Notes about managing/removing duplicates from the forum

Over the last few days I have been working on managing duplicates in the English<>Greek pairs. About 2,500 duplicates have been processed (including cases where the same source term appeared 3 or 4 times) out of a total of 65,000 entries (approximately 3.85%). The work took 20-25 hours to complete.
The procedure was as follows:
1. export the relevant database boards as csv;
2. import them into Excel (see note 1);
3. normalize apostrophe variants to a standard symbol (');
4. replace the "→" delimiter with a tab character (see note 2);
5. delete repeated spaces and leading/trailing spaces (using ASAP Utilities);
6. create an extra column with source text (optionally, separate initialisms found in parentheses in an extra column in order to locate duplicates with or without the initialism);
7. change all text of that column to lowercase (or, even, in languages like French for example, remove all accents too in order to spot errors in accent usage);
8. use a duplicate manager to colour duplicate cells of that column;
9. use Excel filter functionality to sort on coloured cells.
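For anyone who prefers a script to Excel, steps 3 to 8 above can be sketched roughly as follows. This is only a minimal illustration, not the procedure I actually used: the list of apostrophe variants, the delimiter variants and the function names (split_entry, duplicate_key, find_duplicates) are my own assumptions.

```python
import re
import unicodedata
from collections import defaultdict

# Assumed apostrophe variants; adjust to whatever appears in the export.
APOSTROPHES = "\u2019\u2018\u02bc\u0060\u00b4"
# Assumed delimiter variants between source and target.
DELIMITER = re.compile(r"\s*(?:→|->|–>)\s*")


def split_entry(subject):
    """Split 'source → target' into (source, target) with tidy spacing."""
    # Step 3: map every apostrophe variant to the plain apostrophe.
    subject = subject.translate({ord(ch): "'" for ch in APOSTROPHES})
    # Step 5: collapse repeated spaces and trim both ends.
    subject = re.sub(r"\s+", " ", subject).strip()
    # Step 4: split on the first delimiter only.
    parts = DELIMITER.split(subject, maxsplit=1)
    source = parts[0]
    target = parts[1] if len(parts) > 1 else ""
    return source, target


def duplicate_key(source, strip_accents=False):
    """Step 7: lowercased (optionally accent-free) key used for comparison."""
    key = source.lower()
    if strip_accents:
        decomposed = unicodedata.normalize("NFD", key)
        key = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return key


def find_duplicates(subjects, strip_accents=False):
    """Step 8: return {key: [subjects]} for keys that occur more than once."""
    groups = defaultdict(list)
    for subject in subjects:
        source, _ = split_entry(subject)
        groups[duplicate_key(source, strip_accents)].append(subject)
    return {key: items for key, items in groups.items() if len(items) > 1}
```

Calling find_duplicates with strip_accents=True also groups entries that differ only in accents, which is the check suggested for French in step 7.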
At this stage the duplicates had been isolated; the second stage deals with actions taken directly on the forum.
The following courses of action were taken for forum duplicates as I saw fit:
1. Reverse the entry (i.e. change "car → αυτοκίνητο" to "αυτοκίνητο → car") and reverse the language pair (by moving to the reverse language pair board). This implies that it has been confirmed that there is no such entry in the other board (i.e. "αυτοκίνητο").
2. If there was already an entry in the reverse language pair board, then change one of the pairs and move to a different board altogether. For example, if "αυτοκίνητο" already exists in Greek>English, then move to Italian>Greek as "auto → αυτοκίνητο".
An easier workflow would be to first reverse the language pairs of the topics en masse and, when that is done, sort the board on Subject. If the two languages use different alphabets, the reversed topics group together, so it is easy to select them via the check boxes in the board index and move them en masse to the reverse pair board.
3. When two or more duplicate entries have a lot of replies, then it is a good idea to simply merge them, so that people in the future get more out of the topic. When merging, the subject should be amended so that it contains all the translations available in both topics. The downside of merging is that the newest topic ID(s), when accessed via the Internet, will display an error page (since the topic no longer exists).
Merging can be done either by selecting the check boxes after a search, or by adding the topic ID of the topic to be merged (for example the topic ID of this topic is 361000).
4. Another possibility is to change from singular to plural or vice versa.
The forum search does tend to miss some topics at times. This will be remedied in the future by replacing the native forum search system with a server-based one.
Note for French:
Also, you can reverse the pairs without moving them to the Greek-French board yet. Once you have done them all, sort by subject (https://www.translatum.gr/forum/index.php?board=5.0;sort=subject); the topics that begin with Greek characters will come first (or last). Then select all using the check box at the top right and, at the bottom, choose move selected to ... Greek-French. Repeat on the next page until all the relevant topics have been moved.
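The "different alphabets" trick can also be checked offline on an exported list of subjects. Below is a minimal sketch, assuming the subjects are available as plain strings; the function names starts_with_greek and partition_by_script are hypothetical, and the test simply looks at the Unicode name of the first letter.

```python
import unicodedata


def starts_with_greek(subject):
    """True if the first alphabetic character of the subject is Greek."""
    for ch in subject:
        if ch.isalpha():
            # Unicode names of Greek letters begin with "GREEK".
            return unicodedata.name(ch, "").startswith("GREEK")
    return False


def partition_by_script(subjects):
    """Split subjects into (Greek-first, everything else), keeping order."""
    greek_first = [s for s in subjects if starts_with_greek(s)]
    others = [s for s in subjects if not starts_with_greek(s)]
    return greek_first, others
```

The first list corresponds to the topics that would be selected and moved to the reverse pair board; leading punctuation such as an opening parenthesis is skipped before the check.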
1. Since Excel fails to convert the UTF-8 exported from phpMyAdmin, I had to use the following regex pairs (find, then replace) to convert to TSV manually:
";"
\t
"\n"
\n
2. Regex pairs for the delimiter (find, then replace):
->
\t
–>
\t
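The same find/replace pairs can be applied with a short script instead of an editor. This is just a sketch under assumptions: it presumes the export encloses every field in double quotes and separates fields with semicolons, and the names apply_pairs and csv_to_tsv are mine. The removal of the one quote left at the very start and very end of the file is an extra step that the pairs themselves do not cover.

```python
import re

# (find, replace) pairs from the notes, applied in the order given.
CSV_TO_TSV = [(r'";"', "\t"), (r'"\n"', "\n")]
ARROWS_TO_TAB = [(r"->", "\t"), (r"–>", "\t")]


def apply_pairs(text, pairs):
    """Run each find/replace pair over the whole text, in order."""
    for pattern, replacement in pairs:
        text = re.sub(pattern, replacement, text)
    return text


def csv_to_tsv(text):
    """Convert a quoted, semicolon-separated export to tab-separated text."""
    text = apply_pairs(text, CSV_TO_TSV)
    # Assumption: a single stray quote remains at the start and at the end.
    return text.strip().strip('"') + "\n"
```

After csv_to_tsv, running apply_pairs(text, ARROWS_TO_TAB) turns the arrow delimiters into tabs; the spaces left around them are then removed by the space clean-up step.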