Custom machine translation engines using the OPUS corpus (1,715 language pairs, including Greek and Ancient Greek)

spiros · 1 · 1218



A team of NLP researchers and open-source enthusiasts at the University of Helsinki under Jörg Tiedemann has developed machine translation engines for a whopping 1,715 language pairs (all built on the widely used Marian NMT framework). One member of that team, Tommi Nieminen, then developed plugins and other connectivity options for tools like Trados Studio (you will likely have noticed that the tool's name has now officially dropped "SDL" and fortunately not added its new owner "RWS"), memoQ, OmegaT, Wordfast, and web-based tools like Memsource (the latter will be ready within a month or so).

I tested the English-to-German engine and (very unscientifically) found that the quality is on par with generic web-based engines such as those from Google, Microsoft, and even DeepL. It feels kind of absurd to even say this, but all the machine translation processing actually happens on your local computer! Once you've downloaded the machine translation engine, you no longer communicate with the originating website.

Here is how you can access all of this:

First, download the MT engine from the link on top of the page right here; the download includes a control panel that allows you to select your language combination. (Well, I should have started differently: First make sure that you have a Windows PC -- the only operating system this runs on.) Once it's downloaded, just extract the zip file to wherever you would like it to reside from now on (no separate installation necessary). You then double-click on OpusCatMTEngine.exe, and from the resulting interface select "Install OPUS model from Web." You are then shown a massive list of language combinations . . .

. . . from which you'll filter the one(s) you need.

Note that while you'll find many combinations involving languages of lesser diffusion in there, you should not assume that they will necessarily produce translations of acceptable quality. The smaller the corpora for a language, the greater the likelihood of poor results. Tommi Nieminen, whom I talked to for this report, made a special point of saying that engines such as Google or Microsoft do indeed surpass the quality of OPUS-MT in combinations with languages like Amharic because they have access to more -- often specifically generated -- data for those languages. (To see a graphical representation of available language combinations and expected quality, there's an interactive map with all language combinations right here; to use it, hover over the icon in the upper right-hand corner and select a language.)

Once you've selected, downloaded, and installed your language combination(s) in the little desktop control panel shown above (all of which goes surprisingly fast), you can select the desired language model and then press "Translate with model" to get an idea of the quality of the machine translation suggestions you can expect.

If you decide to "fine-tune" the engine with data you have in the form of TMX files (this only makes sense if you have very specific data for a specific client/project and a file that includes at least tens of thousands of translation units), you can select "Fine-tune selected model" and upload your TMX file.

This process will take several hours, but it can be done overnight or while you're working on other projects.
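Before committing several hours to fine-tuning, it may be worth checking whether a TMX file actually holds tens of thousands of translation units. The following is a minimal Python sketch that counts the `<tu>` elements of a TMX file; the function name, the `MIN_UNITS` threshold, and the tiny sample file are assumptions made for illustration and are not part of OPUS-CAT.

```python
import io
import xml.etree.ElementTree as ET

# Assumed lower bound for "tens of thousands" of units; adjust to taste.
MIN_UNITS = 10_000


def count_translation_units(source):
    """Count <tu> elements in a TMX file (path or binary file object).

    The file is parsed incrementally, so large TMX files do not have to
    fit into memory as one tree.
    """
    count = 0
    for _event, elem in ET.iterparse(source, events=("end",)):
        # Strip a namespace prefix such as "{uri}tu" if one is present.
        if elem.tag.rsplit("}", 1)[-1] == "tu":
            count += 1
            elem.clear()  # drop the children we no longer need
    return count


# Tiny illustrative TMX document (hypothetical content).
SAMPLE_TMX = """<?xml version="1.0" encoding="UTF-8"?>
<tmx version="1.4">
  <header srclang="en" datatype="plaintext" segtype="sentence"
          adminlang="en" creationtool="example" creationtoolversion="1"
          o-tmf="none"/>
  <body>
    <tu><tuv xml:lang="en"><seg>Hello</seg></tuv>
        <tuv xml:lang="el"><seg>Γεια</seg></tuv></tu>
    <tu><tuv xml:lang="en"><seg>Thank you</seg></tuv>
        <tuv xml:lang="el"><seg>Ευχαριστώ</seg></tuv></tu>
  </body>
</tmx>""".encode("utf-8")

units = count_translation_units(io.BytesIO(SAMPLE_TMX))
verdict = "enough" if units >= MIN_UNITS else "probably too few"
print(f"{units} translation units: {verdict} for fine-tuning")
```

For a real file, pass its path instead of the in-memory sample, for example `count_translation_units("client.tmx")` (the file name is hypothetical).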

Once all of that is completed (note that the lengthier fine-tuning step is not a requirement), you can set up OPUS in the tool of your choice.

For Trados, install the plugin and see these instructions
For memoQ, follow the instructions on this page
For Wordfast Classic or Pro, follow these instructions
For OmegaT, contact tommi.nieminen AT
For other tools, contact the email address above and ask Tommi to work on plugins (personally, I would love a Star Transit plugin!)

— The 324th Tool Box Journal

From this page
Tatoeba-Challenge/ at master · Helsinki-NLP/Tatoeba-Challenge · GitHub

Greek engines, look for: lang = eng-ell
Ancient Greek engines, look for: lang = eng-grc
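The identifiers above follow a source-target pattern of three-letter language codes (eng-ell, eng-grc). As a rough illustration of filtering a long list of such pairs, here is a small Python sketch; the function `filter_pairs` and the sample list are hypothetical and are not taken from OPUS-CAT or the Tatoeba-Challenge repository.

```python
def filter_pairs(pairs, source=None, target=None):
    """Return the pairs (e.g. 'eng-ell') matching the given language codes.

    A code left as None matches any language on that side.
    """
    matches = []
    for pair in pairs:
        src, _, tgt = pair.partition("-")
        if source is not None and src != source:
            continue
        if target is not None and tgt != target:
            continue
        matches.append(pair)
    return matches


# Illustrative subset only; the real list holds many more combinations.
AVAILABLE = ["eng-ell", "eng-grc", "eng-deu", "deu-eng"]

print(filter_pairs(AVAILABLE, source="eng"))
print(filter_pairs(AVAILABLE, target="grc"))
```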

Attached is a first test translation:

Enter translation in the Source text area, click Translate and wait for a translation to appear in the Translation area (producing the first translation may take some time, as the model needs to be initialized, subsequent translations are faster).

Εισάγετε μετάφραση στην περιοχή Source κείμενο, κάντε κλικ Μεταφράστε και περιμένετε μια μετάφραση για να εμφανιστεί στην περιοχή Μετάφραση (η παραγωγή της πρώτης μετάφρασης μπορεί να πάρει κάποιο χρόνο, καθώς το μοντέλο πρέπει να αρχικοποιηθεί, επόμενες μεταφράσεις είναι ταχύτερη).

The same translation from

Enter translation in the Source text area, click Translate and wait for a translation to appear in the Translation area (producing the first translation may take some time, as the model needs to be initialized, subsequent translations are faster).

Εισαγάγετε μετάφραση στην περιοχή κειμένου Πηγή, κάντε κλικ στη Μετάφραση και περιμένετε να εμφανιστεί μια μετάφραση στην περιοχή Μετάφρασης (για την παραγωγή της πρώτης μετάφρασης μπορεί να χρειαστεί λίγος χρόνος, καθώς το μοντέλο πρέπει να αρχικοποιηθεί, οι επόμενες μεταφράσεις είναι πιο γρήγορες).

And from Bing Microsoft Translator:

Enter translation in the Source text area, click Translate and wait for a translation to appear in the Translation area (producing the first translation may take some time, as the model needs to be initialized, subsequent translations are faster).

Εισαγάγετε μετάφραση στην περιοχή κειμένου Προέλευση, κάντε κλικ στην επιλογή Μετάφραση και περιμένετε να εμφανιστεί μια μετάφραση στην περιοχή Μετάφραση (η παραγωγή της πρώτης μετάφρασης μπορεί να διαρκέσει κάποιο χρονικό διάστημα, καθώς το μοντέλο πρέπει να προετοιμαστεί, οι επόμενες μεταφράσεις είναι ταχύτερες).

And from DeepL:

Enter translation in the Source text area, click Translate and wait for a translation to appear in the Translation area (producing the first translation may take some time, as the model needs to be initialized, subsequent translations are faster).

Εισάγετε τη μετάφραση στην περιοχή Source text, κάντε κλικ στο Translate και περιμένετε να εμφανιστεί η μετάφραση στην περιοχή Translation (η παραγωγή της πρώτης μετάφρασης μπορεί να πάρει κάποιο χρόνο, καθώς το μοντέλο πρέπει να αρχικοποιηθεί, οι επόμενες μεταφράσεις είναι ταχύτερες).

The test with Ancient Greek to English was... less successful:

Ἄξιόν ἐστι τὸ ἀρνίον τὸ ἐσφαγμένον λαβεῖν τὴν δύναμιν καὶ τὸν πλοῦτον καὶ σοφίαν καὶ ἰσχὺν καὶ τιμὴν καὶ δόξαν καὶ εὐλογίαν

It is worthy of the helmets of the slain take the power and the ship, and wisdom, and honor, and glory, and word.

Human translation:
Worthy is the Lamb that was slain to receive power, and riches, and wisdom, and strength, and honour, and glory, and blessing
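To put a rough number on a comparison like this one, a word-overlap score between the machine output and the human translation can be computed. The sketch below is a crude bag-of-words F1 score written for illustration, not BLEU or any metric used by the OPUS-MT team; the function names are assumptions.

```python
import re
from collections import Counter


def tokens(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"\w+", text.lower())


def overlap_f1(candidate, reference):
    """Bag-of-words F1 between a candidate and a reference translation."""
    cand = Counter(tokens(candidate))
    ref = Counter(tokens(reference))
    shared = sum((cand & ref).values())  # words common to both, with counts
    if shared == 0:
        return 0.0
    precision = shared / sum(cand.values())
    recall = shared / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


MT_OUTPUT = ("It is worthy of the helmets of the slain take the power and "
             "the ship, and wisdom, and honor, and glory, and word.")
HUMAN = ("Worthy is the Lamb that was slain to receive power, and riches, "
         "and wisdom, and strength, and honour, and glory, and blessing")

print(round(overlap_f1(MT_OUTPUT, HUMAN), 2))
```

Note that a score of this kind rewards shared function words such as "and" and "the", so it can flatter an output that gets "helmets" for "Lamb" and "ship" for "riches"; it is a quick sanity check rather than a quality judgment.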
« Last Edit: 24 Apr, 2021, 14:21:41 by spiros »

