Online monolingual, bilingual and multilingual corpora (mostly with free access)

1. MyMemory - multilingual (free search, no login required unless you want to upload/download)

MyMemory is the world's largest Translation Memory: 300 million segments by the end of 2009.

Just like a traditional TM, MyMemory stores segments and their translations, supporting translators with matches and concordance. It differs from traditional technologies in terms of the project's ambitious scale, and its centralized, collaborative architecture. Anyone may consult or contribute to MyMemory via the internet, although contributions are carefully vetted for quality.

2. OPUS - multilingual (free search, no login required)

An open source parallel corpus.

OPUS is an attempt to collect translated texts from the web, to convert and align the entire collection, to add linguistic annotation, and to provide the community with a publicly available parallel corpus. OPUS is based on open source products and is also delivered as an open source package. We used several tools to compile the current corpus. (Manual corrections have not been made.)

3. TAUS - multilingual (as of October 31, 2020, the Search and Upload features are no longer available. They have been replaced by the new TAUS Data Marketplace, a platform intended to meet the growing need for language data.)

4. Sketch Engine - monolingual English (free search, login required to search)
British Academic Spoken English Corpus (BASE), English, 1,252,256 segments
British Academic Written English Corpus (BAWE), English, 8,336,262 segments

5. Hellenic National Corpus [Εθνικός Θησαυρός Ελληνικής Γλώσσας (ΕΘΕΓ)] - monolingual Greek (free search, login required to search, restrictions on the number of search results apply to free accounts)

The ILSP Corpus has been developed by the Institute for Language and Speech Processing. It currently contains more than 47,000,000 words of written texts and is constantly being updated. Users can retrieve parts of these texts in the form of whole sentences by making queries based on one to three words, lemmas or parts of speech. Furthermore, users can define the maximum distance between search items as well as the specific sub-corpus they wish to query. Finally, users can also look up certain statistical data for words and lemmas.

The entire ILSP corpus is available from the web pages that follow. For usage instructions and restrictions, as well as general information about the corpus, see the information page.

6. Corpus of Greek Texts [Σώμα Ελληνικών Κειμένων] - monolingual Greek (free search, login required to search)

The Corpus of Greek Texts (CGT) is the first electronic corpus of Greek that was created with the aim of providing a resource for linguistic research in a wide range of Modern Greek genres. CGT includes 30 million words from spoken and written texts produced between 1990 and 2010. CGT’s texts were created in Greece and Cyprus in a wide variety of media (radio, TV, books, newspapers, magazines, internet, face-to-face communication, etc.).

7. Linguee Dictionary and Translation Search Engine - English<>German, English<>French, English<>Spanish, English<>Portuguese, English<>Italian, English<>Greek, English<>Russian, English<>Japanese, English<>Danish, English<>Finnish, English<>Czech, English<>Romanian, English<>Hungarian, English<>Slovak, English<>Estonian, English<>Slovene, English<>Maltese, English<>Bulgarian, English<>Polish, English<>Chinese, English<>Lithuanian, English<>Dutch (they plan to add more language pairs). Presents bilingual web data in tabular format (free, no login required). This is not exactly a formal corpus, but it presents the web as a corpus and it is very useful indeed.

8. WebCorp - essentially a web concordancer (free to use, no login required)

WebCorp LSE is a fully-tailored linguistic search engine to cache and process large sections of the web. WebCorp LSE offers: enhanced sentence boundary detection, date identification, 'boilerplate' removal, collocation and other statistical analyses, grammatical tagging, language detection, full pattern matching and wildcard search.

9. WeBitext, offered by the National Research Council of Canada, is a multilingual corpus (European Community languages, Greek included) searchable by domain (free to use, no login required; the research prototype is provided to the public free of charge and for a limited time only).

10. TransSearch, offered by Terminotix, contains the Hansard and transcripts from the main Canadian courts (English-French); requires login, 5-day free membership, after which membership fees apply.

11. TOTALRecall, offered by the National Tsing Hua University, is an English-to-Chinese corpus (read a paper about it); (does not require log in, free to use).

12. Glosbe - free multilingual TM engine populated with open source content (Wiktionary, open subtitles, etc.).

13. Corpus of Spoken Greek - 1.7 million words. Requires registration. More about it here.

14. Iatrolexi - Greek monolingual medical corpus. Concordancer and other tools.

15. French-Greek corpus of literary texts. [Update 12/2020: no longer available]

16. english-corpora (multiple, multilingual downloadable and searchable corpora).

17. United Nations Parallel Corpus (English, French, Spanish, Russian, Arabic, Chinese. Free).

See also:

— Free multilingual legal translation corpus in tmx format by the European Union
— Web sites / manuals translated in many languages (parallel multilingual texts)
— Access to the Corpus of Greek Texts via a web page at the University of Athens


Also, one can read the article "Pondering and Wondering" by Jost Zetzsche in the Translation Journal. The information I am quoting below overlaps in some respects with what Spiros posted, but I think it is interesting to have Jost's take on it, since he also lists the sources each tool or bilingual corpus contains:

... the availability of large amounts of bilingual data that can be used in translation memories.

Here is just a sampling:

— MyMemory: A colossal translation memory of presently around 300 million segments that contains data from web alignments (app. 30% of the total data), corpora such as the EU corpus (app. 50%, see "DGT TM" below), and TM contributions from translators. It offers terminology searches, download and upload of translation memories in TMX (the Translation Memory eXchange format), editing capabilities for users, and a strong tie-in to machine translation.
— BigTM: A custom translation search engine that can be used by LSPs or translators. You can submit the translatable text or a sample of it, and the system goes out on the web to search for pages similar to the source text that already have translations in the target language. Within 24 hours it then provides a searchable index of the discovered parallel pages that allows you to look up how terms or phrases were translated by others in the past. (This product is still in its beta phase.)

— OPUS: An open-source parallel corpus with a large number of bilingual files in many language directions containing such varied materials as data from the European Medicines Agency, the European constitution, the European Parliament Proceedings, the OpenOffice.org corpus, the opensubtitles.org corpus, and various open-source localization and software documentation files. The author of the site is a researcher working in natural language processing and machine translation, so the files are not especially made for translation memory (most of them are in a text format), but they're nothing that could not be converted to a TM-compatible format or even TMX (and the files for the European Medicines Agency are in fact in TMX).
— TAUS Data Association (TDA): The TAUS Data Association (or TDA) is an association of mostly large corporate translation buyers who originally came together to pool their translation memory data to better train their machine translation engines. TDA has now just announced that they have launched a relatively low-priced Professional membership category that allows you to download 10 times the amount of data that you upload. Also, as a "by-product" they have opened the data up to the public as a terminology resource. Both the terminology searches and the TMX download can be categorized according to client and a (rather coarse) subject taxonomy. Presently (December 2009) the complete corpus includes about 1 billion words.
— The DGT TM: The humongous translation memory for the Acquis Communautaire (the body of EU law) in 22 languages and a total of 231 language pairs. It's available as a free download and the data is presented in TMX format.
— Linguee: Linguee is a very large corpus of English-into-German-into-English data (other language pairings will be released in 2010) of web-based translated materials. The web-based data is matched up with the help of a large custom dictionary and other web-based dictionaries. Though the data is not categorized, every entry is accompanied by a link to the originating webpage where webpages or whole websites can be downloaded and aligned (i.e., converted into a translation memory).
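As a rough illustration of the point about OPUS above (plain-text files that could be converted to TMX), here is a minimal Python sketch. It assumes you already have two line-aligned lists of sentences, one per language; the function name `aligned_to_tmx` and the header values are made up for the example, and a real TM tool may expect additional header fields.

```python
import xml.etree.ElementTree as ET

# Fully qualified name of the xml:lang attribute as ElementTree sees it.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def aligned_to_tmx(source_lines, target_lines, src_lang, tgt_lang):
    """Build a minimal TMX 1.4 tree from two line-aligned lists of sentences."""
    if len(source_lines) != len(target_lines):
        raise ValueError("source and target must have the same number of lines")

    tmx = ET.Element("tmx", version="1.4")
    ET.SubElement(tmx, "header", {
        "creationtool": "aligned_to_tmx",  # hypothetical tool name
        "creationtoolversion": "0.1",
        "segtype": "sentence",
        "o-tmf": "plain-text",
        "adminlang": "en",
        "srclang": src_lang,
        "datatype": "plaintext",
    })
    body = ET.SubElement(tmx, "body")
    for src, tgt in zip(source_lines, target_lines):
        src, tgt = src.strip(), tgt.strip()
        if not src or not tgt:
            continue  # skip pairs where one side is empty
        tu = ET.SubElement(body, "tu")
        for lang, text in ((src_lang, src), (tgt_lang, tgt)):
            tuv = ET.SubElement(tu, "tuv", {XML_LANG: lang})
            ET.SubElement(tuv, "seg").text = text
    return ET.ElementTree(tmx)
```

The resulting tree can be saved with something like `aligned_to_tmx(en_lines, el_lines, "en", "el").write("corpus.tmx", encoding="utf-8", xml_declaration=True)`, where `en_lines` and `el_lines` are your own aligned lists.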


Linguee is not exactly a formal corpus, it just uses the web as one (however, I added it to the list). WebCorp has been around for ages. Different functionality though: just monolingual concordancing of the web.

The DGT TM is not searchable. It is TMX data that can be downloaded and imported into TM tools. TAUS has imported much of it.
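If you want to look inside downloaded TMX data without a TM tool, something along these lines might do. It is only a sketch: the function name `read_tmx_pairs` is invented here, it assumes each `<tuv>` carries its language in either an `xml:lang` or a plain `lang` attribute (which one is used depends on the TMX version of the file), and it matches languages on the primary subtag only, so "EN-GB" counts as "en".

```python
import xml.etree.ElementTree as ET

# Fully qualified name of the xml:lang attribute as ElementTree sees it.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def read_tmx_pairs(source, src_lang, tgt_lang):
    """Yield (source, target) segment pairs from a TMX file path or binary file object."""
    src_lang, tgt_lang = src_lang.lower(), tgt_lang.lower()
    # iterparse streams the file, so large TMX files need not fit in memory at once.
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag != "tu":
            continue
        segs = {}
        for tuv in elem.iter("tuv"):
            lang = (tuv.get(XML_LANG) or tuv.get("lang") or "").lower()
            seg = tuv.find("seg")
            if seg is not None:
                # Key on the primary language subtag; join text around inline tags.
                segs[lang.split("-")[0]] = "".join(seg.itertext()).strip()
        if src_lang in segs and tgt_lang in segs:
            yield segs[src_lang], segs[tgt_lang]
        elem.clear()  # free the processed translation unit
```

Usage would be along the lines of `for en, el in read_tmx_pairs("some_file.tmx", "en", "el"): ...`, with the file name being whatever you downloaded.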


Still, with the TMX format from DGT you can use some other concordancer to create word lists, usage frequencies, etc. Still good to have. TAUS is the best so far, and in combination with Sketch Engine you can create a fairly thorough corpus that combines Internet crawling and European terminology. With Sketch Engine you can create a corpus on the fly based on the words you want to search for. I am going through my links because I swear I thought there was some service where you could submit a domain, a language, even sample text, and have them create a corpus from the web. If I find it again I will post it here.
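For anyone curious what "word lists and usage frequency" amounts to in practice, here is a small Python sketch over a list of text segments (for example, the source side of a TMX file). The function names and the simple `\w+` tokenisation are my own assumptions; a proper concordancer will handle tokenisation, lemmas and sorting far better.

```python
import re
from collections import Counter

# Very naive tokeniser: runs of letters, digits and underscores.
TOKEN = re.compile(r"\w+")


def word_frequencies(segments):
    """Count lower-cased word forms across an iterable of text segments."""
    counts = Counter()
    for seg in segments:
        counts.update(tok.lower() for tok in TOKEN.findall(seg))
    return counts


def concordance(segments, keyword, width=30):
    """Return keyword-in-context tuples: (left context, match, right context)."""
    pattern = re.compile(r"\b%s\b" % re.escape(keyword), re.IGNORECASE)
    lines = []
    for seg in segments:
        for m in pattern.finditer(seg):
            left = seg[max(0, m.start() - width):m.start()]
            right = seg[m.end():m.end() + width]
            lines.append((left, m.group(0), right))
    return lines
```

`word_frequencies(segments).most_common(20)` gives a quick frequency list, and `concordance(segments, "regulation")` gives the contexts for a single term.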


Download widget to search TAUS directly from your desktop (registration required)
