Free multilingual legal translation corpus in tmx format by the European Union


spiros

Download DCEP: Digital Corpus of the European Parliament
Language Technology Resources | EU Science Hub

The distribution consists of 12 zip files (Volume_1.zip, ... Volume_12.zip) of approximately 100 MB each. Each zip file contains dozens of TMX files, each identified by the EUR-Lex number of the underlying Acquis document, plus a file list in txt format specifying the languages in which the documents are available.

You can download the data files from the site http://wt.jrc.it/lt/Acquis/DGT_TU_1.0/data/. There is no need to unzip the files, as the extraction program accesses the data in the zip files directly. The texts for the different languages are spread over the various zip files, so you will need to download all of them if you want the full parallel corpus. Downloading only a subset of the zip files is possible, but it will produce only a subset of the parallel corpus.
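
For anyone who wants to check a download before running the extraction, here is a minimal Python sketch. It is not part of the official distribution; the function names `missing_volumes` and `list_tmx_members` are hypothetical, and the only things taken from the description above are the Volume_N.zip naming and the fact that the zip files can be read without unzipping.

```python
import zipfile
from pathlib import Path


def missing_volumes(directory, total=12):
    """Return the names of the Volume_N.zip files not found in `directory`.

    Assumes the Volume_1.zip ... Volume_12.zip naming described in the post.
    """
    directory = Path(directory)
    expected = [f"Volume_{n}.zip" for n in range(1, total + 1)]
    return [name for name in expected if not (directory / name).is_file()]


def list_tmx_members(zip_path):
    """List the .tmx entries of one volume without extracting the archive."""
    with zipfile.ZipFile(zip_path) as archive:
        return [name for name in archive.namelist()
                if name.lower().endswith(".tmx")]
```

Calling `missing_volumes(".")` in the download directory shows which volumes are still missing, and therefore whether an extraction would cover the full corpus or only a subset.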

You also need to download the extraction program and copy it into the same directory as the zip files with the data. The program consists of two files: the program file and the library.

How to produce bilingual extractions

The multilingual extraction has English as the source language. Users can extract any language pair as follows, using the extraction tool TMXtract:

  • download the zip files, the extraction tool TMXtract (.exe file) and the file swt-win32-3218.dll onto your PC. The files must be in the same directory;
  • open TMXtract;
  • select Input files (Volume_1.zip, etc.; multiple selection is possible);
  • specify Output file (the result is always 1 file);
  • choose Source and Target language;
  • click on Start.
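
If you cannot run the Windows tool, the same idea (pick two languages out of the multilingual TMX files inside the zips) can be sketched in Python. This is not how TMXtract itself works, just an illustration under assumptions: the function name `extract_pairs` is hypothetical, translation units are assumed to be standard TMX `tu`/`tuv`/`seg` elements, and language codes are assumed to look like `EN-GB` or `DE-DE`, so they are matched on the part before the hyphen.

```python
import zipfile
import xml.etree.ElementTree as ET

# Attribute key ElementTree uses for xml:lang
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def extract_pairs(zip_paths, source, target):
    """Yield (source_text, target_text) for every translation unit that
    contains both languages. The zip files are read in place, not unzipped."""
    source, target = source.lower(), target.lower()
    for zip_path in zip_paths:
        with zipfile.ZipFile(zip_path) as archive:
            for name in archive.namelist():
                if not name.lower().endswith(".tmx"):
                    continue  # skip the txt file list and anything else
                root = ET.fromstring(archive.read(name))
                for tu in root.iter("tu"):
                    segments = {}
                    for tuv in tu.iter("tuv"):
                        # Older TMX versions use a plain "lang" attribute
                        lang = tuv.get(XML_LANG) or tuv.get("lang") or ""
                        seg = tuv.find("seg")
                        if seg is not None:
                            key = lang.lower().split("-")[0]
                            segments[key] = "".join(seg.itertext())
                    if source in segments and target in segments:
                        yield segments[source], segments[target]
```

For example, `list(extract_pairs(["Volume_1.zip"], "en", "de"))` would collect the English-German segment pairs from the first volume only; pass all twelve volumes for the full pair.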
« Last Edit: 12 Sep, 2021, 10:11:44 by spiros »


 
