Translation - Μετάφραση

Resources, Technical Assistance and Technology News => Translator resources => Topic started by: spiros on 09 Feb, 2008, 20:02:57

Title: Free multilingual legal translation corpus in tmx format by the European Union
Post by: spiros on 09 Feb, 2008, 20:02:57
Download DCEP: Digital Corpus of the European Parliament (https://wt-public.emm4u.eu/Resources/DCEP-2013/DCEP-Download-Page.html)
Language Technology Resources | EU Science Hub (https://ec.europa.eu/jrc/en/language-technologies)

The distribution consists of 12 zip files (Volume_1.zip, ... Volume_12.zip) of approximately 100 MB each. Each zip file contains dozens of tmx files, identified by the EUR-Lex number of the underlying Acquis documents, and a file list in txt format specifying the languages in which the documents are available.

You can download the data files from the site http://wt.jrc.it/lt/Acquis/DGT_TU_1.0/data/. There is no need to unzip the files, as the extraction program accesses the data in the zip files directly. The texts for the different languages are spread over the various zip files, so you will need to download all of them if you want the full parallel corpus. Downloading only some of the zip files is possible, but it will produce only a subset of the parallel corpus.
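
If you want to see what a downloaded volume contains without unzipping it, a short script can read the archive listing directly. This is only a rough sketch and not part of the official tools: the function name list_tmx_members is made up for illustration, and it assumes the zip files keep the Volume_N.zip names mentioned above and that the tmx files end in ".tmx".

```python
import zipfile
from pathlib import Path


def list_tmx_members(data_dir):
    """Map each Volume_*.zip in data_dir to the .tmx files it contains.

    Hypothetical helper: assumes the archives are named Volume_N.zip
    and that the translation memories inside end in ".tmx".
    """
    members = {}
    for zip_path in sorted(Path(data_dir).glob("Volume_*.zip")):
        # ZipFile reads the archive's index; nothing is extracted to disk.
        with zipfile.ZipFile(zip_path) as archive:
            members[zip_path.name] = [
                name for name in archive.namelist()
                if name.lower().endswith(".tmx")
            ]
    return members


if __name__ == "__main__":
    # Report how many tmx files each downloaded volume holds.
    for volume, tmx_files in list_tmx_members(".").items():
        print(f"{volume}: {len(tmx_files)} tmx files")
```

Run it from the directory holding the zip files. Volumes you have not downloaded simply do not appear in the result.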

You also need to download the extraction program and copy its files into the same directory as the zip files with the data. The program consists of two files: the program file (http://wt.jrc.it/lt/Acquis/DGT_TU_1.0/ExtractionTool/TMXtract.exe) and the library (http://wt.jrc.it/lt/Acquis/DGT_TU_1.0/ExtractionTool/swt-win32-3218.dll).
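
Before running the tool, it can help to confirm that everything sits in one directory. The following is a minimal sketch, not something shipped with the corpus: the function name missing_files is hypothetical, and the expected file names are taken from the description above (TMXtract.exe, swt-win32-3218.dll, Volume_1.zip to Volume_12.zip).

```python
from pathlib import Path

# File names as given in the post; adjust if your downloads differ.
TOOL_FILES = ["TMXtract.exe", "swt-win32-3218.dll"]
VOLUME_FILES = [f"Volume_{n}.zip" for n in range(1, 13)]


def missing_files(data_dir):
    """Return the expected files that are not present in data_dir.

    Hypothetical pre-flight check: missing volumes are not an error as
    such, they only mean a smaller extraction.
    """
    directory = Path(data_dir)
    return [
        name for name in TOOL_FILES + VOLUME_FILES
        if not (directory / name).is_file()
    ]


if __name__ == "__main__":
    missing = missing_files(".")
    if missing:
        print("Not found in this directory:")
        for name in missing:
            print("  " + name)
    else:
        print("Extraction tool and all 12 volumes are in place.")
```

A missing program file or library will stop the tool from running, whereas a missing volume only reduces the size of the corpus you can extract.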

How to produce bilingual extractions

The multilingual extraction has English as the source language. Users can extract any language pair as follows, using the extraction tool TMXtract: