Translation - Μετάφραση

Topic started by: spiros on 16 May, 2017, 07:48:25

Title: Bilingual and Monolingual Terminology Extraction
Post by: spiros on 16 May, 2017, 07:48:25
Bilingual Terminology Extraction from TMX - A State-of-the-Art Overview
https://ec.europa.eu/info/sites/info/files/tef2016_chelo_vargas_en.pdf

Free Term Extractors
https://termcoord.wordpress.com/about/testing-of-term-extraction-tools/free-term-extractors/

TermoPL — a Flexible Tool for Terminology Extraction
http://www.lrec-conf.org/proceedings/lrec2016/pdf/296_Paper.pdf

Making Term Extraction Tools Usable
http://www.lrec-conf.org/proceedings/lrec2016/pdf/296_Paper.pdf

Prospector - A monolingual terminology extractor. The tool automatically extracts terms from source texts in English.
Logrus Global Localization Cloud (https://cloud.logrusglobal.com/#prospector)

Online Terminology Extraction
— sketchengine (https://terms.sketchengine.eu/)
— translated.net (https://labs.translated.net/terminology-extraction/)
— fivefilters (http://fivefilters.org/term-extraction/)
— termine (http://www.nactem.ac.uk/software/termine/#form)

Independent terminology extraction software/tools
— SynchroTerm (https://terminotix.com/index.asp?content=item&item=7&lang=en)
— PhraseMiner (http://asap-traduction.com/PhraseMiner)
— Araya Bilingual Term Extractor (http://www.heartsome.de/en/termextraction.php)
— Similis (http://download.cnet.com/Similis/3000-2079_4-131401.html)
— Prospector (http://logrusglobal.com/prospector.html)
— KEA (http://community.nzdl.org/kea/index.html)
— Syn-Tactic (http://syn-tactic.com/english/documentation--translation/terminology-extraction.html) 
— SDL MultiTerm Extract (http://www.sdl.com/software-and-services/translation-software/terminology-management/sdl-multiterm/extract.html)
— Déjà Vu X3 - Extract terminology using Regular Expressions - YouTube (https://www.youtube.com/watch?v=l2M8b_zyewk)

Terminology extraction tools as part of a translation environment tool
— memoQ (https://docs.memoq.com/current/en/Places/extract-candidates.html)
— CafeTran (https://www.cafetran.com)
— Déjà Vu (Lexicon (https://atrilsolutions.zendesk.com/hc/en-us/articles/208457905-What-is-the-Lexicon) feature) (DVX vs MultiTerm Extract (https://www.proz.com/forum/d%C3%A9j%C3%A0_vu_support/24917-trados_term_extract_vs_dvx_lexicon.html))

There are a number of tools that allow you to extract terminology from a document or set of documents and then decide which of the terms are usable and which are not. Some translation environment tools offer terminology extraction, like the relatively noisy extraction of memoQ, CafeTran, or Déjà Vu. These processes primarily look at frequency of use of each term (which admittedly is a valid approach but not as the only criterion). Other tool vendors have their separate terminology extraction tools, such as SDL MultiTerm Extract or Terminotix's SynchroTerm. While these tools perform better, they're really not widely used (if you don't believe me, try inviting a representative of those companies for a couple of beers at the next conference and then ask whether these tools sell well -- they'll likely have a good chuckle). Then there is the former EU project TaaS, now commercialized and renamed to Tilde Terminology, which not only extracts and normalizes terms from translatable documents but also automatically queries a number of resources for translation suggestions (see edition 229 of the Tool Box Journal for a review of the first incarnation of TaaS). And then there are high-powered, non-translation-specific tools like Sketch Engine, the tool I recently covered in this newsletter (see editions 276 and 277).
Logrus apparently felt there was a lot of room in this market (and I sort of agree, actually) and came up with Prospector. Here is how they describe it themselves in their press release:
"Prospector uses a combination of proprietary linguistic algorithms and semantic relevancy measures to effectively identify terms, and advanced stemming technology to convert plurals and inflections to the base form. The properly adjusted, semantically relevant terms are arranged, in descending order of importance, on separate sheets of an Excel file: new terms, acronyms, and proper nouns.
"One distinguishing feature is that Prospector uses the Corpus of Contemporary American English (COCA) as a 'reference corpus,' which improves term ranking. Maintained by Brigham Young University, COCA is the world's largest [freely available] corpus of the English language."
In other words, you can upload a document into the web-based system and it will give you a very impressive list of terms free of noise words ("the," "a," "and," etc.), a list that includes not mere words but true terms, often with modifiers. For instance, I uploaded a short essay and got extractions like "Oregon coast," "Sistine Chapel," "above-mentioned beauty," "abstract art," "cloud formation," "experimental literature," "Martha Graham Dance Company," and "semi-rotten crab leg" (don't ask!). I also got a few non-starters like "effort end" or "individuals might" (ratio of good to bad was about 4:1), and the tool wasn't particularly successful at separating all the proper terms out. But it was very, very good, very similar in quality, in fact, to the extraction tool XTS that Xerox developed many years ago and then sold to a company that is not making it available to regular peons like us anymore.
So, if you're a translator who comes from English as your source language and you feel that you need to prepare your terminology better, this is a great tool for you. The same is true if you are a project manager preparing projects coming from English. Of course, the problem with tools like this is that they're completely language-specific and -- in this case -- work only for English (and it didn't sound like Serge had plans to include other languages).
The tool is free at the moment. At some point there will be some kind of pricing, but as long as that is reasonable I (and I assume some of you) might be willing to invest in that.
I also asked Serge about the fact that the data is processed in Russia, and he seemed surprised at that being a potential problem (which in turn surprised me). While he assured me that the data is not being stored or kept on any Logrus servers after processing, he said he would consider moving it to the cloud in Europe or the US if clients ask for it. (If Prospector is to be successful, this will almost invariably have to happen, methinks.)
Oh, and the other tool Logrus has released, Goldpan TMX/TBX Editor, is also a nice (and free!) tool, even though it's not really breaking much ground. This desktop tool allows you -- as the name implies -- to load TMX or TBX (translation memory or termbase exchange) files and edit those as well as do a large number of semi-automated quality checks. The tool is very easy to use, and that is where I would see its greatest appeal. For more advanced users, tools like Xbench or the Okapi tools (including Olifant) are more versatile and powerful. 
Jost Zetzsche, The 280th Tool Box Journal
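
As a rough illustration of the reference-corpus idea mentioned in the press release quoted above, the sketch below ranks words by how much more frequent they are in a document than in a general reference text. It is only a toy: the function name `rank_candidates`, the add-one smoothing, and the tiny inline texts are assumptions made for this example, and it does not claim to reproduce how Prospector actually works.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and return its word tokens (hyphenated words kept whole)."""
    return re.findall(r"[a-z]+(?:-[a-z]+)*", text.lower())


def rank_candidates(document, reference, top_n=5):
    """Rank document words by their frequency relative to a reference text.

    score = relative frequency in the document
            / smoothed relative frequency in the reference
    """
    doc_counts = Counter(tokenize(document))
    ref_counts = Counter(tokenize(reference))
    doc_total = sum(doc_counts.values())
    ref_total = sum(ref_counts.values())
    vocab = len(set(doc_counts) | set(ref_counts))

    scores = {}
    for word, count in doc_counts.items():
        doc_freq = count / doc_total
        # Add-one smoothing so words missing from the reference do not divide by zero.
        ref_freq = (ref_counts[word] + 1) / (ref_total + vocab)
        scores[word] = doc_freq / ref_freq
    # Highest score first; ties are broken alphabetically.
    return sorted(scores, key=lambda w: (-scores[w], w))[:top_n]


document = "The termbase stores terms. The termbase exports terms to the translator."
reference = "The cat sat on the mat. The dog ate the food and the cat slept."
print(rank_candidates(document, reference, top_n=2))
```

Words that are common everywhere, such as "the", end up at the bottom of the ranking because they are just as frequent in the reference text, while domain words rise to the top. A real extractor would use a large reference corpus and handle multi-word terms and inflections as well.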
Title: Re: Bilingual Terminology Extraction
Post by: drkhateeb on 05 Apr, 2018, 21:25:00
Hello

SDL projectTermExtract

This plugin adds a very neat feature to Studio allowing you to extract term candidates from your Project, or specific files within a project. The plugin works by adding a file containing the terms to Studio for translation, and then converts the file to a termbase and adds it to your project.

The plugin also provides for some simple refining of the extraction using a very simple and visual interface that makes the process of term extraction incredibly simple and enjoyable to work with.

If you want to use these terms in your own termbase then this can be easily achieved by using the Glossary Converter, or alternatively you can provide the termbase as a MultiTerm termbase or in another format as a nice value-add solution for your customer.

If you want to know a little more about this application then perhaps review this article:

https://appstore.sdl.com/language/app/projecttermextract/817/
Title: Re: Bilingual Terminology Extraction
Post by: spiros on 05 Apr, 2018, 21:29:08
3. It can't be configured to extract phrases. So for instance, if the phrase "breeding grounds" is repeated several times throughout the text, it extracts "breeding" and "grounds" as separate terms. This makes it a whole lot less useful.

[Paul Filkin] That would be a feature of something more complex like MultiTerm Extract, which is a tool you can pay for. This is a free plugin carrying out a simple word extraction. We may look at doing something more complex in the future, especially if this is a tool people like.
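
To make the difference between word extraction and phrase extraction concrete, here is a minimal sketch that counts repeated two-word phrases, so that "breeding grounds" comes out as one candidate instead of two separate words. The function name `repeated_phrases` and the tiny stop list are assumptions for this example only; it is not how projectTermExtract or MultiTerm Extract works internally.

```python
import re
from collections import Counter

# A deliberately tiny stop list, assumed for this example only.
STOP_WORDS = {"a", "an", "and", "are", "for", "in", "is", "of", "the", "to"}


def repeated_phrases(text, min_count=2):
    """Return (phrase, count) pairs for two-word phrases seen at least min_count times.

    Phrases never span punctuation and never contain a stop word.
    """
    counts = Counter()
    # Split at punctuation first so a phrase cannot cross a sentence or clause break.
    for chunk in re.split(r"[.,;:!?()\n]+", text.lower()):
        words = re.findall(r"[a-z]+", chunk)
        # Pair each word with its immediate neighbour.
        for first, second in zip(words, words[1:]):
            if first not in STOP_WORDS and second not in STOP_WORDS:
                counts[f"{first} {second}"] += 1
    return [(p, c) for p, c in counts.most_common() if c >= min_count]


sample = (
    "The breeding grounds of the seals are protected. "
    "Seals return to the breeding grounds in spring, "
    "and the breeding grounds are monitored."
)
print(repeated_phrases(sample))
```

The same loop can be extended to three-word phrases by zipping `words`, `words[1:]` and `words[2:]`, at the cost of more noise in the candidate list.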
Title: Re: Bilingual Terminology Extraction
Post by: drkhateeb on 10 Apr, 2018, 21:25:06
Hello
Dear spiros,

How can I align the attached text file to create a termbase?

https://up.top4top.net/downloadf-830ew9cp1-zip.html

The format of the text is:
Line 1 = term
Line 2 = term definition; the line may hold a single definition or several, separated by , or ;

Entries are separated by an empty line.

Regards
Title: Re: Bilingual Terminology Extraction
Post by: spiros on 11 Apr, 2018, 10:40:34
Emailed. If you want to have multiple definitions converted, find/replace , or ; with |
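
For anyone with a file in the same layout (term on one line, definitions on the next, an empty line between entries), the conversion can also be scripted. The following is a minimal sketch, not the method used for the emailed file: the function name `entries_to_rows` and the choice of tab-delimited output are assumptions, and the | separator simply follows the find/replace suggestion above.

```python
import re


def entries_to_rows(text):
    """Convert 'term / definition line / empty line' blocks to tab-delimited rows."""
    rows = []
    # Entries are separated by one or more empty lines.
    for block in re.split(r"\n\s*\n", text.strip()):
        lines = [line.strip() for line in block.splitlines() if line.strip()]
        if len(lines) < 2:
            continue  # skip incomplete entries (a term with no definition line)
        term = lines[0]
        # Any extra lines in the block are treated as further definitions.
        definition_line = "; ".join(lines[1:])
        # Split on "," or ";", drop empty pieces, and join with "|".
        definitions = [d.strip() for d in re.split(r"[,;]", definition_line) if d.strip()]
        rows.append(f"{term}\t{'|'.join(definitions)}")
    return rows


sample = "cat\nfeline, pet; animal\n\ndog\ncanine\n"
for row in entries_to_rows(sample):
    print(row)
```

Note that this splits on every comma, so a definition that contains a comma as ordinary punctuation will be cut in two; such entries need to be checked by hand before the result is imported into a termbase.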
Title: Re: Bilingual Terminology Extraction
Post by: drkhateeb on 12 Apr, 2018, 04:00:58
Thank you very much

These are Greek dictionaries:

https://up.top4top.net/downloadf-831wpwkm1-zip.html

You may use them or convert them to termbases, SDL dictionaries, or anything else.

Best regards