Author Topic: Morphology support in Termbases for CAT tools  (Read 405 times)

spiros

Morphology support in Termbases for CAT tools
« on: 30 Mar, 2019, 19:58:11 »
The problem with morphology support (i.e., the ability of the underlying technology to recognize that the different forms of a word all belong to one base form of that word, as shown in terminology recognition, etc.) is that it's very tedious to implement because it's language-specific. The developers of Across and STAR Transit/TermStar have "solved" that problem by painstakingly coding specific rules for a very limited set of languages. A tool like Lilt has used artificial intelligence to have the system learn morphology rules for various languages, and tools like OmegaT use third-party components to support "stemming" for a large number of languages.
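To make the idea of "stemming" concrete, here is a deliberately naive sketch (not taken from any of the tools mentioned; the suffix list and the minimum-length rule are illustrative assumptions for English) showing how different surface forms can be collapsed to one base form so that a termbase lookup matches all of them:

```python
# Toy suffix-stripping stemmer: reduces inflected forms to a shared
# stem so a termbase lookup can match any of them. Real stemmers
# (Snowball, Hunspell) use much richer, per-language rule sets.

SUFFIXES = ("ings", "ing", "ed", "es", "s")  # try longest first

def stem(word: str) -> str:
    """Strip the longest known suffix, keeping at least 3 characters."""
    w = word.lower()
    for suf in SUFFIXES:
        if w.endswith(suf) and len(w) - len(suf) >= 3:
            return w[: -len(suf)]
    return w

# All of these collapse to the same stem, so they would hit the
# same termbase entry:
print(stem("load"), stem("loads"), stem("loading"), stem("loaded"))
# → load load load load
```

Irregular forms (bought vs. buy) still fail under pure suffix stripping, which is exactly why dictionary-based approaches do better.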

In my eyes, however, it's a much larger problem to use the tediousness of that task as an excuse not to do anything about it -- especially when it comes to translation environment tools that naturally need comprehensive support for basic linguistic intelligence.

SDL Trados has now finally found a way to offer morphology support for a very large array of languages within their terminology tool (presently: Albanian, Arabic, Bulgarian, Catalan, Czech, Danish, Dutch, English, Estonian, Faroese, Finnish, French, Galician, German, Greek, Hindi, Hungarian, Icelandic, Indonesian, Irish, Italian, Japanese, Korean, Latin, Latvian, Lithuanian, Norwegian, Pashto, Persian, Polish, Portuguese, Punjabi, Romanian, Russian, Sami, Sanskrit, Serbo-Croatian, Slovak, Slovenian, Spanish, Swedish, Thai, Turkish, and Ukrainian).

Ah, you say, MultiTerm? Nah, not quite. Rather than updating the good old MultiTerm with this functionality, it is available only in the cloud-based Language Cloud offering -- which is also home to SDL's machine translation product. In fact, both services are bundled. You can get free access to 200,000 characters of machine translation and a limited cloud-based terminology product (which you are not able to share, into which you can't import Excel files, and for which you can build only one terminology database that cannot have custom termbase definitions), or you can pay $10/month for access to a less restrictive model (see here for specifics).

I asked Daniel about the decision not to implement the "linguistic search" in the desktop-based MultiTerm. Here is what he said:

"MultiTerm is our product for local file-based and server-based/on-premise terminology management. That won't change in the next few years. However, MultiTerm does not have the prerequisites to be transitioned to genuine cloud-based ways of working (which I always call 'the third way of working' beyond file- and server-based). Against that background, we have decided to reinvent terminology management. This is not unlike how we reinvented ourselves back then with Studio as the follow-up to Translator's Workbench. The transition phase will for sure take a long time, but it is important to be prepared to reinvent a piece of technology when the previous generation has matured to its full potential. Obviously, full roundtrip of data between Language Cloud and MultiTerm will be guaranteed, so users can migrate both ways as needed."

Makes sense to me. The only thing I would add is that I'm not sure it makes sense to pay for a full-fledged version of a transitional product.

SDL is making its cloud-based terminology component morphology-aware by using Elasticsearch, a third-party product, which in turn uses the open-source Hunspell token filter (which OmegaT also uses) to enable dictionary-based stemming.
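For readers curious about the mechanism: in Elasticsearch, Hunspell stemming is wired in as a token filter inside an analyzer. A minimal sketch of such a setting might look like the following (the index, filter, and analyzer names are illustrative, not SDL's actual configuration; the `hunspell` filter also requires the matching dictionary files to be installed on the Elasticsearch node):

```json
PUT /termbase_de
{
  "settings": {
    "analysis": {
      "filter": {
        "de_stemmer": {
          "type": "hunspell",
          "locale": "de_DE"
        }
      },
      "analyzer": {
        "de_terms": {
          "tokenizer": "standard",
          "filter": ["lowercase", "de_stemmer"]
        }
      }
    }
  }
}
```

With an analyzer like this, "gekauft" and "kaufen" index and search under the same dictionary stem, which is what makes the "linguistic" matching possible.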

While these termbases are cloud-based, they can be used within SDL Trados in exactly the same way as MultiTerm termbases, including the possibility of adding new term pairs to existing termbases right from within Studio's environment. When you select one of those termbases, you will have the choice between a "linguistic fuzzy search" (which is the one we are talking about) and a "character-based fuzzy search" (which is essentially the same kind of language-independent search based on matches of three letters or more that MultiTerm offers as well).
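The character-based variant can be pictured as comparing strings by their shared three-letter sequences. Here is a small sketch in that spirit (the function names, the Dice-coefficient scoring, and the trigram choice are my own illustration, not SDL's actual implementation):

```python
# Character-based fuzzy matching via trigram overlap: two strings are
# scored by how many three-letter sequences they share, with no
# knowledge of the language involved.

def trigrams(text: str) -> set:
    """All three-character sequences of the lowercased input."""
    t = text.lower()
    return {t[i:i + 3] for i in range(len(t) - 2)}

def similarity(a: str, b: str) -> float:
    """Dice coefficient over trigram sets (0.0 .. 1.0)."""
    ta, tb = trigrams(a), trigrams(b)
    if not ta or not tb:
        return 0.0
    return 2 * len(ta & tb) / (len(ta) + len(tb))

# An inflected form still shares most trigrams with the base term:
print(round(similarity("geladen", "laden"), 2))
# → 0.75
```

The language-independence is the appeal of this approach -- and also its limit, since it knows nothing about which character differences are morphologically meaningful.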

Term settings
You can also see that you can adjust the fuzziness level for the linguistic search. (For the tests I ran, I used the default settings.)

Years ago, I was part of a multi-university morphology project that never really happened, but for the proposal I documented in a fairly extensive manner how the old MultiTerm search feature continuously failed (back then I chose MultiTerm as the example application because it is and was the market leader -- most other tools would have delivered similar results). Fortunately, I was still able to access the same files and termbase and was therefore able to compare the output pre- and post-linguistic search. Here are some of the results (they're in German, but you don't have to understand the language to understand what's happening):

Recognition improvements
The first column shows the word in the translatable file; the second shows the existing terms that were present in the (then: MultiTerm) termbase but were erroneously not detected as matches; and the third column shows the terms that were now detected with the identical source text and an identical termbase, only this time in Language Cloud with morphology support. It was not perfect, but clearly a lot better than before. And not only better, but useful. You can also see that it supports not only morphological changes (gekauft -- kaufen, geladen -- laden) but also compound words, which -- believe me! -- is especially helpful in a language like German.
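To see why compound support matters for German, here is a naive illustration of compound splitting (the tiny lexicon and the recursive segmentation are my own toy example; real systems use full dictionaries and handle linking elements such as the German "Fugen-s"):

```python
# Naive German compound splitter: tries to segment a word into known
# dictionary words, so a termbase entry for one part can still match
# the whole compound.

LEXICON = {"haus", "tür", "schlüssel", "auto"}

def split_compound(word, lexicon=LEXICON):
    """Return a segmentation into lexicon words, or None if impossible."""
    w = word.lower()
    if not w:
        return []
    # Try longer prefixes first, backtracking if the remainder fails.
    for i in range(len(w), 0, -1):
        head, rest = w[:i], w[i:]
        if head in lexicon:
            tail = split_compound(rest, lexicon)
            if tail is not None:
                return [head] + tail
    return None

print(split_compound("haustürschlüssel"))
# → ['haus', 'tür', 'schlüssel']
```

A termbase entry for "Schlüssel" alone can then be recognized inside "Haustürschlüssel" -- the kind of match the screenshot's third column shows.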

The same feature is also available for term verification (you know, the previously frustrating quality check that typically turned up way more false positives than real ones).

Speaking of false positives, the tool is not perfect. In the regular translation interface, there were also false "matches," both in the term recognition window and highlighted in the source text as being available in the termbase when they were not. No doubt I could have changed that by playing with the fuzziness settings, but it was not annoying enough to worry about.

So, why do I give this so much space? Because I think this is a really important feature in a really important tool, and, notably, I feel like we were listened to after having requested its implementation for years. Now it's here and we're all the better for it.
— Tool Box Journal, Issue 19-3-298
« Last Edit: 30 Mar, 2019, 19:59:54 by spiros »