How to use Sketch Engine to extract terminology from a document or parallel texts in just a few clicks (Josh Goldsmith)


Why extract terminology

Terminology underpins every translation or interpreting job. After all, when we use a client's terminology, we come across as competent.

But how can we develop that specialized knowledge and identify domain-specific terminology and phraseology?

Of course, you can – and should – read through materials you receive and search for additional resources, like industry publications or a speaker's previous statements.

But if you're looking for a fast, reliable way to identify key terms, look no further than terminology extraction.

Extracting terminology helps you spot the terms and abbreviations your client actually uses and add them to your glossaries. It provides a general understanding of a domain so you can pinpoint concepts for further research, create glossaries for your projects, teams, or clients and ensure you consistently use the client's terminology.

And if you're an interpreter, terminology extraction is the ideal way to quickly pull terms from a long document you receive right before an assignment.

What's a corpus? And what's a term?

A corpus (plural: corpora) is a collection of one or more texts used to study language. We can use a corpus to identify links between words and unique features in language.

A term is a single-word or multi-word expression that appears in a corpus more frequently than it would in general language. For example, the words "the" or "a" are not terms, while a more specialized concept, like the "United Nations Convention on Biological Diversity," is a term.

How terminology extraction works (in a nutshell)

Sketch Engine uses linguistic and statistical information to detect terms in a corpus.

First, it groups together word forms, like "word"/"words" or "study"/"studies"/"studied."

Then, it compares the corpus to general language to determine which units appear more frequently than expected. Units that are unusually frequent in your corpus relative to general language are labeled as term candidates.
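
To make the comparison step concrete, here is a minimal Python sketch of frequency-based scoring. It is an illustration only: the article does not describe Sketch Engine's actual scoring or word-form grouping, so the tokenizer, the ratio formula, the smoothing value, and the two tiny sample texts are all assumptions.

```python
from collections import Counter
import re


def tokenize(text: str) -> list[str]:
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z]+(?:-[a-z]+)*", text.lower())


def keyness_scores(focus_tokens, reference_tokens, smoothing=1.0):
    """Score each focus-corpus word by how much more frequent it is
    (per million tokens) than in the reference corpus.

    The ratio and the smoothing constant are assumptions for illustration.
    """
    focus_counts = Counter(focus_tokens)
    reference_counts = Counter(reference_tokens)
    focus_total = len(focus_tokens)
    reference_total = len(reference_tokens)

    scores = {}
    for word, count in focus_counts.items():
        focus_per_million = count / focus_total * 1_000_000
        reference_per_million = reference_counts[word] / reference_total * 1_000_000
        # Smoothing keeps the ratio finite when a word is absent from the reference.
        scores[word] = (focus_per_million + smoothing) / (reference_per_million + smoothing)
    return scores


# Invented sample texts: a "client document" and a stand-in for general language.
focus_text = "Biodiversity loss threatens the habitats. The habitats need restoration."
reference_text = "The cat sat on the mat. The dog saw the cat and the mat."

scores = keyness_scores(tokenize(focus_text), tokenize(reference_text))
ranked = sorted(scores, key=scores.get, reverse=True)
print(ranked[:3])
```

In this toy example, "habitats" ranks first because it is frequent in the sample document and absent from the reference text, while "the" scores below 1 because it is even more common in the reference text.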

This approach works well with longer, written documents. If you receive short texts, presentations, spreadsheets, or lists of bullet points, opt for another strategy.

Sketch Engine's OneClick Terms

Sketch Engine has been designed for linguists of all stripes, from translators and interpreters to terminologists and lexicographers.

For our purposes, we'll head to the web-based OneClick Terms interface to detect terminology in just a few clicks – in any domain.

Worried about confidentiality? Sketch Engine is too. They hold the ISO 27001 Certificate of Digital Security, and never "feed" your data into the machine. If you choose not to store data on your account, it is automatically deleted after three days.

OneClick Terms is currently available in 25+ languages, with a free 30-day subscription and monthly, quarterly and yearly subscription plans.

(If you don't create an account, you can still test out the features described below, but some results will be hidden.)

Monolingual terminology extraction

Let's start by extracting terminology from a document in a single language.

Go to OneClick Terms and pick "One language." Select and confirm the document language, upload a document in a supported format (.doc, .docx, .htm, .html, .pdf, or .txt), and click "Extract terminology."

After processing the text, Sketch Engine suggests single-word and multi-word term candidates.

Terms at the top of your list are the most typical of your text, while those further down tend to be more general.

To see the frequency of terms in your document, click the "show statistics" button.

Like what you see? Click the download button to extract terms in .CSV or .XLSX format.

When I extracted terms from a biodiversity-related text, top hits included "biodiversity loss," "invasive alien species," "habitats directive," "nature restoration," "protected area," and "biodiversity-friendly" – and that was just the tip of the iceberg.

Identify phraseology and collocations with the concordancer

Context matters.

By looking at the words immediately before and after a term candidate, we can fine-tune the term and see how it is used in real life. For example, the tool may have detected "terminology," while the full term is actually "terminology extraction."

Sketch Engine features a built-in concordancer that makes this process easy.

Simply click on the "boxes" icon next to any term, and Sketch Engine will display several examples of how the term is used in your document.

Pick an acronym, and the full version of the term is likely to pop up.

Click on an adjective, and you'll find the nouns that often accompany it. For example, "biodiversity-friendly" might yield "biodiversity-friendly practices," "biodiversity-friendly soil cover," "biodiversity-friendly trade," and other helpful expressions.

Select a noun, like "biodiversity loss," and you'll find the verbs that typically precede it – like "address" or "reverse biodiversity loss" – and the words that follow it – like "a threat" or "...reduces crop yields."

Click "See more examples in the reference corpus" to see how the term is used in other sources, like the media.
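
If you are curious what a concordancer does under the hood, the idea can be sketched in a few lines of Python. This is a simplified keyword-in-context view, not Sketch Engine's implementation; the function name, the context width, and the sample sentences are assumptions made for illustration.

```python
import re


def concordance(text: str, term: str, width: int = 30) -> list[str]:
    """Return one keyword-in-context line per occurrence of `term`."""
    lines = []
    pattern = re.compile(re.escape(term), re.IGNORECASE)
    for match in pattern.finditer(text):
        # Take up to `width` characters on each side of the hit.
        left = text[max(0, match.start() - width):match.start()]
        right = text[match.end():match.end() + width]
        # Right-align the left context so the hits line up in a column.
        lines.append(f"{left:>{width}} [{match.group(0)}] {right}")
    return lines


# Invented sample text standing in for an uploaded document.
sample = (
    "Member States must address biodiversity loss by 2030. "
    "Experts warn that biodiversity loss reduces crop yields."
)

for line in concordance(sample, "biodiversity loss"):
    print(line)
```

Lining up the hits this way makes the words immediately before the term (here, "address") and after it (here, "reduces crop yields") easy to scan.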

Bilingual terminology extraction from parallel documents

Do you ever receive a document and its translation, and want to extract terminology from both texts?

In 2022, the Sketch Engine team rolled out an incredible feature to automatically align parallel texts, then extract terms in one language – and identify potential equivalents in the second!

Now, when you go to OneClick Terms, select "Two languages – Non-aligned documents." Pick the languages of your documents, upload both files, and click "Align documents and extract terminology."

After processing, Sketch Engine will offer single-word and multi-word terms in each of the two languages, plus a new view called "Biterms," which identifies five potential target terms for each source term.
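
One common way to propose equivalents from aligned sentences is to check how often a source term and a target candidate appear in the same sentence pair. The Python sketch below ranks candidates with the Dice coefficient. The article does not say how Sketch Engine builds its Biterms view, so treat the method, the function name, and the sample sentence pairs as assumptions.

```python
from collections import Counter


def rank_equivalents(aligned_pairs, source_term, candidates, top_n=5):
    """Rank lowercase target candidates by Dice coefficient with `source_term`
    over aligned (source_sentence, target_sentence) pairs."""
    source_hits = 0
    candidate_hits = Counter()
    joint_hits = Counter()

    for source_sentence, target_sentence in aligned_pairs:
        has_source = source_term in source_sentence.lower()
        source_hits += has_source
        for candidate in candidates:
            if candidate in target_sentence.lower():
                candidate_hits[candidate] += 1
                if has_source:
                    joint_hits[candidate] += 1

    scores = {}
    for candidate in candidates:
        denominator = source_hits + candidate_hits[candidate]
        # Dice: 2 * joint occurrences / (source occurrences + candidate occurrences)
        scores[candidate] = 2 * joint_hits[candidate] / denominator if denominator else 0.0
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)[:top_n]


# Invented English-French sentence pairs, already aligned one to one.
aligned_pairs = [
    ("Biodiversity loss is accelerating.", "La perte de biodiversité s'accélère."),
    ("We must reverse biodiversity loss.", "Nous devons inverser la perte de biodiversité."),
    ("Protected areas support biodiversity.", "Les zones protégées favorisent la biodiversité."),
]
candidates = ["perte de biodiversité", "biodiversité", "zones protégées"]

ranking = rank_equivalents(aligned_pairs, "biodiversity loss", candidates)
print(ranking)
```

Here "perte de biodiversité" comes out on top because it appears in exactly the sentence pairs where "biodiversity loss" appears, whereas "biodiversité" also shows up elsewhere. A ranked shortlist like this still needs a human to pick the right equivalent.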

Select the correct target term, or click the "pencil" icon next to the source or target term to edit it. Not sure which term is right? Open the parallel concordancer to see the terms in use.

After you've selected equivalents, click "Download" to export a bilingual glossary in .TBX, .CSV, or .XLSX format. Clean up any extraneous columns, and you're ready to import your new bilingual glossary into your CAT tool or glossary management software and dive right into your assignment!
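
If you export to .CSV and prefer to script the clean-up, a short Python sketch along these lines could drop the columns you don't need. The column names and sample rows below are hypothetical; check the header row of your own export and adjust KEEP_COLUMNS to match.

```python
import csv
import io

# Hypothetical export: the real column names in a OneClick Terms download may differ.
exported_csv = """source_term,target_term,score,frequency
biodiversity loss,perte de biodiversité,0.92,14
protected area,zone protégée,0.88,9
"""

KEEP_COLUMNS = ["source_term", "target_term"]  # assumed glossary columns


def trim_columns(csv_text: str, keep: list[str]) -> str:
    """Return CSV text containing only the columns listed in `keep`."""
    reader = csv.DictReader(io.StringIO(csv_text))
    output = io.StringIO()
    # extrasaction="ignore" silently drops any column not listed in `keep`.
    writer = csv.DictWriter(
        output, fieldnames=keep, extrasaction="ignore", lineterminator="\n"
    )
    writer.writeheader()
    for row in reader:
        writer.writerow(row)
    return output.getvalue()


glossary_csv = trim_columns(exported_csv, KEEP_COLUMNS)
print(glossary_csv)
```

To work with a real file, read its contents into a string first (for example with pathlib.Path(...).read_text(encoding="utf-8")) and write the result back out the same way.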

Pro tip: Click "Deselect all" before you start reviewing the terms, and only select the ones you actually want to include in your glossary. 😁

Check out my video demonstration of bilingual terminology extraction in Sketch Engine to see the full process of uploading parallel documents, pairing up terms, and downloading my bilingual glossary.

When you should use terminology extraction

Sketch Engine's OneClick Terms is my go-to preparation tool whenever I receive a long document in one language or a document and its translation – especially when I'm pressed for time.

By popping the documents into the tool, I quickly gain an overview of a field and its key terms, and can employ the terminology the client actually uses.

One word of warning: As with all technology, Sketch Engine is a tool – and you're the human with the brain. While automatically identifying potential terms can be a huge time-saver, your linguistic knowledge and expertise are essential for delivering top-notch translation and interpretation.

Josh Goldsmith is a UN and EU accredited translator and interpreter working from Spanish, French, Italian, Portuguese and Catalan into English. A passionate educator, Josh splits his time between interpreting, researching and teaching through, which empowers language professionals to make the most of technology.
« Last Edit: 22 Dec, 2022, 13:19:03 by spiros »

