One of the major tasks of any translation job is the identification of equivalents for specialised terms.
Subject fields such as the different sectors of law and industry each have significant amounts of field-specific terminology. In addition, many document initiators use their own preferred terminology. Researching the specific terms needed to complete a given translation is time-consuming. However, running an initial extraction with a term extraction tool has proved to save considerable time. In fact, this is the most positive aspect of such tools noted during the testing.
Nevertheless, although the tools facilitate extraction, the resulting list of candidate terms must be verified by a human terminologist or translator. The process of term extraction is therefore computer-aided rather than fully automatic.
Term extraction can be either monolingual or multilingual (usually bilingual).
Monolingual term extraction attempts to analyse a text or corpus in order to identify candidate terms, while multilingual term extraction analyses existing source texts along with their translations in an attempt to identify potential terms and their equivalents.
Term extraction tools can thus assist in populating termbases and setting up the terminology for specific tasks or projects. However, only a few translation memory systems offer term extraction as an integrated feature.
The main debate regarding terminology extraction in the EU institutions always leads to the question of how a term extraction tool can help with various aspects of the translator’s terminology-related tasks.
To provide a more fully automated service that really aids the translator, rather than consuming time on yet another application, such tools need to be adapted to the institutions’ needs, possibly by bridging a future CAT system with IATE, the existing terminological database of the EU institutions. As tests of the various off-the-shelf tools went on, this became increasingly apparent.
IATE has over eight million entries and, although far from perfect, it is the terminology database of the EU institutions and is here to stay. The institutions have thus already partly built a Terminology Management System (TMS). What is lacking is a powerful extraction tool to automate the process that has so far been done manually in all institutions, thus complementing the IATE database. One big obstacle when testing term extraction tools was how to cross-reference the extracted terms with the IATE entries. This has to be done manually and is extremely time-consuming.
By using such tools one could easily create temporary or satellite termbases outside IATE. These satellite termbases could then be exported from the term extraction tools in a compatible format for import into IATE. However, such satellite termbases could undermine the purpose of having one centrally located database for all the translators of the EU institutions and the general public. There is a dangerous tendency, seen in the past and to some extent still present, for translators (individually or at unit level) to create and maintain their own termbases without ever interacting with the rest of the institutions, leading to serious inconsistencies in translation.
In other words, there is room for a term extraction tool but, like the pieces of a big jigsaw puzzle, the three main tools, namely the CAT tool, IATE and the term extraction tool, have to fit together and work in unison. This also means that IATE itself would need a structural revamp. A disciplined practice for entering and verifying entries among the contributing parties is also important for avoiding duplicate entries and for increasing the reliability of the verified terms.
Testing different Term Extraction Tools
The Term Mining Project, as it is called at TermCoord, consisted of allocating tasks to trainees in various language units. Following TermCoord’s guidelines, the tasks involved the following activities:
- manual vs automatic term extraction activities;
- testing of particular features on different tools;
- reporting of findings to TermCoord.
There are two main approaches to term extraction: statistical and linguistic.
Term extraction tools using a statistical approach look for repeated sequences of lexical items. The user can often specify the frequency threshold, i.e. the number of times that a word or sequence of words must be repeated to be considered a candidate term. The major strength of the statistical approach is its language-independence. However, the amount of “noise” (i.e. candidates that are unlikely to be terms) and “silence” (i.e. genuine terms that are not identified) is relatively high. For this reason, aspects of the statistical and linguistic approaches are often combined in hybrid term extraction tools.
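The statistical idea can be illustrated with a minimal Python sketch. It is not how any of the tools tested work internally; the function name `candidate_terms` and the parameters `max_len` and `min_freq` are hypothetical, and the tokenisation is deliberately naive.

```python
from collections import Counter
import re


def candidate_terms(text, max_len=3, min_freq=2):
    """Return word sequences of 1..max_len words that occur at least min_freq times."""
    # Naive tokenisation: lowercase, keep letters/digits, allow inner hyphens and apostrophes.
    # Sentence boundaries are ignored, so sequences may span two sentences.
    tokens = re.findall(r"[^\W_]+(?:[-'][^\W_]+)*", text.lower())
    counts = Counter()
    for n in range(1, max_len + 1):
        for i in range(len(tokens) - n + 1):
            counts[" ".join(tokens[i:i + n])] += 1
    # The frequency threshold decides which sequences become candidate terms.
    return {seq: freq for seq, freq in counts.items() if freq >= min_freq}


sample = ("The term extraction tool found a term. "
          "The term extraction tool missed a term.")
candidates = candidate_terms(sample)
print(candidates["term extraction tool"])  # 2
print("the term" in candidates)            # True: an example of noise
```

Nothing in the sketch depends on the language of the text, which is the strength of the approach. The same property explains the noise: a sequence such as "the term" passes the threshold just as easily as "term extraction tool", and a genuine term that occurs only once is silently dropped.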
Term extraction tools using a linguistic approach typically attempt to identify word combinations that match certain part-of-speech patterns (e.g. “adjective+noun” or “noun+noun”). The linguistic approach is heavily language-dependent because term formation patterns differ from language to language. Consequently, term extraction tools that use a linguistic approach are generally designed to work in a single language (or closely related languages) and cannot easily be extended to other languages. They are therefore not well suited for integration into TM systems, which are usually language-independent.
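A minimal Python sketch of pattern matching over part-of-speech tags is given below. It assumes the text has already been tagged; the tag names (`ADJ`, `NOUN`, etc.), the `PATTERNS` list and the function `match_pos_patterns` are hypothetical, and a real tool would obtain the tags from a language-specific tagger.

```python
# Part-of-speech patterns treated as term-like (illustrative English patterns only).
PATTERNS = [("ADJ", "NOUN"), ("NOUN", "NOUN")]


def match_pos_patterns(tagged_tokens, patterns=PATTERNS):
    """Return the word sequences whose tags match one of the given patterns."""
    matches = []
    for i in range(len(tagged_tokens)):
        for pattern in patterns:
            window = tagged_tokens[i:i + len(pattern)]
            # Skip windows cut short by the end of the token list.
            if len(window) != len(pattern):
                continue
            if all(tag == wanted for (_, tag), wanted in zip(window, pattern)):
                matches.append(" ".join(word for word, _ in window))
    return matches


# Hand-tagged example; the tags are assumptions, not the output of a real tagger.
tagged = [("the", "DET"), ("legal", "ADJ"), ("term", "NOUN"),
          ("extraction", "NOUN"), ("tool", "NOUN"), ("works", "VERB")]
print(match_pos_patterns(tagged))
# ['legal term', 'term extraction', 'extraction tool']
```

The patterns are where the language dependence lies: a language that places the adjective after the noun, or forms terms with prepositions, needs a different `PATTERNS` list and a different tagger.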
So far TermCoord has tested the following term extraction tools:
- SDL MultiTerm Extract 2009;
- SDL PhraseFinder 2009 (currently being tested).
See also: Free Term Extractor tools.