File size observations on the IATE TBX Termbase

It has been known for a while now that, as of June 2014, a database dump of IATE, the EU Terminology Database, is available as a download and not only through the web search form. The ZIP file is ~116 MB, the unpacked database 2.2 GB (!) large. Since it contains all EU languages, I split this file into four subfiles and extracted four trilingual DE/FR/EN files using an XSL transformation sheet. xsltproc.exe from Apache’s Xerces XML Parser package couldn’t cope with the complete file, but the four 550 MB files passed through in about 10 minutes each and dropped to about half their original size.
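For readers who would rather not touch XSLT, the language extraction step can be approximated in Python. This is only a minimal sketch, not the XSL sheet I used: it assumes the usual TBX layout of one `langSet` element per language inside each `termEntry`, the sample data is made up, and `keep_languages` is a hypothetical helper name. For a file of several hundred MB one would stream with `ET.iterparse` instead of loading everything at once.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes the xml:lang attribute under the XML namespace URI.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

# Made-up miniature TBX document; element names are an assumption about the IATE export.
SAMPLE = """<martif type="TBX" xml:lang="en">
  <text><body>
    <termEntry id="IATE-1">
      <langSet xml:lang="de"><tig><term>Beispiel</term></tig></langSet>
      <langSet xml:lang="en"><tig><term>example</term></tig></langSet>
      <langSet xml:lang="pl"><tig><term>przyklad</term></tig></langSet>
    </termEntry>
    <termEntry id="IATE-2">
      <langSet xml:lang="it"><tig><term>esempio</term></tig></langSet>
    </termEntry>
  </body></text>
</martif>"""


def keep_languages(root, langs):
    """Remove every langSet whose xml:lang is not in langs (entries themselves are kept)."""
    for entry in root.iter("termEntry"):
        # Copy the list first so that removing children does not disturb the loop.
        for lang_set in list(entry.findall("langSet")):
            if lang_set.get(XML_LANG) not in langs:
                entry.remove(lang_set)
    return root


root = keep_languages(ET.fromstring(SAMPLE), {"de", "fr", "en"})
remaining = [lang_set.get(XML_LANG) for lang_set in root.iter("langSet")]
print(remaining)
```

Note that the second entry survives as an empty shell here, which is exactly the kind of leftover a later clean-up pass has to deal with.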

About 250-275 MB per file is still quite fat, so I thought about ways to reduce this further. (Un-)fortunately, IATE isn’t exactly renowned for its accuracy: colleagues in the know will always tell you to use IATE with caution. IATE has a „Reliability“ rating which is assigned to each entry, running from 1 (unchecked) via 2 (minimal reliability) and 3 (reliable) to 4 (very reliable/assessed). Thus, I was tempted to throw out all Reliability 1+2 entries and considered also doing away with Rel. 3 entries, since the IATE team itself notes:

This code was automatically assigned to many entries, regardless of their previous validation status, following the merger of existing databases to create IATE. Therefore some entries marked as ‘reliable’ are not necessarily so.

Uh-huh. So basically, all sorts of stuff was thrown in and instead of correctly classifying it as minimally reliable (Rel. 2) until the material could be reviewed, it was decided to recommend it as „reliable“ (Rel. 3). That was the point at which I wrote two more XSL sheets to filter for Reliability 3+4 (R34) and exclusively for Reliability 4 (R4). Since that run looked promising, I wrote yet another XSLT script to clean up the results (C), deleting empty language groups („tig“ elements) or even empty term entries („termEntry“ elements). Here’s what happened:

IATE TBX File Size Reductions for DE/FR/EN

| Filename | Orig. Size | R34 Size | R34 % from Orig | R34 Cleaned (R34C) Size | R34C % from R34 | R34C % from Orig | R4 Size | R4 % from Orig | R4 Cleaned (R4C) Size | R4C % from R4 | R4C % from Orig |
|---|---|---|---|---|---|---|---|---|---|---|---|
| IATE-de-fr-en-1of4.tbx | 273 MB | 166 MB | -39% | 125 MB | -25% | -54% | 113 MB | -59% | 57 MB | -50% | -79% |
| IATE-de-fr-en-2of4.tbx | 276 MB | 233 MB | -16% | 212 MB | -9% | -23% | 106 MB | -62% | 50 MB | -53% | -82% |
| IATE-de-fr-en-3of4.tbx | 253 MB | 213 MB | -16% | 192 MB | -10% | -24% | 100 MB | -61% | 46 MB | -54% | -82% |
| IATE-de-fr-en-4of4.tbx | 271 MB | 245 MB | -10% | 231 MB | -6% | -15% | 107 MB | -61% | 56 MB | -48% | -79% |
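The reliability filter and the clean-up pass described above the table can also be sketched in Python. Treat this as an illustration under assumptions rather than a port of my XSL sheets: it assumes each term lives in a `tig` inside a `langSet`, that the rating is stored as `<descrip type="reliabilityCode">`, and that a term without a rating counts as unreliable. The sample entries and the helper names (`reliability`, `filter_and_clean`, `terms`) are made up.

```python
import xml.etree.ElementTree as ET

# Made-up sample; the reliabilityCode descrip is an assumption about the IATE export.
SAMPLE = """<body>
  <termEntry id="IATE-1">
    <langSet xml:lang="de">
      <tig><term>Beispiel</term><descrip type="reliabilityCode">4</descrip></tig>
      <tig><term>Bsp.</term><descrip type="reliabilityCode">2</descrip></tig>
    </langSet>
    <langSet xml:lang="en">
      <tig><term>example</term><descrip type="reliabilityCode">3</descrip></tig>
    </langSet>
  </termEntry>
  <termEntry id="IATE-2">
    <langSet xml:lang="fr">
      <tig><term>exemple</term><descrip type="reliabilityCode">1</descrip></tig>
    </langSet>
  </termEntry>
</body>"""


def reliability(tig):
    """Return the reliability code of a term group, or 0 if none is recorded."""
    for descrip in tig.findall("descrip"):
        if descrip.get("type") == "reliabilityCode":
            return int((descrip.text or "0").strip())
    return 0  # assumption: an unrated term is treated as unreliable


def filter_and_clean(body, minimum):
    """Drop terms rated below minimum, then drop emptied langSets and termEntries."""
    for entry in list(body.findall("termEntry")):
        for lang_set in list(entry.findall("langSet")):
            for tig in list(lang_set.findall("tig")):
                if reliability(tig) < minimum:
                    lang_set.remove(tig)
            if lang_set.find("tig") is None:   # language group left empty
                entry.remove(lang_set)
        if entry.find("langSet") is None:      # whole entry left empty
            body.remove(entry)
    return body


def terms(body):
    return [term.text for term in body.iter("term")]


r34 = filter_and_clean(ET.fromstring(SAMPLE), minimum=3)
r4 = filter_and_clean(ET.fromstring(SAMPLE), minimum=4)
print(terms(r34))
print(terms(r4))
```

Doing the filter and the clean-up in one pass is a convenience of the sketch; I ran them as separate transformations, which is why the table lists separate sizes for each stage.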

Now, what does this mean?

Apparently, German, English and French make up roughly 50% of the whole IATE database. This isn’t astonishing, as DE, FR and EN are the OFFICIAL official languages of the EU (that is, all documents must be made available in at least one of these three languages). But it also means that on average, measured by file size before clean-up, 80% of the chosen DE/FR/EN data subset is classed as „reliable or very reliable“ and still almost 40% as „very reliable“.
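The 80% and 40% figures are simply the summed R34 and R4 file sizes divided by the summed original sizes. A few lines of Python reproduce the arithmetic from the rounded MB values in the table; the dictionary layout and variable names are only for illustration, and file size is of course just a rough proxy for the number of entries.

```python
# Rounded sizes in MB, copied from the table (uncleaned and cleaned variants).
sizes_mb = {
    "1of4": {"orig": 273, "r34": 166, "r34c": 125, "r4": 113, "r4c": 57},
    "2of4": {"orig": 276, "r34": 233, "r34c": 212, "r4": 106, "r4c": 50},
    "3of4": {"orig": 253, "r34": 213, "r34c": 192, "r4": 100, "r4c": 46},
    "4of4": {"orig": 271, "r34": 245, "r34c": 231, "r4": 107, "r4c": 56},
}


def total(column):
    """Sum one column over all four files."""
    return sum(row[column] for row in sizes_mb.values())


share_r34 = total("r34") / total("orig")   # share rated 3 or 4
share_r4 = total("r4") / total("orig")     # share rated 4 only
cut_r34c = 1 - total("r34c") / total("orig")  # overall reduction after R34 + clean-up
cut_r4c = 1 - total("r4c") / total("orig")    # overall reduction after R4 + clean-up

print(f"R34 share of original size: {share_r34:.0%}")
print(f"R4 share of original size:  {share_r4:.0%}")
print(f"R34C overall reduction:     {cut_r34c:.0%}")
print(f"R4C overall reduction:      {cut_r4c:.0%}")
```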

Additionally, this means that by cutting out all unreliable entries and all the unnecessary bits (empty tags, superfluous whitespace, etc.), we can achieve significant file size reductions. This plays an important role during import of the TBX database into other systems, notably SDL Trados Studio’s beloved companion, SDL MultiTerm, which didn’t manage to import the original DE-FR-EN files without lots of „file lock limit“ errors. More on that in another post, perhaps, but Paul Filkin has already written about that in What A Whopper. The message is: „Don’t use IATE as-is, adapt it to your needs!“ For example, one could further filter IATE by the „field“ column to adapt it to one’s own expert fields as a translator.

If you are interested in the XSL transformation sheets used, you can download them as a 3kB ZIP file. If you don’t know anything about XML/XSL, but would like to have a look at the resulting varieties of DE-FR-EN TBX files, send me a nice-to-read e-mail to info ~at~ defrent ~dot~ de (no „mee too!“ blog comments, please). The „unedited TBX“ ZIP file weighs in at ~55MB, the filtered Reliability 3+4 ZIP is ~37MB and the Reliability 4 ZIP is only ~7.5 MB. Since the resulting SDL MultiTerm termbases are 5 times as heavy as the corresponding TBX file, I am reluctant to send out those, but with the free MultiTerm Convert tool from the SDL OpenExchange, conversion should be a matter of minutes. Of course, the IATE usage conditions from their download site apply to the edited files, too:

You are allowed to reproduce the data provided on this page for your personal needs, to distribute it for non-commercial and commercial purposes, and to make and distribute derivative works, provided the source is acknowledged as follows: Download IATE, European Union, 2014.

Edit (1st Oct. 2014): @jeromobot pointed me to Paul Filkin’s recommendation, which I will repeat here in short: If you are looking for more thoroughly cleaned IATE files that are ready for import into your CAT, you might want to visit Henk Sanderson’s site SanTrans, where he also mentions additional IATE pitfalls, like terms-that-aren’t and escaped (pseudo-)HTML codes like &lt;i&gt;some term&lt;/i&gt; inside entries.
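To give an idea of what cleaning up the escaped pseudo-HTML could look like, here is a small Python sketch. It is my own guess at a reasonable approach, not a description of how the SanTrans files were produced; `strip_pseudo_html` is a hypothetical helper, and the regular expression only targets simple inline tags such as `<i>` or `<b>`.

```python
import html
import re

# Matches simple opening/closing tags such as <i>, </i>, <b>; a "<" followed by
# a space or digit (as in "a < b") is deliberately left alone.
TAG = re.compile(r"</?[A-Za-z][^<>]*>")


def strip_pseudo_html(term):
    """Unescape entities like &lt; and &gt;, then remove simple inline tags."""
    unescaped = html.unescape(term)
    return TAG.sub("", unescaped).strip()


print(strip_pseudo_html("&lt;i&gt;some term&lt;/i&gt;"))
```

Because an XML parser already turns the escaped form into literal angle brackets when it reads the term text, the function is written to handle both the escaped and the unescaped variant.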
