Skribenta and The Debated Translation Unit

As providers of Skribenta, a CCMS with integrated translation memory, we sometimes receive inquiries from customers and translation agencies who are curious about the possibilities. There’s this one question in particular which we receive rather frequently:

Is it technically possible to output translation files segmented on the sentence level?

Our response is that, at this time, such a feature is not planned for implementation. But this is not due to technical difficulties or limitations. Instead, it comes down to a decision based on how we view units of translation, and how translations are handled in Skribenta.

Here’s a brief summary of the whys and hows of translation units in Skribenta, and what to do if your approach to working with translations seems to be at cross-purposes with our CCMS.

Why our unit of translation is a block and not a sentence

In Skribenta, a translation unit is a block. A block is a paragraph, a title, a caption, a main title, or a subtitle. One source block should have exactly one corresponding translation block in each target language. This is a prerequisite for the on-the-fly translation that Skribenta is so well known for. When a publication is sent for translation via Skribenta, the translation file contains these blocks, and is therefore segmented on the paragraph level. The same is true of the TMX file that can be exported for a translation memory language pair.
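To make the block-per-unit idea concrete, here is a minimal sketch of reading paragraph-level units from a TMX file. It assumes the generic TMX 1.4 layout (tu, tuv and seg elements); the sample content, the read_blocks helper and the header values are made up for illustration, and an actual Skribenta export may look different.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes the xml:lang attribute under the XML namespace.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

# Invented sample: one translation unit holding one whole block per language.
SAMPLE_TMX = """<tmx version="1.4">
  <header creationtool="example" creationtoolversion="1.0" segtype="paragraph"
          o-tmf="example" adminlang="en" srclang="en" datatype="plaintext"/>
  <body>
    <tu>
      <tuv xml:lang="en"><seg>Close the valve. Wait until the pressure drops.</seg></tuv>
      <tuv xml:lang="sv"><seg>Stäng ventilen. Vänta tills trycket sjunker.</seg></tuv>
    </tu>
  </body>
</tmx>"""


def read_blocks(tmx_text, source_lang, target_lang):
    """Return (source block, target block) pairs for one language pair."""
    root = ET.fromstring(tmx_text)
    pairs = []
    for tu in root.iter("tu"):
        segments = {}
        for tuv in tu.findall("tuv"):
            seg = tuv.find("seg")
            if seg is not None:
                # itertext() also collects text inside inline markup elements.
                segments[tuv.get(XML_LANG)] = "".join(seg.itertext())
        if source_lang in segments and target_lang in segments:
            pairs.append((segments[source_lang], segments[target_lang]))
    return pairs


blocks = read_blocks(SAMPLE_TMX, "en", "sv")
```

Each pair holds a whole paragraph, two sentences in this sample, rather than one pair per sentence.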

So why have we decided on the block as the ideal unit of translation? Our decision is based on the belief that a paragraph contains more semantic information than another common translation unit: the isolated sentence. While the sentence is the basic unit of language, it is far from being a consistent unit of meaning. Since individual sentences do not necessarily communicate content, using them as a unit of translation can often lead to what we call an ambiguity in Skribenta.

An ambiguity occurs when the same text in one language exists in more than one block in the Skribenta translation memory. Ambiguities are, in many cases, the result of variant translations and can be resolved in the translation memory.
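As a rough illustration, ambiguities of the variant-translation kind can be found by grouping memory entries on their source text. The sketch below works on plain (source, target) pairs; the find_ambiguities helper and the sample entries are hypothetical and say nothing about how Skribenta stores or resolves ambiguities internally.

```python
from collections import defaultdict


def find_ambiguities(pairs):
    """Return source texts that have more than one distinct translation."""
    targets_by_source = defaultdict(set)
    for source, target in pairs:
        targets_by_source[source.strip()].add(target.strip())
    return {
        source: sorted(targets)
        for source, targets in targets_by_source.items()
        if len(targets) > 1
    }


# Invented sample memory: the first two entries share the same source text.
memory = [
    ("Close the valve.", "Stäng ventilen."),
    ("Close the valve.", "Stäng kranen."),
    ("Open the cover.", "Öppna luckan."),
]
ambiguities = find_ambiguities(memory)
```

The shorter the unit, the more often the same source text turns up with competing translations, which is the practical argument for keeping whole blocks together.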

Here’s a classic example of semantic ambiguity:

John and Mary are married.

In this case, the extended semantic information a paragraph provides would help decrease ambiguity, and hopefully make it easier for the translator to arrive at the correct translation:

John and Mary are married. John has been married to Kate for twenty years, and Mary is happily married to Paul…

In the above example, an incorrect translation would result in mild amusement at most. In technical information, however, accuracy is of critical importance.

Tips for successfully working with Skribenta translation files

First priority: Translate on paragraph level

Most commercial CAT tools support matching against reference material segmented on paragraph level. In that case, the Skribenta translation file or the TMX does not require re-segmentation. In some CAT tools, the unit for translation isn’t even of vital importance, as the reference material can be extended to include surrounding segments, paragraphs, as well as sentences.

Second priority: Translate on sentence level

For segments that remain untranslated after matching on the paragraph level, it might be beneficial to match on a sentence level in order to get more “hits” in the reference material for the untranslated blocks.

It is also possible that you’re working in a CAT tool that strictly supports matching against sentences, or that only works with reference material segmented on sentences. In such cases, you’ll have to re-segment the translation file or the TMX file. The usual route is to create an import definition with segmentation rules in the CAT tool. This definition can then be applied every time a Skribenta translation file or TMX is imported into the CAT tool.

Segmentation rules are no exact science, but punctuation is usually a pointer. Here is a (very) short primer:

“Break” rules. A full stop (or a question mark, or an exclamation mark if you have a very dramatic technical writer on your staff) followed by a capitalized word usually signifies a break between two sentences in European languages. Japanese has no capital letters, so there the sentence-final mark (。) alone is the pointer. It is always possible that one sentence in the source language corresponds to several sentences in the target language, or the reverse. But this is usually an indication of a poorly written source text.

Colons and semi-colons. Whether these mark a break depends on the writing rules applied in your technical documentation. Preferably, clauses separated by a colon or semi-colon should be self-contained semantically, i.e. be understood without a preceding or following clause.

Exception rules. There are cases where punctuation marks do not indicate a new sentence. Abbreviations may include full stops in either the source or target language, or both. For common abbreviations with full stops, you can implement exception rules that suppress the break. Like abbreviations, decimal numbers include full stops in many languages. But since Skribenta doesn’t output numbers for translation, you don’t need to bother with those.
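The break and exception rules above can be sketched in a few lines of Python. This is only an illustration of the principle, not the rule syntax of any particular CAT tool: the abbreviation list, the regular expression and the segment function are assumptions, and real segmentation rules need tuning per language.

```python
import re

# Hypothetical exception list; extend it for your own languages and style guide.
ABBREVIATIONS = {"e.g.", "i.e.", "etc.", "approx."}

# Break rule: ".", "?" or "!" followed by whitespace and a capitalized word.
BREAK = re.compile(r"(?<=[.?!])\s+(?=[A-ZÀ-ÖØ-Þ])")


def segment(block):
    """Split one block (paragraph) into sentences."""
    text = block.strip()
    if not text:
        return []
    sentences = []
    current = ""
    for piece in BREAK.split(text):
        current = f"{current} {piece}" if current else piece
        last_word = current.rsplit(None, 1)[-1].lower()
        if last_word in ABBREVIATIONS:
            # Exception rule: the full stop belongs to an abbreviation,
            # so keep collecting text instead of breaking here.
            continue
        sentences.append(current)
        current = ""
    if current:
        sentences.append(current)
    return sentences


sentences = segment(
    "Contact the supplier, e.g. Excosoft, for details. "
    "Check the seal. Is it intact? Replace it if not."
)
```

Here the full stop in "e.g." does not trigger a break even though a capitalized word follows, so the sample block comes out as four sentences. A sentence that genuinely ends with a listed abbreviation, such as "etc.", would be merged with the next one, which is the kind of trade-off exception rules always involve.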

With that, I’ll leave you to ponder the ultimate unit of translation. I’ve outlined the Excosoft argument for the block unit here, but please do get back to us if you find a better answer. This topic is always open for debate, and our door is always open for feedback.