The Translate extension translation memory supports multiple backends. The available backends are database, Solr and ElasticSearch. This page helps you install the best one for you and explains their specifications in deeper detail.

Unlike other translation aids, for instance external machine translation services, the translation memory is constantly updated by new translations in your wiki. Advanced search across translations is also available at Special:SearchTranslations if you choose Solr or ElasticSearch.

Comparison

The database backend is used by default: it has no dependencies and doesn't need configuration. However, it can't be shared among multiple wikis and it does not scale to large amounts of translated content, which is why the Solr and ElasticSearch backends are also supported. It is also possible to use another wiki's translation memory through its web API, if that wiki makes it public. Unlike the others, remote backends are not updated with translations from the current wiki.

The bootstrap script will create necessary schemas. If you are using ElasticSearch backend with multiple wikis, they will share the translation memory by default, unless you set the index parameter in the configuration.
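As an illustration of keeping one wiki's translation memory separate on a shared ElasticSearch installation, the sketch below sets the index key. The index name is a hypothetical example; the other values follow the example configuration on this page.

```php
// LocalSettings.php of one wiki (sketch; 'ttmserver_mywiki' is a made-up index name)
$wgTranslateTranslationServices['TTMServer'] = array(
	'type' => 'ttmserver',
	'class' => 'ElasticSearchTTMServer',
	'cutoff' => 0.75,
	// Without this key the default index "ttmserver" is used,
	// and it is shared with the other wikis on the same cluster.
	'index' => 'ttmserver_mywiki',
);
```

After changing the index, the bootstrap has to be run again for that wiki so that the new index is created and populated.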

Solr backend

Here are the general quick steps for installing and configuring Solr for TTMServer. You should adapt them to your situation.

To use the Solr backend you also need the Solarium library. The easiest way to get it is to install the Solarium extension. See the example configuration for the Solr backend in the configuration section of this page. You can pass extra configuration to Solarium via the config key, as is done for example in the Wikimedia configuration.

Installation

After putting the requirements in place, installation requires you to tweak the configuration and then execute the bootstrap.

Configuration

All translation aids including translation memories are configured with the $wgTranslateTranslationServices configuration variable.

The primary translation memory backend must use the key TTMServer. The primary backend receives translation updates and is used by Special:SearchTranslations.

ElasticSearch backend configuration

$wgTranslateTranslationServices['TTMServer'] = array(
	'type' => 'ttmserver',
	'class' => 'ElasticSearchTTMServer',
	'cutoff' => 0.75,
	/*
	 * See http://elastica.io/getting-started/installation.html
	 * See https://github.com/ruflin/Elastica/blob/master/lib/Elastica/Client.php
	'config' => This will be passed to \Elastica\Client
	 */
);

Solr backend configuration

$wgTranslateTranslationServices['TTMServer'] = array(
	'type' => 'ttmserver',
	'class' => 'SolrTTMServer',
	'cutoff' => 0.75,
	/*
	 * See http://wiki.solarium-project.org/index.php/V2:Basic_usage
	'config' => This will be passed to Solarium_Client
	 */
);

Possible keys and values are:

| Key | Applies to | Description |
|---|---|---|
| config | Solr and ElasticSearch | Configuration passed to Solarium or Elastica. |
| cutoff | All | Minimum threshold for matching suggestion. Only a few best suggestions are shown even if there would be more above the threshold. |
| database | Local | If you want to store the translation memory in a different location, you can specify the database name here. You also have to configure MediaWiki's load balancer to know how to connect to that database. |
| displayname | Remote | The text shown in the tooltip when hovering the suggestion source link (the bullets). |
| index | ElasticSearch | The index to use in ElasticSearch. Default: ttmserver. |
| public | All | Whether this TTMServer can be queried through the api.php of this wiki. |
| replicas | ElasticSearch | If you are running a cluster, you can increase the number of replicas. Default: 0. |
| shards | ElasticSearch | How many shards to use. Default: 5. |
| symbol | All | The suggestion source link text. Defaults to ‣ for remote and to • otherwise. |
| timeout | Remote | How long to wait for an answer from remote service. |
| type | All | Type of the TTMServer in terms of results format. |
| url | Remote | URL to api.php of the remote TTMServer. |

You must use the key TTMServer as the array index to $wgTranslateTranslationServices if you want the translation memory to be updated with new translations. Remote TTMServers cannot be used for that, because they cannot be updated.
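To illustrate the difference between the primary and a remote service, here is a sketch of a read-only remote translation memory registered under a key other than TTMServer. The key name, URL, display name and timeout value are hypothetical, and the class name RemoteTTMServer is an assumption that should be checked against your version of Translate.

```php
// LocalSettings.php (sketch; all values below are illustrative)
$wgTranslateTranslationServices['example-remote'] = array(
	'type' => 'ttmserver',
	// Assumed class name for remote translation memories.
	'class' => 'RemoteTTMServer',
	// api.php of the wiki whose translation memory is queried.
	'url' => 'https://example.org/w/api.php',
	// Tooltip text for the suggestion source link.
	'displayname' => 'example.org',
	'cutoff' => 0.75,
	// How long to wait for an answer from the remote service.
	'timeout' => 3,
);
```

Because the key is not TTMServer, this service only provides suggestions; new translations from the current wiki are never written to it.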

Currently only MySQL is supported for the database backend.

Bootstrap

When you have chosen Solr or ElasticSearch and set up the requirements and configuration, run ttmserver-export.php to bootstrap the translation memory. Bootstrapping is also required when changing translation memory backend. If you are using a shared translation memory backend for multiple wikis, you'll need to bootstrap each of them separately.

Sites with lots of translations should consider using multiple threads with the --threads parameter to speed up the process. How long it takes depends heavily on how complete the message group completion stats are (incomplete ones are calculated during the bootstrap). New translations are automatically added by a hook. New sources (message definitions) are added when the first translation is created.

Bootstrap does the following things, which don't happen otherwise:

adding and updating the translation memory schema;

populating the translation memory with existing translations;

cleaning up unused translation entries by emptying and re-populating the translation memory.

When the translation of a message is updated, the previous translation is removed from the translation memory. However, when translations are updated against a new definition, a new entry is added but the old definition and its old translations remain in the database until purged. When a message changes definition or is removed from all message groups, nothing happens immediately. Saving a translation as fuzzy does not add a new translation nor delete an old one in the translation memory.

TTMServer API

If you would like to implement your own TTMServer service, here are the specifications.

Query parameters:

Your service must accept the following parameters:

| Key | Value |
|---|---|
| format | json |
| action | ttmserver |
| service | Optional service identifier if there are multiple shared translation memories. If not provided, the default service is assumed. |
| sourcelanguage | Language code as used in MediaWiki, see IETF language tags and ISO 639 |
| targetlanguage | Language code as used in MediaWiki, see IETF language tags and ISO 639 |
| text | Source text in source language |
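The parameters above can be assembled into a request roughly as follows. This is a minimal sketch, not part of Translate: the function names and the example URL are hypothetical, the source-text parameter is assumed to be named text, and the response check only verifies the outer ttmserver key, treating the suggestion objects as opaque.

```php
<?php
// Build the query URL for a TTMServer-compatible api.php (hypothetical helper).
function buildTtmserverQueryUrl(
	string $apiUrl,
	string $sourceLanguage,
	string $targetLanguage,
	string $text,
	?string $service = null
): string {
	$params = array(
		'format' => 'json',
		'action' => 'ttmserver',
		'sourcelanguage' => $sourceLanguage,
		'targetlanguage' => $targetLanguage,
		'text' => $text,
	);
	// The service identifier is optional; omit it to use the default service.
	if ( $service !== null ) {
		$params['service'] = $service;
	}
	return $apiUrl . '?' . http_build_query( $params, '', '&' );
}

// Decode a response body and return the array stored under the ttmserver key.
function extractSuggestions( string $json ): array {
	$data = json_decode( $json, true );
	if ( !is_array( $data ) || !isset( $data['ttmserver'] ) || !is_array( $data['ttmserver'] ) ) {
		throw new UnexpectedValueException( 'Response has no ttmserver array' );
	}
	return $data['ttmserver'];
}
```

A caller would fetch the URL with whatever HTTP client it prefers and pass the response body to extractSuggestions.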

Your service must provide a JSON object that must have the key ttmserver with an array of objects. Those objects must contain the following data:

Database backend

The backend contains three tables: translate_tms, translate_tmt and translate_tmf. Those correspond to sources, targets and fulltext. You can find the table definitions in sql/translate_tm.sql. The sources contain all the message definitions. Although the definitions are usually all in the same language, say English, the language of the text is also stored for the rare cases where this is not true.

Each entry has a unique id and two extra fields, length and context. Length is used as the first pass filter, so that when querying we don't need to compare the text we're searching with every entry in the database. The context stores the title of the page where the text comes from, for example "MediaWiki:Jan/en". From this information we can link the suggestions back to "MediaWiki:Jan/de", which makes it possible for translators to quickly fix things, or just to determine where that kind of translation was used.
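The link from a stored context back to the translation page can be sketched as below. The function name is hypothetical, and the sketch assumes the context always ends in a language code subpage, as in the example above.

```php
<?php
// Turn a stored context such as "MediaWiki:Jan/en" into the title of the
// corresponding page in the target language, e.g. "MediaWiki:Jan/de".
function translationTitle( string $context, string $targetLanguage ): string {
	// The language code is assumed to be the last subpage component.
	$pos = strrpos( $context, '/' );
	$base = $pos === false ? $context : substr( $context, 0, $pos );
	return $base . '/' . $targetLanguage;
}
```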

The second pass of filtering comes from the fulltext search. The definitions are processed with an ad hoc algorithm. First the text is split into segments (words) with MediaWiki's Language::segmentByWord. If there are enough segments, we strip basically everything that is not a word letter and normalize the case. Then we take the first ten unique words that are at least 5 bytes long (5 letters in English, but fewer letters in languages with multibyte code points). Those words are then stored in the fulltext index and used for further filtering of longer strings.
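The word selection described above might look roughly like this. It is a sketch under stated assumptions, not the code used by Translate: whitespace splitting stands in for Language::segmentByWord, the minimum number of segments is a hypothetical parameter (the text only says "enough"), and the function name is made up.

```php
<?php
// Pick up to ten unique, case-normalized words of at least 5 bytes.
function fulltextKeywords( string $text, int $minSegments = 5 ): array {
	// Stand-in for Language::segmentByWord: split on whitespace.
	$segments = preg_split( '/\s+/u', trim( $text ), -1, PREG_SPLIT_NO_EMPTY );
	if ( count( $segments ) < $minSegments ) {
		// Too short for the fulltext pass.
		return array();
	}
	$words = array();
	foreach ( $segments as $segment ) {
		// Keep letters only, then normalize the case.
		$word = mb_strtolower( preg_replace( '/[^\p{L}]/u', '', $segment ), 'UTF-8' );
		// strlen() counts bytes, so multibyte words qualify with fewer letters.
		if ( strlen( $word ) >= 5 ) {
			$words[$word] = true; // array keys keep first-seen order and deduplicate
		}
		if ( count( $words ) === 10 ) {
			break;
		}
	}
	return array_keys( $words );
}
```

Short words such as "your" or "the" are dropped by the byte-length check, while a word like "äiti" passes because "ä" takes two bytes in UTF-8.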

When we have filtered the list of candidates, we fetch the matching targets from the targets table. Then we apply the Levenshtein edit distance algorithm to do the final filtering and ranking. Let's define:

E: edit distance

S: the text we are searching suggestions for

Tc: the suggestion text

To: the original text which the Tc is translation of

The quality of suggestion Tc is calculated as E/min(length(Tc),length(To)). Which implementation is used depends on the length of the strings: PHP's native levenshtein function by default, or, if either of the strings is longer than 255 bytes, a PHP implementation of the Levenshtein algorithm.[1] It has not been tested whether the native implementation of levenshtein handles multibyte characters correctly. This might be another weak point when the source language is not English (the others being the fulltext search and segmentation).
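The calculation can be sketched as follows. This is an illustration only: the function names are hypothetical, the fallback is a plain byte-wise dynamic programming implementation rather than the one cited above, and the sketch assumes that E is measured between S and To, which the text does not state explicitly.

```php
<?php
// Byte-wise Levenshtein distance for strings beyond the 255-byte threshold.
function editDistanceFallback( string $a, string $b ): int {
	$lenA = strlen( $a );
	$lenB = strlen( $b );
	if ( $lenA === 0 ) {
		return $lenB;
	}
	if ( $lenB === 0 ) {
		return $lenA;
	}
	// $prev holds the distances for the previous row of the DP table.
	$prev = range( 0, $lenB );
	for ( $i = 1; $i <= $lenA; $i++ ) {
		$curr = array( $i );
		for ( $j = 1; $j <= $lenB; $j++ ) {
			$cost = ( $a[$i - 1] === $b[$j - 1] ) ? 0 : 1;
			$curr[$j] = min( $prev[$j] + 1, $curr[$j - 1] + 1, $prev[$j - 1] + $cost );
		}
		$prev = $curr;
	}
	return $prev[$lenB];
}

// Use the native function for short strings and the fallback otherwise.
function editDistance( string $a, string $b ): int {
	if ( strlen( $a ) > 255 || strlen( $b ) > 255 ) {
		return editDistanceFallback( $a, $b );
	}
	return levenshtein( $a, $b );
}

// E / min(length(Tc), length(To)), with E assumed to be the distance from S to To.
function distanceRatio( string $s, string $tc, string $to ): float {
	$len = min( strlen( $tc ), strlen( $to ) );
	if ( $len === 0 ) {
		throw new InvalidArgumentException( 'Tc and To must not be empty' );
	}
	return editDistance( $s, $to ) / $len;
}
```

Both functions count bytes, so they share the multibyte caveat mentioned above.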

Solr backend

The Solr search platform backend works similarly to the database backend, except that it uses a dedicated search engine for increased speed. By default the results are ranked with the Levenshtein algorithm on the Solr side, but other available string matching algorithms can also be used, for example ngram matching.

In Solr there are no tables. Instead we have documents with fields. Here is an example document:

Each translation has its own document and message documentation has one too. To actually get suggestions we first perform the search sorted by string similarity algorithm for all documents in the source language. Then we do another query to fetch translations if any for those messages.

We are using lots of hooks to keep the translation memory database updated in almost real time. If a user translates similar messages one after another, the previous translation can (in the best case) be displayed as a suggestion for the next message.

| Event | Actions |
|---|---|
| New translation (if not fuzzy) | Create document |
| Updated translation (if not fuzzy) | Delete wiki:X language:Y message:Z; create document |
| Updated message definition | Create new document. All existing documents for the message stay around because globalid is different. |
| Translation is fuzzied | Delete wiki:X language:Y message:Z |
| Message changes group membership | Delete wiki:X message:Z; create document (for all languages) |
| Message goes out of use | Delete wiki:X message:Z; create document (for all languages). Any further changes to definitions or translations are not updated to TM. |

| Query | Actions |
|---|---|
| Translation memory query | Collect similar messages with strdist("message definition",content); collect translation with globalid:[A,B,C] |
| Search query | Find all matches with text:"search query". Can be narrowed further by facets on language or group field. |

Identifier fields

Field globalid uniquely identifies the translation or message definition by combining the following fields:

wiki identifier (MediaWiki database id)

message identifier (Title of the base page)

message version identifier (Revision id of the message definition page)

message language

The format used is wiki-message-version/language.

In addition we have separate fields for wiki id, message id and language to make the delete queries listed above possible.
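The identifier format and the delete queries can be illustrated with a small sketch. The function names and example values are hypothetical. The field names wiki, language and message are taken from the delete queries listed above, and joining the clauses with AND and quoting the values is an assumption about how such a query would be written in Solr's query syntax.

```php
<?php
// Build a globalid in the format wiki-message-version/language.
function buildGlobalId( string $wiki, string $message, int $version, string $language ): string {
	return sprintf( '%s-%s-%d/%s', $wiki, $message, $version, $language );
}

// Quote a value as a phrase, escaping backslashes and double quotes.
function quoteValue( string $value ): string {
	return '"' . addcslashes( $value, '"\\' ) . '"';
}

// Build a delete query on the separate fields; pass no language to match
// the documents of a message in all languages.
function buildDeleteQuery( string $wiki, string $message, ?string $language = null ): string {
	$clauses = array( 'wiki:' . quoteValue( $wiki ) );
	if ( $language !== null ) {
		$clauses[] = 'language:' . quoteValue( $language );
	}
	$clauses[] = 'message:' . quoteValue( $message );
	return implode( ' AND ', $clauses );
}
```

Because the revision id is part of globalid, a changed message definition produces a new identifier, which is why the old documents cannot be addressed by globalid alone and the separate fields are needed for deletion.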