Tatoeba is a project that aims to collect lots of sentences translated into several languages. In this blog you will find, among other things, news and documentation about it.

Thursday, January 22, 2009

New validation system

Context

There are currently over 330,000 sentences in Tatoeba (all languages included). Most of them come from an English-Japanese corpus called the Tanaka Corpus. Part of this corpus was translated into French about a year and a half ago thanks to the initiative of Tokidoki's webmaster, who later gave me the translations to integrate into Tatoeba.

We now have about 150,000 sentences in English, about the same number in Japanese, and almost 24,000 in French.

The problem is, many of these sentences still have mistakes. And to understand why, you have to understand how those sentences were collected.

Tanaka Corpus

For those who didn't want to read the page about the Tanaka Corpus, here's the explanation:

Professor Tanaka's students were given the task of collecting 300 sentence pairs each. After several years, 212,000 sentence pairs had been collected

[...]

The original collection contained large numbers of errors, both in the Japanese and English. Many of the errors were in spelling and transcription, although in a significant number of cases the Japanese and English contained grammatical, syntactic, etc. errors, or the translations did not match at all.

A huge amount of work has gone into maintaining this corpus, but it was done mostly by one man (Paul Blay), and you can't expect him to have gotten rid of all the mistakes.

French translations

The French translations that were given to me were the work of 80 volunteers. The idea of this translation project was first of all to translate as much as possible, even if the result was not always correct, and only later to go through a phase of verification. The project stopped early though, and the sentences that had already been translated never went through verification.

Old validation system

In the old version of Tatoeba, a new contribution was not added directly to the rest of the sentence collection. Instead, it was added to a waiting list. Moderators could see this list, validate the sentences that were correct and refuse those that were not. The aim was to keep out more misspelled sentences or even wrong translations.

But unless I had a bunch of devoted and very qualified moderators (which I didn't), this kind of system was clearly very slow and cumbersome.

New validation system

In the new validation system, there are no moderators anymore. Instead, each sentence will have an owner, and only the owner can modify the sentence. Contributors will be responsible for the sentences they own. If you see a mistake in a sentence that is not yours, you can post a comment about it. Of course, each user will be able to quickly access the comments that were posted about their sentences.

If a user doesn't feel (s)he can take the responsibility, (s)he will be able to give up ownership of a sentence. These "orphan" sentences can be adopted by other users. Right now I can tell you that most of the sentences are orphans, and the goal is to find them a parent.

On top of that, it will be possible for every user to follow other users' contributions in Tatoeba. If some people are not doing a good job and are blocking many, many sentences that have mistakes by adopting them and not correcting them, it won't be difficult to withdraw their ownership.
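
For the more technically minded, the ownership rules described above can be sketched in a few lines of Python. This is only an illustration and not Tatoeba's actual code: the `Sentence` class, its method names and the use of plain strings for usernames are all assumptions made for the example, and so is the idea that withdrawing ownership is a separate action carried out by someone other than the owner.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class Sentence:
    """Hypothetical model of a sentence under the ownership rules."""
    text: str
    lang: str
    owner: Optional[str] = None  # None means the sentence is an orphan
    comments: List[Tuple[str, str]] = field(default_factory=list)

    def is_orphan(self) -> bool:
        return self.owner is None

    def edit(self, user: str, new_text: str) -> None:
        # Only the owner can modify the sentence.
        if self.owner is None or self.owner != user:
            raise PermissionError(f"{user} does not own this sentence")
        self.text = new_text

    def comment(self, user: str, message: str) -> None:
        # Anyone can point out a mistake, whether they own the sentence or not.
        self.comments.append((user, message))

    def release(self, user: str) -> None:
        # The owner gives up ownership; the sentence becomes an orphan.
        if self.owner is None or self.owner != user:
            raise PermissionError(f"{user} does not own this sentence")
        self.owner = None

    def adopt(self, user: str) -> None:
        # Only orphan sentences can be adopted.
        if self.owner is not None:
            raise PermissionError("this sentence already has an owner")
        self.owner = user

    def withdraw_ownership(self) -> None:
        # Assumed administrative action for owners who block sentences
        # by adopting them and never correcting them.
        self.owner = None


if __name__ == "__main__":
    sentence = Sentence(text="I has a cat.", lang="eng")  # starts as an orphan
    sentence.adopt("alice")
    sentence.comment("bob", "It should be 'have', not 'has'.")
    sentence.edit("alice", "I have a cat.")
    sentence.release("alice")  # back to being an orphan
```

In this sketch a user who is not the owner gets a `PermissionError` when trying to edit, so the only thing left to do is post a comment, which is the behavior described above.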