Tatoeba is a project that aims to collect a large number of sentences translated into several languages. In this blog you will find, among other things, news and documentation about it.

Saturday, May 10, 2014

New feature: unapproved sentences

We are soon going to release a new feature, and I would like to take some time to talk about it. First of all, here's what this feature will do:

Corpus maintainers will be able to mark a sentence as "unapproved".

Admins will be able to change the "level" of a contributor. By default, contributors have a level of 0, but admins can set this level to -1 so that any new sentence or translation from that contributor is marked as "unapproved".

Unapproved sentences will still be in the database and will still be indexed whenever we run the indexing, but they will be displayed in red on the website.

Unapproved sentences will however NOT be exported into the CSV file that we distribute.
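To make the rules above concrete, here is a minimal sketch in Python. It is only an illustration and not the actual Tatoeba code: the class names, the field names, the `add_sentence`, `mark_unapproved` and `export_csv` functions, and the tab-separated two-column export layout are all assumptions made up for this example.

```python
import csv
import io
from dataclasses import dataclass


@dataclass
class Contributor:
    name: str
    level: int = 0  # default level; an admin may lower it to -1


@dataclass
class Sentence:
    text: str
    lang: str
    owner: str
    unapproved: bool = False  # hypothetical flag, shown in red on the site


def add_sentence(corpus, contributor, text, lang):
    """Store a new sentence, flagging it when the contributor's level is -1."""
    flagged = contributor.level == -1
    sentence = Sentence(text, lang, contributor.name, unapproved=flagged)
    corpus.append(sentence)
    return sentence


def mark_unapproved(sentence):
    """What a corpus maintainer would do to a single existing sentence."""
    sentence.unapproved = True


def export_csv(corpus):
    """Return the export as text, leaving out unapproved sentences."""
    buffer = io.StringIO()
    # Assumed layout: language code, then sentence text, separated by a tab.
    writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
    for sentence in corpus:
        if not sentence.unapproved:
            writer.writerow([sentence.lang, sentence.text])
    return buffer.getvalue()
```

Note that in this sketch an unapproved sentence stays in `corpus` (the stand-in for the database) and is only filtered out at export time, which mirrors the behaviour described above.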

The goal of this feature is to deal with 2 issues:

Poor-quality sentences. We want Tatoeba to become more useful for language learners. The problem is that, since everyone can contribute sentences and translations, some contributions are not reliable enough for language learning, yet not bad enough that they clearly should be deleted.

Non-CC-BY sentences. It often happens that new contributors copy and paste sentences from other language-learning sources. This is a problem because Tatoeba redistributes the sentences under the CC-BY license, so the content needs to be CC-BY compliant.

Setting those sentences as "unapproved" allows us to warn users that there is an issue with the sentence and that they should use it with extra care. This feature will also allow admins to act more quickly when a contributor is polluting the corpus. Admins can lower the level of a contributor so that all of their subsequent contributions are marked in red. The contributor will see for themselves that their contributions are red as soon as they are saved.

This feature can obviously be refined a lot more. Ideally, we should treat poor-quality sentences differently from non-CC-BY sentences. Ideally, we should set a separate level for each user in each language, instead of a single level per user. Ideally, we should also have approved sentences, possibly with several levels of approved and unapproved. We just don't have the time and resources to implement these things right now, but they are part of the next steps.