Tatoeba is a project that aims to collect lots of sentences translated into several languages. In this blog you will find, among other things, news and documentation about it.

Friday, December 10, 2010

Tatoeba update (Dec 10th, 2010)

What's new

Sentences stats. There's now a specific page for the sentences stats, to make them a bit more readable. The total number of sentences is also now indicated (it's quite an important number, but for some reason we never displayed it anywhere).

Wall messages of a user. You can browse the messages that were posted by a specific user from the user's profile. Click on "See this user's contribution" and scroll to the bottom of the page. You will see the latest messages posted by the user, and a link to view them all (if the user has posted any messages).

11 comments:

I would very much like you to solve the problem that some translated sentences are not visible (if they are linked by more than one sentence in between), so people translate them again and again, and we get more and more duplicates.

I made two suggestions - one being that you show the number of sentences linked to a given sentence. See http://tatoeba.org/epo/wall/show_message/4427#message_4427

The other suggestion was to identify the chain and the language: http://tatoeba.org/epo/wall/show_message/4433#message_4433

But like sysko said in his reply, these are not things we can easily do in the current system :( I mean, it's doable, but not without taking a lot of resources. This is why these features are not something you will see before we release the next version, where we will switch to a new kind of database that can handle (in real time) the kind of query you are asking for without making our server crash ^^

But what I can do for now (until the new version is out) is try to make it easier to link sentences. If more sentences are linked, there will be fewer "hidden translations".

Well, even if it can be done at any moment, it wouldn't be easy to implement and we would have to make changes in the database to store this new data. Considering that we have a new system in progress, it wouldn't be worth having sysko or myself spending time on this and it wouldn't be worth changing the database for this purpose =/

However, I've thought of something else.

We provide download files that are updated weekly: http://tatoeba.org/download_tatoeba_example_sentences

Someone who has programming skills can download the links.csv file and calculate the number of 'hidden translations' for each sentence. They would only keep the sentences that have at least one hidden translation. They could publish somewhere the results with the following format:

It would give us information on how many sentences have hidden translations. Maybe it's a lot, or maybe not that many. Right now I personally have no idea how many there could be. In any case, once we have this information, we can regularly work on linking hidden translations to reduce the chances of people adding translations that already exist.
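The calculation described above could be sketched roughly like this. It is only an illustration under a few assumptions: that links.csv is tab-separated with one "sentence id, translation id" pair per line, and that a translation counts as "hidden" when it is more than two links away (direct translations and translations of translations being the ones that are visible). The function names are made up for this example.

```python
import csv
from collections import defaultdict, deque


def read_links(path):
    """Yield (sentence_id, translation_id) pairs from a tab-separated file.

    Assumption: two integer columns per line, as in the links.csv download.
    """
    with open(path, newline="", encoding="utf-8") as handle:
        for row in csv.reader(handle, delimiter="\t"):
            if len(row) >= 2:
                yield int(row[0]), int(row[1])


def build_graph(pairs):
    """Build an undirected adjacency map from (sentence, translation) pairs."""
    graph = defaultdict(set)
    for a, b in pairs:
        if a != b:
            graph[a].add(b)
            graph[b].add(a)
    return graph


def hidden_counts(graph, visible_distance=2):
    """Map each sentence to its number of hidden translations.

    A translation is counted as hidden when it is reachable but more than
    `visible_distance` links away. Sentences with none are left out.
    """
    result = {}
    for start in graph:
        # Breadth-first search gives the shortest link distance to each sentence.
        dist = {start: 0}
        queue = deque([start])
        while queue:
            node = queue.popleft()
            for neighbour in graph[node]:
                if neighbour not in dist:
                    dist[neighbour] = dist[node] + 1
                    queue.append(neighbour)
        hidden = sum(1 for d in dist.values() if d > visible_distance)
        if hidden:
            result[start] = hidden
    return result


# Toy chain 1 - 2 - 3 - 4: sentences 1 and 4 are three links apart.
example = build_graph([(1, 2), (2, 3), (3, 4)])
print(hidden_counts(example))  # {1: 1, 4: 1}
```

With the real file, something like hidden_counts(build_graph(read_links("links.csv"))) would be the starting point. Running one search per sentence is slow on large graphs, so this is meant to show the idea rather than to be fast.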

Three days ago sysko wrote that the new release will show more or less all translations - so it seems the problem will be solved some day.

I tried to find out how many sentences there are with hidden translations. So I went to http://tatoeba.org/epo/sentences/show_all_in/eng/none/none/indifferent/page:10000 and tried them. Out of the ten, there are hidden sentences (which cannot be viewed from the English version) for:
http://tatoeba.org/epo/sentences/show/249696
http://tatoeba.org/epo/sentences/show/249692

Hidden sentences for the:
- German version (and others) of http://tatoeba.org/epo/sentences/show/249690
- Polish versions of http://tatoeba.org/epo/sentences/show/249689
- Persian version of http://tatoeba.org/epo/sentences/show/249684

This may mean that half of the graphs with English sentences have hidden sentences. The ones without were more or less those with only two or three translations.

But we have to consider that there are about 160,000 sentences in English and only 60,000 or fewer in each of the other languages. So I tried the same with French, with http://tatoeba.org/epo/sentences/show_all_in/fra/none/none/indifferent/page:1000

Hidden sentences are in the graphs of:
http://tatoeba.org/epo/sentences/show/542849
http://tatoeba.org/epo/sentences/show/542839
http://tatoeba.org/epo/sentences/show/542836
http://tatoeba.org/epo/sentences/show/542835
http://tatoeba.org/epo/sentences/show/542833
http://tatoeba.org/epo/sentences/show/542830
http://tatoeba.org/epo/sentences/show/542828
http://tatoeba.org/epo/sentences/show/542827
http://tatoeba.org/epo/sentences/show/542825

No: http://tatoeba.org/epo/sentences/show/542832

Which means that nine out of ten graphs with a French sentence have hidden translations which cannot be seen in some languages...

Yes, the problem will definitely be solved someday :) We're coding the new version largely to solve that specific problem ^^

But I think one other reason why there is this problem (or why it's becoming "large") is that the link feature has not scaled with the growth we've had. I mean, only trusted users and moderators can link, and even then, trusted users cannot easily link ANY sentence. So this is clearly not enough. We need to make this feature less restricted and more usable, and that's something we can start improving in the current version.

If you browse Esperanto sentences that are not directly translated into English or into Japanese, you will notice that many have an indirect translation.
=> http://tatoeba.org/sentences/show_all_in/epo/eng/eng/indifferent
=> http://tatoeba.org/eng/sentences/show_all_in/epo/jpn/jpn/indifferent
If more people were working on linking them, it would make hidden translations visible...

I agree about making linking easier. It will be good to have an easier linking procedure and to encourage people to link.

You may try to convince people to add more links. But to solve the problem of hidden sentences for Esperanto you need at least 47,000 links between Esperanto and English alone. (This is the number of Esperanto sentences not linked directly to English but via one translation, from http://tatoeba.org/epo/sentences/show_all_in/epo/eng/eng/indifferent/page:4708 .) This will take a lot of time, and this is only one language pair. So programming will probably be a quicker way to solve the hidden translations problem.
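For anyone wondering how a page number turns into that figure, here is a small sketch of the arithmetic. It assumes the listing shows ten sentences per page (inferred from the ten-sentence samples earlier in the thread, not confirmed) and that page 4708 is the last page; the function name is made up for this example.

```python
SENTENCES_PER_PAGE = 10  # assumption: ten sentences per listing page


def sentence_range(last_page, per_page=SENTENCES_PER_PAGE):
    """Return the (lowest, highest) possible number of listed sentences.

    The last page may be only partly filled, so the total is a small range.
    """
    lowest = (last_page - 1) * per_page + 1
    highest = last_page * per_page
    return lowest, highest


print(sentence_range(4708))  # (47071, 47080)
```

Under those assumptions the count is a little over 47,000, which matches the number quoted above.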