Sunday, December 19, 2010

I just received an email asking if we had a page listing projects using Tatoeba and was reminded that we still don't. So here is the beginning of a list. Feel free to send us projects that you know of (or are the author of) that are not listed here!

Friday, December 10, 2010

Sentences stats. There's now a specific page for the sentences stats, to make them a bit more readable. The total number of sentences is also now indicated (it's quite an important number, but for some reason we never displayed it anywhere).

Wall messages of a user. You can browse the messages that were posted by a specific user, from the user profile. Click on "See this user's contribution" and scroll to the bottom of the page. You will see the latest messages posted by the user, and a link to view them all (if the user has posted any messages).

Sunday, November 21, 2010

Alright, it's been a long time since we last updated Tatoeba :) This is just a small update.

What's new

"Members" page. This is probably the main modification. We redesigned the "Members" page a little to look a bit better and to be less slow. We removed the information about the last login, because some people don't like being spied on :P We removed the top 20 ranking because that's what made the page so slow. Instead we're displaying the members who are currently active (those who took part in the last few hundred contributions).

Tags info. If you hover your mouse over a tag, you will see the id of the user who added it, and the date when it was added. This is mostly useful for sentence owners, who may wonder why someone has tagged a sentence a certain way. You can figure out who the user behind a certain id is with the following URL: http://tatoeba.org/users/show/[id].
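As a small illustration of the URL pattern above, a helper like this could turn a user id into the matching profile link. The function name `user_profile_url` is made up for this sketch, and it assumes ids are positive integers; only the URL pattern itself comes from this post.

```python
def user_profile_url(user_id: int) -> str:
    """Build the profile URL for a user id, following the pattern quoted above."""
    # Assumption: user ids are positive integers (the post does not say so explicitly).
    if user_id <= 0:
        raise ValueError("user id must be a positive integer")
    return f"http://tatoeba.org/users/show/{user_id}"


print(user_profile_url(42))
```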

Set language to "unknown". We get requests for new languages quite frequently and we ask people to add a few sentences in the language they request. Except that the language is sometimes misdetected and there was no way to set the language to "unknown" (to indicate that it's a language that is not in the list). Now it's possible. There is an option called "other language", which sets the language icon to "unknown".

Sentence owner's name in comments. It was requested a long time ago, and it's finally here. The name of the sentence owner is now indicated in the comments, next to the sentence itself. This way, when you look at a comment on the homepage, you will not only know what sentence it is associated with, but also the user who added that sentence.

What next

We'll be working on a page that lists all sentences that were tagged @change and @delete more than 2 weeks ago. This way moderators will have a simple way to know what sentences they can/should take care of.

Sunday, November 14, 2010

Yesterday was our first Tatoeba day, so today I'm publishing stats about what was achieved that day, as well as more general stats.

Stats by language

The chart below shows the number of sentences added on Nov 13th for each language.

The gold medal goes to Arabic! Silver goes to Esperanto and bronze goes to German :)

Arabic (573)

Esperanto (354)

German (247)

Egyptian Arabic (230)

Spanish (207)

Italian (183)

Chinese Mandarin (162)

Hebrew (125)

French (113)

Ukrainian (105)

Danish (100)

Hungarian (78)

Cantonese (78)

English (73)

Russian (70)

Polish (45)

Dutch (36)

Old East Slavic (33)

Lithuanian (18)

Persian (17)

Unknown language (10)

Portuguese (8)

Finnish (7)

Latvian (4)

Vietnamese (4)

Czech (3)

Swedish (3)

Norwegian Bokmål (2)

Shanghainese (2)

Breton (1)

Bulgarian (1)

Catalan (1)

Estonian (1)

Japanese (1)

Quechua (1)

Slovak (1)

Turkish (1)

Uzbek (1)

Sadly, the record set on August 18th of 3465 sentences added was not broken. We only made it to 2899. It's still not bad though, since it's the 2nd most important day, in terms of sentences added (and by "sentences added" I mean "new sentences + translations").
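For anyone who wants to double-check the numbers, here is a quick sketch that adds up the per-language counts from the list above and compares the total with the August 18th record. The variable names are made up for this example; the figures are copied from this post.

```python
# Sentences added on Nov 13th, per language, copied from the list above.
added_by_language = {
    "Arabic": 573, "Esperanto": 354, "German": 247, "Egyptian Arabic": 230,
    "Spanish": 207, "Italian": 183, "Chinese Mandarin": 162, "Hebrew": 125,
    "French": 113, "Ukrainian": 105, "Danish": 100, "Hungarian": 78,
    "Cantonese": 78, "English": 73, "Russian": 70, "Polish": 45, "Dutch": 36,
    "Old East Slavic": 33, "Lithuanian": 18, "Persian": 17,
    "Unknown language": 10, "Portuguese": 8, "Finnish": 7, "Latvian": 4,
    "Vietnamese": 4, "Czech": 3, "Swedish": 3, "Norwegian Bokmål": 2,
    "Shanghainese": 2, "Breton": 1, "Bulgarian": 1, "Catalan": 1,
    "Estonian": 1, "Japanese": 1, "Quechua": 1, "Slovak": 1, "Turkish": 1,
    "Uzbek": 1,
}

RECORD_AUGUST_18 = 3465  # record mentioned in this post

total = sum(added_by_language.values())
shortfall = RECORD_AUGUST_18 - total

# Languages sorted by number of sentences added, highest first.
top_three = sorted(added_by_language.items(), key=lambda kv: kv[1], reverse=True)[:3]

print(total)      # 2899
print(shortfall)  # 566
print([name for name, _ in top_three])  # ['Arabic', 'Esperanto', 'German']
```

So the per-language counts do add up to the 2899 mentioned above, which leaves us 566 sentences short of the record.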

We were missing a few of our devoted members that day, so I guess it's normal. Let's hope more people will be available for the next Tatoeba day :)

Stats by users

The chart below shows the number of sentences added (in green) and the number of sentences modified (in yellow) on Nov 13th, for the top 20 users. You'll excuse my laziness but I only used the number of sentences added for the rank.

Saeb wins the day, by far, with 802 sentences added! Congrats :D Second place goes to nickyeow, and third place goes to Eldad.

At any rate, everyone deserves a big thank you for their contributions! THANK YOU :)

saeb (802/20)

nickyeow (214/20)

Eldad (166/17)

aandrusiak (140/7)

MUIRIEL (138/41)

Guybrush88 (135/2)

danepo (100/12)

GrizaLeono (94/21)

Shishir (94/12)

Dejo (56/11)

Archibald (54/32)

darinmex (53/5)

rado (52/2)

Leono (51/10)

esocom (51/4)

Esperantostern (48/5)

Muelisto (43/1)

kroko (42/4)

Dorenda (41/0)

qdii (40/11)

zipangu (37/2)

wondersz1 (33/4)

Manfredo (27/1)

samueldora (24/2)

sysko (23/7)

szaby78 (22/5)

Zifre (22/7)

cost (21/2)

sencay (20/2)

shanghainese (19/0)

fanty (18/0)

pliiganto (16/13)

BraveSentry (15/1)

pjer (14/5)

U2FS (14/3)

debian2007 (13/1)

Gyuri (12/3)

jxan (12/0)

virgil (12/4)

TRANG (11/32)

slavneui (11/0)

sarah (11/0)

kebukebu (10/2)

Wimmer (10/1)

ae5s (10/0)

Tonari (9/0)

arashi_29 (9/5)

Aleksej (7/0)

CK (5/14)

Shoyren (4/1)

Holyspirit (3/0)

JimBreen (2/0)

luwenzhuo (2/0)

CLARET (2/1)

lajauge (1/0)

ozma29 (1/0)

sschlumberger (1/0)

mr5 (1/0)

Tenshi (1/0)

Language ranks

Tatoeba day is a good occasion to see how each language has progressed. You can see how each language with more than 1000 sentences was positioned one month ago, in this previous post. Let's see how it is now...

Top 5

The top 5 hasn't changed.

English - 158,000+. It looks like English has been growing a little bit.

Japanese - 153,000+. Japanese is standing still. You can tell we don't have a very strong Japanese community.

French - 53,000+. French keeps moving at a steady pace.

Esperanto - 47,000+. Esperanto is catching up with French quickly...

German - 32,000+. German is progressing better than French, but still not quite as well as Esperanto.

Other languages with 10,000+ sentences

Polish - 20,000+

Spanish - almost 19,000. Spanish gained one rank! :D

Russian - almost 18,000

Chinese Mandarin - almost 15,000

Ukrainian - 14,000+

Other languages with 1,000+ sentences

Italian - 8,500+

Arabic - 6,500+. Great boost for Arabic!

Dutch - almost 6,500

Portuguese - 6,000+

Hebrew - 4,500+. Great boost for Hebrew as well!

Icelandic - 4,000+

Hindi - almost 3,500

Hungarian - 3,000+. Hungarian joined the 1,000+ sentences club! Very good progress.

Turkish - 2,500+

Shanghainese - 2,500+

Uyghur - almost 2,500

Danish - 2,000+. Danish is new to the club with very good progress as well!

Vietnamese - 2,000+

Belarusian - almost 2,000

Norwegian Bokmål - 1,500+

Cantonese - 1,500+

Other numbers

55,735 sentences added in October.

About 25,000 sentences added since the beginning of November.

We've reached 600,000 sentences in total today!

But there are probably thousands of duplicates, so it's not really 600,000 yet...

We will soon have 76 languages. 5 are waiting to be added: Galician, Irish, Interlingua, Lojban, Toki Pona. Note that the last 3 languages are constructed languages.

Next Tatoeba day

A potential date for the Tatoeba day would be December 11th. Although it could be December 18th as well. We'll see what suits best for everyone.

The main objective of the first Tatoeba day was to break the record of the highest number of sentences added in one day. We didn't break it, but it's okay because we still had fun :D

The main objective will be different for the second Tatoeba day. We haven't decided what it will be yet, but I think it would be nice to emphasize adoption next time. Because unfortunately I didn't really have time to look at adoptions for this first Tatoeba day :(

Anyway, we'll keep you informed. Thanks again to everyone who participated and who came to our IRC channel :)

Sunday, November 7, 2010

We introduced the "tags" feature several months ago and we've let trusted users experiment with it pretty much freely. A profusion of tags has been created, but they are quite a mess, so we decided to try tidying up.

From now on, if you are going to tag a sentence, please take into consideration the following things.

1. Use tags for objective and official information

We would like to keep the tags for "objective" and "official" information. If you want to categorize sentences for personal purpose, you should use lists.

For instance, you cannot tag a sentence "French exam" to mark the sentence as part of those you will use to practice before your French exam, you should create a list for that. We know lists are not as practical as tags, but we'll be improving the lists feature as soon as we have time.

2. Avoid creating new tags

Avoid creating new tags because it can make the cleaning process harder. If the tag you want to add doesn't appear in the autocompletion list, then it's a new tag, so don't add it unless you are really convinced it's a valid tag.

3. Ask before you create a new tag

We don't have clear rules yet for what is a valid tag and what is not, but one of our moderators (Swift) volunteered to take care of the tags. If you feel the need to create a new tag, it would be wise to ask Swift first. He will be officially in charge of tidying up the tags. He will be the one deciding what tag to keep or not and what tag to rename. Also, don't hesitate to contact him if you would like to help out. It's not easy to decide on these things.

4. Use English for tags, unless you really can't

We have decided to use English as the default language for tags. We will rename all non-English tags into their English equivalent, when it is possible. We can still accept non-English tags, but only if there is no English equivalent.

The point of having one common language is uniformity. It would be inefficient to have a bunch of sentences tagged "proverb" (English) and another bunch tagged "proverbe" (French). There is also no point having a sentence tagged with both "proverb" and "proverbe". They are the same notion. It can even make things confusing to have several tags designating the same notion, which is why we have decided to have one default language. We will later implement the possibility to translate the tags and to display them in languages other than English.

5. How things are going to work

We'll try to keep the process as transparent as possible.

Swift will publish on the Wall the modifications that will be applied to the tags (i.e. renaming and deletions).

There will be a few days until these modifications are actually applied, in case people strongly disagree with a decision.

Swift will also add on his profile and his personal web page the links to every Wall post mentioning the modifications, for people to be able to trace back all the decisions about the tags.

Tatoeba has seen its community grow quite significantly in the past 6 months, and it's really encouraging. There was a suggestion about having a "Tatoeba day", a day where (passionate) members would try to contribute more passionately than ever. It's a very good idea so we'll be organizing one every month (we'll try to).

When?

The first one will happen on Saturday November 13th, from 0:00 to 23:59 (France time).

Where?

Well, this is a virtual event, so it happens on the internet... BUT if you want to live this event at its fullest, come to our IRC channel on Nov 13th: #tatoeba, on freenode. Don't be shy! And even if you are shy, you can just drop by to read what's going on.

What?

For the first Tatoeba day, we will start with something very basic. The goal of the day will be to translate, correct and adopt a lot of sentences. Not that it's different from what's already happening every day, but I will publish detailed stats the following day, to give an idea of what has been achieved during those 24 hours.

How many sentences added for each language and each user

How many corrections made for each language and each user

How many sentences adopted for each language and each user

Why?

This event is of course an occasion to be more productive than we usually are, but it's mostly an occasion for members to feel more connected with each other and to have fun! You may also learn a few things about Tatoeba that you didn't know :)

Thursday, October 14, 2010

I normally tweet whenever a language reaches an important milestone but I was a bit absent from Tatoeba the past 4 weeks and I didn't really keep track of the progress of each language. So I'm going to sum up everything in this blog post and, while I'm at it, give more general stats about Tatoeba.

New languages

We've added several new languages since September. Tatoeba is now supporting a total of 71 languages. The new languages are:

Bosnian

Croatian

Old East Slavic

Chamorro

Tagalog

Quechua

Mongolian

Lithuanian

Sentences stats

Top 5

English - 156,000+ sentences. English took first place back in September and things still haven't changed.

Japanese - 153,000+ sentences.

French - 50,000+ sentences. Around 10,000 sentences were added within 2 months. There's progress :) It had taken 3 months to go from 30,000 to 40,000.

Sunday, September 26, 2010

I decided to write more specific guidelines about how to react to bad behavior because I'm so fricken tired of seeing people attacking each other in public.

The community is growing and becoming more diverse. Diversity means divergence of opinions, which means more intense debates. I can accept divergence of opinions, it's normal, it's even necessary. But I cannot accept people flaming each other in public. I don't expect members to act all lovey-dovey with each other, but I do expect members to make an effort to be respectful with each other, NO MATTER WHAT.

If you think a user is being disrespectful

Send him a private message with the title "Warning: you are being disrespectful". I insist very much on PRIVATE MESSAGE. Everyone can send this warning, not just moderators.

Add in your private message the link to the comment where the user was disrespectful. I insist again: PRIVATE MESSAGE.

Quote the part of the comment that you felt was disrespectful.

Try to explain why you felt it was disrespectful.

Add a link to this blog post.

Just in case it was not clear, I will repeat the main idea: if you think a user is being disrespectful, send him a private message and ONLY a private message.

If you received warnings

It's possible that it was a misunderstanding from the sender, in which case you can simply explain to him what you really meant. But if one person misunderstood, it's possible that other people will misunderstand you as well, so you should consider clarifying your comment for everyone.

It's possible that you are really being disrespectful, in which case you should consider deleting your comment or apologizing for being disrespectful (or both).

Insulting someone is disrespectful, obviously. I don't think I need to explain that one.

Being condescending is disrespectful. You should treat everyone's opinion equally. It shouldn't matter whether you're debating with a 6-year-old kid or a non-native speaker. You are NOT entitled to trash someone's opinions just because you think you know better. If you know better, then educate people, don't trash them.

Lecturing someone publicly is disrespectful. You can tell someone how they should behave in PRIVATE, but not in public, never EVER. Even something small like "Dude, calm down" => PRIVATE MESSAGE.

Generally speaking, writing negative comments about someone is disrespectful. If you don't like something about someone, you let them know in PRIVATE and ONLY IN PRIVATE.

Just to be clear, I may myself show lack of respect in moments of weakness. Everyone may. You come back tired from a long day of work, someone offends you publicly, you can't resist the temptation to reply back publicly as well. It happens to everyone. But it is NOT acceptable, there is NO EXCUSE for that.

What happens to people who misbehave?

My thoughts here about bad behavior are still true today. People who misbehave will not be banned, suspended or anything. They will simply receive a lot of warnings and hopefully those warnings can slap some sense into them. I count on EVERYONE to send warnings to users who are crossing the line. It's not only my job, it's not only moderators' job, it's not only trusted users' job, it's EVERYONE'S JOB to make sure Tatoeba remains a place that people ENJOY going back to.

If your inbox starts being filled with warning messages, you really need to work on your behavior. I must remind you that this is a collaborative project, and collaborative means we are working WITH each other, NOT AGAINST. If you care about this project, then please, show more maturity. If you can't do that, then for Tatoeba's sake, take a break and come back when you grow up. Thank you.

We add new languages regularly, but this week, we're adding quite a special language: CycL. This was requested by our member witbrock. I'm very curious to see where this is going to lead...

What next?

API. More and more people have been asking us if we provide an API. We currently don't, but we definitely want to provide an API someday. I can't say when yet, I don't want to make promises, but I'll be posting about progress as it happens.

Copyright. More copyright issues have been raised lately. So I'll be writing a post about it, to try to explain clearly the issues we are facing related to copyright and what you can do to help.

Tuesday, August 3, 2010

This article explains what kind of content we accept in Tatoeba, what kind of content we delete and what kind of content we review. Note that this article is not final. You have the right to object to something or to ask for more clarifications.

What do we accept?

Tatoeba is about collecting sentences so we only want sentences. However, what exactly do we mean by "sentences"? What is a sentence and what is not? It's actually a difficult question... No one will doubt that "I am happy" is a sentence. But what about "On the left", is that a sentence? What about "Thank you", "Yes", or "Awesome"?

As far as I'm concerned, I think Tatoeba can handle a loose definition of "sentence". We don't strictly need to have an entity with at least a verb. To me, when spoken, everything is a sentence. When written, the main difference between a sentence and a non-sentence is punctuation. That's all. For the rest, as long as people can imagine a context where the "sentence" can be expressed, then it's a sentence.

So yes, I'm roughly saying that you can take all the words in the dictionary, add punctuation and perhaps a capital letter, you'd turn it into a sentence. I don't encourage it because it's not useful (dictionaries do that already), but one-word sentences are still tolerated. I'll trust people's common sense for adding only one-word sentences that are significant (for instance, "Hello" is, "House" isn't).

In case you run across sentences that are not strictly speaking sentences, then tag them as "non-sentence", so that there is a way to quickly identify them. Inform the owner about this article if he's a new member, and let him know it's better to have sentences with more context.

At any rate, don't bother starting endless discussions if the sentence has already been translated because it will be kept as is. Feel free however to add a new sentence based on the "non-sentence".

Generally speaking, Tatoeba is open to many kinds of sentences. We tolerate casual speech, slang, insults (as long as they are not targeting anyone in particular), erotic sentences, sentences that are not "true" (after all, Tatoeba is not an encyclopedia). These sentences can be tagged accordingly to inform users. But I'll ask people to focus primarily on appropriate and politically correct sentences. We don't have (yet) a good system to filter out sentences that are not very "safe", so don't flood us with those, please.

What do we delete?

What we delete for sure are:

Entries that people add by mistake due to our failure to provide a more efficient interface.

Sentences that owners themselves requested to delete (because the delete feature is still not available to everyone).

Entries that are copyrighted or under a license that is not compatible with CC-BY.

Racist comments and personal attacks, if they are really harmful and there is a general agreement that they should be removed.

Entries that really make no sense and whose owner won't provide any explanation.

In the perspective of providing better content, I'm also allowing the deletion of "sentences" that are "not really sentences" and came from the Tanaka Corpus, but only under these conditions:

The vocabulary is already illustrated in other sentences.

There is only the Japanese-English pair, no translation into any other language. We can make an exception for French (i.e. it's still deletable if there is a French translation).

All the sentences that will be deleted do NOT belong to anyone.

It may be obvious, but you should avoid translating a sentence that is likely to be deleted... Unless you want to stand against its deletion.

What do we review?


However, the limit between a "correct" and "incorrect" sentence is not always clear and some sentences can generate a lot of debate. In such cases, the final decision belongs to the owner of the sentence.

Remember that Tatoeba allows several translations in the same language, so there is no point fighting endlessly on what is correct or not. Simply add another version of the sentence if you are not happy with the existing one. We don't mind at all having near duplicate sentences (cf. this discussion on the Wall, and more precisely my thoughts on the issue here).

We also don't want any kind of annotations in the sentences. You can find more details in the contributor's guide, rule #9. If you have a good reason to keep your annotations, then please explain it in your comments. Otherwise moderators have the right to edit your sentence two weeks after you have been requested to change your sentence.

What do we link?

Tatoeba's sentences are represented as a graph. Two sentences that are linked together have the same meaning. Linking two sentences in the same language is accepted, but you shouldn't link only based on meaning. The sentences that you link should also have an equivalent "style" and type of speech. Cf. my wall post here.

Saturday, July 17, 2010

First of all, I'd like to mention that we've had a lot of traffic lately. Allan published an article on linuxfr.org about Tatoeba, and it sure brought a lot of new people :D

Google Analytics says 1,172 unique visitors on July 17th, while we usually have around 400-450. We're glad to see the server is still doing well despite the quite significant increase of activity!

What's new

We can now import sentences. Since July 4th actually, but I didn't have much time to write about it. The feature is currently only available for moderators, because we cannot safely let everyone import huge amounts of data. So the way it works is that you send us your sentences in a simple text file, by email (team@tatoeba.fr), and we import it.

We accept two formats:

Single sentences: each line has one sentence. All the sentences have to be in the same language.

Sentences + translations: each line has a sentence and its translation, separated by a tab (sentence [tab] translation). All the sentences have to be in the same language, and all the translations in the same language. For instance only French-Spanish, and not French-Spanish in one line, and Swedish-Spanish the next line.
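To make the two formats more concrete, here is a rough sketch of how such a file could be read. This is not our actual import code: the function name `parse_import_lines` is made up, and skipping blank lines and rejecting files that mix both formats are assumptions of this example. It also doesn't check languages, which still has to be done by hand.

```python
from typing import List, Optional, Tuple


def parse_import_lines(lines: List[str]) -> List[Tuple[str, Optional[str]]]:
    """Parse lines in one of the two formats described above.

    Returns (sentence, translation) tuples. For the "single sentences"
    format the translation is None.
    """
    entries: List[Tuple[str, Optional[str]]] = []
    for raw in lines:
        line = raw.rstrip("\n")
        if not line.strip():
            continue  # assumption: blank lines are simply ignored
        parts = line.split("\t")
        if len(parts) == 1:
            entries.append((parts[0].strip(), None))
        elif len(parts) == 2:
            entries.append((parts[0].strip(), parts[1].strip()))
        else:
            raise ValueError(f"expected at most one tab per line: {line!r}")

    # Assumption: a file uses one format throughout, never a mix of both.
    formats = {translation is None for _, translation in entries}
    if len(formats) > 1:
        raise ValueError("file mixes single sentences and sentence pairs")
    return entries


singles = parse_import_lines(["Bonjour.\n", "Merci beaucoup.\n"])
pairs = parse_import_lines(["Bonjour.\tHola.\n", "Merci.\tGracias.\n"])
print(singles)
print(pairs)
```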

IMPORTANT: We release our data under the Creative Commons Attribution (CC-BY) license. We will not be importing your content if it brings up copyright issues or license incompatibilities. I mean, for instance don't send us sentences stripped from textbooks, or sentences that are under the CC-BY-SA license (it's not compatible with CC-BY).

So far we imported:

~700 pairs of sentences in Chinese-Shanghainese. In total we have ~900 pairs of sentences thanks to shanghaining.com. The first 200 ones were added by hand.

200+ proverbs in Dutch.

250+ proverbs in Ukrainian.

That's the major thing for the last couple of weeks.

What next?

We still have to import 2500+ pairs of English-Spanish sentences, provided by one of our registered users, Łukasz. And probably thousands and thousands of other sentences, as more and more people discover Tatoeba, and have their own private (or not so private) collections of sentences to share with everyone :)

In terms of features, there will not be much going on in the next couple of weeks. Actually it will depend on the rest of the team, but as far as I'm concerned, I will have other priorities.

There are still a lot of things that can be improved about the current features, and we will keep improving them, but in August we will also start discussing the next new stuff. I will write more about it when we get there.

Right now I'd just like to say thank you to everyone who gave this project a little bit - or a lot - of their time, of their knowledge, of their encouragement... Because Tatoeba has become an awesome place for language lovers and learners, and for that, the credit really goes to the community :)

Sunday, June 27, 2010

Page that lists all the tags. NOTE: It's not organized at all, it's really just for the sake of having a page that displays all the existing tags.

Page that lists all the sentences in a specific language, with possibility to show only those that are NOT translated yet into a certain language. For instance Japanese sentences not yet translated into English. Useful feature for contributors =)

Possibility to filter by language, on the page that lists sentences with a certain tag.
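For the curious, the "not yet translated" filter mentioned above can be pictured roughly like this. It's only a sketch with a made-up data model (sentences as a dictionary of id to language and text, links as pairs of ids, and "jpn"/"eng" as language codes), not the code that runs on the website.

```python
from typing import Dict, List, Set, Tuple


def untranslated(
    sentences: Dict[int, Tuple[str, str]],  # id -> (language code, text)
    links: Set[Tuple[int, int]],            # pairs of linked sentence ids
    source_lang: str,
    target_lang: str,
) -> List[int]:
    """Return ids of source_lang sentences with no direct translation in target_lang."""
    # Links are treated as undirected: record each one in both directions.
    neighbours: Dict[int, Set[int]] = {}
    for a, b in links:
        neighbours.setdefault(a, set()).add(b)
        neighbours.setdefault(b, set()).add(a)

    result = []
    for sentence_id, (lang, _text) in sorted(sentences.items()):
        if lang != source_lang:
            continue
        has_target = any(
            n in sentences and sentences[n][0] == target_lang
            for n in neighbours.get(sentence_id, ())
        )
        if not has_target:
            result.append(sentence_id)
    return result


example_sentences = {
    1: ("jpn", "ありがとう。"),
    2: ("eng", "Thank you."),
    3: ("jpn", "おはよう。"),
    4: ("fra", "Bonjour."),
}
example_links = {(1, 2), (3, 4)}

# Sentence 3 is only linked to a French sentence, so it still needs English.
print(untranslated(example_sentences, example_links, "jpn", "eng"))  # [3]
```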

What's next

Possibility to import sentences from CSV file. This feature won't be available to normal users. For a start (and I think for a long time), only moderators will have access to it. So anyone who wants to import sentences from a file will have to make a request. Anyway, the main point is that as soon as we have this feature, we will add massive numbers of new sentences =]

Friday, June 11, 2010

This will provide a way for people to add meta-data to sentences. For instance "proverb", "formal", "informal", "male", "female", etc. Such information can be very useful for language learners because they cannot necessarily guess such things just by reading the sentence.

Tags will be restricted for a short period of time. Only trusted users will be able to add tags, but everyone can see the tags associated to a sentence. When we feel the feature is ready for everyone, we will allow everyone to add tags.

People will be free to tag sentences with whatever they want. We don't really have any strict rules yet because tags are still new, and we want to see how people use them. But I can at least suggest some basic tags:

proverb, archaic, slang

formal, informal

male, female (to indicate whether the sentence is said by a man or a woman)

to delete, to correct, checked (I will talk more about these)

controversial, unsafe (to mark sentences that can cause problems, are not suitable for kids, etc).

easy, intermediate, difficult (to indicate the level of difficulty of a sentence)

So these are only my suggestions. Again, the tag feature is new, so we will necessarily go through a phase of experimentation before we can clearly set any rule. We count on everyone to try and help us figure out what works best. Feel free to discuss issues related to tags on the Wall.

A few more things you need to know about tags:

You can see the list of sentences associated to a certain tag by clicking on the tag.

You can remove a tag from a sentence only if you were the one who added it.

Moderators can remove any tag.

It's not possible to add the same tag twice to a sentence. If someone has already added "proverb", you can't re-add "proverb".
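The rules above can be summed up in a few lines of code. This is just an illustrative sketch, not how tags are actually implemented in Tatoeba; the class `TaggedSentence` and its methods are invented for the example.

```python
class TaggedSentence:
    """A sentence with tags, following the rules listed above (hypothetical model)."""

    def __init__(self, text: str) -> None:
        self.text = text
        self.tags = {}  # tag name -> username of the person who added it

    def add_tag(self, tag: str, user: str) -> bool:
        """Add a tag; refuse if the sentence already has that tag."""
        if tag in self.tags:
            return False
        self.tags[tag] = user
        return True

    def remove_tag(self, tag: str, user: str, is_moderator: bool = False) -> bool:
        """Remove a tag if the user added it, or if the user is a moderator."""
        if tag not in self.tags:
            return False
        if is_moderator or self.tags[tag] == user:
            del self.tags[tag]
            return True
        return False


sentence = TaggedSentence("A rolling stone gathers no moss.")
print(sentence.add_tag("proverb", "alice"))     # True
print(sentence.add_tag("proverb", "bob"))       # False: already tagged "proverb"
print(sentence.remove_tag("proverb", "bob"))    # False: bob didn't add it
print(sentence.remove_tag("proverb", "carol", is_moderator=True))  # True
```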

"to delete" tag

Those tags will help moderators in their work. At the moment, in Tatoeba, only moderators can delete sentences. The traditional way of requesting a deletion was to add a comment to it, and point out that it should be deleted (and explain why). But the flow of comments has increased a lot and it's less easy for moderators to keep track.

So if you come upon a sentence that you feel should be deleted, then tag it with "to delete" so that moderators can easily find it and clean Tatoeba of entries that are not valid. Anything that is gibberish is not valid. Anything that is not a complete sentence is not valid. But then again, we haven't decided what exactly is a "sentence" so it's debatable.

"to correct" tag

In Tatoeba, it is not possible to modify a sentence that doesn't "belong" to you. These sentences are typically sentences that you have added yourself. No one (or almost) can touch them besides you. If someone sees a mistake in your sentence, all they can do is post a comment, and you have to correct it.

But certain members contribute sentences with mistakes and never come back. And for now, no one can correct their mistakes... except moderators. So if you want to help moderators, whenever you come across a sentence that needs to be corrected, that has a comment asking for correction, but even after two weeks, it was still not corrected, then you can tag the sentence with "to correct".

"checked" tag

Before I explain further, I must stress that this tag is experimental. Many times people have asked for a way to tell whether a sentence can be trusted or not. Okay, so now we can tag a sentence as "checked" to indicate that it has been proofread and validated as a correct sentence.

Of course, this raises some problems...

What if a user tags a sentence as "checked" just for the fun of it?

What if a user tags a sentence as "checked" but was tired and overlooked a mistake?

Well, we can't guarantee 100% accuracy. A sentence that is tagged "checked" will simply have a higher reliability rate than one that doesn't, but it won't be 100% (no one can guarantee that anyway).

What's next

We will make tags available to everyone.

We will add a page that lists all tags, to enable people to easily browse by tags.

Sunday, May 30, 2010

We simplified the registration process. If it doesn't bring too much spam, we'll leave it like that.

We started reviewing the texts in Tatoeba. There's still a lot of editorial work to do though.

We added support for right to left languages (like Arabic). They are now actually displayed right to left.

What's next

Import lists from CSV file (I have already mentioned this many times).

We will try as well to implement tags for sentences.

But you have to know that we are currently investing more time in promoting the project. That means less time on implementing new features.

The reason is that we registered with Drumbeat last weekend. Drumbeat is a platform launched by Mozilla earlier this year. It was made for people to promote projects that can make the Web better and keep it open. The best projects can even get seed funding, and we kind of hope we will :)

Monday, May 24, 2010

This is a little guide/FAQ to explain what the role of a moderator is in Tatoeba, and to make sure moderators use their powers wisely.

Why do we need moderators?

Every community needs its moderators, but in Tatoeba more specifically, the problem is that unless you are the admin, you (currently) cannot:

delete sentences, not even your own sentences

edit sentences that do not belong to you

So with the growing community, more and more sentences are getting in the "delete me" and "correct me" queues (due to members who never come back to correct their sentences).

Moderators are here to help take care of these sentences that no one else can take care of.

What can moderators do?

Moderators can currently delete, edit, and link/unlink any sentence. Yes, this is a lot of power, but since contributions are logged and can be seen by everyone, we don't need to worry too much about a moderator going nuts and ruining others' work.

Keep in mind that the moderator's rights are not "stable" yet. We will balance out the permissions over time. For now, we don't really have time, so we'll trust moderators to do the right thing.

When should moderators edit or delete?

Only use your moderator rights as the last resort.

This is especially true when dealing with others' sentences. Some people will gladly let you edit or delete their sentences without being notified about it (notifications may even annoy them). But other people may feel that you are abusing your powers, not respecting their work, not acknowledging their presence in the project, or whatever.

To avoid any kind of conflict, only edit sentences where the latest correction request says "two weeks ago" (or more) and no correction has been made. Only delete a sentence after asking the owner if they're okay with their contribution being deleted.

Basically, give people the time to do their work first, and only if they don't do anything, you can step in.
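The two-week rule above can be written down as a small check. This is only a sketch: the function name, its parameters and the 14-day constant are assumptions made for illustration, not Tatoeba's actual code.

```python
from datetime import date, timedelta

# Assumed waiting period, taken from the "two weeks ago (or more)" guideline.
WAITING_PERIOD = timedelta(days=14)


def moderator_may_edit(last_request: date, corrected_since: bool, today: date) -> bool:
    """Return True when a moderator can reasonably step in and edit.

    last_request    -- date of the latest correction request on the sentence
    corrected_since -- whether the owner has corrected the sentence since then
    today           -- the current date (passed in to keep the function testable)
    """
    if corrected_since:
        # The owner did their work; nothing is left for a moderator to do.
        return False
    return today - last_request >= WAITING_PERIOD
```

Deleting is a different matter: the guideline there is to ask the owner first, which no date check can replace.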

How do you become a moderator?

You can either ask Trang or wait for her to notice that you are a good candidate to be a moderator. The one criterion is that you are already at least a "trusted user". The rest is subjective.

Sunday, May 16, 2010

Contributors can now edit and translate sentences directly from a list (as well as adopt, favorite and add to another list). This will, I believe, greatly improve the contribution process.

You can display translations in all the languages, and not just one language (I'm still talking about the lists).

You can download a list into a file. I feel this is going to be extremely useful. We made it so that users can import the file to use in Anki.

You can specify the target language when searching sentences. That is to say, you can not only search "from", but also "to" a specific language.

Indirect translations are taken into account in the search. You may not realize this, but it's an incredibly, incredibly powerful feature.
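To give an idea of why indirect translations matter in the search, here is a minimal sketch. It assumes that an indirect translation is a translation of a translation, and that links are stored as a dictionary from a sentence id to the set of ids it is directly linked to. The function names and the data layout are hypothetical, not how Tatoeba stores things.

```python
from typing import Dict, Set


def direct_translations(links: Dict[int, Set[int]], sentence_id: int) -> Set[int]:
    """Sentences directly linked to the given sentence."""
    return set(links.get(sentence_id, set()))


def indirect_translations(links: Dict[int, Set[int]], sentence_id: int) -> Set[int]:
    """Translations of translations, excluding direct ones and the sentence itself."""
    direct = direct_translations(links, sentence_id)
    indirect: Set[int] = set()
    for translation_id in direct:
        indirect |= links.get(translation_id, set())
    return indirect - direct - {sentence_id}


def translations_into(links: Dict[int, Set[int]], languages: Dict[int, str],
                      sentence_id: int, target_lang: str) -> Set[int]:
    """Direct and indirect translations of a sentence in the target language."""
    candidates = direct_translations(links, sentence_id) | indirect_translations(links, sentence_id)
    return {s for s in candidates if languages[s] == target_lang}
```

With an English sentence 1 linked to a French sentence 2, itself linked to a Japanese sentence 3, a search from sentence 1 "to" Japanese finds sentence 3 even though nobody translated it from English directly.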

So we worked very hard this week, which is why the update was delayed by one day.

What's next

Have a "moderator" status. Moderators will be able to edit anyone's sentences, link and unlink any sentence, and delete sentences.

Simplification of the registration process. We will get rid of the whole "validate your registration" step.

Import lists from CSV file. This will be a way to feed Tatoeba more quickly ;)

Have two modes for the lists : "show" and "edit". Only the "edit" mode will have the various buttons to translate, adopt and so on. This will make the list less crowded for people who simply want to browse.

Saturday, May 8, 2010

You can now add Tatoeba as a search engine in your little Firefox search bar.

You can also browse sentences that belong to a specific user, and you can filter them by language.

And we did some small UI improvements.

We have been mostly working on cleaning, improving, debugging and optimizing the code.

What's next

You will be able to download lists into a CSV file as well as import sentences into Tatoeba from a CSV file. It was originally planned for this week, but we decided to delay it as we still had things to discuss about how we were going to do this.

We will first implement these features so we can use them for our "massive validation and correction" process, but of course we will adapt them to other needs afterwards. For instance, it could give users the possibility to download sentences to be used in Anki, as well as to import an Anki deck into Tatoeba.

Saturday, May 1, 2010

This update doesn't really bring anything hugely new. The only new things are:

Icons (instead of text), for the "Inbox" and "Log out" link in the top menu.

Update of the Downloads page. The download files are now (officially) updated every week, and there's a new format.

The rest is bug fixes and optimization, so nothing really worth writing about.

Checking sentences

However, this update marks the beginning of an important phase in Tatoeba : massive correction and validation of sentences. We actually have the necessary features to do it, and in a way that is not just "random". This is all explained in my previous post on the reliability of the sentences.

We will start with French sentences. The primary goal is to correct the spelling and grammar mistakes by the end of May.

What's next

We will mostly revamp the lists section so that it can support in a better way the massive correction and validation of sentences.

Friday, April 30, 2010

Reliability has always been a big issue in Tatoeba. Many sentences have mistakes but there is currently no indication on whether a sentence is correct, or whether it is an accurate translation. So when you look at a sentence, you can never be 100% sure if you can rely on it.

We will start introducing some measures to solve this.

The first objective is to have all the sentences in Tatoeba adopted by someone. The languages most concerned are Japanese, English and French. They are the main languages in Tatoeba and I have explained in a very old post where they come from. The post also explains the idea behind "adoption" of sentences. You can also read this discussion where I explained my point of view on the "adopt" feature in more detail.

However, adopting is not enough. Sooner or later we will need some sort of "vote system". But before integrating a new feature, we can use what we already have. This article describes how we are likely to proceed, but since this is still experimental, it is of course not bound to remain as described. The procedure can be improved as we experiment with it, and all feedback is welcome.

Step 1 - Generating lists

We will generate lists that could be named following this template:

[checking] $languageCode ($languageName), $whateverYouWant

For instance :

[checking] fra (French), list 1

These lists will be private, and each of them will be attributed to the person who has to check the sentences in that list. They will be filled (in priority) with "orphan sentences", which won't remain orphan very long because we will have them automatically assigned to the person in charge of the checking.
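The naming template above is easy to apply mechanically. Here is a minimal sketch; the function and parameter names are made up for illustration, only the template itself comes from this post.

```python
def checking_list_name(language_code: str, language_name: str, label: str) -> str:
    """Build a list name following the "[checking] code (name), label" template."""
    return f"[checking] {language_code} ({language_name}), {label}"


# Example: the first French checking list.
print(checking_list_name("fra", "French", "list 1"))
```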

Step 2 - Checking

Users will have to check sentences in their native language only. Most importantly, they will NOT be checking the accuracy of translations, only the sentences themselves, i.e. if there is any spelling or grammar mistake. We will deal with accuracy of translations much later.

There may be some cases where you are not sure whether the sentence should be edited or not. For instance, sentences that are archaic or sentences that are grammatically correct, but do not sound like what a native speaker would say. There is no absolute rule... Usually, if the sentence is archaic, you can leave it alone. If it is grammatically correct but not "natural", then try to make it sound natural. At any rate, if you are not sure what to do, post a comment on the sentence and we will see what should be done about it.

Step 3 - Marking sentences as checked

If you are volunteering to check sentences, then you will first have to check and correct all the sentences you own (and that are in your native language). This is because we will consider all these sentences as "checked", except those that are in your "checking" list, which are of course still being checked.

All the sentences that are not adopted or that belong to someone who is not part of the "checking team" will be considered as not checked.

Once you are done checking sentences in your list, we will by default renew the list.

Or you can rename your checking list into "[checked] $lang ...", in case you want to keep track of the various batches of sentences you have checked. We will then generate a new list for you.

Anyway, as you can see, with the features we have, we can already start reviewing sentences "en masse". The problem, however, is that the list feature is not exactly optimized for this...

Step 4 - Exporting lists into CSV file

This will be a useful feature in general, but in our particular case, it will enable people to check sentences offline, as well as execute a "replace all" if they come across a recurrent mistake.

On that matter, don't hesitate to send us an email and tell us about recurrent mistakes you find and that can be corrected systematically. This will help us create a script to correct these mistakes in the non-adopted sentences so that contributors won't have to waste time working on things that can be processed automatically.
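As an illustration of the "replace all" idea, here is a minimal sketch that fixes a recurrent mistake in an exported list. The file layout is an assumption (one sentence per row, with the sentence id in the first column and the text in the second); the actual export format has not been decided yet.

```python
import csv
import io


def replace_all(csv_text: str, wrong: str, right: str) -> str:
    """Return a copy of the exported CSV with every `wrong` replaced by `right`.

    Assumes rows of the form: sentence id, sentence text.
    """
    output = io.StringIO()
    writer = csv.writer(output, lineterminator="\n")
    for row in csv.reader(io.StringIO(csv_text)):
        if not row:
            continue  # skip blank lines
        sentence_id, text = row[0], row[1]
        writer.writerow([sentence_id, text.replace(wrong, right)])
    return output.getvalue()
```

A plain string replacement like this one is blunt: it also changes occurrences inside longer words, so it is best kept for mistakes that are wrong in every context.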

Step 5 - Re-importing CSV file

People who decide to correct from the CSV file can later import back their corrections. All the corrections made in the file will be applied to the sentences in Tatoeba. At this time, it is still not very clear how the re-import process will be handled.
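One possible way to handle the re-import, sketched here under the same assumed file layout (sentence id, then text), is to compare the corrected file with the original export and only keep the rows whose text changed. This is not a description of what we will implement, and the function names are hypothetical.

```python
import csv
import io
from typing import Dict


def read_sentences(csv_text: str) -> Dict[str, str]:
    """Map sentence id -> text for a CSV with rows of the form: id, text."""
    return {row[0]: row[1] for row in csv.reader(io.StringIO(csv_text)) if row}


def changed_sentences(original_csv: str, corrected_csv: str) -> Dict[str, str]:
    """Return {id: new text} for sentences whose text differs from the export.

    Ids that were not in the original export are ignored.
    """
    original = read_sentences(original_csv)
    corrected = read_sentences(corrected_csv)
    return {
        sentence_id: text
        for sentence_id, text in corrected.items()
        if sentence_id in original and original[sentence_id] != text
    }
```

Only applying the differences would keep the contribution logs free of "edits" that changed nothing.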

Step 6 - Re-adapting the list page

Some people may prefer checking from a page in Tatoeba rather than from a file so that they can easily post a comment when needed, unadopt the sentence and leave it to someone else to deal with if it's just too difficult for them, favorite it, or add it to another list...

But the list page is not exactly optimized for that, so we will try to provide a page where users can easily do these things while checking.

Step 7 - Reorganizing the lists

There may be a point where the lists section starts getting a bit too messy. We will certainly have to introduce some categorizations for the lists (we'll have at least "checking" and "to translate"). This is still an open idea though, nothing guaranteed.

Step 8 - Integrating a vote system

We will start working on a vote system only when we have a larger number of active contributors. Right now, the number of active contributors in a given language probably doesn't exceed 5. There have to be at least 20, in my opinion, for the vote system to be worth implementing.

This is probably NOT something we will work on before another 3 or 4 months.

Step 9 - Locking sentences

Eventually, when a sentence has been checked and/or discussed again, again and again, it would make sense to lock it so that no one can edit it anymore, not even the owner.

The fact that a sentence is locked will also be the guarantee that a sentence is completely reliable. But again, this won't be integrated before at least several months.

Phase 2 - Step 1 - Checking translations

Well, I'll write another post about this when we get there, because this is a trickier issue...

Let's do the math

Suppose it takes an average of 5 seconds to check a sentence and correct it if needed (which is quite optimistic). We have about 370,000 sentences. That's 500+ hours of checking. So with enough people, it's not that much.

(Optimistically)

Japanese would need about 210 hours.

English, 200 hours.

French, 40 hours.

German, 16 hours.

Polish, 13 hours.

Well, we have other languages, but I won't list them all here...
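For those who want to redo the math, here is the calculation as a small sketch. The 5 seconds per sentence and the 370,000 total come from this post; the per-language sentence counts are not given here, so the second function only shows what count each hour estimate implies.

```python
SECONDS_PER_SENTENCE = 5  # the optimistic average assumed above


def hours_to_check(sentence_count: int) -> float:
    """Hours needed to check a given number of sentences."""
    return sentence_count * SECONDS_PER_SENTENCE / 3600


def sentences_for_hours(hours: float) -> int:
    """Number of sentences that a given hour estimate corresponds to."""
    return round(hours * 3600 / SECONDS_PER_SENTENCE)


print(round(hours_to_check(370_000)))  # whole corpus, a bit over 500 hours
print(sentences_for_hours(40))         # what the French estimate implies
```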

French will be the first language into which we pour our efforts (the project is based in France, after all). We can surely get the French sentences checked by the end of May.

Sunday, April 18, 2010

As if migrating to a new server wasn't enough, we also decided to migrate to a new search engine. It was a rather on-the-fly decision, but I must admit, it was fun :D

A little bit of context

Until now we were using a search engine called Lucene. It's written in Java, and the integration of Lucene into Tatoeba is something that was coded three years ago, back when I didn't know how to code and wasn't even sure yet I would pursue a career in computer science.

I was just very lucky that a computer science student at my university found out about my project and was interested in joining me in the task of integrating a search engine, as part of a school project (thank you François, if you are reading this).

The problem is, running Lucene takes a lot of memory. And our new server doesn't have a lot of memory (512MB RAM). So we figured, okay, we'll just leave the search engine on the old server (2GB RAM), Masa (the admin) will not mind.

But Masa wanted to clean up his server, to reinstall it from scratch, but couldn't. He didn't want Tatoeba to be in trouble (because that meant we had to find somewhere else to go, even if it would be temporary). So when I told him we were moving to our own server, he was quite excited: he could finally reinstall peacefully. I told him our migration was scheduled for Saturday April 17th, and that we would find a temporary solution for the search engine, so he could do whatever he wanted on Sunday.

Migration day

Saturday, migration day. Lots of things to do. And I couldn't be in Paris with 3 other members of my team (Allan, Robin and Baptiste), so it only made the task harder. I won't go into details, but we reached the end of the day, everything went pretty well, except we hadn't taken care of the search engine yet...

We were on IRC, and Robin and Baptiste had left. I was telling Allan all the hackish stuff we would need to do to set up the search engine, because the initial plan was that we temporarily use his machine at work to host it. But then he felt "Okay, this is too hackish, I'll try to find another solution, otherwise we will never update the search engine".

Except, I had received an email from Masa earlier, telling me he would really like it if we could be done migrating by 1AM, so I told Allan "But Masa really really wants to reinstall his server, we need to have something working by midnight". And it was 8PM...

How we decided to use Sphinx

Allan was not going to give up so easily. He started telling me that he had already done some research before, and that Sphinx was often mentioned as a competitor of Lucene.

me: Sphinx or Lucene, if you can code me something within 2-3 hours, I have nothing against it.

So he kept going, telling me that Sphinx handles stemming, that it's written in C++, that someone made a behavior to integrate it in CakePHP...

me: Alright, but it will be for next week :P

Allan: So I didn't really have a choice...

me: Ah because you want to do this now?

Allan, quoting me: Sphinx or Lucene, if you can code me something within 2-3 hours, I have nothing against it.

me: Well okay, we can try it.

Allan: Yea because you know, there wasn't any big fail in our migration, so we need to add more pressure, otherwise it's not fun.

me, thinking: Like I didn't have enough pressure for the day *sigh*. (Allan was in the train while *I* was doing the migration)

me: Give me the links you have, I'll see what I can do to speed up the integration.

It was 8:30PM.

How things went

Things went very well :) Note that none of us knew much about Sphinx before. We had no idea how difficult (or how easy) it was to install it, and run it, and integrate it in CakePHP. Allan took care of the installation & configuration part while I was taking care of the integration in CakePHP.

Once I understood how Sphinx worked and how to get it to work (which took me a bit more than one hour), all I had to do was to follow the explanations in the Sphinx Behavior documentation, adapt the code to Tatoeba, figure out how to pass GET variables with CakePHP's Paginator, and add a "warning" message to let users know that we're switching to a new search engine and some features are no longer available (but of course we will integrate them back as soon as possible).

In the meantime, Allan installed Sphinx on our new server, figured out how to create one index for each language so that people can still search from a specific language, figured out how to query those indexes from CakePHP, and figured out how to make the search work for languages that have non-ASCII characters.
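To illustrate the "one index per language" idea, here is a toy sketch in plain Python. It is not Sphinx and not our configuration: it only shows why keeping a separate index for each language makes "search from a specific language" a simple lookup, and how Unicode-aware tokenizing copes with non-ASCII characters. All the names are made up for the example.

```python
import re
from collections import defaultdict
from typing import Dict, Iterable, List, Set, Tuple


def tokenize(text: str) -> List[str]:
    """Lowercase the text and split it into words (Unicode-aware)."""
    return re.findall(r"\w+", text.casefold())


def build_indexes(sentences: Iterable[Tuple[int, str, str]]) -> Dict[str, Dict[str, Set[int]]]:
    """Build one inverted index per language from (id, language, text) rows."""
    indexes: Dict[str, Dict[str, Set[int]]] = defaultdict(lambda: defaultdict(set))
    for sentence_id, lang, text in sentences:
        for word in tokenize(text):
            indexes[lang][word].add(sentence_id)
    return indexes


def search(indexes: Dict[str, Dict[str, Set[int]]], lang: str, query: str) -> Set[int]:
    """Ids of sentences in `lang` that contain every word of the query."""
    words = tokenize(query)
    if not words or lang not in indexes:
        return set()
    index = indexes[lang]
    return set.intersection(*(index.get(word, set()) for word in words))
```

Splitting on word characters like this does not work for languages written without spaces, such as Japanese or Chinese, which need their own handling.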

It was then 1AM, and we had done it. We had installed Sphinx, integrated it into CakePHP, gotten it to work for all the languages we support, run the tests to make sure basic searches were working, and updated Tatoeba.

Now everything is soooo fast, it's awesome. Besides, indexing with Sphinx only takes 30-60 seconds (compared to 15-20 minutes with our 3-year-old Lucene code). So we can afford to index much more often.

The whole experience was awesome as well. The challenge, the teamwork, the achievement. I loved it :D

Friday, April 16, 2010

Well, I was supposed to be in Paris with my team at the moment, but some volcano decided it would be otherwise - flight canceled. So be it, Tatoeba will still be updated today.

New server

That's the most important news: we're moving to a new server, kindly provided to us by the Free Software Foundation in France.

It may surprise some of you, but we weren't on our own server until... well, now. We were hosted by Masa (not his real name), webmaster of http://tokidoki.fr. I have to thank him for hosting Tatoeba - for free - for the last two years. I also have to thank him for giving Tatoeba the (18,000) French translations of the Tanaka Corpus that he gathered (with many volunteers), also two and a half years ago.

Cleaned up sentences

Duplicate sentences will be merged, and the { } annotations that you can find in some sentences will be removed.

Private messages

The private messages look better now :) The private messages system needs to be changed someday though, to be more practical. Something similar to the Wall, except it would be private. However, people do not use private messages that much, so it is not urgent.

What next?

I won't be talking about our update in two weeks (because we haven't really decided yet), but rather about our plans for the next two months.

As usual, we will be debugging and optimizing our code.

We will take some time to reach out to other people. We are starting to have quite a long list of people to contact, and it's time we actually contacted them.

We will work on improving the profile and the lists.

Tatoeba is currently built on a PHP framework, CakePHP, but we will start switching to Django (something we've been considering for a few months already). It's not like we're going to entirely recode Tatoeba. We still have to discuss how we'll be doing this.

And we will move our source code to GitHub (also something we've been considering for a couple of months).

We have finished paginating the Wall so you won't have to wait forever to get to read the new messages.

Link/unlink sentences

We have implemented the link/unlink feature. The owner of a sentence can turn indirect translations into direct ones by linking them to his or her sentence. He or she can also unlink a translation, if the translation does not mean the same thing. This feature will not be available to everyone however... Only to a few chosen ones.

Trusted users

There is now a new user status that we call "trusted user". For now, the only good thing about being a trusted user is that you can link and unlink sentences, while normal users can't. But in the future, we will start by testing new features with trusted users, who can then give us feedback so that we can improve the features. Only then will we release them for everyone.

Note that there are no specific criteria to become a trusted user. But one very important condition is to have read the "How to be a good contributor" guide ENTIRELY.

What next?

As usual, many many things. But the main thing is that we are going to move to a new server. We recently asked the Free Software Foundation (France) if they could host our project, which is a free project (AGPL license for the source code and CC-BY for the corpus files). They accepted, so we now have our own server. The migration is scheduled for April 17th. Once that is done, Tatoeba will be (much?) faster - because it is pretty slow right now.

Friday, April 2, 2010

We used to display romaji in Tatoeba... We don't anymore. Well, at least not directly. We are now going to display the reading in hiragana. You can however get the romaji version by hovering your mouse over the hiragana and waiting for the little tooltip to appear.

Before:

After:

There has been some discussion about it (like here, here or here), and I think this solution will make everyone happy.

Now, of course, the output generated is not perfect. So if anyone out there is interested in improving the generated hiragana, then please let us know! As much as I agree that the reading is vital information for Japanese learners, I will NOT have time to make it any better. I'd really, really like it if someone could take on this task.

Thursday, April 1, 2010

We started to add audio in Tatoeba, and it will be available on April 3rd. Great, isn't it? :D

Yes, but (there is a but) you will probably be disappointed to see that most of the sentences will be indicating "audio unavailable". So far, only a few hundred sentences have audio, which is barely 0.1% of the whole corpus. This doesn't have to stay that way, however! If you are interested in helping us add more audio, keep reading.

First of all, about Shtooka

Shtooka is a small non-profit organization based in Paris whose goal is to gather collections of audio for words, expressions, proverbs, sentences, etc. You can browse their collections here.

We met them at an event they organized on February 13th, and thanks to them, we are now starting to integrate audio into Tatoeba.

Audio for Shanghainese

The audio we have so far is in Shanghainese. Yes, we do have such an exotic language. Now, you may be wondering why on Earth we picked Shanghainese. Well, for a few reasons.

Allan (aka. sysko), one of the most active developers in the team, is very interested in Chinese, and more particularly in Shanghainese. He was provided with 900 Shanghainese sentences from shanghaining.com.

Congcong (aka. fucongcong), one of the most important contributors in Tatoeba, speaks Shanghainese.

They were both able to meet regularly with Nicolas (aka. zmoo), president of Shtooka, in order to record these sentences in Paris.

Want more?

Needless to say, we will be very happy to add audio for any other language. But it's not going to be easy, and it's not going to be possible without your help! So if you are interested...

First of all, send us an email at team@tatoeba.fr, with the title "Audio for Tatoeba in [insert-language-here]".

You have to know that Shtooka insists a lot on quality, therefore recording from your laptop's microphone is not an option. We will explain things in more detail when we get back to you.

Then if you are still motivated, start gathering sentences for which you would like to record audio, by creating lists. Limit each list to 100 sentences max.

Note that you can also create lists just to gather sentences for which you want audio, even if you are not going to record them. Just make sure that all the sentences in a list are in the same language.

Anyway, having audio in Tatoeba is really exciting for us, and we hope that many of you will join us in this quest!

Saturday, March 20, 2010

As mentioned in the previous update post, we are currently in a phase where we're taking care of the old and small tasks. Here's what we took care of this week.

Traditional Chinese

Chinese sentences are now displayed both in traditional AND simplified. Some contributors in Tatoeba contributed in simplified Chinese and others in traditional Chinese. Instead of asking people to contribute in both or only one form of writing, we let them do both, and we do the conversion automatically. The converted form is in grey, just like the pinyin.

Two new tools (for Chinese)

Since we have integrated traditional Chinese, we also added two new tools.

Pinyin converter. We had a romaji/furigana converter for Japanese. Now you can do that for Chinese as well.

Note that these tools are based on software called Adso. You might as well go there if you need to convert.

Pagination on Wall

The Wall was getting quite crowded so we decided to start doing something about it. There is now a paginated version of the Wall, in case you don't want to display the entire Wall. We still need to do a bit of restructuring though. In two weeks we will have a better Wall :)

Latest messages from Wall on homepage

Since the Wall is being used and visited more and more, we eventually decided to display the latest messages on the homepage. Note that for now, clicking on the link of a message will still lead you to the non-paginated version of the Wall.

Number of sentences in lists

We also decided to improve the lists a little bit, by changing the way they are presented, and by displaying the number of sentences in each list.

What next?

On February 13th we went to an event organized by an association based in Paris called Shtooka. Their goal is to compile audio of words, expressions, proverbs, and guess what... sentences! Just like Tatoeba, what they produce is FREE (and with really good quality). They redistribute their content under a Creative Commons license as well (most of their collections are under CC-BY, the rest is CC-BY-SA).

Needless to say, our projects were made for each other. They need written content to create audio from, and we have it. We (the users, my team, me) have always wanted audio in Tatoeba, and they provide this.

So you can expect a beginning (and I insist, it will just be a beginning) of audio integration in a couple of weeks :)

The other thing we are going to work on is to give the possibility for users to link and unlink sentences. This is a feature that is really really lacking, but this is also a feature that is very very difficult to implement. I cannot guarantee it will be ready for April 3rd, but we will do our best so that it is.

Saturday, March 13, 2010

You have to know that we are currently in a phase where we're getting rid of all the old/small tasks so that when summer comes we can start tackling new/bigger tasks.

We are also in the phase of setting up some working rules which involves us working on and updating Tatoeba more regularly.
For now, it was decided that we would update Tatoeba on a two-week basis, which means you can definitely expect some changes every other week (more precisely on Saturday). Each update will be quite small, so that we avoid introducing bugs whenever we introduce new features. We are actually limiting ourselves to 6 tasks for each update.

But if we have more time or if we are very productive, we will also update Tatoeba in-between. The current update is actually an "in-between" update. It was only one week after the last update, and you will only have to wait one week before we update again. But we are updating only because we had the time to test properly what we were going to integrate.

What's in this update (besides the bug fixes)

1. Page with your comments
You can access it from your profile page. Please, take a look at your comments and delete those that are not useful. I will let you judge what is useful and what is not. But of course, be smart, don't delete a comment if someone has replied to you below or it will become confusing (unless this someone also deletes their comment).

2. Page with comments on your sentences
You can access it from your profile page as well. Please take a look at those comments to check that you haven't missed any correction suggestions on your sentences!

3. Clickable URL
It's not much, but now URLs in the private messages, the Wall and the What's New section are clickable.
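For the curious, making URLs clickable boils down to something like the sketch below. It is an illustration rather than our actual code: the regular expression is a deliberately simple assumption (anything starting with http:// or https:// up to the next whitespace), so punctuation stuck to the end of a URL ends up inside the link.

```python
import html
import re

# Assumed, deliberately simple pattern: scheme followed by non-space characters.
URL_PATTERN = re.compile(r"(https?://\S+)")


def linkify(text: str) -> str:
    """Escape the text for HTML and wrap every URL in an <a> tag."""
    pieces = []
    # With one capturing group, re.split returns plain text at even
    # positions and the captured URLs at odd positions.
    for position, piece in enumerate(URL_PATTERN.split(text)):
        escaped = html.escape(piece)
        if position % 2 == 1:
            pieces.append(f'<a href="{escaped}">{escaped}</a>')
        else:
            pieces.append(escaped)
    return "".join(pieces)
```

Escaping every piece before building the link keeps user-submitted text from injecting its own HTML into the page.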

Anyone to translate the website?

I would also like to make a call, to anyone who speaks Japanese, German or Spanish. We need people to translate Tatoeba into these languages.

We have our texts hosted on Launchpad: https://translations.launchpad.net/tatoeba
It offers an interface for collaborative translation. Whenever we update Tatoeba, we will download the new translations made in Launchpad and integrate them in Tatoeba.

The most urgent would be Japanese!

What next?

I will not write about all the things that we have in mind or in our todo list because there are so many and we don't exactly know when we will deal with them. You will know about it little by little. But I can at least tell you with a decent level of certainty what is planned for the next two updates.

March 20th
- Conversion of simplified<=>traditional Chinese
- Display on homepage of latest messages from Wall
- And some optimization (trying to make Tatoeba a bit faster)

Saturday, March 6, 2010

Ah, finally an update that will be integrating "real" changes. Here's a short description of the new stuff.

Possibility to indicate the language

Until now, when you wanted to add a sentence or a translation, you had no way to indicate in which language you were contributing. The language was auto-detected, but it was still a bit puzzling the first time you tried to add something. Most people were probably thinking "But how will they know what language... oh okay, it's auto-detected". But more importantly, we could not really consider supporting languages that are not supported by Google's language detection tool (which we are using). Users would have to indicate manually the correct language every time, and that would be annoying.

This is a small but important feature we wanted to have in Tatoeba for a long time and it's finally here.

Adopting in place

There's one important concept in Tatoeba: you can only modify a sentence if you are the "parent" (owner) of that sentence. You are by default the parent of the sentences you add, which implies only YOU can modify your sentences (which makes sense). But you can also become the parent of a sentence by "adopting" it, because many, many sentences in Tatoeba do not have any parent. The reason why you'd want to adopt a sentence is that you noticed a mistake and want to correct it, and you wouldn't be able to do this without being the parent of that sentence.

The adopt feature was quite "heavy" to use. Every time, you were redirected to the "info" page of that sentence. Now it can all be done in one place. Click on the adopt icon, and there you go, no redirection, you can modify it right away. This should make the task of correcting sentences less annoying.
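The ownership rule described above can be summed up in a few lines. This is a simplified sketch with made-up names (the `Sentence` class and its methods are not our actual code), and it leaves out special roles such as the admin.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Sentence:
    text: str
    owner: Optional[str] = None  # None means the sentence has no parent

    def can_edit(self, user: str) -> bool:
        """Only the parent (owner) of a sentence can modify it."""
        return self.owner == user

    def adopt(self, user: str) -> bool:
        """Become the parent of an orphan sentence. Returns False if it already has one."""
        if self.owner is not None:
            return False
        self.owner = user
        return True

    def edit(self, user: str, new_text: str) -> None:
        """Change the text, refusing anyone who is not the parent."""
        if not self.can_edit(user):
            raise PermissionError(f"{user} is not the parent of this sentence")
        self.text = new_text
```

So correcting an orphan sentence is always two steps, adopt then edit, which is why doing both in place saves so many clicks.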

Only main sentence displayed when translating

This should solve a problem that we've had for a long time. Users who are not familiar with the system tend to add translations without caring what they actually add their translation to. Many times, people were adding a translation to a Japanese sentence when they were in fact translating from the English sentence. And because of the way things are displayed, they think "okay, I'm just adding one sentence in that box". But it's not the way things work in Tatoeba... Hopefully this will make things clearer.

Possibility to delete comments

Yes, now you can delete your comments (comments on sentences as well as comments on the wall). You cannot edit them yet though. That will be for next time (probably). Be careful though! Deleting a comment will delete it forever. We haven't made a page that lists all your comments yet, but you can go through all the comments in Tatoeba here.

What next

Well I'm looking at our todo list, and it's hard to say... I'd rather let it be a surprise ;)