Thursday, June 30, 2005

With some 76,000 articles, the English language Wiktionary is the biggest Wiktionary. Until yesterday all the article names were capitalised. Some months ago there was a vote to change this so that articles would be named as the word is spelled. The decision was not implemented, a lot of words were spilled on the issue, and now, many months later, it was changed out of the blue.

The English Wiktionary now has a problem, and it has an opportunity. The problem is that many entries are now wrong, and that the interproject links from Wikipedia to Wiktionary are wrong. The opportunity is that many other things are wrong as well, so this is an unsought chance to revisit the content and improve it.

Many people will feel frustrated because of all the hoo-ha. Many people will feel angry because the timing was not great; among other things, it temporarily stopped the migration of Wikipedia to release 1.5. But as the opportunity is there, it is also the time to step up to the plate and do the best that can be done.

I am speaking to Andre Engels and I hope that he will come up with a bot that will find the words that should be capitalised and move them back to their capitalised title. This bot should also be able to list the words that can be found in both upper- and lowercase. After this the interwiki.py bot can be run again. There is also a need for a bot that checks the en.wikipedia content for links to Wiktionary, checks if the article is there and, if not, fixes the link to lowercase.
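To make the two bot tasks a bit more concrete, here is a minimal sketch in Python of the checks involved. It works on a plain list of article titles and does not talk to a wiki at all; the function names (`case_collisions`, `resolve_link`) are made up for this illustration and are not part of interwiki.py or of any bot Andre may write.

```python
def first_lower(title: str) -> str:
    """Return the title with only its first character lower-cased."""
    return title[:1].lower() + title[1:]


def case_collisions(titles):
    """List (capitalised, lowercase) pairs where both forms exist as entries.

    These are the words a person has to look at, because a bot cannot
    know which of the two entries holds the right content.
    """
    existing = set(titles)
    pairs = []
    for title in sorted(existing):
        lower = first_lower(title)
        if title != lower and lower in existing:
            pairs.append((title, lower))
    return pairs


def resolve_link(target: str, titles):
    """Decide where a Wikipedia link to Wiktionary should point.

    Keep the link if the article exists as written; otherwise fall back
    to the lowercase form; return None if neither entry exists.
    """
    existing = set(titles)
    if target in existing:
        return target
    lower = first_lower(target)
    if lower in existing:
        return lower
    return None


# Hypothetical sample of Wiktionary titles after the change.
titles = ["Polish", "polish", "Amsterdam", "dog"]
print(case_collisions(titles))
print(resolve_link("Dog", titles))
```

A real bot would read the titles from a database dump or from the wiki itself and would move or edit pages; this sketch only shows the decision that has to be made for each title and each link.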

Yes, there was a need to prepare this change, but it is also understandable that, given that the decision was reached so long ago, it could go wrong as it did. So now we have to do without preparation and just do the work.

Monday, June 27, 2005

I am probably the official spokesperson for the Ultimate Wiktionary. It is this default thing; I came up with the idea, I carried it forward, I found the needed funding and I do a lot of the evangelising. When I started with UW I had to learn a lot and I did. I do not expect that I have finished learning, but anyway...

There is this thing with convincing people: to what length should you go? To what length do you want to go to make people buy into an idea? My time is valuable in that I can spend it only once, and if I spend too much time arguing I do not speak with people who have ideas on how things can be done. Arguing does not help the project when the result is not positive.

There are people who insist that I do everything by IRC or e-mail while I prefer to use Skype as it gives me better feedback. There are people who are only interested in a tiny specific part of Wiktionary; only English seems to be relevant to some. There are people who think they quote me and say things I would never say.

Basically, to me there are three groups: people who understand what I am saying, people who want to understand what I am saying, and people who for whatever reason do not want to hear what I say or cannot understand what I say. With the first two groups I can talk. We do not have to agree but there is this basis of understanding. With the last group, who can be quite vocal, I find that they waste my time.

The problem is, there is always the off-chance that it is me who does not hear what they are saying. It may be a dilemma where there is no good solution. When I can address the problem of why they do not hear or understand what I say, they may become part of the people that become relevant... So, how much time to spend on this and how much to spend on new things?

Spending time on new things is a hazard in itself. It moves me even further away from the people who find it hard to hear or understand what I am on about.

Sunday, June 26, 2005

Today, Jimbo Wales told me that I can quote him: "the license will not prevent Open / Free software projects to use our data". This is indeed good news. It means that we can host data and cooperate with one and all. It means that we can host data for organisations that are less well equipped to do this.

When we can pull off this kind of thing, we will add extra relevance to the Ultimate Wiktionary.

Saturday, June 25, 2005

When you think about an Ultimate Wiktionary, the idea of including all words of all languages is a given. That is ambitious enough. You do not need anything more, right?

The Dutch spelling will change in 2006; it will change things that are artificial, like paardenbloem, back to paardebloem, which is how it has always been pronounced. The result will be that many words will be wrong from 2006 onwards.

In October 2005 a list of words will be published with the old and new spelling. It means that we have to cater for this list in the Ultimate Wiktionary. So the Ultimate Wiktionary has to be more ambitious, alas.

This is then the time to start experimenting. I am using the word Imbiß as an example; in modern German it is spelled Imbiss. I have introduced two new templates: one to be used in front of everything, to signal the old spelling and point to the correct one, and one to say that a spelling used to be correct.

Having a date for the change will make the information even more valuable. When UW is used within software for optical character recognition, it may be used in a second pass after the initial pass that did the scanning. Knowing which spelling was correct when the text was printed allows for an appropriate spellcheck, and that will enhance the quality of the OCR process.
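What "a spelling with a date" could look like in data can be sketched in a few lines of Python. This is only an illustration and not the UW database design; the class `Spelling`, the function `correct_forms` and the cut-over date are assumptions made up for the example, and the date in particular is a placeholder.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class Spelling:
    """One written form of a word and the period in which it was correct."""
    form: str
    valid_from: Optional[date] = None   # None: no known start date
    valid_until: Optional[date] = None  # None: still correct (end is exclusive)

    def is_correct_on(self, day: date) -> bool:
        if self.valid_from is not None and day < self.valid_from:
            return False
        if self.valid_until is not None and day >= self.valid_until:
            return False
        return True


# Placeholder cut-over date, used for illustration only.
REFORM = date(1998, 8, 1)

SPELLINGS = [
    Spelling("Imbiß", valid_until=REFORM),  # used to be correct
    Spelling("Imbiss", valid_from=REFORM),  # correct now
]


def correct_forms(spellings, day: date):
    """Return the forms that count as correct on the given day."""
    return [s.form for s in spellings if s.is_correct_on(day)]


# An OCR second pass could pick the forms valid when the text was printed.
print(correct_forms(SPELLINGS, date(1990, 1, 1)))
print(correct_forms(SPELLINGS, date(2005, 6, 25)))
```

With records like these, a scan of an older book would accept Imbiß as correct, while a spellcheck of a text written today would flag it.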

One thing to consider as well is that some spellings are local to a certain region or country. Rudolf Heß is called Rudolf Hess in Switzerland; the "scharfes S" is not used in die Schweiz. So words that still have their "scharfes S" in German are spelled differently in Switzerland. This is just spelling. Some words or their meanings are not known to all people who speak German, like "Paradeis", which Austrians know to be a "Tomate".

I am more and more appreciating the fact that linguists find it astounding that we attempt to make the Ultimate Wiktionary a reality. What makes us try it is that it was for us a natural growth path from Wiktionary. So we get our problems serially and not in a parallel fashion. The issues are there to be solved and they can be solved. Getting the issues serially helps because it prevents you from being overwhelmed by complexities; for us it is just a matter of refactoring.

Tuesday, June 21, 2005

I have written earlier about the Holland Open software conference; I was happy to be there and gave a presentation there as well. A conference like this is really valuable: you make contacts, and when you have a new project like Ultimate Wiktionary these are really valuable. They may alter the way a project is run. One such contact I had was with Mr Bart Knubben of OSOSS. This organisation is about Open Standards and Open Source Software in the Dutch government.

OSOSS is working hard to make the list of properly spelled words maintained by the NTU available to the public. Because of all kinds of contractual restrictions this is not possible at this time. To alleviate this issue, they are working as the focal point of the Dutch Open world to get the list of the NTG, the Nederlandstalige TeX Gebruikersgroep, validated for its spelling. This means that some 222,872 words will be validated.

This list of differently spelled words comes with indications of how each word is to be broken up at the end of a line. When the UW is to host such a list, it will mean some adaptations to the software; we will want to keep track of the correct spelling. As the Dutch spelling will change in August 2006, it means that we will want to retain the old spelling and mark it as such. As the change of the spelling rules lies in the future, we will have to consider how to deal with this.
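Keeping the break points next to the word could be as simple as the following Python sketch. I do not know the format in which the NTG list marks the breaks; the notation "paar-de-bloem" with a plain hyphen as marker is an assumption for this example, and the function names are made up.

```python
def parse_hyphenation(marked: str, sep: str = "-"):
    """Split a marked form like 'paar-de-bloem' into the word and its break offsets.

    Assumes the separator only marks break points; a word that contains
    a real hyphen would need a different marker.
    """
    parts = marked.split(sep)
    word = "".join(parts)
    offsets = []
    pos = 0
    for part in parts[:-1]:
        pos += len(part)
        offsets.append(pos)
    return word, offsets


def break_at(word: str, offsets, max_len: int):
    """Break the word at the last allowed point that fits in max_len characters.

    The added hyphen counts towards max_len. Returns (head, tail), or None
    when no allowed break point fits.
    """
    usable = [o for o in offsets if o + 1 <= max_len]
    if not usable:
        return None
    cut = usable[-1]
    return word[:cut] + "-", word[cut:]


word, offsets = parse_hyphenation("paar-de-bloem")
print(word, offsets)
print(break_at(word, offsets, 7))
```

Storing the offsets separately from the word means the same entry can serve both as a spelling record and as a hyphenation record.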

When we host a resource like this for the NTG, it means that our license has to be compatible with theirs. Currently they use the GNU Lesser General Public License. They do not care who uses it under what license as long as it stays Free.

Technically there is this issue; we want to host this data for the NTG. It would be really cool to be the resource for the Open/Free content world and host the Open/Free resource for the Dutch language. It would very much be in line with our objectives. We will find a solution for this issue; one thing is sure: the LGPL is not applicable to a wiki. :)

Monday, June 20, 2005

When you have an entry in an electronic dictionary like Wiktionary, what is enough to make it worthwhile to have it? The question is relevant as there are people who are of the opinion that any article that does not have extensive definitions and etymology is substandard.

My opinion is a bit more inclusive. I would like to have extensive definitions and etymology, but for me the sheer fact that a word is properly spelled is enough to have it in an electronic dictionary. The Dutch language has an institution that provides the authorised list of correctly spelled Dutch words. For me a list with these words would be a worthwhile contribution to the Ultimate Wiktionary. Obviously, it would be a bit meagre but it does serve its purpose.

When the correct way of spelling words changes, like it will do on the 15th of October, an electronic dictionary has a clear advantage over paper-based dictionaries. It is however not clear to me how we should cover the old correct spellings. In a way it is relevant to have a history of correct spellings. It could, or should, be part of the database.

Thursday, June 16, 2005

Some people tell me that I should say that "Ultimate Wiktionary" will improve cooperation. I have been saying that with UW we will get cooperation. At this moment the Italian and the Dutch Wiktionaries are cooperating as well as possible. Today there were some changes on the word [[Jiddisch]]; as a result I had to check some things on the Italian Wiktionary, and found that they have at least 10 more translations.

With UW we will get the cooperation, the synergy that we do not have at this moment. The will to cooperate is there but it just does not happen. So I am unapologetic: only with the UW will we get the synergy that we so desperately want. It is not that we do not want to, it is that it does not happen in a practical manner.

Tuesday, June 14, 2005

On a good day I get some 100 e-mails; on a bad day there are many more. Many of these e-mails are spam. The e-mail software I use has a built-in spam filter; it must be trained, and it more than halves the work that I need to do. For the other stuff, I have to look at the mail to decide its relevancy. When it is from a bank or monetary institution I do not do business with, it is spam; when it is from China in Chinese, it is spam. As I am Dutch, many of the American names that send me stuff are suspect. Typically this works out fine.

When I want to connect to people who are "official" or high up in an organisation, there is little chance for me to actually reach the right level. There are often many intermediary levels before my message gets to Mr or Mrs Right. These intermediary levels have strategies similar to mine; I do not expect that they are impressed with my myrealbox.com or gmail.com e-mail addresses. It makes me just a member of the public (and I am), not someone who asks something on behalf of the Wikimedia Foundation. So it would be helpful if people who are known to be active on behalf of the WMF had a wikimedia.org e-mail address. It helps to overcome the barriers thrown up by the intermediary levels and get a job done, a message delivered.

Monday, June 13, 2005

Sometimes it is a nice surprise when you find that a nice idea gets some following. The pronunciation of the names of famous people is one such thing. When you listen to how an Italian pronounces the name of the Italian prime minister or how an American pronounces the name of his president, you realise that it is different from how it is pronounced in other languages.

Влади́мир Влади́мирович Пу́тин is a surprise for me because it is the first famous person I found on Wikipedia with a sound file that I did not ask for.

The funny thing with pronunciations is that the pronunciation of Mr Bush can be heard on the Dutch Wikipedia. The English Wikipedia objects to the sound file; it has already been removed several times. Some people use Wikipedia to learn languages; it is therefore useful to learn how a local pronounces famous names.

What I would really like is to have sound files of famous people. We have already asked the new pope... One can always hope :)

Saturday, June 11, 2005

I have been working on Farsi training material on Wikibooks; it is a project to teach the Farsi characters and sounds to Dutch people. I do not speak Farsi, I am not learning to speak Farsi, I am just helping this project to improve.

A lesson has two parts: the spoken Farsi words and the translation in Dutch. When you click on the Farsi words, you may hear the pronunciation in the .ogg format. When you click on the Dutch words, it takes you to nl.wiktionary.org. When a word does not exist I create it; when the word exists I add the Farsi translation.

It is really hard if not impossible to use my favourite browser; I have to move to the other side as it is clearly superior when editing a page like FarsiLes5. In the past I did enter bug reports for Mozilla and MediaWiki, and I learned today that they are working on it. I hope they do a good job because Firefox is almost useless when editing pages where there is a mix of languages.

Thursday, June 02, 2005

When you have been away for a few days, you have a lot of reading to do. I had little time to read my e-mail so I had to wade through hundreds of e-mails. There is this big temptation NOT to read many of those e-mails and just delete them.

I decided to wade through the wikitech-l and found this interesting concept of "transcluding a sound". I think this means that a sound is played automagically. This needs some clever software that will play the sound in-line. The article says that there is no need for such a feature... Would it not be cool if, when you find a word, you hear it automagically? Yes, it could also be something that you can enable or disable from your preferences.