Archive for July, 2010

As a software development professional, I have never been a fanatic of any one platform or tool. The choice always depends on many factors and constraints. The most important thing, I think, is not the tools you use to solve the customer's problem. Customers usually know nothing about them. What customers expect is a good, effective and timely solution.

Deasciification is the process of converting text written with only ASCII letters to its correct form, using the corresponding letters of the Turkish alphabet (or of any language that contains non-ASCII letters). For example, the text “Cok yogun bir calisma ve emegin urunu” still conveys its meaning, because human intelligence is able to resolve the ambiguities (if any) and understand text like this. The text, however, should be written as “Çok yoğun bir çalışma ve emeğin ürünü” (in Turkish). Restoring that form is what a deasciifier is supposed to do.
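To make the idea concrete, here is a minimal sketch in Python of a toy, lexicon-based deasciifier. It is only an illustration: the tiny word list and the function names (`asciify`, `deasciify`) are my own assumptions, and it is not how Zemberek or any other tool mentioned in this post works. Real deasciifiers have to deal with ambiguity and with words outside any fixed list.

```python
# Toy lexicon-based deasciifier (illustrative sketch, not a real tool's method).

# Map each Turkish-specific letter to its ASCII look-alike.
TURKISH_TO_ASCII = str.maketrans("çğıöşüÇĞİÖŞÜ", "cgiosuCGIOSU")


def asciify(text: str) -> str:
    """Strip Turkish diacritics, e.g. 'çok' -> 'cok'."""
    return text.translate(TURKISH_TO_ASCII)


# Hypothetical, hand-made lexicon covering only the example sentence.
LEXICON = ["çok", "yoğun", "bir", "çalışma", "ve", "emeğin", "ürünü"]

# Reverse lookup: ASCII form -> correct Turkish form.
# Assumes each ASCII form maps to exactly one word, which is not true in general.
ASCII_TO_TURKISH = {asciify(word): word for word in LEXICON}


def deasciify(text: str) -> str:
    """Restore Turkish letters for every whitespace-separated token found in the lexicon."""
    restored_tokens = []
    for token in text.split():
        restored = ASCII_TO_TURKISH.get(token.lower(), token)
        # Re-apply an initial capital. Note: Python's default casing does not
        # follow the Turkish i/İ rules, so this is only safe for this example.
        if token[:1].isupper():
            restored = restored[:1].upper() + restored[1:]
        restored_tokens.append(restored)
    return " ".join(restored_tokens)


print(deasciify("Cok yogun bir calisma ve emegin urunu"))
# prints: Çok yoğun bir çalışma ve emeğin ürünü
```

Tokens that are not in the lexicon pass through unchanged, which is exactly where a real deasciifier needs morphological or statistical knowledge.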

Well, why do we need deasciification? We may not have Turkish letters on the keyboard (or the OS we are using may lack a Turkish keyboard layout), and yet we need to end up with text in correct Turkish form. It is also possible that we are accustomed to typing only with ASCII letters for some reason.

In addition, we may need to analyze a large collection of Turkish documents, and this collection can be contaminated with text written in ASCII, which will degrade the performance of our analysis. In that case, the only remedy is deasciification. This is the most important reason for me, as I often perform text mining on Turkish document collections and I always need deasciification.

In this post, I’ll briefly review a few deasciification tools developed in several languages.

The first deasciifier is the one that is part of the Zemberek project. Written completely in Java, Zemberek is an open-source, general-purpose Natural Language Processing library and toolset designed for Turkic languages, especially Turkish. A web-based demo of Zemberek is available at http://zemberek-web.appspot.com/. I usually use the Zemberek deasciifier in my text mining research when I work with Turkish text datasets.

The next deasciifier was developed by Gökhan Tür at Sabancı University. More information and a demo are available at http://www.hlst.sabanciuniv.edu/TL/deascii.html. This system is currently not open-source and is not available for download.

In a blog post, Daniel Lemire complains that in research papers, people express a measure or a gain as an absolute value, and then jump to a conclusion about optimality based on the numbers obtained. Lemire calls this the fallacy of absolute numbers. In a comment on this post, John Cook gives the term already coined for this situation: numerator-only data. I had never heard that phrase before. I found it interesting and wanted to share it in a blog post.

So, what exactly is numerator-only data? Cook explains the term in a blog post as data without anything to compare it to, no denominator. It is the data that leaves us asking “compared to what?”.

For example, if someone tells us that an athlete runs the 100 m sprint in 9.75 seconds, we need to ask “is that good or bad?”. The number itself does not give enough information; it is nothing more than a data value. If he then tells us that the world record is 9.58 seconds, we have a denominator to compare it to; that is, 9.75 seconds is now meaningful. Of course, we may already know the “denominator”, but this is not always the case, and, as Lemire points out, numerator-only data becomes a problem in research papers, for instance.
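As a small worked example, the sprint time can be put over its denominator in a couple of lines of Python. This is just a sketch of the arithmetic; the helper name `relative_gap` is my own, and the two times are the ones quoted above.

```python
def relative_gap(value: float, baseline: float) -> float:
    """Return how far `value` is from `baseline`, as a fraction of `baseline`."""
    return (value - baseline) / baseline


athlete_time = 9.75   # seconds, the "numerator" from the example
world_record = 9.58   # seconds, the "denominator" that gives it meaning

gap = relative_gap(athlete_time, world_record)
print(f"{gap:.1%} slower than the world record")
# prints: 1.8% slower than the world record
```

Reported alone, 9.75 is numerator-only data; reported as roughly 1.8% off the record, it answers the “compared to what?” question.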