Monday, November 10, 2003

Above is a picture of my latest completed project. At the top is the title, "Zhongwen Tool". Below it is a text box that takes Chinese characters as input. When the Submit button is pressed the computer thinks for a while, then prints out the text with all the words that it can find in the dictionary underlined. When the cursor is hovered over a word in the text, a box pops up and shows the pinyin pronunciation and meaning of the word.

Just today, I've used it to read an article in Chinese on what changes the latest round of opening and reform has brought to the common people of China, and on decisions by Hu Jintao and Wen Jiabao regarding the direction that the socialist market economy will take in the future.

For now, the "Zhongwen Tool" (gosh, that's an awful name) is only available on my own computer. On some other weekend when I'm less busy, I hope to make it available for public use.

Some of the tool's cool features are only obvious behind the scenes. For example, I have the ability to create custom dictionaries to supplement the main dictionary (the CEDICT) for proper nouns like "The Matrix: Revolutions" (?? 3) and words that don't quite make it into the formal dictionary like "movie fan" (??). Also, every time the tool runs it reads a list of punctuation to ignore. Lastly, I actually wrote what I think is a recursive function to hunt for words in the text. I haven't written a recursive function since I took a class on C in 1997.
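
A recursive word hunt of this kind might look roughly like the Python sketch below. It is not the tool's actual code. It assumes a greedy longest-match strategy, an in-memory lookup table in which custom entries override the main dictionary, and a set of punctuation characters to skip. The names `find_words`, `MAIN_DICT`, `CUSTOM_DICT`, and `IGNORE`, along with the sample entries, are made up for illustration.

```python
# Sample data only; a real run would load the CEDICT and the custom files.
MAIN_DICT = {
    "中文": ("Zhong1 wen2", "Chinese language"),
    "工具": ("gong1 ju4", "tool"),
}
CUSTOM_DICT = {
    "影迷": ("ying3 mi2", "movie fan"),
}
IGNORE = set(",。!?、")  # punctuation that is never looked up

# Custom entries are merged last, so they win over the main dictionary.
lookup = {**MAIN_DICT, **CUSTOM_DICT}


def find_words(text, lookup, ignore, max_len=8):
    """Split text into (chunk, entry) pairs; entry is None when nothing matched."""
    if not text:
        return []  # base case: nothing left to scan

    first = text[0]
    if first in ignore:
        # Pass punctuation through untouched and recurse on the rest.
        return [(first, None)] + find_words(text[1:], lookup, ignore, max_len)

    # Try the longest candidate first, then shrink one character at a time.
    for length in range(min(max_len, len(text)), 0, -1):
        candidate = text[:length]
        if candidate in lookup:
            rest = text[length:]
            return [(candidate, lookup[candidate])] + find_words(
                rest, lookup, ignore, max_len
            )

    # No dictionary word starts here; emit one character and move on.
    return [(first, None)] + find_words(text[1:], lookup, ignore, max_len)


if __name__ == "__main__":
    for chunk, entry in find_words("中文工具,影迷", lookup, IGNORE):
        print(chunk, entry)
```

Greedy longest-match is simple, but it is also one plausible source of the misattached prefixes and suffixes: once a long match is taken, the function never backtracks. Because it recurses once per chunk, a very long text would also run into Python's default recursion limit, so a loop would be the safer choice for whole articles.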

The program is working reasonably well. Even though the CEDICT has 24k words, it can't cover every possible term. It has particular trouble with suffixes and prefixes, which sometimes get tacked onto the wrong word. But that shouldn't be a problem for somebody whose Chinese is at a reasonable level and who is able to pick that stuff out. As the custom dictionary grows, it should eliminate some of the misses. Also, this will be a good way to build up a contribution to the CEDICT, an open Chinese dictionary project which takes submissions of new entries.
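
To give an idea of what a dictionary entry looks like, here is a small Python sketch that parses one line. It assumes the layout used by present-day CC-CEDICT files, `Traditional Simplified [pin1 yin1] /gloss/gloss/`, which may differ from the file this tool reads. `Entry` and `parse_line` are hypothetical names, and the sample line is only an illustration.

```python
import re
from typing import NamedTuple, Optional


class Entry(NamedTuple):
    traditional: str
    simplified: str
    pinyin: str
    glosses: list


# Assumed layout: two headwords, pinyin in brackets, glosses between slashes.
LINE_RE = re.compile(r"^(\S+)\s+(\S+)\s+\[([^\]]*)\]\s+/(.*)/$")


def parse_line(line: str) -> Optional[Entry]:
    """Return an Entry, or None for comments, blanks, and malformed lines."""
    line = line.strip()
    if not line or line.startswith("#"):
        return None  # comment or empty line
    match = LINE_RE.match(line)
    if match is None:
        return None  # does not follow the assumed layout
    traditional, simplified, pinyin, body = match.groups()
    return Entry(traditional, simplified, pinyin, body.split("/"))


if __name__ == "__main__":
    print(parse_line("影迷 影迷 [ying3 mi2] /movie fan/"))
```

A custom dictionary kept in the same layout could be loaded with the same function, which would also make its entries easy to submit upstream later.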

If anybody is really interested in trying this out, I've put up a copy at my Freeshell site. Please note several things: first, I built this on my RedHat box and Freeshell runs NetBSD, so there may be some bugs/weirdness in certain outputs; second, this tool is hosted on a free server, and has to parse a 10 MB dictionary file every time it runs, so please don't tax it too much (if you want it for heavy usage, e-mail me and I'll send you the files); also, for the moment, this only takes input in UTF-8 encoding (RedHat changes everything on the clipboard to UTF-8; convenient!); finally, the definitions are actually output in the title attributes of span tags around the characters, which Mozilla pops up in little boxes, and I know that Safari doesn't do these pop-ups. If I can find some JavaScript that'll do them, that'd be nice. But like I said, all these limitations will have to be overcome some other day.
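
The span-and-title trick can be sketched in a few lines of Python. This is an illustration, not the tool's own output code: the function name `render_html`, the inline underline style, and the `pinyin: meaning` tooltip format are all assumptions. The input is assumed to be a list of `(chunk, entry)` pairs, where `entry` is either `None` or a `(pinyin, meaning)` tuple.

```python
from html import escape


def render_html(segments):
    """Wrap each recognized word in a span whose title holds its definition."""
    parts = []
    for chunk, entry in segments:
        if entry is None:
            # Unknown text and punctuation are emitted as plain escaped text.
            parts.append(escape(chunk))
            continue
        pinyin, meaning = entry
        # Escape quotes too, so a definition cannot break out of the attribute.
        tooltip = escape(f"{pinyin}: {meaning}", quote=True)
        parts.append(
            f'<span style="text-decoration: underline" title="{tooltip}">'
            f"{escape(chunk)}</span>"
        )
    return "".join(parts)


if __name__ == "__main__":
    sample = [("中文", ("Zhong1 wen2", "Chinese language")), ("!", None)]
    print(render_html(sample))
```

Since the tooltip lives in a plain `title` attribute, whether and how it is displayed is left entirely to the browser.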