Wednesday, October 29, 2008

Yes, it took a while, but here's a new release preview build. The biggest changes from 2.1.2 are:

FFmpegSource2 is the new default audio and video provider, replacing Avisynth. This should provide frame-exact seeking (with keyframe support) on AVI, MKV and MP4 files, as well as other benefits. This is still a bit experimental, however. If you have any issues, just switch back to Avisynth in options.

The DirectSound audio player was reverted to what it was in 2.1.1, since 2.1.2 seems to have critical issues related to it.[Edit: jfs says that the issue was after 2.1.2]

Many small issues around the program were fixed.

VSFilter has been updated to the MPC-HC 2.39 version, which includes jfs's new patches (see this post)

Aegisub is now built against Visual C++ 2008 SP1 runtimes. Hopefully there will be no issues related to this (ASSDraw is still built against 2005 SP1 runtimes, due to library issues). If you can't run Aegisub, try installing this and reporting how it goes.

A listing of other fixes can be found on TheFluff's builds page. This release includes the Brazilian Portuguese (100%), Catalan (99%) and Spanish (99%) translations. All other translations are too outdated, and were left out. If you are willing to update any of them, please let us know.

Thursday, October 16, 2008

The first one adds proper ruby (a.k.a. furigana, in the case of Japanese) support to Firefox. The second one is more interesting: it adds furigana to kanji on websites that don't have it, making it much easier to read Japanese text.

An interesting side-effect of having proper ruby support in Firefox is that the times in blogger look quite odd, with the full date above them.

Tuesday, October 14, 2008

I have noticed that lots of people have no idea what exactly is the whole "Unicode", "UTF-8", "UTF-16" and "UCS-2" stuff, aside from the fact that it's somehow related to the display of foreign characters. The objective of this post is to briefly explain them and dispel some of the myths associated with them.Unicode is a coding system used to represent characters from many languages (including Japanese and Chinese) without the need to change your language locale. If you've tried writing Kanji in Medusa, you know what I'm talking about. In Unicode, characters are given an unique number. For example, the capital letter "A" is U+0041 (65 in decimal), and the Hiragana "ふ" is U+3075 (12405 in decimal). Characters are divided into planes of 65536 characters for convenience. Almost all common characters are in plane 0 (also known as the Basic Multilingual Plane, or BMP), which goes from code points U+0000 to U+FFFF. All kanji are in planes 0 and 2.

UTF-8, UTF-16 and UCS-2 are simply techniques used to encode those values into text files. Windows helped create a myth that Unicode is UTF-16 by calling UTF-16 "Unicode" in applications such as Notepad - but the fact is that UTF-16 is as much Unicode as UTF-8 is.

UCS-2 (UCS = Universal Character Set) is an old encoding system that can store characters from the BMP by simply writing them as 16-bit values. The advantage of this system is that it's simple and covers most characters, but anything outside the BMP will fail catastrophically. That's why some Kanji have issues with some programs. An interesting consequence of UCS-2 is that it allows the mapping of characters that don't exist, such as the ones reserved for surrogate pairs (see the next paragraph).

UTF-16 (UTF = Unicode Transformation Format) builds on UCS-2. Indeed, for characters on the BMP, UTF-16 is identical to UCS-2; the difference lies in planes above the BMP. UTF-16 is capable of representing characters in planes 1 through 16 (even though no planes above 3 are specified yet) with a surrogate pair, that is, it uses two 16-bit values to store a character. This means that you can't measure the length of a UTF-16 string by counting how many 16-bit values it has!

UTF-8 is similar to UTF-16, but a character can be encoded as anything ranging from 1 to 6 bytes, although no character is mapped to anything that would be over 4 bytes long in UTF-8. Similarly to how UTF-16 is identical to UCS-2 for the BMP, UTF-8 is identical to Western encoding for the ASCII range (U+0000 to U+007F), making it "backwards compatible" with software that isn't Unicode-aware. It also means that text that is mostly composed of ASCII characters (such as, say, ASS subtitles) will be much shorter as UTF-8 than as other Unicode formats. That's why Aegisub uses UTF-8 as its standard format.

Regardless of encoding differences, UTF-8 and UTF-16 can both represent ANY Unicode character. UTF-16 can sometimes be shorter than UTF-8, but that will, in practice, never be the case for ASS subtitles, even if they are entirely written in Japanese/Chinese, due to all the ASCII text involved in the format syntax.

Ever since I've posted about Kanamemo, there have been quite a few requests for a "Kanjimemo", a tool based on the same idea, but for Kanji.

Even before, I had considered writing something like that... But I never started because I couldn't quite figure out all the details on how it'd work. On this post, I'll talk about some of the ideas that I had for it. If you're interested in a "Kanjimemo", please leave your feedback and suggestions in the comments!Programming Language

First of all, I'm not sure which programming language to write it in. At first, I considered C++, since that would be the easiest for me and allow the maximum flexibility, at least as far as PCs are concerned. The problem is that I'm already fairly experienced with C++, and so it wouldn't be much of a learning experience (which is always a plus :)) unless I went for Direct3D.

Then I pondered about Java: with all the cell phones supporting J2ME, it seemed like a good idea - Kanjimemo on the go? Great! The real problem came when I realized that J2ME is *REALLY* limited - you often have less than 1 MB of heap memory available (!) for your application, which makes a program like Kanjimemo almost impossible to implement. I also lack a J2ME-enabled cell phone, so I couldn't even work on a J2ME port right away.

A few other languages crossed my mind. C# is something that I've always wanted to learn, but its cross-platform support is quite bad (I'm looking at you, Mono). It's also much slower than Java. Python is another "to learn" language, but I question the sanity of doing complicated data analysis on such a high level and slow language... Plus all the horrible dependencies. Same goes for Ruby.

So, any thoughts on the "language barrier" might be useful.

Basics

On to how the program would ACTUALLY work... Learning kanji is nowhere as easy as learning kana. The problem with kanji is that most of them have multiple (typically two) readings, depending on the word... but some (like 日, one of the most basic kanji) can have many more. So my idea is to have an algorithm that works like this:

Select a group of five or so kanji for each level (like Kanamemo)

Mine EDICT for all words marked as [Common] that use that kanji

Perhaps attempt to extract the pronunciation of your kanji on that word? If that doesn't work, just go with individual words.

Create a list of all the different unique pronunciations and associated words.

Have the user learn all the unique pronunciations, preferably by using words that contain nothing but that kanji and kana.

If there's no word with that kanji by itself, make sure that the user already "learned" all the other kanji in the word displayed.

Of course, steps 3 and 6 might be very tricky to code. All of this will require mining data from EDICT and possibly KANJIDIC. If it becomes necessary, I might use a SQLite database to store this information.

ProgressionProgression would work similarly to Kanamemo, with a new set of 5 kanji unlocked with each memorized set. Ideally, the user could choose profiles to control the new kanji: perhaps follow the JLPT progression, or the japanese school system progression, or how common a given kanji is, or a combination of them (i.e. start with all JLPT4 kanji sorted by frequency, then all JLPT3 sorted by frequency, etc). The user should also be able to customize a list of kanji that he wants to learn.

Given this system, it'd be possible to simply consider kana as being kanji, and have the program work in the same way for those, so you'd be entering actual japanese words when learning kana. This has the advantage of making your japanese reading skill progress.

Multiple fontsOne problem that I noticed with kanamemo is that it was easy to just memorize the font glyph, as opposed to the more abstract shape of the kana. This could prove to be an issue with kana that are very different depending on how they're written (such as さ and ふ). This program would fix that problem by using different types of fonts (cursive, brush, type) randomly, or perhaps by forcing you to learn all the different variation before progressing.

TranslationSince the concept of the program is word-focused, it might feel strange to be learning how to read words without learning what they mean. If you're an anime watcher, then perhaps you already have a relatively big vocabulary of words, but you won't know all of them, and not everyone is an anime watcher. EDICT provides translations, but I'm not sure if just slapping the translations there will do any good... Thoughts on this?

VoiceFinally, it might be useful to have someone read the words out loud for you whenever you get them right. I'm not sure how hard it would be to add support for some third-party voice synthesizer, but it might be worth the trouble.

Other ideasPerhaps the program should be designed to look more like a game? A little mascot cheering for you, a scrolling background, some background music? Perhaps this game could have multiple "stages" that you would do in alternating order: First learn to read the kanji, then what the word means, then perhaps a speed typing test? Maybe even a grammar test mode?

DevelopmentOf course, what this needs the most right now are IDEAS! If you have any, please share them with us. If you know of somebody who might be interested in this sort of thing, link them to this page! If you want to help with the development itself, drop by IRC and let us know. The idea is that this should be an open, free project.

About Aegisub

Aegisub is an advanced subtitle editor for Windows, and UNIX-like systems, such as Linux, Mac OS X and BSD. It is open source software and free for any use.

Aegisub natively works with the Advanced SubStation Alpha format (aptly abbreviated ASS) which allows for many advanced effects in the subtitles, apart from just basic timed text. Aegisub's goal is to support using these advanced functions with ease.