

Kanji Sieve – Analysing Kanji Usage

This is a little FileMaker solution I’ve written.
It takes a piece of pasted Japanese text and analyses the kanji contained in it.

I wrote it as a quick and probably imprecise way of looking at kanji usage in texts. The figure usually quoted is that the 1000 most frequent kanji account for 95% of usage, probably because of the 1998 study of kanji usage in the Asahi Shinbun (Shinbun denshi media no kanji, Sanseido, 1998). I have also seen this stated as 1000 characters allowing you to read 95% of articles (a subtle difference), but I think this is a bit of an overstatement (the thread below suggests 1900 kanji in order to read 95% of compounds). While doing a bit of research on this I came across several other frequency studies and an interesting thread where Jim Breen notes:

…a discussion at a language teaching conference in Japan I attended in 1999, where there was general consensus that
the average Japanese adult could read 700-800 kanji…

I find this a bit hard to imagine (write by hand, maybe). What interests me is the percentage of kyouiku kanji that are used in texts, and which of the remaining jyouyou kanji are used most frequently.

My hypothesis is that the kyouiku kanji are a better medium-term goal for JSL learners than the complete jyouyou set. The diminishing returns on effort for the 939 kanji beyond the kyouiku kanji might suggest approaching these on a need-to-know basis. The old canard (from Heisigists, I suspect) is that leaving out 10% of the alphabet isn’t a good idea. I don’t know. Firstly, a more accurate analogy would be with vocabulary: it’s not so much that you completely ignore the unknown characters as that it is possible to work around them. And there’s a world of difference in effort between learning 3 characters and learning 939 characters. But I digress.
The Asahi Shinbun probably isn’t the source most read by JSL learners either. It might be good to have some statistics on Amazon reviews, mixi blogs, or manga.

Kanji Sieve filters for the six primary school grades and for the remaining jyouyou kanji. It then counts the occurrences of each character. This might allow you to see the most frequently occurring characters in the texts you are interested in.
Characters outside the jyouyou set are not considered.
To assess readability or difficulty, other considerations would need to be addressed, such as the vocabulary used, the length of compounds, and the grammar.
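
For anyone who would rather see the idea than open the FileMaker file, here is a rough Python sketch of the same filter-and-count approach. It is not how the FileMaker solution is built. The group table, the `sieve` function, and the sample sentence are all made up for illustration, and the table holds only a handful of characters per group where a real one would hold the full kyouiku grade lists and the rest of the jyouyou set.

```python
from collections import Counter

# Tiny illustrative sample only (an assumption for this sketch). A real table
# would hold the complete kyouiku lists for grades 1-6 plus the remaining
# jyouyou kanji.
SAMPLE_GROUPS = {
    "grade 1": set("日本学字"),
    "grade 2": set("語新聞読"),
    "grade 3": set("漢"),
    "other jyouyou": set("頻繁"),
}


def sieve(text, groups=SAMPLE_GROUPS):
    """Count each listed character per group; everything else is ignored."""
    counts = {name: Counter() for name in groups}
    for ch in text:
        for name, members in groups.items():
            if ch in members:
                counts[name][ch] += 1
                break  # a character belongs to at most one group
    return counts


# Hypothetical pasted text; kana and punctuation fall through the sieve.
result = sieve("日本語の新聞で漢字を学ぶ。日本の新聞を頻繁に読む。")
for name, counter in result.items():
    # most_common() lists the characters in descending order of frequency
    print(name, counter.most_common())
```

Because anything not in the table is simply skipped, kana, punctuation, and non-jyouyou kanji drop out without special handling, which matches the behaviour described above.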

If I continue to play with this I would like to add an export option, and maybe allow you to collect a series of articles and see the aggregate statistics.
I would also like to incorporate it into my Kanji Notebook, to allow you to look up kanji or add them to a study list or set of flash cards.
I would also like to see if I can extract vocabulary in the same way, but I suspect word boundaries would be an issue there, although Rikaichan manages it….

Comments (4)

As you mentioned, Kanji Sieve doesn’t look too good in Windows when opened with FileMaker Pro (version 11 in my case). However, the bound file opened with the Shiawase engine looks very good. It must be a font usage problem.

Is there any way that fonts/font size could be changed in the program?

The possibility of separation and dictionary look-up of Kanji compound words (and parsing of sentences for hiragana/katakana words) would be a fantastic plus.

The bound file might be different from the standalone… I may have cleaned up the fonts a bit after looking at it in Windows before binding it. By the way, the bound file should also open from within FileMaker 11 if you want. I’m undecided as to whether to just standardise on Windows fonts and looser layouts or to spend time re-jigging layouts for Windows. Cross-platform has always been a bit of an issue in FileMaker. With Roman fonts I’d just happily go with Verdana; I’m not sure what Japanese fonts are available cross-platform.
Actually if I had server access the whole thing would be better driven off a web page.

I hadn’t thought of variable font sizing, I must look into it.

Anyhow, I have version 0.2 ready for the Mac side. Hopefully I’ll get it online over Easter. I’ve added filtering for non-jyouyou and katakana and a button to copy a tab panel to your clipboard. I think parsing for word boundaries is beyond my skill at the moment. (Lately I’ve mostly been working on Flashcards for my Kanji Notebook solution)

I haven’t done any software development in a long time. My main interest these days is learning software (especially for Japanese) from a teacher’s/user’s point of view, and I spend a lot of time evaluating programs. I downloaded FileMaker because I saw that you are using it, although I did take a look at it a few years ago.

I don’t have access to Mac machines so, if possible, could you please let me have a look at a Windows-compatible .USR file for version 0.2?

The best font I have found is Meiryo, which looks great in Japanese or English. It comes with Vista so should be available on the web (try Googling); if you can’t find a copy I’ll let you have one.

Hi Tom,
I installed Meiryo. I also found that I needed ClearType, or again the font metrics looked pretty bad.
I might have to wrench myself away from my favourite fonts and go with Meiryo to keep things OK cross-platform.
(must look at the CSS for my blog as well I guess)
If I can find some spare time I’ll finish off and upload v0.2 later today.