ザ南蛮人日記

Announcing myougiden, a command-line Japanese/English dictionary

Where have I been, you ask? I’ve disappeared for the last two weeks! I didn’t write anything, talked to no one, and was nowhere to be seen!

As it happens, the Muse of Programming possessed me forcefully, and after some intense days in the grip of a mood, I ended up with this:

myougiden is a new JMdict-based dictionary for the command line. If you’re on a POSIX-style system (I think OS X should work, probably, perhaps), and you’re interested in trying it out, refer to the README. Here’s a copy of the current features list for hype:

I just think that degrading to “no indication of long vowel” is superior to degrading to word processor style. I don’t know what the official Kunrei-shiki standard says, but at least in the Hepburn world that’s the done thing (e.g. passports, train station names).

Losing phonemic information hurts my computolinguistic sensibilities (even non-phonemic information: I’m very bothered when I have to write e.g. yokozuna or kanazukai and can’t distinguish underlying /du/ from /zu/. And speaking of that, it should be いなづま, not いなずま. It’s the “wife of the rice”!)

After I was 80% done, the thought popped up that I should have looked into the DICT protocol… It’s true that I do a lot of firulas* like color and “intelligent” guessing, but perhaps it would be possible to write it as a custom server/client pair with protocol extensions, while remaining compatible with existing software. Oh well.

*firula: Unreasonably indulgent design; like, say, a backpack with almost too many kinds of inner divisions (“almost” because it’s never too many).

Thanks! Honesty binds me to confess that myougiden is quite a bit slower than I hoped; partly because it tries to “do what you mean” by running many types of queries until one matches, and regexes unfortunately add a significant performance hit. If the latency gets too uncomfortable, try passing lots of parameters to reduce query guessing. Also, depending on what you need, consider simply grepping edict.utf8 or edict2.utf8 (this method was my primary “dictionary” for many years, and myougiden grew out of that workflow).
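The “do what you mean” cascade can be sketched like this — a minimal illustration only; the table layout, strategies, and function names here are invented for the example, not myougiden’s actual schema:

```python
import sqlite3

# Toy dictionary table; illustrative, not myougiden's real schema.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE entries (kanji TEXT, kana TEXT, gloss TEXT)")
db.execute("INSERT INTO entries VALUES ('稲妻', 'いなづま', 'lightning')")

def lookup(query):
    # Try progressively looser query types, strictest first;
    # the first strategy that returns rows wins.
    strategies = [
        ("exact",
         "SELECT gloss FROM entries WHERE kanji = ? OR kana = ?",
         (query, query)),
        ("prefix",
         "SELECT gloss FROM entries WHERE kanji LIKE ? OR kana LIKE ?",
         (query + "%", query + "%")),
        ("substring",
         "SELECT gloss FROM entries WHERE gloss LIKE ?",
         ("%" + query + "%",)),
    ]
    for name, sql, params in strategies:
        rows = db.execute(sql, params).fetchall()
        if rows:
            return name, [r[0] for r in rows]
    return None, []

print(lookup("稲妻"))   # matches the "exact" strategy
print(lookup("light"))  # falls through to the "substring" strategy
```

Running every failed strategy before the one that matches is exactly where the latency mentioned above comes from.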

(And if anyone has suggestions on how to make this thing faster, I’m all ears! Profiling shows that most of the time is spent on the SQL queries, not on the fluff.)
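That kind of profiling can be reproduced with the standard library alone — a minimal sketch with a stand-in workload, not myougiden’s actual code:

```python
import cProfile
import io
import pstats
import sqlite3

# Toy SQL-heavy workload standing in for dictionary lookups.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE entries (gloss TEXT)")
db.executemany("INSERT INTO entries VALUES (?)",
               [("gloss number %d" % i,) for i in range(1000)])

def workload():
    # LIKE with a leading wildcard forces a full table scan -- the sort
    # of thing that shows up at the top of a profile.
    for i in range(200):
        db.execute("SELECT gloss FROM entries WHERE gloss LIKE ?",
                   ("%%%d%%" % i,)).fetchall()

profiler = cProfile.Profile()
profiler.enable()
workload()
profiler.disable()

# Print the five most expensive calls by cumulative time.
buf = io.StringIO()
pstats.Stats(profiler, stream=buf).sort_stats("cumulative").print_stats(5)
report = buf.getvalue()
print(report)
```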

I would love it if you could add support for reading the EPWING dictionary format as well (http://ja.wikipedia.org/wiki/EPWING). I’ve got some dictionaries in this format, and I’m currently stuck using some Windows-based readers in a VM.

Since myougiden is in Python, I might try to add EPWING support myself, if I get the time.

Yeah, multi-dictionary support has been requested; there’s a ton of neat little stuff to add, but I kinda grew tired of coding for now and am concentrating on nethack my thesis. I’ll try my hand at it when I’m coding again, and of course patches are welcome.

I’m sad this project is called ‘defunct’ on Github because I certainly use it every day! (Usually because I feel bad about hitting beta.jisho.org every few minutes.) It’s plenty fast for me. Good work! The code looks high-quality, so hopefully you (or someone else) will come back to it.

I’m so very sorry; I’m doing a thousand little things and find myself with little energy to code. But I myself use myōgiden every day! So it’s not really defunct, it’s in… suspended animation? (笑)

If you’re tech-savvy, check out the latest branches/commits. I’ve added support for “full text search” (search-engine–like queries), which has sped up most queries by an order of magnitude. (There’s also new support for EDICT/JMdict languages other than English, if you’re interested in that.) One of these days I should package a new release, and eventually add a few important features we’re missing (like de-inflection/lemmatization).
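The full-text-search speedup can be sketched with SQLite’s FTS module — assuming an SQLite backend, which the profiling note above suggests; the schema is illustrative, and FTS5 availability depends on how your SQLite was built:

```python
import sqlite3

# An FTS5 virtual table indexes every word in every gloss, so
# search-engine-like MATCH queries avoid scanning each row with
# LIKE or a regex. Schema is illustrative, not myougiden's.
db = sqlite3.connect(":memory:")
db.execute("CREATE VIRTUAL TABLE glosses USING fts5(gloss)")
db.executemany("INSERT INTO glosses VALUES (?)",
               [("flash of lightning",),
                ("lightning rod",),
                ("thunder",)])

# MATCH consults the full-text index instead of scanning the table.
rows = db.execute(
    "SELECT gloss FROM glosses WHERE glosses MATCH ?", ("lightning",)
).fetchall()
print([r[0] for r in rows])
```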

Please don’t apologize! I saw the issue about non-English JMdict versions (which is where I saw the word ‘defunct’ :P), but I didn’t catch the FTS branch — interesting!

But what I’m most intrigued by is de-inflection/lemmatization. I’m currently using Ve (by Kimtaro, who also runs jisho.org; Ve itself post-processes MeCab: https://github.com/Kimtaro/ve) to separate Japanese sentences into “words” (Ve combines the morphemes found by MeCab into something higher-level than morphemes, nominally “words”), and then I use myougiden to look up the resulting lemmas to make glosses. This way, I have my own linguistics-superpowered version of lingq.com :)

E.g., Ve converts “今朝、我が家で初氷を観測しました” (from a lesson on lingq.com) into approximately the following JSON:
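Roughly like this — an illustrative sketch only: the field names loosely follow Ve’s word objects (word/lemma/part of speech), but this is not Ve’s actual output for that sentence, just the general shape described above:

```python
import json

# Hypothetical Ve-style segmentation of 今朝、我が家で初氷を観測しました.
# Note how 観測しました is lemmatized back to 観測する, which is the
# form a dictionary like myougiden can look up.
words = [
    {"word": "今朝",          "lemma": "今朝",    "part_of_speech": "noun"},
    {"word": "我が家",        "lemma": "我が家",  "part_of_speech": "noun"},
    {"word": "で",            "lemma": "で",      "part_of_speech": "postposition"},
    {"word": "初氷",          "lemma": "初氷",    "part_of_speech": "noun"},
    {"word": "を",            "lemma": "を",      "part_of_speech": "postposition"},
    {"word": "観測しました",  "lemma": "観測する", "part_of_speech": "verb"},
]
print(json.dumps(words, ensure_ascii=False, indent=2))
```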

It’s a fine piece of software that you should be very proud of. You’ve put a lot of work into solving problems that aren’t specific to JMdict. I might read through the source and see about getting it to work with other languages. A CEDICT version would be great.