February 8, 2012

A Kanji word block is a 3 x 3 block of Kanji characters in which each row and each column forms a word, and no character appears more than once. Thus the word block contains six words in all. Using Jim Breen’s WWWJDIC as my reference, I found that there are 12 possible word blocks. Here is one:

大 (ダイ / おお)    中 (チュウ)        小 (ショウ / こ)
Large              Middle            Small

人 (ニン)          間 (ケン / カン)    型 (がた)
Person             Space             Model

数 (スウ)          値 (チ)            化 (カ)
Number             Value             Change

Row words:
大中小 (だいちゅうしょう): L/M/S; Clothing Sizes
人間型 (にんげんがた): Humanoid
数値化 (すうちか): Numeric Conversion

Column words:
大人数 (おおにんずう): Large Number of People
中間値 (ちゅうかんち): Median
小型化 (こがたか): Miniaturization

So a person-space-model is a humanoid, a number-value-change is a numeric conversion, and so on.

February 3, 2012

We show Sam the item.
“What is this?” we ask.
“I don’t know,” says Sam.
We say, “It is the item.”
Again, we ask Sam, "What is this?"
“The item,” Sam repeats.
Sam has received his first lesson.

We show Sam the item.
“What is this?” we ask.
“I’m not sure,” says Sam, “but I think it is the item.”
“Of course it is the item! Why are you not sure?”
Sam has no reply.

The thing is very similar to the item.
We show Sam the thing.
“What is this?” we ask.
“It is the item!” Sam shouts confidently.
“It is not the item! It is the thing!”
Was Sam’s earlier hesitation justified?

We show Lisa the item.
“What is this?” we ask.
“I don’t know,” says Lisa.
We say, “It is the item.”
Again, we ask Lisa, "What is this?"
“The item,” Lisa repeats.
We show Lisa the thing.
“What is this?” we ask.
“I don’t know,” says Lisa.
We say, “It is the thing.”
Again, we ask Lisa, “what is this?”
“The thing,” Lisa repeats.
Lisa has received two lessons.

December 20, 2011

Google has some very cool n-gram data sets available for download. Alas, the files are quite large. For example, the 3-grams are split into 200 zip files, each weighing in at 440 MB. That’s about 88 GB total.

The 1-grams are much lighter, totaling 2 GB. I was able to reduce this to about 35 MB by throwing away the time information (the original files indicate the year each data point came from). That’s smaller than the original files by a factor of 57.
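The year-collapsing step is simple to sketch. Assuming each 1-gram line is tab-separated as token, year, match count, and so on (the exact column layout varies between dataset versions, so treat this as an illustration rather than the actual script I used):

```python
from collections import defaultdict

def collapse_years(lines):
    """Sum match counts across years for each token.

    Assumes tab-separated lines of the form:
        token <TAB> year <TAB> match_count <TAB> ...
    (an assumption; the real column layout depends on the dataset version).
    """
    totals = defaultdict(int)
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        token, match_count = fields[0], int(fields[2])
        totals[token] += match_count
    return dict(totals)
```

Discarding the year column collapses each token's many per-year rows into a single row, which is where most of the size reduction comes from.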

November 19, 2011

The Japanese writing system includes hiragana, katakana, and kanji. While hiragana and katakana represent sounds, kanji represent meanings. There are many good places to start learning kanji. The JLPT N5 kanji list is the set of kanji that appears on the first of five standardized tests. Here is a list of the 100 most commonly used kanji on the internet.

Kanji may represent words by themselves, or they may be combined into larger units of meaning called compounds. I was interested in finding a set of kanji with a lot of combinatorial power — the ability to represent many distinct compounds without using any kanji not in the set.

For this project I used Jim Breen's WWWJDIC, jisho.org, Python, and a greedy heuristic. This solution is unlikely to be optimal. I'd like to share a truly optimal solution (with proof) later, but I don't know how long that will take. Here's what I've got now:
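A greedy heuristic of the kind described might be sketched as follows (hypothetical code, not the original script; ASCII letters stand in for kanji, and `compounds` would come from a dictionary dump):

```python
def greedy_kanji_set(compounds, budget):
    """Greedily build a set of up to `budget` characters that covers
    many compounds. A compound is covered when every character in it
    belongs to the chosen set. Greedy selection is a heuristic, so the
    result is not guaranteed to be optimal.
    """
    all_chars = {c for word in compounds for c in word}
    chosen = set()

    def covered(s):
        return sum(1 for word in compounds if set(word) <= s)

    for _ in range(budget):
        # Add the character that covers the most new compounds.
        best = max(all_chars - chosen,
                   key=lambda c: covered(chosen | {c}),
                   default=None)
        if best is None:
            break
        chosen.add(best)
    return chosen
```

One weakness of pure greedy selection here is that no single character covers a compound by itself, so early picks are nearly arbitrary; a real implementation would want a smarter tie-breaking rule.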

I’ll explain this again in a different way. Minimax search is an algorithm for perfect play in any combinatorial game. To solve Go by minimax search would, however, take a very long time, and so it is not a practical method by itself.

One technique that has been useful in chess is to apply minimax search to a similar game with a new rule. The new rule is this: after n moves, for some particular n, the game ends and the final score is determined by a board evaluation function. The optimal move in this modified game may then be assumed to be a decent move in the original chess position.
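The modified game can be searched with a perfectly generic depth-limited minimax. In this sketch, `moves`, `apply_move`, and `evaluate` are stand-ins for a real game implementation:

```python
def minimax(state, depth, maximizing, moves, apply_move, evaluate):
    """Depth-limited minimax: after `depth` plies the game is scored by
    the board evaluation function instead of being played out."""
    legal = moves(state)
    if depth == 0 or not legal:
        return evaluate(state)
    children = (
        minimax(apply_move(state, m), depth - 1, not maximizing,
                moves, apply_move, evaluate)
        for m in legal
    )
    return max(children) if maximizing else min(children)
```

Real chess programs add alpha-beta pruning and many other refinements on top of this skeleton, but the depth cutoff plus evaluation function is the core idea.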

Since Go has a large number of moves available from most game positions, the values of n for which such calculations are practical are quite small. If we could throw out most possible moves (that is, if we could know that they are suboptimal without search), then we could search more deeply.

Enter locality: the fact that effects on a Go board take a long time to propagate over long distances.

(The shading in this diagram shows the distance to the highlighted region, as defined by the metric discussed in the quoted material).

Suppose that our board evaluation function only cares about a small region R of the board, such as the highlighted region shown above. This means we’re now dealing with a local goal instead of the overall goal of winning the game. Let’s just press on, and at a later date we can figure out how local goals relate to winning the game.

If n=1, then the value within a region depends on the set of states that R can be in after one move. To determine which states are possible, we would need the following information:

If a chain has only one liberty within R, we wish to know if it has an external liberty. This requires looking at points at distance 1 from R.

If a chain has no liberties within R, then it must have an external liberty; but whether it can be captured depends on whether it has two external liberties. This again requires looking at points at distance 1.

If an empty point in R is bordered only by (WLOG) white stones, then the legality of a black play there may depend on whether it captures a white chain that is itself at distance 1 from R. Answering this question involves looking at points at distance 2 from R.

This is the extent of the information we may need about the rest of the board. So the set of reachable states in R depends only on a radius-2 neighborhood of R.

Now we can proceed by induction. Suppose that the depth-(n-1) value of R depends only on a neighborhood of radius 2(n-1). Then to get the depth-n value, we would simply look at all the possible states that the radius-2(n-1) neighborhood can be placed into by one move, and take the max (or min) of their depth-(n-1) values. The set of possible states depends only on a radius-2 neighborhood of the radius-2(n-1) neighborhood; that is, it depends only on a radius-2n neighborhood of R. Thus the depth-n value depends only on the radius-2n neighborhood.
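For concreteness, here is a sketch of computing the radius-k neighborhood of a region on a square board, assuming ordinary grid (taxicab) distance stands in for the metric above:

```python
def neighborhood(region, k, size):
    """Return every point within grid distance k of `region`
    on a size x size board, including the region itself."""
    seen = set(region)
    frontier = set(region)
    for _ in range(k):
        # Expand the frontier outward by one step of grid distance.
        nxt = set()
        for (x, y) in frontier:
            for dx, dy in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                p = (x + dx, y + dy)
                if 0 <= p[0] < size and 0 <= p[1] < size and p not in seen:
                    nxt.add(p)
        seen |= nxt
        frontier = nxt
    return seen
```

In the induction above, a depth-n local search would use `neighborhood(R, 2 * n, size)` as the only part of the board it needs to inspect.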

Of course, this bound is not very tight; it would be possible to achieve much tighter bounds.

June 14, 2010

I’m writing a book about Go strategy. I’ll be comparing moves by professionals to moves by amateurs in the 1-3k range. My hope is that amateur players may improve by trying to imitate the professionals.

I ran into this problem in the context of a computer game I’m developing. Rather than storing every detail of the game world in the map file, I store a random seed that is used to generate most map features. I found that every time I tweaked certain parameters in my code, old map files became unrecognizable.

Thus, we would like a random sampling algorithm that is unlikely to change its output when we make a small change to the underlying probability distribution (assuming we keep our original random seed). In particular, if we increase the probability of one outcome and renormalize the others, then the only way the outcome should change is if it changes to the outcome whose probability we increased.

If you want to solve the problem yourself, this would be a good time to stop reading.

I shared this problem with Michael Dobbins and he solved it for me.

Mike’s method is to associate each face of a simplex with one of the outcomes. We pick a random point inside the simplex, measure the distance from that point to each face, and divide each distance by the probability of the associated outcome. The outcome with the lowest ratio wins.
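Here is one way the method might be sketched in Python. It assumes a regular simplex, so that distance to a face is proportional (with the same constant for every face) to the barycentric coordinate of the opposite vertex; a uniform point in the simplex is then drawn by normalizing independent exponentials:

```python
import random

def stable_sample(probs, seed):
    """Pick an outcome index with the given probabilities, stably.

    A uniform random point in the simplex is generated via normalized
    exponentials; its barycentric coordinates play the role of the
    distances to the faces. Dividing each by the associated probability
    and taking the minimum selects outcome i with probability probs[i].
    """
    rng = random.Random(seed)
    coords = [rng.expovariate(1.0) for _ in probs]
    total = sum(coords)
    bary = [c / total for c in coords]
    return min(range(len(probs)), key=lambda i: bary[i] / probs[i])
```

The stability property falls out directly: increasing `probs[i]` and renormalizing the rest scales every other ratio up by one common factor and scales ratio i down, so with a fixed seed the winner can only change to outcome i.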