Friday, December 23, 2011

Once the AI Class was over, I wanted to take some time to study the second optional NLP problem with Python. Finding a solution is not difficult (because even an imperfect one will yield the answer anyway), but devising a satisfying process to do so (i.e. a general way of solving similar problems) is much harder.

Problem Representation

The first task is working up the problem representation. We need to parse:

A useful data structure in this context is a list of 19 sublists, each corresponding to a column of 8 tokens. The reason for this is that we need to easily shuffle around the columns, which are the atomic parts of the problem (i.e. those which cannot change):
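A minimal sketch of such a column-wise representation, in Python 3, assuming the puzzle arrives as pipe-delimited rows. The tiny 3-column grid and the name `parse_columns` are placeholders of mine, not the actual puzzle or the original code:

```python
def parse_columns(text):
    """Turn pipe-delimited rows into a list of columns (one sublist per column)."""
    rows = [line.strip().strip('|').split('|')
            for line in text.strip().splitlines()]
    # zip(*rows) transposes rows into columns, so each column stays intact
    # when the columns are shuffled around later.
    return [list(col) for col in zip(*rows)]


# Toy stand-in for the real 19 x 8 grid (hypothetical data).
toy = """
|de|  | f|
|rn|nk|ow|
"""
grid = parse_columns(toy)
```

With this layout, `grid[i][j]` is the token in column `i`, row `j`, so moving a column is a single list operation.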

There are certainly more elegant ways to handle this, but this is enough for the discussion. To revert back from this representation to a readable form:

def repr(grid):
    return '\n'.join(['|%s|' % '|'.join([grid[i][j] for i in range(19)]) for j in range(8)])

We need to consider two abstract components for solving this problem: the state space exploration, and the evaluation function we must use to guide it, since there's clearly no way we can brute-force the 19! possible configurations.

State Space Exploration

For the exploration function, here is a simple approach: from a given grid configuration, consider all the possible grid reorderings, resulting from having an arbitrary column reinserted to an arbitrary position. Then expand them in the order of their score ranking. Here is an example showing two possible column reorderings from the initial state:

Note that this is not the same as simply swapping columns: from "a,b,c", for instance, we'd like to derive "b,a,c", "b,c,a", "c,a,b" and "a,c,b", which wouldn't be possible with only one swapping step. One way to do this is:
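As an illustration (a sketch in Python 3, not necessarily how the original function was written), every distinct reinsertion can be generated by removing one column and putting it back at each other index. The name `reinsertions` is my own:

```python
def reinsertions(grid):
    """Return every distinct grid obtained by moving one column elsewhere."""
    columns = tuple(tuple(col) for col in grid)   # hashable form of the grid
    seen = set()
    for i in range(len(columns)):
        rest = columns[:i] + columns[i + 1:]
        for j in range(len(columns)):
            if j == i:
                continue  # putting the column back where it was changes nothing
            seen.add(rest[:j] + (columns[i],) + rest[j:])
    return seen


moves = reinsertions(['a', 'b', 'c'])
```

For "a,b,c" this yields exactly the four orderings listed above; the set removes the duplicates that arise because moving a column one step right is the same as moving its neighbor one step left.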

Again it's quite certain that nicer solutions exist for this: in particular, I use sets to avoid repetitions (both for the already explored states overall and the reinsertion configs for a given grid), which thus requires those ugly "tuplify/listify" operations (a list is editable but not hashable, while a tuple is the opposite). We can then use this function in a greedy and heuristic way, always expanding first the best reordering from any given state (using a heap to make our life easier):
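A hedged sketch of such a heap-driven exploration, in Python 3. The `is_goal` and `max_steps` parameters are additions of mine so that the example terminates, and the toy state space (letters to be sorted) stands in for the real grid and language-model score:

```python
import heapq


def best_first(start, neighbors, score, is_goal, max_steps=10000):
    """Always expand the highest-scoring state seen so far."""
    # heapq is a min-heap, so scores are negated to pop the best state first.
    frontier = [(-score(start), start)]
    explored = {start}
    steps = 0
    while frontier and steps < max_steps:
        _, state = heapq.heappop(frontier)
        if is_goal(state):
            return state, steps
        steps += 1
        for nxt in neighbors(state):
            if nxt not in explored:
                explored.add(nxt)
                heapq.heappush(frontier, (-score(nxt), nxt))
    return None, steps


def reinsert(state):
    """All states reachable by moving one item to another position."""
    for i in range(len(state)):
        rest = state[:i] + state[i + 1:]
        for j in range(len(state)):
            if j != i:
                yield rest[:j] + (state[i],) + rest[j:]


def ordered_pairs(state):
    """Toy score: number of adjacent items already in alphabetical order."""
    return sum(a < b for a, b in zip(state, state[1:]))


found, steps = best_first(tuple('dacbe'), reinsert, ordered_pairs,
                          is_goal=lambda s: list(s) == sorted(s))
```

Because the frontier keeps every unexplored state, a poor early choice is not fatal: the search can later fall back on a state it set aside.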

This is obviously greedy, because it might easily overlook a better, but more convoluted solution, while blindly optimizing the current step. Our only hope is that the evaluation function might be more clever in compensation. Note also that we don't stop, simply because there is no obvious criterion for doing so. We could define a tentative one, but since that's not the goal of the exercise, let's just proceed by inspection (once I knew the answer, I cheated by hardcoding a stopping criterion, just for the purpose of counting the number of steps to reach it). But then what about the scores? But just before..

Some Doubts..

At first, my solution wasn't based on a heap: it was simply considering the best configuration at any point, and then would forget about the others (in other words: it used a single-element search frontier). But I had doubts: was this strategy guaranteed to eventually exhaust the space of the possible permutations of a given set? If not, even in theory, it would seem to be a potential problem for the generality of the solution.. I'm not sure about the math required to describe the relationship between the n! permutations of a set and the n^2 - 2n + 1 element reorderings (there must be a technical term for this?) from any configuration, but after having performed some practical tests, I reached the conclusion that using a heap-based method (i.e. a frontier with many elements) was more sound, because although I cannot prove it, I'm pretty certain that it is guaranteed to eventually exhaust the permutation space, whereas the first method is not. In the context of this problem, it doesn't make a difference though, because the search space is so large that we would have to wait a very long time before we see these two closely related strategies behave differently.
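For what it's worth, the count of reorderings can be checked with a short argument (my own, assuming all n elements are distinct): there are n choices of element and n - 1 new positions for it, but each swap of two neighbors is counted twice, once for each of the two elements involved:

```latex
\underbrace{n(n-1)}_{\text{(element, new position) pairs}}
\;-\;
\underbrace{(n-1)}_{\text{adjacent swaps counted twice}}
\;=\; (n-1)^2 \;=\; n^2 - 2n + 1
```

For n = 3 this gives 4, matching the "a,b,c" example, and for 19 columns it gives 324 neighbors per state. Since adjacent swaps are among the reorderings, and adjacent swaps alone can produce any permutation, every configuration is reachable from every other one; this supports the intuition that a frontier which never discards states will eventually visit them all.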

Language Modeling

Enter language modeling.. this is the component we need for the evaluation function, to tell us if we are heading in the right direction or not, in terms of unscrambling the hidden message. My first intuition was that character-based n-grams would work best. Why? Because while exploring the state space, due to the problem definition, most of the time we are dealing with partial or scrambled words, not complete (i.e. real) ones. Thus a character-based n-gram should be able to help, because it works at the sub-lexical level (but happily not exclusively at this level, as it retains its power once words begin to get fully formed, which is what we want). To do this, I used the venerable Brown Corpus, a fairly small (by today's standards) body of text containing about 1M words, which should be enough for this problem (note that I could have also used NLTK):
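To make the idea concrete, here is a small Python 3 sketch of character n-gram counting and the resulting maximum-likelihood estimate. The one-sentence `corpus` is a toy stand-in for the Brown Corpus, and the function names are mine:

```python
from collections import Counter


def char_ngram_counts(text, n):
    """Count every run of n consecutive characters in text."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def mle(ngram, counts, context_counts):
    """P(last char | preceding n-1 chars), by maximum likelihood."""
    context = context_counts[ngram[:-1]]
    return counts[ngram] / context if context else 0.0


corpus = "the cat sat on the mat"      # toy stand-in for the Brown Corpus
trigrams = char_ngram_counts(corpus, 3)
bigrams = char_ngram_counts(corpus, 2)
p = mle("the", trigrams, bigrams)      # how often "th" is followed by "e"
```

Note that such a model can score a fragment like "th" or "at " without knowing whether it belongs to a complete word, which is the property wanted here.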

Another debatable aspect: I use a very small value as the probability of combinations that have never been seen, instead of using a proper discounting method (e.g. Laplace) to smooth the MLE counts: I thought it wouldn't make a big difference, but I might be wrong. The outcome however is that I cannot talk about probabilities, strictly speaking, so let's continue with scores instead (which means something very close in this context).

The final required piece is the already mentioned function to compute the score for a given grid:

A couple of things to note: I use the log-likelihood because it is more numerically stable, and I also use a simple interpolation method (with uniform weights) to combine models of different orders.
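The ingredients just listed (a floor value for unseen n-grams, log-likelihood, uniform interpolation across orders) can be sketched as follows in Python 3. This is only my reading of the description: `FLOOR`, `train` and `score` are hypothetical names and values, and the text to score is assumed to be the grid already flattened into a string:

```python
import math
from collections import Counter

FLOOR = 1e-10   # stand-in probability for unseen n-grams (arbitrary choice)


def train(text, max_n):
    """Counts of all character n-grams of length 1..max_n."""
    counts = Counter()
    for n in range(1, max_n + 1):
        counts.update(text[i:i + n] for i in range(len(text) - n + 1))
    counts[''] = len(text)   # the (empty) context of a unigram
    return counts


def score(text, counts, max_n):
    """Sum of log interpolated scores, with uniform weights over the orders."""
    total = 0.0
    for i in range(len(text)):
        p = 0.0
        for n in range(1, max_n + 1):
            if i - n + 1 < 0:
                continue                 # not enough left context yet
            gram = text[i - n + 1:i + 1]
            seen, ctx = counts[gram], counts[gram[:-1]]
            p += (seen / ctx if seen else FLOOR) / max_n
        total += math.log(p)             # logs avoid underflow on long texts
    return total


model = train("the cat sat on the mat", 3)
good, bad = score("the cat", model, 3), score("tce hat", model, 3)
```

The two test strings contain the same letters, so only the higher-order n-grams separate them, and the unscrambled one scores higher.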

So.. does this work? Not terribly well unfortunately.. Although with some tuning (the most important aspect being the order N of the model) it's possible to reach somewhat interesting states like this:

from which it's rather easy to guess the answer (we are so good at this actually that there's even an internet meme celebrating it), for some reason my experiments seemed to always find themselves stuck in some local minima from which they could not escape.

The Solution

Considering the jittery nature of my simplistic optimization function (although you prefer it to go up, there is no guarantee that it will always do so), I pondered for a while about a backtracking mechanism, to no avail. The real next step is rather obvious: characters are probably not enough, the model needs to be augmented at the level of words. The character-based model should be doing most of the work in the first steps of the exploration (when the words are not fully formed), and the word-based model should take over progressively, as we zero in on the solution. It's easy to modify the previous code to introduce this new level:

After some tinkering, I found that character-based 6-grams augmented with word unigrams was the most efficient model, as it solves the problem in 15 steps only. Of course this is highly dependent on the ~890K training parameters obtained using the Brown corpus, as with some other text it would probably look quite different. I'm not sure if this is the optimal solution (it's rather hard to verify), but it should be pretty close.

Tuesday, November 29, 2011

Earlier today I was reading about the Tuesday Birthday Problem (which curiously doesn't seem to have its own entry on Wikipedia.. maybe it is known under a different name?) and although I was convinced by the argument, I thought that a little simulation would help deepen my understanding of this strange paradox (or at least make it a little more intuitive). The problem I had is how to represent, in a clear way, some a priori knowledge (namely, the fact that one of the children is a son born on a Tuesday) in a numerical simulation?

Since directly modeling the conditional distribution wouldn't be trivial, an easier way to do it is by using rejection sampling: iterate over a set of randomly generated family configurations, and reject those that do not match the given fact, i.e. those not containing at least one son born on a Tuesday. From the configurations that passed the test, the proportion of those in which the other child is also a son (born on whatever day) should yield the answer (which of course is not 1/2, as intuition first strongly suggests):
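A minimal Python 3 sketch of this rejection sampling, assuming two children per family, sexes and weekdays uniform and independent. The function name, the trial count and the seed are my own choices:

```python
import random


def tuesday_boy(trials=200_000, seed=0):
    """Estimate P(both boys | at least one boy born on a Tuesday)."""
    rng = random.Random(seed)
    accepted = both_boys = 0
    for _ in range(trials):
        # A child is (sex, weekday); weekday 2 stands for Tuesday.
        family = [(rng.choice('BG'), rng.randrange(7)) for _ in range(2)]
        if ('B', 2) not in family:
            continue              # reject: does not match the known fact
        accepted += 1
        both_boys += all(sex == 'B' for sex, _ in family)
    return both_boys / accepted


estimate = tuesday_boy()
```

Under these assumptions the exact value is 13/27 (about 0.48): of the 196 equally likely families, 27 contain a Tuesday-born son, and 13 of those have two sons. The estimate should land close to that.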

Thursday, October 6, 2011

I did what I suggested in my last post, and finally read about Peter Norvig's constraint propagation method for solving Sudoku. On one hand it's quite humbling to discover a thinking process so much more elaborate than what I could achieve, but on the other, I'm glad that I didn't read it first, because I wouldn't have learned as much from it.

It turns out that my insights about search were not so far off the mark... but then the elimination procedure is the real deal (in some cases, it can entirely solve an easy problem on its own). In fact the real power is unleashed when the two are combined. The way I understand it, elimination is like a mini-search, where the consequences of a move are carried through to their logical conclusion, revealing, many steps ahead, if it's good or not. It is more than a heuristic, it is a solution space simplifier, and a very efficient one at that.

My reaction when I understood how it worked was to ask myself if there's a way I could adapt it for my current Python implementation, without modifying it too much. It is not exactly trivial, because the two problem representation mechanisms are quite different: Peter Norvig's one explicitly models the choices for a given square, while mine only does it implicitly. This meant that I couldn't merely translate the elimination algorithm in terms of my implementation: I'd have to find some correspondence, a way to express one in terms of the other. After some tinkering, what I got is a drop-in replacement for my Sudoku.set method:

Ok.. admittedly, it is very far from being as elegant as any of Peter Norvig's code.. it is even possibly a bit scary.. but that is the requirement, to patch my existing method (i.e. to implement elimination without changing the basic data structures). Basically, it complements the set method to make it seek two types of things:

a square with a single possible value

a row/column/box with a value that has only one place to go

Whenever it finds one of these, it recursively calls itself, to set it right away. While doing that, it checks for certain conditions that would make this whole chain of moves (triggered by the first call to set) invalid:

a square with no possible value

a row/column/box with a set of unused values that is not equal to the set of values having a place to go (this one was a bit tricky!)
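The four conditions above can be illustrated for a single row, column or box with the following Python 3 sketch. It is not the replacement for Sudoku.set itself: `scan_unit` and its arguments are hypothetical, and the candidate sets are assumed to be computed elsewhere.

```python
def scan_unit(candidates, unused):
    """Inspect one row, column or box.

    candidates: {square: set of values still possible} for its empty squares
    unused:     set of values not yet placed in it
    Returns the forced moves [(square, value)], or None on a contradiction.
    """
    forced = []
    placeable = set()
    for square, values in candidates.items():
        if not values:
            return None                       # a square with no possible value
        placeable |= values
        if len(values) == 1:
            forced.append((square, next(iter(values))))   # single possible value
    if placeable != unused:
        return None                           # an unused value has nowhere to go
    for value in unused:
        homes = [sq for sq, vals in candidates.items() if value in vals]
        if len(homes) == 1 and (homes[0], value) not in forced:
            forced.append((homes[0], value))  # value with only one place to go
    return forced


row = {(0, 0): {1, 2}, (0, 1): {2}, (0, 2): {2, 3}}
moves = scan_unit(row, {1, 2, 3})
```

Each forced move would then be applied right away, which may in turn force further moves in the other rows, columns and boxes that share the square.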

So you'll notice that this is not elimination per se, but rather.. something else. Because there's really nothing to eliminate, this is what happens to the elimination rules when they are forced through an unadapted data structure. Peter Norvig's implementation is of course much more elegant and efficient than this, for many reasons. I wasn't interested in efficiency this time, but rather in finding a correspondence between the two methods.

Finally, we need to adapt the solver (or search method). The major difference with the previous greedy solver (the non-eliminative one) is the fact that a move is no longer a single change we do to the grid (and that can be easily undone when we backtrack). This time, an elimination call can change many squares, which is a problem with this method, because we cannot do all the work with the same Sudoku instance, for backtracking purposes, and such an instance is not as efficiently copied as a dict of strings. There are probably many other ways, but to keep the program object-oriented, here is what I found:

Again it's not terribly elegant (nor as efficient) but it works, in the sense that it yields the same search tree as Peter Norvig's implementation. Just before doing an elimination (triggered by a call to set), we deepcopy the current Sudoku instance (self), and perform the elimination on the copy instead. If it succeeds, we carry the recursion over with the copy. When a solution is found, the instance is returned, so that's why this method has to be called like this:

S = S.solveGreedilyWithConstraintPropagation()

To illustrate what's been gained with this updated solver, here are its first 6 recursion layers, when run against the "hard" problem of my previous post:

Sunday, October 2, 2011

Or.. Some Variations on a Brute-Force Search Theme

While browsing for code golfing ideas, I became interested in Sudoku solving. But while by definition Sudoku code golfing is focused on source size (i.e. trying to come up with the smallest solver for language X, in terms of number of lines, or even characters), I was more interested in writing clear code, to hopefully gain a few insights into the problem.

and it can actually fit Sudoku variants of different size: 9x9, 8x8, 6x6. The only thing that changes for each is the definition of the box method for finding the "coordinates" of a box, given a square position.

Validity Checking

One interesting aspect to note about this implementation is the use of a series of set data structures (for rows, columns and boxes, wrapped in defaultdicts, to avoid the initialization step), to make the validation of a "move" (i.e. putting a value in a square) more efficient. To be a valid move (according to Sudoku rules), the value must not be found in the corresponding row, column or box, which the isValid method can tell very quickly (because looking for something in a set is efficient), by simply checking that it is not in any of the three sets. In fact many Sudoku code golf implementations, based on a single list representation (rather than a two-dimensional grid), use a clever and compact trick for validity checks (along the lines of):

which you'll find, after having scratched your head for a while, that although it does indeed work... is actually less efficient, because it relies on two nested loops looking at all elements (hence is in O(size^2)), whereas my technique:

by exploiting the set data structure, actually runs slightly faster, in O(size^1.5).
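A minimal Python 3 sketch of the set-based bookkeeping described above. The method names `isValid`, `set` and `box` follow the text, but the class name `Grid`, the `unset` method and every other detail are assumptions of mine rather than the original implementation:

```python
from collections import defaultdict


class Grid:
    """Tracks the values already used in each row, column and box."""

    def __init__(self, box_height=3, box_width=3):
        self.box_height, self.box_width = box_height, box_width
        # defaultdict(set) creates each set on first use, so nothing
        # needs to be initialized up front.
        self.rows = defaultdict(set)
        self.cols = defaultdict(set)
        self.boxes = defaultdict(set)
        self.values = {}

    def box(self, i, j):
        """'Coordinates' of the box containing square (i, j)."""
        return (i // self.box_height, j // self.box_width)

    def isValid(self, i, j, v):
        # Three set lookups instead of scanning the whole grid.
        return (v not in self.rows[i] and v not in self.cols[j]
                and v not in self.boxes[self.box(i, j)])

    def set(self, i, j, v):
        self.values[(i, j)] = v
        self.rows[i].add(v)
        self.cols[j].add(v)
        self.boxes[self.box(i, j)].add(v)

    def unset(self, i, j):
        v = self.values.pop((i, j))
        self.rows[i].remove(v)
        self.cols[j].remove(v)
        self.boxes[self.box(i, j)].remove(v)
```

Only `box` depends on the variant being played, e.g. `Grid(box_height=2, box_width=3)` for a 6x6 grid with 2x3 boxes.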

Sequential Solver

With all that, the only piece missing is indeed a solver. Although there are many techniques that try to exploit human-style deduction rules, I wanted to study the less informed methods, where the space of possible moves is explored, in a systematic way, without relying on complex analysis for guidance. My first attempt was a brute-force solver that would simply explore the available moves, from the top left of the grid to the bottom right, recursively:
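For illustration, here is a self-contained Python 3 sketch of such a sequential brute-force solver. It works on a plain list of lists with 0 for empty squares (not the class used in this post), and the 4x4 puzzle is a toy example of mine, chosen so that it runs instantly:

```python
def candidates(grid, i, j, box=2):
    """Values allowed at (i, j) by the row, column and box rules."""
    size = len(grid)
    used = set(grid[i]) | {grid[r][j] for r in range(size)}
    bi, bj = i - i % box, j - j % box        # top-left corner of the box
    used |= {grid[r][c] for r in range(bi, bi + box)
             for c in range(bj, bj + box)}
    return [v for v in range(1, size + 1) if v not in used]


def solve_sequential(grid, pos=0, box=2):
    """Fill empty squares (0) left to right, top to bottom, backtracking."""
    size = len(grid)
    if pos == size * size:
        return True                          # every square is filled
    i, j = divmod(pos, size)
    if grid[i][j]:
        return solve_sequential(grid, pos + 1, box)   # given square: skip it
    for v in candidates(grid, i, j, box):
        grid[i][j] = v
        if solve_sequential(grid, pos + 1, box):
            return True
    grid[i][j] = 0                           # undo before backing up
    return False


puzzle = [[1, 0, 0, 0],
          [0, 0, 3, 0],
          [0, 4, 0, 0],
          [0, 0, 0, 2]]
solved = solve_sequential(puzzle)
```

The grid is modified in place; for a 9x9 puzzle the same functions would be called with `box=3`.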

This solver is not terribly efficient. To see why, we can use a simple counter that is incremented every time the solver function is called: 4209 times for the "easy" puzzle (above), and a whopping 69,175,317 times for the "harder" one! Clearly there's room for improvement.

Random Solver

Next I wondered how a random solver (i.e. instead of visiting the squares sequentially, pick them in any order) would behave in comparison:

This is really worse... sometimes by many orders of magnitude (it is of course variable). I'm not sure I fully understand why, because without thinking much about it, it would seem that it is not any more "random" than the sequential path choosing of the previous method. My only hypothesis is that the sequential path choosing works best because it is row-based: a given square at position (i, j) benefits from a previous move made at (i, j-1), because the additional constraint directly applies to it (as well as to all the other squares in the same row, column or box), by virtue of the Sudoku rules. Whereas with random choosing, it is very likely that this benefit will be lost, as the solver keeps randomly jumping to possibly farther parts of the grid.

Greedy Solver

While again studying the same code golf implementations, I noticed that they're doing another clever thing: visiting first the squares with the least number of possible choices (instead of completely ignoring this information, as the previous methods do). This sounded like a very reasonable heuristic to try:

Performance is way better with this one: only 52 calls for the easy problem, and 10903 for the hard one. This strategy is quite simple: collect all the possible values associated with every square, and visit the one with the minimum number (without bothering about ties). However, even though this solver clearly performs better, it's important to note that a single call (i.e. for a particular square) is now less efficient, because it has to look at every square, to find the one with the fewest choices (whereas the sequential solver didn't have to choose, as the visiting order was fixed). This is the price to pay for introducing a little more wisdom in our strategy, but there are however two easy ways we can speed things up (not in terms of number of calls this time, but rather in terms of overall efficiency): first, whenever we find a square with no possible choice, we can safely back up right away (right in the middle of the search), because we can be sure that this is not a promising configuration. Second, whenever we find a square with only one choice, we can stop the search and proceed immediately with the result just found, because the minimum is what we are looking for anyway. Applying those two ideas, the solver then becomes:
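A self-contained Python 3 sketch of this strategy with both shortcuts, again on a plain list of lists (0 for empty) rather than the class used in this post. The function names and the 4x4 toy puzzle are mine:

```python
def options(grid, i, j, box=2):
    """Values allowed at (i, j) by the row, column and box rules."""
    size = len(grid)
    used = set(grid[i]) | {grid[r][j] for r in range(size)}
    bi, bj = i - i % box, j - j % box
    used |= {grid[r][c] for r in range(bi, bi + box)
             for c in range(bj, bj + box)}
    return [v for v in range(1, size + 1) if v not in used]


def solve_greedy(grid, box=2):
    """Visit the empty square with the fewest options first."""
    best = None
    for i, row in enumerate(grid):
        for j, value in enumerate(row):
            if value:
                continue
            opts = options(grid, i, j, box)
            if not opts:
                return False             # shortcut 1: dead end, back up now
            if best is None or len(opts) < len(best[2]):
                best = (i, j, opts)
            if len(opts) == 1:
                break                    # shortcut 2: cannot beat one option
        if best and len(best[2]) == 1:
            break
    if best is None:
        return True                      # no empty square left: solved
    i, j, opts = best
    for v in opts:
        grid[i][j] = v
        if solve_greedy(grid, box):
            return True
    grid[i][j] = 0                       # undo before backing up
    return False


puzzle = [[1, 0, 0, 0],
          [0, 0, 3, 0],
          [0, 4, 0, 0],
          [0, 0, 0, 2]]
solved = solve_greedy(puzzle)
```

On an easy puzzle like this one there is always some square with a single option, so the solver never has to guess.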

First, here is the exploration path taken by the sequential solver (read from left to right; each node has the square's i and j coordinates, as well as its chosen value):

In contrast, here is the path taken by the greedy solver:

Whenever the solver picks the right answer for a certain square, the remaining puzzle becomes more constrained, hence simpler to solve. Picking the square with the minimal number of choices minimizes the probability of an error, and so also minimizes the time lost in wrong path branches (i.e. branches that cannot lead to a solution). The linear path shown above is an optimal way of solving the problem, in the sense that the solver never faces any ambiguity: it is always certain to make the right choice, because it follows a path where there is only one, at each stage. This can also be seen on this harder 9x9 problem:

with which the sequential solver obviously has a tougher job to do (read from top to bottom; only the first 6 recursion levels are shown):

But even though its path is not as linear as with simpler puzzles (because ambiguity is the defining feature of harder problems), the greedy solver's job is still without a doubt less complicated:

This last example shows that our optimized solver's guesses are not always the right ones: sometimes it needs to back up to recover from an error. This is because our solver employs a greedy strategy, able to efficiently find an optimal solution to a wide variety of problems, which unfortunately doesn't include Sudoku. Because it is equivalent to a graph coloring problem, which is NP-complete, there is in fact little hope of finding a truly efficient strategy. Intuitively, the fundamental difficulty of the problem can be seen if you imagine yourself at a particular branching path node, trying to figure out the best way to go. There is nothing there (or from your preceding steps) that can tell you, without a doubt, which way leads to success. Sure you can guide your steps with a reasonable strategy, as we did, but you can never be totally sure about it, before you go there and see for yourself. But sometimes by then, it is too late, and you have to go back...

Preventing Unnecessary Recursion

The last thing I wanted to try was again inspired by the code golfing ideas cited above: for the moves with only one possible value, why not try to avoid recursion altogether (note that although the poster suggests this idea, I don't see it actually implemented in any of the code examples, at least not the way I understand it). Combining it with the previous ideas, this one can be implemented like this:

The single-valued moves are now handled in a while loop (with a properly placed continue statement), instead of creating additional recursion levels. The only gotcha is the additional bookkeeping needed to "unset" all the tried single-valued moves (in case there were many of them, chained in the while loop) at the end of an unsuccessful branch (just before both places where False is returned). Because of course a single-valued move is not guaranteed to be correct: it may be performed in a wrong branch of the exploration path, and thus need to be undone when the solver backs up. This technique is interesting, as it yields a ~85% improvement on the problem above, in terms of number of calls. Recursion could of course be totally dispensed with, but I suspect that this would require some important changes to the problem representation, so I will stop here.

Note that you are still responsible for managing any transaction externally:

>>> conn.commit()

With the 'return_id' option (which restricts the default 'returning *' clause to the primary key's value, which is assumed to be named '<table>_id'), the insert/update above could also be done this way:

because it trims any extra items in 'values' (i.e. corresponding to columns not belonging to the table). Note that since this option requires an extra SQL query, it makes a single call a little less efficient.

You can always append additional projection elements to a select query with the 'what' argument (which can be a string, a list or a dict, depending on your needs):