2017-04-17

In the last couple of days, I was ill and had to stay in bed, so I've used this time also to tidy up the work that accumulated over the past year in cl-nlp. That was especially timely, considering the interest that was expressed in using it by some people who I've met at the recent Lisp-related events.

I've even assembled a rough checklist of the things that need to be finished to get it to v.1.0 and beyond.

Besides, after finishing the basic cleaning, I've returned to one of the programming tasks that has racked my head for long: tree pretty-printing. In NLP, we constantly have to deal with various versions of parse trees, like the constituency or dependency ones, but the problem is that they are not easily visualized. And good visualization plays, at least for me, a critical role in effective debugging, ideation and programming. It's an essential part of a solid interactive experience that is one of the fundamental traits of Lisp development.

For instance, a constituency tree is usually presented as a Lisp list. Here's an infamous example from the Penn Treebank:

However, writing the visualization code was one of the most intimidating programming tasks I've ever encountered. One explanation is that trees are most naturally processed in depth-first order top-down, while the visualization requires bottom-up BFS approach. The other may be that pixel-perfect (or, in this case, character-perfect display is always tedious). As far as I'm concerned, this is not a sufficient explanation, but I couldn't find any other. The ugliest part of this machinery is deps->levels function that prints the dependency relations in a layered fashion. The problem is to properly calculate minimal space necessary to accommodate both tokens and dependency labels and to account for different cases when the token has outgoing dependency arcs or doesn't. In theory sounds pretty easy, but in practice, it turned out a nightmare.

And all of this assumes projective trees (non-intersecting arcs), as well as doesn't know how to show on one level two arcs going from one token in two directions. Finally, I still couldn't align the two trees (constituency and dependency) above and under the sentence. Here's the target:

Well, the good news is that it is usable, but it still needs more work to be feature complete. I wonder what was I doing wrong: maybe, someone can come up with a clean and simple implementation of this functionality (in any language)? I consider it a great coding challenge, although it may require a week of your free time and a bunch of dead neurons to accomplish. But if you're willing to take it, I'd be glad to see the results... :D