reviews

theology

subtleties of state

2015-08-06

I found a type error in the database system I am implementing. The type error only is thrown after a certain number of pages are added to a table. This was, as one might expect, a fun exercise in debugging.

The actual trail of the path looks as follows:

A month ago, I make the most recent set of changes and send them up to the SCM, everything working.

I open up the project earlier this week and run the smoke tests. Immediate restart thrown! (What!?)

Hours and hours later, I am too sleepy to make progress an close the window. I have traced the static call stack, along with the dynamic call stack found in the traceback. No dice.

Taking up the story tonight, I begin to dig into the Lisp REPL and exhaustively test each function. After a random tweak to a parameter, I find that the crash does not occur. Hmmm.

Culprit found: the caching layer of the system is interacting with the page creation system; when created pages reach the eviction limit, the cache of a page becomes NIL (since it got evicted), and things go sideways, because NIL is not of type cachable-item.

Initial cut at a solution: the caching logic is (1) buggy (clearly), and (2) I'm not sure it's the right place to have caching. So the caching classes got pushed over to a different file and the linkages removed. The caching layer has been buggy since shortly after inception, so this feels right.

Some context onto this problem:

The general idea is that a table has a set of pages, which may or may not be loaded into memory. However, this implies having some extremely complex requirements about visibility and lifetimes of your data. A table scan will thus ripple through your pages as you bring them in and out of memory. Further, indexes have to be overlaid on your columns in the table, which then should have - in some fashion - references to the page data. This means that typical GC approaches won't work, as you need a kind-of weak-reference - a weak reference such that when you access it, it becomes alive.

Another lesson that can be taken away here is that simple tests are not enough; complex semantic tests that exercise the boundary conditions of your system need to be put in place from the beginning; early drafts of this sytem were written on the fly and without tests.

Further, an adequately powerful type system can prevent certain classes of bugs. In this case, an adequate type system would have limited the expression of the idea that a pointer can be NIL from time to time (Of course, this means that the machinery for whether the data is present or not has to be encoded into the type system: no silver bullet).

As I noted above, this problem has a gnarly essential complexity - data has to come and go, transparent to the other layers of the architecture.. Right now, solving that essential complexity is not high on the priority list, so the caching feature will get cut, to be added at a later date (possibly at a different layer in the data archicture).