28 August 2007

I've been working for a while on a compiler for statistical models. (Those of you who knew me back in the day will know that I used to be a programming languages/compilers nut.) The basic idea is to have a language in which you can naturally express hierarchical Bayesian models and either simulate them directly or compile them into your language of choice (right now, the only language you can choose is C :P). The key difference between HBC (the Hierarchical Bayes Compiler) and tools like WinBugs is that you're actually supposed to be able to use HBC for large-scale simulations. Also, although right now it only supports Gibbs sampling, message passing and stochastic EM variants are in the works. It also almost knows how to automatically collapse out variables (i.e., automatic Rao-Blackwellization), but this is still a bit buggy.

Anyway, you can see more information on the official website. I'd really appreciate bug reports from anyone who runs into problems. It's a pretty complex bit of machinery and some of it is still hacked together rather than done properly (especially type checking), so I expect there are bugs to be found.

So just to whet your appetite if you haven't yet clicked on the above link, here's a complete implementation of LDA including hyperparameter estimation in HBC:
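In roughly HBC's notation, it looks something like the following (this is a from-memory sketch, so the exact syntax and hyperparameter values in the shipped example may differ):

```
alpha     ~ Gam(0.1, 1)
eta       ~ Gam(0.1, 1)
beta_{k}  ~ DirSym(eta, V)        , k \in [1,K]
theta_{d} ~ DirSym(alpha, K)      , d \in [1,D]
z_{d,n}   ~ Mult(theta_{d})       , d \in [1,D] , n \in [1,N_{d}]
w_{d,n}   ~ Mult(beta_{z_{d,n}})  , d \in [1,D] , n \in [1,N_{d}]
```

Six lines: two sampled hyperparameters, per-topic word distributions, per-document topic distributions, topic assignments, and the observed words.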

dave -- in short, yes. in long, there are a few things on the TODO list that are higher priority right now. basically, the list as it stands is something like: (1) automatic variable collapsing; (2) better support for gaussians (right now only constant-diagonal covariance matrices are allowed); (3) maximizations; (4) variational. somewhere in there is also targeting other languages. if someone wants to help me target some other language, i'd really appreciate it: i plan to do ocaml, just because i use it a lot, but it would also be nice to have java... unfortunately, i don't remember java, so if someone wants to tell me what the important differences between java and C are, it should be an easy port.

I gather from the Web page that the ability to generate production-performance code distinguishes HBC from most other stochastic languages, but I as an ignorant observer do wonder as a serious question, how does HBC relate to IBAL, AutoBayes, and Dyna? (Any others?) It seems that IBAL and Dyna are more focused on discrete distributions (such as over algebraic data types) and exact inference -- is that right?

this is pretty much right. autobayes, umacs, bugs, etc are all interpreted and hence pretty slow. some allow integration into R, which allows some flexibility, but R (and matlab) is pretty bad for dealing with text. so the big "selling point" is supposed to be that it generates code in your language of choice (which must be C right now). in comparison to dyna and ibal, these focus on exact inference, the former essentially in automata, the latter in algebraic data types.

but the basic point is that it should be somewhat difficult to actually implement a better sampler directly in C than the one produced by HBC. indeed, if you see places in the generated C code that could be sped up significantly (and that gcc doesn't already optimize, like constant lifting and simple dead code elimination), let me know.

What're the licensing terms? I didn't see anything on the site. If it's Apache-style open source, I'd love to help out generating Java if the system's much more scalable than WinBugs.

Could you run a pLSA-like mixture model over the Netflix data? That's 20K films, 500K users, 100M integer ratings in {1,2,3,4,5}. Ratings could be generated by mixtures of around 50-100 Gaussians, or by similarly sized mixtures of 5-dimensional multinomials for the discrete output case.

I'm thinking pLSA needs to be generalized not only in the obvious way for Bayesian priors, but also to generate films for the Netflix case, where not every film is rated by every user. The generative process would first select a topic, then a film from a per-topic multinomial. In addition to completing the generative model, it has the obvious predictive advantage of letting films associate more strongly with topics, so if you're a sci-fi nut and a documentary nut, you don't rate Star Trek a 1 if you happen to draw the documentary topic. I found this helpful in an EM estimator I implemented in Java for pLSA to run on the Netflix data.

How much memory and time would this take? And once the model's estimated, how fast would predictions be (estimating ratings given film and user)?

re license: i was hoping to avoid this issue because it just irks me. anyway, it's completely free for any purpose...i just added this to the web page.

bob -- i don't think memory would be a problem (i find that even on a fair amount of data, the memory usage is quite small). the bigger issue is that the speed of convergence for your model might be slow, at least until collapsing is implemented.

but it's pretty fun, right? very easy to try new models... i hope it's fruitful.

anyway, i'd love help with java. i'm pretty much unavailable for the next two weeks, but after that let's get it done. it should be a fairly trivial modification of the C generation.

For a machine learning perspective, see Bishop's book "Pattern Recognition and Machine Learning". For a simulation-oriented statistics perspective, see Gelman et al.'s "Bayesian Data Analysis". Both of these assume you are comfortable with probability theory and with the common distributions in calculus/matrix format. If you need to bone up on that, I'd recommend Larsen and Marx's "Intro to Mathematical Statistics", but there are lots of other mathematical intros out there.

I think of myself as a decent java programmer. If you need someone to code some stuff I would be willing to invest some time in it. This would also give me the opportunity to get more familiar with Bayesian models ;)

In response to your comment about other versions: I'd be very excited about an Ocaml version, and would be even more excited about a Python or Ruby one. OpenBayes for Python looks good, so together these would be powerful tools... but again, this is just my selfish perspective, since I know these languages better than Haskell. But in any case, just wanted to let you know there's major interest in Ocaml, Python and Ruby versions, and I'd definitely hack on it and add features.

this looks very cool, i'm looking forward to playing with hbc and understanding how it works.

here's a simple challenge (that i think is hard to do in most packages for bayesian modeling): write a pcfg, and achieve the inference performance of cyk (without extraordinary tinkering). this is hard, first because representing a pcfg is often awkward, and second because generic inference algorithms don't seem to leverage the independence properties that dynamic programming does.

btw., as someone who uses haskell and ocaml for doing serious bayesian modeling work, which would you recommend as a 'replacement' for everyday matlab use?

there are essentially two things that need to be done. first, the code generator needs to be written; i can do this with a small amount of input. second, the equivalent of stats.[ch] needs to be ported. the closer the type signatures can match, the easier it will be to generate the code. note, however, that if your language is reasonable and its arrays store their length, then obviously this parameter doesn't need to be passed in.

so, here's the deal i'll strike. if you want a code generator in language X, give me a stats.X and i'll make the code generator. two people have expressed interest in java; maybe you can team up.
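for concreteness, the signatures to match are roughly of this shape (names and argument orders here are illustrative, sketched rather than copied from the actual stats.h):

```c
/* illustrative only, not the real stats.h */
double sample_Gam(double a, double b);               /* gamma             */
double sample_Bet(double a, double b);               /* beta              */
void   sample_Dir(double *th, double *al, int K);    /* dirichlet; the
                                                        length K is passed
                                                        explicitly because
                                                        C arrays don't know
                                                        their own size     */
```

in a language whose arrays carry their length, that trailing int goes away, which is the point about reasonable languages above.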

stats.c isn't so bad, but samplib.c, on which it depends, is a killer. It's 2500 lines of densely written sampling algorithms (gamma, beta, exponential, Poisson, normal, etc.).

Does anyone know of good sampling code in Java? Porting samplib.c would be tough because of all the gotos, which Java doesn't support. The algorithms would have to be significantly refactored. I don't think this'd be possible without some good test sets. There are good refs and decent inline comments in some of the algorithms, so it at least looks possible.

Jakarta Commons Math has inverse cumulative probability methods for all its distributions, and these could be used with a uniform [0,1] sampler from Random to sample from a gamma, but I'm guessing that's not going to be the best way to sample. And not all the distros are even implemented (e.g., beta's missing an Impl).

As for targeting other languages -- putting aside the coolness factor of compiling to many different languages -- given the dependence on tight C code and the inherent slowness of at least Python/Ruby, if not Java, it might be better to stick with generating cross-platform C code and concentrate on the FFI of the host languages (SWIG for Python, JNI for Java, etc.). This will require users to have gcc, but at least the code will run at a reasonable speed.

Alternatively, one can just provide the stats lib using the FFI, which will remove the compilation burden from the users while making it easier to support different languages AND making the generated code run at reasonable speed.

This looks great! So I'm making a first stab at a factorial HMM, and for some reason it's complaining that pi is not quantified. I'm not sure what this means, but I suspect I need some sort of type annotation...