January 4, 2010

And Bill Brass would run through the streets of Edinburgh shouting "Eureqa!"

If you are a social scientist and haven't checked out Eureqa, you should spend a few hours playing with it in the next few months, because it's entirely different from your prior experiences with statistical software. Featured in a Wired article last month, "Download Your Own Robot Scientist," Eureqa is not a scientist but a statistical engine that generates potential formulae to solve a defined problem from data, evaluates the formulae as it goes, and does so using a set of operations defined by the user. The usual (somewhat tedious) method for those of us trained in social sciences is to think very clearly about the problem, define a potential model (or, in reality, the form of a function and the variables that would go in that function), and let software estimate parameters to minimize error defined in some way or maximize the likelihood of having observed the collected data. If "think very clearly" sounds remarkably Cartesian, so be it. In the best of worlds, that a priori modeling can lead to interesting and useful findings, even if you're also exposed to John Tukey-like practicality (such as his 53h smoothing). There's also the "churn it out" school of automated stepwise regressions that used to be an excuse for researcher laziness, though I have recently accepted a manuscript for Education Policy Analysis Archives with precisely that tool used at one step (and for very justifiable and practical reasons--the authors were not being lazy one whit).

So into this world of "try out one well-justified family of models at a time" rushes Eureqa, threatening either to upset the applecart or to lead to some very interesting possibilities. Instead of comparing a set of nested models, where summary statistics often allow inferential judgments of the utility of additional variables, Eureqa compares some very different models, where the conclusions one can draw in comparing models are restricted to the sample (where many people would argue we're always restricted, but I'll skip the metatheoretical discussion of inferential statistics). So what the heck is the use of Eureqa?

To get a glimpse of the possibility, let me tell you about my experience. Looking at one of the images in the tutorials, I saw a sine curve whose magnitude diminished, and I thought, "Okay, let's see how quickly Eureqa recognizes that." I synthesized numbers in a spreadsheet to fit a formula with a magnitude that diminished to 0 asymptotically (i.e., as the independent variable headed to infinity) and plugged them into Eureqa, telling Eureqa that it could add, subtract, multiply, divide, and use sines and cosines in any combination. In a few minutes, Eureqa spat out an optimal formula that was identical to the one I had used. Okay, so far so good, but I had gone easy on it the first time out.

Next, I added an error term. Eureqa asked me if I wanted to smooth the data first. No, I said, and Eureqa had some problems, so I went back and checked the "smooth" box. Eureqa dutifully chugged away, and one of the candidate formulae was almost the same as the one I had used (minus the error term), but it wasn't the prime candidate after several minutes. Instead, Eureqa proposed a sum of two sine curves that had slightly different periods. I thought about it and realized, oh, yes, of course. One way to get a sine wave that shrinks over the observed range is to have a single wave whose amplitude decays, but another is to have the sum of two sine waves with almost but not completely identical periods. As time goes on (or x increases), the waves shift from constructive to destructive interference, and the amplitude of the sum decreases. In a real-world environment, we would need to extend the time (or observe at higher values of x) to disconfirm one of the two candidates--increasing amplitude after some time would lend support to the two-wave interference formula. Eureqa had neatly forced me to think of another way to see the data.
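For readers who want to see the two candidates side by side, here is a minimal Python sketch. Everything specific in it is an assumption made for illustration: the decay rate, the two frequencies, and the observation windows are invented, not the values from the spreadsheet experiment described above. It only shows the general point that a decaying wave keeps shrinking while a two-wave sum eventually rebounds.

```python
import math

def damped_sine(x, decay=0.05, omega=1.0):
    """One sine wave whose amplitude decays toward 0 (assumed parameters)."""
    return math.exp(-decay * x) * math.sin(omega * x)

def two_wave_sum(x, omega1=1.0, omega2=1.1):
    """Average of two sine waves with slightly different frequencies.

    By the sum-to-product identity this equals
    cos((omega1 - omega2) * x / 2) * sin((omega1 + omega2) * x / 2),
    so the slow cosine factor acts as a rising and falling envelope.
    """
    return 0.5 * (math.sin(omega1 * x) + math.sin(omega2 * x))

def peak_amplitude(f, start, stop, steps=2000):
    """Largest |f(x)| on an evenly spaced grid over [start, stop]."""
    return max(abs(f(start + (stop - start) * i / steps))
               for i in range(steps + 1))

# Three observation windows: early, middle, and well past the original range.
windows = [(0, 10), (26, 36), (58, 68)]
for start, stop in windows:
    print(f"x in [{start}, {stop}]: "
          f"damped peak = {peak_amplitude(damped_sine, start, stop):.3f}, "
          f"two-wave peak = {peak_amplitude(two_wave_sum, start, stop):.3f}")
```

With these assumed parameters, both curves shrink between the first and second windows, which is why the two formulae can look alike on a short run of data. Only the third window, past the point where the interference turns constructive again, separates them.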

And that is the obvious first-order value of Eureqa: to generate different ways of seeing data. But it isn't the only value. And to make the argument for the second value, to generate reliable models for complex social data, I'll ask for some help from the late Bill Brass, a Scottish medical demographer I encountered in graduate school through the Brass logit relational model of life tables, a 1971 model for transforming a single model life table into life tables of real countries in real time using two parameters (alpha and beta--okay, so he wasn't exactly stellar in the naming-parameters department, but he was a brilliant practical demographer otherwise). The Brass logit model has some problems at extreme age ranges and for countries with unique mortality conditions, but given the complexity of mortality experiences through time and across continents, having any simple model that could take a model set of age-specific measures and transform it into something anywhere close to real experiences is ... well, amazing. And Brass did it without the help of microcomputers.
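For anyone who hasn't met the model, the relational idea fits in a few lines. In the usual presentation, the logit of the survivorship proportion l(x) is Y(x) = 0.5 * ln((1 - l(x)) / l(x)), and a population's logits are taken to be a linear function of a standard table's logits: Y(x) = alpha + beta * Y_standard(x). The Python sketch below follows that form. The "standard" column in it is a set of made-up numbers for illustration only, not Brass's published standard, and the function names are my own.

```python
import math

def brass_logit(l_x):
    """Logit of a survivorship proportion l(x), for 0 < l_x < 1."""
    return 0.5 * math.log((1.0 - l_x) / l_x)

def inverse_brass_logit(y):
    """Convert a logit back into a survivorship proportion."""
    return 1.0 / (1.0 + math.exp(2.0 * y))

def relate_to_standard(standard_lx, alpha, beta):
    """Apply Y(x) = alpha + beta * Y_standard(x) age by age.

    standard_lx maps exact age x to the standard table's l(x).
    """
    return {age: inverse_brass_logit(alpha + beta * brass_logit(l_x))
            for age, l_x in standard_lx.items()}

# ASSUMPTION: invented survivorship values, NOT Brass's standard life table.
ILLUSTRATIVE_STANDARD = {1: 0.85, 5: 0.77, 20: 0.71, 40: 0.59, 60: 0.40, 80: 0.10}

# alpha shifts the overall level of mortality; beta tilts the balance
# between mortality at younger and older ages.
higher_mortality = relate_to_standard(ILLUSTRATIVE_STANDARD, alpha=0.3, beta=1.0)
for age in sorted(ILLUSTRATIVE_STANDARD):
    print(f"age {age:2d}: standard l(x) = {ILLUSTRATIVE_STANDARD[age]:.3f}, "
          f"transformed l(x) = {higher_mortality[age]:.3f}")
```

Setting alpha = 0 and beta = 1 returns the standard table unchanged, and a positive alpha lowers survivorship at every age, which is the sense in which two numbers carry one table into another.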

Since the early 1970s, a number of demographers have tinkered with the Brass logit model, with the benefit of microcomputers but without Eureqa. Before microcomputers and fast computing using individual-level data, demographers had to use a combination of mathematical models and the type of statistical insight that Brass brought to life tables. So could demographers and other social scientists use Eureqa to generate this type of relational model for a range of data? Possibly, and certainly they could use Eureqa to generate candidate models. I'd be curious to see if Eureqa could come up with anything close to the Brass logit model if fed an appropriately prepared set of data. Demography grad students, here's a great project--see if Eureqa can beat Bill Brass!

In the next month I'm finishing up my EPAA editorial duties (or coming as close to finishing as I can in preparing approved articles) and delving more intensively into unfinished projects. But there's a small project that's perfect for Eureqa. I have no idea if it'll come up with anything useful, but because Eureqa's proposed solutions are sample-dependent and Eureqa splits the sample into training and validation sets (differently on each run), Eureqa gives me a perfect routine for dull tasks: do work, take break to see how Eureqa is running, capture proposed solutions, restart the run with a new training/validation split, go back to dull task, rinse, repeat. It doesn't require the intensity of concentration I'll need for unfinished projects this spring.
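The logic behind restarting with a fresh split can be sketched by hand in Python. This is not Eureqa's internal procedure, which I haven't inspected; it is a stand-in under stated assumptions. The data are synthetic, the two candidate formulae and their parameters are invented, and the 30 percent validation share is arbitrary. The point is only that a formula worth keeping should win on the held-out rows across many different splits, not just one.

```python
import math
import random

def split_rows(rows, validation_fraction=0.3, seed=None):
    """Shuffle a copy of the rows and return (training, validation) lists."""
    rng = random.Random(seed)
    shuffled = list(rows)
    rng.shuffle(shuffled)
    n_validation = max(1, int(round(len(shuffled) * validation_fraction)))
    return shuffled[n_validation:], shuffled[:n_validation]

def mean_abs_error(formula, rows):
    """Average absolute gap between a formula's prediction and observed y."""
    return sum(abs(formula(x) - y) for x, y in rows) / len(rows)

# ASSUMPTION: two hand-written candidates standing in for captured solutions.
CANDIDATES = {
    "damped sine": lambda x: math.exp(-0.05 * x) * math.sin(x),
    "two-wave sum": lambda x: 0.5 * (math.sin(x) + math.sin(1.1 * x)),
}

def tally_winners(rows, candidates, runs=20):
    """Count how often each candidate has the lowest validation error."""
    wins = {name: 0 for name in candidates}
    for seed in range(runs):
        _, validation = split_rows(rows, seed=seed)
        best = min(candidates,
                   key=lambda name: mean_abs_error(candidates[name], validation))
        wins[best] += 1
    return wins

# Synthetic data: a damped sine plus Gaussian noise (all values invented).
noise = random.Random(0)
rows = [(0.25 * i,
         math.exp(-0.05 * 0.25 * i) * math.sin(0.25 * i) + noise.gauss(0, 0.1))
        for i in range(120)]

print(tally_winners(rows, CANDIDATES))
```

The candidates here are fixed in advance, so the training rows go unused; in a real run they are what the search would fit against before the validation rows pass judgment.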