Archive

What follows is a public version of a post originally published on an internal blog at the Education Advisory Board (EAB), my current employer. We don’t yet have a public tech blog, but I got permission to edit and post it here, along with the referenced code.

Data Science teams get asked to do a lot of different sorts of things. Some of what the team that I’m part of builds is enterprise-scale predictive analytics, such as the Student Risk Model that’s part of the Student Success Collaborative. That’s basically software development with a statistical twist and a machine-learning core. Sometimes we get asked to do quick-and-dirty, one-off work to answer a research question, and we have a variety of tools and processes for that. But there’s a third category that I want to focus on: frequently requested but slightly different reports.

What is it?

There’s a relatively new theme in the scientific research community called reproducible research. Briefly, the idea is that it should be possible to re-do all steps after data collection automatically, including data cleaning and reformatting, statistical analyses, and even the actual generation of a camera-ready report with charts, graphs, and tables. This means that if you realized that, say, one data point in your analysis was bogus and needed to be removed, you could remove that data point, press a button, and in a minute or two have a shiny new PDF with all of the results automatically updated.

This type of reproducible research has been around for a while, although it’s having a recent resurgence, in part due to the so-called “statistical crisis.” The R (and S) statistical programming languages have supported LaTeX, the scientific document creation/typesetting tool, for many years. Using a tool called Sweave, a researcher “weaves” chunks of text and chunks of R code together. The document is then “executed”: the R code chunks are run, and their results are merged into a single LaTeX document, which is then compiled into a PDF or similar. The code can generate charts and tables, so no manual effort is needed to rebuild a camera-ready document.
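For readers who haven’t seen Sweave, here’s a tiny illustrative .Rnw file (made up for this post, not taken from any real report):

    % report.Rnw -- a minimal Sweave document: LaTeX with embedded R chunks
    \documentclass{article}
    \begin{document}
    The mean mpg in the built-in mtcars data is \Sexpr{round(mean(mtcars$mpg), 1)}.
    <<mpg-hist, fig=TRUE, echo=FALSE>>=
    hist(mtcars$mpg, main = "Miles per gallon")
    @
    \end{document}

Running Sweave("report.Rnw") produces report.tex, and tools::texi2pdf("report.tex") compiles it to a PDF. Change the data, re-run those two lines, and both the histogram and the inline number update themselves.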

This is great: a huge step forward toward validating often tricky and complex statistical analyses. If you’re writing a conference paper on, say, a biomedical experiment, a reproducible process can drastically improve your confidence in your work. But data scientists often have to generate this sort of thing repeatedly, from different sources of data or with different parameters. And they have to do so efficiently.

Parameterizable reproducible research, then, is a variant of reproducible research tools and workflows that makes it easy to supply data sources, options, and parameters to a standardized analytical report, even one that includes statistical or predictive analyses, data manipulation, and graph generation. The report can be emailed or otherwise sent to people, and doesn’t have to be as public as, say, a web-based app developed in Shiny or another technology. This isn’t a huge breakthrough or anything, but it’s a useful pattern that seems worth sharing.
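As a concrete sketch of what “parameterizable” can look like in practice, a small wrapper function can inject parameters into the environment in which a knitr template is evaluated. The template and parameter names below are hypothetical, not from one of our actual reports:

    library(knitr)

    # Hypothetical wrapper: weave one standardized template per request,
    # with the data source and options supplied as arguments.
    render_report <- function(data_file, school_name, min_term = "2012FA",
                              template = "risk_report.Rnw") {
      params <- new.env()
      assign("data_file",   data_file,   envir = params)
      assign("school_name", school_name, envir = params)
      assign("min_term",    min_term,    envir = params)
      # the template's chunks are evaluated in `params`, so they can read
      # data_file, school_name, and min_term as ordinary variables
      knit2pdf(template, envir = params)
    }

A call like render_report("school_a_extract.csv", "School A") then yields a camera-ready PDF for that one request, and the next request is just another call with different arguments.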

I just returned from the useR! 2012 conference for developers and users of R. One of the common themes across the presentations was the integration of R-based statistical systems with other systems, be they other programming languages, web systems, or enterprise data systems. Some highlights for me were an update to Rserve that includes one-stop web services, and a presentation on ESB integration. Although I didn’t see it discussed, the new httr package for easier access to web services is another notable development for integrating R into large-scale systems.

Coincidentally, just a week or so earlier I had given a short presentation to the local R Meetup entitled “Annotating Enterprise Data from an R Server.” The topic for the evening was “R in the Enterprise”; others talked about generating large, automated reports with knitr and about using RPy2 to integrate R into a Python-based web system. I talked about my experiences building and deploying a predictive system, using the corporate database as the common link. Here are the slides:

In my previous post, I motivated a web application that would allow small-scale sustainable meat producers to sell directly to consumers via a meat-share approach, using constrained optimization techniques to maximize utility for everyone involved. In this post, I’ll walk through some R code that I wrote to demonstrate the technique on a small scale.

Although the problem is set up in R, the actual mathematical optimization is done by SYMPHONY, an open-source mixed-integer solver that’s part of the COIN-OR project. (The problem of optimizing assignments, in this case of cuts of meat to people, is an integer programming problem, because the solution involves assigning either 0 or 1 of each cut to each person. More generally, linear programming and related optimization frameworks allow solving for real-numbered variables.) The Rsymphony package allows problems set up in R to be solved by the C/C++ SYMPHONY code with little hassle.

My code is in a public github repository called groupmeat-demo, and the demo code discussed here is in the subset_test.R file. (The other stuff in the repo is an unfinished version of a larger-scale demo with slightly more realistic data.)

For this toy problem, we want to optimally assign 6 items to 3 people, each of whom has a different utility (value) for each item. In this case, I’m ignoring any fixed utility, such as cost in dollars, but that could be added into the formulation. Additionally, assume that items #1 and #2 cannot both be assigned, as with pork loin and pork chops.
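To make the walkthrough concrete, here is roughly how the toy data could be set up. The numbers are random placeholders; subset_test.R in the repo uses its own values:

    library(Rsymphony)

    set.seed(42)
    n_items  <- 6
    n_people <- 3
    # utility[i, p] = value of item i to person p (placeholder numbers)
    utility <- matrix(round(runif(n_items * n_people, 1, 10)), nrow = n_items)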

This sort of problem is fairly simple to define mathematically. To set up the problem in code, I’ll need to create some matrices that are used in the computation. Briefly, the goal is to maximize an objective expression, $\sum_i u_i x_i$, where the $x_i$ are variables that will be 0 or 1, indicating an assignment or non-assignment, and the $u_i$ form a coefficient vector representing the utilities of assigning each item to each person. Here, there are 6 items for 3 people, so I’ll have a 6×3 matrix, flattened to an 18-vector. The goal will be to find 0’s and 1’s for the $x_i$ that maximize the whole expression.
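Continuing the placeholder setup above, flattening the utility matrix column-wise (person by person) gives the 18-element objective vector:

    # x[(p - 1) * n_items + i] will mean "item i is assigned to person p",
    # so the objective coefficients are the utilities in the same order
    obj <- as.vector(utility)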

Additionally, I need to define some constraints. For starters, it makes no sense to assign an item to more than one person. So, for each item, the sum of its assignment variables (not the utilities) across the three people must be 1, or maybe 0 (if that item is not assigned). I’ll create a constraint matrix, where each row contains 18 columns, and the pattern of 0’s and 1’s picks out the variables for one item. Since there are 6 items, there are 6 rows (for now). Each row’s sum needs to be less than or equal to one (I’ll tell the solver to use integers only later), so I also define vectors of inequality symbols and right-hand sides.
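A sketch of that setup and the solver call, again using the placeholder data above (the repo’s code may differ in details):

    # one row per item: the item goes to at most one person
    mat <- matrix(0, nrow = n_items, ncol = n_items * n_people)
    for (i in 1:n_items) {
      mat[i, seq(i, by = n_items, length.out = n_people)] <- 1
    }

    # items 1 and 2 are mutually exclusive (pork loin vs. pork chops):
    # assignments of item 1 plus assignments of item 2 must total at most 1
    excl <- rep(0, n_items * n_people)
    excl[c(seq(1, by = n_items, length.out = n_people),
           seq(2, by = n_items, length.out = n_people))] <- 1
    mat <- rbind(mat, excl)

    dir <- rep("<=", nrow(mat))
    rhs <- rep(1, nrow(mat))

    # binary ("B") variables make this an integer program; max = TRUE to maximize
    sol <- Rsymphony_solve_LP(obj, mat, dir, rhs,
                              types = rep("B", n_items * n_people), max = TRUE)
    matrix(sol$solution, nrow = n_items)   # the fitted assignment matrix
    sol$objval                             # total utility achieved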

So that’s great. It found an optimal solution worth more than 50% more than the expected value of a random assignment. But there’s a problem. There’s no guarantee that everyone gets anything, and in this case, person #3 gets almost twice as much utility as person #2. Unfair! We need to enforce an additional constraint: the difference between the maximum utility that any one person gets and the minimum utility that any one person gets must not be too high. This is sometimes called a parity constraint. Adding parity constraints is a little tricky, but the basic idea here is to add two more variables to the 18 I’ve already defined. These variables are positive real numbers, and they are forced by constraints to be the maximum and minimum total utilities per person. In the objective function, then, they are weighted so that their difference is not too big. So, that expression becomes: $\sum_i u_i x_i - w\,y_{\max} + w\,y_{\min}$. The first new variable, $y_{\max}$ (the maximum utility of any person), is minimized, while the second, $y_{\min}$, is maximized. The free parameter $w$ defines how much to trade off parity with total utility, and I’ll set it to 1 for now.
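In the sketch, the objective vector simply grows by two entries (the names y_max, y_min, and w are my labels for the reconstruction; the repo may name them differently):

    w <- 1                  # parity weight: how much utility to trade for fairness
    # append a penalized coefficient for y_max and a rewarded one for y_min
    obj2 <- c(obj, -w, +w)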

For the existing rows of the constraint matrix, these new variables get 0’s. But two more rows need to be added per person: one forcing the new “max” variable to be at least as large as that person’s assigned utility, and one forcing the new “min” variable to be no larger than it. Combined with the objective weights, these constraints drive the two variables to equal the actual maximum and minimum of any person’s assigned utility.
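Continuing the sketch, that means padding the existing constraint rows with two zero columns and appending the per-person bound rows before re-solving:

    # pad existing rows with 0 coefficients for y_max and y_min
    mat2 <- cbind(mat, 0, 0)

    # for each person p:  y_max - sum_i u[i, p] * x[i, p] >= 0
    #                     sum_i u[i, p] * x[i, p] - y_min >= 0
    for (p in 1:n_people) {
      person_u <- rep(0, n_items * n_people)
      person_u[seq((p - 1) * n_items + 1, length.out = n_items)] <- utility[, p]
      mat2 <- rbind(mat2,
                    c(-person_u, 1,  0),   # upper-limit row for person p
                    c( person_u, 0, -1))   # lower-limit row for person p
    }

    dir2 <- c(dir, rep(">=", 2 * n_people))
    rhs2 <- c(rhs, rep(0,    2 * n_people))

    # the 18 assignment variables stay binary; y_max and y_min are continuous
    sol2 <- Rsymphony_solve_LP(obj2, mat2, dir2, rhs2,
                               types = c(rep("B", n_items * n_people), "C", "C"),
                               max = TRUE)
    matrix(sol2$solution[1:(n_items * n_people)], nrow = n_items)
    tail(sol2$solution, 2)   # the fitted upper and lower utility bounds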

Looking at just the last row, this constraint says that the sum of the utilities of any items assigned to person #3, minus the lower limit, must be at least 0. Because the same constraint holds for all three people, that is essentially the definition of the lower limit. Similar logic applies for the upper limit.

The last two numbers in the solution are the values of the upper and lower bounds. Note that the objective value is only 41% higher than a random assignment, but the utilities assigned to each person are much closer. Dropping $w$ to something closer to 0 makes the parity bounds less important, and the solution tends to be closer to the initial result.
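In the sketch above, trying a smaller weight is just a matter of rebuilding the objective vector and re-solving:

    # parity matters less as w shrinks; 0.1 is an arbitrary illustration
    sol3 <- Rsymphony_solve_LP(c(obj, -0.1, 0.1), mat2, dir2, rhs2,
                               types = c(rep("B", n_items * n_people), "C", "C"),
                               max = TRUE)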

Scaling this up to include constraints on pricing, farm preferences, price-vs.-preference meta-preferences, etc., is not conceptually difficult, but would just entail careful programming. It is left as an exercise for the well-motivated reader!

If you’ve made it this far, I’d definitely appreciate any feedback about this idea, corrections to my formulation or code or terminology, etc!

A few months back I gave a presentation to the NYC R Meetup. (R is a statistical programming language. If this means nothing to you, feel free to stop reading now.) The presentation was on ggplot2, a popular package for generating graphs of data and statistics. In the talk (which you can see here, including both my slides and my patter!) I presented both the really great things about ggplot2 and some of its downsides. In this blog post, I wanted to expand a bit on my thinking about ggplot, the Grammar of Graphics, and how people’s conceptual representations of graphs, data, ggplot, and R all interact. ggplot is both incredibly elegant and unfortunately difficult to learn to use well, I think as a consequence of the variety of representations.