Last November, I wrote two posts on agile predictive analytics. It’s time to return to the subject. I’m used to KXEN talking about the ability to do predictive modeling, very quickly, perhaps without professional statisticians; that’s the core of what KXEN does. But I was surprised when Revolution Analytics told me a similar story, based on a different approach, because ordinarily that’s not how R is used at all.

Ultimately, there seem to be three reasons why you’d want quick turnaround on your predictive modeling:

You want to change your models quickly as your world changes (your products change, your competition does different things, your customers have different interests, etc.).

You like the general benefits of agility (faster, cheaper, more responsive). Or in particular, …

… you want to model differently for different segments of your relevant universe (sets of customers, sets of products, sets of financial securities, etc.), so any one model had better be easy to build.

A KXEN story along these lines might go:

A retailer has 100s or 1000s of stores, each of which sells a lot of items.

A single model that covered all the stores would be horrifically complex.

Better to run a separate model for each store.

The point here is that KXEN automates some modeling steps that are manual with most other tools, allowing each individual model to be built more quickly.
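As an illustration of per-store modeling (the data and the one-variable regression below are entirely hypothetical, not KXEN's actual method), fitting a separate tiny model for each store might look like:

```python
from collections import defaultdict

# Hypothetical sales rows: (store_id, promo_discount, units_sold)
rows = [
    ("store_1", 0.0, 100), ("store_1", 0.1, 130), ("store_1", 0.2, 160),
    ("store_2", 0.0, 50), ("store_2", 0.1, 55), ("store_2", 0.2, 60),
]

def fit_ols(points):
    """Closed-form simple linear regression: units = a + b * discount."""
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    b = sum((x - mx) * (y - my) for x, y in points) / sum(
        (x - mx) ** 2 for x, _ in points
    )
    return my - b * mx, b  # (intercept, slope)

# One small model per store, instead of one huge model for the whole chain
by_store = defaultdict(list)
for store, x, y in rows:
    by_store[store].append((x, y))

models = {store: fit_ols(pts) for store, pts in by_store.items()}
print(round(models["store_1"][1]))  # store_1's fitted promo sensitivity: 300
```

Each store's model is trivial on its own; the automation story is about building and refreshing thousands of them without an analyst touching each one.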

One production Revolution use case goes:

A large stockbroker has 100s of equities traders.

At any time, what a trader wants to model might be governed by a particular customer’s interests — what exactly their objective function is, which particular stocks they want to look at, etc.

An app was built to let traders re-run the models fresh each time, with a convenient UI that allows parameterized inputs of ticker symbols and risk objectives.

R (in Revolution’s version or any other that I know of) doesn’t have KXEN’s general quick-modeling features, and perhaps not even those of SAS or SPSS. But building a specific parameterized app is obviously a workaround for that lack.
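A minimal sketch of such a parameterized app, with invented tickers, returns, and objective functions (not the broker's actual system), might be:

```python
import statistics

# Hypothetical daily returns per ticker; the real app would pull whatever
# lookback window the trader selects in the UI.
RETURNS = {
    "AAPL": [0.01, -0.02, 0.015, 0.005],
    "MSFT": [0.004, 0.006, 0.005, 0.007],
    "XOM": [-0.01, 0.02, -0.015, 0.01],
}

def run_model(tickers, objective="mean_return"):
    """Re-fit a fresh 'model' on every call, parameterized by the trader's
    inputs: which tickers to consider and which objective to score by."""
    def score(ticker):
        r = RETURNS[ticker]
        if objective == "mean_return":
            return statistics.mean(r)
        if objective == "low_volatility":  # lower stdev scores higher
            return -statistics.stdev(r)
        raise ValueError(f"unknown objective: {objective}")
    return max(tickers, key=score)  # the ticker best fitting the objective

print(run_model(["AAPL", "MSFT", "XOM"], objective="low_volatility"))
```

The point is the shape of the workflow, not the toy scoring: nothing is cached from a previous run, so each invocation is a from-scratch model under the trader's current parameters.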

That said, there are indeed many cases where you need to re-run your models from scratch, whether through convenient technology or by throwing lots of bodies at the problem. Suppose, for example, you’re doing some kind of marketing campaign management for a telecom service provider. Potential changes to your data, or to its interpretation, include:

Your service plan changes.

Your competitors’ service plans change.

You or your competitor embarks on a major new advertising campaign.

New hardware comes out.

New hardware doesn’t come out for a little while, and the market shifts away from early adopters.

You figure out a better way of explaining things to a confused subset of your customers, happily changing their perceptions.

and also:

You do some clever analysis, and subcategorize what you’d previously regarded as one homogeneous set of consumers.

You change your website a little bit, and hence have new kinds of clickstream data.

You improve efficiency in your call center, and hence have different kinds of interactions with callers.

You run a new marketing program, and hence have new kinds of response data.

You up your text analytics or social media game somewhat, and hence have new kinds of sentiment or affinity data.

Any of these changes (and that’s hardly a complete list) could invalidate your existing models, or otherwise make it advantageous for you to run new ones.

Of course, “from scratch” is not necessarily entirely from scratch; while each model may be new, the underlying database is likely to change more slowly. It’s hard to do quick-turnaround predictive modeling unless you start out with a database that’s in good shape — even when one of the reasons for the quick turnaround need is that you keep adding new kinds of data.

One last note — little of this is in the vein “BI has told us something interesting; now let’s start modeling.” The step from operational/monitoring business intelligence to drilldown/investigative BI happens all the time, but I’m not aware of many cases (yet) where there’s a follow-on step of quick-turnaround predictive modeling. Even when modeling is done quickly, it seems to be proactive much more than reactive — or if it is reactive, it’s reactive to big news (stock market crash, natural disaster, whatever) rather than to, say, a few surprising sales results.

The time may (and should) come when iterative investigative BI and iterative predictive analytics go hand-in-hand, but — presumably with a few exceptions I’m overlooking — that natural-seeming synergy doesn’t seem to be exploited much today.

Comments

10 Responses to “Quick-turnaround predictive modeling”

J. Andrew Rogers on
May 28th, 2012 1:25 pm

An issue for some types of quick-turnaround predictive modeling is that many databases capture interpretations of data rather than the “physics” of the data. The database representation tacitly assumes an interpretation that may be poor for some analytics.

The distinction between storing the “physics” of data and an interpretation is a big deal when working with sensing data, and is generally useful when fusing many different data sources. Geospatial geometries do not exist in a cartographic projection even if that is how they are presented to a user. A photograph is a collection of spectral samples from which countless features can be derived. Enumerations and number ratings are often a dimensional reduction that summarizes a more complex set of factors that could be captured in text fields. Data models often store what the user expects to see rather than what that presentation actually represents for analytical purposes.

Hence the term “physics”, since it nominally captures the objective fact of reality underlying the interpretation and presentation and makes the assumption that this is a unifying context across diverse data sources. It is not efficient for fixed, narrow use cases but it is analytically flexible.

Whether or not agile predictive analytics is preferred, I’m interested in your thoughts on the significance of model version management. Updating models based on changes in the real world makes sense, but does maintaining a model’s version and relevant time frame provide additional value or lower risks in any way?

(4) A single model with a thousand subsegments is complex. Managing a thousand models is also complex.

(5) It’s possible to produce individual-level predictions without building individual models. In the real world of predictive analytics, practitioners understand that additional complexity does not necessarily add to predictive power; segmenting a population only adds to overall predictive power if the relationship between the dependent variable and independent variables is significantly different across the segments. More often than not, observed differences are simply noise.

(6) For deployed models, the “versioning” problem is a matter of tracking how well each model performs by matching predicted and actual measures. As the number of deployed models increases, the time and cost to do this becomes prohibitive; organizations with rapid deployment processes in place simply rebuild and publish the models on a regular cycle.

We agree that many of the bottlenecks are in data prep and manipulation (I assume that is what you call marshaling) and deployment. That’s why for data prep and manipulation we have a dedicated product to automate many (not all) of the steps (http://www.kxen.com/Products/Explorer), and for deployment we’ve been doing in-database scoring for years (as well as outputting C++, Java, yes even SAS, etc.). Flexibility in both is key, because the plumbing and horsepower required when doing massive scoring on 50 million subscribers is different than when calculating the “next best thing” in a call center. We also realize that there are many folks who use SAS, SPSS, or homegrown SQL to prepare their analytical data sets, so we can directly read in many formats like SAS, SPSS, etc.

That said, we still encounter plenty of companies where analysts spend weeks (sometimes months) on building a model (I’m counting post data marshaling to start of deployment). Most surveys I have seen from industry analysts estimate it at 20 to 35% of total time spent. So it is incorrect to ignore the time spent on modeling in the equation, just as it is overly simplistic to think that “modeling automation” solves all the problems for agility.

Regarding the number of models, I think there is some confusion. We do not always advocate for lots of models. We say that you should have the flexibility to build the number that matches your business needs. And sometimes, you simply have to build many models or you are not accurately describing the problem. The simplest example is where a company has different data available by line of business and product line (corporate, enterprise, consumer, pre-paid, post-paid, etc.). That is not the analyst’s fault; it is just the reality of their business systems.

At the end of the day, predictive power can be measured by the results obtained. And if you get better business results (more churners identified, more products sold, more email campaigns opened, etc.) with fewer models, then great. Our experience is that we often produce incremental gains over previous results by taking a more granular and agile approach. Not always, but very regularly.

I will contact you offline so we can have a longer discussion.

John

Thomas W Dinsmore on
May 30th, 2012 11:45 pm

In the contemporary world, a modeler who spends months building a model will soon be an ex-modeler.

John M. Wildenthal on
May 31st, 2012 12:53 pm

Re:

“(6) For deployed models, the “versioning” problem is a matter of tracking how well each model performs by matching predicted and actual measures. As the number of deployed models increases, the time and cost to do this becomes prohibitive; organizations with rapid deployment processes in place simply rebuild and publish the models on a regular cycle.”

This makes me uncomfortable from an “unclosed virtuous cycle” aspect.

I guess one might respond to questions about unmonitored model performance by claiming that since the model is [always] recent, it is probably performing as well as could be. But that leaves open the question of quantifying the benefit. Which leaves open the value of the model – and modeler – to the firm. If you never go back and check, how do you prove the model is doing enough better than chance that you justify the costs of creating and using the model in the first place? Maybe the firm would be better off firing the modeler and sending random mailings given a low actual lift. Wouldn’t we rather be showing hard numbers to our managers during our performance reviews?

As for the cost, if you are creating and using models in a regulated industry you will do whatever your regulator tells you to do, including whatever performance monitoring they require. Perhaps the cost of monitoring performance should be included in the cost of deployment? And isn’t performance measurement pretty straightforward BI after adding just a few more variables to your DW, the same DW that supports quick-turnaround modeling in the first place?

I don’t think the cost of monitoring need be high, particularly if the DW is set up well for creating the models in the first place. And it is a good business practice to get a reasonable measure of costs and benefits for post hoc evaluation. If the perceived performance benefits of a model are so small that they are dwarfed by the cost of measuring the performance, maybe the firm doesn’t really benefit from creating/having/using that model.
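The lift measurement Wildenthal alludes to is simple arithmetic once the predicted and actual outcomes are in the warehouse. A sketch with invented campaign numbers:

```python
def lift(targeted_responses, targeted_n, overall_responses, overall_n):
    """Campaign lift: the response rate of the model-targeted group relative
    to the baseline rate. Lift near 1.0 means the model is doing no better
    than a random mailing."""
    return (targeted_responses / targeted_n) / (overall_responses / overall_n)

# Hypothetical numbers: a model-picked mailing list vs. the whole customer base
print(lift(300, 2000, 1000, 50000))  # 7.5
```

A lift of 7.5 would be the kind of hard number a modeler could show at a performance review; a lift near 1.0 would support Wildenthal's "fire the modeler and mail at random" scenario.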

Thomas W Dinsmore on
June 1st, 2012 5:42 pm

The conventional way to monitor a model is to run a K-S test (or similar statistic) to measure model drift. That’s easy to do when you have one deployed model, but not so easy to do when you have a thousand deployed models.

As the number of deployed models increases and the time needed to re-estimate and re-deploy models declines, it becomes more cost effective to simply re-estimate models on a regular cycle than to evaluate each model separately. You are never worse off if you do this. It’s like washing the car every Saturday whether it needs it or not.
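The K-S drift check described above can be sketched in pure Python (in practice one would reach for something like scipy.stats.ks_2samp); the baseline and current score samples below are invented:

```python
import bisect

def ks_statistic(sample_a, sample_b):
    """Two-sample Kolmogorov-Smirnov statistic: the largest gap between the
    two empirical CDFs. A large value suggests the score distribution has
    drifted since the model was built."""
    a, b = sorted(sample_a), sorted(sample_b)

    def ecdf(sorted_xs, x):
        # fraction of the sample that is <= x
        return bisect.bisect_right(sorted_xs, x) / len(sorted_xs)

    return max(abs(ecdf(a, x) - ecdf(b, x)) for x in a + b)

# Hypothetical model scores: at build time vs. on this month's data
baseline = [0.1, 0.2, 0.3, 0.4, 0.5]
current = [0.6, 0.7, 0.8, 0.9, 1.0]
print(ks_statistic(baseline, current))  # 1.0 -- complete drift
```

Running this once is trivial; Dinsmore's point is that running and acting on it for a thousand deployed models is where the cost accumulates, which is what makes scheduled re-estimation attractive.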