One of the highlights of the 2012 Strata California conference was the Oxford-style debate on the proposition “In data science, domain expertise is more important than machine learning skill.” If you weren’t there, Mike Driscoll’s summary is an excellent overview (full video of the debate is available here). To make a long story short, the “cons” won; the audience was won over to the side that machine learning is more important. That’s not surprising, given that we’ve all experienced the unreasonable effectiveness of data. From the audience, Claudia Perlich pointed out that she won data mining competitions on breast cancer, movie reviews, and customer behavior without any prior knowledge. And Pete Warden (@petewarden) made the point that, when faced with the problem of finding “good” pictures on Facebook, he ran a data mining contest at Kaggle.

A good impromptu debate necessarily raises as many questions as it answers. Here’s the question that I was left with. The debate focused on whether domain expertise was necessary to ask the right questions, but a recent Guardian article, “The End of Theory,” asked a different but related question: Do we need theory (read: domain expertise) to understand the results, the output of our data analysis? The debate focused on a priori questions, but maybe the real value of domain expertise is a posteriori: after-the-fact reflection on the results and whether they make sense. Asking the right question is certainly important, but so is knowing whether you’ve gotten the right answer and knowing what that answer means. Neither problem is trivial, and in the real world, they’re often closely coupled. Often, the only way to know you’ve put garbage in is that you’ve gotten garbage out.

By the same token, data analysis frequently produces results that make too much sense. It yields data that merely reflects the biases of the organization doing the work. Bad sampling techniques, overfitting, cherry picking datasets, overly aggressive data cleaning, and other errors in data handling can all lead to results that are either too expected or entirely unexpected. “Stupid Data Miner Tricks” is a hilarious send-up of the problems of data mining: It shows how to “predict” the value of the S&P index over a 10-year period based on butter production in Bangladesh, cheese production in the U.S., and the world sheep population.
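To see how easily this kind of “prediction” arises, here is a minimal sketch in Python. It is not the method from “Stupid Data Miner Tricks”; it only illustrates the general trap of searching many unrelated series for one that happens to track a target. The function names, the use of random walks as stand-ins for real economic series, and the defaults of 1,000 candidates and 10 data points are all assumptions made for illustration.

```python
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def random_walk(rng, length):
    """A series whose steps are independent Gaussian noise."""
    value, walk = 0.0, []
    for _ in range(length):
        value += rng.gauss(0, 1)
        walk.append(value)
    return walk


def best_spurious_match(n_candidates=1000, length=10, seed=0):
    """Strongest |correlation| between a target and unrelated random series."""
    rng = random.Random(seed)
    target = random_walk(rng, length)  # hypothetical stand-in for an index
    candidates = [random_walk(rng, length) for _ in range(n_candidates)]
    return max(abs(pearson(target, c)) for c in candidates)


if __name__ == "__main__":
    best = best_spurious_match()
    print(f"best |r| among 1000 unrelated series: {best:.2f}")
```

Every candidate series here is pure noise, generated independently of the target, yet picking the best of a thousand reliably turns up a strong correlation. The arithmetic is fine; nothing in it can tell you that the relationship is meaningless.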

Cherry picking and overfitting have particularly bad “smells” that are often fairly obvious: The Democrats never lose a Presidential election in a year when the Yankees win the World Series, for example. (Hmmm. The 2000 election was rather fishy.) Any reasonably experienced data scientist should be able to stay out of trouble, but what if you treat your data with care and it still spits out an unexpected result? Or an expected result that’s too good to be true? After the data crunching has been done, it’s the subject expert’s job to ensure that your results are good, meaningful, and well-understood.

Let’s say you’re an audio equipment seller analyzing a lot of purchase data and you find out that people buy more orange juice just before replacing their home audio system. It’s an unlikely, absurd (and completely made up) result, but stranger things have happened. I’d probably go and build an audio gear marketing campaign targeting bulk purchasers of orange juice. Sales would probably go up; data is “unreasonably effective,” even if you don’t know why. This is precisely where things get interesting, and precisely where I think subject matter expertise becomes important: after the fact. Data breeds data, and it’s naive to think that marketing audio gear to OJ addicts wouldn’t breed more datasets and more analysis. It’s naive to think the OJ data wouldn’t be used in combination with other datasets to produce second-, third-, and fourth-order results. That’s when the unreasonable effectiveness of data isn’t enough; that’s when it’s important to understand the results in ways that go beyond what data analysis alone can currently give us. We may have a useful result that we don’t understand, but is it meaningful to combine that result with other results that we may (or may not) understand?

Let’s look at a more realistic scenario. Pete Warden’s Kaggle-based algorithm for finding quality pictures works well, despite giving the surprising result that pictures with “Michigan” in the caption are significantly better than average. (As are pictures from Peru, and pictures taken of tombs.) Why Michigan? Your guess is as good as mine. For Warden’s application, building photo albums on the fly for his company Jetpac, that’s fine. But if you’re building a more complex system that plans vacations for photographers, you’d better know more than that. Why are the photographs good? Is Michigan a destination for birders? Is it a destination for people who like tombs? Is it a destination with artifacts from ancient civilizations? Or would you be better off recommending a trip to Peru?

Another realistic scenario: Target recently used purchase histories to target pregnant women with ads for baby-related products, with surprising success. I won’t rehash that story. From that starting point, you can go a lot further. Pregnancies frequently lead to new car purchases. New car purchases lead to new insurance premiums, and I expect data will show that women with babies are safer drivers. At each step, you’re compounding data with more data. It would certainly be nice to know you understood what was happening at each step of the way before offering a teenage driver a low insurance premium just because she thought a large black handbag (that happened to be appropriate for storing diapers) looked cool.

There’s a limit to the value you can derive from correct but inexplicable results. (Whatever else one may say about the Target case, it looks like they made sure they understood the results.) It takes a subject matter expert to make the leap from correct results to understood results. In an email, Pete Warden said:
