The unreasonable necessity of subject experts

Experts make the leap from correct results to understood results.

One of the highlights of the 2012 Strata California conference was the Oxford-style debate on the proposition “In data science, domain expertise is more important than machine learning skill.” If you weren’t there, Mike Driscoll’s summary is an excellent overview (full video of the debate is also available). To make a long story short, the “cons” won; the audience was won over to the side that machine learning is more important. That’s not surprising, given that we’ve all experienced the unreasonable effectiveness of data. From the audience, Claudia Perlich pointed out that she won data mining competitions on breast cancer, movie reviews, and customer behavior without any prior knowledge. And Pete Warden (@petewarden) made the point that, when faced with the problem of finding “good” pictures on Facebook, he ran a data mining contest at Kaggle.

A good impromptu debate necessarily raises as many questions as it answers. Here’s the question that I was left with. The debate focused on whether domain expertise was necessary to ask the right questions, but a recent Guardian article, “The End of Theory,” asked a different but related question: Do we need theory (read: domain expertise) to understand the results, the output of our data analysis? The debate focused on a priori questions, but maybe the real value of domain expertise is a posteriori: after-the-fact reflection on the results and whether they make sense. Asking the right question is certainly important, but so is knowing whether you’ve gotten the right answer and knowing what that answer means. Neither problem is trivial, and in the real world, they’re often closely coupled. Often, the only way to know you’ve put garbage in is that you’ve gotten garbage out.

By the same token, data analysis frequently produces results that make too much sense. It yields data that merely reflects the biases of the organization doing the work. Bad sampling techniques, overfitting, cherry picking datasets, overly aggressive data cleaning, and other errors in data handling can all lead to results that are either too expected or unexpected. “Stupid Data Miner Tricks” is a hilarious send-up of the problems of data mining: It shows how to “predict” the value of the S&P index over a 10-year period based on butter production in Bangladesh, cheese production in the U.S., and the world sheep population.
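To see how little it takes to manufacture a result like that, here is a minimal Python sketch. It is not the method from “Stupid Data Miner Tricks”; it is an illustration under made-up assumptions: the target series is pure noise, the 1,000 candidate “predictors” are also pure noise, and we simply keep whichever candidate correlates best with the target over the first 10 points. The function names, seed, and sizes are all hypothetical.

```python
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def dredge(n_candidates=1000, n_train=10, n_test=200, seed=0):
    """Pick the noise series that best 'explains' a noise target in-sample."""
    rng = random.Random(seed)
    total = n_train + n_test
    target = [rng.gauss(0, 1) for _ in range(total)]
    candidates = [
        [rng.gauss(0, 1) for _ in range(total)] for _ in range(n_candidates)
    ]
    # Cherry picking: keep the candidate with the strongest in-sample correlation.
    best = max(
        candidates,
        key=lambda c: abs(pearson(c[:n_train], target[:n_train])),
    )
    in_sample = abs(pearson(best[:n_train], target[:n_train]))
    # The same candidate, scored on points it was not selected on.
    out_sample = abs(pearson(best[n_train:], target[n_train:]))
    return in_sample, out_sample


in_sample, out_sample = dredge()
print(f"best in-sample |r|:   {in_sample:.2f}")
print(f"same series, held out: {out_sample:.2f}")
```

With this many candidates and so few training points, the winning series should look impressive in-sample and ordinary on the held-out points, even though nothing in the simulation is related to anything else.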

Cherry picking and overfitting have particularly bad “smells” that are often fairly obvious: The Democrats never lose a presidential election in a year when the Yankees win the World Series, for example. (Hmmm. The 2000 election was rather fishy.) Any reasonably experienced data scientist should be able to stay out of trouble, but what if you treat your data with care and it still spits out an unexpected result? Or an expected result that’s too good to be true? After the data crunching has been done, it’s the subject expert’s job to ensure that your results are good, meaningful, and well-understood.

Let’s say you’re an audio equipment seller analyzing a lot of purchase data and you find out that people buy more orange juice just before replacing their home audio system. It’s an unlikely, absurd (and completely made up) result, but stranger things have happened. I’d probably go and build an audio gear marketing campaign targeting bulk purchasers of orange juice. Sales would probably go up; data is “unreasonably effective,” even if you don’t know why. This is precisely where things get interesting, and precisely where I think subject matter expertise becomes important: after the fact. Data breeds data, and it’s naive to think that marketing audio gear to OJ addicts wouldn’t breed more datasets and more analysis. It’s naive to think the OJ data wouldn’t be used in combination with other datasets to produce second-, third-, and fourth-order results. That’s when the unreasonable effectiveness of data isn’t enough; that’s when it’s important to understand the results in ways that go beyond what data analysis alone can currently give us. We may have a useful result that we don’t understand, but is it meaningful to combine that result with other results that we may (or may not) understand?

Let’s look at a more realistic scenario. Pete Warden’s Kaggle-based algorithm for finding quality pictures works well, despite giving the surprising result that pictures with “Michigan” in the caption are significantly better than average. (As are pictures from Peru, and pictures taken of tombs.) Why Michigan? Your guess is as good as mine. For Warden’s application, building photo albums on the fly for his company Jetpac, that’s fine. But if you’re building a more complex system that plans vacations for photographers, you’d better know more than that. Why are the photographs good? Is Michigan a destination for birders? Is it a destination for people who like tombs? Is it a destination with artifacts from ancient civilizations? Or would you be better off recommending a trip to Peru?

Another realistic scenario: Target recently used purchase histories to target pregnant women with ads for baby-related products, with surprising success. I won’t rehash that story. From that starting point, you can go a lot further. Pregnancies frequently lead to new car purchases. New car purchases lead to new insurance premiums, and I expect data will show that women with babies are safer drivers. At each step, you’re compounding data with more data. It would certainly be nice to know you understood what was happening at each step of the way before offering a teenage driver a low insurance premium just because she thought a large black handbag (that happened to be appropriate for storing diapers) looked cool.

There’s a limit to the value you can derive from correct but inexplicable results. (Whatever else one may say about the Target case, it looks like they made sure they understood the results.) It takes a subject matter expert to make the leap from correct results to understood results. In an email, Pete Warden said:

“My biggest worry is that we’re making important decisions based on black-box algorithms that may have hidden and problematic biases. If we’re deciding who to give a mortgage based on machine learning, and the system consistently turns down black people, how do we even notice it, let alone fix it, unless we understand what the rules are? A real-world case is trading systems. If you have a mass of tangled and inexplicable logic driving trades, how do you assign blame when something like the Flash Crash happens?

“For decades, we’ve had computer systems we don’t understand making decisions for us, but at least when something went wrong we could go in afterward and figure out what the causes were. More and more, we’re going to be left shrugging our shoulders when someone asks us for an explanation.”

That’s why you need subject matter experts to understand your results, rather than simply accepting them at face value. It’s easy to imagine that subject matter expertise requires hiring a PhD in some arcane discipline. For many applications, though, it’s much more effective to develop your own expertise. In an email exchange, DJ Patil (@dpatil) said that people often become subject experts just by playing with the data. As an undergrad, he had to analyze a dataset about sardine populations off the coast of California. Trying to understand some anomalies led him to ask questions about coastal currents, why biologists only count sardines at certain stages in their life cycle, and more. Patil said:

“… this is what makes an awesome data scientist. They use data to have a conversation. This way they learn and bring other data elements together, create tests, challenge hypothesis, and iterate.”

By asking questions of the data, and using those questions to ask more questions, Patil became an expert in an esoteric branch of marine biology, and in the process greatly increased the value of his results.

When subject expertise really isn’t available, it’s possible to create a workaround through clever application design. One of my takeaways from Patil’s “Data Jujitsu” talk was the clever way LinkedIn “crowdsourced” subject matter expertise to their membership. Rather than sending job recommendations directly to a member, they’d send them to a friend, and ask the friend to pass along any they thought appropriate. This trick doesn’t solve problems with hidden biases, and it doesn’t give LinkedIn insight into why any given recommendation is appropriate, but it does an effective job of filtering inappropriate recommendations.

Whether you hire subject experts, grow your own, or outsource the problem through the application, data only becomes “unreasonably effective” through the conversation that takes place after the numbers have been crunched. At his Strata keynote, Avinash Kaushik (@avinash) revisited Donald Rumsfeld’s statement about known knowns, known unknowns, and unknown unknowns, and argued that the “unknown unknowns” are where the most interesting and important results lie. That’s the territory we’re entering here: data-driven results we would never have expected. We can only take our inexplicable results at face value if we’re just going to use them and put them away. Nobody uses data that way. To push through to the next, even more interesting result, we need to understand what our results mean; our second- and third-order results will only be useful when we understand the foundations on which they’re based. And that’s the real value of a subject matter expert: not just asking the right questions, but understanding the results and finding the story that the data wants to tell. Results are good, but we can’t forget that data is ultimately about insight, and insight is inextricably tied to the stories we build from the data. And those stories are going to be ever more essential as we use data to build increasingly complex systems.

I was at the debate and it was an interesting discussion. The conclusion that I came out with was that domain experts and machine learning experts are equally valuable and necessary to the process. They each bring different, complementary skills, knowledge and insight to the problem. I don’t think that one is more valuable than the other.

I didn’t go back and look at the video, but I could have sworn that the proposition was presented this way: if you were a startup hiring your first employee, would it be better to hire a domain expert or a machine learning expert?

http://blog.sparklinglogic.com Carole-Ann Matignon

Brilliant analysis.

Being in the Decision Management space (capturing decision logic for automation and improvement), I have seen exactly the same dilemma: modelers develop predictive models in isolation, then hand them over to IT or Business Users so that they can tune the rules that act on the predictions. There is an isolation / rivalry between those two groups.

The greater value resides in the collaboration. Being able to understand the business and interpret it is critical to the outcome. Being able to leverage vast amounts of data is critical to detecting faint signals as they happen.

On the one hand, I’m not sure the cited cases really were done without domain expertise. Certainly, for example, everyone’s at least “expert” enough to agree that “breast cancer is bad,” and likewise any cancer patient is heavily disposed to accept any treatment with a good track record, understood or not.

On the other hand, in the situations where I actually munge data (and I do), I really already have domain experts on hand: they’re the ones who requested the data; my data is “decision support” for those already fairly well qualified to make the decisions: the business people of the business.

What I’ve found most lacking is yet a third kind of contributor: call them the “epistemology experts.” Data that confirm suspicions can be helpful, but the most important data and analysis is that which questions or overturns established assumptions — in other words, forestalls mistakes.

scott gray

I think the domains that are being considered here are so simple that a “domain expert” doesn’t have that much to offer.

However, if the domain is something like Particle Physics, and the data set is from the LHC then just showing up with some machine learning skills isn’t going to be much help.

http://www.hablantia.com Peter Bennett

Hi,

Interesting article. Seems to me that the old cycle of ‘hypothesis’ then ‘experimental measurement’ is alive and well, but with the ability to be sloppier on the hypothesis.

The correlations found in mining of one dataset should lead to choices of new datasets to add to the analysis. By definition the initial dataset lacked the ‘new’ data, so this decision can only come from outside the analysis – i.e. from a hypothesis-forming ‘expert’ who has access to a broader range of data (albeit at a much shallower level).

Your experts may possibly end up focusing on expanding their breadth of knowledge of different data sources, rather than their depth of understanding of individual ones.

Thanks again for the article.

Peter

http://gumption.typepad.com Joe McCarthy

Interesting summary of what sounds like an interesting debate.

I found myself thinking about the infinite monkey theorem: “a monkey hitting keys at random on a typewriter keyboard for an infinite amount of time will almost surely type a given text, such as the complete works of William Shakespeare” [Wikipedia].

I do not intend to equate monkeys with machine learning experts (especially due to my PhD being in machine learning and NLP), but without a subject matter expert (in English literature .. or at least English), one might conclude that the first generated text is the complete works of William Shakespeare … or, drawing on John Searle’s Chinese room thought experiment, the complete works of Lu Xun.

With respect to your assertion that “the ‘cons’ won”, in m.e.driscoll’s summary of the data science debate, he reports that the vote was 55 to 52, leading me to wonder about the statistical significance of the outcome. I also wonder about the influence of what I suspect was significant subject matter expertise among the classifiers employed in this particular classification task.
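A rough check on that 55-to-52 margin, as a sketch only: under the assumption that each of the 107 votes is an independent coin flip from an evenly split audience (a simplification, since a conference audience is not a random sample), an exact two-sided binomial p-value can be computed with the standard library. The function name is made up for illustration.

```python
from math import comb


def two_sided_binomial_p(k, n):
    """Exact two-sided p-value for k 'yes' votes out of n under a 50/50 null."""
    # Under p = 0.5 the distribution is symmetric around n / 2, so count every
    # outcome at least as far from the center as the observed one.
    distance = abs(k - n / 2)
    extreme = sum(
        comb(n, i) for i in range(n + 1) if abs(i - n / 2) >= distance
    )
    return extreme / 2 ** n


votes_con, votes_pro = 55, 52
p_value = two_sided_binomial_p(votes_con, votes_con + votes_pro)
print(f"two-sided p-value: {p_value:.3f}")
```

Under these assumptions the p-value comes out far above the conventional 0.05 threshold, so a three-vote margin in a room of 107 is entirely consistent with an evenly divided audience.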

http://blog.gillerinvestments.com Graham Giller

I feel it’s worth commenting on the S&P/butter example because it has more depth than the simple “this is a dumb thing to do” context.

It’s an example of a problem (spurious regressions of unit root processes) that won the Nobel Prize in economics (for the invention of Cointegration by Granger and Engle).

The regression is real, but has no predictive power. The power is absent because the unit roots create stochastic trends in both series and the regression has identified the in-sample relationship between the stochastic trends — and there is no out-of-sample relationship.

A robotic fishing expedition with data mining code would fail to know that merely differencing the series (ARIMA, anybody?) would remove the roots, and that’s why ML is not the answer to everything.
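The point about differencing can be illustrated with a small simulation. This is a sketch, not Granger and Engle’s analysis: it assumes the simplest unit root process, a Gaussian random walk, draws many pairs of independent walks, and compares the average absolute correlation of the levels with that of the first differences. The function names, seed, and sizes are hypothetical choices.

```python
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def random_walk(n_steps, rng):
    """Cumulative sum of independent N(0, 1) shocks: a unit root process."""
    level, path = 0.0, []
    for _ in range(n_steps):
        level += rng.gauss(0, 1)
        path.append(level)
    return path


def first_difference(series):
    """Period-to-period changes; for a random walk this recovers the shocks."""
    return [later - earlier for earlier, later in zip(series, series[1:])]


def mean_abs_correlations(n_pairs=200, n_steps=500, seed=1):
    """Average |r| for independent walks, in levels and in differences."""
    rng = random.Random(seed)
    level_total = diff_total = 0.0
    for _ in range(n_pairs):
        a, b = random_walk(n_steps, rng), random_walk(n_steps, rng)
        level_total += abs(pearson(a, b))
        diff_total += abs(pearson(first_difference(a), first_difference(b)))
    return level_total / n_pairs, diff_total / n_pairs


levels_r, diffs_r = mean_abs_correlations()
print(f"mean |r| of levels:      {levels_r:.2f}")
print(f"mean |r| of differences: {diffs_r:.2f}")
```

The two walks in each pair are unrelated by construction, so any sizable correlation in the levels comes from the stochastic trends alone; after differencing, the correlation should shrink toward zero.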
