Why You Should Never Trust a Data Scientist

The wonderful thing about being a data scientist is that I get all of the credibility of genuine science, with none of the irritating peer review or reproducibility worries. My first taste of this was my Facebook friends connection map. The underlying data was sound, derived from 220m public profiles. The network visualization, drawing lines between the top ten links for each city, had issues but was defensible. The clustering was produced by me squinting at all the lines, coloring in some areas that seemed more connected in a paint program, and picking silly names for the areas. I thought I was publishing an entertaining view of some data I’d extracted, but it was treated like a scientific study. A New York Times columnist used it as evidence that the US was perilously divided. White supremacists dug into the tool to show that Juan was more popular than John in Texan border towns, and so the country was on the verge of being swamped by Hispanics. …

I’ve enjoyed publishing a lot of data-driven stories since then, but I’ve never ceased to be disturbed at how the inclusion of numbers and the mention of large data sets numbs criticism. The articles live in a strange purgatory between journalism, toward which most readers have a healthy skepticism, and science, where we sub-contract verification to other scientists and so trust the public output far more. … If a sociologist tells you that people in Utah only have friends in Utah, you can follow a web of references and peer review to understand whether she’s believable. If I, or somebody at a large tech company, tells you the same, there’s no way to check. The source data is proprietary, and in a lot of cases may not even exist any more in the same form, as databases turn over and users delete or update their information. Even other data scientists outside the team won’t be able to verify the results. The data scientists I know are honest people, but there are no external checks in the system to keep them that way.

I’m curious: do MC readers consider “statistical, numerical, and research design illiteracy” to include a tendency to believe inferences drawn from nonrandomly selected data? I follow MC occasionally, and I’m surprised at the number of posts based on data that surely is biased by its collection process. Commentary rarely draws attention to these issues. Given the explosion in nonrandom samples, it might be nice to have a subfield developing more ways to deal with them. We’ve got raking, capture-recapture, and sensitivity analyses of various kinds. Surely we could have more and better? And insist on them when presented with analysis of, say, traffic patterns based on RFID payment schemes (e.g., E-ZPass, FasTrak) which exclude drivers without bank accounts (from an article in Significance, not MC, but it’s at the top of my mind right now)?
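For readers unfamiliar with raking, here is a minimal sketch of the idea: iteratively rescale a sample’s cross-tabulation until its margins match known population totals (iterative proportional fitting). All counts and margins below are made up purely for illustration; the function name `rake` is my own, not from any particular library.

```python
import numpy as np

def rake(counts, row_targets, col_targets, iters=100, tol=1e-10):
    """Rescale a sample cross-tab so its row/column sums match known
    population margins (iterative proportional fitting)."""
    w = counts.astype(float).copy()
    for _ in range(iters):
        # Scale rows to hit the row margins, then columns to hit the
        # column margins; repeat until both are (approximately) satisfied.
        w *= (row_targets / w.sum(axis=1))[:, None]
        w *= (col_targets / w.sum(axis=0))[None, :]
        if np.allclose(w.sum(axis=1), row_targets, atol=tol):
            break
    return w

# Hypothetical biased sample: rows = sex, columns = age group.
sample = np.array([[40, 10],
                   [20, 30]])
# Known population margins (e.g., from a census): 50/50 by sex, 60/40 by age.
row_m = np.array([50.0, 50.0])
col_m = np.array([60.0, 40.0])

weighted = rake(sample, row_m, col_m)
```

The adjusted table keeps the sample’s internal association structure while forcing the margins to agree with the census, which is exactly the kind of partial correction (and no more) that these methods can offer for nonrandom samples.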

Never forget: it was the “data scientists” on Wall Street (a.k.a. quants) who crashed the world’s economy. So the title isn’t nearly as sensational as it could be. “Data Scientists are Evil Predators”, for example.

And don’t get me going on Bayesian priors.
