The Data Science Delusion

Gleanings from observed technical misunderstandings between business leaders and data scientists (and among data scientists themselves) so dramatic that one could start wondering whether there is something wrong with data science as it is being practiced.

By Anand Ramanathan, Computer Scientist.

Prologue

Four years ago, having earned my living as a programmer/researcher for over a decade, I was co-opted into the data science movement. Since then I have witnessed technical misunderstandings between business leaders and data scientists (and among data scientists themselves) quite unlike anything I had seen earlier, and some projects have ended with so unexpected a notion of business or scientific success that I started wondering whether there is something wrong with data science as it is being practiced.

Discussions with several colleagues convinced me that I was not alone in my thoughts. And though not easy to find in the deluge of hype online, more than a few eminent voices have raised questions and concerns about data science. In this article, I have tried to tie together these strands of contrarian thinking with some perspectives of my own, in an attempt to explore the reasons for what could be termed a delusion in many parts of data science.

The Fundamental Problem

First, let’s get out of the way what is unexceptionable: being data-driven is beneficial [1]; it is necessary now to compete on analytics [2]; and there is a set of technologies (big data platforms, scientific software, machine-learning algorithms) that are converging in maturity, making it easy to travel the data-driven path. Data science, then, by this or another name, is indeed an idea whose time has come.

However, while much of the data science work in the industry is useful, and often ground-breaking, much is not. What are the circumstances in which data science succeeds, and where lie the delusionary traps?

Poor Definition

It can be understood too, but only dimly and in flashes. Not half a dozen men have ever been able to keep the whole equation of pictures in their heads. — Scott Fitzgerald, “The Last Tycoon”

Mathematical and statistical knowledge, advanced computing skills (including databases, high-performance computing and visualization) and substantive expertise (or application and domain knowledge) form the almost impossible intersection in the Data Science Venn diagram [3]. It would no doubt be a great advantage if all these skills could somehow be concentrated in one human being, a new-age Spock [1], so to speak. One wonders, though, how many such unicorns could exist. It is also a laudable goal to move away from the extreme specialization of traditional or academic research [4], but has there been an overcompensation?

Figure 1: The Data Science Venn Diagram

It is interesting to note that DJ Patil’s conception of the term “data scientist” seems to suggest a generalist product manager or developer, rather than some kind of a super-scientist [5]:

Yes. It’s good and bad. I think there’s this interesting question of, Well, what is a data scientist? Isn’t that just a scientist? Don’t scientists just use data? So what does that term even mean?

You’ve had one of my co-authors, Hilary Mason, on the show, and the thing we joke about and we wrote about together, is that the number one thing about data scientists’ job description is that it’s amorphous. There’s no specific thing that you do; the work kind of embodies all these different things. You do whatever you need to do to solve a problem.

— DJ Patil, “10 questions for the nation’s first data scientist”

This kind of broad definition seems to be the consensus in the industry: “a set of activities involved in transforming collected data into valuable insights, products, or solutions” [6].

Numerous master’s programs in data science have been introduced in recent years to address this need for generalists with the right mix of competencies. They provide basic training in research methods, statistical modeling, applied machine learning and big data. Critics note, however, that such courses, which are “spreading like mushrooms after the rain”, address only a part of what it means to be a data scientist [7].

A word about calling it a science: Is data science the study of data? That would be a strange claim, as Patil suggests, for every empirical science is based on studying data. Moreover, a science of data disconnected from any domain already has a name: Statistics. And if it is the science of business, it should probably be called Business Science [8]. Of course, despite the poor coinage, data science may one day evolve into a real science by settling on sound foundational principles (as computer science did over its first couple of decades). For now, it seems to function more as a buzzword designed to attract talented scientists from various disciplines to work for business [9, 10].

One consequence of this vague definition is that experts have variously claimed data science to be Statistics 2.0 [8, 11], Computer Science 2.0 [12] and Business Analytics 2.0 [8]. This is partly because of greater interdisciplinary work between statistics and computer science, and a necessary coming together of two different ways of thinking [13, 14], but it also points towards a fundamental confusion.

I should also note that the term “data science” has a long history, as has the confusion surrounding the term. In fact, as early as the 1960s, Peter Naur suggested it as a better name for computer science [15, 16], and Jeff Wu, in the late 1990s, suggested that “statistics = data science” and “statisticians = data scientists”!

Easy-to-Fake

The last decade has seen many areas of research (parallel processing, machine learning, visualization, statistical programming languages) mature into technology, which has made it possible for one person (perhaps an expert in one discipline) to take a project through the stages of data ingestion and manipulation, statistical modeling, and visualization entirely on his or her own. This democratization of algorithms and platforms, paradoxically, has a downside: the signaling properties of such skills have more or less been lost. Where earlier you needed to read and understand a technical paper or a book to implement a model, now you can just use an off-the-shelf model as a black box. While this phenomenon affects many disciplines, the vague and multidisciplinary definition of data science certainly exacerbates the problem.

These reasons make it easy to pose as a data scientist, and the huge demand and limited supply have made data science a fertile ground for poseurs [17, 7].

Deliberate faking aside, the difficulty in evaluating what is being done leads both business leaders and data scientists down blind alleys.

Addressing the Problem

One response to data science’s purported need to be “all things to all people” [18] has been to carve up the space: type A (analysis) vs. type B (building) data scientists [19], statistical vs. database vs. computer systems data scientists [18], and human-targeted vs. machine-targeted data scientists [20]. These are sound suggestions, essentially acknowledging the difficulty of finding unicorn data scientists, and urging the creation of teams with the full range of skills. Such teams then face the problem of melding a disparate set of skills.

The ease of faking omniscience has brought warnings about incomplete knowledge. For example, Drew Conway pointed out (in his article detailing the data science Venn diagram) that the three data science skills on their own or combined with only one other “are at best simply not data science, or at worst downright dangerous” [3]. Most problematic, according to him, is the combination of hacking skills (computing) and domain expertise, because it is in this intersection that people “know enough to be dangerous”.

However, it is not simply an intersection of skills that is lethal, but these skills applied to the wrong task. In fact, the very set of skills that works well in one arena could be dangerous in another. Moreover, the fact that these skills rarely come together in one person often leads to problems when data scientists work with each other.

So, what specifically goes wrong when different types of data scientists, possessing very different skill combinations, work with each other or with business stakeholders, on varied data science problems? An oversimplified model of the data science landscape can help us answer.

A Finer Scalpel: The Data Science Landscape

Figure 2: The Data Science Landscape

The data science landscape in Figure 2 has Modeling Difficulty on one axis (how amenable the problem is to simple statistical modeling or machine learning) and System Complexity on the other (the complexity of the business processes being modeled, the domain dependence of the data, the scale of the data, timeliness requirements, and so on). This representation combines two of the components of data science, computing skills and domain expertise, into one dimension (system complexity), primarily for convenience of analysis. This quadrant view could be further refined into an octant view by separating out these components (more on this later). The terms simple and complex could be misunderstood here. To be clear, we are concerned not with the simplicity of the final model, but with how easily a task can be modeled using state-of-the-art technologies.

Movement along the Modeling-Difficulty axis brings an increasing need for a specialized data scientist (e.g., an expert in time-series analysis or text processing), while movement along the System-Complexity axis brings a greater need for domain expertise and systems understanding. For instance, a classification problem where the domain already provides data in the required form (say, lots of historical data was manually tagged, and the task is to automate the classification) probably slots into Quadrant-1 (Q1). Another example of a Q1 task would be sentiment-tagging when you have an off-the-shelf model pre-trained on data from the same domain. The task would move to Q3 if the data from the target domain is quite different and needs to be preprocessed and modeled anew. (The promise of deep learning is to reduce differences along this dimension, by automating the feature engineering involved [21].)

On the other hand, a move to Q2 or Q4 happens when formulating the problem is itself a challenge, or when aspects of the system cause assumptions to break: the data is updated in specific ways every week, say, or the value of the data stream decays, perhaps due to changing user behaviour (e.g., Google Flu Trends [22]). The same happens if there are vast differences within a variable that are not represented in the data (e.g., state-level data aggregated at the country or continent level).
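
As a rough illustration, the landscape can be sketched in a few lines of Python. Everything in this sketch is an assumption made for the sake of the example: the names Task and quadrant, the 0-to-1 scores, and the 0.5 threshold are hypothetical. The quadrant numbering is inferred from the text (Q1 easy on both axes, Q3 harder to model, Q2 and Q4 higher in system complexity). Which of Q2 and Q4 sits where is a guess, since Figure 2 is not reproduced here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Task:
    """A data science task scored on the two axes of the landscape.

    Both scores are hypothetical judgments on a 0.0 to 1.0 scale.
    """
    name: str
    modeling_difficulty: float  # 0.0: an off-the-shelf model works; 1.0: needs a specialist
    system_complexity: float    # 0.0: clean, stable data; 1.0: messy, shifting system


def quadrant(task: Task, threshold: float = 0.5) -> str:
    """Place a task in a quadrant (assumed numbering, see the lead-in)."""
    hard_to_model = task.modeling_difficulty >= threshold
    complex_system = task.system_complexity >= threshold
    if not complex_system:
        return "Q3" if hard_to_model else "Q1"
    return "Q4" if hard_to_model else "Q2"


if __name__ == "__main__":
    # Illustrative scores only; they are not measurements of anything.
    examples = [
        Task("sentiment tagging, in-domain pre-trained model", 0.2, 0.2),
        Task("sentiment tagging, very different target domain", 0.8, 0.2),
        Task("forecasting from a decaying data stream", 0.8, 0.9),
    ]
    for task in examples:
        print(f"{quadrant(task)}: {task.name}")
```

The point of the sketch is only that a task's position depends on two separate judgments, and that a change in either one (a new target domain, a data stream whose value decays) moves the task to a different quadrant.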