Contributing Editor Hadley Wickham is Chief Scientist at RStudio and Adjunct Professor of Statistics at Rice University. He is interested in building better tools for data science. His work includes R packages for data analysis (ggplot2, plyr, reshape2); packages that make R less frustrating (lubridate for dates, stringr for strings, httr for accessing web APIs); and packages that make it easier to do good software development in R (roxygen2, testthat, devtools, lineprof, staticdocs). He is also a writer, educator, and frequent contributor to conferences promoting more accessible and more effective data analysis. He writes:

Recently, there has been much hand-wringing about the role of statistics in data science. In this and future columns, I’ll discuss both the threat and opportunity of data science. I believe that statistics is a crucial part of data science, but at the same time, most statistics departments are at grave risk of becoming irrelevant. Statistics is flourishing, yet academic statistics by and large continues to focus on problems that are not relevant to most data analyses. In this first column, I’ll discuss why I think data science isn’t just statistics, and highlight important parts of data science that are typically considered to be out of bounds for statistics research.

I think there are three main steps in a data science project: you collect data (and questions), analyze it (using visualization and models), then communicate the results. It’s rare to walk this process in one direction: often your analysis will reveal that you need new or different data, or when presenting results you’ll discover a flaw in your model.

Statistics has a lot to say about collecting data: survey sampling and design of experiments are well-established fields backed by decades of research. Statisticians, however, have little to say about collecting and refining questions. Good questions are crucial for good analysis, but there is little research in statistics about how to solicit and polish good questions, and it’s a skill rarely taught in core PhD curricula.

Once the data has been collected, it needs to be tidied (or normalized) into a form that’s amenable to analysis. Organizing data into the right ‘shape’ is essential for fluent data analysis: if it’s in the wrong shape you’ll spend the majority of your time fighting your tools, not questioning the data. I’ve worked on this problem for quite some time (culminating in the tidy data framework) but I’m aware of little similar work by statisticians.
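To make ‘shape’ concrete, here is a minimal sketch in plain Python. It assumes the usual tidy convention of one observation per row and one variable per column. The table, the column names, and the `to_long` helper are all hypothetical, invented for illustration rather than taken from any particular package.

```python
# Hypothetical "wide" table: one row per subject, one column per treatment.
# Here the treatment, which is really a variable, is hidden in the column names.
wide = [
    {"subject": "s1", "treatment_a": 4.1, "treatment_b": 5.3},
    {"subject": "s2", "treatment_a": 3.8, "treatment_b": 4.9},
]


def to_long(rows, id_key, var_name, value_name):
    """Reshape rows so that each output row holds exactly one observation.

    Every column other than `id_key` is treated as a measured value; its
    column name is moved into the `var_name` field of the output row.
    """
    long_rows = []
    for row in rows:
        for key, value in row.items():
            if key == id_key:
                continue
            long_rows.append(
                {id_key: row[id_key], var_name: key, value_name: value}
            )
    return long_rows


# Illustrative names only: "treatment" and "response" are assumptions.
tidy = to_long(wide, "subject", "treatment", "response")

for row in tidy:
    print(row)
```

In the long form, grouping or filtering by treatment is a question about values in a column rather than about column names, which is the kind of fight with the tools that the right shape avoids.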

Any real data analysis involves data manipulation (sometimes called wrangling or munging), visualization and modelling. Visualization and modelling are complementary. Visualizations surprise you, and can help refine vague questions. However, visualizations rely on human interpretation, so the ability to scale is fundamentally constrained. Models scale much better, and it’s usually possible to throw more computing at the problem. But models are constrained by their assumptions: fundamentally a model cannot surprise you. In any real analysis you may use both visualizations and models. But the vast majority of statistics research is on modelling, much less is on visualization, and less still on how to iterate between modelling and visualization to get to a good place.

The end product of an analysis is not a model: it is rhetoric. An analysis is meaningless unless it convinces someone to take action. In business, this typically means convincing senior management who have little statistical expertise. In science, it typically means convincing reviewers. Communication is not a mainstream thread of statistics research (if you attend the JSM, it’s easy to come to the conclusion that some academic statisticians couldn’t care less about the communication of results). Communication is a part of some PhD programs, but it tends to focus on professional communication (to other statisticians), not communicating with people who have substantive expertise in other domains.

In business, analyses are often not done just once, but need to be performed again and again as new data come in. These data products need to be robust in both the statistical sense (i.e. to changes in the underlying distributions/assumptions) and in the software engineering sense (i.e. to changes in the underlying technological infrastructure). This is a ripe field for research.

Statistics is a part of data science, not the whole thing. Statistics research focuses on data collection and modelling, and there is little work on developing good questions, thinking about the shape of data, communicating results or building data products.

There are people in statistics doing great work in all these areas, but it’s not mainstream statistics. If you’re interested in these areas, it’s harder to get tenure, harder to get grants, and most of the ‘top’ statistics journals are unavailable to you.

Attempting to claim that data science is ‘just’ statistics makes statisticians look out of touch, and belittles the many other contributions outside of statistics.

Editor’s note: The opinions expressed are exclusively of the columnist and do not necessarily reflect opinions of the IMS or editorial opinions of the IMS Bulletin.

31 Comments

I am currently working to develop a data science program at the University of Nebraska at Omaha and teaching an introduction to data science. While talking to the business community around Omaha, we got positive feedback, and everyone indicated the necessity of such a program. The existing programs in statistics, computing or business do not fully satisfy their needs. No matter how strongly statisticians or computer scientists insist they are doing data science, it does indeed make them ‘look out of touch’.

Your discussion in the column nicely complemented how Dr. Cleveland felt about statistics when he wrote his article “Data science: an action plan for expanding the technical areas of the field of statistics” in 2001.

Finally, I share the same feeling when you say “it’s harder to get tenure, harder to get grants, and most of the ‘top’ statistics journals are unavailable to you”. Looking forward to seeing your future columns.

Should the question be framed differently? How is data science different from data analysis? Data analysis seems to encompass all of “data science” and more because it involves models, visualization, simulation and significance testing or testing for independence, and attempts to answer questions about causality or what factors account for variability in a distribution. Data science is but one aspect of statistics as a discipline.

Spot on. It’s just another buzzword, like “Big Data”. Data Science, as the word is used today, is basically a very small subset of statistics and a fancy way of saying that one is practicing the data collection, categorization, and/or combination of existing statistical methods. Glad I’m not the only one who sees this.

I agree. I think also a key difference is that we as statisticians are taught the ins and outs of the scientific process. Take the theory of hypothesis testing for example. I have reviewed many data science program courses, and none appear to be calculus based nor as in-depth as theoretical statistics.

A great summary of what I also see in the rapidly changing world of data.

My experience is that many people who are strong in analytics and data management could definitely benefit from a strong course on thinking like a statistician (a use case review of many of the methods). In particular, developing a keen understanding that much of the power of statistics is underpinned by a strong focus on variability and how prior observed variability can inform your findings for better decisions.

Likewise, many who are expert in statistics would benefit greatly from becoming more knowledgeable about the business or field(s) they analyze data within. This includes speaking more often with leaders and executives about goals, challenges and better understanding the language of the decision-maker. This would lead them to easily reduce excessive analysis of topics that will have low impact and better inform their assumptions about the models they build.

Statisticians can also benefit from a stronger focus on managing dirty data, cleaning and tidying it. This should be a required course for most stats majors at any degree level. Finally, courses and practicums on data presentation skills including graph and dashboard design will radically improve the value of their degrees.

Could you say that data science is the concrete and practical application of statistics to real-world problems? In this idea, domain knowledge is important. For example, the statistician might study the effects of supplementing vitamin D, because she is asked to by researchers. The data scientist might tell researchers that vitamin D interacts with vitamins A and K, so the three need to be researched together. The data scientist helps to ask better questions.

Nice blog. I heard you speak in San Francisco in July 2012 on “The Future of Data Analysis”. Glad to see you’re still working on topics you mentioned there, including the “tidy data framework”.

I’m reminded of an analogous discussion with the then president of a related professional society. In response to my predicting a future of irrelevance for said society, he replied “Not everyone wants to be relevant.”

That said, let me support your observations as follows:

I think the field of statistics, as opposed to professional statisticians, has become too narrow and constrained. Perhaps you’re arguing the same thing. For example, proposing that “everything” in healthcare should undergo a double-blind randomized trial is self-defeating. As Russ Altman observes, “We don’t have time, we’ll all be dead.” Thus, as you imply, we need data-driven approaches that construct experiments as best we can. Because so many U.S. patients are treated differently even though they present “identically”, we have an ongoing natural experiment. The few professional statisticians I’ve spoken to about such things are deeply pragmatic, and, in their own way, they are data-driven instead of being method-driven. But the many books on statistics I have are almost all method-driven – implying that the method is the magic, etc.

If all we had to worry about was as you posit – visualization vs. modeling – I think we’d be in good shape. Instead what I see is endless refinement and esoterica re statistical tests.

And, incidentally apropos “surprises”, one way I know my model is good is when for a given input the model is right and my estimation from that data is wrong.

Lastly, now that we have (more or less) unbounded computing resources, the field of statistics should take on “method” with the goal of making statistics more uniform, letting data drive differences and not method.

I think the talk is more relevant to academic statisticians, not those working in industry such as health or finance. I have never been engaged in a project that is purely about statistical modelling. Each project is a collaboration with domain experts: discussing the data and the research questions we are going to answer, and communicating via visualisation and our modelling approach. And in the end, we have a fully developed protocol for the project.

I have heard many, many times that statisticians only care about analysis, not the business. I don’t know where people are getting this feeling.

Those working in the academic environment or those working as theoretical statisticians are developing and inventing methods for the analysis of different types of data. With the emergence of “Big Data”, not only do computing and IT infrastructure matter, but the methods do as well.

All the blather about how “statistics” is becoming irrelevant and “data science” is the right way to approach data is just hogwash. I don’t know whether the lies involved are deliberate or just ignorant, but it hardly matters.

The name of the identical field as “data science” is “statistics”. “Data science” is just a fad name for the field of statistics. All the claims that “data science” does things that poor old irrelevant statistics doesn’t do are 100% untrue.

Anyone familiar with statistics papers from the past 20 or 30 years will recognize what utter, complete nonsense articles like the one on this web page are.

I found your article from a search trying to understand the differences between Data Science and Statistics. What I don’t understand is why you are blaming Statistics for not asking good questions. It is not Statistics or Data Science that helps you ask good questions.

It is the person. Even before you look at data, you should already be asking questions. Otherwise, no amount of data can ever help you with that. Asking the “good questions” is very subjective. One person’s good question may not be the same as the next.

I’m sorry that your PhD program failed you. Statistics courses that I have taken focused a lot of attention on Interpretation of the results from data collection and modeling. Statistics is about drawing conclusions/interpretations about the pattern in the data.

This article did not really state the difference between Statistics and Data Science. Sadly, your ranting does not help people understand the difference. All it did, as the comments show, is produce “more ranting”. Your article still leaves me clear as mud on what the differences are, and I know there are many. I am not taking sides and just want the facts.

If we define data science as the science of answering scientifically relevant questions using data, then we need to treat the entire learning process in a scientific manner.

In statistics, we have extensive experience with experimental designs, learning from the models, and communicating the results. (I actually think many basic visualization tools from statistics, such as box plots and scatter plots, are still the most useful ones. One can certainly add new dimensions to these tools.)

But as Hadley pointed out, there are gaps. For example, think about the entire pipeline of doing a biological experiment and drawing conclusions from the data. Statisticians can help with design, analysis, and reporting of the results, but in the past, we have paid less attention to helping with documenting all details of an experiment, curating a database of experiment settings for published results (not just experiment data), and so on. (Hadley obviously has a lot of experience with cleaning dirty data.)

I also find it challenging in practice to convince all scientists to talk to a statistician about their designs before they carry out their experiments. If we can achieve that, we will keep many more statisticians busy, and will never need to worry about our job security (and we will be dealing with more high-quality data sets).

I am also not clear how a statistician can help with developing the scientific questions. It seems more the job of a scientist. I do see that, sometimes, scientists want to find questions from the data: they call this process hypothesis generating. But I am not fully convinced that is the best way to generate scientific questions.

To say that “Data Science is just Statistics” would have to imply that Statisticians use deep convolutional neural networks for classification, long short-term memory recurrent networks for forecasting, particle swarm optimizers and other computational intelligence algorithms for non-convex optimization, develop bespoke interactive real-time visualizations using libraries like D3 (implying they can write a modern JavaScript web application) and engineer solutions using all these building blocks end-to-end. If someone who proclaims to be a Statistician cannot do all the above, then they can’t claim that Statistics = Data Science. Because the above is by no means beyond the level of a 4-year graduate in a modern Computer Science curriculum, who happens to have 3 years of calculus, linear algebra, numerical analysis, discrete structures, mathematical statistics, statistical learning, artificial intelligence, deep learning, and the usual “core” Computer Science curriculum under his belt. This is what a modern, well-equipped Data Scientist possesses. They are a one-man Statistical Computer Scientist Swiss Army knife.

Before the name “data science” emerged, we had spoken of statistics and data analysis for a long time. Some people divide statistics into two groups: theoretical statistics, which works with random variables, proposes statistical methods for assumed models and investigates the statistical properties of those methods; and applied statistics, which analyzes real-world data using the methods developed by theoretical statisticians. When data are analyzed, data preparation (data cleaning, data tidying and so on) is usually unavoidable. Someone once said in a textbook on linear regression (obviously a statistics book) that, when analyzing data, one usually spends more than 80% of the time on data preparation and far less on running a program to compute the statistics; sometimes that takes only a few minutes.

Compared to statistics, data science puts more emphasis on question formulation. This appears to be the only difference.

The expression “Data Science” is misleading because it is not a science. “Data scientists” are not scientists, they are more like engineers since they apply results obtained by sciences such as Statistics, Mathematics and Computer Science, among others. Data Engineering would be perhaps a more accurate expression for such activities.
