Data science without statistics is possible, even desirable

The purpose of this article is to clarify a few misconceptions about data and statistical science.

I will start with a controversial statement: data science barely uses statistical science and techniques. The truth is actually more nuanced, as explained below.

1. Data science heavily uses new statistical science

But many statisticians do not regard the new statistical science in question as statistics. I don't know what to call it; "new statistical science" is a misnomer, because it is not all that novel. And statisticians regard it as dirty data processing, not elegant statistics.

While I consider these topics to be statistical science (I contributed to many of them myself, and my background is in computational statistics), most statisticians I talked to do not see it as statistical science. And calling this stuff statistics only creates confusion, especially for hiring managers.

Some people call it statistical learning. One of the pioneers of this type of method is Trevor Hastie, who co-wrote one of the first data science books, The Elements of Statistical Learning.

2. Data science uses a bit of old statistical science

Including the following topics, which, curiously enough, are not found in standard statistical textbooks:

Exploratory data analysis (to be automated with tools such as a data dictionary, at which point it won't be old statistics anymore)

Sampling

Some statistical distributions

Random variables

Some asymptotic results, although I encourage Monte-Carlo simulations to obtain limiting distributions, rather than theoretical principles which may not apply to real, modern data
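As a minimal sketch of the Monte-Carlo approach mentioned in the last item, the Python snippet below approximates the limiting distribution of a standardized sample mean by simulation rather than by theory. The Exponential(1) population, the sample size, the number of replications and the function names are all illustrative assumptions, not something taken from this article.

```python
import random
import statistics

def simulate_standardized_means(n=200, replications=5000, seed=42):
    """Simulate sqrt(n) * (sample mean - 1) for Exponential(1) samples.

    Exponential(1) has mean 1 and standard deviation 1, so if the usual
    asymptotic result applies, the simulated values should look roughly N(0, 1).
    """
    rng = random.Random(seed)
    results = []
    for _ in range(replications):
        sample = [rng.expovariate(1.0) for _ in range(n)]
        results.append((n ** 0.5) * (statistics.fmean(sample) - 1.0))
    return results

def empirical_quantile(values, q):
    """Return the empirical q-quantile (0 < q < 1) of the simulated values."""
    ordered = sorted(values)
    index = min(len(ordered) - 1, int(q * len(ordered)))
    return ordered[index]

draws = simulate_standardized_means()
print("mean  :", round(statistics.fmean(draws), 3))
print("stdev :", round(statistics.stdev(draws), 3))
print("2.5%  :", round(empirical_quantile(draws, 0.025), 3))
print("97.5% :", round(empirical_quantile(draws, 0.975), 3))
```

The same loop works unchanged for a statistic whose limiting distribution has no convenient closed form: only the line that computes the statistic needs to change.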

These techniques can be summarized in one page, and time permitting, I will write that page and call it "statistics cheat sheet for data scientists". Interestingly, of a typical 600-page textbook on statistics, about 20 pages are relevant to data science, and these 20 pages can be compressed into a quarter of a page. For instance, I believe that you can explain the concept of random variable and distribution (at least what you need to understand to practice data science) in about 4 lines, rather than 150 pages. The idea is to explain it in plain English with a few examples, defining a distribution as the expected form (based on a model) or the limit of a frequency distribution (histogram).
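To make the "limit of a frequency distribution" idea concrete, here is a small Python sketch. It assumes a toy example of my choosing (the sum of two fair dice) and compares the observed histogram with the model-based distribution as the number of rolls grows; the function name and roll counts are hypothetical.

```python
import random
from collections import Counter

def empirical_frequencies(num_rolls, seed=0):
    """Roll two fair dice num_rolls times; return the relative frequency of each sum."""
    rng = random.Random(seed)
    counts = Counter(rng.randint(1, 6) + rng.randint(1, 6) for _ in range(num_rolls))
    return {total: counts[total] / num_rolls for total in range(2, 13)}

# Model-based ("expected") distribution: number of ways to obtain each sum, out of 36.
theoretical = {total: (6 - abs(total - 7)) / 36 for total in range(2, 13)}

for num_rolls in (100, 10_000, 200_000):
    freq = empirical_frequencies(num_rolls)
    worst_gap = max(abs(freq[t] - theoretical[t]) for t in theoretical)
    print(f"{num_rolls:>7} rolls: largest gap to the model = {worst_gap:.4f}")
```

The random variable is the sum of the two dice, the histogram is what you observe, and the distribution is what the histogram approaches as the data grows.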

Funny fact: some of these classic stats textbooks still feature tables of statistical distributions in an appendix. Who still uses such tables for computations? Not a data scientist, for sure. Most programming languages offer libraries for these computations, and you can even code them yourself in a couple of lines. A book such as Numerical Recipes in C++ can prove useful, as it provides code for many statistical functions; see also our source code section on DSC, where I plan to add more modern implementations of statistical techniques, some even available as Excel formulas.
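As an illustration of the "couple of lines" claim, the following Python sketch computes the normal cumulative distribution function that textbook appendices tabulate. The hand-written `normal_cdf` helper is my own hypothetical name; `math.erf` and `statistics.NormalDist` are part of the Python standard library.

```python
import math
from statistics import NormalDist

def normal_cdf(x, mean=0.0, sd=1.0):
    """Cumulative distribution function of a normal variable, via the error function."""
    return 0.5 * (1.0 + math.erf((x - mean) / (sd * math.sqrt(2.0))))

print(round(normal_cdf(1.96), 4))             # hand-coded version
print(round(NormalDist().cdf(1.96), 4))       # standard-library version
print(round(NormalDist().inv_cdf(0.975), 2))  # quantile: the reverse table lookup
```

Both versions replace the z-table lookup; the last line goes in the other direction, from a probability to a critical value.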

Now don't get me wrong: there are still plenty of people doing naive Bayes, linear or logistic regression. These techniques work on many simple data sets, and because progress is slow, you'll get a job more easily if you know them than if you don't. But the future is in uniting these techniques under a single methodology that is simple, robust, easy to interpret, available as a black box to non-experts, and easy to automate. This project (I'm working on it, and some computer science people at Cambridge University also work on this) is sometimes referred to as the automated statistician.

But just to give an example, naive Bayes (old stats, still widely used unfortunately) is terrible at detecting spam and categorizing email because it wrongly assumes that rules are independent, while a modern version called hidden decision trees (new stats) has been very successful (combined with pattern recognition) at identifying massive botnets. Some modern techniques such as recommendation engines sometimes fail (they are unable to detect fake reviews) because they still rely on old, poor statistical techniques rather than modern data science. That said, the fix for this issue is to rework the business model rather than to improve data science algorithms.
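The independence problem can be shown with a few lines of arithmetic. The Python sketch below uses made-up probabilities (a hypothetical rule that fires on 80% of spam and 20% of legitimate email) and shows what happens when naive Bayes treats three perfectly correlated copies of that rule as independent evidence. It only illustrates the weakness; it is not an implementation of hidden decision trees.

```python
def naive_bayes_posterior(prior_spam, likelihoods_spam, likelihoods_ham):
    """Posterior P(spam | rules fired), multiplying per-rule likelihoods as if independent."""
    spam_score = prior_spam
    ham_score = 1.0 - prior_spam
    for p_spam, p_ham in zip(likelihoods_spam, likelihoods_ham):
        spam_score *= p_spam
        ham_score *= p_ham
    return spam_score / (spam_score + ham_score)

# Hypothetical numbers, chosen only for illustration.
PRIOR_SPAM = 0.5
P_RULE_GIVEN_SPAM = 0.8
P_RULE_GIVEN_HAM = 0.2

# One rule fired: this is the correct posterior for that single piece of evidence.
one_rule = naive_bayes_posterior(PRIOR_SPAM, [P_RULE_GIVEN_SPAM], [P_RULE_GIVEN_HAM])

# Three rules that always fire together carry no extra information,
# yet naive Bayes multiplies them in as if they were independent.
three_copies = naive_bayes_posterior(
    PRIOR_SPAM, [P_RULE_GIVEN_SPAM] * 3, [P_RULE_GIVEN_HAM] * 3
)

print(f"one rule               : {one_rule:.3f}")
print(f"same rule counted 3x   : {three_copies:.3f}")
```

The evidence is identical in both cases, but the second posterior is far more confident, which is the double counting that correlated rules cause.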

Finally, old statistics use a top-down approach, from model and theory to data, while new statistics or data science use a bottom-up approach, from data to model or algorithm.

Conclusions

Based on what many statisticians think statistical science is, and is not, I am tempted to say that modern data science barely uses statistical science. Instead, it mostly relies on statistical principles that are not considered statistical science by most people who call themselves statisticians, because of their rigid perception of what statistics is, and their inability to adapt to change.

On the contrary, for non-statisticians (computer scientists, engineers and so on), it is clear that data science has a strong statistical component. In my heart, I also believe that new statistics is a core component of data science. Yet when talking to hiring managers, I tell them that statistics is another animal, because in their mind, statistics is old statistics. And old statistics is barely used anymore in modern data science. Likewise, when talking to statisticians, I tell them that data science is not statistics, so as not to upset them or waste my time in fruitless argument.

I think we should stop bashing statisticians as if we were somehow superior to them. We should bear in mind that the term "data scientist" was coined only in the last 2 or 3 years, and some may look at it as a fad.

I guess that the scientists at CERN who are doing data analytics don't view themselves as data scientists, even though they pretty much breathe data analytics from sunrise till sunset. They still use some of the old statistics listed above. We can't dismiss them as non-data-scientists (even though they hardly want to call themselves data scientists). One of my data science team members is a physicist who did her internship at CERN last year (2013). She came with no knowledge of machine learning, but she is now pretty well versed in it because she absorbs new material very fast.

Old statistics will still be around for a long time and, based on what I see in the literature, it keeps evolving into new variants whose performance improves on that of the old ones.

@Richard, thanks for clarifying the relevance of your post. If you reread mine, you'll see that I said that process mining "falls under the umbrella of data science." That means I agree that it's data science; I don't know why you think I was stating the opposite. The point behind my post was simply that process models cannot reliably be inferred from data alone. Judea Pearl ran into the same problem when trying to generalize Bayesian networks to causal nets. He realized he'd have to introduce a "data generating model" (i.e., theory) to make up for the underfitting. So researchers using directed acyclic graphs for causal modeling sketch out the model and then run data through it to see if the model is right. This is what process modelers do with conformance checking.

As for old versus new statistics, Granville lists Markov chains, Markov processes, and simulation as OLD statistics. I'm not aware of any techniques in process mining that use the methods of new statistics listed above.

Peter, your last reply illustrates well how the two schools (new versus old statistics) are different. You wrote "finding the best combination of predictors is generally a NP problem". Data science does not care about the best combination, but about a few that are good enough and easy to interpret. And you can easily guess how much more predictive power you will gain by running your algorithms for billions of years, after having obtained something decent in a few hours or minutes: usually, very little.
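A rough sketch of the "good enough" argument, in Python: the snippet compares an exhaustive search over all predictor subsets with a greedy forward search. The scoring function, the predictor names and all the numbers are hypothetical stand-ins of my own (they are not the predictive power criterion discussed in this thread); in practice `score` would be something like a cross-validated model fit.

```python
from itertools import combinations

# Hypothetical stand-alone value of each predictor, plus penalties for redundant pairs.
GAINS = {"f1": 5.0, "f2": 4.5, "f3": 3.0, "f4": 2.0,
         "f5": 1.0, "f6": 0.5, "f7": 0.2, "f8": 0.1}
REDUNDANT_PAIRS = {frozenset(("f1", "f2")): 4.0, frozenset(("f3", "f4")): 1.5}
COST_PER_PREDICTOR = 0.3  # small penalty that favors simpler, easier-to-read models

def score(subset):
    """Toy score: summed gains minus redundancy and complexity penalties."""
    chosen = set(subset)
    total = sum(GAINS[name] for name in chosen)
    total -= sum(penalty for pair, penalty in REDUNDANT_PAIRS.items() if pair <= chosen)
    return total - COST_PER_PREDICTOR * len(chosen)

def exhaustive_search(features):
    """Score every non-empty subset: 2**n - 1 evaluations."""
    best_subset, best_score, evaluations = (), float("-inf"), 0
    for size in range(1, len(features) + 1):
        for subset in combinations(features, size):
            evaluations += 1
            current = score(subset)
            if current > best_score:
                best_subset, best_score = subset, current
    return best_subset, best_score, evaluations

def greedy_forward_search(features):
    """Add the single most helpful predictor at each step; stop when nothing helps."""
    selected, current_score, evaluations = [], 0.0, 0
    remaining = list(features)
    while remaining:
        scored = [(score(selected + [name]), name) for name in remaining]
        evaluations += len(scored)
        candidate_score, candidate = max(scored)
        if candidate_score <= current_score:
            break
        selected.append(candidate)
        remaining.remove(candidate)
        current_score = candidate_score
    return tuple(selected), current_score, evaluations

features = list(GAINS)
best_subset, best_score, exhaustive_evals = exhaustive_search(features)
greedy_subset, greedy_score, greedy_evals = greedy_forward_search(features)
print(f"exhaustive: score {best_score:.2f} after {exhaustive_evals} evaluations")
print(f"greedy    : score {greedy_score:.2f} after {greedy_evals} evaluations")
```

With 8 predictors the exhaustive search is still cheap, but its cost doubles with every added predictor, while the greedy search grows roughly quadratically. Greedy search carries no optimality guarantee in general, which is exactly the trade-off being argued here.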

And regarding a proven track record for my techniques, I won't publish in scientific journals or talk at conferences. I have used my methods with success, and it's up to you to decide whether they are worth your time or not. This is part of a new school of doing scientific research: self-funded (thanks to the success of my methods!) and posting on highly visited blogs like this one (150,000 members) or DSC. The work gets published much faster, reaches far more professionals, and I receive unbiased feedback from other members. In short, peer review in real time, and in public. And all the time that I save by not publishing in traditional outlets or attending conferences, I use to further grow the data science community and disseminate my intellectual property. I do spend time reading what other authors publish.

My next book (data science 2.0) will be self-published (so I retain copyrights and other benefits) and most likely priced very competitively (5 times cheaper than the competition), and the digital version will be free. I will market it myself via my different communities, reaching far more people, at no cost, than any traditional author or publisher can.

By the way, that's data science that works, because growing these communities involves serious data science. AMSTAT (or anyone else for that matter), with its bunch of statisticians, can't compete with me (a one-man organization) in terms of growth, and they must be paying a lot of money for marketing, despite the fact that I publish many of my growth strategies. Indeed, they even hired a PR firm. Now we are talking about data science that works well enough to beat competitors and generate good revenue with high margins, involving computational marketing and various campaign optimization techniques.

Regarding cross-validation, you should also make sure that the test cases have buckets of data not found in the control cases. For instance, data from a different type of clients or different industry. That's the only way you can check how your model will perform when faced with new, unusual data. Proper data bucketization is thus an important step prior to performing cross-validation. You also need to use a feature selection algorithm that is stable in the way it detects the best predictors. Finally, data science is good at fast development and fast execution of stable algorithms.
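One way to read the bucketization advice above is as leave-one-bucket-out cross-validation: every fold holds out an entire bucket, so the model is always tested on a type of data it never saw. The Python sketch below is a minimal version under that assumption; the function name, the `bucket_of` parameter and the sample records are hypothetical.

```python
from collections import defaultdict

def leave_one_bucket_out(records, bucket_of):
    """Yield (held_out_bucket, train, test) so the test bucket never appears in train."""
    by_bucket = defaultdict(list)
    for record in records:
        by_bucket[bucket_of(record)].append(record)
    for held_out in sorted(by_bucket):
        test = by_bucket[held_out]
        train = [row for bucket, rows in by_bucket.items()
                 if bucket != held_out for row in rows]
        yield held_out, train, test

# Hypothetical records: (industry, feature value, outcome).
records = [
    ("retail", 1.0, 0), ("retail", 1.4, 1),
    ("banking", 2.2, 1), ("banking", 2.0, 0),
    ("telecom", 3.1, 1), ("telecom", 2.9, 1),
]

for industry, train, test in leave_one_bucket_out(records, bucket_of=lambda r: r[0]):
    train_industries = sorted({r[0] for r in train})
    print(f"test on {industry!r} ({len(test)} rows), train on {train_industries}")
```

Here the bucket is the industry, but `bucket_of` can return any key, such as a client type or a combination of fields. A large gap between the scores on these folds and the scores from ordinary random folds suggests the model will struggle with new, unusual data.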

My hidden decision trees and Jackknife regression contain stuff similar to constrained regression (Lasso or ridge regression), boosted trees and random forests blended together. But they are not underpinned by any statistical model, nor solved with complex mathematical optimization. Instead, they are data-driven, with the solution arising easily and naturally from a simple, fast algorithm. The data reduction step uses a criterion called predictive power to find the best combinations of predictors. Data bucketization and bucket sizing are an important part of the process, as are extrapolations at the bucket level. All this stuff is available for free to everyone. It's open patent or open intellectual property.

@Mark Dr. Granville has consistently challenged us to define the boundaries of Data Science, and my comment about Process Mining was in this spirit. I disagree that Process Mining is not Data Science. It appears to be a specialization of event sequence analysis, a discipline initiated by Carl Petri in 1939. Petri Nets have unique formalisms of liveness and the like, and they have evolved into elaborate business process modeling techniques. The Process Mining paradigm of play-in (event sequence to Petri-like model) or play-out (Petri-like model to event sequence) seems analogous to Markov Chains and Markov Processes. Is Process Mining another branch of new statistics, as defined above? I am certainly seeing practical applications in my work, as cited previously.

@Richard, Process Mining falls under the umbrella of data science, but not data mining. In fact, the course you link to explicitly calls process mining "data science in action" and then explains that it bridges the gap between traditional process modeling and data mining. I don't understand why you doubt his assertion. And I also don't understand what your question has to do with Dr. Granville's article. Can you explain?

Ralph, new statistics include a modern version of logistic regression, highly constrained, with very few parameters, making it easy to interpret, stable, and solved without using maximum likelihood estimation, indeed without a statistical model. And it's blended with other classifiers. I also take results from scientific studies and research with a grain of salt. Is it replicable? Does it work consistently on my own data? Is there a more robust methodology, 10 times easier to implement, that yields the same results? These are the questions I ask myself before implementing a technique.

Apparently, IBM trusts 'old' logistic regression enough to incorporate it as the basis of Watson for ranking Jeopardy answers. Food for thought when considering the stability of a 'new' vs. 'old' technique.

"Over the course of the project, we have experimented with logistic regression, support vector machines with linear and nonlinear kernels, boosting, single and multilayer neural nets, decision trees, and locally weighted learning; however, we have consistently found better performance using regularized logistic regression, which is the technique used in the final Watson system."

One of the things I'd like to accomplish is to democratize science. Traditional stats can be arcane to the non-expert, and there's a bit of secrecy or mystery around the recipes used by statisticians. Old statistics seems to be for a small elite of people initiated into the secrets. Outsiders are not welcome (and I made myself an outsider though I was once an insider). Hopefully my new statistics (whatever you call it) is not mysterious, but easy to understand and safe to use by professionals such as engineers, programmers, decision makers, business analysts, geographers, economists and many more.