What Is a Data Scientist (and What Isn’t)?

The perception among organizations over the past five years is that more quantitative methods, with or without Big Data, are critical to success. The problem is that most commercial organizations have little to no depth in these disciplines. On the other hand, businesses whose primary revenue stream is data and data products have an abundance of talent in this area. Some, like Google or Amazon, employ hundreds of applied mathematicians and statisticians, in the same way that manufacturing companies employ mechanical or electrical engineers. Medical informatics, genomics, even intelligence and defense groups work on the bleeding edge of research into methods for classification, prediction and optimization. Because this work is unusual, involving massive data volumes and unruly data formats and sources that are beyond the typical enterprise data flows, coupled with a broader understanding of the business or organization, a name for these professionals emerged: “Data Scientist.”

But the term “Data Scientist” is an over-reaching title.

Let’s look at how this actually plays out. The work is clearly divided between true scientists, those who research and create algorithms and methods, publish papers and actively participate in their discipline’s communications, and those who understand and employ quantitative methods, design, test and deploy models but do not create new science. I refer to these two as Type I and Type II respectively (in a forthcoming research report from Constellation Research, I go into much more detail and describe Types III and IV also). The former are truly scientists; the latter are not, though this is the group typically referred to as data scientists. There will be very few “data scientists” in commercial organizations. Data scientists work in research, academia and organizations where the production of new methods and algorithms is the core of the enterprise. Google, Amazon, Wall Street, etc. – these are organizations whose scientists produce new methods in quantitative science and publish in peer-reviewed journals.

Although there is a prejudice for employing PhDs as Type IIs, it isn’t necessary.

Despite the unfortunate name given to this growing class of professionals (scientists they are not, in general), they do represent a new sort of role in organizations. Finding people to fill this role is difficult for all of the following reasons:

The varied types of data available and the resulting multitude of analyses that can be employed

A skill set that includes programming capability, quantitative methods, investigative and modeling orientation

The ability to understand what is meaningful and what is not

Sufficient domain knowledge, rather than being a quant-for-hire

The ability to communicate complex subjects to others who lack the background in the tools and methods employed

I mentioned engineers above. Engineers come to work with a solid grounding in the area of their choice, but no real practical experience, and typically no experience at all in the business of their employer. They learn as they go. In fact, there is even a professional designation for engineers that demonstrates they have the skill, training and practitioner’s experience to be a senior engineer – Professional Engineer (usually abbreviated as PE).

Another model for recruiting and nurturing professionals for this role, instead of competing for a small pool of PhDs who may be overqualified and unfulfilled with the work, is the way insurance companies grow their own actuaries (full disclosure, I have an actuarial background). There are two major actuarial organizations, the Society of Actuaries and the Casualty Actuarial Society. Both organizations administer comprehensive (actually, sort of grueling) certification programs that start with most of an undergraduate math degree and proceed to all aspects of probability, statistics and the insurance business itself. The series of exams can take 5-10 years to complete, and most insurance companies offer time off for study as well as on-the-job mentoring. Two things about this are key: first, a Fellow in either society demonstrates not only thorough grounding in quantitative methods, but also, and perhaps even more importantly, a true understanding of the workings of the enterprise as well as the entire industry.

There are tons of gimmicky professional “certifications,” but actuarial fellowship, Professional Engineer certification, even CPA, are all rigorous, practitioner-oriented programs. Analytics is looming in importance and is deserving of something similar.

Companies can’t expect universities to provide this kind of education. It’s obvious that skill with data and analytics is central to the success of most, if not all, organizations. It’s time to get serious about it. Call them data scientists if you will, but you have to participate in their learning. They don’t grow on trees.

Some sort of legitimate professional certification is needed. But until then, companies need to take grooming and nurturing these professionals seriously.