There is no question that the USA (in fact, most of the world) would be well-served with more quantitatively capable people to work in business and government. However, the current hysteria over the shortage of data scientists is overblown. To illustrate why, I am going to use an example from air travel.

On a recent trip from Santa Fe, NM to Phoenix, AZ, I tracked the various times:

| Leg | Duration (min) | Cumulative (min) |
| --- | --- | --- |
| Drive from Santa Fe to ABQ Airport | 65 | 65 |
| Park | 15 | 80 |
| Security | 25 | 105 |
| Wait to board | 20 | 125 |
| Boarding process | 30 | 155 |
| Taxiing | 15 | 170 |
| In flight | 60 | 230 |
| Taxiing | 12 | 242 |
| Deplane | 9 | 251 |
| Wait for valet bag | 7 | 258 |
| Travel to rental car | 21 | 279 |
| Arrive at destination in Tempe | 32 | 311 |

As you can see, the actual flying time of 60 minutes represents only 19% of the total travel time (60/311). Because everything but the flight itself is more or less constant for any domestic trip (disregarding the common delays, connections and cancellations that would skew this analysis even further), this low percentage of time in the air is the norm. For example, even if the flight took two hours and fifteen minutes, it would still work out to only 135/386 = 35%. The most recent data I have, from 2005, shows the average nonstop distance flown per departure was 607 miles, so we can add about 25 minutes to the first calculation and arrive at 85/336 = 25%.
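Just to check the arithmetic, here is a quick sketch using the leg durations from the table above; the helper function holds the ground overhead fixed and varies only the flight time:

```python
# Door-to-door travel arithmetic; all values in minutes, taken from the table.
legs = {
    "drive to ABQ": 65, "park": 15, "security": 25, "wait to board": 20,
    "boarding": 30, "taxi out": 15, "in flight": 60, "taxi in": 12,
    "deplane": 9, "valet bag": 7, "to rental car": 21, "to Tempe": 32,
}
total = sum(legs.values())            # 311 minutes door to door
share = legs["in flight"] / total     # share of time actually in the air

def flight_share(flight_min, overhead=total - legs["in flight"]):
    """Share of total travel time spent flying, holding ground overhead fixed."""
    return flight_min / (flight_min + overhead)

print(round(share * 100))               # 19
print(round(flight_share(135) * 100))   # 35  (a 2h15m flight)
print(round(flight_share(85) * 100))    # 25  (the 607-mile average leg)
```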

Keep in mind, again, these calculations do not account for late departures/arrivals, cancelled and re-booked flights, connections, flight attendants and pilots having nervous breakdowns, etc. It’s safe to say that at most 25% of your travel time is spent in the air. Just for fun, let’s see how this would work out if we could take the (unfortunately retired) Concorde. Flying at Mach 2 would cut about 40 minutes from the flight, trimming our journey from five hours and eleven minutes to four hours and 31 minutes, only about a 13% improvement.
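The Concorde thought experiment works out the same way; a minimal check, using the 311-minute total from the table:

```python
# Supersonic flight trims ~40 minutes from the 60-minute flight leg,
# but everything on the ground stays the same.
total_min = 311          # door-to-door total from the table
saved = 40               # minutes saved by the faster flight
new_total = total_min - saved

print(divmod(new_total, 60))            # (4, 31) -> 4 h 31 min
print(round(saved / total_min * 100))   # 13 (% improvement)
```

The point survives the exercise: speeding up the one leg you cannot already control barely moves the door-to-door number.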

What’s the point of all of this and what does it have to do with the so-called data scientist shortage?

Based on our research at Constellation Research, we find that analysts who work with Hadoop or other big data technologies spend a significant amount of their time on work that requires no knowledge of advanced quantitative methods: configuring and maintaining clusters; writing programs to gather, move, cleanse and otherwise organize data for analysis; and many other common tasks in data analysis. In fact, even those who employ advanced quantitative techniques spend 50-80% of their time gathering, cleansing and preparing data. This percentage has not budged in decades. Keep in mind that advanced analytics is not a new phenomenon; what is new is the volume (to some extent) and variety of the source data, along with new techniques to deal with it, especially, but not limited to, Hadoop.

The interest in analytics has risen dramatically in the past two or three years; that is not in dispute. But enterprise-scale adoption of analytics with big data is far from guaranteed in most organizations beyond some isolated pockets of expertise. Most of the activity is in predictable industries (net-based businesses, financial services and telecommunications, for example), but these businesses have employed very large-scale analytics, at the bleeding edge of technology, for decades. For most other organizations, analytics will be provided by algorithms embedded in applications developed elsewhere, and by third-party vendors of tools, services and consulting.

The good news is that 80% of the expertise you need for big data is readily available. The balance can be sourced and developed. The crème de la crème of data scientists will fill roles in academia, technology vendors, Wall Street, research and government.

There are related and unrelated disciplines that are all combined under the term analytics: advanced analytics, descriptive analytics, predictive analytics and business analytics, all defined in a pretty murky way. The field cries out for some precision. Here is how I characterize the many types of analytics, by the quantitative techniques used and the level of skill of the practitioners who use them.

| | Descriptive Title | Quantitative Sophistication/Numeracy | Sample Roles |
| --- | --- | --- | --- |
| Type I | Quantitative Research (True Data Scientist) | PhD or equivalent | Creation of theory, development of algorithms. Academic/research. Often employed in business or government for very specialized roles |
| Type II | (Current definition of) Data Scientist or Quantitative Analyst | Advanced math/stat, not necessarily PhD | Internal expert in statistical and mathematical modeling and development, with solid business domain knowledge |

“Data Scientist” is a relatively new title for quantitatively adept people with accompanying business skills. The ability to formulate and apply tools to classification, prediction and even optimization, coupled with a fairly deep understanding of the business itself, is clearly in the realm of Type II efforts. However, it seems pretty likely that most so-called data scientists will lean more towards the quantitative and data-oriented subjects than business planning and strategy. The reason is that the term data scientist emerged from businesses like Google or Facebook, where the data is the business; there, understanding the data is equivalent to understanding the business. This is clearly not the case for most organizations. We see very few Type II data scientists with the kind of in-depth knowledge of the whole business that, say, actuaries have in the insurance industry; their extensive training should be a model for the newly designated data scientists (see my blogs “Who Needs Analytics PhD’s? Grow Your Own” and “What Is a Data Scientist and What Isn’t”).

Neil Raden is CEO and Principal Analyst at Hired Brains Research, covering Analytics, Big Data and Decision Management. He welcomes your comments and can be reached at nraden@hiredbrains.com. Hired Brains provides research, analysis, advisory and consulting services in North America, Europe and Asia.

It's reminiscent of marginal costs. Perhaps the hope is that since there is a notion that the infrastructure and data acquisition and preparation are essentially "fixed" costs, there's an assumption that the marginal cost of the analysis investment will deliver a good return.

The problem is that I don't believe those infrastructure costs are really fixed, especially not when the scope and volume of data for analysis are constantly changing. The situation could likely be alleviated by concentrating hard on the division of labor in the process, so that those charged with analysis actually do analysis, not the clean-up, rework and other grunt work required because the raw material is not ready to use.

I like the scale - I've used this type of categorization internally to try and explain user roles/profiles/callitwhatyouwill within organisations, usually in a pyramid form, since it also gives the idea that the number of such roles (and licenses, of course) increases the further down the scale you go.

Perhaps you could extend the scale downwards a little into the more downstream data meddlers, viewers & tweakers, skimreaders and ignorers. Add to that a snappy name and the keynotes will come flooding in ;-)

Great post ... loved the data points on your travel times. And agree with your conclusions ... we are going backwards, at least in terms of data preparation and quality. Hadoop is a bit like spreadsheets on steroids, with programmers as users. YIKES!

People still seem to be fixated on HOW to build a new infrastructure to process the increased amount of data (and spend 50-80% of their time cleansing and preparing this data). The real challenge is understanding WHY that data is important for your business in the first place and how you can leverage it to create change. That's where the Type II and Type III analysts come in, to turn data into information.