Friday, 5 September 2014

Data science at scale - calling out the "Big Data Scientist"

“Data science” is a popular term, one in the ascendancy in Gartner’s Hype Cycle for Emerging Technologies 2014. It has multiple meanings depending on whom you ask. One way to deal with subjective interpretations is to crowdsource the answer and pick the popular interpretations, provided there is enough data. Recently, a data scientist (who else?) at LinkedIn attempted to define the term “data scientist” using data from profiles across its network that contain the phrase “data scientist”. His results are available in a short post over at LinkedIn's page.
Unusually for a data scientist, the author doesn’t provide any quantified data at all; I would have expected to see at least the number of profiles analyzed, the popularity scores for skills, and the strength of the relationships between terms. Without numbers, there isn’t a whole lot of interpretation that outsiders like me can do. Looking at the information qualitatively, the set of data scientists in the LinkedIn network seems distinctly tilted towards “small data” analysis as opposed to “large data” analysis. I gauge this from two indicators: (a) the absence from the “Most popular skills” table of those skills typically associated exclusively with large data analysis; (b) the small sizes of the bubbles for these large data-focused skills and the lack of any strong connections (look at the higher-resolution image in that post) from any of these to the popular “small data” skill bubbles.

Does this mean that the majority of data
practitioners are “small data” scientists? Where are the “Big Data Scientists”
(a portmanteau of “big data” and “data scientist”) and what sets them apart?

As that post and many others delineate, a good data scientist has mastery over a breadth of techniques, the tools that encode these techniques, and the domain knowledge that gives the results their extra oomph. As aids, the tools – be they statistical or visualization in nature – provide algorithms and implementations of techniques out-of-the-box that are then used as deemed fit for the data problem at hand. The tools themselves do not provide readymade solutions to the problem; it is the data scientist who knows which tools and techniques to use given the nature of the data, the type of problem being addressed, and the targets to be achieved, if any. It is no wonder, then, that data science is sometimes referred to as an “art”, with its practitioners commanding a premium.

Data science at scale is a completely different beast from data science on a single machine. Data analysis on a single machine is itself hard, but data analysis at scale typically challenges fundamentals that are often taken for granted. Take the problem of sorting. It is one of the first problems introduced in an algorithms course in a computer science curriculum, and how to sort data is well understood. However, when the data being sorted is larger than the memory available on a machine, a different algorithm is required. Let’s call this the single-machine (external) sorting algorithm, while the textbook algorithms could be classified as main-memory algorithms. When the data becomes larger still and no longer fits within a single machine, the previous algorithms do not suffice and yet another algorithm design is required. These could be called distributed sorting algorithms. Sorting at massive scale is a problem class in itself and even has a dedicated big data benchmark (look for "Gray").
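To make the single-machine case concrete, here is a minimal sketch of an external merge sort in Python: data that doesn’t fit in memory is sorted one chunk at a time, each sorted chunk is spilled to disk as a “run”, and the runs are then k-way merged. The function names and chunk-based structure are my own illustration, not taken from any particular library.

```python
import heapq
import os
import tempfile

def external_sort(values, chunk_size):
    """Sort an iterable that may not fit in memory by sorting
    fixed-size chunks, spilling each to disk as a sorted run,
    then k-way merging the runs with a heap."""
    run_files = []
    chunk = []
    for v in values:
        chunk.append(v)
        if len(chunk) == chunk_size:
            run_files.append(_spill(sorted(chunk)))
            chunk = []
    if chunk:
        run_files.append(_spill(sorted(chunk)))

    # k-way merge: heapq.merge keeps only one element per run in memory.
    merged = list(heapq.merge(*[_read_run(f) for f in run_files]))
    for f in run_files:
        os.unlink(f)
    return merged

def _spill(sorted_chunk):
    """Write one sorted run to a temporary file, one value per line."""
    fd, path = tempfile.mkstemp()
    with os.fdopen(fd, "w") as f:
        for v in sorted_chunk:
            f.write(f"{v}\n")
    return path

def _read_run(path):
    """Stream a sorted run back from disk."""
    with open(path) as f:
        for line in f:
            yield int(line)
```

The main-memory algorithm lives on inside each chunk; the new work is entirely in the spill-and-merge machinery around it, which is exactly the shift in thinking the paragraph above describes.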

Sorting is a very simple problem in the world of big data. There are complicated ones as well, like machine learning at scale. In all, I would argue that the most challenging aspect of being a “Big Data Scientist” is knowing when to use particular data analysis approaches (e.g. clustering versus classification), the techniques for each (e.g. k-means for clustering), and also the design of the algorithms themselves. Knowledge of the internals of an algorithm comes in handy when designing a distributed version of that algorithm that performs well on massive data. This crucial task of having to not only know “data science” but also be able to design and implement the algorithms to run on massive data is what really sets a “Big Data Scientist” apart from a “small data" scientist.
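As an illustration of what knowing the internals buys you: a single k-means iteration decomposes naturally into per-partition partial sums (a map step) and a combine of those sums into new centroids (a reduce step). Below is a minimal Python sketch of that decomposition, with data partitions simulated as plain lists; the function names and structure are my own illustration, not from any particular framework.

```python
def nearest(point, centroids):
    """Index of the closest centroid by squared Euclidean distance."""
    return min(range(len(centroids)),
               key=lambda i: sum((p - c) ** 2
                                 for p, c in zip(point, centroids[i])))

def map_partition(partition, centroids):
    """Map step: per-cluster coordinate sums and counts for one shard."""
    k, d = len(centroids), len(centroids[0])
    sums = [[0.0] * d for _ in range(k)]
    counts = [0] * k
    for point in partition:
        i = nearest(point, centroids)
        counts[i] += 1
        for j in range(d):
            sums[i][j] += point[j]
    return sums, counts

def reduce_step(partials, centroids):
    """Reduce step: merge the partial sums into updated centroids."""
    k, d = len(centroids), len(centroids[0])
    tot = [[0.0] * d for _ in range(k)]
    cnt = [0] * k
    for sums, counts in partials:
        for i in range(k):
            cnt[i] += counts[i]
            for j in range(d):
                tot[i][j] += sums[i][j]
    # An empty cluster keeps its old centroid.
    return [[s / cnt[i] for s in tot[i]] if cnt[i] else list(centroids[i])
            for i in range(k)]

def kmeans_iteration(partitions, centroids):
    """One distributed k-means round: map each shard, then reduce."""
    partials = [map_partition(p, centroids) for p in partitions]
    return reduce_step(partials, centroids)
```

The key observation, invisible if you treat k-means as a black box, is that only the small (sums, counts) summaries cross the network between steps, never the raw points; that is what makes the algorithm scale.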

In the last couple of years, there has been a steady stream of software packages offering big data-enabled algorithms out-of-the-box. Open-source packages in the Hadoop stack include the popular Mahout and the newer Spark MLlib, to name a couple. If you do not subscribe to the Hadoop architecture, GraphLab, built on MPI, can be run standalone. Among the few proprietary packages offering massive-scale algorithms out-of-the-box, Teradata Aster is a great example, and I cut my teeth in big data by contributing massive-scale algorithms to its analytic foundation library. These software packages make the transition from a “small data” scientist to a “Big Data Scientist” easier, but talk to any expert statistician and you’ll learn that the coverage of the required breadth of algorithms and techniques is still poor.

O’Reilly Media conducted a survey of data scientist salaries across Strata editions in 2012 and 2013. That report is a better example of presenting data about data scientists (the report calls them data professionals, since not all individuals wear the “data scientist” tag). Parts of the survey, especially those about proprietary tool usage, are not that useful, since the majority of the Strata audience tend to be the open-source-kool-aid-consuming types, and the survey sample is therefore biased. The size and geographic spread of the Strata audiences are also necessarily smaller than what LinkedIn could potentially see in its data of the world. Nevertheless, the O’Reilly survey reinforces this post’s point that the portmanteau role of “Big Data Scientist” is a rare combination and commands a premium over even the “small data" scientist.