Rise of the Data Scientist

As we’ve all read by now, Google’s chief economist Hal Varian commented in January that the sexy job in the next 10 years would be statistician. Obviously, I wholeheartedly agree. Heck, I’d go a step further and say they’re sexy now – mentally and physically.

However, if you went on to read the rest of Varian’s interview, you’d know that he actually meant statistician as a general title for someone who is able to extract information from large datasets and then present something of use to non-data experts.

Sexy Skills of Data Geeks

As a follow-up to Varian’s now-popular quote among data fans, Michael Driscoll of Dataspora discusses the three sexy skills of data geeks. I won’t rehash the post, but here are the three skills that Michael highlights:

Statistics – traditional analysis you’re used to thinking about

Data Munging – parsing, scraping, and formatting data

Visualization – graphs, tools, etc.

Oh, but there’s more…

These skills actually fit tightly with Ben Fry’s dissertation on Computational Information Design (2004). However, Fry takes it a step further and argues for an entirely new field that combines the skills and talents from often disjoint areas of expertise:

Computer Science – acquire and parse data

Mathematics, Statistics, & Data Mining – filter and mine

Graphic Design – represent and refine

Infovis and Human-Computer Interaction (HCI) – interaction

And after two years of highlighting visualization on FlowingData, it seems collaborations between the fields are growing more common, but more importantly, computational information design edges closer to reality. We’re seeing data scientists – people who can do it all – emerge from the rest of the pack.

Advantages of the Data Scientist

Think about all the visualization stuff you’ve been most impressed with or the groups that always seem to put out the best work. Martin Wattenberg. Stamen Design. Jonathan Harris. Golan Levin. Sep Kamvar. Why is their work always of such high quality? Because they’re not just students of computer science, math, statistics, or graphic design.

They have a combination of skills that not only makes independent work easier and quicker, but also makes collaboration more exciting and opens up possibilities in what can be done. Oftentimes, visualization projects are disjoint processes and involve a lot of waiting. Maybe a statistician is waiting for data from a computer scientist; or a graphic designer is waiting for results from an analyst; or an HCI specialist is waiting for layouts from a graphic designer.

Let’s say you have several data scientists working together though. There’s going to be less waiting, and the communication gaps between the fields get narrower.

How often have we seen a visualization tool that held an excellent concept and looked great on paper but lacked the touch of HCI, which made it hard to use and in turn no one gave it a chance? How many important (and interesting) analyses have we missed because certain ideas could not be communicated clearly? The data scientist can solve your troubles.

An Application

This need for data scientists is quite evident in business applications where educated decisions need to be made swiftly. A delayed decision could mean lost opportunity and profit. Terabytes of data are coming in whether it be from websites or from sales across the country, but in an area where Excel is the tool of choice (or force), there are limitations, hence all the tools, applications, and consultancies to help out. This of course applies to areas outside of business as well.

Learn and Prosper

Even if you’re not into visualization, you’re going to need at least a subset of the skills that Fry highlights if you want to seriously mess with data. Statisticians should know APIs, databases, and how to scrape data; designers should learn to do things programmatically; and computer scientists should know how to analyze and find meaning in data.

Basically, the more you learn, the more you can do, and the more in demand you will be as the amount of data grows and more people want to make use of it.

47 Comments

i’m curious about how you chose the term “data scientist” to describe this role. that’s precisely the title we used for folks on the data team at facebook, chosen somewhat arbitrarily as a contraction of “data analyst” and “research scientist”, with the same skills in mind as you mention above. i also titled my chapter for “beautiful data” “information platforms and the rise of the data scientist”. quite amazing if you formulated the phrase separately! something in the air…

Nathan – Nice synthesis and thanks for the shout-out. Ben Fry’s model captures the various fields that comprise this interdisciplinary ‘data science’ quite elegantly. I’d even venture to add some bidirectional arrows. Between the four core activities — Munging, Modeling, Visualizing, and Interacting — there’s a lot of feedback.

And as far as sexiness goes, I’m still holding my breath for People magazine to release its Sexiest Data Scientist Alive issue. It still may be a decade or more away.

@Michael – re:feedback definitely. fry gets into this as it’s one of his arguments for an interdisciplinary field – whereas in a collaboration, a person would have to explain to another what he wanted, have some misunderstandings along the way, and then wait.

I have an open question. My graduate training has a foundation in statistics building to psychometrics. I’ve spent some time in human factors and usability, and I was even an art major for a while. This has all given me a solid base to understand the story data is telling as well as put it into visualizations that facilitate understanding among non-specialists. However, while I appreciate the opportunity to have a sexy job, I am currently on track to senior management in a major corporation. So here’s the question: how do you balance becoming a specialist and developing the breadth necessary to perform as a business manager?

What about “storyteller?” Not trying to trivialize the issue at all, but the ability to effectively communicate the relevance and import of the findings would seem to be the skill that ties it all together. Completing the analysis isn’t the end of the project, getting the HiPPO’s sign-off is. If all the effort doesn’t go toward meeting an organizational goal, it’s wasted.

@Craig: The best answer is to look at senior managers whose work you admire, and see how they did it. For example, often specialized knowledge is replaced by the ability to notice, nurture, and exploit technical talent.

And note well: if you can’t find a senior manager you’d like to imitate, that’s a sign that being an executive will make you unhappy.

I appreciate Dr. Fry’s explicit recognition and acknowledgement that “cartographers have mastered the ability to successfully organize geographic data in a manner that communicates effectively.” Fry goes on to suggest cartography could serve as a “useful model” and be extended in the “direction of Computational Information Design.” My only comment to Fry (and to Nathan’s post) is that it already has – a field called ‘Geographic Information Science and Technology’ (GIS&T). The GIS&T body of knowledge was outlined by Mike Goodchild and others in the mid 1990s and fully scoped in 2006 by the UCGIS. Perhaps new ‘data scientists’ will consider the groundwork already laid by so many geographers and cartographers as they expand into new territories and exploit emerging technologies.

I am not sure what you are getting at, but these are quite difficult questions.

1) Can we predict a theory [like special theory of relativity] using data?

Understanding you to not be asking the trivial question of whether or not data and analysis methods are used to make discoveries, but rather asking about the algorithmic discovery of new laws, or regularities– yes, but we are not very good at unaided computer discovery yet. Since the inception of AI, computer scientists, philosophers, mathematicians, and other researchers have endeavored to find methods for automated discovery of regularities from empirical data. Herbert Simon (BACON), and Paul Thagard (PI) are two historical examples of automated discovery. Today, much is being done in the area of machine learning and discovery, but that is a book (or ten) by itself.

2) Can data help us to prove a theorem like Fermat’s Last Theorem?

While such theorems can be assisted with computers (see the four color theorem and Coq proof assistant), theorems are derived through mathematical induction, construction, negation, etc., not through empirical data. Quasi-empirical data, however, is used, meaning the results of enumeration (such as the ever-growing list of prime numbers) or random evaluation of a complex mathematical object (such as Monte Carlo methods on probability density functions) give us approximations or enumerations for further mathematical consideration.

It’s interesting that the graphic you represented in this post has similarity to the job of librarians…
We acquire books (or information); filter these books, or the information, to the right user or client; or mine the stacks or systems in pursuit of this information; the files are represented via catalog (cards or systems); and the reference service refines and interacts with our users.
Do you think that to be a Data Scientist is to have a librarian’s expertise too? And vice-versa?

Great summary. The data-munging/scraping part was something I particularly identified with. Working in the financial services industry (ahead of others when it comes to using data to drive strategy), I am still surprised by the amount of data that just goes untapped and unused. The systems (and the vision) required to capture this data and put it in a form which is ‘analyzable’ just do not exist.

This post reminds me of a post you made last year, about why Data Visualization isn’t popular. One point you made was that people know, but don’t know what it’s called (so they know.. but they just don’t know they know). But my question is where does ‘data visualization’ stop?

Or, is this blog (that one of your comment-ers suggested on your 37 visualizations post) also visualization? http://thisisindexed.com/

I’m into visualization because reading lots of little words hurts my eyes, because I believe there’s a more efficient way to convey information, and it really is awesome looking sometimes. But, does everything count?

To summarize… he gave a lot of very precise definitions and requirements for ‘information that results in a picture’ to be called visualization.

The definitions (to me at least… casual graph-onlooker) seem pretty intense. And going back to your post about why data visualization wasn’t popular… it might be because data visualization is viewed as a science.. very precise/definite/intense that people are afraid to enjoy/appreciate it at all.

This is my favorite unusually good data visualization site: http://www.babynamewizard.com/voyager
Somebody showed me this one in grad school, and I immediately proceeded to spend several hours riveted to my screen, looking up the popularity and geographic trends associated with the names of everyone I knew.