Evaluating the Data Scientist Job Description

As demand for data scientists continues to increase, we’re seeing significant variation in the ads appearing on places like LinkedIn. Given the inconsistency I’ve noticed in these ads, it is reasonable to question whether internal and external recruiters and HR departments actually know what a data scientist does. In some cases, I’ve seen two or three positions described in a single data scientist ad. I wonder if the successful candidate will command a salary equal to that of two or three people? It is as if the HR manager just did a Google search on “data scientist” and copied and pasted every keyword they could find.

So I thought it would be interesting to evaluate a typical job ad that I found floating around on LinkedIn. This particular ad was posted by Facebook. Let’s see how well it does (my commentary appears in square brackets):

Job description

Facebook is seeking a Data Scientist to join our Data Science team. Individuals in this role are expected to be comfortable working as a software engineer and a quantitative researcher [I don’t agree with this trend, a data scientist is NOT necessarily a coder which I believe is a waste of talent, and a coder is most definitely NOT a data scientist who should have a significant theoretical foundation in mathematical statistics]. The ideal candidate will have a keen interest in the study of an online social network, and a passion for identifying and answering questions that help us build the best products. [“answering questions” alludes to the storytelling ability that all data scientists must have in order to convey in lay terms what the data are saying]

Responsibilities

Work closely with a product engineering team to identify and answer important product questions

Answer product questions by using appropriate statistical techniques on available data [again alluding to the important storytelling ability data scientists should possess; more than just numbers]

Drive the collection of new data and the refinement of existing data sources [data scientists should have deep data munging ability to prepare data for machine learning algorithms]

Analyze and interpret the results of product experiments [the “scientific method” should be a prime ingredient of a data scientist’s tool set]

Develop best practices for instrumentation and experimentation and communicate those to product engineering teams [ditto storytelling above]

Requirements

M.S. or Ph.D. in a relevant technical field, or 4+ years experience in a relevant role [it is good to see that traditional academic background “in a relevant technical field” is a top requirement although the education requirement may have to evolve a bit to include prospects trained in the new MOOC ecosystem]

Extensive experience solving analytical problems using quantitative approaches [this is where a solid background in mathematics and statistics is valuable]

Fluency with at least one scripting language such as Python or PHP [sorry, Coder <> Data Scientist]

Familiarity with relational databases and SQL [Absolutely agree]

Expert knowledge of an analysis tool such as R, Matlab, or SAS [a true data scientist must be expert level with one of these for machine learning modeling]

Experience working with large data sets, experience working with distributed computing tools a plus (Map/Reduce, Hadoop, Hive, etc.) [many job ads now include this requirement and I agree with it, but not to the level of requiring the data scientist to be responsible for production architecture, deployment as well as maintenance & support for a Hadoop system for example, just not realistic]

I think the above Facebook job ad is better than many I’ve seen recently, and I’m sure the person the company ultimately hires will be a good fit. That being said, I believe in separating the “experimentalist” data scientist, who codes production systems and is more of an IT person, from the “theorist” data scientist, who analyzes data, does exploratory data analysis, develops machine learning models, evaluates models and writes cogent reports for management consumption.

Comments

Quote: “I don’t agree with this trend, a data scientist is NOT necessarily a coder which I believe is a waste of talent, and a coder is most definitely NOT a data scientist who should have a significant theoretical foundation in mathematical statistics.”

In my opinion, the job description never actually states “DS = Developer”. It states they are looking for a DS who is also comfortable with dev. You seem to believe that a DS who also does dev is a waste of talent; I believe that a DS who is unable to implement data analysis algorithms (or, more specifically, adapt existing ones to the situation at hand) in order to TEST them (i.e. it is nowhere stated that they are hiring for a product development position) is an enormous waste of money.

Sophia –> well, let’s be clear here, under the “Requirements” section of the job description it states:
Fluency with at least one scripting language such as Python or PHP

To me, that means the DS must be a coder. Let me give you an example that I’ve seen in the field. A friend of mine, whom I knew from my graduate program years ago, applied for a job as VP of Data Science. He has a Ph.D. in statistics from Stanford. He is what I’d term a “theorist” in data science, as he has a strong command of mathematical statistics, algorithm design, modeling and optimization, all at a very high level. The HR director who responded to his inquiry denied him an interview with the CTO who was making the hiring decision. The reason given? He wasn’t a developer. He uses R and Matlab for algorithm development, data modeling, testing hypotheses, performing cross-validation, etc., and was used to working with coders at his previous companies to do the production implementation. This company was shortsighted in “requiring” a Ph.D. to be a coder.

Data scientist and Big Data are poorly understood by a majority of the people that have even heard the terms. So it is not surprising there is variation, even for the same job description/posting. I would offer another viewpoint however. I have seen three different descriptions for the same job at the same company from the same POSTERS. That indicates the posters themselves realize it’s a poorly defined role and are fishing with multiple hooks…not that they’re too stupid to know the difference.

We all know there’s good and bad storytelling when it comes to data analysis, which is why I find the prevalent use of the term troubling. I would absolutely agree that good data scientists must know their audience and be able to ARTICULATE findings to that audience. Moreover, the manner alone in which the data are analyzed can lead to certain interpretations. It is imperative that the data scientist be able to provide results in a manner that is totally transparent and clear (upfront and understandable), including the methods used. Without this approach to “storytelling”, I fear there will be far too much bad information taken as gospel, which will in turn lead to bad decisions. And as we all know, those decisions can too frequently chain into horrendous final conclusions with disastrous results.

FWIW, I’m not sure you always need to spin a story to answer a question.

I like the theorist vs experimenter distinction, two roles found in other scientific circles that must interpret data to gain insight – nuclear physics springs to mind. Perhaps not making that distinction in this example job description led to the coder issue you have. Remember, job posters (including the hiring manager) frequently ask for the sun, moon, and stars from applicants, whether or not that seems realistic. I can see at least *asking* for the ability to use Python and Hadoop for an experimentalist role.

As with many other fields, to be most effective and accurate, it is important that all these roles within a data analysis team have some level of overlap – cross-training if you will. I would argue it is more important on these teams than on most, because they are *interpreting* information used in decisions.

Great overview. I would add the ability to sometimes think and look outside the box for required information/knowledge. What are the most important business questions and what’s the simplest way to answer them, not just HMMMM, what can I do with all these data?
