
Not Just a Title: How to Identify a Data Scientist

Posted May 11th, 2015

Update 2018: This post has been updated to reflect our most recent criteria from our 2018 data science salary study, released in May 2018.

After our much-debated blog post, 4 Ways to Spot a Fake Data Scientist, many readers were curious to know what criteria Burtch Works uses to identify data scientists, since the title itself is not always an indicator. The following is adapted from our recently released data science salary study that goes into more detail about the academic background, skills, and day-to-day job responsibilities that we look for when identifying data scientists. To download the full report with complete compensation data and demographic information, as well as how data science salaries compare to predictive analytics, click here.

Data scientists apply sophisticated quantitative and computer science skills to both structure and analyze massive stores or continuous streams of unstructured data, with the intent to derive insights and prescribe action.

The depth and breadth of their coding skills distinguish them from other predictive analytics professionals and allow them to exploit data regardless of its source, size, or format. Through the use of one or more general-purpose coding languages and data infrastructures, data scientists can tackle problems that are made very difficult by the size and disorganization of the data.

To identify data scientists for our recruiting efforts and Burtch Works Studies, we use the following criteria:

1. Educational Background – Data scientists typically have an advanced degree, such as a Master’s or PhD, in a quantitative discipline, such as Computer Science, Physics, Engineering, Applied Mathematics, Statistics, Economics, or Operations Research. New educational options, including data science degree programs, MOOCs (massive open online courses), and bootcamps, continue to take hold in the quantitative community. Some professionals from related careers or fields of study have successfully pivoted into entry-level data science roles through premier bootcamps and mid-career Master’s programs.

2. Skills – Data scientists have expert knowledge of statistical and machine learning methods using tools such as Python and R, with predictive analytics still at the core of the discipline. Data scientists are usually proficient users of relational databases queried with SQL, Big Data infrastructures like Hadoop and Spark, related tools like Pig and Hive, and, frequently, AWS.

Data scientists may use languages such as Python, Java, and Scala (among others) to write programs to wrangle and manage data, automate analysis, and, at times, build these functions into production level code for SaaS companies. Many also use other methods to derive useful information from data, including pattern recognition using TensorFlow and deep learning techniques, signal processing, and visualization.

3. Dataset Size – Data scientists typically work with datasets that are measured in gigabytes or larger increments, usually too large to be housed in local memory, and may work with continuously streaming data.
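To make the "too large to be housed in local memory" point concrete, here is a minimal sketch of single-pass processing in Python. It is an illustration under assumptions, not a description of any particular tool: the column name `amount`, the helper names, and the tiny in-memory sample are all hypothetical stand-ins for a file or stream that would not fit in RAM.

```python
import csv
import io
from typing import Iterable, Iterator


def read_amounts(lines: Iterable[str], column: str = "amount") -> Iterator[float]:
    """Yield one numeric value per row without loading the whole source."""
    for row in csv.DictReader(lines):
        yield float(row[column])


def running_mean(values: Iterable[float]) -> float:
    """Compute a mean in one pass, keeping only a count and a total in memory."""
    count = 0
    total = 0.0
    for value in values:
        count += 1
        total += value
    if count == 0:
        raise ValueError("no values to average")
    return total / count


# Hypothetical sample data; in practice this would be an open file or a stream.
sample = io.StringIO("customer,amount\na,10.0\nb,20.0\nc,60.0\n")
mean_amount = running_mean(read_amounts(sample))
```

Because both functions consume their input lazily, memory use stays constant no matter how many rows arrive, which is the same idea that larger infrastructures apply at scale.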

4. Job Responsibilities – Although they may specialize in a specific area, data scientists are equipped to work on every stage of the analytics life cycle, which includes:

Data Acquisition – This may involve scraping data, interfacing with APIs, querying relational and non-relational databases, building ETL pipelines, or defining strategy in relation to what data to pursue.

Data Cleaning/Transformation – This may involve parsing and aggregating messy, incomplete, and unstructured data sources to produce datasets that can be used in analytics and/or predictive modeling.

Analytics – This involves statistical and machine learning-based modeling in order to understand, describe, or predict patterns in the data.

Prescribing Actions – This involves interpreting analytical results through the lens of business priorities, and using data-driven insights to inform strategy. Strong technical chops alone do not make an exceptional data scientist, so when recruiting we look for a combination of technical and non-technical skills.

Programming/Automation – In many cases, data scientists are also responsible for creating libraries and utilities to operationalize or simplify various stages of this process. Often, they will contribute production-level code for a firm’s data products.
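The stages above can be sketched end to end in a few lines of Python. This is a toy illustration only: the records, the `segment` and `purchased` fields, and the function names are hypothetical, and a simple purchase rate stands in for real statistical or machine learning modeling.

```python
from collections import defaultdict

# Data acquisition: hypothetical raw records standing in for an API or database pull.
raw_records = [
    {"segment": "email", "purchased": "yes"},
    {"segment": "email", "purchased": "no"},
    {"segment": "social", "purchased": "yes"},
    {"segment": "social", "purchased": "yes"},
    {"segment": "", "purchased": "yes"},      # missing segment
    {"segment": "email", "purchased": None},  # missing outcome
]


def clean(records):
    """Cleaning/transformation: drop incomplete rows and convert the outcome to a bool."""
    for record in records:
        if record["segment"] and record["purchased"] in ("yes", "no"):
            yield {"segment": record["segment"],
                   "purchased": record["purchased"] == "yes"}


def purchase_rates(records):
    """Analytics: compute the purchase rate for each segment."""
    counts = defaultdict(lambda: [0, 0])  # segment -> [purchases, rows]
    for record in records:
        counts[record["segment"]][0] += record["purchased"]
        counts[record["segment"]][1] += 1
    return {segment: bought / total for segment, (bought, total) in counts.items()}


def recommend(rates):
    """Prescribing actions: pick the segment with the highest purchase rate."""
    return max(rates, key=rates.get)


rates = purchase_rates(clean(raw_records))
best_segment = recommend(rates)
```

Wrapping each stage in its own function is a small instance of the programming/automation stage: the same steps can be rerun on new data without redoing the work by hand.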

Note: Professionals whose jobs are described as predictive analytics, analytics management, business intelligence, and operations research are not classified as data scientists under our definition. This is because they either do not work with exceptionally large datasets or do not work with unstructured data. In the specific case of operations researchers, their function is to optimize well-described processes rather than to predict outcomes and prescribe actions for more nebulous problems like customer behavior. Predictive analytics professionals (who primarily work with structured data) were the subject of their own study, The Burtch Works Study: Salaries of Predictive Analytics Professionals, released in September 2017.

There’s been a lot of conversation around this developing field, and as the tools continue to evolve, our criteria have evolved with each new study as well. Whether you’re a data scientist, an analytics professional, a programmer, or a data engineer, it’s important that you continue to learn as tools enter the market, and keep up with new technology. I’m sure there will be some bleeding-edge tools that we’ve missed, so be sure to leave your thoughts in the comments below.


24 Responses to “Not Just a Title: How to Identify a Data Scientist”

DataLover

After reading this post in addition to “4 Ways to Spot a Fake Data Scientist”, I reflect upon my own experience as a data scientist, or maybe “data scientist”. What I realize is that I don’t care which one I am. The term data scientist was put to me in order to identify me as what a group of recruiters and data engineering professionals called a “unicorn” in terms of the combination of skills, experience, talents and interests I embody.

I am not a statistician, a software engineer, a coder, data engineer, a machine learning expert, an algorithm optimizer/developer (or “quant”), a database administrator, a database analyst, a business analyst, a report generator/analyst, a BI platform administrator, a predictive analytics professional, etc… I am a bit of all those things when I need to be. I am a lot of a technology and concept, or idea, hacker with a love for data and insights coming from analyzing living data processes and systems as they exist in their natural habitats. So I use all those different aspects of a data professional to wrangle a set of data, or create one, that contains the concepts I, or my business, wish to understand. I choose the appropriate tools for analyzing the data given the type of outcome desired whether this is the insights themselves, or an automated way to generate the window onto the processes and data that drive the insights.

I then find the best way to deliver the information to my audiences in ways that they can consume the information, WHATEVER way they feel most comfortable. On the business level I may choose a more general solution. But, on the individual level, I meet them where they are so that they can most quickly grasp the information and insights and can contribute to the discussion of implications and strategy as an equal. I also help them get to where they want to be in terms of sophistication of tools, but I have no issue with writing a VBA script for them to press a button and get a report in Excel if that is most comfortable for them, or even creating a brief PowerPoint or a printout that we can sit down and discuss.

My data sets have reached only to the many billions of records, but that does not mean I don’t understand what a distributed file system is, or how to thread processes and results, or would not be able to rapidly create or implement or use such systems if that were necessary.

What interests me in new opportunities are the types of data the business knows they have, questions asked of the data, interest in learning the most out of what their data can tell them if handled correctly and explored fully, the liveliness and staleness of their data, and the universe their data lives in. While I cannot always predict what will interest me in a firm and its data (and I am surprised many times in this), I am very picky.

So talk of data scientist vs. “data scientist” means very little to me. And, if it distracts potential employers, I would rather they not call me either. Instead they should focus on explaining their current data infrastructure, assets and processes, and on telling me their immediate needs and desires in terms of what they would like to get from their data, as well as their long term interests and dreams, vague as they may be, about what they might want to do with their data.

Simon

DataLover

Frogman

DataLover,
I feel exactly the same way. I detest the current trend to get caught up in these data science/analytics labels. In the end it’s all about understanding the data, providing value to the business from the data and having a general curiosity about the data itself. I would appreciate it greatly if you could email me with some guidance on elevating my data skills to the level that you feel would merit the term data scientist or provide the best all-around analytical skill set. Thanks ahead of time! 🙂

DataLover

Frogman,

If I can give any advice regarding your request, and I’m not sure I can, I would say forget about trying to achieve some elevated data scientist status.

Try this: In your request, replace the term “Data Scientist” with “Artist”. First, any artist faces criticism from other “Artists” and their work may not even be considered “Art” by many. But, if you found one and asked them how to become an artist, how would they respond? I’m really asking here. Because, there are many students in art school, but there are just as many artists that never went to school for art. Maybe they had a feeling or emotion or event that they needed to express and found some medium for doing so. Then, society decides whether or not they are artists. But to them, they are expressing themselves in the way they desire, and many times need.

Assuming that I can be considered a data scientist, I will attempt to help you here.

I do believe you need to have a deep desire to work with data: natural curiosity and an interest in learning whatever it takes to get to know your data. Sometimes this is a new system for storing data, sometimes a new programming language, other times it is a new application of a quantitative method, and sometimes you may actually have to draw diagrams or do math… with pencil and paper. You may need to implement whatever you design single-handedly. Whatever it takes to understand the data, how it is generated, and the universe it lives in. In my opinion, a good grasp of the basics, the foundations of inferring from quantitative data, is very important. Also, communication is extremely important. Unlike for the “Fine” artist, being able to communicate and help your business or audience understand your results and various parts of your process (where necessary) is part of your art. It does not do your data justice to display the work you have done with it and not help others understand it. If you were at a party with your friends and family, would you introduce your significant other with “This is [John/Jane]”, then leave the party and leave everyone to stare at each other?

-The data is your medium. (I do know I’m using the plural with singular grammar)
-The data storage systems, programming languages, and descriptive, predictive, statistical, machine learning methods etc. are your tools.
-Your canvas is the landscape of information that you work in, or are interested in exploring. Required here is the motivation for generating relevant and interesting questions (ex. business need, business processes, a gap in knowledge or information, or just genuine curiosity).
-How you use those tools with your medium to compose your canvas is your technique.

That may not be a sturdy analogy, but I do think that if you change it around and say something like:
-The systems, programming languages, quantitative methods, etc. are your medium;
-Information is your tool (think: systems design, process flowcharting, etc.); and
-The data is your canvas.
You would be describing more of a “data engineer”: one who sees systems, languages, and analytical methods as building blocks for designing, fixing, augmenting or changing the flow of information. The result is an efficient and scalable data pipeline, allowing the data to be generated by systems needing to generate it and consumed by systems needing to consume it, integrating analytics into the broader business environment. It’s a stretch of the analogy, but I’m sticking with it. I will say that if you think finding a “true data scientist” is hard, then you can forget about finding a data engineer that does the term justice. (One simple reason for this is that most business needs are somewhat adequately met without needing a data engineer. Much the same way a lot of businesses hire business analysts and machine learning specialists as data scientists. But, if you do need a data engineer, accept no substitutes.)

Many times, technique is where you will find the distinction between “good” vs. “average” data people: choosing the optimal technique or approach by internally solving for resources available, timeliness, and type and quality of results desired. Sometimes, there is no getting around the exploration stage and the results of exploration may be only slightly indicative, and that is part of the optimization problem. For instance, you could learn a new language to process a certain data set in certain ways and then run a boosted tree on it to predict a binary classification outcome (like, purchase/no purchase). Or, you might purchase access to one of the push-button, “substitute our product for any background in data science/analytics”, machine learning platforms and try to let the platform do the work for you (go ahead, get a trial version and see how easy it is to not know anything about variance or what you’re doing and run a random forest). Or, it may be possible to compose a bar chart that illustrates the same information in a fraction of the time, for a fraction of the cost, causing a fraction of the confusion you would need to actively manage and address. A good “data scientist” understands variance and the need for, and applicability of, certain methods based on the parameters mentioned, the desired output, and the audience.

Now, again, assuming you would call me a data scientist, I will continue to try to give you a bit more of an answer to your request (and again, through analogy because that is how I am feeling). Would the veteran “Artist” tell you that all you have to do is go to school to study fine art? I think going (back) to school to study probability, statistics, (and then machine learning, after these things), is just one of many possible beginnings to a fulfilling career as a data person. If I were the veteran artist… And, I think I would be a painter, because I’m sometimes visual like that. I would say to the aspiring artist asking how to acquire the skills to be considered an Artist, “GO PAINT!!”