How to become a data scientist: Interview with Sean McClure

My name is Sean McClure and I am a data scientist working in San Francisco. As a data scientist, I assist organizations looking to compete analytically by applying advanced statistical and mathematical modeling, database technology, and tool-set development in order to derive insight, discover patterns, and automate systems. I am passionate about teaching others the importance of ensuring real science is brought to every use case, and about helping other data scientists learn how to connect advanced analysis to the challenges organizations face.

What does it really mean to be a data scientist?

There is a lot of discussion surrounding what it means to be a data scientist, and unfortunately the big data hype has confused both the general public and companies as to what the profession means.

The most important aspect of this position is the science, so we should start with what it means to be a scientist. Being a scientist means using data that has been generated from natural phenomena to build models that explain and predict those phenomena. This is true for any real science, including all areas of biology, chemistry and physics. All scientists look at data, whether that’s collecting butterflies and measuring their attributes, analyzing the collisions from particle accelerators, computing the folding in a complex protein, or assessing storm systems in weather patterns. All scientists analyze data and use it to build a model of how they think a phenomenon works. This is what it means to do science.

Often that model is encapsulated in the language of mathematics, and with today’s technology, we use computers to assist us in chugging through the math and building that model. In fact, much of science is moving towards what we call scientific computing, since our models are often too complex to write down on paper.

Today, we extend much of our humanity using data-producing devices like laptops, smartphones, and sensors, which means scientists now have access to much more information about our planet and the people living on it. As such, there has been a huge demand to bring the approaches and tools used in traditional science to these data.

And this is what it means to be a data scientist: to use the tools of mathematics, scientific inquiry, and computing to solve today’s biggest challenges. Because the data is so large, we can no longer rely solely on humans to detect the patterns in it. We use techniques in machine learning to ‘learn’ patterns in the data and use those patterns to automate systems.

We use techniques from statistics to validate our models and explore the distributions that generate the data that feed our models. We use visualization to better explore trends inside the data and to communicate our findings to non-scientists. We develop software so that we can send our models into the real world where they can collect more data and take automated actions. We build data infrastructures that better handle large amounts of streaming data and help us scale our science to larger and more practical applications.

To be a data scientist means using the above approaches to do great science on interesting data. This is the responsibility of anyone looking to get into the field. It’s not about making a company money, or hyping up the position as the new sexy thing to do in tech. It’s about doing good science. Often good science will benefit an organization, just as inventing the transistor or discovering a new drug has. But that benefit is a byproduct of doing good science.

Describe your path to becoming a data scientist.

I started my career in science doing an undergraduate degree in materials chemistry and then a Ph.D. in an area called computational chemistry. This is an area of scientific computing that uses mathematical descriptions of molecular compounds, quantum theory, advanced algorithms, and supercomputers to solve challenges related to chemical phenomena. We used advanced calculations and models to help design new solar devices related to nanotechnology.

Towards the end of my Ph.D., I had to decide whether to take the traditional academic route and become a professor or look to apply my knowledge in the “real world.” The professor route often means doing more grant-writing than actual science, and as a scientist I wanted to solve real problems and see the solutions used in the real world, something academic researchers rarely do.

Since most science was moving towards using computers on large datasets, and with the explosion in data happening all around us, it was a perfect chance to leave the “ivory towers” and apply my expertise to interesting data being collected by governments, companies and other organizations.

I started my own company, doing advanced analytics and data science for various organizations. Since I was on my own, I would interact with management consulting firms looking to place analytic talent into their opportunities; this allowed me to get my foot into larger companies and work with interesting data.

After a few years another company reached out to me and offered me an opportunity to work with them as a data scientist and senior consultant. I took the opportunity and this is what I do now.

Which programming language should a data science aspirant learn?

The two main languages are R and Python. These are the most respected and useful in the scientific computing and statistics community. Their package ecosystems are very mature and highly supported. Although much scientific computing is done using C and Fortran, the R and Python packages are typically interfaced with these low-level languages, meaning R and Python are fairly fast for numerical calculations.

Although considered more platforms than languages, Pig and Hive are still being used quite frequently to lay out data pipelines and set up data flow. Obviously you should know some basic SQL to wrangle data from the legacy RDBMS tools that many organizations still use. There are some interesting languages like Julia and Scala, but their package ecosystems are much less mature than those of R and Python.

When in doubt, start solving problems using R and Python. Wrangle your data with SQL if you have to. I would also learn JavaScript for data visualization; start playing around with the D3.js library, so you can communicate your findings effectively.

How much programming and statistics is needed to work as a data scientist? Can you explain the most important statistics and programming concepts?

Programming and statistics are critical. Statistics even more so. Don’t be fooled by vendors preaching tools that allow you to simply press a button and deploy a model or make a ready-to-go recommendation engine. Doing real science takes time and effort and is never the same twice.

You must understand the assumptions your models make about the underlying data, how they scale with infrastructure, when learning algorithms break down, how to parameterize, optimize, validate, and improve your models, and how to use your understanding in programming and statistics to deliver real ROI to an organization. All the above should sound exciting to you, not daunting.
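A minimal sketch of what "validate" can mean in practice is k-fold cross-validation: fit the model on part of the data and score it on the part it has not seen. Everything in this example is an assumption made for illustration, including the synthetic data, the simple straight-line model, and the helper names `k_fold_indices`, `fit_line`, and `cross_validated_mse`. It uses only the Python standard library.

```python
import random
import statistics


def k_fold_indices(n, k, seed=0):
    """Shuffle the indices 0..n-1 and deal them into k near-equal folds."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def fit_line(xs, ys):
    """Ordinary least squares fit of y = intercept + slope * x."""
    mean_x = statistics.fmean(xs)
    mean_y = statistics.fmean(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def cross_validated_mse(xs, ys, k=5):
    """Average held-out mean squared error across k folds."""
    scores = []
    for held_out in k_fold_indices(len(xs), k):
        held = set(held_out)
        train = [i for i in range(len(xs)) if i not in held]
        intercept, slope = fit_line([xs[i] for i in train], [ys[i] for i in train])
        errors = [(ys[i] - (intercept + slope * xs[i])) ** 2 for i in held_out]
        scores.append(statistics.fmean(errors))
    return statistics.fmean(scores)


# Hypothetical data: a straight line plus Gaussian noise with variance 1.
rng = random.Random(42)
xs = [i / 10 for i in range(100)]
ys = [2.0 + 3.0 * x + rng.gauss(0, 1) for x in xs]

cv_mse = cross_validated_mse(xs, ys)
print(f"cross-validated MSE: {cv_mse:.3f}")
```

Because the noise in this synthetic data has variance 1, a well-specified model should score a held-out error in that neighborhood. A much larger value would suggest the model's assumptions do not match the data.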

I love learning new programming languages, and love diving deep into mathematical theory and statistical analyses. I love connecting something esoteric and complex to something real world that a business person can understand. This should be what drives you. If being technical and understanding things deeply scares you, you should not look to be a data scientist.

With all of that said, keep in mind that innovation always moves forward because we “stand on the shoulders of giants.” There are many packages to work with and database technologies to integrate our models with. We are not looking to reinvent the wheel, but you should have a strong conceptual understanding of how these packages work and how to use them to suit the data you are analyzing. If you like to “black box,” don’t go into data science.

How does a complete novice teach himself data science?

Although companies often look for Ph.D. scientists, it doesn’t mean you cannot learn data science. We are living in a world where all the information we could ever hope for is out there. The only way to teach yourself anything is to solve real problems. In data science this means downloading a public dataset and trying to find something or build a data product from it.

Go build an app that shows crime predictions in real time and publish it online. When you do, you will have to work with R or Python. You will have to write some SQL. You will have to develop software. You will have to learn how to integrate a model with that software. You will have to learn what model to use and which ones not to use. You will have to go to online communities like Stack Overflow and Cross Validated and ask questions to better understand what to use or why something doesn’t work. You will have to research articles on models, and math, and stats just to get your app to do something worthwhile. You will have to validate your models and improve your models. You will fail miserably again and again, but this is worth its weight in gold. You will come out with so much new knowledge about how to make data science work in the real world. Forget the courses and countless books. Use the internet and go solve problems.

If you want to impress a company and get hired… go enter public competitions like those on Kaggle, and compete in data mining and machine learning contests. That will prove you know what you’re doing. And don’t worry about first place… just place among the top 10% to show you can compete with the best of them.

What are the best types of entry level positions that could lead to a career in data science? Can you also tell us about the industries in which there is a huge demand for data scientists?

Almost every industry is trying to compete on data. I would focus on one that excites you. Maybe it’s security. Maybe it’s marketing. Maybe it’s crime prevention. Data science doesn’t really have “entry level” positions, because much of the expertise you are expected to have comes from advanced education in a quantitative discipline. However, if you are really keen on working your way up to understanding data science and how to apply the skills in the real world, I would look to get in as an analyst, somewhere you can work with data. Show the company that you have ideas about how to use machine learning models with their existing product. If you truly love working with data this will be enough, and if you truly love building models and doing science, your real-world experience will help you get a more advanced position. But make sure it’s science you really want to do, not simply a title that happens to be popular right now. If you don’t love science, you will not make a good data scientist.

What is the difference between a Data Analyst and a Data Scientist?

A data analyst is not expected to build or deploy models, which is what makes data science a science. A data analyst will use languages like SQL to summarize data and communicate those findings to upper management (or to data scientists). The role is closely related to business intelligence and is more about summarizing existing data.

Data science is about bringing the tools of science and advanced mathematics to that data, so that models can be built and deployed in real-world applications. This requires an understanding of machine learning, big data architecture, advanced statistics and software development…and above all the ability to test hypotheses as a scientist.

Should a Data Scientist definitely know MySQL?

Knowing SQL is important for gathering data and doing simple summaries of that data. It is also important for cleaning data and doing any ETL (extract, transform, load) that might be required. MySQL is similar to other SQL dialects and is used very often, so it’s a good one to know.
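As a rough illustration of the kind of simple summary described here, the sketch below runs a `GROUP BY` query from Python. It uses SQLite through the standard-library `sqlite3` module rather than MySQL so that it runs without a database server. The `orders` table, its columns, and its rows are all hypothetical.

```python
import sqlite3

# In-memory SQLite database standing in for a real RDBMS (assumption).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [("west", 120.0), ("west", 80.0), ("east", 200.0), ("east", None)],
)

# Summarize per region: total rows, rows with a non-missing amount, mean amount.
# COUNT(amount) and AVG(amount) skip NULLs, which is a quick data-quality check.
summary = conn.execute(
    """
    SELECT region,
           COUNT(*)      AS n_rows,
           COUNT(amount) AS n_amounts,
           AVG(amount)   AS avg_amount
    FROM orders
    GROUP BY region
    ORDER BY region
    """
).fetchall()
conn.close()

for row in summary:
    print(row)
```

Comparing `COUNT(*)` with `COUNT(amount)` shows how many values are missing in each group, which is often the first cleaning question to answer before any modeling.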

How are Big Data and Data Science related? What Big Data concepts should a Data Scientist know?

Big Data is about doing data science on very large datasets. Simply spinning up a Hadoop cluster is not doing Big Data. Big Data is interesting because it is not just big, it is different. There are things you can find in big data that you cannot find in smaller amounts of data. A Data Scientist should know how to build and maintain some big data architectures, although keep in mind there are data engineers who specialize in this.

The real responsibility of the data scientist is to ensure the engineers are building something that can scale the models properly. For example, Hadoop is not the best thing to use in all circumstances and it cannot parallelize all algorithms. The data scientist must know how their model scales and help the engineers build something that is suitable for this type of algorithm.
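One way to see why not every algorithm parallelizes cleanly is to compare a mean with a median over partitioned data. The sketch below is an illustration in plain Python, not Hadoop code. The partitions and the helper names `map_partition` and `combine` are assumptions made for the example.

```python
from functools import reduce
import statistics

# Hypothetical data split across three workers.
partitions = [[1, 2, 3], [4, 5], [100]]


def map_partition(values):
    """Per-partition work: reduce the values to a small (sum, count) pair."""
    return (sum(values), len(values))


def combine(a, b):
    """Merge two partial results. Associative, so merge order does not matter."""
    return (a[0] + b[0], a[1] + b[1])


# The mean decomposes: each worker sends back two numbers and the merge is exact.
total, count = reduce(combine, map(map_partition, partitions))
distributed_mean = total / count

all_values = [v for part in partitions for v in part]
exact_mean = statistics.fmean(all_values)

# The median does not decompose this way: combining per-partition medians
# gives a different answer from the median of the full dataset.
exact_median = statistics.median(all_values)
median_of_medians = statistics.median([statistics.median(p) for p in partitions])

print(distributed_mean, exact_mean)     # these agree
print(exact_median, median_of_medians)  # these differ
```

The mean works because its partial results can be merged without losing information. An exact median needs either all the data in one place or a more involved distributed algorithm, which is the kind of scaling behavior a data scientist should be able to explain to the engineers.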

If you could give one piece of advice to someone starting out, what would you say?

Forget the hype and the sex appeal of data science. If you are entering into this field for that reason you will hate your job. It is a demanding career where you are responsible for doing good science, managing client expectations, understanding technical and theoretical aspects deeply, and knowing how to be productive and work with many other disciplines.

If you love going deep and being technical, and also being awesome at communicating those technical aspects in layman’s terms to real-world businesses, then my advice is to start solving problems using data and build a product from it. Life is too short to not love what you do. If data science is no longer cool 5 years from now, would you still do it? Would you solve problems with math and data if you were on an island all by yourself? If yes… go for it!!