Five Interview Questions to Predict a Good Data Scientist

Those of us in the profession are constantly reminded of the drastic shortage of data scientists. It’s only going to get worse before it gets better, since demand for technologies like machine learning, AI, and deep learning is on such an upward trajectory. Because of this shortage, a lot of people sense high-paying employment opportunities and are making the transition from other professions. The resulting problem for employers is clear: you’re not always getting the best candidates for your open positions.

What to do? Many firms craft employment ads that are seemingly designed to scare off candidates. Not everyone can fill the role of a data science “unicorn,” a role calling for a Ph.D. in computer science and applied statistics along with years of domain-specific experience. Of course, there are brave souls who apply for these jobs without the requisite knowledge and experience, so you need a way to effectively filter out the imposters.

The short list below is something I came up with to be used by hiring managers for data science positions (read: not data engineers) to help weed out the folks who are stretching reality with respect to their abilities. It’s true that many tech firms will include grueling coding tests during interviews, but these questions are more nuanced, focusing more on foundational knowledge, down-in-the-trenches experience, and data science common sense. The idea is to see if they know the basics, can create a viable strategy, and can practically solve a problem.

What is the significance of the normal distribution to data science? This question is designed to test the candidate’s understanding of one of the most basic elements of data science. It would be great if the response involved a discussion of the Central Limit Theorem, but maybe that’s too much to ask for. And maybe getting the mathematical formula for the Gaussian probability density function is an overreach. But aside from a mention of the “bell curve,” it would be nice to hear something along the lines of: its mean, median, and mode are all the same; the entire distribution can be specified using just two parameters, the mean and the variance; or maybe a description of its importance to linear regression (the workhorse of data science).
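
For interviewers who want a concrete reference point for the Central Limit Theorem part of this answer, here is a minimal sketch in Python using only the standard library. The function names, the choice of an exponential distribution, the sample size of 50, and the 5,000 repetitions are all illustrative assumptions, not part of the original question. It draws repeated samples from a strongly skewed distribution and checks that the sample means behave roughly like a normal distribution.

```python
import random
import statistics


def sample_means(draw, sample_size, n_samples, seed=0):
    """Return n_samples means, each computed from sample_size random draws."""
    rng = random.Random(seed)
    return [
        statistics.fmean(draw(rng) for _ in range(sample_size))
        for _ in range(n_samples)
    ]


def draw_exponential(rng):
    # Exponential distribution with rate 1: right-skewed, mean 1, std 1.
    return rng.expovariate(1.0)


# Assumed settings for illustration only.
SAMPLE_SIZE = 50
N_SAMPLES = 5000

means = sample_means(draw_exponential, SAMPLE_SIZE, N_SAMPLES)

center = statistics.fmean(means)
spread = statistics.stdev(means)

# Share of sample means within one standard deviation of the center.
# For a normal distribution this share is roughly 68%.
within_one_sd = sum(abs(m - center) <= spread for m in means) / len(means)

print(f"mean of sample means: {center:.3f}")
print(f"std of sample means:  {spread:.3f} (theory: {1 / SAMPLE_SIZE ** 0.5:.3f})")
print(f"share within 1 sd:    {within_one_sd:.3f}")
```

A candidate who can explain why the spread shrinks with the square root of the sample size, even though the underlying data is nowhere near bell-shaped, is showing the kind of foundational understanding this question is after.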

Tell me about your passion for data science. Do you attend local meetups, participate in data challenges like Kaggle, work to use data for the common good (public data hacking, for example), speak at conferences, or write books or articles? The point of this question is to determine whether the candidate feels that data science is their true calling. Do they think and dream about data? Do they see a problem and instantly look for a solution involving patterns in data? What books are in their library? A related question is how much of a role a mathematical foundation plays in how they think about the subject. A data scientist who understands the math behind the algorithms will typically perform much better.

Describe the last time you experienced frustration in a data science project you were working on, and how you overcame it. Not all data science projects progress swimmingly, and many potential roadblocks may occur. This question probes the depth of the candidate’s true experience and how they managed to handle inevitable problems. People with scant knowledge and experience will easily be exposed here.

Think back to a past data science project you worked on. If the powers that be asked you to change one of your data sources, and thus use different predictors, how would you alter your solution? This question relates to the role the candidate played previously, and how well they adapted to changing requirements such as the introduction of new data sets. Many times, lower-level data scientists are simply given a data set with a list of predictors to use, without having any input on the suitability of those predictors. Heavier contributors, on the other hand, will be involved with data set selection, feature engineering, and statistical analysis. You probably want a more well-rounded candidate for your team.

Research has stated that 2.3 billion people have been affected by floods in the last two decades. Describe how you’d approach a data science project to predict upcoming floods in the next 100–500 years. These predictions can be used to build dams at the right locations to minimize loss. This kind of question, or one more closely aligned with your specific industry, calls for consideration of the “data science process,” including problem formulation, data acquisition, data wrangling, exploratory data analysis, feature engineering, modeling the data (build, fit, and validate a model), and data storytelling with the results. The candidate needs to be intimately familiar with a data scientist’s workflow.

If you’re looking for a good data scientist versus someone who just claims the title, then the above questions are surprisingly effective at quickly differentiating between the two. The good thing about these questions is that you can fine-tune the acceptable answers to fit your industry or even your company.