Better Questions to Ask Your Data Scientists

The intersection of big data and business is growing daily. Although enterprises have been studying analytics for decades, data science is a relatively new capability. And interacting in a new data-driven culture can be difficult, particularly for those who aren’t data experts.

One particular challenge that many of these individuals face is how to request new data or analytics from data scientists. They don’t know the right questions to ask, the correct terms to use, or the range of factors to consider to get the information they need. In the end, analysts are left uncertain about how to proceed, and managers are frustrated when the information they get isn’t what they intended.

At The Data Incubator, we work with hundreds of companies looking to hire data scientists and data engineers or enroll their employees in our corporate training programs. We often field questions from our hiring and training clients about how to interact with their data experts. While it’s impossible to give an exhaustive account, here are some important factors to think about when communicating with data scientists, particularly as you begin a data search.

What question should we ask? As you begin working with your data analysts, be clear about what you hope to achieve. Think about the business impact you want the data to have and the company’s ability to act on that information. By hearing what you hope to gain from their assistance, the data scientist can collaborate with you to define the right set of questions to answer and better understand exactly what information to seek.

Even the subtlest ambiguity can have major implications. For example, advertising managers may ask analysts, “What is the most efficient way to use ads to increase sales?” Though this seems reasonable, it may not be the right question since the ultimate objective of most firms isn’t to increase sales, but to maximize profit. Research from the Institute of Practitioners in Advertising shows that using ads to reduce price sensitivity is typically twice as profitable as trying to increase sales. The value of the insight obtained will depend heavily on the question asked. Be as specific and actionable as possible.

What data do we need? As you define the right question and objectives for analysis, you and your data scientist should assess the availability of the data. Ask if someone has already collected the relevant data and performed analysis. The ever-growing breadth of public data often provides easily accessible answers to common questions. Cerner, a supplier of health care IT solutions, uses data sets from the U.S. Department of Health and Human Services to supplement their own data. iMedicare uses information from the Centers for Medicare and Medicaid Services to select policies. Consider whether public data could be applied to your problem as well. You can also work with other analysts in the organization to determine if the data has previously been analyzed for similar reasons by others internally.

Then, assess whether the available data is sufficient. Data may not contain all the relevant information needed to answer your questions. It may also be influenced by latent factors that can be difficult to recognize. Consider the vintage effect in private lending data: Even seemingly identical loans typically perform very differently based on the time of issuance, despite the fact they may have had identical data at that time. The effect comes from fluctuations in the underlying underwriting standards at issuance, information that is not typically represented in loan data.

You should also inquire if the data is unbiased, since sample size alone is not sufficient to guarantee its validity. Finally, ask if the data scientist has enough data to answer the question. By identifying what information is needed, you can help data scientists plan better analyses going forward.
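
A quick simulation (with invented numbers) makes the point that sample size alone does not guarantee validity: a large but biased sample can badly mislead, while a far smaller random sample lands close to the truth.

```python
import random

random.seed(42)

# Hypothetical population of 100,000 customers; the true purchase rate is 30%.
population = [1] * 30_000 + [0] * 70_000

# Large but biased sample: 10,000 survey responses that over-represent
# buyers (say, a survey sent only to loyalty-program members).
biased_sample = [1] * 9_000 + [0] * 1_000
biased_estimate = sum(biased_sample) / len(biased_sample)

# Much smaller, but unbiased: 500 customers chosen uniformly at random.
random_sample = random.sample(population, 500)
unbiased_estimate = sum(random_sample) / len(random_sample)

print(f"biased estimate   (n=10,000): {biased_estimate:.2f}")  # 0.90, far off
print(f"unbiased estimate (n=500):    {unbiased_estimate:.2f}")  # near 0.30
```

The numbers and scenario are made up, but the lesson is general: how the sample was drawn matters more than how big it is.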

How do we obtain the data? If more information is needed, data scientists must decide between using data compiled by the company through the normal course of business, such as through observational studies, and collecting new data through experiments. As part of your conversation with analysts, ask about the costs and benefits of these options. Observational studies, for example, may be easier and less expensive to arrange since they do not require direct interaction with subjects, but they are typically far less reliable than experiments because they can establish only correlation, not causation.

Experiments allow substantially more control and provide more reliable information about causality, but they are often expensive and difficult to perform. Even seemingly harmless experiments may carry ethical or social implications with real financial consequences. Facebook, for example, faced public fury over its manipulation of its own newsfeed to test how emotions spread on social media. Though the experiments were completely legal, many users resented being unwitting participants in Facebook’s experiments. Managers must think beyond the data and consider the greater brand repercussions of data collection and work with data scientists to understand these consequences.

Before investing resources in new analysis, validate that the company can use the insights derived from it in a productive and meaningful way. This may entail integration with existing technology projects, providing new data to automated systems, and establishing new processes.

Is the data clean and easy to analyze? In general, data comes in two forms: structured and unstructured. Structured data conforms to a predefined schema, making it easy to store in a database, and most analysts find it easier and faster to manipulate. Unstructured data is often free-form and cannot be as easily stored in the types of relational databases most commonly used in enterprises. Unstructured data is estimated to make up 95% of the world’s data, according to a report by professors Amir Gandomi and Murtaza Haider of Ryerson University, yet for many large companies, storing and manipulating it may require a significant investment of resources to extract the necessary information. Working with your data scientists, evaluate the additional costs of using unstructured data when defining your initial objectives.
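
To make the distinction concrete, here is a minimal sketch of what extracting structure from unstructured text involves. The ticket format, field names, and parsing rule are all invented for illustration; real pipelines are far more involved.

```python
import re

# Hypothetical raw support-ticket lines: free-form text (unstructured).
raw_tickets = [
    "2024-03-01 | ORDER-1001 | Customer says delivery was late",
    "2024-03-02 | ORDER-1002 | Refund requested, item damaged",
    "bad row without the expected delimiters",
]

# A minimal parser that lifts each line into a structured record.
pattern = re.compile(r"(\d{4}-\d{2}-\d{2}) \| (ORDER-\d+) \| (.+)")

structured, rejected = [], []
for line in raw_tickets:
    m = pattern.fullmatch(line)
    if m:
        date, order_id, note = m.groups()
        structured.append({"date": date, "order_id": order_id, "note": note})
    else:
        rejected.append(line)  # leftovers that need manual review

print(len(structured), "parsed;", len(rejected), "rejected")
```

Even in this toy case, some rows resist parsing; at enterprise scale, that residue is where much of the cost of unstructured data lives.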

Even if the data is structured, it may still need to be cleaned or checked for incompleteness and inaccuracies. When possible, encourage analysts to use clean data first. Otherwise, they will have to waste valuable time and resources identifying and correcting inaccurate records. A 2014 survey conducted by Ascend2, a marketing research company, found that nearly 54% of respondents cited a “lack of data quality/completeness” as their most prominent impediment. By searching for clean data, you can avoid significant problems and loss of time.
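
The kinds of checks involved can be sketched in a few lines. The records and validation rules below are hypothetical; in practice the plausibility checks come from domain knowledge.

```python
# Hypothetical sales records; field names and rules are illustrative.
records = [
    {"customer_id": "C1", "amount": 120.0, "region": "East"},
    {"customer_id": "C2", "amount": None,  "region": "West"},  # incomplete
    {"customer_id": "C3", "amount": -45.0, "region": "East"},  # impossible value
    {"customer_id": "C4", "amount": 80.0,  "region": ""},      # missing region
]

def is_clean(rec):
    """Basic completeness and plausibility checks on one record."""
    return (
        rec.get("amount") is not None  # no missing amounts
        and rec["amount"] >= 0         # sales can't be negative
        and bool(rec.get("region"))    # region must be present
    )

clean = [r for r in records if is_clean(r)]
dirty = [r for r in records if not is_clean(r)]

print(f"{len(clean)} clean / {len(dirty)} flagged for review")
```

Flagging dirty records early is cheap; discovering them after an analysis has been acted on is not.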

Is the model too complicated? Statistical techniques and open-source tools to analyze data abound, but simplicity is often the best choice. More complex and flexible models are prone to overfitting and can take more time to develop. Work with your data scientists to identify simpler techniques and tools, and move to more complex models only if the simpler ones prove insufficient. It is important to observe the KISS rule: “Keep It Simple, Stupid!”
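
A toy numeric sketch (invented data) shows what overfitting looks like: a degree-5 polynomial fits six training points exactly, yet a plain least-squares line predicts a held-out point far better.

```python
# Six training points that roughly follow y = 2x, plus one held-out point.
train = [(0, 0.2), (1, 2.1), (2, 3.8), (3, 6.3), (4, 7.9), (5, 10.4)]
holdout_x, holdout_y = 6, 11.8

def linear_fit(points):
    """Simple model: least-squares line (closed form)."""
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    slope = (sum((x - mx) * (y - my) for x, y in points)
             / sum((x - mx) ** 2 for x, _ in points))
    return slope, my - slope * mx

def lagrange_predict(points, x):
    """Complex model: degree-5 polynomial interpolating the training data exactly."""
    total = 0.0
    for i, (xi, yi) in enumerate(points):
        term = yi
        for j, (xj, _) in enumerate(points):
            if i != j:
                term *= (x - xj) / (xi - xj)
        total += term
    return total

slope, intercept = linear_fit(train)
simple_error = abs((slope * holdout_x + intercept) - holdout_y)
complex_error = abs(lagrange_predict(train, holdout_x) - holdout_y)

# The interpolating polynomial has zero training error but extrapolates
# wildly; the simple line generalizes far better.
print(f"simple model error:  {simple_error:.2f}")
print(f"complex model error: {complex_error:.2f}")
```

Zero error on the data you already have is not the goal; accuracy on data you have not yet seen is.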

It may not be possible to avoid all of the expenses and issues related to data collection and analysis. But you can take steps to mitigate these costs and risks. By asking the right questions of your analysts, you can ensure proper collaboration and get the information you need to move forward confidently.

Dr. Michael Li is the founder and executive director of The Data Incubator. A data scientist, he has worked at Google, Foursquare, and Andreessen Horowitz. He is a regular contributor to VentureBeat, The Next Web, and Harvard Business Review.

Madina Kassengaliyeva is a client services director with Think Big, a Teradata company. She helps clients realize high-impact business opportunities through effective implementation of big data and analytics solutions. Madina has managed accounts in the financial services and insurance industries and led successful strategy, solution development, and analytics engagements.

Raymond Perkins is a researcher at Princeton University working at the intersection of statistics, data, and finance and is the executive director of the Princeton Quant Trading Conference. He has also conducted research at Hong Kong University of Science and Technology, the Mathematical Sciences Research Institute (MSRI), and Michigan State University.