Data Science Project Scoping Guide

Over the past several years, the University of Chicago Center for Data Science and Public Policy and the Data Science for Social Good Fellowship have worked on over 40 data science projects. One (unsurprising) lesson we’ve learned over those years is that it is critical to scope these projects well at the beginning. There are a lot of organizations out there – government agencies, non-profits, social enterprises, corporations – working on important problems that can have a huge impact on society. There are lots (not enough but still, lots) of talented, passionate, and smart people with data (science) skills who can help them tackle those problems. Yet, when these two sets of people come together, the results are often mixed because of the challenges associated with designing a well-scoped project.

We have found that it is necessary to have people who can mediate between the two groups and formulate a problem that is both solvable and impactful. It’s even better if both groups have these scoping skills themselves so they can work together more effectively. After scoping hundreds of projects over the past few years, we’ve learned several project scoping lessons that we want to share with the larger data science for social good community. We’d love any feedback and comments on our approach, and help in improving it.

Although this is written specifically to benefit people scoping data science projects for social good, the lessons here generalize to socially neutral (and unfortunately, to socially evil) projects as well. In fact, a lot of this is based on lessons we’ve learned from working on data science projects in the corporate world in our previous work lives.

Initial Filters

We’ll assume here that we’ve already gone over the basic criteria for doing data science for social good projects:

The problem we’re solving is important and has social impact.

Data can play a role in solving the problem, and the organization has the right data. (See our data maturity framework to help you assess whether you have the right data.)

Scoping Overview

Once we’ve gone through the initial screening process, we start the scoping process. There are many ways to scope a problem. The one we describe below has worked for us, but we certainly don’t think it is the only way to scope a successful project. Let us know if you use alternative approaches that we can benefit from. As always, the scoping process is fairly iterative, and the scope gets refined both during the scoping process and during the project itself.

Step 1: Goals – Define the goal(s) of the project

Step 2: Actions – What actions/interventions do you have that this project will inform?

Step 3: Data – What data do you have access to internally? What data do you need? What can you augment from external and/or public sources?

Step 4: Analysis – What analysis needs to be done? Does it involve description, detection, prediction, or behavior change? How will the analysis be validated?

Step 1: Define the Goal(s)

This is the most critical step in the scoping process. Most projects start with a very vague and abstract goal (say, improving education or healthcare), get a little more concrete (increase the percentage of students who will graduate on time or decrease the number of children who get lead poisoning), and keep getting refined until the goal is both concrete and achieves the aims of the organization. This step is difficult because most organizations haven’t explicitly defined analytical goals for many of the problems they’re tackling. Sometimes, these goals exist but are locked implicitly in the minds of people within the organization. Other times, there are several goals that different parts of the organization are trying to optimize. The objective here is to take the outcome we’re trying to achieve and turn it into a goal that is measurable and can be optimized.

Let’s take an example that we are all extremely familiar with these days: presidential elections. What is the goal of each presidential campaign? Obviously, the answer is winning the election, but what is the single, measurable goal campaigns are trying to optimize? Some may say getting more votes or winning more swing states or winning more electoral votes. Those are all goals that point in the right direction, but at its very core, the goal of a presidential campaign is to maximize the probability of winning at least 270 electoral votes. This is a critical point because getting more than 50% of the votes doesn’t mean you win the election. Winning more than 50% of the states doesn’t mean you win the election either. And getting 270 electoral votes has the same outcome as winning 500 electoral votes. So you want to make sure that you don’t just go for more votes or more electoral votes but that you maximize your probability of getting at least 270 electoral votes. Focusing on this goal allows (rational and data-driven) campaigns to allocate their resources to increase this probability.
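To make the difference between "more electoral votes" and "higher probability of reaching the threshold" concrete, here is a minimal sketch. Everything in it is an illustrative assumption: the function names, the toy three-state map, the win probabilities, and in particular the simplification that state outcomes are independent (they are not in real elections).

```python
def prob_at_least(states, target):
    """Return P(total electoral votes >= target).

    states: list of (electoral_votes, win_probability) pairs.
    Assumes state outcomes are independent, which is a strong simplification.
    """
    total = sum(ev for ev, _ in states)
    dist = [1.0] + [0.0] * total  # dist[v] = probability of holding exactly v votes
    for ev, p in states:
        updated = [0.0] * (total + 1)
        for votes, mass in enumerate(dist):
            if mass:
                updated[votes] += mass * (1 - p)  # lose this state
                updated[votes + ev] += mass * p   # win this state
        dist = updated
    return sum(dist[target:])


def expected_votes(states):
    """Expected number of electoral votes under the same assumptions."""
    return sum(ev * p for ev, p in states)


# Hypothetical toy map: three 10-vote states, 20 votes needed to win.
spread = [(10, 0.65), (10, 0.65), (10, 0.65)]   # effort spread evenly
focused = [(10, 0.90), (10, 0.90), (10, 0.00)]  # effort concentrated on two states

for name, plan in (("spread", spread), ("focused", focused)):
    print(name, expected_votes(plan), prob_at_least(plan, 20))
```

In this toy setup the "spread" plan has more expected electoral votes, yet the "focused" plan has the higher probability of reaching the threshold, which is the quantity the campaign actually cares about.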

Examples

Now let’s take a few problems in public policy and social good and see how we can determine the goals.

Example 1 – Lead Poisoning: A few years ago, we started working with the Chicago Department of Public Health on preventing lead poisoning. The initial goal was to increase the effectiveness of their lead hazard inspections. One way to achieve that goal would be to focus on homes that have lead hazards. Although helpful, this approach wouldn’t get to their real goal, which was to prevent children from getting lead poisoning. Finding a home with lead hazards and getting it remediated is only beneficial if there is a high chance that a child is present (currently or in the future) who is likely to get exposed to lead. The next iteration of the goal was to maximize the number of inspections that find lead hazards in homes where there is an at-risk child (before the child gets exposed to lead). Eventually, we got to the final goal: identifying which children are at high risk of lead poisoning in the future and then targeting interventions at the homes of those children.

Example 2 – High School Graduation: One of the bigger challenges schools are facing today is helping their students graduate (on time). Graduation rates in the US are ~65%. School districts are all interested in identifying students who are at risk of not graduating on time. When we initially talk to school districts, most start with a very narrow goal of predicting which kids are unlikely to graduate on time. The first step is to go back to the goal of increasing graduation rates and ask whether there is a specific subset of at-risk students they want to identify. What if we could distinguish students who are only 5% likely to not graduate on time from students who are 95% likely to not graduate on time without extra support? If the goal is just to increase graduation rates, the first group is (probably) easier to intervene with and influence, while the second group may be more challenging because of the resources they need. Is the goal to maximize the average/mean/median probability of graduating for a class/school, or is the goal to focus on the kids most at risk and maximize the probability of graduation of the bottom 10% of the students? Or is the goal to create more equity and decrease the difference in the on-time graduation probability between the top quartile and the bottom quartile? All of these are reasonable goals, but the schools have to understand, evaluate, and decide which goals they care about. This conversation often makes them think harder about analytically defining what their organizational goals are, as well as the tradeoffs between them.
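One way to make these candidate goals comparable is to write each of them as a number computed from the same predicted probabilities. The sketch below is only an illustration: the function name, the metric names, and the way the bottom 10% and the quartiles are cut are our assumptions, not a prescribed definition.

```python
from statistics import mean


def goal_metrics(probs):
    """Summarize predicted on-time graduation probabilities, one per student.

    Returns three candidate objectives discussed in the text:
    the class average, the average for the bottom 10% of students,
    and the gap between the top and bottom quartiles.
    """
    ranked = sorted(probs)
    n = len(ranked)
    decile = max(1, n // 10)    # at least one student in the bottom group
    quartile = max(1, n // 4)
    return {
        "mean_probability": mean(ranked),
        "bottom_10pct_mean": mean(ranked[:decile]),
        "quartile_gap": mean(ranked[-quartile:]) - mean(ranked[:quartile]),
    }


# Hypothetical class of four students and two hypothetical interventions.
baseline = [0.2, 0.4, 0.6, 0.8]
help_strongest = [0.2, 0.4, 0.6, 0.95]   # raises the mean, widens the gap
help_weakest = [0.35, 0.4, 0.6, 0.8]     # same mean gain, narrows the gap

for name, probs in (("baseline", baseline),
                    ("help_strongest", help_strongest),
                    ("help_weakest", help_weakest)):
    print(name, goal_metrics(probs))
```

Both hypothetical interventions raise the class average by roughly the same amount, but only one of them helps the bottom group and reduces the quartile gap, so the choice of goal determines which one looks better.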

Example 3 – Inspections: We’ve worked on several projects that involved inspections – such as with the EPA and NY State Department of Environmental Conservation to help them prioritize which facilities to inspect to look for waste disposal violations, with the City of Cincinnati to help them target properties at risk of code violations in order to prevent blight, and with the World Bank Group to help them prioritize which fraud and collusion complaints to investigate. In most inspection/investigation problems, there are many more entities (homes, buildings, facilities, businesses, contracts) to inspect than there are resources available to conduct those inspections. The goal most of these organizations start with is to target their inspections at entities that are most likely to be in violation of existing regulations. That is a good start, but most of these organizations can never inspect all facilities/homes that may be noncompliant, so the goal they are really after is deterrence – reducing the total number of facilities that will be in violation. An ideal inspection process would then reduce the actual number of violations (found or not), which may not be the same as an inspection process that is aimed at being efficient and increasing the hit rate (% of inspections resulting in violations).

Example 4 – Scheduling Waste Pickup: We recently started working with Sanergy, a social enterprise based in Kenya. They deploy portable toilets across informal urban settlements and one of their largest costs is hiring people to empty the toilets. Today, every toilet gets emptied every day even though there is variance in how much they get used and how much they fill up. In order for them to grow and keep costs down, they need a more adaptive approach that can optimize the schedule for emptying toilets. The goal in this case is to make sure that you don’t over-empty the toilet when it’s not full but you don’t let it stay full either because then it’s not usable. This translates to a formulation that pushes for emptying the toilet as close to being 100% full as possible without getting to 100%.
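As a rough illustration of this formulation, the sketch below computes how many days a pickup can be deferred so that the projected fill level stays at or below a safety ceiling. The function name, the linear fill-rate assumption, and the 90% ceiling are all hypothetical; a real system would need a fill forecast with uncertainty.

```python
def days_until_pickup(current_fill_pct, daily_fill_pct, ceiling_pct=90):
    """Return how many whole days a pickup can wait.

    current_fill_pct: how full the toilet is now (0-100).
    daily_fill_pct: assumed constant fill added per day (must be > 0).
    ceiling_pct: safety ceiling below 100 so the toilet never becomes unusable.
    """
    if daily_fill_pct <= 0:
        raise ValueError("daily_fill_pct must be positive")
    if current_fill_pct >= ceiling_pct:
        return 0  # already at or above the ceiling: empty it now
    # Largest d such that current + d * daily <= ceiling.
    return int((ceiling_pct - current_fill_pct) // daily_fill_pct)


# A lightly used toilet can wait several days; a heavily used one cannot.
print(days_until_pickup(30, 20))  # lightly used
print(days_until_pickup(85, 10))  # nearly at the ceiling
```

Lowering the ceiling trades more pickups (higher cost) for a smaller chance of a toilet becoming unusable, which is exactly the kind of tradeoff the organization has to decide on.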

Considering tradeoffs while deciding on goals

As we start determining and often prioritizing goals, the conversation leads to tradeoffs. When dealing with students who may need extra support to graduate on time, what do you care more about: finding every single student who may need that help (at the expense of targeting students who may not need the support and possibly being inefficient), or prioritizing efficiency and only focusing on students where you’re extremely sure they’ll need the extra support (and thus missing many students)? Would you rather inspect more homes without finding lead hazards in them (inefficient) or would you rather miss homes with children who will end up getting lead poisoning? When dispatching and placing emergency response vehicles, do you want to make sure you can get to every possible emergency within 10 minutes or do you want to make sure that you can get to critical emergencies within 3 minutes and the non-critical within 20 minutes? What mistakes are you willing to make? That is a critical question a good scoping process brings up and answers based on the priorities of the organization.

In data science terms, would you rather have more false positives or more false negatives? Of course, this decision depends on the impact and cost of those errors, which is often hard (and sometimes uncomfortable) to quantify. There may not be an objectively correct answer, but policymakers need to decide which policy goals they want to optimize, what resources they have, and which outcomes they want to prioritize. The data science work is then used to support and implement those policy goals. Data science can help explore the impact of those goals and understand the implications better, but which goals to optimize is ultimately a policy decision.
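To show how the cost of each kind of error changes the decision, here is a minimal sketch that picks a risk-score threshold by total error cost on historical data. The function names, the scores, the labels, and the costs are all made up for illustration; in practice the costs are the policy judgment described above, not something the analysis can supply.

```python
def total_cost(scores, labels, threshold, cost_fp, cost_fn):
    """Cost of flagging everyone with score >= threshold.

    labels: 1 if the person truly needed the intervention, 0 otherwise.
    cost_fp: cost of intervening when it was not needed (false positive).
    cost_fn: cost of missing someone who needed it (false negative).
    """
    pairs = list(zip(scores, labels))
    false_pos = sum(1 for s, y in pairs if s >= threshold and y == 0)
    false_neg = sum(1 for s, y in pairs if s < threshold and y == 1)
    return false_pos * cost_fp + false_neg * cost_fn


def best_threshold(scores, labels, cost_fp, cost_fn, candidates):
    """Return the candidate threshold with the lowest total cost."""
    return min(candidates,
               key=lambda t: total_cost(scores, labels, t, cost_fp, cost_fn))


# Hypothetical historical risk scores and true outcomes.
scores = [0.9, 0.7, 0.6, 0.4, 0.2]
labels = [1, 0, 1, 1, 0]
candidates = [0.3, 0.5, 0.8]

# Missing a child is costly: prefer a low threshold (more false positives).
print(best_threshold(scores, labels, cost_fp=1, cost_fn=5, candidates=candidates))
# Wasted interventions are costly: prefer a high threshold (more false negatives).
print(best_threshold(scores, labels, cost_fp=5, cost_fn=1, candidates=candidates))
```

The model and the data are identical in both calls; only the stated costs differ, and that alone moves the recommended threshold.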

Step 2: What Actions/Interventions are you Informing?

The work we do can typically only have impact if it’s actionable. What actions can the organization take to achieve these goals? These actions often need to be fairly concrete: home inspections, enrolling a student in one of three after school programs, targeted emails for fundraising or advocacy, dispatching an emergency vehicle, or scheduling a waste pickup. A well-scoped project ideally has a set of actions that the organization is already taking and that can now be better informed using data science. If the action/intervention a public health department is taking is lead hazard inspections, the data science work can help inform which homes to inspect. You don’t have to limit this to making existing actions better. Often, we end up creating a new set of actions as well. Generally, it’s a good strategy to first focus on informing existing actions instead of starting with completely new actions that the organization isn’t familiar with implementing. Enumerating the set of actions allows the project to be actionable. If the analysis that will be done later does not inform an action, then it usually (but not always) does not help the organization achieve its goals and should not be a priority.

Let’s go back to our example of elections. What actions does a campaign have in order to get votes? Typically, a campaign has three high level actions:

Register new voters

Persuade existing voters to support their candidate

Get out the Vote

All three of these actions need one key input: Who should these actions target?

Who should we register to vote?

Who should we persuade?

Who should we try to get to vote?

Breaking Down Actions

Actions have a granularity, frequency, time horizon, channel, etc. For example, we would want to determine what channel(s) (Door knock, Phone call, Email, Twitter, Facebook, Snapchat, TV Ads) to use to target an individual. How often should they be targeted? Who should target them? We’ll leave that discussion to a future write-up. You will also often want to come up with new actions and interventions. We will come back to that in a future write-up as well.

Let’s look at some additional examples of actions:

Often, an organization has one high level action (lead inspection or home inspection or after school programs). In the scoping process, we can proceed in two ways:

We can just keep the scope to informing that one action:
a. Which homes to inspect for hazards?
b. Which students should be enrolled in the after-school program?

We can also break the high level action down into smaller components: There may be multiple after school programs and each of them can be considered an action. For example, there may be 3 types of programs:
a. An online program that can be provided to 90% of the students
b. A short program that can be provided to 50% of the students
c. An intense program that can only be provided to 10% of the students.
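Once the actions are broken down this way, the analysis has to inform which student gets which program. The sketch below shows one possible reading, which is our assumption rather than something the programs dictate: students are ranked by predicted risk and each gets the most intensive program whose capacity reaches their rank. The function name and tier labels are hypothetical.

```python
def assign_programs(risk_by_student,
                    tiers=(("intense", 10), ("short", 50), ("online", 90))):
    """Assign each student the most intensive program they rank into.

    risk_by_student: dict of student id -> predicted risk of not graduating.
    tiers: (program name, percent of students it can serve), most intensive
           first. Students beyond the last tier's capacity get "none".
    """
    ranked = sorted(risk_by_student, key=risk_by_student.get, reverse=True)
    n = len(ranked)
    assignment = {}
    for position, student in enumerate(ranked):
        assignment[student] = "none"
        for name, percent in tiers:
            if position < n * percent // 100:  # within this program's capacity
                assignment[student] = name
                break
    return assignment


# Ten hypothetical students with risks 0.0, 0.1, ..., 0.9.
risks = {f"s{i}": i / 10 for i in range(10)}
print(assign_programs(risks))
```

Other allocation rules are equally defensible (for example, reserving the intense program for students it is most likely to help rather than those at highest risk); the point is that each program is a separate action the analysis must inform.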

Step 3: What Data do you have and What Data do you need?

You’ll notice that so far in the scoping process we haven’t talked about data at all. This is intentional since we want these projects to be problem-centric and not data-centric. Yes, data is important and we all love data but starting with the data often leads to analysis that may not be actionable or relevant to the goals we want to achieve. Once we’ve determined the goals and actions, the next step is to find out what data sources exist inside (and outside) the organization that will be relevant to this problem and what data sources we need to solve this problem effectively. For each data source, it’s good practice to find out how it’s stored, how often it’s collected, what’s its level of granularity, how far back does it go, is there a collection bias, how often does new data come in, and does it overwrite old fields or does it add new rows?

You first want to make a list of data sources that are available inside the organization. This is an iterative process as well, since most organizations don’t necessarily have a comprehensive list of the data sources they have. Sometimes (if you’re lucky), data may be in a central, integrated data warehouse, but even then you may find individuals and/or departments who have additional data or different versions of the data warehouse.

Matching the Data to the Actions

This step also helps you figure out if your data matches the actions you need to inform. If the actions are individual level, then you most likely need data at an individual level. If the actions need to be decided on once a day, then you need your data to be updated every day. It’s important to match the granularity, frequency, and time horizon of the actions to the granularity, frequency, and time horizon of the data you have.
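A simple way to run this check during scoping is to write down each action and each data source with the same few attributes and compare them. The sketch below is illustrative only; the class names, field names, and example values are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Action:
    name: str
    unit: str                  # level the decision is made at, e.g. "toilet"
    decided_every_days: int    # how often the decision is made


@dataclass
class DataSource:
    name: str
    unit: str                  # level the data is recorded at
    refreshed_every_days: int  # how often new data arrives


def mismatches(action, source):
    """List the ways a data source fails to match an action."""
    problems = []
    if source.unit != action.unit:
        problems.append(f"granularity: data is per {source.unit}, "
                        f"action is per {action.unit}")
    if source.refreshed_every_days > action.decided_every_days:
        problems.append(f"frequency: data arrives every "
                        f"{source.refreshed_every_days} days, action is "
                        f"decided every {action.decided_every_days} days")
    return problems


pickup = Action("schedule waste pickup", unit="toilet", decided_every_days=1)
weekly_log = DataSource("weekly collection log", unit="toilet",
                        refreshed_every_days=7)
print(mismatches(pickup, weekly_log))
```

A non-empty result does not necessarily rule a data source out, but it flags a gap that either the data collection or the scope of the action has to close.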

External and/or Public Data

Once you’ve determined what data you need and what data exists inside the organization, you then want to figure out what external and/or public data you can get that fills the gaps. Each domain often has commonly used data sources that you want to know about. American Community Survey is a good example of a data source you want to use in most projects you do that involve some spatial component in the US. Open data portals (at federal, state, and local levels) also have data that can be used to augment your internal data. 311 call data, 911 call data, and fire data are examples of some commonly found local data sources. You also want to take a look at commercial data sources you can buy to augment your internal data. Examples include buying data from organizations such as Acxiom and Experian around purchase behavior or media buying habits from Nielsen.

Step 4: What Analysis Needs to be Done?

The final step in the scoping process is to now determine the analysis that needs to be done to inform the actions using the data we have to achieve our goals.

The analysis can use methods and tools from different areas: computer science, machine learning, data science, statistics, and social sciences. One way to think about the analysis that can be done is to break it down into 4 types:

Description: primarily focused on understanding events and behaviors that have happened in the past. Methods used to do description are sometimes called unsupervised learning methods and include methods for clustering.

Detection: Less focused on the past and more focused on ongoing events. Detection tasks often involve detecting events and anomalies that are currently happening.

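Prediction: Focused on the future and predicting future behaviors and events.

Behavior Change: Focused on changing the behavior of people or organizations. This type of analysis typically requires experimental work.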

There are of course many more types of analysis but we’ll keep the focus on these four in this write-up.

The questions to answer in this step are:

What analysis needs to be done? Is this a descriptive analysis, a predictive model, or a detection or behavior change task? Often, the analysis involves several of the types of analysis we described above.

How will the analysis be validated? What validation can be done using existing, historical data? What field trial needs to be designed to validate this in the field before it can be deployed?
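For validation on historical data, one common pattern (offered here as a sketch, not as the only option) is to build the model on earlier records, score later records, and check how many of the top-ranked cases actually had the outcome. The function names, record fields, and numbers below are hypothetical.

```python
def temporal_split(records, cutoff_year):
    """Split records so that evaluation only uses data after the cutoff."""
    train = [r for r in records if r["year"] < cutoff_year]
    test = [r for r in records if r["year"] >= cutoff_year]
    return train, test


def precision_at_k(scored, k):
    """Share of the k highest-scored cases where the outcome occurred.

    scored: list of (score, outcome) pairs from the held-out period,
            with outcome 1 if the event (e.g. a violation) happened.
    k: how many cases the organization can act on; must be at least 1.
    """
    top = sorted(scored, key=lambda pair: pair[0], reverse=True)[:k]
    return sum(outcome for _, outcome in top) / len(top)


# Hypothetical inspection history.
records = [{"year": 2013, "id": 1}, {"year": 2014, "id": 2},
           {"year": 2015, "id": 3}, {"year": 2015, "id": 4}]
train, test = temporal_split(records, cutoff_year=2015)

# Hypothetical held-out scores and outcomes; the agency can inspect 2 sites.
held_out = [(0.9, 1), (0.8, 0), (0.7, 1), (0.2, 0)]
print(precision_at_k(held_out, k=2))
```

Setting k to the number of inspections or interventions the organization can actually carry out ties the validation metric back to the action. A field trial is still needed afterwards, since historical data only reflects the cases the organization chose to act on in the past.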

Let’s return to the election example and the three actions we listed earlier. For action 1 (Register Voters), we need to answer a key question: Who should we register?

For action 2 (Persuade Voters), the question is very similar: Who should we persuade to support our candidate?

For action 3 (Get out the Vote), it’s again the same question: Who should we target for Get out the Vote?

As mentioned earlier, there are more complicated variations of the question that we can get to. For example, not just focusing on who should be persuaded but also who should persuade them? What channel(s) should we use to persuade them (in person? Phone? Online? TV?)? When should they be persuaded? In the beginning of the campaign? Closer to the elections? How often should they be persuaded?

Let’s pick one of those actions to do the analysis for: Who should we target for “get out the vote” efforts? Intuitively, we want to target people who already support our candidate but are not likely to vote. That requires us to know two things about voters: How likely are they to support our candidate? And how likely are they to vote? Formally, we want to take every voter in the country and predict their:

P(Support): Probability that they will support our candidate

P(Turnout): Probability that they will vote

Once we’ve determined these two predictions, we can then use them to determine the voters who are likely supporters but not likely to vote and target our efforts at persuading them to vote. The second analysis that needs to be done is determining how to persuade them. This falls under the behavior change category and requires us to do some experimental work. By combining predictive analysis and behavior change analysis, we can then have a list of individuals who need to be targeted and an approach to persuading them.
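As a minimal sketch of how the two predictions could be combined, the code below ranks voters by P(Support) × (1 − P(Turnout)), so that likely supporters who are unlikely to vote come first. The product is our assumption about how to combine them (it is one simple choice among several), and the function name, field names, and probabilities are made up.

```python
def gotv_targets(voters, top_n):
    """Return ids of the top_n voters to contact for get-out-the-vote.

    voters: list of dicts with "id", "p_support", and "p_turnout".
    Score = p_support * (1 - p_turnout): high for likely supporters
    who are unlikely to vote without a nudge.
    """
    ranked = sorted(voters,
                    key=lambda v: v["p_support"] * (1 - v["p_turnout"]),
                    reverse=True)
    return [v["id"] for v in ranked[:top_n]]


# Hypothetical voters.
voters = [
    {"id": "A", "p_support": 0.9, "p_turnout": 0.9},  # supporter, votes anyway
    {"id": "B", "p_support": 0.9, "p_turnout": 0.2},  # supporter, unlikely to vote
    {"id": "C", "p_support": 0.1, "p_turnout": 0.2},  # not a supporter
    {"id": "D", "p_support": 0.6, "p_turnout": 0.5},  # lean supporter, coin flip
]
print(gotv_targets(voters, top_n=2))
```

This score ignores how responsive each voter is to being contacted, which is what the behavior change analysis is meant to estimate.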

In addition to doing the analysis, a good scope should also define a process for validating each of the analyses, using both historical data and a field trial.

Summary

Hopefully this gives government agencies and non-profits an overview of how to scope data science/analytics projects and what questions they need to answer before launching the project.

Step 1: Goals – Define the goal(s) of the project

Step 2: Actions – What actions/interventions do you have that this project will inform?

Step 3: Data – What data do you have access to internally? What data do you need? What can you augment from external and/or public sources?

Step 4: Analysis – What analysis needs to be done? Does it involve description, detection, prediction, or behavior change? How will the analysis be validated?