Innovate, test, then scale. The sequence seems obvious—but is in fact a radical departure. Too often the policymaking process looks more like “have a hunch, find an anecdote, then claim success.”

Over the last decade and more, though, many individuals and organizations have invested in this new model of development. We’ve learned a lot, including how to improve learning, reduce extreme poverty, combat violence, and advance women’s empowerment.

But what have we learned about the process of getting from innovation to scale? What kinds of innovations make it to scale? What types of partnerships are needed? What can donors do to catalyze policy impact at scale?

Scaling after testing in multi-site evaluations: The “graduation” model

The “graduation” program is an example of what I consider the classic model for the “innovate, test, scale” approach. The Bangladeshi NGO BRAC realized that some people are too poor to benefit from microfinance. They designed an innovative program that provided very poor people with assets, a small stream of income, and intensive support—designed to help families “graduate” to a higher-income status. This program was tested with a randomized evaluation, which showed that it was spectacularly effective (Bandiera et al).

Researchers then coordinated to test the impacts of the program in six other countries. The findings from randomized evaluations, taken together, demonstrated highly positive cost-benefit ratios: In other words, the program was found to be not only broadly effective, but also cost-effective (Banerjee et al 2015, summarized here).

With these positive findings from rigorous research, BRAC and other organizations implementing the graduation approach were then able to use this evidence of effectiveness to raise more funding. Many were able to scale up their programs, including with support from USAID DIV and USAID’s Office of Microenterprise Development.

The “classic” model for scaling?

So, is this the recipe for moving from innovation to testing to scale: Evaluate a program, work with an organization that can scale, replicate in multiple contexts, and scale to improve millions of lives?

That model can work. It was appropriate for the graduation program, which is expensive and thus required a higher burden of evidence of effectiveness before scaling. The graduation program is also very complex, implying that results may well vary by implementer, so it is worth testing with several implementers and scaling with those who saw success.

But we shouldn’t conclude that this is the only model for getting to scale. In the rest of this post I will discuss examples of the “innovate, test, scale” approach in which there was no need for replication in multiple contexts; others where the testing was (appropriately) done with a different type of organization than that which scaled it; and finally, examples of evidence impacting millions of lives when it was not a program that was tested, but a theory.

Scaling without testing in multiple contexts: The Generasi case

When does it make sense to scale without testing in multiple contexts?

In Indonesia, the government partnered with researchers and the World Bank on a program called Generasi that gave grants to communities to help them improve health and education outcomes. Communities could decide how to spend the money, but in some communities, the size of future payments was directly linked to measurable improvements in education and health outcomes.

A consortium of funders, including the Government of Indonesia and the Government of Australia, funded testing of the program, which was randomized over 264 sub-districts in five provinces in Indonesia, reaching 1.8 million beneficiaries.

Even without additional testing, the results, which showed success, were already representative of a large part of Indonesia. While it might be interesting to look at this approach in other countries, for the Indonesian government, this constituted sufficient evidence to take it to scale. The program is now improving the lives of 6.7 million women and children, with scale-up led by the Indonesian government and supported by the U.S. Millennium Challenge Corporation.

Scaling with new implementing partners: The deworming case

Generasi shows the benefit of testing a program with a government that can take it to massive scale. But that is not the only route to scale.

In the case of school-based deworming, a randomized evaluation tested the impact of a relatively small NGO program, but the program was successfully scaled up with multiple implementing governments. In 2016 in India alone, the government dewormed 179 million children, with technical assistance from Evidence Action supported by funding from USAID.

Of course, it’s possible that the impact of a program may be very different if the implementer changes. This is a serious concern with complicated programs that are highly dependent on personal interactions between staff and beneficiaries. But here, we are talking about a pretty simple intervention. The deworming pill used in these programs is unlikely to have different impacts if it is administered by, say, a government instead of an NGO. What is important to test is whether the pill is actually reaching people, something that Evidence Action takes very seriously and tracks carefully.

Cost-effectiveness

There was another reason deworming reached such large scale: Deworming children is by far the most cost-effective way to increase schooling of any program rigorously evaluated. Even with very conservative cost assumptions, US$100 spent on deworming was estimated to lead to twelve additional years of schooling. Even if worms were half as common in the next context, and costs were much higher per child, it would still be extremely cost-effective. (Costs have actually been much lower at scale.)
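The robustness claim above can be sketched as a back-of-envelope calculation. The 12-years-per-US$100 figure is from the estimate cited in the text; the adjustment factors below are illustrative assumptions, not numbers from any study:

```python
# Back-of-envelope robustness check for deworming cost-effectiveness.
# Baseline: US$100 of deworming buys ~12 additional years of schooling
# (the conservative estimate cited above).
baseline_years_per_100usd = 12.0

def adjusted_cost_effectiveness(prevalence_factor, cost_factor):
    """Scale the benefit by relative worm prevalence and inflate costs.

    prevalence_factor: worm prevalence in the new context relative to
        the original study (e.g., 0.5 = half as common).
    cost_factor: per-child cost relative to the original (e.g., 2.0 =
        twice as expensive).
    """
    return baseline_years_per_100usd * prevalence_factor / cost_factor

# Pessimistic scenario: half the worm prevalence, double the cost.
pessimistic = adjusted_cost_effectiveness(0.5, 2.0)
print(pessimistic)  # 3.0 extra years of schooling per US$100
```

Even under these deliberately unfavorable assumptions, three additional years of schooling per US$100 would still compare very favorably with other evaluated schooling interventions.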

The lesson from deworming is that low-cost, easy-to-implement programs, if they are effective, are easier to scale up across organizations and contexts and thus have the potential to reach massive scale. The other lesson is that support for technical assistance from knowledgeable partners like Evidence Action to help governments scale is a high-leverage investment for donors.

Scaling after general lessons emerge: The “teaching at the right level” case

The examples I’ve noted so far have been cases of scaling up a program that was rigorously evaluated. But I think the most profound way in which evidence has changed lives is not through the scale-up of individual programs, but through evidence changing how we understand and address problems.

Let me give you an example.

One of the first of a new wave of RCTs in development looked at the impact of providing textbooks in Kenyan schools. The authors found that textbooks only helped the top of the class, because most of the class was so far behind the curriculum they could not read the textbooks. So next, academics tested a program called “teaching at the right level” that sought to help those children who had fallen behind, implemented by an NGO called Pratham in India. The results were spectacular: A series of different studies using different instructors and implemented in very different states across India found that the program significantly improved learning outcomes (Banerjee et al 2017).

Back in Kenya, academics worked with an NGO to test an alternative way to “teach at the right level,” sorting grade 1 children into two classes based on their level of English skills and adjusting the curriculum to adapt to each class’s basic skills. The evaluation found that students in both classes learned more (Duflo, Dupas, and Kremer 2011).

Here, although the Kenyan program was quite different from the Indian program, the theory behind the programs’ effectiveness—adapting curriculum to teach students at their current skill level—was consistent across contexts.

Similarly, a review of education technology shows that the most consistent gains come not from simply putting computers in classrooms, but from improving access to personalized learning software on computers or tablets that seeks to tailor learning to the skills of the child—in other words, teaching at the right level (Escueta et al 2017). Some of these computer programs were inspired in part by lessons from the non-computer-based teaching at the right level studies.

In meetings between J-PAL and the Ministry of Education in Zambia, ministry officials noted that children were going to school regularly but falling behind the curricula, and that there was a wide range of learning levels in the same class. Given the wide evidence base from different programs that the “teaching at the right level” approach was likely to improve student learning, the question for Zambia was which type of “teaching at the right level” program made sense in their context. To find out, the government piloted a number of different programs.

We conducted a process evaluation to test whether children turned up and whether teachers taught at the right level for children, because these were the crucial questions in this context. The program found to be most appropriate is being scaled up to 1,800 schools in Zambia over the next three years, with potential for further scale based on the results of an ongoing randomized evaluation.

When testing a theory helps change lives: Pricing of preventative healthcare products

My final example doesn’t even involve testing a program, yet it is probably the example in which evidence has had a direct impact on the most lives.

In 2000 there was an intense argument about whether malarial insecticide-treated bednets (ITNs) should be given out for free. Some argued that charging for bednets would massively reduce take-up by the poor. Others argued that if people don’t pay for something, they don’t value it and are less likely to use it. It was an evidence-free argument at the time.

Then, a series of studies in many countries testing many different preventative health products showed that even a small increase in price led to a sharp decline in product take-up. Pricing did not help target the product to those who needed it most, and people were not more likely to use a product if they paid for it. This cleared the way for a massive increase in free bednet distribution (Dupas 2011 and Kremer and Glennerster 2011).

There was a dramatic increase in malaria bednet coverage between 2000 and 2015 in sub-Saharan Africa. At the same time, there was a massive fall in the number of malarial cases. In Nature, Bhatt and colleagues estimate that the vast majority of the decline in malarial cases is due to the increase in ITNs. They estimate there were 450 million fewer cases of malaria and four million fewer deaths due to ITNs. The lesson here is that testing an important policy-relevant idea can have as much impact on people’s lives as testing a specific program.

What next?

As we think about investing in high-quality evidence for development and using the results of those investments, we need to avoid relying upon a single simplistic model for how evidence is used to improve lives.

Sometimes we do research to test a specific program in the hope that the program can be scaled up. But much of what we learn from RCTs and other rigorous methodologies is about general behavior—for example, people are highly sensitive to prices when it comes to preventative health, and the incentives in school systems often mean teachers teach to the top of the class while children find it hard to learn when the curriculum is above their level of learning.

For funders, three concluding lessons from observing and participating in using evidence to help improve millions of lives:

1. Stay open, not prescriptive, because it is hard to predict innovation. The varied models I describe above are indicative of the many different ways to innovate and improve lives at scale—and we are still learning.

2. Evidence is a global public good and its learnings go well beyond the specific program evaluated. Testing the theories behind programs has the potential to generate general and thus highly policy-relevant lessons applicable across contexts, and should be encouraged.

3. Incorporating theory-based lessons into policymaking requires a deep understanding of the evidence. Supporting technical assistance to help governments incorporate evidence into their policies is a high-payoff approach to catalyzing impact.

One of the first rules of thumb you learn about developing survey questions is that they should be specific and time-bound. In other words, it’s better if a question is about a specific event or behavior rather than a vague idea so respondents are less likely to interpret it in different ways, and it should include a clear timeframe so that their responses are comparable.

Yet some of the most common survey questions for measuring women’s participation in household decision-making are not specific or time-bound. The questions, often adapted from USAID’s Demographic and Health Survey (DHS), go like this:

“Who usually makes decisions about [healthcare for yourself]/ [major household purchases]/ [visits to your family or relatives]: you, your husband/partner, you and your husband jointly, or someone else?”

These questions are an important part of the DHS women’s empowerment modules and are widely used by researchers and practitioners outside the DHS. At a recent IPA and J-PAL roundtable on measuring women’s empowerment, more than half the researchers present had used these kinds of questions in impact evaluations before.

Several, however, had concerns. In practice, these questions can be hard to answer accurately because they are vague and require people to make a quick guess about general trends in decision-making at home. As one researcher put it, “They don’t pass the ‘Can I answer my own survey question?’ test.”

A simple alternative could be to ask people about how they would make a decision in a concrete scenario that's relevant in their context.[i] Instead of asking, “Who usually makes decisions about your healthcare,” we could ask, “If your child is sick and needs immediate healthcare, but your husband is not home, what would you do?” Or, “If you ever need medicine for yourself (for a headache, for example), could you go buy it yourself?”

In an evaluation one of us (Rachel) is conducting on girls’ empowerment in Bangladesh, our team asked both the standardized question and the more specific questions above. We got very different answers.

In response to the standard question, 16 percent of women said they usually make decisions about their healthcare alone or jointly with their husbands. Given this response, we would call this group more empowered—yet nearly a quarter of this group also said they could not take a sick child to the doctor until their husbands came home.

We also found discrepancies in the other direction: over half of the women who appeared disempowered according to the standard question said that they could take a sick child to the doctor on their own, and even more telling, could buy medicine for themselves.
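The pattern of discrepancies can be made concrete with a hypothetical cross-tabulation. The counts below are invented purely for illustration (they mirror the qualitative pattern described above, not the actual Bangladesh data):

```python
# Hypothetical cross-tab comparing two empowerment classifications.
# Key: (standard question says "empowered", can seek care for a sick
#       child on her own) -> number of respondents. Counts are invented.
counts = {
    (True, True): 120,
    (True, False): 40,   # "empowered" by the standard question, yet blocked
    (False, True): 440,  # "disempowered" by the standard question, yet able
    (False, False): 400,
}

empowered_total = counts[(True, True)] + counts[(True, False)]
blocked_share = counts[(True, False)] / empowered_total

disempowered_total = counts[(False, True)] + counts[(False, False)]
can_act_share = counts[(False, True)] / disempowered_total

print(f"{blocked_share:.0%} of 'empowered' women could not act alone")
print(f"{can_act_share:.0%} of 'disempowered' women could act alone")
```

The point of the exercise is that the off-diagonal cells—respondents the two questions classify differently—are large in both directions, which is exactly what we would not expect if the two questions measured the same underlying construct.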

These data should make us concerned that the standard questions are not picking up the characteristics we think they are. However, one test is not enough to jettison the DHS-style questions, which have other benefits.

First, there is value in asking the same questions in multiple countries over many years. For one, it allows us to benchmark a study against the broader literature, and to do meta-analyses of studies using a common indicator. Standard questions are also easier and more convenient to add to surveys than developing new ones.

The hope is that a more general question can fit many contexts, whereas specific questions may be more context-dependent. “Who decides whether and what type of health insurance to purchase for the family?” might be relevant in the United States, but not many other countries. “If you had a headache, could you purchase medication?” might provide a useful diversity of responses in Bangladesh, but not in the US, where most women can purchase cheap over-the-counter drugs.

So when we ask a general question like “Who usually makes decisions about your healthcare?”, respondents will arguably adjust it to be about whatever the relevant health decisions are in their context. The downside is that we usually don’t know exactly what kind of decision the woman is thinking about when she answers, and different women are likely thinking about different decisions. If we don’t know which decisions she’s thinking about, or whether they are important to her, it is hard to judge whether any change we see in this general indicator is meaningful.

However, there have been cases when general questions led to more accurate responses than specific ones. For instance, de Mel, McKenzie, and Woodruff found that simply asking small-scale entrepreneurs what their profits were was more accurate than asking them to report detailed revenues and expenses. Women and men may similarly have a good-enough sense of decision-making at home so that even if there is measurement error, the standard decision-making questions may still pick up something that’s correlated with the underlying truth.

One indication that this could be the case comes from Markus Goldstein, head of the Africa Gender Innovation Lab at the World Bank, who shared an analysis comparing women’s and men’s responses to the DHS decision-making questions at our recent roundtable. It is now available in a working paper by him and co-authors Donald, Koolwal, Annan, and Falb. They find that women who reported having greater sole or joint decision-making power were also more likely to own land, work outside the home, earn more than their husbands, and not condone domestic violence—outcomes we typically think of as signs of empowerment.[ii]

Yet even if responses to the standard household decision-making questions can be correlated with empowerment outcomes, it may not make sense to use them in impact evaluations without carefully working through whether they’re relevant to the program being tested or the context.

Several researchers at the recent IPA and J-PAL roundtable observed that they have rarely seen significant changes in household decision-making indicators in their own or others’ impact evaluations. It could be that these changes take longer than most evaluations. Another possibility is that the program wasn’t likely or designed to change these decisions in the first place. When this is the case, it is probably better to use other questions more specific to the program.

Beyond the program, it’s also important to check that our survey questions are relevant to the context. Gender roles and dynamics can vary widely even within small geographic areas and change over time. Before starting an evaluation of an empowerment program, we typically conduct formative research in the field to collect qualitative and quantitative data about where women lack the ability to make strategic life choices that they want to make. Based on these data, we identify locally relevant indicators of empowerment and develop new survey questions to pick them up.

It can be valuable to use standardized questions in impact evaluations if they’re relevant to the program and context, but we think it is equally, if not more important to include context-specific questions about what the women in our study communities can and want to change in their lives.

More broadly, a fruitful area for future measurement research is to conduct more validation exercises comparing different methods for asking about tricky concepts like agency and decision-making (see a useful recent example from IFPRI that makes the case for calibrating questions to specific contexts). More validation exercises could help us identify whether there are improvements or additions to current standard questions that are worth making. For example, can we develop more specific questions that are relevant in many contexts—such as, “If your child is sick and needs immediate health care, but your husband is not home, what would you do: seek immediate care, ask for permission from someone, wait for your husband….”?

There will likely never be an effective one-size-fits-all set of survey questions to measure women’s decision-making power or agency, but we are optimistic about the potential to improve on current practice. We’re always looking for more research on this, so if you’re aware of useful validation exercises that have already been completed or are currently in the works, please send them to Claire Walsh and we’ll update this post with relevant links.

[ii]Most definitions of empowerment emphasize agency and gaining the ability to make strategic life choices. Many draw on Sen’s concept of an agent as “someone who acts and brings about change, and whose achievements can be judged in terms of her own values and objectives,” (1999), and/or Kabeer’s definition of empowerment as “the process by which those who have been denied the ability to make strategic life choices acquire such an ability” (1999). Sources: Sen, Amartya. 1999. Development as Freedom. New York: Alfred A. Knopf. Kabeer, Naila. "Resources, Agency, Achievements: Reflections on the Measurement of Women's Empowerment." Development and Change 30, no. 3 (1999): 435-464.

The majority of out-of-school children are girls, and much of the rhetoric about improving access to education focuses on girls. Yet many of the policies designed to improve primary school access (particularly those evaluated with randomized evaluations) do not specifically target girls. In J-PAL’s recent review of education RCTs, Roll Call, we therefore ask: Which gender benefits most from these school access policies?

We were surprised to find that recent systematic reviews of education evaluations reported very few results by gender, even though many of the randomized evaluations in a recent J-PAL review on student participation included results disaggregated by gender.

The results of the simple exercise we conducted were clear and compelling: In all but two cases, school attendance improved for girls as much as—if not more than—for boys (the difference was statistically significant in 10 out of 25 cases). The two exceptions where boys’ attendance improved more than girls’ were cases in which boys had lower attendance than girls to start with. In other words, policies aimed at improving school attendance in general appear to benefit the disadvantaged gender (usually but not always girls) most.

The J-PAL education team worked hard to put the results of multiple studies into as consistent a metric as possible when data was available, thus allowing comparisons across studies.

Our measure captures both the number of children enrolled in school and how much children attend school once they are enrolled. For example, if 50 percent of children are enrolled, but those who are enrolled show up 100 percent of the time, the overall attendance rate for the community is 50 percent. If 100 percent of children are enrolled but they show up only 50 percent of the time, the attendance rate is also 50 percent.
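The combined metric is simply the product of the two rates. A minimal sketch (the function name is mine, not from the bulletin):

```python
def aggregate_attendance_rate(enrollment_rate, attendance_if_enrolled):
    """Share of children in school on a given day.

    enrollment_rate: fraction of children enrolled in school.
    attendance_if_enrolled: fraction of school days an enrolled
        child actually attends.
    """
    return enrollment_rate * attendance_if_enrolled

# The two examples from the text yield the same overall rate:
print(aggregate_attendance_rate(0.50, 1.00))  # 0.5
print(aggregate_attendance_rate(1.00, 0.50))  # 0.5
```

This is why the metric treats a marginal gain in enrollment and a marginal gain in attendance symmetrically—a design choice with the trade-offs discussed below.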

In the chart below, reproduced from pages 22-23 of Roll Call, we show the school participation rates for boys in the control group (teal) and the increase in participation resulting from the policy change (yellow). We then show the initial participation rate and the change for girls. In most cases the gender gap in enrollment after the intervention (teal plus yellow) is smaller than it was before the intervention.

There are, naturally, caveats to our conclusion that policies aimed at both genders tend to benefit girls as much as, or more than, boys. We are not saying that there is never a case for gender-targeted interventions: Clearly such interventions are sometimes needed when there are very specific barriers faced by one gender or another. Indeed, a few of the policies summarized in the bulletin were designed with gender disparities in mind from the start (for example, village-based schools in Afghanistan).

Our attempt to put all results in one metric has disadvantages: We weigh enrollment and attendance equally, while it may be that increasing enrollment by 5 percent—getting more children into at least some school—is more beneficial to society than increasing the percentage of days enrolled children attend school by 5 percent. In some cases, we did not have both enrollment and attendance data (both are necessary to calculate our aggregate attendance rate) and we had to make assumptions.

But we feel these disadvantages are outweighed by the benefits of being able to view results through one metric. In particular, it can be extremely difficult to draw conclusions from a review which cites results on entirely different bases: It’s hard for our brains to process a comparison between, say, a 17 percent reduction of dropouts on a base of 32 percent with a 14 percent increase in attendance on a base of 55 percent. Creating a metric consistent across multiple studies makes it much easier to compare results and draw policy insights.
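One illustrative way to harmonize such results—assuming the reported figures are relative changes, which papers do not always make explicit—is to convert each into a percentage-point change on its own base:

```python
# Converting differently-reported results into percentage-point changes.
# Assumes the cited figures (17%, 14%) are relative changes; the two
# example numbers come from the text above.

def relative_to_pp(base_rate, relative_change):
    """Percentage-point change implied by a relative change on a base rate."""
    return base_rate * relative_change

# A 17% relative reduction in dropouts on a base of 32%:
dropout_pp = relative_to_pp(0.32, 0.17)      # fewer dropouts, in pp
# A 14% relative increase in attendance on a base of 55%:
attendance_pp = relative_to_pp(0.55, 0.14)   # more attendance, in pp

print(round(dropout_pp * 100, 1))     # 5.4
print(round(attendance_pp * 100, 1))  # 7.7
```

Once both effects are in the same units, the comparison the brain struggles with in prose becomes a one-line subtraction—which is essentially what the Roll Call exercise does at scale.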

Understanding the differential impacts of education policies on girls and boys is critical for policymakers seeking to design programs that reach the most children. For more lessons from our analysis of 58 evaluations, read the full bulletin: Roll Call: Getting Children Into School.

There has been a huge increase in the number of impact evaluations of different approaches to reducing poverty. Despite this, if you are a policymaker it is unlikely that a rigorous impact evaluation answers precisely the question you are facing in precisely the location you are facing it. How do you draw on the available evidence, both from your local context and from the global base of impact evaluations in other locations, to make the most informed decision?

In an article just published in Stanford Social Innovation Review, Mary Ann Bates and I set out a practical generalizability framework that policymakers can use to decide whether a particular approach makes sense in their context. The key to the framework is that it breaks down the question “Will this program work here?” into a series of questions that come out of the theory behind a program. Different types of evidence can then be used to assess the different steps in the generalizability framework.

Here is a generalizability framework for providing small incentives to nudge parents to immunize their children. The first steps require a local diagnosis of the problem and need to be answered using local descriptive data as well as qualitative interviews and local institutional knowledge. The next steps are about general lessons of human behavior where studies from other contexts can be very valuable. The final steps are about local implementation where local process monitoring evidence is key.

In the article we discuss our experience working alongside policymakers around the world to apply this framework to solve practical policy problems. We also show how this approach enables policymakers to draw on a much wider range of evidence than they might otherwise use: for example, while there are only two published RCTs on the immunization program above, there is a wealth of rigorous impact evaluation supporting the general behavioral lesson behind the program. With this article we seek to move the debate about the generalizability of impact evaluations from its rather confused and unhelpful present to a more practical future.

Today MIT announced the launch of a new master’s in Data, Economics, and Development Policy (DEDP) offered by MIT’s Economics Department and J-PAL. The master’s will be a hybrid of online and in-person courses. After taking courses in development economics, microeconomics, data analysis, and the practicalities of running RCTs, candidates will take an in-person exam at sites around the world (where people take their SATs and GREs). Students will then receive a “MicroMasters,” and the top candidates will be invited to attend MIT in person to complete the full master’s (online courses will be converted to MIT credit). After a semester of classes in MIT’s Economics Department, students will complete a summer of practical research, including on J-PAL projects. Anyone with internet access, high-school-level calculus, and a willingness to work hard will be able to access some of the best teaching the world has to offer and compete for a place at MIT.

Why have we put so much energy over the last few years into developing this MicroMasters with a path to a full in-person MIT master’s? We believe passionately that development policy can benefit from more people rigorously trained in data and economics. We know there are many smart, dedicated people around the world who could benefit enormously from an MIT training but can’t access the level of courses they need to get into a top university. Even if they could, many of them can’t leave their families for one or two years for a typical master’s. With the MicroMasters, thousands will be able to take technical economics courses while they work at their day jobs. I was fortunate to be able to get my master’s and PhD in economics, so I feel particularly pleased to open this opportunity to others. We also believe that MIT will benefit enormously from the diverse range of experience that in-person students will bring to our classrooms.

DEDP is also unusual in being much more practical than many other master’s programs: it fits the proud MIT tradition of uniting mind and hand, “mens et manus.” One of the five online courses is the soon-to-be-released “Designing and Running Randomized Evaluations.” This will cover developing different RCT designs, how to do power calculations in practice, survey design, data collection management, and units on measurement (I just recorded the lecture on measuring women’s empowerment). And those who come to MIT will spend their summer experiencing the rigors of data collection in the field, often as part of a J-PAL project.

Finally, we hope that other universities will use the MicroMasters courses and exam as a launch pad for their own DEDP master’s so that many more in-person places are offered at top institutions around the world.

The MicroMasters program is now open for enrollment for courses beginning in February 2017; the DEDP master’s degree will launch in 2019. For more information and to sign up for the MicroMasters, see this link.

A remarkable new World Bank report makes the case for a radical change in the Bank’s approach to political systems. For years the Bank and other international agencies have sought to give the poor a voice in health, education, and infrastructure decisions through channels unrelated to politics. They have set up school committees, clinic committees, water and sanitation committees on which sit members of the local community. These members are then asked to “oversee” the work of teachers, health workers, and others. But a body of research suggests that this approach has produced disappointing results. More recently, researchers have tested ways to improve accountability of government services by strengthening, rather than ignoring, the political route to oversight. There are a number of quite promising results from a range of countries which have prompted this welcome proposal to change the emphasis of accountability programs at the World Bank.

There is little doubt that in many countries political systems have failed to deliver accountable government. In most poor and middle-income countries, many teachers and health workers fail to turn up to work, theft of essential drugs is common, as are kickbacks on construction projects. Informal fees are charged for services that by law are meant to be free. A few examples illustrate the point:

Sierra Leone: Over 50% of parents reported paying for immunizations that are meant to be free. (Note: this was before a big push to make all services for children under 5 free.)

If politics is not delivering accountable government, a natural response is to attempt to circumvent it: give power directly to the consumers who care most about service quality. The result has been a plethora of committees at the very local level. But rarely are these given any real money or power. Teachers and health workers (by far the biggest expenditure in health and education) continue for the most part to be paid by governments. When researchers have investigated how these committees function on the ground, they often look very different from what they look like on paper. In India, most Village Education Committees mandated by the government were not operating: when surveyed, 25% of people whose names were listed as committee members did not know they had this role. Even the efforts of the otherwise highly effective NGO Pratham failed to achieve effective village engagement through the committees or to improve educational outcomes.

There have been some successes, particularly in health. In Uganda, researchers found that most legally prescribed local health committees were not active. However, support from an outside NGO and information from a detailed survey of health care quality led to higher attendance by nurses and improvements in health. But this involved very expensive, intensive data collection, and when the researchers tested a cheaper version without village-specific data there was no positive effect. In Sierra Leone, NGO support to community health committees (with an intensity between that of the two Uganda approaches) was successful in improving health outcomes and cutting health worker absenteeism (paper in progress).

Another type of project that promotes community-level participation, not through the political process but through community action, is Community Driven Development. Grants are given to communities on the condition that there is strong community participation in the decision-making about how the grants are spent. Again, the results of rigorous evaluations are rather disappointing: the grants themselves have produced real outputs on the ground, but RCTs in Sierra Leone, DRC, and Liberia find little evidence that community participation persists once the grant is finished (the last shows some gains, though arguably small and confined to specific groups).

Partly in response to these underwhelming results, attention switched back to enhancing engagement and transparency of the political process. Results from a series of RCTs have been more encouraging. Voters in Brazil respond to the release of audit reports on corruption in municipalities (the more corruption, the lower the vote share). Voters in Delhi respond with considerable sophistication to information about how members of parliament spend funds made available for their constituency (increased vote share for those who spend more in the voters’ area or attend committees for issues of importance to voters, and lower vote shares for those who spend money on goods that are not seen as useful, such as fountains).

There are some negative effects: publicizing corruption levels when these levels are very high led to lower voter turnout. In Kenya, providing messages about the integrity and efficiency of the electoral commission backfired when there were significant problems with voting systems.

In my view the most promising results are from RCTs of programs that seek to engage voters rather than just provide them with information. In Benin and the Philippines, when political parties held town hall meetings, rather than the normal rallies with little policy content, challengers were able to combat clientelism of the dominant party and increase support. In Sierra Leone, filming and then screening debates between candidates for MP led to increased knowledge about political candidates, their policy positions, and the political process, better alignment on policies between voters and the candidate they voted for, and a shift in votes towards the candidate considered most effective in the debate. MPs elected from constituencies where debates were screened visited their constituency more often, held more meetings with constituents, and were considered to be doing a better job by neutral parties. It was also possible to track debate MPs’ spending to real projects on the ground.

The accumulated evidence of the last 15 years has changed my view on the best way to improve accountability, and it is great to see this shift reflected in this World Bank document, written by Stuti Khemani, with whom I worked to evaluate village-level accountability in India, as well as J-PAL affiliates Claudio Ferraz and Fred Finan. There is likely to be some pushback against what some will portray as “the Bank getting involved in elections.” But as the report sets out, there are many ways the Bank and others can support citizen engagement that do not compromise the Bank’s neutrality.

One of the favorite criticisms of randomized control trials (RCTs) in development economics is that they answer “small” questions rather than tackling big (macho?) questions like “what are the causes of growth” or “what is the impact of infrastructure investment.” The poster child for this argument has been the set of studies examining whether deworming pills or insecticide-treated nets (ITNs) should be given away free or sold at a subsidized price. A key target of criticism was a Cohen and Dupas study showing that take-up of ITNs drops sharply with small changes in price, while those who pay for a net are no more likely to use it than those who get it for free. Lant Pritchett bemoans the wasted talent of the best and brightest of a generation working on such insignificant questions. Dani Rodrik and Angus Deaton questioned the usefulness of a study that looked just at pregnant women in one part of Kenya.1

But the work on pricing and health proves exactly the opposite point. It demonstrates how impact evaluations can simultaneously answer questions of immediate practical importance for the partner in the evaluation (how many more kids will sleep under a bednet if the price is reduced from 50 cents to free) and help us understand underlying truths about human behavior (do we value and use something more if we pay for it? do small costs deter us from doing things that are good for us in the long run?). These more general questions are sometimes derided as academic, but in the long run they are particularly important for policy because they tend to generalize better. A series of studies looking at pricing for different nonacute health products in different countries shows remarkably similar results: sharp declines in take-up with small changes in price (summarized here, here, and here).

Equally important, there is no evidence that the act of paying for something increases use, or that charging helps target products to those most likely to need them. This evidence challenged the growing call to judge development programs on financial sustainability rather than cost-effectiveness.

This literature fed into a growing understanding of how bad human beings are at making small sacrifices today for long-term health payoffs, whether that means forgoing chocolate cake, exercising more, or buying prevention products. This understanding of human behavior is now feeding into policy.

In the early 2000s a debate raged about whether to charge for ITNs. Advocates of free distribution said small costs could reduce access by the poor. Those arguing for charging cited anecdotes of bednets being used as wedding veils or fishing nets but neither side had much evidence. The RCTs on price and use were quickly taken up by advocates of free mass distribution and the opposition faded.

Coverage of ITNs in sub-Saharan Africa (the region with the highest burden of malaria) has improved dramatically, with the vast majority of coverage accounted for by free mass distribution (43 out of 47 countries had free mass distribution programs). As the great maps from Giving What We Can illustrate, malaria cases have fallen dramatically. A recent article in Nature estimates that two-thirds to three-quarters of the decline in malaria cases between 2000 and 2015 can be attributed to increased net coverage: 450 million cases of malaria and 4 million deaths averted by ITN distribution. That’s anything but small.

1 As pregnant women and newborns are those most likely to die of malaria, this is the policy-relevant group to focus on.

The “#ilooklikeaneconomist” hashtag was an attempt to correct the overwhelming predominance of images of men in response to the search term “economics professor.” But “economist” and “economics professor” are not the same thing, and women economists have made important contributions outside of academia. I have written elsewhere about some of the challenges women economists face in the policy world. Here are some of the women policy economists who have inspired me along the way.

Ida Merriam

Ida joined the Social Security Administration virtually from its inception (1936) and worked her way up to head of the research department, where she was a strong and effective advocate for the program. While her knowledge was encyclopedic, her explanations were always simple, clear, and data-driven.

Judith Gueron

Judith is one of the pioneers of randomized control trials (RCTs) in the social sciences. As director of research from the founding of MDRC (1974), and then president (1986-2004), she was at the forefront of figuring out how to bring RCTs out of the lab and into the world of US policymaking. As she explains in her book Fighting for Reliable Evidence (written with Howard Rolston), this fight was primarily fought by policy economists, with very little engagement of academics until the late 1990s. Her lessons on how to convince agencies to randomize are well worth reading, including for those working in developing countries.

Rachel Lomax

There were very few women economists, and even fewer senior women economists, when I started work at the UK Treasury in the 1980s. Rachel Lomax was a prominent exception. She went on to be vice president at the World Bank and permanent secretary of the Welsh Office, the Department for Work and Pensions, and the Department for Transport, as well as deputy governor of the Bank of England. As a boss she was both intimidating and inspiring. More than any other senior bureaucrat I have worked with, she pushed us to stay current with the latest economic literature and make sure our decisions reflected the latest research findings. Knowing that she miraculously found time to stay current with the literature was a very strong incentive for us to do the same.

Luisa Diogo

I had the privilege of supporting Mozambique through the write-down of much of its bilateral and multilateral debt in 2001 as part of the Highly Indebted Poor Country Initiative. Mozambique was only the second country to reach this point, in large part due to the sound economic management of the then-Minister of Finance Luisa Diogo, who got her master’s in economics from the University of London and went on to become Mozambique’s first woman prime minister. As the only female member of the IMF negotiating team, I loved to watch then-Minister Diogo dominate meetings through superior argument as well as force of personality. She would heap withering scorn on poorly thought-through proposals (including from her own central-bank governor). The result was that everyone went into meetings with her very well prepared. As prime minister, Madame Diogo used her formidable intellect and reputation to push for free reproductive health care and gender equality across the African continent.

I would love to hear about women policy economists who have inspired others. (Note that I have not included Janet Yellen on this list, even though she is an important and inspiring role model for women in economics, because she made her reputation as an academic economist and only later moved into policy.)

In my previous blog I discussed some of the challenges of working with governments on RCTs. My aim was not (as some have suggested) to disparage this work but to give some tips to those who are taking it on. Indeed, it is because I know just how hard it is that I have enormous admiration for those who manage to pull it off. I was fortunate enough to work with the Government of Sierra Leone's Decentralization Secretariat in the aftermath of the civil war as they introduced a system of decentralization to the country (not randomized) and experimented with decentralizing all the way down to villages with the GoBifo program (randomized). Ten years later I am still there and still working with the government.

Despite all the challenges, working with governments can have substantial payoffs. Some of these benefits flow to researchers. As discussed in my previous post, working with governments opens up a whole range of questions that cannot be addressed when working with NGOs or companies. Nor is there anything like trying to run an RCT with a developing-country government to understand just how these governments work and what constraints they operate under. But many of the benefits are in the form of public goods. Researchers often come in with knowledge of what has worked elsewhere and help the government design better programs. They typically also provide technical assistance to governments about good process monitoring. Government partners also gain, through an improved capacity to distinguish different types of evidence, to determine which evidence is best at answering what question, and how best to integrate research evidence into their programs. If the relationship goes well, the government partner will be much keener to work with subsequent research teams and will be much more effective in working with them. Importantly, governments are often in a good position to scale up a program that has been tested if the results are positive. A great example of how researchers working on a question of intense interest to the government can lead to a quick and wide-reaching policy change is the Raskin rice program in Indonesia.

Because the costs to the individual researcher are high, but many of the benefits accrue to the world at large, J-PAL has put significant effort into building up long-term partnerships with governments to make it easier for our researchers to work with them. Typically these partnerships include: discussing general lessons from previous research findings that may be relevant to the local context; capacity building to help civil servants read and incorporate lessons from research; and a demand-driven agenda of research to answer the questions the government most wants answered, and that J-PAL researchers are able to design studies to answer.

Another approach is to provide subsidies and support to researchers who are investing in building relationships with governments. The Government Partnership Initiative at J-PAL does exactly this.

Other organizations, including Innovations for Poverty Action and the International Growth Centre (both of which I work with), have similarly invested in building long-term relationships with governments to foster researcher–government collaborations. When doing RCTs or other impact evaluations, researchers at the World Bank and other multilateral development banks also benefit from working within a long-run relationship with governments, which can help with--though it does not solve--the challenges I discussed.

David McKenzie’s blog about the arms race in RCTs raised the concern that there will be a reduction in the number of RCTs with governments and that it will get harder to publish them. I am not going to comment on the publication issue but instead on the pros and cons of doing an RCT with governments and some practical suggestions for those who do want to work with governments.

On the benefit side, governments:

Have substantial resources to invest in large and expensive programs and evaluations;

Cover large populations. When else could you randomize at the subdistrict level and have 1.8 million beneficiaries in an RCT, as in Olken et al. (2014)?

Collect a lot of data on individuals, such as test scores for children, earnings for adults, and encounters with the criminal justice system. Working with governments can help you get access to these administrative data, which can reduce the cost and hassle of running an RCT.

Some examples of RCTs that use these advantages of partnerships with government to good effect are Angrist et al. (2006), who were able to follow up winners and losers of a lottery for vouchers to attend private school in Colombia by linking winners to a centralized college-entry exam seven years after the vouchers were issued. In ongoing work, Bettinger et al. link the same voucher winners and losers to government tax and earnings data, 17 years after the lottery. Muralidharan and Sundararaman (2011) test the impact of teacher-incentive pay in a representative sample of rural schools across the state of Andhra Pradesh, meaning their results are valid across a population of 60 million. Banerjee, Hanna, Cohen, Sumarto, and Olken worked with the Government of Indonesia to test how providing individual ID cards to recipients of government-subsidized rice (which indicated the amount and price of rice they were eligible for) could reduce corruption in the distribution system. The results showed that the cards increased the subsidy received by targeted recipients by 25 percent, so the government scaled up the ID card program, reaching 66 million people. The time from evaluation design to scale-up was about a year.

With these benefits, however, come considerable costs:

Governments can be slow-moving and less able or willing to test out-of-the-box solutions than NGOs.

It may be particularly difficult to run more theory-oriented field experiments with governments. They tend to be less interested in answering an abstract question, the answer to which could inform many policies but would not be scaled up as a specific program.

Governments can also find it harder than NGOs to provide services only to a limited group of needy citizens. Some governments have laws requiring them to treat citizens of equivalent need equally. When the Government of France wanted to test programs using randomized trials, they first had to change the constitution to make this possible.

Staff turnover in governments can be high as civil servants are transferred regularly. This makes it even more important to build support at different levels of government: if the RCT has support from the minister but not the bureaucrats, then it is likely to die with the next cabinet reshuffle.

An election can lead to a dramatic change in policy priorities and personnel at the same time. It can also lead to paralysis for a period both before and after the election, even if the program being evaluated has bipartisan support. (An RCT I was involved in collapsed when none of the planned monitoring could take place because a newly elected government froze all nonessential expenditure while it thought through new priorities. In another instance, a survey had to be suspended just as it was about to go into the field because of a national exchange rate crunch, which again led to a spending freeze. Government budget shortfalls and last-minute crunches are not confined to developing countries.)

Governments can renege on any agreement with impunity. There is not much a researcher can do when a government decides to fill a shortfall in a program budget with money set aside for, say, the endline.

Many of the strategies for working with partners discussed in my previous blog are just as relevant to working with governments. Government partners are in a powerful position vis-à-vis the researcher, so it is important to listen hard to what they want. They often work within short political timelines, so delivering intermediate products such as baseline reports can be key for keeping them engaged.

There are also specific actions a researcher can take to help the often bumpy ride of partnering with governments. A more formal approach to partnership may be needed than in the case of working with NGOs. Governments often require a memorandum of understanding that sets out clear expectations for both parties. Discussions may be going well at the practical implementation level, but any final decision—even a relatively small one—is likely to require sign-off from someone senior. It is important to build extra time into the schedule to account for this.

Government procurement rules can also cause considerable delay. For example, if we decide that an intervention needs a leaflet to explain the study to participants, the government may require a competitive bid for printing the leaflet, leading to several months’ delay. Having some independent funding that does not run through the government can be very helpful in easing some of these constraints: a researcher can come in and offer to pay for a leaflet, or for additional monitoring, etc. Independent funding can also help keep the research going if the government faces short-run liquidity constraints.

Being the first to do something might be exciting for an NGO, but can make a government nervous about being exposed to criticism. Thinking through the optics of the experiment (i.e., how it would look on the front page of a newspaper) can help alleviate concern. Another strategy is to bring in an official from another department or country who has worked on an experiment before, preferably of a similar type. It is much more reassuring for officials to talk to other officials than it is to hear from a researcher.

Policymakers often have a healthy skepticism of researchers who want to provide advice about how to measure or improve a program, especially those coming from another country, state, or region. It is important for researchers to prove their relevance and their local knowledge. A mix of humility, a desire to learn from the policymaker, and a lot of homework about local conditions can help. I have seen policymakers visibly relax and start to engage when they hear about a researcher’s on-the-ground experience. A well-placed anecdote about a conversation with a farmer in Kenema or a teacher in Pittsburgh can be critical for building credibility.

The bottom line is that working with governments can be rewarding, but is very hard and takes enormous investment. It is also risky. As such it may be inadvisable for junior researchers who have to finish a PhD on time or get a paper published before a tenure decision to do an RCT with a government. Even senior researchers need to think hard about the commitment, risk, and reward before taking on an RCT with a government.

For more discussion on ways for researchers to make working with governments easier, see my next blog entry.

Economics doesn’t have a reputation for being a particularly ethical profession, but the new book by William MacAskill might help change that. Many of the concepts laid out in Doing Good Better--such as counterfactuals, diminishing marginal utility, and fat-tailed distributions--will be familiar to economists. What is unusual is to see these tools used to develop a practical guide on how to live an ethical life. MacAskill doesn’t tell you what choices to make; instead he sets out a simple framework for how to think through decisions like whether to be a vegetarian, what job to take, whether to buy fair trade coffee, and where to donate. He also provides a lot of useful numbers for thinking through these decisions.

As an example, MacAskill discusses the choice of career. We might naively think that we should adopt a career that directly helps people, like being a doctor. But MacAskill urges the reader to think through the counterfactual: what would happen if we did not become a doctor? In many countries the number of doctors is highly regulated. What do we think would happen to the average quality of medical care if we did not become a doctor but another person filled the slot in medical school instead? Are there other careers where our marginal impact might be higher? It’s a difficult question but making a career choice is a big decision that deserves some careful, structured thought.

I like to think I approach ethical decisions reasonably logically, but there were a number of cases where this book made me question what I am doing. For example, I am a vegetarian who eats eggs. MacAskill argues, however, that eating eggs may cause more animal suffering than eating beef. If we are making a commitment such as being a vegetarian, isn’t it worth investigating a bit more thoroughly what the consequences are? (The discussion about vegetarianism is also a good example of how ethical decisions are not zero/one, or right versus wrong, but more on a continuum.)

I appreciate how Peter Singer, this book, and Effective Altruism have directly taken on the all-too-common view that we have a stronger duty to fix the problems closest to us than those of people who live far away. Will MacAskill nicely brings together the concept of diminishing marginal utility and results from happiness research to suggest that $100 in a poor country could generate 100 times more utility than $100 in the US. You can quibble with the exact numbers but the basic idea has to be right.
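To see mechanically where a multiplier like 100 can come from (my illustration, not MacAskill’s; the income figures are purely for exposition): with log utility, the marginal utility of a dollar is inversely proportional to income,

```latex
u(c) = \ln c \;\Rightarrow\; u'(c) = \frac{1}{c},
\qquad
\frac{u'(c_{\text{poor}})}{u'(c_{\text{rich}})}
  = \frac{c_{\text{rich}}}{c_{\text{poor}}}
  \approx \frac{\$100{,}000}{\$1{,}000} = 100.
```

That is, if rich-country incomes are roughly 100 times those of the poorest households, an extra $100 is worth roughly 100 times as much to the latter under this admittedly stylized utility function.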

While Effective Altruism is about a lot more than aid, there is a thought-provoking point on aid. Smallpox eradication has saved between 60 and 120 million lives. If it were the only achievement of all aid ever given to developing countries, we would have spent $40,000 for every life saved from aid, making it highly cost-effective (compared, for example, with the FDA’s cost-effectiveness threshold of $7.9 million per life saved). Deaton argues that this fails to take into account the cost of aid in propping up regimes that suppress human rights in developing countries. MacAskill and others would clearly want to take into account any negative general equilibrium effects of aid. I agree with those who stress the importance of human rights and democratic freedoms—the poor deserve these as much as the rich do. I have been shocked by the West’s unwillingness to protest the overthrow of democratically elected governments around the world, including in Bangladesh and Thailand. But it is a big jump to go from caring about democracy to saying that all support to the poor in all developing countries simply props up bad regimes. There are democracies among poor countries, and there are better and worse ways to support the poor even in less-than-perfect democracies.

Deaton’s argument that providing succor to the poor in less-than-perfect regimes is dangerous because it provides credibility to these regimes reminds me vividly of the arguments communists made against social democracy in Europe, from Marx in the 1840s to my college days in the 1980s. Pensions, unemployment insurance, public education, and welfare would sap the poor’s willingness to rebel and create regime change, and thus should be opposed. I didn’t buy it in college and I don’t buy it now. To be concrete, one of the recommendations in Doing Good Better is GiveDirectly, which gives money directly to poor families in Kenya: a democracy, although far from a perfect one. If $100 is 100 times more valuable to a Kenyan than to an American, I would want pretty strong evidence that this giving made the Kenyan regime less democratic before thinking such giving was a net negative. It seems just as plausible to me that giving a poor Kenyan $100 puts them in a better, rather than worse, position to lobby for improved government.

There is one part of MacAskill’s book where I have a different perspective from him: that effective altruists might do more good “earning to give” in a high-paid job than working for a development agency or NGO. I worry that he underestimates the change in preferences someone working in banking may undergo. I also feel that development NGOs and agencies could be more effective if a higher proportion of their staff were quantitatively minded people who carefully think through counterfactuals and trade-offs in the way effective altruists do. In other words, unlike the doctor case, it’s not clear to me that if effective altruists stop pursuing these types of jobs, the positions will be filled with people who are just as effective.

A fundamental position of Effective Altruism is that we have a moral obligation not just to do good, but to think more carefully about how to do good more effectively. To me, being an Effective Altruist is the logical consequence of being an economist who thinks about effectiveness and wants to be ethical. (In a previous blog I discuss some ways in which Effective Altruists have raised important and interesting technical challenges to some economics methodologies.)

In the interests of full disclosure, the first chapter of Doing Good Better has a description of how the recent wave of randomized evaluations in development economics started in which I feature (too prominently).

Twenty-one years ago this summer, I was fortunate enough to witness the start of a new movement of randomized evaluations that have transformed development economics. (This week, CEGA and IPA celebrated the anniversary in Nairobi.)

I had already met Michael’s family in Kansas, but in 1994 he took me to see his other family, with whom he had lived on a small farm in rural Kenya for a year after college. Michael relates the story of how a chance meeting with an old friend who worked for International Child Support (ICS) led to the first RCTs on education in Busia, Kenya, a then very sleepy town on the border with Uganda that has become famous among development economists.

There is something very special about Busia and the ethos of inquiry and partnership that developed and flourished there. Here a craft was developed and honed. As the movement becomes much wider, it’s worth recording some of the principles that were so successful in Busia.

A deep commitment to craft. A thousand small decisions separate a good RCT from a poor one: Framing the question on the survey exactly right; designing the intervention so that it carefully reflects local needs; capturing data at the right time of day or year; meticulously monitoring enumerators; making it easy for the data to be entered correctly; capturing not just the outcome but all the steps along the way to better understand the outcome. Busia was where many researchers came to develop and learn the intensely practical craft of running a good RCT: Lorenzo Casaburi, Pascaline Dupas, Esther Duflo, David Evans, James Habyarimana, Ted Miguel, Owen Ozier, Jon Robinson, Simone Schaner, Frank Schilbach, Rebecca Thornton, Alix Zwane, and many others all worked there. And here they developed many of the protocols that have become standard in RCTs.

Long-term partnership. Many of the best RCTs have come from long-term partnerships between researchers and NGOs or other local partners. The trust that develops with collaborations over many years allows innovative ideas to be tested, often springing out of the lessons of past failures or partial successes. One RCT builds on the findings of the previous. The relationship between researchers and ICS was the first model for this long-term relationship. (Several years after going to Busia I also saw Michael and Abhijit Banerjee forge the start of a second important relationship with Seva Mandir. See Neelima Khetan, former chief executive of Seva Mandir, discuss this summer like no other here.)

Thorough understanding of context. An important ethos of the researchers working in Busia was the importance of understanding the local context. OK, not everyone could spend a year living with a local family, sleeping on cow dung in a thatched hut and teaching in the local school--but this example set the tone. Many PhD students and research assistants lived with families in town and learned Swahili during their long stays in Busia. Scott Guggenheim, an anthropologist who has worked closely with Ben Olken and others on RCTs in Indonesia, criticizes economists’ reluctance to talk about the detailed qualitative work they do in preparation for an RCT: it’s an integral part of how the best economists work, but if you don’t talk about it, people will think you can fly in and fly out and do a good RCT.

Building high-quality research infrastructure. Why did so many PhD students and junior researchers come through this backwater town? Yes, for the collaboration with other researchers, but also to take advantage of the research infrastructure that was built--first as the evaluation group at ICS, which then migrated to become IPA Kenya. It is hard enough for PhD students to do an RCT as their thesis; if they also had to work out how to hire Kenyan enumerators legally, find people to enter the data, and so on, it would be a lot harder. The lower entry costs were critical to the movement taking off. Again the model was copied in other places: first in India, then in IPA offices and J-PAL regional centers across the world. It has been wonderful to see some of the Kenyans who helped build this infrastructure be recognized. This summer, Michael and I (with other Busia alumni) were with Carol Nekesa (part of the team that moved from ICS to IPA) to celebrate her graduation from the midcareer master's program at the Harvard Kennedy School.

In a previous blog post I discussed what a researcher should look for in an implementing partner with whom they want to do an RCT. But what does an implementer want in a research partner, and how can a researcher make him- or herself a better partner?

I) Answer questions the partner wants answered

Start by listening. A researcher will go into a partnership with ideas about what they want to test, but it is important to understand what the implementer wants to learn from the partnership. Work together to come up with a design that answers key questions from both parties. Sometimes this requires not an additional study arm but simply collecting good monitoring data or descriptive data on conditions in the study population.

II) Be flexible about the evaluation design

The research design you have in your head initially is almost never the design that ends up being implemented. It is critical to respond flexibly to the practical concerns raised by the implementer. One of the main reasons that randomized evaluations have taken off in development in the last twenty years is because of the range of tools that have been developed to introduce an element of randomization in various ways. It is important to go into a partnership with all those tools in mind and use the flexibility they provide to achieve a rigorous study that also takes into account the implementer’s concerns.

A common concern implementers have about randomization is that they will lose the ability to choose the individuals or communities they think are most likely to benefit from the intervention; for example, a training program may want to enroll students that have some education, but not too much. These concerns are relatively easy to deal with: agree to drop individuals or communities that don’t fit the criteria as long as there are enough remaining to randomize some into treatment and some into control. This may require expanding the geographic scope of the program. Randomization in the bubble can be a useful design in these cases.
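To make the mechanics concrete, here is a minimal sketch of randomization in the bubble. The function name, thresholds, and data are hypothetical, not from the post: units the implementer is confident about are enrolled (or dropped) as usual, and only the marginal units near the eligibility cutoff are randomized.

```python
import random

def bubble_randomize(scores, lo, hi, seed=0):
    """Randomization 'in the bubble' (illustrative sketch). Units with an
    eligibility score above `hi` always receive the program, units below
    `lo` are dropped, and the marginal units in [lo, hi] -- the 'bubble' --
    are randomly split between treatment and control."""
    rng = random.Random(seed)
    always, excluded, bubble = [], [], []
    for unit, score in scores.items():
        if score > hi:
            always.append(unit)    # clearly eligible: enrolled outside the experiment
        elif score < lo:
            excluded.append(unit)  # clearly ineligible: not enrolled
        else:
            bubble.append(unit)    # marginal: randomized
    rng.shuffle(bubble)
    half = len(bubble) // 2
    return {
        "always_treated": always,
        "excluded": excluded,
        "treatment": bubble[:half],
        "control": bubble[half:],
    }
```

The resulting experiment estimates the program's impact on the marginal population, which is often exactly the group a targeting or scale-up decision hinges on.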

Randomized phase-in designs are also useful for addressing implementer concerns, although they come with important downsides (Glennerster and Takavarasha 2013 detail the pros and cons of different randomization techniques).
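The logic of a phase-in design can be sketched in a few lines (an illustration under stated assumptions; the function and names are hypothetical): every community eventually receives the program, but the order is randomized, so communities assigned to later phases serve as the comparison group while earlier phases are treated.

```python
import random

def phase_in_schedule(communities, n_phases=3, seed=0):
    """Randomized phase-in (illustrative sketch): randomly order the
    communities, then split them into roughly equal phases. Phase 1 is
    treated first; later phases act as controls until their turn."""
    rng = random.Random(seed)
    pool = list(communities)
    rng.shuffle(pool)
    base, extra = divmod(len(pool), n_phases)
    schedule, start = {}, 0
    for phase in range(1, n_phases + 1):
        size = base + (1 if phase <= extra else 0)  # spread any remainder
        for community in pool[start:start + size]:
            schedule[community] = phase
        start += size
    return schedule
```

Because everyone is eventually served, this design often defuses fairness objections, though (as the references above discuss) it rules out measuring long-run impacts against a never-treated group.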

There can and should be limits to this flexibility. If an implementing organization repeatedly turns down research designs carefully tailored to address concerns they've raised previously, at some point the researcher needs to assess whether the implementer wants the evaluation to succeed. This is a very hard judgment to make and is often clouded by an unwillingness to walk away from an idea that the researcher has invested a lot of time in. In this situation, the key question to focus on is whether the implementer is also trying to overcome the practical obstacles to the evaluation. If not, then it probably makes sense to walk away and let go of the sunk costs. Better to walk now than be forced to later, when even more time and money have been invested.

III) Share expertise

Many partners are interested in learning more about impact evaluation as part of the process of engaging in an evaluation. Take the time to explain impact evaluation techniques to them and involve them in every step of the process. Offer to do training on randomized evaluations or run a workshop on Stata for the organization's staff. Having an organization-wide understanding of RCTs also has important benefits for the research. In Bangladesh, employees of the Bangladesh Development Society were so well versed in the logic of RCTs that they intervened when they noticed girls from surrounding communities attending program activities. They explained to the communities (unprompted) that this could contaminate the control group and asked that only local girls attend.

Researchers often have considerable expertise in specific elements of program design, including monitoring systems and incentives, and knowledge of potential funding sources--all of which can be highly valued by implementers. Many researchers end up providing technical assistance on monitoring systems and program design that goes well beyond the program being evaluated. The goodwill earned is invaluable when difficult issues arise later in the evaluation process.

IV) Provide intermediate outputs

While implementing partners benefit from the final evaluation results, the timescales of project funding and reporting are very different from academic timelines. Often an implementing organization will need to seek funding to keep the program going before the endline is even collected and several years before the final evaluation report is complete. It is therefore very helpful to provide intermediate outputs. These can include: a write-up of a needs assessment in which the researcher draws on existing data and/or qualitative work that is used in project design; a description of similar programs elsewhere; a baseline report that provides detailed descriptive data on conditions at the start of the program; or regular reports from any ongoing monitoring of project implementation the researchers are doing. Usually researchers collect these data but don't write them up until the final paper. Being conscious of the implementer's different timescale and getting these products out early can make them much more useful.

V) Have a local presence and keep in frequent contact

Partnerships take work and face time. A field experiment is not something you set up, walk away from, and come back to some time later to discover the results. Stuff will happen, especially in developing countries: strikes, funding cuts, price rises, Ebola outbreaks. It is important to have a member of the research team on the ground to help the implementing partner think through how to deal with minor and major shocks in a way that fits the needs of both the implementer and the researcher. Even in the middle of multiyear projects I have weekly calls with my research assistants, who either sit in the offices of the implementer or visit them frequently. We always have plenty to talk about. I also visit the research site once and often twice a year. Common issues that come up during the evaluation are lower-than-expected program take-up, higher-than-expected costs of running the program, uneven implementation quality, and new ideas on how to improve the program.

Investment pays off

The benefits of investing in long-term partnerships are high. Some of the most interesting RCTs have come out of long-term partnerships between researchers and implementers that cover multiple evaluations. Once a level of trust in the researcher and familiarity with RCTs has been established, implementers are often more willing to randomize different elements of their program and try new approaches to their programs. Indeed, they often become the drivers of new ideas to test.

The Well column of The New York Times recently suggested that too much exercise could be bad for your health based on the results of a study, the flaws of which Justin Wolfers eloquently explained in a subsequent issue of the Times. While this was a particularly embarrassing case of journalists giving more weight to a study than it deserved, it is hardly unique. We regularly hear or read that foodstuff X (coffee, butter, red wine) is bad for our health, only to read some months or years later that a new study shows the same food is in fact good for us. This cycle of excitement and disillusionment, whether it is about a new way to prevent cancer, reduce poverty, or improve children's IQ, is not healthy for the public's faith in science or journalism.

Recently the Department of Communication at Pontificia Universidad Católica de Chile (under the leadership of Gonzalo Saavedra, who opened the session) and J-PAL LAC brought together some of the leading journalists in Chile to discuss how journalists can improve their presentation of evidence and statistics. Paula Escobar-Chavarría moderated the session. My talk (slides available here) focused on the need for journalists, when reporting on a study, fact, or claim, to clearly distinguish between four different types of evidence: descriptive evidence, process evidence, correlations, and causal evidence. All can be newsworthy. Francisco Gallego, in response to a question about whether correlations are ever newsworthy, gave the example of how important it would be to report that wages are lower for women than for men even if there were no way to establish that the relationship is caused by discrimination. The mistake would be to report a correlation in a way that suggests causation.

I suggested the following checklist of questions journalists can ask when reporting on a claim:

1. Does the type of evidence match the claim? Is the claim causal but the evidence only a correlation?

2. If the claim is descriptive, is the sample large and representative?

3. If the claim is causal (i.e., about impact), what is the counterfactual: How do we know what would have happened otherwise?

4. Are there reasons to think that those who participated in the program/ate a certain type of food/went to a type of school are different in other ways from those who did not?

How could a journalist make a story come alive and yet be true to the evidence? Telling individual stories was a powerful communication tool. Rather than interviewing a few (unrepresentative) people and drawing wider conclusions from those interviews, journalists could illustrate the findings of a representative and well-identified study by seeking out individuals whose story reflected the study’s findings.

The precise use of words is something journalists care about. Everyday words like "impact," "led to," "caused," or "the result of" imply causal evidence. They should only be used when we know what would have happened otherwise. When reporting on a study that describes a correlation, it can also be helpful to draw attention to this fact and discuss why the correlation may not be causal. (I spent a frustrating thirty minutes stuck in traffic listening to an--admittedly hour-long--NPR show on the higher-than-average rate of suicide among those taking antidepressants without once hearing the moderator ask, "Could this correlation be due to the fact that those on antidepressants are depressed and thus more prone to suicide?" The side benefit of this episode was that my six-year-old son in the back seat got an impromptu lecture on the difference between correlation and causation, similar to the one I received from my mother at about the same age.)

Journalists have an important role in holding politicians, NGOs, and others to account for the claims they make. Rather than take these claims at face value, journalists are in a perfect position to probe and ask: How do you know that your policy caused that change?

As experts in communication, journalists also have an important role in communicating the results of studies that may be written in complex language. But communicating a complex argument both clearly and accurately usually requires investing time in understanding the study and it may require asking for help in how to express the findings, either from the authors or from other experts--something that journalists are often reluctant to do. (I realize that journalists consider it against their standards to show a draft article to the interviewee or study author, but isn’t there a higher duty to accuracy?)

One of the participants at the event commented that journalists can be reluctant to say “we don’t know” even though sometimes that is the only accurate reading of the evidence. But despite the many challenges to improving the accuracy with which evidence and statistics are reported in the press, I came away energized and optimistic. Throughout the discussion, journalists openly discussed the profession’s (and often their own individual) failings, happily engaged on difficult questions, and said they would welcome more advice and input to improve their reporting. This group was a very select sample of the most serious journalists in Chile but I was left to wonder whether there would be the same appetite to engage on these issues in other countries. Are US journalists too proud to admit they need help?

On Friday The New York Times published our op-ed (written jointly with Tavneet Suri and Herbert M’cleod) on how implausible claims and misleading presentation of information have made the Ebola crisis worse than it needed to be. Much of the op-ed is based on our joint report with the World Bank. In our survey we find little evidence of impacts on agriculture, but lots of evidence that household enterprises in urban areas have been hit.

Given that our section on agriculture was cut from the op-ed, it is worth explaining it in a bit more detail. On September 5th the FAO reported that more than 40 percent of farms in Kailahun were abandoned and raised the alarm that planting might be disrupted because of a lack of planting materials. Some news outlets reported that 40 percent of all farms across the country had been abandoned. Ninety percent of farmers in Sierra Leone grow rice, and during September the rice crop is maturing in the fields (having been planted before the outbreak started). What, in this context, does “abandoned” mean? That farmers were not weeding their crop for fear of infection? And why was the FAO worried about a lack of planting materials in September, given that planting would not take place till late spring/early summer 2015? In our November survey many farmers reported that they had not yet harvested their rice crop (normally the harvest starts in late September and early October, although it extends through December). The main reason given was that the rains were continuing. Virtually no one mentioned Ebola. The FAO report was influential in shaping the debate on economic impacts. It was mentioned in the Ministry of Finance and World Bank assessments of the economic impact of Ebola.

It may well be that the reported harvest is lower this year than before. Unfortunately we will probably never know exactly what the impact of Ebola was on agricultural output, as it is hard, even in the best of times, to estimate agricultural production in a country where the vast majority of agricultural output is consumed by those who produce it. Our previous, survey-based estimates have been at sharp odds with official figures. But what seems clear is that household nonfarm businesses have been hit harder than agricultural production, yet the emphasis has been on the latter. Is that lack of attention because there is no ministry or agency that has the nonagricultural informal sector as its priority?

For much of the last two months Tavneet Suri and I have been working with the Sierra Leone team at Innovations for Poverty Action, the World Bank, and Statistics Sierra Leone on a major household survey to get some facts about how households are faring during the Ebola crisis. The results have just been released.

Given the transport disruptions and the threat of infection, the survey was conducted by cell phone, which is not ideal in a country as poor as Sierra Leone. However, this survey used as its base those who had been interviewed as part of the Labor Force Survey (LFS) undertaken in July/August 2014. Thus we had up-to-date phone numbers for 2,764 people (66 percent of the LFS sample). We also had data that for most people was pre-Ebola. Thanks to the hard work of IPA and SSL enumerators, 70 percent of these were reached during the middle of November (meaning 46 percent of the overall LFS sample was reached). Coverage was good in urban areas but weaker in rural areas. The rural results therefore need to be treated with some caution, although the panel nature of the data helps, and the numbers reached in rural areas still far exceed those of most data collection exercises in Sierra Leone through this period. The response rate is much higher than that of the World Bank survey in Liberia.

Of course we don’t have any good identification to distinguish the impacts of the Ebola crisis from other shocks that have hit the economy (like the unseasonably heavy rains or the fall in the price of iron ore). All we show are changes since August. However, it is important to remember that the Sierra Leone economy was growing strongly before the onset of the crisis. Overall, it looks as though the urban informal sector has suffered most since August. This is perhaps unsurprising, as this is the sector that would bear the brunt of restrictions on transport, markets, restaurants, and bars. It is also the sector most vulnerable to the cutbacks in discretionary expenditure that come with uncertainty.

The percentage of household heads who reported working in the last week in urban areas fell from 75 percent to 67 percent. The percentage working stayed unchanged in rural areas. Hours worked, for those in work, fell everywhere except Freetown.

Revenues from nonfarm household enterprises fell by 40 percent. While 4 percent of nonfarm household enterprises were reported as not operating in August, this rose to 12 percent in November. A third of households reported (in November) that Ebola was the reason that the HH enterprise was not operating.

Roughly half of households still have rice that is yet to be harvested, but the main reason given is that it is still raining. Only three households mentioned Ebola as a reason rice remained unharvested. There were reports of labor shortages, but these were mainly at the household level (14 percent of respondents), while only 6 percent reported lack of labor in the community as a constraint. Over half of households reported hiring outside labor despite worries that fear of infection would reduce this practice.

Just over 70 percent of households reported taking some action to combat food insecurity in the week prior to the survey. As we do not have similar data at this point in the season it is hard to say how much of this is due to Ebola.

Our markets data continues to show that food prices are similar to those seen in previous years. The one exception is that imported rice prices have fallen much more than is typical in cordoned areas. Our market price data are consistent with prices reported by households in the cell phone survey.

Our market trader data looks somewhat more encouraging than in previous rounds. This is partly because we have changed our base year from 2012 to 2011 (the previous round of the market survey stopped in October 2012). There was also an uptick in rice traders as the delayed harvest came in. Traders of palm oil and processed cassava (gari) are still below previous years, although the gap closed somewhat in December.

Long-term biases and stereotypes are not going to be easy to change, but here are a few practical things you can do to help combat sexism in economics both in the academic and policy world:

1. When writing recommendation letters, if you draw comparisons to others, do so across genders (and across race).

It is common in recommendation letters and tenure letters to draw comparisons to other, more senior people in the same field: this PhD student reminds me of X, this junior colleague is the best in their field since Y, or has the breadth of interest of Z. In the policy world the comparisons are not as formal, but it is still common to say, “This person reminds me of X” as shorthand to convey the type and the quality of the person being considered. But whether in academia or in the policy world, these comparisons are nearly always made within gender. A junior male academic is compared to a senior male academic, but not to a female one, and vice versa. I don’t think this is a conscious decision, but it has insidious results, especially if there are few senior women to whom a junior woman can be compared. There are now, thankfully, some women stars in academic and policy economics, but not every woman can be the next Susan Athey, Esther Duflo, or Janet Yellen. By unconsciously drawing comparisons only within gender, it is as if we are forcing ourselves to paint pictures of women in black and white while using the full-color palette to paint pictures of men. To do justice to female candidates we need to be able to use a full palette of comparisons.

So next time you write a recommendation letter, or want to describe someone with a quick shorthand by saying they are similar to someone else, make a conscious decision to dismantle this restrictive norm and choose a comparator of a different gender. If you want to say a junior male colleague is rigorous and thorough, compare them to a senior woman with those characteristics, and vice versa. I don’t have a large enough sample to say whether there is a similar hesitancy to make comparisons across race and country of origin, but I worry that there might be. It makes no sense that junior French economists are frequently compared to Tirole or Piketty when their research may be more like that of a US colleague.

2. Stop and ask yourself, “Would I say that if I were talking to a man?”

Several years ago at Davos, a woman from a large foundation came up to Esther Duflo and me and told us in a very matter-of-fact way that we couldn’t possibly have founded J-PAL because we were too young. I wondered if she would have said that to two men in their mid-30s and early 40s. During a review of a research project, a committee discussed whether the (female) researcher should be discouraged from proceeding with her study (which involved interviewing terrorists) given the danger. Fortunately these are rare examples of bizarre behavior. But how common is it for the work of a junior female faculty member who has coauthored with a senior colleague to be discounted on the assumption that the senior author was the real brains behind the project? If you ever find yourself in that position, ask yourself: would I discount this work equally if the junior faculty member were a man?

3. Don’t draw attention to a woman’s minority status.

At my London comprehensive we had an old-fashioned physics department, and the boys and girls had to line up on different sides of the classroom door: 28 boys on one side and two girls on the other. Throughout the lesson our teacher would constantly refer to the class as “gentlemen” and then add “and of course, our two ladies.” Despite loving science, I dropped physics at 16. This is an extreme version of a pretty common phenomenon in which women who do science and economics are constantly reminded of the fact that they are unusual. At the UK Treasury the etiquette was that the most senior officials entered the Minister’s office first. Yet several times the senior official would usher me in first as the only woman, even though I was the most junior official there. On mission at the IMF, a colleague would make a big show of interrupting the waiter as he took our order and telling him to take my order first, as I was the only woman. Trust me, when you are working 18-hour days for two weeks in close quarters with a small team, the last thing you want anyone to do is remind everyone of the fact that you are the only woman.

4. Try not to be jealous if a woman occasionally gets the spotlight.

There has been a lot of discussion recently of biases in whom the press quotes and pays attention to. Occasionally, as if struck by guilt for past mistakes, the press will suddenly do a feature on a particular woman. These splashy features can be a double-edged sword for the featured woman, as they can create resentment amongst her colleagues. But it is worth remembering that the woman herself almost certainly did not seek or precipitate the feature. The erratic nature of the press is hardly her fault, and neither male nor female academics could be expected to decline being featured because they thought someone else was more worthy. This resentment is not only precipitated by the erratic spotlight of the press. Someone may be searching for a woman to sit on a panel or to fill a very visible policy job in an attempt to signal that the organization is not as male-dominated as it might appear. I know that occasionally IMF colleagues of mine thought it was unfair that I would be seated next to the Governor of the Central Bank as the only woman at the table, and Treasury colleagues resented the fact that women often got coveted jobs working closely with Ministers. (If they knew the banter we had to put up with in those positions they would probably have been less jealous.) But if you find yourself feeling annoyed or jealous on the occasions when the spotlight falls on a woman, try not to blame the woman.

Unlike most academic economic research, running randomized controlled trials (RCTs) often involves intense collaboration between researchers and the organization or individuals implementing the intervention being evaluated. This collaboration can be the best thing about working on a study--or the worst. What should a researcher look for in an implementer? In a later post I will discuss what a researcher can do to strengthen the relationship and provide value to the implementing organization.

i) Sufficient scale

A first, and easy, filter for a good implementing partner is whether the organization is working at a big enough scale to generate a sample size that will provide enough power for the experiment. How big is big enough depends on the level at which the randomization will take place, as well as the number of different variants of the program to be compared and the outcome of interest. Thus a lot of detailed discussion takes place about what a potential evaluation would look like before it is possible to say whether an evaluation is feasible. However, it is surprising how many potential partnerships can be ruled out quite early on because the implementer is just not working at a big enough scale to make a decent evaluation possible.
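The back-of-the-envelope arithmetic behind that filter can be sketched with the standard two-arm power formula. This is a simplified illustration for individual-level randomization (the function name and defaults are mine, not from the post); randomizing at the community level inflates the required sample further.

```python
from math import ceil
from statistics import NormalDist

def sample_size_per_arm(effect_size, power=0.8, alpha=0.05):
    """Minimum individuals per arm for a two-arm, individually
    randomized trial. `effect_size` is the expected impact in standard
    deviations of the outcome (Cohen's d). This ignores clustering:
    cluster-level randomization multiplies the requirement by a design
    effect that grows with intracluster correlation."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # 1.96 for alpha = 0.05
    z_power = NormalDist().inv_cdf(power)          # 0.84 for 80% power
    return ceil(2 * ((z_alpha + z_power) / effect_size) ** 2)

# Detecting a 0.2 standard deviation effect needs ~393 people per arm;
# a 0.5 standard deviation effect needs only ~63.
print(sample_size_per_arm(0.2), sample_size_per_arm(0.5))
```

The quadratic dependence on effect size is why small implementers can still host proof-of-concept studies of large expected effects, while detecting modest effects rules them out quickly.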

ii) Flexibility

A willingness to try different versions of the program and adapt elements in response to discussions with researchers makes an attractive implementing partner. As discussed above, we can learn a lot by testing different parts of a program together and separately, or by comparing different approaches to the same problem against each other. The best partnerships are those where researcher and implementer work together to find the most interesting versions of the program to test.

iii) Technical programmatic expertise, yet representative

There is a risk of testing a program run by an inexperienced implementer, finding a null result, and generating the response, “Of course there was no impact; you worked with an inexperienced implementer.” The researcher also has less to learn from an inexperienced implementer, and thus the partnership risks becoming one-sided. At the other end of the spectrum, we may not want to work with a "gold-plated" implementer unless we are doing a “proof-of-concept” evaluation of the type discussed above. There are two risks here: that the program is so expensive that it will never be cost-effective even if it is effective; and that it relies on unusual and difficult-to-reproduce noncash resources, such as a few highly dynamic mentors who would be hard to replace. An implementer working at a very big scale is unlikely to run a gold-plated program and has already shown the program can be scaled. It is also possible to work with a smaller implementer that closely follows a model used by others.

iv) Local expertise and reputation

Implementers who have been working with a population for many years have in-depth knowledge of local formal and informal institutions, population characteristics, and geography that is invaluable in designing and implementing an evaluation. What messages are likely to resonate with this population? What does success look like, and how can we measure it? When I started working in Sierra Leone I spent a long time traveling around the country with staff from Statistics Sierra Leone, Care, and the Institutional Reform and Capacity Building Project. I learned that it was socially acceptable to ask about the bloody civil war that had just ended but that asking about marital disputes could get us thrown out of the village. From Tajan Rogers I learned that every rural community in Sierra Leone (and some urban ones) comes together for “road brushing,” clearing encroaching vegetation from the dirt road that links the community to the next and even building the common palm-log bridges over rivers. How often this activity took place and what proportion of the community took part became our preferred measure of collective action and has been used in many papers since.

Just as importantly, an implementer who has been working locally has a reputation in local communities that it would take a researcher years to build. This reputation can be vital. We learn little about the impact of a program if suspicion around the implementer means that few take up the program.

Researchers need to understand how valuable this reputational capital is to the implementer. What may seem like reluctance to try new ideas may be a fully justified caution to put their hard-won reputation on the line.

v) Low staff turnover

There are many difficulties in working with governments and donor organizations, but perhaps the hardest to overcome is high staff turnover. As we have emphasized, evaluation is a partnership of trust and understanding, and this takes time to build. All too often a key government or donor counterpart will move on just as an evaluation is reaching a critical stage. Their successor may be less open to evaluation, want to test a different question, be against randomization, or simply be uninterested. The only way a researcher can protect the evaluation is to try to build relationships at many levels throughout the implementing organization so that the loss of one champion does not doom the entire project. But this may not be sufficient. One of the many advantages of working with local NGOs is that they tend to have greater stability in their staffing.

vi) Desire to know the truth and willingness to invest in uncovering it

The most important quality of an implementing partner is the desire to know the true impact of an intervention and a willingness to devote time and energy to helping the researcher uncover the truth. Many organizations start off enthusiastic about the idea of an evaluation but at some point realize that a rigorous evaluation may conclude their program does not have a positive impact. At this point, two reactions are possible: a sudden realization of all the practical constraints that will make an evaluation impossible, or a renewed commitment to learn.

In Running Randomized Evaluations (Glennerster and Takavarasha 2013), we quote Rukmini Banerji of Pratham at the launch of an evaluation of Pratham's flagship “Read India” program:

"[The researchers] may find that it doesn't work. But if it does not work, we need to know that. We owe it to ourselves and the communities we work with not to waste their and our time and resources on a program that does not help children learn. If we find that this program isn't working, we will go and develop something that will."

It is not just that an unwilling partner can throw obstacles in the path of an effective evaluation. An implementing partner needs to be an active and committed member of the evaluation team. Problems will inevitably come up during the evaluation that the implementer will have to help solve, often at a financial or time cost to themselves. The baseline may run behind schedule and implementation will need to be delayed until it is complete; transport costs of the program will be higher because implementation communities will be further apart than they otherwise would be, to allow for comparison groups; roll-out plans must be set further in advance than normal to allow for the evaluation; selection criteria must be written down and followed scrupulously, reducing the discretion of local staff; and some promising program areas must be left for the comparison group. Partners will only put up with these problems and actively help solve them if they fully appreciate the benefits of a high-quality evaluation and understand why these restrictions are necessary.

This commitment to the evaluation needs to be at many levels of the organization. If the headquarters in Delhi want to do an impact evaluation but the local staff don’t, it is not advisable for HQ to force the evaluation through because it is the staff at the local level who will need to be deeply involved in working through the details with the researcher. Similarly, if the local staff are committed but the HQ is not, there will be no support for the extra time and cost the implementer will need to participate in the study. Worst of all is when a funder forces an unwilling implementer to do an RCT run by a researcher. Being involved in a scenario of this kind will suck up months of a researcher's time trying to come up with evaluation designs that the implementer will in turn find some way to object to.

If this level of commitment to discovering the unvarnished truth sounds a little optimistic, there are practical ways to make an impact evaluation less threatening to a partner. An implementer who does many types of programs has less at stake from an impact evaluation of one of their programs than an organization that has a single signature program. Another option is to test different variants of a program rather than the impact of the program itself. For example, testing the pros and cons of weekly versus monthly repayment of microcredit loans (Field and Pande 2008) is less threatening than testing the impact of microcredit loans. In some cases researchers have started relationships with implementers by testing a question that is less threatening (although potentially less interesting). As the partnership has built up trust, the implementing partner has opened up more and more of their portfolio to rigorous testing.

J-PAL and our many partner organizations are in the process of a massive recruitment effort to fill 100+ positions all over the world. The most common position is as a research associate either helping to run a randomized evaluation or analyzing the data from an RCT (often in the US). Many only stay in these positions for a couple of years: they are stepping stones. Where do they lead?

In an RA position you learn a lot about data: how to write a good survey, how to supervise a team of enumerators, how to clean data, and (in a few positions) how to analyze it. This highly practical training is a great entry point and complement to the theoretical training in an economics or political science PhD, and many of my RAs have gone on to top PhD programs at Harvard, MIT, UC Berkeley, etc. Dan Keniston, for example, was one of my first RAs and is now an assistant professor at Yale.

But practical training in how to collect high-quality data and run rigorous impact evaluations is useful well beyond academics. There are now enough organizations doing high-quality impact evaluations that it is possible to build a career moving up and between these organizations. For example, Mike Duthie, an RA of mine in Bangladesh, went on to become Country Director for Innovations for Poverty Action (IPA) in Sierra Leone and now works for Social Impact managing projects all over the world. Buddy Shah and Andrew Fraker worked as RAs for me in India and (with others) launched their own impact evaluation NGO, IDInsight, which is expanding rapidly. Tricia Gonwa worked with me in the J-PAL Cambridge office, then was Country Director for IPA in Liberia, and now works for the Gender Lab at the World Bank.

It is also possible to have a career within J-PAL and IPA: Iman Sen was an RA on my project in Bangladesh and then in Cambridge, and is now Assistant Director of Research at J-PAL South Asia. Shobhini Mukerji worked as an RA in J-PAL South Asia and is now Executive Director there.

Many of these jobs offer RAs from Europe and the US the opportunity to live and work in a developing country. When I first wanted to get into development work I faced a catch-22: I could not get a job in development without having worked in a developing country, but I could not get that experience without already having it. For many, the chance to become an integral part of another society is one of the most rewarding aspects of the job. There are few other jobs where you have the chance to work in such a diverse and close-knit team. Some RAs go on to build businesses based on these experiences. I have already mentioned IDInsight. Bureh belts is another example. It was launched by Grant Bridgman (South Africa), Dan Heyman (US), and Fatoma Momoh (Sierra Leone).

If you'd like to apply for a research associate position at J-PAL, click here for more information.

While the media and many agencies have focused on the concern that the Ebola outbreak may lead to rises in food prices, our latest results suggest the dangers lie elsewhere. Our latest round of market surveys took place in October, shortly after all Sierra Leoneans were asked to stay at home for two days. Since our last report the geographic burden of the disease has shifted considerably and new cordon restrictions have been imposed in Port Loko, Moyamba, and Bombali.

Figure 1: Geographic Spread of Confirmed Cases and Cordon Restrictions, September 18 and October 23

The main results of our survey are:

1. Prices of basic food commodities at markets are not significantly higher in October than they were at this time in previous years, nor are they higher on average in cordon areas.

2. The number of traders selling basic food items has continued to fall in all districts. In Kailahun and Kenema (the first districts to be cordoned) there are 69 percent fewer domestic rice traders than in 2012, while the decline in newly cordoned areas is 29 percent. This suggests the major economic threat is not food prices but income, especially for those who produce cash crops and are dependent on selling their product to traders.

Figure 2: Number of Domestic Rice Traders per Market

3. There are outliers where prices are much higher, and there are more of these outliers than in normal years.

4. An increasing number of markets are closed. In most of these cases traders report they are selling food from their homes. However, it will be important to monitor food security at the household level to ensure that food (at reasonable prices) is reaching households, especially in remote locations.

5. Very preliminary data, however, suggest a new risk to food security. Rainfall in September was much higher than usual for that time of year, though it began to fall back in October. This may reduce, or at the very least delay, the rice harvest.

Figure 3: Average Rainfall by Month in Sierra Leone and Liberia

a. Sierra Leone

b. Liberia

Other data collection efforts are attempting to capture the decline in economic activity more generally as well as utilization of health care for non-Ebola-related health issues. We will report on the results of this work as soon as they are available.