health systems, monitoring, evaluation, learning.

Month: December 2013

*this blog post was also cross-posted on people, spaces, deliberation, including as one of the top 10 posts of 2014.

In a recent blog post on stories, and following some themes from an earlier talk by Tyler Cowen, David Evans ends by suggesting: “Vivid and touching tales move us more than statistics. So let’s listen to some stories… then let’s look at some hard data and rigorous analysis before we make any big decisions.” Stories, in this sense, are potentially idiosyncratic and over-simplified and, therefore, may be misleading as well as moving. I acknowledge that this is a dangerous situation. However, there are a couple of things that are frustrating about the above quote, intentional or not.

Second, it suggests that the main role of stories (words) is to dress up and humanize statistics – or, at best, to generate hypotheses for future research. This seems both unfair and out-of-step with increasing calls for mixed-methods to take our understanding beyond ‘what works’ (average treatment effects) to ‘why’ (causal mechanisms) – with ‘why’ probably being fairly crucial to ‘decision-making’ (Paluck’s piece worth checking out in this regard).


In this post, I try to make the case that there are important potential distinctions between anecdotes and stories/narratives that are too often overlooked when discussing qualitative data, focusing on representativeness and the counterfactual. Second, I suggest that just because many researchers do not collect or analyse qualitative work rigorously does not mean it cannot (or should not) be done this way. Third, I make a few remarks about numbers.


As a small soapbox and aside, even calls for mixed-methods give, in my opinion, unnecessary priority to quantitative data and statistical analysis for making causal claims. A randomized controlled trial – randomizing who gets a treatment and who will remain in the comparison group – is a method of assigning treatment. It doesn’t *necessarily* imply what kind of data will be collected and analyzed within that framework.


Anecdotes, narratives and stories

As to the danger of stories, what Evans, Cowen, and others (partly) caution against is believing, using or being seduced by anecdotes – stories from a single point of view. Here I agree – development decisions (and legislative and policy decisions more generally) have too often been taken on the basis of a compelling anecdote. But not all stories are mere anecdotes, though this is what is implied when ‘hard data’ are equated with ‘statistics’ (an equation that becomes all the more odd when, say, the ‘rigorous’ bit of the analysis is referred to as the ‘quantitative narrative’).


Single stories from single points in time in single places – anecdotes – are indeed potentially dangerous and misleading. Anecdotes lack both representativeness and a counterfactual – both of which are important for making credible (causal) claims and both of which are feasible to accomplish with qualitative work. As revealed with the phrase ‘quantitative narrative,’ humans respond well to narratives – they help us make sense of things – the trick is to tell them from as many perspectives as possible to not un-mess the messiness too far.


Representativeness: It is clear from the growing buzz about external validity that we need to be cautious of even the most objective and verifiable data analysed in the very most rigorous and internally valid way because it simply may not apply elsewhere (e.g. here and here). Putting this concern aside for a moment, both qualitative and quantitative data can be collected to be as representative of a particular time and place and circumstance as possible. I say more about this, below.


Counterfactuals: Cowen notes that many stories can be summed up as ‘a stranger came to town.’ True, to understand something causal about this (which is where anecdotes and tales following particular plot-lines can lead us astray), we would like to consider what would have happened if the stranger had not come to town and/or what happened in the town next door that the stranger by-passed. But those are still stories and they can be collected in multiple places, at multiple time points. Instead of dismissing it or using it only as window-dressing, we can demand more of qualitative data so that it can tell a multi-faceted, multi-perspectival, representative story.


That rigor thing

Perhaps it seems that we have a clearer idea of how to be rigorous with collecting and analysing quantitative data. I don’t think this is necessarily true – but it does seem that many quant-focused researchers trying out mixing their methods for the first time don’t even bother to consider how to make the qualitative data more rigorous by applying similar criteria as they might to the quant part. This strikes me as very odd. We need to start holding qualitative data collection and analysis to higher standards, not be tempted to scrap it just because some people do it poorly. An excellent piece on this (though there are plenty of manuals on qualitative data collection and analysis) is by Lincoln and Guba.


They suggest that ‘conventional’ rigor addresses internal validity (which they take as ‘truth value’), external validity, consistency/replicability and neutrality. (The extent to which quantitative research in the social sciences fulfils all these criteria is another debate for another time.) They highlight the concept of ‘trustworthiness’ – capturing credibility, transferability, dependability and confirmability – as a counterpart to rigor in the quantitative social sciences. It’s a paper worth reading.


Regardless of what types of data are being collected, representativeness is important to being able to accommodate messiness and heterogeneity. If a research team uses stratification along several criteria to select its sample for quantitative data collection (or intends to look at specific heterogeneities/sub-groups for the analysis), it boggles my mind why those same criteria are not used to select participants for qualitative data. Why does representativeness so often get reduced to four focus groups among men and four among women?
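As a rough sketch of what reusing the quantitative strata could look like, the snippet below picks qualitative participants from every cell of a sampling frame, treating the treatment arm as one of the strata so that comparison-group voices are not left out. The field names (`arm`, `sex`, `wealth`), the function `stratified_pick` and the toy frame are all illustrative assumptions, not taken from any particular study.

```python
import random
from collections import defaultdict


def stratified_pick(frame, strata_keys, per_stratum, seed=0):
    """Pick up to `per_stratum` people at random from every stratum in the frame."""
    rng = random.Random(seed)  # fixed seed so the selection can be reproduced
    groups = defaultdict(list)
    for person in frame:
        stratum = tuple(person[key] for key in strata_keys)
        groups[stratum].append(person)
    picks = {}
    for stratum, members in sorted(groups.items()):
        k = min(per_stratum, len(members))  # small strata contribute everyone
        picks[stratum] = rng.sample(members, k)
    return picks


# Hypothetical sampling frame: five people in each arm x sex x wealth cell.
cells = [
    (arm, sex, wealth)
    for arm in ("treatment", "comparison")
    for sex in ("female", "male")
    for wealth in ("poorer", "richer")
    for _ in range(5)
]
frame = [
    {"id": i, "arm": arm, "sex": sex, "wealth": wealth}
    for i, (arm, sex, wealth) in enumerate(cells)
]

picks = stratified_pick(frame, ("arm", "sex", "wealth"), per_stratum=2)
for stratum, people in picks.items():
    print(stratum, [person["id"] for person in people])
```

The point of the sketch is only that the same strata can drive both samples; how many participants each stratum needs is a separate question.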


Equally puzzling, qualitative data are too often collected only in the ‘treated’ groups. Why does the counterfactual go out the window when we are discussing open-ended interview or textual data? Similarly, qualitative work has a counterpart to statistical power and sample size considerations: saturation. Generally, when the researcher starts hearing the same answers over and over, saturation is ‘reached.’ A predetermined number of interviews or focus groups does not guarantee saturation. Research budgets and timetables that take qualitative work seriously should start to accommodate that reality. In addition, Lincoln and Guba suggest that length of engagement – with observations over time also enhancing representativeness – is critical to credibility.
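One hedged way to make the saturation idea concrete is to track how many previously unheard codes each successive interview contributes, and to treat saturation as ‘reached’ only after several interviews in a row add nothing new. The sketch below assumes interviews have already been coded into sets of themes; the example codes, the function names and the three-interview window are hypothetical choices for illustration, not a standard.

```python
from typing import Iterable, List, Optional, Set


def new_codes_per_interview(interviews: Iterable[Set[str]]) -> List[int]:
    """Count how many previously unseen codes each interview contributes."""
    seen: Set[str] = set()
    counts: List[int] = []
    for codes in interviews:
        fresh = codes - seen
        counts.append(len(fresh))
        seen |= fresh
    return counts


def saturation_point(interviews: Iterable[Set[str]], window: int = 3) -> Optional[int]:
    """Return the 1-based interview number at which `window` consecutive
    interviews have added no new codes, or None if that never happens."""
    run = 0
    for number, n_new in enumerate(new_codes_per_interview(interviews), start=1):
        run = run + 1 if n_new == 0 else 0
        if run >= window:
            return number
    return None


# Hypothetical coded interviews, in the order they were conducted.
coded = [
    {"cost", "distance"},
    {"cost", "trust"},
    {"distance"},
    {"trust", "cost"},
    {"cost"},
]
print(new_codes_per_interview(coded))  # [2, 1, 0, 0, 0]
print(saturation_point(coded))         # 5
print(saturation_point(coded[:4]))     # None: a fixed count of four falls short
```

Run separately for each stratum or treatment arm, a tally like this also shows why a predetermined number of interviews cannot guarantee saturation.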


The nature of qualitative work, with more emphasis on simultaneous and iterative data collection and analysis, can make use of that time to follow up on leads and insights revealed over the study period. Also bizarre to me is that quant-focused researchers tend to spend much more time discussing data analysis than data collection and coding for quantitative stuff but then put absolutely all the focus (of the limited attention-slice qualitative gets) on collecting qualitative data and none into how those data are analysed or will be used. Too often, the report tells me that a focus group discussion was done and, if convenient, it is pointed out that the findings corroborate or ‘explain’ the numeric findings. Huh? If I am given no idea of the range of answers given (let’s say the counterpart of a minimum and a maximum) or how ‘common themes’ were determined, that thing that one person said in a focus group just becomes an anecdote with no real ‘place’ in the reporting of the results except as a useful aside.


One more thing on credibility – the equivalent of internal validity. Lincoln and Guba say that credibility *requires* using member-checks (stay tuned for a paper on this), which means sharing the results of the analysis back with those who provided the raw data so that interpretations can, at least in part, be co-constructed. This helps prevent off-the-mark speculation and situation analyses but also helps to break down the need to ‘represent’ people who ‘cannot represent themselves’ – as Said quotes from Marx. I’ve said a few things about this sort of shared interpretation here, recognizing that respondents’ perceptions will reflect the stories they tell themselves. That said, as development researchers increasingly look at nudging behavior, the stories (not-always-rational) actors tell themselves are potentially all the more important. We need to collect and present them well.


One key hurdle I see with enhancing the perceived rigor and non-anecdotal-ness of qualitative work is that it is hard to display the equivalent of descriptive statistics for textual/interview data. That doesn’t mean we shouldn’t try. In addition, it is more difficult and unwieldy to share (even ‘cleaned’) qualitative data than the quantitative equivalent, as increasingly happens to allow for replication. Still, if this would enhance some of the credibility of the multifaceted stories revealed by these data, it is worth pushing this frontier.


Numbers aren’t always clean

In terms of stories we tell ourselves, one is that data are no longer messy (and, often by implication, are clean, hard, ‘true’) because they fit in a spreadsheet. Everything that happened in the field – all the surveyors’ concerns about certain questions or reports of outright lying – often seems to fade from view as soon as the data make it into a spreadsheet. If you ask a farmer how many chickens he has and he gives you a story about how he had 5, 2 got stolen yesterday but his brother will give him 4 tomorrow, regardless of what number the enumerator records, the messiness has been contained for the analyst but not in the reality of the farmer that is meant to be represented.


In general, if we want to talk about creating credible, causal narratives that can be believed and that can inform decision-making at one or more levels, we need to talk about (a) careful collection of all types of data and (b) getting better at rigorously analysing and incorporating qualitative data into the overall ‘narrative’ for triangulating towards the ‘hard’ truth, not equating qualitative data with anecdotes.

Assume that no referee reports are truly anonymous. It is fine to be critical but always be polite.

Skim the paper within a couple of days of receiving the request – my metro rides are good for this – so you can quickly tell whether this is a paper that is well below the bar for some obvious reason and can be rejected as quickly as possible.

Unless it is immediate junk, read the paper once and return to it a week later with deeper thoughts and a fresh mind.

Referee within one month.

Remember you are the referee, not a co-author. I hear a lot that young referees in particular write very long reports, which try and do way more than is needed to help make a paper clear, believable and correct. I think 2 pages or less is enough for most reports.

Your report should not assume that the editor has a working knowledge of the paper.

The first paragraph should summarize the contribution. Reviewers should provide a concise summary of the paper they review at the start of their report and then provide a critical but polite evaluation of the paper.

Explain why you recommend that the paper be accepted, rejected, or revised.

If you would like the editor to accept the paper, your recommendation must be strong. The more likely you think the paper is to merit a revision the more detailed should be the comments.

The referee report itself should not include an explicit editorial recommendation. That recommendation should be in a separate letter to the editor.

If you consistently recommend rejection, then the editor recognizes you are a stingy, overly critical person. Do not assume that the editor will not reveal your identity to the authors. In the long run, there are no secrets.

If you recommend acceptance of all papers, then the editor knows you are not a discriminating referee.

Possible considerations:

Research question and hypothesis:

Is the researcher focused on well‐defined questions?

Is the question interesting and important?

Are the propositions falsifiable?

Has the alternative hypothesis been clearly stated?

Is the approach inductive, deductive, or an exercise in data mining? Is this the right structure?

Research design:

Is the author attempting to identify a causal impact?

Is the “cause” clear? Is there a cause/treatment/program/first stage?

Is the relevant counterfactual clearly defined? Is it compelling?

Does the research design identify a very narrow or a very general source of variation?

Could the question be addressed with another approach?

Useful trick: ask yourself, “What experiment would someone run to answer this question?”

Theory/Model:

Is the theory/model clear, insightful, and appropriate?

Could the theory benefit from being more explicit, developed, or formal?

Are there clear predictions that can be falsified? Are these predictions “risky” enough?

Does the theory generate any prohibitions that can be tested?

Would an alternative theory/model be more appropriate?

Could there be alternative models that produce similar predictions—that is, does evidence on the predictions necessarily weigh on the model or explanation?

Is the theory a theory, or a list of predictions?

Is the estimating equation clearly related to or derived from the model?

Data:

Are the data clearly described?

Is the choice of data well‐suited to the question and test?

Are there any worrying sources of measurement error or missing data?

Are there sample size or power issues?

How were data collected? Are recruitment and attrition clear?

Is it clear who collected the data?

If data are self-reported, is this clear?

Could the data sources or collection method be biased?

Are there better sources of data that you would recommend?

Are there types of data that should have been reported, or would have been useful or essential in the empirical analysis?

Is attrition correlated with treatment assignment or with baseline characteristics in any treatment arm?

Empirical analysis:

Are the statistical techniques well suited to the problem at hand?

What are the endogenous and exogenous variables?

Has the paper adequately dealt with concerns about measurement error, simultaneity, omitted variables, selection, and other forms of bias and identification problems?

Is there selection not just in who receives the “treatment”, but in who we observe, or who we measure?

Is the empirical strategy convincing?

Could differencing, or the use of fixed effects, exacerbate any measurement error?

Are there assumptions for identification (e.g. of distributions, exogeneity)?

Were these assumptions tested and, if not, how would you test them?

Are the results demonstrated to be robust to alternative assumptions?

Does the disturbance term have an interpretation, or is it just tacked on?

Are the observations i.i.d., and if not, have corrections to the standard errors been made?

What additional tests of the empirical strategy would you suggest for robustness and confidence in the research strategy?

Are there any dangers in the empirical strategy (e.g. sensitivity to identification assumptions)?

Is there potential for Hawthorne effects or John Henry-type biases?

Results:

Do the results adequately answer the question at hand?

Are the conclusions convincing? Are appropriate caveats mentioned?

What variation in the data identifies the elements of the model?

Are there alternative explanations for the results, and can we test for them?

Could the author have taken the analysis further, to look for impact heterogeneity, for causal mechanisms, for effects on other variables, etc?

Is absence of evidence confused with evidence of absence?

Are there appropriate corrections for multiple comparisons, multiple hypothesis testing?

Scope:

Can we generalize these results?

Has the author specified the scope conditions?

Have causal mechanisms been explored?

Are there further types of analysis that would illuminate the external validity, or the causal mechanism at work?

Are there other data or approaches that would complement the current one?


without too much detail, i’ll just note that i spent more time in the hospital in undergrad than i would have preferred. often times, i, being highly unintelligent, would wait until things got really bad and then finally decide one night it was time to visit the ER – uncomfortable but not non-functional or incoherent. on at least one occasion – and because she’s wonderful, i suspect more – alannah (aka mal-bug, malice, malinnus) took me there and would do her homework, sometimes reading out loud to me to keep me entertained and distracted. in one such instance, she was studying some communications theories, one of which was called or nicknamed the onion theory of two-way communication. the basic gist is that revealing information in a conversation should be a reciprocal unpeeling. i share something, shedding a layer of social divide, then you do and we both feel reasonably comfortable.

it didn’t take too long to connect that this was the opposite of how my interaction with the doctor was about to go. the doctor would, at best, reveal her name and i would be told to undress in order to be examined, poked and prodded. onion theory, massively violated.

i mention all this because i have just been reading about assorted electronic data collection techniques, namely here, via here. first, i have learned a new word: ‘paradata.’ this seems useful. these are monitoring and administrative data that go beyond how many interviews have been completed. rather, they focus on the process of collecting data. it can include the time it takes to administer the questionnaire, how long it takes a surveyor to locate a respondent, details about the survey environment and the interaction itself (i’d be particularly interested in hearing how anyone actually utilizes this last piece of data, in particular, in analyzing the survey data itself. e.g. would you give less analytic weight to an interview marked ‘distracted’ or ‘uncooperative’ or ‘blatantly lying?’).
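as a very rough sketch of one way such paradata might be put to use, the snippet below flags interviews that ran much shorter than the median so that someone can follow up on them. the interview ids, the durations, the 0.5 threshold and the function name `flag_short_interviews` are all made up for illustration; whether and how to down-weight a flagged interview remains the open question.

```python
from statistics import median


def flag_short_interviews(durations_min, threshold=0.5):
    """Return ids of interviews shorter than `threshold` times the median duration."""
    cutoff = threshold * median(durations_min.values())
    return sorted(
        interview_id
        for interview_id, minutes in durations_min.items()
        if minutes < cutoff
    )


# hypothetical paradata: minutes taken to administer each questionnaire
durations = {"hh01": 42, "hh02": 38, "hh03": 12, "hh04": 45, "hh05": 40}
print(flag_short_interviews(durations))  # ['hh03']
```

the same pattern could be applied per surveyor or per day of fieldwork, which is closer to the mid-course monitoring the responsive-design literature has in mind.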

the proposed process of monitoring and adjustment bears striking resemblance to other discussions (e.g. pritchett, samji and hammer) about the importance of collecting and using monitoring data to make mid-course corrections in research and project implementation. it does feel like there is a certain thematic convergence underway about giving monitoring data its due. in the case of surveying, it feels like there is a slight shift towards the qualitative paradigm, where concurrent data collection, entry and analysis and iterative adjustment are the norm. not a big shift but a bit.

but on the actual computer bit, i am less keen. a survey interview is a conversation. a structured conversation, yes. potentially an awkward conversation and almost certainly one that violates the onion theory of communication. but even doctors – some of the ultimate violators – complain about the distance created between themselves and a patient by having a computer between them during an examination (interview), as is now often required to track details for pay-for-performance schemes (e.g.). so, while i appreciate and support the innovations of responsive survey design and recognize the benefits of speed and aggregation over collecting the same data manually, i do wish we could also move towards a mechanism that doesn’t have the surveyor behind a screen (certainly a tablet would seem preferable to a laptop). could entering data rely on voice more than keying in answers to achieve the same result? are there other alternatives to at least maintain some semblance of a conversation? are there other possibilities to both allow the flexibility of updating a questionnaire or survey design while also re-humanizing ‘questionnaire administration’ as a conversation?

“Principles and proposals for a more credible research publication”, an early draft white paper on best practices for social science journals, by Don Green, Macartan Humphreys, and Jenny Smith.

Abstract: In recent years concerns have been raised that second-rate norms for analysis, reporting, and data access limit the gains that should follow from first-rate research strategies. At best, deficient norms slow the accumulation of knowledge; at worst, they result in a body of published work littered with results that are flawed, fragile, false, or in some cases, fraudulent. Scholars across disciplines have proposed a number of innovations for journal reform that seek to counter these problems. We review these and other ideas and offer a blueprint for a “best-practices” social science journal.


This blog is a cross-post with Suvojit. Update 21 December: the conversation has also continued here.

Recently, Givewell has revised its recommendation on one of its previously top-ranked ‘charities,’ the Against Malaria Foundation (AMF), which focuses on well-tracked distributions of bednets. Givewell “find[s] outstanding giving opportunities and publish the full details of our analysis to help donors decide where to give.” This approach seems to have succeeded in moving donors beyond using tragic stories and heart-wrenching images to raise funds, looking rather at effectiveness and funding gaps.

In their latest list, AMF does not rank amongst the top three recommended charities. Here, based on the experience with AMF, we outline the seeming result of Givewell’s attention on AMF, consider the possible lessons and ask whether Givewell seems to have learnt from this episode, taking clear steps towards changing their ranking methods to avoid similar mishaps in future. As it stands, around US$ 10m now lie parked (transparently and hopefully temporarily) with AMF as a result of its stalled distributions, a fact for which Givewell shares some responsibility.

Givewell lays out its thinking on revising AMF’s recommendation in detail. As a quick re-cap of that blog post: when Givewell looked at AMF two years ago, AMF was successfully delivering bednets at the small- to medium-scale (up to hundreds of thousands in some cases) through partnerships with NGOs (only the delivery of health products such as bednets and cash transfers meet Givewell’s current eligibility criteria). Following Givewell’s rating, a whole bunch of money came in, bumping AMF into a new scale, with new stakeholders and constraints. The big time hasn’t been going quite so well (as yet).

This is slippery ground for a rating service seeking credibility in the eyes of its donors. Currently, Givewell ranks charities on several rating criteria, including: strong evidence of the intervention’s effectiveness and cost-effectiveness; whether a funding gap exists and resources can be absorbed; and the transparency of activities and accountability to donors.

In its younger/happier days, AMF particularly shone on transparency and accountability. Recognizing that supplies of bednets are often diverted and don’t reach the intended beneficiaries, AMF is vigilant about providing information on ‘distribution verification’ as well as household continued use and upkeep of nets.

These information requirements – shiny at the small scale – create a glare at large-scale, which is part of the problem AMF now faces. ‘Scale’ generally means ‘government’ unless you are discussing a country like Bangladesh with nationwide NGO networks. The first hurdle between information and governments is that the required data can be politically sensitive. Information on distribution and use is great for accountability to donors but it can be threatening to government officials, who want to appear to be doing a good job (and/or may benefit from distributing nets to particular constituents or adding a positive price, etc).

As a second, equally important, hurdle: even if government agencies intend to carry out the distribution as intended (proper targeting etc), data collection has high costs (monetary, personnel, and otherwise) – especially when carried out country-wide. AMF doesn’t actually fund or support collection of the data on distribution and use that they require of the implementing agencies. AMF is probably doing this to keep its own costs low, instead passing collection costs and burdens on to the local National Malaria Control Programmes (NMCP), which is definitely not the best way to make friends with the government. Many government bureaucracies in Sub-Saharan Africa are constrained not only by funds but also by capacity to collect and manage data about their own activities.

What do these data needs mean for donors and what do they mean for implementers? For donors, whose resources are scarce, information on transparency and delivery can guide where to allocate money they wish to give. Givewell, by grading on transparency of funding flows and activities, encourages NGOs to compete on these grounds. Donors feel they have made a wise investment and the NGOs that have invested in transparency and accountability benefit from increased visibility.

At issue is that there seems to exist a tension between focusing on transparency and the ability to achieve impact on the ground. If the donor, and possibly Givewell, do not fully take into account institutions (formal and informal), organizational relationships and bureaucratic politics, the problem of a small organization not being able to replicate their own successful results at scale may resurface. Givewell says that it vets a given charity but it is not clear what role potential implementing partners play in this process. Givewell likely needs to account for the views of stakeholders critical to implementation, including those people and organizations that may become more important stakeholders given a scale-up. The fact that NMCPs (or the relevant counterpart) as well as bilaterals and multilaterals are hesitant to work with AMF could have been weighed into Givewell’s algorithm.

Givewell seems to be listening and recognizing these challenges, first by its publicly reasoned response to AMF’s performance, second by posting reviews (in particular, this recent review by Dr. de Savigny) and third by updating its selection criteria for 2013, including a consideration of scalability. de Savigny’s review raises AMF’s strategies in working with governments, both coordinating with donor governments and supporting ‘recipient’ governments with determining data needs and collecting data.

What else can Givewell do now? Expand the criteria beyond need, evidence-base (intervention and organization) and commitment to transparency by also including:

Feedback from previous implementing partners.

Specific project proposals from applicants, in which they lay out a plan to implement their activity in a specific country. Potential funding recipients should think through and detail their government engagement strategy and gain statements of buy-in from likely implementing partners – global and local – in that context.

Givewell should more carefully calibrate how much money goes to organizations for proposed projects. Funding based on engagement in a particular country can help avoid problems of getting too much too fast: funding can be pegged to the requirements of the specific project that has been put up, for which the organization has need and absorptive capacity.


perhaps like many people in public health, i take the fortification of salt with iodine – the prevention of several thyroid-related disorders and the widespread return of the neck ruff – as one of public/global health’s major achievements. up there with smallpox, water treatment (for sanitation and potentially with fluoride) and really-we-are-nearly-there-but-stuff-keeps-happening polio.

the WHO declared a universal salt iodization strategy in 1993 (in quito, if you try to keep up with the location-names of these declarations). there have been recent successes in central asia, among other places, in reversing the cognitive and other negative effects of iodine deficiency. iodization of salt is an appealing strategy for promoting public health because it requires very little effort from front-line workers or potential users. fortification is a neat, technocratic solution to a serious problem. people use salt regularly, out of necessity (though often use more than is necessary), and – voilà! – unconsciously ingest something extra that’s good for them. salt’s pretty important; of course, it used to be traded for gold (and human beings) and as a recent poetic-wax highlighted, salt is constitutive of human emotions and activities, in the form of sweat and tears. and, though i am not sure it has inspired poetry (perhaps among campers?), iodine has to be ingested because human bodies do not produce it on their own though they need it.

but iodization is a technocratic solution only right up till you recognize the politics behind it (as with most technical solutions to development). it had not fully registered with me until i re-read kurlansky’s salt – despite the proliferation of a rainbow of artisanal and heirloom sea salts, rock salts, probably moon salts, at whole foods and trader joe’s – precisely what mass iodization meant for local salt works around the world. kurlansky notes that country decisions to ban non-iodized salt are “popular with health authorities, doctors and scientists, but very unpopular with small independent salt producers.” India banned non-iodized salt in 1998, only to repeal the ban in 2000. among other arguments for repeal, the ban went against “Gandhi’s assertion that every Indian had a right to make salt.” oops. that old controlling-salt-production-is-and-always-has-been-super-political thing.

kurlansky suggests that small salt works have neither the money nor the knowledge to iodize their own salt up to government standards, so good salt comes from large national manufacturers and from outside. but deficits of knowledge and money are generally fixable problems, so this answer to combating iodine deficiency seems… deficient.

partly, at issue is the silo-ed approach to development, where very few projects link directly with national strategies for economic development, though many projects note that poverty reduction and growth promotion *are* national priorities. we might just skip the contents of their actual strategy. we talk about country ownership (hey paris, hey accra), we talk about local capacity-building, we talk about alignment with, say, national health and education priorities, but we don’t talk enough about furthering development through all these projects by buying local (meaning more than that one shirt you bought from that one women’s co-op that one time you were visiting that one project in that one country — which especially doesn’t count if that project was focused on SMEs or entrepreneurship and your shirt is not from one of them).

we don’t, i believe, talk as much as we should about the use of locally manufactured products in global health and development projects more generally. there are, to be sure, political and economic difficulties to a work-local-buy-local approach, since donor countries also have national self-interest to consider. and there are technical and logistical difficulties because many places arguably in need of development projects also don’t have manufacturing processes that are up to global standards, perhaps coddled too long by import substitution strategies that did not have an eye towards exporting and competing. it would take time and effort to build local production capacity and supply chains — and we need to work quickly!

so, health commodities come in, building materials come in, food supplies come in, machinery and equipment come in, often human capital comes in – and development is meant to logically follow. but bringing stuff ‘in’ has big implications for local livelihoods. a comment this week about a large development project in timor-leste describes the “lost opportunity” of not using local materials that would support local employment or small businesses. earlier this year, julie walz and vijaya ramachandran at cgd wrote about promoting local procurement in haiti, noting that this would do “double duty” by “purchas[ing] immediately needed goods or services [and helping] grow the private sector, creat[ing] jobs, and encourag[ing] entrepreneurs.”

two-birds, one stone sounds pretty good. so… can we start talking about this as part of the post-2015 discussions? over probably-not-iodized but tres-good gourmet sel gris popcorn? it supports this adorable old french salt harvester.