Musings on research, international development and other stuff

Tag Archives: Monitoring and Evaluation

Evidence-informed policy – it’s a wonderful thing. But just how widespread is it? The ‘Show your workings’ report from the Institute for Government (and collaborators Sense About Science and the Alliance for Useful Evidence) has asked this question and concluded… not very. It states that “there [are] few obvious political penalties for failing to base decision[s] on the best available evidence”. I have to say that, as a civil servant, this rings true. It’s not that people don’t use evidence – most civil servants, at least where I work, do. But there are no good systems in place to distinguish between people who have systematically looked at the full body of evidence and appraised its strengths and weaknesses, and those who have referenced a few cherry-picked studies to back up their argument.

Rosie is my actual cat’s name. And she does indeed make many poor life decisions. Incidentally, I named my other cat ‘Mouse’ and now that I am trying to teach my child to identify animals I am wondering just how wise a life decision that was…

The problem for those scrutinising decision making – parliament, audit bodies and, in the case of development, the Independent Commission for Aid Impact – is that if you are not a topic expert it can be quite hard to judge whether the picture of evidence presented in a policy document does represent an impartial assessment of the state of knowledge. The IoG authors realised this was a problem quite early in their quest – and came up with a rather nifty solution. Instead of trying to decide if decisions are based on an unbiased assessment of evidence, they simply looked at how transparent decision makers had been about how they had appraised evidence.

Now, on the evidence supply side there has been some great work to drive up transparency. In the medical field, Ben Goldacre is going after pharmaceutical companies all guns blazing to get them to clean up their act. In international development, registers of evaluations are appearing and healthy debates are emerging on the nature of pre-analysis plans. This is vitally important – if evaluators don’t declare what they are investigating and how, it is far too easy for them not to publish findings which are inconvenient, or to try multiple types of analysis until, by chance, one gives them a more agreeable answer.

But as the report shows, and as others have argued elsewhere, there has been relatively little focus on transparency on the ‘demand’ side. And by overlooking this, I think we might have been missing a trick. You see, it turns out that the extent to which a policy document explicitly sets out how evidence has been gathered and appraised is a rather good proxy for systematic evidence appraisal. And the IoG’s hypothesis is that if you could hold decision makers to account for their evidence transparency, you could go some way towards improving the systematic use of evidence to inform decisions.

The report sets out a framework which can be used to assess evidence transparency. As usual, I have a couple of tweaks I would love to see. I think it would be great if the framework more explicitly included an assessment of the search strategy used to gather the initial body of evidence – and perhaps rewarded people for making use of existing rigorous synthesis products such as systematic reviews. But in general, I think it is a great tool and I really hope the IoG et al. are successful in persuading government departments – and crucially those who scrutinise them – to make use of it.

One of the things I love about working in DFID is that people take the issue of beneficiary* feedback very seriously. Of course we don’t get it right all the time. But I like to think that the kind of externally designed, top-down, patronising solutions that are such a feature of the worst kind of development interventions (one word: BandAid**) are much less likely to be supported by the likes of DFID these days.

In fact, beneficiary feedback is so central to how we do our work that criticising it in any way can be seen as controversial; some may see it as tantamount to saying you hate poor people! So just to be clear, I think we can all agree that getting feedback from the people you are trying to help is a good thing. But we do need to be careful not to oversell what it can tell us. Here are a couple of notes of caution:

1. Beneficiary feedback may not be sufficient to identify a solution to a problem

It is of course vital to work with potential beneficiaries when designing an intervention to ensure that it actually meets their needs. However, it is worth remembering that what people tell you they need may not match what they will actually benefit from. Think about your own experience – are you always the best placed person to identify the solution to your problems? Of course not – because we don’t know what we don’t know. It is for that reason that you consult with others – friends, doctors, tax advisors etc. to help you navigate your trickiest problems.

I have come across this problem frequently in my work with policy making institutions (from the north and the south) that are trying to make better use of research evidence. Staff often come up with ‘solutions’ which I know from (bitter) experience will never work. For example, I often hear policy making organisations identify that what they need is a new interactive knowledge-sharing platform – and I have also watched on multiple occasions as such a platform has been set up and has completely flopped because nobody used it.

2. Beneficiary feedback on its own won’t tell you if an intervention has worked

Evaluation methodologies – and in particular experimental and quasi-experimental approaches – have been developed specifically because just asking someone if an intervention has worked is a particularly inaccurate way to judge its effectiveness! Human beings are prone to a whole host of biases – check out this Wikipedia entry for more biases than you ever realised existed. Of course, beneficiary feedback can and should form part of an evaluation, but you need to be careful about how it is gathered – asking a few people who happen to be available and willing to speak to you is probably not going to give you a particularly accurate overview of user experience. The issue of relying on poorly sampled beneficiary feedback was at the centre of some robust criticisms of the Independent Commission for Aid Impact’s recent review of anti-corruption interventions – see Charles Kenny’s excellent blog on the matter here.

If you are trying to incorporate beneficiary feedback into a rigorous evaluation, a few questions to ask are:

1. Have you used a credible sampling frame to select those you get feedback from? If not, there is a very high chance that you have a biased sample – like it or not, the type of person who ends up being easily accessible to you as a researcher will tend to be an ‘elite’ in some way.
2. Have you compared responses in your test group with responses from a group which represents a counterfactual situation? If not, you are at high risk of simply capturing social desirability bias (i.e. the desire of those interviewed to please the interviewer).
3. If gathering feedback through a translator, are you confident that the translator is accurately conveying both what you are asking and the answers you get back? There are plenty of examples of translators who, in a misguided effort to help researchers, put their own ‘spin’ on the questions and/or answers.
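To make the sampling point concrete, here is a toy simulation – all numbers and probabilities are invented purely for illustration – of how a convenience sample that over-represents easy-to-reach ‘elites’ can skew feedback compared with a simple random sample:

```python
import random

random.seed(42)  # fixed seed so the sketch is reproducible

# Imagine 1,000 beneficiaries rating satisfaction from 1 to 5.
# Suppose the well-connected 'elite' minority (easier for a researcher
# to reach) also happens to be more satisfied with the intervention.
population = (
    [{"elite": True, "satisfaction": random.choice([4, 5])} for _ in range(200)]
    + [{"elite": False, "satisfaction": random.choice([1, 2, 3])} for _ in range(800)]
)

def mean_satisfaction(sample):
    return sum(p["satisfaction"] for p in sample) / len(sample)

# A credible sampling frame: a simple random sample of 100 beneficiaries.
random_sample = random.sample(population, 100)

# A convenience sample: elites are far more likely to end up being interviewed.
convenience_sample = [
    p for p in population if random.random() < (0.4 if p["elite"] else 0.03)
]

true_mean = mean_satisfaction(population)
print(f"True mean satisfaction:      {true_mean:.2f}")
print(f"Random-sample estimate:      {mean_satisfaction(random_sample):.2f}")
print(f"Convenience-sample estimate: {mean_satisfaction(convenience_sample):.2f}")
```

With these made-up parameters the convenience sample dramatically overstates satisfaction, while the random sample lands close to the true population mean – which is exactly the kind of distortion a credible sampling frame is there to prevent.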

Even once you have used a rigorous methodology to collect your beneficiary feedback, it may not be enough to tell the whole story. Getting feedback from people will only ever tell you about their perception of success. In many cases, you will also need to measure some more objective outcome to find out if an intervention has really worked. For example, it is common for people to conclude their capacity building intervention has worked because people report an increase in confidence or skills. But people’s perception of their skills may have little correlation with more objective tests of skill level. Similarly, those implementing behaviour change interventions may want to check if there has been a change in perceptions – but they can only really be deemed successful if an actual change in objectively measured behaviour is observed.


I guess the conclusion to all this is that of course it is important to work with the people you are trying to help both to identify solutions and to evaluate their success. But we also need to make sure that we don’t fetishise beneficiary feedback and as a result ignore the other important tools we have for making evidence-informed decisions.


* I am aware that ‘beneficiary’ is a problematic term for some people. Actually I also don’t love it – it does conjure up a rather paternalistic view of development. However, given that it is so widely used, I am going to stick with it for this blog. Please forgive me.

** I refuse to provide linklove to Bandaid but instead suggest you check out this fabulous Ebola-awareness song featured on the equally fabulous Africaresponds website.

I have written before about the separate functions of evidence supply and demand. To recap, supply concerns the production and communication of research findings while demand concerns the uptake and usage of evidence. While this model can be a useful way to think about the process of evidence-informed policy making, it has been criticised for being too high level and not really explaining what evidence supply and demand looks like in the real world – and in particular in developing countries.

I was therefore really pleased to see this paper from the CLEAR centre at the University of the Witwatersrand, which examines in some detail what supply and demand for evidence – in this case specifically evaluation evidence – look like in five African countries.

What is particularly innovative about this study is that they compare the results of their assessments of evaluation of supply and demand with a political economy analysis and come up with some thought-provoking ideas about how to promote the evidence agenda in different contexts. In particular, they divide their five case study countries into two broad categories and suggest some generalisable rules for how evidence fits in to each.

Developmental patrimonial: the ‘benevolent dictator’

Two of the countries – Ethiopia and Rwanda – they categorise as broadly developmental patrimonial. In these countries, there is strong centralised leadership with little scope for external actors to influence. Perhaps surprisingly, in these countries there is relatively high endogenous demand for evidence; the central governments have a strong incentive to achieve developmental outcomes in order to maintain the government’s legitimacy and therefore, at least in some cases, look for evaluation evidence to inform what they do. These countries also have relatively strong technocratic ministries which may be more able to deal with evidence than those in some other countries. It is important to point out that these countries are not consistently and systematically using research evidence to inform decisions and that in general they are more comfortable with impact evaluation evidence which has clear pre-determined goals rather than evidence which questions values. But there does seem to be some existing demand and perhaps the potential for more in the future. When it comes to supply of evaluations, the picture is less positive: although there are examples of good supply, in general there is a lack of expertise in evaluations, and most evaluations are led by northern experts.

Neopatrimonial: a struggle for power and influence

The other three countries – Malawi, Zambia and Ghana – are categorised as broadly neopatrimonial. These countries are characterised by patronage-based decision making. There are multiple interest groups competing for influence and power, largely via informal processes. Government ministries are weaker, and stated policy may bear little relationship to what actually happens. Furthermore, line ministries are less influenced by the Treasury, and thus incentives for evidence from the Treasury are less likely to have an effect. However, the existence of multiple influential groups does mean that there are more diverse potential entry points for evidence to feed into policy discussions. Despite these major differences in demand for evidence, evaluation supply in these countries was remarkably similar to that in developmental patrimonial countries – i.e. some examples of good supply but, in general, relatively low capacity and reliance on external experts.

I have attempted to summarise the differences between these two categories of countries – as well as the commonalities – in the table below.

There are a couple of key conclusions which I drew from this paper. Firstly, if we are interested in supporting the demand for evidence in a given country, it is vital to understand the political situation to identify entry points where there is potential to make some progress on use of evidence. The second point is that capacity to carry out evaluations remains very low despite a large number of evaluation capacity building initiatives. It will be important to understand whether existing initiatives are heading in the right direction and will produce stronger capacity to carry out evaluations in due course – or whether there is a need to rethink the approach.

Another week, another blog pointing out that RCTs are not the ‘gold standard’ of evidence despite the fact that NOBODY is saying they are. To be fair to the blogger, he is simply summarising a paper written by Angus Deaton – a man who is a bit of an enigma to me. I have heard him speak and been blown away by how thoughtful, insightful and challenging he is – until he comes to the topic of RCTs, when he seems to become strawmantastic. Anyway, I’ve written about misconceptions about RCTs so many times in the past that I am sure you are bored of hearing me – in fact I am even bored of hearing myself drone on about it. So, in lieu of another post on this matter, I present to you links to previous posts (here, here and here)… and a picture I have drawn for you of a baby panda. Enjoy.

I suspect that one reason that bad capacity building programmes have persisted for so long is that monitoring and evaluation of capacity building has been so poor. It is commonplace for capacity building programmes to be ‘assessed’ almost entirely on the basis of subjective measurements of how much people have enjoyed the experience or how much they think they have learnt. Of course it is lovely that people enjoy themselves – but surely we should be trying a bit harder to find out if people have actually learnt anything.

There are some exceptions where more rigorous approaches have been used and they illustrate just how vital it is that we get a bit more objective in our assessments.

A multi-million pound science communication capacity building programme (which I won’t name!) had an independent evaluation which compared outputs produced by participants before and after they took part in the scheme. The assessment found NO significant difference in the quality of outputs. A bit of a depressing finding.

A train-the-trainers workshop I ran used a diagnostic test before and after the course to test knowledge of basic principles of pedagogy. The test did reveal a significant increase in scores – although it was notable that a full third of participants continued to get the wrong answers even after the intensive course. More worryingly, observations of teaching practices carried out in the months following the course revealed that many participants had reverted to their old, bad teaching habits. This certainly taught me the importance of follow-up mentoring and within-workplace support for learning.

In both the above examples, participants themselves rated the capacity building programmes as excellent – further illustrating that people’s subjective view of the experience may differ significantly from a more objective assessment of what has been learnt.
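To illustrate why a ‘significant increase’ in test scores doesn’t tell the whole story, here is a hedged sketch of the kind of pre/post diagnostic analysis described above – the scores, pass mark and group size are all invented, not data from the actual workshop:

```python
from statistics import mean, stdev
from math import sqrt

# Invented pre- and post-course diagnostic scores (% correct) for 12 participants.
pre  = [40, 55, 35, 60, 45, 50, 30, 65, 42, 58, 38, 52]
post = [70, 75, 50, 85, 55, 80, 45, 90, 60, 82, 48, 78]

# Per-participant improvement.
diffs = [after - before for before, after in zip(pre, post)]

# Paired t-statistic: mean improvement divided by its standard error.
t = mean(diffs) / (stdev(diffs) / sqrt(len(diffs)))
print(f"Mean improvement: {mean(diffs):.1f} points, paired t = {t:.2f}")

# A clearly significant average gain can coexist with many participants
# still falling short of an absolute standard.
pass_mark = 60
still_failing = sum(1 for score in post if score < pass_mark)
print(f"{still_failing} of {len(post)} participants still score below {pass_mark}%")
```

The point of the sketch is that both numbers matter: the paired test tells you whether the course moved the average, while the pass-mark count tells you whether participants actually reached an acceptable level – and, as the workshop example shows, even that says nothing about whether the gains survive contact with the workplace.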

I strongly believe that if we implemented better monitoring and evaluation of capacity building programmes, it would be quite depressing to start with because it would prove that lots of the stuff we are doing is not working. But it would provide a mighty big incentive for all of us to up our game and start adapting capacity building programmes so they could make a real difference.

So that’s it, those are my four simple rules. What do others think? Would you add other rules? Or do you think I am being too harsh on capacity building programmes, and they are actually generally better than I have implied? Thoughts welcomed!

Stefan Dercon and Paul Clist recently published this excellent short paper outlining 12 principles to consider before using a Payment by Results (PbR) contract for development programmes. But, as pointed out by @hmryder, it is written in quite technical language. You can’t blame the authors – I mean, they are hardcore economists who probably speak that way when they are watching the football. So I have attempted to translate the paper for fellow simple folk – economists do let me know if I have made any mistakes.

Principle 1: PbR involves paying for something after it has been delivered. Therefore it only works if the implementer has enough money in the first place to pay for the work until they are reimbursed.

Principle 2: If you are going to pay based on results, you need to be able to measure the results. If you choose a proxy indicator (i.e. not the final result you are looking for but something that has to change along the way), you need to make sure that changes in your indicator really suggest that the end result will change too.

Principle 3: Some people will game the system by finding ways to make it seem that they have achieved the results when they actually haven’t. Perhaps more worrying is that if you choose the wrong proxy indicator, it might lead people to concentrate too much on trying to achieve that without trying to achieve the actual end result you are looking for.

Principle 4: Donors shouldn’t use PbR just as a way to reduce their risk, for two reasons. Firstly, donors are actually usually much better able to handle risk than implementing partners. This is because donors tend to be funding lots of projects, so if one or two go wrong, they still know they have others that should work. Implementers, on the other hand, may only have one project so they are likely to be really risk averse. The second reason is that the implementer is already likely to be very susceptible to risk and by transferring the additional risk of potential non-payment, you will probably just make them even more risk averse.

Principle 5: If the thing that you want to achieve is essentially the same as the thing the implementer wants to achieve, PbR may not be that useful. PbR should be used to incentivise implementers to do the thing that you want them to do, and you might be wasting effort if they are already fully incentivised to do that thing anyway.

Principle 6: PbR is useful where it is difficult to measure what the implementers are doing (their inputs), and you therefore need to measure what they are achieving. If you can easily measure what they are doing, just do that.

Principle 7: PbR works well when achieving the result you are looking for is (more or less) within the control of the implementers. It doesn’t work well when there are loads of factors outside the implementers’ control which will determine whether the result is achieved.

Principle 8: The biggest extra cost of PbR contracts compared to other contracts is the cost of verifying whether results (or a suitable proxy indicator of results) have been achieved.

Principle 9: There is some evidence that trying to incentivise people who are already very motivated to do something by giving them money can actually backfire – they may feel insulted that you think they need to be paid to do something when actually they want to do it because they think it is the right thing. (I wrote about this a bit here).

Principle 10: Donors need to be honest about the practical constraints they are working under and to be aware when these might get in the way of an effective PbR contract.

Principle 11: You can only judge whether your PbR contract has been successful by looking to see whether the end result you were aiming for has actually been achieved. Just showing that a proxy indicator has been achieved is not enough.

Principle 12: Remember that PbR is not the only tool in the box for incentivising performance.

In a previous post I talked about the issue of many capacity building initiatives using self-reported ability as a measure of impact. To further illustrate this point, I decided to carry out a small scientific experiment. I gathered a randomly selected group of study participants and gave them the following instructions:

1. Rate their tiddly wink playing ability on a scale of 1 (dreadful) to 5 (outstanding)
2. Tiddle their wink (no winking of tiddles was allowed)
3. Measure how far they had tiddled their wink (in cm)

(I suspect that the above may be the best sentence I will EVER blog)

The results are shown in the graph below. Now as any scientist will tell you, the R^2 value written on this graph indicates that this (poorly designed, probably manipulated ;-)) experiment definitively proves that my pre-existing prejudices were correct i.e. self-reported ability to play tiddly winks is not correlated with actual ability.
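For anyone who fancies replicating the analysis, here is a minimal sketch of the R^2 calculation, using invented data (not my actual results) for self-rated ability against measured distance:

```python
# Invented data: self-rated tiddly-wink ability (1-5) and measured
# distance tiddled (cm), constructed to show a near-zero correlation.
self_rating = [1, 2, 2, 3, 3, 3, 4, 4, 5, 5]
distance_cm = [41, 18, 55, 30, 47, 22, 38, 15, 28, 44]

n = len(self_rating)
mean_x = sum(self_rating) / n
mean_y = sum(distance_cm) / n

# Pearson correlation coefficient r, then R^2 = r squared.
cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(self_rating, distance_cm))
var_x = sum((x - mean_x) ** 2 for x in self_rating)
var_y = sum((y - mean_y) ** 2 for y in distance_cm)
r = cov / (var_x * var_y) ** 0.5

# An R^2 near zero means self-rating explains almost none of the
# variance in how far people actually tiddled.
print(f"R^2 = {r ** 2:.3f}")
```

With these made-up numbers the R^2 comes out well under 0.1 – the same flavour of result as my graph, where self-reported ability told you essentially nothing about actual performance.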

So what is the conclusion of this? Well, for one thing, if you ever see a log frame for a project to build tiddly winking capacity with ‘increase in self-reported tiddly wink ability’ as the verifiable indicator, I trust you will refer the project team to this definitive rebuttal.

But in addition, I think we should be a lot more sceptical of any project which uses perceived increase in ability as the measure of success. Of course, there may be some abilities which ARE correlated with self-reported ability. But I suggest that this correlation needs to be demonstrated before self-reporting can be used as a reliable proxy indicator.