If you pay teachers based on performance do you get better teachers, or do they work harder?

If a country announces a policy that it will pay teachers based on some measure of performance, two main things could happen. One, it could get different (maybe better, maybe worse) people applying for teaching jobs. And two, the teachers it hires could work harder. Either or both of these could translate into different learning outcomes for students.

These questions are at the heart of a neat new paper by Claire Leaver, Owen Ozier, Pieter Serneels and Andrew Zeitlin. They set up an experiment in partnership with the Rwanda Education Board and the Ministry of Education. It’s a two-tier experiment. In stage 1, different labor markets for teachers (district-subject combinations) are randomized into either a fixed wage contract (a 20,000 RWF top up) or a pay for performance contract (P4P). The P4P contract gives a bonus of 100,000 RWF (or 15 percent of the annual salary) to teachers who score in the top 20 percent based on presence in the classroom, preparation (lesson plans), pedagogy (captured in classroom observation) and performance (measured through test scores of their students). The first 3 criteria together get 50 percent of the weight, and performance gets the other half.
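To make the bonus rule concrete, here is a minimal Python sketch. It assumes each of the four components is already scaled to a 0-1 score and that the three input measures split their half of the weight equally. The post only gives the 50/50 split, so the equal split, the function names and the sample teachers are illustrative assumptions, not the authors' scoring method.

```python
BONUS_RWF = 100_000   # bonus size stated in the post
TOP_PERCENT = 20      # share of teachers who earn the bonus


def composite_score(presence, preparation, pedagogy, performance):
    """Half the weight on the three input measures, half on student performance.

    Assumption: the three inputs are averaged with equal weight.
    """
    inputs = (presence + preparation + pedagogy) / 3
    return 0.5 * inputs + 0.5 * performance


def bonus_winners(scores):
    """Return the names of teachers ranked in the top 20 percent.

    scores maps a teacher name to a composite score. Ties are broken by
    insertion order, which is an arbitrary choice made for this sketch.
    """
    n_winners = len(scores) * TOP_PERCENT // 100
    ranked = sorted(scores, key=scores.get, reverse=True)
    return set(ranked[:n_winners])


# Hypothetical teachers, for illustration only.
teachers = {
    "A": composite_score(0.95, 0.50, 0.60, 0.70),
    "B": composite_score(1.00, 0.90, 0.80, 0.90),
    "C": composite_score(0.90, 0.40, 0.50, 0.40),
    "D": composite_score(0.80, 0.60, 0.70, 0.60),
    "E": composite_score(0.97, 0.55, 0.50, 0.50),
}
winners = bonus_winners(teachers)

# A 20 percent chance of 100,000 RWF averages out to 20,000 RWF per teacher,
# the same amount as the fixed wage top up (simple arithmetic on the
# figures in the post, assuming equal odds of reaching the top 20 percent).
average_bonus_rwf = BONUS_RWF * TOP_PERCENT // 100
```

With five teachers, exactly one lands in the top 20 percent, so `winners` holds a single name.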

In stage 2, Leaver and co. take all of the schools where an upper-primary teacher recruited under the first stage had applied and been placed, and then re-randomize each school into either a P4P contract or a fixed wage contract. This lets them separate out the composition effects of recruitment from the effort of folks on the job. And lest you be worried that some people get the short end of the stick in this re-randomization, they provide a retention bonus (promised ex ante) so that all folks (no matter their initial assignment and beliefs on probabilities) are made whole.
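The two-stage design can be sketched in a few lines of Python. This is a simplified illustration: the market and school names are hypothetical, and the simple balanced shuffle used here is an assumption that may differ from the authors' actual assignment procedure (which could, for example, be stratified).

```python
import random
from collections import Counter


def balanced_assign(units, arms, rng):
    """Shuffle the units, then deal them out to the arms in turn."""
    shuffled = list(units)
    rng.shuffle(shuffled)
    return {unit: arms[i % len(arms)] for i, unit in enumerate(shuffled)}


rng = random.Random(0)  # fixed seed so the sketch is reproducible

# Stage 1: each labor market (district-subject pair) gets an ADVERTISED contract.
markets = [f"district{d}-{s}" for d in range(1, 5) for s in ("math", "english")]
advertised = balanced_assign(markets, ["FW", "P4P"], rng)

# Stage 2: each school with a placed recruit gets an EXPERIENCED contract,
# drawn independently of what was advertised in the recruit's market.
schools = {f"school{i}": markets[i % len(markets)] for i in range(16)}
experienced = balanced_assign(schools, ["FW", "P4P"], rng)

# Cross-tabulate schools by (advertised, experienced) contract.
cells = Counter((advertised[schools[s]], experienced[s]) for s in schools)
```

Comparing across advertised arms while holding the experienced contract fixed speaks to who was recruited; comparing across experienced arms while holding the advertised contract fixed speaks to effort on the job.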

Before turning to the results, it’s worth discussing two methodological dimensions that are important here. First, because Leaver and co. are randomizing at the level of the market (in the first stage), their ex ante power will be low. Part of their insurance for this is a very careful, thoughtful and actually instructive pre-analysis plan (indeed, they blogged about it last year).

Second, Leaver and co. collect a ton of data. They start with data on teachers, where they have the teacher training college exam data, as well as results from the exams candidates have to sit at the district level in order to get these jobs. Once teachers show up on the job, Leaver and co. collect school surveys (where they get a range of administrative data on teachers), teacher surveys (which include not only demographics and background, but also personality traits, self-esteem and other attributes), and a set of lab-in-the-field games that they play with teachers (more on these later).

That covers the teachers’ attributes. For the students, Leaver and co. developed subject- and grade-specific tests based on the national curricula. These are given to a random subset of students in the 3 rounds of data collection.

For the teacher-on-the-job data, Leaver and co. (via the government and IPA) conduct spot checks to measure teacher presence, review a set of lesson plans, and have an observer sit through and measure different activities during a 45-minute lesson. In the summary statistics, it’s interesting to see that in year 1 (which comprised two rounds of data collection) the teachers are there 96-97 percent of the time, but they only have lesson plans 54 percent of the time. That shifts in year two (which only has one round) when teachers are present 90 percent of the time, but have lesson plans 79 percent of the time.

So what do they find? The best way to describe their results is by their hypotheses (in many cases they use more than one test and data source – I am going to focus on the top-level results here so that this post isn’t as long as the paper):

1. Advertised P4P induces differential application qualities. Based on the training college final exam scores, the answer seems to be no. This also seems to hold for some other measures. Moreover, there is no difference in the volume of applications (although this is less precisely estimated).

2. Advertised P4P affects the observable skills of recruits placed in schools. Using the teacher skill assessment, they find no significant difference.

3. Advertised P4P induces differentially ‘intrinsically’ motivated recruits to be placed in schools. This is where the lab-in-the-field games come into play. In particular, they play a framed version of the dictator game, where teachers have to choose between keeping money and allocating it towards providing school supply packets to students. And here they find something: the teachers recruited under the P4P offer provide 10 percentage points less of the pot to the students.

4. Advertised P4P induces the selection of higher (or lower) value-added teachers. Now Leaver and co. turn to student outcomes, using the tests that they gave them. And they can’t reject the null that there is no impact of advertised P4P on the student outcomes.

So, looking at recruitment effects as a whole, they don’t find anything in terms of ability, but do find teachers who are more likely to keep money rather than handing it over to their students.

5. Experienced P4P creates incentives which contribute to higher (or lower) teacher value-added. Recall that once teachers were on the job, they might have gotten a different contract type – this will let Leaver and co. separate out the recruitment from on-the-job effects. Here again, Leaver and co. turn to the student test results. And boom, students with teachers getting a pay for performance contract perform better. This effect turns out to be small in year one and larger in year two. In the second year, the impact is equivalent to moving a student from the 50th to 56th percentile of test scores: “a modest but certainly economically meaningful result.”

6. Selection and incentive effects are apparent in the teacher performance metrics. Here Leaver and co. see significant increases in teacher presence and classroom practices in the group that got the P4P contracts.
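The percentile comparison under hypothesis 5 can be translated into standard deviation units with a short Python sketch. It assumes test scores are normally distributed, which the post does not state, so the result is a back-of-the-envelope conversion rather than the paper's reported estimate. The function name is a hypothetical helper.

```python
from statistics import NormalDist


def percentile_shift_to_sd(p_from, p_to):
    """Gap between two percentiles of a standard normal, in SD units."""
    std_normal = NormalDist()  # mean 0, standard deviation 1
    return std_normal.inv_cdf(p_to) - std_normal.inv_cdf(p_from)


# Moving from the 50th to the 56th percentile, assuming normal scores.
effect_sd = percentile_shift_to_sd(0.50, 0.56)
print(round(effect_sd, 2))  # 0.15
```

Under that normality assumption, the year two effect corresponds to roughly 0.15 standard deviations of student test scores.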

Leaver and co. then depart from the pre-analysis plan and look at the dynamics of teacher retention and composition over the two years for which they have data. First of all, there is no significant difference in the retention rate between P4P and fixed wage schools: both lose about 20 percent of their teachers. Beyond the average rate of attrition, there is no clear evidence that the P4P contract induces teachers of different skill or different intrinsic motivation to stick around.

So, to sum it all up: pay for performance doesn’t get you differently skilled teachers at the start. The only big difference is that the teachers it attracts are more likely to keep money for themselves rather than give it to their students, so potentially they have some kind of different internal motivation. Once on the job, though, pay for performance seems to induce somewhat higher effort (in some dimensions) from teachers, with resulting gains in learning outcomes for students. How these will play out over a longer time horizon, as this new form of contract sticks around, is an interesting question.

Markus Goldstein is currently a Lead Economist in the Office of the Chief Economist for Africa at the World Bank, where he leads the Gender Innovation Lab.
