Proof of progress Part 2

Back in January I described the comparative judgement trial that we were undertaking at Swindon Academy in collaboration with Chris Wheadon and his shiny, new Proof of Progress system.

Today, Chris met with our KS2 team and several brave volunteers from the secondary English faculty to judge the completed scripts our Year 5 students had written. Chris began proceedings by briefly describing the process and explaining that we should aim to make a judgement every 20 seconds or so. The process really couldn’t be simpler: the system displays two scripts at a time and you just have to judge which one you think is better.

Teachers judging as Chris looks on

When you’ve made your decision, you simply click either the Left or the Right button and the next two scripts are presented. It’s as simple as that.

We had 97 candidates’ scripts to judge, 27 judges and we took a grand total of 7 minutes to arrive at an extremely reliable rank order. When essays are marked by trained examiners using a mark scheme they typically achieve a reliability of between 0.6 and 0.7. We managed a reliability score of 0.91! What this means is that if anyone else were to place our scripts in a rank order there’s a very high probability that they’d arrive at the same order.
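For anyone curious how a pile of left/right clicks can turn into a rank order, here is a minimal sketch. It assumes a Bradley-Terry style model, in which the chance of one script beating another depends only on the difference between their scores; this post does not say which model Proof of Progress actually fits, and the function names, step size and penalty below are illustrative choices rather than anything taken from the system.

```python
import math


def fit_quality_scores(n_scripts, judgements, iterations=500, step=0.1, penalty=0.01):
    """Estimate one score per script from a list of (winner, loser) pairs.

    Assumed model: P(a beats b) = 1 / (1 + exp(-(score_a - score_b))).
    Fitted by plain gradient ascent on the log-likelihood; the small penalty
    keeps scores finite for scripts that never lose (or never win).
    """
    scores = [0.0] * n_scripts
    for _ in range(iterations):
        gradient = [0.0] * n_scripts
        for winner, loser in judgements:
            p_win = 1.0 / (1.0 + math.exp(-(scores[winner] - scores[loser])))
            # An unexpected result (low p_win) moves both scores the most.
            gradient[winner] += 1.0 - p_win
            gradient[loser] -= 1.0 - p_win
        scores = [s + step * (g - penalty * s) for s, g in zip(scores, gradient)]
        # Only differences matter, so keep the scale centred on zero.
        mean = sum(scores) / n_scripts
        scores = [s - mean for s in scores]
    return scores


def rank_order(scores):
    """Script indices from highest score to lowest."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)


# Hypothetical judgements on four scripts, each pair written as (winner, loser).
example_judgements = [(0, 1), (0, 2), (1, 2), (1, 3), (2, 3), (0, 3), (2, 1)]
example_scores = fit_quality_scores(4, example_judgements)
print(rank_order(example_scores))
```

Because every judgement nudges the scores of both scripts involved, a single odd decision by one judge gets diluted by everyone else’s decisions about the same scripts.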

Here are the scores for our 10 top ranked scripts:

And our 10 lowest placed scripts:

Here are our best and worst writers:

We then wanted to decide what the cut off point for ‘age related expectation’ for writing in Year 5 should be. The consensus was that ‘about 50%’ of the students were currently meeting an acceptable standard, so we looked at a couple of the scripts at the halfway point to determine whether they did indeed represent a reasonable standard of writing:

What do you think? While we’d been judging, our conversations were perfectly amicable and our decisions were consensual. As soon as we started discussing standards, no one could agree. This goes to show the impossibility of making holistic judgements. In the end, Chris said, “What do you think?” If a majority of us said either yes or were unsure, that should be our cut off point. In the end it was as straightforward, and as mysterious, as that. As Chris said, “If you think it’s around there, it’s there.” The standard becomes the actual work rather than some arbitrary verbiage in a rubric.

This might seem unsatisfying for teachers raised on the perceived objectivity of standards enshrined in rubrics, but it really is as accurate a way of determining a cut off as anything else. Chris explained that exam boards always start with the numbers, then they look at the candidates’ work for confirmation that the numbers are about right.

So, now we knew, with a reliability of over 90%, how many and exactly which of our students were performing at a standard we deemed acceptable. In terms of the simplicity and expediency of the process, everyone was happy, but what next? How could we judge whether our students were making progress?

This is easier than you might think: you simply take a sample of scripts for which you’ve already agreed a score and then include these known quantities in a second assessment to use as an anchor. Then, when you judge this second set of scripts you already know the value of some and all the other scripts can then be judged either better or worse. You then compare the ‘true’ scores for students in both assessment rounds and measure the change.
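The anchoring idea can be sketched in a few lines. This is a sketch under stated assumptions, not a description of how Proof of Progress does it: it assumes each judging session produces scores on its own arbitrary scale, and that the second session can be lined up with the first by shifting it until the anchor scripts sit, on average, where they sat before. The script names and numbers are made up for illustration.

```python
from statistics import mean


def link_to_first_round(round2_scores, anchor_round1_scores):
    """Put second-round scores onto the first-round scale.

    round2_scores: script id -> score from the second judging session,
                   including the anchor scripts that were re-judged.
    anchor_round1_scores: script id -> score already agreed in the first round.
    Returns the shifted scores of the non-anchor scripts only.
    """
    # How far the anchors appear to have moved is purely a change of scale,
    # because the anchor scripts themselves have not changed.
    shift = mean(anchor_round1_scores[a] - round2_scores[a] for a in anchor_round1_scores)
    return {
        script: score + shift
        for script, score in round2_scores.items()
        if script not in anchor_round1_scores
    }


# Hypothetical data: two anchors and two pupils.
anchor_round1_scores = {"anchor_A": 1.0, "anchor_B": -1.0}
round1_pupil_scores = {"pupil_1": 0.25, "pupil_2": -1.0}
round2_scores = {"anchor_A": 0.5, "anchor_B": -1.5, "pupil_1": 0.0, "pupil_2": -2.0}

linked = link_to_first_round(round2_scores, anchor_round1_scores)
progress = {pupil: linked[pupil] - round1_pupil_scores[pupil] for pupil in linked}
print(progress)  # {'pupil_1': 0.25, 'pupil_2': -0.5}
```

In this made-up example the anchors scored 0.5 lower in the second session, so everything from that session is moved up by 0.5 before the two rounds are compared. An alternative would be to hold the anchor scores fixed while fitting the second round; either way the anchors are what make the two sets of numbers comparable.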

In this way, you can arrive at a much more reliable measure of progress than is ever achieved by external exams and vastly better than that possible with teacher assessment. As a school, if the trend for a majority of children is upwards, that’s a fairly cast iron indication that progress is being made. If the trend is downward, that’s essential feedback on the quality of teaching! Our plan is to repeat this process in the summer term to measure how the performance of these children is (hopefully) improving.

But what about feedback? One of the criticisms of the process is that it’s all very well for making summative judgements, but how will it help students improve? Firstly, teachers taking feedback from students is as, if not more, important than students being given feedback. Finding out how effective our instruction is is pretty important, and this gives us some very useful information on how well students are performing. Secondly, we now have some very useful tools to feed back to students about what good looks like and some benchmarks to measure themselves against. We’re planning to get students to go through a ranking exercise themselves so that they can get a sense of the varying quality and the associated features of their peers’ work. This sets us up to be able to give precise, meaningful feedback to individuals on exactly what they need to do to improve.

We also got feedback on the accuracy of individual judges’ performance:

Chris explained that the lower the infit score, the better, but that anywhere between 0.5 and 1.5 is fine. Infit is short for inlier-sensitive or information-weighted fit, and is a measure of consistency. Infit values of 1 or below represent high consistency. Infit values of above 1.2 suggest inconsistency, possibly due to carelessness. Chris also told us that, “There’s no dishonour in being a bad judge” – like anything else, this is something we improve at with experience. (If you’re interested, I came out with an infit of 0.69, a median time of 7 seconds and made 49.06% left clicks.)
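To make the infit number a little less mysterious, here is a sketch of the usual information-weighted mean-square calculation for a single judge. It assumes the scripts already have fitted scores and that the chance of one script being chosen over another follows a logistic curve on the score difference; whether Proof of Progress computes it in exactly this way is not stated in this post, and the names and scores are hypothetical.

```python
import math


def win_probability(score_a, score_b):
    """Assumed model: chance that script a is preferred to script b."""
    return 1.0 / (1.0 + math.exp(-(score_a - score_b)))


def judge_infit(decisions, scores):
    """Information-weighted mean-square fit for one judge.

    decisions: list of (chosen, rejected) script ids for that judge.
    scores: script id -> fitted score from all judges' decisions combined.
    """
    squared_residuals = 0.0
    information = 0.0
    for chosen, rejected in decisions:
        p = win_probability(scores[chosen], scores[rejected])
        squared_residuals += (1.0 - p) ** 2  # surprise at what the judge chose
        information += p * (1.0 - p)         # how informative the pairing was
    return squared_residuals / information


# Hypothetical fitted scores for three scripts.
scores = {"strong": 1.5, "middling": 0.0, "weak": -1.5}

agrees_with_consensus = [("strong", "middling"), ("middling", "weak"), ("strong", "weak")]
disagrees_with_consensus = [("middling", "strong"), ("weak", "middling"), ("weak", "strong")]

print(judge_infit(agrees_with_consensus, scores))
print(judge_infit(disagrees_with_consensus, scores))
```

A judge who sides with the consensus ranking comes out below 1, and one who keeps choosing the script everyone else ranked lower comes out above 1, which matches the direction Chris described.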

The only concern that came from our teachers was the possibility that children might be penalised for poor handwriting. Chris told us that when they first started analysing their results they thought they were detecting a gender bias as boys were being consistently ranked lower than girls. When they typed up a sample of answers and put them in for further judging, the gender bias disappeared. What they had detected was actually a handwriting bias. Boys, on the whole, have messier handwriting than girls.

Maybe this simple fact might account for the gender gap in national exams? If so, it could be relatively simple to solve. I doubt whether boys’ messy handwriting is a function of their biology; it seems much likelier to be a cultural artefact. That being the case, surely to goodness we can teach boys to improve the way they hold a pen and neaten up their writing?

Our next step is to compare our rank order with that of students studying for PhDs in creative writing. My suspicion is that teachers’ judgements might be warped by the long habit of relying on rubrics to assess students’ work. All too often we end up teaching what’s on the rubric and missing out on other features of expert performance. We end up rewarding work which meets the mark scheme’s criteria even if we think it’s a bit ropey. Likewise, some students are penalised because although they write well, their work doesn’t obviously display the features a marker is primed to look for. It will be fascinating to see whether a different kind of ‘expert’ arrives at a different judgement.

This sounds like a technologically enhanced version of part of Britton, Rosen and Martin’s 1960s study of Multiple Marking of English Compositions. They found that holistic impression marking was highly accurate in discriminating between students’ achievement in writing. They also found that writing tasks often regarded as equivalent actually made different demands on the writer, and went on to investigate in detail the development of children’s writing abilities.

Hi John – I’m not familiar with this study – do you have a link?
It sounds though as if the findings might be the opposite. Laming and others have shown that holistic judgements are not possible – all judgement is comparative.

‘Multiple marking of English compositions: an account of an experiment’, by James N Britton; Nancy Martin; H Rosen. London: H.M.S.O., 1966. By ‘holistic’ I meant that they too didn’t use a mark scheme. They were testing the usual practice of the time to allocate marks to different aspects of writing, e.g. structure, punctuation, grammar etc. Their work resulted in major changes to O level English exams.

I’ve loved reading this blog, but there is very little criticism here! I hope it will show up in future posts! Just because there was a statistical consensus doesn’t mean it is correct! This is absolutely “judgement by fiat”. It is not a measurement model in the strictest sense of Coombs’ “data theory”. This is more like “Hot or Not,” that terrible TV show from about 10 or 15 years ago. Just because we can make a judgement doesn’t mean it is productive, helpful, useful, or will promote growth! I have a friend who was part of a study that found that, especially for young students, the number of words lines up statistically with a teacher’s holistic judgement of writing and many other more technical judgements of the piece of work. I’m guessing that is because students who were able to write more had clearer thinking about their writing and were able to get their words out on paper in the amount of time allocated. But when students have a chance to rewrite and revise, especially as they are older, this should no longer be true! I feel like we often judge because we can, not because we should!

” Just because we can make a judgement doesn’t mean it is productive, helpful, useful, or will promote growth!” No indeed. Neither can comparative judgement cure cancer! What this does – and ALL it does – is mark essays with greater reliability and validity than is possible with mark schemes. It’s also an order of magnitude more efficient. What’s not to like?

A colleague of mine frequently says (or maybe quotes someone) “In an unjust system, pragmatism is unjust action”. Just because using this software makes a job better doesn’t mean we should be doing that job! But, I have the privilege to say that because I’m no longer a teacher. I’ll be the first to admit that I had to give grades/marks even when I was uncomfortable doing so, and I did it anyway. But I have trouble seeing how this improves validity. I’ll stay tuned, I hope that you and your colleagues have some critical discussions after using this type of software. Maybe you’ll have a chance to look at the implications for validity. And discuss that the software may be able to do this same work even without a teacher . . . I look forward to future posts on this topic!

This improves validity because you’re not using a mark scheme which relies on a small number of indicative items of ‘expert performance’. This way you’re judging the work, not trying to fit the work to a rubric.

Usually validity is in relation to something else though- like set criteria, previous results, or theoretical frameworks. Are the rubrics so bad that they aren’t considered criteria for writing? And are considered to make marking less valid?!

Validity is about the depth of the domain which is sampled by a test. Rubrics, by their very nature, reduce validity by picking out criteria indicative of different levels of performance. Most of the expertise we use to judge work is tacit and not quantifiable by a rubric.

This is an interesting approach but differs significantly from current TA data because it is based on a single piece of writing being judged, not a collection and so more akin to a test in writing – do you think this is the way forward?

Hi Debbie – you can judge as many pieces of writing as you want. There is absolutely no reason why you couldn’t judge portfolios of children’s work – it’s just that each piece would be judged in isolation and not as a collection. And, yes – I really do think this is the way forward.

The only problem I can foresee when it comes to developing this on a departmental level is ego. If teacher X’s students are ranked higher than teacher Y’s students during a moderation meeting then teacher Y will feel insecure and inadequate, which could create a perverse incentive to provide too much scaffolding for the assessment.

The only way to entirely protect Teacher X’s fragile ego is not to do any standardisation or moderation at all. If the assessments are undertaken in test conditions – which ours were – then the problem of “too much scaffolding” should not occur.

I wonder if, generally, the appearance of a piece of writing does correspond to how good it is for writers below a certain standard, but that when a threshold is breached the worthiness of the writing is no longer related directly to what it looks like? (In 7 seconds you can’t read and digest the writing?).

For expert writers (which these are unlikely to be?) this methodology wouldn’t work?

Well, in 7 seconds are you really assessing the look of the piece of writing, rather than much of the content? (The handwriting, the grammar, some of the spelling? But not the content, much.) For non-expert writers do these factors actually correlate very well with sophistication of content as well? (Where they might not for expert writers?)

I’m still struggling to be sold that this is useful in any other area other than summative assessment. How much feedback do you get on your teaching with a 10 second scan of an essay (or multiple essays)?
If we really wanted to see if a child has made progress surely it is easier to compare an essay from the start of the year to one completed later in the year. It should be clear which is better and shows they’ve made progress during the year. That’s just one comparison per child. Will a “progress score” be meaningful to a pupil?
I can see a future for CJ in English GCSE exam q marking for instance. But a regular feature of assessment in a classroom? I just can’t see it.
Fascinating blog as ever though.
Damian

1. I’m not suggesting this should be a regular feature of assessment in a classroom. What I’m suggesting is that whenever a teacher has to mark a piece of work using a rubric, CJ will produce better reliability and validity. It will also be a more efficient use of time and will significantly reduce workload.

2. Judging is entirely different to giving feedback. If you want to give students individual feedback on every piece of work you still need to read through carefully. But what you can do is give whole class feedback where individual scripts are deconstructed. Instead of using rubrics you get to use the students’ actual work.

3. “If we really wanted to see if a child has made progress surely it is easier to compare an essay from the start of the year to one completed later in the year.” Yes. That is exactly what I have said. But if you do this using CJ you a) get much more reliable data and b) you get objective proof (much more reliable than that gathered through external examination data) of how well the school is doing.

Thank you for your response DD. I can certainly see that this system would be great at getting a reliable rank order but I still have reservations as to the benefits. Don’t get me wrong, I can see the benefits of deconstructing scripts but am unconvinced that pupils will then be given anything other than rubrics/criteria to improve their work. I genuinely await part 3 with interest.
Damian

This is interesting but I’m finding it complicated! How do you come up with the consensus that about 50% of the pupils are meeting the expected standard? Secondly, what are the ‘true scores’ based on? Thirdly, am I right in assuming that the rank order accuracy was achieved because the same scripts were ranked several times by different teachers? Thanks in advance and hope all’s well. K.

1. We came up with the consensus of 50% by asking the Head teachers of both primaries and the executive head of the academy to come up with a figure. We then looked at the work half way down the rank order to see if it met our expectations. If it does, that’s the standard. If it doesn’t, go up until you find a piece which does meet the standard.
2. Maths.
3. Yes. Each script was judged a minimum of 5 times – this took 7 minutes. Chris has shown that further judgements result in diminishing marginal returns in terms of reliability: https://nomoremarking.com/blog/AAGfCvS9xp2aE8SzY

Intended just to highlight the heritage of this idea, these intuitive qualitative judgements are central to the book, sure Pirsig reads too much into them- he was a crack pot- but does not mean there is not great value in this – hope it grows

Fascinated by the method I have spent the last few days attempting a reconstruction (reverse engineering) of this method, and after two (in retrospect) dumb approaches I figured out that they must be using something like the sort/merge algorithm for ordering a set of numbers. I implemented this, and have a simulation of the process up and running. If you are interested I would like to communicate further by email (howard_at_58@yahoo.co.uk)

The crucial thing affecting the correctness of a ranking is the probability of misjudgement or difference of opinion in the group of judges. The simulations really show this.
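A simulation along the lines described in this comment might look something like the sketch below. It takes the commenter’s sort/merge guess at face value; nothing in the post confirms that the system orders scripts this way. Each script’s “true” quality is just a number, and the hypothetical `error_rate` is the chance that any single judgement picks the worse script.

```python
import random


def noisy_better(a, b, error_rate, rng):
    """Simulated judge: says whether a is better than b, wrong with probability error_rate."""
    correct = a > b
    return correct if rng.random() >= error_rate else not correct


def merge_sort_by_judgement(scripts, error_rate, rng):
    """Merge sort (best first) where every comparison is one noisy judgement."""
    if len(scripts) <= 1:
        return list(scripts)
    mid = len(scripts) // 2
    left = merge_sort_by_judgement(scripts[:mid], error_rate, rng)
    right = merge_sort_by_judgement(scripts[mid:], error_rate, rng)
    merged = []
    i = j = 0
    while i < len(left) and j < len(right):
        if noisy_better(left[i], right[j], error_rate, rng):
            merged.append(left[i])
            i += 1
        else:
            merged.append(right[j])
            j += 1
    merged.extend(left[i:])
    merged.extend(right[j:])
    return merged


def misplaced_pairs(ranking):
    """Number of pairs of scripts that ended up in the wrong relative order."""
    return sum(
        1
        for i in range(len(ranking))
        for j in range(i + 1, len(ranking))
        if ranking[i] < ranking[j]
    )


rng = random.Random(0)
scripts = list(range(97))  # 97 scripts; the number stands in for true quality
rng.shuffle(scripts)
for error_rate in (0.0, 0.1, 0.3):
    ranking = merge_sort_by_judgement(scripts, error_rate, rng)
    print(error_rate, misplaced_pairs(ranking))
```

With an error rate of zero the ranking is perfect, and the count of misplaced pairs gives a rough way to see how quickly the ordering degrades as judges disagree more often.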

Very interested in this. If it makes things more reliable and saves time, that’s got to be good!

Just a thought… When making our comparison in such a short space of time, might we risk placing too much emphasis on the first few sentences of the work, potentially missing other aspects of the writing which appear later on in the text but are still worthy of credit?

Hi David,
I’m really interested in the CJ trial and your blog spurred me on to test out our Y4’s work on the website and see what happened. I think I understand the concepts etc but my question was – did you think about what exactly you judged in the 7ish seconds?
I am really interested in knowing what you thought it was you were basing your judgements on.
I came up with a list of things:- (in order)
Structure
-letter formation/handwriting
-punctuation including layout on the page, capitals, accuracy in full stops, use of paragraphs

So, all the lower end of the essays fell down on those aspects. They are pretty easy to spot when glancing over work – even the full stops.

This was much harder to judge quickly or to feel that I hadn’t skipped over children’s work where the handwriting was a bit more difficult to read.

I ask about this because it’s easier to judge whether a shade of colour is darker or lighter – I’m only looking at one thing. But I was curious to understand what I was looking at when I was judging the writing.

It’s important for two reasons. The first reason is that it affects how the essays at the upper end of the judgement are ranked. I’ll explain the second reason later.

If you have the time to digest the question and answer it, I’d be grateful.

The important thing to remember is that judging is not the same as marking. Accurately judging which is the better written of two scripts requires very little time. If you want to know why one script is better than another then you need to read them both thoroughly. The point is, aggregated comparative judgement is very much more reliable & accurate than marking using a mark scheme.

I’m not disputing that CJ might be more reliable and accurate, I’m asking you what did you think you were judging the work on. What do you think you are looking at in the short space of time, particularly on the top end?

That might be a different question to ‘what are the detailed reasons why one piece is better than another’ and is definitely different to ‘is CJ reliable or accurate or better than using a mark scheme’

At the end of the day I imagine that you are going to produce a grade on a 5 point scale, since it is not reasonable for a single exercise to have % marks purporting to distinguish between 75% and 76%. It has to be easier to find grade boundaries in a set of work that has already been ranked.

Find this really interesting …. have used similar methods prior to moderation with staff and genuinely see it as a positive step forward. It got me thinking, could this be used by children in terms of a very quick peer review exercise so that they can see different work, compare it to theirs and make changes?

I am absolutely sold on CJ – I trialled this in school with mock exams and we very quickly found a rank order of student work, established which teachers were outliers (me!) and collated general year group feedback as a team. I am not completely au fait with the mechanics of attributing a mark yet but will persevere. The single biggest barrier (as with life after levels) was getting teachers to let go of deeply held notions of how we should mark and to give it a go. I am definitely a fan. Yes it can be criticised for not allowing individual feedback (perhaps that is not the purpose of it) and indeed it doesn’t look closely at every rubric (I think that is the joy of it) and of course it may have some bias relating to visual presentation but it is seriously impressive in terms of sharing work/judging/comparing and checking teacher judgement too. More posts on this please!

“I doubt whether boys’ messy handwriting is a function of their biology; it seems much likelier to be a cultural artefact. That being the case, surely to goodness we can teach boys to improve the way they hold a pen and neaten up their writing?”

You seem quite keen to prove something about boys here. The paper you’ve linked to is unfortunately behind a paywall but from the abstract I can’t see any reason to suggest that boys will have messy handwriting. If it’s true that “girls scored significantly better than boys on both visual motor and graphomotor tasks” there’s no reason at all to suggest that this makes boys incapable of holding a pen. In fact, we know quite well that the vast majority of boys can use a pen perfectly adequately.

So, we’re left with either holding everyone to the same high expectation of legible handwriting or making excuses for 50% of the population because of their sex. The question becomes, is it reasonable to treat boys as if they can write legibly? I think the answer has to be yes.

That’s not the question if the problem is that we tend to judge how ‘good’ a piece of writing is on its neatness when we assess using a very quick reaction. The problem is caused by it being a quick reaction, due to the system, not by any prior teacher expectation about handwriting.

If boys are biologically predisposed to write less neatly, then CJ will be biased against them unless you type everything up, which kind of defeats the point. I don’t think that’s got anything to do with expectations, unless you think the most important quality of writing is how neat it is.

As I said in the last comment, I see no reason to believe that slower development in wrist bones equates to poor handwriting in 50% of the population.

Do you really not see how offering excuses to boys about poor handwriting is lowering expectations? Of course it’s not the most important thing about the quality of writing but handwriting bias is real. Poor handwriting will, on average, result in lower grades. This applies every bit as much to examiner assessment as CJ.

Just stumbled upon this whilst pondering how to take writing assessments away from dreaded checklist driven assessments enforced by DfE.

This has got me really excited! It makes perfect sense (to me). As you have stated:

– It speeds up assessments (why spend x amount of time weighing the pig when you could be planning and fattening it for next time?)

– Data is more reliable (comments about HOW judgements are made have to be addressed to the current system of assessments)

– What we do with the ranked assessed pieces in the classroom, comparing and sharing (the nitty-gritty, technical/creative elements), will make the impact the next time around as teachers can focus in on specific areas for improvement.

We will be using this method in September – can’t wait!

Loved reading this, David. Thank you for sharing and also taking the time to reply.