Schools are held ransom by dubious analysis of test score data. And the fine detail of the data is, it appears, dangerously wrong, and wilfully misleading. It’s also incredibly hard to scrutinise independently. Test marks are, it transpires, used to create spurious ‘Fine Grades’ which are used as if they were measured, scalable numbers. If more head teachers knew how much every SATs test answer mattered, many more would question the results they are allocated.

I’ve mentioned Fine Grades before, here and here. A Fine Grade is a single number, which is used in complex-but-daft analysis in RAISEonline and FFT estimates. I wondered out loud last year about the methodology for calculating ‘Fine Grades’, but I couldn’t find an explanation anywhere online. I have now found one, buried within an Appendix to a consultation document issued just before Christmas. And I simply can’t quite believe what I’ve found.

But first, there are two important aspects of educational measurement which need to be revisited.

The first is that grades are guesswork, and we need to be careful when we use them to summarise a child’s current ability. They may be – excuse the woeful pun – educated guesses, but they are guesses nonetheless. I’ve written about this before, but here are the edited highlights: tests are designed to sample knowledge, skills and understanding to provide an estimate of the full range of knowledge, skills and understanding (the ‘domain’) a child might possibly have.

Daniel Koretz’s Measuring Up, published in 2008, summarises this point: ‘The results of an achievement test – the behaviour of students in answering a small sample of questions – is used to estimate how students would perform across the entire domain if we were able to measure it directly.’ (p20)

“There are three distinct reasons why scores on one test, taken by themselves, are not enough to tell which schools are good or bad.
The first is that even a very good achievement test is necessarily incomplete and will leave many aspects of school quality unmeasured. The second reason not to assume that higher scores necessarily identify better schools is that, in the current climate, there can be very large differences among schools in the amount of score inflation. The third and perhaps most important reason scores cannot tell you whether a school is bad or good is that schools are not the only influence on test scores. Other factors, such as the educational attainment and educational goals of parents, have a great impact on students’ performance.” (p325-6)

In summary, the test score in a single test is not a reliable ‘measure’ in the conventional sense. This is why we report Levels and Grades to children and parents in our education system. We rank children into a limited number of ‘big buckets’, which are generally fuzzy, unreliable and often wrong. Grades are, to a large extent, guesswork, since we can only go on what we can get children to demonstrate in written tests.

The second point is that the total number of marks a child is awarded on any given externally-marked test can mean many different things. One child’s 14/20 is likely to be different to another child’s 14/20. This blog, by Tom Sherrington, gives a good insight into this.

Often, externally marked exams are subject to administrative error or idiosyncratic interpretation of mark schemes. Many schools are fully aware of this, and often ask for papers to be remarked. Most marks change when this happens. This recent blog demonstrates the extent of the problem: the school, disappointed with its GCSE English Literature results, ‘sent back a sample for remarks, when they were upgraded that triggered a full cohort, all 198, remark. Of the 198 papers no less than 97 were awarded an extra grade and a few two grades higher.
Instead of 85% A*-C it is now 95% A*-C and 44% A*A rather than 28% A*A.’

Ofqual, the government’s qualification watchdog, publishes details of requests for remarks of GCSE and A Level papers (which it refers to as ‘enquiries’). In 2014, there was a huge increase from 304,400 to 451,000 papers which were asked to be remarked. This resulted in 77,400 grades being changed, and whilst it is not clear how many marks changed, it is fair to assume that the number was somewhat higher than this. Whilst this represents less than one percent of all GCSE and A Level papers which were graded in 2014, it is quite remarkable how many of the remarked papers changed grade: 17%, or one in six papers. Whilst this is a selective sample of the whole population which sat the exams – after all, only those who were close to a grade boundary will have asked for their papers to be remarked – the blog above suggests that some schools which have requested a complete remarking of all their papers have found that around half of their students increase their grade.

And these are high stakes examinations, which matter a great deal for children. Key Stage 2 results only matter to schools: they mean little or nothing to children. But the number of inaccurately reported marks in KS2 tests is likely to be as high as it is in GCSE. All of this casts serious doubt on the reported total number of marks a child has been given, and it’s clear that we use grades because ‘total marks awarded’ is not a particularly good estimation of a child’s current ability.

So, bearing in mind these two serious issues, here’s how ‘Fine Grades’ are calculated. This is taken from a Statistical Working Paper entitled ‘Measuring disadvantaged pupils attainment gaps over time’, issued in December 2014:
[Image: extract from the Statistical Working Paper showing the Fine Grade equations and a worked example]
Yes, it’s got equations and everything, but it’s fairly straightforward, honest. It has a few interesting points, one of which is the ‘but why add 1?’ in the lower equation. There is also a glaring mistake in the lower equation which would be a howler in a student’s work. I’ll leave you to ponder that whilst I explain the rest. Apologies if this is patronising, but experience suggests that anything involving numbers in education needs careful explanation.

So, this is a way of working out what ‘Fine Grade’ a child will be awarded in Year 6. Children sit papers which have marks. The marks are grouped into ‘threshold ranges’ so that parents can be told in which big bucket their child is currently working: ‘Level 3’, ‘Level 4’ or ‘Level 5’. Level 2 and Level 6 are ‘yes/no’ levels, and calculated differently. (Levels have been abolished, of course, except that they haven’t quite died yet and are still being used in 2014/15; next year the situation is likely to be even worse, but I’ll worry about that then.)

Clearly, this means that children can only be put into three ‘buckets’ (as per point one above). It is obviously difficult to do much in the way of analysis with results in only three categories, so ‘Fine Grades’ invent more categories which look like – but, crucially, are not – countable numbers. Levels are categories too, not countable numbers, even though they are cunningly designed to look like them. Remember that there is no Unit of Education, so the progress between Level 3 and Level 4 has no linear equivalence to the progress between Level 4 and Level 5.

So, a Fine Grade takes the ‘bucket’ a child is placed in – 3, 4 or 5 – and then adds a bit. A bit of what, you might ask? Well, a bit of ‘rank’, I suppose. It’s done like this: take a child’s (almost certainly inaccurate) total marks, subtract the bottom of the level’s threshold range, and divide by the range of possible marks in the ‘bucket’. Turn this into a number, to two decimal places. Or possibly one. It isn’t clear.
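The procedure is easier to see in code. Here is a minimal sketch of the lower equation as I read it – the function name is mine, and the Level 4 threshold of 19–35 marks is the one used in the paper’s worked example:

```python
# A minimal sketch of the Fine Grade equation:
#   fine grade = level + (mark - bottom of threshold) / (top - bottom + 1)
# The "+ 1" in the denominator stops the top mark in a level
# producing a Fine Grade of level + 1.

def fine_grade(level, mark, bottom, top):
    """Fine Grade for a total mark within a level's threshold range."""
    return level + (mark - bottom) / (top - bottom + 1)

# Level 4 threshold of 19-35 marks, child awarded 31 marks:
# 4 + 12/17 = 4.71 (to two decimal places)
print(round(fine_grade(4, 31, 19, 35), 2))

# A child on the top mark of the level no longer reaches a full 5.0:
# 4 + 16/17 = 4.94
print(round(fine_grade(4, 35, 19, 35), 2))
```

Note that the output is still a category dressed up as a number: only seventeen distinct values are possible within this threshold range, however many decimal places are printed.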
I’ve never seen a list of Fine Grades, since I’m simply a teacher and this witchcraft isn’t normally explained to us. I don’t think anyone in a primary school has ever seen a Fine Grade either – I think that the data isn’t available before Year 11 – but if you have, do let me know.

The example here (12/15) gives an answer which is 0.80, conventionally written as 0.8. 11/15 would give 0.73 to two decimal places, 0.7 to one, but the badly chosen example isn’t very clear. If it is to two decimal places, then the ‘Level 3 bucket’ has magically been expanded to 16 separate buckets, all of which look like scaled numbers, but which aren’t. They would look like this: 0.07, 0.13, 0.2, 0.27, 0.33, 0.4, 0.47, 0.53, 0.6, 0.67, 0.73, 0.80, 0.87, 0.93, 1.00 and 1.07. And to be clear, these are categories, not continuous numbers. There is no 0.08, for example.

So, have you spotted the ‘but why add 1?’ and the glaring howler yet? Here they are.

Why add 1? Adding 1 solves a simple problem, which is that a child on the top mark of a level threshold would otherwise score (top of level threshold – bottom of level threshold) / (top of level threshold – bottom of level threshold) = 1. To avoid this, simply add 1 to the denominator. Now a student awarded top marks in Level 4 can’t have a Fine Grade of 4 + 1.

The glaring howler? 35-19+1 is 17, not 15, as per the example. The example should say 4+12/17; 12/17 is 0.71 (to two decimal places). I assume that this is a mistake in the paper I found, and isn’t repeated throughout RAISEonline and FFT ‘analysis’. I did find this, however, which makes me wonder. It’s from a paper called ‘FFT: KS2 2012: Calculating Fine Grades’, and it gets its maths wrong too.
[Image: extract from ‘FFT: KS2 2012: Calculating Fine Grades’ showing the FFT fine grade calculation]
5+(80-79)/(100-79+1) = 5.045, not 5.043. Adding 2 to the denominator gets 5.043. Which is somewhat worrying. What other errors are lurking in the analysis of schools’ test results?

All of this says to me that those who have developed the ludicrous methodology behind RAISEonline and FFT analysis really do need to be ashamed of themselves. Getting basic maths wrong is probably forgivable. We all make mistakes. But someone should have checked the maths – it shouldn’t be up to teachers like me to find these elementary errors. And beyond the basic inaccuracies, someone somewhere should have had an overview which sat between those who have developed tests, and those who have created statistical tools which turn fairly randomly distributed mark totals into Cargo Cult Data.

I am willing to bet that virtually no school realises that its RAISEonline Value Added scores – and those little blue and green indicators in RAISE reports – are based on the actual total number of marks obtained in tests, rather than simply on the grade ‘bucket’ awarded to each child. No wonder that those who do realise this teach so carefully to the test.

We know that remarked test papers are given higher mark totals. If I were a head teacher, I would be sorely tempted to have every set of externally marked papers returned and remarked. On the evidence, the number of marks on each paper would be likely to increase. And in a culture of high stakes testing, who could blame a school for doing just that?

And finally: it has taken me a long time to find this information. It isn’t readily available. It should not be up to people like me to scrutinise basic maths and criticise dubious statistical analysis. Surely education unions or data consultancies – or even Ofsted – should be looking more closely at the way that test results are being used in the increasingly data-driven world of education?

Fine Grades are dangerously wrong, and wilfully misleading. They have no place in holding schools to account.
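As a footnote, the FFT discrepancy above takes only a couple of lines to confirm – the 79–100 threshold is the one in the FFT example, and nothing here comes from any official implementation:

```python
# The FFT example: a mark of 80 in a Level 5 threshold of 79-100.
# The stated formula gives 5.045, but the FFT paper prints 5.043 -
# a figure only reproduced by (wrongly) adding 2 to the denominator.

as_published_formula = 5 + (80 - 79) / (100 - 79 + 1)  # denominator 22
off_by_one_more = 5 + (80 - 79) / (100 - 79 + 2)       # denominator 23

print(round(as_published_formula, 3))  # 5.045
print(round(off_by_one_more, 3))       # 5.043
```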

"31-19+1 is 17, not 15"
- but the denominator is 35-19+1 [your typo?]
OR their typo or missing bracket - who knows???

I'm really pleased you're raising this & taking it all apart - I gave up with it years ago. As you point out: "experience suggests that anything involving numbers in education needs careful explanation"

I can attest to the basic test score/marking inaccuracies - I used to do question level analyses of KS1/KS2/Y3/Y4/Y5 optional test papers for most of the schools of which I was consultant. Papers with no mistakes (in marking or arithmetic) were definitely a rarity. Of course, I often discovered them after they'd been reported/recorded internally.

As you point out: "fine grades" appear to be derived from end of KS2 tests only so I daresay primary teachers have generally just let them pass by. They're a problem for secondary schools.
I don't understand how any such grades could be derived from KS1 or through ks2 as none of that data should be reported.

Reply

Jack Marwood

12/1/2015 11:53:08 am

Thanks for spotting my typo - I've now corrected it in the main post above.

Glad to have your corroboration of the scale of inaccuracies, too. As I say, I suspect that a majority of papers may be allocated incorrect totals, which is worrying given what happens when total scores are converted into 'Fine Grades'.

And, whilst none of the Fine Grades are reported at Key Stage 2, they are clearly used in the *calculations* of Value Added scores. And given Ofsted's - and others' - over reliance on VA, that is a serious cause for concern...

Reply

adamcreen

12/1/2015 12:31:38 pm

But I've been doing this with KS2 scores for our Year 7 arrivals for 14 years. There's no alchemy, it's a really easy formula (when used correctly) and is great if students have done different tiers (as we used to have at KS3). I can give you more info if you like. @adamcreen

Reply

Jack Marwood

12/1/2015 12:43:20 pm

Hi - I've replied via email and look forward to hearing from you.

Reply

Dylan Wiliam

15/1/2015 10:17:38 pm

You asked me why I think this post is "wrong on so many levels".

The most important point is that while the scores do indeed suffer from "spurious precision", every psychometrician knows this, and knows that it is impossible to interpret any score without some indication of the likely error in the score. Indeed, the Standards for Educational and Psychological Testing published by the AERA, APA and NCME state clearly that failure to report errors with scores is unacceptable. So a fine score of 53 is unlikely to be reliably different from a score of 54. But when we group scores into categories, the score of 53 might be allocated to one category, and the score of 54 to another. And then people do tend to assume that there is a qualitative difference between the two. Also, all the scores that are allocated to the same category are then treated as equivalent, so scores towards the top of the category are regarded as equivalent to those towards the bottom. So far-apart scores in the same category are regarded as the same, and close-together scores in different categories are regarded as different. In other words, you are throwing away information.

All measurements have error. When reporting on a fine scale, all you have is the measurement error. When you allocate scores to categories, you add rounding error.

Reply

Jack Marwood

16/1/2015 12:09:22 pm

Firstly, thank you very much for commenting on my post both here and via Twitter. I appreciate your time and thoughts.

I note that you mention that ‘every psychometrician knows’ that scores suffer from spurious precision. As a teacher, my impression is that very few teachers, let alone parents, would know what a psychometrician is, or what they do. I suspect many Ofsted inspectors don’t know either, and nor do many of the statisticians who have developed the analysis of school effectiveness on which RAISEonline is built. Hopefully, this post will encourage them to find out – there are links to Daniel Koretz and Noel Wilson here on Icing on the Cake for those who are interested.

It’s also good to be steered towards the AERA, APA and NCME too, and interesting to hear that they all state that failure to report errors with score is unacceptable. That in itself should be useful for schools dealing with data-illiterate Inspectors.

The point you make regarding the measurement errors inherent within fine scores (or Fine Grades, calculated using mark totals) is also useful. Clearly, grouping Fine Grades into categories introduces further error. I think we agree on all of this.

As I said above, ‘We rank children into a limited number of ‘big buckets’, which are generally fuzzy, unreliable and often wrong.’ There is more error in Fine Grades than that simply accounted for by the measurement and rounding errors which you highlight – errors in the mathematics, the reliability of tests, the marking and administration of the tests and the dubious statistical analysis undertaken by RAISEonline and the FFT, amongst others.

And apologies if my analysis isn’t perfect – but the issues I’m raising don’t seem to have been raised by many other people and it’s good to have them discussed, as I’m sure you’ll agree.

Jack

Reply

Patricia Gooch

28/1/2015 03:54:25 am

If you then follow the process a bit further: the fine grade is then multiplied by 6 to get a score, e.g. 5.045 x 6 = 30.27, which equates to a 5c, but the rounding is done in an odd place, so actually isn't quite right (I can't remember where off hand). And when you look at the tables for expected Attainment 8 based on the fine score, these are grouped differently: in theory a 4a ranges from 28-29.99, but according to the expected Attainment 8 table a 4a ranges from 27.9 (probably - the bands are 4.1, 4.1 ... 4.6, 4.7, 4.8, 4.8, 5.0) to 29.699, so we have no idea what to do if a child's fine grade multiplied by 6 is 29.8!
If schools (I'm at an upper school) are going to be measured on LOPs and Progress 8, we need a clear idea of where the data is coming from and how we process it. Not in order to change the data by altering the curriculum, but so we have no surprises when Raise (or Ofsted) appears!

Reply

Jack Marwood

28/1/2015 12:35:48 pm

Thanks for this Patricia. Yes, the process for calculating Fine Point Scores also has issues, as you point out. The Average Point Scores created from the Fine Point Scores also have issues. That'll be in my next blog, with a bit of luck, as I look further into the way in which test scores and teacher assessments are manipulated ;-)

Reply

Jack Marwood

16/2/2015 03:06:21 am

A quick update from me, to say that I've delayed my post on Average Point Scores for now, but I hope to revisit this soon.

In the meantime, the DfE have amended their document at https://www.gov.uk/government/statistics/measuring-disadvantaged-pupils-attainment-gaps-over-time so that the mathematical error I highlighted has been corrected. So, well done to them, and hello too - glad to know you are reading my blog and taking action based on my comments ;-)



Author

Me?
I work in primary education and have done for ten years. I also have children
in primary school. I love teaching, but I think that school is a thin layer of icing on top of a very big cake, and that the misunderstanding of test scores is killing the love of teaching and learning.