I wrote about RAISEonline in March this year, and that piece has consistently been the most read post on Icing on the Cake. This is despite it being a long, detailed argument against RAISEonline which contained – the horror! – equations, explanations of statistical theory and ideas which most readers found challenging. So I’m going to write about RAISEonline again. I’m going to show that its entire foundation is number crunching which statisticians have rejected as too simplistic and, unfortunately, too badly understood to be used by non-experts. And when I meet Ofsted’s Mike Cladingbowl and Sean Harford later this month to discuss Ofsted’s use of data, I’m going to ask them to review the way in which Inspectors and Governors use RAISEonline to hold schools to account. I'm also going to ask those in charge of official government statistics to examine RAISEonline's foundations, and to reconsider the way in which data is used to hold schools to account.

Now, part of the reason for writing this is to show that both those responsible for RAISEonline and those who use it seem to have completely misunderstood the statistics they are using. That does mean that this post might be a bit challenging. It will require you to think hard. But unless you are prepared to do that, there is no way that you will understand why RAISEonline is so dangerously flawed. Of course, if you do understand the statistics, you'll sail through the arguments I'm about to make. Either way, I urge you to think hard about how we currently use Test Scores and to question what RAISEonline does.

RAISEonline has significant problems

To summarise my earlier argument, RAISEonline, or ‘Reporting and Analysis for Improvement through school Self-Evaluation’, may as well be called Reporting Utter B******* Because It’s Simply Hogwash. I identified a number of problems with it:

The ‘What’s it testing?’ problem

The Not Independent and Identically Distributed problem

The Key Stage 1 Data Manipulation problem

The Primary Age problem

The Loss of Definition Problem

The Missing Data Problem

The Misunderstanding Significance Problem

I want to turn to the last problem on this list and look closely at what statisticians mean when they say that something is ‘statistically significant’. As I said in March, “What significance tests do is suggest – and that’s all they do – that one mean which is very different to another may be from a different population and that this is not simply by chance.” I then left readers to explore further, saying, “There is more to the idea of significance involving Null Hypotheses and Type I and Type II errors.” I want to look more closely at ‘statistical significance’, then at the source of RAISEonline’s statistical reasoning, and finally at what working academic statisticians say about this reasoning.

A significance test is just a number crunching exercise

As I said in my earlier post, “Statisticians collect data from a random sample of the population under scrutiny and find its mean, by adding up all the numbers and dividing the total by the number of observations. Then they test it to see if it is different to a mean of another random sample.” In summary, a result is said to be ‘statistically significant’ at the 95% level if the sample mean is calculated to have a less than 1 in 20 chance of being drawn from the same population as another sample. Of course, a significance test is just a number crunching exercise, as those who developed these tests were at pains to point out. There could be many other reasons why a sample appears to be ‘significant’ according to the test. Amongst these are the Type I and Type II errors I mentioned previously. In brief, a Type I error is made when something is said to be significant when it isn’t. This is often referred to as a 'false positive'. A Type II error is made when there is a real difference but the test fails to find the sample significant. This is often referred to as a 'false negative'.

Ooooh… Blue… Green…

The official explanation of what RAISEonline does is as follows:

“Significance Tests

Significance is a statistical term that shows if a difference or relationship exists between populations or samples of data. After finding a significant relationship, it is important to evaluate its strength. Significant relationships can be strong or weak. Significant differences can be large or small depending on the sample size. Significant tests are used in RAISEonline and school Full Reports to determine if a measure for a particular school’s cohort is significantly different from the national cohort. Significance tests are performed on the data using a 95% confidence interval.” (p61)

This is a terrible explanation of a test of statistical significance. The phrase ‘a significant relationship’ will make any statistician hang their head in horror. ‘Significant differences can be large or small depending on the sample size’ is almost entirely meaningless. But it does confirm that RAISEonline uses a 95% significance test. Where it goes really wrong is when it then colours certain items in a RAISEonline report blue or green, like this:

The + and – symbols are powerful. Positive and negative are pretty binary concepts – one good, one bad. Clearly, labelling something ‘Sig+’ or ‘Sig-’ is highly misleading if the impression given is that green is good and blue is bad. They are neither. They are merely statistically significant, i.e. not likely to be the result of chance. They are not ‘significantly good’ or ‘significantly bad’. Remember as well that an item marked Sig+ or Sig- could be a Type I error. This would happen if the item was found to be statistically significant at the 95% level, but this was not actually justified. Maybe the item included data which was wrong, or contained an extreme outlier. Maybe half the class was ill on the day a test was taken, or their teacher had been off school for some reason, or Ofsted turned up the week before and sent everyone into a tailspin. Either way, it could be the result of a Type I error. On the other hand, a number not coloured in could be a Type II error. The test should have found it to be significantly different at the 95% level, but didn’t for some reason. Maybe a high-flying child was off school for the test, or a marker added up the marks incorrectly. Maybe huge numbers of papers were incorrectly marked. A Type II error results whenever a genuinely significant difference is not indicated by the test. Both types of error are important. But either way, the test doesn’t tell you anything. The colours clearly tell Inspectors and Governors something, however.

What Inspectors and Governors are told to think about colour

This is from widely respected Ofsted Inspector Mary Myatt’s blog Making Sense of RAISEonline, and demonstrates the kind of misleading advice which Ofsted Inspectors have been given: “Figures highlighted in blue mean that students are performing significantly below other students nationally. Those in green mean they are significantly above national.”

This is completely wrong. The colours don't 'mean' that at all. They simply indicate that the result is 'statistically significant' - nothing more and probably less. A large number of the students might be being tutored at home. A large number might never read for pleasure. A group of SATs papers might have been incorrectly marked. A majority of the students might have parents educated to degree level. Maybe they are, on average, four months older than most children in their national cohort. A significance test using Test Scores can't tell you 'how the children are performing', only that the numbers are 'statistically significant', which isn't the same thing at all. Governors are routinely misadvised too. The National Governors’ Association’s Knowing Your School says the following: “Each school figure is compared with that year’s national result, the difference is shown and a statistical test has been carried out to indicate if the schools result is below (blue) or above (green) the national result.” It’s hard to overstate just how wrong this is. Significance testing can be wrong, either because the number crunching suggests that there is something different about the sample when there is not, or because a sample may not be flagged as different when it should be. Even by the standards of basic significance, Ofsted inspectors and Governors have been badly advised and are clearly using tools which they simply do not understand.
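To make the point concrete, here is a minimal sketch of the kind of z-style test the methodology guide describes, applied to thousands of hypothetical schools whose cohorts are all drawn from the same national population. All of the figures (national APS, standard deviation, cohort size) are invented for illustration:

```python
import math
import random

def significance_flag(school_mean, national_mean, national_sd, cohort_size):
    """95% two-tailed test of the kind RAISEonline's guide describes."""
    z = (school_mean - national_mean) / (national_sd / math.sqrt(cohort_size))
    if z > 1.96:
        return "Sig+"
    if z < -1.96:
        return "Sig-"
    return ""

random.seed(1)
NATIONAL_MEAN, NATIONAL_SD, COHORT, SCHOOLS = 28.0, 4.0, 30, 5000

# Every 'school' below is sampled from the SAME national population, so no
# school is genuinely different -- yet some are still coloured in.
flags = []
for _ in range(SCHOOLS):
    cohort = [random.gauss(NATIONAL_MEAN, NATIONAL_SD) for _ in range(COHORT)]
    flags.append(significance_flag(sum(cohort) / COHORT,
                                   NATIONAL_MEAN, NATIONAL_SD, COHORT))

flagged = sum(1 for f in flags if f)
print(f"{flagged} of {SCHOOLS} schools flagged Sig+ or Sig- "
      f"({100 * flagged / SCHOOLS:.1f}%) purely by chance")
```

Roughly one school in twenty gets coloured in even though, by construction, nothing distinguishes any of them. That is exactly the 5% Type I error rate the test is built around, before any of the other problems above are even considered.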

And even if Inspectors and others say that RAISEonline is merely a signpost and not a destination, if the signs are colourfully highlighted and clearly wrong, then surely it's much more likely you'll end up at the wrong destination?

Further issues with RAISEonline

Having looked closely at the statistics underpinning RAISEonline, I noticed that two different tests of significance are used. So I asked the team at RAISEonline about them. I asked, “On page 14 of the guide, a formula is given for calculating significance. This suggests that a Z-test is being used to test for significance, as a Z-score is tested against a figure of 1.96. On pages 25-26, a t-test is used to test for significance. Could you explain why the two different tests have been used?” In brief, Z-tests are used for samples larger than 30 and t-tests are used for smaller samples. The response from RAISEonline was very interesting. It is quite long-winded, and a bit contradictory, so I’ve put it here. In essence it says:

“‘Average Point Scores at Key Stage 2’ is tested against a population standard deviation, which is ‘known’, using a Z-test. ‘Average Point Scores at Key Stage 4’ is tested against a population standard deviation, which is ‘known’, using a t-test. School level APS values are assumed to be based on more than 30 observations and should use Z-tests, and pupil group APS values are assumed to be based on fewer than 30 observations and should use t-tests. But ‘This is not a practical solution and would add a layer of complexity that can be avoided by applying a single methodology.’”

It sums up what the powers that be have decided they can do with the data:

“Statisticians within Ofsted and the Department for Education decided to accept simplifying assumptions and apply a t-test to APS significance tests. This has three major benefits:

1. Applying the same formula throughout APS significance tests makes the methodology more accessible to users and avoids concerns about applying different statistical tests to the same metric within the same school.

2. Using a t-test is more robust than a Z-score for small cohorts (as it is less sensitive to outliers), but converges on the results of a Z-test as the sample size increases; applying a t-test to all cohorts provides robust outcomes whilst not being overly-sensitive to larger sample sizes.

3. The t-test is a more conservative test than using a Z-test and as such, it reduces the probability of making a type I error. A type I error occurs when the difference is said to be significant when it is not significant. This in turn increases the probability of making a type II error, i.e. stating that the difference is not significant, when perhaps it is significant. The impact of a type I error is judged to be greater than the impact of a type II error in this situation.”

I’ll let you read that again. It probably still doesn’t make much sense. As far as I can gather, this says that, even though the Methodology document clearly uses two different tests, government statisticians have decided to apply t-tests to all cohorts. They have also decided that Type I errors are more important than Type II errors. This arbitrary messing around with statistical theory is a little baffling. The people who developed Z-tests and t-tests were at pains to explain the important assumptions they were making, and to make clear the limitations of their models.
They recognised the possibility of two types of error. As I said previously, RAISEonline seems to have completely ignored the basic assumptions which have to be in place for you to be able to make inferences from significance testing. Simply messing around with their carefully developed tests seems a little, well, cavalier. But the RAISEonline team really raised my hackles when they included this in their email:

That little citation at the bottom leads to ‘Conversational Statistics for Business and Economics’ by L Van Jones. Here’s a biography of Mr Van Jones. Here’s his home page at Texas Christian University. His degree, awarded in 1961, was in ‘Commerce’. He is a businessman teaching elementary statistics in the context of business. He’s not a statistician, and he’s not exactly an expert in the field.

I’m not convinced that Mr Van Jones represents the best that has been thought and said about statistical inference. It turns out that he doesn’t. In fact, he is a long way from current thinking about tests of significance. A very long way indeed.
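Before turning to that current thinking, it is worth noting that one technical claim in the RAISEonline team's response – that a t-test 'converges on the results of a Z-test as the sample size increases' – is at least true, and easy to check. The two-tailed 95% critical values below are taken from standard published t tables:

```python
# Two-tailed 95% critical values from standard Student's t tables.
T_CRITICAL_95 = {       # degrees of freedom -> two-tailed 5% critical value
    5: 2.571,
    10: 2.228,
    29: 2.045,          # a cohort of 30 pupils
    100: 1.984,
    1000: 1.962,
}
Z_CRITICAL_95 = 1.96    # the figure tested against on page 14 of the guide

for df, t_crit in sorted(T_CRITICAL_95.items()):
    print(f"df = {df:>4}: t critical value {t_crit:.3f} (Z is {Z_CRITICAL_95})")

# For a cohort of 30, a test statistic of 2.0 would be flagged by a Z-test
# (2.0 > 1.96) but not by a t-test (2.0 < 2.045): the t-test is indeed the
# more conservative choice, trading Type I errors for Type II errors.
```

So the convergence claim holds; what it doesn't address is whether either test should be applied to this data at all.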

Statisticians know that simple tests of statistical significance are not well used, and suggest that we don’t use them

Andrew Gelman is a Professor of Statistics and Political Science and director of the Applied Statistics Center at Columbia University, an Ivy League institution which administers the Pulitzer Prize. Hal Stern is Professor of Statistics at the University of California, Irvine, which is ranked 1st among US universities and 5th in the top 100 global universities under 50 years old. If you want an argument from authority, these are the go-to guys.

“Many of the pitfalls of relying on declarations of statistical significance appear to be well known. For example, by now practically all introductory texts point out that statistical significance does not equal practical importance. If the estimated effect of a drug is to decrease blood pressure by 0.10 with a standard error of 0.03, this would be statistically significant but probably not important in practice. Conversely, an estimated effect of 10 with a standard error of 10 would not be statistically significant, but it has the possibility of being important in practice. As well, introductory courses regularly warn students about the perils of strict adherence to a particular threshold such as the 5% significance level.”
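The arithmetic behind their two examples is worth making explicit, since it is the same arithmetic that lies behind every Sig+ and Sig- flag: the estimated effect is divided by its standard error and the ratio compared with 1.96. A minimal sketch:

```python
# Each of Gelman and Stern's examples reduces to the ratio of the estimated
# effect to its standard error, compared with the 95% critical value of 1.96.
def is_significant_95(estimate, standard_error):
    """Two-tailed test at the 5% level: |estimate / SE| > 1.96."""
    return abs(estimate / standard_error) > 1.96

# Blood-pressure drop of 0.10 with standard error 0.03: ratio 3.33, so
# statistically significant -- yet probably unimportant in practice.
print(is_significant_95(0.10, 0.03))   # True

# Estimated effect of 10 with standard error 10: ratio 1.0, so not
# statistically significant -- yet possibly important in practice.
print(is_significant_95(10, 10))       # False
```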

“I’m thinking more and more that we have to get rid of statistical significance, 95% intervals, and all the rest, and just come to a more fundamental acceptance of uncertainty.”

It’s worth noting that, in the extract from the report above, estimated effects are reported with ‘standard errors’. This is because statistics deals, at its heart, with uncertainty. Measuring anything introduces error, and it can’t simply be ignored in the way that simplistic, mechanistic rubbish like RAISEonline does. And leading statisticians suggest that we should ‘get rid of statistical significance, 95% intervals, and all the rest’, which means that they would dismiss the entire underpinning of RAISEonline and its simplistic, wrongly interpreted Sig+ and Sig- indicators. There are further fundamental problems:

1) Test Results for a given school should be treated as clustered results rather than as samples from a population. RAISEonline does not account for the fact that two children from the same school are more likely to be similar (in terms of outcomes) than two children sampled from different schools.
2) Test Scores are population data, not sampled data. This is, once again, a bit technical, but in essence sampled data is used to estimate unknown population values. But we know the population values for Test Results, so using sampling theory - as RAISEonline does - makes no sense whatsoever.
3) Test Scores are not reliable enough to be used for statistical analysis. I’ve written about this here.
4) There is no accounting for errors of measurement within RAISEonline. And the errors are huge.
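The first of these problems can be illustrated with a short simulation (all figures invented): if pupils within a school share an intake effect, school means vary more than the pupil-level standard deviation predicts, so a naive test that assumes independent pupils flags far more than its nominal 5% of schools – even when no school is doing anything differently:

```python
import math
import random

random.seed(2)
PUPIL_SD, INTAKE_SD, COHORT, SCHOOLS = 4.0, 1.5, 30, 3000
NATIONAL_MEAN = 28.0

flagged = 0
for _ in range(SCHOOLS):
    # Every pupil in the cohort shares the same intake effect: this is the
    # clustering that treating pupils as independent draws ignores.
    intake_effect = random.gauss(0, INTAKE_SD)
    cohort = [random.gauss(NATIONAL_MEAN + intake_effect, PUPIL_SD)
              for _ in range(COHORT)]
    mean = sum(cohort) / COHORT
    # Naive z-style test assuming independent pupils:
    z = (mean - NATIONAL_MEAN) / (PUPIL_SD / math.sqrt(COHORT))
    if abs(z) > 1.96:
        flagged += 1

print(f"{100 * flagged / SCHOOLS:.0f}% of schools flagged, "
      f"far above the nominal 5%")
```

With these illustrative numbers, well over a third of the simulated schools get coloured in, none of them because of anything the school did.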

'Serious Doubts About School Effectiveness', an excellent academic paper on the problems with Test Scores written by Professor Stephen Gorard, is worth reading in its entirety. A few highlights are included below. SE is school effectiveness, p-values are the basis for t-tests and Z-tests, NPD/PLASC is the National Pupil Database and its predecessor, and DCSF is the predecessor of the Department for Education.

“Even used as intended, p-values cannot help most analysts in the SE field. The same applies to standard errors and confidence intervals and their variants. But the situation is worse than this because in the field of school effectiveness, these statistical techniques based on sampling theory are hardly ever used as intended. Most commonly, the sampling techniques are used with population figures such as NPD/PLASC. In this context, the techniques mean nothing. There is no sampling variation to estimate when working with population data (whether for a nation, region, education authority, school, year, class or social group). There are missing cases and values and there is measurement error. But these are not generated by random sampling and so sampling theory cannot estimate them, adjust for them or help us decide how substantial they are in relation to our manifest data.

Despite all this, DCSF use and attempt to defend the use of confidence intervals with their population CVA data. A confidence interval, remember, is an estimate of the range of values that would be generated by repeated random sampling, assuming for calculation purposes that our manifest score is the correct one. It has no relevance at all to population data like PLASC/NPD.”

“Teachers are spending their time looking at things like departmental VA figures and distorting their attention to focus on particular areas or types of pupils. School effectiveness results have been used to determine funding allocations and to threaten schools with closure (Bald, 2006; Mansell, 2006). The national school inspection system in England, run by OFSTED, starts with a CVA and the results of that analysis partly pre-determine the results of the inspection (Gorard, 2008c). Schools are paying public funds to external bodies for VA analyses and breakdowns of their effectiveness data. Parents and pupils are being encouraged to use school effectiveness evidence (in league tables, for example) to judge their schools and potential schools. If, as I would argue, the results are largely spurious this means a lot of time and money is wasted and, more importantly, pupils’ education is being needlessly endangered.”

“However, the dangers of school effectiveness are even greater than this. School effectiveness is associated with a narrow understanding of what education is for. It encourages, unwittingly, an emphasis on assessment and test scores—and teaching to the test—because over time we tend to get the system we measure for and so privilege. Further, rather than opening information about schools to a wider public, the complexity of CVA and similar models excludes and so disempowers most people. These are the people who pay tax for, work in or send their children to schools. Even academics are largely excluded from understanding and so criticising school effectiveness work (Normand, 2008). Relevant academic work is often peer-reviewed and ‘quality’ checked by a relatively small clique. School effectiveness then tends to monopolise political expertise on schools and public discussion of education, even though most policy-makers, official bodies like OFSTED, and the public simply have to take the results on trust.”

“The whole school effectiveness model, as currently imagined, should be abandoned. It clearly does not, and could not, work as intended, so it causes all of the damage and danger described above for no good reason. It continues partly as a kind of Voodoo science (Park, 2000), wherein adherents prefer to claim they are dealing with random events, making it easier to explain away the uncertainty and unpredictability of their results.”

As I say, I strongly recommend the whole paper. It is a clear and devastating evisceration of the whole 'School Effectiveness' nonsense underpinning RAISEonline.

So what should we do about this?

Well, someone somewhere in government should be interested in the criticisms I’ve made here. I’m meeting people from Ofsted later this month to discuss their use of data. I’ll raise this with them then. Additionally, the government has recently appointed a new Chief Executive of the UK Statistics Authority.

John Pullinger has had a long career as a government statistician, and clearly knows what he is talking about. Additionally, he’s indicated that he might be aware of issues in education, saying of school league tables, “We’re seeing the numbers but are we necessarily drawing the right conclusion? League tables are altering the way schools behave — which children are going to be entered for exams and the way parents behave, which school they’re going to send their child to.”

So I’ve written the following letter to him.

An open letter to John Pullinger

Dear Mr Pullinger,

I am a teacher and parent and I am extremely concerned about the use of children’s Test Scores within RAISEonline, the main dataset used to measure the effectiveness of schools in England. I have written extensively on the flawed foundations on which RAISEonline is built, and I urge you to read the criticisms I have made on my weblog.

My initial suggestions for questions you should ask of the RAISEonline team are as follows:

1) Why is sampling theory used to test population data?
2) How do you respond to Stephen Gorard's paper, Serious Doubts About School Effectiveness?
3) How can you improve what you do with data so that those who use it do not come to simplistic and incorrect conclusions about schools based on unreliable Test Scores?

I look forward to hearing from you,

Jack Marwood

Don't bring me problems, bring me solutions

The problems:
RAISEonline's entire foundation is based on number crunching which has been rejected by statisticians as too simplistic and too badly understood to be used by non-experts.
Ofsted Inspectors are not trained sufficiently well to understand the limitations of the data they use.
Governors are badly advised about the statistics in RAISEonline.
Academics have explored the inherent problems of the School Effectiveness model RAISEonline represents.

The solutions:
As a profession, we need to debate the use of Test Score 'data'.
We have a new Chief Executive of the UK Statistics Authority who may be able to help.
Ofsted are beginning to listen to criticisms of RAISEonline and the use of Test Score 'data'.
Those in education should find out more about the ways in which numbers are crunched, and investigate the criticisms of the number crunching.

A few final notes

Thanks for reading this far - I know that this is a long post and I appreciate your time.
I'd be interested to know how many teachers have actually seen a RAISEonline report, either as a printed report or in its online form. Please let me know either on Twitter or in the comments below.
Thanks to those who read the early drafts of this - you know who you are!

Here's a short clip which has been playing in my mind since I started this post:

Jack – you have really highlighted some important facts here! I am particularly interested in how sometimes common sense is ignored when interpreting data and information. I understand why there is a deficit in understanding and believe that articles like this help to demystify some of the intricate detail behind the system. There is work to be done to help improve how data is used and understood across the sector. Keep up the good work!


Jack Marwood

14/10/2014 11:35:36 am

Thanks for this, Phil. And given that you run a data consultancy, I'm even more grateful - the more analysis based on Test Scores is questioned, the better... and I promise to keep chipping away at the dubious foundations of RAISEonline!

Success is a journey! If we can get people to understand what the ingredients are then we can make better cakes! Keeping stuff simple is an art that we need to get better at if we are going to improve outcomes for the people that we work with!


Dominic Salles

14/10/2014 04:28:09 pm

Hello,

I follow your premise without really following the statistical arguments. I am a secondary school teacher. What I wonder is, what sort of sample size would you need before you could be confident that the errors were likely to balance out as they would in the population as a whole? Your argument seems to be that the sample size would have to be so big, that it is impossible for a school to do.

This is not helpful really. The question is what sort of sample size is likely to give a reasonable guide to the performance of a school? In a secondary school, a year group might be 200 strong. Would one year's worth of data be enough, or would we need two or three years?

To argue that we would need more than this is to misunderstand the real world - it may be statistically insignificant as a sample, but in terms of how well children do in a school, it is a very real measure. You can feel the difference in a school that is performing well over three years compared to one that is not - you'd spot it even if you never saw an exam result.

In secondary schools I also wonder if a fairer way than Raiseonline would be to assume that each school had only three students, their mean level 3, mean level 4 and mean level 5 entrants. Then you could add their mean GCSE scores together to get a value for the school. This could easily be compared to all other schools. It seems to me that the mean student gaining a level 5 will be quite similar to the mean student nationally gaining a level 5, and so on. If you took cohorts over three years, would this measure give you a much more meaningful analysis of progress than Raiseonline?


Jack Marwood

15/10/2014 02:20:14 pm

Thanks for commenting, Dominic. Given that you say you aren't able to follow the statistical arguments, I suggest that you read this article on Statistical Significance (http://www.workingoutwhatworks.com/en-GB/Magazine/2014/10/Statistics_in_educational_research) as it might help to explain what is meant by significance as used by statisticians. After that, I suggest that you look at my post on Test Scores (http://icingonthecakeblog.weebly.com/blog/ofsteds-use-of-test-scores-to-judge-schools-is-ridiculous) and then my post on the flawed assumption that schools and teachers are solely responsible for pupils' relative achievement (http://icingonthecakeblog.weebly.com/miniblog/fundamentally-flawed-assumptions-schools-and-teachers-are-solely-responsible-for-pupils-relative-achievement). They might give you some food for thought.

In response to a few points you’ve raised:
1) What sort of sample size is likely to give a reasonable guide to the performance of a school? Your argument seems to be that the sample size would have to be so big, that it is impossible for a school to do.

That isn’t my argument at all. My argument is that it doesn’t matter how big the ‘sample’ is, Test Scores are still all population data, so using significance testing is pointless as there is no sampling error. And ‘statistically significant’ as used by RAISE doesn’t mean that ‘students are performing significantly above or below other students nationally’.
What’s more, trying to use woefully inadequate Test Scores as a guide to the ‘Performance of a school’ is a chimera – the largest driver of achievement is prior attainment, which is driven by pupil level factors. I’m not misunderstanding the real world, I’m afraid, I’m simply bringing together thoughts of my own and research by others to argue my case – which is that it isn’t possible to isolate what proportion of any measure of children’s ‘performance’ is due to School level effects using anything we currently have, which is what RAISE assumes is possible. It isn’t.
2) You could add mean GCSE scores together to get a value for the school, and then compare it to other schools.
You could do this, but it would simply tell you about many things – most importantly intake – and almost nothing about the ‘effectiveness’ of the school, as RAISE presumes it can.

Finally, schools are big complicated places full of hopes and dreams - trying to distil them down to numbers does a disservice to everyone in education. It's time we stopped trying to do just that.


Mary Myatt

19/10/2014 04:34:09 am

Many thanks Jack and I take your point about sig plus and minus being statistically rather than actually so. I have adjusted my post and included a link to this post. The potential for there to be Type 1 errors, for example, is precisely why we go into schools. RoL is used as a starting point for discussion.


Jack Marwood

19/10/2014 01:08:19 pm

Thanks for this response, Mary. I really appreciate your efforts to clarify this important point. The revised part of your blog now says:
“Figures highlighted in blue mean that students are performing statistically significantly below other students nationally (it is important to note that this is statistically significant and there may be reasons which account for this. Such significance will need to be explored in discussion with the school)
Those in green mean they are statistically significantly above national (the same caveat in the bullet point above applies) ”

I’m afraid I’m still going to say that this is still biased, however, and would probably steer discussion in a particular direction. As discussed, statistical significance does not mean ‘significant’ as in ‘important’. Any impression that green is good and blue is bad (or, alternatively, that they are not worth investigating) is misleading.

The best links I could find for reporting statistical significance to a general audience: http://www.surveysystem.com/signif.htm or http://medical-dictionary.thefreedictionary.com/statistical+significance

These suggest using the phrase ‘probably’ or ‘probably true’ when discussing differences which are statistically significant, indicating a lack of certainty.

So you could say “Figures highlighted in blue or green mean that students’ average measure is ‘statistically significant’ when compared to the national mean. A blue ‘Sig-’ indicates that it is probably true that the students’ average measure is statistically lower than all similar students nationally. A green ‘Sig+’ indicates that it is probably true that the students’ average measure is statistically higher than all similar students nationally. It is important to note that ‘statistically significant’ does not imply certainty and there may be reasons which account for the appearance of blue or green indicators, as the average measure may be incorrectly highlighted due to statistical error or the data may be skewed by some unknown factor. Equally, some scores may not be highlighted due to statistical error, or the data may obscure some unknown factor. This will need to be explored in discussion with the school.”

‘Performing’, ‘significantly below’ and ‘significantly above’ are terms which are simply too loaded to be read neutrally. There is a big difference in saying that 'a group of musicians is performing significantly below students nationally’ - which implies this is true - and ‘it is probably true that a group of musicians’ average measure is statistically lower than all similar students nationally although this may be incorrectly highlighted due to statistical error or the data may be skewed by some unknown factor.’

And then, with a following wind, RoL could be used as a starting point for a discussion which is not biased from the outset.

Thanks again for responding - it's good to see you trying to clarify a tricky but significant detail of data interpretation.


Author

Me?
I work in primary education and have done for ten years. I also have children in primary school. I love teaching, but I think that school is a thin layer of icing on top of a very big cake, and that the misunderstanding of test scores is killing the love of teaching and learning.