(When I wrote this post in March 2014, I was simply a teacher frustrated by the horribly misleading use of significance testing in RaiseOnline. I was deliberately provocative in the title, hoping to catch the eye of readers. It worked, and it is the most read piece here on Icing on the Cake.

Subsequently, I've had the chance to meet statisticians who currently work on RAISEonline, who seem to be well aware of the issues which I raise here. Unless and until I get time to revise the piece thoroughly, I plan to leave it here. RAISE still uses significance testing in a way which is at the very least highly misleading and, at worst, contemptuous of the hard-working staff in schools across the country who have been unfairly pilloried by those who simply don't understand how data works.

If you'd like to find out how RAISEonline can be useful, as well as learning how to make sure that no-one misrepresents your school using the information it contains, I'm running some CPD in January 2015. In the meantime, your thoughts are welcome in the comments.)

Having had a good look at the Ofsted Schools Data Dashboard and the DfE's Performance Tables, it's time to take a look at RAISEonline, the slickly acronymed behemoth of the Big Data Disaster. In full, RAISE is actually 'Reporting and Analysis for Improvement through school Self-Evaluation'. It may as well be called Reporting Utter Bollocks Because It's Simply Hogwash. It is based on bad statistical analysis, is hopelessly misleading and has been used unbelievably badly by those who should know better.

Part of the issue with RUBBISHonline is that it requires some fairly entry-level statistical knowledge to understand what you can and cannot infer from data analysis. Those who developed RUBBISHonline into the sense-eating monstrosity it has become have taken some quite clearly misunderstood statistical theory and, in effect, added two and two to make a banana.

Having pored over the statistics behind RUBBISHonline, I'm still somewhat shocked at what I have discovered. It seems so incredible that the analysis could be this wrong that I have gone over the arguments I make here again and again, convinced that I must have missed something. I did in fact make a few mistakes in the initial version of this article, as I managed to confuse the distributions of raw data with the means of samples of data, as was very helpfully pointed out in the comments below. I am convinced, however, that my basic criticisms of RAISEonline stand, and I welcome any further comments you may have.

This post is likely to patronise you, scare you and infuriate you in equal measure. But in order to understand why RUBBISHonline has been and continues to be so damaging, you need to look the maths in the face and not be intimidated by people wielding bad data as a weapon. You need to understand how the good name of statistics has been besmirched. And you need to take a stand against crimes against statistics, schools and children.

Bear with me whilst I patronise you a bit

Now, as I have previously said, I have studied Statistics to a high level. And whilst this doesn’t make me the Data God, it does give me a leg up when it comes to explaining a bit about the way statisticians work with data. So bear with me as I give you a bit of an insight into what those working with data have developed to try to make sense of what you can, and can’t, do with data. Apologies if you do understand all this. In my ten years’ experience in Primary education I’ve yet to meet anyone who has studied this in any depth, so I’m going to assume that you have not done so either. If you have, feel free to skip this bit and rejoin the post below when I outline some headline problems with RUBBISHonline.

So firstly, I'd expect most people to have a rough idea of what a normal distribution looks like, even if you are not used to that term. It's the big hilly graphy thing which tells you that most things in a given set of data are somewhere in the middle, with some on either side. A 'bell' curve. One of these:

So this distribution shows that most people are about 5’8”, with some smaller and some taller people. As is pointed out below, however, lots of datasets, if not all, are not distributed in this way. They are skewed one way or the other like this:

But statisticians are clever types, and they noticed that if you take a sample of a population, find its mean, and compare that mean to the means of other samples of the same population, you do get a normal distribution of the means. Of course, I'm using 'sample' and 'population' in their statistical sense, but you wouldn't know that, because they are ordinary English words, used originally because, well, it made sense not to make up new words like 'spingle of data from a bygoat'. This will become important a bit later on…

The mathematical thinking behind this distribution of means is made explicit in the Central Limit Theorem, which states that, with various important provisos, the means of independent and identically distributed random variables are themselves approximately normally distributed.
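This is easier to believe with a quick simulation. Here is a minimal sketch in Python, using dice rolls as a stand-in for a non-normal variable (the sample sizes and seed are illustrative, nothing to do with RAISEonline itself):

```python
import random
import statistics

random.seed(42)  # illustrative seed; any seed shows the same shape

# A single die roll is uniformly distributed, nothing like a bell curve.
# But take the mean of 30 rolls, repeat that 10,000 times, and the
# distribution of those means is approximately normal around 3.5.
sample_means = [
    statistics.mean(random.randint(1, 6) for _ in range(30))
    for _ in range(10_000)
]

print(round(statistics.mean(sample_means), 2))  # close to 3.5

spread = statistics.stdev(sample_means)
within_2sd = sum(abs(m - 3.5) <= 2 * spread for m in sample_means) / len(sample_means)
print(within_2sd)  # roughly 0.95, just as a normal distribution predicts
</imports>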

So what a statistician will do is to transform the data under investigation into an imagined distribution of means, which looks like this:

Now, don’t be scared. Look the Stats Monster in the face. It’s not that scary… Well, okay, maybe it is. So here are a few simple observations:

Most of the graph is clustered around the middle of the distribution.

The percentages refer to the areas under the graph.

The further you get from the middle, the smaller the numbers are.

The ‘z-scores’ or ‘standard deviations’ are a simple but beautiful bit of mathematical trickery which allow you to look at the spread of data around the middle value.

Just over 68% of the graph is between -1 and 1 standard deviations from the mean.

Just over 95% of the graph is within 2 standard deviations from the mean.

Just over 99% of the graph is within 3 standard deviations of the mean.
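Those three percentages can be checked directly: the area under a standard normal curve within k standard deviations of the mean is erf(k/√2), which Python's standard library can compute.

```python
import math

# The area under a standard normal curve within k standard deviations
# of the mean is erf(k / sqrt(2)).
for k in (1, 2, 3):
    area = math.erf(k / math.sqrt(2))
    print(f"within {k} SD: {area:.2%}")
# within 1 SD: 68.27%
# within 2 SD: 95.45%
# within 3 SD: 99.73%
```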

Fairly straightforward, really. Except for the last three points, which get a bit more specific and math-y. But that's what those studying science at university learned about whilst the arts grads discussed poetry and whether trees in the forest make any sound when they fall over, for reasons which never seem to be explained or considered important. Ahem. I digress.

Anyway, what all this allows statisticians to do is to crunch numbers and perform tests to see if a given statistically valid sample of a population is different to another statistically valid sample of a population.

A push and shove and this land is ours

I know, I know, your head probably hurts (I did mention the patronising above, didn't I?). A bit more stat-stuff before we can have a good look at RUBBISHonline. Bear with me once again.

So, statisticians collect data from a random sample of the population under scrutiny and find its mean, by adding up all the numbers and dividing the total by the number of observations. Then they test it to see if it is different to the mean of another random sample. It's worth pointing out that statisticians use 'population' to mean 'the group being studied' and 'sample' to mean 'a completely random sub-section of the population'. By collecting data on, say, a completely random group of a surprisingly small number of people over the age of 18 in England, you can be fairly certain that your data will allow you to start to generalise about adults over 18 in England.

Here comes the very large and important ‘BUT’

There are a few major issues with this kind of generalisation, however.

Firstly, you need to be fairly sure that the means of your data are actually normally distributed, i.e. are symmetrically distributed around the most common value.

Secondly, your population has to be fairly accurately defined, since adults over 18 in England may be quite different to, say, adults over the age of 18 in Japan.

Finally, you need to understand that your population is a theoretical construct containing all the people who you might have had in your sample if the time and money involved weren't prohibitive. Not the actual population.

If all these things are understood and you feel you can make these generalisations, then you can use some funky maths to find out if a new sample of adults over 18 in England is similar to your first sample or not. The great thing about this maths is that it is designed so that you don't really need to understand it. You just measure something about people, find the mean and standard deviation of the sample, and put these into a simple equation which includes the mean and standard deviation from your first sample.

This gives you a number – the Z-score on that last scary graph - and if that number is between two given values, you can be fairly sure that your two samples are ‘statistically similar’. If they are not similar, statisticians say that the comparison between them is ‘significant’. To a Stat-head, this means that they might not be drawn from the same population. It doesn’t mean that they are definitely different. It simply means that the maths says that you might want to look closer to see what is going on.

You’ve probably even seen these Z-scores, if you’ve looked closely at any data. Does 1.96 ring any bells? If it does, it’s because 95% of values in a normal distribution occur between -1.96 and 1.96 standard deviations from the mean.
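That 1.96 figure can be recovered in a couple of lines with Python's standard library (a sketch of the standard theory, not anything RAISEonline itself publishes):

```python
from statistics import NormalDist

z = NormalDist()  # standard normal distribution: mean 0, SD 1

# The central 95% of the distribution lies between these two z-scores.
print(round(z.inv_cdf(0.025), 2), round(z.inv_cdf(0.975), 2))  # -1.96 1.96

# Equivalently, the area between -1.96 and 1.96 is (near enough) 95%.
print(round(z.cdf(1.96) - z.cdf(-1.96), 4))  # 0.95
```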

Anyway, here endeth the lesson. You should now have been thoroughly patronised, may be a little scared, but you should have a rough idea how statisticians calculate Z-scores for standardised normal distributions. Let’s get back to the abuse of data.

RUBBISH data analysis always starts with the best of intentions

Back to RUBBISHonline. So, it will all have started with someone wanting to know how children in school are progressing. I think most people would agree that it is worthwhile trying to work out whether children are developing as they move through school. That’s why we try to assess what children know in increasingly formal ways as they get older. I teach in Primary, and it is useful to know roughly how children are getting on with the core skills of reading, writing and numeracy.

Assessing these things is notoriously difficult, however. It’s even harder to put numbers onto knowledge. That hasn’t stopped people trying to do just that, and since the National Curriculum was introduced in 1988 children have been assessed as being at different ‘levels’ based on what knowledge, skills and understanding various experts have said that they should have.

The levels were fairly arbitrary, as they had to be, since you can’t actually measure knowledge, skills and understanding using an absolute scale. The experts simply drew up a set of things that children should know and be able to use. As I understand it from Warwick Mansell’s Education by Numbers, Year 6 was allocated as Level 4, Year 2 as Level 1 and then a straight line graph was drawn as follows:

So far, so good. A nice straight line, with children developing as they progress through school, as you would expect. But then Big Bad Data reared its ugly head and Bad Things Began To Happen.

The government, in its wisdom, decided to produce 'performance tables' based on the data it began to collect. Teachers pointed out that having a small number of levels didn't really give a good picture of what children could actually do. And schools pointed out that the differing levels of prior knowledge children had meant that a straight set of levels helped some schools and disadvantaged others. So sublevels were introduced, and measures of 'value added' were invented. Levels became 'point scores', and more numbers were used to try to picture what schools' imperfect assessments were actually saying.

All this coincided with an era when computers have made processing data very easy, and it's fairly obvious why those overseeing the running of schools took to data in a big way. Unfortunately, it would appear that those collecting and analysing the data dumps from schools have very little understanding of the wonderful world of statistics, and some extraordinarily bad analysis is now commonplace.

So, here are some problems with RAISEonline:

The ‘What’s it testing?’ problem

As noted above, you need to be fairly sure that your data is independent and identically distributed to use Z-scores and the like. If you want to test whether the test results for a given school are statistically significant when compared to a national mean and standard deviation, as RAISEonline does, you are effectively testing a 'school effect'. Is there something about this school which makes it different to a control sample (in RAISEonline's case, all the children contributing results for a given school year)? So what does make the school different? Is it the quality of the teaching and learning, as RAISEonline implicitly assumes? Is it a particular cohort's teaching and learning? Is it the socio-economic background of the children? Is it their prior attainment? Is it their family income? Or is it a combination of these factors? All of this begs the question: what is RAISEonline actually assessing?

RISE, the Research and Information on State Education think tank, found that "School performance is strongly related to the prior attainment and socio-economic background of a school's intake." They also noted that schools "do not operate in a vacuum" and listed some of the influences on pupil attainment, "such as maternal health and wellbeing, family income, parental job security, the socio-economic mix of peers and access to thriving labour markets."

So is RUBBISHonline testing all of these things against a huge national sample which jumbles all of this up and loses all meaning?

The Not Independent and Identically Distributed problem

If you are testing the effect of a given fertiliser on a given species of plant, you can be fairly sure that each of the plants in a sample of plants is independent and identically distributed. This is complicated, but essentially, you should be able to swap any of your plants between sample groups before conducting your experiment, since any given plant should react to the experiment in the same way. In order to be able to assume that a sample cohort which has supplied test results for a school is independent and identically distributed, you should be certain that a completely different random group of children subject to the same teaching and learning would perform in exactly the same way. This seems entirely unlikely, since a given cohort in a given school will not be randomly selected from the entire population. The children are likely to be similar to each other in a statistically significant way, whether through socio-economic background, prior attainment, family income and so on. And that means that the attainment levels of the cohort are not independent and identically distributed random variables.

The Key Stage 1 Data Manipulation problem

At Key Stage 1, children can be assessed as being at Level 3, 2A, 2B, 2C or 1. These levels are given point scores – which you can find on page 55 of this document – as follows:

You will notice that there are no sub-levels for Level 3 or Level 1. That's because this system was developed piecemeal, and because those who developed it clearly had no idea how Big Data would abuse the information.

As a result of this, teachers have been forced to manipulate data. A child assessed at Level 3 is automatically assigned Level 3B; likewise, Level 1 is assigned 1B. This has hugely important implications later on, as progress measures are calculated using this initial, compromised, data.

Incidentally, there is a Level 4+ category at KS1. Almost no child is ever assigned a Level 4B assessment at KS1, because schools have to show progress over a Key Stage, and no child can be assigned Level 7 at KS2, so it would never happen. There is also a W level – working towards Level 1 – which is arbitrarily assigned 3 points. Why? Who knows? This category is for children who clearly have special educational needs, and is clearly not part of the mainstream.

So, a mainstream child can only be assigned to one of five categories, which are heavily subjective. At the more able end, the data is fudged because some children might be a little more advanced than 2A but not as advanced as 3B, and schools are forced to choose a category. At the lower ability end, a similar situation applies. The data has already been compromised before any analysis is undertaken.

The Primary Age problem

Age matters a great deal at Key Stage 1. Children are put into cohorts with a 365-day range. Some children are up to 13% younger than others in the same cohort when they are assessed. Given that children are likely to have been talking for just five years by the time they are assessed in May of Year 2, some have effectively 20% less experience than others in the same cohort. Remember that any analysis using Z-scores must assume that the data is independent of any underlying variable. KS1 data is heavily skewed by an underlying age factor.

Key Stage 2 is similarly influenced, since some children in Key Stage 2 are 8% or so younger than their oldest classmates – a huge twelve months behind in life experience and learning time.

The Loss of Definition Problem

Statisticians have always been very careful not to overstate what can and cannot be inferred when you start to manipulate data. As well as being rigorous about assumptions and independence of data, it has long been recognised that the more you manipulate data, the more errors you should account for and the more definition you lose.

For example, you might measure the heights, ages and shoe sizes of a population, and then create a weighted average of those three measures. Any analysis using this new measure – let's call it the Average Measure Score (AMS) – would be a lot more general than the three specific measures on which it is based. Someone with a particular AMS might look and be quite different to another person with the same AMS, because you have lost definition by combining three measures.

At primary level, RUBBISHonline combines three scores into an Average Point Score, which doesn't actually mean anything when you think about it. You can't tell whether someone with an APS of 13 was assessed at Level 1 for reading and writing and Level 3 for numeracy, or whether they were assessed at Level 2C for all three. And because you can only score 9, 13, 15, 17 or 21 points in each subject, lots of children will have an APS of 15, whether from Levels 2C, 2B and 2A or from Levels 1, 2B and 3, in any combination of subjects. The APS becomes meaningless.

This gets worse at GCSE, when five different numbers are combined into a single Average Point Score. What exactly does an APS of 390.8 mean?

The Missing Data Problem

RUBBISHonline allows children with missing data to be given an APS, so if you miss an assessment at KS2, you simply take an average of the other two assessments.
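The loss of definition in the Average Point Score is easy to demonstrate. A minimal sketch in Python, using the mainstream KS1 point scores described above (Level 1 = 9, 2C = 13, 2B = 15, 2A = 17, Level 3 = 21), counts how many different level profiles collapse into the same APS:

```python
from itertools import product

# KS1 point scores as given above: Level 1 = 9, 2C = 13, 2B = 15, 2A = 17, Level 3 = 21
points = {"1": 9, "2C": 13, "2B": 15, "2A": 17, "3": 21}

# All ordered (reading, writing, maths) level profiles whose points average 15
combos = [levels for levels in product(points, repeat=3)
          if sum(points[l] for l in levels) / 3 == 15]

print(len(combos))                   # 13 distinct profiles share an APS of 15
print(("2C", "2B", "2A") in combos)  # True
print(("1", "2B", "3") in combos)    # True
```

A child at Level 1 in reading alongside a child at 2C across the board can end up indistinguishable once the averaging is done.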
The mind boggles, and APS becomes even more meaningless.

The Misunderstanding Significance Problem

If you've seen a RUBBISHonline report, you will have seen 'Sig+' and 'Sig-' appearing here and there. The term is used incorrectly, or at least with no statistical justification worth its salt.

Now, in statistical theory, significance is a specific term, and it isn't quite what you might think. It doesn't mean 'important', as many people would legitimately infer. 'Significance' is the likelihood that something is not simply due to chance. So if you take a random sample of a population and find its mean, you would expect that mean to be slightly different to the mean of another sample of the same population. It might be bigger or smaller, but it would be expected to be similar.

What significance tests do is suggest – and that's all they do – that one mean which is very different to another may be from a different population, and that this is not simply by chance.

There is more to the idea of significance, involving null hypotheses and Type I and Type II errors. I could go on about this at length, but I'll simplify by saying that the starting point is usually to check whether something is likely to be within the 95% of possible values which would indicate that two samples are from the same population, and to take it from there. Given that there is very little likelihood that a school cohort and all the children in the country of the same age are samples from the same population, there is little point trying to test for significance, since any statistician will tell you that they can't be compared that way anyway.

The RUBBISHonline Methodology Document suggests that 'Significance is a statistical term that shows if a difference or relationship exists between populations or samples of data'. Which is not what significance means to statisticians at all. It might 'suggest', given all the caveats above, but it doesn't 'show' anything.

Now, RUBBISHonline might say that they have used this explanation because their methodology document is aimed at the general reader; this won't wash, since there is enough higher-level maths in the document to be beyond most people. I understand the statistics, and I am not in the least persuaded that those behind RUBBISHonline do.

The RUBBISHonline problems

I could go on (and on) about RUBBISHonline, but even this initial headline analysis shows why this stuff is, in fact, rubbish: the 'What's it testing?' problem, the Not Independent and Identically Distributed problem, the Primary Age problem, the Key Stage 1 Data Manipulation problem, the Loss of Definition Problem, the Missing Data Problem and the Misunderstanding Significance Problem.

And this is the data on which schools are judged by those who know no better

It's worth quoting the following in full. It comes from the National Governors' Association's document Knowing Your School.

"The purpose of RAISEonline is twofold. Firstly, it is an important (but by no means the only) source of data for school governors to use in retrospective self-evaluation and school improvement planning. It should be used alongside other sources of data such as the Ofsted Data Dashboard, FFT Governor Dashboard, FFT Self Evaluation Booklet and the schools' own pupil tracking data.

Secondly, the analyses are used by Ofsted inspectors during their inspection of schools. It is therefore critical that you are able to interpret your school's data from an inspector's perspective and can identify apparent areas of under-performance in order to explain why they occurred; or demonstrate that you recognise them and have set out the action you are taking to address them."

It's hard to know where to start. The Data Dashboard is ridiculous and tells you nothing. RAISEonline is rubbish, and the next Ofsted inspection team which visits your school is going to use this hopeless analysis and hit you on the head with it. You are stuffed from the outset, as RUBBISHonline is likely to show up all kinds of warnings because it's using the wrong, and incorrectly applied, tests of significance which don't even stand up to elementary statistical scrutiny. You'll be hammered by people using information supplied by nameless data wonks who have completely misunderstood the theories behind the sampling of data.

Variables and means have to be treated differently

A final, further example of the sheer idiocy of the people who judge schools on RUBBISHonline data comes from yet another complete misunderstanding of the statistics of means of samples. Actual test marks can tell you things, and you can observe (with certain caveats, once again – statistics is not about certainty, after all) a trend in the results which, say, a given child gets in tests. If a child gets 12, then 15, then 20 out of 30, there is a trend. Likewise if they get 15, 15 and 15, or 20, 15 and 10.

But means don't work like that, because the central limit theorem states that they will be distributed normally. So the mean you have could be any one of any number of possible means. And you have to test if a given mean is significantly different from the mean which you might expect.

Not that those looking at data know this. For example, the NGA includes the following table in its guidance to governors:

Now, to a non-mathematician, these figures might look like a trend 'over time in Key Stage 2 average point score. This school has tended to achieve higher scores in mathematics than in English'. But does it actually show this? The central limit theorem suggests that the mean of any sample could vary considerably and still be similar to another sample. A Z-test can be used to test whether the mean of a random sample is significantly different to the mean of another sample from the same population. Let's look at one of these samples as reported in the table above – which aren't random or independent, as discussed above, but we'll ignore that for now.

Let's take Mathematics in 2007, because it's got the biggest 'difference' – which is included in this table, illuminatingly, for reasons I'll explain below. It has a mean APS of 28.8, for 47 children, against a national mean of 27.3. Let's assume that these are both samples from the population of children taking the Key Stage 2 test. We are also going to have to assume a standard deviation of 5.22, as in the Methodology document's worked example for KS2 APS. Big assumption, but bear with me.

So, using the formula z = (sample mean – population mean) / (standard deviation / √n):

This is (28.8 – 27.3) / (5.22/6.78233) = 1.948945. Even if the standard deviation were slightly higher, this wouldn't be statistically significant and would have to be seen as a reasonable result for a random sample of children taking these assessments. Since the results can be reported as Levels 3, 4, 5 and 6 and be allocated 21, 27, 33 and 39 points (and 15 points for some categories of result), a standard deviation of at least 5 points seems reasonable.

Now, using RUBBISHonline's own flawed methodology, this isn't statistically significant. The two means can reasonably be assumed to have come from the same population. You might think that they are from different populations (which they are) because the value is really close to the line drawn for 'significant at the 95% level', but the theory says that they aren't.

I haven't done the calculations for any of the other samples above, but I'm willing to bet that none of them are statistically significant. Which means that there is no 'trend' – they are all reasonable results if (big if) the school cohort is a representative sample of the population. By the way, the thing labelled 'difference' above is the kind of thing a complete non-mathematician would use. It's entirely irrelevant if you understand statistical theory at A-level standard. Which clearly very few people do.

So, should you ever have a conversation with anyone about RUBBISHonline – I don't know, an Ofsted inspector, or someone making judgements based on data – I suggest you ask them a simple question: "Can you explain your reasons for thinking that the data for our school reflects our school and no other factor, and that it is independent and identically distributed?"
If they can't answer that, then they are in no position to make any judgements using the data in the way they are trying to use it.

I doubt you will get anywhere, since the level of understanding of statistics – and trust me, this is entry-level undergraduate stuff, not advanced, weird, barely-understood-by-anyone stuff – among those making judgements is utterly laughable.

And good luck with any attempt to “identify apparent areas of under-performance in order to explain why they occurred; or demonstrate that you recognise them and have set out the action you are taking to address them”. I doubt any Ofsted inspector will listen to your reasoned demolition of the statistical nightmare that is RAISEonline. If the people who have created the data analysis tools have so little understanding of what they are doing with data, what hope is there that school inspectors do either? RAISEonline is deeply, deeply flawed and should be dismissed as such. It's RUBBISH.
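For anyone who wants to check the worked example above for themselves, here is a minimal sketch using the figures quoted in the Mathematics 2007 calculation (school mean 28.8, national mean 27.3, assumed national SD of 5.22; the divisor 6.78233 used in that calculation is √46):

```python
import math

# Figures from the worked example above: school mean APS 28.8,
# national mean 27.3, assumed national SD 5.22. The divisor 6.78233
# used in the calculation above is sqrt(46).
school_mean, national_mean, sd = 28.8, 27.3, 5.22
n = 46

z = (school_mean - national_mean) / (sd / math.sqrt(n))
print(round(z, 4))  # 1.9489
print(z > 1.96)     # False: not significant at the 95% level
```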

I have always been of a view that a sample of 30 kids is way too small to conclude anything with respect to numerical data. That was before taking on board the insights you lay out here. How do they get away with this? Have you seen these errors raised by anyone else?

Reply

Icing on the Cake

17/3/2014 03:06:39 pm

No, I haven't seen any criticism similar to mine here, although I have seen people like Dylan Wiliam allude to some concerns about the RAISEonline data. I am fairly sure my criticism is valid, having pored over it recently. Would really like to hear from any statisticians who read this.

Reply

Matt

18/3/2014 01:38:03 pm

I agree: too much emphasis is placed on numerical data to measure the performance of schools. As I point out regularly in my own school, it is pointless comparing end of Key Stage pupil performance from year to year because they are different children! I can see the value of tracking attainment from Y2 to Y6 and Y9 for an individual, but for the benefit of the pupil. Unfortunately, this data is only being used (poorly) to judge school performance.

A great blog. There are faults in RAISEonline and I have had to sit and argue with an Ofsted inspector in the past who had not grasped the significance, or lack of significance, of small cohorts of pupils in a primary school that only had 64 children in total. However, I am also an Ofsted inspector and I find RAISE useful to identify questions. It is a small piece of a much wider evidence base and, yes, it is flawed, but then most data in schools is.

Reply

Icing on the Cake

18/3/2014 04:27:29 pm

Thanks for your comments, Keith. I'm intrigued by your experience as an Ofsted Inspector, as I'm preparing other blogs about data use during inspections. Can you give any examples of questions that RAISE has raised? I'd love to know...

Reply

igb

19/3/2014 03:48:54 pm

"The mathematical thinking behind this is made explicit in the Central Limit Theorum, which states that, with various provisos, independent variables are approximately normally distributed."

No they aren't. As you state it, it's clearly untrue: my counter example would be rolls of a single die, which are most certainly not normally distributed. If I roll a die and plot a histogram of the results, it will converge on a rectangular distribution, and if it doesn't (after a sufficient number of trials) then I'm going to worry about how honest was that craps game I was playing last night. Household incomes aren't normally distributed, because lots of people have household incomes greater than twice the mean, but very few have incomes less than zero. Lifetimes aren't normally distributed, because most people die between 70 and 90, but no-one makes 140. In fact, very few distributions drawn from "real life" are normal, and height is quite rare in being close to it (the tails are fatter, but it's close enough): many are right-skewed in some sense, because there's a "zero" you can't go below, but no hard limit on the other side, so the mean is significantly above the median.

The central limit theorem does not address the distribution of a variable, but the distribution of the mean of a sample of the variable. Suppose I roll a die ten times and divide the sum by ten. I'll get a result somewhere between 1 and 6, but it's a lot more likely to be around 3.5 than it is 1 or 6. If I repeat taking the mean of ten rolls many times, and plot a histogram of _that_, then that will be normally distributed. The same will apply to household incomes: if I sample 100 houses, calculate their average income, and repeat that experiment, then with a lot of caveats about how I did the sampling, those means will themselves be normally distributed. And the same will apply to lifetimes, although the original distribution is not only non-normal, but probably bi-modal (depending on whether the country has good paediatric care: it's certainly savagely, and tragically, bi-modal in poorer countries).

There are some pathological cases where you can sample them repeatedly, calculate a mean of those samples, and not get a normal distribution, but they're rare. So it's magic: I can take distributions that are absolutely non-normal, sample them, take the mean of those samples, and it's a normal distribution.
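That "magic" is easy to check empirically. A minimal sketch in Python, using only the standard library (the sample sizes and trial counts are arbitrary choices for the demonstration):

```python
# Raw die rolls are uniformly ("rectangularly") distributed, but means of
# samples of ten rolls cluster tightly around 3.5 - the central limit
# theorem in action on a distinctly non-normal distribution.
import random
import statistics

random.seed(42)

# Raw rolls: converge on a rectangular distribution.
rolls = [random.randint(1, 6) for _ in range(60_000)]
counts = {face: rolls.count(face) for face in range(1, 7)}
# Each face should appear roughly 10,000 times.

# Means of samples of ten rolls: these pile up around 3.5.
sample_means = [statistics.mean(random.choices(range(1, 7), k=10))
                for _ in range(10_000)]

print(min(counts.values()), max(counts.values()))  # both near 10000: flat
print(round(statistics.mean(sample_means), 2))     # near 3.5
print(round(statistics.stdev(sample_means), 2))    # near 1.71/sqrt(10), i.e. about 0.54
```

Plot a histogram of `sample_means` and the familiar bell shape appears, even though no individual roll is remotely normal.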

Given that misconception, I'm not sure I follow the rest of your argument (although I haven't studied it in detail). If children and schools were homogeneous, then the means of samples of them would be normally distributed (ie, I pick 30 children, take their average attainment, and plot those means) unless the underlying distribution of attainment were weird (Cauchy or something). So to take two samples of children and compare their means using parametric statistics is not an unreasonable move: I don't think it's making an assumption about the distribution of attainments (which is as you say completely invalid), rather it's making an assumption about the distribution of means of sub-samples, which is usually valid (central limit theorem) provided the underlying distribution is not pathological (which would be a matter for study). I might well be misunderstanding your argument, and I'm certainly not a statistician, but your mis-statement of the central limit theorem makes me wonder if your argument didn't take a wrong turning at that point.

Reply

Icing on the Cake

19/3/2014 05:11:47 pm

Thank you very much for your lengthy and exceptionally useful response. As I said, I’ve studied statistics but it’s been a while since I looked at the theory as closely as I have here, and I’ve been hoping that someone would have a good look and find the flaws in my argument. And, yes, I’ve clearly misstated the Central Limit Theorem, which has it that the means of independent variables are approximately normally distributed, rather than the variables themselves…

That obviously sent me off on a tangent with regards to the distribution of assessment results for cohorts of children, which, as you say, are unlikely to be normally distributed, although their means would be expected to be if the sample was independent and identically distributed.

I share the view, however, referenced in the RISE link that children and schools are clearly not homogeneous, and therefore that using analysis based on the Central Limit Theorem is not valid, as the ‘random variables’ – i.e. the children’s assessments – from which a mean is derived are not independent and identically distributed, since the children in a given school share socio-economic background, prior attainment, family income and so on.

So, I need to revise the bit about the ‘Not Normal’ problem, since this is a red herring, and clarify the Not Independent section to make it clear that it is the fact that the random variables are not ‘independent and identically distributed’ which is the problem.

I’d be grateful for any other comments on my analysis – I really appreciate the peer review in real time!

Reply

igb

20/3/2014 12:56:17 am

I don't think your chain of reasoning is right.

I think the claim of analysis of schools statistically is roughly as follows.

We have a set of indicators which characterise schools. That will be a combination of FSM, EAL, PP to measure social deprivation, and (for post-primary in particular) we will have prior attainment which, because it itself depends on other factors, is a proxy for those factors. If we divide schools into subgroups that are similar, such that they have similar values for all these indicators, then it is reasonable to assume that if a sub-sample (ie, the results in one school) differ significantly from the whole population (ie, the distribution over all such schools) then something's afoot, and it would be then worth going and doing some qualitative work to find out if that's the case, and if so why.

Just saying, "oh, the population isn't homogeneous" is a counsel of despair. It's perfectly possible to divide populations up into comparable sub-populations and then analyse outcomes based on that. That division may be inexact, and the onus is on the people proposing the method to justify it. But it has a long history and, to take a related example, your argument would have us believe that all clinical audit is meaningless, because the populations treated by hospitals are neither normal (they tend to skew towards unhealthiness) nor equivalent (for most of the same social reasons that schools differ).

And I think you are overstating the requirement for distributions to be normal for parametric techniques to be valid: very, very few experimental populations are normal, in the sense that you could exactly characterise the distribution by its mean and variance, plug them into the Gaussian distribution formula and get a perfect match. Yes, school populations will be skewed, but you don't have to immediately reach for your non-parametric tool box. I don't know about RAISE, but you appear to be asserting that it relies on the distributions being normal; I find that quite surprising, and I suspect that in fact it relies on the populations being parametric, which isn't remotely the same thing.

Reply

Icing on the Cake

20/3/2014 04:11:42 am

Thanks once again for your exceptionally useful feedback on my analysis, igb. I really appreciate you taking time to critique the ideas I've put forward, and it's given me a lot of food for thought. I'll need to develop my argument a little further when I get time later, but for now, thanks very much!

Reply

igb

20/3/2014 05:03:56 am

A pleasure, and thanks for your comments.

By the way, at risk of becoming a bore, one more thing for you to think about.

I think you're being very conservative in your definition of "independent variables". I'm not sure that anything other than the output from a strong random number generator based on stochastic physical processes would pass the test you're erecting, which is that to be independent the variable must not depend on some other variable. I think in general it's valid to use parametric statistics to study things other than the interval between clicks registered by your Geiger counter (and even they depend, in some sense, on the half-life and mass of the sample you're pointing it at).

Suppose your study is on parental income versus children's educational outcomes. I think that's a perfectly reasonable study, and doesn't require exotic statistical tools: either you make some assumptions about continuous data and use a coefficient of correlation, which will put a straight line (or a quadratic, or whatever) through the scatter graph of income versus outcome, or you go all edgy and non-parametric and use something like Spearman's rank correlation coefficient, which just considers a league table of income and a league table of outcomes without looking at the absolute differences. The former study would consider some notion of how much additional income correlates with an extra GCSE grade; the latter study would just look at whether people who earn more have children who do better, without studying the magnitude of the effect. Interpreting the results would be interesting, and the left might draw different conclusions to the right, but I don't see anything statistically invalid about the basic approach. The study would assume (and it seems reasonable) that educational outcomes may be correlated with parental income, but the effect of a child's educational outcome on their parents' income is small, so the income is an independent variable.
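The two approaches sketched here can be shown in a few lines of Python. The income and outcome figures below are invented purely for demonstration, and the hand-rolled Spearman skips tie handling (the demo data has no ties):

```python
# Pearson correlation looks at magnitudes; Spearman only at rank order.
# All figures are made up to illustrate the contrast, nothing more.
import math

income  = [18, 22, 25, 31, 40, 55, 72, 90]          # hypothetical parental income (thousands)
outcome = [4.1, 4.4, 4.9, 5.2, 5.6, 6.4, 6.9, 7.4]  # hypothetical mean attainment score

def pearson(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def spearman(xs, ys):
    # Rank-transform both variables, then take Pearson on the ranks.
    # (No tie handling: fine for this demo data, which has no ties.)
    def ranks(vs):
        order = sorted(range(len(vs)), key=lambda i: vs[i])
        r = [0.0] * len(vs)
        for rank, i in enumerate(order):
            r[i] = float(rank + 1)
        return r
    return pearson(ranks(xs), ranks(ys))

print(round(pearson(income, outcome), 3))   # strong positive: magnitude matters
print(round(spearman(income, outcome), 3))  # 1.0: the rank orders agree perfectly
```

Here the data is monotone, so Spearman is exactly 1 while Pearson is high but below 1, which is precisely the "league table versus magnitude" distinction being drawn.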

However, in your definition, the parents' income depends on _their_ educational achievement, and on their health, and on the economy of the area, and so on, and is therefore not independent. Now you're right: those factors are important, and are the sort of confounders that make correlation (to use a cliche) not a sign of causation. Showing that although A is correlated with B they are both in fact correlated with C is vitally important (it's at the heart of multivariate factor analysis, which got a bad name from Cyril Burt's shameful abuse of it but is nonetheless entirely respectable statistics) and showing mechanisms of causation is even more important. But they don't mean that, for the purposes of a study into the correlation of parental income and educational outcomes, it's invalid to treat parental income as independent.

I suppose the summary would be "independent variables are those which do not depend on other variables in your study: for each independent variable x_n, there is no y_n in the data you are studying such that x_n = f(y_n)". Your test seems to be that to be independent, a variable must not depend on anything inside or outside the model. To return to my example of income and outcomes, I don't think anyone seriously suggests that taking a correlation between income and outcomes, noting the slope is positive and targeting some interventions at people whose parents are on low incomes is dark voodoo science which we should have nothing to do with.

Mark Bennet

20/3/2014 03:06:54 am

There is one very simple way of illustrating what is wrong with trend information. Suppose the test results for each school are random with any reasonable distribution. In any year, as a result of random fluctuations, essentially half the schools will go down and half will go up. Over two years, which is all we can detect with three years results on display, one quarter of schools will have two successive rises and one quarter two successive falls. Now it may be that schools with higher results are likely to have a fall, which would bias things a bit. But the "clear rising/falling trend" is prone to gross over-interpretation. In approximately one eighth of schools there would be three successive rises, just by chance.
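This is easy to simulate. Treating each year-on-year movement as an independent 50/50 rise or fall (an assumption, but the natural reading of the argument above), roughly a quarter of "schools" show two successive rises and roughly an eighth show three, from pure noise:

```python
# Simulate schools whose results are pure chance: each of three year-on-year
# changes is an independent coin flip. Count the apparent "trends".
import random

random.seed(1)
n_schools = 100_000
two_rises = three_rises = 0

for _ in range(n_schools):
    # Three year-on-year changes: +1 for a rise, -1 for a fall.
    changes = [random.choice([-1, 1]) for _ in range(3)]
    if changes[0] == 1 and changes[1] == 1:
        two_rises += 1
    if all(c == 1 for c in changes):
        three_rises += 1

print(two_rises / n_schools)    # close to 1/4
print(three_rises / n_schools)  # close to 1/8
```

So out of, say, 16,000 primary schools, thousands would show a "clear rising trend" even if every result were random.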

Reply

igb

20/3/2014 05:20:04 am

But there are entirely mainstream statistical approaches which resolve these issues, and if the arguments you advance were true, it would be impossible to measure outcomes over time for any service.

For example, in the case of schools, added value figures are presented with confidence intervals, which we might also read as error bars. Does a movement in the central value of the value added figures over the course of a year indicate a real change? No, for all the reasons you suggest. However, if the upper limit on the added value this year is below the lower limit last year, does _that_ indicate a real change? Yes, it does, subject to all the usual caveats about the testing methodology both now and at the starting point the added value is being measured against. If the confidence intervals largely overlap, we learn almost nothing from the comparison. As the confidence intervals move from being the same to having a progressively smaller overlap, the significance of the difference increases, until they are disjoint, at which point the statistical significance of the original confidence intervals applies to the proposition that there is a difference between the (unknown) real values.
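A minimal sketch of that reading rule, with invented value added figures:

```python
# Compare two value added scores by their confidence intervals rather than
# their central values. The figures are invented for illustration only.
def overlap(ci_a, ci_b):
    """Return True if two (low, high) confidence intervals overlap."""
    return ci_a[0] <= ci_b[1] and ci_b[0] <= ci_a[1]

last_year = (99.1, 101.3)   # hypothetical VA confidence interval, last year
this_year = (101.5, 103.7)  # hypothetical VA confidence interval, this year

if not overlap(last_year, this_year):
    print("Disjoint intervals: evidence of a real change (caveats apply).")
else:
    print("Overlapping intervals: the comparison tells us very little.")
```

Had this year's interval been (100.8, 103.0) instead, the intervals would overlap and the movement in the central value alone would justify no conclusion.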

I agree that there's a tendency in many fields to assume that the probability distribution of the indicators is symmetrical and has a sharp peak, and therefore unwarranted conclusions are drawn about small variations in the central value that are in fact meaningless. But to dismiss all such comparisons, even those where proper attention is paid to the significance of the numbers, seems unwarranted.

Reply

John Vitale

20/4/2014 05:17:27 pm

The final insult comes when carefully calculated levels of progress for each child are deemed inadequate by the SLT prompting the order to "be generous" with levels. No amount of statistical analysis can draw any accurate conclusion from data which is essentially made up.

Reply

Icing on the Cake

21/4/2014 05:32:00 am

"No amount of statistical analysis can draw any accurate conclusion from data which is essentially made up."
Quite! Yet another reason why RUBBISHonline is Not Even Wrong!

Reply

Icing on the Cake

21/4/2014 02:42:43 am

It’s been a month since I published this article, so I thought I’d return to it to add some summarising thoughts on the comments above. I stand by my analysis that the data analysed by RAISEonline is not independent and identically distributed, making the use of statistics based on the Central Limit Theorem (CLT) invalid, and the conclusions which RAISEonline purports to reach RUBBISH.

Igb has raised some interesting points above, but I think that the line of argument does not invalidate the points which I made in my article. Here’s why. Igb uses the example of an analysis of parental income versus children's educational outcomes. This would appear to be a reasonable analysis using CLT, since the data is IID. If you were analysing parental income versus children’s educational outcomes at a school level, you could take two children in two completely different schools in England and swap them round, and the correlation between the two variables would still have a positive slope.

The problem with this line of thinking is that ‘taking a correlation between income and outcomes, noting the slope is positive and targeting some interventions at people whose parents are on low incomes’ is not an accurate summary of what RAISEonline does, which is to subject a cohort of children in a highly specific context to a comparison with all children (or a sub-group of children) in England.

As I said above:
“If you are testing the effect of a given fertiliser on a given species of plant, you can be fairly sure that each of the plants in a sample of plants is independent and identically distributed. This is complicated, but essentially, you should be able to swap any of your plants between sample groups before conducting your experiment, since any given plant should react to the experiment in the same way.

In order to be able to assume that a sample cohort which has supplied test results for a school is independent and identically distributed, you should be certain that a completely different random group of children subject to the same teaching and learning would perform in exactly the same way.”

If you can’t do this, the methodology is flawed and, therefore, any analysis is not even wrong.

I discussed this via email with igb, as follows:

Me: “I'm sticking by my basic hypothesis, which is that it is not reasonable to compare a school - especially a Primary school with 30 or fewer children in Year 6 - to nationally derived means and standard deviations, and that any 'school effect' is comprised of much more than the teaching and learning in the school, which seems to be the basis of the analysis RAISEonline undertakes.”

igb: “I think that performing a t-test between a single class and the whole cohort nationwide is bogus, and even if the test is significant then it falls into the "so what?" category. If RAISE does that, then they deserve scorn.

However, performing a t-test between a class and a matched cohort of pupils in schools with similar measures of intake doesn't strike me as immediately bogus, although the definition of "similar measures" is obviously key.

The rest I agree with: good article.”

I’ve shown elsewhere that the similar schools measures are hopelessly flawed and meaningless, and RAISE doesn’t actually test against a matched cohort of any kind. A school is simply measured against the national dataset in a blunt and hopelessly meaningless way. Whatever RAISE deserves, scorn would be a good starting point. It’s RUBBISH.

Reply

Vp

28/5/2014 02:27:26 pm

Not statistically valid for 64 children!! Try arguing your case when you are in a very small school with cohorts of two!

Reply

SKJ

7/7/2014 01:21:33 pm

I just came across your very interesting article which I think I will have to read again to try and fully understand all the points within it. I have been concerned for a while that 'the data' is being used more and more by ofsted to make judgements about schools.
My experience of this is when an ofsted inspector said that the teaching grading would not be based on what he saw during his visit but on the 'trends of data'. Being in a very small school with 8-10 children per cohort, I presumed that it would be fairly tricky to identify trends in the data. I'm not a statistician but this was just based on common sense. When I asked the inspector about this he said that his judgement would indeed be based on the trends of data. I asked him at what cohort size it was possible to identify these trends, to which he replied 'over a 3 year period'. When I clarified what I was asking, and after a bit of probing, he confirmed he would 'make his judgment based on trends in the data' even if the school only had cohorts of 5 per year.
Surely this is completely statistically unreliable? Is there statistically a point where a cohort size is too small to really be able to make any judgement from the data? I would be interested to know your thoughts.

Reply

Jack Marwood

9/7/2014 04:25:21 pm

Thanks for your comment, SKJ. It's very interesting to read about your experiences with an Inspector referring to 'trends' in the way you suggest; firstly because I know that there are those who don't quite believe that Ofsted Inspectors use data to judge schools in the way you suggest they do, and secondly because, as you suggest, the idea of a 'trend' with a small cohort is clearly not reasonable.

From a statistical point of view, as I said in the article, a sample of less than 200 or so would be subject to different criteria than one which is larger. Once you get down to samples of less than 200, you need tables of t-scores such as http://easycalculation.com/statistics/t-distribution-critical-value-table.php (the second column of the 2-tail test) rather than a simple z-score of 1.96 for a 95% significance test. With 5 degrees of freedom, for example, you would need to compare to 2.5706. This does, of course, assume that the data can be analysed using these tests, which I argue is not reasonable given the non-random nature of a given cohort.
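For anyone who wants to check these numbers: the z critical value can be computed with Python's standard library, while the t critical values below are copied from standard tables (matching the linked table), not computed here:

```python
# The 95% two-tailed z critical value, versus the larger t critical values
# that small samples require. The t values are transcribed from standard
# t-distribution tables, not derived in this sketch.
from statistics import NormalDist

z_95 = NormalDist().inv_cdf(0.975)  # two-tailed 95% critical value
print(round(z_95, 3))               # about 1.96

# t critical values (two-tailed, 95%) keyed by degrees of freedom (n - 1):
t_95 = {1: 12.706, 2: 4.303, 3: 3.182, 4: 2.776, 5: 2.571,
        10: 2.228, 30: 2.042, 100: 1.984}

# With a cohort of six children (5 degrees of freedom), the bar is 2.571,
# not 1.96: small samples need a much larger difference before anything
# can be called "significant".
for df, t in sorted(t_95.items()):
    print(df, t)
```

Note how quickly the t values converge towards 1.96 as the sample grows, and how punishing they are for the tiny cohorts common in primary schools.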

Reply

Jack Marwood

14/7/2014 03:23:22 pm

I've had to update the main article above, as I've now discovered some more information about Z-tests and t-tests. As the saying goes, when the facts change, I change my mind. It's worth noting that over 1500 people have visited this page, and nobody had noticed the fairly obscure bit of statistics of which I wasn't aware.

I've discovered that current theory has it that z-tests *can* be used to test small samples of data - so long as you know the variance for the population data. This was news to me - as I have said, I studied statistics, but I don't work as a statistician; I'm just a teacher who is examining how data is analysed. I had understood that Z-tests were not valid for small samples, which is why t-tests are generally used for them instead.

In any case, this doesn't alter my main issue with RAISEonline and other analysis of data based on the CLT, which is that pupils are not i.i.d. and therefore cannot be analysed using z-tests or t-tests.

Reply

Paul Driscoll

29/10/2014 04:42:24 am

Jack, just came across this and very much enjoyed your analysis. I am CoG of a RI Secondary Comp, constantly trying to advise that you can't take the interpretation of data too far (unlike an HMI on a RoL course I went on, who said that you could read 'behind' the Sig+/- indicators). Although a professional scientist, I am not a bona fide statistician, and I am currently trying to work out if there is a good test of whether the PP gap is closing or widening year on year. I suspect movements in this gap are often too small to be significant, and the cohorts (and some staff, and curriculum) are different in each year, so I am not even sure it is possible. I understand that current Ofsted inspection guidance is focussed on the gap trend....

Anyway, given your comments about Z-tests and t-tests, I thought that you might be interested in something better ;-) ...

Thanks very much for this, Paul. Have you read my review of The Signal and the Noise, by Nate Silver? I discuss Bayesian thinking there. I look forward to exploring the link you've posted.

My current thinking is that cohorts are separate samples from a school's 'population', so any cohort can be expected to differ from any other fairly randomly. That would mean that any differences between FSM6 and other children would also vary randomly from cohort to cohort and year to year. That may not help you, but it may also open up other ways of challenging any Ofsted Inspector's assumptions about the 'quality' of your school...

Reply

Paul Driscoll

30/10/2014 03:14:29 am

Thanks Jack, I am still working through your various (useful) critiques. I have not yet found your review, though I am well aware of Nate Silver and I have read his book, which I thought was very engaging but at times confusing. WRT to Bayes I should declare that my father-in-law (http://en.wikipedia.org/wiki/T._M._F._Smith) was once President of the Roy Stat Soc, and his field was/is Bayesian analysis of survey data and I have tried to engage him in various statistical issues from my professional life -- not yet so much with respect to RAISE. Usually he goes a bit over my head. I may try again about RAISE data next time I see him. (His view of the Silver book was, if I recall correctly, equivocal.)

Mark Farrar

13/11/2014 06:09:25 am

Really enjoyed reading this blog and all the comments. I recently received RAISE data from a primary school where I am a Governor and was immediately concerned with the statistics being presented. A quick Google search led me here. I think we all agree the analyses are flawed to a greater or lesser extent, but as a Governor my concern is Ofsted. If inspectors challenge a school based on these data, surely we have to 'play the game' - I doubt that criticising the methods will help! I'm not completely across the new assessment methods that are being introduced this year as part of the new curriculum, but maybe this will make data more meaningful. We can but hope.

Reply

juliet

6/4/2015 12:58:32 pm

Really interesting to read this post and the ensuing comments. I've been at pains to explain in my school how we are misusing data, but I've struggled to be convincing, since I do not have more than a very basic grounding in statistics. For example, I refused, for some of the reasons which you outline above, to have anything to do with performance related pay which was based on pupil points progress. I was particularly concerned that people working with small intervention groups would also be subjected to PRP on the basis of the progress of 5 or 6 pupils. Even though I outlined the reasons why we could not use pupil points progress to decide on teacher pay, they went ahead with it for everyone else apart from myself. At one point we were debriefed on our RAISEonline data, which revealed that our SEN pupils were not making as much progress as the rest of our pupils!

Now we're in a new situation, where levels have been removed and schools are expected to 'track' pupils in their own way. This has got to be a disastrous new can of worms for anyone who understands the nature of data. I wonder if you have any thoughts about what lies ahead.

Reply

John

20/12/2015 10:41:29 pm

Great discussion. My question: when calculating the Confidence Intervals for Value Added (for all pupils and for groups), the guidance* says that the SD used is that for the national group, not the school's own data (i.e. CI = 1.96 x SD / sqrt(n)). Is this correct? Surely this would give a much narrower and potentially misleading CI?? (I'm no statistician!)
Thanks
*(see page 15: http://www.education.gov.uk/schools/performance/secondary_14/Key_stage_2_to_key_stage_4_value_added_measures.pdf)
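A minimal sketch of the formula John quotes (CI half-width = 1.96 x SD / sqrt(n)), with invented figures, showing how both the SD used and the cohort size drive the width of the interval:

```python
# Half-width of a 95% confidence interval from the formula in the guidance.
# The SD and cohort sizes below are invented purely for illustration.
import math

def ci_half_width(sd, n, z=1.96):
    return z * sd / math.sqrt(n)

national_sd = 6.0  # hypothetical national SD of the value added measure
school_n = 25      # hypothetical cohort size

hw = ci_half_width(national_sd, school_n)
print(round(hw, 2))  # the CI is the school's VA score plus or minus this

# The smaller the cohort, the wider the interval:
for n in (100, 25, 10, 5, 2):
    print(n, round(ci_half_width(national_sd, n), 2))
```

Whether the national SD or the school's own SD belongs in the numerator is exactly the question being asked; the sketch only shows how sensitive the interval is to that choice and to n.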

Reply



Author

Me?
I work in primary education and have done for ten years. I also have children
in primary school. I love teaching, but I think that school is a thin layer of icing on top of a very big cake, and that the misunderstanding of test scores is killing the love of teaching and learning.