SEM

I’ve had a number of enquiries recently about how to calculate the standard error of measurement (SEM) for a range of different repeatability studies. This has struck me as odd because in my mind the SEM is a simple and clearly defined measure and given this it seems quite obvious to me how to calculate it.

On looking at a range of textbooks, though, I think I can see what the problem is. As I’ve pointed out in a previous post the SEM is almost always presented as a derivative of the intra-class correlation coefficient (ICC). Portney and Watkins, for example, introduce it through the formula SEM = SD√(1−ICC). For those not used to maths this looks bad enough on its own. When they probe a little further, however, they will find that the ICC itself is an esoteric output from a specifically structured ANOVA. No wonder so many give up and assume that the SEM is the rather abstract product of some largely incomprehensible calculations.

But nothing could be further from the truth. The SEM is simply the standard deviation of a number of measurements made on the same person. Bland and Altman actually recommend that it should be referred to as the within-subject standard deviation to make this clear (although I think SEM is so well established now that this is a battle not worth fighting). If you understand what a standard deviation is and how it represents variability in measurements from different people (and everyone with even the most basic interest in clinical measurement really should) then you should also understand what the SEM is and how it represents variability within measurements taken on the same person. In a very real sense it is the SEM that is the primary measure of repeatability and the ICC should be seen as a derivative of it rather than vice versa.

Most importantly, if you know how to calculate a standard deviation (whether with pencil and paper, a calculator, or a spreadsheet) then you already know how to calculate the SEM. You just use the same equation to calculate the SD of a number of measurements made on the same person rather than those made on a number of different people. If the measurements have been made by a number of different assessors working in a particular gait lab then the SEM can be taken as representative of the lab as a whole. If they have all been made by the same assessor then they are only really valid when that individual is making the measurements.
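In code terms the calculation really is just the ordinary sample standard deviation applied to one person’s repeats. A minimal sketch in Python (the numbers here are invented purely for illustration):

```python
from statistics import stdev

# Hypothetical data: five repeat measurements (in degrees) made on one person
repeats = [52.0, 55.0, 49.0, 54.0, 50.0]

# The SEM for this person is simply the sample standard deviation of the repeats
sem = stdev(repeats)
print(round(sem, 2))  # 2.55
```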

If you make measurements on more than one person (and you should in any well designed repeatability study) then you can calculate the within-subject standard deviation for each person, and you will find that this varies a little from person to person. This is where the only mildly complicated step in the calculations comes in: the overall SEM is the root mean square average of these within-subject standard deviations (rather than the simple arithmetic mean).
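That root mean square averaging can be sketched as follows (again with invented numbers):

```python
from math import sqrt
from statistics import stdev

# Hypothetical data: three repeat measurements (degrees) on each of three people
subjects = [
    [52.0, 55.0, 49.0],
    [30.0, 28.0, 33.0],
    [61.0, 60.0, 64.0],
]

# Within-subject standard deviation for each person
within_sds = [stdev(s) for s in subjects]

# Overall SEM: the root mean square average of the within-subject SDs
sem = sqrt(sum(sd ** 2 for sd in within_sds) / len(within_sds))
print(round(sem, 2))  # 2.56
```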

Just to show how straightforward the calculations are I’ve prepared a document outlining how to do the sums which you can download at this link. All the data, figures and calculations for the examples are also available in these two Excel spreadsheets (here and here). If you want to listen to a more general talk about repeatability studies then there is one on my YouTube channel which uses the same examples. This is a recording of an open virtual classroom giving publicity to our MSc in Clinical Gait Analysis by distance learning so you’ll have to listen to a couple of minutes sales pitch before you get to the interesting bit!

PS Apologies to some of my recent students who probably wish they had had access to these resources a long time ago!

So how can we use the standard error of measurement? I spent a considerable part of a recent post criticising the ICC but it’s clear from correspondence with several people that the properties of the SEM are not well understood. The SEM is a type of standard deviation (Bland and Altman, 1996, actually refer to it as the within-subject standard deviation). It assumes that measurements of any variable will be distributed about the mean value (in this post we’ll assume that the mean value is the true value, which needn’t necessarily be true, but is the best we can do for most clinically useful measurements). Measurements a long way from the mean are far less likely than those close to the mean and the distribution of many measurements follows a specific shape called the normal distribution. It can be plotted as below and the golden rule is that the probability of finding a measurement between any two values is given by the area under the curve (hence the differently shaded blue areas in the figure).

(Click on picture to go to Wikipedia article on standard deviation)

If the distribution is normal then it is described by just two parameters: the mean (which coincides with the mode and the median) and the standard deviation, which measures the spread. 68% of measurements fall within ± one SEM of the mean. This means that 32% (about 1 in 3) fall outside. So if you only take one measurement then on nearly a third of occasions the true value of whatever you are trying to measure will be further than one SEM from the actual measurement. On 16% (one in six) of occasions the true value will be higher than the measured value by one SEM or more and on 16% of occasions it will be lower. This isn’t particularly reassuring, so in classical measurement theory scientists tend to prefer to concentrate on the ±2 SEM range within which 95% of measurements fall (this still means that on 1 in 20 occasions the true value will lie outside this range for a single measurement).
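Those coverage figures can be checked directly from the normal distribution using Python’s standard library (`statistics.NormalDist` is available from Python 3.8):

```python
from statistics import NormalDist

nd = NormalDist()  # standard normal distribution

# Probability of a measurement falling within ±1 and ±2 SD of the mean
within_1sd = nd.cdf(1) - nd.cdf(-1)
within_2sd = nd.cdf(2) - nd.cdf(-2)

print(round(within_1sd, 2))  # 0.68
print(round(within_2sd, 2))  # 0.95
```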

This type of analysis gets quite scary when applied to gait analysis measures. I’ll focus on a physical exam measure as an example because then we don’t need to worry about variation across the gait cycle. Fosang et al. (2003) calculated the SEM for the popliteal angle as 8°. This means that if a single measurement of 55° is made on any particular person then there is a 1 in 3 chance that the true measurement is greater than 63° or less than 47°. If we want 95% confidence then all we can say is that the true value lies somewhere between 39° and 71°. Data from Jozwiak et al. (1996) suggest that the one standard deviation range for the normal population of boys is from 14° to 50° (you do need to make some assumptions to extract these values) and the two standard deviation range is from −4° to 68°. Thus the 95% confidence limits on our measurement of 55° (39° to 71°) suggest the true value could lie anywhere from well within the 1SD range to a long way outside the 2SD range. As a clinical measurement this isn’t very informative.
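The arithmetic for the popliteal angle example is no more than adding and subtracting multiples of the SEM; sketched in Python:

```python
sem = 8.0           # SEM for popliteal angle (Fosang et al., 2003)
measurement = 55.0  # a single measurement, in degrees

# ±1 SEM: the true value lies outside this range on roughly 1 occasion in 3
ci68 = (measurement - sem, measurement + sem)
# ±2 SEM: roughly the 95% confidence range
ci95 = (measurement - 2 * sem, measurement + 2 * sem)

print(ci68)  # (47.0, 63.0)
print(ci95)  # (39.0, 71.0)
```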

Things look even gloomier when you want to compare two measurements! We very often want to know if there is any evidence of change. Has a patient improved, or have they deteriorated, either as the result of a disease process or as a consequence of some intervention? We take one measurement and some weeks later we make another measurement to compare. There is measurement variability associated with both measurements so we are even less certain about the difference between the two measurements than we are about either individual measurement. Any decent clinical measurement textbook will tell you that the variability in the difference between two measures will be √2 (about 1.4) times the SEM for an individual measurement.

Going back to the popliteal angle measurement, this means that in order to have 95% confidence that two measurements are different we need to have measured a difference of greater than 22° (22.6° exactly, being √2 × 2 × 8°). This is huge and may make you just want to give up and get a job sweeping the roads which doesn’t require you to think about what you are doing. Don’t give up though – for all that this sounds pretty grim it is better than using the old surgeon’s trick of eyeballing the measurement and recording “tight hamstrings”.
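The same sum in code, so the √2 factor doesn’t get lost in a mental shortcut:

```python
from math import sqrt

sem = 8.0  # SEM for a single popliteal angle measurement

# Difference needed between two measurements for roughly 95% (two-tailed)
# confidence: the SD of a difference is sqrt(2) * SEM, and we need ±2 of those
min_difference = 2 * sqrt(2) * sem
print(round(min_difference, 1))  # 22.6
```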

There are some other factors to consider as well. We may not be interested in detecting a difference but want confidence that what we are doing is not actually harming our patient. So take two measurements of popliteal angle and let’s assume the later measurement is lower (better) than the first. On 2.5% of occasions the true difference will be smaller than the lower 95% confidence limit (we will have over-estimated the change by more than the confidence limits) but the other 2.5% who fall outside the confidence limits have had an even more positive change (we will have under-estimated the change). We thus have 97.5% confidence in an improvement greater than the lower limit. There is a strong argument that we should be using what is called a one-tailed test to account for this, in which case we only need 2.3 × SEM in order to have 95% confidence of an improvement. This still works out at over 18°.

We can also question the need for 95% confidence. How often do doctors or allied health professionals ever have 95% confidence in what they are doing? Why should we demand so much more of our statistical measures than we do of other areas of our practice? In some cases we might want 95% confidence (if we are going to spend many thousands of pounds operating on a child with cerebral palsy and requiring them and their family to engage in an 18 month rehabilitation programme) but in others this might be overkill (if we want to assess the outcome of a routine programme of physical therapy). In many clinical situations having 90% confidence that a treatment has not been detrimental may be sufficient. If we drop to requiring 80% confidence then the measured difference need only be as low as 1.2 times the SEM. The table below allows you to work out the minimal difference you need to measure (in multiples of the SEM) to be confident that an improvement has occurred (one-tailed). I wouldn’t drop much below 80% because there is limited sense in drawing formal conclusions from data if you know you’re going to be wrong on 1 in every 5 occasions.

Minimum difference between two measurements (all values in multiples of the SEM)

probability   two-tail   one-tail
1:100            3.6        3.3
1:20             2.8        2.3
1:10             2.3        1.8
1:5              1.8        1.2
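The multipliers in the table come straight from the normal distribution and can be reproduced with the standard library (`NormalDist.inv_cdf` gives the z-value for a given cumulative probability):

```python
from math import sqrt
from statistics import NormalDist

def min_difference_multiplier(p, tails=2):
    """Multiple of the SEM needed between two measurements to be
    confident of a difference at probability p (e.g. p = 0.05 for 1:20)."""
    z = NormalDist().inv_cdf(1 - p / tails)
    return z * sqrt(2)  # sqrt(2) because two measurements are being compared

for p in (0.01, 0.05, 0.1, 0.2):
    print(round(min_difference_multiplier(p, tails=2), 1),
          round(min_difference_multiplier(p, tails=1), 1))
# 3.6 3.3
# 2.8 2.3
# 2.3 1.8
# 1.8 1.2
```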

Before you start thinking that the picture is too rosy remember that not harming your patients is a pretty low standard of care. If we are delivering any care package we really want confidence that it is helping. To manage this statistically we need to define a minimal clinically important difference (MCID). This is the minimum value of change that you consider is useful to the patient as a result of the treatment. If you are simply trying to prevent deterioration then the value may be zero and the analysis above is appropriate. For most interventions, however, you want improvement and to have confidence of that improvement the difference in your measurements needs to exceed the MCID by the number of SEM stated in the right hand column of the table. In some ways this analysis is depressing. The hard truth is that there is significant measurement variability in the measurements that most of us rely on (gait analysis is very little better than physical examination). Most of the time we are deceiving ourselves if we think that we have hard evidence of anything from a single clinical measurement from an individual.
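As a rough sketch of that logic in code (the MCID value here is invented purely for illustration; a real value would come from the clinical literature for the measure in question):

```python
sem = 8.0    # SEM for popliteal angle
mcid = 10.0  # hypothetical minimal clinically important difference, degrees

# One-tailed 95% confidence multiplier (2.3 SEM, as in the table)
multiplier = 2.3

# Measured change needed before we can be confident of a clinically
# useful improvement: the MCID plus the measurement-error margin
required_change = mcid + multiplier * sem
print(round(required_change, 1))  # 28.4
```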

In many ways, though, I think that this is one of the strengths of clinical gait analysis, particularly in the way it brings together so many different measurements including kinematics, kinetics, physical exam, video and EMG. Although we have limited confidence in any of the individual measurements, the identification of patterns within such a wide range of measurements can give considerably more confidence in our overall clinical decision making.

The other thing I’d point out is that none of the discussion above would have been possible on the basis of a measure of reliability such as the ICC. Fosang et al. (2003) quote the ICC for popliteal angle as 0.72. I defy anyone to construct a rational interpretation of the impact of measurement error on clinical interpretation on the basis of that number!