Negative predictive value - the fraction of negative values which are correct; determined by dividing the true negatives by the sum of the true negatives and false negatives

Positive predictive value - the fraction of positive values which are correct; determined by dividing the true positives by the sum of the true positives and false positives

Precision - the closeness of agreement between independent measurements; generally expressed as coefficient of variation or standard deviation

Reference range - the inner 95% of values for a laboratory test as measured in a defined population; the subject population is typically disease free with regards to the test of interest

Sensitivity - the ability of a test to detect a true positive; determined by dividing the true positives by the sum of the true positives and false negatives

Specificity - the ability of a test to detect a true negative; determined by dividing the true negatives by the sum of the true negatives and false positives

Standard deviation - a measure of precision (square root of the variance)

BACKGROUND/SIGNIFICANCE

A basic understanding of statistics is assumed for this chapter. In order to interpret laboratory tests it is essential that health professionals understand sources of variation, reference ranges, predictive values and how laboratory results are interpreted.

There are many different reasons for performing laboratory tests. We perform them when we believe the results may help us answer some question that we have about a patient. In most cases the result itself is not the answer; we interpret the result to answer the question we posed. This interpretation is based on statistical principles. The purpose of this chapter is to provide a statistical framework for interpreting laboratory test results.

One reason for performing a laboratory test might be to screen an asymptomatic individual for evidence of occult disease. In this case, we probably expect to compare the result to a reference range (formerly called “normal range”) to determine whether the individual has the disease under consideration. It is important to understand how reference ranges are established in order to use them appropriately.

In another setting, we may be dealing with a very sick patient for whom we have developed a differential diagnosis based on the history and clinical findings. We may order one or more laboratory tests in order to make a definitive diagnosis. The same test result might have a different significance in a clinically ill patient versus a clinically healthy patient. We have statistical tools to help us address this apparent paradox.

Another common reason for performing laboratory tests is to help follow the course of disease in a particular patient, and assess the impact of treatment. In this case, we want to know whether an interval change in the results of a laboratory test represents a real change in the patient versus just analytical variation. Knowledge of the statistical data routinely collected in the clinical laboratory can facilitate interpretation of serial test results.

SOURCES OF VARIATION

Proper interpretation of results requires an understanding of the sources of variation which influence laboratory tests. These can be categorized as preanalytical, analytical, intraindividual, and interindividual variation.

Analytical variation is produced by conditions which affect the sample and the testing system from the moment the sample is removed from the patient until the final result is generated. (It is helpful to further subdivide this category into preanalytical factors, which include all the things that can happen to a sample as it is collected, transported, processed, and stored, and analytical factors, which affect the testing process itself.) All test results are subject to analytical variation. Important interferences with laboratory tests include hemolysis (rupture of red blood cells, releasing their contents into plasma), lipemia (excess lipids in a plasma sample), and icterus (high concentrations of bilirubin).

Intraindividual variation is produced by conditions which cause a single individual’s laboratory values to change at different times of day or under different physiologic conditions. Examples of factors which contribute to intraindividual variation include circadian rhythms, hydration, activity, stress, posture, and food intake. When we use the results of serial testing to follow the course of disease in a patient, it is important to recognize the potential contribution of normal physiologic factors and try to distinguish it from medically important variation.

Interindividual variation reflects the many different factors which cause laboratory test results to vary from one individual to another within a population. Examples of such variables include age, sex, diet, body mass, general activity level, and genetics. The results of a test performed on a group of individuals will reflect analytical variation and intraindividual as well as interindividual variation.

The remainder of this chapter addresses each of these categories in more detail, together with the applicable statistical concepts. Familiarity with the normal distribution (bell-shaped curve, or Gaussian distribution), including the concepts of the mean and standard deviation, is assumed.

ESTABLISHING REFERENCE RANGES

When we establish a reference range, we want to have a tool for comparing the test result from one individual with those from a relatively large number of other members of a similar population. What we want to determine is the expected range of interindividual variation. We already know that the results of a test performed on a group of people will reflect intraindividual and analytical variation as well as interindividual variation; it is obvious that if the contributions from the first two are relatively large, they will obscure the part of the total variation that is due to actual differences among individuals. Part of the process of establishing a reference range is simply taking steps to reduce the magnitude of this obscuring effect.

Age is one of the most important considerations when defining the reference population. Profound biochemical changes take place in the period between birth and adulthood, and many of these are reflected by clinical chemistry test values in children that differ significantly from those considered normal in adults. The most pronounced and/or accelerated changes are seen in the newborn period and during puberty. Table 3 gives examples of laboratory tests that are affected by age. Some hospitals give reference ranges only for adults, yet report children’s results alongside them (with incorrect reference ranges), so it is important to know which values change with age.

Table 3. Lab Results that are affected by age

| Lab Tests that are Higher in Newborns and Children | Lab Tests that are Lower in Newborns and Children |
| --- | --- |
| Alkaline Phosphatase | Bicarbonate |
| Ammonia | Albumin |
| AST | Amylase |
| Bilirubin | Cholesterol |
| Creatine Kinase (CK) | Creatinine |
| Potassium | Copper |
| Gamma glutamyl transferase (GGT) | Glucose |
| Thyroid stimulating hormone (TSH) | Haptoglobin |
| Thyroxine (T4) | IgA, IgM, IgE |
| | Osmolality |

Here are the steps involved in establishing a reference range:

1. Select the analyte for which the reference range is to be established, and the methodology and instrumentation which will be used for testing. Study what is already known about the analyte, including variables which are already known to cause variation within and between individuals. Review the methodology with particular attention to the impact of variables known to be associated with sample procurement, transport, processing, and storage.

2. Define the reference population. Demographically, it should match the population whose laboratory results will be compared to this reference range. Based on what is already known about the analyte, consider whether separate reference ranges should be established for adults versus children, men versus women, and so forth.

3. Choose a sampling method. Ideally, this method should yield a random sample of individuals representing the reference population.

4. Collect, process, and test the specimens. It is important to treat these specimens exactly as patient specimens are treated.

5. Analyze the results. This process can be broken down into the following steps:

a) Organize the results. A convenient format is a frequency histogram, as shown in the example in Table 4. In this example, the possible values for the analyte are shown in the column on the left, from the lowest at the top to the highest at the bottom. (It is true that the actual levels of most analytes form a continuum, not limited to a set of discrete numbers, but as results are rounded off, we work with a limited number of possibilities which represent regular intervals along the continuum.)

The next column shows the number of reference specimens on which that result value was recorded.

The body of the histogram is a graphic representation of these results. In this example, there is an “X” corresponding to each individual result.

The last column shows the cumulative rank for each value. In this format, the cumulative rank corresponds to the total number of “X’s” in that row and all the rows above it, representing the total number of samples having that result value or a lesser value.

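Table 4. Frequency Histogram

| Value | Number of Observations | Histogram | Rank |
| --- | --- | --- | --- |
| 1 | 4 | xxxx | 4 |
| 2 | 15 | xxxxxxxxxxxxxxx | 19 |
| 3 | 23 | xxxxxxxxxxxxxxxxxxxxxxx | 42 |
| 4 | 25 | xxxxxxxxxxxxxxxxxxxxxxxxx | 67 |
| 5 | 22 | xxxxxxxxxxxxxxxxxxxxxx | 89 |
| 6 | 19 | xxxxxxxxxxxxxxxxxxx | 108 |
| 7 | 12 | xxxxxxxxxxxx | 120 |
| 8 | 2 | xx | 122 |
| 9 | 0 | | |
| 10 | 0 | | |
| 11 | 0 | | |
| 12 | 0 | | |
| 13 | 1 | x | 123 |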
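The cumulative rank column can be reproduced mechanically from the observation counts. The following is a minimal Python sketch, assuming the counts from the frequency histogram example; the variable names (`counts`, `rank_table`) are illustrative only. Unlike the printed table, it also reports a cumulative rank for values with zero observations.

```python
from itertools import accumulate

# Value -> number of observations, taken from the frequency histogram example.
counts = {1: 4, 2: 15, 3: 23, 4: 25, 5: 22, 6: 19, 7: 12, 8: 2,
          9: 0, 10: 0, 11: 0, 12: 0, 13: 1}

values = sorted(counts)

# The cumulative rank of a value is the number of observations
# with that value or a lesser value.
cumulative = list(accumulate(counts[v] for v in values))
rank_table = dict(zip(values, cumulative))

for v in values:
    # One "x" per observation, as in the body of the histogram.
    print(f"{v:>5} {counts[v]:>5}  {'x' * counts[v]:<25} {rank_table[v]:>5}")
```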

b) Eliminate outliers. An outlier is an extreme value at one end of the data set which is so far away from the rest of the data set that the reference range might be substantially different if we include that point versus deleting it.

There is not a single universally accepted mathematical definition of the term “outlier,” so what we offer here is a “rule of thumb.” If the distance between the most extreme data point and its nearest neighbor is more than a third of the total range covered by the data set, then that data point should be regarded as an outlier and eliminated.

After eliminating an outlier, reexamine the data set for new outliers which may become apparent only after reducing the range.
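The one-third rule of thumb can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names are hypothetical, and when both ends qualify the sketch arbitrarily examines the high end first, a choice the rule itself does not specify.

```python
def find_outlier(values):
    """Return the extreme value flagged by the one-third rule, or None."""
    data = sorted(values)
    if len(data) < 3:
        return None
    total_range = data[-1] - data[0]
    if total_range == 0:
        return None
    low_gap = data[1] - data[0]      # distance from lowest point to its neighbor
    high_gap = data[-1] - data[-2]   # distance from highest point to its neighbor
    if high_gap >= low_gap and high_gap > total_range / 3:
        return data[-1]
    if low_gap > total_range / 3:
        return data[0]
    return None


def remove_outliers(values):
    """Repeat the check until no new outlier appears in the reduced range."""
    data = sorted(values)
    while True:
        outlier = find_outlier(data)
        if outlier is None:
            return data
        data.remove(outlier)


# Rebuild the raw observations from the frequency histogram example.
counts = {1: 4, 2: 15, 3: 23, 4: 25, 5: 22, 6: 19, 7: 12, 8: 2, 13: 1}
observations = [value for value, n in counts.items() for _ in range(n)]
cleaned = remove_outliers(observations)
print(len(observations), len(cleaned), max(cleaned))
```

In the histogram example the value 13 lies 5 units from its nearest neighbor (8), while the total range is 12 units; since 5 exceeds one third of 12, the rule flags it.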

c) Once the outliers have been eliminated, determine whether the data have a Gaussian distribution. Often, inspection of the histogram is sufficient to conclude that the distribution is clearly non-Gaussian. The more rigorous approach is to calculate the mean, median, mode, skewness, and kurtosis.

The mean is the average value:

(3)

\begin{align} mean = \frac{X(1) + ... + X(n)}{n} \end{align}

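The median is the value of the middle observation, when the set is arranged in rank order. For n observations, it is the value whose rank is (n + 1)/2; when n is even, it is the average of the two middle values:

(4)

\begin{align} median = X\left(\frac{n + 1}{2}\right) \end{align}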

The mode is the most frequently observed value.

Skewness is a measure of asymmetry.

Kurtosis is a measure of relative peakedness versus flatness.

(The formulas for calculating skewness and kurtosis are beyond the scope of this course but the functions are available in some of the popular software packages which include statistics.)

If the mean equals the median equals the mode, and the skewness and kurtosis are zero, then the distribution is Gaussian. (Kurtosis here is expressed as excess kurtosis, the convention under which a Gaussian distribution scores zero.) For our purposes in this course, values for skewness and kurtosis of -1 to +1 are sufficiently near zero to be treated as such.
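As an illustration of this check, the sketch below computes the five statistics in Python. The moment-based formulas for skewness and excess kurtosis are one common convention and are an assumption here, since the chapter does not give formulas; statistical packages may apply small-sample corrections and return slightly different values. The function names are hypothetical.

```python
import statistics


def describe(values):
    """Mean, median, mode, and moment-based skewness and excess kurtosis."""
    n = len(values)
    mean = statistics.fmean(values)
    # Central moments (population form, dividing by n).
    m2 = sum((x - mean) ** 2 for x in values) / n
    m3 = sum((x - mean) ** 3 for x in values) / n
    m4 = sum((x - mean) ** 4 for x in values) / n
    return {
        "mean": mean,
        "median": statistics.median(values),
        "mode": statistics.mode(values),
        "skewness": m3 / m2 ** 1.5,
        "kurtosis": m4 / m2 ** 2 - 3,  # excess kurtosis: 0 for a Gaussian
    }


def near_gaussian_shape(stats):
    """Apply only the -1 to +1 tolerance for skewness and kurtosis.

    Agreement of mean, median, and mode still has to be judged separately.
    """
    return abs(stats["skewness"]) <= 1 and abs(stats["kurtosis"]) <= 1


sample = [1, 2, 2, 3, 3, 3, 4, 4, 5]  # small symmetric illustration
summary = describe(sample)
print(summary, near_gaussian_shape(summary))
```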

d) Select a method for calculating the reference range

There are two basic methods: parametric and non-parametric.

If the distribution is Gaussian, a parametric method may be used. A minimum of 30 result values are required to calculate a reference range by this method. The reference range is defined as the mean plus or minus two standard deviations: Reference Range = mean ± 2SD
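A minimal Python sketch of the parametric calculation follows. It assumes the sample standard deviation (dividing by n - 1) is intended, and the function name and illustrative data are hypothetical.

```python
import statistics


def parametric_reference_range(values):
    """Return (lower, upper) limits as mean - 2 SD and mean + 2 SD."""
    if len(values) < 30:
        raise ValueError("the parametric method requires at least 30 results")
    mean = statistics.fmean(values)
    sd = statistics.stdev(values)  # sample standard deviation (assumption)
    return mean - 2 * sd, mean + 2 * sd


# Hypothetical results, for illustration only.
results = [98, 100, 102] * 10
lower, upper = parametric_reference_range(results)
print(f"reference range: {lower:.1f} to {upper:.1f}")
```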

If the distribution is non-Gaussian, a non-parametric method must be used. At least 120 result values (after elimination of outliers) are required to calculate a reference range by this method. The non-parametric method may also be used for a Gaussian distribution, as long as there are 120 result values in the data set. Three steps are involved:

First, arrange the results in order and assign a rank to each observation, so that

X(1) ≤ X(2) ≤ … ≤ X(n)

Second, calculate the ranks which correspond to the 2.5 and 97.5 percentiles:

0.025 (n+1) and 0.975 (n+1)

Third, find the values which correspond to those ranks. Those values are the lower and upper limits of the reference range (the central 95% of values).
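The three steps can be sketched in Python as shown below. The chapter does not say how to handle a fractional rank, so rounding to the nearest whole rank is an assumption of this sketch (interpolating between neighboring values is another option); the function name is hypothetical.

```python
def nonparametric_reference_range(values):
    """Return the values at ranks 0.025(n+1) and 0.975(n+1)."""
    data = sorted(values)  # step 1: rank order, X(1) <= ... <= X(n)
    n = len(data)
    if n < 120:
        raise ValueError("the non-parametric method requires at least 120 results")
    # Step 2: ranks of the 2.5 and 97.5 percentiles, rounded (assumption).
    lower_rank = max(round(0.025 * (n + 1)), 1)
    upper_rank = min(round(0.975 * (n + 1)), n)
    # Step 3: ranks are 1-based, Python indexes are 0-based.
    return data[lower_rank - 1], data[upper_rank - 1]


# Frequency histogram example with the outlier (13) already eliminated.
counts = {1: 4, 2: 15, 3: 23, 4: 25, 5: 22, 6: 19, 7: 12, 8: 2}
observations = [value for value, k in counts.items() for _ in range(k)]
print(nonparametric_reference_range(observations))
```

For these 122 observations the ranks are 0.025 x 123 = 3.075 and 0.975 x 123 = 119.925, which round to ranks 3 and 120, corresponding to the values 1 and 7.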

Suppose we collect one more reference sample, and its result just happens to fall near one end of the result distribution. Can you see how this chance event might change the reference range a little, especially if we are using a minimum number of observations?

The point of going through this rather lengthy exercise is to demonstrate concretely that reference ranges are not exact. They represent approximations subject to a wide variety of influences. At best, they can be improved by increasing the size of the population samples on which they are based. Statistical methods exist for estimating the validity of reference ranges, but they are beyond the scope of this course.

USING REFERENCE RANGES

Figure 1 shows a graphic representation of a reference range. The area under the curve represents the entire population of healthy subjects from which the reference sample was drawn. The vertical lines near each end of the curve represent the upper and lower limits of the reference range. The area under the curve between the two vertical lines includes 95% of the reference population. The tails of the curve have been cut off, excluding the values obtained from approximately 5% of the healthy reference subjects from the reference range. Does this mean we should no longer consider those subjects healthy? No. Is there any reason for choosing to include 95% and to exclude 5%? The original selection of 95% limits was arbitrary; it is done now for the sake of convention. Unless otherwise specified, you may presume that the reference range provided with a laboratory result represents the middle (central) 95% of the reference population.

Figure 1: Graphic Representation of a Reference Range

We have a specific term for the values of healthy individuals which fall outside the limits of the reference range: we call them “false positives” regardless of whether they fall at the high or low end of the range. The terms “positive” and “negative” in this context have nothing to do with high or low numbers, but rather indicate positivity versus negativity for disease.

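It follows that the more tests performed on a healthy individual, the greater the chance that at least one result will be a “false positive.” Assuming the tests are independent of one another, this can be calculated by the formula:

Percent False Positives = (1 - 0.95^n) × 100%, where “n” is the number of tests performed.

Thus, one test will yield 5% “false positives,” while a panel of 20 tests will yield at least one “false positive” in about 64% of healthy individuals.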
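The formula is easy to tabulate for different panel sizes. The short Python sketch below assumes each test is independent and uses a 95% reference range; the function name is illustrative.

```python
def percent_false_positives(n_tests, coverage=0.95):
    """Percent of healthy subjects with at least one result outside the range."""
    return (1 - coverage ** n_tests) * 100


for n in (1, 5, 10, 20):
    print(f"{n:>2} tests: {percent_false_positives(n):.1f}%")
```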

Let’s use a hypothetical reference range to interpret some patient results. The first result we’ll consider happens to fall outside the range. What are the possibilities? The patient may be sick. Alternatively, the patient may be well but have a result in one of the tails of the healthy distribution, which were excluded when the range was calculated; or the patient may be demographically different from the reference population; or the specimen may have been collected under different conditions or handled differently from the reference specimens. Now let’s consider a result that falls inside the range. Either the patient is well, or the patient does in fact have the disease or condition we are considering but the test result does not show it, whether for any of the reasons mentioned in the first example or simply because the presence of disease does not guarantee abnormal laboratory results.

Meaningful interpretation of laboratory data requires an understanding of test results to be expected for patients having various diseases and conditions, as well as for healthy individuals. It would be ideal if such reference ranges for disease never had any areas of overlap with the so-called “reference ranges,” but since that is rarely the case, the next section addresses the interpretation of overlapping result distributions.

Finally, the number of patterns of test results is given by the following formula:

# Patterns of Test Results = X^n, where “X” is the number of types of results and “n” is the number of tests.

Thus, for three types of results (e.g., “low,” “normal,” and “high”) and two tests, there would be nine patterns possible.
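To make the count concrete, the patterns can be enumerated directly. This is a small Python sketch; the function name is hypothetical.

```python
from itertools import product


def result_patterns(result_types, n_tests):
    """List every possible combination of result types across n tests."""
    return list(product(result_types, repeat=n_tests))


patterns = result_patterns(["low", "normal", "high"], 2)
for pattern in patterns:
    print(pattern)
print(len(patterns), "patterns")
```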

Taxonomy of Overlapping Distributions

Figure 2 shows a pair of curves, representing hypothetical distributions of test results from two distinct populations, one healthy and one diseased with a region of overlap. The horizontal line represents the continuum of possible result values for the analyte we are measuring. The curve on the left represents the distribution of results from the healthy reference population. The curve on the right represents the distribution of results from a group of people known to have the disease. The vertical line represents the upper limit of the reference range.

All of the results which fall to the left of the vertical line are called negative. All of the results which fall to the right of the line are called positive. Some diseased patients have “negative” results, since part of their distribution falls to the left of the vertical line. We call this group of results “false negatives.” Conversely, when we defined our reference range we already acknowledged that a small group of healthy individuals would have “false positive” results. To complete the taxonomy, we call the results which accurately reflect the status of the individuals from which they came “true positives” and “true negatives,” respectively.

Sensitivity and specificity are performance characteristics of a test. To determine these characteristics, it is necessary to obtain test results on populations in whom the presence or absence of disease has been established by some method independent of this test.

Sensitivity is defined as the proportion of diseased subjects correctly classified by the test; i.e., the ability to detect a true positive in a person afflicted with the disease.

(5)

\begin{align} Sensitivity = \frac{TP}{TP + FN} \end{align}

Specificity is defined as the proportion of healthy subjects correctly classified; i.e., the ability to exclude a diagnosis in a healthy person.

(6)

\begin{align} Specificity = \frac{TN}{TN + FP} \end{align}

A convenient format for arranging data is shown in Tables 4 and 5. Note that sensitivity refers only to the diseased population while specificity refers only to the healthy population. The relative sizes of the two populations do not affect sensitivity or specificity.

Table 4. Sensitivity and Specificity

| | Number of Subjects with Positive Test | Number of Subjects with Negative Test | TOTALS |
| --- | --- | --- | --- |
| Number of Subjects with Disease | TP | FN | TP + FN |
| Number of Subjects without Disease | FP | TN | FP + TN |
| TOTALS | TP + FP | FN + TN | TP + FP + FN + TN |

Table 5. Example of Sensitivity and Specificity

| | Number of Subjects with Positive Test | Number of Subjects with Negative Test | TOTALS |
| --- | --- | --- | --- |
| Number of Subjects with Disease | 68 | 32 | 100 (sensitivity = 68%) |
| Number of Subjects without Disease | 2 | 98 | 100 (specificity = 98%) |
| TOTALS | 70 | 130 | 200 |
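The Table 5 figures can be checked with a few lines of Python. This is a minimal sketch using the definitions given above; the function names are illustrative.

```python
def sensitivity(tp, fn):
    """Fraction of diseased subjects who have a positive test."""
    return tp / (tp + fn)


def specificity(tn, fp):
    """Fraction of healthy subjects who have a negative test."""
    return tn / (tn + fp)


# Counts from the Table 5 example.
tp, fn, fp, tn = 68, 32, 2, 98
print(f"sensitivity = {sensitivity(tp, fn):.0%}")
print(f"specificity = {specificity(tn, fp):.0%}")
```

Note that each function takes counts from only one row of the table, which is why the relative sizes of the diseased and healthy groups do not affect either value.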

Sensitivity and specificity tell us how well a test performs when run on groups of people in whom we already know the diagnosis. In clinical practice, we do not use tests this way. Often, we are running a test on one patient for whom we have not yet made a diagnosis. What we want to know about the test is the probability that the result will correctly classify our patient with respect to the diagnosis we are considering.

Predictive Values

Predictive values describe the probability that the result of a test will correctly classify an individual with respect to the disease or condition under consideration. To determine predictive values, we need to know the prevalence of the disease in the population we are testing. Prevalence is the fraction of the population which has the disease.

The predictive value of a positive test result is the fraction of positive test results which are correct, or the true positives divided by all the positives, both true and false.

(7)

\begin{align} PV+ = \frac{TP}{TP + FP} \end{align}

The predictive value of a negative test is the fraction of all negative results which are correct, or the true negatives divided by all the negatives, both true and false.

(8)

\begin{align} PV- = \frac{TN}{TN + FN} \end{align}

Using the hypothetical data provided in Table 5, we can calculate the predictive value of a positive result to be 97% and that of a negative result to be 75%.
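The arithmetic behind these two figures can be sketched in Python as follows; the function name is illustrative.

```python
def predictive_values(tp, fp, tn, fn):
    """Return (PV+, PV-) from the four cells of the 2 x 2 table."""
    pv_pos = tp / (tp + fp)  # correct positives / all positives
    pv_neg = tn / (tn + fn)  # correct negatives / all negatives
    return pv_pos, pv_neg


# Counts from the Table 5 example.
pv_pos, pv_neg = predictive_values(tp=68, fp=2, tn=98, fn=32)
print(f"PV+ = {pv_pos:.0%}, PV- = {pv_neg:.0%}")
```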

It is important to recognize the impact of disease prevalence on predictive values. Tables 6 and 7 show two more hypothetical data sets. Both have the same sensitivity and specificity as shown in Table 5, but the prevalence has been decreased, demonstrating the impact on predictive values. In general, as the prevalence of disease increases, the predictive value of a positive test improves. As the prevalence of disease decreases, the predictive value of a negative test improves, and the predictive value of a positive test is diminished by increasing numbers of false positive results.

Table 6. Effect of Low Prevalence

| | Positive | Negative | Total |
| --- | --- | --- | --- |
| Diseased | 68 | 32 | 100 |
| Healthy | 20 | 980 | 1,000 |
| Total | 88 | 1,012 | 1,100 |

Sensitivity = 68%

Specificity = 98%

PV+ = 77%

PV- = 97%

Prevalence = $\frac{100}{1100}$ = 9%

Table 7. Effect of Further Decrease in Prevalence

| | Positive | Negative | Total |
| --- | --- | --- | --- |
| Diseased | 68 | 32 | 100 |
| Healthy | 200 | 9,800 | 10,000 |
| Total | 268 | 9,832 | 10,100 |

Sensitivity = 68%

Specificity = 98%

PV+ = 25%

PV- = 99.7%

Prevalence = $\frac{100}{10100}$ = 1%
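The dependence on prevalence can be explored without building a new table each time. The Python sketch below derives the predictive values from sensitivity, specificity, and prevalence by working with the expected fraction of the population in each cell of the 2 x 2 table; the function name is hypothetical.

```python
def predictive_values_from_prevalence(sens, spec, prevalence):
    """Return (PV+, PV-) for a test applied at a given disease prevalence."""
    tp = sens * prevalence              # diseased, test positive
    fn = (1 - sens) * prevalence        # diseased, test negative
    tn = spec * (1 - prevalence)        # healthy, test negative
    fp = (1 - spec) * (1 - prevalence)  # healthy, test positive
    return tp / (tp + fp), tn / (tn + fn)


# Prevalences corresponding to Tables 5, 6, and 7.
scenarios = [("Table 5", 100 / 200), ("Table 6", 100 / 1100), ("Table 7", 100 / 10100)]
for label, prevalence in scenarios:
    pv_pos, pv_neg = predictive_values_from_prevalence(0.68, 0.98, prevalence)
    print(f"{label}: prevalence = {prevalence:.1%}, PV+ = {pv_pos:.1%}, PV- = {pv_neg:.1%}")
```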

Intraindividual vs. Analytical Variation

Up to this point we have focused on the interpretation of individual results relative to group results, or analysis of interindividual variation. We will now focus on determining if therapeutic intervention has changed laboratory values that can be detected analytically.

How do you know whether your therapy has changed the patient’s lab values? Determine whether the difference between the first and a subsequent measurement is greater than 3 times the standard deviation of the assay. If it is, you can be 95% confident that the difference between the two measurements is not due to chance (Reference: Kaplan LA, Pesce AJ, and Kazmierczak SC, Clinical Chemistry: Theory, Analysis, Correlation, 4th edition, St. Louis: Mosby, 2003, page 385).

Figure 3: Quality Control Chart for Sodium with a Mean of 113.2 and a Standard Deviation of 0.5 mmol/L

EXAMPLE

A patient had a sodium of 120 mmol/L on day 1. After treatment the patient’s sodium increased to 126 mmol/L. Has the patient become less hyponatremic?

In order to answer this question you need to know the analytical variation of the lab’s sodium analysis. This can be determined by calling the lab and asking what the standard deviation of the sodium assay is. The laboratory runs quality control specimens with each batch of samples and will know the analytical variability of the assay in the range of interest. A typical quality control chart is shown in Figure 3. The standard deviation for this sodium control is 0.5 mmol/L. Three times 0.5 mmol/L is 1.5 mmol/L. Since the observed change (6 mmol/L) is greater than 1.5 mmol/L, you can be 95% confident that the patient’s sodium has increased to a degree which can be detected analytically.
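The same comparison can be written as a small Python helper. This sketch simply encodes the 3 SD rule described above; the function and parameter names are illustrative.

```python
def is_detectable_change(first, second, assay_sd, multiplier=3.0):
    """True if the change between two results exceeds multiplier x assay SD."""
    return abs(second - first) > multiplier * assay_sd


# Sodium example: 120 -> 126 mmol/L, assay SD of 0.5 mmol/L.
print(is_detectable_change(120, 126, assay_sd=0.5))  # 6 mmol/L versus 1.5 mmol/L
# A change of 1 mmol/L would not exceed the 1.5 mmol/L threshold.
print(is_detectable_change(120, 121, assay_sd=0.5))
```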

The primary reason to run controls is to assess whether the test system is functioning properly and generating reliable test results. When the technologists in the laboratory examine the results obtained on the control sample, they are expecting some variation, and they are trying to distinguish between two possible sources of variation: random analytic variation and systematic error.

Random analytic variation is inevitable and all its points will fall within a Gaussian distribution. Systematic error occurs when some new variable is introduced, such as deterioration of a reagent, clogging of a tube within the instrument, etc. The problem with systematic error is that it is likely to compromise the accuracy of test results.

Each time a technologist sets up an analytical test for patient samples, he or she first calibrates the assay with standards containing a known concentration of the analyte. To ensure that the calibration is accurate, the technologist then runs a series of controls, typically at three levels: low, normal, and elevated concentrations. The technologist checks whether the control values fall within the expected Gaussian pattern before analyzing and reporting patient samples. Once the mean and standard deviation have been established for a particular control for a particular analyte, 95% of the control results should fall within ± 2 SD of the mean. About one in 20 results will fall outside these limits, but within 3 SD. If this is just a random event, the next time the control is tested, the result will return to within 2 SD 95% of the time. If it persists outside 2 SD, this is interpreted as most likely a sign of some systematic error, and the technologist proceeds to investigate and take corrective action. Likewise, if any result is more than 3 SD from the mean, it is interpreted as probable systematic error and treated as such.
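The decision logic in this paragraph can be sketched in Python. This is a simplified illustration of the rules as described here, not a complete laboratory quality control scheme; the function names and status strings are hypothetical, and the mean and SD are those quoted for the sodium control in Figure 3.

```python
def classify_control(result, mean, sd):
    """Classify one control result by its distance from the mean in SD units."""
    z = abs(result - mean) / sd
    if z > 3:
        return "reject"   # more than 3 SD: probable systematic error
    if z > 2:
        return "warning"  # outside 2 SD but within 3 SD: may be random
    return "ok"


def evaluate_controls(results, mean, sd):
    """Evaluate consecutive results for the same control, in run order."""
    previous_warning = False
    for result in results:
        status = classify_control(result, mean, sd)
        if status == "reject":
            return "probable systematic error"
        if status == "warning":
            if previous_warning:  # persists outside 2 SD on the repeat
                return "probable systematic error"
            previous_warning = True
        else:
            previous_warning = False
    return "warning: repeat control" if previous_warning else "in control"


MEAN, SD = 113.2, 0.5  # sodium control from Figure 3
print(evaluate_controls([113.0, 113.5], MEAN, SD))
print(evaluate_controls([114.3], MEAN, SD))
print(evaluate_controls([114.3, 114.4], MEAN, SD))
print(evaluate_controls([115.0], MEAN, SD))
```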

Clinical laboratory technologists do not issue test results if the control results do not fall within established limits. Their objective is to generate the most accurate results possible given the methodology and instrumentation available. The principles of quality control are a major component of their education and training. As Point-of-Care testing becomes more widespread, it is important for pharmacists and other care providers to understand the need to verify the integrity of the test system before using it.


Hi,
I have a question.
My laboratory participates in a PT program.
The institution that sends the samples and then analyzes the results uses 1 SD as the pass or fail criterion.

I looked at different standards, including ISO 13528, that clearly define the acceptable range for results as +/- 2 SD.
Also, the CVs for the tests received from different participants vary between 23 – 59. All results reported by my laboratory remain below 1.7 SD and I would qualify those results as valid; however, they received a fail code.
I would appreciate your comments,

I wonder why this question has not been answered so far?!!
But my guess is that the program assesses accuracy rather than precision (as all external QC programs do), and this is calculated as the % deviation of your measured value from the "true" or "expected" value. Presumably they used 1 SD as an estimate of the limit of deviation, which should not exceed 2%.

CLIA provides guidelines on instrument comparison.
One way to approach this (this is by no means comprehensive or the best way, but one of many ways; I am trying to be as specific as possible to be useful):
You could use 20 patient samples with a wide range of values to correlate the two instruments (Deming regression is a good place to start).
Set your own acceptance criteria based on clinical considerations. If the slope is acceptable and close to 1 (and r^2 is close to 1), you are on track. If there is an intercept, determine whether it is statistically significant (p-value). If the slopes are different, determine whether it has something to do with instrumentation, reagents, or any other experimental factor (troubleshoot). If the slopes are truly different, determine the clinical significance and develop a correlation factor, if possible. But if it changes every six months, you probably have some sort of systemic lab problem you might want to investigate further. If the slopes and intercepts are clinically comparable, or if a valid correlation factor can be applied to make the instruments report the same value, you have nothing to explain to the medical staff. If the instruments are significantly different, you have a more complex issue. I know this is not fully helpful but..