Classical test theory is a body of related psychometric theory that predicts outcomes of psychological testing, such as the difficulty of items or the ability of test-takers. Generally speaking, the aim of classical test theory is to understand and improve the reliability of psychological tests.

Classical test theory may be regarded as roughly synonymous with true score theory. The term "classical" refers not only to the chronology of these models but also contrasts with the more recent psychometric theories, generally referred to collectively as item response theory, which sometimes bear the appellation "modern" as in "modern latent trait theory".


Classical test theory is based on the decomposition of observed scores (which are ordinal, but typically analyzed as interval) into true and error scores. The theory views the observed score of person $ i $, $ x_i $, as a realization of a random variable $ X_i $. The person is characterized by a probability distribution over the possible realizations of this random variable. This distribution is called a "propensity distribution". The true score of person $ i $, $ t_i $, is axiomatically defined as the expectation of this propensity distribution. This definition (Novick 1966) is formally stated as

$ {\varepsilon}(X_i)=t_i $

(Eq. 1)

Secondly, the so-called error score for person $ i $, $ E_i $, is defined as the difference between $ i $'s observed score and his true score:

$ E_i=X_i - t_i $

(Eq. 2)

Note that $ X_i $ and $ E_i $ are random variables, but $ t_i $ is a constant. Also note that it directly follows from these definitions that the error score has expectation zero:

$ {\varepsilon}(E_i)={\varepsilon}(X_i - t_i)={\varepsilon}(X_i) - t_i = t_i - t_i = 0 $

(Eq. 3)

The above equations represent the assumptions that classical test theory makes at the level of the individual person. However, the theory is never used to analyze individual test scores; rather, the focus of the theory is on properties of test scores relative to populations of persons. Hence, the next step is to introduce a population-sampling scheme into the structure of classical test theory. When we assume that people are randomly sampled from a population, the true score becomes a random variable too, so that we get the canonical equation

$ X = T + E $

(Eq. 4)
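The population-level decomposition can be illustrated with a small simulation. This is a minimal sketch under assumed distributions (normal true and error scores with illustrative means and variances; classical test theory itself does not prescribe any particular distribution):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000  # persons sampled from the population

# Illustrative population: true scores T and independent error scores E.
# The specific distributions and parameters are assumptions for the demo.
T = rng.normal(loc=100, scale=15, size=n)  # true scores
E = rng.normal(loc=0, scale=5, size=n)     # error scores, expectation zero (Eq. 3)
X = T + E                                  # observed scores (Eq. 4)

print(round(E.mean(), 1))  # close to 0, as the theory requires
```

Sampling persons makes $ T $ a random variable, which is exactly the step the text describes: the canonical equation $ X = T + E $ holds in the population, not just for a single person.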

Classical test theory is concerned with the relations between the three variables $ X $, $ T $, and $ E $ in the population. These relations are used to say something about the quality of test scores. In this regard, the most important concept is that of reliability. The reliability of the observed test scores $ X $, which is denoted as $ {\rho^2_{XT}} $, is defined as the ratio of true score variance $ {\sigma^2_T} $ to the observed score variance $ {\sigma^2_X} $:

$ {\rho^2_{XT}} = \frac{{\sigma^2_T}}{{\sigma^2_X}} $

(Eq. 5)

Because the variance of the observed scores can be shown to equal the sum of the variance of true scores and the variance of error scores, this is equivalent to

$ {\rho^2_{XT}} = \frac{{\sigma^2_T}}{{\sigma^2_X}} = \frac{{\sigma^2_T}}{{\sigma^2_T}+{\sigma^2_E}} $

(Eq. 6)

This equation, which formulates a signal-to-noise ratio, has intuitive appeal: The reliability of test scores becomes higher as the proportion of error variance in the test scores becomes lower and vice versa. The reliability is equal to the proportion of the variance in the test scores that we could explain if we knew the true scores. The square root of the reliability is the correlation between true and observed scores.
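Both claims in the preceding paragraph, the variance ratio and the square-root relation, can be checked numerically. A minimal sketch, assuming true scores with variance 4 and errors with variance 1 (so the population reliability is 4/5 = 0.8):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200_000

# Assumed illustrative population: var(T) = 4, var(E) = 1.
T = rng.normal(0, 2.0, size=n)  # true scores
E = rng.normal(0, 1.0, size=n)  # error scores
X = T + E                       # observed scores

reliability = T.var() / X.var()          # Eq. 5: sigma_T^2 / sigma_X^2
corr_TX = np.corrcoef(T, X)[0, 1]        # correlation of true and observed

print(round(reliability, 2))   # ~ 4 / 5 = 0.80
print(round(corr_TX ** 2, 2))  # squared correlation ~ reliability
```

The squared correlation between true and observed scores reproduces the variance ratio, matching the statement that the square root of the reliability is the true-observed correlation.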

Note that reliability is not, as is often suggested in textbooks, a fixed property of tests, but a property of test scores relative to a particular population, and estimated in a particular sample from that population. This is because test scores will not be equally reliable in every population or even in every sample. For instance, as is the case for any correlation, the reliability of test scores will be lowered by restriction of range. Thus, IQ-test scores that are highly reliable in the general population will be less reliable in a population of college students and even less reliable in a sample of sophomores. Also note that test scores are perfectly unreliable for any given individual $ i $: as noted above, the true score is a constant at the level of the individual, which implies it has zero variance, so that the ratio of true score variance to observed score variance, and hence the reliability, is zero. The reason for this is that, in the classical test theory model, all observed variability in $ i $'s scores is random error by definition (see Eq. 2). Classical test theory is therefore relevant only at the level of populations and samples, not at the level of individuals.

Reliability cannot be estimated directly, since that would require one to know the true scores, which is impossible according to classical test theory. However, estimates of reliability can be obtained by various means. One way of estimating reliability is by constructing a so-called parallel test. The fundamental property of a parallel test is that it yields the same true score and the same error variance as the original test for every individual. If we have parallel tests $ X $ and $ X' $, then this means that

$ {\varepsilon}(X_i)={\varepsilon}(X'_i) $

(Eq. 7)

and

$ {\sigma}^2_{E_i}={\sigma}^2_{E'_i} $

(Eq. 8)

Under these assumptions, it follows that the correlation between parallel test scores is equal to reliability (see Lord & Novick, 1968, Ch. 2, for a proof).
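The parallel-test result can be checked by simulation. A minimal sketch, assuming two forms that share the same true scores and have independent errors of equal variance (the distributions and parameter values are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(3)
n = 200_000

# Shared true scores; two parallel forms with equal, independent errors.
T = rng.normal(0, 2.0, size=n)             # var(T) = 4
X1 = T + rng.normal(0, 1.0, size=n)        # form X,  var(E)  = 1 (Eq. 7, 8 hold)
X2 = T + rng.normal(0, 1.0, size=n)        # form X', var(E') = 1

reliability = T.var() / X1.var()           # known here only because we simulated T
parallel_corr = np.corrcoef(X1, X2)[0, 1]  # estimable from observed data alone

print(round(reliability, 2))    # ~ 0.80
print(round(parallel_corr, 2))  # matches the reliability
```

The point of the theorem is practical: the right-hand quantity needs no access to true scores, so administering two parallel forms yields a reliability estimate from observable data.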

Using parallel tests to estimate reliability is cumbersome, because parallel tests are very hard to come by; in practice the method is rarely used. Instead, researchers use a measure of internal consistency known as Cronbach's $ {\alpha} $. Consider a test consisting of $ k $ items $ u_{j} $, $ j=1,\ldots,k $. The total test score is defined as the sum of the individual item scores, so that for individual $ i $

$ X_i = \sum_{j=1}^k U_{ij} $

(Eq. 9)

Cronbach's $ {\alpha} $ is then defined as

$ {\alpha} = \frac{k}{k-1}\left(1 - \frac{\sum_{j=1}^k {\sigma^2_{U_j}}}{{\sigma^2_X}}\right) $

(Eq. 10)

Cronbach's $ {\alpha} $ can be shown to provide a lower bound for reliability under rather mild assumptions: the reliability of test scores in a population is always at least as high as the value of Cronbach's $ {\alpha} $ in that population. Because it can be computed from a single test administration, this method is empirically feasible and, as a result, very popular among researchers.
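Cronbach's $ {\alpha} $ can be computed from item-level data alone. A minimal sketch, assuming five items that each equal the true score plus independent noise (with such essentially tau-equivalent items $ {\alpha} $ attains the reliability exactly; in general it is only a lower bound, and all parameter values here are illustrative):

```python
import numpy as np

rng = np.random.default_rng(4)
n, k = 100_000, 5

# Assumed population: each item score is the true score plus unit-variance noise.
T = rng.normal(0, 1.0, size=n)                        # true scores
items = T[:, None] + rng.normal(0, 1.0, size=(n, k))  # item scores U_ij
total = items.sum(axis=1)                             # total score X_i

# Cronbach's alpha: (k / (k-1)) * (1 - sum of item variances / total variance)
item_vars = items.var(axis=0)
alpha = (k / (k - 1)) * (1 - item_vars.sum() / total.var())

# True reliability of the total score, computable only because we simulated T.
reliability = np.corrcoef(T, total)[0, 1] ** 2

print(round(alpha, 2))        # ~ 0.83
print(round(reliability, 2))  # alpha does not exceed this value
```

Unlike the parallel-test estimate, this needs only one administration of one test, which is why the method dominates applied practice.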

As has been noted above, the entire exercise of classical test theory is done to arrive at a suitable definition of reliability. Reliability is supposed to say something about the general quality of the test scores in question; the general idea is that the higher the reliability, the better. Classical test theory does not say how high reliability is supposed to be. Too high a value for $ {\alpha} $, say over .9, indicates redundancy of items; around .8 is recommended for research.[1] These 'criteria' are not based on principled arguments, however, but are the result of convention, and whether they make any sense is unclear.

Classical test theory is by far the most influential theory of test scores in the social sciences. In psychometrics, the theory has been superseded by the more sophisticated models in Item Response Theory (IRT). IRT models, however, are catching on very slowly in mainstream research. One of the main problems causing this is the lack of widely available, user-friendly software; also, IRT is not included in standard statistical packages like SPSS, whereas these packages routinely provide estimates of Cronbach's $ {\alpha} $. As long as this problem is not solved, classical test theory will probably remain the theory of choice for many researchers.