HOW MANY ANSWER CATEGORIES SHOULD ATTITUDE AND PERSONALITY SCALES USE?

J.J. Ray

Sociology, University of New South Wales, Australia

General population data were sought on the question of what is the optimal number of response categories for attitude and personality scale items. When an ethnocentrism scale was administered to 100 randomly sampled residents of Johannesburg, it showed a reliability of 0.73 with five-point options. Scoring each extreme option the same as its adjacent option also gave a three-point version of each item. So scored, the scale reliability dropped to 0.65. In a second study the Jackson Dominance Scale was administered to two random postal samples of the Australian State of New South Wales. When administered and scored with three response options for each item the reliability was 0.84; collapsed to two options, it dropped to 0.80. When administered and scored with only two response options the reliability was 0.81. It was concluded that collapsed scoring is a useful methodology for examining the effects of varied response options and that more response options generally improve reliability.

While there has been in the literature a clear tendency to regard a larger number of response options as more desirable in attitude and personality scales, there is certainly no consensus on the issue. This is shown very clearly in the recent extensive review by Matell and Jacoby (1971). While there would be little point in attempting to add to these authors' very competent review, there may be some point in questioning the empirical work that the same authors also carried out. In their own research they found that it made no difference how many response options were used: whether one uses two, three, five, seven or more options, both the reliability and the validity remain unaltered. A similar conclusion was drawn in Matell and Jacoby (1972).

A question that hangs over both their 1971 and 1972 results, however, is the adequacy of their sampling. They used unmatched groups of students. Thus if a sub-sample was particularly heterogeneous, the answer format it responded to might artificially appear to have especially low reliability.

Even if the groups corresponding to the various answer formats had been matched, however, the generalizability of student-based results to the population at large must remain unknown. Nor would the recent work by Jenkins and Taber (1977) be informative here. These authors abandoned people altogether as a source of data and relied on a computer 'simulation'.

The work reported below, then, will attempt to advance the question using general population data from both South Africa and Australia.

Method

The method proposed for assessing the effect of answer format in the study below uses each person as his own control. This avoids problems with the matching of groups. What is done initially is simply to compare a five-point with a three-point format. This can be done by administering a scale with five response options and then scoring it both with the original five points and with the two 'agree' categories (and likewise the two 'disagree' categories) collapsed together. Thus 'Strongly Agree' is scored the same as 'Agree' on the second occasion. A midpoint is allowed in both cases. We thus have alternative 5, 4, 3, 2, 1 and 3, 3, 2, 1, 1 scoring systems.
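The collapsed scoring described above can be expressed as a minimal sketch (in Python, with the numeric coding of the five options assumed for illustration):

```python
# Sketch of the 'collapsed scoring' system: each 5-point response
# (1 = Strongly Disagree ... 5 = Strongly Agree) is rescored so that
# both 'agree' options share one value and both 'disagree' options
# share another, giving the 3, 3, 2, 1, 1 system described above.
# The numeric codes are an assumption for illustration only.

COLLAPSE = {5: 3, 4: 3, 3: 2, 2: 1, 1: 1}

def collapse_item(response: int) -> int:
    """Map a 5-point response onto the 3-point collapsed scale."""
    return COLLAPSE[response]

def score_scale(responses, collapsed=False):
    """Total score for one respondent across all items of a scale."""
    if collapsed:
        responses = [COLLAPSE[r] for r in responses]
    return sum(responses)

# One respondent's answers to a hypothetical five-item scale:
answers = [5, 4, 3, 2, 1]
print(score_scale(answers))                  # 15 (five-point scoring)
print(score_scale(answers, collapsed=True))  # 10 (collapsed scoring)
```

Because both scores come from the same set of responses, each respondent serves as his own control, as the method requires.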

Study I

This study used the Ray (1974) Ethnocentrism Scale as modified for South African usage by Heaven and Moerdyk (1977). Although originally used with a five-point format, Heaven (1978) prefers a three point answer format. In the circumstances an empirical comparison of the two formats seemed of particular interest for this scale.

The scale was administered as part of a larger questionnaire given to a random cluster sample of 100 residents of the Johannesburg greater metropolitan area interviewed in their own homes. Scored with the original five-point options, the scale showed a reliability of 0.73; with the collapsed three-point scoring, reliability dropped to 0.65.

Study II

In this study, the accuracy of the 'collapsed scoring' system used above to estimate the effect of different formats was the prime issue. It seemed conceivable that a person might respond differently when confronted by two actual 'agree' options than when confronted by only one. Collapsing the former two might for various reasons give a different result from administering the latter one alone.

The scale used on this occasion was in fact one designed for dichotomous use - the Jackson PRF Dominance Scale Form AA (Jackson 1967). Because the scale was designed for dichotomous use, the effect of extra response options should be minimal; if an improvement in reliability were nonetheless observed, the result would be all the more impressive.

The samples used were two postal surveys of the Australian state of New South Wales. Names were selected at random from the Australian electoral rolls and 500 questionnaires sent out in each case. The resulting samples, each of 122 people, showed a distribution on the four basic demographic characteristics of age, sex, occupation and income not significantly different from that observed on other doorstep samples carried out contemporaneously in the Sydney metropolitan area.

In the first sample the Jackson scale was administered in 3, 2, 1 format and scored both 3, 2, 1 and 1, 0. The latter format is of course Jackson's own system and has the effect of treating an 'undecided' (omitted) response as 'no'. Jackson simply counts 'yeses' to get his total score (after allowance for reverse scoring where indicated). The alternative system counts a 'yes' as '3', a '?' as '2' and a 'no' as '1'. The reliability observed with each system was: Jackson system 0.80; alternative system 0.84.
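The two scoring systems compared above can be sketched as follows (assuming each response is recorded as 'yes', '?' for undecided/omitted, or 'no', after any reverse-keyed items have been flipped):

```python
# Sketch of the two scoring systems applied to the Jackson scale.
# Response coding ('yes'/'?'/'no') is an assumption for illustration;
# reverse-keyed items are taken to be flipped before scoring.

def jackson_score(responses):
    """Jackson's 1-0 system: count 'yes' responses; '?' counts as 'no'."""
    return sum(1 for r in responses if r == 'yes')

def three_point_score(responses):
    """Alternative 3, 2, 1 system: 'yes' = 3, '?' = 2, 'no' = 1."""
    values = {'yes': 3, '?': 2, 'no': 1}
    return sum(values[r] for r in responses)

answers = ['yes', '?', 'no', 'yes']
print(jackson_score(answers))      # 2
print(three_point_score(answers))  # 9
```

The difference between the systems lies entirely in the treatment of the middle category: the 1-0 system discards the distinction between 'undecided' and 'no', while the 3, 2, 1 system preserves it.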

In the second sample, the same Jackson scale was administered (not just scored) in the '1-0' system. 'True' and 'False' were the only response options offered. The reliability observed (coefficient alpha) was 0.81. The comparable means (and SDs) on the two occasions were 7.31 (4.15) and 9.24 (4.47). Obviously, whether the scale is administered or merely scored in the briefer system has very little effect on the reliability observed. It does, however, have an important effect on the mean: people are more likely to assent to dominance statements when a 'not sure' category is not provided (t = 3.48; p < 0.01).
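The reported t statistic can be recovered (to within rounding of the published means and SDs) from a two-sample t with equal n:

```python
# Check of the reported t value for the difference between the two
# sample means, using the figures given in the text (n = 122 each).
from math import sqrt

n = 122
m1, sd1 = 7.31, 4.15   # three-option administration, scored 1-0
m2, sd2 = 9.24, 4.47   # two-option ('True'/'False') administration

t = (m2 - m1) / sqrt((sd1 ** 2 + sd2 ** 2) / n)
print(t)  # ~3.49; the text reports 3.48, the discrepancy being rounding
```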

Discussion

Clearly, whether we use matched samples or simply rescore the data from one sample, the reliability is usefully higher the more response categories we use. The nature of the samples used above should make this finding fairly generalizable.

The use of Cronbach's (1951) coefficient alpha above as the criterion of scale quality is in line with the existing literature, but test-retest reliabilities and validity estimates would also of course be useful. The use of general population samples, however, does make the latter very difficult to obtain.

The work above did not of course attempt to find the effects of all possible answer formats. Seven-point formats such as were used with the California F Scale (Adorno, Frenkel-Brunswik, Levinson & Sanford 1950) could well be best of all. If such formats were to be used, the findings of Rotter (1972) on how they should be labelled could be valuable.

Another attempt not made above was to test the significance of the differences between the reliability coefficients. This is because standard tests for such differences do not exist. The reason for this in turn is that with reliability one is concerned not to test its significance but to maximize it. A reliability of (for instance) 0.20 may be significant but it would certainly not be useful. Reliabilities, then, are generally evaluated against conventional standards (see Shaw & Wright 1967) rather than in terms of significance. The present work has suggested that extra response categories may on at least some occasions help a scale to meet these standards.

Perhaps a final point of importance would be a plea for greater use of a midpoint category. We are greatly dependent on voluntary cooperation from people in our testing activities, and attempts to 'force' respondents into a positive or negative response can only worsen relationships. If this results in a lesser feeling of co-operativeness, the answers will surely be of less value and indeed of less validity. The finding above that provision of a midpoint improves scale function should be all the incentive we need to abandon attempts to introduce coercion into our testing.