Acquiescent Response Bias as a Recurrent Psychometric Disease: Conservatism in Japan, the U.S.A. and New Zealand

J. J. RAY

School of Sociology, University of NSW, P.O. Box 1, Kensington, 2033, Australia

Summary

Early work on acquiescence treated it as a trait -- a characterization which Rorer showed could not be sustained. Evidence is reviewed to show that non-meaningful acquiescence is elicited as a response to some aspect of the test or test situation by particular groups of subjects. Its presence can be detected with balanced scales -- producing a validity breakdown for the particular scale on that particular occasion of its use. Balanced scales are needed, then, not only to control for acquiescence but also to detect attacks of validity breakdown. This was illustrated by reanalysing Conservatism scale data from four samples reported in published studies by W. Scott and G. D. Wilson. The correlation between "liberal" and "conservative" sub-scales was used as a criterion of construct validity and it was hypothesized that the two Japanese samples would show less validity for the scale than would the two "Western" samples. It was also hypothesized that the New Zealand sample would show the highest validity of all. In fact one of the Japanese samples and the American sample showed satisfactory validity whereas the other Japanese sample and the New Zealand sample showed little validity. It was concluded that the onset of acquiescence-caused invalidity is common but substantially unpredictable. Control against its effects and measurement of its effects are then both vital before a scale can be accepted as measuring what it purports to measure.

Introduction

After the publication of the one-way-worded California F scale, much attention was given to the role of acquiescent response set in its scores. It was widely held that the scale was more a measure of the tendency towards indiscriminate agreement with ambiguous statements than of what it purported to measure (pre-Fascist ideology). The low meaningfulness of F scale items was confirmed when attempts to construct a "balanced" form of the scale (i.e. a form including equal numbers of pro-Fascist and anti-Fascist items) foundered on the finding that many people agreed both with the "positive" and the "negative" forms of the same item. This was most concisely shown in the fact that the correlation between the positively and negatively worded sub-scales tended more towards orthogonality than towards high negative coefficients. CHRISTIE, HAVEL & SEIDENBERG (1956) regarded correlations of this sort (hereafter in this paper referred to as "rPN") of around -.2 as evidence that the F scale was "irreversible".
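As an illustration only, the rPN statistic described above could be computed along the following lines. This is a minimal sketch, not the procedure of any of the cited authors: the function names, the 1-5 response format and the two small data sets are all hypothetical and are invented purely to show the contrast between meaningful and indiscriminate responding.

```python
from math import sqrt
from statistics import mean


def pearson_r(x, y):
    """Plain Pearson product-moment correlation between two lists."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def r_pn(responses, positive_items, negative_items):
    """Correlate the positive and negative sub-scale totals.

    No reverse-scoring is applied, so a valid balanced scale
    should yield a strongly negative coefficient.
    """
    pos = [sum(row[i] for i in positive_items) for row in responses]
    neg = [sum(row[i] for i in negative_items) for row in responses]
    return pearson_r(pos, neg)


# Hypothetical data: 4 items (0-1 positively worded, 2-3 negatively
# worded), answered on an assumed 1-5 agreement format.
meaningful = [[5, 5, 1, 1], [4, 4, 2, 2], [2, 1, 4, 5], [1, 2, 5, 4], [3, 3, 3, 3]]
indiscriminate = [[5, 5, 5, 5], [4, 4, 4, 4], [1, 1, 1, 1], [2, 2, 2, 2], [3, 3, 3, 3]]

rpn_meaningful = r_pn(meaningful, [0, 1], [2, 3])
rpn_indiscriminate = r_pn(indiscriminate, [0, 1], [2, 3])
print(rpn_meaningful, rpn_indiscriminate)
```

In the first invented data set agreement with one item form goes with disagreement with the other, so rPN is strongly negative; in the second, respondents give the same answer to both forms and rPN is positive.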

This demonstrated invalidity of the F scale (its items evidently did not measure what they were thought to measure) was generally held to result from there being in the population a substantial number of "acquiescers" who responded to ambiguous stimuli by tending to indicate agreement. RORER (1965) however, showed that this theory also ran against the evidence. If acquiescence were a trait, then various measures of it should intercorrelate positively. On the evidence available at that time, he pointed out that this was not so. Intercorrelations between acquiescence measures were generally negligible. This left us with the conclusion that neither pre-Fascist ideology nor acquiescence could be reliably measured.

The conclusion that RORER drew at this juncture was that because it could not be measured, acquiescent response style either did not exist or existed only to a negligible degree. Similarly, one could presumably conclude that pre-Fascist ideology did not exist. While RORER's work made it doubtful that acquiescence could explain the poor correlation between oppositely-worded versions of F scale items, it did not of course cause the poor correlations to go away.

As it happens, however, both these conclusions are wrong. Balanced versions of the F scale with rPNs of up to -.7 have now been constructed (RAY 1972a, 1979a) and evidence has accumulated that acquiescent responding does generalize from one attitude scale to another (VAGT & WENDT 1978, RAY 1979c, 1983). Acquiescence can in fact now be measured with considerable reliability and can have strong correlations of its own (HEAVEN 1983, MARTIN 1964, RAY 1982 & 1983, RAY & PRATT 1979).

Why, then, does the conclusion from later studies differ so markedly from RORER's conclusion? The answer is that acquiescence has only some generality. With some balanced scales on some occasions the acquiescence score (i.e. the score obtained by adding up responses across all items without doing any reverse-scoring) has negligible reliability and correlates little even with the acquiescence scores of other scales. On other occasions, the acquiescence scores of an entire group of scales will be found to intercorrelate highly (VAGT & WENDT 1978; RAY 1983).
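The distinction between the acquiescence score defined above and the ordinary (reverse-scored) scale score can be sketched as follows. The function names, the item positions and the 1-5 response range are assumptions made for the sake of the example and are not taken from any of the scales discussed.

```python
def acquiescence_score(row):
    """Sum of raw responses across all items, with no reverse-scoring."""
    return sum(row)


def content_score(row, negative_items, scale_min=1, scale_max=5):
    """Conventional scale score: negatively worded items are reversed.

    scale_min and scale_max describe an assumed 1-5 response format.
    """
    reversed_items = set(negative_items)
    return sum(
        (scale_min + scale_max - value) if index in reversed_items else value
        for index, value in enumerate(row)
    )


# Hypothetical respondents on a 4-item balanced scale
# (items 2 and 3 are the negatively worded ones).
yea_sayer = [5, 5, 5, 5]     # agrees with everything
consistent = [5, 5, 1, 1]    # agrees with one pole only

print(acquiescence_score(yea_sayer), content_score(yea_sayer, [2, 3]))
print(acquiescence_score(consistent), content_score(consistent, [2, 3]))
```

On a balanced scale the indiscriminate agreer obtains the maximum acquiescence score but only a mid-range content score, whereas the consistent respondent obtains a mid-range acquiescence score and an extreme content score.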

It would seem then that acquiescing regardless of meaning is more properly viewed as a response than as a trait. Some scales on some occasions with some subjects will evoke a tendency to indiscriminate agreement. Whenever this happens, the measurements obtained must of course be regarded as lacking in validity. It therefore becomes a question of some importance to ask what the circumstances are under which this validity defect (indiscriminate acquiescence) may be expected to manifest itself. What are the stimuli that evoke the response of meaningless or indiscriminate agreement? Unless we can identify stimuli, any scale we use on any occasion may turn out to have failed in the measurement task for which it was intended. The entire existing body of attitude scale research might be open to doubt as being based on measures of unknown construct validity.

The obvious circumstance that might give rise to acquiescent responding is item ambiguity. Even RORER conceded that ambiguous items might attract systematic and indiscriminate agreement. But how do we measure the ambiguity of a set of items on any given occasion? The ambiguity is an attribute not only of the items themselves but also of how the respondents perceive those items. The Wilson Conservatism scale has been shown to have an rPN that varies between -.20 and -.57 (RAY 1980, 1981). The RAY balanced dogmatism scale has been found to have an rPN that varies between -.71 and -.22 (RAY 1979b). The RAY Australian environmentalism scale has been found to have an rPN that varies between -.53 and -.03 (RAY 1983). What is ambiguous to one group of respondents may be not at all ambiguous to others.

We are then faced with the possibility that the emergence of ambiguity in our measuring instruments may be substantially unpredictable. This, however, would surely be a premature admission of defeat. Although it may take a long time to zero in on the critical factors, surely we could attempt at least some preliminary hypotheses. Surely we could point to at least some circumstances when the emergence of ambiguity is likely. It is proposed in the present paper to make one attempt in that direction.

Method and Results

The most obvious circumstance in which a scale seemed likely to suffer a breakdown in meaningfulness was when a scale developed for use in one country is applied to the population of another country. Surely we should be able to predict with some confidence that cultural, historical, national and linguistic differences would affect the meaningfulness of attitude scale items. Surely a scale that is quite meaningful to American respondents could very easily be much less meaningful to (say) Japanese respondents. Exactly this prediction is tested below.

This study took advantage of two existing bodies of data: A study by SCOTT (1979) which applied abbreviated forms of the WILSON (1973) Conservatism scale to community samples in Boulder (Colorado), Kyoto/Otsu (Japan) and Wellington (New Zealand) and a study by WILSON & IWAWAKI (1980) which applied the same scale to 219 Japanese university students. As this scale was originally constructed in New Zealand by New Zealanders using New Zealand subjects, some adaptations of some items were made on an a priori basis at the time the American and Japanese questionnaires were devised. An attempt was made, in other words, to reduce the cross-cultural incompatibility which could be expected for the scale. This being so, the data enable a conservative test of the hypothesis that cross-cultural transfer will induce a breakdown of meaningfulness. If a breakdown is detected, it will have emerged despite efforts to prevent it.

Given the provenance of the scale, it was predicted that the meaningfulness of its items would be highest in New Zealand, lowest in Japan and intermediate in the U.S.A. This should be reflected in gradations of rPN.

When the data kindly supplied by the two authors were reanalysed, the rPNs found were -.23 in New Zealand, -.55 in Boulder and -.45 in Kyoto/Otsu. The rPN among the Japanese students of WILSON & IWAWAKI was .27. The anomalous sign of the latter correlation is not unprecedented (RAY 1972b) but is nonetheless a very unfavourable commentary on the validity of the scale on the given occasion. It is therefore clear that the Wellington and the Japanese student data showed levels of rPN of a magnitude similar to that which led CHRISTIE, HAVEL & SEIDENBERG (1956) to regard the F scale as "irreversible". The remaining two correlations, on the other hand, are highly significant and must be regarded as fairly satisfactory.

The hypothesis, then, was thoroughly falsified. The New Zealanders did worst rather than best and one out of the two Japanese results showed high rather than low meaningfulness.
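The comparison between the predicted and the observed ordering can be laid out explicitly. The sketch below uses only the four rPN values reported above; the dictionary labels and variable names are illustrative, and "most meaningful" is taken, as an assumption consistent with the argument of this paper, to mean "most strongly negative rPN".

```python
# rPN values as reported in the text for the four samples.
reported_rpn = {
    "Wellington (New Zealand)": -0.23,
    "Boulder (U.S.A.)": -0.55,
    "Kyoto/Otsu (Japan)": -0.45,
    "Japanese students": 0.27,
}

# Order the samples from most to least meaningful, i.e. from the
# most negative rPN to the least negative (or positive) rPN.
observed_order = sorted(reported_rpn, key=reported_rpn.get)

# The prediction placed New Zealand first and the Japanese samples last.
predicted_best = "Wellington (New Zealand)"
prediction_held = observed_order[0] == predicted_best

for rank, sample in enumerate(observed_order, start=1):
    print(rank, sample, reported_rpn[sample])
print("New Zealand ranked first as predicted:", prediction_held)
```

Ordered in this way, Boulder comes first, Kyoto/Otsu second, Wellington third and the Japanese students last, which is not the ordering that the provenance of the scale led us to expect.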

Discussion

The present results have shown that even the most gross and obvious predictions about when the items of a scale will undergo a collapse of meaningfulness were not supported by the data. If predictions of this kind can be falsified, it augurs ill for any prospects of prediction in this field at all. A breakdown of meaningfulness may afflict any scale at any time with any sample. As such it may be compared to a disease that strikes without warning. We can neither predict nor prevent it. All we can do is detect and record its presence. Unless therefore we use balanced scales and calculate the rPN for each scale on each occasion of its use, we may at any time be relying on data that does not mean to our respondents what we assume it to mean. Any research wherein the rPN for the scales used is not reported must be regarded as of uncertain meaning.

References
