Designing Questions to Be Good Measures

In surveys, answers are of interest not
intrinsically but because of their relationship to something they are supposed
to measure. Good questions are reliable, providing consistent measures in
comparable situations, and valid: answers correspond to what they are intended
to measure.

This chapter discusses theory and practical
approaches to designing questions to be reliable and valid measures.

It is always important to remember that
designing a question for a survey instrument is designing a measure, not a
conversational inquiry. In general, an answer given to a survey question is of
no intrinsic interest. Rather the answer is valuable to the extent that it can
be shown to have a predictable relationship to facts or subjective states that
are of interest. Good questionnaires maximise the relationship between the
answers recorded and what the researcher is trying to measure.

In one sense, survey answers are simply
responses evoked in an artificial situation contrived by the researcher. What
does an answer tell us about some reality in which we have an interest?

Let us look at a few specific kinds of
answers and their meanings:

A respondent tells us that he voted for Nixon rather than McGovern
for president in 1972. The reality in which we are interested is which lever,
if any, he pulled in the voting booth. The answer given in the survey may
differ from what happened in the voting booth for any number of reasons. The
respondent may have pulled the wrong lever and, therefore, not know for whom
he voted. The respondent could have forgotten for whom he voted. The
respondent could have deliberately altered his answer for some reason. The
interviewer could accidentally have checked the wrong box even after an
"accurate" answer was given.

A respondent tells us how many times he went to the doctor for
medical care during the past year. Is that the same number that the researcher
would have come up with had he followed the respondent around for 24 hours a
day for 365 days during the past year? Problems of recall, problems of
definition of what constitutes a visit to a doctor, and problems of
willingness to report accurately may affect the correspondence between the
number the respondent gives and the count the researcher would have arrived at
independently.

When a respondent rates her public school system as "good" rather
than "fair" or "poor," the researcher will want to interpret that answer as
reflecting evaluations and perceptions of that school system. If the
respondent rated only one school rather than the whole school system, or
tilted the answer to please the interviewer, or understood the question
differently from others, her answer may not reflect the feelings the
researcher tried to measure.

Although many surveys are analysed and
interpreted as if the researcher "knows" what the answer means, that, in fact,
is very risky. Studies designed to evaluate the correspondence between
respondents' answers and "true values" show that many respondents answer many
questions very well. However, there is also considerable lack of
correspondence. To assume perfect correspondence between the answers people give
and some other reality is naive. When it is true, it is usually the result of
careful design. In the following sections, we discuss many specific ways
researchers can improve the correspondence between respondents' answers and the
"true" state of affairs.

One goal of a good measure is to increase
question reliability. When two respondents are in the same situation, they
should answer the question in the same way. To the extent that there is
inconsistency across respondents, random error is introduced and the measurement
is less precise. The first part of this chapter deals with how to increase the
reliability of questions.

There is also the issue of what a given
answer "means" in relation to what a researcher is trying to measure: How well
does the answer correspond? The latter two sections of this chapter are devoted
to validity - the correspondence between answers and "true values"- and ways to
improve that correspondence (compare Cronbach & Meehl, 1955).
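
In the language of classical test theory (a standard formulation offered here
as a gloss; the chapter itself does not use equations), an observed answer X
can be written as a true value T plus an error term E, and reliability as the
share of observed variance that is true variance:

    X = T + E, \qquad \text{reliability} = \frac{\operatorname{Var}(T)}{\operatorname{Var}(T) + \operatorname{Var}(E)}

Validity, in these terms, is the further question of whether X corresponds to
what the researcher intends to measure.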

Designing a reliable instrument

One step toward ensuring consistent
measurement is that each respondent in a sample is asked the same set of
questions. Answers to these questions are recorded. The researcher would like to
be able to make the assumption that differences in answers stem from differences
among respondents rather than from differences in the stimuli to which
respondents are exposed.

A survey data collection is an interaction
between a researcher and a respondent. In a self-administered survey, the
researcher speaks directly to the respondent through a written questionnaire. In
other surveys, an interviewer reads the researcher's words to the respondent. In
either case, the questionnaire is the protocol for one side of the interaction.
In order to provide a consistent data collection experience for all respondents,
a good questionnaire has the following properties:

The researcher's side of the question and
answer process is fully scripted, so that the questions as written fully
prepare a respondent to answer them.

The question means the same thing to
every respondent.

The kinds of answers that constitute an
appropriate response to the question are communicated consistently to all
respondents.

Inadequate wording

The simplest example of inadequate question
wording is when, somehow, the researcher's words do not constitute a complete
question.

Incomplete wording

Bad: 5.1 Age?

Better: What was your age on your last birthday?

Bad: 5.2 Reason last saw doctor?

Better: What was the medical problem or reason for which you most recently
went to a doctor?

Interviewers (or respondents) will have to
add words or change words in order to make an answerable question. If the goal
is to have respondents all answering the same questions, then it is best if the
researcher writes the questions fully.

Sometimes optional wording is required to fit
differing respondent circumstances. However, that does not mean that the
researcher has to give up writing the questions. A common convention is to put
optional wording in parentheses. These words will be used by the interviewer
when they are appropriate to the situation and omitted when they are not needed.

Examples of optional wording

5.3 Were you (or anyone living here with you)
attacked or beaten up by a stranger during the past year?

5.4 Did (he/she) report the attack to the
police?

5.5 How old was (EACH PERSON) on (his/her)
last birthday?

In 5.3, the parenthetical phrase would be
omitted if the interviewer already knew that the respondent lived alone.
However, if more than one person lived in the household, the interviewer
would include it.

The parenthetical choice offered in 5.4 may
seem minor. However, the parentheses alert the interviewer to the fact that a
choice must be made; the proper pronoun is used, and the principle is maintained
that the interviewer need read only the questions exactly as written in order to
present a satisfactory stimulus.

A variation that accomplishes the same thing
is illustrated in 5.5. A format such as that might be used if the same question
were to be asked for each person in a household. Rather than repeat the
identical words endlessly, a single question is written instructing the
interviewer to substitute an appropriate designation (your husband/your son/your
oldest daughter).

The above examples permit the interviewer to
ask questions that make sense and take advantage of knowledge previously gained
in the interview to tailor the questions to the respondent's individual
circumstances. There is another kind of optional wording, seen occasionally in
questionnaires, that is not acceptable.

Example of unacceptable optional wording

5.6 What do you like best about this
neighbourhood? (We're interested in anything like houses, the people, the parks,
or whatever.)

Presumably, the parenthetical probe was
thought to be helpful to respondents who were having difficulty in answering the
question. However, from a measurement point of view, it undermines the principle
of standardized interviewing. If interviewers use the parenthetical probe when a
respondent does not readily come up with an answer, then a subset of respondents
will have answered a different question. Such optional probes usually are
introduced when the researcher does not think the initial question is a very
good one. The proper approach is to write a good question in the first place.
Interviewers should never be given any options about what questions to read or
how to read them except, as in the examples above, to make the questions fit the
circumstances of a particular respondent in a standardized way. The following is
a different example of incomplete question wording. There are three errors
embedded in the example.

Poor example of standardized wording

5.7 I would like you to rate different
features of your neighbourhood as very good, good, fair, or poor. Please think
carefully about each item as I read it.

(a) Public schools

(b) Parks and services

(c) Other

The first problem with 5.7 is the order
of the main stem. The response alternatives are read prior to an instruction to
think carefully about the questions. The respondent probably will forget the
question, and the interviewer likely will have to do some explaining or
rewording. Second, the words the interviewer needs to ask about the second item
on the list, "parks," are not provided in 5.7. A much better question would be
the following:

Better example

5.7a I am going to ask you to rate different
features of your neighbourhood. I want you to think carefully about your
answers. How would you rate (FEATURE)-would you say very good, good, fair, or
poor?

This gives the interviewer the wording needed
for asking the first and all subsequent items on the list.

The third problem with the example is the
alternative "other". What is the interviewer to say? It is not uncommon to see
"other" on a list of questions in a form similar to the example. Although
occasionally there may be a worthwhile question objective involved, most often
the questionnaire will benefit from dropping the item.

The above examples illustrate questions that
could not be presented consistently to all respondents due to incomplete
wording. Another step needed to increase consistency is to create a set of
questions that flows smoothly and easily. It can be shown that if questions have
awkward or confusing wording, if there are words that are difficult to
pronounce, or combinations of words that sound awkward together, interviewers
will change the words to make the questions sound better or to make them easier
to read. It may be possible to train and supervise interviewers to keep such
changes to a minimum. However, good design of the questionnaire will raise the
odds of a standardized interview.

Ensuring consistent meaning to all respondents

If all respondents are asked exactly the same
questions, one step has been taken to ensure that differences in answers can be
attributed to differences in respondents. However, there is a further
consideration. The questions should all mean the same thing to all respondents.
If two respondents understand the question to mean different things, their
answers may be different for that reason alone.

One potential problem is using words that are
not understood universally. In general samples, it is important to remember that
a range of educational experiences and cultural backgrounds will be represented.
Even with well-educated samples, using simple words that are short and widely
understood is a sound approach to questionnaire design.

Undoubtedly, a much more common error than
using unfamiliar words is the use of terms or concepts that can have multiple
meanings. It is impossible to give an exhaustive list of ambiguous terms used in
surveys, but the prevalence of misunderstanding of common terms has been well
documented by those who have studied the problem (e.g., Belson, 1981).

Poorly defined terms

5.8 How many times in the past year have you
seen or talked with a doctor about your health?

Problem. There are two ambiguous terms or
concepts in this question. First, there is basis for uncertainty about what
constitutes a doctor. Are only people practicing medicine with M.D. degrees
included? If so, then psychiatrists are included, but psychologists,
chiropractors, osteopaths, and podiatrists are not. What about physicians'
assistants or nurses who work directly for doctors in doctors' offices? If a
person goes to a doctor's office for an inoculation that is given by a nurse,
does it count?

Second, what constitutes seeing or talking
with a doctor? Do telephone consultations count? Do visits to a doctor's office
when the doctor is not seen count?

Solutions.
Often the best approach is to provide respondents and interviewers with the
definitions they need.

5.8a We are going to ask about visits to
doctors and getting medical advice from doctors. In this case we are interested
in all professional personnel who have M.D. degrees or work directly for an M.D.
in the office such as a nurse or medical assistant.

When the definition of what is wanted is
extremely complicated and would take a very long time to define, as may be the
case in this question, an additional constructive approach may be to ask
supplementary questions about desired events that are particularly likely to be
omitted. For example, visits to psychiatrists, visits for inoculations, and
telephone consultations often are underreported and may warrant specific
follow-up questions.

Poorly defined terms

5.9 Did you eat breakfast yesterday?

The difficulty is that the definition of
breakfast varies widely. Some people consider coffee and a donut anytime before
noon to be "breakfast". Others do not consider that they have had breakfast
unless it includes a major entree, such as bacon and eggs, and is consumed
before 8:00 A.M.

Solutions.
There are two approaches to the solution. On the one hand, one might choose to
define breakfast:

5.9a For our purposes, let us consider
breakfast to be a meal eaten before 10:00 in the morning, which includes some
protein such as eggs, meat or milk, some grain such as toast or cereal, and some
fruit or vegetable. Using that definition, did you have breakfast yesterday?

While that often is a very good approach, in
this case it is very complicated. Instead of trying to communicate a common
definition to respondents, the researcher may simply ask people to report what
they consumed before 10:00 a.m. At the coding stage, the "quality" of what was eaten can be evaluated
consistently without requiring each respondent to share the same definition.

Poorly defined terms

5.10 Do you favour or oppose gun control
legislation?

Problem. Gun control legislation can mean banning the legal sale of certain
kinds of guns, asking people to register their guns, limiting the number or
kinds of guns that people may possess, or restricting which people may possess
them. Answers cannot be interpreted without assumptions about what respondents
think the question means. Respondents will undoubtedly interpret the question
differently.

5.10a One proposal for the control of guns is
that no person who ever had been convicted of a violent crime would be allowed
to purchase or own a pistol, rifle, or shotgun. Would you oppose or support
legislation like that?

One could argue that it is only one of a
variety of proposals for gun control. That is exactly the point. If one wants to
ask multiple questions about different possible responses to a gun control
problem, one should ask separate, specific questions that can be commonly
understood by all respondents and interpreted by researchers. One does not solve
the problem of a complex issue by leaving it to the respondents to decide what
questions they want to answer.

The worst way to handle a complex
definitional problem is to give interviewers instructions about how to define
terms if they are asked. Only respondents who ask will receive the definition;
interviewers will not give consistently worded definitions if they are not
written in the questionnaire. Thus the researcher will never know what question
any particular respondent answered. If a complex term that may require
definition must be used, interviewers should be required to read a common
definition to all respondents.

The "Don't Know" Option

When respondents are being asked questions
about their own lives, feelings, or experiences, a "don't know" response is
often a statement that they are unwilling to do the work required to give an
answer. On the other hand, sometimes we ask respondents questions about things
about which they legitimately do not know. The further the object of the
questions gets from their immediate lives, the more plausible and reasonable it
is that some respondents will not have adequate knowledge on which to base an
answer or will not have formed an opinion or feeling.

There are two approaches to dealing with such
a possibility. One simply can ask the questions of all respondents, relying on
the respondent to volunteer a "don't know." The alternative is to ask
respondents whether or not they feel familiar enough with a topic to have an
opinion or feeling about it.

When a researcher is dealing with a topic
about which familiarity is high, whether or not a screening question for
knowledge is asked is probably not important. However, when there is reason to
think that a notable number of respondents will not be familiar with whatever
the question is dealing with, it probably is best to ask a screening question
about familiarity with the topic. People differ in their willingness to
volunteer a "don't know". A screening question for familiarity helps to produce
a kind of standardisation; most people answering the question then will have at
least some minimal familiarity with what they are responding to (Schuman &
Presser, 1981).

Specialised Wording for Special Subgroups

Researchers have wrestled with the fact that
the vocabularies in different subgroups of the population are not the same. One
could argue that standardised measurement actually would require different
questions for different subgroups.

Designing different forms of questionnaires
for different subgroups is almost never done. Rather, methodologists tend to
work very hard to find wording for questions that has consistent meaning across
an entire population. Even though there are situations where a question wording
is more typical of the speech of one segment of a community than another (most
often the better-educated segment), finding exactly comparable words for some
other group of the population, and then giving interviewers reliable rules for
deciding when to ask which version, is so difficult that it is likely to
produce more unreliability than it reduces.

Standardized expectations for type of
response

Thus far we have said it is important to give
interviewers a good script so that they can read the questions exactly as
worded, and it is important to design questions that mean the same thing to all
respondents. The other component of a good question that sometimes is overlooked
is that respondents should have the same perception of what constitutes an
adequate answer for the question.

The simplest way to give respondents the same
perceptions of what constitutes an adequate answer is to provide them with a
list of acceptable answers. Such questions are called closed questions. The
respondent has to choose one, or sometimes more than one, of a set of
alternatives provided by the researcher.

Closed questions are not suitable in all
instances. The range of possible answers may be more extensive than it is
reasonable to provide. The researcher may not feel that all reasonable answers
can be anticipated. For such reasons, the researcher may prefer not to provide a
list of alternatives to the respondent. However, that does not free the
researcher from structuring the focus of the question and the kind of response
wanted as carefully as possible.

5.11 Why did you vote for Candidate A?

Problems.
Almost all "why" questions have problems. The reason is that one's sense of
causality or frame of reference can influence what one talks about. In the
particular instance above, the respondent may choose to talk about the strengths
of Candidate A, the weaknesses of Candidate B, or the reasons the respondent
uses certain criteria ("My mother was a lifelong Democrat"). Hence respondents
who see things exactly the same way may answer differently.

Solution.
Specify the focus of the answer:

5.11a What characteristics of Candidate A led
you to vote for (him/her) over Candidate B?

Such a question explains to respondents that
we want them to talk about Candidate A, the person for whom they voted. If all
respondents answer with that same frame of reference, we then will be able to
compare responses from different respondents in a direct fashion.

5.12 What are some of the things about this
neighbourhood that you like best?

Problems.
In response to a question like that, some people will only make one or two
points, while others will make many. It is possible that such differences
reflect important differences in respondent perceptions or feelings. However,
research has shown pretty clearly that education is highly related to the number
of answers people give to questions. Interviewers also affect the number of such
answers.

Solution.
Specify the number of points to be made:

5.12a What is the feature of this
neighbourhood that you would single out as the one you like most?

5.12b Tell me the three things about this
neighbourhood that you like most about living here.

Although that may not be a satisfactory
solution for all questions, for many such questions it is an effective way of
reducing unwanted variation in answers across respondents.

The basic point is that answers can vary
because respondents have a different understanding of the kind of responses that
are appropriate. Better specification of the properties of the answer desired
can remove a needless source of unreliability in the measurement process.

Types of measures / types of questions

Introduction

The above procedures are designed to maximise
reliability - the extent to which people in comparable situations will answer
questions in similar ways. However, one can measure with perfect reliability and
still not be measuring what one wants to measure. The extent to which the answer
given is a true measure and means what the researcher wants it to mean or
expects it to mean is called validity. In this section, we discuss other aspects
of the design of questionnaires, in addition to steps to maximise the
reliability of questions, that can increase the validity of survey measures.

For this discussion, it is necessary to
differentiate questions designed to measure facts or objectively measurable
events from questions designed to measure subjective states such as attitudes,
opinions, and feelings. Even though there are questions that fall in a murky
area on the borders of these two categories, the idea of validity is somewhat
different for subjective and objective measures for several reasons. If it is
possible to check the accuracy of an answer by some independent observation,
then the measure of validity becomes the similarity of the survey report to the
value of some "true" measure. In theory one could obtain an independent,
accurate count of the number of times that an individual obtained medical
services from a physician during a year. Although in practice it may be very
difficult to obtain such an independent measure (e.g., records also contain
errors), the understanding of validity can be consistent for objective
situations.

In contrast, when people are asked about
subjective states, feelings, attitudes, and opinions, there is no objective way
of validating the answers. Only the person has access to his or her feelings and
opinions. Thus the only way of assessing the "validity" of reports of subjective
states is the way in which they correlate either with other answers that a
person gives or with other facts about the person's life that one thinks should
be related to what is being measured. For such measures, there is no truly
independent direct measure possible; the meaning of answers must be inferred
from patterns of association. This fundamental difference in the meaning of
validity requires separate discussions regarding ways of maximising validity.

Levels of Measurement

There are four different ways in which
measurement is carried out in social sciences. This produces four different
kinds of tasks for respondents and four different kinds of data for analysis:

Nominal - people or events are
sorted into unordered categories. ("Are you male or female?")

Ordinal - people or events are
ordered or placed in ordered categories along a single dimension. ("How would
you rate your health - very good, good, fair, or poor?")

Interval data - numbers
are attached that provide meaningful information about the distance between
ordered stimuli or classes. (In fact, interval data are very rare. Fahrenheit
temperature readings are among the few common examples.)

Ratio data - numbers are
assigned that have absolute meaning such as a count or measurement by an
objective, physical scale such as distance, weight, or pressure. ("How old
were you on your last birthday?")

Most often in surveys, when one is collecting
factual data, respondents are asked to fit themselves or their experiences into
a category, creating nominal data, or they are asked for a number, most often
ratio data. "Are you employed?", "Are you married?", and "Do you have
arthritis?" are examples of questions that provide nominal data. "How many times
have you seen a doctor?", "How old are you?", and "What is your income?" are
examples of questions to which respondents are asked to provide real numbers for
ratio data.
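
As a concrete illustration (added here; the variables and answers are made up),
the sketch below shows how these levels of measurement might be represented for
analysis in Python with pandas: nominal answers as unordered categories, ordinal
ratings as ordered categories, and counts or ages as ratio-level numbers.

    # Representing survey answers at different levels of measurement (illustrative).
    import pandas as pd

    df = pd.DataFrame({
        # Nominal: unordered categories
        "employed": pd.Categorical(["yes", "no", "yes"]),
        # Ordinal: ordered categories along a single dimension
        "health": pd.Categorical(
            ["good", "fair", "very good"],
            categories=["poor", "fair", "good", "very good"],
            ordered=True,
        ),
        # Ratio: numbers with absolute meaning (counts, ages)
        "doctor_visits": [2, 0, 5],
        "age": [34, 61, 45],
    })

    # Ordinal data support order comparisons but not arithmetic:
    print(df["health"] >= "good")       # meaningful
    # df["health"].mean()               # would raise: means are not defined for ordinal labels

    # Ratio data support the full range of arithmetic:
    print(df["doctor_visits"].mean())   # meaningful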

When gathering factual data, respondents may
be asked for ordinal answers. For example, they may be asked to report their
incomes in relatively large categories or to describe their behavior in
nonnumerical terms ("usually, occasionally, seldom, or never"). When respondents
are asked to report factual events in ordinal terms, it is because great
precision is not required by the researcher or because the task of reporting an
exact number was considered too difficult; ordinal classification seemed a more
realistic task for a respondent. However, there usually is a real numerical
basis underlying an ordinal answer to a factual question.

The situation is somewhat different with
respect to reports of subjective data. Although there have been efforts over the
years, first in the work of psychophysical psychologists (e.g., Thurstone,
1929), to have people assign numbers to subjective states that met the
assumptions of interval and ratio data, for the most part respondents are asked
to provide nominal and ordinal data about subjective states. The nominal
question is, "Into which category do your feelings, opinions, or perceptions
fall?" The ordinal question is "Where along this continuum do your feelings,
opinions, or perceptions fall?"

When designing a questionnaire, a basic task
of the researcher is to decide what kind of measurement is desired. When that
decision is made, there are some clear implications for the form in which the
question will be asked.

Types of Questions

Survey questions can be classified roughly
into two groups: those for which a list of acceptable responses is provided to
the respondent (closed questions) and those for which the acceptable responses
are not provided exactly to the respondent (open questions).

When the goal is to put people in unordered
categories (nominal data), the researcher has a choice about whether to ask an
open or closed question. Virtually identical questions can be designed in either
form.

Examples of open and closed forms

5.13 What health conditions do you have? (Open)

5.13a Which of the following conditions do you currently have? (READ LIST)
(Closed)

5.14 What do you consider to be the most important problem facing our country
today? (Open)

5.14a Here is a list of problems that many people in the country are concerned
about. Which do you consider to be the most important problem facing our
country today? (Closed)

There are advantages to open questions. They
permit the researcher to obtain answers that were unanticipated. They also may
describe more closely the real views of the respondent. Third, and this is not a
trivial point, respondents like the opportunity to answer some questions in
their own words. To answer only by choosing a provided response and never to
have an opportunity to say what is on one's mind can be a frustrating
experience. Finally, open questions are appropriate when the list of possible
answers is longer than it is feasible to present to respondents.

Having said all this, closed questions are
usually a more satisfactory way of creating data. There are three reasons for
this:

The respondent can perform more reliably the task of answering the
question when response alternatives are given.

The researcher can perform more reliably the task of interpreting
the meaning of answers when the alternatives are given to the respondent (Schuman
& Presser, 1981).

When a completely open question is asked,
many people give relatively rare answers that are not analytically useful.
Providing respondents with a constrained number of categories increases the
likelihood that there will be enough people in any given category to be
analytically interesting.

Finally, if the researcher wants ordinal
data, the categories must be provided to the respondent. One cannot order
responses reliably along a single continuum unless a set of permissible ordered
answers is specified in the question. A bit more should be said about the task
respondents are given when they are asked to perform an ordinal rating, since
it is probably the most prevalent kind of measurement in survey research.

Figure 5.1 shows a continuum. In this case we
are talking about having respondents make a rating of some sort, but the general
approach applies to all ordinal questions. There is a dimension that is assumed
by the researcher that goes from the most negative feelings possible to the most
positive feelings possible. The way survey researchers get respondents into
ordered categories is to put designations or labels on such a continuum.
Respondents then are asked to consider the labels, consider their own feelings
or opinions, and place themselves in the proper category.

There are two points worth making about the
kinds of data that result from such questions. First, respondents will differ
one from the other in their understanding of what the labels or categories mean.
However, the only assumption that is necessary in order to make meaningful
analyses is that, on the average, the people who rate their feelings as "good"
feel more positively than those who rate their feelings as "fair." To the extent
that people differ somewhat in their understanding of and criteria for "good"
and "fair," there is unreliability in the measurement, but the measurement still
may
have meaning (i.e., correlate with the underlying feeling state that the
researcher wants to measure).

Second, an ordinal scale measurement like
this is relative. The distribution of people choosing a particular label or
category depends on the particular scale that is presented. Consider the rating
scale in Figure 5.1 again and consider two approaches to creating ordinal
scales. In one case, the researcher used a three-point scale, "good, fair, or
poor". In the second case, the researcher used five descriptive words,
"excellent, very good, good, fair, and poor". When one compares the two scales,
one can see that adding "excellent" and "very good" in all probability does not
simply break up the "good" category into three pieces. Rather it changes the
whole sense of the scale. People respond to the ordinal position of categories
as well as to the descriptors. "Fair" almost certainly is further to the
negative side of the continuum when it is the fourth point on the scale than
when it is the second. Thus one would expect considerably more people to give a
rating of "good" or better with the five-point scale than with the three-point
scale.
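
A stylized simulation can make this concrete. Everything in the sketch below is
an illustrative assumption rather than anything from the text: respondents'
underlying feelings are taken to be uniform on a 0-to-1 continuum, and each
scale is assumed to cut that continuum into equal-width categories.

    # Illustrative simulation: the same underlying feelings mapped onto a
    # three-point and a five-point scale (all assumptions are made up).
    import random

    random.seed(1)
    feelings = [random.random() for _ in range(10_000)]  # 0 = most negative, 1 = most positive

    def rate(x, labels):
        """Assign x (in [0, 1]) to one of len(labels) equal-width ordered categories."""
        return labels[min(int(x * len(labels)), len(labels) - 1)]

    three = ["poor", "fair", "good"]
    five = ["poor", "fair", "good", "very good", "excellent"]

    share_three = sum(rate(x, three) == "good" for x in feelings) / len(feelings)
    share_five = sum(rate(x, five) in ("good", "very good", "excellent") for x in feelings) / len(feelings)

    print(f"three-point scale: {share_three:.0%} rate 'good' or better")  # about 33%
    print(f"five-point scale:  {share_five:.0%} rate 'good' or better")   # about 60%

Under these assumptions, the share answering "good" or better nearly doubles
simply because the scale changed, which is exactly why such percentages cannot
be read as absolute descriptions of the population.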

Such scales are meaningful if used as they
are supposed to be used: to order people. However, by itself a statement that
some percentage of the population feels something is "good or better" is not
appropriate because it implies that the population is being described in some
absolute sense. The percentage would change if the question were different. Only
comparative statements (or statements about relationships) are justifiable when
one is using ordinal measures:

(a) Comparing answers to the same question
across groups; e.g., 20 percent more of those in Group A than in Group B rated
the candidate as "good" or better; or

(b) Comparing answers from comparable samples
over time, e.g., 10 percent more rate the candidate 'good' or better in January
than did so in November.

The same general comments apply to data
obtained by having respondents order items. ("Consider the schools, police
services, and trash collection. Which is the most important city service to
you?") The percentage giving any item top ranking, or the average ranking of an
item, is completely dependent on the particular list provided. Comparisons
between distributions when the alternatives have been changed at all are not
meaningful.

Agree-Disagree Items: A Special Case

Agree-disagree items are very prevalent in
the survey research field and therefore deserve special attention. One can see
that the task that respondents are given in such items is different from that of
placing themselves in an ordered category. The usual approach is to read a
statement to respondents and ask them if they agree or disagree with that
statement. The statement is located somewhere on a continuum such as that
portrayed in Figure 5.1. Respondents' locations on that continuum are calculated
by figuring out whether they say they are very close to that statement (by
agreeing) or very far from where that statement is located (by disagreeing).

The use of agree-disagree questions to order
respondents has two main potential limits.

First, a statement, in order to be
interpretable, must be located at the end of a continuum. For example, if a
statement was to be rated that said "The schools are fair," presumably a point
in the middle of a continuum, a respondent could disagree either because he
rated the schools as "good" or because he rated them as "poor". The second
limitation is that it is very common for the statements used as stimuli for
agree-disagree questions to have more than one dimension (i.e., to be
double-barrelled), in which case the answer cannot be interpreted. The two
statements below provide examples of double-barrelled statements.

5.15 In the next five years, this country
will probably be strong and prosperous.

Problems.
It obviously is possible for someone to have the view that the country will be
strong but not prosperous, or vice versa. Since prosperity and strength do not
necessarily go together, a respondent may have trouble knowing what to do.

5.16 With economic conditions the way they
are these days, it really isn't fair to have more than one or two children.

Problems.
If a person does not happen to think that economic conditions are terrible
(which the question imposes as an assumption), or if a person does not believe
that economic conditions of whatever kind have any implications for family
size, but that person happens to think one or two children is a good target for
a family, it is not easy to answer the question.

[Figure 5.1 here: a continuum of feeling about something, running from
extremely positive to extremely negative, with ordered category labels placed
along it - a two-category scale (good, not good), a three-category scale (good,
fair, poor), a four-category scale (very good, good, fair, poor), and a
five-category scale (excellent, very good, good, fair, poor).]

Figure 5.1 Subjective Continuum Scale

The problem then is knowing what the
respondent agreed to, if he or she agreed. Asking two or three questions at once
and having embedded assumptions in questions are very common problems with the
agree-disagree format. The agree-disagree format appears to be a rather simple
way to construct questionnaires. In fact, to use this form to provide reliable,
useful measures is not easy and requires a great deal of care and attention. In
many cases, researchers would have more reliable and interpretable measures if
they used a different question form.

Increasing the validity of factual reporting

When a researcher asks a factual question of
a respondent, the goal is to have the respondent report with perfect accuracy,
that is, give the same answer that the researcher would have given if the
researcher had access to the information needed to answer the question. There is
a rich methodological literature on the reporting of factual material. Reporting
has been compared against records in a variety of areas, in particular the
reporting of economic and health events (see Cannell et al., 1977a, for a good
summary).

Respondents answer many questions accurately.
For example, over 90 percent of overnight hospital stays within six months of an
interview are reported (Cannell & Fowler, 1965). However, how well people report
depends on both what they are being asked and how the question is asked. There
are four basic reasons why respondents report events with less than perfect
accuracy:

They do not know the information.

They cannot recall it, although they do
know it.

They do not understand the question.

They do not want to report the answer in
the interview context.

There are several steps that the researcher
can take to combat each of these potential problems. Let us review these.

Lack of Knowledge

Since the main point of doing a survey is to
get information from respondents that is not available in other ways, most
surveys deal with questions to which respondents know the answers. The main
reason that a researcher would get inaccurate reporting due to lack of knowledge
is that he or she is asking one household member for information that another
household member has. In health surveys, for example, it is common to use a
household informant to report on visits to the doctor, hospitalisations, and
illnesses for all household members. Economic and housing surveys often ask for
a household respondent to report information for the household as a whole.

If the information exists in the household,
but simply not with the person that the researcher wants to be the main
respondent, the solutions are either to eliminate proxy reporting or to provide
an opportunity for respondents to consult with other family members. For
example, the National Crime Survey conducted by the Census Bureau obtains
reports of household crimes from a single household informant, but in addition
asks each household adult directly about personal crimes such as robbery. If the
basic interview is to be carried out in person, costs for interviews with other
members of the household can be reduced if self-administered forms are left to
be filled out by absent household members, or if secondary interviews are done
by
telephone. A variation is to ask the main respondent to report the desired
information as fully as possible for all household members. Then mail the
respondent a summary for verification, permitting consultation with other family
members (see Cannell & Fowler, 1965).

Finally, it sometimes is worth asking
household members to designate the best informed person to answer the questions.
The housewife is not always the most knowledgeable about health, and the
husband is not always the most knowledgeable about finances. People themselves
often can do a better job of choosing the best respondent for a particular topic
than can the researcher.

Recall

Studies of the reporting of known hospital
stays clearly show the significance of memory in the reporting of events. As the
time between the interview and a hospitalisation event increases, the
probability of it being reported in an interview decreases. In a like way, short
hospitalisations are less likely to be reported than long ones. Memory decays in
predictable ways; the minor and distant events are more difficult to conjure up
in a quick question and answer interview.

There are several ways to reduce the impact
of memory decay on the reporting of factual events. Five possible methods are as
follows:

Reduce the period of time about which respondents are asked to
report. There is great value in having respondents report for as long a period
of time as possible, because there is more information obtained in that way.
However, the longer the reporting period, the less accurate the reporting (Cannell
& Fowler, 1965).

Memory is improved by asking more questions. Asking more than one
question about events gives the respondent more time to think. In
addition, questions can be designed that will stimulate associations, thereby
helping the recall process. Thus the number of health conditions reported is
increased by asking about visits to doctors, taking medications, and missing
work (Madow, 1963).

A second chance to think about the answers given also can stimulate
memory. The technique suggested above of sending the respondent a summary of
answers for verification has been shown to improve the recall process as well;
even asking for the same information twice in the same interview can help
recall.

A reinterview procedure, interviewing the same respondent twice or
even more times, is another good way to deal with problems of recall. One key
problem with recalling events over time is setting them in the proper time
period. An initial interview can serve as an anchorpoint for people's recall.
The previous interview serves as a boundary in their minds. In addition, the
researcher can check to make sure that events reported in Interview 1 were not
repeated in Interview 2. A final advantage of the panel approach is that
respondents are sensitised to the kinds of events that will be asked about,
thereby further improving their recall.

Carrying that last point a step further, one way that researchers
have dealt with the reporting of minor events that are hard to remember is by
asking people to keep a diary. Consumption pat- terns, minor deviations from
good health, and patterns of expenditure are all difficult for people to
recall in detail over time unless they are taking notes. Even respondents who
do not keep their diaries up to date conscientiously report considerably
better than they would have had they not been keeping a diary (Sudman &
Bradburn, 1974).

It should be noted that a trade-off with both
the reinterview and diary strategies is that it is more difficult to convince
people to keep a diary or be interviewed several times than it is to get them to
agree to a one-time interview. Hence the values of improved reporting have to be
weighed against the possible biases resulting from sample attrition.

Social Desirability

There are certain facts or events that
respondents would rather not report accurately in an interview. Conditions that
have some degree of social undesirability, such as mental illness and venereal
disease, are underreported significantly more than other conditions (Madow,
1963; Densen et al., 1963). Hospitalisations associated with conditions that are
particularly threatening, either because of the possible stigmas that may be
attached to them or due to their life-threatening nature, are reported at a
lower rate than average (Cannell et al., 1977a). Aggregate estimates of alcohol
consumption strongly suggest underreporting, although the reporting problems may
be a combination of recall difficulties and respondents' concerns about social
norms regarding drinking. Arrest and bankruptcy are other events that have been
found to be consistently underreported, but which seem unlikely to have been
forgotten (Locander et al., 1976).

There are probably limits to what people will
report in a standard interview setting. If a researcher realistically expects
someone to admit something that is very embarrassing or illegal, extraordinary
efforts are needed to convince respondents that the risks are minimal and the
reasons for taking a risk are substantial. The following are some of the steps
that a researcher might particularly consider when sensitive questions are being
asked (also see Sudman & Bradburn, 1982).

Minimise a sense of judgment; maximise the importance of accuracy.
Careful attention should be paid to anything in the introduction or vocabulary
that might imply that the researcher would value certain answers negatively.
Researchers
always have to be aware of the fact that respondents are having a conversation
with the researcher. The questionnaire, plus the behavior of the interviewer
if there is one, constitutes all the information the respondent has about the
kind of interpretation the researcher will give to the answers. Therefore, the
researcher needs to be very careful about the kind of cues the respondent is
receiving and about the type of context in which respondents feel their
answers will be interpreted.

Use self-administered questions. Although the data are not
conclusive, there is evidence that telephone interviews are more subject to
social-desirability bias than personal interviews (e.g., Hendon et al., 1977;
Mangione et al., 1982). There is also evidence that having respondents answer
questions in a self-administered form rather than having an interviewer ask
the questions may produce less social-desirability bias for some items (e.g.,
Hochstim, 1967). Such a consideration might lead one to think of a mail survey
or group administration. A personal interview survey also can be combined
usefully with self-administration: A respondent simply is given a set of
questions to answer in a booklet as part of the personal interview experience.

Confidentiality and anonymity. Almost all
surveys promise respondents that answers will be treated confidentially and that
no one outside the research staff will ever be able to associate individual
respondents with their answers. Respondents usually are reassured of such facts
by interviewers in the introduction and in advance letters, if there are any;
these may be reinforced by signed commitments from the researchers. For surveys
on particularly sensitive or personal subjects, special steps to ensure that
respondents cannot be linked to their answers (such as the random response
techniques described by Greenberg et al., 1969) may be used. Again it is
important to emphasise that the limit of survey research is what people are
willing to tell researchers under the conditions of data collection designed by
the researcher. There are some questions that probably cannot be asked of
probability samples without extraordinary efforts (e.g., Kinsey et al., 1948).
However, some of the procedures discussed in this section, such as trying to
create a neutral context for answers and emphasising the importance of accuracy
and the neutrality of the data collection process, are probably worthwhile even
for the most innocuous of questions. Any question, no matter how innocent it
may seem, may embarrass somebody in the sample. It is best to design all phases
of a survey instrument with a sensitivity to reducing the effects of social
desirability and embarrassment on any answers people may give.
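
To illustrate how a randomized response design protects respondents, here is a
minimal sketch of one simple variant, a forced-response design. The design
probabilities and the simulated population are illustrative assumptions, not
the specific procedures of Greenberg et al. (1969); the point is only that the
interviewer never learns which respondents answered truthfully, yet the
prevalence of the sensitive trait can still be estimated in aggregate.

    # Minimal sketch of a forced-response randomized response design (illustrative).
    import random

    random.seed(7)

    P_FORCED_YES = 0.25  # chance the randomizing device directs the respondent to say "yes"
    P_FORCED_NO = 0.15   # chance the device directs the respondent to say "no"
    P_TRUTH = 1 - P_FORCED_YES - P_FORCED_NO  # chance of a truthful answer

    TRUE_PREVALENCE = 0.20  # unknown in practice; used here only to simulate answers

    def respond(has_trait):
        """One answer; the interviewer cannot tell which branch produced it."""
        r = random.random()
        if r < P_FORCED_YES:
            return True
        if r < P_FORCED_YES + P_FORCED_NO:
            return False
        return has_trait  # truthful answer

    n = 20_000
    answers = [respond(random.random() < TRUE_PREVALENCE) for _ in range(n)]
    observed_yes = sum(answers) / n

    # E[observed_yes] = P_FORCED_YES + P_TRUTH * prevalence, so invert:
    estimate = (observed_yes - P_FORCED_YES) / P_TRUTH
    print(f"estimated prevalence: {estimate:.3f} (true value used in simulation: {TRUE_PREVALENCE})")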

Increasing validity of subjective questions

As discussed above, the validity of
subjective questions has a different meaning than the validity of objective
questions. There is no external criterion. One only can estimate the validity of
a subjective measure by the extent to which answers are associated in expected
ways with the answers to other questions or other characteristics of the
individual to which it should be related (see Turner & Martin, 1984, for an
extensive discussion of issues affecting the validity of subjective measures).

There basically are only three steps to the
improvement of validity of subjective measures:

Make the questions as reliable as
possible. Review the sections on the reliability of questions, dealing with
ambiguity of wording, standardized presentation, and vagueness in response form,
and do everything possible to get questions that will mean the same thing to all
respondents. To the extent that subjective measures are unreliable, their
validity will be reduced. A special issue is the reliability of ordinal scales,
which are the dominant measures of subjective states. The response alternatives
offered must be unidimensional (deal with only one issue) and monotonic
(presented in order, without inversion).

Problematic scales

5.17 How would you rate your job - very
rewarding, rewarding but stressful, not very rewarding but not stressful, or not
rewarding at all?

5.18 How would you rate your job - very
rewarding, somewhat rewarding, rewarding, or not rewarding at all?

Question 5.17 has two scaled properties -
rewardingness and stress - that need not be related, and not all the
combinations are played out. Question 5.17 should be made into two questions if
rewardingness and stress of jobs are both to be measured. In 5.18, some would
see "rewarding" as more positive than "somewhat rewarding" and be confused about
how the categories were ordered. Both of these problems are common and should be
avoided.

When putting people into ordered classes along a continuum, it
probably is better to have more categories than fewer. There is a limit,
however, in the precision of discrimination that respondents can exercise in
giving ordered ratings. When the number of categories exceeds the respondents'
ability to discriminate their feelings, numerous categories simply produce
unreliable "noise." However, the validity of a measure will be increased to
the extent that real variation among respondents is measured.

Ask multiple questions, with different question forms, that measure
the same subjective state; combine the answers into a scale. The answers to
all questions potentially are influenced both by the subjective state to be
measured and by specific features of the respondent or of the questions. Some
respondents avoid extreme categories; some tend to agree more than disagree;
others do just the opposite. Multiple questions help even out response
idiosyncrasies and improve the validity of the measurement process (Cronbach,
1951).
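
A common way to combine such items and to gauge how well they hang together is
Cronbach's alpha (Cronbach, 1951). The sketch below is a minimal illustration;
the three items and the five respondents' ratings are made up.

    # Combining several items measuring the same subjective state into a scale,
    # with Cronbach's alpha as an estimate of the scale's reliability (illustrative data).
    import numpy as np

    # rows = respondents, columns = items (e.g., three 1-5 job-satisfaction ratings)
    items = np.array([
        [4, 5, 4],
        [2, 1, 2],
        [3, 3, 4],
        [5, 5, 5],
        [1, 2, 1],
    ])

    k = items.shape[1]
    sum_of_item_variances = items.var(axis=0, ddof=1).sum()
    variance_of_totals = items.sum(axis=1).var(ddof=1)
    alpha = (k / (k - 1)) * (1 - sum_of_item_variances / variance_of_totals)

    scale_scores = items.mean(axis=1)  # each respondent's combined scale score
    print(f"Cronbach's alpha: {alpha:.2f}")
    print("scale scores:", scale_scores)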

The most important point to remember about
the meaning of subjective measures is their relativity. Distributions can be
compared only when the stimulus situation is the same. Small changes in wording,
changing the number of alternatives offered, and even changing the position of a
question in a questionnaire can make a major difference in how people answer.
(See Turner & Martin, 1984; Schuman & Presser, 1981; and Sudman & Bradburn, 1982
for numerous examples of factors that affect response distributions.) The
distribution of answers to a subjective question cannot be interpreted directly;
it only has meaning when differences between samples exposed to the same
questions are compared or when patterns of association among answers are
studied.

Error in perspective

A defining property of social surveys is that
answers to questions are used as measures. The extent to which those answers are
good measures is obviously a critical dimension of the quality of survey
estimates.

Questions can be poor measures because they
are unreliable (producing erratic results) or because they are biased, producing
estimates that consistently err in one direction from the true value (as when
drunk driving arrests are underreported).

We know quite a bit about how to make
questions reliable. The principles outlined in this chapter to increase
reliability are probably sound. Although other points might be added to the
list, creating unambiguous questions that provide consistent measures across
respondents is always a constructive step for good measurement.

The validity issue is more complex. In a
sense, each variable to be measured requires research to identify the best set
of questions to measure it and to produce estimates of how valid the resulting
measure is. Many of the suggestions to improve reporting in this chapter emerged
from a twenty-year program to evaluate and improve the measurement of
health-related variables (Cannell et al., 1977a, 1977b). There are many areas
in which a great deal more work on validation is needed.

A third issue is the credibility of a
question (or series) as a measure. It always is legitimate to ask researchers
for their evidence about how well a question (or series) measures what it is
supposed to. Too often, researchers make little effort to evaluate their
measures; they assume, and ask their readers to assume, that answers mean what
they "look like" they mean and measure what the researcher thinks they are
supposed to measure. To rely on so-called "face validity" of questions is not
acceptable practice.

Researchers should build explicit efforts to
assess the validity of their key measures into their analyses. As standard
practice, patterns of association related to validity can be calculated and
presented in an appendix.
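
As an illustration of what such an appendix computation might look like, the
sketch below correlates a subjective measure with two other variables it
should, in theory, relate to. Everything here is hypothetical - the variables,
the simulated data, and the strength of the associations - so it shows the form
of a validity check rather than any real result.

    # Illustrative construct-validity check: does a rating correlate, as expected,
    # with other characteristics it should be related to? (All data simulated.)
    import numpy as np

    rng = np.random.default_rng(0)
    n = 500

    # Hypothetical: an underlying feeling drives a 1-5 neighbourhood rating,
    # length of residence, and plans to move away.
    feeling = rng.normal(size=n)
    rating = np.clip(np.round(3 + feeling + rng.normal(scale=0.7, size=n)), 1, 5)
    years_resident = np.clip(5 + 2 * feeling + rng.normal(scale=3, size=n), 0, None)
    plans_to_move = ((feeling + rng.normal(scale=1.2, size=n)) < -0.5).astype(float)

    # If the rating is a valid measure, these associations should have the expected signs.
    print("r(rating, years resident):", round(np.corrcoef(rating, years_resident)[0, 1], 2))
    print("r(rating, plans to move): ", round(np.corrcoef(rating, plans_to_move)[0, 1], 2))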

Reducing measurement error through better
question design is one of the least costly ways to improve survey estimates. For
any survey, it is reasonable to attend to careful questionnaire design and
pretesting (which are discussed in Chapter 6) and to make use of the existing
research literature about how to measure what is to be measured. Also, building
a literature over time in which the validity of measures has been evaluated and
reported is much needed. Such evaluations are now the exception; they should
become routine.