The desire to revise a priority list based on
cost-effectiveness: The role of the prominence effect and
distorted utility judgments.

Abstract

Background. People sometimes object to the results of
cost-effectiveness analysis when the analysis produces a
ranking of options based on both cost and benefit. We suggest
two new reasons for these objections: the prominence effect, in
which people attend mostly to a more prominent attribute
(benefit as opposed to cost), and distortion of utility
judgments. Methods. We simulated the production of a
cost-effectiveness ranking list in three experiments using
questionnaires on the World Wide Web. Subjects rated the
utility of 16 health benefits using either rating scale or
person trade-off elicitation methods. In some experiments, we
asked subjects to rate the utility of the health benefits with
attention also to the cost of achieving the benefits. In all
experiments, at the end, we showed subjects a priority list
based on their own utility judgments and asked them whether
they wanted to move any of the health benefits up or down the
list. Results. In all experiments, subjects wanted to
give higher priority to treatments with high cost or high
benefit. They thus wanted to give less weight to cost and more
weight to benefit than the weight implied by their own prior
judgments. The desire for revision was also reduced when
subjects made their utility judgments after indicating whether
the utility was above or below the midpoint of the scale (a
manipulation previously found to reduce distortion).
Conclusion. The desire to change cost-effectiveness
rankings is in part a preference-reversal phenomenon that
occurs because, at the time they examine the ranking, people
attend to the benefit of health interventions rather than to
their cost.

A natural strategy for the use of cost-effectiveness analysis in
health care allocation is to rank treatments by
cost-effectiveness and then cover all the treatments at the top
of the ranking, going down as far as the budget allows. The
experience of the state of Oregon suggests, however, that such
rankings are intuitively unappealing. The state created a
priority list for determining which health care services its
Medicaid enrollees would receive. The list ranked the
cost-effectiveness of 709 condition/treatment pairs, such as
antibiotic treatment for pneumonia. However, the list was an
immediate failure. It was not even forwarded to the State
legislature by the commission that created it, on the grounds
that it did not capture people's health care priorities. For
example, it ranked surgical treatment for ectopic pregnancy and
for appendicitis at about the same part of the list as dental
caps for ``pulp or near pulp exposure.'' The Commission that
created the list found these rankings counterintuitive and,
instead, abandoned cost-effectiveness as the sole basis of its
priority list.

A number of experts have debated why Oregon's cost-effectiveness
list was such a failure. Some argued that Oregon's list was
plagued by technical flaws and measurement error. Hadorn argued
that Oregon's cost-effectiveness list failed, not because of
inadequate cost-effectiveness data, but instead because
cost-effectiveness itself fails to capture people's preferences.
Cost-effectiveness ignores the rule of rescue - people want to
give priority to treatment for life threatening illnesses in a
way that is not captured by the cost-effectiveness of those
treatments.1 When people are given a list based on
cost-effectiveness, they will want to give higher priority to
expensive life-saving treatments.

Others argue that the rule of rescue, or similar values, could be
captured in cost-effectiveness analysis, if utility measures were
altered to account for the societal value people place on various
treatment benefits, that is, their desired priority in public
policy, which may differ from their utility for patients.3,4
For example, a person may say that a condition is midway
between death and normal health, for a
typical individual patient. This judgment would imply a utility
of .5 on a scale on which death is 0 and normal health is 1. But
the same person might rather save the lives of 25 people than
cure 100 people of the condition in question. Choices that
involve numbers of people reflect, it is argued, a societal
perspective. From this perspective, the condition would have a
``utility'' of .75 or higher for the purpose of
cost-effectiveness analysis. This kind of judgment is the basis
of the person trade-off method of utility elicitation, which asks
respondents (for example) how many saved lives would be
equivalent to curing 100 people of the condition.
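To spell out the arithmetic of this example (under the
assumption, implicit above, that saving a life counts as a
benefit of 1 and curing one person counts as 1 - u, where u is
the condition's utility as a health state): indifference between
saving X lives and curing 100 people implies X = 100(1 - u), so
u = 1 - X/100. With X = 25, u = 1 - 25/100 = .75; strictly
preferring to save the 25 lives implies u greater than .75,
hence ``.75 or higher.''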

This paper addresses primarily a different possible explanation
of the conflict between intuitive rankings and cost-effectiveness
analysis: the prominence effect. When people make choices, they
pay attention to the most prominent attribute of the available
options, the attribute usually judged more ``important,''
whereas, in judgment or matching tasks, they pay attention to all
the relevant attributes.5-8 In one study, subjects were
asked to choose between two hypothetical highway safety programs.
Program X saved 100 lives and cost $55 million, and Program Y saved 30 lives
and cost $12 million. Subjects usually chose program X, despite
the cost. When similar subjects were asked to indicate how many
lives saved by program X would make the two programs equally
attractive, however, they usually gave answers greater than 100.
Such answers imply that program X would be less attractive than Y
with the figures presented in the original choice. Presumably,
subjects considered both cost and benefit when making the
matching judgment (in which they indicated what number would make
the two equivalent), but they considered mainly the benefit when
making the choice. This is because, in this case, the benefit is
more prominent.

The prominence effect could explain the desire to revise a
ranking based on cost-effectiveness analysis because the analysis
is based on judgments in which the respondent must supply a
numerical judgment and the revision of the list is more like
choice. When people provide numerical judgments of utility, they
could attend to costs as well as benefit, but when they look at a
priority list, their desire to revise it could be based largely
on the benefit, because benefit is the more prominent attribute.
Just as cost is less prominent than benefit, it may also be true
that number of patients helped is less prominent than the amount
of benefit per patient.

For example, suppose Treatment A costs $100 and yields an
average of 0.01 QALYs, thus costing $10,000 per QALY. And
suppose Treatment B costs $10,000 and yields 1 QALY, thus having
the same cost-effectiveness. Because of the different costs
of each treatment, 100 people can receive Treatment A for the
same cost as providing one person with Treatment B (in both cases
yielding 1 QALY). A cost-effectiveness ranking would show these
two treatments as being equally cost-effective. But, when
evaluating such a ranking, people might focus primarily on the
more prominent attribute, the amount of benefit brought by
providing one person with each treatment and, thus, they may want
to move Treatment B up higher on the list.
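A minimal sketch of this arithmetic in Python (the figures are
those of the example above; the code is purely illustrative):

    # Cost-effectiveness of the two hypothetical treatments, as cost per QALY.
    treatments = {
        "A": {"cost": 100.0, "qalys": 0.01},    # $100 yields 0.01 QALYs
        "B": {"cost": 10000.0, "qalys": 1.0},   # $10,000 yields 1 QALY
    }

    for name, t in treatments.items():
        print(name, t["cost"] / t["qalys"])     # both: 10000.0 dollars per QALY

    # A fixed budget buys the same total benefit either way:
    budget = 10000.0
    for name, t in treatments.items():
        n_treated = budget / t["cost"]          # 100 people get A; 1 person gets B
        print(name, n_treated * t["qalys"])     # 1.0 QALY in both cases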

In the studies we report here, we conducted utility elicitations
for a series of hypothetical treatment/condition pairs. We used
two methods for eliciting judgments of utility: visual analog
scale (because it is easy for subjects) and person trade-off
(because it is designed to reflect preferences about allocation).
Then we showed subjects a cost-effectiveness ranking based on
their own utility elicitations. In some cases, we provided the
cost of the treatment at the same time we elicited utilities; in
other cases we did not. In all experiments, we asked people to
look at the cost-effectiveness ranking and adjust any items that
they thought were misplaced. We hypothesize that people pay more
attention to the prominent attribute of benefit - and therefore
less attention to cost - when evaluating a cost-effectiveness
ranking than when responding to utility elicitations. They will
thus want to increase the priority of treatments that are
expensive or highly beneficial.

As a secondary issue (addressed in Experiment 2), we also test
the possibility that the initial judgments of the utility of
treating mild conditions are exaggerated because subjects misuse
the judgment scale. Previous research suggests that such
distortion occurs and can be corrected somewhat by asking
subjects to think about the relation of what they are judging to
the midpoint of the scale.9

Both our hypotheses are about psychological factors that affect
the evaluation of priority lists derived from utility judgments.
One factor, the prominence effect, affects the response to the
list, and the other factor involves distortion of the initial
utility judgments that are used to produce the list.

Experiment 1

In Experiment 1, each subject made two kinds of utility
judgments. In one, called ``without-cost,'' subjects used a
standard visual analog scale to rate the amount of benefit
brought by treating various health conditions. The two ends of
the rating scale were labeled ``no good at all'' and ``as good as
preventing death.''

In the other kind of judgment, called ``with-cost,'' subjects had
the opportunity to consider cost as well as benefits. The two
ends of the rating scale were ``no good at all'' and ``as good as
preventing death at no cost.'' This type of judgment is not
traditionally used in medical cost-utility analysis, although it is
used in holistic judgments of consumer goods, when the price of
the good is one attribute among many. As we show, our subjects
did attend to cost when making these holistic judgments.

Either method - without-cost judgments, or holistic with-cost
judgments - can lead to a cost-utility priority list. The
utility/cost ratio for the without-cost method is based on
assumed costs (which can then be shown in the ranked list). The
ratio for the with-cost method is determined directly from
subjects' judgments of utility, because they include cost as an
attribute when they make these judgments.
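Either way, constructing the list is a ratio-and-sort
computation, roughly as in the following Python sketch (the
pairs, utilities, and costs here are illustrative placeholders,
not our stimuli or data):

    # Rank condition-treatment pairs by utility per dollar.
    pairs = {
        "ANTIBIOTICS FOR PNEUMONIA": {"utility": 90.0, "cost": 30.0},
        "HEARING AID":               {"utility": 40.0, "cost": 1000.0},
        "LIVER TRANSPLANT":          {"utility": 70.0, "cost": 100000.0},
    }

    # Without-cost method: divide the judged benefit by the assumed cost.
    # With-cost method: the judgment already reflects cost, so the judged
    # value itself would serve as the ranking key.
    ranked = sorted(pairs,
                    key=lambda p: pairs[p]["utility"] / pairs[p]["cost"],
                    reverse=True)
    for rank, p in enumerate(ranked, 1):
        print(rank, p, "at a cost of $%.0f" % pairs[p]["cost"])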

We used three levels of cost. The middle level was intended to
be plausible. The low level was half of the middle, and the high
level was twice the middle. This manipulation allowed us to
determine whether subjects were following the instructions to
take cost into account. If they took cost into account, their
with-cost ratings would be higher for the low-cost items and
lower for the high-cost items.

After subjects rated the 16 pairs in these four versions
(without-cost, and with-cost using three levels of cost), we
presented them with a list of condition-treatment pairs ranked
according to the middle-cost with-cost judgments. The list
included the cost of each item. The subject could then indicate
which, if any, of these pairs should be moved higher in the list,
or lower. They did not have to indicate how many steps higher or
lower, but the answer to this question allows us to assess which
(if any) of the pairs seemed out of place to each subject. To
analyze these results, we can attempt to predict the subject's
desire to move an item from the item's cost. If high-cost items
are moved up, this supports the hypothesis that people tend to
ignore cost when looking at a ranking but not when making the
ratings that determined the ranking. Note that this analysis
requires computation of a correlation coefficient across the
condition-treatment pairs within each subject. The hypothesis
concerns the across-subjects mean of these within-subject
correlations.
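For concreteness, the analysis can be sketched in Python as
follows (the data are simulated and the use of numpy and scipy
is our illustrative choice; this is not the original analysis
code):

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(0)
    n_subjects, n_pairs = 70, 16
    log_cost = rng.normal(3.0, 1.0, n_pairs)  # log10 of the assumed costs
    # Revision codes per subject and pair: +1 move up, -1 move down, 0 leave.
    revision = rng.choice([-1, 0, 0, 0, 1], size=(n_subjects, n_pairs))

    within_subject_r = []
    for subj in revision:
        if subj.std() == 0:      # no revisions: no within-subject correlation
            continue
        within_subject_r.append(np.corrcoef(subj, log_cost)[0, 1])

    # Test whether the mean within-subject correlation differs from zero.
    t, p = stats.ttest_1samp(within_subject_r, 0.0)
    print(np.mean(within_subject_r), t, p)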

In sum, the final list is based on ratings of the items in the
list, complete with cost information. If subjects want to change
the rankings for systematic reasons, this is the simplest
possible demonstration of a reversal of preference between rating
and ranking.

Method

Seventy subjects completed a questionnaire on the World Wide Web.
They found out about the study from links in a variety of web
pages, including one advertising ``free stuff on the internet.''
They ranged in age from 13 to 50, with a median of 29; 71% were
female; and 36% were students. Subjects were paid $3, and they
had to provide an address and social-security number in order to
be paid. The questionnaire included a variety of checks to make
sure that responses were serious.

The questionnaire began as follows:

Health judgments

Health insurance can be provided by the government or by private
companies. Insurance covers different things. Almost all insurance
covers emergency-room care for heart attacks. Almost no insurance
covers in-vitro fertilization (artificial insemination) for
couples who cannot conceive a child. There are hundreds of such
treatments or preventive measures that could be covered or not.

Suppose that a commission were appointed to try to draw up a list
of priorities for insurance coverage. The idea would be that each
insurer would go down the list until it ran out of money.
Heart-attack treatment would be near the top of the list, so all
insurers would cover it. Insurers with a lot of money might cover
the top 500 treatments. Insurers with less might cover only 400.
Suppose there was a law about this. The law says that, if you
cover one treatment, you must cover everything above it on the
list, unless you get special permission.

In such a situation, insurers must decide how to spend their money
most wisely. Judgments of the importance of curing or preventing
various conditions will play a role.

In the items that follow, you will rate the value of treating or
preventing various conditions. There are two kinds of items. One
has a dark grayish background and provides information about cost.
When you respond to this item, take the cost of the treatment into
account. 48 items are of this type. The other type has a dark red
background, and it involves only the treatment. This is about the
benefit of the treatment alone, irrespective of cost. 16 items are
of this type, for a total of 64.

In each item, you provide a numerical rating, but you do this by
clicking on arrows or on a scale. At the end, you will have a
chance to examine and change the priority list that resulted from
your responses.

Then the subjects saw 64 screens in a random order chosen
differently for each subject. The 64 screens required judgments
of 16 condition-treatment pairs (henceforth simply ``pairs'')
under four versions of cost: without-cost, low, medium, and high.
We assigned costs to treatments in an effort to be plausible to
the subjects rather than accurate. Here are the pairs with their
costs in the medium-cost version:

READING GLASSES TO RESTORE ABILITY TO READ: $100
CATARACT SURGERY TO RESTORE NORMAL VISION: $4000
ANTIBIOTICS FOR PNEUMONIA: $30
ANTIDEPRESSANT DRUG FOR DEPRESSION (1 YR.): $2000
EMERGENCY TREATMENT FOR HEART ATTACK: $1000
SURGERY FOR APPENDICITIS: $10000
MEDICATION FOR HIGH BLOOD PRESSURE (10 YRS. AVG.): $10000
INSULIN FOR DIABETES (20 YRS. AVG.): $5000
REMOVAL OF WARTS FROM HANDS: $300
CAPPING OF BROKEN TOOTH: $500
LIVER TRANSPLANT FOR ALCOHOL INDUCED CIRRHOSIS: $100000
REMOVAL OF PRE-CANCEROUS SKIN MOLE: $100
ANTIBIOTICS FOR STREP THROAT: $30
VACCINATION AGAINST CHICKEN POX: $20
BANDAGE FOR SPRAINED ANKLE: $50
HEARING AID TO RESTORE NORMAL HEARING: $1000

The cost in the low-cost version was half of that given here, and
the cost in the high-cost version was double. On each with-cost
trial, the subject saw a screen with a heading like the following
(from the high-cost version):

How much good does this do?:

REMOVAL OF WARTS FROM HANDS at a cost of $600 per case (which
means 1667 cases for $1,000,000).

Below the heading on the left was a scale ranging from
0 to 100, in units of 5, as well as two arrows pointing up,
labeled +5 and +1, respectively, and two arrows pointing
down, labeled -1 and -5, respectively. The scale was
labeled ``As much as preventing death at no cost'' at the top and
``No good at all'' at the bottom. It was white above the utility
value in effect, and red below it. To the right of the scale was
a summary of the judgment so far:

REMOVAL OF WARTS FROM HANDS at a cost of $600 per case (which
means 1667 cases for $1,000,000).
is ___% as good as preventing death at no cost.

The blank was initially filled in with 0, but its value
changed as the subject manipulated the scale or the arrows. The
visual scale was marked in units of 5, so finer adjustments (from
using the +1 and -1 arrows) were reflected only in the blank
space.

In the without-cost version, the information about cost and
number of people helped for $1,000,000 was omitted, and the top
of the scale was labeled ``As much as preventing death.'' The
without-cost version was presented in a different color, as noted
in the instructions.

At the end, subjects were told, ``Here is a list of treatments,
ranked according to your responses. Please check it to see
whether you want to change anything. The treatments higher in the
list would get priority. (For example, A would get priority over
B.)'' They were instructed to type the letters of conditions
ranked too low on the list, and, separately, those ranked too
high. The list included cost information by adding ``at a cost of
...'' to the end of each description. The list was based on the
responses to the middle-cost items.

Results

We used the logarithm (base 10) of cost in all calculations
because the distribution of cost was highly skewed.

To determine whether the with-cost ratings were sensitive to cost
across the 16 condition-treatment pairs, we regressed, for each
subject, the with-cost utilities for each cost level against the
without-cost utilities and cost, across the 16
condition-treatment pairs. The mean within-subject
(unstandardized) regression weights for cost were -2.4, -2.6,
and -2.3, for the three cost levels (low, medium, high),
respectively. That is, multiplying the cost by 10 reduced the
utility by about 2.5 points on the 100 point scale. The mean of
the three coefficients was significantly different from zero
(t(66) = 4.04, p = .0001). Cost level (low-medium-high) also
affected utility judgments overall. The mean utility ratings for
the low, medium, and high cost levels were, respectively, 46.4,
44.7, and 43.2 (t(69) = 4.12, p = .0001, for the declining linear
trend across the three levels). Thus, subjects took cost into
account in the with-cost condition.
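A Python sketch of the per-subject regression, on simulated data
(numpy's least squares stands in for whatever statistics package
one might actually use):

    import numpy as np

    rng = np.random.default_rng(1)
    n_pairs = 16
    without_cost = rng.uniform(0, 100, n_pairs)  # benefit-only ratings
    log_cost = rng.normal(3.0, 1.0, n_pairs)     # log10(cost) per pair
    with_cost = without_cost - 2.5 * log_cost + rng.normal(0, 5, n_pairs)

    # Regress with-cost utilities on without-cost utilities and log cost.
    X = np.column_stack([np.ones(n_pairs), without_cost, log_cost])
    coefs, *_ = np.linalg.lstsq(X, with_cost, rcond=None)
    print(coefs[2])  # unstandardized weight for log cost, near -2.5 here

    # In the experiment, this regression is run separately for each subject
    # and cost level; the cost weights are then averaged across subjects
    # and tested against zero.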

Subjects had a chance to revise, at the end, the ranking of
condition-treatment pairs based on the standard costs. We asked
if they wanted to move a pair up or down on the priority list.
We coded their response as 1 if they wanted to move a pair up, as
-1 if they wanted to move a pair down, and as 0 if they did not
want to move a pair. Twenty-three subjects did not revise any ranking.
Our hypotheses all concern within-subject correlates, across the
16 pairs, of wanting to move a pair up or down in the ranking, so
these 23 subjects did not contribute to the relevant
correlations. They had no variance in one of the variables being
correlated.

We first computed, for each subject who made revisions, the
correlation between the revision (1 for up, -1 for down, 0 for
no revision) and the without-cost rating, across the 16
condition-treatment pairs. The mean of these correlation
coefficients (one for each subject) was .10, which was
significantly positive (t(44) = 2.42 across the 45 subjects who
had nonzero variance in both variables, p = .0195). In other
words, subjects wanted to revise the rankings based on their
with-cost ratings in the direction of those that would be based
on their without-cost ratings, the ratings of benefit that
ignored cost.

The revision was also correlated with cost across the 16
condition-treatment pairs. The mean within-subject correlation,
across the 47 subjects, was .11 (t(46) = 2.32, p = .0251). In
other words, pairs that were moved up tended to be those with
high cost. Presumably, this happened because subjects took cost
into account in their ratings (as we noted), which were then used
to make the ranking. High-cost items were thus ranked lower than
they would have been ranked on the basis of their benefit alone.
When subjects examined the final ranking, they did not attend to
the cost as much as they had done in their initial ratings -
even though the cost information was provided in the list - so
they wanted to give the high-cost items higher priority.

More generally, of the adjacent items in the ranking that each
subject wanted to switch (by moving the higher item down, the
lower item up, or both), a mean of 27% involved moving up an item
with both higher without-cost utility and higher cost, vs. 12%
involving an item that was both lower in cost and lower in
without-cost utility.

In sum, subjects wanted to move high-benefit and high-cost
treatments higher than the position implied by the subjects' own
ratings that took cost into account. Their response to the final
ranking was less influenced by cost considerations, and more
influenced by benefit alone, than were the utility ratings they
made for the same items.

Experiment 2

Experiment 2 examined a second reason for the desire to revise
rankings. The utility measure itself may be invalid because
subjects may use the scale in a way that does not correspond to
their own representations of what they are judging. In
particular, the usual methods of utility assessment may overstate
the disutility of mild or moderate health conditions, or,
equivalently, they may overstate the benefit of curing or
preventing such conditions.9 For example, Ubel et al. asked
subjects, ``You have a ganglion cyst on one hand. This cyst is a
tiny bulge on top of one of the tendons in your hand. It does
not disturb the function of your hand. You are able to do
everything you could normally do, including activities that
require strength or agility of the hand. However, occasionally
you are aware of the bump on your hand, about the size of a pea.
And once every month or so the cyst causes mild pain, which can
be eliminated by taking an aspirin. On a scale from 0 to 100,
where 0 is as bad as death and 100 is perfect health, how would
you rate this condition?''4 The mean answer was 92; that is, the
cyst was judged about 1/12 as bad as death. This disutility seems
too high: the rating is too far from 100. If the seriousness of minor conditions is
overrated because of a distortion in the response scale, then
these conditions will get higher priority in the final ranking
than they deserve. Subjects would then want to move them down.

Baron et al. found that rating scale judgments yield smaller
disutilities for health conditions and are more consistent with
each other when subjects are first asked whether a condition is
closer to one end of the scale or the other (normal health or
death).9 For example, a subject who rates blindness as 40 on a
scale from 0 (death) to 100 (normal) may revise this number
upward after judging that blindness is closer to normal health
than to death. These results suggest that, indeed, conditions
tend to be rated as closer to death than they ought to be, and
that this bias can be corrected by instructions to consider the
midpoint of the scale. When the bias is corrected, internal
inconsistencies among ratings are reduced. Baron et al. suggest
that this effect results from a general tendency to use normal
health as a reference point and exaggerate the differences near
the reference point.

We use such ``interval instructions'' here as a way to ``debias''
utility judgments. On half of the trials, before the subjects
rated a condition-treatment pair relative to preventing death, we
asked them whether the pair did more or less than half as much
good as preventing death. This question may have
made subjects think more carefully about the nature of the scale
as a measure of differences (even if they would have given the
same answer if they had given their ratings first and based the
answer on their ratings).

Method

Sixty-six subjects completed a questionnaire on the Web, as in
Experiment 1. Ages ranged from 13 to 65 (median 27); 73% were
female; and 44% were students.

The method was the same as Experiment 1, with the following
changes. We used only the middle cost, not high and low. We
used four versions, 16 trials each, with the order of all 64
trials randomized separately for each subject. Two of the
versions provided cost information and two did not. One of the
with-cost versions and one of the without-cost versions provided
interval instructions. We did not use the visual scale (because
we thought it would detract from the salience of the interval
instructions). The text for each trial with the interval
instructions read as follows (for example):

The treatment here is to provide INSULIN FOR DIABETES (20 YRS.
AVG.) at a cost of $5000 per case (which means 200 cases for
$1,000,000).

First, think about whether this does more or less than half as
much good as preventing death at no cost.

If it does more than half as much good, type 'm'.
If it does less than half, type 'l'.
If it does exactly half as much good, type 'h'.

On a scale where

0 means 'no good at all' and
100 means 'as much good as preventing death at no cost',
how much good does it do to provide
INSULIN FOR DIABETES (20 YRS. AVG.)
at a cost of $5000 per case
(which means 200 cases for $1,000,000)?

The version without the interval instructions began with ``On a
scale ....'' The version without cost omitted the cost
information. The introduction was changed to reflect the changes
just described.

Results

Utility ratings (that is, ratings of benefit from treating the
condition) were lower when cost information was provided
(F(1,65) = 8.67, p = .0045), and they were lower when interval
instructions were provided (F(1,65) = 7.13, p = .0095). The
second result shows that the interval instructions were effective
in inducing subjects to provide lower ratings of benefit, as we
hypothesized.

Of interest were the correlates of the final revision of the
ranks. Sixteen subjects did not revise their ranking. The ranking
shown to subjects was based on the cost-no-instructions version.
We asked about the correlation between desired revisions in
ranking and the ratings in the other three versions, as well as
cost. The mean within-subject correlations, across the 50
remaining subjects, were .13 for the correlation between revision
and ratings in without-cost-no-instructions (t(49) = 4.41,
p = .0001), .11 for without-cost-instructions (t(49) = 3.61,
p = .0004), .12 for cost-instructions (t(49) = 3.99, p = .0002),
and .09 for cost itself (t(49) = 2.24, p = .0298).

There are three results here. First, as we found in Experiment
1, subjects wanted to revise in the direction of giving priority
to treatments with greater benefit, regardless of cost (the
without-cost correlations). Second, as found in Experiment 1,
they wanted to revise in the direction of higher cost (the
correlation with cost). Third, they wanted to revise in the
direction of the instructed ratings (cost-instructions).

The third finding suggests that one of the reasons that people
want to change rankings based on the analog scale is that the
scale provided invalid utility ratings. That is, the ratings do
not represent people's most reflective judgments. This
invalidity can be reduced by asking subjects to reflect on the
relation of the condition they are rating to the midpoint of the
utility scale. When they do this, their ratings are more
consistent with their final ranking.

Experiment 3

In Experiments 1 and 2, we based the final ranking on judgments
when the subjects were provided cost information. In essence, we
asked them to evaluate each condition-treatment pair in terms of
its benefit-cost ratio, and we used this to determine the
priority list. In Experiment 3, we asked only about benefit, and
we constructed the priority list by computing the benefit-cost
ratio for the subjects. For this computation, we used the same
costs as in Experiment 2. We showed subjects these costs and
their implications for the number of patients who could be treated
for $1,000,000, as part of the priority list, as in Experiments 1
and 2.

We used two methods to elicit utilities for benefits. One was
the simple direct-rating method used in Experiment 2 (without
cost information and without interval instructions). The other
was the person-tradeoff (PTO). We asked subjects how many lives
saved were just as attractive as 100 people getting each of the
treatments in the list. For example, if the subject says that
saving 25 lives is just as attractive as giving some treatment to
100 people, and if we interpret this answer as a measure of
relative benefit, then the treatment would be 1/4 as beneficial
as saving a life.
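Under this interpretation, converting a PTO response to a utility
is a single division, as in this illustrative Python sketch:

    def pto_utility(lives_saved_equivalent, n_treated=100):
        """Benefit of one treatment relative to saving a life (utility 1.0)."""
        return lives_saved_equivalent / n_treated

    print(pto_utility(25))  # 0.25: 1/4 as beneficial as saving a life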

We focused primarily on the PTO, because some have suggested that
judgments of benefits for the purpose of allocation might differ
from judgments of benefits for other purposes (such as individual
decision making).3,4 In addition, the PTO explicitly asks
people to think about the relative number of patients who would
have to receive a treatment to bring a specific amount of
benefit. If these arguments are correct, PTO judgments should
yield an acceptable final ranking, one that people would not want
to revise in any systematic way. The assumption here is not that
the PTO automatically takes cost into account. We did not
provide cost information. Rather, the argument assumes that PTO
ought to provide the correct measure of benefit for allocation
based on cost as well as benefit.

Method

Seventy-four subjects completed a questionnaire on the Web, as in
Experiments 1 and 2. Ages ranged from 13 to 65 (median 30, or 28
after omissions described below); 75% were female (70% after
omissions); and 40% were students (47% after omissions).

The PTO item read: ``How many people saved from death is just as
attractive as providing 100 people with [the pair].'' The
introduction to the experiment explained the PTO as follows:

In another type of item, you are to imagine that you have a choice
between saving some number of lives, call it X, and giving 100
people the treatment in question. The question is, at what value
of X would you find your two options equally attractive. For
example:

How many people saved from death is just as attractive as
providing 100 people with chemotherapy for breast cancer?

Again, imagine that the people treated and the people saved from
death are similar, and consider the average person who would get
the treatment. In this case, you should not give a number greater
than 100. Providing chemotherapy does a lot of good, but not as
much good as saving a life, on the average. So your answer would
be lower than 100. When the treatment in question does less good,
then your answer should be lower.

Subjects then did a practice item in which they clicked on buttons
that adjusted the answer higher or lower until they were
satisfied.

The first 32 screens of the experiment mixed the 16 rating items
(without cost information) and the 16 PTO items in a different
random order for each subject. Then the subject saw two screens
with rankings, like those in Experiments 1 and 2. The first
screen was always based on the PTO judgments. The second was
based on the ratings. In each case the priority rankings were
determined by dividing the utility (rating or PTO) by the cost.

Results

Subjects found the PTO task difficult, some by their own
admission. To determine whether they interpreted the task
correctly, we formed an index of sensitivity to severity for each
measure, ratings and PTO. We formed the index by subtracting the
responses for the three pairs with the lowest utility on both
measures on the average (warts, sprained ankle, hearing aid) from
the three rated highest (antibiotics for pneumonia, emergency
treatment for heart attacks, insulin for diabetes). We examined
a scatterplot of these two sensitivity indices (one for ratings,
one for PTO) and found one group of subjects in which they were
highly correlated and close to one another and another in which
the index for PTO was much lower than that for ratings. These
subjects presumably were the ones who did not use the PTO in a
way that reflected their views. We eliminated subjects who
differed by .25 (on a scale of 0-1) in this direction. This
left 54 subjects out of the original 73. The subjects who were
eliminated in this way were grouped together in the scatterplot,
with a space between them and the subjects who were retained. We
did this before looking at other results. Of course, we also
examined the results with all subjects included, but we wanted to
make sure that any positive results could not arise from the
inclusion of subjects who had serious difficulty with the PTO.
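The exclusion rule amounts to a difference-and-threshold
computation, sketched here in Python with simulated responses
(all names and numbers are illustrative):

    import numpy as np

    rng = np.random.default_rng(2)
    n_subjects = 74
    # Mean responses (rescaled to 0-1) for the three pairs rated highest
    # and the three rated lowest, separately for each measure.
    high_rating = rng.uniform(0.5, 1.0, n_subjects)
    low_rating  = rng.uniform(0.0, 0.5, n_subjects)
    high_pto    = rng.uniform(0.5, 1.0, n_subjects)
    low_pto     = rng.uniform(0.0, 0.5, n_subjects)

    # Sensitivity to severity: high-utility pairs minus low-utility pairs.
    sens_rating = high_rating - low_rating
    sens_pto = high_pto - low_pto

    # Drop subjects whose PTO sensitivity falls .25 or more below their
    # rating sensitivity (those who seemed not to use the PTO as intended).
    keep = (sens_rating - sens_pto) < 0.25
    print(int(keep.sum()), "of", n_subjects, "subjects retained")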

Subjects wanted to move the more costly treatments higher in
the priority list: revision upward was correlated with cost. The
mean within-subject correlation for the list based on the PTO was
.43 (t(48) = 9.87, p < .0001), and the mean correlation for the
list based on ratings was .40 (t(48) = 8.78, p < .0001). The two
correlations did not differ significantly. (When all subjects
were included, the two correlations were, respectively, .46 and
.45. Again, both correlations were significantly positive and
did not differ significantly.)2

Perhaps the cost information was not salient enough when the
ranking was presented to the subjects. It is possible that
subjects would pay attention to cost, and to the number of people
who could be treated for a fixed cost, if this information were
more salient in the ranking. (Of course, the tendency to ignore
cost when evaluating a priority list would be exacerbated if the
cost information were missing from the listed items.) To determine
whether subjects would still want to move high-cost items upward
under these conditions, we reran Experiment 3 with one change.
In the final presentation of the list, the information about
number of people was presented first. Thus, each line in the
list was of the form, ``1000 cases of HEARING AID TO RESTORE
NORMAL HEARING (cost $1000).'' The questionnaire was completed
by 70 subjects (ages 15-68, median 28; 61% female, 41%
students).

Once again, we eliminated subjects with more than a .25
difference between the indices for PTO and ratings, and also
subjects with less than .25 on either measure (suggesting little
attention to the difference between serious and mild conditions),
leaving 49. The mean within-subject correlations between cost
and upward revision were .47 for PTO (t41 = 10.32, p = .0000)
and .44 for ratings (t40 = 10.99, p = .0000.) These two means
did not differ significantly. (When all subjects were included
the two mean correlations were, respectively .42 and .45 -
again, these did not differ significantly and were significantly
positive.) These means were no lower than in the original
version of the experiment, which suggests that the manipulation
of salience had no effect. Only three subjects wanted to move
high-cost items down (or low-cost items up) for each measure.

In sum, when subjects see a priority list based on
cost-effectiveness of condition-treatment pairs, even when the
information about cost and the number of patients who can be
helped is salient, they want to ignore cost and move high-cost
items up. The use of the PTO does not reduce this effect.

Discussion

We found evidence for two factors that affect evaluation of
priority lists derived from cost-effectiveness analysis. First,
when people evaluate a priority list, they attend less to cost
and more to benefit than when they are asked to make explicit
tradeoffs of cost and benefit. This is, we believe, an instance
of the prominence effect. We found this consistently in several
different cases: when the priority list was based on ratings that
took cost into account, when it was based on ratings of benefit
only (with cost taken into account afterward, in devising the
list), and when person tradeoff (PTO) was used to elicit utility
judgments of benefit. The last result is of interest because PTO
judgments have been suggested as closer to the actual allocation
decision, hence capturing a type of value that is relevant to
allocation decisions in particular.3,4

Second, numerical utility judgments are often too high,
especially for small effects. This is an artifact of the way in
which people assign numbers to internal representations.10
When cost-effectiveness analysis is based on such judgments, it
favors allocation of too many resources to options with minor
benefits. We found that this bias exists and that it is reduced
if subjects are asked to consider the midpoint of the scale as
well as its two ends. Such a debiasing method may be useful in
practice. Simple rating scales are often the easiest methods to
use, but they are suspect because of just this sort of problem.
Interval instructions could help overcome the problem (and make
the scales more theoretically justifiable as well).11

Of course, our results do not address the existence of other
possible factors that cause conflict between priority lists and
the judgments that led to them. We did not examine judgment of
cost at all. It is possible that the public has a broader
conception of cost than the one typically used in
cost-effectiveness analyses, a conception that includes such
things as effects on the family. It is also possible that priority lists
conflict with intuitions about fairness.10

It is also possible that the rule of rescue is an additional
factor, beyond the prominence effect. Note first that the
prominence effect could explain results that seem to imply a rule
of rescue, if those results show that the rule is applied in
choices rather than matching judgments. Such an explanation
could be based on the assumption that benefit itself consists of
two dimensions: the seriousness of the condition before
treatment; and the benefit of the treatment (or the final
condition after treatment). Of these two dimensions of benefit,
seriousness might be more prominent. But that is a different
kind of prominence effect than the one we have examined.
Demonstration of such an effect would require comparison of
matching judgments (or utility ratings) and choice judgments (or
judgments about the revision of a priority list).

A major limitation of our results is that our lists were
hypothetical, not real, and the subjects were not necessarily
representative of activist citizens who complain about priority
lists. Thus, our experiments address questions about the
psychology of evaluating priority lists, but not about the
political and institutional factors that affect the evaluation of
such lists in the real world. Any conclusions about the real
world are based on the assumption that properties of human
judgment studied in the laboratory are active outside the
laboratory enough to affect policy outcomes. Although this
assumption has been defended,12,13 it cannot be taken for
granted.

If our results are relevant at all outside the laboratory, their
main implication is that people should be wary of attempts to
tinker with a priority list on the basis of judgments about the
list itself, even if the list contains information about costs and their
implications. When questions about a list are raised, they
should be resolved by re-doing the procedure that generated it,
perhaps more carefully or with additional checks (such as the
interval instructions used in Experiment 2), rather than by using
intuition to change the ranks.

Footnotes:

1This research was supported by a
pilot-project grant from the Penn Cancer Center and by
National Science Foundation grant SES9876469. Dr. Ubel is a
Robert Wood Johnson Foundation Generalist Physician Faculty
Scholar, recipient of a career development award in health
services research from the Department of Veterans Affairs,
and recipient of a Presidential Early Career Award for
Scientists and Engineers. This research was also supported
by the National Cancer Institute (R01-CA78052-01). We thank
Andrea Gurmankin and the reviewers for comments. Send
correspondence to Jonathan Baron, Department of Psychology,
University of Pennsylvania, 3815 Walnut St., Philadelphia, PA
19104-6196, or (e-mail)
baron@psych.upenn.edu.

2These correlations were
considerably higher than those in Experiment 1, for example.
The difference may be related to the fact that subjects made
many more revisions in the list in Experiment 3. The mean
number of revisions was 6.4 in Experiment 3 and 2.4 in
Experiment 1 (t(140) = 7.82, p < .0001). The correlation is
necessarily low when the number of revisions is small. This
difference in number of revisions is itself something we cannot
explain.