Cognitive Bias in On-Call: the evidence

As later revealed, this was somewhat of a white lie: it was a survey, and it did indeed have time-limited questions, which meant it could plausibly take only a few minutes, but there was a lot more going on than any single test-taker could discern.

In fact, it was a cognitive bias detector -- a survey designed to gather evidence to confirm or reject the hypothesis that cognitive bias of various kinds might operate in on-call related situations, or in SRE thinking generally. In the construction of this survey, I was strongly inspired by Kahneman & Tversky's work about cognitive bias in general, as related in the wonderful Thinking, Fast and Slow, which I strongly recommend everyone to read.

Specifically, the survey looked at participants' attitudes to risk, both within the SRE service-management context and outside of it, together with questions on anchoring and availability. Leaving aside the psychological terms for the moment, the portion of the survey I found the most engaging was the question I designed to try to provoke anchoring effects -- this involved creating a question with a time limit and using the value of the time limit itself to see if that could provoke participants into reading a graph incorrectly. (I chose graph-reading because in many ways, that's what our profession does for a living!)

I apologise for using people's time under somewhat of a false pretence, but the time was given voluntarily, was for the most part quite short, and the results are, in my opinion, quite interesting.

Results

Around 200 people took the survey, with the following distribution of roles:

Around 150 of the participants declared what amounted to an average of 7.6 years of experience, with a standard deviation of 2.96, so we note that we seem not to be looking at a population of junior engineers.

Baseline

The next set of questions established a baseline for the majority of participants: in a context which had nothing to do with SRE, how would their choices compare to the population as a whole?

"You have a chance to win a prize by choosing to draw a marble from one of two urns. Prize-winning marbles are coloured red. Which urn do you choose: Urn A containing 10 marbles, 1 of which is coloured red, or Urn B, containing 100 marbles, of which 8 are coloured red."

This is just a baseline statistics question, to see if the "background rate", if you like, of statistics knowledge is what you'd expect. (The bias it probes is called denominator neglect.) It is not a particularly sophisticated question, of course: 10 marbles, of which 1 is coloured red, is just like 100 marbles, 10 of which are coloured red, which is a 10% chance of success. 100 marbles, 8 of which are coloured red, is just an 8% chance, so urn A is strictly better.

From a baseline comparison point of view, 30-40% of students taking this test are recorded as getting this wrong, whereas only 8.26% of us did, so there is some evidence that our understanding is more sophisticated. Things get more interesting from here on in.

Prospect Theory I: Gain

There are a bunch of ways that psychological kinks in our assessment of risk might reflect themselves. One theory that attempts to explain them is prospect theory.

Q3: Do you prefer a 61% chance to win $520,000 or a 63% chance to win $500,000? (This would be a one-off event, not a repeated sequence.)

From a pure econo-rational point of view -- i.e. simple maximisation of returns -- rational utility theory states that we should calculate the projected yield by multiplying the probability of the event by the gain of the event -- so 0.61 * $520,000 = $317,200, whereas 0.63 * $500,000 = $315,000. Therefore picking the first, although it's less likely to result in a payout, will maximise your expected outcome.
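As a quick check of that arithmetic, here is a minimal Python sketch. The function name `expected_value` is purely illustrative; it simply multiplies probability by payout, as described above.

```python
def expected_value(probability, amount):
    """Expected value of a one-off gamble: probability times payout."""
    return probability * amount


# The two options from Q3.
option_61 = expected_value(0.61, 520_000)
option_63 = expected_value(0.63, 500_000)

print(f"61% chance of $520,000: ${option_61:,.0f}")
print(f"63% chance of $500,000: ${option_63:,.0f}")
print("Higher expected value:", "61% option" if option_61 > option_63 else "63% option")
```

The gap between the two expected values is only $2,200, which is tiny next to either prize.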

However, if I understand prospect theory correctly, it turns out that humans don't think like this. Instead, how you frame the situation changes what you do, and in particular, expected gains matter, and when you're getting a life-changing amount of money, the marginal utility of $520k versus $500k is essentially zero. Therefore we maximise according to the largest probability of getting any large sum, and so we pick the highest chance of getting something. That is indeed what we see here.

Another way to put this is that we are risk averse; we want the largest chance of getting something, no matter what it is. I had an intuition that SRE as a profession might well be risk averse, so I wasn't surprised to see this outcome. (This is risk averse with respect to a gain, of course -- we see what happens with a loss later.) I was, however, surprised at the response to the next question:

"Do you prefer a 98% chance to win $520,000 or a 100% chance to win $500,000? (This would be a one-off event, not a repeated sequence.)"

Again, the econo-rational thing to do is to multiply the probability by the gain: expected gain in the 98% case is $509.6k, expected gain in the 100% case is $500k ... but we overweight the 2% possibility of gaining nothing at all, and become risk averse.

The most interesting thing here, however, is the size of the disparity in choices: in the other question, the ratio is about 93/25 -- here, it's 113/10 -- 8% versus 26%. The disparity is presumably greater here because the psychological disparity between 61-63 and 98-100 is also greater, even though the numerical difference is precisely the same.
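One way to make the "psychological disparity" concrete is the probability weighting function from Tversky & Kahneman's 1992 cumulative prospect theory paper. The sketch below is an illustration only: the curvature value gamma = 0.61 is their published estimate for gains from their own experiments, not something fitted to this survey, and the function name `weight` is hypothetical.

```python
def weight(p, gamma=0.61):
    """Tversky-Kahneman (1992) probability weighting function.

    Maps a stated probability p to the "decision weight" people appear
    to act on. gamma=0.61 is their estimate for gains (an assumption
    here, not derived from the survey data).
    """
    numerator = p ** gamma
    return numerator / (numerator + (1 - p) ** gamma) ** (1 / gamma)


# The same 2-point step in stated probability, in two different places.
gap_middle = weight(0.63) - weight(0.61)
gap_certain = weight(1.00) - weight(0.98)

print(f"Perceived gap, 61% -> 63%:  {gap_middle:.3f}")
print(f"Perceived gap, 98% -> 100%: {gap_certain:.3f}")
```

Under these assumed parameters, the step from 98% to certainty carries a much larger decision weight than the step from 61% to 63%, which is at least consistent with the stronger preference for the sure thing.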

Prospect Theory II: Loss

As promised above, we now move on to the highly relevant question of loss rather than gain. My intuition here is that SREs' reactions might be distinct from the general population's reactions because the domain we work in cares deeply about loss but not so much about gain. Of course, if an SRE is working with an e-commerce facility of some kind, it could be argued that more purchase of stuff (due to higher availability) would allow for more gain, but I think that this is not a first-order way of thinking in our community.

This question used approximately the same dollar numbers and exactly the same probability numbers in order to make the answers as comparable as possible:

"You run a very popular web service with a large numbers of users and paying customers. If there was an outage, would you prefer a 61% chance to lose $500,000 or a 63% chance to lose $480,000?"

This is much more evenly distributed -- about 70/30 -- but again the econo-rational calculation is that 0.61 * $500k is an expected loss of $305k, versus 0.63 * $480k, which is an expected loss of $302.4k; therefore the 63% choice is better. Yet we don't pick it: instead, we pick the one which has the fractionally higher chance of losing nothing -- but which loses more on average! Our behaviour has actually flipped around here: we have no choice but to lose something in this scenario, but we are risk seeking because we take the option which actually loses us more in the steady-state case. Fascinating.

The contrast is actually illustrated even more starkly in the next question:

"You run a very popular web service with a large numbers of users and paying customers. If there was an outage, would you prefer a 98% chance of losing $20k, or a 100% chance to lose $18k?"

As before, this is effectively the same question, except we are moving to the edge of the probability distribution where it is known that humans overweight probabilities, as indeed we do.

This is another increment more evenly distributed than before -- approximately 66/33 -- but the econo-rational thing is not necessarily what you'd expect intuitively. It seems as if the first option would give you a larger chance of getting off scot-free, but when you multiply it out, 0.98 * $20k is an expected loss of $19.6k, versus a sure loss of $18k. In this case, the sure loss of a smaller amount is better, so picking the second would be best. The fact that we don't means that we are gambling on a 2% chance to lose nothing: this is risk-seeking behaviour in the hope of avoiding loss.
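The expected-loss arithmetic for both loss questions can be laid out side by side. This is only a sketch of the multiplication described above; the dictionary layout and names are illustrative.

```python
# (probability of the loss, size of the loss in dollars) for each option.
loss_questions = {
    "61/63 question": {
        "61% chance to lose $500,000": (0.61, 500_000),
        "63% chance to lose $480,000": (0.63, 480_000),
    },
    "98/100 question": {
        "98% chance to lose $20,000": (0.98, 20_000),
        "100% chance to lose $18,000": (1.00, 18_000),
    },
}

expected_losses = {}
for question, options in loss_questions.items():
    print(question)
    for label, (probability, loss) in options.items():
        expected_losses[label] = probability * loss
        print(f"  {label}: expected loss ${expected_losses[label]:,.0f}")
    # The econo-rational pick is the option with the smallest expected loss.
    best = min(options, key=lambda label: expected_losses[label])
    print(f"  smallest expected loss: {best}")
```

In both questions the option with the higher stated probability of loss has the smaller expected loss, which is what makes the majority choice risk seeking.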

Anchoring I

Anchoring is a really powerful and sometimes subtle, sometimes obvious effect. It turns out that it's possible to "anchor" the mind on a particular point for a particular context, making it mentally hard to break away from that point -- for example, just mentioning a price before haggling is enough to change the range of prices that the negotiator will tend to stay within. (It works across many contexts; not just for numbers, but it is particularly pernicious for numbers and estimation tasks.)

For this section, I was wrestling with how to potentially demonstrate this effect in on-call contexts. I settled upon the idea of showing a graph to the test-takers and asking them when it would hit some critical threshold -- in this case, 1.5 -- but I decided the anchoring would be more effective if it was a little out-of-band. Here is the graph in question:

With the luxury of time, it's pretty clear that the most plausible answer as to when the metric would hit ~1.5 again (08:30) is based on the following observation: since 05:30, the metric rises for approximately an hour, and then drops shortly after the half hour -- i.e. 05:30 → 06:30, 06:30 → 07:30, 07:30 → ... 08:30. It's the kind of thing that's easy to see after the fact. But in the kind of stressful situations that on-callers operate within, it's not necessarily that easy to see what's going on.

For that reason, in order to reproduce this effect, the test takers were divided into two cohorts who were shown different initial texts. The first text was this:

"You are in charge of a very important logs processing system used by 900-1000 people in your large company. On the following page you will be shown a graph for a small number of seconds and asked to estimate, if the graph were continued, when the graph metric will hit a particular value.

EMPHASIS: THIS IS A TIME-LIMITED QUESTION. YOU WILL HAVE 45 SECONDS TO ANSWER. Only move forwards when you are prepared to work quickly!"

The other cohort were shown the same thing except without mentioning the 900-1000 figure, and the time limit was set to 30 seconds instead.

For the 30 second cohort, the choice of answers was relatively simple:

Most people -- 69.49% out of 59 takers -- picked the correct answer, 08:30, with 09:00 and 10:00 picking up 8.47% & 5.08% respectively. Note the small but noticeable peak at 09:30 (16.95%), which might or might not be due to anchoring effects from the 30 second time-limit.

The 45 second/900-1000 cohort, however, had a very different pattern.

From 86 takers, the most interesting thing is that, apart from a simple majority (53.49%) picking the correct answer, these distributions are nothing like each other. Note the sizeable peak at 08:45 (~15.12%), which is not matched by a similar peak at 09:45, and the peaks at 09:00 (11.63%) and 10:00 (10.47%), with a lower count of test takers selecting the intermediate value 09:30!
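As a rough check on whether the drop in correct answers between the cohorts (69.49% of 59 versus 53.49% of 86) could plausibly be noise, here is a sketch of a pooled two-proportion z-test using only the standard library. The counts are recovered from the quoted percentages, and the function name is illustrative. Since the cohorts differed in both the time limit and the 900-1000 figure, and were offered different answer choices, a difference here would not by itself isolate anchoring.

```python
import math


def two_proportion_z(hits_a, n_a, hits_b, n_b):
    """Pooled two-proportion z-test; returns (z, two-sided p-value)."""
    p_a, p_b = hits_a / n_a, hits_b / n_b
    pooled = (hits_a + hits_b) / (n_a + n_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / std_err
    # Two-sided tail probability of a standard normal.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Counts of correct (08:30) answers, recovered from the percentages.
correct_30s = round(0.6949 * 59)
correct_45s = round(0.5349 * 86)

z, p_value = two_proportion_z(correct_30s, 59, correct_45s, 86)
print(f"30s cohort: {correct_30s}/59 correct, 45s cohort: {correct_45s}/86 correct")
print(f"z = {z:.2f}, two-sided p = {p_value:.3f}")
```

With these counts the statistic lands close to the conventional 5% threshold, so the difference is suggestive rather than conclusive at this sample size.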

Anchoring & Availability II

I had further questions on outage durations, which attempted to anchor two cohorts of participants on outage durations of either 43 minutes or 107 minutes, but while the data demonstrates a longer average in the longer priming case, the data seem a bit too noisy to derive anything much from:

Type (43)    Min    Max    Mean     Stddev    Count
Minutes      0      120    44.71    32.34     24
Hours        0      24     3.92     6.91      12

Type (107)   Min    Max    Mean     Stddev    Count
Minutes      0      120    56.48    25.62     21
Hours        0      39     10.71    14.01     7
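To put a number on "too noisy", the sketch below computes Welch's t statistic for the two "Minutes" rows (43-minute priming versus 107-minute priming) from the summary statistics alone. It assumes the Stddev column is a sample standard deviation and that the two groups are independent; the function name is illustrative.

```python
import math


def welch_t(mean_a, sd_a, n_a, mean_b, sd_b, n_b):
    """Welch's t statistic and Welch-Satterthwaite degrees of freedom."""
    var_a = sd_a ** 2 / n_a  # squared standard error of mean A
    var_b = sd_b ** 2 / n_b  # squared standard error of mean B
    t = (mean_b - mean_a) / math.sqrt(var_a + var_b)
    df = (var_a + var_b) ** 2 / (
        var_a ** 2 / (n_a - 1) + var_b ** 2 / (n_b - 1)
    )
    return t, df


# "Minutes" rows: primed with 43 minutes vs primed with 107 minutes.
t, df = welch_t(44.71, 32.34, 24, 56.48, 25.62, 21)
print(f"t = {t:.2f} with about {df:.0f} degrees of freedom")
```

A t statistic of roughly 1.4 on about 43 degrees of freedom falls short of the value of about 2 usually needed for significance at the 5% level, which fits the reading that the longer mean in the longer priming case cannot be distinguished from noise with these sample sizes.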

The Linda Problem (Conjunction Fallacy)

The Linda Problem, also known as the conjunction fallacy, is a famous effect in cognitive science where a majority of those sampled choose something which is logically incorrect (to wit, that it is more likely that something has two properties rather than just one). The result is controversial but is definitely reproducible. I set out to see if SREs would suffer from the same error in thinking, by asking the following question:

"For almost half a year, your team has been troubleshooting a persistent but intermittent problem. Throughput between hosts will dramatically decrease for seconds or minutes. Eventually it fixes itself. It can happen several times a day but most often happens a few times a week. You believe you have tracked it down to a network problem. You found a new firmware version for your switches, the update notes of which referenced an obscure packet loss condition. After upgrading, the problem has not recurred since (2-3 days). Which is more likely, in your opinion, considering everything you know above -- that the issue is a network problem, or that it is a network problem related to packet loss?"

So we do way better than the general population on this (somewhere up to 85% of those sampled get it wrong, we're at ~41%) but the interesting thing is how, when the domain of the question moves to something considerably more abstract, we get it a little bit wronger, but not much. The next question is actually the Linda problem in disguise:

Consider a regular six-sided die with four green faces and two red faces. The die will be rolled 20 times and the sequence of greens (G) and reds (R) will be recorded. You are asked to select one sequence from a set of three, and you will win $25 if the sequence you choose appears on successive rolls of the die. Do you pick RGRRR, GRGRRR, or GRRRRR?

Of course, the second sequence is just the first sequence with an extra G on the front, so any run of rolls containing the second also contains the first, and the second is strictly less probable; but it matches the proportions of the die a little better. In the outside world, apparently 66% of respondents pick this one, but we avoid the wrong answer by about 55% to 45%.
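The claim can be checked exactly. The sketch below is one possible way to do it, not something from the survey itself: it computes the chance of each sequence at a fixed position, and then the chance of it appearing anywhere in 20 rolls, using a small dynamic programme over "how much of the pattern have we matched so far". Function names are illustrative.

```python
from fractions import Fraction

# Four green faces and two red faces on a six-sided die.
FACE_PROB = {"G": Fraction(4, 6), "R": Fraction(2, 6)}


def position_prob(pattern):
    """Probability that the pattern occurs starting at one given roll."""
    result = Fraction(1)
    for face in pattern:
        result *= FACE_PROB[face]
    return result


def prob_appears(pattern, rolls=20):
    """Exact probability that the pattern appears somewhere in `rolls` rolls."""

    def advance(matched, face):
        # Longest suffix of (matched prefix + face) that is a prefix of pattern.
        candidate = pattern[:matched] + face
        while not pattern.startswith(candidate):
            candidate = candidate[1:]
        return len(candidate)

    state_probs = {0: Fraction(1)}  # matched length -> probability
    hit = Fraction(0)               # probability the pattern has appeared
    for _ in range(rolls):
        next_probs = {}
        for matched, prob in state_probs.items():
            for face, face_prob in FACE_PROB.items():
                new_matched = advance(matched, face)
                if new_matched == len(pattern):
                    hit += prob * face_prob
                else:
                    next_probs[new_matched] = (
                        next_probs.get(new_matched, 0) + prob * face_prob
                    )
        state_probs = next_probs
    return hit


for sequence in ("RGRRR", "GRGRRR", "GRRRRR"):
    print(sequence, position_prob(sequence), float(prob_appears(sequence)))
```

Whatever the exact figures, RGRRR must come out ahead of GRGRRR, because every run of rolls that contains the longer sequence also contains the shorter one.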

Conclusions

The first interesting fact is that, according to our answers, we behave unlike the standard prospect theory matrix: we are risk averse with loss and risk seeking with gains. It is an open question whether this result is reproducible and whether or not it has anything to do with our profession's attitude and exposure to risk management generally.

Another interesting fact is that we have a better-than-general-population understanding of statistics, perhaps befitting our more scientific training, although it could be argued that this is mostly in domains with which we are familiar. (We still do make mistakes, though.)

The final point is that there is evidence that we are vulnerable to anchoring, particularly in stressful situations.

Further work on this would seem appropriate. Would any statisticians or psychologists like to help? (It seems like there should at least be a paper in it.)
