T-tests? That's for amateurs! In this post we're going to build a Voight-Kampff Test! If you haven't watched Blade Runner or read Do Androids Dream of Electric Sheep?, you may be unfamiliar with this famous test. The purpose of the Voight-Kampff test is to identify replicants. In the film, replicants are androids that are nearly indistinguishable from humans. The only way to detect a replicant is by testing their reaction to a series of questions intended to evoke an emotional response. When humans are asked these questions they uncontrollably respond with "capillary dilation or the so called 'blush response', fluctuation of the pupil, involuntary dilation of the iris." Using the Voight-Kampff machine, an interviewer can measure these responses.

Observing Evidence is an important part of Bayesian Reasoning.

Why do we need Bayesian Analysis?

The machine pictured above is only useful for measuring responses. It requires the skill of the interrogator to make the call. Each question gives the interrogator a bit of evidence, but weighing that evidence is not a trivial task. The purpose of the test is to identify rogue replicants, and the consequence of being identified is execution. We want to be very certain of the test's results. Deckard, the film's protagonist, mentions that he has never mistakenly terminated a human. Deckard also reveals that for most cases 20-30 questions are required to come to a conclusion, but in special cases it can take more than 100! This means we gather evidence until we have enough to establish a firm belief.

You aren't going to use a p-value for that, are you?

Introducing Bayes Factor and Jaynes' Evidence

A regular t-test isn't going to solve the problem we have. For starters, we don't want to simply "reject a null hypothesis". With a t-test we can only say something like "There is a 19/20 chance that the subject is not your average human". That's a pretty lame belief if it means your next step is going to be putting a bullet in someone's head. What we want to be able to say is "The subject is a replicant beyond any doubt". Then there's the issue that p-values are rather hard to reason about as they grow close to 1 or 0. A 0.999 chance that someone is a replicant and a 0.9999 chance seem roughly the same in our heads, but there is an order of magnitude difference between them! Additionally, 0.9999 still leaves too much room for error if the result is going to be "elimination". We also want to ask a series of questions and stop when we have enough evidence to decide; this would violate proper experimental design for traditional null hypothesis testing.

To start solving this problem let's assume that we have some prior information at our disposal. Voight-Kampff tests have been given for ages so we know the response rate for a variety of questions. We have the question "It’s your birthday. Someone gives you a calfskin wallet. How do you react?" Historical data says that humans show an unintentional response with a probability of 0.8, and replicants only 0.01. We'll refer to this prior information as \(X\).

We ask this question to a subject and our machine notices that they do show an emotional response. This observation of a response (or lack of one) is our data, \(D\). Now we have two hypotheses we are considering: \(H\), the subject is a replicant, and \(\bar{H}\), the subject is human. What we need now is a way to evaluate what our data tells us about our hypotheses. We'll do this by comparing the likelihood of getting this data if each hypothesis were true.

If we assume \(H\), then the likelihood of getting the data \(D\) that we observed can be expressed as: $$P(D|HX) = 0.01$$ This value is simply a fact from our previous data: the probability of a replicant showing a response is 0.01. But we're not done yet. We need to compare how likely the response is under \(H\) to how likely it is under \(\bar{H}\). After all, the fact that a response is unlikely for a replicant doesn't by itself tell us how the replicant hypothesis fares against the human hypothesis. We know, from our prior information, that \(\bar{H}\) explains this data much better: $$P(D|\bar{H}X) = 0.8$$ To quantify how well the replicant hypothesis explains the data relative to the human hypothesis, we'll simply take the ratio of the two likelihoods: $$\frac{P(D|HX)}{P(D|\bar{H}X)} = \frac{0.01}{0.8} = 0.0125$$ This ratio of likelihoods is referred to as the Bayes factor.

E. T. Jaynes, in his book Probability Theory: The Logic of Science, takes this one step further by defining Evidence, which converts the Bayes factor into a decibel system: $$e = 10\cdot \log_{10} \Bigg[\frac{P(D|HX)}{P(D|\bar{H}X)}\Bigg]$$ This ends up giving us a measurement very similar to decibels in sound. Even better, our intuitions about decibels in sound carry over: 1 means very little evidence in favor of our hypothesis, and 100 or more means our evidence is extremely 'loud'.

Applying Jaynes' refinement we get a much more intuitive number to work with: $$e = 10\cdot \log_{10} \Bigg[\frac{P(D|HX)}{P(D|\bar{H}X)}\Bigg] = 10\cdot \log_{10} \Bigg[\frac{0.01}{0.8}\Bigg] \approx -19$$ Because we are dealing with the log of the ratio, we can just sum up each piece of evidence as we acquire it. At this point we have 19 dB of evidence against our hypothesis that the subject is a replicant.
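To make the arithmetic concrete, here is a minimal Python sketch of the calculation for the calfskin wallet question. The function name `evidence_db` is my own label for illustration, not something from Jaynes.

```python
import math

def evidence_db(p_data_given_h, p_data_given_not_h):
    """Evidence in decibels: 10 * log10 of the likelihood ratio."""
    return 10 * math.log10(p_data_given_h / p_data_given_not_h)

# Calfskin wallet question, subject shows a response.
# Replicants respond with probability 0.01, humans with probability 0.8.
bayes_factor = 0.01 / 0.8
wallet_response = evidence_db(0.01, 0.8)

print(round(bayes_factor, 4))  # 0.0125
print(round(wallet_response))  # -19
```

A negative value counts against the replicant hypothesis, a positive value counts in its favor, and 0 means the data does not favor either side.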

Putting it all together

Below we have a table of questions that we're going to ask (simply labeled Q1, Q2, etc.). In the column labeled "human" we have the proportion of humans who have an involuntary response registered by our Voight-Kampff machine; in the column labeled "replicant" is the proportion of replicants that have a response. The next two columns, labeled "response" and "no response", are the evidence we gain depending on the response of the subject (we'll round to the nearest whole number).

Evidence for five questions.

Now that we have our data, let the test begin!

Inference

Before we ask any questions we need to establish our initial belief about whether or not the subject is a replicant. Let's also say that our prior information \(X\) tells us that in general there's a 1/10 chance that the subject is a replicant.

Our starting evidence is -10, which means that we're starting from the initial belief that our subject is not a replicant. Now suppose you believe much more strongly in either direction about a particular subject. You can adjust your prior simply by changing this one value; the way we incorporate the rest of the evidence presented next will not change.
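The text does not show where the -10 comes from. One reading, and it is an assumption on my part that follows Jaynes' convention, is that the starting evidence is the log of the prior odds, rounded to a whole number. The helper name `prior_evidence_db` is made up for this sketch.

```python
import math

def prior_evidence_db(p_h):
    """Starting evidence for H in decibels, computed from the prior odds."""
    return 10 * math.log10(p_h / (1 - p_h))

print(round(prior_evidence_db(1 / 10)))  # -10
print(round(prior_evidence_db(1 / 2)))   # 0
```

Under this reading a 50/50 prior corresponds to 0 evidence, so a different prior only shifts the starting point of the running total.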

We ask our first question, Q1, and we observe no response. In the table above we have already precalculated all of the evidence, but to make sure our reasoning is clear let's calculate where that is coming from (and remember we're rounding everything to whole numbers).

For \(P(D|HX)\) given no response to Q1, we take the probability of a replicant not responding, which is just 1 minus the probability of a response.

Because of the log transformation we can simply add our initial evidence (-10) to our Q1 evidence (13) to see what we currently believe, which is 3. Even though not showing a response is more likely for a replicant than for a human, it is not so unlikely for a human that it moves our belief that the subject is a replicant very far from our prior. We just have a very slight hint that the subject might be a replicant. Let's ask the rest of our questions:
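The same bookkeeping can be sketched in Python. The response rates for Q1 are not reproduced in the text, so this example uses the calfskin wallet question from earlier (humans 0.8, replicants 0.01) instead; the function name `question_evidence` is my own.

```python
import math

def question_evidence(p_response_human, p_response_replicant):
    """Return (response, no_response) evidence in dB for H = replicant."""
    response = 10 * math.log10(p_response_replicant / p_response_human)
    no_response = 10 * math.log10(
        (1 - p_response_replicant) / (1 - p_response_human)
    )
    return round(response), round(no_response)

response_db, no_response_db = question_evidence(0.8, 0.01)
print(response_db, no_response_db)  # -19 7

prior_db = -10  # starting evidence from the text
print(prior_db + no_response_db)    # -3
print(prior_db + response_db)       # -29
```

Notice the asymmetry: a response to this question is strong evidence for a human, while the lack of one is only modest evidence for a replicant.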

Our final belief is -52, which means we are pretty sure the subject is human, even though they answered a few questions like a replicant. Let's look at another subject that is more suspect:

Our final belief is 80. Are we ready to 'retire' the subject? We certainly could. We have a pretty strong belief the subject is a replicant but, as stated previously, we ideally want a belief of 100 or more before we decide. This is neither universally true nor some magic number; it merely represents an incredibly strong belief in our hypothesis. We could easily settle on a lower number. Remember, the conclusion of this experiment is going to be the termination of the subject if we believe they are a replicant. We don't want our beliefs to be summarized as "meh, I'm pretty sure" or even "I'm really positive", but preferably "you're damn right that's a replicant."
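To get a feel for the gap between 80 and 100, we can translate evidence back into a probability. This sketch assumes the running total, prior included, can be read as the log odds of the replicant hypothesis, so that the odds are \(10^{e/10}\); the function name `chance_subject_is_human` is just for illustration.

```python
def chance_subject_is_human(total_evidence_db):
    """Probability the subject is human, given total evidence for 'replicant'."""
    odds_replicant = 10 ** (total_evidence_db / 10)
    return 1 / (1 + odds_replicant)

print(f"{chance_subject_is_human(80):.1e}")   # 1.0e-08
print(f"{chance_subject_is_human(100):.1e}")  # 1.0e-10
```

Under that assumption, 80 still leaves roughly a 1 in 100 million chance of retiring a human, and 100 brings it down to about 1 in 10 billion.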

We can now see why Deckard needs 20-30 questions for most cases. Our last subject gave particularly damning answers, but a single human response would have changed our results quite a bit.
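As a rough sanity check on that question count, here is a hypothetical scenario (mine, not from the film or the table): every question is as informative as a non-response to the calfskin wallet question, which works out to about 7 dB each, and we start from -10 and stop at 100.

```python
def questions_needed(prior_db, per_question_db, threshold_db=100):
    """Count the questions asked until total evidence reaches the threshold."""
    total, count = prior_db, 0
    while total < threshold_db:
        total += per_question_db
        count += 1
    return count

print(questions_needed(-10, 7))  # 16
```

Even with every single answer pointing the same way, it takes 16 questions to get there, so any human-looking response along the way pushes the count well into Deckard's range.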

Now the only question left is... have you ever taken the test?

Huge thanks to reader Nevin for pointing out some major errors in the original version of this post!