
The downside of making accurate inferences about people

This is a point that I’ve been thinking about on and off for a while without ever reaching, or seeing elsewhere, a satisfying conclusion. It’s in the general category of “sometimes my politics and my epistemology are hard to make play well together”.

Suppose you meet a woman at a conference. Based solely on the fact that she’s a woman, you are 90% sure she’s a recruiter (Note: All numbers in this post are made up for making a clear point and should not be considered accurate). This isn’t you being prejudiced – you’ve met a lot of women at previous similar events, and even at this one, and 90% of them were recruiters. Your judgement of the a priori chance that she’s a recruiter is an entirely accurate one.

(Reminder that I have no idea what the actual numbers are and that in most circumstances 90% is going to be a ridiculous overestimate)

The problem is not this judgement, but how you act on it. Maybe you really don’t want to interact with a recruiter right now so you avoid her, judging that the 10% chance that she’s a dev isn’t worth the 90% chance that you’ll have yet another awkward recruitment conversation. Maybe you do want to interact with a recruiter and you go up to her and talk excitedly about how you’re looking for a job and she’s like “Uh, great? I’m here to talk about consistency in distributed databases. I don’t really have any hiring power”. Maybe you just act surprised when you find out she’s actually a developer.

A useful feminist concept is that of the microaggression: an interaction that is a minor thing in each individual instance but that, in the aggregate, serves to reinforce roles and express prejudice. All of the above are examples.

The fact that they’re minor in individual instances but major in the aggregate is part of why microaggressions are so insidious. Because each individual interaction is not in and of itself a big deal, if you only see a few of them you probably don’t perceive any real problem. e.g. in the above you might have deprived the pair of you of the chance of an interesting interaction, you might have slightly annoyed someone, etc. All of these are strictly worse than not doing them, but they’re also not the end of the world.

The problem is that everyone else is making more or less the same judgement as you. In practice people’s judgements will be inaccurate (usually tending to the overconfident), but in an epistemically optimal world where everyone has perfect reasoning, most people will come to something like that 90% number.

And this is pretty rough for the 10%. They’re now on the receiving end of a constant stream of microaggressions caused by these accurate judgements: The vast majority of people are treating them as if they’re something they’re not, or assuming them to be less competent at their speciality than they actually are.

(Aside: This being a problem does not require you to think that being a recruiter is in any way a bad thing. Recruiters sadly have a bad rep, but the problem here exists regardless of that: Being constantly assumed to be something other than what you are is grating)

Which will tend to mean they stop coming to conferences, and that number is going to get more extreme.

(Additional parenthetical disclaimer: Obviously this is not the only source of problems for women developers at conferences. It is likely swamped by other more serious problems. I have however definitely heard plenty of women developers complaining about things that sound awfully like being on the receiving end of this, so I don’t think this is just empty theorising)

This is the core problem of making accurate judgements about people: Whatever judgement you make will tend to reinforce itself, because it will be based on broad statistical trends, and this will tend to add friction to interactions with people who buck those trends, which will tend to discourage them, and thus counterexamples to the trends you base your judgement on will tend to disappear faster than those who fit the pattern.
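This feedback loop can be illustrated with a toy simulation. Every number below (the starting share, both retention rates) is invented purely for illustration, in keeping with the disclaimer at the top of the post; the sketch just shows that if one group churns faster because of this friction, its share shrinks year on year even though each year’s judgement was accurate when it was made.

```python
def dev_share_over_time(dev_share=0.10, dev_retention=0.6,
                        recruiter_retention=0.8, years=10):
    """Toy model: devs return to the conference at a lower rate than
    recruiters because of the friction described above. All parameters
    are made up. Returns the dev share of women attendees each year."""
    shares = [dev_share]
    for _ in range(years):
        devs = shares[-1] * dev_retention
        recruiters = (1 - shares[-1]) * recruiter_retention
        shares.append(devs / (devs + recruiters))
    return shares
```

Under these made-up parameters the 90% prior that was accurate in year one drifts upwards every subsequent year, so each new cohort of observers is even more justified in making the judgement that drives the loop.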

And I’m not really sure what to do about this.

Oh, in the conference setting it’s easy enough. The benefits of accurate judgement are low enough that just going “Don’t do that then” is basically enough of a solution. Don’t form preconceptions about what people do based on their gender (or their race, or any of countless other categories) and try to treat everyone the same and you’ll probably just do fine in this case.

But a lot of the time when you’re making judgements about people it’s actually much more important and you do need to make accurate judgements. Consider for example hiring people.

Obviously you should not make judgements like “You are a woman therefore you are less likely to be good at this job therefore I won’t bother to interview you”. Even if this were true (it’s not) it would still be a terrible thing to do.

But a lot of people make judgements like “You do not have a Github profile with lots of open source code on it therefore you are less likely to be good at this job and therefore I won’t bother to interview you”. And guess what: Open source contributions are significantly gendered, due to a variety of cultural problems (women tend to have less free time due to greater expectation of doing house work, child care, etc, and open source is not exactly an inclusive environment). This is somewhat related to what I’ve written about false proxies previously, but is more insidious: It’s almost impossible to come up with metrics that are completely oblivious to certain boundaries (even the “Hire them and work with them for several years and see how you find it” metric isn’t: What if your company is secretly a bit racist and you just haven’t noticed because you’re white? The black colleague you hired is having a much harder time of it than the white one and so you will tend to judge them more harshly even if you yourself are completely ignoring their race).

About the best thing you can do that I know of is screen off certain questions at the individual level when making these decisions (make as many decisions as you can without even knowing about the person’s race, gender, etc. and where you do know it do your best to ignore it), then later go back and calibrate: This question that we screened off… are we actually screening it off? Do we get significantly different results in our process for men and women? Or for different ethnicities?
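The calibration step can be sketched as a simple two-proportion comparison of pass rates. The function below is one standard way to do that check; the counts in the example are entirely hypothetical, and in practice you need reasonably large numbers before the statistic means much.

```python
import math

def two_proportion_z(pass_a, total_a, pass_b, total_b):
    """Two-proportion z-test: is the pass rate for group A plausibly
    the same as for group B? |z| above roughly 2 suggests the
    screened-off attribute is still leaking into outcomes."""
    p_a, p_b = pass_a / total_a, pass_b / total_b
    pooled = (pass_a + pass_b) / (total_a + total_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / total_a + 1 / total_b))
    return (p_a - p_b) / se

# Hypothetical audit: 40 of 100 men passed a screen vs 20 of 100 women.
# z comes out around 3.1, well past the conventional threshold, so the
# "screened off" question probably isn't actually screened off.
```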

This is worth doing when you can, but a lot of the time it’s impossible to do. If you’re a small company you probably don’t have the numbers to get good stats. If you’re an individual trying to form opinions about people you can’t do this sort of statistical analysis – you’re not gathering the data, you probably can’t gather the data, and a lot of the time you’re not even aware you’re asking the question.

Which leads me back to “I don’t know what to do”, which is a pretty depressing point to end this piece on. I value both accurate judgements (not just for the sake of them: they’re also necessary for making good decisions and helping people) and not reinforcing structural prejudice, and it’s completely unclear to me how to balance the two. My current solutions are basically just a bunch of patchwork and special cases and I’ve no real idea whether I’m missing important areas or not.

11 thoughts on “The downside of making accurate inferences about people”

Having been in a few “10%” situations over the years, I also came to the conclusion that it’s basically impossible to fix; people will make these judgements because they’re weak, and people will make these judgements because they’re in a hurry, and people will make these judgements because they have to pick one thing out of ten and are having trouble whittling it down to a shortlist, so rejecting things that might correlate to badness for even the most tenuous of reasons might well be a better strategy than random choice.

But, chalking the suffering of being a minority down to human nature doesn’t make it any less infuriating, so I try to predict and compensate. People will make predictions about me based on easily observed characteristics, sure; the best I can do is to predict the most costly mistakes they’ll make and try to provide easily observed hints. If I were your example of a woman developer at a tech conference, I’d consider wearing a suitably nerdy t-shirt to make my developeryness clear to passing observers, if being taken for a recruiter was a problem for me! Obviously, having to pander to prejudice comes at a cost to me, but it’s often worth it compared to the alternatives.

However, this still leaves a cache of frustration and resentment that slowly builds up inside me. I don’t want to be a person who rants and spits at well-meaning people who had no way of knowing I’m a statistical outlier, although it can certainly be tempting at times. I maintain some awareness of my frustration levels, and take steps to let off steam productively if I need to.

This is leading to a big philosophical topic about epistemology, ethics, etc., and I’ll get back with more detailed thoughts later, but I thought I’d share the first thing that came to mind when I saw the title of this post.

I thought it was going to be about Big Data, the whole goal of which is to make the best inferences possible based on all the data we have at a given snapshot in time, and make decisions based on the inferences, often extremely wide-ranging decisions that disproportionately ignore current outliers.

On the one hand, statistical analysis is quite valuable (and has caught egregious cases of cheating and may or may not have deterred some), but we also know that the unexpected can happen. Someone can improve all of a sudden. A criminal can reform, despite the odds. In some way, all of literature, all of hope, is based on accepting that we do not know for sure, that someone could be given a chance but blow it, while someone else could be denied a chance but end up proving to be right. And we tell stories only after the fact. At the moment of decision making, we really do not know what the consequences will be, and in what time frame.

The title seems a bit off here — it should be more like “The downside of making generalisations about people which are statistically justifiable”. The inferences you describe are not always accurate — that’s the problem. They are statistically justifiable, not “accurate”.

There’s a piece up on Medium entitled “Coding Like a Girl” that covers pretty much exactly the scenario you describe from the point of view of “the 10%”.

Her solution seems to be simply “Assume people [who present as feminine] are as or more qualified than you”, which seems like a nice idea, but doesn’t solve or address the problem that you highlight. Given the odds, that assumption will be often wrong, so it’s difficult to overrule one’s pattern matching instincts.

I think it boils down to “don’t judge a book by its cover” and “do unto others as you would have them do unto you”.

Yes, obviously (ok, it’s not obvious to everyone, but people to whom it’s not obvious are much bigger problems than the one I’m talking about in this post) you should assume women are as competent as you. I thought I covered that. I basically agree with all the points in the linked article. It’s generally a good solution to the conference problem. My point is that the conference problem is an instance of a general problem and that while you should certainly give people of all genders and backgrounds the benefit of the doubt in all circumstances, you can run afoul of this problem even while you’re doing that.

To address your other comment which I somehow missed before: “accurate” basically means statistically accurate. You can say that you are confident that something is true and be wrong 10% of the time or you can say that you’re 90% confident that something is true (more realistically you can say that you’re confident that something is true and be wrong 50% of the time or that you’re 90% confident that something is true and be wrong 40% of the time, but assuming ideal reasoners). If you hold off for certainty then you’ll never get there. You can’t even realistically hold purely mathematical theorems to a standard of complete certainty.

Which is the problem: Suppose we had a cheap test with a 0% false negative rate on men and a 50% false negative rate on women. If the same number of competent men and women take it, the overall false negative rate is 25%. In e.g. hiring I’d kill for a cheap test that only wrongly rejected a quarter of the competent candidates. But the result of this test is that even if we got the same number of competent male and female applicants, twice as many men make it through to the next round. We have accurately inferred that there’s a 25% chance that the people we’ve rejected were actually good at the job and decided that that’s an acceptable risk, but because of the detailed structure of that 25% it turns out to have a prejudicial effect.
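The arithmetic in the comment above, written out with the same invented rates (0% false negatives on men, 50% on women, equal numbers of competent applicants):

```python
def survivors(competent, false_negative_rate):
    """Competent applicants who make it past the screen."""
    return competent * (1 - false_negative_rate)

competent_men = competent_women = 100
men_through = survivors(competent_men, 0.0)      # all 100 survive
women_through = survivors(competent_women, 0.5)  # only 50 survive

overall_false_negatives = 1 - (men_through + women_through) / (
    competent_men + competent_women)
# A respectable-looking 25% overall miss rate, but twice as many
# competent men as competent women reach the next round.
```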

And assuming competence here doesn’t help! We’re not making any assumptions about competence on either side: We’re testing competence, we’re getting an accurate prediction, and we’re making good decisions off the back of that prediction. The problem is that by making those decisions we’re aligning the world in the direction of our reasoning, which turns out to have unintended consequences.

One natural way of thinking about this is to include relevant cost terms for the misclassification in your assessment and make your decisions on the basis of cost, not likelihood. In the conference example, the cost of misclassifying a dev as a recruiter is that it contributes to a culture in which the number of female developers at conferences goes down. Assuming you care about this situation (which you do, or else the original problem wouldn’t be of concern to you), your payoff matrix is now much more in accord with your political views, since the misclassification cost of assuming a recruiter is a dev is low compared to that of assuming a dev is a recruiter.
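A minimal sketch of this suggestion, with entirely invented payoffs in which the aggregate cultural cost of avoiding a dev is folded into the individual misclassification cost:

```python
def expected_costs(p_recruiter, costs):
    """Expected cost of each action given P(recruiter).
    costs maps (action, actual role) to a cost; all values are made up."""
    p_dev = 1 - p_recruiter
    return {action: p_recruiter * costs[(action, "recruiter")]
                    + p_dev * costs[(action, "dev")]
            for action in ("talk", "avoid")}

costs = {
    ("talk", "recruiter"): 1.0,   # a mildly awkward conversation
    ("talk", "dev"): 0.0,
    ("avoid", "recruiter"): 0.0,
    ("avoid", "dev"): 20.0,       # reinforcing the culture that drives devs away
}

ec = expected_costs(0.9, costs)
# talk: 0.9 * 1.0 = 0.9; avoid: 0.1 * 20.0 = 2.0 — talking wins even
# at the 90% prior, because the payoffs are so asymmetric.
```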

Of course, there’s a separate question about making decisions whose individual cost is minimal but with a high aggregate cost.

In the recruitment scenario, the problem is more that you’re not *aware* of the numbers involved; you have a strong suspicion that your recruitment process is biased in certain directions, but not well enough to apply appropriate correction factors. My preferred solution is to attempt to calculate your biases along specific axes which we have good anecdotal evidence are likely to be relevant, and then again make your decision in terms of cost factors; how you allocate your costs for (hiring|not good) and (not hiring|good) must I guess be a factor of your personal/corporate value allocation.

I think the problem is that the costs don’t actually work out right unless you’re very conscientious about paying attention to the issue I’m trying to highlight, and they have to be so finely tuned that they only give the right answer if they add up, suspiciously exactly, to the same solution as ignoring the predictive value of gender altogether (which is my preferred solution).

In a naive cost framework the cost of a missed conversation is very low. The problem is that you need to consider the costs not only of you making these decisions but also of many people very similar to you making very similar decisions. Although the cost of any individual missed or flubbed conversation is low, the cost of many is high. You are not the one who is experiencing the many – the person you’re making the decision about is – so when you try to take into account the cost you will get the wrong answer.

RE recruitment: I think in general recruitment processes are designed to maximize P(not hiring | not good), which makes them particularly susceptible to this sort of bias. Calculating biases is generally a good idea, but it seems to be quite hard to do in practice due to a mix of missing information and insufficient datapoints.

Right, as I say, the naive cost framework isn’t sufficient for this, but that doesn’t mean that cost accounting is the wrong solution, just that you need a better cost framework. This is like an n-person prisoner’s dilemma – the cost for an individual defection is low, but assuming similar people, the global outcome is worse when everyone defects. One option is to assume some kind of superrationality, and make your decisions based on the global cost rather than the local one; if you do such a thing, the cost of missing that conversation is now very high, easily enough to skew your decision in favour of talking to the person!

I wouldn’t expect the costs to work out the same way as ignoring the predictive value of gender – that’s saying that there’s no value of n<100% such that you would ever consider not talking to the possible dev, which seems intrinsically wrong to me. However, the fact that you consider this problem interesting tells me that (a) you care about the cost of the missed conversation and that (b) you are considering the costs on a global rather than individual basis. So I would be surprised if your cost matrix did not resolve in making the decision to talk to the person.

Yes, I think recruitment actually suffers from two problems in this area. The first is that we don't have any real quantification of biases; even those of us who would be inclined to try to apply appropriate correction factors to our decisions don't know what those correction factors should be. The second is that, particularly in small companies, the misclassification cost in one direction is high enough that it may well still overwhelm the cost the other way, and at best one can assume a small subset of competitors is superrational. I think large companies could have a big role to play in both cases; both in attempting to calculate (and maybe publish) good priors on personal/corporate bias, and in altering their hiring metrics away from maximising P(not hiring|not good), because the cost of a mis-hire is much lower. And in the latter case, I guess that's where legislation comes in, preventing players from defecting.

I still don’t think treating this as a cost matrix is a good solution to the conference version of the problem when compared to screening off the question.

The reason is that a) The cost matrix approach is really hard to calculate in any meaningful way and b) I’m not sure that the probabilities matter that much for what the “correct” decision is here. It’s not that there is no probability that someone is a recruiter that would cause me to not talk to them, it’s that I don’t think I should be using gender as part of that computation even if it gives an accurate answer. If there are a lot of woman recruiters and only one or two woman programmers at an event I don’t want the latter to have a miserable time, no matter how large the former group is – it will create the sort of reinforcement I’m concerned about and make the problem worse and c) It’s a very delicate balance of costs between “I should definitely not talk to this person” and “I should definitely talk to this person”. Both are bad if widely adopted. You also don’t want everyone swarming to talk to women at conferences because that’s fairly overwhelming.

I agree that large companies can have a role to play in the hiring problem, but I don’t really like that as a solution. The problem is basically that small companies are a lot easier to fix the culture of than large companies because you have an opportunity to get it right. See e.g. a bunch of the complaints that have been levelled at Google. If the only places where we can correct the bias that causes some groups not to be hired are places where they’re going to have a bad time, that seems a bit sad.

So far my approach on hiring has been to screen off certain questions that I know for a fact introduce bias (e.g. open source contributions) and to try to match the questions as closely to the abilities we actually need as possible. I don’t know how well this works – as you say it’s very hard to measure – but it’s a start.