Some commentators are comparing the upcoming general election with the one in 1992. The opinion polls predicted a hung parliament or a Labour victory; in fact the Conservatives won, with a majority of 21. The reason? People lied to opinion pollsters: they were embarrassed to admit their support for the Tories.

This is an example of a more general issue in data collection: data quality.

Another political example would be the notorious Literary Digest poll of 1936. Despite over two million responses the poll wrongly predicted that Governor Landon of Kansas would win the presidency. The then little-known pollster George Gallup, using a sample of only fifty thousand, correctly forecast that Franklin Roosevelt would be re-elected; indeed he won by a landslide, taking 46 of the 48 states.

In this case, the flaw lay in the people the Literary Digest had polled. Their sample was based in large part on people who owned telephones and people who owned cars, both of which were luxury items in the 1930s. Wealthy people disproportionately supported the Republican party, whose candidate was Landon.

What we learn from the Literary Digest poll is that quantity is no guarantee of quality. In this case the poll was flawed by bias.

Another issue that can impact survey sampling is non-response. What do you do if a member of your sample refuses to answer one of your questions?

In 2010 the Office for National Statistics published the results of its research into how many British adults are gay. While the Kinsey Reports of the late 1940s and early 1950s are often used to support the claim that around 10% of the population is gay, the commonly accepted figure more recently has been somewhere in the region of 5-7%.

The ONS data reduces the figure even further, putting it at around 1.5%.

So which is it?

The problem with the ONS survey is that only 96% of respondents produced a ‘valid response’ to the question about their sexuality. In other words, 4% didn’t produce a valid response. What does this mean? Why would someone not produce a valid response? Perhaps because they’re gay and don’t want to admit it to a researcher. (The ONS survey comprised personal interviews.) If all of those 4% of non-respondents were gay then the true proportion of gay people jumps from 1.5% to 5.5%, which is within the previously accepted range. If only half of them were gay, this still increases the proportion to 3.5%, more than doubling the estimate of the total number of gay adults in the UK.

Of course, we can’t actually know. All we have is an invalid response and it’s impossible to interpret that. But because the proportion of non-respondents is of the same order of magnitude as the proportion of people identifying as gay (it’s actually more than double the size) we have to treat the 1.5% statistic with considerable caution.
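The figures quoted above can be reproduced with a few lines of arithmetic. This is only an illustrative sketch: it follows the simplification used in the text of adding a share of the 4% non-response directly to the 1.5% headline figure, and the function name `adjusted_share` is invented for this example rather than taken from the ONS.

```python
def adjusted_share(reported, non_response, hidden_fraction):
    """Headline share plus the part of the non-response assumed to be gay.

    All arguments are proportions between 0 and 1. This mirrors the
    simple addition used in the text; it is not an ONS method.
    """
    return reported + non_response * hidden_fraction


# Three scenarios: none, half, or all of the non-respondents are gay.
for hidden in (0.0, 0.5, 1.0):
    share = adjusted_share(0.015, 0.04, hidden)
    print(f"{hidden:.0%} of non-respondents gay -> {share:.1%}")
```

The spread between the lowest and highest scenario is far wider than the headline figure itself, which is why the 1.5% has to be read with caution.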

Does it really matter? Well, yes, because social policy is shaped by data. Accurate data helps governments and others decide what legislation to pass, what priorities to set, what funds to distribute. The 5-7% figure was quoted by the Blair government when it was proposing the introduction of civil partnerships. Other parties are interested, too. Advertisers, perhaps lured by the so-called ‘pink pound’, will be very interested to know how large the market for a product targeted at gay people is.

So how can we collect data on sensitive topics such as sexuality? How big a problem is cannabis use amongst teenagers, for example? Do you smoke cannabis? Would you tell a researcher? For the ONS survey, respondents were asked to say ‘stop’ as soon as the interviewer read out the sexual orientation that applied to them. (Someone who is not openly gay might not wish to answer the direct question ‘What is your sexuality?’ with the answer ‘gay’ for fear that someone they know might overhear them.) The ONS ‘stop’ method was designed to avoid this problem. But it still required the respondents to be open about their sexuality to the interviewers, in the same way that respondents to pollsters for the 1992 election had to ‘admit’ to supporting the Conservative party to someone standing right in front of them.

The ideal approach is to make the respondent feel absolutely certain that their answer is given in complete confidence. There is, in fact, one way to achieve this in which the respondent can feel certain that no-one at all knows their answer, not even the researchers. It works like this.

Each respondent is given two unsealed envelopes marked heads and tails, a piece of paper with the words yes and no printed on it, a pencil, a coin and a die. The respondent is free to examine the contents of each envelope before proceeding.

The respondent then goes into a private room where he tosses the coin and rolls the die. If the coin comes down heads, he opens the heads envelope; if it comes down tails he opens the tails envelope. Inside the heads envelope is the question ‘Are you gay?’; inside the tails envelope is the question ‘Does the die show an even number?’ The respondent indicates his answer by circling the word yes or the word no on the piece of paper. He then folds it in half so that his response cannot be seen.

Finally, the respondent emerges from the private room and puts the piece of paper into a sealed box along with other similar pieces of paper from other respondents.
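The procedure just described can be sketched as a small simulation. Everything here is an assumption made for illustration: the function name `respond`, the fixed random seed, the made-up population in which 10% of respondents are gay, and above all the assumption that every respondent answers truthfully.

```python
import random


def respond(is_gay, rng):
    """Return the word ('yes' or 'no') that one respondent circles."""
    coin = rng.choice(["heads", "tails"])
    die = rng.randint(1, 6)
    if coin == "heads":
        # Heads envelope: the sensitive question, answered truthfully.
        answer = is_gay
    else:
        # Tails envelope: 'Does the die show an even number?'
        answer = die % 2 == 0
    return "yes" if answer else "no"


rng = random.Random(0)  # fixed seed so the run is repeatable
# Hypothetical population: 10% gay, chosen purely for illustration.
slips = [respond(rng.random() < 0.10, rng) for _ in range(10_000)]
print(slips.count("yes") / len(slips))
```

Only the list of slips leaves the simulated private room: neither the coin nor the die is recorded, so a single ‘yes’ says nothing about which question was answered.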

This elaborate procedure should ensure that the respondent answers the question honestly. He can feel safe in doing so because no-one can know which question he is answering. If his response is ‘yes’ that could simply mean that he is answering the entirely innocuous question ‘Does the die show an even number?’ Further, there is no way to link his answer to him because it is in a box of identical slips that are just the same as his; indeed, since he didn’t write the word out himself, there’s not even a link via his handwriting to his answer.

But how, then, do we interpret the results if we don’t know which of the two questions has been answered on any given slip?

Probability theory comes to the rescue. On average, about half of the respondents will get a tail when they toss the coin, so half of the respondents will be answering the question about the die. Of these, about half will have got an even number when they rolled the die and these people will answer ‘yes’. So about one quarter of all the responses will be ‘yes’ answers to the die question alone.

If the proportion of ‘yes’ answers is more than one quarter then the excess must (on average) be due to respondents answering ‘yes’ to the question ‘Are you gay?’

For example, suppose that there are 100 respondents and 30 of them say ‘yes’. Then (on average) 25 of these are people who got a tail and an even number. The remaining five got a head and are saying ‘yes’ they are gay. Since half of the respondents will have been answering the question ‘Are you gay?’, the proportion of gay people is 5 out of 50, or 10%.
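The same calculation can be written as a short function. This is a minimal sketch that assumes a fair coin and a fair die; the name `estimate_gay_share` is invented for this example.

```python
def estimate_gay_share(yes_count, total):
    """Estimate the share answering 'yes' to the sensitive question.

    Assumes a fair coin and a fair die, so a quarter of all slips are
    expected to be 'yes' from the die question alone, and half of all
    respondents answered the sensitive question.
    """
    yes_share = yes_count / total
    return (yes_share - 0.25) / 0.5


# The worked example from the text: 30 'yes' slips out of 100.
print(f"{estimate_gay_share(30, 100):.0%}")
```

If fewer than a quarter of the slips say ‘yes’, the function returns a negative number. That is a sign that chance variation has swamped the signal in a small sample, not that the share is really below zero.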

Of course those figures are approximate because if you toss a coin 100 times you don’t get exactly 50 heads every time. So to feel confident in the results, you have to ask many more people than 100. The ONS survey was based on interviews with over 450,000 people. Had they used this methodology, their results would carry a far higher degree of confidence.
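To get a feel for how much the sample size matters, the sketch below computes an approximate standard error for the estimate. It rests on assumptions that are not in the survey itself: a fair coin and die, honest answers, independent respondents, and a hypothetical true share of 10%. The function name `standard_error` is likewise invented for illustration.

```python
import math


def standard_error(true_share, total):
    """Approximate standard error of the estimated share.

    Assumes a fair coin and die and honest answers. The chance of a
    'yes' slip is 0.25 + 0.5 * true_share, and dividing the observed
    yes-share by 0.5 doubles its standard error.
    """
    yes_prob = 0.25 + 0.5 * true_share
    return 2 * math.sqrt(yes_prob * (1 - yes_prob) / total)


# Hypothetical true share of 10%, at three sample sizes.
for n in (100, 10_000, 450_000):
    print(n, f"{standard_error(0.10, n):.2%}")
```

With 100 respondents the standard error is about nine percentage points, nearly as large as the 10% being estimated. With 450,000 respondents it falls to a small fraction of a percentage point.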