Thanks to Kit Baum over at Boston College, our Stata add-on surveybias.ado is now available from Statistical Software Components (SSC). The add-on takes as its arguments the name of a categorical variable and said variable’s true distribution in the population. For what it’s worth, the program tries to be smart, so the following three calls should all give the same result:

surveybias vote, popvalues(900000 1200000 1800000)
surveybias vote, popvalues(0.2307692 0.3076923 0.4615385)
surveybias vote, popvalues(23.07692 30.76923 46.15385)

If you don’t have access to the raw data but want to assess survey bias evident in published figures, there is surveybiasi, an “immediate” command that lets you do stuff like this: surveybiasi , popvalues(30 40 30) samplevalues(40 40 20) n(1000). Again, you may specify absolute values, relative frequencies, or percentages.
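To try this out, something along the following lines should work. This is only a sketch: it assumes that the package is listed on SSC under the name surveybias and that installing it also provides surveybiasi. The second command repeats the example above with relative frequencies instead of percentages.

```stata
* Install from SSC (needs an internet connection);
* the package name is an assumption based on the name of the ado-file
ssc install surveybias, replace

* Same example as above, but with relative frequencies
surveybiasi , popvalues(.3 .4 .3) samplevalues(.4 .4 .2) n(1000)
```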

Do you like this graph? I don’t think it is particularly attractive, and that is after spending hours and hours creating it. What I really wanted was a matrix-like representation of 18 simulations I ran. More specifically, I simulated the sampling distribution of a statistic under six different conditions for three different sample sizes. Doing the simulations was a breeze, courtesy of Stata’s simulate command, which created 18 corresponding data sets. Graphing them with kdensity also poses no problem, but combining these graphs did, because I could find no canned command that produces what I wanted: a table-like arrangement, with labels for the columns (i.e. sample sizes) and rows (experimental conditions). What I could do was set up and label a variable with 18 categories (one for each data set) and use the by() option to create a trellis plot. But that would waste a lot of ink and space by replicating redundant information.

At the end of the day, I created nine graphs that were completely empty save for the text that I wanted as row/column labels, which I then combined into two separate figures, which were in turn combined (using a distorted aspect ratio) with my 18 separate plots. That boils down to a lot of dumb code. E.g., this creates the labels for the six conditions. Note the fxsize option that makes the combined graph narrow, and the necessity to create an empty scatter plot.

capture drop x
capture drop y
capture set obs 5
gen x = .
gen y = .

Who is afraid of whom?

The liberal German weekly Zeit has commissioned a YouGov poll which demonstrates that Germans are more afraid of right-wing terrorists than of Islamist terrorists. The question read “What is, in your opinion, the biggest terrorist threat in Germany?” On offer were right-wingers (41 per cent), Islamists (36.6 per cent), left-wingers (5.6 per cent), other groups (3.8 per cent), or (my favourite) “no threat” (13 per cent). This is a pretty daft question anyway. Given the news coverage of the Neo-Nazi gang that has killed at least ten people more or less under the eyes of the authorities, and given that the authorities have so far managed to stop would-be terrorists in their tracks, the result is hardly surprising.

Nonetheless, the difference of just under five percentage points made the headlines, because there is a subtext for Zeit readers: Germans are worried about right-wing terrorism (a few weeks ago many people would have denied that there are right-wing terrorists operating in Germany), which must be a good thing, and they are less concerned about Islamist terrorists, which is possibly a progressive thing. Or something along those lines.

But is the five-point difference real?

YouGov has interviewed 1043 members of its online access panel. If we assume (and this is a heroic assumption) that these respondents can be treated like a simple random sample, what are the confidence intervals?

Binomial Confidence Intervals

First, we could treat the two categories as if they were distributed as binomial and ask Stata for exact confidence intervals.

cii 1043 round(1043*.41)
cii 1043 round(1043*.366)
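In Stata 14 and later, cii expects a subcommand, so the same intervals would be requested with cii proportions. The following is a minimal sketch under that assumption; the local and scalar names are arbitrary, and the counts are computed by rounding the published percentages.

```stata
* Exact (Clopper-Pearson) binomial intervals, Stata 14+ syntax
local n   = 1043
local rw  = round(`n' * .41)     // right-wingers
local isl = round(`n' * .366)    // Islamists

cii proportions `n' `rw'
scalar rw_lb = r(lb)             // lower bound, right-wingers

cii proportions `n' `isl'
scalar isl_ub = r(ub)            // upper bound, Islamists

* The intervals overlap if the upper bound for Islamists
* exceeds the lower bound for right-wingers
display "overlap: " (isl_ub > rw_lb)
```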

The confidence intervals overlap, so we’re led to think that the proportions in the population are not necessarily different. But the two categories are not independent, because the “not right-wingers” answers include the “Islamists” answers and vice versa, so the multinomial is a better choice.

Multinomial Model

It is easy to re-create the univariate distribution of answers in Stata:
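A minimal sketch of how this might look, not necessarily the code behind the results reported here. It assumes cell counts reconstructed by rounding the published percentages (which makes them sum to 1044 rather than 1043), and value labels chosen so that the equation names match those used in the test command further down. The variable names answer and n are hypothetical.

```stata
clear
* One row per answer category; n is the (reconstructed) cell count
input byte answer int n
1 428
2 382
3  58
4  40
5 136
end

* Value labels become the equation names in the mlogit output
label define answer 1 "right_wingers" 2 "islamists" 3 "left_wingers" ///
    4 "other" 5 "no_threat"
label values answer answer

* Constant-only multinomial logit, "no threat" as the base outcome
mlogit answer [fweight = n], baseoutcome(5)
```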

The parameters of the model reproduce the observed distribution exactly and are therefore not very interesting, but the estimates of their standard errors are available for testing hypotheses:

test [right_wingers]_cons = [islamists]_cons

At the conventional level of 0.05, we cannot reject the null hypothesis that both proportions are equal in the population, i.e. we cannot tell if Germans are really more worried about one of the two groups.

Simulation

Just for the fun of it, we can carry out one additional test and ask a rather specific question: If both proportions are 0.388 in the population and the other three are identical to their values in the sample, what is the probability of observing a difference of at least 4.4 points in favour of right-wingers?

The idea is to sample repeatedly from a multinomial with known probabilities. This could be done more elegantly by defining a program and using Stata’s simulate command, but if your machine has enough memory, it is just as easy and possibly faster to use two loops to generate/analyse the required number of variables (one per simulation) and to fill them all in one go with three lines of Mata code. Depending on the number of trials, you may have to adjust maxvar.
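A minimal sketch of such a simulation, done entirely in Mata rather than with the two loops described above, so it is not the code that produced the figure reported below. The seed, the number of replications and all object names are arbitrary assumptions.

```stata
clear all
set seed 20111201

mata:
// Probabilities under the null: both groups at .388, the other
// three categories at their sample values
p    = (.388 \ .388 \ .056 \ .038 \ .13)
n    = 1043      // respondents per simulated survey
reps = 5000      // number of simulated surveys (arbitrary)

// n x reps matrix of draws; each column is one simulated survey
draws = rdiscrete(n, reps, p)

// Difference between the shares of category 1 and category 2
diff = (colsum(draws :== 1) - colsum(draws :== 2)) / n

// Share of surveys with a difference of at least 4.4 points
share = mean((diff :>= .044)')
share
end
```

Keeping everything in one Mata matrix avoids creating one Stata variable per simulation, so maxvar is not an issue here, at the price of holding an n × reps matrix in memory.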

Seems the chance of a 4.4 point difference is between 5 and 6 per cent. This probability is somewhat smaller than the one from the multinomial model because the null hypothesis is more specific, but still not statistically significant. And the Zeit does not even have a proper random sample, so there is no scientific evidence for the claim that Germans are more afraid of right-wing extremists than of Islamists, whatever that would have been worth. Bummer.

Sometimes, a man’s gotta do what a man’s gotta do. Which, in my case, might be a little simulation of a random process involving an unordered categorical variable. In R, sampling from a multinomial distribution is trivial.

rmultinom(1,1000,c(.1,.7,.1,.1))

gives me a vector of counts for 1000 draws from a multinomial distribution with outcomes 1, 2, 3, and 4, where the probability of observing a ‘1’ is 10 per cent, the probability of observing a ‘2’ is 70 per cent, and so on. But I could not find an equivalent function in Stata. Generating the artificial data in R for use in Stata is not very elegant, so I kept digging and found a solution in section M-5 of the Mata handbook. Hidden in the entry on runiform is a reference to rdiscrete(r,c,p), a Mata function which generates an r × c matrix of draws from a discrete distribution defined by a vector p of probabilities. Unlike rmultinom, it returns the individual draws rather than counts per category.
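Such a wrapper might look like the following sketch. The variable name outcome, the seed, the number of observations and the probabilities are illustrative assumptions; the probabilities are chosen so that they sum to one, as rdiscrete requires.

```stata
clear
set obs 1000
set seed 2012

* Placeholder variable that Mata will fill
generate byte outcome = .

* One draw per observation from a discrete distribution with
* P(1) = .1, P(2) = .7, P(3) = .1, P(4) = .1
mata: st_store(., "outcome", rdiscrete(st_nobs(), 1, (.1 \ .7 \ .1 \ .1)))

* Counts per category, comparable to what rmultinom returns in R
tabulate outcome
```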

That leaves but one question: Is wrapping a handful of lines around a Mata call to replace a non-existent Stata function more elegant than calling an external program?