About that claim in the Monkey Cage that North Korea had “moderate” electoral integrity . . .

Yesterday I wrote about problems with the Electoral Integrity Project, a set of expert surveys that are intended to “evaluate the state of the world’s elections” but have some problems, notably rating more than half of the U.S. states in 2016 as having lower integrity than Cuba (!) and North Korea (!!!) in 2014.

I was distressed to learn that these shaky claims regarding electoral integrity have been promoted multiple times on the Monkey Cage, a blog with which I am associated. Here, for example, is that notorious map showing North Korea as having “moderate” electoral integrity in 2014.

The post featuring North Korea has the following note:

The map identifies North Korea and Cuba as having moderate quality elections. The full report online gives details on how to interpret this. It does not mean that these countries are electoral or liberal democracies. The indicators measure expert perceptions of the quality of an election based on multiple criteria derived from international standards.

It’s good to recognize the problem, but the above note isn’t nearly enough. When you have a measure that makes no sense in some cases, the appropriate response is not to just restate that you’re measuring “expert perceptions of the quality of an election” but to figure out what exactly went wrong! Recall that in this case, North Korea was rated as above 50 on every one of the “multiple criteria” given in their report. You can say “expert perceptions” and “international standards” as many times as you want and it doesn’t resolve this one.

When you find a bug in your code, you shouldn’t just exclude the case that doesn’t work, you should try to track down the problem.

More recently, the Electoral Integrity Index was featured in this Monkey Cage post entitled, “Why don’t more Americans vote? Maybe because they don’t trust U.S. elections,” by Pippa Norris, Holly Ann Garnett and Max Grömping, who concluded with the statement that “the U.S. ranks 52nd out of 153 countries worldwide in the 2016 Perceptions of Electoral Integrity index, and at the bottom of equivalent Western democracies.”

That post also featured a correlational analysis—states with higher measured electoral integrity also, on average, had higher voter turnout—and gave it an entirely unpersuasive causal interpretation (“electoral integrity has an effect [on turnout] as well”). Presenting a raw correlation as a causal effect, with no design to back it up, is really contrary to the principles of social science.

At this point, I wouldn’t be surprised if word processors such as Microsoft Word and Google Docs could even have a Social Science Mode that would find unsupported causal claims in your text (search for “cause,” “effect,” and a few other words) and highlight them in red.
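
As a toy sketch of what such a “Social Science Mode” might look like (the word list and function name are my own illustration, not a real word-processor feature): scan each sentence for causal vocabulary and flag it for review.

```python
# A toy "Social Science Mode": flag sentences that use causal vocabulary
# so the author can check whether the research design actually supports
# a causal claim. The word list is purely illustrative.
import re

# Word-boundary matching avoids false hits such as "because" for "cause".
CAUSAL_PATTERN = re.compile(
    r"\b(cause[sd]?|effects?|affects?|impacts?|leads to|results in)\b",
    re.IGNORECASE,
)

def flag_causal_claims(text):
    """Return the sentences in `text` that contain causal language."""
    sentences = re.split(r"(?<=[.!?])\s+", text)
    return [s for s in sentences if CAUSAL_PATTERN.search(s)]

sample = ("Electoral integrity has an effect on turnout. "
          "Maybe turnout is low because people are busy. "
          "States vary widely in their procedures.")
print(flag_causal_claims(sample))
# -> ['Electoral integrity has an effect on turnout.']
```

A real tool would of course need to distinguish supported from unsupported causal claims, which is exactly the part no regex can do.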

Just to be clear: I’m not saying that this sort of work could or should be excluded from the Monkey Cage. Norris et al. are studying an important topic, even if their methods are seriously flawed. But it is disturbing that we’ve been presenting their work entirely uncritically.

The Monkey Cage is one of the few public faces of political science, and when we feature work claiming that North Korea has moderate electoral integrity, or that subliminal smiley faces have huge effects on political attitudes, or that there are large numbers of votes cast by non-citizens, we’re discrediting our own field as well as polluting the public discourse. So, yes, let’s present controversial and preliminary work: if a well-respected survey gives results that don’t make sense, this is a fine topic for the Monkey Cage. It would just be best to express such claims in the spirit of scientific speculation rather than as scientific fact.

The next step is that we correct our errors and learn from them. For example, after a blogger pointed out implausible estimates in my election maps, I went back, figured out what I’d been doing wrong, and posted an update. After the Monkey Cage published that post on non-citizen voting, our editors added the following note:

The post occasioned three rebuttals (here, here, and here) as well as a response from the authors. Subsequently, another peer-reviewed article argued that the findings reported in this post (and affiliated article) were biased and that the authors’ data do not provide evidence of non-citizen voting in U.S. elections.

And after reading my criticisms of her work, Pippa Norris posted a long note with some details on her project.

Comments

Suppose the Electoral Integrity Project was a machine learning endeavor, seeking to find an algorithm which would properly classify entities. If the algorithm concluded that North Korea or any of the Stans rated highly, indeed above a couple of American states such as North Carolina, the researchers ought to do a rethink about the machine’s learning.

This case was dropped in 2015. NK is no longer in the dataset and hasn’t been for years. What went “wrong” was confined to this specific case. This single instance does not call into doubt the scientific quality of the project, the evidence, or the research, and it is grossly unfair to claim so. Judge us on our current dataset (PEI 4.5) and research publications.

I’m skeptical of your claim, “What went ‘wrong’ was confined to this specific case [of North Korea].” I’m skeptical for three reasons:

1. You put “wrong” in scare quotes as if the North Korea numbers might actually be ok. Saying North Korea has moderate electoral integrity isn’t just “wrong.” It’s actually wrong.

2. When you wrote a post on your results a couple of years ago, you discussed North Korea without saying that you saw any problems with it. It was only after others pointed out how ridiculous these numbers were that you dropped the case from your reports. I think your transparency is admirable. At the same time, I don’t think it’s unfair at all to find problems with numbers that you discussed publicly for a long time.

3. The same method you used for North Korea was used for all the other countries. So when you get meaningless numbers for North Korea, this makes me question your method.

“This case was dropped in 2015. NK is no longer in the dataset and hasn’t been for years.” But why was it dropped from the dataset when there are still others with n=2 that remain? In any case, what justification is there for saying ‘too few responses’? You can only say how wide the error bars for n respondents are if you know what the underlying probability distribution is. But your median response number is n=11, and it’s obviously absurd to say that, even if you had a perfect survey for gauging electoral perceptions, all elections have the same distribution. Everyone seems to agree that elections in Denmark are pretty good, so you probably get a normal curve for responses there. But I’d expect the Cuban distribution, for instance, to be bimodal – most experts would say their elections are window dressing to provide the appearance yet not the reality of a free choice, but the Cuban government has some hardcore defenders in polisci departments. So how is reporting n=3 for the Cuban elections a meaningful number for gauging the weight of expert opinion? Lots of other disputed elections are going to have split opinion among the experts you canvass, and with a median response of n=11, the PEI scores will be a hopelessly blunt tool.
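
That worry about small samples and split opinion is easy to simulate (all numbers here are hypothetical, not the PEI data): if the expert population is bimodal, the mean of a handful of respondents swings wildly depending on who happens to answer.

```python
# Hypothetical bimodal expert population: 80% of experts score the
# election around 20 ("window dressing"), 20% score it around 80
# (hardcore defenders). We repeat the survey many times and see how
# much the reported mean jumps around at different sample sizes.
import random
import statistics

random.seed(0)

def one_survey(n):
    """Mean score from n experts drawn from the bimodal population."""
    scores = [
        random.gauss(20, 5) if random.random() < 0.8 else random.gauss(80, 5)
        for _ in range(n)
    ]
    return statistics.mean(scores)

spreads = {}
for n in (3, 11, 100):
    means = [one_survey(n) for _ in range(2000)]
    spreads[n] = statistics.stdev(means)
    print(f"n={n:3d}: sd of the reported mean across surveys ~ {spreads[n]:.1f}")
```

With n=3 the reported mean can land almost anywhere between the two camps; only with far more respondents than the PEI median of 11 does it settle down.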

Yes, but the whole point of this method is to determine how we should classify these entities. That is, if you are thinking of a supervised learning algorithm, how are we to provide a data set for learning? We would need to already have an established measure of electoral integrity so the algorithm can determine what variables are good predictors of this.

In general, would you not agree that we should, technically, not be evaluating this method based on its results? I know that sounds a bit ridiculous, but the point of this project is to determine electoral integrity – implying that we don’t already have a good measure of it. Obviously, NK being ranked higher than some US states seems outrageous. But at which point do we accept a surprising finding? Suppose the method had yielded a slightly less surprising finding, maybe that NK was ranked higher than China? I suppose a crazy finding will lead us to examine which part of the method allowed it, and maybe that will expose an important flaw in the method. But the result on its own being unexpected should not be enough to discount a method used for this purpose.

You’re taking an idealized form of ‘dispassionate scientific analysis’ and pushing it to an absurdity.

A central challenge in fields like comparative politics is that the information set required to understand a phenomenon is not easily (or even possibly) turned into a nice n-dimensional numeric dataset. The most brilliant scholars of Russian history have invariably read War and Peace. What information is in War and Peace that helps you better predict and understand Russian politics? It’s really hard to say, but there certainly seems to be something there.

As it turns out in this case, though, the methodology is trying to gather wisdom from a crowd. The hope is that the crowd’s wisdom, when forced to answer the same set of questions about a country, will get closer to the truth than any individual could, even though the dataset and questions place massive constraints on the ability to share information or analysis.

The problem is that this ‘wisdom from a crowd of experts’ is clearly not working as intended, or at the very least there is cause for concern, because guys like Gelman and other experts are looking at it and laughing at how ridiculous it is. Would you trust a few alleged experts who said “North Korea isn’t as bad as North Carolina?” I wouldn’t, it’s ridiculous. So having them formally submit a few answers and putting it into an index isn’t changing anything.

The thing is, the human brain isn’t a mystical object. It’s very powerful, and capable of filtering out incredible priors. For example, I’ve studied Political Science for years. This ranges from Econometrics to reading historical books, and even novels, on various countries. This information isn’t useless or somehow ‘doesn’t count’, just because you can’t plot it in R. And to me as well, this finding is ridiculous.

But maybe this is all me missing the forest for the trees: Do you know anything at all about North Korea? The prison camps? The government? The quality of life?

I agree I pushed the argument to an extreme by suggesting that the fact that NK is ranked higher than NC, while surprising, is not reason to discount the method without first finding the flaw in the method. For example, suppose that Gelman et al. were asked to evaluate this method beforehand (forget that this particular method involves surveying experts, which can’t really be assessed without talking to these experts) and they found no flaws in it. However, after putting the method into practice, it turns out that NK is ranked higher than North Carolina! Is this seemingly crazy result reason enough to discount the method? No, I don’t think it is. My point is simply that you can’t discount this method based on the results alone. We don’t know what the ranking is supposed to be! If we just assume the correct rankings based on intuition, and then say that any method that produces results contrary to our intuition is wrong… well, what’s the point?

Spend a few minutes reading their report and you will see there are lots and lots of very very obvious things wrong with their method. One of them they even mention: “In addition, after the announcement of the results, challenges to the legitimacy of the outcome are most common in partly free (hybrid) regimes, whereas these types of protests are suppressed under autocratic regimes.” Translation: our survey gives better scores to countries where people are too afraid of the secret police to protest electoral fraud.

This is just one reason why there may be protests. You can also get an election which is technically high quality but where the outcome was tight and party competition is such that losers have a greater incentive to cry fraud, triggering their voters to protest. This is actually quite common, not just a hypothetical. Whereas if one party wins a decisive majority, everyone stays home, whether there was fraud or not.

Of course there are many reasons why protests may or may not happen – but that’s my point. ‘Parties/candidates challenged the results’ doesn’t happen in North Korea, but it did in, say, Australia in 2013 with Clive Palmer (whom no one took seriously). Your survey just naively asks a lot of yes/no questions like this and then proceeds to add them up.

I’m sympathetic to your general point as a matter of the right way to be a scientist. We’re in no disagreement there. I do technically agree with you that the question ‘how bad is North Korea?’ is empirical.

I would counter, though, that with some of this stuff there are so many forking paths of models to use that we need to ground them in some of our more stable priors (as estimated by the human brain). In this case I’m not stating NK is worse by assumption, but I think the signal is *so strong* that all reasonable persons who read the news will agree. Even taking into account extreme media bias, Wikipedia bias, etc., the signal that dictatorial countries with death camps and mass starvation have less electoral freedom than the worst U.S. state is SUPER strong.

We can even take another step back and ask what we mean by the classification of ‘electoral freedom.’ This is a term, but really it’s a set of attributes humans have collectively filtered into a linguistic label to classify similar things. It’s actually arbitrary when you think about it: there is no true thing called electoral freedom. It is just a set of correlated attributes that we all generally agree on.

The authors present a closed form where they try to reduce it to simpler terms. I think an entirely fair argument though would be that NK is by definition not electorally free. It’s even kind of an endogeneity issue, since I define electoral freedom in part by observing NK and defining it as the opposite.

Anyway, as I said, as a general scientific principle I agree. But sometimes in social sciences it’s impossible to get that awesome proof, abstracted from our own priors and intuition, for general political terms.

Andrew’s links cover much of what I’m saying from a different angle, but I think there’s a bigger problem with all types of pseudo-quantitative listings like this. By “like this” I mean rankings of inherently non-quantitative things like “electoral integrity” or “influence” or “importance” or “university quality.”

The press loves this stuff, especially if it’s done by “an algorithm” instead of just averaging, but the problem either way is that it’s totally arbitrary. The categories you choose, the top and bottom of the scale for each category, and the relative importance will determine the outcome. And metrics and scales that are relevant for one set (like comparing US states, which all have two-party systems and FPP voting) will be bad for another.
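
A toy example makes the arbitrariness concrete (the entities and sub-scores are made up): the same raw scores produce opposite rankings under two equally defensible weightings.

```python
# Two hypothetical entities scored 0-100 on two made-up criteria.
scores = {
    "Entity A": {"procedures": 90, "media_access": 40},
    "Entity B": {"procedures": 60, "media_access": 80},
}

def composite(entity, weights):
    """Weighted sum of an entity's sub-scores."""
    return sum(weights[c] * scores[entity][c] for c in weights)

weightings = [
    {"procedures": 0.8, "media_access": 0.2},  # procedures-heavy
    {"procedures": 0.2, "media_access": 0.8},  # media-heavy
]

rankings = []
for w in weightings:
    ranking = sorted(scores, key=lambda e: composite(e, w), reverse=True)
    rankings.append(ranking)
    print(w, "->", ranking)
# The two weightings rank A and B in opposite orders.
```

Nothing in the data tells you which weighting is right; that choice is exactly where the arbitrariness lives.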

Pretty much the only way to decide if you’ve come up with a useful list is to look at the results and see if they match some reasonable, pre-existing judgment. You don’t even have a proxy for what’s correct (like “wins in a baseball season” to try and validate “player quality”) to see if you’re on the right track.

Given that the validation is basically “Did this match expectations?” it’s always struck me as completely circular to highlight unexpected results as some interesting discovery. Sometimes the stuff is useful for the intended purpose but mostly it’s for forcing you to rethink what’s actually an important factor or realizing how many variables there really are.

Let’s add an additional wrinkle. Suppose the algorithm is known to work well 80% of the time on a validation set. Is it okay if the algorithm produces odd results, like rating North Korea or any of the Stans highly, if it is still reliable most of the time? Can you trust the algorithm to classify future states’ elections correctly (or at least, 80% of the time)? Would you want to highlight any “odd” results?

A possible solution is to try multiple machine learning algorithms and see if they all agree with each other.
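
A sketch of that suggestion (toy data and two hand-rolled classifiers, not any real integrity model): run the classifiers on the same cases and flag the cases where they disagree for manual inspection.

```python
# Toy training data: (restrictions_score, press_freedom) -> integrity label.
train = [((5, 90), "high"), ((8, 85), "high"),
         ((40, 10), "low"), ((35, 15), "low")]

def nearest_neighbor(point):
    """Classifier 1: label of the nearest training example."""
    def dist2(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(train, key=lambda t: dist2(t[0], point))[1]

def threshold_rule(point):
    """Classifier 2: a crude rule based only on press freedom."""
    return "high" if point[1] > 50 else "low"

cases = [(6, 88), (38, 12), (38, 55)]  # the last is deliberately ambiguous
for c in cases:
    votes = [nearest_neighbor(c), threshold_rule(c)]
    status = "agree" if len(set(votes)) == 1 else "DISAGREE -- inspect by hand"
    print(c, votes, status)
```

Agreement between methods is no guarantee of correctness, of course, but disagreement is a cheap flag for the cases that deserve a closer look.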

I’m one of the expert respondents in their panel of experts that they use to produce these measures. And my reaction to taking their survey was that it was garbage. Lots of subjective questions for which the only possible objective answer is “Don’t know.” Things like “Was the 200X election in [country] stolen?” Well, the losers alleged it was, providing no evidence, but all losers in that part of the world do that, and the winners claimed it was on the up and up, and there’s no hard information either way. Maybe it was stolen? Maybe it wasn’t? I have deep case knowledge of this country, but who am I to say? I didn’t personally observe rigging and I can’t prove a negative either. So I put “don’t know.” It’s not clear this is a valid way to get any information to aggregate into a measure of how stolen it was, which is what they used our responses to do…

Basically, my reaction to taking their survey was that they were going to get a lot of variation across countries based on how the “experts” interpreted vague questions and how over-confident the “experts” were in their own knowledge of their cases. Many of the questions are things that are actually — in a strict, literal, factual sense — unknowable because electoral fraud is inherently very difficult to observe. If you read many of these questions literally, the answer is often “who knows?” If you read the questions more figuratively, you can give a more concrete answer. But that interpretation is left up to the respondent. This means they are bad survey questions.

(a) Whose perceptions are these? Should I trust the perception of someone who ranks North Korea highly on several measures of electoral integrity?

(b) What’s the quality of the survey questions? On this blog comment thread we have the testimony of one of the survey participants, someone who you’d judged as an expert, who characterizes the survey as “garbage.”

(c) You yourself tweeted this the other day: “Political scientist: North Carolina ‘can no longer be classified as a full democracy.'” This would seem to be a statement on the actual quality.

If you look at the PEI 3.0 csv file, the average response of their two respondents to the question “Overall how would you rate the integrity of this election on a scale from 1 (very poor) to 10 (very good)?” was 1.5 for North Korea. So it looks like one expert gave a 1 and the other a 2. I think at least part of the problem is that PEI has questions like “Information about voting procedures was widely available.” Information about voting procedures is widely available in North Korea – everyone knows that if you try to vote the wrong way, you will be executed.

I agree with you that the perception of a problem is a problem in this case, but I think there are at least two problems with your approach.

(1) If you’re measuring perceived problems, why are you surveying experts and not the public? Let’s say that the public thinks that voting without ID is a problem, but experts don’t. Does that mean there’s not a perceived problem?

(2) Unless you know whether a problem is real or perceptual, you don’t know what solutions to propose. For example, you personally argue that U.S. perceptions suggest we should shift from a state-based system to a national system, but I’m not sure what the basis for that is.

In particular, if your experts believe that the shift would increase their confidence, but the public at large would lose confidence or remain static, would that affect the proposal, and why?

In terms of the claim that perceived problems of electoral integrity are consistently associated with lower turnout, at micro and macro levels, please do read the research on which this was based, from my Cambridge University Press book Why Electoral Integrity Matters, along with the other research literature from other scholars which has confirmed this observation consistently (Beaulieu, Birch, Van Ham, etc.). The claim rests not simply on the short blog post but on published work in the sub-field, in peer-reviewed journals and books from university presses.

I assume that many of the commenters here are empirical scientists (or at least are interested in empirical stuff). Prof Gelman certainly is! So I’m a little surprised that none of the comments appear to have actually looked at what the data have to say or, for that matter, specifics about the methodology. Sure, the North Korea bit appears to raise a red flag, as do the comments from lewis77. But pause for a moment: red flag or red herring? I would think one would want to give the EIP a fair hearing. What’s good about it? What’s bad about it? As in most scientific endeavors, details matter.

I will offer one specific comment on the EIP report about the US elections. Figure 3 looks like a really clean result: lower (perceived) electoral integrity in states with Republican-controlled legislatures. A very cool result if true! But then look at the data plotted in Figure 4. Yes, I know the data are naturally very noisy in this kind of research. Still, I have a hard time getting excited about that scatterplot or the R^2 = 0.097. So, going back to Figure 3, it makes me wonder: Given that the survey responses are largely from academics or other political scientists, many of whom must have liberal leanings, I wonder if one could tease out the “liberal bias” in the results?
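
For intuition about how weak R^2 = 0.097 is, here is a quick synthetic illustration (random data, nothing to do with the EIP survey): a small true slope buried in noise gives a cloud-like scatterplot and a small R^2, even though the fitted slope is nonzero.

```python
# Fit a least-squares line to noisy synthetic data and compute R^2:
# the share of the variance in y explained by the fitted line.
import random

random.seed(1)

n = 50
x = [random.random() for _ in range(n)]
y = [0.3 * xi + random.gauss(0, 0.3) for xi in x]  # weak signal, much noise

mean_x = sum(x) / n
mean_y = sum(y) / n
sxx = sum((xi - mean_x) ** 2 for xi in x)
sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
slope = sxy / sxx
intercept = mean_y - slope * mean_x

ss_res = sum((yi - (slope * xi + intercept)) ** 2 for xi, yi in zip(x, y))
ss_tot = sum((yi - mean_y) ** 2 for yi in y)
r_squared = 1 - ss_res / ss_tot
print(f"R^2 = {r_squared:.3f}")  # small: most of the variance is noise
```

At R^2 near 0.1, roughly 90% of the variation in the outcome is unexplained by the predictor, which is why the scatterplot in Figure 4 looks like a cloud.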

By the way, you can look for a related podcast from “The State of Things” on WUNC (public radio), Jan 5. Host Frank Stasio interviews Professor Andrew Reynolds, the author of the original op-ed-gone-viral that appeared in the Raleigh newspaper on Dec 22 that started all this. Sorry, podcast link not yet posted as of this writing.

For reasons discussed in my comments here and here, I find it difficult to take these numbers seriously. I don’t think North Korea is a “red herring.” If you have a method that gives ridiculous results, that suggests there’s a problem with the method.