September 12, 2013

No pollster attracts more love and hate than Public Policy Polling. The Democratically aligned polling firm routinely asks questions that poke fun at Republicans, like whether then-Senator Barack Obama was responsible for Hurricane Katrina. Not coincidentally, Republicans routinely accuse them of being biased toward Democrats. Last fall, PPP was front and center in conservative complaints about allegedly skewed polls. But when the election results came in, PPP’s polls were vindicated and the conspiracy-minded critics were debunked.

Pollsters, though, tend to judge one another based more on methodology than record. And for experts and competitors, the firm’s success remains difficult to explain. PPP doesn’t follow many of the industry’s best practices, like calling voters' cell phones; the firm only calls landlines. It discards hundreds of respondents in an unusual process known as “random deletion.” And because PPP's interviewers rely on lists of registered voters—rather than random digit dialing—and simply ask non-voters to hang up the phone, the firm can’t use census numbers to weight their sample, as many other pollsters do. This forces PPP to make more, and more subjective, judgments about just who will be voting.

In PPP’s telling, the Raleigh-based firm overcomes the odds by mastering those subjective judgments, perfecting the art of projecting the composition of the electorate, the same art that eluded Republican pollsters in 2012. If this explanation were satisfying, perhaps PPP could settle into the top-ranked pollster slot without great protest. But PPP’s success, in fact, did not reflect a clairvoyant vision of the electorate. The racial composition of their polls swung wildly. A recent Georgia poll was just wrong.

After examining PPP’s polls from 2012 and conducting a lengthy exchange with PPP’s director, I've found that PPP withheld controversial elements of its methodology, to the extent it even has one, and treated its data inconsistently. The racial composition of PPP’s surveys was informed by whether respondents voted for Obama or John McCain in 2008, even though it wasn’t stated in its methodology. PPP then deleted the question from detailed releases to avoid criticism. Throughout its seemingly successful run, PPP used amateurish weighting techniques that distorted its samples—embracing a unique, ad hoc philosophy that, time and time again, seemed to save PPP from producing outlying results. The end result is unscientific and unsettling.

A SHIFTING ELECTORATE

In an age of racial polarization, small shifts in the racial composition of a poll can mean everything: There was only a 3-point difference in the white share of the electorate between the 2008 election, when Obama won decisively, and the 2010 midterms, when Republicans won a historic landslide. Last year, Pew Research assumed that whites would represent 74 percent of voters, while Gallup assumed whites would make up 78 percent. Pew nailed it. Gallup, having been proven embarrassingly wrong, is now rethinking its entire methodology.

In the past, PPP has said that their ability to divine the composition of the electorate underlies their success. But that composition is all over the map: In PPP’s polling, the white share of the electorate routinely swings by 4 points or more. In PPP’s weekly tracking poll with DailyKos and SEIU, the white share of the electorate once swung from 69 to 76 percent over just one week—basically the difference between an electorate as white as the one that elected George W. Bush in 2004 and an electorate even more diverse than the one that chose President Obama in 2012. Other pollsters don't show similar changes.

These shifts are basically impossible to justify, and even harder to explain. PPP has the discretion to make judgments about the projected composition of the electorate, which they achieve through a highly unusual, one-of-a-kind, even "baffling" process known as “random deletion.”1 PPP “randomly” deletes respondents until the remaining respondents fall into “target ranges” for race, gender, and age based on Census data and prior exit polls.
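To make the mechanics concrete, here is a minimal sketch of what deleting respondents to hit a target might look like. PPP has not published its procedure, so everything here is an assumption for illustration: the function name `random_deletion`, the single target variable (race only), the 1,000-person sample, and the fixed random seed. The 11 percent raw share and 18 percent target echo the Virginia figures Jensen cites.

```python
import random


def random_deletion(respondents, key, group, target_share, seed=0):
    """Randomly drop respondents outside `group` until `group` makes up
    at least `target_share` of the remaining sample (hypothetical sketch)."""
    if not 0 < target_share <= 1:
        raise ValueError("target_share must be in (0, 1]")
    rng = random.Random(seed)  # fixed seed so the deletion can be reproduced
    in_group = [r for r in respondents if r[key] == group]
    others = [r for r in respondents if r[key] != group]
    # Largest number of other respondents we can keep while still meeting the target.
    max_others = int(len(in_group) * (1 - target_share) / target_share)
    if len(others) > max_others:
        others = rng.sample(others, max_others)
    return in_group + others


# Hypothetical raw sample: 1,000 completes, 11 percent of them black.
sample = [{"race": "black"}] * 110 + [{"race": "other"}] * 890
kept = random_deletion(sample, "race", "black", target_share=0.18)
deleted = len(sample) - len(kept)
```

In this toy case, reaching an 18 percent target from an 11 percent raw share means discarding several hundred completed interviews, which is why the choice of target matters so much.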

According to PPP, the exact targets for random deletion are determined by the number of responses by race and ethnicity: When more black voters respond to its surveys, blacks will be a higher share of the electorate. “If we do one Virginia poll where 11 percent of raw responses are black and another where 14 percent are [black] you are likely to see those weighted to 18 percent and 20 percent respectively in a presidential election year,” Tom Jensen, PPP’s director, wrote in an email, explaining how PPP would use the number of black responses to determine just how much to correct for underrepresenting black voters.

This approach is odd. Most pollsters do not assume that there’s a relationship between fluctuations in the number of raw responses and the composition of the electorate, at least not to this extent. I’m not aware of any research that defends this proposition, especially on the basis of a single poll. But although this kind of deletion is unusual, it doesn’t necessarily bias a poll if it is truly random and applied consistently.

But for PPP, that doesn't appear to be the case. Consider two polls of registered voters conducted in consecutive weeks in 2012. One poll was 69 percent white; the other was 76 percent white. Under PPP’s explanation, one would expect the number of non-white responses to be far higher in the first poll than the second. A look at PPP’s raw data shows otherwise.

The non-white share of “completes” was basically equal in each poll, but the racial composition of the two surveys wound up very, very different. To the naked eye, it looks as if there’s something other than changing response levels that influences PPP’s polling.

THE CASE OF THE GEORGIA ELECTORATE

For a pollster like PPP, Georgia is as easy as it gets. Unlike most states, Georgia releases voter turnout and registration data by race and age. That frees pollsters from the need to project turnout based on imprecise and contradictory Census and exit poll data.

Yet earlier this month, a PPP poll ahead of next year’s senatorial election found that Georgia’s “voters,” defined as registered voters who had participated in one of the last three general elections, were 71 percent white and 24 percent black. That is much different from the data from the Georgia secretary of state, which shows that the electorate was 61 percent white and 30 percent black in last November’s election.

That’s an unacceptable amount of variation. Why such a large difference? In a Twitter exchange, PPP defended the composition of its survey by noting that it was following the lesson of the state’s 2008 Senate run-off election, when PPP believes it missed the result because it overestimated the black share of the electorate. The argument was that, with no African-American candidate running, the state’s 2014 turnout would be closer to the 2008 run-off than to the 2010 senatorial race. PPP also noted that in 2006, the last time there was a general election without a black candidate, the electorate was 71 percent white and 24 percent black, just like its survey.

None of these responses withstand scrutiny.

For one thing, the 2008 Senate run-off was a low-turnout election for both black and white voters; there’s no reason to use it as a baseline for general elections, which almost always have higher turnout. Likewise, it’s far-fetched to assume that the presence of a completely uncompetitive black candidate on the 2010 ballot was the reason blacks turned out in significantly higher numbers for a general election than they had for a run-off. And there’s a huge problem with estimating voter turnout based on 2006: Demographic changes over the ensuing seven years have driven the white share of registered voters down from 68 percent in 2006 to 59 percent in 2010.

Instead of pushing back, PPP argued that its electorate yielded a sensible outcome: A properly weighted survey would have shown likely Democratic Senate nominee Michelle Nunn up by an unrealistic 6 points and would have projected Obama and Mitt Romney tied in Georgia in the last presidential election (which Romney, in fact, won by nearly 8 points). But PPP doesn’t deserve bonus points if it tweaked the sample in order to yield a less preposterous result: The simple fact is that PPP’s sample was bad.

If PPP were tweaking its samples to get the result it wanted, there would be little reason to pay attention to its surveys. In a world of accurate polling averages and FiveThirtyEight, we know what the result of a poll ought to be. What’s hard is conducting the high-quality, methodologically sound surveys that allow the models to be as accurate as they are.

When I posed that exact possibility on Twitter, asking “are you saying you weighted to get the result you wanted,” Jensen’s response was terse: “Maybe we’re being over cautious but we try not to make the same mistake twice. That’s all I have to say.”

SHIFTY NUMBERS ELSEWHERE

It wasn't just Georgia. Throughout 2012, PPP's polls displayed the same conveniently self-correcting pattern: More weight went to the groups that pushed PPP’s result closer to the expected outcome. When PPP’s polls otherwise would have shown the president doing worse than expected, there tended to be fewer white voters, which had the effect of bringing PPP closer to expectations.

Consider, for instance, PPP’s daily tracking poll over the final three weeks of the election. Over that period, the white share of the electorate swung between 70 and 74 percent, a gap as wide as the disagreement between Pew and Gallup. On the days when whites represented 70 percent of the electorate, the president was generally performing worse among white voters. Conversely, when whites were 74 percent of the electorate, the president was generally performing better among white voters. Another way to think about it: What would PPP have shown if it held the composition of the electorate constant? When Obama would have done worse, the white share of the electorate dropped and boosted his numbers; when Obama was doing well, the white share of the electorate increased and boosted Romney.
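The arithmetic behind that counterfactual is simple enough to sketch. A poll's topline margin is a share-weighted average of the margins within each group, so one can recompute it with the white share pinned to a single value. The numbers below are invented purely for illustration and are not PPP's data: a good day and a bad day for Obama among white voters, a fixed non-white margin, and a 72 percent constant white share. The function name `topline_margin` is likewise hypothetical.

```python
def topline_margin(white_share, margin_white, margin_nonwhite):
    """Overall margin (in points) as a share-weighted average of group margins."""
    return white_share * margin_white + (1 - white_share) * margin_nonwhite


# Hypothetical Obama-minus-Romney margins, in points. Not real PPP figures.
NONWHITE_MARGIN = 60.0
good_day = {"white_share": 0.74, "margin_white": -20.0}  # whiter sample, better white margin
bad_day = {"white_share": 0.70, "margin_white": -24.0}   # less white sample, worse white margin
CONSTANT_WHITE_SHARE = 0.72  # assumed fixed composition for the counterfactual

# Toplines as published, with the composition moving from day to day.
published = [
    topline_margin(d["white_share"], d["margin_white"], NONWHITE_MARGIN)
    for d in (good_day, bad_day)
]
# Toplines if the composition had been held constant.
constant = [
    topline_margin(CONSTANT_WHITE_SHARE, d["margin_white"], NONWHITE_MARGIN)
    for d in (good_day, bad_day)
]

published_spread = abs(published[0] - published[1])
constant_spread = abs(constant[0] - constant[1])
```

With these made-up inputs, the two published toplines land within half a point of each other, while the constant-composition toplines differ by nearly three points. The shifting composition absorbs most of the movement in the underlying sample.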

The composition of the electorate shouldn’t shift at all for a pollster like PPP. If it does, it shouldn’t shift along with fluctuations in the sample—you might expect the dots on those charts to be scattered randomly and evenly across the grid.

And the effect of this pattern was entirely to PPP’s advantage if the goal were to stay close to the polling averages. On every day of PPP’s tracking poll, PPP’s survey produced a result closer to the average of polls than it would have if the firm had consistently used the composition of the electorate from its final poll.

And it's not just the tracking poll. It's Florida:

It's Ohio:

It's Nevada:

And PPP's home state of North Carolina:

The common denominator: Those are four of the five battleground states where the president was most dependent on non-white voters.2 In all of those states, changes in the composition of the electorate kept PPP closer to the average. For instance, PPP would have shown Romney ahead in Nevada, which no non-partisan pollster found; PPP would have found Romney ahead in Ohio in late October, which no non-partisan, live interview pollster showed; a 5-point drop in the white share of the projected electorate allowed Obama to maintain a narrow October lead in Florida polling, even while the poll had him losing among Florida’s Democratic-leaning Hispanics.3

DELETING THE QUESTIONS

Over the last few weeks, I’ve had a long email exchange with Jensen. I asked him many questions about his methodology, including why there might be a relationship between the preferences of the sample and the composition of the electorate. He first said it was a “coincidence.” But he soon conceded that PPP was using the 2008 vote to help determine the racial composition of the electorate—something they hadn’t previously disclosed:

“We might in a state like say Ohio weight on the lower end of the black scale (10 or 11%) if weighting on the higher end meant we were going to end up with an Obama +8 2008 sample whereas we would be more comfortable with 12 or 13% if that wasn’t going to push it up over Obama +4.”

As far as I’m aware, no other pollster uses the results of the last election the way PPP does.4 Pollsters have stayed away from this approach for good reasons. For one, who knows whether the electorate will end up including Obama and McCain voters in the same proportion as the last election? But even more important, the results are notoriously unreliable: People lie about their vote, or just forget. Simply targeting the past election result shouldn’t work.

In many ways, PPP’s technique has the strengths and weaknesses of weighting by party ID, the widely criticized approach employed by some of PPP’s most vocal critics—the pro-Romney poll “unskewers” of 2012. Like party ID, how you voted in the last election correlates well with how you’ll vote this time. The issue is whether the pollster knows the ratio of former Obama and McCain supporters. Unfortunately, there’s no easy way to know whether the poll should be weighted to an electorate where people say they voted for Obama by 7.2 points or 12 points or nothing at all.

Jensen said that using respondents’ 2008 preferences to project the racial composition of the electorate wasn’t as bad as it looks: “It's not us saying 'Obama's doing well enough against Romney so we don't need to weight African Americans as high,'” he said. “It's 'we don't want to put out a sample that over-represents people who voted for Obama last time and give Republicans something to attack us about.'”

But there’s an obvious problem with that explanation: The detailed releases accompanying PPP surveys rarely included any mention of a question about the 2008 presidential election, so there was nothing for Republicans to attack them about.

It turns out that PPP deleted the question from its public releases. “At some point you get tired of putting out an Ohio poll that’s Obama 48/42 2008 vote and the Republicans attack you,” explained Jensen. But that doesn’t track: It’s a little ridiculous to imagine Jensen, who solicits GOP attacks, deleting questions to avoid the scorn of those same enemies. More importantly, PPP’s decision to delete the question obviated their stated motive for weighting on it: There was obviously no need to weight to avoid Republican attacks on a deleted question.

Jensen finally said that PPP considers the last election in weighting its sample because they think it improves their polls.

It’s troubling enough that PPP was using an undisclosed, poorly executed version of a controversial method based on a deleted question to project the racial composition of the electorate. What’s even stranger is that PPP didn’t use it consistently.

They didn’t use it on their DailyKos/SEIU poll, which did not include a question about the ’08 election, helping to explain why there wasn’t a similarly self-correcting relationship in that poll. (Most of PPP’s public polling is conducted without a sponsor, but PPP also conducted a weekly survey sponsored by the Daily Kos website and the Service Employees International Union.) At first, Jensen said the DailyKos/SEIU poll had the same methodology as PPP’s other surveys. But clearly the absence of the 2008 question made a difference, and, conspicuously, these were also the surveys where PPP released the raw data:

Even in the polls where PPP did ask the 2008 question, Jensen didn’t use it consistently. When asked how he would reconcile the number of non-white responses and 2008 responses, Jensen said “we don’t have any specific rules about what factors trump what when we’re weighting our polls, it’s really just on a poll by poll basis.” Even more bizarrely, Jensen says he would only use the 2008 question if it helped Romney, but not Obama. Jensen said: “If Obama won a state by 4 [in 2008] I’m a lot more comfortable putting out a sample that’s Obama+1 than Obama+7.”

If Jensen believes the 2008 question improves his polls, why wouldn’t he be about as concerned by a poll that showed McCain and Obama tied in Virginia as Obama leading by 12? His reason, of course, is that he’s “sensitive to the constant Republican attacks.” But again, it’s a little tough to imagine that Jensen is so sensitive, especially since deleting the question seems to solve that problem just fine.

Worse still, the record shows that Jensen is, in fact, willing to publish polls showing Obama doing far too well in the past election. The most prominent example came in one of PPP’s biggest meltdowns, the South Carolina special congressional election earlier this year. According to a PPP poll that showed the ultimate victor, Mark Sanford, trailing by a significant margin, the share of voters who had supported Romney last year was five points higher than the share who had supported Obama; in fact, Romney had won the district by 18 points.

Jensen said that was a special case: “there were real world reasons to think it was possible that Republicans were sitting out because of Sanford being the candidate in general and what happened that week in particular.” That was clearly proven wrong, but fair enough. The point is that Jensen’s stated aversion to projecting a pro-Obama electorate only lasts until Jensen thinks there’s a “real world” reason for an exception, whatever that may be. His methodology certainly offers him the flexibility to make an exception whenever he so chooses.

That flexibility manifests through inconsistent random deletion. When confronted with the two spring polls with identical response rates but inconsistent random deletion, Jensen said it was because PPP uses sequential weighting, an amateurish technique where PPP first randomly deletes for race and then weights for age. The problem is that when PPP weights for age, they wind up changing the racial composition of the sample, since young voters are disproportionately non-white. Jensen explained how this influenced those two polls:

“I’m guessing on April 26th I decided I needed to delete more whites in order for the African Americans and Hispanics to be within the target range after I weighted for age whereas on May 3 weighting for age was going to bring up the African Americans and Hispanics by a larger amount so it wasn’t necessary to delete whites to get in range because the age weighting was going to do it.”

This explanation doesn’t cut it. The final, post-age weighted numbers were still extremely far apart, even though they ought to be very similar, since there was only a very minor difference in the non-white share of initial responses. So if the differences in random deletion were indeed in anticipation of the age weights, Jensen's fly-by-night effort was well off target.
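The interaction Jensen describes, where weighting for age after fixing race moves the racial composition again, can be shown with a small worked example. This is only a sketch under stated assumptions: the four cell counts, the 30 percent target for young voters, and the helper name `share` are all invented for illustration, and PPP's actual age categories and targets are not given in this article.

```python
def share(cells, weights, predicate):
    """Weighted share of respondents whose (age, race) cell satisfies `predicate`."""
    total = sum(n * weights[age] for (age, race), n in cells.items())
    part = sum(
        n * weights[age] for (age, race), n in cells.items() if predicate(age, race)
    )
    return part / total


# Hypothetical sample AFTER random deletion for race: 28 percent non-white.
cells = {
    ("young", "white"): 100,
    ("young", "nonwhite"): 100,
    ("old", "white"): 620,
    ("old", "nonwhite"): 180,
}
unweighted = {"young": 1.0, "old": 1.0}

# Step two of the sequence: weight age groups up or down to an assumed target.
TARGET_YOUNG = 0.30
young_raw = share(cells, unweighted, lambda age, race: age == "young")
age_weights = {
    "young": TARGET_YOUNG / young_raw,
    "old": (1 - TARGET_YOUNG) / (1 - young_raw),
}

nonwhite_before = share(cells, unweighted, lambda age, race: race == "nonwhite")
nonwhite_after = share(cells, age_weights, lambda age, race: race == "nonwhite")
young_after = share(cells, age_weights, lambda age, race: age == "young")
```

Because the young respondents in this toy sample are disproportionately non-white, bringing the young share up to its target raises the non-white share from 28 percent to almost 31 percent, even though race was supposedly settled in the first step.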

PPP's philosophy is simple, but disconcerting: Get the result right. Jensen emphasized that “the only thing we’re trying to accomplish when weighting our polls is to accurately tell people what would happen if the election was today.” He reiterated that his polling was accurate, that his clients were satisfied. That’s much like how PPP rationalized its Georgia poll on Twitter. The problem is that it’s very difficult to distinguish these statements from weighting toward a desired result.

Of course, it's impossible to prove that PPP weights toward a desired result. But what’s so troubling is that it’s totally possible. No other pollster employs a truly ad hoc approach, with the flexibility to weight to whatever electorate it chooses, while allowing the composition of the electorate to fluctuate based on the inconsistent and subjective application of controversial or undisclosed metrics. In PPP's own words, they “don’t have any specific rules” about how to weight from poll to poll. And within that framework, it's a little absurd that PPP reserves the power to wield a weighting bludgeon, like considering the last election, whenever and however the firm feels like it. And while that inconsistency prevents anyone from proving that they weight toward a desired result, it also prevents PPP from proving that it does not.

Reproducibility is at the heart of the scientific method: If I conduct an experiment and produce a certain result, and someone can’t do the same experiment and get the same outcome, there should be serious doubts about my finding. Incredibly, PPP’s methodology is so inconsistent that it’s not even clear it could replicate its own results with the same raw data. That should raise red flags.

Public pollsters have a responsibility to be transparent, as The Huffington Post's Mark Blumenthal has repeatedly and persuasively argued. The reason is simple: Transparency is critical to trust. Anyone can publish a bunch of made-up numbers and call it a poll, or weight a bad sample to match the FiveThirtyEight model. Yesterday, PPP implicitly acknowledged that the results of other public polls influence their own confidence in a survey, including their decision to release a poll. This seemed to confirm a study by two political scientists that suggested automated pollsters, like PPP, are more accurate when a live interview pollster has already conducted a survey.5 So there are plenty of reasons why PPP's credibility is at stake.

This is not an abstract concern. Over the last few years, two well-known pollsters shut down after getting caught cooking their numbers. And although PPP has been transparent about releasing cross-tabs and even raw data, Blumenthal is correct in arguing that the accuracy of pollsters doesn’t just depend on “the process used to collect the data, but also the methods used to transform non-representative raw data into the final product. We need full disclosure of both.”

To be fair, even the best pollsters aren’t perfectly transparent. Perhaps that’s especially so when constructing and tinkering with likely voter models.6 And it’s also possible that PPP would still be a decent pollster if it used a more defensible approach. But PPP’s opacity and flexibility go too far. In employing amateurish weighting techniques, withholding controversial methodological details, or deleting questions to avoid scrutiny, the firm does not inspire trust. We need pollsters taking representative samples with a rigorous and dependable methodology. Unfortunately, that’s not PPP.

1

If anything, the composition of the electorate in PPP’s polls should be even more stable than in those of other public pollsters, who use random digit dialing to contact a random sample of adults and then weight it to Census targets for race, age, gender, and geography. Then, most public pollsters exclude non-registered and unlikely voters, which could result in swings in the composition of the electorate, depending on how many adults say they’re registered or unlikely to vote in any given survey. PPP, by contrast, only calls from lists of registered voters, which are often flagged with reasonably accurate data on age, race, and gender. And since it never weights to Census targets for adults, PPP weights (or “randomly deletes”) to whatever target it wants. For a pollster like PPP, fluctuations in the composition of the electorate represent a deliberate choice.

2

The fifth state is Virginia, where the president’s margin among white voters was within 1 point of the median in all but one poll over the final four months of the race, suggesting that there wasn’t enough fluctuation in the underlying sample to generate corresponding shifts in the composition of the electorate.

3

I complained about this at the time and, in many respects, it was this particular survey that first led me to mull whether there was something going on with PPP's weighting.

4

The few pollsters who consider the past election don’t simply weight to the result, like PPP. Nor do they let the last election influence the racial composition of the sample. Instead, they might weight to the average response on the 2008 ballot question from their past surveys, while simultaneously considering other variables, like race.

5

In fairness, PPP sometimes outperforms the field. In Colorado, for instance, PPP consistently and rightly showed Obama ahead by a comfortable margin, when most pollsters didn't. On the other hand, isn't it possible that PPP "didn't want to make the same mistake twice" and elected to show Obama doing better in Colorado, a state where the polls historically and systematically underestimate Democratic candidates? And some of PPP's worst showings, like the South Carolina special congressional election, have come when PPP had the field to itself.

6

On the other hand, pollsters make undisclosed assumptions about the level of turnout, or the measures that best model turnout at any point during the election. And there’s reason to think that pollsters test these assumptions and tinker between polls. But the record suggests, in my view, that these fluctuations pale in comparison to those with PPP.