State Similarity Scores

For most of you who haven’t followed my baseball work, I am best known for inventing a forecasting system called PECOTA, which generates predictions by comparing baseball players with a large database of historical peers and identifying the most similar ones. This same technology, which is really just a variant of nearest-neighbor analysis, can be applied to virtually anything, including identifying the similarity of any two states along a number of dimensions of political salience. In fact, that’s exactly what I’ve done in the chart below, with each state listed along with its three most similar states.

What factors go into the similarity score? There are quite a few, and each is weighted in rough proportion to its importance, as measured by an analysis of variance, in determining the Kerry-Bush result in 2004 and the McCain-Obama polling this year.
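For readers who want to see the mechanics, here is a minimal sketch of a weighted nearest-neighbor comparison. It is not the actual model: the state names, the two features (`urban_pct`, `median_age`), their values, and the weights are all made up for illustration, and the weighted Euclidean distance on standardized values is just one reasonable assumption about how such a comparison could work.

```python
import math

# Hypothetical placeholder data: names, features, and values are invented.
STATES = {
    "A": {"urban_pct": 80.0, "median_age": 36.0},
    "B": {"urban_pct": 78.0, "median_age": 37.0},
    "C": {"urban_pct": 45.0, "median_age": 41.0},
    "D": {"urban_pct": 50.0, "median_age": 40.0},
    "E": {"urban_pct": 65.0, "median_age": 30.0},
}

# Hypothetical weights; the real ones would come from the variance analysis.
WEIGHTS = {"urban_pct": 2.0, "median_age": 1.0}


def standardize(states):
    """Convert each feature to z-scores so that units do not dominate."""
    features = list(next(iter(states.values())))
    stats = {}
    for f in features:
        vals = [s[f] for s in states.values()]
        mean = sum(vals) / len(vals)
        sd = math.sqrt(sum((v - mean) ** 2 for v in vals) / len(vals))
        stats[f] = (mean, sd)
    return {
        name: {f: (s[f] - stats[f][0]) / stats[f][1] for f in features}
        for name, s in states.items()
    }


def weighted_distance(a, b, weights):
    """Weighted Euclidean distance between two standardized profiles."""
    return math.sqrt(sum(w * (a[f] - b[f]) ** 2 for f, w in weights.items()))


def nearest_neighbors(name, states, weights, k=3):
    """Return the k states closest to `name`, nearest first."""
    z = standardize(states)
    ranked = sorted(
        (weighted_distance(z[name], z[other], weights), other)
        for other in z
        if other != name
    )
    return [other for _, other in ranked[:k]]


print(nearest_neighbors("A", STATES, WEIGHTS))
```

Standardizing first keeps a feature measured in large units from swamping one measured in small units; the weights then decide how much each dimension counts.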

The highest score theoretically achievable is 100, for two states that are identical along each of these 19 dimensions. The highest score in practice is 71, between North and South Carolina. A score of 0 represents states that are as dissimilar as they are similar, and negative scores are both possible and quite common (though I list them as zeroes in the chart).
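The exact formula behind the score is not given here, but the scale described above can be illustrated with a small sketch. The assumption, and it is only an assumption, is that a distance between two states is rescaled against some reference distance (say, the average distance between all pairs of states), so that identical states score 100, a pair at the reference distance scores 0, and pairs farther apart than that go negative. The function names and the `reference_distance` parameter are hypothetical.

```python
def similarity_score(distance, reference_distance):
    """Map a distance onto a scale where 0 distance -> 100 and the
    reference distance -> 0 (assumed scaling, not the actual formula)."""
    return 100.0 * (1.0 - distance / reference_distance)


def displayed_score(score):
    """Negative scores are shown as zero when listed."""
    return max(0.0, score)


# Identical states, a pair at the reference distance, and a more distant pair.
for d in (0.0, 2.0, 3.0):
    raw = similarity_score(d, reference_distance=2.0)
    print(d, raw, displayed_score(raw))
```

Under this scaling a negative score simply means two states are farther apart than the reference pair, which is consistent with negative values being common.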

Note that some states really aren’t like any other states at all, including big ones like Florida and Texas and small ones like Alaska, Utah, and New Mexico. Then there are other states that are sort of within the main sequence but need to pull from different regions — like Indiana, whose three most similar states span the Midwest (Ohio), South (North Carolina) and the Prairies (Kansas).

And yes, this does have implications for our model, which will become clear at some point soon.

Nate Silver is the founder and editor in chief of FiveThirtyEight. @natesilver538