Clustering U.S. Senators using roll call voting data

For our forthcoming book on machine learning for hackers, John Myles White and I will discuss clustering and various methods for doing it. One common method for clustering observations is multidimensional scaling (MDS). For those who are not familiar, MDS takes a matrix of distances, or similarity measures, among observations and assigns each observation coordinates in some N-dimensional space based on those distances. The items that are most similar will be located closest together in that space.
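To make that description concrete, here is a minimal sketch of classical (metric) MDS in Python with NumPy. It is not the code behind the plots in this post; the function name `classical_mds` and the toy distances are assumptions made up for illustration, and other MDS variants exist.

```python
import numpy as np

def classical_mds(distances, n_components=2):
    """Embed points in n_components dimensions from a symmetric distance matrix."""
    d = np.asarray(distances, dtype=float)
    n = d.shape[0]
    # Double-center the squared distances to recover an inner-product matrix.
    centering = np.eye(n) - np.ones((n, n)) / n
    gram = -0.5 * centering @ (d ** 2) @ centering
    # eigh returns eigenvalues in ascending order, so pick the largest ones.
    eigenvalues, eigenvectors = np.linalg.eigh(gram)
    top = np.argsort(eigenvalues)[::-1][:n_components]
    # Clip tiny negative eigenvalues caused by rounding or non-Euclidean input.
    scale = np.sqrt(np.clip(eigenvalues[top], 0.0, None))
    return eigenvectors[:, top] * scale

# Toy example (hypothetical data): four points on a line at 0, 1, 2 and 10.
positions = np.array([0.0, 1.0, 2.0, 10.0])
toy_distances = np.abs(positions[:, None] - positions[None, :])
coords = classical_mds(toy_distances, n_components=2)
```

In this toy case the first three points land close together in the output and the fourth lands far away, which is the "most similar items are closest" behavior described above.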

I do not study American politics, but I participate in it, so when the opportunity presents itself I am always eager to dig into the data. One way to measure the distance among legislators in the U.S. Congress is to check how similar their voting records are. Thankfully, political scientist Keith Poole makes all historical roll call data available online in machine-readable format! For our chapter on clustering, John and I will be going into this data in much greater detail, but I thought it would be fun to share a small portion of the results here.
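As a rough illustration of turning voting records into distances, the sketch below computes Euclidean distances between rows of a small vote matrix. The three legislators, their votes, and the +1/-1/0 coding are all hypothetical assumptions; they are not drawn from the actual roll call files or from the analysis in this post.

```python
import numpy as np

# Hypothetical toy roll calls: rows are legislators, columns are votes.
# Assumed coding (for illustration only): +1 = yea, -1 = nay, 0 = did not vote.
votes = np.array([
    [ 1,  1, -1, -1,  1],   # legislator A
    [ 1,  1, -1, -1, -1],   # legislator B: differs from A on one vote
    [-1, -1,  1,  1, -1],   # legislator C: opposes A on every vote
])

def voting_distances(vote_matrix):
    """Euclidean distance between every pair of rows (legislators)."""
    v = np.asarray(vote_matrix, dtype=float)
    diff = v[:, None, :] - v[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))

distances = voting_distances(votes)
```

Here A and B end up much closer to each other than either is to C. A matrix like `distances` is the kind of input an MDS routine expects.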

Above are two sets of MDS plots for the 110th and 111th U.S. Senates. In the left column are the plots with points, and on the right are the same plots with the senators’ surnames replacing those points. I used a two-dimensional MDS for this data, so those senators closest on the x- and y-axes have the most similar voting records. The colors of the points correspond to each senator’s political party in the typical way. Because there is a high degree of overlap in the names plots in the right column, the points in the left column are included for reference.

What is particularly convenient about using MDS for U.S. voting data is that the x-axis can be interpreted as a general ideological scale, with the most liberal and most conservative senators falling at the extremes of the left and right, respectively. It is a bit more difficult to interpret the y-axis, other than that it separates senators with similar ideology along another dimension. If you are familiar with the U.S. Senate, there are lots of interesting tidbits that can be gleaned from this simple analysis.

First, if you had any doubt as to the partisan nature of the U.S. Senate, then this analysis should go a long way toward convincing you. There appears to be quite a large gap between Democrats and Republicans. The data also confirm that those senators often thought to be most extreme are in fact outliers. We can see Sen. Sanders, the Independent from Vermont, at the far left, and Sen. Coburn, he who despises political science, at the far right. Likewise, Sens. Collins and Snowe are tightly clustered at the center for the 111th Congress. As you may recall, it was these moderate Republican senators whose votes became the linchpin for the passage of the healthcare reform legislation.

Another interesting feature of the analysis is the positions of then-Sen. Obama and Sen. McCain in the 110th Senate. Obama appears singled out in the upper-left quadrant of the plot, while McCain is clustered with Sens. Wicker and Thomas closer to the center. While it might be tempting to interpret this as meaning the two senators had very complementary voting records, given the nature of the data’s coding it is more likely a result of the two missing many of the same votes due to campaigning. That is, when they did vote on the same bill they were relatively, though not extremely, different (x-axis), but when they missed votes it was often on the same legislation. Of course, this raises the question: what excuse do Wicker and Thomas have?
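A small sketch can show why shared absences matter. It assumes, purely for illustration, that a missed vote is coded as 0 between yea (+1) and nay (-1); the four records below are invented and the actual coding of the roll call data may differ.

```python
import numpy as np

# Assumed coding for illustration only: +1 = yea, -1 = nay, 0 = missed vote.
# All four records are hypothetical.
records = {
    "present_yea": np.array([ 1,  1,  1,  1,  1,  1]),
    "present_nay": np.array([-1, -1, -1, -1, -1, -1]),
    "absent_yea":  np.array([ 1,  1,  0,  0,  0,  0]),   # missed the last four votes
    "absent_nay":  np.array([-1, -1,  0,  0,  0,  0]),   # missed the same four votes
}

def distance(a, b):
    """Euclidean distance between two named voting records."""
    return float(np.linalg.norm(records[a] - records[b]))

# Both pairs disagree on every vote that both members actually cast.
full_gap = distance("present_yea", "present_nay")
absent_gap = distance("absent_yea", "absent_nay")
```

Under this assumed coding, `absent_gap` is smaller than `full_gap`: the two frequent absentees look closer to each other than the two full-time voters do, even though they never agreed on a vote they both cast.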

There are many more things that could be said about this data, and I would love to hear from you, especially those of you who study this stuff for a living! Finally, if you are interested in the code for this analysis, unfortunately you will have to wait for the book’s publication. The good news is that by then there will be much more to it.