I decided to use scikit-learn to try to build a decision tree that predicts the political party (democrat or republican) of a voter based on that voter's 16 yes/no votes. This is a classification task, which falls under supervised learning.

The data

Let's look at the data we have to work with. It comes from UCI's machine learning repository, which provides a bunch of high-quality data sets that are useful for testing out machine learning algorithms.

There are 435 rows of data, but some of them contain a ?, which indicates that the person abstained from voting on that measure. Once we remove any incomplete rows, we are left with 232 rows. That should be enough to create a decision tree.
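The cleaning step is simple enough to sketch. This is not the exact code from the post; it shows the same filtering on a tiny inline sample standing in for the full house-votes-84.data file:

```python
import csv
import io

# Two sample records in the file's format: party first, then 16 y/n votes,
# with '?' marking an abstention. The real file has 435 such rows.
sample = (
    "republican,n,y,n,y,y,y,n,n,n,y,?,y,y,y,n,y\n"
    "democrat,y,y,y,n,n,n,y,y,y,n,n,n,y,n,y,y\n"
)
rows = list(csv.reader(io.StringIO(sample)))

# Keep only records with no '?' anywhere -- the "complete" rows.
complete = [row for row in rows if "?" not in row]
print(len(rows), len(complete))  # 2 rows in, 1 complete
```

Run against the real file (by passing `open("house-votes-84.data")` to `csv.reader` instead), the same filter takes 435 rows down to 232.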

Validation

But wait, how will we know if our decision tree is any good? We should split the 232 rows of good data equally into two sets: our training set and our validation set. In other words, we'll build the decision tree with 116 data points of input and then check if our decision tree accurately predicts the political party of the remaining 116 voters.

Reading the data into scikit-learn

Let's describe our data using machine learning terms. We have 16 features (the different voting measures, ex: handicapped-infants, water-project-cost-sharing, adoption-of-the-budget-resolution) and 2 classes (political party ex: republican, democrat). We had 435 data points, but some had missing values, so we ended up with 232 instances of useful data.

By convention, scikit-learn takes a 2D array X, sparse or dense, of size [n_samples, n_features] holding the training samples, and an array Y of integer values, size [n_samples], holding the class labels for the training samples.
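Concretely, that means mapping each y/n vote to 1/0 and each party name to an integer label. A minimal sketch (shortened to 3 votes per row for illustration; the real rows have 16):

```python
# `complete` stands in for the cleaned rows: party name first, then votes.
complete = [
    ["republican", "n", "y", "n"],
    ["democrat", "y", "n", "y"],
]

# Encode votes as 0/1 features and parties as integer class labels.
party_to_int = {"democrat": 0, "republican": 1}
X = [[1 if vote == "y" else 0 for vote in row[1:]] for row in complete]
Y = [party_to_int[row[0]] for row in complete]

print(X)  # [[0, 1, 0], [1, 0, 1]]
print(Y)  # [1, 0]
```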

Predicting

While I was reading the data in, I separated every other record into a set of validation data. I named the 2D array of votes X_unseen and the 1D array of classes (democrat, republican) Y_unseen to indicate that the machine learning algorithm never looks at them.
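The fit-and-score step looks roughly like this. The data below is a synthetic stand-in (with vote 3 made perfectly predictive of party) rather than the real voting records, so the printed accuracy here is illustrative, not the post's 96%:

```python
import random
from sklearn.tree import DecisionTreeClassifier

random.seed(0)
# Toy stand-in for the cleaned voting records: 16 random binary votes per
# voter, except vote 3, which is forced to match the party label.
X_all, Y_all = [], []
for party in [0, 1, 1, 0] * 10:
    votes = [random.randint(0, 1) for _ in range(16)]
    votes[3] = party
    X_all.append(votes)
    Y_all.append(party)

# The post's split: every other record becomes unseen validation data.
X_train, Y_train = X_all[0::2], Y_all[0::2]
X_unseen, Y_unseen = X_all[1::2], Y_all[1::2]

clf = DecisionTreeClassifier(random_state=0)
clf.fit(X_train, Y_train)
accuracy = clf.score(X_unseen, Y_unseen)
print(accuracy)
```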

That's a 96% success rate of accurately predicting the political party of a voter based on his or her 16 votes. Pretty neat.

What might be more interesting is plotting the success rate based on the number of training samples. Let's give that a go:
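One way to build that curve is to refit on ever-larger prefixes of the training set and score each fit against the unchanged validation set. Again using synthetic stand-in data (vote 3 forced to match the party), since the real rows come from house-votes-84.data:

```python
import random
from sklearn.tree import DecisionTreeClassifier

random.seed(1)
# Synthetic stand-in rows: 16 binary votes, vote 3 mirrors the party.
X_all, Y_all = [], []
for party in [0, 1, 1, 0] * 15:
    votes = [random.randint(0, 1) for _ in range(16)]
    votes[3] = party
    X_all.append(votes)
    Y_all.append(party)

X_train, Y_train = X_all[0::2], Y_all[0::2]
X_unseen, Y_unseen = X_all[1::2], Y_all[1::2]

# Refit on growing prefixes of the training set; score each on the same
# held-out validation set.
sizes = list(range(2, len(X_train) + 1, 2))
scores = []
for n in sizes:
    clf = DecisionTreeClassifier(random_state=0).fit(X_train[:n], Y_train[:n])
    scores.append(clf.score(X_unseen, Y_unseen))

for n, s in zip(sizes, scores):
    print(n, s)
# plt.plot(sizes, scores) would draw the accuracy-vs-sample-size curve.
```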

As it turns out, we only need to train the classifier on a relatively small sample size to get >90% accuracy.

Examining the decision tree

We can go further and look at the decision tree that is created. This can be used to determine which of the 16 votes are the most polarizing. In other words, if you had to classify a voter as a Republican or a Democrat based only on a few votes, what would they be?

I've annotated some of the house measures in the decision tree above. You can see that out of the 116 instances, if a voter said no to the physician fee freeze (or in other words X[3] < 0.5), the voter is a democrat. If they approved the physician fee freeze, there are some other votes we need to examine (such as the synfuels corporation cutback) before we can predict their political party.
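If you don't want to render the tree as an image, scikit-learn's export_text gives a plain-text version of the same structure. A sketch on four hand-made rows in which vote index 3 (physician-fee-freeze in the real data) fully determines the party, mirroring the split described above:

```python
from sklearn.tree import DecisionTreeClassifier, export_text

# Hypothetical feature names; only index 3 matters in this toy example.
feature_names = ["vote-%d" % i for i in range(16)]
feature_names[3] = "physician-fee-freeze"

X = [
    [0] * 16,                 # votes[3] = 0 -> democrat
    [1, 1, 1, 0] + [1] * 12,  # votes[3] = 0 -> democrat
    [0, 0, 0, 1] + [0] * 12,  # votes[3] = 1 -> republican
    [1] * 16,                 # votes[3] = 1 -> republican
]
Y = ["democrat", "democrat", "republican", "republican"]

clf = DecisionTreeClassifier(random_state=0).fit(X, Y)
# The root split lands on X[3], i.e. physician-fee-freeze <= 0.50.
print(export_text(clf, feature_names=feature_names))
```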

Try it yourself

Here's the complete code I used. The file house-votes-84.data can be found here.