Classifying Data with Discriminant Analysis

Cluster analysis methods have been gaining popularity as a way of Relating pieces of data in large datasets with one another. Examples in social networking are obvious: friends on Facebook cluster into cliques and communities, which cluster into even larger groups. Demographics and other marketing research can also be aided by sorting prospective customers into groups based on preference.

When the clusters are known, and ample training data is available, discriminant analysis is particularly effective at classifying new data. Discriminant analysis methods are built into the R programming language (something we’ve discussed a bit in the past) with a standard package. However, R can be cumbersome to use by itself (and the syntax still seems a bit bizarre to me personally), so I used Rinruby, a gem which gives direct access to R methods and data, to put a nice Ruby wrapper around it. Below is some example code that analyzes a well-known clustering test dataset, the Fisher iris data.

Say our iris data is contained as an array (“training_rows”) of hashes that each look like

(where the species is given as 1, 2, or 3). We can load our data into a DiscriminantAnalysis instance and start predicting in just a few lines:

Every prediction comes as a hash with a confidence score to check the quality of the classification. This code uses a linear analysis (i.e., it separates classes by lines or planes); the pretty scatterplot up top was generated using the iris data with quadratic analysis, which can be achieved in the code above by simply replacing “init_lda_analysis” with “init_qda_analysis”.

R allows for tons of manipulation of the analysis once it’s loaded, some of which has been built in to the DiscriminantAnalysis class (scatterplots, accuracy and significance testing, etc.).

Since we’re all about contributing to the open source community at Highgroove, I’ve packaged these methods into a gem called harlequin (cheesily enough after one of the irises in Fisher’s data). If you have a working copy of R on your machine, you can get the code above running by just adding the proper requires for the gem. Feel free to check it out and help to make it more awesome!

Have you used clustering methods before? How do you deal with heavy-duty data processing in Ruby?