CS 251: Assignment #8

Machine Learning

Due Monday 17 April 2017

The goal of this week's project is to build two simple classifiers
that can be trained from data. In particular, you will implement a
Naive Bayes classifier and a K-nearest-neighbor (KNN) classifier. Once
they are working, build some tools for evaluating the outputs and use
your visualization app to look at the results.

Tasks

Write the two functions in the Classifier parent class for
creating and printing a confusion matrix. The confusion_matrix
method should build a numpy matrix showing the number of data
points in each true category that were classified as each output
category. The confusion_matrix_str method should convert the
matrix into a string that prints as a readable, aligned table.
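
As a starting point, here is a minimal sketch of the two functions,
written as standalone functions rather than Classifier methods. The
argument names, the optional num_classes parameter, and the choice of
rows for true categories and columns for output categories are all
assumptions; adapt them to your own Classifier class. It also assumes
the labels are integers 0..C-1.

```python
import numpy as np


def confusion_matrix(true_cats, class_cats, num_classes=None):
    """Count how often each true category was assigned each output category.

    Rows index the true category, columns index the classifier output.
    Both inputs are assumed to hold integer labels in 0..C-1.
    """
    true_cats = np.asarray(true_cats, dtype=int).ravel()
    class_cats = np.asarray(class_cats, dtype=int).ravel()
    if num_classes is None:
        num_classes = int(max(true_cats.max(), class_cats.max())) + 1
    cmtx = np.zeros((num_classes, num_classes), dtype=int)
    # np.add.at accumulates correctly even when (row, col) pairs repeat
    np.add.at(cmtx, (true_cats, class_cats), 1)
    return cmtx


def confusion_matrix_str(cmtx):
    """Format the matrix as aligned text with labeled rows and columns."""
    n = cmtx.shape[0]
    # column width: widest count or widest label, plus padding
    width = max(len(str(int(cmtx.max()))), len(str(n - 1))) + 2
    header = "true/out".ljust(10) + "".join(str(j).rjust(width) for j in range(n))
    lines = [header]
    for i in range(n):
        cells = "".join(str(int(cmtx[i, j])).rjust(width) for j in range(n))
        lines.append(str(i).ljust(10) + cells)
    return "\n".join(lines)
```

A perfect classifier produces a matrix whose only nonzero entries are
on the diagonal, so off-diagonal counts show which categories are
being confused with each other.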

Write a Python function, probably in a new file, that does the
following.

Reads in a training set and its category labels, possibly as a
separate file.

Reads a test set and its category labels, possibly as a
separate file.

Builds a classifier using the training set.

Classifies the training set and prints out a confusion matrix.

Classifies the test set and prints out a confusion matrix.

Writes out a new CSV data file with the test set data and the
categories as an extra column. Your visualization application
should be able to read this file and plot it with the
categories as colors.

You will want to be able to use either the Naive Bayes or the
KNN classifier for this task. You can create two files, or you
can let the user select one or both classifiers from the command
line.
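
The final step above, writing the test data out with the categories
as an extra column, might look something like the sketch below. The
function names, the "category" column header, and the plain
one-header-row layout are assumptions; if your visualization
application expects additional header rows, add them before the data
rows. The sketch assumes Python 3 and a 2D numeric data array.

```python
import csv
import io

import numpy as np


def format_with_categories(headers, data, cats, cat_header="category"):
    """Return CSV text holding the data columns plus one category column."""
    data = np.asarray(data)
    cats = np.asarray(cats).ravel()
    if data.shape[0] != cats.shape[0]:
        raise ValueError("need exactly one category per data row")
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    # header row: original column names followed by the new column
    writer.writerow(list(headers) + [cat_header])
    for row, cat in zip(data.tolist(), cats.tolist()):
        writer.writerow(row + [int(cat)])
    return buf.getvalue()


def write_with_categories(filename, headers, data, cats):
    """Write the same CSV text to a file."""
    with open(filename, "w") as fp:
        fp.write(format_with_categories(headers, data, cats))
```

Keeping the formatting separate from the file writing makes it easy
to check the output on a tiny example before running it on a full
data set.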

Run the above code on the original Activity Recognition data
set. Then run it again on the PCA-transformed version of the data
set. Include the confusion matrices in your writeup and note any
significant differences.

Plot the activity recognition data set using the first three PCA
axes and use color to show the output labels of the
classifier. Include this image in your writeup.

Repeat the above two exercises on a data set of your choice other
than the Iris and Activity Recognition data sets.

Extensions

Try variations on the training data or the classifiers and compare
performance on the Activity Recognition data set. For example:

Use more or fewer PCA dimensions.

Compare using clustering versus the entire data set for the KNN
classifier.

Compare using different numbers of exemplars per class for the KNN
classifier.

Compare using different numbers of neighbors in the distance sum
for the KNN classifier.

Compare using different distance metrics.

Use a method other than K-means clustering to select a subset of
exemplar points for KNN classification.

Implement a different type of classifier.

Explore more data sets.

Integrate machine learning analysis into your GUI. Be very careful
and intentional if you do this extension. Think for a while about
your design before writing a single line of code to implement it.
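
For the KNN extensions above that vary the number of neighbors or
the distance metric, it helps to make both of them parameters. The
following is a rough sketch, not the required design: it assumes the
exemplars are stored as one array per class and that a point is
assigned to the class whose K nearest exemplars have the smallest
summed distance. The names knn_classify, euclidean, and manhattan are
made up for this example.

```python
import numpy as np


def euclidean(a, b):
    """Pairwise L2 distances between rows of a (n, d) and rows of b (m, d)."""
    return np.sqrt(((a[:, None, :] - b[None, :, :]) ** 2).sum(axis=2))


def manhattan(a, b):
    """Pairwise L1 distances between rows of a (n, d) and rows of b (m, d)."""
    return np.abs(a[:, None, :] - b[None, :, :]).sum(axis=2)


def knn_classify(exemplars, points, k=3, metric=euclidean):
    """Return one class index per point.

    exemplars is a list with one (n_c, d) array per class. For each
    class, the distances to its k nearest exemplars are summed, and the
    class with the smallest sum wins. A class with fewer than k
    exemplars sums over fewer distances, which biases the comparison.
    """
    points = np.asarray(points, dtype=float)
    sums = np.zeros((points.shape[0], len(exemplars)))
    for c, ex in enumerate(exemplars):
        dist = metric(points, np.asarray(ex, dtype=float))
        dist.sort(axis=1)  # nearest exemplars first, per point
        sums[:, c] = dist[:, :k].sum(axis=1)
    return np.argmin(sums, axis=1)
```

With this shape, comparing settings is a loop over values of k and
over metric functions, building a confusion matrix for each
combination.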

Writeup

Make a wiki page for the project report.

Write a brief summary of your project that describes the
purpose, the task, and your solution to it. The summary should
be 200 words or fewer.

Write a brief description of how you implemented the two
classifiers and the results on the test data sets.