Introduction to Artificial Intelligence: Programming assignment 2

Assigned: Apr. 11
Due: Apr. 25

Overview of the assignment

This is actually an experimental assignment, rather than a programming
assignment. The assignment is to run three machine learning algorithms ---
linear regression, Naive Bayes, and a decision tree algorithm ---
over a given data set and compare the accuracy of the output for
different sizes of training sets.

The
WEKA system
is a package of machine learning resources developed by
Ian Witten and his research group at the University of Waikato (New Zealand).
The code is written in Java and is available on the Web. (The link above
points to my local copy. The original package, if you want to get it, is
here.)

There are three main types of resources available in WEKA:

Code to do classification learning, using a variety of machine learning
algorithms. These are called classifiers.

Code to preprocess data sets prior to learning. These are called
filters.

Data sets. These are text files in a format called ARFF: the file starts
with a @RELATION line and a series of @ATTRIBUTE declarations, one per
attribute, followed by a @DATA line. The remaining lines are data. Each line
is a single instance, with the values of the attributes, in the same order
as the @ATTRIBUTE statements, separated by commas.
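For example, a tiny ARFF file, made up here purely to show the format (the
actual auto-mpg files have more attributes and many more instances), would
look like this:

@RELATION toy
@ATTRIBUTE mpg NUMERIC
@ATTRIBUTE weight NUMERIC
@ATTRIBUTE origin {1,2,3}
@DATA
18.0,3504,1
24.0,2130,2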

Running the code

You can find all of the WEKA material in the
weka directory
in the class Web site. If you want to run it on your own machine, then
1. Download the file
weka-3-0-2.jar
2. Expand this using the command "jar xvf weka-3-0-2.jar".
3. Further expand the new file "weka.jar" using the command
"jar xvf weka.jar". (The README file says to use "jar -jar weka.jar", but
that didn't work for me.)
Or, instead of downloading the whole package, you can just download the
files you need for this assignment (a small subset of the entire directory).
4. Download the training set
auto-mpg.arff and the test set
auto-mpg-test.arff
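
Expanding weka.jar produces a directory named weka containing the compiled
class files; the directory that contains this weka directory has to be on
your Java class path. A quick sanity check (the exact wording of the error
message may vary with the WEKA version) is

java -classpath <directory containing the expanded weka folder> weka.classifiers.NaiveBayes

If the class path is right, WEKA itself will complain that no training file
was given and print its list of options; if the class path is wrong, Java
will complain that it cannot find the class.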

Alternatively, you can run the code from an ACF5 account by changing
directory

cd /home/e/ed1/weka/weka-3-0-2

and then running the code as described below.

You can run the code from an account on the Sun system
by changing directory

cd /usr/httpd/htdocs_cs/courses/fall00/G22.3033-001/weka/weka-3-0-2

and there you can find directories with the various WEKA programs in Java,
which can be run as described below.
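
For example, assuming the expanded weka class directory is present in that
directory (it should be), a classifier can be invoked with something like

java -classpath . weka.classifiers.NaiveBayes -t <training file> -T <test file>

The -classpath . part is only needed if you already have a CLASSPATH
environment variable set that does not include the current directory; the
specific classifiers and options to use are described in the experiments
below.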

If you have any trouble
with this, please let me know as soon as possible. Don't wait until
the day before the due date to make sure that you can find this code.

The Experiments

Part I: Linear regression

In this section, you will test the accuracy of linear regression as a
predictor, using training sets of different sizes.

The argument -N 48 means that the data should be divided into 48 folds. As
there are 240 instances in the data, that gives a data set of size 5.
Create the other training files analogously. (Note: The different
training files need not be disjoint.) The argument -S 1 gives a random
number seed, so as to get a random choice. Of course, you should give a
different seed for each training set you create.
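
As a sketch of what this step looks like, the command below should be
roughly right; the filter class name is from memory (later WEKA releases
call the equivalent filter weka.filters.unsupervised.instance.RemoveFolds),
so check the flags against the usage message the filter prints before
relying on them:

java weka.filters.SplitDatasetFilter -N 48 -F 1 -S 1 -i auto-mpg.arff -o auto-mpg-train-5a.arff

Here -F selects which fold to output, -i and -o name the input and output
files (pick whatever output names you like), and, as noted above, you should
vary -S from one training set to the next.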

Step 3: Run the linear regression algorithm over the different partial
training files and the complete training file, and evaluate them over the
test file. The linear regression algorithm is run using the command:
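
A caveat: the class name in the command below is my best guess, by analogy
with the NaiveBayes command in part II; if it does not match your copy of
WEKA, look for the linear regression class under weka/classifiers.

java weka.classifiers.LinearRegression -t <training file> -T auto-mpg-test.arff

Here -t names the training file (substitute whichever of your training files
you are evaluating) and -T names the test file.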

Step 4: Create a plot where the x-axis is the size of the training set
(5, 10, 15, 24) and the y-axis is the average accuracy, measured by
mean absolute error, over the three sample sets of the specified size.

Step 5: Define the similarity of two hypotheses to be the sum
of the absolute values of the differences of corresponding coefficients.
For instance, given the two models
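(the attribute names and coefficients below are made up purely for
illustration)

mpg = 0.8*displacement + 0.3*weight + 2.0
mpg = 0.5*displacement + 0.4*weight + 1.0

the similarity is |0.8-0.5| + |0.3-0.4| + |2.0-1.0| = 0.3 + 0.1 + 1.0 = 1.4,
counting the constant term as one of the coefficients.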

Part II: Naive Bayes

Step 3: As in part 1, step 2, create subsets of the training data
with 5, 10, 15, and 24 elements. Create 3 subsets of each size.

Step 4: Run the Naive Bayes algorithm on the various discretized
training sets, and test them on the discretized test set. The
command is

java weka.classifiers.NaiveBayes -t <discretized training file> -T ~/auto-mpg-test-d2.arff

Step 5: Create a plot where the x-axis is the size of the training set
(5, 10, 15, 24) and the y-axis is the average accuracy, measured by
percentage correct, over the three sample sets of the specified size.

Step 6: Redo step 2, but discretize the remaining attributes into
4 bins rather than 2 (change the argument of -B to 4). Run Naive Bayes
again. Compare the results of the two discretizations.
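
For reference, the discretization command has the following general shape;
the filter class name and the -i/-o file options here are my best
recollection for this WEKA release, and whatever exact options you used in
step 2, the only change needed now is -B 4 in place of -B 2:

java weka.filters.DiscretizeFilter -B 4 -i auto-mpg.arff -o auto-mpg-d4.arff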

Part III: Decision trees

Using the discretized training sets and test set created in steps 1 through
3 of part II, run the C4.5 algorithm (a sophisticated decision tree algorithm)
using the command
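
The class name below is my best recollection of where C4.5 lives in this
WEKA release; if it does not match, look for a class named J48 under
weka/classifiers.

java weka.classifiers.j48.J48 -t <discretized training file> -T ~/auto-mpg-test-d2.arff -M 3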

The flag -M 3 requires that nodes with 3 or fewer instances are not
further split.
Report on the accuracy as in step 5 of part II. Also, discuss the
output hypotheses. How much do the trees differ,
and how do they change as the training set becomes larger?

Extra credit (optional)

I should be interested to see the results of any further experimentation
you choose to do.