CS 478: Introduction to Machine Learning

Assignment Specifications

Guidelines

Each written assignment is limited to no more than four
single-sided, single-spaced pages (the first should require
significantly less) using 12 point font, including all
figures. Figures are not to be hand-drawn and should be
large enough to be easily legible. Anything beyond the four
page limit will not be graded! You must work to include all
important information in the allotted four pages.
Communicating clearly and concisely what you have to say is
an important skill you will use throughout your career.
NOTE: All written assignments are to be done on a word
processor and be neat and professional.

Grading

Lab assignments will be graded 40% on positive
completion, 50% on technical accuracy and insight and 10%
on writing and presentation. Most parts of the assignments
are fairly objective and for these, you are expected to be
correct. A few questions are more open-ended and
subjective, and in these cases you will be graded based on
perceived effort and understanding. Good writing, grammar,
punctuation, etc. are extremely important because of the
great effect they have on the impact of your
work.

Wiki Assignment

This assignment is worth 5% of your overall Homework
and Programming Assignments grade.

Do some surfing on the web and find at least two examples of
recent, interesting machine learning news, applications, advances, etc. Make an entry on
the course wiki that briefly describes each, gives your opinions, etc. and
points interested readers to more information. Email the TAs so they can verify your work.

Linear Models Assignment

This assignment is worth 17% of your overall Homework
and Programming Assignments grade.

In this assignment, you are to implement the simple Logistic Regression algorithm for a single independent variable (IV) and the Perceptron algorithm for the general case of multiple independent variables. You may assume that data is provided to you in 2-column (n-column) format, where the first column(s) is(are) the IV(s) and the second(last) one is the dependent variable (DV). The first row contains the number of independent variables and the second row contains the name(s) of the variables. Each subsequent row corresponds to a single observation. Hence, you may need to aggregate the values of the DV (i.e., counts) for each value of the IV before doing the regression.

Implement Logistic Regression for a single IV.

Your program must:

Output the values of the intercept (w0) and the slope (w1) of the model.

Output the value of R2 from the probabilities (i.e., compute the SSE and TSS on probabilities rather than on odds).

Plot the SSE over time for 1000 epochs of training for a learning rate that is too large (no convergence).

Plot the SSE over time for 1000 epochs of training for a learning rate that is small enough (asymptotic error).

Report the predictions your model makes for the following data (vectors independent variables).

Turn in a thoughtful, well-written written report (see
the guidelines above) that details your experiments and
addresses the questions posed above (look carefully at
everything to make sure you've covered all the parts of
each).

Clustering Assignment

This assignment is worth 17% of your overall Homework
and Programming Assignments grade.

In this assignment, you are to implement both the k-means algorithm and the Hierarchical Agglomerative Clustering (HAC) algorithm. For both, you may assume that all features are real-valued. You may also assume that there is no need for normalization of the features. Use the Ln-norm for distance calculations with a default value of n=2 (Euclidean).

Note that the datasets you are to test your algorithms with contain labeled items. You will need to ignore the label (target attribute, always last here) while clustering.

Implement the k-means clustering algorithm and the HAC algorithm (using single linkage).

For k-means, your program should automatically try 2 <= k <= 7 and compute the squared error in each case. You should then return the value of k that produces the lowest squared error, together with that error.

For HAC, you should compute (and store) the squared errors of all possible clusterings (as they are built from the bottom up). Upon completion, you should return the value of t (distance threshold) that produces the lowest squared error, the corresponding number of clusters, and the corresponding error.

Compare the best number of clusters obtained by k-means and HAC. How do these also compare with the underlying structure of the dataset in which there are 3 classes of iris plants?

Experiment with your distance metric -- can you find a value of n for the Ln-norm that changes the number of clusters found?

Graph the value of the squared error for each clustering as HAC executes (with no threshold). What do you observe? Is this surprising?

Use your algorithms on the Balance problem.

Run HAC on the full dataset. Record the threshold value that produces the lowest squared error, together with the corresponding number of clusters and the corresponding error. Call these tf, kf and ef, respectively.

Run HAC on this sample. Record the value of the best threshold (call it ts).

Run HAC on the full dataset again, but using the threshold (ts) obtained above on the sample. Record the values of the the corresponding number of clusters (ks) and the corresponding error (es).

Compare tf, kf and ef with ts, ks and es. Discuss your findings.

Turn in a thoughtful, well-written written report (see
the guidelines above) that details your experiments and
addresses the questions posed above (look carefully at
everything to make sure you've covered all the parts of
each).

Apriori Assignment

This assignment is worth 17% of your overall Homework
and Programming Assignments grade.

In this assignment, you are to implement the Apriori algorithm. Your implementation should allow the user to specify a minimum support threshold (minsup), a minimum confidence threshold (minconf), and a maximum number of rules to display at a time (maxrules).

This is an artificial problem. Each attribute represents a bit position in a string of 30 bits: Lmost, Lmost1, ..., Lmost14, Rmost14, Rmost13, ..., Rmost1, Rmost and the attribute Symm is 1 if the pattern is
symmetric about its center, and 0 otherwise. Given this interpretation, do any of the rules discovered by your Apriori algorithm make sense?

Build your own association task.

Design your task so that it contains some simple associations you can check your algorithm against. List these associations.

Run Apriori for various combinations of minsup and minconf values.

Verify that the associations you designed into the task are discovered by your algorithm.

Any surprises?

Turn in a thoughtful, well-written written report (see
the guidelines above) that details your experiments and
addresses the questions posed above (look carefully at
everything to make sure you've covered all the parts of
each).

WEKA Exercise

This assignment is worth 10% of your overall Homework
and Programming Assignments grade.

Perform the following activities:

Go to http://www.cs.waikato.ac.nz/ml/weka/. Download and install the latest version of Weka on your computer.

Build a neural network (using the MLP algorithm) that predicts whether a patient has a heart condition. Record the 10-fold cross-validation accuracy of your model as A1.

Create a new attribute coarseBloodPressure, with values: Low if blood pressure is less than or equal to 120, Normal if blood pressure is greater than 120 but less than or equal to 150, and High if blood pressure is greater than 150. (You may need to be a little creative here in the use of (unsupervised) filters, possibly having to first create a copy of blood pressure with the new name and then transforming it as per the foregoing; Filters AddExpression and MathExpression may prove particularly useful).

Build a neural network (using the MLP algorithm) that predicts whether a patient has a heart condition, using the new attribute coarseBloodPressure instead of the original blood pressure. Record the 10-fold cross-validation accuracy of your model as A2.

Compare A1 and A2. Any comments?

Remove all records whose value of the attribute resting ecg is Abnormal.

Construct a decision tree (using the J48 Algorithm) that predicts whether a patient has a heart condition, given the attributes age, sex, chest pain type, coarseBloodPressure, angina, peak, and slope. Save the result buffer and insert the confusion matrix obtained with 10-fold cross-validation in your report.

In the Explorer application, open the cpu.arff data file and do the following:

Cluster the data (using the simple k-Means algorithm, with k=3) and report on the nature and composition of the extracted clusters.

Discretize the attributes MMIN, MMAX, CACH, CHMIN and CHMAX using 3 buckets in one step. Find associations among these attributes only (i.e., remove the other ones), using the Apriori algorithm, with support 0.2, confidence 0.95 and top 3 rules only being displayed. Save the result buffer and insert it in your report.

Using the original data, list the eigenvalues associated with the attributes selected by the Principal Components Analysis method, when the amount of variance you wish covered by the subset of attributes is 75%.

Turn in a thoughtful, well-written written report (see
the guidelines above) that details your experiments and
addresses the questions posed above (look carefully at
everything to make sure you've covered all the parts of
each).

WEKA Lab 1

This assignment is worth 17% of your overall Homework
and Programming Assignments grade.

Become familiar with the ionosphere data set and use it to perform the following experiments:

Use IB1 and IBk (for k up to 9) to classify the data using cross validation. Report your results. Which value of k is best?

Use PCA to try differing numbers of attributes (1-8). For each number of attributes, perform the same classification experiments that you did with the original data. Report your results (as an 8x9 matrix). Which value of k is best? How does performance change with different numbers of attributes? Which number of attributes is best? Can you explain which attributes these are?

Given the best number of (PCA) attributes, choose various subsets (of the same size) of the original attributes. Perform the IBk experiments on these subsets and report and compare results with those obtained using the PCA attributes. What do you see?

Show a visual representation of the decision surface with 2 attributes (derived from PCA) and k=1.

Become familiar with the vowel data set and use it to perform the following experiments:

Remove the first three attributes as well as the class attribute.

Cluster the data using the simple k-Means algorithm, with values of k from 1 to 12. What do you see?

Now add the class attribute back in and repeat the clusterings, comparing the clusterings with the class. How well does the clustering appear to correlate with class? What might this mean?

Choose several different classifiers and use them to classify the data. How does their performance compare with the clustering's "performance"? Is this something you might expect?

Does adding back in any of the original first three attributes have any effect on either the clustering or the classification performance?

Design your own experiment, perform it and report on it:

Find an interesting data set and briefly describe it. You might look here or here. Or, you can come up with your own some other way. A brief description of ARFF (a file format that WEKA knows) may be helpful.

Come up with an interesting experiment (experiments) to do using your data set and briefly describe it (them).

Perform your experiment(s), report your results and draw conclusions (what did you learn? what does it mean? does it make sense? what else could be done? etc).

Turn in a thoughtful, well-written written report (see
the guidelines above) that details your experiments and
addresses the questions posed above (look carefully at
everything to make sure you've covered all the parts of
each). What you say is important. Why you say it is even more important.

WEKA Lab 2

This assignment is worth 17% of your overall Homework
and Programming Assignments grade.

In this lab you will investigate the difference in model performance using statistical significance testing. We will compare three models (decision tree, multi-layer perceptron and SVM) on two different data sets (ionosphere and vowel), and perform a pairwise comparison of the models on each data set (so, a total of six experiments). For the vowel data, preprocess the data by removing the first three attributes as we did in the previous assignment.

For both data sets, compare the performance
of a decision tree vs. a multi-layer perceptron, a decision tree vs. a support vector machine, and a multi-layer perceptron vs. a support vector machine using a paired permutation test. For each model, describe parameter settings/design decisions you make in acquiring your data (so that your experiments are replicable).

For the permutation tests, obtain 10 paired
accuracy estimates (the best and easiest way to do
this is 10-fold cross-validation) and perform an
exhaustive permutation test.

You can collect accuracy estimates using the Experimenter in WEKA, dumping the results to a file and using the appropriate column of the file. You will need to implement the paired permutation test yourself.

Report the p value for each experiment and draw conclusions. Can you make any assertions about the accuracy of one model over another on the ionosphere data (that is, can you say that differences in model accuracy are statistically significant)? On the vowel data? Can you say anything about how these models might compare on other data sets? What can you say about differences in the performance of the models in general?

Turn in a thoughtful, well-written written report (see
the guidelines above) that details your experiments and
addresses the questions posed above (look carefully at
everything to make sure you've covered all the parts of
each).

Group Project

This assignment is worth 20% of your overall grade.

Prepare and submit a polished paper describing your work. Use these Latex style files if you are writing in Latex (which is *the* way to write nice looking papers) or this Word template if you insist on suffering through producing something somewhat less elegant. Your paper should look like a conference paper, including Title, Author Affiliations, Abstract, Introduction, Methods, Results and Conclusion (Bibliographical references are optional for this, but do include them if you have any).

Prepare a polished oral presentation, including slides, that communicates your work. Your presentation should take about 12-15 minutes and allow 2-5 minutes for questions (a rough guide for slide preparation is about 1 slide per minute). Practicing in advance makes for a much better presentation. Send me an email telling me which day (April 7th or 9th) you would like to present your work to the class. First come, first served.

Email me a thoughtful and honest evaluation of the contributions of your group members. For each, include a score from 1 to 5 indicating your evaluation of their work (1 meaning they were a valuable member of the team that made significant contributions to the project and were good to work with, 5 meaning they contributed little and were difficult to work with). If you would like, you may also include any clarifying comments, etc.