Computer Science 5751
Machine Learning

Introduction

One important aspect of understanding Machine Learning is the critical role
that data plays in ML.
In this assignment I want you to familiarize yourself with one of the common
formats (C4.5) for the standard data sets that can be found at the UCI
Machine Learning repository.
The C4.5 format allows researchers to create data sets of interest and
deposit them at the UCI repository so that other researchers can use the
data in their learning programs and compare their results.

In this assignment you will build two small data sets of your own and run
a simple decision tree program I will provide for you on these data sets.
More details on the C4.5 format are given below and then I will discuss
the characteristics I want to see in YOUR data sets.

C4.5 Data Set Format

I have ported a number of the data sets from the UCI repository to my home
machine so that we can easily access these data sets without having to ftp
them.
The files for the data sets can be found in the directory:

~rmaclin/public/datasets

Each data set in the C4.5 format consists of two files. The first ends in the
extension .names and gives the feature names, possible feature values, and
classification values for the data set; the second ends in .data and lists
the actual data points in the data set. For example, the labor data set is
stored in the files labor.names and labor.data.
labor.names looks like this:

The first line of any .names file indicates the classes into which data points
are divided. In labor.names, the first line indicates that points are labeled
good or bad (good or bad is their classification). The first line has the
format:

Name1, Name2, ..., NameN.

Each of the names is a class that a point can be labeled with (and is the
focus of our learning in an inductive learning system).
In the labor data set, data points are labeled "good" or "bad".
In the labor.names file a comment is added at the end of the first line
(the characters " | Classes") which is ignored by the parser.
Following the first line is a blank line and then a list of the feature
names and the possible values of those features.

A feature name is simply any string of characters ending in ":". Some
of the feature names in labor.names are "duration", "cost of living adjustment",
and "contribution to dental plan".
Following the ":" is one of two things, either the single word "continuous"
or a list of names separated by commas.
If the single word "continuous" appears then this feature is assumed to have
values that are real numbers (this includes features with integer values).
Examples of such features in labor.names include "duration",
"wage increase first year", "wage increase second year", etc.
On the other hand, if a list of names appears after the ":" then this list
is assumed to indicate all of the different possible "discrete" values the
feature may take on.
For example, in labor.names, "cost of living adjustment" has "none,tcf,tc"
following it.
This means that the possible values of this feature for each data point are
"none", "tcf" or "tc".
The feature "pension" has possible values of "none", "ret_allw" or "empl_contr".
Such features are generally called nominal or discrete features.
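
To make the .names layout concrete, here is a minimal Python sketch of a
reader for it. It is only an illustration under stated assumptions and is not
the parser used by the program I provide: the function names and the small
"weather" example are hypothetical, comments are assumed to start at "|", and
a trailing "." on a line is assumed to be optional.

```python
def strip_comment(line):
    """Drop a trailing '| ...' comment and surrounding whitespace."""
    return line.split("|", 1)[0].strip()


def split_values(text):
    """Split a comma-separated list, ignoring one optional final period."""
    text = text.strip()
    if text.endswith("."):
        text = text[:-1]
    return [v.strip() for v in text.split(",") if v.strip()]


def parse_names(text):
    """Return (classes, features) from the contents of a .names file.

    features is a list of (name, allowed) pairs, where allowed is None for
    a continuous feature and a list of strings for a discrete one.
    """
    lines = [strip_comment(line) for line in text.splitlines()]
    lines = [line for line in lines if line]  # skip blank lines
    classes = split_values(lines[0])
    features = []
    for line in lines[1:]:
        name, _, rest = line.partition(":")
        values = split_values(rest)
        allowed = None if values == ["continuous"] else values
        features.append((name.strip(), allowed))
    return classes, features


# Hypothetical example file contents (not one of the UCI data sets).
EXAMPLE_NAMES = """\
yes, no.   | Classes

outlook: sunny, overcast, rainy.
temperature: continuous.
windy: true, false.
"""

classes, features = parse_names(EXAMPLE_NAMES)
```

Representing a continuous feature as None keeps the two kinds of feature easy
to tell apart when the data points are checked later.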

The .names file for a data set describes the features, feature values and class
values for each point in the data set.
The .data file actually lists the data points making up that data set.
The first five (out of 57) data points in the labor data set data file
(labor.data) are:

Each data point is simply a list of the values for each feature in the data
set (in the order they appear in the .names file) followed by the class value
for that data point, with the values separated by commas.
So, for example, the first line indicates a data point with the following
feature/feature value pairs:

The final value on the line is the class this point has been labeled with
(in this case "good").
Each data point must have one value for each feature.
The values must be listed in the order they appear in the .names file and
must be of the appropriate type (a number for continuous features or one
of the possible feature values for discrete features).
The only exception to this rule is that if a feature value is not known for
a particular data point, a "?" may be included to indicate that the value
is unknown for this data point.
In the first data point, several of the feature values are unknown (including
"wage increase second year", "wage increase third year",
"cost of living adjustment", etc.).
Some data sets (especially the labor data set) have lots of examples with
unknown values and others have none.
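
The rules above for a line of a .data file can be sketched in Python as
follows. This is a hedged illustration, not the provided program: the
function name, the FEATURES and CLASSES constants, and the "weather" values
are all hypothetical, and a real C4.5 reader may accept variations (such as a
trailing period) that this sketch does not handle.

```python
# Hypothetical schema: (feature name, allowed values), with None meaning
# the feature is continuous.
FEATURES = [
    ("outlook", ["sunny", "overcast", "rainy"]),
    ("temperature", None),
    ("windy", ["true", "false"]),
]
CLASSES = ["yes", "no"]


def parse_data_line(line, features, classes):
    """Parse one comma-separated data point and check it against the schema.

    Returns (point, label), where point maps feature names to values and
    an unknown value ("?") is stored as None.
    """
    fields = [f.strip() for f in line.strip().split(",")]
    if len(fields) != len(features) + 1:
        raise ValueError("expected one value per feature plus a class")
    *raw_values, label = fields
    if label not in classes:
        raise ValueError(f"unknown class {label!r}")
    point = {}
    for (name, allowed), raw in zip(features, raw_values):
        if raw == "?":
            point[name] = None            # value not known for this point
        elif allowed is None:
            point[name] = float(raw)      # continuous: must be a number
        elif raw in allowed:
            point[name] = raw             # discrete: must be a listed value
        else:
            raise ValueError(f"bad value {raw!r} for feature {name!r}")
    return point, label


point, label = parse_data_line("sunny, 85, false, no", FEATURES, CLASSES)
```

Note that float() raises ValueError on a non-numeric string, so a malformed
continuous value is rejected in the same way as a bad discrete value.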

The Task

I want each of you to construct TWO data sets. Each data set should
have at least five features and at least 25 data points. You may choose
any type of data you are interested in but please try to avoid offensive
concepts.
In the first of the two data sets you should use only features with discrete
values (no "continuous" features), allow no unknown values, and have exactly
two possible class values.
In the second data set you should have at least one continuous feature and may
have unknown values or more than two class values if you like.
For each data set you should construct two files, a DATASETNAME.names file
and DATASETNAME.data file where DATASETNAME is the name you give the
data set.
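
If you want to check your first data set against these requirements before
running anything else, a small Python sketch like the following may help. It
is not part of the software I provide; the function name and the data layout
(features as (name, allowed) pairs with None for continuous, rows as lists of
strings ending in the class value) are assumptions made for illustration.

```python
def check_first_data_set(classes, features, rows):
    """Return a list of problems with the first data set (empty if none)."""
    problems = []
    if len(features) < 5:
        problems.append("fewer than five features")
    if len(rows) < 25:
        problems.append("fewer than 25 data points")
    if len(classes) != 2:
        problems.append("number of class values is not two")
    if any(allowed is None for _, allowed in features):
        problems.append("contains a continuous feature")
    if any("?" in row for row in rows):
        problems.append("contains an unknown value")
    return problems


# Hypothetical data set: five two-valued features and 25 identical points.
example_classes = ["yes", "no"]
example_features = [(f"f{i}", ["a", "b"]) for i in range(5)]
example_rows = [["a", "b", "a", "b", "a", "yes"] for _ in range(25)]

problems = check_first_data_set(example_classes, example_features, example_rows)
```

The second data set only has to meet the size requirements and include at
least one continuous feature, so the last two checks do not apply to it.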

You should also make sure that neither data set is trivial, where we will
define trivial as being possible to classify based on only one feature.
To check this, I have made available a working version of a program you
will be implementing later, ID3.
Copy the archive file cds.tar.Z to somewhere in
your home directory (you should do this on one of the Computer Science
department machines in HH314 such as csdev01).
Then do the following:

uncompress cds.tar.Z
tar xvf cds.tar
cd check_data_set

In the directory check_data_set you will find a script named check_data which
runs the program train.
Run it on your first data file by typing:

check_data DATASETNAME

substituting DATASETNAME with the name (and path) of your data set.
This should print out the line "Tree for class 0" followed by a representation
of a decision tree.
Your decision tree should have more than one layer or it is trivial (it
has only one layer if there is a single feature name listed in the entire
tree).
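
The definition of trivial used here can also be checked directly for discrete
features. The Python sketch below is not the ID3 program in check_data_set
and does not build a tree; it only reports which single features, taken
alone, determine the class of every data point. The function name and the
example rows are hypothetical, and "?" is treated as just another value.

```python
def single_feature_classifiers(rows):
    """Return the indices of features that alone determine the class.

    Each row is a sequence of feature values followed by the class value.
    """
    if not rows:
        return []
    n_features = len(rows[0]) - 1
    trivial = []
    for i in range(n_features):
        classes_seen = {}  # feature value -> set of class values seen with it
        for row in rows:
            classes_seen.setdefault(row[i], set()).add(row[-1])
        # The feature classifies perfectly if each of its values always
        # appears with the same class.
        if all(len(seen) == 1 for seen in classes_seen.values()):
            trivial.append(i)
    return trivial


# Neither feature alone predicts the class in this hypothetical data.
nontrivial_rows = [
    ("a", "x", "pos"),
    ("a", "y", "neg"),
    ("b", "x", "neg"),
    ("b", "y", "pos"),
]
# Here the first feature alone predicts the class.
trivial_rows = [
    ("a", "x", "pos"),
    ("a", "y", "pos"),
    ("b", "x", "neg"),
    ("b", "y", "neg"),
]
```

An empty result for your data set suggests that no single discrete feature
classifies it, but the tree printed by check_data is still what you should
turn in.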

What To Turn In

Print out a copy of each of the files making up your data sets and the
result of running check_data on each of the data sets.
Then write a short report discussing the interesting aspects of your data
sets and why you chose them (and what they mean).
Also, you should email the files making up your data sets to Hari.