SPAM assignment
Part 1) Develop a program that classifies spam vs. non-spam email. The
program will be evaluated on a test set that includes spam and
non-spam email sent to Atkeson and the TAs. Your program should take
a list of filenames, classify each file (each file is a single
email message), and produce a text file that lists each file name
followed by SPAM or NON-SPAM, in the format
email1 SPAM
email2 NON-SPAM
email3 SPAM
...
Your writeup of this part should include an explanation of how the
classifier works (what features you selected, how the features are
used to make a decision, etc.).
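A minimal sketch of the input/output behavior described in Part 1 might look
like the following. The keyword list, the threshold, the function names, and
the output file name results.txt are all assumptions made for illustration;
the keyword count is only a placeholder for whatever features and decision
rule you actually choose.

```python
import sys
from pathlib import Path

# Hypothetical keyword list, chosen only to illustrate the file handling.
SPAM_WORDS = {"free", "winner", "money", "offer", "prize"}


def classify_text(text, threshold=2):
    """Return 'SPAM' if at least `threshold` distinct keywords appear."""
    words = set(text.lower().split())
    hits = len(words & SPAM_WORDS)
    return "SPAM" if hits >= threshold else "NON-SPAM"


def classify_files(filenames, out_path="results.txt"):
    """Classify each file and write one 'filename LABEL' line per file."""
    lines = []
    for name in filenames:
        # errors="replace" keeps odd byte sequences in real email from crashing the run.
        text = Path(name).read_text(errors="replace")
        lines.append(f"{name} {classify_text(text)}")
    Path(out_path).write_text("\n".join(lines) + "\n")
    return lines


if __name__ == "__main__" and len(sys.argv) > 1:
    # The filenames are taken from the command line.
    classify_files(sys.argv[1:])
```

Saved under a name of your choice, the script would be run with the email
filenames as command-line arguments and would write one line per file in the
format shown above.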
Part 2) Given the characteristics of your SPAM classifier (error rates, etc.)
explain how the SPAM classifier can best be used by a human. How does
the combined human/program system handle errors, get better with
experience, etc?
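One way to measure the error rates mentioned in Part 2 is to count the two
kinds of mistakes separately on a labeled set of messages, since discarding a
real message usually costs the human more than letting a SPAM message
through. The sketch below assumes labels are the strings SPAM and NON-SPAM;
the function name error_rates is hypothetical.

```python
def error_rates(true_labels, predicted_labels):
    """Return (false_positive_rate, false_negative_rate), with SPAM as the positive class.

    False positive: a NON-SPAM message labeled SPAM.
    False negative: a SPAM message labeled NON-SPAM.
    """
    fp = fn = n_spam = n_nonspam = 0
    for truth, pred in zip(true_labels, predicted_labels):
        if truth == "SPAM":
            n_spam += 1
            if pred != "SPAM":
                fn += 1
        else:
            n_nonspam += 1
            if pred == "SPAM":
                fp += 1
    # Guard against a test set that is missing one of the two classes.
    fpr = fp / n_nonspam if n_nonspam else 0.0
    fnr = fn / n_spam if n_spam else 0.0
    return fpr, fnr
```

The same function can be applied to the output of several classifiers on the
same labeled messages to compare them.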
Extra Credit Part 3) Compare the performance of several classification
approaches on SPAM.
Extra Credit Part 4) Develop a classifier that returns a confidence in its
classification, and explain how to use that confidence value to more
effectively handle SPAM.
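As one possible illustration of Part 4, the sketch below uses a small naive
Bayes classifier with add-one smoothing to estimate the probability that a
message is SPAM, and then passes uncertain messages to a human. Naive Bayes
is only one choice among many and is not prescribed by this assignment; the
function names, the REVIEW label, and the thresholds 0.1 and 0.9 are
assumptions made for illustration.

```python
import math
from collections import Counter


def train(messages):
    """messages: list of (text, label) pairs with label 'SPAM' or 'NON-SPAM'.

    Assumes at least one training message of each class.
    """
    counts = {"SPAM": Counter(), "NON-SPAM": Counter()}
    docs = Counter()
    for text, label in messages:
        counts[label].update(text.lower().split())
        docs[label] += 1
    vocab = set(counts["SPAM"]) | set(counts["NON-SPAM"])
    return counts, docs, vocab


def spam_probability(text, model):
    """Return an estimate of P(SPAM | text) under naive Bayes."""
    counts, docs, vocab = model
    log_score = {}
    for label in ("SPAM", "NON-SPAM"):
        total = sum(counts[label].values())
        score = math.log(docs[label] / sum(docs.values()))  # class prior
        for word in text.lower().split():
            if word in vocab:  # words never seen in training are ignored
                score += math.log((counts[label][word] + 1) / (total + len(vocab)))
        log_score[label] = score
    # Turn the two log scores into a probability; clamp to avoid overflow.
    diff = log_score["NON-SPAM"] - log_score["SPAM"]
    diff = max(min(diff, 700.0), -700.0)
    return 1.0 / (1.0 + math.exp(diff))


def route(p_spam, low=0.1, high=0.9):
    """Act automatically only on confident decisions; send the rest to a human."""
    if p_spam >= high:
        return "SPAM"
    if p_spam <= low:
        return "NON-SPAM"
    return "REVIEW"
```

Moving the two thresholds trades the number of messages the human must
review against the number of automatic mistakes.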
**********************************************************************
Assignment FAQ:
1) Can we work in groups? alone?
Yes. The maximum group size is 3. You can work alone.
2) Can we use stuff off the web?
Yes. As long as you clearly indicate what your contribution is, using
other resources is fine. You will be graded on the "value" you add to
whatever resources you use.
3) How do we turn this in?
I would like a URL pointing to your writeup (and code), so we can make
a class web page, and everyone can learn from what others do. Ideally,
you can make your writeup available to the world, so others can build
on what you do.
4) Where can I get some data?
Example SPAM is on the class web page. Latanya Sweeney recently solicited
SPAM. You need to supply the real (NON-SPAM) email yourself.