WekaClassify.java
=================
Version 0.4
Copyright (C) 2001-2004
Ted Pedersen, tpederse@umn.edu
University of Minnesota, Duluth
Satanjeev Banerjee, satanjeev@cmu.edu
Carnegie Mellon University
http://www.d.umn.edu/~tpederse/sensetools.html
0. Important note on versions:
------------------------------
WekaClassify version 0.4 compiles with Weka 3.4, and can be used with
models generated using Weka 3.4, but not with earlier versions of Weka.
For earlier versions, use WekaClassify version 0.3.
1. Introduction:
----------------
WekaClassify is a Java program that is part of the SenseTools package,
which is a suite of programs that support the application of machine
learning techniques to the problem of word sense disambiguation.
While WekaClassify was developed by Satanjeev Banerjee as a part of that
package, we have found it generally useful and are distributing it
separately as well, in the hopes that it will be useful for users of the
machine learning suite Weka, regardless of their application area.
WekaClassify carries out classification based on a model previously
learned by Weka, and produces output such that each possible answer is
"scored" based on whatever criteria the learned model might use
(confidence scores, probabilities, etc.). Thus, WekaClassify requires
as input an ARFF file that represents the data to be classified, and a
machine learning model that was learned by Weka from training examples.
2. Details
----------
This Java program takes a test/evaluation file in ARFF format and
classifies the instances in the file using the stored representation of a
model learned by Weka from a set of training data.
2.1. The Model:
---------------
WekaClassify uses a previously learned (and stored) model to classify
instances in the given test ARFF file. This model has to be created by
training some classifier (such as a decision tree, a neural network, a
naive Bayesian classifier, etc.) on training data consisting of instances
of the same target word as that in the test ARFF file. Moreover, the model
has to be created using Weka. One can create such a model in Weka by using
the -d switch (see Weka's [1] help and documentation for more information).
A model thus saved contains information about the type of classifier used
as well as other facts "learned" from the training data.
WekaClassify uses this model to classify the instances in the test
file. The model file can be passed to WekaClassify using the -d option.
Classification is done by calling the Weka library routines. Hence Weka
must be available on the Java CLASSPATH for this program to run
properly.
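Since a missing Weka library is a likely cause of startup failures, it
can help to confirm that Weka is visible on the CLASSPATH before running
WekaClassify. The following is a minimal sketch, not part of WekaClassify
itself: the class name WekaClasspathCheck and the helper isLoadable are
hypothetical, and the check assumes that the Weka distribution in use
provides the class weka.classifiers.Classifier.

```java
final class WekaClasspathCheck {

    // Returns true if the named class can be located by this class's
    // class loader (hypothetical helper, for illustration only).
    static boolean isLoadable(String className) {
        try {
            // "false" means: locate the class but do not initialize it.
            Class.forName(className, false,
                    WekaClasspathCheck.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // Assumption: this class is present in the Weka jar being used.
        String wekaClass = "weka.classifiers.Classifier";
        if (isLoadable(wekaClass)) {
            System.out.println("Weka found on the CLASSPATH.");
        } else {
            System.out.println("Weka not found; add weka.jar to the CLASSPATH.");
        }
    }
}
```

If the check reports that Weka is not found, adding the weka.jar file from
the Weka distribution to the CLASSPATH is the usual remedy.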
2.2. Other Command-line Options:
--------------------------------
Other command-line options include the -t switch to specify the test
file and the -p switch to specify the level of precision required. By
default, values are output to at most 4 decimal places.
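To illustrate what a precision of 4 decimal places means for a score, the
following sketch rounds a value to a given number of places. It is only an
illustration under stated assumptions: the class PrecisionSketch and the
helper format are hypothetical, the rounding mode (half up) is an
assumption, and this sketch always pads to exactly the requested number of
places, which WekaClassify itself may not do.

```java
import java.math.BigDecimal;
import java.math.RoundingMode;

final class PrecisionSketch {

    // Rounds a score to exactly `places` decimal places.
    // Hypothetical helper; the half-up rounding mode is an assumption.
    static String format(double score, int places) {
        return BigDecimal.valueOf(score)
                .setScale(places, RoundingMode.HALF_UP)
                .toPlainString();
    }

    public static void main(String[] args) {
        System.out.println(format(1.0 / 3.0, 4)); // prints 0.3333
        System.out.println(format(2.0 / 3.0, 4)); // prints 0.6667
        System.out.println(format(0.25, 2));      // prints 0.25
    }
}
```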
2.3. The Output of WekaClassify:
--------------------------------
For each instance in the test file, WekaClassify outputs a probability
distribution over all the possible 'class' values. As expected, these
values range from 0 (implying this particular instance has zero
probability of belonging to this particular class) to 1 (implying this
particular instance belongs to this particular class). For SENSEVAL test
files, "class" values are the possible senseid values that the target word
in the test data may assume.
This output is in the answer file format required by SENSEVAL (and
used by the scorer.python program). Following is the format of the
output of WekaClassify:
...
For example, suppose for our art.n.xml.arff file above we find the following:
art.n art.40001 art_gallery~1:06:00::/0.0 fine_art~1:06:00::/0.0 art~1:06:00::/1.0
art.n art.40002 art_gallery~1:06:00::/0.25 fine_art~1:06:00::/0.25 art~1:06:00::/0.50
For art.40001, this tells us that the first two senses have zero
probability of occurring, while the third sense has 100 percent
probability of occurring.
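Downstream programs often need only the highest-scoring sense for each
instance. The sketch below reads one output line and picks that sense. It
assumes, based only on the two example lines above, that each line consists
of whitespace-separated fields: the target word, the instance id, and then
one sense/score pair per possible sense, with the score following the last
"/" in the pair. The class AnswerLineSketch and its methods are
hypothetical names, not part of WekaClassify.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class AnswerLineSketch {

    // Returns the second whitespace-separated field (the instance id).
    static String instanceId(String line) {
        return line.trim().split("\\s+")[1];
    }

    // Maps each sense to its score, preserving the order in the line.
    static Map<String, Double> parseScores(String line) {
        String[] tokens = line.trim().split("\\s+");
        Map<String, Double> scores = new LinkedHashMap<>();
        // Fields 0 and 1 are the target word and the instance id.
        for (int i = 2; i < tokens.length; i++) {
            // The sense itself contains ':' and '~', so split at the last '/'.
            int slash = tokens[i].lastIndexOf('/');
            if (slash < 0) {
                throw new IllegalArgumentException("No score in: " + tokens[i]);
            }
            String sense = tokens[i].substring(0, slash);
            double score = Double.parseDouble(tokens[i].substring(slash + 1));
            scores.put(sense, score);
        }
        return scores;
    }

    // Returns the sense with the highest score; on ties, the first one wins.
    static String bestSense(Map<String, Double> scores) {
        String best = null;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (Map.Entry<String, Double> entry : scores.entrySet()) {
            if (entry.getValue() > bestScore) {
                best = entry.getKey();
                bestScore = entry.getValue();
            }
        }
        return best;
    }

    public static void main(String[] args) {
        String line = "art.n art.40001 art_gallery~1:06:00::/0.0 "
                + "fine_art~1:06:00::/0.0 art~1:06:00::/1.0";
        // prints: art.40001 -> art~1:06:00::
        System.out.println(instanceId(line) + " -> "
                + bestSense(parseScores(line)));
    }
}
```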
3. Copying:
-----------
This suite of programs is free software; you can redistribute it
and/or modify it under the terms of the GNU General Public License as
published by the Free Software Foundation; either version 2 of the
License, or (at your option) any later version.
This program is distributed in the hope that it will be useful, but
WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
General Public License for more details.
You should have received a copy of the GNU General Public License
along with this program; if not, write to the Free Software
Foundation, Inc., 59 Temple Place - Suite 330, Boston, MA 02111-1307,
USA.
Note: The text of the GNU General Public License is provided in the file
GPL.txt that you should have received with this distribution.
4. Acknowledgments:
-------------------
This work has been partially supported by a National Science
Foundation Faculty Early CAREER Development award (#0092784) and by a
Grant-in-Aid of Research, Artistry and Scholarship from the Office of
the Vice President for Research and the Dean of the Graduate School of
the University of Minnesota.
5. References:
--------------
1. Weka 3 - Machine Learning Software in Java.
World Wide Web site: http://www.cs.waikato.ac.nz/ml/weka/.