Task

This page describes the task of the Advanced Track only. The task of the Basic Track is described on a separate page.

Datasets

Example datasets for 6 different problems of DNA microarray data analysis and classification can be found in the Repository, in the RSCTC/2010/B/public folder. These are the training and test datasets from the Basic Track, available in ARFF or CSV format, each one separately or zipped together. You can use them as benchmarks when experimenting with different algorithms to find the best one to submit to the challenge: their characteristics are similar to those of the secret datasets that will be used during evaluation of solutions.

Remember that you must be logged in to TunedIT and registered for the challenge; otherwise you will have no access to the dataset files, and they will not even show up in the folder contents.

All datasets - both public and secret ones - contain between 100 and 400 samples, each described by 20,000 to 65,000 attributes. Samples are assigned to several (2-10) classes. All attributes are numeric and represent measurements from DNA microarrays. Attributes are normalized in some way, so in the test data you should expect distributions of attribute values similar to those in the example data.

Solution

A solution has the form of a JAR file containing the Java source code of a classification algorithm. The implementation should be based on the architecture (API) of one of the following systems:
Debellor 1.0,
Weka 3.6.1 or
Rseslib 3.0.2.
Depending on the chosen architecture, the algorithm class should inherit from:
org.debellor.core.Cell - for Debellor,
weka.classifiers.Classifier - for Weka, or
rseslib.processing.classification.Classifier - for Rseslib.

The JAR file may also contain any other classes used by the algorithm class. It is not necessary to include the classes of Debellor, Rseslib or Weka: you may assume that their JAR files will be available on the classpath. You are free to use any of the algorithms available in these systems in your implementation.

In the solution submission form, you must choose the JAR file to be submitted and type the full name (with package) of the class that implements the algorithm.
After submission, the JAR file is compiled on the server under Sun JDK 6 and the algorithm is tested. A preliminary result will appear on the Leaderboard.

Evaluation

Solutions are evaluated on several secret datasets corresponding to different problems of DNA microarray data classification. The secret datasets represent medical problems different from the public ones, but have similar characteristics: number of samples and attributes, statistical distributions of attribute values, etc.

There are 5 datasets in the preliminary evaluation and 6 in the final one. The algorithm is evaluated 5 (preliminary) or 20 (final) times on each dataset using the Train+Test (T+T) procedure. Each T+T trial consists of randomly splitting the data into two equal disjoint parts - a training subset and a test subset - training the algorithm on the first part and testing it on the second part, with calculation of the quality measure: balanced accuracy. Measurements from all T+T trials on all the datasets are averaged. Randomization of data splits is the same for every submitted solution, so every algorithm is evaluated on the same splits.
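A single Train+Test split of the kind described above can be sketched as follows. This is only an illustration, not the server's actual procedure: the class name `TrainTestSplit`, the use of `java.util.Random` with an explicit seed, and the handling of odd sample counts are all assumptions made for the example. The point it shows is that a fixed seed reproduces the same split, which is how every solution can be evaluated on identical data.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

/** Hypothetical helper: splits sample indices into two disjoint halves. */
final class TrainTestSplit {
    final List<Integer> train;
    final List<Integer> test;

    private TrainTestSplit(List<Integer> train, List<Integer> test) {
        this.train = train;
        this.test = test;
    }

    /**
     * Shuffles indices 0..sampleCount-1 with a seeded generator and cuts
     * the shuffled list in the middle. The same seed always gives the
     * same split.
     */
    static TrainTestSplit of(int sampleCount, long seed) {
        List<Integer> indices = new ArrayList<Integer>();
        for (int i = 0; i < sampleCount; i++) {
            indices.add(i);
        }
        Collections.shuffle(indices, new Random(seed));
        // Assumption: for an odd sample count the test part gets the extra sample.
        int half = sampleCount / 2;
        return new TrainTestSplit(
                new ArrayList<Integer>(indices.subList(0, half)),
                new ArrayList<Integer>(indices.subList(half, sampleCount)));
    }
}
```

Calling `TrainTestSplit.of(200, 1L)` twice yields two identical splits of 100 training and 100 test indices.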

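Balanced accuracy is the average of the standard classification accuracies (acc_k) calculated independently for each decision class (k = 1,2,...,K), where acc_k is the fraction of test samples of class k that are classified correctly:

\[ \mathrm{BalancedAcc} = \frac{1}{K} \sum_{k=1}^{K} acc_k \]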

In this way, every class has the same contribution to the final result, no matter how frequent it is.
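For local experiments, the measure can be computed along the following lines. This is a minimal sketch, not the evaluation code used on the server: the class name `BalancedAccuracy`, the encoding of classes as integers 0..K-1, and the decision to skip classes that do not occur in the test set are assumptions made for the example.

```java
/** Hypothetical helper computing balanced accuracy from class indices. */
final class BalancedAccuracy {

    /**
     * @param actual     true class index of each test sample, in 0..classCount-1
     * @param predicted  predicted class index of each test sample
     * @param classCount number of decision classes K
     * @return mean of the per-class accuracies
     */
    static double compute(int[] actual, int[] predicted, int classCount) {
        if (actual.length != predicted.length) {
            throw new IllegalArgumentException("arrays must have equal length");
        }
        int[] total = new int[classCount];   // samples of class k
        int[] correct = new int[classCount]; // samples of class k predicted as k
        for (int i = 0; i < actual.length; i++) {
            total[actual[i]]++;
            if (actual[i] == predicted[i]) {
                correct[actual[i]]++;
            }
        }
        double sum = 0.0;
        int present = 0;
        for (int k = 0; k < classCount; k++) {
            // Assumption: a class with no test samples is left out of the average.
            if (total[k] == 0) {
                continue;
            }
            sum += (double) correct[k] / total[k];
            present++;
        }
        return present == 0 ? 0.0 : sum / present;
    }
}
```

For example, with 8 test samples of class 0 and 2 of class 1, a classifier that always predicts class 0 has a standard accuracy of 0.8 but a balanced accuracy of (1.0 + 0.0) / 2 = 0.5.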

Time and Memory

The algorithm should not only be accurate, but also time- and memory-efficient. There is a time limit set for the whole evaluation: 4 hours in preliminary tests and 20 hours in final tests. Therefore, a single Train+Test trial of the algorithm should last no longer than 10 minutes, on average.
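The 10-minute figure follows from the time limits above and the trial counts given in the Evaluation section (5 datasets with 5 trials each in preliminary tests, 6 datasets with 20 trials each in final tests). The calculation assumes that the whole time limit is available for the trials themselves:

```latex
\[
\text{preliminary: } \frac{4 \times 60\ \text{min}}{5 \times 5\ \text{trials}} = 9.6\ \text{min per trial},
\qquad
\text{final: } \frac{20 \times 60\ \text{min}}{6 \times 20\ \text{trials}} = 10\ \text{min per trial}.
\]
```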

The memory limit is set to 1,500 MB in both the preliminary and the final evaluation. Note that up to 450 MB is used by the evaluation procedure to load the dataset into memory, so about 1 GB is left for the algorithm.
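To get a feeling for how far the remaining memory goes, it may help to estimate the size of a dense copy of the data. The sketch below is only a rough estimate under stated assumptions: values are stored as 8-byte `double`s, array and object headers are ignored, and the class name `MemoryEstimate` is made up for the example.

```java
/** Hypothetical helper for rough memory estimates of dense numeric data. */
final class MemoryEstimate {
    static final long BYTES_PER_DOUBLE = 8L;

    /** Size in bytes of a samples x attributes matrix of doubles, headers ignored. */
    static long denseMatrixBytes(int samples, int attributes) {
        return (long) samples * attributes * BYTES_PER_DOUBLE;
    }

    /** Converts bytes to megabytes (1 MB = 1024 * 1024 bytes). */
    static double toMegabytes(long bytes) {
        return bytes / (1024.0 * 1024.0);
    }
}
```

For the largest dataset size mentioned on this page (400 samples, 65,000 attributes), `denseMatrixBytes(400, 65000)` gives 208,000,000 bytes, that is roughly 198 MB. An algorithm that makes several full copies of the data can therefore come close to the limit.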

Tests are performed on a machine with a 1.9 GHz dual-core CPU, 32-bit Linux and 2 GB of memory, running Sun Java HotSpot Server 14.2 as the JVM.

Implementation Tips

The Examples folder in the Repository contains sample implementations of the simplest classification algorithm, the majority classifier, written for the Debellor and Rseslib architectures. They include compiled code as well as Java sources, so they may help in understanding the API that your class should implement. See also the Examples section in Docs.
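The logic of a majority classifier is simple enough to sketch in plain Java. The code below is not the sample implementation from the Repository and deliberately does not use the Debellor, Weka or Rseslib API: the class `MajorityRule`, its method names and the integer encoding of classes are assumptions made for the example. In a real submission the same logic would go into the methods required by the chosen base class.

```java
/** Hypothetical, API-independent sketch of a majority classifier. */
final class MajorityRule {
    private int majorityClass = -1; // -1 means "not trained yet"

    /**
     * Remembers the most frequent class among the training labels.
     * Labels are class indices in 0..classCount-1; ties go to the lowest index.
     */
    void train(int[] labels, int classCount) {
        if (classCount <= 0) {
            throw new IllegalArgumentException("classCount must be positive");
        }
        int[] counts = new int[classCount];
        for (int label : labels) {
            counts[label]++;
        }
        int best = 0;
        for (int k = 1; k < classCount; k++) {
            if (counts[k] > counts[best]) {
                best = k;
            }
        }
        majorityClass = best;
    }

    /** Predicts the same class for every sample, regardless of its attributes. */
    int predict() {
        if (majorityClass < 0) {
            throw new IllegalStateException("train() must be called first");
        }
        return majorityClass;
    }
}
```

Such a classifier ignores the attribute values entirely, so it is mainly useful as a baseline and as a template for the required class structure.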

JAR files of Debellor, Rseslib and Weka are available in the Repository. You can download them and put them on the classpath of your algorithm in order to run and test it locally, outside TunedTester. For instance, if you develop under Eclipse and want to add a library JAR to the classpath, click on the project, choose Project -> Properties -> Java Build Path from the menu and then, in the Libraries tab, click "Add JARs" or "Add External JARs".

It is also possible to test the algorithm on each of the six public datasets, RSCTC/2010/B/public/dataX_train.arff, where X = 1,2,...,6, under conditions similar to those of the challenge evaluation on the server (i.e., using TunedTester).
To do this:

Add the JAR file with the compiled code of the algorithm to your home folder in the Repository.
In the upload form, choose "private" access to make the file and the related test results
inaccessible to other users.

Download TunedTester
and run tunedtester.bat or tunedtester.sh with the command-line parameter "-m 1500" to increase the memory limit for the test from the default 512 MB to 1500 MB.

In the "Algorithms" field, type the full name (including folders) of your JAR in the Repository,
followed by a colon ":" and the full name of the class. For example:

jsmith/MyJarFile.jar:org.jsmith.MyClassifier

In the "Datasets" field, type the full name of a dataset file. For example:

RSCTC/2010/B/public/data3_train.arff

Type your username and password from the TunedIT website.
Optionally, check "Send results to Knowledge Base", so that test results are sent
to the server. You can view and analyse them afterwards on the
Knowledge Base page
- this is particularly convenient when comparing many different algorithms.
Note that if the algorithm has private access in the Repository, then the results are also private: they are visible only to you.

Click "Run...". This will start a test using the
ClassificationTT70
evaluation procedure, which trains the classifier on a random 70% of the data,
tests it on the remaining 30% and calculates classification accuracy, which becomes the result of the test.
Keep in mind that during challenge evaluation the data is split in a 50/50 ratio and the quality measure is balanced accuracy, not the standard classification accuracy used in ClassificationTT70.

If you have any questions, please post them on the discussion forum of the challenge. We also encourage you to subscribe to this forum ("Subscribe forum" link at the bottom), so that you receive notifications about new posts, which may contain further explanations of the challenge tasks.