Overview

The RSCTC 2010 Discovery Challenge: Mining DNA microarray data for medical diagnosis and treatment is a special event of the Rough Sets and Current Trends in Computing (RSCTC) conference, which will take place in Warsaw, Poland, June 28-30, 2010. The task concerns feature selection in the analysis of DNA microarray data and the classification of patients for the purpose of medical diagnosis and treatment. Prizes worth over 3,000 USD will be awarded to the best solutions.

Introduction

In recent years, researchers from many fields have devoted considerable attention to the investigation of DNA microarray data. This growing interest is largely motivated by numerous practical applications of knowledge acquired from such data in medical diagnostics, treatment planning, drug development and more. When analyzing microarray data, researchers face the few-objects-many-attributes problem: the ratio between the number of examined genes and the number of available samples usually exceeds 100. Many standard classification algorithms have difficulty handling such high-dimensional data and, due to the low number of training samples, tend to overfit. Moreover, usually only a small subset of the examined genes is relevant in the context of a given task. For these reasons, feature selection and extraction methods - in particular those based on rough-set theory and reducts - are an indispensable part of any successful microarray data classification algorithm. With the RSCTC'2010 Discovery Challenge we would like to stimulate investigation in these important fields of research.
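To make the filter idea concrete, the sketch below ranks genes by a simple two-sample separation score (absolute difference of class means divided by the sum of standard deviations) and keeps the top k. This is only an illustration of the general approach, not challenge code; the class and method names are our own invention.

```java
import java.util.Arrays;

/** Illustrative filter-style gene ranking (hypothetical helper, not challenge code). */
class GeneRanker {

    /** Two-sample separation score for one gene: |mean1 - mean0| / (sd1 + sd0). */
    static double score(double[] classA, double[] classB) {
        double mA = mean(classA), mB = mean(classB);
        double sA = sd(classA, mA), sB = sd(classB, mB);
        return Math.abs(mA - mB) / (sA + sB + 1e-12); // epsilon avoids division by zero
    }

    static double mean(double[] x) {
        double s = 0;
        for (double v : x) s += v;
        return s / x.length;
    }

    static double sd(double[] x, double m) {
        double s = 0;
        for (double v : x) s += (v - m) * (v - m);
        return Math.sqrt(s / x.length);
    }

    /** Indices of the k highest-scoring genes; rows are samples, columns are genes. */
    static int[] topGenes(double[][] exprA, double[][] exprB, int k) {
        int genes = exprA[0].length;
        Integer[] idx = new Integer[genes];
        double[] scores = new double[genes];
        for (int g = 0; g < genes; g++) {
            idx[g] = g;
            scores[g] = score(column(exprA, g), column(exprB, g));
        }
        Arrays.sort(idx, (a, b) -> Double.compare(scores[b], scores[a])); // descending
        int[] top = new int[Math.min(k, genes)];
        for (int i = 0; i < top.length; i++) top[i] = idx[i];
        return top;
    }

    static double[] column(double[][] m, int g) {
        double[] c = new double[m.length];
        for (int i = 0; i < m.length; i++) c[i] = m[i][g];
        return c;
    }
}
```

With tens of thousands of genes and few samples, such a cheap univariate filter is typically only a first step before a more refined method (e.g. reduct-based selection) is applied.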

The challenge task is to design a machine-learning algorithm that will classify patients for the purpose of medical diagnosis and treatment. Patients are characterized by gene transcription data from DNA microarrays. The data contain between 20,000 and 65,000 features, depending on the type of microarrays used in a given experiment.

Tracks

The challenge comprises two independent tracks, the Basic Track and the Advanced Track, which differ in the form of solutions. In the Basic track, the participant submits a text file with predicted decisions for test samples, which is compared on the server with ground-truth decisions - a typical setup used in other data mining challenges. In the Advanced track, the participant submits the Java source code of a classification algorithm. The code is compiled on the server, and the classifier is trained on one subset of the data and evaluated on another.

The Advanced track is more demanding for participants than the Basic one, because there are restrictions on how the algorithm is implemented - it must be written in Java, according to the API defined by one of three data mining environments: Weka, Debellor or Rseslib. On the other hand, it allows much more precise and accurate evaluation of solutions, because every algorithm may be trained and tested a number of times on the same dataset, using different splits into train and test parts, to obtain more reliable quality measurements - for example, it is possible to run a cross-validation scheme, which is impossible with the traditional setup used in the Basic track. This is particularly important for problems like DNA microarray data analysis, where datasets are small and evaluation with a single train/test split is not fully objective. Therefore, evaluation on the Advanced track is much more reliable than on the Basic track, and we view the Basic track as a kind of exercise for participants before they enter the Advanced track.
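The repeated train/test evaluation described above can be sketched as plain k-fold index splitting. This is a minimal illustration of the scheme, assuming nothing about the actual server-side evaluation code; the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

/** Sketch of k-fold cross-validation index splitting (illustrative only;
 *  the real evaluation procedure runs on TunedIT servers). */
class CrossValidation {

    /** Partition sample indices 0..n-1 into k shuffled folds. */
    static List<List<Integer>> folds(int n, int k, long seed) {
        List<Integer> idx = new ArrayList<>();
        for (int i = 0; i < n; i++) idx.add(i);
        Collections.shuffle(idx, new Random(seed)); // fixed seed keeps splits reproducible
        List<List<Integer>> folds = new ArrayList<>();
        for (int f = 0; f < k; f++) folds.add(new ArrayList<>());
        for (int i = 0; i < n; i++) folds.get(i % k).add(idx.get(i)); // round-robin assignment
        return folds;
    }

    /** Training indices = union of all folds except the held-out one. */
    static List<Integer> trainIndices(List<List<Integer>> folds, int heldOut) {
        List<Integer> train = new ArrayList<>();
        for (int f = 0; f < folds.size(); f++)
            if (f != heldOut) train.addAll(folds.get(f));
        return train;
    }
}
```

In each of the k rounds, one fold serves as the test set and the remaining folds form the training set, so every sample is used for testing exactly once - the kind of repeated evaluation that a single submitted prediction file cannot support.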

Another advantage of the Advanced track is the possibility to evaluate the time and memory complexity of algorithms, not only the accuracy of their decisions. Time and memory limits are set for the execution of the evaluation procedure, so if an algorithm is too slow or requires too much memory, its evaluation is interrupted with an error.
Moreover, after the Advanced track is finished, the source code of the winning solutions will be readily available on the TunedIT server and can be easily used by other researchers as a benchmark or a starting point for new research, provided that the authors agree to disclose their implementations.

Evaluation

For the purpose of training and thorough evaluation of the algorithms, datasets from a number of microarray experiments were collected, each related to a different medical problem. They have different numbers of attributes and decision classes. Thus, the participant should design an algorithm that can be successfully applied to many different problems of DNA microarray analysis, not only to one.

Datasets in medical domains usually have skewed class distributions, with one dominant class represented by the majority of samples and a few minority classes represented by a small number of objects. This is also the case in this challenge. Typically, the minority classes matter more than the dominant one, and the importance of detecting them correctly should be reflected by the quality measure used to assess algorithms. For this reason, solutions are evaluated using the balanced accuracy measure. This is a modification of standard classification accuracy that is insensitive to imbalanced class frequencies: it calculates the classification accuracy for every decision class independently and then takes the average over all classes. In this way, every class contributes equally to the final result, no matter how frequent it is.
In the case of 2-class problems with no adjustable decision threshold, balanced accuracy is equivalent to the Area Under the ROC Curve (AUC). Thus, it may be viewed as a generalization of AUC to multi-class problems.
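The measure described above can be written down in a few lines. The sketch below reflects our reading of the definition (per-class accuracy averaged over classes); it is not the official evaluation code, and the class name is our own.

```java
/** Balanced accuracy: per-class accuracy averaged over all classes
 *  (illustrative implementation, not the official challenge evaluator). */
class BalancedAccuracy {

    /** labels and predictions hold class ids 0..numClasses-1, one entry per sample. */
    static double compute(int[] labels, int[] predictions, int numClasses) {
        int[] correct = new int[numClasses];
        int[] total = new int[numClasses];
        for (int i = 0; i < labels.length; i++) {
            total[labels[i]]++;
            if (labels[i] == predictions[i]) correct[labels[i]]++;
        }
        double sum = 0;
        for (int c = 0; c < numClasses; c++)
            sum += (double) correct[c] / total[c]; // accuracy within class c
        return sum / numClasses; // every class contributes equally
    }
}
```

For example, on 4 samples of class 0 and 1 sample of class 1, a classifier that always predicts class 0 scores 0.8 in plain accuracy but only 0.5 in balanced accuracy, since the minority class is completely missed.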

Leaderboard

Solutions are evaluated automatically on TunedIT servers using the TunedTester application. Every solution undergoes two distinct evaluations: preliminary and final. The result of the preliminary evaluation is published on the Leaderboard as soon as it is calculated, while the final result is disclosed only when the challenge ends. Only final results are taken into account when deciding the winners.
Note that final results are calculated on different data (sub)sets than preliminary ones, so the best preliminary result does not necessarily correspond to the best final solution!

Participants may submit solutions many times over the whole duration of the challenge, so that they can compare their algorithms with others and make improvements. However, to avoid overfitting to the preliminary test datasets, the number of submissions by each participant is limited to 100 on both tracks. Additionally, the precision of preliminary results is restricted to a small number of digits after the decimal point.

At the end of the challenge, the last submitted solution will be considered the final one. If several solutions achieve the same result, which is especially likely on the Basic track, the earlier submission date will decide.

Awards

To encourage active participation and reward the authors of the best algorithms, TunedIT will award the winning solutions on both tracks with cash prizes: 2,000 USD on the Advanced track and 1,000 USD on the Basic track. Additionally, the RSCTC registration fees for both winners will be covered. If the winner on a given track does not plan to participate in RSCTC, the next best solution on that track will be awarded coverage of the RSCTC participation fee.

Moreover, the first- and second-place winners in each track will be invited to prepare short - up to 3 pages - descriptions of their solutions, to be included in a joint paper summarizing the challenge. The paper will be published in the conference proceedings in the Springer LNAI series. The winners who submit their descriptions will become co-authors of the paper.

Regardless of contributing short descriptions to the joint paper, all participants are welcome to prepare full descriptions of their solutions and submit them as regular standalone papers to RSCTC. Note that these papers will undergo the regular reviewing procedure, like all other papers submitted to RSCTC. Note also that the paper submission deadline falls shortly after the end of the challenge, so we recommend that you start preparing the paper in advance, before the challenge finishes.

Dissemination of Results

We place great importance on the broad dissemination of research findings from the challenge, for the benefit of the whole rough-set and data-mining community. For this purpose:

A workshop devoted to the presentation of the challenge and its solutions will be organized during RSCTC.

A joint paper describing the challenge and - briefly - the winning solutions will be published in the RSCTC proceedings.

All participants will be encouraged to prepare full descriptions of their solutions and submit them as standalone papers to the RSCTC conference. These papers will undergo the regular reviewing procedure.

Participants will be encouraged to make source code of their algorithms publicly available.

After the challenge, the test datasets and the source code of the evaluation procedures will be published on TunedIT, so that new algorithms can be tested against the challenge data using the same experimental setup. In this way, the challenge will contribute to the creation of benchmark datasets and experiments that can be reused later by the whole scientific community.

After the challenge, information about the origin of the datasets will be published on the challenge web page, together with the R scripts that were used for data preparation. Using these scripts, interested researchers will be able to prepare more datasets in the same way as we did for the challenge.

Schedule

Dec 1, 2009: start of the challenge

Feb 28, 2010: end of the challenge

Mar 7, 2010: deadline for submission of papers to challenge workshop

Mar 7, 2010: deadline for submission of short descriptions of winning solutions to the joint paper

Jun 28-30, 2010: RSCTC conference and challenge workshop

Organizing Committee

Marcin Wojnarski, TunedIT and University of Warsaw, Poland

Andrzej Janusz, University of Warsaw, Poland

Hung Son Nguyen, PhD, University of Warsaw, Poland

Jan Bazan, PhD, University of Rzeszów, Poland