The QSAR data provided for the SIAM SDM’11 Contest were known to be highly noisy. As the organizers reported after the contest closed, around 30% of the labels could be wrong due to experimental uncertainty. Furthermore, the contest counted only the last submission, so it was risky to over-tune models on the known data (the training data and the preliminary test data).

In my approach, a 7-fold cross-validation (CV) strategy was initially adopted for modeling on the training data. Several classification algorithms were tried, and the best CV results (in terms of Balanced Youden Index) were observed with the R gbm and randomForest (rf) packages. At that point the performance of gbm was 0.659/0.664/0.640 (in the order 7-fold CV/preliminary/final), and that of rf was 0.636/0.718/0.628. (Of course, I only learned the final performance after the contest closed.) I also tried different feature selection methods but saw no obvious improvement, so I decided to use all 242 features.
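As a rough illustration of the 7-fold split, here is a minimal Python sketch using only the standard library. It is not the R code used in the contest: the function name `kfold_indices`, the shuffling seed, and the round-robin fold assignment are all assumptions made for this example.

```python
import random


def kfold_indices(n_samples, k=7, seed=0):
    """Yield (train_idx, test_idx) pairs for k-fold cross-validation.

    Hypothetical helper: shuffles the indices once, then deals them
    round-robin into k folds so fold sizes differ by at most one.
    """
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test_idx = sorted(folds[i])
        # Training set is every index that is not in the held-out fold.
        train_idx = sorted(j for f in folds[:i] + folds[i + 1:] for j in f)
        yield train_idx, test_idx
```

Each instance lands in exactly one held-out fold, so collecting the held-out predictions across the 7 folds gives one CV prediction per training instance.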

The next step was to remove noisy data. The assumption was that an instance is likely to be mislabeled if it is wrongly predicted with a high probability value. This idea was applied to the CV predictions of a balanced gbm model. If the prediction value for a positive instance was less than nplimit, the instance was treated as noise. Likewise, a negative instance was treated as noise if its prediction value was greater than or equal to pplimit. These noisy instances were removed, and only the remaining instances were used to train the second-round gbm and randomForest classifiers. After a few rounds of tuning, nplimit was set to 0.2 and pplimit to 0.8. The performance then became 0.688/0.644 (in the order preliminary/final) for gbm and 0.771/0.671 for rf.
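The filtering rule can be sketched in a few lines of Python. This is an illustration rather than the original R code: the function name `remove_noisy` is hypothetical, labels are assumed to be coded 1 (positive) and 0 (negative), and the prediction values are assumed to be CV-estimated probabilities of the positive class.

```python
def remove_noisy(labels, cv_probs, nplimit=0.2, pplimit=0.8):
    """Return the indices of instances kept after noise filtering.

    labels   -- assumed coding: 1 = positive, 0 = negative
    cv_probs -- assumed to be out-of-fold predicted probabilities
                of the positive class, one per instance
    """
    kept = []
    for i, (y, p) in enumerate(zip(labels, cv_probs)):
        # A positive predicted well below the threshold is treated as noise.
        noisy_positive = (y == 1 and p < nplimit)
        # A negative predicted at or above the threshold is treated as noise.
        noisy_negative = (y == 0 and p >= pplimit)
        if not (noisy_positive or noisy_negative):
            kept.append(i)
    return kept
```

For example, `remove_noisy([1, 1, 0, 0], [0.1, 0.2, 0.8, 0.5])` keeps only indices 1 and 3. The positive at 0.2 survives because the comparison with nplimit is strict, while the negative at 0.8 is dropped because the comparison with pplimit is inclusive.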

Finally, the above process was applied to the combined training and preliminary data, with all modeling parameters unchanged from the first phase. Step 1: a balanced gbm model was built. Step 2: noisy instances were removed based on the step 1 CV predictions, with nplimit=0.2 and pplimit=0.8. Step 3: an rf model was built for the final classification. Since different rf runs give slightly different results, I actually built 9 rf models and took their majority vote as the final prediction, which ranked 2nd in this contest with a Balanced Youden Index of 0.6889.
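The voting step might look like the following Python sketch. It is not the contest code: the function name `majority_vote` is hypothetical, and each model is assumed to output one hard class label per instance.

```python
from collections import Counter


def majority_vote(predictions):
    """Combine per-model label lists into one label per instance.

    predictions -- list of lists, one inner list of class labels per
                   model, all of the same length (an assumption here)
    """
    n = len(predictions[0])
    if any(len(p) != n for p in predictions):
        raise ValueError("all models must predict the same instances")
    # zip(*predictions) groups the votes cast for each instance.
    return [Counter(votes).most_common(1)[0][0] for votes in zip(*predictions)]
```

With two classes and an odd number of models, such as the 9 used here, a tie cannot occur, so the most common label is always well defined.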