Two training datasets were provided, one larger one containing data taken from single instruments, and one smaller one with data from combinations of exactly two instruments. These two datasets contained both similar as well as unique labels. Overall, there were 32 literally distinct classes contained in the two training sets. Since the problem was one of multi-way classification, the first approach was the multi-layer perceptron. With 35 hidden neurons, the MLP was trained using Levenberg-Marquadt updating. The MLP was then used to evaluate the test set, and the top 2 activations (of the 32 output nodes) were assigned as labels to that sample point. This model performed with 38% accuracy. These results led us to believe that further investigation should be done on the test data, as MLP’s should perform significantly better than the nearest neighbor approach. Also, many inconsistencies existed between the two training set labels i.e. ‘alto sax’ and ‘saxophone’. To investigate the distribution of the test samples, 32 ‘dummy’ scripts were submitted, each of which containing only one instrument class for both instruments and for every test sample. The resulting classification accuracy was collected for all the classes and represented the distribution of the preliminary test samples. Additionally, it was known that the preliminary and final test set was randomly chosen from the entire test set. Using this knowledge, the resulting distribution was used as priors on the 32 classes. Upon scrutinizing the returned test distribution, it was noticed that many of the classes which had similar names i.e. ‘clarinet’ vs. ‘B-flat clarinet’ only appeared as one class in the preliminary test set. With this knowledge, the classes which did not appear at all in the preliminary test set were either deleted, or their data combined with the classes which had similar names.

During initial investigation of the training data a traditional random forest (RF) classifier was used to test the baseline classifiability of the single instrument training dataset (details of the algorithm can be found in L. Breiman 2001). A forest of 1000 decision stumps, each maximally ten splits deep, was trained. Initial performance of this classifier was very good with error > 0.9%. However, the traditional RF classifier is designed to handle discrete, scalar target values. For this problem, training on the mixed interment data, with each datum belonging to two classes, would normally not have been feasible. However, our group devised a method to train this algorithm using both the single instrument and mixed instrument training data. We did so by generating new training sets, with one instance of the single instrument training data, and randomly sampling the mixed training data, with repeats and a non-uniform distribution that matched the prior information about the final test set that was gained from the dummy scripts, and labeling each repeated with one or the other of the two class labels provided by the training data. This allowed the RF algorithm to be trained in a bootstrap-like method seeing the same datum several times, and seeing them with both labels attached to that datum. Out-of-bag training error was optimal for the RF at roughly 300 trees, again each maximally ten splits deep. Probability outputs for each class were obtained by the proportion of votes for that class to the total number of trees.

Initial leaderboard submissions determined classification success of the test data for this RF was 54.66% overall. Next, a submission was made to the leaderboard by mirroring just the most probable RF class for each entry e.g. “cello,cello; violin,violin;…”. Results from this submission had a leader board determined classification success of 46.02%, informing us that this RF algorithm was correctly selecting one of the two instruments in the test data 92% of the time, and the addition of the second most probable instrument correctly selecting the second instrument for roughly 16% of the entries.

The final model used a voting scheme to decide on the two instrument labels for each test sample. The first label was chosen from the highest RF vote. To decide instrument two, the two independently best performing MLPs were used with the RF probabilities. The output activations from the MLP’s and RF’s were weighted by each other, and by the prior distribution. Discarding the selection from the RF for label one, the highest vote from this ensemble was used to create the second label.

Special thanks to Dr Max Welling, Eli Bowen. All analysis was done in MATLAB, using the Neural Network and Randomforest-MATLAB toolboxes.

Thanks for this competition – it was great fun. Software used: R, Weka, LibSVM, Matlab, Excel. This was the 2nd competition I had entered (the first being the SIAM biological one) and I only really entered because I had so much undergraduate marking to do! We developed a novel approach to the problem which involved multi-resolution clustering and Error Correcting Output Coding. Our 2nd place approach involved transforming the cluster labels into feature vectors.

Method and Journey:

1. We trained on 50% of the training data using Weka and built an ensemble of a cost-sensitive random forest (number of trees 100, number of features 25), a Bayes Net and a neural network. This resulted in 77.44% on the preliminary dataset. It was very frustrating as we couldn’t improve on this. We then looked at semi-iterative relabeling schemes such as Error Correcting Output Coding (using Matlab and LibSVM). This resulted in 81.59% prediction accuracy.

2. We then decided to look at the “statistics” of number of performers, segments, genres etc. We used R to normalize the data (training and test data) and to carry out K-means clustering, k =6 for genres, k=60 for performers, k=2000 for possible songs etc. Taking each set of clusters independently didn’t give any information. However, as we had pasted the results into the same file, we noticed a distinct pattern when the cluster results were looked at together – even though no crisp clusters were identified, we noticed that if a training instance was of a different genre from the rest of the cluster then it usually belonged to a different lower granularity cluster. We then built lots of cluster sets for the data (multi-resolution clustering). K was set to 6, 15, 20, 60, 300, 400, 600, 800, 900, 1050, 1200, 2000, 3000, 3200, 5000 and 7000 clusters. At the finest granularity cluster (k=7000) a majority cluster vote was taken using the training instance labels and the test set predictions – the whole cluster was relabelled to the “heaviest” class. If a cluster could not be converged at the finest k-level then we “fell back” to a lower granularity cluster (k=5000) and so on. These new predictions were fed back to the ECOC system and the process was repeated.

3. Figure below shows the overall approach we came up with:

4. This was the winning solution and resulted in 0.87507 score on the final test set. For the 2nd place solution, we decided to look at using the cluster assignation labels as feature vectors. This transformed the problem from the original 171-dimensional input space, into a new 16-dimensional space, where each attribute was an identifier of the cluster at one of the 16 levels. So, for example, if instance #7 have fallen into the 3rd out of 6 clusters at the first granularity level, 10th out of 15 clusters at the second granularity level and so on, in the transformed space it would be described as a 16-diemensional vector: [3, 10, …]. Note, that these attributes are now categorical, with up to 7000 distinct values at the highest granularity level. This has limited the number of classifiers we could use.

We have used majority voting to combine the decisions of these 3 ensembles. After labeling the test dataset using the method described above, we have fed both training and test dataset (this time with the labels from the previous step) to the ECOC system to obtain final predictions. This resulted in 0.87270 on the final test set.

I became interested in the ISMIS 2011 genres contest due to the challenge that some contestants noted in the online forum: standard model selection via cross-validation did not work well on the problem. Supervised learning techniques I tried, such as SVM, FDA, and Random Forest, all achieved accuracy in the 90-95% range in k-fold CV, only to result in leaderboard test set accuracy in the 70-76% range.

I interpreted this performance drop as an indication that the sample selection bias and resulting dataset shift was significant. I tried three categories of techniques in an attempt to produce a classifier that adapted to the test set distribution: standard transductive algorithms, importance weighting, and pseudo-labeling methods.

My best entry used what I call Incremental Transductive Ridge Regression. The procedure pseudo-labels test points progressively over multiple iterations in an attempt to gradually adapt the classifier to the test distribution. Labeled points can also be removed or reweighted over time to increase the significance of the unlabeled points. The objective function minimized in each iteration is the combination of a labeled loss term, a pseudo-labeled loss term, and the standard L2 ridge regularizer:

The response vector yi for each point contains K entries, one for each genre, and is encoded in binary format where yik=1 if point i has label k and 0 otherwise. Other coding schemes are possible, for example using error-correcting output codes or (K-1) orthogonal vectors. The variable yi* is a pseudo-label vector for each unlabeled point, and Lt and Ut represent the sets of labeled and unlabeled point indices utilized in iteration t. The function f is a linear predictor with weights w, and predictions are produced by argmax f(x).

I experimented with several techniques for growing an initially empty Ut across T iterations. The most successful approach was a stratified one, adding the most confident Fk / T predictions for every class in each round. Confidence is determined by the multiclass margin, and Fk is the expected frequency of class k based on the labeled class distribution. I kept all labeled points in Lt during the T iterations, but surprisingly found that performance increased by removing them all at the end and self-training for a few extra iterations (TII) using just the test points.

In the end, I was able to achieve 82.5% leaderboard accuracy using T=10, TII=5, C=1, λ=1. I added another 0.5% by combining several of these classifiers in a voting ensemble, where diversity was introduced by bootstrap sampling the labeled set. This increase may have been spurious, however, as it did not provide similar gains at larger ensemble sizes.

Along the way, I also experimented with semi-supervised manifold algorithms like LapRLS [1] and tried importance weighting using uLSIF [2], but found only modest gains. Other pseudo-labeling techniques that produced around 80% accuracy for me were Large Scale Manifold Transduction [3] and Tri-training [4].

For implementation, I programmed in Python/SciPy and utilized the ‘scikits.learn’ package when experimenting with off-the-shelf classifiers. Reported results involve two pre-processing steps: duplicate entries in the data sets were removed and features were normalized to have zero mean and unit variance.

I would like to thank TunedIT, the members of Gdansk University of Technology, and any others who helped put together this challenging and fun event.