Two training datasets were provided, one larger one containing data taken from single instruments, and one smaller one with data from combinations of exactly two instruments. These two datasets contained both similar as well as unique labels. Overall, there were 32 literally distinct classes contained in the two training sets. Since the problem was one of multi-way classification, the first approach was the multi-layer perceptron. With 35 hidden neurons, the MLP was trained using Levenberg-Marquadt updating. The MLP was then used to evaluate the test set, and the top 2 activations (of the 32 output nodes) were assigned as labels to that sample point. This model performed with 38% accuracy. These results led us to believe that further investigation should be done on the test data, as MLP’s should perform significantly better than the nearest neighbor approach. Also, many inconsistencies existed between the two training set labels i.e. ‘alto sax’ and ‘saxophone’. To investigate the distribution of the test samples, 32 ‘dummy’ scripts were submitted, each of which containing only one instrument class for both instruments and for every test sample. The resulting classification accuracy was collected for all the classes and represented the distribution of the preliminary test samples. Additionally, it was known that the preliminary and final test set was randomly chosen from the entire test set. Using this knowledge, the resulting distribution was used as priors on the 32 classes. Upon scrutinizing the returned test distribution, it was noticed that many of the classes which had similar names i.e. ‘clarinet’ vs. ‘B-flat clarinet’ only appeared as one class in the preliminary test set. With this knowledge, the classes which did not appear at all in the preliminary test set were either deleted, or their data combined with the classes which had similar names.

During initial investigation of the training data a traditional random forest (RF) classifier was used to test the baseline classifiability of the single instrument training dataset (details of the algorithm can be found in L. Breiman 2001). A forest of 1000 decision stumps, each maximally ten splits deep, was trained. Initial performance of this classifier was very good with error > 0.9%. However, the traditional RF classifier is designed to handle discrete, scalar target values. For this problem, training on the mixed interment data, with each datum belonging to two classes, would normally not have been feasible. However, our group devised a method to train this algorithm using both the single instrument and mixed instrument training data. We did so by generating new training sets, with one instance of the single instrument training data, and randomly sampling the mixed training data, with repeats and a non-uniform distribution that matched the prior information about the final test set that was gained from the dummy scripts, and labeling each repeated with one or the other of the two class labels provided by the training data. This allowed the RF algorithm to be trained in a bootstrap-like method seeing the same datum several times, and seeing them with both labels attached to that datum. Out-of-bag training error was optimal for the RF at roughly 300 trees, again each maximally ten splits deep. Probability outputs for each class were obtained by the proportion of votes for that class to the total number of trees.

Initial leaderboard submissions determined classification success of the test data for this RF was 54.66% overall. Next, a submission was made to the leaderboard by mirroring just the most probable RF class for each entry e.g. “cello,cello; violin,violin;…”. Results from this submission had a leader board determined classification success of 46.02%, informing us that this RF algorithm was correctly selecting one of the two instruments in the test data 92% of the time, and the addition of the second most probable instrument correctly selecting the second instrument for roughly 16% of the entries.

The final model used a voting scheme to decide on the two instrument labels for each test sample. The first label was chosen from the highest RF vote. To decide instrument two, the two independently best performing MLPs were used with the RF probabilities. The output activations from the MLP’s and RF’s were weighted by each other, and by the prior distribution. Discarding the selection from the RF for label one, the highest vote from this ensemble was used to create the second label.

Special thanks to Dr Max Welling, Eli Bowen. All analysis was done in MATLAB, using the Neural Network and Randomforest-MATLAB toolboxes.

Thanks for this competition – it was great fun. Software used: R, Weka, LibSVM, Matlab, Excel. This was the 2nd competition I had entered (the first being the SIAM biological one) and I only really entered because I had so much undergraduate marking to do! We developed a novel approach to the problem which involved multi-resolution clustering and Error Correcting Output Coding. Our 2nd place approach involved transforming the cluster labels into feature vectors.

Method and Journey:

1. We trained on 50% of the training data using Weka and built an ensemble of a cost-sensitive random forest (number of trees 100, number of features 25), a Bayes Net and a neural network. This resulted in 77.44% on the preliminary dataset. It was very frustrating as we couldn’t improve on this. We then looked at semi-iterative relabeling schemes such as Error Correcting Output Coding (using Matlab and LibSVM). This resulted in 81.59% prediction accuracy.

2. We then decided to look at the “statistics” of number of performers, segments, genres etc. We used R to normalize the data (training and test data) and to carry out K-means clustering, k =6 for genres, k=60 for performers, k=2000 for possible songs etc. Taking each set of clusters independently didn’t give any information. However, as we had pasted the results into the same file, we noticed a distinct pattern when the cluster results were looked at together – even though no crisp clusters were identified, we noticed that if a training instance was of a different genre from the rest of the cluster then it usually belonged to a different lower granularity cluster. We then built lots of cluster sets for the data (multi-resolution clustering). K was set to 6, 15, 20, 60, 300, 400, 600, 800, 900, 1050, 1200, 2000, 3000, 3200, 5000 and 7000 clusters. At the finest granularity cluster (k=7000) a majority cluster vote was taken using the training instance labels and the test set predictions – the whole cluster was relabelled to the “heaviest” class. If a cluster could not be converged at the finest k-level then we “fell back” to a lower granularity cluster (k=5000) and so on. These new predictions were fed back to the ECOC system and the process was repeated.

3. Figure below shows the overall approach we came up with:

4. This was the winning solution and resulted in 0.87507 score on the final test set. For the 2nd place solution, we decided to look at using the cluster assignation labels as feature vectors. This transformed the problem from the original 171-dimensional input space, into a new 16-dimensional space, where each attribute was an identifier of the cluster at one of the 16 levels. So, for example, if instance #7 have fallen into the 3rd out of 6 clusters at the first granularity level, 10th out of 15 clusters at the second granularity level and so on, in the transformed space it would be described as a 16-diemensional vector: [3, 10, …]. Note, that these attributes are now categorical, with up to 7000 distinct values at the highest granularity level. This has limited the number of classifiers we could use.

We have used majority voting to combine the decisions of these 3 ensembles. After labeling the test dataset using the method described above, we have fed both training and test dataset (this time with the labels from the previous step) to the ECOC system to obtain final predictions. This resulted in 0.87270 on the final test set.

I became interested in the ISMIS 2011 genres contest due to the challenge that some contestants noted in the online forum: standard model selection via cross-validation did not work well on the problem. Supervised learning techniques I tried, such as SVM, FDA, and Random Forest, all achieved accuracy in the 90-95% range in k-fold CV, only to result in leaderboard test set accuracy in the 70-76% range.

I interpreted this performance drop as an indication that the sample selection bias and resulting dataset shift was significant. I tried three categories of techniques in an attempt to produce a classifier that adapted to the test set distribution: standard transductive algorithms, importance weighting, and pseudo-labeling methods.

My best entry used what I call Incremental Transductive Ridge Regression. The procedure pseudo-labels test points progressively over multiple iterations in an attempt to gradually adapt the classifier to the test distribution. Labeled points can also be removed or reweighted over time to increase the significance of the unlabeled points. The objective function minimized in each iteration is the combination of a labeled loss term, a pseudo-labeled loss term, and the standard L2 ridge regularizer:

The response vector yi for each point contains K entries, one for each genre, and is encoded in binary format where yik=1 if point i has label k and 0 otherwise. Other coding schemes are possible, for example using error-correcting output codes or (K-1) orthogonal vectors. The variable yi* is a pseudo-label vector for each unlabeled point, and Lt and Ut represent the sets of labeled and unlabeled point indices utilized in iteration t. The function f is a linear predictor with weights w, and predictions are produced by argmax f(x).

I experimented with several techniques for growing an initially empty Ut across T iterations. The most successful approach was a stratified one, adding the most confident Fk / T predictions for every class in each round. Confidence is determined by the multiclass margin, and Fk is the expected frequency of class k based on the labeled class distribution. I kept all labeled points in Lt during the T iterations, but surprisingly found that performance increased by removing them all at the end and self-training for a few extra iterations (TII) using just the test points.

In the end, I was able to achieve 82.5% leaderboard accuracy using T=10, TII=5, C=1, λ=1. I added another 0.5% by combining several of these classifiers in a voting ensemble, where diversity was introduced by bootstrap sampling the labeled set. This increase may have been spurious, however, as it did not provide similar gains at larger ensemble sizes.

Along the way, I also experimented with semi-supervised manifold algorithms like LapRLS [1] and tried importance weighting using uLSIF [2], but found only modest gains. Other pseudo-labeling techniques that produced around 80% accuracy for me were Large Scale Manifold Transduction [3] and Tri-training [4].

For implementation, I programmed in Python/SciPy and utilized the ‘scikits.learn’ package when experimenting with off-the-shelf classifiers. Reported results involve two pre-processing steps: duplicate entries in the data sets were removed and features were normalized to have zero mean and unit variance.

I would like to thank TunedIT, the members of Gdansk University of Technology, and any others who helped put together this challenging and fun event.

Like this:

The basis for my model was the Probabilistic Neural Network (PNN), which was originally introduced by Donald Specht in 1990. The PNN is an example-based classifier in which the ‘X’ vector for an unknown case to be classified is compared to all known-class cases used in the training set. A distance metric (typically Euclidean) is passed through a gaussian function to estimate a ‘probability’ of a match with each training case. These individual probabilities are combined for each class in the training set, and the class with the highest composite probability is selected as the most likely class for the unknown case. The evaluation and combination function used was:

Although a PNN can be used with little or no training, this problem posed several difficulties. The first was the high dimensionality of the input data. Because they are example based, PNN classifiers require their input ‘space’ to be reasonably well filled in order to perform. As the number of input features is increased, one would expect their input space to become exponentially sparser. The solution to this was to employ feature selection. Also, another challenge for obtaining good performance is the proper selection of σ, which controls the selectivity of the classifier. If one makes σ too large, the classifier will tend lose the ability to differentiate between different input data. On the other hand, if σ is made too small, the classifier loses the ability to generalize beyond its training set. The problems of both feature and σ selection were solved by using a guided random walk, with the objective of maximizing the Modified Youdon performance on the training set. One feature of this approach is that it does not require the calculation of gradient information, only the value of the metric being maximized. To avoid severe overtraining effects, a leave-one-out scheme was used to evaluate training-set performance.

Because the PNN model developed as described above only sees a small subset of available inputs, I decided to attempt to increase the performance through constructing ensembles of the PNNs, and then taking a simple vote among their outputs to decide the final classification.

As one can see from the following plot, there is substantial variation in both the training and final test Modified Youdon measures for different models, with a degree of correlation between the training metric and the final test metric. This led to the idea of constructing the final voting pool out of a subset of models with superior training performance.

In the end, the submission model consisted of a vote of the best 25 out of 135 candidate PNN models (by training score) constructed using 35 features. This yielded a training score of 0.794, and a final test score of 0.689. Note that while some individual sub-models would have had very similar performance to the ensemble model, there was no obvious way of reliably identifying such high-performing sub models a priori, so the ensemble technique allowed for the combination of a number of good (and not so good) models into a better one.

I developed the model generation code in Visual Basic .NET, and did the final vote taking using a spreadsheet. The generation and tuning of the 135 candidate models required nearly 8 hours on a single processor core of an Intel E5300.

Safety of pharmaceutical and chemical products with respect to human health and the environment has been a major concern for the public, regulatory bodies, and the industry, for a long time and this demand is increasing. Safety aspects start in the early design phases of drugs and chemical compounds and they end formally with the official authorization by national and international regulators. Traditionally, for decades, animal tests have been using as the preferred accepted tool – kind of Gold Standard, which, in fact, it is not – for testing harmful effects of chemicals on living species or the environment. Currently, in Europe only, about 10 million animals per year are (ab)used for laboratory experiments, and a lot of time and billions of Euros are spent into these experiments. So, we as consumers who use and value chemical products every day everywhere in some form are safe? No! Not really. About 90% of the chemicals on the market today have never been tested or have not been requested, officially, to be tested. There is a simple reason, apparently: Despite the ethical issues of animal testing – it is estimated that additional 10 – 50 million vertebrate animals would be required if all 150,000 registered substances would have to be tested in this traditional way – it is simply not possible to run animal tests for this amount of substances within reasonable time and cost constraints. Animal tests cannot do that. To solve this problem, there is a strong demand for alternative testing methods like QSAR models to help minimizing and widely substituting animal tests in the future.

Like this:

We have a pleasure to publish three after-challenge reports authored by participants of SIAM DM 2011 Data Mining Contest who achieved ex aequo the 4th best score (differed only in time of submission). Hope you’ll find the reports insightful. To view full results of the competition see Leaderboard and Summary pages.

* * *

By Robert Wieczorkowski, Ph.D. (rwieczor).

SIAM SDM’11 Contest was my second challenge in which I participated on TunedIT. Previously I took part in IEEE ICDM Contest (Traffic Prediction) and ended on the 12th place. Taking part in this challenge was for me a form of new practical experience in using modern statistical tools.

Share this:

Like this:

In the previous week we had an excellent data mining conference in Warsaw – Rough Sets and Current Trends in Computing (RSCTC). Several months ago, TunedIT had organized the Discovery Challenge for RSCTC: analysis of genetic data for medical purposes. Now, there was a challenge session where the winners presented their solutions to general public. Everyone was really curious how they did it and many questions followed after their talks, so they had no choice but to lift the curtain on their secret tricks. If anyone still wants to learn more, I recommend looking into the challenge paper – to be found here or in conference proceedings (pp. 4-19). We’ll also post shortly an interview with one of the winners, so stay tuned!

Apart from the contest, the conference brought many interesting presentations. First of all, there were four invited keynote talks given by prominent researchers, professors: Roman Słowiński, Sankar Pal, Rakesh Agrawal and Katia Sycara.

Rakesh Agrawal is the head of Microsoft Search Labs, responsible for the development of Microsoft’s Bing search engine. In his talk, Search and Data: The Virtuous Cycle, he sketched what kinds of data mining problems they face when trying to make Bing more “intelligent”, so that search results contain exactly the pages that the user is looking for. It appears that one of the toughest problems is to discover real intentions of the user: what is he really looking for? Search engine knows only the query string, usually very short (1-2 words,often misspelled), say “Ireland”, and must guess what the user expects: travel guide for a tourist or geographical facts about the country? Another problem is that many words have several different meanings: if the user writes “polish” does it mean a verb, “to polish”, or an adjective, “Polish”? Yet another problem: how to deal with numbers in a smart way? The query “$200 camera” gives few sensible results if treated literally – better try “$199 camera” 🙂

Many more issues of this kind must be dealt with. Add that the algorithms must dig through petabytes of data in a matter of seconds, and you’ll have no doubts that guys in Microsoft Search Labs never complain about boring assignments. BTW, I must confirm from own experience that data size and performance requirements are critical factors to make data mining fun. With small data and no performance difficulties, data mining is just an interesting thing to do. When performance begins to play a role, you discover that 95% of your fantastic algorithms just don’t catch up and you’ve got to turn all the bright ideas (and software) upside down.

Another talk which I really enjoyed – Emergent Dynamics of Information Propagation in Large Networks – was delivered by Katia Sycara from Carnegie Mellon University. It’s interesting to observe how large networks of “agents”, for example people, share information among themselves on a peer-to-peer basis, like through gossiping, and how the information fills the whole network at some point in time or – conversely – suddenly disappears. It’s important that we can predict evolution of such processes, because in real world the “information” distributed may be an infectious disease whose spread should be stopped as soon as possible; or an operator’s request that must be distributed to all computers in a large decentralized network, in a shortest possible time.

Which outcome is observed depends on different parameters of the network: how many connections there are between agents, what’s the topology (uniform connections? separated clusters?), how keen the agents are to pass the gossip further on. But what’s the most interesting is that Read more of this post

Share this:

Like this:

An interesting post by Mike Loukides at O’Reilly blogs: What is data science? The title question is hard to answer. Most likely there’s no single answer that everyone would agree upon. But still, Mike makes a couple of good points and observations that are worth quoting:

The web is full of “data-driven apps.” Almost any e-commerce application is a data-driven application. (…) But merely using data isn’t really what we mean by “data science.” A data application acquires its value from the data itself, and creates more data as a result. It’s not just an application with data; it’s a data product.

I would add that not only the web is full of data. The amount of data grows exponentially in every domain, be it on-line or off-line apps. But the users are moving more and more from off-line to web applications, plus it’s easier and more natural to merge together data from different users when things happen on the web than in an off-line scenario. Some examples of off-line applications: analysis of medical records, bioinformatics & genetics, video surveillance, energy demand forecasting, industrial control systems.

In the last few years, there has been an explosion in the amount of data that’s available. Whether we’re talking about web server logs, tweet streams, online transaction records, “citizen science,” data from sensors, government data, or some other source, the problem isn’t finding data, it’s figuring out what to do with it.

Yep. Data is the king. I like examples with CDDB and Google. It’s good to realize that 97% of Google revenue actually comes from data mining algorithms: PageRank (smart search engine) combined with AdSense and AdWords (intelligent online advertising). To put it differently, 23 bln $ of Google revenue in 2009 came from data mining algorithms. It’s data mining and machine learning which make Google search engine so accurate in answering queries and which attract so many users. It’s data mining and machine learning which allow Google to present digital advertisements in optimal place and time, to users who are potentially most interested in a given product.

At the same time, intelligent algorithms make up as little as 1% (or less) of their whole code base. Google has lots of other software that has nothing in common with data mining – various web apps (like Google Docs), libraries, widgets, APIs – but the core, the critical code in terms of their revenue, the code that makes Google be Google, is data mining!

This relation – 97% of revenue from 1% of code base – is very typical for data mining applications. On the other hand, this 1% of code is very hard to invent, much harder than the other 99%. I wonder how much do data mining algorithms make for Google in terms of costs? Mainly for paying the specialists who devise them and thoroughly, step by step, over long period of time, tune them up? I would guess for a number that’s closer to 99% than 1%.

The question every company is facing today (…) is how to use data effectively (…). Using data effectively requires something different from traditional statistics, where actuaries in business suits perform arcane but fairly well-defined kinds of analysis. What differentiates data science from statistics is that data science is a holistic approach. We’re increasingly finding data in the wild, and data scientists are involved with gathering data, massaging it into a tractable form, making it tell its story, and presenting that story to others.

Nothing to add.

What is data science?

Share this:

Like this:

Some time ago I spoke to a friend of mine, Paweł Szczęsny from the Polish Academy of Sciences – a biologist, a visionary of Open Science and a pioneer of scientific blogging. When I mentioned about our plans to start a blog for TunedIT, Paweł, after giving it a serious thought, had come up with the following advice: „Remember one thing: do not write about yourself. If someone writes about oneself, the blog becomes terribly boring. Only if you keep writing about something different, it has a chance to be interesting”.

The Nerd Ghost

At first, this tip of advice seemed illogical to me – what’s the point in opening a blog related to the web portal TunedIT, if we are to write about something totally different? All in all, this is so natural: if a new functionality comes up on TunedIT, we will mention it in the blog: „Today a new functionality has been released, which enables … it helps in … you can use it like this …” and so on. If there’s going to be a new competition: „Today we’re launching a new competition … the task is to …” Isn’t it the way you do it? Each of us could instantly list a long series of blogs where similar posts can be encountered. Don’t they sound so familiar, so natural, so conventional, so … banal? Hm, wait a moment. Banal? Actually, … Have I read many blogs like that? Sure! A lot! How many of them have I read further than to the second sentence of the paragraph? I can’t recall any… So maybe writing about yourself is not the best choice for your blog, in fact? But if not, what else then makes it tick? Read more of this post