Data mining curiosities: RSCTC 2010 write-up

Last week we had an excellent data mining conference in Warsaw – Rough Sets and Current Trends in Computing (RSCTC). Several months ago, TunedIT organized the Discovery Challenge for RSCTC: analysis of genetic data for medical purposes. Now there was a challenge session where the winners presented their solutions to the general public. Everyone was really curious how they did it, and many questions followed their talks, so they had no choice but to lift the curtain on their secret tricks. If anyone wants to learn more, I recommend the challenge paper – to be found here or in the conference proceedings (pp. 4-19). We’ll also post an interview with one of the winners shortly, so stay tuned!

Apart from the contest, the conference brought many interesting presentations. First of all, there were four invited keynote talks given by prominent researchers: professors Roman Słowiński, Sankar Pal, Rakesh Agrawal and Katia Sycara.

Rakesh Agrawal is the head of Microsoft Search Labs, responsible for the development of Microsoft’s Bing search engine. In his talk, Search and Data: The Virtuous Cycle, he sketched the kinds of data mining problems they face when trying to make Bing more “intelligent”, so that search results contain exactly the pages the user is looking for. It turns out that one of the toughest problems is discovering the real intention of the user: what are they actually looking for? The search engine knows only the query string, usually very short (1-2 words, often misspelled), say “Ireland”, and must guess what the user expects: a travel guide for a tourist, or geographical facts about the country? Another problem is that many words have several different meanings: if the user writes “polish”, is it the verb “to polish” or the adjective “Polish”? Yet another problem: how to deal with numbers in a smart way? The query “$200 camera” gives few sensible results if treated literally – better to try “$199 camera” 🙂

Many more issues of this kind must be dealt with. Add the fact that the algorithms must dig through petabytes of data in a matter of seconds, and you’ll have no doubt that the folks at Microsoft Search Labs never complain about boring assignments. BTW, I can confirm from my own experience that data size and performance requirements are critical factors in making data mining fun. With small data and no performance constraints, data mining is just an interesting thing to do. When performance begins to play a role, you discover that 95% of your fantastic algorithms just can’t keep up, and you’ve got to turn all the bright ideas (and software) upside down.

Another talk I really enjoyed – Emergent Dynamics of Information Propagation in Large Networks – was delivered by Katia Sycara from Carnegie Mellon University. It’s fascinating to observe how large networks of “agents”, for example people, share information among themselves on a peer-to-peer basis, such as through gossiping, and how the information either fills the whole network at some point in time or – conversely – suddenly dies out. Being able to predict the evolution of such processes matters, because in the real world the “information” being distributed may be an infectious disease whose spread should be stopped as soon as possible, or an operator’s request that must reach all computers in a large decentralized network in the shortest possible time.

Which outcome is observed depends on various parameters of the network: how many connections there are between agents, what the topology is (uniform connections? separated clusters?), and how keen the agents are to pass the gossip on. Most interestingly, no configuration of parameters guarantees the expected outcome every time. Chaos – sheer chance – plays a significant role, so even the same parameters may lead to different outcomes when the simulation is run again. Fortunately, there’s a solution, discovered by Sycara’s team: through data mining and statistical modeling they devised a smart self-adapting algorithm that adjusts the network’s parameters on the fly, during the evolution of the process, by reacting to the current local behavior of agents – and in this way achieves the desired outcome every time.
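To get a feel for that chance factor, here’s a toy simulation – my own sketch, not the team’s actual model; the graph type, parameter names and values are all invented for illustration. It spreads a piece of gossip over a random network and shows that the same parameters can end in very different final outcomes, depending only on the random seed:

```python
import random

def gossip_spread(n_agents, avg_degree, pass_prob, seed):
    """Spread a gossip over a random peer-to-peer network.

    Returns the fraction of agents that eventually hear the gossip.
    All parameters here are illustrative, not taken from the paper.
    """
    rng = random.Random(seed)
    # Build a random (Erdos-Renyi-style) contact graph.
    edge_prob = avg_degree / (n_agents - 1)
    neighbors = [[] for _ in range(n_agents)]
    for i in range(n_agents):
        for j in range(i + 1, n_agents):
            if rng.random() < edge_prob:
                neighbors[i].append(j)
                neighbors[j].append(i)

    informed = {0}   # agent 0 starts the gossip
    frontier = [0]   # agents who will still try to pass it on
    while frontier:
        agent = frontier.pop()
        for peer in neighbors[agent]:
            # Each contact passes the gossip on with probability pass_prob.
            if peer not in informed and rng.random() < pass_prob:
                informed.add(peer)
                frontier.append(peer)
    return len(informed) / n_agents

# Identical network parameters, different random seeds:
outcomes = [gossip_spread(200, 4, 0.35, seed) for seed in range(10)]
print(min(outcomes), max(outcomes))
```

Near the critical region of `pass_prob` the gossip sometimes dies out after a few hops and sometimes floods most of the network – exactly the kind of variability that a self-adapting algorithm has to cope with.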

As you can see, the keynote talks were very thought-provoking. The presentations of regular papers didn’t lag behind – the authors devised clever algorithms that crack data and extract useful knowledge for automatic prediction and recognition tasks. Below is a list of presentations that particularly caught my attention.

In larger cities you may notice video cameras installed above intersections. They record car traffic and send the footage to a monitoring center, where the signal is stored and analysed to detect various events (accidents, traffic jams, thefts) or to calculate traffic statistics. The amount of data coming from a whole city is so huge that analysing it manually would be prohibitively expensive. That’s why data mining and computer vision algorithms are needed: to analyze this video stream automatically. One of the subproblems here is how to recognize vehicle types – trucks, buses, cars and so on – and the authors of the paper designed an algorithm that does this with over 95% accuracy. I must add that this application domain, traffic monitoring, is related to the IEEE ICDM Data Mining Contest currently running on TunedIT.

Computer vision algorithms again, this time applied to TV video streams. The goal is to automatically recognize the sport disciplines shown on TV news. For humans it’s trivial (especially during the soccer World Cup – you can tell without looking), but for computers it’s very tough unless you employ intelligent algorithms. It’s amazing how many practical problems there are in computer vision and image recognition, and how few of them computers can solve today with satisfactory accuracy. Still a work in progress, and a huge application area for data mining and machine learning techniques.

Alexey and his colleagues discovered an interesting thing about microarray data used in medical diagnostics: it’s much better to analyze the differences between the expression values of two selected genes than each gene separately – the diagnostic error rate may drop by as much as a quarter in some cases. How to choose the best pairs of genes? By incorporating external biological knowledge about gene interactions and similarities. It’s worth noting that Alexey participated in the RSCTC 2010 Data Mining Challenge, which also concerned the analysis of genetic data.
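A toy illustration of the idea – the numbers below are synthetic values I made up for this post, not real microarray data, and the rule is a generic pairwise comparison, not the authors’ actual method. The point is that a rule based on which gene of a pair is more strongly expressed can separate classes that a single-gene threshold cannot:

```python
# expression[sample] = (gene_A, gene_B); labels: 0 = healthy, 1 = disease
samples = [
    ((5.1, 4.0), 0), ((4.5, 3.8), 0), ((5.3, 4.4), 0),
    ((4.2, 5.0), 1), ((4.9, 5.6), 1), ((4.0, 4.9), 1),
]

def predict_single_gene(gene_a, threshold=4.7):
    # Gene A alone: its value ranges overlap between classes,
    # so any fixed threshold misclassifies some samples.
    return 0 if gene_a > threshold else 1

def predict_pair(gene_a, gene_b):
    # Pair rule: which of the two genes is more strongly expressed?
    return 0 if gene_a > gene_b else 1

errors_single = sum(predict_single_gene(a) != y for (a, b), y in samples)
errors_pair = sum(predict_pair(a, b) != y for (a, b), y in samples)
print(errors_single, errors_pair)  # -> 2 0
```

On this (deliberately constructed) data the single-gene rule makes two errors while the pair rule makes none: the difference between the genes is a more stable signal than either absolute value.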

Learning Age and Gender Using Co-occurrence of Non-dictionary Words from Stylistic Variations
by Rajendra Prasath (p. 544)

Rajendra’s presentation was truly excellent, both in terms of scientific methodology and the practical significance of the results. I won’t tell you anything more at the moment, because we’ll publish a longer post about this paper soon…

Both papers investigate the problem of splitting a soundtrack into separate instruments and detecting which instruments are playing at a given moment – such software would be very handy for indexing and searching the vast repositories of multimedia content, like YouTube, that have appeared in recent years along with advances in computer storage and networking. This complex problem requires intelligent algorithms that learn from examples how each instrument sounds and what that sound “looks” like numerically, in the sequence of numbers that makes up a recording. One interesting observation made by the authors was that in many cases two instruments playing together may sound like yet another instrument, which makes the task more difficult. The authors tried many different algorithms and their combinations: random forests, k-nearest neighbors, rough sets, decision trees, Fourier transforms, MPEG-7 descriptors, spectral features, cepstral coefficients. The best recognition rates achieved were around 83%, which is high enough for many practical applications.
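To give a flavor of what a spectral feature is, here’s a crude, self-contained sketch – my own toy example with synthesized “instruments” and a single feature (the spectral centroid), nothing like the full MPEG-7 / classifier pipelines the papers actually evaluated. An instrument with strong high harmonics has its spectral “center of mass” higher up than one dominated by the fundamental:

```python
import math

SR = 8000   # sample rate in Hz; all settings here are illustrative
N = 400     # short frames keep the naive DFT cheap

def tone(freq, harmonics):
    """Synthesize a crude 'instrument': a fundamental plus weighted harmonics."""
    return [sum(w * math.sin(2 * math.pi * freq * (k + 1) * n / SR)
                for k, w in enumerate(harmonics))
            for n in range(N)]

def spectral_centroid(signal):
    """Center of mass of the magnitude spectrum (naive DFT, positive bins)."""
    total = weighted = 0.0
    for b in range(1, N // 2):
        re = sum(x * math.cos(2 * math.pi * b * n / N) for n, x in enumerate(signal))
        im = sum(x * math.sin(2 * math.pi * b * n / N) for n, x in enumerate(signal))
        mag = math.hypot(re, im)
        weighted += (b * SR / N) * mag   # bin b corresponds to b*SR/N Hz
        total += mag
    return weighted / total

# Two mock 'instruments' with different harmonic profiles:
flute_like = tone(440, [1.0, 0.1])            # energy near the fundamental
brass_like = tone(440, [0.5, 0.8, 0.9, 0.7])  # energy pushed into harmonics

c_flute = spectral_centroid(flute_like)
c_brass = spectral_centroid(brass_like)

def classify(signal):
    # Nearest-centroid on one feature -- a toy stand-in for the
    # k-NN / random-forest pipelines mentioned above.
    c = spectral_centroid(signal)
    return "flute-like" if abs(c - c_flute) < abs(c - c_brass) else "brass-like"

print(classify(tone(440, [1.0, 0.05])))  # -> flute-like
```

Real systems extract dozens of such features per frame (cepstral coefficients, MPEG-7 descriptors and so on) and feed them into trained classifiers – but the basic trick of turning sound into discriminative numbers is the same.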

All of the above papers can be found in the proceedings, with page numbers given in parentheses. Happy reading!