Archive for August, 2008

In a survey-paper various methods for finding the number of clusters were compared (Dimitriadou et.al, An Examination of Indexes for Determining the Number of Clusters in Binary Data Sets, Psychometrika, 2002) – and there are plenty of methods. None of them work all the time. Finding the right number of clusters has been an open problem for quite a while and also depends on the application, e.g. if more fine or course grained clusters are of interest.

A similar problem occurs in psychopathology. Imagine some measurements taken from several people – some with and some without a mental illness. The question then becomes: are there two clusters or just one? Is the data simply continuous or generated by a latent Bernoulli distribution? There is a whole bunch of literature out there dealing with the same problem from the psychology standpoint (for example: Schmidt, Kotov, Joiner, “Taxometrics – Towards a New Diagnostic Scheme for Psychopathology“, American Psychological Association) One of the more famous researchers is Paul Meehl, who developed a couple of methods to detect a taxon in data. The MAXCOV-HITMAX invented by Paul Meehl is for the detection of latent taxonic structures (i.e., structures in which the latent variable is not continuously, but rather Bernoulli, distributed).

My problem with Meehl’s methods (MAMBAC, MAXCOV, MAXEIG etc.) is that in all the articles only an intuitive explanation is given. Despite being a mathematical method there were no clear definitions of what the method will consider to be a taxon, or any necessary/sufficient conditions on when the algorithm will detect a taxon. Zoologists for example have entire conferences on how to classify species and go through a lot of painful details on how to properly classify species. They have, it seems, endless debates on what constitutes a new species in the taxonomy. However, I still wasn’t able to find a mathematical definition of what constitutes a taxon.

In addition to that, there seems to be some problems when using MAXCOV with dichotomous indicators (Maruan et.al, An Analysis of Meehl’s MAXCOV-HITMAX Procedure for the Case of Dichotomous Indicators, Multivariate Behavioral Research, Vol. 38, Issue 1 Jan. 2003); in this article they pretty much take the entire procedure apart and show that it often fails to indicate taxons when they are there or indicates taxons when there is nothing.

I think the question of finding a taxon is strongly related to clustering, because it simply tries to answer if clusters exist in the data. However, from all the clustering literature I’ve read so far, clusters are generally defined as dense areas in a space and are found in various ways by maximizing or minimizing some criterion (mutual information etc.). What constitutes a cluster is often conveniently defined so it fits the algorithm at hand. And then you still have to deal with or at least acknowledge the fact that the current notion of clustering has been proven to be impossible (An Impossibility Theorem for Clustering; Kleinberg; NIPS 15).

In a new paper in Machine Learning called Generalization from Observed to Unobserved Features by Clustering (Krupka&Tishby; JMLR 9(Mar):339–370, 2008) the authors describe an idea that might change the way we view clustering. In the paper they show that (under certain conditions) given a clustering based on some features the items will be implicitly clustered by yet unobserved features as well. As an intuitive example, imagine apples, oranges and bananas clustered by shape, color, size, weight etc. Once you have them clustered, you will be able to draw conclusions about a yet unobserved feature, e.g. the vitamin content. The work, because it is oriented on the features, might even be a way around the impossibility-theorem.

This is half-way there for a nicer definition of a taxon or what should constitute a cluster for that matter: can we draw conclusions about features not used for the clustering process? If you are clustering documents by topic (using bag-of-words), can you predict which other words will appear in the article? If you cluster genes, can you predict other genes from the cluster-membership?

I recently attended a talk where the authors claimed that the CAPTCHA technology (the squiggly letters they make you type in whenever you sign up for anything) is dead and defeated. I disagree. In the talk, they demonstrated how to break a couple of “home-brew” captcha-implementations they found on the internet. Most of them were – not surprisingly – not very good. I think this is almost comparable to people inventing there own encryption algorithms.

All the implementations they broke were either insecure implementations (accepting solutions several times, hiding the answer in an invisible form field etc.) or were simply writing numbers in images with little or no distortion. The audio captcha they broke was simply reading numbers with a little bit of clicking noise in the background. Those are all very simple. What is supposed to make “real” captchas hard is that they are hard to segment – compare the phpBB captcha with the one from Yahoo. In the later you will have problems separating the letters for your OCR. A good audio captcha overlays music, chatter or other noise that is hard to separate from the code being read.

An article on the Internet Storm Center discusses wether Anti-Virus software in the current state is a dead end. In my opinion it has been dead for quite a while now. Apart from the absolutely un-usable state that anti-virus software is in, I think it’s protecting the wrong things. Most attacks (trojans, spyware) nowadays come through web-browser exploits and maybe instant-messenger (see reports on ISC). So instead of scanning incoming emails, how about a behavior blocker for the web-browser and the instant messenger? There are a couple of freeware programs (e.g. IEController [German]) out there that successfully put Internet Explorer, etc. into a sandbox; whatever Javascript exploit – known or unknown – the browser won’t be able to execute arbitrary files or write outside its cache-directory. Why is there nothing like that in the commercial AV packages?

However, a few possibilities suggested in the article might be worth exploring. For example, they suggest Bayesian heuristics to identify threats. Using machine learning techniques might be a direction worth exploring. IBM AntiVirus (maybe not the current version anymore) has been using Neural Networks with 4Byte sequences (n-grams) for bootsector virus detection.

A couple things to keep in mind, though:

Quality of the classifier (detection rate) should be measured with Area-under-ROC-Curve (AUC), not error-rate like most people tend to do in Spam-Filter comparisons. The base-rate of the “non-virus” class is pretty high; I have over 10.000 executables/libraries on my windows machine. All (most?) of them non-malicious.

The tricky part with that is the feature extraction. While sequences of bytes or strings extracted from a binary might be a good start, advanced features like call-graphs or imported API-calls should be used as well. This is pretty tricky and time-consuming, especially when it has to be done for different types of executables (Windows scripts, x86-EXE files, .Net files etc.). De-obfuscation techniques, just like in the signature based scanners, will probably be necessary before the features can be extracted.

Behavior blocking and sandboxes are probably easier, a better short-term fix, and more pro-active. This has been my experience with email-based attacks as well back in the Mydoom days when a special mime-type auto-executed an attachment in Outlook. Interestingly there are only two programs out there that sanitize emails (check mime-types, headers, rename executable attachments etc.) at the gateway-level – a much better pro-active approach than simply detecting known threats. The first is Mimedefang, a sendmail plugin. The other is impsec, based on procmail. CU Boulder was using impsec to help keep student’s machines clean (there were scalability issues with the procmail solution, though).