Archive for the ‘Machine Learning’ Category

How can one evaluate the performance of prognostic models in a meaningful way? This is a very basic and yet an interesting problem especially in the context of prediction of very rare events (base-rates <10%). How reliable is the model’s forecast? This is a good question and of particular importance when it matters – think criminal psychology where models forecast the likelihood of recidivism for criminally insane people (Quinsey 1980). There are a variety of ways to evaluate a model’s predictive performance on a hold out sample, and some are more meaningful than others. For example, when using error-rates one should keep in mind that they are only meaningful when you consider the base-rate of your classes and the trivial classifier as well. Often this gets confusing when you are dealing with very imbalanced data sets or rare events. In this blog post, I’ll summarize a few techniques and alternative evaluation methods for predictive models that are particularly useful when dealing with rare events or low base-rates in general.

The Receiver Operator Characteristic is a graphical measure that plots the true versus false positive rates such that the user can decide where to cut for making the final classification decision. In order to summarize the performance of the graph in a single, reportable number, the area under the curve (AUC) is generally used.

The New Scientist describes a new method for spam detection by learning patterns. This new method exploits the spammers most powerful weapon – the automatic generation of many, similar messages by automated means (i.e., some grammar in a formal language) – and turns it against them. The article reports that a pattern can reliably be learned from about 1000 examples captured from a bot, allowing the method to classify new messages accurately and with zero false positives. This sounds really exciting given my full spam-folder.

However, I’m a bit cautious. The article is a bit sparse on technical details, so I might make some wrong assumptions here. First, zero false positives reported is the discrimination of spam from that particular spam-grammar versus other messages. At least that’s how I understand it. Second, it seems from the article that they only learn from positive examples. Overall the technique sounds to me like they are learning a pattern language. Pattern languages are a class of grammars that overlap with linear and context-sensitive grammars (Chomsky hierarchy). Unfortunately they don’t have a real Wikipedia page so I’ll try to give a bit of background. The closest I can give for an example right now would be regular expressions with back-references. I’m not sure if this is an accurate description for all possible patterns, but it’s close enough for an example.

I don’t know how the specific technique mentioned in the article works in detail, but I’ve learned two things about learning grammars from text: (a) we can’t learn all linear or context-sensitive languages, only all pattern language grammars; (b) learning patterns without negative examples leads to over-generalization really really fast.

While I haven’t worked with learning grammars in a long while, the only algorithm of which I’m aware is the Lange-Wiehagen algorithm (Steffen Lange and Rolf Wiehagen; Polynomial-time inference of arbitrary pattern languages. New Generation Computing, 8(4):361-370, 1991). This algorithm is not a consistent learner, but can learn all pattern languages in polynomial time. There might be better ones available by now, but learning grammars is not that popular in the machine learning community right now. I’m sure there are some other interesting applications besides spam filtering. Maybe it’s time for a revival.

Overall, it sounds like a promising new anti-spam technique, but I’d like to see some more realistic testing done. There are some obvious ways for spammers to make learning these patterns harder, but either way I’m curious – maybe the inventors of this technique discovered a better way to learn patterns? Maybe by using some problem-specific domain knowledge?

Interesting post over at hunch.net about reviewers bidding for papers in order to shoot them down. Make sure to read the comments… That state of mind of some reviewers might explain why the least-informative and most negative reviews always come with the highest confidence rating in ML conferences (specifically NIPS).

Given that a common question during training with risk assessment software is “what do I do to get outcome/prediction x” from the software it should be explored how to safeguard in the software against users gaming the system. Think detecting multiple model evaluations with slightly changed numbers in a row…

Edit: I just found an instrument implemented as an Excel Spreadsheet. Good for prototyping something, but using that in practice is just asking people to fiddle with the numbers until the desired result is obtained. You couldn’t make it more user-friendly if you tried…

That’s an interesting way to do it (from a risk modeling point of view) and I wonder how well it works in practice. There might also be some legal ramifications to this if it can be demonstrated that this practice (possibly unknowingly to them) discriminates e.g. against minorities.

I had written a post about the issues of converting models into something that is usable in production environments as most stats-packages don’t have friendly interfaces to integrate them into the flow of processing data. I worked on a similar problem involving a script written in SAS recently. To be specific, some code for computing a risk-score in SAS had to be converted into Java and I was confronted with having to figure out the semantics of SAS code. I found a software to convert SAS code into Java and I have to say I was quite impressed with how well it worked. Converting one language (especially one for which there is no published grammar or other specification) into another is quite a task – after a bit of back and forth with support we got our code converted and the Java code worked on the first try. I wish there would be similar converter for STATA, R and SPSS 🙂

Recently I had a fun discussion with Bill over lunch about intellectual property and how that might apply to statistical modeling work. Given that there are more and more companies making a living from forming predictions with a model they have built (churn-prediction, credit-scores and other risk-models) we were wondering if there were any means of protecting them as intellectual property. For example, the ZETA-model for predicting corporate bankruptcies is a closely guarded secret with having published only the variables being used (Altman E. I. (2000); Predicting financial distress for companies: revisiting the Z-Score and ZETA models). Obviously this model is useful for lending and can make serious money for the user. Making decisions guided by a formula is becoming more popular. This might be something over which legal battles will be fought in the future.

Copyrighted works and patents often count towards what a company would be worth should somebody acquire it. This means there would be motivation for start-up companies to protect their models. A mathematical formula (e.g. a regression equation) cannot be patented, and copyright probably won’t apply either; even if copyright would apply, it’s trivial to build a formula that does essentially the same thing (e.g. multiply all the weights in the formula by 10). This leaves only trade secret protection and means there is no recourse once the cat is out of the bag. Often it’s also the data-collection method that is kept secret – a company called Epagogix developed a method to judge the success of movies from a script by scoring it against some scales that they keep secret.

Currently, I don’t see any legal protections with the exception of trade-secrets for this. And given that there is infinitely many ways to express the same scoring rules in a different way, this would be a fairly hard problem for lawyers and politicians to formulate sensible rules for establishing protection for this kind of intellectual property.

In a survey-paper various methods for finding the number of clusters were compared (Dimitriadou et.al, An Examination of Indexes for Determining the Number of Clusters in Binary Data Sets, Psychometrika, 2002) – and there are plenty of methods. None of them work all the time. Finding the right number of clusters has been an open problem for quite a while and also depends on the application, e.g. if more fine or course grained clusters are of interest.

A similar problem occurs in psychopathology. Imagine some measurements taken from several people – some with and some without a mental illness. The question then becomes: are there two clusters or just one? Is the data simply continuous or generated by a latent Bernoulli distribution? There is a whole bunch of literature out there dealing with the same problem from the psychology standpoint (for example: Schmidt, Kotov, Joiner, “Taxometrics – Towards a New Diagnostic Scheme for Psychopathology“, American Psychological Association) One of the more famous researchers is Paul Meehl, who developed a couple of methods to detect a taxon in data. The MAXCOV-HITMAX invented by Paul Meehl is for the detection of latent taxonic structures (i.e., structures in which the latent variable is not continuously, but rather Bernoulli, distributed).

My problem with Meehl’s methods (MAMBAC, MAXCOV, MAXEIG etc.) is that in all the articles only an intuitive explanation is given. Despite being a mathematical method there were no clear definitions of what the method will consider to be a taxon, or any necessary/sufficient conditions on when the algorithm will detect a taxon. Zoologists for example have entire conferences on how to classify species and go through a lot of painful details on how to properly classify species. They have, it seems, endless debates on what constitutes a new species in the taxonomy. However, I still wasn’t able to find a mathematical definition of what constitutes a taxon.

In addition to that, there seems to be some problems when using MAXCOV with dichotomous indicators (Maruan et.al, An Analysis of Meehl’s MAXCOV-HITMAX Procedure for the Case of Dichotomous Indicators, Multivariate Behavioral Research, Vol. 38, Issue 1 Jan. 2003); in this article they pretty much take the entire procedure apart and show that it often fails to indicate taxons when they are there or indicates taxons when there is nothing.

I think the question of finding a taxon is strongly related to clustering, because it simply tries to answer if clusters exist in the data. However, from all the clustering literature I’ve read so far, clusters are generally defined as dense areas in a space and are found in various ways by maximizing or minimizing some criterion (mutual information etc.). What constitutes a cluster is often conveniently defined so it fits the algorithm at hand. And then you still have to deal with or at least acknowledge the fact that the current notion of clustering has been proven to be impossible (An Impossibility Theorem for Clustering; Kleinberg; NIPS 15).

In a new paper in Machine Learning called Generalization from Observed to Unobserved Features by Clustering (Krupka&Tishby; JMLR 9(Mar):339–370, 2008) the authors describe an idea that might change the way we view clustering. In the paper they show that (under certain conditions) given a clustering based on some features the items will be implicitly clustered by yet unobserved features as well. As an intuitive example, imagine apples, oranges and bananas clustered by shape, color, size, weight etc. Once you have them clustered, you will be able to draw conclusions about a yet unobserved feature, e.g. the vitamin content. The work, because it is oriented on the features, might even be a way around the impossibility-theorem.

This is half-way there for a nicer definition of a taxon or what should constitute a cluster for that matter: can we draw conclusions about features not used for the clustering process? If you are clustering documents by topic (using bag-of-words), can you predict which other words will appear in the article? If you cluster genes, can you predict other genes from the cluster-membership?

I recently attended a talk where the authors claimed that the CAPTCHA technology (the squiggly letters they make you type in whenever you sign up for anything) is dead and defeated. I disagree. In the talk, they demonstrated how to break a couple of “home-brew” captcha-implementations they found on the internet. Most of them were – not surprisingly – not very good. I think this is almost comparable to people inventing there own encryption algorithms.

All the implementations they broke were either insecure implementations (accepting solutions several times, hiding the answer in an invisible form field etc.) or were simply writing numbers in images with little or no distortion. The audio captcha they broke was simply reading numbers with a little bit of clicking noise in the background. Those are all very simple. What is supposed to make “real” captchas hard is that they are hard to segment – compare the phpBB captcha with the one from Yahoo. In the later you will have problems separating the letters for your OCR. A good audio captcha overlays music, chatter or other noise that is hard to separate from the code being read.

An article on the Internet Storm Center discusses wether Anti-Virus software in the current state is a dead end. In my opinion it has been dead for quite a while now. Apart from the absolutely un-usable state that anti-virus software is in, I think it’s protecting the wrong things. Most attacks (trojans, spyware) nowadays come through web-browser exploits and maybe instant-messenger (see reports on ISC). So instead of scanning incoming emails, how about a behavior blocker for the web-browser and the instant messenger? There are a couple of freeware programs (e.g. IEController [German]) out there that successfully put Internet Explorer, etc. into a sandbox; whatever Javascript exploit – known or unknown – the browser won’t be able to execute arbitrary files or write outside its cache-directory. Why is there nothing like that in the commercial AV packages?

However, a few possibilities suggested in the article might be worth exploring. For example, they suggest Bayesian heuristics to identify threats. Using machine learning techniques might be a direction worth exploring. IBM AntiVirus (maybe not the current version anymore) has been using Neural Networks with 4Byte sequences (n-grams) for bootsector virus detection.

A couple things to keep in mind, though:

Quality of the classifier (detection rate) should be measured with Area-under-ROC-Curve (AUC), not error-rate like most people tend to do in Spam-Filter comparisons. The base-rate of the “non-virus” class is pretty high; I have over 10.000 executables/libraries on my windows machine. All (most?) of them non-malicious.

The tricky part with that is the feature extraction. While sequences of bytes or strings extracted from a binary might be a good start, advanced features like call-graphs or imported API-calls should be used as well. This is pretty tricky and time-consuming, especially when it has to be done for different types of executables (Windows scripts, x86-EXE files, .Net files etc.). De-obfuscation techniques, just like in the signature based scanners, will probably be necessary before the features can be extracted.

Behavior blocking and sandboxes are probably easier, a better short-term fix, and more pro-active. This has been my experience with email-based attacks as well back in the Mydoom days when a special mime-type auto-executed an attachment in Outlook. Interestingly there are only two programs out there that sanitize emails (check mime-types, headers, rename executable attachments etc.) at the gateway-level – a much better pro-active approach than simply detecting known threats. The first is Mimedefang, a sendmail plugin. The other is impsec, based on procmail. CU Boulder was using impsec to help keep student’s machines clean (there were scalability issues with the procmail solution, though).