Basically, he built a classifier predicting from some innocuous (but possibly correlated variables) the likelihood of somebody having a felony offense. The classifier isn’t meant to be used in practice (from eye-balling the Precision/Recall curve in the talk slides, I estimate an AUC of about 0.6-ish; not too great), but it was built to start a discussion. It turns out that courts have upheld the use of profiling in some cases as “reasonable suspicion,” a legal standard for the police to stop somebody and investigate. This could lead to “predictive policing” being taken even further in the future. Due to the model outputting a score Jim also discusses the trade-off of where the prediction of such a model may be actionable – he calls it the Tyranny/Anarchy Trade-Off (a catchy name 🙂

Having done statistical work in criminal justice before, I think predictive analysis can be helpful in many areas of policing and criminal justice in general (e.g., parole supervision). On the other hand, I find profiling and supporting a “reasonable suspicion” from statistical models unconvincing. I think the courts will have to figure out a minimum reliability standard for such predictors, and hopefully they’ll set the threshold far higher than what the ‘felony classifier’ is producing. There’s just too many ways using a statistical model for “reasonable suspicion” to go wrong. Even if variables of protected classes (gender, ethnicity, etc.) are not used directly, there may be correlated variables (hair-color, income, geographic area) as discussed in the talk Jim gave. Even more problematic in my mind would be variables that do not or hardly ever change, as they would lead to the same people being hassled over and over again. Also the training data from which these models are built is biased since everybody in it by definition has been arrested before. It’s beyond me how one can correct for this sample bias in a reliable way. Frankly, I don’t think policing by profiling (statistical or otherwise) can be done well, and hopefully courts will recognize that eventually.

The RAND corporation just published an interesting paper exploring the benefits of using risk prediction to reduce the screening required at airports. You might have noticed various attempts to establish some kind of fast-lane or trusted traveler program. Obvious this is a very sensitive topic and probably hard to get right. Screening certain groups of the population more than others (“profiling”) is generally frowned upon and also not a good idea in general (see “Strong profiling is not mathematically optimal for discovering rare malfeasors on rare event detection“), but what hasn’t been examined much is identifying people that can be considered more “safe” than others. The paper explores that idea and shows that even under the assumption that the bad guys will try and subvert this program that there can be benefits to implementing this solution. The paper is a bit sparse on mathematical details. Certainly an interesting idea, though.

Just in time for the latest Christmas terror scare, I came across an interesting paper: “Strong profiling is not mathematically optimal for discovering rare malfeasors” (William H. Press; PNAS 106(6), p. 1716-1719 www.pnas.org/cgi/doi/10.1073/pnas.0813202106). In the paper, the author investigates whether profiling by nationality or ethnicity can be justified mathematically and tries to answer the question of how much screening must we do, on average, to catch the bad guys in the crowd. Rare events detection is hard as it is, and it’s interesting to see a look from the sampling perspective. It’s an interesting and short read. Long story short, it shows that using an indiscriminate feature like nationality or ethnicity is not optimal (as is any screening at least in proportion to a prior probability) and wastes resources.

Given that a common question during training with risk assessment software is “what do I do to get outcome/prediction x” from the software it should be explored how to safeguard in the software against users gaming the system. Think detecting multiple model evaluations with slightly changed numbers in a row…

Edit: I just found an instrument implemented as an Excel Spreadsheet. Good for prototyping something, but using that in practice is just asking people to fiddle with the numbers until the desired result is obtained. You couldn’t make it more user-friendly if you tried…

Recently I had a fun discussion with Bill over lunch about intellectual property and how that might apply to statistical modeling work. Given that there are more and more companies making a living from forming predictions with a model they have built (churn-prediction, credit-scores and other risk-models) we were wondering if there were any means of protecting them as intellectual property. For example, the ZETA-model for predicting corporate bankruptcies is a closely guarded secret with having published only the variables being used (Altman E. I. (2000); Predicting financial distress for companies: revisiting the Z-Score and ZETA models). Obviously this model is useful for lending and can make serious money for the user. Making decisions guided by a formula is becoming more popular. This might be something over which legal battles will be fought in the future.

Copyrighted works and patents often count towards what a company would be worth should somebody acquire it. This means there would be motivation for start-up companies to protect their models. A mathematical formula (e.g. a regression equation) cannot be patented, and copyright probably won’t apply either; even if copyright would apply, it’s trivial to build a formula that does essentially the same thing (e.g. multiply all the weights in the formula by 10). This leaves only trade secret protection and means there is no recourse once the cat is out of the bag. Often it’s also the data-collection method that is kept secret – a company called Epagogix developed a method to judge the success of movies from a script by scoring it against some scales that they keep secret.

Currently, I don’t see any legal protections with the exception of trade-secrets for this. And given that there is infinitely many ways to express the same scoring rules in a different way, this would be a fairly hard problem for lawyers and politicians to formulate sensible rules for establishing protection for this kind of intellectual property.

I’ve followed the discussion and introduction of the GPL v3 for a bit. One major change in the license is supposed to close the loophole commonly referred to as the “tivoization” of GPL software, i.e. mechanisms that prevent people from tinkering with the product they bought which includes GPL software. Tivo, in particular, accomplishes this by requiring a valid cryptographic signature for the code to run – the user has access to the code, but it’s of no use. One of the main ideas of the GPL was to allow people the freedom to tinker, improve and understand how something works. This got me thinking a bit about software that uses machine learning techniques.

For the sake of the argument, let’s assume that somebody releases a GPL version of a speech recognition system, or say an improved version of a GPL speech recognition system. While the algorithms would be in the open for everyone to see, two major components of speech recognition systems, the Acoustic Model and Language Model, do not have to be. The Acoustic Model is created by taking a very large number of audio recordings of speech and their transcriptions (Speech Corpus) and ‘compiling’ them into statistical representations of the sounds that make up each word. The Language Model is a very large file containing the probabilities of certain sequences of words in order to narrow down the search for what was said.

A big part of how well the speech recognition system works relies on the training. The author who improved upon the software should publish the training set as well. Otherwise people won’t be able to tinker with the system or understand why the software works well.

The same would hold for things like a handwriting recognition system. One could publish it along with a model (a bunch of numbers) that make the recognition work. It would be pretty hard for somebody to reverse engineer what the training examples were and how training was conducted to make the system work. Getting the training data is the expensive part in building speech-recognition and handwriting-recognition systems.

Think Spam-Assassin – what if the authors suddenly decide to not make their training corpus available anymore? How would users be able to determine the weights for the different tests?

I don’t think this case is covered by the GPL (v3 or older) – (However, I’m not a lawyer). Somebody could include the model in C code (i.e. define those weights for a Neural Net as a bunch of consts) and then argue that all is included to compile the program from scratch as per the requirements of the license. However, the source by itself wouldn’t allow anybody to understand or change (in a meaningful way) what the program is doing. With the growing importance of machine learning methods just being able to recompile something won’t be enough. I think this should be taken into consideration by the open source community for GPL v3.01.

A (long) while ago there was a lot of talk about the IRS doing text-mining for people that did not report income they got from e.g. ebay internet auctions. The German version of this attempt is called XPIDER (see also here and there) was just shown to be totally ineffective. After years of trying and spending millions they did not identify conclusive evidence to go after a single tax-evader. The German GAO (Bundesrechnungshof) is trying to find out what went wrong, why and who misspend all that money. I wonder if the more sophisticated web-crawling program the Canadian IRS is using (it’s called XENON as reported by Wired) is similarly effective …

I just read an article in the Wired blog titled “AI Cited for Unlicensed Practice of Law” citing a ruling from a court upholding its decision that the owner through the expert system he developed has given unlicensed legal advise. While an expert system is a clear cut case (as the system always does exactly what it was told [minus errors in the rules]; it just follows given rules and makes logical conclusions), this becomes more interesting in cases in which the machine learns or otherwise modifies its behavior over time. For example, lets say I put an AI software online that interacts with people and learns over time. Should I be held responsible if the program does something bad? What if I was not the person that taught it that particular behavior? This will probably be a topic that the courts will have to figure out in the future. For one, people should not be able to hide behind actions their computer has done. But what if it is reasonably beyond the capability of the individual to forsee what the AI has done?

This will probably end up being the next big challenge for courts just like the internet has been. It is interesting how the internet has created legal problems just with people being able to communicate more easily with each other: think trademark issues, advertising restrictions for tobacco or copyright violations (fair use differs from country to country; what is legal in one might be illegal in another) …