I’m fascinated by a common problem in data mining: how do you pick variables that are indicative of what you are trying to predict? It is simple for many prediction tasks; if you are predicting the strength of a material, you sample it at certain points, guided by your understanding of physics and materials science. It is more fascinating with problems that we believe we understand, but don’t. For example, what makes a restaurant successful and guarantees repeat business? The food? The pricing? Friendly wait staff? It turns out that a very big indicator of restaurant success is the lighting. I admit that lighting didn’t make it into my top ten list… Now consider what is asked on the average “are you satisfied with our service” questionnaire you find in various restaurant chains: I don’t recall seeing anything about the ambience on it. We are asking the wrong questions.

There are many other problems in real life just like this. I read a book called Blink, and the point the author is trying to make is that subconscious decisions are easy to make – once you know what to look for. More information is not always better. This holds for difficult problems such as judging the effectiveness of teachers (IIRC, seeing the first 30 seconds of a videotape of a teacher entering a classroom is as indicative as watching hours of recorded lectures). The same holds true for prediction problems about relationships: how can you predict whether a couple will still be together 15 years later? It turns out there are four simple indicators to look for, and you can do it in maybe 2 minutes of watching a couple… The book is full of examples like that, but does not provide a way to “extract the right features”. I have similar problems with the criminology stuff I’m working on; while we get pretty good results using features suggested by the criminology literature, I wonder whether we have the right features. I still think that we could improve our results if we had more data – or the “right” data, I should say (it should be obvious by now that more is not better). How do you pick the features for a problem? Tricky question…

There is only one kind of data mining system that does not have this problem: recommender systems. They avoid the problem because they do not rely on particular features to predict, but exploit correlations in “liking”. A classic example is that people who like classical music often like jazz as well – something you wouldn’t easily be able to predict from features you extract from the music. I wonder if we could reframe some prediction problems in ways more similar to recommender systems, or maybe make better use of meta-information in certain problems. What I mean by “meta-information” is easily explained with an example: PageRank. It is so successful in web-scale information retrieval because it does not bother with trying to figure out whether a page is relevant from keyword ratios and whatnot, but simply measures popularity by how many important pages link to it (before link spam became a problem, that is). I wish something simple like that were possible for every problem 🙂
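To make the “correlations in liking” idea concrete, here is a minimal sketch of an item-to-item recommender. It is only an illustration under my own assumptions: the users and their likes are made up, the function names are hypothetical, and cosine similarity over “who likes what” is just one common way to measure how often two things are liked together. Note that nothing in it looks at the music itself.

```python
from collections import defaultdict
from math import sqrt

# Made-up toy data: user -> set of genres they like.
likes = {
    "ann": {"classical", "jazz"},
    "bob": {"classical", "jazz", "blues"},
    "cat": {"classical", "jazz"},
    "dan": {"metal", "punk"},
    "eve": {"metal", "punk", "blues"},
    "fay": {"classical"},  # the user we want recommendations for
}

def item_similarity(likes):
    """Cosine similarity between items, based only on who likes them."""
    fans = defaultdict(set)
    for user, items in likes.items():
        for item in items:
            fans[item].add(user)
    sim = defaultdict(dict)
    for a in fans:
        for b in fans:
            if a == b:
                continue
            overlap = len(fans[a] & fans[b])
            if overlap:
                sim[a][b] = overlap / sqrt(len(fans[a]) * len(fans[b]))
    return sim

def recommend(user, likes, sim, top_n=3):
    """Score items the user has not liked yet by similarity to liked ones."""
    scores = defaultdict(float)
    for liked in likes[user]:
        for other, s in sim[liked].items():
            if other not in likes[user]:
                scores[other] += s
    # Highest score first; ties broken alphabetically for stable output.
    return sorted(scores, key=lambda item: (-scores[item], item))[:top_n]

sim = item_similarity(likes)
print(recommend("fay", likes, sim))
```

With this toy data, jazz comes out on top for the classical-only listener purely because the same people tend to like both.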

Bruce Schneier gave a speech on how human psychology affects computer security. Very true, as security software is often too cumbersome to use. Email encryption is still not commonplace, while SSL as an end-to-end encryption is. It’s easy to use and people have been trained to look for that little golden padlock in the corner before entering their credit card number. Yet I feel that there are a couple of things that could be done to encourage people to pay more attention to computer security. In my opinion this isn’t happening because:

Most people are good and assume that other people are good too. They hold the door open for the guy that left his badge in the car, they click on the “cool link”, they open email that looks like it might be from someone important.

Most people see security problems as something that happens to someone else. Most breaches are never publicized, and some publicized breaches are so huge (millions of credit card numbers copied, yet nothing happens to them or anybody they know) that they reinforce the belief that problems are unlikely. We feel safe in a crowd.

Most people believe they know what they are doing. Some other people are pretty learning-resistant when it comes to computers. I’ve heard stories from companies in which the IT staff is supposed to do user training in addition to the external training people received in the beginning (try to get accounting to explain to you over and over again how to file reimbursement claims). Maybe we really need a computer driver’s test, but then again drunk driving can kill people while drunk computing cannot.

People get bored. Cry wolf too often, or ask a person to be careful too many times in the face of a relatively low-probability event, and they become trained to click “Yes, I’m sure.” (This will be interesting with Windows Vista.) We are constantly bombarded with awareness programs, so IT security awareness has to compete with many other awareness programs.

There is no incentive. Most people (employees) don’t face consequences when their PC is infected or the company database gets stolen. People have the neighbor’s kid come over to remove all the spyware from the machine, and so on. Avoidable security problems like spyware turn into a “car maintenance problem”.

I think there is a lot that can be done on the incentive side. Industry has gained a lot of experience with safety incentive programs to reduce accidents. I found a study cited on a website which states that reinforcing safe acts “removes the unwanted side effects with discipline and the use of penalties; it increases the employees’ job satisfaction; it enhances the relationship between the supervisor and employees” (McAfee and Winn 1989). Properly designed incentives have the approval of the people to whom they are addressed, and are often preferred to other forms of safety motivation such as laws and policing. Some incentives could probably be created to educate users and teach them safer computing practices. For example, to make people think more carefully about following links in email (phishing!), one could send fake phishing emails; if the user clicks on a link, they land on a page that informs them that this could have been a trap and that they should always enter the URL directly into the browser address bar. It’s possible to track who clicked and who didn’t with specially crafted URLs in the emails. Similar things could be done with harmless executable attachments. I think this is a direction that should be pursued.
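As a rough idea of what “specially crafted URLs” could look like for such an internal awareness exercise, here is a minimal sketch. Everything specific in it is my own assumption: the landing page address `training.example`, the secret key, the parameter names `c` and `t`, and the function names are all hypothetical. The idea is simply that each recipient gets a link carrying a short token derived from their address, so a click on the landing page can be matched back to a recipient without putting the address itself in the URL.

```python
import hashlib
import hmac
from urllib.parse import parse_qs, urlencode, urlparse

# Assumptions for illustration only: use a real random secret and your own
# internal landing page in practice.
SECRET_KEY = b"replace-with-a-random-secret"
BASE_URL = "https://training.example/landing"

def make_token(recipient, campaign):
    """Derive a short per-recipient token that cannot be guessed without the key."""
    msg = f"{campaign}:{recipient}".encode("utf-8")
    return hmac.new(SECRET_KEY, msg, hashlib.sha256).hexdigest()[:16]

def build_tracked_url(recipient, campaign):
    """Build the link that goes into this recipient's training email."""
    query = urlencode({"c": campaign, "t": make_token(recipient, campaign)})
    return f"{BASE_URL}?{query}"

def who_clicked(url, recipients):
    """Map a clicked URL back to a recipient, or None if the token is unknown."""
    params = parse_qs(urlparse(url).query)
    campaign = params.get("c", [""])[0]
    token = params.get("t", [""])[0]
    for recipient in recipients:
        expected = make_token(recipient, campaign)
        # Constant-time comparison on bytes.
        if hmac.compare_digest(token.encode("utf-8"), expected.encode("utf-8")):
            return recipient
    return None

recipients = ["alice@example.com", "bob@example.com"]
bob_url = build_tracked_url("bob@example.com", "q1")
print(bob_url)
print(who_clicked(bob_url, recipients))
```

The landing page would log the token and show the “this could have been a trap” message; comparing the logged tokens against the recipient list then tells you who clicked and who didn’t.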