Archive for the ‘Ramblings’ Category

I just read the book Cult of the Amateur and quite frankly disagree with a lot written in it. To make a long story short, the author’s world is one where there is a simple transcendental reality; a truth purveyed by trained experts, journalists and professionals. Apparently we are threatened with dissolution by the radically relativistic truths of Wikipedia, where the community sets the agenda. I think things are far less black and white than he makes them seem. One particular example I very much disagree with is his criticism of recommender systems. The author claims that nobody will need to read movie reviews anymore when there are AI systems making the recommendations. I have to say that so far I have found movie ratings by experts – be it ratings on IMDB or a professional review in a newspaper – totally useless. I’ve liked films with low ratings, and hated others with high ratings. I have discovered so many things that I liked with recommender systems like Pandora. Yes, there are problems with recommender systems, but relying on an expert opinion in matters of taste can, in my opinion, never work out. People simply have different tastes, and I don’t think that one reviewer writing movie reviews for a particular newspaper speaks for the entire readership. For another wonderful example, using spaghetti sauce as an illustration, see the TED talk by Malcolm Gladwell. That said, there are some points in the book that deserve consideration. Anybody who has ever read through the comments (“noise”) on YouTube knows that there seems to be a mass infestation of stupidity out there – something that needs to be taken into account in all the Web 2.0 experiments. StupidFilter, anyone? 🙂

I was playing around with the VMware Player and a Windows XP image, trying to establish a VPN connection with Microsoft’s VPN client. It connected just fine and then got stuck at “Verifying Username and Password”. After a while it aborted with a time-out error (was it error 638 or 721?). It turns out that GRE (Generic Routing Encapsulation) doesn’t deal well with multiple network address translations (e.g. VMware’s NAT networking followed by my DSL router’s NAT). It worked once I switched the VM to bridged networking. This took me a couple of hours to figure out…
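For reference, the fix amounts to switching the VM’s virtual NIC from NAT to bridged. In the VMware Player GUI this is a dropdown in the network adapter settings; in the VM’s .vmx file it is roughly the following (exact key names can vary between VMware versions, so treat this as a sketch):

```text
# In the VM's .vmx configuration file:
ethernet0.present = "TRUE"
ethernet0.connectionType = "bridged"   # was "nat" – bridged avoids the second NAT hop
```

With bridged networking the guest gets its own address on the LAN, so the PPTP/GRE traffic only passes through one NAT (the DSL router) instead of two.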

After reading this article on Slashdot about NSI immediately registering every free domain that is searched for on their site, I went ahead and tried it myself. Indeed, seconds after I searched for two random domain names, they were registered (or locked) – they even put a domain-parking page on them. Since this is all fully automated, I can’t help but wonder what would happen if somebody searched for all sorts of trademarked names, especially from companies that are fairly aggressive in suing for trademark infringement. I wonder if they thought about that…

I just came across a very interesting book announcement for “Super Crunchers: Why Thinking-by-Numbers Is the New Way to Be Smart” by Ian Ayres, a professor at Yale Law School and an econometrician. In the book (I haven’t read it yet, but I will) the author argues that intuition is losing ground to statistical methods and data mining. According to the Amazon abstract, he gives examples from the airline industry, medical diagnostics and even online dating services showing that a statistical model will outperform human intuition.

That machines can outperform human judgement has been known for quite some time. For example, in the field of psychology the diagnosis of mental disorders is more or less standardized by the DSM. There was a very interesting meta-analysis showing that a mechanical predictor consistently matched or outperformed the human psychologist. To be specific: Grove, W.M., Zald, D.H., Lebow, B.S., Snitz, B.E., & Nelson, C. (2000). Clinical versus mechanical prediction: A meta-analysis. Psychological Assessment, 12(1), 19–30. To quote from the abstract: “On average, mechanical-prediction techniques were about 10% more accurate than clinical predictions. Depending on the specific analysis, mechanical prediction substantially outperformed clinical prediction in 33%-47% of studies examined. Although clinical predictions were often as accurate as mechanical predictions, in only a few studies (6%-16%) were they substantially more accurate. Superiority for mechanical-prediction techniques was consistent, regardless of the judgment task, type of judges, judges’ amounts of experience, or the types of data being combined.”
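The “mechanical prediction” in studies like this is often nothing fancier than a fixed linear combination of the same cues the clinician sees, applied identically to every case. A minimal sketch of the idea – the cue names, weights and cutoff here are hypothetical, purely for illustration:

```python
# Sketch of a "mechanical" predictor: a fixed, unit-weighted linear rule
# applied identically to every case. Cues and cutoff are made up for
# illustration; real actuarial rules are fitted to outcome data.

def mechanical_predict(case):
    # Each cue is coded so that higher values mean more pathology.
    cues = [case["symptom_score"], case["test_scale"], case["history_flag"]]
    score = sum(cues)      # unit weights: every cue counts equally
    return score >= 2      # fixed cutoff, applied without exception

cases = [
    {"symptom_score": 1, "test_scale": 1, "history_flag": 0},  # score 2 -> positive
    {"symptom_score": 0, "test_scale": 1, "history_flag": 0},  # score 1 -> negative
]
print([mechanical_predict(c) for c in cases])  # [True, False]
```

Much of the accuracy edge in the meta-analysis comes precisely from this consistency: a fixed rule never gets tired, never anchors on one vivid detail, and never second-guesses itself on a whim.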

I’m a little bit skeptical about using data crunching to decide important questions (as in life-and-death questions). In general it seems like a good idea, but it always comes down to how you model the data and how you model the question to be answered. In many cases this might be obvious, in others not so much. The art is in modeling the data, not in applying the algorithm or technique. It reminds me a bit of a class on formal program verification I took back in Darmstadt. Stefan, the TA of the class, and I had an argument about the practicability of program verification. He gave the Unix find utility as an example for which you can show – more or less easily – that the program will terminate while enumerating all the files in all the directories in the system, and how find can be nicely modeled with a well-founded relation to show the termination of the algorithm. I objected that I could set a symbolic link to an upper-level directory (which is why find does not follow them by default) and make find go in circles. Stefan conceded: “Oh well, I guess then the model was wrong…”. Similar things have happened in cryptography, where a finite-state model (sorry, I lost the citation somewhere; I’m not quite sure if it was the Usenix paper from the Stanford guys I read or something else) showed that the SSL protocol (Secure Sockets Layer) is secure. Later the protocol was broken nonetheless (Wagner, David; Schneier, Bruce: Analysis of the SSL 3.0 Protocol).
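The termination argument assumed the directory tree is well-founded – no infinite descending chains – and a symlink back to an ancestor breaks exactly that assumption. A small Python sketch of a traversal (hypothetical, not the actual find source) makes the point: without the visited-set guard, following such a symlink would recurse forever.

```python
import os

def walk(path, follow_symlinks=False, _seen=None):
    """Enumerate files under path. The _seen set of resolved paths is the
    cycle guard: remove it and a symlink pointing to an ancestor directory
    would make the recursion loop forever when follow_symlinks is True."""
    if _seen is None:
        _seen = set()
    key = os.path.realpath(path)   # resolve symlinks to detect revisits
    if key in _seen:
        return                     # already visited: break the cycle
    _seen.add(key)
    for entry in os.listdir(path):
        full = os.path.join(path, entry)
        if os.path.isdir(full) and (follow_symlinks or not os.path.islink(full)):
            yield from walk(full, follow_symlinks, _seen)
        else:
            yield full
```

In model terms: the set of not-yet-visited resolved directories shrinks on every recursive call, which restores a well-founded relation – precisely the property the original model assumed but the filesystem, with symlinks, does not actually provide.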

I think that with the wrong model you can show a lot of good things about anything. Once you abstract from the real world and build a model, you might just have ignored the one little feature that matters most. Maybe it is time for a best-practices guide to data modeling and data mining (there are already some books out there for specific domains)…

Last night I was out with a couple of friends at our favorite club. I had the time of my life and met some interesting new people. When we left the club, a girl got hit by a car right in front of our eyes. I’ve seen quite a bit of “shit going down” in my life now (3 fires, one guy getting beat up in an alley, 2 times one or more people getting hit by a car, once getting hit by a car myself) and what I find interesting is the different reactions by people. Last night some guys freaked, some were calm, some tried to help, some girls just started crying. Most people just stared in some kind of apathy and did nothing. Some “tough guys”, who I wouldn’t have expected that from, freaked out. In all the instances I’ve been involved with I’ve stayed calm and just did what was necessary (including the one where I got hit). That said, 911 operators are kinda slow – why do they ask if the person hit by the car is OK when I already told them that the person is seriously injured? What a weird end for a good night – I’m still shaking a bit…

Last Saturday (3/31/07) I went to the Caffeine Music and Arts Festival rave (by Triad Dragons) in Littleton (Denver). The event started at 8 PM and was supposed to go until 5 AM. This must have been the worst-organized event I have ever been to. I bought tickets online, arrived at 9 PM and stood in line in the cold with thousands of other people. Everybody already had their ticket, and it took two hours of standing in line just to be able to see the entrance. It was cold, and everybody (including me) was wearing thin stuff like t-shirts. This is ridiculous. At 11 PM I left, as it probably would have taken another hour or so (given the speed the line was moving) to get to the entrance from where I was. Given that they expected a couple of thousand people to attend, they had (from what I could see from where I was) only two (2!) people to pat down visitors for drugs and weapons. Maybe the rest were on break or whatever, but at $35 a ticket one would expect them to be able to afford enough staff for the event. I will most certainly not attend the Caffeine Festival 2008, and I will let anybody (who wants to hear about it) know how things went in 2007. I am upset…

When I started my blog I was already aware of the comment-spam problem and thus enabled a WordPress plugin to prevent comment spam (“did you pass math”). The other day a friend complained that when he wanted to comment on something and forgot to fill out the captcha field, his comment got lost (and pushing the back button made his browser lose everything he had typed up). And reading through the raw Apache logs, I saw somebody trying to post a comment and apparently not succeeding. So I turned the plugin off, and within a day I had 8 spam comments on my blog (which does not have a high PageRank and uses nofollow links; what’s the gain?)… So I’ll keep it turned on. There!

Spam is an interesting problem, because you have an “adversary” with a lot of resources who will do whatever it takes to get your attention, an email in your inbox or a comment with links on your blog. The more filters we build, even with machine learning, the more sophisticated the spam becomes. It will probably be a driving force for classification research for some time to come. However, machine learning and filters are very expensive in CPU time and do not scale very well. Sander told me about the email server at their institute having a backlog of 40 gigabytes, i.e. 40 gigabytes of emails sitting in the spool waiting to be scanned for spam and viruses. That this server was serving only about 50 users, and that 99% of the email in the spool was probably spam, illustrates the problem. Currently (in my opinion) mechanisms like grey-listing are a better solution simply because they scale better: they exploit “implementation issues” of the spam software and don’t require a CPU-intensive scan of every email. That is, until the next generation of spam bots adapts to those measures. Build a better spam filter and somebody will build a better spam.
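Grey-listing scales because the per-message work is tiny: the server temp-fails the first delivery attempt from an unknown (client IP, sender, recipient) triplet and only accepts a retry after a delay. Legitimate mail servers retry; most spam bots fire and forget. A minimal sketch of the decision logic (not a real MTA – the delay and the in-memory store are illustrative):

```python
import time

GREYLIST_DELAY = 300   # seconds before a retried triplet is accepted
_seen = {}             # triplet -> timestamp of first delivery attempt

def greylist_check(ip, mail_from, rcpt_to, now=None):
    """Return an SMTP-style decision for one delivery attempt.
    First contact from a triplet gets a temporary failure; a retry
    after GREYLIST_DELAY seconds is accepted."""
    now = time.time() if now is None else now
    triplet = (ip, mail_from, rcpt_to)
    first = _seen.setdefault(triplet, now)   # remember first attempt
    if now - first < GREYLIST_DELAY:
        return "450 try again later"         # temp-fail: costs one dict lookup
    return "250 ok"                          # proper MTA retried: accept
```

Compare the cost per connection – one hash-table lookup – with content-scanning a 40-gigabyte spool, and the scaling argument makes itself.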

I just read an article on the Wired blog titled “AI Cited for Unlicensed Practice of Law” about a court upholding its ruling that the owner of an expert system had, through the system he developed, given unlicensed legal advice. While an expert system is a clear-cut case (the system always does exactly what it was told [minus errors in the rules]; it just follows given rules and draws logical conclusions), this becomes more interesting in cases in which the machine learns or otherwise modifies its behavior over time. For example, let’s say I put an AI program online that interacts with people and learns over time. Should I be held responsible if the program does something bad? What if I was not the person who taught it that particular behavior? This will probably be a topic that the courts will have to figure out in the future. On the one hand, people should not be able to hide behind the actions of their computers. But what if it was reasonably beyond the capability of the individual to foresee what the AI would do?
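The “clear-cut” part about classic expert systems is that the rule base is the behavior: every conclusion traces directly back to a rule the author wrote down. A toy forward-chaining sketch shows why (the legal-flavored rules here are entirely made up):

```python
# Toy forward-chaining expert system: repeatedly fire rules whose
# conditions are satisfied until no new facts appear. Every conclusion
# is traceable to an author-written rule; the rules below are made up.

RULES = [
    ({"contract", "signed"}, "binding"),
    ({"binding", "breached"}, "damages_possible"),
]

def infer(facts):
    facts = set(facts)
    changed = True
    while changed:
        changed = False
        for conditions, conclusion in RULES:
            if conditions <= facts and conclusion not in facts:
                facts.add(conclusion)    # rule fires, exactly as written
                changed = True
    return facts

print(infer({"contract", "signed", "breached"}))
```

Nothing in such a system’s output exists outside the deductive closure of its rules – which is why attributing its “advice” to its author is straightforward, and why the attribution gets murky once the rules themselves are learned from interaction rather than written down.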

This will probably end up being the next big challenge for the courts, just like the internet has been. It is interesting how the internet created legal problems just by letting people communicate more easily with each other: think of trademark issues, advertising restrictions for tobacco, or copyright violations (fair use differs from country to country; what is legal in one might be illegal in another)…