Thursday, 27 November 2008

Lies, Damned Lies, and Statistics

One of our big customer contracts is up for renegotiation next month. This involves pulling a list of all the search & site activity that originated from that customer over the last year, and then negotiating based on whether usage is up or down. Over the last few years we've seen 10%-15% increases from this particular account, year-on-year, which is good. Yesterday morning I ran the stats report and got this:

Not good. In fact, very, very worrying indeed. Whilst the marketing team went into crisis mode to work out what the hell we were going to do if this was real, I started double-checking that the numbers were genuine. They certainly looked genuine. The graph is horribly organic: the decline is gradual, with occasional peaks and troughs, but the downward trend is very, very definite. In my experience, when software fails, it tends to fail in big straight lines - everything just stops working completely and stays there.

Turns out the stats were wrong - huge sigh of relief all round - but the reason why they were wrong is, I think, quite interesting. These statistics are calculated using some custom logging routines in our (legacy ASP) web code. When a user first hits the site, we create a record in the UserSession table in our database that stores their IP address, user agent string, user ID, and so on. There are some counter fields in that table that are incremented over the course of the session as the user accesses particular resources, so we can build up a fairly accurate picture of which resources get accessed heavily, by whom, and at what times throughout the day.

Well, it turns out our CreateUserSession() routine was failing if the browser's UserAgent string was more than 127 characters. Historically, this was never a problem, but at some point last year Microsoft started putting all sorts of information about .NET framework versions and plugins into the HTTP_USER_AGENT header sent by Internet Explorer (Scott Hanselman has a great post about this if you're interested). As various updates were pushed out to our users via Windows Update and corporate rollouts, the user agent strings were getting longer and longer, until one day they'd exceed 127 characters - and that particular PC would stop showing up in our logs. Whenever they'd roll out new hardware, we'd see the stats increase temporarily, until those new boxes were upgraded and the same thing happened. Hence the gradual decline and the fact that non-IE users were unaffected.
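For anyone who wants to see the failure mode in miniature, here is a rough Python sketch (our real code is legacy ASP, so this is an illustration, not the actual routine). The function names, the in-memory `sessions` list and the record layout are all made up for the example, and I'm assuming the 127-character limit behaves like a fixed-width field that rejects anything longer. The second function shows one possible alternative: truncate the string so the session still gets counted.

```python
MAX_USER_AGENT_LENGTH = 127  # limit from the post; how it was enforced is assumed

sessions = []  # hypothetical in-memory stand-in for the UserSession table


def create_user_session_strict(ip_address, user_agent):
    """Mimic the failure: reject any user agent longer than the limit."""
    if len(user_agent) > MAX_USER_AGENT_LENGTH:
        raise ValueError(
            "user agent exceeds %d characters" % MAX_USER_AGENT_LENGTH
        )
    sessions.append({"ip": ip_address, "user_agent": user_agent, "hits": 0})


def create_user_session_tolerant(ip_address, user_agent):
    """Truncate the user agent instead, so the session is still recorded."""
    sessions.append({
        "ip": ip_address,
        "user_agent": user_agent[:MAX_USER_AGENT_LENGTH],
        "hits": 0,
    })
```

With the strict version, a PC whose user agent string grows past the limit simply never gets a session record; with the tolerant one you lose the tail of the string but keep the visit.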

We would have noticed this a long time ago, of course - but the CreateUserSession() call was wrapped in a try/catch block that called a notification function when it caught an exception, and somewhere along the line, the notification mechanism for this particular system had been commented out. I'd love to blame someone else for this, but Subversion has a commit with my name on it sometime last year with the relevant line mysteriously commented out.
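The swallowed-exception pattern looks roughly like this in Python. Again, this is a sketch under assumptions rather than our actual code: `track_request`, the `stats` counters and the logger name are hypothetical, and `create_user_session` is a stand-in that just fails on long user agent strings. The point is that the handler should always leave a trace (here a log entry and a failure counter), so that disabling one notification channel doesn't make the failure invisible.

```python
import logging

logger = logging.getLogger("session_stats")  # hypothetical logger name

stats = {"created": 0, "failed": 0}  # hypothetical counters


def create_user_session(user_agent):
    """Stand-in for the real routine: fails on over-long user agents."""
    if len(user_agent) > 127:
        raise ValueError("user agent too long")
    stats["created"] += 1


def track_request(user_agent):
    """Try to record a session; never fail silently."""
    try:
        create_user_session(user_agent)
        return True
    except ValueError:
        # Commenting out the next two lines would reproduce the silent
        # failure described above: the exception is caught and nothing
        # is recorded anywhere.
        stats["failed"] += 1
        logger.exception("CreateUserSession failed")
        return False
```

Even a crude failure counter like this, plotted next to the session count, would have shown the failures climbing as the sessions fell.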

I believe the kids are calling that an "epic fail". I believe they have a point.