IE6 fucking up BASE (again)?

At the beginning of this week we noticed a rather dramatic increase in the popularity of our forum, or at least so it seemed. Pageview counts were up from an average of 800,000 a day to over a million, spiking to 1.5 million last Monday. Surely this couldn't be right...

1.5M pageviews on our forum on Monday 26-05-2008

So we put on our "suspicion" hat and started investigating. At first we suspected the PicLens plugin, since at least the IE version of that Web 2.0 (so nice, shiny and worthless) tool seems to have a real appetite for RSS feeds, but separating the views on RSS from the regular forum pageviews did not account for this drastic increase, nor did we find irregularities in, for instance, the topicview stats.

So the cause of this must have been something else, something that slipped through most of our statistics but still counted as pageviews. So it was time to inspect the Apache access logs, and it didn't take long to notice the endless number of requests for resources such as:
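(The original log excerpt is no longer shown here; the lines below are a hypothetical illustration of the pattern, reconstructed from the recursive requests for "js/forum.js" described further down. The exact paths are an assumption.)

```
GET /forum/js/forum.js
GET /forum/js/js/forum.js
GET /forum/js/js/js/forum.js
GET /forum/js/js/js/js/forum.js
```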

Some useragent was seriously fucking up here, and it didn't take long to find the culprit, since all of these requests (hundreds of thousands each day over the past weeks, starting approximately as of May 12th) carried one of these two useragent strings:

Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1;1813)
Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1)

Fact 1: this sudden increase in requests occurred shortly after the arrival of SP3 for Windows XP, and the UA strings all indicate XP clients.
Fact 2: SP3 for Windows XP is known to 'clean up' the UA string of Internet Explorer (removing all the Qxxxxxx and .NET CLR x.x.xxxx additions).
Fact 3: SP3 for Windows XP comes with a bunch of patches for IE6 when that is still the installed browser version.

Conclusion: some patch in SP3 targeted at IE6 has a nasty side effect in the retrieval of external JS files that are relatively linked from a page using a specific <base href> (pointing to another domain). And only JS files: neither our CSS files nor our relatively linked static images seemed to be requested wrongly like this.
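To sketch what should happen: a conforming browser resolves a relatively linked script against the <base> href, not against the page's own URL. A minimal illustration using the (modern) WHATWG URL API, with the tweakimg.net base href our templates use:

```javascript
// Resolve a relative script src against a <base href>, the way a
// standards-compliant browser would.
const baseHref = 'http://tweakimg.net/g/forum/templates/tweakers/';
const resolved = new URL('js/forum.js', baseHref).href;

console.log(resolved);
// http://tweakimg.net/g/forum/templates/tweakers/js/forum.js
```

The misbehaving clients instead resolved 'js/forum.js' against the forum page's own URL, and then kept re-requesting it recursively.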

The actual cause and the patch responsible for this weird behaviour in IE6 for now remains a mystery. Also odd is the fact that IE6 apparently goes into some kind of recursive loop trying to locate "js/forum.js". We didn't bother asking Microsoft, since they are known to either be unresponsive when confronted with bugs or to decide that it is "by design"; instead we just linked the JS file absolutely and immediately saw a drop in "pageviews":
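In concrete terms the workaround came down to something like this (only the "js/forum.js" path is certain; the full absolute URL is an assumption pieced together from the templates base href):

```html
<!-- Before: relative src, resolved via <base href>, mangled by the buggy clients -->
<script type="text/javascript" src="js/forum.js"></script>

<!-- After: absolute src, immune to broken <base> handling -->
<script type="text/javascript"
        src="http://tweakimg.net/g/forum/templates/tweakers/js/forum.js"></script>
```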

Drop in pageviews/requests once we worked around the IE6 bug

Now IE6 is known to treat <base> in a non-standard way: unless you explicitly close the <base> tag (which has no closing tag in HTML), all following elements become descendants of <base> instead of <head>. That's why we use this in our HTML:
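(The snippet itself is no longer shown here; it would have looked something like the following, with the explicit closing tag being the point. The href value is an assumption based on the templates URL mentioned in the comments.)

```html
<base href="http://tweakimg.net/g/forum/templates/tweakers/"></base>
```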

This may have something to do with it, but we've used it for years to accommodate scripts that expect certain elements to be a direct child of <head>.

As for the importance of <base> itself, I already wrote about that some time ago; still, we regularly see odd requests coming from agents that simply ignore <base>...

Unfortunately, even with a tech-minded audience, IE6 still cannot be disregarded, and with Microsoft still fiddling with its codebase the matter just got worse. Maybe there should be a law against maintaining bad browsers? IMO, IE7 should have been a mandatory part of XP's SP3; not that it is a good browser, but at least it is somewhat better than IE6...

Update 03-06-2008: as the comments show, it is more plausible that recent pre-fetching features from anti-virus vendors (among others AVG) are responsible for this. They probably use bad parsers to do in-depth scanning of linked JavaScript files.

Not only do I think that pre-fetching content is a dubious practice, since the cost mostly doesn't outweigh the actual benefit to the user (and more importantly: the cost isn't placed solely on the user but on the website owner as well, since content gets requested that the user may never actually see), but practice has shown that pre-fetchers in general:

- do not seem to do any form of caching (either client-side or at the vendor's site) that would mitigate (some of) the bandwidth issues;
- skew pageview statistics by not identifying themselves as automated tools, instead posing as 'normal' browsers;
- generally ignore robots.txt and rel directives;
- often go more than one level deep.

Not to mention the fact that GET requests may not always be idempotent (as they should be)...

Comments

It would be a great idea to ban bad browsers by law. In fact, I think we should go a little further and force companies to abide by the standards set in a certain field. So, for example, Microsoft would have to make browsers that accept valid HTML, and Apple would have to make it an option to put your music on an iPod via drag & drop. This way we could force companies to be compatible with the standards, so that hard-working developers elsewhere don't have to worry about it anymore. I don't think it's going to happen, though.

You could close the tag this way, I presume: <base href="http://tweakimg.net/g/forum/templates/tweakers/" />
Adding a slash at the end of the tag should not cause any problems in other browsers.

JW1: that would not make any difference since in HTML that slash has no meaning whatsoever; it is completely ignored by the parser.

Civil: we have seen this kind of behaviour in the past as well, but back then it was mostly proxies that did the prefetching while also ignoring BASE (and it wasn't limited to only the linked JS file). See the other blog post I once wrote about BASE. I'm quite sure that this problem is something new.

I had the same problem with one of my websites. I diagnosed the problem to be the AVG LinkScanner plugin for IE6, which was included in AVG 8, released at around the same time. This seems more plausible than an update to IE6 from Microsoft.

Maybe you could help bring attention to this by posting it as a news article on tweakers.net. I informed AVG about this problem a week ago and only received an automated reply that they were processing my "technical query".

In the meantime, possibly thousands of websites are being attacked by this tool while many webmasters remain unaware until they receive the bandwidth bill at the end of the month.

The hosting provider of my website thought it was a malicious PHP-script and restricted access to the directory that was being requested over and over. They also wanted the client to upgrade to a dedicated server for 325 euro per month. Only because I was able to prove that it was not our fault that this happened did the hosting company not charge us extra or force us to upgrade.

To clarify, the UA Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1;1813) is AVG LinkScanner (not the optional toolbar), and it checks results of any searches done on Google/Yahoo/MSN by pre-fetching the listed page and (in some cases) any external JavaScript files.

The UA Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1) remains unidentified but as noted elsewhere acts in exactly the same way as AVG and is likely to be some other anti-virus package exceeding its capabilities.

So these are aggressive pre-fetchers that apparently don't do any caching or actual HTML parsing (at the very least they ignore BASE) when checking for linked scripts. In my book that's amateurish, or at least a big fuck-up. Where can we send the bill for the wasted bandwidth?

Wow. I cannot believe this. We have been fighting performance issues on our web site for the last month, and just commissioned a new server. Then we got our bandwidth overage bill for May, and our bandwidth was more than double (and we got billed huge overages). The bandwidth on our site was going up EXPONENTIALLY! For June, we were looking at being 4-5 times more than our allocated bandwidth, and were looking at more than $5K this month in overages!

What made us realize something was off was that the page views according to Google Analytics were flat, yet traffic and bandwidth were EXPLODING. Most of this started in early April and headed north in a really scary J-curve. But when we ran Webalizer stats, they indicated that traffic on the site WAS going up. Since Google Analytics only logs page views that the browser actually renders (via JavaScript), none of this showed up in the Google stats. So clearly something OTHER than normal browser traffic was sucking up our bandwidth and CPU time.

Our particular problem came from implementing the Highslide JS script and using relative paths in that script. From our logs, it appears that IE6 and Firefox 1.0 were both causing recursive calls and sucking up HUGE amounts of bandwidth. Once we changed the paths over, our CPU utilization on the site dropped from close to 100% down to 20%. Holy cow. And I expect our bandwidth is going to drop drastically as well.

Now our site is no longer overloaded, and we probably do not need the new machine we just commissioned....

Your FACT 2 comment about SP3 cleaning up the user-agent strings doesn't appear to be correct in my testing. I tried out SP3 followed by .NET 2.0, .NET 2.0 followed by SP3 and in both cases IE6 does send