Webby thoughts, mostly about interesting applications of ECMAScript in relation to other open web standards. I live in Mountain View, California, and spend some of my spare time co-maintaining Greasemonkey together with Anthony Lieuallen.

2006-08-20

For the past year or even years, I think, I have been encountering strange site breakage that I've always, more or less subconsciously, incorrectly attributed to site owners. Random broken pictures here and there, typically in galleries, albums and the like. Rarely, but frequently enough to give the slightly "tainted" feeling of browsing a site kept almost but not perfectly in trim, or the impression that lacking quality assurance in the site's file upload dialog allowed partial uploads, botched data format conversions and the like. Perhaps one in every few hundred images broken, and only on IIS sites like those mentioned in my last post. Knowing a bit too much about the web, you can often make up lots of plausible explanations for the kind of breakage you encounter once in a while.

But a few days ago, I encountered a piece of breakage that just couldn't be explained like that, on a community site where native site functionality had been turned off for one profile. You couldn't use the messaging or commenting functionality there, because it was shut down; dropped from the available options. Clicking links leading to the profile would mysteriously blow away the entire frame in a way I couldn't even begin to understand. It was all most unfathomable, and I couldn't help suspecting my own client-side hackery: could one of my userscripts have been behind all of this?

Checking the DOM of these pages, the missing elements were indeed there; they had their nodes but were hidden via a CSS display: none; declaration. Surely I hadn't done anything like that in any of my scripts? Well, apart from that odd hack where I put an onerror handler on a few images injected by myself, dropping an image from display if its URL had gone 404 missing. No, that just wouldn't explain it -- and furthermore, the problem wouldn't go away with the first Greasemonkey debugging tip to try whenever you suspect something like this: clicking the monkey to turn off all Greasemonkey functionality temporarily and reloading the page. Yep, still the same mysterious disappearances. So the monkey went back on again.

By a stroke of luck, I finally stumbled on the culprit: AdBlock, and more precisely, a very trigger-happy regular expression rule for trashing ads, fetched by the Filterset.G updater; a rule that, by the sheer length of it, looks like it would be a very specific fit indeed, triggering only on very specific match criteria:

A bit of a mouthful, yes. What does it do? Well, in summary, it'll snag anything matching the substring "ad", optionally surrounded by one of a busload of words often denoting ads found across the web(*).

For instance, the strings "-AD" or "{AD" will do. That's incidentally a very common thing to find in GUIDs, which IIS servers like to sprinkle all over their URL space. There are five spots in those 38-character identifiers (or sometimes four in 36 characters, when the braces are gone) that each have a one-in-256 chance of triggering a hit, assuming a perfectly random distribution. Some URLs even have two or more GUIDs in them, increasing the chances further. I'm a bit surprised it doesn't strike more viciously than it does, but that's probably because the distribution isn't anywhere near perfectly random.
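The arithmetic is easy to check empirically. Here's a small sketch, using a deliberately simplified stand-in for the lax rule ("ad" right after a dash or opening brace) against uniformly random GUID-shaped strings; the real Filterset.G rule is far longer:

```javascript
// Hypothetical, simplified form of the lax rule: the substring "ad"
// directly preceded by a dash or an opening curly brace.
const laxRule = /[-{]ad/i;

// A braced GUID is 32 hex digits in five dash-separated groups; five
// positions sit right after a "{" or "-", each with a 1/256 chance of
// reading "ad" under a uniform-hex assumption.
function randomGuid() {
  const hex = () => Math.floor(Math.random() * 16).toString(16);
  const group = (n) => Array.from({ length: n }, hex).join("");
  return `{${group(8)}-${group(4)}-${group(4)}-${group(4)}-${group(12)}}`;
}

// Expected hit rate: 1 - (255/256)^5, roughly 1.9% of GUIDs.
// (Real GUIDs have fixed version/variant bits, so the true rate differs.)
let hits = 0;
const trials = 100000;
for (let i = 0; i < trials; i++) {
  if (laxRule.test(randomGuid())) hits++;
}
console.log(hits / trials);
```

Roughly one URL in fifty containing a single GUID would trip a rule shaped like this, which squares well with "perhaps one in every few hundred images broken" once imperfect randomness and multi-GUID URLs are factored in.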

The issue had apparently already been reported to the Filterset.G people a year ago, though it was deemed unfixable at the time. I submitted a suggested solution, hoping for it to get fixed upstream: tucking a leading (or trailing, just as long as it doesn't end up forming a character range) dash and an opening curly brace into the right character class fixes this particular GUID symptom.
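To illustrate the shape of that fix, here is a before/after sketch with made-up, stand-in word lists (the real rule's classes and lists are much longer; only the dash-and-brace addition is the point):

```javascript
// Before: any non-word, non-dot character may precede the match, so
// the "-" and "{" inside GUID-laden URLs qualify as lead-ins.
const before = /[^\w.](?:live|main)?ad\w/i;

// After: "-" and "{" join the negated class (the dash placed first so
// it can't be read as a character range), shutting out GUID hits.
const after = /[^-{\w.](?:live|main)?ad\w/i;

const guidUrl = "/photos/{ad3c91b2-5f4e-41d2-9c10-8b2a90c4e7f1}.jpg";
const adUrl = "/banners/mainads.gif";

console.log(before.test(guidUrl), after.test(guidUrl)); // true false
console.log(before.test(adUrl), after.test(adUrl));     // true true
```

The genuinely ad-looking URL still matches after the fix; the innocent GUID URL no longer does.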

At large, though, this looks like a regexp that evolved badly: two strict regexps relaxed and "optimised" into one, for instance, or one with a lot of self-repetition made less repetitive and overly lax in the same blow.

By the looks of this huge regexp, it wants to find and match either "something-ad" or "ad-somethingelse" -- a string matching "ad" together with any of a set of known prefixes and/or suffixes known to signify an ad. In regexp land, you formulate this as "(prefix-)ad|ad(-suffix)". This will not give false positives for random words not listed among the given prefixes/suffixes, such as "add" or good guys like "AdAware", or random GUIDs, whereas it might well trigger for "livead", "mainad", "adbroker" or "adcontent", for instance, as listed above.

But what this regexp does instead is match "(prefix-)?ad(-suffix)?", meaning: match "(prefix-)ad(-suffix)", "(prefix-)ad", "ad(-suffix)" or simply "ad", on its own!
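The difference is easy to demonstrate with a toy reconstruction (stand-in word lists, not the real Filterset.G ones):

```javascript
// The intended shape: "ad" must carry a known prefix or suffix.
const strict = /(?:live|main)ad|ad(?:broker|content)/i;

// The actual shape: both sides optional, so a bare "ad" matches too.
const lax = /(?:live|main)?ad(?:broker|content)?/i;

// Genuinely ad-ish names trip both variants...
console.log(strict.test("livead"), lax.test("livead")); // true true
// ...but the lax form also fires on any "ad" substring at all.
console.log(strict.test("upload"), lax.test("upload")); // false true
```

With both groups optional, the alternation's entire word list contributes nothing to the yes/no answer; "upload", "road" and every GUID containing "ad" sail right through.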

Ouch! Given that we're just interested in whether we hit a culprit or not, suddenly that whole slew of carefully recorded words meant to strengthen the regexp against false positives doesn't strengthen anything! We might just as well throw them away. They don't do any harm, of course, apart from wasting a bit of memory and computrons every time the regexp is used, but unfortunately they don't do any good either.

And worse: by being there at all, they imply being useful for something. Particularly when working with people who are very skilled artisans at what they do, we are mostly tempted to assume code is there for a purpose, and one that meets the eye, unless a nearby comment explicitly describes deeper magics hidden below the surface. That wastes the mind resources of anyone trying to improve on the rule, figuring "I'll just add this additional word that is a known ad, to improve the matcher incrementally." Except here it won't achieve squat!

This is a very frequent cause of debugging-induced madness: trying to improve code that has secretly been broken for who knows how long. The convention is to write code that works, so except when in deep bug-hunt mode, we read the surface of the code rather than digging in with the mindset that reads the entire code structure and filters all possible input through it to see what happens. Especially with long regexps like this one.

This is something to take heed of, and a problem that grows ever more likely to bite you as expression, nesting and complication levels rise in a chunk of code, whichever language you use. Increased complexity comes at the price of a matching decrease in readability, and before you know it, very intricate problems that are hard to find and solve creep in all over.

While I've given a rough suggestion of the real problem with the above regexp, I can't do a good job of spelling out how it should look instead: the complete regexp lists seventeen consecutive conditions to be met, eleven of which are on/off/zero-to-many toggles, and I don't know which combinations of those the regexp was aiming to meet.

Most likely, though, fixing any random one of them to satisfy the above recipe will immensely strengthen the regexp against false positives, just by giving the already provided word lists meaning again. With luck (and I'd be surprised or sad if this is not so), the Filterset.G maintainers keep their regexps version controlled, with good checkin comments noting what prompted every change. Then they can track back to the commit where the two culprit question marks in the regexp were added (assuming they were not there from the very beginning) to see which parts were meant to go where in the match. And if they were there from the very beginning, all that remains is to mend the situation from here as best you can guess.

I believe this story has many lessons to teach about software engineering, and not only about the magics of regexp crafting. Plus, I finally found and slew the random serial killer that was wreaking havoc in the IIS family photo albums! I'm sure we will all sleep better at night for it. Or maybe not.

(*) Technically, the regexp is a bit further constrained: the match must not be directly followed by ".org" or a few file extensions and other known non-ads, it must be followed by an underscore or word character, and it must be preceded by a non-alphanumeric, =, +, % or @ character. But by and large, the words listed, which make up the lion's share of the regexp, are no-ops -- unless you're using the regexp to parse out those chunks for some further string processing with the matches, rather than looking for a boolean "matches!" or "does not match!" result, as is the case here.
