Query catching for fun and profit

We’ve been quite successful in capturing query strings from client
calls to use as service arguments. The class we developed in Args() condensed it into a neat little package. Queries
abound in HTTP, and there is more we could capture than our own
service’s client data, and more we could do with it all.

We want to fill a slightly different niche than the Args() object does. We’re not going to generalize
the object to, for example, accept an arbitrary URI and parse out the
query from it. We’re only going to look at the referrer or the page
we’re currently on, in that order, for a query string.

So far we’re doing almost exactly the same thing as our Args() constructor, with the exception that we’re
doing it on a different target: the document.referrer if
it’s there, or else the document.location.href. Of course
neither may have a query string, but that’s where we’re looking.

A change that becomes necessary at this point is to account for the
possibility that there will be an anchor name present at the end of the
URI. We need to strip it off when it appears. This line does it
neatly.

qString = qString.replace(/#[^#]*$/, '');
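Putting the last few paragraphs together, here is a minimal sketch of how the raw query string might be located. The function name findQueryString() and its two parameters are assumptions made for illustration; they are not part of the Query() object itself. Taking the URIs as arguments keeps the sketch testable outside a browser.

```javascript
// Hypothetical helper, not the article's own code: returns the raw query
// string (without the leading "?") from the referrer if there is one,
// otherwise from the current page's URI.
function findQueryString(referrer, href) {
    var qString = referrer || href || '';

    // Strip a trailing anchor name, exactly as in the line above.
    qString = qString.replace(/#[^#]*$/, '');

    // Everything after the first "?" is the query; no "?" means no query.
    var mark = qString.indexOf('?');
    return mark === -1 ? '' : qString.substring(mark + 1);
}
```

In a browser this would be called as findQueryString(document.referrer, document.location.href).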

The parseQuery() method for Query() is
identical to the one we developed for Args()
with one exception. We’re no longer in control of the standard for the
query strings. That means we’ll have to change one line to reflect
common usage. We’ve got no choice.

var Pairs = query.split(/;/);     // as it was in Args()
var Pairs = query.split(/[;&]/);  // as it must be in Query()
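For readers who don’t have the Args() version at hand, here is a rough sketch of what a parser built around that split might look like. It is not the original parseQuery(); the names parseQuerySketch() and safeDecode() are invented for this example, and the choice to let a repeated key overwrite the earlier one is an assumption.

```javascript
// Illustrative sketch only: turns "a=1&b=two+words;c=3" into an object
// of decoded key/value pairs.
function parseQuerySketch(query) {
    var params = {};
    var Pairs = query.split(/[;&]/);   // accept both common separators

    for (var i = 0; i < Pairs.length; i++) {
        if (Pairs[i] === '') continue; // skip empty pieces such as "a=1&&b=2"

        var eq = Pairs[i].indexOf('=');
        var key = eq === -1 ? Pairs[i] : Pairs[i].substring(0, eq);
        var value = eq === -1 ? '' : Pairs[i].substring(eq + 1);

        // Assumption: a repeated key simply overwrites the earlier value.
        params[safeDecode(key)] = safeDecode(value);
    }
    return params;
}

// Hypothetical helper: "+" means a space in query strings, and a
// malformed %-escape should not throw away the whole value.
function safeDecode(text) {
    var spaced = text.replace(/\+/g, ' ');
    try {
        return decodeURIComponent(spaced);
    } catch (e) {
        return spaced;
    }
}
```

Called with 'q=blue+book&hl=en;ie=UTF-8', this yields the keys q, hl, and ie with the values “blue book”, “en”, and “UTF-8”.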

What we want to do differently from Args()

The query string itself is easy to get from a URI, a src
attribute, the document.referrer, or the
document.location.href. Getting the part that we care
about isn’t easy.

We don’t want all the arguments, as useful as they can be. We only
care about a specific value in the whole query: the most human
readable one we can find. This usually corresponds to the search
parameter from a referring search engine. Failing that, any other
value that carries meaning is likely to be useful. The other parts of
the query string will likely be internal to the referring source and
uninteresting for our purposes.

For example, if Google sends us a visitor, the
referring query string includes their search parameters; i.e., what they
typed in the search box. That’s valuable. If we know what a user came
for, we can tailor a service’s response to match. Narrowing the target of a user’s interest is what marketing companies
pay great green for.

What aren’t valuable, generally, are all the little details that will be
included in addition to what the seeker asked about. There might be a
dozen or more extra parameters concerning location, language,
encoding, page, browser, and search controls.

We have been parsing queries in a controlled situation up to now; the
API for a service we’ve written. The query string variables from other
sources are subject to the whims of every web developer on the planet.

Let’s take a look at referring query strings from four major search
engines.

That represents 90% or better of all English search referrals. There
are always sites like Naver.com to
consider but if the search terms come through in English it shouldn’t
matter with our technique that the referrer’s other info might be in
Korean or another language.

Considering the queries above, it wouldn’t be too hard to craft a
single RegExp() that would identify the right parameter
key out of our pre-parsed query string.

var rx = new RegExp(/^((enc)?q(uery)?|p)$/);

That for example would catch the right part out of all the examples
above. But it would fail on fringe cases often. And though we’ve
probably caught 90%, why give up 10%? Also, any given site might be in
a fringe that is regularly found only by search engines not
represented in the top tier. We could be giving up far more than 10%
with that approach.
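To see both the appeal and the weakness, here is a small sketch that runs that expression over a list of parameter keys. The function name matchingKeys() is made up for this example, and the keys “search” and “terms” are hypothetical stand-ins for what a smaller search engine might use.

```javascript
// The expression from above, tested against a handful of keys.
var rx = new RegExp(/^((enc)?q(uery)?|p)$/);

// Hypothetical helper: returns only the keys the expression accepts.
function matchingKeys(keys) {
    var hits = [];
    for (var i = 0; i < keys.length; i++) {
        if (rx.test(keys[i])) hits.push(keys[i]);
    }
    return hits;
}

// "q", "p", "query" and "encquery" get through; "search" and "terms",
// which could just as easily hold what the visitor typed, do not.
var hits = matchingKeys(['q', 'p', 'query', 'encquery', 'search', 'terms', 'hl']);
```

Every new key would mean another alternation in the expression, which is the maintenance problem described above.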

Writing a RegExp() or two for those is possible. It
wouldn’t be easy however and it would be easy to break with the next
few cases. This is a hint that we’ve taken the wrong approach,
altogether. We’ve taken the wrong approach.

A RegExp() is right out. We could never craft one to
match the whimsy of every referrer out there because parameter keys
are totally arbitrary.

How-to without voodoo

The clue is the realization that the keys are not reliable. Therefore
the solution does not lie with them. That leaves the query values’
search terms.

While query keys are arbitrary, the values are not. They either make
human sense as something that could be reasonably searched and found,
like “World War II” and “Java for special education”, or they don’t,
like “en” and “org.mozilla:en-US:official.”

Pervasive patterns emerge immediately and they correspond to good
search strategy. You wouldn’t get good results searching for “en” on
Google. Nor for “I,” “tab,” or “x.” “234234SSQWFASDF” and “” are also
obviously not worth catching as search terms. If a search term is
clueless, we don’t care about it anyway because if the target is
nebulous, it can’t be hit.

Your best guess is better than what you think you know

The proper approach then is to iterate over the query values and
apply a scoring, or weighting, system. Once we give each query
value a numeric score it’s easy to choose what we’d consider the best
match. It’s the value that has the highest score.

What’s probably worth catching?

Real words; book not xxxl.

Values with more than one word; blue book over blue.

Prefer longer words to shorter ones; hippopotamus over ten.

Good scoring is like good regular expressions: sometimes knowing what
you don’t want is more valuable than knowing what you do.

What’s a giveaway that we don’t want it?

Short words; as, the, if, etc.

Non-word characters and punctuation; if it looks like cartoon swearing
*^+$@%|&, or like _val1, we don’t want it.

With just those ideas in mind we can score quite effectively. We’ll
create a chooseBest() method to wrap up both the
scoring/weighting in another method, scoredParams(), and
handle the sorting by weight to return the best guess for the most
meaningful, to a human, value out of the query string.
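One way those rules could be turned into code is sketched below. This is a guess at the shape of scoredParams() and chooseBest(), written as plain functions rather than Query() methods. The helper scoreValue(), the specific weights, the “has a vowel” test for real words, and the decision to return null when nothing scores above zero are all assumptions made for illustration.

```javascript
// Hypothetical scoring helper. Weights are arbitrary illustrations.
function scoreValue(value) {
    var words = value.split(/\s+/).filter(function (w) { return w !== ''; });
    if (words.length === 0) return -10;        // empty values are worthless

    var score = words.length > 1 ? 3 : 0;      // bonus for more than one word
    for (var i = 0; i < words.length; i++) {
        var word = words[i];
        if (/[^A-Za-z'-]/.test(word)) {
            score -= 5;                        // digits, punctuation, _val1
        } else if (word.length <= 3) {
            score -= 1;                        // short words: as, the, if
        } else if (!/[aeiouy]/i.test(word)) {
            score -= 2;                        // no vowel: xxxl is no word
        } else {
            score += word.length;              // longer words score higher
        }
    }
    return score;
}

// Attach a score to every parameter in a parsed query object.
function scoredParams(params) {
    var scored = [];
    for (var key in params) {
        if (Object.prototype.hasOwnProperty.call(params, key)) {
            scored.push({ key: key, value: params[key], score: scoreValue(params[key]) });
        }
    }
    return scored;
}

// Sort by weight, highest first, and return the best guess (or null).
function chooseBest(params) {
    var scored = scoredParams(params).sort(function (a, b) {
        return b.score - a.score;
    });
    return (scored.length > 0 && scored[0].score > 0) ? scored[0].value : null;
}
```

Given the parameters hl=“en”, q=“World War II”, client=“org.mozilla:en-US:official”, and ie=“UTF-8”, this sketch picks “World War II”. Note that the letters-only test is English-centric; accented or non-Latin search terms would need a gentler rule.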

NB: the demo script is tweaked to show only the current page’s
query string, not the referrer’s. If we were to use the
referrer, it would be confusing because your choice would show up not
when you clicked it but on the successive page load, at which point it
would become the referrer.

Alternative cat skinning techniques

This sort of query catching can be done up front with vanilla CGI but
it means you have to have that program executing on every page load or
wrapped around your entire page logic. It can also be done behind the
scenes at the Apache/webserver level with modules or
tools hooked directly into the webserver.

These are computationally expensive compared with JavaScript however,
and except in the case of Apache modules in general, they will execute more slowly
for the user. They’re also more difficult or inaccessible in general.
Unless you are running your own server, or paying $100/month for a
dedicated one, you probably will not be allowed to add custom modules.

It’s time to turn our focus to improving the management of all the
code we’ve been developing. Perhaps there is a way we can reuse the
code above without having to put it into every script we’ve got. We’re
ready to take a stab at Using JS code libraries.