A Canary Trap for URI Escaping

When building web applications, one thorny issue is URI escaping and
unescaping. This is especially important when passing data between different
systems or through multiple redirects. It's possible to end up with double- or
triple-escaped URIs, which the application might not handle correctly.

For example, if you pass "Los Angeles" through escaping once, you get
"Los%20Angeles". Web applications expect this, so they decode their input. If
there is a redirect in the path, however, you may end up with the double-escaped string "Los%2520Angeles". Triple escaping looks like
"Los%252520Angeles". Obviously, you wouldn't want to enter one of those into a
database ... or use it for output.
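
The progression is easy to reproduce. The following is a minimal sketch, assuming the URI::Escape module from CPAN (its uri_escape function is the counterpart of the uri_unescape call used later in this article). Each pass through the loop stands in for one extra hop that re-escapes the value.

```perl
use strict;
use warnings;
use URI::Escape qw(uri_escape);

# Escape the same value repeatedly, as a chain of redirects might.
my $value = "Los Angeles";
my @layers;

for my $pass (1 .. 3) {
    $value = uri_escape($value);    # the % from the previous pass becomes %25
    push @layers, $value;
    print "after pass $pass: $value\n";
}
```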

From experience, it's always safe to decode query parameters once, because
browsers will encode things if necessary. However, what if the data you pass
around actually looks like escaped data, or was input for a
calculator (say 65%20)? You don't want to overdecode that, because then the
calculator will see "65 " and output 65 instead of 5. (That's how Mars Rovers
get lost.)
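
To see the danger concretely, here is a small sketch (again assuming URI::Escape) that sends the literal string 65%20 through one layer of escaping and then decodes it once too often. The variable names are only for illustration.

```perl
use strict;
use warnings;
use URI::Escape qw(uri_escape uri_unescape);

my $input = "65%20";                # literal calculator input
my $sent  = uri_escape($input);     # one layer of escaping for transit

my $once  = uri_unescape($sent);    # the right number of decodes
my $twice = uri_unescape($once);    # one decode too many

print "sent:  [$sent]\n";
print "once:  [$once]\n";
print "twice: [$twice]\n";          # the %20 has collapsed into a space
```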

Canary Traps

How many times should you decode a URL? There's no good way to know. To solve this, turn to an old solution used by coal miners. (Or so I've been led to believe.) The story goes something like this.

Coal miners brought a bird, often a yellow canary, into the mine with them. Because the bird has a relatively high respiratory rate and small lung capacity, it is more susceptible to bad air. If the bird fell over, it was a good indication that the air in the mine wasn't safe to breathe, due to a gas leak or perhaps just a lack of oxygen. That was a sign to get out as fast as they could.

A similar technique works in software.

Software Yellow Birds

First, you need some sort of symbol to use as a canary. It should be short,
yellow, and something that gets escaped. Because you can't control the color
of the text, give up on the yellow thing. I chose =: as the canary. It's
short, distinct, and consists of two characters that get escaped (as %3D%3A).
=: is also very unlikely to be the start of a real data string.

You can use other characters for your canary, but it is easier if you select something that is more likely (or required) to be escaped in transit. If you were to select ZZ as your canary, it wouldn't help you much, because hardly anything will escape those characters. The characters in =: will always be escaped, because both are part of the set of reserved characters in URI strings.
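
You can check a candidate canary before committing to it. This sketch assumes URI::Escape and its default escaping rules; a candidate that comes back unchanged from uri_escape gives the decoder nothing to detect.

```perl
use strict;
use warnings;
use URI::Escape qw(uri_escape);

for my $candidate ("=:", "ZZ") {
    my $escaped = uri_escape($candidate);
    my $verdict = $escaped eq $candidate
        ? "unchanged, so a poor canary"
        : "escaped, so a usable canary";
    print "$candidate -> $escaped ($verdict)\n";
}
```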

The technique looks something like this. Put the canary on the front of anything you believe might be improperly escaped somewhere along the line. Then, when you receive that parameter back, decode until you see the canary on the front, and then trim it off.

The decoding loop supports up to five layers of escaping. For each layer, it calls the _remove_canary subroutine, which returns a true value if it has detected the canary, and false otherwise.

The five is arbitrary but reasonable. If you're ever in a circumstance where you have more than five levels of escaping, it probably means something very, very, very, very, very, very bad has happened. What's important is that the number is finite. Otherwise, there is the potential for an infinite loop if someone passes in data without a canary.

    for (1 .. 5) {
        last if _remove_canary(\$param);
    }

The _remove_canary subroutine takes a reference to the original data so that it can modify it in place. This has two benefits. First, you can use the return value of the function to determine when to stop looping. Second, if the data is large, the code won't keep making copies.

    use URI::Escape;    # provides uri_unescape

    sub _remove_canary {
        my $string = shift;    # a reference to the parameter

If the code detects the canary at the beginning, it removes it and returns true.

        return 1 if $$string =~ s/^=://;

The code didn't find the canary, so it unescapes the data and hopes the canary appears.

        $$string = uri_unescape($$string);

Now it checks for the canary again. This isn't strictly necessary, because the next iteration through the loop would catch it. It does save the overhead of one function call. Overoptimization? Maybe.

        return 1 if $$string =~ s/^=://;

Otherwise, it hasn't found the canary, so it returns false. This isn't an error, but it is the trigger to run through the loop again.

        return 0;
    }
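
To watch the pieces work together, here is the routine gathered into one runnable sketch. The escape_layers helper is hypothetical: it exists only to simulate a value that has been escaped some number of times on its way to you. The test value is the awkward 65%20 from earlier, which must come back untouched.

```perl
use strict;
use warnings;
use URI::Escape qw(uri_escape uri_unescape);

# Same logic as the routine above, in one piece.
sub _remove_canary {
    my $string = shift;
    return 1 if $$string =~ s/^=://;
    $$string = uri_unescape($$string);
    return 1 if $$string =~ s/^=://;
    return 0;
}

# Hypothetical helper: prepend the canary, then escape the whole
# thing $layers times to mimic a chain of redirects.
sub escape_layers {
    my ($value, $layers) = @_;
    my $wire = "=:" . $value;
    $wire = uri_escape($wire) for 1 .. $layers;
    return $wire;
}

for my $layers (0 .. 5) {
    my $param = escape_layers("65%20", $layers);
    my $found = 0;
    for (1 .. 5) {
        $found = _remove_canary(\$param);
        last if $found;
    }
    print "$layers layer(s): found=$found value=$param\n";
}
```

Whatever the number of layers, the loop stops as soon as the canary surfaces, so the literal %20 in the data is never decoded.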

Using Your Canary

There are many different ways to read arguments in CGI and similar scripts (and that would be good content for another article entirely). Whichever one you use, pass each parameter that carries a canary through the decoding loop before you use it.
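
As one possible shape, the sketch below wraps the loop in a hypothetical decode_param helper that returns the clean value, or nothing if no canary turned up within five layers. The %args hash is a stand-in for whatever your framework hands you; it is not a real API.

```perl
use strict;
use warnings;
use URI::Escape qw(uri_unescape);

sub _remove_canary {
    my $string = shift;
    return 1 if $$string =~ s/^=://;
    $$string = uri_unescape($$string);
    return 1 if $$string =~ s/^=://;
    return 0;
}

# Hypothetical wrapper around the decoding loop.
sub decode_param {
    my ($param) = @_;
    return unless defined $param;
    for (1 .. 5) {
        return $param if _remove_canary(\$param);
    }
    return;    # no canary found: let the caller decide what to do
}

# Stand-in for your framework's arguments: "=:dark red", escaped twice.
my %args = (color => '%253D%253Adark%2520red');

my $color = decode_param($args{color});
print defined $color ? "color: $color\n" : "no canary found\n";
```

Returning nothing when the canary is missing lets the caller choose between rejecting the input and falling back to a single decode.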

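When you send the value back out, be careful where you put the canary. If you prepend =: to a value that has already been escaped, the color value will be escaped more than the rest of the string. This can cause you problems later: the decoding loop stops as soon as it sees the canary, so it will not call uri_unescape enough times. Instead, make sure always to escape and unescape the canary and entire value together: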

<INPUT TYPE="HIDDEN" NAME="color" VALUE="[% "=:" _ color | uri %]">
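
The same rule applies if you build the value in Perl rather than in a template. The sketch below compares the two orderings, using a compact, hypothetical decode_param that folds the decoding loop into a single routine (it checks for the canary after zero to five rounds of unescaping).

```perl
use strict;
use warnings;
use URI::Escape qw(uri_escape uri_unescape);

# Compact variant of the decoding loop, for illustration only.
sub decode_param {
    my ($param) = @_;
    for (0 .. 5) {
        return $param if $param =~ s/^=://;
        $param = uri_unescape($param);
    }
    return;
}

my $color = "dark red";

my $wrong = "=:" . uri_escape($color);    # canary added after escaping
my $right = uri_escape("=:" . $color);    # canary escaped with the value

printf "wrong: %-20s decodes to [%s]\n", $wrong, decode_param($wrong);
printf "right: %-20s decodes to [%s]\n", $right, decode_param($right);
```

In the first case the canary is visible immediately, so decoding stops while the color is still escaped.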

In summary, canaries are a very simple solution to a very annoying problem. With a few lines of code and a little change to the way you pass data around, you (like the coal miners before you) avoid a very big headache.

Postscript

The inspiration for this article came from a conversation I had with Ask
Bjørn Hansen, who is developing the Bitcard single sign-on system. Bitcard
integrates tightly with Combust, the framework we use at perl.org. Bitcard
passes a lot of its data back and forth in the URI via HTTP GET requests. We
were having issues with multiple redirections creating double- and
triple-escaping conditions. I suggested a canary solution like the one
discussed above, and Bitcard has been happy ever after.

Robert Spier
is a member of the Site Reliability Engineering group at Google.