Secure Cooking with C and C++, Part 2

Editor's note: In part two of this three-part series of sample recipes from Secure Programming Cookbook for C and C++, authors John Viega and Matt Messier discuss some of the factors to consider when decoding a URL properly, and they provide example code programmers can use to decode URLs securely. (And if you missed last week's recipe, the authors provided nine techniques programmers can use for proper data validation.)

Recipe 3.8: Evaluating URL Encodings

Problem

You need to decode a Uniform Resource Locator (URL).

Solution

Iterate over the characters in the URL looking for a percent symbol followed by two hexadecimal digits. When such a
sequence is encountered, combine the hexadecimal digits to obtain the
character with which to replace the entire sequence. For example, in the ASCII
character set, the letter "A" has the value 0x41, which
could be encoded as "%41".
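
As a rough illustration of the step described above, the following minimal sketch decodes a single escape sequence. The helper names hex_value( ) and decode_escape( ) are hypothetical and are not part of the recipe's own code; the sketch also assumes an ASCII-compatible character set, in which the letters "a" through "f" are contiguous.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: convert one hexadecimal digit to its value (0-15).
 * Returns -1 if c is not a hexadecimal digit. Assumes an ASCII-compatible
 * character set, where 'a'..'f' and 'A'..'F' are contiguous. */
static int hex_value(unsigned char c) {
  if (c >= '0' && c <= '9') return c - '0';
  if (c >= 'a' && c <= 'f') return c - 'a' + 10;
  if (c >= 'A' && c <= 'F') return c - 'A' + 10;
  return -1;
}

/* Hypothetical helper: if p points at '%' followed by two hexadecimal
 * digits, store the decoded byte in *out and return 1. Otherwise leave
 * *out untouched and return 0. Each character is checked before the next
 * one is read, so the function never reads past the string terminator. */
static int decode_escape(const char *p, unsigned char *out) {
  int hi, lo;

  assert(p != NULL && out != NULL);
  if (p[0] != '%') return 0;
  hi = hex_value((unsigned char)p[1]);
  if (hi < 0) return 0;
  lo = hex_value((unsigned char)p[2]);
  if (lo < 0) return 0;

  /* The first digit supplies the high four bits, the second the low four. */
  *out = (unsigned char)((hi << 4) | lo);
  return 1;
}
```

Called on the string "%41", decode_escape( ) stores the value 0x41 and returns 1; called on a truncated sequence such as "%4", it returns 0 so that the caller can decide how to handle the malformed input.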

Discussion

RFC 1738 defines the syntax for URLs. Section 2.2 of that document also
defines the rules for encoding characters in a URL. While some characters must
always be encoded, any character may be encoded. Essentially, this means that
before you do anything with a URL--whether you need to parse the URL into
pieces (i.e., username, password, host, and so on), match portions of the URL
against a whitelist or blacklist, or something else entirely--you need to
decode it.

The problem is that you must make certain that you never decode a URL that
has already been decoded; otherwise, you will be vulnerable to
double-encoding attacks. Suppose that the URL
contains the sequence "%25%34%31". Decoded once, the result is "%41" because
"%25" is the encoding for the percent symbol, "%34" is the encoding for the
number 4, and "%31" is the encoding for the number 1. Decoded twice, the
result is "A".
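
The double-decoding example above can be reproduced with a small single-pass decoder. This is only a sketch for experimenting with the problem, not the recipe's spc_decode_url( ) function: the names hex_value( ) and decode_once( ) are hypothetical, the caller is assumed to supply an output buffer at least as large as the input plus one byte, and malformed escapes are simply copied through unchanged.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: value of one hexadecimal digit, or -1 if c is not
 * a hexadecimal digit. Assumes an ASCII-compatible character set. */
static int hex_value(unsigned char c) {
  if (c >= '0' && c <= '9') return c - '0';
  if (c >= 'a' && c <= 'f') return c - 'a' + 10;
  if (c >= 'A' && c <= 'F') return c - 'A' + 10;
  return -1;
}

/* Hypothetical single-pass decoder. Decodes src exactly once into dst,
 * which must have room for at least strlen(src) + 1 bytes. Returns the
 * number of decoded bytes, not counting the terminator that is appended.
 * Anything that is not '%' plus two hexadecimal digits is copied as-is. */
static size_t decode_once(const char *src, char *dst) {
  size_t n = 0;

  assert(src != NULL && dst != NULL);
  while (*src != '\0') {
    int hi = -1, lo = -1;

    if (src[0] == '%' &&
        (hi = hex_value((unsigned char)src[1])) >= 0 &&
        (lo = hex_value((unsigned char)src[2])) >= 0) {
      dst[n++] = (char)(unsigned char)((hi << 4) | lo);
      src += 3;  /* skip the '%' and both digits */
    } else {
      dst[n++] = *src++;
    }
  }
  dst[n] = '\0';
  return n;
}
```

Passing "%25%34%31" to decode_once( ) yields the three bytes "%41". Passing that result to decode_once( ) a second time yields "A", which is exactly the transformation an attacker relies on when a program decodes its input more than once.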

At first glance, this may seem harmless, but what if you were to decode
repeatedly until there were no more escaped characters? Certain sequences of
characters would then become impossible to represent. A literal "%41", for
example, could never be transmitted, because however many times it was
encoded, it would always end up decoded to "A". The purpose of encoding in
the first place is to allow the use of characters that have special meaning
or that cannot be represented visually.

Another potential problem with encoding, limited primarily to C and C++, is
that a NULL-terminator (a zero byte, encoded as "%00") can be encoded
anywhere in the URL. There are several approaches to dealing with this
problem. One is to treat the decoded string as a binary array rather than a
C-style string; another is to use the SafeStr library
described in Recipe 3.4 because it gives no special significance to any one
character.
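
To see why an encoded terminator matters, consider a hypothetical input such as "report.txt%00.jpg". Once decoded it occupies 15 bytes, but any C string function stops at the zero byte and sees only "report.txt". The following sketch treats the decoded data as a binary array of known length and checks it for an embedded zero byte; the helper name has_embedded_nul( ) and the file name are illustrative assumptions, not part of the recipe.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: returns 1 if any of the first nbytes of buf is a
 * zero byte, 0 otherwise. The caller must track nbytes separately,
 * because strlen( ) cannot see past the first zero byte. */
static int has_embedded_nul(const char *buf, size_t nbytes) {
  assert(buf != NULL);
  return memchr(buf, '\0', nbytes) != NULL;
}
```

A caller that knows the decoded length can reject the input outright when has_embedded_nul( ) returns 1, rather than passing a silently truncated string on to file or database routines.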

You can use the following spc_decode_url( ) function to decode a URL. It returns a dynamically allocated copy of the URL in decoded form. The result will be NULL-terminated, so it may be treated as a C-style string, but it may contain embedded NULLs as well. You can determine whether it contains embedded NULLs by comparing the byte count that spc_decode_url( ) reports with the result of calling strlen( ) on the decoded URL. If the URL contains embedded NULLs, the result from strlen( ) will be less than the byte count reported by spc_decode_url( ).