Let me preface this by saying that the "history" part of this turned out to be far more complicated than we have space to cover in this story, so I'll try to keep it brief.

Back in the day, I remember the "PC DOS Tech Ref" manual (yes, I was there in 1981 to read this one. And yes, I still have my copy) - one of the many useful things in that manual was the by-now-very-familiar ASCII table, listing characters 0-127, which had been extended with another 128 characters, for an even 0-255 (00-FF in hex). I believe this extension was specific to the IBM PC. I spent a lot of time using this table, as it was handy in transcoding hex and binary data streams to characters (remember, this was before we had sniffers on PC DOS platforms).

At the time, the main competition for ASCII was EBCDIC, the character encoding used by the IBM System/36, System/38 and mainframe architectures. IBM AIX and all the other Unix vendors (this was pre-Linux) used 8-bit ASCII along with everyone else. But at least we had decent packet sniffers on the mainframe, S/3x and Unix platforms!

Enter "the rest of the world", which needed to read and write in characters beyond the limited A-Z ASCII character set. The Unicode effort was started back in 1987 (yes, really, it was that long ago!), and the standard has seen regular updates ever since. The current version is 6.2 (released just last month, in September 2012), which supports over 110,000 characters across 100 scripts, and covers rendering, collation and bidirectional ordering (to handle right-to-left scripts). All of a sudden, simple text got a lot less simple!

How does this relate to security? Because many of today's defence technologies still live in that 1981, 8-bit ASCII world.

Consider a directory traversal attack. Say you have a website at http://somesite.domain.com/somepage
A directory traversal attack will "traverse" the directory structure to steal files outside of the web root - for instance, http://somesite.domain.com/../../../etc/passwd to steal the "passwd" file from the Unix or Linux system that might host that site.

So, how do we protect against that? The web server should prevent you from using that pesky "../../.." string, or any variant that looks like it. But even in straight-up ASCII, the "./" characters can be encoded in hexadecimal - as 2E 2F. So now we need to protect against "%2E%2F", and any other variation on that.
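To make that concrete, here's a minimal sketch (in Python; the filter function and example path are mine, purely for illustration) showing how a literal "../" check misses the percent-encoded form until the URL is actually decoded:

```python
from urllib.parse import unquote

def naive_filter(path: str) -> bool:
    """Allow the request only if the literal '../' string is absent -
    the 1981-style, bytes-are-characters check."""
    return "../" not in path

encoded = "/somepage/..%2F..%2F..%2Fetc%2Fpasswd"

# The naive filter passes the encoded path right through...
print(naive_filter(encoded))            # True - attack not caught
# ...but after one round of percent-decoding, the traversal is plain to see.
print(unquote(encoded))                 # /somepage/../../../etc/passwd
print(naive_filter(unquote(encoded)))   # False - now it's caught
```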

Simple so far, but now consider Unicode, where "." and "/" can also be represented (again in hexadecimal) as the code points 002E and 002F - so we also need to protect against "%u002E%u002F" (the %u-encoding that IIS supports). Now factor in UTF-8, where a single character can be encoded as byte sequences of several different lengths, plus the hundreds of other alphabets and character sets, and we have more than a few different hexadecimal representations for the "." and "/" characters! For instance the "/" character, which we now know as %2F or %u002F, can also be sent as the overlong UTF-8 sequence %c0%af (this one was missed in an early version of IIS). Or we can mis-code it intentionally - deliberately malformed sequences like %c0%9v have also worked against sloppy decoders!

Oh, and remember that on a Windows machine the subdirectory delimiter goes the other way, using the "\" character (hex 5C)? That means we need to take everything above and double the number of checks!
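For instance (again a Python sketch, with a made-up target path), the backslash variants decode just as readily, so every forward-slash check above needs a %5C twin:

```python
from urllib.parse import unquote

# The Windows-style traversal, percent-encoded with backslashes (hex 5C)
print(unquote("..%5C..%5C..%5Cboot.ini"))   # ..\..\..\boot.ini

# A filter that only looks for "../" misses the "..\" form entirely:
for sep in ("%2F", "%5C"):
    decoded = unquote(".." + sep)
    print(repr(decoded), "../" in decoded)
```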

Now add every other web attack method (directory traversal is just one of the simplest), and you can see how character encoding can complicate matters tremendously in defending web (and other) applications!

One of the character encoding attacks that we're all expecting to become more common is the use of Unicode in spear-phishing attacks. We covered this a while back in a diary: http://isc.sans.edu/diary/non-latin+TLD+to+be+issued/8755
Consider your Google searches - it's now easy to redirect you to a site where the "o" characters are actually different characters entirely, from another code page. It's unlikely that most people would detect an attack like this, and most of our technical controls aren't prepared for non-Latin domain names either.
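A quick homoglyph sketch (Python; the spoofed domain is invented for illustration): the Greek small letter omicron renders almost identically to the Latin "o", but as far as string comparison - and DNS - is concerned, it's a completely different character:

```python
import unicodedata

latin_o = "o"        # U+006F
greek_o = "\u03bf"   # U+03BF, Greek small letter omicron

# They render nearly identically, but they are different characters...
print(unicodedata.name(latin_o))   # LATIN SMALL LETTER O
print(unicodedata.name(greek_o))   # GREEK SMALL LETTER OMICRON
print(latin_o == greek_o)          # False

# ...so a look-alike "google.com" built with omicrons is a
# different domain entirely, though it looks the same on screen.
real = "google.com"
fake = real.replace(latin_o, greek_o)
print(real == fake)                # False
```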

What got me started on this you ask? We received a note from one of our readers (thanks again Larry) - he had captured a cross site scripting attack against his web application (an unsuccessful attack, thankfully). The neat thing for me was the character encoding used to obfuscate the attacking script - the attack as captured is shown (partially) here:

This looks fairly straightforward, but that weird string in the middle '004498978135172075721:ll4byhgudkg' had me stumped. Bojan Zdrnja (another Handler here at the ISC) clued me in - it's a stored search string on Google. So what this attack script does, once successful, is pull the "real" attack down from a stored site, indexed and called indirectly courtesy of Google. This real attack might often be a command and control channel back to a botnet or other controller host, but it could be just about anything really.

Anyway, back to character encoding - you'll see that the majority of this attack was encoded and obfuscated in plain 8-bit ASCII - it's not Unicode or anything complex at all. The IPS in front of the website had no trouble dealing with it; the attack was blocked and sent to our reader as an alert, and he passed it on to us.

But remember what I mentioned about many of our defences still living in the 1980's world of 8-bit ASCII? While the attack *looks* complicated to the human eye, it's 10-years-ago complicated - that is, it looks complex, but if you've got any defences at all, attacks of this nature are likely to be blocked handily. Throwing in Unicode, especially from one of the less-used tables, and doctoring it up with some mis-coded characters might have made this simple XSS attack far more likely to evade detection by a signature-based IPS.

The proper method for an IPS (or Web Application Firewall - WAF) to deal with this is to have it decode the attack the same way the target host will (this is true of web attacks as well as network-based evasions like packet fragmentation), rather than relying on a signature database alone. If you have multiple hosts, the IPS/WAF may need to decode the attack multiple times to "get it right" for each target. The tough part is that the IPS or WAF has to decode *everything* before it knows what traffic is good and what is an attack, which is why IPSes these days usually have lots of CPU and memory!
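The decode-before-inspect idea can be sketched like this (Python; my own simplification of what a real IPS/WAF does): percent-decode repeatedly until the string stops changing, then inspect the canonical form - which also catches double-encoded tricks like %252E:

```python
from urllib.parse import unquote

def canonicalize(path: str, max_rounds: int = 5) -> str:
    """Percent-decode until the string stops changing (bounded, so a
    pathological input can't keep us decoding forever)."""
    for _ in range(max_rounds):
        decoded = unquote(path)
        if decoded == path:
            break
        path = decoded
    return path

def is_traversal(path: str) -> bool:
    """Inspect the canonical form, folding both separator styles together."""
    canon = canonicalize(path).replace("\\", "/")
    return "../" in canon

# Single-, double- and backslash-encoded probes all canonicalize
# to the same attack:
for probe in ("..%2F..%2Fetc/passwd",
              "..%252F..%252Fetc%252Fpasswd",   # double-encoded
              "..%5C..%5Cboot.ini"):
    print(probe, "->", is_traversal(probe))     # True for all three
```

The design point mirrors the paragraph above: the expensive part is normalizing *every* request into the form the target host will see before any signature or rule is consulted.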

We covered a lot of ground in today's story, I hope that the example made things clearer by showing a real attack you might see on your network today. If you have any comments, perhaps a neat attack you may have seen lately that uses character encoding, please use our comment form!

So when you're thinking about attack and defence on the net - until you've had a chance to look at the character encoding, don't believe everything you read!