Regular expression sandboxing

Wednesday, 5 May 2010

Birth of the regex sandbox

I decided today to do a proper blog post to explain my reasons for creating regex sandboxes. I don’t often write a lot of words on this blog partly because I’m not very good a making long meaningful sentences and partly because I think the point can often be made in less words. Hopefully this will be useful for someone writing filters.

First off a quote “You can’t parse [X]HTML with regex. Because HTML can’t be parsed by regex. Regex is not a tool that can be used to correctly parse HTML” from (stackoverflow). I agree with the comment it isn’t possible to fully parse HTML with regexes but my goal wasn’t to do that, I wanted to parse a safe form of HTML. I also have a uncontrollable urge to do something that people say can’t be done.

Now we have that out of the way, how did this all begin? Well I was building a char by char JavaScript parser inside JavaScript to allow untrusted code to be executed. Every time I wrote a simple string matching function I found myself making shortcuts and using regexes instead. For example why loop through all characters when you can whitelist the desired ones? I soon found that I had a great advantage of using regexes instead of parsing every character, because I could use the native JavaScript engine to help me.

This lead me to develop JSReg [1], at first it seemed very easy to match JavaScript, the numbers were pretty easy and strings but I then encountered one of the first problems of regex sandboxes. It is very difficult to match something that is matching itself, for example an array can contain pretty much any JavaScript statement and itself but if you are defining it how can you match it? I didn’t really have an answer to this, one of my solutions to this problem was to create a recursive regex that created a second compiler to match inside the first match and so on. But this was slow and because JavaScript doesn’t have lookbehind previous matches would eat characters in the next match (I’ll talk more about this in the design). My other idea was to use backreferences but these are very difficult to track when using multiple regexes and they only return a successful match in my tests it wasn’t possible to produce a perfect array match using backreferences. I could be wrong of course I know I’m not perfect.

The design

My basis of my design was to not rely on 3rd party code were possible that means no jquery etc, in addition I should employ multiple layers of security wherever possible. These were good design decisions. Throughout initial testing the multiple layers proved difficult to break down. For JSReg the first layer was an iframe, the iframe was created each time of execution enabling fresh prototypes and a throw away box once execution had finished. Then I whitelisted the entire JavaScript objects/properties, this was done by forcing all methods to use suffix/prefix of “$”. Each variable assignment was then localized using var to force local variables. Each object was also checked to ensure it didn’t contain a window reference.

Javascript arrays proved tricky as mentioned earlier because of the amount of code that can be included within them, initially I decided to try and match them and their contents. But there were several performance problems of matching all that code and JavaScript regex limitations. For example I use one regex with a replace function to globally match each sequence using groups, the idea is to match all the valid objects first. In the instance of an array you’d first match all regex objects, strings etc because they can contain a “[” and “]” then once all valid objects have been enumerated by the regex engine it will encounter the first “[” of our array.

This works well in practice for every object apart from arrays. In JavaScript the array literal shares the same syntax as the object accessor. Therefore you have to identify the difference between an array or object. Sounds easy?

As you can see with the samples above, you’d have to match the entire js syntax before the opening “[“. Then if you don’t match the entire sequence inside the array you won’t know if the ending “]” is part of an array sequence or object. This problem was unsolved for a long time. The main reason was in order to protect against window references I rewrite object accessors like obj[‘abc’] to obj[JSREG_FUNC.gp(‘abc’)] so the function returns a safe string which uses the prefix/suffix of $ e.g. abc becomes $abc$. Because a string is returned of the expression it would break an array if it wasn’t detected.

Detecting an array or object was difficult because of the design too, you see if a regex object is matched like /abc/ and is followed by a object accessor like /abc/[‘source’] the previous expression is eaten by the parser so the next match is effectively [‘source’] which JSReg understandably thinks is an array. A simple way round this would be to lookbehind to see if a whitelist of characters make the opening “[” an array or not. But JavaScript doesn’t support lookbehinds!

The simple workaround was to use Array(1,2,3) instead for arrays and assume all “[” and “]” were not arrays. This worked but it breaks existing code. Finally after many attempts I think I’ve come up with a solution. I store a list of previous matches and rewrite all array literals and object accessors into a function or method. This means I no longer need to detect the ending of the array as they both have a “)” instead of a “]”. Easily demonstrated with a code example:-

Finally as part of the design I check the JavaScript syntax before and after conversion this provides another layer of security if the rewrite fails at any part of matching the code.

The code

JavaScript is difficult to match but I found HTML/CSS easier. At first I started the code for HTMLReg [2] and CSSReg [3] in a similar way to JSReg. Then I realized when hacking my own code how I could make it better to defend against attack. First off I employed a strict whitelist to remove any partial open HTML attacks and evil attributes that were obvious attacks. This means I didn’t stick to the HTML specification, I don’t allow any junk in attributes. For example if you want to include “” inside a title attribute then you have to encode it. I may allow them in future if it can be proven safe but I’d rather not fight something I can’t win. You may disagree with what I’ve just said but your filter is probably being pwnd right now.

Once I had my whitelist of tags and attributes I constructed RegExes for any individual parts I wanted to match. For example text nodes, invalid tags and valid attributes, these would be nicely chained together in one big regex. Then each part is grouped so that you can match each expression and validate it.

Notice how I use the replace function, I don’t do html = html.replace because I only want to match the text in my regexes. I prefer to use replace because I have a nice reference to each group like this automatically with local variables. This was a lesson I learned from developing JSReg as if the replace fails it will return your plain code rather than rewrite it.

Inside the function I include a couple of things in each block I’ll use the text node as an example:-

Here if the text node is matched it adds it to the output. Parse tree is a nice way of keeping track of what you’ve matched. It’s a useful debugging reference. The if statement is required because of browser inconsistencies when matching groups.

In the case of HTMLReg for performance reasons I have a whitelist to match a general tag, then inspect it further so I’m only matching a smaller amount of text. You can see that with the following code:-

Once my tag has been matched I then start to parse attributes, I do this by creating a hidden div and reading it’s contents. This is cool for a number of reasons, we can read what the browser reads and our code automatically gets formatted. Because we then use the DOM it means our entities will be decoded for us. While testing I found that JavaScript won’t be executed using innerHTML without certain tags or attributes, if I whitelist the tags and attributes then I can use the innerHTML safely without having to worry about execution. I have a backup plan if this fails, I could be more strict with certain attributes if it’s possible to execute code.

Onto CSSReg! It didn’t exist nor did I think it was needed as I thought I could rely on the browser to ensure multiple CSS rules didn’t cross over from single CSS dom rules. I was wrong. It was proven by many talented researchers (mentioned in the thanks section) that it wasn’t possible to get the browsers to rewrite CSS safely. I had to write another regex sandbox. This time it wasn’t as tricky as first appeared. As long as I didn’t try to follow the madness of the specification again I should be able to produce some CSS that was safe from malicious code yet is useful enough to use.

First off I gathered a list of properties and identifiers, I removed crappy browser specific extensions yeah they are bad. ALL OF THEM. Then I used the same method of HTMLReg to match each part, the trickiest part this time was urls. There are so many ways to escape a css url in every browser, you have to handle backslash escapes, entities, new lines and backslash hex escapes. The best way I came up with was to whitelist the url first, match everything in-between () and then decode and escape every character that didn’t match the whitelist.

This made it pretty safe across multiple browsers. But there was a problem, some browsers decoded the CSS even when it was sandboxed correctly e.g. one attack I found was to triple encode the character and the browser would decode the entities and escapes until it produced it’s mangled version of CSS which broke the sandbox. To get round this I created a custom attribute which didn’t match my whitelist “sandbox-style” this allowed CSSReg to store it’s correctly sandboxed style, I used a custom attribute outside of the whitelist to prevent injections of sandbox-style. Once my CSS was stored correctly I could then match it again and rename it back to style which was then returned correctly.

All this trouble was because I wanted the browser to handle invalid HTML for me, any unclosed HTML tags would be automatically closed by the browser engine for me

Finally in order to handle selectors I stuck to very simple syntax, either #someid or .someclass and allowed multiple like .someclass1, someclass2 {} this prevents CSS injection based attacks and well as making it easy to parse. Once each selector was matched I restrict which tags are allowed and prefix a application ID to prevent HTML/CSS crossing across sandboxes. I then check if a selector is matched before opening or closing one.

I hope you’ve enjoyed this post as it’s a break from what I normally do but I thought it would be worth the effort to get together as I’ve found some of the concepts the best way to code a solution and hopefully you’ll find it useful.

Thanks

I would like to thank Dave Ross as I was heavily inspired by him especially with the multiple regex references chained together. Eduardo Vela aka “sirdarckcat” for his awesome (?:HTML|JS|CSS)Reg hacks. Juriy Zaytsev aka “kangax” for his excellent input in detecting parsing flaws with JSReg. Kyo for breaking things without even trying. Theharmonyguy for breaking HTMLReg classes and spotting comical spelling mistakes by me. LeverOne for breaking HTMLReg and CSSReg with some quite simply awesome and evil vectors. Mario Heiderich aka “.mario” for making regex objects look insane and provide great input for JSReg and breaking HTMLReg. David Lindsay aka “Thornmaker” finding JSReg parsing errors with ternary operators. Stefano Di Paola for smashing the JSReg stack and proving that non-mortals exist. Achim Hoffmann for providing valuable JSReg input and everyone else who has helped me test and develop JSReg & others.

At the moment JSReg is used to execute code and return the results, I protect access to window using $ so for example
‘abc'[‘__parent__’] would not return window because JSReg rewrites it to ‘abc'[‘$__parent__$’]