Saturday, October 16, 2010

In web development nowadays, the importance of minimizing page load times is widely acknowledged. (See, for instance, http://www.stevesouders.com/blog/2010/05/07/wpo-web-performance-optimization/.) There are all manner of best practices for reducing load time. However, many of these practices make day-to-day development more complicated. In this post, I'll discuss some techniques we're using to reconcile performance with ease of development.

Consider the "primordial" approach to web serving: a site is a collection of files (static resources and/or dynamic templates) sitting in a directory. JavaScript and CSS code is written in a modular fashion, grouping code into files in whatever organization makes it easiest to locate and manage. Templating aside, files are served raw, always using the latest copy in the filesystem. This is quite nice for development:

You can edit using whatever tools you like.

Changes are immediately visible in the browser. No need to restart the server, or invoke any sort of recompilation tool.

Comments, formatting, and file organization are preserved, making it easy to inspect and debug your JavaScript, CSS, and HTML code.

However, the resulting site is likely to break many rules of web performance. Notably:

JavaScript and CSS files are served with comments and whitespace intact, rather than minified.

Each page references many small files, multiplying the number of HTTP requests.

Assets are served without long-lived cache headers, and nothing is offloaded to a CDN.

The usual approach to this convenience / efficiency tradeoff is to implement two serving modes. In development mode, the "primordial" model is used. In production mode, a preprocessing step is used to minify script code, combine scripts, push content to a CDN, and otherwise prepare the site for efficient serving.

This approach can get the job done, but it is a pain to work with. It's hard to debug problems on the production site, because the code is minified and otherwise rearranged. It's hard to test performance fixes, because the performance characteristics of development mode are very different from those of production. Developers may have to manually invoke the preprocessing step, and problems arise if they fail to do so. Web development is hard enough without this added complexity.

The preprocessing approach is also limited to static content. For instance, it is difficult to minify dynamic pages with this approach.

Solution: optimize on the fly

In our site, optimizations are applied on the fly, in a servlet filter that postprocesses each response. (Of course, the server is free to cache the optimized version of static files.) Current optimizations include:

Minify JavaScript and CSS files.

Minify HTML pages, including embedded JavaScript and CSS.

Rewrite script/style/image URLs to support caching (described below).
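As a rough sketch of the core of such a filter, the postprocessing step can dispatch on the response's content type. All names here are hypothetical, and the "minification" is deliberately naive; a real implementation would delegate to a proper minifier library.

```java
// Core of the on-the-fly postprocessing step (names hypothetical).
// A servlet filter would buffer each response and run its body through here.
public class ResponseOptimizer {

    public static String optimize(String contentType, String body) {
        if (contentType.startsWith("text/html")) {
            return minifyHtml(body);
        }
        if (contentType.startsWith("text/css")
                || contentType.startsWith("application/javascript")) {
            return stripCommentsAndWhitespace(body);
        }
        return body; // pass anything we don't recognize through untouched
    }

    // Deliberately naive: drop /* ... */ comments, collapse whitespace runs.
    static String stripCommentsAndWhitespace(String s) {
        return s.replaceAll("(?s)/\\*.*?\\*/", "").replaceAll("\\s+", " ").trim();
    }

    // Deliberately naive: drop whitespace between adjacent tags.
    static String minifyHtml(String s) {
        return s.replaceAll(">\\s+<", "><").trim();
    }
}
```

In a servlet container, the filter would wrap the response in an HttpServletResponseWrapper that captures the output, then write the optimized body (or the original, if optimization is disabled for this request).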

Forthcoming extensions:

Asset bundling

Rule checking

Asset repository to avoid version skew (described below)

Because optimization is done on the fly, it is never "stale"; developers don't need to remember to re-run the optimizer. At least as important, it can be enabled or disabled on the fly. When investigating a problem on the production site, a developer can disable minification and debug the code directly. Or they can turn on minification in a development build and observe the effects.

Optimization is enabled and disabled using a simple configuration system we've built. Configuration flags can be specified at the server, session, or request level. The session mechanism is particularly convenient: simply by invoking a special form, a developer can disable minification for themselves on the production server, without affecting the way content is served to other users. And toggling minification in a development build doesn't require a server restart.
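The flag lookup itself is simple precedence: a request-level value overrides a session-level value, which overrides the server default. A minimal sketch (names hypothetical; a real version would read from the actual request and session objects):

```java
import java.util.Map;

// Sketch of the configuration flag lookup described above.
// Precedence: request parameter > session attribute > server default.
public class ConfigFlags {

    public static boolean isEnabled(String flag,
                                    Map<String, String> requestParams,
                                    Map<String, String> sessionAttrs,
                                    Map<String, String> serverDefaults) {
        String value = requestParams.get(flag);
        if (value == null) value = sessionAttrs.get(flag);
        if (value == null) value = serverDefaults.get(flag);
        return Boolean.parseBoolean(value);
    }
}
```

Because the session level sits between the two, a developer can flip a flag for their own session on production without touching the server default that everyone else sees.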

URL rewriting for cacheability and CDNs

For performance, it's critical to allow browsers to cache asset files (e.g. images and scripts). However, this causes problems when those files are updated. The best solution I'm aware of is to change the filename whenever an asset is modified. This allows you to mark the files as permanently cacheable. However, it's a hassle to implement: whenever you modify a file, you have to update all links referencing it.

Our servlet filter scans HTML content for asset references (e.g. <script src=...> or <img src=...>). For any such reference, if the reference refers to a static file in our site, the filter rewrites the reference to include a fingerprint of the file contents. Thus, references automatically adjust when an asset file is changed.
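The fingerprinting step might look like the sketch below. The naming scheme, a short hex digest spliced in before the extension, is an assumption; any scheme that changes deterministically with the content works.

```java
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Sketch: hash the file's bytes and splice the digest into the filename,
// so "app.js" becomes something like "app-d41d8cd9.js".
public class Fingerprinter {

    public static String fingerprintedName(String filename, byte[] contents) {
        try {
            byte[] digest = MessageDigest.getInstance("MD5").digest(contents);
            StringBuilder hex = new StringBuilder();
            for (int i = 0; i < 4; i++) {           // 8 hex chars is plenty here
                hex.append(String.format("%02x", digest[i]));
            }
            int dot = filename.lastIndexOf('.');
            if (dot < 0) return filename + "-" + hex;
            return filename.substring(0, dot) + "-" + hex + filename.substring(dot);
        } catch (NoSuchAlgorithmException e) {
            throw new RuntimeException(e);          // MD5 is always available
        }
    }
}
```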

When our server receives a request for an asset file, it looks for a fingerprint in the filename. If the fingerprint is present, we set the response headers to allow indefinite caching. Otherwise, we mark the response as noncacheable.
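The header decision reduces to a pattern match on the requested path. In this sketch, the filename pattern and the specific header values are assumptions:

```java
import java.util.regex.Pattern;

// Sketch of the caching decision: a fingerprinted URL can be cached
// indefinitely (the name changes whenever the content does); anything
// else must not be cached.
public class CacheHeaders {

    // Matches names like "app-d41d8cd9.js" produced by the rewriter.
    private static final Pattern FINGERPRINTED =
        Pattern.compile(".*-[0-9a-f]{8}\\.[a-z]+$");

    public static String cacheControlFor(String path) {
        return FINGERPRINTED.matcher(path).matches()
            ? "public, max-age=31536000"   // ~1 year: safe, name tracks content
            : "no-cache";                  // unfingerprinted: always revalidate
    }
}
```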

When we adopt a CDN, we'll use this same mechanism to rewrite asset references to point to the CDN.

One issue we haven't yet tackled is rewriting asset references that appear in JavaScript code. Fortunately, the mechanism is failsafe: if our filter doesn't rewrite a reference, that asset will be served as noncacheable.

File bundling

A page often references many asset files. Performance can be improved by bundling these into a smaller number of larger files. For JavaScript and CSS files, this is fairly straightforward; for images, it requires CSS sprites. In both cases, we run into another tradeoff between convenience and performance -- development is easier with unbundled files.

Once again, on-the-fly rewriting comes to the rescue. When the servlet filter sees multiple <script> or <style> references in a row, it can substitute a single reference to a bundled file. When it sees multiple CSS image references, it can substitute references to a single sprite image. The configuration system can be used to toggle this on a server, session, or request level.
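The script-tag substitution might look like the sketch below. The /bundle/ URL scheme and the exact tag pattern are my assumptions, and a real filter would work on the parsed page rather than a regex:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Sketch: replace a run of consecutive external <script> tags with one
// reference to a bundle whose name encodes its members. The server would
// recognize "/bundle/a+b.js" and serve the concatenation of a.js and b.js.
public class ScriptBundler {

    // Two or more adjacent script tags, separated only by whitespace.
    private static final Pattern RUN = Pattern.compile(
        "(<script src=\"([^\"]+)\"></script>\\s*){2,}");

    public static String bundle(String html) {
        Matcher m = RUN.matcher(html);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            Matcher one = Pattern.compile("src=\"([^\"]+)\"").matcher(m.group());
            StringBuilder names = new StringBuilder();
            while (one.find()) {
                if (names.length() > 0) names.append('+');
                names.append(one.group(1).replace(".js", ""));
            }
            m.appendReplacement(out,
                "<script src=\"/bundle/" + names + ".js\"></script>");
        }
        m.appendTail(out);
        return out.toString();
    }
}
```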

Rule checking

Over time, we might extend our servlet filter to enforce some performance rules by rewriting the response. But in the meantime, it is relatively straightforward for the filter to detect and report violations. It is easier to perform this detection at runtime than in a static lint tool, because we can observe the final, post-template-engine version of the page. And unlike tools like YSlow, we don't have to manually test each release of the site.

Consistency across server updates

When you push a new version of your site to the production server, a "version skew" problem can occur. Suppose a user opens a page just before the new version is activated. They might wind up receiving the old version of the page, but the new version of the scripts. This can cause script errors. If your site is distributed across multiple servers, the window of vulnerability can be many minutes (depending on your push process). This is not a performance issue, but it involves references from a page to its assets, so it ties into the mechanisms involved in asset caching.

Happily, the URL rewriting technique described earlier offers a solution to version skew. When a server receives a request for an asset file, it simply serves the file version that matches the fingerprint in the URL. This ensures that all asset files come from the same site version as the main page.

This solution does assume that all servers have access to all versions of the site. This can be addressed by copying asset files to a central repository, such as Amazon S3, during the push process.

Conclusion

A surprisingly large number of web performance optimization rules can be implemented with little or no impact on day-to-day site development. Equally important, optimizations can be implemented with session-based disable flags, allowing problems to be debugged directly on the production site. We've been using this approach in our site development, with good results so far.

Saturday, October 2, 2010

This blog has been quiet lately. I've been working on the nuts and bolts of what we[1] hope will eventually be a major web site, and that hasn't lent itself to the sort of material I want here. I'm trying to post only when I have something at least mildly novel and meaty to talk about -- as opposed to, say, photos from my three-day trip to the land of Tomcat configuration. We're starting to get past the boilerplate into more interesting work, so I hope to start posting more often again.

One thing I set out to tackle this week is XSS defense. Traditionally, it's a tedious and error-prone task. In this post, I'll present an attempt to improve on this.

Wikipedia provides a good introduction to the subject. As a brief refresher, XSS (Cross-Site Scripting) is an attack where a malicious user enters deliberately malformed data into your system. If your site is not properly protected, it may display that data on a web page in such a way that the browser interprets it as JavaScript code, allowing the attacker to take control of the browser of any victim who views the page.

Escaping

There are a variety of defenses against XSS. The most common is escaping -- transforming the data in such a way that the browser will not interpret it as code. For instance, when including a user-supplied value in a web page, characters like "<" and "&" should be rewritten as HTML entities -- "&lt;" and "&amp;". Old hat, probably, to most of you reading this.
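For concreteness, here is a minimal entity encoder of this kind. Real code should use a vetted library; this sketch shows only the core substitutions:

```java
// Minimal HTML entity encoder: the five characters that let attacker data
// escape its context. A production escaper should come from a vetted library.
public class Escaper {

    public static String entityEncode(String s) {
        StringBuilder out = new StringBuilder();
        for (char c : s.toCharArray()) {
            switch (c) {
                case '&':  out.append("&amp;");  break;
                case '<':  out.append("&lt;");   break;
                case '>':  out.append("&gt;");   break;
                case '"':  out.append("&quot;"); break;
                case '\'': out.append("&#39;");  break;
                default:   out.append(c);
            }
        }
        return out.toString();
    }
}
```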

Escaping, if done correctly, is a solid defense against XSS. However, getting it right is notoriously difficult. You must apply the escaping in every single place where your code inserts user-supplied data into a web page. That can easily be thousands of locations. Miss a single one, and your site is vulnerable. It's sometimes necessary to perform the encoding at different levels in the code, making it hard to keep track of which strings have already been encoded. Worse yet, different sorts of encoding are needed depending on context: HTML entity encoding, URL encoding, backslash encoding, etc. I've even seen cases where two levels of encoding were needed -- for instance, when a value is included in a JavaScript string literal, and the JavaScript code later places that string in the innerHtml of some DOM node. If you get the encoding wrong, you may again be vulnerable.

Escaping also tends to uglify your code. In JSP, <%= username %> might have to become <%= Escaper.entityEncode(userName) %>. The impact on code readability and maintainability is nontrivial. Some templating systems handle this better than others; we're using plain ol' JSP, which offers no special support. So, we weren't very enthusiastic about this approach.

Validation and filtering

Another well-known defense is to validate and/or filter your input: disallow users from entering special characters like <, or filter out those characters. If these characters never enter your system, you don't have to worry about escaping them at output time.

Most web sites have fewer inputs than outputs, so airtight input filtering is easier to achieve than airtight output escaping. Also, a single filter can render data safe for inclusion in a variety of contexts, unlike output escaping where you have to be careful to use the correct escaping mode according to context. (Though this requires a broadminded definition of "special characters" -- see this link for a discussion.)

Input filtering does have some drawbacks. Most notably, it is visible to users. Some dangerous characters, like ' and &, appear frequently in ordinary situations, and users will be annoyed if they can't use them. For this reason, filtering is most commonly used for specific data types, such as phone numbers, rather than free-form text such as a message subject.

Another drawback of input filtering is that, if you find a bug in your input filter, you have to re-validate all existing data. That can be a huge burden in practice, especially if you're under time pressure to close a security hole. With output escaping, as soon as you fix the bug and push a new build, you're protected.

Design criteria

We're building an actual web site; rubber is meeting road. Theory aside, we need a concrete plan for XSS defense. Ideally, the solution would satisfy the following criteria:

Easy to use -- no complex rules to be remembered while coding.

Minimal impact on code readability and maintainability.

Works with JSP (our current templating system), and portable to other templating systems.

Little or no user-visible impact.

Auditability -- it should be possible to scan the code with an automated tool and identify any possible XSS holes.

Also, I'd like a pony. (Well, not really. One of these, maybe.) Output escaping fails on ease of use, and is difficult to audit; input filtering fails on user impact, and by itself does not provide defense in depth. Time to get creative.

Input transformation

Again, one problem with input filtering is that it can cause serious annoyance to users. As noted in one of the pages linked above, imagine poor Mr. O'Malley's frustration when he can't type the ' in his name.

What if, instead of forbidding dangerous characters, we replace them with safer substitutes? The most common offender, the single quote, has a very acceptable substitute -- the curved single quotes, ‘ and ’. When we process a form submission, we could perform this substitution automatically. Mr. O'Malley might not even notice that he's now Mr. O’Malley, and if he did notice, he probably wouldn’t mind.

This appeals to me. The main objection to input filtering is the impact on users, and this mitigates that impact. The impact is not eliminated completely, so this won't work in all situations. But in our application, it should be usable for almost all inputs.

The mapping I'm currently envisioning is as follows:

' -> ‘ or ’ (depending on context)

" -> “ or ” (depending on context)

< -> (

> -> )

& -> +

\ -> /
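The mapping above might be implemented as follows. XssUtil echoes the name used later in this post; the context rule, treating a quote as "opening" when it follows whitespace or starts the string, is my assumption:

```java
// Sketch of the input transformation table above. The opening/closing
// heuristic (a quote after whitespace or at the start is "opening") is
// an assumption; real typography rules are fussier.
public class XssUtil {

    public static String transform(String input) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            boolean opening = i == 0 || Character.isWhitespace(input.charAt(i - 1));
            switch (c) {
                case '\'': out.append(opening ? '\u2018' : '\u2019'); break;
                case '"':  out.append(opening ? '\u201C' : '\u201D'); break;
                case '<':  out.append('('); break;
                case '>':  out.append(')'); break;
                case '&':  out.append('+'); break;
                case '\\': out.append('/'); break;
                default:   out.append(c);
            }
        }
        return out.toString();
    }
}
```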

This isn't sufficient in every situation. For instance, URLs need a different input transformation -- the mapping above will break some URLs, and doesn't rule out "javascript:" or other dangerous links. And no reasonable input transformation will suffice if you include user-supplied values in an unquoted tag attribute -- quoting is essential. But if you're good about quoting, this transformation suffices for most common situations. And it scores pretty well on my "I want a pony" design criteria:

Ease of use: simply replace every instance of request.getParameter("name") with something like XssUtil.getSafe(request, "name"). (With exceptions only for those unusual cases where the transformation is not acceptable.)

Impact on code readability: the new form is not especially bulkier than the old.

Template compatibility: input transformation has no impact on the templating system.

That leaves only user impact, which I've discussed; and defense in depth. Defense in depth brings me to my next topic.

Script tagging

Most approaches to XSS defense involve enumerating, and protecting, every spot on a page where scripts might creep in. The idea is for the page to be "clean" as it emerges from the templating system. As we've seen above, that's difficult.

Instead, let's accept that a malicious script might sneak into the generated page. If we have a whitelist of places where scripts are supposed to appear, we could filter out any unwanted ones. This would work as follows:

1. Tag all "good" scripts -- scripts that we're deliberately including in the page -- with a special marker. In JSP, the coding idiom might be something like this:

<script language=javascript>

<%= scriptToken() %>

...regular JavaScript code here...

</script>

It's important that an attacker not be able to guess the special marker. This is easily ensured by generating a random marker for each request.
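Token generation could be as simple as the sketch below. Wrapping the marker in a JavaScript comment is my assumption, so that it doesn't disturb script execution before the filter strips it:

```java
import java.security.SecureRandom;

// Sketch: one unpredictable marker per request. The filter generates this
// at the start of the request and strips it from whitelisted scripts.
public class ScriptToken {

    private static final SecureRandom RANDOM = new SecureRandom();

    public static String newToken() {
        byte[] bytes = new byte[16];
        RANDOM.nextBytes(bytes);
        StringBuilder hex = new StringBuilder("/*token:");
        for (byte b : bytes) hex.append(String.format("%02x", b));
        return hex.append("*/").toString();
    }
}
```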

2. After the templating system executes, run the generated page through an HTML parser to identify all scripts on the page. (For this to be robust, we'll actually need an HTML "cleaner" that rewrites the page in a way that all browsers can be trusted to parse properly.) Here, "all scripts" means all constructs that could trigger undesired actions in the browser: <script> tags, <style> tags, onclick handlers, URLs with protocols other than http/https/mailto, etc. Like any HTML cleaner, the system should be based on a whitelist -- any tag or attribute not in the whitelist is considered dangerous.

3. Whenever we see a script with the special marker, remove the marker and leave the script alone. If we see any scripts without the marker, strip them out, and flag the page for manual inspection.
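Steps 2 and 3 might look like this sketch, which uses a regex in place of the real HTML parser the approach calls for, and handles only <script> tags (all names hypothetical):

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Sketch of the whitelist pass: scripts carrying the per-request token keep
// their body (minus the token); unmarked scripts are stripped. Attributes on
// kept <script> tags are dropped here for brevity; a real implementation
// would preserve them and also flag stripped scripts for manual inspection.
public class ScriptWhitelist {

    private static final Pattern SCRIPT =
        Pattern.compile("(?s)<script[^>]*>(.*?)</script>");

    public static String filter(String html, String token) {
        Matcher m = SCRIPT.matcher(html);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String body = m.group(1);
            if (body.contains(token)) {
                m.appendReplacement(out, Matcher.quoteReplacement(
                    "<script>" + body.replace(token, "") + "</script>"));
            } else {
                m.appendReplacement(out, ""); // unmarked: strip it out
            }
        }
        m.appendTail(out);
        return out.toString();
    }
}
```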

I don't recall seeing this approach suggested before, but at first blush it seems sound to me. Of course, browsers are complicated beasts, and I may be missing something. If you can poke a hole, please let me know!

This approach does impose processing overhead, to parse each generated page. However, cycles are cheap nowadays; security holes are expensive, as is programmer time. Also, the same processing pass can perform other useful functions, such as HTML minification. How does the approach stack up on my design criteria?

Ease of use: pretty good. It only requires adding a bit of boilerplate at the beginning of every script block; in practice, it might look something like <%= scriptToken() %>.

Code impact: the token is not a big deal for a script block. It will be more annoying in something small like an onclick handler.

Template compatibility: inserting a token should be easy in any templating system.

Defense in depth: this approach is completely independent of input transformation, so combining the two achieves a layered defense. It could also be combined with traditional output escaping.

Conclusions

The combination of input transformation and script tagging yields a layered, auditable defense against XSS attacks, with less programming burden than traditional output escaping. Output escaping will still be needed in a few places where input transformation is impractical, such as when embedding a user-supplied value in a URL.

If there is interest, I might take the time to open-source this code once it's completed.

Notes

[1] My old partners in crime from Writely, Sam Schillace and Claudia Carpenter