Tuesday, January 30, 2007

Input validation or output filtering, which is better?

This question is asked regularly with respect to solutions for Cross-Site Scripting (XSS). The answer is input validation and output filtering are two different approaches that solve two different sets of problems, including XSS. Both methods should be used whenever possible. However, this answer deserves further explanation.

Input Validation(aka: sanity checking, input filtering, white listing, etc.)Input validation is one of those things ranted about incessantly in web application security, and for good reason. If input validation was done properly and religiously throughout all web application code we’d wipe out a huge percentage of vulnerabilities, XSS and SQL Injection included. I’m also a believer that developers shouldn’t have to be experts in all the crazy attacks potentially thrown at a websites. There’s simply too much to learn and their primary job should be writing new code, not to become web application hackers. Developer should only have to concern themselves with the solutions required to mitigate any attack no matter what it might be. This is where input validation comes in play.

Input validation should be performed on any incoming data that is not heavily controlled and trusted. This includes user-supplied data (query data, post data, cookies, referers, etc.), data in YOUR database, from a third-party (web service), or elsewhere. Here are the steps that should be performed before any incoming data is used:

NormalizeURL/UTF-7/Unicode/US-ASCII/etc decode the incoming data.

Character-set checkingEnsure the data only contains characters you expect to receive. The more restrictive the rules are the better.

Length restrictions (min/max)Ensure the data falls within a restricted minimum and maximum number of bytes. Limit the window of opportunity for an attacks as exploits tend to require lengthy input strings.

Data formatEnsure the structure of the data is consistent with what is expected. Phone should look like phone numbers, email addresses should look like email address, etc.

Regular expression examples with iteratively more restrictive security:(These are just samples, not recommended for production use)

ImplementationFor a variety of reasons input validation has proved time consuming, prone to mistakes, and easy to forget about. The best approach is defining all the expected application data-types (account ID’s, email addresses, usernames, etc.), abstract them into reusable objects, and made easily available from inside the development framework. Input validation is all handled behind the scenes, no need to parse URLs, or remember to apply all the relevant business logic rules. The benefit to this approach is security becomes consistent and predictable. Plus developers are assisted is creating software at faster rate. Security and business goals are in alignment, which is exactly the place you want to be.

For example, let’s say you’re in an objected oriented environment working with a product purchase process:

URL:http://website/purchase.cgi

Post Data:product=100&quanitiy=4&cc=4444333322221111&exp=01/08

// Check if the user is properly logged-in and their account is activeif (user.isActive) {

// make sure the product is available in the requested quantityif (req.product.isAvailable) {

// make sure the credit card is valid for the purchase totalif (req.creditcard.isValid(total)) {

// initiate the transactionprocessOrder(user, req.product, req.qty, total, req.creditcard); } else { // inform user that their credit card was not accepted with a consistent message and also log the error to central database. requestFailed(req.creditcard.error); }

} else { // inform user that items is not available with a consistent message and also log the error to central database. requestFailed(req.product.error);

}

} else {

// inform user that they are not properly logged-in with a consistent message and also log the error to central database. requestFailed(user.error);}

Notice in the example code there is no input validation, direct database calls, or implicit strings. Everything is handled behind the scenes by the objects and methods. This makes mistakes less likely to occur and extremely helpful in preventing a wide variety of attacks including XSS, SQL Injection, and more.

Output FilteringWhen you get right down to it, XSS happens on output when the unfiltered data hits the user (victim) web browser. Plus untrusted data may originate from a variety of locations, including your own database. As a developer you’re never really certain if someone else is doing their job and placing potentially malicious data in the DB. Better to play it safe when printing to screen.

Control the output encodingDon’t let the web browser guess at a web pages content encoding. They’re known for making mistakes that could lead to strange XSS variants. There are two ways to set encoding, response header and meta tags. Its best to use both methods to make certain the browser gets it right.

Removing HTML/JavaScriptMany of the languages and frameworks have their own methods to convert special characters in their equivalent HTML Entities, it’s probably best to use one of those. If not, here is Perl regex snippet that can be used or ported. I welcome anyone to comment on libraries they like, I’m not familiar and up to date with all of them. As with input validation its best to abstract this layer and make it second nature for developers.

9 comments:

kl
said...

Usually strict checking against predefined pattern is a nightmare for users - everyone writes dates and phone numbers differently.In such cases I prefer _extraction_ of data. For example instead of checking for proper arrangement of spaces, hypehs, etc. in phone number, just remove all non-digit characters and you'll have safe and bulletproof input.

Oh, and please don't forget that + is legal character in e-mail username!MTAs (gmail) can use it for tagging/filtering (username+tag@example.com)

I like this article because, however short, it provides some basics that I haven't found in too many places - which is kind of a surprise. Many write and talk about the importance but details are hard to find.

Establishing class objects for data types may not necessarily be the ideal way to go. One can be limited by their web software architecture (class objects may not be so easy to implement) and relying on built-in object verification can easily ignore data that doesn't fit into a predefined object class. A programmer not used to consciously using input validation functions could be more likely to skip validation.

I also dislike using regular expressions for quick data validation in most cases (what a waste of a computer) - I like the comment that data should be extracted (and validated) which is what I do.

About Me

Jeremiah Grossman's career spans nearly 20 years and has lived a literal lifetime in computer security to become one of the industry's biggest names. He has received a number of industry awards, been publicly thanked by Microsoft, Mozilla, Google, Facebook, and many others for his security research. Jeremiah has written hundreds of articles and white papers. As an industry veteran, he has been featured in hundreds of media outlets around the world. Jeremiah has been a guest speaker on six continents at hundreds of events including many top universities. All of this was after Jeremiah served as an information security officer at Yahoo!