Jonathan Weatherhead's wild and exciting world

PHP validation and sanitization with the Filter API

When it comes to validation and sanitization in PHP, many wheels are frequently reinvented. Stop! Did you know that PHP offers a core API for this? The Filter extension has been available since PHP 5.2 (you really should be on 5.3 or higher) and is enabled by default.

This extension filters data by either validating or sanitizing it. This is especially useful when the data source contains unknown (or foreign) data, such as user-supplied input from an HTML form.

To summarize the documentation, the Filter API filters data in two ways: validation and sanitization. Validation filters return the data if it is valid (converted to the matching type for filters such as the integer filter) and false otherwise, while sanitization filters return a sanitized version of the data. The API also comes in two flavours of input argument: filter_var(), a general purpose function for any scalar value, and filter_input(), a specialized version for filtering fields of the HTTP-oriented superglobals. Both accept the argument to be filtered, the name of the filter to apply, and an optional configuration array.

Let's work on a practical example now. Suppose that we wanted to determine the validity of an email address. Use the Filter API, and I repeat, use the Filter API!
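A minimal sketch of that check, assuming a couple of hypothetical sample addresses in place of real form input:

```php
<?php
// Hypothetical sample inputs; in practice these would come from a form.
$candidates = ['jane.doe@example.com', 'not-an-email'];

foreach ($candidates as $email) {
    // filter_var() returns the address itself when valid, false otherwise.
    if (filter_var($email, FILTER_VALIDATE_EMAIL) === false) {
        echo "'$email' is not a valid email address\n";
    } else {
        echo "'$email' is a valid email address\n";
    }
}
```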

Beautifully simple and concise. The best part is avoiding having to reinvent the wheel and worry about the RFC 822 spec. Now suppose that we’d like to check that the value is an integer within some range – a numerical month for example!
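A range check along those lines might look like the sketch below. The $month value is a hypothetical stand-in for form input, and the range is passed through the min_range and max_range options of the integer filter.

```php
<?php
// Hypothetical input; form values arrive as strings.
$month = '7';

// The range lives under the 'options' key of the configuration array.
$config = ['options' => ['min_range' => 1, 'max_range' => 12]];

// Yields the integer 7 here; '13', '0' or 'July' would yield false.
$validMonth = filter_var($month, FILTER_VALIDATE_INT, $config);

// Compare strictly against false so a legitimate value is never mistaken for failure.
if ($validMonth === false) {
    echo "Invalid month\n";
} else {
    echo "Month number: $validMonth\n";
}
```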

Now, an invalid month might sometimes be within the tolerance of the logic. Suppose that we would like to default to January in the case that the month is invalid or perhaps not set. Validation filters use the default option if it is provided in the configuration array, so let's leverage this feature.
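Here is one way the default option could be used. The helper name monthOrJanuary() is made up for this illustration; the same configuration array can also be handed to filter_input().

```php
<?php
// Hypothetical helper: returns a month number from 1 to 12, or 1 (January)
// whenever the raw value fails validation.
function monthOrJanuary($raw)
{
    $config = [
        'options' => [
            'default'   => 1,  // returned instead of false on failure
            'min_range' => 1,
            'max_range' => 12,
        ],
    ];

    return filter_var($raw, FILTER_VALIDATE_INT, $config);
}

echo monthOrJanuary('5'), "\n";   // a valid month passes through as an integer
echo monthOrJanuary('13'), "\n";  // out of range, so the default is used
echo monthOrJanuary(''), "\n";    // empty input, so the default is used
```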

Now let's sanitize some input from a web form. When receiving a chunk of text that will not be interpreted, it’s a good idea to sanitize by stripping tags and bad byte sequences. As a rule of thumb, if the data is being persisted somewhere such as the database, it should be sanitized before that happens.
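A sketch of that kind of sanitization follows. Note that the string-stripping filter FILTER_SANITIZE_STRING is deprecated as of PHP 8.1, so this example swaps in strip_tags() for the tags and FILTER_UNSAFE_RAW with FILTER_FLAG_STRIP_LOW for the control bytes. The sanitizeText() helper and the sample comment are assumptions for illustration.

```php
<?php
// Hypothetical helper: removes tags, then strips bytes with a value below 32.
// Beware that FILTER_FLAG_STRIP_LOW also removes newlines and tabs.
function sanitizeText(string $raw): string
{
    $withoutTags = strip_tags($raw);

    return filter_var($withoutTags, FILTER_UNSAFE_RAW, FILTER_FLAG_STRIP_LOW);
}

// Hypothetical form input containing markup and a stray control byte (0x07).
$comment = "Nice post! <b>Thanks</b>\x07";

echo sanitizeText($comment), "\n";
```

Keep in mind that strip_tags() removes the tags themselves but leaves the text between them in place.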

The advantages of validation/sanitization aside, using the Filter API to filter HTTP input also circumvents the nuisance of warnings generated by unset map keys that plagues PHP. Even if the input isn’t to be filtered at that point or doesn’t require filtering, the default filter, FILTER_UNSAFE_RAW, can be used to read the input as-is.
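To illustrate, here is a small sketch that reads a hypothetical query-string parameter named q both ways. filter_input() returns null when the key is absent, so no guard is needed.

```php
<?php
// Reading the superglobal directly needs a guard (here the ?? operator)
// to avoid a warning when the key is not set.
$fromSuperglobal = $_GET['q'] ?? null;

// filter_input() simply returns null for a missing key, with no warning.
// FILTER_UNSAFE_RAW passes the value through as-is.
$fromFilter = filter_input(INPUT_GET, 'q', FILTER_UNSAFE_RAW);

var_dump($fromFilter);
```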

And there it is, the Filter API. I recommend reading the official documentation – we covered the essentials but there is even more to be leveraged. The API offers a rich set of filters, many of which offer optional mode flags which affect their behaviour. Example: the integer filter can be set to accept hex and octal notation. The API also supports custom validators and sanitizers.
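As a taste of those extras, the sketch below passes mode flags to the integer filter and plugs a custom validator in through FILTER_CALLBACK. The colour whitelist is a made-up example.

```php
<?php
// Mode flags: accept hexadecimal and octal notation for integers.
$hex   = filter_var('0x1A', FILTER_VALIDATE_INT, FILTER_FLAG_ALLOW_HEX);   // 26
$octal = filter_var('0755', FILTER_VALIDATE_INT, FILTER_FLAG_ALLOW_OCTAL); // 493

// Without the flag, the same input is rejected.
$rejected = filter_var('0x1A', FILTER_VALIDATE_INT); // false

// Custom validator: the callback receives the value and returns either
// the accepted value or false. The whitelist here is hypothetical.
$isKnownColour = function ($value) {
    return in_array($value, ['red', 'green', 'blue'], true) ? $value : false;
};

$colour  = filter_var('green', FILTER_CALLBACK, ['options' => $isKnownColour]);
$unknown = filter_var('mauve', FILTER_CALLBACK, ['options' => $isKnownColour]);

var_dump($hex, $octal, $rejected, $colour, $unknown);
```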

It is actually best to do both – content sanitization and query parametrization lend to separate concerns. The objective of content sanitization is to remove unwanted/malicious content such as script tags that might load external content if they manage to be embedded on a page. The objective of query parametrization is to protect against query injections that could poison the database’s integrity and/or return sensitive data.

Please read Scott’s point again – his point is that sanitization should be done on output, not on input, so as to always have a ‘clean’, non-mangled copy available in the database. Correct sanitization is something that depends on the display context (HTML page or JSON API? etc.), and therefore belongs in display logic, not in storage logic.

*escape* on output, not filter. You should always filter input that is undesirable before you do anything with it in your application, not just before you insert it into the database. The database isn’t the only place you want input to be clean.

The situations where you need raw, untouched input are vastly outnumbered by situations where you’re capturing, say, a phone number, but don’t want to impose a structure on it (e.g. +012345678 or 1-555-555-5555 or 0.838.383.838.2). You don’t want to make it integer only, but you also don’t want people putting in script tags and bad byte sequences either.

I had a talk with Andrew Nacin in Austin last year and it seems like those are not used in core because of security related issues. Also there is no INPUT_REQUEST available so you have to know what is the type of request.

I haven’t tested this myself yet but I read in passing recently that these filter functions look at the raw input and not at whatever modifications have been made to the $_GET/$_POST superglobals. WordPress applies some filters to those variables (and I don’t agree with all of them) so if this is true that could be what Nacin meant, in that WP prefilters are bypassed.

$_REQUEST has its own troubles as it contains get/post/cookies and can be subject to key collision. I prefer to be precise in using the appropriate dataset depending on the request type, discernible with $_SERVER['REQUEST_METHOD'].
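A rough sketch of that approach, where the inputTypeFor() helper and the name field are hypothetical:

```php
<?php
// Hypothetical helper: map the HTTP method to the matching input type.
function inputTypeFor(string $method): int
{
    return strtoupper($method) === 'POST' ? INPUT_POST : INPUT_GET;
}

// Fall back to GET when the method is unavailable (e.g. on the command line).
$method = $_SERVER['REQUEST_METHOD'] ?? 'GET';

// Read the field from exactly one dataset, so keys cannot collide.
$name = filter_input(inputTypeFor($method), 'name', FILTER_UNSAFE_RAW);

var_dump($name);
```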

I’m also going to throw in there that while the Filter API supports reading server variables through INPUT_SERVER, there is a known issue that makes it not work on various server configurations. I’ve run into this myself (and scratched my head mighty hard), so it’s far more reliable to read $_SERVER directly.