Bleach, HTML sanitizer and auto-linker

Bleach is a whitelist-based HTML sanitizer and auto-linker written in Python and built on html5lib. It was developed for AMO and SUMO and is released under the BSD license.

Bleach has two main functions: sanitizing HTML based on a whitelist of tags and attributes, and turning URLs into links. It uses html5lib for both.

For more information on using Bleach, see the README included in the source. For more on how Bleach works, read on.

Sanitizing HTML

Bleach’s clean() function uses a lightly customized version of html5lib’s HTMLSanitizer tokenizer that adds support for per-tag attribute whitelists. Anything that is not part of a whitelisted tag or a valid entity is encoded, while legitimate entities and whitelisted tags pass through. The default whitelist is set up for AMO.
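To make the idea of a per-tag attribute whitelist concrete, here is a minimal sketch using only Python's standard html.parser module. It is not Bleach's implementation and does not use html5lib. The names ALLOWED_TAGS, ALLOWED_ATTRS, WhitelistSketch and sketch_clean are hypothetical, and the whitelists shown are made up for illustration rather than taken from Bleach's defaults.

```python
from html import escape
from html.parser import HTMLParser

# Hypothetical whitelists for illustration; these are NOT Bleach's defaults.
ALLOWED_TAGS = {"a", "b", "em", "strong"}
ALLOWED_ATTRS = {"a": {"href", "title"}}  # attributes allowed per tag


class WhitelistSketch(HTMLParser):
    """Keeps whitelisted tags and attributes, escapes everything else."""

    def __init__(self):
        # convert_charrefs=True: entities reach handle_data already decoded,
        # and handle_data re-escapes the characters that need it.
        super().__init__(convert_charrefs=True)
        self.out = []

    def handle_starttag(self, tag, attrs):
        if tag not in ALLOWED_TAGS:
            # Show the disallowed tag as escaped text instead of markup.
            self.out.append(escape(self.get_starttag_text()))
            return
        allowed = ALLOWED_ATTRS.get(tag, set())
        kept = "".join(
            ' {}="{}"'.format(name, escape(value or "", quote=True))
            for name, value in attrs
            if name in allowed
        )
        self.out.append("<{}{}>".format(tag, kept))

    def handle_startendtag(self, tag, attrs):
        # Treat <tag/> like <tag> to keep the sketch short.
        self.handle_starttag(tag, attrs)

    def handle_endtag(self, tag):
        closing = "</{}>".format(tag)
        self.out.append(closing if tag in ALLOWED_TAGS else escape(closing))

    def handle_data(self, data):
        self.out.append(escape(data, quote=False))


def sketch_clean(text):
    parser = WhitelistSketch()
    parser.feed(text)
    parser.close()  # flush any buffered text
    return "".join(parser.out)


print(sketch_clean('<script>alert(1)</script><b onclick="x">hi</b>'))
```

This sketch does not validate URL schemes in href values and does not repair malformed markup, so it should not be used as a real sanitizer. It only shows how a tag whitelist and a per-tag attribute whitelist fit together.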

Linkifying Text

The linkify() function is a little more complicated. Naïve implementations usually rely on a simple regular expression to find URL-like strings, but this quickly becomes insufficient when you need to handle situations like these:
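As a hedged illustration of why a single regular expression falls short, the sketch below applies a deliberately naive pattern to three inputs. The cases are assumptions chosen for demonstration, not taken from Bleach's test suite, and NAIVE_URL and naive_linkify are hypothetical names that do not appear in Bleach.

```python
import re

# Deliberately naive pattern, for illustration only.
NAIVE_URL = re.compile(r'https?://[^\s<>"]+')


def naive_linkify(text):
    # Wrap every URL-like match in an anchor, with no awareness of context.
    return NAIVE_URL.sub(
        lambda m: '<a href="{0}">{0}</a>'.format(m.group(0)), text
    )


# Case 1: plain text works as expected.
print(naive_linkify("See http://example.com for details"))

# Case 2: text that is already a link gets linked again, both inside the
# href attribute and inside the link text, producing broken markup.
already = '<a href="http://example.com">http://example.com</a>'
print(naive_linkify(already))

# Case 3: trailing sentence punctuation is swallowed into the URL.
print(naive_linkify("Visit http://example.com."))
```

Avoiding the second case requires knowing whether a match sits inside an existing tag or attribute, which is the kind of context a parser provides and a bare regular expression does not.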

Bleach is available on GitHub, and can be installed with pip or easy_install. Improvements and test cases are very welcome! In fact, one test is currently disabled because the case it covers is not yet supported. If you can make it work, that would be pretty great!