I've been rolling my own PHP framework recently and initially thought about using an existing HTML filtering solution such as htmLawed. But curiosity got the better of me and I decided to try writing my own instead.

I was hoping a few people might be good enough to give the demo a try to see how it holds up.

>> http://allowhtml.com/demo/

There's a link to the source code in the demo as well. The demo is running the default settings (and allowing the "style" attribute, which I wouldn't normally do).

Any comments / feedback appreciated. :) I was thinking about releasing it under LGPL (hence the domain name), but wanted to see if it's up to scratch first.

I'll build in a whitelist for the style attribute (always figured I'd have to if I was going to allow it). I've got mbstring dotted around the place in various components I'm building, so that's going to need re-thinking.

I'm interested in the decoding issue - I was under the (false?) impression that it was a good idea to decode the input as much as possible, then check for and neutralise any resulting evil characters afterwards. Is that a bad idea in general, or just with regard to urlencoded characters like %22?

Yeah, whitelist each property and only allow the ones you know about, rather than blacklisting.

The reason that decoding is a mistake is that your filter is potentially creating new vectors by converting. I'd recommend you inspect but do not convert; that way you'll avoid potential future issues like this one, for example:
<a href="%256aavascript%2522">test</a>

Your filter is performing an automatic decode of urlencoded data, and after that it encodes the result with HTML entities. Some vectors include HTML entities, and some can also function with double urlencoding. I'd recommend you either leave input exactly as it was supplied or remove it if it is dangerous. IMHO, designing a filter like this involves thinking about what could potentially break, not just what breaks currently.
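To make that vector concrete, here's a minimal sketch of what repeated urldecode() passes do to it (plain PHP only; nothing here is taken from the demo's code):

```php
<?php
// A doubly-urlencoded payload: %25 is itself the encoding of "%".
$input = '%256aavascript%2522';

// First decode pass: %25 -> "%", leaving a freshly-minted encoded string.
$once = urldecode($input);   // "%6aavascript%22"

// If anything downstream decodes again (a browser, another filter layer),
// the payload finally surfaces: %6a -> "j", %22 -> '"'.
$twice = urldecode($once);   // 'javascript"'

echo $once, "\n", $twice, "\n";
```

This is why decode-then-encode is risky: the filter's own first pass manufactures a valid encoded string that a later layer can finish decoding.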

If you are comparing against the original input continually, then I guess it would be OK to run it through htmlspecialchars, but try to use a whitelist wherever possible, even in attributes.
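A whitelist-style attribute check might be sketched like this (the attribute list, the patterns and the filterAttribute() helper are all illustrative, not taken from the demo's source):

```php
<?php
// Hypothetical whitelist: each attribute we understand gets its own rule.
$allowedAttributes = [
    'href'  => '#^(?:https?://|/)[\w\-./?=&%]*$#i',
    'title' => "/^[\w .,'-]*$/u",
    'class' => '/^[\w -]*$/',
];

function filterAttribute(array $allowed, string $name, string $value): ?string
{
    $name = strtolower($name);
    // Unknown attribute: drop it outright rather than trying to sanitise it.
    if (!isset($allowed[$name])) {
        return null;
    }
    // Known attribute whose value fails its rule: drop it too.
    if (!preg_match($allowed[$name], $value)) {
        return null;
    }
    // Safe to keep; entity-encode for output without any other conversion.
    return htmlspecialchars($value, ENT_QUOTES, 'UTF-8');
}
```

The point of the shape: anything not explicitly understood is removed, and the only transformation ever applied to kept values is output encoding, never decoding.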

I'd also be tempted to avoid certain characters completely: just because the HTML spec says you can use a given character, and as many of them as you like, does that mean you should? Specifications are fine for making things easy to understand and implement, but they should be ignored whenever their definitions make it easier to exploit your system.

I like your code. If it's improved and you take this approach, I'll definitely recommend it, after mario, sirdarckcat and thornmaker have broken it first though :)

I was just playing around with the decoding options. So far I'm taking the approach that if passing the attribute through the decoding function changes the value in any way, then it's presumed to be bad input and the attribute is removed (otherwise further checks are carried out).

The only situation I can think of where a legitimate user entering html might fall foul of this is if they've copy & pasted a urlencoded link into the attribute value. Other than that, there's no reason for any encoded data to ever end up in an attribute (that I can think of).
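That heuristic is cheap to express; a sketch (the decodesCleanly() name is mine, not from the demo):

```php
<?php
// Reject any attribute value that a decode pass would alter: aside from the
// copy-and-pasted-link case, legitimate attribute values have no business
// containing %-encoded bytes.
function decodesCleanly(string $value): bool
{
    return rawurldecode($value) === $value;
}
```

Note that this also catches double encoding, since %2522 decodes to %22 and so fails the comparison, without the filter ever emitting the decoded form.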

AntiSamy does look good. I've refined the attribute filtering somewhat, although I haven't gone as far as producing a rule for every different attribute / style value.

At this point, something like <div style="color:'''';">test</div> will still get through the value whitelist (allow letters, numbers, spaces and # % ' , - . _ characters), but I wouldn't have thought that would be exploitable (which is my primary concern for now)?
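For reference, that value whitelist as a regex, assuming (as I read it) that it's applied to each property value after the style string is split on ";" and ":". That is why color:'''' slips through: the value part is nothing but whitelisted apostrophes. The pattern below is my reconstruction, not the demo's actual code:

```php
<?php
// Letters, numbers, spaces and # % ' , - . _ : the whitelist described above.
function styleValueAllowed(string $value): bool
{
    return (bool) preg_match("/^[a-z0-9 #%',._-]*$/i", $value);
}
```

Usefully, the really dangerous CSS constructs (expression(), url()) need parentheses, which this set excludes.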

I suppose that one alternative would be to hook into the AntiSamy policy file, using XPath to find the appropriate rules for the attribute or property. The input value could then be matched against the resulting regex / literal rules.
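Sketching that idea with SimpleXML and XPath: the policy fragment below is a cut-down, hypothetical imitation of an AntiSamy policy file, and attributeRegexps() is an illustrative name. The real antisamy.xml layout should be checked before relying on the element paths used here.

```php
<?php
// Hypothetical, cut-down policy fragment in the spirit of antisamy.xml.
$policyXml = <<<'XML'
<anti-samy-rules>
  <common-attributes>
    <attribute name="href">
      <regexp-list>
        <regexp value="(https?://|/)[^\s]*"/>
      </regexp-list>
    </attribute>
  </common-attributes>
</anti-samy-rules>
XML;

$policy = simplexml_load_string($policyXml);

// Fetch the regexp rules declared for a given attribute name via XPath.
function attributeRegexps(SimpleXMLElement $policy, string $attr): array
{
    $nodes = $policy->xpath(
        "//common-attributes/attribute[@name='$attr']/regexp-list/regexp"
    );
    $patterns = [];
    foreach ($nodes ?: [] as $regexp) {
        $patterns[] = (string) $regexp['value'];
    }
    return $patterns;
}
```

An attribute with no rules in the policy simply yields an empty list, which fits the whitelist philosophy: no rule, no attribute.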