Office HTML filter

This filter cleans up HTML generated by Microsoft Office. It can remove header tags (<style>, <script>, etc.) along with their contents, and can convert HTML entities to their plain-text equivalents. Used in conjunction with the core HTML filter, it can completely filter out the plethora of HTML that Microsoft Office generates.

To deal with Office-generated HTML, you must strip not only the offending tags but also the content between them. The core HTML filter can easily strip the tags themselves: configure a whitelist such as <a> <i> <em> <strong> <cite> <code> <ul> <ol> <li> <dl> <dt> <dd> <img> <h1> <h2> <h3> <h4> <h5> <h6> <table> <tr> <td> <thead> <tbody> <tfoot> <br> <p> <b> and choose to strip disallowed tags. However, due to a bug (or feature), it does not strip the content between the tags (see #447684: HTML Filter does not strip text between 'style' and 'script' elements). This is the gaping void that this module seeks to fill by stripping out that offending content. It also converts some HTML entities to their plain-text equivalents.
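The two steps described above (dropping <style> and <script> elements together with their contents, then converting some entities to plain text) can be illustrated with a minimal sketch. This is not the module's own code, which is written for Drupal. It is a standalone Python illustration, and the names CONTENT_TAGS, ENTITY_MAP, strip_tag_contents and clean_office_html are hypothetical. The set of entities shown is also an assumption, since the description does not say which entities the module converts.

```python
import re

# Assumption: elements whose contents should be discarded entirely.
CONTENT_TAGS = ("style", "script")

# Assumption: a small, illustrative entity map. The real module's list
# of converted entities is not given in the description.
ENTITY_MAP = {
    "&nbsp;": " ",
    "&ldquo;": '"',
    "&rdquo;": '"',
    "&lsquo;": "'",
    "&rsquo;": "'",
    "&ndash;": "-",
}


def strip_tag_contents(markup, tags=CONTENT_TAGS):
    """Remove each listed element, including everything between its tags."""
    for tag in tags:
        pattern = re.compile(
            r"<{0}\b[^>]*>.*?</{0}\s*>".format(re.escape(tag)),
            re.IGNORECASE | re.DOTALL,
        )
        markup = pattern.sub("", markup)
    return markup


def clean_office_html(markup):
    """Strip style/script blocks, then replace the mapped entities."""
    markup = strip_tag_contents(markup)
    for entity, plain in ENTITY_MAP.items():
        markup = markup.replace(entity, plain)
    return markup


sample = (
    '<style type="text/css">p.MsoNormal {margin: 0}</style>'
    "<p>Hello&nbsp;&ldquo;world&rdquo;</p>"
    "<SCRIPT>alert(1)</SCRIPT>"
)
print(clean_office_html(sample))
```

A regular expression is enough for this illustration because <style> and <script> elements do not nest. The remaining tags would still be handled by a whitelist filter such as the core HTML filter.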

Development / maintenance / issue queue policy

I have no immediate plans or funding for further development. However, I will happily accept RTBC (reviewed and tested by the community) patches.