regular expression to remove javascript from html

I'm writing a servlet to parse XML and XSL together, and then do some bits and pieces with the resultant HTML.

The first thing I need to do with the output is remove any and all JavaScript content from the HTML, as it's to be used in a different fashion to the original page.

I'm sure I'm not far off with the regex but I can't quite get it to work. (Don't you just hate it when you get to that stage?)

Here's an example of the regex, and a code snippet that it fails to parse even though I think it should (note there's no CR/LF between the end of the first script tag and the start of the second one).

Use the Pattern.DOTALL flag instead of Pattern.MULTILINE. The latter causes the start and end anchors (^ and $) to match at line boundaries as well as at the beginning and end of the input. DOTALL allows the dot to match line terminator characters (\r, \n, etc.), which it doesn't normally do.
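
As a rough sketch of what that looks like: the pattern below is my own guess, since the original regex isn't shown here, and ScriptStripper and strip are made-up names. It assumes every script element has an explicit closing tag, and it uses a reluctant quantifier so that two script blocks sitting back to back are removed separately.

```java
import java.util.regex.Pattern;

public class ScriptStripper {
    // Hypothetical pattern, not the original poster's regex.
    // DOTALL lets '.' match line terminators, so multi-line scripts are covered.
    // CASE_INSENSITIVE also catches <SCRIPT>. The reluctant .*? stops at the
    // first closing tag, so adjacent script blocks are matched one at a time.
    private static final Pattern SCRIPT = Pattern.compile(
            "<script\\b[^>]*>.*?</script\\s*>",
            Pattern.DOTALL | Pattern.CASE_INSENSITIVE);

    public static String strip(String html) {
        return SCRIPT.matcher(html).replaceAll("");
    }

    public static void main(String[] args) {
        // Two script blocks with no line break between them, one spanning lines.
        String html = "<p>a</p><script type=\"text/javascript\">\nvar x = 1;\n</script>"
                + "<script>alert(x);</script><p>b</p>";
        System.out.println(strip(html)); // prints <p>a</p><p>b</p>
    }
}
```

This is only a sketch: it will not cope with self-closing script tags or with a literal "</script>" inside a string in the script body.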

If your original source is XML, why use regular expressions at all? You have a well-formed document, so you can guarantee that <script></script> is all JavaScript, so handle it when you parse your XML, or transform it with your XSLT. Regular expressions seem unnecessary here, and just an overcomplication. But then again, perhaps I'm not quite following what you are trying to do?

Basically, I have an abstract class that instantiates an XML transformer, reads the basic data-centric XML from the database and transforms it with an HTML-centric XSL stylesheet (which XSL it uses is determined by various bits of data in the database).

This is being extended to a variety of things, and in the one I'm currently working on, I want to strip out all the JavaScript that may be in it.

If all you want to do is remove the JS then do it at transform time. It's far easier to write a template which matches <script /> elements and replaces them with nothing than to transform XML into HTML and then strip out the script tags. Otherwise you are parsing the document twice and exposing your app to more possible sources of error, both in the transform and in the regular expression matching. If you only get the document after it has been transformed and can't change that part of the application, remember that you can make the result of an XSLT transformation valid XHTML, in which case you could quite easily transform it again with a simple XSLT.
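
For the second-pass case, something along these lines might do. It's a minimal sketch, assuming the first transform produced well-formed XHTML. The class and method names are invented for illustration, and the stylesheet is just an identity transform plus one empty template. Matching on local-name() is my own choice, so that script elements are caught whether or not they are in the XHTML namespace.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

public class ScriptRemovingTransform {
    // Identity transform that copies everything, plus an empty template
    // that swallows any element whose local name is "script".
    private static final String XSL =
          "<xsl:stylesheet version=\"1.0\""
        + " xmlns:xsl=\"http://www.w3.org/1999/XSL/Transform\">"
        + "<xsl:template match=\"@*|node()\">"
        + "<xsl:copy><xsl:apply-templates select=\"@*|node()\"/></xsl:copy>"
        + "</xsl:template>"
        + "<xsl:template match=\"*[local-name()='script']\"/>"
        + "</xsl:stylesheet>";

    // Hypothetical helper: the input must be well-formed XML/XHTML.
    public static String removeScripts(String xhtml) {
        try {
            Transformer t = TransformerFactory.newInstance()
                    .newTransformer(new StreamSource(new StringReader(XSL)));
            // Force XML output so the serializer does not switch to HTML rules.
            t.setOutputProperty(OutputKeys.METHOD, "xml");
            t.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            t.transform(new StreamSource(new StringReader(xhtml)), new StreamResult(out));
            return out.toString();
        } catch (TransformerException e) {
            throw new IllegalStateException("Could not strip script elements", e);
        }
    }

    public static void main(String[] args) {
        String page = "<html><body><p>a</p><script>alert(1);</script>"
                + "<p>b</p></body></html>";
        System.out.println(removeScripts(page));
    }
}
```

If you can edit the original stylesheet instead, the same empty template dropped into it saves the second pass altogether.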