Wouldn't it be better to work with the DOM and then convert it into a String representation? You'll have a cleaner version of the HTML along with your changes. Is there a specific reason that you need to modify the original source string?
– Vivin Paliath, Sep 4 '11 at 4:41

I can think of one. It makes it easier to figure out what substantive modifications have been made if you don't have to wade through a bunch of textual changes that are just a result of the unparser rendering the XML a bit differently.
– Stephen C, Sep 4 '11 at 5:04

Vivin - I'm trying to 'clean' certain sensitive items from HTML pages so that I can run other parsing tests against the original HTML (with the sensitive data overwritten with 999 or xxx etc). Also, I want different tests to be able to run in both Java and JavaScript, so the original source is the best starting point for me, as the Java HTML parsers and each browser's HTML parser may produce different DOMs (even though they probably shouldn't).
– Paul Grime, Sep 4 '11 at 11:51

7 Answers

It sounds like Jericho is almost exactly what you want. It is a robust HTML parser designed specifically for making unintrusive modifications to the source document.

While it doesn't come with DOM, SAX, or StAX interfaces, it has custom APIs that are similar enough to those standards that you should be able to adapt your approach to them fairly easily, or write an adapter between whatever you are using and Jericho. For instance, you can do XPath queries on Jericho documents using Jaxen -- see this blog entry for an example.

Jericho records begin and end positions for every element, and even for parts of an element such as the tag name or an attribute name, so you can edit the document yourself using those offsets. Where Jericho really shines, though, is the OutputDocument class: you specify replacements directly by passing it the Jericho elements that match your query, instead of calling getBegin() and getEnd() on them yourself and handing the offsets to some replacement method.
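To make the offset-based idea concrete, here is a minimal sketch in plain Java of the "edit the document yourself" route. It is not Jericho's API: `SourcePatcher` and `Edit` are hypothetical names invented for this illustration, and the begin/end values are assumed to come from whatever parser reports source positions (such as Jericho's getBegin() and getEnd()). The point it shows is that everything outside the replaced ranges is copied through character for character, so odd spacing or tag casing in the original survives.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical helper, not part of Jericho: applies replacements to the
// original source by character offset and leaves everything else untouched.
final class SourcePatcher {

    /** One planned replacement of the half-open range [begin, end). */
    static final class Edit {
        final int begin;
        final int end;
        final String replacement;

        Edit(int begin, int end, String replacement) {
            this.begin = begin;
            this.end = end;
            this.replacement = replacement;
        }
    }

    /** Returns a copy of source with every edit applied; edits must not overlap. */
    static String apply(String source, List<Edit> edits) {
        // Sort a copy by start offset so the caller's list is not modified.
        List<Edit> sorted = new ArrayList<>(edits);
        sorted.sort(Comparator.comparingInt((Edit e) -> e.begin));

        StringBuilder out = new StringBuilder(source.length());
        int cursor = 0; // first character of the source not yet copied
        for (Edit e : sorted) {
            if (e.begin < cursor || e.end < e.begin || e.end > source.length()) {
                throw new IllegalArgumentException("overlapping or out-of-range edit");
            }
            out.append(source, cursor, e.begin); // untouched text before the edit
            out.append(e.replacement);           // the substituted text
            cursor = e.end;                      // skip the replaced range
        }
        out.append(source, cursor, source.length()); // untouched tail
        return out.toString();
    }
}
```

With a parser that exposes offsets, you would create one `Edit` per sensitive segment and call `SourcePatcher.apply(html, edits)` once at the end. OutputDocument does this bookkeeping for you, which is why it is the more convenient option when you are already using Jericho.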

We had problems with Jericho's behavior within a server app (memory management, logging), which we fixed. The original developer didn't think our issues were important enough to put in the main code branch, so our fork is on GitHub.
We also made fixes to HtmlCleaner.

As others have suggested, you probably want to render the DOM. This just means constructing the node tree; it won't alter the document source unless you use an HTML cleaner like jTidy. You then have easy access to the document and can modify it as required. I would suggest DOM4J: it has a good API and XPath support too.

Re your "indexing" requirement: during your traversal or querying of the document, you can cache in a list or map any elements or nodes whose text you wish to modify at a later point.
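A rough sketch of that caching approach follows. It uses the JDK's built-in org.w3c.dom classes as a stand-in for DOM4J, and it assumes the input is well-formed XHTML, since the JDK parser rejects tag-soup HTML. The class name `SensitiveNodeCache` and the `class="secret"` marker are hypothetical, chosen only for the illustration.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper: remember matching elements during one traversal,
// then change their text in a later, separate step.
final class SensitiveNodeCache {

    /** Parses well-formed XHTML into a DOM tree (assumption: input is valid XML). */
    static Document parse(String xhtml) {
        try {
            return DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xhtml)));
        } catch (Exception e) {
            throw new IllegalArgumentException("input is not well-formed XML", e);
        }
    }

    /** Traversal step: caches every element whose class attribute equals className. */
    static List<Element> collect(Document doc, String className) {
        List<Element> hits = new ArrayList<>();
        NodeList all = doc.getElementsByTagName("*"); // "*" matches all elements
        for (int i = 0; i < all.getLength(); i++) {
            Element e = (Element) all.item(i);
            if (className.equals(e.getAttribute("class"))) {
                hits.add(e);
            }
        }
        return hits;
    }

    /** Later step: overwrites the text of each cached element with the mask. */
    static void mask(List<Element> cached, String mask) {
        for (Element e : cached) {
            e.setTextContent(mask);
        }
    }
}
```

Because the list holds references to the live nodes, the masking step needs no second query. Note that serializing this DOM back to a string will not necessarily reproduce the original source formatting, which is the trade-off raised in the comments above.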