Hi,

I'm processing some invalid XHTML files that aren't even well-formed, and I'm
hoping NekoHTML can help.

My main aim is to make them well-formed with the minimum possible changes.

I've written a simple test app that uses org.cyberneko.html.filters.Writer
to process one of the XHTML source files and output a cleaned version.

It currently does this:

import org.apache.xerces.xni.parser.XMLDocumentFilter;
import org.apache.xerces.xni.parser.XMLInputSource;
import org.apache.xerces.xni.parser.XMLParserConfiguration;
import org.cyberneko.html.HTMLConfiguration;
import org.cyberneko.html.filters.Purifier;
import org.cyberneko.html.filters.Writer;

public static void main(String[] args) throws Exception {
    XMLParserConfiguration parser = new HTMLConfiguration();
    // Scanner and error-reporting features.
    parser.setFeature("http://apache.org/xml/features/scanner/notify-char-refs", true);
    parser.setFeature("http://cyberneko.org/html/features/scanner/notify-builtin-refs", true);
    parser.setFeature("http://cyberneko.org/html/features/report-errors", true);
    parser.setFeature("http://cyberneko.org/html/features/balance-tags", true);
    parser.setProperty("http://cyberneko.org/html/properties/names/elems", "lower");
    String iencoding = null;            // no input encoding specified
    String oencoding = "Windows-1252";
    // Filter chain: Purifier first, then Writer serializes to stdout.
    XMLDocumentFilter[] filters = {
        new Purifier(),
        new Writer(System.out, oencoding)
    };
    parser.setProperty("http://cyberneko.org/html/properties/filters", filters);
    // args[0] is the system ID (path or URL) of the source file.
    XMLInputSource source = new XMLInputSource(null, args[0], null);
    source.setEncoding(iencoding);
    parser.parse(source);
}

A few problems with the output that I need to resolve:

1. The doctype from the source file doesn't appear in the target file:

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">

It is the first line in the source. I also get an odd error reported about it:

[Error] source.xhtml:1:110: DOCTYPE declaration found inside document content.

2. The main problem with the source files I am trying to fix is that they contain
attribute values with bare ampersands in them. This causes normal XML parsing with
Xerces to fail. Here's an example:

href="http://somewhere.com/form?this=that&foo=baa"

I get warnings for this:

[Warning] source.xhtml:476:108: Bare ampersand found.
[Warning] source.xhtml:476:108: Unknown general entity "email".

I would have thought this should be an error. However, all that matters to me is
finding a way to have these fixed in the output, e.g.:

href="http://somewhere.com/form?this=that&amp;foo=baa"

I tried setting the http://cyberneko.org/html/features/scanner/normalize-attrs feature to
true, but that just caused an ArrayIndexOutOfBoundsException, so I removed it.

3. I must also be missing something obvious in my usage of NekoHTML, as the output
file contains unbalanced <br> elements.

Would appreciate any advice on fixing these things.

Thanks,
Derek

I think you are making an invalid assumption that the Writer filter will
output XHTML rather than plain old HTML. Based on the description [1], it
simply outputs HTML. You might presume that the Purifier filter would ensure
this, but don't confuse balancing tags in the DOM with its serialized
presentation. If you want valid XHTML output, you're going to need to use a
serializer designed to provide XHTML. Arguably, NekoHTML ought to come with
such a facility, but I don't think that's the case today.
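
To illustrate the difference between a balanced tree and its serialized form, here is a minimal sketch that uses only the JDK's JAXP classes, not NekoHTML. The class name SerializationDemo and the hand-built tree are my own stand-ins for whatever the parser would produce; the point is only that the same tree comes out differently depending on the output method chosen.

```java
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

public class SerializationDemo {

    // Hand-built stand-in for a parsed tree: a <p> holding text, an empty <br>, and more text.
    static Document sampleTree() throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
        Element p = doc.createElement("p");
        p.appendChild(doc.createTextNode("one"));
        p.appendChild(doc.createElement("br"));
        p.appendChild(doc.createTextNode("two"));
        doc.appendChild(p);
        return doc;
    }

    // Serialize the tree with the given JAXP output method ("html" or "xml").
    static String serialize(Document doc, String method) throws Exception {
        Transformer t = TransformerFactory.newInstance().newTransformer();
        t.setOutputProperty(OutputKeys.METHOD, method);
        t.setOutputProperty(OutputKeys.INDENT, "no");
        t.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
        StringWriter out = new StringWriter();
        t.transform(new DOMSource(doc), new StreamResult(out));
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        Document doc = sampleTree();
        System.out.println("html: " + serialize(doc, "html"));
        System.out.println("xml:  " + serialize(doc, "xml"));
    }
}
```

With the html method the empty br element is written as a bare start tag, which is legal HTML but not well-formed XML; with the xml method it is written as a self-closing tag. The tree is identical in both cases.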
One option is to use Xalan's serializer. It's actually built as a separate
jar file and shipped with Xerces (serializer.jar). You can use it via the
JAXP Transformer API [2]. See the Xalan documentation for use.
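
A rough sketch of that approach, under some assumptions: the tree below is built by hand as a stand-in for a DOM you would get from NekoHTML (for example via its DOMParser, not shown here), the class name XhtmlOutputSketch is made up, and the doctype identifiers are simply the ones from your source file. It uses whichever Transformer implementation the JDK provides, so check the behaviour against the Xalan serializer if you use that jar directly. A real XHTML document would also want the XHTML namespace on the root element, which I've left out.

```java
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

public class XhtmlOutputSketch {

    // Stand-in tree: <html><body><a href="...&...">form</a><br></body></html>
    static Document sampleTree() throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
        Element html = doc.createElement("html");
        Element body = doc.createElement("body");
        Element a = doc.createElement("a");
        // In the DOM the attribute value holds a plain '&'; escaping is the serializer's job.
        a.setAttribute("href", "http://somewhere.com/form?this=that&foo=baa");
        a.appendChild(doc.createTextNode("form"));
        body.appendChild(a);
        body.appendChild(doc.createElement("br"));
        html.appendChild(body);
        doc.appendChild(html);
        return doc;
    }

    // Serialize as XML and ask the serializer to emit the XHTML 1.0 Strict doctype.
    static String toXhtml(Document doc) throws Exception {
        Transformer t = TransformerFactory.newInstance().newTransformer();
        t.setOutputProperty(OutputKeys.METHOD, "xml");
        t.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
        t.setOutputProperty(OutputKeys.DOCTYPE_PUBLIC, "-//W3C//DTD XHTML 1.0 Strict//EN");
        t.setOutputProperty(OutputKeys.DOCTYPE_SYSTEM,
                "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd");
        StringWriter out = new StringWriter();
        t.transform(new DOMSource(doc), new StreamResult(out));
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(toXhtml(sampleTree()));
    }
}
```

Because the serializer works from the tree rather than from the original text, the doctype, the escaped ampersand in the href, and the closed br all come from the output settings, not from anything preserved in the source file.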
[1] http://nekohtml.sourceforge.net/filters.html#filters.serialize
[2] http://xml.apache.org/xalan-j/apidocs/javax/xml/transform/package-summary.html
Jake
On Fri, 24 Jul 2009 13:27:19 +0100
Derek Alexander <d.alexander@...> wrote: