Does anyone have a good way of detecting whether a string contains malformed XHTML using JavaScript?

Since my page accepts 'user'-generated XHTML returns (the users can be trusted) and injects them into the DOM, I want a way to check for unclosed or overly closed tags and encode them as &lt; and &gt; so that the errors are simply displayed as text. This way all valid XHTML will still be rendered, and the invalid parts will simply become text nodes, allowing the script to at least continue despite the errors.

Here's a similar method I made, which is rather crude. It has a different purpose (simply stripping all valid XHTML tags and leaving the rest). It works by recursively selecting the innermost tags and stripping them out.
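
The snippet itself is not shown here. As a rough sketch of the approach described (not the original code), the function below repeatedly removes the innermost well-formed tag pairs until nothing changes. The name `stripValidTags` and the regular expression are assumptions made for illustration; comments, CDATA sections, and attribute values containing `<` or `>` are not handled.

```javascript
// Hypothetical helper: strips well-formed tag pairs, innermost first,
// and leaves anything that never pairs up untouched.
function stripValidTags(input) {
  // <name attrs>text</name>, where text contains no '<', so the pair
  // cannot contain another tag (it is an innermost pair).
  // \1 is case-sensitive, which matches XHTML's case-sensitive names.
  var innermost = /<([A-Za-z][A-Za-z0-9]*)(\s[^<>]*)?>([^<]*)<\/\1\s*>/g;
  var output = input;
  var previous;
  do {
    previous = output;
    // Keep only the text between the opening and closing tag.
    output = output.replace(innermost, '$3');
  } while (output !== previous); // stop once a pass changes nothing
  return output;
}
```

Each pass exposes a new layer of innermost pairs, so the loop ends after at most as many passes as the deepest nesting level, plus one. Tags that are unclosed or closed in the wrong order never match and stay in the output.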

One problem any such program will have is deciding which tags are the erroneous ones... Consider the fragment <b><i><u>hi</b></u>; which tags would you want to encode?
– Stobor Jun 11 '09 at 3:56

@Stobor, I'm lazy, but I would try that particular bit of code in a lot of different browsers and see what they do. I suspect most of them will understand it internally as <b><u>hi</u></b>
– Ape-inago Jun 11 '09 at 4:06

"One problem any such program will have is deciding which tags are the erroneous ones... Consider the fragment <b><i><u>hi</b></u>; which tags would you want to encode?" Anything that terminates before all nested tags are terminated, or anything that doesn't terminate before its parent is terminated. In this case, all tags should be removed.
– Ian Elliott Jun 11 '09 at 4:36
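
One way to read the rule in the comment above, as a minimal sketch rather than a full validator: treat a tag as valid only if it pairs up with everything inside it already valid, and encode every other `<` and `>`. The name `escapeMalformedTags`, the regular expressions, and the use of the private-use characters U+E000 and U+E001 as temporary markers (assumed never to occur in the input) are all assumptions for illustration. Comments, CDATA sections, and attribute values containing `<` or `>` are not handled.

```javascript
// Hypothetical helper: keeps properly nested tags, encodes the rest.
function escapeMalformedTags(input) {
  // Temporary stand-ins for '<' and '>' on tags judged valid.
  // Assumption: these private-use characters never appear in the input.
  var OPEN = '\uE000';
  var CLOSE = '\uE001';
  // Self-closing tags such as <br/> are valid on their own.
  var selfClosing = /<([A-Za-z][A-Za-z0-9]*)(\s[^<>]*)?\/>/g;
  // A pair whose content holds no raw '<' or '>' (markers are allowed).
  var pair = /<([A-Za-z][A-Za-z0-9]*)(\s[^<>]*)?>([^<>]*)<\/\1\s*>/g;

  function protect(match) {
    // Only tag delimiters can be raw here, so swap them for markers.
    return match.replace(/</g, OPEN).replace(/>/g, CLOSE);
  }

  var output = input.replace(selfClosing, protect);
  var previous;
  do {
    previous = output;
    output = output.replace(pair, protect);
  } while (output !== previous);

  // Whatever is still raw did not pair up: encode it as text.
  output = output.replace(/</g, '&lt;').replace(/>/g, '&gt;');
  // Restore the delimiters of the tags that were accepted.
  return output.split(OPEN).join('<').split(CLOSE).join('>');
}
```

With this reading, the fragment from the earlier comment has no valid pair at all, so every tag in it comes out encoded, while a fragment like `<p>ok</p><b>bad` keeps the paragraph and turns only the unclosed `<b>` into text.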

2 Answers

Seems like a good project, but I can't use this as part of an XHTML page.
– Ian Elliott Jun 11 '09 at 4:41

But since you've already said in another comment that you're using XHR, can't you just post the malformed (X)HTML to your own tidy.cgi? htmltidy can fix almost anything, and what it can't fix your script probably wouldn't handle much better. Sure, it adds maybe 2 seconds to the submit/save action, but is that really going to be an issue?
– SpliFF Jun 11 '09 at 5:38

Btw, there is a project called jTidy which can probably run "on site" as a Java applet, but I don't think the project is actively maintained and I haven't used it.
– SpliFF Jun 11 '09 at 5:40

Well, it's part of an XHTML+Voice application, which already calls a CGI script to interpret a Haskell program. The entire return has to be 'verified' before I allow it to be parsed, which has to be done with a synchronous Ajax request, or else the VXML initiates before it has anything to output. Adding two seconds to this will hang the browser for an additional two seconds, not good! Also, I must contain this within one file :( The constraints weren't mine to make, sadly.
– Ian Elliott Jun 11 '09 at 5:45

2 seconds is pure speculation and was meant to describe the HTTP round-trip cost (since tidy itself should finish in milliseconds). If you're already round-tripping the data, then you can add a tidy stage to the output/verification process. In reality it could end up being quicker than a pure JS approach, especially if you're relying on regex. Test it and see.
– SpliFF Jun 11 '09 at 6:10

The HTML is generated after load, by grabbing it from a script file using xmlhttp. I have to be able to validate it on site; sending it off to the W3C and waiting for a response won't do. Not to mention I just need to validate the tags, not the document, as the W3C validator will always report a bare fragment of tags as invalid.
– Ian Elliott Jun 11 '09 at 5:14