Xponent's Mostly XML Blog

XML Parsers And Well-Formed Errors

The W3C XML 1.0 Specification requires an XML document to be "well-formed," which basically means that its syntax
is correct. The list of syntax rules is rather lengthy. Some of the basic rules are that the document must have a
single "root" element, that element tags must be properly nested, and that tag names are case sensitive. This article
addresses how an XML parser locates well-formed errors.
The W3C (www.w3c.org) is the standards-setting organization that developed XML and related specifications.

Why Browsers and XML Editors Stop When They Hit A Well-formed Error

Why not read the entire document and then report all the errors? The structure of
XML makes it impossible to accurately read any farther in
the document once an error has been encountered. While reading this article, keep in mind that an XML parser,
which is the software that reads and analyzes the XML, reads in a forward direction only. It does not retain much
information about what it has previously read, primarily for memory and performance reasons. The parser error
messages quoted here come from the XmlReader class in Microsoft's .NET Framework.

Parsing XML is somewhat like walking down a staircase in the dark. One carefully takes a step at a time, checking to
see if the next step is broken or missing. Any attempt to step over a broken or missing step could be disastrous. It
is simply too dangerous to continue until the step is repaired. When an XML parser encounters a well-formed error,
there is no way to reliably evaluate what lies ahead. Consider the following XML, bearing in mind that XML tag names
may not begin with a digit, so 1Foo is a well-formed error.

Most XML parsers report the line and character position of the offending character when a well-formed error is
encountered. In the example above, the error would be reported as:

Name cannot begin with the '1' character, hexadecimal value 0x31. Line 2, position 2.
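
The stop-at-the-first-error behavior can be reproduced with other parsers too. The sketch below uses Python's expat binding rather than the .NET XmlReader quoted above, so the message wording differs. The document text and the helper name first_error are my own assumptions for illustration, not the original example.

```python
import xml.parsers.expat

# Hypothetical document: the element name on line 2 starts with a digit.
DOC = "<root>\n<1Foo>text</1Foo>\n<Foo>more</Foo>\n</root>"

def first_error(text):
    """Return (message, line) for the first well-formed error, or None."""
    parser = xml.parsers.expat.ParserCreate()
    try:
        parser.Parse(text, True)
    except xml.parsers.expat.ExpatError as err:
        # expat raises at the first error; nothing after it is examined.
        return xml.parsers.expat.ErrorString(err.code), err.lineno
    return None

print(first_error(DOC))
```

Only one error ever comes back, even though the end tag on the same line is just as invalid as the start tag.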

If a parser were to report the error and proceed, how would it know where the 1Foo element ends? The end tag may have
the same error, or it may have a different error, or it may be missing. Should the parser assume that a Foo end tag
closes 1Foo, or that it closes another element that is missing its start tag? The parser cannot safely assume either:
a wrong guess would cascade into further misinterpretations and into reports of additional errors that may not be
errors at all.

Locating Well-formed Errors

Consider the following XML file:

Most XML parsers would report an error similar to the following:

End tag 'allbooks' does not match the start tag 'book'. Line 5, Position 3.
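
To see the same effect outside .NET, here is a minimal sketch using Python's ElementTree. The five-line file is a hypothetical stand-in that I chose to be consistent with the message above (the second book tag on line 4 should have been an end tag), and error_position is an illustrative helper name.

```python
import xml.etree.ElementTree as ET

# Hypothetical file: the <book> on line 4 was meant to be </book>.
BOOKS = """<allbooks>
  <book>
    <title>XML Basics</title>
  <book>
</allbooks>"""

def error_position(text):
    """Return the (line, column) where parsing failed, or None if well-formed."""
    try:
        ET.fromstring(text)
    except ET.ParseError as err:
        return err.position  # where the parser gave up, not where the typo is
    return None

line, column = error_position(BOOKS)
print(f"parser stopped on line {line}; the faulty tag is on line 4")
```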

Notice that the reported line is the root end tag. The book element was read and the parser reached the end of the file
with no end element found for it. In this small file it is easy to see that the problem may be fixed by changing the
second book element into an end element. But the error message would not be much help in a large XML file where there
may be thousands of book elements and the error occurs in the middle of the document.

Even more problematic is an XML file containing no carriage returns or line feeds such as the following:

The last element before the root end tag could either be missing the markup that would make it an end tag for element
"e", or it could be an element without an end tag. An XML parser will report a well-formed error. The line number will
be 1 since it is a single-line file, but the position reported will be that of the root end tag, regardless of where
the actual problem tag is in the file. Thus, the problem tag cannot be located from the parser's error report alone.
The XML parser reports an error like the following:

The 'e' start tag on line 1 does not match the end tag of 'root'. Line 1, position 50.

Position 50 is the root end tag. The parser does not realize that the last element is missing its end tag until it has
reached the end tag of the root, which it does recognize as the last element in the document, so it stops there. What
do you suppose the parser reports if it is the second element that is missing its end tag, rather than the last one?
The error message is identical: the parser still does not realize an error exists until it reaches the root end tag.
An XML parser does not remember much and cannot look back. When it meets the stray start tag, it does not guess that
an end tag was intended and report the error there; it treats the tag as the start of a child element and continues
reading until it reaches the end.
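
A quick way to check this claim is to feed a parser two single-line documents that differ only in which element is missing its end tag. The sketch below again uses Python's expat binding, not XmlReader. Both documents and the helper name error_location are assumptions made up for the comparison.

```python
import xml.parsers.expat

# Hypothetical single-line documents of equal length.
LAST_UNCLOSED = "<root><e>1</e><e>2</e><e>3<e></root>"    # faulty tag is last
SECOND_UNCLOSED = "<root><e>1</e><e>2<e><e>3</e></root>"  # faulty tag is second

def error_location(text):
    """Return (line, offset) of the first well-formed error, or None."""
    parser = xml.parsers.expat.ParserCreate()
    try:
        parser.Parse(text, True)
    except xml.parsers.expat.ExpatError as err:
        return err.lineno, err.offset
    return None

# Both calls point at the root end tag, not at the faulty <e>.
print(error_location(LAST_UNCLOSED))
print(error_location(SECOND_UNCLOSED))
```

The two reported locations are the same, so the location by itself cannot tell the two mistakes apart.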

A human can easily see where the problem is because this is a tiny XML document. What if the XML had a few thousand
elements? Even if viewed in a tree, how long would it take to locate the error if the problem tag is in the middle of
a huge XML file?

As far as I know, no XML parser reports the location of the problem tag in this particular scenario.
If you know of an XML parser that does, please advise. This does not preclude applications that use an
XML parser from implementing their own strategy to accomplish this.
It would require a method of tracking element tags along with their positions.
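
One possible shape for such a strategy is sketched here, assuming Python's expat binding and a hypothetical helper named unclosed_tags. The application keeps its own stack of start tags with their positions and, when the parser fails, hands back whatever is still open. This is only an illustration of the idea, not a feature of any particular parser.

```python
import xml.parsers.expat

def unclosed_tags(text):
    """On a well-formed error, return the start tags still open at that
    moment as (name, line, column) tuples, outermost first; else []."""
    parser = xml.parsers.expat.ParserCreate()
    open_tags = []

    def on_start(name, attrs):
        # Remember where each start tag begins (expat columns are 0-based).
        open_tags.append(
            (name, parser.CurrentLineNumber, parser.CurrentColumnNumber))

    def on_end(name):
        # A matching end tag closes the most recently opened element.
        open_tags.pop()

    parser.StartElementHandler = on_start
    parser.EndElementHandler = on_end
    try:
        parser.Parse(text, True)
    except xml.parsers.expat.ExpatError:
        return list(open_tags)
    return []

# Hypothetical file: the <book> on line 4 should have been </book>.
SAMPLE = "<allbooks>\n  <book>\n    <title>XML</title>\n  <book>\n</allbooks>"
for name, line, column in unclosed_tags(SAMPLE):
    print(name, line, column)
```

The list does not say which tag is wrong, but it shrinks the search from the whole file to a handful of positions, and an element opened directly inside another element of the same name is an obvious suspect.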