Error Handling

The previous example is incomplete, or at least optimistic. What happens if the input is missing a closing tag, or otherwise not well-formed XML? By default, the parser throws an exception and stops reading the document.

Xerces offers limited control over what to do in the event of an error. You can assign a custom error handler object to the parser, which gives you the chance to report a more useful error message (such as one that includes the filename and location of the error) or, in some cases, ignore the error altogether. The sample program step3 demonstrates a simple error handler.

An error handler implements four methods. The first three, warning(), error(), and fatalError(), are callbacks that fire in the event of a warning, a parsing error, and a fatal parsing error, respectively. The fourth, resetErrors(), is called at the end of each parse to give the object a chance to refresh itself. For example, if you track the number of errors you encounter, you can use resetErrors() to reset that counter to zero.

Writing a custom ErrorHandler doesn't offer much in the way of recovery, though. You can choose to swallow the exceptions instead of rethrowing them, but consider whether the document is still worth parsing after you get an error.

You can still use the error handler to tell you where the error occurred. (Hint: assign the same Locator to both the parser and your custom error handler.) It's much nicer to report "Error in file X, line Y, column Z" instead of just "Your very large document has an error. Good luck."
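
The handler described in this section might be sketched as follows. This is a minimal, library-free illustration rather than the step3 code: ParseIssue, describe(), and CountingErrorHandler are hypothetical names, and ParseIssue stands in for the SAXParseException that the real Xerces callbacks receive. Only the four method names follow the Xerces ErrorHandler interface.

```cpp
#include <sstream>
#include <stdexcept>
#include <string>

// Hypothetical stand-in for the details Xerces hands to a handler
// (the real callbacks receive a SAXParseException).
struct ParseIssue {
    std::string file;
    long line;
    long column;
    std::string message;
};

// Builds the "file X, line Y, column Z" style of message.
std::string describe(const char* level, const ParseIssue& issue) {
    std::ostringstream out;
    out << level << " in file " << issue.file
        << ", line " << issue.line
        << ", column " << issue.column
        << ": " << issue.message;
    return out.str();
}

// Illustrative only: this class does not derive from any Xerces type.
class CountingErrorHandler {
public:
    // Warnings are recorded; nothing is thrown.
    void warning(const ParseIssue& issue) {
        ++warnings_;
        lastMessage_ = describe("Warning", issue);
    }

    // Ordinary errors are counted and swallowed in this sketch.
    void error(const ParseIssue& issue) {
        ++errors_;
        lastMessage_ = describe("Error", issue);
    }

    // Fatal errors are counted, then rethrown with the location attached.
    void fatalError(const ParseIssue& issue) {
        ++errors_;
        lastMessage_ = describe("Fatal error", issue);
        throw std::runtime_error(lastMessage_);
    }

    // Returns the handler to a clean state between parses.
    void resetErrors() {
        warnings_ = 0;
        errors_ = 0;
        lastMessage_.clear();
    }

    int warningCount() const { return warnings_; }
    int errorCount() const { return errors_; }
    const std::string& lastMessage() const { return lastMessage_; }

private:
    int warnings_ = 0;
    int errors_ = 0;
    std::string lastMessage_;
};
```

In this sketch, error() swallows the problem and fatalError() rethrows it. Whether either choice suits a real application is the judgment call discussed above.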

Validation

Note the difference between well-formed and valid XML. The first is purely structural ("Does every start tag have a corresponding end tag?"), and the parser handles it automatically. The second is specific to your application ("Does the <airport> element exist?"), and it is your responsibility to handle it, at least in part.

For example, the basic error handling in step3 will force the parser to halt if the document is not well-formed (not structurally sound); but if a required element is missing, step3 will blindly pass half-formed RPMInfo objects around the rest of the app. Code that handles RPMInfo objects shouldn't have to worry about that. At the same time, how does a parser know when an XML document is suitable for your purposes?

Your side of the valid-document bargain is to provide a DTD or schema/XSD (collectively, grammars) that defines how your XML document should look. Grammars are a contract between your code and the incoming XML documents. Assign a grammar to the parser, and it will enforce that contract for you.

Declare a document's grammar in its XML prolog. For example, the following code excerpt declares a DTD:
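
A minimal illustration of such a declaration, assuming a hypothetical root element rpm-list and a hypothetical DTD file rpm-list.dtd (neither name is taken from the sample programs):

```xml
<?xml version="1.0"?>
<!-- The DOCTYPE names the root element and the location of its DTD. -->
<!DOCTYPE rpm-list SYSTEM "rpm-list.dtd">
<rpm-list>
  <!-- document content goes here -->
</rpm-list>
```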

XML validation makes your code cleaner: it can focus on the task at hand, knowing for certain that the invariants of the contract will hold true ("There will always be a <foo/> element"). If you don't already account for document structure in your code, validation lets you sleep easier in spite of the lack of explicit checks.

Entity Resolvers

It's easy enough to point to a DTD or schema in a known path, but you may have noticed that a lot of DTD and schema references point to off-site URLs. You certainly don't expect to hit the internet every time you parse a document, do you? Besides, URLs used as schema IDs are there more for uniqueness than anything else, and some of them don't resolve to anything. How does this work?

Xerces handles this with an entity resolver (Xerces class EntityResolver for SAX and XMLEntityResolver for DOM). When the parser encounters an entity in the document, it asks the resolver where to find it. The default resolver just tries to load entities from whatever location is specified; custom entity resolvers match the incoming name to some other resource--local file, in-memory document, alternate URL--and return that instead.

The resolver's resolveEntity() callback method returns an InputSource from which the parser can read the grammar. For example, a LocalFileInputSource reads from an on-disk file, and a MemBufInputSource loads data from an in-memory buffer.

(The code for DOM is only slightly different; it takes an XMLResourceIdentifier object instead of the public ID/system ID pair.)

The sample program step4 demonstrates this using the resolver SimpleEntityResolver. This class uses an internal map to match grammar URIs to local resources. Note that resolveEntity() returns a new InputSource each time it is called, because the parser takes ownership of the pointer. If it cannot find the document, resolveEntity() returns NULL.
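
The map-based lookup described above might be sketched as follows. This is a library-free illustration under stated assumptions, not the step4 code: FileSource and MapResolverSketch are hypothetical names, with FileSource standing in for a file-backed InputSource such as LocalFileInputSource.

```cpp
#include <map>
#include <memory>
#include <string>

// Hypothetical stand-in for a file-backed input source.
struct FileSource {
    std::string path;
};

// Illustrative resolver: maps grammar URIs to local files.
class MapResolverSketch {
public:
    // Registers a local replacement for a grammar URI.
    void add(const std::string& systemId, const std::string& localPath) {
        local_[systemId] = localPath;
    }

    // Returns a fresh source on every call, or null when the ID is unknown.
    // The text describes a raw pointer that the parser then owns;
    // unique_ptr expresses the same hand-off in this stand-in.
    std::unique_ptr<FileSource> resolveEntity(const std::string& publicId,
                                              const std::string& systemId) const {
        (void)publicId;  // this sketch matches on the system ID only
        auto found = local_.find(systemId);
        if (found == local_.end()) {
            return nullptr;
        }
        return std::unique_ptr<FileSource>(new FileSource{found->second});
    }

private:
    std::map<std::string, std::string> local_;
};
```

A caller would register each off-site grammar URI once with add(), then let every lookup for that URI come back as a local path. Unknown URIs yield a null result.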

Conclusion

Xerces-C++ is a robust, feature-filled XML parser toolkit. These two articles have introduced the basics of using Xerces, but by no means do they cover everything you need to know. They should, however, serve as a starting point for further exploring the product documentation and trying your own experiments.

Resources

Xerces-C++ is similar in syntax and feel to some Java-based XML parsers. If you're familiar with those implementations, migrating to Xerces-C++ should be a breeze.

Despite the title, Elliotte Rusty Harold's Processing XML with Java is a useful reference for XML processing in all languages. The book is available for purchase, or you can read it all online.

The Xerces-C++ web site has links to documentation and downloads. Binaries are available for several platforms. While no RPMs are available, the source bundle includes a spec file for building your own.

Q Ethan McCallum grew from curious child to curious adult, turning his passion for technology into a career.