Abstract

Over the last few years, XML has grown in importance for storing data on stream-oriented media. In most cases, DOM API implementations are used to store and manipulate the data in memory. This article shows how to use an alternative API - the SAX API - in a generic way, with automated code generation from XML schema definitions in mind. It presents a C# implementation of a simple-to-use yet flexible framework for interacting with SAX parsers.

1 Introduction

This article assumes that the reader is familiar with general XML processing and especially with the classes from the System.Xml namespace of .NET. Some familiarity with C# programming is also expected. The rest of this section first gives some motivation, then outlines the main characteristics of SAX parsers, and finally gives an overview of a possible framework for interacting with SAX parsers.

The second section introduces a C# implementation of such a framework and explains its overall design principles. Some hints on porting to other languages are given. The third section concludes this article by showing some examples of how to use the framework and, finally, gives an outlook on the second article, which will introduce a mapping of XML schema definitions to the framework.

The solution provided contains several VS projects. Of these, "SaxParserSupport" builds the framework. The other projects contain the examples from section 3.

1.1 Motivation

Over the last few years, XML has grown considerably in importance for serializing and deserializing data in a portable, platform-independent way. Many people use DOM in preference to SAX to access XML content. This is especially true for read access. They thereby avoid the burden of implementing the processing of data in the context of the XML parser, which DOM does automatically. However, on closer inspection it becomes evident that much of the source code for this processing work can be written in a generic way. More importantly, it can be generated automatically, as is shown in the follow-up article, which makes using SAX quite painless for the programmer.

Another way to read or write XML data is to use System.Xml.Serialization.XmlSerializer. This class takes an instance of a suitably designed class and serializes it to XML, or deserializes it from XML. The .NET SDK contains a tool that can generate C# classes from schema definitions. For details, refer to the MSDN documentation on xsd.exe.

The generated classes, however useful in many scenarios, suffer from two serious drawbacks. First, they don't allow for behavioural customization by inheritance. Instead, class instances are hardwired as members into the generated classes. Dynamic behaviour, e.g. consistency checks, cannot be performed during parsing. As a consequence, XML data is always represented in a tree-like structure, much as in DOM. Secondly, as there is no XML parser involved, they don't allow for on-the-fly validation the way XML parsers do.

1.2 The outset: DOM vs. SAX

To read XML content into memory, programmers often rely on DOM. DOM is a simple, easy to use and, most important, out of the box way to get data from XML content into a memory representation. Likewise, DOM implementations usually provide some means to transform such a memory representation into XML content.

However, this ease of use comes with a price tag. First of all, DOM parsers create an in-memory tree, whose nodes are instances of a well defined, finite set of classes. While sufficient in most situations, there might be cases where a tree representation is not adequate or one might want a more refined view of data-node relationship than the one DOM provides.

The second drawback of DOM is that the complete XML content is included in the DOM tree, regardless of whether it is of actual interest. For large amounts of XML content this might lead to an unacceptably large memory footprint.

An alternative to DOM, SAX or "Simple API for XML", does not suffer from these limitations. Instead of creating a tree in memory, SAX provides a forward-only iterator view of the XML content. Each time certain conditions are met, SAX implementations enable the client application to take appropriate action in response to that condition. A condition becoming true can be interpreted as the firing of an event. In fact, SAX is sometimes referred to as event-driven XML parsing.

Examples of such events are: start of document, start of XML element, end of XML element etc. Each event carries a set of data that describes the event. An event indicating an opening tag of an XML element, for example, carries the tag name and the set of attributes. Two things are important to note. First, the complete stream of events along with their accompanying data is sufficient to fully describe the XML content, i.e. at any time the XML content can be reconstructed up to the point the SAX parser has most recently read. Secondly, and even more importantly, the structure of the event stream is independent of the actual XML content. That is, the first event always indicates the start of the XML content, each event indicating the opening tag of an XML element is matched, implicitly or explicitly, by an event indicating the element's closing tag, and so on.
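To make the idea of an event stream concrete, the following minimal sketch lists the nodes that System.Xml.XmlReader reports for a piece of XML content. The class and method names (EventStreamSketch, ListEvents) are made up for this illustration and are not part of the framework. Note how an empty element such as <child/> produces no separate end event; its end is implied.

```csharp
using System.Collections.Generic;
using System.IO;
using System.Xml;

// Hypothetical helper, for illustration only.
public static class EventStreamSketch
{
    // Returns one line per node the reader reports, e.g. "Element root".
    public static List<string> ListEvents(string xml)
    {
        var events = new List<string>();
        using (XmlReader reader = XmlReader.Create(new StringReader(xml)))
        {
            while (reader.Read())
            {
                string line = reader.NodeType + " " + reader.Name;
                // An empty element such as <child/> is not followed by an
                // EndElement node; its closing tag is implicit.
                if (reader.NodeType == XmlNodeType.Element && reader.IsEmptyElement)
                    line += " (empty)";
                events.Add(line);
            }
        }
        return events;
    }
}
```

For the input <root someAttr="someValue"><child/></root> this yields three entries: the start of root, the empty child element, and the end of root.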

1.3 Observations on using SAX

As pointed out in the last subsection, the XML content can be reconstructed from the sequence of events generated by the SAX parser. Of course, one is not limited to simply reconstructing XML from this sequence. Instead, any data processing can take place as required by the client application. Not surprisingly, the basic framework of client interaction with the SAX parser has a quite simple structure. For example, consider the following chunk of XML content:

<root someAttr="someValue"><child/></root>

When encountering the closing bracket of the root start tag, the SAX parser would notify the client application. In response, the client would instantiate a class suitable for representing the data from the root element (from here on, the terms "element" and "tag" are used interchangeably) and set some representation of the someAttr attribute to the value someValue, which is provided by the SAX parser.

The parser will continue to read XML content and thus encounter the child tag next. Again the client will be notified, but this time it instantiates a class suitable for representing the child element in the context of the root element. This is an important detail - in contrast to its structure, the semantics of an XML element are determined by its enclosing element, if there is one. Consider the following XML content:

<root><child/><otherchild><child/></otherchild></root>

The second instance of the child element may well have the same structure as the first. In fact, it might even have the same interpretation of its data. But because the first one is in the context of the root element and the second in the context of the otherchild element, information collected or synthesized by their respective parent elements might differ significantly. For example, the root element (or rather its representing class) may interpret the child element as data, while otherchild may want to ignore it altogether.

An XML element has been completely parsed when its end tag is encountered. SAX parsers reflect this by generating an endelement event for the current element. On receiving this event, the client application can trigger the processing of data by the class instance representing the current element or its parent element.

1.4 A Simple Parsing Algorithm

With all this in mind, a simple algorithm for XML parsing with SAX can be laid out:

1. Set up some SAX parser implementation. Instantiate some class representing the root element of the XML content to be parsed. The root element is the first XML element in the content. Set up some stack implementation with an empty stack.

2. Start the parsing process on the XML content.

3. On a startelement event, if the stack is empty, push the instance representing the root element. Otherwise use the top element of the stack to look up the class representing the current element by the current element's qualified name, and push an instance of this class on the stack. Set attributes in the stack's top element according to the values provided by the parser.

4. On an endelement event, pop the top element from the stack and let the new top element, if there is one, process it. If the stack is empty, parsing is finished.

5. On a text event, process the text as the stack's top element sees fit.

6. Likewise for whitespace and significantwhitespace events.
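The steps above can be sketched in a few lines of C# on top of System.Xml.XmlReader. This is only a minimal illustration under simplifying assumptions, not the framework's implementation: Node, LookupElement, OnChildEnd and StackParserSketch are hypothetical names, every element is represented by the same Node class, and text and whitespace are simply appended to the current element.

```csharp
using System.Collections.Generic;
using System.IO;
using System.Text;
using System.Xml;

// Hypothetical stand-in for an element-representing class.
public class Node
{
    public string Name;
    public Dictionary<string, string> Attributes = new Dictionary<string, string>();
    public List<Node> Children = new List<Node>();
    public StringBuilder Text = new StringBuilder();

    // Looks up the representation for a child element. This sketch
    // always answers with a plain Node.
    public virtual Node LookupElement(string qualifiedName) { return new Node(); }

    // Called on the parent once the child's end tag has been seen.
    public virtual void OnChildEnd(Node child) { Children.Add(child); }
}

public static class StackParserSketch
{
    public static Node Parse(string xml, Node root)
    {
        var stack = new Stack<Node>();
        using (XmlReader reader = XmlReader.Create(new StringReader(xml)))
        {
            while (reader.Read())
            {
                switch (reader.NodeType)
                {
                    case XmlNodeType.Element:
                        // Must be read before moving on to the attributes.
                        bool isEmpty = reader.IsEmptyElement;
                        Node current = stack.Count == 0
                            ? root
                            : stack.Peek().LookupElement(reader.Name);
                        current.Name = reader.Name;
                        stack.Push(current);
                        while (reader.MoveToNextAttribute())
                            current.Attributes[reader.Name] = reader.Value;
                        // <x/> is not followed by an EndElement node.
                        if (isEmpty) EndElement(stack);
                        break;
                    case XmlNodeType.EndElement:
                        EndElement(stack);
                        break;
                    case XmlNodeType.Text:
                    case XmlNodeType.Whitespace:
                    case XmlNodeType.SignificantWhitespace:
                        if (stack.Count > 0) stack.Peek().Text.Append(reader.Value);
                        break;
                }
            }
        }
        return root;
    }

    // Pops the finished element and hands it to its parent, if any.
    private static void EndElement(Stack<Node> stack)
    {
        Node finished = stack.Pop();
        if (stack.Count > 0) stack.Peek().OnChildEnd(finished);
    }
}
```

A real implementation would return different classes from LookupElement depending on the enclosing element, which is exactly the context dependence described in the previous subsection.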

This algorithm isn't all bells and whistles - for example, entity resolving isn't covered - but it should be sufficient to show the general direction an implementation might go. One word of caution though - there are two approaches to the implementation of SAX parsers. The first one is commonly referred to as SAXReader. It uses direct callbacks into the client to indicate events, i.e. a push model. Xerces from the Apache project is an example of this approach. The second flavor is known under the name XMLReader. System.Xml.XmlReader from the .NET framework is an example. XMLReader implementations use a pull model: the client application requests each piece of XML content in turn and thereby keeps control of the parsing process. Therefore, they are slightly more efficient.

Note that the System.Xml namespace does not provide a SAXReader. There are some small differences between the respective APIs of the two flavors, which shouldn't pose much trouble, though. For the rest of this article, System.Xml.XmlReader is used.

2 Implementation Overview

This section introduces a sample implementation of a framework for interaction with SAX parsers. At first, the structure of the framework is explained, i.e. the implementing classes. After that the two main issues in SAX parser interaction - how to map XML elements to representing classes and how to store XML attributes - are discussed. Interspersed, there are some notes on using languages other than C# for implementation.

The information presented shows just an overview. For detailed method descriptions refer to the documentation contained inside the source code.

2.1 Structure of the Framework

The framework for SAX parser interaction is called SaxParserSupport. It defines two namespaces, SaxParserSupport and SaxParserSupport.Impl. While the former defines the API, the latter contains a sample implementation. Two wrapper classes for System.Xml.XmlReader are contained as well as implementing classes to represent XML elements. These implementations should be sufficient to act as a base for day-to-day use. Of course, they can be replaced by customized implementations at any time.

Fig.1

The interface classes from namespace SaxParserSupport are shown in Fig.1 as a UML Class Diagram.

Namespace SaxParserSupport

The interface class SaxParser defines the API of a wrapper class for a SAX parser instance. It exposes two methods

void parse();
void parse( System.IO.Stream src );

that start the parsing process. The first one uses the current setup of the instance, the second one the given System.IO.Stream instance to read from. The details of the setup are left to the implementation. The start or root element is determined by the property RootElement, an instance of XMLDocument. Both methods may throw exceptions related to IO issues and/or XML parsing.

XML elements are modelled by a hierarchy of three interface classes, XMLSimpleElement, XMLElement and XMLDocument respectively. The basic behaviour of XMLElements is modelled by interface XMLSimpleElement:

The characteristics of an XML element are the element name and the parent XML element. Access to the getters and setters for name and owning element is public because of C# language requirements. However, they should be treated as if declared protected.

Method onElementEnd is the callback to be called by a SaxParser instance on encountering the closing tag of an element. Implementations should provide at least an empty method. Note that this method should be treated as if declared protected.

write and writeAttributes write the XML element to the given StreamWriter. The latter one is meant to be protected, again. Implementations are expected to perform any required transcoding of characters, e.g. '&' to '&amp;', on their own.

Use write to write an instance of XMLSimpleElement to a StreamWriter.

setAttribute is called to set the value of an XML attribute. It is up to the implementation to provide a means to map XML attribute names to data structures for storing.

addElement and removeElement establish or destroy a parent-child relationship with the child XML element specified by parameter elem. Typically, they will also set the owner of elem as indicated by elem.getOwner().

Lookup of possible child element representing classes is done by method lookupElement, based on the qualified name as specified by System.Xml.XmlReader. If there is no namespace available for the XML element in question, i.e. its local name equals the qualified name, the local name prefixed with ':' is used. The SaxParser and XMLElement implementations are tightly linked together here, because the name to be matched is generated by the SaxParser implementation. Apart from that, implementations are free to use other mapping schemes.
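The naming rule for lookup keys can be illustrated with a small helper on top of System.Xml.XmlReader. LookupKeySketch, KeyFor and KeyOfRoot are hypothetical names used only for this sketch; how the sample implementation builds the key internally is not shown in this article.

```csharp
using System.IO;
using System.Xml;

// Hypothetical helper, for illustration only.
public static class LookupKeySketch
{
    // Builds the lookup key for the element the reader is positioned on.
    public static string KeyFor(XmlReader reader)
    {
        // Without a namespace alias, Name equals LocalName; in that case
        // the local name is prefixed with ':'.
        return reader.Name == reader.LocalName
            ? ":" + reader.LocalName
            : reader.Name;
    }

    // Convenience wrapper: key of the root element of the given content.
    public static string KeyOfRoot(string xml)
    {
        using (XmlReader reader = XmlReader.Create(new StringReader(xml)))
        {
            reader.MoveToContent(); // skips to the first element
            return KeyFor(reader);
        }
    }
}
```

With this rule, an element without an alias is looked up as ":element", while an element written with the alias NS keeps its qualified name, e.g. "NS:element2".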

Callback methods onSignificantWhitespace, onText and onWhitespace are to be called by the SaxParser implementation. Again, they are supposed to be used in a protected context.

In contrast to DOM parser behaviour, there is no need to create special classes for text and (significant) whitespace. If sufficient, the string instances given to the three methods can be stored in the instances representing the respective parent elements.

Interface XMLDocument, finally, models the start or root node of XML content. For the time being, this is limited to representing XML version and encoding:

When writing to a System.IO.StreamWriter, implementations should take care that the encoding of the StreamWriter instance and the one given by the getEncoding method do match.
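As a rough summary of the description above, the three element interfaces might be declared along the following lines. The method names are taken from the text where it mentions them; the exact signatures, the setters, getVersion, the inheritance order and the placement of each method in a particular interface are assumptions made for this sketch. The source code remains the authoritative reference.

```csharp
using System.IO;

// Sketch only: signatures are assumed, not copied from the framework.
public interface XMLSimpleElement
{
    string getName();
    void setName(string name);               // treat as protected
    XMLElement getOwner();
    void setOwner(XMLElement owner);         // treat as protected

    // Called when the closing tag is encountered; treat as protected.
    void onElementEnd(XMLElement owner);

    // Called once per XML attribute of the element.
    void setAttribute(string name, string value);

    void write(StreamWriter dest);
    void writeAttributes(StreamWriter dest); // treat as protected
}

public interface XMLElement : XMLSimpleElement
{
    void addElement(XMLSimpleElement elem);
    void removeElement(XMLSimpleElement elem);

    // Maps a qualified child element name to a representing instance.
    XMLSimpleElement lookupElement(string qualifiedName);

    // Callbacks for character data; treat as protected.
    void onText(string text);
    void onWhitespace(string whitespace);
    void onSignificantWhitespace(string whitespace);
}

public interface XMLDocument : XMLElement
{
    string getEncoding();
    string getVersion();                     // assumed name
}
```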

Namespace Impl

This subsection describes a sample implementation of the API described above. In Fig.1, it is contained in the package Impl which maps to a C# namespace Impl. As before, the discussion is centered on the principles of design. For implementation details, refer to the source code documentation. The contents of namespace Impl are shown in Fig.2:

Fig.2

Classes SaxParserImpl and SaxValidatingParserImpl both implement the SaxParser interface. Both use instances of System.Xml.XmlReader to do the actual parsing, with the latter one using a System.Xml.XmlValidatingReader instance set up to perform XML schema validation.
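System.Xml.XmlValidatingReader has been marked obsolete in later versions of .NET. Readers on a newer framework version can obtain a schema-validating XmlReader through XmlReaderSettings instead. The following is a minimal sketch of that alternative, not the code used by SaxValidatingParserImpl; ValidatingReaderSketch and its methods are hypothetical names.

```csharp
using System.IO;
using System.Xml;
using System.Xml.Schema;

// Hypothetical helper, for illustration only.
public static class ValidatingReaderSketch
{
    // Creates a reader that validates the content against the given schema.
    public static XmlReader Create(string xml, string xsd)
    {
        var settings = new XmlReaderSettings();
        settings.ValidationType = ValidationType.Schema;
        using (XmlReader schemaReader = XmlReader.Create(new StringReader(xsd)))
        {
            // A null target namespace means: use the one declared in the schema.
            settings.Schemas.Add(null, schemaReader);
        }
        // No ValidationEventHandler is registered, so the first validation
        // error is reported by throwing an exception.
        return XmlReader.Create(new StringReader(xml), settings);
    }

    // Returns true if the content can be read to its end without a
    // validation error.
    public static bool IsValid(string xml, string xsd)
    {
        try
        {
            using (XmlReader reader = Create(xml, xsd))
            {
                while (reader.Read()) { }
            }
            return true;
        }
        catch (XmlSchemaException)
        {
            return false;
        }
    }
}
```

Reporting errors by exception fits the contract described in section 2.2, where a parse that completes without an exception indicates successful validation.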

The interface classes for XML elements are implemented by classes XMLSimpleElementImpl, XMLElementImpl and XMLDocumentImpl, respectively. Creating a class representing a given XML element is done by derivation from one of these classes. The way these classes work might impose some limitations on how to derive such classes.

The main purpose of class XMLSimpleElementImpl is to store an element's tag name and attributes. The tag name is stored in its qualified form, that is, for

<element/><NS:element2 xmlns:NS="http://www.foo.com/bar"/>

XMLSimpleElement.getName() will yield element and NS:element2, respectively, in these two cases. Implementations may provide means to look up the namespace from the alias.

Attributes are implemented as follows: If an attribute name starts with "xmlns:", it is assumed to be a namespace alias declaration. Namespaces and their aliases are stored for internal mapping.

The attributes xsi:schemaLocation and xsi:noNamespaceSchemaLocation are ignored as they are only relevant to a validating parser. However, they are stored to be included if the instance is written to a System.IO.StreamWriter.

For all other attributes, the implementation tries to map the attribute to a C# property with the same name as the attribute. If the attribute name contains a ":", the name of the C# property has a "_" at the corresponding position(s). If the mapping fails, implementations may choose whether to ignore the attribute or throw an exception.

Attributes whose names would lead to an invalid property declaration because they are reserved words in C# are treated in a special way - they are mapped to a property name with a leading "@". That is, an attribute of name base will be mapped to a property of name @base.

On successful lookup of a C# property, an instance of the property's class is created from the attribute's value string (this string is equal to the System.Xml.XmlReader's Value property at this point). Creation is done via a constructor with one string parameter. If this constructor cannot be found, or the property is read-only or non-public, an exception is to be thrown.

Lookup is done by reflection. Languages that don't support reflection (e.g. unmanaged C++) must use some other means to perform lookup - dictionaries may be a good choice here. Reflection is also used in locating the constructor.
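The attribute handling described above can be sketched with System.Reflection as follows. This is a simplified illustration rather than the code of XMLSimpleElementImpl: AttributeMappingSketch is a hypothetical name, and StringAttribute and GrandChild are minimal stand-ins for the classes used in the examples of section 3. One detail worth noting is that the leading "@" is only a source-level escape in C# and not part of the identifier, so reflection sees the property @base under the name base.

```csharp
using System;
using System.Reflection;

// Minimal stand-in for an attribute value class with the required
// constructor taking one string.
public class StringAttribute
{
    private readonly string value;
    public StringAttribute(string value) { this.value = value; }
    public override string ToString() { return value; }
}

// Minimal stand-in for an element-representing class.
public class GrandChild
{
    public StringAttribute name { get; set; }
    public StringAttribute xsi_id { get; set; }   // attribute xsi:id
    public StringAttribute @base { get; set; }    // attribute base
}

public static class AttributeMappingSketch
{
    // "xsi:id" becomes "xsi_id"; names without ':' are left unchanged.
    public static string PropertyNameFor(string attributeName)
    {
        return attributeName.Replace(':', '_');
    }

    public static void SetAttribute(object element, string attributeName, string value)
    {
        PropertyInfo property = element.GetType().GetProperty(
            PropertyNameFor(attributeName),
            BindingFlags.Public | BindingFlags.Instance);
        if (property == null || !property.CanWrite)
            throw new InvalidOperationException(
                "No writable public property for attribute " + attributeName);

        // The property's class must offer a constructor taking one string.
        ConstructorInfo ctor =
            property.PropertyType.GetConstructor(new[] { typeof(string) });
        if (ctor == null)
            throw new InvalidOperationException(
                property.PropertyType + " lacks a constructor taking one string");

        property.SetValue(element, ctor.Invoke(new object[] { value }), null);
    }
}
```

The exception type used here is an arbitrary choice for the sketch; the sample implementation may throw a different one.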

When executing writeAttributes, the XMLSimpleElementImpl iterates through all properties, converts their values to string and writes these pairs to destination after performing necessary transcoding of attribute names and values.
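The reverse direction might look like the following sketch. It is written against System.IO.TextWriter, the base class of StreamWriter, so that it can also be tried with a StringWriter. WriteAttributesSketch and SampleElement are hypothetical names, and the simple reverse mapping of "_" to ":" is an assumption that would break for attribute names that contain an underscore themselves.

```csharp
using System.IO;
using System.Reflection;

// Minimal stand-in for an element-representing class.
public class SampleElement
{
    public string xsi_id { get; set; }
}

public static class WriteAttributesSketch
{
    // Transcodes the characters that must not appear literally in a
    // double-quoted attribute value. '&' has to be replaced first.
    public static string Escape(string s)
    {
        return s.Replace("&", "&amp;").Replace("<", "&lt;")
                .Replace(">", "&gt;").Replace("\"", "&quot;");
    }

    public static void WriteAttributes(object element, TextWriter dest)
    {
        PropertyInfo[] properties = element.GetType().GetProperties(
            BindingFlags.Public | BindingFlags.Instance);
        foreach (PropertyInfo p in properties)
        {
            if (p.GetIndexParameters().Length > 0) continue; // skip indexers
            object value = p.GetValue(element, null);
            if (value == null) continue;                     // attribute not set

            // Assumption: '_' in a property name always stands for ':'.
            string attributeName = p.Name.Replace('_', ':');
            dest.Write(" " + attributeName + "=\"" + Escape(value.ToString()) + "\"");
        }
    }
}
```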

XMLElementImpl maintains a collection of XMLSimpleElement instances. Specifically, when executing onElementEnd, an XMLSimpleElement instance may remove itself from its parent. Details are left to the implementation.

The second role of XMLElementImpl is to implement lookup for child XML elements. The sample implementation uses a simple mapping of the qualified element name to a C# class name. The mapping is done on a per-class basis, i.e. in

<root><bar/><foo><bar/></foo></root>

bar is mapped to two distinct classes for representation, although the XML elements are the same (assuming the same definition of the bar tag in both cases). If local and qualified names are the same, the (local) name must be prefixed with ":". This is caused by the implementation of SaxParserImpl.

Because the mapping is done per class, static constructors can be used to set up dictionaries with appropriate tag name/class name pairs. XMLElementImpl defines an abstract property

System.Collections.IDictionary ElementClasses
{
    get;
}

that returns a dictionary containing the name pairs. The class name in the dictionary must be the fully qualified class name including namespaces and assembly name.
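A class might fill and use such a dictionary as in the following sketch. It does not derive from XMLElementImpl, so ElementClasses is an ordinary property here, and MyRoot, ChildOne and LookupElement are simplified stand-ins. To stay self-contained, the class name is obtained from typeof(...).AssemblyQualifiedName; in practice it would typically be written out as a string that includes namespace and assembly name.

```csharp
using System;
using System.Collections;

// Minimal stand-in for a child element's representing class.
public class ChildOne { }

public class MyRoot
{
    // One dictionary per class, hence a static member.
    private static readonly Hashtable elementClasses = new Hashtable();

    static MyRoot()
    {
        // Key: qualified tag name (':' prefix because no namespace is used).
        // Value: class name including namespace and assembly name.
        elementClasses[":childone"] = typeof(ChildOne).AssemblyQualifiedName;
    }

    public IDictionary ElementClasses
    {
        get { return elementClasses; }
    }

    // Creates the representing instance for a child element by name.
    public object LookupElement(string qualifiedName)
    {
        string className = (string)ElementClasses[qualifiedName];
        if (className == null)
            throw new ArgumentException("No representing class for " + qualifiedName);

        // The assembly name is required because the type is resolved
        // from a string and possibly from inside another assembly.
        Type type = Type.GetType(className, true);
        return Activator.CreateInstance(type);
    }
}
```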

Method writeSubElements can be overridden to change the way child elements are written. For example, this is the case if mixed content is to be emitted.

Class XMLDocumentImpl implements the XMLDocument interface. Currently, this means managing information about the encoding to use for output and the XML version, which is hardwired to "1.1".

2.2 Contracts

The sample implementation assures that at any time, several invariants summarized as follows will hold:

When reading XML content from a System.IO.Stream or System.Xml.XmlReader, error conditions are indicated by throwing an exception.

While reading XML content, exceptions will be thrown because of IO error conditions or malformed XML content, and because of violations of schema definitions if an instance of SaxParserSupport.Impl.SaxValidatingParserImpl is used. Additionally, an exception is thrown if the RootElement property of the SaxParser instance is not set. Further exceptions may be thrown if C# properties cannot be looked up or if their representing classes lack required constructors.

It is the responsibility of the representing classes to ensure that properties are initialized as required.

No exception being thrown during execution of the SaxParser.parse or SaxParser.parse(System.IO.Stream) methods indicates successful reading of XML content and (possibly) schema validation.

At any time, encountered child XML elements shall be mapped to an instance of a representing class. If such a class cannot be found, an exception is thrown. The context for mapping is the class representing the enclosing element. Attributes whose names are invalid as property names because they are C# keywords will be mapped to property names with a leading "@" (see above).

At any time, encountered attributes shall be mapped against a C# property of the class representing the current XML element. If such a property cannot be found, an exception will be thrown. The property must be of a class that has a constructor taking one string instance as parameter. The attribute name might be transcoded to match a property name.

Text and whitespace are ignored by default. If they are encountered, callbacks XMLElement.onText, XMLElement.onSignificantWhitespace and XMLElement.onWhitespace, respectively, are executed, however.

When executing XMLSimpleElement.onElementEnd, the parameter given to this callback denotes the instance of the class representing the enclosing element. It is thus equal to the result of method getOwner. All attributes have been stored in their respective properties.

2.3 Limitations

As of now, the sample implementation imposes these limitations:

Currently, declarations of a default namespace, i.e. the xmlns attribute, are not supported.

The namespace alias for the http://www.w3.org/2001/XMLSchema-instance namespace is assumed to be xsi. This is important for proper recognition of some predefined attributes.

As the implementation uses C# properties to represent XML attributes, classes derived from XMLSimpleElementImpl must enforce a strict one-to-one relationship between properties and XML attributes.

Properties used for XML attributes must be of public access along with their classes. If this is not the case, NotImplementedException will be thrown when the property is to be set by the framework. Non-public properties are not considered when writing to a stream.

Classes that are used for C# properties must provide a constructor taking one string instance as parameter. Moreover, an implementation of System.Object.ToString() must be provided.

For property lookup, the property's namespace alias - if present - is used, not the actual namespace itself. This is because of difficulties in mapping a URL to a valid C# identifier, and it certainly needs some attention in the future.

Hashtables used for child element lookup must contain the complete mapping of child elements for a class. Neither should derived classes use the hashtable of their base class, nor will such base class dictionaries be considered for lookup.

Currently there is no support for xsd:any and CDATA content. For CDATA, support will be added in a future release. Whether this is possible for xsd:any remains to be seen, due to the vague semantics of this content type.

3 Examples

This section shows some examples on how to use the framework. Only the key elements are shown. For the complete example source see the project files supplied. All examples will write XML content equivalent to the original except for ignorable white space if the write method for the root element is called.

3.1 ParserSetup - setting up the parser

To use the framework, an instance of SaxParserSupport.SaxParser must be set up with a System.Xml.XmlReader or System.IO.Stream instance:

3.2 ChildElements - Elements with Children

Here rootelement can have two child XML elements, childone and childtwo, the latter one having a child grandchild itself. Therefore MyRoot uses a static constructor to map the element names to the elements' representing classes. Note that because parent-child relationships are modelled on class level, it is sufficient to use static members for storage and setup. MyRoot now looks like this:

Because no namespaces are used, the element names are prefixed with ":". They are used as keys that map to the fully qualified class names of the representing classes. As these classes are instantiated from inside another assembly (the SaxParserSupport assembly), the assembly name must be included. The hashtable is accessed via ElementClasses.

ChildOne inherits from XMLSimpleElement. The class is nested inside MyRoot. This need not always be the case, as the class for childtwo, ChildTwo, demonstrates. It is set up in a manner similar to MyRoot.

3.3 Namespaces

This example illustrates how to cope with namespace alias declarations. Consider the following XML content:

Note that namespace alias declarations are attributes that are treated in a special way.
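Because the reader reports namespace alias declarations as ordinary attributes whose names start with "xmlns:", they can be collected while iterating over an element's attributes. The following minimal sketch does so for the root element of some XML content; NamespaceAliasSketch and AliasesOfRoot are hypothetical names, and the framework's own bookkeeping may differ.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Xml;

// Hypothetical helper, for illustration only.
public static class NamespaceAliasSketch
{
    private const string Prefix = "xmlns:";

    // Returns alias -> namespace for all alias declarations on the root
    // element. Other attributes are skipped.
    public static Dictionary<string, string> AliasesOfRoot(string xml)
    {
        var aliases = new Dictionary<string, string>();
        using (XmlReader reader = XmlReader.Create(new StringReader(xml)))
        {
            reader.MoveToContent(); // positions the reader on the root element
            while (reader.MoveToNextAttribute())
            {
                if (reader.Name.StartsWith(Prefix, StringComparison.Ordinal))
                    aliases[reader.Name.Substring(Prefix.Length)] = reader.Value;
            }
        }
        return aliases;
    }
}
```

A default namespace declaration (a plain xmlns attribute) does not match the prefix test and is therefore ignored, in line with the limitation stated in section 2.3.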

3.4 Attributes - Adding attributes to elements

This example demonstrates how to make use of attributes. Let's assume the grandchild element has attributes name and xsi:id with the latter from the XMLSchema-instance namespace. Then an example XML content could look like this:

Class ChildTwo.GrandChild now has two properties name and xsi_id that correspond to the two XML attributes. Both properties are of class StringAttribute for convenience, but they could be of any class required provided that a suitable constructor for that class is available. GrandChild's implementation now looks like this:

Any C# property to be used as XML attribute representation MUST be public for two reasons: First, properties are accessed from inside the framework and, second, being public is used to distinguish properties for XML attributes from others used for class implementation. This is done with tools like UML editors in mind, which use C# properties for UML attributes. As a consequence, public properties should not be used for purposes other than XML attribute representation. Otherwise, XMLSimpleElement.writeAttributes should be implemented in an appropriate way.

3.5 Text1 - Simple Text processing

The default implementation for XMLElement simply discards any encountered text. This example shows how to use the onText method. (Significant) whitespace can be dealt with in a similar manner. Consider the following XML content:

Note that writeSubElements is not needed in this special case, anymore.
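The general pattern of such an onText override might look like the sketch below. ElementBase is a hypothetical stand-in for the framework's element base class, and the tag name is passed to write as a parameter only to keep the sketch self-contained. The text is exposed through a method rather than a public property, because public properties are reserved for XML attributes.

```csharp
using System.IO;
using System.Text;

// Hypothetical stand-in for the framework's element base class.
public class ElementBase
{
    // Default behaviour: encountered text is discarded.
    public virtual void onText(string text) { }
}

public class TextElement : ElementBase
{
    private readonly StringBuilder text = new StringBuilder();

    // May be called more than once for one element, e.g. when child
    // elements are interspersed with text.
    public override void onText(string chunk)
    {
        text.Append(chunk);
    }

    public string getText()
    {
        return text.ToString();
    }

    // Writes the element with the collected text as its content,
    // transcoding characters that must not appear literally.
    public void write(TextWriter dest, string tagName)
    {
        string escaped = getText()
            .Replace("&", "&amp;").Replace("<", "&lt;").Replace(">", "&gt;");
        dest.Write("<" + tagName + ">" + escaped + "</" + tagName + ">");
    }
}
```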

4 Future Directions

The framework relieves the programmer from a great deal of boring infrastructure to be implemented when using SAX parsers. Still, it is somewhat clumsy to use because of the limitations and invariants described above. However, its intended purpose is to be used in conjunction with a tool for synthesizing classes from XML schema definitions (XSD). Except for the issue with xsd:any content, 100% coverage of the rich spectrum of XSD should be possible. Implementation of this tool will be the topic of the second article.

The supplied implementation uses C#, but the framework was designed with other languages in mind, too. In fact, porting it to other OOP languages with reflection support, e.g. Java, should prove quite easy. Languages like unmanaged C++, that don't offer reflection will have to use other means of achieving constructor and property lookup. Cleverly designed dictionary structures may be a good choice here.

License

This article has no explicit license attached to it but may contain usage terms in the article text or the download files themselves. If in doubt please contact the author via the discussion board below.

About the Author

Still lacking a university degree in computer science, I have 20 years of experience in software development and implementation. Having expert knowledge in object-oriented programming languages like C++, Java and C# on Windows, LINUX and UNIX platforms, I participated in multiple research projects at the University of Oldenburg. During these assignments, I was entrusted with the implementation of a graphical editor for specification languages like CSP or Z and a prototypical tool for workflow data distribution and analysis. I gained experience in a wide spectrum of CS and software development topics, ranging from compiler construction through database programming to MDA. My research interests include questions of graphical user interface design and component-based systems.

I consider myself an old-school CS geek. While I admit that simple tasks do not pose much of a problem and can be done in a quick and efficient manner, it's the challenging ones that appeal to me. If you are looking for a skillful employee - be it on a permanent, contract or freelance basis - and if you can live with a lacking university degree, why not contact me?