Introduction

The Expat XML Parser is a fine
and widely used event-based XML parser. One of the nicer features of
Expat is that its API can be used from C programs. Even though many
programmers use Expat in a C++ environment, the C-based API makes
it easy to export from a DLL.

However, the fact that Expat has a C-based API doesn't mean we have to live
without our C++ classes. Luckily, Expat was designed so that it can be
augmented with classes.

(Definition: Event Based XML Parser - An XML parser
which invokes methods (a.k.a. events) when XML constructs are parsed.
This differs from the DOM (Document Object Model) style parsers that parse the
XML and then present the application with XML data in its logical hierarchical
format.)

Design Rationale

The primary considerations when designing the Expat wrapper classes were
completeness, simplicity, and extensibility. For completeness, almost all
Expat API routines have been wrapped in the classes, even routines such as
XML_ExpatVersionInfo. For simplicity, the wrapper classes only wrap the
Expat API and provide no other features. For extensibility, the wrapper
classes make it easy to derive new classes that provide enhanced
functionality.

Basics

The Expat wrappers consist of two classes: a template-based class
(CExpatImpl <class _T>) and a virtual-function-based class
(CExpat). Each class has features that lend themselves to specific
solutions.

The following table illustrates the relationship between the API and the two
classes.

CExpat

CExpatImpl <class _T>

Expat C API

The template class CExpatImpl <class _T> provides the base layer of
translation between C++ and the Expat C API. The benefit of the template
design is that if the application only needs a few of the Expat event
routines, then the code for the unused event routines is not compiled into
the final executable. Admittedly, the amount of space saved is minimal, but
why waste it?
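
The compile-time dispatch described above can be sketched without Expat at all. The following is a minimal illustration of the general technique, not the wrapper's actual source: MiniImpl, StartTrampoline, FireStart, and Recorder are hypothetical names, and FireStart merely stands in for the C library invoking a registered callback.

```cpp
#include <string>
#include <vector>

// C-style callback signature. This is a simplified stand-in, not one of
// Expat's real handler typedefs.
typedef void (*StartCallback)(void *userData, const char *name);

template <class T>
class MiniImpl
{
public:
    // Static member with a C-compatible signature. It recovers the
    // derived object from the user data pointer and forwards the event.
    static void StartTrampoline(void *userData, const char *name)
    {
        static_cast<T *>(userData)->OnStartElement(name);
    }

    // Default handler, used when T does not declare its own.
    void OnStartElement(const char *) {}

    // Simulates the C library firing the registered callback.
    void FireStart(const char *name)
    {
        StartCallback callback = &MiniImpl<T>::StartTrampoline;
        callback(static_cast<T *>(this), name);
    }
};

// The derived class names itself as the template argument, so the call in
// StartTrampoline is resolved at compile time without a virtual table.
class Recorder : public MiniImpl<Recorder>
{
public:
    std::vector<std::string> seen;

    void OnStartElement(const char *name) { seen.push_back(name); }
};
```

Because member functions of a class template are only instantiated when they are used, a trampoline that is never referenced generates no code.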

The CExpat class is derived from the CExpatImpl <class _T> template
class. However, excluding the default constructor, the only methods
contained within this class are the event methods, declared as virtual
functions. CExpat is intended for situations where virtual functions are
preferable to templates.

Within reason, the two classes are interchangeable. If you have a class that
is derived from CExpat, it can easily be modified to use CExpatImpl
<class _T>, or vice versa, without having to modify any other source. See the
"Implementation Notes" for more information about some implementation
pitfalls with more complex derived classes.

For the rest of this document, only the CExpatImpl <class _T> class will
be discussed. As stated previously, the two wrapper classes are almost
100 percent interchangeable. Documenting both would be redundant.

Getting Started

The first step in using CExpatImpl <class _T> is deriving a new class that
provides the application-specific implementation. Deriving a class is
required: without one, no event routines exist, and the parser would do
nothing more than verify that the XML is well formed, just as Expat does
when no handlers are set.

As a starting point, let us define an XML parser that reports when an
element begins, when it ends, and the data contained within the element.

The CMyXML::OnPostCreate method will be invoked by CExpatImpl <class _T>
after the Expat parser has been created. This provides an easy method of
enabling event routines. The CMyXML::OnStartElement,
CMyXML::OnEndElement, and CMyXML::OnCharacterData methods will be invoked by
Expat while the XML text is being parsed. These routines will not be
invoked unless they are enabled. The code inside CMyXML::OnPostCreate
enables the three event routines.

Creating a Parser

Now that we have a derived class, we can use it to create an Expat parser.
Creating the parser is very easy. First create an instance of the parser
class, then invoke the Create method.

The Create method has two arguments: the document encoding and the character
used to separate a namespace from a name. The encoding is the default
encoding that will be used while parsing the XML document unless an encoding
is specified in the XML document itself. The namespace separator is used
to separate the namespace from the name in calls such as OnStartElement.

For example, suppose the XML document contains the name
SOAP_ENC:Envelope, the SOAP_ENC prefix is defined as
"http://schemas.xmlsoap.org/soap/envelope/", and "#" was specified to
Create. OnStartElement would then be invoked with the string
"http://schemas.xmlsoap.org/soap/envelope/#Envelope".

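An event routine that receives such a combined name often needs the two halves back. The helper below is a small sketch of one way to do that; SplitQualifiedName is a hypothetical name and is not part of Expat or the wrapper classes.

```cpp
#include <string>
#include <utility>

// Splits a name of the form "<namespace URI><separator><local name>" into
// its two parts. Names without a separator are returned with an empty
// namespace. Splitting at the LAST separator is an assumption that holds
// when the separator cannot occur in a local name, as with '#'.
std::pair<std::string, std::string>
SplitQualifiedName(const std::string &name, char separator)
{
    const std::string::size_type pos = name.rfind(separator);
    if (pos == std::string::npos)
        return std::make_pair(std::string(), name);
    return std::make_pair(name.substr(0, pos), name.substr(pos + 1));
}
```

Choosing a separator that cannot appear in an XML name, such as "#", is what makes the split unambiguous.
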
Parsing a Simple Text String

Next, we actually need to send the XML document to the parser. There are
two different methods of sending the document to the XML parser, directly or by
internal buffers. The easier of the two is sending the data
directly to the parser. However, it is also just a bit slower.

To send a simple string to the parser, the application invokes the Parse
(LPCTSTR pszBuffer, int nLength = -1, bool fIsFinal = true) method. The
first argument is a pointer to the string of data to be parsed. A routine
has been defined for both ANSI and UNICODE strings. The second parameter
is the length of the string in characters (char or wchar_t depending on ANSI
or UNICODE). If nLength is less than zero, the string pointed to by
pszBuffer must be NUL terminated, and the length is determined from the
string. If nLength is greater than or equal to zero, the string need not be
NUL terminated, and the length should not include the NUL character if one
exists. The third parameter lets the XML parser know when there is no more
data. If the whole XML document fits within one string, fIsFinal can be set
to true on the first call. Otherwise, fIsFinal should remain false while
there is more data to be parsed. Once all data has been read in, Parse can
be invoked with nLength set to zero and fIsFinal set to true.
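
The fIsFinal protocol can be sketched as a small feeding loop. This is an illustration under stated assumptions rather than code from the wrapper: FeedInChunks and FakeParser are hypothetical names, the Parse signature is the ANSI form described above, and FakeParser only records what it is given so the loop can be followed without Expat.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Feeds a document to any object that offers
// Parse(const char *, int, bool), one fixed-size chunk at a time.
template <class Parser>
bool FeedInChunks(Parser &parser, const std::string &doc, std::size_t chunkSize)
{
    if (chunkSize == 0)
        return false;

    std::size_t offset = 0;
    while (offset < doc.size())
    {
        std::size_t length = doc.size() - offset;
        if (length > chunkSize)
            length = chunkSize;

        // More data may follow, so fIsFinal stays false.
        if (!parser.Parse(doc.data() + offset, static_cast<int>(length), false))
            return false;
        offset += length;
    }

    // All data has been sent: a zero-length call with fIsFinal set to true.
    return parser.Parse(doc.data(), 0, true);
}

// Hypothetical stand-in for a parser; it only records its input.
struct FakeParser
{
    std::string received;
    std::vector<bool> finalFlags;

    bool Parse(const char *buffer, int length, bool isFinal)
    {
        received.append(buffer, static_cast<std::size_t>(length));
        finalFlags.push_back(isFinal);
        return true;
    }
};
```

Element boundaries do not need to line up with chunk boundaries; only the last call sets fIsFinal to true.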

Parsing Using Internal Buffers

Passing data directly makes the Expat parser copy it into its own internal
buffers. To avoid that extra memory copy, the application can instead read
data straight into a buffer internal to the Expat parser. Using internal
buffers takes three steps: requesting a buffer, reading data into the
buffer, and submitting the data to the parser.

As you can see, this method is more complicated than the other, but when you
modify the example in the previous section to read a file, the differences in
complexity are minimal.
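
The three steps can be sketched as a loop over an input stream. The method names GetBuffer and ParseBuffer are assumptions chosen to mirror Expat's XML_GetBuffer and XML_ParseBuffer; ParseStream and FakeBufferParser are hypothetical, and the fake parser exists only so the loop can be exercised without Expat.

```cpp
#include <cstddef>
#include <istream>
#include <sstream>
#include <string>
#include <vector>

template <class Parser>
bool ParseStream(Parser &parser, std::istream &in, int bufferSize)
{
    if (bufferSize <= 0)
        return false;

    for (;;)
    {
        // Step 1: request a buffer owned by the parser.
        void *buffer = parser.GetBuffer(bufferSize);
        if (buffer == nullptr)
            return false;

        // Step 2: read data directly into that buffer.
        in.read(static_cast<char *>(buffer), bufferSize);
        const int bytesRead = static_cast<int>(in.gcount());
        const bool isFinal = in.eof();

        // Step 3: submit the data, flagging the last block.
        if (!parser.ParseBuffer(bytesRead, isFinal))
            return false;
        if (isFinal)
            return true;
        if (in.fail())
            return false; // read error before the end of the input
    }
}

// Hypothetical stand-in that records what was submitted.
class FakeBufferParser
{
public:
    FakeBufferParser() : finalCalls(0) {}

    void *GetBuffer(int length)
    {
        m_buffer.resize(static_cast<std::size_t>(length));
        return m_buffer.empty() ? nullptr : &m_buffer[0];
    }

    bool ParseBuffer(int length, bool isFinal)
    {
        received.append(m_buffer.begin(), m_buffer.begin() + length);
        if (isFinal)
            ++finalCalls;
        return true;
    }

    std::string received;
    int finalCalls;

private:
    std::vector<char> m_buffer;
};
```

The same loop works for a std::ifstream, since it only relies on the std::istream interface.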

Working With Event Routines

Event routines provide the actual information about what has been parsed to the
application. The method names inside the CExpatImpl <class _T>
class have been selected to make it easy to know which routine applies to what
Expat event.

In Expat:

Set the event handler routine: XML_Set[Event Name]Handler
Name of the event handler: Application specific

In CExpatImpl <class _T>:

Enable the event handler routine: Enable[Event Name]Handler
Name of the event handler: On[Event Name]
Name of the internal event handler: [Event Name]Handler

So, if you wish to receive StartElement events, you define a method called
OnStartElement with the proper arguments and invoke EnableStartElementHandler
with true as the only argument. The event routine can later be
disabled by invoking EnableStartElementHandler again with false as the only
argument.

The specifics of each event routine are beyond the scope of this
document. For more information about the events and the Expat parser
itself, see http://www.xml.com/pub/a/1999/09/expat/index.html.
Almost everything described in that article has a counterpart
of the same name in CExpatImpl <class _T>.

Implementation Notes

As stated earlier, there are some pitfalls applications will have to be aware of
when creating complex derived class hierarchies. Let us consider the example
of an XML parser consisting of two classes, CMyXMLBase and CMyXML. CMyXML
is derived from CMyXMLBase and CMyXMLBase is derived from one of the Expat
class wrappers.

Consider the case where the classes are derived from the CExpatImpl
<class _T> template class.

In this case, the programmer expects the OnStartElement method of the most
derived class to be invoked by the Expat parser. However, due to the design
of the CExpatImpl <class _T> class, only the methods of the class specified
in the template argument list are invoked. This is by design.

There are three different ways to fix this problem. The first is to declare
OnStartElement as virtual in CMyXMLBase. The second is to derive CMyXMLBase
from CExpat instead of CExpatImpl <class _T>. The third requires changing
CMyXMLBase from a normal class to a template. This change provides
CExpatImpl <class _T> with the name of the class from which to locate the
event routines.
About the Author

Tim has been a professional programmer for way too long. He currently
works at a company he co-founded that specializes in data acquisition
software for industrial automation.

License

This article has no explicit license attached to it but may contain usage terms in the article text or the download files themselves. If in doubt please contact the author via the discussion board below.

In the above example the field reg.1.displayName has the value "JReddy"(the value is stored in double quotes).

In the above case, what does the value look like after being parsed by the Expat parser? I have observed that the field after parsing appears like
";JReddy";

Could somebody please kindly let me know why the expat parser parses in that manner? Is it correct? Why does it put semicolon after the double quotes? Please forgive me if the question is very basic but do reply to the query.

Expat impl works fine in Visual Studio environments. With some minor changes it will be portable and can be used also with gcc in Unix/Linux environments:
Line 442 : XML_Expat_Version v = XML_ExpatVersionInfo (); // upper case E and V
Line 587 : static void XMLCALL StartElementHandler (void *pUserData,
until
Line 716 : replace __cdecl by XMLCALL
Line 742 : Add a virtual destructor virtual ~CExpat (){} // avoids warning from gcc
Line 860 : Add a new line at the end of the file

Hello,
I tried to get ExpatImpl working in a library project. Since I received lots of linker errors, I took the example from the CodeProject site and compiled it with MS VC++ .NET. Before that, I built a static version of the Expat library using the MS project files shipped with the Expat files taken from SourceForge.
I called the compiler with the following command line:
cl /MT /Iexpat-1.95.8\lib expattest.cpp /link libexpatMT.lib

But I still get linker errors (see below). What is wrong? Is it a problem with the calling convention? Threading? C vs. C++? Where does this __imp__ prefix come from?
Any help would be very much appreciated.
Regards
Thomas

I know you've addressed the licence issue in the message board already, but I just want to be sure it's safe before I do anything.

I would like to use ExpatImpl in a public-domain speech recognition project, but if I do I'd like to make an installation package for it and make it available on our website alongside our primary installation package. See this URL for information about our group:

We would use Expat and the ExpatImpl wrapper to parse XML format files,
W3C Speech Recognition Grammars specifically, and miscellaneous other XML format items in the future. We could make a configurable installation package for ExpatImpl and make it available for download so that users could install our speech recognition suite and configure ExpatImpl for use by it.

We would greatly appreciate your contribution if you allow us to use ExpatImpl for this purpose.

Dear Sir,
Am very pleased with your Expat C++ wrapper. However, I was wondering how to go about extracting the value of a specific element in an XML string. For example if I have XML such as:
{zt}
{jobqueuepath}\ZT\JobQueue{/jobqueuepath}
{jobreturnpath}\ZT\JobReturn{/jobreturnpath}
{preprocessdelayms}250{/preprocessdelayms}
{/zt}
(NOTE: I've replaced the less-than and greater-than characters with curly brackets to avoid any html formatting problems in this thread)
So if, for example, I want to get the value of the /zt/preprocessdelay element, how do I go about it? Ideally, I would like a function such as:
LPSTR GetValue(LPCSTR sPath)
Any help would be greatly appreciated. Thank you.

If you make the methods pure virtual then all derived classes must implement these methods in some form or another. This would basically mean that every derived class must do all the work thus making the class mostly pointless.

As far as why templates. With templates the amount of generated code "can" be much smaller and the end result faster. For example, if you have a base class implementation of a virtual routine and the derived class implementation, both would end up in the final executable. Using the template method, only the ones that are actually called end up in the image.

I don't know if "I'm going to patent thought [sic]." makes any sense, but if you meant you were going to patent using C++ in this way, consider that the Microsoft ATL/WTL libraries (discussed on this site) use this construct heavily (the latter is now on sf.net and the former comes with Visual C++ or can be found in the Platform SDK). So I hope you applied for that patent about 6 or more years ago...

As I am using an old C++ compiler, I had to use Roguewave's RWBoolean instead of bool and the usual cast operator (brackets) instead of static_cast. But when I run the example, it does not print the elements even though the Parse() method returns true.

Can anybody help?
Below is the code copied from this website... where there are lines to print the elements.

I have derived a class from CExpatImpl,
and I need to store a pointer in a Ulong
(unsigned long) class object as a data
member of my derived class.

When I try to do this, I can compile
my program, but during linking I encounter
many "undefined reference to: (Ulong related
methods)", when the methods are indeed defined.

The problem ONLY occurs when trying to instantiate
a Ulong object within the CExpatImpl derived
class. Not when instantiating the Ulong object
somewhere else in the same file and scope as where
a CExpatImpl derived object exists, and not when all
the header files are included.

Do you have any idea why I can use Ulong and
CExpatImpl in the same file, but can't use
Ulong inside of CExpatImpl? Or why trying to do
so makes the linker unable to find select bits of
code in the other libraries I'm linking with?

The reason I had asked about passing the parser object to a function is this:

I need to read XML documents line by line and feed them to the parser. This means Expat needs to be informed that each parse (until the last) is not the last parse of the document, so that when XML_Parse encounters end tags whose corresponding start tags were parsed in a previous call, the parse does not fail.

I know XML_Parse must be passed a boolean flag to indicate whether each call is the final call, and I made some changes to allow me to pass said flag, but despite that, all calls to Parse after the first one fail. Basically I had modified your ParseSomeXML example by putting the call to Parse in a loop and trying to call it multiple times. Why is it that only one call to the Parse method may be made?

I need to find a way to parse something like

<superitem> <item>1</item> </superitem>

one line at a time; it thinks end tag </superitem> has no matching start tag, and the call to Parse returns false.

Or, perhaps you have another way of parsing documents in chunks? If so would you post a small example?

I'm just having trouble parsing anything less than a complete XML structure (from start tag to end tag), which means I have to load the entire structure in memory to Parse it, which eliminates some of the memory efficiency Expat is meant to provide.

Passing an XMLParser object to a function and then
calling Parse inside that function causes a segmentation
fault.

Passing the XMLParser object by reference prevents this fault, but the parse
fails instead.

Why must the parser be declared in the same scope which calls Parse?
Pointers to this object work correctly within the same scope. Passing a
pointer to the XMLParser object to a function and dereferencing that pointer
to call Parse causes the parse to fail.

You can't make a copy of the object due to having to invoke XML_SetUserData to point EXPAT to the class instance. This could have been avoided by moving that call to the parse routine. As is, I should have declared the class with private copy constructors to prevent it from being passed by value.
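
The idea of disabling copies can be sketched as follows. This is an illustration only: SelfRegistering and UseParser are hypothetical names, the stored pointer stands in for the address handed to XML_SetUserData, and the C++11 "= delete" form is used here in place of the older idiom of declaring the copy operations private without defining them.

```cpp
#include <type_traits>

// Hypothetical stand-in for a wrapper that registers its own address with
// a C library.
class SelfRegistering
{
public:
    SelfRegistering() : m_registered(this) {}

    // Copying would duplicate m_registered and leave the copy pointing at
    // the original object, so both copy operations are disabled.
    SelfRegistering(const SelfRegistering &) = delete;
    SelfRegistering &operator=(const SelfRegistering &) = delete;

    // True while the registered address still refers to this object.
    bool IsRegistrationValid() const { return m_registered == this; }

private:
    const void *m_registered;
};

// Passing the object by value is now rejected at compile time.
static_assert(!std::is_copy_constructible<SelfRegistering>::value,
              "SelfRegistering must not be copyable");

// Functions take the object by reference (or pointer) instead.
bool UseParser(SelfRegistering &parser)
{
    return parser.IsRegistrationValid();
}
```

With the copy operations disabled, an accidental pass-by-value becomes a compile error instead of a crash at run time.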

As to why the parse is failing after you pass the instance by pointer, I have no idea. All I can say is check your code. The description of the problem is vague.

I cannot successfully compile anything
that includes ExpatImpl.h. I receive the
following errors:
ExpatImpl.h:588: variable or field `__cdecl' declared void
ExpatImpl.h:588: parse error before `(' token
ExpatImpl.h:592: syntax error before `->' token
... and so on.

I am compiling using gcc 3.2.1 on a Solaris
machine. Do you know why the first "error"
occurs?

__cdecl is a Microsoft'ism that is used to tell the compiler that the routine uses a 'C' calling convention. Depending on your compiler, "static" might be all that is needed if the C++ calling convention and C calling conventions are the same. In your case specifically (gcc 3.2.1 on solaris), I don't know the answer to that question.
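
One common way to keep such a keyword out of portable code is to hide it behind a macro. The sketch below is an assumption-laden illustration: WRAPPER_CALLBACK, StartCallback, and Counter are hypothetical names. Expat's own headers define XMLCALL for the same purpose, which is what the porting notes earlier in this thread substitute for __cdecl.

```cpp
// Hypothetical macro: expands to __cdecl on Microsoft compilers and to
// nothing elsewhere.
#if defined(_MSC_VER)
#define WRAPPER_CALLBACK __cdecl
#else
#define WRAPPER_CALLBACK
#endif

// Simplified callback type; not one of Expat's real handler typedefs.
typedef void (WRAPPER_CALLBACK *StartCallback)(void *userData, const char *name);

class Counter
{
public:
    Counter() : starts(0) {}

    int starts;

    // Static member usable as a C-style callback on either compiler.
    static void WRAPPER_CALLBACK StartElementHandler(void *userData, const char *)
    {
        ++static_cast<Counter *>(userData)->starts;
    }
};
```

On compilers without __cdecl the macro expands to nothing and the default calling convention is used.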

I believe that there are a couple of bugs in some (probably rarely used) handlers.

1) The ExternalEntityRefHandler, unlike most other handlers, doesn't get the userData as its first argument; it should take the Expat parser itself. Unfortunately, there is not an elegant fix, as no arbitrary data is passed to this handler at all.

2) The UnknownEncodingHandler also doesn't get the userData as its first argument. However, the call to XML_SetUnknownEncodingHandler takes an additional argument, the encodingHandlerData, which could be used to pass in the wrapper class "this" pointer in place of the userData.

I've been using this implementation of Expat and it's excellent. However, I've come to a problem. I've got new lines in some of my XML elements and for some reason these new lines don't seem to get retrieved. It's as if the XML parser (or me) ignores them. The code I use to combine the string is this: