The complete example file can be found [http://stlab.adobe.com:8080/@md=d&cd=//sandbox/papers/xml_parser_example/&cdf=//sandbox/papers/xml_parser_example/xml_parser_example.cpp&c=9XT@//sandbox/papers/xml_parser_example/xml_parser_example.cpp?ac=64&rev1=2 here.]

Revision as of 23:18, 2 May 2008

Introduction

This paper is a response to an e-mail question asking if anyone had code to validate that a sequence of characters is a valid XML name which is neither a reserved XML name (one starting with ('x'|'X') ('m'|'M') ('l'|'L')) nor a qualified name (one containing ':'). See Section 2.3 of the XML Specification.

Languages are defined using a grammar. There are many notations for grammars, but most use some variant of EBNF, and XML is no exception to this rule. The variant of EBNF used for the XML notation is described in Section 6 of the XML Specification. Transforming an EBNF grammar into a lexical analyzer and/or parser is a relatively straightforward process. For simplicity, here I'm going to define the lexical analyzer as the portion of code which transforms a stream of characters into tokens, and the parser as something which reads the tokens and takes some action once it has recognized production rules. There are libraries, such as the Boost Spirit Library, as well as tools such as Lex and Yacc, which can be used to aid in writing lexical analyzers and parsers - but knowing how to write these simply and directly is an invaluable tool in any programmer's tool chest. Experience with writing such systems will also make you more productive when using the libraries and tools, and give you some insight into when you need them. Some problems (including this one) can also be addressed by writing a regular expression using a library such as the Boost Regex Library; however, the numerous ranges of values in the grammar productions for this problem are difficult to express in a regular expression.

When you look at a typical EBNF grammar it will often not be separated into lexical-analyzer production rules and parser production rules - the structure of a simple parser and a simple lexer is the same, and the separation only becomes apparent once you start writing the code. Because the structure is the same, from here on I'm going to refer to both parts simply as the parser.

A Simple Recursive Descent Parser

To solve our problem of validating a name we're going to write a simple recursive descent, or top-down, parser. The basic idea is that we will be reading from a stream of characters denoted by a pair of InputIterators - we will only need to read each character once, without backtracking (a grammar which can be read without backtracking in this fashion is known as an LL(1) grammar).

To do this we are going to translate each production rule into a function which will return a bool with the value true if it recognizes the production and false if it does not. If it does recognize the production then the stream will be advanced to the end of the production and a value will be updated with the recognized production. Otherwise, the stream and value are left unchanged. Such a function will follow this pattern:

bool is_production(I& first, I last, T& value);

Rather than passing first and last through all of our functions we're going to create a simple struct which holds them and write the is_production() as a member function:

The is_match() is a helper function for terminal productions. A terminal production is a production which is not defined in terms of another production. It is in the terminal productions where the interesting work is done. Here is the body.

If we are at the end of our stream, or the current character is not a match, then we return false. Otherwise we set our value to the matched character, increment the stream, and return true.

The remaining productions are all of a form which match one of a set of code points:

Ideographic ::= [#x4E00-#x9FA5] | #x3007 | [#x3021-#x3029]

Some of the tables, such as the one for BaseChar, are quite large, making it prohibitively expensive to write an expression like (0x4E00U <= x && x <= 0x9FA5U) || (x == 0x3007U) || .... Instead we will create a simple lookup table. When dealing with a parser that works with a plain char type we could just build a table with 256 elements containing an enumeration of each character's type. A full table for these code points would be a bit large (but we could do it - 64K elements) - so instead we'll create an array of ranges which contain the code points we want to match, and search to find whether our character is in one of the ranges. To do that we'll create a semi-open range for each character range, of the form (f0, l0], (f1, l1], ..., and then use lower bound to search the table; if we end up on an odd index (starting at 0) then the item is in a range, and on an even index it is not. This strategy means we can't represent a range which includes 0, but 0 is not a valid character in an XML production. The code for the table lookup for <Ideographic> is like so:
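The listing itself is missing from this copy; a sketch of the lookup under the scheme just described (function and table names assumed; the code point values come straight from the <Ideographic> production) might be:

```cpp
#include <algorithm>
#include <cstdint>
#include <iterator>

// Ideographic ::= [#x4E00-#x9FA5] | #x3007 | [#x3021-#x3029]
// Each closed range [f, l] is stored as the pair (f - 1), l, so the
// semi-open ranges (f0, l0], (f1, l1], ... pack into a flat sorted array.
const std::uint32_t ideographic_table[] = {
    0x3006, 0x3007, // #x3007
    0x3020, 0x3029, // [#x3021-#x3029]
    0x4DFF, 0x9FA5  // [#x4E00-#x9FA5]
};

// lower_bound lands on an odd index exactly when x falls inside a range.
bool is_ideographic_char(std::uint32_t x) {
    const std::uint32_t* p = std::lower_bound(
        std::begin(ideographic_table), std::end(ideographic_table), x);
    return (p - std::begin(ideographic_table)) % 2 != 0;
}
```

Note that because the table is sorted, the search costs O(log n) comparisons regardless of how many ranges the production contains.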

To transform from the closed ranges to the semi-open ranges we subtract one from the first item in the range. Note that this code doesn't have to be a member function and is not a template - we can put these lookup tables in a .cpp file. We'll use this function in our is_ideographic() member function:
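Again the listing is absent here; restating the table so the snippet is self-contained, the member function might be sketched as (names assumed):

```cpp
#include <algorithm>
#include <cstdint>
#include <iterator>

namespace {

// Semi-open range table built from the closed ranges in the spec by
// subtracting one from the first item of each range.
const std::uint32_t ideographic_table[] = {
    0x3006, 0x3007, 0x3020, 0x3029, 0x4DFF, 0x9FA5};

bool is_ideographic_char(std::uint32_t x) {
    const std::uint32_t* p = std::lower_bound(
        std::begin(ideographic_table), std::end(ideographic_table), x);
    return (p - std::begin(ideographic_table)) % 2 != 0;
}

} // namespace

template <typename I>
struct xml_parser {
    I first;
    I last;

    // <Ideographic> production: consume one ideographic code point,
    // following the same contract as the other productions.
    bool is_ideographic(std::uint32_t& value) {
        if (first == last || !is_ideographic_char(*first)) return false;
        value = *first;
        ++first;
        return true;
    }
};
```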

This is where we move past a lexical analyzer and into a parser; however, for our needs this is relatively inefficient. We're constructing and copying a string twice when we really don't care about the value, except that we need to ensure it doesn't start with 'xml' (case insensitive). If we needed a full parser we would write a more efficient way of managing tokens - but for this problem we don't need that. Instead we'll check for 'xml' as we proceed by unrolling the first couple of iterations of the loop. If we find 'xml' we return false. This violates our rule of advancing the stream only when we fully recognize a production - so we'll choose a different naming convention for this function to avoid accidentally confusing it with a normal production later.

Finally - to make this a little simpler to invoke, we write a wrapper function which just creates a temporary instance of the class, and we add one last check to make sure that the valid name makes up the entire range provided:

That's the complete code - besides having a solution to our problem, we also have a general framework for building solutions to other parsing problems and a pretty decent start that can be reused for other XML-related issues. Parsers written in this style are simple to extend even when packaged as a library: you can simply inherit from the struct and add additional productions. The code described here took very little time to write (the biggest hassle was building the larger lookup tables from the productions in the XML specification).