Introduction

The YARD parser is a generic recursive descent parser for C++. It can be used to parse any pattern that can be expressed as a context-free grammar (CFG). Unlike my previous article, A Regular Expression Tokenizer using the YARD Parser, this article uses version 0.2 of the YARD parser. There are a few minor syntactic changes, as well as more features, such as support for user-defined semantic actions.

My goal wasn't to write a complete XML parser, but rather to provide a practical demo of the YARD parser doing a real-world task which could be useful in some circumstances. If a programmer wants a more complete version of the XML parser, they are of course free to write one and are encouraged (but not obliged) to share their modifications. This source code is entirely public domain.

Note: This code only works on Visual C++ 7.1 or better.

Context Free Grammars (CFG)

A context-free grammar (CFG) is a way of expressing a pattern, like a regular expression (theoretical regular expressions, not the Perl kind). In fact, for every regular expression there is an equivalent CFG. A CFG is typically expressed in some kind of normal form, such as an EBNF, which expresses a CFG as a set of grammar productions. A CFG lends itself to the writing of a tool known as a recursive descent parser (R-D parser). An example of an annotated CFG for XML can be found here.

Grammar Productions

A grammar is typically described as a series of grammar productions. These are the basic types of productions used when describing a CFG:

C ::== A - Renaming

C ::== AB - Concatenation

C ::== A | B - Union

C ::== A * - Kleene star

C ::== null - Empty string match

The notation used is a semi-formal syntax known as a BNF (Backus Naur Form). Even though these rules are sufficient for describing a CFG, more operations are often desirable for convenience's sake, such as:

C ::== A k - The concatenation of A k times

C ::== A ? - Equivalent to C ::== A | null

C ::== A + - Equivalent to C ::== A A*

A BNF that is extended with these operations (and others) is known as an EBNF (Extended Backus Naur Form).
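To make the correspondence between productions and an R-D parser concrete, here is a minimal sketch in which each production type becomes a small parameterized type. This is not YARD's own code: the names Char, Null, Seq, Or, Star, Opt, Plus, and MatchesAll are hypothetical, and the Accept() signature is an assumption chosen for illustration. The "A k" production is left out to keep the sketch short.

```cpp
#include <cassert>
#include <string>

typedef std::string::const_iterator Iter;

// Contract assumed throughout this sketch: Accept() advances pos past the
// match on success and leaves pos unchanged on failure.

// Terminal: matches the single character C.
template <char C>
struct Char {
    static bool Accept(Iter& pos, Iter end) {
        if (pos == end || *pos != C) return false;
        ++pos;
        return true;
    }
};

// C ::== null (consumes nothing, so it always succeeds)
struct Null {
    static bool Accept(Iter&, Iter) { return true; }
};

// C ::== A B (concatenation)
template <typename A, typename B>
struct Seq {
    static bool Accept(Iter& pos, Iter end) {
        Iter start = pos;
        if (A::Accept(pos, end) && B::Accept(pos, end)) return true;
        pos = start;  // undo a partial match
        return false;
    }
};

// C ::== A | B (union, tried in order)
template <typename A, typename B>
struct Or {
    static bool Accept(Iter& pos, Iter end) {
        return A::Accept(pos, end) || B::Accept(pos, end);
    }
};

// C ::== A * (Kleene star: zero or more, so it always succeeds)
template <typename A>
struct Star {
    static bool Accept(Iter& pos, Iter end) {
        while (pos != end) {
            Iter before = pos;
            // Stop when A fails, or when A matched without consuming input.
            if (!A::Accept(pos, end) || pos == before) break;
        }
        return true;
    }
};

// C ::== A ?  is the same as  C ::== A | null
template <typename A>
struct Opt : Or<A, Null> {};

// C ::== A +  is the same as  C ::== A A*
template <typename A>
struct Plus : Seq<A, Star<A> > {};

// Example grammar: Rule ::== 'a'+ 'b'?
typedef Seq<Plus<Char<'a'> >, Opt<Char<'b'> > > Rule;

// True when the whole string matches rule R.
template <typename R>
bool MatchesAll(const std::string& s) {
    Iter pos = s.begin();
    return R::Accept(pos, s.end()) && pos == s.end();
}

void example() {
    assert(MatchesAll<Rule>("aab"));
    assert(!MatchesAll<Rule>("b"));
}
```

Note that Opt and Plus need no matching code of their own; they are defined purely in terms of the basic productions, just as in the EBNF equivalences above.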

The Parser

The YARD parser works by taking a starting grammar production (called a rule in YARD) and an input data sequence passed as a pair of iterators. The Parser function returns true or false depending on whether the input data matches the grammar.
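The shape of such an entry point can be sketched as follows. This is only an illustration under stated assumptions: Parse, Digit, and OneOrMore are hypothetical names rather than YARD's actual interface, and I assume here that a match must consume the whole input.

```cpp
#include <cassert>
#include <cctype>
#include <string>

// Hypothetical rule: a single decimal digit.
struct Digit {
    template <typename Iter>
    static bool Accept(Iter& pos, Iter end) {
        if (pos == end || !std::isdigit(static_cast<unsigned char>(*pos)))
            return false;
        ++pos;
        return true;
    }
};

// Hypothetical rule: Rule repeated one or more times.
// Assumes Rule always consumes input when it succeeds.
template <typename Rule>
struct OneOrMore {
    template <typename Iter>
    static bool Accept(Iter& pos, Iter end) {
        if (!Rule::Accept(pos, end)) return false;
        while (Rule::Accept(pos, end)) {}
        return true;
    }
};

// The starting rule is a template argument; the input is an iterator pair.
// Returns true only if StartRule matches the entire range (an assumption).
template <typename StartRule, typename Iter>
bool Parse(Iter begin, Iter end) {
    Iter pos = begin;
    return StartRule::Accept(pos, end) && pos == end;
}

void example() {
    std::string text = "2024";
    assert((Parse<OneOrMore<Digit> >(text.begin(), text.end())));

    // Any iterator pair works, including plain pointers.
    const char raw[] = {'4', 'x'};
    assert(!(Parse<OneOrMore<Digit> >(raw, raw + 2)));
}
```

Because the grammar is a type and the input is a generic iterator range, the same rule can be applied to strings, arrays, or other containers without change.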

This is how most parsers work, and of course, this in itself isn't much use to anybody. Like most other parsers, the YARD parser allows the definition of semantic actions.

Semantic Actions

A semantic action is like an event or call-back that is triggered at a specific time during the parsing process. Most parsers require semantic actions to be embedded directly in the grammar itself; this is not the case with YARD. YARD is nonetheless very flexible, and nothing stops the industrious programmer from writing their own rules which have embedded semantic actions.

A semantic action in YARD is defined by creating a template specialization of the following type:

re_until<Rule_T> - Matches everything up to and including Rule_T, fails if it reaches the end of input.

These are called meta-functions, but really they are simply parameterized types. We call this meta-programming because the parsing algorithm is expressed by composing types at compile-time, in the same way that functional programming composes functions.
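As an illustration of a meta-function being nothing more than a parameterized type, here is a minimal sketch of a rule with the behaviour described for re_until. It is not YARD's implementation: Until, Char, Seq, CommentTail, and ConsumedBy are hypothetical names, and the Accept() signature is an assumption.

```cpp
#include <cassert>
#include <string>

typedef std::string::const_iterator Iter;

// Matches the single character C; leaves pos unchanged on failure.
template <char C>
struct Char {
    static bool Accept(Iter& pos, Iter end) {
        if (pos == end || *pos != C) return false;
        ++pos;
        return true;
    }
};

// Matches A followed by B; rewinds pos if either part fails.
template <typename A, typename B>
struct Seq {
    static bool Accept(Iter& pos, Iter end) {
        Iter start = pos;
        if (A::Accept(pos, end) && B::Accept(pos, end)) return true;
        pos = start;
        return false;
    }
};

// Matches everything up to and including Rule_T.
// Fails (and rewinds) if the end of input is reached first.
template <typename Rule_T>
struct Until {
    static bool Accept(Iter& pos, Iter end) {
        Iter start = pos;
        while (pos != end) {
            if (Rule_T::Accept(pos, end)) return true;  // Rule_T consumed too
            ++pos;  // skip one character and try again
        }
        pos = start;
        return false;
    }
};

// "Everything up to and including -->", written purely as a type.
typedef Until<Seq<Char<'-'>, Seq<Char<'-'>, Char<'>'> > > > CommentTail;

// Returns the number of characters consumed, or -1 if there is no match.
int ConsumedBy(const std::string& s) {
    Iter pos = s.begin();
    if (!CommentTail::Accept(pos, s.end())) return -1;
    return static_cast<int>(pos - s.begin());
}

void example() {
    assert(ConsumedBy("hello --> rest") == 9);
    assert(ConsumedBy("never closed") == -1);
}
```

The typedef is the whole "program": no parsing code is written for CommentTail, it is assembled by passing types to other types.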

Under The Hood: The Pattern Matching Algorithms

The YARD engine uses a brute force trial and error matching algorithm. There is a trade-off of some speed for ease of use and simplicity. It was my goal to design the simplest possible generic R-D parser. The YARD parser is nonetheless sufficiently fast for most purposes.
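The cost of trial and error can be seen in a small sketch that counts character tests. This is not YARD's engine; Char, Seq, Or, AbcOrAbd, AttemptsFor, and the g_attempts counter are all hypothetical and exist only to make the re-scanning visible.

```cpp
#include <cassert>
#include <string>

typedef std::string::const_iterator Iter;

static int g_attempts = 0;  // number of single-character tests performed

// Matches the single character C and counts every attempt.
template <char C>
struct Char {
    static bool Accept(Iter& pos, Iter end) {
        ++g_attempts;
        if (pos == end || *pos != C) return false;
        ++pos;
        return true;
    }
};

// Matches A followed by B; rewinds pos if either part fails.
template <typename A, typename B>
struct Seq {
    static bool Accept(Iter& pos, Iter end) {
        Iter start = pos;
        if (A::Accept(pos, end) && B::Accept(pos, end)) return true;
        pos = start;  // backtrack
        return false;
    }
};

// Tries A first; if A fails, tries B from the same position.
template <typename A, typename B>
struct Or {
    static bool Accept(Iter& pos, Iter end) {
        return A::Accept(pos, end) || B::Accept(pos, end);
    }
};

// Rule ::== "abc" | "abd"  (both alternatives share the prefix "ab")
typedef Or<Seq<Char<'a'>, Seq<Char<'b'>, Char<'c'> > >,
           Seq<Char<'a'>, Seq<Char<'b'>, Char<'d'> > > > AbcOrAbd;

// Runs the rule on s and returns how many character tests were needed.
int AttemptsFor(const std::string& s, bool& matched) {
    g_attempts = 0;
    Iter pos = s.begin();
    matched = AbcOrAbd::Accept(pos, s.end()) && pos == s.end();
    return g_attempts;
}

void example() {
    bool matched = false;
    assert(AttemptsFor("abc", matched) == 3 && matched);
    // The first alternative fails on 'd', so "ab" is scanned a second time.
    assert(AttemptsFor("abd", matched) == 6 && matched);
}
```

A three-character input can therefore take six character tests. A grammar writer can avoid this by factoring out the common prefix, which is the kind of responsibility a simple brute force engine leaves to the grammar.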

The XML Grammar

The XML grammar that I use the YARD parser to read was lifted directly from here. Since this is more of a demo than an industrial-strength XML parser, I have cut several corners (that is, left out features and relaxed certain constraints). At the same time, I have been truer to the grammar than many other so-called open-source "XML parsers". The naming of the productions is taken from the official XML grammar.

The grammar productions (YARD rules) are contained in the file xml_grammar.hpp. The starting production is document which is at the bottom of the file. YARD grammars have to be read starting from the bottom of the file. This has to do with C++ compilation rules. Another artifact of compilation order dependencies in C++ is that cyclical type references have to be broken using functions. You will notice four functions at the bottom of the file: AcceptElement(), AcceptComment(), AcceptCDSect(), AcceptPI() which are required because they represent recursive grammar productions.
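Why a function is needed to break the cycle can be shown with a much smaller recursive grammar than XML. The following is a minimal sketch, not the contents of xml_grammar.hpp: the grammar (nested parentheses), the ElementRef wrapper, and the signature given to AcceptElement() are assumptions made for illustration.

```cpp
#include <cassert>
#include <string>

typedef std::string::const_iterator Iter;

// Matches the single character C; leaves pos unchanged on failure.
template <char C>
struct Char {
    static bool Accept(Iter& pos, Iter end) {
        if (pos == end || *pos != C) return false;
        ++pos;
        return true;
    }
};

// Matches A followed by B; rewinds pos if either part fails.
template <typename A, typename B>
struct Seq {
    static bool Accept(Iter& pos, Iter end) {
        Iter start = pos;
        if (A::Accept(pos, end) && B::Accept(pos, end)) return true;
        pos = start;
        return false;
    }
};

// Matches A zero or more times.
template <typename A>
struct Star {
    static bool Accept(Iter& pos, Iter end) {
        while (pos != end) {
            Iter before = pos;
            if (!A::Accept(pos, end) || pos == before) break;
        }
        return true;
    }
};

// Element ::== '(' Element* ')'
// A typedef cannot mention itself, so the recursive reference goes through
// a function that is declared first and defined after the typedef.
bool AcceptElement(Iter& pos, Iter end);

struct ElementRef {
    static bool Accept(Iter& pos, Iter end) { return AcceptElement(pos, end); }
};

typedef Seq<Char<'('>, Seq<Star<ElementRef>, Char<')'> > > Element;

bool AcceptElement(Iter& pos, Iter end) { return Element::Accept(pos, end); }

// True when the whole string is one balanced Element.
bool IsElement(const std::string& s) {
    Iter pos = s.begin();
    return AcceptElement(pos, s.end()) && pos == s.end();
}

void example() {
    assert(IsElement("(()())"));
    assert(!IsElement("(()"));
}
```

This also shows why the file reads from the bottom up: every type must be complete before another type can name it, so the starting production necessarily comes last.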

There appears to be something mysterious going on, because the parser does not appear to do anything. In fact, the parser automatically calls the semantic actions which are defined in the file xml_test.hpp. Semantic actions can be defined anywhere, and are automatically associated with the parser, because they are defined as template specializations. There are two semantic actions defined, which upon a successful match of the pattern STag or ETag, output the text of the match to the standard output stream. Here is the STag matching semantic action defined:

Note: Because of the way the R-D parser works, yard::Actor::OnSuccess(...) will be triggered immediately upon a successful match, even if a parent production ultimately fails.
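The mechanism, and the consequence described in the note, can be sketched as follows. This is a simplified illustration rather than YARD's code: the Actor template here takes assumed OnSuccess(begin, end) parameters, and WithAction, Name, AcceptStatement, ParseStatement, and g_log are hypothetical. The action records its text in a log instead of writing to standard output so that the effect is easy to check.

```cpp
#include <cassert>
#include <cctype>
#include <string>
#include <vector>

typedef std::string::const_iterator Iter;

static std::vector<std::string> g_log;  // text seen by fired actions

// Primary template: by default a rule has no semantic action.
template <typename Rule>
struct Actor {
    static void OnSuccess(Iter, Iter) {}
};

// Runs Rule and fires its action as soon as Rule itself succeeds.
template <typename Rule>
struct WithAction {
    static bool Accept(Iter& pos, Iter end) {
        Iter start = pos;
        if (!Rule::Accept(pos, end)) return false;
        Actor<Rule>::OnSuccess(start, pos);
        return true;
    }
};

// Name ::== one or more lowercase letters
struct Name {
    static bool Accept(Iter& pos, Iter end) {
        Iter start = pos;
        while (pos != end && std::islower(static_cast<unsigned char>(*pos)))
            ++pos;
        return pos != start;
    }
};

// The action is attached by specialization, outside the grammar itself.
template <>
struct Actor<Name> {
    static void OnSuccess(Iter begin, Iter end) {
        g_log.push_back(std::string(begin, end));
    }
};

// Statement ::== Name ';'
bool AcceptStatement(Iter& pos, Iter end) {
    Iter start = pos;
    if (WithAction<Name>::Accept(pos, end) && pos != end && *pos == ';') {
        ++pos;
        return true;
    }
    pos = start;
    return false;
}

bool ParseStatement(const std::string& s) {
    Iter pos = s.begin();
    return AcceptStatement(pos, s.end()) && pos == s.end();
}

void example() {
    g_log.clear();
    assert(ParseStatement("abc;"));
    assert(!ParseStatement("xyz"));  // the parent production fails...
    assert(g_log.size() == 2 && g_log[1] == "xyz");  // ...but the action fired
}
```

In the second call the Name action has already run by the time the missing ';' is discovered. An action with side effects that must only happen for a complete match would need to buffer its work and commit it once the parent production succeeds.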

About the Project

The source project includes all of the work on the YARD parser up to the current moment, and runs three tests: it tests the parser along with the string tokenizer and the scanner which are part of YARD but are not discussed in this article.

Summary

I have only just touched on the potential of the YARD parser, and R-D parsers, in general. This also isn't even a complete XML parser, but hopefully, it will provide the motivated reader with enough information to go ahead and implement a more functional and useful XML parser.

License

This article has no explicit license attached to it but may contain usage terms in the article text or the download files themselves. If in doubt please contact the author via the discussion board below.

Comments and Discussions

It is really good! The words I can find to describe both CFG and yard are: pure, sophisticated, clean and beautiful.
In my opinion, this is a simpler and more efficient method of R-D grammar analysis than PCRE. It is clear, easy to understand, more essential, and faster!
The largest difference, I think, is the greedy feature. re_star and re_plus "eat" as much as they can, so the speed is improved because we don't need to think about where to stop. Deciding where to stop becomes the grammar writer's task, and a well defined grammar should not have two points to stop in a syntax unit.
I will use yard in my future designs. THANK YOU!

Hi,
I am trying to develop a generalized algorithm for code instrumentation. This should be driven by a rule-based mechanism, meaning that it should instrument/inject the code based on the rules. Rules can be added, modified, or removed in the future.

hi,
can you give me some advice/links on where to look if I want to learn about (markup) parsers, grammars etc.? I don't have any experience or theoretical background (in this area).
Right now I am doing some kind of HTML parser, but I want to make it more general (possibly all SGML-related things?)... And I am interested in parsing and language processing in general.

Thx a lot!

ah, and congrats to what seems to be good work! (anyway I don't have time to study ur article now )

For theory I use "Introduction to the Theory of Computation" by Michael Sipser. It is a pretty hard-core theoretical book though. If you want to do an HTML parser, you could study my code as it can be adapted relatively easily to do what you want. You may also want to look at the documentation for Boost::Spirit, Antlr, YACC, Flex, and Sable. Documentation for these tools might have some good starting points for learning about parsing. Good luck.

Well, as the W3C can always change/update the XML standards, will applications developed with the .Net XML related classes be affected? Will there be any problem running these applications later on? If yes, how can we keep our applications up to date?

XML standards will almost definitely always be backwards compatible. The only challenge is to make sure that the grammar used for parsing is in fact correct. In my code I cannot assure you of this, because I only spent a couple of days on it. If you are writing a commercial application you should run the test suites provided by W3C.

All operations have the same precedence and are evaluated left to right. Associativity does not apply. What I call "operators" are simply meta-functions, which behave like ordinary functions except they are evaluated at compile-time. I hope this helps. If you need further assistance, perhaps you could provide me with an example. Thanks for the interest in the article.