Regular expressions for XML Schema

From HaskellWiki

1 Regular expressions for XML Schema

The W3C XML Schema specification defines a language for regular expressions. This language is used
in in the XML Schema spec when defining the data type library part. The regex-xmlschema
package contains a complete implementation of this spec. It is implemented with the technique of derivations of regular expression. Main features are
full support of Unicode including all Unicode codeblocks and character properties, purely functional interface, extensions for intersection,
set difference and
exclusive or of regular sets (regular expressions), extensions for subexpression matches, interface for matching, stream like editing and tokenizing.

2 Motivation

When developing the RelaxNG schema validator in the Haskell XML Toolbox (HXT)
there was the need for a complete regular expression matcher for the W3C XML Schema regular expression syntax.
The string representation of basic data types in XML Schema as well as in the RelaxNG standard can be defined by regular expressions.
The available Haskell libraries for processing of regular expressions where not applicable for this task, e.g. they used other r.e. grammars or
did not support Unicode.

When implementing the DTD validator and the RelaxNG validator in HXT, we used the rather old idea of derivations of regular expression.
This is a technique to make the regular expressions operational in a direct way without the clumsy construction of a finite state machine. This worked
fine for validating the content model of XML elements and the implementations were very compact and sufficiently efficient.
So we could expect it to also work fine for Unicode.

The XML Schema grammar is well designed, so the transformation of this grammar in a parsec parser was straight forward,
the interesting part was the internal data structure and it's processing. When completing the work for HXT it showed up,
that this library is generally useful, not only for XML validation. And with some rather simple extensions it became
possible to not only use this for matching strings, as required in HXT, but also for sed like stream editing and for
easy construction of lightweight scanners and tokenizers. This was the motivation for doing a stand alone version of
that regular expression library.

3 Resources

4 The idea of derivations of regular expressions

The idea of derivations of regular expression was developed by Janusz A. Brzozowski, Princeton Univ. in 1964.
Goal was to perform the test whether a string w is a word of a given regular set (given by a regular expression r)
by manipulating the (internal tree like representation of the) regular expression r.

The word test is based on two functions. The first, here called nullable checks, whether the empty word ε
is contained in the regular set associated with r. This test can easily be done and it can be done efficiently.

Given an expression r and a single char x the second function, here called delta computes the so called
derivative of r with respect to x. The derivative delta r x is again a regular expression.
The following law (in Haskell like notation) must hold for delta r x:

match r (x:xs)<=>
match (delta r x) xs

Brzozowski has shown that this derivative exists, and that it is simple to construct it.

Let's see how regular Expression can be modelled with Haskell data types and how
nullable and delta work. In the example we will work with the fixed alphabet of Haskell chars.

In this first version we use the minimal set of regular expressions:
The empty set, the set containing ε and the single char sets
form the primitive sets, Star is the repetition, Seq the concatenation
and Alt the choice operator.

nullable is defined like this, so it's easy and efficient to compute this predicate:

We see for the three simple cases that only Unit is nullable,
r* contains per definition the empty word. A sequence contains the empty word
only in case where both parts are nullable, a union is nullable when at least one operand
is nullable.

delta can be viewed as a parser that can accept a single character and delivers a new parser.
delta fails, when the parser (the regular expression) is Zero or Unit. A Sym parser
checks the input against the required character and fails (Zero) or results in Unit (the EOF parser).
A r* expression is expanded into a sequence r⋅r*, and the input character
is parsed with the simple parser r.

The most complicated rule is the rule for Seq r1 r2. There are two cases. The simple one is
that r1 is not nullable. In this case the input must be consumed by r1. In the second case
(r1 is nullable)
there are to choices: The input could be consumed by r1, but it also could be consumed by r2
when r1 only accepts the empty word.
The Alt r1 r2 rule is defined such that both r1 and r2 are run in parallel. Here is the
point where the nondeterminism is implemented.

delta can easily be expanded on strings

deltaS :: Regex ->String-> Regex
deltaS =foldl delta

Combining deltaS with nullable gives us a simple matching function

match :: Regex ->String->Bool
match re = nullable . deltaS re

These are the essential facts from theory, but is this approach practically applicable?

5 Making the theory practically applicable

The above shown code runs, but it just runs for toy examples. When checking words
with this simple version of delta space and time can grow exponentially with the length of
the input. A simple example is the stepwise derivation of a* with a-s as input

The problem here is that within delta the derivations become more complex.
This happens in two places: In the rule for Star, where r* becomes r⋅r*
and in the Seq rule, when r1 is nullable. In this case a choice is introduced.

The solution for this problem is, like in symbolic algebra systems, the introduction
of a simplification step after derivation. We easily see that Seq Unit r2 is
the same as r2. This rule is applicable in the 1. derivation. Furthermore
Seq Zero r2 is equivalent to Zero (failure remains failure). A third effective rule
is: Alt Zero r2 equals r2.

These and some more simplification rules can be added by introducing smart constructors:

With these simplification rules the resulting regular expressions remain constant in size when
successively derive an expression. Furthermore the simplification rules run in constant time.
So space as well as runtime remains proportional to the length of the input.

6 Extension: Operators for intersection, complement, set difference, exclusive or and interleave

6.1 Implementation of other operators on regular sets and regular expressions

To check, whether a word w is in the union of two regular sets represented by expressions
r1 and r2, we apply delta to both r1 and r2. The real test is then done within
nullable and there it's done by a simple logical OR operation. This observation leads to the
question, whether we could implement other binary operations on regular set. And indeed this can be done
for all binary operations. A test whether a word w is in the intersection of two regular sets
can be added just by adding the operator to the Regex data type, add a rule for delta with
precisely the same structure as for Alt and add the predicate with the corresponding logical
operation to nullable. Here's the example for intersection:

So adding intersection, difference or other set operations are done by
adding about 10 lines of code.

6.2 Syntax extensions for new operators

When extending the XML Schema regular expression syntax, there was one leading
principle: All legal regular expressions should remain correct and their semantics
should not change. In the concrete syntax the curly braces are special symbols
and they are only allowed in so called quantifiers, postfix operators that specify
a kind of repetition, e..g. a{5,7} stand for 5,6 or 7 a's.
So operators in curly braces, e.g. {&} for intersection,
is an extension that does not conflict with the standard syntax.
The following new operators are added: {:} for interleave, {&} for intersection {\} for set difference and
{^} for exclusive or, here enumerated with decreasing priority.

Examples:

.*a.*{&}.*b.* all words containing at least one a and one b

[a-z]+{\}bush all names but not bush

.*a.*{^}.*b.* all words containing at least one a or one b but not both an a and a b

aaa{:}bbb all 6 char long words containing 3 a's and 3 b's

A complement operator can be formulated by using the set difference operator.
The first attempt .*{\}bush (everything but not bush) fails. The . in XML Schema syntax
is a shortcut for every character but newline (\n) and carriage return (\r). So .|\n|\r is
the whole alphabet and (.|\n|\r)* all words over the alphabet. This makes the
complement expression a bit clumsy: (.|\n|\r)*{\}bush. There are some character escape sequences
for character sets, e..g. \s for whitespace, \i for XML name start characters, \d for
digits and others. This list has been extended by \a for .|\n|\r and \A for \a*.
With this new multiCharEsc spec the above complement expression becomes \A{\}bush.

6.3 Examples using the extended syntax

6.3.1 Substitution of none greedy operators

In Perl and other libraries there are so called none greedy repetition operators.
These are not present in the W3C XML Schema syntax. But for many real world examples these
none greedy expressions can be reformulated with the use of set difference.
A classical example is a regular expression for comments, which are delimited by character
sequences, like in C with /* and */. The naive approach /[*].*[*]/ does
not work. A word like /*abc*/123*/ is not a C comment, but it matches the above given expression.

The solution with this library is an expression like the following:

/[*](\A{\}(\A[*]/\A))[*]/

in words: The contents of a C (multi line) comment is every word, that does not contain a subsequence
of */

6.3.2 Identifiers except keywords

In most scanner specs, the regular expressions for names and keywords overlap and
the sequence of the rules in the scanner spec becomes important for solving this
ambiguity. With the set difference it becomes simple to exclude keywords from the regular expression for identifier.

A simple example:

[a-z][a-z0-9]*{\}(if|then|else|while|do)

excludes the 5 keywords from the set of identifiers.

6.3.3 Permutations

With the use of the intersection operator {&} it is rather easy to formulate
a regular expression for the permutations of a character set.

.*a.*{&}.*b.*{&}.*c.*{&}.{3}

is an expressions for all permutations of a, b and c.

The above expression for permutations can be formulated even simpler
by using the interleave operator.

The interleave operator is used when validating XML with RelaxNG Schema.
In RelaxNG the content model is described by a regular expression. But
there it is allowed to specify two different content models for a single
element and then mix theses content models together.

7 Extension: Matching of subexpressions

This library supports matching of subexpressions like in Perl, but the syntax for subexpressions

is different. Labeling subexpressions is not done implicitly by counting and numbering the pairs of parentheses, but

the parentheses of interest can be labeled with a name. This is more flexible and less
error prone when extending the regular expressions.

Here's an example for searching a date pattern in YYYY-MM-DD format in a line of text and
in case of success giving back the strings for the year, month and day:

.*({y}[0-9]{4})-({m}[0-9]{2})-({d}[0-9]{2}).*

The three pairs of parentheses are labeled y, m and d. Let's see
what function we call with this pattern and how the result looks like.
We've already seen the match function giving a back a Boolean for the match result.
This is too less information in this case. We need an extended form of match called matchSubex
with the following signature:

The matchSubex function returns a list of label-value pairs for the subexpression matches.
The label-value pairs occur in the same sequence as in the regular expression. If there is no match
or there is no labeled subexpression the result is the empty list.
Nesting of labeled subexpressions is possible. Examples:

7.2 Matching with subexpressions and nondeterminism

When writing regular expressions with labeled subexpressions there are some traps, if this
is not done carefully. When an expression is nondeterministic and contains labeled subexpressions,
all matches for these subexpressions are computed.
The number of matches can grow exponentially, so the runtime also
can grow exponentially.

Here is a simple example demonstrating this situation.

matchSubex "(({l}x+))*"$"xx"=>[("l","xx"),("l","x"),("l","x")]

The expression is (x+)* and
the x+ is labeled with l. The match can be done in two ways.
The first one is: apply the *-expression once and take xx as the matching subexpression,
the second is apply the *-expression two times and take tow times x as matching subexpressions.
This gives 3 matches for label l. The situation becomes much more complicated for longer input.

There is one extension for resolving nondeterministic results with subexpressions.
As an example, let's match a text with identifiers or keywords. This could be done as follows

In the 2. test we would like to detect the else as a keyword. This could be done
by subtracting all keywords from the subexpressions for name. but this make the expression
unnecessarily complicated. There is an operator {|} that works like set union, but for
subexpressions it's not symmetric. If the left hand side matches, only these results are taken
the results from the right hand are ignored. So the above example can be rewritten to prioritize
the keywords as follows:

9 Examples for tokenizing

The tokenize function can be used for constructing simple tokenizers.
It is recommended to use regular expressions where the empty word does not match.
Else there will appear a lot of probably useless empty tokens in the output.
All none matching chars are discarded.

There are two more tokenizer functions, tokenize' does not throw
away the none matching characters, and tokenizeSubex works with labeled
subexpressions, so the resulting words can be labeled and can further be processed
with respect to that label.

10 Performance

A simple performance test shows that it's possible to process even large
data rather efficiently. The simple test does the following: A large text file is generated,
in the example run with 2^25 characters (about 33Mb). This file is copied by reading and writing the complete file
to get a figure about the time spend in IO.
The second test reads in the file splits it up into lines with the predefined line function
and writes the lines out again. This is compared with a line function defined as
tokenize ".*". The same comparison is done with the built in words and
tokenize "\\S+".

The following runtimes in second where measured:

1.33 copy file

8.48 split into lines with lines from prelude

10.93 split into lines with regex tokenize

20.44 split into words with words from prelude

39.75 split into words with regex tokenize

So the overhead introduced by repeated computation of derivation is
not too bad. Of course this throughput can not be expected by more complex
regular expressions.