Sorry to play the spoiler. I don't want to say that the improvement is
not significant, however...
Henry S. Thompson wrote:
>Jeff Rafter <lists@jeffrafter.com> writes:
>
>
>
>>Does this mean that you have modified the approach set forth in
>>http://www.ltg.ed.ac.uk/~ht/XML_Europe_2003.html?
>>
>>
>
>Yes. That approach unfolds numeric exponents, producing a number of
>states which grows linearly with maxOccurs for elements and
>non-nesting groups, and grows exponentially for nesting groups, that
>is if you have (a, b{2,5}, c){1,100} the number of states is
>Order(500).
>
>
>
That is not exponential, but linear in the product of all the "maxOccurs".
>The new approach is linear in the number of particles, because it uses
>counters, not unfolding.
>
>
>
Strictly speaking, it was also linear before, since all numeric constants
are constant, there is no change in complexity (which is blind for
constant factor improvements).
In practice, these constants are very large (and most probably much
larger than the number of particles of the regular expression).
You don't need to argue about computational complexity here to underline
the improvement.
>>If so, do you have any metrics on changes to the time complexity?
>>
>>
>
>Should not change the _runtime_ complexity at all, which was always
>the normal FSM complexity. Changes the _compile-time_ complexity
>significantly, from exponential to linear.
>
>
>
Am I missing something here?
Actually, the best compile time complexity is quadratic (in number of
particles), no matter whether you unfold or not.
That is a direct consequence of the fact that all regexps are
1-unambiguous, so the Berry-Sethi construction [1] yields a
deterministic automaton (if they were not 1-unambiguous, you would need
to apply this construction, and make the automaton deterministic
afterwards, which is worst-case exponential time because of state-space
explosion).
The worst case size of such an outcome of the Berry-Sethi construction
for a regular expression is quadratic (where size is the number of
transitions). If you define size as number of states, it is linear, but
this is not very useful to assess time-complexity, because it is the
transitions that are actually stored somewhere, not the states.
If your compile-time complexity really was exponential, it just means
you used a construction that was not optimal. In particular, using
epsilon transitions is a waste of time (though they make things much
easier) - they give you a nondeterministic automaton, even though you
could have constructed a deterministic one right away!
Hope you find this useful.
cheers,
Burak
http://lamp.epfl.ch/~buraq
[1] for Berry-Sethi, rather then looking up their original paper (Gerard
Berry,Ravi Sethi "From regular expressions to deterministic automata"
Th.Comp.Sc.), get a more recent perspective from <shameless_plug/>
Burak Emir "Compiling Regular Patterns to Sequential Machines"
http://icwww.epfl.ch/publications/ documents/IC_TECH_REPORT_200472.pdf
or search for papers by Anne Brueggemann-Klein.