Andy Heninger wrote:
> On 10/4/07, Mike <mike-list@pobox.com> wrote:
> > Should the set [^xyz\q{ch}] match the 'c' in "ch" ?
>
> I don't think so; since the \q{ch} matches "ch", the negated set does
> not match at the first position.
>
> The choices you have made seem reasonable to me.
>
> But what would implementations with a DFA (non-backtracking)
> implementation do? It would be very difficult for them to
> not take a shorter string > from a set if that led to an overall
> longer match. Would it be OK - still useful- if the UTS left
> what happens unspecified?

You seem to assume some difference between a DFA and NFA implementation, but
this classification (Deterministic or Non-deterministic Finite-state
Automatas) belongs more to some classes of languages that are defined with a
single-variable state for determining its behaviour (i.e. determining the
transition within a set of multiple candidate transitions or redictions).

Although it is possible to parse some regexps using such single-variable
state, the class of languages that you can parse with such automata is much
more restricted than what Regexp's can recognize: these languages are LR or
LR(n) languages (or its minor variant LALR that requires just one symbol of
look-ahead to determine the transition to apply in a deterministic way
without using backtracking). But for Regexps in the general case, you can't
make such simplification unless you fix a maximum bound to the value of 'n'
in the LR(n) class.

And the highest the value of 'n', the highest the number of states you need
in the transition graph. For Regexps, this rapidly explodes in a
combinatorial way, so that does not work.

To lift the restriction in the value of 'n', you have to abandon the
restriction of single state variables. Instead, it becomes rapidly necessary
to allow multiple states to be active *at the same time* within the
transition graph, and text and perform the tests of *all* candidate
transitions from the same state and at the same time. This way you get the
expressiveness of LR(infinite) languages, which is exactly what you need to
support general regexps.

And if you do that, your automata is fully deterministic, uses a single (but
compound) state represented by the individual state of activity of the
transition graph, whose size *does not* grow in a combinatorial way (in fact
with such implementation, if the source string representing the regexp has a
length of 'n' character, the transition graph can be built with *at most*
(k1 × n) nodes (k1 can easily be lower than 3) and *at most* (2×n)
transitions. And this graph can be built in linear time.

The only cost is when running the engine for matches: the time for each
transition becomes *at worst* linearily bound to the total length (n) of the
regexp. But in most applications, regexps are short, even if their
expressiveness is very large (and would take a about n^(n-1) nodes in the
transition graph of a DFA implementation).

Choosing DFA is then certainly the worst approach: it can't go beyond the
simplest regexps (such as matching substrings with very few alternatives,
and NO unbounded aggregates that include unlimited loops in the NFA
transition graph).

As soon as you must allow more than just 'x?' and 'x{m,n}' aggregates (with
low values of 'n' here), for example by allowing 'x+' or 'x*' or 'x{m,}',
you must abandon the DFA approach, or you need to use compound state
variables. And if you do that, you don't need any backtracking (backtracking
will be used only to handle syntax errors in languages where it is useful,
or to evaluate conditional transitions that can't be determined only by the
current node position in the transition graph and by the current input).

Thanks, Regexp recognizers can be built using a DFA approach *without any*
needed backtracking, provided that you choose the correct representation of
the state variables in your automata so that they can work *directly* based
on the NFA-style transition graph: each node in the NFA-style transition
graph will not be the single state variable for your DFA automata, because
multiple states may be "active" candidates. But the set of active nodes
perfectly defines the state of the DFA automata.

If your automata will only accept a limited number of state variables (say
'v'), then the regexps it will be able to recognize are in the class of
'LR(v)' languages, and will not parse all possible regexps containing
aggregates whose upper bound is larger than 'v', and then only you'll need
the support of backtracking (which just complicates things and that will
seriously slow down your automata implementation).

So forget the idea of supporting backtracking when implementing regexps:
they are used only for handling syntax errors in context-free grammars (like
those recognized by yacc), but regexps are NOT context-free grammars.

Note: the paradox here is that syntax of a regexp string can be defined and
parsed using a context free grammar like yacc. But this does not mean that
the language that the regexp describes and that you'll need to match with it
is context-free. In fact it is not!

And this becomes extremely visible if you also want to support capturing
groups in regexps, notably for supporting the use of regexps for automated
replacement of a match by a string built from parts of a match, in regexps
like "(x*)(x*)" with two capturing groups: you need to use something better
than backtracking for handling multiple possibilities, and this includes the
possibility to use compound states, or to augment the regexp with some
"sugar" to specify which capturing group has the priority (you can't *only*
use the simple rule of the longest match, as this will be applicable to the
only one of the two capturing groups in "(x*)(x*)").