> Since I prefer code that is readable, as opposed to sets of numbers
> that leave me scratching my head, so it emits a simple state machine
> for the token. I hope the list maintainer doesn't mind, but here's an
> example of output:

Your code is OK; nothing surprisingly good or bad about it, at least
in terms of your stated goals. In fact, it looks a lot like how
certain Verilog designers encode state machines in their logic.
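Since the actual output wasn't quoted above, I can only sketch what such
a readable encoding typically looks like. Something in the spirit of a
Verilog case-statement FSM, here in Python with invented state names,
matching a C-style identifier ([A-Za-z_][A-Za-z0-9_]*):

```python
def match_identifier(s):
    # Explicit named states instead of opaque numeric tables --
    # the "readable state machine" style described above.
    state = "START"
    for ch in s:
        if state == "START":
            if ch.isalpha() or ch == "_":
                state = "IN_IDENT"
            else:
                return False
        elif state == "IN_IDENT":
            if ch.isalnum() or ch == "_":
                state = "IN_IDENT"
            else:
                return False
    # Accept only if we ended in the accepting state.
    return state == "IN_IDENT"
```

The trade-off is the usual one: a dispatch on named states is easy to
read and debug, while a packed transition table is smaller and faster.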

> I don't have the experience necessary to create a regular expression
> that can show the system fault; however, I can say what it is not: It
> isn't a regex parser generator, because it doesn't have look-behinds
> or look-aheads, and notably by the source above: it doesn't do any
> capturing (yet).

Ok, here you need some theoretical gaps covered.

The term "regex parser generator" is a mixture of several concepts.

Regex, as you appear to be using the term, is not a concept with a
well-founded theoretical basis (currently, although some researchers
are working to put such a basis in place)--it is more an ad hoc
construction used to describe regular expressions extended with
captures and other elements. Your usage of the term seems to
correspond to Perl (or PCRE) regexes.

Now, if you were talking about regular expressions (a well-founded
theory), the term generally used is a lexer generator rather than a
parser generator. However, your subsequent usage (in this and your
following post) seems to indicate that you do mean what is
traditionally termed parsing (the assigning of input sentences to
their corresponding derivation (parse) trees).

To be clearer about this: regular expressions are generally used in
lexing--the mapping of inputs to tokens. The structure of the
regular expression is not considered, i.e. we don't care how the
characters in the input are matched to the various parts of a
regular expression, only which token the string matches.
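To make that concrete, here is a minimal lexing sketch (token names
and patterns invented for illustration, using Python's re module):
the caller gets back only which token matched and the lexeme, and the
internal structure of the match is thrown away.

```python
import re

# Hypothetical token set; longest/first-match details are simplified.
TOKENS = [
    ("NUMBER", re.compile(r"[0-9]+")),
    ("IDENT",  re.compile(r"[A-Za-z_][A-Za-z0-9_]*")),
    ("OP",     re.compile(r"[+\-*/=]")),
]

def next_token(text, pos=0):
    for name, pattern in TOKENS:
        m = pattern.match(text, pos)
        if m:
            # Only the token name, the lexeme, and the new position
            # escape the lexer -- not how the pattern matched.
            return name, m.group(0), m.end()
    raise ValueError("no token at position %d" % pos)
```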

The regex extensions (captures, assertions, back references, et al.)
are not used in lexing because they require keeping track of the
matches within a token. This breaks, among other things, the
theoretical equivalences that make a DFA and an NFA interchangeable,
and some extensions potentially require a Turing Machine to match.
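A back reference is the clearest illustration of this. The pattern
below matches strings of the form a^n b a^n--the \1 forces the second
run of a's to be exactly as long as the first, a counting requirement
no finite automaton (DFA or NFA) can meet. Using Python's re module
purely for demonstration:

```python
import re

# a^n b a^n is not a regular language: matching it requires
# remembering an unbounded count of a's.
pattern = re.compile(r"^(a+)b\1$")

print(bool(pattern.match("aaabaaa")))  # same count on both sides
print(bool(pattern.match("aaabaa")))   # counts differ
```

This is also why back-reference matching in general has no guaranteed
linear-time implementation, unlike a pure regular-expression match.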

In parsing, one doesn't traditionally use regex extensions either.
The language of BNF or EBNF (= BNF + regular expressions) is
generally sufficient to describe programming languages (or
significant subsets thereof, with the missing parts covered by
"hacks"). Note EBNF is not more powerful than BNF, simply more
succinct. The main forms of parsing used with BNF are the LL and
LR families of grammars. Both of these have linear-time parsing
methods. Recently, these methods have begun to be supplanted by
extensions such as predicated grammars, PEGs, and GLR parsing,
which admit a wider class of languages but still run in linear
time for the common cases. Again, the regex extensions do not have
guaranteed linear-time implementations and thus are not used. It's
not that one couldn't use them, but the theory isn't currently
sufficiently developed to do so without significant "risk".
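LL parsing in particular makes the linear-time claim easy to see: a
recursive-descent parser consumes each token exactly once. A sketch
for a toy grammar (invented for illustration):

```python
# Grammar (EBNF):
#   expr ::= term ( "+" term )*
#   term ::= NUMBER
# Each call advances pos monotonically, so parsing is linear time.

def parse_expr(tokens, pos=0):
    node, pos = parse_term(tokens, pos)
    while pos < len(tokens) and tokens[pos] == "+":
        rhs, pos = parse_term(tokens, pos + 1)
        node = ("+", node, rhs)  # left-associative parse tree
    return node, pos

def parse_term(tokens, pos):
    tok = tokens[pos]
    if tok.isdigit():
        return ("num", int(tok)), pos + 1
    raise SyntaxError("expected number at position %d" % pos)
```

Note how the output is a derivation (parse) tree, not merely a yes/no
match--the distinction between parsing and lexing made above.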

In your follow-on question, where you are doing minimization, you
cannot merge states when the states generate (or lead to the
generation of) different outputs. If you omit that check, you will
get results similar to what you describe.
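In a Moore/Hopcroft-style minimization, that check shows up as the
choice of initial partition: states are first grouped by the output
they produce, and only then refined by transitions. A minimal sketch
of that first step, with a made-up representation (outputs maps a
state name to the token it emits; states emitting nothing are absent):

```python
def initial_partition(states, outputs):
    # Group states by output (None = no output). Merging states
    # across these groups is exactly the error described above:
    # it changes the machine's output behavior.
    groups = {}
    for s in states:
        groups.setdefault(outputs.get(s), set()).add(s)
    return list(groups.values())

def can_merge(s, t, outputs):
    # Necessary (not sufficient) condition for merging two states:
    # they must produce identical outputs. Transition behavior
    # still has to be checked by the refinement passes.
    return outputs.get(s) == outputs.get(t)
```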

I know you are anxious to plow ahead, but doing this without the
underlying theory is causing you to make mistakes that could easily be
avoided. It is possible that those mistakes will result in
significant innovations, but the more likely outcome is that you will
produce something that looks nice but is inherently flawed. It would
only take a couple of courses to rectify that problem, as I noted
earlier. Until then, the best I can do is point out your most
egregious mistakes and leave the subtle flaws undiscussed, as you
don't have the background to understand why they are errors.