7 How the Input Is Matched

When the generated scanner is run, it analyzes its input looking for
strings which match any of its patterns. If it finds more than one
match, it takes the one matching the most text (for trailing context
rules, this includes the length of the trailing part, even though it
will then be returned to the input). If it finds two or more matches of
the same length, the rule listed first in the flex input file is
chosen.
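
The selection rule above can be sketched in plain C. This is only an illustration under stated assumptions: the two matcher functions and the name pick_rule are hypothetical stand-ins for a pair of rules (the literal "if" followed by [a-z]+), and a real flex scanner does not try rules one at a time but walks a single generated automaton that arrives at the same answer.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-ins for two scanner rules. Each returns the number of
   characters it matches at the start of s, or 0 if it does not match. */
static size_t match_keyword_if(const char *s)    /* rule 0: the literal "if" */
{
    return strncmp(s, "if", 2) == 0 ? 2 : 0;
}

static size_t match_identifier(const char *s)    /* rule 1: [a-z]+ */
{
    size_t n = 0;
    while (s[n] >= 'a' && s[n] <= 'z')
        n++;
    return n;
}

typedef size_t (*matcher_fn)(const char *);

/* Rules in the order they would appear in the input file. */
static const matcher_fn rules[] = { match_keyword_if, match_identifier };

/* Returns the index of the winning rule, or -1 if no rule matches.
   The length of the winning match is stored in *len. */
int pick_rule(const char *input, size_t *len)
{
    int best = -1;
    size_t best_len = 0;

    for (size_t i = 0; i < sizeof rules / sizeof rules[0]; i++) {
        size_t n = rules[i](input);
        /* Strictly greater: on a tie the earlier rule keeps the match. */
        if (n > best_len) {
            best_len = n;
            best = (int)i;
        }
    }
    *len = best_len;
    return best;
}
```

With these two rules, the input "iffy" goes to the identifier rule because it matches four characters, while "if" alone is a two-character tie that the keyword rule wins by being listed first.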

Once the match is determined, the text corresponding to the match
(called the token) is made available in the global character
pointer yytext, and its length in the global integer
yyleng. The action corresponding to the matched pattern is
then executed (see section Actions), and then the remaining input is scanned
for another match.

If no match is found, then the default rule is executed: the next
character in the input is considered matched and copied to the standard
output. Thus, the simplest valid flex input is:

%%

which generates a scanner that simply copies its input (one character at
a time) to its output.

Note that yytext can be defined in two different ways: either as
a character pointer or as a character array. You can
control which definition flex uses by including one of the
special directives %pointer or %array in the first
(definitions) section of your flex input. The default is
%pointer, unless you use the ‘-l’ lex compatibility option,
in which case yytext will be an array. The advantage of using
%pointer is substantially faster scanning and no buffer overflow
when matching very large tokens (unless you run out of dynamic memory).
The disadvantage is that you are restricted in how your actions can
modify yytext (see section Actions), and calls to the unput()
function destroy the present contents of yytext, which can be a
considerable porting headache when moving between different lex
versions.

The advantage of %array is that you can then modify yytext
to your heart’s content, and calls to unput() do not destroy
yytext (see section Actions). Furthermore, existing lex
programs sometimes access yytext externally using declarations of
the form:

extern char yytext[];

This definition is erroneous when used with %pointer, but correct
for %array.
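
The reason the two declarations are not interchangeable is that, in C, an array object is the character storage itself, whereas a pointer object only holds an address. The following sketch makes the difference visible with sizeof. The names demo_array, demo_pointer and DEMO_YYLMAX are hypothetical and chosen for illustration; they are not part of any scanner flex generates.

```c
#include <stddef.h>

#define DEMO_YYLMAX 8192   /* hypothetical size, standing in for YYLMAX */

/* In another source file, the declaration has to agree with the definition:
       extern char yytext[];    when the scanner uses %array
       extern char *yytext;     when the scanner uses %pointer
   Declaring an array where the definition is a pointer makes the compiler
   read the bytes of the pointer itself as if they were token text. */

static char demo_array[DEMO_YYLMAX];      /* %array style: the storage itself */
static char *demo_pointer = demo_array;   /* %pointer style: just an address  */

/* Size of the array object: the whole buffer. */
size_t array_object_size(void)
{
    return sizeof demo_array;
}

/* Size of the pointer object: one machine address, whatever it points at. */
size_t pointer_object_size(void)
{
    return sizeof demo_pointer;
}
```

Code that reaches yytext only through the scanner's own generated header or source file is not affected, since the declaration and the definition then always agree.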

The %array declaration defines yytext to be an array of
YYLMAX characters, which defaults to a fairly large value. You
can change the size by simply #define’ing YYLMAX to a different
value in the first section of your flex input. As mentioned
above, with %pointer yytext grows dynamically to accommodate
large tokens. While this means your %pointer scanner can
accommodate very large tokens (such as matching entire blocks of
comments), bear in mind that each time the scanner must resize
yytext it also must rescan the entire token from the beginning,
so matching such tokens can prove slow. At present, yytext does
not grow dynamically if a call to unput() pushes back too
much text; instead, a run-time error results.
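
To get a rough feel for the cost of rescanning after each resize, the following back-of-the-envelope model counts how many characters are examined for a single token. It rests on assumptions that are not taken from flex itself: the buffer is assumed to double each time it fills, and the scanner is assumed to restart at the beginning of the token after every resize. The function name chars_examined is hypothetical.

```c
#include <stddef.h>

/* Estimates the number of characters examined while matching one token of
   token_len characters, starting from a buffer of initial_cap characters.
   Assumed model: each time the buffer fills, it doubles and the token is
   rescanned from its first character. */
size_t chars_examined(size_t token_len, size_t initial_cap)
{
    size_t cap = initial_cap ? initial_cap : 1;   /* avoid a zero capacity */
    size_t total = 0;

    while (cap < token_len) {
        total += cap;   /* scanned up to the capacity, then had to grow */
        cap *= 2;       /* assumed growth policy: double the buffer     */
    }
    return total + token_len;   /* the final pass covers the whole token */
}
```

Under this model a 1000-character token that starts in a 128-character buffer costs 1896 character examinations, a little under twice the token length, while any token that fits in the initial buffer is examined only once.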