On Fri, 2007-02-23 at 15:32 +0000, Joel Reymont wrote:
> Folks,
>
> Is there a way to make a case-insensitive lexer with ocamllex?
>
> The only answer I was able to find is the following from John Skaller:
>
> | identifier {
> let x = toupper (lexeme lexbuf) in
> try Hashtbl.find keywords x
> With Not_found -> x
> }
>
You should note there is another reason for this technique,
which isn't really related to case insensitivity.
A lexer in which you spell out all the keywords like:
| "let" { LET }
| "in" { IN }
| "[Cc][Aa][Ss][Ee]" -> { CASE }
....
is likely to blow ocamllex out of the water if you have
a lot of keywords. This is a good thing, because the
generated code would have a transition matrix so large
it would blow your program out of the water.
(** but it would be good if it produced a warning instead
of looping forever .. can I suggest a command line argument
which limits the number of states?
eg ocamllex -maxtstates 1000 ...)
Using a hash table here retains roughly O(1) performance,
and may actually be *faster* than using a transition matrix
because the large matrix would probably cause disk paging
for every character.
As an example, I once tried to decode UTF-8, which caused
ocamllex to explode. The encoding is finite and therefore
regular .. but trying to capture the value in the lexer
state isn't the best way to do this because there are
too many states.
If you have a nasty lexicology it's a good idea to factor
it if possible, and implement 'sublexers' using different
techniques. Note you can actually tail-call ocamllex lexers
in their action codes, and, ocamllex lexers allow arguments,
so you can actually pass complex state around. You might also
note recent Ocaml's provide substring extraction using
Ville Laurikari's tagged automata technique.
If you want an EXTREME example of this -- the Felix lexer
can actually call the Felix parser and wrap a whole AST
up inside a single token. This is used to open the grammar
up to support user defined syntax extensions: the extensions
are actually parsed with an executable recursive descent
parser .. but the core constructions are still parsed
by ocamlyacc LALR(1) parsing, so the core language retains
good error diagnostics and high performance. The whole
of my code is entirely re-entrant (no global state).
So actually, although it isn't advertised very heavily,
ocamllex has quite advanced combinatorial properties,
whereas traditional 'lex' lexers are considerably more
monolithic.
--
John Skaller <skaller at users dot sf dot net>
Felix, successor to C++: http://felix.sf.net