I wouldn't read the generated code, one is normally only interested in the lex source code.
–
Giorgio Jan 1 '12 at 11:45


@Giorgio: The generated code is the code you have to interface with, with disgusting non-thread-safe global variables, for example, and it's the code whose NULL-termination bugs you're introducing into your application.
–
DeadMG Jan 1 '12 at 11:50

I think that if the regular expression becomes very complex, so does the corresponding code. That's why lexer generators are good: I would normally only code a lexer myself if the language is very simple.
–
Giorgio Jan 1 '12 at 14:30

I have never written a complex parser and all the lexers and parsers I have written were also hand-coded. I just wonder how this scales for more complex regular languages: I have never tried it but I imagine that using a generator (like lex) would be more compact. I admit I have no experience with lex or other generators beyond some toy examples.
–
Giorgio Jan 1 '12 at 14:55

There would be a string you append *pc to, right? Like while (isdigit(*pc)) { value += *pc; pc++; }. Then after the } you convert the value into a number and assign that to a token.
–
rightfold Jan 1 '12 at 15:59

@WTP: For numbers, I just calculate them on the fly, similar to n = n * 10 + (*pc++ - '0');. It gets a little more complex for floating point and 'e' notation, but not bad. I'm sure I could save a little code by packing the characters into a buffer and calling atof or whatever. It wouldn't run any faster.
–
Mike Dunlavey Jan 1 '12 at 17:53
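The on-the-fly accumulation described in that comment can be sketched as follows (scan_int is a hypothetical helper name chosen for the example):

```cpp
#include <cctype>

// Sketch of scanning an integer on the fly, advancing pc as it goes:
// accumulate digit by digit, with no intermediate buffer or atoi call.
long scan_int(const char*& pc) {
    long n = 0;
    while (std::isdigit(static_cast<unsigned char>(*pc))) {
        n = n * 10 + (*pc++ - '0');  // shift previous digits left, add the new one
    }
    return n;
}
```

After the loop, pc is left pointing at the first non-digit character, ready for the next token.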

Lexers are finite state machines. Therefore, they can be constructed by any general-purpose FSM library. For the purposes of my own education, however, I wrote my own, using expression templates. Here's my lexer:

It's backed by an iterator-based, back-tracking, finite state machine library which is ~400 lines in length. However, it's easy to see that all I had to do was construct simple boolean operations, like and, or, and not, and a couple of regex-style operators like * for zero-or-more, eps to mean "match anything" and opt to mean "match anything but don't consume it". The library is fully generic and based on iterators. The MakeEquality stuff is a simple test for equality between *it and the value passed in, and MakeRange is a simple <= >= test.
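The combinator style being described can be illustrated roughly like this. Note this is not the author's library, just a sketch of the idea using std::function instead of expression templates; the names MakeEquality and MakeRange are borrowed from the description above, while Or and Star are invented for the example:

```cpp
#include <functional>
#include <string>

// A matcher is a predicate over an iterator position that advances
// the iterator on success and leaves it untouched on failure.
using It = std::string::const_iterator;
using Matcher = std::function<bool(It&, It)>;

// Simple test for equality between *it and the value passed in.
Matcher MakeEquality(char c) {
    return [c](It& it, It end) {
        if (it != end && *it == c) { ++it; return true; }
        return false;
    };
}

// Simple <= >= range test.
Matcher MakeRange(char lo, char hi) {
    return [lo, hi](It& it, It end) {
        if (it != end && *it >= lo && *it <= hi) { ++it; return true; }
        return false;
    };
}

// Boolean composition: try the first matcher, fall back to the second.
Matcher Or(Matcher a, Matcher b) {
    return [a, b](It& it, It end) { return a(it, end) || b(it, end); };
}

// Regex-style '*': zero-or-more repetitions; always succeeds.
Matcher Star(Matcher m) {
    return [m](It& it, It end) {
        while (m(it, end)) {}
        return true;
    };
}
```

Composing these gives recognizers such as Star(Or(MakeRange('a','z'), MakeRange('A','Z'))) for a run of letters.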

I've seen several lexers that just read the next token when requested by the parser to do so. Yours seems to go through the whole file and make a list of tokens. Is there any particular advantage to this method?
–
user673679 Oct 27 '13 at 12:20

Generally, we expect a lexer to do all three steps in one go; however, the last step is inherently more difficult, and there are some issues with automation (more on this later).

The most amazing lexer I know of is Boost.Spirit.Qi. It uses expression templates to generate your lexer expressions, and once accustomed to its syntax the code feels really neat. It compiles very slowly though (heavy templates), so it's best to isolate the various portions in dedicated files to avoid recompiling them when they haven't been touched.

There are some performance pitfalls, and the author of the Epoch compiler explains in an article how he got a 1000x speed-up through intensive profiling and investigation into how Qi works.

Finally, there is also code generated by external tools (Yacc, Bison, ...).

But I promised a write-up on what was wrong with automating the grammar verification.

If you check out Clang, for example, you will realize that instead of using a generated parser or something like Boost.Spirit, they set out to validate the grammar manually using a generic recursive-descent parsing technique. Surely this seems backward?
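The code sample from the original answer is not reproduced here; a reconstruction along the lines described (a declaration of Foo immediately followed by a function declaration) would be:

```
struct Foo {
  int i;
}

void bar() {}
```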

Notice the error? A missing semicolon right after the declaration of Foo.

It is a common error, and Clang recovers neatly by realizing that the semicolon is simply missing and that void is not an instance of Foo but part of the next declaration. This avoids hard-to-diagnose, cryptic error messages.

Most automated tools have no (at least no obvious) way of specifying those likely mistakes and how to recover from them. Often, recovery requires a little syntactic analysis, so it is far from trivial.

So, there is a trade-off involved in using an automated tool: you get your parser quickly, but it is less user-friendly.

Since you want to learn how lexers work, I presume you actually want to know how lexer generators work.

A lexer generator takes a lexical specification, which is a list of rules (regular-expression-token pairs), and generates a lexer. This resulting lexer can then transform an input (character) string into a token string according to this list of rules.

The most commonly used method consists mainly of transforming a regular expression into a deterministic finite automaton (DFA) via a nondeterministic finite automaton (NFA), plus a few details.
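To make the DFA side concrete, here is a sketch of the kind of table-driven scanner a generator might emit: one 256-entry transition row per state, run with maximal munch. The rules (INT = [0-9]+, IDENT = [a-z]+), state names, and token names are made up for the example:

```cpp
#include <cstdint>
#include <cstddef>
#include <string>
#include <utility>

enum { START, IN_INT, IN_IDENT, DEAD, NSTATES };
enum Token { TOK_INT, TOK_IDENT, TOK_NONE };

struct Tables {
    uint8_t next[NSTATES][256];  // one full transition row per state
    Token accept[NSTATES];       // which token each state accepts, if any
    Tables() {
        for (int s = 0; s < NSTATES; ++s)
            for (int c = 0; c < 256; ++c) next[s][c] = DEAD;
        for (int c = '0'; c <= '9'; ++c) { next[START][c] = IN_INT;   next[IN_INT][c]   = IN_INT; }
        for (int c = 'a'; c <= 'z'; ++c) { next[START][c] = IN_IDENT; next[IN_IDENT][c] = IN_IDENT; }
        accept[START] = TOK_NONE; accept[IN_INT] = TOK_INT;
        accept[IN_IDENT] = TOK_IDENT; accept[DEAD] = TOK_NONE;
    }
};

// Return the longest-match token at the front of `in` and its length
// (maximal munch: remember the last accepting state seen).
std::pair<Token, size_t> next_token(const Tables& t, const std::string& in) {
    int s = START;
    Token best = TOK_NONE;
    size_t best_len = 0;
    for (size_t i = 0; i < in.size(); ++i) {
        s = t.next[s][static_cast<unsigned char>(in[i])];
        if (s == DEAD) break;
        if (t.accept[s] != TOK_NONE) { best = t.accept[s]; best_len = i + 1; }
    }
    return {best, best_len};
}
```

A real generator produces exactly this shape of code, but with the tables computed from your rule list via the NFA-to-DFA construction rather than filled in by hand.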

A detailed guide to this transformation can be found here. Note that I haven't read it myself, but it looks quite good. Also, just about any book on compiler construction will feature this transformation in its first few chapters.

If you are interested in lecture slides of courses on the topic, there are no doubt an endless amount of them from courses on compiler construction. From my university, you can find such slides here and here.

There are a few more things that are not commonly employed in lexers or treated in texts, but are quite useful nonetheless:

Firstly, handling Unicode is somewhat nontrivial. The problem is that ASCII input is only 8 bits wide, which means that you can easily have a transition table for every state in the DFA, because each table has only 256 entries. However, Unicode, being 16 bits wide (if you use UTF-16), requires a 64k-entry table for every state in the DFA. If you have complex grammars, this may start taking up quite some space. Filling these tables also starts to take quite a bit of time.

Alternatively, you could generate interval trees. An interval tree might contain the tuples ('a', 'z') and ('A', 'Z'), for example, which is a lot more memory-efficient than having the full table. If you maintain non-overlapping intervals, you can use any balanced binary tree for this purpose. The running time is linear in the number of bits you need for every character, so O(16) in the Unicode case; in practice, however, it will usually be quite a bit less.
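A minimal sketch of this interval lookup, using std::map (a red-black tree) keyed by each interval's lower bound as the balanced tree; the state numbers are invented for the illustration:

```cpp
#include <map>
#include <utility>

// Non-overlapping character ranges, each mapping to a target DFA state:
// lower bound -> (upper bound, state).
using IntervalMap = std::map<char32_t, std::pair<char32_t, int>>;

// Find the state for character c, or `dead` if no interval contains it.
int lookup(const IntervalMap& m, char32_t c, int dead = -1) {
    auto it = m.upper_bound(c);              // first interval with lo > c
    if (it == m.begin()) return dead;        // c is below every interval
    --it;                                    // candidate interval with lo <= c
    return (c <= it->second.first) ? it->second.second : dead;
}
```

Three intervals replace what would otherwise be a 64k-entry row, at the cost of an O(log n) tree walk per character.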

One more issue is that commonly generated lexers actually have quadratic worst-case performance. Although this worst-case behaviour is not commonly seen, it might bite you. If you run into the problem and want to solve it, a paper describing how to achieve linear time can be found here.

You'll probably want to be able to describe regular expressions in string form, as they normally appear. However, parsing these regular expression descriptions into NFAs (or possibly into a recursive intermediate structure first) is a bit of a chicken-and-egg problem. For parsing regular expression descriptions, the shunting-yard algorithm is very suitable. Wikipedia seems to have an extensive page on the algorithm.
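A sketch of the shunting-yard approach for a tiny, assumed regex dialect (literals, '|', '*', and parentheses), producing a postfix form from which an NFA could then be built via Thompson's construction; the precedence table and the explicit '.' concatenation operator are conventions chosen for the example:

```cpp
#include <cctype>
#include <string>
#include <vector>

// Higher binds tighter: '*' > concatenation '.' > alternation '|'.
int prec(char op) { return op == '*' ? 3 : op == '.' ? 2 : op == '|' ? 1 : 0; }

std::string to_postfix(const std::string& re) {
    // Step 1: insert an explicit '.' wherever one operand ends
    // (literal, '*', or ')') and the next begins (literal or '(').
    std::string in;
    for (char c : re) {
        if (!in.empty()) {
            char p = in.back();
            bool p_ends   = std::isalnum(static_cast<unsigned char>(p)) || p == '*' || p == ')';
            bool c_starts = std::isalnum(static_cast<unsigned char>(c)) || c == '(';
            if (p_ends && c_starts) in += '.';
        }
        in += c;
    }
    // Step 2: standard shunting yard with an operator stack.
    std::string out;
    std::vector<char> ops;
    for (char c : in) {
        if (std::isalnum(static_cast<unsigned char>(c))) out += c;
        else if (c == '(') ops.push_back(c);
        else if (c == ')') {
            while (!ops.empty() && ops.back() != '(') { out += ops.back(); ops.pop_back(); }
            if (!ops.empty()) ops.pop_back();  // discard the '('
        } else {  // '|', '.', or '*': pop higher-or-equal precedence first
            while (!ops.empty() && prec(ops.back()) >= prec(c)) { out += ops.back(); ops.pop_back(); }
            ops.push_back(c);
        }
    }
    while (!ops.empty()) { out += ops.back(); ops.pop_back(); }
    return out;
}
```

For instance, "a(b|c)*" comes out as "abc|*.", which a postfix walk can turn directly into NFA fragments.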