How would one go about writing a lexical parser? I have several tools and containers I've written at my disposal to contribute to the job. Ultimately, I'd like to use it in my compiler: read in one token at a time, tell its type, and decide what to do based on that, but I know it is never that simple. I plan to have it read in BNF files and create a lexer state that processes input and returns tokens sequentially. One problem I foresee is knowing what many small tokens form when put together: if the largest non-terminal type is returned, how do I find the smaller components to operate on them, unless I return some sort of tree? I'm not sure how to go about this, and any advice would be helpful.

And please, no suggestions to abandon the project, or use one already written. Far too frequently, I receive only discouraging responses.

Note that this is most likely not the optimal (or perhaps even practical) way of doing this. I needed to do it for an assignment where we were "forced" to use certain tools to create our own "scripting" language. What I did was the following:

I use flex (from Cygwin) to scan the file, and with the help of a small script I can filter out all the different tokens I want to parse (some of the usage examples I use come from this site). The scanner generates a C file whose functions you can override so you can parse at your leisure. From there you can walk through your tokens and build "paths" that lead you to a variable declaration (for example, you find the path "object.position.x = 10"), and with that path you can decide what to do to achieve your goal. There are multiple ways to extract meaningful data; I guess you can figure out what suits you best and how you want to parse it.

You seem to be trying to write a syntactic parser, not a lexical parser. A lexical parser eats characters and spits out tokens, while a syntactic parser eats tokens and spits out a parse tree. People often refer to these as a "lexer" and simply a "parser", respectively.

If you want to take one token at a time and decide what to do based on that, you probably want to write an LL(1) parser. That Wikipedia page contains a section on how to build an LL(1) parsing table which I think answers your question.
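
If it helps to see the idea concretely, here's a minimal sketch in C of the one-token-lookahead principle behind LL(1), written as a recursive-descent recognizer for the invented toy grammar E -> 'a' | '(' E ')'. The function names are made up for illustration; a table-driven LL(1) parser encodes the same decisions in a table instead of in code:

```c
#include <stdbool.h>

/* Recursive-descent recognizer for the toy grammar
 *   E -> 'a' | '(' E ')'
 * Looking at a single upcoming symbol is enough to pick the
 * production, which is exactly the LL(1) property. */
static bool parse_e(const char **p)
{
    if (**p == 'a') {            /* E -> 'a' */
        (*p)++;
        return true;
    }
    if (**p == '(') {            /* E -> '(' E ')' */
        (*p)++;
        if (!parse_e(p)) return false;
        if (**p != ')') return false;
        (*p)++;
        return true;
    }
    return false;                /* no production starts with this symbol */
}

static bool accepts(const char *s)
{
    return parse_e(&s) && *s == '\0';   /* must consume the whole input */
}
```

The table-driven version replaces the if-chains with a lookup of (non-terminal, lookahead token) in the parsing table described on that Wikipedia page.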

Are you making a lexer (fixed parsing for a known grammar), or are you making a lexer generator (reading BNF or some other grammar description and then working dynamically on a stream based on it)?

For the former, it's fairly easy to do naively: read in characters, switch on state, and generate results. If it's the latter... you should really use an existing tool. Unless you're doing something quirky, this is a well-covered area that is difficult to do well but has been effectively solved for decades.

[edit:] If it's for a compiler, the grammar is likely fixed, though most compilers can't just read a line in at a time and know the type of each token... Do you happen to have the grammar well defined? Or can you give us a better idea of what you'd expect the inputs and outputs to be?
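
To make the "read in characters, switch on state, generate results" idea concrete, here's a minimal hand-rolled sketch in C. The token kinds and the `next_token` name are invented for illustration; a real lexer would also need to handle strings, comments, multi-character operators, and so on:

```c
#include <ctype.h>
#include <stddef.h>

/* Token kinds for a tiny illustrative language (placeholders, not a real spec). */
typedef enum { TOK_IDENT, TOK_NUMBER, TOK_PUNCT, TOK_EOF } TokKind;

typedef struct {
    TokKind kind;
    char text[64];   /* lexeme copied out for simplicity */
} Token;

/* Scan one token from *src, advancing the cursor past it. */
static Token next_token(const char **src)
{
    const char *p = *src;
    Token t = { TOK_EOF, "" };
    size_t n = 0;

    while (isspace((unsigned char)*p))   /* skip whitespace between tokens */
        p++;

    if (*p == '\0') {
        *src = p;
        return t;
    }

    if (isalpha((unsigned char)*p) || *p == '_') {          /* identifier */
        t.kind = TOK_IDENT;
        while ((isalnum((unsigned char)*p) || *p == '_') && n < 63)
            t.text[n++] = *p++;
    } else if (isdigit((unsigned char)*p)) {                /* number */
        t.kind = TOK_NUMBER;
        while (isdigit((unsigned char)*p) && n < 63)
            t.text[n++] = *p++;
    } else {                                                /* single-char punctuation */
        t.kind = TOK_PUNCT;
        t.text[n++] = *p++;
    }
    t.text[n] = '\0';
    *src = p;
    return t;
}
```

Each branch of the if-chain is one "state" of the scanner; a generated lexer just derives those states from a grammar instead of having them hand-written.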

I guess I could work from tokens. However, this would require me to write something like what I already have for every type of input, to create tokens from it. I'm also looking for reusability. My preprocessor reads a stream and creates tokens from it, and I could use those tokens for something, but ideally I'd like this to create the tokens itself. Either way, it looks like I'll need both parsers at some point.

As for the suggestions to abandon writing this project and use an already written parser: no, thank you.

So, it sounds like you have the tokenizer step done already. The only thing left during lexical analysis is to classify what kind of terminal each token is. For example, your terminal types might include: {identifier, keyword-void, keyword-for, keyword-if, operator-plus, operator-multiply, operator-not, ...}

After you've created terminals, you feed those to the parser (by parser, I mean the part that builds your abstract syntax tree).
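
As a sketch of that classification step, assuming the tokens have already been split out as strings (the enum values mirror the examples above; `classify` is a made-up name):

```c
#include <string.h>
#include <ctype.h>

/* Terminal categories from the list above; the set is illustrative, not complete. */
typedef enum {
    T_IDENTIFIER, T_KEYWORD_VOID, T_KEYWORD_FOR, T_KEYWORD_IF,
    T_OPERATOR_PLUS, T_OPERATOR_MULTIPLY, T_OPERATOR_NOT, T_UNKNOWN
} Terminal;

/* Classify an already-extracted token string as a terminal kind. */
static Terminal classify(const char *tok)
{
    if (strcmp(tok, "void") == 0) return T_KEYWORD_VOID;
    if (strcmp(tok, "for") == 0)  return T_KEYWORD_FOR;
    if (strcmp(tok, "if") == 0)   return T_KEYWORD_IF;
    if (strcmp(tok, "+") == 0)    return T_OPERATOR_PLUS;
    if (strcmp(tok, "*") == 0)    return T_OPERATOR_MULTIPLY;
    if (strcmp(tok, "!") == 0)    return T_OPERATOR_NOT;
    if (isalpha((unsigned char)tok[0]) || tok[0] == '_')
        return T_IDENTIFIER;      /* anything word-like that isn't a keyword */
    return T_UNKNOWN;
}
```

Checking keywords before falling through to the identifier case is the important ordering here, since every keyword also looks like an identifier.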

I use table-based LR(0) parsing with a GLR-based state machine. This involves:

Building the parse table from the grammar productions. This is probably the hardest step to understand. I do this at runtime, whereas most popular parser generators do it as a pre-compile step (generating source code for the parser). This is a reusable step you can perform on a wide variety of grammars.

Creating a state machine that can follow the transitions in your table. An LR state machine is easy to write; the GLR extensions are much more complicated. This is a generic algorithm which can be reused for multiple parse tables.

Feeding all of your tokenized terminals to the state machine, followed by a special 'end of input' terminal.
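
Those steps can be illustrated end to end with a hand-built table for a deliberately tiny grammar, S -> '(' S ')' | 'x'. This is plain LR(0) with the table written out by hand rather than generated, and without any of the GLR machinery, so treat it as a sketch of the table-driven idea only:

```c
#include <stdbool.h>

typedef enum { ERR, SHIFT, REDUCE, ACCEPT } Act;
typedef struct { Act act; int arg; } Action;

/* Map a terminal character to its table column ('$' = end of input). */
static int term_index(char c)
{
    switch (c) {
    case '(': return 0;
    case ')': return 1;
    case 'x': return 2;
    case '$': return 3;
    default:  return -1;
    }
}

static bool lr_accepts(const char *input)
{
    /* ACTION[state][terminal] for  S -> '(' S ')'  (prod 1),  S -> 'x'  (prod 2) */
    static const Action A[6][4] = {
        /* (            )            x            $          */
        { {SHIFT,2},  {ERR,0},    {SHIFT,3},  {ERR,0}    },  /* 0: start        */
        { {ERR,0},    {ERR,0},    {ERR,0},    {ACCEPT,0} },  /* 1: S seen       */
        { {SHIFT,2},  {ERR,0},    {SHIFT,3},  {ERR,0}    },  /* 2: after '('    */
        { {REDUCE,2}, {REDUCE,2}, {REDUCE,2}, {REDUCE,2} },  /* 3: S -> x .     */
        { {ERR,0},    {SHIFT,5},  {ERR,0},    {ERR,0}    },  /* 4: ( S .        */
        { {REDUCE,1}, {REDUCE,1}, {REDUCE,1}, {REDUCE,1} },  /* 5: ( S ) .      */
    };
    static const int gotoS[6]   = { 1, -1, 4, -1, -1, -1 };  /* GOTO on S      */
    static const int rhs_len[3] = { 0, 3, 1 };               /* symbols popped */

    int stack[256], top = 0;
    stack[0] = 0;
    for (;;) {
        int t = term_index(*input ? *input : '$');
        if (t < 0)
            return false;
        Action a = A[stack[top]][t];
        if (a.act == SHIFT) {
            if (top >= 255) return false;
            stack[++top] = a.arg;
            input++;
        } else if (a.act == REDUCE) {
            top -= rhs_len[a.arg];          /* pop the production's rhs */
            int g = gotoS[stack[top]];
            if (g < 0) return false;
            stack[++top] = g;               /* follow goto for the exposed state */
        } else if (a.act == ACCEPT) {
            return *input == '\0';
        } else {
            return false;
        }
    }
}
```

The REDUCE case is the heart of it: pop as many states as the production has right-hand-side symbols, then follow the goto entry for the state that becomes exposed. A parser generator's job is to compute tables like `A` and `gotoS` from the grammar automatically.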

I would like to have it tokenize, as I would think that trying to attach the type of each token after the fact would not be an easy task. Besides, the preprocessor only does quick but imperfect tokenization: it merely ensures that tokens are whole, well-formed, and in a state where they can be operated upon.

I would like to have it tokenize, as I would think that trying to attach the type of each token after the fact would not be an easy task. Besides, the preprocessor only does quick but imperfect tokenization: it merely ensures that tokens are whole, well-formed, and in a state where they can be operated upon.

It depends on the language. Some languages are designed so that the lexical stage is very easy. Natural languages like English are the exact opposite. GLR parsers are specifically designed to handle ambiguous situations like this. For example, if you see the word "run" in English, it can be a noun, an intransitive verb, or a transitive verb. To handle this, you can pass a GLR parser multiple options per token, and it'll analyze each possibility in parallel. Most programming languages specifically avoid this ambiguity, which makes things much easier.

Your thread has the "C Language" tag on it. Does that mean you want to parse C, or that you are implementing it in C?

Your thread has the "C Language" tag on it. Does that mean you want to parse C, or that you are implementing it in C?

Both. I have a couple of specialized virtual processors already implemented, and I wrote a poor compiler for them before. I have a C preprocessor already written that can either produce a container of tokens and line numbers, or write the processed output to a stream.

Past that, I have nothing. Ideally, it'd be neat to use this lexer for something other than the compiler, or for another compiler. I don't want to hard-code C syntax into it, and I don't want to embed it into the compiler; I planned it out as a separate module that the compiler calls in order to process its data.

I guess I'll first ask questions to see if I understand. First, does BNF describe a syntax for a lexical parser, or a syntax parser? If BNF is not used for a lexical parser, which type of input should I use?

Second, would I simply be returning tokens, without any type information or any indication of how they might relate to each other? If so, would the next parser then take those and interpret syntactic meaning? Also, how would I determine which non-terminal is of the highest order without scanning all of the tokens? If one terminal could be part of a non-terminal, and that non-terminal could belong to another, how would I best ascertain the highest non-terminal?

First, does BNF describe a syntax for a lexical parser, or a syntax parser?

Either. BNF for the lexical parser works with characters, and BNF for the syntactic parser works with the tokens that the lexer generates. Depending on the grammar you're looking at, though, they might be mixed together: it might be a grammar based on words with an implicit 'ignore all whitespace' rule.

Second, would I simply be returning tokens, without any type information or any indication of how they might relate to each other?

Usually, the lexical grammar simply converts the stream of characters into a stream of tokens, omitting comments, whitespace, and preprocessing directives. It might be prudent to include associations between tokens, but it's usually too early to determine that.

If so, would the next parser then take those and interpret syntactic meaning?

Yes. The lexer makes tokens; the parser makes sure the tokens are in the right order and figures out things like "are these parentheses a function call, or precedence grouping in an expression?".

I see. If I use the tokens I already have, as I had thought I would when I started, that would restrict me to using only the container type that the preprocessor emits, and I'd have to write a whole new lexer or preprocessor for the next language. I'll likely use the C preprocessor heavily, but I wish there were an easier way that was language-agnostic.

So, past that, I'd be reading in tokens, and building a tree token by token in the second stage?

Also, how would I determine which non-terminal is of the highest order without scanning all of the tokens? If one terminal could be part of a non-terminal, and that non-terminal could belong to another, how would I best ascertain the highest non-terminal?

If I use the tokens I already have, as I had thought I would when I started, that would restrict me to using only the container type that the preprocessor emits, and I'd have to write a whole new lexer or preprocessor for the next language

Yeah. That's why I said parser generators were a bit more difficult to make. They're also tons more difficult to debug.

So, past that, I'd be reading in tokens, and building a tree token by token in the second stage?

Pretty much.

Also, how would I determine which non-terminal is of the highest order without scanning all of the tokens? If one terminal could be part of a non-terminal, and that non-terminal could belong to another, how would I best ascertain the highest non-terminal?

If I understand you correctly (I'm horrible with formal terminology), the language specification should provide disambiguation rules for those scenarios. They might require some look-ahead to determine the correct interpretation.
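
A look-ahead interface can be as simple as a cursor that can inspect upcoming tokens without consuming them. This sketch assumes tokens have already been collected into a NULL-terminated array; `peek` and `consume` are invented names:

```c
#include <stddef.h>

/* A minimal lookahead cursor over an array of token strings.
 * peek(c, k) inspects the k-th upcoming token without consuming it,
 * which is how a parser looks ahead to disambiguate productions. */
typedef struct {
    const char **toks;   /* NULL-terminated array of token strings */
    size_t pos;
} Cursor;

static const char *peek(const Cursor *c, size_t k)
{
    for (size_t i = 0; i <= k; i++)       /* don't run past the end */
        if (c->toks[c->pos + i] == NULL)
            return NULL;
    return c->toks[c->pos + k];
}

static const char *consume(Cursor *c)
{
    const char *t = c->toks[c->pos];
    if (t != NULL)
        c->pos++;
    return t;
}
```

Keeping look-ahead behind an interface like this means the parser never needs to know whether the tokens come from an array, a stream, or the preprocessor's container.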

Since you don't seem to have a whole lot of experience in writing these things, I think trying to write a general parser that can be used for multiple languages is too ambitious. It only makes sense to generalize a problem when you have solved individual instances several times.

It's a bit like the "write games, not engines" advice: Once you have a few games under your belt, you can consider what an engine might do that will help you with future games, but before you have that experience you are extremely unlikely to write a useful engine.

I think I understand. A lexer creates tokens from an input. If I can find a way to describe what constitutes a token as an input, I can construct a lexer. I'll look into that. Next, in the parser, tokens should be put into a tree based on the syntactic significance of terminals and non-terminals, probably driven by a BNF input. Finally, I'd iterate through the tree, read the data, and act on it or write it to a stream.
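
A parse-tree node doesn't need to be fancy to start with. Here's a minimal sketch (the fixed fan-out and fixed-size text buffer keep it short; a real implementation would grow both dynamically and store the terminal type alongside the text):

```c
#include <stdlib.h>
#include <string.h>

/* A minimal parse-tree node: each node holds its token text and a list of
 * children. Terminals end up as leaves, non-terminals as interior nodes. */
typedef struct Node {
    char text[32];
    struct Node *child[4];   /* fixed fan-out for brevity */
    int nchild;
} Node;

static Node *node_new(const char *text)
{
    Node *n = calloc(1, sizeof *n);
    if (n != NULL)
        strncpy(n->text, text, sizeof n->text - 1);
    return n;
}

static void node_add(Node *parent, Node *child)
{
    if (parent->nchild < 4)
        parent->child[parent->nchild++] = child;
}

/* Count the leaves (terminals) by walking the tree depth-first;
 * the "act on it or write it out" pass would walk it the same way. */
static int count_leaves(const Node *n)
{
    if (n->nchild == 0)
        return 1;
    int total = 0;
    for (int i = 0; i < n->nchild; i++)
        total += count_leaves(n->child[i]);
    return total;
}
```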

Lookahead seems like a viable solution for determining the highest-order non-terminal. That, I can do.

A bonus question: would regular expressions help me with tokenization? I can think of a way they might, but it could be too costly. I already have regular expressions working, so if I can use them to expedite the process, I'll probably favor that.

I strongly suggest you pick up a book or four on parsing theory. There's a lot of stuff that has been well-understood about parsing for decades, and you'd be doing yourself a huge favor to read up on it instead of trying to rediscover it all the hard way.

Just as an example, to answer your last question: regular expressions are well known to have the same expressive power as regular grammars. Since most lexing is based on a character-level regular grammar, a lexer and a regular-expression engine are essentially interchangeable. In fact, many lexer generators are designed to use a regular-expression language to generate the lexical-analysis code.
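
For instance, using the POSIX regex API available on most Unix-like systems (including Cygwin), a single anchored pattern can recognize a token class at the head of the input; here the pattern is the classic identifier regex:

```c
#include <regex.h>
#include <stddef.h>

/* Use a POSIX regular expression as a token recognizer: return the length
 * of the identifier at the start of `s`, or 0 if none matches. */
static size_t match_identifier(const char *s)
{
    regex_t re;
    regmatch_t m;
    size_t len = 0;

    /* anchored pattern: only match at the very start of the input */
    if (regcomp(&re, "^[A-Za-z_][A-Za-z0-9_]*", REG_EXTENDED) != 0)
        return 0;
    if (regexec(&re, s, 1, &m, 0) == 0 && m.rm_so == 0)
        len = (size_t)m.rm_eo;
    regfree(&re);
    return len;
}
```

Compiling each token pattern once up front, rather than per call as in this sketch, is what you'd do in practice; that answers the cost concern.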

At the very least, go read up on the Wikipedia articles on regular expressions, lexical analysis, parsing, and so on. You'll save yourself a tremendous amount of pain by learning from the experts instead of banging your head against decades of solved mysteries.