12 Answers

There are three options really, all three of them preferable in different situations.

Option 1: parser generators, or 'you need to parse some language and you just want to get it working, dammit'

Say, you're asked to build a parser for some ancient data format NOW. Or you need your parser to be fast. Or you need your parser to be easily maintainable.

In these cases, you're probably best off using a parser generator. You don't have to fiddle around with the details, you don't have to get lots of complicated code to work properly, you just write out the grammar the input will adhere to, write some handling code and presto: instant parser.
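To make the "just write out the grammar" step concrete, here is a sketch of what such a specification might look like in a yacc/bison-style generator (the rule names and actions are illustrative, not taken from any answer here):

```yacc
expr : expr '+' term     { $$ = $1 + $3; }
     | expr '-' term     { $$ = $1 - $3; }
     | term              { $$ = $1; }
     ;
term : term '*' factor   { $$ = $1 * $3; }
     | term '/' factor   { $$ = $1 / $3; }
     | factor            { $$ = $1; }
     ;
factor : NUMBER          { $$ = $1; }
       | '(' expr ')'    { $$ = $2; }
       ;
```

The generator turns rules like these into a working parser; the `{ ... }` actions are the "handling code" mentioned above.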

The advantages are clear:

It's (usually) quite easy to write a specification, in particular if the input format isn't too weird (option 2 would be better if it is).

You end up with a maintainable piece of work that is easily understood: a grammar definition usually flows a lot more naturally than code.

The parsers generated by good parser generators are usually a lot faster than hand-written code. Hand-written code can be faster, but only if you know your stuff - this is why most widely used compilers use a hand-written recursive-descent parser.

There's one thing you have to be careful of with parser generators: they can sometimes reject your grammars. For an overview of the different types of parsers and how they can bite you, you may want to start here. Here you can find an overview of a lot of implementations and the types of grammars they accept.

Option 2: hand-written parsers, or 'you want to build your own parser, and you care about being user-friendly'

Parser generators are nice, but they aren't very friendly to the user (the end user, not you). You typically can't give good error messages, nor can you provide error recovery. Perhaps your language is very weird and parser generators reject your grammar, or you need more control than the generator gives you.

In these cases, using a hand-written recursive-descent parser is probably the best. While getting it right may be complicated, you have complete control over your parser so you can do all kinds of nice stuff you can't do with parser generators, like error messages and even error recovery (try removing all the semicolons from a C# file: the C# compiler will complain, but will detect most other errors anyway regardless of the presence of semicolons).
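To make this concrete, here is a minimal sketch of a recursive-descent parser for arithmetic, written in Python purely for illustration (the answer itself talks about C#). The point is the error messages: because every parsing decision is ordinary code, the parser can say exactly what it expected and where.

```python
# A minimal recursive-descent parser for arithmetic expressions.
# All names here are illustrative, not from the original answer.

def parse(src):
    pos = 0

    def error(expected):
        found = src[pos] if pos < len(src) else "end of input"
        raise SyntaxError(f"at position {pos}: expected {expected}, found {found!r}")

    def peek():
        return src[pos] if pos < len(src) else None

    def expr():          # expr := term (('+'|'-') term)*
        nonlocal pos
        value = term()
        while peek() in ("+", "-"):
            op = src[pos]; pos += 1
            rhs = term()
            value = value + rhs if op == "+" else value - rhs
        return value

    def term():          # term := factor (('*'|'/') factor)*
        nonlocal pos
        value = factor()
        while peek() in ("*", "/"):
            op = src[pos]; pos += 1
            rhs = factor()
            value = value * rhs if op == "*" else value / rhs
        return value

    def factor():        # factor := digits | '(' expr ')'
        nonlocal pos
        if peek() == "(":
            pos += 1
            value = expr()
            if peek() != ")":
                error("')'")
            pos += 1
            return value
        start = pos
        while peek() is not None and src[pos].isdigit():
            pos += 1
        if start == pos:
            error("a number or '('")
        return int(src[start:pos])

    value = expr()
    if pos != len(src):
        error("end of input")
    return value
```

Error recovery (for example, skipping ahead to the next statement after reporting a problem) would just be more code in the same style, which is exactly the control a generator doesn't give you.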

Hand-written parsers also usually perform better than generated ones, assuming the quality of the parser is high enough. On the other hand, if you don't manage to write a good parser - usually due to (a combination of) lack of experience, knowledge or design - then performance is usually slower. For lexers the opposite is true though: generally generated lexers use table lookups, making them faster than (most) hand-written ones.

Education-wise, writing your own parser will teach you more than using a generator. You have to write more, and more complicated, code after all, plus you have to understand exactly how you parse a language. On the other hand, if you want to learn how to create your own language (that is, get experience at language design), either option 1 or option 3 is preferable: if you're developing a language, it will probably change a lot, and options 1 and 3 give you an easier time with that.

Option 3: hand-written parser generators, or 'you're trying to learn a lot from this project and you wouldn't mind ending up with a nifty piece of code you can re-use a lot'

This is the path I'm currently walking down: you write your own parser generator. While highly nontrivial, doing this will probably teach you the most.

To give you an idea what doing a project like this involves I'll tell you about my own progress.

The lexer generator

I created my own lexer generator first. I usually design software starting with how the code will be used, so I thought about how I wanted to be able to use my code and wrote that usage code first (it's in C#).

The input string-token pairs are converted into a corresponding recursive structure describing the regular expressions they represent, using the ideas of an arithmetic stack. This is then converted into an NFA (nondeterministic finite automaton), which is in turn converted into a DFA (deterministic finite automaton). You can then match strings against the DFA.
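The author's C# code is not reproduced here, but the NFA-to-DFA half of that pipeline can be sketched roughly as follows in Python. The NFA below is hand-built for the classic example regex (a|b)*abb; a real lexer generator would construct it from the pattern (for instance via Thompson's construction). All names are mine, not the author's.

```python
# NFA transitions: state -> symbol -> set of next states ('' = epsilon).
# This NFA accepts the language of (a|b)*abb, with accept state 10.
NFA = {
    0: {"": {1, 7}},
    1: {"": {2, 4}},
    2: {"a": {3}},
    4: {"b": {5}},
    3: {"": {6}},
    5: {"": {6}},
    6: {"": {1, 7}},
    7: {"a": {8}},
    8: {"b": {9}},
    9: {"b": {10}},
}
START, ACCEPT = 0, {10}

def eps_closure(states):
    """All states reachable from `states` via epsilon moves alone."""
    stack, seen = list(states), set(states)
    while stack:
        s = stack.pop()
        for t in NFA.get(s, {}).get("", ()):
            if t not in seen:
                seen.add(t)
                stack.append(t)
    return frozenset(seen)

def move(states, sym):
    """States reachable from `states` by one `sym` transition."""
    return {t for s in states for t in NFA.get(s, {}).get(sym, ())}

def subset_construct(alphabet="ab"):
    """Subset construction: each DFA state is a set of NFA states."""
    start = eps_closure({START})
    dfa, todo = {}, [start]
    while todo:
        state = todo.pop()
        if state in dfa:
            continue
        dfa[state] = {}
        for sym in alphabet:
            nxt = eps_closure(move(state, sym))
            dfa[state][sym] = nxt
            if nxt not in dfa:
                todo.append(nxt)
    return start, dfa

def matches(text):
    start, dfa = subset_construct()
    state = start
    for ch in text:
        state = dfa[state].get(ch)
        if state is None:          # symbol outside the alphabet
            return False
    return bool(state & ACCEPT)
```

Once the DFA table is built, matching is a plain table walk per character, which is why generated (table-driven) lexers tend to be fast.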

This way, you get a good idea how exactly lexers work. In addition, if you do it the right way the results from your lexer generator can be roughly as fast as professional implementations. You also don't lose any expressiveness compared to option 2, and not much expressiveness compared to option 1.

I implemented my lexer generator in just over 1600 lines of code. This code makes the above work, but it still generates the lexer on the fly every time you start the program: I'm going to add code to write it to disk at some point.

If you want to know how to write your own lexer, this is a good place to start.

The parser generator

You then write your parser generator. I again refer to the overview of the different kinds of parsers mentioned earlier - as a rule of thumb, the more they can parse, the slower they are.

Speed not being an issue for me, I chose to implement an Earley parser. Advanced implementations of an Earley parser have been shown to be about twice as slow as other parser types.

In return for that speed hit, you get the ability to parse any kind of grammar, even ambiguous ones. This means you never have to worry about whether your parser has any left-recursion in it, or what a shift-reduce conflict is. You can also define grammars more easily using ambiguous grammars when it doesn't matter which parse tree is the result - for example, when it doesn't matter whether you parse 1+2+3 as (1+2)+3 or as 1+(2+3).

(Note that IntWrapper is simply an Int32, except that C# requires it to be a class, hence I had to introduce a wrapper class)

I hope you see that this setup is very powerful: any grammar you can come up with can be parsed, and you can embed arbitrary bits of code in the grammar to perform all kinds of tasks. If you manage to get this all working, you can re-use the resulting code for a lot of tasks very easily: just imagine building a command-line interpreter with this piece of code.
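The grammar-definition code from the answer is not shown here; as a rough sketch of the underlying algorithm, here is a minimal Earley recognizer in Python, run on the deliberately ambiguous grammar E -> E + E | n, which would give an LR-style generator a shift-reduce conflict. The grammar and all names are mine, for illustration only.

```python
# Minimal Earley recognizer. An item is (head, body, dot, origin):
# a production `head -> body` parsed up to `dot`, started at chart
# position `origin`.
GRAMMAR = {
    "E": [["E", "+", "E"], ["n"]],
}
START = "E"

def earley_recognize(tokens):
    chart = [set() for _ in range(len(tokens) + 1)]
    for body in GRAMMAR[START]:
        chart[0].add((START, tuple(body), 0, 0))

    for i in range(len(tokens) + 1):
        changed = True
        while changed:                      # run predict/scan/complete to fixpoint
            changed = False
            for head, body, dot, origin in list(chart[i]):
                if dot < len(body):
                    sym = body[dot]
                    if sym in GRAMMAR:      # PREDICT: expand a nonterminal
                        for prod in GRAMMAR[sym]:
                            new = (sym, tuple(prod), 0, i)
                            if new not in chart[i]:
                                chart[i].add(new); changed = True
                    elif i < len(tokens) and tokens[i] == sym:  # SCAN a terminal
                        chart[i + 1].add((head, body, dot + 1, origin))
                else:                       # COMPLETE: advance waiting items
                    for h2, b2, d2, o2 in list(chart[origin]):
                        if d2 < len(b2) and b2[d2] == head:
                            new = (h2, b2, d2 + 1, o2)
                            if new not in chart[i]:
                                chart[i].add(new); changed = True

    # Accept if some start production spans the whole input.
    return any(h == START and d == len(b) and o == 0
               for h, b, d, o in chart[len(tokens)])
```

Note how nothing here cares that the grammar is ambiguous or left-recursive; a full parser would additionally build the (possibly multiple) parse trees out of the chart.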

I think you underestimate the amount of work needed to create a high performance parser and lexer.
–
user1249Dec 23 '10 at 17:01

I've already finished building my own lexer generator and I was quite far along with building my own parser generator when I decided to implement a different algorithm instead. It didn't take me that long to get it all working, but then again I didn't aim for 'high performance', just 'good performance' and 'great asymptotic performance' - Unicode is a bitch to get good running times for and using C# already imposes a performance overhead.
–
Alex ten BrinkDec 23 '10 at 18:13

Very nice answer. I agree with your option 3 for all the reasons you stated above. But I might add that if, as in my case, you are also serious about designing a language, perhaps you should use a parser generator at the same time as creating your own, so you can get a head start on the language issues and see your language in action faster.
–
LefterisDec 16 '12 at 4:41

That depends entirely on what you need to parse. Can you roll your own faster than you could hit the learning curve of a lexer? Is the stuff to be parsed static enough that you won't regret the decision later? Do you find existing implementations overly complex? If so, have fun rolling your own, but only if you aren't ducking a learning curve.

Lately, I've come to really like the lemon parser, which is arguably the simplest and easiest that I've ever used. For the sake of making things easy to maintain, I just use that for most needs. SQLite uses it as well as some other notable projects.

But, I'm not at all interested in lexers, beyond them not getting in my way when I need to use one (hence, lemon). You might be, and if so, why not make one? I have a feeling you'll come back to using one that exists, but scratch the itch if you must :)

The most obvious change that a language workbench makes to the equation is the ease of creating external DSLs. You no longer have to write a parser. You do have to define abstract syntax - but that's actually a pretty straightforward data modeling step. In addition your DSL gets a powerful IDE - although you do have to spend some time defining that editor. The generator is still something you have to do, and my sense is that it isn't much easier than it ever was. But then building a generator for a good and simple DSL is one of the easiest parts of the exercise.

Reading that, I would say that the days of writing your own parser are over and it's better to use one of the libraries that are available. Once you've mastered the library then all DSLs that you create in the future benefit from that knowledge. Also, others don't have to learn your approach to parsing.

Edit to cover comment (and revised question)

Advantages of rolling your own

You'll own the parser and gain all that lovely experience of thinking through an intricate series of problems

You may come up with something special that no-one else has thought of (unlikely but you seem like a clever chap)

It'll keep you occupied with an interesting problem

So in short, you should roll your own when you want to really hack deep into the bowels of a seriously difficult problem that you feel strongly motivated to master.

IMO this is not a good answer to this question. It is just general advice, not suited to the specific case. I'm starting to suspect that the area51.stackexchange.com/proposals/7848 proposal was closed prematurely.
–
bigownNov 9 '10 at 18:31


If the wheel was never re-invented, we wouldn't be travelling at 100kmph+ on a daily basis - unless you're going to suggest large heavy lumps of rock spinning on wooden axles is better than the many many variants of modern tyres used in so many vehicles?
–
Peter BoughtonNov 9 '10 at 23:25

That's a valid opinion, and it's the right intuition. I'm thinking this answer might be more helpful if you could list specific advantages or disadvantages, because this sort of thing entirely depends upon the circumstances.
–
MacneilNov 15 '10 at 0:12

@Peter: It's one thing to reinvent something (implies do it totally differently) but to refine an existing solution to meet additional requirements is better. I'm all for 'improvement', but going back to the drawing board for an already-solved problem seems wrong.
–
JBRWilkinsonNov 15 '10 at 23:55

Identify why all these tools are not good enough - why don't they let you achieve your goal?

Unless you're certain that the oddities in the grammar you're dealing with are unique, you shouldn't just create a single custom parser+lexer for it. Instead, create a tool that will create what you want, but can also be used to fulfil future needs, then release it as Free Software to prevent other people having the same problem as you.

Rolling your own parser forces you to think directly about the complexity of your language. If the language is hard to parse, it is probably going to be hard to understand.

There was a lot of interest in parser generators in the early days, motivated by highly-complicated (some would say "tortured") language syntax. JOVIAL was a particularly-bad example: it required two symbol lookahead, at a time when everything else required at most one symbol. This made generating the parser for a JOVIAL compiler more difficult than expected (as General Dynamics / Fort Worth Division learned the hard way when they procured JOVIAL compilers for the F-16 program).

Today, recursive descent is universally the preferred method, because it is easier for compiler writers. Recursive descent compilers strongly reward simple, clean language design, in that it is a lot easier to write a recursive-descent parser for a simple, clean language than for a convoluted, messy one.

Finally: Have you considered embedding your language in LISP, and letting a LISP interpreter do the heavy lifting for you? AutoCAD did that, and found it made their life a lot easier. There are quite a few lightweight LISP interpreters out there, some embeddable.

Very nice. I'll just add as a point of information that Fortran required almost arbitrary (entire-line) lookahead in order to parse things, before JOVIAL. But at the time, they had no better idea of how to design (or implement) a language.
–
MacneilNov 15 '10 at 0:14

Walking is the best means of transportation, as it gives you time to think whether going where you are going is really worth it. It is healthy too.
–
babouOct 13 '14 at 12:20

I've written a parser for a commercial application once, and I used yacc. There was a competing prototype where a developer wrote the whole thing by hand in C++, and it worked about five times slower.

As for the lexer for this parser, I wrote it entirely by hand. It took -- sorry, it was almost 10 years ago, so I don't remember it precisely -- about 1000 lines in C.

The reason I wrote the lexer by hand was the parser's input grammar. It was a requirement, something my parser implementation had to comply with, as opposed to something I designed. (Of course I would have designed it differently. And better!) The grammar was severely context-dependent, and even lexing depended on semantics in some places. For example, a semicolon could be part of a token in one place, but a separator in a different place - based on the semantic interpretation of some element that was parsed out before. So, I "buried" such semantic dependencies in the hand-written lexer, and that left me with a fairly straightforward BNF that was easy to implement in yacc.
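The original grammar and code are not shown, but the trick described - burying a semantic dependency in the hand-written lexer - can be sketched like this in Python. The "raw mode" flag and its effect on ';' are entirely hypothetical; the real dependency was on the answerer's proprietary format.

```python
# A contrived lexer where a semantic flag, set by something parsed
# earlier, decides whether ';' is a separator token or part of a token.

def lex(text, raw_mode=False):
    tokens, buf = [], []
    for ch in text:
        if ch == ";" and not raw_mode:
            if buf:
                tokens.append("".join(buf)); buf = []
            tokens.append(";")          # ';' is a separator token here
        elif ch.isspace():
            if buf:
                tokens.append("".join(buf)); buf = []
        else:
            buf.append(ch)              # in raw mode ';' falls through to here
    if buf:
        tokens.append("".join(buf))
    return tokens
```

Hiding the condition inside the lexer like this is what keeps the grammar the parser sees context-free and yacc-friendly.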

ADDED in response to Macneil: yacc provides a very powerful abstraction that lets the programmer think in terms of terminals, non-terminals, productions and stuff like that. Also, when implementing yylex() function, it helped me to focus on returning the current token and not worry about what was before or after it. The C++ programmer worked on the character level, without the benefit of such abstraction and ended up creating a more complicated and less efficient algorithm. We concluded that the slower speed had nothing to do with C++ itself or any libraries. We measured pure parsing speed with files loaded in memory; if we had a file buffering problem, yacc wouldn't be our tool of choice to solve it.

ALSO WANT TO ADD: this is not a recipe for writing parsers in general, just an example of how it worked in one particular situation.

++ Good experience. I wouldn't put too much weight on performance. It's easy for otherwise good programs to be slowed down by something silly and unnecessary. I've written enough recursive-descent parsers to know what not to do, so I doubt if there's anything much faster. After all, the characters need to be read. I suspect parsers that run off tables will be a bit slower, but probably not enough to notice.
–
Mike DunlaveyNov 9 '10 at 21:49

The advantage of writing your own recursive descent parser is that you can generate high-quality error messages on syntax errors. Using parser generators, you can make error productions and add custom error messages at certain points, but parser generators just don't match the power of having complete control over the parsing.

Another advantage of writing your own is that it is easier to parse to a simpler representation that doesn't have a one-to-one correspondence to your grammar.

If your grammar is fixed, and error messages are important, consider rolling your own, or at least using a parser generator that gives you the error messages you need. If your grammar is constantly changing, you should consider using parser generators instead.

Bjarne Stroustrup talks about how he used YACC for the first implementation of C++ (see The Design and Evolution of C++). In that first case, he wished he had written his own recursive descent parser instead!

I'm barely convinced the first experiments should be with a parser generator. You gave me some advantages of switching to a custom solution. I'm not deciding anything yet, but it's a useful answer to help me.
–
bigownNov 9 '10 at 18:40

++ This answer is exactly what I would say. I've built numerous languages and almost always used recursive descent. I would only add that there have been times when the language I needed was built most simply by layering some macros on top of C or C++ (or Lisp).
–
Mike DunlaveyNov 9 '10 at 21:40

JavaCC is claimed to have the best error messages. Also, notice the JavaScript error and warning messages on V8 and Firefox, I think they did not use any parser generators.
–
Ming-TangNov 13 '10 at 1:01

The big advantage to writing your own is that you'll know how to write your own. The big advantage to using a tool like yacc is that you'll know how to use the tool. I'm a fan of treetop for initial exploration.

If you have never, ever written a parser I would recommend you do it. It is fun, and you learn how things work, and you learn to appreciate the effort that parser and lexer generators save you from doing the next time you need a parser.

Are you trying to learn how parsers/compilers work? Then write your own from scratch. That's the only way you'd really learn to appreciate all the ins and outs of what they are doing. I've been writing one the past couple of months, and it's been an interesting and valuable experience, especially the 'ah, so that's why language X does this...' moments.

Do you need to put something together quickly for an application on a deadline? Then perhaps use a parser tool.

Do you need something that you'll want to expand upon over the next 10, 20, maybe even 30 years? Write your own, and take your time. It'll be well worth it.

Why not fork an open-source parser generator and make it your own?
If you don't use a parser generator, your code will be very hard to maintain if you make big changes to the syntax of your language.

In my parsers, I used regular expressions (I mean, Perl-style) to tokenize, and used some convenience functions to increase code readability. However, parser-generated code can be faster, thanks to state tables and long switch-cases, which can increase source code size considerably (you may want to .gitignore the generated files).
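As an illustration of the regex-based tokenizing described above, here is a sketch using Python's stdlib re module (the answer's own code used Perl-style regexes; the token names and classes here are mine):

```python
import re

# One alternation of named groups; m.lastgroup tells us which one matched.
TOKEN_RE = re.compile(r"""
    (?P<NUMBER> \d+ )
  | (?P<IDENT>  [A-Za-z_]\w* )
  | (?P<OP>     [+\-*/=()] )
  | (?P<SKIP>   \s+ )
  | (?P<ERROR>  . )
""", re.VERBOSE)

def tokenize(src):
    for m in TOKEN_RE.finditer(src):
        kind = m.lastgroup
        if kind == "SKIP":
            continue                    # drop whitespace
        if kind == "ERROR":
            raise SyntaxError(f"unexpected character {m.group()!r}")
        yield kind, m.group()
```

This is the hand-rolled counterpart of a generated table-driven lexer: shorter and more readable, at the cost of the regex engine doing the state-machine work at runtime.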