12 Answers

Let me admit frankly: building a parser is a tedious job that comes close to compiler technology, but building one can turn out to be a good adventure. A parser usually comes with an interpreter, so you have to build both.

A quick introduction to parsers and interpreters

This is not too technical, so experts, don't fret at me.

When you feed some input into a terminal, the program reading it splits the input into multiple units. The input is called an expression and the units are called tokens. These tokens can be operators or operands. So if you enter 4+5 in a calculator, the expression gets split into three tokens: 4, + and 5. The plus is an operator, while 4 and 5 are operands. The tokens are passed to a program (consider this the interpreter) which contains the definitions of the operators. Based on the definition (in our case, add), it adds the two operands and returns the result to the terminal. Compilers are built on the same idea.
The program that splits an expression into tokens is called a lexer, and the program that checks those tokens against a grammar and arranges them into a structure for further processing and execution is called a parser.
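To make the 4+5 example concrete, here is a minimal Python sketch of the lexer and interpreter steps described above. The names `tokenize`, `evaluate` and `OPERATORS` are made up for this illustration, and the "grammar" is deliberately limited to one operator between two integers.

```python
import operator
import re

# A token is either an integer or one of the four arithmetic operators.
TOKEN_RE = re.compile(r"\s*(\d+|[-+*/])")

# The interpreter side: one definition per operator.
OPERATORS = {
    "+": operator.add,
    "-": operator.sub,
    "*": operator.mul,
    "/": operator.truediv,
}


def tokenize(expression):
    """Lexer: split the expression into a flat list of tokens."""
    expression = expression.rstrip()
    tokens = []
    position = 0
    while position < len(expression):
        match = TOKEN_RE.match(expression, position)
        if match is None:
            raise ValueError(f"unexpected input at position {position}")
        tokens.append(match.group(1))
        position = match.end()
    return tokens


def evaluate(tokens):
    """Parser and interpreter for the single form: operand operator operand."""
    if len(tokens) != 3 or tokens[1] not in OPERATORS:
        raise ValueError("expected: <number> <operator> <number>")
    left, op, right = tokens
    if not (left.isdigit() and right.isdigit()):
        raise ValueError("operands must be integers")
    return OPERATORS[op](int(left), int(right))


print(tokenize("4+5"))            # ['4', '+', '5']
print(evaluate(tokenize("4+5")))  # 9
```

A real parser would handle precedence and nesting; this sketch only shows where the lexer ends and the interpreter begins.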

Lex and Yacc are the canonical tools for building lexers and parsers from a BNF grammar in C, and they are the option I recommend. Many parser generators are clones of Lex and Yacc.

Read the Dragon Book on compilers to get a feel for it. I personally haven't finished the book.

This link would give a super-fast insight into Lex and Yacc under Python

A simple approach

If you just need a simple parsing mechanism with limited functions, turn your requirement into a regular expression and just write a bunch of functions. To illustrate, assume a simple parser for the four arithmetic operations. You would write the operator first and then the list of operands (similar to Lisp), in the style (+ 4 5) or (add [4,5]); then you could use a simple regular expression to extract the operator and the operands to be operated upon.
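A rough Python sketch of that idea, assuming the (+ 4 5) style with integer operands only. The names `COMMAND_RE`, `FUNCTIONS` and `run` are invented for this example.

```python
import re
from functools import reduce

# One function per operator, each taking the list of operands.
FUNCTIONS = {
    "+": lambda values: sum(values),
    "-": lambda values: reduce(lambda a, b: a - b, values),
    "*": lambda values: reduce(lambda a, b: a * b, values),
    "/": lambda values: reduce(lambda a, b: a / b, values),
}

# "(op n n ...)": one operator followed by one or more integers.
COMMAND_RE = re.compile(r"^\(\s*([-+*/])((?:\s+-?\d+)+)\s*\)$")


def run(command):
    match = COMMAND_RE.match(command.strip())
    if match is None:
        raise ValueError(f"cannot parse: {command!r}")
    op, operand_text = match.groups()
    operands = [int(text) for text in operand_text.split()]
    return FUNCTIONS[op](operands)


print(run("(+ 4 5)"))    # 9
print(run("(* 2 3 4)"))  # 24
```

Note that a nested input such as (+ 1 (* 2 3)) is rejected by this pattern, since a single regular expression cannot match arbitrarily nested parentheses.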

Most common cases can easily be solved by this approach. The downside is that you can't have deeply nested expressions with a clear syntax, and you can't easily have higher-order functions.

This is one of the hardest possible ways. Separating the lexing and parsing passes, etc., is probably useful for implementing a high-performance parser for a very complex but archaic language. In the modern world, lexerless parsing is the simplest default option. Parsing combinators or eDSLs are easier to use than dedicated preprocessors like Yacc.
– SK-logic, Dec 21 '11 at 14:43

Agreed with SK-logic but since a general detailed answer is required, I suggested Lex and Yacc and some parser basics. getopts suggested by Anton is also a simpler option.
– Ubermensch, Dec 21 '11 at 15:44

That's what I said: Lex and Yacc are among the hardest ways of parsing, and not even generic enough. Lexerless parsing (e.g., packrat, or simple Parsec-like) is much simpler for the general case. And the Dragon Book is not a very useful introduction to parsing any more; it is too out of date.
– SK-logic, Dec 21 '11 at 15:54

@SK-logic Can you recommend a better, more up-to-date book? The Dragon Book seems to cover all the basics for a person trying to understand parsing (at least in my perception). Regarding Lex and Yacc: though hard, they are widely used, and a lot of programming languages provide implementations of them.
– Ubermensch, Dec 21 '11 at 16:03


@alfa64: be sure to let us know then when you actually code a solution based on this answer
– qes, Dec 29 '11 at 20:09

First, when it comes to grammar, or how to specify arguments, don't invent your own. The GNU-style standard is already very popular and well known.

Second, since you're using an accepted standard, don't reinvent the wheel. Use an existing library to do it for you. If you use GNU-style arguments, there is almost certainly a mature library in your language of choice already. For example: C#, PHP, C.

A good option parsing library will even print formatted help on available options for you.
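As one illustration (in Python, which is my choice of language here, not the asker's), the standard `argparse` module parses GNU-style options and builds the help text from the declarations. The program name `mytool` and its options are made up for this sketch.

```python
import argparse

# Declare the options once; parsing and help text both come from this.
parser = argparse.ArgumentParser(
    prog="mytool", description="Example command-line tool."
)
parser.add_argument("-v", "--verbose", action="store_true",
                    help="print extra detail")
parser.add_argument("-o", "--output", metavar="FILE",
                    help="write results to FILE")
parser.add_argument("inputs", nargs="*", help="input files")

# Parse an explicit list here; a real program would omit the argument
# so that argparse reads sys.argv.
args = parser.parse_args(["-v", "--output", "out.txt", "a.txt", "b.txt"])
print(args.verbose, args.output, args.inputs)

# The formatted help that "mytool --help" would print.
help_text = parser.format_help()
print(help_text)
```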

EDIT 12/27

It seems like you are making this out to be more complicated than it is.

When you look at a command line, it's really quite simple. It's just options and arguments to those options. There are very few complicating issues. Options can have aliases. Arguments can be lists of arguments.

One problem with your question is that you haven't really specified any rules for what type of command line you'd like to deal with. I've suggested the GNU standard, and your examples come close to that (though I don't really understand your first example with the path as the first item?).

If we're talking GNU, any single option can have only one long form and one short form (a single character) as aliases. Any argument containing a space has to be surrounded by quotes. Multiple short-form options can be chained. Short-form options must be preceded by a single dash, long-form options by two dashes. Only the last of a chain of short-form options can take an argument.

All very straightforward. All very common. It has also been implemented in every language you can find, probably five times over.
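These rules can be seen in action with Python's standard `getopt.gnu_getopt` (again, Python is just the language picked for this sketch; the option names are invented). The argument list stands in for what the shell would hand to the program after it has processed the quotes.

```python
import getopt

# Short options: -a and -b are flags; the colon means -o takes an argument.
SHORT_OPTIONS = "abo:"
# Long options: --all is a flag; the "=" means --output takes an argument.
LONG_OPTIONS = ["all", "output="]

# What the program receives for:  -abo out.txt --all "my file.txt"
# The shell has already turned the quoted text into a single argument.
argv = ["-abo", "out.txt", "--all", "my file.txt"]

options, arguments = getopt.gnu_getopt(argv, SHORT_OPTIONS, LONG_OPTIONS)

# Chained short options are split apart; only the last one (-o) got a value.
print(options)    # [('-a', ''), ('-b', ''), ('-o', 'out.txt'), ('--all', '')]
print(arguments)  # ['my file.txt']
```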

Don't write it. Use what's already written.

Unless you have something in mind other than standard command line arguments, just use one of the MANY already existing, tested libraries that do this.

Have you already tried something like http://qntm.org/loco? This approach is much cleaner than any handwritten ad hoc parser, but won't require a standalone code generation tool like Lemon.

EDIT: And a general trick for handling command lines with complex syntax is to combine the arguments back into a single whitespace-separated string and then parse it properly as if it is an expression of some domain-specific language.
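A small Python sketch of that trick, assuming a made-up prefix-style command language with parentheses. The command names `copy` and `newest` and the helper names are hypothetical, and joining with spaces does lose the boundaries of arguments that themselves contained spaces.

```python
import re

# A token is a parenthesis or a run of non-space, non-parenthesis characters.
TOKEN_RE = re.compile(r"[()]|[^\s()]+")


def parse(tokens):
    """Recursive descent: turn the token list into nested Python lists."""
    token = tokens.pop(0)
    if token == "(":
        node = []
        while tokens and tokens[0] != ")":
            node.append(parse(tokens))
        if not tokens:
            raise SyntaxError("missing closing parenthesis")
        tokens.pop(0)  # discard the ")"
        return node
    if token == ")":
        raise SyntaxError("unexpected closing parenthesis")
    return token


def parse_command_line(argv):
    # Undo the shell's word splitting, then tokenize by our own rules.
    source = " ".join(argv)
    tokens = TOKEN_RE.findall(source)
    if not tokens:
        raise SyntaxError("empty command line")
    tree = parse(tokens)
    if tokens:
        raise SyntaxError(f"unexpected trailing input: {tokens}")
    return tree


# The shell splits on whitespace, so the parentheses arrive glued to words.
argv = ["(copy", "(newest", "logs)", "backup)"]
print(parse_command_line(argv))  # ['copy', ['newest', 'logs'], 'backup']
```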

You have not given many specifics about your grammar, just some examples. What I can see is that there are some strings, whitespace and a (probably; your examples are inconsistent in your question) double-quoted string, and then one ";" at the end.

It looks like this could be similar to PHP syntax. If so, PHP comes with a parser you can reuse, and then validate more concretely. Finally, you need to deal with the tokens, but it looks like this is simply done from left to right, so it is really just an iteration over all tokens.

Some examples to re-use the PHP token parser (token_get_all) are given in the answers to the following questions:

If your needs are simple, and you both have the time and are interested in it, I'll go against the grain here and say don't shy away from writing your own parser. It's a good learning experience, if nothing else. If you have more complex requirements (nested function calls, arrays, etc.), just be aware that doing so could take a good chunk of time. One of the big positives of rolling your own is that there won't be an issue of integrating with your system. The downside is, of course, that all the screw-ups are your fault.

Work against tokens, though; don't use hard-coded commands. Then that problem with similar-sounding commands goes away.

I have written programs that work like that. One was an IRC bot which has similar command syntax. There is a huge file that is a big switch statement. It works -- it works fast -- but it's somewhat difficult to maintain.

Another option, which has a more OOP spin, is to use event handlers. You create a key-value array with commands as keys and their dedicated functions as values. When a command is given, you check whether the array has the given key. If it does, call the function. This would be my recommendation for new code.
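In Python the key-value array is a dict of functions. This is only a sketch of the pattern; the commands `join` and `say` and their handlers are invented, loosely in the spirit of an IRC bot.

```python
def cmd_join(args):
    # Hypothetical handler: expects exactly one channel name.
    if len(args) != 1:
        return "usage: join <channel>"
    return f"joining {args[0]}"


def cmd_say(args):
    # Hypothetical handler: echoes the rest of the line.
    return " ".join(args)


# Command name -> handler function.
HANDLERS = {
    "join": cmd_join,
    "say": cmd_say,
}


def dispatch(line):
    tokens = line.split()
    if not tokens:
        return "empty command"
    name, *args = tokens
    handler = HANDLERS.get(name)
    if handler is None:
        return f"unknown command: {name}"
    return handler(args)


print(dispatch("join #python"))     # joining #python
print(dispatch("say hello world"))  # hello world
print(dispatch("quit"))             # unknown command: quit
```

Adding a command then means adding one function and one dict entry, instead of another branch in a large switch statement.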

i've read your code and it's exactly the same scheme as my code, but as i stated, if you want other people to use, you need to add error checking and stuff
– alfa64, Dec 17 '11 at 1:52


@alfa64 Please add any clarifications to the question, instead of comments. It's not very clear what exactly you are asking for, although it's somewhat clear that you are looking for something really specific. If so, tell us exactly what that is. I don't think it's very easy to go from "I think my implementation is very crude and faulty" to "but as i stated, if you want other people to use, you need to add error checking and stuff"... Tell us exactly what's crude about it and what's faulty; it would help you get better answers.
– Yannis Rizos♦, Dec 21 '11 at 4:27

I suggest using a tool instead of implementing a compiler or interpreter yourself. Irony uses C# to express the target language grammar (the grammar of your command line). The description on CodePlex says: "Irony is a development kit for implementing languages on .NET platform."

I've been using NodeJS a lot lately, and Optimist is what I use for command-line processing. I encourage you to search for one you can use for your own language of choice. If there isn't one, write one and open source it :D You may even read through Optimist's source code and port it to your language of choice.

The extensible command language indicates that a DSL is required. I would suggest not rolling your own but using JSON if your extensions are simple. If they are complex, an s-expression syntax is nice.

Error checking implies that your system also knows about the possible commands. That would be part of the post-command system.

If I was implementing such a system from scratch, I would use Common Lisp with a stripped-down reader. Each command token would map to a symbol, which would be specified in an s-expression RC file. After tokenization, it would be evaluated/expanded in a limited context, trapping the errors, and any recognizable error patterns would return suggestions. After that, the actual command would be dispatched to the OS.
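The same pipeline can be roughed out in Python rather than Common Lisp. This is a loose sketch under several assumptions: a dict stands in for the s-expression RC file, `shlex` replaces the stripped-down reader, the command table is invented, and the final step only returns the argument list that would be dispatched instead of running it.

```python
import difflib
import shlex

# Hypothetical command table; in the design above it would be loaded
# from an RC file rather than hard-coded.
COMMANDS = {
    "list": ["ls", "-l"],
    "remove": ["rm", "-i"],
    "copy": ["cp", "-v"],
}


def expand(line):
    """Tokenize a command line and expand it to the OS command to run."""
    tokens = shlex.split(line)
    if not tokens:
        raise ValueError("empty command")
    name, *arguments = tokens
    if name not in COMMANDS:
        # A recognizable error pattern: a near miss of a known command.
        suggestions = difflib.get_close_matches(name, list(COMMANDS), n=3)
        hint = f"; did you mean {', '.join(suggestions)}?" if suggestions else ""
        raise ValueError(f"unknown command {name!r}{hint}")
    return COMMANDS[name] + arguments


print(expand('copy "my file.txt" backup'))  # ['cp', '-v', 'my file.txt', 'backup']

try:
    expand("remvoe old.log")
except ValueError as error:
    print(error)  # unknown command 'remvoe'; did you mean remove?
```

The returned list could then be handed to something like `subprocess.run` for the actual dispatch.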