Now that I'm able to grab each line, I'm trying to figure out how to define my tokens. I really like MJD's approach to lexing and have been referencing HOP::Lexer::Article (as well as Higher Order Perl), and the general wisdom is that we should break each line down into the applicable tokens.

OK, so I think I understand the code as presented (though I have no idea why HOP::Lexer "use"s (imports?) HOP::Stream but doesn't actually use the module... but that's really irrelevant to my use of it), and I think I get the gist of why we want our tokens in "TYPE", "TOKEN" format.

What I'm really not grokking is the how of defining/identifying tokens.

For my sudoers file, a line can be one of three types: a comment, an alias definition, or a rule definition. Comments _should_ be easy since the line is just prefixed with a "#" (though I just thought of an edge case: rules that have been commented out might end with a trailing "\" continuation, so I may want to parse comments as rules. "Should" is a funny word...). So I'm currently trying to tackle parsing alias definitions.
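To illustrate that edge case, a first pass at classifying comment lines might look like the sketch below. The sample lines and the continuation handling are my assumptions about how a commented-out rule would appear, not code from the question:

```perl
use strict;
use warnings;

# Distinguish plain comments from commented-out rules that end
# in a "\" line continuation (the edge case described above).
for my $line ( '# plain comment', "# root ALL=(ALL) ALL \\" ) {
    next unless $line =~ /^\s*#/;    # not a comment at all
    if ( $line =~ /\\\s*$/ ) {
        print "commented-out rule, possibly continued: $line\n";
    }
    else {
        print "plain comment: $line\n";
    }
}
```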

There are four types of aliases: "Host_Alias", "User_Alias", "Runas_Alias", and "Cmnd_Alias". Alias definitions use the format:
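For illustration, alias definitions take a shape along these lines (the alias and member names are invented):

```
Host_Alias   WEBSERVERS = www1, www2
User_Alias   ADMINS = alice, bob
Runas_Alias  SERVICES = sshd, www
Cmnd_Alias   RESTARTCMDS = /usr/sbin/service apache2 restart
```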

(users don't really need to run as sshd; that's just for example purposes, but the accounts a user would sudo as could be any service or user account)

HOP::Lexer::make_lexer takes an iterator, then a list of array refs of the form [ $label, $pattern, $optional_transform_sub ]. The keywords are easy since we can just match against literal text, e.g. (My::Sudoers::Iterator returns an iterator that grabs a logical line, joining physical lines continued with a trailing "\"):
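A sketch of what such token definitions might look like is below. The token names, patterns, and stand-in iterator are my own guesses, not code from the question, and I've written the alternation with a non-capturing `(?:...)` group (a capturing group there is the problem described next):

```perl
use strict;
use warnings;
use HOP::Lexer 'make_lexer';

# Stand-in for My::Sudoers::Iterator: returns one logical line
# per call, undef when exhausted.
my @lines = ('User_Alias ADMINS = alice, bob');
my $iter  = sub { shift @lines };

my $lexer = make_lexer(
    $iter,
    # Keywords are easy: match the literal text.
    [ 'ALIAS_TYPE', qr/(?:Host|User|Runas|Cmnd)_Alias/ ],
    [ 'EQUALS',     qr/=/                              ],
    [ 'COMMA',      qr/,/                              ],
    [ 'WORD',       qr/[\w\/.-]+/                      ],
    [ 'SPACE',      qr/\s+/, sub { () } ],  # discard whitespace
);

# Each call returns a [ $label, $text ] pair (or plain text for
# anything unmatched), e.g. [ 'ALIAS_TYPE', 'User_Alias' ].
while ( defined( my $token = $lexer->() ) ) {
    next unless ref $token;
    printf "%-10s %s\n", @$token;
}
```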

The big problem here is that HOP::Lexer itself uses capturing parentheses to extract each token, so capturing groups in my own patterns break the module. Additionally, "(.*+)" is usually a bad idea, but I couldn't figure out how to write that pattern better. Also, I don't think HOP::Lexer will be able to "see" tokens in a line that have already been consumed.

The way I'm currently dealing with aliases is to split on the equals sign, then split the left half on spaces to get the alias type and name, and split the right half on commas. I don't think this approach is really appropriate, as it requires further logic to make sense of the mess, as opposed to just lexing the string to obtain tokens. (Obviously I'll need to make use of the tokens at a later point in my application, but trying to do too much at once is causing me headaches when debugging edge cases. Lexing into tokens first would make it easier to determine what each piece of a statement means.)
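The split-on-delimiters approach described above can be sketched like this (the sample line is invented for illustration):

```perl
use strict;
use warnings;

my $line = 'User_Alias ADMINS = alice, bob';

# Split on the equals sign, then break apart each half.
my ( $left, $right ) = split /\s*=\s*/, $line, 2;
my ( $type, $name )  = split /\s+/,     $left;
my @members          = split /\s*,\s*/, $right;

print "type=$type name=$name members=@members\n";
# prints: type=User_Alias name=ADMINS members=alice bob
```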

I realize this is not the lexing solution you are after, but I want to throw it out there. I also realize it's probably not the most elegant solution, but it does seem to work with the input you specified.

That looks like a really nice lexer, but I must admit that I have become very fond of Parse::RecDescent. Yes, a full-on parser, driven by a grammar. I’ve asked that tool to do fairly ridiculous things, like parsing a conglomeration of SAS programs, Korn shell scripts and Tivoli schedule files ... hundreds of ’em ... and it Just Did It™ with style and grace. I would basically take that approach instead of building my own program to navigate through the file’s semantic structure, even with a good lexer by my side.

Furthermore, you can find an EBNF grammar description for the sudoers file here: http://www.sudo.ws/sudoers.man.html. No, P::RD does not consume such grammars directly (although other Perl parsers do ...), but it shows you outright what the proper grammar structure ought to be. I think that this might save you a lot of messy coding.
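For what it's worth, the alias-definition portion might be sketched like this in P::RD. This is not the grammar from the man page; the rule names, patterns, and actions are invented for illustration:

```perl
use strict;
use warnings;
use Parse::RecDescent;

my $grammar = q{
    alias_def  : alias_type name '=' member(s /,/)
                 { [ $item[1], $item[2], $item[4] ] }
    alias_type : 'Host_Alias' | 'User_Alias' | 'Runas_Alias' | 'Cmnd_Alias'
    name       : /[A-Z][A-Z0-9_]*/
    member     : /[\w\/.-]+/
};

my $parser = Parse::RecDescent->new($grammar) or die "Bad grammar\n";

# Yields something like [ 'User_Alias', 'ADMINS', [ 'alice', 'bob' ] ]
my $parsed = $parser->alias_def('User_Alias ADMINS = alice, bob')
    or die "Parse failed\n";
```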