The personal ravings of a security consultant that do not fit into his official channels. Want to see all his excuses not to work on WaspVM, MOSREF and IOActive? Here you go!

Sunday, November 14, 2010

Lexical Analysis of C using Python and Ply

Code reviewers fall into two camps: those who rely on grep and their favorite text editor for review, and those who rely on a sophisticated language-specific review environment or IDE with a cross-reference generator. Consultants tend to be in the former camp, as getting a customer's random code base into an IDE can be almost as miserable as getting it out.

I use a hybrid strategy: a simple webapp that does syntax highlighting and grep, with a few simple features that let me combine common browsing habits (history, document tabs and linking) with a minimal-expectations environment. It isn't beautiful or featureful, but it doesn't interrupt my flow.

Of course, there's always room for improvement, like a cross-reference of identifiers and the source files that mention them. This requires simple lexical analysis, which is where a smart C programmer reaches for Flex. So, where does a Python programmer go? My best guess is Ply (Python Lex-Yacc), whose lex module merges Lex semantics with Python metaprogramming.
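
To make the cross-reference idea concrete, here is a rough sketch of the kind of index in question: a map from each identifier to the files and lines that mention it. It deliberately uses only the standard library's re module rather than Ply, so it is a stand-in for illustration, not the lexer this post is about. The names TOKEN, KEYWORDS and index_identifiers, and the two sample files, are made up for the example, and the scanner assumes well-formed comments and literals and makes no attempt to understand preprocessor directives.

```python
import re
from collections import defaultdict

# C89 keywords, filtered out so that only user-chosen names are indexed.
KEYWORDS = frozenset("""
    auto break case char const continue default do double else enum extern
    float for goto if int long register return short signed sizeof static
    struct switch typedef union unsigned void volatile while
""".split())

# Alternatives are tried left to right at each position, so comments,
# literals and numbers are consumed before anything inside them can be
# mistaken for an identifier.
TOKEN = re.compile(r"""
    (?P<comment>/\*.*?\*/|//[^\n]*)       # block and line comments
  | (?P<string>"(?:\\.|[^"\\\n])*")       # string literal
  | (?P<char>'(?:\\.|[^'\\\n])*')         # character literal
  | (?P<number>\d[\w.]*)                  # numbers, e.g. 0x1F or 10UL
  | (?P<ident>[A-Za-z_][A-Za-z0-9_]*)     # identifier or keyword
""", re.VERBOSE | re.DOTALL)


def index_identifiers(sources):
    """Map each identifier to a sorted list of (filename, line) pairs.

    `sources` is a dict of filename -> file contents.
    """
    xref = defaultdict(set)
    for name, text in sources.items():
        line, pos = 1, 0
        for m in TOKEN.finditer(text):
            # Advance the line counter by the newlines passed since the
            # previous token started (this covers multi-line comments too).
            line += text.count("\n", pos, m.start())
            pos = m.start()
            if m.lastgroup == "ident" and m.group() not in KEYWORDS:
                xref[m.group()].add((name, line))
    return {ident: sorted(refs) for ident, refs in xref.items()}


# Hypothetical input: two tiny "files" that share one global.
sources = {
    "a.c": 'int count; /* count is unused */\nvoid bump(void) { count++; }\n',
    "b.c": 'extern int count;\nconst char *s = "count";\n',
}
xref = index_identifiers(sources)
for ident in sorted(xref):
    print(ident, xref[ident])
```

The mentions of count inside the comment and the string literal are skipped, which is the whole reason a bare grep is not enough here. A regex like this gets unwieldy quickly once scopes and more token types enter the picture, which is roughly where a Lex-style tool starts to earn its keep.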

So, in WEPMA fashion, here is the interesting bit: a lexical analyzer that produces identifiers, line numbers, and tokens indicating the start and end of lexical scopes. It is barely smart enough to filter out comments and strings, and it tolerates unanticipated syntactic elements because, obviously, I couldn't be bothered to implement a full C lexer.