Note: this is a rough writeout of what will be presented at Uninet Infosec
2002 on Saturday, 20 April, 2002. A Babelfish translation into Spanish and
French appears below. The slides are available on my website at:
http://www.monkey.org/~jose/presentations/czech-rubicon02.d/ ... this page
is available at http://www.monkey.org/~jose/presentations/czech-rubicon02.d/czech.html

INTRODUCTION

[slide 1]
This talk is designed to introduce the application of lexical analysis
to source code analysis. Specifically, the design and implementation of a
static analysis tool for the C programming language, named "czech", will
be presented. Several inherent flaws in static analysis, the use of lexical
analysis in source code reviews, and specific lessons learned from failures
in the implementation of czech will be discussed.

[slide 2]
Briefly, I will give an overview of some of the current research I am familiar
with in this arena. Several people are attempting to leverage many years
of source code evaluation methods and tools to automate security analysis of
their code. This is an area of high interest for corporate and academic
researchers alike. I will introduce what lexical code generation tools are and
how they operate, specifically the use of the "flex" tool, a freely available
implementation of lex. I will then outline typical strategies for these tools,
and then delve into the specific challenges faced in the execution of the
czech tool, both for the tool and the author as well as for users of the tool.

[slide 3]
Quickly, an overview of the presentation specifically about czech. I will
discuss the design philosophy of czech, and then how I designed and then
executed the implementation. The performance of czech is something I'm quite
proud of, interestingly enough. I'll give some examples of how it operates and
show you the rough output of the tool, and discuss some of the outstanding
bugs as it currently stands. Lastly, I will discuss some of the future
directions for czech and the whole of project pedantic.

[slide 4]
Despite all of the interest it's getting these days, source code analysis is not
anything new. Software engineering researchers have been doing this for years.
In the security realm, this has become a hot area for researchers in recent
years, and many big groups have chimed in. These include David Wagner, David
Evans, and Microsoft.

The two big approaches are static and dynamic analysis. Static analysis
typically looks at the source code itself. Dynamic analysis, in contrast,
takes the whole or parts of the program and runs it with varying input,
monitoring the behavior of the program. Static analysis focuses on things
like logical analysis of the design, flow analysis as it handles data, and
the analysis of type qualifiers in arguments. All of these have their
varying strengths, including extensibility, ease of use, or completeness
of coverage.

It should definitely be noted that no method gives total coverage, and each
misses some aspects. Also, there is a problem related to Turing's 'halting
problem': you can only examine a small fraction of the possible states of
the code in any given time frame. As such, you have to focus in on some
aspects and make tradeoffs between completeness and efficiency.

[slide 5]
Project pedantic certainly owes a lot to some of the more accessible
static analysis tools. Pscan was one of the first format string analysis
tools, and could be used to find some common format string problems (see also
KF's talk later today for what those mean). ITS4, and later RATS, were also
some of the most commonly cited tools for common programming errors.
Both are also based on lexical analysis. Flawfinder, written in Python,
works similarly to RATS, but makes some smarter decisions. Lastly, two
great tools to note are Lclint and splint, both from the research of
David Evans and coworkers in Virginia (USA).

Czech takes a lot of inspiration from these tools. In fact, I started writing
it after writing a piece for Linux Journal on RATS, ITS4, and Flawfinder and
getting rather upset with them. Also, during the design of czech, I spoke
to David Wheeler, the author of Flawfinder. I tried to leverage a lot from
this 'prior art'. I should also note, for humor's sake, that the test
suite for czech breaks RATS and Flawfinder.

[slide 6]
Lex is a tool for building parsers of input streams. A lexical analyzer looks
for patterns as the input flows by and performs actions based on them. The lex
tool takes this specification of patterns and actions and generates C code
from it, which is then compiled. The patterns are regular expressions, which
makes them powerful and easy to construct. The actions are written largely
in C, with some macros and functions from the lex and yacc libraries.

[slide 7]
Here we see an example of some of the lex code in the main scanner in
czech. Some of the actions match some obvious things, like tabs,
newlines, and single characters, or words. The function "unput()" is
a lex function which places what you tell it to back on the input stream.
REJECT is a macro that makes the scanner fall back to the 'second best'
rule that matched the input. Once you have started your action, you can
continue to access the stream via the input() function.
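The scanner itself isn't reproduced in this writeout, so here is a rough,
hypothetical flex fragment in the same spirit (not czech's actual rules;
the comment-eating action is adapted from the flex manual) showing a keyword
action and the use of input() to keep reading the stream:

```lex
%{
/* Illustrative sketch only -- not czech's actual scanner. */
#include <stdio.h>
%}
%%
"strcpy"    { printf("possible buffer overflow in strcpy\n"); }
"/*"        { /* consume a C comment directly from the stream */
              int c;
              for (;;) {
                  while ((c = input()) != '*' && c != EOF)
                      ;               /* eat up text of comment */
                  if (c == '*') {
                      while ((c = input()) == '*')
                          ;
                      if (c == '/')
                          break;      /* found the end */
                  }
                  if (c == EOF)
                      break;
              }
            }
[ \t\n]+    ;   /* eat whitespace and newlines */
.           ;   /* ignore everything else */
%%
```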

This piece of code is pretty ugly, by the way. I really need to clean
it up to make it more readable and do some proper grammar checking.
Niels Provos, a friend of mine, read it and said, "It makes my brain
hurt."

[slide 8]
Like I mentioned above, you run this parser through the lex generator
to generate C code. Then you can compile it:

$ lex -t file.l > file.c
$ cc -c file.c
$ cc -o file file.o -ll -ly

That runs the parser "file.l" through lex to generate some C code,
which you then compile. In the final linking step you link in the lex
and yacc libraries for specific functions from them. Pretty simple.
The flex manpage is a great tutorial on lex, by the way.

[slide 9]
The project strategy was pretty simple in the abstract, then: focus the
efforts on a lex parser. A yacc parser is for later (more below).

The choice between a full C parser and a keyword tagger was actually a
decision between a complex design and a simple design. I chose the simpler
one, but I will eventually have to move closer towards a C parser. Czech
understands the improper usage of dangerous functions, like the printf()
family for format string problems and the use of strcat() in loops, which
should never be done.

[slide 10]
The design philosophy of czech was pretty important for me. The main points
are:

don't ask much of the developer
it's time-consuming, error-prone, and a big turnoff.
An example is Lclint, a powerful tool which requires
annotations of the code for real use. Cqual is another
example, which requires developers to recast their code
into a logic structure. Instead, I was shooting for
a drop-in replacement for CC in your Makefile.

make the output easy to read
don't give them too much more to worry about

make it extendable
if you extend an API, you should be able to train the tool
pretty easily. (I failed here miserably)

work pretty fast
even though you will have to spend a lot more time looking
over the output, you should still return in a reasonable
amount of time

attempt to understand program flow
czech divides it into ingress and egress (inbound and outbound)
data flows

[slide 11]
The challenges facing any of these tools are very elemental. First you have
false positives, and then you have false negatives. Each will lead to
decreased confidence in your tool, and discourage developers. The tool should
be able to understand new functions, such as those in glib or BIND, making it
useful for really large projects like GNOME. Lastly, remember that C is a
very flexible language, and no two people code alike. Your tools should be
able to deal with this reality.

[slide 12]
Czech as it stands is an initial attempt to implement these ideas. It uses
preprocessing in a first pass to expand macros and grab function prototypes,
as well as variable declarations. Basic type qualifying is performed
as it builds two lists of data types. Data that is input from the user is
going to be treated far more strictly than data that is statically defined.
This matters in the case of a string format issue, for example, such as the
improper handling of a string handed to a function like syslog().

Syslog has to worry about string formatting, but if you've defined a string
yourself, you know what kind of data you have there, so you shouldn't
throw a red flag on that (probably just an orange flag). Now, if you
used a function like that and *stuff had user-supplied content, then
you should definitely check its formatting and its bounds. Czech attempts
to do this for you with these split data types.

[slide 13]
Czech is the first available component of project pedantic, which is designed
to be a set of source code analysis tools. It's written in about 1000 lines
of code, runs really fast (see performance, below), and is pretty easy
to use. Currently it's not fully functional. While it can build the lists,
they're not yet used. This is more a matter of time and motivation right
now, but it should be pretty easy (using the OpenBSD QUEUE(3) interface).
Lastly, there are several real show-stopper bugs in it right now.

However, it's actually ready to drop in in place of your C compiler in many
Makefiles. Also, I've used it to actually find some bugs and have reported
them to the right people.

[slide 14]
Czech's key features are summarized here:

it knows a bit about safe usage
a strcpy() with a constant source string is safer than one
with a variable-length string, but still not safe. And it can
safely ignore the use of va_arg() in the printf() family.

it knows a bit about unsafe usage
the biggest one is the use of strcat() in a loop, which
can quickly become very dangerous.

it works by tokenizing each line of code
it then evaluates the function and makes decisions based
on that.

[slide 15]
Czech's performance is pretty amazing, given the entangled nature of
the code. In single pass mode (as czech -C *.c) it can do about 1
million lines per minute. When you do full analysis using preprocessing,
you see about a 10x loss in performance.

I should note that these numbers are on my laptop, a P3/600 OpenBSD system.
Filesystems, memory, and CPU will, of course, vary these numbers.

[slide 16]
Here's an example of czech in action. In this trimmed example, czech
was used in the Makefile as a replacement for CC:

$ make CC=czech
czech -O2 -c www6to4.c
initializing czech 0130-04082002 in /home/jose/software/www6to4-1.5
scanning www6to4.c ...
line 120: possible buffer overflow in strcpy
line 487: possible buffer overflow in strcpy
line 505: possible buffer overflow in strcpy
total number of lines: 683
total number of matches: 3

And it finds a few keywords it warns you about. In the next slide
we analyze part of the output.

[slide 17]
Here's a brief analysis of one of the strcpy() calls czech warned you
about:

In line 481 buf is a user-supplied variable, bounds-checked by fgets().
In line 487 it's used in strcpy() from buf to tmp. The risk here depends
on the relative sizes of the buffers, tmp and buf. One of the things
czech has to learn is how to calculate those values ...

In this case it is not an off-by-one error, as fgets() will accept at most
sizeof(buf) - 1 characters. Both buf and tmp are defined as BUFSIZE
in length, so the strcpy on line 487 will copy at most BUFSIZE bytes from
buf to tmp, including the NUL terminator.

[slide 18]
So now let's talk about the limitations of czech. Like all lexical analysis
tools for static analysis, it doesn't really get to learn program flow. It
only looks top-down in any one file, and has only the most basic
understanding of input and output.

Czech cannot examine all of the possible states of the code; instead it takes
a basic-case analysis. Perhaps it should take a worst-case analysis. Like I
said above, czech doesn't calculate the size of a variable's allocated buffer
(but it should be easy to do so), or the return sizes of functions. And
of course it has some serious bugs in it which prevent me from rolling up
a 0.1 release (but you can check it out in CVS).

[slide 19]
I would like to finish czech, but I will have to move to yacc grammar to do
so, I think. This will certainly make the lex code easier to understand.
I will also have to construct a rudimentary C parser, and of course examine
the sizes of the arguments to the functions for comparison. Lclint does this
for example, with a basic prototype of:

strcpy(dest, src);
/* insist (size(src) <= sizeof(dest)) */

The second generation tool is in the design phase. It's named 'prague', a play
on the capital of the Czech Republic, as well as on the shortened "prog". I'm
shooting for a dynamic analysis tool, possibly using the ptrace(2)
functions to monitor arguments and how they are called, or perhaps the Gdb
analysis engine (suggested by a couple of friends). I may be able to
simply do static analysis from the output of the ktrace facility.

[slide 20]
To conclude, while it certainly has some limitations, source code analysis
using lexical analysis techniques is worthwhile for development. However,
it can only assist the developer, not replace a manual audit. Lastly, such
tools are limited in the scope of their known grammar. Czech, an
implementation of such a tool, was quick to code, is pretty easy
to use for most people, and runs very fast on lots of code. However,
it has a way to go before it can return really useful data in a majority
of situations.

Thank you, and I want to thank sarnold, viZard, MJesus, Fernand0, and Oroz, as well
as all of the other people involved in Uninet Infosec. It's an honor to present
here with such esteemed fellow presenters. Thank you!