Millennial C

Recently it became obvious that the
Visual Net data-prep
and index-build subsystems needed refactoring, and I took on the job.
So I’ve been up to my elbows in heavy C coding for a week now—my
first such excursion this millennium.
Herewith some extremely technical low-level notes on the subject, probably
not of interest to non-professionals, except perhaps for a paragraph on the
world-view of the aging coder.
There is some discussion of XML and scaling issues.

Greybeard Coding ·
Every time I take up the cudgels to do some real development, I have to
wonder if this is the last time I’ll be doing this; one way or another,
many programmers are not cutting code any more after twenty years or so.
Whereas I can see the attractiveness of getting paid to have opinions and
meetings and leadership skills, I’m still hooked on the feeling of watching
the piles of executable
abstractions grow higher and take useful form, step by very small step at
my command.

I’m about as good a programmer as I was a decade or two ago.
I’m no longer as strong, no longer willing to stay up till three to get
past some stupid time-dependent bug, and happily encumbered by family and so
on.
But my bag of tricks is large, and somewhere in the dusty coding
cupboards is a perhaps-not-perfect but known-to-work solution to most of what
I run across.

This doesn’t mean that I don’t make stupid mistakes; much of one
day last week was spent writing a tree-builder that produced subtly wrong
results. Then I looked at it with fresh eyes and saw that the recursion was
backward; I must have been thinking about something else for four hours. I
can only assume this happens to other people too.

On C ·
I recall one of the basic tutorials for those new to Unix some twenty
years ago; it contained the immortal line “For any serious programming,
you pretty well have to use C.”
This is no longer true, except when it is; I note that a lot of important
pieces of infrastructure are still written in C, and I don’t think
it’s going away any time soon.

For most things I’d much rather use Java or equivalent, or Python or
equivalent, but sometimes you just have to wrangle shared-memory data
structures hundreds of megabytes in size, not waste any memory, and
count your microseconds.
The obvious alternative was C++, but reasons of aesthetic revulsion aside,
the case for it wasn’t strong enough to be noticeable.

On Object Orientation ·
Just because you need to write in C doesn’t mean you can’t be O-O.
At the end of the day, a Java Object is really a void
*, and after a while, it’s easy to fall into a pattern where all
your code is grouped into modules that smell like classes; each routine
is either a constructor that returns a void *, or takes one of
those void * thingies as its first argument.
You have to cook your own package-like naming scheme, but no biggie.
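A minimal sketch of the pattern, with invented names (the `tbCounter` prefix standing in for a home-grown package scheme):

```c
#include <stdlib.h>
#include <string.h>

/* "Class" Counter: the struct holds the instance data, a void * is the
   object handle, and the tbCounter prefix is the poor man's package name. */
typedef struct {
    long count;
    char name[32];
} Counter;

/* Constructor: returns an opaque handle. */
void *tbCounterNew(const char *name) {
    Counter *c = malloc(sizeof(Counter));
    if (!c)
        return NULL;
    c->count = 0;
    strncpy(c->name, name, sizeof(c->name) - 1);
    c->name[sizeof(c->name) - 1] = '\0';
    return c;
}

/* "Methods": each takes the handle as its first argument. */
void tbCounterAdd(void *handle, long n) {
    ((Counter *) handle)->count += n;
}

long tbCounterGet(void *handle) {
    return ((Counter *) handle)->count;
}

void tbCounterFree(void *handle) {
    free(handle);
}
```

Callers only ever see the `void *`, so the struct layout can change without touching any client code, which is most of what you want from a class anyway.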

If there’s anyone out there who’s still writing
application-level C code but isn’t doing it this way, I recommend giving
it some serious thought.

On Processing Big Files ·
I’ve spent a whole lot of my career processing really big input files,
going back to the 570MB Oxford English Dictionary in the
Eighties (that was really big then).
I gave a lecture at the first-ever
Perl Whirl entitled Perl,
XML, and Really Big Data which passed on some of the lessons.
I may reproduce that here on ongoing sometime, but here’s one important
lesson for free.

Suppose that you have to read ten million lines of text, each of which
begins with a number, and count how many begin with an odd number.
In Perl for the sake of brevity:
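Something like this, say:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Count the lines whose leading number is odd.
my $odd = 0;
while (<>) {
    $odd++ if /^(\d+)/ && $1 % 2;
}
print "$odd lines start with an odd number\n";
```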

Looks fine, right?
So, you fire it up against your great big file and settle back to wait.
After three or four minutes, you start to get nervous; did you get that regex
right?
Did you screw up the loop somehow?
You don’t want to interrupt it, because you’ve already invested in this
run; but if it’s off the rails, you don’t want to wait too long to
find out.
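Hence the lesson, as the anecdote suggests: make long-running jobs tell you how they’re doing as they go. In the same Perl vein (a sketch, not production code), a progress line every hundred thousand records costs essentially nothing:

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $odd  = 0;
my $seen = 0;
while (<>) {
    $odd++ if /^(\d+)/ && $1 % 2;
    # Reassure the human at the terminal every 100,000 lines.
    print STDERR "$seen lines so far...\n" if ++$seen % 100_000 == 0;
}
print "$odd lines start with an odd number\n";
```

Thirty seconds in, you know whether the regex is right and roughly how long the run will take; no more staring at a silent terminal.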

Expat ·
This is still pretty well the state of the art for XML parsing in C.
We had originally used Xerces but it was too big and too complicated and we
had trouble shaking out some weird bugs.
Expat is just excellent,
on performance grounds if nothing else.
I was shaking this program down on a middle-aged 750MHz Linux box, and have
also been debugging on my 550MHz Powerbook, running Expat (with fairly simple
event handlers) over really big data files, and the CPU usage never gets up
over 60%; it’s so efficient that the performance is I/O-limited.
I like being I/O-limited.

It’s going to be interesting to see what happens when I run this on a
modern multi-GHz production server with really fast disks.

OS X Weirdness ·
One of the advantages of having a Mac is that I can work on my server-side
code in a self-contained way here on the laptop.
Yes, but there are some distinctly surprising things in the view from down
here in the C-language trenches:

I can’t seem to do a read(2) of more than 8192
bytes against a pipe. Huh?
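Whatever the platform’s pipe buffering, the defensive idiom is the same: treat every read(2) as entitled to come up short, and loop. A sketch:

```c
#include <errno.h>
#include <unistd.h>

/* Read up to `want` bytes from fd, looping over short reads;
   returns the byte count actually read (short only at EOF), or -1 on error. */
ssize_t read_fully(int fd, void *buf, size_t want) {
    size_t done = 0;
    while (done < want) {
        ssize_t got = read(fd, (char *) buf + done, want - done);
        if (got < 0) {
            if (errno == EINTR)
                continue;       /* interrupted by a signal; retry */
            return -1;
        }
        if (got == 0)
            break;              /* EOF */
        done += (size_t) got;
    }
    return (ssize_t) done;
}
```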

The C compiler is incredibly sloppy by default; modern GCC on Linux is
helpfully pedantic about conditionals and possibly-uninitialized variables
and function prototypes and so on.
With this thing I can say foo(a); and then a couple lines later
foo(x,b,c);, with the types of a and x
being wildly different, and hear nothing from the compiler.
It’s probably there, I just haven’t figured out the right option
incantations.
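For the record, the sort of GCC incantation that catches mismatched calls and unprototyped functions looks something like this (the file name is invented, and exact flag spellings vary across GCC versions):

```shell
# Warn on implicit declarations, old-style definitions, and the
# usual conditional/uninitialized-variable suspects.
gcc -Wall -W -Wstrict-prototypes -Wmissing-prototypes \
    -Wimplicit-function-declaration -O2 -c index.c
```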

The -pg option is making some distinctly weird stuff
happen, causing breakage that I can’t reproduce in my test suite and
can’t track down in the debugger either.
Oh well, I can profile on Linux.

An Interesting Optimization Problem ·
What this program does is read an XML file that is either a from-scratch
description of the database Visual Net is mapping, or describes some deltas
to it.
In the first case, performance is vital because the input files are
potentially huge; many gigabytes is not uncommon.
In the second case, this is an interactive transaction and humans are waiting
for the results of the deltas.
Either way, performance is critical.

At the moment, it’s the first area that’s giving me the
performance challenges.
Visual Net is very fast because it’s mostly just
compiled code traversing
in-memory data structures, and thus so is update.
But pulling the data out of the monster XML streams and building those
structures can be challenging.
Here’s one little part of the problem, which XML aficionados will find
amusing.

One of the elements in the input stream represents a customer’s data
object, which can come with an arbitrary number of named metadata fields;
we handle this with a <metadata> element, the fields show up
either as attributes or child elements, almost always as attributes.
In one of the databases we’ve done recently, there are 29 fields for each
incoming data object.

So here’s the problem.
When Expat gives you a start-tag event and an array of name/value pairs, and
for each attribute you have to look through your list of 29 field definitions
to figure out where this one goes, and you’re processing twenty million
records or so, the profiler tells you that looking through that list starts
to loom very large in your processing time figures.

What would you do?

I ended up computing a little automaton; it takes 35 states to recognize
the 29 distinct possible attribute values, rarely needing to look at more than
three characters.
Made a huge difference.
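The production automaton isn’t reproducible here, but the shape of the trick is easy to show: branch on the first character or two and fall back to a full comparison only when names share a prefix. A hand-rolled miniature with invented field names (a generated DFA does the same thing with explicit state numbers):

```c
#include <string.h>

/* Hypothetical field slots for incoming attributes. */
enum field { F_TITLE, F_AUTHOR, F_DATE, F_SIZE, F_UNKNOWN };

/* Recognize an attribute name while looking at as few characters
   as possible, instead of strcmp-ing down a list of 29 names. */
enum field field_of(const char *name) {
    switch (name[0]) {
    case 't':
        return strcmp(name, "title") == 0 ? F_TITLE : F_UNKNOWN;
    case 'a':
        return strcmp(name, "author") == 0 ? F_AUTHOR : F_UNKNOWN;
    case 'd':
        /* "date" vs. "dimensions": the second character settles it. */
        if (name[1] == 'a')
            return strcmp(name, "date") == 0 ? F_DATE : F_UNKNOWN;
        return F_UNKNOWN;
    case 's':
        return strcmp(name, "size") == 0 ? F_SIZE : F_UNKNOWN;
    default:
        return F_UNKNOWN;
    }
}
```

With twenty million records times 29 attributes each, shaving the lookup from a list scan to two or three character inspections is exactly the kind of constant factor that shows up at the top of a profile.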

I wonder what all those XML deserialization packages I see advertised out
there do?
I wonder how they’d perform if asked to read 25 million records?

Understanding the Problem ·
This work has reinforced my conviction that you never really understand the
problem until you’ve written some of the code.
We’d worked out in advance how this thing I’m writing was supposed to
interface to the rest of the system, then on day three I had to go back and
say “This isn’t gonna work, here’s why” and we made some
changes.

I assume there must be people out there who are smart enough to spec out
an interface without having written any code and get it right; but I also
assume that they’re few and far between, and normal people like you and
me shouldn’t count on this sort of virtuosity.