I noticed that Joe Hellerstein, Mike Stonebraker, and James
Hamilton (DBMS luminaries all) have published a nice,
reasonably high-level paper describing the architecture and
design principles of a typical database management system: "Architecture
of a Database System".

Ian Lance Taylor's blog has an interesting
post on signed overflow behavior in C. According to the
C standard, signed integer overflow results in undefined behavior,
and modern versions of GCC take advantage of this to
generate more efficient code. Tom raised this topic
on -hackers a few years ago; at the time, GCC implemented
only the -fwrapv flag. Now
that GCC 4.2 provides -Wstrict-overflow, this might
be worth investigating further.
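
For a concrete example of what's at stake, consider the
following snippet. Traditional C programmers would expect it
to return 0 when handed INT_MAX, but because signed overflow
is undefined, GCC at -O2 may fold the comparison to a
constant; this is exactly the kind of idiom that
-Wstrict-overflow is designed to flag, and -fwrapv restores
the wrapping behavior.

    #include <limits.h>
    #include <stdio.h>

    /* Two's-complement intuition says x + 1 wraps to INT_MIN when
     * x == INT_MAX, making this return 0. Since signed overflow is
     * undefined, GCC at -O2 may instead fold the comparison to the
     * constant 1. */
    static int
    next_is_larger(int x)
    {
        return x + 1 > x;
    }

    int
    main(void)
    {
        printf("%d\n", next_is_larger(INT_MAX));
        return 0;
    }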

The broader point here is that while this optimization is
completely legal according to the C standard, it is
inconsistent with traditional C semantics, and risks
breaking code that depends on signed overflow wrapping
around. At least GCC now provides a
flag to emit warnings for potentially broken code, which
IMHO is a prerequisite for doing aggressive optimizations of
this type. There's another interesting
post on Ian Lance Taylor's blog that discusses this
situation in general (e.g. alias optimizations are another
instance where the C standard contradicts the traditional
expectations of C programmers).
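
To make the aliasing case concrete, here is a classic bit
of type-punning that traditional C treats as well-defined
but the standard's aliasing rules do not:

    #include <stdio.h>

    /* The standard lets the compiler assume a write through an
     * unsigned short * never modifies an int, so at -O2 GCC may
     * return the stale value 10 here even when ip and sp point at
     * the same object. -fno-strict-aliasing restores the
     * traditional "memory is just memory" semantics. */
    static int
    type_pun(int *ip, unsigned short *sp)
    {
        *ip = 10;
        *sp = 20;       /* may actually alias *ip */
        return *ip;     /* compiler is free to assume this is still 10 */
    }

    int
    main(void)
    {
        int i = 0;
        printf("%d\n", type_pun(&i, (unsigned short *) &i));
        return 0;
    }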

A few years ago, I did a summer internship with a group at
Microsoft that was building a multi-master filesystem
replication product. This was a very rewarding experience
for several reasons. Now that the replication product has
shipped (in Windows Server 2003 R2, Vista, and Windows Live
Messenger), I was happy to see that my mentor for that
summer, Nikolaj
Bjørner, has published a paper containing "lessons
learned" from the project: "Models
and Software Model
Checking of a Distributed File Replication System". The
paper is worth reading for a few reasons:

Why is filesystem replication such a hard problem,
particularly in the asynchronous, multi-master case? The
paper talks about the basic problem and the approach the
group took to solving it.

Perhaps more interestingly, how do you go about
constructing a high-quality implementation of such a
product? I was impressed by the group's emphasis on
correctness. Nikolaj and Dan (the technical lead for the
group) both had a CS theory background, so this is perhaps
not surprising, but it's interesting to see some of the
practical techniques that they used to ensure they built a
correct replication system:

A detailed specification (on the order of a few hundred
pages)

A prototype of the system in OCaml, written concurrently
with the specification but before the real implementation
work began

A high-level, executable specification of the
replication protocol in AsmL.
This served both as a readable description of the
protocol and as a way to automatically generate
useful test cases.

Using model
checking to verify the correctness of certain
particularly complex aspects of the protocol (distributed
garbage collection, conflict resolution).

A "simulator" that walked a random tree of filesystem
operations, pausing after each node to verify that the
system had correctly replicated the resulting filesystem
state. Once a leaf node in the tree was reached, the
simulator then backtracked, exploring another branch of the
tree. The simulator was also clearly inspired by model
checking techniques. By replacing certain components of the
real system with virtualized ones (e.g. using a toy
in-memory database), this tool could be used to test large
numbers of scenarios very quickly; a toy sketch of the tree
walk appears after this list.

Exhaustive testing. Using the simulator and a cluster of
test machines, more than 500 billion test cases were examined.
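
The simulator in particular maps nicely onto a depth-first
tree walk. Here's the toy sketch promised above; nothing in
it is from the actual system (the "filesystem" is a few
integer slots and "replication" is a memcpy), it just makes
the walk-verify-backtrack structure concrete:

    #include <assert.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    #define FS_SLOTS   4        /* toy "filesystem": a few slots */
    #define MAX_DEPTH  3        /* depth of the operation tree */
    #define BRANCHING  3        /* operations tried at each node */

    static int  primary[FS_SLOTS];
    static int  replica[FS_SLOTS];
    static long scenarios = 0;

    /* Stand-in for the replication machinery under test. */
    static void
    replicate(void)
    {
        memcpy(replica, primary, sizeof(primary));
    }

    static void
    explore(int depth)
    {
        if (depth == MAX_DEPTH)
        {
            scenarios++;            /* leaf reached: backtrack */
            return;
        }

        for (int i = 0; i < BRANCHING; i++)
        {
            int slot = rand() % FS_SLOTS;
            int saved = primary[slot];

            primary[slot] = depth * BRANCHING + i;  /* apply an op */
            replicate();
            /* Pause after each node to verify replicated state. */
            assert(memcmp(primary, replica, sizeof(primary)) == 0);

            explore(depth + 1);

            primary[slot] = saved;  /* undo: try the next branch */
            replicate();
        }
    }

    int
    main(void)
    {
        explore(0);
        printf("explored %ld scenarios\n", scenarios);
        return 0;
    }

Because the expensive components are virtualized away, each
scenario check is cheap, which is what makes numbers like
500 billion test cases feasible.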

On May 31, 2008, a tribute
to honor the life and work of Jim Gray will be held at
UC Berkeley. There's a technical session,
for which registration is required, preceded by a general
session that is open to the public. As the invitation email
I received (thanks Elein!) states:

This is not a memorial, because Jim is still listed as
missing, and will be so listed until about Jan 28, 2011. It
is important that it is not referred to as a memorial,
because it can't be a memorial until then. We believe
that it is good to go ahead and recognize Jim's
contributions, to honor him in a Tribute, before such a long
time has passed.

There's a new
post by Stonebraker up at The Database Column. I don't
have much to add to the post itself, although it's
interesting to hear some information about the old Sequoia
2000 project. I notice that Google and
Yahoo were invited to the workshop — at first glance,
it seems
to me that the data management problems faced by the big web
companies are quite dissimilar to the challenges facing "big
science", but perhaps that's not the case.

This weekend's conference in
Portland was a great
experience. Many thanks to Selena Deckelmann, Josh Drake,
and all the other volunteers for organizing and running the
conference. Everything ran amazingly smoothly!

I've posted the slides to my talk on "Query
Execution Techniques in PostgreSQL". I thought the talk
went fairly well, although unfortunately I didn't have
enough time to get to everything I wanted to discuss.

In the talk, one of the algorithms I discussed was the
"hybrid hash join", which is the common hash join algorithm
used by most modern DBMSs, including PostgreSQL. The night
before, Jeff Davis tipped me off to the fact that the
inventor of the hybrid hash join algorithm, Dr. Len Shapiro
from PSU, was going to be in the audience! Thankfully I
didn't get the details of the hybrid hash join wrong :) It
was a pleasure to meet Dr. Shapiro, whose students are doing
some interesting work improving
hash index bulk build performance.
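
Since the hybrid hash join came up, here's a deliberately
simplified sketch of the algorithm, with bare integer keys
standing in for tuples and in-memory arrays standing in for
the temp files a real implementation spills batches to:

    #include <stdio.h>
    #include <string.h>

    #define NBATCH   4
    #define NBUCKET  8
    #define MAXTUP  64

    static int hashtab[NBUCKET][MAXTUP];    /* in-memory table */
    static int nhash[NBUCKET];
    static int sp_in[NBATCH][MAXTUP];       /* "spilled" inner tuples */
    static int nsp_in[NBATCH];
    static int sp_out[NBATCH][MAXTUP];      /* "spilled" outer tuples */
    static int nsp_out[NBATCH];

    /* Batch and bucket come from independent bits of the hash. */
    static int batch_of(int key)  { return (key / NBUCKET) % NBATCH; }
    static int bucket_of(int key) { return key % NBUCKET; }

    static void
    probe(int key)
    {
        int b = bucket_of(key);
        for (int i = 0; i < nhash[b]; i++)
            if (hashtab[b][i] == key)
                printf("match: %d\n", key);
    }

    int
    main(void)
    {
        int inner[] = {1, 5, 9, 14, 23, 42};    /* build side */
        int outer[] = {5, 7, 14, 42, 50};       /* probe side */

        /* Pass 1a: partition the build side. Batch 0 goes straight
         * into the in-memory hash table; the rest is spilled. */
        for (size_t i = 0; i < sizeof(inner) / sizeof(inner[0]); i++)
        {
            int k = inner[i], b = batch_of(k);
            if (b == 0)
                hashtab[bucket_of(k)][nhash[bucket_of(k)]++] = k;
            else
                sp_in[b][nsp_in[b]++] = k;
        }

        /* Pass 1b: scan the probe side. Batch 0 is joined on the
         * fly (the "hybrid" part); the rest is spilled. */
        for (size_t i = 0; i < sizeof(outer) / sizeof(outer[0]); i++)
        {
            int k = outer[i], b = batch_of(k);
            if (b == 0)
                probe(k);
            else
                sp_out[b][nsp_out[b]++] = k;
        }

        /* Pass 2: join the spilled batches one at a time, reusing
         * the in-memory hash table for each. */
        for (int b = 1; b < NBATCH; b++)
        {
            memset(nhash, 0, sizeof(nhash));
            for (int i = 0; i < nsp_in[b]; i++)
            {
                int k = sp_in[b][i];
                hashtab[bucket_of(k)][nhash[bucket_of(k)]++] = k;
            }
            for (int i = 0; i < nsp_out[b]; i++)
                probe(sp_out[b][i]);
        }
        return 0;
    }

The "hybrid" part is that batch 0 is joined during
partitioning itself, so when the build side fits (or nearly
fits) in memory the algorithm degrades gracefully into an
ordinary in-memory hash join and most tuples never touch
disk.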

I gave a revised version of the "Introduction to Hacking
PostgreSQL" tutorial at PgCon earlier today. I've posted the
slides, handouts, and example patch here; this
version uses a completely new example patch, and much of the
introductory material has been revised. You can also find
the slides at the PgCon
page for the talk, which also includes a link to give me
feedback.

I've been busy at school recently,
finishing my
undergraduate thesis and taking my final exams. I've got one
more exam on Thursday, but once that's finished, I should be
pretty much finished with my undergraduate degree (fingers
crossed).

In a complexity theory class I'm taking, I recently gave a
talk on
Kolmogorov complexity, introducing the basic ideas and
discussing some notable applications. I'm no expert in
algorithmic information theory (AIT), so take the contents
with a grain of salt.
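
For reference, the central definition from the talk: fixing
a universal machine U, the Kolmogorov complexity of a string
x is the length of the shortest program that outputs x,

    K_U(x) = \min \{\, |p| : U(p) = x \,\}

and the invariance theorem says the choice of U matters only
up to an additive constant: for any other machine V there is
a constant c (independent of x) with K_U(x) <= K_V(x) + c.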

Work

I've decided to take a full-time position at Amalgamated Insight,
working on their PostgreSQL-based data stream
management system (DSMS). I interned at AI last summer and
was very impressed, so I'm excited to have the chance to
work for them again. This also means I'll be moving to the
Bay Area in early June.

Speaking of stream processing, I'll be giving a talk about
data stream query processing at PgCon 2007. Gavin and
I are
also doing an introduction to modifying the PostgreSQL
source code,
similar in spirit to the "Introduction to Hacking
PostgreSQL" tutorial we gave at the Anniversary Summit. The
details are still a little hazy, but the goal is to make the
session rewarding both to newcomers and to those who
attended the previous tutorial.