Gotchas

This page contains a list of "gotchas" that I've come across during my
long tenure as a PhD candidate (and hacker extraordinaire). They relate to
my work in (very large-scale) distributed systems, data mining and data collection.

Java

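Not news, but recently news to me: using lots of booleans is not a "compact" way
to represent a bitmap. Here's the problem: Java never packs booleans into bits. On
common JVMs each element of a boolean[] takes a whole byte, and the JVM handles
individual boolean values as ints (a whole 4-byte word!). Solution? Make a custom
Bitmap class where the bitmap is stored using an array of ints, where the size of the
array is (numberOfBits + Integer.SIZE - 1) / Integer.SIZE, i.e. numberOfBits / Integer.SIZE
rounded up. (Don't write Math.ceil(numberOfBits/Integer.SIZE): the int division truncates
before Math.ceil ever sees the value.) The standard library's java.util.BitSet does the
same job with an array of longs if you don't want to roll your own. Note that for
counting set bits, Java provides the Integer.bitCount() method -- as if they knew the
flaw in their boolean representation would lead developers to this workaround.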
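
The int-array bitmap described above can be sketched roughly as follows. This is a minimal illustration, not a tuned implementation: the class name Bitmap comes from the text, but the method names (set, clear, get, cardinality, size) are my own assumptions.

```java
/** Minimal fixed-size bitmap backed by an int[] (hypothetical class, not a library API). */
final class Bitmap {
    private final int[] words;
    private final int numberOfBits;

    Bitmap(int numberOfBits) {
        if (numberOfBits < 0) {
            throw new IllegalArgumentException("numberOfBits must be >= 0");
        }
        this.numberOfBits = numberOfBits;
        // Ceiling division done in long arithmetic so the "+ 31" cannot overflow.
        this.words = new int[(int) (((long) numberOfBits + Integer.SIZE - 1) / Integer.SIZE)];
    }

    /** Sets the bit at the given index to 1. */
    void set(int index) {
        checkIndex(index);
        words[index / Integer.SIZE] |= 1 << (index % Integer.SIZE);
    }

    /** Sets the bit at the given index to 0. */
    void clear(int index) {
        checkIndex(index);
        words[index / Integer.SIZE] &= ~(1 << (index % Integer.SIZE));
    }

    /** Returns true if the bit at the given index is 1. */
    boolean get(int index) {
        checkIndex(index);
        return (words[index / Integer.SIZE] & (1 << (index % Integer.SIZE))) != 0;
    }

    /** Counts the set bits, one 32-bit word at a time, using Integer.bitCount(). */
    int cardinality() {
        int count = 0;
        for (int word : words) {
            count += Integer.bitCount(word);
        }
        return count;
    }

    /** Number of usable bits (not the number of ints allocated). */
    int size() {
        return numberOfBits;
    }

    private void checkIndex(int index) {
        if (index < 0 || index >= numberOfBits) {
            throw new IndexOutOfBoundsException("bit index out of range: " + index);
        }
    }

    public static void main(String[] args) {
        Bitmap bitmap = new Bitmap(100);           // 100 bits fit in 4 ints
        bitmap.set(0);
        bitmap.set(31);
        bitmap.set(99);
        bitmap.clear(31);
        System.out.println(bitmap.get(99));        // true
        System.out.println(bitmap.cardinality());  // 2
    }
}
```

The payoff is the storage: 100 flags cost 4 ints (16 bytes) plus the array header, instead of one byte or more per flag.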

MySQL

LOAD DATA INFILE is really fast. If you're loading a copy of
a table that has GBs of indexed data, don't use inserts.

MySQL + NFS = misery. If you're stuck with it, use MyISAM tables, which
may not have the best performance, but will certainly be
easy to work with if you have to recover from crashes.

By default, a 64-bit machine will support an extremely
large number of rows per table, but a 32-bit machine will
support only 2^32 rows. If you add rows beyond the limit,
MySQL simply rolls the number of rows back to zero and counts
from there. So if you have 2^32+1 rows, MySQL will tell you
that you have only 1. To raise the limit, use ALTER TABLE xxx MAX_ROWS=[something large].
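
The wraparound arithmetic is easy to check for yourself. The snippet below is only a model of the behavior described above (a counter that keeps 32 bits), written in Java for convenience. It says nothing about how MySQL implements its row count, and the names RowCounterWrap and reportedRows are made up for the illustration.

```java
/** Illustration only: models a row counter that is limited to 32 bits. */
final class RowCounterWrap {
    /** Keeps only the low 32 bits, as an unsigned 32-bit counter would. */
    static long reportedRows(long actualRows) {
        return actualRows & 0xFFFFFFFFL;
    }

    public static void main(String[] args) {
        long limit = 1L << 32;                        // 2^32 = 4294967296
        System.out.println(reportedRows(limit - 1));  // 4294967295
        System.out.println(reportedRows(limit));      // 0
        System.out.println(reportedRows(limit + 1));  // 1
    }
}
```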

The built-in bzip2 library is much less efficient (i.e., slower) than
the standalone bzip2 binary, at least on Windows.

If your script is used for data mining and it grows to more than 100 lines,
you probably want to use a statically typed language instead. This not only makes
the code faster (generally) but also makes it take up less memory.

Web service + MySQL

Use INSERT DELAYED wherever feasible. This allows the service
to return immediately to the caller that is blocking on the response.
Note that it only works with some storage engines (MyISAM among them)
and is deprecated in newer MySQL versions (5.6 and later).

If the data is coming in faster than your server
can handle it, don't use INSERT DELAYED (or selectively drop data).

If your web service is humming along and all of a sudden the load
on the DB server drops dramatically, make sure that the web server is not
dropping connections. One likely culprit when you have a large number of clients: the
default value for ip_conntrack_max is way too low. You can verify this
by checking /var/log/messages for reports of dropped packets.

Linux servers

If you see a consistent 5 or 10 second delay before certain operations
complete, your DNS settings are probably wrong. For example, the
DNS servers you configured may not exist.

From Pred: You must have at least a 1-second sleep in scripts between
the last partition command and the mkfs command, or else you run the
risk of mkfs not finding the newly created partition. Harsh.

Kernel Hacking

printk works everywhere ... except when you're using it
in the scheduler. There it can cause a deadlock.

Calling kmalloc before the memory manager has been initialized
will do nothing. It will not cause an error, however, nor will
it be caught at compile time. So make sure you don't use kmalloc
until the kernel is ready for it.