I noticed something familiar while looking at ldapsearch output from AD and
ADAM. Even though their ISAM database uses B-trees for its indices, and they
use integers for their primary keys (they call them DNTs, DN Tags, but they're
essentially the same as our entryIDs, though theirs are fixed at 32 bits and
ours are 64 bits on a 64 bit machine), they don't preserve the entry input
order. In fact, it looks like they're passing their 4-byte integer keys
directly to their B-tree routines in native little-endian byte order, which
totally screws up the sorting and locality properties of the B-tree. I noticed
this same mis-behavior in the back-bdb code within about a week of starting to
work on it. (Patched it in November 2001 back-bdb/init.c:1.42, made the
corresponding fix to back-ldbm/init.c:1.28 in December 2001...)

So essentially they have a fundamental design flaw that is costing them a huge
amount of both write and read performance. It's even more amusing that they
recognized they had a problem, but rather than fix the actual problem, they
wrote a Defragmentation process that has to run on a regular schedule (or can
be invoked manually when the server is shutdown) to try and put a bandaid over
the problem. Good old Microsoft.

Summary: B-trees are balanced trees that use ordered keys. If you store these
keys in this order, a retrieval should return them in the same order:
1
2
3
4

This characteristic offers a number of performance advantages - when you're
doing a bulk load, it is essentially the same as a sequential write - there's
very little disk seeking needed. When you're trying to find ranges of entries
within a node, you can use binary search to locate the desired targets instead
of having to do a pure linear search. And if you're retrieving a range of
entries they'll reside next to each other, which makes caching more effective.

But this only works if you insert keys in their natural, sequential order, or
your B-tree sort/comparison function knows something special about your keys'
characteristics. Most B-tree implementations just use memcmp for this purpose.
What happens when you feed little-endian integers to memcmp is incredibly bad,
although the effects won't be seen until you get to more than 256 items in the
database. The first 255 entries will all get added in sequentially, without
difficulty. But when #256 comes, instead of just tacking on after #255, it has
to seek back to the beginning and do an insertion.

for entries 1-255. Then 256 comes along as
0001
which clearly belongs at the beginning of the above list. And 257,
0101
slots in right after entry #1. If all of the nodes were optimally filled up to
this point, the majority of them are going to need to be split as you add
entries from 256-511. And the same rewind seeking/splitting will occur again
for 512-767, and so on for everything up to entry #65535. Then entry #65536
aggravates the problem even further, as does entry #16777216.

I.e., this flaw turns what should be the best possible insert order for a
B-tree into the worst possible insert order. It's something that's easily
overlooked, when you don't think about what you're doing, but its effect is so
obvious you really can't miss it. No wonder fragmentation is such a severe
problem in AD and ADAM...

I think this also explains to a large degree why, even though all of these
packages use B-trees, OpenLDAP imports data an order of magnitude faster than
ADAM, and nearly two orders of magnitude faster than AD. It also explains why
our import speeds scale pretty much linearly with the number of objects,
without extensive multithreading tricks, while theirs scales sub-linearly,
even using multiple threads in parallel. (When you compare actual CPU time
used, instead of wall clock time, the difference is another 3-4x greater at 1M
entries. More than that at 5M.)