LevelDB is an exciting new entrant into the pantheon of embedded databases, notable both for its pedigree, being authored by the makers of the now-mythical Google MapReduce and BigTable products, and for its emphasis on efficient disk-based random access using log-structured merge (LSM) trees.

The plan is to keep LevelDB fairly low-level. The intention is that it will be a useful building block for higher-level storage systems. Basho is already investigating using LevelDB as one of its storage engines.

In the past many systems were built around embedded databases, though most developers now use database servers accessed via RPC. An embedded database is a database distributed as a library and linked directly into your application. The application is responsible for providing a service-level API, sharding, backups, initiating consistency checking, initiating rollback, startup, shutdown, queries, etc. The application becomes the container for the database and the manager of the database.

Architectures using embedded databases typically never expose a raw database abstraction at all. They have a service API, and the services use the embedded database library transparently behind the scenes. Often an embedded database will provide multiple access methods, like indexed access for key-value use and btrees for range queries and cursors.

BerkeleyDB is one well known example of an embedded database, SQLite is another, the file system is perhaps the most commonly used database, and there have been many, many other btree libraries in common use. I've used C-tree on several projects. In a battle of old versus new, a user named IM46 compared LevelDB to BerkeleyDB and found that LevelDB solidly outperforms BerkeleyDB for larger databases.

Programmers usually thought doing this stuff was easy, wrote their own failed on-disk btree library (raises hand), and then looked around for a proven product. It's only relatively recently that databases have gone upmarket, adding a network layer and higher-level services.

Building a hybrid application/database architecture is still a very viable option when you want everything to be just so. If you are going to load balance requests across sharded application servers anyway, a heavyweight external database infrastructure may not be necessary.

The LevelDB mailing list started off very active and has died down a bit, but is still nicely active and informative. Here are some excellent FAQish tips, performance suggestions, and porting issues extracted from the list:

Relationship between LevelDB and BigTable: The implementation of LevelDB is similar in spirit to the representation of a single Bigtable tablet (section 5.3). However the organization of the files that make up the representation is somewhat different and is explained [in source code comments]. They wanted to put together something like the BigTable tablet stack that had minimal dependencies and would be suitable for open sourcing, and also would be suitable for use in Chrome (for the IndexedDB implementation). LevelDB has the same general design as the BigTable tablet stack, but does not share any of the code.

Didier Spezia on log-structured merge (LSM) trees: They are mostly useful to optimize random I/O at insertion/delete time at the price of a slight degradation in read access time. They are extremely efficient at indexing data arriving in random order stored on rotational disks (i.e. better than btrees).

Optimized for random writes. TokyoCabinet could be filled with a million 100-byte writes in less than two seconds if writing sequentially, but the time ballooned to ~2000 seconds when writing randomly. The corresponding slowdown for leveldb is from ~1.5 seconds (sequential) to ~2.5 seconds.
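To make the write path concrete, here is a minimal sketch of filling a database with small values, assuming the leveldb library is installed; the `/tmp/testdb` path and key naming are arbitrary illustrations. Whether keys arrive in sequential or random order, each `Put` is just an append to the log plus a memtable insert, which is why the random-write penalty is so small.

```cpp
#include <cassert>
#include <string>
#include "leveldb/db.h"

int main() {
  leveldb::Options options;
  options.create_if_missing = true;  // create the database if it doesn't exist

  leveldb::DB* db = nullptr;
  leveldb::Status status = leveldb::DB::Open(options, "/tmp/testdb", &db);
  assert(status.ok());

  // Each Put appends to the log; random key order costs little extra
  // because sorting happens later, during background compaction.
  for (int i = 0; i < 1000; ++i) {
    std::string key = "key" + std::to_string(i);
    status = db->Put(leveldb::WriteOptions(), key, "value");
    assert(status.ok());
  }

  delete db;
  return 0;
}
```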

In the tradition of BerkeleyDB, it's a library you embed in your program, not a server. You will have to add the network layer, sharding, etc. if a single process won't suffice.

Quite appropriately, threading decisions are left to the application; the library is not fully thread safe. Threads sharing iterators, for example, will need to lock.
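The locking is entirely up to the caller. A minimal sketch of guarding a shared iterator with a mutex (the `iter_mutex` name and `scan_some` helper are illustrative, not part of the LevelDB API):

```cpp
#include <mutex>
#include "leveldb/db.h"

// A shared iterator guarded by a mutex; a leveldb::Iterator is not
// safe for concurrent use by multiple threads without external locking.
std::mutex iter_mutex;  // protects *it

void scan_some(leveldb::Iterator* it) {
  std::lock_guard<std::mutex> lock(iter_mutex);
  for (it->SeekToFirst(); it->Valid(); it->Next()) {
    // process it->key() / it->value() while holding the lock
  }
}
```

Alternatively, each thread can create its own iterator from the shared DB handle and skip the lock entirely.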

Data is written in sorted order.
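Because data is stored sorted, iteration comes back in key order (bytewise by default, or by a custom comparator) regardless of the order in which keys were inserted. A sketch of a full ordered scan, assuming an already-open DB handle:

```cpp
#include <cstdio>
#include "leveldb/db.h"

// Dump every key/value pair in sorted key order.
void dump_sorted(leveldb::DB* db) {
  leveldb::Iterator* it = db->NewIterator(leveldb::ReadOptions());
  for (it->SeekToFirst(); it->Valid(); it->Next()) {
    std::printf("%s => %s\n",
                it->key().ToString().c_str(),
                it->value().ToString().c_str());
  }
  delete it;  // iterators must be deleted before the DB is closed
}
```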

C++ only.

Variable-sized keys are used to save memory.

What leveldb does differently from B+trees is that it trades off write latency for write throughput: write latency is reduced by doing bulk writes, but the same data may be rewritten multiple times (at high throughput) in the background due to compactions.

Log-structured merge trees offer better random write performance than btrees: LevelDB never writes in place, it only appends to a log file or merges existing files together to produce new ones.

Search performance is still O(lg N) with a very large branching factor (so the constant factor is small and number of seeks should be <= 10 even for gigantic databases).

Every write will cause the log file to grow, regardless of whether you are writing to a key that already exists in the database, and regardless of whether you are overwriting a key with the exact same value. Only background compactions will get rid of overwritten data. So you should expect high CPU usage while you are inserting data, and also for a while afterwards as background compactions rearrange things.
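If you want to reclaim the space held by overwritten or deleted entries right away, for example after a bulk load, you can force a compaction rather than waiting for the background process. A sketch using the public `CompactRange` call:

```cpp
#include "leveldb/db.h"

// Force compaction of the entire key space, discarding overwritten
// and deleted data. Passing nullptr for both bounds means "everything".
void compact_all(leveldb::DB* db) {
  db->CompactRange(nullptr, nullptr);
}
```

This is expensive and blocks until done, so it's best reserved for after large batch imports.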

Comparing LevelDB and Bitcask: LevelDB is a persistent ordered map; Bitcask is a persistent hash table (no ordered iteration). Bitcask stores a fixed-size record in memory for every key, so for databases with a large number of keys it may use too much memory for some applications. Bitcask can guarantee at most one disk seek per lookup I think. LevelDB may have to do a small handful of disk seeks. To clarify, LevelDB stores data in a sequence of levels. Each level stores approximately ten times as much data as the level before it. A read needs one disk seek per level. So if 10% of the db fits in memory, LevelDB will need to do one seek (for the last level, since all of the earlier levels should end up cached in the OS buffer cache). If 1% fits in memory, LevelDB will need two seeks. Bitcask is a combination of Erlang and C.

Writes can be lost, but that doesn't trash the data files: Leveldb never writes in place: it always appends to a log file, or merges existing files together to produce new ones. So an OS crash will cause a partially written log record (or a few partially written log records). Leveldb recovery code uses checksums to detect this and will skip the incomplete records.

LevelDB keeps a separate file for every couple of MB of data, and these are all in one directory. Depending on the underlying file system, this might start causing trouble at some point.

Scalability is more limited by the frequency of reads and writes that are being done, rather than the number of bytes in the system.

Transactions are not supported. Writes (including batches) are atomic. Consistency is up to you. There is limited isolation support. Durability is a configurable option. Full-blown ACID transactions require a layer on top of LevelDB (see WebKit's IndexedDB).
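Batch atomicity and configurable durability are both exposed directly in the API. A sketch of atomically moving a value between two keys (the `move_value` helper is illustrative): either both the delete and the put survive a crash, or neither does, and setting `sync = true` fsyncs the log before the write is acknowledged.

```cpp
#include <string>
#include "leveldb/db.h"
#include "leveldb/write_batch.h"

// Atomically move the value stored under `from` to `to`.
leveldb::Status move_value(leveldb::DB* db,
                           const std::string& from,
                           const std::string& to) {
  std::string value;
  leveldb::Status s = db->Get(leveldb::ReadOptions(), from, &value);
  if (!s.ok()) return s;

  leveldb::WriteBatch batch;   // all ops in a batch apply atomically
  batch.Delete(from);
  batch.Put(to, value);

  leveldb::WriteOptions wopts;
  wopts.sync = true;           // durability: fsync before returning
  return db->Write(wopts, &batch);
}
```

Note there is no isolation between the `Get` and the `Write` here; preventing a concurrent writer from interleaving is, as the text says, up to you.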

Michi Mutsuzaki compared LevelDB to MySQL as a key-value store. LevelDB had better overall insert throughput, but it was less stable (high variation in throughput and latency) than MySQL. There was no significant performance difference for an 80% read / 20% update workload.

According to the docs on the leveldb google code page, leveldb is thread-safe, but it is not multi-process safe. Under "Concurrency": "Within a single process, the same leveldb::DB object may be safely shared by multiple concurrent threads."

LevelDB is not totally thread-safe, though. From the same "Concurrency" section: "However other objects (like Iterator and WriteBatch) may require external synchronization. If two threads share such an object, they must protect access to it using their own locking protocol."