This tutorial illustrates techniques for storing and managing large values in FoundationDB. We’ll look at using the blob (binary large object) layer, which provides a simple interface for storing unstructured data. We’ll be drawing on Data Modeling and Python API, so you should take a look at those documents if you’re not familiar with them.

For an introductory tutorial that begins with “Hello world” and explains the basic concepts used in FoundationDB, take a look at our class scheduling tutorial.

Although we’ll be using Python, the concepts in this tutorial are also applicable to the other languages supported by FoundationDB.

If you’d like to see the finished version of the code we’ll be working with, take a look at the Appendix: FileLib.py.

For key-value pairs stored in FoundationDB, values are limited to a size of 100 kB (see Known Limitations). Furthermore, you’ll usually get the best performance by keeping value sizes below 10 kB, as discussed in our performance guidelines.

These factors lead to an obvious question: what should you do if your first cut at a data model results in values that are larger than those allowed by the above guidelines?

The answer depends on the nature and size of your values. If your values have some internal structure, consider revising your data model to split the values across multiple keys. For example, suppose you’re recording a directed graph of users who “follow” other users to receive status updates. Your initial thought may be to have a single key for each userID and store the users being followed as its value, using something like:
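The original snippet isn't reproduced in this excerpt, but the two models can be sketched as follows. A plain dict stands in for a FoundationDB transaction and Python tuples for tuple-layer keys; real code would pack keys with fdb.tuple inside a @fdb.transactional function, and names here are illustrative:

```python
# Naive model: all followed users crammed into one value, which can
# easily exceed the value-size guidelines:
#   tr[('following', user_id)] = b','.join(followed_ids)
#
# Revised model: one key per (follower, followed) edge, so no single
# value grows with the number of users followed.
store = {}  # dict standing in for the database

def follow(tr, user_id, followed_id):
    tr[('following', user_id, followed_id)] = b''  # empty value is fine

def unfollow(tr, user_id, followed_id):
    tr.pop(('following', user_id, followed_id), None)

def get_following(tr, user_id):
    # In FoundationDB this would be a single range read over the prefix.
    return sorted(f for (_, u, f) in tr if u == user_id)

follow(store, 'alice', 'bob')
follow(store, 'alice', 'carol')
```

With this layout, following and unfollowing each touch exactly one small key-value pair, and listing a user's follows is a prefix range read.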

Similarly, suppose you’d like to store a serialized JSON object. Instead of storing the object as the value of a single key, you could construct a key for each path in the object, as described for documents.
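The per-path idea can be sketched like this: each leaf of a JSON-like object becomes its own key, built from the path down to that leaf (again with tuples standing in for packed keys; a sketch, not the documents layer itself):

```python
def flatten(obj, prefix=()):
    """Yield one (path, value) pair per leaf of a JSON-like object."""
    if isinstance(obj, dict):
        for k, v in obj.items():
            yield from flatten(v, prefix + (k,))
    elif isinstance(obj, list):
        for i, v in enumerate(obj):
            yield from flatten(v, prefix + (i,))
    else:
        yield prefix, obj

doc = {'user': {'name': 'hamlet', 'tags': ['prince', 'denmark']}}
kv = dict(flatten(doc))
```

Each resulting key-value pair is small, and any sub-object can be retrieved with a range read over its path prefix.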

Note

In general, you should consider splitting your values if their sizes are above 10 kB, or if they are above 1 kB and you only use a part of each value after reading it.

Sometimes, revising your data model using the above approach is not sufficient or even feasible. Some data, even after splitting, would result in values that are still too large. Unstructured data may not have elements that can naturally serve as keys. Binary large objects (blobs) are common examples of the latter sort.

You can store large values as binary large objects using our example blob layer. This layer provides an abstraction for random reads and writes of a blob, allowing it to be partially accessed or streamed. Sequential reads and writes are also supported. The implementation automatically splits a blob into chunks and stores it using an efficient, sparse representation.

An instance of Blob is used to read and write a single blob in the database. It’s initialized with a Subspace that defines the subspace of keys the database will use to store the blob. We normally create or open a directory to obtain an initial subspace:
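The blob layer's actual initialization code isn't reproduced in this excerpt. As an illustration of the chunked, sparse representation behind it, here is a minimal in-memory sketch (all names hypothetical, not the real Blob API): a dict stands in for the database and a tuple for a directory subspace, and chunks are created only on demand.

```python
CHUNK = 4  # deliberately tiny, to make the chunking visible

class ChunkedBlob:
    """In-memory sketch of the blob layer's chunking idea: bytes are
    stored in fixed-size chunks under chunk-aligned (subspace, offset)
    keys, allowing partial reads and writes."""
    def __init__(self, store, subspace):
        self.store = store        # dict standing in for the database
        self.subspace = subspace  # tuple standing in for a subspace

    def write(self, offset, data):
        for i, byte in enumerate(data):
            off = offset + i
            key = self.subspace + (off - off % CHUNK,)  # chunk-aligned
            chunk = bytearray(self.store.get(key, b'\x00' * CHUNK))
            chunk[off % CHUNK] = byte
            self.store[key] = bytes(chunk)

    def read(self, offset, n):
        out = bytearray()
        for off in range(offset, offset + n):
            key = self.subspace + (off - off % CHUNK,)
            out.append(self.store.get(key, b'\x00' * CHUNK)[off % CHUNK])
        return bytes(out)

store = {}
blob = ChunkedBlob(store, ('files', 'demo'))
blob.write(0, b'hello world')
```

Writing at a large offset creates only the chunks it touches, which is what keeps the representation sparse and makes partial access and streaming cheap.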

Suppose you have files that you’d like to store and manage in the database. The file content could be anything: text, audio, video clips, etc. You might want to group the files into named libraries and record some metadata for each file when you store it.

Let’s use the blob layer to implement a class with the basic methods we’d need to manage this kind of library. The class FileLibrary is initialized with a subspace that it extends to store metadata on the files:
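The class body isn't reproduced in this excerpt; as a hedged sketch of its shape (dict-backed and with hypothetical names -- the real class stores file content through the blob layer), it might look like:

```python
class FileLibrary:
    """Sketch of a named file library: content and per-file metadata
    live under keys extended from a base subspace, modeled as tuples."""
    def __init__(self, store, subspace, name):
        self.store = store                        # dict standing in for the db
        self.subspace = subspace + ('lib', name)  # extend the base subspace

    def _key(self, *parts):
        return self.subspace + parts

    def put_file(self, filename, data, **metadata):
        self.store[self._key(filename, 'data')] = data
        for attr, value in metadata.items():
            self.store[self._key(filename, 'meta', attr)] = value

    def get_file(self, filename):
        return self.store.get(self._key(filename, 'data'))

    def delete_file(self, filename):
        prefix = self._key(filename)
        for key in [k for k in self.store if k[:len(prefix)] == prefix]:
            del self.store[key]
```

Keeping metadata under the same per-file prefix as the content means a single prefix-based delete removes both.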

The class exposes methods to import a file into a library, export a file, and remove a file from the library.

To import a file, it would be simple to read it in its entirety and write it to a blob, but let’s assume we may be reading from a CD or DVD that’s subject to damage. We can handle minor damage by reading the file in small chunks and error-checking each one, replacing any chunk that proves unreadable with 0’s while preserving the rest of the file content.

The import_file() method initializes an empty blob and begins to read the file in chunks specified by CHUNK_SIZE. (We can adjust CHUNK_SIZE according to our knowledge of the media we’re working with.) Good chunks are directly appended to the blob, while bad chunks are replaced by 0’s before being appended. Finally, we record the condition of the file with its metadata and return the blob:
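The method body isn't reproduced in this excerpt. Its reading strategy can be sketched as plain file I/O (names and the error-recovery details are assumptions; the real method appends each chunk to a Blob and records the damage flag as metadata):

```python
CHUNK_SIZE = 16384  # adjust to the medium being read

def read_file_tolerantly(path):
    """Sketch of import_file()'s strategy: read in CHUNK_SIZE pieces,
    substituting zero bytes for any chunk that raises an I/O error, and
    report whether damage was seen."""
    data = bytearray()
    damaged = False
    with open(path, 'rb') as f:
        while True:
            try:
                chunk = f.read(CHUNK_SIZE)
            except OSError:
                # Unreadable region: preserve its length with zeros and
                # skip past it (hand-wavy for a real damaged device).
                damaged = True
                f.seek(f.tell() + CHUNK_SIZE)
                data.extend(b'\x00' * CHUNK_SIZE)
                continue
            if not chunk:
                break
            data.extend(chunk)
    return bytes(data), damaged
```

An undamaged file round-trips byte-for-byte; a damaged one keeps its overall layout, with zeroed gaps where the bad chunks were.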

The reason export_file() reads in chunks is not to do error-checking: unlike a CD or DVD, FoundationDB will not lose bits of your data. However, the blob’s read() method is transactional, and the intent of the blob layer is to allow blobs to be large. Transactions in FoundationDB cannot be long (currently defined as over five seconds) and are best kept under one second. You might be tempted to read a blob from the database in a single transaction, like so:

size = source.get_size(db)
data = source.read(db, 0, size)

but doing so with an arbitrary blob may well result in a long transaction. Hence, export_file() reads from the database with one transaction per chunk.
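The loop can be sketched like this, with one bounded read per iteration -- each of which would be its own transaction in the real method (read_chunk stands in for a transactional Blob.read; names are hypothetical):

```python
CHUNK_SIZE = 16384

def export_chunked(read_chunk, size, path):
    """Write a blob of the given size to a local file one piece at a
    time, so no single read (transaction) spans the whole blob."""
    with open(path, 'wb') as f:
        for offset in range(0, size, CHUNK_SIZE):
            f.write(read_chunk(offset, min(CHUNK_SIZE, size - offset)))
```

Each iteration is a short, bounded operation, keeping every transaction comfortably under the time limit regardless of the blob's total size.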

Let’s try out the import method on a text file. We can grab a plain text copy of Hamlet from Project Gutenberg and save it locally as hamlet.txt. Before we import the file, we’ll use grep -b in the shell to find the byte location of something easy to recognize:

$ grep -b 'To be, or not to be' hamlet.txt
85829:To be, or not to be,--that is the question:--
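The byte offset reported by grep lets us spot-check the import with a partial read. Here is a local-file analogue of the kind of ranged read Blob.read() would perform against the database (a sketch, not the blob layer's API):

```python
def read_at(path, offset, n):
    # Partial read at a byte offset, analogous to Blob.read(db, offset, n).
    with open(path, 'rb') as f:
        f.seek(offset)
        return f.read(n)
```

Reading a few dozen bytes at offset 85829 of the imported blob should return the famous line, confirming the import preserved byte positions.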

FileLibrary and Blob treat the data to be stored as uninterpreted binary data. Many standard data formats have well-defined metadata that you might want to record.

Let’s say you want to store MP3 files in a library. Most MP3s contain metadata in one of the ID3 formats, with ID3v2 being the most popular. We’ll use pytagger, one of the many Python modules available for ID3 handling, to capture ID3v2 metadata:

from tagger import ID3v2, ID3Exception

We can define an MP3Library class as a subclass of FileLibrary, extending the import_file() method to grab ID3v2 tags. It will use an _id3v2() method that scans for the tags and adds them as attributes:
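The helper's body isn't reproduced in this excerpt. A hedged sketch of a tag-scanning helper in the spirit of _id3v2() might look like the following; the pytagger calls are assumptions about its API and may need adjusting to the installed version, and any failure (including a missing module) is treated as "no readable tag":

```python
def scan_id3v2(path):
    """Hypothetical helper: return ID3v2 frames as {frame_id: strings},
    or {} when pytagger is unavailable or no readable tag is found."""
    try:
        from tagger import ID3v2  # pytagger; may not be installed
    except ImportError:
        return {}
    try:
        tag = ID3v2(path)
        if not tag.tag_exists():
            return {}
        return {frame.fid: frame.strings for frame in tag.frames}
    except Exception:  # treat any parse failure as "no readable tag"
        return {}
```

In the subclass, import_file() would call this after storing the blob and record each returned frame as a metadata attribute alongside the file.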

The flexibility of FoundationDB’s ordered key-value store allows it to model a wide variety of data. The goal of data modeling is to design a mapping of data to keys and values that efficiently supports the operations you need. In the case of large data objects, a good model will usually map a single object to multiple key-value pairs.

With structured data, you can often design a good model by encoding selected data elements into keys in a way that splits an object. With unstructured data, whether text or a binary large object, you can take advantage of the simple but powerful interface provided by the blob layer.