Your address book is a lot like a living organism.

It evolves over time as you meet new people – or as people change jobs, phone numbers or even their names. Many contact management systems don’t pay attention to this important fact. They model a contact list as a snapshot of your contacts at a single point in time: the present. Unfortunately, that snapshot starts aging immediately and must be kept up to date.

But what if an update incorrectly modifies a contact? Many systems support an “undo” function but, in a synchronized world, incorrect modifications may not be noticed until hours, days or weeks afterward. By then, a short-term, command-based undo history is of little use.

In other words, to find and correct errors in contacts – wherever those errors may come from – contact management systems must retain a complete version history of every change committed to an address book. This simple concept yields quite a bit of power, although getting it right can be tricky.

Which is where Git comes in.

Enter Git

In 2005, Linus Torvalds switched the version control system of the Linux kernel to Git, a distributed system he himself had written. Git achieved a number of goals: it was 100% locally stored, resistant to corruption, fast as hell, and space-efficient. By 2011 I had been using Git for several years, and I was quite impressed by the elegance of its design. I started noodling with the idea of a contact management API inspired by Git.

As a result, FullContact’s internal contact storage API embodies quite a few of my early ideas, the most significant of which is the complete content-addressable version history of every contact.

Content-Addressable Storage

Every change you make to contacts in your FullContact account is stored as an immutable and content-addressable copy of each contact version. “Content-addressable” simply means that you can compute an identifier for a contact using only its contents.

FullContact does this by computing the SHA1 of all of the data comprising the contact at a specific point in time. This lets us store contacts in a key-value store using a handy 160-bit key. It also protects us against unnecessarily-redundant data storage. Any two contacts which are exactly the same will yield the same SHA1 value, so it’s impossible to store the same data under two different keys.
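To make the idea concrete, here is a minimal sketch of content addressing in Python. It is not FullContact's actual code: the JSON serialization, the `contact_key` function and the in-memory `ContentStore` class are all assumptions made for illustration. The one thing it borrows from the description above is that the key is a SHA1 computed from nothing but the contact's contents.

```python
import hashlib
import json


def contact_key(contact: dict) -> str:
    """Return a SHA1 hex digest computed only from the contact's contents."""
    # Canonical serialization (sorted keys, fixed separators) so that
    # logically identical contacts always produce identical bytes.
    canonical = json.dumps(contact, sort_keys=True, separators=(",", ":"))
    return hashlib.sha1(canonical.encode("utf-8")).hexdigest()


class ContentStore:
    """Hypothetical in-memory key-value store keyed by content hash."""

    def __init__(self):
        self._blobs = {}

    def put(self, contact: dict) -> str:
        key = contact_key(contact)
        # Storing identical content twice is a no-op: same bytes, same key.
        self._blobs.setdefault(key, dict(contact))
        return key

    def get(self, key: str) -> dict:
        return self._blobs[key]

    def __len__(self):
        return len(self._blobs)


store = ContentStore()
key_a = store.put({"name": "Ada Lovelace", "email": "ada@example.com"})
# Same data with the fields in a different order maps to the same key.
key_b = store.put({"email": "ada@example.com", "name": "Ada Lovelace"})
```

Both `put` calls return the same 40-character hex key (160 bits), and the store ends up holding a single entry. The canonical serialization step matters: without it, two equal contacts could serialize to different bytes and hash differently.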

Versions = Commits

Contact versioning is so similar to the way Git versions files that, when it came time to cut the first prototype of the FullContact storage API, we actually used JGit, a Java implementation of Git. Unsurprisingly, after some experimentation, we noticed some important differences between versioning files and versioning contacts.

In Git, collections of files are called “trees”. In a contact versioning system, an address book acts as a “tree”, with the exception that address books don’t nest inside of each other. In a file system, nesting trees within other trees makes a lot of sense, but contact management systems need to support much richer categorization than a tree of labels (e.g. in FullContact, you can add as many tags as you like to your contacts).

In Git, a snapshot of the entire top-level tree is called a “commit”, and each commit references a “parent commit”, where the parent represents the previous state of the tree. This allows Git to walk from the latest commit (called the HEAD), parent by parent, all the way back to the first commit in the repository. This works great for Git because it can use relatively fast local disk access and because trees tend to be smaller (in terms of number of files). For a cloud contact management system to follow this same pattern, it would have to fetch each commit from storage, then fetch its parent (and its parent’s parent, etc.).
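The cost of that parent-by-parent walk is easy to see in a small sketch. The `CountingStore` class and `walk_history` function below are hypothetical; they stand in for any store where reading a commit means one round trip.

```python
class CountingStore:
    """Toy commit store that counts how many fetches a history walk needs."""

    def __init__(self):
        self._commits = {}
        self.fetches = 0

    def put(self, commit_id, parent_id):
        self._commits[commit_id] = {"id": commit_id, "parent": parent_id}

    def fetch(self, commit_id):
        # In a cloud system, each call here would be a network round trip.
        self.fetches += 1
        return self._commits[commit_id]


def walk_history(store, head_id):
    """Yield commit ids from HEAD back to the root, one fetch per commit."""
    current = head_id
    while current is not None:
        commit = store.fetch(current)
        yield commit["id"]
        current = commit["parent"]


commits = CountingStore()
commits.put("c1", None)   # first commit has no parent
commits.put("c2", "c1")
commits.put("c3", "c2")   # c3 is the HEAD
history = list(walk_history(commits, "c3"))
```

Walking three commits takes three sequential fetches, and each fetch can only start once the previous one has returned, because the parent id is not known until then. On local disk that is cheap; over a network the latency adds up with every version in the history.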

This doesn’t bode well for response times.

Because of these performance implications, we opted to simplify the Git model, prefixing a modification timestamp to the SHA1 hash of a contact’s contents. This ends up being especially useful when synchronizing, as a remote system can use this value (which we call an ETag) to easily compare a version it has stored with the latest version in FullContact, fetching any updates if they exist.
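Here is a rough sketch of what such an ETag could look like. The exact format is an assumption: a zero-padded millisecond timestamp, a dash, then the SHA1 hex digest. The function names `make_etag`, `is_stale` and `etag_timestamp` are likewise invented for this example.

```python
import hashlib
import json


def make_etag(contact: dict, modified_ms: int) -> str:
    """Build an ETag: zero-padded modification timestamp + SHA1 of contents."""
    canonical = json.dumps(contact, sort_keys=True, separators=(",", ":"))
    digest = hashlib.sha1(canonical.encode("utf-8")).hexdigest()
    # Zero-padding to 13 digits keeps string order equal to time order.
    return f"{modified_ms:013d}-{digest}"


def is_stale(stored_etag: str, latest_etag: str) -> bool:
    """A remote copy needs refreshing whenever its ETag differs from the latest."""
    return stored_etag != latest_etag


def etag_timestamp(etag: str) -> int:
    """Recover the modification time (in milliseconds) from an ETag."""
    return int(etag.split("-", 1)[0])


v1 = make_etag({"name": "Ada"}, 1_700_000_000_000)
v2 = make_etag({"name": "Ada L."}, 1_700_000_500_000)
remote_needs_fetch = is_stale(v1, v2)
```

A remote system holding `v1` can tell from a single string comparison that it is out of date, and it can read the modification time straight out of the ETag without walking any chain of parents.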

Point-in-Time Restore

A nice consequence of keeping a complete snapshot of your address book at every point in time is that you can also recreate its state at any given point in time. This makes it possible to compare contacts at different points in time, which can really help when third-party systems unintentionally pollute your address book by synchronizing bad data (like incorrect merges or out-of-date CRM exports). This requires a lot of storage, and some pretty intelligent history compaction algorithms, but it’s worth it to provide this flexibility to an address book.
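As an illustration, a point-in-time lookup can be sketched as a binary search over a contact's versions. The data layout (a sorted list of timestamp and contents pairs) and the helper names `version_at` and `changed_fields` are assumptions for this example, not a description of FullContact's storage.

```python
from bisect import bisect_right


def version_at(history, when):
    """Return the contact contents in effect at time `when`, or None.

    `history` is a list of (timestamp, contents) tuples sorted by timestamp.
    """
    timestamps = [ts for ts, _ in history]
    index = bisect_right(timestamps, when)
    return history[index - 1][1] if index else None


def changed_fields(old, new):
    """Return the sorted field names whose values differ between two versions."""
    return sorted(k for k in old.keys() | new.keys() if old.get(k) != new.get(k))


history = [
    (100, {"name": "Ada Lovelace", "phone": "555-0100"}),
    (200, {"name": "Ada Lovelace", "phone": "555-0199"}),  # e.g. a bad sync
]
before = version_at(history, 150)
after = version_at(history, 250)
diff = changed_fields(before, after)
```

Comparing the state just before and just after a suspicious sync shows which fields it touched, which is the kind of comparison that helps track down polluted data.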

The process for restoring a contact version is simple, and is exactly like reverting a commit in Git: create a new version with the contents of the old version you want to restore. This means that contact histories are immutable, and sidesteps a ton of problems that come up in a distributed system with mutable data.
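In code, restore-by-appending might look something like the sketch below. The `ContactHistory` class and its method names are hypothetical. The point is only that `restore` never edits or deletes an existing entry; it appends a copy of the old contents as the newest version.

```python
import hashlib
import json


def content_hash(contents: dict) -> str:
    canonical = json.dumps(contents, sort_keys=True, separators=(",", ":"))
    return hashlib.sha1(canonical.encode("utf-8")).hexdigest()


class ContactHistory:
    """Hypothetical append-only version list for a single contact."""

    def __init__(self):
        self._versions = []  # list of (timestamp, hash, contents)

    def commit(self, contents: dict, timestamp: int) -> int:
        """Append a new version and return its index in the history."""
        self._versions.append((timestamp, content_hash(contents), dict(contents)))
        return len(self._versions) - 1

    def restore(self, index: int, timestamp: int) -> int:
        """Restore an old version by appending a copy of it as the newest one."""
        _, _, old_contents = self._versions[index]
        return self.commit(old_contents, timestamp)

    def version(self, index: int):
        return self._versions[index]

    def latest(self):
        return self._versions[-1]

    def __len__(self):
        return len(self._versions)


h = ContactHistory()
good = h.commit({"name": "Ada Lovelace"}, timestamp=100)
h.commit({"name": "Ada Byron"}, timestamp=200)  # unwanted change
h.restore(good, timestamp=300)
```

After the restore, the history holds three versions rather than two. The unwanted change is still there at index 1, so the restore itself can be undone later in exactly the same way.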

Merges

One important pseudo-similarity to Git is the concept of merging contacts. Git supports merging branches. Specifically, this means that a commit in Git can have more than one parent. For a contact versioning system, a key difference is that multiple contacts can be merged. This is analogous to merging files in Git, not branches.

Yet a contact versioning system should still be able to support undo for a contact which was incorrectly merged into two other contacts. This requires contact versions to be modeled more like commits, but comes at the expense of being able to create a single commit which affects multiple contacts. Another approach is to model a commit as a group of individual contact versions, and then ensure that each version can refer to the commit it belongs to. This isn’t 100% compatible with content-addressability without using a separate index to track these relationships.
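The second approach can be sketched with a small side index. Everything here is hypothetical, including the `CommitIndex` class, the commit id and the version keys. It also assumes each version key is unique per write (for example, a timestamp-prefixed hash), so that a version belongs to exactly one commit.

```python
class CommitIndex:
    """Side index linking commit ids to the contact versions they contain."""

    def __init__(self):
        self._versions_by_commit = {}
        self._commit_by_version = {}

    def record(self, commit_id, version_keys):
        """Register a commit as a group of individual contact versions."""
        group = self._versions_by_commit.setdefault(commit_id, set())
        for key in version_keys:
            group.add(key)
            self._commit_by_version[key] = commit_id

    def commit_of(self, version_key):
        """Return the commit a version belongs to, or None if unknown."""
        return self._commit_by_version.get(version_key)

    def versions_in(self, commit_id):
        """Return a copy of the set of version keys grouped under a commit."""
        return set(self._versions_by_commit.get(commit_id, ()))


index = CommitIndex()
# One merge touched three contacts, so it produced three new versions.
index.record("merge-1", ["v-a2", "v-b2", "v-c2"])
siblings = index.versions_in(index.commit_of("v-b2"))
```

Because the relationship lives in the index rather than inside the hashed contents, the version keys stay purely content-derived. Starting from any one contact touched by a bad merge, the index finds every other version created by that merge, which is what an undo needs.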

Next up: Sync

As I hinted at earlier, a strong versioning system is essential to supporting address book syncing with third-party systems. Contact management systems use several approaches to syncing, ranging from periodic polling to event-based push schemes.

Sync is a pretty big topic – we’ll have a whole post coming up dedicated to exploring the challenges and some of the theoretical approaches taken to keeping multiple systems synchronized with each other.