We don't want your crap databases, says Twitter: We've made OUR OWN

Exclusive Twitter is growing up and, like an adult, is beginning to desire consistent guarantees about its data rather than instant availability.

At least that's the emphasis placed by the company on its new "Manhattan" data management software, a bedrock storage system that was revealed in a blog post on Wednesday.

What's not mentioned is that Twitter is developing something it calls secondary indexes, The Register can reveal. This is a potentially powerful feature within Manhattan that will give its employees greater flexibility over how they search through the social network's vast stores of data, arming the publicly listed company with another weapon for its commercial endeavors.

First, here's an overview of what Twitter told the world this week.

"Over the last few years, we found ourselves in need of a storage system that could serve millions of queries per second, with extremely low latency in a real-time environment. Availability and speed of the system became the utmost important factor. Not only did it need to be fast; it needed to be scalable across several regions around the world," Twitter wrote on its website.

Manhattan is software built by the social network's engineers to cope with the roughly 6,000 tweets that flood into its system every second. Though 6,000 tweets is not that much data, the messages hold a lot of complexity, as Manhattan also needs to handle the mesh of replies and retweets per tweet – a tricky problem when a celeb with millions of followers makes a remark and is immediately inundated with responses.

With this technology, which may eventually become open source, Twitter has been able to move away from a prior system that used Cassandra for eventual consistency and additional tools for strong consistency to a single, hulking system that does both. Manhattan has been in production for over a year, we understand.

Developers can select the consistency of data when reading from or writing to Manhattan, allowing them to create new services with varying tradeoffs between availability (how quickly something can be accessed) and consistency (how sure you are of the results of a query).

Because of this, Twitter's programmers can access a "Strong Consistency service" which uses a consensus algorithm paired with a replicated log to make sure that in-order events reach replicates.

So far, Twitter offers LOCAL_CAS (strong consistency within a single data center) and GLOBAL_CAS (strong consistency across multiple facilities). These will have "different tradeoffs when it comes to latency and data modeling for the application," Twitter noted in a blog post on Manhattan.

We have to go deeper

Data from the system is stored on three different systems: seadb is a read-only file format, sstable is a log-structured merge tree for heavy-write workloads and btree is a heavy-read and light-write system. The Manhattan "Core" system then decides whether to place information on spinning disks, memory, or solid-state flash (SSDs).

Manhattan can match incoming data to its most appropriate format. The output of Hadoop workloads, for example, can be fed into Manhattan from an Hadoop File System, and the software will transform that information into seadb files "so they can then be imported into the cluster for fast serving from SSDs or memory," Twitter explains.

The database supports multi-tenancy and contains its own rate-limiting service to stop Twitter's developers flooding the system with requests. These systems are wrapped in a front-end that, Twitter says, gives its engineers access to a "self-service" storage system.

"Engineers can provision what their application needs (storage size, queries per second, etc) and start using storage in seconds without having to wait for hardware to be installed or for schemas to be set up," Twitter wrote, describing the system.

Twitter has plans to publish a white paper outlining the technology in the future, and may even publish the database as open source. The latter may take a while though, as one former Twitter engineer said in a post to an online message board: "It's got a lot of moving parts and internal integrations, however, so it's got to be a ton of work to make open."

The future? Secondary indexes

As for further development, the company is working to implement secondary indexes, we've discovered.

Secondary indexes let developers add an additional set of range keys to the index for a database, dramatically speeding up the way developers can navigate through and ask questions of large amounts of data.

Amazon Web Services's major DynamoDB product, for instance, implements this technology in a local and global (multi-data center) format.

By adding secondary indexes into Manhattan, Twitter will give its developers the ability to write more sophisticated queries against their large data sets, the indexes of which will be stored on memory.

This means, for instance, that Twitter's commercial arm could build a fast ad system that could present adverts in near-real-time to people according to a multitude of different factors at a lower cost and higher level of flexibility than before.

Systems like secondary indexes will be crucial to the rollout of more advanced, granular, advertising display options, and are therefore a critical tool in Twitter's commercial arsenal.

There are challenges, though: secondary indexes will initially be built to offer consistency just locally within a data center, because generating global secondary indexes is computationally impractical. ®