This Is What You Build to Juggle 6,000 Tweets a Second



Illustration: Getty Images

When you open the Twitter app on your smartphone and all those tweets, links, icons, photos, and videos materialize in front of you, they're not coming from one place. They're coming from thousands of places.

Twitter juggles tweets from more than 240 million people across the globe, with about 5,700 of these mini-messages sent every second, and this enormous stream of digital information gets stored on thousands of servers inside the company's vast network of data centers. Because it contains so many types of data – from links and video to the bits of metadata you don't see – even a single tweet is spread across multiple machines. The trick lies in finding ways of grabbing all the right data from this sea of machines and rapidly delivering it to your phone or PC.
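
For the technically inclined, here's a minimal sketch of that idea, sometimes called "scatter-gather": fetch each piece of a tweet from a different machine in parallel, then stitch the pieces together before handing them to your phone. The shard names and lookup functions below are invented for illustration; this is not Twitter's actual code.

    from concurrent.futures import ThreadPoolExecutor

    # Hypothetical mapping of data kinds to per-shard lookups. In a real
    # system each lambda would be a network call to a different machine.
    SHARDS = {
        "text":     lambda tweet_id: "text for tweet %d" % tweet_id,
        "media":    lambda tweet_id: "media for tweet %d" % tweet_id,
        "metadata": lambda tweet_id: "metadata for tweet %d" % tweet_id,
    }

    def fetch_tweet(tweet_id):
        # Fan out to every shard at once, then gather the results.
        with ThreadPoolExecutor() as pool:
            futures = {kind: pool.submit(fn, tweet_id) for kind, fn in SHARDS.items()}
            return {kind: f.result() for kind, f in futures.items()}

    print(fetch_tweet(42))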

In the beginning, Twitter did this with help from software systems that are widely used across the web – things like the open source databases MySQL and Cassandra. But much like Google and Facebook, the microblogging outfit reached a point where its operation had grown so large and so complex, ordinary software just didn't cut it. Twitter needed a new type of software that could juggle massive amounts of information in new and more efficient ways. So it started building its own.

Twitter has openly shared other new-age tools it created to help run its massive online empire, but it has mostly kept quiet about the system it designed to store and retrieve all the data that comes tumbling into its computing centers with each passing second. Created by a team of three engineers working in the Twitter engine room – Chris Goffinet, Peter Schuller, and Boaz Avital – this system is called Manhattan, and it may provide a glimpse into the near future of the computer database.

From left: Chris Goffinet, Boaz Avital, and Peter Schuller.

Photo: Ariel Zambelich/WIRED

Today, most large online operations are underpinned by many disparate databases, each designed to handle a slightly different type of task. When a web service reaches a certain size, this is almost a necessity. But with Manhattan, Twitter has built a database that aims to do it all – a database capable of driving its entire online empire. It doesn't yet drive the whole thing, but it has run parts of Twitter's operation for more than a year, and the plan, says Goffinet, is to move nearly all of Twitter's database tasks to Manhattan in the near future.

Like many web outfits, Twitter once relied heavily on Cassandra. Originally developed by Facebook, Cassandra is one of many "NoSQL" databases designed to store data across hundreds or even thousands of machines. This worked well enough, but Twitter soon found it too difficult to expand the system onto new sets of machines. "A lot of systems have this problem," says Avital, who helped oversee the use of Cassandra at the company. "It's hard to manage systems when server clusters grow from tens to hundreds to thousands of nodes." And, perhaps more importantly, there were certain tasks Cassandra just wasn't suited to – tasks that required other database tools such as MySQL and something called Gizzard.
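
Cassandra actually uses a more sophisticated placement scheme than the one below, but a toy example shows the basic problem any distributed store must wrestle with when machines are added: data has to move. With the naive approach – hash each key, take it modulo the number of nodes – growing the cluster by even one machine reshuffles almost everything.

    def shard(key, num_nodes):
        # Naive placement: hash the key, take it modulo the cluster size.
        return hash(key) % num_nodes

    keys = ["tweet:%d" % i for i in range(10000)]
    moved = sum(shard(k, 10) != shard(k, 11) for k in keys)
    print("%.0f%% of keys move when growing from 10 to 11 nodes"
          % (100.0 * moved / len(keys)))

Run it, and you'll see roughly nine out of ten keys land on a different machine after adding a single node – which is why real systems go to such lengths to make expansion cheaper, and why it remains a hard operational problem at Twitter's scale.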

Cassandra is what's known as an "eventually consistent" database. Basically, this means you can store and retrieve data without delay – in theory, at least, you never have to wait for the data to become available. The rub is that you can't always be sure the data you're retrieving is completely up-to-date. At Twitter, this model suits most tasks just fine. You don't want a long delay when opening up your tweet stream, but it's OK if you don't get everything that was posted to the service in the last fractions of a second. "Because we're a real-time company, we really care about availability of our data," says Goffinet, who previously worked on massive data systems at Yahoo and Digg. "If it's inconsistent for milliseconds, that's fine. But we have to be up and online at all times."
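
Here's a toy illustration of that trade-off, with all the machinery of a real replicated store stripped away. A write is acknowledged as soon as one replica has it, and a read can be served by any replica – so a lagging replica can briefly return stale data, but it returns immediately. This is purely illustrative, not Manhattan's design.

    import random

    # Three copies of the same record, living on three different machines.
    replicas = [{"bio": "old"}, {"bio": "old"}, {"bio": "old"}]

    def write(key, value):
        # Acknowledged as soon as one replica has it; copying the value to
        # the other replicas happens asynchronously, some time later.
        replicas[0][key] = value

    def read(key):
        # Any replica may answer, with no waiting, so a lagging one can
        # hand back data that is milliseconds out of date.
        return random.choice(replicas)[key]

    write("bio", "new")
    print(read("bio"))  # prints "new" or "old", depending on which replica answers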

That said, there are cases where Twitter needs a "strongly consistent" database – one where you know that you're getting all the latest data. It needs strong consistency, for instance, when storing all available Twitter handles. Everyone across the network must have the same view of all the handles. Otherwise, the system might assign the same handle to two different people. In situations like this, Twitter used Gizzard. This isn't unusual. Typically, if they want strong consistency for some tasks and eventual consistency for others, companies will use two databases, pairing something like Cassandra with a strongly consistent system such as Gizzard, MongoDB, or HBase.
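
A minimal, single-process sketch of that idea: the handle table is guarded so that checking for a name and claiming it happen as one atomic step. In a real distributed database, the lock below is replaced by an agreement protocol run across many machines – the names here are made up – but the guarantee is the same: no two people can walk away with the same handle.

    import threading

    handles = {}                 # handle -> user id, the source of truth
    lock = threading.Lock()      # stand-in for a cluster-wide agreement step

    def claim_handle(handle, user_id):
        # Check and set atomically, so two racing claims can't both succeed.
        with lock:
            if handle in handles:
                return False
            handles[handle] = user_id
            return True

    print(claim_handle("jack", 1))  # True: the handle is now reserved
    print(claim_handle("jack", 2))  # False: the second claim is rejected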

But about two years ago, Goffinet, Schuller, and Avital created Manhattan. According to the engineers, it lets the company expand across new machines much more easily than it could with Cassandra, and it lets them run both eventually consistent and strongly consistent applications. This morning, for the first time, Twitter released a detailed description of the system, and in the future, it may open source the thing, sharing the underlying code with the world at large – though Goffinet says the company is still mulling this over.

Though they have yet to see Manhattan in action, some database engineers outside the company stress that it doesn't appear to be a vast leap forward in database design. "This isn't revolutionary," says Eric Frenkiel, an ex-Facebook engineer who now runs a new-age database company called MemSQL. "The funny thing with databases is that everything has pretty much been done before." But they do agree that this type of design is where the rest of the database world may eventually wind up – a design where all tasks can be handled by a single platform. Indeed, as of this fall, Cassandra offers a way of running strongly consistent tasks alongside eventually consistent ones (in addition to a new tool designed to help you more easily expand the database across more machines). And according to Eliot Horowitz, the chief technology officer at MongoDB, his company's database will likely embrace both models as well.
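
Cassandra exposes that strongly consistent option through what it calls lightweight transactions, added in version 2.0. Below is a sketch of the handle-reservation case using the open source DataStax Python driver; the keyspace and table names are made up for illustration.

    from cassandra.cluster import Cluster  # open source DataStax Python driver

    # Connect to a hypothetical local cluster and keyspace.
    session = Cluster(["127.0.0.1"]).connect("twitter_demo")

    # IF NOT EXISTS runs a consensus round under the hood, so of two
    # competing inserts for the same handle, only one can win.
    result = session.execute(
        "INSERT INTO handles (handle, user_id) VALUES (%s, %s) IF NOT EXISTS",
        ("jack", 1),
    )
    print(result.one()[0])  # True if the insert won; False if the handle was taken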

Databases have evolved by leaps and bounds over the past few years, providing new ways of accommodating increasingly enormous online operations, but because of the trade-off between availability and consistency – which is more of a continuum than a choice between black and white – they have evolved in many different directions. LinkedIn has built at least two massive databases of its own. Google has built at least three. And the open source world is littered with NoSQL databases, all trying to serve particular needs. But in the future, these various threads will come back together. The future will look more like Manhattan. "The big web companies," Goffinet says, "are already doing this."