Scaling Twitter to New Peaks

For many of us, Twitter has become an essential communications utility. People and businesses use Twitter every day in broader and deeper ways, so we all have an interest in how well Twitter scales. Earlier this month Twitter experienced and seamlessly handled a new peak load of 143,199 tweets per second—a substantial spike above its current steady state of 5,700 tweets per second. Raffi Krikorian, VP of Platform Engineering at Twitter, reported the new record and took some time to review the engineering changes they've made to scale to this new level of traffic.

Three years ago, peaks of 2,000 tweets per second from activity around the 2010 World Cup caused major stability problems for Twitter and a realization that they needed to re-architect their systems. A subsequent engineering review found that Twitter had the world's largest Ruby on Rails installation: everything was in one codebase, and both the application and the engineering team were monolithic. Their MySQL storage system had reached its limits, hardware was not fully utilised, and repeated "optimizations" were ossifying the codebase. Krikorian reports that Twitter came out of their review with some big aims: to reduce the number of machines by 10x, to move to a loosely coupled service oriented architecture with cleaner boundaries and more cohesion, and to be able to launch new features faster with smaller empowered teams.

Twitter moved to the JVM and away from Ruby. They had hit the limits of Ruby's process-level concurrency model and needed a programming platform that provided higher throughput and better use of hardware resources. Rewriting their codebase on the JVM yielded better than a 10x performance improvement, and they now push 10-20K requests/sec/host.

Twitter's largest architectural change was moving to a service oriented architecture focussing on their "core nouns" of tweet, timeline and user services. Their development approach relies on "design by contract" where interface definitions are agreed up front and then teams work independently on the implementation. The services are autonomous and self-contained and that is reflected in the new engineering team structure. An asynchronous RPC platform, Finagle, was developed to handle concurrency, failover and load balancing in a standard manner across all engineering teams.
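The core idea of an async RPC layer like Finagle—concurrency, failover across replicas, and load balancing handled uniformly below the application—can be sketched roughly as follows. Finagle itself is a Scala/JVM library; this is an illustrative Python asyncio sketch of the pattern, and all names here are hypothetical rather than Finagle's API:

```python
import asyncio

# Illustrative sketch of the async-RPC-with-failover idea behind a layer
# like Finagle. Each "replica" is a coroutine standing in for a remote host.

class RpcClient:
    def __init__(self, replicas):
        self.replicas = list(replicas)

    async def call(self, request):
        last_error = None
        for replica in self.replicas:  # failover: try each replica in turn
            try:
                return await replica(request)
            except ConnectionError as exc:
                last_error = exc  # record the failure and move on
        raise last_error  # every replica failed

async def flaky(request):
    raise ConnectionError("host down")

async def healthy(request):
    return f"ok:{request}"

async def main():
    # the first replica fails, so the client transparently fails over
    client = RpcClient([flaky, healthy])
    return await client.call("timeline")

print(asyncio.run(main()))  # prints "ok:timeline"
```

Because every team calls services through the same client abstraction, retry and failover policy lives in one place instead of being reimplemented per service.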

The new architecture is reflected in the organization of Twitter's engineering teams. The services and their teams are autonomous and self-contained. Each team owns their interfaces and their problem domains. No one needs to be an expert across the system, and not everyone has to worry about scaling tweets. Critical capabilities are abstracted behind APIs that make them accessible to everyone who needs them.

But even with a less monolithic architecture, says Krikorian, persistence remains a huge bottleneck. Twitter's single master MySQL database has been replaced with a distributed framework of sharded, fault-tolerant databases using Gizzard.
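The essential idea of a sharding layer like Gizzard is that a stable hash of the key decides which database holds the data, so reads and writes for a given key always route to the same place. Gizzard itself is a Scala framework that also handles replication and fault tolerance; the sketch below is a deliberately minimal Python illustration of the routing step only, with dicts standing in for databases:

```python
import hashlib

# Minimal sketch of hash-based shard routing, the core idea behind a
# sharding framework like Gizzard (which additionally provides
# replication and fault tolerance across real database hosts).

class ShardedStore:
    def __init__(self, num_shards):
        # each "shard" is a dict standing in for a separate database
        self.shards = [{} for _ in range(num_shards)]

    def _shard_for(self, key):
        # stable hash so a given key always routes to the same shard
        digest = hashlib.md5(str(key).encode()).hexdigest()
        return self.shards[int(digest, 16) % len(self.shards)]

    def put(self, key, value):
        self._shard_for(key)[key] = value

    def get(self, key):
        return self._shard_for(key).get(key)

store = ShardedStore(num_shards=4)
store.put(143199, "tweet body")
print(store.get(143199))  # routed back to the shard it was written to
```

Spreading keys across shards removes the single-master write bottleneck, since each shard accepts writes independently.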

Reinforcing a common theme for scaling large systems, observability and statistics are a key tool to manage the system and provide concrete data to support optimization efforts. Twitter's development platform incorporates tools which make it very easy for developers to provide request tracing and statistical reporting.
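The kind of low-friction statistical reporting described above might look something like a timing wrapper that records per-request latencies into shared counters. This is a hypothetical sketch of the pattern, not Twitter's actual tooling:

```python
import time
from collections import defaultdict

# Hypothetical sketch of per-request statistics collection: each traced
# function records its latency under a named counter for later reporting.

stats = defaultdict(list)  # request name -> list of latencies in seconds

def traced(name):
    def decorator(fn):
        def wrapper(*args, **kwargs):
            start = time.perf_counter()
            try:
                return fn(*args, **kwargs)
            finally:
                stats[name].append(time.perf_counter() - start)
        return wrapper
    return decorator

@traced("timeline.fetch")
def fetch_timeline(user_id):
    return [f"tweet for {user_id}"]

fetch_timeline(42)
fetch_timeline(43)
count = len(stats["timeline.fetch"])
avg_ms = 1000 * sum(stats["timeline.fetch"]) / count
print(f"timeline.fetch: {count} calls, avg {avg_ms:.3f} ms")
```

When instrumentation is this cheap to add, developers measure by default, which is what makes data-driven optimization possible across hundreds of services.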

The final element in Twitter's scaling story is the effort put into their runtime configuration and testing environment. Testing Twitter at "Twitter scale" can really only be done in production. Deployment of new features could also require a challenging level of coordination across teams. So Twitter have developed a mechanism called Decider to switch on new features only after they have been deployed. Features can be deployed in an "off" setting and switched on either in a binary fashion (all at once), or gradually for a percentage of operations.
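A Decider-style switch can be sketched as a map from feature name to rollout percentage, with a deterministic hash bucketing each user so that a given user sees consistent behavior as the percentage ramps up. The code below is an illustrative guess at the mechanism, assuming per-user bucketing; the real Decider's internals are not described in the post:

```python
import hashlib

# Illustrative sketch of a Decider-style feature switch: code ships in an
# "off" state, then is enabled for a percentage of users without redeploying.

class Decider:
    def __init__(self):
        self.features = {}  # feature name -> rollout percentage (0-100)

    def set_percentage(self, feature, percentage):
        self.features[feature] = percentage

    def is_enabled(self, feature, user_id):
        percentage = self.features.get(feature, 0)  # unknown features stay off
        # deterministic bucket: the same user always lands in the same
        # 0-99 slot, so ramping 10% -> 20% only adds users, never flickers
        digest = hashlib.md5(f"{feature}:{user_id}".encode()).hexdigest()
        return int(digest, 16) % 100 < percentage

decider = Decider()
decider.set_percentage("new_timeline", 0)    # deployed, but switched off
decider.set_percentage("new_timeline", 10)   # gradual rollout to ~10% of users
decider.set_percentage("new_timeline", 100)  # fully on, all at once

print(decider.is_enabled("new_timeline", user_id=42))  # prints "True"
```

Setting the percentage to 100 gives the binary all-at-once switch, while intermediate values give the gradual rollout the post describes.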

The overall result for Twitter today is that it is more scalable, more resilient and more agile than before. Traffic volumes are breaking new records and new features can be rolled out without significant disruption. Krikorian finishes his blog post urging us to keep an eye on @twittereng for more details about Twitter's re-architecture.
