Lessons learned in scaling Twitter

Brian Degenhardt talked about some
of the challenges of scaling Twitter. Initially the system was developed as a
monolithic application: a MySQL database was used to hold everything, with a
Rails app fronting the code. Breaking the system down into separate services
meant that the components could be scaled independently of each other, at
both the network layer and the development layer.

The second-generation Twitter split the monolith into different back-end
data storage layers, using
gizzard
as a sharding mechanism for MySQL databases and a
redis
instance for storing timeline information. Subsequent iterations replaced the
front-end components with services as well, which allowed more heavily used
services (such as the timeline and user tweet details) to be scaled
horizontally when needed.

When a request comes in to Twitter it can be decomposed into calls to
different services, and by using non-blocking futures the results of a series
of lookups can be composed into one chain. Many requests are executed in
parallel, but because that parallelism is hidden by the underlying futures
framework the code remains easy to understand whilst still taking advantage
of multiple threads.
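Twitter's actual implementation uses Finagle's composable futures on the JVM; as an illustrative sketch of the same idea (the service names and payloads here are invented), parallel lookups can be composed so that the calling code reads sequentially:

```python
import asyncio

# Hypothetical per-service lookups; in Twitter's case these would be
# Finagle futures calling real back-end services.
async def lookup_user(user_id):
    return {"id": user_id, "name": f"user-{user_id}"}

async def lookup_timeline(user_id):
    return [f"tweet-{n}" for n in range(3)]

async def render_page(user_id):
    # Both lookups run concurrently; the composition reads sequentially.
    user, timeline = await asyncio.gather(
        lookup_user(user_id), lookup_timeline(user_id)
    )
    return {"user": user["name"], "tweets": timeline}

result = asyncio.run(render_page(42))
```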

When a write to Twitter happens, it gets split into a lookup of all the
subscribers and then writes a copy of each into all the timelines. This involves
a fan-out (lookup of affected destinations) and then a sharded write across
the database. For providers that are subscribed to the firehose, the feed
goes to them as well – around 30Mb/s in total. In the case of the
Lady Gaga
tweet replying to the Mars rover,
this resulted in combining follower sets of around 1.5 million and 40 million
users, which took some time in the fan-out processing.
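A toy sketch of the fan-out step, assuming in-memory dicts in place of the follower store and the sharded timeline databases:

```python
# Hypothetical follower store and sharded timeline stores.
followers = {"gaga": ["alice", "bob", "carol"]}
N_SHARDS = 4
shards = [{} for _ in range(N_SHARDS)]

def shard_for(user):
    # Pick a shard deterministically from the user name.
    return shards[hash(user) % N_SHARDS]

def fan_out(author, tweet_id):
    # Step 1: look up all the destinations affected by the write.
    destinations = followers[author]
    # Step 2: write a copy of the tweet reference into each follower's
    # timeline, spread across the database shards.
    for user in destinations:
        shard_for(user).setdefault(user, []).insert(0, tweet_id)
    return len(destinations)

written = fan_out("gaga", "t1")
```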

In terms of measuring the performance of Twitter, once the data stream moves
into the tens of errors per second (even at 99.999% uptime, 300,000 requests
per second implies errors at that scale), it’s impossible to analyse
individual events. By focusing on a statistical view of the events, it’s
possible to get a much better picture of overall system health than by
decoding any single event.

To that extent, Twitter uses a distributed tracing system called
Zipkin
(demo) which performs tracing throughout the
lifetime of a request, based on Google’s
Dapper paper.
Google described the system but never open-sourced it, so Twitter
re-implemented it from the contents of the paper and open-sourced the result
instead.

The underlying service calls use HTTP for external communication and, once
inside the boundary, Thrift for passing RPC messages, with a standardised
library that provides load balancing, service discovery, metrics generation
and logging. These metrics are collected centrally and can be visualised as
histograms, or as a view of the data points with percentiles that shows when
the system is suffering under a higher load.
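As a sketch of the percentile view (with simulated latencies, not real Twitter data), a slow tail that would vanish in an average shows up clearly at a high percentile:

```python
import random

# Instead of inspecting individual requests, summarise a window of
# latency samples with percentiles.
random.seed(0)
# Simulated latencies in milliseconds: mostly fast, with a slow tail.
latencies = [random.gauss(20, 5) for _ in range(990)] + [200.0] * 10

def percentile(samples, p):
    # Nearest-rank percentile over a sorted copy of the samples.
    ordered = sorted(samples)
    k = max(0, min(len(ordered) - 1, round(p / 100 * len(ordered)) - 1))
    return ordered[k]

p50 = percentile(latencies, 50)     # the typical request
p999 = percentile(latencies, 99.9)  # the slow tail under load
```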

Twitter open-sources most of their code for high performance services
via their GitHub repositories under
an Apache license.

Scaling at Netflix

Ruslan Meshenberg discussed
how Netflix leverages multiple regions to increase availability. In fact, he
said that the talk was “Really about failure” and how it’s handled, rather
than anything else.

He showed a pyramid of failures, with top incidents being those that caused
negative PR (mitigated by active/active pairs and game day practicing),
customer service incidents (mitigated by better tools and practices),
metrics impact (mitigated by data tagging and feature enable/disabling),
and automated failover/recovery processes.

The key should be about how to respond to failure; with enough systems,
eventually at least one will have a hardware-related failure with a disk
or CPU dying, or natural events such as a
lightning strike taking out
power to a data centre.
It’s possible that human error will be involved
(configuration, bad code push etc.) but the key is to be able to recover
quickly and painlessly from any events that may occur.

In the case of an Amazon ELB failure, in which the elastic load balancing
configuration was lost, Netflix lost connectivity for a
few hours on Xmas eve in 2012.
To patch this immediately Isthmus was created to front-end
all services coming in to ELB to provide a backup level of routing should
ELB fail again. This was a spike solution aimed solely at providing
resilience for ELB-related traffic; it grew into
Netflix Zuul which provides
load balancing and front-end re-routing for all of the (non-streaming)
Netflix services. In conjunction with various DNS update layers fronted
by denominator,
which provides a cross-provider DNS updating tool, any service can be taken
off-line and recovered by updating either routing information or DNS
entries to point to a different location (which typically has a 5-10 minute
TTL).

The service information is stored in a replicated Apache Cassandra instance,
which provides active/active regions (with each region being its own
consistency group) and eventual consistency between regions. This is also
fronted by a custom EVcache
which provides a Memcached API but with tools to permit remote eviction of
cached information, and the soon-to-be-open-sourced
ribbon client libraries
and karyon server libraries
are used, which provide a generalised RPC mechanism. This is combined with
the configuration information and routing provided by
Asgard, and configuration
updates by Archaius.

To fail fast in the event of a dependent service failure, a circuit-breaker
mechanism is used to disconnect failing dependencies: the
Hystrix library provides
a means to measure the performance of individual services and to disconnect
them if they go past a certain error threshold or delay. By failing fast,
the system is able to detect that an error condition has occurred and re-route
the requests to a healthy part of the system.
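A minimal circuit breaker in the spirit of Hystrix (an illustrative sketch, not the Hystrix API) might look like:

```python
# After too many consecutive failures the breaker opens and subsequent
# calls fail fast to a fallback, without touching the unhealthy service.
class CircuitBreaker:
    def __init__(self, threshold=3):
        self.threshold = threshold
        self.failures = 0

    @property
    def open(self):
        return self.failures >= self.threshold

    def call(self, fn, fallback):
        if self.open:
            # Fail fast: don't wait on a service known to be unhealthy.
            return fallback()
        try:
            result = fn()
            self.failures = 0  # a success closes the breaker again
            return result
        except Exception:
            self.failures += 1
            return fallback()

breaker = CircuitBreaker(threshold=3)

def flaky():
    raise RuntimeError("service down")

# Three failing calls trip the breaker; the fourth is short-circuited.
results = [breaker.call(flaky, lambda: "fallback") for _ in range(4)]
```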

To test whether this works, various monkeys are used to take out parts
of the system, under the
Simian Army umbrella.
The original Chaos Monkey is used to take out individual nodes in the system,
whilst the Chaos Gorilla can be used to take out sets of services or nodes.
Should a whole zone need to be taken out, Chaos Kong will wipe out an entire
zone to ensure that the system is working as expected, all the time working
in production. After all, if failure is a natural event and the system has
had practice at recovering from it, then when an unplanned event occurs it is
more likely to keep running in the same way.
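The Chaos Monkey idea itself reduces to something very small (a toy sketch; the fleet structure and instance names here are made up):

```python
import random

# Pick a random running instance and terminate it, so that recovery
# paths are exercised continuously rather than only during incidents.
random.seed(1)
fleet = {f"i-{n:03d}": "running" for n in range(5)}

def chaos_monkey(fleet):
    victim = random.choice(sorted(fleet))
    fleet[victim] = "terminated"
    return victim

victim = chaos_monkey(fleet)
survivors = [i for i, state in fleet.items() if state == "running"]
```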

Overview of Vert.X

Tim Fox gave an overview of Vert.X,
describing it as a lightweight, reactive application platform superficially
similar to Node.JS and inspired by Erlang. However, unlike Node.JS and Erlang,
Vert.X is polyglot, with implementations and libraries in many different
languages and a common communication bus for passing messages between them.

At its core lies an event bus that communicates with JSON messages and
to which many individual single-threaded processes called verticles subscribe.
These can be written in any language supported by Vert.X; messages are passed
between them as immutable data structures and so can be materialised into
any supported format. In this sense it is similar to the Actor model,
with no shared state.
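The pattern can be sketched outside Vert.X itself: single-threaded workers that share no state and exchange serialised JSON messages over a bus (illustrative only; real verticles use Vert.X’s own event bus API):

```python
import json
import queue
import threading

# A bus of named addresses; messages cross it as serialised JSON,
# so the workers never share mutable state.
bus = {"echo": queue.Queue(), "replies": queue.Queue()}

def echo_verticle():
    # Each "verticle" runs on its own single thread and owns its state.
    msg = json.loads(bus["echo"].get())
    reply = {"echo": msg["body"].upper()}
    bus["replies"].put(json.dumps(reply))

t = threading.Thread(target=echo_verticle)
t.start()
bus["echo"].put(json.dumps({"body": "hello"}))
reply = json.loads(bus["replies"].get())
t.join()
```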

Running a vert.x program can be achieved with vertx run followed by a class
or function that implements the verticle. For interpreted languages such as
Groovy or JavaScript the VM starts and begins executing the code directly;
for compiled languages like Java and Scala a compilation process is kicked
off to compile the code ahead of time. In development, these can be reloaded
dynamically to facilitate development and debugging.

The event bus can be distributed between different JVMs, enabled by running
with the -cluster flag. This will discover other JVMs running on the same
host (though cross-host clustering is also possible) and sync up messages
between the two to allow for distributed processing.

It is also possible to sync the event bus to a web browser, using the
JavaScript libraries to consume the event bus data over HTTP. Since it uses the
same API but does not enforce any one particular style of user interface, it is
possible to generate a UI using whatever framework or toolkit is desired.

Modules are also possible in Vert.X, using a mod.json descriptor file
and zero or more resources or verticles. These can be run with vertx runmod
and will start up any entry points discovered. It is also possible to refer
to modules stored in Maven central, using a group~artifact~version syntax;
for example, running vertx runmod io.vertx~hello-mod~1.0 will download
and run the
io.vertx/hello-mod/1.0/hello-mod-1.0.jar
file and start executing the contents. It is also possible to generate a
standalone application by running vertx fatjar which will download all the
dependencies and create a single jar containing the bootstrap library code and
dependencies.
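The mapping from the group~artifact~version syntax to the repository path described above can be sketched as (assuming the standard Maven layout):

```python
# Convert a vertx module coordinate such as "io.vertx~hello-mod~1.0"
# into the corresponding Maven repository jar path.
def module_to_path(coordinate):
    group, artifact, version = coordinate.split("~")
    return f"{group}/{artifact}/{version}/{artifact}-{version}.jar"

path = module_to_path("io.vertx~hello-mod~1.0")
```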

As well as clustered mechanisms it is possible to run vertx in a high
availability mode, with -ha. This implies -cluster and allows multiple JVMs
to take the load of individual vertx modules, so that if a high-availability
group has been created that provides two modules and one of those module
processes dies, a new one will be instantiated on a different node in the HA
cluster.

Using Docker in Cloud Networks

Chris Swan talked about
Docker as a container mechanism for software.
Under the hood it uses Linux containers (via LXC) and some kind of union
filesystem such as AUFS, or a snapshotting filesystem such as ZFS or Btrfs.

By specifying a dockerfile, with a set of build instructions (with the RUN
command) and an execution (with the CMD command) it is possible to automate the
booting of a container inside an existing machine. Since many of the host OS
libraries and functions are used it is possible to spin up many containers for
invocations. For example, a dockerfile might look like:

FROM ubuntu:12.04
MAINTAINER cpswan
RUN echo 'Hello World'

This would spin up a new container based on the ubuntu:12.04 image, run
echo ‘Hello World’ and then quit.

Docker is invoked with the docker command, which needs root privileges to
run. It is possible to run a pre-configured instance and expose port 1234 using
sudo docker run -d -p 1234 cpswan/demoapp or if the local port needs to be
bound then sudo docker run -d -p 1234:1234 cpswan/demoapp could be used
instead. Various docker examples are available at Chris Swan’s
github page, and a blog discussing
how to implement multi-tier apps in docker
is also available.

Docker is restricted to running on Linux, but in the most recent release a
version has been made available for OS X, using VirtualBox to host a
Tiny Core Linux image.

Scaling continuous deployment at Etsy

Avleen talked about how Etsy, a marketplace
for custom-built products, scaled their deployment systems, mainly by
identifying bottlenecks in their existing deployment process and speeding
those up.

The Etsy codebase is a collection of PHP files which are deployed and pushed
to the Etsy servers, with an Apache front end serving them. A deploy of this
data set takes around 15 minutes, so to enable more pushes per day they
combine several unrelated changes in each push and synchronise between the
owners of those changes to increase the velocity of changes going out. In
addition, by separating configuration changes from code changes, two parallel
streams of pushes can occur, which frees up more resources for pushing
content out.

As a micro-optimisation, they used a tmpfs to serve the newly pushed site
more quickly than before, and a custom Apache module on the server side made
it possible to atomically switch between old and new versions (named
‘yin’ and ‘yang’) by flipping a configuration bit to point between
DocumentRoot1 and DocumentRoot2.
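Etsy’s switch was done with that custom Apache module; an analogous (and simpler, hypothetical) mechanism is an atomically renamed symlink between the two document roots:

```python
import os
import tempfile

# Two document roots standing in for 'yin' and 'yang'; the live site
# is whichever one the 'current' symlink points at.
root = tempfile.mkdtemp()
for name in ("docroot1", "docroot2"):
    os.mkdir(os.path.join(root, name))

def activate(target):
    # Create the new symlink under a temporary name, then rename it over
    # the old one: rename is atomic on POSIX, so readers always see
    # either the old or the new docroot, never a broken link.
    tmp = os.path.join(root, "current.tmp")
    os.symlink(os.path.join(root, target), tmp)
    os.replace(tmp, os.path.join(root, "current"))

activate("docroot1")
activate("docroot2")  # flip to the freshly pushed version
live = os.readlink(os.path.join(root, "current"))
```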

The key takeaway was that if a process is slow, identify through measurement
where the bottlenecks are, and then design a system around those bottlenecks
as a way of increasing throughput and decreasing latency for the system.

Elasticsearch at The Guardian

Shay Banon and
Graham Tackley discussed how
The Guardian had used Elasticsearch to provide real-time updates on
how readers were interacting with the news site.

The Guardian has 5-6 million unique visitors per day to its website, and also
has applications for Android, Kindle, iPhone and iPad. Thanks to
The Scott Trust
The Guardian is a shareholder-free organisation, and as such took advantage
of the early internet to make all its content available on-line for free.
As a result it is now the third-largest English-language website in the world.

To track user interaction with the website, an initial 24-hour ‘hackday’
project which grepped the Apache logs to show the top content in the prior
few minutes was launched. The success of that project led to the creation of
an Elasticsearch-backed infrastructure for doing the same thing over the
period of the last few minutes to the last few hours, days, or weeks.

The Guardian website uses a form of invisible pixel tracking: when a page is
loaded and rendered by the browser, an image is requested from the remote
server. This causes an event to be posted to Amazon SNS, which posts a message
into Amazon SQS, which in turn loads it into an Elasticsearch database. By
storing the time, the referrer, and geolocation data it is possible to build
up a data set which richly describes who is using the site at any one time.
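The kind of event document such a pixel request might produce could look like this (the field names are assumptions, not The Guardian’s actual schema):

```python
import json
from datetime import datetime, timezone

# Build the hypothetical page-view event that would travel through
# SNS/SQS before being indexed into Elasticsearch.
def page_view_event(path, referrer, country):
    return {
        "path": path,
        "referrer": referrer,
        "country": country,
        "time": datetime.now(timezone.utc).isoformat(),
    }

event = page_view_event("/sport/article-1", "https://t.co/abc", "GB")
payload = json.dumps(event)  # the serialised form posted onwards
```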

By using Elasticsearch to search the content with specific filters (e.g. all
sport references in the last five minutes, or all referrals from the front
page) it is possible to build real-time statistics of who is using the site
and at what time. By capturing the geolocation data it is possible to build
a heatmap of where readers are located in the world, and a set of graphs
(displayed by D3 in the dashboard)
gives a drill-down view of the data, permitting editorial decisions
to be made about what content to promote or follow up on.

Elasticsearch provides most of these search functions out of the box,
including the date histogram, which can partition requests into different
time-based buckets. The graphs can be used to correlate the information seen,
including being able to find out which particular tweet caused an upsurge in
traffic, or when a Reddit page went viral and drove readers to the content.
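A query like “all sport references in the last five minutes”, bucketed per minute with a date histogram, might be expressed as follows (the field names are assumptions; the aggregation syntax follows Elasticsearch of that era):

```python
import json

# A sketch of the Elasticsearch request body: a section filter plus a
# time-range filter, with a per-minute date_histogram aggregation.
query = {
    "query": {
        "bool": {
            "filter": [
                {"term": {"section": "sport"}},
                {"range": {"time": {"gte": "now-5m"}}},
            ]
        }
    },
    "aggs": {
        "per_minute": {
            "date_histogram": {"field": "time", "interval": "1m"}
        }
    },
}
body = json.dumps(query)  # what would be POSTed to the _search endpoint
```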

Conference wrap-up

The good news is that
my hat was found
so I can go home with a warm head. But it’s the end of a long week, and
QCon, whilst very enjoyable, is also very exhausting.

The take-aways from this conference were:

Micro-services are big.
It’s not just SOA in a different guise (which was really about RPC with giant
XML SOAP messages) but rather URIs and message passing of JSON data
structures. Virtually every talk on distributed architecture or resilience
talked about how to break a monolithic application into individual services
which could be developed, deployed and monitored individually.

Agile companies are deploying components many times per day.
This isn’t limited to small organisations; this happens at internet-scale
companies that are serving hundreds of thousands of requests per second,
including financial companies. The key to this is removal of process and
pontification and replacing it with automation so easy that a new employee can
deploy to production before lunch on their first day.

Failure happens.
Failing to plan for failure is planning to fail. Any system that
is based on a monolithic application or runs in a single instance is both
non-scalable and non-fault-tolerant by design. The same set of benefits that
comes from scaling also comes from failure planning. Testing this by
destroying instances in production is an extreme way of demonstrating that
the system will work in the case of failure, but it gives monitoring and
processes practice for when unexpected events occur.

The GPL is effectively dead for corporate sponsored open-source projects.
All of the firms that had created code for their needs or exposed it for
others to use were licensed under permissive licenses such as Apache, MIT,
BSD or EPL. In fact, the only GPL licensed software at the talk was OpenJDK,
and even that didn’t get a mention by name but under the Java 8 moniker.
The only place the GPL is being used by companies is when they actively
want to be anti-corporate and are selling subscription support services to
large organisations.

Of course QCon London has many tracks, and I only covered a few of them in
my write-ups. The videos of the conference will be made available from
www.infoq.com (whom I write for) over the
next few months. Most of the presentations have been recorded, and the slides
should be available from the qconlondon.com
website.