May 18, 2008

Twitter as a scalability case study

So once again Twitter was down last week for a good chunk of time.

A lot has been said already about Twitter's scalability issues. Many have held up Twitter as an example of how not to deal
with scalability and have suggested different solutions for scaling it. As Twitter is famously a Ruby-on-Rails
deployment, this case has also been used as a weapon in the language/platform wars between the RoR and Java camps, and to a lesser degree, also with the
LAMP (PHP) camp.

(BTW, Todd is doing an excellent job of
collecting all this valuable information in a very professional way. Todd, keep up the good work!)

Reading through this summary, a pattern emerges. Twitter is no different than many other web apps that have become overnight successes.

They had a good idea and set out to
implement it as quickly as possible. Ruby seemed to be a perfect tool for them
to get there quickly. Success was much bigger and faster
than they imagined. Not surprisingly, the architecture was not
designed for scalability and they are now forced to go through the painful process
of scaling their architecture. There are similar stories about eBay, LinkedIn, MySpace and many other notable web companies.

What's somewhat surprising is that they had to hit almost every
possible bottleneck before realizing that they still
had a scaling issue. For example, they used memcached to address the scalability of their (single!) MySQL database. They tried various messaging solutions (including a Linda-based implementation), and threw a lot of hardware at the problem. Now
it seems they are looking to completely rewrite the application, as noted here.

Recently I've encountered similar scenarios with GigaSpaces customers: one is a PHP-based app for "casual gaming" and another is a gambling application designed with a classic J2EE architecture. Both were
facing similar scaling bottlenecks. The fact that we're seeing the same scalability
issues in PHP, Java and obviously Ruby, tells us that the
scalability problem is not about the language. It's about the
architecture.

I hate to repeat myself with this cliché, but with
scalability you're only as strong as your weakest link. Trying
to bypass this problem by plugging in point solutions is not going to cut it. It is
therefore not surprising that those who dealt with scalability challenges
successfully -- such as eBay, Amazon, Google and Yahoo -- went through architecture changes
and eventually reached a similar pattern to the one I described in my summary of the
QCon conference: Architecture You Always Wondered About: Lessons Learned at QCon.

Scalability -- Doing It Right

Asynchronous event-driven design: Avoid, as much as possible, any synchronous interaction with the data or business logic tier. Instead, use an event-driven approach and workflow.
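To illustrate (a Python sketch, purely for illustration -- the queue and handler names are mine, not from any specific product): the front end publishes events to a queue and returns immediately, while a background consumer does the actual work.

```python
import queue
import threading

# Requests become events on a queue instead of synchronous calls to the data tier.
events = queue.Queue()
results = []

def worker():
    # The consumer drains events in the background, decoupled from the producer.
    while True:
        event = events.get()
        if event is None:  # sentinel to shut down
            break
        results.append(f"processed:{event}")
        events.task_done()

t = threading.Thread(target=worker)
t.start()

# The front end only enqueues and moves on -- no synchronous wait on processing.
for i in range(3):
    events.put(f"update-{i}")

events.put(None)
t.join()
print(results)  # ['processed:update-0', 'processed:update-1', 'processed:update-2']
```

The same shape scales out naturally: the queue becomes a messaging system, and the single worker becomes a pool of service instances.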

Partitioning/Sharding: Design the data model to fit the partitioning model.
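A toy illustration of the idea (the hash function and shard count are arbitrary): route each record to a shard by a stable hash of its key, which is why the data model has to expose a good partitioning key in the first place.

```python
# A hash-based router: each record's key determines its shard, so the data
# model (keyed by user, here) must match the partitioning scheme.
NUM_SHARDS = 4
shards = {i: {} for i in range(NUM_SHARDS)}

def shard_for(key: str) -> int:
    # Stable hash so the same key always routes to the same shard.
    return sum(key.encode()) % NUM_SHARDS

def put(key, value):
    shards[shard_for(key)][key] = value

def get(key):
    return shards[shard_for(key)].get(key)

put("alice", "tweet 1")
put("bob", "tweet 2")
print(get("alice"))  # tweet 1
```

Queries that stay within one shard scale linearly with the number of shards; queries that cross shards are where the data model design earns its keep.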

Parallel execution: Use parallel execution to get the most out of available resources. A good place to use parallel execution is the processing of user requests. Multiple instances of each service can take requests from the messaging system and execute them in parallel. Another good place for
parallel processing is using MapReduce to perform aggregated requests on partitioned data.
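Here is a minimal map/reduce-style sketch (the partition contents are made up for illustration): the aggregate query is mapped over all partitions in parallel, and the partial results are reduced into a single answer.

```python
from concurrent.futures import ThreadPoolExecutor

# Each partition holds a slice of the data; an aggregate query runs as a
# map step on every partition in parallel, then a reduce step merges results.
partitions = [
    {"ads": 120, "clicks": 3},
    {"ads": 340, "clicks": 9},
    {"ads": 95,  "clicks": 2},
]

def map_count(partition):
    # Map step: each partition computes its local count independently.
    return partition["ads"]

with ThreadPoolExecutor() as pool:
    partial_counts = list(pool.map(map_count, partitions))

# Reduce step: combine the partial results into the global answer.
total = sum(partial_counts)
print(total)  # 555
```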

Replication (read-mostly): In read-mostly scenarios (LinkedIn seems to fall into this category well), database replication can help load-balance reads by splitting the
read requests among the replicated database nodes.
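A simplified sketch of the idea (replication is synchronous here just to keep the example short): writes go to the primary and are copied to the replicas, while reads rotate round-robin across the replicas.

```python
import itertools

# One primary takes all writes; reads are spread round-robin across replicas.
primary = {}
replicas = [dict() for _ in range(3)]
_read_cycle = itertools.cycle(range(len(replicas)))

def write(key, value):
    primary[key] = value
    for replica in replicas:      # simplified synchronous replication
        replica[key] = value

def read(key):
    # Each read goes to the next replica in turn, balancing the read load.
    return replicas[next(_read_cycle)][key]

write("profile:1", "Alice")
print(read("profile:1"))  # Alice
```

In a real deployment replication is usually asynchronous, which is exactly where the consistency trade-offs discussed below come in.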

Consistency without distributed transactions: This was one of the hot topics of the conference,
which also sparked some discussion during one of the panels I participated in. An argument was made that to reach scalability you had to sacrifice consistency and handle consistency in your applications using things such as optimistic locking and asynchronous error-handling. It also assumes that you will need to handle idempotency in your code. My argument was that while this pattern addresses scalability, it creates complexity and is therefore error-prone. During another panel, Dan Pritchett argued that there are ways to avoid this level of complexity and still achieve
the same goal, as I outlined inthis blog post.

Move the database to the background: There was violent agreement that the database
bottleneck can only be solved if database interactions happen in the background. (NOTE: I recently wrote a more detailed post explaining how you can effectively move the database to the background.)
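A bare-bones write-behind sketch (all names are illustrative): the request path updates memory and enqueues the change, and a background thread is the only thing that ever touches the "database."

```python
import queue
import threading

# Writes land in an in-memory store and on a queue; a background thread
# flushes them to the database asynchronously (write-behind).
cache = {}
pending = queue.Queue()
database = {}  # stands in for the real database

def set_value(key, value):
    cache[key] = value          # request path touches memory only
    pending.put((key, value))   # the database update is deferred

def flusher():
    while True:
        item = pending.get()
        if item is None:        # sentinel to shut down
            break
        key, value = item
        database[key] = value   # the only code path that hits the database
        pending.task_done()

t = threading.Thread(target=flusher)
t.start()
set_value("views:ad-42", 1000)
pending.put(None)
t.join()
print(database)  # {'views:ad-42': 1000}
```

The request latency no longer depends on database latency; the cost is that durability lags by the depth of the queue, which has to be an acceptable trade-off for the data in question.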

This brings me to an interesting
story about a startup company that has an ad engine using PHP.
Like many others, when they became successful, they ran into scalability issues and a
similar dilemma to the one faced by Twitter. Their story, written by the implementor, a consulting company called Rocketier, is a good example of how
to do it right (in this case, using GigaSpaces):

"As part of a project for a Web
2.0 company, which focuses on the Affiliate Junction market, Rocketier
developed a generic connector from PHP to GigaSpaces XAP. The company's product
serves as a hub which connects advertisers and publishers in a unique
proprietary model, tracking and billing ad views, ad clicks and resulting sales
in affiliate networks. The software was originally developed using PHP;
however, once the number of customers started to grow, the product hit severe
performance and scalability issues. The main bottleneck in the product was the
ad view counting mechanism which effectively limited the software to an upper
bound of 1M hits per day. Instead of replacing the entire product, Rocketier
focused on the performance critical business processes and developed a backend
GigaSpaces module, responsible for counting the ad views and clicks. This
module was integrated into the software using a custom designed PHP connector,
employing COM+ technology. The new solution is easily scaled up and out and can
support the company's growing customer base up to 200M hits per day...

...This solution enabled the company to
meet its business needs in a short period of time, while keeping a low risk
factor due to the fact that only the performance critical business processes
were replaced. The chosen technological solution added grid support to the
PHP-based product, including: PHP and GS mediation (the aforementioned
connector), application persistence, session based continuity, asynchronous
programming capabilities and presentation and business logic separation.

To sum up, the company was able to
leverage its existing investment and migrate evolutionarily from an initial
prototype to a heavy load, production level solution (from Beta to a Post Digg
Effect system), in a short time to market and with minimal risk"

Rocketier was kind enough to
publish part of their work on OpenSpaces.org.

What's interesting about their
approach is that they were able to address PHP
scalability without throwing out the existing implementation. They used PHP as it was
meant to be used: a front-end. They grid-enabled the processing engine using
GigaSpaces. This approach is a good one for scalability of
any web application whether it's Java, PHP, RoR, .Net, or J2EE.

Are
we all doomed to go through this painful process when we are successful?

We seem to take it for
granted that dealing with scalability is complex. When we
start a new application it's hard to know whether we're ever going to be
successful to the point where the investment in scalability is worth the
effort. At this initial stage the important thing is time-to-market. We want to get
our idea out there as quickly as possible. This is a reasonable desire
as indeed most projects don't take off.

Now imagine what would happen if dealing with scalability wasn't that difficult. That would change the entire decision-making process, and would enable Twitter and many others to start with a scalable architecture from day one, avoiding this painful process.

So the question is: what is required to simplify building a scalable application to the point where it is as simple as building one for a single machine?

From my experience, most challenges have already been faced and dealt with by others -- so the first thing I did was look at how others (not necessarily in the same industry) addressed this issue.

In this case, storage virtualization is a good example. At first, we used local disks. Local disks tend to fill up quite fast. This was very hard to deal with, because it required replacing the disk with a bigger one every time full capacity was reached. IT had to go through this process for every user and every application -- very painful and costly. The solution came in the form of network-attached storage (NAS) and storage area networks (SAN). Instead of using a local disk, use a virtual disk that resides somewhere on the network. The user and the application don't need to be aware of it, because they use a local disk driver that virtualizes the network devices to make them look like just another local disk. The application scales but hasn't changed as a result. We can add and remove devices as we wish with no changes to the application. Later on, if a more cost-effective solution becomes available, we can easily replace the devices.

We can apply the same concept of virtualization to the middleware stack -- namely the data, the messaging and the processing -- with the same degree of simplicity. The application interacts with a "proxy" that hides the details of how a message or update operation is routed, how fail-over is handled, how data is partitioned and so on.
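A sketch of what such a proxy might look like (the class and its routing scheme are hypothetical, not GigaSpaces' actual API): the application uses it like a plain dictionary, while the proxy decides which backend node holds each key.

```python
# A "virtual store" proxy: the application sees one dict-like object, while
# the proxy hides partitioning across backend nodes. A production proxy
# would also hide fail-over and remote transport; this shows only routing.
class PartitionedStoreProxy:
    def __init__(self, nodes):
        self.nodes = nodes  # plain dicts standing in for remote nodes

    def _node_for(self, key):
        # Stable hash routing; the application never sees this decision.
        return self.nodes[sum(str(key).encode()) % len(self.nodes)]

    def __setitem__(self, key, value):
        self._node_for(key)[key] = value

    def __getitem__(self, key):
        return self._node_for(key)[key]

store = PartitionedStoreProxy([{}, {}, {}])
store["user:17"] = "logged-in"
print(store["user:17"])  # logged-in
```

Just as with the local disk driver, the application code is unchanged whether there is one node behind the proxy or twenty.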

With services such as Amazon EC2,
and other cloud environments, this can be made even simpler, as we can have a
pre-configured image and hardware ready for deployment. All we need to do is
deploy our business logic. (See an EC2 example here.)

With today's frameworks and architectures, we
don't have to go through the same painful experience. We can build scalable
architectures from the get-go. I would even argue that it takes less time to build applications with this approach than the traditional client-server approach.

In my next post I'll discuss the difference
between TCO and software license costs in scalable applications.
