Cisco

November 09, 2010

GigaSpaces Tera-Scale Computing over Cisco UCS

General Architecture Overview:

GigaSpaces' Tera-Scale Computing architecture follows the exact principles that I outlined in Part I of this post.

Unlike alternative approaches that are built by fusing different products that are pre-packaged and branded together, GigaSpaces was designed holistically, with a single, consistent clustering architecture across the entire stack.

In addition, the entire stack was designed to run completely in-memory and support highly concurrent workloads.

Another interesting benefit of this approach, compared with the alternatives, is that there is no reliance on expensive specialized hardware (InfiniBand, high-end storage...) beyond what comes out of the box with the UCS machine. The use of a single clustering mechanism saves a lot of redundant synchronization overhead and makes the entire platform more efficient and reliable. The use of a complete in-memory stack yields extreme utilization and low latency.

GigaSpaces-UCS Integration:

The actual integration work with Cisco UCS was based on two main parts:

1. Built-in automation and provisioning to enable zero configuration

One of the unique characteristics of the Cisco UCS machine is that it exposes an API that is backed by the low-level bare metal components. This makes it possible to interact with the lower layers of the hardware and set up a complete network of blades programmatically and without reliance on a hypervisor or an operating system as a middleman.

In addition, the integration with the UCS manager enables managing and allocating a pool of blades dynamically. We can turn machines on and off on demand, discover new machines as they are plugged into the rack, and use their capacity immediately without any human intervention.

A blade doesn’t need to have any GigaSpaces or Java installation to join the pool. The provisioning of the JVM and GigaSpaces is taken care of by a provisioning agent that installs GigaSpaces onto the machine remotely as soon as it is discovered and allocated.
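As a rough illustration of this flow, here is a sketch of what the scaling adaptor's discovery-and-provision loop could look like. Everything here is hypothetical: `Blade`, `BladePool`, and `ProvisioningAgent` are stand-in interfaces, not the actual UCS or GigaSpaces APIs.

```java
import java.util.List;

// Hypothetical sketch of the discovery/provisioning flow described above.
// None of these types correspond to real UCS or GigaSpaces APIs.
public class ZeroConfigProvisioner {
    interface Blade { String ip(); boolean isAllocated(); }
    interface BladePool { List<Blade> discover(); void allocate(Blade b); }
    interface ProvisioningAgent {
        void installJvm(String ip);
        void installGigaSpaces(String ip);
        void joinGrid(String ip);
    }

    private final BladePool pool;
    private final ProvisioningAgent agent;

    ZeroConfigProvisioner(BladePool pool, ProvisioningAgent agent) {
        this.pool = pool;
        this.agent = agent;
    }

    // Poll the pool; any newly plugged-in blade gets allocated and provisioned
    // remotely, with no pre-installed Java or GigaSpaces required on the blade.
    public void reconcile() {
        for (Blade b : pool.discover()) {
            if (!b.isAllocated()) {
                pool.allocate(b);
                agent.installJvm(b.ip());
                agent.installGigaSpaces(b.ip());
                agent.joinGrid(b.ip());
            }
        }
    }
}
```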

The diagram below shows the various components that comprise this Tera-Scale Computing system.

The SLA Driven Container and Data & Messaging services are the services that come with the standard GigaSpaces XAP installation. The GigaSpaces UCS Scaling Adaptor is the service that is responsible for all the communication with the physical UCS pool, turning it into a GigaSpaces pool.

You can find more details on how this integration works, including a downloadable version of the scaling and provisioning agent, here.

2. Performance tuning and optimization

The second part of the work is around performance optimization and tuning of GigaSpaces on the UCS platform.

You can find below some of the first benchmark reports we published. The benchmark was specifically geared toward real-time analytics applications. In the benchmark we were able to tune the system to reach roughly 300GB per node, with a record throughput of 7.5M reads/sec in an embedded (scale-up) scenario and 300k ops/sec in a remote (scale-out) scenario.

The table below shows the history of the performance optimization process that we carried out throughout the various stages of tuning. It also includes comparison with other similar benchmarks that were done with older versions of UCS and with other platforms.

What we could see is that we were able to continuously improve performance through optimization of both the GigaSpaces platform and the hardware itself.

We are working these days on another set of optimizations that are targeted for latency-sensitive applications.

With this effort we could save the majority of the performance tuning cycles, come up with pre-defined configuration profiles that are baked into our platform and the hardware, and get closer to our zero-configuration goal.

How does GigaSpaces Tera-Scale Plug into Existing Applications?

There are basically two modes of operation in which existing applications can plug into this solution.

1. Virtual Appliance mode -- In the appliance mode, GigaSpaces and the UCS machine are used as a service that exposes any of the APIs that are currently supported (SQL, Memcached, JMS, MapReduce...). The application remains unchanged and sees the joint solution as a better implementation of any of those services.

The interesting aspect of this mode is that even though the appliance is consumed as a service, a user can still use our executor API to offload pieces of application code into the appliance and run them within the box.

This is particularly useful in cases where the application wants to leverage the processing power of the UCS machine without necessarily moving the entire application into the platform.
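To give a feel for what that offloading looks like, here is a minimal sketch based on the OpenSpaces executor API (`Task`/`AsyncFuture`) as it existed around XAP 7.x; treat the exact signatures as approximate, and the task itself as a made-up example.

```java
import org.openspaces.core.GigaSpace;
import org.openspaces.core.executor.Task;
import com.gigaspaces.async.AsyncFuture;

public class OffloadExample {
    // A task that is shipped to the appliance and executed inside the box,
    // next to the data, rather than on the client.
    public static class WordCountTask implements Task<Integer> {
        private final String text;
        public WordCountTask(String text) { this.text = text; }

        @Override
        public Integer execute() throws Exception {
            return text.split("\\s+").length; // runs inside the appliance
        }
    }

    // Client side: offload the task and wait for the result.
    public static int offload(GigaSpace gigaSpace) throws Exception {
        AsyncFuture<Integer> result = gigaSpace.execute(new WordCountTask("count these words"));
        return result.get();
    }
}
```

The task class is serialized and shipped to the space, so the computation runs next to the data; only the small result travels back to the client.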

2. Complete pre-engineered platform -- In this mode, both GigaSpaces and the application run on the UCS platform. GigaSpaces acts as the container for the application services, and the application can exploit the full capacity of the UCS machine through the GigaSpaces in-memory middleware stack. For Java-based applications, GigaSpaces supports a standard deployment model that enables seamless deployment into the platform.

Final words

These are exciting days in the industry, with lots of new breakthrough technologies being introduced on a daily basis. Having been at the cutting edge of that curve for many years, I feel really excited about the work with Cisco UCS. Not just because of the elegance in the design of its hardware but also because the Cisco team that worked with us through this entire process has been a great partner to work with.

Going forward, I see a great potential here as we can finally break the hardware and software wall. On one hand, the fact that the hardware can finally interact with the software that runs on top of it with no middleman puts a lot of power in our hands. On the other hand, we see a shift where application developers can focus on the business while the platform distributes work, maintains consistency, availability, performance and security of the underlying resources including CPU, Memory, Disk and Network. In doing so they become completely abstracted from the underlying operating system and hardware.

With that in mind, we are no longer limited to the services that are provided by the operating system or even by the JVM.

As platform providers, we can call directly into the network interface; we can use deep inspection of resource capacity and the traffic matrix to proactively respond to demand surges in an effective way and take immediate benefit from them, while at the same time keeping the application unchanged. We can also create much more robust platforms, as we can get immediate alerts directly from the device when something goes wrong. I can only imagine how robust our platform can become just by getting rid of all the failure-detection loops that we have to maintain today throughout the entire software stack.

We are only scratching the surface with this offering and there are probably more areas for innovation that we have not yet explored. This could be a great opportunity to send us your wish list and be part of this project.

In the past year, Intel released a series of powerful chips based on the new Nehalem microarchitecture, with large numbers of cores and extensive memory capacity. This new class of chips is part of a bigger Intel initiative referred to as Tera-Scale Computing. Cisco has released its Unified Computing System (UCS), equipped with unique extended memory and a high-speed network within the box, which is specifically geared to take advantage of this type of CPU architecture.

This new class of hardware has the potential to revolutionize the IT landscape as we know it.

In Part-I of this post, I want to focus primarily on the potential implications on application architecture, more specifically on the application platform landscape.

Quoting Intel's description of the initiative:

"By scaling multi-core architectures to 10s to 100s of cores and embracing a shift to parallel programming, we aim to improve performance and increase energy-efficiency. 'Tera' means 1 trillion, or 1,000,000,000,000. Our vision is to create platforms capable of performing trillions of calculations per second (teraflops) on trillions of bytes of data (terabytes)."

To put it in simple words, Tera-Scale is a commoditized version of the mainframe. With this new technology, we can create supercomputers at an affordable price.

What are the potential implications on current deployments?

One of the more trivial and probably more common use cases that comes with the introduction of these new powerful machines is the ability to condense more applications and virtual machines on less hardware. In this specific post, however, I'll refer mostly to the implications for the large-scale distributed applications that are becoming more popular these days, as the demand for scaling continues to grow, while cloud-based deployments are becoming more common and affordable.

Is it the end of distributed systems?

For anyone who deals with large scale deployments, it might feel at first impression that we're back to the mainframe days. However, a closer look can yield a completely different picture.

The increased hardware capacity will enable us to manage more data in a shorter amount of time. In addition, the demand for more reliability through redundancy, as well as the need for better utilization through the sharing of resources driven by SaaS/Cloud environments, will force us even more than before towards scale-out and distributed architecture.

So, in essence, what we can expect to see is an evolution where the predominant architecture will be scale-out, but the resources in that architecture will get bigger and bigger, thus making it simpler to manage more data without increasing the complexity of managing it. To maximize the utilization of these bigger resources, we will have to combine a scale-up approach as well.

More specifically, the implications can be broken into two categories:

Density – we can serve the same capacity and workload of our existing applications on significantly less hardware. This obviously yields an immediate benefit in terms of the operational costs associated with reduced maintenance, cooling, power, space, cabling, etc.

Capacity – we could get tens or hundreds of times the capacity and processing power for the same size and cost as today's cluster.

The second point is where I see the greater potential. Imagine running your entire application completely in-memory, i.e. -- the disk becomes the new tape. The fact that we can now get terabytes of data in-memory with only a few boxes means that we can do things that we couldn't have dreamt of before -- we can run complex analytics in real time that used to take hours, run new classes of complex fraud detection algorithms, and recognize malicious attacks before they happen. We could correlate customer trends and increase the conversion rate of visitors on our e-commerce sites. As a Telco provider, we could serve multimedia and social networking activities, process and render videos and images quickly, and provide more targeted and personalized commercials. We could build online gaming applications that run 3D animation at a fraction of the cost. And the list goes on and on...

The Challenge

As with any new technology, exploiting its full potential involves change. The main challenge is with existing applications. Existing (large-scale) applications were not built to utilize the new multi-core, network, and memory capacity, simply because they were written at a time when memory was expensive and available only in capacities of a few GBs at most, the network was considered a bottleneck, and multi-core didn't exist or was just beginning to emerge. In addition, most of these applications were designed to run against a centralized database or some sort of centralized storage, which makes it even harder to take advantage of the power behind this new class of machines.

So while this new class of hardware holds tremendous promise, there is a small caveat that is worth noting in Intel's vision:

By scaling multi-core architectures from tens to hundreds of cores and embracing a shift to parallel programming...

In other words, Intel recognizes the fact that to take full advantage of this new class of hardware, we need to embrace a new architecture that better lends itself to parallel programming.

Tera-Scale Reference-Architecture:

The diagram below, taken from the Intel reference architecture, points to the fact that in order to take advantage of the new underlying multi-core architecture we need a specialized software platform that is thread-aware and can abstract the application from the details of the underlying infrastructure.

In other words, we rely on the application platform to become the glue that enables us to migrate our existing applications from the current centralized model into a more parallel and decentralized model. To make the transition simple, the platform will need to continue to support the existing interfaces and APIs, but replace them with more scalable implementations. At the same time, the platform needs to expose new capabilities that will enable development of new services that are designed with parallel distribution in mind.

There are two fundamental requirements of such a platform:

Scalable Software Infrastructure

Highly parallel – The platform itself needs to be designed with extreme parallelism in its internal engine as well as expose parallel programming and event-driven semantics to the application.

In-Memory – Memory is the only physical device that can manage highly concurrent transactions. To achieve maximum utilization the platform must not rely on disk or devices that do not lend themselves to concurrent processing.

Scale-up and Out – The platform needs to support seamless transition between scale-up within a machine and scale-out between machines without changing the application code.

Simple – Pre-Integrated

The complexity often associated with tuning the software and hardware together is going to increase exponentially in this highly parallel environment. Therefore, we can't afford to think of the software platform and hardware infrastructure as two separate things. Instead, it must be the responsibility of the platform to provide a pre-integrated and tuned environment, which includes:

Hardware and Software tightly integrated – As the application is only exposed to the platform and not to the operating system, a tightly integrated platform can have much more room to use proprietary optimization of the hardware than the application itself. It can integrate with lower-level pieces of the hardware such as tuning network routing, moving the execution environment to the right core to minimize the NUMA effect and take full advantage of the CPU cache.

Zero configuration – The platform should come pre-tuned and expose a simple setup that can be applied to the entire platform in a consistent way.

Fully automated deployment – Deploying even the most complex application should involve a single deploy command that could in turn wire all the pieces together programmatically, without exposing any manual step to the end user. This also includes the ability to deal with post-deployment events, such as recovery processes and auto-scaling. This implies that the platform needs to have built-in integration with Dev-Ops APIs that are baked into the core of the platform and expose full control over the infrastructure services -- Memory, Machine, CPU, Network -- and middleware services (data, messaging and processing, web, etc.).

In Part-II of this post, I discuss the specific implementation of this concept in our new GigaSpaces Tera-Scale solution for Cisco UCS.

October 16, 2010

In the past few months I was involved in many of the NoSQL discussions. I must admit that I really enjoyed those discussions, as it felt that we finally started to break away from the "one size fits all" dogma and look at data management solutions in a more pragmatic manner. That in itself sparks lots of interesting and innovative ideas that can revolutionize the entire database market, such as the introduction of the document model, map-reduce, and the new query semantics that come with it. As with any new movement, we seem to be going through the classic hype cycle, and right now it seems to me that we're getting close to its peak. One of the challenges that I see when a technology reaches the peak of its hype is that people stop questioning the reason for doing things and jump on the new technology just because X did it. NoSQL is no different in that regard.

In this post I wanted to spend some time on the CAP theorem and clarify some of the confusion that I often see when people associate CAP with scalability without fully understanding the implications that come with it and the alternative approaches.

I chose to name this post NoCAP specifically to illustrate the idea that you can achieve scalability without compromising on consistency, at least not to the degree that many of the disk-based NoSQL implementations impose.

Recap on CAP

Quoting the definition on wikipedia:

The CAP theorem, also known as Brewer's theorem, states that it is impossible for a distributed computer system to simultaneously provide all three of the following guarantees: Consistency (all nodes see the same data at the same time), Availability (every request receives a response about whether it succeeded or failed), and Partition tolerance (the system continues to operate despite arbitrary message loss).

CAP and NoSQL

Many of the disk-based NoSQL implementations originated from the need to deal with write scalability. This was largely due to changes in traffic behavior, mainly a result of social networking, in which most of the content is generated by the users and not by the site owner.

It was clear that the demand for write scalability would conflict with the traditional approaches for achieving consistency (synchronous write to a central disk and distributed transactions).

The solution to that was:

1. Breaking the centralized disk access through partitioning of the data into distributed nodes.

2. Achieving high availability through redundancy (replication of the data into multiple nodes).

3. Using asynchronous replication to reduce the write latency.

The assumption behind point 3 above is going to be the focus of this specific post.

The Consistency Challenge

One of the common assumptions behind many of the NoSQL implementations is that to achieve write scalability we need to push as much of the write path as possible to a background process, so that we can minimize the time in which a user transaction is blocked on write.

The implication is that with asynchronous write we lose consistency between write and read operations, i.e., a read operation can return an older version than the one most recently written.

There are different algorithms that were developed to address this type of inconsistency challenge, often referred to as Eventual Consistency.

For those interested in more information in that regard, I would recommend Jeremiah Peschka's post Consistency models in nonrelational dbs. Jeremiah provides a good (and short!) summary of the CAP theorem, the Eventual Consistency model, and other common principles that come with it, such as BASE (Basically Available, Soft state, Eventually consistent), NRW, and vector clocks.

Do we really need Eventual Consistency to achieve write scalability?

Before I dive into this topic, I want to start with a quick introduction to the term "scalability", which is often used interchangeably with throughput. Quoting Steve Haines:

The terms “performance” and “scalability” are commonly used interchangeably, but the two are distinct: performance measures the speed with which a single request can be executed, while scalability measures the ability of a request to maintain its performance under increasing load

In our specific case, that means that write scalability can be delivered primarily through points 1 and 2 above (1 - breaking the centralized disk access through partitioning of the data into distributed nodes; 2 - achieving high availability through redundancy and replication of the data into multiple nodes), whereas point 3 (using asynchronous replication to those replicas to avoid the replication overhead on write) mostly affects write throughput and latency, not scalability. Which brings me to the point behind this post:

Eventual consistency has little or no direct impact on write scalability.

To be more specific, my argument is that it is quite often enough to break our data model into partitions (a.k.a. shards) and break away from the centralized disk model to achieve write scalability. In many cases we may find that we can achieve sufficient throughput and latency just by doing that.

We should consider the use of asynchronous write algorithms to optimize write performance and latency, but due to the inherent complexity that comes with them, we should consider them only after we have tried simpler alternatives such as database shards, flash disks, or memory-based devices.

The diagram below illustrates one of the examples by which we could achieve write scalability and throughput without compromising on consistency.

As with the previous examples, we break our data into partitions to handle our write scaling between nodes. To achieve high throughput we use in-memory storage instead of disk. As in-memory devices tend to be significantly faster and more concurrent than disk, and since network speed is no longer a bottleneck, we can achieve high throughput and low latency even when we use synchronous write to the replica.

The only place in which we'll use asynchronous write is the write to the long-term storage (disk). As user transactions don't access the long-term storage directly through the read or write path, they are not exposed to the potential inconsistency between the memory storage and the long-term storage. The long-term storage can be any of the disk-based alternatives, from a standard SQL database to any of the existing disk-based NoSQL engines.
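Here is a minimal sketch of that write path: partitioned data, a synchronous write to an in-memory replica, and asynchronous write-behind to the long-term storage. All names are illustrative (plain Java maps stand in for the in-memory nodes); it is not a real data grid implementation.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Illustrative sketch only: each partition keeps a primary and a replica map
// (standing in for two in-memory nodes); writes hit both synchronously, so
// reads are always consistent, while the disk write happens in the background.
public class PartitionedInMemoryStore {
    private static class Partition {
        final Map<String, String> primary = new ConcurrentHashMap<>();
        final Map<String, String> replica = new ConcurrentHashMap<>();
    }

    private final Partition[] partitions;
    private final ExecutorService writeBehind = Executors.newSingleThreadExecutor();

    public PartitionedInMemoryStore(int partitionCount) {
        partitions = new Partition[partitionCount];
        for (int i = 0; i < partitionCount; i++) partitions[i] = new Partition();
    }

    private Partition route(String key) {
        return partitions[Math.floorMod(key.hashCode(), partitions.length)];
    }

    public void write(String key, String value) {
        Partition p = route(key);
        p.primary.put(key, value);  // synchronous write to the primary
        p.replica.put(key, value);  // synchronous write to the replica: no eventual consistency
        writeBehind.submit(() -> persistToLongTermStorage(key, value)); // async, off the user's path
    }

    public String read(String key) {
        return route(key).primary.get(key); // served from memory, always current
    }

    private void persistToLongTermStorage(String key, String value) {
        // e.g. an SQL database or a disk-based NoSQL engine
    }
}
```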

The other benefit of this approach is that it is significantly simpler -- not just in terms of development, but also to maintain, compared with the Eventual Consistency alternatives. In distributed systems, simplicity often correlates with reliability and deterministic behavior.

Final words

It is important to note that in this post I was referring mostly to the C in CAP and not to CAP in its broad definition. My point was not to say "don't use solutions that are based on the CAP/Eventual Consistency model", but rather "don't jump on Eventual Consistency based solutions before you have considered the implications and the alternative approaches". There are potentially simpler approaches to deal with write scalability, such as using database shards or in-memory data grids.

As we're reaching the age of Tera-Scale devices such as Cisco UCS, where we can get huge capacities of memory, network and compute power in a single box, the areas in which we can consider putting our entire data in-memory get significantly broader, as we can easily store terabytes of data in just a few boxes. The case of Foursquare's MongoDB Outage is interesting in that regard. 10gen's CEO Dwight Merriman argued that the entire data set actually needs to be served completely in-memory:

For various reasons the entire DB is accessed frequently so the working set is basically its entire size. Because of this, the memory requirements for this database were the total size of data in the database. If the database size exceeded the RAM on the machine, the machine would thrash, generating more I/O requests than the four disks could service.

It is a common misconception to think that putting part of the data in an LRU-based cache on top of a disk-based storage could yield better performance, as noted in the Stanford research paper The Case for RAMClouds:

..even a 1% miss ratio for a DRAM cache costs a factor of 10x in performance. A caching approach makes the deceptive suggestion that "a few cache misses are OK" and lures programmers into configurations where system performance is poor..

In that case, using a pure in-memory data grid as a front end and disk-based storage as long-term storage could potentially work better, with significantly lower maintenance overhead and higher determinism. The amount of data in this specific case (<100GB) shouldn't be hard to fit into a single UCS box or a few EC2 boxes.

September 01, 2010

In my previous post Concurrency 101 I touched on some of the key terms that often come up when dealing with multi-core concurrency.

In this post I'll cover the difference between multi-core concurrency, often referred to as the Scale-Up model, and distributed computing, often referred to as the Scale-Out model.

The Difference Between Scale-Up and Scale-Out

One of the common ways to best utilize multi-core architecture in the context of a single application is through concurrent programming. Concurrent programming on multi-core machines (scale-up) is often done through multi-threading and in-process message passing, also known as the Actor model. Distributed programming does something similar by distributing jobs across machines over the network. There are different patterns associated with this model, such as Master/Worker, Tuple Spaces, Blackboard, and MapReduce. This type of pattern is often referred to as scale-out (distributed).

Conceptually, the two models are almost identical, as in both cases we break a sequential piece of logic into smaller pieces that can be executed in parallel. Practically, however, the two models are fairly different from an implementation and performance perspective. The root of the difference is the existence (or lack) of a shared address space. In a multi-threaded scenario you can assume the existence of a shared address space, and therefore data sharing and message passing can be done simply by passing a reference. In distributed computing, the lack of a shared address space makes this type of operation significantly more complex. Once you cross the boundaries of a single process you need to deal with partial failure and consistency. Also, the fact that you can't simply pass an object by reference makes the process of sharing, passing or updating data significantly more costly (compared with in-process reference passing), as you have to deal with passing copies of the data, which involves additional network, serialization, and de-serialization overhead.
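To make the contrast concrete, here is a small sketch of the scale-up flavor of message passing: the "message" handed through the queue is just an object reference, with none of the copying, serialization, or partial-failure handling its distributed equivalent would require.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// Scale-up flavor of message passing: producer and consumer share an address
// space, so passing a "message" costs no more than handing over a reference.
public class InProcessMessaging {
    static class Job {
        final int id;
        Job(int id) { this.id = id; }
    }

    public static void main(String[] args) throws InterruptedException {
        BlockingQueue<Job> queue = new LinkedBlockingQueue<>();

        Thread worker = new Thread(() -> {
            try {
                while (!Thread.currentThread().isInterrupted()) {
                    Job job = queue.take(); // same object the producer created: no copy
                    System.out.println("processed job " + job.id);
                }
            } catch (InterruptedException e) {
                // worker shut down
            }
        });
        worker.start();

        for (int i = 0; i < 3; i++) queue.put(new Job(i));

        Thread.sleep(200); // let the worker drain the queue (demo only)
        worker.interrupt();
    }
}
```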

Choosing Between Scale-Up and Scale-Out

The most obvious reason for choosing between the scale-up and scale-out approaches is scalability/performance. Scale-out allows you to combine the power of multiple machines into a virtual single machine with the combined power of all of them together. So in principle, you are not limited to the capacity of a single unit. In a scale-up scenario, however, you have a hard limit -- the scale of the hardware on which you are currently running. Clearly, then, one factor in choosing between scaling out or up is whether or not you have enough resources within a single machine to meet your scalability requirements.

Reasons for Choosing Scale-Out Even If a Single Machine Meets Your Scaling/Performance Requirements

Today, with the availability of large multi-core and large memory systems, there are more cases where you might have a single machine that can cover your scalability and performance goals. And yet, there are several other factors to consider when choosing between the two options:

1. Continuous Availability/Redundancy: You should assume that failure is inevitable, and therefore having one big system is going to lead to a single point of failure. In addition, the recovery process is going to be fairly long, which could lead to an extended downtime.

2. Cost/Performance Flexibility: As hardware costs and capacity tend to vary quickly over time, you want to have the flexibility to choose the optimal configuration setup at any given time or opportunity to optimize cost/performance. If your system is designed for scale-up only, then you are pretty much locked into a certain minimum price driven by the hardware that you are using. This could be even more relevant if you are an ISV or SaaS provider, where the cost margin of your application is critical to your business. In a competitive situation, the lack of flexibility could actually kill your business.

3. Continuous Upgrades: Building an application as one big unit is going to make it harder or even impossible to add or change pieces of code individually without bringing the entire system down. In these cases it is probably better to decouple your application into concrete sets of services that can be maintained independently.

4. Geographical Distribution: There are cases where an application needs to be spread across data centers or geographical location to handle disaster recovery scenarios or to reduce geographical latency. In these cases you are forced to distribute your application and the option of putting it in a single box doesn’t exist.

Can We Really Choose Between Scale-Up and Scale-Out?

Choosing between scale-out/up based on the criteria that I outlined above sounds pretty straightforward, right? If our machine is not big enough we couple a few machines together to get what we're looking for, and we're done. The thing is, with the speed at which network, CPU power and memory advance, the answer to the question of what we require at a given time could be very different from the answer a month later.

To make things even more complex, the gain between scale-up and scale-out is not linear. In other words, when we switch between scale-up and scale-out we're going to see a significant drop in what a single unit can do, as all of a sudden we have to deal with network overhead, transactions, and replication for operations that were previously done just by passing object references. In addition, we will probably be forced to rewrite our entire application, as the programming model shifts quite dramatically between the two models. All this makes it fairly difficult to answer the question of which model is best for us.

Beyond a few obvious cases, choosing between the two options is fairly hard, and maybe even almost impossible.

Which brings me to the next point: What if the process of moving between scale-up and scale-out were seamless -- not involving any changes to our code?

I often use storage as an example of this. In storage, when we switch between a single local disk to a distributed storage system, we don’t need to rewrite our application. Why can’t we make the same seamless transition for other layers of our application?

Designing for Seamless Scale-Up/Scale-Out

To get to a point of seamless transition between the two models, there are several design principles that are common to both the scale-out and scale-up approaches.

Parallelize Your Application

1. Decouple: Design your application as a decoupled set of services. "All problems in computer science can be solved by another level of indirection" is a famous quote attributed to David Wheeler. In this specific context: if your code sets have loose ties to one another, the code is easier to move, and you can add more resources when needed without breaking those ties. In our case, designing an application as a set of services that doesn't assume the locality of other services enables us to handle a scale-out scenario by routing requests to the most available instance.

2. Partition: To parallelize an application, it is often not enough to spawn multiple threads, because at some point they are going to hit a shared contention. To parallelize a stateful application we need to find a way to partition our application and data model so that our parallel units share-nothing with one another.

Enabling Seamless Transitions Between Remote and Local Services

First, I'd like to clarify that the pattern outlined in this section is intended to enable a seamless transition between distributed and local services. It is not intended to make the performance overhead between the two models go away.

The core principle is to decouple our services from things that assume locality of either services or data. Thus, we can switch between local and remote services without breaking the ties between them. The decoupling should happen in the following areas:

1. Decouple the communication: When a service invokes an operation on another service we can determine whether that other service is local or remote. The communication layer can be smart enough to go through more efficient communication if the service happens to be local or go through the network if the service is remote. The important thing is that our application code is not going to be changed as a result.

2. Decouple the data access: Similarly, we need to abstract the data access of our data service. A simple abstraction would be a hash table interface, where we could use the same code to point to a local in-memory hash table or to a distributed version of it. A more sophisticated version would be an SQL interface that could point either to an in-memory data store or to a distributed data store.
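A minimal sketch of that kind of decoupling: the service below is written against the `Map` abstraction, and whether the implementation is a local hash table or a distributed one becomes a wiring decision.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// The service depends only on the Map abstraction; locality is a wiring decision.
public class ProfileService {
    private final Map<String, String> profiles;

    public ProfileService(Map<String, String> profiles) {
        this.profiles = profiles;
    }

    public void save(String userId, String profile) {
        profiles.put(userId, profile);
    }

    public String load(String userId) {
        return profiles.get(userId);
    }

    public static void main(String[] args) {
        // Scale-up wiring: a local in-memory hash table.
        ProfileService local = new ProfileService(new ConcurrentHashMap<>());
        local.save("user-1", "profile-data");
        System.out.println(local.load("user-1"));
        // Scale-out wiring would hand in a distributed Map implementation
        // (e.g. a data grid client) without touching this class.
    }
}
```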

Packaging Our Services for Best Performance and Scalability

Having an abstraction layer for our services and data brings us to the point where we could use the same code whether our data happens to be local or distributed. Through decoupling, the decision about where our services should live becomes more of a deployment question, and can be changed over time without changing our code.

In the two extreme scenarios, this means that we could use the same code to do only scale-up by having all the data and services collocated, or scale-out by distributing them over the network.

In most cases, it wouldn't make sense to go to either of the extreme scenarios, but rather to combine the two. The question then becomes at what point should we package our services to run locally and at what point should we start to distribute them to achieve the scale-out model.

To illustrate, let’s consider a simple order processing scenario where we need to go through the following steps for the transaction flow:

1. Send the transaction

2. Validate and enrich the transaction data

3. Execute it

4. Propagate the result

Each transaction process belongs to a specific user. Transactions of two separate users are assumed to share nothing between them (beyond reference data which is a different topic).

In this case, the right way to assemble the application in order to achieve the optimal scale-out/scale-up ratio would be to have all the services that are needed for steps 1-4 collocated, and therefore set up for scale-up. We would scale out simply by adding more of these units and splitting both the data and the transactions between them based on user IDs. We often refer to this unit-of-scale as a processing unit.
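A sketch of that packaging (names are made up): steps 2-4 run collocated inside one processing unit, and a routing function keeps each user's transactions on a single unit so that units share nothing.

```java
// Illustrative sketch: steps 2-4 of the flow run collocated inside one
// processing unit; scale-out is achieved by routing each user's transactions
// to a single unit, so units share nothing with one another.
public class OrderProcessingUnit {
    static class Transaction {
        final String userId;
        final double amount;
        Transaction(String userId, double amount) { this.userId = userId; this.amount = amount; }
    }

    // Step 1 happens on the client; steps 2-4 are in-process calls, no network hops.
    public void process(Transaction tx) {
        validateAndEnrich(tx); // step 2
        execute(tx);           // step 3
        propagateResult(tx);   // step 4
    }

    private void validateAndEnrich(Transaction tx) { /* in-process */ }
    private void execute(Transaction tx)           { /* in-process */ }
    private void propagateResult(Transaction tx)   { /* in-process */ }

    // Routing: transactions of the same user always land on the same unit.
    public static int unitFor(String userId, int unitCount) {
        return Math.floorMod(userId.hashCode(), unitCount);
    }
}
```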

To sum up, choosing the optimal packaging requires:

1. Packaging our services into bundles based on their runtime dependencies to reduce network chattiness and number of moving parts.

2. Scaling-out by spreading our application bundles across the set of available machines.

3. Scaling-up by running multiple threads in each bundle.

The entire pattern outlined in this post is also referred to as Space Based Architecture. A code example illustrating this model is available here.

Final Words

Today, with the availability of large multi-core machines at significantly lower price, the question of scale-up vs. scale-out becomes more common than in earlier years.

There are more cases in which we could now package our application in a single box to meet our performance and scalability goals.

A good analogy that I have found useful for understanding where the industry is going with this trend is to compare disk drives with storage virtualization. Disk drives are a good analogy to the scale-up approach, and storage virtualization is a good analogy to the scale-out approach. Similar to the advances in multi-core technology today, disk capacity has increased significantly in recent years. Today, we have multi-terabyte data capacity on a single disk.

Interestingly enough, the increase in capacity of local disks didn't replace the demand for storage -- quite the contrary. A possible explanation is that while single-disk capacity doubled every year, the demand for more data grew at a much higher rate, as indicated in the following IDC report:

Another explanation is that storage provides functions such as redundancy, flexibility and sharing/collaboration -- properties that a single disk drive cannot provide regardless of its capacity.

The advances with the new multi-core machines will follow similar trends, as there is often a direct correlation between the advance in the capacity of data and the demand for more compute power to manage it, as indicated here:

The current rate of increase in hard drive capacity is roughly similar to the rate of increase in transistor count.

The increased hardware capacity will enable us to manage more data in a shorter amount of time. In addition, the demand for more reliability through redundancy, as well as the need for better utilization through the sharing of resources driven by SaaS/Cloud environments, will force us even more than before towards scale-out and distributed architecture.

So, in essence, what we can expect to see is an evolution where the predominant architecture will be scale-out, but the resources in that architecture will get bigger and bigger, thus making it simpler to manage more data without increasing the complexity of managing it. To maximize the utilization of these bigger resources, we will have to combine a scale-up approach as well.

Which brings me to my final point -– we can’t think of scale-out and scale-up as two distinct approaches that contradict one another, but rather must view them as two complementing paradigms.

The challenge is to make the combination of scale-up/out native to the way we develop and build applications. The Space Based Architecture pattern that I outlined here should serve as an example on how to achieve this goal.

According to the article there are two primary ways Google will measure page speed:

How a page responds to Googlebot

Load time as measured by the Google Toolbar

A slow site can translate into a poor user experience and a lower conversion rate -- which in itself can lose you money. But this new development can actually cause you further harm by lowering your Google search ranking, as was recently reported by this user: Google Rankings Drop! Where to Check Site Response Time?

I noticed some of my sites drop significantly for many of the keywords that I rank for in Google. I rank in the top 10 for several keywords (on several different sites) and my traffic came to a screeching halt today. My rankings are down to the 7th or 8th page or lower!

After doing some digging around, I decided to log into my Google Webmaster tools. Sure enough I found a warning next to the sitemaps for all my affected sites. The warning was: "Some URLs in the Sitemap have a high response time. This may indicate a problem with your server or with the content of the page."

Understanding why Web Speed Matters

The rationale behind this move by Google is fairly straightforward:

Slow web sites lead to a poor user experience, and therefore should not appear at the top of the search list even if they contain relevant content.

This is not something Google made up -- the effects of slow speed, or latency, are well documented.

Matt McGee quotes interesting statistics on the impact of latency on traffic behavior:

How to Improve Web Speed

What makes web-speed improvement difficult is that it involves multiple layers, some of which you don't really control. It depends on the browser and how fast it processes CSS, JavaScript, etc.; on the size of your pages and images; on where your site is physically located; and on the actual server-side architecture.

Web Speed at Large Scale

Steve Souders' 14 rules apply to any site of any size. However, large-scale sites require additional special treatment, as they depend more heavily on the architecture. A non-scalable architecture can lead to devastating results under load. In this section I will discuss how to control web speed at large scale from an application architecture perspective.

Brief History

In Dec 2007 I summarized the main lessons from Google, Amazon, and LinkedIn large-scale web site architecture in this post. On December 8th, 2008, I wrote a response to Todd Hoff's post on highscalability.com, Latency is Everywhere and it Costs You Sales - How to Crush it - My Take. In those articles I tried to provide architecture guidelines on how to control latency in large-scale environments. Most of those lessons still hold true today. In this post I want to update some of those lessons based on recent experience with social networking and the emergence of NoSQL alternatives.

The Emergence of Read/Write Web

During the past two years, social networking has significantly changed the web experience. Today's web sites deal with viral traffic behavior, as can be seen in the Twitter traffic sample below. In addition, most of the content on these sites is now written by external users rather than by the site owner.

These differences in behavior led to a demand for read/write scaling as opposed to read-mostly scaling to deal with continuous scaling demands. This later led to the emergence of the NoSQL alternatives which started to pick up during the first quarter of 2010.

Last week I had the honor of a visit by Cees de Groot and members of his team from Marktplaats/eBay. Cees has designed large-scale web applications for years, specifically in the e-commerce area. He was recently in charge of the design of an Adword Service (see reference here) and is now working on moving their entire site from a database-centric PHP architecture to a scale-out architecture in Java.

Here are some of the main takeaways from our discussions on how to improve web performance in this new age of read/write scaling demand:

Improve Data Speed and Scaling

Data access is probably the most notable area of contention in many sites. In a large-scale system, data contention means that concurrent user access to the same table or data item is serialized due to locking. This makes one user request dependent on another, and therefore, as the scale grows, it will have a more and more significant impact on latency.

In many e-commerce sites, product inventory, product catalog, and user profiles tend to be typical areas where this sort of contention happens. In online gaming sites this would apply to the user profile and also to the gaming table. The important thing to note is that not all of our data is exposed to this level of contention. Understanding where the contention happens is the first step in solving the problem.

There are various options to reduce the contention points, depending on the access pattern:

Read Mostly

In a read-mostly scenario, many users try to fetch the same content at the same time. Only a few users actually update the content, and even fewer share the content and try to update it concurrently. A large portion of large-scale web sites use memcache today to handle their read-mostly scaling scenarios. Memcache is extremely simple and exposes a key/value store API. At the same time, memcache is fairly limited, as it doesn't provide consistency and high availability, and therefore cannot be used as a system of record. This means that update operations still need to go through the database, making memcache suitable for read-mostly scenarios but not for write-scaling scenarios.
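The pattern behind this description is commonly called cache-aside. A minimal sketch, with `CacheClient` and `Database` as hypothetical stand-ins for a memcache client and the system of record:

```java
// Sketch of the cache-aside pattern behind memcache-style read-mostly scaling.
// CacheClient stands in for a memcache client; Database for the system of record.
public class CatalogDao {
    interface CacheClient { String get(String key); void set(String key, String value, int ttlSeconds); }
    interface Database { String query(String key); void update(String key, String value); }

    private final CacheClient cache;
    private final Database db;

    CatalogDao(CacheClient cache, Database db) {
        this.cache = cache;
        this.db = db;
    }

    public String read(String key) {
        String v = cache.get(key);          // most reads are served from memory
        if (v == null) {
            v = db.query(key);              // cache miss: fall through to the database
            cache.set(key, v, 300);
        }
        return v;
    }

    public void write(String key, String value) {
        db.update(key, value);              // writes still go through the database...
        cache.set(key, value, 300);         // ...which is why memcache alone can't scale writes
    }
}
```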

Write-Intensive with High Latency

A write-intensive scenario means that the insert/update rate reaches fairly high levels (compared with read-mostly scenarios). It doesn't necessarily mean that the write rate is higher than the read rate, but rather that it's high enough to hit the limit of the database. Many social networking sites fall under this category, as most of the site content is driven by users and not by the site owner.

NoSQL alternatives such as Cassandra can manage write scalability, but in most cases at the expense of consistency. With NoSQL alternatives, getting consistency between write and read comes at the expense of read latency (to ensure read consistency you need to read multiple copies of the same data from the replicas). Furthermore, write latency is still bound to disk. So with NoSQL we can remove the scaling overhead of read/write, but we don't come close to the latency that a memory-based solution such as memcache provides. A good reference to the type of performance that you can expect from some of the file-based NoSQL alternatives is provided here and here.

Read/Write-Intensive with Low Latency

In this scenario it is not enough to manage the scaling of our write/read operation -- we need to be able to reduce the time it takes to perform the actual read/write operation. A good example is Twitter. With Twitter, read latency provided by NoSQL alternatives could be too slow to meet overall performance goals.

The solution would be very similar to the one provided by the file-based NoSQL alternatives, only that it would be entirely based on memory.

The emergence of large memory devices such as Cisco UCS makes it possible to store terabytes of data purely in-memory (see Memory is the New Disk). Unlike memcache, an in-memory data grid such as the one provided by GigaSpaces turns memory devices into a transactional data store that can act as the system of record. This makes them suitable for handling both read and write scaling at extremely low latency.

You can read a more detailed description on how an in-memory data grid can be used for read/write scaling of existing databases in my Scaling Out MySQL post.

How Can I Manage Read/Write Scaling if I'm Already Using Memcache?

Write-scaling pressure seems to be pushing sites like Twitter, Digg and others toward NoSQL as a replacement for memcache + MySQL, as noted in this article: MYSQL AND MEMCACHED: END OF AN ERA?

Having said that, many of these sites are already heavily invested in memcache, so the implication of that transition translates into fairly significant rewrites.

One way to avoid this rewrite exercise would be to turn memcache into a transactional data store, just like its close in-memory data grid relatives.

Because memcache is basically just a client/server protocol, we can easily add memcache support to an existing data grid. In this way we can use memcache as a system of record that can manage both read and write scaling. As a matter of fact, we're just about to announce our first memcache support for GigaSpaces for this exact purpose. Other data grid providers are expected to announce their support for memcache as well.

Improve Dynamic Page Load Time

Web page content is derived from many sources. The use of asynchronous calls to those services makes it possible to parallelize page rendering, reducing the time it takes to build the entire page content significantly. The diagram below illustrates how parallel page part fetching works:

Sequential Page Part Fetching

Parallel Page Part Fetching

While all this makes sense, writing the actual code to do this sort of parallel fetching might not be as trivial as it seems. One of the things that can make this work much simpler is the use of Future. The snippet below illustrates what this API looks like:
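(The original GigaSpaces snippet isn't reproduced here; below is a minimal stand-in showing the same parallel-fetch idea with the standard java.util.concurrent Future API, using made-up service names.)

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class PageAssembler {
    public static void main(String[] args) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(3);

        // Kick off all part fetches at once; none of these calls blocks.
        Future<String> header  = pool.submit(() -> fetch("header-service"));
        Future<String> catalog = pool.submit(() -> fetch("catalog-service"));
        Future<String> ads     = pool.submit(() -> fetch("ads-service"));

        // Page build time is roughly the slowest part, not the sum of all parts.
        String page = header.get() + catalog.get() + ads.get();
        System.out.println(page);

        pool.shutdown();
    }

    static String fetch(String service) throws InterruptedException {
        Thread.sleep(100); // stands in for a remote call
        return "<div>" + service + "</div>";
    }
}
```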

The sync mode is where you call an executor to fetch content from a remote service and then use the Future handle to poll for the result of that call at some later point in time. Even though the user calls the execute method sequentially, the call doesn't block, and therefore the actual execution happens in parallel.

In the async mode we are not polling for the result; instead we get a callback with the actual results. This can be an ideal way to combine async execution with Ajax, where the callback method is used to update the page asynchronously after the user has already loaded it, giving the user a fairly low-latency experience.

Use On-Demand Scaling to Ensure Latency Under Load

Viral traffic behavior means that we can't predict the load. Or, to be more precise, we need to be ready to change our site sizing more frequently than we were previously accustomed to. The current practice of over-provisioning based on the busiest hour of the busiest day doesn't hold up anymore: 1) you end up with a huge investment to meet the peak-load traffic, and 2) provisioning for peak load leads to low average utilization during regular hours. To be able to handle this type of traffic behavior we need to design our site for on-demand scaling. On-demand scaling involves the following steps:

Monitor the current traffic

If latency grows beyond a certain threshold, add another web container and update the load-balancer with the IP address of that container. You can read more on how you can take an existing Java web application and add dynamic scaling without changing a single line of code here.
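A sketch of that trigger logic; `LatencyMonitor`, `ContainerPool`, and `LoadBalancer` are hypothetical interfaces standing in for real monitoring and infrastructure APIs:

```java
// Hypothetical sketch of the scaling trigger described above; none of these
// interfaces correspond to a real product API.
public class AutoScaler {
    interface LatencyMonitor { double p95LatencyMillis(); }
    interface ContainerPool  { String startWebContainer(); } // returns the new container's IP
    interface LoadBalancer   { void addBackend(String ip); }

    private final LatencyMonitor monitor;
    private final ContainerPool pool;
    private final LoadBalancer balancer;
    private final double thresholdMillis;

    public AutoScaler(LatencyMonitor monitor, ContainerPool pool,
                      LoadBalancer balancer, double thresholdMillis) {
        this.monitor = monitor;
        this.pool = pool;
        this.balancer = balancer;
        this.thresholdMillis = thresholdMillis;
    }

    // Called periodically: when latency crosses the threshold, start another
    // web container and register its IP with the load balancer.
    public void tick() {
        if (monitor.p95LatencyMillis() > thresholdMillis) {
            balancer.addBackend(pool.startWebContainer());
        }
    }
}
```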

Control your user traffic

Whether or not you manage to add dynamic scaling, you are always going to be bound by the amount of physical resources you currently have. One of the worst things that could happen is that your site crashes as a result of an unpredictable peak load. If the Google search bot happens to visit your site while it is down, Google will "punish" your site severely and remove it completely from search results for quite some time (this is one of the SEO experts' worst nightmares; I've witnessed it on several occasions, as my wife is becoming an SEO expert herself). It is therefore considered a best practice to control your user traffic and put a limit on how users can access your site. For example, Twitter limits the size of your tweets and the number of tweets each user can post in an hour. This is also a good practice for protecting your site from malicious attacks. In other words, it is better to deny service to some of your users than to lose them all.
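As a rough illustration of that kind of traffic control, here is a minimal fixed-window per-user limiter. It is a deliberately simplified sketch; a production limiter would also need entry eviction and, in a scaled-out site, distributed counters.

```java
import java.util.concurrent.ConcurrentHashMap;

// Simplified sketch of per-user traffic limiting: each user gets a fixed
// time window and a maximum number of requests inside it.
public class RequestLimiter {
    private static final class Window {
        long windowStart;
        int count;
        Window(long start) { this.windowStart = start; }
    }

    private final int maxPerWindow;
    private final long windowMillis;
    private final ConcurrentHashMap<String, Window> windows = new ConcurrentHashMap<>();

    public RequestLimiter(int maxPerWindow, long windowMillis) {
        this.maxPerWindow = maxPerWindow;
        this.windowMillis = windowMillis;
    }

    // Returns true if the request may proceed, false if the user should be throttled.
    public boolean allow(String userId) {
        long now = System.currentTimeMillis();
        Window w = windows.computeIfAbsent(userId, id -> new Window(now));
        synchronized (w) {
            if (now - w.windowStart >= windowMillis) { // window expired: reset
                w.windowStart = now;
                w.count = 0;
            }
            return ++w.count <= maxPerWindow;
        }
    }
}
// e.g. new RequestLimiter(100, 3_600_000).allow("user-42") caps a user at 100 requests/hour
```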

Final Notes

In brick-and-mortar stores we've known for a long time that slow customer service turns customers away. The retail industry has put a lot of effort into improving customer service and reducing the time customers spend waiting in queues or not being answered. Not surprisingly, we're seeing analogous developments with web site traffic. Slow sites turn customers away. Various measures such as the one presented by Google show a direct correlation between web site latency and user traffic behavior. Lower user traffic translates immediately into fewer purchases on your site and therefore a loss of potential revenue.

Google made the right move by adding web speed to its search engine ranking. As users, this will help ensure that we get better service from sites that want to be at the top of the search list. For site owners, this makes site performance a much higher priority. For us techies, this is also good news: it will make the work of justifying better architecture easier, as we can now measure how our work translates into real business value.

March 28, 2010

Back in June 2008 I wrote a post in response to an InfoQ article, RAM is the new disk..., where I quoted Tim Bray and others comparing file system performance to memory:

Memory is the new disk! With disk speeds growing very slowly and memory chip capacities growing exponentially, in-memory software architectures offer the prospect of orders-of-magnitude improvements in the performance of all kinds of data-intensive applications. Small (1U, 2U) rack-mounted servers with a terabyte or more of memory will be available soon, and will change how we think about the balance between memory and disk in server architectures....

… Disk will become the new tape, and will be used in the same way, as a sequential storage medium (streaming from disk is reasonably fast) rather than as a random-access medium (very slow). Tons of opportunities there to develop new products that can offer 10x-100x performance improvements over the existing ones.

It didn't take long for this vision to become a reality through the work that we're doing these days with Cisco. By integrating GigaSpaces XAP with the Cisco UCS machine we are demonstrating our ability to easily load hundreds of gigabytes into a single box, and to scale linearly with growing capacity without any performance degradation. This is a great example of how middleware that was built for memory from the ground up, combined with hardware that was equipped to provide terabytes of memory in a single box, can be game changing.

This exciting combination makes it possible to manage 15-20x the amount of data in-memory, per partition. This, in turn, makes it possible to store the entire application dataset in-memory, and gain not only 10x the performance but also great simplicity, because the application no longer needs to deal with a *miss* ratio in the cache; and, at the same time, there are no consistency issues because all the data resides in-memory.

The table below was taken from a recent Stanford research paper, The Case for RAMClouds. It provides estimates of the total storage capacity needed for one year's worth of customer data for a hypothetical online retailer and a hypothetical airline. In each case the total requirements are no more than a few terabytes, which would fit in a modest-sized RAMCloud. The last line provides the estimated cost of purchasing the necessary RAMCloud servers:

As the table illustrates, the entire annual data of an online reservation system or an online retailer can be mapped completely into memory. The combination with UCS makes the deployment of such a system simple to manage and economical. The online retailer's annual data can fit into 4 UCS chassis, and the entire annual data set of the airline reservation system can fit into 2 UCS chassis!

The Next Big Thing -- In-Memory Elastic Database

In one of my recent talks, Elastic Data on the Cloud: Hype or Reality?, I also discussed this trend in the context of the cloud database. I spoke about how the addition of elastic middleware -- which enables allocation of data services on demand, sharing of that data with other tenants of the application, and built-in dynamic scaling as needed -- will make this sort of deployment significantly simpler and more economical.

What’s Next

In my last CTO Note I described our current progress with our 7.1 XAP release and our plans toward the future 8.x release:

Here is what we’ve done to take XAP even closer to this vision, in the recent 7.0 release and the upcoming 7.1 release (slated for April 14, 2010):

Fast network -- XAP 7.1 will make dynamic scaling simple by enabling you to set up a distributed system with a single API call. This not only provides a simpler API for developers, but also eliminates manual work for sizing, provisioning, and configuration at the deployment stage, and reduces the maintenance effort by automatically adjusting cluster configuration in response to changing loads, machine failures, and other events.

Large memory -- XAP can manage terabytes of data in-memory. It can even manage up to 384 GB in a single VM, through its certification for use with Cisco UCS, coming up in version 7.1.

Multi-core -- as of version 7.1, XAP comes with new concurrent transaction management for better utilization of multi-core processors. This comes on the heels of major improvements in multi-core utilization in version 7.0, which resulted in a 300% performance improvement on the Intel Nehalem processor. New benchmarks will shortly be released showcasing the performance of XAP 7.1 on Cisco UCS, which is based on the latest multi-core technology.

Virtualization and multi-tenancy -- XAP 7.1 is the first middleware platform with built-in multi-tenancy, making it extremely simple to share applications on the same resources with full isolation and enterprise-grade security. This dramatically cuts the cost per user for software vendors that use XAP for SaaS-enablement, because it enables them to cram more users onto the same hardware resources. Full VMware integration is also available, enabling multi-tenancy throughout the entire virtualized application stack.

Our 8.x release will include more exciting stuff in the same direction, with JPA support and continuation of our work to make our elastic middleware simpler. We will also be releasing a tighter packaging with Cisco UCS that will make large data deployments fully optimized from the application down to the hardware.

May 19, 2009

I was very pleased to read an email from Leonardo, who was the winner of the OpenSpaces Developer Challenge (a worldwide programming contest using the GigaSpaces application server, held last year), saying that he is now a finalist in the Cisco developer contest. Here's a bit about him and the application he submitted:

About Leonardo

Leonardo has worked for several ISPs in various roles: as a network administrator, as a Java programmer for IT consulting firms, and finally as a software architect on high-performance Java EE based projects. He is passionate about parallel programming, distributed computing and, more recently, the semantic web and its applications to software engineering.

Leonardo was the winner of the OpenSpaces Developer Challenge. He enjoys reading about various technologies in the field of computer science. When he is not developing code, he prefers to spend time with family and friends, walk in the park, or watch a movie.

About the application

Resource Management Platform is a proposal to develop an event-based platform that leverages AXP, OSGi (the Open Services Gateway initiative), Jini and JavaSpaces technologies to enable deployment of IP Multimedia Subsystem (IMS) applications based on the Session Initiation Protocol (SIP); more specifically, the Call Session Control Function (CSCF) components. It will have admission control mechanisms to manage call processing.

This solution improves infrastructure manageability for large scale IMS applications. Such a platform will potentially be useful to enable deployment of high-performance, network-based SaaS (Software as a Service) or Cloud Computing solutions at the network edge by leveraging AXP.

Leonardo's project is interesting because it shows how you can use Space Based Architecture (SBA) to implement a scalable Telco application and offer it as a SaaS application on the cloud.

Interestingly enough, I got another email the week before from Amin Abbaspour, who presented another case study illustrating how you can build a scalable SMS service using SBA, as shown in this diagram:

What the two projects have in common, from an architecture perspective, is that they both represent a highly scalable event-driven design. The unique thing about event-driven applications is that they require a combination of messaging, data and service interaction that needs to be tightly orchestrated to meet high-performance/low-latency requirements without compromising on consistency, ordering (FIFO) and reliability. This combination of requirements represents one of the hardest challenges in building scalable architectures.

Trying to meet this type of challenge in the traditional way -- by integrating a messaging system for event delivery, a database or simple caching (like Memcached or TC) for data, and a traditional application server for business logic -- is going to lead to a fairly complex architecture. Trying to reach linear scalability and keep the latency low with so many moving parts is close to impossible.

This is what makes SBA such a good fit. The main difference with SBA is that it recognizes the strong dependency between messaging, data and business logic. The key is to have one shared model for clustering, high availability and scalability across all three components of the architecture. This makes it possible to reduce the number of moving parts and network hops associated with each business transaction, thereby increasing reliability.

On a personal level, I was very pleased to see that the software we are developing is helping people like Leonardo and Amin build their own carrier-grade services and put themselves in a unique spot in a highly competitive market.