scalability

February 04, 2015

Background

You don’t need to be an expert to realize that a failure of an eCommerce site during Black Friday or Cyber Monday is a disastrous event, leading to huge losses in revenue and reputation for the retailer. As eCommerce's share of total US retail sales grows to more than 8% this year, the impact of failure becomes more significant - not just for the site itself, but for the overall economy. A study on the subject, compiled by Joyent and New Relic, showed that 86% of companies experienced one or more episodes of downtime last holiday season. At the same time, 58% of customers will not use a company’s site again after experiencing site errors.

Another study by Radware measured not just the impact of downtime on eCommerce sites, but also the impact of slowness - an even more common and less measured metric. According to this study a one-second delay correlates to:

2.1% decrease in cart size

3.5-7% decrease in conversions

9-11% decrease in page views

8% increase in bounce rate

16% decrease in customer satisfaction

A 2.2-second slowdown equals a 7.7% conversion rate hit.

Meanwhile, KISSmetrics illustrated how page loads longer than three seconds lead to a 40% bounce rate.

Obviously there is enough business incentive here to invest in handling both the downtime and latency issues. Meanwhile, looking at typical retailer traffic during this season (source: Akamai), we notice that traffic spikes by at least 500%:

In this post I will share our specific experience and lessons learned from the 2014 holiday season, which turned out to be very successful. I believe that the results below speak for themselves.

2014 Results

How We Achieved These Results

Taking a preemptive approach - rather than reacting after a failure occurred - prevented failures in the first place.

Common Causes: Most failures are the result of misconfiguration or capacity planning guesswork.

Knowledge & Experience: eCommerce applications are complex and built from many subsystems. In many cases, an eCommerce organization does not have the expert skill-set in each of the subsystems. Having an expert in the room helps to bridge this gap and builds the capabilities of business operations.

Fast Feedback: When product-related issues were identified, we were able to provide the fastest path to protect the business and address concerns in a timely fashion.

To give you a bit more insight on this process I’ve added a section to this post called Stories from the War Room which illustrates a real-life incident and the action that was taken by our on-site engineer to resolve it.

Data is mirrored back into the database in batches. In this way, peak load transactions are buffered so that database traffic does not crash the database back-end.

The In-Memory Compute grid acts as a system of record. A failure in the underlying database can be survived without affecting online users while the database is restored to a working state.

Using a combination of In-Memory & SSD allows very large In-Memory data sets to be stored at a reasonable cost, while still ensuring fast recovery during failure.
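
The batched mirroring described above can be illustrated with a simple write-behind buffer. This is a minimal sketch, not the actual product mechanism; the Entry record, the catalog_item table, and the DataSource wiring are all assumptions made for the example:

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import javax.sql.DataSource;

// Buffers peak-load writes in memory and mirrors them to the database
// in fixed-size batches, so the database never sees the raw traffic spike.
public class WriteBehindMirror implements Runnable {

    // Hypothetical record type: a key/value pair to persist.
    public record Entry(String key, String value) {}

    private final BlockingQueue<Entry> buffer = new LinkedBlockingQueue<>(100_000);
    private final DataSource dataSource;   // assumed to point at the backing database
    private final int batchSize;

    public WriteBehindMirror(DataSource dataSource, int batchSize) {
        this.dataSource = dataSource;
        this.batchSize = batchSize;
    }

    // Called on the fast path; never touches the database directly.
    public void write(Entry entry) throws InterruptedException {
        buffer.put(entry);
    }

    // Background thread: drain the buffer and flush it as one JDBC batch.
    @Override
    public void run() {
        List<Entry> batch = new ArrayList<>(batchSize);
        try {
            while (!Thread.currentThread().isInterrupted()) {
                batch.add(buffer.take());                 // block until there is work
                buffer.drainTo(batch, batchSize - 1);     // then grab up to a full batch
                flush(batch);
                batch.clear();
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    private void flush(List<Entry> batch) {
        // "catalog_item" is a made-up table name for illustration only.
        String sql = "INSERT INTO catalog_item (item_key, item_value) VALUES (?, ?)";
        try (Connection con = dataSource.getConnection();
             PreparedStatement ps = con.prepareStatement(sql)) {
            for (Entry e : batch) {
                ps.setString(1, e.key());
                ps.setString(2, e.value());
                ps.addBatch();
            }
            ps.executeBatch();
        } catch (Exception e) {
            // In a real system the batch would be retried or re-queued here.
            e.printStackTrace();
        }
    }
}
```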

Self-Healing Systems recover from failure in real time

Failures are inevitable: Keeping a backup copy in-memory enables zero-downtime systems to service user traffic without interruption, even if something does go wrong.

Systems provisioned for failure handle failure by design.

Automatic failover and provisioning eliminate the need to overprovision (costly) resources in case of failure. Traditionally, it’s common for retailers to provision holiday-season resources at 5 times the capacity of their non-holiday traffic infrastructure.

Two Examples

In one case, a Top 100 online retailer used XAP to provide access to its catalog and inventory data and achieved its first zero-downtime holiday season in several years. As a result, this retailer delivered a vastly improved customer experience from previous years (achieving an 18% improvement in customer satisfaction ratings) and generated a 139% increase over 2013 holiday sales.

In another case, a Top 30 US online retailer logged a record-setting peak sales day of $44 million. This was especially notable because that same day the retailer experienced system performance issues caused by an automated hardware failover condition. Fortunately, the retailer’s XAP implementation began automatically relocating application components to standby resources, keeping apps running despite the complications. As a result, consumers continued to shop—and buy—with minimal disruption.

Stories from the War Room

I’ve picked two issues that we identified while our engineers were working on-site with one of our top eCommerce customers. I thought that these two cases provide useful insight into how a preemptive support strategy and a short feedback loop work:

Issue #1: Sudden slow client response time

The quote below was taken from the direct on-site report:

GC spikes are one of the common issues that we encounter when managing in-memory data clusters. As GC tends to compete for the same CPU resources that serve user transactions, it often leads to overall slowness of the system. Fairly quickly, this slowness can pile up into a huge backlog which can break the system in unexpected areas.

The resolution was to split the cluster into more data containers (GSCs in XAP terminology), as this allows a better spread of the load across the entire cluster. In addition, the overall capacity (memory and CPU) allocated to the cluster was increased to meet the growing capacity demand.

The diagram below provides a view of one of the clusters at the time the issue occurred.

As can be seen, around 23:00 the system started to hit its high CPU mark as a result of GC spikes. The system was gradually rebalanced after a couple of hours without any downtime. The preemptive action taken to handle this incident prevented it from piling up and causing a complete system failure.
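
To spot this kind of GC pressure before it piles up, the time spent in collection can be sampled from inside the JVM with the standard GarbageCollectorMXBean API. A minimal sketch, with an arbitrary 500 ms threshold and a 5-second sampling interval:

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;
import java.util.HashMap;
import java.util.Map;

// Polls the JVM's GC beans and prints a warning whenever the time spent in GC
// since the previous sample crosses a threshold.
public class GcWatcher {
    public static void main(String[] args) throws InterruptedException {
        Map<String, Long> lastGcTime = new HashMap<>();
        long thresholdMillis = 500;          // assumed alerting threshold per sample
        while (true) {
            for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
                long total = gc.getCollectionTime();   // cumulative millis spent in this collector
                long previous = lastGcTime.getOrDefault(gc.getName(), 0L);
                long delta = total - previous;
                lastGcTime.put(gc.getName(), total);
                if (delta > thresholdMillis) {
                    System.out.printf("GC spike: %s spent %d ms collecting in the last interval%n",
                            gc.getName(), delta);
                }
            }
            Thread.sleep(5_000);               // sample every 5 seconds
        }
    }
}
```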

Issue #2: Connection Issues

In this incident, the increase in concurrent client activity during peak load resulted in a large number of network connections being opened at the same time. One of the nodes in the cluster was misconfigured with a low limit on the number of connections that could be open simultaneously. The resolution was to kill that faulty node and leverage the self-healing capability of XAP to force an immediate re-route of clients to the backup node while relocating the faulty node to another machine.

Final Notes

Peak load tends to stretch a system's behavior in the areas that are least expected and therefore hardest to handle. Quite often, peak loads lead to unexpected downtime.

There are many cases in which this sort of peak load performance is known in advance, as is the case with Black Friday and Cyber Monday. Still, many eCommerce sites continue to experience downtime or slowness during such events that lead to huge loss of revenue and reputation.

As a software vendor, we have often found ourselves involved in the early architecture discussion phases, which usually take place as a result of a failure in the previous year. Despite the fact that we are brought in to solve these peak load performance problems, we are still called in during a fire drill when those failures occur. Often, the failure was the result of misconfiguration or a problem in another system that manifested itself as an issue in our product. Obviously the experience of handling fire drills is never pleasant, for us or for our customers, and that is something we wanted to avoid as we approached the 2014 holiday season.

This year we decided, together with our customers, to take a more preemptive approach by putting an engineer on-site to accompany the customer team during the event itself. This was a huge success, leading to 100% uptime. Both teams learned much from the experience; the customer learned how to better operate our product and what to look for to ensure that the system is running properly. We learned much about how the customer uses our product and were able to shorten the feedback loop between the customer and our product and engineering teams.

With those lessons in hand I feel that both we and our customers are much more equipped to handle 2015. I can’t wait to write about the lessons learned from Black Friday 2015.

July 30, 2014

“You just can't have it all” is a phrase that most of us are accustomed to hearing, and one that many still believe to be true when discussing the speed, scale and cost of processing data. To reach high-speed data processing, it is necessary to use more memory resources, which increases cost, because memory, on average, tends to be far more expensive than commodity disk drives. The idea that data systems cannot reliably provide both memory and fast access—not to mention at the right cost—has long been debated, though the idea of such limitations was cemented by computer scientist Eric Brewer, who introduced us to the CAP theorem.

The CAP Theorem and Limitations for Distributed Computer Systems

Through this theorem, Brewer stated that it is impossible for any distributed computer system to provide the following three guarantees simultaneously:

Consistency (Every node will be able to view the same data at the same time)

Availability (Every request will receive a response)

Partition Tolerance (The system will continue to operate despite arbitrary message loss or partial failure)

We can evaluate different data management solutions by these three properties and the tradeoffs each makes when favoring one over another. For example, what is the tradeoff of putting more emphasis on consistency? The tradeoff will most likely be less availability or partition tolerance.

Traditional RDBMS solutions offered consistency over partition tolerance and cost. In-memory caching solutions such as memcached offered a different set of tradeoffs, for example speed over some degree of consistency (between the in-memory data and the data held on disk). The In-Memory Data Grid, or In-Memory Computing as it is called today, extended the use of memory as the system of record and thus enabled a higher degree of consistency by reducing the dependency on the underlying disk databases. The new generation of NoSQL databases took a different approach and offered speed and scale at low cost over consistency, by utilizing distributed commodity storage and relying heavily on asynchronous data flows to speed up data processing.

Quite often there is a direct correlation between those tradeoffs and cost. For example, a high degree of consistency often relies on synchronous data flow and replication operations, which come at the cost of speed. To overcome the speed limit we often need to use memory or other flavors of high-speed storage, which translates into high cost. On the other hand, if we can compromise on consistency, we can offer speed and scale using commodity resources (without relying on high-cost resources).

As technology advances and demands for cost effective and scalable, reliable data increase, new solutions are coming to light, with one of the most favored being a combination between two items we are already familiar with: RAM and flash.

Why SSD Is a Viable Storage Solution

Many of the previously mentioned alternatives were built under the assumption that disk is the bottleneck. If disk is the bottleneck, we need layers of optimization to minimize access to it. SSDs provide a high-speed storage device, which does away with the idea that disk is the bottleneck. This is a fundamental change in one of the core assumptions behind the design of many databases today. Let me explain.

For example, if the disk is slow it makes sense to put various filters in place to minimize access to it. As the disk gets faster, all of those filters become overhead. In many cases it would be faster to write directly to disk without those additional layers. In a similar way, the use of asynchronous operations needs to change as well. When the disk is slow it makes sense to use asynchronous operations to defer write operations. However, when the disk is no longer the bottleneck it is faster and vastly simpler to access the disk directly. This also avoids many of the consistency tradeoffs that I mentioned earlier, which are often the result of those asynchronous operations.

Putting SSD and RAM Together Using Off-Heap Storage

RAM storage can be as big as necessary and it is incredibly quick—but it costs about ten times more than a flash disk. By putting SSD and RAM together we can optimize the cost/performance ratio. The main challenge in doing so remains consistency, i.e. how do we keep the data in memory and in flash in sync so that, from an end-user perspective, the integration looks completely seamless?

One method of such integration is referred to as off-heap storage.

Off-heap storage is often implemented as a plug-in that the in-memory data store uses to offload its data from RAM into flash and vice versa.

Traditionally, off-heap storage was implemented with shared memory, which is basically block storage that provides external access to the same RAM device. By doing so, we bypass the overhead of managing the data in RAM through the Java heap. The limitation of that approach is that capacity is still limited by the size of RAM. It also forces an external data management and garbage collection layer to manage this external heap.
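
As a rough illustration of the off-heap idea, the sketch below keeps values in a direct ByteBuffer outside the Java heap and keeps only a small index on-heap, so the values are invisible to the garbage collector. It is a toy example (no eviction or space reclamation), not how any particular product implements it:

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

// Stores values outside the Java heap in a direct ByteBuffer, keeping only a small
// on-heap index of (offset, length) pairs.
public class OffHeapStore {
    private final ByteBuffer buffer;
    private final Map<String, long[]> index = new HashMap<>(); // key -> {offset, length}

    public OffHeapStore(int capacityBytes) {
        this.buffer = ByteBuffer.allocateDirect(capacityBytes);
    }

    public void put(String key, String value) {
        byte[] bytes = value.getBytes(StandardCharsets.UTF_8);
        int offset = buffer.position();
        buffer.put(bytes);                         // value lives off-heap
        index.put(key, new long[]{offset, bytes.length});
    }

    public String get(String key) {
        long[] loc = index.get(key);
        if (loc == null) return null;
        byte[] bytes = new byte[(int) loc[1]];
        ByteBuffer view = buffer.duplicate();      // independent position for reads
        view.position((int) loc[0]);
        view.get(bytes);
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        OffHeapStore store = new OffHeapStore(1 << 20); // 1 MB off-heap region
        store.put("sku-42", "blue widget");
        System.out.println(store.get("sku-42"));
    }
}
```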

With SSD, users can now have both the memory and the SSD device synchronized and regularly kept up-to-date, meaning storage is scalable and consistent.

Some SSD drivers provide a key/value interface, which makes the integration vastly simpler as the driver takes care of its own data management. Having said that, most in-memory implementations that use SSD as off-heap storage provide fairly limited functionality in terms of query and transaction support. This is because they integrate with the SSD through a disk storage interface.

Using SSD as a Foundation for a Flash Database

To really make the best use of SSD and maximize its potential, we need to think of SSD as the foundation for a database. This requires a tighter, native integration with the SSD in order to overcome some of the limitations of many current In-Memory Data Grid and SSD implementations.

The specific set of features that are needed for this sort of integration includes:

Portable and native Key/Value interface that works against any flash device.

Using Flash as Durable Storage - Flash can be used as a persistent data store and not just as an extension of RAM. As such, we can leverage the durability of flash to speed up the loading of data from the underlying flash device in case of a planned or unplanned recovery process.

Support for batch operations - Batch operations are a common way to speed up access to flash by minimizing the number of cross-boundary calls between the RAM device and the underlying flash devices.

Transactional - To support transactional access we need to extend the batch operations to fail or succeed as a single atomic operation.

Query index support - To speed up query and access time, the data needs to be indexed in a way that fetching a particular object by its ID points directly to its physical location on disk.
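
To make the feature list above concrete, here is a hypothetical Java facade over a flash device exposing exactly these capabilities. It is illustrative only - it is not the ZetaScale API, and all names and signatures are invented:

```java
import java.util.List;
import java.util.Map;

// A hypothetical key/value facade over a flash device, illustrating the feature set
// listed above. Names and signatures are invented for clarity.
public interface FlashKeyValueStore {

    // Basic key/value access against the flash device.
    byte[] get(String key);
    void put(String key, byte[] value);

    // Batch operations to minimize crossings between RAM and the flash device.
    Map<String, byte[]> getBatch(List<String> keys);
    void putBatch(Map<String, byte[]> entries);

    // Transactional access: the whole batch either commits or rolls back atomically.
    void putBatchAtomically(Map<String, byte[]> entries) throws TransactionFailedException;

    // Query index support: map an object id directly to its physical location on flash.
    long locate(String key);

    // Durable storage: reopen an existing store after a planned or unplanned restart.
    static FlashKeyValueStore open(String devicePath) {
        throw new UnsupportedOperationException("illustrative sketch only");
    }

    class TransactionFailedException extends Exception {}
}
```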

To make this a practical reality, GigaSpaces partnered with SanDisk, which through its Fusion-io acquisition owns the majority of the SSD market. As part of this partnership, we implemented a new version of our In-Memory Data Grid, XAP MemoryXtend, which is now integrated with the SanDisk ZetaScale interface. ZetaScale implements all of the above features as a general-purpose flash disk interface. Through this integration we are now able to use the XAP RAM-based storage to handle complex queries and analytics as well as transactional support. We can also leverage XAP's cluster support to integrate multiple flash devices mounted on various machines and make them work as one big transactional database (as illustrated in the diagram below).

The XAP MemoryXtend cluster consists of a number of partitions, each plugged into a local SSD drive. Each partition can have at least one backup running on a separate machine to ensure availability. The XAP client is a smart proxy that abstracts the underlying cluster and exposes all of the physical partitions through a single data-grid interface. The proxy routes write or query requests to a particular partition when we’re looking for a specific data item. In the case of aggregated queries, it invokes a parallel query against all nodes and consolidates the results into a single result set.
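
The routing behavior of such a smart proxy can be sketched roughly as follows. The Partition interface and the hash-based routing are assumptions for illustration, not the actual XAP client implementation:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Single-key requests are routed to one partition by hashing the key; aggregate queries
// are fanned out to all partitions in parallel and their results consolidated.
public class SmartProxy {

    public interface Partition {
        String get(String key);
        List<String> query(String predicate);
    }

    private final List<Partition> partitions;
    private final ExecutorService pool = Executors.newCachedThreadPool();

    public SmartProxy(List<Partition> partitions) {
        this.partitions = partitions;
    }

    // Route a single-key read to the owning partition.
    public String get(String key) {
        int index = Math.floorMod(key.hashCode(), partitions.size());
        return partitions.get(index).get(key);
    }

    // Fan an aggregate query out to every partition and merge the partial results.
    public List<String> query(String predicate) throws Exception {
        List<Future<List<String>>> futures = new ArrayList<>();
        for (Partition p : partitions) {
            futures.add(pool.submit((Callable<List<String>>) () -> p.query(predicate)));
        }
        List<String> merged = new ArrayList<>();
        for (Future<List<String>> f : futures) {
            merged.addAll(f.get());
        }
        return merged;
    }
}
```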

Flash Database as a Service

As flash devices are now supported by cloud providers such as Amazon, it only makes sense to leverage this capability and offer this sort of flash-based database on demand.

This is where XAP Deployment and Orchestration comes in handy, as it automates the process of deploying and managing XAP clusters across a variety of cloud infrastructures.

June 26, 2014

The big data movement was largely driven by the demand for scale in the velocity, volume, and variety of data. Those three vectors led to the emergence of a new generation of distributed data management platforms such as Hadoop for batch processing and NoSQL databases for interactive data access, both inspired by their predecessors at Google (MapReduce, BigTable) and Amazon (Dynamo).

As we move to fast data, there’s more emphasis on processing big data at speed. Getting speed without compromising on scale pushes the limits in the way most of the existing big data solutions work and drives new models and technologies for breaking the current speed boundaries. New advancements on the hardware infrastructure with new flash drives offer great potential for breaking the current speed limit which is bounded mostly by the performance of hard drive devices.

Why using an existing RDBMS or NoSQL database on top of a flash drive is not enough

Many existing databases — including more modern solutions such as NoSQL and NewSQL — were designed to utilize standard hard disk drives (HDDs). These databases were designed with the assumption that disk access is slow, and therefore they use techniques such as Bloom filters to avoid going to disk when the data doesn’t exist. Another common technique is writing asynchronously to a commit log. A good insight into all the optimizations that are often involved in a single write on a NoSQL database is provided by the Cassandra architecture. Let’s have a look at what that entails below:

The Cassandra write path
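
One of the disk-saving optimizations mentioned above, the Bloom filter, can be sketched in a few lines. The sizing and hash scheme below are arbitrary; the point is only to show how a negative answer lets the store skip a disk read entirely:

```java
import java.util.BitSet;

// A minimal Bloom filter of the kind HDD-oriented stores use to skip a disk read
// when a key is definitely absent; parameters are illustrative, not tuned.
public class BloomFilter {
    private final BitSet bits;
    private final int size;
    private final int hashes;

    public BloomFilter(int size, int hashes) {
        this.bits = new BitSet(size);
        this.size = size;
        this.hashes = hashes;
    }

    public void add(String key) {
        for (int i = 0; i < hashes; i++) {
            bits.set(indexFor(key, i));
        }
    }

    // false -> the key was never written, so the disk read can be skipped entirely
    // true  -> the key *may* exist and the store must still go to disk to confirm
    public boolean mightContain(String key) {
        for (int i = 0; i < hashes; i++) {
            if (!bits.get(indexFor(key, i))) return false;
        }
        return true;
    }

    private int indexFor(String key, int seed) {
        int h = key.hashCode() * 31 + seed * 0x9E3779B9;
        return Math.floorMod(h, size);
    }
}
```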

What happens when disk speed is no longer the bottleneck?

When the speed of disk is no longer the bottleneck, as is the case with flash devices, a lot of that optimization turns into overhead. In other words, with flash devices it is faster and simpler to access the device directly for every write or read operation.

Flash is not just a fast disk

Most of the existing use cases treat flash devices as a faster disk. Using flash as a fast disk was a short path to bypassing the disk performance overhead without having to change much of the software and applications. Having said that, this approach inherits a lot of unnecessary disk-driver overhead. To exploit the full speed potential of flash devices, it is best to access them directly from the application and treat them as a key/value store rather than a disk drive.

Why can’t we simply optimize the existing databases?

When we reach a point at which we need to change many of the existing assumptions and much of the architecture of existing databases to take advantage of new technologies and devices such as flash, that’s a clear sign that local optimization isn’t going to cut it. This calls for a new disruption.

The next big thing in big data

Given the background above, I believe that in the same way that demand for big data led to the birth of the current generation of data-management systems, the drive for fast data will also lead to new kinds of data management systems. Unlike the current set of databases, I believe that the next generation databases will be written natively to flash and will use direct access to flash rather than the regular disk-based access. In addition to that, those databases will include high performance event streaming capabilities as part of their core API to allow processing of the data as it comes in, and thus allowing real-time data processing.

Fast data in the cloud

Many of the existing databases weren’t designed to run in the cloud as first-class citizens, and quite often they require a fairly complex setup to run well in a cloud environment.

As cloud infrastructure matures, we now have more options for running big data workloads in the cloud. The next-generation databases need to be designed to run as a service from the get-go.

To avoid the latency that is often associated with such a setup, the next generation databases will need to run as close as possible to the application. Assuming that many of the applications will run in one cloud or another, it means that those databases will need to have built-in support for different cloud environments. In addition, they will need to leverage dynamic code shipping to pass code with the data, and in this way allow complex processing with minimum network hops.

RAM and flash-based devices have much more in common than flash and hard drives. In both RAM and flash, access time is fairly low and isn't really the bottleneck; in-memory databases use direct access to the RAM device through a key/value interface to store and index data.

Those factors make in-memory-based databases more likely to fit as the basis for the next generation of flash databases.

The combination of in-memory databases and data grid on top of flash devices also allows the system to overcome some of the key capacity and cost per GB limitations of an in-memory based solution. The integration of the two will allow an increase in the capacity per single node to the limit of the underlying flash device rather than to the limit of the size of RAM.

The overall architecture of the integrated solution looks as follows (Source: Gartner):

As can be seen in the diagram above, the IMC (in-memory computing) layer acts as the front end to the flash device and handles the transactional data access and stream processing, while the flash device is used as an extension of the RAM device from the application perspective.

There are basically two modes of integration between the RAM and flash device.

LRU Mode – In this mode, we use RAM as a caching layer over the flash device. The RAM device holds the “hot” (i.e. most recently used) data, while the flash device holds the entire set.

Pros: Optimized for maximum capacity.

Cons: Limited query to simple key/value access

Fast Index Mode – In this mode, all the indexes are held in RAM while the flash device holds the entire data set.

Pros: Rich query support over the entire data set.

Cons: Capacity is limited to the size of indexes that can be held in RAM.

In both cases, access to flash is done directly through a key/value interface and not through a disk drive interface. As both RAM and flash drives are fairly fast, it is simpler to write data to the flash drive synchronously, and in this way avoid the potential complexity of inconsistency.
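
LRU mode can be illustrated in miniature with an access-ordered map in front of a flash-backed key/value store. The FlashStore interface stands in for the real device integration, and writes go through synchronously as described above:

```java
import java.util.LinkedHashMap;
import java.util.Map;

// RAM holds only the hot entries while a flash-backed key/value store holds the
// entire data set. Writes go to flash synchronously to avoid inconsistency.
public class LruFrontedStore {

    public interface FlashStore {                 // assumed flash key/value interface
        void put(String key, String value);
        String get(String key);
    }

    private final FlashStore flash;
    private final Map<String, String> hot;

    public LruFrontedStore(FlashStore flash, int hotCapacity) {
        this.flash = flash;
        // access-ordered LinkedHashMap evicting the least recently used entry
        this.hot = new LinkedHashMap<String, String>(hotCapacity, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<String, String> eldest) {
                return size() > hotCapacity;
            }
        };
    }

    public void put(String key, String value) {
        flash.put(key, value);   // synchronous write-through: flash always has the full set
        hot.put(key, value);     // keep the entry hot in RAM as well
    }

    public String get(String key) {
        String value = hot.get(key);
        if (value == null) {
            value = flash.get(key);   // miss: fall back to flash and re-warm the cache
            if (value != null) hot.put(key, value);
        }
        return value;
    }
}
```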

It is also quite common to sync the data from the IMC and flash devices into an external data store that runs on a traditional hard drive. This process is done asynchronously and in batches to minimize the performance overhead.

Direct flash access in real numbers

To put some real numbers behind these statements, I wanted to refer to one of the recent benchmarks that was done using an in-memory data grid based on GigaSpaces XAP and direct flash access using a key/value software API that allows direct access to various flash devices. The benchmark was conducted on several devices as well as on private and public cloud-based services on AWS. The benchmark shows a ten times increase in managed data without compromising on performance.

December 15, 2013

In a stream processing model, data is processed as it arrives in the Big Data system. With a batch processing model, only once the data is stored does the system run a variety of batch analytics, typically through a map/reduce style of processing.

According to a recent survey of approximately 250 respondents, there is a trend moving towards stream processing in order to speed up analytics, with a significant increase in the popularity of this model when compared with last year. The number of organizations planning to use stream processing in 2014 has more than doubled (24%) from last year’s amount (10%).

This goes hand in hand with the fact that real-time analytics is also becoming more mainstream, as I pointed out in my 2014 Predictions.

Another interesting data point from the survey is that roughly 43% of the respondents defined real-time as sub-second and 42% defined it as sub-minute. This difference is quite interesting and could lead to different approaches for implementing stream processing.

For example, real-time is defined differently by Facebook and Twitter. Twitter chose Storm as its event processing engine, allowing it to process events at sub-second resolution. Meanwhile, Facebook defines real-time as a 30-second batch window, choosing a logging-based approach to fit its need.

Based on the survey, it appears that both approaches are valid and can be applied in correlation to the degree of real-timeliness of your analysis.

Facebook Log-Centric Stream Processing

Twitter Event-Centric Stream Processing

In-Memory Data Grids become a more popular choice for real-time processing

According to the survey, 64% of respondents indicated that they plan to incorporate In-Memory-based solutions for delivering their real-time analytics processing. This is also consistent with last year's survey by Ventana Research.

I believe that we will see an even bigger movement in this direction, as the cost/performance ratio of In-Memory Data Grids and In-Memory Databases will improve significantly with the combination of RAM and flash devices, which together provide a fairly compelling alternative, on both performance and cost, to disk-based solutions.

Big Data in the Cloud increased to 56%

New developments in cloud infrastructure, such as the support for bare metal in the new OpenStack releases, as well as support for high-memory instances, flash disks, etc., are removing many of the technical barriers to running I/O-intensive workloads like Big Data in the cloud.

Indeed, according to the survey, there is a significant shift towards the use of Big Data in the cloud with 55% of organizations either using or planning to run their Big Data in the cloud in 2014.

Where do we go from here? How does this affect GigaSpaces’ future roadmap?

There are multiple areas in the GigaSpaces roadmap that aim to address this demand.

Real-Time Processing through Storm Integration - In this project, we integrate Storm on top of an in-memory backend for both stream processing and the data store.

Support for Flash Disk - The integration with flash disk is planned for our upcoming XAP release and will include support for SanDisk, Fusion I/O and other flash disk devices.

Big Data in the Cloud - through Cloudify - We've been continuously extending our Big Data portfolio support and recently added support for Storm and Cognos, adding to our existing support for Hadoop, MongoDB, Cassandra, ElasticSearch, etc.

January 01, 2013

Memory-based databases and caching products have been available for over a decade. However, so far they have been used in a fairly small niche of the data management solution market.

There have been multiple advances in the industry in both hardware and software architecture, which makes memory-based computing more relevant today than in the past, as outlined in the diagram below.

In a nutshell, new classes of hardware with 64-bit CPUs can now support 2TB of memory on a single device. In addition, the advances in software architecture and solutions toward distributed architecture and cloud make it easier to utilize these new hardware capabilities.

In-Memory Computing

In many ways, In-Memory Computing is a close relative of In-Memory Databases. As with many databases, it was designed to support all the data management aspects that are often expected from traditional databases, such as queries and transactions, with the difference that the data is managed on RAM devices and not on disk, which yields potentially 1000x better performance and latency according to various benchmarks.

The main differences between the traditional in-memory databases and in-memory-based-computing are that In-Memory-Computing is:

1. Designed for distributed and elastic environments

2. Designed for In-Memory data processing

Executing the code where the data is:

The fact that we can store our data in the same address space as our application is the biggest gain.

Unlike disk and even flash disk devices, we can access our data by reference and thus perform complex data manipulation without any serialization/deserialization overhead. With the newer class of JVM-based and dynamic languages such as Java, JavaScript, JRuby, and Scala, it is also significantly easier to pass complex logic over the wire and execute it on a remote device.

In-Memory Computing relies heavily on that capability, and exposes a new class of complex data processing capabilities that fits well with the distributed nature of the data through real-time map/reduce and stream-based processing as a core element of its architecture.
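
A rough sketch of the "send the code to the data" idea follows. The Partition, Task, and Cluster types are invented for illustration; the point is that a small serializable task travels to each partition, runs against local in-memory data by reference, and only the partial results travel back to be reduced:

```java
import java.io.Serializable;
import java.util.List;
import java.util.Map;
import java.util.function.BinaryOperator;

// Real-time map/reduce in miniature: ship the task, run it where the data lives,
// and reduce the small partial results on the calling side.
public class MapReduceSketch {

    public interface Partition {
        Map<String, Long> localData();            // data held in this partition's RAM
    }

    public interface Task<R> extends Serializable {
        R execute(Partition partition);           // runs inside the partition's JVM
    }

    public interface Cluster {
        <R> List<R> broadcast(Task<R> task);      // ship the task, collect partial results
    }

    // Example: count items per partition and reduce to a cluster-wide total.
    public static long totalCount(Cluster cluster) {
        Task<Long> countTask = partition -> (long) partition.localData().size();
        BinaryOperator<Long> reduce = Long::sum;
        return cluster.broadcast(countTask).stream().reduce(0L, reduce);
    }
}
```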

The Big Data Context

According to our recent survey, more than 70 percent of respondents said their business requires real-time processing of big data -- either in large volumes, at high velocity, or both.

Interestingly enough, another survey, by Ventana Research, indicated that one of the biggest technical challenges in Big Data is the lack of real-time capabilities (67%). The report also indicated that many organizations are planning to use in-memory databases (40%) as part of their Big Data stack. This places the In-Memory Database as a second choice, ahead of specialized DBMSs (33%) and Hadoop (32%). One conclusion this survey leads to is that organizations see Data Warehouse Appliances and In-Memory Databases as among their first choices to deal with the lack of real-time capabilities.

No One-Size-Fits-All Solution

While In-Memory Databases fit well into the planned Big Data stack, it is clear that there's no one-size-fits-all solution. The Big Data stack is going to be based on a blend of various technologies, each covering different aspects of the challenges of Big Data, from batch to real-time, from vertical to horizontal solutions, and so on.

The question is: How do we integrate them all, without adding even more complexity to an already complex system?

In this post I will focus specifically on one of the approaches that we used for combining In-Memory Computing with other Big Data solutions, such as Hadoop and Cassandra.

Putting In-Memory Computing Together with a NoSQL DB

One of the main motivations to integrate in-memory-based solutions with a NoSQL DB is to reduce the cost per GB of data.

Putting our entire data set purely in memory can be too costly, especially for data that we're not going to access frequently.

There are various approaches to doing this -- the approach we found most useful is a two-tier approach.

With the two-tier approach the In-Memory Computing systems run separately from the NoSQL database, which acts as the long-term storage.

The Challenge

The main challenge with this approach is the complexity that is associated with synchronising two separate data systems. Specifically, how to ensure that data that is written into the front end In Memory Computing engine gets populated into the NoSQL database reliably, and vice versa.

The Solution

To deal with this challenge we used an approach similar to the one we used before with RDBMSs: an implicit plug-in that gets called whenever new data is written and populates it into the underlying database. The plug-in also deals with pre-loading the data when the system starts. In the RDBMS world we used frameworks like Hibernate to handle the implicit mapping of the data between the in-memory front end and the underlying database.

Working with Dynamic Data Structures (a.k.a. the Document Model)

When we tried to apply the same approach with NoSQL databases, we could no longer rely on Hibernate as the default framework for mapping the data between the two data systems, as NoSQL databases like Cassandra tend to be fundamentally different from traditional RDBMSs. The main difference is the use of dynamic data structures, a.k.a. the Document Model.

To deal with dynamic data structures we added the following hooks:

Introducing new documents and objects: Users can choose to write or load data in various forms -- Documents for unstructured data, or Objects (POJOs) for structured or semi-structured data.

Introducing and loading new metadata: To map the data to and from the NoSQL database, we also added the ability to introduce new metadata and to load an object's metadata before the actual data is loaded.

Introducing new indexes: In NoSQL databases you cannot effectively access data that is not indexed. For that purpose we included the ability to introduce indexes on the fly.
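
The hooks above can be illustrated with a toy document store in which properties are an open map and an index on any property can be introduced at runtime. All class names here are made up for the example; this is not the product API:

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Documents carry an open set of properties, and an index on any property can be
// introduced on the fly so that property becomes queryable.
public class DynamicDocumentStore {

    public static class Document {
        public final String id;
        public final Map<String, Object> properties = new HashMap<>();
        public Document(String id) { this.id = id; }
    }

    private final Map<String, Document> byId = new HashMap<>();
    // property name -> (property value -> document ids)
    private final Map<String, Map<Object, Set<String>>> indexes = new HashMap<>();

    // Introduce a new index on the fly; existing documents are back-filled into it.
    public void addIndex(String property) {
        Map<Object, Set<String>> index = new HashMap<>();
        for (Document doc : byId.values()) {
            Object value = doc.properties.get(property);
            if (value != null) index.computeIfAbsent(value, v -> new HashSet<>()).add(doc.id);
        }
        indexes.put(property, index);
    }

    public void write(Document doc) {
        byId.put(doc.id, doc);
        for (Map.Entry<String, Map<Object, Set<String>>> e : indexes.entrySet()) {
            Object value = doc.properties.get(e.getKey());
            if (value != null) e.getValue().computeIfAbsent(value, v -> new HashSet<>()).add(doc.id);
        }
    }

    // Only indexed properties can be queried efficiently.
    public Set<String> query(String property, Object value) {
        Map<Object, Set<String>> index = indexes.get(property);
        if (index == null) throw new IllegalStateException("no index on " + property);
        return index.getOrDefault(value, Set.of());
    }
}
```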

The main benefit of this two-tier approach is that it allows us to take the best of the memory- and file-based approaches without adding too much complexity. The two tiers behave and work as one data system from an end-user perspective. The details of how the two systems are kept in sync are hidden from the application.

Furthermore, the two-tier approach opens up a new degree of flexibility in how we design our Big Data system.

New Degree of Flexibility

If we look at the entire data flow, from the point at which a user interacts with our system (where we expect low latency and a high degree of consistency) to our analytics systems (where we record and analyze those actions, latency is less of an issue, and we can relax some of the consistency constraints), we can see that each stage in our data processing has different consistency, latency and performance requirements.

With the two-tier approach there is more flexibility in dealing with those different requirements while still keeping everything working as if it were one big system from a usability perspective. Here are a few examples of how this setup can work:

Consistent data flow from Real-Time to Batch: The integration enables us to handle real-time data processing at in-memory speed and deal with more long-term data processing through the underlying database.

Performance & latency: The In-Memory Computing system can handle the event processing before the data gets into the database. Or, another approach is to keep the last day (or days) in-memory and the rest of the data in the NoSQL database.

Mixed consistency model: The In-Memory Computing system is often built for extreme consistency, whereas NoSQL databases often work best with eventual consistency. Usually, the consistency requirements are more relevant at the front end of the system and become less relevant as the data gets older. The combined approach enables us to set our front end for extreme consistency and our back end for eventual consistency.

Deterministic behavior: In many cases, we must ensure that a given set of data can be served with consistent performance. Many databases use an LRU-based cache to optimize data access. The limitation of this approach is that the speed at which we can access our data becomes non-deterministic, as we often do not control which data is served through the database cache; in some cases we will get a fast response time if we hit the cache, while in other cases the same operation can take 10 times longer if we miss it. By splitting our in-memory data from our file-based storage, we get explicit control over which data is served at in-memory speed and which data is not, thus ensuring consistent access to that data.

Faster ETL: By front-ending our Big Data storage with In-Memory Computing we can also speed up the time it takes to pre-process and load data into our long-term data system. In this context, we can push filtering, validation, compression and other aspects of our data processing into memory before the data goes into our long-term databases.

Final Words

Big Data systems are complex beasts, and it is clear that the one-size-fits-all approach doesn't work.

On the other hand, having too many data systems increases the complexity of managing our Big Data system almost exponentially; our ability to ensure consistent behavior, data integrity and reliable synchronization across the various systems becomes an almost unmanageable task if done manually.

Adding real-time capabilities to our Big Data system is a classic area where the kind of integration described in this article is needed. The integration between in-memory and file-based approaches as two separate tiers also introduces additional areas of flexibility in how we can handle often contradictory requirements such as consistency, scale, latency, and cost. Instead of trying to come up with a least common denominator, we can optimize each tier for the area it fits best.

August 21, 2012

One of the challenges in processing data is that the speed at which we can input data is quite often much faster than the speed at which we can process it. This problem becomes even more pronounced in the context of Big Data, where the volume of data keeps on growing, along with a corresponding need for more insights, and thus the need for more complex processing also increases.

Batch Processing to the Rescue

Hadoop was designed to deal with this challenge in the following ways:

1. Use a distributed file system: This enables us to spread the load and grow our system as needed.

2. Optimize for write speed: To enable fast writes, the Hadoop architecture was designed so that writes are first logged and then processed, which makes write speeds fairly fast.

3. Use batch processing (Map/Reduce) to balance the speed for the data feeds with the processing speed.

Batch Processing Challenges

The challenge with batch-processing is that it assumes that the feeds come in bursts. If our data feeds come in on a continuous basis, the entire assumption and architecture behind batch processing starts to break down.

If we increase the batch window, the result is higher latency between the time the data comes in and the time we actually get it into our reports and insights. Moreover, the number of hours in a day is finite -- in many systems the batch window runs on a daily basis. Often, the assumption is that most of the processing can be done during off-peak hours. But as the volume gets bigger, the time it takes to process the data gets longer, until it reaches the limit of the hours in a day, and then we face a continuously growing backlog. In addition, if we experience a failure during processing we might not have enough time to re-process.

Speed Things Up Through Stream-Based Processing

The concept of stream-based processing is fairly simple. Instead of logging the data first and then processing it, we can process it as it comes in.

A good analogy for explaining the difference is a manufacturing pipeline. Think about a car manufacturing pipeline: compare a process in which all the raw parts arrive at the assembly line and are put together piece by piece, with a process in which each unit is pre-packaged at the manufacturer and only the pre-packaged parts are sent to the assembly line. Which method is faster?

Data processing is just like any pipeline. Putting stream-based processing at the front is analogous to pre-packaging our parts before they get to the assembly line, which in our case is the Hadoop batch processing system.

As in manufacturing, even if we pre-package the parts at the manufacturer we still need an assembly line to put all the parts together. In the same way, stream-based processing is not meant to replace our Hadoop system, but rather to reduce the amount of work that the system needs to deal with, and to make the work that does go into the Hadoop process easier, and thus faster, to process.

In-memory stream processing can make a good stream processing system; as Curt Monash points out in his research, traditional databases will eventually end up in RAM. An example of how this can work in the context of real-time analytics for Big Data is provided in this case study, where we demonstrate the processing of Twitter feeds using stream-based processing that then feeds a Big Data database serving the historical aggregated view, as described in the diagram below.
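
In code, the stream-based side of such a pipeline can be as simple as updating an in-memory aggregate as each event arrives, and periodically handing the already-summarized view to the long-term store. The event shape and the hashtag field are assumptions made for this sketch:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Instead of logging raw events and re-counting them in a nightly batch, each event
// updates an in-memory aggregate the moment it arrives.
public class StreamingAggregator {

    public record TweetEvent(String user, String hashtag) {}

    private final Map<String, Long> countsByHashtag = new ConcurrentHashMap<>();

    // Called once per incoming event, as it arrives.
    public void onEvent(TweetEvent event) {
        countsByHashtag.merge(event.hashtag(), 1L, Long::sum);
    }

    // The already-aggregated view can be pushed periodically to the Big Data store,
    // which now only has to absorb small summaries instead of every raw event.
    public Map<String, Long> snapshot() {
        return Map.copyOf(countsByHashtag);
    }
}
```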

Due to a lack of alternatives at the time, Map/Reduce is used in many Big Data systems today in areas where it wasn't a very good fit in the first place. A good example is using Map/Reduce to maintain a global search index. With Map/Reduce we basically rebuild the index on every run, where it would actually make more sense to update it with changes as they come in.

Google moved a large part of its index processing from Map/Reduce to a more real-time processing model, as noted in this recent post:

So, how does Google manage to make its search results increasingly real-time? By displacing GMR in favor of an incremental processing engine called Percolator. By dealing only with new, modified, or deleted documents and using secondary indices to efficiently catalog and query the resulting output, Google was able to dramatically decrease the time to value. As the authors of the Percolator paper write, ”[C]onverting the indexing system to an incremental system … reduced the average document processing latency by a factor of 100.” This means that new content on the Web could be indexed 100 times faster than possible using the MapReduce system!

..Some datasets simply never stop growing ..it is why trigger-based processing is now available in HBase, and it is a primary reason that Twitter Storm is gaining momentum for real-time processing of stream data.

Final Notes

We can make our Hadoop system run faster by pre-processing some of the work before it gets into Hadoop. We can also move the types of workload for which batch processing isn't a good fit out of the Hadoop Map/Reduce system and use stream processing instead, as Google did.

Interestingly enough, I recently found out that Twitter Storm now has an option to integrate an in-memory data store into Storm through the Trident-State project. The combination of the two makes a lot of sense and is something we're currently looking at, so stay tuned.

July 14, 2011

Lately, we've been talking to various clients about realtime analytics, and with convenient timing Todd Hoff wrote up how Facebook's realtime analytics system was designed and implemented (see my previous review on that here).

Their design rested on some assumptions about the reliability of in-memory systems and about database neutrality that affected what they did: for memory, that transactional memory was unreliable, and for the database, that HBase was the only targeted data store.

What if those assumptions are changed? We can see reliable transactional memory in the field, as a requirement for any in-memory data grid, and certainly there are more databases than HBase; given database and platform neutrality, and reliable transactional memory, how could you build a realtime analytics system?

Joseph Ottinger and I discussed this, and this is what we came up with.

A Summary of History

To understand what a new design might look like, it’s often useful to consider a previous design. This is a very short summary of Facebook’s realtime analytics system.

First, it’s based on a system of key/value pairs, where the key might be a URL and the value is a counter. Thus, there’s a requirement for atomic, transactional updates to a very simple piece of data. The difficulties come from scale, not from the focus of the system.

The process flow is fairly simple:

A user creates an event by performing some action on the website. This generates an AJAX request, sent to a service.

Scribe is used to write the events into logs, stored on HDFS.

PTail is used to consolidate the HDFS logs.

Puma takes the consolidated logs from PTail and stores them into HBase in groupings that represent roughly 1.5 seconds’ worth of events.

HBase serves as the long-term repository for analytics data.

There are some questions around how PTail and Puma serve as scaling agents, and some of the notes around their use suggest they are still limited in scale – for example, one of the concerns is that an in-memory hash table will fill up, which sounds like a fairly serious limitation to have to keep in mind.

A Potential for Improvement

There are lots of areas in which you can see potential improvements, if the assumptions are changed. As a contrast to Facebook's working system:

We can simplify the design. If memory can be seen as transactional - and it can - we can use the events without transforming them as they proceed along our analytics workflow. This makes our design much simpler to implement and test, and performance improves as well.

We can strengthen the design. With a polling semantic, such systems are brittle, relying on systems that pull data in order to generate realtime analytics data. We should be able to reduce the fragility of the system, even while making it faster.

We can strengthen the implementation. With batching subsystems, there are limits that shouldn’t exist. For example, one concern in Facebook's implementation is the use of an in-memory hash table that stores intermediate data; the in-memory aspect isn’t a concern until you realize that the batch sizes are chosen partially to make sure that this hash table doesn’t overflow available space.

We can allow deployments to change databases based on their requirements. There's nothing wrong with HBase, but it's got specific characteristics that aren't appropriate for all enterprises. We can design a system which you’d be able to deploy on various and flexible platforms, and we can migrate the underlying long-term data store to a different database if needed.

We can consolidate the analytics system so that management is easier and unified. While there are system management standards like SNMP that allow management events to be presented in the same way no matter the source, having so many different pieces means that managing the system requires an encompassing understanding, which makes maintenance and scaling more difficult.

What we want to do, then, is create a general model for an application that can accomplish the same goals as Facebook’s realtime analytics system, while leveraging the capabilities that in-memory data grids offer where available, potentially offering improvement in the areas of scalability, manageability, latency, platform neutrality, and simplicity, all while increasing ease of data access.

That sounds like quite a tall order, but it’s doable.

The key is to remember that at heart, realtime analytics represent an events system. Facebook’s entire architecture is designed to funnel events through various channels, such that they can safely and sequentially manage event updates.

Therefore, they receive a massive set of events that “look like” marbles, which they line up in single file; they then sort the marbles by color, you might say, and for each color they create a bundle of sticks; the sticks are lit on fire, and when the heat goes up past a certain temperature, steam is generated, which turns a turbine.

It’s a real-life Rube Goldberg machine, which is admirable in that it works, but much of it is still unnecessary if the assumptions about memory ("unreliable") and database ("HBase is the only target that counts") are changed. Looking at the analogy from the previous paragraph, there’s no need to change a marble into anything. The marble is enough.

A Plan for Implementation

Our design for implementation is built around putting data and messaging together. A data grid is a perfect mechanism for this, as long as it provides some basic features: transactional operations, push and pull semantics, and data partitioning.

A data grid does provide those basic features, or else it's not really much of a data grid; it'd be more of a cache otherwise.

With a data grid, then, the events come in as individual messages. When the user chooses an operation on the web site, an asynchronous operation would write the event, just as Facebook does today. However, instead of filtering and batching the events into various forms, the events are dispatched to waiting processes that perform many transactional updates in parallel.

There’s a danger that those updates might be slower than the generated events, if each event is processed sequentially. That said, this isn’t as much a problem as one might think; if data partitioning is used, then event handlers can receive partitioned events, which localizes updates and speeds them up dramatically.

In fact, you can still use batching to process events as a group; since the events would be partitioned coming in, the batch process would still be updating local data very quickly, which would be faster than individual event processing, even while retaining simplicity.
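
A rough sketch of this partitioned, batched event handling follows. Events are routed to a partition by key, each partition owns its own counters, and updates are applied locally in small batches; the types and sizes are illustrative only:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.*;

// No cross-partition coordination is needed, which is what keeps the updates fast.
public class PartitionedCounters {

    private final int partitionCount;
    private final List<BlockingQueue<String>> queues = new ArrayList<>();
    private final List<Map<String, Long>> counters = new ArrayList<>();
    private final ExecutorService workers;

    public PartitionedCounters(int partitionCount) {
        this.partitionCount = partitionCount;
        this.workers = Executors.newFixedThreadPool(partitionCount);
        for (int i = 0; i < partitionCount; i++) {
            queues.add(new LinkedBlockingQueue<>());
            counters.add(new ConcurrentHashMap<>());
            final int partition = i;
            workers.submit(() -> drain(partition));
        }
    }

    // Producer side: route the event (e.g. a URL that was "liked") to its partition.
    public void onEvent(String key) {
        int partition = Math.floorMod(key.hashCode(), partitionCount);
        queues.get(partition).add(key);
    }

    // Consumer side: each partition applies its own events in small local batches.
    private void drain(int partition) {
        List<String> batch = new ArrayList<>();
        BlockingQueue<String> queue = queues.get(partition);
        Map<String, Long> local = counters.get(partition);
        try {
            while (!Thread.currentThread().isInterrupted()) {
                batch.add(queue.take());
                queue.drainTo(batch, 999);          // up to 1000 events per batch
                for (String key : batch) {
                    local.merge(key, 1L, Long::sum);
                }
                batch.clear();
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    public long count(String key) {
        int partition = Math.floorMod(key.hashCode(), partitionCount);
        return counters.get(partition).getOrDefault(key, 0L);
    }
}
```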

With this design, there is no overflow condition, because a system that’s designed to scale in and out as most data grids are will repartition to maintain even usage. If a data grid can’t provide this feature intrinsically, of course some management will be necessary, but finding data grids with this feature isn’t very difficult.

One other advantage of data grids is in write-through support. With write-through, updates to the data grid are written asynchronously to a backend data store – which could be HBase (as used by Facebook), Cassandra, a relational database such as MySQL, or any other data medium you choose for long-term storage, should you need that.

The memory system and the database - the external data store - work together. The in-memory solution is ideal for the realtime aspects, the events that affect now. The external data storage solution is designed to handle long-term data, for which speed is not as much of an issue.

A Discussion of Strengths

The key concept here is that event handling is the lever that can move the realtime analytics mountain. By providing a simple, scalable publisher/subscriber model, you simplify design; by using a platform that supports data partitioning, transactional updates, and write through capabilities, you gain scalability.

The data grid’s flexible query API means that applications can react literally the moment data becomes available.

For a call center, for example, you want to immediately identify signals that show that the caller should be handled differently; imagine an ecommerce site that was able to determine immediately if a user was losing interest, and thus could respond appropriately, before the customer moves on.

With external processes and a long funnel for data, immediate-response capabilities are very difficult to implement, not just because of latency but because the data transformations tend to homogenize the data, instead of allowing rich expressions and flexible event types.

The data grid also has much richer support in terms of client applications. Instead of applications going through an API that focuses on a specific phase of the data’s life (for example, an API focused on HBase), you can focus on a generic API that can capture events at any point in their lifecycle, and from anywhere. An external monitoring process, then, can have the same immediate, partition-aware access to data that the integrated message-handling system does; adding features and analysis is just a matter of connecting a client to the data grid.

Here we have a quick demo that shows much of this in motion. We have a market analysis application, deployed into GigaSpaces XAP via our new cloud deployment tool, Cloudify; it uses an event-driven system to display realtime data, with write-through to Cassandra on the back end. The design is very simple, demonstrates the principles we've discussed here - and can scale up and down depending on demand.

Final words

Todd Hoff (HighScalability) and Alex Himel (Facebook) provided a fairly detailed description on their solution and even more importantly they even shared the rationales that made them do things in certain ways.

One main difference in assumptions that led to the different implementation strategies is in the use of reliable memory for event processing, and in the use of passive data storage.

Another difference is that we had to think of the solution as an easily cloneable one, and therefore a lot of attention was paid to the simplicity of the runtime, packaging and management of the solution.

Yet another difference is that we couldn’t settle on a specific database, as there isn’t a "one size fits all" solution – for certain customers, SQL would still be the preferred choice, and the fact that we can buffer writes to the database gives them more headroom while still allowing them to scale on writes.

I hope that this will lead to a constructive dialogue on the various tradeoffs, which will serve the entire industry...

In this first post, I’d like to summarize the case study and consider some things that weren't mentioned in the summaries. This will lead to an architecture for building your own Realtime Analytics for Big Data that might be easier to implement, using Facebook's experience as a starting point and guide, as well as the experience gathered through recent work with a few of GigaSpaces' customers. The second post provides a summary of that new approach as well as a pattern and a demo for building your own Real Time Analytics system.

The Business Drive for real time analytics: Time is money

The main drive for many of the innovations around realtime analytics has to do with competitiveness and cost, just as with most other advances.

For example, during the past few years financial organizations realized that intra-day risk analysis of their customers' portfolios translated to increased profit, as they could react faster to profit and loss events.

The same applies to many online ecommerce and social sites. Knowing what your users are doing on your site in real time and matching what they do with more targeted information translates into better conversion rates and better user satisfaction, which means more money in the end.

Todd provides similar reasoning to describe the motivation behind Facebook's new system:

Content producers can see what people like, which will enable content producers to generate more of what people like, which raises the content quality of the web, which gives users a better Facebook experience.

Why now?

Real time analytics goes mainstream

The massive transition to online and social applications makes it possible to track user patterns like never before. The quality of the data that providers track and their business success are closely related: for example, e-commerce customers want to know what their friends think about products or services, right in the middle of their shopping experience. If sites cannot keep up with their thousands of users in real time, they can lose their customers to sites that can.

So while risk analytics in the financial industry remained a fairly small niche of the analytics market, the demand from social, eCommerce and SaaS applications brought real-time analytics to mainstream businesses operating under massive load.

No one has time for batch processing anymore.

Technology advancement

Newer infrastructures and technologies like tera-scale memory, NoSQL, parallel processing platforms, and cloud computing provide new ways to process massive amounts of data in a shorter time and at lower cost. As most of the current analytics systems weren’t built to take advantage of these new technologies and capabilities, they haven't been able to adapt to real-time requirements without massive changes.

Hadoop Map/Reduce doesn’t fit the real time business

One of the hottest trends in the analytics space is the use of Hadoop as almost the de facto standard for many batch processing analytics applications. While Hadoop (and Map/Reduce in general) does a pretty good job of processing massive logs and data through parallel batch processing, it wasn’t designed to serve the real-time part of the business.

Strong evidence for that can be seen in the moves of those who were known as the “poster children” for Map/Reduce: Google and Yahoo have both moved away from it. Google has moved to Google Percolator for its indexing service, and Yahoo came up with a new service, S4, designed specifically for real time processing.

It is therefore not surprising that Facebook reached the same conclusion as it relates to Hadoop.

Facebook's Real Time Analytics system

According to Todd, Facebook evaluated a fairly long list of alternatives, including MySQL, an in-memory cache, and Hadoop Hive/Map Reduce. I highly recommend reading the full details in Todd's post.

I tried to outline Facebook's architecture, based on Todd's summary, in the diagram below:

Every user activity triggers an asynchronous event through AJAX. The event is logged to a tail log using Scribe. Ptail is used to aggregate the individual logs into a consolidated stream. The stream is batched into 1.5-second groupings by Puma, which stores each event batch into HBase. The real time logs are kept for a certain period of time and then cleared from the system.

Obviously this description is a fairly simplistic view – the full details are provided in the original post.

Evaluating the Facebook Architecture

Facebook's reasoning behind their technology evaluation seems sound for the most part, although there are some obvious concerns.

There were two things that caught my attention:

Memory Counters

(Facebook) felt in-memory counters, for reasons not explained, weren't as accurate as other approaches. Even a 1% failure rate would be unacceptable. Analytics drive money so the counters have to be highly accurate.

It sounds to me like the evaluation was based on memcached, which by default is not highly available; a failure would result in loss of data. Obviously, that doesn’t apply to other memory-based solutions such as In-Memory Data Grids (for example, GigaSpaces and Coherence were both designed for high resiliency).

Cassandra vs HBase

The choice of HBase over Cassandra was also very interesting, mainly since Cassandra was developed by another Facebook team to address write scalability. The choice had to do with the difference in write rates between the two alternatives at the time of the evaluation:

HBase seemed a better solution based on availability and the write rate. Write rate was the huge bottleneck being solved.

Eric Hauser posted a comment on this analysis which seems to indicate that this issue has since been addressed in Cassandra:

When Facebook engineers started the project 6 months ago, Cassandra did not have distributed counters, which are now committed in trunk. Twitter is making a large investment in Cassandra for real-time analytics (see Rainbird). Write rate should be less of a bottleneck for Cassandra now that counter writes are spread out across multiple hosts. For HBase, every counter is still bound by the performance of a single region server? A performance comparison of the two would be interesting.

Eric's comment is indicative of how dynamic the NoSQL space is. I’d be interested in how different the technology selection would be now.

The Architecture

There are a few common principles that drive the architecture for this type of system:

Event logging needs to be extremely fast, to minimize the latency impact on the site.

Events need to be reliable, otherwise the entire system's accuracy is questionable and the data is devalued.

The real time data in the form of logs is kept for a certain period of time (x hours or x days)

Write scalability is key.

Post-processing can happen in batches; the size of the batch depends on how *real-time* we need this data to be.

Writes to a backend database need to be done asynchronously.

Facebook seems to follow these same principles in their architecture while keeping the system running at scale at a fairly impressive rate.
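To make these principles a bit more concrete, here is a minimal sketch in plain Java of cheap, non-blocking event logging combined with periodic, asynchronous batch writes. This is an illustration of the principles only, not Facebook's actual Scribe/Ptail/Puma code; the EventStore interface and the flush interval are assumptions introduced for the example.

```java
import java.util.Map;
import java.util.concurrent.*;

// Sketch: the request path only enqueues; a background task drains the queue
// on a fixed interval, aggregates counts, and writes one batch asynchronously.
public class BatchedCounterLogger {

    public interface EventStore {              // hypothetical backend (e.g. HBase, an IMDG)
        void incrementAll(Map<String, Long> countsByKey);
    }

    private final BlockingQueue<String> events = new LinkedBlockingQueue<>();
    private final ScheduledExecutorService flusher = Executors.newSingleThreadScheduledExecutor();
    private final EventStore store;

    public BatchedCounterLogger(EventStore store, long flushMillis) {
        this.store = store;
        flusher.scheduleAtFixedRate(this::flush, flushMillis, flushMillis, TimeUnit.MILLISECONDS);
    }

    /** Called on the request path: must be fast, so it only enqueues the event key. */
    public void log(String eventKey) {
        events.offer(eventKey);
    }

    /** Runs in the background: drains the queue, aggregates counts, writes one batch. */
    private void flush() {
        Map<String, Long> counts = new ConcurrentHashMap<>();
        String key;
        while ((key = events.poll()) != null) {
            counts.merge(key, 1L, Long::sum);
        }
        if (!counts.isEmpty()) {
            store.incrementAll(counts);        // single asynchronous batch write
        }
    }

    public void shutdown() {
        flusher.shutdown();
        flush();                               // final drain on shutdown
    }
}
```

The flush interval plays the same role as Puma's 1.5-second batching window: a longer window means fewer, larger writes at the price of less "real-time" data.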

There are a few questions that are still open as I don’t have full visibility into their system:

Are Ptail and Puma centralized components? If so, don't they pose a potential bottleneck? Based on Todd's summary and Alex's presentation, it seems that the way Facebook scaled their system is by splitting the Ptail stream by categories of events, so that each type can be handled by a different data center.

Puma batches event logs in memory before it writes them into HBase – what happens if Puma fails before the batch is written?

The solution seems to be limited to handling simple counters. This seems a fairly severe limitation, as many systems need to produce more complex relationships even during the real time part of the system, as also indicated in the planned future enhancements:

"(We need to know) how many people across a time window liked a URL. Easy to do in MapReduce, hard to do with a naive counter solution.”

At one point, it is noted that Facebook chose not to rely on memory for counters. However, throughout the description it seems that there is still a strong reliance on keeping data within memory boundaries:

“(We) write extremely lean log lines. The more compact the log lines the more can be stored in memory..”

“(We) batch for 1.5 seconds on average. Would like to batch longer but they have so many URLs that they run out of memory when creating a hashtable”

Looking backward – wouldn’t it be better to store the data in-memory in the first place? Why add the extra architecture components, if you're able to make memory work for you? This is a crucial question, of course, because it focuses attention on memory availability. As mentioned, though, there are in-memory data grids that are designed for just this kind of situation.

It is noted that Puma writes its batches to HBase sequentially:

“(We) wait for last flush to complete before starting new batch to avoid lock contention issues...”

What if this rate is lower than the actual write rate?

Realtime.Analytics.next

Interestingly enough, I was asked to sketch a solution for a similar challenge in a voice recording system: collect lots of data from various sources and process it in real time. The good news is that it's doable, as shown by Facebook's success.

The better news is that it's actually fairly easy. We can add rich query capabilities, elastic scaling, database and platform neutrality, evolution of data, and more without making things unnecessarily difficult. It's not so much that this is the next generation of realtime analytics as that Map/Reduce and the HBase approach used by Facebook are the previous generation.

During the event, Ronnie Bodinger, Head of IT at Avanza Bank AB, gave an excellent talk on how they turned their existing online banking application into a new site that was designed for read/write scaling.

Avanza's System Description:

Avanza Bank is a Swedish bank that makes it easy for investors to make equity transactions and fund switches. It runs the most trades on the Stockholm Stock Exchange.

It prides itself on providing advanced tools for its investors through its online banking system.

The current online system is a typical web site based on Java/JSP and Spring.

Scaling architecture of the existing site

Most of the interactions with the current site are reads, so the main scaling challenge was handling concurrent read operations. Read scaling was addressed through a side-cache architecture, common in many existing LAMP + Memcached deployments, where the first query hits the database and subsequent queries hit the cache.
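As an illustration of that read-scaling approach, below is a small cache-aside (side-cache) sketch in Java. The Database interface is a hypothetical stand-in for the relational back end; it is the pattern that matters, not any specific product API.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

// Sketch of the side-cache pattern: the first read for a key misses the cache
// and hits the database, later reads for that key are served from memory.
public class SideCache<K, V> {

    public interface Database<K, V> {
        V load(K key);                         // assumed expensive read from the relational store
    }

    private final ConcurrentMap<K, V> cache = new ConcurrentHashMap<>();
    private final Database<K, V> database;

    public SideCache(Database<K, V> database) {
        this.database = database;
    }

    public V get(K key) {
        // computeIfAbsent: only the first reader for a key pays the database round trip
        return cache.computeIfAbsent(key, database::load);
    }

    public void invalidate(K key) {
        // on every write the cached entry must be evicted or refreshed, which is
        // exactly where the pattern starts to struggle once writes dominate
        cache.remove(key);
    }
}
```

The invalidation path is the weak spot: as the write rate grows, the cache spends more time being refreshed than serving reads, which is the write-scaling challenge described below.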

The New System

The new site was designed to fit the real-time and social era. This means that a lot of the traffic and activity is now generated by user actions, not just by the site owner. These activities need to be presented to the bank's users in real time.

Challenges

The changes in the new site led to a significant change in traffic and load behavior, which drives a new class of scaling challenges:

Write scaling

In cases where lots of updates are involved, the existing side-cache architecture yields diminishing returns: the cache becomes stale quickly, so synchronizing it only adds overhead.

Using Oracle RAC on a high-end hardware platform didn’t prove itself either; it yielded a fairly expensive solution that still didn’t meet the scaling requirements.

Unlike “Green Field” applications, Avanza has an existing online application (a “Brown Field”) that serves its current customers. That brings the following additional challenges:

Existing Data Model

The entire data model of the application was designed for a relational model; changing the data model or moving it to a new NoSQL architecture, as was considered, would involve a huge change that could turn into a potentially years-long effort.

Legacy system

The online bank application consists of a large set of legacy applications and third-party services. Rewriting the existing infrastructure is either impossible (due to the dependency on third-party tools) or impractical.

Complex environment

As is often the case, a large portion of the legacy applications weren’t designed for scale and weren’t built with a clear, holistic architecture, as they were built up in layers over the years. This increases the complexity of scaling considerably.

Existing Skillset

The existing development team already had fairly good knowledge of Java and Java EE. Changing the team and/or developing a completely new skillset is a huge barrier as the ramp-up time required to bring new developers up to speed with the complexity of the system can take years.

The solution: Read/write scale without complete re-write

It was clear that meeting the new scaling challenges would involve changes to the existing application – the main question was how to scope those changes so that they wouldn’t require a complete re-write. The second goal was to make the change in a way that would reduce the TCO of the current system.

To achieve those two goals the following approach was used:

Minimize the change by clearly identifying the scalability hotspots

The areas of the application that need intensive write access are often a small part of the overall system. The first step is therefore to limit the change to the hot spots of the application and keep the rest untouched. In Avanza's case, the hot spots were identified as certain tables used by the online web application. Most of the backend systems still access the database for reporting, synchronization, and batch processing, and could therefore remain unchanged.

Keep the database as is

One of the key pieces of the current design is the ability to address read/write scalability outside of the database context (see the next bullet). This makes it possible to keep the existing database and data schema unchanged. That way, all the other systems continue to work with the database as if nothing had changed.

Put an In Memory Data Grid as a front end to the database

Scaling the application is done by front-ending it with an In-Memory Data Grid (IMDG). The IMDG holds all the hot tables or rows of the original database, and the online web application accesses the IMDG instead of the database. The IMDG is distributed in nature, allowing it to scale by spreading the load over a cluster of machines for both read and write operations.
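The toy sketch below illustrates the generic idea behind that distribution: every entry is routed to a partition by its key, so both reads and writes spread across the cluster. It is a simplified model of partition routing in general, not GigaSpaces' (or any vendor's) actual implementation.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Sketch: hash-based routing of entries to partitions, the mechanism that lets a
// distributed grid scale reads and writes by adding machines.
public class PartitionedGrid<K, V> {

    /** A single in-memory partition; in production each would live on its own machine. */
    static final class GridPartition<K, V> {
        private final Map<K, V> data = new ConcurrentHashMap<>();
        V get(K key)           { return data.get(key); }
        void put(K key, V val) { data.put(key, val); }
    }

    private final List<GridPartition<K, V>> partitions = new ArrayList<>();

    public PartitionedGrid(int partitionCount) {
        for (int i = 0; i < partitionCount; i++) {
            partitions.add(new GridPartition<>());
        }
    }

    /** Route by key hash: a given key always lands on the same partition. */
    private GridPartition<K, V> route(K key) {
        int idx = Math.floorMod(key.hashCode(), partitions.size());
        return partitions.get(idx);
    }

    public V read(K key)          { return route(key).get(key); }
    public void write(K key, V v) { route(key).put(key, v); }
}
```

Choosing a routing key that groups related data (for example, an account id) keeps related reads and writes on one partition and avoids cross-partition traffic.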

Use write-behind to reduce the synchronization overhead

Updates from the IMDG to the underlying database are done asynchronously, in batches, through an internal queuing mechanism (the redo log).
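A minimal sketch of the write-behind idea follows: the online path updates the grid and only enqueues the change, while a background worker drains the queue and applies the changes to the database in batches. The AccountUpdate class, SQL statement, and plain in-process queue are hypothetical placeholders; GigaSpaces' actual redo-log and mirror machinery are considerably more involved.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// Sketch: write-behind from an in-memory front end to the database using JDBC batching.
public class WriteBehindWorker implements Runnable {

    public static final class AccountUpdate {
        final long accountId;
        final double balance;
        public AccountUpdate(long accountId, double balance) {
            this.accountId = accountId;
            this.balance = balance;
        }
    }

    private final BlockingQueue<AccountUpdate> redoLog = new LinkedBlockingQueue<>();
    private final Connection connection;
    private final int batchSize;

    public WriteBehindWorker(Connection connection, int batchSize) {
        this.connection = connection;
        this.batchSize = batchSize;
    }

    /** Called by the grid-facing code after the in-memory update has completed. */
    public void enqueue(AccountUpdate update) {
        redoLog.offer(update);
    }

    @Override
    public void run() {
        try (PreparedStatement stmt = connection.prepareStatement(
                "UPDATE account SET balance = ? WHERE id = ?")) {
            while (!Thread.currentThread().isInterrupted()) {
                // Block for the first update, then drain up to batchSize - 1 more.
                AccountUpdate first = redoLog.take();
                stmt.setDouble(1, first.balance);
                stmt.setLong(2, first.accountId);
                stmt.addBatch();
                for (int i = 1; i < batchSize; i++) {
                    AccountUpdate next = redoLog.poll();
                    if (next == null) break;
                    stmt.setDouble(1, next.balance);
                    stmt.setLong(2, next.accountId);
                    stmt.addBatch();
                }
                stmt.executeBatch();    // one database round trip per batch instead of one per update
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        } catch (SQLException e) {
            throw new RuntimeException("write-behind flush failed", e);
        }
    }
}
```

Because the database only sees batched updates, peak write load is smoothed out and the online response time no longer depends on database latency.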

Use O/R mapping to map the data back into its original format

In many cases, to achieve the best scalability we need to partition the data, which often involves changes to the data schema. Changing the data schema could break the entire system, including the areas that don’t suffer from the scalability bottleneck. To handle this impedance mismatch, we scope the data schema changes to the IMDG only; the data is mapped from the IMDG schema back to the original schema through standard O/R mapping tools such as Hibernate or OpenJPA.
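As an illustration, a grid-side object can be mapped back onto the original table with standard JPA annotations, roughly as sketched below. The Position class and the table and column names are hypothetical, not Avanza's actual schema.

```java
import javax.persistence.Column;
import javax.persistence.Entity;
import javax.persistence.Id;
import javax.persistence.Table;

// Sketch: inside the grid the object may be partitioned and denormalized, but the
// O/R mapping flushes it back to the pre-existing relational table unchanged.
@Entity
@Table(name = "POSITION")           // the original table stays exactly as it was
public class Position {

    @Id
    @Column(name = "POSITION_ID")
    private long id;

    @Column(name = "ACCOUNT_ID")    // also a natural grid routing key, so all positions
    private long accountId;         // of one account land on the same partition

    @Column(name = "QUANTITY")
    private int quantity;

    protected Position() {}         // no-arg constructor required by JPA

    public Position(long id, long accountId, int quantity) {
        this.id = id;
        this.accountId = accountId;
        this.quantity = quantity;
    }

    public long getId()        { return id; }
    public long getAccountId() { return accountId; }
    public int getQuantity()   { return quantity; }
}
```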

Use standard Java API and framework to leverage existing skillset

One of the challenges with many of the new NoSQL database alternatives is that they often force a complete re-architecture. This comes with a fairly high cost of rebuilding skillsets across the organization, both for developing against new APIs and for maintaining capacity and sizing of those new databases.

IMDGs such as GigaSpaces expose standard APIs, such as JPA. In addition, they allow organizations to extend the use of their existing database while removing a large part of the read/write load. Both the use of standard APIs and of the existing database enables organizations to leverage their existing skillset and still meet their scalability requirements. It also enables a smoother transition (through baby steps) into a completely new scale-out architecture, by allowing a new scalable database to be plugged in at a later stage.
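A small usage sketch of that point: the web tier keeps coding against plain JPA regardless of whether the persistence unit is backed by the grid or by the database. The persistence-unit name and the Position entity are the hypothetical ones from the previous sketch.

```java
import javax.persistence.EntityManager;
import javax.persistence.EntityManagerFactory;
import javax.persistence.Persistence;

// Sketch: standard JPA code; only the persistence-unit configuration decides
// whether it talks to the IMDG or directly to the relational database.
public class PositionDao {

    private final EntityManagerFactory emf =
            Persistence.createEntityManagerFactory("trading-grid");   // assumed unit name

    public Position find(long positionId) {
        EntityManager em = emf.createEntityManager();
        try {
            return em.find(Position.class, positionId);
        } finally {
            em.close();
        }
    }

    public void save(Position position) {
        EntityManager em = emf.createEntityManager();
        try {
            em.getTransaction().begin();
            em.merge(position);          // grid is updated now; the database follows via write-behind
            em.getTransaction().commit();
        } finally {
            em.close();
        }
    }
}
```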

Use two parallel (old/new) sites to enable gradual transition

Switching all customers to a new system at once is often a bad idea. A better approach is to enable a gradual transition of selected customers to the new site. A common model for achieving that is to run two parallel sites; the challenge in doing so is keeping the two sites synchronized. In Avanza's case, they used the GigaSpaces Mirror service to sync all the changes from one site to the other and in that way keep the two sites up to date.

The diagram below provides a visual summary of this approach:

The TCO angle

The second goal in the project was to reduce the TCO of the current system.

This is achieved in the following way:

Use RAM for high performance access and disk for long term storage

As I noted in a recent post, a RAM-based solution can be 10x–100x cheaper than a disk-based solution for high-performance applications.

In addition, the price of RAM continues to drop.

1GB can cost as little as $1.9 a month; storing 1TB of data entirely in RAM can fit into a single rack at a total cost of roughly $1.8k per month.

The optimal solution is therefore to use RAM for data that needs high-speed read/write access, and disk-based storage for the long-term data that is accessed less frequently.

Use commodity database and hardware – A single instance of an Oracle RAC deployment can reach $500k. Putting a data grid in front of the database removes the need for many of the high-end features of the Oracle RAC database. It also removes the need for high-end hardware such as storage devices, InfiniBand networks, etc.

In this specific case, it was possible to move the data into MySQL and run the entire relational data system on commodity Dell machines.

Final words

Many existing applications are built from layers upon layers of development, with relational databases often sitting at their heart. Scaling these systems is therefore an extremely challenging task. This leads a lot of organizations to take the easy route and simply pay more for high-end hardware and databases. We have reached the point where this approach doesn’t cut it in many cases - it's simply too expensive to maintain in the face of cheaper, more scalable alternatives.

In a world where the impact of software accretion is no longer tolerable, it's clear that a change is inevitable if we're to meet the new demands for scalability. The real question is how to make that change. The approach taken by Avanza Bank in their use of GigaSpaces is an excellent reference: it shows that they were not only able to meet the new scaling requirements fairly quickly, through measurable and small changes, but they also reduced their cost of ownership significantly, as noted recently in Ronnie Bodinger's statement: