Cassandra

April 07, 2013

Real-time processing is becoming more and more common in Big Data systems.

One of the most common frameworks used for running real-time Big Data systems is Twitter Storm backed by a NoSQL database, as shown in the diagram below:

In this architecture, Storm is used for processing data as it comes in and the NoSQL data store is used as the output for that processing as well as for reference data storage.

Challenges

This architecture presents a couple of challenges:

Performance - Storm runs in-memory, and can therefore process large volumes of data at in-memory speed. However, a typical Storm architecture needs to interface with a data source for its input and a data source for its output. In this context, the overall performance is determined not by Storm itself but by the data input and output sources. Quite often these interfaces rely on file-based message queues for streaming, and NoSQL for data storage. These interfaces are at least an order of magnitude slower than Storm itself, and therefore become the limiting factor.

Complexity - Storm itself consists of several moving parts, such as a coordinator (ZooKeeper), state manager (Nimbus), and processing nodes (Supervisor). The NoSQL data store also comes with its own cluster management. In addition, a typical Big Data system comes with more components such as the application front end, a reporting database, and more. This makes the process of managing the deployment, configuration, fail-over and scaling of such a system quite complex.

Meeting the Performance Challenge

Given that Storm itself runs in memory, it only makes sense that in order for Storm to run at maximum capacity the streaming and data store interfaces should be implemented in-memory as well, as shown below:

As you can see in the above architecture, we added two interfaces that rely on a built-in Storm plug-in - one for data input and the other for data output. In both cases, the underlying implementation is memory-based, thus removing the impedance mismatch of the previous architecture.

As most analytics applications tend to be read-mostly, we can speed up access to the NoSQL data store using an in-memory local cache. This architecture speeds up read performance by a factor of 10 to 1,000, as outlined by Shay Hassidim in his post on real-time Big Data performance.
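To make the read-mostly pattern concrete, here is a minimal sketch of a read-through local cache in front of a slower store. The names (`ReadThroughCache`, `backing_store`) are illustrative, not part of any specific product API; the backing dictionary stands in for the NoSQL database.

```python
class ReadThroughCache:
    """A minimal local read-through cache in front of a slower store."""

    def __init__(self, backing_store):
        self.backing_store = backing_store  # stands in for the NoSQL store
        self.cache = {}

    def get(self, key):
        if key in self.cache:                # fast path: served from memory
            return self.cache[key]
        value = self.backing_store[key]      # slow path: hit the NoSQL store
        self.cache[key] = value              # populate for subsequent reads
        return value

store = {"user:1": {"name": "alice"}}
cache = ReadThroughCache(store)
assert cache.get("user:1") == {"name": "alice"}  # first read goes to the store
assert "user:1" in cache.cache                   # later reads are local
```

Since most reads after the first are served from local memory, the NoSQL store is only touched on cache misses, which is where the read speedup comes from.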

Meeting the Complexity Challenge

To meet the complexity challenge, we will use a DevOps automation approach using Cloudify in conjunction with Chef or Puppet.

With this approach, we wrap every service with a deployment recipe that abstracts the underlying details of how to manage Storm, Cassandra, and our in-memory data store. Cloudify uses these recipes to automate the deployment of the entire stack. In this way, you only interact with Cloudify for the deployment, configuration, scaling, and fail-over of your stack, rather than managing each individual component separately. In addition, we use Cloudify to abstract the underlying infrastructure, which enables us to use the same deployment recipe across different environments: one for testing, another for production, and so on. We can also use the same approach to deploy our apps based on the type of workload: a public cloud for sporadic workloads, leveraging the elasticity of the cloud to create and scale the environment as needed and tear it down completely when done, or bare-metal machines for I/O-intensive workloads, and so on.

You can read more on how to set up Storm using Cloudify in DeWayne's post on Storm and the cloud.

Cloudify comes with a rich set of built-in recipes for other Big Data services as well, making the integration process an out-of-the-box experience. The main ones are listed below:

Final Notes - Optimizing without Code Change

Many real-time Big Data systems are now based on Twitter Storm and a NoSQL data store such as Cassandra.

In this post I tried to outline how we can optimize this architecture by addressing two areas: performance, and management.

The good news is that all this is possible to achieve seamlessly, without any code change. Let me explain:

In the case of Storm we used built-in plug-ins - Spout for streaming and Trident as the data store interface. In this way we can simply plug our memory-based plug-ins under these two integration points. All our existing Storm business logic would work pretty much the same.

We use the same plug-in approach to integrate our in-memory data store with our NoSQL data store. This integration makes the data flow between our real-time streaming and our NoSQL storage fairly transparent. In addition, it allows us to plug in different NoSQL data stores such as MongoDB, Couchbase, etc., giving us another degree of flexibility.

The same applies to our management layer. Existing Storm and NoSQL deployments are wrapped in pluggable recipes, and don't require any specific code changes.

Not Mutually Exclusive

We don't have to implement all of the optimizations to gain the benefit of this architecture. Each of the optimization points can be plugged in independently of the others. For example, if your main pain point is complexity, you can start by adding the DevOps automation alone. Similarly, if your main pain point is performance, you can use the memory-based plug-ins to speed up processing.

The other advantage of the combined architecture that I haven't discussed in this post is that it provides more flexibility in the degrees of consistency and availability with which you configure your system, as I outlined in one of my recent posts on the subject, In Memory Computing (Data Grid) for Big Data.

January 01, 2013

Memory-based databases and caching products have been available for over a decade. However, so far they have occupied a fairly small niche of the data management solution market.

There have been multiple advances in the industry in both hardware and software architecture, which makes memory-based computing more relevant today than in the past, as outlined in the diagram below.

In a nutshell, new classes of hardware with 64-bit CPUs can now hold 2TB of RAM on a single device. In addition, the advances in software architecture and solutions toward distributed architecture and cloud make it easier to utilize these new hardware capabilities.

In-Memory Computing

In many ways, In-Memory Computing is a close relative of In-Memory Databases. As with many databases, it was designed to enable all the data management aspects that are often expected from traditional databases, such as queries and transactions, with the difference that the data is managed on RAM devices and not disks, and thus comes with potentially 1,000x better performance and latency according to various benchmarks.

The main differences between traditional in-memory databases and In-Memory Computing are that In-Memory Computing is:

1. Designed for distributed and elastic environments

2. Designed for In-Memory data processing

Executing the code where the data is:

The fact that we can store our data in the same address space as our application is the biggest gain.

Unlike disk and even flash disk devices, we can access our data by reference and thus perform complex data manipulation without any serialization/deserialization overhead. With the new class of managed languages such as Java, JavaScript, JRuby, and Scala, it is also significantly easier to pass complex logic over the wire and execute it on a remote device.

In-Memory Computing relies heavily on that capability, and exposes a new class of complex data processing capabilities that fits well with the distributed nature of the data through real-time map/reduce and stream-based processing as a core element of its architecture.
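The idea of shipping the code to the data, rather than the data to the code, can be sketched in a few lines. This is a toy illustration, not any product's API: `Partition` stands in for a node that holds a slice of the data in memory and accepts arbitrary tasks to run locally against it.

```python
class Partition:
    """Data and code share an address space: we pass a function to the
    node holding the data instead of moving the data over the wire."""

    def __init__(self, rows):
        self.rows = rows  # data held in memory on this node

    def execute(self, task):
        # Run the task where the data lives; the rows are accessed by
        # reference, with no serialization of the data itself.
        return task(self.rows)

p = Partition([3, 1, 4])
assert p.execute(lambda rows: sum(rows)) == 8  # only the result crosses the wire
assert p.execute(max) == 4
```

In a real distributed map/reduce or stream-processing setup, only the (small) task and its (small) result cross the network, while the (large) data stays put.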

The Big Data Context

According to our recent survey, more than 70 percent of the respondents said their business requires real-time processing of big data -- either in large volumes, at high velocity, or both.

Interestingly enough, another survey by Ventana Research indicated that one of the biggest technical challenges in Big Data is the lack of real-time capabilities (67%). The report also indicated that many of the organizations are planning to use in-memory databases (40%) as part of their Big Data stack. This places in-memory databases as the second most popular choice, ahead of specialized DBMSs (33%) and Hadoop (32%). One of the conclusions this survey leads to is that organizations see Data Warehouse Appliances and In-Memory Databases as their first choices to deal with the lack of real-time capabilities.

No One-Size-Fits-All Solution

While In-Memory Databases fit well in the planned Big Data stack, it is clear that there's no one-size-fits-all solution. The Big Data stack is going to be based on a blend of various technologies, each covering different aspects of the challenges of Big Data, from batch to real-time, from vertical to horizontal solutions, etc.

The question is: How do we integrate them all, without adding even more complexity to an already complex system?

In this post I will focus specifically on one of the approaches we used for combining In-Memory Computing with other Big Data solutions, such as Hadoop and Cassandra.

Putting In-Memory Computing Together with a NoSQL DB

One of the main motivations to integrate in-memory-based solutions with a NoSQL DB is to reduce the cost per GB of data.

Keeping our entire data set purely in memory can be too costly, especially for data that we're not going to access frequently.

There are various approaches to doing this -- the approach we found most useful is a two-tier approach.

With the two-tier approach, the In-Memory Computing system runs separately from the NoSQL database, which acts as the long-term storage.

The Challenge

The main challenge with this approach is the complexity that is associated with synchronizing two separate data systems. Specifically, how to ensure that data that is written into the front-end In-Memory Computing engine gets populated into the NoSQL database reliably, and vice versa.

The Solution

To deal with this challenge we used a similar approach to the one that we used before with RDBMSs: an implicit plug-in that gets called whenever new data is written, and populates it into the underlying database. The plug-in also deals with pre-loading of the data when the system starts. In the RDBMS world we used frameworks like Hibernate to deal with the implicit mapping of the data between the in-memory front end and the underlying database.
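A minimal sketch of this implicit plug-in pattern follows. The class and method names (`NoSqlMapper`, `initial_load`, `on_write`) are invented for illustration; in a real system the hooks would be called by the data grid, and the dictionary here stands in for a Cassandra or MongoDB client.

```python
class NoSqlMapper:
    """Hypothetical plug-in: the in-memory grid calls these hooks implicitly."""

    def __init__(self, nosql):
        self.nosql = nosql  # stands in for a NoSQL client

    def initial_load(self):
        # Pre-load existing data into memory when the system starts.
        return dict(self.nosql)

    def on_write(self, key, value):
        # Persist every in-memory write to the underlying NoSQL store.
        self.nosql[key] = value

class InMemoryGrid:
    def __init__(self, mapper):
        self.mapper = mapper
        self.data = mapper.initial_load()  # pre-load hook

    def put(self, key, value):
        self.data[key] = value             # fast in-memory write
        self.mapper.on_write(key, value)   # implicit persistence hook

nosql = {"a": 1}
grid = InMemoryGrid(NoSqlMapper(nosql))
assert grid.data == {"a": 1}       # data pre-loaded on start
grid.put("b", 2)
assert nosql == {"a": 1, "b": 2}   # write propagated to the store
```

From the application's point of view there is only `put`; the synchronization with the NoSQL tier happens behind the scenes, which is exactly the point of the implicit plug-in.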

Working with Dynamic Data Structure (a.k.a Document Model)

When we tried to apply the same approach with NoSQL databases we could no longer rely on Hibernate as the default framework for mapping the data between the two data systems, as NoSQL databases like Cassandra tend to be fundamentally different from traditional RDBMSs. The main difference is the use of dynamic data structures, a.k.a. the Document Model.

To deal with dynamic data structure we added the following hooks:

Introducing new documents and objects: Users can choose to write or load data in various forms -- Document for non-structured data or Objects or POJOs for structured or semi-structured data.

Introducing and loading new metadata: To map the data to and from the NoSQL database we also added the ability to introduce new metadata and load the metadata of the object before the actual data is loaded.

Introducing new indexes: In NoSQL databases you cannot effectively access data that is not indexed. For that purpose we included the ability to introduce indexes on the fly.

The main benefit of this two-tier approach is that it allows us to take the best of the in-memory and file-based approaches without adding too much complexity. The two tiers behave and work as one data system from an end-user perspective. The details of how the two systems get synchronized are abstracted away from the user.

Furthermore, the two-tier approach opens up a new degree of flexibility in how we design our Big Data system.

New Degree of Flexibility

If we look at the entire data flow, from the point at which a user interacts with our system (where we expect low latency and a high degree of consistency) through to our analytics systems, where we record and analyze those actions (and where latency matters less and we can relax some of the consistency constraints), we can see that each stage in our data processing has different consistency, latency, and performance requirements.

With the two-tier approach there is more flexibility in dealing with those different requirements while keeping everything working as if it were one big system from a usability perspective. Here are a few examples of how this setup can work:

Consistent data flow from Real-Time to Batch: The integration enables us to handle real-time data processing at in-memory speed and deal with more long-term data processing through the underlying database.

Performance & latency: The In-Memory Computing system can handle the event processing before the data gets into the database. Or, another approach is to keep the last day (or days) in-memory and the rest of the data in the NoSQL database.

Mixed consistency model: The In-Memory Computing system is often built for extreme consistency, where NoSQL databases often work best with eventual consistency. Usually, the consistency requirements are more relevant at the front end of the system, and become less relevant as the data gets older. The combined approach enables us to set our front end for extreme consistency and back end for eventual consistency.

Deterministic behavior: In many cases, we must ensure that a given set of data can be served under constant performance. Many databases use an LRU-based cache to optimize data access. The limitation of this approach is that the speed at which we can access our data becomes non-deterministic, as we often do not control which data is served through the database cache; in some cases we will get a fast response time if we hit the cache, and in other cases the same operation can take 10 times longer if we miss the cache. By splitting our in-memory data from our file-based storage we get more explicit control over which data is served at in-memory speed and which data is not, thus ensuring consistent access to that data.
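The difference between implicit LRU caching and explicit control can be sketched with a small cache that supports pinning. This is a toy model (the `pin` operation and class names are invented): pinned keys are never evicted, so access to them stays deterministic, while everything else is subject to ordinary LRU eviction.

```python
from collections import OrderedDict

class LruCache:
    """A small LRU cache with explicit pinning for deterministic access."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.items = OrderedDict()   # insertion/recency order
        self.pinned = set()

    def pin(self, key, value):
        # Explicitly keep this key in memory; it is never evicted.
        self.pinned.add(key)
        self.items[key] = value

    def put(self, key, value):
        self.items[key] = value
        self.items.move_to_end(key)  # mark as most recently used
        self._evict()

    def _evict(self):
        # Evict least-recently-used unpinned keys while over capacity.
        evictable = [k for k in self.items if k not in self.pinned]
        while len(self.items) > self.capacity and evictable:
            self.items.pop(evictable.pop(0))

cache = LruCache(capacity=2)
cache.pin("today", "hot data")   # always served at in-memory speed
cache.put("x", 1)
cache.put("y", 2)                # forces eviction of "x", never "today"
assert "today" in cache.items
assert "x" not in cache.items
```

With a plain LRU cache, whether "today" stays in memory depends on access patterns; with explicit pinning, the decision is ours, which is the deterministic behavior the two-tier split provides at system scale.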

Faster ETL: By front-ending our Big Data storage with In-Memory Computing we can also speed up the time it takes to pre-process and load data into our long-term data system. In this context, we can push the filtering, validation, compression and other aspects of our data processing into memory before the data goes into our long-term databases.
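As a sketch of this in-memory pre-processing step, here is a small filter-and-validate pipeline. The event shape and field names are made up for illustration; the point is that malformed and noisy records are dropped in memory before anything reaches the long-term store.

```python
def preprocess(events):
    """Filter and validate events in memory before long-term storage."""
    for event in events:
        if "user" not in event:               # validation: drop malformed records
            continue
        if event.get("type") == "heartbeat":  # filtering: drop noise
            continue
        # Normalize to the schema the long-term store expects.
        yield {"user": event["user"], "type": event.get("type", "unknown")}

raw = [
    {"user": "u1", "type": "click"},
    {"type": "click"},                   # malformed: no user field
    {"user": "u2", "type": "heartbeat"}, # noise
]
clean = list(preprocess(raw))
assert clean == [{"user": "u1", "type": "click"}]
```

Because only the cleaned records are written out, the long-term database both receives less data and receives it already in the shape it expects, which is where the ETL speedup comes from.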

Final Words

Big Data systems are complex beasts, and it is clear that the one-size-fits-all approach doesn't work.

On the other hand, having too many data systems increases the complexity of managing our Big Data system almost exponentially; and our ability to ensure consistent behaviour, data integrity and reliable synchronization across the various systems becomes an almost unmanageable task if done manually.

Adding real-time capabilities to our Big Data system is a classic area where the kind of integration described in this article is needed. The integration between In-Memory and File-Based approaches as two separate tiers also introduces additional areas of flexibility in how we can handle often contradictory requirements such as consistency, scale, latency, and cost. Instead of trying to come up with a least common denominator, we can optimize each tier for the area it fits best.

August 28, 2012

In one of my earlier posts I discussed in general terms why it makes a lot of sense to put Big Data on the cloud.

The first step that we took in this regard with Cloudify was to make NoSQL databases, such as Cassandra and MongoDB, run on any cloud through Cloudify recipes. Uri Cohen gave an excellent talk on this during the Cassandra Summit where he provided insights into this work.

In the past couple of weeks we've been working on the next phase of this project: Putting Hadoop on the cloud.

As there are various solutions that aim to do something similar, I thought that a good start would be to first outline where we fit in this ecosystem.

Cloudifying Hadoop -- What Does That Actually Mean?

Yet another Hadoop distribution?

There are multiple distributions of the Hadoop framework today. All come with strong management and sets of tools and do a fairly good job. With that in mind, it was clear to us that our goal wouldn't be to come up with yet another Hadoop distribution but rather to integrate with the ones that are already out there. We picked IBM BigInsights and the Cloudera distribution as the first targets for the project.

When we realized that this was the path we wanted to pursue, the main question that immediately came up was:

What value can we add on top of IBM BigInsights and the Cloudera distribution?

Big Data systems tend to be complex to manage and operate. BigInsights and Cloudera provide good tools to make the process of configuring and deploying Hadoop significantly simpler. That's good for the Hadoop part of the story. Big Data systems and applications often include other services such as relational databases, other NoSQL databases such as Cassandra or MongoDB, stream processing such as Twitter-Storm and GigaSpaces XAP, web front ends such as Tomcat, Play framework and Node.js. Each framework comes with its own management, installation, configuration, and scaling solutions, as shown in the diagram below:

Managing each component of your Big Data system separately is an operational nightmare. That complexity grows exponentially as the system gets bigger, and in Big Data that's just to be expected.

We realized that one of the areas in which we can reduce this operational complexity is through Consistent Management. By Consistent Management, I'm referring specifically to consistent deployment, configuration, and management across the stack. Consistent management applies not only to the deployment phase, but also to post-deployment, including fail-over, scaling, and upgrades. In addition, Big Data systems tend to consume a lot of infrastructure resources that can easily pile up to thousands of nodes. We realized that we can optimize the infrastructure cost of running the Big Data system through Cloud Enablement and Cloud Portability. Cloud Portability enables you to choose the right cloud for the job. For example, you could choose a bare-metal cloud for I/O-intensive workloads or a virtualized/public cloud for more sporadic workloads. Below is a more detailed description of how we implemented these two properties:

1. Consistent Management

With consistent management, we wanted to make the experience of managing each of the tiers and services in the Big Data System consistent throughout the entire stack. This is where Cloudify plugs in.

Cloudify already plugs into a variety of web containers, databases, and through the integration with Chef also into hundreds of services available through the Chef Cookbook.

The process for achieving consistent management for BigInsights and Cloudera Hadoop distribution basically maps to the creation of a Cloudify recipe. The purpose of the recipe in this specific case was to map all the capabilities that come with the IBM BigInsights and Cloudera distribution in a way that would be later consistent with other services in the stack.

For this, we needed to come up with the following:

Deployment automation -- A Cloudify recipe enables us to automate the installation, configuration, and deployment of a given service. In the context of a Hadoop distribution, which comes with its own scripts for automating these phases, the process of creating a Cloudify recipe basically meant mapping the specific Hadoop distribution scripts into the Cloudify convention, as well as using the Cloudify discovery, global context, and other services to dynamically inject configuration values that are driven on-the-fly through the deployment process. In the case of BigInsights, this basically mapped into a simple recipe that would launch the machines, set up the SSH environment, and update the BigInsights configuration file with the relevant information. We then called BigInsights to install the NameNode, DataNode, Management, Hive and other services that come with the BigInsights distribution.

Automation of post-deployment operations -- In the Hadoop scenario, post-deployment operations mapped into the abilities to add a node, rebalance, and test the environment. To automate these processes, we used Cloudify custom commands. Custom commands enable us to give an alias to those maintenance scripts, and then enable users to call those scripts using a simple naming convention without knowing the physical location of the Hadoop deployment.

SLA-based monitoring and auto-scaling -- SLA-based monitoring enables us to measure how the Hadoop distribution behaves after it's been deployed, and also to map specific actions to a given SLA breach. For example, you can monitor a situation where a node fails and then spawn a new machine to take over for the failed node. You can also use the monitors to trigger any of the actions defined in the custom commands, which basically means that you can automate not only a fail-over process but also the ability to scale by adding more capacity. To do this with Cloudify, use Custom Metrics to monitor the specific metrics that are of interest. One great power of custom metrics, in my opinion, is that you're not bound to metrics that are provided by the Hadoop distribution; you can easily generate compound metrics by correlating metrics from multiple sources, such as network statistics or a cloud management system. You can then use the Cloudify Scaling Rules to attach rules to each of those metrics and trigger a fail-over or scaling event.
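The shape of such a scaling rule can be sketched in a few lines. This is a toy model, not Cloudify's actual rule syntax: the metric names (`cpu`, `queue_saturation`) and thresholds are invented, and the compound metric is simply an average of two sources.

```python
def scaling_decision(metrics, high=0.8, low=0.2):
    """A toy SLA rule: a compound metric from multiple sources, with
    thresholds that trigger scale events (names and numbers invented)."""
    # Compound metric: average of CPU load and queue saturation, both in [0, 1].
    load = (metrics["cpu"] + metrics["queue_saturation"]) / 2
    if load > high:
        return "scale_out"   # e.g. spawn an additional node
    if load < low:
        return "scale_in"    # e.g. release an idle node
    return "steady"

assert scaling_decision({"cpu": 0.9, "queue_saturation": 0.95}) == "scale_out"
assert scaling_decision({"cpu": 0.1, "queue_saturation": 0.1}) == "scale_in"
assert scaling_decision({"cpu": 0.5, "queue_saturation": 0.5}) == "steady"
```

The interesting part is that the rule's input need not come from Hadoop at all; any monitored source can feed the compound metric, which is the flexibility custom metrics provide.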

2. Cloud Enablement and Portability

The Cloud is basically a more efficient way to run IT. Today, there are growing numbers of cloud offerings and types of clouds, each providing different SLAs and pricing models. There are the public clouds such as Amazon and Rackspace, as well as HP, IBM and Dell, which are coming out with their own cloud offerings. Microsoft and Google are also coming out with new offerings that will undoubtedly change the dynamic in this space. There are also the bare-metal clouds, the private clouds, the open-source clouds, etc., and obviously the VMware cloud, which are also an important part of the mix.

Enabling your Big Data systems for all these environments means you can leverage all the agility, efficiency, and flexibility of the cloud, while also allowing you to choose the right cloud for the job.

Given the rapid dynamic of the cloud world, you also want to keep your options open so that you can leverage any future developments and services as they become available.

Hybrid-Cloud -- Offload some of the work into the public cloud and optimize costs through more elastic computing.

Take Zynga for example. Zynga recently launched their own private cloud offering, zCloud, and is now using a hybrid cloud strategy in which they move some of the heavy workload from the Amazon cloud into their private cloud environment. With this approach Zynga was able to increase utilization by 3x, which means that they will need 1/3 of the servers that they would need from Amazon for the same workload, as noted by CTO Allan Leinwand on Zynga’s engineering blog:

For social games specifically, zCloud offers 3x the efficiency of standard public cloud infrastructure. For example, where our games in the public cloud would require three physical servers, zCloud only uses one. We worked on provisioning and automation tools to make zCloud even faster and easier to set up. We’ve optimized storage operations and networking throughput for social gaming. Systems that took us days to set up instead took minutes. zCloud became a sports car that’s finely tuned for games.

How Does Cloudify Add Cloud Portability to Your Hadoop Distribution?

Cloudify recipes use a feature called machine templates. Machine templates are an abstraction of the underlying compute resource. The mapping of the machine template onto the particular cloud infrastructure is done through the Cloud Driver.

Cloudify comes with a large and continuously growing list of Cloud Drivers available for HP, Rackspace, Amazon, Azure, and CloudStack, as well as for completely non-virtualized environments, which is basically just a bunch of bare-metal machines with an IP and network. The Cloud Driver also plugs in with JClouds and as such can plug into any cloud that is supported through the JClouds framework.

With all this, running your Hadoop deployment on any cloud becomes just a matter of a simple configuration of the target cloud endpoint.

Current Status:

We've made the entire project available on GitHub. The technical description of that work is provided here. These days, we're working closely with our IBM partners to harden and optimize our BigInsights integration, and will be coming out with more updates shortly. Obviously, this is meant to be an ongoing process, so I'd really appreciate any feedback or comments or contributions.

In the next post we will outline more specifics about the work with IBM BigInsights.

Final Notes

Adding consistent management and cloud portability to your Big Data system allows you to reduce the operational and infrastructure cost of running Big Data systems. It's clear to me that this is just the beginning of what we can achieve through this work. The flexibility that we can add through the integration of Cloudify with IBM BigInsights and the Cloudera distribution will allow us to easily plug into other services and cloud infrastructures. It will also allow us to deploy and manage even the most complex systems through a single command... So stay tuned.

August 21, 2012

One of the challenges in processing data is that the speed at which we can input data is quite often much faster than the speed at which we can process it. This problem becomes even more pronounced in the context of Big Data, where the volume of data keeps on growing, along with a corresponding need for more insights, and thus the need for more complex processing also increases.

Batch Processing to the Rescue

Hadoop was designed to deal with this challenge in the following ways:

1. Use a distributed file system: This enables us to spread the load and grow our system as needed.

2. Optimize for write speed: To enable fast writes the Hadoop architecture was designed so that writes are first logged, and then processed. This enables fairly fast write speeds.

3. Use batch processing (Map/Reduce) to balance the speed for the data feeds with the processing speed.

Batch Processing Challenges

The challenge with batch-processing is that it assumes that the feeds come in bursts. If our data feeds come in on a continuous basis, the entire assumption and architecture behind batch processing starts to break down.

If we increase the batch window, the result is higher latency between the time the data comes in and the time we actually get it into our reports and insights. Moreover, the batch window is finite -- in many systems batch processing is done on a daily basis. Often, the assumption is that most of the processing can be done during off-peak hours. But as the volume gets bigger, the time it takes to process the data gets longer, until it reaches the limit of the hours in a day, and then we face a continuously growing backlog. In addition, if we experience a failure during the processing we might not have enough time to re-process.

Speed Things Up Through Stream-Based Processing

The concept of stream-based processing is fairly simple. Instead of logging the data first and then processing it, we can process it as it comes in.

A good analogy to explain the difference is a manufacturing pipeline. Think about a car manufacturing pipeline: Compare the process of first putting all the parts together and then assembling them piece by piece, versus a process in which you package each unit at the manufacturer and only send the pre-packaged parts to the assembly line. Which method is faster?

Data processing is just like any pipeline. Putting stream-based processing at the front is analogous to pre-packaging our parts before they get to the assembly line, which in our case is the Hadoop batch processing system.

As in manufacturing, even if we pre-package the parts at the manufacturer we still need an assembly line to put all the parts together. In the same way, stream-based processing is not meant to replace our Hadoop system, but rather to reduce the amount of work that the system needs to deal with, and to make the work that does go into the Hadoop process easier, and thus faster, to process.
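The core idea of processing data as it arrives can be sketched with a simple per-event aggregator. The class name and the hashtag-counting example are illustrative (loosely inspired by the Twitter feed scenario mentioned below), not a real Storm API.

```python
from collections import Counter

class StreamAggregator:
    """Process events as they arrive instead of logging first and
    batch-processing later; here, a running count per hashtag."""

    def __init__(self):
        self.counts = Counter()

    def on_event(self, tweet):
        # Each event updates the aggregate at in-memory speed; there is
        # no batch window and no backlog to catch up on.
        for tag in tweet.get("hashtags", []):
            self.counts[tag] += 1

agg = StreamAggregator()
agg.on_event({"hashtags": ["bigdata", "storm"]})
agg.on_event({"hashtags": ["bigdata"]})
assert agg.counts["bigdata"] == 2
assert agg.counts["storm"] == 1
```

The aggregate is always current, so the downstream batch system only has to deal with the pre-digested counts rather than the raw firehose, which is the "pre-packaged parts" of the manufacturing analogy.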

In-memory stream processing can make a good stream processing system; as Curt Monash points out in his research, traditional databases will eventually end up in RAM. An example of how this can work in the context of real-time analytics for Big Data is provided in this case study, where we demonstrate the processing of Twitter feeds using stream-based processing that then feeds a Big Data database serving the historical aggregated view, as described in the diagram below.

Due to a lack of alternatives at the time, in many Big Data systems today Map/Reduce is used in areas where it wasn't a very good fit in the first place. A good example is using Map/Reduce for maintaining a global search index. With Map/Reduce, we basically rebuild the entire index, when it would actually make more sense to update it with changes as they come in.
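The contrast between the two indexing styles can be shown with a toy inverted index. Both functions are illustrative sketches: the batch version touches every document on each run, while the incremental version touches only the new document.

```python
def build_index(docs):
    """Full rebuild, Map/Reduce style: touches every document every time."""
    index = {}
    for doc_id, text in docs.items():
        for word in text.split():
            index.setdefault(word, set()).add(doc_id)
    return index

def update_index(index, doc_id, text):
    """Incremental update: touches only the new or modified document."""
    for word in text.split():
        index.setdefault(word, set()).add(doc_id)

docs = {"d1": "big data"}
index = build_index(docs)
update_index(index, "d2", "big memory")  # no full rebuild needed
assert index["big"] == {"d1", "d2"}
assert index["memory"] == {"d2"}
```

With millions of documents, the cost of the batch rebuild grows with the corpus while the incremental update grows only with the change set; this is, in spirit, the shift Google describes in the Percolator quote below.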

Google moved a large part of its index processing from Map/Reduce into a more real-time processing model, as noted in this recent post:

So, how does Google manage to make its search results increasingly real-time? By displacing GMR in favor of an incremental processing engine called Percolator. By dealing only with new, modified, or deleted documents and using secondary indices to efficiently catalog and query the resulting output, Google was able to dramatically decrease the time to value. As the authors of the Percolator paper write, ”[C]onverting the indexing system to an incremental system … reduced the average document processing latency by a factor of 100.” This means that new content on the Web could be indexed 100 times faster than possible using the MapReduce system!

…Some datasets simply never stop growing… it is why trigger-based processing is now available in HBase, and it is a primary reason that Twitter Storm is gaining momentum for real-time processing of stream data.

Final Notes

We can make our Hadoop system run faster by pre-processing some of the work before it gets into our Hadoop system. We can also move the types of workload for which batch processing isn't a good fit out of the Hadoop Map/Reduce system and use Stream Processing, as Google did.

Interestingly enough, I recently found out that Twitter Storm came up with an option to integrate an in-memory data store into Storm through the Trident-State project. The combination of the two makes a lot of sense and is something we're currently looking at right now, so stay tuned.

Edd touched briefly on the role of PaaS for delivering Big Data applications in the cloud:

Beyond IaaS, several cloud services provide application layer support for big data work. Sometimes referred to as managed solutions, or platform as a service (PaaS), these services remove the need to configure or scale things such as databases or MapReduce, reducing your workload and maintenance burden. Additionally, PaaS providers can realize great efficiencies by hosting at the application level, and pass those savings on to the customer.

Even though Edd’s article covers the different forms of running Big Data on private and public clouds, it focuses mainly on the public cloud offerings from Amazon, Microsoft, and Google.

In this post, I wanted to cover more specifically how I see the evolution of cloud application platforms (PaaS) to support Big Data. I’ll refer specifically to Cloudify, which was designed primarily to support Big Data applications.

Big Data in the cloud using Cloudify

Background

Most of the PaaS solutions out there started by focusing on simple web application deployments in Ruby, Java, and Node.js. Unlike other PaaS solutions, when we designed Cloudify we picked Big Data as its primary target, starting with support for popular NoSQL clusters such as Cassandra and MongoDB, as well as providing the equivalent of Amazon RDS through recipes for MySQL. Our goal was to make Big Data deployments a first-class citizen within Cloudify. To this end, when you download Cloudify you'll notice that all of our examples come with pre-integrated Big Data deployments.

There are a couple of reasons that brought us to this decision:

Managing large data clusters is a core expertise at GigaSpaces

Most people know GigaSpaces for our In-Memory Data Grid solution, XAP (eXtreme Application Platform). Over the past 10 years, as our customer deployments grew substantially, we realized that developing strong automation and cluster management is as critical as handling data consistency, performance, and latency in our data-grid product. In a large cluster, if something breaks, it is literally impossible to handle that failure through manual procedures. For that reason, we developed lots of IP around automating data cluster deployment, which resulted in a unique self-managed data cluster.

Cloudify is a natural evolution of GigaSpaces Data Cluster

When we built Cloudify, it made a lot of sense to take the IP we had developed for managing GigaSpaces clusters and generalize it so it would fit any other framework. In this way, we could leverage years of experience and development in this area, and gain a significant head start.

Big Data applications are complex

Big Data applications tend to be fairly complex, which makes them an ideal candidate for the sort of automation and management that Cloudify can offer.

Big Data applications have a lot in common with XAP applications

Both need automation of data, failover and recovery, both fit into large cluster deployments, and both share similar partitioning and other clustering architecture.

What makes a Big Data platform different from any other application

Most of the existing orchestration systems were designed to handle stateless processes. Moving data is a completely different ballgame, as you need to think about:

Primary and backup dependencies

Availability - moving data without losing it

Moving processes to the data, rather than the other way around

Data replication within and across sites

Automating any of these processes through general orchestration tooling like Chef or RightScale can become a fairly involved and complex process, with lots of pitfalls in handling edge scenarios, such as the handshake process that is often involved in automating a data-node failure, including split-brain scenarios.

In Cloudify, we were able to abstract much of that logic away from the user. For example, Cloudify will automatically ensure that primaries and backups won't run on the same node, or in the same data center in a disaster-recovery setup. You don't need to do anything but tag your cluster nodes with a zone tag.
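To make the anti-affinity idea concrete, here is a minimal Python sketch of the kind of zone-aware placement rule described above. All names are illustrative assumptions, not Cloudify's actual API:

```python
def place_partitions(partitions, machines_by_zone):
    """Assign each partition a (zone, machine) pair for primary and backup,
    guaranteeing the backup never lands in the primary's zone."""
    zones = sorted(machines_by_zone)
    if len(zones) < 2:
        raise ValueError("need at least two zone tags for primary/backup anti-affinity")
    placement = {}
    for i, pid in enumerate(partitions):
        # Round-robin primaries across zones; the backup goes to the next zone over,
        # so the two copies of a partition can never share a zone.
        pz = zones[i % len(zones)]
        bz = zones[(i + 1) % len(zones)]
        primary = (pz, machines_by_zone[pz][i % len(machines_by_zone[pz])])
        backup = (bz, machines_by_zone[bz][i % len(machines_by_zone[bz])])
        placement[pid] = {"primary": primary, "backup": backup}
    return placement

placement = place_partitions(
    ["partition-1", "partition-2"],
    {"zone-a": ["m1", "m2"], "zone-b": ["m3", "m4"]},
)
```

A real orchestrator also has to rebalance on failure, but the invariant it maintains is the one checked here: for every partition, the primary zone differs from the backup zone.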

Managing Big Data applications != Managing Big Data storage

Managing data clusters is one thing. Being able to process the data is yet another challenge that we need to think about when we’re dealing with application platforms, as I noted in one of my earlier posts.

The main challenge is that quite often the management of the data-processing logic is built on completely different scaling, availability, and monitoring tools than the ones used for managing our Big Data deployment. It turns out that this silo thinking leads to a whole set of complexities, starting with the inconsistency of having multiple managers, each of which detects failure or scaling events in its own way, and which quite often end up conflicting with one another. Having lots of moving parts is yet another challenge that can turn the entire deployment into a complete mess.

Over the next few years, we'll see the adoption of scalable frameworks and platforms for handling streaming, or near real-time, analysis and processing. In the same way that Hadoop was born out of large-scale web applications, these platforms will be driven by the needs of large-scale location-aware mobile, social and sensor use.

Being part of the XAP family, Cloudify already comes with built-in support for streaming Big Data processing. This means that building your own Facebook- or Twitter-like real-time analytics can be as simple as writing the small scripts that handle the analytics counters. All the rest, i.e. scalability, availability, automation, cloud portability, management and monitoring, is covered by Cloudify, as noted in this and this use case.

Examples for Big Data applications running on Cloudify

In the list below, I've put together a couple of references and examples that will make it easier for you to get started. The first reference points to a simple scenario that lets you use Cloudify to deploy your Big Data database as a service. The other three references are full application-stack deployments in which the data-processing and web tiers of the applications are managed together with the Big Data database.

Running a Big Data 'Database as a Service'

Cloudify comes with built-in recipes for Cassandra and MongoDB, as well as Solr (a popular search engine), which make it easy to deploy these database clusters on your local machine, in your data center, or on a private/public cloud through a single command. In this way, you can use Cloudify to automate the database.

Spring Travel application with Cassandra

Demonstrates the deployment of a Java-based travel application with Cassandra as the database.

The example includes recipes that provision the Cassandra database, create a schema, load the data, and then spawn a Tomcat container, automatically injecting the reference to Cassandra; it then shows custom management and monitoring of the application, all through a single command. A recent videocast showing how the travel application works on the HP OpenStack cloud is available here.

Pet Clinic example with MongoDB

The Pet Clinic example does pretty much the same thing, only using a sharded MongoDB cluster.

Twitter Real Time analytics example for Big Data

The Twitter example shows how you can attach real-time stream-based processing to live Twitter feeds, and how you can manage both the stream-processing cluster and the Big Data cluster using Cloudify and run them on any cloud. The entire source code for this example is available on Github.

Give it a try

To try out any of the examples, you'll need to download the Cloudify (latest) or (stable) build. Cloudify comes with the first three examples as part of the distribution, under the recipes or examples directory. To try them out, simply follow the steps in the Cloudify Quick Start Guide.

September 30, 2011

Next week, I’m going to be at JavaOne. Unlike last year, when I expressed lots of skepticism about the way Oracle drives the Java community, it seems that this year things are starting to get back on course – not that the skepticism has vanished completely, but at least there is a stronger sentiment that things are settling down and taking a more positive course.

It seems that now Java is taking center stage in the Cloud world, with more application platforms such as SalesForce/Heroku, VMware, Redhat/OpenShift, Cloud.com, JClouds, and obviously GigaSpaces providing a rich set of offerings that are based on Java.

One of the most interesting things in my opinion is that the combination of Cloud and Java brings new opportunities that until recently were unheard of:

The Java EE platform architecture is taking into account the operational impact of the cloud, more specifically by defining new roles, such as a PaaS administrator.

The Java EE platform may also establish a set of constraints for PaaS-specific features, such as multi-tenancy, that deployments may have to obey. Applications may also be able to identify themselves as designed for cloud environments.

All resource manager-related APIs, such as JPA, JDBC and JMS, will be updated to enable multi-tenancy. The programming model may be refined as well by introducing connectionless versions of the major APIs.

Java EE will define a descriptor for application metadata to enable developers to describe certain characteristics of their applications that are essential for running them in a PaaS environment. These may include multi-tenancy support, resource sharing, quality-of-service information, and dependencies between applications.

What This Means

Java is regaining some of the luster it lost over the last few years, when it was perceived as vaguely pedestrian and ordinary while languages like Ruby and Python were seen as more dynamic environments for rapid and scalable deployments.

Now we're seeing Java as the lingua franca of enterprise development. With so many environments being heterogeneous, a homogeneous platform becomes a very desirable resource, and the JVM is able to run many languages in an integrated environment: Java, Ruby, Python, Scala, and Groovy, to name only a few, and many of these can run both as compiled code and as scripts.

Even beyond the JVM's support for multiple programming languages, you have very broad support for almost any remote procedure call mechanism you can think of. The old J2EE approach, where you defined your requirement and there was one specific way to fulfill it (which meant that every architecture was more or less nudged toward the middle), is no longer a real limit.

If you can do it, you can do it in Java, without Java forcing you to sacrifice in the process.

This is a very powerful concept.

It puts your organization back under your control. This is desirable because you know your product and architecture better than a community process does. You know whether you need to support REST here, and SOAP there; you know you need to use NoSQL in this area and MySQL in that area.

A Remaining Challenge

Java EE 7 may be becoming aware of the new cloud-based environment, but it's still only one aspect of application deployment and design. What it does not address is the entire application environment.

This is where applications such as Cloudify come in. Cloudify treats every artifact involved with your deployment as a managed element.

Consider: when you deploy an enterprise application, you're not just deploying some operational code. That code has external requirements, like a database (or NoSQL warehouse); it might also rely on an Apache-based load balancer, or perhaps another independently-deployed artifact.

That's quite a burden on operations, because all of these are separately deployed and managed.

Cloudify, however, centralizes the management of each element your application is composed of, over the entire application's lifecycle, and provides a bridge to whatever environment you wish to use.

For example, you could have a MySQL database, used by a Java application hosted by a single Tomcat instance, fronted by an Apache httpd server, hosted on an internal Linux server. Cloudify can deploy and monitor each of those elements: the MySQL database, the schema and initial data; Tomcat; the Java application; Apache httpd, and the server upon which it all runs.

Now let's look at a larger deployment: a set of MySQL servers, a cluster of GigaSpaces XAP nodes, an instance of GlassFish hosting some web services, six .Net-based web services, and five httpd servers being load-balanced, deployed on a platform such as Microsoft Azure. From Cloudify's perspective, the deployment and management process is the same. You have more artifacts to deploy, and you're telling Cloudify to deploy on a different platform... that's it.

Cloudify can make sure each resource is running optimally, as well. If you can measure an attribute - such as CPU load, or disk space, or free memory - you can tell Cloudify to assert that some headroom exists, and what to do if a metric is exceeded.

For example, if you have a CPU running at 100% for an extended period of time, Cloudify can automatically deploy another instance of the component consuming the CPU, and configure load balancing so your application is more scalable, without human intervention.

Likewise, if you have four servers running a given process, all at 0% CPU load, you can reduce the number of containers with that component so that you're not overallocating resources, saving energy and money.
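The two scenarios above boil down to a simple rule over a measured metric. The Python sketch below is purely illustrative; the function name and thresholds are assumptions, not Cloudify's actual configuration:

```python
def scaling_decision(cpu_samples, instances, high=0.80, low=0.10,
                     min_instances=1, max_instances=10):
    """Return a new instance count given recent CPU-load samples (0.0 to 1.0)."""
    avg = sum(cpu_samples) / len(cpu_samples)
    if avg > high and instances < max_instances:
        return instances + 1   # sustained load: add an instance
    if avg < low and instances > min_instances:
        return instances - 1   # idle capacity: release an instance
    return instances           # enough headroom: leave things alone

print(scaling_decision([1.0, 0.95, 1.0], 2))  # sustained ~100% CPU: scale out to 3
print(scaling_decision([0.0, 0.0], 4))        # idle cluster: scale in to 3
```

In practice the platform would evaluate such a rule continuously, over a sliding window long enough to ignore transient spikes, and would also reconfigure the load balancer after each change.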

The key thought here:

You know how to solve your problem. We can help.

We'll be on the floor at JavaOne 2011, in booth #5002. We'll be happy to show you both XAP, our industry-leading application platform, and Cloudify, our deployment solution for the cloud-enabled world.

(This post was co-written with Joseph Ottinger, who unfortunately won't be able to make it to JavaOne this year. But he says that the best people in GigaSpaces will be there, so you're not missing out!)

July 14, 2011

Lately, we've been talking to various clients about realtime analytics, and with convenient timing Todd Hoff wrote up how Facebook's realtime analytics system was designed and implemented (see my previous review in that regard here).

Their design rested on some assumptions about the reliability of in-memory systems and about database neutrality: for memory, that transactional memory was unreliable, and for the database, that HBase was the only target data store.

What if those assumptions are changed? We can see reliable transactional memory in the field, as it is a requirement for any in-memory data grid, and certainly there are more databases than HBase. Given database and platform neutrality, and reliable transactional memory, how could you build a realtime analytics system?

Joseph Ottinger and I discussed this, and this is what we came up with.

A Summary of History

To understand what a new design might look like, it’s often useful to consider a previous design. This is a very short summary of Facebook’s realtime analytics system.

First, it’s based on a system of key/value pairs, where the key might be a URL and the value is a counter. Thus, there’s a requirement for atomic, transactional updates to a very simple piece of data. The difficulties come from scale, not from the focus of the system.
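At its core, then, the data model is nothing more than atomic increments on key/value counters. A thread-safe in-process sketch in Python makes the operation itself look trivial, which is the point: at Facebook's scale the difficulty is distributing this operation, not performing it.

```python
import threading
from collections import defaultdict

class CounterStore:
    """A toy key/value counter store with atomic increments."""

    def __init__(self):
        self._lock = threading.Lock()
        self._counts = defaultdict(int)

    def increment(self, key, delta=1):
        # The entire analytics pipeline exists to perform this one
        # operation safely at massive volume.
        with self._lock:
            self._counts[key] += delta
            return self._counts[key]

    def get(self, key):
        with self._lock:
            return self._counts[key]

store = CounterStore()
for _ in range(3):
    store.increment("http://example.com/article")  # e.g. three page-view events
```

The rest of the architecture, at Facebook or anywhere else, is about getting billions of these increments applied consistently and durably.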

The process flow is fairly simple:

A user creates an event by performing some action on the website. This generates an AJAX request, sent to a service.

Scribe is used to write the events into logs, stored on HDFS.

PTail is used to consolidate the HDFS logs.

Puma takes the consolidated logs from PTail and stores them into HBase in groupings that represent roughly 1.5 seconds’ worth of events.

HBase serves as the long-term repository for analytics data.

There are some questions around how PTail and Puma serve as scaling agents, and some of the notes around their use suggest limits to that scale – for example, one of the concerns is that an in-memory hash table will fill up, which sounds like a fairly serious limitation to have to keep in mind.

A Potential for Improvement

There are lots of areas in which you can see potential improvements, if the assumptions are changed. As a contrast to Facebook's working system:

We can simplify the design. If memory can be seen as transactional - and it can - events can be used without transforming them as they proceed along our analytics workflow. This makes our design much simpler to implement and test, and performance improves as well.

We can strengthen the design. With polling semantics, such systems are brittle, relying on processes that pull data in order to generate realtime analytics. We should be able to reduce the fragility of the system, even while making it faster.

We can strengthen the implementation. With batching subsystems, there are limits that shouldn’t exist. For example, one concern in Facebook's implementation is the use of an in-memory hash table that stores intermediate data; the in-memory aspect isn’t a concern until you realize that the batch sizes are chosen partially to make sure that this hash table doesn’t overflow available space.

We can allow deployments to change databases based on their requirements. There's nothing wrong with HBase, but it's got specific characteristics that aren't appropriate for all enterprises. We can design a system which you’d be able to deploy on various and flexible platforms, and we can migrate the underlying long-term data store to a different database if needed.

We can consolidate the analytics system so that management is easier and unified. While there are system management standards like SNMP that allow management events to be presented in the same way no matter the source, having so many different pieces means that managing the system requires an encompassing understanding, which makes maintenance and scaling more difficult.

What we want to do, then, is create a general model for an application that can accomplish the same goals as Facebook’s realtime analytics system, while leveraging the capabilities that in-memory data grids offer where available, potentially offering improvement in the areas of scalability, manageability, latency, platform neutrality, and simplicity, all while increasing ease of data access.

That sounds like quite a tall order, but it’s doable.

The key is to remember that at heart, realtime analytics represent an events system. Facebook’s entire architecture is designed to funnel events through various channels, such that they can safely and sequentially manage event updates.

Therefore, they receive a massive set of events that “look like” marbles, which they line up in single file; they then sort the marbles by color, you might say, and for each color they create a bundle of sticks; the sticks are lit on fire, and when the heat goes up past a certain temperature, steam is generated, which turns a turbine.

It’s a real-life Rube Goldberg machine, which is admirable in that it works, but much of it is still unnecessary if the assumptions about memory ("unreliable") and database ("HBase is the only target that counts") are changed. Looking at the analogy from the previous paragraph, there’s no need to change a marble into anything. The marble is enough.

A Plan for Implementation

Our design for implementation is built around putting data and messaging together. A data grid is a perfect mechanism for this, as long as it provides some basic features: transactional operations, push and pull semantics, and data partitioning.

A data grid does provide those basic features, or else it's not really much of a data grid; it'd be more of a cache otherwise.

With a data grid, then, the events come in as individual messages. When the user chooses an operation on the web site, an asynchronous operation would write the event, just as Facebook does today. However, instead of filtering and batching the events into various forms, the events are dispatched to waiting processes that perform many transactional updates in parallel.

There’s a danger that those updates might be slower than the generated events, if each event is processed sequentially. That said, this isn’t as much a problem as one might think; if data partitioning is used, then event handlers can receive partitioned events, which localizes updates and speeds them up dramatically.

In fact, you can still use batching to process events as a group; since the events would be partitioned coming in, the batch process would still be updating local data very quickly, which would be faster than individual event processing, even while retaining simplicity.
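As an illustration of partition-local dispatch, here is a small Python sketch; the names are hypothetical, and a real data grid would spread partitions across machines rather than across dictionaries in one process:

```python
def partition_for(key, num_partitions):
    """Route a key to a partition by hashing it."""
    return hash(key) % num_partitions

class PartitionedCounters:
    def __init__(self, num_partitions=4):
        self.num_partitions = num_partitions
        # One private counter map per partition; updates never cross partitions,
        # so no global lock is needed and batches stay local.
        self.partitions = [{} for _ in range(num_partitions)]

    def dispatch(self, events):
        # Apply an incoming batch of event keys; each update touches
        # only the partition that owns the key.
        for key in events:
            local = self.partitions[partition_for(key, self.num_partitions)]
            local[key] = local.get(key, 0) + 1

    def count(self, key):
        return self.partitions[partition_for(key, self.num_partitions)][key]

grid = PartitionedCounters()
grid.dispatch(["/a", "/b", "/a", "/a"])
```

Because every key always hashes to the same partition, a batch of events for one partition can be applied as a tight local loop, which is exactly why batched, partitioned updates outrun sequential global ones.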

With this design, there is no overflow condition, because a system that’s designed to scale in and out, as most data grids are, will repartition to maintain even usage. If a data grid can’t provide this feature intrinsically, of course some management will be necessary, but finding data grids with this feature isn’t very difficult.

One other advantage of data grids is in write-through support. With write-through, updates to the data grid are written asynchronously to a backend data store – which could be HBase (as used by Facebook), Cassandra, a relational database such as MySQL, or any other data medium you choose for long-term storage, should you need that.
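The write-through path can be sketched as a queue sitting between the in-memory update and the backing store. In the Python sketch below, the BackingStore class is a stand-in for HBase, Cassandra, or MySQL, not a real driver, and the class names are made up for illustration:

```python
import queue
import threading

class BackingStore:
    """Stand-in for the long-term store; a real one would talk to a database."""
    def __init__(self):
        self.rows = {}

    def write(self, key, value):
        self.rows[key] = value

class WriteThroughGrid:
    def __init__(self, store):
        self._data = {}
        self._store = store
        self._pending = queue.Queue()
        # A background worker drains grid updates to the backing store.
        self._worker = threading.Thread(target=self._drain, daemon=True)
        self._worker.start()

    def put(self, key, value):
        self._data[key] = value          # in-memory update: the fast path
        self._pending.put((key, value))  # queued for asynchronous persistence

    def get(self, key):
        return self._data[key]

    def _drain(self):
        while True:
            key, value = self._pending.get()
            self._store.write(key, value)  # persisted off the caller's path
            self._pending.task_done()

    def flush(self):
        self._pending.join()  # block until the backlog has been persisted

store = BackingStore()
grid = WriteThroughGrid(store)
grid.put("url:/a", 42)
grid.flush()
```

The caller only ever pays the in-memory cost; the database write happens behind it, which is what lets the realtime path stay fast regardless of which long-term store is plugged in.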

The memory system and the database - the external data store - work together. The in-memory solution is ideal for the realtime aspects, the events that affect now. The external data storage solution is designed to handle long-term data, for which speed is not as much of an issue.

A Discussion of Strengths

The key concept here is that event handling is the lever that can move the realtime analytics mountain. By providing a simple, scalable publisher/subscriber model, you simplify design; by using a platform that supports data partitioning, transactional updates, and write-through capabilities, you gain scalability.

The data grid’s flexible query API means that clients can react the moment matching data becomes available.

For a call center, for example, you want to immediately identify signals showing that the caller should be handled differently; or imagine an ecommerce site that could determine immediately that a user was losing interest, and respond appropriately before the customer moves on.
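That kind of immediate reaction can be sketched in Python as a predicate-based subscription; the event fields and thresholds here are invented for illustration:

```python
class EventGrid:
    """Toy pub/sub grid: subscribers register a predicate over events."""

    def __init__(self):
        self._subscribers = []  # (predicate, callback) pairs

    def subscribe(self, predicate, callback):
        self._subscribers.append((predicate, callback))

    def publish(self, event):
        for predicate, callback in self._subscribers:
            if predicate(event):
                callback(event)  # the reaction happens at publish time, not on a poll

alerts = []
grid = EventGrid()
# e.g. flag call-center events with long hold times for special handling
grid.subscribe(lambda e: e.get("hold_seconds", 0) > 120,
               lambda e: alerts.append(e["caller"]))
grid.publish({"caller": "alice", "hold_seconds": 30})
grid.publish({"caller": "bob", "hold_seconds": 300})
```

Because the match is evaluated as each event arrives, there is no polling loop and no funnel of transformations between the signal and the response.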

With external processes and a long funnel for data, immediate-response capabilities are very difficult to implement, not just because of latency but because the data transformations tend to homogenize the data, instead of allowing rich expressions and flexible event types.

The data grid also has much richer support in terms of client applications. Instead of applications going through an API that focuses on a specific phase of the data’s life (for example, an API focused on HBase), you can focus on a generic API that can capture events at any point in their lifecycle, and from anywhere. An external monitoring process, then, can have the same immediate, partition-aware access to data that the integrated message-handling system does; adding features and analysis is just a matter of connecting a client to the data grid.

Here we have a quick demo that shows much of this in motion. We have a market analysis application, deployed into GigaSpaces XAP via our new cloud deployment tool, Cloudify; it uses an event-driven system to display realtime data, with a write-through to Cassandra on the back end. The design is very simple, demonstrates the principles we've discussed here, and can scale up and down depending on demand.

Final words

Todd Hoff (HighScalability) and Alex Himel (Facebook) provided a fairly detailed description of their solution and, even more importantly, shared the rationales behind their choices.

One main difference in assumptions that led to the different implementation strategies is reliable memory for event processing, along with the use of passive data storage.

Another difference is that we had to think of the solution as an easily cloneable one, and therefore a lot of attention was paid to the simplicity of its runtime, packaging, and management.

Yet another difference is that we couldn’t settle on a specific database, as there isn’t a "one size fits all" solution – for certain customers, SQL would still be the preferred choice, and the fact that we can buffer the writes to the database gives them more headroom while still allowing them to scale on writes.

I hope this will lead to a constructive dialogue on the various tradeoffs, which will serve the entire industry.