June 16, 2010

WTF is Elastic Data Grid? (By Example)

Forrester released their new wave report: The Forrester Wave™: Elastic Caching Platforms, Q2 2010 where they listed GigaSpaces, IBM, Oracle, and Terracotta as leading vendors in the field. In this post I'd like to take some time to explain what some of these terms mean, and why they’re important to you. I’ll start with a definition of Elastic Data Grid (Elastic Caching), how it is different then other caching and NoSQL alternatives, and more importantly -- I'll illustrate how it works through some real code examples.

Let’s start…

Elastic Data Grid (Caching) Platform Defined

Software infrastructure that provides application developers with data caching and code execution services that are distributed across two or more server nodes that 1) consistently perform as volumes grow; 2) can be scaled without downtime; and 3) provide a range of fault-tolerance levels.

Source: Forrester

I personally feel more comfortable with the term Data Grid then term Cache, because Data Grid represents the fact that it's more then a passive data service, but rather one that was designed to execute code. Note that I use the term Elastic Data Grid (Cache) throughout this post instead of just Elastic Cache.

Local Cache, Distributed Cache, and Elastic Data Grid (Cache)

Mike Gualtieri Forrester Senior Analyst provides good coverage on this topic in two posts. In the first one, titled Fast Access To Data Is The Primary Purpose Of Caching, he explains the difference between local cache (a cache that lives within your application process), distributed cache (remote data that runs on multiple servers (from 2 up to 1000s)), and elastic cache (a distributed cache that can scale on demand).

It is important to note that most Data Grids support multiple cache topologies that include local caches in conjunction with distributed caches. The use of a local cache as a front end to a distributed cache enables you to reduce the number of remote calls in scenarios where you access the same data frequently. In GigaSpaces these components are referred to as LocalCache or LocalView.

The main difference between the two approaches comes down mostly to cost/GB of data and read/write performance.

…

An analysis done recently by Stanford University, called “The Case for RAMCloud” provides an interesting comparison between the disk and memory-based approaches, in terms of cost performance. In general, it shows that cost is also a function of performance. For low performance, the cost of the disk is significantly lower then RAM-based approach, and with higher performance requirements, the RAM becomes x100 to x1000 times cheaper then regular disk.

The Elastic Data Grid as a Component of Elastic Middleware

In my opinion, an Elastic Data Grid is only one component in a much broader category that applies to all middleware services, for example: Amazon SimpleDB, SQS, MapReduce (Messaging, Data, Processing). The main difference between this sort of middleware and traditional middleware is that the entire deployment and production complexity is curved out by the service provider. Next-generation Elastic Middleware should follow the same line of thought. For this reason, I think that the Forester definition of the Elastic Data Grid (Caching) category should also include attributes such as On-Demand and Multi-tenancy as I describe below:

On-demand -– With Elastic Data Grid you don’t need to go through a manual process of installing and setting up the services cluster before you use it. Instead, the service is created by an API call. The actual physical resources are allocated based on required SLA (provided by the user) such as Capacity-Range, Scaling threshold, etc. It is the responsibility of the middleware service to manage the provisioning and allocation of those resources.

Multi-tenancy -- Elastic Data Grid is built provides built in support multi-tenancy. Different users can have their own data-grid reference while at the same time the data-grid share the resources amongst themselves. Users that want to share the same data-grid instance without sharing the data can also do that through a specialized content based security authorization.

Auto-scaling -– Users of the elastic data grid need to be able to grow the capacity of the data grid in an on-demand basis, at the same time the elastic data-grid should automatically scale-down when the resources are not needed anymore.

Always-on – In an event of a failure the elastic data-grid should handle the failure implicitly and provision new alternate resources if needed while the application is still running.

Elastic Data Grid by Example (Using GigaSpaces)

To illustrate how these characteristics works in practice, I'll use the new GigaSpaces Elastic Middleware introduced in our latest product release.

On-Demand

As with the cloud –- when we want to use a service like SimpleDB or any other service we don’t set up an entire cluster ourselves. We expect that the service will be there when we call it. Let’s see how that works.

Step 1 -– Create your own cloud

This is a one-time setup for all your users. To create your own cloud with GigaSpaces all you need to do is make sure that a GigaSpaces agent runs in each machine that belongs to the cloud. The agent is a lightweight process that enables interaction with the machine and execute provisioning commands when needed. The following snippet shows how to start the Elastic Middleware manager and agent. In our particular case, we would start a single manager for the entire cluster and an agent per machine. If you happen to use Virtual Machine you can bundle the agent with each VM in exactly the same way as you would do with regular machines. The snippet below shows the agent command would look like:

>gs-agent gsa.global.esm 1 gsa.gsc 0

What this command does is start an agent with one Elastic Manager. Note that the use of the word global means that the number of Elastic Manager (ESM) instances refers to the entire cluster and not to a particular machine. The agents will use an internal election mechanism to make sure that in each point in time there would be a single manager instance available. If the manager failed a new one will be launched automatically on one of the other agent machines.

Step 2 -- Start the Elastic Data Grid On-Demand

In this step we write a simple Java client that calls the elastic middleware and requests a data-grid service with 1G at minimum and up to 2G (I used these numbers just an illustration that works in a single machine; in a real-life scenario we would expect 10-100G per data-grid).

Capacity -– "Minimum", "Maximum" memory to be provisioned for the entire cluster. In cases where highlyAvailable is set to true, the Capacity value should take into account the double capacity required for the backups. In our case that means that setting 1G capacity with highAvliabilty will translate to 512m for data and 512m for backup. Note also that the capacity number defines the capacity for the entire data-grid cluster and not just a single instance. The difference between the Min-Max will determine the level of elasticity that we expect for our deployment.

High Availability -– When set to true, the elastic manager will provision a backup instance per partition. It will ensure that primary and backup instances wouldn’t run on the same machine. (Note that if you're running a single machine you should turn this argument to false, as otherwise the provisioning will wait infinitely until a backup machine will become available).

Public Deployment/Isolation –- This is basically one of three isolation/multi-tenancy modes that you can choose for your deployment. The default is dedicatedDeploymentIsolation which means that only one deployment can run on the same machine. In this case we chose public isolation which means that different users share the same machine for deploying multiple data-grid instances.

Maximum/Initial Heap Size -– This argument defines the capacity per deployment container. The value will be translated into the –Xms, –Xmx JVM argument for each container process.

Using the data grid

Once we have a reference to the data-grid processing unit, we can start and use it.

The next snippet shows how we can write 1000 instances of a a simple Data POJO.

The first step is to get the remote reference of GigaSpace from our deployment.

Note for the use of waitFor(..) method. The first call waits until the first instance of the space has been deployed. The second call space.waitFor(space.getTotalNumberOfInstances()) waits until the entire cluster has been provisioned successfully.

Note for the use of @SpaceRouting annotation. Space routing is used to set the routing index, or in other words, it marks the key that determines which target partition this instance should belong to. @SpaceId determines the unique identifier for this instance.

Step 3 -- Give It a Go

Now that we have our cloud running (Step 1) and the client application ready, we're ready to try out our application. All we need to do is run our client application and verify that it works.

Now, if we try to run the same application again we would see that it runs much faster. The reason is simply because in the first run we also provisioned our data grid, while in the second run the elastic middleware skipped the deployment step as it already found the data-grid services instance running.

Monitoring and Troubleshooting

To understand what happened behind the scenes we need to familiarize ourselves with the GigaSpaces monitoring tool.

You can use the gs-ui utility located in the GigaSpaces bin directory to monitor the deployment at runtime.

You can look at the ESM (Elastic Service Manager) view to see the provisioning and scaling event. You can also view the Agent (GSA) view to see the actual Java command that is passed to the agent to launch new containers. There are plenty of other views and options that will enable you to view the actual data as well as the provisioning steps.

Understanding the Cluster Sizing

In a cloud environment we normally wouldn't care too much how our service gets provisioned, in this case we would need to better understand how the provisioning algorithm works as we are going to be the owner of both the application and our own little cloud.

The initial provisioning will fork number of machines/processes based on the following inputs:

- Minimum capacity requested by the user

- JVM capacity per container

- Actual amount of memory available per machine

Number of partitions: The number of partitions is calculated as follows: With High Availability set to false, Max Capacity/Max Heap Size in our case would be 2G/250m = 8 partitions. With High Availability set to true that would be 4 partitions and 4 backup (1 per partition).

Number of JVM Containers: In our example the minimum capacity is set to 1G where the capacity per deployment unit is set to 250m. This means that we need a minimum of 4 JVM containers. If we set High Availability to true we would need the same number of containers, but would be able to host only 512m of data, because the other 512m would serve as backup.

Number of machines: The initial number of machines is a factor of minimum capacity requirements and the actual memory available in each machine. In our example we deliberately chose numbers that fit into a single small machine. If the heap size per JVM container is smaller than the machine available heap size, the Elastic Manager will first try to host more containers to maximize the utilization per machine before it calls for a new machine.

Indeed, if we look at at the deployment view of the gs-ui, we see that in the given case there are currently 8 partitioned being provisioned.

If we examine the host view we see that there are indeed 4 JVM containers being provisioned as expected for the initial deployment.

Multi-Tenancy/AutoScaling, Always-On

Elastic middleware supports multi-tenancy and auto-scaling, and it will also keep our data-grid service always-on in case of failure. I've already covered in greater detail how multi-tenancy works here, as well as how high availability works, not just in a local environment but across a disaster recovery site as described here. That said, trying out these features in the context of this example shouldn’t be that hard. You are welcome to play with the isolation argument (Private, Public, Shared), deploy multiple data-grids in the same environment and see what happens. You can also try and break the machines and containers and see how the system heals itself while keeping your application running. You can also launch new containers explicitly (right-click the agent in the management console) or just start a new machine and see how the system scales-out on-demand.

Scaling Down

One thing that is worth emphasizing is how the system scales-down (not just auto-scale). The simplest way to experience how this feature works is to un-deploy our entire data-grid, which you can do by simply calling the undeploy() method:

pu.undeploy();

Another option would from the Administration console: Select the deployment tab and un-deploy the data-grid service from the console.

Now watch what happens to our data-grid instances -- but not only to the data-grid instances but to our containers (processes) as well. Let me explain:

When we un-deploy the data-grid the first thing that happens is that the partitions (the actual data content) get un-deployed freeing the memory within the hosting container. Another thing that happens in the background is that the Elastic Manager notices that we no longer need to run those container processes, and is begins freeing up our cloud environment from those processes automatically… You can watch the ESM logs as this happens.

What’s Next?

Mike includes a roadmap for Elastic Data Grid (Caching) as illustrated in the diagram below.

Since the report was compiled (the report was based on cutoff date of last year) we've already developed a fairly enhanced version of the Elastic Data Grid (Caching) platform. Towards our 8.0 release we’ll have full support for complete application deployment in that same model through the support of a processing unit. In this model you basically deploy an application partition in exactly the same way that you would have deployed a data-grid instance. This will enable us to reuse the same multi-tenancy, auto-scaling and self-healing abstraction that we use for the data-grid for our entire application -- in the exact same way.

Note that current and earlier version of GigaSpaces XAP already support full elastic application and data grid deployment through the SerivceGrid. The Elastic Middleware provides a much higher level of abstraction that makes elastic deployment much simpler to use and much closer to the next-generation data center/Cloud vision.

Give it a try

As noted earlier the example in this post is based on GigaSpaces. To try out the demo you’ll need to download the latest version of GigaSpaces 7.1.1 and follow the instructions above.

Comments

WTF is Elastic Data Grid? (By Example)

Forrester released their new wave report: The Forrester Wave™: Elastic Caching Platforms, Q2 2010 where they listed GigaSpaces, IBM, Oracle, and Terracotta as leading vendors in the field. In this post I'd like to take some time to explain what some of these terms mean, and why they’re important to you. I’ll start with a definition of Elastic Data Grid (Elastic Caching), how it is different then other caching and NoSQL alternatives, and more importantly -- I'll illustrate how it works through some real code examples.

Let’s start…

Elastic Data Grid (Caching) Platform Defined

Software infrastructure that provides application developers with data caching and code execution services that are distributed across two or more server nodes that 1) consistently perform as volumes grow; 2) can be scaled without downtime; and 3) provide a range of fault-tolerance levels.

Source: Forrester

I personally feel more comfortable with the term Data Grid then term Cache, because Data Grid represents the fact that it's more then a passive data service, but rather one that was designed to execute code. Note that I use the term Elastic Data Grid (Cache) throughout this post instead of just Elastic Cache.

Local Cache, Distributed Cache, and Elastic Data Grid (Cache)

Mike Gualtieri Forrester Senior Analyst provides good coverage on this topic in two posts. In the first one, titled Fast Access To Data Is The Primary Purpose Of Caching, he explains the difference between local cache (a cache that lives within your application process), distributed cache (remote data that runs on multiple servers (from 2 up to 1000s)), and elastic cache (a distributed cache that can scale on demand).

It is important to note that most Data Grids support multiple cache topologies that include local caches in conjunction with distributed caches. The use of a local cache as a front end to a distributed cache enables you to reduce the number of remote calls in scenarios where you access the same data frequently. In GigaSpaces these components are referred to as LocalCache or LocalView.

The main difference between the two approaches comes down mostly to cost/GB of data and read/write performance.

…

An analysis done recently by Stanford University, called “The Case for RAMCloud” provides an interesting comparison between the disk and memory-based approaches, in terms of cost performance. In general, it shows that cost is also a function of performance. For low performance, the cost of the disk is significantly lower then RAM-based approach, and with higher performance requirements, the RAM becomes x100 to x1000 times cheaper then regular disk.

The Elastic Data Grid as a Component of Elastic Middleware

In my opinion, an Elastic Data Grid is only one component in a much broader category that applies to all middleware services, for example: Amazon SimpleDB, SQS, MapReduce (Messaging, Data, Processing). The main difference between this sort of middleware and traditional middleware is that the entire deployment and production complexity is curved out by the service provider. Next-generation Elastic Middleware should follow the same line of thought. For this reason, I think that the Forester definition of the Elastic Data Grid (Caching) category should also include attributes such as On-Demand and Multi-tenancy as I describe below:

On-demand -– With Elastic Data Grid you don’t need to go through a manual process of installing and setting up the services cluster before you use it. Instead, the service is created by an API call. The actual physical resources are allocated based on required SLA (provided by the user) such as Capacity-Range, Scaling threshold, etc. It is the responsibility of the middleware service to manage the provisioning and allocation of those resources.

Multi-tenancy -- Elastic Data Grid is built provides built in support multi-tenancy. Different users can have their own data-grid reference while at the same time the data-grid share the resources amongst themselves. Users that want to share the same data-grid instance without sharing the data can also do that through a specialized content based security authorization.

Auto-scaling -– Users of the elastic data grid need to be able to grow the capacity of the data grid in an on-demand basis, at the same time the elastic data-grid should automatically scale-down when the resources are not needed anymore.

Always-on – In an event of a failure the elastic data-grid should handle the failure implicitly and provision new alternate resources if needed while the application is still running.

Elastic Data Grid by Example (Using GigaSpaces)

To illustrate how these characteristics works in practice, I'll use the new GigaSpaces Elastic Middleware introduced in our latest product release.

On-Demand

As with the cloud –- when we want to use a service like SimpleDB or any other service we don’t set up an entire cluster ourselves. We expect that the service will be there when we call it. Let’s see how that works.

Step 1 -– Create your own cloud

This is a one-time setup for all your users. To create your own cloud with GigaSpaces all you need to do is make sure that a GigaSpaces agent runs in each machine that belongs to the cloud. The agent is a lightweight process that enables interaction with the machine and execute provisioning commands when needed. The following snippet shows how to start the Elastic Middleware manager and agent. In our particular case, we would start a single manager for the entire cluster and an agent per machine. If you happen to use Virtual Machine you can bundle the agent with each VM in exactly the same way as you would do with regular machines. The snippet below shows the agent command would look like:

>gs-agent gsa.global.esm 1 gsa.gsc 0

What this command does is start an agent with one Elastic Manager. Note that the use of the word global means that the number of Elastic Manager (ESM) instances refers to the entire cluster and not to a particular machine. The agents will use an internal election mechanism to make sure that in each point in time there would be a single manager instance available. If the manager failed a new one will be launched automatically on one of the other agent machines.

Step 2 -- Start the Elastic Data Grid On-Demand

In this step we write a simple Java client that calls the elastic middleware and requests a data-grid service with 1G at minimum and up to 2G (I used these numbers just an illustration that works in a single machine; in a real-life scenario we would expect 10-100G per data-grid).

Capacity -– "Minimum", "Maximum" memory to be provisioned for the entire cluster. In cases where highlyAvailable is set to true, the Capacity value should take into account the double capacity required for the backups. In our case that means that setting 1G capacity with highAvliabilty will translate to 512m for data and 512m for backup. Note also that the capacity number defines the capacity for the entire data-grid cluster and not just a single instance. The difference between the Min-Max will determine the level of elasticity that we expect for our deployment.

High Availability -– When set to true, the elastic manager will provision a backup instance per partition. It will ensure that primary and backup instances wouldn’t run on the same machine. (Note that if you're running a single machine you should turn this argument to false, as otherwise the provisioning will wait infinitely until a backup machine will become available).

Public Deployment/Isolation –- This is basically one of three isolation/multi-tenancy modes that you can choose for your deployment. The default is dedicatedDeploymentIsolation which means that only one deployment can run on the same machine. In this case we chose public isolation which means that different users share the same machine for deploying multiple data-grid instances.

Maximum/Initial Heap Size -– This argument defines the capacity per deployment container. The value will be translated into the –Xms, –Xmx JVM argument for each container process.

Using the data grid

Once we have a reference to the data-grid processing unit, we can start and use it.

The next snippet shows how we can write 1000 instances of a a simple Data POJO.

The first step is to get the remote reference of GigaSpace from our deployment.

Note for the use of waitFor(..) method. The first call waits until the first instance of the space has been deployed. The second call space.waitFor(space.getTotalNumberOfInstances()) waits until the entire cluster has been provisioned successfully.

Note for the use of @SpaceRouting annotation. Space routing is used to set the routing index, or in other words, it marks the key that determines which target partition this instance should belong to. @SpaceId determines the unique identifier for this instance.

Step 3 -- Give It a Go

Now that we have our cloud running (Step 1) and the client application ready, we're ready to try out our application. All we need to do is run our client application and verify that it works.

Now, if we try to run the same application again we would see that it runs much faster. The reason is simply because in the first run we also provisioned our data grid, while in the second run the elastic middleware skipped the deployment step as it already found the data-grid services instance running.

Monitoring and Troubleshooting

To understand what happened behind the scenes we need to familiarize ourselves with the GigaSpaces monitoring tool.

You can use the gs-ui utility located in the GigaSpaces bin directory to monitor the deployment at runtime.

You can look at the ESM (Elastic Service Manager) view to see the provisioning and scaling event. You can also view the Agent (GSA) view to see the actual Java command that is passed to the agent to launch new containers. There are plenty of other views and options that will enable you to view the actual data as well as the provisioning steps.

Understanding the Cluster Sizing

In a cloud environment we normally wouldn't care too much how our service gets provisioned, in this case we would need to better understand how the provisioning algorithm works as we are going to be the owner of both the application and our own little cloud.

The initial provisioning will fork number of machines/processes based on the following inputs:

- Minimum capacity requested by the user

- JVM capacity per container

- Actual amount of memory available per machine

Number of partitions: The number of partitions is calculated as follows: With High Availability set to false, Max Capacity/Max Heap Size in our case would be 2G/250m = 8 partitions. With High Availability set to true that would be 4 partitions and 4 backup (1 per partition).

Number of JVM Containers: In our example the minimum capacity is set to 1G where the capacity per deployment unit is set to 250m. This means that we need a minimum of 4 JVM containers. If we set High Availability to true we would need the same number of containers, but would be able to host only 512m of data, because the other 512m would serve as backup.

Number of machines: The initial number of machines is a factor of minimum capacity requirements and the actual memory available in each machine. In our example we deliberately chose numbers that fit into a single small machine. If the heap size per JVM container is smaller than the machine available heap size, the Elastic Manager will first try to host more containers to maximize the utilization per machine before it calls for a new machine.

Indeed, if we look at at the deployment view of the gs-ui, we see that in the given case there are currently 8 partitioned being provisioned.

If we examine the host view we see that there are indeed 4 JVM containers being provisioned as expected for the initial deployment.

Multi-Tenancy/AutoScaling, Always-On

Elastic middleware supports multi-tenancy and auto-scaling, and it will also keep our data-grid service always-on in case of failure. I've already covered in greater detail how multi-tenancy works here, as well as how high availability works, not just in a local environment but across a disaster recovery site as described here. That said, trying out these features in the context of this example shouldn’t be that hard. You are welcome to play with the isolation argument (Private, Public, Shared), deploy multiple data-grids in the same environment and see what happens. You can also try and break the machines and containers and see how the system heals itself while keeping your application running. You can also launch new containers explicitly (right-click the agent in the management console) or just start a new machine and see how the system scales-out on-demand.

Scaling Down

One thing that is worth emphasizing is how the system scales-down (not just auto-scale). The simplest way to experience how this feature works is to un-deploy our entire data-grid, which you can do by simply calling the undeploy() method:

pu.undeploy();

Another option would from the Administration console: Select the deployment tab and un-deploy the data-grid service from the console.

Now watch what happens to our data-grid instances -- but not only to the data-grid instances but to our containers (processes) as well. Let me explain:

When we un-deploy the data-grid the first thing that happens is that the partitions (the actual data content) get un-deployed freeing the memory within the hosting container. Another thing that happens in the background is that the Elastic Manager notices that we no longer need to run those container processes, and is begins freeing up our cloud environment from those processes automatically… You can watch the ESM logs as this happens.

What’s Next?

Mike includes a roadmap for Elastic Data Grid (Caching) as illustrated in the diagram below.

Since the report was compiled (the report was based on cutoff date of last year) we've already developed a fairly enhanced version of the Elastic Data Grid (Caching) platform. Towards our 8.0 release we’ll have full support for complete application deployment in that same model through the support of a processing unit. In this model you basically deploy an application partition in exactly the same way that you would have deployed a data-grid instance. This will enable us to reuse the same multi-tenancy, auto-scaling and self-healing abstraction that we use for the data-grid for our entire application -- in the exact same way.

Note that current and earlier version of GigaSpaces XAP already support full elastic application and data grid deployment through the SerivceGrid. The Elastic Middleware provides a much higher level of abstraction that makes elastic deployment much simpler to use and much closer to the next-generation data center/Cloud vision.

Give it a try

As noted earlier the example in this post is based on GigaSpaces. To try out the demo you’ll need to download the latest version of GigaSpaces 7.1.1 and follow the instructions above.