September 11, 2009

In Part I of this topic, I covered the general concept and motivation behind the "Interactive Cloud". I described how the interactive cloud enables applications to be managed in much the same way we run our business. In this post, I discuss in more detail what it means to add operational awareness to the application, drawing on our specific experience at GigaSpaces. Towards the end of the post, I walk through specific code snippets that illustrate the model.

Background

The GigaSpaces 7.0 Application Cluster Management API is an API for managing application clusters and for giving applications operational awareness. The reason we came up with this API in the first place is quite interesting. When we started GigaSpaces, a typical deployment was a primary-backup pair with multiple clients connected to it. Sooner than we anticipated, we found ourselves supporting cluster deployments of tens and hundreds of nodes per application. We are now at a point where customers are starting to use GigaSpaces XAP as a service shared between multiple applications, which in itself increases the deployment size quite significantly.

In this type of environment, troubleshooting a production system can be a real nightmare for everyone involved. We started to add all sorts of metrics and logging information into our system but it hardly improved things. The process of finding the right information at the level we needed was still a heavy-duty task.

When we looked back and analyzed how we learned about the application’s behavior, we saw that quite often we first identified the time when things started to go wrong, and then, using the timestamps on the different operations, ran a complex correlation process to see what happened on each machine during that time. In many cases, when we hit the first area of trouble, we were still missing information, because for performance reasons we were running without fine-grained logging. So we would turn up the log level and wait for the next event, and so forth.

We realized that just adding more probes and log levels, and continuously using some sort of CEP (Complex Event Processing) to parse all the garbage we were going to generate, was not going to be good enough. Even worse, if the reporting system got that complex, we would eventually end up going back to the old logs, because the reporting system would inevitably lose some of our events :)

We figured that we could not afford to continue down that path – the cost, both on our end and on the customer’s end, was way too high. We needed to try a totally new approach.

The Ideal Solution – Getting Help from the Applications Themselves

When we analyzed what would be the ideal process, we came to the realization that the application should be responsible for the monitoring. When the application starts to behave abnormally (for example, CPU spikes, GC spikes), or experiences network failure, it should do two things:

Take corrective action – the application should figure out if the problem is that resources have been exhausted. If that is the case, it should try to rebalance itself to reduce the load automatically.

Provide information to help analyze the root cause of the event that occurred – the application should point out the suspect machines and ask the server instances on them to increase their logging level and print a thread dump snapshot. This yields only the information that is actually needed to perform root cause analysis, and at the right level of granularity.
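To make the second step more concrete, here is a minimal Java sketch of an instance raising its own log level and capturing a thread dump once a CPU reading crosses a threshold. It uses only standard JDK classes (java.util.logging and ThreadMXBean). The class name, the 0.9 threshold, and passing the CPU reading in as a parameter are assumptions made for illustration; this is not the GigaSpaces API.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.util.logging.Level;
import java.util.logging.Logger;

// Hypothetical helper, for illustration only.
class DiagnosticEscalation {
    // Assumed threshold: treat a CPU load of 90% or more as a spike.
    static final double CPU_SPIKE_THRESHOLD = 0.9;

    // Returns true if the reading was abnormal and diagnostics were captured.
    static boolean escalateIfAbnormal(double cpuLoad, Logger logger) {
        if (cpuLoad < CPU_SPIKE_THRESHOLD) {
            return false; // normal operation: leave the log level untouched
        }
        // Switch this logger to fine-grained output only once there is a problem.
        logger.setLevel(Level.FINE);
        // Snapshot all live threads (monitor and synchronizer details are skipped).
        ThreadInfo[] threads =
                ManagementFactory.getThreadMXBean().dumpAllThreads(false, false);
        for (ThreadInfo thread : threads) {
            logger.fine(thread.toString());
        }
        return true;
    }

    public static void main(String[] args) {
        Logger logger = Logger.getLogger("diagnostics.demo");
        System.out.println(escalateIfAbnormal(0.35, logger)); // below the threshold
        System.out.println(escalateIfAbnormal(0.97, logger)); // above the threshold
    }
}
```

With the default JDK logging configuration the console handler only prints INFO and above, so a handler set to FINE would have to be attached to actually see the dump.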

It wasn’t surprising that some of the root cause analysis indicated that one of the biggest contributors to failures was human error or misconfiguration. That led us to the realization that the way to reduce those factors would be through complete automation of manual deployment and maintenance processes.

Another thing we realized during that process is that when things went wrong, quite often the operations guys became fairly useless at a fairly early stage, and the application guys had to be called in to understand what happened. Not surprisingly, the application guys didn’t even know how to get the required information, or didn’t have permission to get it, and had to rely on someone else to do that for them – which made the entire process really long and complex.

All of this led us to the realization that we needed to create an application cluster management and monitoring tool that could be used by both the application developers and the operations guys. The application guys could use that API as part of their development cycle, and the operations guys could use it to feed the right level of information into their management and control dashboards.

In the following section, I provide a short sneak peek into this API, and how it is structured.

There are basically two ways to design the API for this challenge: define the interaction model and drive the API from there, or define the domain model and drive the interaction model from the domain model. We chose the latter.

1. Defining the Application Management Domain Model

The first step was to define the domain model of our application management. The diagram below maps out the “nervous system” of an application running on XAP – it shows the core layers of the application management domain model.

2. Defining the Interaction Model – the API

Once we mapped the nerves and sensors of our application, we could start to wire them. This enabled us not only to monitor the state of the application components, but also to interact with them when we needed to take corrective actions (see snippet below).

3. Making Operational Awareness an Integral Part of the Development Framework

If we want the application to take more responsibility for its operational behavior, we can’t just point it to an external management and monitoring tool that doesn’t speak the “developer language”. We need to make the proactive and reporting behavior an integral part of the development framework. This is illustrated in the examples below.

Application Code that uses the Cluster Management API - a Few Examples

Below are a few examples that show how applications interact with the new Administration API, using a Groovy-based API. These snippets were written in Groovy, but the API can be called from Java as well.

Example 1 – Getting Statistics

This example uses Groovy closures to get callbacks when a new container or machine joins the network, and when a specific application event happens.
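As a rough illustration of this callback style, here is a minimal, self-contained Java sketch in which lambdas play the role of the Groovy closures. The `ClusterEvents` class and its method names are hypothetical stand-ins invented for this example, not the GigaSpaces Administration API, and the "discovery" events are simulated by direct method calls.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// Hypothetical stand-in for a cluster admin object; not the GigaSpaces API.
class ClusterEvents {
    private final List<Consumer<String>> machineAdded = new ArrayList<>();
    private final List<Consumer<String>> containerAdded = new ArrayList<>();

    // Register callbacks to be invoked when a machine or container joins.
    void onMachineAdded(Consumer<String> callback) { machineAdded.add(callback); }
    void onContainerAdded(Consumer<String> callback) { containerAdded.add(callback); }

    // In this sketch the discovery layer is simulated by calling these directly.
    void machineJoined(String host) { machineAdded.forEach(c -> c.accept(host)); }
    void containerJoined(String id) { containerAdded.forEach(c -> c.accept(id)); }
}

class ClusterEventsDemo {
    // Registers two callbacks, simulates two join events, and returns what was recorded.
    static List<String> run() {
        List<String> log = new ArrayList<>();
        ClusterEvents events = new ClusterEvents();
        events.onMachineAdded(host -> log.add("machine added: " + host));
        events.onContainerAdded(id -> log.add("container added: " + id));
        events.machineJoined("host-1");
        events.containerJoined("container-1");
        return log;
    }

    public static void main(String[] args) {
        run().forEach(System.out::println);
    }
}
```

The point of the design is that the application registers interest once and is then told about topology changes, rather than polling for them.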

Example 2 – Calculating Average Load on a Web Container

This code snippet shows how to get an aggregated result of the total number of requests/sec from all the web containers in the cluster.
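The aggregation itself can be sketched in a few lines of Java. The `WebContainerStats` type and the sample numbers below are assumptions for illustration; in a real deployment the per-container figures would come from the management API rather than being constructed by hand.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical per-container statistics sample; not a GigaSpaces type.
class WebContainerStats {
    final String containerId;
    final double requestsPerSecond;

    WebContainerStats(String containerId, double requestsPerSecond) {
        this.containerId = containerId;
        this.requestsPerSecond = requestsPerSecond;
    }
}

class WebLoad {
    // Sum of requests/sec across all web containers in the cluster.
    static double totalRequestsPerSecond(List<WebContainerStats> stats) {
        return stats.stream().mapToDouble(s -> s.requestsPerSecond).sum();
    }

    // Average load per container; defined as 0 for an empty cluster.
    static double averageRequestsPerSecond(List<WebContainerStats> stats) {
        return stats.isEmpty() ? 0.0 : totalRequestsPerSecond(stats) / stats.size();
    }

    public static void main(String[] args) {
        List<WebContainerStats> stats = Arrays.asList(
                new WebContainerStats("web-1", 120.0),
                new WebContainerStats("web-2", 80.0),
                new WebContainerStats("web-3", 100.0));
        System.out.println("total requests/sec: " + totalRequestsPerSecond(stats));
        System.out.println("average requests/sec: " + averageRequestsPerSecond(stats));
    }
}
```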

Example 3 – Scaling Out

The code snippet below shows the process of scaling out, when the number of requests exceeds the maximum number of requests allowed per instance.

Note that the call to incrementInstance() triggers an entire match-making process behind the scenes. A new web container is spawned, but the provisioning of this new container is based on the SLA that was defined at deployment time. In our case, it is deployed only on machines that don’t already run a web container and that have spare CPU and memory capacity.
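The decision and the match-making rule described above can be sketched as follows. This is a simplified model under stated assumptions: the `Host` and `ScaleOutPolicy` types, the 80% CPU ceiling, and the 512 MB memory floor are all invented for illustration, and the real provisioning is done by the platform according to the deployed SLA, not by code like this.

```java
import java.util.List;
import java.util.Optional;

// Hypothetical view of a machine in the cluster; not a GigaSpaces type.
class Host {
    final String name;
    final double cpuUtilization; // 0.0 to 1.0
    final long freeMemoryMb;
    boolean runsWebContainer;

    Host(String name, double cpuUtilization, long freeMemoryMb, boolean runsWebContainer) {
        this.name = name;
        this.cpuUtilization = cpuUtilization;
        this.freeMemoryMb = freeMemoryMb;
        this.runsWebContainer = runsWebContainer;
    }
}

class ScaleOutPolicy {
    // Assumed SLA thresholds for "spare CPU and memory capacity".
    static final double MAX_CPU_UTILIZATION = 0.8;
    static final long MIN_FREE_MEMORY_MB = 512;

    // Scale out when the average load per instance exceeds the allowed maximum.
    static boolean needsScaleOut(double totalRps, int instances, double maxRpsPerInstance) {
        return instances > 0 && totalRps / instances > maxRpsPerInstance;
    }

    // First host with no web container and with spare CPU and memory.
    static Optional<Host> pickHost(List<Host> hosts) {
        return hosts.stream()
                .filter(h -> !h.runsWebContainer)
                .filter(h -> h.cpuUtilization < MAX_CPU_UTILIZATION
                        && h.freeMemoryMb >= MIN_FREE_MEMORY_MB)
                .findFirst();
    }

    // Returns the name of the host that received a new instance, if any.
    static Optional<String> scaleOutIfNeeded(List<Host> hosts, double totalRps,
                                             double maxRpsPerInstance) {
        int instances = (int) hosts.stream().filter(h -> h.runsWebContainer).count();
        if (!needsScaleOut(totalRps, instances, maxRpsPerInstance)) {
            return Optional.empty();
        }
        Optional<Host> target = pickHost(hosts);
        target.ifPresent(h -> h.runsWebContainer = true); // simulate provisioning
        return target.map(h -> h.name);
    }
}
```

Calling `scaleOutIfNeeded` repeatedly with the same total load stops adding instances once the per-instance average drops back under the limit, or when no eligible host is left.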

Initial Success Stories

Since we came up with the API, we have seen its use spreading into areas we didn’t even think of. Here are a few examples:

Helping customers reproduce problem scenarios – now that we can script almost every part of the application, we can easily script even very complex scenarios, where the application starts to execute a few things and then crashes only when a certain operation happens, etc. The fact that we can script these scenarios makes it easier to create a reproduction scenario, and from there, the road to spotting the problem and fixing it becomes significantly shorter (see detailed reference here).

Third-party integration – one of our partners created a fairly extensible Eclipse plug-in. The reason he was able to do that pretty much by himself is that the API gave him everything he needed to control the application development and testing lifecycle (see more details in Jeroen Remmerswaal's post here).

Writing a complex SLA – we were able to easily write complex SLAs, for example, automating deployment of applications across multiple data centers and making sure that primaries and backups never run on the same data center. All that without any human intervention.
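To give a feel for the kind of rule involved, here is a minimal Java sketch of the "primaries and backups never share a data center" constraint, expressed as a placement choice plus a check. All type names, method names, and data center names are hypothetical; the actual SLA is defined through the product's own deployment descriptors and API.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical record of where one partition's primary and backup live.
class PartitionPlacement {
    final int partitionId;
    final String primaryDataCenter;
    final String backupDataCenter;

    PartitionPlacement(int partitionId, String primaryDataCenter, String backupDataCenter) {
        this.partitionId = partitionId;
        this.primaryDataCenter = primaryDataCenter;
        this.backupDataCenter = backupDataCenter;
    }
}

class DataCenterSeparationSla {
    // Picks the first data center that differs from the primary's; fails if none exists.
    static String chooseBackupDataCenter(String primaryDataCenter, List<String> dataCenters) {
        for (String dataCenter : dataCenters) {
            if (!dataCenter.equals(primaryDataCenter)) {
                return dataCenter;
            }
        }
        throw new IllegalStateException("no second data center available for the backup");
    }

    // Returns the ids of partitions whose primary and backup share a data center.
    static List<Integer> findViolations(List<PartitionPlacement> placements) {
        List<Integer> violations = new ArrayList<>();
        for (PartitionPlacement placement : placements) {
            if (placement.primaryDataCenter.equals(placement.backupDataCenter)) {
                violations.add(placement.partitionId);
            }
        }
        return violations;
    }

    public static void main(String[] args) {
        List<String> dataCenters = Arrays.asList("london", "new-york");
        System.out.println(chooseBackupDataCenter("london", dataCenters));
        System.out.println(findViolations(Arrays.asList(
                new PartitionPlacement(1, "london", "new-york"),
                new PartitionPlacement(2, "london", "london"))));
    }
}
```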

What’s Coming Next?

One of the immediate enhancements we are planning is to provide more built-in SLAs, along with a set of utilities that handle deployment and implementation of mainstream scenarios, such as auto-scaling of web applications and automatic rebalancing of data grid deployments. This is going to serve as a cornerstone of our new cloud-enabled middleware initiative, where all the classic, enterprise-grade middleware components become available as a service. In fact, some of our customers are already using our In-Memory Data Grid as a shared service between multiple applications within the organization. The idea is to make that type of deployment simpler and more intuitive.

Take Part in this New Initiative

I could go on with more of my thoughts on what should be done, but I’d rather stop here and ask all of you for your feedback. If you’ve faced similar challenges, how do you see this solution? What would you want to see in this new package?

For those who want to try out this new API, it is worth noting that it is actually provided for free as part of our Community Edition, so anyone can simply download it and try out the examples from this post.
