
Flemming T Christensen on Quality Collaboration

About this blog:
This blog focuses on software quality in general, and IBM Collaboration Solutions offerings in particular. The author is an IBM employee, but expresses his observations and opinions as an individual here. The purpose of the blog is to nurture a conversation with our customers and partners about continuous improvement of our software-based offerings. ~FTC.

Cloud Difference #19: Monitoring is central

The discipline of monitoring is not new, but it takes on renewed importance in the cloud. Our on-premises customers monitor their environments too. Some applications have built-in monitoring capabilities, such as Domino Domain Monitoring (DDM) probes. Others may use separate Tivoli monitoring tools like IBM Tivoli Composite Application Manager (ITCAM), open source tools like Nagios, or even third-party monitoring services. The cost reduction focus of cloud offerings drives a need for effective monitoring, as we run systems closer to their capacity limits than is typical in on-premises environments. Monitoring has many targets in a complex environment: availability, basic resources (memory, CPU, and disk utilization), queue lengths, bottlenecks, application-level parameters, response times, log entries, and more. Building more sophisticated application-level monitoring capabilities for our hosted LotusLive (SmartCloud) environment also benefits our on-premises products, which can integrate the same monitoring capabilities and make them available to on-premises customers.
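To make the basic resource checks concrete, here is a minimal sketch of a Nagios-style disk utilization check. It is an illustration only, not our actual tooling; the thresholds are hypothetical, and the exit codes follow the common Nagios plugin convention (0 = OK, 1 = WARNING, 2 = CRITICAL).

    #!/usr/bin/env python
    # Minimal Nagios-style disk check (illustrative sketch only).
    import shutil
    import sys

    WARN_PCT = 80   # hypothetical warning threshold
    CRIT_PCT = 90   # hypothetical critical threshold

    def check_disk(path="/"):
        usage = shutil.disk_usage(path)
        pct = usage.used / usage.total * 100
        if pct >= CRIT_PCT:
            print("CRITICAL: %s at %.1f%% used" % (path, pct))
            return 2
        if pct >= WARN_PCT:
            print("WARNING: %s at %.1f%% used" % (path, pct))
            return 1
        print("OK: %s at %.1f%% used" % (path, pct))
        return 0

    if __name__ == "__main__":
        sys.exit(check_disk())

A monitoring system schedules such checks periodically and raises a notification when the returned state changes.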

We expand monitoring systems intelligently to go beyond mere notification that a threshold has been exceeded or a specific service is no longer available, toward predictive capabilities. In basic resource monitoring of CPU, memory, and disk utilization, for example, alerts can be set on adverse trends in a particular resource, which warrant a closer look before a potential incident occurs in the environment. Similarly, analyzing log entries recorded ahead of observed incidents in both test and production environments helps us develop predictive capability, so that application-level monitoring too can alert us before an incident occurs, not just after. That way, preventive action can be taken to avoid the incident. We have built, and continue to extend, a sophisticated set of monitoring mechanisms around our LotusLive (SmartCloud) offerings, which helps us take preemptive action and keep systems operational.
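As a rough sketch of what trend-based alerting can look like (an illustration under simple assumptions, not our production code), the following fits a straight line to recent disk utilization samples and estimates when the disk would fill at the current rate. The sample data and the 48-hour alert horizon are made up.

    # Illustrative trend alert: fit a line to recent utilization samples
    # and predict when usage would cross 100% at the current rate.

    def hours_until_full(samples):
        """samples: list of (hour, percent_used) pairs, oldest first."""
        n = len(samples)
        mean_t = sum(t for t, _ in samples) / n
        mean_u = sum(u for _, u in samples) / n
        # Least-squares slope: percent used per hour.
        num = sum((t - mean_t) * (u - mean_u) for t, u in samples)
        den = sum((t - mean_t) ** 2 for t, _ in samples)
        slope = num / den
        if slope <= 0:
            return None  # flat or improving trend; nothing to predict
        latest_t, latest_u = samples[-1]
        return (100.0 - latest_u) / slope

    # Hourly disk utilization samples (hour, percent used) -- hypothetical.
    history = [(0, 62.0), (1, 63.1), (2, 64.4), (3, 65.2), (4, 66.8)]
    eta = hours_until_full(history)
    if eta is not None and eta < 48:
        print("ALERT: disk projected to fill in %.0f hours" % eta)

The point is that the alert fires while utilization is still comfortably below any hard threshold, leaving time for preventive action.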

Another key aspect of monitoring for cloud systems is the differentiation between monitoring in the data center and monitoring from representative end-user locations to "see what users see". Proxies, edge caching, network routes and latencies, etc., all contribute to an end-user experience that differs from what is seen when monitoring the systems themselves. It is essential to do both. After the earthquake in Japan in March of this year (2011), a disk array in a network acceleration router in Tokyo gradually deteriorated over a span of several days. It worked fine immediately after the quake, so it was assumed at first that it was not damaged. Monitoring solely at the data center would not have revealed that the subset of users accessing the system via this router was affected a few days later.
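A synthetic end-user probe can be as simple as timing a request from a representative location and comparing the result against measurements taken inside the data center. The sketch below illustrates the idea; the URL and the 2-second threshold are hypothetical placeholders.

    # Illustrative synthetic probe: run from a representative end-user
    # location; compare timings with in-data-center measurements.
    import time
    import urllib.request

    URL = "https://example.com/login"  # placeholder service endpoint

    start = time.time()
    try:
        with urllib.request.urlopen(URL, timeout=10) as resp:
            status = resp.status
    except Exception as exc:
        print("PROBE FAILED: %s" % exc)
    else:
        elapsed = time.time() - start
        state = "OK" if status == 200 and elapsed < 2.0 else "SLOW"
        print("%s: HTTP %d in %.2f s" % (state, status, elapsed))

Run from several geographies, probes like this would have flagged the Tokyo users' deteriorating experience even while the data center metrics looked healthy.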

Similar concerns can play out for global enterprises with geographically dispersed users in their on-premises environments. The monitoring challenges are in many ways similar between cloud services and large-scale on-premises environments. The main differences derive from the scale involved and from the variability of network routes, since most of the traffic traverses the internet.

PS: To sort the blog and display just the ‘Cloud Difference’ series, click on the “cloud_difference” tag below the title of any post in the series.