Compuware puffs up Outage Analyzer to fight performance anxiety

Ad serving and web analytics services have high failure rate... who knew?

Compuware is taking another stab at making the data gathered from users of its Gomez performance monitoring network available on a freebie basis in a bid to get IT shops hooked on using the more sophisticated and definitely not free tools.

The company has also upgraded its application performance monitoring tools for mainframes, which will allow it to reach down into CICS/COBOL apps – which are often the back-end systems on many webby apps – and determine where the bottlenecks are.

This is not the first freebie service that Compuware has put out there as a bit of bait for the Gomez services. CloudSleuth, which was announced by Compuware in November 2010, monitors the performance of public clouds such as Amazon AWS, Google App Engine, Microsoft Azure, Rackspace Cloud, GoGrid, SoftLayer, Terremark, BlueLock and others as well as content delivery networks. It runs a simulated and fairly primitive ecommerce web benchmark on a number of sites, then smacks those simulated storefronts from locations around the web and measures how well or poorly they are doing.

With Outage Analyzer, which is also a freebie service that rides on the Gomez network, Compuware is not restricting itself to the ups and downs of clouds and CDNs. It is examining outages at the web services level and then drilling into its data to assess what has caused an outage and which apps out there on the intertubes are affected by it.

The issue is complex, Mike Allen, strategic director of APM tools at Compuware, tells El Reg. A single page on the USA Today online newspaper, for instance, pulls in content from over 30 different services, so an outage on any one of them can cause the page to fail. Using data derived from Gomez subscribers, who are using the SaaS APM tools to monitor networks and applications for customers, Compuware can drill down into each page view and see how many different elements of each page of each website or application are failing. It then takes its best guess, based on the data coming out of those apps and websites, at what has caused the outage.

Using data from its Gomez Synthetic Monitoring service, Compuware finds that the average web page in the developed economies has anywhere from around 7 to 12 services linked into it, with the United States and China being at the high end of the curve. In a single day, Gomez subscribers hitting the web hit around 1,500 unique services from the 160,000 nodes where Gomez is monitoring. About half of traffic is for ad serving and a quarter for web analytics. A relatively small portion (about 12 per cent) is actually for web page component serving, and a teeny tiny slice is for search. Interestingly, ad serving and web analytics services have a very high failure rate (shhhh), and of the 100,000 failures that Gomez is tracking a day, about 35,000 come from failed ads and 25,000 come from failed web analytics. (If you do the math, the shares that ad serving and analytics have of traffic volume and of failures are roughly proportional.)
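For anyone who wants to do that math, here is a minimal sketch. It uses only the rounded figures quoted above, and the variable and function names are illustrative assumptions, not anything from Compuware or Gomez.

```python
# Rounded figures quoted in the article; all names here are illustrative.
traffic_share = {"ad serving": 0.50, "web analytics": 0.25}
daily_failures = {"ad serving": 35_000, "web analytics": 25_000}
total_failures = 100_000


def failure_share(category):
    """Fraction of all tracked daily failures attributed to this category."""
    return daily_failures[category] / total_failures


def failure_to_traffic_ratio(category):
    """1.0 means a category fails exactly in proportion to its traffic."""
    return failure_share(category) / traffic_share[category]


for category in traffic_share:
    print(
        f"{category}: {traffic_share[category]:.0%} of traffic, "
        f"{failure_share(category):.0%} of failures, "
        f"ratio {failure_to_traffic_ratio(category):.2f}"
    )
```

On these rounded numbers, web analytics fails in line with its share of traffic, while ad serving accounts for a somewhat smaller share of failures than of traffic, so "roughly" is the operative word.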

But other things fail, too, such as cloud compute and storage services or their networking or CDN front ends. For instance, on September 28, a chunk of Amazon's S3 storage cloud running in its Oregon data center was offline for eight hours and about 560 different web applications were affected. When Facebook goes down, about 5,000 applications that are dependent on Zuck & Co go down, says Allen. A typical outage tends to affect between 50 and 100 applications, based on the Gomez data.

Compuware Outage Analyzer during an Amazon S3 outage on September 28

Outage Analyzer, which is in beta today, does 1 billion web page and object measurements per day, which are extracted out of the Gomez service and sucked into a Hadoop cluster via the Flume add-on, which does bulk moves of data from applications into the Hadoop Distributed File System. This system is collecting about 700,000 data points per minute. At the moment Outage Analyzer is just seeing what is up and what is down, but Allen says that in the long term the tool will be extended to do performance analysis across the pages that Gomez subscribers hit. The system is designed to suck in raw data coming in from the Gomez SaaS service and provide a visualization of the outage as well as a probable cause within five minutes.
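The two throughput figures can be checked against each other with a quick calculation. This sketch assumes the per-minute and per-day numbers describe the same stream of measurements, which the article does not state outright; the names are illustrative.

```python
# Figures quoted in the article; treating them as the same stream is an assumption.
MINUTES_PER_DAY = 24 * 60
points_per_minute = 700_000
measurements_per_day = 1_000_000_000

# Daily volume implied by the per-minute rate.
implied_per_day = points_per_minute * MINUTES_PER_DAY
print(implied_per_day)  # 1008000000

# Per-minute rate implied by the daily total, rounded to a whole number.
implied_per_minute = round(measurements_per_day / MINUTES_PER_DAY)
print(implied_per_minute)
```

The per-minute rate works out to just over a billion a day, so the two rounded figures are consistent with each other.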

Peeking into that big iron

In addition to rolling out Outage Analyzer, Compuware has also tweaked its application performance monitoring tools for mainframes. IBM mainframe hardware and operating systems may be legendary in terms of uptime, but mainframe application software is as subject to the vagaries of the human condition as any other software.

Specifically, Compuware is updating the PurePath APM tools it received when it acquired rival DynaTrace for $256m back in July 2011. The PurePath tools, which have been woven into the Compuware APM for Mainframe stack, were already able to reach down into Java virtual machines and the WebSphere application server to tell customers what was up with their Java apps running on the z/OS operating system. But with the latest update to the PurePath tools, agents can now babysit applications written in COBOL using the CICS transaction monitor and either VSAM flat files or DB2 databases. (IMS database support is in the works.) The updated tool can monitor local CICS instances or those running across a cluster of servers, known as a CICSPlex and using IBM's Parallel Sysplex clustering as a transport.

Compuware does not reveal pricing on its mainframe tools, but does say that they are metered prices based on IBM's million service units (MSU) relative rating scheme for mainframe engines. The PurePath tool can also link into the Strobe analyzer to peer into WebSphere MQ and message brokering and enterprise service bus software to see what is going on in there. ®