Tracks cloud and third-party web service outages with instant notification of cause and impact

Compuware, the technology performance company, today announced a new-generation performance analytics solution that raises the intelligence of software-as-a-service (SaaS) application performance management (APM).

Outage Analyzer provides real-time visualizations and alerts of outages in third-party web services that are mission critical to web, mobile and cloud applications around the globe. Compuware is providing this new service free of charge.

Utilizing cutting-edge big data technologies and a proprietary anomaly detection engine, Outage Analyzer correlates more than eight billion data points per day. The data is collected from the Compuware Gomez Performance Monitoring Network of more than 150,000 test locations and yields information on specific outages, including the scope, duration and probable cause of each event, all visualized in real time.

“Compuware’s new Outage Analyzer service is a primary example of the emerging industry trend toward applying big data analytics technologies to help understand and resolve application performance and availability issues in near real-time,” said Tim Grieser, Program VP, Enterprise System Management Software at IDC. “Outage Analyzer’s ability to analyze and visualize large masses of data, with automated anomaly detection, can help IT and business users better understand the sources and causes of outages in third-party web services.”

Cloud and third-party web services allow organizations to rapidly deliver a rich user experience, but also expose web and mobile sites to degraded performance—or even a total outage—should any of those components fail. Research shows that the typical website has more than ten separate hosts contributing to a single transaction, many of which come from third-party cloud services such as social media, ecommerce platforms, web analytics, ad servers and content delivery networks.

Outage Analyzer addresses this complexity with the following capabilities:

Incident Visualization: Issues with third-party services are automatically visualized on Outage Analyzer’s global map view. This view displays information on the current status, impact—based on severity and geography—and duration, along with the certainty and probable cause of the outage. Outage Analyzer also provides a timeline view that shows the spread and escalation of the outage. The timeline has a playback feature to replay the outage and review its impact over time.

Incident Filtering and Searching: With Outage Analyzer, users can automatically view the most recent outages, filtered by severity of impact, or search for outages in specific IPs, IP ranges or service domains. This allows users to find the outages in services that are potentially impacting their own applications.

Alerting: Users can sign up to receive alerts automatically through RSS and Twitter feeds, and can specify the exact types of incidents to be alerted on, such as the popularity of the third-party web service provider, the certainty of an outage and the geographical region impacted. Alerts contain links to the global map view and details of the outage. This provides an early-warning system for potential problems.

Performance Analytics Big Data Platform: Utilizing cutting-edge big data technologies in the cloud, including Flume and Hadoop, Outage Analyzer collects live data from the entire Gomez customer base and Gomez Benchmark tests, processing more than eight billion data points per day. The processing from raw data to visualization and alerting on an outage all happens within minutes, making the outage data timely and actionable.

Anomaly Detection Algorithms: At the heart of Outage Analyzer’s big data platform is a proprietary anomaly detection engine that automatically identifies availability issues with third-party web services that are impacting performance of the web across the globe. Outage Analyzer then correlates the outage data, identifies the source of the problem, calculates the impact and lists the probable causes — all in real-time.

“Since Outage Analyzer has been up and running, we’ve seen an average of about 200 third-party web service outages a day,” said Steve Tack, Vice President of Product Management for Compuware’s APM business unit. “Outage Analyzer is just the beginning. Our big data platform, proprietary correlation and anomaly detection algorithms, and intuitive visualizations of issues with cloud and third-party web services are key building-blocks to delivering a new generation of answer-centric APM.”

Outage Analyzer harnesses the collective intelligence of the Compuware Gomez Network, the largest and most active APM SaaS platform in the world. Now eight billion measurements a day across the global Internet can be harnessed by any organization serious about delivering exceptional web application performance. Determining whether an application performance issue is the fault of an organization’s code or of a third-party service has never been easier.

Compuware APM® is the industry’s leading solution for optimizing the performance of web, non-web, mobile, streaming and cloud applications. Driven by end-user experience, Compuware APM provides the market’s only unified APM coverage across the entire application delivery chain—from the edge of the internet through the cloud to the datacenter. Compuware APM helps customers deliver proactive problem resolution for greater customer satisfaction, accelerate time-to-market for new application functionality and reduce application management costs through smarter analytics and advanced APM automation.

See what the experts are saying about the APM market. Read Gartner’s comprehensive 2011 “Magic Quadrant for Application Performance Monitoring (APM)” report. It evaluates 29 vendors on completeness of vision and ability to execute.

When I saw this report, I was expecting it to be downloadable.

So here is my downloadable PDF version, which might be a little more manageable.

“You have to define availability,” said Imad Mouline, CTO of Compuware’s APM Solution. “What are the tasks and services that customers need? A service does not only have to be accessible, but working. Can an entire task be completed? That’s how you should define service level agreements.”

Mickey Zandi, managing principal, Consulting Services, at SunGard Availability Services, agreed. “Uptime is always driven by the business and supported by IT,” he said. “To determine availability metrics, we first interview the business and then the IT team. We identify the core business measures for success, which typically revolve around revenue, cost and profit. Next, we identify what are the infrastructure components that drive those mission-critical applications and measure the business impact of downtime.”

This focus on what the end user wants hasn’t always been the defining factor in calculating system availability. Mouline explained that, in the past, service availability concentrated on specific areas of infrastructure. If a server was pingable, it was working. “Whether a server is up is interesting,” he said, “but not necessarily relevant from a business perspective.”

Steve Shalita, VP Marketing at NetScout, has also seen a shift in the perception of availability.

“Outright failures are fairly rare these days,” he explains. “Things are being built to avoid outages, equipment is built to maintain standards of availability. Many view degradation in the same way as an outage, and it can be much more impactful.”

Previously, Shalita added, downtime was almost always caused by network problems. Today, he sees many more issues with application or server configuration although “the network” still gets blamed for performance problems. Availability measures have to take into account every element that contributes towards the user experience.

Measures of availability

Once availability is defined in a way that is meaningful to you, it’s then possible to measure it. The standard approach is to express it as a percentage. “The holy grail in the enterprise is five nines,” said Shalita. That’s 99.999% available. The small window when services are not available equates to just over five minutes per year.
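The five-nines arithmetic can be checked with a short Python sketch. The function name and the 365-day year are assumptions made for illustration, not part of any vendor tool.

```python
# Assumption: a 365-day year (leap years ignored).
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes


def downtime_minutes_per_year(availability_percent: float) -> float:
    """Return the yearly downtime, in minutes, allowed by an availability target."""
    if not 0 <= availability_percent <= 100:
        raise ValueError("availability must be between 0 and 100")
    unavailable_fraction = (100 - availability_percent) / 100
    return MINUTES_PER_YEAR * unavailable_fraction


if __name__ == "__main__":
    # Print the downtime budget for two through five nines.
    for target in (99.0, 99.9, 99.99, 99.999):
        minutes = downtime_minutes_per_year(target)
        print(f"{target}% available -> {minutes:.2f} minutes of downtime per year")
```

Each extra nine cuts the downtime budget by a factor of ten, which is why five nines leaves only a few minutes a year for every failure and every maintenance task combined.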

“Response time is another measure,” said Mouline. “How long did it take for the transaction to go through?” This is important because consistency is what counts for users. If it takes you half a second to process a transaction today and three seconds tomorrow, you’ll soon start to feel that the system is unreliable.
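One way to put a number on that consistency is to report the spread of response times alongside the average. The sketch below is illustrative only: the sample values are hypothetical, taken from the half-second and three-second example above, and the function name is an assumption.

```python
import statistics


def response_time_summary(samples_seconds):
    """Summarise response times: average, spread and worst case, in seconds."""
    if not samples_seconds:
        raise ValueError("at least one sample is required")
    return {
        "mean": statistics.mean(samples_seconds),
        # Population standard deviation: 0.0 means perfectly consistent.
        "stdev": statistics.pstdev(samples_seconds),
        "worst": max(samples_seconds),
    }


if __name__ == "__main__":
    steady = [0.5, 0.5, 0.5]   # hypothetical: the same speed every day
    erratic = [0.5, 0.5, 3.0]  # hypothetical: one slow day
    print("steady: ", response_time_summary(steady))
    print("erratic:", response_time_summary(erratic))
```

The two series share the same best case, yet only the second shows a non-zero spread, and that spread is what users experience as unreliability.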

“Measurement should be based on the business need,” said Chris O’Connell, director of Marketing at Nimsoft, Inc. “For example, for office workers at their desks, the work peak is typically between 9 a.m. and 5 p.m. That being the case, they need the best possible response time and availability during that time, depending on the application. Another example that clarifies where specific application prioritization plays a role would be at the end of a quarter at any given company, the financial tools may be given priority over other types of applications. The customer should be able to easily set and adjust priorities as necessary, based on business requirements.”

These days, however, it is rare that a business operates only between nine and five. Now that flexible working and a global customer base are commonplace, IT teams have less and less time in which to schedule maintenance work.

“The traditional IT team mentality was there is a maintenance window to change or update systems,” said Zandi. “Maintenance windows are luxuries that no longer apply in the 7x24x365 global business environment. Maintenance needs to be done transparently to the business.”

Fortunately, manufacturers have given us options for working in this environment. “Network-level vendors have kit that allows for upgrades without disrupting operations,” Shalita explained.

Mouline believes that there is still the option of having a maintenance window for consumer-facing applications. Where the expectation of availability is 100%, scheduling a time for maintenance may be the only option. Any planned downtime should be well-communicated and as infrequent as possible. Other applications don’t have these restrictions: a trading application, for example, only needs to be used during trading hours, so users’ expectations of availability will be different. Understanding expectations of availability helps define when maintenance can be planned in.

Calculating the impact

Unplanned downtime has a massive business impact. Figures from Alinean show that outages in a messaging system can cost around $1,000 a minute. Downtime for trading applications can cost up to $40,000 a minute, so tolerance of downtime could differ between mission critical and other systems.
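To make the comparison concrete, the per-minute figures quoted above can be turned into a rough outage estimate. This is a minimal sketch: the dictionary keys and function name are illustrative assumptions, and the trading figure is the quoted upper bound rather than a typical cost.

```python
# Per-minute costs quoted in the text (attributed to Alinean), in US dollars.
# The trading figure is an "up to" value, so treat it as a ceiling.
COST_PER_MINUTE_USD = {
    "messaging": 1_000,
    "trading": 40_000,
}


def outage_cost_usd(system: str, outage_minutes: float) -> float:
    """Estimate the cost of an outage from a flat per-minute rate."""
    if outage_minutes < 0:
        raise ValueError("outage duration cannot be negative")
    return COST_PER_MINUTE_USD[system] * outage_minutes


if __name__ == "__main__":
    # Compare the same five-minute outage across both systems.
    for system in COST_PER_MINUTE_USD:
        print(f"5-minute {system} outage: ${outage_cost_usd(system, 5):,.0f}")
```

At these rates the same five-minute outage costs about $5,000 for messaging and up to $200,000 for trading, which is why tolerance of downtime differs so sharply between systems.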

It may take some time to define appropriate measures for system availability but there is one thing everyone’s clear on: the opinion of the IT department is not important.

“There should be only one way that matters: availability in the mind of the customer and end-users’ perception and experience,” said O’Connell. “We measure by the users’ experience, because that’s the reality of the situation.”

Just running through the options for quick access to core synthetic monitoring ‘outside-in’ data for a client today. Essentially, looking for a quick way to alert core staff to current up-to-the-minute performance for key website transactions running in Gomez.

The operational dashboard pulling data from the Active Network XF backbone tests is nicely displayed from a quick tab in the portal. Looking at ways to display this, as in the obligatory LCD or plasma in the ops centre.

Just thinking about overall end-user web experience after chatting to a client this week.

The client is developing a new front-end for a leading e-commerce retailer in the UK. I was struck by the lack of visibility the client had into real user experience of their current website.

The views expressed were gathered from a very subjective point of view. As in, ‘looks great at my desktop in the office, and was pretty quick at home too.’

Not to labour the point, but running a schedule of synthetic monitoring tests from a series of UK backbone networks gave us a ‘better picture’. The big Flash splash on the home page seemed to haemorrhage and led to some shockers in overall average response time.

As expats were also a significant market segment, a series of tests run from outside the UK was added, and it produced timeouts all over the shop. No pun intended: if you wanted to shop, I think getting past the home page would have been a challenge for many.

Ok, so we don’t all just click away when we are impatient, like me. But I think it was a useful little exercise and proved in this case getting an ‘outside-in’ view can help from the outset on a project.