Why Alerts Suck and Monitoring Solutions Need to Become Smarter

I have yet to meet anyone in Dev or Ops who likes alerts. I’ve also yet to meet anyone who was fast enough to acknowledge an alert and prevent an application from slowing down or crashing. In the real world alerts just don’t work: nobody has the time or patience anymore, alerts are truly evil, and no one trusts them. The most efficient alert today is an angry end user phone call, because Dev and Ops physically hear and feel the pain of someone suffering 🙂

Why? There is little or no intelligence in how a monitoring solution determines what is normal or abnormal for application performance. Today, monitoring solutions are only as good as the users that configure them, which is bad news because humans make mistakes, configuration takes time, and time is something many of us have little of.

It’s therefore no surprise to learn that behavioral learning and analytics are becoming key requirements for modern application performance monitoring (APM) solutions. In fact, Will Cappelli from Gartner recently published a report on IT Operational Analytics and pattern-based strategies in the data center. The report covered the role of Complex Event Processing (CEP), behavior learning engines (BLEs) and analytics as a means for monitoring solutions to deliver better intelligence and quality information to Dev and Ops. Rather than just collect, store and report data, monitoring solutions must now learn from and make sense of the data they collect, so they can become smarter and deliver better intelligence back to their users.

Change is constant for applications and infrastructure thanks to agile cycles, so monitoring solutions must also change if they are to adapt and stay relevant. For example, if the response time of a business transaction in an application is 2.5 secs one week, and that drops to 200ms the week after because of a development fix, then 200ms should become the new performance baseline for that same transaction. Otherwise the monitoring solution won’t learn about, or alert on, any later performance regression. If the end user experience of a business transaction goes from 2.5 secs to 200ms, then end user expectations change instantly, and users become used to an instant response. Monitoring solutions have to keep up with user expectations, otherwise IT will become blind to the one thing that impacts customer loyalty and experience the most.
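To make the 2.5 secs to 200ms example concrete, here is a minimal sketch of a baseline that re-learns itself from a sliding window of recent response times. The class name `RollingBaseline`, the window size and the "mean plus three standard deviations" rule are all my own assumptions for illustration; they are not how any particular APM product works.

```python
from collections import deque
from statistics import mean, pstdev


class RollingBaseline:
    """Hypothetical baseline learned from the most recent response times only."""

    def __init__(self, window=50, sigmas=3.0):
        # Old samples fall out of the deque automatically once it is full.
        self._window = deque(maxlen=window)
        self.sigmas = sigmas

    def observe(self, response_ms):
        self._window.append(response_ms)

    def threshold(self):
        # Not enough data to say what "normal" is yet.
        if len(self._window) < 2:
            return None
        return mean(self._window) + self.sigmas * pstdev(self._window)

    def is_breach(self, response_ms):
        limit = self.threshold()
        return limit is not None and response_ms > limit


baseline = RollingBaseline(window=50)

# Week one: the transaction takes roughly 2.5 secs.
for i in range(50):
    baseline.observe(2600 if i % 2 == 0 else 2400)
print(baseline.is_breach(1000))  # a 1 sec response is well within "normal"

# Week two: a development fix brings it down to roughly 200ms.
for i in range(50):
    baseline.observe(210 if i % 2 == 0 else 190)
print(baseline.is_breach(1000))  # the same 1 sec response is now a regression
```

A baseline frozen at week one would never flag that 1 sec response, which is exactly the blind spot described above.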

So what do behavioral learning and analytics actually do, and how do they help someone in IT? Let’s look at some key Dev and Ops use cases that benefit from such technology.

#1 Problem Identification – Do I have a problem?

Alerts are only as good as the thresholds which trigger them. A key benefit of behavioral learning technology is the ability to automate the process of discovering and applying relevant performance thresholds to an application, its business transactions and infrastructure, all without human intervention. It does this by automatically learning the normal response time of an application, its business transactions and infrastructure at different hours of the day, week and month, so that these references create an accurate and dynamic baseline of what normal application performance is over time.
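As a rough illustration of what "learning normal at different hours of the day and week" could look like, the sketch below keeps a separate set of response-time samples for every (weekday, hour) slot and derives a threshold per slot. `HourlyBaseline`, the minimum sample count and the three-sigma rule are hypothetical choices of mine, not a description of a real product’s algorithm.

```python
from collections import defaultdict
from datetime import datetime, timedelta
from statistics import mean, pstdev


class HourlyBaseline:
    """Hypothetical baseline with one learned threshold per (weekday, hour) slot."""

    def __init__(self, sigmas=3.0, min_samples=5):
        self.sigmas = sigmas
        self.min_samples = min_samples
        self._samples = defaultdict(list)

    @staticmethod
    def _slot(ts):
        # Monday is 0, so Saturday 20:00 maps to (5, 20).
        return (ts.weekday(), ts.hour)

    def observe(self, ts, response_ms):
        self._samples[self._slot(ts)].append(response_ms)

    def threshold(self, ts):
        samples = self._samples.get(self._slot(ts), [])
        if len(samples) < self.min_samples:
            return None  # nothing learned for this slot yet
        return mean(samples) + self.sigmas * pstdev(samples)

    def is_abnormal(self, ts, response_ms):
        limit = self.threshold(ts)
        return limit is not None and response_ms > limit


baseline = HourlyBaseline()
tuesday_10am = datetime(2024, 1, 2, 10)
saturday_8pm = datetime(2024, 1, 6, 20)

# Five weeks of made-up history: quiet weekday mornings, busy weekend evenings.
quiet = [100, 101, 102, 103, 104]
busy = [3000, 3100, 2900, 3050, 2950]
for week in range(5):
    baseline.observe(tuesday_10am + timedelta(weeks=week), quiet[week])
    baseline.observe(saturday_8pm + timedelta(weeks=week), busy[week])

next_tuesday = tuesday_10am + timedelta(weeks=5)
next_saturday = saturday_8pm + timedelta(weeks=5)
print(baseline.is_abnormal(next_tuesday, 2500))   # True: far above a quiet Tuesday
print(baseline.is_abnormal(next_saturday, 2500))  # False: normal for Saturday peak
```

The same 2.5 sec response is a problem on Tuesday morning and perfectly normal on Saturday evening, which a single static threshold cannot express.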

A performance baseline which is dynamic over time is significantly more accurate than a baseline which is static. For example, a static baseline threshold which assumes application performance is OK if all response times are less than 2 seconds is naive and simplistic. All user requests and business transactions are unique. They have distinct flows across the application infrastructure, which vary depending on what data is requested, processed and packaged up as a response.

Take, for example, a credit card payment business transaction. Would these requests normally take less than 2 seconds for a typical web store application? Not really; they can vary between 2 and 10 seconds. Why? There is often a delay whilst an application calls a remote 3rd party service to validate credit card details before the payment can be authorized and confirmed. In comparison, a product search business transaction is relatively simple and localized to an application, meaning it often returns sub-second response times 24/7 (e.g. like Google). Applying a single 2 second static threshold to business transactions as different as “credit card payment” and “search” will trigger alert storming (false and redundant alerts) for payments, while a badly degraded search can still slip under the threshold. To avoid this without behavioral learning, users must manually define individual performance thresholds for every business transaction in an application. This is bad because, as I said earlier, nobody in IT has the time to do this, so most users resort to applying thresholds which are static and global across an application. Don’t believe me? Ask your Ops people whether they get enough alerts today; chances are they’ll smile or snarl.
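Here is a small sketch of that comparison, using made-up response times for the two transactions. The sample data, the function names and the "mean plus three standard deviations" rule are assumptions for illustration only.

```python
from statistics import mean, pstdev

STATIC_THRESHOLD_MS = 2000  # one global threshold for the whole application

# Hypothetical history (in ms) for two very different business transactions.
history = {
    "credit card payment": [2400, 3100, 5200, 4800, 9500, 2900, 6100, 7300],
    "search": [120, 150, 90, 200, 180, 110, 140, 160],
}


def learned_threshold(samples, sigmas=3.0):
    """Per-transaction threshold derived from that transaction's own history."""
    return mean(samples) + sigmas * pstdev(samples)


def count_alerts(samples, threshold_ms):
    """Number of requests that would have fired an alert at this threshold."""
    return sum(1 for ms in samples if ms > threshold_ms)


for name, samples in history.items():
    static_alerts = count_alerts(samples, STATIC_THRESHOLD_MS)
    learned_alerts = count_alerts(samples, learned_threshold(samples))
    print(f"{name}: static={static_alerts} alerts, learned={learned_alerts} alerts")

# A search that suddenly takes 1.5 secs is badly degraded, yet the static
# threshold stays silent while the learned one fires.
slow_search_ms = 1500
print(slow_search_ms > STATIC_THRESHOLD_MS)
print(slow_search_ms > learned_threshold(history["search"]))
```

With this data every perfectly normal payment trips the static threshold, while the learned thresholds stay quiet until something genuinely unusual happens.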

The screenshot below shows the average response time of a production application over time, with spikes representing peak load during weekend evening hours. You can see that on weekdays normal performance is around 100ms, yet under peak load it’s normal to experience response times of up to several seconds. Applying a static threshold of 1 or 2 seconds in this scenario would basically cause alert storming at the weekend, even though it’s normal to see such performance spikes. This application could therefore benefit from behavioral learning technology, so that the correct performance baseline is applied for the correct hour and day.

Another key limitation with alerts and traditional monitoring solutions is that they lack business context. They’re typically tied to infrastructure health rather than the health of the business, making it impossible for anyone to understand the business impact of an alert or problem. It can be the difference between “Server CPU Utilization is above 90%” and “22% of Credit Card Payments are stalling”. You can probably guess that the latter alert is more important to troubleshoot, and more useful than pulling up a terminal console, logging onto a server and typing prstat to view processes and CPU usage. Behavioral learning combined with business context allows a monitoring solution to alert on the performance and activity of the business, rather than, say, the performance and activity of its infrastructure. This ensures Dev and Ops have the correct context to understand and be aligned with the business services.

Analytics can also play a critical role in how monitoring data is presented to the user to help them troubleshoot. If a business transaction is slow or has breached its threshold, the user needs to understand the severity of the problem. For example, were a few or a lot of user transactions impacted? How many returned errors, or actually stalled and timed out? Everything is relative. Dev and Ops don’t have the time to investigate every user transaction breach, so it’s important to prioritize by business impact before jumping in to troubleshoot.

If we look at the below screenshot of AppDynamics Pro, you can see how behavioral learning and analytics can help a user identify a problem in production. We can see the checkout business transaction has breached its performance baseline (which was learnt automatically). We can also see the severity of the breach, which shows no errors, 10 slow requests, 13 very slow requests and no stalls. 23 out of the 74 user requests (calls), roughly 31%, were impacted, meaning this is a critical problem for Dev and Ops to troubleshoot.
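The kind of severity summary described above can be sketched in a few lines. To be clear, the cut-offs below (twice the baseline for "slow", four times for "very slow", a fixed stall timeout) are placeholders I picked for the example, not the definitions AppDynamics Pro uses, and the `Call` type is hypothetical.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class Call:
    response_ms: float
    error: bool = False


def classify(call, baseline_ms, stall_ms=45000):
    """Bucket one call relative to the learned baseline (assumed cut-offs)."""
    if call.error:
        return "error"
    if call.response_ms >= stall_ms:
        return "stall"
    if call.response_ms > 4 * baseline_ms:
        return "very slow"
    if call.response_ms > 2 * baseline_ms:
        return "slow"
    return "normal"


def summarize(calls, baseline_ms):
    """Return per-bucket counts, the number of impacted calls and the percentage."""
    counts = Counter(classify(c, baseline_ms) for c in calls)
    impacted = len(calls) - counts["normal"]
    percent = 100.0 * impacted / len(calls) if calls else 0.0
    return counts, impacted, percent


# Made-up data shaped like the checkout example: 74 calls against a 500ms baseline.
calls = (
    [Call(400) for _ in range(51)]     # normal
    + [Call(1200) for _ in range(10)]  # slow
    + [Call(2500) for _ in range(13)]  # very slow
)
counts, impacted, percent = summarize(calls, baseline_ms=500)
print(dict(counts))
print(f"{impacted} of {len(calls)} calls impacted ({percent:.1f}%)")
```

An alert built from this summary can say "31% of checkouts are slow or very slow" rather than just "a threshold was breached", which is what makes it possible to prioritize.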

#2 Problem Isolation – Where is my problem?

Once a user has identified abnormal application performance, the next step for them is to isolate where that latency is spent in the application infrastructure. A key problem today is that most monitoring solutions collect and report data, but they don’t process or visualize it in a way that automates problem isolation for a user. Data exists, but it’s down to individual users to drill down and piece it together so they can find what they’re looking for. This is made difficult by the fact that performance data can be fragmented across multiple silos and monitoring toolsets, making it impossible for Dev or Ops to get a consistent end-to-end view of application performance and business activity. To solve this data fragmentation problem, many monitoring solutions use time-based correlation or Complex Event Processing (CEP) engines to piece together data and events from multiple sources, so they can look for patterns or key trends which may help a user isolate where a problem or latency exists in an application.

For example, if a user’s credit card payment business transaction took 9 seconds to execute, where exactly were those 9 seconds spent in the application infrastructure? If you look at performance data from an OS, app server, database or network perspective you’ll end up with four different views of performance, none of which relate to that individual credit card payment business transaction which took 9 seconds. Using time-based correlation won’t help either. Knowing the database was running at 90% CPU whilst the credit card payment transaction executed is about as helpful as a poke in the eye. Time-based correlation is effectively a guess, and given the complexity and distribution of applications today, the last thing you want to be doing is guessing where a problem might be in your application infrastructure. Infrastructure metrics tell you how an application is consuming system resources; they don’t have the granularity to tell you where an individual user business transaction is slow in the infrastructure.

Behavioral learning and analytics can be used together to learn and track how business transactions flow across distributed application infrastructure. If a monitoring solution is able to learn the journey of a business transaction, then it can monitor the real flow execution of that transaction across and inside distributed application infrastructure. By visualizing the entire journey and latency of a business transaction, at each hop in the infrastructure, monitoring solutions can make it simple for Dev and Ops to isolate problems in seconds. If you want to travel from San Francisco to LA by car, the easiest way to understand that journey is to visualize it on Google Maps in seconds. In comparison, the easiest way for Dev or Ops to isolate a slow user business transaction is to do the same thing and visualize its journey across the application infrastructure. For example, take the below screenshot which shows the distributed transaction flow of a “Checkout” business transaction which took 10 seconds across its application infrastructure. You can see that 99.8% of its response time is spent making a JDBC call to the Oracle database. Isolating problems this way is much faster and more efficient than tailing log files or asking system, network or database administrators whether their silos are performing correctly.
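Once each hop of a transaction has been traced, working out where the time went is simple arithmetic. The sketch below assumes a trace is available as a list of hops with the time spent in each one; the `Hop` type, tier names and timings are invented to mirror the 10 second Checkout example.

```python
from dataclasses import dataclass


@dataclass
class Hop:
    tier: str
    call_type: str
    self_time_ms: float  # time spent in this hop only, excluding downstream calls


def breakdown(hops):
    """Return (tier, call_type, ms, percent of total) rows, slowest hop first."""
    total = sum(h.self_time_ms for h in hops)
    if total == 0:
        return []
    rows = [
        (h.tier, h.call_type, h.self_time_ms, 100.0 * h.self_time_ms / total)
        for h in hops
    ]
    return sorted(rows, key=lambda row: row[2], reverse=True)


# Hypothetical trace of a single Checkout request that took 10 seconds.
checkout = [
    Hop("Web Server", "HTTP", 12),
    Hop("Order Service", "Web Service", 8),
    Hop("Oracle Database", "JDBC", 9980),
]

for tier, call_type, ms, percent in breakdown(checkout):
    print(f"{tier:<16} {call_type:<12} {ms:>7.0f} ms  {percent:5.1f}%")
```

The slowest hop sorts to the top, so the tier to investigate is visible at a glance instead of being inferred from four unrelated infrastructure dashboards.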

You can also apply dynamic baselining and analytics to the performance and flow execution of a business transaction. This means a monitoring solution can effectively highlight to the user which application infrastructure tier is responsible for a performance breach and baseline deviation. Take, for example, the below screenshot which visualizes the flow of a business transaction in a production environment, and highlights the breach for the application tier “Security Server”, which has deviated from its normal performance baseline of 959ms.
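Highlighting the responsible tier boils down to comparing each tier’s observed time against its own learned baseline. In this sketch only the 959ms baseline for “Security Server” comes from the example above; the other tiers, the standard deviations, the observed values and the three-sigma rule are assumptions.

```python
import math


def deviating_tiers(observed_ms, baselines, sigmas=3.0):
    """Return tiers whose observed time exceeds mean + sigmas * std, worst first.

    baselines maps tier name -> (learned mean in ms, learned std deviation in ms).
    """
    breaches = []
    for tier, ms in observed_ms.items():
        mean_ms, std_ms = baselines[tier]
        if ms > mean_ms + sigmas * std_ms:
            # How many standard deviations above normal this tier is running.
            score = (ms - mean_ms) / std_ms if std_ms else math.inf
            breaches.append((tier, ms, mean_ms, score))
    return sorted(breaches, key=lambda b: b[3], reverse=True)


# Learned baselines per tier (hypothetical except the 959ms mean).
baselines = {
    "Web Server": (120, 15),
    "Security Server": (959, 80),
    "Inventory Service": (300, 40),
}

# Per-tier timings for one slow request (made up).
observed = {
    "Web Server": 130,
    "Security Server": 4200,
    "Inventory Service": 310,
}

for tier, ms, mean_ms, score in deviating_tiers(observed, baselines):
    print(f"{tier}: {ms}ms vs baseline {mean_ms}ms ({score:.1f} sigma above normal)")
```

Only the tier that has genuinely deviated is reported, so the user is pointed at one place to look rather than at every tier the transaction touched.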

Behavioral learning and analytics can therefore be a key enabler to automating problem isolation in large, complex, distributed applications.

#3 Problem Resolution – How do I fix my problem?

Once Dev or Ops has isolated where the problem is in the application infrastructure, the next step is to identify the root cause. Many monitoring solutions today can collect diagnostic data which relates to the activity of components within an application tier such as a JVM, CLR or database. For example, a Java profiler might show you thread activity, and a database tool might show you the top N SQL statements. What these tools lack is the ability to tie diagnostic data to the execution of real user business transactions which are slow or breaching associated performance thresholds. When Ops picks up the phone to an angry user, users don’t complain about CPU utilization, thread synchronization or garbage collection. Users complain about specific business transactions they are trying to complete, like login, search or purchase.

As I outlined above in the Problem Isolation section, monitoring solutions can leverage behavioral learning technology to monitor the flow execution of business transactions across distributed application infrastructure. This capability can also be extended inside an application tier, so monitoring solutions can learn, and monitor, the relevant code execution of a slow or breaching business transaction.

For example, here is a screenshot which shows the complete code execution (diagnostic data) of a distributed Checkout business transaction which took 10 seconds. We can see in the top dialogue the code execution from the initial Struts action all the way through to the remote Web Service call which took 10 seconds. From this point we can drill inside the offending web service to its related application tier and see its code execution, before finally pinpointing the root cause of the problem, which is a slow SQL statement as shown.

Without behavioral learning and analytics, monitoring solutions lack intelligence on what diagnostic data to collect. Some solutions try to collect everything, whilst others limit what data they collect so that their agent overhead doesn’t become intrusive in production environments. The one thing you need when trying to identify root cause is complete visibility, otherwise you begin to make assumptions or guess what might be causing things to run slow. If you only have 10% visibility into the application code in production, then you’ve only got a 10% probability of finding the actual root cause of an issue. This is why users of most legacy application monitoring solutions struggle to find root cause – because they have to balance application code visibility with monitoring agent overhead.

Monitoring today isn’t about collecting everything; it’s about collecting what is relevant to business impact, so any business impact can be resolved as quickly as possible. You can have all the diagnostic data in the world, but if that data isn’t provided in the right context for the right problem to the right user, it becomes about as useful as a chocolate teapot.

With applications becoming ever more complex, agile, virtual and distributed, Dev and Ops no longer have the time to monitor and analyze everything. Behavioral learning and analytics must help Dev and Ops monitor what’s relevant in an application, so they can focus on managing real business impact instead of infrastructure noise. Monitoring solutions must become smarter so Dev and Ops can automate problem identification, isolation and resolution. The more monitoring solutions rely on human intervention to configure and analyze, the more they will continue to fail.

If you want to experience how behavioral learning and analytics can automate the way you manage application performance, take a trial of AppDynamics Pro and see for yourself.