If all you have is logs, you are doing it wrong

If all you have is logs, you are doing it wrong

Ten years ago, the standard way to troubleshoot an application issue was to look at the logs. Users would complain about a problem, you’d go to operations and ask for a thread dump, and then you’d spend some time poring over log files looking for errors, exceptions, or anything that might indica...

Ten years ago, the standard way to troubleshoot an application issue was to look at the logs. Users would complain about a problem, you’d go to operations and ask for a thread dump, and then you’d spend some time poring over log files looking for errors, exceptions, or anything that might indicate a problem. There are some people who still use this approach today with some success, but for most modern applications logging is simply not enough. If you’re depending on log files to find and troubleshoot performance problems, then chances are your users are suffering – and you’re losing money for your business. In this blog we’ll look at how and why logging is no longer enough for managing application performance.

The Legacy Approach

The typical legacy web application was monolithic and fairly static, with a single application tier talking to a single database that was updated every six months. The legacy approach to monitoring production web applications was essentially a customer support loop. A customer would contact the support team to report an outage or bug, the customer support team reports the incident to the operations team, and then the operations team would investigate by looking at the logs with whatever useful information they had from the customer (username, timestamps, etc.). If the operations team was lucky and the application had ample logging, the operations team would spot the error and bring in developers to find the root cause and provide a resolution. This is the ideal scenario, but more often than not the logs were of very little use and the operations team would have to wait for another user to complain about a similar problem and kick off the process again. Ten years ago, this was what production monitoring looked like. Apart from some rudimentary server monitoring tools that could alert the operations team if a server was unavailable, it was the end users who were counted on to report problems.

Logging is inherently reactive

The most important reason that logging was never truly an application performance management strategy is that logging is an inherently reactive approach to performance. Typically this means an end user is the one alerting you to a problem, which means that they were affected by the issue – and (therefore) so was your business. A reactive approach to application performance loses you money and damages your reputation. So logging isn’t going to cut it in production.

You’re looking for a needle in a haystack

Another reason why logging was never a perfect strategy is that system logs have a particularly low signal to noise ratio. This means that most of the data you’re looking at (which can amount to terabytes for some organizations) isn’t helpful. Sifting through log files can be a very time-consuming process, especially as your application scales, and every minute you spend looking for a problem is time that your customers are being affected by a performance issue. Of course, newer tools like Splunk, Loggly, SumoLogic and others have made sorting through log files easier, but you’re still looking for a needle in a haystack.

Logging requires an application expert

Which brings us to another reason logging never worked: Even with tools like Loggly and Splunk, you need to know exactly what to search for before you start, whether it’s a specific string, a time range, or a particular file. This means the person searching needs to be someone who knows the application well, usually a developer or an architect. Even then, their hunches could be wrong, especially if it’s a performance issue that you’ve never encountered before.

Not everyone has access to logs

Logging is a great tool for developers to debug their code on their laptops, but things get more complicated in production, especially if the application is dealing with sensitive data like credit card numbers. There are usually restrictions on the production system that prevent people like developers from accessing the production logs. In some organizations, these can be requested from the operations team, but this step can take a while. In a crisis, every second counts, and these costly processes (while important) can cost organization money if your application is down.

It doesn’t work in production

Even in a perfect world where you have complete access to your application’s log files, you still won’t have complete visibility into what’s going on in your application. The developer who wrote the code is ultimately the one who decides what gets logged, and the verbosity of those logs is often limited by performance constraints in production. So even if you do everything right there’s still a chance you’ll never find what you’re looking for.

The Modern Approach

Today, enterprise web applications are much more complex than they were ten years ago. The new normal for these applications includes multiple application tiers communicating via a service-oriented architecture (SOA) that interacts with several databases and third-party web services while processing items out of caches and queues. The modern application has multiple clients from browser-based desktops to native applications on mobile. As a result, it can be difficult just to know where to start if you’re depending on log files for troubleshooting performance issues.

Large, complex and distributed applications require a new approach to monitoring. Even if you have centralized log collection, finding which tier a problem is on is often a large challenge on its own. Usually increasing logging verbosity is not an option, as production does not run with debugging enabled due to performance constraints. First the operations team must figure out which log file to look through and once finding something that resembles an error, reach out to the developers to find out what it means and how to resolve it. Furthermore with the evolution of web 2.0 applications which are heavily dependent on JavaScript and the capabilities of the browser in order to properly monitor the application is operating as expected you must have end user monitoring that is capable of capturing errors on the client-side and in native mobile applications.

Logging is simply not enough

Logging is not enough – modern applications require application performance management to enable application owners to stay informed to minimize the business impact of performance degradation and downtime

Logging is simply not enough information to get to the root cause of problems in modern distributed applications. The problems of production monitoring have changed and so has the solution. Your end users are demanding and fickle, and you can’t afford to let them down. This means you need the fastest and most effective way to troubleshoot and solve performance problems, and you can’t rely on the chance that you might find the message in the log. Business owners, developers, and operations need in-depth visibility into the app, and the only way to get that is by using application performance monitoring.