When an error reaches your production environment, things get tough. Scientific research from the Institute of Made Up Stats™ shows that developer teams get a gazillion times more stressed than usual, coffee cups get dirty 3x faster and god kills 2 kittens because one is just not enough. In the following post we’ll analyze the main problems you face when debugging code that runs in production. Let’s get started with a fair warning that our made up legal department forced us to put here.

WARNING: The following post contains explicit descriptions of bugs in production systems that some developers may find disturbing.

Problem #0: You Have Bugs in Production

The first step is admitting that you have a problem. Bugs always find their way to production, no matter how much testing your code goes through, no matter how beautiful your staging environment is and no matter how accurate your load tests are. In production it’s a whole different game, with real live data flowing through your system. You’re not in IntelliJ or Eclipse anymore, you don’t have XRebel by your side, and unless you do something about it, you’re out in the dark with them bugs. But don’t worry, you’re not alone in this. Let’s see how we can turn the light on, examine the possible solutions and solve this once and for all.

In order to solve bugs you usually need one thing: the variable values that led to the error and how they came to be across the call stack. Let’s break it down into separate problems and see how we can get the most data out of each situation.

Solution: Keep on reading and hold on tight, here it comes.

Problem #1: An Error Happened But You Don’t Know Where it Came From

Something bad happened. Let’s keep it at the logged error / uncaught exception level for now. So you’re looking at your log and all you see is a bare statement of the error. It usually looks similar to this:

But whoa, whoa, whoa, wait, what do I actually know about this error other than that, well, something bad happened there with http-nio? Things get trickier when I look closer at the code and see that it acts all fine, and that the error probably originated in some other machine, microservice, process, or even just another thread. That’s all it takes to get lost in the logs. If this scenario sounds familiar then you’re probably missing a transaction ID in your logs: a unique ID that helps you trace where the error originated and the path it took until it ended its life as a log error.

Other than the transaction ID, you need to keep in mind other data that gets lost unless you print it out with the error. The key is to draw as much context as you can: the thread you’re in, the class, and maybe even the specific method if you’re on a critical path. Ideally we’d also have variable values here, but this may be a bit too much for this type of problem. In this post on the Takipi blog we measured the performance hit for different types of logging styles with Logback, so check it out if you’re looking for more insights.

Solution: Generate a UUID at every thread’s entry point to your application and append it to every log entry – Keep it consistent over machines to save its original context. Also, consider using some of the log management tools if you want a way out of the console.
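
To make this a bit more concrete, here is a minimal sketch of the idea using nothing but the JDK. The TransactionContext class and its method names are made up for this example, and in a real system you would more likely plug the ID into your logging framework than format messages by hand. The upstreamId parameter stands for an ID that arrived from another machine (say, in a request header), which is what keeps it consistent across services.

```java
import java.util.UUID;

/** Hypothetical helper: holds one transaction ID per thread. */
final class TransactionContext {
    private static final ThreadLocal<String> CURRENT = new ThreadLocal<>();

    private TransactionContext() {}

    /** Call at the thread's entry point; reuses an upstream ID if one was passed along. */
    static String begin(String upstreamId) {
        String id = (upstreamId == null || upstreamId.isEmpty())
                ? UUID.randomUUID().toString()
                : upstreamId;
        CURRENT.set(id);
        return id;
    }

    static String currentId() {
        String id = CURRENT.get();
        return id == null ? "no-tid" : id;
    }

    /** Clears the ID so pooled threads do not leak it into the next request. */
    static void end() {
        CURRENT.remove();
    }

    /** Prefixes a message with the transaction ID and the thread name. */
    static String format(String message) {
        return "[tid=" + currentId() + "] [" + Thread.currentThread().getName() + "] " + message;
    }

    public static void main(String[] args) {
        begin(null); // no upstream ID, so a fresh UUID is generated
        try {
            System.err.println(format("Failed to load user profile"));
        } finally {
            end();
        }
    }
}
```

Searching your logs for a single tid value should then give you the whole path of one request, as long as every hop passes the ID on.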

Problem #2: Uncaught Exceptions Don’t Have Information About the Error

When we step away from logged errors and caught exceptions, we’re getting into an even darker territory. Uncaught Exceptions land, where threads go to die. But it’s not all so grim, we have a last line of defense: The Default Uncaught Exception Handler. Once we have an Uncaught Exception Handler in place, we can actually do something about them and draw out some valuable data. The problem is that by the time they reach the Uncaught Exception Handler up the call stack, most of the context we can draw about the error is lost. Once again, we have another small window of opportunity here and not all is lost: Thread Local Storage (TLS) and thread names.

Thread Local Storage lets us store variables on the thread itself, not bound to any stack frame. Actually, threads can do pretty cool stuff. For more hands-on examples using the least known yet useful tricks with threads, you can check out the post right here. So with TLS we can store more data about the state we were in before the exception was thrown. Another cool trick here is using meaningful thread names, so instead of a name like:

pool-1-thread-1

We can do:

Thread.currentThread().setName(Context + TID + Params + current Time, ...);

Solution: Set an uncaught exception handler, keep thread names in mind and put your Thread Local Storage to work.
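
Here is a rough sketch of how the three pieces can fit together. The CrashContext class, the STATE thread-local and the thread name format are all invented for this example. The part that comes from the JDK is Thread.setDefaultUncaughtExceptionHandler, and the sketch relies on the handler being invoked on the dying thread itself, which is why the thread-local value is still readable there.

```java
import java.util.concurrent.atomic.AtomicReference;

/** Hypothetical helper that reports context for uncaught exceptions. */
final class CrashContext {
    /** Per-thread note about what the thread was doing. */
    static final ThreadLocal<String> STATE = ThreadLocal.withInitial(() -> "unknown");

    /** Last report, kept only so the sketch is easy to inspect. */
    static final AtomicReference<String> LAST_REPORT = new AtomicReference<>();

    private CrashContext() {}

    static void install() {
        Thread.setDefaultUncaughtExceptionHandler((thread, error) -> {
            // Runs on the dying thread, so its thread-locals are still available.
            String report = "Uncaught " + error + " in thread '" + thread.getName()
                    + "' while: " + STATE.get();
            LAST_REPORT.set(report);
            System.err.println(report);
        });
    }

    public static void main(String[] args) throws InterruptedException {
        install();
        Thread worker = new Thread(() -> {
            // A meaningful name instead of something like pool-1-thread-1
            Thread.currentThread().setName(
                    "checkout tid=42 cart=3 started=" + System.currentTimeMillis());
            STATE.set("charging credit card");
            throw new IllegalStateException("payment gateway returned garbage");
        });
        worker.start();
        worker.join();
    }
}
```

The stack frames are gone by the time the handler runs, but the thread name and whatever you stored in the thread-local survive, which is often enough to tell which request blew up.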

Problem #3: Your Process is Hanging and You Don’t Know Why

Is a process stuck, or is something weird happening with a mysterious flow in your application? It’s probably a mission for good ol’ jstack, one of the tools built into your JDK. It can hook up to any Java process, just point it to a PID, and it prints all the threads currently running in it with their stack traces, frames, the locks they’re holding and all sorts of other metadata. With jstack you can also analyze core dumps of processes that no longer exist.

For a more hands-on overview and a walkthrough for turning it on automatically under certain conditions, check out this post right here, and the code sample on Github that shows how you can get it done. To extract the most value out of it, you’ll also have to do some manual work to make sense of your results.

Solution: Know your way around jstack and use it to untie hairy situations.
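
From the command line, the basic invocation is jstack followed by the PID, and jstack -l adds extra lock information. If you would rather trigger a similar dump from inside the application, the JDK’s java.lang.management API exposes comparable data. The sketch below is not jstack itself and not the code sample mentioned above; ThreadDumper is a made-up name, and the output format is just one way to print it.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;

/** Hypothetical helper: an in-process, jstack-like thread dump. */
final class ThreadDumper {
    private ThreadDumper() {}

    static String dump() {
        ThreadMXBean threads = ManagementFactory.getThreadMXBean();
        // Ask for lock details only where this JVM supports them
        boolean monitors = threads.isObjectMonitorUsageSupported();
        boolean synchronizers = threads.isSynchronizerUsageSupported();

        StringBuilder out = new StringBuilder();
        for (ThreadInfo info : threads.dumpAllThreads(monitors, synchronizers)) {
            out.append('"').append(info.getThreadName()).append("\" ")
               .append(info.getThreadState());
            if (info.getLockName() != null) {
                out.append(" waiting on ").append(info.getLockName());
            }
            out.append('\n');
            for (StackTraceElement frame : info.getStackTrace()) {
                out.append("    at ").append(frame).append('\n');
            }
        }

        // Both calls return null when no deadlock is found
        long[] deadlocked = synchronizers
                ? threads.findDeadlockedThreads()
                : threads.findMonitorDeadlockedThreads();
        if (deadlocked != null) {
            out.append("Deadlocked threads: ").append(deadlocked.length).append('\n');
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.print(dump());
    }
}
```

You could call dump() from a watchdog thread when a request takes suspiciously long, keeping in mind that collecting stack traces for every thread is not free.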

Problem #4: All These Solutions Require Code Changes

Introducing changes to your application and refactoring the logging across your servers is a pretty hard task. What if there was a way to extract the information you need to debug your application without messing around with its internals? What if you don’t even have access to some parts of it but an error that originates someplace else explodes in your own code? Enter Java agents.

Java agents have the power to hook up to a live JVM and extract all kinds of information from it. One Java agent you can check out is BTrace. After attaching it through a JVM argument, you can hook up to it using the BTrace scripting language. You can instrument your own code and grab data that runs through it, without changing the actual application. Let’s see how we can merge this trick with activating jstack programmatically, using the BTrace scripting language:

Here we’re grabbing all the ClassLoaders and their subclasses. Whenever “defineClass” returns, the script will print out that the class was loaded and run jstack for us. The downside here is that it’s not recommended to use this continuously in production, so it’s only good for pinpointing specific issues. It also has some limitations: a script cannot create new objects, catch exceptions, or include loops.

If you’re feeling more adventurous, you can write your own custom Java agent, just like BTrace. For example, when we had a problem with a certain class that was generating millions of new objects out of nowhere, a simple Java agent helped us solve it. We hooked into the constructor of that class, and whenever an instance was allocated, we got its stack trace and understood where the load was coming from. The agent’s code is available on Github and you’re welcome to check it out.

Solution: Learn more about Java agents and get comfortable with using them to solve the really thorny issues.
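
To give a feel for what a custom agent looks like, here is a bare-bones skeleton built on the JDK’s java.lang.instrument package. This is not the agent from our Github repository: WatchAgent, Watcher and com.example.Suspect are names invented for this sketch, and it only reports where a watched class gets loaded. Hooking a constructor the way we described would mean rewriting the class bytes inside transform, usually with a bytecode library, which is left out here.

```java
import java.lang.instrument.ClassFileTransformer;
import java.lang.instrument.Instrumentation;
import java.security.ProtectionDomain;
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical agent skeleton; declare it public in its own file when packaging it for real. */
final class WatchAgent {
    private WatchAgent() {}

    /** Watches for one class and prints the stack trace of whoever triggered its loading. */
    static final class Watcher implements ClassFileTransformer {
        private final String internalName;
        final AtomicInteger hits = new AtomicInteger();

        Watcher(String className) {
            // The JVM passes class names in internal form: com/example/Suspect
            this.internalName = className.replace('.', '/');
        }

        @Override
        public byte[] transform(ClassLoader loader, String className,
                                Class<?> classBeingRedefined,
                                ProtectionDomain protectionDomain,
                                byte[] classfileBuffer) {
            // className may be null for some classes, so compare from our side
            if (internalName.equals(className)) {
                hits.incrementAndGet();
                new Throwable("Loading " + className).printStackTrace();
            }
            return null; // null tells the JVM to keep the bytecode unchanged
        }
    }

    /** Entry point the JVM calls before main() when started with -javaagent. */
    public static void premain(String agentArgs, Instrumentation inst) {
        String target = (agentArgs == null || agentArgs.isEmpty())
                ? "com.example.Suspect" // placeholder class name
                : agentArgs;
        inst.addTransformer(new Watcher(target));
    }
}
```

To try it, you would package the class in a jar whose manifest has a Premain-Class entry pointing at it, then start the JVM with -javaagent: followed by the path to that jar.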

Problem #5: No Data About the Real Root Cause of the Error

Stepping up a notch from custom Java agents, native agents have a much more powerful arsenal at their disposal. You can read more about the exact difference right here. Generally speaking, Java agents have code instrumentation capabilities, while native agents have access to the JVM’s low-level set of APIs (JVMTI). This means they can go beyond basic stack trace data and get to the real root cause, including the variable values that caused each error.

Takipi’s error analysis screen

Takipi’s native agent was built especially for monitoring production servers in high scale systems. It works at the JVM level to capture complete code and variable data for your errors, without relying on log files. On the configuration side, it only requires that you attach it to your server as a JVM argument, and it immediately starts reporting and analyzing all your exceptions and log errors.

Solution: Try a native agent based tool that takes advantage of low-level capabilities, and get down to the real root cause of your errors.

Conclusion

Hope this walkthrough helps you solve the errors in your production environment in a painless way, without wasting away hours and days of your time. Do you have more methods you use to debug your servers in production? Or would you like to share stories of your battles against bugs? Let us know in the comments section below.

Alex is an engineer working at Takipi on a mission to help Java and Scala developers solve bugs in production and rid the world of buggy software. Passionate about all things tech, he is also the co-founder & lead of GDG Haifa, a local developer group. Alex holds a B.Sc from the Technion, Israel's Institute of Technology.