God-Mode in Production Code

Debugging is like taxes - everybody likes to write code, few like to pay for it. If you catch errors during development, things aren't bad. You’ve got your IDE with breakpoints, watches, tooltips and plenty of time to reproduce the problem. Even more importantly, you can fix things before they do any real harm.

When code fails in production, all that goes out of the window.

With debuggers no longer an option, it’s up to you to use log files and stack traces to try to determine the source code and variable state combination that caused the error.

Takipi is trying to level the playing field by making it just as easy to fix Java and Scala code in production as it is on your desktop.

It detects errors and exceptions in server code, provides analytics to help prioritize them, and captures the source code and values of variables that caused them.

Takipi was founded in 2012, and has been in beta for the last year with over 200 companies.

It’s releasing its first GA product this month. (Takipi is named after the company’s dogs, if you were wondering.)


How Would You Use Takipi?

Developers use many debugging tools today, ranging from command-line debuggers to dynamic tracers and log analyzers. Takipi focuses heavily on production debugging, breaking that process into three steps -

Detection - know when a new error has been introduced into your environment at either staging or production, or when an existing one has increased in frequency.

Prioritization - get the metrics needed to decide if and when to fix it.

Analysis - get the actual source code and combination of variable values that caused it. Think of it like a debugger that automatically turns on once an error happens, collects the variables and source code for later review, and then lets the code continue executing.

1. Detecting Errors

Takipi operates at the native JVM level, which allows it to detect and show you any form of exception or error in your code, regardless of whether it was thrown by the application code, the JVM or a third-party library, and regardless of how it was caught. The same is true for logged errors and HTTP errors.
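To make the "regardless of how it was caught" point concrete, here is a minimal sketch of the kind of error that never shows up in a log file. The class and method names are hypothetical and have nothing to do with Takipi's own code; the example only shows an exception that is thrown and swallowed inside a single method, which is why a tool would need to observe it at the JVM level rather than through logs.

```java
import java.util.Optional;

// Hypothetical example: the names below are illustrative, not part of Takipi.
final class SwallowedExceptionDemo {

    // Parses a port number and silently swallows malformed input.
    // The NumberFormatException is thrown and caught inside this method,
    // so it never reaches a log file or an uncaught-exception handler.
    static Optional<Integer> parsePort(String raw) {
        try {
            return Optional.of(Integer.parseInt(raw.trim()));
        } catch (NumberFormatException e) {
            return Optional.empty(); // swallowed: no log line, no rethrow
        }
    }

    public static void main(String[] args) {
        System.out.println(parsePort("8080"));  // prints Optional[8080]
        System.out.println(parsePort("80 80")); // prints Optional.empty
    }
}
```

From the outside, the second call just looks like a missing value. The exception that explains why is gone by the time the method returns.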

You can see and sort through all the errors in Takipi’s dashboard, which operates as a sort of spreadsheet for all the errors in your application. You can sort and filter them by the most recent ones, ones that have recently increased in volume, or by a specific type (e.g. uncaught NullPointerExceptions).

When a new location in your code begins firing an error, Takipi will notify you by email. It also sends daily digests that summarize which new errors have been introduced into your code, and top errors across your cluster.

2. Prioritizing Errors

Once you've seen an error through the dashboard or email, the next step is to decide whether you want to do something about it right now, tomorrow or next quarter. For this you’ll need to understand its actual impact. This requires correlating a number of metrics, including how often it’s happening, when it started, and whether it’s related to a recent change in the code.

To help with this process Takipi provides a set of metrics for each error -

When it started. The first and last time that location in your code fired that type of exception.

Code changes. For every method in the error’s call stack, Takipi shows where its code was modified on that machine in the day or week prior to the error. Takipi detects deployments by assigning a unique binary signature to each .jar, .war, or .class loaded into the JVM. So when code breaks, it can tell when it was deployed onto that machine and into the application in general.
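As a rough illustration of what a binary signature can look like, the sketch below hashes the bytes of a loaded binary. The article doesn't say which scheme Takipi uses, so SHA-256 and the BinarySignature class are assumptions made for this example. The property that matters is that identical bytes give an identical signature, while any change to the deployed code gives a different one.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Illustrative only: the signature scheme Takipi uses is not described in the article.
// SHA-256 is an assumption, chosen because any change to the bytes changes the digest.
final class BinarySignature {

    // Returns the hex-encoded SHA-256 digest of a binary's bytes.
    static String of(byte[] binary) {
        try {
            MessageDigest sha256 = MessageDigest.getInstance("SHA-256");
            StringBuilder hex = new StringBuilder();
            for (byte b : sha256.digest(binary)) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // Every Java platform implementation is required to support SHA-256.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Stand-ins for the bytes of two builds of the same class file.
        byte[] build1 = "class bytes, build 1".getBytes(StandardCharsets.UTF_8);
        byte[] build2 = "class bytes, build 2".getBytes(StandardCharsets.UTF_8);
        System.out.println(of(build1).equals(of(build1))); // prints true
        System.out.println(of(build1).equals(of(build2))); // prints false
    }
}
```

Recording the first time each signature is seen on a machine would be one way to date a deployment. That bookkeeping is left out of the sketch.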

Frequency. One of the most important aspects of prioritizing an error is frequency - both absolute and relative to the calls into the code. If an exception was fired 1000 times today, that may be significant if the code is called 5000 times. But if it’s being called a million times, this may actually be okay. For this Takipi shows both the number of times an error occurred and the percentage of the total calls to that code which that represents.

Trends. Some exceptions represent normal application logic such as cache misses, login failures, or conditional update failures. That may be normal, but what if an error has increased by 40% since the last deployment? You’ll probably want to know about that. So Takipi tracks trends for each error, to show whether it has increased in the past hour, day or week.
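The arithmetic behind the frequency and trend metrics is simple enough to sketch. The example below uses the numbers from the text: 1000 errors out of 5000 calls versus 1000 errors out of a million calls, and a 40% increase between two periods. The ErrorMetrics class and its method names are made up for illustration, and the 500 and 700 counts are example values. This is not Takipi's API.

```java
// Hypothetical helper: class and method names are illustrative only.
final class ErrorMetrics {

    // Percentage of calls into a piece of code that ended in the error.
    static double percentOfCalls(long errors, long calls) {
        if (calls <= 0) {
            throw new IllegalArgumentException("calls must be positive");
        }
        return 100.0 * errors / calls;
    }

    // Percentage change in error count between two periods of equal length.
    static double percentChange(long previous, long current) {
        if (previous <= 0) {
            throw new IllegalArgumentException("previous must be positive");
        }
        return 100.0 * (current - previous) / previous;
    }

    public static void main(String[] args) {
        System.out.println(percentOfCalls(1000, 5000));      // prints 20.0
        System.out.println(percentOfCalls(1000, 1_000_000)); // prints 0.1
        System.out.println(percentChange(500, 700));         // prints 40.0
    }
}
```

The same absolute count of 1000 errors reads very differently at 20% of calls than at 0.1%, which is why both numbers are worth having side by side.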


* The dashboard screen serves as a “spreadsheet” to sort and filter through all the errors in your application.

3. Analyzing Errors

Once you've decided you want to fix an error, as a developer you’ll most likely need two things - the source code which was executing on that machine, and the variable state at the moment of error. Takipi captures that information to show a reconstruction of the source code running on that machine, and the variable and object values across the stack. This enables you to quickly discover any mismatches between the two, allowing you to determine the root cause of the error.

To display source code, Takipi will either decompile bytecode in the cloud as necessary, or use .jar source files and Scala source directories if present on the machine.


* The error analysis view showing the source code and variables which caused a server exception

Distributed debugging. Takipi will also show the source code and variable values across machines. So if machine A makes an HTTP call into machine B which fires an error, Takipi will show not just the code on machine B (where it may already be too late to do anything), but also the code and variable values on any number of machines calling into it.

This is done through a process of “reverse signalling”, where the machine that fires the error signals back to the machines calling into it that an error has occurred and that they need to collect error data for the call. Takipi will then correlate these snapshots into one “story” which is presented to you.

This is especially efficient from a developer’s perspective when compared to working from a stack trace in a log file, where it can be challenging to identify and access the machine which originated the call, and then to find the relevant variable data for that call (assuming it was logged) within the logs.
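One way to picture the correlation step is as a grouping of per-machine snapshots by the call they belong to. The sketch below is a toy model only. The article doesn't describe how Takipi links snapshots, so the Snapshot type, the shared callId field and the grouping rule are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical model: Snapshot, callId and the grouping rule are assumptions;
// the article does not describe Takipi's actual data format or protocol.
final class SnapshotCorrelation {

    static final class Snapshot {
        final String callId;  // assumed identifier shared by every machine in one call chain
        final String machine; // machine that captured the snapshot
        final String detail;  // stand-in for the captured source and variable data

        Snapshot(String callId, String machine, String detail) {
            this.callId = callId;
            this.machine = machine;
            this.detail = detail;
        }
    }

    // Groups snapshots by call id, keeping arrival order within each group.
    static Map<String, List<Snapshot>> correlate(List<Snapshot> snapshots) {
        Map<String, List<Snapshot>> stories = new LinkedHashMap<>();
        for (Snapshot s : snapshots) {
            stories.computeIfAbsent(s.callId, id -> new ArrayList<>()).add(s);
        }
        return stories;
    }

    public static void main(String[] args) {
        List<Snapshot> received = Arrays.asList(
                new Snapshot("call-1", "machine-B", "exception thrown here"),
                new Snapshot("call-2", "machine-C", "unrelated error"),
                new Snapshot("call-1", "machine-A", "caller state for the failed request"));
        for (Map.Entry<String, List<Snapshot>> story : correlate(received).entrySet()) {
            System.out.println(story.getKey() + ": " + story.getValue().size() + " snapshot(s)");
        }
    }
}
```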

Communicating Errors Between Teams

The universal language of Java errors is the stack trace. Stack traces are the currency in which errors are described and passed along between dev, QA and Ops teams. One of the things Takipi does is make stack traces smarter, so that they contain not only a description of what happened, but also of when and why it happened.

Whenever an error is logged, Takipi makes a small addition to the stack trace called a power link. This hyperlink lets a developer jump directly from the stack trace into the error’s source code, variable state and analytics. The link persists as part of the stack trace, even when it’s shared by email or pasted into a bug tracking system such as Jira or Bugzilla. This enables developers to get much better data from Ops or QA using the same methodology used today, without having to instrument, redeploy and reproduce an error to get to the variable state which caused it.


* You can jump from a log stack trace to the source code and variables that caused it using embedded hyperlinks.

Performance

Debugging during development is materially different from debugging in production. When you run a JVM in debug mode using either JDWP or JVMTI, you’re enabling hooks within the JVM that let a debugger receive notifications when low-level events such as exceptions happen, or pause execution at specific bytecode locations for things like step-over and breakpoints.

The downside is that enabling these hooks prevents the JIT compiler from performing some of the optimizations it would normally do, which reduces the speed at which your code executes. An example of this would be exception callbacks. When enabled by a debugger, they prevent the JIT compiler from fully optimizing try / catch clauses, and can cause code to revert to interpreted mode when an exception is thrown, in order to make the callback into the debugger. With this comes a significant drop in speed - especially at scale.

Takipi approaches this challenge by combining static bytecode analysis in the cloud (similar to tools like Coverity) with dynamic data collection at the native JVM level.

Takipi offloads compute-intensive operations (such as bytecode analysis and data reduction) from your machine to its servers. At the machine level it instruments not only the bytecode that’s loaded into the JVM for compilation, but also the resulting x86 machine code. This enables it to collect low-level data and intercept signals without incurring the continuous performance overhead that a normal debugger would have. Through this it can operate with an average performance overhead of less than 3% once analysis of your code has been completed.

Installing On Your Machine

Takipi runs on Windows 7 / 8 / Server, OS X and major Linux flavors. You can install it using standard Linux wget / curl commands, or through installation packages such as DEB, RPM and Chef. After you install Takipi, the next step is to add an agent argument to your JVM. Once your application launches, Takipi will analyze your code for the first time. Any exception or error that your code encounters will be detected and tracked. Since Takipi operates at the JVM level, it’s agnostic to which frameworks (e.g. Guava) or web containers (e.g. Tomcat, Play) you use. No code or additional configuration changes are needed to run it.
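If you want to confirm that a running JVM was actually started with an agent argument, you can ask the JVM for its own launch arguments. The sketch below filters them for the three standard JVM agent options (-agentlib, -agentpath and -javaagent). The article doesn't give the exact argument Takipi expects, so no agent name appears in the code, and the AgentArguments class is hypothetical.

```java
import java.lang.management.ManagementFactory;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: checks which agent options a JVM was launched with.
// It does not know or assume the name of any particular agent.
final class AgentArguments {

    // Returns the JVM arguments that load a native or Java agent.
    static List<String> agentArgs(List<String> jvmArgs) {
        List<String> agents = new ArrayList<>();
        for (String arg : jvmArgs) {
            if (arg.startsWith("-agentlib:")
                    || arg.startsWith("-agentpath:")
                    || arg.startsWith("-javaagent:")) {
                agents.add(arg);
            }
        }
        return agents;
    }

    public static void main(String[] args) {
        // Arguments passed to this JVM at launch (not the program's own args).
        List<String> jvmArgs = ManagementFactory.getRuntimeMXBean().getInputArguments();
        System.out.println("Agent arguments: " + agentArgs(jvmArgs));
    }
}
```

Run without any agent option, this prints an empty list. A check like this can be handy in a startup script or health endpoint to catch a server that was deployed without the agent.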

How Is Data Stored

A big challenge when it comes to collecting data directly from the application is storing and securing it. This is especially true if the data collected contains personally identifiable information (PII), such as user names or credit card numbers.

Takipi provides two modes for storing data: hosted and on-premises. In hosted mode, source and variable data is encrypted on your machine using your private 256-bit AES encryption key before it's sent to Takipi. Your data can only be decrypted by you using your private encryption key. This is similar to the way you would secure access to an AWS instance using a private key.
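For readers who haven't worked with client-side encryption, here is a minimal sketch of encrypting data locally with a 256-bit AES key using the standard Java cryptography API. The article specifies the key size but not the cipher mode or the wire format, so AES/GCM, the random 12-byte IV prepended to the ciphertext, and the LocalEncryption class are assumptions for illustration, not Takipi's implementation.

```java
import java.nio.charset.StandardCharsets;
import java.security.GeneralSecurityException;
import java.security.SecureRandom;
import javax.crypto.Cipher;
import javax.crypto.KeyGenerator;
import javax.crypto.SecretKey;
import javax.crypto.spec.GCMParameterSpec;

// Illustrative only: cipher mode (GCM) and message layout (IV || ciphertext)
// are assumptions; the article states only that a private 256-bit AES key is used.
final class LocalEncryption {
    private static final int IV_BYTES = 12;  // recommended IV size for GCM
    private static final int TAG_BITS = 128; // authentication tag length

    // Generates a fresh 256-bit AES key that stays on the local machine.
    static SecretKey newKey() throws GeneralSecurityException {
        KeyGenerator generator = KeyGenerator.getInstance("AES");
        generator.init(256);
        return generator.generateKey();
    }

    // Encrypts plaintext and returns IV followed by ciphertext and tag.
    static byte[] encrypt(SecretKey key, byte[] plaintext) throws GeneralSecurityException {
        byte[] iv = new byte[IV_BYTES];
        new SecureRandom().nextBytes(iv);
        Cipher cipher = Cipher.getInstance("AES/GCM/NoPadding");
        cipher.init(Cipher.ENCRYPT_MODE, key, new GCMParameterSpec(TAG_BITS, iv));
        byte[] ciphertext = cipher.doFinal(plaintext);
        byte[] message = new byte[IV_BYTES + ciphertext.length];
        System.arraycopy(iv, 0, message, 0, IV_BYTES);
        System.arraycopy(ciphertext, 0, message, IV_BYTES, ciphertext.length);
        return message;
    }

    // Reverses encrypt(); fails if the key is wrong or the message was altered.
    static byte[] decrypt(SecretKey key, byte[] message) throws GeneralSecurityException {
        Cipher cipher = Cipher.getInstance("AES/GCM/NoPadding");
        cipher.init(Cipher.DECRYPT_MODE, key, new GCMParameterSpec(TAG_BITS, message, 0, IV_BYTES));
        return cipher.doFinal(message, IV_BYTES, message.length - IV_BYTES);
    }

    public static void main(String[] args) throws GeneralSecurityException {
        SecretKey key = newKey();
        byte[] sealed = encrypt(key, "userName=alice".getBytes(StandardCharsets.UTF_8));
        System.out.println(new String(decrypt(key, sealed), StandardCharsets.UTF_8));
    }
}
```

Because only the sealed bytes leave the machine, whoever stores them can't read the variable data without the key, which matches the property the hosted mode is described as providing.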

In on-premises mode, data is not sent to Takipi’s servers, but to a designated server located on-premises, where it gets stored. When you open an error inside of Takipi, instead of pulling that data into the browser from Takipi’s servers, it is pulled from the on-premises server using a RESTful API call. You can customize the way in which data is stored (e.g. file system, relational DBs, key / value store), and the method by which users authenticate against it. This enables you to manage data access and retention in accordance with your own organization’s internal policies.

* In on-premises mode, data is not kept on Takipi’s servers, but is stored and retrieved from within the user’s domain.

Summary

Takipi is trying to improve one of the most basic operations developers do every day - fixing their code when it breaks. You can open up a trial account and test Takipi on your application. Share your notes, thoughts and questions with us in the comments section below - we’d love to hear them!

About the Author

Tal Weiss is the CEO of Takipi. Tal has been designing scalable, real-time Java and C++ applications for the past 15 years. He still enjoys analyzing a good bug though, and instrumenting Java code. In his free time Tal plays Jazz drums.