All Your Base Are Belong To Us

February 14, 2016

Warning: This post requires a bit of background. I strongly recommend Brendan Gregg’s introduction to eBPF and bcc. With that said, the post below describes two new bcc-based tools, which you can use directly without perusing the implementation details.

A few weeks ago, I started experimenting with eBPF. In a nutshell, eBPF (introduced in Linux kernel 3.19 and further improved in 4.x kernels) allows you to attach verifiably safe programs to arbitrary functions in the kernel or a user process. These little programs, which execute in kernel mode, can collect performance information, trace diagnostic data, and aggregate statistics that are then exposed to user mode. Although BPF’s lingua franca is a custom instruction set, the bcc project provides a C-to-BPF compiler and a Python module that can be used from user mode to load BPF programs, attach them, and print their results. The bcc repository contains numerous examples of using BPF programs, and a growing collection of tracing tools. Because these tools aggregate in the kernel and pass only summaries to user mode, they offer much lower overhead than perf or similar alternatives when those are used to record every event.

So far, this work has produced two new scripts: memleak and argdist. memleak helps detect memory leaks in kernel components or user processes by keeping track of allocations that haven’t been freed, along with the call stack that performed each allocation. argdist is a generic tool that traces function arguments into a histogram or frequency counting table, so that you can explore a function’s behavior over time. To experiment with the tools in this post, you will need to install bcc on a modern kernel (4.1+ is recommended). Instructions and prerequisites are available on the bcc installation page.

memleak

In basic mode, memleak attaches to either malloc and free, or kmalloc and kfree, and collects outstanding allocations. Outstanding allocations older than a certain age are printed along with the allocating call stack. For example, here’s some output from a leaking user program:

It looks like main keeps allocating more and more memory that isn’t being freed.
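The bookkeeping behind basic mode can be sketched in plain Python. This is a toy model under stated assumptions, not memleak’s actual code: the class and method names are hypothetical, the events are fed in by hand instead of coming from probes on malloc and free, and call stacks are plain lists of function names.

```python
from collections import defaultdict


class AllocationTracker:
    """Toy model of outstanding-allocation tracking (hypothetical names)."""

    def __init__(self):
        # address -> (size in bytes, allocation timestamp in ns, call stack)
        self.outstanding = {}

    def on_alloc(self, address, size, timestamp_ns, stack):
        # Called when the allocator returns a new block.
        self.outstanding[address] = (size, timestamp_ns, tuple(stack))

    def on_free(self, address):
        # A freed block is no longer outstanding; unknown addresses are ignored.
        self.outstanding.pop(address, None)

    def report(self, now_ns, min_age_ns):
        """Group allocations older than min_age_ns by allocating call stack."""
        by_stack = defaultdict(lambda: [0, 0])  # stack -> [bytes, count]
        for size, timestamp_ns, stack in self.outstanding.values():
            if now_ns - timestamp_ns >= min_age_ns:
                by_stack[stack][0] += size
                by_stack[stack][1] += 1
        return {stack: tuple(totals) for stack, totals in by_stack.items()}


tracker = AllocationTracker()
tracker.on_alloc(0x1000, 16, 100, ["main", "malloc"])
tracker.on_alloc(0x2000, 16, 200, ["main", "malloc"])
tracker.on_alloc(0x3000, 64, 300, ["helper", "malloc"])
tracker.on_free(0x3000)  # the helper's block is released, so it is not a leak

leaks = tracker.report(now_ns=1000, min_age_ns=500)
print(leaks)
```

In this sketch, only the two 16-byte blocks allocated from main remain outstanding, so they are reported together under a single stack.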

Additional options include printing only the top N allocating stacks, only stacks that allocated more than N bytes, capturing only specific allocation sizes, reducing overhead by capturing only every N-th allocation, and more. The really cool part is how easy it was to build this tool, even though I had to roll my own user symbol decoding support (currently based on a rather hackish invocation of `objdump`).
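To illustrate how a few of these options narrow the output, here is a small Python sketch. The function names and parameters are hypothetical and do not correspond to memleak’s actual command-line switches or internals. It only shows the idea of sampling every N-th event and of keeping the top N stacks above a byte threshold.

```python
def sample_every(events, n):
    """Keep only every n-th event, trading accuracy for lower overhead."""
    return [event for index, event in enumerate(events, start=1) if index % n == 0]


def top_stacks(bytes_by_stack, top=None, min_bytes=0):
    """Return (stack, bytes) rows above min_bytes, largest first, limited to top."""
    rows = [(stack, total) for stack, total in bytes_by_stack.items()
            if total > min_bytes]
    rows.sort(key=lambda row: row[1], reverse=True)
    return rows if top is None else rows[:top]


# Hypothetical per-stack totals of outstanding bytes.
stats = {"main": 4096, "parse_input": 512, "log_init": 32}

sampled = sample_every(list(range(1, 11)), 5)
largest_two = top_stacks(stats, top=2)
above_100 = top_stacks(stats, min_bytes=100)
print(sampled, largest_two, above_100)
```

Sampling means some allocations are never seen, so a sampled report is a lower bound on what is actually outstanding.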

argdist

argdist is a Swiss Army knife designed to analyze the distribution of a function’s arguments. It attaches to functions in the kernel or a user process, collects specific argument values, stores them in a histogram or frequency counting collection, and displays them for further analysis.

Let’s start with a couple of simple examples. Suppose you want to find what allocation sizes are common in your application. The probe syntax for malloc is the following: `p:c:malloc(size_t size):size_t:size`. This obscure-looking string is rather simple, really. Its fields are separated by colons: `p` stands for probe (it could also be `r`, which is a probe on the return from the function), `c` is the library that contains the function, `malloc(size_t size)` is the function’s signature, and the last two fields are the type of the expression that you want to collect and the expression itself.
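To make the structure of the probe string concrete, here is a minimal Python sketch that splits it into its fields. It covers only the form described above and is not argdist’s own parser. The `Probe` type and `parse_probe` function are hypothetical names, and accepting an empty library field is an assumption of this sketch.

```python
import re
from typing import NamedTuple


class Probe(NamedTuple):
    kind: str        # "p" for entry probe, "r" for return probe
    library: str     # library that contains the function, e.g. "c"
    function: str    # function name, e.g. "malloc"
    params: list     # list of (type, name) pairs from the signature
    expr_type: str   # type of the collected expression
    expr: str        # expression to collect


# kind : library : function(signature) : expression type : expression
PROBE_RE = re.compile(r"^(p|r):([^:]*):(\w+)\(([^)]*)\):([^:]+):(.+)$")


def parse_probe(spec):
    match = PROBE_RE.match(spec)
    if match is None:
        raise ValueError("unrecognized probe specifier: %r" % spec)
    kind, library, function, signature, expr_type, expr = match.groups()
    params = []
    for param in filter(None, (p.strip() for p in signature.split(","))):
        # The last word is the parameter name; everything before it is the type.
        param_type, param_name = param.rsplit(None, 1)
        params.append((param_type, param_name))
    return Probe(kind, library, function, params, expr_type, expr)


probe = parse_probe("p:c:malloc(size_t size):size_t:size")
print(probe)
```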

This application seems to be allocating only blocks of size 16. We can do a similar thing with kernel functions. For example, here’s a probe on kmalloc that groups by both the allocation type (gfp_t) and the allocation size:

Another thing you can do is wait for the function to return and then refer to its execution time (latency) and the values of the arguments it had on entry. For example, ever wondered how many nanoseconds it takes to allocate a typical byte using malloc?
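The idea of pairing a function’s entry with its return can be sketched in plain Python. This is an illustration under stated assumptions, not argdist’s implementation: events are supplied by hand, entries are keyed by thread id so that concurrent calls do not get mixed up, and the class and function names are hypothetical. Each completed call contributes its latency divided by the requested size to a power-of-two histogram.

```python
from collections import Counter


def log2_bucket(value):
    """Power-of-two bucket index: 0 -> 0, 1 -> 1, 2..3 -> 2, 4..7 -> 3, and so on."""
    return int(value).bit_length()


class LatencyPerByte:
    """Toy model of an entry/return probe pair (hypothetical names)."""

    def __init__(self):
        self.entries = {}           # thread id -> (entry timestamp ns, size argument)
        self.histogram = Counter()  # bucket index -> number of calls

    def on_entry(self, tid, timestamp_ns, size):
        # Remember the argument and the time at which the call started.
        self.entries[tid] = (timestamp_ns, size)

    def on_return(self, tid, timestamp_ns):
        entry = self.entries.pop(tid, None)
        if entry is None:
            return  # return without a recorded entry; nothing to pair with
        start_ns, size = entry
        if size == 0:
            return  # avoid dividing by zero for zero-sized requests
        ns_per_byte = (timestamp_ns - start_ns) // size
        self.histogram[log2_bucket(ns_per_byte)] += 1


tracer = LatencyPerByte()
tracer.on_entry(tid=1, timestamp_ns=1000, size=16)
tracer.on_return(tid=1, timestamp_ns=1160)   # 160 ns / 16 bytes = 10 ns per byte
tracer.on_entry(tid=2, timestamp_ns=2000, size=16)
tracer.on_return(tid=2, timestamp_ns=2192)   # 192 ns / 16 bytes = 12 ns per byte
tracer.on_entry(tid=1, timestamp_ns=3000, size=64)
tracer.on_return(tid=1, timestamp_ns=3064)   # 64 ns / 64 bytes = 1 ns per byte
print(dict(tracer.histogram))
```

Bucket 4 covers 8 to 15 ns per byte, so the first two calls land there, and the third lands in bucket 1.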

argdist is a fairly sophisticated tool, so it has a lot of switches and features that I haven’t described here. You can control tracing frequency, monitor functions that have struct parameter types, capture complex expressions and filters, capture multiple variables in a single probe, and even capture data from multiple functions in a single run.

Summary

This post introduced memleak and argdist, two new tools based on bcc/eBPF that demonstrate the power of dynamic tracing. memleak helps diagnose memory leaks, and argdist helps analyze function arguments using histograms and frequency collections. I intend to continue working on these and other bcc-based tools in the future. If this looks cool or useful, please head over to GitHub and contribute!