Digging into the system

This post is an introduction to sysdig — an “open source, system-level exploration” tool that eases
the task of performance troubleshooting on the Linux operating system.

The Box

When it comes to understanding the performance characteristics of applications, I use a method called “The Box”,
introduced to me by Kirk Pepperdine during his Java Performance Workshop. The Box itself is an abstraction of the complete
system and a systematic method that can be used to find performance bottlenecks.

Kirk states:

The final layer in the box is Hardware. This is a static layer that has a finite capacity. The CPU can only process
so many instructions per second, memory can only hold so much data, I/O channels have limited data transfer rates, disks
have a fixed capacity. It hardly needs to be said that if you don’t have enough capacity in your hardware, your
application’s performance will suffer. Given the direct impact that hardware has on performance, all investigations
must start here.

Our main task when investigating this layer is to understand which shared hardware resources are used by
our application, and how. With this knowledge we can move up “the box”, understanding how each successive layer utilises
resources and what impact that has on the overall performance seen by the end user.

There are many tools for observing this layer, but most of them require extensive knowledge of system internals —
of how to use them and how to read their results. This is exactly why I think sysdig is a great tool for everyone involved
in performance troubleshooting: it’s flexible, extensible and extremely easy to learn.

In the next couple of sections I will try to convince you of this by showing how sysdig works at the
kernel level, what capabilities it has and how you can leverage its power in everyday work.

System calls

Before going further, it is crucial to explain what a system call is and how system call usage affects performance.
According to Wikipedia, a system call can be defined as follows:

In computing, a system call is how a program requests a service from an operating system’s kernel. This may
include hardware-related services (for example, accessing a hard disk drive), creation and execution of new processes,
and communication with integral kernel services such as process scheduling. System calls provide an essential
interface between a process and the operating system.

System calls exist to provide a proper security model in which user and kernel
space are separated. The two spaces have different access levels thanks to
the protection rings mechanism implemented in CPUs. As a result, programs run
in their own address space and direct hardware access is prohibited.

Essentially every operation involving I/O (hard disk access, networking or working with any other device), managing
processes or threads, scheduling or memory allocation goes through the system call facility. Understanding application
behaviour at this level gives us intimate knowledge of which hardware resources are consumed and where bottlenecks
occur.
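To make this concrete, here is a quick way to see the system calls behind even a trivial command. This is only a sketch: it assumes strace is available on your system, and the exact call list and counts depend on your libc and kernel.

```shell
# Count the system calls made by a single 'cat' invocation.
# -c prints a summary table instead of every call; -f follows forks.
~# strace -c -f cat /etc/hostname > /dev/null
# The summary table typically includes calls such as openat, read,
# write, close and mmap, together with the time spent in each.
```

Even this one-liner goes through dozens of system calls, which is why syscall-level visibility tells you so much about resource usage.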

You can read more about system calls and how they are implemented in Linux in the following sources:

Tracepoints

Sysdig makes use of a kernel facility called tracepoints, introduced in kernel version 2.6.28 (released December 2008).
This mechanism allows developers to attach probes to specific functions inside the kernel. The list of traceable syscall
tracepoints can be obtained with the command perf list 'syscalls:*'. The output should be similar to the following:
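The output has a predictable shape; on a typical kernel it looks roughly like this (the exact names and the length of the list vary with kernel version):

```shell
~# perf list 'syscalls:*'
  syscalls:sys_enter_accept          [Tracepoint event]
  syscalls:sys_exit_accept           [Tracepoint event]
  syscalls:sys_enter_open            [Tracepoint event]
  syscalls:sys_exit_open             [Tracepoint event]
  syscalls:sys_enter_read            [Tracepoint event]
  syscalls:sys_exit_read             [Tracepoint event]
  ...
```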

As you can see, tracepoints allow capturing system call entry and exit points, so the processing time on the kernel side
(let’s call it latency) can be determined. There are many more tracepoints besides syscalls, but as of version 0.1.93
they are not captured by sysdig, so we will not cover them here (you can always play with the perf
tool and get every possible piece of information directly from the kernel).

Sysdig architecture


Sysdig consists of three main parts:

a memory-mapped ring buffer shared between user and kernel space,

a kernel module called sysdig_probe that is responsible for publishing captured events into the ring,

the sysdig client tool that reads, filters and processes the published events.

This straightforward architecture is what enables sysdig’s low-overhead tracing of system calls and scheduling events on
the kernel side, as the kernel module itself is only responsible for copying event details (note that the probe halts
kernel execution, so the less work it has to do, the greater the throughput). Most of the work is then done in user
space, where events are read from the ring buffer, decoded, filtered, processed in any way and displayed to the user.

Using sysdig

Now that we are through this boring introduction, it is time to play with sysdig and unleash its power.

As you can clearly see from example #2 above, sshd has accepted a new incoming
connection on a socket from address 89.70.xx.xxx.

You may not have noticed in example #1 that the event numbering is not contiguous: this comes from the fact that
sysdig filters out events coming from itself. If you want to capture all events, just use the -D debug flag.

You should also notice the huge number of Driver drops in example #1. As it turns out, the sysdig kernel
module is clever enough to drop events when the ring buffer is full and the client is not able to keep up. Thanks to
that, there is no danger of a sudden system slowdown, which makes sysdig suitable for production usage.

Capturing and reading events

Instead of displaying events on the console (which causes a high number of Driver drops, as formatting and writing
output takes a small but significant amount of time), we will capture them, write them to disk and analyze them later
(this is perfect for offline analysis in case of an emergency):
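A capture session along these lines might look as follows. The file name and event count are arbitrary; the flags shown (-w to write a trace file, -r to read it back, -n to stop after N events) are standard sysdig options:

```shell
# Capture 50000 events to a trace file:
~# sysdig -n 50000 -w trace.scap

# Later (possibly on another machine), read them back:
~# sysdig -r trace.scap
```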

We captured, wrote and read back exactly the same number of events, which is a pretty good sign.

Filtering the data

Most programs rely heavily on system resources (and thus make a lot of system calls), and the number of events
sysdig is able to capture is overwhelming - reading them line by line would be a cumbersome task.

If you wonder what fields are associated with each generic event, you should definitely run sysdig -l; for the
list of event types and their arguments (evt.args), run sysdig -L.

With this knowledge, let’s try something simple and find out whether someone has tried to connect to our sshd (we will
filter events by process name, event type and event direction over the previously collected trace file):
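A filter matching that description could be written like this. Here trace.scap is a placeholder for your capture file; evt.dir '<' selects the exit side of the call, where the result of the accept is available:

```shell
# Look for completed accept() calls made by sshd in the trace:
~# sysdig -r trace.scap proc.name=sshd and evt.type=accept and evt.dir='<'
```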

Notice that I’m using the -z flag so that the trace file is compressed, and I’m also prefiltering the data so that only
events related to the nginx process are captured. The -s flag determines how many bytes of buffer are captured on I/O
events (like reading from or writing to a file).
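Put together, the capture command described above might look like this (the output file name is illustrative):

```shell
# Compressed capture (-z), 2000 bytes of I/O buffers (-s),
# prefiltered to nginx events only:
~# sysdig -z -s 2000 -w nginx-trace.scap.gz proc.name=nginx
```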

In this example I’m particularly interested in the open syscall, to see which files have
been “touched” by the web server and how many times. I would also like to export the directory of the file, its name
and the event timestamp in JSON format. Here is how it can be achieved:
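One plausible invocation reads back the nginx capture described above (the file name is a placeholder); the field names used in the -p format string come from sysdig -l:

```shell
# Emit matching open events as JSON with a custom field selection:
~# sysdig -j -r nginx-trace.scap.gz \
     -p '%evt.time %fd.directory %fd.filename' \
     proc.name=nginx and evt.type=open
```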

That’s it - a single command, with the -j flag responsible for returning events in JSON format and -p
allowing you to select (and format) the fields that will be part of the output.

Given the powerful filtering and formatting syntax, it’s extremely easy to analyze and understand the behaviour of your
applications.

But there is one more thing

Sometimes, however, you will need to analyze not a single event but sequences of ordered events in a stateful
manner. If you are familiar with dtrace, you probably know that it supports writing event-based
scripts. The analogous concept in sysdig is called a chisel.

Chisels are small, reusable scripts written in the Lua programming language, and I can say from experience that
they are quite easy to write :)

Sysdig comes with a couple of useful chisels out of the box:

~# sysdig -cl
Category: CPU Usage
-------------------
topprocs_cpu Top processes by CPU usage
Category: Errors
----------------
topfiles_errors top files by number of errors
topprocs_errors top processes by number of errors
...

Then, to display more information about a chisel (and its arguments), the -i flag can be used:

~# sysdig -i iobytes_net
Category: Net
-------------
iobytes_net Show total network I/O bytes
counts the total bytes read from and written to the network, and prints the result every second
Args:
(None)

Lua syntax is powerful and easy to understand, and the chisel
API consists of only a handful of functions, so
writing chisels is quite a pleasant task.
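Running a chisel is just another sysdig invocation: the -c flag takes the chisel name, any chisel arguments follow it, and an ordinary sysdig filter can be appended afterwards:

```shell
# Top processes by CPU usage, system wide:
~# sysdig -c topprocs_cpu

# A chisel combined with a regular filter - top files written/read
# by nginx only:
~# sysdig -c topfiles_bytes proc.name=nginx
```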

As the best recommendation I can give: it took me about an hour to write a simple socket inactivity
detector, without any prior knowledge or experience in writing
chisels or in the Lua programming language. Cool, isn’t it?

Summary

For me, sysdig is a great, easy-to-understand and easy-to-use tool for online and offline production profiling and
analysis tasks. It comes with powerful filtering and formatting capabilities, very good
documentation and a growing community. The ability
to write chisels allows both automating common tasks and performing complex, stateful analyses.

I would recommend sysdig as a first-choice tool when diagnosing application performance problems.

Mateusz is a solutions architect responsible for financial and payments systems. His main areas of research interest are: scalable, distributed computing in cloud environments, reactive programming and failure resilience.