How to trace a system call is not obvious at first sight – there are various kernel profilers, each working in its own way and requiring its own configuration, methods, analysis and so on. Yes, it’s really hard to figure out. Being the biggest open-source project, developed by a massive community, Linux has absorbed several different and sometimes conflicting profiling facilities. And in some sense it’s getting even worse – while some profilers tend to merge (ftrace and perf), other tools emerge – the latest example is ktap.

To understand that bazaar, let’s start from the bottom – what does the kernel have that makes it possible to profile it? Basically, there are only 3 kernel facilities that enable profiling:

Kernel tracepoints

Kernel probes

Perf events

These are the features that give us access to kernel internals. By using them we can measure kernel function execution, trace access to devices, analyze CPU states and so on.

These features are really awkward for direct use and are accessible only from the kernel. If you really want to, you can write your own Linux kernel module that utilizes these facilities for your custom needs, but that’s pretty much pointless. That’s why people have created a few really good general-purpose profilers:

ftrace

perf

SystemTap

ktap

All of them are based on those features and will be discussed more thoroughly later, but for now let’s review the features themselves.

Kernel tracepoints

A tracepoint is a place in the code where you can bind a callback. Tracepoints can be disabled (no callback) or enabled (callback bound). There might be several callbacks, though it’s still lightweight – a disabled tracepoint amounts to little more than if (unlikely(tracepoint.enabled)).

Tracepoint output is written into a ring buffer that is exported through debugfs at /sys/kernel/debug/tracing/trace. There is also a whole tree of traceable events at /sys/kernel/debug/tracing/events that exports control files to enable/disable particular events.

Despite the name, tracepoints are the basis for event-based profiling, because besides tracing you can do anything in the callback, e.g. timestamping and measuring resource usage. The Linux kernel has been instrumented with tracepoints in many places since 2.6.28. For example, __do_kmalloc:

```c
/**
 * __do_kmalloc - allocate memory
 * @size: how many bytes of memory are required.
 * @flags: the type of memory to allocate (see kmalloc).
 * @caller: function caller for debug tracking of the caller
 */
static __always_inline void *__do_kmalloc(size_t size, gfp_t flags,
					  unsigned long caller)
{
	struct kmem_cache *cachep;
	void *ret;

	/* If you want to save a few bytes .text space: replace
	 * __ with kmem_.
	 * Then kmalloc uses the uninlined functions instead of the inline
	 * functions.
	 */
	cachep = kmalloc_slab(size, flags);
	if (unlikely(ZERO_OR_NULL_PTR(cachep)))
		return cachep;
	ret = slab_alloc(cachep, flags, caller);

	trace_kmalloc(caller, ret, size, cachep->size, flags);

	return ret;
}
```

trace_kmalloc is a tracepoint. There are many others in critical parts of the kernel such as the scheduler, block I/O, networking and even interrupt handlers. All of them are used by most profilers because they have minimal overhead, fire on the event, and save you from modifying the kernel.

Ok, so by now you may be eager to insert tracepoints into all of your kernel modules and profile them to hell, but BEWARE. Adding your own tracepoints requires a lot of patience and skill, because declaring them is really ugly and awkward. You can see examples at samples/trace_events/. Under the hood a tracepoint is C macro black magic that only the bold and fearless can understand.

And even if you get through all those crazy macro declarations and struct definitions, it might simply not work at all if you have CONFIG_MODULE_SIG=y and don’t sign your module. It might seem like a strange configuration, but in reality it’s the default for all major distributions, including Fedora and Ubuntu. That said, after 9 circles of hell, you will end up with nothing.

So, just remember:

USE ONLY EXISTING TRACEPOINTS IN KERNEL, DO NOT CREATE YOUR OWN.

Now I’m gonna explain why this happens. If you’re tired of tracepoints, just skip ahead to kprobes.

```c
static int tracepoint_module_coming(struct module *mod)
{
	struct tp_module *tp_mod, *iter;
	int ret = 0;

	/*
	 * We skip modules that taint the kernel, especially those with different
	 * module header (for forced load), to make sure we don't cause a crash.
	 */
	if (mod->taints)
		return 0;
```

If the module is tainted, we will NOT get ANY tracepoints. Later the check became more adequate:

```c
	/*
	 * We skip modules that taint the kernel, especially those with different
	 * module headers (for forced load), to make sure we don't cause a crash.
	 * Staging and out-of-tree GPL modules are fine.
	 */
	if (mod->taints & ~((1 << TAINT_OOT_MODULE) | (1 << TAINT_CRAP)))
		return 0;
```

Like, ok, it may be out-of-tree (TAINT_OOT_MODULE) or staging (TAINT_CRAP), but anything else is a no-no.

Seems legit, right? Now, what do you think happens if your kernel is compiled with CONFIG_MODULE_SIG enabled and your pretty module is not signed? Well, the module loader will set the TAINT_FORCES_MODULE flag for it. And now your pretty module will never pass the condition in tracepoint_module_coming and will never show you any tracepoint output. And as I said earlier, this stupid option has been the default for all major distributions, including Fedora and Ubuntu, since kernel version 3.1.

If you think – “Well, let’s sign the goddamn module!” – you’re wrong. Modules must be signed with the kernel private key, which is held by your Linux distro vendor and, of course, not available to you.

As for me, I’ll just cite my favorite quote from Steven Rostedt (ftrace maintainer and one of the tracepoint developers):

> OK, this IS a major bug and needs to be fixed. This explains a couple
> of reports I received about tracepoints not working, and I never
> figured out why. Basically, they even did this:
>
>
> trace_printk("before tracepoint\n");
> trace_some_trace_point();
> trace_printk("after tracepoint\n");
>
> Enabled the tracepoint (it shows up as enabled and working in the
> tools, but not the trace), but in the trace they would get:
>
> before tracepoint
> after tracepoint
>
> and never get the actual tracepoint. But as they were debugging
> something else, it was just thought that this was their bug. But it
> baffled me to why that tracepoint wasn't working even though nothing in
> the dmesg had any errors about tracepoints.
>
> Well, this now explains it. If you compile a kernel with the following
> options:
>
> CONFIG_MODULE_SIG=y
> # CONFIG_MODULE_SIG_FORCE is not set
> # CONFIG_MODULE_SIG_ALL is not set
>
> You now just disabled (silently) all tracepoints in modules. WITH NO
> FREAKING ERROR MESSAGE!!!
>
> The tracepoints will show up in /sys/kernel/debug/tracing/events, they
> will show up in perf list, you can enable them in either perf or the
> debugfs, but they will never actually be executed. You will just get
> silence even though everything appeared to be working just fine.

Recap:

Kernel tracepoints are a lightweight tracing and profiling facility.

The Linux kernel is heavily instrumented with tracepoints, which are used by most profilers, especially perf and ftrace.

Tracepoints are C macro black magic and almost impossible to use in your own kernel modules.

They will NOT work in your LKM if:

Kernel version >=3.1 (might be fixed in 3.15)

CONFIG_MODULE_SIG=y

Your module is not signed with the kernel private key.

Kernel probes

Kernel probes are a dynamic debugging and profiling mechanism that allows you to break into kernel code, invoke your custom function called a probe, and then return everything back.

Basically, it’s done by writing a kernel module where you register a handler for some address or symbol in kernel code. Also, according to the definition of struct kprobe, you can pass an offset from the address, but I’m not sure about that. In your registered handler you can do really anything – write to the log, write to some buffer exported via sysfs, measure time – an infinite number of possibilities. And that’s really nifty, contrary to tracepoints, where you can only read logs from debugfs.

There are 3 types of probes:

kprobes – basic probes that allow you to break into any kernel address.

jprobes – jump probes that are inserted at the start of a function and give you handy access to the function’s arguments; something like a proxy function.

kretprobes – return probes that are inserted at the return point of a function.

The last 2 types are based on basic kprobes.

Altogether it works like this:

We register a probe on some address A.

The kprobe subsystem finds A.

kprobe copies the instruction at address A.

kprobe replaces the instruction at A with a breakpoint (int 3 in the case of x86).

Now, when execution hits the probed address A, a CPU trap occurs.

Registers are saved.

The CPU transfers control to kprobes via the notifier_call_chain mechanism.

And finally, kprobes invokes our handler.

After all that, we restore the registers, copy the original instruction back to A, and continue execution.

Our handler usually receives as arguments the address where the breakpoint happened and the register values in a pt_regs structure. kprobes handler prototype:

```c
typedef int (*kprobe_break_handler_t)(struct kprobe *, struct pt_regs *);
```

In most cases, except debugging, this info is useless because we have jprobes. A jprobes handler has exactly the same prototype as the intercepted function. For example, this is a handler for do_fork:

```c
/* Proxy routine having the same arguments as actual do_fork() routine */
static long jdo_fork(unsigned long clone_flags, unsigned long stack_start,
		     struct pt_regs *regs, unsigned long stack_size,
		     int __user *parent_tidptr, int __user *child_tidptr)
```

Also, jprobes don’t cause interrupts because they work with the help of setjmp/longjmp, which are much more lightweight.

And finally, the most convenient tool for profiling is kretprobes. It allows you to register 2 handlers – one invoked at function entry and the other at function return. The really cool feature is that you can save state between those 2 calls, like a timestamp or counters.

kprobes are a beautiful hack for dynamic debugging, tracing and profiling.

It’s a fundamental kernel feature for non-invasive profiling.

Perf events

perf_events is an interface to hardware metrics implemented in the PMU (Performance Monitoring Unit), which is part of the CPU.

Thanks to perf_events you can easily ask the kernel to show you, say, the L1 cache miss count, regardless of which architecture you are on – x86 or ARM. The CPUs supported by perf are listed here.

In addition to that, perf includes various kernel metrics, like the software context switch count (PERF_COUNT_SW_CONTEXT_SWITCHES).

And on top of that, perf includes tracepoint support via ftrace.

To access perf_events there is a special syscall, perf_event_open. You pass the type of event (hardware, kernel, tracepoint) and a so-called config, where you specify what exactly you want, depending on the type: a function name in the case of a tracepoint, some CPU metric in the case of hardware, and so on.

And on top of that, there is a whole lot of other stuff like event groups, filters, sampling and various output formats. And all of that is constantly breaking, which is why the only thing you should use to access perf_events is the special perf utility – the only userspace utility that is part of the kernel tree.

perf_events and everything related to it has spread like a plague through the kernel, and now ftrace is going to become part of perf (1, 2). Some people overreact to perf-related things, though it’s pointless because perf is developed by kernel big fish – Ingo Molnar and Peter Zijlstra.

I really can’t tell you anything more about perf_events in isolation from perf, so I’ll finish here.

Summary

There are a few Linux kernel features that enable profiling:

tracepoints

kprobes

perf_events

All Linux kernel profilers use some combination of these features – read the details in the article for the particular profiler.