The following patch series brings to vanilla Linux a bit of the RT kernel trace facility. This incorporates the "-pg" profiling option of gcc that will call the "mcount" function for all functions called in the kernel.

Note: I did investigate using -finstrument-functions, but that adds a call to both the start and the end of a function. Using mcount only hooks the beginning of the function. mcount alone adds ~13% overhead, while -finstrument-functions added ~19%. It also made me do tricks with inline, because it adds the function calls to inline functions as well.

This patch series implements the code for x86 (32 and 64 bit), but other archs can easily be implemented as well (note: ARM and PPC are already implemented in -rt).

Some Background:
----------------

A while back, Ingo Molnar and William Lee Irwin III created a latency tracer to find problem latency areas in the kernel for the RT patch. This tracer became an integral part of the RT kernel for finding where the latency hot spots were. One of the features that the latency tracer added was a function trace. This function tracer would record all functions that were called (implemented by the gcc "-pg" option) and would show what was called when interrupts or preemption were turned off.

This feature is also very helpful in normal debugging. So there has been talk of taking bits and pieces from the RT latency tracer and bringing them to LKML. But no one had the time to do it.

Arnaldo Carvalho de Melo took a crack at it. He pulled out the mcount code as well as part of the tracing code, and made it generic from the point of view of the tracing code. I'm not sure why this stopped. Probably because Arnaldo is a very busy man, and his efforts had to be utilized elsewhere.

I came across a need to do the mcount with logdev too. I was successful, but found that it became very dependent on a lot of code. One thing that I liked about my logdev utility was that it was very non-intrusive, and it has been easy to port since the Linux 2.0 days. I did not want to burden the logdev patch with the intrusiveness of mcount (not really that intrusive, it just needs to add a "notrace" annotation to functions in the kernel, but that will cause more conflicts in applying patches for me).

Being close to the holidays, I grabbed Arnaldo's old patches and started massaging them into something that could be useful for logdev, and what I found out (after talking this over with Arnaldo too) is that this can be much more useful for others as well.

The main thing I changed was that I made the mcount function itself generic, so that it no longer depends on the tracing code. That is, I added

  register_mcount_function() and
  clear_mcount_function()

So whenever mcount is enabled and a function is registered, that function is called for all functions in the kernel that are not labeled with the "notrace" annotation.
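
As a rough illustration of the registration idea (a user-space sketch, not the code from these patches), a single function-pointer slot is enough to model it. The signatures of register_mcount_function() and clear_mcount_function() below are assumptions, and mcount_model() and count_calls() are hypothetical stand-ins for the real mcount stub and a registered function.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed hook type: the address of the traced function and of its caller. */
typedef void (*mcount_func_t)(unsigned long ip, unsigned long parent_ip);

/* The single registration slot; NULL means nothing is registered. */
static mcount_func_t mcount_hook;

/* Assumed signatures; the real ones live in the patches. */
static void register_mcount_function(mcount_func_t func)
{
	mcount_hook = func;
}

static void clear_mcount_function(void)
{
	mcount_hook = NULL;
}

/* Hypothetical stand-in for the stub gcc -pg calls on every function entry. */
static void mcount_model(unsigned long ip, unsigned long parent_ip)
{
	mcount_func_t func = mcount_hook;	/* read the slot once */

	if (func)
		func(ip, parent_ip);
}

static unsigned long calls;

/* Example registered function: just counts how often it is invoked. */
static void count_calls(unsigned long ip, unsigned long parent_ip)
{
	(void)ip;
	(void)parent_ip;
	calls++;
}

/* Returns how many times the hook ran across four simulated entries. */
static unsigned long demo_register_clear(void)
{
	calls = 0;
	mcount_model(0x1000, 0x2000);		/* nothing registered: ignored */
	register_mcount_function(count_calls);
	mcount_model(0x1000, 0x2000);		/* counted */
	mcount_model(0x1010, 0x1000);		/* counted */
	clear_mcount_function();
	mcount_model(0x1020, 0x1000);		/* cleared: ignored */
	return calls;
}
```

Calling demo_register_clear() from a small main() shows that only the entries made while a function is registered reach it.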

The Simple Tracer:
------------------

To show the power of this, I also massaged the tracer code that Arnaldo pulled from the RT patch and made it a nice example of what can be done with this.

The function that is registered to mcount has the prototype:

void func(unsigned long ip, unsigned long parent_ip);

The ip is the address of the function and parent_ip is the address of the parent function that called it.
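
To make the prototype concrete, here is a minimal user-space sketch of a function with that signature that stores each (ip, parent_ip) pair in a small ring buffer. This is not the tracer from the patches; TRACE_ENTRIES, struct trace_entry, record_call() and the helpers are all hypothetical names chosen for the example.

```c
#include <assert.h>
#include <stddef.h>

#define TRACE_ENTRIES 4		/* tiny on purpose, so wrap-around is visible */

struct trace_entry {
	unsigned long ip;		/* address of the traced function */
	unsigned long parent_ip;	/* address of its caller */
};

static struct trace_entry trace_buf[TRACE_ENTRIES];
static size_t trace_next;		/* total number of entries ever written */

/* Matches: void func(unsigned long ip, unsigned long parent_ip); */
static void record_call(unsigned long ip, unsigned long parent_ip)
{
	struct trace_entry *e = &trace_buf[trace_next % TRACE_ENTRIES];

	e->ip = ip;
	e->parent_ip = parent_ip;
	trace_next++;			/* oldest entry is overwritten when full */
}

/* Number of valid entries currently held. */
static size_t trace_count(void)
{
	return trace_next < TRACE_ENTRIES ? trace_next : TRACE_ENTRIES;
}

/* Oldest entry that has not been overwritten yet. */
static const struct trace_entry *trace_oldest(void)
{
	size_t start = trace_next < TRACE_ENTRIES ? 0 : trace_next % TRACE_ENTRIES;

	return &trace_buf[start];
}

/* Record five calls into a four-entry buffer; return the oldest surviving ip. */
static unsigned long demo_ring(void)
{
	unsigned long i;

	trace_next = 0;
	for (i = 0; i < 5; i++)
		record_call(0x100 + i, 0x200);
	return trace_oldest()->ip;
}
```

A real tracer would need per-CPU buffers and would have to avoid calling anything that is itself traced; this sketch ignores both.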

The x86_64 version has the assembly call the registered function directly to save having to do a double function call.

To enable mcount, a sysctl is added:

/proc/sys/kernel/mcount_enabled

Once mcount is enabled, when a function is registered, it will be called by all functions. The tracer in this patch series shows how this is done. It adds a directory in debugfs called mctracer, with a ctrl file that lets the user have the tracer register its function. Note, the order of enabling mcount and registering a function is not important, but both must be done to initiate the tracing. That is, you can disable tracing by either disabling mcount or by clearing the registered function.

Only one function may be registered at a time. If another function is registered, it will simply override whatever was there previously.
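
The two rules above (both the sysctl and a registration are needed, and a new registration replaces the old one) can be modelled in a few lines of user-space C. This is only a sketch of the described behaviour: the mcount_enabled variable stands in for the sysctl, and set_hook(), hook_a(), hook_b() and mcount_model() are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>

typedef void (*mcount_func_t)(unsigned long ip, unsigned long parent_ip);

static int mcount_enabled;		/* models /proc/sys/kernel/mcount_enabled */
static mcount_func_t mcount_hook;	/* the one and only registration slot */

static int hits_a, hits_b;

static void hook_a(unsigned long ip, unsigned long parent_ip)
{
	(void)ip;
	(void)parent_ip;
	hits_a++;
}

static void hook_b(unsigned long ip, unsigned long parent_ip)
{
	(void)ip;
	(void)parent_ip;
	hits_b++;
}

/* Registering simply overwrites whatever was in the slot before. */
static void set_hook(mcount_func_t func)
{
	mcount_hook = func;
}

/* Tracing happens only when mcount is enabled AND a function is registered. */
static void mcount_model(unsigned long ip, unsigned long parent_ip)
{
	if (mcount_enabled && mcount_hook)
		mcount_hook(ip, parent_ip);
}

static void demo_gate(void)
{
	hits_a = hits_b = 0;

	set_hook(hook_a);
	mcount_model(1, 2);	/* registered but not enabled: no call */

	mcount_enabled = 1;
	mcount_model(1, 2);	/* both conditions hold: hook_a runs */

	set_hook(hook_b);
	mcount_model(1, 2);	/* hook_b replaced hook_a: only hook_b runs */

	mcount_enabled = 0;
	mcount_model(1, 2);	/* disabled again: no call */
}
```

After demo_gate() each hook has run exactly once, which matches the "order does not matter, but both must be done" description.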

preempt_thresh
    echo a number (in usecs) into this to record all traces that are greater than the threshold.

iter_ctrl
    echo "symonly" to not show the instruction pointers in the trace.
    echo "nosymonly" to disable symonly.
    echo "verbose" for verbose output from latency format.
    echo "noverbose" to disable verbose output.
    cat iter_ctrl to see the current settings.

Future:
-------

The way the mcount hook is done here, other utilities can easily add their own functions. Care just needs to be taken not to call anything that is not marked with notrace, or you will crash the box with recursion. But even the simple tracer adds a "disabled" feature, so in case it happens to call something that is not marked with notrace, there is a safety net that keeps it from killing the box.
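
The "disabled" safety net can be pictured roughly as follows. This is a user-space sketch under the assumption that the guard is a simple nesting counter; in a kernel it would be per-CPU and atomic, and the names trace_disabled, tracer_hook() and traced_helper() are hypothetical.

```c
#include <assert.h>

static int trace_disabled;	/* nesting depth; per-CPU in a real kernel */
static int recorded;		/* entries the tracer actually handled */
static int skipped;		/* re-entrant calls the guard turned away */

static void tracer_hook(unsigned long ip, unsigned long parent_ip);

/*
 * Stands for a helper that is NOT marked notrace: entering it triggers
 * mcount again, which lands back in the tracer hook.
 */
static void traced_helper(void)
{
	tracer_hook(0x3000, 0x2000);
}

static void tracer_hook(unsigned long ip, unsigned long parent_ip)
{
	(void)ip;
	(void)parent_ip;

	if (trace_disabled++ == 0) {
		/* Outermost entry: safe to do the real work. */
		recorded++;
		traced_helper();	/* would recurse forever without the guard */
	} else {
		/* Already inside the tracer: bail out immediately. */
		skipped++;
	}
	trace_disabled--;
}
```

A single call to tracer_hook() records one entry, turns away the one re-entrant call, and leaves the counter back at zero.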

I was originally going to use the relay system to record the data, but that had a chance of calling functions not marked with notrace. But if, for example, LTTng wanted to use this, it could disable tracing on a CPU when doing the calls, and this will protect from recursion.

Redesign:
---------

This will be the last series that has each of the traces as a separate buffer. The next series will be more like the RT patch implementation of holding a single buffer for all latency traces. The only disadvantage of this is that you can only perform one type of latency trace at a time. But this has never been brought up as an issue with the RT patch.

The user will still be able to switch at runtime which type of latency they would like to record.

SystemTap:
----------

One thing that Arnaldo and I discussed last year was using systemtap to add hooks into the kernel to start and stop tracing. kprobes is too heavy to do on all function calls, but it would be perfect to add to non hot paths to start the tracer and stop the tracer.

So when debugging the kernel, instead of recompiling with printks or other markers, you could simply use systemtap to place trace start and stop locations and trace the problem areas to see what is happening.

Latency Tracing:
----------------

We can also add trace points to record the time the highest priority task needs to wait before running. This too is currently done in the RT patch.

These are just some of the ideas we have with this. And we are sure others could come up with more.

These patches are for the underlying work. We'll see what happens next.