Bernie Pope

Linux Performance Counters

Modern CPUs can provide information about the runtime behaviour of software through so-called hardware performance counters. Recent versions of the Linux kernel (since 2.6.31) provide a generic interface to low-level events for running processes. This includes access to hardware counters but also a wide array of software events such as page faults, scheduling activity and system calls. A userspace tool called perf is built on top of the kernel interface and provides a convenient way to record and view events for running processes.

Unfortunately it is rather hard to find authoritative information about performance counters on Linux; documentation is scarce. Here’s a list of links that provide some information:

-a System-wide collection from all CPUs
-c Event period to sample (in the above case record every single event, which is useful for getting schedule events).
-R Collect raw sample records from all opened counters (default for tracepoint counters).
-e Select the PMU event.

Perf file format

The perf record command records information about performance events in a file called (by default) perf.data. It is a binary file format which is basically a memory dump of the data structures used to record event information. The file has two main parts:

A header which describes the layout of information in the file (section sizes etcetera) and common information about events in the second part of the file (an encoding of event types and their names).

The payload of the file which is a sequence of event records.

Each event record has a header which says what general type of event it is, plus information about the size of its body.

There are nine types of event:

PERF_RECORD_MMAP:

PERF_RECORD_LOST: reports that some events were lost (they could not be recorded, for example because the buffer filled up).

PERF_RECORD_COMM: maps a command name string to a process and thread ID. Perhaps this corresponds to an exec?

PERF_RECORD_EXIT: process exit.

PERF_RECORD_THROTTLE

PERF_RECORD_UNTHROTTLE

PERF_RECORD_FORK: process creation.

PERF_RECORD_READ:

PERF_RECORD_SAMPLE: a sample of an actual hardware counter or a software event.
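To make the record layout concrete, here is a minimal sketch in Python of walking the payload one record at a time. It assumes the header layout of struct perf_event_header from the kernel's perf_event.h (a 32-bit type, a 16-bit misc field and a 16-bit size that counts the header itself), and it assumes little-endian data; the file is written in the byte order of the recording machine. The function name iter_records is made up for illustration, and the sketch does not deal with the file header at all.

```python
import struct

# Record type numbers as defined in the kernel's perf_event.h.
PERF_RECORD_NAMES = {
    1: "PERF_RECORD_MMAP",
    2: "PERF_RECORD_LOST",
    3: "PERF_RECORD_COMM",
    4: "PERF_RECORD_EXIT",
    5: "PERF_RECORD_THROTTLE",
    6: "PERF_RECORD_UNTHROTTLE",
    7: "PERF_RECORD_FORK",
    8: "PERF_RECORD_READ",
    9: "PERF_RECORD_SAMPLE",
}

# struct perf_event_header: u32 type, u16 misc, u16 size.
# Assumption: little-endian byte order (true for x86 recordings).
HEADER = struct.Struct("<IHH")


def iter_records(payload):
    """Yield (type_name, misc, body_bytes) for each record in the payload.

    `payload` is the raw bytes of the event section only. The size field
    covers the whole record, header included, so it tells us where the
    next record starts.
    """
    offset = 0
    while offset + HEADER.size <= len(payload):
        rtype, misc, size = HEADER.unpack_from(payload, offset)
        if size < HEADER.size or offset + size > len(payload):
            raise ValueError("corrupt record at offset %d" % offset)
        body = payload[offset + HEADER.size:offset + size]
        name = PERF_RECORD_NAMES.get(rtype, "UNKNOWN(%d)" % rtype)
        yield name, misc, body
        offset += size
```

The body of each record still has to be decoded according to its type; for samples that decoding depends on the sample type stored in the file header.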

The PERF_RECORD_SAMPLE events (samples) are the most interesting ones in terms of program profiling. The other events seem to be mostly useful for keeping track of process technicalities. Samples are timestamped with an unsigned 64-bit word, which records elapsed nanoseconds since some point in time (I’m not exactly sure what, perhaps system running time, since it seems to be based on the kernel scheduler clock). The other events can optionally be timestamped too, if a certain bit flag (sample_id_all) is set in the file header. Samples themselves have a “type”, which is defined in the file header and linked to the sample by an integer identifier.

Processing the perf file format from Haskell

There are three ways one might go about writing a tool to process the perf.data file from Haskell:

Write a program to read the data directly from file.

Call the perf code as a library.

Parse the output of the perf script command.

Option 1 is what we have already done with the haskell-linux-perf library. The upside of this approach is that it is independent of the existing perf source code. The downside is that the perf.data format is complicated and largely undocumented. It does not seem to be designed for external tools to read. It might be hard to keep a custom parser compatible with the format.

For option 2, the most likely interface to use would be the C function (from <perf source>/util/session.c):

The struct perf_session is a representation of the perf.data file and the struct perf_event_ops contains a collection of functions for processing each of the event types. One could envisage a Haskell FFI call that would get perf_session__process_events to build a Haskell representation of the event list, but it seems like it would have to build it all in memory at once, rather than stream the data lazily.

Option 3 is appealing because it avoids the need to deal with the perf.data file directly: we just need to parse the text output of perf script. One small downside is that it entails a double handling of the data, which might be a little bit slower than reading perf.data directly. Another downside is that it means we are at the whim of the format of the output of perf script, which could change at any point.
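As a rough illustration of option 3, here is a sketch in Python of parsing one line of perf script text. The line shape it expects (command name, pid, CPU in square brackets, a seconds.fraction timestamp, then an event name followed by a colon) is an assumption about what the tool prints; the real output depends on the perf version and the fields selected, which is exactly the fragility noted above. The sample line in the usage note is made up, and parse_script_line is a hypothetical name.

```python
import re

# Assumed line shape (not guaranteed by perf):
#   <comm> <pid> [<cpu>] <secs>.<frac>: <event>: <rest>
LINE = re.compile(
    r"^\s*(?P<comm>.+?)\s+(?P<pid>\d+)\s+\[(?P<cpu>\d+)\]\s+"
    r"(?P<secs>\d+)\.(?P<frac>\d+):\s+(?P<event>\S+):\s*(?P<rest>.*)$"
)


def parse_script_line(line):
    """Return a dict for a line in the assumed shape, or None otherwise."""
    m = LINE.match(line)
    if m is None:
        return None
    # Convert the timestamp to integer nanoseconds without going through
    # floating point, so that no precision is lost.
    frac_ns = int(m.group("frac").ljust(9, "0")[:9])
    return {
        "comm": m.group("comm"),
        "pid": int(m.group("pid")),
        "cpu": int(m.group("cpu")),
        "timestamp_ns": int(m.group("secs")) * 10**9 + frac_ns,
        "event": m.group("event"),
        "rest": m.group("rest"),
    }
```

For example, the made-up line "  ghc-prog  4242 [001] 12345.678901: sched:sched_switch: prev_comm=ghc-prog" would yield the event name sched:sched_switch with a timestamp of 12345678901000 nanoseconds. Lines that do not match, such as call-chain lines, come back as None and can be skipped or handled separately.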

Event timestamps

Certain events carry a timestamp in the form of an unsigned 64-bit integer. Time measurements come from the kernel function perf_clock in linux/kernel/events/core.c, which simply calls local_clock. On x86 systems with a stable Time Stamp Counter this ends up calling native_sched_clock in linux/arch/x86/kernel/tsc.c. You might be able to read the value of this clock in userspace with a call to clock_gettime(CLOCK_MONOTONIC, ...), although I’m not sure it is guaranteed to line up with the counter that perf uses.
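For reference, reading CLOCK_MONOTONIC from userspace looks like this in Python, which wraps the POSIX call as time.clock_gettime_ns (Unix only, Python 3.7 or later). This only shows how to obtain a nanosecond reading; whether that reading lines up with the perf timestamps is precisely the open question, so treat it as an experiment rather than a solution. The helper name read_monotonic_ns is made up.

```python
import time


def read_monotonic_ns():
    """Return the current CLOCK_MONOTONIC reading in integer nanoseconds."""
    return time.clock_gettime_ns(time.CLOCK_MONOTONIC)
```

One way to test the hypothesis would be to take a reading immediately before and after an action that shows up in the perf event log, and check whether the perf timestamp falls between the two readings.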

Extending ThreadScope to support perf events

How can perf event information be incorporated into the GHC event format used by ThreadScope?

These are the design issues I can think of:

How to synchronise the perf timestamps with the time format used by GHC events?

How to encode the perf events in the GHC event log format?

Synchronising timestamps

The perf timestamps are based on some kind of internal clock, perhaps the TSC hardware. A simple approach would be to just synchronise the timestamps at the start of the profiled program, but this runs the risk of them drifting apart over the length of the computation. It is not clear how CPU frequency scaling affects the perf clock, and the clock probably doesn’t run if the CPU core hibernates. For these reasons it is probably necessary to synchronise the two timestamps on a regular basis throughout the execution of the traced program.

There are a number of ways we might try to synchronise the timestamps:

Look for a known sys-call in the perf event log which happens at a known time-of-day. For example, we could look for calls to gettimeofday. Alternatively, it might be possible to get the GHC RTS to emit some kind of innocuous sys-call at various intervals throughout the execution of the traced program.

We could get the GHC RTS to sample the same clock that perf is using for its timestamps. Unfortunately there does not appear to be a portable way to do this, as the clock used is system specific. However, for many systems it is possible that the POSIX clock_gettime(CLOCK_MONOTONIC, ...) will work.

It is worth noting that any kind of sys-call sampling is likely to lead to “observer effects” whereby the re-scheduling due to the sys-call will affect the runtime behaviour of the traced Haskell program.
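Whichever method produces the synchronisation points, the conversion step could look like the following sketch in Python. It assumes we have collected pairs of (perf timestamp, GHC timestamp) for the same instants, both in nanoseconds, and it maps a perf timestamp onto the GHC timeline by linear interpolation between the two nearest pairs, which absorbs slow drift between the clocks. The function name and the data representation are made up for illustration.

```python
from bisect import bisect_right


def perf_to_ghc(sync_points, perf_ts):
    """Map a perf timestamp onto the GHC timeline.

    `sync_points` is a non-empty list of (perf_ns, ghc_ns) pairs, sorted
    by strictly increasing perf_ns. Timestamps outside the covered range
    are extrapolated from the nearest segment.
    """
    if not sync_points:
        raise ValueError("need at least one synchronisation point")
    if len(sync_points) == 1:
        # A single point only gives a constant offset.
        p, g = sync_points[0]
        return g + (perf_ts - p)
    keys = [p for p, _ in sync_points]
    # Pick the segment [i-1, i] that contains perf_ts, clamped to the
    # first or last segment when perf_ts lies outside the range.
    i = bisect_right(keys, perf_ts)
    i = min(max(i, 1), len(sync_points) - 1)
    (p0, g0), (p1, g1) = sync_points[i - 1], sync_points[i]
    # Integer arithmetic keeps full nanosecond precision.
    return g0 + (perf_ts - p0) * (g1 - g0) // (p1 - p0)
```

With a single pair this degenerates to the simple approach of synchronising once at program start; the more often pairs are recorded, the less any drift between them matters, at the cost of a larger observer effect.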

Encoding perf events in the GHC event format

The GHC event format specifies in its header a fixed number of event types. Each event instance in the payload of the file should have one of those types. As seen below, there is a very large number of event types that the perf tool can record. Obviously in the context of ThreadScope we are only interested in a limited subset of all possible events. There appear to be three choices regarding the encoding of perf events in the GHC event stream:

Pick a fixed subset of the perf events and add them to the fixed set of GHC event types.

Allow for any number of different perf events to be encoded in the GHC event stream.

A combination of the above two approaches.

The advantages of option 1 are that it is easy to implement and probably nicer for the visualisation part of ThreadScope, since it knows all the possible event types in advance. It may be possible and advantageous to pick a subset of events that are likely to be available on other platforms (CPU, operating system, and tracing framework).

The advantage of option 2 is that it allows the user more flexibility with the set of events to include, and it will be forwards and backwards compatible with different versions of the perf tool.

Option 2 could be implemented using a two-level encoding. The first level of the encoding adds two new event types to the GHC header:

A perf event meta-type.

A perf event record.

Instances of the meta-type are pseudo events which encode true perf event types. They contain a string name of the actual perf event, and a unique integer token identifying them. Instances of the perf event record are actual event values. They contain timestamp and other information such as thread-id, plus an integer token which links the event to its corresponding perf type. The program which reads the perf event stream will have to be extended to be aware of the two-level encoding.
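The two-level encoding can be sketched as follows in Python. This models only the logical structure (meta-type events that introduce a token, and record events that refer to it); it says nothing about the binary layout in the GHC event log, and all the class, field and function names are made up for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PerfMetaType:
    """Pseudo event: declares a perf event type under an integer token."""
    token: int
    name: str


@dataclass(frozen=True)
class PerfRecord:
    """Actual event value, linked to its type by the token."""
    token: int
    timestamp_ns: int
    thread_id: int


def resolve_stream(events):
    """Resolve each PerfRecord to (event_name, timestamp_ns, thread_id).

    A meta-type must appear in the stream before any record that uses
    its token; otherwise the record cannot be named and KeyError is raised.
    """
    names = {}
    resolved = []
    for ev in events:
        if isinstance(ev, PerfMetaType):
            names[ev.token] = ev.name
        else:
            resolved.append((names[ev.token], ev.timestamp_ns, ev.thread_id))
    return resolved
```

A reader that does not understand the two new event types can skip them, while an extended reader keeps the token table as it goes, so the set of perf events never needs to be fixed in advance.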

Types of events

You can get a listing of all the available events with the command:

sudo perf list

The sudo is necessary (at least on my system) because some of the events (tracepoints?) require root privileges.