[PATCH v3] perfcounters: record time running and time enabled for each counter

Impact: new functionality

Currently, if there are more counters enabled than can fit on the CPU, the kernel will multiplex the counters on to the hardware using round-robin scheduling. That isn't too bad for sampling counters, but for counting counters it means that the value read from a counter represents some unknown fraction of the true count of events that occurred while the counter was enabled.

This remedies the situation by keeping track of how long each counter is enabled for, and how long it is actually on the cpu and counting events. These times are recorded in nanoseconds using the task clock for per-task counters and the cpu clock for per-cpu counters.

These values can be supplied to userspace on a read from the counter. Userspace requests that they be supplied after the counter value by setting the PERF_FORMAT_TOTAL_TIME_ENABLED and/or PERF_FORMAT_TOTAL_TIME_RUNNING bits in the hw_event.read_format field when creating the counter. (There is no way to change the read format after the counter is created, though it would be possible to add some way to do that.)
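As a sketch of what this looks like from the userspace side: with both format bits set, the two times follow the counter value in the buffer returned by read(). The struct and helper names below are ours, for illustration only; they are not part of the kernel API.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Layout of the read() buffer when both PERF_FORMAT_TOTAL_TIME_ENABLED
 * and PERF_FORMAT_TOTAL_TIME_RUNNING are requested: the two times are
 * supplied after the counter value, in that order. */
struct counter_read {
	uint64_t value;		/* raw event count */
	uint64_t time_enabled;	/* ns the counter was enabled */
	uint64_t time_running;	/* ns the counter was on the cpu */
};

/* Decode a raw read() buffer into the struct above. */
static void parse_counter_read(const void *buf, struct counter_read *out)
{
	memcpy(out, buf, sizeof(*out));
}
```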

Using this information it is possible for userspace to scale the countit reads from the counter to get an estimate of the true count:

true_count_estimate = count * total_time_enabled / total_time_running

This also lets userspace detect the situation where the counter never got to go on the cpu: total_time_running == 0.
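A minimal sketch of this scaling in userspace C (the helper name is ours, not from the kernel API); it also handles the never-ran case by returning 0 when total_time_running == 0:

```c
#include <assert.h>
#include <stdint.h>

/* Estimate the true count from a multiplexed counter:
 * true_count_estimate = count * total_time_enabled / total_time_running.
 * Floating point is used to avoid 64-bit overflow in the product. */
static uint64_t scale_count(uint64_t count, uint64_t time_enabled,
			    uint64_t time_running)
{
	if (time_running == 0)
		return 0;	/* counter never got to go on the cpu */
	return (uint64_t)((double)count * time_enabled / time_running);
}
```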

This functionality has been requested by the PAPI developers, and will be generally needed for interpreting the count values from counting counters correctly.

In the implementation, this keeps 5 time values (in nanoseconds) for each counter: total_time_enabled and total_time_running are used when the counter is in state OFF or ERROR and for reporting back to userspace. When the counter is in state INACTIVE or ACTIVE, it is the tstamp_enabled, tstamp_running and tstamp_stopped values that are relevant, and total_time_enabled and total_time_running are determined from them. (tstamp_stopped is only used in INACTIVE state.) The reason for doing it like this is that it means that only counters being enabled or disabled at sched-in and sched-out time need to be updated. There are no new loops that iterate over all counters to update total_time_enabled or total_time_running.
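The bookkeeping described above can be modelled in a few lines of C. This is our own simplified model of the scheme (function names are ours, and states are reduced to active/inactive), not the patch's code: totals are derived lazily from the timestamps, so a sched-in or sched-out only touches the counters actually changing state.

```c
#include <assert.h>
#include <stdint.h>

/* Simplified model of the per-counter time values. */
struct counter_times {
	uint64_t total_time_enabled;
	uint64_t total_time_running;
	uint64_t tstamp_enabled;	/* notional time counter was enabled */
	uint64_t tstamp_running;	/* notional time counter went on cpu */
	uint64_t tstamp_stopped;	/* INACTIVE: time counter went off cpu */
};

/* Counter becomes enabled (INACTIVE) at context time `now`. */
static void counter_enable(struct counter_times *c, uint64_t now)
{
	c->tstamp_enabled = now;
	c->tstamp_running = now;
	c->tstamp_stopped = now;
}

/* INACTIVE -> ACTIVE: push tstamp_running forward by the off-cpu time. */
static void counter_sched_in(struct counter_times *c, uint64_t now)
{
	c->tstamp_running += now - c->tstamp_stopped;
}

/* ACTIVE -> INACTIVE: remember when we went off the cpu. */
static void counter_sched_out(struct counter_times *c, uint64_t now)
{
	c->tstamp_stopped = now;
}

/* Derive the totals from the timestamps; `active` is nonzero when the
 * counter is currently in ACTIVE state. */
static void update_times(struct counter_times *c, uint64_t now, int active)
{
	c->total_time_enabled = now - c->tstamp_enabled;
	c->total_time_running =
		(active ? now : c->tstamp_stopped) - c->tstamp_running;
}
```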

This also keeps separate child_total_time_running and child_total_time_enabled fields that get added in when reporting the totals to userspace. They are separate fields so that they can be atomic. We don't want to use atomics for total_time_running, total_time_enabled etc., because then we would have to use atomic sequences to update them, which are slower than regular arithmetic and memory accesses.

It is possible to measure total_time_running by adding a task_clock counter to each group of counters, and total_time_enabled can be measured approximately with a top-level task_clock counter (though inaccuracies will creep in if you need to disable and enable groups since it is not possible in general to disable/enable the top-level task_clock counter simultaneously with another group). However, that adds extra overhead - I measured around 15% increase in the context switch latency reported by lat_ctx (from lmbench) when a task_clock counter was added to each of 2 groups, and around 25% increase when a task_clock counter was added to each of 4 groups. (In both cases a top-level task-clock counter was also added.)

In contrast, the code added in this commit gives better information with no overhead that I could measure (in fact in some cases I measured lower times with this code, but the differences were all less than one standard deviation).

Signed-off-by: Paul Mackerras <paulus@samba.org>
---
Hopefully it's now clear from the comments and the commit message that all the times are in units of approximately 1/pi attocenturies. 8-)

+	/*
+	 * These are the total time in nanoseconds that the counter
+	 * has been enabled (i.e. eligible to run, and the task has
+	 * been scheduled in, if this is a per-task counter)
+	 * and running (scheduled onto the CPU), respectively.
+	 *
+	 * They are computed from tstamp_enabled, tstamp_running and
+	 * tstamp_stopped when the counter is in INACTIVE or ACTIVE state.
+	 */
+	u64				total_time_enabled;
+	u64				total_time_running;
+
+	/*
+	 * These are timestamps used for computing total_time_enabled
+	 * and total_time_running when the counter is in INACTIVE or
+	 * ACTIVE state, measured in nanoseconds from an arbitrary point
+	 * in time.
+	 * tstamp_enabled: the notional time when the counter was enabled
+	 * tstamp_running: the notional time when the counter was scheduled on
+	 * tstamp_stopped: in INACTIVE state, the notional time when the
+	 *	counter was scheduled off.
+	 */
+	u64				tstamp_enabled;
+	u64				tstamp_running;
+	u64				tstamp_stopped;
+
 	struct perf_counter_hw_event	hw_event;
 	struct hw_perf_counter		hw;

 	/*
+	 * These accumulate total time (in nanoseconds) that children
+	 * counters have been enabled and running, respectively.
+	 */
+	atomic64_t			child_total_time_enabled;
+	atomic64_t			child_total_time_running;
+
+	/*
 	 * Protect attach/detach and child_list:
 	 */
 	struct mutex			mutex;
@@ -325,6 +368,16 @@ struct perf_counter_context {
 	int				nr_active;
 	int				is_active;
 	struct task_struct		*task;
+
+	/*
+	 * time_now is the current time in nanoseconds since an arbitrary
+	 * point in the past. For per-task counters, this is based on the
+	 * task clock, and for per-cpu counters it is based on the cpu clock.
+	 * time_lost is an offset from the task/cpu clock, used to make it
+	 * appear that time only passes while the context is scheduled in.
+	 */
+	u64				time_now;
+	u64				time_lost;
 #endif
 };

+	/*
+	 * Add any time since the last sched_out to the lost time
+	 * so it doesn't get included in the total_time_enabled and
+	 * total_time_running measures for counters in the context.
+	 */
+	ctx->time_lost = get_context_time(ctx, 0) - ctx->time_now;
+
 	flags = hw_perf_save_disable();