test by bracketing the test with calls to rdtsc in assembly. Note that you must bind yourself to a specific CPU on the box to make this effective, because the TSC clocks on different cores do not have the same concept of "beginning."
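A sketch of that bracketing harness might look like the following; the call under test, the iteration count, and the Linux-specific pinning are illustrative, with clock_gettime(CLOCK_MONOTONIC) standing in for gethrtime() on Linux:

#define _GNU_SOURCE
#include <sched.h>
#include <stdint.h>
#include <stdio.h>
#include <time.h>

static inline uint64_t
rdtsc(void) {
  uint32_t lo, hi;
  __asm__ __volatile__("rdtsc" : "=a"(lo), "=d"(hi));
  return ((uint64_t)hi << 32) | lo;
}

int
main(void) {
  /* Pin to one CPU so both TSC reads come from the same counter
     (Linux-specific; Illumos would use processor_bind()). */
  cpu_set_t set;
  CPU_ZERO(&set);
  CPU_SET(0, &set);
  sched_setaffinity(0, sizeof(set), &set);

  enum { ITERS = 10000000 };
  struct timespec ts;
  uint64_t start = rdtsc();
  for (int i = 0; i < ITERS; i++)
    clock_gettime(CLOCK_MONOTONIC, &ts);  /* the call under test */
  uint64_t end = rdtsc();

  printf("%.1f ticks/op\n", (double)(end - start) / ITERS);
  return 0;
}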
Table 1 shows the results when this is run on our two primary platforms (Linux and Illumos/OmniOS on a 24-core 2.6GHz Intel box).

The first observation is that Linux optimizes both of these calls significantly more than OmniOS does. This has actually been addressed as part of the LX brand work in SmartOS by Joyent and will soon be upstreamed into Illumos for general consumption by OmniOS. Alas, that isn't the worst of it: objectively determining what time it is remains simply too slow for microsecond-level timing, even at the lower 119.8ns/op (nanoseconds per operation) number above. Note that gettimeofday() supports only microsecond-level accuracy and thus is not suitable for timing faster operations.

So 19.33% of the execution time is spent on calculating the timing, and that doesn't even include the time spent recording the result. A good goal to target here is 10% or less. So, how do we get there?

Looking at Our Tools

These same modern CPUs with invariant TSCs have the rdtsc instruction, which reads the TSC yet doesn't provide insight into which CPU you are executing on. That would require either prefixing the call with a cpuid instruction or binding the executing thread to a specific core. The former adds ticks to the work; the latter is wholly inconvenient and can really defeat any advanced NUMA (nonuniform memory access)-aware scheduling that the kernel might provide. Basically, binding the CPU provides a super-fast but overly restrictive solution. We just want the gethrtime() call to work and be fast.

We are not the only ones in need; out of that generally recognized need, the rdtscp instruction was introduced. It supplies the value in the TSC and a programmable 32-bit value. The operating system can program this value to be the ID of the CPU, so a sufficient amount of information is emitted in a single instruction. Don't be deceived; this instruction isn't cheap and measures in at 34 ticks on this machine. We wrap that instruction in a call, uint64_t mtev_rdtscp(int *cpuid), that returns the TSC and optionally sets cpuid to the programmed value.
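A minimal sketch of such a wrapper, assuming x86-64 and GCC/Clang inline assembly (the article doesn't show its implementation, and the interpretation of the IA32_TSC_AUX value as a plain CPU ID is OS-dependent):

#include <stdint.h>

/* Returns the TSC; if cpuid is non-NULL, also returns the value the
 * OS programmed into IA32_TSC_AUX (on Linux this encodes the CPU
 * number in the low bits). */
static inline uint64_t
mtev_rdtscp(int *cpuid) {
  uint32_t lo, hi, aux;
  __asm__ __volatile__("rdtscp" : "=a"(lo), "=d"(hi), "=c"(aux));
  if (cpuid) *cpuid = (int)aux;
  return ((uint64_t)hi << 32) | lo;
}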

The first challenge here is to understand the frequency at which the TSC ticks. This is a straightforward timing exercise, illustrated in the accompanying figure.
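That exercise might look something like this sketch; the helper name, the sleep interval, and the use of clock_gettime(CLOCK_MONOTONIC) in place of Illumos's gethrtime() are assumptions here:

#include <stdint.h>
#include <time.h>
#include <unistd.h>

/* Run while bound to one CPU so both TSC reads hit the same counter. */
static double
estimate_nanos_per_tick(void) {
  struct timespec ts;
  uint64_t ns0, ns1, t0, t1;

  clock_gettime(CLOCK_MONOTONIC, &ts);
  ns0 = (uint64_t)ts.tv_sec * 1000000000ULL + (uint64_t)ts.tv_nsec;
  t0 = mtev_rdtscp(NULL);

  usleep(100000);  /* let a measurable interval elapse */

  t1 = mtev_rdtscp(NULL);
  clock_gettime(CLOCK_MONOTONIC, &ts);
  ns1 = (uint64_t)ts.tv_sec * 1000000000ULL + (uint64_t)ts.tv_nsec;

  return (double)(ns1 - ns0) / (double)(t1 - t0);
}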

This usually takes around 10ns, assuming no major page fault during the assignment. That's 10ns to set a piece of memory! Remember, that includes the average time of a call to mtev_rdtscp(), which is just over 9ns. That's not really the problem, though. The problem is that sometimes we get HUGE answers. Why? We have switched CPUs, and the outputs of the two TSC reads report two completely unrelated counters. So, to rephrase the problem: we must relate the counters across CPUs.
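The cpuid output of mtev_rdtscp() is what lets us notice this; an illustrative fragment (not the article's code):

int c1 = -1, c2 = -1;
uint64_t t1 = mtev_rdtscp(&c1);
uint64_t t2 = mtev_rdtscp(&c2);
if (c1 != c2) {
  /* We migrated between reads: t2 - t1 compares two unrelated
     counters and can come out absurdly large. */
}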

The code for skew assessment is a bit much to include here. The basic idea is that we run a calibration loop on each CPU that measures TSC * nanos_per_tick and assesses its skew from gethrtime(), accommodating the running time of gethrtime() itself. As with most calibration loops, the most skewed samples are discarded and the remainder averaged.
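A condensed sketch of that idea follows; the sample count and trimming policy are illustrative, and the correction for gethrtime()'s own running time is omitted for brevity:

#include <stdint.h>
#include <stdlib.h>    /* llabs */
#include <sys/time.h>  /* gethrtime() on Illumos */

/* Assumes the thread is already bound to the CPU being calibrated. */
static int64_t
assess_skew(double nanos_per_tick) {
  enum { NSAMPLES = 9 };
  int64_t skew[NSAMPLES], sum = 0, mean;
  int i, worst = 0;

  for (i = 0; i < NSAMPLES; i++) {
    uint64_t ticks = mtev_rdtscp(NULL);
    int64_t ns = gethrtime();  /* clock_gettime(CLOCK_MONOTONIC) elsewhere */
    skew[i] = ns - (int64_t)((double)ticks * nanos_per_tick);
  }
  for (i = 0; i < NSAMPLES; i++) sum += skew[i];
  mean = sum / NSAMPLES;
  /* Discard the single most skewed sample, average the rest. */
  for (i = 1; i < NSAMPLES; i++)
    if (llabs(skew[i] - mean) > llabs(skew[worst] - mean)) worst = i;
  sum = 0;
  for (i = 0; i < NSAMPLES; i++) if (i != worst) sum += skew[i];
  return sum / (NSAMPLES - 1);
}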

As the TSC is per CPU, you need to track m and b (nanos_per_tick and skew, as in y = mx + b) on a per-CPU basis.
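Putting it together, the fast path might look like this sketch; MAXCPU, the table, and fast_gethrtime() are illustrative names, not the article's actual code:

#include <stdint.h>

#define MAXCPU 256

struct tsc_line {
  double  nanos_per_tick;  /* m */
  int64_t skew;            /* b */
};
static struct tsc_line tsc_lines[MAXCPU];

/* y = m * ticks + b for the current CPU. The double math is a
 * simplification (fixed-point avoids precision loss at large tick
 * counts), and a real implementation must fall back to gethrtime()
 * when the current CPU has no calibration yet. */
static uint64_t
fast_gethrtime(void) {
  int cpu;
  uint64_t ticks = mtev_rdtscp(&cpu);
  struct tsc_line *l = &tsc_lines[cpu % MAXCPU];
  return (uint64_t)((int64_t)((double)ticks * l->nanos_per_tick) + l->skew);
}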
