As described in the introduction, a sampling profiler captures the stack of
each thread every few milliseconds. To do this, it is preferable to suspend the
thread. We don’t want the stack changing under us as we are profiling.
Even worse, a thread could just terminate while we were attempting to profile
it and invalidate all the memory we want to read. It is safer to just suspend
it and resume once the capture is done.

Pitfalls of suspending threads

This is probably the most important lesson of building an in-process profiler. You can do most other things wrong and at
least get some output, but if you ignore the advice here, you may not get output at all and will likely make the program
misbehave. There are three things your profiler CANNOT do while it has a thread suspended:

You MUST NOT forget to resume the suspended thread - failing to do this would prevent programs from making progress.

You MUST NOT suspend the sampling thread itself - doing this will prevent the profiler from making progress.

You MUST NOT allocate memory or acquire locks on the sampler thread that
other threads have access to. When a thread is suspended, the thread may be
holding locks, making assumptions about certain memory locations and so on.
There are a lot of locks that are created by various platform APIs
per-process and operated on behind the scenes, including things like
printf and malloc! Let’s say the thread acquired the allocator lock and
then was suspended. If your sampling thread attempts to allocate memory, it
will block on the allocator lock and your program will deadlock! This means
any memory we need for storing data about the thread stack (which will be
covered in Part 3) must be pre-allocated before we suspend the thread. You
cannot dynamically resize this while any thread is suspended.
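
As a rough illustration of the pre-allocation rule, here is a minimal sketch of a fixed-capacity sample buffer. The SampleBuffer type and the sample_buffer_* function names are hypothetical and are not taken from any particular profiler. The only point is that all allocation happens in the init function, before any thread is suspended, and that the push function neither allocates nor takes a lock.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical storage for one stack sample (a list of program counters). */
typedef struct {
    uintptr_t *frames;
    size_t capacity;
    size_t len;
} SampleBuffer;

/* Call BEFORE suspending any thread: this is the only function that allocates. */
bool sample_buffer_init(SampleBuffer *buf, size_t capacity) {
    buf->frames = calloc(capacity, sizeof *buf->frames);
    buf->capacity = buf->frames ? capacity : 0;
    buf->len = 0;
    return buf->frames != NULL;
}

/* Safe to call while a thread is suspended: no allocation, no locks.
 * Returns false (and drops the frame) once the buffer is full, instead of
 * trying to grow it. */
bool sample_buffer_push(SampleBuffer *buf, uintptr_t pc) {
    assert(buf->frames != NULL);
    if (buf->len == buf->capacity) {
        return false;
    }
    buf->frames[buf->len++] = pc;
    return true;
}

/* Reuse the same storage for the next sample. */
void sample_buffer_clear(SampleBuffer *buf) {
    buf->len = 0;
}

/* Call only when no thread is suspended, since free() may take the allocator lock. */
void sample_buffer_free(SampleBuffer *buf) {
    free(buf->frames);
    buf->frames = NULL;
    buf->capacity = 0;
    buf->len = 0;
}
```

Dropping frames when the buffer is full loses the outermost part of a very deep stack, which is usually a better failure mode than a deadlock.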

Finally, suspending a thread does have a performance penalty. First, it slows
down the program. We try to minimize this by keeping our stack collection as
fast as possible. Second, pausing the thread may force the OS to context switch
to get another thread going. There isn’t much we can do about this.

Suspending and resuming threads is one of the easier parts of profiling. All three
OSes provide a way to suspend threads of the current process, although Linux takes more work.

Sampling all threads or just registered threads

A sampling profiler can choose to sample every thread in the application, or only certain threads. The latter solution
is good if you want to offer selective profiling, like browsers often do for web pages. It can also improve the
performance of the profiler.

Selective profiling is achieved by having interested threads call some function to register themselves with the
profiler.
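
A registration scheme could look roughly like the sketch below. It assumes POSIX threads, and the names (profiler_register_current_thread, profiler_snapshot_threads, MAX_REGISTERED_THREADS) are hypothetical. Note that the sampler copies the list under the lock before it suspends anything, so it never needs the registry lock while a thread is suspended.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_REGISTERED_THREADS 64 /* hypothetical fixed capacity */

static pthread_mutex_t registry_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_t registry[MAX_REGISTERED_THREADS];
static size_t registry_len = 0;

/* Called by a thread that wants to be profiled.
 * Returns false if the fixed-size registry is full. */
bool profiler_register_current_thread(void) {
    bool ok = false;
    pthread_mutex_lock(&registry_lock);
    if (registry_len < MAX_REGISTERED_THREADS) {
        registry[registry_len++] = pthread_self();
        ok = true;
    }
    pthread_mutex_unlock(&registry_lock);
    return ok;
}

/* Called by the sampler BEFORE it suspends any thread. Copies the registered
 * threads into a caller-provided (pre-allocated) array and returns how many
 * were copied. The lock is released before returning, so it is never held
 * or requested while a sampled thread is suspended. */
size_t profiler_snapshot_threads(pthread_t *out, size_t out_cap) {
    assert(out != NULL);
    pthread_mutex_lock(&registry_lock);
    size_t n = registry_len < out_cap ? registry_len : out_cap;
    for (size_t i = 0; i < n; i++) {
        out[i] = registry[i];
    }
    pthread_mutex_unlock(&registry_lock);
    return n;
}
```

A real implementation would also need a way for threads to unregister before they exit, which this sketch leaves out.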

Here we will stick to sampling every thread to keep the profiler logic simpler.

Windows

Windows is probably the simplest platform on which to suspend and resume threads, but also the most annoying to iterate
over, because there isn’t an API to iterate only the threads of a given process. This means you end up iterating over
every thread in the system and discarding the ones you don’t care about, which is inefficient. A profiler that only cares about
registered threads and stores their HANDLEs in a list will do better.

First, one uses the CreateToolhelp32Snapshot function to obtain a snapshot of
all running threads. Then, the Thread32First and Thread32Next functions can
iterate over this snapshot and obtain thread information. MSDN has a code
sample about using the thread iteration APIs, so it should be clear. We can
compare the thread’s th32OwnerProcessID with GetCurrentProcessId() to restrict
to threads from our process.

Once we have a thread ID, we obtain a handle to it using the OpenThread()
function. Then we call SuspendThread(), walk the stack, and call
ResumeThread().

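// te32 is the THREADENTRY32 filled in by Thread32First()/Thread32Next().
HANDLE thread_handle = OpenThread(THREAD_SUSPEND_RESUME | THREAD_GET_CONTEXT, FALSE, te32.th32ThreadID);
if (thread_handle == NULL) {
    // the thread may have exited since the snapshot was taken
    return;
}
// SuspendThread() and ResumeThread() return (DWORD)-1 on failure
if (SuspendThread(thread_handle) == (DWORD)-1) {
    // handle error
    CloseHandle(thread_handle);
    return;
}
// walk the stack
if (ResumeThread(thread_handle) == (DWORD)-1) {
    // at this point we probably want to crash the program as this is a bad state to be in!
    abort();
}
CloseHandle(thread_handle);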

macOS

(function names link to documentation)

On Mac, thread suspension requires using the Mach subsystem. Mach calls within the
same process are generally unrestricted, so no security measures need to be disabled.

Start by getting a handle to the process itself using mach_task_self(), and then obtain a list of threads using the
task_threads function.

The Mach structures and APIs are not always well documented, and often fiddly, so it is best to see working code from
other projects. psutil has an example.

task_threads returns an array of thread_act_t values (Mach thread ports), which can be passed to the suspend and resume functions directly.

The thread_suspend() and thread_resume() functions do what they say. Gecko
and Chromium both use these functions. Incidentally, Mach also offers a
thread_sample() function which will sample and write out PC values to a port
(a queue). This is pretty cool, but I’ve not seen it used in practice.

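Both Windows and macOS maintain a suspend count per thread: each suspend call increments it, each resume call decrements it, and the thread only runs again once the count drops back to zero. It is important to call resume as many times as suspend is called!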

Linux

To obtain a list of threads for a process, it is easiest to use the proc filesystem. The /proc/self/task directory has subdirectories for every thread, identified by the kernel task ID. We can just iterate over these.
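
A minimal sketch of that iteration, assuming a Linux system with /proc mounted. The function name list_thread_ids is made up for this example. Note that opendir() allocates memory, so this has to run before any thread is suspended.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <dirent.h>
#include <stdlib.h>
#include <sys/types.h>

/* Fills `tids` with up to `cap` kernel task IDs of the current process.
 * Returns the number of IDs written, or -1 if /proc/self/task cannot be opened.
 * The result is only a snapshot: threads can start or exit at any time. */
int list_thread_ids(pid_t *tids, int cap) {
    assert(tids != NULL && cap >= 0);
    DIR *dir = opendir("/proc/self/task");
    if (dir == NULL) {
        return -1;
    }
    int n = 0;
    struct dirent *entry;
    while (n < cap && (entry = readdir(dir)) != NULL) {
        char *end;
        long tid = strtol(entry->d_name, &end, 10);
        /* Skip "." and ".." and anything else that is not a plain number. */
        if (end == entry->d_name || *end != '\0') {
            continue;
        }
        tids[n++] = (pid_t)tid;
    }
    closedir(dir);
    return n;
}
```

The caller passes in a pre-allocated array, which fits the rule above about not allocating while a thread is suspended.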

Suspending and resuming threads is really involved on Linux. One has to use a
complicated set of synchronization primitives combined with signals. I’ve linked to the vignette implementation throughout. I’ll use the term “sampler thread” for the thread the profiler is running on, and “sampled thread” for the thread we are interested in profiling.

Set up a process-wide signal handler for the SIGPROF signal. We have to pass the SA_SIGINFO flag to use the 3-argument handler. This lets us access the ucontext_t param required for unwinding later.

The sampler thread then sends SIGPROF to the sampled thread. When the sampled thread re-enters userspace, it will receive the signal, and the signal handler will be invoked. The sampler thread waits on the first semaphore (msg2 in vignette), blocking until the signal handler acknowledges the signal.

When the signal handler runs in the context of the sampled thread, the original operation of the thread is now suspended. When the signal handler exits, it will be resumed. We use a combination of semaphores to communicate with the sampler and only resume after we have the information we need.

Within the handler, we first copy the context that we will need later. We use msg2 to notify the sampling thread that we have a context. The sampled thread waits on msg3 and is effectively suspended.

On the sampler thread, we use the context to walk the stack. Then we notify msg3 so the sampled thread can resume itself. We wait on msg4 to be absolutely sure the sampled thread is resumed. This is required because we have shared state and shared semaphores, so we cannot move on to the next thread until both ends have finished! If we were to concurrently send a signal to another thread, that handler could run in its entirety and leave all our semaphores in states we cannot predict.

The sampled thread simply notifies msg4 since it doesn’t have anything to do.
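
The handshake described above can be sketched in C roughly as follows. This is a simplified illustration written for this article, not the vignette code: the semaphore names msg2, msg3 and msg4 follow the text, while sampler_init, sample_thread, the `walk` callback and the demo helpers at the bottom are hypothetical. It assumes POSIX semaphores and uses pthread_kill to deliver the signal. Strictly speaking, sem_wait is not on the POSIX list of async-signal-safe functions (sem_post is), so waiting inside the handler relies on behaviour that works in practice on Linux.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <semaphore.h>
#include <signal.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <string.h>
#include <ucontext.h>

static sem_t msg2, msg3, msg4;   /* names follow the text */
static ucontext_t saved_context; /* pre-allocated shared state */

/* sem_wait can be interrupted by a signal; retry until it succeeds. */
static void wait_sem(sem_t *s) {
    while (sem_wait(s) == -1 && errno == EINTR) {
    }
}

/* Runs on the sampled thread. */
static void sigprof_handler(int signo, siginfo_t *info, void *uc) {
    (void)signo;
    (void)info;
    int saved_errno = errno;
    memcpy(&saved_context, uc, sizeof saved_context); /* copy the context */
    sem_post(&msg2); /* tell the sampler the context is ready */
    wait_sem(&msg3); /* effectively suspended until the sampler is done */
    sem_post(&msg4); /* acknowledge that we are resuming */
    errno = saved_errno;
}

/* Initialise the semaphores and install the process-wide handler. */
int sampler_init(void) {
    if (sem_init(&msg2, 0, 0) || sem_init(&msg3, 0, 0) || sem_init(&msg4, 0, 0)) {
        return -1;
    }
    struct sigaction sa;
    memset(&sa, 0, sizeof sa);
    sa.sa_sigaction = sigprof_handler;
    sa.sa_flags = SA_SIGINFO | SA_RESTART; /* SA_RESTART is this sketch's choice */
    sigemptyset(&sa.sa_mask);
    return sigaction(SIGPROF, &sa, NULL);
}

/* Runs on the sampler thread. `walk` must not allocate or take shared locks. */
int sample_thread(pthread_t target, void (*walk)(const ucontext_t *, void *), void *arg) {
    assert(walk != NULL);
    if (pthread_equal(target, pthread_self())) {
        return -1; /* never suspend the sampler itself */
    }
    if (pthread_kill(target, SIGPROF) != 0) {
        return -1;
    }
    wait_sem(&msg2);            /* handler has copied the context */
    walk(&saved_context, arg);  /* sampled thread is parked in the handler */
    sem_post(&msg3);            /* let the sampled thread resume */
    wait_sem(&msg4);            /* both ends done; safe to sample the next thread */
    return 0;
}

/* Demo helpers (hypothetical): a thread to sample and a trivial "walker". */
atomic_bool stop_worker;

void *busy_worker(void *arg) {
    (void)arg;
    while (!atomic_load(&stop_worker)) {
    }
    return NULL;
}

void record_sample(const ucontext_t *ctx, void *arg) {
    (void)ctx;
    ++*(int *)arg; /* a real walker would unwind from ctx here */
}
```

To try it, call sampler_init() once, start a thread with pthread_create(&t, NULL, busy_worker, NULL), and call sample_thread(t, record_sample, &count) from the sampler thread. Because there is a single saved_context and one set of semaphores, only one thread can be sampled at a time, which is exactly why the sampler waits on msg4 before moving on.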

When the app being profiled is not aware of the profiler (for example, when profiling from outside the process), setting a signal handler is difficult on Linux. The correct way is to use ptrace(2), which operates on a per-thread (task) level. After attaching to a thread, the registers can be read for unwinding. This approach is fraught with edge cases.