The complexity of computing systems has tremendously increased over the last decades. Hierarchical cache subsystems, non-uniform memory, simultaneous multithreading and out-of-order execution have a huge impact on the performance and compute capacity of modern processors.

Figure 1: "CPU Utilization" measures only the time a thread is scheduled on a core

Software that understands and dynamically adjusts to the resource utilization of modern processors has performance and power advantages. The Intel® Performance Counter Monitor provides sample C++ routines and utilities to estimate the internal resource utilization of the latest Intel® Xeon® and Core™ processors and gain a significant performance boost.

When the CPU utilization does not tell you the utilization of the CPU

The CPU utilization number reported by the operating system (OS) is a metric that has been used for many purposes, such as product sizing, compute capacity planning, and job scheduling. The current implementation of this metric (the number that the UNIX* "top" utility and the Windows* Task Manager report) shows the portion of time slots that the CPU scheduler in the OS could assign to the execution of running programs or the OS itself; the rest of the time is idle. For compute-bound workloads, the CPU utilization metric calculated this way predicted the remaining CPU capacity very well on the architectures of the 1980s, which had much more uniform and predictable performance than modern systems. Advances in computer architecture have made this metric unreliable: multi-core and multi-CPU systems, multi-level caches, non-uniform memory, simultaneous multithreading (SMT), pipelining, out-of-order execution, and so on.

Figure 2: The complexity of a modern multi-processor, multi-core system

A prominent example is the non-linear CPU utilization on processors with Intel® Hyper-Threading Technology (Intel® HT Technology). Intel® HT Technology is a performance feature that can boost performance by up to 30%. However, end users who are unaware of it are easily confused by the reported CPU utilization. Consider an application that runs a single thread on each physical core: the reported CPU utilization is 50%, even though the application can use up to 70%-100% of the execution units. Details are explained in [1].
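The gap between the two numbers can be sketched with a small calculation. The model below is an illustrative assumption and not taken from [1]: it supposes that running two threads per core yields (1 + gain) times the throughput of one thread per core, and it measures throughput rather than execution-unit occupancy. The function names are hypothetical.

```cpp
#include <cassert>

// Utilization as an OS-style tool would report it: busy logical CPUs
// divided by all logical CPUs.
double reported_utilization(int busy_logical_cpus, int total_logical_cpus) {
    assert(total_logical_cpus > 0);
    return static_cast<double>(busy_logical_cpus) / total_logical_cpus;
}

// ASSUMPTION (illustrative model): two threads per core deliver
// (1 + smt_gain) times the throughput of one thread per core.
// Under that model, one thread per core already reaches this share
// of the peak throughput of the machine.
double share_of_peak_throughput(double smt_gain) {
    assert(smt_gain >= 0.0);
    return 1.0 / (1.0 + smt_gain);
}
```

With 4 busy logical CPUs out of 8, the reported utilization is 0.5. Under the assumed model and the best-case gain of 0.30 quoted above, the same workload already delivers roughly 77% of the peak throughput; with no gain at all it delivers 100%.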

A different example is the CPU utilization for memory-throughput-intensive workloads on multi-core systems. The bandwidth test "stream" already saturates the capacity of the memory controller with fewer threads than there are cores available.

Abstraction Level for Performance Monitoring Units

The good news is that Intel processors already provide the capability to monitor performance events inside processors. In order to obtain a more precise picture of CPU resource utilization we rely on the dynamic data obtained from the so-called performance monitoring units (PMU) implemented in Intel's processors. We concentrate on the advanced feature set available in the current Intel® Xeon® 5500, 5600, 7500, E5, E7 and Core i7 processor series [2-4].

We have implemented a basic set of routines with a high-level interface that are callable from user C++ applications and provide various CPU performance metrics in real time. In contrast to other existing frameworks like PAPI* and Linux* "perf", we support not only core but also uncore PMUs of Intel processors (including the recent Intel® Xeon® E7 processor series). The uncore is the part of the processor that contains the integrated memory controller and the Intel® QuickPath Interconnect to the other processors and the I/O hub. In total, the following metrics are supported:

Intel® PCM version 1.5 (and later) also supports Intel® Atom™ processors but counters like memory and Intel® QPI bandwidth and L3 Cache Misses will always show 0 because there is no L3 Cache in the Intel® Atom™ processor and no on-die memory controller or Intel® QPI links.

Intel® PCM version 1.6 supports on-core performance metrics (like instructions per clock cycle and L3 cache misses) of the 2nd generation Intel® Core™ processor family (Intel® microarchitecture code name Sandy Bridge) and adds experimental support for some earlier Intel® microarchitectures (e.g. Penryn), which can be enabled by defining PCM_TEST_FALLBACK_TO_ATOM in cpucounter.cpp.

I want to see these counters!

As an additional goody, the package includes easy-to-use command line and graphical utilities that are based on these routines. They can be used out of the box by users who cannot or do not want to integrate the routines into their code but want to monitor and understand the CPU capacity limits in real time.

Figure 3 shows a screenshot of the command line utility on the Windows* platform. Whereas the Linux* version can rely on the MSR kernel module that is provided with the Linux kernel, no such facility is available on Windows. For Windows, a sample implementation of a Windows driver provides a similar interface.

Figure 3: Intel® Performance Counter Monitor command line version

But there is more to come. For the Linux operating system, the package includes an adaptor that plugs into the KDE* utility ksysguard. Using this daemon, it is possible to graph the various metrics in real time. Figure 4 shows a screenshot where some of the metrics are displayed during a workload run.

See figures 9 and 10 below for PCM version 2.0 versions of these screenshots.

Since these utilities provide direct insight into the system, they can even be used to quickly find and understand fundamental performance bottlenecks in real time. (In contrast to the Intel® VTune™ Performance Analyzer, however, they will not tell you which parts of the application are causing the performance issue.)

Since version 1.5, the Intel® Performance Counter Monitor package contains a Windows* service, based on Microsoft .Net* 2.0 or later, that creates performance counters that can be shown in the Perfmon program delivered with the Microsoft Windows* OS. Microsoft's perfmon is capable of showing many useful performance counters on the Windows* OS, such as disk activity, memory usage, and CPU load. More information about perfmon for Windows* 7 and Windows* 2008/R2 can be found here (but perfmon has been available for many releases of Windows now). Please read the Windows_howto.rtf file on how to install and remove the service for Intel® PCM.

For each of the above-mentioned hardware counters on the Nehalem and Westmere based platforms, a corresponding perfmon counter is created. All features supported by perfmon, such as logging over time to a file or database, are therefore also available for these counters. For Intel® Atom™ processors, the perfmon counters for memory and Intel® QPI bandwidth and L3 Cache Misses will always show 0 for the reasons mentioned above. In a future update of Intel® Performance Counter Monitor, the service will only show the available counters.

Thanks to the abstraction layer that the library provides, it has become very easy to monitor the processor metrics inside your application. Before their usage, the performance counters need to be initialized. Afterwards, the counter state can be captured before and after the code section of interest. Different routines capture the counters for cores, sockets, or the complete system, and store their state in corresponding data structures. Additional routines provide the possibility to compute the metric based on these states. The following code snippet shows an example for their usage:
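The capture-before, capture-after pattern described above can be illustrated independently of the library. The sketch below is not the Intel® PCM interface: `CounterSnapshot` and `ipc_between` are hypothetical names, and the counter values are plain numbers rather than readings from a PMU. It only shows how a metric such as instructions per clock cycle is derived from the difference between two captured states.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical snapshot of two free-running counters. In a real program
// the values would be filled in by the monitoring library; here they are
// plain numbers supplied by the caller.
struct CounterSnapshot {
    std::uint64_t instructions_retired;
    std::uint64_t core_cycles;
};

// Instructions per clock cycle for the code that ran between two snapshots.
double ipc_between(const CounterSnapshot& before, const CounterSnapshot& after) {
    assert(after.instructions_retired >= before.instructions_retired);
    assert(after.core_cycles >= before.core_cycles);
    const std::uint64_t instructions =
        after.instructions_retired - before.instructions_retired;
    const std::uint64_t cycles = after.core_cycles - before.core_cycles;
    if (cycles == 0) return 0.0;  // nothing measured; avoid dividing by zero
    return static_cast<double>(instructions) / static_cast<double>(cycles);
}
```

For example, a snapshot of {1000, 2000} before and {5000, 4000} after a code section corresponds to 4000 instructions in 2000 cycles, so `ipc_between` returns 2.0. Keeping the states separate from the metric computation means one pair of snapshots can feed several metrics.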

To assess the potential impact of having precise resource utilization data, we have implemented a simple scheduler that executes 1000 compute-intensive and 1000 memory-bandwidth-intensive jobs in a single thread. The challenge was the existence of unpredictable background load on the system, a rather typical situation in modern multi-component systems with many third-party components. Figure 6 depicts a possible schedule for a scheduler that is unaware of the background activity.

Figure 6: Scheduler without Intel® Performance Counter Monitor

If the scheduler can detect (using the provided routines) that a lot of the memory bandwidth is currently used by a different process, it can adjust its schedule accordingly. Our simulations show that such a scheduler executes the 2000 jobs 16% faster than a generic unaware scheduler on the test system.

Figure 7: Scheduler using Intel® Performance Counter Monitor
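The decision such a bandwidth-aware scheduler makes can be sketched as follows. This is a minimal illustration and not the scheduler used for the measurement above: the threshold `kBusyBandwidthFraction`, the `JobKind` type, and the function `pick_next_job` are all hypothetical, and the background bandwidth is passed in as a number instead of being read from the counters.

```cpp
#include <cassert>
#include <cstddef>

enum class JobKind { Compute, Memory, None };

// Hypothetical threshold: treat the memory system as busy when background
// traffic uses more than half of the available bandwidth.
const double kBusyBandwidthFraction = 0.5;

// Chooses the kind of job to run next, given the fraction of memory
// bandwidth currently consumed by other processes.
JobKind pick_next_job(double background_bandwidth_fraction,
                      std::size_t compute_jobs_left,
                      std::size_t memory_jobs_left) {
    assert(background_bandwidth_fraction >= 0.0);
    const bool memory_busy =
        background_bandwidth_fraction > kBusyBandwidthFraction;
    // Prefer compute jobs while memory is contended, memory jobs otherwise.
    if (memory_busy && compute_jobs_left > 0) return JobKind::Compute;
    if (!memory_busy && memory_jobs_left > 0) return JobKind::Memory;
    // The preferred kind is exhausted: fall back to whatever is left.
    if (compute_jobs_left > 0) return JobKind::Compute;
    if (memory_jobs_left > 0) return JobKind::Memory;
    return JobKind::None;
}
```

With 80% of the bandwidth in use by others and both kinds of jobs pending, the sketch picks a compute job; at 10% it picks a memory job. A single fixed threshold is the simplest possible policy and is chosen here only for readability.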

Intel PCM version 2.0 Features

Intel PCM version 2.0 adds support for the Intel® Xeon E5 series processor based on Intel microarchitecture codenamed Sandy Bridge EP/EN/E. This processor has a new uncore with lots of monitoring options.

The Xeon E5 series processor's uncore has multiple 'boxes' similar to the Xeon E7 processor (Intel microarchitecture codename Westmere-EX). Intel PCM v2.0 supports Intel® QPI and memory metrics for the new processor.

Comparing the output of 'pcm.exe 1' version 1.7 versus version 2.0 on a Xeon E7 (Westmere-EX) based system, the primary differences are:

Version 2.0 prints a 'TEMP' column for each core (and for each socket on the Xeon E5 processor series). 'TEMP' values are temperature readings in steps of 1 degree Celsius relative to the TjMax temperature (thermal headroom): 0 corresponds to the maximum temperature.

Version 2.0 also displays the C-state core and package residency. This is the percentage of time that the core (or the whole package) spends in a particular level of C-state. The higher the level, the greater the power savings.

Intel® Xeon® E5 series specific features

The PCM version 2.0 information below applies to the Intel® Xeon® E5 series processor.

PCM version 2.0 adds more Intel® QPI info:

the QPI link(s) speed

the percentage of in-coming (received) QPI bandwidth used for data

the bytes of out-going (transmitted) data and non-data traffic for each link along with percentage utilization for the out-going link.

Please note that the availability of Intel® QPI information may depend on support for the Xeon E5 uncore performance monitoring units in your BIOS and on the BIOS settings.

PCM version 2.0 also adds energy usage info:

Energy usage by socket

DRAM energy usage. If the BIOS doesn't support this feature then the DRAM energy will be reported as zero.

PCM-power utility

For the Intel® Xeon® E5 series processor, PCM version 2.0 also provides the pcm-power utility. The MSVS Windows project file for this utility is in the PCM-Power_Win directory.

The pcm-power utility displays, for all cases:

For each socket and Intel® QPI port, the percentage of QPI clocks spent in the L0p and L1 lower-power states. In the L0p power-saving state, half of the QPI lanes are disabled. In the L1 state, all lanes are in standby mode. The above-mentioned uncore performance monitoring guide has more information on these metrics (see table 2-102). Please note that the availability of Intel® QPI information may depend on support for the Xeon E5 uncore performance monitoring units in your BIOS and on the BIOS settings.

For each socket, display the energy used, the watts, and the thermal headroom.

For the DRAM, display the energy and watts used, if the platform supports this feature. The value displayed will be zero if the DRAM energy display is not supported.

This option uses the 'frequency banding' feature of the PCU PMU to display the percentage of time the cores spend in 3 'bands' of frequency.

The default bands are 10, 20 and 40. You can override each band with '-a band0', '-b band1', and '-c band2'. Each band is multiplied by 100 MHz. The default bands then represent the %time the cores are in frequency:
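The unit conversion stated above (each band value is multiplied by 100 MHz) can be written down as a one-line helper. The function name `band_to_mhz` is hypothetical and is not part of the pcm-power utility.

```cpp
#include <cassert>

// Converts a band value as passed to '-a', '-b' or '-c' into MHz.
// Each band value is a multiple of 100 MHz, as stated in the text.
int band_to_mhz(int band) {
    assert(band >= 0);
    return band * 100;
}
```

Applied to the default bands 10, 20 and 40, this gives 1000 MHz, 2000 MHz and 4000 MHz.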

The unit is the number of cores on the socket that were in C0, C3 or C6 during the measurement interval.

On a busy system one can get:
S0; PCUClocks: 26512878934; core C0/C3/C6-state residency: 7.28; 0.00; 0.72
Which means that, for socket 0, during the interval, on average, 7.28 cores were in C0 (the full-power mode), 0.0 cores were in C3 (a low power state) and 0.72 cores were in C6 state (an even lower power state).
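If the output is post-processed by a script or another program, the three residency values can be extracted from such a line. The parser below is a sketch that assumes the line has exactly the shape of the sample above; the names `CStateResidency` and `parse_cstate_residency` are hypothetical, and other versions of the utility may format the line differently.

```cpp
#include <cstdio>
#include <string>

// Average number of cores in each C-state during the interval.
struct CStateResidency {
    double c0;
    double c3;
    double c6;
};

// Parses the three values that follow "residency:" in a line shaped like
// the sample output. Returns false if the marker or the numbers are missing.
bool parse_cstate_residency(const std::string& line, CStateResidency& out) {
    const std::string marker = "residency:";
    const std::string::size_type pos = line.find(marker);
    if (pos == std::string::npos) return false;
    const char* values = line.c_str() + pos + marker.size();
    // Whitespace in the format matches any amount of whitespace,
    // including none, so "7.28; 0.00; 0.72" and "7.28;0.00;0.72" both parse.
    return std::sscanf(values, " %lf ; %lf ; %lf",
                       &out.c0, &out.c3, &out.c6) == 3;
}
```

For the sample line, the function fills in 7.28, 0.00 and 0.72 for C0, C3 and C6 and returns true; for a line without the "residency:" marker it returns false and leaves the output untouched.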

the freq was limited by the OS 6.09% of the time. This is based on PCU event 0x6 FREQ_MAX_OS_CYCLES.

the power usage limited the freq 2.39% of the time. This is based on the same event as option '-p 3' second event.

the current usage limited the freq 91.51% of the time. This is based on the same event as option '-p 3' third event.

option '-p -1' omits PCU PMU output

Updates to plugins for Linux Ksysguard and Windows* Perfmon GUI

In addition to the command line tools the graphical plugins for Linux Ksysguard and Windows* Perfmon have been extended with essential energy related metrics (C-states, thermal headroom, processor and DRAM energy).

Intel, Xeon, Core, and VTune are trademarks of Intel Corporation in the U.S. and other countries. *Other names and brands may be claimed as the property of others. Intel processor numbers are not a measure of performance. Processor numbers differentiate features within each processor family, not across different processor families. Go to: http://www.intel.com/products/processor_number

Any software source code reprinted in this document is furnished under a software license and may only be used or copied in accordance with the terms of that license. The software license text is included into the code sample.

Results have been estimated based on internal Intel analysis and are provided for informational purposes only. Any difference in system hardware or software design or configuration may affect actual performance.

This software is subject to the U.S. Export Administration Regulations and other U.S. law, and may not be exported or re-exported to certain countries (Burma, Cuba, Iran, North Korea, Sudan, and Syria) or to persons or entities prohibited from receiving U.S. exports (including Denied Parties, Specially Designated Nationals, and entities on the Bureau of Export Administration Entity List or involved with missile technology or nuclear, chemical or biological weapons).

License and Download

Intel Performance Counter Monitor is discontinued. Instead, we will contribute updates and new features to the fork Processor Counter Monitor on GitHub.

135 comments

Hi, the pcm-pcie tool seems to require Numa to be enabled. We are using a Z820 with dual E5-2697 and the tool fails if Numa is disabled. Does not calculate Num Sockets correctly is the "first" issue I see. Will the tool be able to work with Numa Off?

As you have pointed out correctly, PCM currently does not support measuring memory bandwidth on client parts on Windows. The reason is that you need access to physical memory, which is not supported by the Windows drivers that we are using.

Is there a way to implement it? As far as i see you must only map PCI config space and btw. VTune is working on this machine and sdram bandwidth is shown correctly. But i had to monitor memory bandwidth from my own application, so i was so glad, when i had found this article on performance counters. Is there a way to make it work?

I grabbed the WinRing0 binaries from RealTemp370 and followed the PCM-Service steps. I then used Perfmon and graphed some of the counters. However I find that IPC, Instructions Retired, Memory Read Bandwidth, and Memory Write Bandwidth are all always zero. IPC, in particular, is one of the most interesting counters so it is quite disappointing that it does not work. I have an Intel(R) Core(TM) i7-3930K CPU @ 3.20 GHz and I was running a workload that kept all twelve hardware threads fully occupied.

Thermal Headroom below TjMax works, but that is less interesting.

What do I have to do in order to get IPC information? After struggling through all of the build issues listed above I still don't have any useful information. I tried adding the counters for varying processors with no success.

This project really needs a solution file to contain all of the projects. This solution would then automatically build all of the results to one output directory instead of requiring hand copying of components. As it is there are twice as many steps as there should be. Setting this up properly is trivial and would immediately save time for both Intel employees working on this and end users.

No VS 2010/2012/2013 support? VS 2005 is nine years old and should not be the only supported compiler anymore. After upgrading to VS 2013 I hit the following errors and warnings:

cpucounters.cpp is missing an “#include <algorithm>” for min and max – the code will not compile in VS 2013 without this.

This warning is annoying and should be fixed:

1>c:\devtools\intelperformancecountermonitorv2.6\cpucounters.h(226): warning C4251: 'PCM::errorMessage' : class 'std::basic_string<char,std::char_traits<char>,std::allocator<char>>' needs to have dll-interface to be used by clients of class 'PCM'

No msr.sys shipped with the package? Sending people off to get RealTemp binaries? That really needs to be fixed. Having to install the Windows DDK Kit and learn how to use it just to discover whether PCM is even worth using is an unreasonably high bar. And running random binaries from the Internet as kernel drivers is a crazy thing to recommend. Intel really needs to distribute signed binaries.

PCM.exe project should be set to require administrator privileges instead of needing custom modifications. Is there a compelling reason why the executables that require

Why does PCM-Service have Release and Release64 configurations? The target platform should be separate from the configuration Debug/Release versus Win32/x64, not combining the two of them. I’m not sure what Release64, Win32 platform actually means.

PCMService depends on Intelpcm.lib so they should be in the same solution with this dependency made explicit to avoid this error:

This is really just a variation on my initial complaint about the lack of a solution file. If these projects are set up correctly then the instructions should just be "Load the solution file. Select Release and Win32 or x64. Build." Anything more complicated than that means that the projects are not correctly set up.

I am trying to run the code given in the library on HSW machine running Ubuntu 13.04. I run into problem:

"Access to Intel(r) Performance Counter Monitor has denied (Performance Monitoring Unit is occupied by other application). Try to stop the application that uses PMU.Alternatively you can try to reset PMU configuration at your own risk. Try to reset? (y/n)"

What should i do? Please help

(Pls note: I use this machine for Vtune/SEP and EMOM analysis but they are not running when i use PCM)

I want to monitor pipeline based execution. particularly, I am interested in monitoring instructions when processor stalls. is it possible with PCM? if so, can you please provide some pointers/suggestions to do so?