Intel's XTU analyzed

Intel's XTU benchmark is one of the most famous CPU benchmarks out there. HWBOT has registered well over 800,000 results; no other benchmark comes even close to that number. But what is being tested with XTU? Is it well implemented, safe against cheating, or even reliable? To answer these questions I am going to dive deep into the reverse engineered code of XTU and reveal inside information, tweaks and various attack vectors.

Before we get to it I'd like to make clear that I will not provide any helper applications to cheat XTU. This article is for educational purposes only, although I will uncover a few legit tweaks and methods to experiment with that could result in an increase of XTU scores.

When analyzing the security and reliability of an application you have to see things through the eyes of an attacker. For my process it always helps to get the big picture first before reverse engineering anything. Sysinternals' Process Monitor is a good way to start; a filter needs to be set for "PerfTune.exe" to tune out the other applications:

ProcessMonitor monitoring PerfTune.exe, the main executable of XTU

The selected row shows the creation of a file called p95-bench(32-bit).exe or p95-bench(64-bit).exe, depending on the OS and detected hardware. This file is the actual benchmark, which is called 20 times for the whole run, once for each movement of the blue progress bar. It can be found in a temporary folder called: C:\ProgramData\Intel\Intel Extreme Tuning Utility\Temp
The ProgramData folder and the temporary benchmark executables are marked as "hidden", so be sure to enable the setting to show hidden files in Windows Explorer to view them. After the benchmark run these files are deleted again.

So "p95-bench", huh? Could this be Prime95? Let's grab the file during a benchmark run and have a look at it. The file's properties show that it was digitally signed to prevent modification. But that doesn't stop us from running a reverse engineering tool (OllyDbg or IDA will do). We don't even need to reverse anything at this point; we will just have a look at the strings inside the benchmark executable:

Opening the temporary benchmark executable with IDA reveals the following strings when searching for mersenne

A search for "mersenne" reveals that this is indeed the well known Prime95 at work here, or at least a very basic command line version of it. That immediately raises the question of why this benchmark is only compatible with Intel CPUs. Prime95 in its latest versions even brings rudimentary Zen support to the table, so the benchmark itself can't be the real reason. I will dive into that later on, but let's analyze the executable first. By searching through the strings of the application we can also find that the Prime95 version used by Intel is "27.7", one of the first to make use of AVX instructions. The application's strings also reveal that most of the configuration options defined by prime.ini and local.ini are still supported. Good to know.

Now it's time to see some code. By knowing that Prime95 27.7 is used and that it's actually open source, this is just too easy. By guessing the filename on mersenne.org's FTP we can download the source code to assist us with the reverse engineering process: p95v277.source.zip

Don't worry, I won't bother you with learning reverse engineering here (we will save that up for another article). But I do want to give you a quick look at what you would be able to see. As an example, these are among the last assembler instructions the benchmark executes:

Reverse Engineering in all its glory: The last few lines of code show the usage of the WIN32 API function MapViewOfFile

Looking at the executable's instructions without debugging is called static analysis. Although it gives good insight into what is in the file, it's very hard to determine which code is actually executed. That's why it is necessary to isolate the functionality we want to reverse engineer as well as possible, so that we can debug it whenever we want. In our case we need to run the benchmark executable without pressing the benchmark button in XTU's main application. This is easy because we have already grabbed the executable from the temporary directory and can therefore execute it from a command line prompt (cmd.exe or PowerShell). The only thing missing is a single command line parameter, which is required to really run the benchmark. By the way, this is a relic from old XTU versions, where the parameter defined the location of the result file on the hard drive. Good to know that this has been improved, although not by much (more on that later). The parameter is ignored anyway, so just pass 0 or whatever. It is important to point out that XTU has to run in the background because the benchmark executable relies on a dependency provided by XTU and its system service. Here is the output:

Finally we are able to step through the code line by line and with the help of the original source code of Prime95 it is a delight to label most of the general functions correctly to complete the puzzle piece by piece. So without further ado this is a simplified list of what the benchmark does each time it is called:

Creating an event to signal the main application that the result is up for grabs later on

Reading a TSC "bending" value from a memory mapped file

Reading configuration files and, if they are not available, using hardcoded or detected defaults instead (like CPU architecture, L2/L3 cache sizes, hyperthreading, ...)

Generating random input data for the calculation based on the current time

Running 1875 Lucas Lehmer iterations (gwnum's gwsquare function) while checking on each iteration if it was the fastest one

Writing a memory mapped result file that includes the fastest iteration time as a double precision value in milliseconds

Signaling the main application that the memory mapped result file was written
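The timing part of the steps above can be sketched as follows. This is a minimal Python illustration and not the benchmark's actual code: `workload` stands in for a single gwsquare call, and the function name and the injectable `clock` parameter are hypothetical.

```python
import time
from typing import Callable


def fastest_iteration_ms(
    workload: Callable[[], None],
    iterations: int = 1875,
    clock: Callable[[], float] = time.perf_counter,
) -> float:
    """Run `workload` repeatedly and return the shortest single run in ms."""
    best = float("inf")
    for _ in range(iterations):
        start = clock()
        workload()
        elapsed_ms = (clock() - start) * 1000.0
        # Only the fastest iteration survives; all other timings are discarded.
        if elapsed_ms < best:
            best = elapsed_ms
    return best


if __name__ == "__main__":
    # Placeholder workload; the real benchmark squares a large number via FFT.
    print(fastest_iteration_ms(lambda: sum(range(1000)), iterations=100))
```

Keeping only the minimum means that a single lucky iteration decides the reported value, no matter how the other iterations performed.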

You might ask yourself what the TSC "bending" value might be. Well, XTU uses the CPU's internal timestamp counter, the TSC (read via the RDTSC/RDTSCP instructions), to measure time. To convert the elapsed number of ticks into seconds you need to know how many ticks happen each second. Makes sense, doesn't it? Modern CPUs have an invariant (or constant) TSC that ticks with the base clock the CPU was booted with. Any bclock or CPU ratio changes won't have an effect on this number, so a benchmark can rely on the number of ticks provided by CPUID information (side note: that's not true for AMD's Ryzen but that's another story). This does not apply when your CPU is too old and has no invariant TSC, or if you are using Windows 8+ in combination with an AMD CPU or an Intel CPU older than Skylake. These systems are prone to time skewing when the bclock is altered in the OS, because the number of ticks per second changes but the benchmark still divides by the assumed bootup value. So, to avoid timing bugs, Intel introduced their own dynamic measurement of ticks per second. It is passed to the benchmark executable via a memory mapped file and used to calculate the final result value in milliseconds.
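To make the skewing effect concrete, here is a small Python sketch of the tick-to-time conversion. The function names and the 3.7 GHz rate are assumptions chosen for illustration; they are not taken from XTU.

```python
def ticks_to_ms(elapsed_ticks: float, ticks_per_second: float) -> float:
    """Convert a TSC tick delta to milliseconds using an assumed tick rate."""
    return elapsed_ticks / ticks_per_second * 1000.0


def reported_ms(real_ms: float, actual_rate: float, assumed_rate: float) -> float:
    """Time a benchmark would report if the TSC really ticks at `actual_rate`
    while the conversion still divides by `assumed_rate`."""
    elapsed_ticks = real_ms / 1000.0 * actual_rate
    return ticks_to_ms(elapsed_ticks, assumed_rate)


if __name__ == "__main__":
    boot_rate = 3.7e9  # hypothetical 3.7 GHz TSC rate at boot
    # Invariant TSC: the rate is unchanged, so the reported time is the real time.
    print(reported_ms(1.0, boot_rate, boot_rate))
    # Non-invariant case: the TSC ticks 5% slower but the divisor stays the same,
    # so one real millisecond is reported as roughly 0.95 ms.
    print(reported_ms(1.0, boot_rate * 0.95, boot_rate))
```

The reported time scales with the ratio of the actual to the assumed tick rate, which is why a dynamically measured ticks-per-second value closes this particular gap.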

That's already quite a list but we are not finished yet. Let's take a better look at the XTU main application (PerfTune.exe) again. By opening the executable with IDA we will encounter the following dialog indicating that this is a .NET assembly executable written in C#:

IDA doesn't do well with .NET but there are other disassemblers out there to make our lives easier. For a quick static analysis of unencrypted executables I prefer JetBrains' dotPeek; for debugging or other heavy lifting I recommend the much more advanced dnSpy.

As we can see in these screenshots from dotPeek, the executable PerfTune.exe is decompiled without any trouble, so we can read every line of code Intel's engineers have written. The problem starts when certain libraries are invoked, like "BenchmarkLibrary" (IntelBenchmarkSDK.dll). Most of these DLLs are also written in C# but have been encrypted, so dotPeek cannot decompile the method bodies. Instead we get comments stating "ISSUE: unable to decompile the method.". That's where dnSpy comes into play:

Opening IntelBenchmarkSDK.dll in dnSpy shows obfuscated symbols but still reveals a fair amount of insight

Now we know that we are struggling with a .NET obfuscator, namely a custom variant of ConfuserEx. This is not a bad choice by the Intel engineers per se, as it takes at least a serious attempt to be able to read the DLL's code. Sadly, it's still pretty easy: We just need to debug the application with dnSpy and step into the DLL we want to deobfuscate (set a breakpoint at MultipleRun() and follow the DLL via the call to BenchmarkLibrary::StartBenchmarkRun()). That way the module (the DLL) has to be loaded as readable, but still obfuscated, code in order to be executed:

Stepping into the original IntelBenchmarkSDK.dll with dnSpy we can now see the code but still a lot of obfuscated symbols

Now we can save the DLL loaded in memory to a file and clean it up with a deobfuscator. I prefer a self compiled version of de4dot for that task. Now that we have a clean version of IntelBenchmarkSDK.dll we need to exchange it with the original version. That's where another anti-tampering protection of Intel comes into play:

That's good news as well: it seems that Intel is checking for modifications of the application's files and their dependencies. Let's dive deeper to find out if it's worth it. One quick debug session later, with a breakpoint at the message box and a look at the call stack, I can follow a path of functions that leads to the WIN32 function WinVerifyTrust(), which checks the digital signature of each DLL in the application's working directory. That is pretty common, but the verification is incomplete. It only checks whether the digital signature is valid, but does not care who actually signed it. Perfect for our endeavour, because I can just sign IntelBenchmarkSDK.dll with my own signature and swap out the original for the deobfuscated version. Mission accomplished.

A deobfuscated, nicely readable version of IntelBenchmarkSDK.dll is now debuggable because I could digitally sign it myself
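The gap in the signature check can be expressed as a difference between two predicates. The following Python sketch is purely conceptual: `SignatureInfo` and both functions are hypothetical and do not mirror the WinVerifyTrust API or XTU's code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SignatureInfo:
    """Hypothetical result of verifying a file's digital signature."""
    chain_is_valid: bool     # signature verifies and chains to a trusted root
    signer_thumbprint: str   # fingerprint of the signing certificate


def accepts_any_valid_signature(sig: SignatureInfo) -> bool:
    """The behaviour described above: validity alone is enough."""
    return sig.chain_is_valid


def accepts_only_expected_signer(sig: SignatureInfo, expected: str) -> bool:
    """Stricter variant: the signature must be valid AND come from a pinned signer."""
    return sig.chain_is_valid and sig.signer_thumbprint.lower() == expected.lower()


if __name__ == "__main__":
    vendor = "AA11"  # made-up thumbprint standing in for the vendor certificate
    third_party = SignatureInfo(chain_is_valid=True, signer_thumbprint="BB22")
    print(accepts_any_valid_signature(third_party))           # True
    print(accepts_only_expected_signer(third_party, vendor))  # False
```

A file signed with any trusted certificate passes the first check, while the second one would reject it because the signer is not the expected publisher.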

Sadly, from this moment on it's very easy to debug into the HWBOT upload process and catch the AES encryption key and IV. It's also possible to change the data XML file to our liking before it gets encrypted. I won't share any details of course, but if you ever wondered what data gets uploaded to HWBOT by XTU you can have a look at this example score with an i7-8700K:

In the previous chapter of this article I uncovered the attack vector to modify the main application's DLLs by simply digitally signing them again with my own certificate. This is also true for the temporarily created p95-bench(..-bit).exe files. The only thing that needs to be added to the recipe to run our own inner benchmark executable is the "read only" flag, which can be set on these files using Windows Explorer's file properties dialog. By doing that, XTU can't overwrite the modified versions and starts whatever executable it finds under those names (as long as they are digitally signed of course). With that in mind let's write our own benchmark file that integrates nicely into XTU to return any result value we like. To do that we need to implement the same interprocess communication that the original benchmark executable provides for the XTU main application by creating the memory mapped result file and an additional event to signal that the result file was successfully written. Now the fun part begins, where we need to reverse engineer the result data that is written to the file. To show how vulnerable the communication with memory mapped files is, I chose to do this with a small command line tool that continuously reads the data in these files. This is the result data for each loop in binary (formatted as bytes):

Next we need to align the data as meaningful variables. We already know that somewhere in there might be a floating point value that returns the time of the fastest iteration. Bingo, the first 192 bits are three double precision values and already look very promising. This would be the C++ code to align our data correctly:
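For illustration, the same alignment can be sketched in Python with the `struct` module. This assumes three little-endian doubles followed by an opaque remainder; the field names are hypothetical and the sample bytes are made up rather than captured from XTU.

```python
import struct
from typing import NamedTuple

# 3 doubles * 64 bit = 192 bit, little-endian as on x86.
HEADER = struct.Struct("<3d")


class ResultData(NamedTuple):
    value0: float  # first double
    value1: float  # second double
    value2: float  # third double
    other: bytes   # remaining bytes, meaning unknown


def parse_result(raw: bytes) -> ResultData:
    """Split a raw result buffer into three doubles and the untouched rest."""
    if len(raw) < HEADER.size:
        raise ValueError(f"need at least {HEADER.size} bytes, got {len(raw)}")
    v0, v1, v2 = HEADER.unpack_from(raw)
    return ResultData(v0, v1, v2, raw[HEADER.size:])


if __name__ == "__main__":
    # Made-up sample buffer: three doubles plus eight filler bytes.
    sample = HEADER.pack(0.0, 0.25, 0.5) + bytes(8)
    print(parse_result(sample))
```

Because the buffer is plain, unencrypted binary data, nothing more than the field layout is needed to read it.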

Looking at the output of our first standalone run we can easily guess that the second double precision value is the time of the fastest iteration, while the third seems to be the average time of all 1875 iterations. I couldn't figure out what the rest of the data (pData->szOther) represents, although another look at the benchmark executable with IDA could reveal that as well. But it is not part of the XTU score, so we don't need to know. What we know for a fact now is that the result file is accessible through our own command line tool and stores unencrypted data. The interprocess communication with memory mapped files is therefore without doubt the weakest spot of XTU. If you think it can't get worse, then have a look at the following screenshot showing a DLL injection into XTUService.exe to redirect calls to the WIN32 functions that are used to access the TSC bending value's memory mapped file. I intercept the creation and opening calls and change the file path to an alternate location, where the service will continuously write the tick count into a file that will never be used. Finally I create the original TSC bend file myself and write my much bigger custom value to it: 15,000,000 instead of 3,696,000. The benchmark executable will now perceive a second as four times longer than it really is.

How to inject your own DLL into the XTU service to intercept access to the memory mapped file responsible for the TSC bending value
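The "four times" figure follows directly from the two values quoted above. A quick Python check, assuming the elapsed tick count is simply divided by the bending value (the function names are hypothetical):

```python
def skew_factor(real_value: float, forged_value: float) -> float:
    """How much larger the forged divisor is than the measured one."""
    return forged_value / real_value


def perceived_ms(real_ms: float, real_value: float, forged_value: float) -> float:
    """Elapsed time as computed with the forged divisor instead of the real one."""
    return real_ms / skew_factor(real_value, forged_value)


if __name__ == "__main__":
    factor = skew_factor(3_696_000, 15_000_000)
    print(f"skew factor: {factor:.3f}")
    print(f"1 ms of real time is computed as {perceived_ms(1.0, 3_696_000, 15_000_000):.4f} ms")
```

Every measured duration shrinks by the same factor, so the manipulation carries over to the fastest iteration time without any change to the benchmark code itself.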

We have collected so many pieces of the XTU puzzle by now that we can tackle the score formula next. Of course there is always the possibility to have a peek at the C# source code, but we won't do that. Reverse engineering a formula is too much fun, no shortcuts necessary. Instead I went ahead and implemented an empty inner benchmark executable that contains all the necessary interprocess communication to work exactly like the original one, although it does nothing but return a custom result value. I started with 0.2 milliseconds (world record) and went back to 10 ms to get a range of meaningful results:

The XTU score scaling helps us to reverse engineer the formula for scoring

As you can see it is not a linear curve. So to solve this we need to figure out how it scales. Focus on the 1307 points for the result of 1 ms. 2 ms get exactly half the points, while 0.5 ms is awarded twice the points. That sounds very much like 1307 / 2 and 1307 / 0.5. Dividing by a fraction is the same as multiplying by its reciprocal, so 1307 / 0.5 = 1307 * 2. So the formula is nothing more than a simple division with a fixed dividend: score = 1307 / t, where t is the fastest iteration time in milliseconds.

It's interesting to see that the difference between a world record with 18 cores (6344 points with 7980XE@5.7G) and the fastest 8 core score (4778 points with 9900K@6.8G) is merely 0.0675 ms, essentially nothing. That indicates really bad core scaling that is covered up by the non-linear formula. We will have a closer look at scaling in the second part of this article (spoiler alert: it sucks!).
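The numbers above can be checked with a few lines of Python. This sketch uses only the dividend of 1307 points per millisecond derived in the text; any rounding XTU applies to the final score is not modeled, and the function names are hypothetical.

```python
SCORE_DIVIDEND = 1307.0  # points awarded for a fastest iteration of exactly 1 ms


def xtu_score(fastest_ms: float) -> float:
    """Score for a given fastest iteration time in milliseconds."""
    return SCORE_DIVIDEND / fastest_ms


def ms_from_score(score: float) -> float:
    """Invert the formula: fastest iteration time that yields `score`."""
    return SCORE_DIVIDEND / score


if __name__ == "__main__":
    print(xtu_score(2.0), xtu_score(0.5))  # half and double of 1307
    t_18_core = ms_from_score(6344)        # 7980XE world record
    t_8_core = ms_from_score(4778)         # fastest 9900K score
    print(f"{t_18_core:.4f} ms vs {t_8_core:.4f} ms, "
          f"difference {t_8_core - t_18_core:.4f} ms")
```

Because the score is inversely proportional to time, a tiny absolute time difference near the top of the ranking translates into a large point difference.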

To summarize my findings I have compiled the following sorted list of attack vectors that XTU suffers from:

Reading, writing and intercepting XTU's unencrypted interprocess communication between the inner benchmark executable, the XTU service and the main application to skew time (TSC bending value) or modify the result data.

Intercepting RDTSC/RDTSCP instructions via hypervisor to change how the benchmark interprets time in seconds. I notified Intel about this exploit in 2017.

Modifying or exchanging DLLs and the inner benchmark executable by digitally signing them with your own valid certificate.

Create a local.txt in this directory and paste this line into it: CpuArchitecture=5
This line sets the CPU architecture to i3/i5/i7 (the fastest implementation) and has to remain in the config file for standalone testing. Otherwise the architecture won't be detected correctly.

Open XTU and start a benchmark run

Go back to the Explorer and refresh it, you will find a hidden file called p95-bench(64-bit).exe (or p95-bench(32-bit).exe for 32 bit systems). Show its file properties and check the box for "read only" while XTU is still benching. With this the standalone benchmark executable will be permanently available for testing.

Open a command line window (cmd.exe or PowerShell) and navigate to the above location

Execute the benchmark with an additional command line parameter: p95-bench(64-bit).exe 0

Abusing Prime95's configuration files

Go through all steps above for quick performance testing

You can now edit your local.txt file to your own liking. Have a look at the official documentation of Prime95 27.7 (undoc.txt).

Important: To enable the configuration file inside XTU you need to copy local.txt to: C:\Windows\SysWOW64
Depending on the Windows version and bitness, the default working directory could also be C:\Windows\system32

Check with some cores or AVX disabled if your config file impacts the score

Example for disabling hyperthreading and two cores on the 7980XE (CPU doesn't scale beyond 16 cores):

Code: INI

NumCPUs=16
CpuNumHyperthreads=1

Disabling AVX (never did anything good for me, but great for testing the config file's impact)

A fact that seems to be publicly unknown is that although 20 loops with 1875 Lucas Lehmer iterations each are executed, only the best iteration counts for the final score. So you don't need to run all loops at full speed. Or even better: use a configuration file that disables certain features and enable it (via batch file) after the first few loops have successfully gone through. This will downgrade your settings to 2 cores, no hyperthreading and AVX disabled:

My thorough investigation shows that XTU is vulnerable to several serious attacks. I found two distinct ways to change the benchmark's perception of time and therefore the final score. Additionally, the benchmark's DLLs and executables can easily be modified by abusing the fact that the main application only relies on valid digital signatures without checking the owner of the certificate itself. Last but not least, the interprocess communication can be intercepted by anyone with basic WIN32 programming skills and the courage to dump the strings of the inner benchmark executable to gather the names of the memory mapped files.

Security issues aside, the real problem lies in Intel's choice of workload for benching with XTU. Although it uses the well known Prime95, which is a great stress test and surely can be a great benchmark as well, Intel's engineers chose the (now pretty much obsolete) FFT timings benchmark, which scales horribly with more than 8 cores. That's why Prime95 normally uses several worker threads to run as many FFTs in parallel as possible. Running only a single Lucas Lehmer iteration at a time on all available cores, especially with the small size of 1024K, was never going to be future proof. If anybody should have known that, it would be Intel, right?

Furthermore, XTU is a benchmark that gives a score to something that only takes a fraction of a millisecond into account (see "Only the best iterations wins!" above). There is also no error checking implemented, so whatever it calculates could be garbage as far as we know. So it actually just tests whether the OS is stable enough with the current system settings for about a second, then sleeps for about a second and repeats that process 20 times. Yeah.