
Updating our benchmark suite for 2017 and beyond

If you've looked at some of our reviews in the last couple of days, you may have noticed a few benchmarks and some new charts that we weren't using before. That's because, behind the scenes, we've just given our benchmark suite its first comprehensive update since 2013, something we don't do often. In the interest of keeping you all informed and letting you know what we're thinking here on the Ars Orbiting HQ, this is a good opportunity to run through the tests we use, why we use them, and why we care about benchmarks in the first place.

Charts

These colors may look a little wonky compared to our old ones, but they're more legible to people with different types of color blindness.

First, you'll notice that we're using some new colors in our charts. As much as we liked the bright colors in our previous charts—colors chosen to match the Ars color palette, incidentally—we got semi-regular feedback from colorblind folks that they were hard to read. Ars Creative Director Aurich Lawson chose our new chart colors to be easily legible by people with all common forms of color blindness.

What we're using: CPU and GPU compute benchmarks

We may use benchmarks beyond these when we're doing in-depth component reviews of the latest flagship processors or graphics cards, but generally speaking, these are the standard benchmarks we'll be running on everything in phone and laptop reviews.

Our primary CPU test is Geekbench 4 from the fine people at Primate Labs. It replaces the previous version, Geekbench 3, and scores from Geekbench 3 are not comparable to those from Geekbench 4.

If you're interested in knowing more about what has changed in Geekbench 4, Primate Labs' John Poole gave a great interview to XDA last year about the update. At a high level: huge performance increases in mobile chips enabled better parity between the desktop and mobile versions of Geekbench, and the tests have been tweaked to make performance comparisons across platforms and architectures easier. The way the test runs was also changed to minimize throttling in heat-constrained systems.

Geekbench 4 also adds a new GPU compute benchmark that works with OpenCL, Nvidia's CUDA, and Android's RenderScript APIs. Lots of apps these days will use GPUs to accelerate specific tasks and take some of the load off of the CPU, and this number will be of particular interest to people who use GPUs to run apps like Photoshop or Premiere or to crunch numbers rather than run games.
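
To give a concrete sense of what "GPU compute" means here, below is a minimal sketch of the kind of task such benchmarks time: offloading simple arithmetic over a large array to the GPU via OpenCL. It uses the pyopencl bindings and is purely illustrative; it is not part of Geekbench's actual workload.

    import numpy as np
    import pyopencl as cl

    # Two large input arrays we want the GPU, rather than the CPU, to add.
    a = np.random.rand(1_000_000).astype(np.float32)
    b = np.random.rand(1_000_000).astype(np.float32)

    ctx = cl.create_some_context()          # pick an available OpenCL device
    queue = cl.CommandQueue(ctx)
    mf = cl.mem_flags
    a_buf = cl.Buffer(ctx, mf.READ_ONLY | mf.COPY_HOST_PTR, hostbuf=a)
    b_buf = cl.Buffer(ctx, mf.READ_ONLY | mf.COPY_HOST_PTR, hostbuf=b)
    out_buf = cl.Buffer(ctx, mf.WRITE_ONLY, a.nbytes)

    # One GPU work item per array element.
    program = cl.Program(ctx, """
    __kernel void add(__global const float *a, __global const float *b, __global float *out) {
        int i = get_global_id(0);
        out[i] = a[i] + b[i];
    }
    """).build()

    program.add(queue, a.shape, None, a_buf, b_buf, out_buf)
    result = np.empty_like(a)
    cl.enqueue_copy(queue, result, out_buf)  # copy the answer back to the CPU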

On our desktop platforms, we'll also continue to use Maxon's Cinebench R15, a combined CPU and GPU benchmark. It spits out one big number each for single-core and multi-core CPU performance, and since it takes a while to run, it can be a decent indicator of heat-related throttling issues.

Browser benchmarks

Historically, browser benchmarks have been used mostly to compare different browsers running on the same system and how the performance of different versions of the same browser evolved over time. In the early days before decent smartphone benchmarking apps became a thing, they were also useful as a rudimentary way to compare performance between different devices. Today's benchmarking tools have gotten good enough that we don't need to rely on them as much, so we don't.

We're still using three browser benchmarks that we consider to be reasonably modern: Google Octane, Mozilla's Kraken, and Browserbench.org's JetStream. Just be aware of their limitations; they're not super helpful when comparing different platforms, and they're also primarily an indicator of single-core CPU performance and not multi-core CPU performance.

GPU benchmarks

Our primary cross-platform GPU benchmark is Kishonti's GFXBench. The test offers a variety of high- and low-level benchmarks, and we stick to the high-level ones: we're still using the T-Rex and Manhattan tests, and now we've added the more punishing Manhattan 3.1 and Car Chase tests. On Windows and Android, these are all OpenGL graphics benchmarks. In iOS and macOS, we run the Metal versions, which don't include the Car Chase test yet (the OpenGL versions aren't being updated on Apple's platforms, presumably because Apple itself appears to have given up on keeping its OpenGL implementation even remotely up-to-date).

GFXBench offers two different types of tests: "onscreen" tests and "offscreen" tests. There has been some confusion about the two in the past, but the difference should be fairly easy to understand. Onscreen tests run at the native resolution of the device's display panel, which tells us how good a given GPU is at driving graphics on a particular display. If you have one laptop with a 1080p screen and one with a 4K screen and both are using the same model of GPU, the 4K system is going to score significantly lower in the onscreen tests because that GPU is pushing more pixels. Offscreen tests render the scenes at 1080p on every device regardless of the screen's resolution, which puts all the GPUs on even footing so that we can definitively say "all else being equal, GPU X is better than GPU Y."
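
If it helps to see the arithmetic, here's a very rough sketch of why that happens. The throughput figure below is made up, and the model ignores everything except pixel count; it's meant only to show how a fixed-resolution offscreen test takes the panel out of the equation.

    # Hypothetical GPU that can shade a fixed number of pixels per second.
    PIXELS_PER_SECOND = 250_000_000

    def fps(width, height):
        return PIXELS_PER_SECOND / (width * height)

    print(f"onscreen, 1080p panel:    {fps(1920, 1080):.0f} fps")  # ~121 fps
    print(f"onscreen, 4K panel:       {fps(3840, 2160):.0f} fps")  # ~30 fps
    print(f"offscreen (always 1080p): {fps(1920, 1080):.0f} fps")  # same on both devices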

On Windows PCs, we also run four tests from the most recent version of 3DMark: Cloud Gate, Sky Diver, Fire Strike, and Time Spy. Where GFXBench helps us measure OpenGL performance, 3DMark covers DirectX performance, and those four tests cover several versions of DirectX up to and including version 12. And the Cinebench R15 GPU test rounds things out with another OpenGL-based test on both Windows and Macs.

Storage benchmarks

This part of our suite is unchanged. In Windows and Android, we measure sequential read and write speeds (behavior you'd see when downloading or copying a single large file to disk) and random read and write speeds (what you'd see if multiple programs were making a bunch of small writes to the disk, as happens often if you're multitasking). AndroBench is the utility we use to make these measurements in Android, and in Windows we use CrystalDiskMark.
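
For readers curious what those two access patterns look like in practice, here's a minimal sketch of the distinction. This is not AndroBench or CrystalDiskMark: real tools bypass the OS page cache, test multiple queue depths, and measure writes too, and the file name and sizes here are arbitrary.

    import os, random, time

    PATH, FILE_SIZE, BLOCK = "testfile.bin", 256 * 1024 * 1024, 4096

    with open(PATH, "wb") as f:                 # create a 256MB scratch file
        f.write(os.urandom(FILE_SIZE))

    def throughput(offsets):
        """Read 4KB blocks at the given offsets and return MB/s."""
        start = time.perf_counter()
        with open(PATH, "rb", buffering=0) as f:
            for off in offsets:
                f.seek(off)
                f.read(BLOCK)
        return FILE_SIZE / (time.perf_counter() - start) / 1e6

    sequential = list(range(0, FILE_SIZE, BLOCK))            # one block after another
    scattered = random.sample(sequential, len(sequential))   # the same blocks, shuffled

    print(f"sequential: {throughput(sequential):.0f} MB/s")
    print(f"random 4K:  {throughput(scattered):.0f} MB/s")
    os.remove(PATH)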

Things are more complicated on Apple's side of the fence because the storage benchmarks there are either less robust or nonexistent. The few storage benchmarks that exist for iOS are badly out of date and offer no customization options, so rather than publish potentially bad numbers, for now we err on the side of not providing iOS storage speeds at all. For macOS we use QuickBench, which isn't as good as CrystalDiskMark but does at least let us measure peak sequential read and write speeds.

Why benchmark?

Let's begin by stating the obvious: benchmarks don't always tell you much about what it will be like to run actual apps. You can't point to a specific Geekbench score and declare that it meets the minimum requirements for running Photoshop well. Same for 3DMark and any given game at a given resolution or detail level; they can give us some idea, but it's also not really the point.

Benchmarking is primarily about comparing the relative performance of two or more systems using a consistent, repeatable set of tasks. We can use these numbers both to track generational improvements as new hardware replaces old hardware and to track how the same chips perform in different devices. The first part doesn't always feel important today, since the performance of mainstream laptop and desktop chips has largely plateaued, and smartphone chips are beginning to show signs of doing the same thing, but it's still useful for people replacing something that's more than a year or two old.

That second part has become especially important in the age of smartphones, tablets, and fanless laptops, where the design of any given device's heatsink can have a big impact on performance. Under ideal circumstances, a Qualcomm Snapdragon 821 or an Intel Core m3-7Y30 is going to run about the same in just about any system. In the real world, this is heavily dependent on how good each individual device is at dissipating heat and letting the chips run at their maximum speeds.

Because benchmarks usually put a respectable amount of strain on the components inside these gadgets, they're also a good way to suss out certain kinds of design flaws. Look at the benchmarking charts for something like Huawei's MateBook tablet from last year or a Snapdragon 810 phone, for instance, and it becomes clear that heat is a problem for these systems. This means not only that heat will impact the day-to-day speed of your device, but also that it may not last as long, since excessive heat shortens the life of batteries and other components.

Andrew Cunningham
Andrew has a B.A. in Classics from Kenyon College and has over five years of experience in IT. His work has appeared on Charge Shot!!! and AnandTech, and he records a weekly book podcast called Overdue. Twitter: @AndrewWrites

37 Reader Comments

So if Geekbench 4 was "...changed to minimize throttling in heat-constrained systems," does that reflect real-world usage? I feel like on laptops and mobile devices, throttling ends up being the limiting factor in most cases.

One other question - when you're comparing scores across devices (i.e., reviewing Device 1 and showing scores from similar Devices 2-4), are you just pulling the scores from previous reviews of those particular devices, or are you rerunning the tests on devices you have on hand? The first seems more likely, but I've had trouble matching the numbers on some reviews to previous reviews (case in point - today's Chromebook Pro review).

(Also included is a file integrity utility that creates checksums for everything that can be verified at a later date to make sure there was no bit rot. It's not a substitute for a proper file system integrity check (Thanks, Apple... not.) but it's better than nothing for critical files.)

I'm a little bit fed up with the constant dick-waving over modern SSD speeds - 2+GB/sec looks great on the box, but I'm highly skeptical of how often that speed actually occurs in daily use.

I'm finding myself far more interested in random 4K / small-file speeds - even modern SSDs struggle to pull more than a few dozen MB/sec with small files. Clearly there's huge room for improvement in this area, and it's my impression that modern OSes juggle far more small files than before.

It'd be good to know if I'm right in thinking this, and to get a better idea of the relative importance of small-file speeds to modern SSD-based OSes.

This is the only Benchmark update I'm interested in, especially on Fridays.

Well, that certainly has some erudite comments in there.

Edit: What the hell was the original image from? TrollWars?

It was supposed to be a picture of a scotch bottle under the Benchmark brand label. But I do suck at attachments, and the filters at work don't help either. My bad, sorry. Again, great weekend to all the readers and staff here.

These colors may look a little wonky compared to our old ones, but they're more legible to people with different types of color blindness.

Fascinating!

I'm not, but thanks anyway for your awareness (wokeness?) of others and their needs, even if they are a minority of all readers.

Speaking as somebody chromatically challenged, the usual colors are actually pretty decent as well. There have been a few where they went with other colors that were challenging, but it's not too common. I don't remember examples off the top of my head, but it has happened. Bar graphs are pretty easy anyway - it's graphs/plots that get confusing.

Anyways, I do appreciate them at least giving it consideration.

So Ars, did you get requests from others to make this change, or did you come to recognize it over time?

I really wish the state of cross-platform graphics benchmarks were better. 3DMark only has Ice Storm Unlimited to make accurate cross-platform comparisons and it turns out such ludicrous scores on modern hardware that it's not very useful. GFXBench has better tests on the surface, but I have two major problems with it. For one, they were very quick to embrace Metal for iOS/OS X but can't seem to be bothered with Vulkan/DirectX 12 for Android or Windows. Using OpenGL over DirectX on Windows especially penalizes Intel GPUs, which are unsurprisingly the ones you're most often comparing across platforms. It's also kind of silly for a gaming focused benchmark as essentially no games on Windows actually use OpenGL exclusively. I'm not terribly picky; I'd be fine with settling on Vulkan for both platforms to at least be using a low-level API across all of them, but pitting Metal on iOS/macOS against OpenGL on Windows is a pretty Apples-to-oranges comparison. Secondly, it appears to strongly favor Imagination Technologies' architecture to a degree that makes the cross-platform results between iOS (or the rare Android device that uses a PowerVR GPU) and everything else kind of useless.

For GFXBench, we typically won't compare things if a different API is being used. So we wouldn't compare OGL results from Windows to Metal results from macOS/iOS, for instance. As tests get updated for Vulkan we'll probably add on another benchmark or two.

We used to run the OpenGL tests in macOS and iOS too, but like I mentioned, the state of the API is pretty miserable these days. macOS is stuck on 4.1, I think, the same place it's been since Mavericks, and I think iOS is stuck on OpenGL ES 3.1. Even if the GFXBench folks wanted to add more tests, I don't think the software/driver support for their newer tests is there.

Re: Geekbench 4, that full interview I linked really is good. The short version is that they wanted Geekbench to be more representative of what chips were capable of under typical/bursty circumstances. They were also trying to make it so that tests run later in the benchmark wouldn't be penalized JUST because they were being run on a hotter processor that had already been going for a while.

They are (were?) working on a different test specifically for throttling that we've actually helped them test, though it doesn't appear to be ready for general usage just yet. Hopefully they keep at it so we can keep using Geekbench to measure that too.

Re: benchmark age, it's tricky, because we're usually required to send hardware back when we're done with it, but old benchmarks don't necessarily reflect improvements in OSes and drivers and browsers. Typically (I won't say universally, but typically) if numbers are more than a year-or-so old we hesitate to re-use them without re-benching. The rest of the time, yeah, we'll pull numbers from past reviews.

The exception is for Macs and iDevices and NUCs, which I have a pretty comprehensive collection of and re-bench with some regularity when new models and OSes come out.

Good stuff. If I may suggest, a decibel reading from devices with a fan would be nice too.

I can look into this! Some of the tests we run are limited by our reviewers being scattered all over the place - Ron is in NJ, Peter and Valentina are in the NYC area, Sam is in Seattle, and I'm in Philly. This is part of the reason why we don't really do Wi-Fi or noise tests - it's just too hard to recreate the same setup in half a dozen different places and control for ambient noise and interference and everything else. I'd rather publish *no* numbers than inaccurate numbers.

But yeah, let me look into this and see what we can add.

I'll second the request for this. Even if the results are inconsistent, a simple note with noise measurements under load vs idle vs ambient would be informative. It would obviously be better if you all have the same measuring device and/or microphone.

Thanks for some of this. I just built my first PC (I've been converted from Mac to Windows) and want to do some testing/benchmarks...to look back on. And for a reference if I want to OC the new PC.

Asus RealBench is a good all-around benchmarking tool that runs a set of tests using bundled open-source tools: high-resolution image transforms using GIMP, video encoding with H.264 compression using Handbrake, LuxMark OpenCL rendering, and a multitasking script that throws a laundry list of all the tests at the system.

It's a little different from synthetic benchmark tests, which are valid in their own right, but the RealBench tests do a good job of showing up instability in an overclocked system that will hit during real-world work or gaming. The Handbrake test in particular can really show up any potential overheating problems -- Handbrake is murder on multithreaded CPU loading. The OpenCL tests will tease out any RAM or memory controller issues.

Take your time and do your research. Overclocking can be a lot of fun, and very satisfying when you have your PC running stable and solid.

Welcome (or welcome back) to the custom PC world -- you can build a machine that's every bit as good as a Jobs-era Power Mac (and those were some sweet machines...) and have the satisfaction of knowing it inside-out.

Thanks! I might have to check out that full interview when I have more time to waste. I do like the idea of a bursty/best-case benchmark, as long as you also have a throttled one. And (most importantly), how long it takes to switch to the throttled case.

I did wonder about still having access to a lot of devices, so that makes sense. I would add: make sure you get your data right - I start losing confidence when I go back to glance at the old reviews and they vary wildly from the numbers in a newer one. Or they just straight up use the wrong data.

Some/most of that may come from inaccurate data entry, which is something that should happen less often since we're recording/storing numbers a little differently behind the scenes.

I would argue that Geekbench has an unstated mission to make mobile SoCs look competitive with x86 CPUs when there is still a huge computational capacity difference. Their numbers have always failed to fit with other cross-platform benchmarks, much less with actually running identical OS + software on both physical platforms (ARM vs x86). As an example, does it seem likely that an iPhone 7 is 25% faster than an i7 NUC (albeit a mobile U SKU, ~5k vs ~4k scores)? Sure, the A10 is a great SoC, but it doesn't pass the sniff test. And when you compare against older, established CPU benchmarks such as Linpack and SPEC, the results come out substantially in favor of x86, even for mobile U SKUs.

One benchmark can't do EVERYTHING. GB4 tries to capture the peak performance a system can operate at. This includes, for example, how well an Intel system performs at its maximum turbo speed. This info is not useless --- it tells you something about how snappy a system will feel under normal, short interactions.

But yes, you will have to look elsewhere if your interest is in how a system performs under 30 minutes of continuous exertion, whether that system is a phone or a PC.

So let me see if I understand. Your argument is that "low power x86 devices OF COURSE must perform better than ARM devices" therefore the benchmark was designed to be biased?

Ummmm. Exactly where did that "OF COURSE" come from? What's this independent evidence you have for the eternal superiority of x86?

Might I suggest that you're simply confused about the various issues?

(a) What is generally suggested as comparable devices are LOW POWER Intel vs A9, A10. No-one is claiming that your overclocked nitrogen-cooled K-series runs slower than an A10. But how about a Core m3? Even the Skylake i5 in a Surface Pro 4 only turbos up to 3GHz (and takes 15W to do so).

(b) The claim is about SINGLE-threaded performance. We all know Intel ships with hyper-threading and with high core numbers on the more expensive parts.

(c) The non-GB4 numbers (e.g. SPEC and Linpack) do NOT show substantially in favor of x86. Here are Anandtech's figures for the A9X (no-one has released SPEC numbers yet for the A10): http://www.anandtech.com/show/9766/the- ... o-review/4 Those show essentially parity, AND they're sub-optimal for comparison in that they don't use the same compiler. (The libquantum and hmmer numbers in particular are problematic because icc transforms the code in such a way that the two CPUs are performing rather different sequences of instructions. This is fine if you want to compare systems as SPEC engines; NOT useful if you want to compare them as generic CPUs.)

The browser numbers for iOS vs MacOS show the same sort of thing.

As for Linpack: (a) are you running it multithreaded? Back to what was said earlier: no-one is claiming Apple has more cores than Intel. (b) When run single-threaded, the primary speed determinant is that the newest Intel chips have 2x256-bit SIMD engines, giving say 2x4 = 8 FMACs per cycle. Apple has 3x128-bit SIMD engines, so 6 FMACs per cycle. Performance basically tracks number of FMACs * frequency, and that's what you see on both sides.

Essentially Apple wins on:
- wider than Intel
- apparently slightly better branch prediction (and/or recovery from branch misprediction)
- lower power
- more capable instruction set (so things like fusing work better, can handle two loads/stores in a single instruction, more registers)

They appear to be about equal (as of A10) on:
- memory controller performance
- degree of out-of-order performance they can sustain
- memory ordering speculation

Intel wins on:
- wider SIMD
- lower latency to L2, L3 and DRAM
- higher turbo frequencies even for mobile (but can't sustain those very long...)
- support for two reads & one write to L1 (Apple can do two reads or one read & one write --- but the better ARMv8 ISA makes this much less of a problem than it appears)

Unknown are things like:
- who has better prefetchers (for all purposes: I1, D1, L2, TLB, ...)
- who has better cache insertion and replacement algorithms.

What is certainly clear is that (so far...) Apple is upgrading the quality of its algorithms each iteration substantially faster than Intel is. Apple's IPC increases by 15 to 20% each year, in addition to the frequency jumps each year. Intel's annual IPC jumps range from zero to a few percent. This MAY reflect that Apple had all the low-hanging fruit available when it was behind Intel, OR it may reflect that Apple has substantially faster turn-around cycles than Intel, so it can learn faster from its previous mistakes and can more rapidly incorporate new ideas and research.
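
To make the FMAC arithmetic above concrete, here is a small back-of-the-envelope sketch. It assumes double-precision lanes and counts an FMA as two floating-point operations; the clock speeds are picked purely for illustration and are not taken from any spec sheet.

    # Peak GFLOP/s ~= SIMD engines * lanes per engine * 2 ops per FMA * frequency (GHz)
    def peak_gflops(simd_engines, simd_width_bits, freq_ghz, lane_bits=64):
        lanes = simd_width_bits // lane_bits      # double-precision lanes per engine
        fmacs_per_cycle = simd_engines * lanes
        return fmacs_per_cycle * 2 * freq_ghz     # FMA = multiply + add

    # Illustrative configurations matching the comment above, not measured values:
    intel_like = peak_gflops(simd_engines=2, simd_width_bits=256, freq_ghz=3.0)  # 48.0
    apple_like = peak_gflops(simd_engines=3, simd_width_bits=128, freq_ghz=2.3)  # 27.6
    print(intel_like, apple_like)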

Yeah, it would be nice if they allowed a method with no gaps to prevent throttling. This would be more representative of, say, exporting a video render or other more demanding real-world tasks than a series of benchmarks with stall spaces in between specifically to save mobile devices from throttling.

Anyways, that's for the Geekbench devs.

If that's what you want to know, why are the Cinebench numbers not good enough? Surely a range of benchmarks testing different things is more useful than five benchmarks that all test the exact same thing?

Yay Andrew! There is actually a shocking amount of work that goes into these charts and all these benchmarks.

We should also mention that we're well aware of the "Benchmark boosting" techniques some Android OEMs use to try and eke out higher scores. Usually this involves clicking the CPU over to an "All cores at maximum power" mode that isn't representative of normal usage.

On Android we have a special version of Geekbench that defeats the package-name-based benchmark detection we've seen OEMs use, so the CPU should treat our Geekbench like any normal app. Still though, I usually fire up a CPU monitor and make sure nothing crazy is happening when the benchmark apps start up.
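
For anyone curious what "firing up a CPU monitor" amounts to, here's a rough sketch of the idea: poll the kernel's cpufreq interface and watch whether cores are pinned at maximum frequency before the benchmark has even started doing real work. The sysfs paths are standard on Linux/Android, but you'd need a shell on the device (e.g., over adb) to actually run something like this.

    import glob, time

    CUR_FREQ = "/sys/devices/system/cpu/cpu*/cpufreq/scaling_cur_freq"

    for _ in range(10):
        # Current frequency of every core the kernel exposes, converted from kHz to MHz.
        freqs = [int(open(path).read()) // 1000 for path in sorted(glob.glob(CUR_FREQ))]
        print("per-core MHz:", freqs)
        time.sleep(1)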