FutureMark Releases Statement Concerning Time Spy Benchmark

Depending on the parts of the Internet you frequent, you may have seen some heated discussion about the recently released Time Spy benchmark for 3DMark, especially its implementation of asynchronous compute. Because of this, FutureMark has released a lengthy explanation of how Time Spy implements this feature of DirectX 12. It does get a bit technical but is still an interesting read. I have attempted to cover some of the main points below.

Before getting too deep into this, asynchronous compute needs to be explained. A modern GPU has to do a number of different things, so it contains specialized engines to perform specific tasks. Typically there will be at least one copy engine and a compute engine, and these are separate from the 3D engine. Each engine has its own queue of command lists (Direct or Graphics, Compute, and Copy), and what asynchronous compute enables is for these queues to be executed in parallel. DirectX 11 did not allow for this and relied on synchronous compute, as there was no simple way to guarantee the commands would be executed in the proper order. In Time Spy, the DirectX 12 engine produces two queues, a Direct queue and a Compute queue, that are passed to the driver. In the end, the driver and the hardware decide how the queues and the command lists they contain are processed, which is important to note.
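The benefit of running those queues in parallel can be sketched conceptually. This is an illustration in Python, not actual Direct3D 12 code; the task names and timings are invented, and threads merely stand in for the GPU's independent engines:

```python
import time
from concurrent.futures import ThreadPoolExecutor

def run_commands(name, duration):
    # Stand-in for executing one command list; sleeping simulates GPU work.
    time.sleep(duration)
    return name

def serialized(direct_work, compute_work):
    # DX11-style synchronous compute: compute waits behind graphics.
    start = time.time()
    for name, dur in direct_work + compute_work:
        run_commands(name, dur)
    return time.time() - start

def asynchronous(direct_work, compute_work):
    # DX12-style async compute: Direct and Compute queues run in parallel.
    start = time.time()
    with ThreadPoolExecutor(max_workers=2) as pool:
        d = pool.submit(lambda: [run_commands(n, t) for n, t in direct_work])
        c = pool.submit(lambda: [run_commands(n, t) for n, t in compute_work])
        d.result()
        c.result()
    return time.time() - start

direct = [("shadow pass", 0.05), ("geometry pass", 0.05)]
compute = [("particles", 0.04), ("light culling", 0.04)]

t_serial = serialized(direct, compute)   # roughly the sum of all work
t_async = asynchronous(direct, compute)  # roughly the longer of the two queues
```

The total time for the parallel case approaches the length of the longer queue rather than the sum of both, which is the whole appeal of asynchronous compute. Whether real hardware gets anywhere near that ideal is exactly what the driver and engine details below determine.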

If you have been following the news about how asynchronous compute can boost the performance of various games, you may have also seen that AMD GPUs typically see a bigger boost than NVIDIA GPUs. This is because AMD's GCN architecture features Asynchronous Compute Engines (ACEs) in the hardware, while NVIDIA GPUs rely on a pre-emption method that requires a greater level of software control by the driver. At least that is the case for NVIDIA's newer Pascal architecture (GTX 10xx series); Maxwell GPUs (GTX 9xx series) have asynchronous compute disabled in their drivers, so while Time Spy passes those drivers two queues (Direct and Compute), the driver serializes the command lists and runs everything in the Direct queue.

Now we can get to what some on the Internet have deemed a controversial or biased move by FutureMark. All GPUs are sent the commands from Time Spy in the exact same way, meaning (in part) that whatever GPU and drivers you have, a Direct queue and a Compute queue are sent to them to execute. The supposed bias comes from the observations that AMD's GCN-based GPUs are often better equipped to utilize asynchronous compute than NVIDIA GPUs, and that DirectX 12 would allow the benchmark's engine to recognize the capabilities of a GPU and use a code path optimized specifically for that hardware. Time Spy does not take advantage of this feature of DirectX 12, and as someone with a background in science and having done lab work, I would argue it really should not. (Incoming editorializing.)

At the heart of this issue seems to be a question of what a benchmark is for. If it is meant to demonstrate the ultimate potential of one's hardware using all of the latest supported technologies, then yes, Time Spy has failed. If instead a benchmark is meant to be a constant and consistent test of one's hardware, then Time Spy has succeeded. Considering that the definition of a benchmark involves being a standard or point of reference against which other things can be compared, the second option is the better fit. Having special and distinct optimizations for AMD, NVIDIA, and Intel GPUs means any measurements made of them cannot be directly compared without also considering the effectiveness of those optimizations. There are enough variables to consider when benchmarking without adding these.

While this approach does maintain Time Spy as a valid benchmark, there would still be the potential for a bias in the design of the benchmark itself, but this is only in theory and not in reality. FutureMark has a Benchmark Development Program (BDP), which exists specifically to allow industry leaders to work directly with FutureMark on the creation of its benchmarks, and AMD, NVIDIA, Intel, and Microsoft are all members. Over the two years of development that went into Time Spy, BDP members received regular builds and had access to the source code for providing suggestions. This source code access is all in a single tree, though, so each member sees every suggestion and comment made by every other member. In the end, each member approved the final benchmark for release, and of every benchmark FutureMark has made, Time Spy received the most scrutiny prior to release. If any systemic advantage existed for one manufacturer or another, the others would have known about it and would have had to approve of it anyway, or else Time Spy would not have been released in this state.

In the end, Time Spy is a DirectX 12 benchmark designed to accurately and impartially compare hardware, which means it lacks vendor-specific bias or optimizations. To see the impact such optimizations can make, look at the results of additional benchmarks and weigh them all as you deem appropriate. That is why reviewers run more than one test, after all.

If two people bought really good vehicles, one a truck that can off-road well and the other a sports car, the best way to test them would be to run each on its respective strength. This benchmark puts both vendors in a tar pit and says "race"! That seems silly to me.

But the results from the tar pit can be compared (assuming the truck does not get an undue advantage from its off-roading capability), whereas tests of each vehicle's respective strength cannot be. Compare max speed, and the sports car has a bias. Compare max payload, and the truck has a bias. If all you're interested in is max speed, then looking at just that is what you want, so look at the tests that report on that. If all you're interested in is payload, then look at those tests.

If you want to compare the two vehicles in terms of overall quality, then you need something without bias that likely considers far more than those two measurements. This is what FutureMark tries to create with its synthetic benchmarks, so while it may test for max speed and for max payload, it tests for other things as well and weights them all to create the abstract score value, which can then be compared between the sports car and the truck. Other tests that focus on a specific feature, like asynchronous compute, are still of value, but those are asynchronous compute tests, not graphics tests, just as Time Spy is a graphics test, not an asynchronous compute test.
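That weighting step is the key idea. As a rough sketch of how sub-test results can be folded into one comparable score, here is an illustrative weighted harmonic mean; note that the sub-tests, numbers, and weights below are invented for the car analogy and are not Futuremark's actual scoring formula:

```python
def weighted_harmonic_mean(scores, weights):
    # A harmonic mean penalizes a weak sub-score more than an arithmetic
    # mean would, so one strong result cannot fully mask a poor one.
    assert len(scores) == len(weights)
    return sum(weights) / sum(w / s for s, w in zip(scores, weights))

# Hypothetical sub-test results for the two "vehicles" (made-up units):
sports_car = {"top_speed": 90, "payload": 20}
truck      = {"top_speed": 40, "payload": 85}
weights    = [0.5, 0.5]  # invented weighting, purely for illustration

score_car   = weighted_harmonic_mean(list(sports_car.values()), weights)
score_truck = weighted_harmonic_mean(list(truck.values()), weights)
```

Under this (arbitrary) weighting, the truck's balanced results beat the sports car's lopsided ones, which is the point of an abstract composite score: it rewards broad capability rather than one standout feature.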


Bit of a problem with that: when you compare graphics scores in the form of average FPS to the scores of Time Spy, the numbers don't always add up. For example, I did a rather quick and crude look at Tweaktown's Time Spy and recent GTX 1060 reviews. Notably, the Time Spy graphics scores show the GTX 1070 at 5956 and the Fury X at 5956, so obviously, as this is a DX12 benchmark, the GTX 1070 is the better card in DX12 games, right? Well, oops: looking at the DX12 section of the GTX 1060 review, the GTX 1070 and Fury X score 44 (1070) versus 46 (Fury) in Ashes of the Singularity and 65 (1070) versus 75 (Fury) in Hitman, making the Fury X the better real-world card in current DX12 games.

According to Futuremark:

3DMark Time Spy is a new DirectX 12 benchmark test for Windows 10 gaming PCs. Time Spy is one of the first DirectX 12 apps to be built "the right way" from the ground up to fully realize the performance gains that the new API offers. With its pure DirectX 12 engine, which supports new API features like asynchronous compute, explicit multi-adapter, and multi-threading, Time Spy is the ideal test for benchmarking the latest graphics cards.

The complaints I've read seem to be centered on the benchmark not using hardware the way ACTUAL DX12 games do, and on it following an approach that, at least on the surface, looks as if it could have been designed to benefit NVIDIA. Factor in NVIDIA's logo being plastered on the benchmark's reveal video, and it's not hard to see why someone might question how valid this benchmark is for comparing GPUs. It's similar to designing a CPU benchmark that ignores an instruction set because one vendor doesn't fully support it.

With all that said, I'm not a graphics expert, so do your own research, but if DX12 games are your thing, I sure as hell wouldn't buy a card based on this alleged DX12 benchmark after looking at real games.

If this were a CPU benchmark that moved from single-threaded to multi-threaded benching, this would be like saying: "we love multi-threading so much we are only going to use 2 threads." We know the 480 has 4 async engines, so it would benefit more from having each engine with its own thread and command queue. Of course, not all cards can benefit from this, so instead of trying to improve the experience of users by encouraging better implementations, they're just going with the lowest common async denominator (NVIDIA's barely passable async). Am I wrong here?

But what improves a benchmark user's experience? For a game it is obviously improved performance, but for a benchmark, wouldn't it be the most accurate and broadly applicable results? For async, that might mean using just the two queues, similar to how only feature level 11_0 is used instead of something higher, so that the most configurations can run the benchmark.

Maybe after we see Volta, which will hopefully use a different async solution, we'll also get a new benchmark to push the feature. Or maybe we will see a different company produce an async-heavy benchmark. Of course, then there will be a lot of people complaining about it being biased toward AMD (unlike now, when it seems people are upset there is not a bias for AMD, which is being treated as an implicit bias for NVIDIA). It seems like there is no solution that's going to win the argument, so FutureMark went with trying not to have an active bias for anything, creating a common, if not tremendously high, standard of measurement for everything.

I don't think you're getting it, Jim. Futuremark claims that this new benchmark is one showcasing DX12 that will "fully realize the performance gains that the new API offers," yet they held back use of DX12 features (ones, I might add, that game titles that ALREADY exist use, not just future ones) that one company's cards can't handle. Now if it were a DX11 benchmark, no big deal, but to claim you're fully utilizing the power of DX12 and asynchronous compute, you should be doing just that; otherwise we may as well keep looking at older benchmarks. And before defending the "so more cards can run it" BS: I'm sorry, it's a DX12 benchmark by a so-called technology leader in the industry; of course some older hardware won't be able to run it. That older hardware won't be running such features in the newer games utilizing them anyway. A similar comparison would be a look at asynchronous compute on vs. off for NVIDIA's own Maxwell in this benchmark:

They clearly had no problems implementing the basics of a feature not all cards can use.

Now, let’s talk about the bad news: Maxwell. Performance on 3DMark Time Spy with the GTX 980 and GTX 970 is basically unchanged with asynchronous compute enabled or disabled, telling us that the technology isn’t being integrated. In my discussion with NVIDIA about this topic, I was told that async compute support isn’t enabled at the driver level for Maxwell hardware, and that it would require both the driver and the game engine to be coded for that capability specifically.

It doesn't take much to mount a theory that the benchmark is designed to make Pascal hardware look good. Is that the case? No clue. There's a big pile of evidence that points to it as a possibility, though, and in a world where Warner Brothers settled a case over paying people to say nice things about a game that was generally considered good anyway, paid cross-promotions are everywhere, so I don't think it's unreasonable to be a little suspicious.

Perhaps I am not getting your point, but you are also apparently missing mine: Time Spy is meant to be a DX12 benchmark, not an asynchronous compute benchmark. The claim you are quoting is about the API, not this lone feature of it, and it immediately follows the table in the technical guide showing order-of-magnitude increases in the amount of work a GPU has to complete in each frame. Going from 3,900,000 vertices and 5,100,000 triangles to 30,000,000 and 13,500,000 is not something async alone enables. That comes from far more, and it speaks to the "fully realized performance gains" the API offers more than any lone feature does.
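For scale, the per-frame increase quoted above works out to roughly these multipliers. The counts are as quoted from the technical guide (the baseline is whatever earlier test its comparison table uses); only the arithmetic is mine:

```python
# Per-frame geometry counts as quoted above
old_vertices, new_vertices = 3_900_000, 30_000_000
old_triangles, new_triangles = 5_100_000, 13_500_000

vertex_ratio = new_vertices / old_vertices      # roughly 7.7x the vertices
triangle_ratio = new_triangles / old_triangles  # roughly 2.6x the triangles
```

A several-fold jump in raw geometry per frame is API-wide heavy lifting, not something a single scheduling feature like async compute delivers on its own.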

Let me try making my point another way: Time Spy is a benchmark, not a tech demo. It isn't meant to show off specific features or eye candy; it is meant to measure capabilities broadly and consistently.