Logic and Reason in Benchmarking

The Need for Objectivity

I had intended my next article in this column to be a look at the Winstone benchmarks (which is still in the works); however, the news and reaction surrounding the SysMark 2001 ‘bug’ caught my attention. Essentially, SysMark 2001 has an inherent problem that seemingly puts the Athlon XP, Athlon MP and Duron (with the Morgan core) processors at a disadvantage in the benchmark. It turns out that Windows Media Encoder 7, which is used in the benchmark, checks specifically for the CPUID vendor string ‘GenuineIntel’ to determine whether it should use the SSE instruction path (just when we thought developers had learned their lesson about such things).

This, of course, has reignited accusations that Intel is influencing BAPCo, and that the benchmark is biased. Some have suggested that BAPCo should have put out a patch, and that by not doing so it has ‘proven’ its bias. AMD did put out a patch that replaces the check for ‘GenuineIntel’ with a check for ‘AuthenticAMD’, enabling the SSE code path for the Athlon XP. Many reviewers and users have claimed that this makes the benchmark realistic and fair. However, I can make the argument that it makes the benchmark biased towards AMD.
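To see why swapping one vendor string for another merely moves the problem, consider a schematic sketch of the dispatch logic. This is not Windows Media Encoder’s actual code; the function names and CPU descriptions are invented for illustration, and it simply takes the description of the patch (a straight string replacement) at face value:

```python
# Schematic illustration (NOT Windows Media Encoder's actual code) of
# vendor-string dispatch versus feature-flag dispatch.

def select_path_by_vendor(vendor: str) -> str:
    """The problematic check: SSE is used only if the vendor is Intel."""
    return "SSE" if vendor == "GenuineIntel" else "x87"

def select_path_by_vendor_patched(vendor: str) -> str:
    """The patch as described: the string is swapped to 'AuthenticAMD',
    which moves the bias rather than removing it."""
    return "SSE" if vendor == "AuthenticAMD" else "x87"

def select_path_by_feature(vendor: str, has_sse: bool) -> str:
    """The lesson developers were supposed to have learned: test the
    CPUID SSE feature flag and ignore the vendor string entirely."""
    return "SSE" if has_sse else "x87"

# An SSE-capable Athlon XP under each scheme:
athlon = ("AuthenticAMD", True)
print(select_path_by_vendor(athlon[0]))          # prints x87 (SSE wrongly skipped)
print(select_path_by_vendor_patched(athlon[0]))  # prints SSE
print(select_path_by_feature(*athlon))           # prints SSE

# ...while an SSE-capable Intel CPU under a literal string swap:
intel = ("GenuineIntel", True)
print(select_path_by_vendor_patched(intel[0]))   # prints x87
print(select_path_by_feature(*intel))            # prints SSE
```

Only the feature-flag version gives every SSE-capable processor its SSE path; both vendor-string versions privilege exactly one vendor.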

What? Most people seem to agree that the patch makes the benchmark fairer – so why would I say it makes the benchmark biased? Partly to play devil’s advocate, but there really is a serious issue of concern here. From what I have seen, the detractors of benchmarks such as SPEC CPU2000 argue that those benchmarks are not valid because they are not ‘real world’ applications. Similar claims have been made against Winstone and SysMark, specifically that certain parts of the tests rely more heavily on SSE-enabled code than is typical in the ‘real world’. Yet, so the popular argument seems to go, replacing a piece of code in a benchmark so that the application can use SSE on the Athlon, when that change will not (at least as of the time of this article) be available in the commercial application, is somehow more realistic. The question is: if the patch is not publicly available from the application’s manufacturer, is it ‘fair’ to use it in the benchmark, when the results will not reflect real world performance? Isn’t this making the benchmark biased towards AMD with regard to real world applications?

At the risk of angering many people, what this points out to me is the lack of what a colleague of mine calls ‘intellectual honesty’, or what I would simply call logical consistency. Logic is a process. Logic does not provide the facts used to come to a conclusion; that is what the benchmarks provide. Proper use of logic allows us to be consistent in our thinking and to come to reasonable conclusions.

I believe that there are at least two goals in benchmarking. One is to show the performance that users can expect to see if they run out and buy a component or system today. The other is to look at the various technologies and features and try to determine their potential. The first is ‘real world’ performance, and the second is potential (or perhaps academic) performance. It appears that people vacillate between these two goals, depending upon what they are trying to prove. Logic should tell us that the patch really shows potential performance, not the actual performance that users will see, and therefore the patched run is not a ‘real world’ benchmark.

While I think it is not unreasonable to present the results of the benchmark both with and without the patch, it is unreasonable to present the results from the patched version as the ‘real world’ results. If strict guidelines are not followed for what are supposed to be ‘industry standard’ benchmarks, what is to prevent any other company from issuing patches to make its own products run faster, even if the patch is not available to the public for that application?

On the other hand, the results of the test with and without the patch seem to indicate that the WME application contributes to the overall SysMark 2001 score to a greater degree than seems reasonable. Another question is why that particular application was chosen at all. Without information from BAPCo on the methodology used to select applications, any number of speculations can be made. These are the issues I believe should be the focus, not the bug in WME. The operation of any specific application is not under the control of BAPCo, but the selection and relative weighting of the applications are, and there should be some information about why and how these are determined.
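The weighting question matters because a single heavily weighted application can dominate a composite score. BAPCo has not published SysMark 2001’s actual weighting scheme, so the sketch below is purely hypothetical: the application scores and weights are invented, and a weighted geometric mean is assumed only as one common way of combining per-application results.

```python
# Hypothetical illustration of how one application's weight can dominate a
# composite benchmark score. All names, scores, and weights are invented;
# this is NOT BAPCo's actual formula.
from math import prod

def weighted_geomean(scores, weights):
    """Composite score as a weighted geometric mean of per-app scores."""
    total = sum(weights)
    return prod(s ** (w / total) for s, w in zip(scores, weights))

# Four imaginary applications; only the first (a WME-like encoder) changes
# between the unpatched and patched runs.
unpatched = [100, 150, 140, 160]   # SSE path disabled in app 1
patched   = [180, 150, 140, 160]   # SSE path enabled in app 1

for weights in ([1, 1, 1, 1], [3, 1, 1, 1]):
    before = weighted_geomean(unpatched, weights)
    after = weighted_geomean(patched, weights)
    print(f"weights={weights}: composite changes by {after / before - 1:+.1%}")
```

With equal weights the 80% gain in one of four applications moves the composite by roughly 16%; tripling that application’s weight pushes the swing to roughly 34%. Without published weights, readers cannot tell which situation SysMark 2001 is closer to, which is exactly the transparency problem raised above.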