Features

What Are Measurements Good For?

If you read audio websites often, you’ve surely seen discussion about whether or not measurements are important in audio reviews. Unfortunately, few of the people writing about this topic have experience in audio measurement, and their comments rarely amount to anything more than excuses for why they don’t do measurements. Because measurement is such a big part of SoundStage!’s group of websites, and SoundStage! Solo in particular, I thought it important to explain why we do measurements, and what conclusions you should draw -- and not draw -- from them.

The reason we do measurements is that a subjective audio review cannot present a comprehensive, unbiased evaluation of an audio product. It’s one writer’s opinion, almost always formed after casual, sighted listening sessions. A subjective review reflects not only the sounds that reached the writer’s eardrums, but also the writer’s presuppositions about the product category, the brand the product wears, and the technology the product uses; the writer’s relationship with the manufacturer and/or the public relations person; and the writer’s concerns about what readers, other writers, and other manufacturers will think of the review. It can also be affected by the writer’s mood, the music chosen for listening, even the time of day during which the writer performed the evaluations and wrote the review. How much do these factors affect the review? We don’t know -- and neither does the reviewer, unless he possesses a depth of self-knowledge that the Buddha would envy.

These problems could be eliminated by blind testing, but for reasons I’ve discussed elsewhere, almost no audio writers do blind testing. So what we have are mostly subjective reviews that include only the writer’s reactions to a product. These reviews can be entertaining to read as a sort of audio travelogue, but because there’s no attempt to correlate the writer’s judgment with anyone else’s judgment, or with any objective standards, these reviews provide, as famed audio researcher Floyd Toole says in the new 3rd edition of Sound Reproduction: The Acoustics and Psychoacoustics of Loudspeakers and Rooms, “just stylish prose and opinion.”

A better method is to include performance measurements of the product being tested. Measurements provide a practical way to get beyond a writer’s opinion and provide a more comprehensive and less biased evaluation of a product.

Many audio reviewers say they reject measurements because music is about emotion, and measurements can’t gauge emotion. The audio writer, they suggest, can gauge the emotion of a certain piece of music played through a certain piece of audio equipment, and the presumption is that the reader will share his emotional reaction to this experience. But our emotional reactions to music incorporate all sorts of influences, many of which I cite above, and it’s hubristic for any audio writer to assume that a reader’s emotional reaction to a certain piece of music played over a certain system at a certain moment will correlate with his own. I find it insulting when audio writers -- few of whom have demonstrated deep knowledge of audio engineering, scientific research, physics, or music -- presume that their emotional reaction to a piece of music played through a certain piece of gear will be the same as mine.

Contrary to the beliefs of many audio reviewers, measurements tell us much more about how well a component conveys the emotion of a piece of music than their opinions can. What the critics of measurement fail to realize is that the key measurements of speakers and headphones are interpreted by how they relate to the preferences of real listeners established through extensive blind testing. Measurements allow us to gauge a product against the opinions of dozens or hundreds of listeners, formed in conditions where bias is minimized or eliminated. This is vastly more useful than gauging a product against one reviewer’s opinion, formed in uncontrolled, casual testing with no attempt to eliminate bias.

Research in correlating measured performance with listener responses dates back at least to the 1980s. Here’s how the process generally works. The researcher brings in numerous listeners -- with a preference for trained listeners experienced at evaluating audio products -- to listen to samples of a wide variety of audio products in a particular category, and pick their favorites. The researcher then performs measurements of the products to see which measurements predict the listener impressions and which ones don’t. A target response is created based on the listeners’ comments and the responses of the listeners’ favorite products, and then the target response is tested against listener perceptions to confirm its validity.

In these studies, researchers are typically able to develop measurements that predict listener preferences with impressive accuracy. For example, in their 2017 paper “A Statistical Model That Predicts Listeners’ Preference Ratings of In-Ear Headphones: Part 2 -- Development and Validation of the Model,” researchers Sean Olive, Todd Welti, and Omid Khonsaripour report a correlation of 0.91, with 1.0 being perfect correlation. What’s the correlation between subjective reviews and listener preferences? To my knowledge, no magazine or website has tested this, or published the resulting data.
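For readers wondering what a figure like 0.91 means in practice: it is a Pearson correlation coefficient between the preference ratings a model predicts from measurements and the ratings listeners actually gave. The sketch below shows how such a coefficient is computed; the data are invented for illustration and have nothing to do with the Harman studies, which use a far more elaborate methodology.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Hypothetical data: mean preference ratings from a listening panel
# versus ratings predicted by a measurement-based model.
panel_ratings = [6.2, 4.8, 7.5, 3.1, 5.9, 6.8]
model_predictions = [6.0, 5.1, 7.2, 3.5, 5.5, 7.0]

r = pearson_r(panel_ratings, model_predictions)
print(f"r = {r:.2f}")
```

A value of 1.0 would mean the model's predicted ratings track the panel's ratings perfectly; 0.91 means the measurements explain most, but not all, of the variation in listener preference.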

Note that I’m mostly talking about frequency response measurements of loudspeakers and headphones. I’ve also found excellent correlation between my headphone isolation measurements and listener perception of the leakage of outside noise into the headphones and earphones, using listeners from the staff of Wirecutter (a website that tests headphones and many other products) as my test subjects, and playing a recording of airplane cabin noise through my surround-sound system.

The correlation between other measurements and listener perception is not as well established. Distortion measurements predict listener perception only in fairly extreme cases. Spectral decay, or waterfall, measurements have yet to be well correlated with listener perceptions, but they are interesting to look at and they often correspond with frequency response measurements, so I include them. Impedance and sensitivity measurements tell you little or nothing about the sound quality of headphones or speakers, but they are important for assuring that a set of headphones can deliver optimum performance with the amplifier or source device you use.

You may be wondering why I haven’t mentioned measurements of audio electronics, such as amplifiers, preamps, and DACs. That’s because the numerous papers on the subject from the Audio Engineering Society’s E-Library show at best a tenuous and slight correlation between measurements of electronics and the results of blind listening tests. Listeners are only rarely able to consistently distinguish between these products in blind tests, and even when they can, the preferences among multiple listeners are usually too varied and mild to be meaningful. Without reasonable consistency in listener preferences, there’s nothing with which the measurements -- or the impressions of a subjective reviewer -- can be correlated.

However, listeners can distinguish among these devices when they exhibit significant flaws, such as high levels of distortion or large deviations in frequency response, and measurements can easily and reliably detect these flaws. Some of these products also have idiosyncrasies, such as high output impedance or low maximum output, that affect how well they’ll work with the other products in your system. Thus, it’s important to measure these products to see if they have any flaws, characteristics, or limitations that might affect your experience with them.
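To make the output-impedance point concrete: the source's output impedance and the headphone's impedance form a voltage divider, so a headphone whose impedance varies with frequency will have its frequency response tilted by a high-impedance source. The impedance figures below are hypothetical, chosen only to illustrate the arithmetic.

```python
import math

def level_change_db(z_headphone_ohms, z_output_ohms):
    """Level at the headphone relative to an ideal (0-ohm) source,
    modeled as a simple voltage divider."""
    ratio = z_headphone_ohms / (z_headphone_ohms + z_output_ohms)
    return 20 * math.log10(ratio)

# Hypothetical headphone whose impedance rises from 32 ohms in the
# midrange to 60 ohms at its bass resonance, driven from a 10-ohm
# output: the bass region loses less level, so the response tilts.
mid = level_change_db(32, 10)
bass = level_change_db(60, 10)
print(f"midrange: {mid:.2f} dB, bass: {bass:.2f} dB, tilt: {bass - mid:.2f} dB")
```

With a near-zero output impedance both numbers go to 0 dB and the tilt disappears, which is why this interaction shows up only with certain source/headphone pairings.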

I certainly understand why most audio publications avoid measurements. I think most audio engineers would agree that it takes at least a couple of years’ experience to become proficient in any one measurement, plus incalculable hours to actually run the measurements and analyze the results. It’s also costly: while there are a few good, affordable audio measurement systems, most cost somewhere between $3000 and $30,000. And of course, audio measurement demands more commitment, passion, and effort than most people would prefer to devote to such a dense and challenging subject. It’s much easier to pour yourself another glass of scotch and deride the measurement guys as “enemies of poetry, love, and humanistic culture.” But if that’s all the writer is willing to do, they won’t be able to provide information that can predict how the reader -- as opposed to just the reviewer -- will like the product in question.

It is not about controversy. The whole objective/subjective debate has been blown out of proportion, to the extent that it has devolved into cults of personality. In essence it is fear ... and that's just it: there are going to be these two camps out of fear ... fear that objectivity robs hifi of its humanity, fear that what was held to be true/popular belief is not the case, fear that the industry truly is subjective and that so many "trusted" brands are not worth the resources upon which they are built.

I am GUILTY ... guilty of being an insufferable romantic. I grew up from my early teens with HiFi - 40+ years now - and it will always be an intrinsic aspect of my life. Intellectually, academically, in my head I "grok" the need for measurements and statistics. I use them as a filter to guide me when in the market for gear. I want the targets of my desire to be able to deliver on tangible needs. That being said and filtered, it's all about the intangibles: what I feel in my heart, which gets it pounding, gets me motivated to search out products/services which I can write about and share with the HiFi community at large.

To the fears I have enumerated:

1. I am scared that when, for me, HiFi becomes solely a mental exercise, it will have lost its lustre.
2. I am scared that the wondrous "hot stove league" aspect of HiFi discussions (which, as between baseball seasons, is where compelling rumination takes place) will be stymied if and when objectivity metaphorically slams the door. For baseball fans: SABR, while intriguing, is not definitive.
3. I am scared that if indeed HiFi is ruled by graphs and measurements, then the industry, which rests upon subjective "shoulders," could very well implode, as manufacturers shut down when competing products with comparable if not equal measurements are that much more affordable.

It is not a zero-sum game. It should never be. I posit that without the objective/subjective debate it is game-over for HiFi as we know it.

There is no "guilt" in being a romantic when it comes to music - the arts. When emotions get transferred to hardware it can be problematic. I conducted my first blind listening test on loudspeakers in 1966 - 52 years ago. As a research scientist I was intrigued when even simple anechoic measurements showed a clear relationship to the sound quality ratings. This was the first step in a process that has occupied my professional life. I sum it up as "science in the service of art." After all, the objective of HiFi - high fidelity - was and is to deliver the art as nearly as possible as it was created. When uncertainties about the performance of the hardware are reduced, then what is heard is more directly attributable to the performing artists and the recording and mastering engineers in the creative process. My attitude is that this is where debates should be. It is elaborated on in detail in my book, but is nicely summarized in a tutorial lecture on YouTube, "Sound Reproduction; Art and Science/Opinions and Facts": https://www.youtube.com/watch?v=zrpUDuUtxPM&list=FL8EhjwAiBJi_scYd0WgC12g&t=0s&index=7&frags=pl%2Cwn

Having just completed a significant upgrade to my own entertainment system - making it as "neutral," as transparent, as is possible - I am rewarded by many stunningly good recordings, as well as chagrined by those that suffer from apparently flawed recording apparatus, techniques, or judgment. With such a system one can lay credit and blame where it belongs, without having to wonder what, if any, influence one's hardware has. It is the art that puts the smile on my face, not the hardware, even though I respect the engineering excellence embedded in it and the science that guided those efforts.

One benefit of understanding the science is that high quality, accurate, sound reproduction becomes available at lower price levels. I don't understand why this is a problem. That is why publications and manufacturers that show accurate measurements are part of the solution for consumers.

Measurements are important only as far as we can correlate them to perception. As for loudspeakers, we cannot measure the sound of a speaker, just some separate characteristics, which few have correlated to perception. There is no single measure of a speaker that can tell us anything about how it sounds. There is no such thing as "accuracy" in loudspeakers for this reason.

Hi Gary, You gave my book a five-star Amazon review, but did you read it? It provides massive evidence of strong correlations between listener evaluations of sound quality and measurements. Even I say that there is no SINGLE measurement curve that describes loudspeaker sound quality. In the book I show that verifiable subjective/objective correlations are possible if one makes many anechoic measurements and post-processes them so that they convey useful information. My original NRCC measurements, as shown in SoundStage!, are excellent guidance. The evolved "spinorama" version, which uses 70 measurements, is even better. It is now an ANSI/CTA standard. Read the book.

Thanks for jumping in, Floyd! I wish more people would read your book. Their understanding of audio -- REAL audio, not made-up quasi-religious audio -- would go up tenfold. The 3rd edition is very enjoyable and informative; it's more like a collection of really great magazine articles than a textbook.

Thank you for joining, Dr. Toole! I want to ask you a question... A couple of years ago, numerous people were saying that headphone design was where loudspeaker design was in the 1970s insofar as the research on measurements and listening. Do you agree with that? And how far along do you think we are now?

Pages 373-375 of my book outline research that I did back in the 70s. They show that, even in experiments lacking the significant controls that recent research has implemented, there was a useful relationship between competent measurements and listener preferences. The data in Figure 13.4 are clearly not far from current preferred curves (using comparable data). AudioScene Canada magazine published reviews using this data in the mid 1970s; I think it was an historical "first." Many headphones back then were truly dreadful, so crude visual correlations were easy to see.

Recent work by the Harman research group led by Dr. Sean Olive, and extensively covered in AES papers, is much more thorough and precise. It would seem that we are very close to, if not at, the point of diminishing returns. Bass leakage, a problem in the 70s, remains a problem, meaning that individual opinions of sound through individual headphones will vary. Because bass is about 30% of the basis for overall subjective opinions of sound quality (another Olive contribution), it is clear that satisfying all listeners may be difficult. Although bass is certainly a problem with loudspeakers in small listening rooms, it does not usually vary as much as it can in headphones with variable amounts of leakage.

There were a few, very few, good-sounding headphones way back then. It was interesting that Sony engineers visited my NRCC lab after I presented AES papers on the topic, and they went back and created some superb headphones, of which they gave me a set. They measured and sounded superb, but they were monster powered electrostatics, too expensive for what they perceived to be the North American market. Some of the technology trickled down to affordable products, but eventually I lost track of what they, and others, were doing as I focussed on loudspeakers and rooms - a much larger market at that time. How things have changed!

Bass leakage remains a problem for in-ear and closed-back circumaural and supra-aural headphones. Open-back headphones already have a leak, so consistency in bass across subjects can be quite good.

It's important to note that we carefully controlled leakage in all our headphone studies so that listeners generally heard the same bass and we could better correlate subjective impressions with measurements.

In real applications, bass leakage remains a problem that can be minimized by good mechanical design and proper fit. If the leakage can be measured (like we did in our studies by installing a tiny microphone inside the ear-cup or in-ear) you can either try to compensate for it or warn the listener to adjust the fit to achieve a better seal.

In all of the later studies we virtualized all the headphones over a single pair by simulating the measured magnitude and minimum-phase response. The correlation between subjective ratings of the actual and virtualized headphones was greater than r = 0.90. For controlling leakage in IE headphones we put a MEMS mic inside the replicator IE and measured any leakage. For AE/OE virtualizing we used an open-back headphone that had very repeatable response within and between different listeners.
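The virtualization step described above relies on a standard DSP operation: deriving the minimum-phase response that corresponds to a measured magnitude curve. The sketch below is my own illustration of that operation via the real-cepstrum method, with an invented magnitude curve; it is not the Harman implementation.

```python
import numpy as np

def minimum_phase_spectrum(magnitude):
    """Given a sampled magnitude response (real, positive, on the full
    FFT grid), return the complex minimum-phase spectrum that has the
    same magnitude, via the real-cepstrum folding method."""
    n = len(magnitude)
    log_mag = np.log(np.maximum(magnitude, 1e-12))
    cepstrum = np.fft.ifft(log_mag).real
    # Fold the cepstrum: keep c[0], double the causal part, zero the rest.
    window = np.zeros(n)
    window[0] = 1.0
    window[1:n // 2] = 2.0
    if n % 2 == 0:
        window[n // 2] = 1.0
    min_phase_log = np.fft.fft(cepstrum * window)
    return np.exp(min_phase_log)

# Hypothetical smooth magnitude response on a 512-point FFT grid:
# a gentle first-order low-pass shape, symmetric in frequency.
n = 512
freqs = np.fft.fftfreq(n)
magnitude = 1.0 / np.sqrt(1.0 + (np.abs(freqs) / 0.1) ** 2)
spectrum = minimum_phase_spectrum(magnitude)
impulse = np.fft.ifft(spectrum).real
```

The resulting spectrum keeps the measured magnitude while its impulse response concentrates its energy at the start, which is what makes it suitable for simulating one headphone over another without adding audible excess delay.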

One of the things we have done to more accurately measure leakage effects using our GRAS 45CA is to develop pinnae that represent typical leakage on humans. Todd Welti did a cool study where he measured a number of headphones on 10 different listeners to establish how they leaked. He then developed a new pinna that matches the average leakage of the headphones measured on subjects. The paper is here: http://www.aes.org/e-lib/browse.cfm?elib=17699

The next challenge is finding one or more heads that represent the range of leakage found across a distribution of head sizes. The size/shape of the head is usually more of a factor with headphones that have larger cups.

While I agree there is no "single measure of a speaker" that correlates to good sound, there is wide consensus on the numerous measurements that do correlate very well. These measurements were the basis for Dr. Floyd Toole's research at the NRC, back in the 1980s, which is considered the most important and influential research ever done correlating measurements with listening impressions. Few great speaker designers dispute his work.

I am still making my way through the book, notepad in hand, so I will shut up until I see the chapter about what I call "The Big Three." Mark Davis has said that what we hear in a loudspeaker are the frequency response and the radiation pattern. I have added speaker positioning and the reflective qualities of the walls around them. There are so many variations of these factors that it is nearly impossible to characterize the "sound" of a speaker with simple frequency response measurements of the direct sound output and waterfall plots that indicate the width of the "spray" of the direct field. I know that Floyd touches upon my work on p. 403, but there is some misunderstanding in there that I have written to him about. What I am waiting to see in the book are measurements that can correlate The Big Three (radiation pattern, speaker positioning, and acoustics of the room) to perception. Those are what is audible about speakers.

It has been so refreshing to read this article, simply because it needs to be said. We use an entire suite of measurements taken in our anechoic chamber to create a Listening Window and a Sound Power curve, using our algorithm to average them. Ultimately, the final test a speaker must pass is the double-blind listening test, and the results of this testing will generally mean adjustments to the response, because of the high audibility of very low-Q artifacts that are difficult to identify visually. But I can say with certainty that the correlation between the results derived from the "Spinorama" and the results from the double-blind listening tests is very real, and large variations from these results will result in a guaranteed loser in the double-blind listening test. It is worth noting that other factors, like power capabilities and overall bandwidth, play a large role in overall product performance and can affect the results of a double-blind listening test. But those measurements are not the discussion here, and they tend not to be haunted by a disbelief that they need to be measured.
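The averaging step mentioned above can be sketched simply. Composite curves such as the Listening Window are commonly formed as a power average of responses measured at several angles: convert each dB value to power, average, and convert back. The angles and SPL values below are invented for illustration, and this is not any manufacturer's actual algorithm.

```python
import math

# Hypothetical frequency-response curves (SPL in dB) measured at several
# angles in an anechoic chamber; each list covers the same frequency points.
curves_db = [
    [85.0, 86.2, 84.8, 83.5],   # on-axis
    [84.6, 85.9, 84.1, 82.7],   # 10 degrees horizontal
    [84.8, 86.0, 84.3, 82.9],   # 10 degrees vertical
]

def power_average(curves):
    """Combine SPL curves by averaging in the power domain: dB -> power,
    arithmetic mean across curves, power -> dB."""
    out = []
    for point in zip(*curves):
        mean_power = sum(10 ** (db / 10) for db in point) / len(point)
        out.append(10 * math.log10(mean_power))
    return out

listening_window = power_average(curves_db)
```

Averaging in the power domain rather than directly in dB is what keeps a single hot or cold angle from skewing the composite curve, which is why a speaker whose composite curves look smooth tends to fare well in the blind tests described above.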