Has anyone come across other listening test methodologies they thought were highly revealing, useful, and rigorous? In a few papers I've seen (mostly AES) there are methodological details that vary here and there, but mostly researchers are doing double-blind tests of direct comparisons between two files.

There could be a role, for example, in tests of the form usually seen in medicine. Instead of taking one person and exposing them to both stimuli as a single data point, take a lot of persons, put them in different groups, and then collect Likert-scale data on their response to one class of stimuli. You'd probably want 100-300 persons per group, and of course expose them still double-blind, but only to one format.

This would present a calibration problem, of course, since who's to say what each number on the Likert scale represents. Still, with careful random sampling and a large enough n, it's possible that the variations in judgment would wash out (or they might add so much noise that no significant result is obtained). Analysis would look for systematic/significant differences between groups.
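To make the analysis step concrete, here is the sort of two-group comparison I have in mind -- a sketch in Python with simulated ratings (the data, group sizes, and effect size are all invented for illustration), using a large-sample z-test on mean Likert scores:

```python
import math
import random

def two_group_ztest(a, b):
    """Large-sample z-test on the difference of two group means.
    With 100-300 raters per group, the CLT makes the group means
    approximately normal even though individual ratings are ordinal."""
    na, nb = len(a), len(b)
    ma, mb = sum(a) / na, sum(b) / nb
    va = sum((x - ma) ** 2 for x in a) / (na - 1)   # sample variances
    vb = sum((x - mb) ** 2 for x in b) / (nb - 1)
    se = math.sqrt(va / na + vb / nb)               # std. error of the difference
    z = (ma - mb) / se
    p = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))  # two-sided
    return z, p

# Simulated example: 200 raters per group on a 1-5 Likert scale,
# with group B's underlying mean shifted down by 0.3 scale points.
rng = random.Random(1)
clamp = lambda x: min(5, max(1, round(x)))
group_a = [clamp(rng.gauss(3.8, 0.9)) for _ in range(200)]
group_b = [clamp(rng.gauss(3.5, 0.9)) for _ in range(200)]
z, p = two_group_ztest(group_a, group_b)
print(f"z = {z:.2f}, two-sided p = {p:.4f}")
```

Whether the p-value comes out significant depends entirely on the (invented) effect size versus the rating noise, which is exactly the wash-out-vs.-noise tradeoff described above.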

As a variation, a long-term experiment might ask a subject to listen to a single exposure of at least 10 minutes or so each day, and rate it on a Likert scale, then come back on successive days until at least 30 and preferably over 100 data points are obtained. Still lots of statistical noise here, but that's one of the points of large sample sets.

In the past such tests would have had huge practical barriers, but I'll bet it could be done now on the Internet.

In the preferred and most sensitive form of this method, one subject at a time is involved and the selection of one of three stimuli (A, B, C) is at the discretion of this subject. The known reference is always available as stimulus A. The hidden reference and the object are simultaneously available but are randomly assigned to B and C, depending on the trial.

The subject is asked to assess the impairments on B compared to A, and C compared to A, according to the continuous five-grade impairment scale. One of the stimuli, B or C, should be indiscernible from stimulus A; the other one may reveal impairments. Any perceived differences between the reference and the other stimuli must be interpreted as an impairment.

Note those first words "In the preferred and most sensitive form of this method..." - people have tried a lot of ways of doing this, and we end up with ABX or very similar because it's the most sensitive method.

Another sensitive method is the three-alternative forced-choice test (3-AFC). Present A, B and C, where two of those are the original and one is the coded version. Pick the odd one out. If the user picks correctly, move to a higher-quality version; if the user picks incorrectly, move to a lower-quality version. Great for training and finding thresholds of audibility, but fraught with problems in terms of moving up and down the "quality" scale in a useful way. Easy for simple masking experiments. Harder and less reliable for finding the transparency threshold of a codec, though it's one possible tool.
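The up/down rule just described is essentially an adaptive staircase. Here's a toy simulation of the idea in Python -- the "listener" model, the 0-10 quality ladder, and the detection function are all invented for illustration, not taken from any real codec test:

```python
import random

def staircase_3afc(detect_prob, n_trials=40, start_level=0, seed=0):
    """1-up/1-down staircase over discrete quality levels 0..10.
    detect_prob(level) gives the chance the listener picks the odd
    one out; a correct pick moves to a higher-quality (harder) level,
    an incorrect pick moves to a lower-quality (easier) one.
    Threshold is estimated as the mean level at the reversals."""
    rng = random.Random(seed)
    level, last_step, reversals = start_level, None, []
    for _ in range(n_trials):
        step = 1 if rng.random() < detect_prob(level) else -1
        if last_step is not None and step != last_step:
            reversals.append(level)           # direction changed here
        last_step = step
        level = max(0, min(10, level + step))
    return sum(reversals) / len(reversals) if reversals else level

# Toy listener whose detection rate falls off with quality level,
# bottoming out at 3-AFC chance performance (1/3):
est = staircase_3afc(lambda lv: max(1 / 3, 1.0 - 0.08 * lv))
print(f"estimated threshold: level {est:.1f}")
```

The "fraught with problems" part shows up in building a ladder of versions whose steps are perceptually meaningful; the staircase machinery itself is simple.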

Great info, thank you. Depending on how confident I am, I have conducted a de facto ABC round from time to time in the foobar ABX interface. Nothing forces the user to check both A and B before deciding, so one way I've proceeded is to check A alone, and then decide which of X or Y is actually A.

I'm also thinking of a case where a control group of large sample n gets *only* a referent file, like a CD-mastered copy, and then rates that on a variety of quality-oriented Likert scales. There *is* an implicit comparison between that and "everything else I listen to besides this test file", but subjects would not A/B. Then other treatment groups get *only* a manipulated file derived from the referent, and respond on the same survey measures. The hope would be to reject the null hypothesis on a t-test of no difference between the aggregated measures for each group.


But you are comparing a qualitative judgment between two different groups using two different samples? Would you think it likely to obtain any meaningful result?

With an ABX test, you can reject the null hypothesis by correctly identifying X with statistical significance, or fail to reject and imply transparency. This gives you a strong result.
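For reference, that significance calculation is just a one-sided binomial test against p = 0.5 guessing; a minimal Python sketch (the 12-of-16 example is arbitrary):

```python
from math import comb

def abx_p_value(correct, trials):
    """Probability of getting at least `correct` answers right in
    `trials` ABX trials if the listener is purely guessing (p = 0.5)."""
    return sum(comb(trials, k) for k in range(correct, trials + 1)) / 2 ** trials

# e.g. 12 correct out of 16 trials
print(f"p = {abx_p_value(12, 16):.4f}")  # ~0.038, under the usual 0.05 cutoff
```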

Possibly just two different samples, but I was thinking of classes of samples. One group does only, say, MP3 128; the other does only AAC 128. Given those settings, what I'd do then is encode each subject's own personal listening library. Collect the Likert scales plus demographics and personal listening information (dollars spent, type of equipment, genres, hours/day, etc). Then with many hundreds of subjects over a period of weeks, collect a lot of data. Then potentially you'd get results like "Punk rockers, the elderly and those who listened primarily in the car showed no variance between the treatment groups, while young listeners playing pop on headphones preferred AAC." I actually filled out an online survey solicited by SONY with questions about my listening habits and preferences, and it did go to different formats as well, just without any actual listening tests. This information would be of great interest to SONY, but getting the methodology together is difficult.

I'm trying to get away from the unrealistic spot listening encouraged by foobar-type ABXing. I never listen to 30 seconds of music, and I certainly don't play pieces over and over. I listen to one piece, and then move on. My appreciation of the audio quality is cumulative over time, not locked to a variety of spots.

[Caveat: Perhaps this goes to successful ABXing: I ***Do*** listen to the same spot over and over, when I'm practicing a new piece on a live instrument. Getting the septuplet flourish in the Chopin fingered right and phrased beautifully is precisely the exercise of ABXing: short excerpt, repeated listening, listening as a perfectionist for every detail.]

I listened to YouTube music videos with my children for a good two hours the other evening, and my ears acclimated to it. Then I switched to one of my own Redbook-ripped-to-HD tracks--the soundstage leaped out of the speakers, a dramatic contrast. This could conceivably be expanded as a methodology: "Listen all your usual ways all day. Then stop for ten minutes and listen to the prescribed track in the prescribed way. Then rate the quality on these dimensions. No re-listening."

The ABX result IS a strong result, the best there is, but for a highly artificial listening situation. To put it another way, ABX is a lab experiment, well-controlled and rigorous, but with artificial conditions and poor generalizability. Something closer to a field experiment would also be useful. The data would be noisier for sure, but the potential for results directly applicable to development and product offerings would make it worth it to find out if significant results are possible.


Excellent, that's what I was looking for. The unit of treatment exposure in the telephone tests is the "conversation", which is a highly realistic representation of the normal mode of use for that population. One long stimulus, followed by subject responses on a number of quality dimensions. Good procedure.


I'm trying to figure out where you are going here. Your statement "mostly researchers are doing double-blind tests of direct comparisons between two files" is extremely general and seems to describe a very large family of procedures that seem hard to criticize. If not double-blind, then what, sighted? If not direct comparisons, then what, indirect comparisons?

If you want to see a general treatment of comparison methodologies, look at the classic:

There could be a role, for example, in tests of the form usually seen in medicine. Instead of taking one person and exposing them to both stimuli as a single data point, take a lot of persons, put them in different groups, and then collect Likert-scale data on their response to one class of stimuli.

Been there, done that.

It has its moments. One key issue to bear in mind is that the test needs to be tailored and optimized for the question at hand. For example: "Do these two files sound different at all?" is thought by many to be a prerequisite for, and a very different question than: "Which of these files do I prefer?"

QUOTE

You'd probably want 100-300 persons per group, and of course expose them still double-blind, but only to one format.

Lotsa luck with finding 300 qualified listeners.

Seems to me you need to do more study of subjective testing technology to date. The study of hearing developed blind testing for a decade or more before modern testing methodologies were popularized for audio in the mid-1970s. The Journal of the Acoustical Society of America is a good resource, and just to keep you on your toes they have an ABX test which is substantially different from the one we use in audio, and for good reasons.

No, not sighted. Here I was just listing the normal methodological procedures.

QUOTE (Arnold B. Krueger @ Sep 23 2013, 07:29)

If not direct comparisons, then what, indirect comparisons?

No, as described earlier I'm thinking about tests in which the subjects themselves do not compare side-by-side, they only rate, as is commonly the case in drug tests. The point of comparison occurs during analysis of results, not data collection.

QUOTE (Arnold B. Krueger @ Sep 23 2013, 07:29)

QUOTE (UltimateMusicSnob @ Sep 20 2013, 09:13)

There could be a role, for example, in tests of the form usually seen in medicine. Instead of taking one person and exposing them to both stimuli as a single data point, take a lot of persons, put them in different groups, and then collect Likert-scale data on their response to one class of stimuli.

Been there, done that.

Excellent, do you have any citations?

QUOTE (Arnold B. Krueger @ Sep 23 2013, 07:29)

QUOTE (UltimateMusicSnob @ Sep 20 2013, 09:13)

You'd probably want 100-300 persons per group, and of course expose them still double-blind, but only to one format.

Lotsa luck with finding 300 qualified listeners.

Obviously a tough problem, but that's why I mention the possibility of leveraging the Internet. It *would* be expensive, but conceivably a service like Zoomerang could provide me a sample population.

QUOTE (Arnold B. Krueger @ Sep 23 2013, 07:29)

Seems to me you need to do more study of subjective testing technology to date.

Yes.....that's why I posted the thread....

This post has been edited by db1989: Sep 23 2013, 19:30

Reason for edit: replacing full quote with inlined replies with a properly formatted version.

Actually, in clinical trials, direct comparisons are used if at all feasible. The reason is that if you don't do direct comparisons, your sample sizes will need to be enormous. So you could get together huge numbers of listeners and spend months and years doing a test, but it's probably easier to just design a better test that doesn't cost millions of dollars to run.


Yes, the cost/benefit ratio is probably never going to work out. It's the artificiality of the ABX that works against generalizability for the research questions in audio, though. The procedure is rigorous, the data is useful, absolutely. It's just a big departure from how people actually listen.
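To put rough numbers on the sample-size point: with the standard normal-approximation power formula (two-sided alpha = 0.05 and 80% power are conventional choices, and the effect sizes below are illustrative, not measured), the per-group n for a between-groups design blows up quickly as the audible difference shrinks:

```python
import math

def n_per_group(d):
    """Approximate listeners per group for a two-group comparison:
    n = 2 * ((z_alpha + z_beta) / d)^2, with z-values hard-coded for
    two-sided alpha = 0.05 (1.96) and 80% power (0.84).
    d is the standardized effect size (Cohen's d)."""
    return math.ceil(2 * ((1.96 + 0.84) / d) ** 2)

for d in (0.8, 0.5, 0.2, 0.1):
    print(f"d = {d}: ~{n_per_group(d)} listeners per group")
```

A codec difference near transparency is, almost by definition, a small effect, which is part of why a side-by-side within-subject design is so much cheaper.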

I'm not sure what you mean by this. True, to get the most sensitivity to small differences one listens to short segments with rapidly switching between them, but there is absolutely no reason that one could not do ABX testing by listening to the entire piece from beginning to end each time. It's all a matter of what you want to accomplish and how much time you are willing to put into it.


The main thing is repetition. If the segments are short, that's also unnatural in terms of the length of a listening segment. True, one could listen to entire pieces ("Let's A/B Tosca! --see you in a month!") But ears acclimate to the current sound environment. ABX depends on listening memory--unlikely to be effective for listening sessions which resembled "normal" listening. I tend to listen to an album all the way through, so my typical realistic session would be in the neighborhood of 30-60 minutes. I could provide Likert responses with some confidence, but compared to an album I heard an hour ago? It doesn't seem feasible. Not because it takes too long, but because aural memory will not function effectively across such long spans. I could be wrong, I'd be interested in published data if anyone has done it.


The main thing is repetition.

You don't have to actually repeat the test. You could do all sorts of methodologies where one does an ABX comparison a single time and then uses multiple samples. I suspect you'll find that it's just a more complex way to arrive at the same answer, though.

QUOTE (UltimateMusicSnob @ Sep 23 2013, 15:04)

I tend to listen to an album all the way through, so my typical realistic session would be in the neighborhood of 30-60 minutes. I could provide Likert responses with some confidence, but compared to an album I heard an hour ago? It doesn't seem feasible. Not because it takes too long, but because aural memory will not function effectively across such long spans. I could be wrong, I'd be interested in published data if anyone has done it.

If all you care about is how things sound compared to your long-term memory, then accuracy is probably not too important. Even relatively large differences will not be apparent over such time periods. Or to put this another way, differences that matter over such long time periods are generally so obvious when A/B'ed that ABX is unnecessary.

One of the excuses that is given when someone is unable to back up a claim of audibility using ABX testing, is that the usual protocol of switching back and forth rapidly between two versions makes it more difficult rather than less difficult to tell the difference, because it is so different than how one usually listens to music. They talk of things like "fatigue factor".

The counter argument is that if they are able to hear the difference only when listening to much longer segments separated by much longer times, there is no reason ABX cannot be performed in that way. Of course, those people never take up this challenge, or if they do then they are unwilling to report the results.


One of the benefits of Likert scale data is that the researcher (if they obtained significant results) would have more than just "could they tell the difference" to use. "How *much* better is A than B?", for example, requires at least ordinal data. Detection is just the first step. Some of the protocols cited above get into this area, which strikes me as useful.
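Since Likert responses are ordinal rather than interval, the textbook tool for that between-groups comparison is a rank-based test. Here's a bare-bones Mann-Whitney sketch in Python (normal approximation, no tie correction; the sample ratings are invented, and for real Likert data the tie correction would matter):

```python
import math

def mann_whitney_u(a, b):
    """Mann-Whitney U test via the large-sample normal approximation."""
    combined = sorted(a + b)
    ranks = {}                                 # value -> average rank (handles ties)
    i = 0
    while i < len(combined):
        j = i
        while j < len(combined) and combined[j] == combined[i]:
            j += 1
        ranks[combined[i]] = (i + 1 + j) / 2   # mean of ranks i+1..j
        i = j
    n1, n2 = len(a), len(b)
    u = sum(ranks[x] for x in a) - n1 * (n1 + 1) / 2
    mu = n1 * n2 / 2
    sigma = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (u - mu) / sigma
    p = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))  # two-sided
    return u, p

# Invented 1-5 ratings for two treatment groups:
group_a = [4, 5, 4, 3, 5, 4, 4, 5]
group_b = [3, 3, 4, 2, 3, 4, 3, 2]
u, p = mann_whitney_u(group_a, group_b)
print(f"U = {u}, p = {p:.4f}")
```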


Perhaps one of the reasons ABX and MUSHRA rule is that more elaborate data would be useful in areas where preferences for different kinds of imperfections matter. With digital encoding, it is pretty trivial to get (audibly) perfect representation of the source, using lossless if necessary at only a minor cost in file size. So it is easy to get to a point where preferences could not be relevant. Same with electronics, as I understand.

So the large scale clinical trials would only be of interest, I think, to makers of loudspeakers and devisers of multichannel systems.


There could be a role, for example, in tests of the form usually seen in medicine. Instead of taking one person and exposing them to both stimuli as a single data point, take a lot of persons, put them in different groups, and then collect Likert-scale data on their response to one class of stimuli.

One that makes sense to me is a variation used in the food industry: the two-alternative forced choice (2AFC) test. Also, generally you have a reliably perceived difference if the testee scores 75% correct choices, regardless of the number of trials.

You present two choices, A and B. The testee must choose one. The parameter in the food industry is something like "choose the sweeter of the pair". In audio you could ask a person to choose the sample with the most bass, or simply the version they prefer, or the one that sounds most real.

I think it nicer than ABX as it is closer to how people listen for differences when not doing blind tests. They listen to a couple of things and pick the one they prefer. Also you are not straining to hear if something is different or if it matches some reference. You know for certain the two tracks presented are in fact different. You just pick the one you prefer, or the one with whatever quality is being tested for. So you hear two versions, know they are different, pick a preference. Of course, which version is presented first varies randomly. If you prefer the same version 75% or more of the time, then it is audible.
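One caveat on the 75% rule of thumb: whether 75% correct is actually reliable does depend on the number of trials. A quick binomial check (Python sketch; the trial counts are arbitrary examples):

```python
from math import ceil, comb

def p_at_least(k, n):
    """P(X >= k) for X ~ Binomial(n, 1/2): the chance of k or more
    consistent picks if the listener has no real preference at all."""
    return sum(comb(n, i) for i in range(k, n + 1)) / 2 ** n

# 75% correct at different trial counts -- the same percentage can be
# chance-level noise or strong evidence:
for n in (8, 16, 40):
    k = ceil(0.75 * n)
    print(f"{k}/{n}: p = {p_at_least(k, n):.4f}")
```

At 8 trials, 6/8 (75%) gives p ≈ 0.14 under pure guessing; the same 75% at 40 trials is far below the usual 0.05 cutoff.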


I think it nicer than ABX

The two are not really comparable because they test two different things. ABX is a test of transparency, hence you must have the reference. What you're describing above is a test of preference, hence no reference is necessary.

Basically, you're asking two different questions, which will not in general necessarily give you the same answer. Neither is nicer, it just depends what you want to know.

So let people acclimate to the current sound environment before you start the test!

QUOTE

ABX depends on listening memory

Good case of demonizing ABX for a property of all comparative listening evaluations.

Here's your challenge - how do you do comparative listening without depending on listening memory?

QUOTE

-unlikely to be effective for listening sessions which resembled "normal" listening.

Here's a news flash - normal listening is horrifically unreliable once you figure out how to determine how reliable it actually is.

We live in a world where self-deceit is very common. People do sighted listening evaluations and they think they hear all sorts of things. But there is no way to know how reliable sighted evaluations all by themselves really were since sighted listening involves other senses than listening which easily substitute their influence for just listening.

QUOTE

I tend to listen to an album all the way through, so my typical realistic session would be in the neighborhood of 30-60 minutes. I could provide Likert responses with some confidence,

You may have confidence but silly boy that I am, I notice that you know which alternative you are listening to by other means than listening, and I rightfully cry foul!

The history of ABX is that first we started doing blind tests, and we quickly encountered the problems with memory for small differences. We then devised ABX to maximize the sensitivity of our blind testing. The reason why we never encountered these problems before is that our listening tests weren't just listening tests, they were also tests that involved knowing what we listened to by other means than listening. Guess what, tests are harder if the right answers aren't posted on the blackboard during the test!

QUOTE

but compared to an album I heard an hour ago? It doesn't seem feasible. Not because it takes too long, but because aural memory will not function effectively across such long spans.

There you go! The most sensitive form of aural memory is all over with in about 2 seconds. There is actually a cascade of different kinds of aural memory, but they last for different amounts of time. Generally they become less sensitive to small differences the longer the amount of time involved.

QUOTE

I could be wrong, I'd be interested in published data if anyone has done it.

The best book I've found that describes how we remember sounds is "This Is Your Brain on Music" by Levitin. It is full of citations of proper scientific research. It is readily available for about $15.

One that makes sense to me is a variation used in the food industry: the two-alternative forced choice (2AFC) test. Also, generally you have a reliably perceived difference if the testee scores 75% correct choices, regardless of the number of trials.


Proving once again that tests are a lot more fun if they aren't real tests. One of the characteristics of a real test is that it must provide a means for people to fail. Sorry about that!

You seem to be sort of dancing around tests that are more like MUSHRA or ABC/hr. They aren't preference tests, but they are more like preference tests than ABX.

It's really about the right tool for the job. ABX seems to still be king if you want to know if there is an audible difference, but it is horrible for preference testing.