I'm not arguing against blind testing, as I believe the only way to reliably discern differences is to listen double blind. However, we are having a debate over a number of audio myths, and blind testing eventually cropped up.

In my local forum the following was referenced:

Quote:

Many respected academic researchers also question the validity of blind A/B testing. Michael Gerzon stated in his paper “Limitations of Double-Blind A/B Listening Tests,” presented at the 91st AES convention, “It would be a disaster if we had protocols that didn’t reveal subjective differences that the average consumer would notice in five years’ time. I want to indicate possible areas in which normal double-blind A/B protocols may not be adequate to reveal faults that may be audible to even unsophisticated end listeners. I’m going to do this with possible models of how we hear.” Gerzon encouraged other researchers to look beyond double-blind testing and “to develop experimental methodology matched to the phenomenon which is being tested,” and to “not believe that one simple protocol—double-blind or A/B or ABX—is the answer to all kinds of measurement problems.”

Similarly, AES Fellow J. Robert Stuart stated in his landmark papers “Estimating the Significance of Errors in Audio Systems” and “Predicting the Audibility, Detectability and Loudness of Errors in Audio Systems” that “A/B testing using program material and particularly musical program is fraught with difficulties ... the author sets out some reasons why the ‘objective’ approaches of A/B listening and null-testing may be flawed.”

Peter McGrath (one of the world's greatest recording engineers of classical music) [spoke] about the sound quality differences between high-res and 16-bit/44.1kHz. He did the same experiment as you, listening to high-res directly from the computer and then listening to the same file downconverted to 16-bit/44.1kHz. He described the sound-quality difference as "like throwing a light switch."

It's not as though the superiority of high-res digital (compared with 16-bit/44.1kHz) flies in the face of theory. Bob Stuart has shown in a series of AES papers why 16-bit/44.1kHz is insufficient to encode all the information humans can hear. The positions argued in the papers are not idle speculation; Stuart has an extensive scientific background in psychoacoustics, and cites previously published literature regarding human auditory capability.

I'm not arguing against blind testing, as I believe the only way to reliably discern differences is to listen double blind. However, we are having a debate over a number of audio myths, and blind testing eventually cropped up.

In my local forum the following was referenced:

Quote:

Many respected academic researchers also question the validity of blind A/B testing. Michael Gerzon stated in his paper “Limitations of Double-Blind A/B Listening Tests,” presented at the 91st AES convention, “It would be a disaster if we had protocols that didn’t reveal subjective differences that the average consumer would notice in five years’ time.

Harley has a long track record of making hypercritical and questionable statements about DBTs. I could be a cynic and point out that Harley's quoting of Michael Gerzon might be considered inappropriate, since Gerzon regrettably passed away in 1996 and is unavailable to comment on whether or not Harley is making appropriate use of his work.

BTW there is no evidence that our current DBT protocols fail to reveal subjective differences that the average consumer would notice in five years’ time. Remember, it is possible to perform an ABX or other DBT test that is based on 5 years of listening, so asking for verification of this claim is not mission impossible.

Quote:

I want to indicate possible areas in which normal double-blind A/B protocols may not be adequate to reveal faults that may be audible to even unsophisticated end listeners. I’m going to do this with possible models of how we hear.” Gerzon encouraged other researchers to look beyond double-blind testing and “to develop experimental methodology matched to the phenomenon which is being tested,” and to “not believe that one simple protocol—double-blind or A/B or ABX—is the answer to all kinds of measurement problems.”

Similarly, AES Fellow J. Robert Stuart stated in his landmark papers “Estimating the Significance of Errors in Audio Systems” and “Predicting the Audibility, Detectability and Loudness of Errors in Audio Systems” that “A/B testing using program material and particularly musical program is fraught with difficulties ... the author sets out some reasons why the ‘objective’ approaches of A/B listening and null-testing may be flawed.”

Peter McGrath (one of the world's greatest recording engineers of classical music) [spoke] about the sound quality differences between high-res and 16-bit/44.1kHz. He did the same experiment as you, listening to high-res directly from the computer and then listening to the same file downconverted to 16-bit/44.1kHz. He described the sound-quality difference as "like throwing a light switch."

I exercised due diligence trying to find this document attributed to Peter McGrath, and failed. Quoting presumed authorities and then leaving people mystified and frustrated because of incomplete scholarship and a lack of proper footnoting does not enhance a work's credibility. A quote without a proper reference is just name dropping, no?

Quote:

It's not as though the superiority of high-res digital (compared with 16-bit/44.1kHz) flies in the face of theory. Bob Stuart has shown in a series of AES papers why 16-bit/44.1kHz is insufficient to encode all the information humans can hear.

That appears to be based on a lot of incompletely documented and otherwise questionable assertions.

Quote:

The positions argued in the papers are not idle speculation; Stuart has an extensive scientific background in psychoacoustics, and cites previously published literature regarding human auditory capability.

What do you guys make of these references?

The biggest and most open DBT of high sample rates that could be imagined was inadvertently contrived by the music industry. They apparently falsely and misleadingly sold millions of discs, involving thousands of different musical works, as high resolution recordings when in fact they were fatally flawed by being derived from low resolution masters. It is a scientific fact that you can't accurately put resolution and bandwidth back into a recording once it has been stripped out. Years went by and no golden-eared reviewer ever pointed this out. Just shows that they couldn't hear any difference, right?
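One practical consequence of that fact: a disc genuinely derived from a high-rate master should contain spectral energy above the 22.05 kHz Nyquist limit of CD, while one upconverted from a CD-resolution master will not, and that difference is measurable. A rough sketch of the idea using synthetic numpy signals (the signals here are made up for illustration; real analysis would use actual track data and proper windowing):

```python
import numpy as np

def ultrasonic_fraction(signal: np.ndarray, fs: float, cutoff: float = 22050.0) -> float:
    """Fraction of total spectral energy above `cutoff` Hz."""
    spectrum = np.abs(np.fft.rfft(signal)) ** 2
    freqs = np.fft.rfftfreq(len(signal), d=1.0 / fs)
    return spectrum[freqs > cutoff].sum() / spectrum.sum()

fs = 96000
rng = np.random.default_rng(0)
genuine = rng.standard_normal(fs)  # one second of full-bandwidth "96 kHz" noise

# Simulate a disc derived from a 44.1 kHz master: everything above 22.05 kHz is gone.
spec = np.fft.rfft(genuine)
spec[np.fft.rfftfreq(fs, 1.0 / fs) > 22050] = 0
fake = np.fft.irfft(spec, n=fs)

print(ultrasonic_fraction(genuine, fs))  # roughly 0.54 for white noise
print(ultrasonic_fraction(fake, fs))     # ~0.0: the stripped bandwidth never comes back
```

This is how the mislabeled discs were eventually caught: by spectrum analysis, not by ear.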

Ever get a pair of glasses? The process is about all A-B; if it works for that, it works for sound. It's what sounds best to you and what looks the best. There are just too many respected academic researchers, too damn many experts, with an opinion, and like rear ends, everyone has one.

Not the tests we discuss here. We don't get involved in preference at all. We get involved only in audible differences. The testers in our tests only need to identify which is A and which is B. They don't need to state a preference. While everyone has a rear end, not everyone has experience with bias-controlled audio tests. How do you explain that those who criticize blind tests don't engage in them, and those who do always come up with the same or similar results? I'm still waiting for the first tester in a bias-controlled test to come up with different results than the rest of us. The different results always come from those with no bias-controlled testing experience. Figure that one out.
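For what it's worth, the statistics behind "just identify which is A and which is B" are simple: under the null hypothesis of no audible difference, every trial is a coin flip, so a listener's score can be checked against the binomial distribution. A minimal sketch (the 13-of-16 score is an arbitrary example):

```python
from math import comb

def abx_p_value(correct: int, trials: int) -> float:
    """One-sided binomial p-value: probability of scoring at least
    `correct` out of `trials` by guessing (p = 0.5 per trial)."""
    return sum(comb(trials, k) for k in range(correct, trials + 1)) / 2 ** trials

# Example: 13 of 16 correct identifications.
print(f"p = {abx_p_value(13, 16):.4f}")  # p = 0.0106: unlikely to be guessing
```

At the conventional 0.05 threshold, 12 of 16 is the minimum passing score; anything near 8 of 16 is indistinguishable from guessing.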

Opponents of DBT testing will probably claim that the methodology is designed to result in negative outcomes. I think Robert Harley believes that DBTs are rigged and are therefore destined to produce false results.

Ever get a pair of glasses? The process is about all A-B; if it works for that, it works for sound. It's what sounds best to you and what looks the best.

Not exactly. Eye tests for fitting glasses involve very specific tests that guide the patient to the glasses that do the best job of correcting his vision.

They don't put up a page from a newspaper and ask you which lens you like the best. That would be analogous to the typical audio listening evaluation where we play a musical passage and ask you what you like the best.

Instead they vary the size and content of the test targets, and also perform optical manipulations to the images, to get the best possible correction.

Opponents of DBT testing will probably claim that the methodology is designed to result in negative outcomes. I think Robert Harley believes that DBTs are rigged and are therefore destined to produce false results.

Unfortunately, he has no credibility with me. Nevertheless he is wrong. He does know the truth, however, because he has been involved in bias controlled tests. He eschews that experience because it conflicts with the way he makes his living. It is one thing to never experience it and another thing to have learned the truth and then fight it because of his job.

I went looking for that Gerzon article on the AES site and it's not listed there. But there's a very entertaining description of that AES convention in this issue of The Audio Critic. Here is a copy of Aczel's comments in that article about Gerzon's paper.

Quote:

Originally Posted by The Audio Critic
Following this donnybrook there was another episode of poignant human comedy, but of a different, more subtle kind. Michael A. Gerzon, the Oxford mathematician responsible for Ambisonics, delivered a paper titled "Limitations of Double-Blind AB Listening Tests." It was a highly technical and thought-provoking argument in favor of certain procedural changes in conventional double-blind testing, but John Atkinson, the militantly subjectivist editor of Stereophile (and Bob Harley's puppeteer), obviously thought it was going to be a delicious put-down of the ABX faction by—at last!—a highly accredited academician. There was no preprint to be looked at, so John started to scribble notes furiously as soon as Michael Gerzon opened his mouth. About five minutes into the presentation, the latter remarked that of course double-blind conditions are a must for any kind of validity in a listening test. John abruptly stopped taking notes as if someone had pulled his plug. It was a moment to be savored.

According to Aczel, then, there was no preprint, which is why one can't obtain it from the AES site.

BTW, the entire "Invasion of the Credibility Snatchers" article is a really fun read.

There you go. In the meantime, link to this thread here. You'll notice I asked in your local forum for the source of those quotes. Now that we know it's Robert Harley, I can see why AlleyCat didn't list it originally.

I'll repeat here what I said there: Blind test deniers criticize blind testing because deep down they know they'll fail. So it's easier to refute the method than to admit they've been full of it all these years.

--Ethan

The sad thing is that blind testing is known to be essential for all sorts of human testing, so they are really taking a ridiculous position that is known to be wrong. I like the wine example:

It is only in blind testing that real differences are found in taste for wine, and only in blind testing that real differences are found for sound in audio. Otherwise, the other aspects of human perception will affect the experience. This is a known fact, not something that is speculative.

They might as well say that the world is flat. It would be no more silly than what they claim.

God willing, we will prevail in peace and freedom from fear and in true health through the purity and essence of our natural fluids. God bless you all.

But blind tests reach absurd conclusions .... therefore, the methodology is absurd! I think that is the lesson that Robert tried to convey in his prejudiced report against blind testing. A faulty premise leading to a false conclusion = popcorn.

The sad thing is that blind testing is known to be essential for all sorts of human testing, so they are really taking a ridiculous position that is known to be wrong. I like the wine example:

I would kill for the same tools that the medical industry has to validate the output of double blind tests. They are able, for the most part, to run diagnostic tests to judge the efficacy of the medication. They don't just rely on the patient's own opinion as to whether or not there has been improvement and how much. In audio, all we have is what the person says they heard, not what they actually heard!

Here is a personal example. I go to the audiologist to have my hearing checked. I sit in a booth with headphones. The operator is in another room and plays tones at different frequencies: first loud, then less loud, then even less, and so on. Detecting the loud ones was no big deal, but all of a sudden there is a long gap/silence. You say to yourself, "I bet they played a tone there," and go on to say you heard the thing even though you did not! More than two decades of taking tests makes you want to be "right," even though in this case it is totally counterproductive. I expected at the end of the test for them to say I was hearing things when they weren't even playing anything, but I guess they were too polite to say anything. If there were a diagnostic test of whether I heard something or not, then it could not be distorted by the way I voted.
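That "I bet they played a tone there" reflex is exactly what signal detection theory is built to handle: mixing in silent catch trials lets the tester separate real sensitivity from a bias toward saying "yes." A toy sketch of the standard d′ (sensitivity) calculation, with made-up hit and false-alarm rates:

```python
from statistics import NormalDist

def d_prime(hit_rate: float, false_alarm_rate: float) -> float:
    """Sensitivity index: z(hit rate) - z(false-alarm rate).
    A 'yes'-happy listener inflates both rates and gains little d'."""
    z = NormalDist().inv_cdf  # inverse standard-normal CDF
    return z(hit_rate) - z(false_alarm_rate)

# A listener who says "yes" to 90% of real tones but also to 60%
# of silent catch trials is mostly biased, not sensitive:
print(round(d_prime(0.90, 0.60), 2))  # 1.03
# versus one with the same hit rate but few false alarms:
print(round(d_prime(0.90, 0.10), 2))  # 2.56
```

Audiometers do include such catch trials for this reason; the same logic applies to any "did you hear a difference?" protocol.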

I have been in numerous double blind tests and created many formal and informal ones. The formal ones were run by outside agencies following strict ITU BS.1116 guidelines. Yet they arrived at completely wrong conclusions! The reason? They picked the wrong music material with respect to the technical differences that existed. In doing so, they arrived at 90% of the people not being able to tell the difference. This is yet another problem. We just ignore the other 10%, but likely those were the people who actually detected the difference correctly. But because they were in the minority, folks just throw away their votes.

With respect to building technology for everyone, errors in double blind tests of this sort are not material. We care about the 90%, not the 10%. With respect to enthusiasts, however, that kind of logic is not as relevant, since they want the answer for the 10%.

Of course blind testing is an essential component of audio research. It is just that you need to know a ton to understand the limitations of the test and what it really tested, as opposed to what they thought they tested. Earlier in this thread Arny referred to the Meyer and Moran study, where they said they were testing high-resolution audio against CD resolution. Little did they know that some of the high-res SACDs were actually based on CD masters, so there was no difference in the two versions. As an interesting aside, it was the detractors of blind testing who did the research to find this out. Folks who blindly follow the results of any double blind test and celebrate the negative outcome are not critical at all. They only investigate the ones that disagree with their views. That is a shame, because people on both sides of this argument make mistakes, and at times very serious ones, in how they evaluate audio technology.

I would kill for the same tools that the medical industry has to validate the output of double blind tests. They are able, for the most part, to run diagnostic tests to judge the efficacy of the medication. They don't just rely on the patient's own opinion as to whether or not there has been improvement and how much. In audio, all we have is what the person says they heard, not what they actually heard!

Good point.

In the medical and pharmaceutical world, before anyone can publish clinical trial results, they have to clear a much higher bar than in the audio industry. They have to positively document efficacy, not just rely on hearsay. For example, cancer drugs have to shrink tumors and provide statistically significant longer patient survival times. The Food and Drug Administration acts as another level of referee; they often decide just how high or low the efficacy bar should be set. In addition, they have legal regulatory authority over all clinical trials of experimental drugs. It's very expensive, and more often than not, a trial has negative results.

Nothing like that exists in audio. I can easily imagine people in the clinical trial business wishing they had it as easy as people in audio.

Yeh but...but...BUT... they don't have to come on some forum to defend their drug testing!!!

I have been in numerous double blind tests and created many formal and informal ones. The formal ones were run by outside agencies following strict ITU BS.1116 guidelines. Yet they arrived at completely wrong conclusions! The reason? They picked the wrong music material with respect to the technical differences that existed. In doing so, they arrived at 90% of the people not being able to tell the difference. This is yet another problem. We just ignore the other 10%, but likely those were the people who actually detected the difference correctly. But because they were in the minority, folks just throw away their votes.

The comment about program material is a far-reaching issue. Not unexpectedly, in this post we see the same old anti-DBT habit of shooting the messenger. The messenger is DBT, and yet again it is being demonized.

The first thing that DBT brought to the audio evaluation table is a high possibility of a negative result if you don't do things right. With old-school sighted evaluations there is very high possibility of a positive result whether the listener is reliably hearing a difference or not. Sighted evaluations let listeners wish themselves into the result that they are comfortable with. DBTs do not afford the listener that luxury. The listener must perform or fail. This is of course uncomfortable for the listener.

One of the common ways to create a failing DBT is to use the wrong musical or dramatic recording. Until DBT there was very little guidance about this, since failure was all but impossible. With DBT the guidance is very clear, as Amir admits above, even though the post does not properly identify the cause of the problem and instead demonizes DBT.

The cause of the problem is common to science, not just DBT. When an audio system has a failure to perform well, that failure may not be triggered by every stimulus. In testing in general, it is well known that the stimulus needs to be varied in order to get various technical problems to rear their ugly heads. For example, if you want to see how a car handles, you have to do more than just drive it down the middle of a straight, smooth road. The eye chart has to have small letters that are hard to see, and the examiner has to know whether the patient is reading them accurately. Yet this is what audiophiles did before DBT: they used the best-sounding recordings, which may well have been the ones that masked equipment faults the most.

Scientifically speaking, one of the hallmarks of a true test is that it can fail. If you don't do things right, you get a null result. Those of us who have general training in the sciences know this. For example, in chemistry, if you don't do the chemical test right, you get mud in your test tube instead of the desired color of precipitate. In thermodynamics, if you don't go through the right steps with the right equipment, you get unreliable or even nonsense results.

Why should audio testing be any different? Yet the anti-DBT clique keeps blaming DBT for behaving like a good scientific test and offering the ready alternative of failure.

Quote:

In doing so, they arrived at 90% of the people not being able to tell the difference. This is yet another problem. We just ignore the other 10%, but likely those were the people who actually detected the difference correctly. But because they were in the minority, folks just throw away their votes.

When a test blows up in one's face like this, there are choices. In most areas of science you correct your test procedure and run the test again. Of course you can do the same thing in audio, except this is not being presented as a viable option. This looks like more anti-DBT talk to me. The fact is that in audio tests the same procedure should be followed as would be followed in any other kind of test. Instead of engaging in wishful thinking about the test that got away, and trying to extract a positive result out of a bad test by committing statistical fraud, the stimulus needs to be adjusted, possibly using wisdom that evolved from doing the failing test. I've personally done this many times, and the annals of MPEG tests as published by JJ and other experimenters show examples of doing the right thing. No pain, no gain!
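The subgroup point can be made concrete with a tiny simulation: even when every listener is purely guessing, the top 10% of a panel will post impressive-looking scores if you select them after the fact. That is why the honest remedy is to fix the stimulus and re-run, rather than promote the lucky minority. A sketch with an arbitrary panel size and trial count:

```python
import random

random.seed(1)
PANEL, TRIALS = 100, 16

# Null hypothesis: every listener guesses, each trial a fair coin flip.
scores = [sum(random.random() < 0.5 for _ in range(TRIALS)) for _ in range(PANEL)]

# Post-hoc selection of the best performers:
top_decile = sorted(scores, reverse=True)[:PANEL // 10]
print("panel mean:", sum(scores) / PANEL)                  # near 8 of 16, i.e. chance
print("top 10% mean:", sum(top_decile) / len(top_decile))  # around 11-12, looks impressive
```

The top decile scores well by construction, not because it heard anything; only a fresh test of those same listeners can tell the two cases apart.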

Actually things like this do happen all the time in audio. It doesn't seem to happen in the fairyland of high end audio publications but it does happen in other areas of audio such as speaker testing and lossy encoders. It doesn't happen with cables, amplifiers and DACs because the results are predictable - a null result is to be expected.

Great post. As you say, when a test fails, the proper response is to devise a better test, not give up on real testing.

In audio, all we have is what the person says they heard, not what they actually heard!

Yes, exactly. I also agree that some types of fidelity loss need training to notice. But that's different from blind testing. If you haven't learned what to listen for, you won't spot it sighted either.

Great post. As you say, when a test fails, the proper response is to devise a better test, not give up on real testing.

Give up? Why give up? The independent test was commissioned by our marketing department over my concern that the technology was not strong enough to show 64 kbps to be "CD quality." To my surprise and the marketing department's delight, 90% or so of the listeners thought we did achieve that target! Why? Simple: the people creating the test did not understand audio compression technology. They picked classical music because that is what "audiophiles" listen to and what is always used in magazine reviews of audio gear. It turns out classical music is some of the easiest content to compress. The most difficult is content with sharp transients. The reason is that the act of audio compression causes the quantization noise (distortion) to spread over a block of audio samples. In the case of a transient somewhere in the middle of that frame, the distortion spreads to before the transient. Since the noise is correlated with the music itself, it will actually smear that part of the music backwards. Hence the term "pre-echo" used to describe this kind of artifact: you get an echo of the transient before it occurs.

Since a transient by definition is a sudden jump in level, what comes before it is comparatively very quiet, which allows one to hear the pre-echo. The result is a transient that gets distorted in a pretty harsh way, and it is usually quite audible, although not to everyone.
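The noise-spreading mechanism described above can be shown with a toy transform coder: coarsely quantize the block-transform coefficients of a frame containing a late transient, invert, and the quantization error lands across the whole block, including the silence before the transient. This sketch uses a plain FFT as the block transform purely for illustration; real codecs use an MDCT with overlapping windows, so it only models the spreading effect:

```python
import numpy as np

BLOCK = 1024
frame = np.zeros(BLOCK)
frame[800:810] = 1.0  # a sharp transient late in the frame

# Coarsely quantize the block's frequency coefficients.
coeffs = np.fft.rfft(frame)
step = 2.0
quantized = step * np.round(coeffs / step)
decoded = np.fft.irfft(quantized, n=BLOCK)

# Error in the silent region BEFORE the transient:
pre_error = np.abs(decoded[:800] - frame[:800]).max()
print(f"max error before transient: {pre_error:.3f}")  # nonzero: the 'pre-echo'
```

With sustained content (closer to classical music), the same error would be buried under the signal; before a transient there is nothing to mask it, which is why transient-rich "codec killer" tracks are chosen for tests.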

In contrast, classical music is continuous (for the most part anyway). Pre-echo that does occur there competes with existing sound from other instruments and hence gets masked. If you look at the standard set of "codec killer" tracks used by MPEG to develop audio compression formats such as AAC and MP3, you don't find Mozart. The people who selected those tracks knew what was difficult based on detailed knowledge of audio compression. Testing codecs with easy content would be putting one's head in the sand: you will not hit on difficult content, but your customer will. Simple guitar strings, vocals, even people clapping fall into the category of difficult material. If you are in the US, listen to an XM radio talk show; no doubt you will hear severe compression artifacts. It is for this reason that cell phones use specialized voice coders that work on a completely different principle (although the wideband versions of them these days make this much less of a problem). I am actually a co-inventor of a patented technology that combines these two approaches. This issued patent explains it in more detail: http://www.google.nl/patents/US6658383

So no, nothing "blew up" in our face. The wrong conclusion was reached from this double-blind test run by an independent agency, and that conclusion happened to be the one our marketing folks desired. I don't know why Arny keeps forgetting this story, as I have told him and reminded him multiple times. Here is an example: http://www.avsforum.com/t/1388821/question-about-dacs/420#post_21625657. If my memory serves me right, this study cost nearly $25,000 to conduct! Of course, our marketing department thought it was worth every penny.

Yes, exactly. I also agree that some types of fidelity loss need training to notice. But that's a separate issue from blind testing: if you haven't learned what to listen for, you won't spot it sighted either.

--Ethan

For me, this muddies the results of blind A/B testing, because we would have to know how trained the listeners are. In one test you could have a room full of untrained listeners, which would skew the results toward there being no differences (or smaller ones) between A and B.

And for all practical purposes, I know of no way to evaluate or test listeners to determine how acute their listening discernment really is. Even if such a test exists, published blind A/B tests don't share this kind of information about their listeners.

So it's possible that a majority of blind A/B results show that people can't tell a difference not because there isn't one, but because they are not trained to notice the differences.

3.1 Expert listeners
It is important that data from listening tests assessing small impairments in audio systems should come exclusively from subjects who have expertise in detecting these small impairments. The higher the quality reached by the systems to be tested, the more important it is to have expert listeners. etc.

but it also stipulates pre- and post-screening, as well as guidelines for panel size. And, to circle back to an earlier point, it also stipulates the use of suitable "critical material" for the test(s). AFAICT, and for better or worse, few of the tests I've seen mentioned rise to the level of rigor spelled out in the ITU guidance.

Here is a personal example. I go to an audiologist to have my hearing checked: I sit in a booth with headphones while the operator, in another room, plays tones at different frequencies. First loud, then less loud, then even less, and so on. Detecting the loud ones was no big deal, but all of a sudden there is a long gap of silence. You say to yourself, "I bet they played a tone there," and go on to say you heard the thing even though you did not! More than two decades of taking tests makes you want to be "right," even though in this case it is totally counterproductive. I expected them to say at the end of the test that I was hearing things when they weren't playing anything, but I guess they were too polite to say so. If there was a diagnostic test of whether I heard something or not, then it could not be distorted by the way I voted.

It is possible that, not being aware of the details of the test, the listener might not realize how the test avoids being distorted by the way the listener behaves. Furthermore, not having been present for the test or having had the opportunity to examine the test procedure, I don't know exactly how it was performed. However, I can describe how the vast majority of tests like this are done.

Finding an online document that contains these details was not easy, but I found what looks like a good summary of common audiology testing procedures, and from what is said above I may have successfully deduced what was actually being done:

A relatively efficient method of estimating the 50% level is the simple up-down or staircase method. It is similar to the method of limits in that the stimulus level is decreased after a positive response (or increased after a negative response), but unlike the method of limits the test is not terminated after the first reversal. A recommended procedure is to continue testing until at least six or eight reversals are obtained (Wetherill and Levitt, 1965). A typical data record is shown in Fig. 4.

The description is sketchy, but I have been familiar with this kind of test for decades, so I can fill in some of the missing details.

The bottom line is that pressing the switch does not by itself validate the listener's response. There has to be a tone in progress for a switch press to be valid, and tones are played at varying times during the testing period, so there are plenty of moments when no sound is playing; a switch press then is instantly invalidated. Secondly, every switch press is recorded, and the overall sequence has to fit into a logical pattern involving "at least six or eight reversals" to be validated.

This method of testing is considered adaptive in that the listener is generally the only active party during each test session. The test thus runs under the listener's control, and it is the rules and mechanics of the test that make it blind. The test can be fully automated. Notice that at no point does the listener know the actual dB level of the tones he is hearing; he is therefore blinded to the outcome of the test.
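As a concrete sketch, here is a minimal simulation of the simple up-down procedure quoted above. The listener here is simulated, and the "true" threshold, step size, and psychometric slope are all assumptions of this toy example, not values from any real audiometric protocol:

```python
import random

# Minimal sketch of the simple up-down ("staircase") procedure: the level
# drops after a "heard" response, rises after a "not heard" response, and
# the run stops after a fixed number of reversals. The listener is SIMULATED.

def staircase(true_threshold_db, start_db=60.0, step_db=5.0,
              reversals_needed=6, seed=1):
    rng = random.Random(seed)
    level = start_db
    reversal_levels = []
    last_direction = None          # -1 = level was lowered, +1 = raised
    while len(reversal_levels) < reversals_needed:
        # Simulated listener: more likely to hear the tone the farther
        # the level sits above the (hidden) threshold.
        p_heard = 1.0 / (1.0 + 10 ** ((true_threshold_db - level) / 4.0))
        heard = rng.random() < p_heard
        direction = -1 if heard else +1    # down after "yes", up after "no"
        if last_direction is not None and direction != last_direction:
            reversal_levels.append(level)  # direction flipped: a reversal
        last_direction = direction
        level += direction * step_db
    # Estimate the 50% ("threshold") level as the mean of the reversal levels
    return sum(reversal_levels) / len(reversal_levels)

estimate = staircase(true_threshold_db=20.0)
print(f"estimated threshold: {estimate:.1f} dB")
```

Note how the listener's responses drive the level, yet the listener never learns what the level is; that asymmetry is what keeps the procedure blind even though it is adaptive.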

During a DBT, do the test subjects focus their hearing on the specific characteristics of the sound that actually give them the best chance of real differences being revealed, or on the characteristics they merely expect to give them that best chance?

Great post. As you say, when a test fails, the proper response is to devise a better test, not give up on real testing.

... Wrong conclusion was reached from this double blind test run by an independent agency which happened to be what our marketing folks desired. ...

That a test is blind does not guarantee that it is a good one; there are, of course, many things necessary for a test to be good. But if a test is not blind, it cannot be a good one, because bias can then contaminate the results.

In other words, being blind is one of the necessary features of a good test, but it is not the only necessary feature of a good test. That one can come up with bad tests that are done blind does nothing to show that there is no need for testing to be blind.

God willing, we will prevail in peace and freedom from fear and in true health through the purity and essence of our natural fluids. God bless you all.