iTunes CVBR 64 is resampled to 32 kHz and low-passed at about 12 kHz; otherwise it sounds pretty "clean". It doesn't really help listeners understand what kinds of artifacts (distortion, noise, pre-echo, etc.) a sample may produce.

If -q 35 is too bad, a higher value can be used.

In addition, it would be better to include only 44.1 kHz samples. Sample-rate switching may produce additional problems with the ABC-HR program, some operating systems, and/or some sound devices.

If we're looking for an anchor that emphasizes the artifacts to be expected, why not use MP3, e.g. LAME CBR at the lowest setting that doesn't downsample to 32 kHz? I think we could actually use the old "version 1.0" Fraunhofer encoder from 1994(?) with an additional 16-kHz lowpass filter applied before encoding (that should avoid the bug).

Edit: The more I think of it, the more I believe we should use two anchors to stabilize the results: one to define the lower end of the grading scale, the other to define a mid-point of the scale. For the lower end, I just imitated the world's first audio encoder: our test set downsampled to 8 kHz using Audition and saved as an 8-bit µ-law stereo WAV file. That's a 128-kbps encoding. Nice demonstration of how far we've come in the last 40 years or so.
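For the curious, the bitrate arithmetic and the companding curve behind that anchor can be sketched as follows (µ = 255 as in ITU-T G.711; this is a plain-formula illustration, not Audition's actual implementation):

```python
import math

MU = 255  # companding constant used by G.711 mu-law

def ulaw_compress(x: float) -> float:
    """Map a linear sample in [-1, 1] to a companded value in [-1, 1]."""
    return math.copysign(math.log1p(MU * abs(x)) / math.log1p(MU), x)

def ulaw_expand(y: float) -> float:
    """Exact inverse of ulaw_compress."""
    return math.copysign(math.expm1(abs(y) * math.log1p(MU)) / MU, y)

# The anchor's bitrate: 8000 samples/s * 8 bits * 2 channels
bitrate = 8000 * 8 * 2
print(bitrate)  # 128000 bit/s

# Companding spends the 8 bits non-uniformly: quiet samples get boosted,
# so quantization noise stays roughly proportional to signal level.
print(round(ulaw_compress(0.01), 3))  # a -40 dB sample maps to ~0.23
```

Storing the companded values with a uniform 8-bit quantizer is all the "encoding" this anchor does; the telephone-like sound then comes from the 8 kHz resampling band-limiting everything to under 4 kHz.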

Thanks a lot, lvqcl! I tried a 112-kbps encode with the - apparently bug-free - l3enc version 2.60 (Linux version). The quality is actually too good for a mid anchor. Unfortunately, 96 kbps doesn't work in the unlicensed version. We are currently investigating LAME at 96 kbps and a 44.1 kHz sampling rate as the anchor.

For the record, the lower anchor will be created and decoded with the following commands. This yields a delay-free anchor.

What do you think about getting GXLame in as a low anchor (or even as a competitor in a non-AAC test)? It's a low-bitrate MP3 encoder, so it just might fit the bill somewhere between V0 and V30 (V20 averages 96 kbps and defaults to 44 kHz).

I don't understand why two low anchors would be needed. Wouldn't it be better to let the "mid" anchor define where the lower end of the scale is? Then there would possibly be a bit wider scale for the contenders. Ideally the low anchor would then get 0-3 and the contenders 2-5. IMHO, it would be enough that there is one low anchor that can be detected more easily than the actual contenders.

Also, I don't understand why some old/mediocre MP3 encoder/setting would make a better low anchor than FAAC. FAAC would nicely represent the basis of the more developed AAC encoders. FAAC can be adjusted freely to provide the desired quality level. "-q 35 -c 18000" worked for me, but perhaps -q 38, -q 40 or so would work as well.

In general, it would be desirable that all encoders, including the low anchor, are easily available so that anyone can reproduce the test scenario (for verifying the authenticity of the results) or test different samples/encoders using/including the tested encoders and settings in order to get comparable personal results. Also the procedure to decode and split the test sample should be reproducible by anyone.

QUOTE

I don't understand why two low anchors would be needed. Wouldn't it be better to let the "mid" anchor define where the lower end of the scale is? Then there would possibly be a bit wider scale for the contenders. Ideally the low anchor would then get 0-3 and the contenders 2-5. IMHO, it would be enough that there is one low anchor that can be detected more easily than the actual contenders.

Use of two anchors follows the MUSHRA methodology and is an attempt at making the grading scale of this test more absolute. After all, all encoders in this test sound quite good compared to old/simple encoding techniques or lower bit rates. As the name implies, the lower anchor shall define the lower end of the scale and should give the listeners an idea of what we mean by "bad quality" (range 0-1). The hope then is that this reduces the confidence intervals (grade variance) for the other coders in the test, including the mid anchor (which should end up somewhere in the middle of the grading scale).

QUOTE

Also, I don't understand why some old/mediocre MP3 encoder/setting would make a better low anchor than FAAC. FAAC would nicely represent the basis of the more developed AAC encoders. [...]

Actually, it seems it doesn't. In my first informal evaluation, I noticed that FAAC is tuned very differently than the other AAC encoders in the test (less pre-echo, more warbling), and it seems LAME@96kb emphasizes the artifacts of the codecs under test (pre-echo, warbling on tonal sounds, etc.) better than FAAC@64. Btw, the bandwidth of LAME@96 is close enough to the codecs under test (around 15 kHz).

QUOTE

In general, it would be desirable that all encoders, including the low anchor, are easily available so that anyone can reproduce the test scenario (for verifying the authenticity of the results) or test different samples/encoders using/including the tested encoders and settings in order to get comparable personal results. Also the procedure to decode and split the test sample should be reproducible by anyone.

Agreed. Igor and I are working on scripts, run by the listeners, which do all the decoding and splitting of the bit streams and creation of the (decoded) anchors. My commands for the lower anchor above are a first attempt at this.
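As a purely hypothetical illustration (not our actual script; file contents and segment length are made up), splitting a decoded WAV into equal-length test items could look like this in Python, using only the standard library:

```python
import io
import struct
import wave

def split_wav(data: bytes, seconds_per_item: float) -> list[bytes]:
    """Split a PCM WAV byte stream into consecutive equal-length item files."""
    src = wave.open(io.BytesIO(data), "rb")
    params = src.getparams()
    frames_per_item = int(params.framerate * seconds_per_item)
    chunks = []
    while True:
        frames = src.readframes(frames_per_item)
        if not frames:
            break
        buf = io.BytesIO()
        dst = wave.open(buf, "wb")
        dst.setparams(params)  # frame count in the header is fixed up on close
        dst.writeframes(frames)
        dst.close()
        chunks.append(buf.getvalue())
    return chunks

# Build a silent 3-second, 44.1 kHz, 16-bit mono file in memory and split it
buf = io.BytesIO()
w = wave.open(buf, "wb")
w.setnchannels(1)
w.setsampwidth(2)
w.setframerate(44100)
w.writeframes(struct.pack("<h", 0) * 44100 * 3)
w.close()

items = split_wav(buf.getvalue(), 1.0)
print(len(items))  # 3
```

The real scripts additionally have to invoke the decoders and create the anchors, but the splitting step is essentially this.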

QUOTE

Use of two anchors follows the MUSHRA methodology and is an attempt at making the grading scale of this test more absolute. After all, all encoders in this test sound quite good compared to old/simple encoding techniques or lower bit rates. As the name implies, the lower anchor shall define the lower end of the scale and should give the listeners an idea of what we mean by "bad quality" (range 0-1). The hope then is that this reduces the confidence intervals (grade variance) for the other coders in the test, including the mid anchor (which should end up somewhere in the middle of the grading scale).

In the past 48 and 64 kbps tests, most samples were difficult for me because the low anchor was too bad and the remaining scale wasn't wide enough for correctly stating the differences between the easier and more difficult samples. That is, the low anchor always sounded like a "telephone" and got a "1". The actual contenders were considerably better, but never close to transparency, so the usable scale for the contenders was mostly from 2.0 to 3.5. Actually, even then the grade "2" was a bit too low for correctly describing the difference between the low anchor and the worst contender. At the other end of the quality scale, the difference between the reference and the best contender was always significant, and anything above 4 would have been too much for the best contenders.

Of course the situation is different in a 128 kbps AAC test, but there is a danger that the two anchors will occupy the grades 1-4 and the actual contenders will get 4-5 and once again be more or less tied even though the testers actually could hear clear differences between the contenders.

QUOTE

Actually, it seems it doesn't. In my first informal evaluation, I noticed that FAAC is tuned very differently than the other AAC encoders in the test (less pre-echo, more warbling), and it seems LAME@96kb emphasizes the artifacts of the codecs under test (pre-echo, warbling on tonal sounds, etc.) better than FAAC@64. Btw, the bandwidth of LAME@96 is close enough to the codecs under test (around 15 kHz).

I see. I didn't actually try to do that kind of complex cross-comparison, so you know more about this than I do. You could have posted the explanation earlier...

QUOTE

Of course the situation is different in a 128 kbps AAC test, but there is a danger that the two anchors will occupy the grades 1-4 and the actual contenders will get 4-5 and once again be more or less tied even though the testers actually could hear clear differences between the contenders.

The method of statistical analysis that we will be using this time will take care of this: http://www.aes.org/e-lib/browse.cfm?elib=15021

Getting two MUSHRA-style anchors (one for worst quality, one for intermediate quality, plus the hidden reference for best quality) into our test allows us to use MUSHRA-style evaluation, as stated in the referenced paper.
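As a rough sketch of the per-condition aggregation such an analysis starts from (a simple normal-approximation 95% confidence interval over made-up grades; the referenced paper's method is more involved):

```python
import math
import statistics

def mean_ci(scores: list[float], z: float = 1.96) -> tuple[float, tuple[float, float]]:
    """Mean and approximate 95% confidence interval (normal approximation)."""
    m = statistics.mean(scores)
    half = z * statistics.stdev(scores) / math.sqrt(len(scores))
    return m, (m - half, m + half)

# Hypothetical grades from five listeners for one item (0..5 scale)
ratings = {
    "hidden_ref": [5.0, 4.8, 5.0, 4.9, 5.0],
    "mid_anchor": [3.1, 2.8, 3.4, 3.0, 2.7],
    "low_anchor": [1.0, 0.8, 1.2, 0.9, 1.1],
}
for condition, scores in ratings.items():
    m, (lo, hi) = mean_ci(scores)
    print(f"{condition}: {m:.2f} [{lo:.2f}, {hi:.2f}]")
```

A lower anchor that everyone grades near the bottom of the scale should shrink these intervals for the contenders, which is exactly the point of including it.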

QUOTE

I see. I didn't actually try to do that kind of complex cross-comparison, so you know more about this than I do. You could have posted the explanation earlier...

QUOTE

What do you think about getting GXLame in as a low anchor (or even as a competitor in a non-AAC test)? It's a low-bitrate MP3 encoder, so it just might fit the bill somewhere between V0 and V30 (V20 averages 96 kbps and defaults to 44 kHz).

When I have time, I'll certainly blind-test GXLame against LAME (because I'm interested in your work). However, assuming GXLame sounds better than LAME at low bit rates, I still tend towards LAME as anchor for this test. Here's why: unlike the codecs under test, anchors are supposed to produce certain artifacts, not avoid them.

OK, I changed my mind and will go along with Alex. The mid anchor will be a "compromised" AAC encoding at 96 kbps VBR; more precisely, one without TNS and short blocks, and with a bandwidth of 15.8 kHz. It will be created with FAAC v1.28 and the following command:

CODE

faac.exe --shortctl 1 -c 15848 -q 50 -w ha_aac_test_sample_2010.wav

Decoder-wise, I'm not sure yet. Either NeroAacDec 1.5.1.0 or FAAD2 v2.7. Can someone point me to an Intel Mac OS X (fat) binary of the latter?

QUOTE

What do you think about getting GXLame in as a low anchor (or even as a competitor in a non-AAC test)? It's a low-bitrate MP3 encoder, so it just might fit the bill somewhere between V0 and V30 (V20 averages 96 kbps and defaults to 44 kHz).

When I have time, I'll certainly blind-test GXLame against LAME (because I'm interested in your work). However, assuming GXLame sounds better than LAME at low bit rates, I still tend towards LAME as anchor for this test. Here's why: unlike the codecs under test, anchors are supposed to produce certain artifacts, not avoid them.

Chris

That's perfectly understandable. With its t4 release, I think it's actually quite competitive--I rushed to finish it in time for this test.

Also, it would be beneficial to create a tutorial covering each single, small step that a proper test must consist of.

Do you mean a tutorial for the listeners on "what the rules are" and how to proceed before and during the test? That sounds good. Will be done.

I finally found some time for this test again. I've managed to write an instruction sheet, nearly independent of the test methodology (ABC/HR or MUSHRA) and the user interface, to guide the test participants through a test session. It's based on my own experience and adapted to this particular test with regard to anchor and hidden-reference selection and grading. I've put a draft under

A description of said "general test terminology", i.e. an explanation of terms such as anchor, item, overall quality, reference, session, stimulus, and transparency, will follow.

Everything related to listener training, i.e. how to use the test software, what kinds of artifacts to expect, and how to spot artifacts, will also be discussed separately. As mentioned, this instruction sheet is the "final one in the chain" and assumes a methodology- and terminology-informed, trained listener.

If you're an experienced listener and feel that your approach to a high-bit-rate blind test is radically different from my recommendation, please let me know about the difference.

QUOTE

If you're an experienced listener and feel that your approach to a high-bit-rate blind test is radically different from my recommendation, please let me know about the difference.

Chris, I'm not an experienced listener at all, and my headphones are really poor. But I would love to know what people think about my way. I actually don't care about ABX probabilities but simply mux the encoded and raw audio into the L and R channels so that I can hear both signals simultaneously.

Also, since there is some activity related to the test, I'm wondering whether someone could reach Opticom, or just has access to OperaDigitalEar, to get Advanced PEAQ scores for the test samples.

QUOTE

I actually don't care about ABX probabilities but simply mux the encoded and raw audio into the L and R channels so that I can hear both signals simultaneously.

My initial guess is that this is dangerous! You will probably hear artifacts which are inaudible if you just listen to the original and the coded version one after the other, and you might not hear certain artifacts which are clearly audible if you listen to both channels of the codec signal. Example: if the original and coded versions are slightly delayed relative to each other, you'll hear this with your approach, because human hearing is very sensitive to inter-aural delay. However, if both coded channels are delayed by the same amount compared to the original two channels, this might be inaudible when you listen to both coded channels (which you should). I've never ABXed this way.
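To put rough numbers on that sensitivity: detection thresholds for interaural time differences are on the order of 10 µs for trained listeners (a figure from the psychoacoustics literature, not from this thread), so even a one-sample channel offset at 44.1 kHz is already above threshold:

```python
# Interaural time difference (ITD) produced by an n-sample offset between
# the left and right channels at a given sampling rate.
FS = 44100  # Hz

def itd_microseconds(n_samples: int, fs: int = FS) -> float:
    return n_samples / fs * 1e6

for n in (1, 2, 5):
    print(f"{n}-sample offset -> {itd_microseconds(n):.1f} us ITD")
```

So the muxed-channels method turns any tiny codec delay into a strong lateralization cue that has nothing to do with the artifacts one actually wants to grade.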

Objective quality measures will be computed, but they might not be published with the results (I don't know whether I'm allowed to publish Advanced PEAQ scores; the license is owned by my employer, not by me), and especially not before the test.

QUOTE

and you might not hear certain artifacts which are clearly audible if you listen to both channels of the codec signal.

What kind of artifacts could be missed? Excluding stereo issues, I can only imagine a very far-fetched example. Anyway, this method can be thought of as a unit test. Here is what I usually do:

CODE

[a, fs] = wavread('sampleA.wav');  % reference
[b, fs] = wavread('sampleB.wav');  % coded version
c = [mean(a, 2), mean(b, 2)];      % downmix each to mono: left = reference, right = coded
wavwrite(c, fs, 'muxed.wav');