I figured ratings would vary between testers depending on which of pre-echo, lowpass, ringing, warble and grittiness is more objectionable. Furthermore, on the Bohemian Rhapsody sample the source warbling had me very confused for a while.

#27 are my results. I do not know if something went wrong, but I am definitely not a cheater. Over a week ago, I sent Igor some wave files he asked for, but he has not answered my email yet.

I think it's really unfortunate that Igor released a file with the word cheater in it. There are so many ways for a result to go weird which have nothing to do with "cheating".

Your results can be excluded purely on the basis of the previously published confused-reference criterion (samples 2, 4, 9, 22, 30 invalid), so that should settle the question of whether excluding those results was correct, and it should have been left at that. Even with good and careful listeners this can happen, and it's nothing anyone should take too personally.

Still, your results are pretty odd: you ranked the reference fairly low (e.g. 3) on a couple of comparisons where many people found the reference and the codec indistinguishable. I think you also failed to reverse your preference on some samples where the other listeners changed theirs (behavior characteristic of a non-blind test?).

I don't mean to cause offense, but were you listening via speakers or could you have far less HF sensitivity than most of the other listeners (if you are male and older than most participants then the answer to that might be yes)? Any other ideas why your results might be very different overall and also on specific samples?

This was the first test of this kind I have done. I quickly realized that I do not hear much difference with my speakers, so I tested the samples with a pair of good earphones. I "think" I can hear differences in HF quite well. BTW, I am male and 26 years old. Yes, OK, there might be a special case: I can hear high frequencies better with my left ear, where I have a slight tinnitus (from loud fireworks).

I've done some more testing with headphones after this was finished and also realized that my speakers were limiting my initial impressions. I can pick up differences significantly more easily through headphones than speakers. I guess next time I'll have a more valuable contribution!

Some of the listeners prefer Nero over Vorbis or vice versa. Some of them rated Vorbis higher than the HE-AAC codecs. Others preferred Apple HE-AAC over CELT on the second half of the samples. These variations are all fine. Still, on average Opus/CELT was better for all listeners with enough results. It was very strange that you ranked Opus as low as the low anchor (e.g. sample 10 and many others) where ALL other listeners scored it very well. Your average scores (including the 5 invalid samples):
Vorbis - 3.53
Nero - 3.15
Apple - 3.51
CELT - 2.34

Maybe your hardware has some issues.

Earlier I also wrote to you asking you to re-run the whole test, because there were 5 invalid results and your entire result set was discarded.

Hi Igor, on sample 10 I voted this way because I found this part of the sample SUPER annoying:


Here is the raw data for a bitrate table. The bitrates are calculated from the physical file sizes and exact durations of the lossless reference files. The container overhead is not taken into account, but the situation is the same for every contender. I can create the finished table if no one else volunteers, but perhaps not today. I have already spent too much time with this.
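For anyone who wants to double-check the numbers, the calculation is essentially just this (a sketch; the file names are made up, and the exact duration is taken from the lossless reference WAV):

```python
import os
import wave

def avg_bitrate_kbps(encoded_path, reference_wav_path):
    """Average bitrate of an encoded file, using the duration of the
    lossless reference WAV as the exact playing time.
    Container overhead is included, as in the table above."""
    size_bits = os.path.getsize(encoded_path) * 8
    with wave.open(reference_wav_path, "rb") as ref:
        duration_s = ref.getnframes() / ref.getframerate()
    return size_bits / duration_s / 1000.0

# Hypothetical file names, just to show the call:
# print(avg_bitrate_kbps("sample01_opus.opus", "sample01.wav"))
```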

Thanks for organizing the tests, guys! Sorry for being picky, but I'm not convinced about the analysis. To ease my mind, it would be great if you could comment on the following.

Please provide the number of valid results (i.e. listeners) per sample (excluding "27", see below).

How did you compute the overall average score of a codec and its confidence intervals? Taking the mean of all listeners' results? That would mean a sample with more listeners (i.e. probably sample01) has a greater influence than the last few samples (which still needed listeners shortly before the end of the test). This is probably not a good approach; weighting each sample equally in the overall score seems to be the way to go for me (it probably doesn't make a difference here, but still...).

Nothing personal, but if a listener like "27" consistently scores in opposite direction as the average (as shown by Igor), a thorough post-screening analysis (like Spearman rank correlation < some value) would - and has to - exclude such results.
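To be concrete about what I mean, here is a minimal sketch of such a screening pass (assuming the scores sit in per-listener dictionaries keyed by (sample, codec); the function name and the cutoff are placeholders, not a recommended threshold):

```python
from scipy.stats import spearmanr

def flag_listeners(scores, cutoff=0.0):
    """scores: {listener: {(sample, codec): score}}.
    Flags listeners whose scores correlate poorly (Spearman rho < cutoff)
    with the mean score of all other listeners on the same items."""
    flagged = []
    for listener, own in scores.items():
        own_vals, consensus = [], []
        for item, value in own.items():
            others = [s[item] for other, s in scores.items()
                      if other != listener and item in s]
            if others:
                own_vals.append(value)
                consensus.append(sum(others) / len(others))
        rho, _ = spearmanr(own_vals, consensus)
        if rho < cutoff:
            flagged.append((listener, rho))
    return flagged
```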

Edit: Christoph, why are the samples you uploaded at 96 kHz? Did you do the test that way?

Chris

@Edit: Oh sorry, I just quickly cut it with Audacity and did not notice it was still configured that way. But anyway, I hope you can hear what I mean. Maybe most people only concentrated on the beginning of the samples!? The part with the glitch is well into the sample.

Some presentation suggestions:
1. Codec versions and settings should be in the results or one clearly marked click away. I don't consider what is there now to be clearly marked.
2. Links to results of older tests would be welcome.
3. I can't wait for the bitrate table.


Christoph, do you mean the slightly washed out bass drum? To me (and probably most other listeners) the artifacts of the other codecs in the first 15 seconds appeared much more severe. I don't have the decoded items here. Can someone check if Christoph's CELT decodes match his/her own?

And, since you said this is your first listening test of this kind: did you do training sessions? Did you read e.g. this guideline? The way you choose your loops (especially length) has a great impact on your ability to identify artifacts.

I figured ratings would vary between testers depending on which of pre-echo, lowpass, ringing, warble and grittiness is more objectionable. Furthermore, on the Bohemian Rhapsody sample the source warbling had me very confused for a while.

The bigger difference just comes from which samples were tested. A great many listeners only listened to the first few samples, so of course their preferences will be skewed by the correlation with the samples they tested.

If you look at the 10 listeners who had all 30 valid results (so no sample imbalance), you'll see that the overall preferences agree pretty strongly:

These are just the ranks of the averages (no comment on the significance):

The sample to sample variance in rank is a lot greater than the listener to listener variance in rank (scores might be another matter— but listeners don't score things the same, and because the score scale is non-linear I don't know of any intuitively correct way to deal with that other than using ranks).
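If anyone wants to redo the rank conversion, this is a sketch of what I mean (not the exact script used for the numbers above): turn each listener's scores on a sample into ranks, so differences in how listeners use the score scale drop out.

```python
from scipy.stats import rankdata

def codec_ranks(scores_by_codec):
    """scores_by_codec: {codec: score} for one listener on one sample.
    Returns {codec: rank}, rank 1 = lowest score, ties get average ranks,
    so rankings can be compared across listeners with different scales."""
    codecs = list(scores_by_codec)
    ranks = rankdata([scores_by_codec[c] for c in codecs])
    return dict(zip(codecs, ranks))

# e.g. codec_ranks({"Vorbis": 3.5, "Nero": 3.0, "Apple": 3.5, "Opus": 4.2})
# -> {'Vorbis': 2.5, 'Nero': 1.0, 'Apple': 2.5, 'Opus': 4.0}
```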

Christoph, do you mean the slightly washed out bass drum? To me (and probably most other listeners) the artifacts of the other codecs in the first 15 seconds appeared much more severe. I don't have the decoded items here. Can someone check if Christoph's CELT decodes match his/her own?

And, since you said this is your first listening test of this kind: did you do training sessions? Did you read e.g. this guideline? The way you choose your loops (especially length) has a great impact on your ability to identify artifacts.

Chris

Hi C.R.,

Yes, I read the guideline before the test, but usually compared only 2-3 loops per sample. It is interesting that you are not that much annoyed by this part. I can clearly hear it, and I just did a spectrum analysis where it is also visible: http://dl.dropbox.com/u/745331/spectrum.png
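If anyone wants to check the same spot themselves, a spectrogram of the decoded file shows it; something along these lines works (the file name and settings here are just placeholders, not what I actually used):

```python
import matplotlib.pyplot as plt
from scipy.io import wavfile

# Hypothetical file name for the decoded item under discussion.
rate, data = wavfile.read("sample10_celt_decoded.wav")
if data.ndim > 1:
    data = data.mean(axis=1)  # mix down to mono for the plot

plt.specgram(data, NFFT=2048, Fs=rate, noverlap=1024, cmap="magma")
plt.xlabel("time [s]")
plt.ylabel("frequency [Hz]")
plt.show()
```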

Yes, I read the guideline before the test, but usually compared only 2-3 loops per sample.

I wonder how many listeners did it like that. It seems there are a lot of things we should put in a checklist for all to read before a test. Such as "listen to the entire sample" and "use headphones"... Maybe by coincidence you only listened to sections where CELT does a bit worse than the other codecs?

QUOTE (C.R.Helmrich @ Apr 12 2011, 21:11)

[*]Please provide the number of valid results (i.e. listeners) per sample (excluding "27", see below).

Will be addressed when per sample graphs are made. You can obtain this data yourself easily if you can't wait - the results are public.

QUOTE (C.R.Helmrich @ Apr 12 2011, 21:11)

[*]How did you compute the overall average score of a codec and its confidence intervals? Taking the mean of all listeners' results? That would mean a sample with more listeners (i.e. probably sample01) has a greater influence than the last few samples (which still needed listeners shortly before the end of the test). This is probably not a good approach; weighting each sample equally in the overall score seems to be the way to go for me (it probably doesn't make a difference here, but still...).

This is already addressed and explained on the results page. Note that equal sample weighting, by only including complete results, does not change the results in the slightest.

That being said, the only real solution is to put infrastructure in place that forces an equal number of listeners per sample in the next tests. Any kind of post-processing to equalize the sample weights is probably as controversial as not having them equal in the first place. The samples that weren't included in the test also had unequal weights compared to those that were, if you know what I mean.
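To make the two weighting schemes concrete, here is a minimal sketch (not the actual analysis script; it assumes the raw results are available as (listener, sample, codec, score) records):

```python
from collections import defaultdict

def overall_means(records):
    """records: iterable of (listener, sample, codec, score) tuples.

    pooled_mean  - plain mean over all individual scores, so a sample with
                   more listeners pulls harder on the overall average.
    sample_mean  - average each sample first, then average the per-sample
                   means, so every sample counts equally."""
    pooled = defaultdict(list)
    per_sample = defaultdict(lambda: defaultdict(list))
    for _listener, sample, codec, score in records:
        pooled[codec].append(score)
        per_sample[codec][sample].append(score)

    pooled_mean = {c: sum(v) / len(v) for c, v in pooled.items()}
    sample_mean = {
        c: sum(sum(v) / len(v) for v in by_sample.values()) / len(by_sample)
        for c, by_sample in per_sample.items()
    }
    return pooled_mean, sample_mean
```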

QUOTE

[*]Nothing personal, but if a listener like "27" consistently scores in opposite direction as the average (as shown by Igor), a thorough post-screening analysis (like Spearman rank correlation < some value) would - and has to - exclude such results.

The paired statistical tests are pretty incontrovertible. I've since run the same analysis with a number of different balancing and post filtering rules and every time it's come out to be the same way.

If it's any consolation, Opus bombs considerably on the couple of cases where it does poorly (though its sample-by-sample variance is still not as large as the other codecs', it has stronger outliers). This is undoubtedly due to a mixture of encoder immaturity, not taking advantage of VBR, and just one of the annoying tradeoffs that come from creating a low-latency codec. (The mode Opus was used in here has a total of 22.5 ms of latency, including the overlap but ignoring any serialization delay related to VBR.)

I've noticed that there seems to be some misunderstanding promoted around here related to confidence intervals. Even ignoring the issues with non-pairwise comparisons, assumptions of normality, etc., there seems to be a misapprehension that the confidence intervals must not overlap at all for the result to be deemed significant at whatever p-value was used to draw the bars. This is clearly incorrect.

For example, consider 5% error bars on the mean of codec A and 5% bars on the mean of codec B, where the lower bar of A is the same as the upper bar of B. Is there a 1/20 (p=0.05) chance that the difference in means arose from noise? _NO_. If we assume that the errors are independent, the chance of that is more like 1/400 (0.05^2). Of course, the errors are not completely independently distributed, but this fact also invalidates the assumptions used to set the error bars in the first place. Another approach would be to compare the mean of one codec with the error bars on the mean of the other and vice versa; this isn't ideal either, but it does avoid squaring the p-value used.

Blocked pair-wise parametric tests are much better, for this reason and others, but they don't result in pretty graphs.
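To illustrate with made-up numbers: when the per-sample difficulty is shared between two codecs, the per-codec 95% intervals can overlap even though a paired test on the per-sample differences is extremely significant. This is just a toy simulation, not the test data:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
sample_difficulty = rng.normal(4.0, 0.8, size=30)          # shared per-sample effect
codec_a = sample_difficulty + 0.15 + rng.normal(0, 0.1, 30)
codec_b = sample_difficulty - 0.15 + rng.normal(0, 0.1, 30)

def ci95(x):
    """Naive 95% confidence interval on the mean, ignoring the pairing."""
    half = stats.t.ppf(0.975, len(x) - 1) * stats.sem(x)
    return x.mean() - half, x.mean() + half

print("codec A 95% CI:", ci95(codec_a))   # the two intervals overlap...
print("codec B 95% CI:", ci95(codec_b))
print("paired t-test:", stats.ttest_rel(codec_a, codec_b))  # ...yet p is tiny
```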


Sorry, Christoph, can't reproduce it. What you describe must sound like a notch filter, i.e. frequency band missing. Haven't noticed anything of that sort during and after the test. What OS are you using? 64-bit?

Thanks, Garf and NullC, for the explanations.

QUOTE (Garf @ Apr 12 2011, 22:44)

Note that equal sample weighting, by only including complete results, does not change the results in the slightest.

That's good to hear. Still, if you find some time, would you mind creating a closeup average-codec-score plot using only the complete results, just like the plot on the results page?

Some presentation suggestions:
1. Codec versions and settings should be in the results or one clearly marked click away. I don't consider what is there now to be clearly marked.
2. Links to results of older tests would be welcome.
3. I can't wait for the bitrate table.

I added the bitrate table (thanks AlexB!), but that's as far as I'll go. If people want nicer webpages they need to find someone who is actually skilled at making nice HTML/CSS.

Apple_HE-AAC is better than Vorbis (p=0.000)
Apple_HE-AAC is better than Nero_HE-AAC (p=0.000)
Opus is better than Vorbis (p=0.000)
Opus is better than Nero_HE-AAC (p=0.000)
Opus is better than Apple_HE-AAC (p=0.000)
AAC-LC@48k is worse than Vorbis (p=0.000)
AAC-LC@48k is worse than Nero_HE-AAC (p=0.000)
AAC-LC@48k is worse than Apple_HE-AAC (p=0.000)
AAC-LC@48k is worse than Opus (p=0.000)
