Thanks for organizing the tests, guys! Sorry for being picky, but I'm not convinced about the analysis. To ease my mind, it would be great if you could comment on the following.

Please provide the number of valid results (i.e. listeners) per sample (excluding "27", see below).

How did you compute the overall average score of a codec and its confidence intervals? Taking the mean of all listeners' results? That would mean a sample with more listeners (i.e. probably sample01) has a greater influence than the last few samples (which still needed listeners shortly before the end of the test). This is probably not a good approach; weighting each sample equally in the overall score seems to be the way to go for me (it probably doesn't make a difference here, but still...).
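To make the difference concrete, here's a minimal sketch of the two approaches (toy scores and sample names, not actual test data):

CODE

import numpy as np

# Toy data for illustration only: scores[s] holds the listener scores one
# codec received on sample s. Sample names and values are made up.
scores = {
    "sample01": [4.2, 3.9, 4.5, 4.1, 3.8, 4.0],  # many listeners
    "sample18": [3.7, 4.3],                      # few listeners
}

# Pooled mean: every listener result counts once, so samples with more
# listeners dominate the overall score.
pooled = np.concatenate([np.asarray(v) for v in scores.values()])
pooled_mean = pooled.mean()

# Equal sample weighting: average within each sample first, then average
# the per-sample means, so each sample contributes equally.
sample_means = np.array([np.mean(v) for v in scores.values()])
weighted_mean = sample_means.mean()

# 95% confidence interval on the per-sample means (normal approximation;
# with this few samples a t-quantile would be more appropriate).
sem = sample_means.std(ddof=1) / np.sqrt(len(sample_means))
ci = (weighted_mean - 1.96 * sem, weighted_mean + 1.96 * sem)
print(pooled_mean, weighted_mean, ci)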

Nothing personal, but if a listener like "27" consistently scores in the opposite direction to the average (as shown by Igor), a thorough post-screening analysis (like Spearman rank correlation < some value) would - and has to - exclude such results.
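Something along these lines would do (a sketch using scipy's spearmanr; the listener IDs, scores and the cutoff are made up for illustration, and a real test should fix the threshold in advance):

CODE

import numpy as np
from scipy.stats import spearmanr

# Toy data: ratings maps a listener ID to the scores that listener gave
# the conditions of one sample. All numbers are invented.
ratings = {
    "12": [1.5, 2.8, 3.9, 4.6],
    "27": [4.4, 3.7, 2.1, 1.3],  # runs opposite to the rest of the panel
    "31": [1.8, 2.5, 4.1, 4.4],
}

THRESHOLD = 0.0  # assumed cutoff, chosen here only for the example

def screened_out(listener, ratings, threshold=THRESHOLD):
    # Correlate the listener against the mean of all other listeners.
    others = [v for k, v in ratings.items() if k != listener]
    panel_mean = np.mean(others, axis=0)
    rho, _ = spearmanr(ratings[listener], panel_mean)
    return rho < threshold

for lid in ratings:
    print(lid, "excluded" if screened_out(lid, ratings) else "kept")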

Edit: Christoph, why are the samples you uploaded at 96 kHz? Did you do the test that way?

QUOTE (C.R.Helmrich @ Apr 12 2011, 21:11)

Please provide the number of valid results (i.e. listeners) per sample (excluding "27", see below).

Will be addressed when the per-sample graphs are made. You can easily obtain this data yourself if you can't wait - the results are public.
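For example, if you export the public results as one row per submitted result (the file name and column names below are assumptions, not the actual export format), counting takes a few lines:

CODE

import csv
from collections import Counter

# Assumed layout: a CSV with "listener" and "sample" columns, one row per
# valid submitted result. Adjust to whatever the public export really is.
counts = Counter()
with open("results.csv", newline="") as f:
    for row in csv.DictReader(f):
        if row["listener"] != "27":  # drop the post-screened listener
            counts[row["sample"]] += 1

for sample, n in sorted(counts.items()):
    print(sample, n)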

QUOTE (C.R.Helmrich @ Apr 12 2011, 21:11)

How did you compute the overall average score of a codec and its confidence intervals? Taking the mean of all listeners' results? That would mean a sample with more listeners (i.e. probably sample01) has a greater influence than the last few samples (which still needed listeners shortly before the end of the test). This is probably not a good approach; weighting each sample equally in the overall score seems to be the way to go for me (it probably doesn't make a difference here, but still...).

This is already addressed and explained on the results page. Note that equal sample weighting, by only including complete results, does not change the results in the slightest.

That being said, the only real solution is to put infrastructure in place that forces an equal number of listeners per sample in the next tests. Any kind of post-processing to equalize the sample weights is probably as controversial as not having them equal in the first place. The samples that weren't included in the test also had unequal weights compared to those that were, if you know what I mean.
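For reference, "only including complete results" above means keeping just the listeners who submitted a result for every sample, so every sample sees the same panel and automatically gets equal weight. A toy sketch:

CODE

# Toy data: results maps a listener ID to {sample: score}. IDs, sample
# names and scores are invented for illustration.
results = {
    "05": {"sample01": 4.1, "sample02": 3.8},
    "12": {"sample01": 3.9},                    # incomplete, dropped
    "31": {"sample01": 4.4, "sample02": 4.0},
}

all_samples = set().union(*(r.keys() for r in results.values()))
complete = {k: v for k, v in results.items()
            if set(v) == all_samples}
print(sorted(complete))  # ['05', '31']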

QUOTE

Nothing personal, but if a listener like "27" consistently scores in the opposite direction to the average (as shown by Igor), a thorough post-screening analysis (like Spearman rank correlation < some value) would - and has to - exclude such results.

Sorry, Christoph, can't reproduce it. What you describe must sound like a notch filter, i.e. a missing frequency band. I haven't noticed anything of that sort during or after the test. What OS are you using? 64-bit?

Thanks, Garf and NullC, for the explanations.

QUOTE (Garf @ Apr 12 2011, 22:44)

Note that equal sample weighting, by only including complete results, does not change the results in the slightest.

That's good to hear. Still, if you find some time, would you mind creating a close-up plot of the average codec scores using only the complete results, just like the plot on the results page?