It would be interesting to do a rank-sum analysis comparing each pair of encoders. Although the numeric values assigned by the listener seem like legitimate statistical data, the only real value is whether or not a listener ranked one encoder higher or lower than another.
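Just to make the suggestion concrete: below is a minimal sketch of what such a pairwise rank-based comparison could look like, assuming the per-listener grades for each encoder are available in a common listener order. The encoder names and numbers are made up, and a paired signed-rank test is used because every listener grades every encoder; a plain rank-sum test would ignore that pairing.

```python
from itertools import combinations
from scipy.stats import wilcoxon  # paired rank test; mannwhitneyu (rank-sum) would ignore the pairing

# Hypothetical grades, one per listener, in the same listener order for every encoder.
scores = {
    "CVBR": [4.2, 4.5, 3.9, 4.8, 4.1],
    "TVBR": [4.0, 4.6, 3.7, 4.7, 4.3],
    "Nero": [3.8, 4.1, 3.5, 4.4, 3.9],
}

# Compare every pair of encoders; note that no multiple-comparison correction is applied here.
for a, b in combinations(scores, 2):
    stat, p = wilcoxon(scores[a], scores[b])
    print(f"{a} vs {b}: W={stat:.1f}, p={p:.3f}")
```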

Completely and utterly false. We're asking listeners to grade on a reference scale, to compare against a low anchor, and to judge the severity of distortions, not merely whether one codec is better than another.

If you're going to claim this only "seems like legitimate" statistical data, you'd better back up that statement. Specifically, explain why the interval scale used here (and in every previous test) suddenly has to be abandoned for an ordinal scale, or why we're dropping the ITU-R BS.1116-1 methodology that is generally followed in these tests. Are you saying the ITU methodology also only "seems" legitimate?

... whether or not a listener ranked one encoder higher or lower than another.

I thought that (Garf, correct me if necessary) this information is reflected in the p-value tables and whether or not the confidence intervals of two coders overlap.

Yes (aggregated over all listeners). Note that the graphics are simplified plots and don't show the correct confidence intervals for the bootstrap (because the tool doesn't support generating them) or for the ANOVA (IIRC, the plots don't account for the blocking).

This is why you'll see overlap in the graphics but not in the bootstrap or blocked ANOVA results.
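For what it's worth, here is a rough sketch of how per-codec bootstrap confidence intervals of the kind mentioned above could be computed by resampling listeners. The array shape, the scores, and the number of resamples are placeholder assumptions, not the actual procedure used by the tool.

```python
import numpy as np

rng = np.random.default_rng(0)
ratings = rng.normal(4.3, 0.4, size=(40, 5))   # placeholder: 40 listeners x 5 codecs
n_listeners = ratings.shape[0]

# Resample listeners with replacement and recompute the per-codec means each time.
boot_means = []
for _ in range(10_000):
    idx = rng.integers(0, n_listeners, n_listeners)
    boot_means.append(ratings[idx].mean(axis=0))
boot_means = np.asarray(boot_means)

# 95% percentile intervals for each codec's mean rating.
lo, hi = np.percentile(boot_means, [2.5, 97.5], axis=0)
for c, (l, h) in enumerate(zip(lo, hi)):
    print(f"codec {c}: [{l:.2f}, {h:.2f}]")
```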

Sorry, I only now read the caveat on the results page: "The graphs are a simple ANOVA analysis over all submitted and valid results. This is compatible with the graphs of previous listening tests, but should only be considered as a visual support for the real analysis." My initial reaction was to the box-plot graphs, not to the analysis at the bottom of the page.

The Friedman ANOVA analyses (bootstrap or not) use rank-based testing.

(Blocked) ANOVA is a parametric, means-based test. FRIEDMAN is the name of the utility (which, unsurprisingly, also supports Friedman analysis). The result posted is means-based, not rank-based. It's there mostly to allow cross-referencing with older tests and with other statistical packages, which are more likely to support a normal blocked ANOVA than the nonparametric variants. Friedman wasn't developed further because it doesn't allow p-value step-down without losing a significant amount of power when there are many comparisons, and because for high-bitrate tests it is no longer clear that the results are normally distributed. That's exactly what led to the bootstrap.
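To make the distinction concrete, here is a small sketch contrasting a means-based blocked ANOVA (listener as the blocking factor) with Friedman's rank-based test on the same data. The scores, codec labels, and column names are assumptions for illustration only, not the FRIEDMAN utility's actual input or output.

```python
import numpy as np
import pandas as pd
from scipy.stats import friedmanchisquare
import statsmodels.formula.api as smf
import statsmodels.api as sm

rng = np.random.default_rng(0)
codecs = ["CVBR", "TVBR", "CT", "FhG", "Nero"]
ratings = rng.normal(4.3, 0.4, size=(40, len(codecs)))  # placeholder: 40 listeners

# Long format for the parametric, means-based blocked ANOVA.
df = pd.DataFrame(
    [(l, c, ratings[l, j]) for l in range(ratings.shape[0]) for j, c in enumerate(codecs)],
    columns=["listener", "codec", "score"],
)
model = smf.ols("score ~ C(codec) + C(listener)", data=df).fit()
print(sm.stats.anova_lm(model, typ=2))  # listener acts as the blocking factor

# Friedman: nonparametric, works on within-listener ranks rather than means.
stat, p = friedmanchisquare(*(ratings[:, j] for j in range(len(codecs))))
print(f"Friedman chi-square = {stat:.2f}, p = {p:.4f}")
```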

I should also mention that I participated in this test too. Steve Forte Rio made the ABC/HR sessions and a new key for me, and he checked my results. (You can find this key in results.zip.) A big thank you to him for that.

Maybe there could be a legend for the X-axis explaining the abbreviations used, at least under the first graph?

FhG, low_anchor*, and Nero are clear enough (*though "wait, what was it again?" ;p ), but making sense of CT, CVBR, and TVBR might require going back to the test page, which I think shouldn't be necessary.

Thanks to all who participated in this test and to those who made this test possible, especially to IgorC.

My findings are in line with the general results, and I am actually surprised by Nero ending up rather low. Curious to see that the CVBR mean is a bit higher than TVBR's, but I suppose this doesn't mean much, as each falls within the other's confidence interval.

In some personal testing about a year ago with Apple CVBR at around 128kbps, I found it stunningly good, but I never really compared it to Nero (I have been using Nero for 2 years now). Is it safe to conclude that if a codec is better at about 100kbps, it is also better at 128kbps? Or might the quality of the tuning differ between quality settings (and therefore bitrates)?

Is it safe to conclude that if a codec is better at about 100kbps, it is also better at 128kbps? Or might the quality of the tuning differ between quality settings (and therefore bitrates)?

This is a tough question. The quality of the tuning can make a difference. But barring any further information, I'd bet that the codec that is better tuned and performing at 100kbps will perform better at 128kbps, too.

You could say that a codec's performance at 100kbps is a hint, but not proof, of how it will do at 128kbps.

To be perfectly honest, I am surprised that FhG did so well against Coding Technologies. Since Winamp introduced it, some of my songs seemed to retain more quality when encoded with the Coding Technologies encoder rather than with FhG's. Too bad I found out about the test two days after it had already closed. I've been anxious to see the results; interesting how Nero did the worst. Great information for future reference.

It appears to me that the low anchor was way too bad. Shouldn't the low anchor be at around the same quality as the contenders, but "slightly" worse than all of them?

Not sure about this one; I thought it should "calibrate the scale". (Because the overall quality is so high, it's less needed at the upper end.)

If you don't use an anchor, what happens is that users will tend to slam the slider down for even a minor distortion. The anchor serves as a reminder of "what really bad really is".

It would be more useful if the anchor stayed the same throughout the tests, I guess. But the opportunity to test ffmpeg in one swoop was probably interesting. No idea whether it was understood beforehand that it is *this* bad.