As I said in the test results thread, I would like to discuss each sample separately. The overall results were tied and all encoders appear to be equal. On closer inspection, however, it seems that each encoder had problems with at least some samples.

It would be useful to analyze each sample separately in order to find out what kind of problems the testers noticed and how severe they are. The discussion would help to understand the test results and probably also help the codec developers in their work. This has not been done before, but I think the outcome would be valuable.

Some testers added comments to the result files. Those comments are useful if the tester intends to revisit a saved session later. Unfortunately, the comments in the result files are quite hidden, and they cannot be easily evaluated and compared. That's why I didn't add comments to my results (except for some unfinished, partially wrong comments in one of my first result files - I meant to delete them, but I forgot to do that).

This thread is for Sample #2. Please try to keep the discussion on topic. If you want to discuss any other sample, feel free to start a new thread for it. I am hoping that eventually we'll have 14 separate threads - one for each sample. I'll create them myself if others have not done so before me.

Sample #2 - Vangelis_Chariots_of_Fire

The overall results:

The results from the individual testers:

I sorted the testers so that the most critical tester is the first on the left.

Six people gave l3enc a score of 5.0. Three of them gave LAME 3.97 a score below 5.0 (Peter Sperl, anonymous19, anonymous07), and one did the same for iTunes (anonymous16). Interesting...

Indeed... This graphic shows things that are really suspicious.

Suspicious listeners:

7 (This is definitely not random, but it is indeed strange), 19, 21, 23, 24.

How could the low anchor be rated higher in quality than any of the contenders on this sample? (For the record, I am the 17th in this graph.)

Sebastian, you said you discarded several results because you found rated references. Is it reasonable to think (and this depends on how many you had to discard) that some users posted random results?

Edit: I've been reading the results. Listener 21 (anonymous19) did try to ABX the samples, but did not rate the low anchor. In this case, I would think the user simply didn't bother to rank the low anchor, which is not a good thing to do, but I guess his/her results are not disqualified.

Listener 7 (anon25) left comments on all the samples. If this anonymous tester would like to step forward and explain this strange rating (versus everyone else's), it could help us understand it. The first impression is that he somehow swapped the results of the low anchor and Fraunhofer.

I am listener 6. It seems that I personally don't like the sound of iTunes, and not only on this sample.

What I find strange in this sample is that iTunes and Fraunhofer degrade the right channel's quality even though there is nothing in it to mask the degradation. Presumably a given sound in the left channel triggers it, but these encoders seem to apply the matching degradation to both channels by mistake.

Here were my comments:

General comments: Interesting stereo problems. The bit depth seems to be drastically reduced in the right channel, where it is perfectly audible.

File 1
Evaluation: 2.8
Comment: Artifacts on the right side especially
ABX: 8/8

The problem is that as long as the reference isn't ranked, I cannot simply refuse to accept results, even if the low anchor is rated higher than a contender. Otherwise people might blame me for selecting only the results I like. That would be fatal for me in an AAC test, for example, because there were already people who told me that I am biased since I work for Nero.

The only thing I did was to discard all of a user's results if they had a very high number of results with ranked references (like 9 out of 14). In that case, I contacted the submitter and asked why this happened. Some people replied that they had simply guessed; after asking them to redo the test with ABX if possible, I included only the new results. Others were affected by the ABC/HR problem: they wrote down the results on paper first, not knowing that reloading the configuration files re-randomizes the contenders. Others didn't reply at all. However, only a very small number of people were affected (I think a total of maybe 3 submitters).

LAME 3.97 suffers from the sandpaper problem.
LAME 3.98 has pronounced high-frequency hiss. (I don't know why that bothered me so much when I tested this sample. The other encoders seem to have similar hiss problems. I would probably now give LAME 3.98 something like 3.3 - 3.5.)
FhG is noisy and has some sandpaper problem too.
Helix is constantly a bit distorted.
iTunes is the cleanest. I didn't notice the problem Pio2001 mentioned. (I still don't, even though I tried.)

I didn't notice the problem Pio2001 mentioned. (I still don't, even though I tried)

The first two notes, on the left, from 3" to 6", are affected by artifacts, but oddly enough, the artifacts are audible in the opposite channel! It is more audible with LAME 3.97, during the first note only. I see what you call "sandpaper". The artifacts in iTunes are not sandpaper, but more common mp3 artifacts; I don't know how to qualify them... they're... mpc-ish!

The problem is that as long as the reference isn't ranked, I cannot simply refuse to accept results even if the low anchor is rated higher than a contender...

But I can.

I removed the results that I judged unsuitable for the purpose of this thread. (They were absolutely valid for the public listening test, but here I am trying to get more information about the detected differences between the actual contenders.)

LAME 3.97 suffers from the sandpaper problem.
LAME 3.98 has pronounced high-frequency hiss. (I don't know why that bothered me so much when I tested this sample. The other encoders seem to have similar hiss problems. I would probably now give LAME 3.98 something like 3.3 - 3.5.)
FhG is noisy and has some sandpaper problem too.
Helix is constantly a bit distorted.
iTunes is the cleanest. I didn't notice the problem Pio2001 mentioned. (I still don't, even though I tried.)

Just by looking at the bitrates from all encoders and at LAME's output, it seems LAME uses a lot more short blocks on this sample than the other encoders. Maybe too much pre-echo tuning, triggered by /mnt? Without short blocks, LAME would encode it at about 123 kbps, which would be more in line with all the others.

...Maybe too much pre-echo tuning, triggered by /mnt? Without short blocks, LAME would encode it at about 123 kbps, which would be more in line with all the others.

Just a thought: I know it would make LAME usage more complicated (not to mention development), but would a specific option be useful, one that takes into account the specific needs of individual users? For instance, I am pretty insensitive to pre-echo, but I dislike tonal distortions. And I don't care about problems that are more or less specific to metal, hard rock, or hard-core electronic music. People like /mnt may feel the other way around. Encoder development always has to make compromises. While that will remain true for any useful encoder, it might be advantageous for advanced users to have an option that takes some care of their specific needs. Is it possible?

The problem is that as long as the reference isn't ranked, I cannot simply refuse to accept results even if the low anchor is rated higher than a contender. Otherwise people might blame me for selecting only the results I like. This would be fatal for me in an AAC test for example because there were already persons who told me that I am biased since working for Nero.

Have you looked at any statistical methods for gleaning information by considering subsets of your data? http://en.wikipedia.org/wiki/Resampling_(s...tics)#Jackknife

I'm familiar with bootstrapping in phylogenetics, where you regenerate an evolutionary tree with various random subsets of the species you're interested in. That seems fundamentally different from considering random subsets of the samples, or of the submitters, though, since nothing non-linear or unpredictable happens after discarding some samples. Still, it might be interesting to see what fraction of the subsets of submitters leave all codecs tied within a 95% confidence interval.
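The jackknife idea above can be sketched quickly. This is only an illustration with made-up per-listener scores (not the actual test data): drop one submitter at a time and recompute each codec's mean, to see how sensitive the result is to any single listener.

```python
# Jackknife over submitters: leave one listener out at a time and see how
# much each codec's mean score moves. Scores are hypothetical, not the
# real test results.
scores = {
    "LAME 3.98": [4.8, 4.5, 5.0, 4.2, 4.9],
    "Helix":     [4.4, 4.6, 4.1, 4.7, 4.3],
}

def jackknife_means(values):
    """Return the leave-one-out means: one mean per omitted listener."""
    n = len(values)
    total = sum(values)
    return [(total - v) / (n - 1) for v in values]

for codec, vals in scores.items():
    loo = jackknife_means(vals)
    print(f"{codec}: full mean {sum(vals) / len(vals):.2f}, "
          f"leave-one-out range {min(loo):.2f}-{max(loo):.2f}")
```

If dropping a single listener swings a codec's mean past another's, the ranking rests heavily on that one submitter.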

BTW, what's so magic about 95%? What is the p-value for Helix >= LAME 3.98.2 on the whole test? Even if you can't say Helix >= LAME with 95% confidence, maybe you can say there's a >70% chance that Helix did better on this test, and a 30% chance that it didn't.
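That kind of probability estimate can be approximated by bootstrapping over submitters. Again, a minimal sketch with hypothetical (helix, lame) score pairs, not the real results: resample the listeners with replacement many times and count how often Helix's mean beats LAME's.

```python
import random

# Hypothetical per-listener scores: each pair is (helix_score, lame_score)
# from one submitter. Not the actual test data.
scores = [(4.4, 5.0), (3.8, 3.5), (4.0, 4.2), (4.6, 4.1),
          (3.9, 4.0), (4.2, 3.7), (4.5, 4.4), (3.6, 3.9)]

def bootstrap_prob(pairs, trials=10000, seed=1):
    """Fraction of bootstrap resamples (over listeners) in which
    Helix's mean score is strictly higher than LAME's."""
    rng = random.Random(seed)
    wins = 0
    for _ in range(trials):
        resample = [rng.choice(pairs) for _ in pairs]
        helix = sum(h for h, _ in resample) / len(resample)
        lame = sum(l for _, l in resample) / len(resample)
        if helix > lame:
            wins += 1
    return wins / trials

p = bootstrap_prob(scores)
print(f"P(Helix mean > LAME mean) ~ {p:.2f}")
```

A result like 0.7 would support exactly the kind of statement suggested above: "a >70% chance that Helix did better on this test", even when the 95% interval says "tied".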

QUOTE

The only thing I did was to discard all of a user's results if they had a very high number of results with ranked references (like 9 out of 14). In that case, I contacted the submitter and asked why this happened. Some people replied that they had simply guessed; after asking them to redo the test with ABX if possible, I included only the new results. Others were affected by the ABC/HR problem: they wrote down the results on paper first, not knowing that reloading the configuration files re-randomizes the contenders. Others didn't reply at all. However, only a very small number of people were affected (I think a total of maybe 3 submitters).

My results didn't get counted for sample 2. I didn't know that ABC/HR would gray out the reference after a successful ABX in trial (not training) mode, so I just used training mode to find an artifact, but then picked the wrong slider on one sample (based on the effectively single-trial ABX of listening to both sliders and the reference). I sent you an updated results file, but I guess you didn't use it because it arrived after the test deadline?

iTunes: 5.0
LAME 3.98.2: 5.0
l3enc: 2.1 (higher than on the other 3 samples I did. It sucks, but wasn't as bad here, so got a higher score)
FhG: 4.0
LAME 3.97: 2.7
Helix: 4.4 (note the comment, though)

Hardware: Koss TD/60 headphones and Logitech Z-5500 speakers. Sound card: Intel HDA (STAC9271D codec with 105 dB DAC SNR), driven by Ubuntu GNU/Linux. I think I tried all encodes on both the headphones and the speakers, to see if there was something I could hear on one but not the other. My headphones aren't bad, but they have no bass compared to my lovely speakers.

CODE

Testname: Sample02
Tester: pcordes

General Comments: re-rated sample #5 (the one I previously rated the reference for). I could ABX it, but I was being too harsh.
---
1L File: Sample02/Sample02_3.wav
1L Rating: 2.1
1L Comment: warbly
---
4L File: Sample02/Sample02_5.wav
4L Rating: 2.7
4L Comment: horn isn't smooth, sounds clickly
---
5R File: Sample02/Sample02_4.wav
5R Rating: 4.0
5R Comment: second horn sounds a little off
---
6R File: Sample02/Sample02_6.wav
6R Rating: 4.4
6R Comment: Can only ABX based on the difference in noise floor in the opening half second. The sample has less background hiss.