-q135 should sound OK on non-killer rock/pop samples to most people, but to Guruboolez, IMO it is probably not up to par.

<$0.02>My ears are not especially sensitive or trained at all, but on a personal sample taken from the soundtrack CD of E.T., I had to go up as high as -q210 until the quality reached un-ABXable transparency. </$0.02>

edit: I used the first stable release of version 1.24 from Rarewares (file date = April 25, 2004 - 11:24am).

Thanks for the report. I very quickly checked what bitrate -q135 corresponds to with faac 1.24. It's apparently close to 150 kbps on average with classical (~150 kbps on common orchestral/lyrical; ~140 with less complex music such as piano; ~160 with the four solo harpsichord tracks I've encoded so far).

It's the same problem with vorbis megamix II. To avoid some contestation, I've chosen three different vorbis settings. Adding two more encodings (faac/-q classical & faac/-q rock) would make my test more difficult, especially for building a correct hierarchy for each sample (it's a long task with 5 contenders, it will be much longer with 7). I'll check the overall quality of faac in preliminary tests (next week). If the encoder is competitive, I'll see what I can do. If the codec isn't competitive enough, I'll probably wait for another AAC encoder and, why not, test faac later in a similar test pitting different AAC encoders against each other at the same bitrate.

Hello. Although this test was done last week, I couldn't publish it until today for various reasons. This time I did ABX only, since doing ABC/HR on non-killer samples at high bitrate is too hard for me.

This test features the three encoders, with the following settings, that I'm interested in right now.

Thanks a lot for the results and samples. I can't download all of them yet; I've just finished marteau.flac (Boulez I suppose ;)). Apparently, the file is corrupted: "error while decoding metadata". Could someone confirm?

Thank you for your work. However, considering that the number of trials varies, it seems that you performed, like Guruboolez, sequential ABX testing. That makes all the p-values useless. What was the maximum number of trials that you fixed before giving up?

I recall that, to avoid any difficulty in interpreting the results, the cleanest way to perform ABX tests is to fix the number of trials before the test begins, then to perform the test once.

I recall that, to avoid any difficulty in interpreting the results, the cleanest way to perform ABX tests is to fix the number of trials before the test begins, then to perform the test once.

It's easy to say, but harder to do. By fixing a precise number of trials (16 could be exhausting, given the difficulty of such tests; 8...12 is more realistic in my opinion), there's a BIG risk: finishing the test with insignificant results. With eight trials, the situation is simple: you can't miss two of them; with 16 trials, errors are less crucial... but with 6 contenders, there are 96 trials to perform for one sample, 960 for ten, and listening fatigue is very hard or maybe impossible to avoid. Fatigue implies more errors, and again the risk of finishing the test with insignificant results.

By "pushing" the test beyond the fixed value, the tester tries to prove that the difference is not placebo, and that he can hear it, and even ABX it. As a tester, I prefer to finish a test with 24/30 than with a poor 10/16 due to bad concentration or something else. And in order to avoid fatigue (and therefore errors), I sometimes stop the test very quickly when the ABX score is perfect after 5...8 trials.
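To put numbers on the scores mentioned in this exchange, here is a minimal Python sketch of the one-sided binomial p-value that ABX programs display for a fixed number of trials. The function name `abx_p_value` is made up for illustration; the only assumption is that a guessing listener gets each trial right with probability 1/2.

```python
from math import comb

def abx_p_value(correct: int, trials: int) -> float:
    """Chance of scoring at least `correct` out of `trials` by pure guessing.

    Only meaningful when `trials` was fixed before the test started.
    """
    favourable = sum(comb(trials, k) for k in range(correct, trials + 1))
    return favourable / 2 ** trials

# Scores taken from the posts above.
for correct, trials in [(7, 8), (6, 8), (10, 16), (24, 30)]:
    print(f"{correct}/{trials}: p = {abx_p_value(correct, trials):.4f}")
```

With 8 trials, one miss (7/8) stays under 0.05 while two misses (6/8) do not, which is the "you can't miss two of them" remark.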

This way of doing things completely screws the results. I recall that the corrected p-values given there, though obtained with simulations, are exact.

When you are ready to go for 100 trials, and get p=0.05 in the ABX program, your test has failed, because your real p value is not 0.05, it is 0.2! The p values displayed in ABX programs are only valid for tests run either without looking at the results before the test is over, or with a fixed number of trials, and for tests run for the first time! If you take the test a second time, or if you look at the results before the end and the number of trials is not fixed, then the p values given are plain wrong!

It was discussed in the thread linked above, and in the other thread linked again from there.
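The inflation described here can be checked without simulation. The sketch below is an illustration only, not the method behind the corrected table mentioned in the thread: it assumes the tester looks at the running p-value after every trial and stops as soon as it is at or below `alpha`, and it computes exactly how often a pure guesser would "succeed" that way within `max_trials`. Both function names are hypothetical.

```python
from math import comb

def fixed_p_value(correct: int, trials: int) -> float:
    """One-sided binomial p-value, as shown by ABX software."""
    return sum(comb(trials, k) for k in range(correct, trials + 1)) / 2 ** trials

def sequential_false_positive_rate(max_trials: int, alpha: float = 0.05) -> float:
    """Chance that a guesser sees p <= alpha at some point within max_trials.

    Assumption: the tester checks the score after every trial and stops
    at the first trial where the displayed p-value is <= alpha.
    """
    alive = {0: 1.0}   # hits -> probability of being there without having stopped
    stopped = 0.0      # probability of having stopped on a "success"
    for n in range(1, max_trials + 1):
        spread = {}
        for hits, prob in alive.items():
            for gained in (0, 1):  # wrong or right guess, each with chance 1/2
                spread[hits + gained] = spread.get(hits + gained, 0.0) + prob / 2
        alive = {}
        for hits, prob in spread.items():
            if fixed_p_value(hits, n) <= alpha:
                stopped += prob
            else:
                alive[hits] = prob
    return stopped

for limit in (8, 16, 50, 100):
    print(limit, round(sequential_false_positive_rate(limit), 3))
```

With at most 5 trials the only way to cross 0.05 is 5/5, so the rate is 1/32; every extra trial the tester allows himself adds new ways to cross the line, and the value printed for 100 can be compared with the 0.2 figure quoted above.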

QUOTE (guruboolez @ Jul 24 2004, 12:39 AM)

It's easy to say, but harder to do.

Exactly! Getting a real p=0.05 is much harder than getting p=0.2, because there is less room for errors. That's why only a real p of 0.05 is considered valid. Otherwise, it is too easy.

QUOTE (guruboolez @ Jul 24 2004, 12:39 AM)

By fixing a precise number of trials (16 could be exhausting, given the difficulty of such tests; 8...12 is more realistic in my opinion),

I use 16 for easy tests, and 8 for hard ones.

QUOTE (guruboolez @ Jul 24 2004, 12:39 AM)

there's a BIG risk: finishing the test with insignificant results.

The significance is given by the real p value, and nothing else. If you finish the test with p above 0.05, it just means that there is more than 1 chance out of 20 of getting such a score by guessing. The risk increases because the test is more significant, that's all.

All these sentences are mathematically equivalent. If you want to make the test easier, then you want to make it less meaningful.

QUOTE (guruboolez @ Jul 24 2004, 12:39 AM)

Fatigue implies more errors, and again the risk of finishing the test with insignificant results.

You can't avoid this, the test must last longer, in order to allow the hearing system to rest.

QUOTE (guruboolez @ Jul 24 2004, 12:39 AM)

By "pushing" the test beyond the fixed value, the tester tries to prove that the difference is not placebo, and that he can hear it, and even ABX it. [...] I sometimes stop the test very quickly when the ABX score is perfect after 5...8 trials.

This is cheating. The p value suffers from random variations. What you are doing is waiting for the p value to get below 0.05 by chance, and deciding to stop there.

Remember when Gabriel got p = 0.003 without listening to anything? http://www.hydrogenaudio.org/forums/index....ndpost&p=151932

If you want to allow more room for errors, increase the number of fixed trials, but in any case, don't stop before the end if you see the p value coming down accidentally, unless you use the corrected p value table linked above.

Your results in this test are entirely based on the ABC/HR ratings you gave, granted that you didn't know which codec was which. The analysis of your ABX values led to no conclusion. I could only show that you got more successes than expected randomly, but since I don't know the standard deviation of the probability of getting p < 0.05 in a 50-trial sequential test (I just know its value), I can't tell if you got significantly more positive results than expected, or randomly more positive results than expected!

By "pushing" the test beyond the fixed value, the tester tries to prove that the difference is not placebo, and that he can hear it, and even ABX it. [...] I sometimes stop the test very quickly when the ABX score is perfect after 5...8 trials.

This is cheating.

No, it's not cheating.

- Listening tests in my apartment are far from ideal listening conditions. There's a lot of noise: computer, fridge, phone, street, road, neighbour. It's easy to miss one trial due to a lapse of concentration following a sudden and disturbing noise; then it's easy to miss the ten following trials due to anger.
- Sometimes, the first trials are bad for other reasons: focusing on the wrong problem, for example [that's why the training module of abc/hr 1.1 is precious].
- Also, don't forget that the encoded files are not all tested (ABX module) in the same conditions: I'm more familiar with the original when I'm testing the sixth and last encoding. For that reason, it's easier to fail on ABXing the first file than the third. My ears are also more saturated on the sixth file than on the third. I may feel it necessary to redo the ABX session for the first file, failed not because it was especially difficult but because I was not very familiar with the reference. The score will report the failure, but not the reason for the failure.

There are many reasons that could explain a failure, and rather than redoing the whole test, it's preferable for practical reasons to "push" the number of trials. It's easy to understand, I suppose. It doesn't sound like "cheating" to me... Stopping at 10/16 when I know that I could obtain a much better score is not far from cheating either...

An ABX score is not only a score. There's a history behind it. It's like the final score of a soccer match: it doesn't tell anything about the quality of the winner/loser. A team could dominate a full match, and finish as the loser. The score would suggest the winner's superiority, but the full match would show the contrary. The same applies to a listening test: bad results could have a reason other than difficulty.

If I decide to stop a test after 16 trials, whatever the history of this test, and to publish the result, people would say: "look, guruboolez rated 1L = 3.2, but he is probably guessing according to the ABX score". At least, that's the conclusion of the statistician...

If I decide to continue the test, it's not to prove to myself that the difference is really audible, but to publish a decent score. The ABX log files don't tell anything about the testing conditions. They don't help to understand why a score is low; they don't show that the tester successfully ABXed the last 32 trials but missed the first 10, but just reveal an enigmatic, unusual, "randomly" stopped 32/42. In my opinion, 32/42 is a much better score than 14/16, if the last 32 trials were correct, and if the first 10 were bad because I focused my attention on the wrong problem.

QUOTE

What you are doing is waiting for the p value to get below 0.05 by chance, and deciding to stop there.

That's a wrong interpretation. I sometimes fail a test for one file, and stop it at 9/16 (it's an example). After I have successfully finished the other files, it happens that I resume this bad test. I can't reset the score, and my new attempt begins with 9/16 and not 0/0. If I decide to add 20 more trials, the final number of trials will be x/36, which is unusual. You could conclude it was a random stop while it was an intended one.

When this kind of situation happens (and it happens very frequently), I generally add a few words in the comments. But sometimes I forget to do it, or I'm too bored to do it. I often stop a test when I succeed in 16 consecutive trials, whatever the final score looks like. The ABX log won't show that. Again, the reader could be fooled by pure numbers.

Yes, I remember. But he performed the test in one session and stopped randomly IIRC. As I said before, multiple scenarios are possible for the same score. Gabriel may have stopped randomly after 26 trials, but for someone else, 26 trials could mean 10+16: a new test of 16 fixed trials inside the global one.

QUOTE

If you want to allow more room for errors, increase the number of fixed trials, but in any case, don't stop before the end if you see the p value coming down accidentally

Again, it's easy to fix principles... But there's a human tester behind score or pval. I could go with 32 trials in order to minimize the impact of ABX errors, but for one or two contenders, and certainly not with 6, at least not at this bitrate.

QUOTE

The analysis of your ABX values led to no conclusion

I never rated the reference. I identified all 60 encoded files, and rated them very carefully (rating was the most important task of the test, and the hierarchy its purpose). I think that's meaningful enough. I don't see what kind of additional conclusion you're trying to build from undescribed ABX scores.

I understand what you are saying. But imagine that you are the one reading the test results. When you see a 20/36 result, how do you know if it was a one-shot test (failed, sequential or not), or a 4/20 failure followed by a 16/16 success?

According to rule 8, the results must prove to the reader that the difference was audible. Maybe, given the way the test was actually done, that is the case, but there is no published result that proves it.

Now, what if the one who did the listening says it all: 4/20, due to lack of concentration, then 16/16?

First, we want to get rid of placebo, and only analyse the results of the blind test. Therefore the comment about concentration can't be taken into account. It is an unproven opinion about the test result.

So we are left trying to compute the p value of the result, that is, the probability of getting p <= 1/65536 in a series of ABX tests beginning with 20 and 16 trials, and with an unknown sequel if the second test had failed (maybe the guy would have tried 12, then 8, and claimed a success, we don't know). The real result of this test takes at least one hour to compute for the people on this board who have enough math knowledge to sort it out, and it is inaccessible to most members, who didn't study probability and statistics.

For example, I can't tell if this 4/20-then-16/16 result has any significance. I don't know if its p value is above or below 0.01. I think that it is probably below 0.05, but I can't prove it in 5 minutes.

A binomial table giving the p value for fixed ABX tests has been made; it is linked in the FAQ about ABX, and its results are the ones given in all ABX software. It allows anyone to perform tests and publish the results. By not following the standard methodology, one makes his results unreadable for most of the community, and gives much analysis work to the math people of the forum.

We have a tool allowing anyone to analyse ABX test results, use it!

In your ABX logs, we can see that you performed a total of 1088 trials. If you had fixed the number of trials at 8 for each sample and codec, you would only have performed 10 x 6 x 8 = 480 trials, all codecs would have been tested, all results would have been understandable, and, most important, the victory of MPC and Vorbis would not have changed! They won with a confidence of 95 % even if all the ABX tests had failed. The rankings show it.

In conclusion, we can't deduce anything from Harashin's results right now. I just hope that the information he will give us about his methodology will help to find some significance in the results, and that he has not done all this in vain.

About your tests, Guruboolez, you see that if you don't follow the standard methodology of fixed, single ABX sessions, it is not necessary to spend much time ABXing in a way from which clear results can't be deduced. The ABC/HR results are enough, thanks to ff123's analyzer, to provide some useful information.

Thanks to this discussion, I think we should be able to write some instructions for ABX testing and include them in the forum rules, as well, maybe, as a tutorial for analyzing ABC/HR results. By the way, shouldn't the Anova analyzer be included in ABCHR software ? ABX software gives the significance of the result, why shouldn't ABC/HR do the same ?

About your tests, Guruboolez, you see that if you don't follow the standard methodology of fixed, single ABX sessions, it is not necessary to spend much time ABXing in a way from which clear results can't be deduced. The ABC/HR results are enough, thanks to ff123's analyzer, to provide some useful information.

I understand. I've tried to do my best to publish "valid" results, in order to avoid possible criticism like "mhhhh, he rated some files, but it doesn't prove to us that he could really hear a difference". But in my opinion, even if the ABX tests could be interpreted as sequential due to the disparity in the total number of trials, even if the pval goes from 0.05 to 0.2 because I didn't respect the number of trials I had fixed beforehand, the ABX scores I obtained are certainly better than nothing. If a statistician can't be happy, another reader with common sense could say: "well, 39 out of 59, pval = 0.009 for Dover, Giustizia.mpc sample, it's probably not luck".

QUOTE

Now, what if the one who did the listening says it all: 4/20, due to lack of concentration, then 16/16? First, we want to get rid of placebo, and only analyse the results of the blind test. Therefore the comment about concentration can't be taken into account. It is an unproven opinion about the test result.

If someone wants to adopt a suspicious attitude towards the results, there's no need for him to look at the validity of the ABX scores and the real pvalue they imply: he could simply question the authenticity of the log file.

All these results are based on trust: trust in the methodology, trust in the listener, trust that he tried to prove that a difference really exists and that he could hear it. Failing (partially or completely) an ABX test may lead to the conclusion that the listener probably can't hear a difference. This conclusion is wrong: multiple ABX sessions are not always a good thing. Differences are sometimes very subtle, and can't survive an intensive test like ABX. That's why some people have tried to perform long-term ABX tests to prove that a difference could be audible in other listening conditions. I've tried to obtain "valid" (or if not, "good") scores in listening conditions which were not ideal. I'll probably continue this way... probably not the "best" or the "ideal" way, but probably the most practical given the difficulty of such tests.

If a statistician can't be happy, another reader with common sense could say: "well, 39 out of 59, pval = 0.009 for Dover, Giustizia.mpc sample, it's probably not luck".

Keep in mind that if someone says this, we will fight this interpretation, since it is wrong and spreads misinformation.

QUOTE (guruboolez @ Jul 24 2004, 02:37 PM)

If someone wants to adopt a suspicious attitude towards the results, there's no need for him to look at the validity of the ABX scores and the real pvalue they imply: he could simply question the authenticity of the log file. All these results are based on trust: trust in the methodology, trust in the listener, trust that he tried to prove that a difference really exists and that he could hear it.

I don't think so. Audiophiles are not evil. When they say they can hear a difference, they don't lie in order to fool us, they really believe they do. The widespread existence of a strong placebo effect has led us to listen not to opinions, but to facts. Opinions about sound quality are honest. Often wrong, but 99.9 % of the time honest. So are log files. We can trust them 99.9 % of the time. But unlike opinions, they are facts that can be interpreted.

QUOTE (guruboolez @ Jul 24 2004, 02:37 PM)

Failing (partially or completely) an ABX test may lead to the conclusion that the listener probably can't hear a difference. This conclusion is wrong

The right conclusion is that he didn't hear the difference at least when he made the mistakes. For the rest of the test, we can't know. No proof. Still waiting for a positive result.

QUOTE (guruboolez @ Jul 24 2004, 02:37 PM)

That's why some people tried to perform long-term ABX tests to prove that a difference could be audible in other listening conditions.

I remember the 24 to 16 bits test, passed after several days, but as far as I remember, it was not a sequential test, was it?

QUOTE (guruboolez @ Jul 24 2004, 02:37 PM)

I've tried to obtain "valid" (or if not, "good") scores in listening conditions which were not ideal. I'll probably continue this way... probably not the "best" or the "ideal" way, but probably the most practical according to the difficulty of such tests.

Do as you wish, but stay tuned to the forum rules and tutorials: they might soon be updated in order to point out the non-significance of such results, if other specialists agree. Once done, interpreting the results as a success will be a violation of rule 8.

Whatever way you ABX, keep on the good work with ABC/HR, it is deeply appreciated !

If a statistician can't be happy, another reader with common sense could say: "well, 39 out of 59, pval = 0.009 for Dover, Giustizia.mpc sample, it's probably not luck".

Keep in mind that if someone says this, we will fight this interpretation, since it is wrong and spreads misinformation.

Misinformation? Could you be more precise? What chance do I have of obtaining this result by guessing? I wonder... I can't obtain this kind of result by ABXing MPC Q10 in ideal conditions, but I can obtain it in bad conditions with mpc Q5. It's definitely not luck.

QUOTE

We can trust them 99.9 % of the time. But unlike opinions, they are facts that can be interpreted.

Failing (partially or completely) an ABX test may lead to the conclusion that the listener probably can't hear a difference. This conclusion is wrong

The right conclusion is that he didn't hear the difference at least when he made the mistakes. For the rest of the test, we can't know. No proof. Still waiting for a positive result.

No proof of what? If you take a look at the log files I've posted, I sometimes add comments about the score's evolution. Apparently, you're not taking this into account, because you don't know how to compute this situation.

QUOTE

QUOTE (guruboolez @ Jul 24 2004, 02:37 PM)

That's why some people tried to perform long-term ABX tests to prove that a difference could be audible in other listening conditions.

I remember the 24 to 16 bits test, passed after several days, but as far as I remember, it was not a sequential test, was it?

I'm not talking about 16 vs 24 bit, but about people trying to ABX high bitrate encodings after listening to the same disc many, many times.

QUOTE

QUOTE (guruboolez @ Jul 24 2004, 02:37 PM)

I've tried to obtain "valid" (or if not, "good") scores in listening conditions which were not ideal. I'll probably continue this way... probably not the "best" or the "ideal" way, but probably the most practical according to the difficulty of such tests.

Do as you wish, but stay tuned to the forum rules and tutorials: they might soon be updated in order to point out the non-significance of such results, if other specialists agree. Once done, interpreting the results as a success will be a violation of rule 8.

I'd like to see it. The consequences would be funny. Most listening tests already done are simply invalid. Roberto's tests should be removed from the news, because they don't respect some scientific conditions for practical reasons (pval of 0.01, too few samples, not enough listeners, disparity between critical and easy listeners, etc... ff123 already pointed out those limits). All HA tacit knowledge should be eradicated, because proof of MPC superiority against the other contenders was NEVER published (but it's a common and shared idea). The "recommended encoder and setting" threads could simply be erased, except maybe for the long-tested 3.90.3. The GT3b2/aoTuV/Megamix... recommendations are all based on invalid tests. Enforce the rule #8 conditions, and the only "valid" tests you'll see will be for 32 or 64 kbps. HA will be a place of reliable knowledge about low quality audio encoding, and a vast desert of uncertainty, because no tester on this board will have enough courage to risk publishing a listening test following the "rules".

Tests and the evolution of knowledge were possible on this board because absolutely strict conditions were never requested. Limits on rigour were always accepted, for practical reasons, and even with this relaxed attitude, few testers are posting results. I don't know what you or someone else on this board expect from more exactness. Chaos? Assumptions only?

QUOTE

Whatever way you ABX, keep on the good work with ABC/HR, it is deeply appreciated

Don't be sarcastic or insincere. I doubt that something considered as invalid and "soonly" illegal could be really appreciated.

Guruboolez, I've got no time to answer your last post right now, but since you ended with a negative feeling, I'd like all the same to clarify one point now : I think that you don't understand the meaning of the Anova analysis performed by Roberto in his tests, and that I performed on yours.

For a chosen confidence level, it gives the result of the test. For example, in this test, it shows that you found MPC superior to Vorbis, and Vorbis superior to the rest with p<0.05. It means that there is less than one chance out of 20 that you rated them higher accidentally. Which makes your results (as well as those of Roberto's tests) perfectly valid. That's why I was thanking you. I'm not used to thanking people in a sarcastic way, nor to posting ambiguous messages.

I just wanted to point out that we are making a fuss on ABX methodology, and that it has nothing to do with your test results, because people who can't bother to read all that we post would otherwise think that I discuss your conclusions while I'm just discussing the possible analysis of your ABX results, that nearly nobody read anyway, since they are hidden as an addition to your log files.

Ratings are meaningful without the need for ABX tests, because there is no way (or to be precise, a chance lower than the p value) that one codec comes first every time if you don't know which one is which, since the ABCHR software hides them. Anova is a way of computing the p value for this event.
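A back-of-the-envelope version of this argument, as a Python sketch. It is not the Anova computation itself, only the simpler question of how likely it is that ranks assigned at random would put one codec first on every sample. The 6 codecs and 10 samples are the figures from this test; the function names are hypothetical, and ties are ignored as a simplifying assumption.

```python
from fractions import Fraction

def chance_always_first(codecs: int, samples: int) -> Fraction:
    """Chance that one given codec ranks first on every sample if ranks are random."""
    return Fraction(1, codecs) ** samples

def chance_any_always_first(codecs: int, samples: int) -> Fraction:
    """Same, for whichever codec it happens to be (the events are mutually exclusive)."""
    return codecs * chance_always_first(codecs, samples)

p = chance_any_always_first(codecs=6, samples=10)
print(f"{float(p):.2e}")
```

Real ratings are rarely that clear-cut, which is why an analysis such as Anova, working on the ratings themselves rather than on "first every time", is needed to get an actual p value.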

So to make it short, we have MPC>Vorbis>other codecs for p < 0.05 in your test (I didn't compute the results for other p values). The ABX results reported in your logs don't provide much more information, or it is hidden from me, nor do Harashin's.

ABX results are one thing: the results are not meaningful, so claiming they are positive is a rule 8 violation. ABC/HR ratings are another thing: the results are meaningful, and the work is appreciated.

ABC/HR ratings without ABX confirmation don't amount to much... It's a blind test OK, but not a double blind one. Such tests won't be really and genuinely accepted. Look at the LAME (3.90.3 vs new release) testing phase for example:

QUOTE

4. Your test results have to include the following:

* ABX results for: 3.90.3 vs. Original; 3.96 vs. Original; 3.96 vs. 3.90.3
* ABC/HR results are appreciated especially at lower bitrates, but shouldn't be considered a requirement.
* (Short) descriptions of the artifacts/differences

Those conditions are required. Ratings without ABX tests are often considered useless. ABX tests are requested, especially those pitting different encoders against each other. So please don't try to say that ABC rankings alone are appreciated when other threads or people's reactions show that without ABX confirmation, these ratings are considered as wind...

When comparing different codecs in abchr.exe, the purpose of the abx module is really just to help clarify in the listener's mind how he thinks things should be rated in the abc/hr module. Pio's point is that abx results by themselves (without the ratings) don't say anything about the relative standings of the codecs. I agree with that.

ABX: purpose is to determine if an individual can reliably detect a difference between 2 files using multiple trials.

ABC/HR: purpose is to determine preference between 2 or more codecs, but not necessarily reliably! Multiple listeners or multiple samples increase reliability for ABC/HR in the same way the multiple trials increase reliability for ABX. Generally, it is more important that multiple samples be tested than multiple people.

The helper role of the abx module in abchr.exe version 1.1 (I need to spend a little time to clean up the last few minor bugs) is further emphasized since it unhides the hidden reference after a successful abx run.

(...) Pio's point is that abx results by themselves (without the ratings) don't say anything about the relative standings of the codecs. I agree with that.

I also agree. That's why I spent much more time and attention on the rating phase. When testing many encoders, I'm only interested in the hierarchy (the best, the second best, etc...). ABX scores can't reveal anything about quality, or even about difficulty. I also agree that harashin's results don't give me any information about the relative quality of the three different formats; I just know that there's a serious chance that he heard a difference between the encoded files and the reference. In my opinion, the ABX phase is useful for three things:

• helping me to refine the ratings (I often lower or raise some ratings after ABX tests).

• giving myself assurance that I wasn't dreaming about possible artifacts when I rated the different encoders (i.e. avoiding placebo). Useful when, during the ABC/HR phase, I've ranked two or more files with a slight difference: if I can't ABX this difference, I often change the rating and give the same one to both files [or sometimes, even if I failed to ABX the difference, I just maintain a slight difference of 0.1 point for the file I still suspect to sound better]

• giving others the feeling (or in the best case the proof) that the differences were really audible. I'm sorry to repeat it again, but I consider something like 45/60 better than nothing. At least when I ended the test with a nice consecutive series of correct trials.

[/quote]Keep in mind that if someone says this, we will fight this interpretation, since it is wrong and spreads misinformation.[/quote]Misinformation? Could you be more precise? How many chance do I have to obtain this result by guessing?[/quote]

In the case of a sequential ABX test, the pval can't be 0.009 for 39 out of 59, since that is the pval for a fixed ABX test. Saying pval = 0.009 is misinformation. The max number of trials must be known, and the corrected p-val table must be extended to this number to get the right value.

My conclusion is that codec B must be a bit underrated, since an "annoying difference" couldn't be distinguished from the original 4 times (unless the tester states that he hit the wrong button). I don't know how to interpret the ABX scores, since I don't know if they were run in a sequential or a fixed way. From a fixed point of view, however, the confidence level is high.

[quote=guruboolez,Jul 24 2004, 03:38 PM]No proof of what? If you take a look at the log files I've posted, I sometimes add comments about the score's evolution. Apparently, you're not taking this into account, because you don't know how to compute this situation.[/quote]

Exactly. I'm not going to spend a whole weekend trying to analyse partly sequential ABX results with additional conditions, with pages of calculations, while we have a binomial table that gives us the result at once if the number of trials is fixed in advance. Most people on this board have hammered home (though I'm not sure I repeated it in the ABX tutorial) the necessity of fixing the number of trials before the test begins OR not looking at the results during the test for them to be valid.

[quote=guruboolez,Jul 24 2004, 03:38 PM][quote][quote=guruboolez,Jul 24 2004, 02:37 PM]That's why some people tried to perform long-term ABX tests to prove that a difference could be audible in other listening conditions.

[/quote]I remember the 24 to 16 bits test, passed after several days, but as far as I remember, it was not a sequential test, was it?[/quote]I'm not talking about 16 vs 24 bit, but about people trying to ABX high bitrate encodings after listening to the same disc many, many times.[/quote]

So what? Long term or short term doesn't change the methodology... Either the number of trials is fixed, or you don't look at the results until the test is finished, or you fix a maximum number of trials and use the corrected p-val table. The three methods are valid for short or long term tests.

[quote=guruboolez,Jul 24 2004, 03:38 PM]I'd like to see it. Consequence would be funny. Most listening tests already done are simply invalid. Roberto's test should be removed from news, because they are not respecting some scientific conditions for practical reason (pval of 0.01, too few samples, not enough listeners, disparity between critical and easy listeners, etc... [/quote]

Roberto's results are perfectly valid:
- Tests were double blind
- Pval is strictly below 0.05 (<0.01 is a good thing, <0.05 is required)

The limits pointed out by FF123 have nothing to do with the results of the test, but with its scope. In the same way, your test is valid in itself, because you get a success with p < 0.05, but the scope is very narrow, because you were the only one listening, and it is not certain that someone else would get the same (valid) results. It's like saying "this man is taller than this woman". The test consists in measuring them. The results are:
Man 181 cm
Woman 176 cm
The right conclusion is "this man is taller than this woman". The result is valid, proven by a repeatable experiment on the same two people. But the scope is very narrow: we can't conclude that every man is taller than any woman.

[quote=guruboolez,Jul 24 2004, 03:38 PM]All HA tacit knowledge should be eradicated, because no proof of MPC's superiority against other contenders was EVER published (but it's a common and shared idea).[/quote]

You just published it (implicitly) at the top of this thread. Your result is valid, V.A.L.I.D. Can't you read the Anova log I posted and its conclusion? Here's the first link I found about Anova searching the web: http://www.psychstat.smsu.edu/introbook/sbk27.htm
The column it talks about refers to another piece of software, and the value discussed is the P-Value.

[quote]If the number (or numbers) found in this column is (are) less than the critical value (α) set by the experimenter, then the effect is said to be significant. Since this value is usually set at .05, any value less than this will result in significant effects, while any value greater than this value will result in nonsignificant effects. If the effects are found to be significant using the above procedure, it implies that the means differ more than would be expected by chance alone. In terms of the above experiment, it would mean that the treatments were not equally effective. This table does not tell the researcher anything about what the effects were, just that there most likely were real effects.
If the effects are found to be nonsignificant, then the differences between the means are not great enough to allow the researcher to say that they are different. In that case, no further interpretation is attempted.[/quote]

First, yes it is. Your computer is hiding the names of the samples, and you have no other way of finding the reference than your ears. Therefore the test IS double-blind.
A simple blind test would be a listening test between a pressed CD and an original one, for example, with someone putting the CD in the drive for you to listen to. By listening to what he does with the CD he takes out of the drive, you might tell whether he is putting the same one back in, or setting it aside and inserting the other. This is a simple blind test. For it to become double blind, you'd have to use 10 identical CD players, each with a CD hidden inside. You're left alone in the room, and must tell which players hold an original and which ones hold a copy. This is a double blind test, because you can't be influenced by the operator. Fortunately, computers allow us to hide and play samples without any means for us to guess which one is being played.

Those conditions are required. Ratings without ABX tests are often considered useless. ABX tests are required, especially for tests pitting different encoders against each other. So please don't try to say that ABC rankings alone are appreciated when other threads and people's reactions show that, without ABX confirmation, these ratings are considered as wind...

[/quote]

This is because until now, Roberto was the only one to use Anova analysis on ABC/HR tests. Remember your last test. You posted some rankings, and they were discussed. I was on the verge of brandishing rule 8, but instead I asked if someone could compute the results and post the graph with error bars. No one did.
This time you tested Lame vs Vorbis vs MPC at high bitrate. Since I found this test very important, and I saw that no one had been able to compute the results last time, I read Roberto's pages more carefully, and found FF123's Anova analyzer.

When you rate MPC superior to MP3 9 times out of 10 and get p < 0.05 in the ABC/HR Anova analysis, it is mathematically equivalent to succeeding in a fixed ABX test with p < 0.05.
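A simplified way to see the parallel is a sign test on the ratings, sketched below. Note the assumptions: the ratings are invented for illustration, and a sign test only counts which codec was rated higher on each sample, whereas the Anova analysis also uses the size of the differences, so this is an analogy and not FF123's actual computation.

```python
from math import comb

def at_least(successes, n):
    """Probability of at least `successes` out of n under a 50/50 chance."""
    return sum(comb(n, k) for k in range(successes, n + 1)) / 2 ** n

# Hypothetical ABC/HR ratings on 10 samples (made-up numbers).
mpc = [4.5, 4.0, 5.0, 4.2, 4.8, 3.9, 4.6, 4.4, 4.1, 3.5]
mp3 = [3.0, 3.5, 4.0, 3.8, 4.0, 3.0, 4.0, 4.0, 3.6, 3.8]

# Keep only samples where the two ratings differ, then count MPC "wins".
pairs = [(a, b) for a, b in zip(mpc, mp3) if a != b]
wins = sum(1 for a, b in pairs if a > b)

sign_test_p = at_least(wins, len(pairs))   # MPC preferred 9 times out of 10
abx_p = at_least(9, 10)                    # fixed ABX run scoring 9/10

print(wins, len(pairs), sign_test_p, abx_p)
```

Both p-values come from the same binomial sum, which is the sense in which 9 preferences out of 10 and a fixed 9/10 ABX score carry the same weight.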

The ABC/HR results tell this, not the ABX ones. They show, among other things, that you can consistently hear the difference between MPC and MP3 with the settings you chose, on the samples you chose. It has not been pointed out much outside Roberto's tests, but ABC/HR can be a substitute for ABX. I think that it's time to explain this in a tutorial. Your test proves the great usefulness of this method of testing, even for one person with several samples, instead of several people and several samples.
It should even work with one person and one sample, but with multiple ABC/HR sessions. I think it should be considered in future ABC/HR software.

It should even work with one person and one sample, but with multiple ABC/HR sessions. I think it should be considered in future ABC/HR software.

You're talking about multiple trials of rating a codec in the abc/hr module. For example, rate a certain number of codecs for trial 1, then reshuffle them and rate them again for trial 2. At the end of N trials, one could average the ratings. On the face of it, it would seem the more codecs there are, and the less difference between them, the more benefit one could get from a procedure like this. Imagine testing just two codecs of very different quality. Then it doesn't make much sense to repeat the ratings: they will be rated exactly the same every time.

So I tend to think that rating more music clips is probably better than trying to get the variability out of the ratings for a single music clip.

My conclusions are that codec B must be a bit underrated, since an "annoying difference" couldn't be distinguished from the original 4 times (unless the tester states that he hit the wrong button).

The problem is, that it could be. Some reasons are easy to explain.

Imagine that you're testing many formats in the same test. The first step is to rate each file. The first one (1L) is excellent, very hard to distinguish (4.5/5). You're not even sure that the difference really exists. The second file suffers by comparison: coarseness is clearly audible (2/5). Second step now: ABX. The first file is hard to ABX; a lot of concentration is needed. I could distinguish a slight amount of pre-echo on a precise range, that's all. 14/16 [16 as fixed value]: not bad. The second file should be much easier to ABX. But the first six trials are bad (2/6). Why? Because all my attention is focused on pre-echo I can't hear, simply because the file doesn't suffer from this problem. By changing the selected range a bit and focusing my attention on another problem, I'll find again the annoyance I immediately detected the first time, and perform a very nice 16/16 in 2 minutes. Final score is 18/22.
Is your conclusion still the same: "codec B must be a bit underrated"?

There's a serious problem with tests including more than one encoded file: conditions are not equal for all. By changing the order, you could change the ABX scores. Beginning with an easy test could help you warm up your ears and give you confidence, but an easy 'victory' could also handicap you by making you overconfident, etc. You could be tired after two files if you begin with the two most difficult, etc.
Of course, the solution would be to rest your ears as often as you want, to take care of your concentration... to be like a sportsman during a competition. The problem is that some people (including me) can't always spend three or four hours just to complete one single test including 6 contenders.

QUOTE

Exactly. I'm not going to spend a whole week-end trying to analyse partly sequential ABX results with additional conditions (...) especially after most people on this board have hammered (but I'm not sure if I repeated it in the ABX tutorial) the necessity of fixing the number of trials before the test begins OR not looking at them during the test for the results to be valid.

Nobody forces you to analyse these ABX results.
What kind of conclusions could you draw by computing ABX scores (I'm serious, I still don't understand)? What could you conclude when you see that one file was ABXed at 10/16 and the other one at 15/16? That the second one has stronger flaws? That's a wrong conclusion. The tester is not a robot, is not living in a studio and is not a champion. He can't necessarily maintain the same level of concentration during a whole test; he can't necessarily keep his ears at the same level of freshness; he logically doesn't have the same familiarity with the reference during the first ABX session as during the sixth and last one... By fixing a strict number of trials, you're solving problems if and only if the tester has maintained the same listening abilities (a generic term including freshness, concentration, motivation, patience, silence in the room) during the whole test.
If the tester admits that his listening conditions changed during a test, there's no need to spend a weekend, or even a minute, computing additional data based on ABX scores, which represent nothing (at least, they don't only reflect the level of difficulty of the samples, but could also reflect the variations of the listening conditions themselves).

QUOTE

Roberto's results are perfectly valid:
- Tests were double blind
- Pval is strictly below 0.05 (<0.01 is a good thing, <0.05 is required)

And what about the number of listeners? What about the samples? Many people, including JohnV, ff123 and others, have pointed out that different samples might seriously change the results. Roberto's tests are probably valid (he can't use 100 samples and force 200 members of HA to participate in the test), but conclusions built upon the final results are often... questionable: Faac tied with Nero AAC, or WMA@128 close to having a "perceptible but not annoying" difference.

QUOTE

Your result is valid, V.A.L.I.D. Can't you read the Anova log I posted and its conclusion?

OK, I was a bit angry. Sorry

QUOTE

(...) Therefore the test IS double-blind. (...) A simple blind test would be (...)

Thank you for the explanation. I thought that a double blind test was a single blind test repeated twice.

QUOTE

When you rate MPC superior to MP3 9 times out of 10 and get p < 0.05 in the ABC/HR Anova analysis, it is mathematically equivalent to succeeding in a fixed ABX test with p < 0.05.

But it's only true under some conditions, isn't it? The level of degradation (artifacts) could also play a role, I suppose.

QUOTE

It has not been much pointed out outside Roberto's tests, but ABC/HR can be a substitute for ABX. I think that it's time to explain this in a tutorial.

I'm learning a lot of different things (though it's sometimes confusing). A tutorial would be needed.

If I have further questions, I'll probably ask them in French (private message): comprehension should be easier for me.

Anyway, thanks for the long explanations. And sorry again for the irritating tone of my previous posts.