The minimal number of trials depends on how successful you are. How quickly you succeed shows how confident you are.

The time taken & the number of successful trials are tied; you should never separate them when judging an ABX log.

With the F2K ABX component, 8 successful trials in a row is the minimum for me (& if they are all successful in a row, it usually means the test was quick).

As soon as you begin to fail, you can easily increase to 10 or 12 trials to try to "erase" your failures. In this case, if you fail once or twice you can usually still get a significant result, although it usually means the ABXing was hard & consequently took longer as you began to hesitate.

Usually, if you begin to fail more than 3 times in 12 trials, it becomes so hard & you hesitate so much that it takes forever to ABX. At this stage I usually give up & declare that I cannot ABX, as in general it means I am not sure that the part of the audio I am focusing on actually contains any real artefact.
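
For anyone who wants to check these thresholds, here is a minimal Python sketch of the usual one-sided binomial calculation behind an ABX score. It assumes every trial is an independent 50/50 guess; the function name `abx_p_value` is only illustrative and is not part of the F2K component.

```python
from math import comb

def abx_p_value(correct: int, trials: int) -> float:
    """Chance of getting at least `correct` right out of `trials` by pure guessing."""
    # Count every outcome with `correct` or more right answers,
    # then divide by the 2**trials equally likely guess sequences.
    return sum(comb(trials, k) for k in range(correct, trials + 1)) / 2 ** trials

for correct, trials in [(8, 8), (9, 10), (8, 10), (11, 12), (10, 12), (9, 12)]:
    print(f"{correct}/{trials}: p = {abx_p_value(correct, trials):.4f}")
```

Under that assumption, 8/8, 9/10, 11/12 and 10/12 all come out below 0.05, while 8/10 (about 0.055) and 9/12 (about 0.073) do not.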

It is expected that you choose the number of trials in advance rather than continue testing in hopes that you get the desired result. This is spelled out pretty clearly in the link for those who bother to read it.
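
To see why stopping whenever the score happens to look good is a problem, here is a rough Python sketch. It assumes a listener who is purely guessing, checks the one-sided binomial p-value after every trial, and stops as soon as it drops below 0.05. The helper names `p_value` and `chance_of_lucky_stop` are hypothetical.

```python
from math import comb

def p_value(correct: int, trials: int) -> float:
    """One-sided binomial p-value for `correct` or more right out of `trials`."""
    return sum(comb(trials, k) for k in range(correct, trials + 1)) / 2 ** trials

def chance_of_lucky_stop(max_trials: int, alpha: float = 0.05) -> float:
    """Chance that a pure guesser sees p < alpha at some point within max_trials."""
    alive = {0: 1.0}   # correct-count -> probability of still testing with that count
    stopped = 0.0      # probability of having already stopped on a "significant" score
    for n in range(1, max_trials + 1):
        advanced = {}
        for correct, prob in alive.items():
            # Each guess is wrong or right with probability 1/2.
            for new_correct in (correct, correct + 1):
                advanced[new_correct] = advanced.get(new_correct, 0.0) + prob / 2
        alive = {}
        for correct, prob in advanced.items():
            if p_value(correct, n) < alpha:
                stopped += prob          # the guesser stops here and claims success
            else:
                alive[correct] = prob    # keep testing
    return stopped

for cap in (5, 8, 12, 16):
    print(f"up to {cap} trials: {chance_of_lucky_stop(cap):.4f}")
```

With a fixed 5 trials the chance of a false "pass" is 1/32, but allowing yourself to stop anywhere up to 8 trials already pushes it slightly above 5%, and it keeps growing as the cap increases.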

Additionally, if you're going to do more samples, you should do more trials per sample, since the odds of being wrong by chance increase the more times you test. So 8 trials might be good for you for one sample, but if you do 2 samples, you have roughly doubled your chance of getting a false positive.

The problem with a test like this is that there are a lot of samples (10 total!) and only a few trials per sample. So the odds of getting these results by just tossing a coin are actually fairly high. If I'm doing my math right, there is a greater than 1/4 chance that at least one of these 10 samples will return 5/5 correct choices even with just random guessing!
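
The arithmetic here can be sketched in a few lines of Python, assuming each sample is tested independently and the listener is guessing on every trial (the function name is only illustrative):

```python
def chance_of_lucky_perfect_score(samples: int, trials_per_sample: int) -> float:
    """Chance that at least one sample gets a perfect score by guessing alone."""
    p_single = 0.5 ** trials_per_sample   # one sample, all trials guessed right
    return 1 - (1 - p_single) ** samples  # complement of "no sample is perfect"

print(f"1 sample  x 8 trials: {chance_of_lucky_perfect_score(1, 8):.4f}")
print(f"2 samples x 8 trials: {chance_of_lucky_perfect_score(2, 8):.4f}")
print(f"10 samples x 5 trials: {chance_of_lucky_perfect_score(10, 5):.4f}")
```

For 10 samples of 5 trials this gives about 0.27, which matches the "greater than 1/4" estimate. For 8 trials, going from one sample to two roughly doubles the chance of a lucky perfect score (from about 0.4% to about 0.8%).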

Well, there is theory & there is real-life practice. I agree that choosing the number of trials before the test & choosing a higher number of trials to avoid randomness is better ... but just try to organize a real test with several samples, several encoders (& maybe even several bitrates) ... you will quickly realize that the time it takes goes exponential very quickly, especially as the part where you actually decide what to test also takes time & usually goes unnoticed by beginners who read the test ...

To be short, theoretical perfection is indeed better, but time is the enemy. If you cannot take the necessary time to ABX correctly, you'd better refrain from making any quality claims. That's why serious ABXing on a large scale is rare.

I have wanted to re-ABX Apple AAC & CELT for years, & so far I have always been unable to run the test as it would need 3 days of work in a row, which is discouraging as I use 100% lossless anyway.

You can mitigate this to some extent by carefully picking what you will ABX. If A/B comparisons of a sample and control reveal no apparent differences, it's probably not worthwhile to ABX it. Just announce that you can't tell the difference and move on to something else. Failing an ABX test proves nothing, so there's no sense doing it needlessly.

Looking at this test, I think the OP did more trials than he needed to, but he did them on the wrong samples. It seems like he only thinks he can ABX one or two of the 10 samples, so the obvious thing to do would have been to do about 10-12 trials of each of those few samples.

He should be conducting sets of 16 trials at first to get a feeling about how these things work, rather than being discouraged into taking shortcuts.

At this point in time it is only taking him about two minutes per test set and he's telling us the results are legitimate. He complains that doing something other than a half-baked job will take too long and you're giving the impression (intentional or not) that what he's doing is OK.

I never said I judged this test valid; I only gave my opinion about what is a valid methodology for me. I usually only trust ABXers whose results I have personally double-tested, so I don't know if it's valid for me & honestly I don't care, as I ain't gonna re-test this.

Was this directed at me? If so, sorry, I don't think I understand what you mean. If not, please ignore.

saratoga: Yes, I had the feeling that you were thinking that I was defending the topic starter by stating that "in some cases (complete success on an identified artefact) fast ABXing with a low number of trials (but not too low indeed) can be perfectly valid". That's not the case; I am not defending the topic starter. I am just trying to help him, as HA users can be relentless while he seems to be making some effort to conform to the TOS.

Edit: Obviously his number of trials is too low & his "multiply my results" method is wrong; it's especially wrong as soon as you begin to fail.

Are you confusing me with another, maybe deleted post? I never said that . . .

I hate to say it, but your testing was flawed. 5-6 trials for a single song isn't nearly enough. You should conduct the test and make at least 10 determinations as to which song is which. 5 guesses is far too few, as the results can be skewed. Increasing the number of trials (i.e. how many times you pick which song is which) is necessary. People gave you links in your original post, where you did absolutely no testing, but it looks like you didn't fully read them.

Flipping the coin 5 times takes a really short time. But that doesn't mean the results will be flawed, because it will take much more time to get the same side of the coin 5 times. More exactly, it takes [on average] 32 sets * 5 tries (160 tries) to get a false positive with just random clicking. And a listener would have to be really insane to listen to anything 160 times, hence it's not a realistic scenario.

Here's another example where numbers can fool us. If we test 20 cables, one by one, to find out whether they have an effect on the sound, and if we consider p < 0.05 a success, then in the case where no cable has any actual effect on the sound, since we run 20 tests, we should still expect on average one accidental success among the 20 tests! In this case we absolutely cannot say that the cable affects the sound with a probability of 95%, even though p is below 5%, since this success was expected anyway. The test failed, that's all.
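
The cable example works out as follows in a short Python sketch, assuming the 20 tests are independent and none of the cables has a real effect. The function names are hypothetical, and the Bonferroni threshold is not something the post proposes, just one standard (and conservative) way to account for the number of tests.

```python
def expected_false_positives(tests: int, alpha: float = 0.05) -> float:
    """Average number of accidental 'successes' when nothing is audible."""
    return tests * alpha

def chance_of_any_false_positive(tests: int, alpha: float = 0.05) -> float:
    """Chance that at least one of the tests passes by luck alone."""
    return 1 - (1 - alpha) ** tests

def bonferroni_threshold(tests: int, alpha: float = 0.05) -> float:
    """Per-test p-value limit that keeps the overall false-positive rate near alpha."""
    return alpha / tests

cables = 20
print(f"expected accidental successes: {expected_false_positives(cables):.2f}")
print(f"chance of at least one:        {chance_of_any_false_positive(cables):.4f}")
print(f"per-test threshold (Bonferroni): {bonferroni_threshold(cables):.4f}")
```

With 20 cables that is one expected accidental success and roughly a 64% chance of at least one, so a single p < 0.05 result among them says very little.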

Five in a row can happen right off the bat or can happen somewhere later on. If we were to assume that all of his results were guesses then all of a sudden five in a row at some point in time doesn't even remotely touch upon the unreasonable.
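
As a rough illustration of this point, the Python sketch below computes the chance that a run of five correct answers shows up somewhere in a longer sequence of pure guesses. Treating all trials as one continuous sequence is a simplifying assumption (the actual test was split into separate sets), and the function name is only illustrative.

```python
def chance_of_streak(total_trials: int, streak: int = 5) -> float:
    """Chance that `streak` correct guesses in a row occur somewhere in the sequence."""
    # alive[s] = probability of currently being on a run of s correct answers
    # without having completed a full streak so far.
    alive = [0.0] * streak
    alive[0] = 1.0
    for _ in range(total_trials):
        nxt = [0.0] * streak
        for s, prob in enumerate(alive):
            nxt[0] += prob * 0.5            # a wrong guess resets the run
            if s + 1 < streak:
                nxt[s + 1] += prob * 0.5    # a correct guess extends the run
            # otherwise the streak completes and this mass leaves `alive`
        alive = nxt
    return 1 - sum(alive)

for n in (5, 10, 25, 50):
    print(f"{n} guessed trials: {chance_of_streak(n):.4f}")
```

For exactly 5 trials this is 1/32, but it climbs steadily as more guessed trials are added.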

...then there is the glaring zero out of five which you seemed to have overlooked!

Now of course there will be situations where artifacts are so obvious that five out of five would be considered acceptable, but that's where reproducibility comes in. Unfortunately we don't have clips that are 30 seconds or less for others to verify.

Are five trials enough in this situation? No, they are not!

Until an administrator or another moderator says otherwise, this is the way things stand with regards to this discussion in its current state.

Well, I know this topic isn't about me, but my opinion is that ABXing is not pure statistics ... as soon as you select your sample & bitrate it is flawed statistics, because you usually select the bitrate & sample in order to get failure from the start. Unlike a heads-or-tails coin flip, nobody knows the average probability of failure/success of a given ABX test, because it depends both on how hard the ABX test is (samples/bitrate/encoders ...) & on how good the listener is (ears/experience/patience ...). The coin is flawed, so applying pure math is good, but it has its real-life limits.

So is there a minimal number of trials to identify an artefact? For yourself the answer is NO, not really ... well, usually 2 or 3 to be honest. For yourself & yourself only, what matters is not the number of trials but the fact that you can identify the artefact. Identifying the artefact means that you know WHEN it happens in the sample & that you can DESCRIBE it. Knowing WHAT happens & WHEN it happens is what matters the most to me, because it means that you can ABX it for yourself 100% of the time no matter the number of trials.

The number of trials is only useful to convince others that you're not talking complete bullshit. This is why a minimum number of trials to get meaningful statistical value is useful.

Science means that you can repeat the experiment. Once you know for yourself what you hear, the number of trials & how fast you can repeat your success is only useful to convince others. Obviously you need a higher number of trials to convince others than to convince yourself, because they are lazy & won't double-check your results.

So between 5 trials, which is only good for yourself, & 16 trials, which is overkill for you, there is a real-life "in between" which is statistically valid, & usually it is between 8 & 12 depending on how successful you are.
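
To put numbers on that "in between", here is a small Python sketch that lists, for each number of trials, the lowest score that still gives p < 0.05 under the usual assumption of independent 50/50 guesses. The function name `min_correct` is hypothetical.

```python
from math import comb

def min_correct(trials: int, alpha: float = 0.05):
    """Smallest number of correct answers with a one-sided p-value below alpha.

    Returns None when even a perfect score is not significant.
    """
    tail = 0  # number of guess sequences with at least `correct` right answers
    for correct in range(trials, -1, -1):
        tail += comb(trials, correct)
        if tail / 2 ** trials >= alpha:
            # `correct` is no longer significant, so the answer is one above it.
            return correct + 1 if correct < trials else None
    return None

for n in (4, 5, 8, 10, 12, 16):
    print(f"{n} trials: need at least {min_correct(n)} correct")
```

Under that assumption a perfect 5/5 is required with 5 trials, one miss is allowed at 8 or 10 trials, two at 12, and four at 16.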

PS: Sorry if I was a little too extensive about myself...

Edit: Typo: expensive > extensive. As you can see, saratoga, my English is not perfect, so sometimes there are communication breakdowns. Sorry.

That's not that easy. It involves different issues: credibility (whether it was an ironic or double-meaning post), bugs, etc. We don't know that for sure, do we? Otherwise you're shooting yourself right in the foot (if not in the other place as well). Because after years of applying TOS8 you mention a post where 12/13 isn't valid? Sorry, with all respect, refer yourself (even if you are an admin) to TOS8 and show me where 5/5 wasn't good enough, and where one HA member (with a few years of registration) has the right to demand from a new participant something that isn't really required by the rules (>5 tries).

If one day I post my ABX logs with just 5/5, please do not ask me for more than that. Take it or leave it. Why? Because it's the only thing we can do: trust. If one person cheats and gives you 20/20, that won't make it more valid than another guy's mere but true 5/5, will it?

Second, please, if you quote somebody's post, quote the complete part, because quoting only one part completely changes the sense of the original post. Like this:

QUOTE

Flipping the coin 5 times takes a really short time. But that doesn't mean the results will be flawed, because it will take much more time to get the same side of the coin 5 times.

Do you still think that's not true?

QUOTE

Five in a row can happen right off the bat or can happen somewhere later on.

QUOTE

If we were to assume that all of his results were guesses then all of a sudden five in a row at some point in time isn't unreasonable.

Again. Credibility issues, more than statistics. The listener usually tries several samples. Not just one and case closed.

QUOTE

...then there is the glaring zero out of five which you seemed to have overlooked!

I didn't overlook it. My statements are general in nature.

P.S. In the end, TOS8 works the same way for everybody. You have to provide a result with p<0.05. We can believe you... or not. But that is already a question of credibility, not statistics.

Until an administrator or another moderator says otherwise, this is the way things stand with regards to this discussion in its current state.

So let's remove all previous public tests as well as all personal ABX logs with only 5/5 results, because they are no longer valid. Please show me which TOS states that 5/5 isn't enough? Greynol, don't do that. That's not the way.

The difference between you and the OP is that you have earned a reputation.

If you came to our forum telling us, "5/5 is all you'll ever get from me take it or leave it," without providing samples and never demonstrating that you actually understand how to interpret double-blind test results, you might end up finding yourself in a position where you would no longer be allowed to post here.

Someone can flip a coin five times and only five times and get the same result. Someone can flip a coin five times and only five times and guess wrong about the result every time as well. You are wrong to say otherwise, and you did in fact say otherwise. I suggest you take the time to help groom the OP to be more like you rather than try and then fail to play "gotcha" with me.

As I have pointed out twice before, the results in this thread collectively do not reach p<0.05.

Please, take the time to understand what people are saying before you accuse them of such nonsense.

Before posting like that, please re-read my posts and you will see that I refer to this issue as one of credibility, Mister Wisdom himself.

I'm pointing out to you that this is not in general true for the reasons the link you provided explained. And in this particular case, it's absolutely wrong. Since you did not retract that above, I assume you still believe it to be true. Could you explain why?

We can play this game for a long time. Why? I can ask you exactly the same question: why not?

We're mixing two different things: credibility and statistics. Statistics says: p<0.05 implies 95% validity, hence acceptable. But with one condition: credibility, in other words that the OP didn't run the test many times. We're mixing these two things as if they were the same.

I don't even need a log anymore to trust /mnt or martel, not because of reputation ... but because I have double-tested some of their results (when I needed samples) & most of the time I agree with them, & when I don't, maybe they are more skilled than me, maybe I don't listen where I should.

So 3 trials or 300 trials, other than to build credibility & share results, it doesn't matter. What matters is that you can redo the test yourself & find the same results.