I'm thinking of doing at least 21 ABX trials on each file. If the subject manages to tell a difference and the guess probability is low, I'd also use a 1-5 point scale to measure the perceived quality.

So far I have analyzed one subject. He did pretty well on most tests, except one, where the results were: 15 out of 29, pval = 0.500

AFAIK, p<0.05 is good enough to be certain that he didn't get it right by chance, but what about the rest? How do I interpret that? Is the codec transparent for him on that test? What are the ranges of the guess probabilities and how should I interpret them?
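
For reference, the pval quoted above can be reproduced by hand. This is only a minimal sketch, assuming the usual ABX convention that the pval is the one-sided probability of scoring at least that many correct answers by pure guessing (50% per trial); the function name `abx_p_value` is made up for illustration and is not taken from any ABX tool.

```python
from math import comb


def abx_p_value(correct: int, trials: int) -> float:
    """Probability of getting at least `correct` right out of `trials`
    when every answer is a coin flip (one-sided binomial tail)."""
    # Count all outcomes with `correct` or more hits, then divide by
    # the total number of equally likely answer sequences (2**trials).
    tail = sum(comb(trials, k) for k in range(correct, trials + 1))
    return tail / 2 ** trials


# The result reported above: 15 correct out of 29 trials.
print(f"{abx_p_value(15, 29):.3f}")  # 0.500
```

With an odd number of trials, "15 or more out of 29" covers exactly half of all possible outcomes, which is why the value comes out at exactly 0.5.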

QUOTE (Luis G @ Jun 18 2005, 07:51 PM)

So far I have analyzed one subject. He did pretty well on most tests, except one, where the results were: 15 out of 29, pval = 0.500

AFAIK, p<0.05 is good enough to be certain that he didn't get it right by chance, but what about the rest? How do I interpret that? Is the codec transparent for him on that test? What are the ranges of the guess probabilities and how should I interpret them?

If you say he did pretty well except on this one, that means he could tell the difference, right? For the codec it would be more flattering (i.e. better) if he couldn't tell the difference, which means it's transparent for him and he's guessing whether X is A or B.

Anyway, either alpha = 0.01 or alpha = 0.05 is generally chosen. Btw, you are never certain: a pval below alpha only means that such a score is very unlikely to come from guessing, so the listener is considered not to be guessing. A pval of 0.5 would be the reference for guessing.
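
To make the alpha levels concrete, here is a rough sketch of how many correct answers are needed before the guess probability drops below a chosen alpha. It assumes a one-sided binomial test with a 50% chance per trial; `guess_probability` and `min_correct` are illustrative names, not part of any existing ABX software.

```python
from math import comb


def guess_probability(correct, trials):
    """One-sided chance of scoring `correct` or better by guessing."""
    return sum(comb(trials, k) for k in range(correct, trials + 1)) / 2 ** trials


def min_correct(trials, alpha):
    """Smallest score whose guess probability is strictly below alpha.

    Returns None when even a perfect score cannot reach that level.
    """
    for correct in range(trials + 1):
        if guess_probability(correct, trials) < alpha:
            return correct
    return None


# 21 trials, as proposed in the first post.
print(min_correct(21, 0.05))  # 15
print(min_correct(21, 0.01))  # 17
```

So with 21 trials, 15 correct gives a pval of about 0.039, while 14 correct is only about 0.095 and would not pass at alpha = 0.05.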

Yes, I have read that thread, but it doesn't mention how to deal with the results when p>0.05, which is the case here. How is an interval like 0.05<p<0.25 to be dealt with? It certainly doesn't say much about the codec, and I would hardly classify it as transparency. IMO p>0.5 means transparency, but I wanted to check with you gurus about this.

I thought that anyone could rate the quality, but what if the subject merely rated the correct file by chance? I need more certainty, so I'm using ABX as an indicator of the trustworthiness of the subject's ratings. Is this correct?

Btw, I'm more interested in rating the subjective quality of the codec.

If it were me, I'd just choose a whole bunch of different samples (say 30), and then rate each one against the reference using abc-hr. Then I'd plug the results into a statistical calculator (http://ff123.net/friedman/stats.html) to determine first if you found a significant difference from the reference and second how much that difference is. No ABX'ing is involved, plus you get a better indicator of codec quality by sampling a lot of different music.

BTW, I would also keep the samples where you rate the reference, rather than throw these cases out. So if you make some mistakes, the reference will average something less than 5.0.
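
A small sketch of what keeping those cases looks like in the numbers. The ratings below are entirely made up for illustration (they are not results from any test), and the layout of one (reference, codec) pair per sample is just an assumption about how the ABC/HR scores might be tabulated.

```python
from statistics import mean

# Hypothetical ABC/HR ratings, one (reference, codec) pair per sample.
# In the third pair the listener picked the wrong file and rated the
# hidden reference 4.5, leaving the encoded file at 5.0.
ratings = [
    (5.0, 4.0),
    (5.0, 3.5),
    (4.5, 5.0),  # mistake: the reference was rated, keep it anyway
    (5.0, 4.5),
    (5.0, 5.0),  # no difference heard
]

reference_mean = mean(ref for ref, _ in ratings)
codec_mean = mean(codec for _, codec in ratings)

print(f"reference: {reference_mean:.2f}")  # reference: 4.90
print(f"codec:     {codec_mean:.2f}")      # codec:     4.40
```

The mistake pulls the reference average below 5.0, so the gap between reference and codec shrinks instead of the error being silently discarded. Whether that gap is significant is what the statistical calculator is for.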

If it were me, I'd just choose a whole bunch of different samples (say 30), and then rate each one against the reference using abc-hr.

I'm going off-topic, but I have an important question. I'm trying to build a complete set of classical music samples, in order to replace the usual suite of 15 samples I've been using for 18 months now. My goal is to gather 100 samples, covering many instruments, solo, chamber, orchestral, lyrical, noisy or not noisy, quiet and loud, etc... But I'm realizing that making ABX comparisons with so many samples would be a Herculean task. What would be the best thing in your opinion:
- 100 samples rated in ABC/HR without ABX
- 15...20 samples rated in ABC/HR + ABX confirmation?

Good point, I have to be more specific. Are such tests valid (I mean: statistically) or, more precisely, do both kinds of test have the same level of validity?

I'm asking because I usually publish the results of my tests, and I always try to avoid criticism. I just fear that a big listening test including 50 or 100 samples without ABX confirmation will be contested. Should I keep this kind of test private and favour ABX for public ones, or would you find it more interesting to publish an ABC/HR-only listening test involving many more samples?

That's encouraging. A listening test involving a lot of samples (and therefore introducing much greater diversity) without the need to listen to each one ~50 times is much less boring for the listener. Less stressful too (I'm often frightened when I have to click on the "view results" button).

Thank you for your precious assistance with statistics and listening test methodology.

Luis G> Could I possibly ask you what kind of codec you are developing?

ff123, currently I'm using 9 samples: castanets, finger snaps, French horns, timpani, triangle, trumpets 1, trumpets 2, violins 1 and violins 2. I'm also planning to add male and female voice samples, and perhaps a rock/pop/jazz test would also be included. How many subjects should I use to get a significant result?

guruboolez, it is a transform codec (MDCT) with 25 bands, one per Bark. The data rate is variable and is always adjusted according to signal demands to achieve good quality. Expected data rates range from 60 to 340 kbps.