Not surprising that people couldn't distinguish between 192 and 176.4, as the ultrasound frequencies would be super-high. 48, 88.2, and 96 would be more relevant I think. I would like to see DSD compared to those PCM rates.

If hearing ultra or near-ultrasound frequencies has anything to do with this, I think that the test subjects need to have a full range hearing exam first.

Interesting read. Viewed solely from a methodological perspective, this paper is appallingly bad (I am sure Dr. Reiss is a fine engineer, but he would have benefited greatly from taking on a statistician as a co-author). The fundamental assumption underlying meta-analysis, that both the independent and dependent variables are fairly homogeneous, seems not to have been met at all and it just gets worse from there (I am happy to elaborate if anyone is interested, although it will probably put you rapidly to sleep)...

It is also interesting to note Reiss' inherent bias, which comes through in his writing in many places. It is surprising that the reviewers (apparently) did not comment on that or the sometimes dubious decisions that arose from it. Actually, I am curious what the peer-review process at AES entails; they mention a "review board", but I wonder who participates in that, as they don't explain it any further. I can state with great certainty that they did not seek out an expert in the use of meta-analytical tools.

As an historical aside, the appendix is hilarious, describing in excruciating detail methods that have been available since the late fifties (Mantel-Haenszel) or the mid-eighties (DerSimonian and Laird, which is what's in the Cochrane software he used) and are very well-known.

Was it a Dr.? I thought it was a student paper! No, I didn't read it that carefully.

AES has always been a joke to me. There are no real qualifications to join as far as I can see.

Full disclosure: I posted this study after only reading the abstract and skimming through, hoping the more knowledgeable here would be able to cut through the meat of it. So I really didn't read it carefully, nor was I endorsing the findings.

I looked into AES membership once. The amateur choir I sing with hires a so-called professional recording company to cover our concerts. One of their engineers said he'd invite me or whatever after I took an interest in the mics he was using. But after seeing how completely incompetent he and the rest of his company are at recording a choir and orchestra, I lost interest.

It reminds me of this:

Quote

Please accept my resignation. I don't want to belong to any club that will accept people like me as a member. - Groucho Marx

Most students would probably do better. Although many do like to demonstrate how much they learned along the way, even if the details aren't relevant to the manuscript at hand (see: detailed appendix to this paper)...

I did read this thing pretty carefully, as it covers a topic I am interested in and because it uses methods that I am very familiar with (scores of publications). It is a mess, to the extent that I don't think the conclusions can be taken seriously. Maybe they are correct, maybe not; the same study, properly implemented, would probably shine some light on the subject. In fairness, though, science has become so specialized that it is basically impossible to do these things on your own (which is why I said he should have asked a statistician to help him).

As for AES, I will take you guys at your word (I have no idea). I wish more of the papers were open access, though...

The only reason I once considered joining was access to the papers, but then at some point they raised the fee rates even for members to access them and I lost interest.

I was not aware of any open access AES papers (other than copies hosted by the authors themselves, or by some other organization on their behalf, outside of the AES) until seeing this, so I look forward to finding out whether there is anything interesting in the other open access papers available through their site.

^ Interesting pattern discernible in those open access publications: once you get to the more modern ones (first couple of pages are from the early fifties), they are mostly from outside of the US. A lot of the overseas public funding sources stipulate open access publication, which is funded by fees paid by the authors. The US does not generally require this (to my knowledge), so this is not uncommon in a lot of journals.

Thanks for the list, by the way. There are some interesting papers available.

I may regret asking but I'm curious as to why the methodology used by this researcher was so poor. It sounds like you're someone who works in a field where you do this type of thing a lot, so I'd like to know what I as a layman am missing. My research experience is limited to some papers and studies done as part of my Master's in music, and nothing nearly this involved.

Well, there are quite a number of little things, but here are a few of the big ticket items (so to speak). The first is that the independent variable (exposure, stimulus) and the dependent variable (outcome, response) should be relatively homogeneous. For example, you could do a meta-analysis looking at the effects of statin use on LDL cholesterol levels. Ideally, you would want to have all included studies using the same statin (say, simvastatin) and dosage in all participants and LDL measured in exactly the same way. Outside of clinical trials, it is generally impossible to obtain that kind of data, though, so I think most reviewers would accept statin use versus LDL, even if there were several different medications and a couple of ways of measuring LDL (although they might very well ask for sub-group analysis stratifying on those). Still quite homogeneous, since statins all work via the same pathway and LDL measurement variability is pretty consistent across measurement methods. Reiss, in contrast, takes disparate outcomes and forces them into the same box and does the same with the exposures. In some cases, he allows his own bias to influence how he does this ("for each trial, it was treated as a correct discrimination if the highest sample rate, 192 kHz, was ranked closer to “live” than the lowest sample rate, 44.1 kHz, and an incorrect discrimination if 44.1 kHz was ranked closer to “live” than 192 kHz").

Another major problem is that he considers proportions as means and analyzes them as such (this maybe gets into the statistical weeds a bit, but I will keep it brief). In doing so, the methods he implements make a lot of distributional assumptions, particularly assumptions of normality. The actual trial data, though, are not at all normal; they are binomially distributed (like coin-flip data). Any analysis needs to account for that explicitly, and there are a number of methods for doing so (such as Stuart-Ord). The R package "meta", which is freely available (as is base R), implements several approaches. Incidentally, this criticism applies not only to the meta-analysis, but also to the binomial test panel in Table 2; the test is appropriate at the individual study level, but not for the aggregate. Really, this issue is even more complicated, because the trials themselves are not independent. There is correlation between a given subject's choices, so that 1 trial in 1000 people, 10 trials in 100 people, and 1000 trials in 1 person are not the same thing, statistically.
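To make the distributional point concrete (this is not the paper's code; the 14-of-20 numbers are invented for the example), here is a minimal comparison of an exact binomial tail probability against the normal approximation:

```python
import math

def exact_binom_sf(k, n, p=0.5):
    """P(X >= k) for X ~ Binomial(n, p), by summing the upper-tail pmf."""
    return sum(math.comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

def normal_approx_sf(k, n, p=0.5):
    """The same tail under a normal approximation with continuity correction."""
    mu, sigma = n * p, math.sqrt(n * p * (1 - p))
    z = (k - 0.5 - mu) / sigma
    return 0.5 * math.erfc(z / math.sqrt(2))

# A small ABX-style example: 14 correct out of 20 trials under the 50% null.
print(exact_binom_sf(14, 20))    # exact one-sided p-value
print(normal_approx_sf(14, 20))  # close here, but drifts for small n or extreme p
```

The two agree reasonably well for moderate n and p near 0.5, which is why the approximation often looks harmless; the gap grows for small studies or lopsided proportions.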

Then there is the obvious publication bias problem seen in the funnel plot. He explains this away, but omits mentioning a very plausible explanation, which is that supporters of higher sampling rates conducting these studies may shelve a study that doesn't fall in line with their expectations (i.e., fails to reject the null hypothesis). This might not even be completely conscious on the researcher's part ("this is ongoing work, for which my sample size is currently insufficient"). Sensitivity analysis could help here (assuming he used an appropriate set of studies and statistical approach in the first place).

There are many other things, too, such as his curious, and curiously inconsistent, approach to multiple testing corrections or his unfortunate tendency to cite P-values as percentages. Honestly, if I had to do a formal review of this, it would take hours. The paper is poorly structured and Reiss' bias is all too obvious in many places. Anyway, that should be enough (or more than!), but let me know if you are interested in more detail...

I'm the author of this paper. I’ve avoided engaging in forum discussions, but Aaronji’s comments caught my eye, and I couldn’t help but respond.

- Full data, analysis methods, source code, etc. are available at https://code.soundsoftware.ac.uk/projects/hi-res-meta-analysis . I encourage anyone who is interested to perform their own analysis, and I will happily answer any questions. I expect that others may be more rigorous, or may uncover other interesting information that I overlooked. I'll also try to answer any comments posted on the paper's forum at https://secure.aes.org/forum/pubs/journal/?ID=591

- I consulted with statisticians and meta-analysis experts at various stages throughout the preparation of the paper. I would have liked a co-author with expertise in those areas, but the people I asked were unavailable.

- The Appendix was not included in the original submission, but was requested by one of the reviewers. I believe this request was correct, since the readers of the AES journal, including those who frequently apply statistical techniques to their data, are generally not familiar with meta-analysis and the techniques applied in that field.

- I'm aware of the importance of homogeneity, and the heterogeneity issues here are more serious than those that would typically be found in medical research, and a world apart from formal clinical trials. However, meta-analysis has been successfully applied to social and behavioural science research with far more heterogeneity problems than those seen here. Anyway, this is a judgement call. So the approach I took was to use all possible studies (for which I could do inverse variance analysis), and then do sensitivity or subgroup analysis on more homogeneous subsets of the data.

- Bias. This made me laugh at first, since in relation to this paper I've been accused of bias from all sides. Before beginning the study, I did not have a strong opinion either way as to whether differences could be perceived. But I could easily be fooling myself. So I committed to publishing all results, regardless of outcome. And again, I included all possible studies, even if I thought they were problematic, then did further analysis looking at alternative choices. I also decided that any choices regarding analysis or transformation of data would be made a priori, regardless of the result of that choice. However, I wrote the paper once all the analysis had been done, and so my writing style may reflect my knowledge of the conclusions.

- I agree that the work would have been improved by using an approach specific to binomial distributions. However, for much of the analysis, the normal approximation is justified. As for independence in the binomial test, under the null hypothesis every randomised trial would be uncorrelated, regardless of whether they involved the same participant or the same study (think guessing a truly random coin toss). I also agree that the aggregate binomial test is not appropriate for meta-analysis. It was included only for completeness, along with the binomial values for the individual studies in Section 2, and was not used as part of the meta-analysis in Section 3.

- For King 2012 (the 'closer to live' study), it could have been excluded completely, higher preference rating could have been treated as discrimination (which is fraught with issues), or closer to live could have been treated as successful discrimination. Since the live feed was provided as a reference stimulus, similar to many other multistimulus evaluation studies, and the intention of the 192 kHz feed was to be 'closer to live' even if not perceived, this seemed a logical approach. Again, this decision was made a priori, in an attempt to minimize any of my own biases influencing the outcome.

- The studies were mainly from the audio engineering discipline and had a strong tendency to express and consider results (effect sizes) as means rather than proportions, and to express probabilities as percentages. This is reflected in the paper, though better editing on my part would have resulted in more consistency with the notation of p values. I could also have performed sensitivity analysis where results were considered as odds ratios. But at some point, one has to stop looking at every variation and just submit the paper.

- The structure of the paper is in line with the structure of most engineering papers (including IEEE). As such, it looks very different from the structure of papers in medical journals and other places where a lot of meta-analysis is published.

- The standard explanation for the publication bias problem was mentioned several times. The beginning of Section 3.6 first presents it. Figure 3 shows that the apparent evidence of publication bias from the funnel plot mostly goes away when subgrouping is applied. However, it then goes on to state "publication bias may still be a factor" and, in the Conclusion, "still a potential for reporting bias. That is, smaller studies that did not show an ability to discriminate high resolution content may not have been published."
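For readers unfamiliar with the inverse variance analysis mentioned above, here is a toy sketch of fixed-effect inverse-variance pooling. The study numbers are invented for illustration; the paper itself used Cochrane's tools, not this code.

```python
import math

# Each study contributes an effect estimate y_i with variance v_i; the pooled
# estimate weights each study by w_i = 1 / v_i, so precise studies count more.
studies = [
    (0.12, 0.004),  # (effect estimate, variance) - hypothetical study 1
    (0.05, 0.010),  # hypothetical study 2
    (0.20, 0.002),  # hypothetical study 3
]

weights = [1.0 / v for _, v in studies]
pooled = sum(w * y for (y, _), w in zip(studies, weights)) / sum(weights)
se = math.sqrt(1.0 / sum(weights))  # standard error of the pooled estimate

print(f"pooled effect = {pooled:.4f} +/- {1.96 * se:.4f} (95% CI half-width)")
```

A random effects model, as used in the paper, extends this by adding a between-study variance component to each v_i before weighting, which is what methods like DerSimonian-Laird estimate.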

^ First of all, welcome to taperssection, and thanks for coming in and discussing this with us. To be honest, I was initially very surprised to see you post in this little backwater of the web, catering to practitioners of a pretty uncommon hobby, but on further reflection I am fairly certain I know how you arrived here. At any rate, I would like to respond to a couple of your comments on my more major criticisms (the presence or absence of the appendix, for example, is immaterial in the end).

- I’m aware of the importance of homogeneity, and the heterogeneity issues here are more serious than those that would typically be found in medical research, and a world apart from formal clinical trials. However, meta-analysis has been successfully applied to social and behavioural science research with far more heterogeneity problems than those seen here. Anyway, this is a judgement call. So the approach I took was to use all possible studies (for which I could do inverse variance analysis), and then do sensitivity or subgroup analysis on more homogeneous subsets of the data.

With respect to the bolded part, what does "successfully" mean? Obtained a P-value? Published a paper? Generated a useful result that led to downstream hypotheses that were also tested successfully? Settled an open debate? Whatever that definition, though, do you think your work should fall into the category of "squishy" science (like a lot of social and behavioural science)? I always thought of engineering as "hard" science, with experiments conducted rigorously and in the most methodologically proper way possible. I am sorry, but "others did it worse!" is not a valid rebuttal of this criticism, which, in my mind, completely undermines the entire paper. You are right that, in the end, it is a series of judgement calls, but others can freely interpret the merit of the work based on their assessment of the quality of those judgements.

- I agree that the work would have been improved by using an approach specific to binomial distributions. However, for much of the analysis, the normal approximation is justified. As for independence in the binomial test, under the null hypothesis every randomised trial would be uncorrelated, regardless of whether they involved the same participant or same study (think guessing a truly random coin toss). I also agree that the aggregate binomial test is not appropriate for meta-analysis. It was included only for completeness along with the binomial values for the individual studies in Section 2, and not used as part of the meta-analysis in Section 3.

The normal approximation may be justified, particularly for large numbers, but I think you need to show that. It is kind of beside the point, though. Why make those additional, potentially spurious, assumptions when it is easy to implement the correct analysis, modelled on the correct distribution, in freely available software? With respect to the aggregate binomial analysis being included for "completeness", wouldn't it have been more complete to actually put the correct estimate in there? The intra-individual trials are not like coin flips, in my opinion; there is a discrete set of perceptual apparatus that is unique to each individual that causes correlation between that individual's observations. If such correlations did not exist, nobody would ever score higher (or lower) than 50% in a sufficiently large number of trials.

With respect to publication bias, I never said you didn't consider it, only that you never mention, specifically, the implication about the type of study that is not reported based on that funnel plot. In any event, that is a lesser concern for me than the above. I certainly appreciate your comments here, and I hope you understand I am not trying to be a dick in any way (this, after all, is the nature of scientific discourse), but your rebuttal doesn't much impact my previous assessment...

While you are here, on a somewhat related topic, can you comment on the Journal's review policy? The website says there is a "review board". Who comprises that board? How large is it? Do all reviewers come from this board or are outside experts brought in?

Apologies in advance if I don't continue the discussion much. I've just got a long 'to do' list to catch up on.

“what does ‘successfully’ mean?” – I meant something loosely along the lines of ‘Generated a useful result that led to downstream hypotheses that were also tested successfully.’ How about https://www2.ed.gov/rschstat/eval/tech/evidence-based-practices/finalreport.pdf . This was a massive, well-cited study that has led to a better understanding of the potential benefits and drawbacks of online learning. And it tested hypotheses that were generated from previous meta-studies in the field. But the data had huge heterogeneity issues. Note that I didn't follow the approach from that paper, though; I kept mainly to the guidelines in the Cochrane Handbook. I'm just using it as an example.

I fully agree about best effort and rigour in research, and did not mean to imply an ‘others did it worse’ justification. But nor do I think the heterogeneity issues are insurmountable here. The studies were all looking at discrimination between high resolution and standard resolution audio. Almost all looked at it directly, and a couple of others (King 2012 and Repp 2006) had data that could be transformed into that form. All had multiple participants, each performing multiple dichotomous trials. And all yielded (single outcome measure) results where, if differences could always be perceived, one expects 100% discrimination, and if differences could never be perceived, one expects 50% correct discrimination. And almost all tests were forced choice, either same/different or an ABX variant (these two approaches were also treated to subgroup analysis). I’ll also note that a random effects model was used, and it can be easily seen from the main forest plot and associated statistics that heterogeneity is not readily apparent from the results with the training subgroup.

Anyway, this is going back to the ‘apples and oranges’ analogy. Meta-analysis is comparing apples and oranges (two studies using different dependent and independent variables), but that is ok if you are trying to learn about the nature of fruit (both studies looking at the same research question).

Regarding the normal approximation, binomial analysis, etc.: first, I wasn’t aware of the full functionality of the ‘meta’ package in R, and so didn’t use it. But I don’t think that use of the normal approximation invalidates any results. Also, the null hypothesis in this case results in exactly what you said, ‘nobody would ever score higher (or lower) than 50% in a sufficiently large number of trials.’ To clarify, suppose randomly you called the correct answer A half the time and randomly you called it B the other half, but that there is no way anyone can distinguish between them. Then it doesn’t matter how someone answers; it still converges on 50% correct. And given that, we can give a probability for at least 6736 ‘correct’ results out of 12645 trials.

But these are minor details. I agree that the binomial distribution is preferred, that the aggregate binomial analysis is not the right approach, and that if there is any perceptual difference at all then individuals’ scores are highly correlated (I make note of that in the paper when discussing Meyer 2007). The disagreement is only over the severity and importance of these things. I don’t think the analysis or conclusions are in any sense invalidated, and I still strongly encourage others to revisit the data.
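That 6736-of-12645 tail probability can be checked with a few lines of stdlib Python; the log-space summation here is just one way to avoid floating-point underflow, not the method used in the paper.

```python
import math

def binom_tail_log(k, n, p=0.5):
    """P(X >= k) for X ~ Binomial(n, p), summed in log space to avoid underflow."""
    logs = []
    for i in range(k, n + 1):
        # log C(n, i) via lgamma, plus the log-probability of i successes
        lp = (math.lgamma(n + 1) - math.lgamma(i + 1) - math.lgamma(n - i + 1)
              + i * math.log(p) + (n - i) * math.log(1 - p))
        logs.append(lp)
    m = max(logs)  # log-sum-exp trick for numerical stability
    return math.exp(m) * sum(math.exp(lp - m) for lp in logs)

# Probability of at least 6736 correct out of 12645 trials under the 50% null
p = binom_tail_log(6736, 12645)
print(p)  # vanishingly small; the observed count is ~7 SDs above the null mean
```

That tiny p-value is exactly why the independence assumption matters: if the trials are correlated, the effective sample size is smaller and the true p-value is larger than this calculation suggests.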

can you comment on the Journal's review policy? The website says there is a "review board". Who comprises that board? How large is it? Do all reviewers come from this board or are outside experts brought in?

The editorial staff of the journal are listed at http://www.aes.org/journal/masthead.cfm . They have a much larger pool of reviewers that they pick from, and also use outside experts. I think they aim for a minimum of three reviews per paper. That said, it's always a struggle (as is the case for many journals) to maintain a talented and diverse pool of reviewers, and it's hard to find just the right outside experts. I'm sure that they would welcome more potential reviewers.

^^ I think we'll just have to agree to disagree, as we have a fundamental philosophical divide with respect to the tautness of the hypothesis and its relationship to the meaningfulness and interpretability of the results...

Also, the null hypothesis in this case results in exactly what you said, ‘nobody would ever score higher (or lower) than 50% in a sufficiently large number of trials.’ To clarify, suppose randomly you called the correct answer A half the time and randomly you called it B the other half, but that there is no way anyone can distinguish between them. Then it doesn’t matter how someone answers, it still converges on 50% correct.

I didn't state it very well for the case of the null being true, but, even when the null holds, the intra-individual results are not truly independent (i.e. not coin flips). Even in simple tests like this, there are a wide range of subtle individual biases and others introduced by the experimental design. So that non-independence is a factor in analyses like these and should be accounted for; this is generally difficult to do at the meta-analysis level (although if you had the individual results for all of the studies, you could do it no problem with a linear mixed effects model), but it does complicate the interpretation of the results.
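The point about subject-level correlation can be illustrated with a quick simulation (all parameters invented for illustration): give each subject a private hit rate that varies slightly around 50%, and the spread of the total score becomes much wider than a pooled coin-flip model predicts, even though the average stays at 50%.

```python
import random
import statistics

random.seed(1)

def total_correct(n_subjects, trials_each, spread):
    """Total 'correct' answers when each subject has a private hit rate near 0.5."""
    total = 0
    for _ in range(n_subjects):
        # Subject-specific probability: 0.5 on average, varying person to person
        p_i = min(max(random.gauss(0.5, spread), 0.0), 1.0)
        total += sum(random.random() < p_i for _ in range(trials_each))
    return total

N, M, REPS = 50, 50, 400  # 50 subjects x 50 trials, repeated 400 times
iid = [total_correct(N, M, 0.0) for _ in range(REPS)]    # true coin flips
mixed = [total_correct(N, M, 0.1) for _ in range(REPS)]  # per-subject biases

# Both means sit near 1250, but per-subject variation inflates the variance,
# so a pooled Binomial(2500, 0.5) test understates the spread of the mixed data.
print(statistics.mean(iid), statistics.pvariance(iid))
print(statistics.mean(mixed), statistics.pvariance(mixed))
```

This overdispersion is what a mixed effects model would absorb with a per-subject random intercept, and it is why a test that treats all 2500 trials as one big coin-flip experiment is anti-conservative.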

The editorial staff of the journal are listed at http://www.aes.org/journal/masthead.cfm . They have a much larger pool of reviewers that they pick from, and also use outside experts. I think they aim for a minimum of three reviews per paper. That said, it's always a struggle (as is the case for many journals) to maintain a talented and diverse pool of reviewers, and it's hard to find just the right outside experts. I'm sure that they would welcome more potential reviewers.

Interesting. Thanks. At least in theory, it works a little differently in my field, in that anyone is a potential reviewer based on relevant experience. In reality, editors usually have some "go-to" people, though.