My primary motivation for performing this listening test was to find the lowest QT AAC TVBR setting that is fully transparent for me, because I want to use that for my music collection. Secondary motivation was to find out how the other encoders compare to QT AAC at high quality settings.

This is my first rigorous listening test, and a rather extensive one, so I wanted to share the results with the audio community. I hope others may learn as much from this experiment as I did!

Results in a nutshell (for the impatient)

QT AAC was judged fully transparent at q91 and close to transparent at q82. The sample in which I heard a faint difference between these presets had a bitrate of only 128kbps at q82 and 159kbps at q91, so taking that into consideration together with expected bitrates at q82, in CBR mode I would assume files at 190kbps and up to be reasonably safe for my ears.

AoTuV (Vorbis) was judged very close to transparent at q5 and q6 and fully transparent at q7. If I were to use Vorbis for my music collection I would pick q6, because I think the tradeoff between file size and perceived sound quality is better at that preset than at q7. I would trust CBR files of 200kbps or greater.

Opus was judged fully transparent at VBR with target bitrate 224kbps, which is considerably higher than I expected based on previous reports. At preset 192 I judged it untransparent, so there's no grey area as in AAC or Vorbis. Opus VBR seems to be a lot less variable than the other codecs, so in CBR mode I would trust Opus files of 230kbps and up.

LAME (MP3) was judged very close to transparent at V1 and V0 and fully transparent at c320. I would pick V0 if I were to use LAME for my music collection. In CBR mode I would trust files of 260kbps or greater.
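For a sense of the size tradeoff behind those CBR thresholds, here's a quick back-of-the-envelope sketch (the helper function is my own, not from any tool; it uses 1 MB = 1000 kB for simplicity):

```python
def track_size_mb(bitrate_kbps: float, minutes: float) -> float:
    """Approximate size of a lossy file: bitrate times duration,
    using 1 MB = 1000 kB for simplicity."""
    return bitrate_kbps * minutes * 60 / 8 / 1000

# A typical 4-minute track at the CBR thresholds quoted above:
for kbps in (190, 200, 230, 260):
    print(kbps, "kbps ->", round(track_size_mb(kbps, 4), 1), "MB")
```

So the gap between the AAC and MP3 thresholds is roughly 2 MB per 4-minute track.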

Encoder details

QT AAC: my installation of Mac OS X included CoreAudio 3.2.6, QuickTime 7.6.6 and QuickTime X 10.0. I used TVBR mode and overall encoder quality "max".

AoTuV: XLD included release 1. Apart from the target quality setting, no options were shown.

Opus: XLD included libopus 1.0.2. I used VBR mode and frame size 20ms. opus-tools 0.1.6 also uses libopus 1.0.2.

LAME: XLD included version 3.99.5. I used VBR mode with -q2 and the new VBR method.

Ambient conditions

Test setup was in an apartment with reasonably good sound isolation, in a moderately quiet environment with singing birds and low traffic. During ABX trials I kept the room door and the ventilation window closed. Computer fans were turned down. Under those conditions, while wearing the headphones, most of the time the only sound I heard was the low humming of the external hard drive that carried the samples. Usually I became unaware of that sound when actively listening to a sample.

Samples

I selected 15 samples from the LAME Quality and Listening Test Information page. In 8 of those samples I didn't hear a difference with any of the encodings I tested. The remaining 7 samples are numbered below. In addition I included a 10-second fragment from Central Industrial by The Future Sound of London, which I had previously found to contain obvious artifacts when encoded with QT AAC q63:

1. applaud.wv

2. fatboy.wv

3. goldc.wv

4. pipes.wv

5. testsignal2.wv

6. vbrtest.wv

7. velvet.wv

8. central industrial.m4a (ALAC)

Henceforth I'll refer to these samples by their numbers. See the appendix for detailed discussion of each sample.

General test procedure

As a general preparation I transcoded the WavPack samples to ALAC in order to make them playable in ABXTester. I always used the lossless original as sample A and the lossy compressed file as sample B. I took regular breaks in order to prevent fatigue. The measurements were spread over multiple sessions, with almost a week between the first and the last session.

For each codec, I would first encode all samples at the middle preset, i.e. q63 for QT AAC, V5 for LAME, q4 for aoTuV and 96kbps for Opus. Then for each sample I would conduct ABX testing and assign one of the following quality levels:

clear difference if I was very sure I heard obvious artifacts and I scored 5 out of 5 after the first batch, or if I scored near 100% after multiple batches with overall p <= 0.002;

marginal difference if I wasn't absolutely sure in each trial but testing showed that I was able to hear the difference, i.e. at least three batches with overall p <= 0.05;

no difference if testing didn't disprove that I might be just guessing (p > 0.05) or if I gave up in advance.
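The p-values behind these criteria are one-sided binomial tail probabilities. A minimal sketch (the function name is my own, not from any ABX tool):

```python
from math import comb

def abx_p_value(correct: int, trials: int) -> float:
    """One-sided binomial p-value: the probability of scoring
    `correct` or more out of `trials` by guessing (chance 1/2)."""
    return sum(comb(trials, k) for k in range(correct, trials + 1)) / 2 ** trials

# A perfect first batch of 5 trials:
print(abx_p_value(5, 5))              # 0.03125
# A near-perfect score over four batches, 18 of 20:
print(round(abx_p_value(18, 20), 5))  # 0.0002, well under 0.002
# 18 of 25, the kind of score behind a "marginal difference":
print(round(abx_p_value(18, 25), 4))  # 0.0216, under 0.05
```

Note that the exact tail probability can differ slightly from figures produced by tools that use a normal approximation.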

By default I set the audio volume to 5 notches out of 16. I tended to turn it up to 6 notches if I didn't immediately hear a difference, except for sample #1, which I experienced as very loud already. Occasionally I would try a sample with the channels reversed (by reversing my headphones) to test whether something new might come to my attention.

After testing all samples at the middle preset I would proceed to higher presets with the samples in which I heard any difference, until I found the minimal preset at which I heard no difference or until I couldn't go higher. A preset was judged "fully transparent" if I heard no difference in any sample, "very close to transparent" if I heard a marginal difference in at most one sample, and "untransparent" otherwise. I decided to assign QT AAC q82 an intermediate category, "close to transparent", because I heard a clear but very faint difference in one sample. More on that below.

The overall search path from preset to preset generally went like a binary search, or similarly "jumpy".
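That "jumpy" search amounts to a bisection over the ordered preset scale, assuming the outcome is monotonic (once a preset is transparent, higher presets stay transparent). A sketch with an illustrative listener predicate; the function and predicate are my own, not from any tool:

```python
def lowest_transparent(presets, hear_difference):
    """Bisect an ascending list of quality presets for the lowest
    one at which hear_difference(preset) is False. Assumes the
    outcome is monotonic in the preset."""
    lo, hi = 0, len(presets) - 1
    result = None
    while lo <= hi:
        mid = (lo + hi) // 2
        if hear_difference(presets[mid]):
            lo = mid + 1           # artifacts audible: search higher presets
        else:
            result = presets[mid]  # transparent: try to go lower
            hi = mid - 1
    return result

# Illustrative run over the QT AAC TVBR scale; suppose some sample
# has audible artifacts at every preset below q91:
qt_presets = [27, 36, 45, 54, 63, 73, 82, 91, 100]
print(lowest_transparent(qt_presets, lambda q: q < 91))  # 91
```

With 9 presets this needs at most 4 ABX rounds per sample instead of a linear sweep.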

I executed the above procedure first for QT AAC, then for LAME, then aoTuV and finally Opus. During the course of the experiment I noticed I had become better at detecting artifacts, so in the end I returned to QT AAC to verify my end results for that encoder.

QT AAC

Observed bitrate range: varies wildly around the official expected value. For example, at q63 (135kbps expected) some samples had an average bitrate of 80kbps while others went over 190kbps.

Observed artifacts: even at medium bitrates (q27) most artifacts were slight changes in timbre or texture rather than very obtrusive stand-alone sounds. The exception is sample 8, which acquired some obvious, very sharp "ticks" after encoding that remained audible up to q82, at an average file bitrate of 128kbps.

Stage 1: all samples at q63.

I heard no differences except for a clear difference in sample 8. I decided to ignore that for the moment and to continue my search downwards first.

Stage 7: sample 8 at q73.

Clear difference; I chose q82 as my search result for the time being.

Stage 8: samples 1, 2, 6, 7, 8 at q82 (verification after finishing the other encoders).

I did hear a clear difference in sample 8 after all, though I had to listen to A and B a few times before I noticed it. I heard no difference in the other samples.

Note: I have not reviewed stages 1-4. With my trained ears I might actually hear some additional differences at q54 or even q63, but I haven't tested that.

Stage 9: sample 8 at q91.

No difference. I decided q91 to be my final search result for QT AAC.

LAME

Observed bitrate range: the spread is somewhat smaller than in QT AAC; generally the highest and lowest average bitrates were within 30kbps of the expected bitrate for the given quality preset.

Observed artifacts: no standalone "objects", but changes in timbre or texture could be very un-subtle.

Stage 3: samples 1, 6 at V1.

Marginal difference in sample 1. I decided V1 to be my search result for the time being.

Stage 4: sample 1 at V2 (checking for consistency with aoTuV after finishing Opus).

Clear difference. I chose V0 as my final search result instead.

Stage 5: sample 1 at V0 (for completeness, shortly before starting this report).

Marginal difference (yes, really: I believe I heard a difference, and I identified 18 out of 25 Xs correctly: 72%, p=0.014).

Stage 6: sample 1 at c320.

No difference (at first I thought I heard a difference, but ABX testing showed I didn't).

AoTuV

Observed bitrate range: the average file bitrate is usually greater than the official target bitrate for the given quality preset. For example, the average bitrates at q4 were all greater than 128kbps. Upwards spread from the target bitrate seemed to be similar to QT AAC.

Observed artifacts: few and subtle. The marginal difference in sample 3 that I consistently heard up to q6 was an attenuation effect: the high-frequency components were slightly softened.

Stage 1: all samples at q4.

Clear difference in sample 1, marginal difference in sample 3 and no difference in the other samples.

Opus

Observed bitrate range: average bitrates were always very close to the target bitrate, with a spread of less than 10kbps in each direction. I would compare Opus VBR to QT AAC ABR.

Observed artifacts: texture changes, some of them very severe, including "rattling" and "grinding" sounds. Usually the timbre became more "metallic".

Conclusions and recommendations

QT AAC and aoTuV are the clear winners in this comparison, with QT AAC achieving full transparency at the best compression ratio. I was a bit surprised to find that the highest quality preset is not overkill (for my ears) in LAME. Opus doesn't seem to perform exceptionally well (though better than LAME) at high bitrates, although it's known to beat QT HE-AAC (more or less) at 64kbps. This is probably in part explained by the fact that Opus is still very young. Another explanation is that Opus might be intended more for low bitrates, which is somewhat suggested by the way it's described on the Opus home page.

According to the Hydrogenaudio wiki, most people find AAC to be transparent at about 150kbps, Vorbis at about 150-170kbps and LAME at about 160-224kbps. Given the results of this experiment, my ears might be slightly better than average.

If you wish to repeat this experiment, you might be able to save a lot of time by using my results as a hint where to find the most significant differences. The sample details in the appendix may help you to "look" in the right direction. In addition, you can probably start your searches for Opus and LAME at higher presets than I did.

If you just want to use this report as a hint for choosing your ideal encoder setting, I suggest that you perform a miniature version of my experiment using just a single sample in the encoder that you're interested in. If you hear a difference, go up one preset until you don't; otherwise do the opposite by going down. Specifically:

For QT AAC, I would recommend listening to sample 8 and starting at q73. If you descend below q54, I recommend listening to samples 2 and 6 instead.

For aoTuV, I would recommend listening to sample 3 and starting at q5. If you don't hear any difference, switch to sample 1 at q4.

For Opus, you could take sample 4 at target 160kbps.

For LAME, I recommend listening to sample 1 starting at V3.

Appendix: sample details

Sample 1

Loud applause, with a "thank you" yelled through a microphone shortly after the start. The "thank you" is loud but sounds a bit muffled because of the microphone, and there's a faint echo to it.

In the lossless original the applause sounds "wet"; you could compare it to rain or perhaps to oil spattering in a hot pan. In audibly different encodings it may sound dryer, noisier and coarser, perhaps like sandblasting, or very coarse and metallic (in Opus at 96kbps target bitrate).

The "thank you" should be a separate sound layered on top of the applause, and should sound fairly smooth. In audibly different encodings you may expect it to interact with the applause in several ways:

The applause may seem less clear, noisier or softer during the "thank you".

Directly after the "thank you" the applause may seem to be slightly louder and much coarser.

The echo to the "thank you" may seem to be amplified compared to the original and include some noise.

The "thank" syllable may sound slightly less smooth, a bit raspy, as if affected by the sandblasting (this is the primary way in which I made out the difference at maximum quality settings in LAME).

Sample 2

Some sawtooth-like signal with an additional trill effect that seems to contain vowels. I'm not sure whether this is a heavily filtered human voice or just something creative from a synthesizer, but either way it sounds quite interesting.

At medium bitrates in QT AAC and Opus it sounded distorted and metallic.

Sample 3

Symphonic fragment with drums, trumpets, violins, vocals and some high-pitched snare instrument which I think might be a steel guitar. There's also some high tingling in the right channel which I suspect is an artifact in the original file coming from the snare instrument. Sounds like a soundtrack to an epic 1960s movie.

In aoTuV you may find that the snare instrument (the proper sound slightly to the left, not the tingling in the right channel) is arpeggiated less sharply and sounds softer overall; I would call it a bit "timid" compared to the original.

Sample 4

Bagpipe playing a slow high-pitched melody over a constant bass. The sound is smooth overall, although you'll find some irregularity, especially in the second long-lasting high note. In the background there's the occasional hollow, raspy, low-pitched sound, which might be either the bag being inflated by the artist or (a suggestion of) wind.

Focus on the long-lasting high-pitched notes, especially the very last one. In case of an audible difference you'll find that they sound metallic and/or less smooth, or even outright distorted (Opus at 96kbps target bitrate).

Sample 5

Drums (something that sounds similar to a conga or a djembe) playing a samba-like rhythm. At the start an alto voice sings "aaaa", which is a bit of a shame, because the voice will not help you to distinguish the encoded sample from the original and it partially masks the drums.

In Opus at 96kbps target bitrate the high-pitched slap beats sound more metallic than in the original.

Sample 6

Western guitar playing a country tune.

At lower bitrates you might recognise the encoded sample directly, because it sounds metallic and perhaps even a bit distorted. At high bitrates you might be able to make out the difference if you focus on the initial arpeggio and the final note. The last note of the initial arpeggio (which lasts longer than the previous notes) might sound a bit rougher than in the original. The final note might sound metallic. The latter difference is probably easier to hear than the former. You probably won't find a difference in the chords.

Sample 7

Monotone (synthetic) drum rhythm with bass: a big tom beating every second bass beat, an open-closing hi-hat in the right channel alternating with the bass beat, and another closed hi-hat in the left channel beating four times for every bass beat.

You'll only hear a difference at the lower quality settings, and you are most likely to find it in the closed hi-hat in the left channel.

Sample 8

Synthesizer music of fairly low complexity.

Frankly, the sounds aren't really important, because the main reason to listen to this fragment is the sharp ticks that are introduced by QT AAC. I don't think I need to tell you where they are, because you're pretty much guaranteed to hear them at q63 and below.

Since this sample isn't available from the LAME Quality and Listening Test Information page, I made it available for download here: https://dl.dropbox.com/u/3512486/central%20industrial.m4a

@eahm: I'm sorry for quote-sniping you, but I see several things in your post which I think need to be addressed before I can do any additional testing:

QUOTE (eahm @ Feb 8 2013, 22:05)

Jplus, yes I meant these logs. Until I see a proper ABX test that tells me you really hear quality difference

Excuse me, but what is improper about my ABX tests? The only difference from the foobar2000 logs is that you can't see trial-by-trial whether I identified the X correctly or not. I explained my definitions of "clear difference" and "marginal difference". Everywhere I said I heard a "clear difference" I got a 100% or near-100% score, with a probability of less than 0.002 that I identified them correctly by luck. For example, I might have correctly identified 18 out of 20 trials. The probability of getting exactly that score by guessing is 0.5^20 * choose(20,2) = 0.00018 (judging from an example log, foobar2000 would round that down to 0.0%). A similar but less-extremely-certain story applies to the "marginal differences".

If you think that I might have made this up then logs shouldn't change anything, because I can make those up as well.

If knowing my exact score for each individual trial is important to you, I can keep track of that during my next experiments and write it down in my own way in my next post. Would that solve your issue? Because frankly, I won't be able to run foobar2000 on my Mac.

QUOTE

between lossless and 192 AAC

Note that from my VBR results I concluded that most files at or above 190kbps are probably transparent to me (I called it "probably safe for my ears", but that amounts to the same thing). That means that I actually don't expect to hear a difference between lossless and 192kbps AAC. If you want, I can check whether any of the QT AAC samples that I found audibly different from the lossless original had a bitrate near 192kbps.

Edit: I did this, and none of them did. The highest average bitrate was 161kbps, for sample 2 at q45. If I could still hear the difference in sample 2 at q54 (which I didn't verify after my ears became more trained) then that one would come close, at 186kbps.

QUOTE

I have to remain skeptical, AAC is soo good at low bitrates.

I completely agree with that! Note that at the start of my experiment, I heard no difference at q54 (expected bitrate 95kbps) in any of my samples except for #8, which had obvious ticks which were probably caused by QT choosing the bitrate too low. I didn't hear those ticks anymore at q91, where the average bitrate in sample 8 was still only 159kbps.

I'm not denying that QT AAC is really good even at medium bitrates (where I follow the apparent convention that 80-120kbps is medium). I'm just saying that I found a case where q82 isn't strictly transparent, so I'll have to choose q91 for my music in order to be on the safe side.

QUOTE

Even = low. It was more for the ~96.

So you'd like me to test at about 96kbps, 128kbps and 160kbps. I'm fine with that, but how would you want me to approach it? Use the VBR preset whose expected bitrate is near the proposed bitrate?

Why exactly would you like me to do that? Do you expect results that are somehow in conflict with my first post?

@DonP: Alright, so libopus 1.1a is probably better and more variable than 1.0.2. That seems to confirm my suspicion that 1.0.2 didn't score very well in my experiment because it's still a very young codec. I acknowledge that I wasn't using the bleeding-edge version in my measurements and that Opus would probably have scored better if I had.

I prefer testing release versions only, because you never know what rare errors an alpha encoder might have that happen not to show up in my limited set of test samples. It seems that you are concerned that Opus might look worse from my results than it deserves. Would it make you happier if I repeat my Opus measurements when 1.1 is ready for release?

As for aoTuV, XLD is probably just displaying the version number incorrectly (indeed if you search "aotuv" at the XLD homepage the last hit you'll find indicates that the default included version should be at least 4.51). My results don't seem any worse than you'd expect from aoTuV (as compared to QT AAC), so I think there's no reason for concern.

That said, XLD offers plugins for Opus 1.1a and for aoTuV 6.03b, so if more people think I really have to test those, I can without needing to jump through hoops. Please do keep in mind that what I've done here is very time-consuming though.