After two years of listening tests, I've tried something new, based on this discussion. This time, I've performed a multiformat blind comparison based on a much larger group of samples, but without ABX confirmation. Tests are still performed with a double-blind methodology: the only difference is that I haven't spent time confirming the audible differences with an ABX session. The time saved was invested in something more interesting (to my eyes, but also for statistical analysis tools): 150 samples instead of the usual 15.

1.1/ classical samples

A few words about this extravagant number. I used to perform comparisons on a limited number of classical samples (15…20). It was probably enough to draw reliable conclusions about the relative quality of various codecs, but such a limited collection couldn't represent the fullness of classical music, which consists of numerous instruments played in countless combinations, most of them offering wide dynamics. There are also voice and electronic sounds, and finally all the variants linked to technical factors (acoustics, recording noise, etc…). That's why I've tried to build a structured collection of "classical music" situations, which of course doesn't aspire to completeness, but which should represent most situations. The collection is made up of very hard to encode samples as well as very easy ones, loud (+10 dB) and ultra-quiet (+30 dB); noisy and crystal clear recordings; ultra-tonal and micro-detailed sounds. I've split it into four series:

(note#1: all samples are deliberately short. First, it's easier to upload them. Second, there's only one acoustic phenomenon to test per sample, which makes comparison between different tests a bit more interesting. The exact length of the collection is 25 minutes; that corresponds to 10.00 seconds per sample on average).

(note#2: all samples were named following a simple convention. The first letter (A, E, S, V) corresponds to the category (artificial, ensemble, solo, voice). The number corresponds to the catalogue number. Then, additional information is appended: nature of the instrument, type of instrument or voice, etc…

ex: S11_KEYBOARD_Harpsichord_A
ex: E35_PERIOD_CHAMBER_E_flutes_harpsichord.mpc
For short, samples will be called S11, E35, etc…)
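As a side note, the convention is regular enough to be parsed mechanically. Here is a hypothetical sketch (the field names are my own; only the convention itself, category letter plus catalogue number plus underscore-separated details, comes from the description above):

```python
# Hypothetical sketch: parsing the sample naming convention described above.
# Field names are assumptions; only the convention (category letter + catalogue
# number, then underscore-separated details) comes from the text.
def parse_sample_name(name: str) -> dict:
    categories = {"A": "artificial", "E": "ensemble", "S": "solo", "V": "voice"}
    stem = name.rsplit(".", 1)[0]          # drop a possible extension (.mpc, .wav...)
    head, *details = stem.split("_")
    return {
        "short_name": head,                 # e.g. "S11" -- the sample's short id
        "category": categories[head[0]],
        "catalogue": int(head[1:]),
        "details": details,                 # instrument / voice information
    }

info = parse_sample_name("E35_PERIOD_CHAMBER_E_flutes_harpsichord.mpc")
print(info["short_name"], info["category"], info["catalogue"])
```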

With such a collection, I should obtain a very precise idea of the performance of different lossy encoders on classical music. For me, it's interesting, especially since I plan to buy in the near future a portable player supporting one of the newer audio formats, such as Vorbis, AAC or WMAPro. I'd like to know how good these new formats are compared to MP3. These 150 samples may also help developers/testers evaluate the performance of a codec on a wide panel of situations.

1.2/ various music samples

Last but not least, I've decided to give this test a wider audience by adding samples representing genres other than classical. For an elementary reason (99.9% of my CDs are classical) I can't build the same kind of structured collection for what I will from now on call, for short, "various music". I used all the samples selected by Roberto during his listening tests, removed all the classical ones, and kept the 35 samples representing "various music". It's much less than the 150 above, but more than double what was used in all previous collective listening tests.

=> total = 150 classical + 35 various = 185 samples.

1.3/ choice of bitrate

For my first test based on these samples, I've selected a friendly bitrate (at least for the tester): 80 kbps. It may appear uninteresting, so I must explain my choice.

First, I plan to perform similar tests at higher bitrates. My dream is to build a coherent set of tests covering all bitrates from 80 to 160 or 192. But this project is very ambitious (too ambitious, certainly) and I'll possibly stop my tests (in their current form) at ~130 kbps.

But why 80, and not 64 kbps? To my ears, there is currently no encoder that sounds satisfying at 64 kbps. They're all disappointing or unsuitable for listening on headphones, even cheap ones, even in an urban environment (I repeat: to my ears). But I've noticed that the perceptible and highly annoying distortions I've heard at 64 kbps are seriously lowered once the bitrate reaches the next step. Vorbis has fewer problems, and AAC-LC (at least the advanced encoders) also seems to improve quickly beyond 64 kbps. It's a bit like MP3, which was considered acceptable at 128 kbps, but which quickly sank below this value. I would consider as reasonable the *idea* of an acceptable quality at 80 kbps with modern encoders. Let's see the facts...

II. PROBLEMS

2.1/ competitors

One big problem with this kind of test is the choice of competitors. Choosing the formats is easy: the tester just has to select what he considers interesting. Here, I'll exclude outdated formats (VQF, MP3Pro) and unsuitable ones (MPC, MP3 – though this last one would also be interesting to test, just for reference...). That leaves: WMA, WMAPro (if available at this bitrate), AAC-LC, AAC-HE, Vorbis.

But which implementation should I use? Nero AAC or iTunes AAC? Nero AAC features a VBR mode, but is VBR reliable at this bitrate, especially for samples which exhibit wide dynamics? And for Nero, which encoder would be the best: the "high" one (default, which has verified issues with classical) or the "fast" one (which performs better with classical, but maybe not as well with various music, and which is still considered not completely mature by Nero's developers)? Vorbis CVS or Vorbis aoTuV? I'd say aoTuV, but if Vorbis fails, people will (legitimately) suspect the other one could have performed better. WMA CBR or WMA VBR? VBR is theoretically better than CBR, but tests have already shown that VBR can be unsafe at low bitrates.

My first idea was to test them all. Schnofler's ABC/HR allows the use of countless encoders in a same round (ff123's software is limited to 8 contenders). But after a quick enumeration of all possible competitors (iTunes AAC, Nero AAC CBR fast, Nero AAC CBR high, Nero AAC VBR fast, Nero AAC VBR high, faac, Vorbis aoTuV, Vorbis CVS, Vorbis ABR, WMA CBR, WMA VBR, HE-AAC fast, high, CBR & VBR...) and a mental calculation of the number of comparisons I would have to perform with 185 samples and so many contenders, I immediately canceled this project. Last but not least, multiplying the competitors in a single test would lower the significance (statistically speaking) of the results.

Then I came to a second idea: test all competitors for one single format in a single pool, and put the winner of each pool in the final arena.
It's like sports: qualification first, final for the best. The remaining problem is the additional work. I've planned to test 4…5 codecs per bitrate with 185 samples, not 13 or 14. That's why I've reduced the number of tested samples for the preliminary pools. I've limited the number to 40 samples, using 25 samples coming from different categories of the complete classical collection and 15 from the 35 samples representing "various music". The imbalance in favor of classical is intended: the whole test is clearly focused on classical – "various music" is just an extension or bonus.

2.2/ Encoding mode and output bitrate

Another problem: VBR and CBR. Testing VBR against CBR has always been a source of controversy. In my opinion, testing a VBR encoder which outputs the targeted bitrate on average (i.e. over a full set of CDs) is absolutely not a problem, even if the bitrate reaches amazing values on short tested samples. It's not a problem, but the test should in my opinion meet the following condition: it must include samples for which VBR encoders produce high bitrates as well as low ones. VBR encoders have the chance to automatically increase the bitrate when a difficulty is detected – a possibility that CBR encoders don't have, and they sometimes suffer from that handicap, especially on critical samples. But VBR encoders also decrease the bitrate of musical parts they don't consider difficult – and this reduction is sometimes very important; theoretically it shouldn't affect the quality, but we know the gap between theory and reality, between a principle and its implementations. Testing the output quality of 'non-difficult' parts is therefore very important, because these samples are the possible handicap of VBR encoders; otherwise there's a big risk of favoring VBR encoders over CBR by testing only samples apparently favorable to VBR (whatever the format).

My classical music gallery is not exclusively based on critical or difficult samples; most of them don't exhibit any specific issue. The sample pool should therefore be fairly distributed between samples with a lower bitrate than the target and samples with a higher one. I'll post as an appendix a distribution curve which confirms this.

2.3/ degree of tolerance

When testing VBR profiles, it's not always possible to match the exact target. Some encoders don't have a precise scale of VBR settings. With luck, one available profile will approximately correspond to the fixed bitrate; sometimes, the output bitrate will deviate too much from the target. CBR is not free of problems either, although they're less important. With AAC for example, CBR is a form of ABR: the output bitrate can vary a little (but fortunately not very much).

That's why trying to obtain identical bitrates between the various contenders could be considered a utopia, even when the test is limited to CBR encoders only. The tester therefore has to allow some freedom: not too much, in order to keep the comparisons significant, and not too little, in order to make the test possible. I consider a deviation of 10% acceptable, but again, on one condition: 10% between the lowest averaged bitrate and the highest averaged one, not 10% between each encoder and the target. As an example, if one encoder reaches 72 kbps (80 kbps - 10%) and another 88 kbps (80 kbps + 10%), the total difference would be ~20%: too much.

However, I will possibly allow rare exceptions: when a VBR profile is outside but close to the limit, or if it would be more interesting to test a more common profile (example: Musepack –quality 4 instead of –quality 3.9). Of course, the deviation mustn't be exaggerated; and I'll try to limit the possible exceptions to the pools, in order to keep the fairest conditions during the final test.
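The 10% rule can be expressed as a tiny check. A sketch under my own formulation (the baseline for the percentage is an assumption, here the lowest average bitrate; the text's ~20% figure is measured against the 80 kbps target instead):

```python
# Sketch of the tolerance rule described above (my formulation, not the
# author's exact procedure): the spread between the lowest and highest
# average bitrate must stay within 10%, measured against the lowest one.
def within_tolerance(avg_bitrates, max_spread=0.10):
    lo, hi = min(avg_bitrates), max(avg_bitrates)
    return (hi - lo) / lo <= max_spread

print(within_tolerance([78, 80, 83]))   # spread ~6.4% -> acceptable
print(within_tolerance([72, 88]))       # spread ~22% -> too much
```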

2.4/ Bitrate evaluation for VBR encoders

Now that the rules are fixed, we have to estimate the corresponding bitrate for each VBR encoder and profile. It's not as easy as one might suppose. Ideally, I would have to encode a lot of albums with each profile. But with my slow computer, that's not really possible. And doing so would only give the corresponding bitrate for classical; according to my experience, this average bitrate can seriously differ from the output value that other people listening to other music (like metal) have already reported. Think about the LAME sf21 issues, which could inflate the bitrate up to 230…250 kbps with –preset-standard, and compare it to the average bitrate I obtain with classical: <190 kbps! Another, different example: lossless.

For practical reasons, I followed a methodology I don't really consider acceptable, and took the average bitrate over the 185 samples as the reference for my test. I don't like it, because short samples can dramatically exaggerate the behavior of VBR encoders, and therefore distort the final estimation. Nevertheless, with 185 samples, this kind of over- and underrating occurring with some samples should normally be softened. And indeed, it seems that the average bitrates of encodings I've done of the full suite with formats I've used in the past (LAME –preset standard, MPC) are very close to the average bitrate of my old music library. I can't be absolutely certain that my gallery works like a microcosm and that its bitrate matches the real usage of a full library, but I'm pretty sure that the deviation isn't significant (+/- 5%, something like that).

2.5/ Bitrate report

Before starting to reveal the results, there's one last problem I'd like to put in the spotlight. It concerns the different ways of calculating the bitrate. I've tried to obtain the most reliable value, and that's why I logically thought of calculating it myself with the filesize as a basis. As long as no tags are embedded in the files, the calculated bitrate should correspond to the real one (the audio stream). But the problem is somewhere else. Some formats are apparently embedded in complex containers, which weigh the size down. It's not a problem in real life: adding something like 30 KB to a 5 MB file is totally insignificant. But when these 30 KB are appended to very short encodings, the calculation of the average bitrate is completely distorted as a consequence.

Concrete example: iTunes AAC. Just try the following: encode a sample (length: exactly one second) in CBR. At 80 kbps, we should obtain an 80 Kbit or 10 KB file (80 x 1 / 8). But the final size is 60 KB, and that corresponds to a 480 kbps (60x8) encoding! What's the problem? Simply that iTunes adds to each encoding something like 50 KB of extra chunks. The problem can be partially solved with foobar2000 0.8 and the "optimize mp4 layout" command: the filesize drops to 14 KB. But even then, the 14 KB correspond to a ~128 kbps bitrate, while the audio stream is only 80 kbps.

Apparently, iTunes is not alone in this situation. I haven't looked closely, but it seems that WMA (Pro) behaves the same way, and we have no "optimize WMA layout" tool to partially correct this. If we keep in mind that the average length of my samples is 10 seconds, with some of them at only 5 seconds, we have to admit that calculating the bitrate with the filesize/length formula is, for this test, anything but reliable.
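The iTunes arithmetic above can be reproduced in a few lines. A sketch using the figures quoted in the text (the ~50 KB of extra chunks is the value reported above; KB is taken as 1000 bytes for simplicity):

```python
# Illustration of the container-overhead problem described above: a fixed
# amount of container data distorts the naive filesize-based bitrate on
# short samples, while being negligible on full-length tracks.
def naive_bitrate_kbps(filesize_bytes, seconds):
    return filesize_bytes * 8 / 1000 / seconds

audio_only = 80 * 1000 // 8         # a true 80 kbps stream, 1 s -> 10,000 bytes
overhead = 50 * 1000                # container chunks, independent of length

print(naive_bitrate_kbps(audio_only + overhead, 1))          # 480.0 kbps (!)
print(naive_bitrate_kbps(audio_only * 300 + overhead, 300))  # ~81.3 kbps for 5 min
```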

That's why I followed the value calculated by specialized software. MrQuestionMan 0.7 was released during my test, but the software has some issues calculating a correct average bitrate on short encodings (iTunes AAC encodings, for example). Foobar2000 appeared to be the most reliable tool, and I've decided to trust its calculated value. For practical reasons, foobar2000 is also preferable: the "copy name" command can be modified to easily export bitrates to a spreadsheet.

2.6/ notation and scale

The (really) last problem. Each time I have to evaluate quality at low bitrates, I regret the inappropriateness of the scale used in ABC/HR. At 80 kbps, encodings would rarely reach the 4.0 state ("slight but not annoying difference"). 3.0 ("slightly annoying") would rather be the best quality degree that modern encoders can obtain at this bitrate. It implies that the notation will fluctuate within a compressed scale, from 1.0 to 3.0. It's not very much, especially when big differences in quality between contenders are noticed by the tester.

To solve this issue, I've simply mentally lowered the visible scale by one point. Example: when I considered an encoding to be "annoying" (the state corresponding to "2.0") I put the slider to 3.0. The scale I used for the test was:

5.0 : "perceptible but not annoying"
4.0 : "slightly annoying"
3.0 : "annoying"
2.0 : "very annoying"
1.0 : "totally crap"

If, exceptionally, one encoding appeared to correspond to "perceptible but not annoying", I put the slider on 4.9, which means "5.0"; if the quality was superior to this state, I wrote the exact notation in the comments. A transparent encoding obtained 6.0. When the tests were finished, I removed one point from all notations. 6.0 became 5.0, 3.4 -> 2.4, and 1.0 was transformed into a shameful 0.0! By doing this, I maintain the usual scale; the only change is therefore a lower floor, corresponding to an exceptionally bad quality.

The quality scale could be redefined directly in Schnofler's ABC/HR software, but apparently the tester has to type the descriptions for each new test (did I miss an option?); it was faster for me to do this small mental exercise rather than typing the same content more than 200 times.
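The post-test correction is trivial, but here it is as a sketch anyway, to make the mapping explicit (6.0 standing for transparency on the shifted scale):

```python
# The mental rescaling described above: shift every notation down by one
# point after the test, so the shifted 6.0 (transparent) maps back to 5.0.
def rescale(grade: float) -> float:
    return round(grade - 1.0, 2)

print(rescale(6.0))   # 5.0 -- transparent encoding
print(rescale(3.4))   # 2.4
print(rescale(1.0))   # 0.0 -- the "shameful" new floor
```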

POOL#1: AAC-LC – Nero

Nero currently offers the widest support for AAC: two different encoders (veiled behind the names "high" and "fast"), and for each, support for CBR and VBR. The purpose of the first pool is to establish which one could be considered the most trustworthy AAC encoding solution from Nero. I didn't include the 'fast' encoder in VBR mode: the "radio" profile targets a bitrate below 70 kbps and the next profile ("internet") reaches the 140 kbps ceiling.

Contenders for this first pool:

• As tested previously, the "fast" encoder performs better on classical music. And the difference is really striking, in my opinion. Quality is also better on average than VBR "high", despite its lower bitrate (76 kbps vs 86 kbps).

• the "high" encoder is also worse on average with samples coming from group 2, but this inferiority can't be claimed with a confidence of 95%. On average (group 1 and group 2 mixed together), CBR "high" appears to be the lowest quality AAC-LC encoding tool.

• the "high" VBR mode produces better quality on average with group 2 (but confidence is < 95%), and is clearly worse with samples from group 1.

=> Nero AAC-LC ‘fast’ with CBR 80 will join the second pool.

POOL#2: AAC-LC – faac, iTunes & Nero

- I had planned to run a dedicated pool comparing faac ABR and faac VBR: some encoders perform better with ABR, especially when the VBR model is not tuned enough. I started this pool, but quickly canceled it. faac ABR (80 kbps stereo) suffers too much from the lowpass (8 kHz instead of 13 kHz for VBR), and therefore can't achieve any good result. People should also keep in mind that the corresponding bitrate for the tested setting (-q70) was a bit excessive (see the bitrate table).
- iTunes AAC is ambrosia for the tester: one encoder, no settings: CBR 80. My iTunes is based on QuickTime 7.02 (it appears in the MP4 metadata).

• faac offers very poor quality in its current state: the lowpass (13 kHz, vs 14 kHz for Nero & 15 kHz for iTunes) is often audible. More annoying: severe distortions which affect most tested files. Warbling is also often audible. I recall that faac suffers from warbling with some tonal samples up to –q500! However, this severe comparison shows how much AAC could still be improved at low bitrates.

• iTunes AAC offers very similar quality from one group to the other. Classical music and various music are encoded with approximately the same quality. Obviously, iTunes AAC is well balanced. Quality is very similar to Nero's on classical (warning: 'fast' encoder only – 'high' is crappy here), and slightly better (but without a 95% confidence level) with various music.

• Nero AAC 'fast' is less balanced than its contender. No surprise: that was revealed during the first pool. However, quality is very close to iTunes; a small difference remains, at least for me and for the 40 tested samples.

=> iTunes AAC is qualified for the final comparison

POOL#3: AAC-HE – Nero

There are several AAC-HE implementations available. Apple's one is still not available on Windows. I didn't test Real's (it needs Producer). Therefore, I only have Nero to test. But that means two different encoders, with 2 different settings (CBR, VBR): four combinations. However, VBR can't be tested here. The highest VBR settings of both encoders output a low bitrate (60 kbps, too far from the target - see the bitrate table).

This time, the 'fast' encoder isn't superior anymore with classical music (group 1), but it still shows a slight regression on various music (group 2). People might also note the very low average notation for both encoders. Nero's AAC-HE suffers from several artifacts, typical of SBR I would say, which constantly annoy me. The only difference I noticed between the two encoders was a tiny reduction in the level of an audible artifact (sandy sound).

POOL#4: Ogg Vorbis – 1.1.1 & aoTuV beta 4

This big listening test was a good occasion to evaluate the modifications introduced by Aoyumi into the 1.1.0 core (aoTuV beta 4 based on 1.1.1 wasn't released when I performed this pool). It also gave me the opportunity to evaluate the performance of ABR (not recommended) against VBR. I don't expect anything from ABR, but surprises are always possible with lossy encodings.

Unfortunately, aoTuV and 1.1.1 don't output the same bitrate (see the bitrate table). The difference isn't that big, but it might favor aoTuV's results. I'd prefer to compare both on a near-identical basis, in order to see which one is really the best. That's why I kept aoTuV at –q1, and increased 1.1.1 to match the same bitrate. –q1,5 was very close (and it's a semi-round number: I prefer that over eccentric settings like –q1,38 or something like that).

• on group 1, all encoders are tied (although aoTuV is better than 1.1.1 with 90% confidence). It's a disappointment for me, because I seriously expected aoTuV to reduce the level of coarseness/fatness on this specific musical genre. However, slight improvements were often perceptible – it's better than nothing. With some samples, a slight regression was also perceptible: additional distortion or an apparently more restrictive lowpass (noticed with harpsichord). It's interesting to note that ABR doesn't perform badly, except on critical samples (the bitrate stayed at ~85 kbps where VBR encodings reached 160!); ABR also sounded a bit better with some samples (tonal ones). A good point for ABR (just note that encoding speed is dramatically slow compared to VBR).

• on group 2, differences are much more defined. ABR appeared clearly worse than VBR, and aoTuV beta 4 outdid 1.1.1 in VBR mode. Obviously, the changes Aoyumi made to Vorbis are much more effective on various music.

=> on average, aoTuV beta 4 was better than 1.1.1 (not a surprise, I would say), and will therefore join the final.

POOL#5: WMA 9.1 Standard

WMA9Pro offers a minimal CBR setting at 128 kbps; on the other side, VBR Q10 outputs ~68 kbps and the next step (Q25) ~110 kbps. WMA9Pro can't compete in this test for that reason.

I've therefore limited the test of Microsoft products to WMA9 Standard. It's the only one that can be played on a DAP, and the number of manufacturers supporting WMA STD is countless. WMA is supposed to offer better quality than MP3 at this bitrate, so it's interesting to see how this format will really perform (at least, I will see it). I have compared CBR to VBR. VBR Q25 offered a nice, round 80 kbps bitrate over the 185 samples. However, people should keep in mind that the bitrate was lower with the 150 classical samples (76 kbps) and higher with the 35 various music ones (88 kbps).

• CBR 80 was slightly inferior to VBR Q25 on classical music. It's a good point for VBR, because wide-dynamics samples are often harder to handle at low bitrates with VBR than with CBR. It might be interesting to note that this better performance was obtained with a (slightly) lower bitrate.

• with group 2, the difference is much more contrasted. CBR 80 performed very poorly on various music, whereas VBR showed significant progress. Microsoft clearly improved its product with VBR. VBR offers the most balanced results between both groups (1.79 & 1.67), whereas CBR is obviously unbalanced (1.57 vs 1.07) in favor of classical music.

At this stage, there's one slight problem: the Vorbis bitrate is a bit higher than the other contenders' (~83 kbps). It's not really a problem: 3 kbps can't lead to a significant difference. However, I know that some people tend to whine, especially when their favorite encoder doesn't appear to win each listening test by far. That's why I decided not to use Vorbis aoTuV at –q1, but to set it at –q0,9. The bitrate is now very close to the other contenders: a bit too low for classical and a bit higher than 80 for various music. Interested people can refer to the bitrate table.

To complete this test, I've also added two anchors:

• as high anchor, I considered MP3 at 128 kbps the most profitable one: good enough to play the role of high anchor, and also an interesting reference. Many software editors and sellers like to claim that modern encoders can perform as well as MP3@128 at half the bitrate. Here we will see if the best implementation of each audio format can do it at only 60% of MP3's bitrate. I've decided to maximize the quality of this anchor: ABR and LAME 3.97a10. The setting was --preset 131 in order to match 128 kbps (126 in fact).

• as low anchor, I hesitated. Finally, I decided to use MP3 again, at 80 kbps. Quality should theoretically be low enough (though I had serious doubts before starting the test); I also consider it very important to obtain a direct comparison between old MP3 and the new competitors, theoretically much better at such a low bitrate. Again, I used LAME 3.97a10 and --preset 82.

(N.B. the indicated bitrate values all correspond to the average calculated by foobar2000, which weights the calculation by the length of each sample. These values slightly differ from the plain averages calculated with Excel and reported at the bottom of my bitrate table).
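To make the difference between the two averages concrete, here is a sketch with made-up numbers (not the actual test data): foobar2000-style averaging weights each sample's bitrate by its length, whereas a plain spreadsheet average treats every sample equally.

```python
# Length-weighted average bitrate (total bits / total duration) vs the
# plain arithmetic mean of per-sample bitrates. Long low-bitrate samples
# pull the weighted figure down; a plain mean ignores the lengths.
def weighted_avg(bitrates, lengths):
    total_bits = sum(b * t for b, t in zip(bitrates, lengths))
    return total_bits / sum(lengths)

def simple_avg(bitrates):
    return sum(bitrates) / len(bitrates)

bitrates = [70, 90, 160]   # kbps per sample (made-up)
lengths = [20, 10, 5]      # seconds per sample (made-up)
print(weighted_avg(bitrates, lengths))  # ~88.6 kbps
print(simple_avg(bitrates))             # ~106.7 kbps
```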

Not all modern encoders are equal. Not at this bitrate, obviously. Vorbis ends the test with a clear superiority, whereas WMA doesn't really show a big difference compared to an old format like MP3. AAC (whatever the profile) is disappointing; the High Efficiency profile doesn't help the AAC core perform better at 80 kbps, and seems rather to handicap the format. Also interesting to note: MP3 at 128 kbps (when encoded with LAME) is currently untouchable, except maybe by Vorbis on various music. People shouldn't seriously expect to increase the musical content of their portable player by 100% and keep the same quality as MP3 at 128 kbps. Most people on HA.org probably knew that.

ANALYTICAL COMMENTS

• MP3 LAME, 80 kbps: low anchor, so there's little to comment. Quality is poor, but not as poor as expected. Indeed, I was sometimes surprised by the quality obtained with MP3 at this setting: this format can handle some 'easy' samples decently (more easily, I would say, than some competitors). In rare cases, the low anchor obtained a better note than the high one. I thought it was a mistake, but when I checked later, I confirmed it. The reason is simple: LAME 3.97 has warbling problems with some samples; this warbling was a bit more annoying to my taste than the lowpass/resampling of the 80 kbps encoding, and that's how an 80 kbps encoding obtained a better notation than a 128 kbps one.

• AAC-HE (Nero, CBR 80 "high"): a very disappointing score for a format claimed to be a killer at low bitrates. 80 kbps is probably excessive for AAC-HE, now that AAC-LC implementations are getting better and better (take a look again at POOL#1, and see how AAC-LC has progressed). AAC-HE doesn't suffer from any lowpass, but the SBR layer is highly impure, and seems to interfere with the lowest part of the spectrum. As a result I get constant artefacts, noticed with more than 90% of the tested samples. AAC-HE may have a full, CD-like spectrum, but it's as if a cricket were screeching directly in my headphones. Personally, I would consider something poorer (with an audible lowpass and some ringing) as better than this constant parasitical noise. Just a personal appreciation; other people might prefer the opposite – I don't know. AAC-HE also has *big* troubles with attacks (pre-echo) and fine details (smearing), even more audible than with simple MP3. AAC-HE would probably be more pertinent at lower bitrates, where the other contenders would probably be in pain.

• AAC-LC (iTunes, CBR 80): poor results. I expected something better, a bit more suitable for listening on a portable player. Quality is not *that* bad (just compare to MP3 or WMA for reference), but there are too often irritating distortions. The lowpass is also annoying, at least in ABC/HR conditions (with direct comparison against a full-quality reference file), and probably less perceptible on common earbuds (I've tried, and the quality suddenly became much less irritating).

• Vorbis (aoTuV beta 4, VBR –q 0,9): this is by far the most enjoyable thing I've heard at this bitrate. I was highly surprised by the results I got with the 150 classical samples; I was literally astonished by the final score obtained with the 35 remaining samples! Vorbis is obviously an amazing tool at this bitrate. Or, put differently: Vorbis apparently embeds some encoding tools (point stereo?) which are remarkably suited to this bitrate (but which maybe interfere too much at higher bitrates: see this test and this test). Quality is not perfect of course; the usual Vorbis problems are here: noise boost, coarseness, fatness. Distortion (a vibrating effect) on long tonal notes also occurs. But these issues are limited (at least compared to the mutilations produced by other encoding tools at this bitrate) and I would say that Vorbis at this bitrate could pertinently be used for portable playback by people who are not excessively hard to please and are more interested in maximizing the capacity of their flash-memory player. It's too bad for me that Vorbis' performance is not as good with classical as with "various music". But even on this Achilles' heel, Vorbis outperforms the other current encoding tools.

• Windows Media Audio (9.1 Standard, VBR Q25): performance is well balanced... in weakness. WMA shares the same problems as MP3 at a similar bitrate: audible lowpass (13 kHz), and a lot of distortions accompanied by many artefacts. WMA is sometimes better than MP3, sometimes worse (especially with classical; a bit less true with various music – but I recall that WMA VBR outputs something like ~88 kbps with various music, and that MP3 was tested at a bitrate 10% lower). I suppose that WMA would gain some quality by using automatic resampling like LAME does: from my experience, it helps the encoder limit the amplitude of some artefacts. I've often read that WMA should be preferred to MP3 for encodings below 128 kbps; these results could put that into question. Can't we expect MP3 to perform as well as, if not better than, WMA at 96 kbps? Answer in a few weeks, in my next listening test.

• MP3 LAME, ABR 128 kbps: high anchor, perfect in this role. Quality seems to evolve in another universe than all the modern audio formats, though of course not at a comparable bitrate. But it should indicate how optimistic (should I say "biased") the claims of most software editors are, which don't hesitate to proclaim a 50% efficiency gain over MP3. I also recall that "MP3 at 128 kbps" doesn't necessarily mean "LAME at 128 kbps". Compared to less efficient implementations of the format, modern AAC and Vorbis encoders could perform as well (and probably better in Vorbis' case).

To finish: well, this test took long to perform, but it was also enjoyable. Blind-testing 150 samples is less boring than testing 15 with tedious and sometimes pointless ABX sessions (pointless when the difference is really obvious). People might be surprised – and even uninterested – by the low bitrate tested here. I recall that my purpose wasn't to evaluate encoders at near-transparency settings, but to see if I could get something decent at 80 kbps, and to evaluate with precision which encoding tool could safely be considered, to my ears, as the best.

I also recall that testing 185 samples, even various ones, even in double-blind conditions, doesn't remove one important limitation of such a test: the results correspond to my own subjectivity. It's important to remember this, especially at low bitrates. Why at low bitrates? Simply because the tester has to evaluate two things: the amount of degradation and the kind of degradation. The distortion introduced at low bitrates can take different shapes: lowpass, ringing, coarseness, noise boost, metallic coloration, etc… One tester could be tolerant of one kind of distortion, whereas another could hate it (people who followed the old debate about RV9/RV10 blurring vs MPEG-4 blocking probably know what I mean).

• The complete gallery of classical samples is available here, here and here. I hope it will help developers not to forget classical music in their amazing work.

• ABC/HR logs are here: http://audiotests.free.fr/tests/2005.07/80/ABC/
Note that there are very few comments (the test was long, and I had to be fast in order not to spend my whole summer on it). Keep in mind that I lowered the notations for the final analysis.

• Big thanks to: Roberto (who corrected some of my terrible grammatical mistakes, at least for the first part of my narration, and helped me to draw plots), Schnofler for his great software, Peter Pawlowski for foobar2000 (without it, no accurate bitrate calculation, no easy decoding and renaming), ff123 for suggesting the idea of listening tests without ABX imposition, John33 for all the binaries, and of course all developers: the amazing Aoyumi first, and also Ivan, Gabriel, Robert and the people working on audio coding at Apple and Microsoft.

What impressed me the most was Vorbis' performance, even compared to "state of the art" HE-AAC (even though Guruboolez' tastes probably played a big role in those results). If anything, that's yet another proof of Aoyumi's enormous talent.

This post has been edited by rjamorim: Jul 10 2005, 23:13

--------------------

Get up-to-date binaries of Lame, AAC, Vorbis and much more at RareWares: http://www.rarewares.org

Great job. I agree with most of the statements. And HE-AAC doesn't sound great compared to Ogg and AAC-LC. Indeed, at lower bitrates the situation is different. It would be interesting to see 64 kbit/s tested in the same way, and to see how good Ogg is compared to enhanced HE-AAC.

So can I be the noob and stupid one here and ask... what does this tell us? Why do all the tests always run towards the lower bitrates and not the higher bitrates? Okay, you can flame me now lol.

At higher bitrates, it's more difficult to actually hear any substantial differences between codecs. Guruboolez is really the only one around here who has golden ears. Personally, I couldn't really tell the difference from -q 5 and up with Vorbis, but that's just me; I'm sure some folks have found some problem-case samples.

I suspect that the amazing difference between Vorbis and the other contenders is very specific to this bitrate. At 96 kbps, AAC-LC is probably stronger (much stronger?) than what I've heard in this test; at 64 kbps, AAC-HE's screeching artefacts are probably more acceptable when compared to the other forms of distortion audible with non-SBR products.

It's just a suspicion. To confirm or refute it, I'll probably start the second test very soon, and evaluate the relative quality of these contenders at 96 kbps. It should be less tedious (fewer pools). Then, 128 kbps should follow (in August, if I'm motivated enough). This one will be much harder.