06/14/2010 (11:59 am)

I’ll do a more detailed post later on how to properly compare encoders, but lately I’ve seen a lot of people doing something in particular that demonstrates they have no idea what they’re doing.

PSNR is not a very good metric. But it’s useful for one thing: if every encoder optimizes for it, you can effectively measure how good those encoders are at optimizing for PSNR. Certainly this doesn’t tell you everything you want to know, but it can give you a good approximation of “how good the encoder is at optimizing for SOMETHING”. The hope is that this is decently close to the visual results. This of course can fail to be the case if one encoder has psy optimizations and the other does not.
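For readers unfamiliar with the metric: PSNR is just a log-scaled inverse of the mean squared error between the reference and the encode. A minimal sketch (NumPy assumed, 8-bit video assumed; the function name is mine):

```python
import numpy as np

def psnr(ref, enc, max_val=255.0):
    """Peak signal-to-noise ratio: log-scaled inverse of mean squared error."""
    ref = np.asarray(ref, dtype=np.float64)
    enc = np.asarray(enc, dtype=np.float64)
    mse = np.mean((ref - enc) ** 2)
    if mse == 0:
        return float("inf")  # identical frames: infinite PSNR
    return 10.0 * np.log10(max_val ** 2 / mse)
```

The key point is that PSNR counts every squared pixel error equally, which is exactly why an encoder can trade away PSNR for perceptually smarter decisions.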

But it only works to begin with if both encoders are optimized for PSNR. If one optimizes for, say, SSIM, and one optimizes for PSNR, comparing PSNR numbers is completely meaningless. If anything, it’s worse than meaningless — it will bias enormously towards the encoder that is tuned towards PSNR, for obvious reasons.

And yet people keep doing this.

They keep comparing x264 against other encoders which are tuned for PSNR. But they don’t tell x264 to also tune for PSNR (--tune psnr, it’s not hard!), and surprise surprise, x264 loses. Of course, these people never bother to actually look at the output; if they did, they’d notice that x264 usually looks quite a bit better despite having lower PSNR.

This happens so often that I suspect this is largely being done intentionally in order to cheat in encoder comparisons. Or perhaps it’s because tons of people who know absolutely nothing about video coding insist on doing comparisons without checking their methodology. Whatever it is, it clearly demonstrates that the person doing the test doesn’t understand what PSNR is or why it is used.

Another victim of this is Theora Ptalarbvorm, which optimizes for SSIM at the expense of PSNR — an absolutely great decision for visual quality. And of course if you just blindly compare Ptalarbvorm (1.2) and Thusnelda (1.1), you’ll notice Ptalarbvorm has much lower PSNR! Clearly, it must be a worse encoder, right?

24 Responses to “Stop doing this in your encoder comparisons”

This can be generalized as “all measurements are flawed,” which also helps explain why all benchmarks are flawed. More optimistically, as you point out, it’s important to understand what is being measured.

It doesn’t matter. The comparison has served its purpose. Even when a bad comparison is debunked, fanboys will continue to cite it. Remember Chris Blizzard’s bad comparison, and how it just wouldn’t go away?

Tim, the flip side happens too: testing at an unreasonably low bitrate (or right at the R/D knee) will exaggerate the differences if your normal usage is at higher rates. Both exaggerations tell you something true, but the usefulness depends on your application, and most applications are somewhere in the middle.

This is very similar to the browser benchmark nightmares. “X isn’t exactly like IE/Safari/Firefox, therefore I’m taking points off”, “Y is slower in a three-year-old artificial test”, “Z’s default look matches my favorite tie”. At some point a fellow wants to start breaking fingers until the ‘stupid’ goes away.

But this is the internet. Anything worth discussing (and lots that isn’t) will be the center of its own holy war until it stops being interesting to anyone at all.

Definitely, I agree with your point. A comparison is meaningful only if the two encoders are optimized with a similar approach. I still believe PSNR has some meaning, but mostly only (and partially) for comparing two systems that share most of what happens on the encoder side. When comparing two different codecs, it’s almost impossible to satisfy everyone.

Maybe it’s worth adding a simple switch, e.g. an ‘-extreme’ switch in x264, to easily show the performance tuned for research purposes (PSNR-tuned, full RD optimization, pyramid coding, trying all possible modes, etc.). I’m sure that would reduce the noise a lot.

True, true. I like a lot that Xvid makes it so easy, too, by providing you with SSIM numbers out of the box. And you guys are obviously doing a great job, being better than all commercial H.264 encoders, it seems.

Adding CABAC, B-slices, a hierarchical-B structure, and the 8×8 transform typically yields more than 20-30% bit savings in the PSNR sense, and the difference is clearly distinguishable in subjective quality. So if VP8’s coding efficiency is similar to H.264 baseline, it actually means H.264 high profile outperforms VP8 by more than 20%.

Francois, Jan Ozer has no idea what he is doing. His Theora vs. H.264 comparison was one of the worst I have seen, and I guess the only reason this one didn’t end up with all the same problems is that he didn’t encode the videos himself.
Of course he doesn’t give settings, so it’s not repeatable. The images he posted are JPEGs, and he apparently doesn’t get why using a lossy compressor for the comparison images isn’t a good idea. According to the comments he used GIF for the screenshots before that, which, if true, is absurd, since GIF only supports 256 colours.

…where they are basically optimizing for global PSNR, which is probably even worse than using average PSNR.
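For context, the difference is where the averaging happens: global PSNR averages the per-frame MSE before taking the log, so a few very bad frames drag the number down even harder than with per-frame (average) PSNR. A rough sketch of the two, assuming you already have per-frame MSE values (function names are mine):

```python
import numpy as np

def average_psnr(frame_mses, max_val=255.0):
    # Convert each frame's MSE to PSNR first, then take the arithmetic mean.
    return float(np.mean([10.0 * np.log10(max_val ** 2 / m) for m in frame_mses]))

def global_psnr(frame_mses, max_val=255.0):
    # Average the MSEs over the whole clip first, then take a single log.
    # Because the log comes last, high-MSE (bad) frames dominate the result.
    return float(10.0 * np.log10(max_val ** 2 / np.mean(frame_mses)))
```

With one clean frame (MSE 1) and one bad frame (MSE 100), average PSNR lands midway between the two per-frame values, while global PSNR sits much closer to the bad frame’s figure.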

Multimedia Mike, I don’t think the takeaway from this is “all measurements are flawed.” It’s true that any method for objectively evaluating video quality is probably not going to correlate perfectly with human perception. But if we do use them, let’s at least use one that has been statistically proven to correlate well; PSNR isn’t that metric when it comes to image/video data.

I personally like the idea of SSIM or some other objective metric being used to establish that some minimum level of quality is met, although I also believe the results should be compared with a box plot and not just “mean SSIM,” but that’s another can of worms…

hurumi, both Dark Shikari and I have toyed with the idea of adding an option to x264 that enables brute-force RDO for all coding modes, but our tests have shown that such an approach only gives minimal PSNR gains on most content, which suggests that x264 is already very close to the theoretical limit of locally-optimal RDO.
Non-locally-optimal RDO, on the other hand, might give significant PSNR gains, but it is also likely to be complex to implement and will probably come at a high speed cost.

Have you seen this new online gaming site? http://onlive.com
Essentially you are playing a game on a remote server, with the video streamed to you.
For 5 Mbit of bandwidth you supposedly get a 720p video stream of the game (and they recommend you have <25 ms ping to their servers).

Do you know if they are using x264 for this (are you allowed to say if they are?), and with the very low latency requirement, what kind of quality might we expect to see from 720p computer game video input? Would it be equal to, for example, a 3 Mbit 2-pass encode of the same input?

If I may say, shouldn’t encoder comparisons be done with real video playback instead of screenshots, rated on a MOS scale, like two yuv4mpeg videos decoded from the encode process?

I was watching the World Cup on cable the other day (HDCBC). I must say, if you take each frame one by one (like pausing playback on the PVR), the individual quality is horrible; as the video flows, however, it’s far more acceptable and the errors average themselves out. The quality was bad, but it wasn’t *too* horrible to withstand, at least compared to a frame-by-frame analysis.

Of course the problem is that people then don’t have a flashy picture to show, and maybe the differences between encoders will be far blurrier when comparing MOS.

I think, though, the best comparison would be to take the YouTube use case: make a test where the video must be X size and be compressed in X minutes… Of course then VP8 will lose due to the slowness of the encoder, but that’s to be expected.

The only comparison that really matters is visual. Lossy codecs like x264, Theora, or VP8 are designed for end-user consumption, and as such the only things that matter are performance, purpose/compatibility, file size/bitrate, and visual quality.

All these things are a trade-off. If you’re optimizing for streaming media out to a wide number of devices, you’re probably going to sacrifice quality for the highest compatibility and bitrates.

If you’re doing your ‘DVD backups’, does it really matter so much that one codec has to be 10 or 20% larger to match the quality of another? If not, then performance may be a higher priority… especially if you’re doing playback on any sort of battery-driven device.

This is one of the reasons comparisons are so difficult… There are a hell of a lot of variables and profiles change based on purpose and goals.

And even if you do even out all the variables and properly optimize each codec for different use cases, the only metric that matters is subjective quality… opinion on what looks better.

So the best way to do comparisons are with double blind tests.

I’ve done this personally using shell scripts and mplayer.

I’d copy a bunch of different media formats of the same video to a directory. Then I’d write a script that used ‘mktemp -u’ to create a unique, random file name to use as a symbolic link back to each of the files (note that this is an insecure use of mktemp).

I’d then sort the symbolic links alphabetically and feed them to mplayer with mplayer’s output suppressed.

Then I’d examine each video and make a note of its quality in relation to the others.

I’d also throw the original source video into the mix, to help tell how accurate my subjective judgement is… (the original source should always be the highest-quality version).

Then, once I’d finished deciding which versions I liked the most, I could use ‘ls -l’ on the symbolic links to reveal their relationship with the files, and map the relative quality back to codec type.
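For what it’s worth, the symlink shuffle above can also be sketched in Python. This is a rough equivalent of the shell procedure, not the commenter’s actual script, and all names are hypothetical:

```python
import os
import random
import string

def make_blind_links(files, link_dir):
    """Create randomly named symlinks so the viewer can't tell which encode
    is which; the mapping is recovered afterwards (the 'ls -l' step)."""
    mapping = {}
    for path in files:
        # Random 12-letter name, keeping the extension so players still work.
        name = "".join(random.choices(string.ascii_lowercase, k=12))
        name += os.path.splitext(path)[1]
        link = os.path.join(link_dir, name)
        os.symlink(os.path.abspath(path), link)
        mapping[name] = path  # reveal only after judging
    return mapping
```

After viewing the randomly named links in sorted order and ranking them, the returned mapping (or `os.readlink` on each link) plays the role of ‘ls -l’ in de-anonymizing the results.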

That is about as close to a perfect test as you can get without spending a shitload of money on a professional double-blind survey.

The reason it’s not completely perfect is that a person with a lot of exposure to codecs is going to be able to identify codec types by their visual artifacts, which obviously ruins the ‘blindness’ of the test.

The other flaw, which is only partially relevant, is that people’s expectations and perceptions of quality are conditioned by exposure to lots of one type of media.

An example of this is that many people who heavily listen to relatively low-bitrate versions of songs on their media players (say, 128 kbit/s MP3) get used to the audio distortions and artifacts that MP3 tends to introduce. Often people actually end up preferring these distortions over what should be higher-quality samples, simply because they are familiar.

You can also see this among ‘audiophiles’ who grew up before solid-state amplifiers became the norm. They’ll talk about the ‘warm’ sound that tube-based amplifiers produce, but this is just the audio distortion caused by tube amplifiers being driven too hard, out of their range of ‘clean’ amplification.

One example of this with visual codecs may be a preference for ‘blockiness’ over ‘blurriness’ when it comes to visual artifacts.

But like I said, this is only partially a problem. Since we are generally aiming for the best subjective quality, prejudice among your audience can be a factor you need to optimize for, or at least take into account.

Alex: If you’re talking about the comparison I think you are, the point was to show that switching to Theora wouldn’t cause YouTube’s infrastructure to explode, and that H.264 as encoded by YouTube really isn’t very good (IIRC they ARE using x264, but they’re trading encode speed for quality). Basically it was to show that Theora could in fact handle YouTube-quality video just as well as H.264 (which it CAN, since YouTube doesn’t really get the most out of even baseline profile at the bitrates used). It wasn’t meant to be a general H.264 vs Theora comparison, just ‘H.264 as encoded by YouTube’ vs Theora… (or am I thinking of something Monty at Xiph did… hmmm, he knows what he’s doing at least)

hurumi: H.264 vs VP8 comparisons are done with baseline H.264 because that’s what’s actually used on the web, so the comparison is valid: they are comparing a VP8 encoding for web use to an H.264 encoding for web use. Baseline is all that lots of devices people care about can handle (e.g. the iPhone), and IIRC that’s all Flash can play as well, so since web video is what pretty much everyone doing the Theora or VP8 vs H.264 comparisons cares about, they use baseline. As for H.264 high profile being 20% better than VP8… well, VP8 doesn’t have a mature encoder. Look at how much better the Theora 1.1 encoder is than 1.0, and most of that was just fixing brain-dead behavior from On2’s original code and adding proper rate control; seriously, go look at Monty’s post about when they fixed the DCT code to be less stupid. It surprised him how much difference it made, and he kinda knows Theora inside out.

nate: if they’re over-driving their amps, they’re not much of an audiophile… though that doesn’t mean the ones who DON’T overdrive their amps don’t think tubes sound better; they frequently do. But audiophiles mostly have no idea what they are talking about, tend to use anecdotes as ‘data’, and don’t do anything resembling a valid test (some of the magazines they read do… but that’s sort of tangential).

However, there are also lots of audiophiles who DO know what they’re talking about; there are just fewer of them.

(I’ve never understood why Vorbis didn’t get more uptake. It’s not that much more expensive computationally than MP3, and in good double-blind tests Vorbis consistently blows MP3 away at the same bitrates; plus there are no patent licenses, or even trademark licenses, needed to put the name on your product… This parenthetical note has nothing to do with anything, just a random commentary.)

It’s a common problem with any sort of benchmarking: a lot of manufacturers and developers optimize their product for the benchmark too, and ace the benchmark but fail miserably in real-world testing.

In my experience, the best way to test an encoder is to stick a DVD in the drive, encode it, and check the logs! That’s the best way if you ask me!

Cheers,

I just discovered the blog and I am loving it. I’m not a programmer by trade, but I can appreciate your work and I understand the outline of it. You seem like an extremely level-headed person who knows what you are doing, and I am learning a lot from your blog.

Please keep it up.

PS Do you have any experience with the x264 behind a program known as xvid4psp?

@Spudd86 “h264 vs VP8 comparisons are done with baseline h264 because that’s what’s actually used on the web”

You’re clueless. Most websites (including Youtube!) use Main or High profile, as that’s what Flash supports.

“As for h264 high profile being 20% better than VP8… well VP8 doesn’t have a mature encoder”

It’s 6 years old; how in the world could it not be mature by now, unless On2 were a bunch of incompetent tools? I don’t think they are. Also, it’s more than 20%, and that’s ignoring VP8’s lack of psy optimizations. I’d put the real number around 40-70%.

AlexW: Thanks for the information. I agree that x264 is very nicely optimized and has good trade-offs in all aspects. I’ll actually start studying the x264 code seriously.

Spudd86: Then the article cannot use the phrase “H.264 vs VP8”. “H.264 baseline profile vs VP8” is a more appropriate and clearer name that minimizes the confusion.

For the comparison with H.264 baseline profile, I’m not sure which one is better in terms of coding efficiency, but I’m pretty sure it’s meaningless, since how to beat H.264 baseline profile is already known and relatively easy to achieve: simply use a few of H.264’s own main- or high-profile tools, e.g. the 8×8 transform.

Technically (in terms of coding tools, not software), I believe that H.264 high profile (hey, it’s H.264) is much better than VP8, after investigating the VP8 specification.