Making Video Intuitive: An Explainer

On the Stream team at Cloudflare, we work to provide a great viewing experience while keeping our service affordable. That involves a lot of small tweaks to our video pipeline that can be difficult to discern by most people. And that makes the results of those tweaks less intuitive.

In this post, let's have some fun. Instead of fine-grained optimization work, we’ll do the opposite. Today we’ll make it easy to see changes between different versions of a video: we’ll start with a high-quality video and ruin it. Instead of aiming for perfection, let’s see the impact of various video coding settings. We’ll go on a deep dive on how to make some victim video look gloriously bad and learn on the way.

Everyone agrees that video on the Internet should look good, start playing fast, and never rebuffer regardless of the device they’re on. People can prefer one version of a video over another and say it looks better. Most people, though, would have difficulty elaborating on what ‘better’ means. That’s not an issue when you’re just consuming video. However, when you’re storing, encoding, and distributing it, how that video looks determines how happy your viewers are.

To determine what looks better, video engineers can use a variety of techniques. The most accessible is the most obvious: compare two versions of a video by having people look at them—a subjective comparison. We’ll apply eyeballs here.

So, who’s our sacrificial video? We’re going to use a classic video for the demonstration here—perhaps too classic for people that work with video—Big Buck Bunny. This is an open-source film by Sacha Goedegebure available under the permissive Creative Commons Attribution 3.0 license. We’re only going to work with 17 seconds of it to save some time. This is what the video looks like when downloaded from https://peach.blender.org/download/. Take a moment to savor the quality since we’re only getting worse from here.

For brevity, we'll evaluate our results by two properties: smooth motion and looking ‘crisp’. The video shouldn’t stutter and its important features should be distinguishable.

It’s worth mentioning that video is a hack of your brain. Every video is just an optimized series of pictures— a very sophisticated flipbook. Display those pictures quickly enough and you can fool the brain into interpreting motion. If you show enough points of light close together, they meld into a continuous image. Then, change the color of those lights frequently enough and you end up with smooth motion.

Frame rate

Not stuttering is covered by framerate, measured in frames-per-second (fps). fps is the number of individual pictures displayed in a single second; many videos are encoded at somewhere between 24 and 30fps. One way to describe fps is in terms of how long a frame is shown for—commonly called the frame time. At 24fps, each frame is shown for about 41 milliseconds. At 2fps, that jumps to 500ms. Lowering fps causes frames to trend rapidly towards persisting for the full second. Smooth motion mostly comes down to the single knob of fps. Mucking about with framerate isn’t a sporting way to achieve our goal. It’s extremely easy to tank the framerate and ruin the experience. Humans have a low tolerance for janky motion. To get the idea, here’s what our original clip reduced to 2fps looks like; 500ms per-frame is a long time.

Resolution

Making tiny features distinguishable has many more knobs. Choices you can make include what codec, level, profile, bitrate, resolution, color space, or keyframe frequency, to name a few. Each of these also influences factors apart from perceived quality, such as how large the resulting file is plus what devices it is compatible with. There’s no universal right answer for what parameters to encode a video with. For the best experience while not wasting resources, the same video intended for a modern 4k display should be tailored differently for a 2007 iPod Nano. We’ll spend our time here focusing on what impacts a video’s crispness since that’s what largely determines the experience.

We’re going to use FFmpeg to make this happen. This is the sonic screwdriver of the video world; a near-universal command-line tool for converting and manipulating media. FFmpeg is almost two decades old, has hundreds of contributors, and can do essentially any digital video-related task. Its flexibility also makes it rather complex to work with. For each version of the video, we’ll show the command used to generate it as we go.

Let’s figure out exactly what we want to change about the video to make it a bad experience.

You may have heard about resolution and bitrate. To explain them, let’s use an analogy. Resolution provides pixels. Pixels are buckets for information. Bitrate is the information that fills those buckets. How full a given bucket is determines how well a pixel can represent content. With too few bits of information for a bucket, the pixel will get less and less accurate to the original source. In practice, their numerical relationship is complicated. These are what we’ll be varying.

The decision of which bucket should get how many bits of information is determined by software called a video encoder. The job of the encoder is to use the bits budgeted for it as efficiently as possible to display the best quality video. We’ll be changing the bitrate budget to influence the resulting bitrate. Like people with money, budgeting is a good idea for our encoder. Uncompressed video can use a byte, or more, per-pixel for each of the red, green, and blue(RGB) channels. For a 1080p video, that means 1920x1080 pixels multiplied by 3 bytes to get 6.2MB per frame. We’ll talk about frames later but 6.2 MB is a lot— at this rate, a DVD disc would only fit about 50 seconds of video.

With our variables chosen, we’re good to go. For every variation we encode, we’ll show a comparison to this table. Our source video is encoded in H.264 at 24fps with a variety of other settings, those features will not change. Expect these numbers to get significantly smaller as we poke around to see what changes.

Resolution

Bitrate

File Size

Source

1280x720

7.5Mbps

16MB

To start, let’s change just resolution and see what impact that has. The lowest resolution most people are exposed to is usually 140p, so let’s reencode our source video targeting that. Since many video platforms have this as an option, we’re not expecting an unwatchable experience quite yet.

By the numbers, we find some curious results. We didn’t ask for a different bitrate from the source but our encoder gave us one that is roughly a third. Given that the number of pixels was dramatically reduced, the encoder had fewer buckets to put the information in our bitrate. Despite its best attempt at using the entire bitrate budget provided to it, our encoder filled all the buckets we provided. What did it do with the leftover information? Since it isn’t in the video, it tossed it.

This would probably be an acceptable experience on a 4in phone screen. You wouldn’t notice the sort-of grainy result on a small display. On a 40in TV, it’d be blocky and unpleasant. At 40in, 140 rows of pixels become individually distinguishable which doesn’t fool the brain and ruins the magic.

Bitrate

Bitrate is the density of information for a given period of time, typically a second. This interacts with framerate to give us a per frame bitrate budget. Our source having a bitrate of 7.5Mbps (millions of bits-per-second) and framerate of 24fps means we have an average of 7500Kbps / 24fps = 312.5Kb of information per frame.

Different kinds of frames

There are different ways a frame can be encoded. It doesn’t make sense to use the same technique for a sequence of frames of a single color and most of the sequences in Big Buck Bunny. There’s differing information density and distribution between those sequences. Different ways of representing frames take advantage of those differing patterns. As a result, the 312Kb average for each frame is both lower than the size of the larger frames and greater than the size of the smallest frames. Some frames contain just changes relative to other frames – these are P or B frames – those could be far smaller than 312Kb. However, some frames contain full images – these are I frames – and tend to be far larger than 312Kb. Since we’re viewing the video holistically as multiple seconds, we don’t need to worry about them since we’re concerned with the overall experience. Knowing about frames is useful for their impact on bitrate for different types of content, which we’ll discuss later.

Our starting bitrate is extremely large and has more information than we actually need. Let’s be aggressive and cut it down to 1/75th while maintaining the source’s resolution.

When you take a look at the video, fur and grass become blobs. There’s just not enough information to accurately represent the fine details.

Source Video100 Kbps budget

We provided a bitrate budget of 100Kbps but the encoder doesn’t seem to have quite hit it. When we changed the resolution, we had a lower bitrate than we asked for, here we have a higher bitrate. Why would that be the case?

We have so many buckets that there’s some minimum amount the encoder wants in each. Since it can play with the bitrate, it ends up favoring slightly more full buckets since that’s easier. This is somewhat the reverse of why our previous experiment had a lower bitrate than expected.

We can influence how the encoder budgets bitrate using rate control modes. We’re going to stick with the default ‘Average-Bitrate’ mode to keep things easy. This mode is sub-optimal since it lets the encoder spend a bunch of budget up front to its detriment later. However, it's easy to reason about.

Resolution + Bitrate

Targeting a bitrate of 100Kbps got us an unpleasant video but not something completely unwatchable. We haven’t quite ruined our video yet. We might as well take bitrate down to an even further extreme of 20Kbps while keeping the resolution constant.

Now, this is truly unwatchable! There’s sometimes color but the video mostly devolves into grayscale rectangles roughly approximating the silhouettes of what we’re expecting. At slightly less than a third the bitrate of the previous trial, this definitely looks like it has less than a third of the information.

As before, we didn’t hit our bitrate target and for the same reason that our pixel buckets were insufficiently filled with information. The encoder needed to start making hard decisions at some point between 102 and 35Kbps. Most of the color and the comprehensibility of the scene were sacrificed.

We’ll discuss why there’s moving grayscale rectangles and patches of color in a bit. They’re giving us a hint about how the encoder works under the hood.

What if we go just one step further and combine our tiny resolution with the absurdly low bitrate? That should be an even worse experience, right?

Wait a minute, that’s actually not too bad at all. It’s almost like a tinier version of 1280 by 720 at 100Kbps. Why doesn’t this look terrible? Having a lower bitrate means there’s less information, which implies that the video should look worse. A lower resolution means the image should be less detailed. The numbers got smaller, so the video shouldn’t look better!

Thinking back to buckets and information, we now have less information but fewer discrete places for that information to live. This specific combination of low bitrate and low resolution means the buckets are nicely filled. The encoder exactly hit our target bitrate which is a reasonable indicator that it was at least somewhat satisfied with the final result.

This isn’t going to be a fun experience on a 4k display but it is fine enough for an iPod Nano from 2007. A 3rd generation iPod Nano has a 320x240 display spread across a 2in screen. Our 140p video will be nearly indistinguishable from a much higher quality video. Even more, 48KB for 17 seconds of video makes fantastic use of the limited storage – 4GB on some models. In a resource-constrained environment, this low video quality can be a large quality of experience improvement.

We should have a decent intuition for the relationship between bitrate and resolution plus what the tradeoffs are. There’s a lingering question, though, do we need to make tradeoffs? There has to be some ratio of bitrate to pixel-count in order to get the best quality for a given resolution at a minimal file size.

In fact, there are such perfect ratios. In ruining the video, we ended up testing a few candidates of this ratio for our source video.

Resolution

Bitrate

File Size

Bits/Pixel

Source

1280x720

7.5Mbps

16MB

8.10

Scaled to 140p

248x140

2.9Mbps

6.1MB

83.5

Targeted to 100Kbps

1280x720

102Kbps

217KB

0.11

Targeted to 20Kbps

1280x720

35Kbps

81KB

0.03

Scaled to 140p and Targeted to 20Kbps

248x140

19Kbps

48KB

0.55

However, there are some complications.

The biggest caveat is that the optimal ratio depends on your source video. Each video has a different amount of information required to be displayed. There are a couple of reasons for that.

If a frame has many details then it takes more information to represent. Frames in chronological order that visually differ significantly (think of an action movie) take more information than a set of visually similar frames (like a security camera outside a quiet warehouse). The former can’t use as many B or P frames which occupy less space. Animated content with flat colors require encoders to make fewer trade offs that cause visual degradation than live-action.

Thinking back to the settings that resulted in grayscale rectangles and patches of color, we can learn a bit more. We saw that the rectangles and color seem to move, as though the encoder was playing a shell game with tiny boxes of pictures.

What is happening is that the encoder is recognizing repeated patterns within and between frames. Then, it can reference those patterns to move them around without needing to actually duplicate them. The P and B frames mentioned earlier are mainly composed of these shifted patterns. This is similar, at least in spirit, to other compression algorithms that use dictionaries to refer to previous content. In most video codecs, the bits of picture that can be shifted are called ‘macroblocks’, which subdivide each frame with NxN squares of pixels. The less stingy the bitrate, the less obvious the macroblock shell game.

Even worse is that flat color and noise might only be seen in two different scenes in the same video. That forces you to either waste your bitrate budget in one scene or look terrible in the other. We give the encoder a bitrate budget it can use. How it uses it is the result of a feedback loop during encoding.

Yet another caveat is that your resulting bitrate is influenced by all those knobs that were listed earlier, the most impactful being codec choice followed by bitrate budget. We explored the relationship between bitrate and resolution but every knob has an impact on the quality and a single knob frequently interacts with other knobs.

So far we’ve taken a look at some of the knobs and settings that affect visual quality in a video. Every day, video engineers and encoders make tough decisions to optimize for the human eye, while keeping file sizes at a minimum. Modern encoding schemes use techniques such as per title encoding to narrow down the best resolution-bitrate combinations. Those schemes look somewhat similar to what we’ve done here: test various settings and see what gives the desired result.

With every example, we’ve included an FFmpeg command you can use to replicate the output above and experiment with your own videos. We encourage you to try improving the video quality while reducing file sizes on your own and to find other levers that will help you on this journey!

At Cloudflare, we produce all types of video content, ranging from recordings of our Weekly All-hands to product demos. Being able to stream video on demand has two major advantages when compared to live video....