sRGB vs Linear

Every image that is meant to be displayed on your screen is an sRGB image. That’s what it means to be an sRGB image.

When doing things like applying lighting, generating mip maps, or blurring an image, we need to be in linear space (not sRGB space) so that the operations give results that appear correct on the monitor.

This means an sRGB image needs to be converted to linear space, the operations can then be done in linear space, and then the result needs to be converted back to sRGB space to be displayed on a monitor.

A small example of why this matters is really driven home when you try to interpolate between colors. The image below interpolates from green (0, 1, 0) to red (1, 0, 0)

.

The top row interpolates in sRGB space, meaning it interpolates between those colors and writes out the result without doing any other steps. As you can see, there is a dip in brightness in the middle. That comes from not doing the operation in linear space.

The second row uses gamma 1.8. What is meant by that is that the color components are raised to the power of 1.8 to convert from sRGB to linear, the interpolation happens in linear space, and then they are raised to the power of 1.0/1.8 to convert from linear to sRGB. As you can hopefully see, the result is much better and there is no obvious drop in brightness in the middle.

Getting into and out of linear space isn’t so simple though, as it depends on your display. Most displays use a gamma of 2.2, but some use 1.8. Furthermore, some people do a cheaper approximation of gamma operations using a value of 2.0 which translates into squaring the value to make it linear, and square rooting the value to take it back to sRGB. You can see the difference between those options on the image.

The last row is “sRGB”, which means it uses a standard formula to convert from sRGB to linear, do the interpolation, and then use another standard formula to convert back to sRGB.

The Mistake!

The mistake I was making seemed innocent enough to me…

Whenever loading an image that was color information (I’m not talking about normals or roughness maps here, just things that are colors), as part of the loading process I’d take the u8 image (aka 8 bits per channel), and convert it from sRGB to linear, giving a result still in u8.

From there, I’d do my rendering as normal, come up with the results, convert back to linear and go on my way.

Doing this you can look at your results and think “Wow, doing lighting in linear space sure does make it look better!” and you’d be right. But, confirmation bias bites us a bit here. We are missing the fact that by converting to linear, and storing the result in 8 bits means that we lost quite a bit of precision in the dark colors.

Here are some graphs to show the problem. Blue is the input color, red is the color after converting to linear u8, and then back to sRGB u8. Yellow is the difference between the two. (If wondering why I’m showing round trip instead of just 1 way, think about what you are going to do with the linear u8 image. You are going to use it for something, then convert it back to sRGB u8 for the display!)

Gamma 1.8

Gamma 2.0

Gamma 2.2

sRGB

As you can see, there is quite a bit of error in the lower numbers! This translates to error in the darker colors, or just any color which has a lower numbered color component.

The largest amount of error comes up in gamma 2.2. sRGB has lower initial error, but has more error after that. I would bet that was a motivation for the sRGB formulas, to spread the error out a bit better at the front.

Even though gamma 2.2 and sRGB looked really similar in the green to red color interpolation, this shows a reason you may prefer to use the sRGB formulas instead.

Another way of thinking about these graphs is that there are a quite a few input numbers that get clamped to zero. At gamma 1.8, an input u8 value of 12 (aka 0.047) has to be reached before the output is non zero. At gamma 2.0, that value is 16. At gamma 2.2 it’s 21. At sRGB it’s 13.

Showing graphs and talking about numbers is one thing, but looking at images is another, so let’s check it out!

Below are images put through the round trip process, along with the error shown. I multiplied the error by 8 to make it easier to see.

Gamma 1.8

Gamma 1.8 isn’t the most dramatic of the tests but you should be able to see a difference.

Error*8:

Gamma 2.0

Gamma 2.0 is a bit more noticeable.

Error*8:

Gamma 2.2

Gamma 2.2 is a lot more noticeable, and even has some noticeable sections of the images turning from dark colors into complete blackness.

Error*8:

sRGB

sRGB seems basically as bad as Gamma 2.2 to me, despite what the graphs showed earlier.

Error*8:

Since this dark image was basically a “worst case scenario”, you might wonder how the round trip operation treats a more typical image.

It actually has very little effect, except in the areas of shadows. (these animated gifs do show more color banding than they should, and some other compression artifacts. Check out the source images in github to get a clean view of the differences!)

Gamma 1.8

Error*8:

Gamma 2.0

Error*8:

Gamma 2.0

Error*8:

sRGB

Error*8:

So What Do We Do?

So, while we must work in linear space, converting our sRGB u8 source images into linear u8 source images causes problems with dark colors. What do we do?

Well there are two solutions, depending on what you are trying to do…

If you are going to be using the image in a realtime rendering context, your API will have texture format types that let you specify that a texture is sRGB and needs to be converted to linear before being used. In directx, you would use DXGI_FORMAT_R8G8B8A8_UNORM_SRGB instead of DXGI_FORMAT_R8G8B8A8_UNORM for instance.

If you are going to be doing a blur or generating mip maps, one solution is that you convert from sRGB u8 to linear f32, do your operation, and then convert from linear f32 back to sRGB u8 and write out the results. In other words, you do your linear operations with floating point numbers so that you never have the precision loss from converting linear values to u8.

You can also do your operations in u16 instead of u8 apparently, and also f16 which is a half float.

The takeaway is that you should “never” (there are always exceptions) store linear color data as uint8 – whether in memory, on disk, or anywhere else.

I’ve heard that u12 is enough for storage though, for what that’s worth.

If you want to make a sound shorter, you can play it faster. Doing this also makes it higher pitch unfortunately.

If you want to make a sound longer, you can play it more slowly. This also makes it lower pitch though.

Sound length and pitch are tied together and there’s no way to change one without changing the other.

… Actually that’s a lie. Granular synthesis can be used to change playback speed and pitch independently!

This post talks about how granular synthesis works, gives some examples you can listen to, and also supplies simple standalone C++ code that does it. (680 lines of code, one source file, only standard C++ includes, no libraries used.)

By the end of this post, you should even be able to program your own “autotune” effect.

Granular Synthesis Basics

Granular synthesis is conceptually pretty simple. The first step is to break a sound file up into small sections of sounds called “grains” that are typically between 10 and 100 milliseconds long.

You don’t have to do anything special to make grains, you just literally cut the sound up into a bunch of pieces.

To make a sound twice as long, you then make a new sound where each grain is repeated twice. When you play it back, it will sound mostly the same and be the same pitch, but will be twice as long.

To make a sound half as long, you would just throw away every other grain. The result is a sound that mostly sounds the same, and is the same pitch, but is half as long.

You aren’t restricted to integers though. You could easily throw out every 3rd grain to make it 2/3 as long or repeat every 5th grain to make it 20% longer.

You can also adjust pitch instead of length.

To make a sound that is the same length, but has twice as high pitch, you would double each grain, but play them back twice as fast. That would result in a sound that was the same length but had frequencies that were twice as high.

To make a sound that is the same length, but had half as high of a pitch, you would throw out every other grain, but play the ones you kept twice as slowly. That would result in a sound that was the same length but had frequencies that were half as high.

Congratulations, you now understand granular synthesis!

There are other usage cases of granular synthesis, but they are a lot more exotic – like making crazy cool sounds.

There are also variations of granular synthesis where grains overlap instead of being omitted, when dealing with sounds getting shorter, or grains being played a non integer number of times.

Some Granular Synthesis Gotchas

If you just do the above, you are going to have some issues with clicking and popping. When you put grains next to each other that weren’t next to each other before, there is going to be a discontinuity in the audio wave form, which translates into very short, very high frequencies that make a popping noise.

What’s more is if your grain size is 20 milliseconds, that means you will get a pop 50 times a second, which means you’ll get a 50hz tone from the popping.

So, how do you fix this?

One way is to use envelopes to do cross fading.

If you wanted to put down grain A and then grain C that would make a pop because A and C expect B to be between them.

Using an envelop to fix this problem, you’d put down grain A, and then next to it you’d put down grain B but have it’s volume go from 1 to 0 over some length of time (like 2 milliseconds). You then ADD grain C on top of grain B, but have it’s volume from from 0 to 1 over the same length of time.

The result is that there is no immediate “pop” from discontinuous wave forms. Instead, it gently fades from one grain to the next over the length of the cross fade.

Note that the above is equivalent to linearly interpolating from grain B to grain C over the length of the envelope, it’s just done in two passes.

I’ve heard that another way to handle this problem is that instead of cutting grains perfectly at the time / length they should be at, you make the cuts at zero crossings that are closest to the desired cut position.

What this does is make it so you can put any grain next to any other grain, and they should fit together pretty decently. This gives you C0 continuity by the way, but higher order discontinuities still affect the quality of the result. So, while this method is fast, it isn’t the highest quality. I didn’t try it personally, so am unsure how it affects the quality in practice.

Another issue you will encounter when doing granular synthesis is wanting to play back a grain at a non integer playback speed. This means you may want to sample index 0, then index 1.1, then index 2.2 and so on. How do you sample a fractional index?

A quick and easy way is to use the fractional part of the index to linearly interpolate between the two samples it’s in between. That means that sampling index 2.2 would be a linear interpolation between index 2 and index 3, and would be 20% of the way from index 2 to index 3. AKA it would be value[2] * 0.8 + value[3] * 0.2;

Another way to sample fractionally is to use cubic hermite interpolation. It’s more expensive to compute but interpolates more smoothly, and perserves first order derivatives.

Lastly, there are a couple parameters to these techniques that have to be hand tuned for the situation they are used in:

Grain Size – Some usage cases want larger grain sizes, while others want smaller grain sizes. You’ll have to play with it and see what’s best for your usage case. Again, I’ve heard that typical grain sizes can range between 10 and 100 milliseconds.

Envelope Size – The size of the envelope used for cross fading can affect the result as well. Too short an envelope will start to make popping happen, but too long an envelop will muddy your sound.

Drums and other percussion instruments have a particularly hard time with granular synthesis because they are usually made up of very short but noticeable sounds. If these sounds get (partially?) repeated, it will sound weird and wrong.

Check the links at the end of the post for deeper reads into these topics and more!

Experiments & Results

For all experiments, I used a sound clip from one of my favorite movies “Legend” where Tim Curry plays the devil, and Tom Cruise and Mia Sara are the main characters fighting him.

The experiments use cubic hermite interpolation to sample fractionally, and they use crossfading to fight popping between grains. Everything uses a grain size of 20 milliseconds, and a cross fade of 2 milliseconds.

Naive Pitch / Length Adjustment

To start out, we can play the sound faster and slower to naively adjust pitch and length.

Granular Synth Pitch Adjustment

If we want to adjust the pitch but leave the length alone, there are two ways you could do that.

The first way is to use granular synthesis to change the length of the sound (longer or shorter), keeping the pitch the same, then use the regular “naive” method to make that resulting sound be the original sound length again.

If you made the sound shorter to start with, this process would decrease the pitch. If you made the sound longer to start with, this process would increase the pitch.

Another way to get a very similar result is to just change the playback rate of the grains themselves – INDEPENDENTLY of how many times you repeat the grains (0 to N times) which changes sound length.

Here is a sound made doing that. It plays each grain back ~1.43 times faster, but makes a sound that is the same length in the end. My ears can’t tell the difference, even though I do know there is one. This is the technique we’ll be using for the rest of the experiments.

Granular Synth Pitch and Length Adjustment

To really drive home how pitch and length adjustments can be made independent with granular synthesis, here is a sound where it plays back more slowly (130% as much time), but it plays back at a higher pitch (~1.43 times as high).