Efficient Gaussian blur with linear sampling

Gaussian blur

Gaussian blur is an image space effect that is used to create a softly blurred version of the original image. This image then can be used by more sophisticated algorithms to produce effects like bloom, depth-of-field, heat haze or fuzzy glass. In this article I will present how to take advantage of the various properties of the Gaussian filter to create an efficient implementation as well as a technique that can greatly improve the performance of a naive Gaussian blur filter implementation by taking advantage of bilinear texture filtering to reduce the number of necessary texture lookups. While the article focuses on the Gaussian blur filter, most of the principles presented are valid for most convolution filters used in real-time graphics.

Gaussian blur is a widely used technique in the domain of computer graphics and many rendering techniques rely on it in order to produce convincing photorealistic effects, no matter if we talk about an offline renderer or a game engine. Since the advent of configurable fragment processing through texture combiners and then using fragment shaders the use of Gaussian blur or some other blur filter is almost a must for every rendering engine. While the basic convolution filter algorithm is a rather expensive one, there are a lot of neat techniques that can drastically reduce the computational cost of it, making it available for real-time rendering even on pretty outdated hardware. This article will be most like a tutorial article that tries to present most of the available optimization techniques. Some of them may be familiar to all of you but maybe the linear sampling will bring you some surprise, but let’s not go that far but start with the basics.

Terminology

In order to precede any possibility of confusion, I’ll start the article with the introduction of some terms and concepts that I will use in the post.

Convolution filter – An algorithm that combines the color value of a group of pixels.

NxN-tap filter – A filter that uses a square shaped footprint of pixels with the square’s side length being N pixels.

N-tap filter – A filter that uses an N-pixel footprint. Note that an N-tap filter does *not* necessarily mean that the filter has to sample N texels as we will see that an N-tap filter can be implemented using less than N texel fetches.

Filter kernel – A collection of relative coordinates and weights that are used to combine the pixel footprint of the filter.

Linear sampling – Texture sampling method when we fetch a footprint of 2×2 texels and we apply a bilinear filter to aquire the final color information (aka GL_LINEAR filtering).

Gaussian filter

The image space Gaussian filter is an NxN-tap convolution filter that weights the pixels inside of its footprint based on the Gaussian function:

The pixels of the filter footprint are weighted using the values got from the Gaussian function thus providing a blur effect. The spacial representation of the Gaussian filter, sometimes referred to as the “bell surface”, demonstrates how much the individual pixels of the footprint contribute to the final pixel color.

The graphical representation of the 2-dimensional Gaussian function

Based on this some of you may already say “aha, so we simply need to do NxN texture fetches and weight them together and voilà”. While this is true, it is not that efficient as it looks like. In case of a 1024×1024 image, using a fragment shader that implements a 33×33-tap Gaussian filter based on this approach would need an enormous number of 1024*1024*33*33 ≈ 1.14 billion texture fetches in order to apply the blur filter for the whole image.

In order to get to a more efficient algorithm we have to analyze a bit some of the nice properties of the Gaussian function:

The 2-dimensional Gaussian function can be calculated by multiplying two 1-dimensional Gaussian function:

A Gaussian function with a distribution of 2σ is equivalent with the product of two Gaussian functions with a distribution of σ.

Both of these properties of the Gaussian function give us room for heavy optimization.

Based on the first property, we can separate our 2-dimensional Gaussian function into two 1-dimensional one. In case of the fragment shader implementation this means that we can separate our Gaussian filter into a horizontal blur filter and the vertical blur filter, still getting the accurate results after the rendering. This results in two N-tap filters and an additional rendering pass needed for the second filter. Getting back to our example, applying the two filters to a 1024×1024 image using two 33-tap Gaussian filters will get us to 1024*1024*33*2 ≈ 69 million texture fetches. That is more than an order of magnitude less than the original approach made possible.

The second property can be used to bypass hardware limitations on platforms that support only a limited number of texture fetches in a single pass.

Gaussian kernel weights

We’ve seen how to implement an efficient Gaussian blur filter for our application, at least in theory, but we haven’t talked about how we should calculate the weights for each pixel we combine using the filter in order to get the proper results. The most straightforward way to determine the kernel weights is by simply calculating the value of the Gaussian function for various distribution and coordinate values. While this is the most generic solution, there is a simpler way to get some weights by using the binomial coefficients. Why we can do that? Because the Gaussian function is actually the distribution function of the normal distribution and the normal distribution’s discrete equivalent is the binomial distribution which uses the binomial coefficients for weighting its samples.

The Pascal triangle showcasing the binomial coefficients that can be used to calculate the kernel weights (each element in the succeeding rows is the sum of its "parents").

For implementing our 9-tap horizontal and vertical Gaussian filter we will use the last row of the Pascal triangle illustrated above in order to calculate our weights. One may ask why we don’t use the row with index 8 as it has 9 coefficients. This is a justifiable question, but it is rather easy to answer it. This is because with a typical 32 bit color buffer the outermost coefficients don’t have any effect on the final image while the second outermost ones have little to no effect. We would like to minimize the number of texture fetches but provide the highest quality blur as possible with our 9-tap filter. Obviously, in case very high precision results are a must and a higher precision color buffer is available, preferably a floating point one, using the row with index 8 is better. But let’s stick to our original idea and use the last row…

By having the necessary coefficients, it is very easy to calculate the weights that will be used to linearly interpolate our pixels. We just have to divide the coefficient by the sum of the coefficients that is 4096 in this case. Of course, for correcting the elimination of the four outermost coefficients, we shall reduce the sum to 4070, otherwise if we apply the filter several times the image may get darker.

Now, as we have our weights it is very straightforward to implement our fragment shaders. Let’s see how the vertical file shader will look like in GLSL:

Obviously the horizontal filter is no different just the offset value is applied to the X component rather than to the Y component of the fragment coordinate. Note that we hardcoded here the size of the image as we divide the resulting window space coordinate by 1024. In a real life scenario one may replace that with a uniform or simply use texture rectangles that don’t use normalized texture coordinates.

If you have to apply the filter several times in order to get a more strong blur effect, the only thing you have to do is ping-pong between two framebuffers and apply the shaders to the result of the previous step.

9-tap Gaussian blur filter applied to an image of size 1024x1024: no filter applied (left), applied once (middle), applied nine times (right). Click to view the full-sized image in order to better see the difference.

Linear sampling

So far, we were able to see how to implement a separable Gaussian filter using two rendering pass in order to get a 9-tap Gaussian blur. We’ve also seen that we can run this filter three times over a 1024×1024 sized image in order to get a 33-tap Gaussian blur by using only 56 million texture fetches. While this is already quite efficient it does not really expose any possibilities of the GPUs as this form of the algorithm would work perfectly almost unmodified on a CPU as well.

Now, we will see that we can take advantage of the fixed function hardware available on the GPU that can even further reduce the number of required texture fetches. In order to get to this optimization let’s discuss one of the assumptions that we made from the beginning of the article:

So far, we assumed that in order to get information about a single pixel we have to make a texture fetch, that means for 9 pixels we need 9 texture fetches. While this is true in case of a CPU implementation, it is not necessarily true in case of a GPU implementation. This is because in the GPU case we have bilinear texture filtering at our disposal that comes with practically no cost. That means if we don’t fetch at texel center positions our texture then we can get information about multiple pixels. As we already use the separability property of the Gaussian function we actually working in 1D so for us bilinear filter will provide information about two pixels. The amount of how much each texel contribute to the final color value is based on the coordinate that we use.

By properly adjusting the texture coordinate offsets we can get the accurate information of two texels or pixels using a single texture fetch. That means for implementing a 9-tap horizontal/vertical Gaussian filter we need only 5 texture fetches. In general, for an N-tap filter we need [N/2] texture fetches.

What this will mean for our weight values previously used for the discrete sampled Gaussian filter? It means that each case we use a single texture fetch to get information about two texels we have to weight the color value retrieved by the sum of the weights corresponding to the two texels. Now that we know what are our weights, we just have to calculate the texture coordinate offsets properly.

For texture coordinates, we can simply use the middle coordinate between the two texel centers. While this is a good approximation, we won’t accept it as we can calculate much better coordinates that will result us exactly the same values as when we used discrete sampling.

In case of such a merge of two texels we have to adjust the coordinates that the distance of the determined coordinate from the texel #1 center should be equal to the weight of texel #2 divided by the sum of the two weights. In the same style, the distance of the determined coordinate from the texel #2 center should be equal to the weight of texel #1 divided by the sum of the two weights.

As a result, we get the following formulas to determine the weights and offsets for our linear sampled Gaussian blur filter:

By using this information we just have to replace our uniform constants and decrease the number of iterations in our vertical filter shader and we get the following:

This simplification of the algorithm is mathematically correct and if we don’t consider possible rounding errors resulting from the hardware implementation of the bilinear filter we should get the exact same result with our linear sampling shader like in case of the discrete sampling one.

9-tap Gaussian blur applied nine times with discrete sampling (left) and linear sampling (right). Click for the full resolution of the image. Note that there is no visible difference between the two techniques even after several passes.

While the implementation of the linear sampling is pretty straightforward, it has a quite visible effect on the performance of the Gaussian blur filter. Taking into consideration that we managed to implement a 9-tap filter using just five texture fetches instead of nine, back to our example, blurring a 1024×1024 image with a 33-tap filter takes only 1024*1024*5*3*2 ≈ 31 million texture fetches instead of the 56 million required by discrete sampling. This is a quite reasonable difference and in order to better present how much that matters I’ve done some experiment to measure the difference between the two techniques. The result speaks for itself:

Performance comparison of the 9-tap Gaussian blur filter with discrete and linear sampling on a Radeon HD5770. The vertical axis is the frames per second (higher is better) and the horizontal axis represents results with various number of blur steps (higher is blurrier).

As we can see, the performance of the Gaussian filter implemented with linear sampling is about 60% faster than the one implemented with discrete sampling indifferent from the number of blur steps applied to the image. This roughly proportional to the number of texture fetches spared by using linear filtering.

Conclusion

We’ve seen that implementing an efficient Gaussian blur filter is quite straightforward and the result is a very fast real-time algorithm, especially using the linear sampling, that can be used as the basis of more advanced rendering techniques.

Even though we concentrated on Gaussian blur in this article, many of the discussed principles apply to most convolution filter types. Also, most of the theory applies in case we need a blurred image of reduced size like it is usually needed by the bloom effect, even the linear sampling. The only thing that is really different in case of a reduced size blurred image is that our center pixel is also a “double-pixel”. This means that we have to use a row from our Pascal triangle that has even number of coefficients as we would like to linear sample the middle texels as well.

We’ve also had a brief insight into the computational complexity of the various techniques and how the filter can be efficiently implemented on the GPU.

The demo application used for the measurements performed to compare the discrete and linear sampling method can be downloaded here:

It is not necessary, however with an even radius you’ll have two additional texels at the left and right end that cannot be linearly sampled, or you can include the middle texel into linear sampling on both sides to get less number of texture fetches. However, in general an odd radius is preferred to get optimum number of texture fetches.

On the other hand, in case you need a reduced size blurred image it doesn’t really matter if you use an odd or even sized filter radius.

Yes, it should, and actually if you sum the weights counting the side texel weights double it is roughly equal to 1.0. The only reason it is a little bit less is because we discarded the two least significant coefficients as I mentioned in the article.

I also mentioned that due to the discard of the two least significant coefficients we can decrease the sum of the coefficients accordingly but in the sample application I didn’t do so but just for simplicity. I thought it will be less confusing this way.

Great article!
In my shaders I’m getting a performance increase by moving the texture coordinates calculation to the vertex shader. This saves ‘millions’ of vec2 constructors, additions and divisions in the fragment shader.
The vertical filter shader would look like this:

Actually you’re right as these things should be better calculated in the vertex shader.

However, I’m not 100% sure that you save that amount of time in rendering due to texture fetches usually take much more time than for the GPU to calculate a few additions and subtractions and due to latency hiding mechanisms in the latest GPUs will simply eliminate the benefits of moving the calculations to the vertex shader.

If you are not convinced, then please give me some benchmark results as maybe I’m wrong.

You made me interested. Unfortunately my only card wouldn’t be a good basis for a generic benchmark but maybe I will create then a demo which can do both methods and let’s see how it performs on various GPU generations as I’m interested as well what will be the outcome of such a test.

> The only reason it is a little bit less is because we discarded the two least significant coefficients as I mentioned in the article.

Actually, the weights should always sum to 1.0, otherwise the filter is essentially dimming down the incoming image (a tiny bit). For example, if you used “1 2 1″ and trimmed the “1″s off your filter, but still weighted by 4 (sum of 1+2+1), you would effectively cut the image intensity by half. Extreme case, I know, but not summing to 1.0 in your code is the same sort of problem, at a different scale.

There is a discrepancy between text and code. The weights in your first code snippet, when multiplied by 1024, give the values “16.5 55 123.75 198 231 198 123.75 55 16.5″. They’re supposed to give “10 45 120 210 252 210 120 45 10″ as shown in the diagram. So, why is there a mismatch?

Otherwise, I like the article, and the comment about precomputing values in the vertex shader is a great tip (thanks, heliosdev!). Since at least some of these values are used *before* the first texture sample is retrieved, it makes sense to me that computing these once in the VS would be a win. I agree that anything computed after the first texture tap tends to cost nothing due to latency hiding, but computations before any taps cannot be hidden (if I understand this area correctly…).

Yes, you’re right that the numbers do not sum exactly to 1.0 in the demo. I know about the issue. It is due to the discarded coefficients and due to rounding errors. However, I don’t think that there such a big difference if you re-multiply them but I will double check them sometime.
About the VS solution, you are right that the calculations between the first texture fetch cannot be hidden, however for the first texture fetch we don’t need any calculation as we simply fetch the texel corresponding to the actual fragment. The only thing done in the demo is the division by the texture size, however that is not needed either if you use a texture rectangle as for them the texture coordinates are not normalized.

Divide by 1022 instead of 1024 and you’re golden – the weights must sum to 1.0. The code dims by a mere 0.2%, but there’s no reason to dim at all.

> for the first texture fetch we don’t need any calculation

Good point! However, I do think heliosdev’s point is a great one: computing it 4-6 times in the vertex shader for the quad is going to be faster than computing these a million times or more in the pixel shader. Sometimes latency hiding will luck out and hide these computations, but it seems better to not risk it and just compute these in the VS. That said, maybe there’s some hidden cost (interpolating extra values on the VS output?) that makes this A Bad Idea. I’ll be interested to how your benchmarking goes, if you do it.

Any comment on the large discrepancy I found between theory and code? The theory says use “10 45 120 210 252″, the code actually uses “16.5 55 123.75 198 231″. Bug, or was a different filter kernel actually used, or…? I’m not trying to nit-pick, but I’d love it if you fixed your code to match the theory (or vice versa), so that we all could benefit.

If the Pascal’s triangle numbers are the right ones to use, then for the first code snippet it should be:

> I’m not trying to nit-pick, but I’d love it if you fixed your code to match the theory (or vice versa), so that we all could benefit.
No problem, just I haven’t had the time so far as I had to travel this weekend and I haven’t got home yet.
> maybe there’s some hidden cost (interpolating extra values on the VS output?) that makes this A Bad Idea
Yes, that’s what I’m concerned about, but maybe I’m completely wrong so let the results talk instead.

No rush, I just wasn’t sure it was on the radar, since you didn’t mention it in your reply. I was thinking a bit more on the 252 vs. 231 mismatch. One idea: the Pascal Triangle is an approximation, after all, of the Gaussian. Also, I think we want to integrate the area under the curve of the Gaussian for each pixel’s extent. This would tend to make the central area’s value (middle of the curve) smaller that the central point’s value. I might start playing with this and asking some experts about it…

> the Pascal Triangle is an approximation, after all, of the Gaussian
Yes, I know and it is most probably not the best idea to use it for production quality but I intended this article for novices as it is much more like a tutorial than a quality scientific article.
I know now why the coefficients are different. It is because I’ve used the 12th row of the Pascal triangle that is not even shown on the figure. 924 is the middle coefficient and the sum is 4096. Sorry for that, just I’ve made the demo earlier than the article. I will correct it ASAP.
Btw, are you the Eric Haines from Autodesk?

Nice, thanks for the quick fixes, looks good. Yeah, I was figuring there was something going on at 4096, since the old factors had halfs and quarters for fractions. I thought you might have been going much further down the triangle and adding up factors and averaging. I’m still not sure about “area under the Gaussian curve” vs. Pascal’s triangle values, nor (for that matter) how standard deviation for the Gaussian curve relates. Anyway, your article now matches text to code, so good deal.

Really great article. I have been looking for a clear explanation of how to do linear sampled blur before without any luck. So I will definitely add this to my screensaver/music visualizer.

Can you always blur down to a downscaled version for extra performance even if you need a normal blur? You mention doing this for a bloom effect.
It might be more of a memory optimization to not be forced to allocated backbuffers of the same size as the screen.

However using the ping-pong technique with scaled down back buffers would make calculating the total blur size difficult.

> Can you always blur down to a downscaled version for extra performance even if you need a normal blur?
Depends on the use case. If you would like to create a simple blurred image of your visualization effect, you should not use downscaling. However, if the blurred image is just an input to more sophisticated effects, maybe it worths. Can you give more details?

Since no one seems to have mentioned it so far: in signal processing the filter you use is known as a “binomial filter” (surprise!). Another way to compute them is to convolve repeatedly with a moving average filter [1 1]. For example convolving once with [1 2 1] is equivalent to convolving twice with [1 1]. This can sometimes be used to speed up binomial filters in CPU implementations, but I’m not sure it’s beneficial on the GPU.

You are right, that in fact the sample implementation is really a simplified binomial filter, but actually I mentioned that we use the binomial coefficients to give a rough estimation of the area under the Gaussian curve. Actually that is not the only approximation we use, just see the eliminated coefficients. Usually in real-time graphics things go like this…

“A Gaussian function with a distribution of 2σ is equivalent with the product of two Gaussian functions with a distribution of σ.”
“Using the second property of the Gaussian function, we can separate our 33×33-tap filter into three 9×9-tap filter (9+8=17, 17+16=33).”

Daniel, could you please explain more detailed what is filter separation?Multiplication of two one-dimensional filters looks clear – they are applied consistently, and result is two dimensional filter of same size, but how can one filter be represented by less size filters?

I am newbie in image filtering, so I’ll be satisfacted enough by link to any manual instead of answer

All of these properties depend on the nature of the filter. While a simple box filter or a Gaussian filter is separable, most of the convolution filters are not. The same is true for the composition of two fewer-tap filters.

Unfortunately I don’t have any book recommendations, at least right now. If I’ll find something useful then I’ll post it.

Okay, in order to ease the explanation let us consider the 1D case:
In the first round you have a 9 tap filter: the center texel plus 4 more in both directions.
In the second round you take another 9 tap but there every tap already contains data from a 9 tap footprint so you get actually a 17 tap filter as besides the 9 taps you take, the texel contains already 4-4 taps from the left and right side so it is actually 4+9+4=17 taps.
The same for the next case. Here you actually have a cumulative 8-8 texels from both sides beside your 9 tap filter so that is 8+17+8=33.

I added this to my engine and it does seem to work fine so thanks for that. But I do wonder what the best way to do large blurs is. When is it better to do say a 21 tap blur then a few 9 tap blurs. One such test I did I needed quite a few passes to get the blur radius I wanted. So I changed it do to a few 21 tap blurs then 9 tap for the smaller ones and the frame rate increased by 60%.

Yes, in fact usually using larger kernels performs better. This is because raster operations also take some times as well as the implicit synchronization incurred by requiring the results of the previous pass to do the current one. This is why the theoretical performance gain by using multiple steps is not always there in real world scenarios.

I agree that a 21 tap filter is better than using several smaller tap filters, however, in case you would need a >100 tap filter then most probably you could really take advantage of the composition property of the Gaussian function. Also, optimal filter kernel varies based on the target hardware, so it is always best to do some benchmark to select a particular kernel size.

Hello everybody. Very interesting article but I´ve got one followup question. You mentioned that convolving twice with sigma is the same as convolving one with 2*sigma. My current work relies on rather accurate sigma calculation so for example i´ve got sigma of sqrt(2) so i divide it by 2 giving sigma=sqrt(2)/2 convolve twice and the result is miles away from beeing the same in both cases. For coefficients i use actual equation not binomial, i see that if I take first row from binomial and convolve twice i d achieve 2. row but that is not the case of sigma beeing double , is it ?

in fact this is my code for weights that is probably the most important of the whole deal. I use 1d separation too but that of course have no effect on my current problem … int kernel_size=(int)sigma*6+1;
if(kernel_size<5)
kernel_size=5;
float *kernel;
kernel=new float[kernel_size];

Ok my bad, it seems the image is too large so I use too much GPU instruction and it stops halfway. Can I divide a n-tap blur into two (n/2)-tap? In wikipedia it says applying 6-tap then 8-tap results in 10-tap blur, so the relation is a^2+b^2=c^2. Which is correct?

Thanks for the writeup, I came across it after reading David Wolff’s “OpenGL 4.0 Shading Language Cookbook.”

In response to Evan, I was able to adapt the above into a shader program that works well on iOS devices. On an iPhone 4, it can filter a live 640×480 video frame in under 2 ms, and I was able to filter a 2000×1494 image and save it to disk in about 500 ms. The code for this can be found within my open source GPUImage framework under the GPUImageFastBlurFilter class:

The naming needs a little work, because I use this in two passes, one each for the vertical and horizontal halves, and I just set the corresponding width or height input offset to 0 for the appropriate stage.

You do need to move the texture offset calculations to the vertex shader, because dependent texture reads are very expensive on these GPUs. Also, they don’t handle for loops well, so I had to inline the calculations.

Thanks for sharing this, Brad, however, you made some incorrect assumptions:

There is no dependent texture reads in my shader either. Everything is uniform. Dependent texture reads mean that the texture coordinates of the fetch are affected by a previous texture fetch. While moving the texture coordinate calculation to the vertex shader might help in some situations, on desktop GPUs the additional varying count would hurt the performance more than performing the math in the fragment shader.

Also, the for loop should not hurt performance either as a proper GLSL compiler implementation should be able to unroll for loops with compile time constant conditions. At least, I can assure you that this happens in case of desktop drivers.

“Dynamic texture lookups, also known as dependent texture reads, occur when a fragment shader computes texture coordinates rather than using the unmodified texture coordinates passed into the shader. Although the shader language supports this, dependent texture reads can delay loading of texel data, reducing performance. When a shader has no dependent texture reads, the graphics hardware may prefetch texel data before the shader executes, hiding some of the latency of accessing memory.

It may not seem obvious, but any calculation on the texture coordinates counts as a dependent texture read. For example, packing multiple sets of texture coordinates into a single varying parameter and using a swizzle command to extract the coordinates still causes a dependent texture read.”

Perhaps there is a difference in terminology here, but I’ve seen that anything that performs a calculation involving a texture coordinate causes a huge slowdown in the texture fetch in a fragment shader on these mobile devices. This is even true for calculations involving constant offsets, like yours here. Moving the calculations from the fragment to the vertex shader, then providing the results as varyings, took my version of your blur from 50 ms / frame to 2 ms / frame. I think this is one of those areas where the tile-based deferred renderers diverge from standard desktop hardware.

My comments about the for loops are also based on my profiling. The current shader compilers employed for mobile GPUs are not great at these kinds of optimizations yet, so we still need to hand-tune things that should be taken care of for us, like unrolling loops. It’s unfortunate, but hopefully they’ll catch up to desktop compilers soon.

Mobile GPUs are interesting animals, and I’ve had to perform quite a few tweaks like this to get desktop shaders working well on them.

No trackbacks yet.

I’ve chosen the title based on the popular article that tries to prove that OpenGL lost the war against Direct3D. To be honest, I didn’t really like the article at all. First, because it compared OpenGL 3 which targeted Shader Model 4.0 hardware and DirectX 11 which targeted Shader Model 5.0 hardware. Besides that, as we…

After the release of the OpenGL 4.1 specification the Khronos Group slowed down the pace a little bit but they didn’t left OpenGL developers without a new specification version for too long as a few weeks ago they’ve released OpenGL 4.2. The new version of the specification brings several API improvements as well as exposes…

You might remember that I wrote an article about my suggestions for OpenGL 4.2 and beyond. One of the features that I recommended to be added to OpenGL was a yet non-existent extension called GL_ARB_draw_indirect2 which suggested the addition of new draw commands that are similar in fashion to the ancient MultiDraw* commands but they…

In this article, I would like to present you an edge detection algorithm that shares similar performance characteristics like the well-known Sobel operator but provides slightly better edge detection and can be seamlessly extended with little to no performance overhead to also detect corners alongside with edges. The algorithm works on a 3×3 texel footprint…

The Khronos Group did a great job in the last few years to once again prove that OpenGL is still in game and that it can become the ultimate graphics API of choice, if it is not that already. However, we must note that it is not quite yet true that OpenGL 4.1 is a…

Currently there are several ways to feed data to the GPU no matter of what API we use and what type of application we develop. In case of OpenGL we have uniform buffers, texture buffers, texture images, etc. The same is true for OpenCL and other compute APIs that even provide more fine-grained memory management…

Dynamic geometry level-of-detail (LOD) algorithms are very popular and powerful algorithms that provide a great level of rendering performance optimization while preserving detail by using less detailed geometry for objects that are far away, too small or otherwise less significant in the quality of the final rendering. Many of these are used since the very…

Hierarchical-Z is a well known and standard feature of modern GPUs that allows them to speed up depth testing by rejecting large group of incoming fragments using a reduced and compressed version of the depth buffer that resides in on-chip memory. The technique presented in this article uses the same basic idea to allow batched…

OpenGL 3.0 capable GPUs introduced a level of processing power and programming flexibility that isn’t comparable with any earlier generations. After that, OpenGL 4.0 and the hardware supporting it even further pushed the limits of what previously seemed to be impossible. Thanks to these features nowadays more and more possibilities are available for the graphics…

With the introduction of Shader Model 5.0 hardware and the API support provided by OpenGL 4.0 made GPU based geometry tessellation a first class citizen in the latest graphics applications. While the official support from all the commodity graphics card vendors and the relevant APIs are quite recent news, little to no people know that…