[The following commentaries and notes are intended to explain the talk in more detail.]

2 / 55

Working on projects with strict deadlines and budgets has quite an impact on the way things get done, especially the research phase, which is mainly driven by the project's constraints. You may have very strong technical constraints, limited by the memory and execution time available, which can differ from platform to platform, as well as constraints on production time and cost. And it is especially interesting to see how new solutions to common problems can be discovered in such conditions.

I would like to share with you one of my recent experiences of searching for an alternative anti-aliasing solution which we ended up using in The Force Unleashed 2 (TFU2) by going through a series of prototypes, failures and successes, getting to the main idea of Directionally Localized Anti-Aliasing (DLAA) and discussing its PlayStation3 and XBox360 specific implementation details.

3 / 55

The final idea of the technique is very simple: "blur edges along their directions".

Of course, this type of idea doesn't pop out of nowhere at once or overnight. It's rather a long process of trial and error, a process of building some kind of understanding of what does and what does not work. And as in our case, it's a result of acquiring an intuitive feeling for what to do next when something doesn't work the way it's supposed to.

4 / 55

The problem with simple ideas like this is that it's really hard to believe they would ever work, and they are very easy to misinterpret.

The idea of blurring to achieve anti-aliasing is not new by itself, as quite a few techniques based on depth-weighted kernels of different kinds already exist. But it took more than a month of thinking and playing with Photoshop, while working on other things at the same time, to realize that you can blur the edges directionally, horizontally and vertically, to hide the jaggies.

As simple as it sounds, it is not trivial at all. And as usual, complexity is in the details. So let's see what it really is about.

5 / 55

But before we get to the anti-aliasing technique itself, let's see what aliasing is, why we care about it and why it's important to minimize it as much as possible.

Generally speaking, aliasing is an effect which makes different signals become indistinguishable when sampled, or when the signal reconstructed from samples differs from the original one. This is quite a problem in signal processing, both audio and video, and it has well established theoretical foundation.
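As a minimal sketch of signals becoming indistinguishable when sampled, consider two sine waves sampled below the Nyquist rate; the frequencies and sample rate here are made up purely for illustration:

```python
import numpy as np

fs = 4.0                             # sampling rate in Hz
t = np.arange(8) / fs                # 8 sample times
low = np.sin(2 * np.pi * 1.0 * t)    # 1 Hz signal, below Nyquist (2 Hz)
high = np.sin(2 * np.pi * 5.0 * t)   # 5 Hz signal, above Nyquist

# Sampled at 4 Hz, the 5 Hz sine aliases down to |5 - 4| = 1 Hz:
# its samples are identical to those of the 1 Hz sine.
indistinguishable = np.allclose(low, high)
```

Once sampled, nothing distinguishes the two signals, so any reconstruction of `high` will look like the 1 Hz wave.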

In practice though, we don't think about it in terms of sampling and reconstruction. In graphics, it is associated with "pixel noise", when distant or inclined parts of the scene suffer from temporal noise and Moire patterns, which is solved relatively easily. But very often it is related to geometrical edge jaggies, long edges that look more like "stair steps" instead of continuous lines.

The problem with "stair steps" is that they are a very unnatural element. It's definitely not something that we would see or experience in everyday life. Thus, whether we like it or not, a scene with edge jaggies all over the screen will feel unnatural to look at. And this sets some foundation for perception-based filtering, the idea of trying to turn unnatural elements into more natural ones.

6 / 55

The general way to fight aliasing artifacts is to reduce the higher frequencies of the signal before sampling it: filtering the signal by cutting off anything that could not be reconstructed back, anything that would interfere with the sampling frequency or sampling grid.

Often it's some sort of low-pass filter or "blur" applied to the original input, and it's always a compromise between softness and aliasing. According to sampling theory, no perfect filter exists.
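A tiny 1D sketch of why pre-filtering matters; the 2-tap box filter and checker pattern are illustrative choices, not a recommendation:

```python
import numpy as np

# A 1-pixel checker pattern: the highest frequency a 1D image can hold.
signal = np.tile([1.0, 0.0], 8)

# Naive 2x downsample: take every other pixel.
naive = signal[::2]           # picks up only the 1s: aliased to flat white

# Low-pass ("blur") first with a 2-tap box filter, then downsample.
blurred = (signal + np.roll(signal, -1)) / 2.0
filtered = blurred[::2]       # flat gray: the unrepresentable frequency is gone
```

The naive result keeps none of the original structure and invents a new (flat white) one, while the pre-filtered result at least preserves the average intensity.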

The filtering is done temporally in audio and spatially in optics. Digital cameras, for instance, blur the input optically before it reaches the sensor to reduce aliasing. Whereas photographic film contains "grains" in a random pattern, and doesn't suffer from this effect.

7 / 55

In real-time graphics, different aspects of aliasing are addressed differently.

Texture sampling aliasing, that "pixel noise" we talked about, is eliminated by simply pre-filtering the image for different sampling frequencies, creating so-called mip-maps, as well as by using anisotropic filtering to improve the overall sharpness of the texture.

Shading aliasing, when dealing with specular and rim lighting, is usually solved manually in the shader by tweaking, because the most commonly used multisample anti-aliasing solutions don't re-evaluate the shader for multiple samples, especially the 2x and 4x implementations on PS3 and X360.

8 / 55

To reduce geometry edge jaggies, multisample anti-aliasing (MSAA) has been the holy grail solution in games for many years as it gives pretty good visual results.

MSAA is an optimized form of supersampling, which only requires sampling some components of the scene at higher frequencies (resolutions). Usually it's the depth information that gets supersampled and then combined with single shader samples. But unfortunately, this solution is not always applicable.

The more multi-pass and deferred techniques we put in place in order to keep increasing the visual complexity of the scenes, the more costly it becomes. Especially on consoles, where it easily takes over 5 ms directly in the color or lighting passes, but also indirectly, when we have to adjust all the other post-processing effects that can introduce aliasing as well.

9 / 55

And this is where alternative anti-aliasing solutions come into play.

Screen-space filtering tries to hide jaggies based on perception, using as little additional data as possible. Some only use color information, others incorporate depth and normals, and in most cases the anti-aliasing effect is achieved at the expense of losing sub-pixel accuracy.

Temporal filters take advantage of information present in multiple frames (Crysis 2, Halo Reach) and can get pretty good results. The most basic implementation of temporal anti-aliasing would accumulate the image while jittering the camera position within a pixel distance. But as soon as things get more dynamic, you have to do temporal re-projection and deal with different sorts of artifacts, introducing halos and decreasing effectiveness.

There are also geometry assisted and edge-based techniques that can use extra information about the edges in order to do the filtering.

10 / 55

Morphological Anti-Aliasing, presented by Alexander Reshetov from Intel Labs, spawned a lot of interest and new research related to screen-space filtering.

The core idea of MLAA is based on reconstruction of original geometry from its picture. The result of binary edge detection is used for pattern and shape recognition, in order to figure out what the edge should have looked like and then recombining the neighboring pixels to achieve anti-aliasing effect.

The PlayStation3 version of The Saboteur made the first attempt to implement this technique on the SPUs, as it's very CPU friendly. But it was not until God Of War 3 that it became more practical in terms of quality, while saving a lot of precious RSX (GPU) time. And while SPU implementations seemed to work, the XBox360 GPU implementations were and still are challenging, taking over 3.7 ms at 720p.

Additionally, while getting good results on static images, MLAA suffers from very noticeable temporal artifacts in motion, due to its pattern recognition based binary nature. Flickering in the image is most visible on long edges passing across noisy backgrounds.

11 / 55

Geometry assisted methods can get even better results while still doing pixel recombination in screen-space with sub-pixel accuracy.

A very effective edge-based anti-aliasing technique was demonstrated in an XBox360 SDK sample by Cameron Egbert. One-pixel-wide polygons are rendered on top of the original scene, while a texture coordinate is used as a blending coefficient to re-blend neighboring pixels.

In this case, the texture coordinate is linearly interpolated during rasterization and represents the distance from the sampling point, or the coverage of the pixel by the original polygon. The closer we get to the top edge of the one-pixel-wide polygon, the more of the pixel above it we blend in. Vertical edges are processed in the same way.
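The re-blending step described above is just a lerp driven by the interpolated coverage value; this sketch uses scalar "colors" and a hypothetical helper name for illustration:

```python
def blend_edge_pixel(pixel, pixel_above, coverage):
    """Re-blend a pixel with its upper neighbor using the rasterized
    coverage value (the interpolated texture coordinate of the
    one-pixel-wide edge polygon), as in the SDK-sample technique."""
    return (1.0 - coverage) * pixel + coverage * pixel_above

# The closer we are to the top of the edge polygon (coverage -> 1),
# the more of the pixel above gets blended in.
quarter = blend_edge_pixel(0.0, 1.0, 0.25)   # mostly the original pixel
half = blend_edge_pixel(0.2, 0.8, 0.5)       # even mix of both
```

Horizontal edges blend with the left/right neighbor instead, but the formula is identical.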

It works beautifully, and the only problem with this method is the cost of storing those edges and/or generating them in real time.

12 / 55

But we couldn't really use either of those.

MLAA because of its temporal instability and implementation difficulties on the XBox360. The edge-based technique would introduce extra cost on the GPU and additional geometry processing. We could probably have used different solutions on different platforms, but that would create difficulties during production, since artists might need to tweak their work differently for each platform. That is a very bad thing to do when you only have 9 months of production.

We could not use temporal anti-aliasing either. In order to control performance to stay at a solid 30 frames per second, we introduced dynamic resolution adjustment.

This dynamic resolution adjustment was a necessary feature, mostly based on perception. When a lot of things are going on screen at the same time, it is very likely that there is a lot of motion; we simply can't focus on details and see high-resolution features. Moreover, the image is most likely motion-blurred anyway. So in this case we decrease the resolution gradually, until the game is capable of running at 30 fps. The worst case would be to go from 720p down to 576p. But when you are done with the "enemies", so fewer things happen and you can enjoy the vista, the resolution goes back up to 720p.

13 / 55

In TFU1 we had a fairly good depth-based anti-aliasing filter, which would blur the edges based on the depth difference. But when the resolution started to change gradually all the time, it made the whole thing look even worse.

If the resolution doesn't change, then all the "stair steps" move consistently in the same way, and when blurred uniformly they look fine. But when it does change, they start moving quite far away from their previous positions on the edge, almost randomly, from frame to frame, which along with the extended blur makes them a lot more noticeable.

And while the old solution remained our back-up plan, we really needed something that would work on both platforms, the XBox360 GPU and the PlayStation3 SPUs; if not perfect, then reliable in production by looking the same on both platforms, temporally stable, effective and efficient.

14 / 55

It was around August 2009, during art pre-production, that most of the prototyping was done, looking for different ways to get those gradient-like features on the edges and use them for pixel recombination.

And despite the fact that we were not able to use any of those prototypes in the end, mainly for the same reasons as all the other existing techniques, they were essential for building up knowledge and understanding of what works, what doesn't and why. It was important to build an intuitive understanding in order to be able to see similarities and come up with the final idea of using directional blurs.

15 / 55

The first thing that comes to mind, when you see those gradients on the edges, is probably Fresnel term. For very high exponents of dot( N, V ) it looks very similar. And when used for pixel recombination in the same way as MLAA or edge-based solutions use it, the result is surprisingly good. But it's really hard to control and it mostly works on curved surfaces only.

The idea was to compute it in the shader, when doing the main color pass, and output it in alpha component. But since we used 32-bit buffers (8-bit per channel), that would not be enough to do proper normalization, as the value would saturate in a lot of cases.

Moreover, the alpha channel was already used for lower precision HDR luminosity, so along with the extra GPU cost this technique was discarded, as well as other geometry assisted ones that would rely on extra shader output.

16 / 55

Then I tried to play with different depth based techniques. One of which was based on the idea of blurring the depth in both directions uniformly (box blur), detecting the edges and normalizing the blurred value between minimal and maximal depth values in the pixel's neighborhood.

We got pretty good looking edge gradients, but without any additional adjustments it would only work on flat surfaces, and like any other depth-based solution it would not be very sensitive to jaggies on floors and ceilings, flat surfaces where the depth difference is minimal.

17 / 55

All this prototyping went on for a while without getting anywhere. And the best thing that seemed to work was to use an alternative depth pre-pass to compute pixel coverage, which required extra GPU time on the PS3.

The idea behind this technique is to have many different sampling points along the edge within a certain radius, virtually having more than two samples while doing only two samples :) This is done by rendering an alternative rotated depth buffer instead of the blurred one from the previous technique.

It is interesting, because it gives very nice dithering-like effect and preserves some sub-pixel information. Alternatively, we could probably reuse results of a single MSAA depth pre-pass, resolve it into different buffers for anti-aliasing filter while using one of them as the main depth. This would reduce some execution time on the XBox360, but would introduce Z-fighting, causing a lot of headaches.

It turned out that it was not the end of prototyping, but only the beginning.

18 / 55

I also got MLAA working on the SPUs, which is when it revealed its serious temporal instabilities that were not acceptable. It was a huge mistake not to test the filter for stability before putting it on the SPUs. Since there was no GPU implementation, I should have tested it on a video sequence by running it through Intel's filter, for which the source code was available.

Any attempt to use extra information such as depth didn't work either. For instance, rim lighting hides a lot of aliasing in many cases, but the depth information alone says that there is aliasing and that it has to be fixed. So in cases where there was no visible aliasing, the filter would introduce it.

Probably I could have found some better way to make it temporally stable, but it was about time to start freaking out, because it started to stall all the other SPU work that I was supposed to be doing, and it was time to back up and proceed with plan B, staying with the old solution.

But whether by chance or not, the following happened in the next few days. And this is really striking. Blurring the edge vertically makes it look identical to the supersampled one. Of course, you can say that if the angle changes then it will look different, but you can always adjust the blur, and it will look the same again.

19 / 55

It was the end of January 2010 and only 5 months to finish all the post-processing on SPUs, but now it was up to a few sleepless nights to make our new anti-aliasing filter work.

And this is where Photoshop comes in handy, as it really forces you to think differently. It's hard to do anything complex, so if something happens to work, it will be easy to implement. In fact, I find myself using Photoshop more than any other prototyping tool.

The key difference from shader-oriented prototyping is that you are forced to think in terms of layers and masks, as opposed to individual pixels and textures. Most of the basic Photoshop filters such as blurs, sharpeners and edge detectors are easily implemented with a custom 5x5 convolution matrix. You can't do wider filters, but that's good enough to test an idea.

Now, let's play with some filters and see what we can do with it.

20 / 55

Well, blurring the whole thing vertically makes the vertical edge look pretty good, but obviously not the rest. It only makes sense to do it on the edge only.

21 / 55

And in order to find all the vertical edges we can apply the following high-pass filter [-1 2 -1], to find rapid intensity changes in the horizontal direction. Since the output of such a filter can be negative, we have to adjust the additional filter parameters a little bit, setting the scale to 2 and the offset to 128, thus placing the zero level at middle gray.

In this case, gray levels represent both negative and positive values. The higher the contrast in the original image, the more pronounced the edge is and the brighter or darker the result of this filter is.
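Photoshop's Custom filter computes `sum(pixel * coeff) / scale + offset`, so the step above can be reproduced directly; the scanline values here are made up for illustration:

```python
import numpy as np

# An 8-bit scanline with a dark-to-bright step: a vertical edge seen
# along the horizontal direction.
row = np.array([40, 40, 40, 200, 200, 200], dtype=np.float32)

def high_pass(r, scale=2.0, offset=128.0):
    """[-1 2 -1] high-pass, mapped like Photoshop's Custom filter:
    divide by scale, add offset, clamp to the 8-bit range. With scale 2
    and offset 128, the zero level of the signed response is middle gray."""
    padded = np.pad(r, 1, mode="edge")
    response = -padded[:-2] + 2 * padded[1:-1] - padded[2:]
    return np.clip(response / scale + offset, 0, 255)

result = high_pass(row)
```

Flat regions come out as middle gray (128), while the step produces a dark value on one side and a bright value on the other, exactly the signed pair the text describes.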

22 / 55

And since both bright and dark levels intersect with the original edge, we have to take both of them into account, ignoring anything that is closer to the middle gray level.

That could be done easily by applying the curve, inverting anything that is darker than middle gray and adjusting the contrast at the same time. In shader terms this would be equivalent to

saturate( abs( x ) * a - b )

23 / 55

Now we can use this as a mask for the blurred layer, converting it to gray scale by desaturation. This will apply the vertical blur only where the high-pass response for vertical edges is high. And then do the exact same thing for all the horizontal edges, rotating kernels by 90 degrees.
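The whole short-edge step for vertical edges can be sketched in a few lines: high-pass across x, the saturate curve as a mask, and a masked vertical blur. The contrast/threshold values `a` and `b` and the 3-tap blur are made-up illustration choices, not the production parameters:

```python
import numpy as np

def saturate(x):
    return np.clip(x, 0.0, 1.0)

def shifted(img, dy, dx):
    """Shift an image by (dy, dx) with edge clamping (no wrap-around)."""
    p = np.pad(img, 1, mode="edge")
    h, w = img.shape
    return p[1 + dy : 1 + dy + h, 1 + dx : 1 + dx + w]

def dlaa_short_vertical(img, a=4.0, b=0.1):
    # [-1 2 -1] high-pass across x: responds to vertical edges.
    hp = -shifted(img, 0, -1) + 2.0 * img - shifted(img, 0, 1)
    # The "curve" step, saturate( abs(x) * a - b ): fold the signed
    # response onto positive values and raise the contrast.
    mask = saturate(np.abs(hp) * a - b)
    # 3-tap vertical blur, blended in only where the edge mask is high.
    blur = (shifted(img, -1, 0) + img + shifted(img, 1, 0)) / 3.0
    return img * (1.0 - mask) + blur * mask

# A jaggy step: the edge sits at column 4 in the top rows, column 3 below.
img = np.zeros((6, 8))
img[:3, 4:] = 1.0
img[3:, 3:] = 1.0
out = dlaa_short_vertical(img)
```

Flat regions pass through untouched, while the pixels around the "stair step" pick up intermediate values (1/3 and 2/3 here), softening the step. Horizontal edges would use the same code with the two directions swapped.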

24 / 55

Surprisingly, it doesn't look that bad. For what it actually is, it looks pretty good. Some pixel-wide features are reduced, but that could be adjusted. The left vertical edge is fine, but the longer horizontal one at the top still looks aliasy.

25 / 55

Having a 5-pixel wide kernel for everything is definitely not enough. And if we just extend the kernel to 16 pixels, which in general case is enough to hide jaggies at 720p, that would break the image a lot more. So we have to consider two cases, short and long edges, and try to apply wider blur to long edges only.

Since we are limited to 5-pixel wide kernels, when doing custom filters, we could use Photoshop's Motion Blur filter instead.

26 / 55

But how do we find those long edges?

Well, if we take the result of horizontal edge filter and blur it again horizontally, after adjusting the contrast we get a pretty good estimate of where the long edges are.
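A 1D sketch of that estimate: blur the absolute high-pass response along the edge direction, then raise the contrast. The kernel width, contrast and threshold below are illustrative values, not the shipped TFU2 parameters:

```python
import numpy as np

def saturate(x):
    return np.clip(x, 0.0, 1.0)

def long_edge_estimate(hp_abs, width=8, contrast=2.0, threshold=0.5):
    """Blur the (absolute) high-pass response along the candidate edge
    direction, then apply a contrast curve: long runs of response sum to
    large values, isolated details do not."""
    kernel = np.ones(width) / width
    padded = np.pad(hp_abs, width // 2, mode="constant")
    blurred = np.convolve(padded, kernel, mode="valid")[: len(hp_abs)]
    return saturate((blurred - threshold) * contrast)

# A long run of edge response vs. an isolated two-pixel detail.
long_edge = np.array([0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0], dtype=np.float32)
detail = np.array([0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0], dtype=np.float32)
```

The long run saturates the estimate to 1 somewhere along its length, while the isolated detail never rises above the threshold and is rejected entirely.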

27 / 55

It sounds like witchcraft, but it makes sense once you think about it.

The reason I like to use the word "blur" is because it hides a lot of mathematical complexity and allows us to focus on the idea itself. If we were doing continuous analysis we would talk about integrals. In discrete analysis this would be summation. So blur is just a nice, simplified but very visual way to think of integrals and summations.

By looking at the result of a high-pass filter, we can clearly see all the long edges. We basically get very long horizontal or vertical stripes of bright dots, which means that the more high frequencies we have in a particular direction, the higher the chance of having a long edge in that direction.

28 / 55

The contrast adjustment is important. Looking at the blurred version of the high-pass result, it is obvious that there are some less intense areas and lines that don't correspond to any long edges at all, and those are the ones we want to ignore.

Since we are using 16 pixel wide kernel for long edges, it's capable of scattering a single "stair step" across 16 pixels as well, 8 pixels to the left and 8 pixels to the right. The absolute value of the high-pass filter gives high response on both sides of the edge. Which means that a line with 16 pixel wide steps will turn into segments of 32 pixels around the step, 16 to the left and 16 to the right. And when those get blurred with 16 pixel wide kernel, we would only get maximal sums in the 16 pixel wide window around the "stair step".

Technically, we could use a threshold at the maximal possible value to detect edges that are 16 pixels long and longer. But that would introduce temporal artifacts and wouldn't handle edges shorter than 16 pixels, especially since the result of the high-pass filter is not binary.

By adjusting the contrast we can control when the filter kicks-in.

29 / 55

The closer we get to the long edge case, the higher the response of the detector is. And it could be used as a blending mask for wider blur, in the exact same way we did it for short edges.

30 / 55

But then we get a new problem. Color bleeding.

If it happens that there is some distant element with a very different color, then its color will be scattered around for the size of the long blur kernel. For shorter edges it's not an issue, since our visual system has lower color resolution and we just don't notice it. But in this case it's really bad.

31 / 55

Unless we go gray scale, in which case the edge looks fine, and all we have to do is use it as a target luminance. In Photoshop we could use the luminosity blending mode, but in practice, to prevent luminosity bleeding, it's better to find a pixel in the local neighborhood whose luminance is close enough to the blurred one.
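A sketch of that "closest luminance" selection. Rec.601 luma weights are used here purely as an illustrative luminance function (the talk notes elsewhere that TFU2 used the green channel), and the neighborhood is made up:

```python
import numpy as np

# Illustrative luminance weights (Rec.601); TFU2 used the green channel.
LUMA = np.array([0.299, 0.587, 0.114])

def resolve_long_edge(neighborhood, target_luma):
    """Instead of writing the wide-blur color (which bleeds), pick the
    neighborhood pixel whose luminance is closest to the blurred
    gray-scale target luminance."""
    lumas = neighborhood @ LUMA
    return neighborhood[np.argmin(np.abs(lumas - target_luma))]

# Neighborhood: black, white and a saturated red pixel.
pixels = np.array([[0.0, 0.0, 0.0],
                   [1.0, 1.0, 1.0],
                   [1.0, 0.0, 0.0]])
bright = resolve_long_edge(pixels, 0.8)   # closest to white
dark = resolve_long_edge(pixels, 0.1)     # closest to black
```

Because an existing pixel is reused rather than the blurred color itself, no foreign hue from a distant element can be smeared along the edge.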

32 / 55

On top of that, there are a few other improvements that could be done.

Since long edge detection is based on the result of the high-pass filter, it can give wrong results when a certain part of the picture has a lot of noise. If there are a lot of high frequencies in both horizontal and vertical directions, then it's most likely a noisy region. So we say that we can only have either a horizontal or a vertical edge at a time, and don't do any extra processing, preserving sharpness and increasing efficiency at the same time.
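The either/or rule can be sketched as a simple comparison of high-pass energy in the two directions; the ratio test and its value of 2 are illustrative assumptions, not the shipped logic:

```python
import numpy as np

def is_noisy_region(hp_h_abs, hp_v_abs, ratio=2.0):
    """If high-pass energy is comparable in both directions, the region is
    likely noise rather than a directional edge, so long-edge processing
    is skipped there (preserving sharpness and saving work)."""
    h = float(np.mean(hp_h_abs))
    v = float(np.mean(hp_v_abs))
    return max(h, v) < ratio * min(h, v)

noise = is_noisy_region([0.9] * 4, [0.85] * 4)   # similar energy: noise
edge = is_noisy_region([0.9] * 4, [0.1] * 4)     # one clear direction
```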

33 / 55

In terms of edge gradients, when compared to MLAA, it's very similar. It gives gradual transitions between longer and shorter axial edges, but like MLAA it does not cover the diagonal, 45 degree cases. That's less of an issue, since we are not that sensitive to those types of jaggies.

This particular example shows the reference implementation of DLAA, which uses 3 pixel wide high-pass filters and 5 pixel wide short edge blurs, and makes those diagonal cases slightly more pronounced. We could use a 3 pixel wide blur instead, or extend the high-pass filter to 5 pixels, which is what we did in TFU2. That high-pass filter extension makes diagonal edges look a little softer and less pronounced, but there is a reason for that.

Anti-aliasing is not just an abstract filter processing a single buffer. It's part of a post-processing pipeline that is aimed at improving the final visual experience. When working with consoles, we must take into account the way the image gets to your TV and what leaves your regular LCD or plasma panel at the end, after upscaling and sharpness settings that are usually set a little high, rather than just looking at the raw data or screenshots.

34 / 55

This final prototyping phase took only 3 days (and nights). And it turned out to be very effective when the resolution changes, making all the jaggies on the long edges barely noticeable. Removing some of the specular aliasing and even small holes between the geometry.

Of course, we had to do a lot of testing and tweaking. But everybody was pushing it into production, as the artists' response was that it looks more cinematic, and we agreed that this was a positive thing.

35 / 55

Reflections had their own story, as they were actually anti-aliased at the very end.

I was doing something else when Jerome, who was sitting right next to me, came by and said that a few engineers had been refining the reflections and trying to reduce the jaggies all day long. He was wondering if we could run them through the same filter, to which my response was "no way".

On the PS3 we had two types of SPU processing, in-frame and one frame behind. Things like SSAO were running in-frame and had to be done quickly without stalling the GPU, so they ran on fewer SPUs in order to minimize interference with other jobs, and had a slightly different synchronization mechanism. The rest of post-processing ran one frame behind, and that's where the main anti-aliasing filter was executed. I didn't want any more complications to deal with at that stage of the project, so the natural response was NO.

Jerome said that they had pointed out that the reflection buffer is 14 times smaller (256x256), which I knew, but it didn't come to mind right away. It took me only a fraction of a second, though, to divide 0.8 ms by 14 and realize that we could do it on the RSX (GPU) and it would be dirt cheap, since at that resolution we don't need to care about the long edges.

30 minutes later we had anti-aliased reflections on both platforms.

36 / 55

Research and prototyping took about 8 weeks of project time, scattered somewhere between August 2009 and January 2010. But then it had to be implemented, and implemented efficiently.

In terms of efficiency, it got well within our budgets on both platforms, taking up to 2.4 ms on the 360 and 1.9 ms on the PS3, with the possibility of saving another half a millisecond on 5 SPUs.

The old solution on the XBox360 was taking around 1.2 ms, and since the new one worked really well with the resolution adjustment, we were willing to go up to 3 or even 4 ms with it. Thus the only strategy for the XBox360 was to try to make it as fast as possible.

Budgets on the PlayStation3 were a little different. The estimated SPU time for all the graphics work, directly derived from the GPU implementations, was 12 ms on 5 SPUs, and that didn't allow any sloppiness. It turned out that by doing all the low-level as well as high-level optimizations, all the post-processing got squeezed down to 7 ms, plus 2 ms of in-frame SSAO.

So far, the whole technique looks very simple: a few blurs, high-passes, saturates and lerps. And this is where one runs into the difficulty of explaining how it can take 10 minutes to make it run in 8 ms, one day to bring it down to 4 ms, and then a few weeks to get rid of another one and a half.

37 / 55

Without going too deep into technical details, I would like to give you some taste of what it really takes to make it run efficiently. And not just this example of anti-aliasing, but anything.

In general, high performance is achieved by reusing the results of computations and memory, doing high-level work rejection and pipeline balancing, whether it's texture versus ALU on the XBox360 or even versus odd pipeline instructions on the PlayStation3 SPUs. The entire rendering pipeline also has to be optimized globally, making the optimization of one part rely on optimizations of the others.

38 / 55

It is said that the fastest way to do something is not to do it at all, in a good sense of course :) And this is what "high-level work rejection" is based on.

The most important step of work rejection is to estimate which parts of the entire workload require more and which require less expensive processing. In the case of post-processing, for example, the estimation could be either some clever way to guess what needs to be processed differently, or just running a cheaper version of the effect at a lower resolution.

In our particular case, it seems obvious to try to separate long and short edge processing somehow, because long edges require a lot more samples and additional high-pass results.

After some experimentation it turns out that on natural scenes only about 10 to 20 percent of the screen contains long edges. So at pre-process step we want to find those regions and only do the more expensive processing where it's required.

Thus the outline of the processing looks like this: find the long edge regions, run the high-pass filter around long edges and resolve the buffer into itself (on the XBox360 only). Then process short edges in the regions other than the long-edge ones, and process both short and long edges in the long-edge regions.

39 / 55

Long edge estimation can be done by running the following kernel, detecting both long horizontal and long vertical (rotated by 90 degrees) edges at a lower resolution, reusing, for instance, the one left over from the HDR reduction. It's already resistant to noise, but too costly at higher resolutions.

HDR reduction is a common step in modern renderers and is used to estimate the global luminance of the scene for tone mapping, as well as serving as a source for the bloom effect.

Reduction could be done in many ways, but the fastest one on the XBox360 is to go from 720p down to 360p, and only then go to lower resolutions, at least for 32-bit buffers.

But that, unfortunately, reduces our chances of detecting long edges that are present at the original resolution. So we need to do something about it.

40 / 55

Since 64-bit buffers are much slower on the PS3, the TFU2 rendering pipeline was redesigned to work with 32-bit buffers directly, doing the tone mapping in the shader and outputting lower precision HDR luminance in the alpha channel. The shader cost was minimal on the X360, and the PS3 had no extra shader cost at all, whereas using 32-bit buffers sped things up significantly.

To increase the chances of long edge detection, the HDR reduction was modified. The first step interleaves rgb and luminance data, taking every other pixel with offset (0, 0) for rgb and (1, 1) for luminance, and uses it for long edge estimation. Then the reduction continues, merging the fields together. This can potentially affect the quality of the bloom, but in our scenes that was not an issue.

41 / 55

There are a few tricks you can do on X360.

The first trick is to render a full-screen quad with 4x MSAA into a quarter-resolution buffer, in order to initialize the Hi-Z while sampling the estimation mask. This is 4 times faster than regular Hi-Z initialization, 0.07 ms instead of 0.27 ms at 720p. And while this trick is an official one, suggested by Microsoft, the second one is not, as far as I know.

Usually, what we are trying to do is done using Hi-Stencil rejection. But the way Hi-Stencil is implemented on the XBox360 is not very convenient. It only uses one bit, which tells whether a block of 4x4 pixels passed the test or not, and there is no way to change that result without re-initializing it. Thus we cannot render one part of the screen, flip the test and render the rest.

But we can do it with Hi-Z, which has its own difficulties. The Hi-Z test is associated with a particular depth buffer at creation time. So in order to be able to flip the test, we create two depth buffers aliased in memory and pointing to the same Hi-Z location. Then we clear it with 0 at the previous Resolve, which is free, render the estimation with 0.5, render short edges with the first depth buffer set, and then render the rest with the other depth buffer, which indirectly flips the Hi-Z test. This way we don't need to re-initialize anything :)

42 / 55

Additionally, it's good to do some dilation of the estimate, in case it missed something when running at a lower resolution or precision.

By this point we have our Hi-Z initialized and ready to use. The first thing we need to do is to run bi-directional high-pass filter on the original buffer, put the result back into the alpha channel and resolve it into itself, so we can sample it later.

When doing the Photoshop prototyping, we used high-pass results from previous steps in order to detect long edges. But it would be too expensive to have two of those for both directions and sample them in addition to the source buffer. Thus we can just merge all those operations into one. Since the high-pass result is only used around the long edges, we can run it only there, making sure that the rest of the alpha is set to zero by some previous step, assuming that the original data still resides in EDRAM. Now, when we sample the original buffer, we also get the high-pass data at that point as well.

The filter itself could be done with just 5 bi-linear samples instead of 9, taking advantage of hardware interpolation by sampling in-between the pixels, 4 corners and one central.

43 / 55

The short edge filter is quite simple. Again, 5 bi-linear samples, which get reused between the directional blurs and the edge detectors. The only difference here is the normalization of the edge blending coefficient, which makes it intensity independent. Then we combine the horizontal and vertical blurs together, exposing effectiveness controls and using the green channel as our luminosity function when needed.

What's important is that this step can be done in a lot of different ways. Bi-directional anti-aliasing can be achieved with only a horizontal blur and horizontal edge detection. Since horizontal edge detection is a high-pass over vertical samples, its coefficients can be adjusted so that you get nice gradients in the vertical direction as well.

The edge detection kernel could be shorter, which would give sharper but more aliased diagonal edges, and would require extra samples. The actual kernel coefficients are not that important, as long as they are close enough to each other.

In TFU2, for the short edge case, we used 5x1 kernels in both directions (5 samples total), the green channel as the luminosity function, and eps = 0.1 and lambda = 3 as effectiveness parameters, achieving a nearly perfect balance between ALU and texture unit utilization of the GPU.
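As a rough illustration, here is what one directional pass might look like on the CPU. The ingredients come from the text above (a 5x1 blur, high-pass edge detection, the green channel as luminosity, eps and lambda controls), but the exact blend formula and kernel weights are assumptions made for this sketch, not the shipped TFU2 shader.

```python
def short_edge_aa_h(img, x, y, eps=0.1, lam=3.0):
    # img holds (r, g, b) floats in [0, 1]
    p = lambda i: img[y][x + i]
    # 5x1 horizontal blur (flat weights; the exact shape matters little)
    blur = tuple(sum(p(i)[c] for i in range(-2, 3)) / 5.0 for c in range(3))
    # vertical high-pass detects horizontally running edges
    above, below = img[y - 1][x], img[y + 1][x]
    edge = abs(above[1] + below[1] - 2.0 * p(0)[1])
    # normalize by luminosity (green channel) for intensity independence
    t = min(1.0, lam * edge / max(blur[1], eps))
    # blend the original pixel toward the directional blur
    return tuple(p(0)[c] + t * (blur[c] - p(0)[c]) for c in range(3))
```

In flat regions the edge detector returns zero and the pixel passes through untouched; only around horizontal edges does the blur kick in.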

44 / 55

Long edges are a little tricky though. On one hand the kernel is much wider, on the other, we can't use the result of the blur directly due to color bleeding.

Since we already have to sample the central part of the kernel when processing short edges, that data can be reused, which makes our life easier. And this is the main reason why we want to run short and long edge processing jointly.

Moreover, we can sample the rest of the kernel sparsely, by doing only 4 extra bi-linear samples in each direction, as it doesn't seem to affect the quality so much.

An additional branch for long edges can be used inside the shader. It's not as efficient as Hi-Z rejection, but it makes things a little faster in cases of overestimation due to Hi-Z block dilation, when there are no actual long edges present.

45 / 55

Now we need to find a pixel that matches the luminosity of that wide blur. To do that, suppose we could construct a pixel of such color by interpolating between X and Y. We don't know what the pixel is, but we know its luminosity:

blurred_lum = lerp( X_lum, Y_lum, t )

Thus we know at which point between the two our pixel is located:

blurred_lum = X_lum + t * ( Y_lum - X_lum )

t = ( blurred_lum - X_lum ) / ( Y_lum - X_lum )

If t is between zero and one, then such a pixel does exist and can be constructed by interpolation:

if ( 0 < t < 1 ) color = lerp( X, Y, t )
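In code, the whole reconstruction step is just a guarded inverse lerp. This Python sketch is a direct transcription of the equations above; the luminosity function is passed in, since the talk uses the green channel for it.

```python
def reconstruct(X, Y, blurred_lum, lum):
    # find t so that lerp(lum(X), lum(Y), t) == blurred_lum,
    # then build the matching color; None when no such pixel exists
    xl, yl = lum(X), lum(Y)
    if xl == yl:
        return None                      # degenerate: no gradient to match
    t = (blurred_lum - xl) / (yl - xl)
    if not (0.0 < t < 1.0):
        return None                      # this pair cannot produce the target
    return tuple(x + t * (y - x) for x, y in zip(X, Y))
```

For example, matching a blurred luminosity of 0.25 between black and white lands a quarter of the way toward white.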

46 / 55

But what happens when a single "stair step" gets blurred, let's say horizontally?

Well, for an edge running from the top-left corner to the bottom-right, one side of the edge, on the left, gets brighter, whereas the other side gets darker.

Blending between two dark levels will not give you anything bright. So, in one case we have to "lerp" the current pixel with the top neighbor.

47 / 55

In another case, at the right, where it gets darker, we "lerp" with the bottom neighbor. But which one do we use?

Well, we use both, because one of those cases will always be invalid.

Thus to filter long edges: we blur horizontally, sample the neighbors, compute interpolation coefficients and interpolate if coefficients are valid. Then do the exact same thing vertically, using the left and the right neighbors.

This particular operation can be implemented very efficiently on the GPU as well as on the SPUs by vectorizing all blending coefficient computations. 4 neighbors fit into one vector register perfectly.

To finish the filtering, we need to blend these newly constructed pixels with the result of the short edge filter, based on the long edge mask, which is derived from the blurred alpha channel that contains the high-pass results. The wide blur kernel is not just blurring; it is also counting the "amount" of high frequencies in that particular direction, and thus the probability of it being a long edge.
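Putting the pieces together for one direction, a sketch of the combine step might look like this. The two neighbor tests and the final mask blend follow the description above; the exact arbitration and blend used in TFU2 may differ.

```python
def long_edge_aa_h(center, top, bottom, blurred_lum, short_result, mask, lum):
    # try to match blurred_lum by lerping the center pixel toward the
    # top or the bottom neighbor; at most one of the two is valid;
    # mask in [0, 1] is the long-edge probability from the blurred alpha
    def try_pair(a, b):
        al, bl = lum(a), lum(b)
        if al == bl:
            return None
        t = (blurred_lum - al) / (bl - al)
        if 0.0 < t < 1.0:
            return tuple(x + t * (y - x) for x, y in zip(a, b))
        return None
    rebuilt = try_pair(center, top) or try_pair(center, bottom) or center
    # blend the short-edge result toward the rebuilt pixel by the mask
    return tuple(s + mask * (r - s) for s, r in zip(short_result, rebuilt))
```

On a bright-above-dark stair step, only the "toward top" case validates, which is exactly the behavior described on the previous slides.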

48 / 55

So far we have mainly talked about the GPU implementation, but the SPU implementation is not that different. The algorithm was designed with simplicity and both platforms in mind. The SPU implementation doesn't add anything new to the idea: you do the exact same thing, but slightly differently. And now it's time to see what the difference is.

It happens that the PlayStation3 got a slightly slower GPU, 1.2-1.5 times slower than the one in the XBox360, and when working on multi-platform titles this becomes an issue. One solution is to target the PS3 GPU (RSX) and just run faster on the XBox360, which doesn't sound quite right. So instead we try to move as much workload off the RSX as possible.

And we move that work to the SPUs, the special cores of the CELL processor. There are only 5 of those guys that we can use (5 and a half), but those 5 can easily outperform the XBox360 GPU in terms of quality and speed when used properly. Each SPU has two pipelines executing specific types of instructions in "parallel".

Typical SPU code is not very efficient, having a lot of pipeline stalls (those red bars) and instruction issue idles (all that empty space between the instructions). In a lot of cases this is not an issue, unless you have to process millions of pixels per second. And that is when it's important to make sure that each and every cycle is utilized.

49 / 55

The number one candidate out of all graphics tasks to go on SPUs is post-processing. It requires very little system memory and the nature of processing is usually very uniform.

Reading video memory from the SPUs is very, very slow, so system memory becomes precious, and thus we try to use as little of it as possible. In TFU2 we used two 1280x720 buffers plus 512 KB of scratch for all the post-filters, and two 640x360 buffers for SSAO. The total damage was 9.3 MB of system memory.

But those buffers have to be copied from video memory, and the fastest way to do that is to use tiled surfaces. Copying a single 1280x720 32-bit surface into a tiled system buffer takes about 0.3 ms of RSX time. Tiled surfaces are tricky, as you can't access them in a linear fashion. They are made up of linear 8x4 ROP tiles, which build up higher-level regions of 64x64 pixels that then go in a more or less linear way (for 32-bit surfaces with a 1280 or 640 pixel pitch).

There is one handy way to work with those surfaces: partial surface untiling with DMA transfers. Construct DMA lists such that you receive a sequence of 8x4 ROP tiles next to each other. And since this format is defined at compile time, it unifies the way you work with those surfaces and makes the whole thing a lot simpler.
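As a simplified model of that layout (and only a model: the real RSX tiled format has additional quirks, such as bank swizzling, that are not captured here), the offset of a pixel under the scheme described above can be computed like this:

```python
def tiled_offset(x, y, pitch_pixels=1280):
    # offset in pixels of (x, y) in a simplified tiled layout:
    # linear 8x4 ROP tiles, grouped into 64x64 regions, regions row by row
    rx, ry = x // 64, y // 64                   # which 64x64 region
    regions_per_row = pitch_pixels // 64
    region = (ry * regions_per_row + rx) * 64 * 64
    tx, ty = (x % 64) // 8, (y % 64) // 4       # which 8x4 tile inside it
    tile = (ty * 8 + tx) * 8 * 4                # 8 tiles per tile-row
    return region + tile + (y % 4) * 8 + (x % 8)
```

Even this toy version shows why the surface cannot be read linearly, and why DMA lists that gather whole 8x4 tiles are the natural way to untile it.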

50 / 55

Code efficiency is achieved by manual vectorization, loop unrolling and software pipelining to hide latency, but also by reusing the results of computation between neighboring pixels. Finally, it's necessary to do pipeline balancing, trying to reduce the instruction count of the longest pipeline by using instructions from the other one. The simplest example is using element-wise (even) vs. register-wise (odd) shifts to perform multiplications and divisions. For most of the post-effect kernels, we were able to achieve 97 to 100% pipeline utilization.

When it comes to implementing DLAA, it also happens that the algorithm is not additive, so we don't need to worry about task overlaps, which in the general case might require multiple buffers or scratch memory. Re-filtering is not an issue.

Most of the short kernels are done as byte operations, such as byte averaging (AVGB) and absolute byte difference (ABSDB), which allow us to process 4 RGBA pixels per cycle. It's not the fastest, but a very convenient way to do it. Kernels like [1 2 2 2 1] are easily decomposed into sequences of AVGBs.
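The [1 2 2 2 1] decomposition can be checked with a tiny scalar model of AVGB, which on the SPU computes the rounded byte average (a + b + 1) >> 1:

```python
def avgb(a, b):
    # SPU AVGB: byte average with rounding, (a + b + 1) >> 1
    return (a + b + 1) >> 1

def blur_12221(a, b, c, d, e):
    # avg(avg(a,c), avg(c,e)) = (a + 2c + e)/4 and avg(b,d) = (b + d)/2;
    # averaging those two gives (a + 2b + 2c + 2d + e)/8, i.e. the
    # [1 2 2 2 1]/8 kernel (up to rounding), in five AVGBs
    return avgb(avgb(avgb(a, c), avgb(c, e)), avgb(b, d))
```

On the SPU the same five AVGBs run on sixteen bytes at once, which is where the 4-pixels-per-cycle figure comes from.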

For long edge blurs we use incremental kernel updates, which only require subtracting and adding the tails, thus turning the most expensive GPU operation into the cheapest one, while getting perfectly smooth edge gradients.
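The incremental update is the classic sliding-window trick: keep a running sum, add the sample entering the kernel and subtract the one leaving it, so the per-pixel cost is independent of the kernel width. A scalar Python sketch (with clamped borders, which is an assumption; TFU2's border handling isn't specified here):

```python
def running_blur(row, radius):
    # wide box blur at constant cost per pixel: update a running sum
    # by adding the incoming tail and subtracting the outgoing one
    n, w = len(row), 2 * radius + 1
    clamp = lambda i: min(max(i, 0), n - 1)
    s = sum(row[clamp(i)] for i in range(-radius, radius + 1))
    out = [s / w]
    for x in range(1, n):
        s += row[clamp(x + radius)] - row[clamp(x - radius - 1)]
        out.append(s / w)
    return out
```

Doubling the kernel width costs nothing extra per pixel, which is exactly why the most expensive GPU operation becomes the cheapest one here.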

51 / 55

When doing this type of SPU coding, I find that it's easier to think of even and odd pipeline instructions as if they actually run in parallel at the same time and take one cycle to execute when software pipelined.

Blurs, luminosity computation, saturation and interpolation are the very basic components of DLAA. And there are a few useful tricks to do those quickly.

By shuffling the green components of 4 RGBA pixels into the corresponding alpha channels, it's possible to compute luminance in just one cycle per 4 pixels. SUMB will compute G+R+G+B, giving a 10-bit integer luminosity with luma coefficients of 0.25R + 0.5G + 0.25B.
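A one-line model of that trick: with green shuffled into the alpha slot, summing the four bytes of a pixel is what SUMB does for each 4-byte group, and it lands on the stated coefficients scaled by 4 (hence the 10-bit range):

```python
def sumb_luma(r, g, b):
    # alpha holds a shuffled copy of green, so the 4-byte sum is
    # G + R + G + B = 4 * (0.25*R + 0.5*G + 0.25*B), a 10-bit value
    a = g
    return r + g + b + a
```
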

Quick saturates can be done by converting floating-point values to integers and back to floats. Technically speaking, it's not always the fastest way in terms of raw cycles, but most of the time it schedules better in software-pipelined code.

And finally, interpolation, the most important operation of DLAA. Since most things are done as bytes, it's important not to do any extra format conversion. We can use a fake floating-point trick for interpolation, injecting the source bytes directly into the mantissa and doing the interpolation in floating point. It works as long as the interpolation parameter is always in the range between 0 and 1. After the interpolation we can grab those bytes back with shuffles. Voilà!
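The same trick can be modeled in Python with struct standing in for the SPU shuffles. The particular mantissa position chosen here (bit 15, giving floats in [1, 2)) is an assumption for illustration; the idea only requires that the exponent stays fixed while t is in [0, 1].

```python
import struct

def byte_to_float(b):
    # inject the byte into the top mantissa bits of a float32 in [1, 2)
    return struct.unpack('<f', struct.pack('<I', 0x3F800000 | (b << 15)))[0]

def float_to_byte(f):
    # grab the byte back out of the mantissa (a shuffle on the SPU)
    return (struct.unpack('<I', struct.pack('<f', f))[0] >> 15) & 0xFF

def lerp_bytes(x, y, t):
    # with t in [0, 1] the result stays in [1, 2), so the exponent
    # never changes and the mantissa carries the interpolated byte
    fx, fy = byte_to_float(x), byte_to_float(y)
    return float_to_byte(fx + t * (fy - fx))
```
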

52 / 55

Work rejection can be done in two ways. The simplest is to use dynamic branching with static prediction while working on multiple pixels. In the general case, dynamic branching on the SPUs is very expensive, so it's not something you would use for every pixel. On the other hand, you have to work on multiple pixels at a time anyway, so as soon as all the wide blur kernels for the long edges are updated, we can tell whether further processing is needed or not.

The better way is to run the process in multiple software pipelined passes. The first pass does some processing and builds a list of records containing all the intermediate data required for the second pass. Then the second pipelined pass runs over the list and finishes the rest of the processing. The reason is that you cannot use branching in software pipelined loops.
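Schematically, the two-pass structure looks like this (a scalar Python sketch; on the SPU both loops would be software pipelined, and the record append would be done branch-free with selects and stores rather than an if):

```python
def filter_two_pass(pixels, needs_long_edge, cheap, expensive):
    # pass 1: run the cheap path everywhere and record the pixels
    # that need more work; pass 2: finish only the recorded ones
    out, records = [], []
    for i, p in enumerate(pixels):
        out.append(cheap(p))
        if needs_long_edge(p):
            records.append((i, p))      # defer the expensive work
    for i, p in records:
        out[i] = expensive(p)
    return out
```

The names here are hypothetical stand-ins: "cheap" would be the short edge filter and "expensive" the long edge reconstruction.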

In the end, we get highly efficient code, utilizing each and every cycle of the processor in the inner loops, where it makes the biggest impact on performance.

It is said that necessity is the mother of invention. And it's very true.

We are getting closer to the end of the current console life cycle: the point at which every millisecond counts, when tricks are inevitable in order to keep increasing visual quality and complexity, and when different solutions and different ways of thinking make the difference.

25 years ago, in the era of Commodores and Spectrums, I would have said the exact same thing, and it looks like it's not going to change any time soon. The hardware will become more powerful, but the techniques will become more demanding as well. And this is what makes it even more interesting.