As part of my theory for occlusion, I'm thinking about incorporating a software rasteriser which will just render simple objects (mostly cubes I guess) in plain colours into a simulated bitmap (basically just a 2d array of pixels/numbers).

I kind of get the theory (or can find implementations of it), but I was wondering whether it can be a quick process or whether it is processor intensive. I'm thinking it'll just go into its own thread.

The bit I'm concerned about is filling the shapes with colour, i.e. the pixels in between the Bresenham lines.

On a 2GHz Core i7, I take less than 1 millisecond to rasterize 2000 triangles of terrain geometry into a 256 pixel wide occlusion buffer. Increasing the resolution to 1024 pixels increases the time taken to 2.5 ms. My routine is depth-only rasterization and is not optimized to the max: plain C++ code with no SIMD instructions.
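For the "filling" part of the question, here is a minimal sketch of a depth-only triangle fill. It is not the routine timed above; it swaps in the edge-function (half-space) approach, where every pixel centre inside the triangle's bounding box is tested against the three edges, instead of filling between Bresenham lines. The names `Vec3`, `DepthBuffer` and `rasterizeDepth` are made up for illustration, and it assumes smaller z means nearer and omits a top-left fill rule.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <limits>
#include <vector>

struct Vec3 { float x, y, z; };  // x, y in pixels; z = depth (smaller is nearer)

struct DepthBuffer {
    int width, height;
    std::vector<float> depth;  // initialised to "infinitely far"
    DepthBuffer(int w, int h)
        : width(w), height(h),
          depth(w * h, std::numeric_limits<float>::infinity()) {
        assert(w > 0 && h > 0);
    }
};

// Twice the signed area of triangle (a, b, p).
static float edge(const Vec3& a, const Vec3& b, float px, float py) {
    return (b.x - a.x) * (py - a.y) - (b.y - a.y) * (px - a.x);
}

// Writes the nearest depth of triangle (a, b, c) into buf, either winding.
void rasterizeDepth(DepthBuffer& buf, Vec3 a, Vec3 b, Vec3 c) {
    float area = edge(a, b, c.x, c.y);
    if (area == 0.0f) return;                           // degenerate triangle
    if (area < 0.0f) { std::swap(b, c); area = -area; } // force one winding

    // Bounding box of the triangle, clamped to the buffer.
    int minX = std::max(0, (int)std::floor(std::min({a.x, b.x, c.x})));
    int maxX = std::min(buf.width - 1, (int)std::ceil(std::max({a.x, b.x, c.x})));
    int minY = std::max(0, (int)std::floor(std::min({a.y, b.y, c.y})));
    int maxY = std::min(buf.height - 1, (int)std::ceil(std::max({a.y, b.y, c.y})));

    for (int y = minY; y <= maxY; ++y) {
        for (int x = minX; x <= maxX; ++x) {
            float px = x + 0.5f, py = y + 0.5f;         // sample at pixel centre
            float w0 = edge(b, c, px, py);              // weight of vertex a
            float w1 = edge(c, a, px, py);              // weight of vertex b
            float w2 = edge(a, b, px, py);              // weight of vertex c
            if (w0 < 0.0f || w1 < 0.0f || w2 < 0.0f) continue;  // outside
            float z = (w0 * a.z + w1 * b.z + w2 * c.z) / area;  // interpolated depth
            float& d = buf.depth[y * buf.width + x];
            if (z < d) d = z;                           // keep the nearest depth
        }
    }
}
```

The inner loop is the part that scales with buffer resolution, which is consistent with the time going up when the buffer gets wider.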

I've found it hard to thread the occlusion rendering though, as the way I use it is a sequential operation: find the visible occluders, rasterize them, then use the result to check object AABBs for visibility.

In the occlusion rasterizer that I'm using, my frame-buffer is only 1 bit per pixel (drawn to yet, or not drawn to). I render all the triangles for occluders and occludees from front to back, with occluders filling pixels, and occludees testing whether all their pixels are filled (if any of their pixels aren't filled, the occludee is visible).

Using SSE, you can fill 128 pixels at a time with this algorithm, so the actual rasterization is not at all a bottleneck. I can run at pretty much any resolution without much difference in speed. The real bottleneck for me is actually transforming all of the vertices from model space into screen space (i.e. the 'vertex shader').

I break the frame-buffer into 'tiles', each of them 128 pixels wide and <height> pixels tall. Each of these tiles can then be independently rasterized by a different thread.
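A minimal sketch of the tile idea, assuming each triangle has already been projected and only its horizontal pixel extent matters for binning. `ScreenTri`, `binByTile` and `kTileWidth` are hypothetical names for illustration. Each resulting bin only touches its own 128-pixel-wide column of the frame-buffer, which is what lets a separate thread process it.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <vector>

constexpr int kTileWidth = 128;  // one tile = 128 pixels wide, as in the post

// Hypothetical projected triangle: only its horizontal extent is needed here.
struct ScreenTri { float minX, maxX; };

// Returns, for every tile, the indices of the triangles overlapping it.
std::vector<std::vector<int>> binByTile(const std::vector<ScreenTri>& tris,
                                        int bufferWidth) {
    assert(bufferWidth > 0 && bufferWidth % kTileWidth == 0);
    const int numTiles = bufferWidth / kTileWidth;
    std::vector<std::vector<int>> bins(numTiles);

    for (int i = 0; i < (int)tris.size(); ++i) {
        const ScreenTri& t = tris[i];
        if (t.maxX < 0.0f || t.minX >= (float)bufferWidth) continue;  // off-screen

        // Clamp the extent to the buffer, then convert pixels to tile indices.
        int x0 = std::max(0, (int)std::floor(t.minX));
        int x1 = std::min(bufferWidth - 1, (int)std::floor(t.maxX));
        for (int tile = x0 / kTileWidth; tile <= x1 / kTileWidth; ++tile)
            bins[tile].push_back(i);  // a wide triangle lands in several bins
    }
    return bins;
}
```

A triangle spanning several tiles is simply listed in each of them, so no tile ever needs to write outside its own column.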

How are you sorting the triangles? What about intersections? Or do you sort by max z and rasterize the whole triangle with that value? Sounds like a great technique, but I'd like to hear more details.

Yeah, you use a single z value for each triangle, which means that large triangles at glancing angles don't act as effective occluders.

To ensure conservative results (no false occlusion), you use the maximum z value for occluder triangles and the minimum z value for occludee triangles, which allows intersecting triangles to work without errors.

First you project all the triangles into screen space, determine their z values as above, bucket them into the "tiles", then sort the triangle list in each tile according to z value. Then you rasterize the tiles.

When rasterizing a triangle, you iterate through the scanlines, generating a bitmask of the pixels covered by the triangle on each line. Occluders simply OR this mask into the framebuffer. Occludees AND this mask with "NOT framebuffer", and if the result is non-zero, they write a non-zero value to some address indicating that this object is visible. (When submitting a group of occludee triangles, you also pass an int*, etc., where this value will be written if the object is visible. The int at this address is initialized to zero beforehand.)
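The sort-then-OR/AND steps above can be sketched roughly as follows. This is an illustration under assumptions, not the code from the demo: it uses one 64-bit word per scanline instead of a 128-bit SSE register, it assumes the triangles have already been converted into per-scanline spans, and `Tile`, `Span`, `Item`, `spanMask` and `processTile` are made-up names.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// One tile: 64 pixels wide (one word per scanline), 1 bit per pixel.
struct Tile {
    std::vector<std::uint64_t> rows;
    explicit Tile(int height) : rows(height, 0) {}
};

// Bitmask with bits [x0, x1) set, where 0 <= x0 <= x1 <= 64.
std::uint64_t spanMask(int x0, int x1) {
    assert(0 <= x0 && x0 <= x1 && x1 <= 64);
    if (x0 == x1) return 0;
    std::uint64_t upTo = (x1 == 64) ? ~0ULL : ((1ULL << x1) - 1);
    std::uint64_t below = (1ULL << x0) - 1;  // x0 < 64 here, so the shift is safe
    return upTo & ~below;
}

struct Span { int y, x0, x1; };  // pixels [x0, x1) covered on scanline y

struct Item {
    float z;                  // occluder: max z of the triangle; occludee: min z
    bool isOccluder;
    std::vector<Span> spans;  // assumed pre-computed triangle coverage
    int* visible;             // occludees only; caller initialises *visible to 0
};

// Sorts one tile's items front to back, then rasterizes them in that order.
void processTile(Tile& tile, std::vector<Item> items) {
    std::stable_sort(items.begin(), items.end(),
                     [](const Item& a, const Item& b) { return a.z < b.z; });
    for (const Item& item : items) {
        for (const Span& s : item.spans) {
            std::uint64_t mask = spanMask(s.x0, s.x1);
            std::uint64_t& row = tile.rows[s.y];
            if (item.isOccluder)
                row |= mask;          // occluder: fill the pixels
            else if (mask & ~row)
                *item.visible = 1;    // occludee: some pixel is still unfilled
        }
    }
}
```

Because an occludee is keyed on its minimum z and an occluder on its maximum z, an occludee is only ever tested against occluders that lie entirely in front of it, which is what keeps the result conservative.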

P.S. I didn't come up with this; I've shamelessly taken the idea from Vadim Shcherbakov, who got it from another guy, IronPeter. There's a full explanation and a demo with source code on his blog. His code uses SSE intrinsics so it's a bit unreadable in places, and it contains a few bugs, but it's very fast. Eventually, I'd like to release my own open source version of this algorithm, but I've got other things to be working on at the moment.

[edit]P.P.S. I contacted Vadim about the copyright on his demo, because there is no explicit license contained in the ZIP, and got this response:

On 05/10/13 8:11 AM, Vadim Shcherbakov wrote:
--------------------
Hey, you can use the code as you like, there is no license or any limitations.
Regards, Vadim

I am going to try to implement this myself, since Vadim S.'s demo crashes for me at certain camera positions/angles. His code is also hard to follow, so I cannot fix it.

I think the problem is in the clipping algorithm and shows up sometimes when triangles intersect the near plane (an assert in there is the cause of the crash)... but yeah, his code is very hard to follow.

Could I test just the 4 corner pixels of the occludee's axis-aligned 2D rectangle?

I'm not sure that would work, e.g. if the occludee is just beyond a window but is larger than the window. Its four corners will be occluded by the wall, but its centre will be visible through the window.
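The window case can be reproduced with a tiny sketch, assuming a 1-bit coverage buffer stored as one 64-bit word per scanline. `Coverage`, `hiddenByCorners` and `hiddenByAllPixels` are hypothetical names, and rectangles use inclusive pixel coordinates.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// rows[y] has bit x set when pixel (x, y) is already covered by an occluder.
using Coverage = std::vector<std::uint64_t>;

bool covered(const Coverage& c, int x, int y) {
    assert(0 <= x && x < 64 && 0 <= y && y < (int)c.size());
    return ((c[y] >> x) & 1ULL) != 0;
}

// Corner-only test: claims "hidden" when the four corner pixels are covered.
bool hiddenByCorners(const Coverage& c, int x0, int y0, int x1, int y1) {
    return covered(c, x0, y0) && covered(c, x1, y0) &&
           covered(c, x0, y1) && covered(c, x1, y1);
}

// Exhaustive test: "hidden" only when every pixel of the rectangle is covered.
bool hiddenByAllPixels(const Coverage& c, int x0, int y0, int x1, int y1) {
    for (int y = y0; y <= y1; ++y)
        for (int x = x0; x <= x1; ++x)
            if (!covered(c, x, y)) return false;
    return true;
}
```

With an 8x8 wall that has a 2x2 hole in the middle, a rectangle from (1, 1) to (6, 6) passes the corner test but fails the exhaustive one, i.e. the corner test would wrongly cull it.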

If you're rasterizing a traditional depth buffer instead of one of these 1-bit occlusion buffers, then you can create a hierarchical Z-buffer from it, which does allow you to test any occludee with just 4 samples. This is a very old technique (1993?), but it was used very recently in a Splinter Cell game.
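A rough sketch of how the 4-sample test can work, assuming a square power-of-two depth buffer where smaller z is nearer. Each coarser level stores the farthest depth of its 2x2 children, and the test picks the level at which the occludee's rectangle spans at most 2x2 texels. `HiZ`, `buildHiZ` and `isOccluded` are illustrative names, not taken from any particular engine.

```cpp
#include <algorithm>
#include <cassert>
#include <utility>
#include <vector>

struct HiZ {
    int size = 0;                            // level 0 is size x size pixels
    std::vector<std::vector<float>> levels;  // levels[k] is (size >> k) squared
};

// Builds the chain: each texel holds the max (farthest) depth of its 2x2 children.
HiZ buildHiZ(const std::vector<float>& depth, int size) {
    assert(size > 0 && (size & (size - 1)) == 0);
    assert((int)depth.size() == size * size);
    HiZ h;
    h.size = size;
    h.levels.push_back(depth);
    for (int s = size / 2; s >= 1; s /= 2) {
        const std::vector<float>& src = h.levels.back();
        const int ss = s * 2;                // width of the source level
        std::vector<float> dst(s * s);
        for (int y = 0; y < s; ++y)
            for (int x = 0; x < s; ++x)
                dst[y * s + x] = std::max(
                    std::max(src[(2 * y) * ss + 2 * x], src[(2 * y) * ss + 2 * x + 1]),
                    std::max(src[(2 * y + 1) * ss + 2 * x], src[(2 * y + 1) * ss + 2 * x + 1]));
        h.levels.push_back(std::move(dst));  // src is not used after this point
    }
    return h;
}

// True if a rectangle (inclusive pixel coords) whose nearest depth is minZ
// lies behind everything the buffer holds in that region.
bool isOccluded(const HiZ& h, int x0, int y0, int x1, int y1, float minZ) {
    assert(0 <= x0 && x0 <= x1 && x1 < h.size);
    assert(0 <= y0 && y0 <= y1 && y1 < h.size);
    int level = 0;  // coarsen until the rectangle spans at most 2x2 texels
    while ((x1 >> level) - (x0 >> level) > 1 || (y1 >> level) - (y0 >> level) > 1)
        ++level;
    const int s = h.size >> level;
    const std::vector<float>& mip = h.levels[level];
    const int ax = x0 >> level, bx = x1 >> level;
    const int ay = y0 >> level, by = y1 >> level;
    const float farthest = std::max(std::max(mip[ay * s + ax], mip[ay * s + bx]),
                                    std::max(mip[by * s + ax], mip[by * s + bx]));
    return minZ > farthest;  // conservative: any far texel keeps it visible
}
```

Unlike sampling four corner pixels, the four texels here cover the whole rectangle, so a single far pixel (such as a window in a wall) propagates up the chain and keeps the object visible.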

I always found that the memory bandwidth was the biggest issue with software rasterization, which would link the performance to the frame buffer size. However, using one bit per pixel like Hodgman would probably alleviate that issue nicely...

It depends on what your rasterizer does. For occlusion culling on one core, you just read/write 2 or 4 bytes per pixel, which is not an issue: you have a lot of math to do to get to that point, and you know at the beginning of the loop which pixels you are going to touch. You can even predict the next line, so inserting some prefetch instructions can hide most of the memory access cost.

Occlusion culling is a special case of rasterization, and it has special demands.

1. Accuracy: it needs to be high quality. A single leaking pixel from the background will invalidate all the rasterization you've done to create occlusion.

2. x/y resolution: you cannot really assume that a lower-resolution buffer will be enough. Say you have 128 pixels in x while actually playing at 2560x1600: that means 20 real pixels map to one occlusion buffer pixel. If you stand at an unlucky angle to a window or door opening, the whole world you see behind it can flicker.

3. Depth resolution: unless you want to waste human resources placing and adjusting custom geometry, you have to render with the same depth accuracy as the hardware does. Usually you are trying to avoid not only polys but also draw calls, so you want to cull decals and tiny props (e.g. a painted image on a wall), and therefore you need to rasterize accurately to avoid flickering due to z-fighting.

4. It needs to be solid (software-wise), so avoid special cases, create automatic regression tests, and profile every change. If it works 99% of the time and fails the other 1%, it won't be used, because it can have a massive impact on gameplay and visuals (e.g. choppy framerates, slowdowns, wrongly culled objects...).

(5. The most important part of occlusion culling is the amount you cull, because that's what you save through the whole pipeline. Don't fool yourself into thinking that you need the fastest occlusion culler. If you cull 90% of the draw calls, ending up with only 500, artists will fill that up again and you are back at 5k draw calls. You will again end up at 10ms. Doing it in 2ms, but culling just 70% instead of 90%, will lower your overall framerate!)

One last word regarding the 1 bit/pixel solution: I implemented something like this ages ago (on a Pentium, using 32-bit ints), and the culling results were very inconsistent. Depending on your view angle, you might end up rendering half the room behind a wall, just because the sorting re-ordered the polys now that you are looking at the wall at 45 degrees. It's faster the bigger your polys are, but at the same time less accurate; you can become quite accurate if you use tons of tiny polys, but then you won't see such a big speed-up compared to the usual way of rasterization.