Consider the Alternatives:? Quake’s Hidden-Surface Removal

by Michael Abrash

Okay, I admit it:? I’m sick and tired of classic rock.? Admittedly, it’s been a while--about 20 years--since last I was excited to hear anything by the Cars or Boston, and I was never particularly excited in the first place about Bob Seger or Queen--to say nothing of Elvis--so some things haven’t changed.? But I knew something was up when I found myself changing the station on the Allman Brothers and Steely Dan and Pink Floyd and, God help me, the Beatles (just stuff like “Hello Goodbye” and “I’ll Cry Instead,” though, not “Ticket to Ride” or “A Day in the Life”; I’m not that far gone).? It didn’t take long to figure out what the problem was; I’d been hearing the same songs for a quarter-century, and I was bored.

I tell you this by way of explaining why it was that when my daughter and I drove back from dinner the other night, the radio in my car was tuned, for the first time ever, to a station whose slogan is “There is no alternative.”

Now, we’re talking here about a ten-year-old who worships the Beatles and has been raised on a steady diet of oldies.? She loves melodies, catchy songs, and good singers, none of which you’re likely to find on an alternative rock station.? So it’s no surprise that when I turned on the radio, the first word out of her mouth was “Yuck!”

What did surprise me was that after listening for a while, she said, “You know, Dad, it’s actually kind of interesting.”

Apart from giving me a clue as to what sort of music I can expect to hear blasting through our house when she’s a teenager, her quick uptake on alternative rock (versus my decades-long devotion to the music of my youth) reminded me of something that it’s easy to forget as we become older and more set in our ways.? It reminded me that it’s essential to keep an open mind, and to be willing--better yet, eager--to try new things.? Programmers tend to become attached to familiar approaches, and are inclined to stick with whatever is currently doing the job adequately well, but in programming there are always alternatives, and I’ve found that they’re often worth considering.

Not that I should have needed any reminding, considering the ever-evolving nature of Quake.

Creative flux

Back in January, I described the creative flux that led to John Carmack’s decision to use a precalculated potentially visible set (PVS) of polygons for each possible viewpoint in Quake, the game we’re developing at id Software.? The precalculated PVS meant that instead of having to spend a lot of time searching through the world database to find out which polygons were visible from the current viewpoint, we could simply draw all the polygons in the PVS from back to front (getting the ordering courtesy of the world BSP tree; check out the May, July, and November 1995 columns for a discussion of BSP trees), and get the correct scene drawn with no searching at all, letting the back-to-front drawing perform the final stage of hidden-surface removal (HSR).? This was a terrific idea, but it was far from the end of the road for Quake’s design.

Drawing moving objects

For one thing, there was still the question of how to sort and draw moving objects properly; in fact, this is the question I’ve been asked most often since the January column came out, so I’ll take a moment to address it.? The primary problem is that a moving model can span multiple BSP leaves, with the leaves that are touched varying as the model moves; that, together with the possibility of multiple models in one leaf, means there’s no easy way to use BSP order to draw the models in correctly sorted order.? When I wrote the January column, we were drawing sprites (such as explosions), moveable BSP models (such as doors), and polygon models (such as monsters) by clipping each into all the leaves it touched, then drawing the appropriate parts as each BSP leaf was reached in back-to-front traversal.? However, this didn’t solve the issue of sorting multiple moving models in a single leaf against each other, and also left some ugly sorting problems with complex polygon models.

John solved the sorting issue for sprites and polygon models in a startlingly low-tech way:? We now z-buffer them.? (That is, before we draw each pixel, we compare its distance, or z, value with the z value of the pixel currently on the screen, drawing only if the new pixel is nearer than the current one.)? First, we draw the basic world--walls, ceilings, and the like.? No z-buffer testing is involved at this point (the world visible surface determination is done in a different way, as we’ll see soon); however, we do fill the z-buffer with the z values (actually, 1/z values, as discussed below) for all the world pixels.? Z-filling is a much faster process than z-buffering the entire world would be, because no reads or compares are involved, just writes of z values.? Once drawing and z-filling the world is done, we can simply draw the sprites and polygon models with z-buffering and get perfect sorting all around.

Whenever a z-buffer is involved, the questions inevitably are:? What’s the memory footprint, and what’s the performance impact?? Well, the memory footprint at 320x200 is 128K, not trivial but not a big deal for a game that

equires 8 Mb.? The performance impact is about 10% for z-filling the world, and roughly 20% (with lots of variation) for drawing sprites and polygon models.? In return, we get a perfectly sorted world, and also the ability to do additional effects, such as particle explosions and smoke, because the z-buffer lets us flawlessly sort such effects into the world.? All in all, the use of the z-buffer vastly improved the visual quality and flexibility of the Quake engine, and also simplified the code quite a bit, at an acceptable memory and performance cost.

Leveling and improving performance

As I said above, in the Quake architecture, the world itself is drawn first--without z-buffer reads or compares, but filling the z-buffer with the world polygons’ z values--and then the moving objects are drawn atop the world, using full z-buffering.? Thus far, I’ve discussed how to draw moving objects.? For the rest of this column, I’m going to talk about the other part of the drawing equation, how to draw the world itself, where the entire world is stored as a single BSP tree and never moves.

As you may recall from the January column, we’re concerned with both raw performance and level performance.? That is, we want the drawing code to run as fast as possible, but we also want the difference in drawing speed between the average scene and the slowest-drawing scene to be as small as possible.? It does little good to average 30 frames per second if 10% of the scenes draw at 5 fps, because the jerkiness in those scenes will be extremely obvious by comparison with the average scene, and highly objectionable.? It would be better to average 15 fps 100% of the time, even though the average drawing speed is only half as much.

The precalculated PVS was an important step toward both faster and more level performance, because it eliminated the need to identify visible polygons, a relatively slow step that tended to be at its worst in the most complex scenes.? Nonetheless, in some spots in real game levels the precalculated PVS contains five times more polygons than are actually visible; together with the back-to-front HSR approach, this created hot spots in which the frame rate bogged down visibly as hundreds of polygons are drawn back to front, most of those immediately getting overdrawn by nearer polygons.? Raw performance in general was also reduced by the typical 50% overdraw resulting from drawing everything in the PVS.? So, although drawing the PVS back to front as the final HSR stage worked and was an improvement over previous designs, it was not ideal.? Surely, John thought, there’s a better way to leverage the PVS than back-to-front drawing.

And indeed there is.

Sorted spans

The ideal final HSR stage for Quake would reject all the polygons in the PVS that are actually invisible, and draw only the visible pixels of the remaining polygons, with no overdraw--that is, with every pixel drawn exactly once--all at no performance cost, of course.? One way to do that (although certainly not at zero cost) would be to draw the polygons from front to back, maintaining a region describing the currently occluded portions of the screen and clipping each polygon to that region before drawing it.? That sounds promising, but it is in fact nothing more or less than the beam tree approach I described in the January column, an approach that we found to have considerable overhead and serious leveling problems.

We can do much better if we move the final HSR stage from the polygon level to the span level and use a sorted-spans approach.? In essence, this approach consists of turning each polygon into a set of spans, as shown in Figure 1, and then sorting and clipping the spans against each other until only the visible portions of visible spans are left to be drawn, as shown in Figure 2.? This may sound a lot like z-buffering (which is simply too slow for use in drawing the world, although it’s fine for smaller moving objects, as described earlier), but there are crucial differences.? By contrast with z-buffering, only visible portions of visible spans are scanned out pixel by pixel (although all polygon edges must still be rasterized).? Better yet, the sorting that z-buffering does at each pixel becomes a per-span operation with sorted spans, and because of the coherence implicit in a span list, each edge is sorted against only against some of the spans on the same line, and clipped only to the few spans that it overlaps horizontally.? Although complex scenes still take longer to process than simple scenes, the worst case isn’t as bad as with the beam tree or back-to-front approaches, because there’s no overdraw or scanning of hidden pixels, because complexity is limited to pixel resolution, and because span coherence tends to limit the worst-case sorting in any one area of the screen.? As a bonus, the output of sorted spans is in precisely the form that a low-level rasterizer needs, a set of span descriptors, each consisting of a start coordinate and a length.

Figure 1: Span generation.

In short, the sorted spans approach meets our original criteria pretty well; although it isn’t zero-cost, it’s not horribly expensive, it completely eliminates both overdraw and pixel scanning of obscured portions of polygons, and it tends to level worst-case performance.? We wouldn’t want to rely on sorted spans alone as our hidden-surface mechanism, but the precalculated PVS reduces the number of polygons to a level that sorted spans can handle quite nicely.

So we’ve found the approach we need; now it’s just a matter of writing some code and we’re on our way, right?? Well, yes and no.? Conceptually, the sorted-spans approach is simple, but it’s surprisingly difficult to implement, with a couple of major design choices to be made, a subtle mathematical element, and some tricky gotchas that we’ll see in the next column.? Let’s look at the design choices first.

Edges versus spans

The first design choice is whether to sort spans or edges (both of which fall into the general category of “sorted spans”).? Although the results are the same both ways--a list of spans to be drawn, with no overdraw--the implementations and performance implications are quite different, because the sorting and clipping are performed using very different data structures.

With span-sorting, spans are stored in x-sorted linked list buckets, typically with one bucket per scan line.? Each polygon in turn is rasterized into spans, as shown in Figure 1, and each span is sorted and clipped into the bucket for the scan line the span is on, as shown in Figure 2, so that at any time each bucket contains the nearest spans encountered thus far, always with no overlap.? This approach involves generating all spans for each polygon in turn, with each span immediately being sorted, clipped, and added to the appropriate bucket.

Figure 2: The spans from polygon A from Figure 1 sorted and clipped with the spans from polygon B, where polygon A is at a constant z distance of 100 and polygon B is at a constant z distance of 50 (polygon B is closer).

With edge-sorting, edges are stored in x-sorted linked list buckets according to their start scan line.? Each polygon in turn is decomposed into edges, cumulatively building a list of all the edges in the scene.? Once all edges for all polygons in the view frustum have been added to the edge list, the whole list is scanned out in a single top-to-bottom, left-to-right pass.? An active edge list (AEL) is maintained.? With each step to a new scan line, edges that end on that scan line are removed from the AEL, active edges are stepped to their new x coordinates, edges starting on the new scan line are added to the AEL, and the edges are sorted by current x coordinate.

For each scan line, a z-sorted active polygon list (APL) is maintained.? The x-sorted AEL is stepped through in order.? As each new edge is encountered (that is, as each polygon starts or ends as we move left to right), the associated polygon is activated and sorted into the APL, as shown in Figure 3, or deactivated and removed from the APL, as shown in Figure 4, for a leading or trailing edge, respectively.? If the nearest polygon has changed (that is, if the new polygon is nearest, or if the nearest polygon just ended), a span is emitted for the polygon that just stopped being the nearest, starting at the point where the polygon first because nearest and ending at the x coordinate of the current edge, and the current x coordinate is recorded in the polygon that is now the nearest.? This saved coordinate later serves as the start of the span emitted when the new nearest polygon ceases to be in front.

Figure 3: Activating a polygon when a leading edge is encountered in the AEL.

Figure 4: Deactivating a polygon when a trailing edge is encountered in the AEL.

Don’t worry if you didn’t follow all of that; the above is just a quick overview of edge-sorting to help make the rest of this column clearer.? There will a thorough discussion in the next column.

The spans that are generated with edge-sorting are exactly the same spans that ultimately emerge from span-sorting; the difference lies in the intermediate data structures that are used to sort the spans in the scene.? With edge-sorting, the spans are kept implicit in the edges until the final set of visible spans is generated, so the sorting, clipping, and span emission is done as each edge adds or removes a polygon, based on the span state implied by the edge and the set of active polygons.? With span-sorting, spans are immediately made explicit when each polygon is rasterized, and those intermediate spans are then sorted and clipped against other the spans on the scan line to generate the final spans, so the states of the spans are explicit at all times, and all work is done directly with spans.

Both span-sorting and edge-sorting work well, and both have been employed successfully in commercial projects.? We’ve chosen to use edge-sorting in Quake partly because it seems inherently more efficient, with excellent horizontal coherence that makes for minimal time spent sorting, in contrast with the potentially costly sorting into linked lists that span-sorting can involve.? A more important reason, though, is that with edge-sorting we’re able to share edges between adjacent polygons, and that cuts the work involved in sorting, clipping, and rasterizing edges nearly in half, while also shrinking the world database quite a bit due to the sharing.

One final advantage of edge-sorting is that it makes no distinction between convex and concave polygons.? That’s not an important consideration for most graphics engines, but in Quake, edge clipping, transformation, projection, and sorting has become a major bottleneck, so we’re doing everything we can to get the polygon and edge counts down, and concave polygons help a lot in that regard.? While it’s possible to handle concave polygons with span-sorting, that can involve significant performance penalties.

Nonetheless, there’s no cut-and-dried answer as to which approach is better.? In the end, span-sorting and edge-sorting amount to the same functionality, and the choice between them is a matter of whatever you feel most comfortable with.? In the next column, I’ll go into considerable detail about edge-sorting, complete with a full implementation.? I’m going the spend the rest of this column laying the foundation for next time by discussing sorting keys and 1/z calculation.? In the process, I’m going to have to make a few forward references to aspects of edge-sorting that I haven’t covered in detail; my apologies, but it’s unavoidable, and all should become clear by the end of the next column.

Edge-sorting keys

Now that we know we’re going to sort edges, using them to emit spans for the polygons nearest the viewer, the question becomes how we can tell which polygons are nearest.? Ideally, we’d just store a sorting key in each polygon, and whenever a new edge came along, we’d compare its surface’s key to the keys of other currently active polygons, and could easily tell which polygon was nearest.

That sounds too good to be true, but it is possible.? If, for example, your world database is stored as a BSP tree, with all polygons clipped into the BSP leaves, then BSP walk order is a valid drawing order.? So, for example, if you walk the BSP back to front, assigning each polygon an incrementally higher key as you reach it, polygons with higher keys are guaranteed to be in front of polygons with lower keys.? This is the approach Quake used for a while, although a different approach is now being used, for reasons I’ll explain shortly.

If you don’t happen to have a BSP or similar data structure handy, or if you have lots of moving polygons (BSPs don’t handle moving polygons very efficiently), another way to accomplish our objectives would be to sort all the polygons against one another before drawing the scene, assigning appropriate keys based on their spatial relationships in viewspace.? Unfortunately, this is generally an extremely slow task, because every polygon must be compared to every other polygon.? There are techniques to improve the performance of polygon sorts, but I don’t know of anyone who’s doing general polygon sorts of complex scenes in realtime on a PC.

An alternative is to sort by z distance from the viewer in screenspace, an approach that dovetails nicely with the excellent spatial coherence of edge-sorting.? As each new edge is encountered on a scan line, the corresponding polygon’s z distance can be calculated and compared to the other polygons’ distances, and the polygon can be sorted into the APL accordingly.

Getting z distances can be tricky, however.? Remember that we need to be able to calculate z at any arbitrary point on a polygon, because an edge may occur and cause its polygon to be sorted into the APL at any point on the screen.? We could calculate z directly from the screen x and y coordinates and the polygon’s plane equation, but unfortunately this can’t be done very quickly, because the z for a plane doesn’t vary linearly in screenspace; however, 1/z does vary linearly, so we’ll use that instead.? (See Chris Hecker’s series of columns on texture mapping over the past year in Game Developer magazine for a discussion of screenspace linearity and gradients for 1/z.)? Another advantage of using 1/z is that its resolution increases with decreasing distance, meaning that by using 1/z, we’ll have better depth resolution for nearby features, where it matters most.

The obvious way to get a 1/z value at any arbitrary point on a polygon is to calculate 1/z at the vertices, interpolate it down both edges of the polygon, and interpolate between the edges to get the value at the point of interest.? Unfortunately, that requires doing a lot of work along each edge, and worse, requires division to calculate the 1/z step per pixel across each span.

A better solution is to calculate 1/z directly from the plane equation and the screen x and y of the pixel of interest.? The equation is:

1/z = (a/d)x’ - (b/d)y’ + c/d

where z is the viewspace z coordinate of the point on the plane that projects to screen coordinate (x’,y’) (the origin for this calculation is the center of projection, the point on the screen straight ahead of the viewpoint), [a b c] is the plane normal in viewspace, and d is the distance from the viewspace origin to the plane along the normal.? Division is done only once per plane, because a, b, c, and d are per-plane constants.

The full 1/z calculation requires two multiplies and two adds, all of which should be floating-point to avoid range errors.? That much floating-point math sounds expensive but really isn’t, especially on a Pentium, where a plane’s 1/z value at any point can be calculated in as little as six cycles in assembly language.

For those who are interested, here’s a quick derivation of the 1/z equation.? The plane equation for a plane is

ax + by + cz - d = 0,

where x and y are viewspace coordinates, and a, b, c, d, and z are defined above.? If we substitute x=x’z and y=-y’z (from the definition of the perspective projection, with y inverted because y increases upward in viewspace but downward in screenspace), and do some rearrangement, we get

z = d / (ax’ - by’ + c).

Inverting and distributing yields

1/z = ax’/d - by’/d + c/d.

We’ll see 1/z sorting in action next time.

Quake and z-sorting

I mentioned above that Quake no longer uses BSP order as the sorting key; in fact, it uses 1/z as the key now.? Elegant as the gradients are, calculating 1/z from them is clearly slower than just doing a compare on a BSP-ordered key, so why have we switched Quake to 1/z?

The primary reason is to reduce the number of polygons.? Drawing in BSP order means following certain rules, including the rule that polygons must be split if they cross BSP planes.? This splitting increases the numbers of polygons and edges considerably.? By sorting on 1/z, we’re able to leave polygons unsplit but still get correct drawing order, so we have far fewer edges to process and faster drawing overall, despite the added cost of 1/z sorting.

Another advantage of 1/z sorting is that it solves the sorting issues I mentioned at the start involving moving models that are themselves small BSP trees.? Sorting in world BSP order wouldn’t work here, because these models are separate BSPs, and there’s no easy way to work them into the world BSP’s sequence order.? We don’t want to use z-buffering for these models because they’re often large objects such as doors, and we don’t want to lose the overdraw-reduction benefits that closed doors provide when drawn through the edge list.? With sorted spans, the edges of moving BSP models are simply placed in the edge list (first clipping polygons so they don’t cross any solid world surfaces, to avoid complications associated with interpenetration), along with all the world edges, and 1/z sorting takes care of the rest.

Onward to next time

There is, without a doubt, an awful lot of information in the preceding pages, and it may not all connect together yet in your mind.? The code and accompanying explanation next time should help; if you want to peek ahead, the code should be available from ftp.idsoftware.com/mikeab/ddjsort.zip by the time you read this column.? You may also want to take a look at Foley & van Dam’s Computer Graphics or Rogers’ Procedural Elements for Computer Graphics.

As I write this, it’s unclear whether Quake will end up sorting edges by BSP order or 1/z.? Actually, there’s no guarantee that sorted spans in any form will be the final design.? Sometimes it seems like we change graphics engines as often as they play Elvis on the ‘50s oldies stations (but, one would hope, with more aesthetically pleasing results!), and no doubt we’ll be considering the alternatives right up until the day we ship.

Inside Quake: Visible-Surface Determination

by Michael Abrash

Years ago, I was working at Video Seven, a now-vanished video adapter manufacturer, helping to develop a VGA clone. The fellow who was designing Video Seven’s VGA chip, Tom Wilson, had worked around the clock for months to make his VGA run as fast as possible, and was confident he had pretty much maxed out its performance. As Tom was putting the finishing touches on his chip design, however, news came fourth-hand that a competitor, Paradise, had juiced up the performance of the clone they were developing, by putting in a FIFO.

That was it; there was no information about what sort of FIFO, or how much it helped, or anything else. Nonetheless, Tom, normally an affable, laid-back sort, took on the wide-awake, haunted look of a man with too much caffeine in him and no answers to show for it, as he tried to figure out, from hopelessly thin information, what Paradise had done. Finally, he concluded that Paradise must have put a write FIFO between the system bus and the VGA, so that when the CPU wrote to video memory, the write immediately went into the FIFO, allowing the CPU to keep on processing instead of stalling each time it wrote to display memory.

Tom couldn’t spare the gates or the time to do a full FIFO, but he could implement a one-deep FIFO, allowing the CPU to get one write ahead of the VGA. He wasn’t sure how well it would work, but it was all he could do, so he put it in and taped out the chip.

The one-deep FIFO turned out to work astonishingly well; for a time, Video Seven’s VGAs were the fastest around, a testament to Tom’s ingenuity and creativity under pressure. However, the truly remarkable part of this story is that Paradise’s FIFO design turned out to bear not the slightest resemblance to Tom’s, and didn’t work as well. Paradise had stuck a read FIFO between display memory and the video output stage of the VGA, allowing the video output to read ahead, so that when the CPU wanted to access display memory, pixels could come from the FIFO while the CPU was serviced immediately. That did indeed help performance--but not as much as Tom’s write FIFO.

What we have here is as neat a parable about the nature of creative design as one could hope to find. The scrap of news about Paradise’s chip contained almost no actual information, but it forced Tom to push past the limits he had unconsciously set in coming up with his original design. And, in the end, I think that the single most important element of great design, whether it be hardware or software or any creative endeavor, is precisely what the Paradise news triggered in Tom: The ability to detect the limits you have built into the way you think about your design, and transcend those limits.

The problem, of course, is how to go about transcending limits you don’t even know you’ve imposed. There’s no formula for success, but two principles can stand you in good stead: simplify, and keep on trying new things.

Generally, if you find your code getting more complex, you’re fine-tuning a frozen design, and it’s likely you can get more of a speed-up, with less code, by rethinking the design. A really good design should bring with it a moment of immense satisfaction in which everything falls into place, and you’re amazed at how little code is needed and how all the boundary cases just work properly.

As for how to rethink the design, do it by pursuing whatever ideas occur to you, no matter how off-the-wall they seem. Many of the truly brilliant design ideas I’ve heard over the years sounded like nonsense at first, because they didn’t fit my preconceived view of the world. Often, such ideas are in fact off-the-wall, but just as the news about Paradise’s chip sparked Tom’s imagination, aggressively pursuing seemingly-outlandish ideas can open up new design possibilities for you.

Case in point: The evolution of Quake’s 3-D graphics engine.

The toughest 3-D challenge of all

I’ve spent most of my waking hours for the last seven months working on Quake, id Software’s successor to DOOM, and after spending the next three months in much the same way, I expect Quake will be out as shareware around the time you read this.

In terms of graphics, Quake is to DOOM as DOOM was to its predecessor, Wolfenstein 3D. Quake adds true, arbitrary 3-D (you can look up and down, lean, and even fall on your side), detailed lighting and shadows, and 3-D monsters and players in place of DOOM’s sprites. Sometime soon, I’ll talk about how all that works, but this month I want to talk about what is, in my book, the toughest 3-D problem of all, visible surface determination (drawing the proper surface at each pixel), and its close relative, culling (discarding non-visible polygons as quickly as possible, a way of accelerating visible surface determination). In the interests of brevity, I’ll use the abbreviation VSD to mean both visible surface determination and culling from now on.

Why do I think VSD is the toughest 3-D challenge? Although rasterization issues such as texture mapping are fascinating and important, they are tasks of relatively finite scope, and are being moved into hardware as 3-D accelerators appear; also, they only scale with increases in screen resolution, which are relatively modest.

In contrast, VSD is an open-ended problem, and there are dozens of approaches currently in use. Even more significantly, the performance of VSD, done in an unsophisticated fashion, scales directly with scene complexity, which tends to increase as a square or cube function, so this very rapidly becomes the limiting factor in doing realistic worlds. I expect VSD increasingly to be the dominant issue in realtime PC 3-D over the next few years, as 3-D worlds become increasingly detailed. Already, a good-sized Quake level contains on the order of 10,000 polygons, about three times as many polygons as a comparable DOOM level.

The structure of Quake levels

Before diving into VSD, let me note that each Quake level is stored as a single huge 3-D BSP tree. This BSP tree, like any BSP, subdivides space, in this case along the planes of the polygons. However, unlike the BSP tree I presented last time, Quake’s BSP tree does not store polygons in the tree nodes, as part of the splitting planes, but rather in the empty (non-solid) leaves, as shown in overhead view in Figure 1.

Figure 1: In Quake, polygons are stored in the empty leaves. Shaded areas are solid leaves (solid volumes, such as the insides of walls).

Correct drawing order can be obtained by drawing the leaves in front-to-back or back-to-front BSP order, again as discussed last time. Also, because BSP leaves are always convex and the polygons are on the boundaries of the BSP leaves, facing inward, the polygons in a given leaf can never obscure one another and can be drawn in any order. (This is a general property of convex polyhedra.)

Culling and visible surface determination

The process of VSD would ideally work as follows. First, you would cull all polygons that are completely outside the view frustum (view pyramid), and would clip away the irrelevant portions of any polygons that are partially outside. Then you would draw only those pixels of each polygon that are actually visible from the current viewpoint, as shown in overhead view in Figure 2, wasting no time overdrawing pixels multiple times; note how little of the polygon set in Figure 2 actually need to be drawn. Finally, in a perfect world, the tests to figure out what parts of which polygons are visible would be free, and the processing time would be the same for all possible viewpoints, giving the game a smooth visual flow.

As it happens, it is easy to determine which polygons are outside the frustum or partially clipped, and it’s quite possible to figure out precisely which pixels need to be drawn. Alas, the world is far from perfect, and those tests are far from free, so the real trick is how to accelerate or skip various tests and still produce the desired result.

As I discussed at length last time, given a BSP, it’s easy and inexpensive to walk the world in front-to-back or back-to-front order. The simplest VSD solution, which I in fact demonstrated last time, is to simply walk the tree back-to-front, clip each polygon to the frustum, and draw it if it’s facing forward and not entirely clipped (the painter’s algorithm). Is that an adequate solution?

For relatively simple worlds, it is perfectly acceptable. It doesn’t scale very well, though. One problem is that as you add more polygons in the world, more transformations and tests have to be performed to cull polygons that aren’t visible; at some point, that will bog performance down considerably.

Happily, there’s a good workaround for this particular problem. As discussed earlier, each leaf of a BSP tree represents a convex subspace, with the nodes that bound the leaf delimiting the space. Perhaps less obvious is that each node in a BSP tree also describes a subspace--the subspace composed of all the node’s children, as shown in Figure 3. Another way of thinking of this is that each node splits into two pieces the subspace created by the nodes above it in the tree, and the node’s children then further carve that subspace into all the leaves that descend from the node.

Since a node’s subspace is bounded and convex, it is possible to test whether it is entirely outside the frustum. If it is, all of the node’s children are certain to be fully clipped, and can be rejected without any additional processing. Since most of the world is typically outside the frustum, many of the polygons in the world can be culled almost for free, in huge, node-subspace chunks. It’s relatively expensive to perform a perfect test for subspace clipping, so instead bounding spheres or boxes are often maintained for each node, specifically for culling tests.

So culling to the frustum isn’t a problem, and the BSP can be used to draw back to front. What’s the problem?

Overdraw

The problem John Carmack, the driving technical force behind DOOM and Quake, faced when he designed Quake was that in a complex world, many scenes have an awful lot of polygons in the frustum. Most of those polygons are partially or entirely obscured by other polygons, but the painter’s algorithm described above requires that every pixel of every polygon in the frustum be drawn, often only to be overdrawn. In a 10,000-polygon Quake level, it would be easy to get a worst-case overdraw level of 10 times or more; that is, in some frames each pixel could be drawn 10 times or more, on average. No rasterizer is fast enough to compensate for an order of magnitude more work than is actually necessary to show a scene; worse still, the painter’s algorithm will cause a vast difference between best-case and worst-case performance, so the frame rate can vary wildly as the viewer moves around.

So the problem John faced was how to keep overdraw down to a manageable level, preferably drawing each pixel exactly once, but certainly no more than two or three times in the worst case. As with frustum culling, it would be ideal if he could eliminate all invisible polygons in the frustum with virtually no work. It would also be a plus if he could manage to draw only the visible parts of partially-visible polygons, but that was a balancing act, in that it had to be a lower-cost operation than the overdraw that would otherwise result.

When I arrived at id at the beginning of March, John already had an engine prototyped and a plan in mind, and I assumed that our work was a simple matter of finishing and optimizing that engine. If I had been aware of id’s history, however, I would have known better. John had done not only DOOM, but also the engines for Wolf 3D and several earlier games, and had actually done several different versions of each engine in the course of development (once doing four engines in four weeks), for a total of perhaps 20 distinct engines over a four-year period. John’s tireless pursuit of new and better designs for Quake’s engine, from every angle he could think of, would end only when we shipped.

By three months after I arrived, only one element of the original VSD design was anywhere in sight, and John had taken “try new things” farther than I’d ever seen it taken.

The beam tree

John’s original Quake design was to draw front to back, using a second BSP tree to keep track of what parts of the screen were already drawn and which were still empty and therefore drawable by the remaining polygons. Logically, you can think of this BSP tree as being a 2-D region describing solid and empty areas of the screen, as shown in Figure 4, but in fact it is a 3-D tree, of the sort known as a beam tree. A beam tree is a collection of 3-D wedges (beams), bounded by planes, projecting out from some center point, in this case the viewpoint, as shown in Figure 5.

Figure 5: Quake's beam tree was composed of 3-D wedges, or beams, projecting out from the viewpoint to polygon edges.

In John’s design, the beam tree started out consisting of a single beam describing the frustum; everything outside that beam was marked solid (so nothing would draw there), and the inside of the beam was marked empty. As each new polygon was reached while walking the world BSP tree front to back, that polygon was converted to a beam by running planes from its edges through the viewpoint, and any part of the beam that intersected empty beams in the beam tree was considered drawable and added to the beam tree as a solid beam. This continued until either there were no more polygons or the beam tree became entirely solid. Once the beam tree was completed, the visible portions of the polygons that had contributed to the beam tree were drawn.

The advantage to working with a 3-D beam tree, rather than a 2-D region, is that determining which side of a beam plane a polygon vertex is on involves only checking the sign of the dot product of the ray to the vertex and the plane normal, because all beam planes run through the origin (the viewpoint). Also, because a beam plane is completely described by a single normal, generating a beam from a polygon edge requires only a cross-product of the edge and a ray from the edge to the viewpoint. Finally, bounding spheres of BSP nodes can be used to do the aforementioned bulk culling to the frustum.

The early-out feature of the beam tree--stopping when the beam tree becomes solid--seems appealing, because it appears to cap worst-case performance. Unfortunately, there are still scenes where it’s possible to see all the way to the sky or the back wall of the world, so in the worst case, all polygons in the frustum will still have to be tested against the beam tree. Similar problems can arise from tiny cracks due to numeric precision limitations. Beam tree clipping is fairly time-consuming, and in scenes with long view distances, such as views across the top of a level, the total cost of beam processing slowed Quake’s frame rate to a crawl. So, in the end, the beam-tree approach proved to suffer from much the same malady as the painter’s algorithm: The worst case was much worse than the average case, and it didn’t scale well with increasing level complexity.

3-D engine du jour

Once the beam tree was working, John relentlessly worked at speeding up the 3-D engine, always trying to improve the design, rather than tweaking the implementation. At least once a week, and often every day, he would walk into my office and say “Last night I couldn’t get to sleep, so I was thinking...” and I’d know that I was about to get my mind stretched yet again. John tried many ways to improve the beam tree, with some success, but more interesting was the profusion of wildly different approaches that he generated, some of which were merely discussed, others of which were implemented in overnight or weekend-long bursts of coding, in both cases ultimately discarded or further evolved when they turned out not to meet the design criteria well enough. Here are some of those approaches, presented in minimal detail in the hopes that, like Tom Wilson with the Paradise FIFO, your imagination will be sparked.

Subdividing raycast: Rays are cast in an 8x8 screen-pixel grid; this is a highly efficient operation because the first intersection with a surface can be found by simply clipping the ray into the BSP tree, starting at the viewpoint, until a solid leaf is reached. If adjacent rays don’t hit the same surface, then a ray is cast halfway between, and so on until all adjacent rays either hit the same surface or are on adjacent pixels; then the block around each ray is drawn from the polygon that was hit. This scales very well, being limited by the number of pixels, with no overdraw. The problem is dropouts; it’s quite possible for small polygons to fall between rays and vanish.

Vertex-free surfaces: The world is represented by a set of surface planes. The polygons are implicit in the plane intersections, and are extracted from the planes as a final step before drawing. This makes for fast clipping and a very small data set (planes are far more compact than polygons), but it’s time-consuming to extract polygons from planes.

Draw-buffer: Like a z-buffer, but with 1 bit per pixel, indicating whether the pixel has been drawn yet. This eliminates overdraw, but at the cost of an inner-loop buffer test, extra writes and cache misses, and, worst of all, considerable complexity. Variations are testing the draw-buffer a byte at a time and completely skipping fully-occluded bytes, or branching off each draw-buffer byte to one of 256 unrolled inner loops for drawing 0-8 pixels, in the process possibly taking advantage of the ability of the x86 to do the perspective floating-point divide in parallel while 8 pixels are processed.

Span-based drawing: Polygons are rasterized into spans, which are added to a global span list and clipped against that list so that only the nearest span at each pixel remains. Little sorting is needed with front-to-back walking, because if there’s any overlap, the span already in the list is nearer. This eliminates overdraw, but at the cost of a lot of span arithmetic; also, every polygon still has to be turned into spans.

Portals: the holes where polygons are missing on surfaces are tracked, because it’s only through such portals that line-of-sight can extend. Drawing goes front-to-back, and when a portal is encountered, polygons and portals behind it are clipped to its limits, until no polygons or portals remain visible. Applied recursively, this allows drawing only the visible portions of visible polygons, but at the cost of a considerable amount of portal clipping.

Breakthrough

In the end, John decided that the beam tree was a sort of second-order structure, reflecting information already implicitly contained in the world BSP tree, so he tackled the problem of extracting visibility information directly from the world BSP tree. He spent a week on this, as a byproduct devising a perfect DOOM (2-D) visibility architecture, whereby a single, linear walk of a DOOM BSP tree produces zero-overdraw 2-D visibility. Doing the same in 3-D turned out to be a much more complex problem, though, and by the end of the week John was frustrated by the increasing complexity and persistent glitches in the visibility code. Although the direct-BSP approach was getting closer to working, it was taking more and more tweaking, and a simple, clean design didn’t seem to be falling out. When I left work one Friday, John was preparing to try to get the direct-BSP approach working properly over the weekend.

When I came in on Monday, John had the look of a man who had broken through to the other side--and also the look of a man who hadn’t had much sleep. He had worked all weekend on the direct-BSP approach, and had gotten it working reasonably well, with insights into how to finish it off. At 3:30 AM Monday morning, as he lay in bed, thinking about portals, he thought of precalculating and storing in each leaf a list of all leaves visible from that leaf, and then at runtime just drawing the visible leaves back-to-front for whatever leaf the viewpoint happens to be in, ignoring all other leaves entirely.

Size was a concern; initially, a raw, uncompressed potentially visible set (PVS) was several megabytes in size. However, the PVS could be stored as a bit vector, with 1 bit per leaf, a structure that shrunk a great deal with simple zero-byte compression. Those steps, along with changing the BSP heuristic to generate fewer leaves (contrary to what I said a few months back, choosing as the next splitter the polygon that splits the fewest other polygons is clearly the best heuristic, based on the latest data) and sealing the outside of the levels so the BSPer can remove the outside surfaces, which can never be seen, eventually brought the PVS down to about 20 Kb for a good-size level.

In exchange for that 20 Kb, culling leaves outside the frustum is speeded up (because only leaves in the PVS are considered), and culling inside the frustum costs nothing more than a little overdraw (the PVS for a leaf includes all leaves visible from anywhere in the leaf, so some overdraw, typically on the order of 50% but ranging up to 150%, generally occurs). Better yet, precalculating the PVS results in a leveling of performance; worst case is no longer much worse than best case, because there’s no longer extra VSD processing--just more polygons and perhaps some extra overdraw--associated with complex scenes. The first time John showed me his working prototype, I went to the most complex scene I knew of, a place where the frame rate used to grind down into the single digits, and spun around smoothly, with no perceptible slowdown.

John says precalculating the PVS was a logical evolution of the approaches he had been considering, that there was no moment when he said “Eureka!”. Nonetheless, it was clearly a breakthrough to a brand-new, superior design, a design that, together with a still-in-development sorted-edge rasterizer that completely eliminates overdraw, comes remarkably close to meeting the “perfect-world” specifications we laid out at the start.

Simplify, and keep on trying new things

What does it all mean? Exactly what I said up front: Simplify, and keep trying new things. The precalculated PVS is simpler than any of the other schemes that had been considered (although precalculating the PVS is an interesting task that I’ll discuss another time). In fact, at runtime the precalculated PVS is just a constrained version of the painter’s algorithm. Does that mean it’s not particularly profound?

Not at all. All really great designs seem simple and even obvious--once they’ve been designed. But the process of getting there requires incredible persistence, and a willingness to try lots of different ideas until the right one falls into place, as happened here.

My friend Chris Hecker has a theory that all approaches work out to the same thing in the end, since they all reflect the same underlying state and functionality. In terms of underlying theory, I’ve found that to be true; whether you do perspective texture mapping with a divide or with incremental hyperbolic calculations, the numbers do exactly the same thing. When it comes to implementation, however, my experience is that simply time-shifting an approach, or matching hardware capabilities better, or caching can make an astonishing difference. My friend Terje Mathisen likes to say that “almost all programming can be viewed as an exercise in caching,” and that’s exactly what John did. No matter how fast he made his VSD calculations, they could never be as fast as precalculating and looking up the visibility, and his most inspired move was to yank himself out of the “faster code” mindset and realize that it was in fact possible to precalculate (in effect, cache) and look up the PVS.

The hardest thing in the world is to step outside a familiar, pretty good solution to a difficult problem and look for a different, better solution. The best ways I know to do that are to keep trying new, wacky things, and always, always, always try to simplify. One of John’s goals is to have fewer lines of code in each 3-D game than in the previous game, on the assumption that as he learns more, he should be able to do things better with less code.

So far, it seems to have worked out pretty well for him.

Learn now, pay forward

There’s one other thing I’d like to mention before I close up shop for this month. As far back as I can remember, DDJ has epitomized the attitude that sharing programming information is A Good Thing. I know a lot of programmers who were able to leap ahead in their development because of Hendrix’s Tiny C, or Stevens’ D-Flat, or simply by browsing through DDJ’s annual collections. (Me, for one.) Most companies understandably view sharing information in a very different way, as potential profit lost--but that’s what makes DDJ so valuable to the programming community.

It is in that spirit that id Software is allowing me to describe in these pages how Quake works, even before Quake has shipped. That’s also why id has placed the full source code for Wolfenstein 3D on ftp.idsoftware.com/idstuff/source; you can’t just recompile the code and sell it, but you can learn how a full-blown, successful game works; check wolfsrc.txt in the above-mentioned directory for details on how the code may be used.

So remember, when it’s legally possible, sharing information benefits us all in the long run. You can pay forward the debt for the information you gain here and elsewhere by sharing what you know whenever you can, by writing an article or book or posting on the Net. None of us learns in a vacuum; we all stand on the shoulders of giants such as Wirth and Knuth and thousands of others. Lend your shoulders to building the future!