For now Bad Mood is still a development project, so the 680x0 forum seems to be the place for it - at least until it turns into something playable.

The current aim is to speed up the rendering (3D scene view) as much as possible, before trying to bolt it onto the Doom engine itself (which is a bit of an unknown performance-wise, so for now I'm going to assume it will run fast enough @16MHz with the graphics layer stripped away and some minor mods!).

I have been doing experiments in the background as time permits and have reached a few conclusions on rendering performance. I expect to have some (good) news soon.

Here are some notes on experiments I did today with textured rendering of floors and ceilings (a.k.a. 'visplanes') in the Bad Mood engine.

Visplane rendering incurs a high cost in the Bad Mood engine - the highest single cost for any typical view (except transparent walls, which need to be redone completely and are best ignored for now, and turbulent lava stuff, which I'll deal with another time).

Visplane rendering is expensive for a few reasons:-

- the visplane textures rotate, 'roto-zoomer' style (while the wall textures do not), so the texture addressing unit is much more complicated than for walls.

- more complicated texture addressing means more cycles per pixel, more ops per pixel, and poor loop unrolling in the CPU cache

- there are often more 'spans' making up visplane surfaces than there are wall 'columns', each one needing some CPU setup before drawing (so lots of short spans == bad)

- the setup for each span involves several exchanges between CPU/DSP to acquire du/dv affine gradients and get them formatted for drawing, again adding bus accesses and pressuring the CPU cache on each new span
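To make the addressing cost concrete, here's a minimal C sketch (hypothetical code, not Bad Mood's actual inner loops - the 64x64 texture size and 16.16 fixed-point stepping are assumptions): a wall column advances a single coordinate through the texture, while a rotated floor span must advance u and v together and wrap both on every texel.

```c
#include <stdint.h>

/* Hypothetical 64x64 8-bit texture, 16.16 fixed-point coordinates. */

/* Wall column: one coordinate, texels visited in memory order. */
static void draw_wall_column(uint8_t *dst, const uint8_t *tex,
                             uint32_t v, uint32_t dv, int count)
{
    while (count--) {
        *dst++ = tex[(v >> 16) & 63];   /* single shift + mask + index */
        v += dv;
    }
}

/* Visplane span: u and v both step (the 'roto-zoomer' case), so every
   texel needs two adds, two shifts, two masks and a 2D index - visibly
   more work per pixel than the wall loop above. */
static void draw_floor_span(uint8_t *dst, const uint8_t *tex,
                            uint32_t u, uint32_t v,
                            uint32_t du, uint32_t dv, int count)
{
    while (count--) {
        *dst++ = tex[((v >> 10) & (63 << 6)) | ((u >> 16) & 63)];
        u += du;
        v += dv;
    }
}
```

The rotated span also visits texels in an arbitrary diagonal order through memory, which is exactly what hurts the CPU data cache.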

The best demoscene tip I had received so far (Mikro) was to implement some kind of DSP-based texture unit, using the DSP host port as a sort of 'texel server'. I had used this in the past for gouraud shading tricks but hadn't tried actually putting a whole texture on the DSP (!). I was also concerned that the host port is 3 separate byte-wide ports - so reading words from it probably costs twice as much as reading bytes, which might even the score a bit with normal CPU texturing. However, that may not be the case on the Falcon (I will measure all this stuff separately in 'Nimbench' at some point). It could well be mapped as words on the host side and bytes on the DSP side at 2x the clock rate - I'll come back to it!

Anyway I ran 2 new experiments using the DSP as a texture unit to compare with the last release, using RGB as the reference display mode.

1) original 'CPU-only' based visplane texturing
2) DSP+CPU based texel server, feeding 8-bit texels to the CPU, which then 'lights' and draws the pixel
3) DSP-based texel server, feeding 16-bit pixels to the CPU, which are drawn directly (no lighting)

I used the built-in sampling profiler to measure the relative cost of each method, and results are below. Note that the profiler sucks CPU time so the indicated FPS can be as much as 1fps lower than it is in a normal build.

So any kind of DSP-based texture unit is faster than the current CPU-only one. That's useful.

The hybrid DSP+CPU version has the advantage that it can be implemented relatively easily and lighting still works as before. However the DSP-only solution is the fastest of all, and that's a bit of a dilemma...

Making the DSP-only method work with dynamic lighting is *hard* - the lighting is currently done with 64-level fog tables and there isn't space on the DSP for that. It's also too expensive to do the arithmetic explicitly per texel, so that's not an option. One way that might work is to reindex the textures so each uses no more than 32 or 64 colours (from the 256-colour palette) - most of the textures don't use that many distinct colours, so this could work. It would allow the lighting/fog table to fit in the DSP, assuming it is re-uploaded for each new texture.
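One way the reduced-palette idea could look in C - a hypothetical sketch, where the simple multiplicative fade-to-black stands in for whatever fog curve is actually used: with at most 64 local colours per texture, the full 64-level light table shrinks to 64x64 = 4K entries, small enough to swap per texture.

```c
#include <stdint.h>

#define LIGHT_LEVELS 64
#define TEX_COLOURS  64   /* texture reindexed to <= 64 local colours */

/* Per-texture fog/light table: 64 levels x 64 local colours = 4K bytes,
   small enough to re-upload to DSP memory on each texture change. */
static uint8_t light_table[LIGHT_LEVELS][TEX_COLOURS];

/* Build a trivial 'fade to black' table: each local colour maps to a
   global palette index, scaled by the light level. Hypothetical scheme -
   a real fog table would be precomputed from the palette colours. */
static void build_light_table(const uint8_t *local_to_global)
{
    for (int lv = 0; lv < LIGHT_LEVELS; lv++)
        for (int c = 0; c < TEX_COLOURS; c++)
            light_table[lv][c] =
                (uint8_t)((local_to_global[c] * lv) / (LIGHT_LEVELS - 1));
}

/* Per-texel lighting is then a single table lookup, cheap enough to
   live inside the texel-serving loop. */
static uint8_t light_texel(int level, uint8_t local_colour)
{
    return light_table[level][local_colour];
}
```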

Other possibilities involve mipmaps, encoding the fog tables directly into the mipmap levels - but that doesn't help with other dynamic lighting effects, and the whole thing becomes quite complex to manage and change later.

So this is the current status - there's a considerable speedup on the table, pretty much guaranteed at this point, but the biggest speedup would involve quite a lot of work. I'm going to spend a bit longer thinking about it all before writing any new code, in case there is some way to simplify it further.

The table in my last post has been updated with an optimized version, which removes the redundant CPU/DSP exchanges needed for CPU-based texturing on each new span of pixels, and a version with the sampling profiler disabled (with timings missing, obviously).

calimero wrote:is there any sense to render same way walls (using DSP)?

There is, to a lesser degree, yes - but I would quickly run into problems with it: the DSP is already busy at that time generating visplanes. So a change like that would need a lot of rework - it could be painful. But the walls can be sped up, yes - even without the DSP. There are some details about the wall rendering which make it suitable for other kinds of speedup.

I did a quick test already with 1-pix textures and the walls were much faster - so they are very sensitive to texel reuse (and probably texel skipping) through the CPU data cache. This is because wall columns are not rotated, and the textures are oriented such that texels are always scanned in memory order (unlike floors).

I think the minimum that can be done is to use MipMaps and always select a Mip which gives a >= 1:1 pixel/texel ratio, so there is always some pixel stretching. MipMaps can be adjusted easily with a config value, so the degree of stretching will trade performance against detail.

Even without any stretching, MipMaps would eliminate texel skipping - which the data cache hates - and this is probably a major cost in the wall drawing just now, since a lot of the drawn wall area involves scaling textures down, not up (notice the framerate can rise quite a lot if you walk right up to a blank wall).
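The mip selection rule could be sketched like this - hypothetical code, `select_mip` and the fixed-point format are my inventions, not BM's: pick the first level at which the texel step drops to 1.0 or below, so drawing only ever stretches, and expose a bias value as the config knob.

```c
#include <stdint.h>

/* 16.16 fixed-point texel step per screen pixel. A step above 1.0
   means the texture is being scaled down, i.e. texels get skipped. */
#define FIX_ONE 0x10000u

/* Each mip level halves the effective step. Walk up until the step is
   <= 1 texel/pixel, so drawing always stretches; 'bias' starts the
   search at a coarser level, trading detail for speed. */
static int select_mip(uint32_t step, int bias, int max_level)
{
    int level = bias;
    while (level < max_level && (step >> level) > FIX_ONE)
        level++;
    return level;
}
```

With `bias = 0` this only kicks in when minification would occur; raising the bias forces coarser mips (and more stretching) everywhere.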

I also have some 68030 PMMU trickery to try on the display memory to prevent accidental cache pollution through writing, and some other stuff I haven't mentioned yet - such as selective cache control for du:dv texel rates. Doom has a nice feature that every texel is exactly the same 'size' in the world at a given distance - there is no texture scaling anywhere. So you know exactly how big each texel is at every given scene depth. This means you can tell when texel stretching or skipping is going to occur for each new line of pixels and you can plan ahead a little.

So, I'm not done with the performance thing yet - not for a while.

I'm currently working out a texture encoding for the DSP which will look at least as good as, or better than, the current version. I figure the texture and its light/fog tables can be interleaved into the same 4k of memory for a single 64x64 floor tile, which will make a 'DSP-only' visplane texture route viable for Bad Mood (in its basic form it wouldn't allow any lighting or depth-cue). This will take time because I need to reformat the textures into a 2nd form, possibly cached on disk, and write some new DSP code etc. etc. So there may not be any interesting updates until I get that part working.

BTW on the subject of texture formats - another thing which can be done with Bad Mood is to adopt truecolour textures. It does rely on indexed textures, but they don't all have to share the same 256-colour palette - each texture can effectively have its own palette (or palette group). So at some point we could rework (or lift, from the Jag version?) truecolour versions of the Doom texture set and use those instead of the old PC ones...

How do you do the floor and ceiling mapping? Around the time when DOOM and Duke 3D were hot I started to make a DOOM engine myself on PC. I didn't finish it, but the technique I used was different from what they used in DOOM. I didn't use a BSP tree but sectors and portals instead. To draw the walls I used a perspective mapper, but with one perspective division per line instead of per pixel, since you can't look up and down. This is probably what you already do too. For the floor and ceiling I used a floor mapper routine that works in about the same way as mode 7 on the SNES. It's very simple and you can put all divisions and multiplications in lookup tables so that you don't need any at runtime. I found one explanation here: http://gamedev.stackexchange.com/questions/24957/doing-an-snes-mode-7-affine-transform-effect-in-pygame

As you can see, the routine is very simple and can easily map the entire screen. Instead of having the y and x FOR loops you just do a normal polygon render and use this as the horizontal line routine. One test I did once was to build an entire 3D engine from just the floor mapper. I had different tiles for floors and walls, so when the floor mapper found a wall tile, I didn't draw the floor but instead a wall strip like in Wolfenstein. The end result is a Wolfenstein engine with floors and ceilings that is quite fast. It would have been interesting to try this on Atari to see how fast it would be. Maybe even an ST could handle it. You don't need a raycast engine, and as long as you can draw 320x100 pixels (or maybe 160x50 with c2p) at a decent framerate, this could work. The ceiling could be a copy of the floor with a different palette to speed things up.
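The mode-7 scheme above can be sketched in C - a simplified, hypothetical illustration with an assumed camera model: one divide per scanline in the row setup, then a pure add-and-lookup inner loop.

```c
#include <stdint.h>

/* Draw one horizontal floor scanline, SNES mode-7 style: the whole row
   lies at one depth, so the u/v start values and steps are constant for
   the row and every texel is just an add + lookup. 64x64 tiled texture,
   16.16 fixed-point coordinates. */
static void mode7_scanline(uint8_t *dst, const uint8_t *tex,
                           int32_t u, int32_t v,
                           int32_t du, int32_t dv, int width)
{
    while (width--) {
        *dst++ = tex[((v >> 10) & (63 << 6)) | ((u >> 16) & 63)];
        u += du;
        v += dv;
    }
}

/* Hypothetical per-row setup: camera at height h above the floor,
   screen row y below the horizon, h_focal = h * focal length (16.16).
   One divide per scanline gives the depth; the rotated basis vectors
   scaled by it give the per-pixel steps (16.16 sin/cos inputs). */
static void floor_row_setup(int y, int32_t h_focal,
                            int32_t cos_a, int32_t sin_a,
                            int32_t *du, int32_t *dv)
{
    int32_t depth = h_focal / y;            /* one divide per scanline */
    *du = (int32_t)(((int64_t)cos_a * depth) >> 16);
    *dv = (int32_t)(((int64_t)sin_a * depth) >> 16);
}
```

With a lookup table for `h_focal / y` (one entry per screen row), even the per-row divide disappears, which is the trick described in the linked explanation.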

Zamuel_a wrote:How do you do the floor and ceiling mapping? For the floor and ceiling I used a floor mapper routine that works in about the same way as mode 7 on the SNES. It's very simple and you can put all divisions and multiplications in lookup tables so that you don't need any at runtime. I found one explanation here:

Well it's very similar, except I don't need a lookup table for the DSP version - there isn't much space for it, and the perspective calc is absorbed in parallel with drawing time. For a CPU-only solution I would use the LUT of course.

Zamuel_a wrote:Around the time when DOOM and Duke 3D were hot I started to make a DOOM engine myself on PC. I didn't finish it, but the technique I used was different from what they used in DOOM. I didn't use a BSP tree but sectors and portals instead. To draw the walls I used a perspective mapper, but with one perspective division per line instead of per pixel, since you can't look up and down. This is probably what you already do too.

Yes I was quite into portals at the time and made an engine with that too for Atari - but didn't finish it. I did use them in a commercial game project, and wrote a Maya plugin to build the 'portal maps' automatically out of CSG primitives.

Zamuel_a wrote:I had different tiles for floors and walls, so when the floor mapper found a wall tile, I didn't draw the floor but instead a wall strip like in Wolfenstein. The end result is a Wolfenstein engine with floors and ceilings that is quite fast. It would have been interesting to try this on Atari to see how fast it would be. Maybe even an ST could handle it. You don't need a raycast engine, and as long as you can draw 320x100 pixels (or maybe 160x50 with c2p) at a decent framerate, this could work. The ceiling could be a copy of the floor with a different palette to speed things up.

Yes there's no need for 'raycasting' as such. Bad Mood doesn't really raycast either - it's all affine mapping with a perspective calculation per row or column. The perspective calc cost is just hidden in a different way.

BTW You should try your project on Atari and see how well your technique works (hint, hint).

dml wrote:I also have some 68030 PMMU trickery to try on the display memory to prevent accidental cache pollution through writing,

Note that although the Hatari WinUAE CPU core has preliminary MMU emulation and some simpler things work with it, it is still missing some bits of full MMU (exception) emulation, and CPU cycle information in the Hatari profiler will be (even more) bogus for that variant of the WinUAE CPU core.

Eero Tamminen wrote:Note that although the Hatari WinUAE CPU core has preliminary MMU emulation and some simpler things work with it, it is still missing some bits of full MMU (exception) emulation, and CPU cycle information in the Hatari profiler will be (even more) bogus for that variant of the WinUAE CPU core.

I won't be using Hatari for the MMU tests - it's a very specific optimization and definitely best done on real kit. It's also not clear if it will have any effect with write-allocation disabled (the default case on Falcon IIRC - although I will be trying to combine the two in some parts of the BM code).

I remember when Doom came out and I tried to find information about how it was done; everyone said that it was a raycasting engine, similar to Wolfenstein but more advanced. I found one "good" text that tried to explain it, and it said that in Doom every line was raycasted instead of every block as in Wolfenstein. I haven't looked into the Doom source code, but I can't believe it is a raycasting engine - it must be a "real" 3D engine, except that they don't calculate everything in all axes. Doing a pixel-precise raycaster would be very slow I think.

Zamuel_a wrote:I found one "good" text that tried to explain it, and it said that in Doom every line was raycasted instead of every block as in Wolfenstein. I haven't looked into the Doom source code, but I can't believe it is a raycasting engine - it must be a "real" 3D engine, except that they don't calculate everything in all axes. Doing a pixel-precise raycaster would be very slow I think.

It's a 2D raycasting engine, as I understand it.

For each vertical pixel column, it fires a single ray out from the eye, colliding it very quickly with the world using the 2D BSP tree. At each point where it crosses a plane (which is basically a height change) it can find the location in 3D space, calculate how much new visible floor / wall / ceiling it defines and rasterise the visible space. Very clever technique - it very much decouples the cost of rendering the world from its size; the span generation is pretty much proportional to the number of rays and not much else (hence why Doom scales nearly linearly when adjusting the horizontal resolution). Every pixel is written exactly once and no more, so no messing about with complex Z algorithms.

The only drawback is you're restricted to a 2D world. You can't rotate in the roll or pitch axes - looking up or down causes perspective distortion (hence why Doom doesn't even do it).

Firing a ray for each column would be very slow. Wolfenstein 3D is fast since you know the world is made of 64x64 pixel blocks, so you can more or less jump 64 pixels and don't have to test the ray at each potential position. If you do a "real" raycaster for each pixel, then even a Wolf 3D engine would be terribly slow on a Pentium 2 computer (I tried this once). In Doom you don't know where a wall is, since the world is not made up of blocks like in Wolf 3D, so you don't know where to "shoot" - you would have to test each pixel, and that would be very slow, so I can't believe they do that.

The way Doom maps are processed allows them to be drawn implicitly without casting rays into the BSP, or sorting polygons or z-buffering etc. It is pretty clever. It certainly could be raycasted because it provides everything that would be needed to do that - but it's cheaper not to. Raycasting is used in the engine but for other stuff - for indexing textures, collision detection and AI interactions.

The engine walks down the BSP tree in view-dependent order (near node first), dealing with each convex 'ssector' (sub-sector or physical node) one at a time, front to back, with only front-facing 'ssegs' (sub-segments) from each ssector drawn as walls, effectively as complete polygons made up of individual columns. However each column is first clipped against an occlusion buffer for the entire image, and itself updates the occlusion buffer. In this way no pixel is drawn twice, and the engine knows when to stop drawing (all pixels occluded) and even when to stop walking the BSP. It tracks scene coverage to the pixel, very cheaply.
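The per-column occlusion tracking described above might look roughly like this in C - a hypothetical simplification based on the description, not the actual Doom source, and it only models solid walls, which close their column completely:

```c
#include <stdint.h>

#define SCREEN_W 320
#define SCREEN_H 200

/* Per-column occlusion: ceil_clip grows downward from the top of the
   image, floor_clip grows upward from the bottom. A column is fully
   occluded once they meet. */
static int16_t ceil_clip[SCREEN_W];   /* last occluded row from the top */
static int16_t floor_clip[SCREEN_W];  /* first occluded row from below  */

static void occlusion_reset(void)
{
    for (int x = 0; x < SCREEN_W; x++) {
        ceil_clip[x] = -1;
        floor_clip[x] = SCREEN_H;
    }
}

/* Clip a solid wall column [top, bottom) against what is already drawn
   and update the buffer. Returns the number of visible pixels; 0 means
   the column contributed nothing, and once every column returns 0 the
   BSP walk can stop. */
static int clip_wall_column(int x, int top, int bottom,
                            int *out_top, int *out_bottom)
{
    if (top <= ceil_clip[x]) top = ceil_clip[x] + 1;
    if (bottom > floor_clip[x]) bottom = floor_clip[x];
    if (top >= bottom) return 0;          /* column fully occluded */
    *out_top = top;
    *out_bottom = bottom;
    ceil_clip[x] = SCREEN_H;              /* solid wall: close column */
    floor_clip[x] = -1;
    return bottom - top;
}
```

Upper/lower walls at sector joins would update only one of the two clip values instead of closing the column, which is exactly the "one window per sector join" limit discussed below.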

This does impose a limit though of one 'window' per sector join - each column has simple miny/maxy occlusion tracking, and can only 'close off' the scene from the top/bottom of the image inwards. You won't ever see one window above another (unless it's faked using a transparent wall texture with holes in the texture)...

I expect Doom and BadMood don't process upper/lower wall segments the same way - BadMood treats them as separate 'sub walls', added in sequence to the occlusion buffer. Doom might be processing them both at once as a kind of 'inverse wall' where the window between ssectors fills the occlusion buffer for upper/lower walls together. I'll probably have to check that sometime.

The floors (visplanes) are just area fills (actually gaps) between wall tops/bottoms, but they have to be scan converted into the opposite axis before rasterization. There are no explicit floor/ceiling primitives involved - just a clever way of tracking what's left after the walls are generated. The visplane conversion is messy compared with everything else for a few reasons, but the method does work and is cheaper than using primitives.
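That scan conversion - turning the per-column vertical gaps into horizontal spans - can be sketched as follows (hypothetical code; Doom's actual visplane code is organised differently):

```c
#define MAX_H 200

typedef struct { int y, x1, x2; } Span;

/* top[x]..bottom[x] (inclusive) is the vertical gap the plane occupies
   in column x. Walking left to right, a row 'opens' when it enters the
   column's range and 'closes' when it leaves it; each open..close pair
   becomes one horizontal span. Returns the number of spans in out. */
static int make_spans(const int *top, const int *bottom, int w,
                      int max_h, Span *out)
{
    int span_start[MAX_H];
    int prev_top = max_h, prev_bot = -1;   /* empty column before x=0 */
    int n = 0;

    for (int x = 0; x <= w; x++) {
        int t = (x < w) ? top[x]    : max_h;   /* sentinel closes all */
        int b = (x < w) ? bottom[x] : -1;

        /* rows visible in the previous column but not in this one */
        for (int y = prev_top; y <= prev_bot; y++)
            if (y < t || y > b) {
                out[n].y  = y;
                out[n].x1 = span_start[y];
                out[n].x2 = x - 1;
                n++;
            }

        /* rows visible in this column but not in the previous one */
        for (int y = t; y <= b; y++)
            if (y < prev_top || y > prev_bot)
                span_start[y] = x;

        prev_top = t;
        prev_bot = b;
    }
    return n;
}
```

Each emitted span then gets the affine u/v setup and the rotated-texture inner loop - which is why lots of short spans are the expensive case.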

So the BSP divides and sorts the scene into front-to-back order, allowing the occlusion buffer to work properly - that's its main duty. It's also used for casting rays by the game engine to find obstacles between entities - line of sight etc. But these tests are relatively rare compared with rendering occlusion tests. The BSP walk does depend on good visibility testing for each ssector, to avoid wasted work processing ssegs which are out of view. I don't remember what the Doom engine does for this, but I use 3 methods at once: 1) fast octant cull, 2) viewcone cull of the node bounding box, 3) occlusion buffer test using a 2D poly representing the node bounding box.

Transparent objects, sprites etc. are tracked and occluded in the same way but not drawn until the walls and floors are done, after which they are drawn in reverse order, back to front.

With respect to Dio's post on raycasting - I have to admit some bias on techniques, because I began the current engine some time before I had a copy of the Doom source code, so I'm partly influenced by the way I had to do things to suit the Falcon, DSP etc., deriving methods logically from scraps of information on the net (e.g. I wouldn't try to keep the scene BSP on the DSP because its size is pretty arbitrary and it would prevent reuse of the DSP for any other task). It's likely BM diverges from Doom in a number of ways - there is more than one way to render those scenes from the BSP correctly and quickly, with pretty much the same termination conditions. I have also seen multiple, different accounts of how it works...

The main objection I had to literal raycasting is the 5-20 intersections per display column, with 'work' at potentially many of those interfaces (sector windows - upper/lower walls), which can be dealt with cheaply in scan-conversion order per surface.

However, if you had a really fast implementation of that raycasting test, simply to find the first contact (and which sseg/ssector owns the contact), it would be a viable way to generate a visible surface & column list from which to do the rest. The other bits still need to be done (y occlusion tables) but many of the stages are separable.

I suppose it's something I should try at some point just to see what the cost trade looks like. I fear it's something that would need a nice big CPU cache because of the duty switching & tracking involved at each BSP split. It would also have to be entirely CPU-side I think, and might not involve *literally* flinging entire rays into the BSP root every time - possibly some kind of incremental scan across hyperplanes, spawning sub-rays at each split (a marching-cubes kind of thing), if that happened to reduce duty switching.

Anyway I should be careful about what I say Doom does/doesn't do without properly studying the source again after all these years, and many of these comments keep me thinking of possibilities, so...

How fast could it be if the original Doom source were used but with the graphics stuff optimized for the Falcon? The first time I played it was on a 20MHz 386SX computer and it ran at an OK speed in low detail mode. Is the Falcon so much slower than a 386 with more or less the same clock speed? The MCGA mode on PC is of course much better than the bitplane modes on the Falcon, but in highcolour that shouldn't matter so much, unless it's very slow to draw stuff in that mode.

Zamuel_a wrote:How fast could it be if the original Doom source were used but with the graphics stuff optimized for the Falcon? The first time I played it was on a 20MHz 386SX computer and it ran at an OK speed in low detail mode.

That's sort of what I'm trying to do by bolting BM onto the Doom code - if that works out (BM would be a sort of Falcon-optimized graphics layer).

However it's difficult to speculate on the performance of a 100% direct port of the original sources without the DSP etc., since nobody has tried it (if anyone had, it would have happened while porting PMDOOM, in which case we'd already have a Falcon030 'Doom'), and I expect quite a lot of the C code would need to be redone for the Falcon, as it was for x86...

In any case it's a different project. I'll stick with the BM implementation for now and incorporate any improvements and methods that seem to make sense for the Falcon (including anything I scrape from Doom sources too).

I started writing up some progress on DSP texturing over the last couple of days, and it ended up being a bit of a ramble on the DSP generally, so here's the short version.

My tests showed there are between 7 and 10 'instruction cycles' (each one of those being 2x 32MHz clock cycles) free & available for use on the DSP, for every single texel/pixel the CPU can copy onto the screen from the DSP port.

So any texture mapping implementation has to fit into that, before the CPU needs to be slowed down.

I know this has been done already in at least one Falcon demo (albeit without texel lighting) in just 7 operations, so I had to figure out a way to get lighting in there as well to make it useful for BadMood.

I say between 7 and 10 because it varies depending on bus load, display size, blah blah... this is a bit of a worry - I don't want the game crashing/locking up midway because of an optimization. Sometimes I was able to get 10 ops, other times only 9 - at one point there was only time for 8 when I reduced the display to a small window.

Anyway I managed a full implementation of texture addressing + uv wrapping + texture + lighting lookups in just 8 ops (still to be properly verified with a real texture), and I don't think I can do any better than that - it's already a densely packed, unreadable mess of parallel moves and addressing tricks.

Now either it will work reliably *or* the CPU will need artificial padding to slow it down and let the DSP keep up.

Will find out before next week and report.

There is another problem which may prevent me using pure DSP texturing anyway - at least for now: texture state changes, i.e. the number of textures needing to be uploaded per frame. It's another solvable thing, but more work, for later. Another time...

Busy for the rest of the week but will post any interesting progress if any is made...