Let's consider 5/32 = .15625.
On the first next line, I'll add nothing. So, x value will be the floor part, and that's all.
On the second line, I'll add the floor part, AND 1 * 0.15625 = 0.15625, which is rounded to 0.
On the third line, I'll add the floor part, AND 2 * 0.15625 = 0.3125, which is rounded to 0.
On the fourth line, I'll add the floor part, AND 3 * 0.15625 = 0.46875, which is rounded to 0.
On the fifth line, I'll add the floor part, AND 4 * 0.15625 = 0.625, which is rounded to 1.
...
On the 32nd line, I'll add the floor part, AND 31 * 0.15625 = 4.84375, which is rounded to 5.

What we will focus on is the rounded part, and especially WHEN did it change.
On the first line, the rounded part was 0.
On the second line, the rounded part was 0.
On the third line, the rounded part was 0.
On the fourth line, the rounded part was 0.
On the fifth line, the rounded part was 1, we increased by 1.
...
And so on.

So,
On the first line, I didn't add anything.
On the second line, I didn't add anything.
On the third line, I didn't add anything.
On the fourth line, I didn't add anything.
On the fifth line, I did add 1.
...
And so on.

Finally, for 5/32, I get
Line 0, I add 0
Line 1, I add 0
Line 2, I add 0
Line 3, I add 0
Line 4, I add 1
Line 5, I add 0
Line 6, I add 0
Line 7, I add 0
Line 8, I add 0
Line 9, I add 0
Line 10, I add 1
Line 11, I add 0
...
Line 28, I add 0
Line 29, I add 1
Line 30, I add 0
Line 31, I add 0

Let's note that for 5/32, and considering 32 lines, I added 1 exactly 5 times.
If I had just 10 lines to raster, I would have add 1 once (line 0 to line 9).
Which may be slightly different from 10 * 5/ 32, but it's the rounding error.

What if I got more than 32 lines to draw, you ask ?
Let's just start from line 0 again!

Implementation
Of course, I could have gone to 6-bits precision (1/64), or even more.
And I could have computed for, eg 21/64, and the 64 times I would have incremented, or not.
But I didn't need the extra-precision.

Let's now consider 5-bit precision, which tells me 32 times whether or not I have to increment.
32 times true or false.
32 times a binary value.
32 times one bit.
So, a 32 bit value.
Sticking to 5/32, it's easy to write its increments horizontally:
0 0 0 0 1 0 0 0 0 0 1 0 0 ... 0 0 1 0 0
, then wrapping it all together, binary style:
0000100000100 ... 00100b
, then put it in a 32-bit register, hexa-style:
08208104h
, then rotate it multiple times to the left

Thus, on each next line, I rotate my value to the left,
the MSb is moved into T,
I then move T into a register with the MOVT instruction,
finally, I add this register to my xA register.

No test, no lookup-table, plus, 2 main advantages:
- reduced memory footprint
- I don't have to bother for the modulus, ROTL does this for me!

#Wrapping it all together
Here are the values for 5-bit precision.
I add to adjust the value above 15/32 to alias some roundig errors.
If your decimal value is between 0 and 1/32 ( 0.03125), use 0.
If your decimal value is between 1/32 ( 0.03125) and 2/32 (0.0625), use $00008000.
If your decimal value is between 2/32 (0.0625) and 3/32 (0.09375), use $00800080.
If your decimal value is between 3/32 (0.09375) and 4/32 (0.125), use $08080808.
If your decimal value is between 4/32 (0.125) and 5/32 (0.15625), use $08208104.
...
And so on.

## How to do a real DIV without, floating-point nor DIV
In your ROM, just put a lookup table, eg 512 x 512. Let's say that the dividend will be on "lines", and divisors on "columns".
Let's say that each value will be, eg 9 bits for the integer part, and 5 bits for the "floating" part.
The "floating" part will be expressed in 32th of 1, eg 5 for 0.15625.
Each value will use 16 bits (2 bits for padding, 9 for the integer part, and 5 for the floating part), or 2-bytes.
I'll then use the "floating" part as an offset for the rotate table above.
Thus, 216 / 60 (= 3.6) is located at line 216, column 60, or, (216 * 512 + 60) * 2 = 221 304.
I will find there 0x0073

Drawing a triangle
So, for each 2D-triangle (we'll see the 3D -> 2D projection another time), I sorted the vertices (well ... pixels), A being the uppermost, then B being the leftmost.
I compute the slopes (with the fixed-decimal part, remember?), and I figure two pixels starting from A : x1 and x2.
I increment the line, and add the left slope integer part to x1, and the right slope integer part to x2.
I rotate the left slope decimal part and if T is true, I increment (or decrement) x1 by 1 pixel.
I rotate the right slope decimal part and if T is true, I increment (or decrement) x2 by 1 pixel.
... till I'm at yB (or yC).
This is the upper part of my triangle.
Then, I compute the last slope, and to the same for the remaining of the triangle. That would be the lower part of my triangle.

Problem ?
It's sooooo damn slow.
While I'm not even projecting, rotating, or lighting, and even while I'm only drawing half of a triangle, I can barely reach 8 triangles @60fps. I'm pretty sure Virtua Racing was better :/

The most common way to do software rendering of triangles is to use something akin to bresenham lines for calculating the sides (really just the interpolation part). Then you don't need to do multiplication or division at all.

This is the gist of the interpolation if you wonder (basically error accumulation):

Start with accum at either 0 or (xdist >> 1) or whatever. Sort that one out.

Note that it's possible the slow down comes from elsewhere, e.g. screwing up how you're doing the pixel writes to the framebuffer. So you need to check out that stuff first if you ever want to try again.

This is rather close to how quads are rendered in Yeti3D, but the affine loops are in C. I've always intended to go back into my Yeti3D demo and replace the rendering, but never got around to it. As is, the Yeti3D demo does very roughly about 1200 polys per second on the 32X, with finding which polys to render being a significant amount of the slow down.

If you're doing more than 16 bit division, use the hardware divider built into the SH2 instead. You can do other things while it's dividing. The Saturn example code from Sega for their 3D uses the hw divider by default for all division in handling vertexes and such.

Note that it's possible the slow down comes from elsewhere, e.g. screwing up how you're doing the pixel writes to the framebuffer. So you need to check out that stuff first if you ever want to try again.

I remember writing a sprite blitting engine for the 32X once that I think was severely bottlenecked by framebuffer writes.
What's the correct way to do it?

Note that it's possible the slow down comes from elsewhere, e.g. screwing up how you're doing the pixel writes to the framebuffer. So you need to check out that stuff first if you ever want to try again.

I remember writing a sprite blitting engine for the 32X once that I think was severely bottlenecked by framebuffer writes.
What's the correct way to do it?

Frame buffer writes take 3 cycles per word until the FIFO fills (four words), then empty at 5 cycles per word. If you're filling pixels with the cpu, you need to arrange the code so that you take at least that many cycles between writes or you wind up wasting time waiting on a FIFO slot. You could also draw to an off-screen buffer in SDRAM at 2 cycles per word, then transfer the off-screen buffer to the frame buffer using DMA. If you use the 16 byte transfer size, the DMA will use burst reads from SDRAM (1.5 cycles per word). The writes will still be no faster than 5 cycles per word after the first four words, but you could schedule your DMA for times when you're otherwise not doing anything.