Subject: Fastest line algorithm -- FINALLY...
From: lukeh@kcbbs.gen.nz (Luke Hutchison)
Hi,
(FINALLY... This paper should be *complete*... I've sorted out the
problem with the newsreader, so it should post as normal. Sorry for the
half-finished and blank messages!)
Here's the line-routine description, as promised... I wrote this a year ago,
and it's written as a "paper" (I saw my first scientific paper not long
before writing this, and copied the style!).
Just a couple of things--IMPORTANT:
(1) Please, please can I have your comments about this paper and the
algorithm it describes. I will update the paper, and need to know
your opinions and need ideas for improvements.
(2) If you ever use this algorithm in any application whatsoever, could
you please let me know. I'm just really interested to see how and
where it is used. You can reach me at the address/number/Email
address given at the end of the paper.
Also remember that a claim to being "the fastest" line-drawing routine
is very subjective... but this is *very* fast...
Have fun!
Luke Hutchison (lukeh@kcbbs.gen.nz).
--
-----------------------------------------------------------------------
A New Algorithm for the Fast Scan-Conversion and Rasterization of Lines
-----------------------------------------------------------------------
Copyright (C) 19 Jan 1994 Luke A. D. Hutchison
ABSTRACT
A new, fast algorithm for line-drawing is presented which makes use of
a concept similar to that employed by Bresenham's "Run-length slice"
algorithm, combined with fixed-point calculations which make redundant the
error-variable decisions inherent in all Bresenham implementations.
A working knowledge of fixed-point math and some previous knowledge of
scan-conversion is assumed.
Keywords and phrases: Line drawing, scan-conversion, rasterization,
Bresenham's Algorithm, application of fixed-point math
INTRODUCTION
Since the start of the computer age people have been developing and improving
algorithms for the scan-conversion of graphical primitives. Probably the most
well-known of these algorithms are the ones developed by J.E.Bresenham for the
plotting of a line, circle and other objects on a 2-dimensional raster device.
Bresenham's line-drawing algorithm works by stepping along the major axis, at
each step calculating how far from the ideal line the current minor distance is
(the "error"). When the error gets too large the point is moved in the
appropriate direction along the minor axis (ie. towards the line) and the error
variable adjusted down accordingly.
Efficient versions of Bresenham's original algorithm can be written, yet it is
not the most efficient way to draw a line. Bresenham later developed his
"Run-length slice algorithm" which looks at the line as a series of horizontal
or vertical "runs" or rows/columns of pixels, as shown below:
+---------------------------------------+
| **** (x2,y2) |
| *** |
| **** |
| *** |
| **** |
| (x1,y1) *** |
+---------------------------------------+
Fig.1
The line in Fig.1 has a gradient of 6/21, or 1/3.5. Notice that the average
length of the rows of adjacent pixels is 3.5.
The Run-length Slice Algorithm is different from a normal Bresenham's in that
it steps along the minor axis, taking the integer reciprocal of the gradient
(3 for the above line) and using this as the minimum "horizontal run-length"
for the line. This means that for every step along the minor axis, at least
three pixels are drawn, and the fourth pixel is only drawn if the above
error-decision system so indicates (this in fact works, in theory, by adding
the fractional part of the reciprocal of the gradient to a "total"
variable, and when this reaches 1.0 then the fourth pixel is drawn and 1.0 is
subtracted from the total).
THE NEW ALGORITHM
A lot of needless processing is done by Bresenham's run-length slice algorithm.
Why bother keeping track of the error, or total the fractional parts, when
the gradient and the position along the major axis can be kept as fixed-point
values? To illustrate the point, refer again to the example given in Fig.1.
Note that we will arbitrarily draw lines from bottom to top here, however either
direction will work. Also note that the below description applies only to
X-major lines (the converse of the description applies Y-major lines).
Starting at the bottom of the line, (x1,y1), we set a variable, 'x' to the
fixed-point equivalent of 'x1':
x = x1 << 16;
Note that we are using 32-bit integers and that we are using the lower 16 bits
of the int as the fractional part of the fixed-point number. A different
number of bits could work for a given implementation as long as precision is
not lost.
Secondly the screen address of the pixel at (x1,y1) is found:
char *scr = scr_base + x1 + y1 * screen_width;
This is where on the screen the linedrawing will commence. Thirdly the inverse
gradient is found, as with Bresenham's Run-length Slice Algorithm, only in
fixed-point:
m = (dx << 16) / dy; (Eq.1)
Where:
dx = x2 - x1;
dy = y2 - y1;
A variable is used to remember the previous position of 'x', which we shall
call 'ox'. It is initialized as:
ox = x;
Now, in the algorithm's simplest form, another integer variable, 'y', steps
between 'y1' and 'y2', ie. steps through all run-slices in turn, and for every
run-slice a new value for 'x' is calculated by adding the inverse gradient to
it:
x += m;
The length of a given run is then:
rlen = ((x >> 16) - ((ox >> 16));
(Or in other words the integer distance between the current and previous
horizontal fixed-point positions).
Finally a row of pixels is drawn starting at 'scr' and of length 'rlen'
(note that 'scr' is updated on exit from the horizontal run-drawing code to
point to the pixel at the end of the run), then 'scr' is updated to point
to the scanline above the current one and 'ox' is set to 'x' as follows:
draw_H_run(&scr, rlen);
scr -= screen_width; /* Move up one scanline */
ox = x;
(draw_H_run() will draw the specified number of pixels in a horizontal line).
The Y-loop then continues with the next run.
As previously stated, the converse of all above remarks applies to Y-major
lines.
IMPLEMENTATION DETAILS
There are a few special cases to consider, as well as some details that need
to be taken into account for the line to be centred around the ideal (ie. for
the rasterized version of the line to match as closely as possible the true
line between the endpoints).
Firstly, if (dy == 0) then the above statement for 'm' (Eq.1) is undefined.
The same applies for (dx == 0). If (dy == 0) then a horizontal line between
(x1,y1) and (x2,y2) needs to be drawn (in other words a single horizontal run
of length (dx + 1) needs to be drawn) instead of going through the normal
process:
if (dy == 0) {
draw_H_run(&scr, dx + 1);
return; /* Line is now drawn so simply return */
}
The same needs to be done for (dx == 0) with 'draw_V_run()' to avoid division
by zero in the Y-major part of the routine.
The second detail that needs to be considered is that the algorithm, as thus
far stated, is not correct in that the line to be drawn is not necessarily
centred about the ideal line, as illustrated below, and also the line will be
too long:
+-----------------------------------+-----------------------------------+
| *** (x2,y2) | (x2,y2) *### |
| **** | **** |
| *** | **** |
| **** | **** |
| (x1,y1) *** | (x1,y1) **** |
+-----------------------------------+-----------------------------------+
Fig.2 (Ideal rasterized line) Fig.3 (Line produced by algorithm
thus far)
The line in Fig.2 is what would be generated by a standard Bresenham's
algorithm whereas Fig.3 shows what the above (simplified) description of
the algorithm would produce. The hashes ('#') in Fig.3 show the "extra" pixels
which would be plotted but are in fact further to the right than 'x2'. This
is because dx = 16 and dy = 4, giving a run-length of 4.0. However the first
run (4 pixels in above example), plus the single pixel that is valid in the
last run, must be split evenly between the first and last runs in order to
balance the line about its centre. This is done in the following way:
Because we previously set 'ox' to 'x' just before the 'y'-loop, and to
calculate the run-length 'rlen' we subtract the integer part of 'ox' from the
integer part of 'x', to force the initial run to be shorter all we need to do
is add a number to 'ox' so that 'rlen' is the length we require.
We know that 'm' is the fixed-point version of the run-length, and we know
that (m + 1.0) pixels need to be evenly divided between initial and final runs.
We will in fact use (m + 1.5) so that rounding instead of truncating occurs.
Hence instead of doing
ox = x;
just before the 'y'-loop, we do:
ox = x + ((m + (3 << 15)) >> 1);
Which boils down to "ox = x + (m + 1.5) / 2" because (1.5 << 16) is best
expressed as (3 << 15). But what happens at the other end, (x2, y2)? We will
modify the 'y'-loop from
for (y = y1; y <= y2; y++) {
...
}
to
for (y = y1; y < y2; y++) {
...
}
draw_H_run(&scr, x2 - (x >> 16) + 1);
The above change means that all but the last run will be plotted, then the
last run will be plotted separately at the end. As we already plotted a run of
length (half of initial- & final-run total) or (m + 1.0) when we processed the
initial run, the length of the final run should be the remaining half-run.
The produced line now matches the ideal.
EFFICIENCY CONSIDERATIONS AND OPTIMIZATION
The presented algorithm is very fast. But it could be improved on.
CONSIDERATION #1: Division
Divisions are very slow. But why divide when you could multiply?
You know that the maximum height of a plottable line is the height of the
screen. As you are dividing 'dx' by 'dy' and getting a fixed-point result,
the divisor is only ever within the range 1 to (screen_height - 1) inclusive
(note that if the line is Y-major then this will be dy/dx hence the range
will be 1 to (screen_width - 1) inclusive). So we can create a lookup-table
of size (MAX(screen_width, screen_height)) of fixed-point reciprocals where:
div_table[index] = (1 << 16) / index;
This step is performed at initialization-time. Then to perform, say, dx/dy in
fixed-point we would later use:
m = dx * div_table[dy];
This can typically be performed up to 50 times as fast as the previous division
method on some processors.
One other important point to note is that 16 bits may no longer be accurate
enough for large screensizes as 1/table_size can lose accuracy due to underflow
very quickly. More bits can then be used to represent the reciprocals, and
these shifted back to the 16.16-format fixed-point after the multiply as
follows:
div_table[index] = (1 << 23) / index;
(at initialization-time), then
m = (dx * div_table[dy]) >> (22 - 16)
(when we require a division). This forms the table in .22-format fixed-point,
leaving the top 10 bits clear for the sign-bit plus the multiplication by dx
(which may now be represented in 9 bits max, or has a maximum magnitude of 511).
The above statistics are for 32-bit ints only. No such problem exists on
64-bit processors where 32.32 format fixed-point numbers may be used. Note
that the fixed-point format will need to be taylored for the application.
CONSIDERATION #2: Long Horizontal Runs
The second speedup that could be employed on some processors is that after
calculating 'm' for X-major lines, it could be tested to see if it were larger
than a predefined threshold value, and if so then instead of calling the normal
'draw_H_run()' procedure you could call a special 'draw_long_H_run()' procedure
that did multiple-wordwise stores instead of just bytewise stores. This will
work on systems that have word-accessible video RAM and a way of blitting large
amounts of consecutive data. This type of line-draw is very fast but can often
take copious amounts of time to start up, hence the threshold (some short lines
will be drawn faster by bytewise stores). This alternative code could either
be invoked through a second copy of the linedrawing-routine, or through the
use of a function-pointer to the routine to use to plot the lines.
The case mentioned above where (dy == 0) would of course use the same
alternative horizontal-line mechanism.
Vertical runs will need to be drawn bytewise on most systems.
CONSIDERATION #3: Diagonal runs, Symmetry, Lookahead, Short lines,
"Runs of Runs"
Following are four concepts which could be used to make the line-drawing
orders of magnitude faster, but require much extra coding and extrapolation
of special cases.
1) This line-drawing algorithm is very fast for steep and shallow lines
(because of the long runs in such primitives), and especially for the latter
because of the runs' being horizontal. But once the inverse-gradient gets
less than 2.0, the following happens:
+-------------------+--------------------+-------------------+-----------------+
| m = 7.33 | m = 2.0 ** | m = 1.2 * | m = 1.0 * |
| ***** | ** | * | * |
| ****** --> ** --> ** --> * |
| ***** | ** | * | * |
| | ** | * | * |
+-------------------+--------------------+-------------------+-----------------+
Fig.4 (Progression of gradient)
Notice how horizontal runs of (length > 2) become fewer when (m < 2.0). Of
course when (m < 1.0) then vertical runs are used instead, but in the range
(1.0 <= m < 2.0) the algorithm becomes inefficient. Notice though that there
are now far more *diagonal* runs, with (num. diag. runs == num. horiz. runs)
when (m == 2.0).
In order to separate this special-case we need initially to check that
(m < 2.0). This is normally done using a division in fixed-point such as
(dx / dy) to calculate 'm'. However if the above condition is true, and
we are handling a line with mostly diagonal runs, then a _second_ division
must be performed as outlined below in order to determine the line's gradient
with respect to the diagonal. Performing two divisions per line is very
wasteful, so an alternative method is required:
if (dy < (dx << 1)) {
/* Line is at least twice as long as it is high, use horizontal runs */
} else {
/* Line has more diagonal than horizontal runs, use diagonal runs */
}
The above check is made to see whether or not (dx / dy < 2.0) without actually
performing the divisions. The respective calculations for 'm' would then be
performed within the "then" and "else" blocks of the "if" statement.
Diagonal run-lengths are calculated in exactly the same way as horizontal
run-lengths except that they are calculated around the line (dx == dy):
For "shallow" lines ( <= 45degrees), instead of
m = dy / dx;
we write the following to calculate the gradient as offset from (dx == dy):
m_diag = dy / (dx - dy);
(or more precisely
m_diag = (dy << 16) / (dx - dy);
in order to produce a fixed-point result). Of course a special case will now
need to be taken out for (dx == dy) (diagonal lines) because the above equation
would otherwise be undefined. This check need only be done in the code which
is *inclusive* of the diagonal: when we check near the beginning of the
algorithm whether we are dealing with X- or Y-major lines we might use the
following:
if (ABS(dx) > ABS(dy)) {
/* X-major */
} else {
/* Y-major */
}
Now we know that the condition (dx == dy), or the condition which is satisfied
by a 45-degree line, can only occur in the Y-major part of the code as the
X-major code is exclusive of this condition. Hence the special-case
diagonal-only code need only be placed somewhere in the Y-major part of the
routine.
The following equation must then be used again, as demonstrated above, to
initiate the first run:
ox = x + ((m + (1 << 16)) >> 1);
All the remaining runs excepting the final one are then drawn as diagonal-runs
of 'rlen' pixels (as calculated for horizontal runs), after which the current
pixel-address 'scr' must be moved _down_ one scanline so that the next
diagonal-run starts to the right of the last pixel of the current run.
The final run is then plotted as outlined above, with respect to the diagonal
case of course.
If diagonal runs are also processed then the algorithm starts to look very
efficient.
2) The second speed-optimization that can be made is that of drawing the line
from both ends at once, reducing the amount of actual calculation that needs
to be done by half. It is fairly well-known that this will nearly double
the speed of a standard Bresenham's algorithm, however the speedup in the case
of this algorithm will not be quite as large as the calculations will (usually)
take a small amount of time compared to the time taken in plotting the runs.
As the gradient approaches 2.0 (or the equivalent in any octant) however this
optimization becomes more important as the runs can be quite short.
Two screen-pointers now need to be employed, one for each endpoint. A second
version of the run-drawing procedures should be written, for speed, which work
backwards, ie. they start at the passed address and work backwards through
memory storing the run. This is so that no extra pointer-manipulation needs to
be done to ensure that the screen pointer for the duplicated half of the line
always remains at the beginning of the duplicated half of the line.
The bottom (or top) half of the line is then scan-converted, with each
calculated run being plotted twice (once from each screen-pointer, with the
duplicated half being plotted backwards towards the other half). This is done
inside the following loop:
for (y = y1; y < y1 + ((dy + 1) >> 1); y++) {
...
}
if (!(dy & 1))
draw_H_run(&scr, (int) scr2 - (int) scr + 1);
(This assumes the second screen-pointer is called 'scr2'). The equation
((dy + 1) >> 1) is equal to the truncated integer number of runs which need
drawing in order to complete half of the line. The condition (!(dy & 1))
checks to see whether there is an odd number of runs in the line and if so
then the final, middle run is drawn -- the length of this run is the
difference in the two screen-pointers plus one because the two screen-pointers
are now on the same scanline.
3) The third optimization that could be employed, in conjunction with any or
all of the others, is the use of a "lookahead". I will not go into much
detail here save to say that we know that for example if the fractional part of
'm' is less than or equal to 0.25 then we can safely plot sets of 4 runs all the
same length before adding (m << 2) to 'ox' and 'x'. This can vastly speed up
the drawing of many types of line. More than one case can also be used:
checks can be made on the fractional part of 'm' to see whether it is greater
than 0.5, 0.25 and 0.125 for example. Again the most optimal choice will have
to be taken for the particular application and implementation.
4) One other factor which could be considered for applications where very
"long" lines are to be drawn (eg. lines at a high resolution) is that we
could consider "runs of runs": Say the fractional part of the gradient
was 0.25, three runs in a row would be the same length, then the forth
would be longer by one pixel. (The number of runs in such a series can
trivially be found.) Of course, this could be extended to "runs of runs
of runs" etc. *but* the initialization time would outweigh the gains in
almost all applications.
5) The final optimization mentioned here is that it could be found in a given
implementation that for very short lines a standard Bresenham's algorithm is
faster. This is up to the discression of the programmer to separate out as a
special case. If, for example, (ABS(dx) + ABS(dy) < threshold) then a normal
Bresenham line could be drawn. For some applications however the usage of
very short lines is minimal and this new algorithm will provide a more
efficient alternative.
COMPLETE IMPLEMENTATIONS
Nearly all above examples have been for the case of octant 0, ie. X-major
lines with (m >= 0). Y-major lines can generally be obtained by taking the
converse of the above algorithm (swapping the X- and Y-variables).
When (m < 0) some additional points need to be considered: The absolute-value
of the minor-axis delta may need to be taken if a division lookup-table is to
be used, or alternatively a lookup-table which is twice the size can be used
with negative values stored before and positive values stored after the
reference-point from which the delta is used as an index, ie.
int index, div_tab[TABLE_SIZE * 2 - 1];
div_table = div_tab + TABLE_SIZE - 1; /* div_table points to (dy == 0) */
for (index = -TABLE_SIZE + 1; index < TABLE_SIZE - 1; index++)
div_table[index] = (1 << 16) / index;
Then to access the division table later, a signed quantity can be used as an
index into 'div_table'.
Some other points to note are that when (m < 0), the computation of 'rlen'
performed above will be negative. 'rlen' could either be multiplied by the
sign of 'm' (which could be slow on some processors) or the code could be
extrapolated into a special-case for negative gradients where calculations
such as
rlen = ((ox >> 16) - ((x >> 16));
and
x -= m;
would be performed instead of the previous positive equivalents. Run-plotting
should also happen in reverse for speed (avoiding pointer corrections such as
having to subtract the run-length from the screen-address before and after
plotting), or at least the screen-pointer needs do be updated in the reverse
direction to normal. This is a "problem" that is implementation-dependent
as most systems will be able to work backwards but some may not. Other
solutions to the movement of the pointers could probably be readily found,
again looking at the problem from an implementation-dependent point-of-view.
One such solution would be to have 'scr' merely point to the start of the
scanline which will hold the current run. Then to draw a run a modified
call to 'draw_H_run()' would be used as follows:
draw_H_run(scr, (ox >> 16), ((x >> 16) - (ox >> 16)));
Where 'draw_H_run()' is declared as:
void draw_H_run(char *pscanline, int xstart, int rlen);
To draw a run in the opposite or 'backwards' direction, one would merely swap
the 'ox' and the 'x'.
I will state again that the 16 bits of mantissa for the fixed-point variables
'x', 'ox' etc. is arbitrary, as is the number of bits used in the division
lookup-table. This must be taylored to suit a given implementation.
SUITABILITY OF ALGORITHM TO IMPLEMENTATION
The presented algorithm is particularly suited to modern RISC architectures
however may be developed for any system. Integers of at least 32 bits in size
are quite important if reasonably long lines are to be drawn without loss of
accuracy however if this is not a problem then there is no reason why other
sizes could not be accommodated.
Because of the algorithm's particular suitability to the ARM series of
processors originally designed by Acorn Computers and now developed by
Advanced Risc Machines Ltd., I shall discuss in a bit more detail an
implementation for such a chip and why it is suited. The following comments
may, and probably also apply to other modern processors:
* The ARM processors have a built-in barrel-shifter in their ALU hence
they can do the 16-bit shifts in no processor time when combined with
another instruction, such as an 'add'
* In an ARM implementation a block of code can be replicated at
assemble-time and jumped into by adding a multiple of the blocksize to
the 'pc' (Program Counter). If there are 10 runs to draw out of a maximum
of 256 and each block (which processes one run) takes
(8 instructions * 4 bytes per instruction) then we jump ahead from the
current position in the program by ((256 - 10) * 8*4) bytes into the
following chunk of replicated blocks. This is effectively completely
removing all looping constructs, which is a good thing to do on a
pipelined processor.
There are also many more, less significant tricks that can be applied in the
implementation of this algorithm on these processors and no doubt every
implementation will have its own optimizations. The above was given merely
as an example of this put into practice for a possible implementation.
ORIGIN OF ALGORITHM
I was recently writing a polygon scan-converter, using the fact that I always
drew from top to bottom, hence the move in the Y-direction would always be 1.
I realised that adding the value of (dx / dy) to the current x-position with
each move down by one scanline would give me the desired polygon edge. It
made sense to do this in fixed-point for speed. I then worked out the method
of using a lookup-table of reciprocals and multiplying instead of dividing
for speed.
One thing I noticed about any shallow edges of the polygon was that even a long
shallow edge could be scan converted at an amazing speed. I realised this
could be put to use by remembering the previous x-position, 'ox', as well as
the current, 'x', and drawing a horizontal line between them.
I leter found that Bresenham had already come up with the concept of drawing
lines in runs, however this fixed-point version of the algorithm should be
faster than an implementation of such an algorithm because of the lack of
true decision and error variables in this algorithm.
+-------------------------------------+
| |
| ## | Fig.5: Polygon scan-conversion,
| #******# | showing the positions of 'x'
| #************# | at the beginning and end of each
| #******************# | scanline as a '#'.
| #************************# |
| #*****************# |
| #*********# |
| |
+-------------------------------------+
+-------------------------------------+
| |
| ** | Fig.6: Polygon scan-conversion
| ******** | modified for line-drawing, showing
| ************** | the positions of 'x' ('#') and
| ******************** | 'ox' ('%') along the bottom edge.
| %**********#************** |
| %***********#****** |
| %*********# |
| |
+-------------------------------------+
+-------------------------------------+
| |
| | Fig.7: Horizontal-runs drawn
| | between 'ox' and 'x'.
| |
| |
| *********** |
| ************ |
| *********** |
| |
+-------------------------------------+
REFERENCES
[I would place Bresenham's references here but I don't have them -- I have
only heard of his algorithms from other sources but have yet to see any of
his papers. He is however acknowledged here for his work]
CONTACT ADDRESS
You can reach me via Email at:
lukeh@kcbbs.gen.nz .
I am also reachable via SnailMail at:
Luke Hutchison
30 Long Bay Drive
Torbay
Auckland 1310
NEW ZEALAND
Tel. 649-473-9026