Speeding up the graphics on Pentium Pro / Pentium II computers

Technical description

The Pentium Pro and Pentium II processors contain
cache memory that speeds up access to frequently
used data. The first-level cache resides on the processor chip;
the second-level cache is in the processor package (Pentium Pro),
on the processor cartridge (Pentium II), or absent altogether
(older Celerons).

The principle of a cache is to keep copies of
data from external memory. These redundant copies
must be carefully synchronised with memory; otherwise,
for example, a DMA peripheral could read a stale
copy of data whose new value exists only in the
cache.

Several strategies are possible for this
synchronisation. Either every write updates the
cache and main memory at the same time (write-through),
or the write to memory is deferred to a more convenient
time (write-back).
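
The difference between the two strategies can be illustrated
with a toy model. This is only a sketch: the Cache class and the
dma_read function are hypothetical names invented for the
illustration, and plain dictionaries stand in for main memory.

```python
class Cache:
    """Toy cache in front of a dict that plays the role of main memory."""

    def __init__(self, memory, policy):
        assert policy in ("write-through", "write-back")
        self.memory = memory   # backing store, as seen by a DMA device
        self.policy = policy
        self.lines = {}        # cached copies: address -> value
        self.dirty = set()     # addresses modified only in the cache

    def write(self, addr, value):
        self.lines[addr] = value
        if self.policy == "write-through":
            self.memory[addr] = value   # memory is updated immediately
        else:
            self.dirty.add(addr)        # memory is updated later

    def flush(self):
        # Write back every dirty line, e.g. before a DMA transfer starts.
        for addr in sorted(self.dirty):
            self.memory[addr] = self.lines[addr]
        self.dirty.clear()


def dma_read(memory, addr):
    # A DMA device bypasses the CPU cache and reads memory directly.
    return memory[addr]


wt_mem = {0x100: "old"}
Cache(wt_mem, "write-through").write(0x100, "new")
print(dma_read(wt_mem, 0x100))   # new

wb_mem = {0x100: "old"}
wb = Cache(wb_mem, "write-back")
wb.write(0x100, "new")
print(dma_read(wb_mem, 0x100))   # old (stale until the cache is flushed)
wb.flush()
print(dma_read(wb_mem, 0x100))   # new
```

With write-back the device sees the old value until the dirty
line is flushed, which is exactly the synchronisation problem
described above.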

In some cases the memory is located on a
device and is accessed over a device bus. A
graphics card on a PCI or AGP bus is a good example.
These buses achieve higher throughput when the data
arrives in larger chunks that are transferred in
a single transaction; merging several writes into one
such transaction is called write combining.
Operations that transfer large contiguous blocks
(and are not performed locally by the accelerator
on the device itself) can benefit from this setting.
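
The effect of write combining on the number of bus transactions
can be estimated with a small model. This is a sketch under
stated assumptions: count_transactions is a hypothetical helper,
and the 32-byte combining block is an illustrative value, not a
figure taken from this text.

```python
def count_transactions(addresses, block_size=None):
    """Count bus transactions for a sequence of single-byte writes.

    block_size=None models uncombined writes: one transaction each.
    Otherwise consecutive writes that fall into the same aligned
    block of block_size bytes are merged into one burst.
    """
    if block_size is None:
        return len(addresses)
    transactions = 0
    current_block = None
    for addr in addresses:
        block = addr // block_size
        if block != current_block:   # a new block starts a new burst
            transactions += 1
            current_block = block
    return transactions


BASE = 0xE0000000
row = [BASE + i for i in range(1440)]            # one row of 1-byte pixels
scattered = [BASE + i * 4096 for i in range(10)]  # isolated writes

print(count_transactions(row))            # 1440
print(count_transactions(row, 32))        # 45
print(count_transactions(scattered, 32))  # 10
```

A contiguous row shrinks from 1440 transactions to 45 in this
model, while scattered single writes gain nothing, which matches
the claim that only large contiguous transfers benefit.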

Pentium Pro and Pentium II processors contain
registers, the MTRRs (Memory Type Range Registers),
that specify the strategy for communication with
external memory for a number of physical address
ranges. The Linux operating system provides
access to these registers from user space.

Configuration

To exploit this feature you need the following:

Linux kernel version 2.2.0 or later

A PCI or AGP graphics card with a known address of the
linear frame buffer and a known memory size. Both values
can be found in the X server log messages, which are
normally printed on the screen when X is started with
startx, or written to a log file such as
/var/log/xdm-error.log when X is started automatically. An example:

(--) S3: videoram: 2048k
(--) S3: Local bus LAW is 0xE0000000
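
The two values can also be extracted from the log
programmatically. The following is a minimal sketch that
assumes the log lines look exactly like the two S3 lines above;
parse_s3_log is a hypothetical helper, and other X servers
print these values in different formats.

```python
import re


def parse_s3_log(log_text):
    """Return (base_address, size_in_bytes) from XF86_S3 probe lines."""
    ram = re.search(r"videoram:\s*(\d+)k", log_text)
    law = re.search(r"LAW is (0x[0-9A-Fa-f]+)", log_text)
    if ram is None or law is None:
        raise ValueError("frame buffer address or size not found in log")
    # videoram is reported in kilobytes, the address in hexadecimal
    return int(law.group(1), 16), int(ram.group(1)) * 1024


log = """(--) S3: videoram: 2048k
(--) S3: Local bus LAW is 0xE0000000
"""
base, size = parse_s3_log(log)
print(hex(base), hex(size))   # 0xe0000000 0x200000
```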

The option "MTRR control and configuration" (CONFIG_MTRR)
must be enabled in the "Processor type and features" section of the
Linux kernel configuration. After the kernel is recompiled
and the machine rebooted, a new pseudo-file appears in the
/proc filesystem: /proc/mtrr,
which provides access to the MTRR registers. The current
settings can be read with the cat /proc/mtrr command.
For example, on a machine with 96 MB of memory
the output is

If everything works normally, you can put
this command into a script that is run at boot time.
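
As an illustration of the /proc/mtrr interface, the sketch
below builds and writes a write-combining request. The request
format ("base=... size=... type=write-combining") follows the
kernel's MTRR documentation; the function names are
hypothetical, the values are taken from the S3 log example
above, and this is not necessarily the exact command the text
refers to. The power-of-two and alignment checks reflect the
usual constraints of variable-range MTRRs.

```python
def mtrr_request(base, size, mem_type="write-combining"):
    """Build the text line accepted by the /proc/mtrr interface."""
    if size <= 0 or size & (size - 1):
        raise ValueError("size must be a power of two")
    if base % size:
        raise ValueError("base must be aligned to a multiple of size")
    return "base=0x%x size=0x%x type=%s\n" % (base, size, mem_type)


def set_mtrr(base, size, path="/proc/mtrr"):
    # Requires root privileges and a kernel built with CONFIG_MTRR.
    with open(path, "w") as f:
        f.write(mtrr_request(base, size))


def show_mtrr(path="/proc/mtrr"):
    # Equivalent of "cat /proc/mtrr": returns the current settings.
    with open(path) as f:
        return f.read()


# Frame buffer at 0xE0000000 with 2048 KB of video memory
print(mtrr_request(0xE0000000, 2048 * 1024), end="")
# base=0xe0000000 size=0x200000 type=write-combining
```

Calling set_mtrr(0xE0000000, 0x200000) as root would submit the
request; show_mtrr() can then be used to check that a new
write-combining region is listed.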

Results

I have used the described setting for more than a year on a
Pentium Pro / 166 MHz computer with a PCI S3 Trio64V+
graphics card. The X server is XF86_S3 version 3.3.3, and the
graphics mode is 1440x1080 with 256 colors. The x11perf
performance testing tool shows a speedup of more than 50%
for the following operations:

Ratio  Operation
3.09   500x500 tiled rectangle (161x145 tile)
3.00   500x500 tiled rectangle (216x208 tile)
2.86   100x100 tiled rectangle (216x208 tile)
2.78   100x100 tiled rectangle (161x145 tile)
2.74   Copy 500x500 from pixmap to window
2.68   ShmPutImage 500x500 square
2.56   Copy 100x100 from pixmap to window
2.44   ShmPutImage 100x100 square
2.17   Fill 300x300 tiled trapezoid (4x4 tile)
2.09   Fill 300x300 tiled trapezoid (161x145 tile)
2.06   Fill 300x300 tiled trapezoid (216x208 tile)
1.82   Fill 100x100 tiled trapezoid (4x4 tile)
1.76   Fill 300x300 tiled trapezoid (17x15 tile)
1.66   Destroy window via parent (200 kids)
1.51   Unmap window via parent (16 kids)
1.51   PutImage 100x100 square

The following operations were at least 10% slower:

Ratio  Operation
0.89   Create and map subwindows (4 kids)
0.88   Fill 10x10 tiled trapezoid (161x145 tile)
0.87   Fill 10x10 tiled trapezoid (216x208 tile)
0.84   Create unmapped window (200 kids)
0.82   Destroy window via parent (16 kids)
0.80   1x1 tiled rectangle (216x208 tile)
0.80   1x1 tiled rectangle (161x145 tile)
0.78   Unmap window via parent (4 kids)

The other operations (x11perf tests more than 300 of
them) did not change notably in either direction. The
statistical error is hard to estimate, but the most accelerated
operations were indeed the ones that transfer large
contiguous blocks between memory and the graphics card.