The Perfect Memory

By Margaret Piemonte [PC Magazine]

Memory: We know we have it, and we know we need it. We've heard that
the more a PC has, the better its performance will be. This section
will explore the different types of main memory (called DRAM, or
dynamic random access memory) and processor cache memory (called Level
2 or L2 cache) available for a PC today. By varying the amounts and
types of DRAM and L2 cache in your system, you can get different
performance returns.

As CPU speeds increase, so does the need for faster system
components. Traditionally, the memory bus runs much slower than the
CPU. But with Intel's newest PCI chip sets, the 430HX and 430VX, faster
memory technologies can be implemented, closing the gap between the
speed of the memory bus and the speed of the CPU.

Fast Page Mode: A Dying Breed

Fast-page-mode (or FPM) DRAM used to be standard issue on mainstream
PCs. But the market has recently seen a surge in the availability of
newer and faster memory types, which have all but succeeded in
displacing FPM DRAM as the memory of choice. In fact, in our last
roundup of 101 Pentium-class PCs ("Pentium Classic: Still the
One," June 25, 1996), only three PCs used FPM DRAM.

FPM-memory read accesses begin with the activation of a row in the
DRAM array, then move on to the activation of the first column of a
memory address location that contains the data that you want. Each
piece of information needs to be validated, then the data needs to be
latched back to the system. Once the correct piece of information is
found, the column deactivates and gets ready for the next cycle. This
introduces a wait state, because nothing is happening while the column
is deactivating. (The CPU must wait for the memory to complete the
cycle). The data output buffer is turned off until either the next
cycle begins or the next piece of information is requested. In fast
page mode, the next column in the row activates in anticipation that
the next piece of data you need lies in the memory location adjacent
to the previous piece. This activation of the next column pays off
only with sequential reads from memory in a given row.
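The row-then-column access pattern can be sketched as a toy model. This is purely illustrative (the class and constant names are invented, not a real API), using the ideal clock counts discussed in this section: a read that misses the open row pays the full row-activation overhead, while a read that hits the already-open row needs only a column access.

```python
# Toy model of fast-page-mode reads: a row must be opened before any
# column in it can be read; reads that hit the already-open row skip
# the row-activation overhead. Clock counts follow the ideal 6-3-3-3
# timing; ROW_SIZE is an assumed value for the sketch.

FIRST_ACCESS_CLOCKS = 6   # row activation + first column access
PAGE_HIT_CLOCKS = 3       # subsequent column in the same (open) row
ROW_SIZE = 1024           # bytes per DRAM row (assumption)

class FPMDram:
    def __init__(self):
        self.open_row = None

    def read(self, address):
        """Return the clock cost of reading one data element."""
        row = address // ROW_SIZE
        if row == self.open_row:
            return PAGE_HIT_CLOCKS      # page hit: column access only
        self.open_row = row
        return FIRST_ACCESS_CLOCKS      # page miss: open the row first

dram = FPMDram()
burst = [dram.read(a) for a in (0, 8, 16, 24)]   # sequential, one row
print(burst, sum(burst))                          # [6, 3, 3, 3] 15
```

Four sequential reads from the same row cost 6 + 3 + 3 + 3 clocks; a read pattern that jumped between rows would pay the 6-clock penalty every time.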

Ideally, a read from 50-nanosecond FPM memory can achieve a burst
cycle timing as fast as 6-3-3-3 (6 clocks for the first data element,
followed by 3 clocks each for the next three data elements). The first
phase includes the overhead created by activating the row and column.
Once they have been activated, the memory can transfer the data in as
few as three clock cycles per piece of data.
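The arithmetic behind that 6-3-3-3 figure is easy to work out. The 64-bit (8-byte) bus width below is an assumption typical of Pentium-class systems, not something stated in the text:

```python
# Worked arithmetic for the ideal 6-3-3-3 FPM burst on a 66-MHz memory
# bus, assuming a 64-bit (8-byte) data bus.

BUS_MHZ = 66
CLOCK_NS = 1000 / BUS_MHZ            # ~15.2 ns per bus clock
burst = (6, 3, 3, 3)                 # clocks per data element

total_clocks = sum(burst)            # 15 clocks for four elements
total_ns = total_clocks * CLOCK_NS   # ~227 ns per 32-byte burst
bytes_moved = 4 * 8                  # four 8-byte transfers

print(total_clocks)                                  # 15
print(round(bytes_moved / (total_ns * 1e-9) / 1e6))  # ~141 MB/s peak
```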

EDO DRAM

Extended data out (EDO, sometimes called hyper-page-mode) DRAM and
burst EDO (BEDO) DRAM are two memory technologies based on the
fundamentals of page-mode memory. EDO was introduced into mainstream
PCs about a year ago and has since become the main memory of choice for
many system vendors. BEDO is relatively new and has not yet caught the
market's attention to the extent that EDO has.

EDO works much like FPM DRAM: A row of memory is activated, and then
the column is activated. But when the piece of information is found,
instead of deactivating the column and turning off the output buffer
(which is what FPM DRAM does), EDO memory keeps the output data buffer
on until the next column access or next read cycle begins. By keeping
the buffer on, EDO eliminates wait states, and burst transfers happen
more quickly.

EDO also enjoys a faster ideal burst read cycle timing than FPM
DRAM: 6-2-2-2 versus FPM's 6-3-3-3. This ultimately saves three clock
cycles in a burst of four data elements from DRAM on a 66-MHz bus. EDO
is easy to implement, and with virtually no price difference between
fast page mode and EDO, there is no reason not to choose EDO.

Burst EDO DRAM

BEDO DRAM improves cycle times over FPM much more than EDO does.
Since most PC applications access memory in four-cycle bursts to fill
cache memory (system memory will burst its data into L2 cache, or to
the CPU in the absence of L2 cache), once the first address is known,
the next three can quickly be provided by the DRAM. The essential
enhancement that BEDO offers is the addition of an address counter on
the chip to keep track of the next addresses.
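The role of that on-chip counter can be sketched as follows; the class and method names are invented for illustration. Once the controller supplies the first column address, the chip generates the rest of the burst itself:

```python
# Sketch of BEDO's on-chip address counter: after the first column
# address is latched, the chip increments the column itself for the
# remaining beats of a four-element burst, so the memory controller
# need not send three more addresses. Names are illustrative only.

class BurstCounter:
    def __init__(self, burst_length=4):
        self.burst_length = burst_length
        self.column = None
        self.remaining = 0

    def load(self, column):
        """Latch the starting column for a new burst."""
        self.column = column
        self.remaining = self.burst_length

    def next_column(self):
        """Yield the next column address without external input."""
        col = self.column
        self.column += 1
        self.remaining -= 1
        return col

counter = BurstCounter()
counter.load(40)
print([counter.next_column() for _ in range(4)])   # [40, 41, 42, 43]
```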

BEDO also adds a pipelined stage that allows the page-access cycle
to be divided into two components. For a memory read operation, the
first component accesses the data from the memory array to the output
stage (second latch), while the second component drives the data bus
from this latch at the appropriate logic level. Since the data is
already in the output buffer, faster access time is achieved. BEDO can
achieve a maximum burst timing of 5-1-1-1 (with 52-ns BEDO and a 66-MHz
bus), saving an additional three clocks over optimally designed EDO
memory.

Synchronous DRAM

Intel's 430VX chip set supports a new type of memory technology
called synchronous DRAM (SDRAM). A key feature of SDRAM is its ability
to synchronize all operations with the processor clock signal. This
makes the implementation of control interfaces easier, and it makes
column (but not row) access time quicker. SDRAM includes an on-chip
burst counter that can be used to increment column addresses for very
fast burst accesses, similar to BEDO's. In addition, SDRAM allows a
new memory access to be initiated before the preceding access has
completed.

SDRAM can achieve a burst timing of 5-1-1-1 with a 66-MHz bus in a
well-designed and well-tuned PC. The SDRAM's burst length and latency
are fully programmable via an on-chip mode register.
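Summing the ideal four-element burst timings quoted in this section makes the relative savings easy to see at a glance (all figures assume a 66-MHz bus):

```python
# Totals for the ideal four-element burst timings quoted in this
# section, on a 66-MHz bus.

timings = {
    "FPM":   (6, 3, 3, 3),
    "EDO":   (6, 2, 2, 2),
    "BEDO":  (5, 1, 1, 1),
    "SDRAM": (5, 1, 1, 1),
}

for name, beats in timings.items():
    print(f"{name:5} {'-'.join(map(str, beats))}  {sum(beats):2} clocks")
```

FPM's 15 clocks shrink to 12 for EDO and 8 for BEDO or SDRAM, so a well-tuned SDRAM or BEDO system moves a four-element burst in roughly half the clocks FPM needs.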

Processor Cache

When we read about cache, we are usually reading about Level 2
cache, or external cache. L2 cache has been the domain of a very fast
and expensive memory type called SRAM (static RAM) that holds data
frequently used by the CPU so that the CPU doesn't have to rely solely
on slower DRAM. Now that faster types of DRAM are available, some
vendors offer cacheless PCs to hit a lower price point. Through our
testing,
however, we've found that the performance levels achieved by cacheless
PCs can't match the performance of a PC with L2 cache.

The simplest form of SRAM uses an asynchronous design, in which the
CPU sends an address to the cache and the cache looks up the address,
then returns the data. An extra cycle is required at the beginning of
each access for the tag lookup. Thus, asynchronous cache's response
time can be as fast as 3-2-2-2 on a 66-MHz bus, although 4-2-2-2 is
much more common.

Synchronous cache buffers incoming addresses to spread the
address-lookup routine over two or more clock cycles. SRAM stores the
requested address in a register during the first clock cycle. During
the second, it retrieves the data and delivers it. Since the address is
stored in the register, synchronous SRAM can then receive the next data
address internally while the CPU is reading the data from the previous
request. Synchronous SRAM can then "burst" subsequent data
elements without receiving and decoding additional addresses from the
chip set. Response time can be reduced--optimally--to a 2-1-1-1 timing
on a 66-MHz bus.
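The same kind of tally applies to the L2 cache timings quoted above, comparing a common asynchronous cache, a fast asynchronous design, and an optimal synchronous burst, all on a 66-MHz bus:

```python
# Totals for the L2 cache read timings quoted in this section.

cache_timings = {
    "async (common)": (4, 2, 2, 2),
    "async (fast)":   (3, 2, 2, 2),
    "sync burst":     (2, 1, 1, 1),
}

for name, beats in cache_timings.items():
    print(f"{name:14} {'-'.join(map(str, beats))}  {sum(beats)} clocks")
```

A synchronous burst linefill completes in 5 clocks versus 10 for a common asynchronous cache, which is why synchronous designs took over at 66-MHz bus speeds.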

Another type of synchronous SRAM is called pipelined burst.
Pipelining essentially adds an output stage that buffers data reads
from the memory locations so that subsequent memory reads are accessed
quickly, without the latency incurred by traveling all the way into the
memory array to get the next data element. Pipelining works most
effectively with sequential access patterns, such as cache
linefills.
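Why the extra output stage pays off can be shown with a toy clock model. The 2-clock array access and 1-clock bus drive below are assumed numbers chosen for the sketch, not figures from the text:

```python
# Toy clock model of why pipelining helps a cache linefill: without
# the output stage, every element pays the full array-access latency;
# with it, the array fetch of element N+1 overlaps driving element N
# onto the bus, so only the first element pays the full latency.

ARRAY_CLOCKS = 2   # assumed: pull an element out of the memory array
BUS_CLOCKS = 1     # assumed: drive a latched element onto the bus

def linefill_clocks(n_elements, pipelined):
    """Clocks to read n sequential elements from the SRAM."""
    if pipelined:
        # First element pays the array latency; the remaining array
        # fetches hide behind the bus drives of earlier elements.
        return ARRAY_CLOCKS + n_elements * BUS_CLOCKS
    # Non-pipelined: each element waits for array access plus drive.
    return n_elements * (ARRAY_CLOCKS + BUS_CLOCKS)

print(linefill_clocks(4, pipelined=False))  # 12 clocks
print(linefill_clocks(4, pipelined=True))   # 6 clocks
```

With random, non-sequential accesses the overlap disappears, which is why pipelining shines on linefills but not on scattered single reads.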

Performance Results

Standardizing our test-bed was the key to consistent results. We
used a Pentium/166 PC, from Dell Computer Corp., with an Intel 430FX
chip set, a Seagate 2GB Fast ATA-2 hard disk, and a Number Nine Imagine
128 graphics card. To ensure a consistent set of test criteria, all of
the memory used for our tests was supplied by a single company,
Kingston Technology Corp. (800-337-8410; http://www.kingston.com).

The most significant change in performance in all the testing
scenarios was the increase in Winstone 32 scores when a machine's EDO
DRAM was raised from 8MB to 16MB. Conversely, there was little or no
change in CPUmark32 scores in the same scenario, because CPUmark32
stresses the speed of the CPU, L2 cache, and memory rather than the
amount of memory installed.

The numbers also show little performance gain in Winstone 32 scores
when comparing fast-page-mode DRAM to EDO with any amount or type of
cache. When there is no L2 cache, however, EDO shows its biggest
performance gain over FPM: 15 percent with 8MB and 18 percent with
16MB.

For a machine with 16MB of EDO DRAM and no L2 cache, a big
performance boost was achieved by adding 256K of pipelined-burst L2
cache: 47 percent on CPUmark32, and 36 percent on Winstone 96. In
general, adding cache or changing from asynchronous to synchronous
pipelined-burst cache will result in a bigger increase in CPU/memory
performance (as measured by CPUmark32) than in overall system
performance (as measured in Winstone 32).