Before this test suite was created, there was no proper software for
measuring vital system parameters (CPU, chipset, RAM) that delivered steady,
reproducible test results and allowed test parameters to be varied over a
wide range. The vital low-level system characteristics include the latency
and real bandwidth of RAM, the average/minimal latency of the different
cache levels and their associativity, the real L1-L2 cache bandwidth, and
the specifications of the TLB levels. Moreover, these aspects receive
insufficient attention in a product's technical documentation (CPU or
chipset). A test suite that combines many subtests for measuring objective
system characteristics is therefore much needed for evaluating a platform's
crucial parameters. The test suite is developed within RightMark, is named
RightMark Memory Analyzer (RMMA), and is available as open source.

System requirements

Minimal system requirements:

Pentium MMX CPU or higher;

32MB RAM available;

Windows 2000 OS and higher.

General settings

This section contains the general settings shared by all subtests
implemented in RMMA.

CPU Clock, CPU Count

The CPU clock rate and the number of logical processors in the system.

Cache Line Size

The effective cache line size, detected automatically when the application
starts (detection takes only a few seconds). This parameter is very important
for obtaining correct results in most of the subtests, which is why its
automatic detection is an integral part of RMMA.

Memory Allocation

Selects the method used to allocate the memory needed for the tests.

Standard - the standard method: a memory block is allocated with malloc()
and VirtualLock() is then applied to the allocated region, which guarantees
that later accesses to this region won't cause page faults.

AWE - this method uses Address Windowing Extensions, available
in Windows 2000/XP/2003 Server. This memory allocation method is more reliable
for some tests, such as cache associativity. AWE requires the Lock
Pages in Memory privilege, which is not granted by default. To obtain
this privilege, take the following steps:

Log on with administrative privileges;

Launch Local Security Policy from Administrative Tools;

Select Security Settings -> Local Policies -> User Rights
Assignment;

Select the Lock pages in memory policy and add a user or group name
(e.g., Administrators);

Log off and log on again to apply the policy.

Data Set Size

The total amount of data read or written when measuring each point (pixel)
of the graph. Every point is measured four times and the minimal result (in
CPU clocks) is taken, which improves repeatability. For example, if memory
bandwidth is measured by reading 1 MB blocks, a Data Set Size of 128 MB
means 4 measurements of 32 read iterations each. A larger Data Set Size
gives a more reliable result (smoother lines) but increases the test time
accordingly.
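The arithmetic in the example above can be sketched as follows. This is only
an illustration of the relation described in the text, not RMMA's code; the
function and constant names are hypothetical.

```python
MB = 1024 * 1024
MEASUREMENTS_PER_POINT = 4  # from the text: every point is measured four times


def iterations_per_measurement(data_set_size, block_size):
    """Read/write iterations performed in one of the four measurements."""
    return data_set_size // (MEASUREMENTS_PER_POINT * block_size)


def point_result(clock_counts):
    """The value kept for a point is the minimum of its measurements."""
    return min(clock_counts)


# 128 MB data set, 1 MB blocks: 4 measurements of 32 iterations each.
print(iterations_per_measurement(128 * MB, 1 * MB))
```

Doubling the Data Set Size doubles the number of iterations per measurement,
which is why the curves get smoother while the test takes longer.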

Thread Lock

In general, every test runs in the main thread, which is given the highest
(realtime) priority to prevent interference from other running processes.
This is sufficient only on uniprocessor systems, though most user systems
are of this kind. On SMP or Hyper-Threading systems the additional
processors can noticeably affect the test scores. This option locks other
processes on SMP systems to increase the precision and reliability of the
measurement data. It is not recommended on Hyper-Threading systems, however,
because the option itself has a strong effect there. The ideal test
condition for Hyper-Threading systems is the minimal possible system load:
all applications, including those with the lowest priority, should be closed.

Logarithmic Y Scale

Switches the Y axis of the graph to a logarithmic scale. The scale is linear
by default.

White Background

Uses a white background instead of the default black one in the graphic
representation of test results, both during test execution and in the
report BMP file. This option makes the test results easier to print.

Create Test Report

This option determines whether a report will be created on completion of
the test. The report includes two files with textual (MS Excel CSV) and
graphic (BMP) representation of test results.

Sequential test execution (Batch)

Tests can be executed sequentially as a batch, which is more convenient, in
particular, for testing a large number of systems with the same test suite.
RMMA supports the following operations with a batch:

In each test you can set user settings or select one of the presets.
Presets are needed for more convenient usage of test options and for comparison
of systems of various classes in the same conditions. Once you select a
preset the test parameters can't be changed.

Benchmark #1: Memory BW

The first benchmark estimates the real bandwidth of the L1/L2/L3
data caches and RAM. It measures the time (in CPU clocks) taken to fully
read, write or copy a data block of a certain size (which can vary or stay
fixed) using one of the CPU register sets (MMX, SSE, or SSE2). For reading
and writing the test also supports several optimizations - Software Prefetch
or Block Prefetch - in order to reach the Maximal Real Read Bandwidth.
The scores are reported in bytes transferred to (or from) the CPU per
clock, as well as in MB/s. The settings of the first benchmark are
described below.
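The relation between the two score units can be sketched like this. It is a
minimal illustration, not RMMA's code: the function names are hypothetical,
and the text does not say whether a megabyte means 10^6 or 2^20 bytes, so
that is left as a parameter (10^6 is an assumption).

```python
def bytes_per_clock(bytes_transferred, cpu_clocks):
    """Score in bytes transferred per CPU clock."""
    return bytes_transferred / cpu_clocks


def megabytes_per_second(bytes_transferred, cpu_clocks, cpu_hz,
                         bytes_per_mb=1_000_000):
    """Same measurement expressed in MB/s for a CPU running at cpu_hz.

    bytes_per_mb is an assumption; pass 1024 * 1024 for binary megabytes.
    """
    return bytes_transferred * cpu_hz / cpu_clocks / bytes_per_mb


# 1048576 bytes read in 262144 clocks on a 2 GHz CPU.
print(bytes_per_clock(1_048_576, 262_144))                      # 4.0
print(megabytes_per_second(1_048_576, 262_144, 2_000_000_000))  # 8000.0
```

The conversion shows why the CPU Clock value in the general settings matters:
the MB/s figure is only as accurate as the detected clock rate.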

Variable Parameter

PF Distance - dependence of the real memory bandwidth on the prefetch
distance in the Software Prefetch method. This mode is intended for reaching
the maximal real read bandwidth and is recommended only for large data
blocks (larger than the overall data cache size);

Block PF Size - dependence of the real memory bandwidth on the block
prefetch size in one of the two Block Prefetch methods. Like PF Distance,
it is recommended only for large data block sizes.

Maximal Block PF

Stride Size

Stride Size, in bytes, used when reading data into the cache in the Block
Prefetch methods (1, 2). For more reliable results this parameter must
match the cache line size. That is why in this and other subtests
it is set to auto-detect, which means that the program detects the cache
line size automatically at launch.

CPU Register Usage

Read Prefetch Type

Read Prefetch Type defines the type of instruction used for Software Prefetch
(PREFETCHNTA/T0/T1/T2); it also enables one of the Block Prefetch modes needed
for measurements with Variable Parameter = Block PF Size. Block Prefetch
1 reads cache lines from memory into a prefetch block of a given size using
MOV instructions and is recommended for the AMD K7 family (Athlon/Athlon
XP/MP). In the Block Prefetch 2 method the data are read with one
of the Software Prefetch instructions (PREFETCHNTA). This method is recommended
by AMD for the K8 family (Opteron/Athlon 64/FX).

Non-Temporal Store

Non-Temporal Store - direct memory access (write combining protocol) for
writes. With this access method data are written to memory without first
reading the old data into the CPU cache hierarchy (i.e. without the write
allocate mechanism). It keeps unneeded data out of the CPU cache, in
particular during copy operations.

Copy-to-Self Mode

The data block is copied onto the same memory region it occupies, i.e. the
memory content doesn't actually change. This option is disabled by default;
in that case the destination is shifted from the source by an offset equal
to the size of the transferred data block. Since in this mode write operations
hit the cache completely, the benchmark tests the memory's ability to
read data after writing (read around write). The cache is utilized to a
greater degree and the benchmark turns out to be much lighter for the
memory subsystem. Note that the Non-Temporal Store and Copy-to-Self modes
are incompatible.

Selected Tests

Selected Tests defines which memory access types are measured.

Read Bandwidth - real memory bandwidth at reading;

Write Bandwidth - real memory bandwidth at writing;

Copy Bandwidth - real memory bandwidth at copying.

Benchmark #2: Latency/Associativity of L1/L2 Data Cache (D-Cache Lat)

The second benchmark estimates the average/minimal latency of the L1/L2
data caches and memory, the L2 cache line size, and the associativity of the
L1/L2 data caches. Its parameters and operating modes are described below.

Variable Parameter

There are 4 types of this test:

Block Size - dependence of cache/memory latency on the block
size. This test mode shows the latency of the different memory levels - L1,
L2, L3 (if present) caches or RAM. A dependent access chain is created
in the allocated memory, with each element holding the address of the
next one. During one full read iteration every chain element is accessed
exactly once. The number of chain elements equals the block size divided
by the Stride Size (see below). If the stride size matches the cache line
length, the block size reflects the real amount of data read (because data
are transferred from RAM to L2 or from L2 to L1 line by line). Block sizes
less than or equal to the L1 cache size estimate the load-use latency of
the L1 cache; block sizes in the range (L1..L1+L2) or (L1..L2), depending
on the cache architecture (exclusive or inclusive), estimate the L2 cache
latency; and finally (since an L3 cache is rarely used) block sizes greater
than L1+L2 estimate the latency of RAM. The order in which the chain
elements are read depends on the test method (see below). The Forward Read
Latency method starts from the first element and goes through all of them
to the last one, which holds the address of the first element so that the
walk can be repeated many times. In Backward Read Latency the first element
holds the address of the last one, and reading goes from the last element
to the first. Finally, the Random Read Latency test visits the elements in
random order, still accessing every element exactly once.
Below you can see the principle of forward reading of a chain of
8 elements.
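A dependent access chain of this kind can be modelled with a small sketch.
This is an illustration under stated assumptions, not RMMA's implementation:
elements are represented by indices rather than real addresses (element i
would sit at byte offset i * stride), and the names build_chain and walk are
hypothetical.

```python
import random


def build_chain(block_size, stride, order="forward", seed=0):
    """Return (next_index, start) for a dependent access chain.

    next_index[i] is the element read after element i. The chain is closed,
    so the walk can be repeated any number of times.
    """
    n = block_size // stride           # number of chain elements
    visit = list(range(n))             # forward: 0, 1, ..., n-1
    if order == "backward":
        visit.reverse()                # n-1, ..., 1, 0
    elif order == "random":
        random.Random(seed).shuffle(visit)
    next_index = [0] * n
    for pos, elem in enumerate(visit):
        # The last visited element points back to the first visited one.
        next_index[elem] = visit[(pos + 1) % n]
    return next_index, visit[0]


def walk(next_index, start):
    """Follow the chain for one full iteration; returns the visit order."""
    order, i = [], start
    for _ in range(len(next_index)):
        order.append(i)
        i = next_index[i]              # each load depends on the previous one
    return order


# 8 elements: a 512-byte block with a 64-byte stride.
chain, start = build_chain(512, 64, "forward")
print(chain)   # [1, 2, 3, 4, 5, 6, 7, 0]
```

Because every load address comes from the previous load, the CPU cannot
overlap the accesses, so the time per element approximates the latency of
the level the block fits into.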

Stride Size - dependence of cache latency on the stride size. This
test mode makes sense only for block sizes that fit into the L2 cache
and allows estimating its line length. This is not the only way to estimate
a cache line size: RMMA contains three such methods, and the others are
described below.

Chains Count - dependence of cache latency on the number of sequential
dependent access chains. It estimates the associativity of the L1/L2 data
caches. The number of chains is a notional concept, because in reality
there is only one dependent access chain, which is executed several times.
The only difference between the multi-chain version and the single-chain
one is that in the former data are read from different memory regions
(their number equals the number of "chains"), and the offset between
them is a multiple of the cache segment size. Have a look at the forward
reading of an array that contains 4 chains.

To estimate such an important processor cache parameter as associativity,
you should gradually increase the number of dependent access chains
while keeping the block size minimal. This shows how easy it is to defeat
a CPU cache: there is no need to fill it with data.
To make a "breach" in an n-way set associative cache, you just
have to read n cache lines at addresses whose offsets are multiples
of the cache segment size. This is what the test does. For example, to
defeat the 512K L2 cache of the Pentium 4, with an associativity
of n = 8 and a 64-byte line, one has to read only 8 x 64 =
512 bytes, i.e. less than 0.1% of its size(!). The
minimal cache segment size in the current test version is 1MB.
Such a large value guarantees that the test will correctly determine the
L2/L3 cache associativity even in systems with a large cache (note that
a cache segment of 1MB corresponds to an 8MB L3 data cache with
an associativity of 8).
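The numbers in this example can be checked with a short sketch. It assumes
the usual set-associative mapping, in which the set is selected by the line
address modulo the number of sets; the function names are hypothetical and
this is not RMMA code.

```python
KB = 1024
MB = 1024 * KB


def cache_segment_size(cache_size, ways):
    """Size of one way ("segment") of an n-way set associative cache."""
    return cache_size // ways


def set_index(address, cache_size, ways, line_size):
    """Set that an address maps to (assumed modulo mapping)."""
    sets = cache_size // (ways * line_size)
    return (address // line_size) % sets


def fraction_touched(cache_size, ways, line_size):
    """Share of the cache occupied by n lines that fill one set."""
    return ways * line_size / cache_size


# Pentium 4 L2 from the text: 512K, 8-way, 64-byte lines.
offsets = [k * 1 * MB for k in range(8)]          # chains 1 MB apart
print({set_index(a, 512 * KB, 8, 64) for a in offsets})   # {0}: one set
print(fraction_touched(512 * KB, 8, 64))                  # 0.0009765625
print(cache_segment_size(8 * MB, 8) == 1 * MB)            # True
```

All eight addresses land in the same set, so 512 bytes of data are enough to
fill that set completely, and any further chain at the same offset has to
evict one of them.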

NOP Count - dependence of the latency of the selected memory level
(L2 cache or memory) on the number of no-op instructions inserted between
two successive accesses to it. These operations, called NOPs,
are not related to the cache access, but they introduce a fixed time gap
between two successive accesses to different cache/memory lines. This
unloads the data bus between L1-L2 or L2-RAM so that the latency of
accessing the selected level is as low as possible. In the current RMMA
version the NOP is the x86 ALU instruction or eax, edx (eax holds the
chain element address and edx is initialized to 0); this instruction
works well for testing many modern processors.

Maximal Block Size, KB

Minimal NOP Count

Maximal NOP Count

Maximal NOP Count in case of Variable Parameter = NOP Count.

Minimal Chains Count

Minimal Chains Count - a minimal number of successive dependent access
chains in case of Variable Parameter = Chains Count; the number of successive
dependent access chains in other cases. The offset of every such dependent
access chain from its neighbors is equal to the value which is a multiple
of the maximum possible cache segment size.

Maximal Stride Size

Latency Measurement

Latency Measurement technique (this parameter can be configured only if
Variable Parameter = NOP Count). Method 1 uses an ordinary dependent
access chain with a varying number of NOPs (see above; edx = 0)
to determine the minimal latency:

Nevertheless, in some cases (if speculative loading works effectively)
the minimal cache latency may not be reached. For such cases RMMA offers
an alternative (Method 2), which uses a different chain
read code (ebx = edx = 0):

Benchmark #3: Real L1/L2 Data Cache Bus Bandwidth (D-Cache BW)

This benchmark estimates the real L1-L2 cache bus bandwidth (or the L2-RAM
bus bandwidth). It is the simplest RMMA test to configure.
It is based on the method used in the real L1/L2/RAM bandwidth test
(Benchmark #1), but here memory read/write operations are carried
out line by line, i.e. with a stride equal to the cache line length, and
with the CPU's ALU registers. Both forward and backward access modes are
supported. Test parameters:

Variable Parameter

Stride Size - dependence of a real L1-L2 or L2-RAM bus bandwidth
on a stride size. This mode is the second way to calculate a cache line
length.

Minimal Block Size, KB

Minimal Block Size, KB, in case of Variable Parameter = Block Size; Block
Size in other cases. A value below 1.5 times the L1 cache size yields
meaningless results. This test doesn't estimate the bandwidth of the
L1-LSU-registers path because data are not loaded from L1 into the LSU
(Load-Store Unit) and then into CPU registers line by line. To estimate the
L1-LSU bandwidth it is better to run the first test (Memory BW) with block
sizes that fit into the L1 cache.

Maximal Block Size, KB

A value lower than the L2 cache size (inclusive cache architecture) or
L1+L2 (exclusive cache architecture) allows estimating the real L1-L2 bus
bandwidth. For Block Size values above L1+L2 this benchmark estimates the
maximal real memory bandwidth when reading/writing full cache lines, which
in some cases turns out to be greater than the maximal real bandwidth
obtained when all the data are read/written.

Benchmark #4: L1/L2 (D-Cache) Arrival

The fourth benchmark estimates implementation features of the L1-L2 bus
(width, multiplexing) for some processors with an exclusive cache
architecture, in particular AMD K7/K8 processors. The test measures the
total latency of two accesses to the same cache line which are separated
by a certain offset. The measurement method is identical to the one in Method
#2 except that two consecutive chain elements are located in the
same cache line.

Besides, the fourth test can be used to calculate the L2 cache line
size (this is the third way in RMMA to estimate it, and it's used for its
estimation at the program startup). The fourth test parameters are as follows:

Variable Parameter

Variable Parameter defines one of five test types:

Block Size - dependence of the total latency on the
block size.

NOP Count - dependence of the total latency on the number of
NOPs between successive accesses to different cache lines.

SyncNOP Count - dependence of the total latency on the number
of NOPs between successive accesses to the same cache line.

1st DW Offset - dependence of the total latency on the first
word offset within the cache line.

2nd DW Offset - dependence of the total latency on the second
word offset within the cache line.

Minimal Block Size, KB

Maximal Block Size, KB

Maximal Block Size, KB, in case of Variable Parameter = Block Size.

Minimal NOP Count

Minimal NOP Count defines the minimal number of NOPs between two successive
accesses to adjacent cache lines in case of Variable Parameter = NOP Count;
the number of NOPs between two successive accesses to adjacent cache lines
in other cases.

Maximal NOP Count

Maximal NOP Count defines the maximal number of NOPs between two successive
accesses to adjacent cache lines in case of Variable Parameter = NOP Count.

Minimal SyncNOP Count

Minimal SyncNOP Count defines the minimal number of NOPs between two successive
accesses to the same cache line in case of Variable Parameter = SyncNOP
Count; the number of NOPs between two successive accesses to the same cache
line in other cases.

Maximal SyncNOP Count

Maximal SyncNOP Count defines the maximal number of NOPs between two successive
accesses to the same cache line in case of Variable Parameter = SyncNOP
Count.

Stride Size

Minimal Stride Size, in bytes, in the dependent access chain between two
successive accesses to consecutive cache lines.

Minimal 1st Dword Offset

Minimal 1st Dword Offset within the cache line, in bytes, in case of Variable
Parameter = 1st DW Offset; 1st Dword Offset within the cache line in other
cases.

Maximal 2nd Dword Offset

Selected Tests

Selected Tests define a way of testing the latency of the double access.

Forward Two-Dword Read Latency;

Backward Two-Dword Read Latency;

Random Two-Dword Read Latency.

Benchmark #5: Data Translation Lookaside Buffer Test (D-TLB)

The fifth test determines the size and associativity of the Translation
Lookaside Buffer (L1/L2 D-TLB). It measures the latency of accessing
the L1 cache when every next cache line is loaded from the next
memory page (not the same one).

(The memory page size in real operating systems is much
greater (e.g. 4096 bytes) than in our scheme, where a page holds only
4 cache lines).
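The access pattern described above can be sketched as a list of byte
offsets. The exact layout RMMA uses is not given in the text; this sketch
assumes that each element is shifted by one page plus one cache line, so
that consecutive elements fall into different pages and different cache
lines. The function name is hypothetical.

```python
def tlb_chain_offsets(pages, page_size=4096, line_size=64):
    """Byte offsets of chain elements, one element per memory page.

    Assumed layout: element k sits at k * (page_size + line_size), i.e. in
    the next page and one cache line further into it than element k - 1.
    """
    return [k * (page_size + line_size) for k in range(pages)]


# The simplified scheme from the text: a page that holds only 4 cache lines.
offsets = tlb_chain_offsets(4, page_size=256, line_size=64)
print(offsets)                                  # [0, 320, 640, 960]
print([o // 256 for o in offsets])              # pages:  [0, 1, 2, 3]
print([(o % 256) // 64 for o in offsets])       # lines:  [0, 1, 2, 3]
```

With this layout each access needs its own page translation, while the data
themselves occupy only as many cache lines as there are pages.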

So, if the number of pages used is less than the TLB size, the test measures
the L1 cache's own latency (TLB hit). Otherwise, it measures the L1 cache
latency in case of a TLB miss. Note that Maximal TLB Entries mustn't be
greater than the number of L1 cache lines, otherwise the graph will show a
jump caused by the transition from L1 to L2 rather than by the D-TLB
structure size. In practice, however, the overall size of the TLB levels is
always less than the number of cache lines that fit into the L1 cache. Test
settings:

Variable Parameter

Variable Parameter defines one of two test modes.

TLB Entries - dependence of latency when accessing the
L1 cache on the number of memory pages used.

Chains Count - dependence of latency when accessing the L1 cache
on the number of sequential access chains at a given number of pages used
for estimation of the associative level of each D-TLB level. The principle
of chain formation is identical to the one used in the latency test (Benchmark
#2), but in this case the value equal to the stride size when accessing
every next element (cache line size) is added to the offset between the
chains. Here you can see reading of four TLB elements in case of two access
"chains".

Maximal Chains Count

Selected Tests

Selected Tests define the ways of testing.

Forward Access;

Backward Access;

Random Access.

Benchmark #6: Instruction Cache Test (I-Cache)

The sixth test estimates the efficiency of decoding/execution of certain
simple CPU instructions (ALU/FPU/MMX), as well as the efficiency of the
L1 instruction cache and its associativity. This test is of special
interest for estimating the effective Trace Cache size of Pentium 4 processors
when decoding/executing various instructions. Test parameters:

Variable Parameter

Defines one of three types of this benchmark:

Block Size - dependence of decode bandwidth on the code
block size (Decode Bandwidth is the rate at which the CPU reads, decodes
and executes a sequence of instructions). The test method consists of
creating a code block of a certain size on the fly (at runtime) and
measuring the CPU clocks taken for its execution. The
last instruction in the code block is always the return instruction
(RET).

Chains Count - dependence of decode bandwidth on the number of
sequential access chains. As in Benchmark #2, this allows estimating the
associativity of the L1 instruction cache. Transitions between neighboring
access chains, which correspond to different cache segments, are carried
out with an unconditional jmp instruction. Below you can see the code
execution graph (red arrows) in case of two chains (transition operations
are marked with green arrows).

Prefixes Count - dependence of decode bandwidth for
[pref]nNOP instructions on the number of prefixes used (pref
= 0x66, operand-size override prefix).

Minimal Prefixes Count, Maximal Prefixes Count

Stride Size

Stride Size is the minimal size of the code executed in this chain which
includes transition to the neighboring chain. It's recommended that the
stride size is equal to the instructions cache line size.

The first comparison tests of various platforms (AMD K7/K8, Intel
Pentium 4, Intel Pentium III / Pentium M) revealed the weak points of
RightMark Memory Analyzer 2.4. They are addressed in the new version of
the test suite and highlighted in this appendix.

General Test Settings

The key change in this section is the automatic detection of both the
L1 and L2 Cache Line Sizes. Earlier only the L1 cache
line size was displayed, though both were measured. Both in theory
and in practice the effective L1 and L2 cache lines can have different
sizes on some processors. Take the Intel Pentium 4: its effective
L2 cache line size is 128 bytes (such a line is called dual-sector and
consists of two 64-byte lines), which means that 128 bytes are transferred
at a time over the L2-RAM bus (or L2-L3, L3-RAM on the Pentium 4 XE).

That is why the D-Cache Bandwidth test underestimated the real
bandwidth of the L2-RAM bus (L2-L3, L3-RAM)
if the stride size was set equal to the automatically detected line
size (L1 cache). The stride size had to be set manually to get objective
scores. This problem is now solved: you can select a Minimal
Stride Size equal to the line size of the L1 or L2 cache (see
the screenshot above).

The general test settings got one new parameter, Active CPU Index,
which selects the number of the active CPU (physical or logical)
on which the main thread runs. This option is of no use on ordinary SMP
(and HT) systems, but it can be used to study performance
differences when the CPU accesses its "native" or "alien" memory
in systems with a separate memory architecture (for example,
in dual-processor AMD K8 platforms each CPU has its own memory).

The last change is that there are now three Memory Allocation methods:
Standard, VirtualLock and AWE. The first one uses malloc() and is not
recommended for ordinary platform tests. It is used mostly for testing
operation under memory managers other than the standard Windows memory
manager which have certain advantages (for example, support of large 4MB
memory pages). AWE is recommended for obtaining the most reliable results,
which is why this method is the default.

Test 1: Memory Bandwidth

The changes are made mostly in the memory read/write/copy procedures
in all access modes (MMX, SSE, SSE2), including the methods using
Software Prefetch. A large loop unrolling factor optimizes the operation
of these procedures for small block sizes (within the L1 d-cache)
and increases the real bandwidth measured for this cache level.

Test 2: D-Cache Latency

The new mode here is Pseudo-Random Read Latency, which yields a lower
latency than fully random memory access. In this mode the dependent
access chain is walked randomly within every memory page, but the
memory pages are walked in the forward direction.

The first property minimizes Hardware Prefetch interference, and
the second almost completely prevents D-TLB misses. That is why the
pseudo-random latency is much lower than the random latency (which
is accompanied by a great number of D-TLB misses) and can be considered
the objective memory latency parameter.
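The pseudo-random order can be sketched as follows. This is an illustration
of the idea described above (random within a page, forward across pages),
not RMMA's code; the function name, the default page size and the use of
element indices instead of addresses are assumptions.

```python
import random


def pseudo_random_order(block_size, page_size=4096, stride=64, seed=0):
    """Visit order of chain elements: shuffled inside each page, while the
    pages themselves are visited one after another in forward order."""
    rng = random.Random(seed)
    per_page = page_size // stride             # chain elements per page
    order = []
    for page in range(block_size // page_size):
        lines = list(range(per_page))
        rng.shuffle(lines)                     # random order within the page
        order.extend(page * per_page + line for line in lines)
    return order


order = pseudo_random_order(4 * 4096)          # 4 pages, 64 elements each
pages_visited = [i // 64 for i in order]
print(pages_visited == sorted(pages_visited))  # True: pages go forward
print(sorted(order) == list(range(256)))       # True: each element once
```

Each page translation is reused for all elements of the page before the walk
moves on, which is why D-TLB misses almost disappear, while the shuffled
order inside a page gives the hardware prefetcher no regular pattern to
follow.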

Another change in this test is the altered procedure for plotting
the dependence of the latency of the selected memory subsystem level (L1/L2
cache/RAM) on the walk size (Variable Parameter = Walk Size).
In the previous RMMA version the real number of walks over the dependent
access chain was calculated as the block size divided by the walk
size (which was variable), so the number of walks decreased as
the walk size grew. In the new test version the number of walks is
fixed, irrespective of the walk size (whose limits are defined
by Minimal Walk Size and Maximal Walk Size). It is calculated
as the block size divided by the Stride Size, which by default
equals the L1 line size.