The benchmark tends to perform slightly better when compiled with GCC.

2 Benchmark Information

The 557.xz_r benchmark is based on the Tukaani Project’s xz, an application that compresses and
decompresses files using the LZMA2 compression algorithm.

2.1 Inputs

See the 557.xz_r documentation for a description of the input format.

The inputs are stored as compressed archives. Therefore, decompression is actually performed
twice: once when loading the file, and a second time after the data has been repeated in
memory. As a result, the fewer times a file is repeated, the more the benchmark results will
be skewed toward decompression performance. However, the LZMA compression and
decompression algorithms are fairly similar, so this skew may not change results
significantly.

2.2 Generating Workloads

There are a few steps to generate new workloads for the 557.xz_r benchmark:

1. Create a directory with your workload’s name. In this directory, create two directories
named input and output.

2. Compress each of the desired input files as separate xz archives and place them in the
input directory.

3. In the input directory, create a text file named control.

4. For each input archive, add a line to the control file listing the parameters as described in
the 557.xz_r documentation’s Input Description.

5. Add a compiled version of the 557.xz_r benchmark to your PATH, then cd to the input
directory. For each input, run the following command:

The benchmark will not run without dummy reftime and refpower files in your workload’s
root directory. Create these files, and in them, simply write your workload’s name followed by a
newline, a 2 and then another newline.
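As a rough sketch, the directory layout and dummy timing files described above can be created as follows. The workload name myload is a placeholder, and compressing the real input files with xz is not shown, since the input files are user-supplied:

```python
from pathlib import Path

# Workload skeleton as described above ("myload" is a placeholder name).
root = Path("myload")
(root / "input").mkdir(parents=True, exist_ok=True)
(root / "output").mkdir(parents=True, exist_ok=True)

# One line per input archive goes here, per the Input Description:
(root / "input" / "control").touch()

# Dummy timing files: the workload name, a newline, "2", another newline.
for name in ("reftime", "refpower"):
    (root / name).write_text("myload\n2\n")
```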

3 LZMA Compression

To understand how workloads were generated for the 557.xz_r benchmark, a brief introduction to
the LZMA encoding algorithm is necessary.

LZMA uses a method called “sliding-window compression” that maintains a buffer split into two
parts: the dictionary and the look-ahead buffer (shown in Figure 1). The encoding algorithm searches
the dictionary for the longest possible match for the data currently in the look-ahead buffer. When a
match is found, its length, position relative to the start of the dictionary, and following
byte are written to the output stream. The dictionary and look-ahead buffers are then
moved ahead by N characters where N is the length of the match. 557.xz_r uses a more
complicated encoding scheme, but the general idea of how the two buffers interact is the
same.

Figure 1: The data layout of a stream undergoing compression.
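The sliding-window scheme just described can be sketched in Python. This is a simplified LZ77-style toy encoder, not LZMA’s actual algorithm: it emits (offset, length, next-byte) triples and searches the dictionary by brute force:

```python
# A simplified sliding-window encoder in the spirit of the description above.
# Literals are emitted as (0, 0, byte) triples.

def lz77_encode(data: bytes, window: int = 14, lookahead: int = 8):
    tokens = []
    i = 0
    while i < len(data):
        start = max(0, i - window)
        best_len, best_off = 0, 0
        # Search the dictionary for the longest match with the look-ahead buffer.
        for j in range(start, i):
            length = 0
            while (length < lookahead and i + length < len(data) - 1
                   and data[j + length] == data[i + length]):
                length += 1
            if length > best_len:
                best_len, best_off = length, i - j
        nxt = data[i + best_len]
        tokens.append((best_off, best_len, nxt))
        i += best_len + 1  # advance past the match plus the literal byte
    return tokens

def lz77_decode(tokens):
    out = bytearray()
    for off, length, nxt in tokens:
        for _ in range(length):
            out.append(out[-off])  # copy from the dictionary (may overlap)
        out.append(nxt)
    return bytes(out)
```

Running the encoder on b"SCRUTINIZING" repeated several times, with the 14-byte window and 8-byte look-ahead used in the figures, yields maximum-length matches recurring at the same 12-byte offset, mirroring the predictable behaviour described above.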

Consider an encoder using a dictionary size of 14 bytes and a lookahead buffer size of 8 bytes. We
provide the benchmark with a 12-byte file containing the word “SCRUTINIZING”, repeated to 1 MiB.
Then a pattern of maximum length can easily be matched as in Figure 3a. The buffers are then
moved ahead by 8 bytes, and a match is found again at the same offset in Figure 3b. As a result, the
program’s behaviour is very predictable, as matches always occur at the same position, and very little
time is spent searching the dictionary.

Figure 2: With a dictionary larger than the repeated file, maximum-length matches are
consistently found at the same position.

However, if the dictionary is smaller than the size of the repeated file, matches are much harder to
find. For example, in Figure 4, the same settings are used, but we instead use the 16-byte string
“INCOMPREHENSIBLE.” As the full string no longer fits into the dictionary, we are
no longer able to find a match. Thus, rather than skipping large portions of the file due
to matches, we have to encode each character as a literal and move ahead one byte at a
time, searching the entire dictionary for each byte but never finding a match longer than 1
character.

Figure 4: With a file larger than the dictionary, no matches are found.

As well, if there are small patterns within the larger file, then matches will sometimes occur and
sometimes not, leading to unpredictable behaviour. Consider Figure 5, which uses the 17-byte string
“AUTOMATEABDICATE.” At first, there is a short 3-byte match, which is encoded. However, after
moving ahead by 3 bytes, a match can no longer be found. In practice, the dictionary, look-ahead
buffer and file will be much larger, making this kind of short pattern matching much more likely as
there is much more data in which to find matches. As such, the program is constantly switching back
and forth between encoding literals and matching short patterns which could occur at any point in the
dictionary.

In summary, if data is to be repeated, then the size of the input in comparison to the size of the
dictionary can affect performance greatly. It is therefore important to keep in mind the size of the
dictionary for each of the compression levels available to the 557.xz_r benchmark (shown in
Figure 7).

Figure 5: Sub-string matches in large files lead to unpredictable behaviour.

(a) A short match is found. (b) After moving ahead 3 bytes, no match is found.

Figure 7: Dictionary sizes for each compression level.

Level    Size (MiB)
1        1
2        2
3        4
4        4
5        8
6        8
7        16
8        32
9        64
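The effect of dictionary size is easy to reproduce with Python’s lzma module, which exposes dict_size through custom filter chains. The data here is a scaled-down stand-in for the benchmark’s inputs, not the actual workload data:

```python
import hashlib
import lzma

# A 16 KiB pseudo-random (effectively incompressible) block, repeated in
# memory 16 times, as a miniature analogue of a repeated input file.
block = b"".join(hashlib.sha256(i.to_bytes(2, "big")).digest()
                 for i in range(512))
data = block * 16  # 256 KiB total

def xz_size(payload: bytes, dict_size: int) -> int:
    """Compress with LZMA2 at the given dictionary size; return output length."""
    filters = [{"id": lzma.FILTER_LZMA2, "dict_size": dict_size}]
    return len(lzma.compress(payload, format=lzma.FORMAT_XZ, filters=filters))

small_dict = xz_size(data, 4 * 1024)     # dictionary smaller than the block
large_dict = xz_size(data, 1024 * 1024)  # dictionary larger than all the data
```

With the 4 KiB dictionary the repeats of the 16 KiB block are never found, so the output stays near the input size; with the 1 MiB dictionary every repeat is matched and the output collapses to roughly one block.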

4 New Workloads

We sorted files by three characteristics:

Compressibility

Inputs are ranked as either highly compressible or not very compressible. Files with more
repeating patterns tend to be more compressible.

Size

Inputs are ranked as either larger than the dictionary or smaller than (or equal to the size
of) the dictionary.

Repetition

Inputs are either unrepeated, or are repeated 8 times in memory.

We then created 8 workloads based on these categories. Each workload contains several files of
different types, sizes and compression levels. The file types were chosen to represent commonly-used
data formats.

In order to ensure that the files in the small workloads are always smaller than or
equal to the size of the dictionary while also maintaining a good mix of compression levels
(which varies the dictionary size), the sizes of the small workloads are inherently much
smaller than those of the big workloads. However, the workloads are constructed such
that the compressible and incompressible workloads of the same size (i.e. small or big)
are directly comparable, as their input files correspond both in file size and compression
level.

The new workloads are as follows. Note the naming convention: comp and xcomp are respectively
the compressible and incompressible workloads; big and small are respectively the workloads with
inputs larger than and smaller than the dictionary; and workloads with a -rep suffix are the ones
repeated 8 times in memory.

comp-big

A collection of files which are highly compressible and larger than the dictionary.

enronsent-a.txt

A 10 MiB subset of data from the EnronSent Corpus, a collection of emails which were
put in the public domain while the Enron corporation was under investigation by the US
government. It was chosen to represent a large amount of natural text.
Compression level: 6 (8 MiB dictionary)

enronsent-b.txt

Another 10 MiB subset of data from the EnronSent Corpus, but with a different
compression level.
Compression level: 3 (4 MiB dictionary)

data.xml

An XML file generated by the XMLgen data generation tool, truncated to 40 MiB. It
was chosen as an example of a large XML file.
Compression level: 8e (32 MiB dictionary)

nebula.bmp

A large bitmap image from NASA’s public domain collection, truncated to 20 MiB. It
was chosen as an example of a common uncompressed image format.
Compression level: 5 (8 MiB dictionary)

comp-small

A collection of files which are highly compressible and smaller than the dictionary.

code.c

All of the .c and .h files found in snapshots of two public domain projects (High Level
Assembler and SQLite), concatenated into one file and truncated to 6 MiB. Chosen as an
example of highly compressible code.
Compression level: 6 (8 MiB dictionary)

allbooksa.xml

All of Shakespeare’s works in XML format, truncated to 4 MiB. As with data.xml in the
comp-big workload, this was chosen as an example of a large XML file.
Compression level: 3 (4 MiB dictionary)

A large bitmap image from NASA’s public domain collection, truncated to 15 MiB.
As with nebula.bmp in the comp-big workload, this was chosen as an example of a common
uncompressed image format.
Compression level: 9e (64 MiB dictionary)

xcomp-big

A collection of files which are not very compressible and larger than the dictionary.

music-a.mp3

Most of the music from Kevin MacLeod’s Creative Commons-licensed “Light” collection,
concatenated into one file. This is the first 20 MiB of the resulting file. It was chosen to
represent a large amount of music in a common format.
Compression level: 6 (8 MiB dictionary)

music-b.mp3

A different 20 MiB of data taken from the same concatenated file as music-a.mp3, but
with a different compression level.
Compression level: 3 (4 MiB dictionary)

sintel.mkv

A truncated 40 MiB section of the Blender Foundation’s open movie project, Sintel,
encoded in an MKV container using the H.264 standard. It was chosen as an example of a
video file which is already at the limits of compression.
Compression level: 8e (32 MiB dictionary)

space.jpg

Many JPEG images from NASA’s public domain collection concatenated into one file,
then truncated to 10 MiB. This was chosen to represent a large amount of image data
which is already at the limits of compression.
Compression level: 5 (8 MiB dictionary)

xcomp-small

A collection of files which are not very compressible and smaller than the dictionary.

tales.mp3

A song from Kevin MacLeod’s collection, truncated to 6 MiB. Chosen to represent a
small amount of not very compressible music data.
Compression level: 6 (8 MiB dictionary)

A different truncated 12 MiB section of the Blender Foundation’s open movie project,
Sintel, but with a different compression level.
Compression level: 7 (16 MiB dictionary)

nasa.jpg

Many JPEG images from NASA’s public domain collection concatenated into one file, then
truncated to 15 MiB. It was chosen to represent a large amount of not very compressible
image data.
Compression level: 9e (64 MiB dictionary)

comp-big-rep

Identical to the comp-big workload, but repeated 8 times in memory.

comp-small-rep

Identical to the comp-small workload, but repeated 8 times in memory.

xcomp-big-rep

Identical to the xcomp-big workload, but repeated 8 times in memory.

xcomp-small-rep

Identical to the xcomp-small workload, but repeated 8 times in memory.

5 Analysis

This section presents an analysis of the workloads for the 557.xz_r benchmark and their
effects on its performance. All of the data in this section was measured using the Linux perf
utility and represents the mean value from three runs. This section addresses the following
questions:

How much work does the benchmark do for each workload?

Which parts of the benchmark code are covered by each workload?

How is the benchmark’s behaviour affected by the compressibility of its input data?

How is the benchmark’s behaviour affected by data repetition and dictionary size?

How is the benchmark’s behaviour affected by different compilers and optimization levels?

5.1 Work

The simplest metric that can be used to compare the effect of the various workloads on the behaviour
of a benchmark is variation in execution time. To this end, the benchmark was compiled with GCC
4.8.4 at optimization level -O3, and the execution time was measured for each workload on machines
with Intel Core i7-2600 processors at 3.4 GHz and 8 GiB of memory running Ubuntu 14.04 LTS on
Linux kernel 3.16.0-76-generic. Figure 8 shows the mean execution time for each of the
workloads.

A clear pattern emerges for the new workloads: for each workload, the incompressible (xcomp)
variant always takes less time. Recall that the compressible and incompressible variants of a workload
have inputs with the same file sizes and compression levels. Thus, the change in execution time can be
entirely attributed to the compressibility of the data, so we can claim that compression tends to take
longer on highly compressible data.

Figure 9 shows the mean instruction count for each of the workloads, while Figure 10 shows the
mean cycles per instruction. While the mean instruction count and time elapsed appear to be closely
related, Figure 10 reveals that there are considerable differences in the number of cycles per
instruction across each of the workloads, which suggests that the behaviour of the benchmark varies at
least somewhat under each of these workloads.

Figure 8: The mean execution time from three
runs for each workload (in logarithmic scale).

Figure 9: The mean instruction count from
three runs for each workload (in logarithmic
scale).

Figure 10: The mean clock cycles per
instruction from three runs for each workload.

5.2 Coverage

This section will analyze which parts of the benchmark are exercised by each of the workloads. To this
end, we calculated the percentage of execution time the benchmark spends on several of the most
time-consuming functions. The full data can be found in the Appendix. It was recorded on a machine
with an Intel Core i7-2600 processor at 3.4 GHz and 8 GiB of memory running Ubuntu 14.04 LTS on
Linux kernel 3.16.0-76-generic.

The six functions covered in this section are all the functions which made up more than 5% of
the execution time for at least one of the workloads. A brief explanation of each function
follows:

bt_find_func

Finds pattern matches in the dictionary buffer, including detecting when patterns contain
sub-patterns.

bt_skip_func

Skips the remaining dictionary data when a match of maximum length is found.

lzma_decode

Sets up and executes the main decoding loop.

lzma_lzma_encode

Sets up and executes the main encoding loop, making calls to lzma_lzma_optimum_normal as necessary.

lzma_lzma_optimum_normal

Handles the LZMA encoding logic, such as choosing whether to use find or skip and
passing matched data to the encoding function.

lzma_mf_bt4_find

A wrapper for the bt_find_func function which is specifically designed for handling the
type of binary trees used at high compression levels.

lzma_mf_bt4_skip

Like lzma_mf_bt4_find, but a wrapper for bt_skip_func.

To help make this large amount of data easier to compare, Figure 11 condenses it into a stacked
bar chart. Figure 12 is similar, but instead of scaling all of the bars as percentages, it shows the total
time spent on each symbol. Finally, Figure 13 shows the same information as Figure 12,
but excludes all workloads with an execution time greater than 100 seconds so that the
smaller workloads can be more easily compared. The rest of this section will analyze these
results.

Generally speaking, the comp workloads spend a large amount of time on the bt_find_func
function, which is reasonable given that compressible inputs, which contain complex sets of patterns,
are likely to spend more time finding good matches. Similarly, the lzma_lzma_optimum_normal
function also takes up a large amount of execution time for compressible workloads. Another popular
function for compressible workloads is bt_skip_func – due to the high match rate of compressible
data, it is unsurprising that this function is called frequently.

On the other hand, the xcomp workloads tend to spend less time finding patterns and more time
encoding data with the lzma_lzma_encode function. This is due to the fact that most of the bytes in
an incompressible file will have to be separately encoded as they do not match any data in the
dictionary, and so many calls are made to the encoding function. The lzma_mf_bt4_find
function also becomes more prevalent, which suggests that the use of binary trees may be less
efficient for data with few patterns. Also noteworthy is that the lzma_lzma_optimum_normal
function causes far less overhead for the xcomp workloads, possibly due to the fact that
optimal compression is more easily achieved with data that can barely be compressed.
Finally, the lzma_decode function takes a larger amount of time in comparison to the other
functions. This can be attributed to two factors: firstly, the xcomp workloads take less time to
compress, and so decoding takes comparatively longer; and secondly, it may take longer to
decompress a large number of short patterns than it does to decompress a small number of large
patterns.

Comparing the repeated benchmarks to their unrepeated equivalents, there is very little change for
the big workloads. However, for the small workloads, the behaviour changes completely,
with the repeated benchmarks spending a huge portion of their time in the bt_skip_func
function. As explained in Section 3, this is because maximum-length patterns can be found
very easily in the small-rep workloads and so most of the data in the dictionary can be
skipped. In contrast, the unrepeated small workloads behave very similarly to their big
equivalents.

The train workload’s top functions are somewhat similar to those of the comp-small workload,
though it spends more time in the lzma_mf_bt4_find and lzma_lzma_encode functions and less time
in the bt_skip_func function. The refrate workload instead appears to be fairly similar in
behaviour to the comp-big-rep workload aside from a larger amount of time spent in
bt_skip_func.

Figure 11: The mean (of 3 runs) percentage of execution time spent on each of the functions
listed in Figure 23.

Figure 12: The mean (of 3 runs) execution time in seconds spent on each of the functions listed
in Figure 23.

Figure 13: The mean (of 3 runs) percentage of execution time spent on each of the functions
listed in Figure 23, excluding workloads with execution times above 100 seconds.

5.3 Compressibility

As mentioned in section 5.1, compression tends to take longer for highly compressible data than it
does for highly incompressible data of the same size. In other words, more processor cycles are spent
per instruction when compressing highly compressible data. To determine what causes this change, we
will use the Intel “Top-Down” methodology to see where the processor is spending its time while
running the benchmark. This methodology breaks the benchmark’s total execution slots into four
high-level classifications:

Front-end bound

Cycles spent because there are not enough micro-operations being supplied by the front
end.

Back-end bound

Cycles spent because there are not enough resources available to process pending
micro-operations, including slots spent waiting for memory access.

Bad speculation

Cycles spent because speculative work was performed due to an incorrect prediction
(usually from branch misses).

Retiring

Cycles spent actually carrying out micro-operations.

The event counters used in the formulas to calculate each of these values can be inaccurate.
Therefore the information should be considered with some caution. Nonetheless, they provide a broad,
generally reasonable overview of the benchmark’s performance.
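For reference, the level-1 Top-Down breakdown is derived from raw event counts roughly as follows. These are the standard formulas for a 4-wide core; the counter values below are hypothetical, purely illustrative:

```python
# Level-1 Top-Down breakdown for a machine that issues `width` micro-op
# slots per cycle. The back-end bound is computed as the remainder.

def top_down(cycles, slots_not_delivered, uops_issued, uops_retired,
             recovery_cycles, width=4):
    slots = width * cycles
    frontend = slots_not_delivered / slots
    bad_spec = (uops_issued - uops_retired + width * recovery_cycles) / slots
    retiring = uops_retired / slots
    backend = 1.0 - frontend - bad_spec - retiring
    return {"front-end bound": frontend, "bad speculation": bad_spec,
            "retiring": retiring, "back-end bound": backend}

# Hypothetical counter values, for illustration only.
breakdown = top_down(cycles=1.0e12, slots_not_delivered=4.0e11,
                     uops_issued=2.1e12, uops_retired=2.0e12,
                     recovery_cycles=2.5e10)
```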

Figure 14 shows the distribution of execution slots spent on each of the aforementioned Top-Down
categories. For each of the new workloads, the front-end and retiring bounds tend to remain
fairly similar between the xcomp and comp variants. However, the xcomp workloads tend to have a
slightly higher back-end bound than their comp equivalents.

Figure 14: The distribution of execution slots spent in each category according to Intel’s
Top-Down methodology.

The xcomp-small-rep workload has a particularly unusual distribution of execution slots: its retiring and
back-end bounds are high, while its front-end and bad speculation bounds are smaller than those of
any other workload. Also of note is that the comp-small workload has a fairly high bad speculation
bound.

From this top-down analysis, we can see that the benchmark is strongly memory-bound, and that
branching behaviour also appears to change for certain workloads, so we analyze both of these aspects
of the program’s performance more closely. Figures 15 and 16 show, respectively, the rate of
branching instructions and last-level cache accesses from the same runs as Figure 14. As the
benchmark executes a vastly different number of instructions for each workload, which
makes comparison between workloads difficult, the values shown are rates per instruction
executed.
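As a small illustration of how such per-instruction rates are derived, consider the following sketch. The counter values are hypothetical, not measured data:

```python
# Hypothetical raw perf counts, for illustration only.
counts = {
    "instructions":   4.0e11,
    "branches":       6.0e10,
    "branch-misses":  2.0e9,
    "LLC-references": 9.0e8,
    "LLC-misses":     3.0e8,
}

# Rates per instruction, as plotted in Figures 15 and 16.
branches_per_insn = counts["branches"] / counts["instructions"]
llc_refs_per_insn = counts["LLC-references"] / counts["instructions"]

# Miss rates per branch / per cache reference, as shown inside the bars.
branch_miss_rate = counts["branch-misses"] / counts["branches"]
llc_miss_rate = counts["LLC-misses"] / counts["LLC-references"]
```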

Figure 15 demonstrates that the incompressible inputs usually involve slightly fewer
branches overall than their compressible counterparts, though the opposite is true with
the *-small-rep workloads. The number of branch misses does not appear to follow a
clear pattern. As before, the xcomp-small-rep workload still stands out as having an
especially low branch miss rate despite having the highest rate of branching instructions,
while the comp-small workload has the highest branch miss rate both per branch and per
instruction.

Figure 15: The mean (of 3 runs) rate of branching instructions and branch misses for each
workload. The number in each bar represents the mean branch miss rate (computed as number
of branch misses divided by number of branch instructions).

Figure 16: The mean (of 3 runs) rate of last-level cache (LLC) references and misses per
instruction for each workload. The number in each bar represents the mean cache miss rate
(computed as number of cache misses divided by number of cache references).

While branching behaviour seems to be hard to predict based on compressibility alone, Figure 16
clearly reveals that incompressible inputs usually access the last-level cache (LLC) fewer
times per instruction, albeit with a higher miss rate per access than compressible files. Thus, incompressible
inputs appear to produce a higher cache miss rate per instruction than compressible inputs, with the
exception once again being xcomp-small-rep.

Lastly, in terms of the cache behaviour, the two SPEC workloads behave almost identically to each
other. Interestingly, their rate of access to the LLC is nearly as low as the xcomp-small-rep workload.
As a result, despite having moderate miss rates per cache reference, they have the lowest miss rates
per instruction out of all the workloads.

5.4 Data Repetition

As mentioned in section 3, combining repeated data with a large dictionary can cause a
significant change in program behaviour. Specifically, this section will show that even the most
incompressible data will become extremely compressible if the repeated data is smaller than the
dictionary. Figure 17 illustrates this point by comparing the benchmark’s performance
between the repeated (-rep) and unrepeated (no suffix) workloads. Note that the train and
refrate workloads are left out, as there are no unrepeated workloads to which they can be
compared.

With 8x data repetition, one would expect most processor events to increase by approximately 8
times, which is denoted by the horizontal blue line in each of the diagrams. However, as shown in
Figure 18a, the *-small workloads are slowed down by less than 8x, while the *-big workloads are
slowed down by slightly more. Similarly, as shown in Figure 18b, the number of branch misses for the
small workloads is multiplied by only 3-5 when the data is repeated 8 times, whereas the number of
branch misses for big workloads increases by around 7. Thus, repetition of the small workloads causes
them to behave quite differently.

Another interesting effect of repetition can be seen in Figure 18c, which reveals that the number of
instructions executed is usually multiplied by around 8 for the repeated workloads, except for the
xcomp-small-rep workload, which executes over 10 times as many instructions as the xcomp-small
workload. This can be explained by the fact that, as shown in Section 5.1, processing
compressible data tends to take more instructions. By repeating the xcomp-small workload, it
becomes much easier to compress. In effect, we are comparing an incompressible file to a
compressible file, and so the number of instructions executed rises more than it would
otherwise.

Figure 17: Effects of repeating data in memory 8 times (calculated as the value for the repeated
workload divided by the value for the unrepeated workload).

Note that this same effect does not occur for the xcomp-big workload, as repeating data larger
than the dictionary does not make it significantly more compressible. However, as seen in Figure 18d,
the number of dTLB misses does appear to rise more than expected when repeating incompressible
workloads, regardless of their sizes. Also of note, dTLB misses increase by only a factor of two for
the comp-small-rep workload compared with comp-small.

5.5 Compilers

This section compares the performance of the benchmark when compiled with GCC 4.8.4, the
Clang frontend to LLVM 3.6, or Intel ICC 16.0.3. To this end, the benchmark was compiled with
all three compilers at six optimization levels (-O{0-3}, -Os and -Ofast), resulting in 18 different
executables. Each workload was then run 3 times with each executable on the Intel Core i7-2600
machines described earlier.
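The 18-executable build matrix can be enumerated as a nested loop. The command template and the SOURCES placeholder are assumptions for illustration, not the actual build invocations used:

```python
import itertools

# The 3 compilers x 6 optimization levels described above.
compilers = ["gcc", "clang", "icc"]
opt_levels = ["-O0", "-O1", "-O2", "-O3", "-Os", "-Ofast"]

# Build one command string per (compiler, flag) pair; "SOURCES" is a
# placeholder for the benchmark's source file list.
commands = [f"{cc} {opt} -o xz_r.{cc}{opt} SOURCES"
            for cc, opt in itertools.product(compilers, opt_levels)]
```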

5.5.1 LLVM

First, we will compare GCC to LLVM. Generally, the two compilers produce code which behaves
somewhat similarly. However, there are a few noticeable differences in their behaviour, the most
prominent of which are highlighted in Figure 19.

Figure 20a shows that the LLVM-compiled benchmarks tend to run slightly slower than the
GCC-compiled benchmarks. In particular, the GCC benchmarks consistently outperform the LLVM
benchmarks for optimization levels -O1, -O3 and -Ofast.

Figure 20b demonstrates that aside from optimization levels -O0 and -O1, the GCC benchmarks
tend to be more heavily front-end bound than the LLVM benchmarks. There is considerable variation
between workloads, but on average LLVM appears to reduce the front-end bound for most
optimization levels.

In Figure 20c, we see that the LLVM-compiled code has a consistently higher retiring bound.
Coupled with the findings in Figure 20b, this suggests that the LLVM-compiled executables generally
spend more time doing useful work than the GCC executables.

Lastly, Figure 20d shows that the LLVM-compiled versions of the benchmark consistently
spend fewer cycles per instruction than the GCC versions, aside from a few exceptions at
optimization level -O0. However, despite this improvement, they almost always execute more
instructions than their GCC-compiled equivalents (not shown in this figure), which outweighs the
benefit of the lowered CPI and results in the comparatively poor performance shown in
Figure 20a.

5.5.2 ICC

The executables produced by ICC and GCC are also very similar in terms of behaviour aside from a
few minor differences highlighted in Figure 21.

Figure 22a demonstrates that the ICC-compiled benchmarks have similar performance to the
GCC-compiled benchmarks, though at optimization level -O0, the ICC executables are visibly slower.
As well, the comp-small workload runs slower than the other workloads with the ICC versions at
optimization levels -O2, -O3, -Os and -Ofast.

Figure 22b compares the front-end bound of the ICC- and GCC-compiled benchmarks. Neither
compiler’s code clearly outperforms the other for most optimization levels, though at
optimization levels -O1, -O2, -O3, -Ofast and -Os, ICC does not match GCC’s performance on
the xcomp-small, xcomp-big and xcomp-big-rep workloads.

Finally, Figure 22c demonstrates that the number of branch instructions executed in the ICC code
is higher for optimization level -O0.

6 Conclusion

The 557.xz_r benchmark behaves very differently from workload to workload. Generally,
it is a memory-bound benchmark, but this behaviour may be more or less pronounced
depending on many factors, including the compressibility of the data and the amount of data
repetition used. These same factors also significantly change the benchmark’s execution
profile in terms of functions executed. As such, it is worthwhile to test this benchmark
with a wide variety of inputs so as to assess its performance when handling different types
of data. However, for the same reasons, it is also important to thoroughly test any new
workloads to verify that they are indicative of the type of program behaviour that we intend to
measure.

Appendix

Figure 23: Percentage of execution time spent on all symbols which made up more than 5% of
the execution time for at least one of the workloads. This figure continues onto the next page.