Block Resampling

Figure 7.14:
Connection arrival time series for UNC 1 PM (dashed line) and
a Poisson arrival process with the same mean (solid line).

Figure 7.15:
Connection arrival time series for UNC 1 AM and
a Poisson arrivals process with the same mean.

Figure 7.16:
Wavelet spectra of the connection arrival time series for UNC 1 PM and
a Poisson arrival process with the same mean.

Figure 7.17:
Wavelet spectra of the connection arrival time series for UNC 1 AM and
a Poisson arrival process with the same mean.

Table 7.1:
Estimated Hurst parameters and their confidence intervals for the
connection arrival time series of UNC 1 PM and UNC 1 AM, and their
Poisson arrival fits.

Trace

Estimated Parameters

UNC 1 PM Conn. Arrivals

=0.685041 C.I.=[0.646250, 0.723831]

Poisson secs.

=0.506069 C.I.=[0.467279, 0.544860]

UNC 1 AM Conn. Arrival

=0.756533 C.I.=[0.717743, 0.795324]

Poisson secs.

=0.502217 C.I.=[0.463427, 0.541008]

The basic assumption of Poisson Resampling is that connection inter-arrivals
are independent and identically distributed according to an exponential
distribution, which results in a Poisson arrival process. While the choice of
exponential inter-arrivals is reasonable given the measurement data presented in Figures
7.1 to 7.4, the arrival
process may not necessarily be independent. On the one hand, we can argue that
common application protocols make use of more than one connection, creating correlations
among some start times. For example, web browsers often open several connections
for downloading a web page. On the other hand, we focus on traces of highly aggregated
traffic, where a large number of hosts start hundreds or even thousands of connections
every second. The high
aggregation could diminish or even eliminate completely any correlation structure in the
connection arrival processes.

The analysis of our traces reveals non-negligible correlations in the connection
arrival process. Figures 7.14 and
7.16 examine the arrival process for the UNC 1 PM trace.
Using a time series of 10-millisecond bin counts,
Figure 7.14 compares the burstiness of
the original arrival process (dashed line) and that of a Poisson arrival process with
the same mean inter-arrival time (solid line). The original arrival process was far more variable.
Its standard deviation was equal to 346.07, while the one for the
Poisson process was equal to 79.21.
In order to further study the connection arrival process across a range of time-scales,
we rely on wavelet analysis. Figure 7.16 shows the wavelet
spectra of the original connection arrivals and a Poisson process with the same mean
inter-arrival time. The Poisson process exhibits the expected flat spectrum of
short-range dependent processes [HVA02]. On the contrary, the spectrum for
the original connection arrivals follows a line with a substantial positive slope, which is characteristic of
long-range dependent processes. The results of the wavelet-based estimation
[AV98] of the Hurst parameters of these processes are given in table
7.1. The Poisson process has a Hurst parameter very close to the
expected 0.5, while the original arrival process has a Hurst parameter of 0.685. This
is consistent with moderate long-range dependence. For comparison, typical estimates
of the Hurst parameter for packet and byte arrival processes are between 0.75 and 0.95, i.e.,
typical packet and byte arrival processes exhibit significantly stronger long-range dependence
than this connection arrival process.

We performed a similar analysis for the UNC 1 AM trace, and the results are shown
in Figures 7.15 and 7.17. As in
the previous case, the time series plot shows a connection arrival process that is
significantly more bursty than that of a Poisson process with the same mean. Note
however than in this case there is some degree of non-stationarity. We observe a significantly
larger number of connections started in the first 5 minutes, and a significantly
lower number started in the last 10 minutes. In this case we
compute the mean inter-arrival rate required to construct the Poisson arrivals using
the middle 40 minutes of the trace. We therefore handle the effect of trace boundaries
by ignoring the first and the last few minutes of the arrival process.
The wavelet spectra for these middle 40 minutes and a Poisson process with the
same mean arrival rate are shown in Figure 7.17. As in
the UNC 1 PM case, the original connection arrival process exhibits clear long-range
dependence. The estimated Hurst parameter in Table 7.1 reveals
a somewhat stronger long-range dependence for the UNC 1 AM trace (0.757 vs. 0.685).

In summary, the connection arrival processes we have examined are consistent with
significant long-range dependence.
Therefore, it is desirable to develop the resampling and load scaling methods that can reproduce this
structure, to cover experiments where the manner in which connections arrive is relevant for the
network phenomenon studied using synthetic traffic.
One example of this type of scenario is the evaluation of a router mechanism where
the arrival of new connections creates new state in the router. For such a mechanism, a more bursty
arrival process creates a more stringent workload, just like burstier traffic was shown by
[BC98] to be more demanding on web server performance.

Poisson Resampling cannot reproduce this observed long-range dependence in the connection
arrival process since its inter-arrivals times come from independently sampling an exponential
distribution. For this reason, we propose a second resampling technique that
can reproduce the long range dependence in the connection arrival process.
The starting point is the intuition that dependencies between connection start times
are far more likely to occur within relatively small periods. For example, web browsing
results in new connections started according to the sequence of web page downloads
and the way the browser opens new connections to the servers in which these pages are
found.
This results in brief bursts of connections whose start times are correlated.
We use this intuition to develop a resampling method wherein
the resampled objects are not individual connections, but groups of connections started
during the same period, which we call blocks.
The key idea of our Block Resampling
method is that sampling blocks of connections
rather than individual connections preserves the relative offsets
of connection start times within blocks, and therefore the dependency structure7.6Our method is derived from the Moving Block Bootstrap method [ET93].

Figure 7.18:
Block resamplings of UNC 1 PM: impact of different block lengths on the wavelet
spectrum of the connection arrival time series.

Figure 7.19:
Block resamplings of UNC 1 AM: impact of different block lengths on the wavelet
spectrum of the connection arrival time series.

Block Resampling proceeds in the following manner:
Given a trace , we divide it in blocks of duration ,
so that the first block groups together connections started in
the interval , the second block groups together connections
started in the interval
, and so on.
The block resampled trace
is
obtained by concatenating randomly sampled blocks, and adjusting the start time
of connections in each block by the time offset of the new location of this block. For example, if the random
resampling puts block as the first block of
, the start times of the i-th
connection vector in this block is set to .
Similarly, if is placed in the fourth location of
, the start times of
the i-th connection in this block are set to . More formally,
when the -th block in the original trace becomes the -th block
in the block resampling, the start time of the i-th connection vector in
is set to

Block Resampling chooses blocks for
with replacement, making it possible to
create new traces that are longer than the original from which the blocks are
obtained.

As pointed out by Efron and Tibshirani [ET93],
choosing the block duration can be a difficult problem. In our case, we found
a clear trade-off between block duration and how well long-range dependence
was preserved in the resampled trace. The shorter the block duration, the larger the number
of distinct trace resamplings that can be performed from the same trace .
This number is equal to for resampled traces with the same duration
of the original trace.
However, if the duration of the blocks is too small, the process of connection
arrivals in the resampled trace exhibits a dependency structure that does not resemble the one in the original trace.

Figure 7.18 explores the impact of block duration on the
long-range dependence of the connection arrival process in the resampled trace.
The top left plot shows the wavelet spectra of the
connection arrivals for UNC 1 PM and for 5 block resamplings where the block duration was
1 second. There is a clear and consistent flat region after octave 8, which shows
that blocks of 1 second are too short to preserve the long-range dependence of
the original connection arrival process. As the block duration is increased in subsequent plots,
we observe an increasingly better match between the arrivals in the block resamplings and
the arrivals in the original trace. Blocks with a duration of 30 seconds or 1 minute
provide the best trade off between blocks that are large enough to ensure realistic
long-range dependence in the connection arrival process, and blocks that are short
enough to provide a large number of distinct resamplings.
The same sensitivity analysis was performed for the UNC 1 AM trace and the results
are shown in Figure 7.19. Block durations of 30 seconds or
1 minute are also shown to perform well.

As discussed earlier in this chapter, an important goal of trace resampling
is the ability to preserve the target load of the original trace and to scale it up and
down according to the needs of the experimenter. The analysis of a large set
of Poisson resamplings revealed that offered load and number of connections are only
loosely correlated. This motivated the use of a byte-driven version of Poisson Resampling
which could achieve a very precise scaling of the load offered by the resampled trace.
In the case of Block Resampling, the question is whether the averaging effect of grouping
connections into blocks significantly diminishes the variability observed for
the basic version of Poisson Resampling. We study this question by examining the offered
load found in a large collection of block resampled traces.
If the blocks had roughly uniform offered load, we would expect to generate similar
offered load with each resampled trace.
The results in Figure 7.20 do not confirm this expectation.
The top row presents the analysis of 1,000 trace resamplings
constructed by resampling UNC 1 PM using 30-second blocks. The average offered load
was derived from the total payload computed using Equation 7.1.
As shown in the scatterplot, the number of connections stayed within a narrow range,
but the offered loads were far more variable.
The histogram on the right further characterizes the distribution
of offered loads in these trace resamplings. The use of blocks does not appear to have
any benefit in terms of a more predictable load. This is not surprising given the known
burstiness of the packet and byte arrival processes at many time-scales. If blocks were
effective at smoothing out these processes, we would find little long-range dependence.
This situation does not change for longer block durations, as shown in the middle and lower rows of
Figure 7.20 for blocks of 1 and 5 minutes respectively.
It is interesting to note the wider y-axis and range of the histogram for the
5-minutes blocks, which suggest even higher variability for this longer block duration.

Table 7.2:
Estimated Hurst parameters and their confidence intervals for five subsamplings
obtained from the connection arrival time series of UNC 1 PM and UNC 1 AM

Trace

Estimated Parameters

UNC 1 PM Conn. Arrivals

=0.727540 C.I.=[0.701687, 0.753393]

Subsample 90% Conn.

=0.724175 C.I.=[0.698322, 0.750028]

Subsample 80% Conn.

=0.724046 C.I.=[0.698193, 0.749899]

Subsample 70% Conn.

=0.718502 C.I.=[0.692649, 0.744354]

Subsample 60% Conn.

=0.701378 C.I.=[0.675525, 0.727230]

Subsample 50% Conn.

=0.701020 C.I.=[0.675167, 0.726872]

UNC 1 AM Conn. Arrivals

=0.746591 C.I.=[0.720738, 0.772444]

Subsample 90% Conn.

=0.738659 C.I.=[0.712806, 0.764512]

Subsample 80% Conn.

=0.725030 C.I.=[0.699177, 0.750882]

Subsample 70% Conn.

=0.715679 C.I.=[0.689827, 0.741532]

Subsample 60% Conn.

=0.696723 C.I.=[0.670870, 0.722576]

Subsample 50% Conn.

=0.691139 C.I.=[0.665287, 0.716992]

The Block Resampling method described so far makes it possible to construct a resampled
of arbitrary
duration but it cannot be used to adjust its offered load.
In order to perform this task, we can rely on thinning,
when the offered load of is above our intended offered load, and on
thickening,
when the offered load of is below our intended offered load. Block thinning involves
randomly removing connections from
.
Theoretical work by Hohn and Veitch [HV03]has shown that the thinning
of a long-range dependent process does not change its long-range dependence structure.
Our own experimentation confirms this result. Figure
7.21 shows the wavelet spectra of thinned versions
of the connection arrivals in the UNC 1 PM trace (left) and the UNC 1 AM trace
(right). The overall energy level decreases as the fraction of connections removed from each block increases.
However, the spectra maintain their shapes, which demonstrates that the degree of the
long-range dependence remains unchanged. The estimated Hurst parameters for these two
traces is presented in Table 7.2.
The values reveal only a moderate decrease in the Hurst parameter even when
half of the connections are dropped.

Figure 7.22:
Analysis of the accuracy of byte-driven Block Resampling using source-level
trace replay: replays of two separate resamplings of UNC 1 PM for each target offered load,
illustrating the scaling down of load from the original 177.36 Mbps.

Figure 7.23:
Analysis of the accuracy of byte-driven Block Resampling using source-level
trace replay: replay of one resampling of UNC 1 AM for each target offered load,
illustrating the scaling up of load from the original 91.65 Mbps.

Block thickening consists of combining more than one block in each of the disjoint
intervals of
, i.e., to ``fusion'' one or more blocks from to
form a single block in
. This makes the offered load a multiple of the original load.
For example, to double the load, the connection vectors of
two randomly chosen blocks will be placed in the first interval, those from another
pair of randomly chosen blocks will be placed in the second interval, and so on.
The new start times of the connection vectors in the resampled trace are computed using Equation
7.2, but being careful to use the right for each connection
vector.

To achieve offered loads that are not a multiple of the original
load, we can combine basic thickening and thinning using a two-step process.
The first step is to create a preliminary version of
by combining as many blocks
as possible without exceeding the target load.
The second step is to ``complete'' this trace by combining it with another
block-resampled trace that has been thinned in such a manner that the combined
load of the two resampled traces matches the intended load. For example, in order to create a
with 2.5 times the load of , a first thickened trace
is
created by combining two blocks in each position.
This trace is then combined with second trace
that has been thinned to half
of the offered load of .
From our analysis in Figure 7.20, we can see that
is not necessarily equal to twice the offered load of .
For this reason
is actually thinned to exactly the offered load
needed to complement
, and not just to half of the original offered
load This careful thinning makes the scaling match the intended load in a highly
precise manner. We can therefore achieve any intended load with the Block Resampling method, so it is as
flexible as Poisson Resampling. In accordance with our earlier analysis,
accurate thinning cannot rely on any correlation between the number of connections
and the offered load, so it must be driven by Equation 7.1, just like
byte-driven Poisson Resampling.
Therefore, our final resampling technique is Byte-driven Block Resampling.

Figures 7.22 and 7.23
show the result of several testbed experiments where Byte-driven Block Resampling
is used to create new traces. The results demonstrate that traces resampled using
this method achieve a very good approximation of the target offered loads. As in
the case of Byte-driven Poisson Resampling, the achieved loads are slightly higher
than target ones due to the packetization overhead, which is not taken into account in the resampling.

Figure 7.24:
Wavelet spectra of the packet arrival time series for UNC 1 PM and
the source-level trace replays of two block resamplings of this trace.

Figure 7.25:
Wavelet spectra of the packet arrival time series for UNC 1 PM and
the source-level trace replays of three Poisson resamplings of this trace.

One interesting question is whether the effort to preserve the
scaling of the connection arrival process has any effect on the generated traffic
aggregate. To understand this question, we can compare the process of packet (or byte)
arrivals from block resamplings and from Poisson Resampling, since the former fully
preserves connection arrival long-range dependence and the latter fully eliminates it.
Figure 7.24 shows the wavelet spectra of
the packet arrivals in UNC 1 PM and those in two testbed experiments where byte-driven
block resamplings of UNC 1 PM were replayed.
Figure 7.25 shows the same wavelet spectrum of the
packet arrivals in UNC 1 PM, and also the spectra from three testbed experiments
where byte-driven Poisson resamplings of UNC 1 PM were replayed.
Both resampling methods achieve equally good approximations of the packet
scaling found in the original trace. In other words, according to this type of analysis,
the simpler Poisson Resampling method performs as well as the
more elaborate Block Resampling method. This is a confirmation, using a closed-loop
traffic generation approach, of the results by Hohn et al. in [HVA02],
which were obtained using (open-loop) semi-experiments.
This is not to say that long-range dependence in the arrival of connections
(e.g., arrival of flow state or cache misses to a router) can be safely ignored,
since other metrics and experimental results may be more sensitive to this characteristic
of the synthetic traffic.