Optimizing WAN throughput with compression

Wide-area network optimization has become an important technique for improving the performance and throughput of all sorts of networked applications, including some that depend on optimization to function over WAN links at all. WAN optimization focuses primarily on reducing the volume of data on the link. However, pure data reduction is not sufficient to achieve optimization. Data reduction, and all other optimization techniques, must be applied in a way that minimizes message latency through the device and guarantees that all users receive a fair share of the total available WAN bandwidth, both of which are critical QoS issues. Unfortunately, the goals of bandwidth optimization, latency minimization, and fair access are often mutually antagonistic when processing the bursty traffic typical of network transmissions.

WAN optimization techniques

Although this article focuses on data
compression, we should mention some of the other important WAN
optimization techniques. Message caching saves entire messages, or
even whole conversations, so that a replay does not require the
entire message to be copied over the WAN. Many protocols are quite
chatty and send many redundant messages. “Keep-alives” and service advertisement messages are typical of
this sort of behavior. Caching these messages allows the WAN
optimization appliance to send only a short reference message, or
the differences between two messages, rather than retransmitting
everything.
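
As a rough illustration, the following Python sketch shows the core idea of message caching: the first copy of a message is sent in full, and repeats are replaced by a short digest-based reference. The class name and token format are invented for this example; a real appliance would keep the sender's and receiver's caches synchronized and handle eviction and cache misses.

    import hashlib

    class MessageCache:
        # Toy message cache: the first copy of a message is sent in
        # full; repeats become a 3-byte tag plus a 32-byte SHA-256
        # digest. (Illustrative only; real systems synchronize the
        # caches at both ends of the link.)
        def __init__(self):
            self.seen = {}

        def encode(self, message: bytes) -> bytes:
            digest = hashlib.sha256(message).digest()
            if digest in self.seen:
                return b"REF" + digest        # short reference message
            self.seen[digest] = message
            return b"RAW" + message           # first occurrence: full copy

        def decode(self, wire: bytes) -> bytes:
            tag, body = wire[:3], wire[3:]
            if tag == b"REF":
                return self.seen[body]        # replay from the cache
            self.seen[hashlib.sha256(body).digest()] = body
            return body

    cache = MessageCache()
    keepalive = b"KEEPALIVE service=printer host=hq-01 interval=30"
    print(len(cache.encode(keepalive)))  # 51: full message the first time
    print(len(cache.encode(keepalive)))  # 35: digest reference thereafter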

Another effective technique for avoiding redundant transmission is byte string caching, where only certain segments of the message are removed or compressed. Byte caching can find redundancies across messages even when there are important differences between them: it tokenizes or compresses the redundant segments while sending the full text of the portions whose content differs. Byte caching is a very powerful technique, but it is also very compute intensive and is usually limited to high-end WAN appliances.
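
A minimal Python sketch of the byte caching idea: messages are split into segments, segments seen before are replaced by short tokens, and new content is passed through in full. Real byte caches use content-defined segment boundaries found with rolling hashes, rather than the fixed-size chunks assumed here.

    import hashlib

    CHUNK = 64   # fixed-size segments for brevity; real byte caches use
                 # content-defined boundaries found with rolling hashes

    def tokenize(data: bytes, store: dict) -> list:
        # Replace previously seen segments with short hash tokens while
        # passing new segments through in full.
        out = []
        for i in range(0, len(data), CHUNK):
            chunk = data[i:i + CHUNK]
            key = hashlib.sha1(chunk).digest()
            if key in store:
                out.append(("ref", key))     # redundant segment: token only
            else:
                store[key] = chunk
                out.append(("raw", chunk))   # new content: full text
        return out

    store = {}
    msg1 = b"A" * 64 + b"payload one".ljust(64)
    msg2 = b"A" * 64 + b"payload two".ljust(64)
    print([t for t, _ in tokenize(msg1, store)])  # ['raw', 'raw']
    print([t for t, _ in tokenize(msg2, store)])  # ['ref', 'raw']: the
                                                  # shared prefix is found
                                                  # across the two messages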

Data compression approaches

There are two ways to apply data compression in WAN optimization appliances. The first takes advantage of the fact that all web browsers support negotiating compression on HTTP streams, which allows compression to be applied whenever it is likely to reduce network bandwidth.
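
A server-side sketch of that negotiation, using only the Python standard library: the client advertises support in its Accept-Encoding header, and the body is compressed only when the client allows it. The 256-byte minimum is an arbitrary threshold chosen for this example.

    import gzip

    def negotiate_body(accept_encoding: str, body: bytes):
        # Compress only when the client has advertised gzip support and
        # the body is large enough for compression to plausibly pay off.
        if "gzip" in accept_encoding and len(body) > 256:
            return gzip.compress(body), {"Content-Encoding": "gzip"}
        return body, {}

    html = b"<html>" + b"<p>hello wan</p>" * 200 + b"</html>"
    body, headers = negotiate_body("gzip, deflate", html)
    print(len(html), "->", len(body), headers)  # large reduction on HTML
    body, headers = negotiate_body("identity", html)
    print(len(body), headers)                   # client refused: sent as-is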

The second approach compresses data between two gateway appliances. HTTP compression is the easiest to apply, but it is limited to HTTP traffic. In a gateway-to-gateway scenario, the gateway can compress any traffic it chooses, but only on the segment between the gateways. In either case, data compression delivers a benefit only when the original data contains redundancy.

Some data, such as text or HTML, is very compressible, while other data, such as encrypted data or compressed multimedia files, does not compress at all. When multimedia files are transferred over HTTP, the source identifies the Internet Media Type (MIME type) of the data, so the server can see that it is incompressible. Where no MIME type is given, the only way to determine whether data will compress is to attempt to compress it and measure the result.
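
A trial-compression probe might look like the following Python sketch; the 90% threshold is an arbitrary cutoff chosen for illustration.

    import os
    import zlib

    def looks_compressible(sample: bytes, threshold: float = 0.9) -> bool:
        # Trial-compress a sample at a fast level and keep compression
        # only if it actually shrinks the data below the cutoff.
        return len(zlib.compress(sample, 1)) < threshold * len(sample)

    print(looks_compressible(b"<tr><td>row</td></tr>" * 100))  # True: HTML
    print(looks_compressible(os.urandom(2048)))  # False: random bytes,
                                                 # like encrypted or already
                                                 # compressed payloads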

The most popular data compression techniques are based on the “deflate” algorithm. Deflate, gzip, and zlib are all variants, differing in the way they treat files and data integrity checksums. The basic deflate algorithm is a combination of the LZ77 algorithm and Huffman coding. LZ77 is a high-speed compression scheme that scans a data stream for repeated patterns and replaces each repetition with a reference to its earlier occurrence. The deflate algorithm then replaces the symbols generated in the LZ77 pass with their Huffman encodings, delivering very good data compression at high speed.
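
To make the LZ77 pass concrete, here is a deliberately brute-force Python sketch that emits the literal and back-reference token stream that deflate would then Huffman-code. Production deflate implementations find matches with hash chains rather than this linear scan.

    def lz77_tokens(data: bytes, window: int = 32 * 1024):
        # Minimal greedy LZ77 pass: emit literals and (distance, length)
        # back-references to repeated patterns. Deflate Huffman-codes
        # this token stream; the sketch stops at the tokens.
        i, tokens = 0, []
        while i < len(data):
            best_len, best_dist = 0, 0
            for j in range(max(0, i - window), i):
                length = 0
                while (i + length < len(data)
                       and data[j + length] == data[i + length]
                       and length < 258):            # deflate's max match
                    length += 1
                if length > best_len:
                    best_len, best_dist = length, i - j
            if best_len >= 3:                        # deflate's min match
                tokens.append(("match", best_dist, best_len))
                i += best_len
            else:
                tokens.append(("lit", data[i:i + 1]))
                i += 1
        return tokens

    print(lz77_tokens(b"blah blah blah"))
    # five literals for "blah ", then ("match", 5, 9): copy nine bytes
    # starting five bytes back, an overlapping match covering the rest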

The deflate algorithm achieves good throughput for file transfers by breaking the input stream into blocks and encoding the blocks individually. Consequently, the file may not be optimally compressed, but output transmission can begin before the entire file has been analyzed. By comparison, a pure Huffman encoding requires that the entire file be read once to determine the probabilities of the input characters, and then again to encode it using variable-length code words. The result is optimal but requires time and memory proportional to the size of the entire file, whereas deflate encoding can be streamed so that compressed data emerges once 32 Kbytes of data have been read.
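
Python's zlib module exposes this streaming behavior directly, so the effect is easy to demonstrate: compressed output can be sent as input slices arrive. The send() function and the input data below are stand-ins for this sketch.

    import zlib

    def send(buf: bytes):
        # Stand-in for writing to the WAN link.
        if buf:
            print(f"sent {len(buf)} compressed bytes")

    # Output is produced block by block as input arrives, so transmission
    # can begin long before the whole transfer has been analyzed.
    comp = zlib.compressobj(level=6, wbits=-15)    # -15: raw deflate blocks
    data = b"GET /index.html HTTP/1.1\r\n" * 4096  # stand-in input stream
    for i in range(0, len(data), 32 * 1024):       # feed 32-Kbyte slices
        send(comp.compress(data[i:i + 32 * 1024]))
    send(comp.flush())                             # final deflate block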

Both the gzip and zlib formats are based on the deflate algorithm, but zlib adds an Adler-32 data integrity checksum to the end of the file, while gzip adds member information that allows the creation of multi-file compressed archives, as well as a CRC-32 data integrity checksum. The CRC ensures that the system can detect any errors in the compression process. These additions significantly increase complexity when compressing data for WAN optimization, because the checksum in the trailer must be computed over the entire message, requiring that some context be maintained from beginning to end for each stream being compressed.
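
The three framings can be compared with Python's zlib module, where the wbits parameter selects raw deflate, zlib, or gzip framing around the same deflate blocks:

    import zlib

    data = b"the same payload under three framings " * 100

    for name, wbits in [("raw deflate", -15),  # no header or trailer
                        ("zlib", 15),          # 2-byte header + Adler-32
                        ("gzip", 31)]:         # 10-byte header + CRC-32
                                               #   and length trailer
        c = zlib.compressobj(wbits=wbits)
        out = c.compress(data) + c.flush()
        print(name, len(out))                  # same blocks, more framing

    # Both trailer checksums cover the entire stream, which is why the
    # compressor must carry state from the first byte to the last.
    print(hex(zlib.adler32(data)), hex(zlib.crc32(data)))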

While this is not usually a concern when data files are compressed for local storage, the situation in network processing is different. When compressing files for local storage, the entire data set is usually available without significant delay. On the network, however, there are often bursts and pauses in the transmission of packets. Packets from different streams are often intermixed and can be delivered out of sequence. Furthermore, on the network there is often an expectation that some level of QoS must be maintained, so the available link bandwidth must be fairly shared according to some policy. Enforcement of a prioritized QoS policy is often a primary objective of WAN optimization systems.

Part of the difficulty in maintaining context is that the zlib and gzip formats were not really designed for network compression. For example, the CRC computation is often redundant, as in the case of HTTP messages carried by TCP, because TCP already guarantees error-free delivery. Nonetheless, zlib and gzip remain the preferred compression formats for browsers, so the appliance must maintain per-stream context without losing the QoS guarantee.

If the compressor is in software, context can be maintained by running a separate thread for each data stream; the thread context then includes the state of the compression algorithm. However, a thread context often contains significant amounts of other information. When large numbers of streams are being processed, this can result in excessive memory usage, and context switching over this much data can be quite time consuming. Switching can also be very frequent, because each packet delivery drives a context switch. High-performance networking stacks usually try to avoid this sort of stream-driven context switching.
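
To make the per-stream state concrete, here is a Python sketch: each stream's context is its compressor object, which carries the deflate window and the running checksum, and every arriving packet must first be matched back to its context. The dictionary here stands in for whatever mechanism, threads or otherwise, owns the contexts; the cost described above comes from doing this at packet rate across many streams.

    import zlib
    from collections import defaultdict

    # One compressor object per stream: the object holds the deflate
    # window and checksum state, so every packet must be routed back
    # to its own stream's context before it can be compressed.
    contexts = defaultdict(lambda: zlib.compressobj(wbits=-15))

    def on_packet(stream_id: str, payload: bytes) -> bytes:
        comp = contexts[stream_id]           # restore this stream's state
        return comp.compress(payload) + comp.flush(zlib.Z_SYNC_FLUSH)

    # Packets from different streams arrive interleaved, as on a real link.
    for sid, chunk in [("A", b"alpha " * 100), ("B", b"beta " * 100),
                       ("A", b"alpha " * 100), ("B", b"beta " * 100)]:
        print(sid, len(on_packet(sid, chunk)))  # repeats compress smaller
                                                # because the stream's
                                                # window state persists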

Hardware-based context switching

These disadvantages can be avoided by deploying hardware-based context-switching compression and decompression subsystems, such as those from Altior. These subsystems use multiple compression and decompression cores: dedicated hardware implemented in an FPGA.

The subsystem saves the compressor context in
local memory, where it can be accessed quickly. The hardware is
pipelined so that the loading of the next stream context can
overlap with the previous stream’s execution.

Pipelining efficiency increases with packet size. The result is very low-latency context switching, and packets can be processed without waiting for the complete compressed message to be received. A packet can also carry part of a compressed record without its CRC; the compression context keeps a running checksum so that the CRC can be verified as soon as it arrives.
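
The running-checksum bookkeeping can be illustrated in software: Python's zlib lets CRC-32 be updated packet by packet, so the final value is ready to check the moment the last fragment arrives.

    import zlib

    # The running CRC-32 is part of the stream context: it is updated as
    # each packet arrives, so the trailer can be verified as soon as the
    # final fragment shows up.
    packets = [b"first fragment ", b"middle fragment ", b"last fragment"]
    crc = 0
    for pkt in packets:
        crc = zlib.crc32(pkt, crc)                 # update per packet
    print(crc == zlib.crc32(b"".join(packets)))    # True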

Another advantage is that, although packets must
be processed in sequence, they do not require complete reassembly.
Consequently, they can be compressed or decompressed as soon as the
sequence can be assured.

A context-switching compression or decompression subsystem allows the WAN optimization server to enforce a bandwidth management policy across all streams simply by limiting the size of outgoing packets and managing priority queues on output. The context-switching core ensures that each packet contains complete deflate blocks and that the proper zlib or gzip headers and trailers are added when needed.
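
This packetization has a software analogue in Python's zlib, sketched below: a sync flush closes the current deflate block and byte-aligns the output, so each emitted packet ends on a block boundary and can be forwarded immediately. The slice size is an arbitrary illustrative value.

    import zlib

    def packetize(comp, data: bytes, in_slice: int = 4096):
        # Emit one packet per input slice. Z_SYNC_FLUSH closes the current
        # deflate block and byte-aligns the output, so every packet ends
        # on a block boundary and can be forwarded and decoded at once.
        packets = []
        for i in range(0, len(data), in_slice):
            out = comp.compress(data[i:i + in_slice])
            out += comp.flush(zlib.Z_SYNC_FLUSH)
            packets.append(out)
        return packets

    comp = zlib.compressobj(wbits=-15)
    pkts = packetize(comp, b"telemetry record 0042 status=OK " * 1000)
    print([len(p) for p in pkts])   # small, bounded packets from one flow;
                                    # a QoS scheduler can now interleave
                                    # packets from many flows by priority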

Context-switched decompression cores allow ingress packets to be processed as soon as they are re-sequenced. Since the subsystem remembers the stream context, it can easily intermix packets from different streams and keep the decompression hardware busy, even if some flows are subject to long network delays (see Fig. 1).
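
The decompression side can be sketched the same way in Python: one decompressor object per flow, with packets from different flows handled in any interleaving, as long as each flow's own packets stay in order.

    import zlib

    def make_packets(text: bytes, n: int):
        # Build a compressed flow split into sync-flushed packets.
        c = zlib.compressobj(wbits=-15)
        step = len(text) // n
        return [c.compress(text[i:i + step]) + c.flush(zlib.Z_SYNC_FLUSH)
                for i in range(0, len(text), step)]

    pkts_a = make_packets(b"flow A " * 400, 4)
    pkts_b = make_packets(b"flow B " * 400, 4)

    # One decompression context per flow: packets from different flows
    # may be processed in any interleaving, provided each flow's own
    # packets arrive at its decompressor in sequence.
    dctx = {"A": zlib.decompressobj(wbits=-15),
            "B": zlib.decompressobj(wbits=-15)}
    for sid, pkt in [("A", pkts_a[0]), ("B", pkts_b[0]),
                     ("B", pkts_b[1]), ("A", pkts_a[1])]:
        print(sid, len(dctx[sid].decompress(pkt)))   # 700 bytes each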

Fig. 1: Context switching enables channel
sharing.

The illustration shows the effect of context-switched compression on the delivery of QoS guarantees. Two data streams, Flow A and Flow B, must be compressed. Without context switching, the hardware must be dedicated to one flow for the duration of an entire HTTP record (see block C). With context switching, it is possible to break long records into multiple packets and to process packets from different flows interchangeably (see block D). This allows all flows fair access to the output bandwidth, even when some compressed records are very long.

The context memory is implemented as a local DDR2 device, and the cores communicate through multichannel DMA controllers, with a dedicated DMA channel per core. Because the cores share the same context memory, any context may be processed by any core with available capacity. This not only increases throughput but also makes it unnecessary to schedule the cores.

The FPGA-based approach enables easy field upgrades and patches, and achieves significant power savings versus software-based compression. ■