RFC 7932

Brotli Compressed Data Format

6. Encoding of Block-Switch Commands
As described in Section 2, a block-switch command is a pair <block
type, block count>. These are encoded in the compressed data part of
the meta-block, right before the start of each new block of a
particular block category.
Each block type in the compressed data is represented with a block
type code, encoded using a prefix code over the block type code
alphabet. A block type symbol 0 means that the new block type is the
same as the type of the previous block from the same block category,
i.e., the block type that preceded the current type, while a block
type symbol 1 means that the new block type equals the current block
type plus one. If the current block type is the maximal possible,
then a block type symbol of 1 results in wrapping to a new block type
of 0. Block type symbols 2..257 represent block types 0..255,
respectively. The previous and current block types are initialized
to 1 and 0, respectively, at the end of the meta-block header.
Since the first block type of each block category is 0, the block
type of the first block-switch command is not encoded in the
compressed data. If a block category has only one block type, the
block count of the first block-switch command is also omitted from
the compressed data; otherwise, it is encoded in the meta-block
header.
Since the end of the meta-block is detected by the number of
uncompressed bytes produced, the block counts for any of the three
categories need not count down to exactly zero at the end of the
meta-block.
The number of different block types in each block category, denoted
by NBLTYPESL, NBLTYPESI, and NBLTYPESD for literals, insert-and-copy
lengths, and distances, respectively, is encoded in the meta-block
header, and it must equal to the largest block type plus one in that
block category. In other words, the set of literal, insert-and-copy
length, and distance block types must be [0..NBLTYPESL-1],
[0..NBLTYPESI-1], and [0..NBLTYPESD-1], respectively. From this it
follows that the alphabet size of literal, insert-and-copy length,
and distance block type codes is NBLTYPESL + 2, NBLTYPESI + 2, and
NBLTYPESD + 2, respectively.
Each block count in the compressed data is represented with a pair
<block count code, extra bits>. The block count code and the extra
bits are encoded back-to-back, the block count code is encoded using
a prefix code over the block count code alphabet, while the extra
bits value is encoded as a fixed-width integer value. The number of
extra bits can be 0..24, and it is dependent on the block count code.

The symbols of the block count code alphabet along with the number of
extra bits and the range of block counts are as follows:
Extra Extra Extra
Code Bits Lengths Code Bits Lengths Code Bits Lengths
---- ---- ------- ---- ---- ------- ---- ---- -------
0 2 1..4 9 4 65..80 18 7 369..496
1 2 5..8 10 4 81..96 19 8 497..752
2 2 9..12 11 4 97..112 20 9 753..1264
3 2 13..16 12 5 113..144 21 10 1265..2288
4 3 17..24 13 5 145..176 22 11 2289..4336
5 3 25..32 14 5 177..208 23 12 4337..8432
6 3 33..40 15 5 209..240 24 13 8433..16624
7 3 41..48 16 6 241..304 25 24 16625..16793840
8 4 49..64 17 6 305..368
The first block-switch command of each block category is special in
the sense that it is encoded in the meta-block header, and as
described earlier, the block type code is omitted since it is an
implicit zero.
7. Context Modeling
As described in Section 2, the prefix tree used to encode a literal
byte or a distance code depends on the block type and the context ID.
This section specifies how to compute the context ID for a particular
literal and distance code and how to encode the context map that maps
a <block type, context ID> pair to the index of a prefix code in the
array of literal and distance prefix codes.
7.1. Context Modes and Context ID Lookup for Literals
The context for encoding the next literal is defined by the last two
bytes in the stream (p1, p2, where p1 is the most recent byte),
regardless of whether these bytes are produced by uncompressed meta-
blocks, backward references, static dictionary references, or by
literal insertions. At the start of the stream, p1 and p2 are
initialized to zero.
There are four methods, called context modes, to compute the Context
ID:
* LSB6, where the Context ID is the value of six least
significant bits of p1,
* MSB6, where the Context ID is the value of six most significant
bits of p1,

7.2. Context ID for Distances
The context for encoding a distance code is defined by the copy
length corresponding to the distance. The context IDs are 0, 1, 2,
and 3 for copy lengths 2, 3, 4, and more than 4, respectively.
7.3. Encoding of the Context Map
There are two context maps, one for literals and one for distances.
The size of the context map is 64 * NBLTYPESL for literals, and 4 *
NBLTYPESD for distances. Each value in the context map is an integer
between 0 and 255, indicating the index of the prefix code to be used
when encoding the next literal or distance.
The context maps are two-dimensional matrices, encoded as one-
dimensional arrays:
CMAPL[0..(64 * NBLTYPESL - 1)]
CMAPD[0..(4 * NBLTYPESD - 1)]
The index of the prefix code for encoding a literal or distance code
with block type, BTYPE_x, and context ID, CIDx, is:
index of literal prefix code = CMAPL[64 * BTYPE_L + CIDL]
index of distance prefix code = CMAPD[4 * BTYPE_D + CIDD]
The values of the context map are encoded with the combination of run
length encoding for zero values and prefix coding. Let RLEMAX denote
the number of run length codes and NTREES denote the maximum value in
the context map plus one. NTREES must equal the number of different
values in the context map; in other words, the different values in
the context map must be the [0..NTREES-1] interval. The alphabet of
the prefix code has the following RLEMAX + NTREES symbols:
0: value zero
1: repeat a zero 2 to 3 times, read 1 bit for repeat length
2: repeat a zero 4 to 7 times, read 2 bits for repeat length
...
RLEMAX: repeat a zero (1 << RLEMAX) to (1 << (RLEMAX+1))-1
times, read RLEMAX bits for repeat length
RLEMAX + 1: value 1
...
RLEMAX + NTREES - 1: value NTREES - 1

If RLEMAX = 0, the run length coding is not used and the symbols of
the alphabet are directly the values in the context map. We can now
define the format of the context map (the same format is used for
literal and distance context maps):
1..5 bits: RLEMAX, 0 is encoded with one 0 bit, and values 1..16
are encoded with bit pattern xxxx1 (so 01001 is 5)
Prefix code with alphabet size NTREES + RLEMAX
Context map size values encoded with the above prefix code and run
length coding for zero values. If a run length would result in
more lengths in total than the size of the context map, then
the stream should be rejected as invalid.
1 bit: IMTF bit, if set, we do an inverse move-to-front transform
on the values in the context map to get the prefix code
indexes.
Note that RLEMAX may be larger than the value necessary to represent
the longest sequence of zero values. Also, the NTREES value is
encoded right before the context map as described in Section 9.2.
We define the inverse move-to-front transform used in this
specification by the following C language function:
void InverseMoveToFrontTransform(uint8_t* v, int v_len) {
uint8_t mtf[256];
int i;
for (i = 0; i < 256; ++i) {
mtf[i] = (uint8_t)i;
}
for (i = 0; i < v_len; ++i) {
uint8_t index = v[i];
uint8_t value = mtf[index];
v[i] = value;
for (; index; --index) {
mtf[index] = mtf[index - 1];
}
mtf[0] = value;
}
}
Note that the inverse move-to-front transform will not produce values
outside the [0..NTREES-1] interval.

8. Static Dictionary
At any given point during decoding the compressed data, a reference
to a duplicated string in the uncompressed data produced so far has a
maximum backward distance value, which is the minimum of the window
size and the number of uncompressed bytes produced. However,
decoding a distance from the compressed stream, as described in
Section 4, can produce distances that are greater than this maximum
allowed value. In this case, the distance is treated as a reference
to a word in the static dictionary given in Appendix A. The copy
length for a static dictionary reference must be between 4 and 24.
The static dictionary has three parts:
* DICT[0..DICTSIZE], an array of bytes
* DOFFSET[0..24], an array of byte-offset values for each length
* NDBITS[0..24], an array of bit-depth values for each length
The number of static dictionary words for a given length is:
NWORDS[length] = 0 (if length < 4)
NWORDS[length] = (1 << NDBITS[length]) (if length >= 4)
DOFFSET and DICTSIZE are defined by the following recursion:
DOFFSET[0] = 0
DOFFSET[length + 1] = DOFFSET[length] + length * NWORDS[length]
DICTSIZE = DOFFSET[24] + 24 * NWORDS[24]
The offset of a word within the DICT array for a given length and
index is:
offset(length, index) = DOFFSET[length] + index * length
Each static dictionary word has 121 different forms, given by
applying a word transformation to a base word in the DICT array. The
list of word transformations is given in Appendix B. The static
dictionary word for a <length, distance> pair can be reconstructed as
follows:
word_id = distance - (max allowed distance + 1)
index = word_id % NWORDS[length]
base_word = DICT[offset(length, index)..offset(length, index+1)-1]
transform_id = word_id >> NDBITS[length]
The string copied to the uncompressed stream is computed by applying
the transformation to the base dictionary word. If transform_id is
greater than 120, or the length is smaller than 4 or greater than 24,
then the compressed stream should be rejected as invalid.

Each word transformation has the following form:
transform_i(word) = prefix_i + T_i(word) + suffix_i
where the _i subscript denotes the transform_id above. Each T_i is
one of the following 21 elementary transforms:
Identity, FermentFirst, FermentAll,
OmitFirst1, ..., OmitFirst9, OmitLast1, ..., OmitLast9
The form of these elementary transforms is as follows:
Identity(word) = word
FermentFirst(word) = see below
FermentAll(word) = see below
OmitFirstk(word) = the last (length(word) - k) bytes of word, or
empty string if length(word) < k
OmitLastk(word) = the first (length(word) - k) bytes of word, or
empty string if length(word) < k

9. Compressed Data Format
In this section, we describe the format of the compressed data set in
terms of the format of the individual data items described in the
previous sections.
9.1. Format of the Stream Header
The stream header has only the following one field:
1..7 bits: WBITS, a value in the range 10..24, encoded with the
following variable-length code (as it appears in the
compressed data, where the bits are parsed from right
to left):
Value Bit Pattern
----- -----------
10 0100001
11 0110001
12 1000001
13 1010001
14 1100001
15 1110001
16 0
17 0000001
18 0011
19 0101
20 0111
21 1001
22 1011
23 1101
24 1111
Note that bit pattern 0010001 is invalid and must not
be used.
The size of the sliding window, which is the maximum value of any
non-dictionary reference backward distance, is given by the following
formula:
window size = (1 << WBITS) - 16

9.2. Format of the Meta-Block Header
A compliant compressed data set has at least one meta-block. Each
meta-block contains a header with information about the uncompressed
length of the meta-block, and a bit signaling if the meta-block is
the last one. The format of the meta-block header is the following:
1 bit: ISLAST, set to 1 if this is the last meta-block
1 bit: ISLASTEMPTY, if set to 1, the meta-block is empty; this
field is only present if ISLAST bit is set -- if it is 1,
then the meta-block and the brotli stream ends at that
bit, with any remaining bits in the last byte of the
compressed stream filled with zeros (if the fill bits are
not zero, then the stream should be rejected as invalid)
2 bits: MNIBBLES, number of nibbles to represent the uncompressed
length, encoded with the following fixed-length code:
Value Bit Pattern
----- -----------
0 11
4 00
5 01
6 10
If MNIBBLES is 0, the meta-block is empty, i.e., it does
not generate any uncompressed data. In this case, the
rest of the meta-block has the following format:
1 bit: reserved, must be zero
2 bits: MSKIPBYTES, number of bytes to represent
metadata length
MSKIPBYTES * 8 bits: MSKIPLEN - 1, where MSKIPLEN is
the number of metadata bytes; this field is
only present if MSKIPBYTES is positive;
otherwise, MSKIPLEN is 0 (if MSKIPBYTES is
greater than 1, and the last byte is all
zeros, then the stream should be rejected as
invalid)
0..7 bits: fill bits until the next byte boundary,
must be all zeros
MSKIPLEN bytes of metadata, not part of the
uncompressed data or the sliding window

MNIBBLES * 4 bits: MLEN - 1, where MLEN is the length of the meta-
block uncompressed data in bytes (if MNIBBLES is greater
than 4, and the last nibble is all zeros, then the stream
should be rejected as invalid)
1 bit: ISUNCOMPRESSED, if set to 1, any bits of compressed data
up to the next byte boundary are ignored, and the rest of
the meta-block contains MLEN bytes of literal data; this
field is only present if the ISLAST bit is not set (if the
ignored bits are not all zeros, the stream should be
rejected as invalid)
1..11 bits: NBLTYPESL, number of literal block types, encoded with
the following variable-length code (as it appears in the
compressed data, where the bits are parsed from right to
left, so 0110111 has the value 12):
Value Bit Pattern
----- -----------
1 0
2 0001
3..4 x0011
5..8 xx0101
9..16 xxx0111
17..32 xxxx1001
33..64 xxxxx1011
65..128 xxxxxx1101
129..256 xxxxxxx1111
Prefix code over the block type code alphabet for literal block
types, appears only if NBLTYPESL >= 2
Prefix code over the block count code alphabet for literal
block counts, appears only if NBLTYPESL >= 2
Block count code + extra bits for first literal block count,
appears only if NBLTYPESL >= 2
1..11 bits: NBLTYPESI, number of insert-and-copy block types,
encoded with the same variable-length code as above
Prefix code over the block type code alphabet for insert-and-
copy block types, appears only if NBLTYPESI >= 2
Prefix code over the block count code alphabet for insert-and-
copy block counts, appears only if NBLTYPESI >= 2

9.3. Format of the Meta-Block Data
The compressed data part of a meta-block consists of a series of
commands. Each command has the following format:
Block type code for next insert-and-copy block type, appears only
if NBLTYPESI >= 2 and the previous insert-and-copy block count
is zero
Block count code + extra bits for next insert-and-copy block
count, appears only if NBLTYPESI >= 2 and the previous insert-
and-copy block count is zero
Insert-and-copy length, encoded as in Section 5, using the insert-
and-copy length prefix code with the current insert-and-copy
block type index
Insert length number of literals, with the following format:
Block type code for next literal block type, appears only if
NBLTYPESL >= 2 and the previous literal block count is zero
Block count code + extra bits for next literal block count,
appears only if NBLTYPESL >= 2 and the previous literal
block count is zero
Next byte of the uncompressed data, encoded with the literal
prefix code with the index determined by the previous two
bytes of the uncompressed data, the current literal block
type, and the context map, as described in Section 7.3
Block type code for next distance block type, appears only if
NBLTYPESD >= 2 and the previous distance block count is zero
Block count code + extra bits for next distance block count,
appears only if NBLTYPESD >= 2 and the previous distance block
count is zero
Distance code, encoded as in Section 4, using the distance prefix
code with the index determined by the copy length, the current
distance block type, and the distance context map, as described
in Section 7.3, appears only if the distance code is not an
implicit 0, as indicated by the insert-and-copy length code

The number of commands in the meta-block is such that the sum of the
uncompressed bytes produced (i.e., the number of literals inserted
plus the number of bytes copied from past data or generated from the
static dictionary) over all the commands gives the uncompressed
length, MLEN encoded in the meta-block header.
If the total number of uncompressed bytes produced after the insert
part of the last command equals MLEN, then the copy length of the
last command is ignored and will not produce any uncompressed output.
In this case, the copy length of the last command can have any value.
In any other case, if the number of literals to insert, the copy
length, or the resulting dictionary word length would cause MLEN to
be exceeded, then the stream should be rejected as invalid.
If the last command of the last non-empty meta-block does not end on
a byte boundary, the unused bits in the last byte must be zeros.
10. Decoding Algorithm
The decoding algorithm that produces the uncompressed data is as
follows:
read window size
do
read ISLAST bit
if ISLAST
read ISLASTEMPTY bit
if ISLASTEMPTY
break from loop
read MNIBBLES
if MNIBBLES is zero
verify reserved bit is zero
read MSKIPLEN
skip any bits up to the next byte boundary
skip MSKIPLEN bytes
continue to the next meta-block
else
read MLEN
if not ISLAST
read ISUNCOMPRESSED bit
if ISUNCOMPRESSED
skip any bits up to the next byte boundary
copy MLEN bytes of compressed data as literals
continue to the next meta-block

if distance code is implicit zero from insert-and-copy code
set backward distance to the last distance
else
if BLEN_D is zero
read block type using HTREE_BTYPE_D and set BTYPE_D
save previous block type
read block count using HTREE_BLEN_D and set BLEN_D
decrement BLEN_D
compute context ID, CIDD from CLEN
read distance code using HTREED[CMAPD[4*BTYPE_D + CIDD]]
compute distance by distance short code substitution
if distance code is not zero,
and distance is not a static dictionary reference,
push distance to the ring buffer of last distances
if distance is less than the max allowed distance plus one
move backwards distance bytes in the uncompressed data,
and copy CLEN bytes from this position to
the uncompressed stream
else
look up the static dictionary word, transform the word as
directed, and copy the result to the uncompressed stream
while number of uncompressed bytes for this meta-block < MLEN
while not ISLAST
If the stream ends before the completion of the last meta-block, then
the stream should be rejected as invalid.
Note that a duplicated string reference may refer to a string in a
previous meta-block, i.e., the backward distance may cross one or
more meta-block boundaries. However, a backward copy distance will
not refer past the beginning of the uncompressed stream or the window
size; any such distance is interpreted as a reference to a static
dictionary word. Also, note that the referenced string may overlap
the current position, for example, if the last 2 bytes decoded have
values X and Y, a string reference with <length = 5, distance = 2>
adds X,Y,X,Y,X to the uncompressed stream.
11. Considerations for Compressor Implementations
Since the intent of this document is to define the brotli compressed
data format without reference to any particular compression
algorithm, the material in this section is not part of the definition
of the format, and a compressor need not follow it in order to be
compliant.

11.1. Trivial Compressor
In this section, we present a very simple algorithm that produces a
valid brotli stream representing an arbitrary sequence of
uncompressed bytes in the form of the following C++ language
function.
string BrotliCompressTrivial(const string& u) {
if (u.empty()) {
return string(1, 6);
}
int i;
string c;
c.append(1, 12);
for (i = 0; i + 65535 < u.size(); i += 65536) {
c.append(1, 248);
c.append(1, 255);
c.append(1, 15);
c.append(&u[i], 65536);
}
if (i < u.size()) {
int r = u.size() - i - 1;
c.append(1, (r & 31) << 3);
c.append(1, r >> 5);
c.append(1, 8 + (r >> 13));
c.append(&u[i], r + 1);
}
c.append(1, 3);
return c;
}
Note that this simple algorithm does not actually compress data, that
is, the brotli representation will always be bigger than the
original, but it shows that every sequence of N uncompressed bytes
can be represented with a valid brotli stream that is not longer than
N + (3 * (N >> 16) + 5) bytes.
11.2. Aligning Compressed Meta-Blocks to Byte Boundaries
As described in Section 9, only those meta-blocks that immediately
follow an uncompressed meta-block or a metadata meta-block are
guaranteed to start on a byte boundary. In some applications, it
might be required that every non-metadata meta-block starts on a byte
boundary. This can be achieved by appending an empty metadata meta-
block after every non-metadata meta-block that does not end on a byte
boundary.

11.3. Creating Self-Contained Parts within the Compressed Data
In some encoder implementations, it might be required to make a
sequence of bytes within a brotli stream self-contained, that is,
such that they can be decompressed independently from previous parts
of the compressed data. This is a useful feature for three reasons.
First, if a large compressed file is damaged, it is possible to
recover some of the file after the damage. Second, it is useful when
doing differential transfer of compressed data. If a sequence of
uncompressed bytes is unchanged and compressed independently from
previous data, then the compressed representation may also be
unchanged and can therefore be transferred very cheaply. Third, if
sequences of uncompressed bytes are compressed independently, it
allows for parallel compression of these byte sequences within the
same file, in addition to parallel compression of multiple files.
Given two sequences of uncompressed bytes, U0 and U1, we will now
describe how to create two sequences of compressed bytes, C0 and C1,
such that the concatenation of C0 and C1 is a valid brotli stream,
and that C0 and C1 (together with the first byte of C0 that contains
the window size) can be decompressed independently from each other to
U0 and U1.
When compressing the byte sequence U0 to produce C0, we can use any
compressor that works on the complete set of uncompressed bytes U0,
with the following two changes. First, the ISLAST bit of the last
meta-block of C0 must not be set. Second, C0 must end at a byte-
boundary, which can be ensured by appending an empty metadata meta-
block to it, as in Section 11.2.
When compressing the byte sequence U1 to produce C1, we can use any
compressor that starts a new meta-block at the beginning of U1 within
the U0+U1 input stream, with the following two changes. First,
backward distances in C1 must not refer to static dictionary words or
uncompressed bytes in U0. Even if a sequence of bytes in U1 would
match a static dictionary word, or a sequence of bytes that overlaps
U0, the compressor must represent this sequence of bytes with a
combination of literal insertions and backward references to bytes in
U1 instead. Second, the ring buffer of last four distances must be
replenished first with distances in C1 before using it to encode
other distances in C1. Note that both compressors producing C0 and
C1 have to use the same window size, but the stream header is emitted
only by the compressor that produces C0.
Note that this method can be easily generalized to more than two
sequences of uncompressed bytes.

12. Security Considerations
As with any compressed file formats, decompressor implementations
should handle all compressed data byte sequences, not only those that
conform to this specification, where non-conformant compressed data
sequences should be rejected as invalid.
A possible attack against a system containing a decompressor
implementation (e.g., a web browser) is to exploit a buffer overflow
triggered by invalid compressed data. Therefore, decompressor
implementations should perform bounds-checking for each memory access
that result from values decoded from the compressed stream and
derivatives thereof.
Another possible attack against a system containing a decompressor
implementation is to provide it (either valid or invalid) compressed
data that can make the decompressor system's resource consumption
(CPU, memory, or storage) to be disproportionately large compared to
the size of the compressed data. In addition to the size of the
compressed data, the amount of CPU, memory, and storage required to
decompress a single compressed meta-block within a brotli stream is
controlled by the following two parameters: the size of the
uncompressed meta-block, which is encoded at the start of the
compressed meta-block, and the size of the sliding window, which is
encoded at the start of the brotli stream. Decompressor
implementations in systems where memory or storage is constrained
should perform a sanity-check on these two parameters. The
uncompressed meta-block size that was decoded from the compressed
stream should be compared against either a hard limit, given by the
system's constraints or some expectation about the uncompressed data,
or against a certain multiple of the size of the compressed data. If
the uncompressed meta-block size is determined to be too high, the
compressed data should be rejected. Likewise, when the complete
uncompressed stream is kept in the system containing the decompressor
implementation, the total uncompressed size of the stream should be
checked before decompressing each additional meta-block. If the size
of the sliding window that was decoded from the start of the
compressed stream is greater than a certain soft limit, then the
decompressor implementation should, at first, allocate a smaller
sliding window that fits the first uncompressed meta-block, and
afterwards, before decompressing each additional meta-block, it
should increase the size of the sliding window until the sliding
window size specified in the compressed data is reached.

Correspondingly, possible attacks against a system containing a
compressor implementation (e.g., a web server) are to exploit a
buffer overflow or cause disproportionately large resource
consumption by providing, e.g., uncompressible data. As described in
Section 11.1, an output buffer of
S(N) = N + (3 * (N >> 16) + 5)
bytes is sufficient to hold a valid compressed brotli stream
representing an arbitrary sequence of N uncompressed bytes.
Therefore, compressor implementations should allocate at least S(N)
bytes of output buffer before compressing N bytes of data with
unknown compressibility and should perform bounds-checking for each
write into this output buffer. If their output buffer is full,
compressor implementations should revert to the trivial compression
algorithm described in Section 11.1. The resource consumption of a
compressor implementation for a particular input data depends mostly
on the algorithm used to find backward matches and on the algorithm
used to construct context maps and prefix codes and only to a lesser
extent on the input data itself. If the system containing a
compressor implementation is overloaded, a possible way to reduce
resource usage is to switch to more simple algorithms for backward
reference search and prefix code construction, or to fall back to the
trivial compression algorithm described in Section 11.1.
A possible attack against a system that sends compressed data over an
encrypted channel is the following. An attacker who can repeatedly
mix arbitrary (attacker-supplied) data with secret data (passwords,
cookies) and observe the length of the ciphertext can potentially
reconstruct the secret data. To protect against this kind of attack,
applications should not mix sensitive data with non-sensitive,
potentially attacker-supplied data in the same compressed stream.
13. IANA Considerations
The "HTTP Content Coding Registry" has been updated with the
registration below:
+-------+-------------------------------------+------------+
| Name | Description | Reference |
+-------+-------------------------------------+------------+
| br | Brotli Compressed Data Format | RFC 7932 |
+-------+-------------------------------------+------------+