1. Compression algorithm (deflate)The deflation algorithm used by gzip (also zip and zlib) is a variation ofLZ77 (Lempel-Ziv 1977, see reference below). It finds duplicated strings inthe input data. The second occurrence of a string is replaced by apointer to the previous string, in the form of a pair (distance,length). Distances are limited to 32K bytes, and lengths are limitedto 258 bytes. When a string does not occur anywhere in the previous32K bytes, it is emitted as a sequence of literal bytes. (In thisdescription, `string' must be taken as an arbitrary sequence of bytes,and is not restricted to printable characters.)Literals or match lengths are compressed with one Huffman tree, andmatch distances are compressed with another tree. The trees are storedin a compact form at the start of each block. The blocks can have anysize (except that the compressed data for one block must fit inavailable memory). A block is terminated when deflate() determines thatit would be useful to start another block with fresh trees. (This issomewhat similar to the behavior of LZW-based _compress_.)Duplicated strings are found using a hash table. All input strings oflength 3 are inserted in the hash table. A hash index is computed forthe next 3 bytes. If the hash chain for this index is not empty, allstrings in the chain are compared with the current input string, andthe longest match is selected.The hash chains are searched starting with the most recent strings, tofavor small distances and thus take advantage of the Huffman encoding.The hash chains are singly linked. There are no deletions from thehash chains, the algorithm simply discards matches that are too old.To avoid a worst-case situation, very long hash chains are arbitrarilytruncated at a certain length, determined by a runtime option (levelparameter of deflateInit). So deflate() does not always find the longestpossible match but generally finds a match which is long enough.deflate() also defers the selection of matches with a lazy evaluationmechanism. After a match of length N has been found, deflate() searches fora longer match at the next input byte. If a longer match is found, theprevious match is truncated to a length of one (thus producing a singleliteral byte) and the process of lazy evaluation begins again. Otherwise,the original match is kept, and the next match search is attempted only Nsteps later.The lazy match evaluation is also subject to a runtime parameter. Ifthe current match is long enough, deflate() reduces the search for a longermatch, thus speeding up the whole process. If compression ratio is moreimportant than speed, deflate() attempts a complete second search even ifthe first match is already long enough.The lazy match evaluation is not performed for the fastest compressionmodes (level parameter 1 to 3). For these fast modes, new stringsare inserted in the hash table only when no match was found, orwhen the match is not too long. This degrades the compression ratiobut saves time since there are both fewer insertions and fewer searches.2. Decompression algorithm (inflate)2.1 IntroductionThe key question is how to represent a Huffman code (or any prefix code) sothat you can decode fast. The most important characteristic is that shortercodes are much more common than longer codes, so pay attention to decoding theshort codes fast, and let the long codes take longer to decode.inflate() sets up a first level table that covers some number of bits ofinput less than the length of longest code. It gets that many bits from thestream, and looks it up in the table. The table will tell if the nextcode is that many bits or less and how many, and if it is, it will tellthe value, else it will point to the next level table for which inflate()grabs more bits and tries to decode a longer code.How many bits to make the first lookup is a tradeoff between the time ittakes to decode and the time it takes to build the table. If building thetable took no time (and if you had infinite memory), then there would onlybe a first level table to cover all the way to the longest code. However,building the table ends up taking a lot longer for more bits since shortcodes are replicated many times in such a table. What inflate() does issimply to make the number of bits in the first table a variable, and thento set that variable for the maximum speed.For inflate, which has 286 possible codes for the literal/length tree, the sizeof the first table is nine bits. Also the distance trees have 30 possiblevalues, and the size of the first table is six bits. Note that for each ofthose cases, the table ended up one bit longer than the ``average'' codelength, i.e. the code length of an approximately flat code which would be alittle more than eight bits for 286 symbols and a little less than five bitsfor 30 symbols.2.2 More details on the inflate table lookupOk, you want to know what this cleverly obfuscated inflate tree actuallylooks like. You are correct that it's not a Huffman tree. It is simply alookup table for the first, let's say, nine bits of a Huffman symbol. Thesymbol could be as short as one bit or as long as 15 bits. If a particularsymbol is shorter than nine bits, then that symbol's translation is duplicatedin all those entries that start with that symbol's bits. For example, if thesymbol is four bits, then it's duplicated 32 times in a nine-bit table. If asymbol is nine bits long, it appears in the table once.If the symbol is longer than nine bits, then that entry in the table pointsto another similar table for the remaining bits. Again, there are duplicatedentries as needed. The idea is that most of the time the symbol will be shortand there will only be one table look up. (That's whole idea behind datacompression in the first place.) For the less frequent long symbols, therewill be two lookups. If you had a compression method with really longsymbols, you could have as many levels of lookups as is efficient. Forinflate, two is enough.So a table entry either points to another table (in which case nine bits inthe above example are gobbled), or it contains the translation for the symboland the number of bits to gobble. Then you start again with the nextungobbled bit.You may wonder: why not just have one lookup table for how ever many bits thelongest symbol is? The reason is that if you do that, you end up spendingmore time filling in duplicate symbol entries than you do actually decoding.At least for deflate's output that generates new trees every several 10's ofkbytes. You can imagine that filling in a 2^15 entry table for a 15-bit codewould take too long if you're only decoding several thousand symbols. At theother extreme, you could make a new table for every bit in the code. In fact,that's essentially a Huffman tree. But then you spend too much timetraversing the tree while decoding, even for short symbols.So the number of bits for the first lookup table is a trade of the time tofill out the table vs. the time spent looking at the second level and above ofthe table.Here is an example, scaled down:The code being decoded, with 10 symbols, from 1 to 6 bits long:A: 0B: 10C: 1100D: 11010E: 11011F: 11100G: 11101H: 11110I: 111110J: 111111Let's make the first table three bits long (eight entries):000: A,1001: A,1010: A,1011: A,1100: B,2101: B,2110: -> table X (gobble 3 bits)111: -> table Y (gobble 3 bits)Each entry is what the bits decode as and how many bits that is, i.e. howmany bits to gobble. Or the entry points to another table, with the number ofbits to gobble implicit in the size of the table.Table X is two bits long since the longest code starting with 110 is five bitslong:00: C,101: C,110: D,211: E,2Table Y is three bits long since the longest code starting with 111 is sixbits long:000: F,2001: F,2010: G,2011: G,2100: H,2101: H,2110: I,3111: J,3So what we have here are three tables with a total of 20 entries that had tobe constructed. That's compared to 64 entries for a single table. Orcompared to 16 entries for a Huffman tree (six two entry tables and one fourentry table). Assuming that the code ideally represents the probability ofthe symbols, it takes on the average 1.25 lookups per symbol. That's comparedto one lookup for the single table, or 1.66 lookups per symbol for theHuffman tree.There, I think that gives you a picture of what's going on. For inflate, themeaning of a particular symbol is often more than just a letter. It can be abyte (a "literal"), or it can be either a length or a distance whichindicates a base value and a number of bits to fetch after the code that isadded to the base value. Or it might be the special end-of-block code. Thedata structures created in inftrees.c try to encode all that informationcompactly in the tables.Jean-loup Gailly Mark Adlerjloup@gzip.org madler@alumni.caltech.eduReferences:[LZ77] Ziv J., Lempel A., ``A Universal Algorithm for Sequential DataCompression,'' IEEE Transactions on Information Theory, Vol. 23, No. 3,pp. 337-343.``DEFLATE Compressed Data Format Specification'' available inhttp://tools.ietf.org/html/rfc1951

Network Working Group P. DeutschRequest for Comments: 1950 Aladdin EnterprisesCategory: Informational J-L. Gailly Info-ZIP May 1996 ZLIB Compressed Data Format Specification version 3.3Status of This Memo This memo provides information for the Internet community. This memo does not specify an Internet standard of any kind. Distribution of this memo is unlimited.IESG Note: The IESG takes no position on the validity of any Intellectual Property Rights statements contained in this document.Notices Copyright (c) 1996 L. Peter Deutsch and Jean-Loup Gailly Permission is granted to copy and distribute this document for any purpose and without charge, including translations into other languages and incorporation into compilations, provided that the copyright notice and this notice are preserved, and that any substantive changes or deletions from the original are clearly marked. A pointer to the latest version of this and related documentation in HTML format can be found at the URL <ftp://ftp.uu.net/graphics/png/documents/zlib/zdoc-index.html>.Abstract This specification defines a lossless compressed data format. The data can be produced or consumed, even for an arbitrarily long sequentially presented input data stream, using only an a priori bounded amount of intermediate storage. The format presently uses the DEFLATE compression method but can be easily extended to use other compression methods. It can be implemented readily in a manner not covered by patents. This specification also defines the ADLER-32 checksum (an extension and improvement of the Fletcher checksum), used for detection of data corruption, and provides an algorithm for computing it.Deutsch & Gailly Informational [Page 1]RFC 1950 ZLIB Compressed Data Format Specification May 1996Table of Contents 1. Introduction ................................................... 2 1.1. Purpose ................................................... 2 1.2. Intended audience ......................................... 3 1.3. Scope ..................................................... 3 1.4. Compliance ................................................ 3 1.5. Definitions of terms and conventions used ................ 3 1.6. Changes from previous versions ............................ 3 2. Detailed specification ......................................... 3 2.1. Overall conventions ....................................... 3 2.2. Data format ............................................... 4 2.3. Compliance ................................................ 7 3. References ..................................................... 7 4. Source code .................................................... 8 5. Security Considerations ........................................ 8 6. Acknowledgements ............................................... 8 7. Authors' Addresses ............................................. 8 8. Appendix: Rationale ............................................ 9 9. Appendix: Sample code ..........................................101. Introduction 1.1. Purpose The purpose of this specification is to define a lossless compressed data format that: * Is independent of CPU type, operating system, file system, and character set, and hence can be used for interchange; * Can be produced or consumed, even for an arbitrarily long sequentially presented input data stream, using only an a priori bounded amount of intermediate storage, and hence can be used in data communications or similar structures such as Unix filters; * Can use a number of different compression methods; * Can be implemented readily in a manner not covered by patents, and hence can be practiced freely. The data format defined by this specification does not attempt to allow random access to compressed data.Deutsch & Gailly Informational [Page 2]RFC 1950 ZLIB Compressed Data Format Specification May 1996 1.2. Intended audience This specification is intended for use by implementors of software to compress data into zlib format and/or decompress data from zlib format. The text of the specification assumes a basic background in programming at the level of bits and other primitive data representations. 1.3. Scope The specification specifies a compressed data format that can be used for in-memory compression of a sequence of arbitrary bytes. 1.4. Compliance Unless otherwise indicated below, a compliant decompressor must be able to accept and decompress any data set that conforms to all the specifications presented here; a compliant compressor must produce data sets that conform to all the specifications presented here. 1.5. Definitions of terms and conventions used byte: 8 bits stored or transmitted as a unit (same as an octet). (For this specification, a byte is exactly 8 bits, even on machines which store a character on a number of bits different from 8.) See below, for the numbering of bits within a byte. 1.6. Changes from previous versions Version 3.1 was the first public release of this specification. In version 3.2, some terminology was changed and the Adler-32 sample code was rewritten for clarity. In version 3.3, the support for a preset dictionary was introduced, and the specification was converted to RFC style.2. Detailed specification 2.1. Overall conventions In the diagrams below, a box like this: +---+ | | <-- the vertical bars might be missing +---+Deutsch & Gailly Informational [Page 3]RFC 1950 ZLIB Compressed Data Format Specification May 1996 represents one byte; a box like this: +==============+ | | +==============+ represents a variable number of bytes. Bytes stored within a computer do not have a "bit order", since they are always treated as a unit. However, a byte considered as an integer between 0 and 255 does have a most- and least- significant bit, and since we write numbers with the most- significant digit on the left, we also write bytes with the most- significant bit on the left. In the diagrams below, we number the bits of a byte so that bit 0 is the least-significant bit, i.e., the bits are numbered: +--------+ |76543210| +--------+ Within a computer, a number may occupy multiple bytes. All multi-byte numbers in the format described here are stored with the MOST-significant byte first (at the lower memory address). For example, the decimal number 520 is stored as: 0 1 +--------+--------+ |00000010|00001000| +--------+--------+ ^ ^ | | | + less significant byte = 8 + more significant byte = 2 x 256 2.2. Data format A zlib stream has the following structure: 0 1 +---+---+ |CMF|FLG| (more-->) +---+---+Deutsch & Gailly Informational [Page 4]RFC 1950 ZLIB Compressed Data Format Specification May 1996 (if FLG.FDICT set) 0 1 2 3 +---+---+---+---+ | DICTID | (more-->) +---+---+---+---+ +=====================+---+---+---+---+ |...compressed data...| ADLER32 | +=====================+---+---+---+---+ Any data which may appear after ADLER32 are not part of the zlib stream. CMF (Compression Method and flags) This byte is divided into a 4-bit compression method and a 4- bit information field depending on the compression method. bits 0 to 3 CM Compression method bits 4 to 7 CINFO Compression info CM (Compression method) This identifies the compression method used in the file. CM = 8 denotes the "deflate" compression method with a window size up to 32K. This is the method used by gzip and PNG (see references [1] and [2] in Chapter 3, below, for the reference documents). CM = 15 is reserved. It might be used in a future version of this specification to indicate the presence of an extra field before the compressed data. CINFO (Compression info) For CM = 8, CINFO is the base-2 logarithm of the LZ77 window size, minus eight (CINFO=7 indicates a 32K window size). Values of CINFO above 7 are not allowed in this version of the specification. CINFO is not defined in this specification for CM not equal to 8. FLG (FLaGs) This flag byte is divided as follows: bits 0 to 4 FCHECK (check bits for CMF and FLG) bit 5 FDICT (preset dictionary) bits 6 to 7 FLEVEL (compression level) The FCHECK value must be such that CMF and FLG, when viewed as a 16-bit unsigned integer stored in MSB order (CMF*256 + FLG), is a multiple of 31.Deutsch & Gailly Informational [Page 5]RFC 1950 ZLIB Compressed Data Format Specification May 1996 FDICT (Preset dictionary) If FDICT is set, a DICT dictionary identifier is present immediately after the FLG byte. The dictionary is a sequence of bytes which are initially fed to the compressor without producing any compressed output. DICT is the Adler-32 checksum of this sequence of bytes (see the definition of ADLER32 below). The decompressor can use this identifier to determine which dictionary has been used by the compressor. FLEVEL (Compression level) These flags are available for use by specific compression methods. The "deflate" method (CM = 8) sets these flags as follows: 0 - compressor used fastest algorithm 1 - compressor used fast algorithm 2 - compressor used default algorithm 3 - compressor used maximum compression, slowest algorithm The information in FLEVEL is not needed for decompression; it is there to indicate if recompression might be worthwhile. compressed data For compression method 8, the compressed data is stored in the deflate compressed data format as described in the document "DEFLATE Compressed Data Format Specification" by L. Peter Deutsch. (See reference [3] in Chapter 3, below) Other compressed data formats are not specified in this version of the zlib specification. ADLER32 (Adler-32 checksum) This contains a checksum value of the uncompressed data (excluding any dictionary data) computed according to Adler-32 algorithm. This algorithm is a 32-bit extension and improvement of the Fletcher algorithm, used in the ITU-T X.224 / ISO 8073 standard. See references [4] and [5] in Chapter 3, below) Adler-32 is composed of two sums accumulated per byte: s1 is the sum of all bytes, s2 is the sum of all s1 values. Both sums are done modulo 65521. s1 is initialized to 1, s2 to zero. The Adler-32 checksum is stored as s2*65536 + s1 in most- significant-byte first (network) order.Deutsch & Gailly Informational [Page 6]RFC 1950 ZLIB Compressed Data Format Specification May 1996 2.3. Compliance A compliant compressor must produce streams with correct CMF, FLG and ADLER32, but need not support preset dictionaries. When the zlib data format is used as part of another standard data format, the compressor may use only preset dictionaries that are specified by this other data format. If this other format does not use the preset dictionary feature, the compressor must not set the FDICT flag. A compliant decompressor must check CMF, FLG, and ADLER32, and provide an error indication if any of these have incorrect values. A compliant decompressor must give an error indication if CM is not one of the values defined in this specification (only the value 8 is permitted in this version), since another value could indicate the presence of new features that would cause subsequent data to be interpreted incorrectly. A compliant decompressor must give an error indication if FDICT is set and DICTID is not the identifier of a known preset dictionary. A decompressor may ignore FLEVEL and still be compliant. When the zlib data format is being used as a part of another standard format, a compliant decompressor must support all the preset dictionaries specified by the other format. When the other format does not use the preset dictionary feature, a compliant decompressor must reject any stream in which the FDICT flag is set.3. References [1] Deutsch, L.P.,"GZIP Compressed Data Format Specification", available in ftp://ftp.uu.net/pub/archiving/zip/doc/ [2] Thomas Boutell, "PNG (Portable Network Graphics) specification", available in ftp://ftp.uu.net/graphics/png/documents/ [3] Deutsch, L.P.,"DEFLATE Compressed Data Format Specification", available in ftp://ftp.uu.net/pub/archiving/zip/doc/ [4] Fletcher, J. G., "An Arithmetic Checksum for Serial Transmissions," IEEE Transactions on Communications, Vol. COM-30, No. 1, January 1982, pp. 247-252. [5] ITU-T Recommendation X.224, Annex D, "Checksum Algorithms," November, 1993, pp. 144, 145. (Available from gopher://info.itu.ch). ITU-T X.244 is also the same as ISO 8073.Deutsch & Gailly Informational [Page 7]RFC 1950 ZLIB Compressed Data Format Specification May 19964. Source code Source code for a C language implementation of a "zlib" compliant library is available at ftp://ftp.uu.net/pub/archiving/zip/zlib/.5. Security Considerations A decoder that fails to check the ADLER32 checksum value may be subject to undetected data corruption.6. Acknowledgements Trademarks cited in this document are the property of their respective owners. Jean-Loup Gailly and Mark Adler designed the zlib format and wrote the related software described in this specification. Glenn Randers-Pehrson converted this document to RFC and HTML format.7. Authors' Addresses L. Peter Deutsch Aladdin Enterprises 203 Santa Margarita Ave. Menlo Park, CA 94025 Phone: (415) 322-0103 (AM only) FAX: (415) 322-1734 EMail: <ghost@aladdin.com> Jean-Loup Gailly EMail: <gzip@prep.ai.mit.edu> Questions about the technical content of this specification can be sent by email to Jean-Loup Gailly <gzip@prep.ai.mit.edu> and Mark Adler <madler@alumni.caltech.edu> Editorial comments on this specification can be sent by email to L. Peter Deutsch <ghost@aladdin.com> and Glenn Randers-Pehrson <randeg@alumni.rpi.edu>Deutsch & Gailly Informational [Page 8]RFC 1950 ZLIB Compressed Data Format Specification May 19968. Appendix: Rationale 8.1. Preset dictionaries A preset dictionary is specially useful to compress short input sequences. The compressor can take advantage of the dictionary context to encode the input in a more compact manner. The decompressor can be initialized with the appropriate context by virtually decompressing a compressed version of the dictionary without producing any output. However for certain compression algorithms such as the deflate algorithm this operation can be achieved without actually performing any decompression. The compressor and the decompressor must use exactly the same dictionary. The dictionary may be fixed or may be chosen among a certain number of predefined dictionaries, according to the kind of input data. The decompressor can determine which dictionary has been chosen by the compressor by checking the dictionary identifier. This document does not specify the contents of predefined dictionaries, since the optimal dictionaries are application specific. Standard data formats using this feature of the zlib specification must precisely define the allowed dictionaries. 8.2. The Adler-32 algorithm The Adler-32 algorithm is much faster than the CRC32 algorithm yet still provides an extremely low probability of undetected errors. The modulo on unsigned long accumulators can be delayed for 5552 bytes, so the modulo operation time is negligible. If the bytes are a, b, c, the second sum is 3a + 2b + c + 3, and so is position and order sensitive, unlike the first sum, which is just a checksum. That 65521 is prime is important to avoid a possible large class of two-byte errors that leave the check unchanged. (The Fletcher checksum uses 255, which is not prime and which also makes the Fletcher check insensitive to single byte changes 0 <-> 255.) The sum s1 is initialized to 1 instead of zero to make the length of the sequence part of s2, so that the length does not have to be checked separately. (Any sequence of zeroes has a Fletcher checksum of zero.)Deutsch & Gailly Informational [Page 9]RFC 1950 ZLIB Compressed Data Format Specification May 19969. Appendix: Sample code The following C code computes the Adler-32 checksum of a data buffer. It is written for clarity, not for speed. The sample code is in the ANSI C programming language. Non C users may find it easier to read with these hints: & Bitwise AND operator. >> Bitwise right shift operator. When applied to an unsigned quantity, as here, right shift inserts zero bit(s) at the left. << Bitwise left shift operator. Left shift inserts zero bit(s) at the right. ++ "n++" increments the variable n. % modulo operator: a % b is the remainder of a divided by b. #define BASE 65521 /* largest prime smaller than 65536 */ /* Update a running Adler-32 checksum with the bytes buf[0..len-1] and return the updated checksum. The Adler-32 checksum should be initialized to 1. Usage example: unsigned long adler = 1L; while (read_buffer(buffer, length) != EOF) { adler = update_adler32(adler, buffer, length); } if (adler != original_adler) error(); */ unsigned long update_adler32(unsigned long adler, unsigned char *buf, int len) { unsigned long s1 = adler & 0xffff; unsigned long s2 = (adler >> 16) & 0xffff; int n; for (n = 0; n < len; n++) { s1 = (s1 + buf[n]) % BASE; s2 = (s2 + s1) % BASE; } return (s2 << 16) + s1; } /* Return the adler32 of the bytes buf[0..len-1] */Deutsch & Gailly Informational [Page 10]RFC 1950 ZLIB Compressed Data Format Specification May 1996 unsigned long adler32(unsigned char *buf, int len) { return update_adler32(1L, buf, len); }Deutsch & Gailly Informational [Page 11]

Network Working Group P. DeutschRequest for Comments: 1951 Aladdin EnterprisesCategory: Informational May 1996 DEFLATE Compressed Data Format Specification version 1.3Status of This Memo This memo provides information for the Internet community. This memo does not specify an Internet standard of any kind. Distribution of this memo is unlimited.IESG Note: The IESG takes no position on the validity of any Intellectual Property Rights statements contained in this document.Notices Copyright (c) 1996 L. Peter Deutsch Permission is granted to copy and distribute this document for any purpose and without charge, including translations into other languages and incorporation into compilations, provided that the copyright notice and this notice are preserved, and that any substantive changes or deletions from the original are clearly marked. A pointer to the latest version of this and related documentation in HTML format can be found at the URL <ftp://ftp.uu.net/graphics/png/documents/zlib/zdoc-index.html>.Abstract This specification defines a lossless compressed data format that compresses data using a combination of the LZ77 algorithm and Huffman coding, with efficiency comparable to the best currently available general-purpose compression methods. The data can be produced or consumed, even for an arbitrarily long sequentially presented input data stream, using only an a priori bounded amount of intermediate storage. The format can be implemented readily in a manner not covered by patents.Deutsch Informational [Page 1]RFC 1951 DEFLATE Compressed Data Format Specification May 1996Table of Contents 1. Introduction ................................................... 2 1.1. Purpose ................................................... 2 1.2. Intended audience ......................................... 3 1.3. Scope ..................................................... 3 1.4. Compliance ................................................ 3 1.5. Definitions of terms and conventions used ................ 3 1.6. Changes from previous versions ............................ 4 2. Compressed representation overview ............................. 4 3. Detailed specification ......................................... 5 3.1. Overall conventions ....................................... 5 3.1.1. Packing into bytes .................................. 5 3.2. Compressed block format ................................... 6 3.2.1. Synopsis of prefix and Huffman coding ............... 6 3.2.2. Use of Huffman coding in the "deflate" format ....... 7 3.2.3. Details of block format ............................. 9 3.2.4. Non-compressed blocks (BTYPE=00) ................... 11 3.2.5. Compressed blocks (length and distance codes) ...... 11 3.2.6. Compression with fixed Huffman codes (BTYPE=01) .... 12 3.2.7. Compression with dynamic Huffman codes (BTYPE=10) .. 13 3.3. Compliance ............................................... 14 4. Compression algorithm details ................................. 14 5. References .................................................... 16 6. Security Considerations ....................................... 16 7. Source code ................................................... 16 8. Acknowledgements .............................................. 16 9. Author's Address .............................................. 171. Introduction 1.1. Purpose The purpose of this specification is to define a lossless compressed data format that: * Is independent of CPU type, operating system, file system, and character set, and hence can be used for interchange; * Can be produced or consumed, even for an arbitrarily long sequentially presented input data stream, using only an a priori bounded amount of intermediate storage, and hence can be used in data communications or similar structures such as Unix filters; * Compresses data with efficiency comparable to the best currently available general-purpose compression methods, and in particular considerably better than the "compress" program; * Can be implemented readily in a manner not covered by patents, and hence can be practiced freely;Deutsch Informational [Page 2]RFC 1951 DEFLATE Compressed Data Format Specification May 1996 * Is compatible with the file format produced by the current widely used gzip utility, in that conforming decompressors will be able to read data produced by the existing gzip compressor. The data format defined by this specification does not attempt to: * Allow random access to compressed data; * Compress specialized data (e.g., raster graphics) as well as the best currently available specialized algorithms. A simple counting argument shows that no lossless compression algorithm can compress every possible input data set. For the format defined here, the worst case expansion is 5 bytes per 32K- byte block, i.e., a size increase of 0.015% for large data sets. English text usually compresses by a factor of 2.5 to 3; executable files usually compress somewhat less; graphical data such as raster images may compress much more. 1.2. Intended audience This specification is intended for use by implementors of software to compress data into "deflate" format and/or decompress data from "deflate" format. The text of the specification assumes a basic background in programming at the level of bits and other primitive data representations. Familiarity with the technique of Huffman coding is helpful but not required. 1.3. Scope The specification specifies a method for representing a sequence of bytes as a (usually shorter) sequence of bits, and a method for packing the latter bit sequence into bytes. 1.4. Compliance Unless otherwise indicated below, a compliant decompressor must be able to accept and decompress any data set that conforms to all the specifications presented here; a compliant compressor must produce data sets that conform to all the specifications presented here. 1.5. Definitions of terms and conventions used Byte: 8 bits stored or transmitted as a unit (same as an octet). For this specification, a byte is exactly 8 bits, even on machinesDeutsch Informational [Page 3]RFC 1951 DEFLATE Compressed Data Format Specification May 1996 which store a character on a number of bits different from eight. See below, for the numbering of bits within a byte. String: a sequence of arbitrary bytes. 1.6. Changes from previous versions There have been no technical changes to the deflate format since version 1.1 of this specification. In version 1.2, some terminology was changed. Version 1.3 is a conversion of the specification to RFC style.2. Compressed representation overview A compressed data set consists of a series of blocks, corresponding to successive blocks of input data. The block sizes are arbitrary, except that non-compressible blocks are limited to 65,535 bytes. Each block is compressed using a combination of the LZ77 algorithm and Huffman coding. The Huffman trees for each block are independent of those for previous or subsequent blocks; the LZ77 algorithm may use a reference to a duplicated string occurring in a previous block, up to 32K input bytes before. Each block consists of two parts: a pair of Huffman code trees that describe the representation of the compressed data part, and a compressed data part. (The Huffman trees themselves are compressed using Huffman encoding.) The compressed data consists of a series of elements of two types: literal bytes (of strings that have not been detected as duplicated within the previous 32K input bytes), and pointers to duplicated strings, where a pointer is represented as a pair <length, backward distance>. The representation used in the "deflate" format limits distances to 32K bytes and lengths to 258 bytes, but does not limit the size of a block, except for uncompressible blocks, which are limited as noted above. Each type of value (literals, distances, and lengths) in the compressed data is represented using a Huffman code, using one code tree for literals and lengths and a separate code tree for distances. The code trees for each block appear in a compact form just before the compressed data for that block.Deutsch Informational [Page 4]RFC 1951 DEFLATE Compressed Data Format Specification May 19963. Detailed specification 3.1. Overall conventions In the diagrams below, a box like this: +---+ | | <-- the vertical bars might be missing +---+ represents one byte; a box like this: +==============+ | | +==============+ represents a variable number of bytes. Bytes stored within a computer do not have a "bit order", since they are always treated as a unit. However, a byte considered as an integer between 0 and 255 does have a most- and least- significant bit, and since we write numbers with the most- significant digit on the left, we also write bytes with the most- significant bit on the left. In the diagrams below, we number the bits of a byte so that bit 0 is the least-significant bit, i.e., the bits are numbered: +--------+ |76543210| +--------+ Within a computer, a number may occupy multiple bytes. All multi-byte numbers in the format described here are stored with the least-significant byte first (at the lower memory address). For example, the decimal number 520 is stored as: 0 1 +--------+--------+ |00001000|00000010| +--------+--------+ ^ ^ | | | + more significant byte = 2 x 256 + less significant byte = 8 3.1.1. Packing into bytes This document does not address the issue of the order in which bits of a byte are transmitted on a bit-sequential medium, since the final data format described here is byte- rather thanDeutsch Informational [Page 5]RFC 1951 DEFLATE Compressed Data Format Specification May 1996 bit-oriented. However, we describe the compressed block format in below, as a sequence of data elements of various bit lengths, not a sequence of bytes. We must therefore specify how to pack these data elements into bytes to form the final compressed byte sequence: * Data elements are packed into bytes in order of increasing bit number within the byte, i.e., starting with the least-significant bit of the byte. * Data elements other than Huffman codes are packed starting with the least-significant bit of the data element. * Huffman codes are packed starting with the most- significant bit of the code. In other words, if one were to print out the compressed data as a sequence of bytes, starting with the first byte at the *right* margin and proceeding to the *left*, with the most- significant bit of each byte on the left as usual, one would be able to parse the result from right to left, with fixed-width elements in the correct MSB-to-LSB order and Huffman codes in bit-reversed order (i.e., with the first bit of the code in the relative LSB position). 3.2. Compressed block format 3.2.1. Synopsis of prefix and Huffman coding Prefix coding represents symbols from an a priori known alphabet by bit sequences (codes), one code for each symbol, in a manner such that different symbols may be represented by bit sequences of different lengths, but a parser can always parse an encoded string unambiguously symbol-by-symbol. We define a prefix code in terms of a binary tree in which the two edges descending from each non-leaf node are labeled 0 and 1 and in which the leaf nodes correspond one-for-one with (are labeled with) the symbols of the alphabet; then the code for a symbol is the sequence of 0's and 1's on the edges leading from the root to the leaf labeled with that symbol. For example:Deutsch Informational [Page 6]RFC 1951 DEFLATE Compressed Data Format Specification May 1996 /\ Symbol Code 0 1 ------ ---- / \ A 00 /\ B B 1 0 1 C 011 / \ D 010 A /\ 0 1 / \ D C A parser can decode the next symbol from an encoded input stream by walking down the tree from the root, at each step choosing the edge corresponding to the next input bit. Given an alphabet with known symbol frequencies, the Huffman algorithm allows the construction of an optimal prefix code (one which represents strings with those symbol frequencies using the fewest bits of any possible prefix codes for that alphabet). Such a code is called a Huffman code. (See reference [1] in Chapter 5, references for additional information on Huffman codes.) Note that in the "deflate" format, the Huffman codes for the various alphabets must not exceed certain maximum code lengths. This constraint complicates the algorithm for computing code lengths from symbol frequencies. Again, see Chapter 5, references for details. 3.2.2. Use of Huffman coding in the "deflate" format The Huffman codes used for each alphabet in the "deflate" format have two additional rules: * All codes of a given bit length have lexicographically consecutive values, in the same order as the symbols they represent; * Shorter codes lexicographically precede longer codes.Deutsch Informational [Page 7]RFC 1951 DEFLATE Compressed Data Format Specification May 1996 We could recode the example above to follow this rule as follows, assuming that the order of the alphabet is ABCD: Symbol Code ------ ---- A 10 B 0 C 110 D 111 I.e., 0 precedes 10 which precedes 11x, and 110 and 111 are lexicographically consecutive. Given this rule, we can define the Huffman code for an alphabet just by giving the bit lengths of the codes for each symbol of the alphabet in order; this is sufficient to determine the actual codes. In our example, the code is completely defined by the sequence of bit lengths (2, 1, 3, 3). The following algorithm generates the codes as integers, intended to be read from most- to least-significant bit. The code lengths are initially in tree[I].Len; the codes are produced in tree[I].Code. 1) Count the number of codes for each code length. Let bl_count[N] be the number of codes of length N, N >= 1. 2) Find the numerical value of the smallest code for each code length: code = 0; bl_count[0] = 0; for (bits = 1; bits <= MAX_BITS; bits++) { code = (code + bl_count[bits-1]) << 1; next_code[bits] = code; } 3) Assign numerical values to all codes, using consecutive values for all codes of the same length with the base values determined at step 2. Codes that are never used (which have a bit length of zero) must not be assigned a value. for (n = 0; n <= max_code; n++) { len = tree[n].Len; if (len != 0) { tree[n].Code = next_code[len]; next_code[len]++; }Deutsch Informational [Page 8]RFC 1951 DEFLATE Compressed Data Format Specification May 1996 } Example: Consider the alphabet ABCDEFGH, with bit lengths (3, 3, 3, 3, 3, 2, 4, 4). After step 1, we have: N bl_count[N] - ----------- 2 1 3 5 4 2 Step 2 computes the following next_code values: N next_code[N] - ------------ 1 0 2 0 3 2 4 14 Step 3 produces the following code values: Symbol Length Code ------ ------ ---- A 3 010 B 3 011 C 3 100 D 3 101 E 3 110 F 2 00 G 4 1110 H 4 1111 3.2.3. Details of block format Each block of compressed data begins with 3 header bits containing the following data: first bit BFINAL next 2 bits BTYPE Note that the header bits do not necessarily begin on a byte boundary, since a block does not necessarily occupy an integral number of bytes.Deutsch Informational [Page 9]RFC 1951 DEFLATE Compressed Data Format Specification May 1996 BFINAL is set if and only if this is the last block of the data set. BTYPE specifies how the data are compressed, as follows: 00 - no compression 01 - compressed with fixed Huffman codes 10 - compressed with dynamic Huffman codes 11 - reserved (error) The only difference between the two compressed cases is how the Huffman codes for the literal/length and distance alphabets are defined. In all cases, the decoding algorithm for the actual data is as follows: do read block header from input stream. if stored with no compression skip any remaining bits in current partially processed byte read LEN and NLEN (see next section) copy LEN bytes of data to output otherwise if compressed with dynamic Huffman codes read representation of code trees (see subsection below) loop (until end of block code recognized) decode literal/length value from input stream if value < 256 copy value (literal byte) to output stream otherwise if value = end of block (256) break from loop otherwise (value = 257..285) decode distance from input stream move backwards distance bytes in the output stream, and copy length bytes from this position to the output stream. end loop while not last block Note that a duplicated string reference may refer to a string in a previous block; i.e., the backward distance may cross one or more block boundaries. However a distance cannot refer past the beginning of the output stream. (An application using aDeutsch Informational [Page 10]RFC 1951 DEFLATE Compressed Data Format Specification May 1996 preset dictionary might discard part of the output stream; a distance can refer to that part of the output stream anyway) Note also that the referenced string may overlap the current position; for example, if the last 2 bytes decoded have values X and Y, a string reference with <length = 5, distance = 2> adds X,Y,X,Y,X to the output stream. We now specify each compression method in turn. 3.2.4. Non-compressed blocks (BTYPE=00) Any bits of input up to the next byte boundary are ignored. The rest of the block consists of the following information: 0 1 2 3 4... +---+---+---+---+================================+ | LEN | NLEN |... LEN bytes of literal data...| +---+---+---+---+================================+ LEN is the number of data bytes in the block. NLEN is the one's complement of LEN. 3.2.5. Compressed blocks (length and distance codes) As noted above, encoded data blocks in the "deflate" format consist of sequences of symbols drawn from three conceptually distinct alphabets: either literal bytes, from the alphabet of byte values (0..255), or <length, backward distance> pairs, where the length is drawn from (3..258) and the distance is drawn from (1..32,768). In fact, the literal and length alphabets are merged into a single alphabet (0..285), where values 0..255 represent literal bytes, the value 256 indicates end-of-block, and values 257..285 represent length codes (possibly in conjunction with extra bits following the symbol code) as follows:Deutsch Informational [Page 11]RFC 1951 DEFLATE Compressed Data Format Specification May 1996 Extra Extra Extra Code Bits Length(s) Code Bits Lengths Code Bits Length(s) ---- ---- ------ ---- ---- ------- ---- ---- ------- 257 0 3 267 1 15,16 277 4 67-82 258 0 4 268 1 17,18 278 4 83-98 259 0 5 269 2 19-22 279 4 99-114 260 0 6 270 2 23-26 280 4 115-130 261 0 7 271 2 27-30 281 5 131-162 262 0 8 272 2 31-34 282 5 163-194 263 0 9 273 3 35-42 283 5 195-226 264 0 10 274 3 43-50 284 5 227-257 265 1 11,12 275 3 51-58 285 0 258 266 1 13,14 276 3 59-66 The extra bits should be interpreted as a machine integer stored with the most-significant bit first, e.g., bits 1110 represent the value 14. Extra Extra Extra Code Bits Dist Code Bits Dist Code Bits Distance ---- ---- ---- ---- ---- ------ ---- ---- -------- 0 0 1 10 4 33-48 20 9 1025-1536 1 0 2 11 4 49-64 21 9 1537-2048 2 0 3 12 5 65-96 22 10 2049-3072 3 0 4 13 5 97-128 23 10 3073-4096 4 1 5,6 14 6 129-192 24 11 4097-6144 5 1 7,8 15 6 193-256 25 11 6145-8192 6 2 9-12 16 7 257-384 26 12 8193-12288 7 2 13-16 17 7 385-512 27 12 12289-16384 8 3 17-24 18 8 513-768 28 13 16385-24576 9 3 25-32 19 8 769-1024 29 13 24577-32768 3.2.6. Compression with fixed Huffman codes (BTYPE=01) The Huffman codes for the two alphabets are fixed, and are not represented explicitly in the data. The Huffman code lengths for the literal/length alphabet are: Lit Value Bits Codes --------- ---- ----- 0 - 143 8 00110000 through 10111111 144 - 255 9 110010000 through 111111111 256 - 279 7 0000000 through 0010111 280 - 287 8 11000000 through 11000111Deutsch Informational [Page 12]RFC 1951 DEFLATE Compressed Data Format Specification May 1996 The code lengths are sufficient to generate the actual codes, as described above; we show the codes in the table for added clarity. Literal/length values 286-287 will never actually occur in the compressed data, but participate in the code construction. Distance codes 0-31 are represented by (fixed-length) 5-bit codes, with possible additional bits as shown in the table shown in Paragraph 3.2.5, above. Note that distance codes 30- 31 will never actually occur in the compressed data. 3.2.7. Compression with dynamic Huffman codes (BTYPE=10) The Huffman codes for the two alphabets appear in the block immediately after the header bits and before the actual compressed data, first the literal/length code and then the distance code. Each code is defined by a sequence of code lengths, as discussed in Paragraph 3.2.2, above. For even greater compactness, the code length sequences themselves are compressed using a Huffman code. The alphabet for code lengths is as follows: 0 - 15: Represent code lengths of 0 - 15 16: Copy the previous code length 3 - 6 times. The next 2 bits indicate repeat length (0 = 3, ... , 3 = 6) Example: Codes 8, 16 (+2 bits 11), 16 (+2 bits 10) will expand to 12 code lengths of 8 (1 + 6 + 5) 17: Repeat a code length of 0 for 3 - 10 times. (3 bits of length) 18: Repeat a code length of 0 for 11 - 138 times (7 bits of length) A code length of 0 indicates that the corresponding symbol in the literal/length or distance alphabet will not occur in the block, and should not participate in the Huffman code construction algorithm given earlier. If only one distance code is used, it is encoded using one bit, not zero bits; in this case there is a single code length of one, with one unused code. One distance code of zero bits means that there are no distance codes used at all (the data is all literals). We can now define the format of the block: 5 Bits: HLIT, # of Literal/Length codes - 257 (257 - 286) 5 Bits: HDIST, # of Distance codes - 1 (1 - 32) 4 Bits: HCLEN, # of Code Length codes - 4 (4 - 19)Deutsch Informational [Page 13]RFC 1951 DEFLATE Compressed Data Format Specification May 1996 (HCLEN + 4) x 3 bits: code lengths for the code length alphabet given just above, in the order: 16, 17, 18, 0, 8, 7, 9, 6, 10, 5, 11, 4, 12, 3, 13, 2, 14, 1, 15 These code lengths are interpreted as 3-bit integers (0-7); as above, a code length of 0 means the corresponding symbol (literal/length or distance code length) is not used. HLIT + 257 code lengths for the literal/length alphabet, encoded using the code length Huffman code HDIST + 1 code lengths for the distance alphabet, encoded using the code length Huffman code The actual compressed data of the block, encoded using the literal/length and distance Huffman codes The literal/length symbol 256 (end of data), encoded using the literal/length Huffman code The code length repeat codes can cross from HLIT + 257 to the HDIST + 1 code lengths. In other words, all code lengths form a single sequence of HLIT + HDIST + 258 values. 3.3. Compliance A compressor may limit further the ranges of values specified in the previous section and still be compliant; for example, it may limit the range of backward pointers to some value smaller than 32K. Similarly, a compressor may limit the size of blocks so that a compressible block fits in memory. A compliant decompressor must accept the full range of possible values defined in the previous section, and must accept blocks of arbitrary size.4. Compression algorithm details While it is the intent of this document to define the "deflate" compressed data format without reference to any particular compression algorithm, the format is related to the compressed formats produced by LZ77 (Lempel-Ziv 1977, see reference [2] below); since many variations of LZ77 are patented, it is strongly recommended that the implementor of a compressor follow the general algorithm presented here, which is known not to be patented per se. The material in this section is not part of the definition of theDeutsch Informational [Page 14]RFC 1951 DEFLATE Compressed Data Format Specification May 1996 specification per se, and a compressor need not follow it in order to be compliant. The compressor terminates a block when it determines that starting a new block with fresh trees would be useful, or when the block size fills up the compressor's block buffer. The compressor uses a chained hash table to find duplicated strings, using a hash function that operates on 3-byte sequences. At any given point during compression, let XYZ be the next 3 input bytes to be examined (not necessarily all different, of course). First, the compressor examines the hash chain for XYZ. If the chain is empty, the compressor simply writes out X as a literal byte and advances one byte in the input. If the hash chain is not empty, indicating that the sequence XYZ (or, if we are unlucky, some other 3 bytes with the same hash function value) has occurred recently, the compressor compares all strings on the XYZ hash chain with the actual input data sequence starting at the current point, and selects the longest match. The compressor searches the hash chains starting with the most recent strings, to favor small distances and thus take advantage of the Huffman encoding. The hash chains are singly linked. There are no deletions from the hash chains; the algorithm simply discards matches that are too old. To avoid a worst-case situation, very long hash chains are arbitrarily truncated at a certain length, determined by a run-time parameter. To improve overall compression, the compressor optionally defers the selection of matches ("lazy matching"): after a match of length N has been found, the compressor searches for a longer match starting at the next input byte. If it finds a longer match, it truncates the previous match to a length of one (thus producing a single literal byte) and then emits the longer match. Otherwise, it emits the original match, and, as described above, advances N bytes before continuing. Run-time parameters also control this "lazy match" procedure. If compression ratio is most important, the compressor attempts a complete second search regardless of the length of the first match. In the normal case, if the current match is "long enough", the compressor reduces the search for a longer match, thus speeding up the process. If speed is most important, the compressor inserts new strings in the hash table only when no match was found, or when the match is not "too long". This degrades the compression ratio but saves time since there are both fewer insertions and fewer searches.Deutsch Informational [Page 15]RFC 1951 DEFLATE Compressed Data Format Specification May 19965. References [1] Huffman, D. A., "A Method for the Construction of Minimum Redundancy Codes", Proceedings of the Institute of Radio Engineers, September 1952, Volume 40, Number 9, pp. 1098-1101. [2] Ziv J., Lempel A., "A Universal Algorithm for Sequential Data Compression", IEEE Transactions on Information Theory, Vol. 23, No. 3, pp. 337-343. [3] Gailly, J.-L., and Adler, M., ZLIB documentation and sources, available in ftp://ftp.uu.net/pub/archiving/zip/doc/ [4] Gailly, J.-L., and Adler, M., GZIP documentation and sources, available as gzip-*.tar in ftp://prep.ai.mit.edu/pub/gnu/ [5] Schwartz, E. S., and Kallick, B. "Generating a canonical prefix encoding." Comm. ACM, 7,3 (Mar. 1964), pp. 166-169. [6] Hirschberg and Lelewer, "Efficient decoding of prefix codes," Comm. ACM, 33,4, April 1990, pp. 449-459.6. Security Considerations Any data compression method involves the reduction of redundancy in the data. Consequently, any corruption of the data is likely to have severe effects and be difficult to correct. Uncompressed text, on the other hand, will probably still be readable despite the presence of some corrupted bytes. It is recommended that systems using this data format provide some means of validating the integrity of the compressed data. See reference [3], for example.7. Source code Source code for a C language implementation of a "deflate" compliant compressor and decompressor is available within the zlib package at ftp://ftp.uu.net/pub/archiving/zip/zlib/.8. Acknowledgements Trademarks cited in this document are the property of their respective owners. Phil Katz designed the deflate format. Jean-Loup Gailly and Mark Adler wrote the related software described in this specification. Glenn Randers-Pehrson converted this document to RFC and HTML format.Deutsch Informational [Page 16]RFC 1951 DEFLATE Compressed Data Format Specification May 19969. Author's Address L. Peter Deutsch Aladdin Enterprises 203 Santa Margarita Ave. Menlo Park, CA 94025 Phone: (415) 322-0103 (AM only) FAX: (415) 322-1734 EMail: <ghost@aladdin.com> Questions about the technical content of this specification can be sent by email to: Jean-Loup Gailly <gzip@prep.ai.mit.edu> and Mark Adler <madler@alumni.caltech.edu> Editorial comments on this specification can be sent by email to: L. Peter Deutsch <ghost@aladdin.com> and Glenn Randers-Pehrson <randeg@alumni.rpi.edu>Deutsch Informational [Page 17]

Network Working Group P. DeutschRequest for Comments: 1952 Aladdin EnterprisesCategory: Informational May 1996 GZIP file format specification version 4.3Status of This Memo This memo provides information for the Internet community. This memo does not specify an Internet standard of any kind. Distribution of this memo is unlimited.IESG Note: The IESG takes no position on the validity of any Intellectual Property Rights statements contained in this document.Notices Copyright (c) 1996 L. Peter Deutsch Permission is granted to copy and distribute this document for any purpose and without charge, including translations into other languages and incorporation into compilations, provided that the copyright notice and this notice are preserved, and that any substantive changes or deletions from the original are clearly marked. A pointer to the latest version of this and related documentation in HTML format can be found at the URL <ftp://ftp.uu.net/graphics/png/documents/zlib/zdoc-index.html>.Abstract This specification defines a lossless compressed data format that is compatible with the widely used GZIP utility. The format includes a cyclic redundancy check value for detecting data corruption. The format presently uses the DEFLATE method of compression but can be easily extended to use other compression methods. The format can be implemented readily in a manner not covered by patents.Deutsch Informational [Page 1]RFC 1952 GZIP File Format Specification May 1996Table of Contents 1. Introduction ................................................... 2 1.1. Purpose ................................................... 2 1.2. Intended audience ......................................... 3 1.3. Scope ..................................................... 3 1.4. Compliance ................................................ 3 1.5. Definitions of terms and conventions used ................. 3 1.6. Changes from previous versions ............................ 3 2. Detailed specification ......................................... 4 2.1. Overall conventions ....................................... 4 2.2. File format ............................................... 5 2.3. Member format ............................................. 5 2.3.1. Member header and trailer ........................... 6 2.3.1.1. Extra field ................................... 8 2.3.1.2. Compliance .................................... 9 3. References .................................................. 9 4. Security Considerations .................................... 10 5. Acknowledgements ........................................... 10 6. Author's Address ........................................... 10 7. Appendix: Jean-Loup Gailly's gzip utility .................. 11 8. Appendix: Sample CRC Code .................................. 111. Introduction 1.1. Purpose The purpose of this specification is to define a lossless compressed data format that: * Is independent of CPU type, operating system, file system, and character set, and hence can be used for interchange; * Can compress or decompress a data stream (as opposed to a randomly accessible file) to produce another data stream, using only an a priori bounded amount of intermediate storage, and hence can be used in data communications or similar structures such as Unix filters; * Compresses data with efficiency comparable to the best currently available general-purpose compression methods, and in particular considerably better than the "compress" program; * Can be implemented readily in a manner not covered by patents, and hence can be practiced freely; * Is compatible with the file format produced by the current widely used gzip utility, in that conforming decompressors will be able to read data produced by the existing gzip compressor.Deutsch Informational [Page 2]RFC 1952 GZIP File Format Specification May 1996 The data format defined by this specification does not attempt to: * Provide random access to compressed data; * Compress specialized data (e.g., raster graphics) as well as the best currently available specialized algorithms. 1.2. Intended audience This specification is intended for use by implementors of software to compress data into gzip format and/or decompress data from gzip format. The text of the specification assumes a basic background in programming at the level of bits and other primitive data representations. 1.3. Scope The specification specifies a compression method and a file format (the latter assuming only that a file can store a sequence of arbitrary bytes). It does not specify any particular interface to a file system or anything about character sets or encodings (except for file names and comments, which are optional). 1.4. Compliance Unless otherwise indicated below, a compliant decompressor must be able to accept and decompress any file that conforms to all the specifications presented here; a compliant compressor must produce files that conform to all the specifications presented here. The material in the appendices is not part of the specification per se and is not relevant to compliance. 1.5. Definitions of terms and conventions used byte: 8 bits stored or transmitted as a unit (same as an octet). (For this specification, a byte is exactly 8 bits, even on machines which store a character on a number of bits different from 8.) See below for the numbering of bits within a byte. 1.6. Changes from previous versions There have been no technical changes to the gzip format since version 4.1 of this specification. In version 4.2, some terminology was changed, and the sample CRC code was rewritten for clarity and to eliminate the requirement for the caller to do pre- and post-conditioning. Version 4.3 is a conversion of the specification to RFC style.Deutsch Informational [Page 3]RFC 1952 GZIP File Format Specification May 19962. Detailed specification 2.1. Overall conventions In the diagrams below, a box like this: +---+ | | <-- the vertical bars might be missing +---+ represents one byte; a box like this: +==============+ | | +==============+ represents a variable number of bytes. Bytes stored within a computer do not have a "bit order", since they are always treated as a unit. However, a byte considered as an integer between 0 and 255 does have a most- and least- significant bit, and since we write numbers with the most- significant digit on the left, we also write bytes with the most- significant bit on the left. In the diagrams below, we number the bits of a byte so that bit 0 is the least-significant bit, i.e., the bits are numbered: +--------+ |76543210| +--------+ This document does not address the issue of the order in which bits of a byte are transmitted on a bit-sequential medium, since the data format described here is byte- rather than bit-oriented. Within a computer, a number may occupy multiple bytes. All multi-byte numbers in the format described here are stored with the least-significant byte first (at the lower memory address). For example, the decimal number 520 is stored as: 0 1 +--------+--------+ |00001000|00000010| +--------+--------+ ^ ^ | | | + more significant byte = 2 x 256 + less significant byte = 8Deutsch Informational [Page 4]RFC 1952 GZIP File Format Specification May 1996 2.2. File format A gzip file consists of a series of "members" (compressed data sets). The format of each member is specified in the following section. The members simply appear one after another in the file, with no additional information before, between, or after them. 2.3. Member format Each member has the following structure: +---+---+---+---+---+---+---+---+---+---+ |ID1|ID2|CM |FLG| MTIME |XFL|OS | (more-->) +---+---+---+---+---+---+---+---+---+---+ (if FLG.FEXTRA set) +---+---+=================================+ | XLEN |...XLEN bytes of "extra field"...| (more-->) +---+---+=================================+ (if FLG.FNAME set) +=========================================+ |...original file name, zero-terminated...| (more-->) +=========================================+ (if FLG.FCOMMENT set) +===================================+ |...file comment, zero-terminated...| (more-->) +===================================+ (if FLG.FHCRC set) +---+---+ | CRC16 | +---+---+ +=======================+ |...compressed blocks...| (more-->) +=======================+ 0 1 2 3 4 5 6 7 +---+---+---+---+---+---+---+---+ | CRC32 | ISIZE | +---+---+---+---+---+---+---+---+Deutsch Informational [Page 5]RFC 1952 GZIP File Format Specification May 1996 2.3.1. Member header and trailer ID1 (IDentification 1) ID2 (IDentification 2) These have the fixed values ID1 = 31 (0x1f, \037), ID2 = 139 (0x8b, \213), to identify the file as being in gzip format. CM (Compression Method) This identifies the compression method used in the file. CM = 0-7 are reserved. CM = 8 denotes the "deflate" compression method, which is the one customarily used by gzip and which is documented elsewhere. FLG (FLaGs) This flag byte is divided into individual bits as follows: bit 0 FTEXT bit 1 FHCRC bit 2 FEXTRA bit 3 FNAME bit 4 FCOMMENT bit 5 reserved bit 6 reserved bit 7 reserved If FTEXT is set, the file is probably ASCII text. This is an optional indication, which the compressor may set by checking a small amount of the input data to see whether any non-ASCII characters are present. In case of doubt, FTEXT is cleared, indicating binary data. For systems which have different file formats for ascii text and binary data, the decompressor can use FTEXT to choose the appropriate format. We deliberately do not specify the algorithm used to set this bit, since a compressor always has the option of leaving it cleared and a decompressor always has the option of ignoring it and letting some other program handle issues of data conversion. If FHCRC is set, a CRC16 for the gzip header is present, immediately before the compressed data. The CRC16 consists of the two least significant bytes of the CRC32 for all bytes of the gzip header up to and not including the CRC16. [The FHCRC bit was never set by versions of gzip up to 1.2.4, even though it was documented with a different meaning in gzip 1.2.4.] If FEXTRA is set, optional extra fields are present, as described in a following section.Deutsch Informational [Page 6]RFC 1952 GZIP File Format Specification May 1996 If FNAME is set, an original file name is present, terminated by a zero byte. The name must consist of ISO 8859-1 (LATIN-1) characters; on operating systems using EBCDIC or any other character set for file names, the name must be translated to the ISO LATIN-1 character set. This is the original name of the file being compressed, with any directory components removed, and, if the file being compressed is on a file system with case insensitive names, forced to lower case. There is no original file name if the data was compressed from a source other than a named file; for example, if the source was stdin on a Unix system, there is no file name. If FCOMMENT is set, a zero-terminated file comment is present. This comment is not interpreted; it is only intended for human consumption. The comment must consist of ISO 8859-1 (LATIN-1) characters. Line breaks should be denoted by a single line feed character (10 decimal). Reserved FLG bits must be zero. MTIME (Modification TIME) This gives the most recent modification time of the original file being compressed. The time is in Unix format, i.e., seconds since 00:00:00 GMT, Jan. 1, 1970. (Note that this may cause problems for MS-DOS and other systems that use local rather than Universal time.) If the compressed data did not come from a file, MTIME is set to the time at which compression started. MTIME = 0 means no time stamp is available. XFL (eXtra FLags) These flags are available for use by specific compression methods. The "deflate" method (CM = 8) sets these flags as follows: XFL = 2 - compressor used maximum compression, slowest algorithm XFL = 4 - compressor used fastest algorithm OS (Operating System) This identifies the type of file system on which compression took place. This may be useful in determining end-of-line convention for text files. The currently defined values are as follows:Deutsch Informational [Page 7]RFC 1952 GZIP File Format Specification May 1996 0 - FAT filesystem (MS-DOS, OS/2, NT/Win32) 1 - Amiga 2 - VMS (or OpenVMS) 3 - Unix 4 - VM/CMS 5 - Atari TOS 6 - HPFS filesystem (OS/2, NT) 7 - Macintosh 8 - Z-System 9 - CP/M 10 - TOPS-20 11 - NTFS filesystem (NT) 12 - QDOS 13 - Acorn RISCOS 255 - unknown XLEN (eXtra LENgth) If FLG.FEXTRA is set, this gives the length of the optional extra field. See below for details. CRC32 (CRC-32) This contains a Cyclic Redundancy Check value of the uncompressed data computed according to CRC-32 algorithm used in the ISO 3309 standard and in section 8.1.1.6.2 of ITU-T recommendation V.42. (See http://www.iso.ch for ordering ISO documents. See gopher://info.itu.ch for an online version of ITU-T V.42.) ISIZE (Input SIZE) This contains the size of the original (uncompressed) input data modulo 2^32. 2.3.1.1. Extra field If the FLG.FEXTRA bit is set, an "extra field" is present in the header, with total length XLEN bytes. It consists of a series of subfields, each of the form: +---+---+---+---+==================================+ |SI1|SI2| LEN |... LEN bytes of subfield data ...| +---+---+---+---+==================================+ SI1 and SI2 provide a subfield ID, typically two ASCII letters with some mnemonic value. Jean-Loup Gailly <gzip@prep.ai.mit.edu> is maintaining a registry of subfield IDs; please send him any subfield ID you wish to use. Subfield IDs with SI2 = 0 are reserved for future use. The following IDs are currently defined:Deutsch Informational [Page 8]RFC 1952 GZIP File Format Specification May 1996 SI1 SI2 Data ---------- ---------- ---- 0x41 ('A') 0x70 ('P') Apollo file type information LEN gives the length of the subfield data, excluding the 4 initial bytes. 2.3.1.2. Compliance A compliant compressor must produce files with correct ID1, ID2, CM, CRC32, and ISIZE, but may set all the other fields in the fixed-length part of the header to default values (255 for OS, 0 for all others). The compressor must set all reserved bits to zero. A compliant decompressor must check ID1, ID2, and CM, and provide an error indication if any of these have incorrect values. It must examine FEXTRA/XLEN, FNAME, FCOMMENT and FHCRC at least so it can skip over the optional fields if they are present. It need not examine any other part of the header or trailer; in particular, a decompressor may ignore FTEXT and OS and always produce binary output, and still be compliant. A compliant decompressor must give an error indication if any reserved bit is non-zero, since such a bit could indicate the presence of a new field that would cause subsequent data to be interpreted incorrectly.3. References [1] "Information Processing - 8-bit single-byte coded graphic character sets - Part 1: Latin alphabet No.1" (ISO 8859-1:1987). The ISO 8859-1 (Latin-1) character set is a superset of 7-bit ASCII. Files defining this character set are available as iso_8859-1.* in ftp://ftp.uu.net/graphics/png/documents/ [2] ISO 3309 [3] ITU-T recommendation V.42 [4] Deutsch, L.P.,"DEFLATE Compressed Data Format Specification", available in ftp://ftp.uu.net/pub/archiving/zip/doc/ [5] Gailly, J.-L., GZIP documentation, available as gzip-*.tar in ftp://prep.ai.mit.edu/pub/gnu/ [6] Sarwate, D.V., "Computation of Cyclic Redundancy Checks via Table Look-Up", Communications of the ACM, 31(8), pp.1008-1013.Deutsch Informational [Page 9]RFC 1952 GZIP File Format Specification May 1996 [7] Schwaderer, W.D., "CRC Calculation", April 85 PC Tech Journal, pp.118-133. [8] ftp://ftp.adelaide.edu.au/pub/rocksoft/papers/crc_v3.txt, describing the CRC concept.4. Security Considerations Any data compression method involves the reduction of redundancy in the data. Consequently, any corruption of the data is likely to have severe effects and be difficult to correct. Uncompressed text, on the other hand, will probably still be readable despite the presence of some corrupted bytes. It is recommended that systems using this data format provide some means of validating the integrity of the compressed data, such as by setting and checking the CRC-32 check value.5. Acknowledgements Trademarks cited in this document are the property of their respective owners. Jean-Loup Gailly designed the gzip format and wrote, with Mark Adler, the related software described in this specification. Glenn Randers-Pehrson converted this document to RFC and HTML format.6. Author's Address L. Peter Deutsch Aladdin Enterprises 203 Santa Margarita Ave. Menlo Park, CA 94025 Phone: (415) 322-0103 (AM only) FAX: (415) 322-1734 EMail: <ghost@aladdin.com> Questions about the technical content of this specification can be sent by email to: Jean-Loup Gailly <gzip@prep.ai.mit.edu> and Mark Adler <madler@alumni.caltech.edu> Editorial comments on this specification can be sent by email to: L. Peter Deutsch <ghost@aladdin.com> and Glenn Randers-Pehrson <randeg@alumni.rpi.edu>Deutsch Informational [Page 10]RFC 1952 GZIP File Format Specification May 19967. Appendix: Jean-Loup Gailly's gzip utility The most widely used implementation of gzip compression, and the original documentation on which this specification is based, were created by Jean-Loup Gailly <gzip@prep.ai.mit.edu>. Since this implementation is a de facto standard, we mention some more of its features here. Again, the material in this section is not part of the specification per se, and implementations need not follow it to be compliant. When compressing or decompressing a file, gzip preserves the protection, ownership, and modification time attributes on the local file system, since there is no provision for representing protection attributes in the gzip file format itself. Since the file format includes a modification time, the gzip decompressor provides a command line switch that assigns the modification time from the file, rather than the local modification time of the compressed input, to the decompressed output.8. Appendix: Sample CRC Code The following sample code represents a practical implementation of the CRC (Cyclic Redundancy Check). (See also ISO 3309 and ITU-T V.42 for a formal specification.) The sample code is in the ANSI C programming language. Non C users may find it easier to read with these hints: & Bitwise AND operator. ^ Bitwise exclusive-OR operator. >> Bitwise right shift operator. When applied to an unsigned quantity, as here, right shift inserts zero bit(s) at the left. ! Logical NOT operator. ++ "n++" increments the variable n. 0xNNN 0x introduces a hexadecimal (base 16) constant. Suffix L indicates a long value (at least 32 bits). /* Table of CRCs of all 8-bit messages. */ unsigned long crc_table[256]; /* Flag: has the table been computed? Initially false. */ int crc_table_computed = 0; /* Make the table for a fast CRC. */ void make_crc_table(void) { unsigned long c;Deutsch Informational [Page 11]RFC 1952 GZIP File Format Specification May 1996 int n, k; for (n = 0; n < 256; n++) { c = (unsigned long) n; for (k = 0; k < 8; k++) { if (c & 1) { c = 0xedb88320L ^ (c >> 1); } else { c = c >> 1; } } crc_table[n] = c; } crc_table_computed = 1; } /* Update a running crc with the bytes buf[0..len-1] and return the updated crc. The crc should be initialized to zero. Pre- and post-conditioning (one's complement) is performed within this function so it shouldn't be done by the caller. Usage example: unsigned long crc = 0L; while (read_buffer(buffer, length) != EOF) { crc = update_crc(crc, buffer, length); } if (crc != original_crc) error(); */ unsigned long update_crc(unsigned long crc, unsigned char *buf, int len) { unsigned long c = crc ^ 0xffffffffL; int n; if (!crc_table_computed) make_crc_table(); for (n = 0; n < len; n++) { c = crc_table[(c ^ buf[n]) & 0xff] ^ (c >> 8); } return c ^ 0xffffffffL; } /* Return the CRC of the bytes buf[0..len-1]. */ unsigned long crc(unsigned char *buf, int len) { return update_crc(0L, buf, len); }Deutsch Informational [Page 12]

A Fast Method for Identifying Plain Text Files==============================================Introduction------------Given a file coming from an unknown source, it is sometimes desirableto find out whether the format of that file is plain text. Althoughthis may appear like a simple task, a fully accurate detection of thefile type requires heavy-duty semantic analysis on the file contents.It is, however, possible to obtain satisfactory results by employingvarious heuristics.Previous versions of PKZip and other zip-compatible compression toolswere using a crude detection scheme: if more than 80% (4/5) of the bytesfound in a certain buffer are within the range [7..127], the file islabeled as plain text, otherwise it is labeled as binary. A prominentlimitation of this scheme is the restriction to Latin-based alphabets.Other alphabets, like Greek, Cyrillic or Asian, make extensive use ofthe bytes within the range [128..255], and texts using these alphabetsare most often misidentified by this scheme; in other words, the rateof false negatives is sometimes too high, which means that the recallis low. Another weakness of this scheme is a reduced precision, due tothe false positives that may occur when binary files containing largeamounts of textual characters are misidentified as plain text.In this article we propose a new, simple detection scheme that featuresa much increased precision and a near-100% recall. This scheme isdesigned to work on ASCII, Unicode and other ASCII-derived alphabets,and it handles single-byte encodings (ISO-8859, MacRoman, KOI8, etc.)and variable-sized encodings (ISO-2022, UTF-8, etc.). Wider encodings(UCS-2/UTF-16 and UCS-4/UTF-32) are not handled, however.The Algorithm-------------The algorithm works by dividing the set of bytecodes [0..255] into threecategories:- The white list of textual bytecodes: 9 (TAB), 10 (LF), 13 (CR), 32 (SPACE) to 255.- The gray list of tolerated bytecodes: 7 (BEL), 8 (BS), 11 (VT), 12 (FF), 26 (SUB), 27 (ESC).- The black list of undesired, non-textual bytecodes: 0 (NUL) to 6, 14 to 31.If a file contains at least one byte that belongs to the white list andno byte that belongs to the black list, then the file is categorized asplain text; otherwise, it is categorized as binary. (The boundary case,when the file is empty, automatically falls into the latter category.)Rationale---------The idea behind this algorithm relies on two observations.The first observation is that, although the full range of 7-bit codes[0..127] is properly specified by the ASCII standard, most controlcharacters in the range [0..31] are not used in practice. The onlywidely-used, almost universally-portable control codes are 9 (TAB),10 (LF) and 13 (CR). There are a few more control codes that arerecognized on a reduced range of platforms and text viewers/editors:7 (BEL), 8 (BS), 11 (VT), 12 (FF), 26 (SUB) and 27 (ESC); but thesecodes are rarely (if ever) used alone, without being accompanied bysome printable text. Even the newer, portable text formats such asXML avoid using control characters outside the list mentioned here.The second observation is that most of the binary files tend to containcontrol characters, especially 0 (NUL). Even though the older textdetection schemes observe the presence of non-ASCII codes from the range[128..255], the precision rarely has to suffer if this upper range islabeled as textual, because the files that are genuinely binary tend tocontain both control characters and codes from the upper range. On theother hand, the upper range needs to be labeled as textual, because itis used by virtually all ASCII extensions. In particular, this range isused for encoding non-Latin scripts.Since there is no counting involved, other than simply observing thepresence or the absence of some byte values, the algorithm producesconsistent results, regardless what alphabet encoding is being used.(If counting were involved, it could be possible to obtain differentresults on a text encoded, say, using ISO-8859-16 versus UTF-8.)There is an extra category of plain text files that are "polluted" withone or more black-listed codes, either by mistake or by peculiar designconsiderations. In such cases, a scheme that tolerates a small fractionof black-listed codes would provide an increased recall (i.e. more truepositives). This, however, incurs a reduced precision overall, sincefalse positives are more likely to appear in binary files that containlarge chunks of textual data. Furthermore, "polluted" plain text shouldbe regarded as binary by general-purpose text detection schemes, becausegeneral-purpose text processing algorithms might not be applicable.Under this premise, it is safe to say that our detection method providesa near-100% recall.Experiments have been run on many files coming from various platformsand applications. We tried plain text files, system logs, source code,formatted office documents, compiled object code, etc. The resultsconfirm the optimistic assumptions about the capabilities of thisalgorithm.--Cosmin TrutaLast updated: 2006-May-28

/*
** Copyright (c) 2010 D. Richard Hipp
**
** This program is free software; you can redistribute it and/or
** modify it under the terms of the Simplified BSD License (also
** known as the "2-Clause License" or "FreeBSD License".)
** This program is distributed in the hope that it will be useful,
** but without any warranty; without even the implied warranty of
** merchantability or fitness for a particular purpose.
**
** Author contact information:
** drh@hwaci.com
** http://www.hwaci.com/drh/

|

1
2
3
4
5
6
7
8
9
10
11
12
13
14

/*
** Copyright (c) 2010 D. Richard Hipp
**
** This program is free software; you can redistribute it and/or
** modify it under the terms of the Simplified BSD License (also
** known as the "2-Clause License" or "FreeBSD License".)
**
** This program is distributed in the hope that it will be useful,
** but without any warranty; without even the implied warranty of
** merchantability or fitness for a particular purpose.
**
** Author contact information:
** drh@hwaci.com
** http://www.hwaci.com/drh/

/*
** Copyright (c) 2006 D. Richard Hipp
**
** This program is free software; you can redistribute it and/or
** modify it under the terms of the Simplified BSD License (also
** known as the "2-Clause License" or "FreeBSD License".)
**
** This program is distributed in the hope that it will be useful,
** but without any warranty; without even the implied warranty of
** merchantability or fitness for a particular purpose.
**
** Author contact information:
** drh@hwaci.com
** http://www.hwaci.com/drh/
................................................................................
if( pTo ){
blob_append(pTo, &pFrom->aData[pFrom->iCursor], i - pFrom->iCursor);
}
pFrom->iCursor = i;
}
/*
** Return true if the blob contains a valid base16 identifier artifact hash.
**** The value returned is actually one of HNAME_SHA1 OR HNAME_K256 if the** hash is valid. Both of these are non-zero and therefore "true".** If the hash is not valid, then HNAME_ERROR is returned, which is zero or** false.
*/
int blob_is_hname(Blob *pBlob){
return hname_validate(blob_buffer(pBlob), blob_size(pBlob));
}
/*
** Return true if the blob contains a valid filename
*/
int blob_is_filename(Blob *pBlob){
return file_is_simple_pathname(blob_str(pBlob), 1);

** WEBPAGE: praise
**
** URL: /annotate?checkin=ID&filename=FILENAME
** URL: /blame?checkin=ID&filename=FILENAME
** URL: /praise?checkin=ID&filename=FILENAME
**
** Show the most recent change to each line of a text file. /annotate shows
** the date of the changes and the check-in SHA1 hash (with a link to the
** check-in). /blame and /praise also show the user who made the check-in.
**
** Query parameters:
**
** checkin=ID The manifest ID at which to start the annotation
** filename=FILENAME The filename.
** filevers Show file versions rather than check-in versions

** WEBPAGE: praise
**
** URL: /annotate?checkin=ID&filename=FILENAME
** URL: /blame?checkin=ID&filename=FILENAME
** URL: /praise?checkin=ID&filename=FILENAME
**
** Show the most recent change to each line of a text file. /annotate shows
** the date of the changes and the check-in hash (with a link to the
** check-in). /blame and /praise also show the user who made the check-in.
**
** Query parameters:
**
** checkin=ID The manifest ID at which to start the annotation
** filename=FILENAME The filename.
** filevers Show file versions rather than check-in versions

}
cur_tok = strtok(NULL, " \t");
if( !cur_tok ){
/* This mark was generated by an older version of Fossil and doesn't
** include the mark name and uuid. create_mark() will name the new mark
** exactly as it was when exported to git, so that we should have a
** valid mapping from git sha1<->mark name<->fossil sha1. */
unsigned int mid;
if( type_=='c' ){
mid = COMMITMARK(mark->rid);
}
else{
mid = BLOBMARK(mark->rid);
}

|

259
260
261
262
263
264
265
266
267
268
269
270
271
272
273

}
cur_tok = strtok(NULL, " \t");
if( !cur_tok ){
/* This mark was generated by an older version of Fossil and doesn't
** include the mark name and uuid. create_mark() will name the new mark
** exactly as it was when exported to git, so that we should have a
** valid mapping from git hash<->mark name<->fossil hash. */
unsigned int mid;
if( type_=='c' ){
mid = COMMITMARK(mark->rid);
}
else{
mid = BLOBMARK(mark->rid);
}

** Usage: %fossil fusefs [--debug] DIRECTORY
**
** This command uses the Fuse Filesystem (FuseFS) to mount a directory
** at DIRECTORY that contains the content of all check-ins in the
** repository. The names of files are DIRECTORY/checkins/VERSION/PATH
** where DIRECTORY is the root of the mount, VERSION is any valid
** check-in name (examples: "trunk" or "tip" or a tag or any unique
** prefix of a SHA1 hash, etc) and PATH is the pathname of the file in
** the check-in. If DIRECTORY does not exist, then an attempt is made
** to create it.
**
** The DIRECTORY/checkins directory is not searchable so one cannot
** do "ls DIRECTORY/checkins" to get a listing of all possible check-in
** names. There are countless variations on check-in names and it is
** impractical to list them all. But all other directories are searchable

|

292
293
294
295
296
297
298
299
300
301
302
303
304
305
306

** Usage: %fossil fusefs [--debug] DIRECTORY
**
** This command uses the Fuse Filesystem (FuseFS) to mount a directory
** at DIRECTORY that contains the content of all check-ins in the
** repository. The names of files are DIRECTORY/checkins/VERSION/PATH
** where DIRECTORY is the root of the mount, VERSION is any valid
** check-in name (examples: "trunk" or "tip" or a tag or any unique
** prefix of an artifact hash, etc) and PATH is the pathname of the file in
** the check-in. If DIRECTORY does not exist, then an attempt is made
** to create it.
**
** The DIRECTORY/checkins directory is not searchable so one cannot
** do "ls DIRECTORY/checkins" to get a listing of all possible check-in
** names. There are countless variations on check-in names and it is
** impractical to list them all. But all other directories are searchable

#include "printf.h"
#if defined(_WIN32)
# include <io.h>
# include <fcntl.h>
#endif
#include <time.h>
/* Two custom conversions are used to show a prefix of SHA1 hashes:
**
** %!S Prefix of a length appropriate for URLs
** %S Prefix of a length appropriate for human display
**
** The following macros help determine those lengths. FOSSIL_HASH_DIGITS
** is the default number of digits to display to humans. This value can
** be overridden using the hash-digits setting. FOSSIL_HASH_DIGITS_URL
................................................................................
# define FOSSIL_HASH_DIGITS 10 /* For %S (human display) */
#endif
#ifndef FOSSIL_HASH_DIGITS_URL
# define FOSSIL_HASH_DIGITS_URL 16 /* For %!S (embedded in URLs) */
#endif
/*
** Return the number of SHA1 hash digits to display. The number is for
** human output if the bForUrl is false and is destined for a URL if
** bForUrl is false.
*/
static int hashDigits(int bForUrl){
static int nDigitHuman = 0;
static int nDigitUrl = 0;
if( nDigitHuman==0 ){

#include "printf.h"
#if defined(_WIN32)
# include <io.h>
# include <fcntl.h>
#endif
#include <time.h>
/* Two custom conversions are used to show a prefix of artifact hashes:
**
** %!S Prefix of a length appropriate for URLs
** %S Prefix of a length appropriate for human display
**
** The following macros help determine those lengths. FOSSIL_HASH_DIGITS
** is the default number of digits to display to humans. This value can
** be overridden using the hash-digits setting. FOSSIL_HASH_DIGITS_URL
................................................................................
# define FOSSIL_HASH_DIGITS 10 /* For %S (human display) */
#endif
#ifndef FOSSIL_HASH_DIGITS_URL
# define FOSSIL_HASH_DIGITS_URL 16 /* For %!S (embedded in URLs) */
#endif
/*
** Return the number of artifact hash digits to display. The number is for
** human output if the bForUrl is false and is destined for a URL if
** bForUrl is false.
*/
static int hashDigits(int bForUrl){
static int nDigitHuman = 0;
static int nDigitUrl = 0;
if( nDigitHuman==0 ){

@ <hr />
onoff_attribute("Allow REMOTE_USER authentication",
"remote_user_ok", "remote_user_ok", 0, 0);
@ <p>When enabled, if the REMOTE_USER environment variable is set to the
@ login name of a valid user and no other login credentials are available,
@ then the REMOTE_USER is accepted as an authenticated user.
@ </p>
@
@ <hr />
entry_attribute("IP address terms used in login cookie", 3,
"ip-prefix-terms", "ipt", "2", 0);
@ <p>The number of octets of of the IP address used in the login cookie.
@ Set to zero to omit the IP address from the login cookie. A value of
@ 2 is recommended.

@ <hr />
onoff_attribute("Allow REMOTE_USER authentication",
"remote_user_ok", "remote_user_ok", 0, 0);
@ <p>When enabled, if the REMOTE_USER environment variable is set to the
@ login name of a valid user and no other login credentials are available,
@ then the REMOTE_USER is accepted as an authenticated user.
@ </p>
@ @ <hr /> onoff_attribute("Allow HTTP_AUTHENTICATION authentication", "http_authentication_ok", "http_authentication_ok", 0, 0); @ <p>When enabled, allow the use of the HTTP_AUTHENTICATION environment @ variable or the "Authentication:" HTTP header to find the username and @ password. This is another way of supporting Basic Authenitication. @ </p>
@
@ <hr />
entry_attribute("IP address terms used in login cookie", 3,
"ip-prefix-terms", "ipt", "2", 0);
@ <p>The number of octets of of the IP address used in the login cookie.
@ Set to zero to omit the IP address from the login cookie. A value of
@ 2 is recommended.

/******************************************************************************
** This file is an amalgamation of many separate C source files from SQLite
** version 3.17.0. By combining all the individual C code files into this
** single large file, the entire code can be compiled as a single translation
** unit. This allows many compilers to do optimizations that would not be
** possible if the files were compiled separately. Performance improvements
** of 5% or more are commonly seen when SQLite is compiled as a single
** translation unit.
**
** This file is all you need to compile SQLite. To use SQLite in other
................................................................................
# define _LARGE_FILE 1
# ifndef _FILE_OFFSET_BITS
# define _FILE_OFFSET_BITS 64
# endif
# define _LARGEFILE_SOURCE 1
#endif
/* What version of GCC is being used. 0 means GCC is not being used */#ifdef __GNUC__
# define GCC_VERSION (__GNUC__*1000000+__GNUC_MINOR__*1000+__GNUC_PATCHLEVEL__)
#else
# define GCC_VERSION 0
#endif
/* What version of CLANG is being used. 0 means CLANG is not being used */#if defined(__clang__) && !defined(_WIN32)# define CLANG_VERSION \ (__clang_major__*1000000+__clang_minor__*1000+__clang_patchlevel__)
#else
# define CLANG_VERSION 0
#endif
/* Needed for various definitions... */
#if defined(__GNUC__) && !defined(_GNU_SOURCE)
# define _GNU_SOURCE
#endif
................................................................................
** string contains the date and time of the check-in (UTC) and an SHA1
** hash of the entire source tree.
**
** See also: [sqlite3_libversion()],
** [sqlite3_libversion_number()], [sqlite3_sourceid()],
** [sqlite_version()] and [sqlite_source_id()].
*/
#define SQLITE_VERSION "3.17.0"
#define SQLITE_VERSION_NUMBER 3017000
#define SQLITE_SOURCE_ID "2017-02-07 13:51:48 a136609c98ed3cc673c5a3c2578d49db3f2518d1"
/*
** CAPI3REF: Run-Time Library Version Numbers
** KEYWORDS: sqlite3_version sqlite3_sourceid
**
** These interfaces provide the same information as the [SQLITE_VERSION],
** [SQLITE_VERSION_NUMBER], and [SQLITE_SOURCE_ID] C preprocessor macros
................................................................................
** has a unique 64-bit signed
** integer key called the [ROWID | "rowid"]. ^The rowid is always available
** as an undeclared column named ROWID, OID, or _ROWID_ as long as those
** names are not also used by explicitly declared columns. ^If
** the table has a column of type [INTEGER PRIMARY KEY] then that column
** is another alias for the rowid.
**
** ^The sqlite3_last_insert_rowid(D) interface returns the [rowid] of the
** most recent successful [INSERT] into a rowid table or [virtual table]
** on database connection D.
** ^Inserts into [WITHOUT ROWID] tables are not recorded.
** ^If no successful [INSERT]s into rowid tables
** have ever occurred on the database connection D,
** then sqlite3_last_insert_rowid(D) returns zero.
**
** ^(If an [INSERT] occurs within a trigger or within a [virtual table]
** method, then this routine will return the [rowid] of the inserted
** row as long as the trigger or virtual table method is running.** But once the trigger or virtual table method ends, the value returned
** by this routine reverts to what it was before the trigger or virtual** table method began.)^
**
** ^An [INSERT] that fails due to a constraint violation is not a
** successful [INSERT] and does not change the value returned by this
** routine. ^Thus INSERT OR FAIL, INSERT OR IGNORE, INSERT OR ROLLBACK,
** and INSERT OR ABORT make no changes to the return value of this
** routine when their insertion fails. ^(When INSERT OR REPLACE
** encounters a constraint violation, it does not fail. The
................................................................................
** function is running and thus changes the last insert [rowid],
** then the value returned by [sqlite3_last_insert_rowid()] is
** unpredictable and might not equal either the old or the new
** last insert [rowid].
*/
SQLITE_API sqlite3_int64 sqlite3_last_insert_rowid(sqlite3*);
/*
** CAPI3REF: Count The Number Of Rows Modified
** METHOD: sqlite3
**
** ^This function returns the number of rows modified, inserted or
** deleted by the most recently completed INSERT, UPDATE or DELETE
** statement on the database connection specified by the only parameter.
................................................................................
** either of these things are undefined.
**
** The session object will be used to create changesets for tables in
** database zDb, where zDb is either "main", or "temp", or the name of an
** attached database. It is not an error if database zDb is not attached
** to the database when the session object is created.
*/
int sqlite3session_create(
sqlite3 *db, /* Database handle */
const char *zDb, /* Name of db (e.g. "main") */
sqlite3_session **ppSession /* OUT: New session object */
);
/*
** CAPI3REF: Delete A Session Object
................................................................................
** results of attempting to use pSession with any other session module
** function are undefined.
**
** Session objects must be deleted before the database handle to which they
** are attached is closed. Refer to the documentation for
** [sqlite3session_create()] for details.
*/
void sqlite3session_delete(sqlite3_session *pSession);
/*
** CAPI3REF: Enable Or Disable A Session Object
**
** Enable or disable the recording of changes by a session object. When
** enabled, a session object records changes made to the database. When
................................................................................
** Passing zero to this function disables the session. Passing a value
** greater than zero enables it. Passing a value less than zero is a
** no-op, and may be used to query the current state of the session.
**
** The return value indicates the final state of the session object: 0 if
** the session is disabled, or 1 if it is enabled.
*/
int sqlite3session_enable(sqlite3_session *pSession, int bEnable);
/*
** CAPI3REF: Set Or Clear the Indirect Change Flag
**
** Each change recorded by a session object is marked as either direct or
** indirect. A change is marked as indirect if either:
**
................................................................................
** is set. Passing a value less than zero does not modify the current value
** of the indirect flag, and may be used to query the current state of the
** indirect flag for the specified session object.
**
** The return value indicates the final state of the indirect flag: 0 if
** it is clear, or 1 if it is set.
*/
int sqlite3session_indirect(sqlite3_session *pSession, int bIndirect);
/*
** CAPI3REF: Attach A Table To A Session Object
**
** If argument zTab is not NULL, then it is the name of a table to attach
** to the session object passed as the first argument. All subsequent changes
** made to the table while the session object is enabled will be recorded. See
................................................................................
**
** Changes are not recorded for individual rows that have NULL values stored
** in one or more of their PRIMARY KEY columns.
**
** SQLITE_OK is returned if the call completes without error. Or, if an error
** occurs, an SQLite error code (e.g. SQLITE_NOMEM) is returned.
*/
int sqlite3session_attach(
sqlite3_session *pSession, /* Session object */
const char *zTab /* Table name */
);
/*
** CAPI3REF: Set a table filter on a Session Object.
**
** The second argument (xFilter) is the "filter callback". For changes to rows
** in tables that are not attached to the Session object, the filter is called
** to determine whether changes to the table's rows should be tracked or not.
** If xFilter returns 0, changes is not tracked. Note that once a table is
** attached, xFilter will not be called again.
*/
void sqlite3session_table_filter(
sqlite3_session *pSession, /* Session object */
int(*xFilter)(
void *pCtx, /* Copy of third arg to _filter_table() */
const char *zTab /* Table name */
),
void *pCtx /* First argument passed to xFilter */
);
................................................................................
** is inserted while a session object is enabled, then later deleted while
** the same session object is disabled, no INSERT record will appear in the
** changeset, even though the delete took place while the session was disabled.