PyMOTW: gzip

gzip – Read and write gzip files

Purpose: Read and write gzip files.

Python Version: 1.5.2 and later

The gzip module provides a file-like interface to GNU zip files, using zlib to compress and uncompress the data.

Writing Compressed Files

The module-level function open() creates an instance of the file-like class GzipFile. The usual methods for writing and reading data are provided. To write data into a compressed file, open the file with mode 'w'.
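As a minimal sketch of the write path described above, the following opens a compressed file for writing and reports its size on disk. It assumes Python 3, where the data written must be bytes; the helper name write_gzip and the temporary directory are illustrative choices, not part of the gzip module.

```python
import gzip
import os
import tempfile


def write_gzip(path, data):
    """Write bytes to a gzip-compressed file and return its size on disk.

    Hypothetical helper for illustration only.
    """
    # gzip.open() returns a GzipFile; the with block closes it on exit.
    with gzip.open(path, 'wb') as output:
        output.write(data)
    return os.path.getsize(path)


# A temporary directory keeps the example from littering the working directory.
outfilename = os.path.join(tempfile.mkdtemp(), 'example.txt.gz')
size = write_gzip(outfilename, b'Contents of the example file go here.\n')
print(outfilename, 'contains', size, 'bytes of compressed data')
```

The file is closed before its size is checked, since the gzip trailer is only written on close.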

Different compression levels can be used by passing a compresslevel argument. Valid values range from 1 to 9, inclusive. Lower values are faster and result in less compression. Higher values are slower and compress more, up to a point.

The center column of numbers in the output of the script is the size in bytes of the files produced. As you see, for this input data, the higher compression values do not necessarily pay off in decreased storage space. Results will vary, depending on the input data.
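One way to run this kind of comparison yourself is sketched below. It is not the original script: it assumes Python 3, compresses into an in-memory buffer instead of writing files, and uses made-up sample data, so the sizes it prints will differ from any figures discussed here. The helper name compressed_size is hypothetical.

```python
import gzip
import io


def compressed_size(data, level):
    """Return the number of bytes produced by gzipping data at the given level.

    Hypothetical helper for illustration only.
    """
    buf = io.BytesIO()
    # compresslevel is passed straight through to GzipFile.
    with gzip.GzipFile(fileobj=buf, mode='wb', compresslevel=level) as f:
        f.write(data)
    return len(buf.getvalue())


# Sample input (an assumption); substitute data representative of your own.
data = b'The same line, over and over.\n' * 1024

sizes = {level: compressed_size(data, level) for level in range(1, 10)}
for level in range(1, 10):
    print(level, sizes[level])
```

Running it against your own data is the most reliable way to decide whether a higher level is worth the extra time.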

A GzipFile instance also includes a writelines() method that can be used to write a sequence of strings.

import gzip
import itertools
import os

output = gzip.open('example_lines.txt.gz', 'wb')
try:
    output.writelines(itertools.repeat('The same line, over and over.\n', 10))
finally:
    output.close()

os.system('gzcat example_lines.txt.gz')

$ python gzip_writelines.py
The same line, over and over.
The same line, over and over.
The same line, over and over.
The same line, over and over.
The same line, over and over.
The same line, over and over.
The same line, over and over.
The same line, over and over.
The same line, over and over.
The same line, over and over.

Reading Compressed Data

To read data back from previously compressed files, simply open the file with mode 'r'.
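A minimal sketch of reading a file back, assuming Python 3 (where binary mode returns bytes). So that it is self-contained, it first writes a small file to a temporary location; the file name and contents are illustrative.

```python
import gzip
import os
import tempfile

# Create a small compressed file to read back (illustrative setup).
path = os.path.join(tempfile.mkdtemp(), 'example.txt.gz')
with gzip.open(path, 'wb') as output:
    output.write(b'Contents of the example file go here.\n')

# Open the same file for reading; read() returns the uncompressed data.
with gzip.open(path, 'rb') as input_file:
    contents = input_file.read()

print(contents)
```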

Working with Streams

It is possible to use the GzipFile class directly to compress or uncompress a data stream, instead of an entire file. This is useful for working with data being transmitted over a socket or from an existing (open) file handle. A StringIO buffer can also be used.

import gzip
from cStringIO import StringIO
import binascii

uncompressed_data = 'The same line, over and over.\n' * 10
print 'UNCOMPRESSED:', len(uncompressed_data)
print uncompressed_data

When re-reading the previously compressed data, I pass an explicit length to read(). Leaving the length off resulted in a CRC error, possibly because StringIO returned an empty string before reporting EOF. If you are working with streams of compressed data, you may want to prefix the data with an integer representing the actual amount of data to be read.
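The length-prefix idea can be sketched as follows. This is one possible framing, not an established protocol: it assumes Python 3, uses io.BytesIO in place of cStringIO, and assumes the prefix is the uncompressed length stored as a 4-byte big-endian integer. The function names pack_message and unpack_message are hypothetical.

```python
import gzip
import io
import struct


def pack_message(data):
    """Gzip data and prefix it with its uncompressed length.

    Assumed framing: 4-byte big-endian unsigned integer, then the gzip data.
    """
    buf = io.BytesIO()
    with gzip.GzipFile(fileobj=buf, mode='wb') as f:
        f.write(data)
    return struct.pack('!I', len(data)) + buf.getvalue()


def unpack_message(stream):
    """Read the length prefix from stream, then decompress that many bytes."""
    (length,) = struct.unpack('!I', stream.read(4))
    # GzipFile starts reading at the stream's current position,
    # which is just past the 4-byte prefix.
    with gzip.GzipFile(fileobj=stream, mode='rb') as f:
        return f.read(length)


uncompressed_data = b'The same line, over and over.\n' * 10
message = pack_message(uncompressed_data)
reread_data = unpack_message(io.BytesIO(message))

print('UNCOMPRESSED:', len(uncompressed_data))
print('RE-READ:', len(reread_data))
```

With a real socket the reader would also need to know how many compressed bytes to pull off the wire, so a second prefix carrying the compressed length may be worth adding.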

$ python gzip_StringIO.py
UNCOMPRESSED: 300
The same line, over and over.
The same line, over and over.
The same line, over and over.
The same line, over and over.
The same line, over and over.
The same line, over and over.
The same line, over and over.
The same line, over and over.
The same line, over and over.
The same line, over and over.

RE-READ: 300
The same line, over and over.
The same line, over and over.
The same line, over and over.
The same line, over and over.
The same line, over and over.
The same line, over and over.
The same line, over and over.
The same line, over and over.
The same line, over and over.
The same line, over and over.