Pig should be able to split Gzip files like it can split Bzip files

Description

It would be nice to be able to split gzip files like we can split bzip files. Unfortunately, we don't have a sync point for the split in the gzip format.

The gzip file format supports concatenated gzipped files: when gzipped files are concatenated together, they are treated as a single file. So to make a gzipped file splittable, we can use an empty compressed file with some salt in the headers as a sync signature. We then place this sync signature between the compressed segments of the file, and each signature marks a point where a split can begin.
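
As a rough illustration of the concatenation property described above, here is a minimal Python sketch. It is not the attached patch: the empty member below is a plain empty gzip file standing in for the sync signature and carries no salt, and all variable names are made up for the example.

```python
import gzip

# Two independently compressed segments, plus an empty gzip member that
# stands in for the sync signature (no salt in its header here).
part_a = gzip.compress(b"first segment\n")
part_b = gzip.compress(b"second segment\n")
empty_marker = gzip.compress(b"")

# Concatenating gzip members produces another valid gzip stream.
combined = empty_marker + part_a + empty_marker + part_b

# gzip.decompress reads every member in turn and joins the output, so the
# empty members contribute nothing to the decompressed data.
restored = gzip.decompress(combined)
print(restored.decode(), end="")
```

The point is only that a reader who ignores the markers still sees the original data, which is why the scheme stays compatible with ordinary gzip tools.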

Benjamin Reed
added a comment - 01/Dec/07 22:18 The attached patch implements the method of splitting GZipped files as outlined in the issue description. It uses the same hooks as BZip. We need to review to make sure it terminates properly.
If the gzipped file is not set up for splits, we fall back to not splitting the file.
An unsplittable gzipped dataset can be converted to a splittable one with the following Pig Latin:
a = load 'orig.gz';
store a into 'splittable.gz';

Sam Pullara
added a comment - 01/Dec/07 23:00 Is there any reason you decided not to use the gzip ID instead of empty files? It seems like it would be better if people could generate these files themselves easily without using Pig at all. Each gzip file will start with "1F 8B 08 08" [1] if you use this mechanism to create them:
gzip -c test1 test2 > test.gz [2]
In the few cases where a match is wrong, you will get an exception from your gzip stream and you can try again at the next boundary.
[1] http://www.gzip.org/zlib/rfc-gzip.html
[2] man gzip
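
The scan-and-retry idea in this comment could look roughly like the Python sketch below. It is only an illustration under assumptions: `find_member_start` and `make_member` are hypothetical helpers, and validating a candidate by decompressing everything after it is far more expensive than what a real record reader would do.

```python
import gzip
import io
import zlib

# ID1, ID2, CM=deflate, FLG=FNAME: what a member starts with when the
# original file name is stored in the header.
MAGIC = b"\x1f\x8b\x08\x08"

def make_member(name, payload):
    """Hypothetical helper: one gzip member with the file name stored."""
    buf = io.BytesIO()
    with gzip.GzipFile(filename=name, mode="wb", fileobj=buf, mtime=0) as f:
        f.write(payload)
    return buf.getvalue()

def find_member_start(data, start):
    """Return the first offset >= start where a full gzip member decodes, or -1."""
    pos = data.find(MAGIC, start)
    while pos != -1:
        try:
            d = zlib.decompressobj(31)  # 16 + 15: expect a gzip wrapper
            d.decompress(data[pos:])
            if d.eof:  # reached the end of a member, CRC checked by zlib
                return pos
        except zlib.error:
            pass  # false match: move on to the next candidate
        pos = data.find(MAGIC, pos + 1)
    return -1

first = make_member("test1", b"alpha\n" * 100)
second = make_member("test2", b"beta\n" * 100)
data = first + second

# A split that begins one byte into the file should sync to the second member.
print(find_member_start(data, 1) == len(first))
```

Note that a false match is only rejected here once decoding fails, which is the retry cost this comment accepts.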

Benjamin Reed
added a comment - 03/Dec/07 16:01 There are two reasons I use an empty file with a comment:
1) It allows me to test that a gzip file is in fact splittable. We need to know up front that we can split the gzip file. If the gzip isn't split at regular intervals, it's going to waste a lot of time! The signature is more than a marker; it is metadata that indicates that the file can be split. You will also notice that if you do 'head' on the file you can see that it is splittable.
2) It gives you a much more reliable signature. (20 bytes instead of 4)
You can still use standard tools without using Pig:
cat signature.gz > test.gz; gzip -c test1 >> test.gz; cat signature.gz >> test.gz; gzip -c test2 >> test.gz
You use standard gunzip to decompress. You can also easily find the split boundaries outside of pig by looking for the signature.gz sequence.
This also allows you to better control the grouping. If your gzip file is bigger than 4G, it will be a concatenation, so there may be times when you want to process concatenated gzip files together without splitting. Using the empty signature file allows you to do that.
Now that I think about it more, it might also be good to reserve some bytes in the signature.gz to put a block size. That way we can do intelligent splits when the fs blocksize doesn't correspond to the gzip blocksize or the number of requested splits is very high.
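
To make the signature idea concrete, here is a hedged Python sketch of an empty gzip member carrying a comment, plus the up-front splittability check and boundary search. The comment text `PIG-SPLIT-SYNC` is an invented placeholder (the bytes the patch actually writes are not given in this thread), and `signature_member` and `split_segments` are hypothetical names.

```python
import gzip
import struct

# Placeholder salt: an assumption for illustration, not the patch's value.
SALT = b"PIG-SPLIT-SYNC"

def signature_member(salt=SALT):
    """Build an empty gzip member whose header comment holds the salt."""
    # Magic, CM=deflate, FLG=FCOMMENT (0x10), then mtime=0, XFL=0, OS=unknown.
    header = b"\x1f\x8b\x08\x10" + struct.pack("<IBB", 0, 0, 0xFF)
    body = b"\x03\x00"                   # raw deflate stream for empty input
    trailer = struct.pack("<II", 0, 0)   # CRC32 and length of empty input
    return header + salt + b"\x00" + body + trailer

def split_segments(data, sig):
    """Split at signatures, or return the file whole if it is not marked."""
    if not data.startswith(sig):
        return [data]  # no signature up front: treat as unsplittable
    return [seg for seg in data.split(sig) if seg]

sig = signature_member()
data = sig + gzip.compress(b"one\n") + sig + gzip.compress(b"two\n")

# Standard decompression ignores the empty signature members.
print(gzip.decompress(data))

# Each segment between signatures can be decompressed on its own.
print([gzip.decompress(seg) for seg in split_segments(data, sig)])
```

Because the check is just a prefix comparison, a reader can decide whether to split before touching the rest of the file.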

Owen O'Malley
added a comment - 07/Dec/07 08:38 It seems a lot more friendly to define the format like:
% touch empty
% gzip -nc part0 empty part1 empty part2 empty part3 > big.sgz
That would let the user do:
% gzcat big.sgz
to get their file back. I'd also use filenames rather than a header to reflect whether a file is in this format, but that is mostly just a personal preference.

Benjamin Reed
added a comment - 07/Dec/07 16:31 There are two problems with just using an empty file.
1) The signature is just too small to reliably detect the split. Recovering from a misdetected split isn't as easy as retrying, because it usually means you get an OutOfMemoryError or you may have already returned bad data.
2) You have to revert to relying on an extension to detect splittability. This ends up being pretty hokey because most gzip utilities are looking for a .gz extension. The splittable gzip format is completely compatible with existing gzip utilities. Also, if a user puts the wrong extension, splits may not happen when they could, or we may try to split files that we cannot.
Plus it's really nice to be able to do a head file.gz and see right away whether the file is splittable or not.

Benjamin Reed
added a comment - 07/Dec/07 18:00 The patch is not ready to commit yet. It's a work-in-progress patch. I talked to Utkarash about this and it's missing a termination of the split. Currently each split will not terminate correctly. There is a termination hook that bzip uses that I need to latch into.
Basically here are the things I need to add to finish:
1) Terminate split processing correctly
2) Add test cases
3) Encode block size as part of the header so that we can get almost "perfect" splits. (For example a file that is compressed as 128M blocks should not be split on 64M boundaries even if the block size of the filesystem is 128M.)
I'll try to get a committable patch this weekend.

Tom White
added a comment - 08/Sep/08 15:03 It would be nice if the format could be generated using standard tools. By modifying the gzip flag header so that it refers to the file name (which the gzip tool can set) rather than a comment (which it cannot), we can generate compatible files using the following:
touch -mt 197007130719.25 Split
gzip -c Split file1 Split file2 > file.gz
Then the first split file has the following hexdump:
hexdump -n 26 -C file.gz
00000000 1f 8b 08 08 6d ca fe 00 00 03 53 70 6c 69 74 00 |....m.....Split.|
00000010 03 00 00 00 00 00 00 00 00 00 |..........|
0000001a
Note that the OS flag is 03 (Unix) rather than FF (unknown), but that should be OK as the code doesn't use it when searching for the signature.
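
For anyone who wants to check the dump by hand, the following Python sketch decodes the 26 bytes shown above. The field layout follows the gzip header format; the variable names are just for this example.

```python
import gzip
import struct

# The 26 bytes from the hexdump above.
raw = bytes.fromhex(
    "1f 8b 08 08 6d ca fe 00 00 03 53 70 6c 69 74 00"
    "03 00 00 00 00 00 00 00 00 00"
)

# Fixed 10-byte header: magic, compression method, flags, mtime, XFL, OS.
magic, cm, flg, mtime, xfl, os_id = struct.unpack("<2sBBIBB", raw[:10])

# FLG=0x08 (FNAME) means a NUL-terminated file name follows the header.
name_end = raw.index(b"\x00", 10)
name = raw[10:name_end]

print(magic.hex(), cm, hex(flg), hex(mtime), xfl, os_id, name)

# The whole thing is a valid, empty gzip member.
print(gzip.decompress(raw))
```

The mtime field is stored little-endian, so the bytes 6d ca fe 00 decode to 0x00feca6d, which is the fixed timestamp set by the touch command.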

David Ciemiewicz
added a comment - 06/Apr/10 22:48 Hadoop Archives are not really the solution here. I want my code to work with exactly the same file name references whether I have 100 gzip compressed (or bzip2 compressed) part files or a single concatenation of the individually compressed part files.
I have to change all my filename references to use a har.
What we really want are simple concatenations of gzip files and bzip2 files that work with map reduce.