Details

Processing of concatenated gzip files formerly stopped (quietly) at the end of the first substream/"member"; now processing will continue to the end of the concatenated stream, like gzip(1) does. (bzip2 support is unaffected by this patch.)
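The old-versus-new behavior is easy to illustrate outside Hadoop. A minimal Python sketch (not Hadoop code) of what a conforming reader does with a multi-member gzip stream, per RFC 1952:

```python
import gzip

# Two independently gzipped "members" concatenated back to back, as
# produced by e.g. `gzip -c a >> out.gz; gzip -c b >> out.gz`.
stream = gzip.compress(b"first member\n") + gzip.compress(b"second member\n")

# A conforming reader (gzip(1), zcat, Python's gzip module) decodes every
# member; the old Hadoop codec stopped quietly after the first one.
assert gzip.decompress(stream) == b"first member\nsecond member\n"
```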

Oscar Gothberg
added a comment - 12/Jan/09 20:18
Thanks, Tom, for filing the Jira on this issue.
Does anyone know of a possible alternate gzip codec that could just be plugged into Hadoop? I suddenly have terabytes of gzipped data that I don't know if I can trust Hadoop with, and it would be a big pain to re-archive everything.

David Ciemiewicz
added a comment - 06/Apr/10 21:43
The bzip2 compression format also supports concatenation of individual bzip2-compressed files into a single file.
bzcat has absolutely no problem reading all of the data in one of these concatenated files.
Unfortunately, both Hadoop Streaming and Pig see only about 2% of the data from the original file in my case. That's a 98% effective data loss.
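The bzcat behavior described above can be reproduced with Python's bz2 module, which (like bzcat) explicitly supports multi-stream inputs; a sketch of the format behavior, not of the Hadoop codec:

```python
import bz2

# Concatenation of two independently bzip2-compressed files, as produced
# by e.g. `bzip2 -c a >> out.bz2; bzip2 -c b >> out.bz2`.
stream = bz2.compress(b"part one\n") + bz2.compress(b"part two\n")

# bzcat reads all of the data; so does bz2.decompress(), which handles
# the concatenation of multiple distinct compressed streams.
assert bz2.decompress(stream) == b"part one\npart two\n"
```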

David Ciemiewicz
added a comment - 07/Apr/10 00:15
Unfortunately, I discovered that concatenated bzip2 files did not work in Map-Reduce until AFTER I went and concatenated 3 TB and over 250K compressed files.
A colleague suggested that I "fix" my data using the following approach:
hadoop dfs -cat X | bunzip2 | bzip2 | hadoop dfs -put - X.new
I tried this with a 3GB single file concatenation of multiple bzip2 compressed files.
This process took just over an hour with compression taking 5-6X longer than decompression (as measured in CPU utilization).
It only took several minutes to concatenate the multiple part files into a single file.
I think this points out that decompressing and recompressing data is not really a viable solution for creating large concatenations of smaller files.
The best-performing solution is to create the smaller part files in parallel with a bunch of reducers, then concatenate them later into one (or several) larger files.
And so fixing Hadoop Map-Reduce to be able to read concatenations of files is probably the highest return on investment for the community.
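The trade-off above can be sketched in Python (illustrative only; the real data lives in HDFS): concatenating already-compressed part files is a pure byte copy, while the bunzip2 | bzip2 pipeline pays the full recompression cost for logically identical output.

```python
import bz2

parts = [b"part-00000 data\n" * 100, b"part-00001 data\n" * 100]

# Cheap: concatenate the already-compressed part files (a byte copy;
# no compression work at all once the parts exist).
concatenated = b"".join(bz2.compress(p) for p in parts)

# Expensive: decompress and recompress the whole thing, as in
# `hadoop dfs -cat X | bunzip2 | bzip2 | hadoop dfs -put - X.new`.
recompressed = bz2.compress(bz2.decompress(concatenated))

# Both routes yield the same logical data; only the CPU bill differs.
assert bz2.decompress(concatenated) == bz2.decompress(recompressed) == b"".join(parts)
```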

Greg Roelofs
added a comment - 27/May/10 03:32
The bzip2 part reportedly is fixed on the trunk (HADOOP-4012); I haven't yet verified this for myself, but I have no reason to believe it doesn't work.
I'm working on half of the gzip half, i.e., the native-libraries portion. I appear to have a working proof of concept, but my testing so far has been extremely minimal. The java.util.zip portion could be addressed with something similar to Duncan Loveday's MultiMemberGZIPInputStream workaround (http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=4691425), but the license on his actual code is unclear. (On the other hand, he has an Apache account and apparently still works at BT, so it might be possible to get that clarified.)
Ravi, do you mind if I assign this issue to myself?

Greg Roelofs
added a comment - 15/Jun/10 03:39
Almost-final gzip concatenation code (several style-related issues to deal with, but working code, both native and non-native, with no debug statements) and a halfway test case (still need to get the bzip2 half working).
Summary: I implemented an Inflater-based Decompressor with "manual" gzip header/trailer parsing and CRC checks, and added new getRemaining() and resetPartially() methods to the interface. I also modified DecompressorStream to support concatenated streams (decompress() and getCompressedData() methods). For backward compatibility, the default behavior is unchanged; one needs to set the new io.compression.gzip.concat config option to true to turn it on. Since bzip2 apparently changed its behavior without such a setting, perhaps this is overkill...
Anyway, this is against trunk (as of a week or two ago). I still need to check it against Yahoo's tree, deal with the FIXMEs, update my source tree(s), run test-patch, etc. Also, I haven't included the (binary) test files here; I'll do so in one of the next versions of the patch.
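The "manual" gzip header/trailer parsing and CRC checks mentioned in the summary can be sketched with Python's zlib (the Hadoop class itself is Java; this only shows the format work a gzip-aware Decompressor must do): skip the 10-byte header, raw-inflate the deflate body, then verify the trailer's CRC-32 and ISIZE.

```python
import gzip, struct, zlib

payload = b"hello, concatenated world\n"
member = gzip.compress(payload)

# 10-byte header: magic 1f 8b, CM=8 (deflate), then FLG/MTIME/XFL/OS.
assert member[:3] == b"\x1f\x8b\x08"
assert member[3] == 0  # FLG=0: no optional fields (FNAME, FEXTRA, ...) here

# Raw deflate data sits between the header and the 8-byte trailer.
inflated = zlib.decompressobj(wbits=-15).decompress(member[10:-8])

# Trailer: CRC-32 of the uncompressed data, then ISIZE (length mod 2^32).
crc, isize = struct.unpack("<II", member[-8:])
assert inflated == payload
assert crc == zlib.crc32(inflated)
assert isize == len(inflated) % 2**32
```

Note the FLG=0 assumption holds for gzip.compress output; a general decoder must also skip any optional header fields the flags announce.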

David Ciemiewicz
added a comment - 15/Jun/10 16:43
Greg, I cannot think of any good reason to keep the current behavior of failing to read concatenated gzip files. Requiring end users to actively set a flag (io.compression.gzip.concat) just to get the behavior every user already expects seems ass-backwards. Reading concatenated files should be the default behavior.

Greg Roelofs
added a comment - 15/Jun/10 19:34
Ciemo,
On Yahoo grids it will be (via site config). For the default code as used
by everyone else, I'm open to input. I tend to be conservative when it
comes to things that could be interpreted as user-visible API changes, but
if everyone agrees that the old behavior is simply a bug, I'm happy to drop
the config option and just hardcode concatenation support. Or I can leave
it configurable but make the concat support the default.
I'll send an e-mail to mapreduce-user and see what others think.

David Ciemiewicz
added a comment - 15/Jun/10 22:53
Greg, I have yet to encounter ANYONE who doesn't consider this a bug, because all cited references EXPECT concatenated files to work: they work in ALL OTHER cited instances, including GNU tools, web browsers, etc. Can you think of a single instance where it would be the right thing to stop reading a concatenated file after the first part is read, ignoring all other concatenated parts? Forgive me, but suggesting that we keep the existing behavior seems absurd, because I cannot think of a single case where this would be the right thing to do.

Greg Roelofs
added a comment - 16/Jun/10 00:32
I can't think of a good use case for it, but a few years of development
experience have taught me that, planetwide, someone else very well may have.
Hence my question to mapreduce-user.
The one thing that did occur to me was financially oriented, e.g., if an
existing data flow just fits within budget (whether actual dollars or grid
capacity or time limits or whatever) because it's been reading half of the
available data and getting "good enough" results. Suddenly doubling its
usage (or 50x, as in your case) without adequate warning could be quite
painful. Admittedly, this is a weak example and probably very unlikely,
but there may be a real case that's somewhat similar.
Note that I'm perfectly willing to hardcode it always-on just like the
trunk's bzip2 code; that would simplify the code, eliminate a per-bufferload
conditional, and just generally be cleaner. I'm also happy to leave it
configurable but on by default. However, I'd like to give the user community
a chance to pipe up in case there actually is a problematic use case out
there that you and I have overlooked.

Greg Roelofs
added a comment - 17/Jun/10 21:04
Okey dokey, default it is. mapreduce-user feedback was limited but in favor of hardcoding it (i.e., not even configurable), so that's currently my plan for the final version. Later today I'll post a second interim version that merely flips the current config option. (Main purpose will be to incorporate informal review feedback from Chris Douglas and, hopefully, make the unit test much more complete; at that point I'll consider it ready for formal review and checkin.)

Greg Roelofs
added a comment - 22/Jun/10 04:01
Expanded test coverage uncovered a bug on Friday, and today's trunk update has breakage, so this version is against Yahoo's 0.20S+ branch.
Still not quite final; I haven't finished updating the unit test to exercise both native and built-in gzip and built-in bzip2 at multiple buffer sizes, and I've left some (mostly) commented-out debug statements in place in case that turns up anything further.
Reviewer questions:
Currently the new BuiltInGzipDecompressor class inherits directly from JDK Inflater, but I suspect I should extend BuiltInZlibInflater instead.
Is it worthwhile to encapsulate the state label and associated variables into a private inner class (BuiltInGzipDecompressor.java, first FIXME comment)?
The other FIXMEs are either related to the two items above or else are largely unrelated to this issue.

Greg Roelofs
added a comment - 22/Jun/10 04:28
Oh, two more review questions:
DecompressorStream currently supports two concatenation modes via a pseudo-ifdef ("final boolean useResetPartially"): resetPartially(), which avoids any additional buffer copies at a cost of uglifying the Decompressor interface with this new method; or regular reset() + setInput() to recopy any "excess" bytes (that is, from stream N+1) at the end of stream N. The amount of recopying in the latter case is dependent on the buffer sizes (typically 64KB around here) and sizes of the concatenated gzip streams/members, but in general it won't be much. Barring strong disagreement, I'll go with the latter approach and clean up all the resetPartially() stuff in the next (hopefully final) version of the patch.
Any last-minute qualms about hardcoding the concatenation behavior? It would simplify the patch slightly and seems to be the preferred approach, so that's my plan for the next version.
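The reset() + setInput() approach (hand the "excess" bytes from member N+1 to a freshly reset decompressor) has a close analogue in zlib's unused_data, which gives a compact Python sketch of the control flow; this is illustrative, not the Java patch itself:

```python
import gzip, zlib

stream = gzip.compress(b"member one\n") + gzip.compress(b"member two\n")

out, excess = [], stream
while excess:
    # "reset()": a fresh decompressor for each gzip member...
    d = zlib.decompressobj(wbits=31)  # wbits=31 selects the gzip wrapper
    # ..."setInput()": feed it the leftover bytes from the previous pass.
    out.append(d.decompress(excess))
    excess = d.unused_data  # bytes belonging to the next member, if any

assert b"".join(out) == b"member one\nmember two\n"
```

The recopy the comment mentions corresponds to handing `unused_data` back in at the top of the loop; as noted, it is at most one buffer's worth per member.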

Chris Douglas
added a comment - 25/Jun/10 18:39
> Currently the new BuiltInGzipDecompressor class inherits directly from JDK Inflater, but I suspect I should extend BuiltInZlibInflater instead.
I'd lean the other way. It's not really a supertype of BuiltInZlibInflater, and neither are public types. It's not really an Inflater, either; it may be worth either supporting that interface or using an Inflater member rather than inheritance, since it only calls public methods. Either way, it's an existing confusion in the compression type hierarchy, and if it calls for any additional testing it can be ignored in this issue.
> Is it worthwhile to encapsulate the state label and associated variables into a private inner class (BuiltInGzipDecompressor.java, first FIXME comment)?
Since the code is already implemented and tested, refactoring it for a slightly cleaner implementation of a user-opaque, RFC-compliant library doesn't seem like a reasonable condition for committing it.
> DecompressorStream currently supports two concatenation modes via a pseudo-ifdef ("final boolean useResetPartially"): resetPartially(), which avoids any additional buffer copies at a cost of uglifying the Decompressor interface with this new method; or regular reset() + setInput() to recopy any "excess" bytes (that is, from stream N+1) at the end of stream N. The amount of recopying in the latter case is dependent on the buffer sizes (typically 64KB around here) and sizes of the concatenated gzip streams/members, but in general it won't be much. Barring strong disagreement, I'll go with the latter approach and clean up all the resetPartially() stuff in the next (hopefully final) version of the patch.
Agreed; the penalty for re-copying once per stream is light enough that it can be endured for other codecs' API considerations.
> Any last-minute qualms about hardcoding the concatenation behavior? It would simplify the patch slightly and seems to be the preferred approach, so that's my plan for the next version.
Sounds fine to me. It may cause faults in some containers, but those are probably bugs covered over by this one.

Greg Roelofs
added a comment - 25/Jun/10 18:40
Updated patch with the promised changes from my third and fourth questions (i.e., reset()/setInput() rather than resetPartially(), and the hardcoded concat fix); cleaned up debug/FIXME stuff; and full test files (well, the shorter versions).
This is still against Yahoo's 0.20.2xx branch; I'll port it back to trunk next.

Greg Roelofs
added a comment - 25/Jun/10 18:45
Whoops, ships passing...
> It's not really an Inflater, either; it may be worth either supporting that interface or using an Inflater member rather than inheritance, since it only calls public methods.
Good point. OK, I'll look into that first, then port to trunk. Seems like a pretty lightweight change.

Greg Roelofs
added a comment - 29/Jun/10 03:26
Final patch against trunk. All the good stuff is in the -common half; the mapreduce half has a new unit test and six supporting test files.
Because the unit test purports to be a generic concatenated-compressed-input test, I had ported both of the main gzip subtests to work with the bzip2 decoder. That turned up a possible issue in the bzip2 decoder (or, equally likely, in the way I wrote the test case); for now, I commented out that part of the test (near line 586). The weird thing is that my own, more rambunctious test case works fine. This shouldn't be a blocker for a gzip-oriented patch, but if a reviewer could take a quick look at the commented-out block and see if I did something obviously wrong, I'd appreciate it. (The problem didn't turn up in the Yahoo branch because 0.20.x doesn't support bzip2 concatenation, as the issue title notes. The breakage appears to occur precisely at the start of the second bzip2 "member.")
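For anyone reproducing the bzip2 breakage: the second member's boundary is easy to locate, since every bzip2 stream begins with the "BZh" magic. A quick Python illustration (unrelated to the Java test case itself) of where a multi-member-aware decoder should pick up again:

```python
import bz2

stream = bz2.compress(b"member one\n") + bz2.compress(b"member two\n")

# A single-stream decompressor stops at the end of the first member.
d = bz2.BZ2Decompressor()
first = d.decompress(stream)
assert first == b"member one\n" and d.eof

# Everything left over is the next member, starting with the bzip2 magic;
# this boundary is exactly where the breakage was observed.
assert d.unused_data[:3] == b"BZh"
assert bz2.BZ2Decompressor().decompress(d.unused_data) == b"member two\n"
```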

Hadoop QA
added a comment - 29/Jun/10 03:35
-1 overall. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12448272/HADOOP-6835.v4.trunk-hadoop-mapreduce.patch
against trunk revision 957074.
+1 @author. The patch does not contain any @author tags.
+1 tests included. The patch appears to include 21 new or modified tests.
-1 patch. The patch command could not apply the patch.
Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-h4.grid.sp2.yahoo.net/596/console
This message is automatically generated.

Greg Roelofs
added a comment - 01/Jul/10 01:27
OK, so auto-patch doesn't know about the split projects, and hadoop-mapreduce patches don't work in hadoop-common.
I tried moving the test case to hadoop-common/src/test/core/org/apache/hadoop/io/compress, but it depends on too many MR classes to be workable (JobConf, Reporter, RecordReader, FileSplit, InputSplit, FileInputFormat, TextInputFormat). So I guess the procedure is (1) reupload the hadoop-common patch and mark it as patch-available [actually, I missed a fix in TestCodec, so I need to upload a new one anyway]; (2) get that past QA and checked in; (3) open a separate MR JIRA, attach the test case, and mark it patch-available; and (4) get that past QA and checked in, too.
If there's an easier way, feel free to enlighten me. (On a related note, I was told by Reliable Sources to blame Owen, but he just left town. Coincidence?)

However, BuiltInGzipDecompressor lives in org.apache.hadoop.io.compress.zlib, and GzipCodec.java includes "import org.apache.hadoop.io.compress.zlib.*;", so this failure makes no sense to either of us.

Greg Roelofs
added a comment - 07/Jul/10 01:59
Fixed an extra pair of javac warnings (unnecessary int typecasts in readUShortLE()).
The javadoc output file (http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-h4.grid.sp2.yahoo.net/602/artifact/trunk/patchprocess/patchJavadocWarnings.txt) shows 15 + 6 warnings, all of which are existing warnings in the security code. I have no idea which one it thinks I'm responsible for (particularly since there's no corresponding "trunkJavadocWarnings.txt" link), but I plead not guilty...

Greg Roelofs
added a comment - 07/Jul/10 03:49
Same as v7, but with a pair of new TestCodec unit (sub)tests provided by Chris Douglas. (These represent a low-level, hadoop-common complement to the existing ones in the hadoop-mapreduce patch.)
All uses of the deprecated "hadoop.native.lib" option in TestCodec are replaced by "io.native.lib.available", too.