hadoop-mapreduce-issues mailing list archives

[jira] Commented: (MAPREDUCE-2094) org.apache.hadoop.mapreduce.lib.input.FileInputFormat: isSplitable implements unsafe default behaviour that is different from the documented behaviour.

Date

Wed, 29 Sep 2010 08:33:34 GMT

[ https://issues.apache.org/jira/browse/MAPREDUCE-2094?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12916076#action_12916076
]
Niels Basjes commented on MAPREDUCE-2094:
-----------------------------------------
bq. Changing the default implementation to return false would be an incompatible change, potentially
breaking existing subclasses.
If you mean with "breaking" : "Some subclasses will see an unexpected performance degradation"
. then Yes, that will most likely occur ( first one I think of is SequenceFileInputFormat
).
I however do not expect any functional "breaking" of the output of these subclasses.
bq. Making the method abstract would also be incompatible and break subclasses, but in a way
that they'd easily detect.
Yes.
The downside of this option is that if subclasses want to have detection depending on the
compression I expect a lot of code duplication to occur.
This code duplication is already occurring within the main code base in KeyValueTextInputFormat
, TextInputFormat, CombineFileInputFormat and their old API counterparts (I found a total
of 5 duplicates of the same isSplittable implementation).
bq. Perhaps the javadoc should just be clarified to better document this default?
Definitely an option. However this would not fix the effect in the existing subclasses.
I just did a quick manual code check of the current trunk and I found that the following classes
are derived from FileInputFormat yet do not implement the isSplittable method (and thus use
"return true").
* ./src/java/org/apache/hadoop/mapreduce/lib/input/NLineInputFormat.java
* ./src/contrib/streaming/src/java/org/apache/hadoop/streaming/AutoInputFormat.java
* ./src/java/org/apache/hadoop/mapred/SequenceFileInputFormat.java
I expect that the NLineInputFormat and the AutoInputFormat will affected by this "large gzip"
bug.
So expect that simply fixing the isSplittable documentation would lead to the need to fix
*at least* these two classes.
As far as I understand the SequenceFileInputFormat can only be compressed using a "splittable"
compression, so the "return true;" from FileInputFormat will work fine there.
Overall I still prefer the clean option of returning the correct value depending on the compression.
That would effectively leave the behavior in most use cases unchanged. Yet in those cases
where splitting is known to cause problems it would avoid those problems. Thus avoiding major
issues like the ones we had and described in HADOOP-6901.
For the SequenceFileInputFormat it may be needed to implement the isSplittable as "return
true;"
Effectively the set of changes I propose (in both the old and new API versions of these classes):
1) FileInputFormat . isSplittable gets the implementation as seen in TextInputFormat
2) The isSplittable implementation is removed from KeyValueTextInputFormat , TextInputFormat,
CombineFileInputFormat (useless code duplication)
3) The isSplittable implementation "return true" is added to SequenceFileInputFormat. Given
the fact that you cannot gzip a sequencefile I expect this to be an optional fix.
> org.apache.hadoop.mapreduce.lib.input.FileInputFormat: isSplitable implements unsafe
default behaviour that is different from the documented behaviour.
> -------------------------------------------------------------------------------------------------------------------------------------------------------
>
> Key: MAPREDUCE-2094
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-2094
> Project: Hadoop Map/Reduce
> Issue Type: Bug
> Components: task
> Affects Versions: 0.20.1, 0.20.2, 0.21.0
> Reporter: Niels Basjes
>
> When implementing a custom derivative of FileInputFormat we ran into the effect that
a large Gzipped input file would be processed several times.
> A near 1GiB file would be processed around 36 times in its entirety. Thus producing garbage
results and taking up a lot more CPU time than needed.
> It took a while to figure out and what we found is that the default implementation of
the isSplittable method in [org.apache.hadoop.mapreduce.lib.input.FileInputFormat | http://svn.apache.org/viewvc/hadoop/mapreduce/trunk/src/java/org/apache/hadoop/mapreduce/lib/input/FileInputFormat.java?view=markup
] is simply "return true;".
> This is a very unsafe default and is in contradiction with the JavaDoc of the method
which states: "Is the given filename splitable? Usually, true, but if the file is stream compressed,
it will not be. " . The actual implementation effectively does "Is the given filename splitable?
Always true, even if the file is stream compressed using an unsplittable compression codec.
"
> For our situation (where we always have Gzipped input) we took the easy way out and simply
implemented an isSplittable in our class that does "return false; "
> Now there are essentially 3 ways I can think of for fixing this (in order of what I would
find preferable):
> # Implement something that looks at the used compression of the file (i.e. do migrate
the implementation from TextInputFormat to FileInputFormat). This would make the method do
what the JavaDoc describes.
> # "Force" developers to think about it and make this method abstract.
> # Use a "safe" default (i.e. return false)
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.