Details

Description

The property "mapred.output.compression.codec" is used both when setting and getting the map output compression codec in JobConf and as the property for the overall job output codec, making it impossible to use different codecs for map outputs and job outputs.

Activity

Riccardo Boscolo
added a comment - 06/Sep/07 22:29 - edited The methods that affect this behavior are in lines 341-371 of JobConf in Hadoop 0.14.1:
/**
 * Set the given class as the compression codec for the map outputs.
 * @param codecClass the CompressionCodec class that will compress the
 *                   map outputs
 */
public void setMapOutputCompressorClass(Class<? extends CompressionCodec> codecClass) {
  setCompressMapOutput(true);
  setClass("mapred.output.compression.codec", codecClass,
           CompressionCodec.class);
}

/**
 * Get the codec for compressing the map outputs.
 * @param defaultValue the value to return if it is not set
 * @return the CompressionCodec class that should be used to compress the
 *         map outputs
 * @throws IllegalArgumentException if the class was specified, but not found
 */
public Class<? extends CompressionCodec> getMapOutputCompressorClass(Class<? extends CompressionCodec> defaultValue) {
  String name = get("mapred.output.compression.codec");
  if (name == null) {
    return defaultValue;
  } else {
    try {
      return getClassByName(name).asSubclass(CompressionCodec.class);
    } catch (ClassNotFoundException e) {
      throw new IllegalArgumentException("Compression codec " + name +
                                         " was not found.", e);
    }
  }
}
This could be easily fixed by using a different property, for example, "map.output.compression.codec".
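The clash is easy to reproduce in miniature. The sketch below uses a Map-backed stand-in for JobConf (a hypothetical stub, not the real Hadoop class) to show how two logically distinct setters that share one property key silently overwrite each other:

```java
import java.util.HashMap;
import java.util.Map;

// Minimal stand-in for JobConf illustrating why sharing one property key
// conflates the map-output and job-output codec settings.
public class CodecClash {
    static final String SHARED_KEY = "mapred.output.compression.codec";
    private final Map<String, String> props = new HashMap<>();

    void setOutputCompressorClass(String codec)    { props.put(SHARED_KEY, codec); }
    void setMapOutputCompressorClass(String codec) { props.put(SHARED_KEY, codec); } // same key!

    String get(String key) { return props.get(key); }

    public static void main(String[] args) {
        CodecClash conf = new CodecClash();
        conf.setOutputCompressorClass("GzipCodec");   // intended for job outputs
        conf.setMapOutputCompressorClass("LzoCodec"); // intended for map outputs
        // Both reads now see LzoCodec; the job-output choice was silently lost.
        System.out.println(conf.get(SHARED_KEY));
    }
}
```

Whichever setter runs last wins for both purposes, which is exactly the behavior this issue describes.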

Arun C Murthy
added a comment - 27/Sep/07 07:37 As Riccardo says, we clearly need another configuration property... how about:

Map Outputs:
mapred.map.output.compression.{type|codec}
JobConf.{get|set}MapOutputCompression{Type|Codec}

Job Outputs:
mapred.output.compression.{type|codec}
JobConf.{get|set}OutputCompression{Type|Codec}

That's the easy bit; however, I think this will snowball into a large-ish patch, since it entails hunting down all OutputFormat implementations and ensuring that they use the mapred.output.compression.{type|codec} properties... clearly something we have to fix. Of course, I'll ensure this is clearly documented for folks who write their own OutputFormats.
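The proposed split can be sketched with the same kind of Map-backed stub (hypothetical, not the eventual Hadoop API): distinct keys mean the map-output and job-output choices no longer collide.

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of Arun's proposed property split, using a stub in place of JobConf.
// The property names follow the proposal above; the class itself is illustrative.
public class SplitCodecConf {
    static final String MAP_KEY = "mapred.map.output.compression.codec";
    static final String JOB_KEY = "mapred.output.compression.codec";
    private final Map<String, String> props = new HashMap<>();

    void setMapOutputCompressionCodec(String codec) { props.put(MAP_KEY, codec); }
    void setOutputCompressionCodec(String codec)    { props.put(JOB_KEY, codec); }

    // Each getter reads its own key, so setting one never clobbers the other.
    String getMapOutputCompressionCodec(String dflt) { return props.getOrDefault(MAP_KEY, dflt); }
    String getOutputCompressionCodec(String dflt)    { return props.getOrDefault(JOB_KEY, dflt); }
}
```

With this shape, a job can use a fast codec (e.g. LZO) for intermediate map outputs while writing final job outputs with gzip.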

Arun C Murthy
added a comment - 27/Sep/07 10:18 - edited Umm... clearly (as per the new way) {set|get}CompressorClass belongs to OutputFormatBase; does that mean I should put {set|get}CompressionType in SequenceFileOutputFormat? Yep, it makes sense only for SequenceFiles... equivalently, I could move both sets of APIs, {set|get}CompressorClass & {set|get}CompressionType, to JobConf.
Thoughts? Doug?

Arun C Murthy
added a comment - 27/Sep/07 10:26 I also propose we deprecate the OutputFormatBase.{get|set}CompressOutput and JobConf.{get|set}CompressMapOutput APIs; with reasonable defaults for compression-type (NONE) and compression-codec (null), things should work just fine.

Doug Cutting
added a comment - 27/Sep/07 17:36 Yes, OutputFormat-specific parameters should not be accessed through JobConf, but rather through static methods on the appropriate OutputFormat class. So, yes, we should deprecate existing OutputFormat-specific methods of JobConf.
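The pattern Doug describes can be sketched as follows. Both classes here are hypothetical stand-ins (the real SequenceFileOutputFormat gained helpers of this shape later): format-specific knobs are exposed as static methods on the OutputFormat class, keeping them out of the generic configuration API.

```java
import java.util.HashMap;
import java.util.Map;

// Minimal Map-backed stand-in for a configuration object.
class Conf {
    private final Map<String, String> props = new HashMap<>();
    void set(String k, String v) { props.put(k, v); }
    String get(String k, String dflt) { return props.getOrDefault(k, dflt); }
}

// Sketch of OutputFormat-specific static accessors: SequenceFile-only settings
// live on the SequenceFile output format class, not on the conf object itself.
class SequenceFileOutputFormatSketch {
    static final String TYPE_KEY = "mapred.output.compression.type";

    static void setOutputCompressionType(Conf conf, String type) {
        conf.set(TYPE_KEY, type);
    }
    // Default of "NONE" matches the reasonable default discussed above.
    static String getOutputCompressionType(Conf conf) {
        return conf.get(TYPE_KEY, "NONE");
    }
}
```

The design keeps JobConf free of methods that only make sense for one OutputFormat, so deprecating the format-specific JobConf methods loses nothing.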

Arun C Murthy
added a comment - 28/Sep/07 23:31 Here is an early patch for review while I continue testing...
I'd really appreciate feedback, since this introduces a fairly large number of changes, i.e. new config knobs, deprecated methods, newly added ones, etc.

Arun C Murthy
added a comment - 01/Oct/07 06:45
org.apache.hadoop.mapred.TestMiniMRWithDFS.testWithDFS
Failing for the past 1 build (since Failed#853)
java.net.SocketTimeoutException: Read timed out
at java.net.SocketInputStream.socketRead0(Native Method)
at java.net.SocketInputStream.read(SocketInputStream.java:129)
at java.net.SocketInputStream.read(SocketInputStream.java:182)
at java.io.DataInputStream.readShort(DataInputStream.java:284)
at org.apache.hadoop.dfs.DFSClient$DFSOutputStream.endBlock(DFSClient.java:1641)
at org.apache.hadoop.dfs.DFSClient$DFSOutputStream.close(DFSClient.java:1714)
at org.apache.hadoop.fs.FSDataOutputStream$PositionCache.close(FSDataOutputStream.java:49)
at org.apache.hadoop.fs.FSDataOutputStream.close(FSDataOutputStream.java:64)
at org.apache.hadoop.io.SequenceFile$Writer.close(SequenceFile.java:774)
at org.apache.hadoop.mapred.PiEstimator.launch(PiEstimator.java:170)
at org.apache.hadoop.mapred.TestMiniMRWithDFS.testWithDFS(TestMiniMRWithDFS.java:170)
Seems unrelated to this patch, re-submitting.

Doug Cutting
added a comment - 01/Oct/07 20:24 Renaming mapred.compress.map.output is not back-compatible. Perhaps we should continue to use this if it is set for back-compatibility? Or else don't rename.
It also appears that all of the previous public constructors for MapFile are no longer present. If there are public constructors that are no longer used, these should be deprecated, not simply removed. Later we can remove the deprecated methods.
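The back-compatibility fallback Doug suggests could look like the following sketch. The new key name here is hypothetical (the thread does not settle on one); the point is only the lookup order: read the new property, and fall back to the old mapred.compress.map.output when it is unset.

```java
import java.util.HashMap;
import java.util.Map;

// Hedged sketch of a back-compatible property rename: honour the old
// property when the new one is absent, so existing job configs keep working.
public class BackCompat {
    static final String OLD_KEY = "mapred.compress.map.output";
    static final String NEW_KEY = "mapred.map.output.compress"; // hypothetical renamed key

    static boolean getCompressMapOutput(Map<String, String> conf) {
        String v = conf.get(NEW_KEY);
        if (v == null) {
            v = conf.get(OLD_KEY); // fall back for configs written before the rename
        }
        return Boolean.parseBoolean(v); // parseBoolean(null) is false, the default
    }
}
```

An old config that only sets mapred.compress.map.output still takes effect, while the new key, when present, wins.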