Details

This provides an option to store the fsimage compressed. The layout version is bumped to -25. The user can configure whether the fsimage is compressed and which codec to use. By default the fsimage is not compressed.

Description

Our HDFS has an fsimage as large as 20 GB. It consumes a lot of network bandwidth when the secondary NN uploads a new fsimage to the primary NN.

If we could store the fsimage compressed, the problem would be greatly alleviated.

I plan to provide a new configuration option, hdfs.image.compressed, with a default value of false. If it is set to true, the fsimage is stored compressed.

The fsimage will have a new layout with a new field, "compressed", in its header, indicating whether the namespace is stored compressed.
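A minimal sketch of the header-flag idea, using java.util.zip's gzip in place of Hadoop's codec framework. The field layout and names here are illustrative only, not the actual fsimage format: the point is that the flag lives in an uncompressed header, so the loader can consult it before choosing a stream.

```java
import java.io.*;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Toy image layout: [int layoutVersion][boolean compressed][body...]
// Only the body after the header is compressed, so the flag itself
// stays readable without decompression.
public class ImageHeaderSketch {

    static byte[] save(int layoutVersion, boolean compress, byte[] body) {
        try {
            ByteArrayOutputStream raw = new ByteArrayOutputStream();
            DataOutputStream header = new DataOutputStream(raw);
            header.writeInt(layoutVersion);   // uncompressed header field
            header.writeBoolean(compress);    // the new "compressed" flag
            OutputStream bodyOut = compress ? new GZIPOutputStream(raw) : raw;
            bodyOut.write(body);
            bodyOut.close();
            return raw.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    static byte[] load(byte[] image) {
        try {
            DataInputStream in = new DataInputStream(new ByteArrayInputStream(image));
            in.readInt();                     // layout version, ignored here
            InputStream bodyIn = in.readBoolean() ? new GZIPInputStream(in) : in;
            ByteArrayOutputStream body = new ByteArrayOutputStream();
            byte[] buf = new byte[4096];
            for (int n; (n = bodyIn.read(buf)) != -1; ) body.write(buf, 0, n);
            return body.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Round-tripping save(-25, true, body) through load recovers body; passing false exercises the uncompressed path with the same reader.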

Hairong Kuang
added a comment - 01/Oct/10 01:40 This depends on the compression algorithm to be used. We need to choose an algorithm that balances compression quality and speed. I will do some experiments to provide more data.
The image-loading overhead is a concern, since it adds to NN restart time.

Philip Zeyliger
added a comment - 01/Oct/10 07:01 Instead of changing the binary format, could you update the code to read either fsimage or fsimage.gz, whichever is available? Obviously, one could start compressing at any point, but there's significant value in being able to use existing tools to decompress if anything goes awry.

Jeff Hammerbacher
added a comment - 01/Oct/10 07:04 Could we use the Avro file format to store the fsimage? We've designed configurable compression into the format, and tools will automatically be available for inspection of the file.

Hairong Kuang
added a comment - 01/Oct/10 17:29 Philip, your suggestion definitely has value: it gives the flexibility of compressing the fsimage at any time. The focus of this jira is to have HDFS store it compressed. This allows the secondary NN to transfer the compressed image to the primary NN, reducing network and disk I/O overhead.
Jeff, I'd like to take a look at the Avro file format. Do you know if the Avro file format has more overhead than the current fsimage format?

dhruba borthakur
added a comment - 02/Oct/10 07:44 > time to load the image and the time to do saveNamespace to go up by a lot with this change?
It might go up a little, and we can measure it and provide details here.

Hairong Kuang
added a comment - 04/Oct/10 22:22 I thought more about Philip's suggestion. Instead of changing the fsimage format, I have an option to simply compress the whole image file; when loading the fsimage, the NN decompresses it if the file name ends with a compression suffix.
This has a couple of advantages over my original idea:
1. No need to change layout versions;
2. Gives admins more flexibility to use existing tools to compress the fsimage even if HDFS is not configured to compress it.
I also did a few experiments with different compression algorithms. I tried both gzip and LZO on a 13G fsimage, both using the default compression level.
Gzip took 13 minutes to compress the 13G fsimage down to 2.3G, and decompression took 2 minutes 47 seconds.
LZO took only 3 minutes to compress the 13G fsimage down to 3G, and decompression took 2 minutes 51 seconds.
These are very promising results. The fsimage apparently has a lot of duplicate bytes, so it compresses really well. It is also clear that LZO provides good compression speed with good enough compression quality.
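The suffix approach can be sketched as follows; the file names and the prefer-compressed policy are assumptions for illustration, and java.util.zip's gzip stands in for a pluggable codec:

```java
import java.io.*;
import java.util.zip.GZIPInputStream;

public class SuffixSketch {

    // Pure name check so the policy is easy to test.
    static boolean isCompressedName(String name) {
        return name.endsWith(".gz");
    }

    // Open an image file, transparently decompressing "fsimage.gz".
    static InputStream openImage(File f) throws IOException {
        InputStream raw = new BufferedInputStream(new FileInputStream(f));
        return isCompressedName(f.getName()) ? new GZIPInputStream(raw) : raw;
    }

    // A storage directory may now hold both "fsimage" and "fsimage.gz";
    // one illustrative policy is to prefer the compressed copy when both exist.
    static File pickImage(File dir) {
        File gz = new File(dir, "fsimage.gz");
        return gz.exists() ? gz : new File(dir, "fsimage");
    }
}
```

The multiple-fsimages ambiguity discussed later in the thread shows up directly in pickImage: the chooser, not the file name, ends up encoding the policy.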

dhruba borthakur
added a comment - 05/Oct/10 05:48 +1 on compressing the entire file.
The VERSIONS file should have an entry of the form:
codec=org.apache.hadoop.io.compress.GzipCodec
if the fsimage has been compressed using gzip.
1. At namenode startup time, it reads the VERSIONS file to determine how the fsimage is compressed. If the VERSIONS file does not have a codec=xxx entry, then the NN assumes that the image is not compressed.
2. While saving the fsimage, the NN looks at its own configuration to see if a config parameter named io.compression.codec is defined. If it is, the NN uses that codec to compress the fsimage and also updates the VERSIONS file.
This approach would be fully backward compatible and supports different compression algorithms for the fsimage.
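A sketch of reading the proposed codec= entry, assuming the VERSIONS file uses the usual key=value property syntax; the helper name is hypothetical:

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Properties;

public class VersionFileSketch {

    // Returns the codec class name recorded in the VERSIONS file contents,
    // or null when no codec= entry is present (i.e. an uncompressed image).
    static String readCodec(String versionFileContents) {
        Properties props = new Properties();
        try {
            props.load(new StringReader(versionFileContents));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return props.getProperty("codec");
    }
}
```

The null-means-uncompressed convention is exactly the backward-compatibility hook: an old VERSIONS file simply lacks the key.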

Hairong Kuang
added a comment - 05/Oct/10 22:41 The LZO compression codec is not included in the standard Hadoop package, so the compression algorithm has to be configurable.
If we compress the entire image file, the challenge is deciding where to put the compression-algorithm information.
Dhruba suggested storing this information in the VERSION file. This idea is neat. The only problem is that saving the fsimage would then need to touch two files, and it's hard to guarantee atomicity.
Another solution is to use a suffix on the image file name to indicate the compression algorithm. The problem with this is that the image file no longer has a unique name, so one storage directory could contain multiple fsimages. How do we handle this?
After discussions back and forth, I am inclined to use the approach I originally proposed: changing the binary format. We could then store the compression-algorithm information in the fsimage header and avoid all of the complexity that compressing the entire image file presents.
What does the community think?

Jeff Hammerbacher
added a comment - 07/Oct/10 08:05 Jeff, I'd like to take a look at the Avro file format. Do you know if the Avro file format has more overhead than the current fsimage format?
I don't know about the current fsimage format. The Avro format, however, is detailed in the Avro spec: http://avro.apache.org/docs/current/spec.html#Object+Container+Files

Doug Cutting
added a comment - 07/Oct/10 23:00 Hairong, Avro's file format has little overhead. It supports compression. However, it assumes that a file is composed of a sequence of entries with the same schema. The fsimage has various sections. The header information could be added as Avro file metadata. The files and directories, datanodes, and files under construction are currently written as separate blocks. Instead, the schema for every item might be something like a union of [File, Directory, Symlink, DataNode, FileUnderConstruction].

Hairong Kuang
added a comment - 11/Oct/10 22:58 This patch changes the fsimage format to support compression. The third field in the header (isCompressed) indicates whether the image is stored compressed. If so, the fourth field stores the compression codec.
The HDFS admin can configure whether to store the fsimage compressed and which codec to use. The codec used for storing or reading an fsimage has to be one of the codecs specified in io.compression.codecs, or either GzipCodec or DefaultCodec if io.compression.codecs is not configured.
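The codec restriction described above amounts to a membership check. A sketch follows, where the fallback list mirrors the GzipCodec/DefaultCodec default; the class and method names are illustrative, not the patch's actual code:

```java
import java.util.Arrays;
import java.util.List;

public class CodecCheckSketch {

    // Default allowed codecs when io.compression.codecs is not configured.
    static final List<String> FALLBACK = Arrays.asList(
            "org.apache.hadoop.io.compress.GzipCodec",
            "org.apache.hadoop.io.compress.DefaultCodec");

    // configured: parsed value of io.compression.codecs, or null if unset
    static boolean isAllowed(String codecClassName, List<String> configured) {
        List<String> allowed = (configured == null) ? FALLBACK : configured;
        return allowed.contains(codecClassName);
    }
}
```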

Yilei Lu
added a comment - 12/Oct/10 16:37 We also encountered the same problem; our fsimage is 12 GB. We limit the transmission speed and compress the transfer to resolve the problem, rather than changing the format of the fsimage. We also plan to compare the NameNode's fsimage with the SecondaryNameNode's when downloading checkpoint files; if they are the same, the SecondaryNameNode will not download the fsimage from the NameNode.

Doug Cutting
added a comment - 13/Oct/10 13:34 Hairong, I don't think that using Avro is critical here. Avro is primarily intended for user data. Using Avro here could simplify long-term maintenance but might add a significant amount of work in the short term. So I would not file another Jira unless you intend to implement it soon. Thanks!

Hairong Kuang
added a comment - 13/Oct/10 22:23 @Lu, compressing the fsimage has the additional advantage of reducing disk I/O as well as network bandwidth when writing to a remote copy. I like your proposed optimizations, such as limiting the transmission speed and not downloading an fsimage when the one at the primary NameNode is the same as the one at the secondary NameNode. Could you please contribute those back to the community?
@Doug, thanks for your feedback. I hope we will get some time to work on the Avro format soon.

Hairong Kuang
added a comment - 14/Oct/10 00:44 When I was doing performance testing, I found that TrunkImageCompress.patch has a bug: it does not use a buffered input stream to read an old image. This patch fixes that performance degradation.

Yilei Lu
added a comment - 14/Oct/10 12:42 To Kuang: if the fsimage is very big, the network saturates for a short time when the SecondaryNamenode does a checkpoint, causing the Jobtracker's requests to the Namenode for file data to fail during the job-initialization phase. So we limit the transmission speed and compress the transfer to resolve the problem.
We have completed the development and testing, but we have not added the code that compares the NameNode's fsimage with the SecondaryNameNode's when downloading checkpoint files, because it may introduce other risks.
Please see the patch that limits the transmission speed and compresses the transfer.
Next I will contribute another patch that compares the NameNode's and SecondaryNameNode's fsimages when downloading checkpoint files. Thanks.
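The throttling idea can be sketched as a rate-capped copy loop; this is an illustration of the technique, not Lu's actual patch:

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.io.UncheckedIOException;

public class ThrottledCopySketch {

    // Copy in -> out, sleeping whenever the average rate would exceed
    // maxBytesPerSec. Returns the number of bytes copied.
    static long copy(InputStream in, OutputStream out, long maxBytesPerSec) {
        byte[] buf = new byte[8192];
        long start = System.nanoTime(), sent = 0;
        try {
            int n;
            while ((n = in.read(buf)) != -1) {
                out.write(buf, 0, n);
                sent += n;
                // nanoseconds by which we are ahead of the allowed byte budget
                long ahead = sent * 1_000_000_000L / maxBytesPerSec
                           - (System.nanoTime() - start);
                if (ahead > 0) {
                    try {
                        Thread.sleep(ahead / 1_000_000, (int) (ahead % 1_000_000));
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt(); // stop throttling, keep copying
                    }
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sent;
    }
}
```

Capping the average rate spreads the checkpoint transfer out so it no longer saturates the link that the Jobtracker's RPCs share.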

Yilei Lu
added a comment - 14/Oct/10 14:18 If the fsimage is very big, the network saturates for a short time when the SecondaryNamenode does a checkpoint, causing the Jobtracker's requests to the Namenode for file data to fail during the job-initialization phase. So we limit the transmission speed and compress the transfer to resolve the problem.
The LZO compression codec is not included in the standard Hadoop package, so the compression codec used is GzipCodec.

Hairong Kuang
added a comment - 14/Oct/10 19:17 I did experiments with a secondary namenode using our internal 0.20 branch. I used LzoCodec to compress the image. Here are the results:

                                 uncompressed   LZO compressed
  image size                     13G            2.9G
  loading image from disk        5 mins         8 mins
  save image to disk             2 mins         4.5 mins
  download image from primary NN 16.5 mins      6.5 mins
  upload image to primary NN     16.5 mins      6.5 mins
  whole checkpoint               40 mins        25 mins

The results show that a compressed image greatly reduces image download and upload time, although it adds 5.5 minutes of overhead to loading/saving the image. Overall this gives us a 15-minute reduction for checkpointing a 13G image.
As Lu pointed out, another obvious optimization we could easily do is not to download the image from the primary NameNode if the secondary has the same one. This would give us an additional 6.5-minute reduction.
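As a sanity check, the per-phase numbers are consistent with the reported totals (all values in minutes, taken from the measurements above):

```java
public class CheckpointMathSketch {
    public static void main(String[] args) {
        double uncompressed = 5 + 2 + 16.5 + 16.5;     // load + save + download + upload
        double compressed   = 8 + 4.5 + 6.5 + 6.5;     // = 25.5, reported as ~25 mins
        double extraLoadSave = (8 - 5) + (4.5 - 2);    // = 5.5 mins of compression overhead
        double netSaving = uncompressed - compressed;  // = 14.5, reported as ~15 mins
        System.out.println(uncompressed + " " + compressed + " "
                + extraLoadSave + " " + netSaving);    // prints: 40.0 25.5 5.5 14.5
    }
}
```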

Hairong Kuang
added a comment - 14/Oct/10 19:35 Oops! The header of the table above should shift right one column.
@Lu, I really like your idea and thanks a lot for your patch. I created HDFS-1457 to track this.

Todd Lipcon
added a comment - 15/Oct/10 00:28 Hey Hairong. Another idea which you may want to experiment with at some point is to write a BufferedInputStream equivalent that does "readahead", or buffer filling, in a second thread. That way the extra CPU cost of decompression goes onto another core. Given that the actual application of the image data to the namespace is single-threaded due to the FSN lock, I bet compressed reading could actually get faster than uncompressed.
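A minimal sketch of that readahead idea, using a bounded pipe and a daemon filler thread; a real implementation would want proper error propagation, and the names here are illustrative:

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.PipedInputStream;
import java.io.PipedOutputStream;
import java.io.UncheckedIOException;

public class ReadaheadSketch {

    // Wrap src so a daemon thread pre-reads it into a bounded pipe; the
    // caller consumes already-buffered bytes from the returned stream,
    // letting decompression CPU run on another core.
    static InputStream readahead(InputStream src, int bufferSize) {
        try {
            PipedOutputStream sink = new PipedOutputStream();
            PipedInputStream out = new PipedInputStream(sink, bufferSize);
            Thread filler = new Thread(() -> {
                byte[] buf = new byte[8192];
                try (PipedOutputStream s = sink) {
                    int n;
                    while ((n = src.read(buf)) != -1) s.write(buf, 0, n);
                } catch (IOException ignored) {
                    // reader closed early; the pipe is already broken
                }
            });
            filler.setDaemon(true);
            filler.start();
            return out;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    static byte[] readAll(InputStream in) {
        try {
            ByteArrayOutputStream bos = new ByteArrayOutputStream();
            byte[] buf = new byte[4096];
            for (int n; (n = in.read(buf)) != -1; ) bos.write(buf, 0, n);
            return bos.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

The pipe's buffer bounds how far the filler can run ahead, so memory stays capped while the loader and decompressor overlap.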

Hairong Kuang
added a comment - 15/Oct/10 00:57 Todd, this is a wonderful idea! When I discussed with Dmytro, he also talked about this optimization. Could you please file a jira on this? Thanks a lot.

dhruba borthakur
added a comment - 15/Oct/10 01:50 I looked at the code; it looks good. I have two comments:
1. I would prefer to have the compression start after the header record. For example, the imgVersion, numFiles, genStamp, defaultReplication, etc. should not be compressed. This allows easier debugging and the ability to dump the header of a file to find out its contents (via od -x), etc.
2. The new image version is -25, but the code refers to it as -19:
if (imgVersion <= -19) { // -19: 1st version providing compression option
  isCompressed = in.readBoolean();
  if (isCompressed) {
    String codecClassName = Text.readString(in);
    ...
  }
}
[exec] +1 overall.
[exec] +1 @author. The patch does not contain any @author tags.
[exec] +1 tests included. The patch appears to include 7 new or modified tests.
[exec] +1 javadoc. The javadoc tool did not generate any warning messages.
[exec] +1 javac. The applied patch does not increase the total number of javac compiler warnings.
[exec] +1 findbugs. The patch does not introduce any new Findbugs warnings.
[exec] +1 release audit. The applied patch does not increase the total number of release audit warnings.
[exec] +1 system tests framework. The patch passed system tests framework compile.

Hairong Kuang
added a comment - 19/Oct/10 21:18 Sorry for the confusion. The bug was introduced by my patch trunkImageCompress2.patch, where I made numFiles and genStamp in the header uncompressed, but I forgot to make the same change in the OIV.

Aaron T. Myers
added a comment - 20/Oct/10 15:49 Thanks for the clarification, Hairong.
Konstantin commented in the original OIV JIRA (HADOOP-5467) that it would be nice if we eliminated the code duplication stemming from effectively having two distinct FS image loaders. Had we done that, you wouldn't have needed to remember to make this change in another place. This work probably shouldn't be done as part of this JIRA; this problem that you hit just reminded me of it.
I've filed HDFS-1465 to address this problem.

Hairong Kuang
added a comment - 22/Oct/10 20:38 This patch cleans up some unnecessary indentation and blank-line changes and fixes a failed test caused by the previous patch.
I ran ant test-patch and it succeeded.
I ran ant test and saw the following tests fail:
TestFileStatus, TestHdfsTrash (timeout), TestHDFSFileContextMainOperations, TestPipelines, TestBlockTokenWithDFS, and TestLargeBlock (timeout).
They do not seem related to my patch. If nobody objects, I will commit this patch later today.