Tech Off - Impact of changing block size in Hadoop HDFS

Hi,

Can anybody let me know what the impact of changing the block size in Hadoop HDFS is? Say I change the block size from 64 MB to 128 MB; what effect does that have?

Also, please let me know what the default block size is.

Thanks,

Santosh

There are a number of things that this impacts. Most obviously, a file will consist of fewer blocks if the block size is larger. This potentially allows a client to read or write more data without interacting with the Namenode, and it also reduces the amount of metadata the Namenode has to keep, lowering Namenode load (this can be an important consideration for extremely large file systems).
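
To make the Namenode effect concrete, here is a small back-of-the-envelope sketch (plain Java, not Hadoop code) comparing how many blocks a single file occupies at 64 MB versus 128 MB blocks; the 10 GB file size is arbitrary and the ~150 bytes-per-block figure is only a commonly quoted rough estimate of Namenode heap usage, not an exact number:

// Rough sketch: how block size affects the number of blocks (and thus
// Namenode metadata) for a single file. Pure arithmetic, no Hadoop needed.
public class BlockCountSketch {
    static long blocksFor(long fileSizeBytes, long blockSizeBytes) {
        // HDFS rounds up: the last block may be only partially filled.
        return (fileSizeBytes + blockSizeBytes - 1) / blockSizeBytes;
    }

    public static void main(String[] args) {
        long fileSize = 10L * 1024 * 1024 * 1024;      // an illustrative 10 GB file
        long[] blockSizes = {64L << 20, 128L << 20};   // 64 MB and 128 MB

        for (long bs : blockSizes) {
            long blocks = blocksFor(fileSize, bs);
            // ~150 bytes per block object is a frequently cited rough estimate
            // of Namenode heap usage; treat it as an order-of-magnitude guide.
            System.out.printf("block size %3d MB -> %d blocks (~%d KB of Namenode heap)%n",
                    bs >> 20, blocks, blocks * 150 / 1024);
        }
    }
}

Doubling the block size halves the block count for the same file, which is why the saving matters mostly when the file system holds a very large number of blocks.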

With fewer blocks, the file may be stored on fewer nodes in total; this can reduce the total throughput available for parallel access and make it harder for the MapReduce scheduler to place data-local tasks.
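
A small sketch of that spread effect (plain Java; the 1 GB file, 40-node cluster and the default replication factor of 3 are illustrative assumptions, and it assumes the best case where every replica lands on a different DataNode):

// Rough sketch of how fewer blocks limit how widely a file can be spread.
public class SpreadSketch {
    public static void main(String[] args) {
        long fileSize = 1L << 30;          // an illustrative 1 GB file
        int replication = 3;               // typical HDFS default
        int clusterNodes = 40;             // hypothetical cluster size

        for (long bs : new long[]{64L << 20, 256L << 20}) {
            long blocks = (fileSize + bs - 1) / bs;
            long replicas = blocks * replication;
            // At best, each replica lands on a different DataNode, so the file
            // can be spread over at most min(replicas, clusterNodes) nodes.
            long maxNodes = Math.min(replicas, clusterNodes);
            System.out.printf("block size %3d MB -> %2d blocks, %2d replicas, spread over at most %d nodes%n",
                    bs >> 20, blocks, replicas, maxNodes);
        }
    }
}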

When such a file is used as input for MapReduce (and the maximum split size is not constrained to be smaller than the block size), the larger block size reduces the number of tasks, which can decrease overhead. But having fewer, longer tasks also means you may not reach maximum parallelism (if there are fewer tasks than the cluster can run simultaneously), the chance of stragglers increases, and more work has to be redone if a task fails. Increasing the amount of data processed per task can also cause additional read/write operations (for example, if a map task goes from having a single spill to having multiple spills that need a merge at the end).
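
A quick way to see the task-count effect is the classic split-size rule used by FileInputFormat, splitSize = max(minSplitSize, min(maxSplitSize, blockSize)). The sketch below mirrors that formula in plain Java (the 2 GB input and the 64 MB cap are hypothetical settings); it shows how a larger block size shrinks the number of map tasks unless the maximum split size is capped below the block size:

// Rough sketch of FileInputFormat-style split sizing versus block size.
// Capping the maximum split size keeps task counts up even with big blocks.
public class SplitSketch {
    static long splitSize(long blockSize, long minSplit, long maxSplit) {
        return Math.max(minSplit, Math.min(maxSplit, blockSize));
    }

    public static void main(String[] args) {
        long fileSize = 2L << 30;                 // an illustrative 2 GB input file
        long minSplit = 1;                        // default lower bound
        long noCap = Long.MAX_VALUE;              // no maximum split size configured
        long cap = 64L << 20;                     // hypothetical 64 MB max-split-size setting

        for (long bs : new long[]{64L << 20, 256L << 20}) {
            for (long maxSplit : new long[]{noCap, cap}) {
                long split = splitSize(bs, minSplit, maxSplit);
                long tasks = (fileSize + split - 1) / split;
                System.out.printf("block %3d MB, maxSplit %s -> split %3d MB, ~%d map tasks%n",
                        bs >> 20, maxSplit == noCap ? "unset" : (maxSplit >> 20) + " MB",
                        split >> 20, tasks);
            }
        }
    }
}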

Which block size works best usually depends on the input data. If you want to maximize throughput for a very large input file, very large blocks (128 MB or even 256 MB) are best. For smaller files, a smaller block size is better. Note that files on the same file system can have different block sizes, because the dfs.block.size parameter is applied when the file is written, e.g. when uploading with the command-line tools: "bin/hadoop fs -Ddfs.block.size=33554432 -put localpath dfspath" (33554432 bytes = 32 MB).
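
The same per-file block size can be set through the Java API at write time. The sketch below is a minimal example under stated assumptions: the destination path and the 32 MB figure are placeholders, the replication factor is hard-coded to a typical default of 3, and a Hadoop client configuration (core-site.xml/hdfs-site.xml) is assumed to be on the classpath:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Minimal sketch: write an HDFS file with an explicit per-file block size.
// The block size applies only to files written with it; existing blocks are unchanged.
public class PutWithBlockSize {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();          // picks up core-site.xml / hdfs-site.xml
        FileSystem fs = FileSystem.get(conf);

        Path dst = new Path("/user/santosh/example.dat");  // hypothetical destination path
        long blockSize = 32L * 1024 * 1024;                // 32 MB, matching the CLI example
        int bufferSize = conf.getInt("io.file.buffer.size", 4096);
        short replication = 3;                             // typical default replication factor

        try (FSDataOutputStream out =
                 fs.create(dst, true /* overwrite */, bufferSize, replication, blockSize)) {
            out.writeBytes("hello hdfs\n");
        }

        // The block size is recorded per file and can be read back:
        System.out.println("block size: " + fs.getFileStatus(dst).getBlockSize());
    }
}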