Re: file size effects on jobs?

HDFS is designed to handle large files, so a small number of large files is generally better than a large number of small files. That is mainly for two reasons:

From the HDFS perspective: the default HDFS block size is 64MB, which is far larger than in most other block-structured file systems, where blocks are normally a few KB. Every file and every block is tracked as an object in the namenode's memory, so lots of small files means lots of extra metadata and significant namenode memory overhead.
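To make that concrete, here is a small sketch against the Java FileSystem API (the path, replication factor, and block size values are purely illustrative): it writes a tiny file with an explicit 64MB block size and then asks the namenode which block locations it tracks for that file. Even a few bytes of data still cost one file object plus one block object in namenode memory.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockSizeDemo {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // Hypothetical path; the cluster-wide default block size comes from
        // dfs.block.size (older releases) / dfs.blocksize (newer releases).
        Path file = new Path("/tmp/blocksize-demo.txt");

        // Create the file with an explicit 64 MB block size and replication 3.
        // Even a tiny file still occupies its own file + block entries on the namenode.
        long blockSize = 64L * 1024 * 1024;
        FSDataOutputStream out = fs.create(file, true, 4096, (short) 3, blockSize);
        out.writeBytes("a few bytes, but still a whole block entry\n");
        out.close();

        // List the blocks the namenode tracks for this file.
        FileStatus status = fs.getFileStatus(file);
        for (BlockLocation loc : fs.getFileBlockLocations(status, 0, status.getLen())) {
            System.out.println("offset=" + loc.getOffset()
                    + " length=" + loc.getLength()
                    + " hosts=" + String.join(",", loc.getHosts()));
        }
    }
}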

From the MapReduce perspective: more small files means more blocks, which in turn requires more map tasks to consume them. Each task then processes very little data, but the cost of creating and tearing down all those tasks is high, so the job ends up much less efficient.
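One common way around that is to pack many small files into each input split so the job launches far fewer map tasks. Below is a minimal driver sketch, assuming a Hadoop release that ships CombineTextInputFormat and relying on the default identity mapper/reducer; input and output paths are taken from the command line, and the 64MB split cap is just an example value.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.CombineTextInputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class CombineSmallFiles {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "combine-small-files");
        job.setJarByClass(CombineSmallFiles.class);

        // Pack many small files into each split instead of one split per block.
        job.setInputFormatClass(CombineTextInputFormat.class);
        // Cap each combined split at roughly one block's worth of data (64 MB here).
        CombineTextInputFormat.setMaxInputSplitSize(job, 64L * 1024 * 1024);

        // Default identity mapper/reducer: records pass straight through.
        job.setOutputKeyClass(LongWritable.class);
        job.setOutputValueClass(Text.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

With this setup, a directory of thousands of KB-sized files produces a handful of map tasks instead of thousands, which is usually the cheapest fix when you cannot merge the files at the HDFS level.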
