Description

I have a map-reduce job that uses CombineFileInputFormat. It is configured with 10000 pools and 30000 files. Creating the splits takes more than an hour. The reason is that CombineFileInputFormat.getSplits() converts the same path from a String to a Path object multiple times, once for each pool. Similarly, it calls Path.toUri() multiple times. This code can be optimized.
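
To illustrate the cost pattern described above, here is a minimal sketch. It is not the Hadoop code: java.net.URI stands in for Hadoop's Path, and the Pool class, the prefix-based filter, and the method names are all hypothetical. It only shows why parsing inside the per-pool loop costs pools x files conversions, while parsing up front costs one conversion per file.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.List;

final class SplitPathSketch {
    /** Hypothetical stand-in for a pool: accepts files whose URI path starts with a prefix. */
    static final class Pool {
        final String prefix;
        final List<URI> members = new ArrayList<>();
        Pool(String prefix) { this.prefix = prefix; }
        boolean accepts(URI uri) { return uri.getPath().startsWith(prefix); }
    }

    /** Counts string-to-URI conversions so the two strategies can be compared. */
    static int conversions = 0;

    static URI parse(String s) {
        conversions++;
        return URI.create(s);
    }

    /** Slow pattern: every pool re-parses every path string (pools x files conversions). */
    static void assignNaive(List<String> files, List<Pool> pools) {
        for (Pool pool : pools) {
            for (String f : files) {
                URI uri = parse(f);
                if (pool.accepts(uri)) pool.members.add(uri);
            }
        }
    }

    /** Faster pattern: parse each path string once, then reuse the parsed objects for all pools. */
    static void assignConvertOnce(List<String> files, List<Pool> pools) {
        List<URI> uris = new ArrayList<>(files.size());
        for (String f : files) uris.add(parse(f));
        for (Pool pool : pools) {
            for (URI uri : uris) {
                if (pool.accepts(uri)) pool.members.add(uri);
            }
        }
    }
}
```

With 10000 pools and 30000 files, the first pattern would perform 300 million conversions and the second 30000. The pool-matching loop itself is still pools x files in both versions, so this sketch only removes the repeated parsing, not the matching work.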

Activity

dhruba borthakur added a comment - 03/Feb/10 23:29

The conversion of strings to Path() occurs only once. In the presence of multiple pools, this improves performance by an order of magnitude. A job that needed 6 hours to create splits now takes about 1.5 hours.