You are reading from local disk and writing to HDFS. When you write to HDFS, your data is most likely being replicated, so it is physically written two or three times depending on the replication factor you have set.

So you are not only writing, but writing two or three times the amount of data you are reading. And your writes are going over the network; your reads are not.
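If you want to see how much replication is costing you, you can lower the replication factor for this write and compare timings. A minimal sketch, reusing the same conf and hDFSDestinationDirectory as the code below, and assuming a single replica is acceptable for the test:

// ask for a single replica for files created by this client
conf.set("dfs.replication", "1");
FileSystem fs = FileSystem.get(URI.create(hDFSDestinationDirectory), conf);

// or change the replication factor of a file that already exists
fs.setReplication(new Path(hDFSDestinationDirectory + "/" + "data.seq"), (short) 1);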

One reason you could be seeing this is that each iteration of your for loop creates a new byte array. That forces the JVM to allocate fresh heap space on every pass; if the array is sufficiently large this is expensive, and eventually you're going to run into the garbage collector. I'm not sure what HotSpot might do to optimize this away, however.
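For contrast, the expensive pattern described above looks roughly like this (just a sketch of the idea, not your exact code, using the same nx, ny, nz, key and writer as the snippet below):

for (int i = 1; i <= nz; i++) {
    // a new byte array and a new BytesWritable are allocated on every pass
    byte[] buf = new byte[nx * ny * 2];
    in.readFully(buf);
    key.set(i);
    writer.append(key, new BytesWritable(buf));
}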

My suggestion would be to create a single BytesWritable:

// use DataInputStream so you can call readFully()
DataInputStream in = new DataInputStream(new FileInputStream(localSource));
FileSystem fs = FileSystem.get(URI.create(hDFSDestinationDirectory),conf);
Path sequenceFilePath = new Path(hDFSDestinationDirectory + "/"+ "data.seq");
IntWritable key = new IntWritable();
// create a BytesWritable, which can hold the maximum possible number of bytes
BytesWritable value = new BytesWritable(new byte[maxPossibleSize]);
// grab a reference to the value's underlying byte array
byte byteBuf[] = value.getBytes();
SequenceFile.Writer writer = SequenceFile.createWriter(fs, conf,
        sequenceFilePath, key.getClass(), value.getClass());
for (int i = 1; i <= nz; i++) {
    // work out how many bytes to read - if this is a constant, move it outside the for loop
    int imageDataSize = nx * ny * 2;
    // read the image bytes straight into the value's backing array
    in.readFully(byteBuf, 0, imageDataSize);
    key.set(i);
    // set the actual number of bytes used in the BytesWritable object
    value.setSize(imageDataSize);
    writer.append(key, value);
}
IOUtils.closeStream(writer);
in.close();
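Because value.getBytes() hands back the backing array, readFully() fills the BytesWritable in place, and setSize() simply tells the writer how many of those bytes are valid, so nothing new is allocated inside the loop.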