Details

The key and value objects given to the Combiner and Reducer are now reused between calls. This is much more efficient, but users can no longer assume the objects are constant.

Description

Currently, the input key and value are recreated on every iteration for input to the combiner and reducer. Reusing the keys and values would speed up the system substantially. The downside is that it may break applications that count on holding references to previous keys and values, but I think it is worth doing.
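The proposed behaviour can be sketched with a toy iterator (illustrative only; ReusingIterator is a made-up name, not Hadoop's actual ValueIterator): a single mutable holder is refilled on every next() call, so every call hands back the same instance.

```java
import java.util.Iterator;
import java.util.List;

// Toy iterator mimicking the reuse proposal: one mutable holder is
// refilled in place on each next() call, so the caller always receives
// the same object. (Illustrative only; not Hadoop code.)
class ReusingIterator implements Iterator<StringBuilder> {
    private final List<String> source;
    private final StringBuilder holder = new StringBuilder(); // reused instance
    private int pos = 0;

    ReusingIterator(List<String> source) {
        this.source = source;
    }

    public boolean hasNext() {
        return pos < source.size();
    }

    public StringBuilder next() {
        holder.setLength(0);                  // clear the shared buffer
        holder.append(source.get(pos++));     // refill it in place
        return holder;                        // same object every call
    }
}
```

A caller that holds on to the returned reference will see its contents change on the next iteration, which is exactly the compatibility risk discussed below.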

Activity


Doug Cutting
added a comment - 11/Dec/07 19:55

+1

As a general rule, I think applications should not expect to be able to hold on to pointers to objects passed to them, but should expect to be able to hold on to pointers returned to them. Lots of exceptions of course, but, in this case, I don't think applications should be expecting to be able to hold on to these objects, and so any that break if we reuse them were not well written.

These were originally reused. Reuse was removed when the combiner was added, since the original combiner kept pointers to the objects.


Owen O'Malley
added a comment - 14/Feb/08 21:03

This patch changes the ValueIterator to have 2 instances of the key and 1 instance of the value, and to reuse the objects during the iteration. I also fixed some of the compiler warnings for unbound generic types in the ValueIterator.


Owen O'Malley
added a comment - 19/Feb/08 21:34

This patch fixes the value iterator to reuse the key and value between iterations. Aggregation was assuming that the reduce inputs were not reused, so I stringified the value. Is that ok, Runping? I got a minor speedup of 2:33 instead of 2:37 on a simple 1-node word count.
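The "stringify" workaround mentioned above can be sketched like this (illustrative only, not the actual aggregation code): instead of holding the reused value object, keep an immutable String snapshot of it.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of the "stringify" workaround: snapshot each reused value as an
// immutable String instead of retaining the mutable object itself.
// (StringBuilder stands in for the reused Writable; not Hadoop code.)
class StringifyDemo {
    static List<String> snapshotValues(List<String> input) {
        StringBuilder reused = new StringBuilder(); // the one shared value object
        List<String> retained = new ArrayList<>();
        for (String s : input) {
            reused.setLength(0);
            reused.append(s);                       // framework refills in place
            retained.add(reused.toString());        // immutable copy survives reuse
        }
        return retained;
    }
}
```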


Ralf Gutsche
added a comment - 15/Jun/08 17:56

This piece of code will print different output with Hadoop 17 (compared to Hadoop 16):

public void reduce(... Iterator<Writable> aValues ...) throws IOException {
    ArrayList<Writable> ret = new ArrayList<Writable>();
    System.out.println("First");
    while (aValues.hasNext()) {
        Writable val = aValues.next();
        System.out.println(val.toString());
        ret.add(val); // retains a reference to the reused object
    }
    System.out.println("Second");
    for (Writable w : ret) {
        System.out.println(w.toString());
    }
}

In Hadoop 16, the values printed after First and Second were the same.
In Hadoop 17, the values printed after First are identical to Hadoop 16. However, in Hadoop 17, all the records printed after Second are identical.
Adding a clone (ret.add(val.clone())) will fix this, if the clone is implemented correctly.
I guess this is the consequence of this JIRA.
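The symptom and the clone fix can both be reproduced with a self-contained sketch (illustrative only; StringBuilder stands in for the Writable and a copy constructor stands in for clone()): retaining the shared reference yields N copies of the last value, while copying on add keeps each one.

```java
import java.util.ArrayList;
import java.util.List;

// Reproduces the behaviour Ralf describes: the framework refills one
// shared buffer per value, so a list of retained references collapses to
// N copies of the last value; copying on add (the clone() fix) does not.
class CloneFixDemo {
    static List<String> secondPass(List<String> input, boolean copy) {
        StringBuilder shared = new StringBuilder();   // the reused "value" object
        List<StringBuilder> ret = new ArrayList<>();
        for (String s : input) {
            shared.setLength(0);
            shared.append(s);                         // framework refills in place
            // ret.add(shared) keeps a reference; the copy mirrors val.clone()
            ret.add(copy ? new StringBuilder(shared) : shared);
        }
        List<String> out = new ArrayList<>();         // the "Second" pass
        for (StringBuilder sb : ret) {
            out.add(sb.toString());
        }
        return out;
    }
}
```

With copy = false the second pass prints only the last value repeated; with copy = true it matches the first pass, which is the pre-Hadoop-17 behaviour.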


Owen O'Malley
added a comment - 15/Jun/08 23:00

This Jira was marked as an incompatible change because it did change the semantics. However, without this change there was an allocation (and later garbage collection) for every key and value passed to the reduce, which had measurable performance costs.