[ http://issues.apache.org/jira/browse/HADOOP-195?page=comments#action_12377862 ]
paul sutter commented on HADOOP-195:
------------------------------------
I've had problems with timeouts during the copy phase, but it's not clear to me that the problem
is with the RPC mechanism. It may be TCP, in which case going to HTTP wouldn't help.
Since the copy phase is the moment when the switch is busiest, my recommendation is to saturate
your switch with traffic and experiment with various transfer methods, such as HTTP.
I did tests to saturate our (cheap) switch with traffic from the nodes and measure packet
loss. Though all the nodes were sending data at near a gigabit, the packet loss rate ranged
from 0.5% to 2%, sometimes peaking at 5%.
TCP behaves badly in such an environment, even with low latency. This is probably an opportunity
to retune TCP, but ever since our workaround this problem has gone to the back burner.
We worked around the problem by dramatically increasing the number of mappers, which reduces
the size of each map output file. For probabilistic problems like this, the probability of
failure increases with the duration of the transfer. Once files reach a certain size, the
failure is likely to happen on every transfer, and no forward progress can be made.
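
To make the size argument concrete, here is a minimal sketch of one possible failure model. It
assumes (this is an illustration, not something measured) that a transfer is made of fixed-size
chunks and that each chunk independently triggers a fatal timeout with a small probability. The
function names, the 1 MB chunk size, and the 1% per-chunk failure rate are all hypothetical.

```python
import math


def success_probability(file_bytes, chunk_bytes, p_chunk_fail):
    """Probability that a single transfer attempt completes.

    Assumption: each chunk fails independently with probability
    p_chunk_fail, and any one failure aborts the whole transfer.
    """
    chunks = math.ceil(file_bytes / chunk_bytes)
    return (1.0 - p_chunk_fail) ** chunks


def expected_attempts(file_bytes, chunk_bytes, p_chunk_fail):
    """Expected number of attempts until one succeeds (geometric distribution)."""
    return 1.0 / success_probability(file_bytes, chunk_bytes, p_chunk_fail)


if __name__ == "__main__":
    MB = 1024 * 1024
    P_FAIL = 0.01  # hypothetical per-chunk failure rate, not a measured value
    for size_mb in (1, 10, 100, 1000):
        p_ok = success_probability(size_mb * MB, MB, P_FAIL)
        tries = expected_attempts(size_mb * MB, MB, P_FAIL)
        print(f"{size_mb:>5} MB  P(success)={p_ok:.6f}  expected attempts={tries:.1f}")
```

Under this model the success probability decays exponentially with file size, which is why
splitting the same data across more mappers (smaller files, each retried on its own) can restore
forward progress even though the underlying loss rate is unchanged.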
The root cause is still an open question. I spent a lot of time trying to debug the RPC mechanisms,
and I wasn't able to find a clear culprit.
> transfer map output transfer with http instead of rpc
> -----------------------------------------------------
>
> Key: HADOOP-195
> URL: http://issues.apache.org/jira/browse/HADOOP-195
> Project: Hadoop
> Type: Improvement
> Components: mapred
> Versions: 0.2
> Reporter: Owen O'Malley
> Assignee: Owen O'Malley
> Fix For: 0.3
>
> The data transfer of the map output should be transferred via http instead of rpc, because
rpc is very slow for this application and the timeout behavior is suboptimal. (server sends
data and client ignores it because it took more than 10 seconds to be received.)