[ https://issues.apache.org/jira/browse/HADOOP-923?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12471803
]
Hairong Kuang commented on HADOOP-923:
--------------------------------------
Two comments:
1. I feel that it is not necessary to balance the number of transfers when the heartbeat thread picks
up the replication work. First, the background thread that computes pendingTransfers has already
balanced the load. Second, block replication work needs to be done as soon as possible to avoid data loss.
Since the datanode has already been assigned the block replication work, no other datanode can
pick it up. If the work is not sent to the datanode in the current heartbeat,
it has to wait for at least another heartbeat interval.
2. The background thread that computes pendingTransfers scans only 100 datanodes per iteration
and then sleeps for 3 seconds. I feel that this approach does not scale well. For example, when
the cluster size reaches 2000, a datanode's work gets computed every 2000/100*3 = 60 seconds (1 min) if we ignore
the computation overhead, which is far less frequent than what we do now (every 3 seconds).
Another minor flaw is that the thread uses an index to record the next node to be checked.
If the heartbeat queue gets updated between two consecutive iterations, the index may
no longer point to the right node.
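
To make the two points in item 2 concrete, here is a small sketch. It is not taken from the patch: the class and method names (ScanSketch, secondsBetweenVisits, nodeAtSavedIndexAfterRemoval) are hypothetical, and the heartbeat queue is modelled as a plain list. The first method reproduces the 2000/100*3 estimate; the second shows how a saved index can end up pointing at a different node once an earlier entry leaves the list.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class ScanSketch {
    // Seconds between two visits to the same datanode when the scanner
    // handles nodesPerPass nodes and then sleeps for sleepSeconds.
    // Computation time is ignored, as in the estimate above.
    static long secondsBetweenVisits(int clusterSize, int nodesPerPass, int sleepSeconds) {
        long passes = (clusterSize + nodesPerPass - 1) / nodesPerPass; // ceiling division
        return passes * sleepSeconds;
    }

    // Returns the node that a previously saved index points at after the
    // entry at removedIndex has been removed from the (hypothetical) queue.
    static String nodeAtSavedIndexAfterRemoval(List<String> nodes, int savedIndex, int removedIndex) {
        List<String> copy = new ArrayList<>(nodes);
        copy.remove(removedIndex); // remove(int) removes by position
        return copy.get(savedIndex);
    }

    public static void main(String[] args) {
        // 2000 datanodes, 100 per pass, 3 second sleep
        System.out.println(secondsBetweenVisits(2000, 100, 3));

        List<String> nodes = Arrays.asList("dn0", "dn1", "dn2", "dn3", "dn4");
        // The scanner intended to resume at index 2 ("dn2"), but "dn0"
        // leaves the queue before the next iteration starts.
        System.out.println(nodeAtSavedIndexAfterRemoval(nodes, 2, 0));
    }
}
```

In this toy model the saved index skips "dn2" after the removal. Remembering the next node itself rather than its position would be one way to avoid that, though whether it fits the real heartbeat queue is not something this sketch can show.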
> DFS Scalability: datanode heartbeat timeouts cause cascading timeouts of other datanodes
> ----------------------------------------------------------------------------------------
>
> Key: HADOOP-923
> URL: https://issues.apache.org/jira/browse/HADOOP-923
> Project: Hadoop
> Issue Type: Bug
> Components: dfs
> Affects Versions: 0.10.1
> Reporter: dhruba borthakur
> Assigned To: dhruba borthakur
> Attachments: pendingTransferThread.patch
>
>
> The datanode sends a heartbeat to the namenode every 3 seconds. The namenode processes
the heartbeat and sends a list of blocks-to-be-replicated and blocks-to-be-deleted as part
of the heartbeat response.
> At times when a couple of datanodes fail, the heartbeat processing on the namenode becomes
pretty heavyweight. It acquires the global FSNamesystem lock, traverses the neededReplication
structure, generates a list of blocks to be replicated and responds to the heartbeat message.
Determining the list of blocks-to-be-replicated is pretty heavyweight: it takes plenty of CPU
and blocks processing of other heartbeats because of the global FSNamesystem lock.
> It would improve scalability a lot if heartbeat processing did not require the FSNamesystem
lock. In fact, the "heartbeat" lock already exists for this purpose.
> I propose that the Heartbeat message be separate from the "retrieve blocks-to-replicate
and blocks-to-delete" messages. The datanode can continue to heartbeat once every 3 seconds
while it can afford to "retrieve blocks-to-replicate" at a much coarser interval. Heartbeat
processing on the namenode will be fast because it does not require the global FSNamesystem
lock. Moreover, a datanode failure will not aggravate the heartbeat processing time on the
namenode.
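
The split proposed in the description could look roughly like the sketch below. It is only an illustration under stated assumptions: SplitHeartbeatSketch, retrieveBlockWork, scheduleWork and the two lock objects are hypothetical names, not the actual namenode code or the attached patch. The idea shown is that the frequent heartbeat call takes only a small heartbeat lock, while the less frequent work-retrieval call takes the lock standing in for the global FSNamesystem lock.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class SplitHeartbeatSketch {
    // Hypothetical stand-ins for the "heartbeat" lock and the global FSNamesystem lock.
    private final Object heartbeatLock = new Object();
    private final Object namesystemLock = new Object();

    private final Map<String, Long> lastHeartbeatMillis = new HashMap<>();
    private final Map<String, List<String>> pendingWork = new HashMap<>();

    // Fast path, called every heartbeat interval: records liveness only
    // and never touches the namesystem lock.
    public void heartbeat(String datanodeId, long nowMillis) {
        synchronized (heartbeatLock) {
            lastHeartbeatMillis.put(datanodeId, nowMillis);
        }
    }

    // Slow path, called at a coarser interval: hands out and clears the
    // block work queued for this datanode.
    public List<String> retrieveBlockWork(String datanodeId) {
        synchronized (namesystemLock) {
            List<String> work = pendingWork.remove(datanodeId);
            return work == null ? new ArrayList<String>() : work;
        }
    }

    // Queues a block for a datanode; in this sketch it stands in for
    // whatever computes the blocks to replicate or delete.
    public void scheduleWork(String datanodeId, String blockId) {
        synchronized (namesystemLock) {
            pendingWork.computeIfAbsent(datanodeId, k -> new ArrayList<>()).add(blockId);
        }
    }

    // Last recorded heartbeat time, or null if the datanode is unknown.
    public Long lastHeartbeat(String datanodeId) {
        synchronized (heartbeatLock) {
            return lastHeartbeatMillis.get(datanodeId);
        }
    }
}
```

With this shape, a long-running scheduleWork or retrieveBlockWork call does not delay heartbeat, which is the property the description is after.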
--
This message is automatically generated by JIRA.