Details

Description

In the new pipeline design, pipeline close is implemented by sending an additional empty packet. If one of the datanodes does not respond to this empty packet, the pipeline hangs; there appears to be no timeout.

Hairong Kuang
added a comment - 06/Nov/09 19:24

If a datanode really becomes non-responsive, the DFS client is able to detect the problem.

The issue here is that the test simulates a non-responsive block receiver. While the block receiver is blocked, the packet responder is still alive and sends heartbeats back periodically, so the client still thinks the pipeline is working well.

The solution:
1. The packet responder does not send heartbeats; instead, turn on TCP-level keepalive probes by setting the socket's "keepalive" option to true.
2. The DFS client does not receive acks until there is at least one outstanding packet.
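Point 1 above maps to a standard socket option. As a minimal sketch (this is a generic Java snippet, not code from the patch):

```java
import java.net.Socket;
import java.net.SocketException;

// Minimal sketch: enabling OS-level TCP keepalive probes on a socket,
// as an alternative to application-level heartbeats. With SO_KEEPALIVE
// set, the kernel periodically probes an idle connection and surfaces
// a dead peer as a socket error on the next read/write.
public class KeepAliveSketch {
    public static Socket configure(Socket s) throws SocketException {
        s.setKeepAlive(true);  // turn on SO_KEEPALIVE
        return s;
    }

    public static void main(String[] args) throws SocketException {
        // unconnected socket, for illustration only
        Socket s = configure(new Socket());
        System.out.println("keepalive=" + s.getKeepAlive());
    }
}
```

One caveat with this approach: the default keepalive probe interval is set by the OS (commonly two hours), so it reacts far more slowly than an application-level heartbeat, which may be why the discussion below moves toward client-driven heartbeats instead.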

dhruba borthakur
added a comment - 06/Nov/09 20:29

3. Another option would be to make the DFSClient send a heartbeat packet through the pipeline. Each BlockReceiver in the datanodes recognizes it as a heartbeat packet and forwards it to the next datanode in the pipeline. The last datanode, on receipt of this heartbeat packet, triggers its PacketResponder to send the heartbeat back to the previous datanode in the pipeline. The heartbeat message finally arrives back at the client. This way, we can test that the entire pipeline (both forward and backward channels) is alive and active.

Hairong Kuang
added a comment - 06/Nov/09 23:07

Dhruba, this is a good idea. The DFSClient sends out a heartbeat after the pipeline has been idle for half of the read timeout. A heartbeat is a special packet with a seq# of -1.
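The timing rule described here ("send a heartbeat after half of the read timeout of idleness") can be sketched as a small predicate; the names below are illustrative, not the actual DFSClient fields:

```java
// Sketch of the heartbeat trigger rule: send a heartbeat once the
// pipeline has been idle for at least half of the socket read timeout.
// Names are hypothetical, not the real DFSClient code.
public class HeartbeatTrigger {
    static final long HEARTBEAT_SEQNO = -1;  // special seq# marking a heartbeat

    // now and lastPacket are timestamps in ms; socketTimeout is the read timeout.
    static boolean shouldSendHeartbeat(long now, long lastPacket, long socketTimeout) {
        return now - lastPacket >= socketTimeout / 2;
    }

    public static void main(String[] args) {
        long socketTimeout = 60_000;  // 60s read timeout
        // idle for 20s: below the 30s threshold -> false
        System.out.println(shouldSendHeartbeat(100_000, 80_000, socketTimeout));
        // idle for 40s: past the 30s threshold -> true
        System.out.println(shouldSendHeartbeat(100_000, 60_000, socketTimeout));
    }
}
```

Triggering at half the read timeout leaves the receiving side a full margin: the heartbeat reaches the peer well before the peer's own read timeout would fire.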

Hairong Kuang
added a comment - 12/Nov/09 01:05

I am quite torn over whether a heartbeat should be:
1. a regular empty packet, handled exactly the same as a regular data packet; or
2. a special empty packet with a seq# of -1, treated differently from a regular packet. For example, it does not get added to the packet queue on either the client or the datanode side.

Solution 1 is much simpler than solution 2. But is there any side effect?

dhruba borthakur
added a comment - 12/Nov/09 14:57

Agree that solution 1 will be simpler than solution 2. Another thing that might play a part here is that, over time, we might want to add metadata to the heartbeat packet. (For example, maybe a timestamp every time a datanode forwards it; this would enable the client to eliminate slow datanodes from the pipeline if needed.)

Hairong Kuang
added a comment - 16/Nov/09 21:09

The timestamp meta-information could be added to regular data packets as well. My plan is not to send heartbeats periodically; instead, a heartbeat is sent only if the connection has been idle for too long.

Hairong Kuang
added a comment - 25/Nov/09 01:26

OK, here comes a patch that does the following:
1. When a pipeline is idle for half of the read timeout, the DFSClient sends a heartbeat packet that has a sequence number of -1 and a data length of 0.
2. At the client side, a heartbeat packet does not get put in the data queue and hence is not in the ack queue.
3. At the datanode side, the block receiver treats a heartbeat packet mostly like a regular data packet; a heartbeat packet is put in the ack queue, and its ack is the same as that of a regular data packet.
4. The ack to a heartbeat packet may also indicate failures in the pipeline and therefore trigger pipeline recovery.
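Point 2 — heartbeats bypass the client's queues — can be sketched as follows. The class and field names are hypothetical, chosen only to mirror the description above:

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Sketch of the client-side rule from the patch description: a heartbeat
// packet (seqno -1, empty payload) is written to the stream but never
// enqueued, so there is nothing to resend if it is lost. Names are
// illustrative, not the actual DFSClient code.
public class ClientQueueSketch {
    static final long HEART_BEAT_SEQNO = -1L;

    static class Packet {
        final long seqno;
        final int dataLen;
        Packet(long seqno, int dataLen) { this.seqno = seqno; this.dataLen = dataLen; }
        boolean isHeartbeat() { return seqno == HEART_BEAT_SEQNO; }
    }

    final Deque<Packet> dataQueue = new ArrayDeque<>();

    void queueForSending(Packet p) {
        if (!p.isHeartbeat()) {   // heartbeats bypass the data/ack queues
            dataQueue.addLast(p);
        }
        // ...in the real client, the packet is then written to the pipeline
    }

    public static void main(String[] args) {
        ClientQueueSketch c = new ClientQueueSketch();
        c.queueForSending(new Packet(0, 512));               // regular packet: queued
        c.queueForSending(new Packet(HEART_BEAT_SEQNO, 0));  // heartbeat: not queued
        System.out.println("queued=" + c.dataQueue.size());
    }
}
```

Because a heartbeat never enters the ack queue on the client, a lost heartbeat costs nothing; the datanode-side ack (point 3) still flows back and can surface pipeline failures (point 4).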

Hairong Kuang
added a comment - 25/Nov/09 18:30

This patch ends up implementing the heartbeat packet as a special packet. This avoids the complexity of both the data stream thread and the application thread creating packets simultaneously; otherwise, the DFS client would have to handle issues such as synchronization and out-of-sequence packets.

dhruba borthakur
added a comment - 09/Dec/09 08:52

I am still looking at the patch; two comments:
1. Is it possible to get rid of lastDataNodeRun() completely? This method existed because the last datanode in the pipeline needed to send heartbeats. Now that is not needed anymore.
2. In DFSClient.java, one change is:

    long timeout = (stage == BlockConstructionStage.DATA_STREAMING) ?
        socketTimeout/2 - (now - lastPacket) : 1000;

Could timeout sometimes be negative?
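Dhruba's concern is easy to reproduce: once the pipeline has been idle longer than half the read timeout, the expression goes below zero. The sketch below demonstrates this and shows one possible guard; the committed patch may handle it differently:

```java
// Demonstrates how the wait-timeout expression from DFSClient.java can go
// negative once idle time exceeds socketTimeout/2, and one way to clamp it.
// Illustrative only; the actual fix in the patch may differ.
public class TimeoutSketch {
    static long rawTimeout(long socketTimeout, long now, long lastPacket) {
        return socketTimeout / 2 - (now - lastPacket);
    }

    // Clamp to a small positive floor so a wait call never sees a negative timeout.
    static long safeTimeout(long socketTimeout, long now, long lastPacket) {
        return Math.max(1000, rawTimeout(socketTimeout, now, lastPacket));
    }

    public static void main(String[] args) {
        long socketTimeout = 60_000;
        // idle for 40s with a 60s read timeout: 30_000 - 40_000 = -10_000
        System.out.println(rawTimeout(socketTimeout, 100_000, 60_000));
        System.out.println(safeTimeout(socketTimeout, 100_000, 60_000));
    }
}
```

A negative value matters because Object.wait(long) throws IllegalArgumentException for negative timeouts, so the raw expression cannot safely be passed to a wait call as-is.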

Hairong Kuang
added a comment - 10/Dec/09 00:31

Thanks Dhruba for your review comments. I guess you are still having jet lag. This patch:
1. merges with the trunk (not an easy job),
2. incorporates Dhruba's comments (it feels scary to remove a large chunk of code), and
3. adds Nicholas' close-related fault injection test, which is attached to this jira.

Hairong Kuang
added a comment - 13/Dec/09 19:08

Let me provide more information about this patch:
1. A heartbeat is an empty packet with a sequence number of -1.
2. At the client side, heartbeat packets are not queued in any of the queues, because there is no need to resend a heartbeat if an error occurs while sending one.
3. At the datanode side, heartbeats are queued in the ack queue. A datanode treats a heartbeat the same as a regular data packet, and the ack of a heartbeat packet is the same as that of a regular data packet. To distinguish a heartbeat from an end-of-block packet, receivePacket returns -1 when receiving an end-of-block packet.

Dhruba, please feel free to reach me if you need more explanation. I need this patch to be committed before I commit HDFS-101.
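The datanode-side distinction described above hinges on receivePacket's return value, since both a heartbeat and an end-of-block packet carry no data. A sketch of that convention (names and signature are hypothetical, not the actual BlockReceiver code):

```java
// Sketch of the return-value convention described above: a heartbeat and an
// end-of-block packet both have an empty payload, so the return value of
// receivePacket (-1 for end-of-block) is what tells them apart.
// Illustrative only; not the real BlockReceiver.receivePacket signature.
public class ReceiverSketch {
    static final long HEART_BEAT_SEQNO = -1L;

    // Returns the payload length for data packets (0 for a heartbeat),
    // and -1 for the end-of-block packet so the caller knows to stop.
    static int receivePacket(long seqno, int dataLen, boolean lastPacketInBlock) {
        if (lastPacketInBlock) {
            return -1;       // end-of-block: terminates the receive loop
        }
        return dataLen;      // a heartbeat (seqno == -1) arrives here with dataLen == 0
    }

    public static void main(String[] args) {
        System.out.println(receivePacket(7, 512, false));               // data packet
        System.out.println(receivePacket(HEART_BEAT_SEQNO, 0, false));  // heartbeat
        System.out.println(receivePacket(8, 0, true));                  // end-of-block
    }
}
```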

stack
added a comment - 16/Nov/10 00:02

I tried it by running a few of the hbase tests that were failing before hbAckReply.patch. They pass now (with hdfs-895 in place on 0.20-append). Let me run the full suite; I'll report back.

One issue that came up over in hbase-3234 was the fact that hdfs-724 does this:

    - public static final int DATA_TRANSFER_VERSION = 14;
    + public static final int DATA_TRANSFER_VERSION = 15;

Could we not do the above on commit to 0.20-append? No biggie, just asking. It would be nice not to make communication incompatible. Good stuff, Hairong.

Hairong Kuang
added a comment - 16/Nov/10 17:57

Thanks Stack for testing this. I just committed the hbAckReply patch to the append 0.20 branch.

For the incompatibility problem, as I commented on the hbase jira, HDFS-724 unfortunately changed the syntax and semantics of heartbeat packets.