hadoop-hdfs-issues mailing list archives

[jira] [Commented] (HDFS-8960) DFS client says "no more good datanodes being available to try" on a single drive failure

Date: Fri, 04 Sep 2015 05:57:47 GMT

[ https://issues.apache.org/jira/browse/HDFS-8960?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14730359#comment-14730359 ]
Hudson commented on HDFS-8960:
------------------------------
FAILURE: Integrated in HBase-TRUNK #6778 (See [https://builds.apache.org/job/HBase-TRUNK/6778/])
HBASE-14317 Stuck FSHLog: bad disk (HDFS-8960) and can't roll WAL (stack: rev 661faf6fe0833726d7ce7ad44a829eba3f8e3e45)
* hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/wal/HLogKey.java
* hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/wal/SyncFuture.java
* hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/wal/ProtobufLogWriter.java
* hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/HRegion.java
* hbase-server/src/test/java/org/apache/hadoop/hbase/regionserver/TestWALLockup.java
* hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/LogRoller.java
* hbase-server/src/test/java/org/apache/hadoop/hbase/regionserver/wal/TestLogRolling.java
* hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/wal/ProtobufLogReader.java
* hbase-server/src/main/java/org/apache/hadoop/hbase/io/hfile/HFile.java
* hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/wal/FSHLog.java
* hbase-server/src/test/java/org/apache/hadoop/hbase/regionserver/TestHRegion.java
* hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/MultiVersionConcurrencyControl.java
* hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/wal/FSWALEntry.java
* hbase-server/src/test/java/org/apache/hadoop/hbase/regionserver/TestMultiVersionConcurrencyControlBasic.java
* hbase-server/src/test/java/org/apache/hadoop/hbase/regionserver/TestMultiVersionConcurrencyControl.java
* hbase-server/src/main/java/org/apache/hadoop/hbase/wal/WALKey.java
* hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/wal/DamagedWALException.java
* hbase-server/src/test/java/org/apache/hadoop/hbase/regionserver/TestFSErrorsExposed.java
* hbase-server/src/test/java/org/apache/hadoop/hbase/regionserver/TestFailedAppendAndSync.java
HBASE-14317 Stuck FSHLog: bad disk (HDFS-8960) and can't roll WAL; addendum (stack: rev 54717a6314ef6673f7607091e5f77321c202d49f)
* hbase-server/src/test/java/org/apache/hadoop/hbase/regionserver/TestFSErrorsExposed.java
> DFS client says "no more good datanodes being available to try" on a single drive failure
> -----------------------------------------------------------------------------------------
>
> Key: HDFS-8960
> URL: https://issues.apache.org/jira/browse/HDFS-8960
> Project: Hadoop HDFS
> Issue Type: Bug
> Components: hdfs-client
> Affects Versions: 2.7.1
> Environment: openjdk version "1.8.0_45-internal"
> OpenJDK Runtime Environment (build 1.8.0_45-internal-b14)
> OpenJDK 64-Bit Server VM (build 25.45-b02, mixed mode)
> Reporter: Benoit Sigoure
> Attachments: blk_1073817519_77099.log, r12s13-datanode.log, r12s16-datanode.log
>
>
> Since we upgraded to 2.7.1 we regularly see single-drive failures cause widespread problems
at the HBase level (with the default 3x replication target).
> Here's an example. This HBase RegionServer is r12s16 (172.24.32.16) and is writing its
WAL to [172.24.32.16:10110, 172.24.32.8:10110, 172.24.32.13:10110], as can be seen from
occasional messages like the following:
> {code}
> 2015-08-23 06:28:40,272 INFO [sync.3] wal.FSHLog: Slow sync cost: 123 ms, current pipeline:
[172.24.32.16:10110, 172.24.32.8:10110, 172.24.32.13:10110]
> {code}
> A bit later, the second node in the pipeline above is going to experience an HDD failure.
> {code}
> 2015-08-23 07:21:58,720 WARN [DataStreamer for file /hbase/WALs/r12s16.sjc.aristanetworks.com,9104,1439917659071/r12s16.sjc.aristanetworks.com%2C9104%2C1439917659071.default.1440314434998
block BP-1466258523-172.24.32.1-1437768622582:blk_1073817519_77099] hdfs.DFSClient: Error
Recovery for block BP-1466258523-172.24.32.1-1437768622582:blk_1073817519_77099 in pipeline
172.24.32.16:10110, 172.24.32.13:10110, 172.24.32.8:10110: bad datanode 172.24.32.8:10110
> {code}
> And then HBase decides it can no longer write to its WAL and aborts the RegionServer:
> {code}
> 2015-08-23 07:22:26,060 FATAL [regionserver/r12s16.sjc.aristanetworks.com/172.24.32.16:9104.append-pool1-t1]
wal.FSHLog: Could not append. Requesting close of wal
> java.io.IOException: Failed to replace a bad datanode on the existing pipeline due to
no more good datanodes being available to try. (Nodes: current=[172.24.32.16:10110, 172.24.32.13:10110],
original=[172.24.32.16:10110, 172.24.32.13:10110]). The current failed datanode replacement
policy is DEFAULT, and a client may configure this via 'dfs.client.block.write.replace-datanode-on-failure.policy'
in its configuration.
> at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.findNewDatanode(DFSOutputStream.java:969)
> at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.addDatanode2ExistingPipeline(DFSOutputStream.java:1035)
> at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.setupPipelineForAppendOrRecovery(DFSOutputStream.java:1184)
> at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.processDatanodeError(DFSOutputStream.java:933)
> at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:487)
> {code}
> This should be mostly a non-event, as the DFS client should simply drop the bad
replica from the write pipeline.
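> For context on why a replacement is attempted at all here: a rough sketch of the DEFAULT policy's decision, as described in the Hadoop docs for ReplaceDatanodeOnFailure (class and method names below are mine, simplified, not the actual client code):
{code:java}
// Sketch (not Hadoop source) of the documented DEFAULT
// replace-datanode-on-failure condition: only look for a replacement node
// when replication >= 3 and either half or more of the pipeline is gone,
// or the stream is an append/hflushed stream. A WAL is hflushed, so it
// qualifies as soon as one node fails.
public class ReplaceDatanodeSketch {
    static boolean shouldReplace(int replication, int aliveNodes,
                                 boolean isAppend, boolean isHflushed) {
        if (replication < 3) {
            return false;               // small pipelines: just keep writing
        }
        if (aliveNodes <= replication / 2) {
            return true;                // lost half or more of the replicas
        }
        return isAppend || isHflushed;  // durability-sensitive streams
    }

    public static void main(String[] args) {
        // 3x-replicated hflushed WAL with 2 of 3 nodes still alive:
        // a replacement is requested, which is how the client ends up
        // in findNewDatanode() in the stack trace above.
        System.out.println(shouldReplace(3, 2, false, true)); // true
    }
}
{code}
> So for this WAL the condition is satisfied, the client tries to add a node, and the fatal exception above is thrown when it cannot find one.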
> This is a small cluster, but it has 16 DNs, so the failed DN in the pipeline should be easily
replaced. I didn't set {{dfs.client.block.write.replace-datanode-on-failure.policy}} (so
it's still {{DEFAULT}}) and didn't set {{dfs.client.block.write.replace-datanode-on-failure.enable}}
(so it's still {{true}}).
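> For reference, these are client-side settings in hdfs-site.xml. A minimal sketch of the defaults being described above (shown for reference, not a recommended change):
{code:xml}
<!-- Client-side pipeline-recovery settings (hdfs-site.xml on the client).
     Values shown are the defaults discussed above. -->
<property>
  <name>dfs.client.block.write.replace-datanode-on-failure.enable</name>
  <value>true</value>
</property>
<property>
  <name>dfs.client.block.write.replace-datanode-on-failure.policy</name>
  <value>DEFAULT</value> <!-- other values: NEVER, ALWAYS -->
</property>
{code}
> (If I recall correctly, 2.7 also has a {{dfs.client.block.write.replace-datanode-on-failure.best-effort}} knob, default {{false}}, that lets the write continue when no replacement node can be found.)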
> I don't see anything noteworthy in the NN log around the time of the failure. It just
seems like the DFS client gave up, or threw an exception back to HBase that it wasn't throwing
before, and that made this single-drive failure lethal.
> We've occasionally been "unlucky" enough to have a single-drive failure cause multiple
RegionServers to commit suicide because they all had their WALs on that drive.
> We upgraded from 2.7.0 about a month ago, and I'm not sure whether we were seeing this
with 2.7 or not; prior to that we were running in a quite different environment, but this
is a fairly new deployment.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)