Description

The new 2-level directory layout can make directory scans expensive in terms of disk seeks (see HDFS-8791 for details).

It would be good if the DirectoryScanner had a configurable duty cycle that would reduce its impact on disk performance (much like the approach in HDFS-8617).

Without such a throttle, disks can go 100% busy for many minutes at a time. (Assuming the common case of all inodes in cache but no directory blocks cached, 64K seeks are required for a full directory listing, which translates to 655 seconds.)
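A back-of-envelope check on that figure (the ~10 ms average seek time is an assumption, not stated in the issue text):

```java
// Rough arithmetic behind the 655-second estimate above: 64K seeks for a
// full directory listing, at an assumed ~10 ms average seek time.
public class SeekMath {
  static double fullListingSeconds(int seeks, double seekTimeMs) {
    return seeks * seekTimeMs / 1000.0;
  }
}
```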

Daniel Templeton
added a comment - 02/Sep/15 16:53 Suggestion from HDFS-8989 :
We should allow rate-limiting the DirectoryScanner via time-slicing, so that it only runs for 100 ms out of every 1 s (or whatever other fraction we configure).
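A minimal sketch of that time-slicing idea (the class name and pure-function shape are illustrative, not from any patch): given a wall-clock time, decide whether the scanner may run or how long it must sleep until the next one-second boundary.

```java
// Duty-cycle sketch: the scanner may run only during the first
// runMsPerSec milliseconds of each one-second period.
public class DutyCycleSketch {
  static final int MS_PER_SEC = 1000;

  /** Milliseconds to sleep at nowMs, or 0 if inside the run window. */
  static long sleepNeededMs(long nowMs, int runMsPerSec) {
    long offset = nowMs % MS_PER_SEC;
    if (offset < runMsPerSec) {
      return 0;                  // inside the run window: keep scanning
    }
    return MS_PER_SEC - offset;  // sleep to the next second boundary
  }
}
```

With runMsPerSec = 100 this gives the "100 ms out of 1 s" behavior quoted above.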


Daniel Templeton
added a comment - 11/Sep/15 19:08 Here's a patch to add a time-based throttle. The changes are a little large because I had to first make the scanner throttlable. I also took the liberty of fleshing out the javadocs. The throttling approach is explained in the comments and javadocs.

Daniel Templeton
added a comment - 11/Sep/15 22:05 Resolved some of the checkstyle issues.
On the line-too-long comments: they're in the DFSConfig file, where every line is too long. I assume I should follow the established pattern rather than introduce a new style just for checkstyle.
On the comments about making the fields private and adding accessors: since the fields are all final, I don't see the need. Does anyone disagree?
The whitespace issue wasn't something I introduced, so I don't know why it's complaining.


Colin P. McCabe
added a comment - 11/Sep/15 22:14 Good work, Daniel Templeton.
    public static final String DFS_DATANODE_DIRECTORYSCAN_THROTTLE_KEY =
        "dfs.datanode.directoryscan.throttle";
Should be "throttle.type"?
    <property>
      <name>dfs.datanode.directoryscan.throttle.limit</name>
      <value>0</value>
      <description>The limit setting for the report compiler threads throttle.
        The meaning of this setting is determined by the
        dfs.datanode.directoryscan.throttle setting. In all cases, setting this
        limit to 0 disables compiler thread throttling. 0 is the default
        setting.
      </description>
    </property>
I would prefer to see per-type configuration keys that are more descriptive. For example, dfs.datanode.directoryscan.timeslice.throttle.ms.per.sec. If we invent more throttle types later, we can always add more configuration key names.
testTimesliceThrottle: please copy the Configuration object and change the copy rather than mutating TestDirectoryScanner#CONF.
    // Waiting should be about 4x running.
    ratio = (float) scanner.timeWaiting.get() / scanner.timeRunning.get();

    LOG.info("RATIO: " + ratio);
    assertTrue("Throttle is too restrictive", ratio <= 4.5);
    assertTrue("Throttle is too permissive", ratio >= 3.0);
I'm a bit concerned that other tests running on the same machine, or GCs could cause delays here that would make the test fail. Perhaps we could do this in a loop and keep retrying until we found that the ratio was correct?
    @VisibleForTesting
    final AtomicLong timeRunning = new AtomicLong(0L);
Should be "timeRunningMs" to reflect the fact that this interval is in milliseconds. Similarly for "timeWaiting".
    public static ThrottleType parse(String type) {
      if (type.trim().equalsIgnoreCase(TIMESLICE.toString())) {
        return TIMESLICE;
      } else {
        return NONE;
      }
    }
We should log an ERROR message if we can't understand the throttle type. Silently defaulting to doing nothing is not behavior most users will appreciate.
DirectoryScanner.java: there are some unnecessary whitespace changes.
TimesliceThrottler: maybe TimeSliceThrottlerTask is a more appropriate name here? I like to think of executors as scheduling tasks. Arguably the throttler state is all contained outside the class so it's not really "the throttler."
    } catch (Throwable t) {
      LOG.error("Throttle thread died unexpectedly", t);

      if (t instanceof Error) {
        throw t;
      }
    }
What's the purpose of rethrowing exceptions here?
    private static class Throttle extends Semaphore {
While this works, it feels more natural to use a boolean + condition variable here.
    lock.lock();
    try {
      while (blockThreads) {
        cond.await();
      }
    } finally {
      lock.unlock();
    }
    private final ThrottleType throttleType;
    private final int throttleLimit;
    private ScheduledExecutorService throttleThread;
    private Semaphore runningThreads = new Semaphore(0);
    private volatile Throttle throttle;
It feels like this state should be encapsulated inside the Throttler itself.
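The two suggestions above (a boolean guarded by a condition variable, with the state living inside the throttle object itself) might look something like this; the class and method names are hypothetical, not from the patch:

```java
import java.util.concurrent.locks.Condition;
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical gate: worker threads call awaitOpen(); the scheduler
// flips the gate with setBlocked().
public class ThrottleGate {
  private final ReentrantLock lock = new ReentrantLock();
  private final Condition opened = lock.newCondition();
  private boolean blockThreads = false;   // the gate starts open

  void setBlocked(boolean blocked) {
    lock.lock();
    try {
      blockThreads = blocked;
      if (!blocked) {
        opened.signalAll();   // wake every thread parked in awaitOpen()
      }
    } finally {
      lock.unlock();
    }
  }

  boolean isBlocked() {
    lock.lock();
    try {
      return blockThreads;
    } finally {
      lock.unlock();
    }
  }

  void awaitOpen() throws InterruptedException {
    lock.lock();
    try {
      while (blockThreads) {
        opened.await();       // a Condition uses await(), not wait()
      }
    } finally {
      lock.unlock();
    }
  }
}
```

The while loop around await() guards against spurious wakeups, which is why the boolean is needed alongside the condition variable.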
    // Give our parent a chance to block us for throttling
    if (throttle != null) {
      throttle.start();
    }
Can we just say that Throttler is always non-null, but sometimes we have a NoOpThrottler? I don't like all this checking-for-null business.
You could get rid of the type enum and all those explicit type checks, and just have a factory function inside an interface that creates the appropriate kind of Throttler object from the Configuration (no-op, timeslice, etc).
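A rough sketch of that factory-plus-no-op idea (the interface name, the config string, and the timeslice math are all illustrative): the factory always returns a usable Throttler, so callers never branch on null or on a type enum.

```java
// Hypothetical Throttler: create() returns a no-op instance when
// throttling is disabled, so call sites can invoke throttle()
// unconditionally.
public interface Throttler {
  void throttle() throws InterruptedException;

  static Throttler create(String type, int limitMsPerSec) {
    if ("timeslice".equalsIgnoreCase(type)
        && limitMsPerSec > 0 && limitMsPerSec < 1000) {
      return () -> {
        long offset = System.currentTimeMillis() % 1000L;
        if (offset >= limitMsPerSec) {
          Thread.sleep(1000L - offset);  // sleep out the rest of the second
        }
      };
    }
    return () -> { };  // no-op throttler
  }
}
```

In the real code the arguments would presumably come from the Configuration object rather than being passed directly.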


Daniel Templeton
added a comment - 14/Sep/15 16:13 Colin P. McCabe, I implemented all of your suggestions except the throttle limit property name.
I would prefer to see per-type configuration keys that are more descriptive. For example, dfs.datanode.directoryscan.timeslice.throttle.ms.per.sec.
Encapsulating the throttle stuff requires a reasonable abstraction of what a throttle is. The various kinds of throttles (time, file, iops, ...) are all pretty different and aren't easy to overlay with a single abstraction. I decided to give up on the idea of making the throttle type selectable. The limit therefore always means the same thing, and so I think it's fair to leave its name as is.
What's the purpose of rethrowing exceptions here?
I'm only rethrowing Errors, which is the right thing to do. It would be bad if a thread caught an OOME and discarded it.


Colin P. McCabe
added a comment - 15/Sep/15 01:10 Encapsulating the throttle stuff requires a reasonable abstraction of what a throttle is. The various kinds of throttles (time, file, iops, ...) are all pretty different and aren't easy to overlay with a single abstraction. I decided to give up on the idea of making the throttle type selectable. The limit therefore always means the same thing, and so I think it's fair to leave its name as is.
I agree that it is probably premature to have a single Throttle base class that can do all the various possible things you might want a Throttle to do. But it doesn't follow from that that we need to give up on the idea of making the throttle type selectable in the future.
Also, configuration keys which are specified in terms of milliseconds should always end in ms. It causes a huge amount of confusion when times come without units. It is obvious to you, the author of the code, that it is in milliseconds, but it is not obvious to users or other developers.
Why not just have a single class called TimeBasedThrottle which does whatever you want your time-based throttle to do? You can make it a standalone class that doesn't extend or implement anything, and even create an instance of it that does nothing when throttling is turned off. But keep the flexible configuration mechanism that we discussed earlier. That way, if someone wants to do something more elaborate later, they can.
hadoop-hdfs-project/hadoop-hdfs/now: did you intend to put this in the patch?
    public static final String DFS_DATANODE_DIRECTORYSCAN_THROTTLE_LIMIT_KEY =
        "dfs.datanode.directoryscan.throttle.limit";
Should end with "ms" to indicate the limit is in milliseconds. Or ideally "ms.per.sec".
- if (!retainDiffs) clear();
+
+ if (!retainDiffs) {
+ clear();
+ }
Can we move small whitespace cleanups like this to a follow-on change? It just makes backports a pain because it creates unnecessary conflicts, and tends to obscure what this JIRA is about when people are reading the change log.
+ } //end for
+ } //end synchronized
+ } // end if
If we're changing this, then let's get rid of the COBOLisms. A close brace is enough to know that the block is closed.


Daniel Templeton
added a comment - 16/Sep/15 23:46 New patch. Colin P. McCabe , I addressed your concerns about key name, whitespace, and comments. I haven't done anything in the patch that will stand in the way of someone extending the throttle concept to be more flexible later, but I don't think it's worth the effort to proactively invest in doing more to enable that extension since it may well never happen.


Daniel Templeton
added a comment - 17/Sep/15 18:56 [sigh] Those tests pass for me locally, so I can't say why they failed.
The whitespace error is interesting. I changed line n in the patch. Jenkins complained about the whitespace on line n+1. I fixed the whitespace on line n+1 in the next patch. Jenkins is now complaining about the whitespace on line n+2. There is no issue on line n+3, so I could correct n+2 and be done, but at that point I've made whitespace changes on two lines that I didn't otherwise touch. What's the accepted way to do it: fix the whitespace or ignore the error?


Colin P. McCabe
added a comment - 17/Sep/15 21:01 The jenkins errors look like:
    java.lang.NoSuchMethodError: org.apache.hadoop.hdfs.protocol.DatanodeInfo.<init>(Lorg/apache/hadoop/hdfs/protocol/DatanodeID;Ljava/lang/String;JJJJJJJJILorg/apache/hadoop/hdfs/protocol/DatanodeInfo$AdminStates;)V
        at org.apache.hadoop.hdfs.protocolPB.PBHelper.convert(PBHelper.java:591)
We've seen this before and never managed to track it down. It seems to be a bug in our Jenkins integration, possibly related to having multiple maven invocations going on at once sharing the same .m2 directory. I will re-trigger the build.
The whitespace error is interesting. I changed line n in the patch. Jenkins complained about the whitespace on line n+1. I fixed the whitespace on line n+1 in the next patch. Jenkins is now complaining about the whitespace on line n+2
I would say just leave it alone. If you didn't introduce the whitespace issue then don't worry about it. We really should turn off most of those checkstyle checks since they provide no value.


Colin P. McCabe
added a comment - 18/Sep/15 02:00 Thanks, Daniel Templeton . Looking better.
The configuration looks good. We had a quick offline discussion where Daniel Templeton pointed out that we could add a selectable throttle type in the future in a compatible way. So we don't need to have the throttle type selector in this patch.
    public DirectoryScannerThrottle(int limit) {
      if ((limit <= 0) || (limit >= 1000)) {
We use "1000" a lot in this code. Can we have a constant like MS_PER_SEC to make its role clearer?
    public synchronized void cycle() throws InterruptedException {
      while (!open) {
        try {
          wait();
        } catch (InterruptedException ex) {
          // Ignore
        }
      }
Why are we declaring that the function throws InterruptedException if we just catch it and do nothing?
In ReportCompiler#run, we should probably at least call Thread.currentThread().interrupt() rather than swallowing the InterruptedException completely. As described in http://www.ibm.com/developerworks/library/j-jtp05236/ , "even noncancelable tasks should attempt to preserve the interrupted status in case code higher up on the call stack wants to act on the interruption after the noncancelable task completes."
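The pattern being referenced is small; a minimal illustration (the helper name is made up):

```java
// Restore the interrupt flag instead of swallowing InterruptedException,
// so callers higher up the stack can still observe the interruption.
public class InterruptDemo {
  /** Returns true if the full sleep completed, false if interrupted. */
  static boolean sleepQuietly(long ms) {
    try {
      Thread.sleep(ms);
      return true;
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();  // preserve interrupted status
      return false;
    }
  }
}
```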
    @VisibleForTesting
    final Map<String, Stats> stats = new HashMap<>();
It's not clear from looking at this what the String key is all about. Either document in a comment that it is a block pool id, or rename the variable to statsPerBpId or similar.
    int d = 0; // index for blockpoolReport
    int m = 0; // index for memReport
Rename d to blockReportIdx, m to memReportIdx? I don't normally complain about short variable names, but this loop is somewhat complex.
    // No need to set an initial capacity because the number of entries will
    // never be that large
All very true, but you could have just initialized it with reports.size() and saved some typing.
ThrottleTask / UnthrottleTask: is there any way we could reuse these objects and avoid generating all that garbage? It seems like they are stateless (although they alter state inside the DirectoryScannerThrottle). We could probably just keep one of each inside the parent class and avoid allocating.
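One way to realize that reuse suggestion (class and field names hypothetical): allocate each task once and hand the same instance to the executor every period.

```java
// Pre-allocated, reusable task objects: the tasks themselves hold no
// per-run state, so one instance of each can be scheduled repeatedly.
public class ReusableTasks {
  int gateClosed = 0;  // stand-in for the throttle state the tasks flip

  final Runnable throttleTask = () -> gateClosed++;    // close the gate
  final Runnable unthrottleTask = () -> gateClosed--;  // reopen the gate
}
```

In the scanner these would presumably be handed to the ScheduledExecutorService on each cycle rather than constructed fresh.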
    long mark = Time.monotonicNow();
Please make this markMs. Same for newMark.
    // Skip all the files that cycle with block name until
    // getting to the metafile for the block
    while ((i + 1 < files.length) && files[i + 1].isFile()
        && files[i + 1].getName().startsWith(blockFile.getName())) {
      i++;

      if (isBlockMetaFile(blockFile.getName(), files[i].getName())) {
        metaFile = files[i];
        break;
      }
    }
I realize you didn't change this in your patch, but this is a potentially very expensive operation depending on the directory size. Perhaps we should sort the array of children at the lowest level, so that we know that the meta file is the next entry, if it exists. We might want to do that in a follow-on jira.
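That follow-on idea can be sketched like this (the file names and the adjacency assumption are illustrative; real HDFS block/meta naming may need more care):

```java
// With the listing sorted, a block's ".meta" companion, if present,
// is expected at the very next index, so no forward scan is needed.
public class SortedListingSketch {
  /** Index of the meta entry for sortedNames[blockIdx], or -1 if absent. */
  static int metaIndex(String[] sortedNames, int blockIdx) {
    int next = blockIdx + 1;
    if (next < sortedNames.length
        && sortedNames[next].startsWith(sortedNames[blockIdx])
        && sortedNames[next].endsWith(".meta")) {
      return next;
    }
    return -1;
  }
}
```

This trades one sort per directory for eliminating the inner while loop over siblings.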


Yi Liu
added a comment - 18/Sep/15 05:45
We've seen this before and never managed to track it down. It seems to be a bug in our Jenkins integration, possibly related to having multiple maven invocations going on at once sharing the same .m2 directory. I will re-trigger the build.
Yes, I think so; I've seen a similar issue several times. If I have an HDFS patch that also adds a new class in Hadoop common, it easily fails because other builds may overwrite the Hadoop common jar installed by my build. Unless we use separate virtual machines, maybe through Docker containers, to solve this problem? Maybe we need support from Infra?


Daniel Templeton
added a comment - 18/Sep/15 13:23 Can we have a constant like MS_PER_SEC to make its role clearer?
For you, Colin P. McCabe, anything.
Why are we declaring that the function throws InterruptedException if we just catch it and do nothing?
Good catch. (No pun intended.)
In ReportCompiler#run, we should probably at least call Thread.currentThread.interrupt rather than swallowing the InterruptedException completely.
OK.
Question about the rest of your comments: Except for the last one, the rest of your comments apply to code that I touched only by indenting it. I didn't modify any of that code that I didn't have to, in the name of simplifying review by diff. Sounds like the going rule is, "you touch it, you own it," even if it's just a whitespace change. Am I reading that correctly?

Daniel Templeton
added a comment - 18/Sep/15 13:36 Oops, forgot to address the ThrottleTask comment:
ThrottleTask / UnthrottleTask: is there any way we could reuse these objects and avoid generating all that garbage?
Sure.


Colin P. McCabe
added a comment - 18/Sep/15 18:26 Question about the rest of your comments: Except for the last one, the rest of your comments apply to code that I touched only by indenting it. I didn't modify any of that code that I didn't have to, in the name of simplifying review by diff. Sounds like the going rule is, "you touch it, you own it," even if it's just a whitespace change. Am I reading that correctly?
That's a fair point. If you want to leave the d and m variables as-is to simplify the diff and backporting, that's fine. We should fix them at some point, but it doesn't have to be right now.


Nathan Roberts
added a comment - 18/Sep/15 21:38 Thanks Daniel Templeton for the patch. A few comments.
hdfs-default.xml
I personally would prefer the default to be 1000. In my mind 0 is a special out-of-range condition that we're allowing to mean "full rate". Just reading the default of 0 and then the first sentence of the description could easily lead one to believe the report threads are effectively off by default.
Test
Any way to avoid the sleep(5000) in the test? Our tests already take a really long time, so anytime we can avoid sleeping it's better. Maybe wait at most 5 seconds for timeWaitingMs.get() to become > 0
directoryScannerThrottle
Shouldn't stop call resume() instead of just notifyAll()? Will cycle() get out if we try to shut down while in that wait?
Did we hit this problem with too big of a hammer? Couldn't cycle() be implemented with a simple sleep? For example, with a 75% duty cycle:
    n = Time.monotonicNow() % 1000;
    if (n > 1000 * 0.75) sleep(1000 - n);
Seems like it could be as simple as a config and a couple of lines of code. Maybe I'm missing something or there are grander plans for the throttle.


Daniel Templeton
added a comment - 18/Sep/15 22:03 I personally would prefer the default to be 1000.
I went with 0 because that's the general convention for "off", but if 1000 makes more sense to people it's fine with me.
Any way to avoid the sleep(5000) in the test?
I hear ya. I'll see if I can reformulate the test a bit to get the sleep time down to the bare minimum.
Shouldn't stop call resume() instead of just notifyAll()?
Sure. It won't matter since stop() is only called by shutdown(), which first sets shouldRunCompile to false. But for correctness, you're right.
Did we hit this problem with too big of a hammer?
The majority of the patch is refactoring the report compilers so that they can be throttled at all. The additional code to do the throttling isn't much. It's more formal than just a sleep, but it's also more testable and extensible.


Nathan Roberts
added a comment - 19/Sep/15 00:26 Sure. It won't matter since stop() is only called by shutdown(), which first sets shouldRunCompile to false. But for correctness, you're right.
I was looking at v3 of the patch; I see you already fixed this in v4. Sorry for the noise.
The majority of the patch is refactoring the report compilers so that they can be throttled at all. The additional code to do the throttling isn't much. It's more formal than just a sleep, but it's also more testable and extensible.
ok. I'll have to look at that a little deeper. I thought we were basically hitting FileUtil.listFiles(dir) really quickly in the original code, so it felt like a very simple thing to just do a variable sleep right there based on the configured duty cycle.
I need to look more into how the scan-job queue is working. It seems like all the worker threads could be working in the same volume, which doesn't seem like what we want. (It seems like we want all volumes spending the duty cycle scanning, but I didn't catch how that was the case.)


Daniel Templeton
added a comment - 21/Sep/15 18:24 The scanjob queue is indeed ignoring volume when selecting the next job. I was considering the case where there are volumes of greatly differing sizes, in which case not binding a thread to a volume will result in a better distribution of the load. That's also true when the number of threads exceeds the number of volumes.
That said, the point of the JIRA was not to change the load profile of the directory scanner; it was just to insert a throttle. I'll post a changeset with a reduced scope shortly.


Colin P. McCabe
added a comment - 21/Sep/15 19:53 Nathan Roberts, I agree that it might be better to keep the old behavior of finishing one volume in a thread before moving on to the next. It might increase our cache hit rate. I can think of reasons to do the opposite (i.e. spread the load across disks), that might motivate us to add that mode as an option, but it seems better to focus on just throttling in this change.

This logic seems flawed. If we sleep for a whole second, we'll be back in the case where nowMs % 1000 is what it was before, and make no progress. We should gracefully handle the case where the sleep is longer than a second by not re-throttling.

Colin P. McCabe
added a comment - 22/Sep/15 18:57 Thanks, Daniel Templeton.
if ((throttle > 1000) || (throttle <= 0)) {
  if (throttle > 1000) {
    LOG.error(
        DFSConfigKeys.DFS_DATANODE_DIRECTORYSCAN_THROTTLE_LIMIT_MS_PER_SEC_KEY
        + " set to value above 1000 ms/sec. Assuming default value of " +
        DFSConfigKeys.DFS_DATANODE_DIRECTORYSCAN_THROTTLE_LIMIT_MS_PER_SEC_DEFAULT);
Can we have a constant here for MS_PER_SEC? I think I commented on this earlier
if (throttleLimitMsPerSec < 1000) {
  logMsg = String.format("Periodic Directory Tree Verification scan"
      + " starting at %dms with interval %dms and run limit %dms/s",
      firstScanTime, scanPeriodMsecs, throttleLimitMsPerSec);
Maybe say "throttle" instead of "run limit"?
// Variable for tracking time spent running and waiting for testing
// purposes
private Long markMs;
Does this need to be an object, or can it be a primitive? I don't see any case where we need it to be null.
while (nowMs % 1000L > throttleLimitMsPerSec) {
  try {
    Thread.sleep(1000L - (nowMs % 1000L));
  } catch (InterruptedException ex) {
    // Try sleeping again and mark the thread as interrupted
    Thread.currentThread().interrupt();
  }
This logic seems flawed. If we sleep for a whole second, we'll be back in the case where nowMs % 1000 is what it was before, and make no progress. We should gracefully handle the case where the sleep is longer than a second by not re-throttling.

Can we have a constant here for MS_PER_SEC? I think I commented on this earlier

I didn't do that in this patch because I didn't think the 1000 was as prominent as before, but it appears that was before I was done adding stuff. I'll put it back. I'd love to put that constant somewhere like util.Time. Would that be kosher? Or better to keep it low profile and leave it local to DirectoryScanner? I notice there's already HdfsClientConfigKeys.SECOND, but that would introduce a pointless dependency. Maybe the best answer is to keep it local and file a JIRA to consolidate them under util.Time?

Maybe say "throttle" instead of "run limit"?

I was shooting for something that would be meaningful to someone who doesn't know the code. What about "throttle limit," since that echoes the config param?

Does this need to be an object, or can it be a primitive?

Ha. Evolutionary mistake. I'll fix it.

This logic seems flawed.

I don't follow. (nowMs % 1000) has to be between 0 and 999. If it's less than the throttle limit, we won't enter the loop. The throttle limit must be between 1 and 1000. (Anything else gets set to 1000 when the scanner is created.) The sleep must therefore be for between 1 and 999 ms, pretty much guaranteeing a different result the next time around.

Daniel Templeton
added a comment - 22/Sep/15 20:58 Can we have a constant here for MS_PER_SEC? I think I commented on this earlier
I didn't do that in this patch because I didn't think the 1000 was as prominent as before, but it appears that was before I was done adding stuff. I'll put it back. I'd love to put that constant somewhere like util.Time. Would that be kosher? Or better to keep it low profile and leave it local to DirectoryScanner? I notice there's already HdfsClientConfigKeys.SECOND, but that would introduce a pointless dependency. Maybe the best answer is to keep it local and file a JIRA to consolidate them under util.Time?
Maybe say "throttle" instead of "run limit"?
I was shooting for something that would be meaningful to someone who doesn't know the code. What about "throttle limit," since that echoes the config param?
Does this need to be an object, or can it be a primitive?
Ha. Evolutionary mistake. I'll fix it.
This logic seems flawed.
I don't follow. (nowMs % 1000) has to be between 0 and 999. If it's less than the throttle limit, we won't enter the loop. The throttle limit must be between 1 and 1000. (Anything else gets set to 1000 when the scanner is created.) The sleep must therefore be for between 1 and 999 ms, pretty much guaranteeing a different result the next time around.

I didn't do that in this patch because I didn't think the 1000 was as prominent as before, but it appears that was before I was done adding stuff. I'll put it back. I'd love to put that constant somewhere like util.Time. Would that be kosher?

I think it's fine to put it in the DirectoryScanner itself if you want. I don't object to putting it in util.Time either. Up to you.

I was shooting for something that would be meaningful to someone who doesn't know the code. What about "throttle limit," since that echoes the config param?

Sure.

I don't follow. (nowMs % 1000) has to be between 0 and 999. If it's less than the throttle limit, we won't enter the loop. The throttle limit must be between 1 and 1000. (Anything else gets set to 1000 when the scanner is created.) The sleep must therefore be for between 1 and 999 ms, pretty much guaranteeing a different result the next time around.

Colin P. McCabe
added a comment - 22/Sep/15 21:47 I didn't do that in this patch because I didn't think the 1000 was as prominent as before, but it appears that was before I was done adding stuff. I'll put it back. I'd love to put that constant somewhere like util.Time. Would that be kosher?
I think it's fine to put it in the DirectoryScanner itself if you want. I don't object to putting it in util.Time either. Up to you.
I was shooting for something that would be meaningful to someone who doesn't know the code. What about "throttle limit," since that echoes the config param?
Sure.
I don't follow. (nowMs % 1000) has to be between 0 and 999. If it's less than the throttle limit, we won't enter the loop. The throttle limit must be between 1 and 1000. (Anything else gets set to 1000 when the scanner is created.) The sleep must therefore be for between 1 and 999 ms, pretty much guaranteeing a different result the next time around.
Let's say we start the loop at time 5200. Then while (nowMs % 1000L > throttleLimitMsPerSec) returns true (let's say throttleLimitMsPerSec = 100).
We call sleep with an argument of 800, but sleep actually sleeps for 1000 ms instead. (Remember, Thread#sleep may always sleep for longer than requested.) nowMs becomes 6200. Now while (nowMs % 1000L > throttleLimitMsPerSec) returns true again, since 6200 % 1000 = 200 > 100. Now we sleep for 800 ms yet again. We completely missed our timeslice, and there's no guarantee that we'll pick up the next one either. That's the bug.
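The oversleep scenario above can be reproduced with a minimal standalone version of the modulo check. This is an illustrative reduction using the numbers from the comment, not the actual patch code:

```java
// Minimal reproduction of the modulo-throttle oversleep pathology:
// an oversleep of a full second lands us back at the same phase,
// so the throttle check fires again and we never run.
public class ModuloOversleep {
    // The check from the patch: throttle whenever we are past the
    // allowed run window within the current wall-clock second.
    static boolean shouldThrottle(long nowMs, long limitMsPerSec) {
        return nowMs % 1000L > limitMsPerSec;
    }

    public static void main(String[] args) {
        long limit = 100;
        long now = 5200;
        System.out.println(shouldThrottle(now, limit)); // true: throttled
        System.out.println(1000 - (now % 1000));        // 800: the sleep we request
        now += 1000; // but Thread.sleep oversleeps a full second
        // Same phase again: 6200 % 1000 = 200 > 100, so we throttle again.
        System.out.println(shouldThrottle(now, limit)); // true
    }
}
```

Within a single second the check behaves as Daniel describes; the pathology only appears once an oversleep crosses a second boundary and preserves the phase.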

Based on the results of the previous patch, this one will get flagged with 3 whitespace violations in lines I didn't touch and a fistful of checkstyle issues because of the length of the config key names.

Daniel Templeton
added a comment - 23/Sep/15 15:56 New patch to address Colin P. McCabe's comments.
Based on the results of the previous patch, this one will get flagged with 3 whitespace violations in lines I didn't touch and a fistful of checkstyle issues because of the length of the config key names.

Forgot to mention that this patch tries to be fair about oversleeps when throttling, but it doesn't do anything to credit full seconds lost entirely to oversleeps. For example, if the throttle is set to 100ms, the following could happen:

In the second that the thread wakes up from the oversleep, we try to give it its full run time, but we don't credit it for full seconds lost. Notice also that threads only offer to block once per cycle, so setting a very low throttle limit will just make the threads run one cycle between sleeps.

When the thread wakes up from the oversleep, we give it more than a second's worth of run time because it's so close to the end of the second. (Any value over 999 just means not to throttle the thread that second.) The next second starts the count over, and at .2089, even though it's been running for over 100 consecutive ms, we don't block it, because it hasn't run for 100ms in this second yet. When it checks in the next time, at .2101, we see it's over the 100ms limit and block it for the rest of the second.

Daniel Templeton
added a comment - 23/Sep/15 16:10 Forgot to mention that this patch tries to be fair about oversleeps when throttling, but it doesn't do anything to credit full seconds lost entirely to oversleeps. For example, if the throttle is set to 100ms, the following could happen:
.0000 - thread calls throttle(), no block
.0422 - thread calls throttle(), sleep for 588ms
.5016 - thread wakes up from oversleep, run limit this second set to 116ms
.5219 - thread calls throttle(), sleep for 781ms
In the second that the thread wakes up from the oversleep, we try to give it its full run time, but we don't credit it for full seconds lost. Notice also that threads only offer to block once per cycle, so setting a very low throttle limit will just make the threads run one cycle between sleeps.
The following could also happen:
.0000 - thread calls throttle(), no block
.0422 - thread calls throttle(), sleep for 588ms
.1970 - thread wakes up from oversleep, run limit this second set to 1070ms
.2089 - thread calls throttle(), no block
.2101 - thread calls throttle(), sleep for 899ms
When the thread wakes up from the oversleep, we give it more than a second's worth of run time because it's so close to the end of the second. (Any value over 999 just means not to throttle the thread that second.) The next second starts the count over, and at .2089, even though it's been running for over 100 consecutive ms, we don't block it, because it hasn't run for 100ms in this second yet. When it checks in the next time, at .2101, we see it's over the 100ms limit and block it for the rest of the second.

I guess the short summary is that I'm not happy with the "modulo" solution to the timekeeping problem since I think its inaccuracy (in the direction of not running often enough, or at all) will be unacceptable at lower throttle rates.

We have a StopWatch class which should provide a pretty easy solution to this problem that doesn't suffer from these issues. Just start a StopWatch when you enable scanning. Check the elapsed duration on the StopWatch before scanning each block-- if it is above dfs.datanode.directoryscan.throttle.limit.ms.per.sec, then disable scanning and go into the other loop, which calls Thread.sleep until 1000 - dfs.datanode.directoryscan.throttle.limit.ms.per.sec ms have elapsed.

This solution is robust against slightly shorter or longer thread pauses. Like the modulo solution, it also may run at a slightly lower rate than expected in the case where we oversleep. Unlike the modulo solution, it does not have a pathological case where we never get to run at all despite a non-zero throttle rate.

I'm confused about this code... why are we comparing the directory scan interval (in seconds) with the throttle limit in milliseconds? It seems like this was intended to test for the case where the throttle is disabled, but some wires got crossed?

Colin P. McCabe
added a comment - 23/Sep/15 19:47 - edited I guess the short summary is that I'm not happy with the "modulo" solution to the timekeeping problem since I think its inaccuracy (in the direction of not running often enough, or at all) will be unacceptable at lower throttle rates.
We have a StopWatch class which should provide a pretty easy solution to this problem that doesn't suffer from these issues. Just start a StopWatch when you enable scanning. Check the elapsed duration on the StopWatch before scanning each block-- if it is above dfs.datanode.directoryscan.throttle.limit.ms.per.sec, then disable scanning and go into the other loop, which calls Thread.sleep until 1000 - dfs.datanode.directoryscan.throttle.limit.ms.per.sec ms have elapsed.
This solution is robust against slightly shorter or longer thread pauses. Like the modulo solution, it also may run at a slightly lower rate than expected in the case where we oversleep. Unlike the modulo solution, it does not have a pathological case where we never get to run at all despite a non-zero throttle rate.
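The StopWatch design described here can be sketched as follows. This is a hedged illustration, not the patch itself: the real change uses Hadoop's StopWatch class, while this sketch uses a plain elapsed-time counter with an injected clock so the logic is deterministic and testable. All names besides the config key are invented.

```java
import java.util.function.LongSupplier;

// Sketch of the elapsed-time throttle: run until the per-second budget
// (dfs.datanode.directoryscan.throttle.limit.ms.per.sec) is spent, then
// sleep out the remainder of the second and reset the budget.
public class ElapsedThrottle {
    private final long limitMsPerSec; // run budget per second, in ms
    private final LongSupplier clockMs; // injected clock, for testing
    private long runStartMs; // when the current run budget started

    ElapsedThrottle(long limitMsPerSec, LongSupplier clockMs) {
        this.limitMsPerSec = limitMsPerSec;
        this.clockMs = clockMs;
        this.runStartMs = clockMs.getAsLong();
    }

    /** Returns how long the caller should sleep before scanning the next block. */
    long checkBeforeNextBlock() {
        long elapsed = clockMs.getAsLong() - runStartMs;
        if (elapsed < limitMsPerSec) {
            return 0; // still within this cycle's run budget
        }
        // Budget spent: sleep out the rest of the second, then start fresh.
        long sleepMs = 1000 - limitMsPerSec;
        runStartMs = clockMs.getAsLong() + sleepMs;
        return sleepMs;
    }

    public static void main(String[] args) {
        long[] fakeNow = {0};
        ElapsedThrottle t = new ElapsedThrottle(100, () -> fakeNow[0]);
        fakeNow[0] = 50;
        System.out.println(t.checkBeforeNextBlock()); // 0: only 50ms used
        fakeNow[0] = 120;
        System.out.println(t.checkBeforeNextBlock()); // 900: budget exhausted
    }
}
```

Because the budget is measured from a restart point rather than from second boundaries, an oversleep shifts the next cycle instead of re-triggering the throttle at the same phase, which is the robustness property argued for above.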
public static final int DFS_DATANODE_DIRECTORYSCAN_INTERVAL_DEFAULT = 21600;
...
int defaultLimit =
    DFSConfigKeys.DFS_DATANODE_DIRECTORYSCAN_INTERVAL_DEFAULT;
String logMsg;

if (throttleLimitMsPerSec < defaultLimit) {
  logMsg = String.format(START_MESSAGE_WITH_THROTTLE, firstScanTime,
      scanPeriodMsecs, throttleLimitMsPerSec);
} else {
  logMsg = String.format(START_MESSAGE, firstScanTime, scanPeriodMsecs);
}
I'm confused about this code... why are we comparing the directory scan interval (in seconds) with the throttle limit in milliseconds? It seems like this was intended to test for the case where the throttle is disabled, but some wires got crossed?

The throttle() method is guaranteed to exit when at least (1000 - limit) ms have passed and the calling thread is scheduled again. What am I missing?

As far as I can tell, the difference between the StopWatch approach and the modulo approach is the case when a thread wakes up within (1000 - limit) of the end of the second. In that case, the modulo approach will allow the thread to run longer than the StopWatch approach would. In other words, the modulo approach is focused on trying to hold to the per-second duty cycle, and the StopWatch approach is trying to hold to the duty cycle regardless of second boundaries. Given that neither the sleep time nor run time is reliable, I don't see where it matters much one way or the other.

That said, I did just notice that this patch is broken in the case of spurious wake-ups. I'll have a new patch for that (including tests) shortly.

Daniel Templeton
added a comment - 23/Sep/15 20:47 Unlike the modulo solution, it does not have a pathological case where we never get to run at all despite a non-zero throttle rate.
I don't see the pathological case. Assume the throttle is 1ms and we always oversleep:
.0000 - thread calls throttle(), no block
– thread runs –
.0422 - thread calls throttle(), sleep for 588ms
.1999 - thread wakes up from oversleep, run limit this second set to 1000ms
– thread runs (999 < 1000) –
.2190 - thread calls throttle(), sleep for 810ms
.3106 - thread wakes up from oversleep, run limit this second set to 107ms
– thread runs (106 < 107) –
etc.
The throttle() method is guaranteed to exit when at least (1000 - limit) ms have passed and the calling thread is scheduled again. What am I missing?
As far as I can tell, the difference between the StopWatch approach and the modulo approach is the case when a thread wakes up within (1000 - limit) of the end of the second. In that case, the modulo approach will allow the thread to run longer than the StopWatch approach would. In other words, the modulo approach is focused on trying to hold to the per-second duty cycle, and the StopWatch approach is trying to hold to the duty cycle regardless of second boundaries. Given that neither the sleep time nor run time is reliable, I don't see where it matters much one way or the other.
That said, I did just notice that this patch is broken in the case of spurious wake-ups. I'll have a new patch for that (including tests) shortly.
I'm confused about this code...
Good catch. Copy-paste error, grabbed the wrong key.

OK, here's a version reworked to use the StopWatch class. After a little research, I think we're safe to trust that Thread.sleep() won't have spurious wake-ups, which makes the StopWatch code simpler than the modulo version. If we want to care about spurious wake-ups, then the modulo code is simpler.

This patch is still failing one of the new tests I just added. I will have to fix that, but I wanted to get this patch posted for review by Nathan Roberts this morning. I'll post a fixed patch when I get a chance.

Daniel Templeton
added a comment - 24/Sep/15 15:36 OK, here's a version reworked to use the StopWatch class. After a little research, I think we're safe to trust that Thread.sleep() won't have spurious wake-ups, which makes the StopWatch code simpler than the modulo version. If we want to care about spurious wake-ups, then the modulo code is simpler.
This patch is still failing one of the new tests I just added. I will have to fix that, but I wanted to get this patch posted for review by Nathan Roberts this morning. I'll post a fixed patch when I get a chance.

Thanks Daniel Templeton. I like that the stopwatch class makes this much cleaner. Just a couple of comments:

Shouldn't the isInterrupted() check throw an InterruptedException? Otherwise won't we just break out of one level? It would probably be good to test shutdown on an actual cluster if possible because you're exactly right that we could be in here a long time and it would be good to make sure we don't affect shutdown of the datanode. This has been a problem in the past and can have a serious impact on rolling upgrades.

nit but I find markRunning() and markWaiting() confusing (seem backwards to me because we call markRunning() just before going to sleep).

I'm kind of wondering if we should disallow extremely low duty cycles. Seems like it could take close to 24 hours with a minimum setting. A minimum of 20% should keep us within an hour.

Nathan Roberts
added a comment - 24/Sep/15 16:32 Thanks Daniel Templeton. I like that the stopwatch class makes this much cleaner. Just a couple of comments:
Shouldn't the isInterrupted() check throw an InterruptedException? Otherwise won't we just break out of one level? It would probably be good to test shutdown on an actual cluster if possible because you're exactly right that we could be in here a long time and it would be good to make sure we don't affect shutdown of the datanode. This has been a problem in the past and can have a serious impact on rolling upgrades.
nit but I find markRunning() and markWaiting() confusing (seem backwards to me because we call markRunning() just before going to sleep).
I'm kind of wondering if we should disallow extremely low duty cycles. Seems like it could take close to 24 hours with a minimum setting. A minimum of 20% should keep us within an hour.

I think it will be confusing either way. Maybe rename them logTimeRunning() and logTimeWaiting()?

I'm kind of wondering if we should disallow extremely low duty cycles.

I'm kinda caveat emptor on that one. In no case can they shut down the scanning completely, but if they want to make it take forever, that's their business. I also say that because at this point we don't know what reasonable lower bounds are. Maybe after the patch goes in, you can play with it on your system and tell us what impact various throttle levels have? We could then follow up with a JIRA to update the docs and maybe bounds checking accordingly.

Daniel Templeton
added a comment - 24/Sep/15 16:42 Shouldn't the isInterrupted() check throw an InterruptedException?
In patch 8 it does.
nit but I find markRunning() and markWaiting() confusing
I think it will be confusing either way. Maybe rename them logTimeRunning() and logTimeWaiting()?
I'm kind of wondering if we should disallow extremely low duty cycles.
I'm kinda caveat emptor on that one. In no case can they shut down the scanning completely, but if they want to make it take forever, that's their business. I also say that because at this point we don't know what reasonable lower bounds are. Maybe after the patch goes in, you can play with it on your system and tell us what impact various throttle levels have? We could then follow up with a JIRA to update the docs and maybe bounds checking accordingly.

Haohui Mai
added a comment - 29/Sep/15 21:42 It looks like this patch introduces a new findbugs warning in trunk. Please see
https://builds.apache.org/job/PreCommit-HDFS-Build/12742/artifact/patchprocess/trunkFindbugsWarningshadoop-hdfs.html

I've noticed the new TestDirectoryScanner#testThrottling fails consistently on OS X... I filed HDFS-11707 to track.

Also attaching a patch for a 2.7 backport; it does not contain the findbugs issue fixed in HDFS-9174. The backport was mostly clean with some minor non-structural conflicts around DirectoryScanner#getDiskReport caused by HDFS-7758 "Retire FsDatasetSpi#getVolumes() and use FsDatasetSpi#getVolumeRefs()" which moved the order of a few statements within getDiskReport. That change doesn't have an actual effect on the behavior of this patch.

Erik Krogen
added a comment - 26/Apr/17 16:57 - edited I've noticed the new TestDirectoryScanner#testThrottling fails consistently on OS X... I filed HDFS-11707 to track.
Also attaching a patch for a 2.7 backport; it does not contain the findbugs issue fixed in HDFS-9174. The backport was mostly clean with some minor non-structural conflicts around DirectoryScanner#getDiskReport caused by HDFS-7758 "Retire FsDatasetSpi#getVolumes() and use FsDatasetSpi#getVolumeRefs()" which moved the order of a few statements within getDiskReport. That change doesn't have an actual effect on the behavior of this patch.