The RM webapp should allow users to authenticate using delegation tokens to maintain parity with RPC.

YARN-2241.
Minor bug reported by Robert Kanter and fixed by Robert Kanter (resourcemanager)ZKRMStateStore: On startup, show nicer messages if znodes already exist

When using the ZKRMStateStore, if you restart the RM, you get a bunch of stack traces with messages like {{org.apache.zookeeper.KeeperException$NodeExistsException: KeeperErrorCode = NodeExists for /rmstore}}. This is expected, as these nodes already exist from before. We should catch these and print nicer messages.
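A minimal sketch of the kind of handling this asks for, downgrading NodeExistsException to an informational log when creating a state-store znode (method and logger names are illustrative, not the actual ZKRMStateStore code):
{code}
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public class ZnodeCreationSketch {
  private static final Logger LOG = LoggerFactory.getLogger(ZnodeCreationSketch.class);

  // Create the znode if missing; treat "already exists" as a normal startup condition.
  static void createRootDirIfMissing(ZooKeeper zk, String path) throws Exception {
    try {
      zk.create(path, new byte[0], ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
    } catch (KeeperException.NodeExistsException e) {
      // Expected on RM restart: the state-store znode was created by a previous run.
      LOG.info(path + " znode already exists, reusing it");
    }
  }
}
{code}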

If the default setting DEFAULT_NM_VMEM_CHECK_ENABLED is set to false, the test will fail. Make the test not rely on the default settings; instead, have it verify that once the setting is turned on, the memory check is actually performed. See YARN-2225, which suggests we turn the default off.
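A minimal sketch of making the test independent of the shipped default, assuming the test builds its own YarnConfiguration (the test class name is illustrative):
{code}
import org.apache.hadoop.yarn.conf.YarnConfiguration;
import org.junit.Test;

public class VmemCheckTestSketch {
  @Test
  public void testVmemCheckEnforced() {
    YarnConfiguration conf = new YarnConfiguration();
    // Turn the check on explicitly instead of relying on DEFAULT_NM_VMEM_CHECK_ENABLED,
    // so the test still passes if the shipped default is flipped to false (YARN-2225).
    conf.setBoolean(YarnConfiguration.NM_VMEM_CHECK_ENABLED, true);
    // ... start the ContainersMonitor with this conf and assert that the
    // over-limit container is killed (omitted here).
  }
}
{code}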

From https://builds.apache.org/job/Hadoop-Yarn-trunk/595/ :
{code}
testRMWritingMassiveHistory(org.apache.hadoop.yarn.server.resourcemanager.ahs.TestRMApplicationHistoryWriter) Time elapsed: 33.469 sec <<< FAILURE!
java.lang.AssertionError: expected:<10000> but was:<7156>
at org.junit.Assert.fail(Assert.java:88)
at org.junit.Assert.failNotEquals(Assert.java:743)
at org.junit.Assert.assertEquals(Assert.java:118)
at org.junit.Assert.assertEquals(Assert.java:555)
at org.junit.Assert.assertEquals(Assert.java:542)
at org.apache.hadoop.yarn.server.resourcemanager.ahs.TestRMApplicationHistoryWriter.testRMWritingMassiveHistory(TestRMApplicationHistoryWriter.java:430)
at org.apache.hadoop.yarn.server.resourcemanager.ahs.TestRMApplicationHistoryWriter.testRMWritingMassiveHistory(TestRMApplicationHistoryWriter.java:391)
{code}

YARN-2204.
Trivial bug reported by Robert Kanter and fixed by Robert Kanter (resourcemanager)TestAMRestart#testAMRestartWithExistingContainers assumes CapacityScheduler

YARN-2201.
Major bug reported by Ray Chiang and fixed by Varun Vasudev TestRMWebServicesAppsModification dependent on yarn-default.xml

TestRMWebServicesAppsModification.java has some errors that are yarn-default.xml dependent. By changing yarn-default.xml properties, I'm seeing the following errors:
1) Changing yarn.resourcemanager.scheduler.class from capacity.CapacityScheduler to fair.FairScheduler gives the error:
Running org.apache.hadoop.yarn.server.resourcemanager.webapp.TestRMWebServicesAppsModification
Tests run: 10, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 79.047 sec <<< FAILURE! - in org.apache.hadoop.yarn.server.resourcemanager.webapp.TestRMWebServicesAppsModification
testSingleAppKillUnauthorized[1](org.apache.hadoop.yarn.server.resourcemanager.webapp.TestRMWebServicesAppsModification) Time elapsed: 3.22 sec <<< FAILURE!
java.lang.AssertionError: expected:<Forbidden> but was:<Accepted>
at org.junit.Assert.fail(Assert.java:88)
at org.junit.Assert.failNotEquals(Assert.java:743)
at org.junit.Assert.assertEquals(Assert.java:118)
at org.junit.Assert.assertEquals(Assert.java:144)
at org.apache.hadoop.yarn.server.resourcemanager.webapp.TestRMWebServicesAppsModification.testSingleAppKillUnauthorized(TestRMWebServicesAppsModification.java:458)
2) Changing yarn.acl.enable from false to true results in the following errors:
Running org.apache.hadoop.yarn.server.resourcemanager.webapp.TestRMWebServicesAppsModification
Tests run: 10, Failures: 4, Errors: 0, Skipped: 0, Time elapsed: 49.044 sec <<< FAILURE! - in org.apache.hadoop.yarn.server.resourcemanager.webapp.TestRMWebServicesAppsModification
testSingleAppKill[0](org.apache.hadoop.yarn.server.resourcemanager.webapp.TestRMWebServicesAppsModification) Time elapsed: 2.986 sec <<< FAILURE!
java.lang.AssertionError: expected:<Accepted> but was:<Unauthorized>
at org.junit.Assert.fail(Assert.java:88)
at org.junit.Assert.failNotEquals(Assert.java:743)
at org.junit.Assert.assertEquals(Assert.java:118)
at org.junit.Assert.assertEquals(Assert.java:144)
at org.apache.hadoop.yarn.server.resourcemanager.webapp.TestRMWebServicesAppsModification.testSingleAppKill(TestRMWebServicesAppsModification.java:287)
testSingleAppKillInvalidState[0](org.apache.hadoop.yarn.server.resourcemanager.webapp.TestRMWebServicesAppsModification) Time elapsed: 2.258 sec <<< FAILURE!
java.lang.AssertionError: expected:<Bad Request> but was:<Unauthorized>
at org.junit.Assert.fail(Assert.java:88)
at org.junit.Assert.failNotEquals(Assert.java:743)
at org.junit.Assert.assertEquals(Assert.java:118)
at org.junit.Assert.assertEquals(Assert.java:144)
at org.apache.hadoop.yarn.server.resourcemanager.webapp.TestRMWebServicesAppsModification.testSingleAppKillInvalidState(TestRMWebServicesAppsModification.java:369)
testSingleAppKillUnauthorized[0](org.apache.hadoop.yarn.server.resourcemanager.webapp.TestRMWebServicesAppsModification) Time elapsed: 2.263 sec <<< FAILURE!
java.lang.AssertionError: expected:<Forbidden> but was:<Unauthorized>
at org.junit.Assert.fail(Assert.java:88)
at org.junit.Assert.failNotEquals(Assert.java:743)
at org.junit.Assert.assertEquals(Assert.java:118)
at org.junit.Assert.assertEquals(Assert.java:144)
at org.apache.hadoop.yarn.server.resourcemanager.webapp.TestRMWebServicesAppsModification.testSingleAppKillUnauthorized(TestRMWebServicesAppsModification.java:458)
testSingleAppKillInvalidId[0](org.apache.hadoop.yarn.server.resourcemanager.webapp.TestRMWebServicesAppsModification) Time elapsed: 0.214 sec <<< FAILURE!
java.lang.AssertionError: expected:<Not Found> but was:<Unauthorized>
at org.junit.Assert.fail(Assert.java:88)
at org.junit.Assert.failNotEquals(Assert.java:743)
at org.junit.Assert.assertEquals(Assert.java:118)
at org.junit.Assert.assertEquals(Assert.java:144)
at org.apache.hadoop.yarn.server.resourcemanager.webapp.TestRMWebServicesAppsModification.testSingleAppKillInvalidId(TestRMWebServicesAppsModification.java:482)
I'm opening this JIRA as a discussion for the best way to fix this. I've got a few ideas, but I would like to get some feedback about potentially more robust ways to fix this test.

YARN-2195.
Trivial improvement reported by Wei Yan and fixed by Wei Yan Clean a piece of code in ResourceRequest

YARN-2192.
Major bug reported by Anubhav Dhoot and fixed by Anubhav Dhoot TestRMHA fails when run with a mix of Schedulers

If the test is run with the FairScheduler, some of the tests fail because the metrics system objects are shared across tests and not destroyed completely.
{code}
Error Message
Metrics source QueueMetrics,q0=root already exists!
Stacktrace
org.apache.hadoop.metrics2.MetricsException: Metrics source QueueMetrics,q0=root already exists!
at org.apache.hadoop.metrics2.lib.DefaultMetricsSystem.newSourceName(DefaultMetricsSystem.java:126)
at org.apache.hadoop.metrics2.lib.DefaultMetricsSystem.sourceName(DefaultMetricsSystem.java:107)
at org.apache.hadoop.metrics2.impl.MetricsSystemImpl.register(MetricsSystemImpl.java:217)
at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSQueueMetrics.forQueue(FSQueueMetrics.java:96)
at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.reinitialize(FairScheduler.java:1281)
at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMActiveServices.serviceInit(ResourceManager.java:427)
{code}

YARN-2191.
Major bug reported by Wangda Tan and fixed by Wangda Tan (resourcemanager)Add a test to make sure NM will do application cleanup even if RM restarting happens before application completed

YARN-2187.
Major bug reported by Robert Kanter and fixed by Robert Kanter (fairscheduler)FairScheduler: Disable max-AM-share check by default

Say you have a small cluster with 8GB of memory and 5 queues. Each queue can then have 8GB / 5 = 1.6GB, but an AM requires 2GB to start, so no AMs can be started. By default, the max-AM-share check should be disabled so users don't see a regression. On medium-sized clusters, it still makes sense to set the max-AM-share to a value between 0 and 1.

YARN-2171.
Critical bug reported by Jason Lowe and fixed by Jason Lowe (capacityscheduler)AMs block on the CapacityScheduler lock during allocate()

When AMs heartbeat into the RM via the allocate() call they are blocking on the CapacityScheduler lock when trying to get the number of nodes in the cluster via getNumClusterNodes.
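A hedged sketch of one way to avoid that contention: track the node count in a field readable without the scheduler lock (an illustration of the idea, not the actual CapacityScheduler change):
{code}
import java.util.concurrent.atomic.AtomicInteger;

public class NodeCountSketch {
  // Updated when nodes are added or removed, readable lock-free from allocate().
  private final AtomicInteger numNodes = new AtomicInteger();

  void nodeAdded()   { numNodes.incrementAndGet(); }
  void nodeRemoved() { numNodes.decrementAndGet(); }

  // AM heartbeats can call this without taking the CapacityScheduler lock.
  public int getNumClusterNodes() {
    return numNodes.get();
  }
}
{code}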

YARN-2167.
Major bug reported by Junping Du and fixed by Junping Du (nodemanager)LeveldbIterator should get closed in NMLeveldbStateStoreService#loadLocalizationState() within finally block

In NMLeveldbStateStoreService#loadLocalizationState(), we use a LeveldbIterator to read the NM's localization state, but it is not closed in a finally block. We should close this connection to the DB, as is common practice.
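A minimal sketch of the intended pattern, assuming the LeveldbIterator wrapper from YARN-1987 (package, constructor, and method names are taken from memory and should be treated as assumptions):
{code}
import java.io.IOException;
import org.apache.hadoop.yarn.server.utils.LeveldbIterator;
import org.iq80.leveldb.DB;

public class LoadStateSketch {
  // Close the iterator in a finally block so the underlying DB resources are
  // released even when decoding a record throws.
  static void loadLocalizationState(DB db, byte[] prefix) throws IOException {
    LeveldbIterator iter = new LeveldbIterator(db);
    try {
      iter.seek(prefix);
      while (iter.hasNext()) {
        // ... decode one localized-resource record ...
        iter.next();
      }
    } finally {
      iter.close();
    }
  }
}
{code}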

YARN-2163.
Minor bug reported by Wangda Tan and fixed by Wangda Tan (resourcemanager , webapp)WebUI: Order of AppId in apps table should be consistent with ApplicationId.compareTo().

Currently, AppId is treated as numeric, so the applications table is sorted by the int-typed id only (not including the cluster timestamp); see the attached screenshot. The order of AppId in the web page should be consistent with ApplicationId.compareTo().
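A hedged sketch of what consistent ordering means here: sort by the full ApplicationId (cluster timestamp first, then the numeric id), as ApplicationId.compareTo() already does, rather than by the integer suffix alone:
{code}
import java.util.Collections;
import java.util.List;
import org.apache.hadoop.yarn.api.records.ApplicationId;

public class AppIdOrderingSketch {
  // ApplicationId is Comparable; sorting the full ids matches compareTo()
  // (cluster timestamp, then id) instead of parsing only the trailing int.
  static void sortByApplicationId(List<ApplicationId> ids) {
    Collections.sort(ids);
  }
}
{code}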

YARN-2152.
Major sub-task reported by Jian He and fixed by Jian He (resourcemanager)Recover missing container information

Container information such as container priority and container start time cannot be recovered, because the NM's container today lacks such information to send across on NM registration when RM recovery happens.

YARN-2148.
Major bug reported by Wangda Tan and fixed by Wangda Tan (client)TestNMClient failed due more exit code values added and passed to AM

Currently, TestNMClient fails in trunk; see https://builds.apache.org/job/PreCommit-YARN-Build/3959/testReport/junit/org.apache.hadoop.yarn.client.api.impl/TestNMClient/testNMClient/
{code}
java.lang.AssertionError: null
at org.junit.Assert.fail(Assert.java:86)
at org.junit.Assert.assertTrue(Assert.java:41)
at org.junit.Assert.assertTrue(Assert.java:52)
at org.apache.hadoop.yarn.client.api.impl.TestNMClient.testGetContainerStatus(TestNMClient.java:385)
at org.apache.hadoop.yarn.client.api.impl.TestNMClient.testContainerManagement(TestNMClient.java:347)
at org.apache.hadoop.yarn.client.api.impl.TestNMClient.testNMClient(TestNMClient.java:226)
{code}
Test cases in TestNMClient use the following code to verify the exit code of COMPLETED containers:
{code}
testGetContainerStatus(container, i, ContainerState.COMPLETE,
    "Container killed by the ApplicationMaster.", Arrays.asList(
        new Integer[] {137, 143, 0}));
{code}
But YARN-2091 added logic to make the exit code reflect the actual status, so the exit code of a container "killed by the ApplicationMaster" will be -105:
{code}
if (container.hasDefaultExitCode()) {
  container.exitCode = exitEvent.getExitCode();
}
{code}
We should update the test case as well.
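A minimal sketch of the kind of test update implied here, accepting -105 (ContainerExitStatus.KILLED_BY_APPMASTER) alongside the signal-based codes (a fragment in the style of the excerpt above, not the actual committed patch):
{code}
// Also accept the AM-kill exit status introduced by YARN-2091.
testGetContainerStatus(container, i, ContainerState.COMPLETE,
    "Container killed by the ApplicationMaster.", Arrays.asList(
        new Integer[] {ContainerExitStatus.KILLED_BY_APPMASTER, 137, 143, 0}));
{code}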

YARN-2132.
Major bug reported by Karthik Kambatla and fixed by Vamsee Yarlagadda (resourcemanager)ZKRMStateStore.ZKAction#runWithRetries doesn't log the exception it encounters

If we encounter any ZooKeeper issues, we don't know what is going on unless we exhaust all the retries. It would really help to log the exception sooner, so we know what is going on with the cluster.
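A hedged sketch of the retry loop with the exception logged on each attempt rather than only after retries are exhausted (structure and names are illustrative, not the exact ZKRMStateStore code):
{code}
import org.apache.zookeeper.KeeperException;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public abstract class ZkRetrySketch<T> {
  private static final Logger LOG = LoggerFactory.getLogger(ZkRetrySketch.class);

  abstract T run() throws KeeperException, InterruptedException;

  T runWithRetries(int numRetries, long retryIntervalMs)
      throws KeeperException, InterruptedException {
    for (int retry = 0; ; retry++) {
      try {
        return run();
      } catch (KeeperException ke) {
        // Log every failure so operators can see what is wrong with ZK
        // before all the retries are exhausted.
        LOG.info("Exception while executing a ZK operation, retry " + retry, ke);
        if (retry >= numRetries) {
          throw ke;
        }
        Thread.sleep(retryIntervalMs);
      }
    }
  }
}
{code}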

While playing with the scheduler with preemption enabled, I found that ProportionalCapacityPreemptionPolicy cannot work; an NPE is raised when the RM starts:
{code}
2014-06-05 11:01:33,201 ERROR org.apache.hadoop.yarn.YarnUncaughtExceptionHandler: Thread Thread[SchedulingMonitor (ProportionalCapacityPreemptionPolicy),5,main] threw an Exception.
java.lang.NullPointerException
at org.apache.hadoop.yarn.util.resource.Resources.greaterThan(Resources.java:225)
at org.apache.hadoop.yarn.server.resourcemanager.monitor.capacity.ProportionalCapacityPreemptionPolicy.computeIdealResourceDistribution(ProportionalCapacityPreemptionPolicy.java:302)
at org.apache.hadoop.yarn.server.resourcemanager.monitor.capacity.ProportionalCapacityPreemptionPolicy.recursivelyComputeIdealAssignment(ProportionalCapacityPreemptionPolicy.java:261)
at org.apache.hadoop.yarn.server.resourcemanager.monitor.capacity.ProportionalCapacityPreemptionPolicy.containerBasedPreemptOrKill(ProportionalCapacityPreemptionPolicy.java:198)
at org.apache.hadoop.yarn.server.resourcemanager.monitor.capacity.ProportionalCapacityPreemptionPolicy.editSchedule(ProportionalCapacityPreemptionPolicy.java:174)
at org.apache.hadoop.yarn.server.resourcemanager.monitor.SchedulingMonitor.invokePolicy(SchedulingMonitor.java:72)
at org.apache.hadoop.yarn.server.resourcemanager.monitor.SchedulingMonitor$PreemptionChecker.run(SchedulingMonitor.java:82)
at java.lang.Thread.run(Thread.java:744)
{code}
This is caused by ProportionalCapacityPreemptionPolicy needing the ResourceCalculator from CapacityScheduler, but ProportionalCapacityPreemptionPolicy gets initialized before CapacityScheduler is initialized, so the ResourceCalculator is still null in ProportionalCapacityPreemptionPolicy.
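A hedged sketch of one way to avoid the NPE: read the ResourceCalculator from the scheduler lazily, when the policy actually runs, instead of caching a possibly-null reference at construction time (illustrative only; assumes CapacityScheduler#getResourceCalculator() as the source):
{code}
import org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler;
import org.apache.hadoop.yarn.util.resource.ResourceCalculator;

public class PreemptionPolicySketch {
  private final CapacityScheduler scheduler;

  PreemptionPolicySketch(CapacityScheduler scheduler) {
    this.scheduler = scheduler;
  }

  // Fetch the calculator each time the policy runs, so it is never captured
  // before the scheduler has finished initializing.
  void editSchedule() {
    ResourceCalculator rc = scheduler.getResourceCalculator();
    // ... compute the ideal resource distribution using rc ...
  }
}
{code}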

YARN-2122.
Major bug reported by Karthik Kambatla and fixed by Robert Kanter (scheduler)In AllocationFileLoaderService, the reloadThread should be created in init() and started in start()

AllocationFileLoaderService has a reloadThread that is currently created and started in start(). Instead, it should be created in init() and started in start().
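A hedged sketch of the service-lifecycle split being asked for, using the standard AbstractService hooks (the thread body is elided; names are illustrative):
{code}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.service.AbstractService;

public class ReloadServiceSketch extends AbstractService {
  private Thread reloadThread;

  public ReloadServiceSketch() {
    super(ReloadServiceSketch.class.getName());
  }

  @Override
  protected void serviceInit(Configuration conf) throws Exception {
    // Create the thread during init() ...
    reloadThread = new Thread(new Runnable() {
      @Override
      public void run() {
        // ... periodically reload the allocation file ...
      }
    }, "AllocationFileReloader");
    super.serviceInit(conf);
  }

  @Override
  protected void serviceStart() throws Exception {
    // ... but only start it in start().
    reloadThread.start();
    super.serviceStart();
  }
}
{code}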

YARN-2121.
Major sub-task reported by Zhijie Shen and fixed by Zhijie Shen TimelineAuthenticator#hasDelegationToken may throw NPE

YARN-2119.
Major bug reported by Anubhav Dhoot and fixed by Anubhav Dhoot DEFAULT_PROXY_ADDRESS should use DEFAULT_PROXY_PORT

The fix for [YARN-1590|https://issues.apache.org/jira/browse/YARN-1590] introduced a method to get the web proxy bind address with an incorrect default port. Because the only user of the method ignores the port, it's not breaking anything yet. Fixing it in case someone else uses this in the future.

YARN-2118.
Major sub-task reported by Ted Yu and fixed by Ted Yu Type mismatch in contains() check of TimelineWebServices#injectOwnerInfo()

YARN-2115.
Major sub-task reported by Jian He and fixed by Jian He Replace RegisterNodeManagerRequest#ContainerStatus with a new NMContainerStatus

This JIRA covers protocol changes only: replace the ContainerStatus sent across via the NM register call with a new NMContainerStatus that includes all the information necessary for container recovery.

YARN-2112.
Major bug reported by Zhijie Shen and fixed by Zhijie Shen Hadoop-client is missing jackson libs due to inappropriate configs in pom.xml

Now YarnClient is using TimelineClient, which has a dependency on the jackson libs. However, the current dependency configurations make the hadoop-client artifact miss 2 jackson libs, such that applications which depend on hadoop-client will see the following exception
{code}
java.lang.NoClassDefFoundError: org/codehaus/jackson/jaxrs/JacksonJaxbJsonProvider
at java.lang.ClassLoader.defineClass1(Native Method)
at java.lang.ClassLoader.defineClassCond(ClassLoader.java:637)
at java.lang.ClassLoader.defineClass(ClassLoader.java:621)
at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:141)
at java.net.URLClassLoader.defineClass(URLClassLoader.java:283)
at java.net.URLClassLoader.access$000(URLClassLoader.java:58)
at java.net.URLClassLoader$1.run(URLClassLoader.java:197)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:190)
at java.lang.ClassLoader.loadClass(ClassLoader.java:306)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:301)
at java.lang.ClassLoader.loadClass(ClassLoader.java:247)
at org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl.<init>(TimelineClientImpl.java:92)
at org.apache.hadoop.yarn.client.api.TimelineClient.createTimelineClient(TimelineClient.java:44)
at org.apache.hadoop.yarn.client.api.impl.YarnClientImpl.serviceInit(YarnClientImpl.java:149)
at org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
at org.apache.hadoop.mapred.ResourceMgrDelegate.serviceInit(ResourceMgrDelegate.java:94)
at org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
at org.apache.hadoop.mapred.ResourceMgrDelegate.<init>(ResourceMgrDelegate.java:88)
at org.apache.hadoop.mapred.YARNRunner.<init>(YARNRunner.java:111)
at org.apache.hadoop.mapred.YarnClientProtocolProvider.create(YarnClientProtocolProvider.java:34)
at org.apache.hadoop.mapreduce.Cluster.initialize(Cluster.java:95)
at org.apache.hadoop.mapreduce.Cluster.<init>(Cluster.java:82)
at org.apache.hadoop.mapreduce.Cluster.<init>(Cluster.java:75)
at org.apache.hadoop.mapreduce.Job$9.run(Job.java:1255)
at org.apache.hadoop.mapreduce.Job$9.run(Job.java:1251)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:394)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1614)
at org.apache.hadoop.mapreduce.Job.connect(Job.java:1250)
at org.apache.hadoop.mapreduce.Job.submit(Job.java:1279)
at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:1303)
at org.apache.hadoop.examples.QuasiMonteCarlo.estimatePi(QuasiMonteCarlo.java:306)
at org.apache.hadoop.examples.QuasiMonteCarlo.run(QuasiMonteCarlo.java:354)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
at org.apache.hadoop.examples.QuasiMonteCarlo.main(QuasiMonteCarlo.java:363)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:72)
at org.apache.hadoop.util.ProgramDriver.run(ProgramDriver.java:145)
at org.apache.hadoop.examples.ExampleDriver.main(ExampleDriver.java:74)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at org.apache.hadoop.util.RunJar.main(RunJar.java:212)
Caused by: java.lang.ClassNotFoundException: org.codehaus.jackson.jaxrs.JacksonJaxbJsonProvider
at java.net.URLClassLoader$1.run(URLClassLoader.java:202)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:190)
at java.lang.ClassLoader.loadClass(ClassLoader.java:306)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:301)
at java.lang.ClassLoader.loadClass(ClassLoader.java:247)
... 48 more
{code}
when using YarnClient to submit an application.

YARN-2111.
Major bug reported by Sandy Ryza and fixed by Sandy Ryza (scheduler)In FairScheduler.attemptScheduling, we don't count containers as assigned if they have 0 memory but non-zero cores

{code}
if (Resources.greaterThan(RESOURCE_CALCULATOR, clusterResource,
    queueMgr.getRootQueue().assignContainer(node),
    Resources.none())) {
{code}
As RESOURCE_CALCULATOR is a DefaultResourceCalculator, we won't take cores into account here.
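A hedged sketch of a check that also counts assignments with zero memory but non-zero vcores, comparing the assigned Resource against Resources.none() with equals() instead of greaterThan() (a fragment illustrating the idea, not the exact patch):
{code}
// Count anything other than an all-zero Resource as an assignment, instead of
// asking the DefaultResourceCalculator (which only looks at memory).
Resource assigned = queueMgr.getRootQueue().assignContainer(node);
if (!Resources.equals(assigned, Resources.none())) {
  // container(s) were assigned on this node
}
{code}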

YARN-2109.
Major bug reported by Anubhav Dhoot and fixed by Karthik Kambatla (scheduler)Fix TestRM to work with both schedulers

testNMTokenSentForNormalContainer requires the CapacityScheduler and was fixed in [YARN-1846|https://issues.apache.org/jira/browse/YARN-1846] to explicitly set it to CapacityScheduler. But if the default scheduler is set to FairScheduler, then the rest of the tests that execute after this one will fail with invalid cast exceptions when getting queue metrics. This depends on test execution order, as only the tests that execute after this test will fail. This is because the queue metrics will be initialized by this test to QueueMetrics and shared by the subsequent tests.
We can explicitly clear the metrics at the end of this test to fix this; see the sketch after the stack trace below.
For example
java.lang.ClassCastException: org.apache.hadoop.yarn.server.resourcemanager.scheduler.QueueMetrics cannot be cast to org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSQueueMetrics
at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSQueueMetrics.forQueue(FSQueueMetrics.java:103)
at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.reinitialize(FairScheduler.java:1275)
at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMActiveServices.serviceInit(ResourceManager.java:418)
at org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.createAndInitActiveServices(ResourceManager.java:808)
at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.serviceInit(ResourceManager.java:230)
at org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
at org.apache.hadoop.yarn.server.resourcemanager.MockRM.<init>(MockRM.java:90)
at org.apache.hadoop.yarn.server.resourcemanager.MockRM.<init>(MockRM.java:85)
at org.apache.hadoop.yarn.server.resourcemanager.MockRM.<init>(MockRM.java:81)
at org.apache.hadoop.yarn.server.resourcemanager.TestRM.testNMToken(TestRM.java:232)
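A minimal sketch of clearing the shared metrics state in the test teardown so the next test can re-register queue metrics (a sketch of the idea, assuming the QueueMetrics.clearQueueMetrics() test helper; not the committed patch):
{code}
import org.apache.hadoop.metrics2.lib.DefaultMetricsSystem;
import org.apache.hadoop.yarn.server.resourcemanager.scheduler.QueueMetrics;
import org.junit.After;

public class MetricsTeardownSketch {
  @After
  public void tearDown() {
    // Drop the QueueMetrics registered by this test and reset the metrics
    // system so a later test using a different scheduler starts clean.
    QueueMetrics.clearQueueMetrics();
    DefaultMetricsSystem.shutdown();
  }
}
{code}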

YARN-2103.
Major bug reported by Binglin Chang and fixed by Binglin Chang Inconsistency between viaProto flag and initial value of SerializedExceptionProto.Builder

Bug 1:
{code}
SerializedExceptionProto proto = SerializedExceptionProto
    .getDefaultInstance();
SerializedExceptionProto.Builder builder = null;
boolean viaProto = false;
{code}
Since viaProto is false, we should initialize the builder rather than the proto.
Bug 2:
The class does not provide hashCode() and equals() like other PBImpl records. Since this class is used in other records, it may affect those records' behavior.
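For reference, a hedged sketch of the usual PBImpl initialization pattern being alluded to, where viaProto == false goes together with a live builder rather than a default proto (simplified; not the actual class):
{code}
// Start from a fresh builder when the record is not backed by an existing proto.
SerializedExceptionProto proto = null;
SerializedExceptionProto.Builder builder = SerializedExceptionProto.newBuilder();
boolean viaProto = false;
{code}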

YARN-2096.
Major bug reported by Anubhav Dhoot and fixed by Anubhav Dhoot Race in TestRMRestart#testQueueMetricsOnRMRestart

org.apache.hadoop.yarn.server.resourcemanager.TestRMRestart.testQueueMetricsOnRMRestart fails randomly because of a race condition.
The test validates that metrics are incremented, but does not wait for all transitions to finish before checking for the values.
It also resets metrics after kicking off recovery of the second RM. The metrics that need to be incremented race with this reset, causing the test to fail randomly.
We need to wait for the right transitions.

YARN-2091.
Major task reported by Bikas Saha and fixed by Tsuyoshi OZAWA Add more values to ContainerExitStatus and pass it from NM to RM and then to app masters

Currently, the AM cannot programmatically determine if the task was killed due to using excessive memory. The NM kills it without passing this information in the container status back to the RM. So the AM cannot take any action here. The jira tracks adding this exit status and passing it from the NM to the RM and then the AM. In general, there may be other such actions taken by YARN that are currently opaque to the AM.

YARN-2089.
Major improvement reported by Anubhav Dhoot and fixed by zhihai xu (scheduler)FairScheduler: QueuePlacementPolicy and QueuePlacementRule are missing audience annotations

We should mark QueuePlacementPolicy and QueuePlacementRule with audience annotations @Private @Unstable
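A minimal sketch of what adding the annotations looks like (annotation placement only; the class body is elided):
{code}
import org.apache.hadoop.classification.InterfaceAudience.Private;
import org.apache.hadoop.classification.InterfaceStability.Unstable;

@Private
@Unstable
public abstract class QueuePlacementRule {
  // ... existing rule logic unchanged ...
}
{code}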

YARN-2075.
Major bug reported by Zhijie Shen and fixed by Kenji Kikushima TestRMAdminCLI consistently fail on trunk and branch-2

One orthogonal concern with issues like YARN-2055 and YARN-2022 is that AM containers getting preempted shouldn't count towards AM failures and thus shouldn't eventually fail applications.
We should explicitly handle AM container preemption/kill as a separate issue and not count it towards the limit on AM failures.

YARN-2071.
Major sub-task reported by Zhijie Shen and fixed by Zhijie Shen Enforce more restricted permissions for the directory of Leveldb store

We need to enforce more restricted permissions for the directory of the Leveldb store, as we did for the filesystem generic history store.

YARN-2065.
Major bug reported by Steve Loughran and fixed by Jian He AM cannot create new containers after restart-NM token from previous attempt used

Slider AM Restart failing (SLIDER-34). The AM comes back up, but it cannot create new containers.
The Slider minicluster test {{TestKilledAM}} can replicate this reliably: it kills the AM, then kills a container while the AM is down, which triggers a reallocation of a container, leading to this failure.

ZKRMStateStore has a few places where it is logging at the INFO level. We should change these to DEBUG or TRACE level messages.

YARN-2059.
Major sub-task reported by Zhijie Shen and fixed by Zhijie Shen Extend access control for admin acls

YARN-2054.
Major bug reported by Karthik Kambatla and fixed by Karthik Kambatla (resourcemanager)Better defaults for YARN ZK configs for retries and retry-inteval when HA is enabled

Currently, we have the following default values:
# yarn.resourcemanager.zk-num-retries - 500
# yarn.resourcemanager.zk-retry-interval-ms - 2000
This leads to a cumulative 1000 seconds before the RM gives up trying to connect to ZK.

YARN-2052.
Major sub-task reported by Tsuyoshi OZAWA and fixed by Tsuyoshi OZAWA (resourcemanager)ContainerId creation after work preserving restart is broken

Container ids are made unique by using the app identifier and appending a monotonically increasing sequence number to it. Since container creation is a high churn activity the RM does not store the sequence number per app. So after restart it does not know what the new sequence number should be for new allocations.

YARN-2050.
Major bug reported by Ming Ma and fixed by Ming Ma Fix LogCLIHelpers to create the correct FileContext

LogCLIHelpers calls FileContext.getFileContext() without any parameters. Thus the FileContext created isn't necessarily the FileContext for the remote log filesystem.
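A hedged sketch of creating the FileContext from the remote log directory's URI instead of the default filesystem (the path variable is illustrative):
{code}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileContext;
import org.apache.hadoop.fs.Path;

public class RemoteLogFileContextSketch {
  // Build the FileContext from the remote log dir's URI so the CLI reads the
  // aggregated logs from the right filesystem, not the local default.
  static FileContext getRemoteLogContext(Path remoteRootLogDir, Configuration conf)
      throws Exception {
    return FileContext.getFileContext(remoteRootLogDir.toUri(), conf);
  }
}
{code}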

YARN-2049.
Major sub-task reported by Zhijie Shen and fixed by Zhijie Shen Delegation token stuff for the timeline sever

YARN-2022.
Major sub-task reported by Sunil G and fixed by Sunil G (resourcemanager)Preempting an Application Master container can be kept as least priority when multiple applications are marked for preemption by ProportionalCapacityPreemptionPolicy

Cluster Size = 16GB [2NM's]
Queue A Capacity = 50%
Queue B Capacity = 50%
Consider there are 3 applications running in Queue A which has taken the full cluster capacity.
J1 = 2GB AM + 1GB * 4 Maps
J2 = 2GB AM + 1GB * 4 Maps
J3 = 2GB AM + 1GB * 2 Maps
Another Job J4 is submitted in Queue B [J4 needs a 2GB AM + 1GB * 2 Maps ].
Currently in this scenario, job J3 will get killed, including its AM.
It would be better if the AM could be given the least priority among multiple applications. In this same scenario, map tasks from J3 and J2 could be preempted instead.
Later, when the cluster is free, maps can be allocated to these jobs.

YARN-2017.
Major sub-task reported by Jian He and fixed by Jian He (resourcemanager)Merge some of the common lib code in schedulers

A bunch of the same code is repeated among schedulers, e.g. between FiCaSchedulerNode and FSSchedulerNode. It would be good to merge and share it in a common base.

Currently the 'default' rule in the queue placement policy, if applied, puts the app in the root.default queue. It would be great if we could make the 'default' rule optionally point to a different queue as the default queue.
This default queue can be a leaf queue, or it can also be a parent queue if the 'default' rule is nested inside the nestedUserQueue rule (YARN-1864).

YARN-2011.
Trivial test reported by Chen He and fixed by Chen He Fix typo and warning in TestLeafQueue

YARN-1987.
Major improvement reported by Jason Lowe and fixed by Jason Lowe Wrapper for leveldb DBIterator to aid in handling database exceptions

Per discussions in YARN-1984 and MAPREDUCE-5652, it would be nice to have a utility wrapper around leveldb's DBIterator to translate the raw RuntimeExceptions it can throw into DBExceptions to make it easier to handle database errors while iterating.

YARN-1982.
Major sub-task reported by Zhijie Shen and fixed by Zhijie Shen Rename the daemon name to timelineserver

Nowadays, it's confusing that we call the new component timeline server, but we use
{code}
yarn historyserver
yarn-daemon.sh start historyserver
{code}
to start the daemon.
Before the confusion propagates further, we'd better modify the command line ASAP.

YARN-1981.
Major bug reported by Jason Lowe and fixed by Jason Lowe (resourcemanager)Nodemanager version is not updated when a node reconnects

When a nodemanager is quickly restarted and happens to change versions during the restart (e.g.: rolling upgrade scenario) the NM version as reported by the RM is not updated.

The current version of ProportionalCapacityPreemptionPolicy should be improved to deal with the following two scenarios:
1) when rebalancing over-capacity allocations, it potentially preempts without considering the maxCapacity constraints of a queue (i.e., preempting possibly more than strictly necessary)
2) a zero-capacity queue is preempted even if there is no demand (consistent with the old use of zero capacity to disable queues)
The proposed patch fixes both issues and introduces a few new test cases.

YARN-1940.
Major bug reported by Kihwal Lee and fixed by Rushabh S Shah deleteAsUser() terminates early without deleting more files on error

In container-executor.c, delete_path() returns early when unlink() against a file or a symlink fails. We have seen many cases of the error being ENOENT, which can safely be ignored during delete.
This is what we saw recently: An app mistakenly created a large number of files in the local directory and the deletion service failed to delete a significant portion of them due to this bug. Repeatedly hitting this on the same node led to exhaustion of inodes in one of the partitions.
Besides ignoring ENOENT, delete_path() can simply skip the failed entry and continue in some cases, rather than aborting and leaving files behind.

YARN-1938.
Major sub-task reported by Zhijie Shen and fixed by Zhijie Shen Kerberos authentication for the timeline server

YARN-1937.
Major sub-task reported by Zhijie Shen and fixed by Zhijie Shen Add entity-level access control of the timeline data for owners only

YARN-1824 broke compatibility with previous 2.x releases by changing the APIs in org.apache.hadoop.yarn.util.Apps.{setEnvFromInputString,addToEnvironment}. The old API should be added back in.
This affects any ApplicationMasters that were using this API. It also breaks previously built MapReduce libraries from working with the new YARN release, as MR uses this API.

In the fair scheduler, computing shares continues until the iterations are complete, even when we have a perfect match between the resource shares and the total resources. This is because the binary search checks only less-than or greater-than, not equals. Add an early termination condition for when they are equal.
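A hedged sketch of a binary search with the early-termination branch described above (resourceUsedWithWeightToResourceRatio stands in here as a placeholder for the real FairScheduler share computation):
{code}
public class ComputeSharesSketch {
  // Stop iterating as soon as the computed usage exactly matches the total,
  // instead of always running the full iteration budget.
  static double findRatio(long totalResource, double low, double high, int maxIter) {
    for (int i = 0; i < maxIter; i++) {
      double mid = (low + high) / 2.0;
      long used = resourceUsedWithWeightToResourceRatio(mid);
      if (used == totalResource) {
        return mid;                    // early termination on a perfect match
      } else if (used < totalResource) {
        low = mid;
      } else {
        high = mid;
      }
    }
    return (low + high) / 2.0;
  }

  // Placeholder for the real share computation.
  static long resourceUsedWithWeightToResourceRatio(double ratio) {
    return 0L;
  }
}
{code}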

YARN-1913.
Major bug reported by bc Wong and fixed by Wei Yan (scheduler)With Fair Scheduler, cluster can logjam when all resources are consumed by AMs

It's possible to deadlock a cluster by submitting many applications at once, and have all cluster resources taken up by AMs.
One solution is for the scheduler to limit resources taken up by AMs, as a percentage of total cluster resources, via a "maxApplicationMasterShare" config.

YARN-1885.
Major bug reported by Arpit Gupta and fixed by Wangda Tan RM may not send the app-finished signal after RM restart to some nodes where the application ran before RM restarts

During our HA testing we have seen cases where YARN application logs are not available through the CLI, but I can look at AM logs through the UI. The RM was also being restarted in the background while the application was running.

YARN-1877.
Critical sub-task reported by Karthik Kambatla and fixed by Robert Kanter (resourcemanager)Document yarn.resourcemanager.zk-auth and its scope

YARN-1870.
Minor improvement reported by Ted Yu and fixed by Fengdong Yu (resourcemanager)FileInputStream is not closed in ProcfsBasedProcessTree#constructProcessSMAPInfo()

In the Fair Scheduler, we want to be able to create user queues under any parent queue in the hierarchy. For example, say user1 submits a job to a parent queue called root.allUserQueues; we want to be able to create a new queue called root.allUserQueues.user1 and run user1's job in it. Any further jobs submitted by this user to root.allUserQueues will be run in this newly created root.allUserQueues.user1.
This is very similar to the 'user-as-default' feature in Fair Scheduler which creates user queues under root queue. But we want the ability to create user queues under ANY parent queue.
Why do we want this?
1. Preemption: these dynamically created user queues can preempt each other if their fair share is not met, so there is fairness among users.
User queues can also preempt other non-user leaf queues if below fair share.
2. Allocation to user queues: we want all the (ad hoc) user queries to consume only a fraction of the resources in the shared cluster. With this feature, we could do that by giving a fair share to the parent user queue, which is then redistributed to all the dynamically created user queues.

YARN-1845.
Major improvement reported by Rushabh S Shah and fixed by Rushabh S Shah Elapsed time for failed tasks that never started is wrong

The elapsed time for tasks in a failed job that were never started can be way off. It looks like we're marking the start time as the beginning of the epoch (i.e.: start time = -1) but the finish time is when the task was marked as failed when the whole job failed. That causes the calculated elapsed time of the task to be a ridiculous number of hours.
Tasks that fail without any attempts shouldn't have start/finish/elapsed times.

YARN-1833.
Major bug reported by Mit Desai and fixed by Mit Desai TestRMAdminService Fails in trunk and branch-2 : Assert Fails due to different count of UserGroups for currentUser()

In the test testRefreshUserToGroupsMappingsWithFileSystemBasedConfigurationProvider, the following assert is not needed.
{code}
Assert.assertTrue(groupWithInit.size() != groupBefore.size());
{code}
As the assert takes the default groups for groupWithInit (which in my case are users, sshusers and wheel), it fails because the sizes of groupWithInit and groupBefore are the same.
I do not think we need this assert here. Moreover, we are also checking that groupInit does not have the userGroups that are in groupBefore, so removing the assert should not be harmful.

YARN-1670.
Critical bug reported by Thomas Graves and fixed by Mit Desai aggregated log writer can write more log data then it says is the log length

We have seen exceptions when using 'yarn logs' to read log files.
at java.lang.NumberFormatException.forInputString(NumberFormatException.java:65)
at java.lang.Long.parseLong(Long.java:441)
at java.lang.Long.parseLong(Long.java:483)
at org.apache.hadoop.yarn.logaggregation.AggregatedLogFormat$LogReader.readAContainerLogsForALogType(AggregatedLogFormat.java:518)
at org.apache.hadoop.yarn.logaggregation.LogDumper.dumpAContainerLogs(LogDumper.java:178)
at org.apache.hadoop.yarn.logaggregation.LogDumper.run(LogDumper.java:130)
at org.apache.hadoop.yarn.logaggregation.LogDumper.main(LogDumper.java:246)
We traced it down to the reader trying to read the file type of the next file, but what it reads is still log data from the previous file. What happened was that the Log Length was written as a certain size, but the log data was actually longer than that.
Inside the write() routine in LogValue, it first writes what the log file length is, but then when it goes to write the log itself it simply writes up to the end of the file. There is a race condition here: if someone is still writing to the file when it goes to be aggregated, the length written could be too small.
We should have the write() routine stop when it writes whatever it said was the length. It would be nice if we could somehow tell the user it might be truncated but I'm not sure of a good way to do this.
We also noticed a bug in readAContainerLogsForALogType, where it is using an int for curRead whereas it should be using a long.
while (len != -1 && curRead < fileLength) {
This isn't actually a problem right now as it looks like the underlying decoder is doing the right thing and the len condition exits.

YARN-1561.
Minor improvement reported by Junping Du and fixed by Chen He (scheduler)Fix a generic type warning in FairScheduler

The Comparator below should be specified with type:
private Comparator nodeAvailableResourceComparator =
new NodeAvailableResourceComparator();
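A hedged sketch of the parameterized declaration that removes the raw-type warning; the element type shown here is a placeholder and should match whatever NodeAvailableResourceComparator actually compares in FairScheduler:
{code}
private Comparator<FSSchedulerNode> nodeAvailableResourceComparator =
    new NodeAvailableResourceComparator();
{code}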

I've been occasionally coming across instances where Hadoop's Cluster Applications REST API (http://hadoop.apache.org/docs/r0.23.6/hadoop-yarn/hadoop-yarn-site/ResourceManagerRest.html#Cluster_Applications_API) has returned JSON that PHP's json_decode function failed to parse. I've tracked the syntax error down to the presence of the unquoted word NaN appearing as a value in the JSON. For example:
"progress":NaN,
NaN is not part of the JSON spec, so its presence renders the whole JSON string invalid. Hadoop needs to return something other than NaN in this case -- perhaps an empty string or the quoted string "NaN".
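A minimal sketch of guarding the value before serialization so the emitted JSON stays valid (the class and method names are illustrative, not the actual ResourceManager web-services code):
{code}
public class ProgressSanitizerSketch {
  // Emit 0 instead of NaN so strict JSON parsers (e.g. PHP's json_decode)
  // can handle the "progress" field.
  static float sanitizeProgress(float progress) {
    return Float.isNaN(progress) ? 0.0f : progress;
  }
}
{code}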

Schedulers currently have a reinitialize but no start and stop. Fitting them into the YARN service model would make things more coherent.

YARN-1429.
Trivial bug reported by Sandy Ryza and fixed by Jarek Jarcec Cecho (client)*nix: Allow a way for users to augment classpath of YARN daemons

YARN_CLASSPATH is referenced in the comments in ./hadoop-yarn-project/hadoop-yarn/bin/yarn and ./hadoop-yarn-project/hadoop-yarn/bin/yarn.cmd, but doesn't do anything.

YARN-1424.
Minor improvement reported by Sandy Ryza and fixed by Ray Chiang (resourcemanager)RMAppAttemptImpl should return the DummyApplicationResourceUsageReport for all invalid accesses

RMAppImpl has a DUMMY_APPLICATION_RESOURCE_USAGE_REPORT to return when the caller of createAndGetApplicationReport doesn't have access.
RMAppAttemptImpl should have something similar for getApplicationResourceUsageReport.
It also might make sense to put the dummy report into ApplicationResourceUsageReport and allow both to use it.
A test would also be useful to verify that RMAppAttemptImpl#getApplicationResourceUsageReport doesn't return null if the scheduler doesn't have a report to return.
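A hedged sketch of the suggested behavior: hand out a shared dummy report when the scheduler has nothing to return (class and method names are illustrative):
{code}
import org.apache.hadoop.yarn.api.records.ApplicationResourceUsageReport;
import org.apache.hadoop.yarn.util.Records;

public class DummyUsageReportSketch {
  // One shared "empty" report returned for invalid or inaccessible attempts.
  static final ApplicationResourceUsageReport DUMMY_REPORT =
      Records.newRecord(ApplicationResourceUsageReport.class);

  static ApplicationResourceUsageReport getReportOrDummy(
      ApplicationResourceUsageReport fromScheduler) {
    return fromScheduler != null ? fromScheduler : DUMMY_REPORT;
  }
}
{code}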

YARN-1408.
Major sub-task reported by Sunil G and fixed by Sunil G (resourcemanager)Preemption caused Invalid State Event: ACQUIRED at KILLED and caused a task timeout for 30mins

Capacity preemption is enabled as follows.
* yarn.resourcemanager.scheduler.monitor.enable= true ,
* yarn.resourcemanager.scheduler.monitor.policies=org.apache.hadoop.yarn.server.resourcemanager.monitor.capacity.ProportionalCapacityPreemptionPolicy
Queue = a,b
Capacity of Queue A = 80%
Capacity of Queue B = 20%
Step 1: Assign a big jobA on queue a which uses full cluster capacity
Step 2: Submitted a jobB to queue b which would use less than 20% of cluster capacity
The JobA tasks which use queue b's capacity are preempted and killed.
This caused the problem below:
1. New Container has got allocated for jobA in Queue A as per node update from an NM.
2. This container has been preempted immediately as per preemption.
Here, an "ACQUIRED at KILLED" invalid state exception was raised when the next AM heartbeat reached the RM.
ERROR org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl: Can't handle this event at current state
org.apache.hadoop.yarn.state.InvalidStateTransitonException: Invalid event: ACQUIRED at KILLED
This also caused the task to go for a timeout of 30 minutes, as this container was already killed by preemption.
attempt_1380289782418_0003_m_000000_0 Timed out after 1800 secs

YARN-1368.
Major sub-task reported by Bikas Saha and fixed by Jian He Common work to re-populate containers’ state into scheduler

YARN-1367 adds support for the NM to tell the RM about all currently running containers upon registration. The RM needs to send this information to the schedulers along with the NODE_ADDED_EVENT so that the schedulers can recover the current allocation state of the cluster.

YARN-1366.
Major sub-task reported by Bikas Saha and fixed by Rohith (resourcemanager)AM should implement Resync with the ApplicationMasterService instead of shutting down

The ApplicationMasterService currently sends a resync response, to which the AM responds by shutting down. The AM behavior is expected to change to resyncing with the RM. Resync means resetting the allocate RPC sequence number to 0, and the AM should send its entire outstanding request to the RM. Note that if the AM is making its first allocate call to the RM then things should proceed as normal without needing a resync. The RM will return all containers that have completed since the RM last synced with the AM. Some container completions may be reported more than once.

YARN-1365.
Major sub-task reported by Bikas Saha and fixed by Anubhav Dhoot (resourcemanager)ApplicationMasterService to allow Register of an app that was running before restart

For an application that was running before restart, the ApplicationMasterService currently throws an exception when the app tries to make the initial register or final unregister call. These should succeed and the RMApp state machine should transition to completed like normal. Unregistration should succeed for an app that the RM considers complete since the RM may have died after saving completion in the store but before notifying the AM that the AM is free to exit.

YARN-1362.
Major sub-task reported by Jason Lowe and fixed by Jason Lowe (nodemanager)Distinguish between nodemanager shutdown for decommission vs shutdown for restart

When a nodemanager shuts down it needs to determine if it is likely to be restarted. If a restart is likely then it needs to preserve container directories, logs, distributed cache entries, etc. If it is being shutdown more permanently (e.g.: like a decommission) then the nodemanager should cleanup directories and logs.

YARN-1339.
Major sub-task reported by Jason Lowe and fixed by Jason Lowe (nodemanager)Recover DeletionService state upon nodemanager restart

YARN-1338.
Major sub-task reported by Jason Lowe and fixed by Jason Lowe (nodemanager)Recover localized resource cache state upon nodemanager restart

Today when the node manager restarts, we clean up all the distributed cache files from disk. This is definitely not ideal, for two reasons:
* For a work-preserving restart we definitely want them, as running containers are using them.
* Even for a non-work-preserving restart this is useful, in the sense that we don't have to download the files again if they are needed by future tasks.

YARN-1136.
Major bug reported by Karthik Kambatla and fixed by Chen He Replace junit.framework.Assert with org.junit.Assert

There are several places where we are using junit.framework.Assert instead of org.junit.Assert.
{code}grep -rn "junit.framework.Assert" hadoop-yarn-project/ --include=*.java{code}

YARN-738.
Major bug reported by Omkar Vinit Joshi and fixed by Ming Ma TestClientRMTokens is failing irregularly while running all yarn tests

Running org.apache.hadoop.yarn.server.resourcemanager.TestClientRMTokens
Tests run: 6, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 16.787 sec <<< FAILURE!
testShortCircuitRenewCancel(org.apache.hadoop.yarn.server.resourcemanager.TestClientRMTokens) Time elapsed: 186 sec <<< ERROR!
java.lang.RuntimeException: getProxy
at org.apache.hadoop.yarn.server.resourcemanager.TestClientRMTokens$YarnBadRPC.getProxy(TestClientRMTokens.java:334)
at org.apache.hadoop.yarn.security.client.RMDelegationTokenIdentifier$Renewer.getRmClient(RMDelegationTokenIdentifier.java:157)
at org.apache.hadoop.yarn.security.client.RMDelegationTokenIdentifier$Renewer.renew(RMDelegationTokenIdentifier.java:102)
at org.apache.hadoop.security.token.Token.renew(Token.java:372)
at org.apache.hadoop.yarn.server.resourcemanager.TestClientRMTokens.checkShortCircuitRenewCancel(TestClientRMTokens.java:306)
at org.apache.hadoop.yarn.server.resourcemanager.TestClientRMTokens.testShortCircuitRenewCancel(TestClientRMTokens.java:240)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:45)
at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:15)
at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:42)
at org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:20)
at org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:28)
at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:263)
at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:68)
at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:47)
at org.junit.runners.ParentRunner$3.run(ParentRunner.java:231)
at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:60)
at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:229)
at org.junit.runners.ParentRunner.access$000(ParentRunner.java:50)
at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:222)
at org.junit.runners.ParentRunner.run(ParentRunner.java:300)
at org.apache.maven.surefire.junit4.JUnit4Provider.execute(JUnit4Provider.java:252)
at org.apache.maven.surefire.junit4.JUnit4Provider.executeTestSet(JUnit4Provider.java:141)
at org.apache.maven.surefire.junit4.JUnit4Provider.invoke(JUnit4Provider.java:112)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at org.apache.maven.surefire.util.ReflectionUtils.invokeMethodWithArray(ReflectionUtils.java:189)
at org.apache.maven.surefire.booter.ProviderFactory$ProviderProxy.invoke(ProviderFactory.java:165)
at org.apache.maven.surefire.booter.ProviderFactory.invokeProvider(ProviderFactory.java:85)
at org.apache.maven.surefire.booter.ForkedBooter.runSuitesInProcess(ForkedBooter.java:115)
at org.apache.maven.surefire.booter.ForkedBooter.main(ForkedBooter.java:75)

YARN-614.
Major improvement reported by Bikas Saha and fixed by Xuan Gong Separate AM failures from hardware failure or YARN error and do not count them to AM retry count

Attempts can fail due to a large number of user errors and they should not be retried unnecessarily. The only reason YARN should retry an attempt is when the hardware fails or YARN has an error. NM failing, lost NM and NM disk errors are the hardware errors that come to mind.

YARN-596.
Major bug reported by Sandy Ryza and fixed by Wei Yan (scheduler)Use scheduling policies throughout the queue hierarchy to decide which containers to preempt

In the fair scheduler, containers are chosen for preemption in the following way:
All containers for all apps that are in queues that are over their fair share are put in a list.
The list is sorted in order of the priority that the container was requested in.
This means that an application can shield itself from preemption by requesting its containers at higher priorities, which doesn't really make sense.
Also, an application that is not over its fair share, but is in a queue that is over its fair share, is just as likely to have containers preempted as an application that is over its fair share.

YARN-483.
Major improvement reported by Sandy Ryza and fixed by Akira AJISAKA (documentation)Improve documentation on log aggregation in yarn-default.xml

The current documentation for log aggregation is
{code}
<property>
<description>Whether to enable log aggregation</description>
<name>yarn.log-aggregation-enable</name>
<value>false</value>
</property>
{code}
This could be improved to explain what enabling log aggregation does.

MAPREDUCE-6002.
Major bug reported by Wangda Tan and fixed by Wangda Tan (task)MR task should prevent report error to AM when process is shutting down

MAPREDUCE-5896.
Major improvement reported by Sandy Ryza and fixed by Sandy Ryza InputSplits should indicate which locations have the block cached in memory

MAPREDUCE-5895.
Major bug reported by Kousuke Saruta and fixed by Kousuke Saruta (client)FileAlreadyExistsException was thrown : Temporary Index File can not be cleaned up because OutputStream doesn't close properly

MAPREDUCE-5888.
Major bug reported by Jason Lowe and fixed by Jason Lowe (mr-am)Failed job leaves hung AM after it unregisters

Set "dfs.namenode.legacy-oiv-image.dir" to an appropriate directory to make standby name node or secondary name node save its file system state in the old fsimage format during checkpointing. This image can be used for offline analysis using the OfflineImageViewer. Use the "hdfs oiv_legacy" command to process the old fsimage format.

HDFS-6289.
Critical bug reported by Aaron T. Myers and fixed by Aaron T. Myers (ha)HA failover can fail if there are pending DN messages for DNs which no longer exist

HDFS-6273 introduces two new HDFS configuration keys:
- dfs.namenode.http-bind-host
- dfs.namenode.https-bind-host
The most common use case for these keys is to have the NameNode HTTP (or HTTPS) endpoints listen on all interfaces on multi-homed systems by setting the keys to 0.0.0.0 i.e. INADDR_ANY.
For the systems background on this usage of INADDR_ANY please refer to ip(7) in the Linux Programmer's Manual (web link: http://man7.org/linux/man-pages/man7/ip.7.html).
These keys complement the existing NameNode options:
- dfs.namenode.rpc-bind-host
- dfs.namenode.servicerpc-bind-host

HADOOP-10454.
Major improvement reported by Kihwal Lee and fixed by Kihwal Lee Provide FileContext version of har file system

HADOOP-10451.
Trivial improvement reported by Benoy Antony and fixed by Benoy Antony (security)Remove unused field and imports from SaslRpcServer

SaslRpcServer.SASL_PROPS is removed.
Any use of this variable should be replaced with the following code:
SaslPropertiesResolver saslPropsResolver = SaslPropertiesResolver.getInstance(conf);
Map<String, String> sasl_props = saslPropsResolver.getDefaultProperties();