Details

Description

moveApplicationAcrossQueues operation currently does not check user permission on the target queue. This incorrectly allows one user to move his/her own applications to a queue that the user has no access to

Rohith Sharma K S
added a comment - 24/Aug/16 19:50 +1 for the issue. And also I think there are uncovered bugs with respect to queueACL for moveToQueue operation.
Along with modify app acl, user should also have access to ADMINISTER_QUEUE & SUBMIT_APPLICATIONS acl's. Basically all 3 operations should be AND operation rather than OR operation. cc :-/ Jian He
Comments on the patch,
patch appears to be do not fix the reported bug since it still check for old-queue for user permission. Basically user authorization should happen for targeted queue.
The method appSubmissionToQueueAllowed can be renamed to checkUserAccessToQueue ?

Wilfred Spiegelenburg
added a comment - 25/Aug/16 07:26 Patch that does the checks for the ACL on the target queue.
Following checks are made:
user has modify application permission or is admin on the queue (existing check)
user has submit access or admin access on the target queue (new check)
Tests have been added for the newly added check. The existing check has not got a test and is not covered by the newly added tests

Thanks Wilfred Spiegelenburg for working on this. The patch looks good to me generally. My one thought is that moveApplicationAcrossQueues doesn't check if the queue exists before ACL checking, neither do its callers. It's OK since its callee like queueACLsManager.checkAccess did the NULL check of the target queue, but the LOG information seems vague. What about to do the NULL check in moveApplicationAcrossQueues and provide the explicitly information of "target queue does't exist" and remove the NULL check in its callee like function queueACLsManager.checkAccess added in this patch.

Yufei Gu
added a comment - 31/Aug/16 18:48 - edited Thanks Wilfred Spiegelenburg for working on this. The patch looks good to me generally. My one thought is that moveApplicationAcrossQueues doesn't check if the queue exists before ACL checking, neither do its callers. It's OK since its callee like queueACLsManager.checkAccess did the NULL check of the target queue, but the LOG information seems vague. What about to do the NULL check in moveApplicationAcrossQueues and provide the explicitly information of "target queue does't exist" and remove the NULL check in its callee like function queueACLsManager.checkAccess added in this patch.

For the permission part: should we check (submit_acl_on_target_queue || target_queue_adminAcl) && application_acl
The first half means permission on the target queue; The second half means permission on application itself.
I think the first half is also what being used currently in SubmitApplication

Jian He
added a comment - 01/Sep/16 05:39 For the permission part: should we check (submit_acl_on_target_queue || target_queue_adminAcl) && application_acl
The first half means permission on the target queue; The second half means permission on application itself.
I think the first half is also what being used currently in SubmitApplication

Sorry for the delayed response, I tried to add a new test which test just the getAccess call in the client but did not get it to work nicely.

I have updated the patch with the check for an non existing queue including an extra test.

I did not move the check for a non existent queue into the ClientRMService because each scheduler checks the queue existence in its own way and we would have had to introduce a number of new dependencies into the client. I left it in QueueACLsManager which already has the CS as a dependency. It now also logs that the target queue does not exists.

For the check that Jian He mentioned: we have an existing check for MODIFY_APP in the code. That check also takes into account the administrator access for the origin queue, covering the application_acl part. The new check added handles the first part submit_acl_on_target_queue || target_queue_adminAcl) Both need to pass to move the application.

Wilfred Spiegelenburg
added a comment - 26/Sep/16 11:18 Sorry for the delayed response, I tried to add a new test which test just the getAccess call in the client but did not get it to work nicely.
I have updated the patch with the check for an non existing queue including an extra test.
I did not move the check for a non existent queue into the ClientRMService because each scheduler checks the queue existence in its own way and we would have had to introduce a number of new dependencies into the client. I left it in QueueACLsManager which already has the CS as a dependency. It now also logs that the target queue does not exists.
For the check that Jian He mentioned: we have an existing check for MODIFY_APP in the code. That check also takes into account the administrator access for the origin queue, covering the application_acl part. The new check added handles the first part submit_acl_on_target_queue || target_queue_adminAcl) Both need to pass to move the application.

Yufei Gu
added a comment - 27/Sep/16 07:52 Thanks Wilfred Spiegelenburg 's new patch. It makes sense to not introduce more dependencies. Some nits:
1.
/*
* Check access to a targetQueue in the case of a move of an application
The first line should be /** .
2. I am wondering
if (scheduler instanceof CapacityScheduler) {
...
} else if (scheduler instanceof FairScheduler) {
...
} else {
return ...
}
will be more readable.
3. In func moveApplicationAcrossQueues , can we use ioe.toString() instead of "Target queue does not exist " + targetQueue ?

Karthik Kambatla
added a comment - 05/Oct/16 00:11 Thanks for reporting and working on this, Wilfred Spiegelenburg .
Main comment on the patch: QueueACLsManager#checkAccess: Instead of throwing an IOException, can we log the fact that queue does not exist and return false?

Wilfred Spiegelenburg
added a comment - 05/Oct/16 05:43 updated the text in the messages, it does make sense to include it not just in the message of the queue manager. Does the message look OK Bibin A Chundatt ?

Daniel Templeton
added a comment - 29/Nov/16 22:06 Thanks for all the patch updates, Wilfred Spiegelenburg . My turn.
" doesn't have permissions submit to target queue: " is missing a "to" before the "submit."
In QueueACLsManager.checkAccess() , I don't see why you need to do the scheduler-dependent if . Can't you just call checkAccess() in all cases?
In your tests, I would feel better if you tested that the app is in the right queue after the successful moves.
Note that your use of a lambda in createClientRMServiceForMoveApplicationRequest() means this patch can only go into trunk.

" doesn't have permissions submit to target queue: " is missing a "to" before the "submit."

fixed the typo

In QueueACLsManager.checkAccess(), I don't see why you need to do the scheduler-dependent if. Can't you just call checkAccess() in all cases?

The capacity scheduler part is a copy of the checkAccess() that is already there. The change to not use the checkAccess() of the scheduler for the capacity scheduler was made as part of YARN-4571. Bringing the FairScheduler and the CapacityScheduler in sync is more work than we can just push into this jira. I think it is better to open a follow up jira to refactor this and bring the two schedulers in sync again. Let me know if you agree with that approach.

In your tests, I would feel better if you tested that the app is in the right queue after the successful moves.

Because of the way the tests are mocked up the current tests can not do that. We create a ClientRMService which does not have a scheduler or an application manager. The test are focussed on the ACL managers and making sure that they stop the move in the service. We can extend the tests to do the app checks but that would introduce scheduler specific testing into the client service.

Note that your use of a lambda in createClientRMServiceForMoveApplicationRequest() means this patch can only go into trunk.

oops did not think about that. I'll have rewritten the tests to remove the lambda. I now really appreciate the simplicity of using a lambda

Wilfred Spiegelenburg
added a comment - 02/Dec/16 16:28 " doesn't have permissions submit to target queue: " is missing a "to" before the "submit."
fixed the typo
In QueueACLsManager.checkAccess(), I don't see why you need to do the scheduler-dependent if. Can't you just call checkAccess() in all cases?
The capacity scheduler part is a copy of the checkAccess() that is already there. The change to not use the checkAccess() of the scheduler for the capacity scheduler was made as part of YARN-4571 . Bringing the FairScheduler and the CapacityScheduler in sync is more work than we can just push into this jira. I think it is better to open a follow up jira to refactor this and bring the two schedulers in sync again. Let me know if you agree with that approach.
In your tests, I would feel better if you tested that the app is in the right queue after the successful moves.
Because of the way the tests are mocked up the current tests can not do that. We create a ClientRMService which does not have a scheduler or an application manager. The test are focussed on the ACL managers and making sure that they stop the move in the service. We can extend the tests to do the app checks but that would introduce scheduler specific testing into the client service.
Note that your use of a lambda in createClientRMServiceForMoveApplicationRequest() means this patch can only go into trunk.
oops did not think about that. I'll have rewritten the tests to remove the lambda. I now really appreciate the simplicity of using a lambda

The change to not use the checkAccess() of the scheduler for the capacity scheduler was made as part of YARN-4571.

I'm looking at YARN-4571, and I can't see any reason why checkAccess() was modified to special case the capacity scheduler. The only difference I see from the code is that the new checkAccess() plows ahead even if the queue is null. Maybe there were differences at the time that have since disappeared? (Jian He, can you shed any light?)

Looking at your new checkAccess() method, I also don't see where it does anything different from the schedulers' checkAccess() methods. What problem is it that you're solving with the new method?

Daniel Templeton
added a comment - 06/Dec/16 22:36 The change to not use the checkAccess() of the scheduler for the capacity scheduler was made as part of YARN-4571 .
I'm looking at YARN-4571 , and I can't see any reason why checkAccess() was modified to special case the capacity scheduler. The only difference I see from the code is that the new checkAccess() plows ahead even if the queue is null. Maybe there were differences at the time that have since disappeared? ( Jian He , can you shed any light?)
Looking at your new checkAccess() method, I also don't see where it does anything different from the schedulers' checkAccess() methods. What problem is it that you're solving with the new method?

Jian He
added a comment - 06/Dec/16 22:57 In YARN-4571 , do you mean the QueueACLsManager#checkAccess method ? it has difference between CS and FS: in CS it calls YarnAuthorizationProvider#checkPermission, whereas FS doesn't.

My question is why the QueueACLsManager.checkAccess() method replicates what's in CapacityScheduler.checkAccess() instead of just calling that method (which would mean that there's no need to special case for the scheduler type).

Daniel Templeton
added a comment - 06/Dec/16 23:25 My question is why the QueueACLsManager.checkAccess() method replicates what's in CapacityScheduler.checkAccess() instead of just calling that method (which would mean that there's no need to special case for the scheduler type).

oh, I think it was because the CapacityScheduler#checkAccess is actually not needed for CS now, it cannot be removed because it's a public interface. And CS requires additional parameters like remoteAddress, forwardedAddresses which is not that appropriate to add that in the public interface, because FS inherits the same interface and it does not require these parameters.

Jian He
added a comment - 07/Dec/16 00:17 oh, I think it was because the CapacityScheduler#checkAccess is actually not needed for CS now, it cannot be removed because it's a public interface. And CS requires additional parameters like remoteAddress, forwardedAddresses which is not that appropriate to add that in the public interface, because FS inherits the same interface and it does not require these parameters.

The main point is that the ClientRMService does not have direct access to the Scheduler. All access checks run through the QueueACLsManager or the ApplicationACLsManager. Any change must thus go through that. In this case the new method was introduced because the current method does not have the destination queue available. We need to check the destination queue the originating queue is already checked earlier by calling the existing method. The passed in application has not been moved yet and thus still has the original queue. Updating the application is not possible because that would pre-empt the fact that the application can and will be moved.

The target queue checks are performed because it comes out of the move request and has not been checked at the time the access check is performed. To be able to distinguish between an access denied and a queue that does not exist the log message was added if the queue returned is empty. Without that check, and the log entries, at that point we would not be able to trace back that difference.

I looked at folding the two methods into one to remove some code duplication but stopped with that. The small but important differences between the two methods required a number of if ... else ... constructs which made the code really difficult to read and understand.

Wilfred Spiegelenburg
added a comment - 07/Dec/16 04:35 The main point is that the ClientRMService does not have direct access to the Scheduler. All access checks run through the QueueACLsManager or the ApplicationACLsManager . Any change must thus go through that. In this case the new method was introduced because the current method does not have the destination queue available. We need to check the destination queue the originating queue is already checked earlier by calling the existing method. The passed in application has not been moved yet and thus still has the original queue. Updating the application is not possible because that would pre-empt the fact that the application can and will be moved.
The target queue checks are performed because it comes out of the move request and has not been checked at the time the access check is performed. To be able to distinguish between an access denied and a queue that does not exist the log message was added if the queue returned is empty. Without that check, and the log entries, at that point we would not be able to trace back that difference.
I looked at folding the two methods into one to remove some code duplication but stopped with that. The small but important differences between the two methods required a number of if ... else ... constructs which made the code really difficult to read and understand.

What you're saying then, is that the FairScheduler.checkAccess() method does the same thing, including the null check, but if the queue is null the message is logged as debug, which is not loud enough. Since it doesn't make sense to have the checkAccess() method always log louder, you're replicating the null check so that you can log it as a warn. Am I correct?

Daniel Templeton
added a comment - 07/Dec/16 18:07 What you're saying then, is that the FairScheduler.checkAccess() method does the same thing, including the null check, but if the queue is null the message is logged as debug, which is not loud enough. Since it doesn't make sense to have the checkAccess() method always log louder, you're replicating the null check so that you can log it as a warn. Am I correct?

Daniel Templeton
added a comment - 07/Dec/16 18:08 Since this patch is having to replicate that code again, any objection to moving it into an overloaded CapacityScheduler.checkAccess() method (that isn't inherited from the interface)?

Correct the checkAccess() methods does not have a way to communicate back that the queue does not exist and says that access is denied. There is no way to distinguish the two and we really want to leave some clue behind in the logs which case we have seen.
In the normal checkAccess() case a queue that does not exist is not likely, maybe not even possible, since the queue is set on the application.

Wilfred Spiegelenburg
added a comment - 07/Dec/16 23:25 Correct the checkAccess() methods does not have a way to communicate back that the queue does not exist and says that access is denied. There is no way to distinguish the two and we really want to leave some clue behind in the logs which case we have seen.
In the normal checkAccess() case a queue that does not exist is not likely, maybe not even possible, since the queue is set on the application.

I am all for it but I think we should do that from a follow up jira and not as part of this one.

The reason I think that we should do it in a separate jira is that within the FairScheduler when you dig deeper the access check performed in the queue is exactly what is now done for the CapacityScheduler. The FSQueue.hasAccess() is using the same call to an YarnAuthorizationProvider as we have now in the in the QueueACLsManager for CS.

Wilfred Spiegelenburg
added a comment - 08/Dec/16 02:12 I am all for it but I think we should do that from a follow up jira and not as part of this one.
The reason I think that we should do it in a separate jira is that within the FairScheduler when you dig deeper the access check performed in the queue is exactly what is now done for the CapacityScheduler. The FSQueue.hasAccess() is using the same call to an YarnAuthorizationProvider as we have now in the in the QueueACLsManager for CS.

Yep, I noticed that as well. The remoteAddress and forwardedAddress parameters that are the reason for the special casing are actually ignored under the covers, resulting in all the schedulers doing the same thing anyway.

Daniel Templeton
added a comment - 28/Dec/16 22:25 Yep, I noticed that as well. The remoteAddress and forwardedAddress parameters that are the reason for the special casing are actually ignored under the covers, resulting in all the schedulers doing the same thing anyway.

In the new QueueACLsManager.checkAccess(), I'd really appreciate a comment that sums up the previous discussion on this JIRA so that the next person is less confused than I was

In TestClientRMService.getQueueAclManager(), the answer() method in the anonymous inner class should have an @Override annotation. Also, I think you'll run into problems with Java 7 and the non-final parameters being used inside the anonymous inner class

Same comments for createClientRMServiceForMoveApplicationRequest(), plus you shouldn't need the suppress warnings annotation now that YARN-4457 is in. You will need it in branch-2, though.

Daniel Templeton
added a comment - 28/Dec/16 22:54 Let's get this thing closed out. A few more comments:
In ClientRMService ,
Server.getRemoteAddress(), null , targetQueue)||
should have a space before the pipes
In the new QueueACLsManager.checkAccess() , I'd really appreciate a comment that sums up the previous discussion on this JIRA so that the next person is less confused than I was
In TestClientRMService.getQueueAclManager() , the answer() method in the anonymous inner class should have an @Override annotation. Also, I think you'll run into problems with Java 7 and the non-final parameters being used inside the anonymous inner class
Same comments for createClientRMServiceForMoveApplicationRequest() , plus you shouldn't need the suppress warnings annotation now that YARN-4457 is in. You will need it in branch-2, though.

Wilfred Spiegelenburg
added a comment - 03/Jan/17 04:51 Updated patch:
rebased to make sure it still applies to trunk after YARN-5932
In ClientRMService added the missing space before the pipes
Added a comment to the QueueACLsManager.checkAccess() to explain the versions and the scheduler dependency
Added @Override in TestClientRMService.getQueueAclManager() and TestClientRMService.createClientRMServiceForMoveApplicationRequest()
removed the suppress warnings annotations (not needed after rebase)
The other two remarks I have left for the backport to branch-2:
problems with Java 7 and the non-final parameters being used inside the anonymous inner class (yes it exists)
suppress warnings annotation should not be needed after YARN-5932
I will log a new jira to work on the change for the QueueACLsManager.

Thanks, Wilfred Spiegelenburg. Can you move that checkAccess() comment into the checkAccess() method, right before the if it's talking about? While you're at it, could you fix the language/grammar of the comment text you reformatted in the pre-existing checkAccess() method?

Daniel Templeton
added a comment - 04/Jan/17 22:02 Thanks, Wilfred Spiegelenburg . Can you move that checkAccess() comment into the checkAccess() method, right before the if it's talking about? While you're at it, could you fix the language/grammar of the comment text you reformatted in the pre-existing checkAccess() method?

Daniel Templeton
added a comment - 05/Jan/17 00:01 One more detail: in the new comment on the old checkAccess() , "allow users access and view the apps" should be "allow users to access and view the apps."
I feel like we're asymptotically approaching the finish line. Thanks for hanging in there.

Daniel Templeton
added a comment - 05/Jan/17 15:25 In doing a last pass, I have two questions on the test code:
In testMoveApplicationSubmitTargetQueue() and testMoveApplicationAdminTargetQueue() , would it make sense to test that the moves that are supposed to work do actually work?
Why a ConcurrentHashMap in createClientRMServiceForMoveApplicationRequest() instead of Collections.singletonMap() ?

In testMoveApplicationSubmitTargetQueue() and testMoveApplicationAdminTargetQueue(), would it make sense to test that the moves that are supposed to work do actually work?

The application object itself is hidden and mocked up as an object. The testing from this side is purely for the ACL checks and enforcement. The application object that would be changed is not reachable from the ClientRMService at all. It would require a lot of changes that test underlying the underlying code base more than the client service.
Now that I think about it: we might even be able to make this far simpler if we move things around. Now that we have the move in the RMAppManager we could even think about moving all these ACL checks etc into the pre-validate check, or a security check, that is performed in the app manager. It does make more sense to have it there.

Why a ConcurrentHashMap in createClientRMServiceForMoveApplicationRequest() instead of Collections.singletonMap()?

I used that because the thenReturn expects a ConcurrentHashMap. The apps variable must be declared like it is. To use th singletonMap I then have to cast in the code which does not make it any more readable or maintainable. The code that works would look like this:

Wilfred Spiegelenburg
added a comment - 06/Jan/17 05:56 In testMoveApplicationSubmitTargetQueue() and testMoveApplicationAdminTargetQueue(), would it make sense to test that the moves that are supposed to work do actually work?
The application object itself is hidden and mocked up as an object. The testing from this side is purely for the ACL checks and enforcement. The application object that would be changed is not reachable from the ClientRMService at all. It would require a lot of changes that test underlying the underlying code base more than the client service.
Now that I think about it: we might even be able to make this far simpler if we move things around. Now that we have the move in the RMAppManager we could even think about moving all these ACL checks etc into the pre-validate check, or a security check, that is performed in the app manager. It does make more sense to have it there.
Why a ConcurrentHashMap in createClientRMServiceForMoveApplicationRequest() instead of Collections.singletonMap()?
I used that because the thenReturn expects a ConcurrentHashMap . The apps variable must be declared like it is. To use th singletonMap I then have to cast in the code which does not make it any more readable or maintainable. The code that works would look like this:
ConcurrentHashMap<ApplicationId, RMApp> apps = (ConcurrentHashMap) Collections.singletonMap(applicationId, app);
when(rmContext.getRMApps()).thenReturn(apps);
That does not look any nicer than what we now have does it?