Add long polling to asynchronous execution in HiveServer2

Details

Change to use Long polling as described in description.
Adds hive.server2.long.polling.timeout configuration parameter, which can be used to configure how long the long poll waits. Most users would not need to bother about changing this configuration parameter.

Change to use Long polling as described in description.
Adds hive.server2.long.polling.timeout configuration parameter, which can be used to configure how long the long poll waits. Most users would not need to bother about changing this configuration parameter.

Description

HIVE-4617 provides support for async execution in HS2. The client gets an operation handle which it can poll to check on the operation status. However, the polling frequency is entirely left to the client which can be resource inefficient. Long polling will solve this, by blocking the client request to check the operation status for a configurable amount of time (a new HS2 config) if the data is not available, but responding immediately if the data is available.

[~vaibhavgumashta] Version 5 of the patch won't apply cleanly on trunk. Can you please rebase the patch and post a review request on RB?

One concern I have with the patch is that the delay period affects everyone who calls SQLOperation.getStatus(). Ideally we would be able to limit this behavior to remote clients only. I'm worried that at some point in the future we're going to want to have a housekeeping or monitoring thread that periodically calls getStatus() on all active operations, and we don't want getStatus() to block in these types of situations. Would it be practical to relocate this logic to the CLIService layer so that it only impacts CLIService clients?

Carl Steinbach
added a comment - 06/Nov/13 16:36 [~vaibhavgumashta] Version 5 of the patch won't apply cleanly on trunk. Can you please rebase the patch and post a review request on RB?
One concern I have with the patch is that the delay period affects everyone who calls SQLOperation.getStatus(). Ideally we would be able to limit this behavior to remote clients only. I'm worried that at some point in the future we're going to want to have a housekeeping or monitoring thread that periodically calls getStatus() on all active operations, and we don't want getStatus() to block in these types of situations. Would it be practical to relocate this logic to the CLIService layer so that it only impacts CLIService clients?

Actually the delay period only affects the async SQLOperation (and returns immediately if the async call has the results) and doesn't change the blocking calls. Also, as far as I understand the use case you are pointing to is for a cleanup type of operation, where we might want to clean up the entire session based on whether the client is active or not. In that case, the cleanup thread might end up checking the last active time of the session rather than the operation state and based on whether it exceeds some threshold, delete the session from the server.

Moving the long polling logic to CLIService might involve changing the Operation API, which I was trying to avoid until we get to a specific use case. But if you feel otherwise, I can definitely make the change.

Vaibhav Gumashta
added a comment - 06/Nov/13 19:48 Carl Steinbach Sure, let me rebase on the trunk. You can look at the patch here: https://reviews.facebook.net/D12801 , but I can create a request on RB if you prefer that.
Actually the delay period only affects the async SQLOperation (and returns immediately if the async call has the results) and doesn't change the blocking calls. Also, as far as I understand the use case you are pointing to is for a cleanup type of operation, where we might want to clean up the entire session based on whether the client is active or not. In that case, the cleanup thread might end up checking the last active time of the session rather than the operation state and based on whether it exceeds some threshold, delete the session from the server.
Moving the long polling logic to CLIService might involve changing the Operation API, which I was trying to avoid until we get to a specific use case. But if you feel otherwise, I can definitely make the change.
Thanks for looking at the patch!

Carl Steinbach What Vaibhav is saying sounds reasonable to me. It wouldn't be too hard to add support for a non long polling getOperationStatus call in future, specially a non rpc call. We can make the change when we have such a use case.

Thejas M Nair
added a comment - 07/Nov/13 21:15 Carl Steinbach What Vaibhav is saying sounds reasonable to me. It wouldn't be too hard to add support for a non long polling getOperationStatus call in future, specially a non rpc call. We can make the change when we have such a use case.

It wouldn't be too hard to add support for a non long polling getOperationStatus call in future, specially a non rpc call.

Adding a getOperationStatusInternal() or getOperationStatusNonLongPoll() method to work around this problem is exactly the sort of thing I want to avoid. This patch implements a service layer feature, so I think it makes sense that the implementation belongs in the service layer as well. In CLIService.getOperationStatus() you have access to the SessionManager, and from that you can easily get the Session object, the Operation object, the operation type, and the Session's configuration. As an added bonus you also get the OperationHandle which makes it easier to emit useful log messages that reference the session and operation IDs.

Carl Steinbach
added a comment - 08/Nov/13 06:38 It wouldn't be too hard to add support for a non long polling getOperationStatus call in future, specially a non rpc call.
Adding a getOperationStatusInternal() or getOperationStatusNonLongPoll() method to work around this problem is exactly the sort of thing I want to avoid. This patch implements a service layer feature, so I think it makes sense that the implementation belongs in the service layer as well. In CLIService.getOperationStatus() you have access to the SessionManager, and from that you can easily get the Session object, the Operation object, the operation type, and the Session's configuration. As an added bonus you also get the OperationHandle which makes it easier to emit useful log messages that reference the session and operation IDs.

Based on the test case included with the patch it looks like hive.server2.long.polling.timeout sets a lower bound on the total round trip time required to execute a getOperationStatus RPC, e.g. if polling.timeout = 5000 my query can finish after a second but getOperationStatus will still block for another four seconds before returning. Is this accurate?

Carl Steinbach
added a comment - 08/Nov/13 06:50 I posted the patch to reviewboard and left some comments: https://reviews.apache.org/r/15337/
Based on the test case included with the patch it looks like hive.server2.long.polling.timeout sets a lower bound on the total round trip time required to execute a getOperationStatus RPC, e.g. if polling.timeout = 5000 my query can finish after a second but getOperationStatus will still block for another four seconds before returning. Is this accurate?

Meanwhile on your last comment: getOperationStatus returns after hive.server2.long.polling.timeout expires or the query completes - whichever happens first. So if polling.timeout = 5000 and the query finishes in 1000, the call to getOperationStatus will return in 1000. However, in the test case, the test query we are running is very short - therefore to test the behavior of the long polling timeout, we're explicitly setting the config to a smaller time than default, which is 5000; otherwise if the query takes 3000, it will return state FINISHED and we won't be able to test the timeout.

Vaibhav Gumashta
added a comment - 08/Nov/13 07:09 Thanks Carl Steinbach . I'll look the the feedback here and on rb.
Meanwhile on your last comment: getOperationStatus returns after hive.server2.long.polling.timeout expires or the query completes - whichever happens first. So if polling.timeout = 5000 and the query finishes in 1000, the call to getOperationStatus will return in 1000. However, in the test case, the test query we are running is very short - therefore to test the behavior of the long polling timeout, we're explicitly setting the config to a smaller time than default, which is 5000; otherwise if the query takes 3000, it will return state FINISHED and we won't be able to test the timeout.

Regarding moving the long polling timeout logic, I have some comments where I'll appreciate your feedback:
1. Moving that logic to CLIService would mean accessing some of the operation logic within CLIService, and from the general design of CLIService it looks like we wanted to avoid that.
2. This will also mean that the async nature moves to Operation (runAsync, backgroundHandle) from SQLOperation.

What are your thoughts? Thanks for the feedback and discussion so far!

Vaibhav Gumashta
added a comment - 14/Nov/13 00:12 Carl Steinbach Uploaded a patch based on the feedback.
Regarding moving the long polling timeout logic, I have some comments where I'll appreciate your feedback:
1. Moving that logic to CLIService would mean accessing some of the operation logic within CLIService, and from the general design of CLIService it looks like we wanted to avoid that.
2. This will also mean that the async nature moves to Operation (runAsync, backgroundHandle) from SQLOperation.
What are your thoughts? Thanks for the feedback and discussion so far!

Vaibhav Gumashta
added a comment - 18/Feb/14 13:18 Carl Steinbach Thejas M Nair I just uploaded a patch which moves the polling timeout logic to CLIService. It would be awesome if you can take a look. Thanks in advance!