Details

Description

If a network issue causes an AM web app to resolve the RM proxy's address to something other than what's listed in the allowed proxies list, the AmIpFilter will 302 redirect the RM proxy's request back to the RM proxy. The RM proxy will then consume all available handler threads connecting to itself over and over, resulting in an outage of the web UI.

Activity

One add-on note that we can likely also address with this one: The AmIpFilter resolves the proxy addresses to host addresses (getAllByName, getHostAddress) every single time a request is made to it, vs. caching it upfront. I think we should not try to resolve it on-request unless we have errors, cause the proxy address list does not usually change over time on an already running AM?

Harsh J
added a comment - 04/Mar/16 23:05 One add-on note that we can likely also address with this one: The AmIpFilter resolves the proxy addresses to host addresses (getAllByName, getHostAddress) every single time a request is made to it, vs. caching it upfront. I think we should not try to resolve it on-request unless we have errors, cause the proxy address list does not usually change over time on an already running AM?

Harsh J, I think that's a worthy improvement, but I don't think it's exactly related. I found a couple of items that should be cleaned up in the current web UI code. I'll file another JIRA for those cleanup items.

Daniel Templeton
added a comment - 08/Mar/16 20:10 Harsh J , I think that's a worthy improvement, but I don't think it's exactly related. I found a couple of items that should be cleaned up in the current web UI code. I'll file another JIRA for those cleanup items.

Just noticed that this patch will break if the AM is on the same host as the RM. I'll post an updated patch later. For now, though, I'd still love some feedback on how I'm (ab)using the web UI framework.

Daniel Templeton
added a comment - 08/Mar/16 20:41 Just noticed that this patch will break if the AM is on the same host as the RM. I'll post an updated patch later. For now, though, I'd still love some feedback on how I'm (ab)using the web UI framework.

In the current proxy code we have an override set that allows circular redirects.
By default the jetty service has: http.protocol.allow-circular-redirects set to false and we override it to true to allow some failure cases to redirect. This was done in MAPREDUCE-3706. The real cause of the circular redirect was fixed in MAPREDUCE-3930.

If we remove the change for the circular redirect again would that not fix this issue and show a error?

Wilfred Spiegelenburg
added a comment - 09/Mar/16 00:10 In the current proxy code we have an override set that allows circular redirects.
By default the jetty service has: http.protocol.allow-circular-redirects set to false and we override it to true to allow some failure cases to redirect. This was done in MAPREDUCE-3706 . The real cause of the circular redirect was fixed in MAPREDUCE-3930 .
If we remove the change for the circular redirect again would that not fix this issue and show a error?

Wilfred Spiegelenburg, no that won't solve the issue. The problem is that it's not a circular redirect. A circular redirect is A->A->A... This is a loop of: A contacts B and is redirected to A. From the HttpClient's perspective, there's nothing wonky going on here.

Daniel Templeton
added a comment - 09/Mar/16 14:42 Wilfred Spiegelenburg , no that won't solve the issue. The problem is that it's not a circular redirect. A circular redirect is A->A->A... This is a loop of: A contacts B and is redirected to A. From the HttpClient 's perspective, there's nothing wonky going on here.

This patch also works for single-node clusters. The basic idea is that the AmIpFilter appends /redirect to the URL when it redirects to the web proxy. (It only does that when it's contacted by a host other than the web proxy.) The web proxy looks for the /redirect when it receives a request. If it's there, and if the request originated from the web proxy itself, it instead redirects to an error page, ending the endless loop.

The error page is implemented within the app page and is triggered by appending an "R" to the end of the app ID. If the app page sees the "R", it prints the error instead of the app page. This part is the part I'm least confident about being a reasonable thing to do. Feedback welcome.

Daniel Templeton
added a comment - 09/Mar/16 18:54 This patch also works for single-node clusters. The basic idea is that the AmIpFilter appends /redirect to the URL when it redirects to the web proxy. (It only does that when it's contacted by a host other than the web proxy.) The web proxy looks for the /redirect when it receives a request. If it's there, and if the request originated from the web proxy itself, it instead redirects to an error page, ending the endless loop.
The error page is implemented within the app page and is triggered by appending an "R" to the end of the app ID. If the app page sees the "R", it prints the error instead of the app page. This part is the part I'm least confident about being a reasonable thing to do. Feedback welcome.

Here's one more patch that kills one more of the checkstyle issues. The last two are line length complaints on lines that cannot be shortened without compromising readability.

Ashwin Shankar, I'd love to hear more about what you're encountering. Have you tried applying this patch in your environment? The scenario where the redirect loop happens is pretty unusual. How come you're seeing it so often?

Daniel Templeton
added a comment - 25/May/16 13:43 Here's one more patch that kills one more of the checkstyle issues. The last two are line length complaints on lines that cannot be shortened without compromising readability.
Ashwin Shankar , I'd love to hear more about what you're encountering. Have you tried applying this patch in your environment? The scenario where the redirect loop happens is pretty unusual. How come you're seeing it so often?

More an observation. We seem to be appending "R" at the end to capture a redirection. It seems a little hacky. The alternative of defining a special class with a boolean to capture redirection seems rather heavy and unnecessary. Can we add a more simple test that verifies this addition of "R" so it is easy for the developer changing surrounding code in the future.

AppBlock#generateOverviewTable: The ternary operators aren't following the coding conventions and are also hard to read. Mind updating them to make sure "?" and the associated value are on the same line, and ":" and the associated value are on the same line? And, for nested ternary operator, additional indentation is good. For instance,

Karthik Kambatla
added a comment - 25/May/16 17:01 - edited The patch looks good to me, but for the following nits:
More an observation. We seem to be appending "R" at the end to capture a redirection. It seems a little hacky. The alternative of defining a special class with a boolean to capture redirection seems rather heavy and unnecessary. Can we add a more simple test that verifies this addition of "R" so it is easy for the developer changing surrounding code in the future.
AppBlock#generateOverviewTable: The ternary operators aren't following the coding conventions and are also hard to read. Mind updating them to make sure "?" and the associated value are on the same line, and ":" and the associated value are on the same line? And, for nested ternary operator, additional indentation is good. For instance,
app.getTrackingUrl() == null || app.getTrackingUrl().equals(UNAVAILABLE)
? null
: root_url(app.getTrackingUrl()),
app.getTrackingUrl() == null || app.getTrackingUrl().equals(UNAVAILABLE)
? "Unassigned"
: app.getAppState() == YarnApplicationState.FINISHED ||
app.getAppState() == YarnApplicationState.FAILED ||
app.getAppState() == YarnApplicationState.KILLED
? "History"
: "ApplicationMaster" );
Xuan Gong , Vinod Kumar Vavilapalli - you are more familiar with the redirection and proxy code. Mind taking a quick look? I think we should try and get this into 2.8.0.

Daniel Templeton
added a comment - 25/May/16 19:10 Not that simple. It's not a single connection that's looping. Each
connection is only redirected once. The pattern is: new connection ->
redirect -> new connection -> redirect...

I'm not aware of a TTL at the HTTP level. It's technically possible to use an HTTP header to flag that the incoming connection is a redirect rather than encoding it in the URL, but it would be a pretty invasive change. The code that handles the redirects is buried pretty deep. I think it may even be in a bundled library. Getting to it to make it respect the header probably isn't worth it.

Daniel Templeton
added a comment - 25/May/16 22:20 The full trace looks like this:
Thread 1) Proxy receives incoming request
Thread 1) Proxy opens connection to the NM
Thread 1) NM sends redirect back to the proxy
Thread 1) Proxy opens a new connection to the proxy
Thread 2) Proxy receives incoming request
Thread 2) Proxy opens connection to the NM
Thread 2) NM sends redirect back to the proxy
Thread 2) Proxy opens a new connection to the proxy
And so on.
I'm not aware of a TTL at the HTTP level. It's technically possible to use an HTTP header to flag that the incoming connection is a redirect rather than encoding it in the URL, but it would be a pretty invasive change. The code that handles the redirects is buried pretty deep. I think it may even be in a bundled library. Getting to it to make it respect the header probably isn't worth it.

Here's a patch that cleans up those silly nested ternary operators. As soon as we get some consensus that this approach is the right approach, I'll work on adding tests around the redirect tagging. I agree with Karthik Kambatla that the current approach is hacky, but I don't see a less hacky approach at the moment that is not overkill.

Daniel Templeton
added a comment - 01/Jun/16 22:28 Here's a patch that cleans up those silly nested ternary operators. As soon as we get some consensus that this approach is the right approach, I'll work on adding tests around the redirect tagging. I agree with Karthik Kambatla that the current approach is hacky, but I don't see a less hacky approach at the moment that is not overkill.

hey Daniel Templeton,
Thanks much for rebasing the patch! Just to give you some context on what we see at my company - we first got complaints from users that they cannot access the AM UI.Since these http requests go through the Web proxy, we looked at that process and found that it was unresponsive since all its threads were busy.When we listed open file descriptors, we saw that the webproxy had many connections from itself to itself, which seemed weird then, but makes sense now since its due to AM redirecting requests back to proxy. Web proxy logs showed that most of the requests were made to one or two specific apps. We then looked at that app's AM logs and found "UnknownHostException" when AM was trying to resolve proxy host(which is basically the master node where RM runs) in AmIpFilter code. We believe it wasn't able to resolve due to an intermittent network event to the DNS but its not conclusive. Overall, this issue has been occurring pretty much once every week and we have to bounce the webproxy to fix it.

Ashwin Shankar
added a comment - 09/Jun/16 17:47 - edited hey Daniel Templeton ,
Thanks much for rebasing the patch! Just to give you some context on what we see at my company - we first got complaints from users that they cannot access the AM UI.Since these http requests go through the Web proxy, we looked at that process and found that it was unresponsive since all its threads were busy.When we listed open file descriptors, we saw that the webproxy had many connections from itself to itself, which seemed weird then, but makes sense now since its due to AM redirecting requests back to proxy. Web proxy logs showed that most of the requests were made to one or two specific apps. We then looked at that app's AM logs and found "UnknownHostException" when AM was trying to resolve proxy host(which is basically the master node where RM runs) in AmIpFilter code. We believe it wasn't able to resolve due to an intermittent network event to the DNS but its not conclusive. Overall, this issue has been occurring pretty much once every week and we have to bounce the webproxy to fix it.
Thanks Karthik Kambatla for the review! Xuan Gong , Vinod Kumar Vavilapalli please take a look at the patch when you get a chance. we would like to backport it as soon as its committed.

Daniel Templeton - I am comfortable with the approach here. I have some nits to point out, but will post the comments along with review of tests. Can we add tests and make progress towards getting this in?

Karthik Kambatla
added a comment - 09/Aug/16 17:01 Daniel Templeton - I am comfortable with the approach here. I have some nits to point out, but will post the comments along with review of tests. Can we add tests and make progress towards getting this in?
Xuan Gong and Vinod Kumar Vavilapalli - chime in if you can. I will interpret your silence as a go-ahead

Yishan Yang
added a comment - 09/Aug/16 18:47 Is that possible to provide a patch for hadoop 2.7.2, it's kind of hard to back port this patch to 2.7.2 version. Since 2.8.0 won't be released shortly. Thanks

I also ran into the same issue a couple of months back, but didn't see this JIRA.

Just looked at the patch. It took me a little while, but I understood the problem and your patch.

The approach overall looks good to me.

I am sure you had to do a bit of testing and bringing that back may be some effort, but I got few comments

The "append-/redirect" approach will fail if the AM URIs have query parameters.

This is because, on the WebAppProxyServlet side -> methodAction(), it looks at HttpServletRequest.getPathInfo() which "returns any extra path information associated with the URL the client sent when it made this request. The extra path information follows the servlet path but precedes the query string and will start with a / character.".

How about we instead go with a String target = redirectUrl + "/redirect" + httpReq.getRequestURI(); approach? Or something like that as a prefix..

Also if we do that, we won't need (a) redirecting to the app-page (b) the internal "R" hack to let the app-pages know (c) changing AppBlock etc? We can simply look for paths of the form "http://RM-proxyAddress/proxy/cluster/application_id/redirect" and treat them separately by sending them to an error-page?

Also inject the 'wrong' proxy-address and add a test-case which proves the patch?

Let me know how I can help in push this forward.

Also looking at a 2.8.0 RC0 this weekend / beginning next week - let's see if we can get this moving before that?

Vinod Kumar Vavilapalli
added a comment - 19/Aug/16 22:37 Apologies Daniel Templeton , I haven't been looking at this.
I also ran into the same issue a couple of months back, but didn't see this JIRA.
Just looked at the patch. It took me a little while, but I understood the problem and your patch.
The approach overall looks good to me.
I am sure you had to do a bit of testing and bringing that back may be some effort, but I got few comments
The "append-/redirect" approach will fail if the AM URIs have query parameters.
This is because, on the WebAppProxyServlet side -> methodAction() , it looks at HttpServletRequest.getPathInfo() which "returns any extra path information associated with the URL the client sent when it made this request. The extra path information follows the servlet path but precedes the query string and will start with a / character." .
How about we instead go with a String target = redirectUrl + "/redirect" + httpReq.getRequestURI(); approach? Or something like that as a prefix..
Also if we do that, we won't need (a) redirecting to the app-page (b) the internal "R" hack to let the app-pages know (c) changing AppBlock etc? We can simply look for paths of the form "http://RM-proxyAddress/proxy/cluster/application_id/redirect" and treat them separately by sending them to an error-page?
Also inject the 'wrong' proxy-address and add a test-case which proves the patch?
Let me know how I can help in push this forward.
Also looking at a 2.8.0 RC0 this weekend / beginning next week - let's see if we can get this moving before that?

Vinod Kumar Vavilapalli
added a comment - 29/Aug/16 19:34 Daniel Templeton , the latest patch looks much better to me.
Also like that the unit tests verify the added code in a pin-pointy way.
There's one bug in the code
this .failurePageUrlBase =
StringHelper.pjoin(WebAppUtils.getResolvedRMWebAppURLWithScheme(conf),
"cluster" , "failure" );
You put cluster prefix here, but don't expect that prefix in other places.
Can you please address Jenkins warnings too?

This patch addresses the checkstyle issues. The javadoc complaints are sadly unavoidable.

The "cluster" in the failure page path is correct. See the rmAppPageUrlBase above it. Without the "cluster" there, the patch doesn't work. Do you mean that I should be testing with the "cluster" prefix in the test files? Since the redirection bit doesn't actually care what the path is, I wasn't, but if it will brighten your day, I can do that.

Daniel Templeton
added a comment - 29/Aug/16 21:00 - edited This patch addresses the checkstyle issues. The javadoc complaints are sadly unavoidable.
The "cluster" in the failure page path is correct. See the rmAppPageUrlBase above it. Without the "cluster" there, the patch doesn't work. Do you mean that I should be testing with the "cluster" prefix in the test files? Since the redirection bit doesn't actually care what the path is, I wasn't, but if it will brighten your day, I can do that.

Oh, whoops. Missed the unit test failure. Huh. It passes for me locally. I'll take a look. (Correction: I was passing for me locally. I must have broken it without noticing. Looks like you're right...)

Daniel Templeton
added a comment - 29/Aug/16 21:10 - edited Oh, whoops. Missed the unit test failure. Huh. It passes for me locally. I'll take a look. (Correction: I was passing for me locally. I must have broken it without noticing. Looks like you're right...)

The latest patch looks good to me. Daniel Templeton - is the javadoc related, and does it need fixing?

Vinod Kumar Vavilapalli - from your comments here, I take it you are comfortable with the approach and wanted the tests to be fixed. Would be nice if you could take another look, but otherwise let me assume you are a +0 and commit it once the javadocs are cleared.

Karthik Kambatla
added a comment - 22/Sep/16 01:40 The latest patch looks good to me. Daniel Templeton - is the javadoc related, and does it need fixing?
Vinod Kumar Vavilapalli - from your comments here, I take it you are comfortable with the approach and wanted the tests to be fixed. Would be nice if you could take another look, but otherwise let me assume you are a +0 and commit it once the javadocs are cleared.

Daniel Templeton
added a comment - 23/Sep/16 03:03 The javadoc complaints are all about the fact that _ is used as a variable name. I'd really love to fix that, but it's way out of scope for this patch.

Apologies for not committing it last week before I went away. Now, the patch does not apply cleanly. Daniel Templeton - could you please upload another patch. Will try my best to commit it soon this time around.

Karthik Kambatla
added a comment - 03/Oct/16 01:21 Apologies for not committing it last week before I went away. Now, the patch does not apply cleanly. Daniel Templeton - could you please upload another patch. Will try my best to commit it soon this time around.