Some initial context to this question. Currently I have an application cluster deployed behind an ALB, which maintains persistent keep-alive connections to the application. This application is under continuous heavy load and must have very high uptime. The ALB has been sending back 502 Bad Gateway status codes from this service. Digging deeper, after taking a pcap capture and a sysdig capture on the affected instances, we see the following (ordered by sequence of events):

As stated above, it appears that our Node.js application reaches 5 seconds of inactivity on a keep-alive connection (the default keep-alive timeout period), then receives a network request, then closes the socket, and finally responds to the queued network request with an RST.

So it appears that the 502s are due to a race condition where a new request arrives from the load balancer just before or during the TCP teardown.
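For reference, the 5-second default mentioned above can be confirmed directly on a Node HTTP server (a minimal sketch; keepAliveTimeout has been exposed on http.Server since Node 8):

    const http = require('http');

    const server = http.createServer((req, res) => {
      res.end('ok');
    });

    // Node's default timeout for idle keep-alive sockets is 5000 ms,
    // which is the teardown trigger described above.
    console.log(server.keepAliveTimeout); // 5000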

The most apparent solution to this problem would be to ensure that the load balancer is the source of truth when tearing down these connections, by keeping the load balancer's idle timeout lower than the timeout on the application server. This works with AWS Classic Load Balancers, but not with ALBs, according to their docs:

You can set an idle timeout value for both Application Load Balancers and Classic Load Balancers. The default value is 60 seconds. With an Application Load Balancer, the idle timeout value applies only to front-end connections.
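For reference, the front-end idle timeout the docs refer to is configured as a load balancer attribute. A rough sketch with the aws-sdk ELBv2 client (the ARN and region below are placeholders):

    const AWS = require('aws-sdk');
    const elbv2 = new AWS.ELBv2({ region: 'us-east-1' }); // placeholder region

    // Sets the *front-end* idle timeout; per the docs quoted above, there is
    // no equivalent attribute for back-end connections.
    elbv2.modifyLoadBalancerAttributes({
      LoadBalancerArn: 'arn:aws:elasticloadbalancing:...', // placeholder ARN
      Attributes: [{ Key: 'idle_timeout.timeout_seconds', Value: '60' }]
    }, (err, data) => {
      if (err) console.error(err);
      else console.log(data.Attributes);
    });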

Could someone speculate as to why AWS may have removed the backend idle timeout (I'm assuming it is infinite)? I could set the keep-alive timeout on the Node server to Infinity as well, but should I worry about leaking sockets? Are there any other server technologies that handle this problem more gracefully that I could apply to fix this issue (without using Classic Load Balancers)?

Also, AWS support states that the ALB will not respect a Keep-Alive header sent back from the service.

2 Answers

We cannot know the reason why they don't have it, nor the AWS design and implementation decisions that cause this behaviour. Only the people who work on this feature at Amazon know, and they are most likely under NDA.

It doesn't seem reasonable to believe that ALB keeps back-end connections open indefinitely; more likely, the configurable timeout simply applies only to front-end connections, and the back-end timer is not exposed.

I'm a little concerned by how much time elapsed between the arrival and ACK of the request at :32.042 and the fd close at :32.066. You're timing out a connection that is arguably not actually idle -- it accepted a request 24ms earlier. (!?) To me, that's a surprisingly "long" time.

You should not need to worry about leaking descriptors since ALB won't open connections it doesn't actually need for serving requests... but you should not need an infinite timeout, either.

The question seems to be how long ALB holds idle back-end connections open -- which appears to be undocumented but I will review my logs and see if I can find evidence to suggest what the timer may be set to, assuming it's static. (Holding back-end connections open is of course intended as a performance optimization.)

Intuition suggests you might try a 75-second timer on your side. Those are the defaults I established based on Classic Load Balancer behavior, and I have observed no issues with ALBs dropped into their place.
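On a Node server that would look something like the following (a sketch only; the 75-second figure is the suggestion above, and headersTimeout only exists on newer Node releases):

    const http = require('http');

    const server = http.createServer((req, res) => res.end('ok'));

    // Keep idle sockets open longer than the ALB's 60-second front-end idle
    // timeout, so the balancer (not the app) initiates connection teardown.
    server.keepAliveTimeout = 75 * 1000;

    // On Node versions that have it (10.15.2+), headersTimeout should exceed
    // keepAliveTimeout, or a similar teardown race can still occur.
    server.headersTimeout = 76 * 1000;

    server.listen(3000);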

As this is a Node server, I'm guessing that the 24 ms is spent processing a CPU-bound task, so the event loop is blocked. That probably plays a role in this problem, or at least exacerbates it. According to AWS, the backend is expected to close the connection, and they also reiterated that the ALB has no backend idle timeout. They recommend implementing a retry (see the sketch below). Somewhat surprising; it seems like this edge case would always exist.
– andrsnn Dec 20 '17 at 16:44
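A minimal client-side retry along the lines AWS suggests might look like this (a sketch only; the host and path are placeholders, and only idempotent requests should be retried blindly):

    const http = require('http');

    // Retry on 502s and connection resets, since the teardown race described
    // in the question produces transient failures of exactly those kinds.
    function getWithRetry(options, retries, callback) {
      const req = http.get(options, (res) => {
        if (res.statusCode === 502 && retries > 0) {
          res.resume(); // drain so the socket is released
          return getWithRetry(options, retries - 1, callback);
        }
        callback(null, res);
      });
      req.on('error', (err) => {
        if (err.code === 'ECONNRESET' && retries > 0) {
          return getWithRetry(options, retries - 1, callback);
        }
        callback(err);
      });
    }

    getWithRetry({ host: 'my-alb.example.com', path: '/' }, 2, (err, res) => {
      if (err) console.error(err);
      else console.log(res.statusCode);
    });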