TIME_WAIT and “port reuse”

Recently, during some support work, a customer raised an interesting case involving what they referred to as “port reuse”. This led to quite a nice investigation into the effects of the MSL (Maximum Segment Lifetime) and TIME_WAIT characteristics of TCP. So first we should define these terms and what exactly they mean. Getting an exact definition can be more difficult than expected, because the RFC states one thing but allows vendors the flexibility to change their defaults for better performance. The best illustration I could come up with is the following scenario:

1. Client and server terminate their connection “A” via the use of FIN packets.

2. Client opens another connection “B” to the server, using the same IP addresses and TCP port numbers as “A”.

3. Normal TCP operation.

4. For whatever reason (for example network congestion, latency, or high CPU on intermediate nodes), a delayed TCP packet from connection “A” arrives at the server. Should the server accept this packet? Mark it as a duplicate? Deny it?
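To make the question above concrete, here is a minimal sketch of why a late segment is dangerous. It uses a simplified version of the RFC 793 acceptability test: the segment's first sequence number must fall inside the receive window, evaluated modulo 2^32. It ignores zero windows and segment length. The sequence numbers in the usage lines are made up for illustration.

```python
MOD = 2 ** 32  # TCP sequence numbers are 32 bits wide and wrap around

def seq_in_window(seq: int, rcv_nxt: int, rcv_wnd: int) -> bool:
    """Simplified acceptability test: RCV.NXT <= SEG.SEQ < RCV.NXT + RCV.WND,
    evaluated modulo 2**32 so it still works across wraparound."""
    return (seq - rcv_nxt) % MOD < rcv_wnd

if __name__ == "__main__":
    late_seq_from_a = 1_000_500  # hypothetical segment left over from connection "A"
    window = 65_535
    # Connection "B" reuses the same addresses/ports and, by bad luck, the
    # server now expects sequence numbers starting just below the old ones.
    print(seq_in_window(late_seq_from_a, 999_001, window))        # True: old data accepted into "B"
    # If "B" started far away in sequence space, the late segment is rejected.
    print(seq_in_window(late_seq_from_a, 3_000_000_000, window))  # False
```

The first case is exactly the corruption TIME_WAIT is meant to prevent. The stray segment from “A” looks like valid data for “B”.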

In order to solve the above problem, we have the TIME_WAIT state. TCP requires that the endpoint that initiates an active close of the connection eventually enters TIME_WAIT. Usually TIME_WAIT = 2MSL. In other words:

If a client or server initiates an active close (using FIN packets), it must wait for 2MSL before allowing the same socket to be used again (i.e. the same IP addresses and TCP port numbers).

It is easy to see why this makes sense and why it avoids the problem described in the steps above. If all TCP packets must “die” after time period “MSL”, then surely waiting for twice that amount of time means that no TCP packets from the old connection could possibly still exist on the network. I found a pretty good, if somewhat unclear, diagram over at the University of California webpage:

A couple of points you should notice. As I noted before, only the side that actively closes the TCP connection enters the TIME_WAIT state. You see this illustrated above in the arrow coming from the “ESTABLISHED” state labelled “appl:close”. That label means the application calls close(), which generates a FIN packet, illustrated with “send: FIN”. Which side this will be is highly dependent on the application being used: either the server or the client can be the one that goes into TIME_WAIT.

All well and good. This solves the problem I illustrated before, but it introduces a new problem: starvation of resources. For example, even on a terminal server with very low traffic we already see several TIME_WAIT connections, and keep in mind these connections cannot be used again for 2MSL.
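On a Linux box you can count these sockets by reading /proc/net/tcp (and /proc/net/tcp6). In those files the fourth column is the connection state in hex, and 06 means TIME_WAIT. The sketch below assumes the standard Linux layout of those files. The count_states helper name is hypothetical.

```python
import os
from collections import Counter

# State codes used in /proc/net/tcp (the Linux kernel's TCP state enum).
TCP_STATES = {
    0x01: "ESTABLISHED", 0x02: "SYN_SENT", 0x03: "SYN_RECV",
    0x04: "FIN_WAIT1", 0x05: "FIN_WAIT2", 0x06: "TIME_WAIT",
    0x07: "CLOSE", 0x08: "CLOSE_WAIT", 0x09: "LAST_ACK",
    0x0A: "LISTEN", 0x0B: "CLOSING",
}

def count_states(proc_net_tcp_text: str) -> Counter:
    """Count sockets per TCP state in the text of /proc/net/tcp or tcp6."""
    counts = Counter()
    for line in proc_net_tcp_text.splitlines()[1:]:  # first line is a header
        fields = line.split()
        if len(fields) < 4:
            continue
        state = int(fields[3], 16)  # the "st" column, in hex
        counts[TCP_STATES.get(state, f"UNKNOWN({state:#x})")] += 1
    return counts

if __name__ == "__main__":
    total = Counter()
    for path in ("/proc/net/tcp", "/proc/net/tcp6"):
        if os.path.exists(path):  # absent on non-Linux systems
            with open(path) as f:
                total += count_states(f.read())
    print("TIME_WAIT sockets:", total["TIME_WAIT"])
```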

So you see where this is going… on very high-activity network nodes such as proxies (BlueCoat), we may end up in a situation where we don’t have any more sockets (local ports) to spare, since they are all in the TIME_WAIT state.
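To put numbers on this, here is a back-of-the-envelope sketch. It assumes a single local IP connecting to one destination IP and port, that this host actively closes every connection, and that no TIME_WAIT reuse option is enabled. The port range 32768–61000 is the Linux default quoted in the comments below. The 240 s value is 2MSL with the RFC 793 MSL of 2 minutes, 60 s is the fixed TIME_WAIT length Linux uses, and 30 s is a hypothetical reduced value.

```python
def usable_ports(port_low: int, port_high: int) -> int:
    """Number of ephemeral ports in an inclusive range such as 32768-61000."""
    return port_high - port_low + 1

def max_connection_rate(port_low: int, port_high: int, time_wait_s: float) -> float:
    """Rough upper bound on new connections per second from one local IP to a
    single destination IP:port, assuming this host actively closes every
    connection, so each local port is parked in TIME_WAIT for time_wait_s."""
    return usable_ports(port_low, port_high) / time_wait_s

if __name__ == "__main__":
    for time_wait_s in (240, 60, 30):
        rate = max_connection_rate(32768, 61000, time_wait_s)
        print(f"TIME_WAIT {time_wait_s:>3} s: about {rate:.0f} new connections/s")
```

The range contains 28,233 ports. That works out to roughly 118 connections per second at 240 s, 471 at 60 s and 941 at 30 s. Widening the range to 1024–61000 (59,977 ports) roughly doubles each figure.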

RFC 1122 (section 4.2.2.13), anticipating (or reacting to) this problem, states:

“When a connection is closed actively, it MUST linger in TIME-WAIT state for a time 2xMSL (Maximum Segment Lifetime). However, it MAY accept a new SYN from the remote TCP to reopen the connection directly from TIME-WAIT state, if it:

(1) assigns its initial sequence number for the new connection to be larger than the largest sequence number it used on the previous connection incarnation, and

(2) returns to TIME-WAIT state if the SYN turns out to be an old duplicate”

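OK, so focusing on point (1). Strictly, it constrains the initial sequence number that the host in TIME_WAIT picks for the new incarnation. In practice, BSD-derived stacks and Linux only accept the new SYN if its sequence number is larger than the last one seen on the previous connection. In other words, we can reuse the same socket, but only if the SYN packet contains a sequence number larger than was previously used. Simple to follow… but not when you introduce factors like NAT and non-compliant client operating systems. In most of these cases the same socket pair is reused (especially behind NAT, where many clients share one public address), but the new SYN’s sequence number is not higher than the old one… so the upstream server or proxy ends up resetting the connection, and you get dropped connections and lots of moaning from users.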
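Here is a minimal sketch of the check described in point (1), assuming the behaviour of BSD-derived stacks and Linux. The stack compares the incoming SYN’s sequence number against the next sequence number expected on the old connection, allowing for 32-bit wraparound. Real kernels also consult TCP timestamps, which this ignores. The function names and numbers are hypothetical.

```python
MOD = 2 ** 32  # 32-bit sequence space

def seq_after(a: int, b: int) -> bool:
    """True if sequence number a comes after b, allowing for wraparound
    (the same idea as the Linux kernel's before()/after() helpers)."""
    diff = (a - b) % MOD
    return 0 < diff < 2 ** 31

def accept_syn_in_time_wait(syn_seq: int, old_rcv_nxt: int) -> bool:
    """Hypothetical helper: reopen a 4-tuple that is still in TIME_WAIT only if
    the new SYN's sequence number is beyond the last sequence number received
    on the previous incarnation; otherwise treat it as an old duplicate."""
    return seq_after(syn_seq, old_rcv_nxt)

if __name__ == "__main__":
    old_rcv_nxt = 4_000_000_000
    # A well-behaved client whose new ISN has wrapped past 2**32 is still "after".
    print(accept_syn_in_time_wait(100, old_rcv_nxt))            # True
    # A different host behind the same NAT address picks a lower ISN: rejected.
    print(accept_syn_in_time_wait(3_999_000_000, old_rcv_nxt))  # False
```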

The obvious answer would be to ensure everyone uses a sequence number higher than the one previously used. Easier said than done. The sheer variety of connections, clients, applications and proxies that need to communicate means you cannot easily keep track of which sequence numbers were used. This is especially true because high-traffic networks can have connections numbering in the millions. With the RFC 793 MSL of 2 minutes, the 2MSL period means every sequence number must be remembered for 4 minutes. That would be an architectural headache for any node that tries to do this.

Instead, the RFC takes a liberal stance and allows vendors to reduce the 2MSL period, so that TIME_WAIT states last a shorter amount of time and ports can be reused sooner. This makes things easier to work with. Let’s say we now have two network nodes (these could be a client / server pair, or two forwarding proxies, or what have you… any two nodes that terminate connections). Let’s also say that you have a well-designed and well-behaved network in which you rarely see any late or duplicated packets… so the problem I described in the steps above is of little concern to you. Let’s finally say that you have a high-volume network which presents you with the TIME_WAIT resource starvation I just described. We’re getting close to an ideal situation where shortening the 2MSL timeout on one side of the connection is a valid solution.

Here is where we hit a snag… it’s not always easy to determine which node to apply the 2MSL reduction on. Ideally, one should observe which sockets are being reset due to port reuse, track which applications generate these sockets, and see which side of the connection normally initiates the FIN. In practice, there is rarely the time or will to do all this. It should, however, be easy to spot which side is resetting the connections.
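Spotting the resetting side usually means looking at a packet capture. The stdlib-only script below is a sketch that counts TCP segments with the RST flag set, per source IPv4 address, in a classic libpcap file (for example one written by tcpdump -w). It assumes Ethernet framing and untagged IPv4, and it ignores pcapng, VLAN tags and IPv6.

```python
import socket
import struct
import sys
from collections import Counter

TCP_RST = 0x04
ETH_P_IP = 0x0800
LINKTYPE_ETHERNET = 1

def count_rst_senders(data: bytes) -> Counter:
    """Count TCP segments with RST set, keyed by source IPv4 address."""
    magic = data[:4]  # identifies byte order (microsecond or nanosecond pcap)
    if magic in (b"\xd4\xc3\xb2\xa1", b"\x4d\x3c\xb2\xa1"):
        endian = "<"
    elif magic in (b"\xa1\xb2\xc3\xd4", b"\xa1\xb2\x3c\x4d"):
        endian = ">"
    else:
        raise ValueError("not a classic pcap file")
    linktype = struct.unpack(endian + "I", data[20:24])[0]
    if linktype != LINKTYPE_ETHERNET:
        raise ValueError(f"unsupported link type {linktype}")

    counts = Counter()
    offset = 24  # skip the global header
    while offset + 16 <= len(data):
        _, _, incl_len, _ = struct.unpack(endian + "IIII", data[offset:offset + 16])
        frame = data[offset + 16:offset + 16 + incl_len]
        offset += 16 + incl_len
        if len(frame) < 14 + 20:
            continue  # too short for Ethernet + IPv4 headers
        if struct.unpack("!H", frame[12:14])[0] != ETH_P_IP:
            continue
        ip = frame[14:]
        ihl = (ip[0] & 0x0F) * 4  # IPv4 header length in bytes
        if ip[9] != socket.IPPROTO_TCP or len(ip) < ihl + 14:
            continue
        if ip[ihl + 13] & TCP_RST:  # byte 13 of the TCP header holds the flags
            counts[socket.inet_ntoa(ip[12:16])] += 1
    return counts

if __name__ == "__main__":
    for path in sys.argv[1:]:
        with open(path, "rb") as f:
            for src, n in count_rst_senders(f.read()).most_common():
                print(f"{src}\t{n} RST")
```

Run it with the capture file as its argument (the script file name is up to you). The address with by far the most RSTs is the side rejecting the reused sockets.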

For example, let’s say that, as in my situation, you have two BlueCoat proxies, one forwarding connections to the other: a downstream proxy sending connections to an upstream proxy. Using a packet capture (PCAP), we observe that the upstream proxy is sending all the RST packets because the downstream proxy is re-using the sockets. While it’s not exact, we can assume to a certain degree that if we:

1. Leave the 2MSL timeout on the downstream proxy at the normal 4-minute period: if the client side starts the connection closure, there will be a relatively long period before the downstream tries to re-use the ports.

2. Reduce the 2MSL timeout on the upstream proxy: if the server side starts the connection closure, there will be a relatively short period before the upstream releases the ports for re-use.

Combined, the above two steps should ensure that even if we don’t know which side starts the FIN closure, the downstream will probably not reuse the same ports for a longish time, and the upstream will free up the ports for use after a shortish time.
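The reasoning behind the two steps can be captured in a tiny model. This is a hypothetical sketch, not BlueCoat behaviour. It rests on three assumptions: only the active closer holds TIME_WAIT, the downstream will not pick a port it still holds in TIME_WAIT, and a reused port is reset only if the upstream still holds the old 4-tuple in TIME_WAIT. The 240 s and 30 s values are examples.

```python
def _check(active_closer: str) -> None:
    if active_closer not in ("downstream", "upstream"):
        raise ValueError("active_closer must be 'downstream' or 'upstream'")

def earliest_port_reuse_s(active_closer: str, downstream_tw_s: float) -> float:
    """Seconds after the close before the downstream proxy will pick the same
    source port again: only the active closer holds TIME_WAIT."""
    _check(active_closer)
    return downstream_tw_s if active_closer == "downstream" else 0.0

def rst_risk_window_s(active_closer: str, upstream_tw_s: float) -> float:
    """Seconds after the close during which a reused port may still hit a
    TIME_WAIT entry on the upstream proxy and be reset."""
    _check(active_closer)
    return upstream_tw_s if active_closer == "upstream" else 0.0

if __name__ == "__main__":
    DOWNSTREAM_TW_S, UPSTREAM_TW_S = 240.0, 30.0  # example values for the two steps
    for closer in ("downstream", "upstream"):
        print(f"{closer} closes: port reused after >= "
              f"{earliest_port_reuse_s(closer, DOWNSTREAM_TW_S):.0f} s, "
              f"RST risk for {rst_risk_window_s(closer, UPSTREAM_TW_S):.0f} s")
```

Whichever side closes first, the exposure is bounded by the short upstream timer. That is the point of combining the two steps.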

Let me know if you have ever run into this situation and if the above makes sense :)

10 responses to “TIME_WAIT and “port reuse””

I support a website (apologies, I can’t reveal the name) where I have encountered this problem. The Apache web server and the back-end Java application server (WebLogic) both run on Red Hat EL 5. For now, assume that I start an Apache process that has two threads (thread1 and thread2), and they both connect to WebLogic to serve Java pages.

Occasionally, I have seen a scenario where thread1 has just closed a connection to the WebLogic back-end and this socket hangs around for 1 minute. But before that 1-minute period is over, thread2 re-uses the same socket, and the next thing we see is a TCP reset from the WebLogic server.

On Solaris, I can change the value of tcp_time_wait_interval, but on Linux I don’t see an equivalent TCP parameter. There is a tcp_fin_timeout parameter, but that is theoretically for setting the FIN_WAIT_2 period. In your experience, do you think reducing tcp_fin_timeout to 30 seconds will help? I could change the MSL timeout period, but that would involve re-compiling the kernel. A no-go area. One option I can think of to resolve this problem is to widen the TCP port range. At present I use 32768 to 61000. Perhaps I can change it to 1024 to 61000.

Can you think of any other setting or solution to reduce the TIME_WAIT on Linux?

Well, the documentation is a bit confusing for Linux systems. It seems to equate FIN_WAIT with TIME_WAIT. I usually reduce the FIN_WAIT timeout anyway, but I can’t tell for sure if that makes a difference because I have never put the servers under rigorous testing.

However, by far the best solution I have found is, as you said, to increase the port range. This usually works really well for me. Another thing that works, but may be difficult to implement, is to make the client send a RST instead of a FIN. It’s not elegant, but it usually clears the connection.
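Making one side send a RST instead of a FIN, as suggested in the reply above, is normally done with an “abortive close”. The socket option SO_LINGER is set to on with a zero timeout before calling close(), so the closing side skips TIME_WAIT entirely. The sketch below assumes a platform where struct linger is two C ints, as on Linux. The abortive_close helper name is hypothetical. Any unsent data is discarded and the peer sees a connection-reset error, which is why this is not elegant.

```python
import socket
import struct

def abortive_close(sock: socket.socket) -> None:
    """Close with SO_LINGER {l_onoff=1, l_linger=0}: the kernel discards unsent
    data and sends RST instead of FIN, so this side never enters TIME_WAIT."""
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_LINGER, struct.pack("ii", 1, 0))
    sock.close()

if __name__ == "__main__":
    server = socket.create_server(("127.0.0.1", 0))  # any free local port
    client = socket.create_connection(server.getsockname())
    conn, _ = server.accept()
    abortive_close(client)
    conn.settimeout(5)
    try:
        conn.recv(1)
    except ConnectionResetError:
        print("peer saw RST, not FIN")
    finally:
        conn.close()
        server.close()
```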

I had a problem with TIME_WAIT in my environment, and the only solution I found was enabling reuse of TCP connections with the Linux command:
echo 1 > /proc/sys/net/ipv4/tcp_tw_reuse

But I didn’t understand why we have to explicitly enable the reuse of connections and why TIME_WAIT lasts so long. Your article explained these issues perfectly and gave me more options for better settings.
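For anyone wanting to check the setting before changing it, the sketch below reads the relevant values from /proc/sys. As far as I know, tcp_tw_reuse only affects new outgoing connections, and it relies on TCP timestamps (net.ipv4.tcp_timestamps) to decide when reuse is safe. The value meanings follow the kernel’s ip-sysctl documentation, and value 2 (loopback only) exists only on newer kernels. The read_sysctl helper is hypothetical.

```python
from pathlib import Path
from typing import Optional

# Meanings documented for net.ipv4.tcp_tw_reuse (value 2 only on newer kernels).
TW_REUSE_MEANING = {
    "0": "disabled",
    "1": "globally enabled",
    "2": "enabled for loopback traffic only",
}

def read_sysctl(name: str, root: str = "/proc/sys") -> Optional[str]:
    """Read a sysctl such as 'net.ipv4.tcp_tw_reuse'; None if it is absent."""
    path = Path(root, *name.split("."))
    try:
        return path.read_text().strip()
    except OSError:
        return None

if __name__ == "__main__":
    for name in ("net.ipv4.tcp_tw_reuse", "net.ipv4.tcp_timestamps"):
        print(f"{name} = {read_sysctl(name)}")
    reuse = read_sysctl("net.ipv4.tcp_tw_reuse")
    if reuse is not None:
        print("tcp_tw_reuse:", TW_REUSE_MEANING.get(reuse, "unknown value"))
```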