Why Are Linux Kernel Protocol Stacks Dropping SYN Packets

This blog focuses on network problems related to the TCP protocol stack, specifically the issue where no SYNACK was returned to the client.

When troubleshooting network problems, it is common to encounter TCP connection failures. If you can obtain the packets captured at both ends, the packets will look like the following:

The client has been retransmitting TCP SYN in exponential backoff.

This is because no RTT sample exists yet for the first packet, so the RTO cannot be derived from a measurement. The SYN is therefore retransmitted after roughly 1, 2, 4, 8, ... seconds, until net.ipv4.tcp_syn_retries retransmissions have been made.
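The backoff described above can be sketched as a small calculation. This is only an illustration: the initial RTO of 1 second and the default of 6 retries are assumptions (check sysctl net.ipv4.tcp_syn_retries on the machine in question), the function name is hypothetical, and the kernel's upper bound on the RTO is ignored.

```python
def syn_retransmit_schedule(syn_retries=6, initial_rto=1.0):
    """Model when a client sends its SYN and each retransmission.

    syn_retries : assumed value of net.ipv4.tcp_syn_retries
    initial_rto : assumed initial retransmission timeout in seconds
    Returns (send_times, give_up_time), both in seconds after the first SYN.
    """
    send_times = [0.0]      # the original SYN
    rto = initial_rto
    now = 0.0
    for _ in range(syn_retries):
        now += rto          # wait one RTO, then retransmit
        send_times.append(now)
        rto *= 2            # exponential backoff
    give_up_time = now + rto  # the last retransmission also waits one RTO
    return send_times, give_up_time


times, give_up = syn_retransmit_schedule()
print("SYN sent at:", times)
print("connect() gives up after about", give_up, "seconds")
```

Under these assumptions, a capture that shows SYNs spaced 1, 2, 4, 8, ... seconds apart with no reply matches the pattern described above.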

It can be seen at the server that the TCP SYN packet has reached the NIC, but there are no return TCP packets.

This problem occurs relatively frequently. This blog focuses on the cases that are related to the TCP protocol stack, meaning that the TCP SYN packet has reached the TCP processing module of the kernel, but the server-side kernel code did not return a SYNACK to the client. A client's persistent retransmission of TCP SYN can also have other causes, such as asymmetric paths caused by multiple NICs on the server side, or SYN packets blocked by iptables rules, but these are not discussed here. Rather, we will focus on the most common causes.

Processing TCP SYN while in the Listen State

In this tutorial, we use the widely deployed kernel version of CentOS 7. First, let's look at the main logic of how TCP processes a SYN, and analyze the points that may cause problems based on case-handling experience. The logic by which a socket in the LISTEN state processes the first TCP SYN packet is as follows:

tcp_v4_do_rcv() @net/ipv4/tcp_ipv4.c
|--> tcp_rcv_state_process() @net/ipv4/tcp_input.c // This function implements the processing of received segments in most TCP states (all except ESTABLISHED and TIME-WAIT), including of course the LISTEN state we are interested in.
|--> tcp_v4_conn_request() @net/ipv4/tcp_ipv4.c // When the TCP socket is in the LISTEN state and the SYN flag is set in the received segment, processing reaches this function.

The kernel code shipped with CentOS may differ slightly from upstream. If you need to find the exact line numbers in the source code, systemtap is a suitable tool for this purpose, as shown below:

In the code of tcp_v4_conn_request(), the first few lines of the function logic are as follows:

The precondition for entering this function is that the TCP socket is in the Listen state, and the TCP SYN flag is set in the received packet. After entering the function logic, we can find that the function should consider various possible exceptions, but many of them are not actually that common. For example, the two situations in the first few lines:

Line 1482: Reject packets sent to broadcast and multicast.

Line 1490: If the request queue (the queue for storing SYN packets) is full, the isn is 0, and the want_cookie is false, the SYN packet is dropped.

The first case is relatively easy to understand, and has not been seen in practice. But, the second case is a little more complicated, and may be encountered with some probability in practice. Let's take a look at it below:

The first condition, a full request queue, is actually easily met; a SYN flood attack can easily cause it. isn is assigned from TCP_SKB_CB(skb)->when at the beginning of the function. This is the field used to compute the RTT in the TCP control block structure. want_cookie indicates whether SYN cookies are used. It is the return value of tcp_syn_flood_action(), whose definition is as follows. If the kernel is built with CONFIG_SYN_COOKIES (guarded by #ifdef) and the kernel parameter net.ipv4.tcp_syncookies is also set to 1, the function returns true, so want_cookie is true.

Therefore, in the above case of dropping SYN packets, the real precondition is that the kernel parameter net.ipv4.tcp_syncookies is not enabled. However, in actual production systems, net.ipv4.tcp_syncookies is enabled by default. SYN cookies defend against SYN flood attacks by trading time (CPU computation) for space (the request queue). In actual production, there is no scenario in which you need to turn this switch off explicitly. So in general, the drop at line 1490 is not very common.
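The decision described above can be restated as a small model. This is a sketch of the logic as explained in the text, not the kernel source; the function and parameter names are hypothetical.

```python
def drops_syn_on_full_request_queue(queue_full, isn,
                                    syncookies_compiled, tcp_syncookies):
    """Model the 'request queue full' check described in the text.

    queue_full          : True if the request (SYN) queue is full
    isn                 : value taken from the packet's control block (0 or not)
    syncookies_compiled : True if the kernel was built with CONFIG_SYN_COOKIES
    tcp_syncookies      : value of net.ipv4.tcp_syncookies
    Returns True if the SYN is dropped at this point.
    """
    if queue_full and isn == 0:
        # want_cookie is true only when SYN cookies are both built in and enabled
        want_cookie = syncookies_compiled and tcp_syncookies != 0
        if not want_cookie:
            return True   # no cookie fallback available: drop the SYN
    return False          # continue processing (possibly with a SYN cookie)


# With SYN cookies enabled, a full request queue alone does not cause a drop.
print(drops_syn_on_full_request_queue(True, 0, True, 1))   # False
# With SYN cookies disabled, the same SYN is dropped.
print(drops_syn_on_full_request_queue(True, 0, True, 0))   # True
```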

Main Scenarios of Kernel Dropping SYN packets

Below are the two main scenarios in which SYN packets may be dropped, and how to quickly determine why the server does not return a SYNACK.

1. The Per-host PAWS Check Results in the Drop of SYN Packets

Symptom

This is the most common problem in the actual production environment: For servers with both the net.ipv4.tcp_tw_recycle and the net.ipv4.tcp_timestamps enabled, the probability of this problem is very high when the server has NAT client access. From the client side, the symptom of this problem is that the new connection is unstable. Sometimes it can be connected and sometimes it cannot.

Per-Host PAWS Principle

For background information, PAWS is short for Protect Against Wrapped Sequences, a mechanism that protects against old segments being accepted after sequence numbers wrap around. "Per-host" means that the check is keyed on the IP address of the peer host rather than on the 4-tuple of IP addresses and ports.

The per-host PAWS check works as follows: for the peer host IP of a TIME_WAIT socket that has been quickly recycled, in order to prevent interference from old data from the same host, the TCP Timestamps option of a new SYN packet must increase within the following 60 seconds. When the client is in a NAT environment, this condition is often not met.

In principle, remembering the sentence above is enough to diagnose cases where the three-way handshake from many clients sometimes succeeds and sometimes fails. For more information, see the detailed explanation below.

Why Does Per-Host PAWS Exist?

Allow old duplicate segments to expire. To replace this function of TIME-WAIT state, a mechanism would have to operate across connections. PAWS is defined strictly within a single connection; the last timestamp is TS.Recent is kept in the connection control block, and discarded when a connection is closed.

An additional mechanism could be added to the TCP, a per-host cache of the last timestamp received from any connection. This value could then be used in the PAWS mechanism to reject old duplicate segments from earlier incarnations of the connection, if the timestamp clock can be guaranteed to have ticked at least once since the old connection was open. This would require that the TIME-WAIT delay plus the RTT together must be at least one tick of the sender's timestamp clock. Such an extension is not part of the proposal of this RFC.

Note that this is a variant on the mechanism proposed by Garlick, Rom, and Postel [Garlick77], which required each host to maintain connection records containing the highest sequence numbers on every connection. Using timestamps instead, it is only necessary to keep one quantity per remote host, regardless of the number of simultaneous connections to that host.

The code comments in tcp_minisocks.c also explain why TIME_WAIT is needed, and mention the PAWS mechanism as the theoretical basis for fast recycling of TIME_WAIT:

Main purpose of TIME-WAIT state is to close connection gracefully, when one of ends sits in LAST-ACK or CLOSING retransmitting FIN (and, probably, tail of data) and one or more our ACKs are lost.

What is TIME-WAIT timeout? It is associated with maximal packet lifetime in the internet, which results in wrong conclusion, that it is set to catch "old duplicate segments" wandering out of their path. It is not quite correct. This timeout is calculated so that it exceeds maximal retransmission timeout enough to allow to lose one (or more) segments sent by peer and our ACKs. This time may be calculated from RTO.

When TIME-WAIT socket receives RST, it means that another end finally closed and we are allowed to kill TIME-WAIT too.

Second purpose of TIME-WAIT is catching old duplicate segments. Well, certainly it is pure paranoia, but if we load TIME-WAIT with this semantics, we MUST NOT kill TIME-WAIT state with RSTs.

If we invented some more clever way to catch duplicates (f.e. based on PAWS), we could truncate TIME-WAIT to several RTOs.

According to the RFC text and the kernel code comments above, we can see that the Linux kernel implements a fast recycling mechanism for the TIME-WAIT state. Linux can abandon the 60-second TIME-WAIT period and shorten it to 3.5 times the RTO, because it uses a "smarter" method to catch old duplicate packets (for example, one based on the PAWS mechanism). Specifically, Linux uses per-host PAWS to prevent packets from previous connections from being accepted into new connections.

Linux Kernel Implementation

In tcp_ipv4.c, when a SYN is received and before it is accepted, the kernel checks whether the peer is proven, that is, performs the per-host PAWS check, if the following two conditions are met:

The received packet has the TCP Timestamps option.

The kernel parameter net.ipv4.tcp_tw_recycle is enabled on the server.

In tcp_metrics.c, the implementation logic of Linux per-host PAWS is as follows. Simply put, as mentioned at the beginning of this section: The TCP Timestamps option of the new SYN packet needs to be increased within 60 seconds.
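The check summarized above can be modeled in a few lines. This is a sketch under stated assumptions, not the kernel source: the 60-second window comes from the text, the tolerance of 1 timestamp tick and all function and parameter names are assumptions made for illustration, and timestamps are treated as 32-bit values compared with wraparound-aware signed arithmetic.

```python
PAWS_MSL_SECONDS = 60   # window from the text: timestamps must increase within 60 s
PAWS_WINDOW = 1         # assumed tolerance, in timestamp ticks


def signed32(value):
    """Interpret the low 32 bits of value as a signed 32-bit integer."""
    value &= 0xFFFFFFFF
    return value - (1 << 32) if value & 0x80000000 else value


def peer_is_proven(now, cached_stamp, cached_ts, syn_ts):
    """Return True if a new SYN passes the per-host timestamp check.

    now          : current time in seconds
    cached_stamp : time (seconds) when the peer's timestamp was cached,
                   or None if nothing is cached for this peer IP
    cached_ts    : last TCP timestamp value cached for this peer IP
    syn_ts       : TCP timestamp value carried by the new SYN
    """
    if cached_stamp is None:
        return True                      # nothing recorded for this host
    recently_seen = (now - cached_stamp) < PAWS_MSL_SECONDS
    went_backwards = signed32(cached_ts - syn_ts) > PAWS_WINDOW
    return not (recently_seen and went_backwards)


# A SYN with a smaller timestamp, 10 s after the cached one: rejected.
print(peer_is_proven(now=100, cached_stamp=90, cached_ts=5000, syn_ts=4000))
# The same SYN 110 s later falls outside the window: accepted.
print(peer_is_proven(now=200, cached_stamp=90, cached_ts=5000, syn_ts=4000))
```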

The Impact on Clients in the NAT Environment

When the per-host PAWS mechanism, which allows TIME-WAIT sockets to be recycled quickly, was implemented in Linux, it was designed for a network environment in which enough IPv4 addresses were available. However, with the rapid growth of the Internet, NAT has become more and more common, and it is now very common for multiple clients to access the same server from behind one SNAT device.

The per-host PAWS mechanism identifies old data by requiring the TCP Timestamps option field to increase. The timestamp, however, is derived from each client's own clock ticks, so the timestamps of different clients behind one NAT device are effectively unrelated to each other. When client host 1 establishes a TCP connection with the server through NAT, and the server then closes it and quickly recycles the TIME-WAIT socket, new connections from other client hosts carry the same source IP as the one recorded in the server's peer table, but their TCP Timestamps values are effectively random relative to the timestamp recorded for host 1, with roughly a 50% chance of being smaller. If the timestamp is smaller than that of host 1, new connections are rejected for 60 seconds and succeed after that. If the timestamp is larger than that of host 1, the new connection succeeds immediately. So, from the client side, the symptom of this problem is that the new connection is unstable. Sometimes it can be connected and sometimes it cannot.
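The NAT scenario above can be replayed with a toy model. This is a deliberately simplified sketch: the class and method names are hypothetical, the IP address is from a documentation range, timestamps are compared as plain integers without wraparound handling, and only the 60-second window is taken from the text.

```python
class PerHostTimestampCache:
    """Toy model of a server-side cache keyed by source IP only."""

    def __init__(self, window_seconds=60):
        self.window = window_seconds
        self.cache = {}   # source IP -> (time recorded, last timestamp seen)

    def on_connection_recycled(self, src_ip, now, timestamp):
        """Record the peer's last timestamp when its TIME-WAIT socket is recycled."""
        self.cache[src_ip] = (now, timestamp)

    def accepts_syn(self, src_ip, now, timestamp):
        """Return True if a new SYN from src_ip would be accepted."""
        entry = self.cache.get(src_ip)
        if entry is None:
            return True
        recorded_at, last_timestamp = entry
        if now - recorded_at < self.window and timestamp < last_timestamp:
            return False  # looks like old data from the "same host"
        return True


nat_ip = "203.0.113.10"           # all clients share this source IP after SNAT
server = PerHostTimestampCache()

# Host 1 (timestamp clock at 500000) connects, and the connection is recycled at t=0.
server.on_connection_recycled(nat_ip, now=0, timestamp=500000)

# Host 2 has a smaller timestamp clock, host 3 a larger one.
print("host 2 at t=5 :", server.accepts_syn(nat_ip, now=5, timestamp=120000))
print("host 3 at t=5 :", server.accepts_syn(nat_ip, now=5, timestamp=900000))
print("host 2 at t=65:", server.accepts_syn(nat_ip, now=65, timestamp=120000))
```

Because the cache key is the source IP alone, host 2 is penalized for host 1's timestamp even though the two machines are unrelated, which is the intermittent failure seen from the client side.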

This is the side effect of using the TIME-WAIT fast recycling mechanism on clients in the NAT environment. This side effect could not have been anticipated when the per-host PAWS mechanism was designed, because the network environment at that time was quite different from the current one. In the current network environment, the only recommendation is to disable TIME-WAIT fast recycling, that is, to set net.ipv4.tcp_tw_recycle=0. Disabling net.ipv4.tcp_timestamps to remove the TCP Timestamps option can also solve this problem. However, because the timestamp is used in computing RTT and RTO, disabling it is generally not recommended.

Troubleshooting

In actual production, troubleshooting is not easy. However, for servers with both the net.ipv4.tcp_tw_recycle and the net.ipv4.tcp_timestamps enabled, the probability of this problem is very high when the server has NAT client access, so if you obtain the settings of these two kernel parameters and the NAT environment of the client network, you can make a basic judgment.

In addition, you can refer to the statistics in netstat -s, which collects data from /proc/net/snmp, /proc/net/netstat and /proc/net/sctp/snmp. As shown below, this statistic indicates how many new connections have been rejected because of the timestamp. It is a cumulative total since boot, so the difference between two points in time is more meaningful for troubleshooting.

xx passive connections rejected because of time stamp
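Since the counter is cumulative, it helps to take two snapshots and compare them. The sketch below parses text in the /proc/net/netstat layout (a header line of names followed by a line of values, both with the same prefix) and computes the difference for one counter. The helper names are hypothetical, the sample text is illustrative and heavily truncated, and the counter name PAWSPassive is an assumption about how this statistic is labelled in that file; verify it on your own system.

```python
def parse_proc_netstat(text):
    """Parse /proc/net/netstat-style text into {'Prefix.Name': value}."""
    counters = {}
    lines = [line for line in text.splitlines() if line.strip()]
    # Lines come in pairs: "TcpExt: name1 name2 ..." then "TcpExt: v1 v2 ..."
    for header, values in zip(lines[0::2], lines[1::2]):
        prefix, names = header.split(":", 1)
        value_prefix, numbers = values.split(":", 1)
        if prefix != value_prefix:
            continue  # unexpected layout; skip this pair
        for name, number in zip(names.split(), numbers.split()):
            counters[prefix + "." + name] = int(number)
    return counters


def counter_delta(before, after, key):
    """Difference of one counter between two parsed snapshots."""
    return after.get(key, 0) - before.get(key, 0)


# Illustrative, truncated samples (a real file has many more columns).
snapshot_1 = "TcpExt: SyncookiesSent PAWSPassive ListenOverflows\nTcpExt: 0 17 3\n"
snapshot_2 = "TcpExt: SyncookiesSent PAWSPassive ListenOverflows\nTcpExt: 0 42 3\n"

before = parse_proc_netstat(snapshot_1)
after = parse_proc_netstat(snapshot_2)
print("PAWS rejections in interval:",
      counter_delta(before, after, "TcpExt.PAWSPassive"))
```

On a live system the two snapshots would come from reading /proc/net/netstat a few seconds apart while clients are reporting failures.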

2. A Full Accept Queue Results in Dropped SYN Packets

Symptom

There is no single, consistent symptom. The drop happens when the TCP accept queue is full, which usually means that something is wrong with the user-space application. In general, the probability of occurrence is not very high.

Principle

The accept queue is also known as the fully established connection queue. A new connection enters the accept queue after the three-way handshake completes. The user-space application calls the accept system call to take the connection from the queue; the call creates a new socket and returns the file descriptor (fd) associated with it. In user space, poll and similar mechanisms can report, through a readable event, that a connection that has completed the three-way handshake has entered the accept queue, so the application can call accept to obtain the new connection immediately after receiving the notification.

The length of the accept queue is limited. The length is min(backlog, net.core.somaxconn), that is, the smaller of the two parameters.

backlog is the second parameter passed when the application calls the listen system call. Please refer to the int listen(int sockfd, int backlog) in #include.

net.core.somaxconn is a system kernel parameter. The default value is 128. When the application calls listen with a large backlog (for example, the default is 511 for NGINX) and the global kernel parameter is not adjusted, the length of the accept queue is still determined by the smaller net.core.somaxconn.
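The rule above is a plain minimum, which the following sketch makes explicit. The function names are hypothetical; the /proc path is the standard location of the net.core.somaxconn sysctl on Linux.

```python
def effective_accept_queue_limit(backlog, somaxconn):
    """Accept queue limit: the smaller of listen()'s backlog and net.core.somaxconn."""
    return min(backlog, somaxconn)


def read_somaxconn(path="/proc/sys/net/core/somaxconn"):
    """Read the current net.core.somaxconn value (Linux only)."""
    with open(path) as f:
        return int(f.read().strip())


# An application asking for a large backlog is still capped by somaxconn=128.
print(effective_accept_queue_limit(backlog=511, somaxconn=128))
# A small backlog is the limit even if somaxconn is raised.
print(effective_accept_queue_limit(backlog=64, somaxconn=4096))
```

This is why raising only one of the two values often has no visible effect on the queue length.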

Even in the case of a large number of concurrent connections, the normal use of the accept system call by the application to obtain connections in the accept queue will not be delayed due to efficiency problems. However, if connections are not obtained in time due to application blocking, it may cause the accept queue to be full, resulting in the new SYN packets being dropped.

Linux Kernel Implementation

In tcp_ipv4.c, the implementation of "rejecting SYN packets when the accept queue is full" is very simple, as follows:

/* Accept backlog is full. If we have already queued enough
 * of warm entries in syn queue, drop request. It is better than
 * clogging syn queue with openreqs with exponentially increasing
 * timeout.
 */
// If the accept queue is full and the SYN queue holds more than one half-open connection whose SYNACK has not yet been retransmitted, the SYN request is dropped.
if (sk_acceptq_is_full(sk) && inet_csk_reqsk_queue_young(sk) > 1) {
    NET_INC_STATS_BH(sock_net(sk), LINUX_MIB_LISTENOVERFLOWS);
    goto drop;
}

The above is the implementation in 3.10. In fact, two conditions are evaluated: "whether the accept queue is full" and "whether the SYN queue holds half-open connections whose SYNACK has not yet been retransmitted". A large number of new connections usually exist when the accept queue is full, so the second condition is usually met at the same time. If the SYN queue holds no such half-open connections when the accept queue is full, the Linux kernel still accepts the SYN and returns a SYNACK. This situation is rare in actual production, unless the application process is completely stalled, for example after being stopped with the SIGSTOP signal. So in 3.10, a full accept queue alone does not make the kernel TCP stack drop the SYN packet outright.
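The two-condition check can be modeled as a small predicate. This is a sketch of the logic described in the text, with hypothetical names; the require_young flag is an illustration device for contrasting the 3.10 behaviour with kernels that test only whether the accept queue is full.

```python
def drops_syn_on_full_accept_queue(acceptq_full, young_requests,
                                   require_young=True):
    """Model the accept-queue check applied to an incoming SYN.

    acceptq_full   : True if the accept queue is full
    young_requests : number of half-open connections in the SYN queue whose
                     SYNACK has not yet been retransmitted
    require_young  : True models the two-condition check shown for 3.10;
                     False models a check on the accept queue only
    """
    if not acceptq_full:
        return False
    if require_young:
        return young_requests > 1
    return True


# 3.10-style: full accept queue plus fresh half-open connections -> drop.
print(drops_syn_on_full_accept_queue(True, 5))
# 3.10-style: full accept queue but a stalled application and no fresh SYNs -> no drop.
print(drops_syn_on_full_accept_queue(True, 0))
# Single-condition check: a full accept queue is enough to drop.
print(drops_syn_on_full_accept_queue(True, 0, require_young=False))
```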

The logic of dropping SYN because the accept queue is full has changed slightly in newer kernel versions. For example, in 4.10, the judgment condition is changed from two to one, that is, the kernel only determines whether the accept queue is full. Therefore, in these versions, when the accept queue is full, the kernel will directly drop the SYN packet.

Troubleshooting

This kind of problem often happens when a problem occurs with the user-space application. In general, the probability of occurrence is not very high. The following two methods are available to confirm this problem:

Use ss Command to Check Real-Time Problems

Use option -l of the ss command to check the listening sockets. Recv-Q and Send-Q are displayed. For a listening socket, Recv-Q is the number of connections currently in the accept queue, and Send-Q is the maximum length of the accept queue. As shown in the following: The accept queue limit of several processes is 128, because it is restricted by the net.core.somaxconn=128 of the system.
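When many listeners are present, the comparison of Recv-Q and Send-Q can be scripted. The sketch below scans text in the column layout printed by ss -lnt (State, Recv-Q, Send-Q, Local Address:Port, Peer Address:Port) and reports listeners whose Recv-Q has reached Send-Q. The function name is hypothetical and the sample output is illustrative, not captured from a real system.

```python
def listeners_at_limit(ss_output):
    """Return (local_address, recv_q, send_q) for listeners at or above their limit."""
    full = []
    for line in ss_output.splitlines():
        fields = line.split()
        # Skip the header and anything that is not a LISTEN row.
        if len(fields) < 4 or fields[0] != "LISTEN":
            continue
        recv_q, send_q = int(fields[1]), int(fields[2])
        if recv_q > 0 and recv_q >= send_q:
            full.append((fields[3], recv_q, send_q))
    return full


# Illustrative sample in the layout of `ss -lnt`.
sample = """\
State      Recv-Q Send-Q Local Address:Port               Peer Address:Port
LISTEN     0      128          *:22                       *:*
LISTEN     129    128          *:8080                     *:*
LISTEN     3      511          *:80                       *:*
"""

for address, recv_q, send_q in listeners_at_limit(sample):
    print(address, "accept queue", recv_q, "of", send_q)
```

A listener that shows up repeatedly in this list is a candidate for the application-side investigation described later in this section.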

Netstat -s Statistics

Refer to the statistics in netstat -s. The following statistical value indicates the number of new connections rejected due to socket overflow. Similarly, this is also a historical statistical total, and the difference between the two time points is more meaningful for troubleshooting.

xx times the listen queue of a socket overflowed

Suggested solution

If it is confirmed that the SYN packet is dropped due to the accept queue, it is natural to think that a solution is to increase the length of the accept queue. Increasing the backlog and net.core.somaxconn parameters simultaneously can increase the length of accept queue. However, in general, this method can only alleviate the problem, and the most likely situation is that the lengthened accept queue will be filled up again quickly. Therefore, the best way to solve this problem is to check the application and see why it accepts new connections so slowly, solving the root cause.

Summary

This blog article summarizes the two main scenarios, mostly seen at the cloud infrastructure and service software layers, in which SYN is dropped: the per-host PAWS check and a full accept queue. These two scenarios cover the vast majority of cases in which the TCP stack drops SYN packets. If SYN is dropped elsewhere in the protocol stack, further case-by-case troubleshooting is required in combination with the parameter configuration and code logic.