Wednesday, 18 May 2016

Navigating the Linux kernel network stack: into user land

This is a continuation of my previous post, in which we follow the path of an inbound multicast packet to the user application.
At the end of the last post, I mentioned that application back-pressure was most likely to be the cause of receiver packet-loss. As we continue through the code path taken by inbound packets up into user-land, we will see the various points at which data will be discarded due to slow application processing, and what metrics can shed light on these events.

Protocol mapping

First of all, let's take a quick look at how a received data packet is matched up to its handler function.
We can see from the stack-trace below that the top-level function calls when dealing with the packets are packet_rcv and ip_rcv:

This trace captures the part of the receive path where the inbound packet is passed to each registered handler function. In this way, the kernel handles things like VLAN tagging, interface bonding, and packet-capture. Note the __bpf_prog_run function call, which indicates that this is the entry point for pcap packet capture and filtering.

The protocol-to-handler mapping can be viewed in the file /proc/net/ptype:

Comparing against this reference, it is clear that the kernel reads the value of the ethernet frame's EtherType field and dispatches to the corresponding function. There are also some functions that will be executed for all packet types (signified by type ALL).

So our inbound packet will be processed by the capture hooks, even if no packet capture is running (this then must presumably be very cheap), before being passed to the correct protocol handler, in this case ip_rcv.

The socket buffer

As discussed previously, each socket on the system has a receive-side FIFO queue that is written to by the kernel, and read from by the user application.

ip_rcv starts by getting its own copy of the incoming packet, copying if the packet is already shared. If the copy fails due to lack of memory, then the packet is discarded, and the Discard count of the global ip statistics table is incremented. Other checks made at this point include the IP checksum, header-length check and truncation check, each of which update the relevant metrics.

Before calling the ip_rcv_finish function, the packet is diverted through the netfilter module where software-based network filtering can be applied.

Assuming that the packet is not dropped by a filter, ip_rcv_finish passes the packet on to the next protocol-handler in the chain, in this instance, udp_rcv.

In the internals of the udp_rcv function, we finally get to the point of accessing the socket FIFO queue.
Simple validation performed at this point includes packet-length check and checksum. Failure of these checks will cause the relevant statistics to be updated in the global udp statistics table.

Because the traffic we're tracing is multicast, the next handler function is __udp4_lib_mcast_deliver.
With some exotic locking in place, the kernel determines how many different sockets this multicast packet needs to be delivered to. It is worth noting here that there is a smallish number (256 / sizeof( struct sock)) of sockets that can be listening to a given multicast address before the possibility of greater lock-contention creeps in.

An effort is made to enumerate the registered sockets with a lock held, then perform the packet dispatching without a lock. However, if the number of registered sockets exceeds the specified threshold, then packet dispatch (the flush_stack function) will be handled while a lock is held.

Once in flush_stack, the packet is copied for each registered socket, and pushed onto the FIFO queue.
If the kernel is unable to allocate memory in order to copy the buffer, the socket's drop-count will be incremented, along with the RcvBufErrors and InErrors metrics in the global udp statistics table.

After another checksum test, we are finally at the point where socket-buffer overflow is tested:

If the size of the socket's backlog queue, plus the memory used in the receive queue is greater than the socket receive buffer size, then the RcvBufferErrors and InErrors metrics are updated in the global udp statitics table, along with the socket's drop count.

Kernel statistics

Now that we have seen where the various capacity checks are made, we can make some observations in the proc file system.

There are protocol-specific files in /proc/net that are updated by the kernel whenever data is received.

/proc/net/snmp

This file contains statistics on multiple protocols, and is the target for updates to the UDP receive buffer errors. So when we want to know about system-wide receive errors, this is the place to look. When this file is read, a function call iterates through the stored values and writes them to the supplied destination. Such is the magic of procfs.

In the output below, we can see that there have been several receive errors (RcvbufErrors), and the number is matched by the InErrors metric:

There is a subtle difference between InErrors and RcvbufErrors, but in our case when looking for socket buffer overflow, we only care about RcvbufErrors.

/proc/net/udp

This file contains socket-specific statistics, which only live for the lifetime of a socket. As we have seen from the code so far, each time the RcvbufErrors metric has been incremented, so too has the corresponding sk_drops value for the target socket.

So if we need to differentiate between drop events on a per-socket basis, this file is what we need to observe. The contents below can be explained by looking at the handler function for socket data:

For observing back-pressure, we are interested in the drops column, and the rx_queue column, which is a snapshot of the amount of queued data at the time of reading. Local and remote addresses are hex-encoded ip:port addresses.

Monitoring this file can tell us if our application is falling behind the packet arrival rate (increase in rx_queue), or whether the kernel was unable to copy a packet into the socket buffer due to it already being full (increase in drops).

Tracepoints

If we wish to capture more data about packets that are being dropped, there a three tracepoints that are of interest.

sock:sock_rcvqueue_full: called when the amount of memory already allocated to the socket buffer is greater than or equal to the configured socket buffer size.