Inside the Linux Packet Filter, Part II

Gianluca concludes the packet's journey through the kernel, picking up with TCP processing.

Sleeping Processes

As a side note, you may be wondering how a user process
comes to sleep on a given socket when it invokes a recv(),
recvfrom() or recvmsg() system call. The mechanism is actually
pretty simple: all the recv functions are implemented inside the
kernel by calling, more or less directly, sock_recvmsg() (in
net/socket.c). This function, in turn, calls the recvmsg() function
registered among the socket's protocol-specific operations. In the
case of the PF_PACKET protocol, for example, this function is
packet_recvmsg().

The protocol-specific recvmsg function, among other things,
sooner or later calls skb_recv_datagram(), a generic function that
handles datagram reception for all protocols. The latter function
blocks the process by calling wait_for_packet() (in
net/core/datagram.c), which sets the process state to
TASK_INTERRUPTIBLE (i.e., sleeping) and queues the process on the
socket's sleep queue. The process rests there until a call to
wake_up_interruptible() is triggered by the arrival of a new
packet, as we saw in the previous paragraphs.
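
The following sketch condenses the logic of the 2.4-era
wait_for_packet() into a few lines. It is only a paraphrase (the
function name wait_for_packet_sketch is invented here), and the
timeout, signal and error handling of the real code is left out:

/* Condensed paraphrase of wait_for_packet() (net/core/datagram.c):
 * put the current task to sleep on the socket until data arrives. */
static void wait_for_packet_sketch(struct sock *sk)
{
        DECLARE_WAITQUEUE(wait, current);        /* wait-queue entry for this task */

        add_wait_queue(sk->sleep, &wait);        /* hook onto the socket's sleep queue */
        set_current_state(TASK_INTERRUPTIBLE);   /* mark the task as sleeping */

        /* Sleep only if no packet is already waiting in the receive queue */
        if (skb_queue_empty(&sk->receive_queue))
                schedule();                      /* yield the CPU until woken up */

        /* Execution resumes here once wake_up_interruptible(sk->sleep) is
         * called by the packet-arrival path (e.g., sock_def_readable()). */
        set_current_state(TASK_RUNNING);
        remove_wait_queue(sk->sleep, &wait);
}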

What about the Filter Itself?

The main filter implementation resides in net/core/filter.c,
whereas the SO_ATTACH_FILTER and SO_DETACH_FILTER socket options
are handled in net/core/sock.c. The filter is initially attached to
a socket by the sk_attach_filter() function, which copies it from
user space to kernel space and runs an integrity check on it
(sk_chk_filter()). The check is aimed at ensuring that the filter
interpreter is never fed inconsistent or unsafe code. Finally, the
filter base address is copied into the filter field of the sock
structure, where it is used for filter invocation as we saw
before.
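
From user space, attachment happens through setsockopt(). The
fragment below is a minimal sketch of that step, with helper names
(attach_accept_all, detach_filter) invented for the example; the
single-instruction program simply accepts every packet:

#include <sys/socket.h>
#include <linux/filter.h>       /* struct sock_filter, struct sock_fprog, BPF_* */

/* Attach a trivial one-instruction filter: "RET 0xffff" accepts the
 * whole packet.  The kernel copies the program (sk_attach_filter) and
 * validates it (sk_chk_filter); a malformed program fails with EINVAL. */
static int attach_accept_all(int fd)
{
        struct sock_filter insns[] = {
                BPF_STMT(BPF_RET | BPF_K, 0xffff),
        };
        struct sock_fprog prog = {
                .len    = sizeof(insns) / sizeof(insns[0]),
                .filter = insns,
        };

        return setsockopt(fd, SOL_SOCKET, SO_ATTACH_FILTER, &prog, sizeof(prog));
}

/* Detaching: the option value is ignored, but optlen must be at least
 * sizeof(int) for the call to be accepted. */
static int detach_filter(int fd)
{
        int dummy = 0;
        return setsockopt(fd, SOL_SOCKET, SO_DETACH_FILTER, &dummy, sizeof(dummy));
}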

The packet filter proper is implemented in the
sk_run_filter() function, which is given an skb (the current
packet) and a filter program. The latter is simply an array of BPF
instructions (see Resources), that is, a sequence of numeric
opcodes and operands. The sk_run_filter() function does nothing
more than implement a BPF interpreter (or a virtual CPU, if you
prefer) in a pretty straightforward way: a long switch/case
statement discriminates on the opcode and acts on the emulated
registers and memory accordingly. The memory space the filter code
works on is, of course, the packet's contents (skb->data). Filter
execution terminates, and the function returns, when a BPF RET
instruction is encountered.
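
To make the virtual CPU idea concrete, here is a toy interpreter in
the same spirit. It is not the kernel's sk_run_filter(): it handles
only three opcodes and a single accumulator register, and exists
purely to show the switch/case structure:

#include <stdint.h>
#include <stddef.h>
#include <linux/filter.h>       /* struct sock_filter, BPF_* opcode constants */

/* Toy BPF interpreter: A is the accumulator, pc the program counter.
 * Returning 0 means "drop the packet"; a non-zero value is the number
 * of bytes of the packet to accept. */
static unsigned int toy_run_filter(const uint8_t *data, size_t len,
                                   const struct sock_filter *insns, int ninsns)
{
        uint32_t A = 0;
        int pc;

        for (pc = 0; pc < ninsns; pc++) {
                const struct sock_filter *f = &insns[pc];

                switch (f->code) {
                case BPF_LD | BPF_H | BPF_ABS:   /* A = 16-bit word at offset k */
                        if (f->k + 2 > len)
                                return 0;        /* out-of-bounds load: drop */
                        A = (data[f->k] << 8) | data[f->k + 1];
                        break;
                case BPF_JMP | BPF_JEQ | BPF_K:  /* skip jt or jf instructions */
                        pc += (A == f->k) ? f->jt : f->jf;
                        break;
                case BPF_RET | BPF_K:            /* done: return k */
                        return f->k;
                default:                         /* opcode not handled by this toy */
                        return 0;
                }
        }
        return 0;
}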

Note that the sk_run_filter() function is called directly
only by the PF_PACKET processing routines. Socket-level receive
routines (i.e., the TCP, UDP and raw IP ones) go through the
wrapper function sk_filter() (in include/net/sock.h), which,
besides calling sk_run_filter(), trims the packet to the length
returned by the filter.
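
In other words, sk_filter() behaves roughly as in the following
paraphrase, simplified from the 2.4 sources rather than copied from
them (the name sk_filter_sketch is invented here):

/* Simplified paraphrase of the sk_filter() wrapper: run the filter,
 * tell the caller to drop the packet if it returned 0, otherwise cut
 * the packet down to the number of bytes the filter asked for. */
static inline int sk_filter_sketch(struct sk_buff *skb, struct sk_filter *filter)
{
        unsigned int pkt_len;

        pkt_len = sk_run_filter(skb, filter->insns, filter->len);
        if (pkt_len == 0)
                return -EPERM;          /* filter said "drop" */

        skb_trim(skb, pkt_len);         /* keep only the accepted bytes */
        return 0;
}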

Hooks to Packet Filter

Our tour of the kernel packet-handling functions is now
complete. It is interesting to draw some conclusions regarding the
packet filter invocation points. As we have seen, there are three
distinct places inside the kernel where the filter may be invoked:
the TCP and UDP (layer 4) receive functions and the PF_PACKET
(layer 2.5) receive function. Raw IP packets are filtered as well,
because they pass through the same path followed by UDP packets
(namely, sock_queue_rcv_skb(), which is used for datagram-oriented
reception).

It is important to notice that, at each layer, the filter is
applied to that layer's data. That is, as the packet travels
upward, the filter can see less and less information. For PF_PACKET
sockets, the filter is applied to layer 2 information, which means
either the whole link-layer frame (SOCK_RAW sockets) or the whole
IP packet (SOCK_DGRAM sockets); for TCP/UDP sockets, the filter is
applied to layer 4 information (basically, port numbers and little
other useful data). For this reason, layer 4 socket filtering is of
limited use. Of course, in any case the application-level payload
(user data) is always available to the filter, even if it is often
of little or no use at all.

A concrete example of this limited usefulness is given in
Listing 1 (available at
ftp.linuxjournal.com/pub/lj/listings/issue95/5617.tgz) and
Listing 2, which present a simple UDP server with an attached
socket filter and an associated simple UDP data sender. The filter
accepts only packets whose payload starts with “lj” (hex
0x6c6a). To test the programs, compile and run Listing 1, called
udprcv. Then compile Listing 2 (udpsnd) and launch it like
this:

./udpsnd 127.0.0.1 "hello world"

Nothing will be printed by udprcv. Now try sending a string
starting with “lj”, as in

./udpsnd 127.0.0.1 "lj rules"

This time the string is printed correctly by udprcv since the
packet payload matches the filter.
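
Listing 1 is not reproduced here, but a filter with this effect
would look roughly like the sketch below (variable names invented
for the example). The offsets are an assumption based on the
discussion above: on a UDP socket the filter sees data starting at
the UDP header, so the payload begins at offset 8; Table 1 gives
the exact base addresses.

#include <linux/filter.h>

/* Hypothetical filter in the spirit of udprcv's: accept a packet only if
 * the first two payload bytes are 'l','j' (0x6c6a).  Offsets assume a UDP
 * socket, whose filter data starts at the UDP header (8 bytes long). */
struct sock_filter lj_insns[] = {
        BPF_STMT(BPF_LD | BPF_H | BPF_ABS, 8),              /* A = first 2 payload bytes */
        BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, 0x6c6a, 0, 1),  /* "lj"? fall through : skip */
        BPF_STMT(BPF_RET | BPF_K, 0xffff),                  /* accept the whole packet */
        BPF_STMT(BPF_RET | BPF_K, 0),                       /* drop it */
};

struct sock_fprog lj_prog = {
        .len    = sizeof(lj_insns) / sizeof(lj_insns[0]),
        .filter = lj_insns,
};

/* Attached to the UDP socket with:
 *   setsockopt(fd, SOL_SOCKET, SO_ATTACH_FILTER, &lj_prog, sizeof(lj_prog));
 */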

Another important issue filter writers should be aware of is
that the filter must be written for the type of socket (PF_PACKET,
raw IP or TCP/UDP) it will be attached to. In fact, filter memory
accesses use offsets that are relative to the first byte of the
packet as seen at that specific level. The filter memory base
addresses corresponding to the most common families are reported in
Table 1.
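
As an illustration, the instruction loading the UDP destination
port would use a different offset at each level. The figures below
assume an Ethernet frame and a 20-byte IP header without options,
so take them as an example rather than as universally valid
offsets:

/* Three alternative ways of loading the same field (the UDP destination
 * port) into the accumulator, one per socket type: */
BPF_STMT(BPF_LD | BPF_H | BPF_ABS, 14 + 20 + 2)   /* PF_PACKET: Ethernet + IP headers come first */
BPF_STMT(BPF_LD | BPF_H | BPF_ABS, 20 + 2)        /* raw IP socket: IP header comes first */
BPF_STMT(BPF_LD | BPF_H | BPF_ABS, 2)             /* UDP socket: straight into the UDP header */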

Moreover, the method described in the June 2001 article for
obtaining the filter code (i.e., using tcpdump -dd) no longer
applies if non-PF_PACKET sockets are used, since it produces a
filter that works only at layer 2 (it assumes that address 0 is the
start of the link-layer frame).