Inside the Linux Packet Filter

In Part I of this two-part series on the Linux Packet Filter, Gianluca describes a packet's journey through the kernel.

Network geeks among you may remember my
article, “Linux Socket Filter: Sniffing Bytes over the Network”,
in the June 2001 issue of LJ, regarding the
use of the packet filter built inside the Linux kernel. In that
article I provided an overview of the functionality of the packet
filter itself; this time, I delve into the depths of the kernel
mechanisms that allow the filter to work and share some insights on
Linux packet processing internals.

Last Article's Points

In the previous article, several points regarding kernel
packet processing were raised. It is worthwhile to recall the most
important of them briefly:

Packet reception is first dealt with at the network
card's driver level, more precisely in the interrupt service
routine. The service routine looks up the protocol type inside the
received frame and queues it appropriately for later
processing.

At the socket level, just before reaching user
land, the kernel checks whether an open socket for the given packet
exists. If it does not, the packet is discarded.

The Linux kernel also implements a general-purpose
protocol, called PF_PACKET, which allows you to create a socket
that receives packets directly from the network card driver. Hence,
the handling of any other protocol is skipped, and any packet can
be received.

An Ethernet card usually passes to the kernel only the
packets destined for itself, discarding all the others.
Nevertheless, it is possible to configure the card in such a way
that all the packets flowing through the network are captured,
independent of their MAC address (promiscuous mode).

Finally, you can attach a filter to a socket, so
that only packets matching your filter's rules are accepted and
passed to the socket. Combined with PF_PACKET sockets, this
mechanism allows you to sniff selected packets efficiently from
your LAN.

Even though we built our sniffer using PF_PACKET sockets, the
Linux socket filter (LSF) is not limited to those. In fact, the
filter also can be used on plain TCP and UDP sockets to filter out
unwanted packets—of course, this use of the filter is much less
common.

In the following, I sometimes refer either to a socket or to
a sock structure. As far as this article is concerned, both forms
indicate the same object, and the latter corresponds to the
kernel's internal representation of the former. Actually, the
kernel holds both a socket structure and a sock structure, but the
difference between the two is not relevant here.

Another data structure that will recur quite often is the
sk_buff (short for socket buffer), which represents a packet inside
the kernel. The structure is arranged in such a way that addition
and removal of header and trailer information to the packet data
can be done in a relatively inexpensive way: no data actually needs
to be copied since everything is done by just shifting
pointers.

Before going on, it may be useful to clear up possible
ambiguities. Despite the similar name, the Linux socket filter
serves a completely different purpose from the Netfilter framework
introduced into the kernel in early 2.3 versions. Although
Netfilter allows you to bring packets up to user space and feed
them to your programs, the focus there is to handle network address
translation (NAT), packet mangling, connection tracking, packet
filtering for security purposes and so on. If you just need to
sniff packets and filter them according to certain rules, the most
straightforward tool is LSF.

Now we are going to follow the trip of a packet from its very
ingress into the computer to its delivery to user land at the
socket level. We first consider the general case of a plain (i.e.,
not PF_PACKET) socket. Our analysis at link layer level is based on
Ethernet, since this is the most widespread and representative LAN
technology. Cases of other link layer technologies do not present
significant differences.

Ethernet Card and Lower-Kernel Reception

As we mentioned in the previous article, the Ethernet card is
hard-wired with a particular link layer (or MAC) address and is
always listening for packets on its interface. When it sees a
packet whose destination MAC address matches either its own address
or the link layer broadcast address (i.e., FF:FF:FF:FF:FF:FF for
Ethernet), it starts reading it into memory.

Upon completion of packet reception, the network card
generates an interrupt request. The interrupt service routine that
handles the request is the card driver itself, which runs with
interrupts disabled and typically performs the following
operations:

Allocates a new sk_buff structure, defined in
include/linux/skbuff.h, which represents the kernel's view of a
packet.

Fetches packet data from the card buffer into the
freshly allocated sk_buff, possibly using DMA.

Invokes netif_rx(), the generic network reception
handler.

When netif_rx() returns, re-enables interrupts and
terminates the service routine.

The netif_rx() function prepares the kernel for the next
reception step; it puts the sk_buff into the incoming packets queue
for the current CPU and marks the NET_RX softirq (softirq is
explained below) for execution via the __cpu_raise_softirq() call.
Two points are worth noticing at this stage. First, if the queue is
full the packet is discarded and lost forever. Second, we have one
queue for each CPU; together with the new deferred kernel
processing model (softirqs instead of bottom halves), this allows
for concurrent packet reception in SMP machines.

If you want to see a real-world Ethernet driver in action,
you can refer to the simple NE 2000 PCI card driver, located in
drivers/net/8390.c; the interrupt service routine, ei_interrupt(),
calls ei_receive(), which in turn performs the following
procedure:

Allocates a new sk_buff structure via the
dev_alloc_skb() call.

Reads the packet from the card buffer
(ei_block_input() call) and sets skb->protocol
accordingly.

Calls netif_rx().

Repeats the procedure for a maximum of ten
consecutive packets.

A slightly more complex example is provided by the 3Com
driver, located in drivers/net/3c59x.c, which uses DMA to transfer
the packet from the card memory to the sk_buff.
