BPF: A Tour of Program Types

Notes on BPF (1) - A Tour of Program Types

Oracle Linux kernel developer Alan Maguire presents this six-part series on BPF, wherein he presents an in-depth look at the kernel's "Berkeley Packet Filter" -- a useful and extensible kernel function for much more than packet filtering.

If you follow Linux kernel development discussions and blog posts, you've probably heard BPF mentioned a lot lately. It's being used for high-performance load-balancing, DDoS mitigation and firewalling, safe instrumentation of kernel and user-space code and much more! BPF does this by supporting a safe, flexible programming environment in many different contexts: networking datapaths, kernel probes, perf events and more. Safety is key - in most environments, adding a kernel module introduces significant risk. BPF programs, however, are verified at program load time to ensure that, for example, no out-of-bounds accesses occur. In addition, BPF supports just-in-time compilation of its bytecode to the native instruction set, so BPF programs are also fast. If you're interested in topics like fast packet processing and observability, learning BPF should definitely be on your to-do list.

Here we try to give a guide to BPF, covering a range of topics which will hopefully help developers trying to get to grips with writing BPF programs. This guide is based on Linux 4.14 (which is the kernel for Oracle Linux UEK5), so do bear that in mind as there have been a bunch of changes in BPF since, and some package names etc may differ for other distributions.

Because BPF in Linux is such a fast-moving target, I'm going to try and point you at relevant places in the kernel codebase that may help you get a sense for what the technology can do. The samples/bpf directory is a great place to look to see what others have done, but here we'll also dig into the implementation as reference, as it may give you some ideas how to create new BPF programs. The aim here isn't to give a deep dive into BPF internals, but rather to give a few pointers to areas in the code which reveal BPF functionality. The source tree I'm using for reference is our UEK5 release, based on Linux 4.14.35. See https://github.com/oracle/linux-uek/tree/uek5/master . Most of the functionality described can be found in any recent kernel. The bpf-next tree (where BPF kernel development happens) can be found at

https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next.git

An important caveat: again, what is below describes the state as of the 4.14 kernel. A lot has changed since, but hopefully with the pointers into the code, you'll be better equipped to figure out what some of these changes are!

The aim here is to be able to get to the point of working on interesting problems with BPF. However before we get there, let's look at the various pieces and how they fit together.

The first question to ask is what can we do with BPF? What kinds of programs can we write?

To get a sense for this, let's examine the enumerated type definition from include/uapi/linux/bpf.h https://github.com/oracle/linux-uek/blob/uek5/master/include/uapi/linux/bpf.h#L117

What are all of these program types? To understand this, we will ask the same set of questions for each program type:

what do I do with this program type?

how do I attach my BPF program for this program type?

what context is provided to my program? By this we mean what argument(s) and data are provided for us to work with.

when does the attached program get run? It's important to understand this, as it gives a sense of where for example in the network stack a filter is applied.

We won't worry about how you create the programs for now; that side of things is relatively uniform across the various program types.

1. socket-related program types - SOCKET_FILTER, SK_SKB, SOCK_OPS

First, let's consider the socket-related program types which allow us to filter, redirect socket data and monitor socket events. The filtering use case relates to the origins of BPF. When observing the network we want to see only a portion of network traffic, for example all traffic from a troublesome system. Filters are used to describe the traffic we want to see, and ideally we want it to be fast, and we want to give users an open-ended set of filtering options. But we have a problem; we want to throw away unneeded data as early as possible, and to do that we need to filter in kernel context. Consider the alternative to an in-kernel solution - incurring the cost of copying packets to user-space and filtering there. That would be very expensive, especially if we only want to see a portion of the network traffic and throw away the rest.

To achieve this, a safe mini-language was invented to translate high-level filters into a bytecode program that the kernel can use (termed classic BPF, cBPF). The aim of the language was to support a flexible set of filtering options while being fast and safe. Filters written in this assembly-like language could be pushed by userspace programs such as tcpdump to accomplish filtering in-kernel. See

https://www.tcpdump.org/papers/bpf-usenix93.pdf

...for the classic paper describing this work. Modern eBPF took these concepts, expanded the register and instruction set, added data structures called maps, hugely expanded the kinds of events we can attach to, and much more!

For socket filtering, the common case is to attach to a raw socket (SOCK_RAW), and in fact you'll notice most programs that do socket filtering have a line like this:

s = socket(AF_PACKET,SOCK_RAW,htons(ETH_P_ALL));

Creating such a socket, we specify the domain (AF_PACKET), socket type (SOCK_RAW) and protocol (all packet types). In the Linux kernel, reception of raw packets is implemented by the raw_local_deliver() function. It is called by ip_local_deliver_finish(), just prior to calling the relevant IP protocol's handler, which is where the packet is passed to TCP, UDP, ICMP etc. So at this point the traffic has not been associated with a specific socket; that happens later, when the IP stack figures out the mapping from packet to layer 4 protocol, and then to the relevant socket (if any). You can see the cBPF bytecodes generated by tcpdump by using the -d option. Here I want to run tcpdump on the wlp4s0 interface, filtering TCP traffic only:

Without much deep knowledge we can get a feel for what's happening here. On line 000 we load the 16-bit value at offset 12 of the Ethernet header: the protocol type (EtherType). On line 001, we jump to 002 if it matches ETH_P_IPV6 (0x86dd) (jt 2); otherwise we jump to 007 (jf 7) to handle the IPv4 case.

Let's look at the IPv6 case first. On line 003 we jump to 010 - success - if the IPv6 protocol (offset 20) is 6 (IPPROTO_TCP); line 010 returns 65535, which is the max length, so we're accepting the packet. Otherwise we continue to 004. Here we compare to 0x2c, which indicates there's an IPv6 fragment header. If that's true, we check whether the fragment header (offset 54) specifies a next protocol value of IPPROTO_TCP; if so we jump to 010 (success), and if not to 011 (failure). Line 011 returns 0, which means dropping the packet for filtering purposes.

Handling IPv4 is simpler; on 007 (arrived at via "jf" on 001), we check for ETH_P_IP (0x800) and, if found, we verify that the IP protocol is TCP. And we're done! Remember though this is cBPF; eBPF has an extended instruction/op set similar to x86_64 and additional registers.
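
To make the walk-through concrete, here is a minimal sketch that encodes the twelve instructions described above as a struct sock_filter array and runs them through a toy user-space evaluator. The instruction list is reconstructed from the description rather than copied from tcpdump output, and the evaluator, the names tcp_only, run_filter and tcp_only_verdict, and the restriction to four opcodes are all assumptions made for illustration; the kernel's interpreter handles the full cBPF instruction set.

```c
#include <stddef.h>
#include <stdint.h>
#include <linux/filter.h>	/* struct sock_filter, BPF_STMT(), BPF_JUMP() */

/* Jump offsets are relative to the next instruction, so "jt 10" on
 * line 003 is encoded as 10 - 4 = 6. */
const struct sock_filter tcp_only[] = {
	/* 000 */ BPF_STMT(BPF_LD | BPF_H | BPF_ABS, 12),
	/* 001 */ BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, 0x86dd, 0, 5),
	/* 002 */ BPF_STMT(BPF_LD | BPF_B | BPF_ABS, 20),
	/* 003 */ BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, 6, 6, 0),
	/* 004 */ BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, 0x2c, 0, 6),
	/* 005 */ BPF_STMT(BPF_LD | BPF_B | BPF_ABS, 54),
	/* 006 */ BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, 6, 3, 4),
	/* 007 */ BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, 0x800, 0, 3),
	/* 008 */ BPF_STMT(BPF_LD | BPF_B | BPF_ABS, 23),
	/* 009 */ BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, 6, 0, 1),
	/* 010 */ BPF_STMT(BPF_RET | BPF_K, 65535),
	/* 011 */ BPF_STMT(BPF_RET | BPF_K, 0),
};

/* Toy evaluator (hypothetical): supports only the opcodes used above. */
uint32_t run_filter(const struct sock_filter *prog, size_t n,
		    const uint8_t *pkt, size_t len)
{
	uint32_t acc = 0;	/* the cBPF accumulator register */

	for (size_t pc = 0; pc < n; pc++) {
		const struct sock_filter *in = &prog[pc];

		switch (in->code) {
		case BPF_LD | BPF_H | BPF_ABS:	/* 16-bit big-endian load */
			if ((size_t)in->k + 2 > len)
				return 0;	/* out-of-bounds load: drop */
			acc = (uint32_t)pkt[in->k] << 8 | pkt[in->k + 1];
			break;
		case BPF_LD | BPF_B | BPF_ABS:	/* 8-bit load */
			if ((size_t)in->k + 1 > len)
				return 0;
			acc = pkt[in->k];
			break;
		case BPF_JMP | BPF_JEQ | BPF_K:	/* conditional relative jump */
			pc += (acc == in->k) ? in->jt : in->jf;
			break;
		case BPF_RET | BPF_K:		/* bytes to accept, 0 = drop */
			return in->k;
		default:
			return 0;		/* outside this toy subset */
		}
	}
	return 0;
}

uint32_t tcp_only_verdict(const uint8_t *pkt, size_t len)
{
	return run_filter(tcp_only, sizeof(tcp_only) / sizeof(tcp_only[0]),
			  pkt, len);
}
```

A frame whose EtherType is 0x0800 and whose byte at offset 23 is 6 takes the path 000, 001, 007, 008, 009, 010 and is accepted; changing that byte to 17 (UDP) ends at 011 and the frame is dropped.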

One other thing to note - socket filtering is distinct from netfilter-based filtering. Netfilter defines its own set of hooks with NF_HOOK() definitions, which netfilter-based technologies such as iptables can use to filter traffic also. You might think - couldn't we use eBPF there too? And you'd be right! The bpfilter project aims to provide an eBPF-based alternative to iptables in more recent Linux kernels.

So with all that in mind, let's return to examining the socket-related program types.

1.1 BPF_PROG_TYPE_SOCKET_FILTER

What do I do with it? The filtering actions include dropping packets (if the program returns 0) or trimming packets (if the program returns a length less than the original). See sk_filter_trim_cap() and its call to bpf_prog_run_save_cb(). Note that we're not trimming or dropping the original packet which would still reach the intended socket intact; we're working with a copy of the packet metadata which raw sockets can access for observability. In addition to filtering packet flow to our socket, we can also do things that have side-effects; for example collecting statistics in BPF maps.

How do I attach my program? BPF programs can be attached to sockets via the SO_ATTACH_BPF setsockopt(), which passes in a file descriptor to the program.

What context is provided? A pointer to the struct __sk_buff containing packet metadata/data. This structure is defined in include/uapi/linux/bpf.h, and includes key fields from the real sk_buff. The BPF verifier converts access to valid __sk_buff fields into offsets into the "real" sk_buff, see https://lwn.net/Articles/636647/ for more details.

When does it run? Socket filters run for receive in sock_queue_rcv_skb() which is called by various protocols (TCP, UDP, ICMP, raw sockets etc) and can be used to filter inbound traffic.

To give a sense of what programs look like, here we will create a filter that trims packet data on the basis of protocol type: for IPv4 TCP, let's grab the IPv4 + TCP header only, while for UDP, we'll take the IPv4 and UDP header only. We won't deal with IPv4 options as it's a simple example, and in all other cases we return 0 (drop packet).
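
The decision logic of such a filter can be sketched in plain user-space C, which makes it easy to test. This is not the BPF program itself: a real socket filter would read the same bytes through its struct __sk_buff context and be compiled for the bpf target. The function name trim_len is hypothetical, and the sketch assumes the frame starts at the Ethernet header (as on an AF_PACKET raw socket), so the Ethernet header length is included in the returned length.

```c
#include <stddef.h>
#include <stdint.h>
#include <netinet/in.h>		/* IPPROTO_TCP, IPPROTO_UDP */
#include <linux/if_ether.h>	/* ETH_HLEN, ETH_P_IP */
#include <linux/ip.h>		/* struct iphdr */
#include <linux/tcp.h>		/* struct tcphdr */
#include <linux/udp.h>		/* struct udphdr */

/* Returns how many bytes of the frame to keep, or 0 to drop it.
 * IPv4 options are ignored, as in the description above. */
uint32_t trim_len(const uint8_t *frame, size_t len)
{
	const size_t proto_off = ETH_HLEN + offsetof(struct iphdr, protocol);

	if (len <= proto_off)
		return 0;
	/* The EtherType is big-endian at bytes 12-13 of the Ethernet header. */
	if (((uint32_t)frame[12] << 8 | frame[13]) != ETH_P_IP)
		return 0;

	switch (frame[proto_off]) {
	case IPPROTO_TCP:
		return ETH_HLEN + sizeof(struct iphdr) + sizeof(struct tcphdr);
	case IPPROTO_UDP:
		return ETH_HLEN + sizeof(struct iphdr) + sizeof(struct udphdr);
	default:
		return 0;
	}
}
```

Returning a length shorter than the packet is what asks the kernel to trim the copy delivered to the socket; returning 0 drops it.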

This program can be compiled into BPF bytecodes using LLVM/clang by specifying arch of "bpf" , and once that is done it will contain an object with an ELF section called "socket". That is our program. The next step is to use the BPF system call to assign a file descriptor to the program, then attach it to the socket. In samples/bpf , you can see that bpf_load.c scans the ELF sections, and sections with name prefixed by "socket" are recognized as BPF_PROG_TYPE_SOCKET_FILTER programs. If you're adding a sample I'd recommend including bpf_load.h so you can just call load_bpf_file() on your BPF program. For example, in samples/bpf/sockex1_user.c we take the filename of our program (sockex1) and load sockex1_kern.o ; the associated BPF program. Then we open a raw socket to loopback (lo) and attach the program there:

1.2 BPF_PROG_TYPE_SOCK_OPS

What do I do with it? Attach a BPF program to catch socket operations such as connection establishment, retransmit timeout etc. Once caught, socket options can also be set via bpf_setsockopt(), so for example on passive establishment of a connection from a system not on the same subnet, we could lower the MTU so we won't have to worry about intermediate routers fragmenting packets. Programs can return success (0) or failure (a negative value) and a reply value can be set to indicate the desired value for a socket option (e.g. TCP rwnd). See https://lwn.net/Articles/727189/ for full details, and look for tcp_call_bpf()'s inline definition in include/net/tcp.h to see how TCP handles execution of such programs. Another use case is for sockmap updates in combination with BPF_PROG_TYPE_SK_SKB programs; the bpf_sock_ops struct pointer passed into the BPF_PROG_TYPE_SOCK_OPS program is used to update the sockmap, associating a value with that socket. Later, sk_skb programs can reference those values to specify which socket to redirect to via bpf_sk_redirect_map() calls. If this sounds confusing, I'd recommend taking a look at the code in samples/sockmap.

How do I attach my program? It is attached to a cgroup file descriptor using BPF_CGROUP_SOCK_OPS attach type.

What context is provided? The argument provided is the context, a struct bpf_sock_ops *. The op field specifies the operation (BPF_SOCK_OPS_RWND_INIT, BPF_SOCK_OPS_TCP_CONNECT_CB etc). The reply field can be used to indicate to the caller a new value for a parameter.

/* User bpf_sock_ops struct to access socket values and specify request ops
 * and their replies.
 * Some of this fields are in network (bigendian) byte order and may need
 * to be converted before use (bpf_ntohl() defined in samples/bpf/bpf_endian.h).
 * New fields can only be added at the end of this structure
 */
struct bpf_sock_ops {
	__u32 op;
	union {
		__u32 reply;
		__u32 replylong[4];
	};
	__u32 family;
	__u32 remote_ip4;	/* Stored in network byte order */
	__u32 local_ip4;	/* Stored in network byte order */
	__u32 remote_ip6[4];	/* Stored in network byte order */
	__u32 local_ip6[4];	/* Stored in network byte order */
	__u32 remote_port;	/* Stored in network byte order */
	__u32 local_port;	/* stored in host byte order */
};

When does it run? As per the above article, unlike other BPF program types that expect to be called at a particular place in the codebase, SOCK_OPS programs can be called at different places and use an "op" field to indicate that context. See include/uapi/linux/bpf.h for the enumerated BPF_SOCK_OPS_* definitions, but they include events like retransmit timeout, connection establishment etc.
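
To illustrate how the op field and the byte-order notes in the struct are used together, here is a small sketch of the decision a sock_ops program might make. It is written as an ordinary C function over the UAPI struct so it can be exercised in user space; a real program would be compiled for the bpf target and would store the result in the reply field. The policy itself (ask for an initial receive window of 40 packets when an IPv4 peer is outside our /24) and the name choose_rwnd_reply are hypothetical, and -1 is used here simply to mean "no preference".

```c
#include <arpa/inet.h>		/* ntohl() */
#include <sys/socket.h>		/* AF_INET */
#include <linux/bpf.h>		/* struct bpf_sock_ops, BPF_SOCK_OPS_* */

/* Returns the value a program could place in skops->reply,
 * or -1 when this sketch has no preference. */
int choose_rwnd_reply(const struct bpf_sock_ops *skops)
{
	__u32 local, remote;

	/* Only the "initial receive window" operation is handled here. */
	if (skops->op != BPF_SOCK_OPS_RWND_INIT || skops->family != AF_INET)
		return -1;

	/* The address fields are stored in network byte order. */
	local = ntohl(skops->local_ip4);
	remote = ntohl(skops->remote_ip4);

	/* Peer outside our /24: ask for a larger initial window. */
	if ((local & 0xffffff00) != (remote & 0xffffff00))
		return 40;
	return -1;
}
```

Because the same program is invoked for every BPF_SOCK_OPS_* event, checking op first is the usual way to ignore the operations a program does not care about.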

1.3 BPF_PROG_TYPE_SK_SKB

What do I do with it? Allows users to access skb and socket details such as port and IP address with a view to supporting redirect of skbs between sockets. See https://lwn.net/Articles/731133/ . This functionality is used in conjunction with a sockmap - a special-purpose BPF map that contains references to socket structures and associated values. sockmaps are used to support redirection. The program is attached and the bpf_sk_redirect_map() helper can be used to carry out the redirection. The general approach is that we catch socket creation events with sock_ops BPF programs, associate values with the sockmap for these, and then use data at the sk_skb instrumentation points to inform socket redirection - this is termed the verdict, and the program for this is attached to the sockmap via BPF_SK_SKB_STREAM_VERDICT. The verdict can be __SK_DROP, __SK_PASS, or __SK_REDIRECT. Another use case for this program type is in the strparser framework (https://www.kernel.org/doc/Documentation/networking/strparser.txt). BPF programs can be used to parse message streams via callbacks for read operations, verdict and read completion. TLS and KCM use stream parsing.

How do I attach my program? A redirection program is attached to a sockmap as BPF_SK_SKB_STREAM_VERDICT; it should return the result of bpf_sk_redirect_map(). A strparser program is attached via BPF_SK_SKB_STREAM_PARSER and should return the length of data parsed.

What context is provided? A pointer to the struct __sk_buff containing packet metadata/data. However, more fields are accessible to the sk_skb program type. The extra fields (family, remote_ip4, local_ip4, remote_ip6[4], local_ip6[4], remote_port and local_port) are documented in include/uapi/linux/bpf.h.

So from the above alone we can see we can gather information about the socket, since the above represents the key information that identifies the socket uniquely (protocol is already available in the globally-accessible portion of the struct __sk_buff).

When does it run? A stream parser can be attached to a socket via BPF_SK_SKB_STREAM_PARSER attachment to a sockmap, and the parser runs on socket receive via smap_parse_func_strparser() in kernel/bpf/sockmap.c . BPF_SK_SKB_STREAM_VERDICT also attaches to the sockmap, and is run via smap_verdict_func().

2. tc (traffic control) subsystem programs

Next let's examine the program type related to the TC kernel packet scheduling subsystem. See the tc(8) manpage for a general introduction, and tc-bpf(8) for BPF specifics.

2.1 tc_cls_act : qdisc classifier

What do I do with it? tc_cls_act allows us to use BPF programs as classifiers and actions in tc, the Linux QoS subsystem. What's even better is the tc(8) command has eBPF support also, so we can directly load BPF programs as classifiers and actions for inbound (ingress) and outbound (egress) traffic. See http://man7.org/linux/man-pages/man8/tc-bpf.8.html for a description of how to use tc's BPF functionality. tc programs can classify, modify, redirect or drop packets.

How do I attach my program? tc(8) can be used; see tc-bpf(8) for details. The basics are that we create a "clsact" qdisc for a network device, and add ingress and egress classifiers/actions by specifying the BPF object and relevant ELF section. For example, to add an ingress classifier to eth0 in ELF section my_elf_sec from myprog_kernel.o (a bpf-bytecode-compiled object file):

What context is provided? A pointer to struct __sk_buff packet metadata/data.

When does it get run? As mentioned above, a clsact qdisc must be added, and once it is we can attach BPF programs to classify inbound and outbound traffic. Implementation-wise, act_bpf.c and cls_bpf.c implement action/classifier modules. On ingress/egress, sch_handle_ingress()/sch_handle_egress() call tcf_classify(). In the case of ingress, we do classification via the core network interface receive function, so we are getting the packet after the driver has processed it but before IP etc see it. On egress, the filtering is done prior to submitting to the device queue for transmit.

3. xdp : the eXpress Data Path.

The key design goal for XDP is to introduce programmability in the network datapath. The aim is to provide the XDP hooks as close to the device as possible (before the OS has created sk_buff metadata) to maximize performance while supporting a common infrastructure across devices. To support XDP like this requires driver changes. For an example see drivers/net/ethernet/broadcom/bnxt/bnxt_xdp.c. A bpf net device op (ndo_bpf) is added. For bnxt it supports XDP_SETUP_PROG and XDP_QUERY_PROG actions; the former configures the device for XDP, reserving rings and setting the program as active. The latter returns the BPF program id. BPF-specific transmit and receive functions are provided and called by the real send/receive functions if needed.

3.1 BPF_PROG_TYPE_XDP

What do I do with it? XDP allows access to packet data as early as possible, before packet metadata (struct sk_buff) has been assigned. Thus it is a useful place to do DDoS mitigation or load balancing since such activities can often avoid the expensive overhead of sk_buff allocation. XDP is all about supporting run-time programming of the kernel via BPF hooks, but by working in concert with the kernel itself; i.e. it is not a kernel bypass mechanism. Actions supported include XDP_PASS (pass into network processing as usual), XDP_DROP (drop), XDP_TX (transmit) and XDP_REDIRECT. See include/uapi/linux/bpf.h for the "enum xdp_action".

How do I attach my program? Via a netlink socket message. A netlink socket - socket(AF_NETLINK, SOCK_RAW, NETLINK_ROUTE) - is created and bound, and then we send an RTM_SETLINK netlink message carrying a nested attribute of type NLA_F_NESTED | 43 (IFLA_XDP), which marks it as an XDP request. The message contains the BPF program fd and the interface index (ifindex). See samples/bpf/bpf_load.c for an example.

What context is provided? A pointer to struct xdp_md, defined in include/uapi/linux/bpf.h; data and data_end mark the start and end of the packet data:

/* user accessible metadata for XDP packet hook
 * new fields must be added to the end of this structure
 */
struct xdp_md {
	__u32 data;
	__u32 data_end;
};

When does it get run? "Real" XDP is implemented at the driver level, and transmit/receive ring resources are set aside for XDP usage. For cases where drivers do not support XDP, there is the option of using "generic" XDP, which is implemented in net/core/dev.c. The downside of this is that we do not bypass skb allocation; it just allows us to use XDP for such devices also.
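
The data/data_end pair in struct xdp_md is what an XDP program uses to prove to the verifier that every access stays inside the packet. The sketch below mirrors that pattern in user-space C so it can be run and tested: it takes the two pointers directly (a real program would derive them from ctx->data and ctx->data_end and compare pointers rather than lengths), and it implements a toy policy of dropping IPv4 UDP and passing everything else. The function name xdp_verdict and the policy are assumptions for illustration only.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <netinet/in.h>		/* IPPROTO_UDP */
#include <arpa/inet.h>		/* htons() */
#include <linux/bpf.h>		/* enum xdp_action: XDP_DROP, XDP_PASS, ... */
#include <linux/if_ether.h>	/* struct ethhdr, ETH_P_IP */
#include <linux/ip.h>		/* struct iphdr */

int xdp_verdict(const void *data, const void *data_end)
{
	const uint8_t *pkt = data;
	size_t len = (size_t)((const uint8_t *)data_end - pkt);
	struct ethhdr eth;
	struct iphdr ip;

	/* Never read a header before checking that it fits in the packet. */
	if (len < sizeof(eth))
		return XDP_PASS;
	memcpy(&eth, pkt, sizeof(eth));
	if (eth.h_proto != htons(ETH_P_IP))
		return XDP_PASS;

	if (len < sizeof(eth) + sizeof(ip))
		return XDP_PASS;
	memcpy(&ip, pkt + sizeof(eth), sizeof(ip));

	/* Toy policy: drop IPv4 UDP before any sk_buff would be allocated. */
	return ip.protocol == IPPROTO_UDP ? XDP_DROP : XDP_PASS;
}
```

Truncated or non-IPv4 frames fall through to XDP_PASS here, leaving the normal network stack to deal with them.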

4. kprobes, tracepoints and perf events

kprobes, tracepoints and perf events all provide kernel instrumentation. kprobes - https://www.kernel.org/doc/Documentation/kprobes.txt - allow instrumentation of specific functions - entry of a function can be monitored via a kprobe, along with most instructions within a function, or entry/return can be instrumented via a kretprobe. When one of these probes is enabled, the code at the enable point is saved, and replaced with a breakpoint instruction. When this breakpoint is hit, a trap is generated, registers are saved and we branch to the relevant instrumentation handler. For example, kprobes are handled by kprobe_dispatcher() which gets the address of the kprobe and register context as arguments. kretprobes are implemented via kprobes; a kprobe fires on entry and modifies the return address, saving the original and replacing it with the location of the instrumentation handler. Tracepoints - https://www.kernel.org/doc/Documentation/trace/tracepoints.rst - are similar, but rather than being enabled at particular instructions, they are explicitly marked at sites in code, and if enabled can be used to collect debugging information at those sites of interest. The same tracepoint can be declared in multiple places; for example trace_drv_return_int() is called in multiple places in net/mac80211/driver-ops.c .

Perf events - https://perf.wiki.kernel.org/index.php/Main_Page - are the basis for eBPF support for these program types. BPF essentially piggy-backs on the existing infrastructure for event sampling, allowing us to attach programs to perf events of interest, which include kprobes, uprobes, tracepoints etc as well as other software events, and indeed hardware events can be monitored too.

These instrumentation points are what give BPF the capability to be a general-purpose tracing tool as well as a means for supporting the original networking-centric use cases like socket filtering.

4.1 BPF_PROG_TYPE_KPROBE

What do I do with it? Instrument code in any kernel function (bar a few exceptions) via kprobe, or instrument entry/return via kretprobe. k[ret]probe_perf_func() executes a BPF program attached to the probe point. Note that this program type can also be used to attach to u[ret]probes - see https://www.kernel.org/doc/Documentation/trace/uprobetracer.txt for details.

How do I attach my program? When the kprobe is created via sysfs, it has an id associated with it, stored in /sys/kernel/debug/tracing/events/[uk]probes/<probename>/id (return probes created with the r: prefix appear under the same directory). https://www.kernel.org/doc/Documentation/trace/kprobetrace.txt contains details on how to create a kprobe using sysfs. For example, to create a probe called "myprobe" on entry to tcp_retransmit_skb() and retrieve its id:
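
The same steps can be sketched in C rather than from the shell, roughly as a loader would do them: write a definition such as "p:myprobe tcp_retransmit_skb" to the kprobe_events file, then read the numeric id back. The helper names format_kprobe_def and create_kprobe are hypothetical, the tracing directory is assumed to be mounted at /sys/kernel/debug/tracing as in the paths above, and "kprobes" is the default event group described in kprobetrace.txt. Creating a probe needs root.

```c
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

#define TRACING_DIR "/sys/kernel/debug/tracing"

/* Builds "p:<event> <function>" (entry) or "r:<event> <function>" (return).
 * Returns the string length, or -1 if it does not fit in buf. */
int format_kprobe_def(char *buf, size_t size, int is_return,
		      const char *event, const char *function)
{
	int n = snprintf(buf, size, "%c:%s %s",
			 is_return ? 'r' : 'p', event, function);

	return (n < 0 || (size_t)n >= size) ? -1 : n;
}

/* Registers the probe and returns its numeric id, or -1 on failure. */
long create_kprobe(int is_return, const char *event, const char *function)
{
	char def[256], path[256], id[32];
	ssize_t n;
	int fd, len;

	len = format_kprobe_def(def, sizeof(def), is_return, event, function);
	if (len < 0)
		return -1;

	/* Appending keeps any probes that are already defined. */
	fd = open(TRACING_DIR "/kprobe_events", O_WRONLY | O_APPEND);
	if (fd < 0)
		return -1;
	n = write(fd, def, (size_t)len);
	close(fd);
	if (n != len)
		return -1;

	snprintf(path, sizeof(path), TRACING_DIR "/events/kprobes/%s/id", event);
	fd = open(path, O_RDONLY);
	if (fd < 0)
		return -1;
	n = read(fd, id, sizeof(id) - 1);
	close(fd);
	if (n <= 0)
		return -1;
	id[n] = '\0';
	return strtol(id, NULL, 10);
}
```

Calling create_kprobe(0, "myprobe", "tcp_retransmit_skb") as root would correspond to the example described above.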

We can use that probe id to open a perf event, enable it, and set the BPF program for that perf event to be our program. See samples/bpf/bpf_load.c in the load_and_attach() function for how this can be done for k[ret]probes. The code might look something like this:
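
A minimal sketch of those three steps follows, loosely modelled on what load_and_attach() does. The function name attach_prog_to_event is hypothetical, event_id is the id read from the tracing directory, and prog_fd is assumed to be the descriptor of an already-loaded BPF program; the attribute values (sampling every event, CPU 0, all processes) are one reasonable choice rather than the only one.

```c
#define _GNU_SOURCE
#include <string.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <linux/perf_event.h>

/* Opens a perf event for a kprobe/tracepoint id, enables it and attaches
 * the BPF program. Returns the perf event fd, or -1 on failure. */
int attach_prog_to_event(unsigned long long event_id, int prog_fd)
{
	struct perf_event_attr attr;
	int efd;

	memset(&attr, 0, sizeof(attr));
	attr.size = sizeof(attr);
	attr.type = PERF_TYPE_TRACEPOINT;	/* used for kprobe events too */
	attr.config = event_id;
	attr.sample_type = PERF_SAMPLE_RAW;
	attr.sample_period = 1;
	attr.wakeup_events = 1;

	/* glibc has no wrapper for perf_event_open(), so use syscall():
	 * pid -1 and cpu 0 mean "all processes on CPU 0", no group, no flags. */
	efd = (int)syscall(__NR_perf_event_open, &attr, -1, 0, -1, 0UL);
	if (efd < 0)
		return -1;

	if (ioctl(efd, PERF_EVENT_IOC_ENABLE, 0) < 0 ||
	    ioctl(efd, PERF_EVENT_IOC_SET_BPF, prog_fd) < 0) {
		close(efd);
		return -1;
	}
	return efd;
}
```

Keeping the returned descriptor open keeps the program attached; closing it detaches the program from the event.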

What context is provided? A struct pt_regs *ctx, from which the registers can be accessed. Much of this is platform-specific, but some general-purpose functions exist such as regs_return_value(regs), which returns the value of the register that holds the function return value (regs->ax on x86).

When does it get run? When the probe is enabled and breakpoint is hit, k[ret]probe_perf_func() executes a BPF program attached to the probe point via trace_call_bpf(). Similar story for u[ret]probe_perf_func().

4.2 BPF_PROG_TYPE_TRACEPOINT

What do I do with it? Instrument tracepoints in kernel code. Tracepoints can be enabled via sysfs as is the case with kprobes, and in a similar way. The list of trace events can be seen under /sys/kernel/debug/tracing/events.

How do I attach my program? As we saw above, when the tracepoint is created via sysfs, it has an id associated with it. We can use that probe id to open a perf event, enable it, and set the BPF program for that perf event to be our program. See samples/bpf/bpf_load.c in the load_and_attach() function for how this can be done for tracepoints; the above code snippet for kprobes works for tracepoints also. As an example showing how tracepoints are enabled, here we enable the net/net_dev_xmit tracepoint as "myprobe2" and retrieve its id:

What context is provided? The context provided by the specific tracepoint; arguments and data types are associated with the tracepoint definition.

When does it get run? When the tracepoint is enabled and hit, perf_trace_() (see definition in include/trace/perf.h) calls perf_trace_run_bpf_submit() which will invoke the bpf program via trace_call_bpf().

4.3 BPF_PROG_TYPE_PERF_EVENT

What do I do with it? Instrument software and hardware perf events. These include events like syscalls, timer expiry, sampling of hardware events, etc. Hardware events include PMU (performance monitoring unit) events, which tell us things like how many instructions completed. Perf event monitoring can be targeted at a specific process or group, or at a processor, and a sample period can be specified for profiling.

How do I attach my program? A similar model as per the above; we call perf_event_open() with an attribute set, enable the perf event via PERF_EVENT_IOC_ENABLE ioctl(), and set the bpf program via PERF_EVENT_IOC_SET_BPF ioctl(). For a PMU (performance monitoring unit) perf event example, see these snippets from samples/bpf/sampleip_user.c:

When does it get run? Depends on the perf event firing and the sample rate chosen, specified by freq and sample_period fields in the perf event attribute structure.
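
As a sketch of how the sample rate is expressed, the helper below fills in a perf event attribute structure for frequency-based sampling. It is not taken from sampleip_user.c: the function name make_cpu_clock_sampler is hypothetical, and the choice of the software CPU clock event is an assumption (a hardware event would use PERF_TYPE_HARDWARE with, for example, PERF_COUNT_HW_CPU_CYCLES). The resulting structure would then be passed to perf_event_open() once per CPU, and the BPF program attached with the ioctl() calls described above.

```c
#include <string.h>
#include <linux/perf_event.h>

/* Describe an event that fires roughly hz times per second per CPU. */
struct perf_event_attr make_cpu_clock_sampler(unsigned long long hz)
{
	struct perf_event_attr attr;

	memset(&attr, 0, sizeof(attr));
	attr.size = sizeof(attr);
	attr.type = PERF_TYPE_SOFTWARE;
	attr.config = PERF_COUNT_SW_CPU_CLOCK;
	attr.freq = 1;		/* treat the value below as a frequency */
	attr.sample_freq = hz;	/* shares a union with sample_period */
	attr.inherit = 1;	/* also count child tasks */
	return attr;
}
```

With freq set to 0 instead, the same union member is read as sample_period, i.e. "fire once every N events".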

5. cgroups-related program types

CGroups are used to handle resource allocation, allowing or denying access to system resources such as CPU, network bandwidth etc for groups of processes. One key use case for cgroups is containers; a container's resource access is limited via cgroups while its activities are isolated by the various classes of namespace (network namespace, process ID namespace etc). In the BPF context, we can create eBPF programs that allow or deny access. In include/linux/bpf-cgroup.h we can see definitions for execution of socket/skb programs, where __cgroup_bpf_run_filter_skb is called wrapped in a check that cgroup BPF is enabled:

If cgroups are enabled, we attach our program to the cgroup and it will be executed at the relevant hook points. To get an idea of the full list of hooks, consult include/uapi/linux/bpf.h and examine the enumerated type "bpf_attach_type" for BPF_CGROUP_* definitions.

5.1 BPF_PROG_TYPE_CGROUP_SKB

What do I do with it? Allow or deny network access on IP egress/ingress (BPF_CGROUP_INET_INGRESS/BPF_CGROUP_INET_EGRESS). BPF programs should return 1 to allow access. Any other value results in the function __cgroup_bpf_run_filter_skb() returning -EPERM, which will be propagated to the caller such that the packet is dropped.

How do I attach my program? The program is attached to a specific cgroup's file descriptor.
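
A minimal sketch of that attach step using the BPF system call directly is shown below. The helper name attach_to_cgroup is hypothetical; cgroup_fd is assumed to come from opening the cgroup's directory (for example with open(path, O_RDONLY | O_DIRECTORY)), and prog_fd from loading the program.

```c
#define _GNU_SOURCE
#include <string.h>
#include <unistd.h>
#include <sys/syscall.h>
#include <linux/bpf.h>		/* union bpf_attr, BPF_PROG_ATTACH */

/* Attach a loaded program to a cgroup for the given hook, e.g.
 * BPF_CGROUP_INET_INGRESS or BPF_CGROUP_INET_EGRESS.
 * Returns 0 on success, -1 with errno set on failure. */
int attach_to_cgroup(int cgroup_fd, int prog_fd, enum bpf_attach_type type)
{
	union bpf_attr attr;

	memset(&attr, 0, sizeof(attr));
	attr.target_fd = (__u32)cgroup_fd;	/* the cgroup directory fd */
	attr.attach_bpf_fd = (__u32)prog_fd;	/* the BPF program fd */
	attr.attach_type = type;

	return (int)syscall(__NR_bpf, BPF_PROG_ATTACH, &attr, sizeof(attr));
}
```

The same call shape serves the other cgroup-attached program types mentioned in this article, with a different attach type such as BPF_CGROUP_INET_SOCK_CREATE or BPF_CGROUP_SOCK_OPS.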

What context is provided? The relevant skb.

When does it get run? For inet ingress, sk_filter_trim_cap() (see above) contains a call to BPF_CGROUP_RUN_PROG_INET_INGRESS(sk, skb); if a non-zero value is returned, the error is propagated to the caller (e.g. __sk_receive_skb()) and the packet is discarded and freed. A similar approach is taken on egress, but in ip[6]_finish_output().

5.2 BPF_PROG_TYPE_CGROUP_SOCK

What do I do with it? Allow or deny network access at various socket-related events (BPF_CGROUP_INET_SOCK_CREATE, BPF_CGROUP_SOCK_OPS). As above, BPF programs should return 1 to allow access. Any other value results in the function __cgroup_bpf_run_filter_sk() returning -EPERM, which will be propagated to the caller such that the socket operation fails.

How do I attach my program? The program is attached to a specific cgroup's file descriptor.

What context is provided? The relevant socket (sk).

When does it get run? At socket creation time, in inet_create() we call BPF_CGROUP_RUN_PROG_INET_SOCK() with the socket as argument, and if that function fails, the socket is released.

6. Lightweight tunnel program types.

Lightweight tunnels - https://lwn.net/Articles/650778/ - are a simple way to do tunneling by attaching encapsulation instructions to routes. The examples in the patch description make things clearer: iproute examples (without BPF):

So we're telling Linux for example that for traffic to 40.1.1.1/32 addresses, we want to encapsulate with a VXLAN ID of 10 and destination IPv4 address of 50.1.1.2. BPF programs can do the encapsulation on outbound/transmit (inbound packets are read-only). See https://lwn.net/Articles/705609/ for more details. Similarly to tc, iproute eBPF support allows us to attach the eBPF program ELF section directly:

Summary

So hopefully this roundup of program types was useful. We can see that BPF's safe in-kernel programmable environment can be used in all sorts of interesting ways! The next thing we will talk about is what BPF helper functions are available within the various program types.

Learning more about BPF

Thanks for reading this installment of our series on BPF. We hope you found it educational and useful.