Debugging netlink requests

This week I was working on a Kubernetes networking problem. Basically
our container network backend was reporting that it couldn’t delete
routes, and we didn’t know why.

I started reading the code that was failing, and it was using a library
called “netlink”. I’d never heard of that before this week.

what’s netlink?

Wikipedia says:

Netlink socket family is a Linux kernel interface used for
inter-process communication (IPC) between both the kernel and
userspace processes, and between different userspace processes, in a
way similar to the Unix domain sockets.

The program I was debugging was creating and deleting routes in the route table.
Netlink seems capable of lots of things (kernel <-> userspace and userspace <->
userspace communication), but in this case what was happening was pretty
simple:

userspace program creates a netlink socket

userspace program sends a message with that socket asking the kernel
to delete a route
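The two steps above can be sketched in Python with just the standard library: build an RTM_DELROUTE request by hand with struct, then (optionally) send it on a NETLINK_ROUTE socket. The constant values come from <linux/netlink.h> and <linux/rtnetlink.h>. The helper names (rtattr, build_delroute, send_to_kernel), the sequence number, and the choice of flags and scope are assumptions for this sketch, not what our container network backend actually sent. Sending needs Linux and CAP_NET_ADMIN, so the usage line only builds the message.

```python
import socket
import struct

# Constant values from <linux/netlink.h> and <linux/rtnetlink.h>.
NLM_F_REQUEST = 0x1
NLM_F_ACK = 0x4
RTM_DELROUTE = 25
RT_TABLE_MAIN = 254
RT_SCOPE_NOWHERE = 255
RTA_DST = 1
RTA_OIF = 4


def rtattr(attr_type: int, payload: bytes) -> bytes:
    """Encode one route attribute: 2-byte length, 2-byte type, payload, padded to 4 bytes."""
    length = 4 + len(payload)
    padding = (-length) % 4
    return struct.pack("=HH", length, attr_type) + payload + b"\0" * padding


def build_delroute(dst: str, prefix_len: int, oif: int, seq: int = 1) -> bytes:
    """Build an RTM_DELROUTE request for an IPv4 route leaving via interface index `oif`."""
    # struct rtmsg: family, dst_len, src_len, tos, table, protocol, scope, type, flags
    rtm = struct.pack("=BBBBBBBBI", socket.AF_INET, prefix_len, 0, 0,
                      RT_TABLE_MAIN, 0, RT_SCOPE_NOWHERE, 0, 0)
    attrs = rtattr(RTA_DST, socket.inet_aton(dst)) + rtattr(RTA_OIF, struct.pack("=I", oif))
    body = rtm + attrs
    # struct nlmsghdr: total length, message type, flags, sequence number, port ID (0 here)
    header = struct.pack("=IHHII", 16 + len(body), RTM_DELROUTE,
                         NLM_F_REQUEST | NLM_F_ACK, seq, 0)
    return header + body


def send_to_kernel(msg: bytes) -> bytes:
    """Send `msg` on a NETLINK_ROUTE socket and return the raw reply (Linux only, needs CAP_NET_ADMIN).

    Because NLM_F_ACK is set, the kernel answers with an NLMSG_ERROR message
    whose error code is 0 on success.
    """
    with socket.socket(socket.AF_NETLINK, socket.SOCK_RAW, socket.NETLINK_ROUTE) as sock:
        sock.bind((0, 0))  # (port ID, multicast groups): let the kernel pick the port ID
        sock.send(msg)
        return sock.recv(65536)


if __name__ == "__main__":
    print(build_delroute("172.16.5.0", 24, oif=1).hex())
```

Netlink messages use host byte order, which is why every format string uses "=" (native order, standard sizes, no implicit alignment).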

why the program I was debugging wasn’t working

The route message includes a field called RTA_OIF, which is a network
interface ID (the "output interface" for the route). For example, on my laptop
right now I have 5 network interfaces, numbered 1 through 5. The correct
message I compared against had RTA_OIF set to 1, for the lo loopback interface.
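Python's standard library exposes the same interface numbering through socket.if_nameindex(), so a short sketch like this prints the index-to-name table that RTA_OIF values refer to (ip link shows the same numbers). The function name interface_indexes is made up for this example.

```python
import socket


def interface_indexes() -> dict[int, str]:
    """Return {interface index: name} for every network interface the kernel reports."""
    return {index: name for index, name in socket.if_nameindex()}


if __name__ == "__main__":
    for index, name in sorted(interface_indexes().items()):
        print(index, name)  # on a typical Linux machine the first line is "1 lo"
```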

But in our errant program, the RTA_OIF field was set to 0! I don't think 0 is
even a valid value for this field, since 0 is not a valid network interface ID.
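If you have the raw bytes of a route message (how you capture them is up to you), a small parser can pull out RTA_OIF and flag the 0 case. This is a hypothetical debugging helper, not something from the backend we were running. It assumes a route message with the fixed 16-byte nlmsghdr and 12-byte rtmsg in front of the attributes, in host byte order; route_attrs and check_oif are illustrative names.

```python
import struct

RTA_OIF = 4        # from <linux/rtnetlink.h>
NLMSG_HDRLEN = 16  # sizeof(struct nlmsghdr)
RTMSG_LEN = 12     # sizeof(struct rtmsg)


def route_attrs(msg: bytes) -> dict[int, bytes]:
    """Walk the rtattr list that follows nlmsghdr + rtmsg and return {attr type: payload}."""
    (total_len,) = struct.unpack_from("=I", msg, 0)
    end = min(total_len, len(msg))
    offset = NLMSG_HDRLEN + RTMSG_LEN
    attrs: dict[int, bytes] = {}
    while offset + 4 <= end:
        length, attr_type = struct.unpack_from("=HH", msg, offset)
        if length < 4 or offset + length > end:
            raise ValueError(f"malformed rtattr at offset {offset}")
        attrs[attr_type] = msg[offset + 4 : offset + length]
        offset += (length + 3) & ~3  # each attribute is padded to a 4-byte boundary
    return attrs


def check_oif(msg: bytes) -> int:
    """Return the RTA_OIF interface index, raising if it is missing or 0."""
    attrs = route_attrs(msg)
    if RTA_OIF not in attrs:
        raise ValueError("message has no RTA_OIF attribute")
    (oif,) = struct.unpack("=I", attrs[RTA_OIF])
    if oif == 0:
        raise ValueError("RTA_OIF is 0, which is not a valid interface index")
    return oif
```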

pyroute2 is great

pyroute2 is really cool. If I wanted to write a quick script to understand
what's going on with my network interfaces and routes, I would 100% definitely
try pyroute2. It comes with a lot of great examples.

For example, it can do the equivalent of ip route add 172.16.5.0/24 via 127.0.0.1 dev lo.
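A sketch of that route addition with pyroute2, assuming a recent pyroute2 where IPRoute works as a context manager and route() accepts dst in CIDR form. IPRoute, link_lookup(), and route() are pyroute2's own API; route_args() and add_route() are hypothetical helpers for this sketch, and route_args() refuses an interface index of 0 so the RTA_OIF problem above can't slip through. The import is optional so the helper can be exercised without pyroute2 installed; actually adding the route needs root.

```python
try:
    from pyroute2 import IPRoute  # third-party: pip install pyroute2
except ImportError:
    IPRoute = None


def route_args(dst: str, gateway: str, oif: int) -> dict:
    """Keyword arguments for IPRoute.route(); rejects interface index 0 and below."""
    if oif <= 0:
        raise ValueError(f"invalid interface index {oif}")
    return {"dst": dst, "gateway": gateway, "oif": oif}


def add_route(dst: str = "172.16.5.0/24", gateway: str = "127.0.0.1", ifname: str = "lo") -> None:
    """Roughly `ip route add 172.16.5.0/24 via 127.0.0.1 dev lo` (needs root)."""
    if IPRoute is None:
        raise RuntimeError("pyroute2 is not installed")
    with IPRoute() as ipr:
        # link_lookup returns a list of matching indexes; unpacking fails if there isn't exactly one
        (oif,) = ipr.link_lookup(ifname=ifname)
        ipr.route("add", **route_args(dst, gateway, oif))
```

Looking the index up by name, rather than hard-coding it, is the whole point here: the index is what ends up in the RTA_OIF attribute.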

nltrace

There's also nltrace (for instance, nltrace ip route list), but in this case it didn't give me the information I wanted. It's not a maintained project, but it looks like it could be useful!

that’s all!

It always makes me happy when I learn about a NEW LINUX THING during the course
of my job. When I was in the middle of this, I tweeted:

kubernetes is cool but definitely not easy, my experience is definitely like
“learn how all the networking works in excruciating detail”

which definitely feels true. It's less like "set up networking and it works"
and more like "pick a networking backend, wait a month, discover weird
problems, strace it, learn things about netlink and what an RTA_OIF is, fix
the bugs, eventually it works". Maybe that isn't everyone's experience, but
that is my experience so far!