Martin Josefsson's netfilter blog

I've finally found the refcount problem after adding tons of printk's everywhere.
It turns out that the event cache increases the refcount for each packet in the
L3 protocol handler and then decreases it in nf_conntrack_confirm(). This
refcount isn't coupled to the skb in question as the "regular" refcount is, which
means that "this" refcount isn't decreased automagically when the skb is freed.

When we register the hooks we pass in an array of 'struct nf_hook_ops'
and then we loop over the array calling nf_register_hook() for each hook registration.
nf_register_hook() calls spin_lock_bh()/spin_unlock_bh(), and spin_unlock_bh()
executes any pending softirqs after it releases the lock.

This means that we can register nf_conntrack_in() at PREROUTING and during this
hook registration we get an interrupt from the NIC that schedules a softirq to
be executed as soon as possible. When the hook registration of nf_conntrack_in()
at PREROUTING is complete and the spin_unlock_bh() is called the pending softirq
is executed. And now we just passed a packet through nf_conntrack_in() which
calls the event cache which increases the refcount of the conntrack entry. But
nf_conntrack_confirm() wasn't registered yet so the event cache never decreased
the refcount since the packet never passed through nf_conntrack_confirm().

One workaround could be to extend the spin_lock()/spin_unlock() to cover the
entire nf_conntrack_hooks(), which would prevent the above scenario from occurring...
but only on UP kernels. On SMP kernels one cpu could be going through the hooks
at the same time as we are registering/unregistering them as we are using RCU
for the linked lists of hooks. So that's not a solution.

A better solution is to make sure that we order the hook registrations in the
'struct nf_hook_ops' array in the order the hooks will be called in the net stack.
We then register them in reverse order but still unregister them in forward order.
This means that nf_conntrack_confirm() is registered before nf_conntrack_in(), and
unregistered after, and after each unregistration we call synchronize_net() to make
sure that all packets currently in the net stack are finished before we continue to
unregister the next hook. We must make sure that a packet that passed through
nf_conntrack_in() completes its journey through the net stack before we unregister
the nf_conntrack_confirm() hook, thus the synchronization.

Registering in reverse order and unregistering in forward order means that it's
possible that packets, for example, are passed through nf_conntrack_confirm()
without having been passed through nf_conntrack_in(). This is the reverse
of the original problem, but it's much easier to handle :) We need to handle cases
like this where skbs are supposed to have been passed through an earlier hook
but that hasn't happened because of the racy registration and unregistration.

I've implemented the above solution and so far it has passed my testcase,
consisting of unloading/loading nf_conntrack_ipv4 in a loop, without problems.
Without the fix it's very easy to trigger the refcount problem, even without my
RCU patches. I've been able to trigger it with as few as 3 icmp echo requests
while reloading the module in a loop, sometimes up to 20 packets are needed.
I've only tested the fix in QEMU but I hope it also works for real SMP machines.
And I need to implement the same fix for the ipv6 part as well.

Sometime in the future we'll hopefully be able to get rid of the extra refcounts
the event cache brings, time will tell. Now it's time for some sleep.

I've finally got qemu working as I want so I can start to test my RCU patches
without having to reboot my laptop all the time. I'm having a weird refcount
problem with the L3/L4 RCU patch: I sometimes get a conntrack entry that has an
elevated refcount which never drops down to zero. Very annoying.

Steps to reproduce: ping the test machine and run the following on the test machine:

    while true; do rmmod nf_conntrack_ipv4 ; modprobe nf_conntrack_ipv4; done

This results in the rmmod not returning and an entry with use=1 in /proc/net/nf_conntrack_deleted.
All entries to be deleted as a result of the forced kill are added to this
linked list after being removed from the hashtable, and we wait until they are
all dead since they contain pointers to the L3/L4 protocol handlers that we are
about to unload. They die and get removed from this deleted list when the
refcount, use, drops to zero. But this never happens. The counters of the entry
always say that there's only been traffic in the ORIG direction. Since it was
an icmp echo-request and 'ping' reports that it has received all responses, we
know that the packet must have passed through the stack properly, which should
have decreased the refcount when the packet was kfree_skb()'d. I've seen it
with tcp packets as well, but it's easier to verify that all packets got through
with icmp.

We have three different "users" of refcounts. The timer for the conntrack entry
holds one reference. Each packet passing through the conntrack infrastructure
holds one reference. And we sometimes manually increase and decrease the
refcount when we want to force an entry to stick around for an extended period
of time (some of these forced refcount increases/decreases might go away with the
use of RCU for the actual entries later, but that's another story).

Testing is performed on a UP kernel without preemption and without the -rt
patches, which means that there's no preemption of the softirq going on.
Everything should be serialized and pretty, but something somewhere increases
the refcount without decreasing it. The L3/L4 patch decreases the refcount when
it forcibly kills the timer for the entry, so that should be ok.

It's been a long time since my last entry, about 9.5 months.
So what has happened since then? Not much. I moved to a new apartment 6 months ago,
it still looks a little like I recently moved in but that is starting to change for the better.

I haven't been hacking very much on netfilter, or on anything, lately. I implemented a small
hackish mysql-backed dhcp-server in perl but that's about it. I've been on vacation for 3 weeks and
I've rediscovered the joy of hacking. I've been working on cleaning up nf_conntrack a bit in
preparation for using regular spinlocks instead of rwlocks, and then moving on to RCU.
So far I've split nf_conntrack_core.c into a few smaller files since it was fairly large at
1700 lines, replaced the rwlock with a spinlock, and switched to RCU for the l3/l4 protocol handlers
and helpers, along with various other cleanups and minor optimizations. The goal is to use RCU for
the hashtable as well and only use a few atomic operations in the fast path. Currently we have a
truckload of atomic operations in there.

I now have the hashtrie in nf_conntrack compiling. It is still untested and testing will have to wait until tomorrow.
There are still some unresolved issues, like the unconfirmed list. That list used the normal list_head
that was used for the hashtable when the entry wasn't added to the hashtable. This has to change now since list_head
isn't used any more. The old way of implementing the unconfirmed list was bad anyway since it was a global list which was
modified twice for each new assured connection. The other main issues are some refcounting and locking problems
and I have to implement a way to get conntrack dumping working.

Today was a fairly uneventful day travelling back home.
Hotel-Taxi-Flight-Flight-Train-Taxi-Home

During the second flight from Madrid to Copenhagen I came up with a
possible way to modify the hashtrie to do longest prefix matching.
This needs a lot more thought to make sure it could actually work.
Then comes the implementation which I fear is going to become a bit tricky.
It may end up a very bad idea... time will tell.

I unfortunately caught a cold so when I got back home I had a sore throat and a bit of a fever.

The second day of hacking started out a bit sluggish but it picked up speed later in the day.
We spent the day at the hotel all crammed into one hotel room. Our biggest problem was the
fact that the WiFi at the hotel was extremely unstable, to the point where it was often unusable.
In the evening we went out for yet another wonderful meal. Then it was time to say goodbye to everyone
as I'm going to leave early in the morning.

The formal workshop is over but we have two more days of hacking planned.
This first day we tried to divide ourselves up into small groups where each group
works on or discusses one area, like {nf,ct}netlink, nfhipac, tcpwindowtracking etc...
I mostly experimented with the hashtrie but I also attended the nfhipac discussions
and I have to say that nfhipac is going to kick butt when the proposed changes are made.
These changes add a generic userspace to kernelspace format based on nfnetlink that is
going to be documented so you actually can have different userspace applications to manage
rules, and you can even have different filtering backends in the kernel. We just don't want
to paint ourselves into a corner; we like to think we've learned from previous mistakes.
Another really nice result of this discussion is a new way to pass the data needed
for matches between userspace and kernelspace. Currently that is performed by passing structs
around, which has a lot of problems. One problem is the 64bit kernel and 32bit userspace issue,
which bites when you have things like pointers in the structs. Another problem is
when you want to add more members to the struct: then the size of the struct definition in
an old kernel and in the new userspace library doesn't match anymore, thus we've broken
backwards compatibility, which just isn't allowed. The new idea is to pass this data around
with netlink TLVs, build your internal representation from these TLVs, and
then go the other way around when you list the rules.

We had nothing planned for the evening so we went back to the hotel to leave all the
hardware and then just go out somewhere to eat and drink some beer. 15 of us went out barhopping.
Later in the evening we ended up at a fairly small street. I have to say I've never ever
seen so many people in one place before. The reason for this was that there are 4 bars located
within a total distance of 10 meters. We had a really good time there but when we finally decided
to head back to the hotel we noticed that Pablo, who knew the way back to the hotel, was missing.
We ended up walking for a while but we eventually found the hotel.

We had lots of nice talks and discussions today as well.
The workshop has come up with possible solutions for many current problems and issues.
Hopefully many of them will result in patches :)

The pub/restaurant we ended up at in the evening was a bit interesting. Instead of tables and chairs
it had kind of a bleacher where everyone sat, drank beer and ate their food. Me and Harald yet again
failed to go to sleep when we got back to the hotel. Once again we ended up discussing a lot of different
topics, including why it is that no one seems to have implemented an open software stack for the AC97 compatible
winmodems present in almost all newer laptops. That discussion ended up covering signal processing and
other weird things people have done that make writing a stack for these modems sound not all that difficult
(that is, if you know what you are doing, which I don't :)
I've been hacking on the hashtrie during the day, deletes are now a lot faster and forced eviction
of certain entries (with a special status) is implemented and it appears to be really fast as well. But this new feature will
have to undergo a lot of testing to make sure we're not ending up evicting the wrong entries, agewise that is.
Somehow I keep thinking that if I implement something new and it turns out it's fast, it must be broken in some way.

This was the first day of the Netfilter Workshop. Really nice to meet all the
netfilter hackers again. Lots of nice talks, I gave a small half-improvised talk
about a data structure I'm working on, a hashtrie to be used for connection tracking.
I showed some performance numbers comparing the regular hashtable of ip_conntrack
to itself with different configurations. And I showed some results comparing the
regular hashtable with the hashtrie.

Later in the evening we went out for dinner, that was really nice. I think there were
around 20 people attending the dinner resulting in lots of discussions about many netfilter related issues.
Some of us went back to the hotel early in order to get some well needed sleep. That failed as expected,
but it was worth a try. Me and Harald ended up talking about a lot of different things, including Swedish and German food, for
a while instead so we didn't get more sleep anyway.

All travel arrangements were made several months ago, I just forgot one small detail...
to include the strike of the French air traffic controllers in my calculations.
Because of them my first flight from Copenhagen to Madrid was delayed, which led to me
missing the second, connecting flight. Getting a new boarding card for the next flight
wasn't a problem, but the next flight wasn't supposed to depart for another 1.5 hours.
The monitors said it was delayed 45 minutes so I walked to the gate in order to
rest and maybe finish my presentation I was supposed to give at the NFWS2005.
While walking to the gate I discovered a weird thing: the Madrid airport has designated
smoking areas, but those areas aren't separated from the other areas by
anything other than some blue tape on the floor. I couldn't see any extra ventilation
in the roof above these smoking areas either.

Anyway, this new flight ended up being delayed over an hour. So I arrived in Seville
after midnight. I met Harald at the hotel room and we had a nice chat before it was
time to go to bed.