I've finally had some time to work on this. Here is the result so far:
In the ipfilter 5.1.2 code in -current, there are two locking-related bugs (in
one case a lock is released too soon, and in one case, a lock can be leaked),
and the custom-built red-black tree seems to have some bugs. I haven't looked
into what bugs the ipf rb-tree implementation has; at Christos's suggestion,
I've just switched it to use our <sys/rbtree.h> implementation, and that solves
the problem.
The ipf rb-tree is implemented as cpp macros, so I swapped it out by adding a
different set of macros that call <sys/rbtree.h> functions. I believe this is a
minimally invasive change to the ipf code base, and it should maintain
compatibility with all the other OSes ipf is built on. That said, the bug will
remain on other OSes, so Darren may
want to check that out. (The bug manifests as a kernel panic or hard hang
during a call to RBI_SEARCH or RBI_INSERT.)
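To illustrate the general idea of keeping a macro interface while swapping the
code behind it, here is a minimal sketch. It is not the attached patch: the
DEMO_RB_* macros, struct demo_node, and struct demo_head are hypothetical names,
and the backend is POSIX tsearch()/tfind() from <search.h> rather than
<sys/rbtree.h>, chosen only so the example builds outside a NetBSD kernel.

```c
#include <search.h>   /* POSIX tsearch()/tfind(): stand-in tree backend */
#include <stddef.h>

/* Hypothetical node type for illustration; not an ipf structure. */
struct demo_node {
	int key;
};

/* Hypothetical tree head; tsearch() keeps its root in a void pointer. */
struct demo_head {
	void *root;
};

/* Three-way comparison of two nodes by key. */
static int
demo_cmp(const void *a, const void *b)
{
	const struct demo_node *na = a, *nb = b;

	return (na->key > nb->key) - (na->key < nb->key);
}

/*
 * Insert n. Returns the node stored in the tree for that key: n itself,
 * or the node already present if the key is a duplicate. Returns NULL
 * if the backend could not allocate.
 */
static struct demo_node *
demo_insert(struct demo_head *h, struct demo_node *n)
{
	void *slot = tsearch(n, &h->root, demo_cmp);

	return slot == NULL ? NULL : *(struct demo_node **)slot;
}

/* Look up a node by key; returns NULL if no node has that key. */
static struct demo_node *
demo_search(struct demo_head *h, int key)
{
	struct demo_node probe = { .key = key };
	void *slot = tfind(&probe, &h->root, demo_cmp);

	return slot == NULL ? NULL : *(struct demo_node **)slot;
}

/*
 * The macro layer callers see. Only these definitions change when the
 * backend is replaced; call sites stay as they are.
 */
#define DEMO_RB_INIT(h)		((h)->root = NULL)
#define DEMO_RB_INSERT(h, n)	demo_insert((h), (n))
#define DEMO_RB_SEARCH(h, k)	demo_search((h), (k))
```

Callers would only ever write DEMO_RB_INSERT(&head, &node) or
DEMO_RB_SEARCH(&head, key), so the tree code behind the macros can be replaced
without touching them.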
Attached is a patch that keeps my router from panicking or hanging under heavy
NAT load. Would anyone like to take a look at it? I think these changes should
be incorporated into -current.
After that, there are still a couple of other ipf problems that cause serious
issues, although they don't kill the machine. For example, ns_bucketlen, the
count of elements in each bucket of the hash table that keeps NAT state, can be
decremented below 0. Since it's an unsigned int, it wraps around to a very large
value, which makes it look as if the bucket is way over-full, and no new state
can be tracked between the two hosts in question. I'll try to look into this
later today.
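As a rough illustration of the wraparound (this is not the ipf code; the
function names and the limit check are hypothetical), decrementing an unsigned
counter that is already 0 yields UINT_MAX, which any "bucket is full" comparison
then treats as over the limit. A guarded decrement is one possible way to avoid
that, though it only hides whatever unbalanced decrement caused it:

```c
#include <limits.h>

/* Hypothetical unchecked decrement: 0 wraps around to UINT_MAX. */
static unsigned int
demo_dec_unchecked(unsigned int len)
{
	return len - 1;
}

/* Hypothetical guarded decrement: never goes below 0. */
static unsigned int
demo_dec_guarded(unsigned int len)
{
	return len > 0 ? len - 1 : 0;
}

/* Hypothetical limit check: nonzero if the bucket counts as full. */
static int
demo_bucket_full(unsigned int len, unsigned int max)
{
	return len >= max;
}
```

With a limit of, say, 12 entries, demo_bucket_full(demo_dec_unchecked(0), 12)
reports a full bucket even though it is empty, while the guarded version does
not.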
- Geoff