The future NetBSD network infrastructure has to embrace two major design criteria: Symmetric Multi-Processing (SMP) and modularity. Other design considerations include not only supporting but taking advantage of the capability of newer network devices to do packet classification, payload splitting, and even full connection offload.

Symmetric Multi-Processing

NetBSD networking has evolved to work in a uniprocessor environment. Switching it to use fine-grained locking is a hard and complex problem.

The network infrastructure can be divided into five major components:

Interfaces (both real devices and pseudo-devices)

Socket code

Protocols

Routing code

mbuf code.

Part of the complexity is that due to the monolithic nature of the
kernel each layer currently feels free to call any other layer.
This makes designing a lock hierarchy difficult and likely to fail.

Part of the problem is asynchronous upcalls, which include:

ifa->ifa_rtrequest for route changes

pr_ctlinput for interface events

Another source of complexity is the large number of global variables
scattered throughout the source files. This makes putting locks
around them difficult.

The proposed solutions presented here include (in no particular order):

One building block is the producer-consumer queue (PCQ). It allows multiple writers (producers) but only a single reader (consumer).
Compare-And-Store operations are used to allow lockless updates.
The consumer is expected to be protected by a mutex that covers the structure
that the PCQ is embedded into (e.g. socket lock, ifnet hwlock).
These queues operate in a First-In, First-Out (FIFO) manner.
The act of inserting or removing an item from a PCQ does not modify the item in any way.
A PCQ does not prevent an item being inserted multiple times into a single PCQ.

Since this structure isn't specific to networking, it will be accessed via <sys/pcq.h> and the code will live in kern/subr_pcq.c.

bool pcq_put(pcq_t *pcq, void *item);

Places item at the end of the queue. If there is no room in the queue for the item, false is returned; otherwise true is returned. The item must not have the value of NULL.

void *pcq_peek(pcq_t *pcq);

Returns the next item to be consumed from the queue but does not remove it from the queue. If the queue was empty, NULL is returned.

void *pcq_get(pcq_t *pcq);

Removes the next item to be consumed from the queue and returns it. If the queue was empty, NULL is returned.

size_t pcq_maxitems(pcq_t *pcq);

Returns the maximum number of items that the queue can store at any one time.

These routines allow for commonly typed items to be locklessly inserted at either the head or tail of a queue for either last-in, first-out (LIFO) or first-in, first-out (FIFO) behavior, respectively.
However, a queue is not intrinsically LIFO or FIFO.
Its behavior is determined solely by the method with which each item was pushed onto the queue.

It is only possible for an item to be removed from the head of the queue. This removal is also performed in a lockless manner.

All items in the queue must share an atomic_queue_link_t member at the same offset from the beginning of the item. This offset is passed to atomic_qinit.

void atomic_qinit(atomic_queue_t *q, size_t offset);

Initialize the atomic_queue_t queue at q.

offset is the offset to the atomic_queue_link_t inside the data structure where the pointer to the next item in this queue will be placed. It should be obtained using offsetof.

void *atomic_qpeek(atomic_queue_t *q);

Returns a pointer to the item at the head of the supplied queue q. If there was no item because the queue was empty, NULL is returned. No item is removed from the queue. Given this is an unlocked operation, it should only be used as a hint as to whether the queue is empty.

void *atomic_qpop(atomic_queue_t *q);

Removes the item (if present) at the head of the supplied queue q and returns a pointer to it. If there was no item to remove because the queue was empty, NULL is returned.
Because this routine uses atomic Compare-And-Store operations, the returned item must stay accessible for some indeterminate time, so that other interrupted or concurrent callers to this function with this q can continue to dereference it without trapping.

BSD systems have always used a radix tree for their routing tables.
However, the radix tree implementation is showing its age.
Its lack of flexibility (it is suitable only for use in a routing table) and its overhead of use (it requires memory allocation/deallocation for insertions and removals) make replacing it with something better tuned to today's processors a necessity.

Since a radix tree branches on bit differences, finding these bit differences efficiently is crucial to the speed of tree operations.
This is most quickly done by XORing the key and the tree node's value together and then counting the number of leading zeroes in the result of the XOR.
Many processors today (ARM, PowerPC) have instructions that can count the number of leading zeroes in a 32-bit word (and even a 64-bit word).
Even those that don't can use a simple constant-time routine to count them:

The existing BSD radix tree implementation does not use this method but instead uses a far more expensive method of comparison.
Adapting the existing implementation to do the above is actually more expensive than writing a new implementation.

The primary requirements for the new radix tree are:

Be self-contained.
It can't require additional memory other than what is used in its data structures.

Be generic.
A radix tree has uses outside networking.

To make the radix tree flexible, all knowledge of how keys are represented
has been encapsulated into a pt_tree_ops_t
structure which contains four functions:

Returns true if both foo and bar objects have the identical string of bits starting at *bitoffp and ending before max_bitoff.
In addition to returning true, *bitoffp should be set to the smaller of max_bitoff or the length, in bits, of the compared bit strings.
Any bits before *bitoffp are to be ignored.
If the strings of bits are not identical, *bitoffp is set to where the bit difference occurred, *slotp is the value of that bit in foo, and false is returned.
The foo and bar (if not NULL) arguments are pointers to a key member inside a tree object.
If bar is NULL, then it is assumed to point to a key consisting entirely of zero bits.

A wq must be supplied. It may be one returned
by kcont_workqueue_acquire or a predefined
workqueue such as (sorted from highest priority to lowest):

wq_softserial,
wq_softnet,
wq_softbio,
wq_softclock

wq_prihigh,
wq_primedhigh,
wq_primedlow,
wq_prilow

lock, if non-NULL, will be locked before calling func(arg) and released afterwards. However, if the lock will be released and/or destroyed before the called function returns, then before returning kcont_setmutex must be called with either a new mutex to be released or NULL.
If acquiring lock would block, other pending kernel continuations which depend on other locks may be dispatched in the meantime. However, all continuations sharing the same set of { wq, lock, [ci] } will be processed in the order they are scheduled.

Currently, flags must be 0.

int kcont_schedule(kcont_t *kc, struct cpu_info *ci, int nticks);

If the continuation is marked as INVOKING, an error of
EBUSY will be returned.
If nticks is 0, the continuation is marked as INVOKING while EXPIRED and PENDING are cleared, and the continuation is scheduled to be invoked without delay.
Otherwise, the continuation is marked as PENDING while
EXPIRED status is cleared, and the timer reset to
nticks.
Once the timer expires, the continuation is marked as
EXPIRED and INVOKING, and the
PENDING status is cleared.

If ci is non-NULL, the continuation will be invoked on the specified CPU if the continuation's workqueue has per-cpu queues.
If that workqueue does not provide per-cpu queues, an error of
ENOENT will be returned.
Otherwise when ci is NULL, the continuation will be
invoked on either the current CPU or the next available CPU depending
on whether the continuation's workqueue has per-cpu queues or not,
respectively.

void kcont_destroy(kcont_t *kc);

kmutex_t *kcont_getmutex(kcont_t *kc);

Return the lock currently associated with the continuation kc.

void kcont_setarg(kcont_t *kc, void *arg);

Update arg in the continuation kc.
If no lock is associated with the continuation, then arg may be changed at any time though if the continuation is being invoked it may not pick up the change. Otherwise, kcont_setarg must only be called when the associated lock is locked.

kmutex_t *kcont_setmutex(kcont_t *kc, kmutex_t *lock);

Updates the lock associated with the continuation kc and returns the previous lock.
If no lock is currently associated with the continuation, then calling this function with a lock other than NULL will trigger an assertion failure.
Otherwise, kcont_setmutex must be called only when the existing lock (which will be replaced) is locked.
If kcont_setmutex is called as a result of the invokation of func, then after kcont_setmutex has been called but before func returns, the replaced lock must have been released, and the replacement lock, if non-NULL, must be locked upon return.

void kcont_setfunc(kcont_t *kc, void (*func)(void *), void *arg);

Update func and arg in the continuation kc.
If no lock is associated with the continuation, then only arg may be changed. Otherwise, kcont_setfunc must be called only when the associated lock is locked.

bool kcont_stop(kcont_t *kc);

The kcont_stop function stops the timer associated with the continuation handle kc.
The PENDING and EXPIRED status for the continuation handle is cleared.
It is safe to call kcont_stop on a continuation handle that is not pending, so long as it is initialized.
kcont_stop will return true if the continuation was EXPIRED.

bool kcont_pending(kcont_t *kc);

The kcont_pending function tests the PENDING status of the continuation handle kc.
A PENDING continuation is one whose timer has been started and has not expired.
Note that it is possible for a continuation's timer to have expired without being invoked if the continuation's lock could not be acquired or there are higher-priority threads preventing its invocation.
Note that it is only safe to test PENDING status when holding the continuation's lock.

bool kcont_expired(kcont_t *kc);

The kcont_expired function tests to see if the continuation's function has been invoked since the last kcont_schedule.

bool kcont_active(kcont_t *kc);

bool kcont_invoking(kcont_t *kc);

The kcont_invoking function tests the INVOKING status of the kcont handle kc.
This flag is set just before a continuation's function is called.
Since the scheduling of the worker threads may induce delays, other pending higher-priority code may run before the continuation function is allowed to run.
This may create a race condition if this higher-priority code deallocates storage containing one or more continuation structures whose continuation functions are about to be run.
In such cases, one technique to prevent references to deallocated storage would be to test whether any continuation functions are in the INVOKING state using kcont_invoking, and if so, to mark the data structure and defer storage deallocation until the continuation function is allowed to run.
For this handshake protocol to work, the continuation function will have to use the kcont_ack function to clear this flag.

bool kcont_ack(kcont_t *kc);

The kcont_ack function clears the INVOKING state in the continuation handle kc.
This is used in situations where it is necessary to protect against the race condition described under kcont_invoking.

kcont_wq_t *kcont_workqueue_acquire(pri_t pri, int flags);

kcont_workqueue_acquire returns a workqueue that matches the specified criteria. Thus if multiple requesters ask for the same criteria, they will all be returned the same workqueue.

pri specifies the priority at which the kernel thread which empties the workqueue should run.

If flags is 0 then the standard operation is required. However, the following flag(s) may be bitwise ORed together:

WQ_PERCPU specifies that the workqueue should have a separate queue for each CPU, thus allowing continuations to be invoked on specific CPUs.

int kcont_workqueue_release(kcont_wq_t *wq);

kcont_workqueue_release releases an acquired workqueue.
On the last release, the workqueue's resources are freed and the
workqueue is destroyed.

The API is primarily derived from the callout(9) API and is a
superset of the softint(9) API. The most significant change is
that workqueue items are not tied to a specific kernel thread.

Remove ARP, AARP, ISO SNPA, and IPv6 Neighbors from the routing table.
Instead, the ifnet structure will have a set of nexthop caches (usually implemented using patricia trees), one per address family.
Each nexthop entry will contain the datalink header needed to reach the neighbor.

This will remove cloneable routes from the routing table, remove the need for protocol-specific code in the common Ethernet, FDDI, PPP, etc. code, and put it back where it belongs, in the protocol itself.

When a network device gets an interrupt, it will call <iftype>_defer(ifp) to schedule a kernel continuation for that interface which invokes <iftype>_poll.
Whether the interrupt source should be masked depends on whether the device is a DMA device or a PIO device.
This routine will call (*ifp->if_poll)(ifp) to deal with the interrupt's servicing.

During servicing, any received packets will be passed up via (*ifp->if_input)(ifp, m), which will be responsible for ALTQ or any other optional processing as well as protocol dispatch.
Protocol dispatch in <iftype>_input will decode the datalink headers, if needed, via a table lookup and call the matching protocol's pr_input to process the packet. As such, interrupt queues (e.g. ipintrq) will no longer be needed.
Any transmitted packets can be processed as can MII events.
Either true or false will be returned by if_poll, depending on whether another invocation of <iftype>_poll for this interface should be immediately scheduled or not, respectively.

Memory allocation will be prohibited in the interrupt routines. The device's
if_poll routine should pre-allocate enough mbufs to do any required buffering. For devices doing DMA, the buffers are placed into receive descriptors to be filled via DMA.

For devices doing PIO, pre-allocated mbufs are enqueued onto the softc of the device so that when the interrupt routine needs one it simply dequeues one, fills it in, enqueues it onto a completed queue, and finally calls <iftype>_defer.
If the number of pre-allocated mbufs drops below a threshold, the driver may decide to increase the number of mbufs that if_poll pre-allocates.
If there are no mbufs left to receive the packet, the packet is dropped and the number of mbufs for if_poll to pre-allocate should be increased.

When interrupts are unmasked depends on a few things. If the device is interrupting "too" often, it might make sense for the device's interrupts to remain masked and just schedule the device's continuation for the next clock tick. This assumes the system has a high enough value set for HZ.

Instead of having a set of active workqueue lwps waiting to service
sockets, use the kernel lwp that's blocked on the socket
to service the workitem. It's not being productive while blocked,
it has an interest in getting that workitem done, and maybe
the data can be copied directly to the user's address space, avoiding
queuing in the socket at all.

Collect all the global data for an instance of a network stack
(excluding AF_LOCAL). This includes routing table, data for
multiple domains and their protocols, and the mutexes needed for
regulating access to it all. Indeed, a brane is an instance of
a networking stack.

An interface belongs to a brane, as do processes. This can be
considered chroot(2) for networking; a corresponding chbrane(2)
system call would move a process into another brane.

Radical Thought #1

LWPs in user space don't need a kernel stack

Those pages are only being used in case an exception happens. Interrupts will probably go to their own dedicated stack. One could just keep a set of kernel stacks around: each CPU has one, and when a user exception happens, that stack is assigned to the current LWP and removed as that CPU's active stack. When that CPU next returns to user space, the kernel stack it was using is saved to be used for the next user exception. The idle lwp would just use the current kernel stack.

LWPs waiting for kernel condition shouldn't need a kernel stack

If a LWP is waiting on a kernel condition variable, it is expecting to be inactive for some time, possibly a long time.
During this inactivity, it doesn't really need a kernel stack.

When the exception handler gets a usermode exception, it sets the LWP's restartable flag to indicate that the exception is restartable, and then services the exception as normal.
As routines are called, they can clear the LWP restartable flag as needed.
When an LWP needs to block for a long time, instead of calling cv_wait, it calls cv_restart.
If cv_restart returns false, the LWP's restartable flag was clear, so cv_restart acted just like cv_wait.
Otherwise, the LWP and CV have been tied together (big hand wave), the lock has been released and the routine should return ERESTART.
cv_restart could also wait for a small amount of time (say, 0.5 seconds), and only return ERESTART if that timeout expires.

As the stack unwinds, it will eventually return to the exception handler.
The handler will see the LWP has a bound CV, save the LWP's user state into the PCB, set the LWP to sleeping, mark the LWP's stack as idle, and call the scheduler to find more work. When called, cpu_switchto will notice the stack is marked idle and detach it from the LWP.

When the condition times out or is signalled, the first LWP attached to the condition variable is marked runnable and detached from the CV.
When the cpu_switchto routine is called, it will notice the lack of a stack, so it will grab one, restore the trapframe, and reinvoke the exception handler.