This RFC proposes to add busy-poll support to AF_XDP sockets. With
busy-poll, the driver is executed in process context by calling the
poll() syscall. The main advantage of this is that all processing
occurs on a single core. This eliminates the core-to-core cache
transfers that otherwise occur between the application and the
softirq/ksoftirqd processing on another core. From a systems point of
view, it also has the advantage that we do not have to provision
extra cores in the system to handle ksoftirqd/softirq processing, as
all processing is done on the single core that executes the
application. The drawback of busy-poll is that max throughput seen
from a single application will be lower (due to the syscall), but on
a per core basis it will often be higher as the normal mode runs on
two cores and busy-poll on a single one.

The semantics of busy-poll from the application point of view are the
following:

* The application is required to call poll() to drive rx and tx
  processing. There is no guarantee that softirq and interrupts will
  do this for you. This is in contrast with the current
  implementations of busy-poll that are opportunistic in the sense
  that packets might be received/transmitted by busy-poll or
  ksoftirqd. (In this patch set, softirq/ksoftirqd will kick in at
  high loads just as in the current opportunistic implementations,
  but I would like to get to a point where this is not the case for
  busy-poll enabled XDP sockets, as this slows down performance
  considerably and starts to use one more core for the softirq
  processing. The end goal is for only poll() to drive the napi loop
  when busy-poll is enabled on an AF_XDP socket. More about this
  later.)

* It should be enabled on a per socket basis. No global enablement, i.e.
  the XDP socket busy-poll will not care about the current
  /proc/sys/net/core/busy_poll and busy_read global enablement
  mechanisms.

* The batch size (how many packets are processed every time the
  napi function in the driver is called, i.e. the weight parameter)
  should be configurable. Currently, the busy-poll size of AF_INET
  sockets is set to 8, but for AF_XDP sockets this is too small as the
  amount of processing per packet is much smaller with AF_XDP. This
  should be configurable on a per socket basis.

* If you put multiple AF_XDP busy-poll enabled sockets into a poll()
  call, the napi contexts of all of them should be executed. This is in
  contrast to the AF_INET busy-poll that quits after the first one that
  finds any packets. We need all napi contexts to be executed due to
  the first requirement in this list. The behaviour we want is much more
  like regular sockets in that all of them are checked in the poll
  call.

* It should be possible to mix AF_XDP busy-poll sockets with any other
  sockets, including busy-poll AF_INET ones, in a single poll() call
  without any change to the semantics or behaviour of any of those
  socket types.

* As suggested by Maxim Mikityanskiy, poll() will in the busy-poll
  mode return POLLERR if the fill ring is empty or the completion
  queue is full.

Busy-poll support is enabled by calling a new setsockopt called
XDP_BUSY_POLL_BATCH_SIZE that takes batch size as an argument. A value
between 1 and NAPI_WEIGHT (64) will turn it on, 0 will turn it off and
any other value will return an error.
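
As a rough illustration of the enablement rule above, here is a minimal
sketch of how an application might set the option. It assumes the option
takes a plain int, uses the value 8 that the uapi patch in this series
proposes for XDP_BUSY_POLL_BATCH_SIZE (this define is not in mainline
headers), and the helper names are hypothetical, not part of libbpf.

```c
#include <errno.h>
#include <sys/socket.h>

#ifndef SOL_XDP
#define SOL_XDP 283 /* socket level used for AF_XDP options on Linux */
#endif

/* Assumption: value proposed by this RFC's uapi patch, not in mainline. */
#ifndef XDP_BUSY_POLL_BATCH_SIZE
#define XDP_BUSY_POLL_BATCH_SIZE 8
#endif

#define BUSY_POLL_MAX_BATCH 64 /* NAPI_WEIGHT, per the RFC text */

/* 0 turns busy-poll off, 1..64 turns it on, anything else is invalid. */
static int busy_poll_batch_valid(int batch)
{
	return batch >= 0 && batch <= BUSY_POLL_MAX_BATCH;
}

/*
 * Hypothetical helper. Rejects out-of-range values up front, mirroring
 * the error the kernel side is described as returning.
 * Returns 0 on success, -1 with errno set on failure.
 */
static int xsk_set_busy_poll_batch(int xsk_fd, int batch)
{
	if (!busy_poll_batch_valid(batch)) {
		errno = EINVAL;
		return -1;
	}
	return setsockopt(xsk_fd, SOL_XDP, XDP_BUSY_POLL_BATCH_SIZE,
			  &batch, sizeof(batch));
}
```

An application would call xsk_set_busy_poll_batch(xsk_socket__fd(xsk), 16)
once after creating the socket, and pass 0 later to turn busy-poll off.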
A typical packet processing rxdrop loop with busy-poll will look something
like this:

    for (i = 0; i < num_socks; i++) {
        fds[i].fd = xsk_socket__fd(xsks[i]->xsk);
        fds[i].events = POLLIN;
    }

    for (;;) {
        ret = poll(fds, num_socks, 0);
        if (ret <= 0)
            continue;

        for (i = 0; i < num_socks; i++)
            rx_drop(xsks[i], fds); /* The actual application */
    }

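
The loop above only looks at the return value of poll(). Since poll()
is described as returning POLLERR when the fill ring is empty or the
completion queue is full, an application would probably also want to
inspect revents per socket. A minimal sketch of that check follows; the
enum and function names are hypothetical and the recovery action is left
to the application.

```c
#include <poll.h>

enum xsk_poll_state {
	XSK_POLL_IDLE,     /* nothing to do for this socket */
	XSK_POLL_RX_READY, /* packets are available on the rx ring */
	XSK_POLL_STALLED,  /* POLLERR: fill ring empty or completion queue full */
};

/* Map the revents of one pollfd entry to an action for the application. */
static enum xsk_poll_state xsk_classify_revents(short revents)
{
	/* Check POLLERR first: rx cannot make progress until rings are fixed. */
	if (revents & POLLERR)
		return XSK_POLL_STALLED;
	if (revents & POLLIN)
		return XSK_POLL_RX_READY;
	return XSK_POLL_IDLE;
}
```

In the loop, a socket classified as XSK_POLL_STALLED would refill its
fill ring and drain its completion queue before calling poll() again.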
Need some advice around this issue please:

In this patch set, softirq/ksoftirqd will kick in at high loads and
render the busy-poll support useless, as the execution then happens
in the same way as without busy-poll support. Everything works from an
application perspective but this defeats the purpose of the support
and also consumes an extra core. What I would like to accomplish when
XDP socket busy-poll is enabled is that softirq/ksoftirqd is never
invoked for the traffic that goes to this socket. This way, we would
get better performance on a per core basis and also get the same
behaviour independent of load.

To accomplish this, the driver would need to create a napi context
containing the busy-poll enabled XDP socket queues (not shared with
any other traffic) that does not have an interrupt associated with
it.

Does this sound like an avenue worth pursuing and if so, should it be
part of this RFC/PATCH or separate?

Epoll() is not supported at this point in time. Since it was designed
for handling a very large number of sockets, I thought it was better
to leave this for later, if the need arises.

To do:

* Move over all drivers to the new xdp_[rt]xq_info functions. If you
  think this is the right way to go forward, I will move over
  Mellanox, Netronome, etc for the proper patch.

* Performance measurements

* Test SW drivers like virtio, veth and tap. Actually, test anything
  that is not i40e.

* Test multiple sockets of different types in the same poll() call

* Check bisectability of each patch

* Versioning of the libbpf interface since we add an entry to the
  socket configuration struct

This RFC has been applied against commit 2b5bc3c8ebce ("r8169: remove manual autoneg restart workaround")
Structure of the patch set:

Patch 1: Makes the busy poll batch size configurable inside the kernel
Patch 2: Adds napi id to struct xdp_rxq_info, adds this to the
         xdp_rxq_reg_info function and changes the interface and
         implementation so that we only need a single copy of the
         xdp_rxq_info struct that resides in _rx. Previously there was
         another one in the driver, but now the driver uses the one in
         _rx. All XDP enabled drivers are converted to these new
         functions.
Patch 3: Adds a struct xdp_txq_info, with functions corresponding to
         those of xdp_rxq_info, that is used to store information
         about the tx queue. The use of these is added to all AF_XDP
         enabled drivers.
Patch 4: Introduces a new setsockopt/getsockopt in the uapi for
         enabling busy_poll.
Patch 5: Implements busy poll in the xsk code.
Patch 6: Adds busy-poll support to libbpf.
Patch 7: Adds busy-poll support to the xdpsock sample application.

Thanks: Magnus
Magnus Karlsson (7):
net: fs: make busy poll budget configurable in napi_busy_loop
net: i40e: ixgbe: tun: veth: virtio-net: centralize xdp_rxq_info and
add napi id
net: i40e: ixgbe: add xdp_txq_info structure
netdevice: introduce busy-poll setsockopt for AF_XDP
net: add busy-poll support for XDP sockets
libbpf: add busy-poll support to XDP sockets
samples/bpf: add busy-poll support to xdpsock sample
drivers/net/ethernet/intel/i40e/i40e_ethtool.c | 2 -
drivers/net/ethernet/intel/i40e/i40e_main.c | 8 +-
drivers/net/ethernet/intel/i40e/i40e_txrx.c | 37 ++++-
drivers/net/ethernet/intel/i40e/i40e_txrx.h | 2 +-
drivers/net/ethernet/intel/i40e/i40e_xsk.c | 2 +-
drivers/net/ethernet/intel/ixgbe/ixgbe.h | 2 +-
drivers/net/ethernet/intel/ixgbe/ixgbe_main.c | 48 ++++--
drivers/net/ethernet/intel/ixgbe/ixgbe_xsk.c | 2 +-
drivers/net/tun.c | 14 +-
drivers/net/veth.c | 10 +-
drivers/net/virtio_net.c | 8 +-
fs/eventpoll.c | 5 +-
include/linux/netdevice.h | 1 +
include/net/busy_poll.h | 7 +-
include/net/xdp.h | 23 ++-
include/net/xdp_sock.h | 3 +
include/uapi/linux/if_xdp.h | 1 +
net/core/dev.c | 43 ++----
net/core/xdp.c | 103 ++++++++++---
net/xdp/Kconfig | 1 +
net/xdp/xsk.c | 122 ++++++++++++++-
net/xdp/xsk_queue.h | 18 ++-
samples/bpf/xdpsock_user.c | 203 +++++++++++++++----------
tools/include/uapi/linux/if_xdp.h | 1 +
tools/lib/bpf/xsk.c | 23 +--
tools/lib/bpf/xsk.h | 1 +
26 files changed, 495 insertions(+), 195 deletions(-)
--
2.7.4

On Fri, May 3, 2019 at 2:26 AM Samudrala, Sridhar
<sridhar.samudrala@intel.com> wrote:
>
>
>
> On 5/2/2019 1:39 AM, Magnus Karlsson wrote:
> > This patch introduces a new setsockopt that enables busy-poll for XDP
> > sockets. It is called XDP_BUSY_POLL_BATCH_SIZE and takes batch size as
> > an argument. A value between 1 and NAPI_WEIGHT (64) will turn it on, 0
> > will turn it off and any other value will return an error. There is
> > also a corresponding getsockopt implementation.
>
> I think this socket option should also allow specifying a timeout value
> when using blocking poll() calls.
> OR can we use SO_BUSY_POLL to specify this timeout value?

I think you are correct in that we need to be able to specify the
timeout. The current approach of always having a timeout of zero was
optimized for the high throughput case. But Ilias and others often
talk about using AF_XDP for time sensitive networking, and in that
case spinning in the kernel (for a max period of the timeout) waiting
for a packet would provide better latency. And with a configurable
value, we could support both cases, so why not.

I will add the timeout value to the new setsockopt I introduced, so it
will take both a batch size and a timeout value. I will also call it
something else since it should not have batch_size in its name
anymore.

Thanks: Magnus

> >
> > Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com>
> > ---
> > include/uapi/linux/if_xdp.h | 1 +
> > 1 file changed, 1 insertion(+)
> >
> > diff --git a/include/uapi/linux/if_xdp.h b/include/uapi/linux/if_xdp.h
> > index caed8b1..be28a78 100644
> > --- a/include/uapi/linux/if_xdp.h
> > +++ b/include/uapi/linux/if_xdp.h
> > @@ -46,6 +46,7 @@ struct xdp_mmap_offsets {
> > #define XDP_UMEM_FILL_RING 5
> > #define XDP_UMEM_COMPLETION_RING 6
> > #define XDP_STATISTICS 7
> > +#define XDP_BUSY_POLL_BATCH_SIZE 8
> >
> > struct xdp_umem_reg {
> > __u64 addr; /* Start of packet data area */
> >

On Thu, May 02, 2019 at 10:39:16AM +0200, Magnus Karlsson wrote:
> This RFC proposes to add busy-poll support to AF_XDP sockets. With
> busy-poll, the driver is executed in process context by calling the
> poll() syscall. The main advantage with this is that all processing
> occurs on a single core. This eliminates the core-to-core cache
> transfers that occur between the application and the softirqd
> processing on another core, that occurs without busy-poll. From a
> systems point of view, it also provides an advantage that we do not
> have to provision extra cores in the system to handle
> ksoftirqd/softirq processing, as all processing is done on the single
> core that executes the application. The drawback of busy-poll is that
> max throughput seen from a single application will be lower (due to
> the syscall), but on a per core basis it will often be higher as
> the normal mode runs on two cores and busy-poll on a single one.
>
> The semantics of busy-poll from the application point of view are the
> following:
>
> * The application is required to call poll() to drive rx and tx
> processing. There is no guarantee that softirq and interrupts will
> do this for you. This is in contrast with the current
> implementations of busy-poll that are opportunistic in the sense
> that packets might be received/transmitted by busy-poll or
> softirqd. (In this patch set, softirq/ksoftirqd will kick in at high
> loads just as the current opportunistic implementations, but I would
> like to get to a point where this is not the case for busy-poll
> enabled XDP sockets, as this slows down performance considerably and
> starts to use one more core for the softirq processing. The end goal
> is for only poll() to drive the napi loop when busy-poll is enabled
> on an AF_XDP socket. More about this later.)
>
> * It should be enabled on a per socket basis. No global enablement, i.e.
> the XDP socket busy-poll will not care about the current
> /proc/sys/net/core/busy_poll and busy_read global enablement
> mechanisms.
>
> * The batch size (how many packets that are processed every time the
> napi function in the driver is called, i.e. the weight parameter)
> should be configurable. Currently, the busy-poll size of AF_INET
> sockets is set to 8, but for AF_XDP sockets this is too small as the
> amount of processing per packet is much smaller with AF_XDP. This
> should be configurable on a per socket basis.
>
> * If you put multiple AF_XDP busy-poll enabled sockets into a poll()
> call the napi contexts of all of them should be executed. This is in
> contrast to the AF_INET busy-poll that quits after the first one that
> finds any packets. We need all napi contexts to be executed due to
> the first requirement in this list. The behaviour we want is much more
> like regular sockets in that all of them are checked in the poll
> call.
>
> * Should be possible to mix AF_XDP busy-poll sockets with any other
> sockets including busy-poll AF_INET ones in a single poll() call
> without any change to semantics or the behaviour of any of those
> socket types.
>
> * As suggested by Maxim Mikityanskiy, poll() will in the busy-poll
> mode return POLLERR if the fill ring is empty or the completion
> queue is full.
>
> Busy-poll support is enabled by calling a new setsockopt called
> XDP_BUSY_POLL_BATCH_SIZE that takes batch size as an argument. A value
> between 1 and NAPI_WEIGHT (64) will turn it on, 0 will turn it off and
> any other value will return an error.
>
> A typical packet processing rxdrop loop with busy-poll will look something
> like this:
>
> for (i = 0; i < num_socks; i++) {
> fds[i].fd = xsk_socket__fd(xsks[i]->xsk);
> fds[i].events = POLLIN;
> }
>
> for (;;) {
> ret = poll(fds, num_socks, 0);
> if (ret <= 0)
> continue;
>
> for (i = 0; i < num_socks; i++)
> rx_drop(xsks[i], fds); /* The actual application */
> }
>
> Need some advice around this issue please:
>
> In this patch set, softirq/ksoftirqd will kick in at high loads and
> render the busy poll support useless as the execution is now happening
> in the same way as without busy-poll support. Everything works from an
> application perspective but this defeats the purpose of the support
> and also consumes an extra core. What I would like to accomplish when

Not sure what you mean by 'extra core'.
The above poll+rx_drop is executed for every af_xdp socket
and there are N cpus processing exactly N af_xdp sockets.
Where is 'extra core'?
Are you suggesting a model where a single core will be busy-polling
all af_xdp sockets and then waking processing threads?
Or a single core will process all sockets?
I think the af_xdp model should be flexible and allow easy out-of-the-box
experience, but it should be optimized for the 'ideal' user that
does the-right-thing from a max packet-per-second point of view.
I thought we've already converged on the model where af_xdp hw rx queues
bind one-to-one to af_xdp sockets and user space pins processing
threads one-to-one to af_xdp sockets on corresponding cpus...
If so, that's the model to optimize for on the kernel side
while keeping all other user configurations functional.

> XDP socket busy-poll is enabled is that softirq/ksoftirqd is never
> invoked for the traffic that goes to this socket. This way, we would
> get better performance on a per core basis and also get the same
> behaviour independent of load.

I suspect separate rx kthreads for af_xdp socket processing are necessary
with and without busy-poll exactly because of the 'high load' case
you've described.
If we do this additional rx-kthread model, why differentiate
between busy-polling and polling?

The af_xdp rx queue is completely different from the stack rx queue
because of the target dma address setup.
Using the stack's napi ksoftirqd threads for processing af_xdp queues
creates fairness issues. Isn't it better to have separate kthreads for
them and let the scheduler deal with fairness among af_xdp processing
and the stack?
>
> To accomplish this, the driver would need to create a napi context
> containing the busy-poll enabled XDP socket queues (not shared with
> any other traffic) that do not have an interrupt associated with
> it.
why no interrupt?
>
> Does this sound like an avenue worth pursuing and if so, should it be
> part of this RFC/PATCH or separate?
>
> Epoll() is not supported at this point in time. Since it was designed
> for handling a very large number of sockets, I thought it was better
> to leave this for later if the need will arise.
>
> To do:
>
> * Move over all drivers to the new xdp_[rt]xq_info functions. If you
> think this is the right way to go forward, I will move over
> Mellanox, Netronome, etc for the proper patch.
>
> * Performance measurements
>
> * Test SW drivers like virtio, veth and tap. Actually, test anything
> that is not i40e.
>
> * Test multiple sockets of different types in the same poll() call
>
> * Check bisectability of each patch
>
> * Versioning of the libbpf interface since we add an entry to the
> socket configuration struct
>
> This RFC has been applied against commit 2b5bc3c8ebce ("r8169: remove manual autoneg restart workaround")
>
> Structure of the patch set:
> Patch 1: Makes the busy poll batch size configurable inside the kernel
> Patch 2: Adds napi id to struct xdp_rxq_info, adds this to the
> xdp_rxq_reg_info function and changes the interface and
> implementation so that we only need a single copy of the
> xdp_rxq_info struct that resides in _rx. Previously there was
> another one in the driver, but now the driver uses the one in
> _rx. All XDP enabled drivers are converted to these new
> functions.
> Patch 3: Adds a struct xdp_txq_info with corresponding functions to
> xdp_rxq_info that is used to store information about the tx
> queue. The use of these are added to all AF_XDP enabled drivers.
> Patch 4: Introduce a new setsockopt/getsockopt in the uapi for
> enabling busy_poll.
> Patch 5: Implements busy poll in the xsk code.
> Patch 6: Add busy-poll support to libbpf.
> Patch 7: Add busy-poll support to the xdpsock sample application.
>
> Thanks: Magnus
>
> Magnus Karlsson (7):
> net: fs: make busy poll budget configurable in napi_busy_loop
> net: i40e: ixgbe: tun: veth: virtio-net: centralize xdp_rxq_info and
> add napi id
> net: i40e: ixgbe: add xdp_txq_info structure
> netdevice: introduce busy-poll setsockopt for AF_XDP
> net: add busy-poll support for XDP sockets
> libbpf: add busy-poll support to XDP sockets
> samples/bpf: add busy-poll support to xdpsock sample
>
> drivers/net/ethernet/intel/i40e/i40e_ethtool.c | 2 -
> drivers/net/ethernet/intel/i40e/i40e_main.c | 8 +-
> drivers/net/ethernet/intel/i40e/i40e_txrx.c | 37 ++++-
> drivers/net/ethernet/intel/i40e/i40e_txrx.h | 2 +-
> drivers/net/ethernet/intel/i40e/i40e_xsk.c | 2 +-
> drivers/net/ethernet/intel/ixgbe/ixgbe.h | 2 +-
> drivers/net/ethernet/intel/ixgbe/ixgbe_main.c | 48 ++++--
> drivers/net/ethernet/intel/ixgbe/ixgbe_xsk.c | 2 +-
> drivers/net/tun.c | 14 +-
> drivers/net/veth.c | 10 +-
> drivers/net/virtio_net.c | 8 +-
> fs/eventpoll.c | 5 +-
> include/linux/netdevice.h | 1 +
> include/net/busy_poll.h | 7 +-
> include/net/xdp.h | 23 ++-
> include/net/xdp_sock.h | 3 +
> include/uapi/linux/if_xdp.h | 1 +
> net/core/dev.c | 43 ++----
> net/core/xdp.c | 103 ++++++++++---
> net/xdp/Kconfig | 1 +
> net/xdp/xsk.c | 122 ++++++++++++++-
> net/xdp/xsk_queue.h | 18 ++-
> samples/bpf/xdpsock_user.c | 203 +++++++++++++++----------
> tools/include/uapi/linux/if_xdp.h | 1 +
> tools/lib/bpf/xsk.c | 23 +--
> tools/lib/bpf/xsk.h | 1 +
> 26 files changed, 495 insertions(+), 195 deletions(-)
>
> --
> 2.7.4

On Mon, May 6, 2019 at 6:33 PM Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>
> On Thu, May 02, 2019 at 10:39:16AM +0200, Magnus Karlsson wrote:
> > This RFC proposes to add busy-poll support to AF_XDP sockets. With
> > busy-poll, the driver is executed in process context by calling the
> > poll() syscall. The main advantage with this is that all processing
> > occurs on a single core. This eliminates the core-to-core cache
> > transfers that occur between the application and the softirqd
> > processing on another core, that occurs without busy-poll. From a
> > systems point of view, it also provides an advantage that we do not
> > have to provision extra cores in the system to handle
> > ksoftirqd/softirq processing, as all processing is done on the single
> > core that executes the application. The drawback of busy-poll is that
> > max throughput seen from a single application will be lower (due to
> > the syscall), but on a per core basis it will often be higher as
> > the normal mode runs on two cores and busy-poll on a single one.
> >
> > The semantics of busy-poll from the application point of view are the
> > following:
> >
> > * The application is required to call poll() to drive rx and tx
> > processing. There is no guarantee that softirq and interrupts will
> > do this for you. This is in contrast with the current
> > implementations of busy-poll that are opportunistic in the sense
> > that packets might be received/transmitted by busy-poll or
> > softirqd. (In this patch set, softirq/ksoftirqd will kick in at high
> > loads just as the current opportunistic implementations, but I would
> > like to get to a point where this is not the case for busy-poll
> > enabled XDP sockets, as this slows down performance considerably and
> > starts to use one more core for the softirq processing. The end goal
> > is for only poll() to drive the napi loop when busy-poll is enabled
> > on an AF_XDP socket. More about this later.)
> >
> > * It should be enabled on a per socket basis. No global enablement, i.e.
> > the XDP socket busy-poll will not care about the current
> > /proc/sys/net/core/busy_poll and busy_read global enablement
> > mechanisms.
> >
> > * The batch size (how many packets that are processed every time the
> > napi function in the driver is called, i.e. the weight parameter)
> > should be configurable. Currently, the busy-poll size of AF_INET
> > sockets is set to 8, but for AF_XDP sockets this is too small as the
> > amount of processing per packet is much smaller with AF_XDP. This
> > should be configurable on a per socket basis.
> >
> > * If you put multiple AF_XDP busy-poll enabled sockets into a poll()
> > call the napi contexts of all of them should be executed. This is in
> > contrast to the AF_INET busy-poll that quits after the first one that
> > finds any packets. We need all napi contexts to be executed due to
> > the first requirement in this list. The behaviour we want is much more
> > like regular sockets in that all of them are checked in the poll
> > call.
> >
> > * Should be possible to mix AF_XDP busy-poll sockets with any other
> > sockets including busy-poll AF_INET ones in a single poll() call
> > without any change to semantics or the behaviour of any of those
> > socket types.
> >
> > * As suggested by Maxim Mikityanskiy, poll() will in the busy-poll
> > mode return POLLERR if the fill ring is empty or the completion
> > queue is full.
> >
> > Busy-poll support is enabled by calling a new setsockopt called
> > XDP_BUSY_POLL_BATCH_SIZE that takes batch size as an argument. A value
> > between 1 and NAPI_WEIGHT (64) will turn it on, 0 will turn it off and
> > any other value will return an error.
> >
> > A typical packet processing rxdrop loop with busy-poll will look something
> > like this:
> >
> > for (i = 0; i < num_socks; i++) {
> > fds[i].fd = xsk_socket__fd(xsks[i]->xsk);
> > fds[i].events = POLLIN;
> > }
> >
> > for (;;) {
> > ret = poll(fds, num_socks, 0);
> > if (ret <= 0)
> > continue;
> >
> > for (i = 0; i < num_socks; i++)
> > rx_drop(xsks[i], fds); /* The actual application */
> > }
> >
> > Need some advice around this issue please:
> >
> > In this patch set, softirq/ksoftirqd will kick in at high loads and
> > render the busy poll support useless as the execution is now happening
> > in the same way as without busy-poll support. Everything works from an
> > application perspective but this defeats the purpose of the support
> > and also consumes an extra core. What I would like to accomplish when
>
> Not sure what you mean by 'extra core' .
> The above poll+rx_drop is executed for every af_xdp socket
> and there are N cpus processing exactly N af_xdp sockets.
> Where is 'extra core'?
> Are you suggesting a model where single core will be busy-polling
> all af_xdp sockets? and then waking processing threads?
> or single core will process all sockets?
> I think the af_xdp model should be flexible and allow easy out-of-the-box
> experience, but it should be optimized for 'ideal' user that
> does the-right-thing from max packet-per-second point of view.
> I thought we've already converged on the model where af_xdp hw rx queues
> bind one-to-one to af_xdp sockets and user space pins processing
> threads one-to-one to af_xdp sockets on corresponding cpus...
> If so that's the model to optimize for on the kernel side
> while keeping all other user configurations functional.
>
> > XDP socket busy-poll is enabled is that softirq/ksoftirqd is never
> > invoked for the traffic that goes to this socket. This way, we would
> > get better performance on a per core basis and also get the same
> > behaviour independent of load.
>
> I suspect separate rx kthreads of af_xdp socket processing is necessary
> with and without busy-poll exactly because of 'high load' case
> you've described.
> If we do this additional rx-kthread model why differentiate
> between busy-polling and polling?
>
> af_xdp rx queue is completely different from stack rx queue because
> of target dma address setup.
> Using stack's napi ksoftirqd threads for processing af_xdp queues creates
> the fairness issues. Isn't it better to have separate kthreads for them
> and let scheduler deal with fairness among af_xdp processing and stack?

When using ordinary poll() on an AF_XDP socket, the application will
run on one core and the driver processing will run on another in
softirq/ksoftirqd context. (Either due to explicit core and irq
pinning or due to the scheduler or irqbalance moving the two threads
apart.) In the AF_XDP busy-poll mode of this RFC, I would like the
application and the driver processing to occur on a single core, thus
there is no "extra" driver core involved that needs to be taken into
account when sizing and/or provisioning the system. The napi context
is in this mode invoked from syscall context when executing the poll
syscall from the application.

Executing the app and the driver on the same core could of course be
accomplished already today by pinning the application and the driver
interrupt to the same core, but that would not be that efficient due
to context switching between the two. A more efficient way would be to
call the napi loop from within the poll() syscall when you are inside
the kernel anyway. This is what the classical busy-poll mechanism
operating on AF_INET sockets does. Inside the poll() call, it executes
the napi context of the driver until it finds a packet (if it is rx)
and then returns to the application that then processes the packets. I
would like to adopt something quite similar for AF_XDP sockets. (Some
of the differences can be found at the top of the original post.)

From an API point of view with busy-poll of AF_XDP sockets, the user
would bind to a queue number and taskset its application to a specific
core and both the app and the driver execution would only occur on
that core. This is in my mind simpler than with regular poll or AF_XDP
using no syscalls on rx (i.e. current state), in which you bind to a
queue, taskset your application to a core and then you also have to
take care to route the interrupt of the queue you bound to to another
core that will execute the driver part in the kernel. So the model is
in both cases still one core - one socket - one napi. (Users can of
course create multiple sockets in an app if they desire.)

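
The "taskset its application to a specific core" step can also be done
from inside the application. Below is a minimal sketch using the
standard Linux sched_setaffinity() call; the helper names are
hypothetical, and which core to choose for a given queue is left to the
user.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE /* needed for cpu_set_t and the CPU_* macros */
#endif
#include <sched.h>

/* Returns the lowest-numbered CPU the calling thread may run on, or -1. */
static int first_allowed_cpu(void)
{
	cpu_set_t set;
	int cpu;

	if (sched_getaffinity(0, sizeof(set), &set) != 0)
		return -1;
	for (cpu = 0; cpu < CPU_SETSIZE; cpu++)
		if (CPU_ISSET(cpu, &set))
			return cpu;
	return -1;
}

/* Pin the calling thread to a single CPU; 0 on success, -1 on error. */
static int pin_to_cpu(int cpu)
{
	cpu_set_t set;

	CPU_ZERO(&set);
	CPU_SET(cpu, &set);
	return sched_setaffinity(0, sizeof(set), &set);
}
```

With busy-poll, pinning the thread that calls poll() this way would be
the only placement decision, since the napi context runs in that same
thread's syscall context.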
The main reasons I would like to introduce busy-poll for AF_XDP
sockets are:

* It is simpler to provision, see arguments above. Both application
  and driver run efficiently on the same core.

* It is faster (on a per core basis) since we do not have any core to
  core communication. All header and descriptor transfers between
  kernel and application are core local, which is much
  faster. Scalability will also be better. E.g., 64 bytes desc + 64
  bytes packet header = 128 bytes per packet less on the interconnect
  between cores. At 20 Mpps/core, this is ~20 Gbit/s and with 20 cores
  this will be ~400 Gbit/s less interconnect traffic with busy-poll.

* It provides a way to seamlessly replace user-space drivers in DPDK
  with Linux drivers in kernel space. (Do not think I have to argue
  why this is a good idea on this list ;-).) The DPDK model is that
  application and driver run on the same core since they are both in
  user space. If we can provide the same model (both running
  efficiently on the same core, NOT drivers in user-space) with
  AF_XDP, it is easy for DPDK users to make the switch. Compare this
  to the current way where there are both application cores and
  driver/ksoftirqd cores. If a systems builder had 12 cores in their
  appliance box and 12 instances of a DPDK app, one on each
  core, how would they reason when repartitioning between
  application and driver cores? 8 application cores and 4 driver
  cores, or 6 of each? Maybe it is also packet dependent? Etc. Much
  simpler to migrate if we had an efficient way to run both of them on
  the same core.

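
The interconnect estimate in the second bullet can be checked with a
few lines of arithmetic. This is only a back-of-the-envelope sketch
using the figures quoted above (128 bytes per packet, 20 Mpps per core);
the function name is hypothetical.

```c
/*
 * Interconnect traffic avoided, in Gbit/s:
 * bytes per packet * 8 bits * packets per second * number of cores.
 */
static double interconnect_gbps(double bytes_per_pkt, double mpps, int cores)
{
	return bytes_per_pkt * 8.0 * mpps * 1e6 * cores / 1e9;
}
```

interconnect_gbps(128, 20, 1) evaluates to 20.48 and
interconnect_gbps(128, 20, 20) to 409.6, in line with the rounded
~20 Gbit/s and ~400 Gbit/s figures.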
Why no interrupt? That should have been: no interrupts enabled to
start with. We would like to avoid interrupts as much as possible
since when they trigger, we will revert to the non busy-poll model,
i.e. processing on two separate cores, and the advantages from above
will disappear. How to accomplish this?

* One way would be to create a napi context with the queue we have
  bound to but with no interrupt associated with it, or with it
  disabled. The socket would in that case only be able to receive and
  send packets when calling the poll() syscall. If you do not call
  poll(), you do not get any packets, nor are any packets sent. It
  would only be possible to support this with a poll() timeout value
  of zero. This would have the best performance.

* Maybe we could support timeout values >0 by re-enabling the interrupt
  at some point. When calling poll(), the core would invoke the napi
  context repeatedly with the interrupt of that napi disabled until it
  found a packet, but at most for the duration of the busy-poll
  timeout (like regular busy poll does today). If that times out, we
  fall back to the regular timeout of the poll() call, enable the
  interrupt of the queue associated with the napi and put the process
  to sleep. Once woken up by an interrupt, the interrupt of the napi
  would be disabled again and control returned to the application.
  With this scheme, we would process the vast majority of packets
  locally on a core with interrupts disabled and with good
  performance, and only when we have low load and are
  sleeping/waiting in poll would we process some packets using
  interrupts on the core that the interrupt has been bound to.

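
To make the second option easier to discuss, here is a toy user-space
model of the two phases (spin with the interrupt off, then sleep with
the interrupt on). It is not kernel code and not the proposed
implementation; time is reduced to discrete slots and all names are
hypothetical.

```c
#include <stddef.h>

enum wake_source {
	WAKE_BUSY_POLL, /* packet found while spinning, interrupt stayed off */
	WAKE_INTERRUPT, /* spin budget used up, woken by the interrupt */
	WAKE_TIMEOUT,   /* no packet before the poll() timeout expired */
};

/*
 * arrivals[i] is the number of packets showing up in time slot i.
 * Slots [0, spin_slots) model the busy-poll timeout, during which the
 * napi context is invoked directly. Slots [spin_slots, total_slots)
 * model the rest of the poll() timeout, spent asleep with the
 * interrupt enabled.
 */
static enum wake_source poll_two_phase(const int *arrivals,
				       size_t total_slots, size_t spin_slots)
{
	size_t i;

	for (i = 0; i < total_slots; i++) {
		if (arrivals[i] == 0)
			continue;
		return i < spin_slots ? WAKE_BUSY_POLL : WAKE_INTERRUPT;
	}
	return WAKE_TIMEOUT;
}
```

Under load, packets arrive within the first few slots and the model
always returns WAKE_BUSY_POLL; only sparse traffic reaches the
interrupt phase, which matches the intent described above.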
I will produce some performance numbers for the various options and
post them in a follow-up mail. We need some numbers to base the
discussion on.

/Magnus

> >
> > To accomplish this, the driver would need to create a napi context
> > containing the busy-poll enabled XDP socket queues (not shared with
> > any other traffic) that do not have an interrupt associated with
> > it.
>
> why no interrupt?
>
> >
> > Does this sound like an avenue worth pursuing and if so, should it be
> > part of this RFC/PATCH or separate?
> >
> > Epoll() is not supported at this point in time. Since it was designed
> > for handling a very large number of sockets, I thought it was better
> > to leave this for later if the need will arise.
> >
> > To do:
> >
> > * Move over all drivers to the new xdp_[rt]xq_info functions. If you
> > think this is the right way to go forward, I will move over
> > Mellanox, Netronome, etc for the proper patch.
> >
> > * Performance measurements
> >
> > * Test SW drivers like virtio, veth and tap. Actually, test anything
> > that is not i40e.
> >
> > * Test multiple sockets of different types in the same poll() call
> >
> > * Check bisectability of each patch
> >
> > * Versioning of the libbpf interface since we add an entry to the
> > socket configuration struct
> >
> > This RFC has been applied against commit 2b5bc3c8ebce ("r8169: remove manual autoneg restart workaround")
> >
> > Structure of the patch set:
> > Patch 1: Makes the busy poll batch size configurable inside the kernel
> > Patch 2: Adds napi id to struct xdp_rxq_info, adds this to the
> > xdp_rxq_reg_info function and changes the interface and
> > implementation so that we only need a single copy of the
> > xdp_rxq_info struct that resides in _rx. Previously there was
> > another one in the driver, but now the driver uses the one in
> > _rx. All XDP enabled drivers are converted to these new
> > functions.
> > Patch 3: Adds a struct xdp_txq_info with corresponding functions to
> > xdp_rxq_info that is used to store information about the tx
> > queue. The use of these are added to all AF_XDP enabled drivers.
> > Patch 4: Introduce a new setsockopt/getsockopt in the uapi for
> > enabling busy_poll.
> > Patch 5: Implements busy poll in the xsk code.
> > Patch 6: Add busy-poll support to libbpf.
> > Patch 7: Add busy-poll support to the xdpsock sample application.
> >
> > Thanks: Magnus
> >
> > Magnus Karlsson (7):
> > net: fs: make busy poll budget configurable in napi_busy_loop
> > net: i40e: ixgbe: tun: veth: virtio-net: centralize xdp_rxq_info and
> > add napi id
> > net: i40e: ixgbe: add xdp_txq_info structure
> > netdevice: introduce busy-poll setsockopt for AF_XDP
> > net: add busy-poll support for XDP sockets
> > libbpf: add busy-poll support to XDP sockets
> > samples/bpf: add busy-poll support to xdpsock sample
> >
> > drivers/net/ethernet/intel/i40e/i40e_ethtool.c | 2 -
> > drivers/net/ethernet/intel/i40e/i40e_main.c | 8 +-
> > drivers/net/ethernet/intel/i40e/i40e_txrx.c | 37 ++++-
> > drivers/net/ethernet/intel/i40e/i40e_txrx.h | 2 +-
> > drivers/net/ethernet/intel/i40e/i40e_xsk.c | 2 +-
> > drivers/net/ethernet/intel/ixgbe/ixgbe.h | 2 +-
> > drivers/net/ethernet/intel/ixgbe/ixgbe_main.c | 48 ++++--
> > drivers/net/ethernet/intel/ixgbe/ixgbe_xsk.c | 2 +-
> > drivers/net/tun.c | 14 +-
> > drivers/net/veth.c | 10 +-
> > drivers/net/virtio_net.c | 8 +-
> > fs/eventpoll.c | 5 +-
> > include/linux/netdevice.h | 1 +
> > include/net/busy_poll.h | 7 +-
> > include/net/xdp.h | 23 ++-
> > include/net/xdp_sock.h | 3 +
> > include/uapi/linux/if_xdp.h | 1 +
> > net/core/dev.c | 43 ++----
> > net/core/xdp.c | 103 ++++++++++---
> > net/xdp/Kconfig | 1 +
> > net/xdp/xsk.c | 122 ++++++++++++++-
> > net/xdp/xsk_queue.h | 18 ++-
> > samples/bpf/xdpsock_user.c | 203 +++++++++++++++----------
> > tools/include/uapi/linux/if_xdp.h | 1 +
> > tools/lib/bpf/xsk.c | 23 +--
> > tools/lib/bpf/xsk.h | 1 +
> > 26 files changed, 495 insertions(+), 195 deletions(-)
> >
> > --
> > 2.7.4

On Tue, May 07, 2019 at 01:51:45PM +0200, Magnus Karlsson wrote:
> On Mon, May 6, 2019 at 6:33 PM Alexei Starovoitov
> <alexei.starovoitov@gmail.com> wrote:
> >
> > On Thu, May 02, 2019 at 10:39:16AM +0200, Magnus Karlsson wrote:
> > > This RFC proposes to add busy-poll support to AF_XDP sockets. With
> > > busy-poll, the driver is executed in process context by calling the
> > > poll() syscall. The main advantage with this is that all processing
> > > occurs on a single core. This eliminates the core-to-core cache
> > > transfers that occur between the application and the softirqd
> > > processing on another core, that occurs without busy-poll. From a
> > > > systems point of view, it also provides an advantage that we do not
> > > have to provision extra cores in the system to handle
> > > ksoftirqd/softirq processing, as all processing is done on the single
> > > core that executes the application. The drawback of busy-poll is that
> > > max throughput seen from a single application will be lower (due to
> > > the syscall), but on a per core basis it will often be higher as
> > > the normal mode runs on two cores and busy-poll on a single one.
> > >
> > > The semantics of busy-poll from the application point of view are the
> > > following:
> > >
> > > * The application is required to call poll() to drive rx and tx
> > > processing. There is no guarantee that softirq and interrupts will
> > > do this for you. This is in contrast with the current
> > > implementations of busy-poll that are opportunistic in the sense
> > > that packets might be received/transmitted by busy-poll or
> > > softirqd. (In this patch set, softirq/ksoftirqd will kick in at high
> > > loads just as the current opportunistic implementations, but I would
> > > like to get to a point where this is not the case for busy-poll
> > > enabled XDP sockets, as this slows down performance considerably and
> > > starts to use one more core for the softirq processing. The end goal
> > > is for only poll() to drive the napi loop when busy-poll is enabled
> > > on an AF_XDP socket. More about this later.)
> > >
> > > * It should be enabled on a per socket basis. No global enablement, i.e.
> > > the XDP socket busy-poll will not care about the current
> > > /proc/sys/net/core/busy_poll and busy_read global enablement
> > > mechanisms.
> > >
> > > * The batch size (how many packets that are processed every time the
> > > napi function in the driver is called, i.e. the weight parameter)
> > > should be configurable. Currently, the busy-poll size of AF_INET
> > > sockets is set to 8, but for AF_XDP sockets this is too small as the
> > > amount of processing per packet is much smaller with AF_XDP. This
> > > should be configurable on a per socket basis.
> > >
> > > * If you put multiple AF_XDP busy-poll enabled sockets into a poll()
> > > call the napi contexts of all of them should be executed. This is in
> > > > contrast to the AF_INET busy-poll that quits after the first one that
> > > finds any packets. We need all napi contexts to be executed due to
> > > the first requirement in this list. The behaviour we want is much more
> > > like regular sockets in that all of them are checked in the poll
> > > call.
> > >
> > > * Should be possible to mix AF_XDP busy-poll sockets with any other
> > > sockets including busy-poll AF_INET ones in a single poll() call
> > > without any change to semantics or the behaviour of any of those
> > > socket types.
> > >
> > > * As suggested by Maxim Mikityanskiy, poll() will in the busy-poll
> > > mode return POLLERR if the fill ring is empty or the completion
> > > queue is full.
> > >
> > > Busy-poll support is enabled by calling a new setsockopt called
> > > XDP_BUSY_POLL_BATCH_SIZE that takes batch size as an argument. A value
> > > between 1 and NAPI_WEIGHT (64) will turn it on, 0 will turn it off and
> > > any other value will return an error.
> > >
> > > A typical packet processing rxdrop loop with busy-poll will look something
> > > like this:
> > >
> > > > for (i = 0; i < num_socks; i++) {
> > > >         fds[i].fd = xsk_socket__fd(xsks[i]->xsk);
> > > >         fds[i].events = POLLIN;
> > > > }
> > > >
> > > > for (;;) {
> > > >         ret = poll(fds, num_socks, 0);
> > > >         if (ret <= 0)
> > > >                 continue;
> > > >
> > > >         for (i = 0; i < num_socks; i++)
> > > >                 rx_drop(xsks[i], fds); /* The actual application */
> > > > }
> > >
> > > Need some advice around this issue please:
> > >
> > > In this patch set, softirq/ksoftirqd will kick in at high loads and
> > > render the busy poll support useless as the execution is now happening
> > > in the same way as without busy-poll support. Everything works from an
> > > application perspective but this defeats the purpose of the support
> > > and also consumes an extra core. What I would like to accomplish when
> >
> > Not sure what you mean by 'extra core' .
> > The above poll+rx_drop is executed for every af_xdp socket
> > and there are N cpus processing exactly N af_xdp sockets.
> > Where is 'extra core'?
> > Are you suggesting a model where single core will be busy-polling
> > all af_xdp sockets? and then waking processing threads?
> > or single core will process all sockets?
> > I think the af_xdp model should be flexible and allow easy out-of-the-box
> > experience, but it should be optimized for 'ideal' user that
> > does the-right-thing from max packet-per-second point of view.
> > I thought we've already converged on the model where af_xdp hw rx queues
> > bind one-to-one to af_xdp sockets and user space pins processing
> > threads one-to-one to af_xdp sockets on corresponding cpus...
> > If so that's the model to optimize for on the kernel side
> > while keeping all other user configurations functional.
> >
> > > XDP socket busy-poll is enabled is that softirq/ksoftirq is never
> > > invoked for the traffic that goes to this socket. This way, we would
> > > get better performance on a per core basis and also get the same
> > > behaviour independent of load.
> >
> > I suspect separate rx kthreads of af_xdp socket processing is necessary
> > with and without busy-poll exactly because of 'high load' case
> > you've described.
> > If we do this additional rx-kthread model why differentiate
> > between busy-polling and polling?
> >
> > af_xdp rx queue is completely different from stack rx queue because
> > of target dma address setup.
> > Using stack's napi ksoftirqd threads for processing af_xdp queues creates
> > the fairness issues. Isn't it better to have separate kthreads for them
> > and let scheduler deal with fairness among af_xdp processing and stack?
>
> When using ordinary poll() on an AF_XDP socket, the application will
> run on one core and the driver processing will run on another in
> softirq/ksoftirqd context. (Either due to explicit core and irq
> pinning or due to the scheduler or irqbalance moving the two threads
> apart.) In AF_XDP busy-poll mode of this RFC, I would like the
> application and the driver processing to occur on a single core, thus
> there is no "extra" driver core involved that need to be taken into
> account when sizing and/or provisioning the system. The napi context
> is in this mode invoked from syscall context when executing the poll
> syscall from the application.
>
> Executing the app and the driver on the same core could of course be
> accomplished already today by pinning the application and the driver
> interrupt to the same core, but that would not be that efficient due
> to context switching between the two.

Have you benchmarked it?
I don't think the context switch will be that noticeable when kpti is off.
napi processes 64 packet descriptors and switches back to user space to
do payload processing of these packets.
I would think that the same job on two different cores would be
a bit more performant, with user code consuming close to 100%
and softirq a single-digit %. Say it's 10%.
I believe combining the two on a single core is not 100 + 10 since
there is no cache bouncing. So Mpps from the two-core setup will
reduce by 2-3% instead of 10%.
There is a cost of going to sleep and being woken up from poll(),
but 64 packets is probably a large enough number to amortize it.
If not, have you tried to bump the napi budget to say 256 for af_xdp rx queues?
Busy-poll avoids the sleep/wakeup overhead and can probably make
this scheme work with lower batching (like 64), but fundamentally
they're the same thing.
I'm not saying that we shouldn't do busy-poll. I'm saying it's
complementary, but in all cases a single core per af_xdp rx queue
with user thread pinning is preferred.

> A more efficient way would be to
> call the napi loop from within the poll() syscall when you are inside
> the kernel anyway. This is what the classical busy-poll mechanism
> operating on AF_INET sockets does. Inside the poll() call, it executes
> the napi context of the driver until it finds a packet (if it is rx)
> and then returns to the application that then processes the packets. I
> would like to adopt something quite similar for AF_XDP sockets. (Some
> of the differences can be found at the top of the original post.)
>
> From an API point of view with busy-poll of AF_XDP sockets, the user
> would bind to a queue number and taskset its application to a specific
> core and both the app and the driver execution would only occur on
> that core. This is in my mind simpler than with regular poll or AF_XDP
> using no syscalls on rx (i.e. current state), in which you bind to a
> queue, taskset your application to a core and then you also have to
> take care to route the interrupt of the queue you bound to to another
> core that will execute the driver part in the kernel. So the model is
> in both cases still one core - one socket - one napi. (Users can of
> course create multiple sockets in an app if they desire.)
>
> The main reasons I would like to introduce busy-poll for AF_XDP
> sockets are:
>
> * It is simpler to provision, see arguments above. Both application
> and driver runs efficiently on the same core.
>
> * It is faster (on a per core basis) since we do not have any core to
> core communication. All header and descriptor transfers between
> kernel and application are core local which is much
> faster. Scalability will also be better. E.g., 64 bytes desc + 64
> bytes packet header = 128 bytes per packet less on the interconnect
> between cores. At 20 Mpps/core, this is ~20Gbit/s and with 20 cores
> this will be ~400Gbit/s of interconnect traffic less with busy-poll.

exactly. don't make cpu do this core-to-core stuff.
pin one rx to one core.

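The back-of-the-envelope estimate quoted above (64 bytes of descriptor plus 64 bytes of packet header per packet, 20 Mpps per core, 20 cores) can be checked with a few lines of C. This is only the arithmetic from the quoted bullet; the function name interconnect_gbit_per_s() is a hypothetical helper, and the model assumes every one of those 128 bytes would otherwise cross the interconnect exactly once.

```c
#include <assert.h>

/*
 * Interconnect traffic in Gbit/s for a given per-packet byte count,
 * per-core packet rate in Mpps, and number of cores.
 * Hypothetical helper, used only to reproduce the estimate in the thread.
 */
double interconnect_gbit_per_s(double bytes_per_pkt, double mpps,
			       unsigned int cores)
{
	double bits_per_pkt = bytes_per_pkt * 8.0;
	double pkts_per_s = mpps * 1e6;

	return bits_per_pkt * pkts_per_s * cores / 1e9;
}
```

For 128 bytes at 20 Mpps this gives 20.48 Gbit/s for one core and 409.6 Gbit/s for 20 cores, which matches the rounded ~20 Gbit/s and ~400 Gbit/s figures.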
> * It provides a way to seamlessly replace user-space drivers in DPDK
> with Linux drivers in kernel space. (Do not think I have to argue
> why this is a good idea on this list ;-).) The DPDK model is that
> application and driver run on the same core since they are both in
> user space. If we can provide the same model (both running
> efficiently on the same core, NOT drivers in user-space) with
> AF_XDP, it is easy for DPDK users to make the switch. Compare this
> to the current way where there are both application cores and
> driver/ksoftirqd cores. If a systems builder had 12 cores in his
> appliance box and they had 12 instances of a DPDK app, one on each
> core, how would he/she reason when repartitioning between
> application and driver cores? 8 application cores and 4 driver
> cores, or 6 of each? Maybe it is also packet dependent? Etc. Much
> simpler to migrate if we had an efficient way to run both of them on
> the same core.
>
> Why no interrupt? That should have been: no interrupts enabled to
> start with. We would like to avoid interrupts as much as possible
> since when they trigger, we will revert to the non busy-poll model,
> i.e. processing on two separate cores, and the advantages from above
> will disappear. How to accomplish this?
>
> * One way would be to create a napi context with the queue we have
> bound to but with no interrupt associated with it, or it being
> disabled. The socket would in that case only be able to receive and
> send packets when calling the poll() syscall. If you do not call
> poll(), you do not get any packets, nor are any packets sent. It
> would only be possible to support this with a poll() timeout value
> of zero. This would have the best performance
>
> * Maybe we could support timeout values >0 by re-enabling the interrupt
> at some point. When calling poll(), the core would invoke the napi
> context repeatedly with the interrupt of that napi disabled until it
> found a packet, but max for a period of time up until the busy poll
> timeout (like regular busy poll today does). If that times out, we
> go up to the regular timeout of the poll() call and enable
> interrupts of the queue associated with the napi and put the process
> to sleep. Once woken up by an interrupt, the interrupt of the napi
> would be disabled again and control returned to the application. We
> would with this scheme process the vast majority of packets locally
> on a core with interrupts disabled and with good performance and
> only when we have low load and are sleeping/waiting in poll would we
> process some packets using interrupts on the core that the
> interrupt has been bound to.

I think both 'no interrupt' solutions are challenging for users.
Stack rx queues and af_xdp rx queues should look almost the same from
napi point of view. Stack -> normal napi in softirq. af_xdp -> new kthread
to work with both poll and busy-poll. The only difference between
poll and busy-poll will be the running context: new kthread vs user task.
If busy-poll drained the queue then new kthread napi has no work to do.
No irq approach could be marginally faster, but more error prone.
With new kthread the user space will still work in all configurations.
Even when single user task is processing many af_xdp sockets.

I'm proposing new kthread only partially for performance reasons, but
mainly to avoid sharing stack rx and af_xdp queues within the same softirq.
Currently we share softirqd for stack napis for all NICs in the system,
but af_xdp depends on isolated processing.
Ideally we have rss into N queues for stack and rss into M af_xdp sockets.
The same host will receive traffic on both.
Even if we rss stack queues to one set of cpus and af_xdp to another set
of cpus, softirqds are doing work on all cpus.
A burst of 64 packets on stack queues or some other work in softirqd
will spike the latency for af_xdp queues if softirq is shared.

Hence the proposal for new napi_kthreads:
- user creates af_xdp socket and binds to _CPU_ X then
  - driver allocates single af_xdp rx queue (queue ID doesn't need to be exposed)
  - spawns kthread pinned to cpu X
  - configures irq for that af_xdp queue to fire on cpu X
- user space with the help of libbpf pins its processing thread to that cpu X
- repeat above for as many af_xdp sockets as there are cpus
  (it's also ok to pick the same cpu X for different af_xdp sockets,
  then the new kthread is shared)
- user space configures hw to RSS to this set of af_xdp sockets.
  since ethtool api is a mess I propose to use af_xdp api to do this rss config

imo that would be the simplest and most performant way of using af_xdp.
All configuration apis are under libbpf (or libxdp if we choose to fork it)
End result is one af_xdp rx queue - one napi - one kthread - one user thread.
All pinned to the same cpu with irq on that cpu.
Both poll and busy-poll approaches will not bounce data between cpus.
No 'shadow' queues to speak of and should solve the issues that
folks were bringing up in different threads.
How crazy does it sound?
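
The user-space half of the steps above (pinning the processing thread to cpu X) can already be done today with the standard Linux affinity calls. The sketch below uses sched_setaffinity() directly rather than any libbpf helper; pin_self_to_cpu() and first_allowed_cpu() are hypothetical names chosen for illustration. The kernel-side steps (per-socket kthread, irq steering, RSS setup through an af_xdp api) are proposals in this thread and are not shown.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <sched.h>

/*
 * Pin the calling thread to a single cpu.
 * Returns 0 on success and -1 on error (errno set by sched_setaffinity).
 */
int pin_self_to_cpu(int cpu)
{
	cpu_set_t set;

	CPU_ZERO(&set);
	CPU_SET(cpu, &set);
	return sched_setaffinity(0, sizeof(set), &set);
}

/*
 * Return the lowest-numbered cpu the calling thread may run on,
 * or -1 if the affinity mask cannot be read.
 */
int first_allowed_cpu(void)
{
	cpu_set_t set;
	int cpu;

	if (sched_getaffinity(0, sizeof(set), &set) != 0)
		return -1;
	for (cpu = 0; cpu < CPU_SETSIZE; cpu++)
		if (CPU_ISSET(cpu, &set))
			return cpu;
	return -1;
}
```

A processing thread would call pin_self_to_cpu(X) before entering its poll() loop, with X being the cpu the socket's queue and irq are bound to.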

On Tue, May 7, 2019 at 8:24 PM Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>
> On Tue, May 07, 2019 at 01:51:45PM +0200, Magnus Karlsson wrote:
> > On Mon, May 6, 2019 at 6:33 PM Alexei Starovoitov
> > <alexei.starovoitov@gmail.com> wrote:
> > >
> > > On Thu, May 02, 2019 at 10:39:16AM +0200, Magnus Karlsson wrote:
> > > > This RFC proposes to add busy-poll support to AF_XDP sockets. With
> > > > busy-poll, the driver is executed in process context by calling the
> > > > poll() syscall. The main advantage with this is that all processing
> > > > occurs on a single core. This eliminates the core-to-core cache
> > > > transfers that occur between the application and the softirqd
> > > > processing on another core, that occurs without busy-poll. From a
> > > > > systems point of view, it also provides an advantage that we do not
> > > > have to provision extra cores in the system to handle
> > > > ksoftirqd/softirq processing, as all processing is done on the single
> > > > core that executes the application. The drawback of busy-poll is that
> > > > max throughput seen from a single application will be lower (due to
> > > > the syscall), but on a per core basis it will often be higher as
> > > > the normal mode runs on two cores and busy-poll on a single one.
> > > >
> > > > The semantics of busy-poll from the application point of view are the
> > > > following:
> > > >
> > > > * The application is required to call poll() to drive rx and tx
> > > > processing. There is no guarantee that softirq and interrupts will
> > > > do this for you. This is in contrast with the current
> > > > implementations of busy-poll that are opportunistic in the sense
> > > > that packets might be received/transmitted by busy-poll or
> > > > softirqd. (In this patch set, softirq/ksoftirqd will kick in at high
> > > > loads just as the current opportunistic implementations, but I would
> > > > like to get to a point where this is not the case for busy-poll
> > > > enabled XDP sockets, as this slows down performance considerably and
> > > > starts to use one more core for the softirq processing. The end goal
> > > > is for only poll() to drive the napi loop when busy-poll is enabled
> > > > on an AF_XDP socket. More about this later.)
> > > >
> > > > * It should be enabled on a per socket basis. No global enablement, i.e.
> > > > the XDP socket busy-poll will not care about the current
> > > > /proc/sys/net/core/busy_poll and busy_read global enablement
> > > > mechanisms.
> > > >
> > > > * The batch size (how many packets that are processed every time the
> > > > napi function in the driver is called, i.e. the weight parameter)
> > > > should be configurable. Currently, the busy-poll size of AF_INET
> > > > sockets is set to 8, but for AF_XDP sockets this is too small as the
> > > > amount of processing per packet is much smaller with AF_XDP. This
> > > > should be configurable on a per socket basis.
> > > >
> > > > * If you put multiple AF_XDP busy-poll enabled sockets into a poll()
> > > > call the napi contexts of all of them should be executed. This is in
> > > > > contrast to the AF_INET busy-poll that quits after the first one that
> > > > finds any packets. We need all napi contexts to be executed due to
> > > > the first requirement in this list. The behaviour we want is much more
> > > > like regular sockets in that all of them are checked in the poll
> > > > call.
> > > >
> > > > * Should be possible to mix AF_XDP busy-poll sockets with any other
> > > > sockets including busy-poll AF_INET ones in a single poll() call
> > > > without any change to semantics or the behaviour of any of those
> > > > socket types.
> > > >
> > > > * As suggested by Maxim Mikityanskiy, poll() will in the busy-poll
> > > > mode return POLLERR if the fill ring is empty or the completion
> > > > queue is full.
> > > >
> > > > Busy-poll support is enabled by calling a new setsockopt called
> > > > XDP_BUSY_POLL_BATCH_SIZE that takes batch size as an argument. A value
> > > > between 1 and NAPI_WEIGHT (64) will turn it on, 0 will turn it off and
> > > > any other value will return an error.
> > > >
> > > > A typical packet processing rxdrop loop with busy-poll will look something
> > > > like this:
> > > >
> > > > > for (i = 0; i < num_socks; i++) {
> > > > >         fds[i].fd = xsk_socket__fd(xsks[i]->xsk);
> > > > >         fds[i].events = POLLIN;
> > > > > }
> > > > >
> > > > > for (;;) {
> > > > >         ret = poll(fds, num_socks, 0);
> > > > >         if (ret <= 0)
> > > > >                 continue;
> > > > >
> > > > >         for (i = 0; i < num_socks; i++)
> > > > >                 rx_drop(xsks[i], fds); /* The actual application */
> > > > > }
> > > >
> > > > Need some advice around this issue please:
> > > >
> > > > In this patch set, softirq/ksoftirqd will kick in at high loads and
> > > > render the busy poll support useless as the execution is now happening
> > > > in the same way as without busy-poll support. Everything works from an
> > > > application perspective but this defeats the purpose of the support
> > > > and also consumes an extra core. What I would like to accomplish when
> > >
> > > Not sure what you mean by 'extra core' .
> > > The above poll+rx_drop is executed for every af_xdp socket
> > > and there are N cpus processing exactly N af_xdp sockets.
> > > Where is 'extra core'?
> > > Are you suggesting a model where single core will be busy-polling
> > > all af_xdp sockets? and then waking processing threads?
> > > or single core will process all sockets?
> > > I think the af_xdp model should be flexible and allow easy out-of-the-box
> > > experience, but it should be optimized for 'ideal' user that
> > > does the-right-thing from max packet-per-second point of view.
> > > I thought we've already converged on the model where af_xdp hw rx queues
> > > bind one-to-one to af_xdp sockets and user space pins processing
> > > threads one-to-one to af_xdp sockets on corresponding cpus...
> > > If so that's the model to optimize for on the kernel side
> > > while keeping all other user configurations functional.
> > >
> > > > XDP socket busy-poll is enabled is that softirq/ksoftirq is never
> > > > invoked for the traffic that goes to this socket. This way, we would
> > > > get better performance on a per core basis and also get the same
> > > > behaviour independent of load.
> > >
> > > I suspect separate rx kthreads of af_xdp socket processing is necessary
> > > with and without busy-poll exactly because of 'high load' case
> > > you've described.
> > > If we do this additional rx-kthread model why differentiate
> > > between busy-polling and polling?
> > >
> > > > af_xdp rx queue is completely different from stack rx queue because
> > > of target dma address setup.
> > > Using stack's napi ksoftirqd threads for processing af_xdp queues creates
> > > the fairness issues. Isn't it better to have separate kthreads for them
> > > and let scheduler deal with fairness among af_xdp processing and stack?
> >
> > When using ordinary poll() on an AF_XDP socket, the application will
> > run on one core and the driver processing will run on another in
> > softirq/ksoftirqd context. (Either due to explicit core and irq
> > pinning or due to the scheduler or irqbalance moving the two threads
> > apart.) In AF_XDP busy-poll mode of this RFC, I would like the
> > application and the driver processing to occur on a single core, thus
> > there is no "extra" driver core involved that need to be taken into
> > account when sizing and/or provisioning the system. The napi context
> > is in this mode invoked from syscall context when executing the poll
> > syscall from the application.
> >
> > Executing the app and the driver on the same core could of course be
> > accomplished already today by pinning the application and the driver
> > interrupt to the same core, but that would not be that efficient due
> > to context switching between the two.
>
> Have you benchmarked it?
> I don't think the context switch will be that noticeable when kpti is off.
> napi processes 64 packet descriptors and switches back to user space to
> do payload processing of these packets.
> I would think that the same job on two different cores would be
> a bit more performant, with user code consuming close to 100%
> and softirq a single-digit %. Say it's 10%.
> I believe combining the two on a single core is not 100 + 10 since
> there is no cache bouncing. So Mpps from the two-core setup will
> reduce by 2-3% instead of 10%.
> There is a cost of going to sleep and being woken up from poll(),
> but 64 packets is probably a large enough number to amortize it.
> If not, have you tried to bump the napi budget to say 256 for af_xdp rx queues?
> Busy-poll avoids the sleep/wakeup overhead and can probably make
> this scheme work with lower batching (like 64), but fundamentally
> they're the same thing.
> I'm not saying that we shouldn't do busy-poll. I'm saying it's
> complementary, but in all cases a single core per af_xdp rx queue
> with user thread pinning is preferred.
>
> > A more efficient way would be to
> > call the napi loop from within the poll() syscall when you are inside
> > the kernel anyway. This is what the classical busy-poll mechanism
> > operating on AF_INET sockets does. Inside the poll() call, it executes
> > the napi context of the driver until it finds a packet (if it is rx)
> > and then returns to the application that then processes the packets. I
> > would like to adopt something quite similar for AF_XDP sockets. (Some
> > of the differences can be found at the top of the original post.)
> >
> > From an API point of view with busy-poll of AF_XDP sockets, the user
> > would bind to a queue number and taskset its application to a specific
> > core and both the app and the driver execution would only occur on
> > that core. This is in my mind simpler than with regular poll or AF_XDP
> > using no syscalls on rx (i.e. current state), in which you bind to a
> > queue, taskset your application to a core and then you also have to
> > take care to route the interrupt of the queue you bound to to another
> > core that will execute the driver part in the kernel. So the model is
> > in both cases still one core - one socket - one napi. (Users can of
> > course create multiple sockets in an app if they desire.)
> >
> > The main reasons I would like to introduce busy-poll for AF_XDP
> > sockets are:
> >
> > * It is simpler to provision, see arguments above. Both application
> > and driver runs efficiently on the same core.
> >
> > * It is faster (on a per core basis) since we do not have any core to
> > core communication. All header and descriptor transfers between
> > kernel and application are core local which is much
> > faster. Scalability will also be better. E.g., 64 bytes desc + 64
> > bytes packet header = 128 bytes per packet less on the interconnect
> > between cores. At 20 Mpps/core, this is ~20Gbit/s and with 20 cores
> > this will be ~400Gbit/s of interconnect traffic less with busy-poll.
>
> exactly. don't make cpu do this core-to-core stuff.
> pin one rx to one core.
>
> > * It provides a way to seamlessly replace user-space drivers in DPDK
> > with Linux drivers in kernel space. (Do not think I have to argue
> > why this is a good idea on this list ;-).) The DPDK model is that
> > application and driver run on the same core since they are both in
> > user space. If we can provide the same model (both running
> > efficiently on the same core, NOT drivers in user-space) with
> > AF_XDP, it is easy for DPDK users to make the switch. Compare this
> > to the current way where there are both application cores and
> > driver/ksoftirqd cores. If a systems builder had 12 cores in his
> > appliance box and they had 12 instances of a DPDK app, one on each
> > core, how would he/she reason when repartitioning between
> > application and driver cores? 8 application cores and 4 driver
> > cores, or 6 of each? Maybe it is also packet dependent? Etc. Much
> > simpler to migrate if we had an efficient way to run both of them on
> > the same core.
> >
> > Why no interrupt? That should have been: no interrupts enabled to
> > start with. We would like to avoid interrupts as much as possible
> > since when they trigger, we will revert to the non busy-poll model,
> > i.e. processing on two separate cores, and the advantages from above
> > will disappear. How to accomplish this?
> >
> > * One way would be to create a napi context with the queue we have
> > bound to but with no interrupt associated with it, or it being
> > disabled. The socket would in that case only be able to receive and
> > send packets when calling the poll() syscall. If you do not call
> > poll(), you do not get any packets, nor are any packets sent. It
> > would only be possible to support this with a poll() timeout value
> > of zero. This would have the best performance
> >
> > * Maybe we could support timeout values >0 by re-enabling the interrupt
> > at some point. When calling poll(), the core would invoke the napi
> > context repeatedly with the interrupt of that napi disabled until it
> > found a packet, but max for a period of time up until the busy poll
> > timeout (like regular busy poll today does). If that times out, we
> > go up to the regular timeout of the poll() call and enable
> > interrupts of the queue associated with the napi and put the process
> > to sleep. Once woken up by an interrupt, the interrupt of the napi
> > would be disabled again and control returned to the application. We
> > would with this scheme process the vast majority of packets locally
> > on a core with interrupts disabled and with good performance and
> > only when we have low load and are sleeping/waiting in poll would we
> > process some packets using interrupts on the core that the
> > interrupt has been bound to.
>
> I think both 'no interrupt' solutions are challenging for users.
> Stack rx queues and af_xdp rx queues should look almost the same from
> napi point of view. Stack -> normal napi in softirq. af_xdp -> new kthread
> to work with both poll and busy-poll. The only difference between
> poll and busy-poll will be the running context: new kthread vs user task.
> If busy-poll drained the queue then new kthread napi has no work to do.
> No irq approach could be marginally faster, but more error prone.
> With new kthread the user space will still work in all configuration.
> Even when single user task is processing many af_xdp sockets.
>
> I'm proposing new kthread only partially for performance reasons, but
> mainly to avoid sharing stack rx and af_xdp queues within the same softirq.
> Currently we share softirqd for stack napis for all NICs in the system,
> but af_xdp depends on isolated processing.
> Ideally we have rss into N queues for stack and rss into M af_xdp sockets.
> The same host will receive traffic on both.
> Even if we rss stack queues to one set of cpus and af_xdp on another cpus
> softirqds are doing work on all cpus.
> A burst of 64 packets on stack queues or some other work in softirqd
> will spike the latency for af_xdp queues if softirq is shared.
> Hence the proposal for new napi_kthreads:
> - user creates af_xdp socket and binds to _CPU_ X then
> - driver allocates single af_xdp rq queue (queue ID doesn't need to be exposed)
> - spawns kthread pinned to cpu X
> - configures irq for that af_xdp queue to fire on cpu X
> - user space with the help of libbpf pins its processing thread to that cpu X
> - repeat above for as many af_xdp sockets as there are cpus
> (it's also ok to pick the same cpu X for different af_xdp sockets;
> then the new kthread is shared)
> - user space configures hw to RSS to these set of af_xdp sockets.
> since ethtool api is a mess I propose to use af_xdp api to do this rss config
>
> imo that would be the simplest and performant way of using af_xdp.
> All configuration apis are under libbpf (or libxdp if we choose to fork it)
> End result is one af_xdp rx queue - one napi - one kthread - one user thread.
> All pinned to the same cpu with irq on that cpu.
> Both poll and busy-poll approaches will not bounce data between cpus.
> No 'shadow' queues to speak of and should solve the issues that
> folks were bringing up in different threads.
> How crazy does it sound?
Actually, it sounds remarkably sane :-). It will create something
quite similar to what I have been wanting, but you take it at least
two steps further. I had not thought about introducing a separate kthread
as a potential solution, and Björn and I have only loosely discussed
user-space configuration of RSS (and maybe other flow steering
mechanisms) from AF_XDP. Anyway, I am producing performance
numbers for the options that we have talked about. I will get back to
you with them as soon as I have them and we can continue the
discussions based on those.
Thanks: Magnus

Tossing in my .02 cents:
I anticipate that most users of AF_XDP will want packet processing
for a given RX queue occurring on a single core - otherwise we end
up with cache delays. The usual model is one thread, one socket,
one core, but this isn't enforced anywhere in the AF_XDP code and is
up to the user to set this up.
On 7 May 2019, at 11:24, Alexei Starovoitov wrote:
> I'm not saying that we shouldn't do busy-poll. I'm saying it's
> complementary, but in all cases single core per af_xdp rq queue
> with user thread pinning is preferred.
So I think we're on the same page here.
> Stack rx queues and af_xdp rx queues should look almost the same from
> napi point of view. Stack -> normal napi in softirq. af_xdp -> new kthread
> to work with both poll and busy-poll. The only difference between
> poll and busy-poll will be the running context: new kthread vs user task.
...
> A burst of 64 packets on stack queues or some other work in softirqd
> will spike the latency for af_xdp queues if softirq is shared.
True, but would it be shared? This goes back to the current model,
which as used by Intel is:
(channel == RX, TX, softirq)
MLX, on the other hand, wants:
(channel == RX.stack, RX.AF_XDP, TX.stack, TX.AF_XDP, softirq)
Which would indeed lead to sharing. The more I look at the above, the
stronger I start to dislike it. Perhaps this should be disallowed?
I believe there was some mention at LSF/MM that the 'channel' concept
was something specific to HW and really shouldn't be part of the SW API.
> Hence the proposal for new napi_kthreads:
> - user creates af_xdp socket and binds to _CPU_ X then
> - driver allocates single af_xdp rq queue (queue ID doesn't need to be exposed)
> - spawns kthread pinned to cpu X
> - configures irq for that af_xdp queue to fire on cpu X
> - user space with the help of libbpf pins its processing thread to that cpu X
> - repeat above for as many af_xdp sockets as there are cpus
> (it's also ok to pick the same cpu X for different af_xdp sockets;
> then the new kthread is shared)
> - user space configures hw to RSS to this set of af_xdp sockets.
> since ethtool api is a mess I propose to use af_xdp api to do this rss config
From a high level point of view, this sounds quite sensible, but does need
some details ironed out. The proposal above essentially enforces a model of:
(af_xdp = RX.af_xdp + bound_cpu)
(bound_cpu = hw.cpu + af_xdp.kthread + hw.irq)
(temporarily ignoring TX)
I foresee two issues with the above approach:
1. hardware limitations in the number of queues/rings
2. RSS/steering rules
> - user creates af_xdp socket and binds to _CPU_ X then
> - driver allocates single af_xdp rq queue (queue ID doesn't need to be exposed)
Here, the driver may not be able to create an arbitrary RQ, but may need to
tear down/reuse an existing one used by the stack. This may not be an issue
for modern hardware.
> - user space configures hw to RSS to these set of af_xdp sockets.
> since ethtool api is a mess I propose to use af_xdp api to do this
> rss config
Currently, RSS only steers default traffic. On a system with shared
stack/af_xdp queues, there should be a way to split the traffic types,
unless we're talking about a model where all traffic goes to AF_XDP.
This classification has to be done by the NIC, since it comes before RSS
steering - which currently means sending flow match rules to the NIC, which
is less than ideal. I agree that the ethtool interface is non-optimal, but
it does make it clear to the user what's going on.
Perhaps an af_xdp library that does some bookkeeping:
- open af_xdp socket
- define af_xdp_set as (classification, steering rules, other?)
- bind socket to (cpu, af_xdp_set)
- kernel:
  - pins calling thread to cpu
  - creates kthread if one doesn't exist, binds to irq and cpu
  - has driver create RQ.af_xdp, possibly replacing RQ.stack
  - applies (af_xdp_set) to NIC.
Seems workable, but a little complicated? The complexity could be moved
into a separate library.
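The "creates kthread if one doesn't exist" step is essentially per-CPU
reference counting. Below is a minimal user-space sketch in C of that
bookkeeping only. struct cpu_table, bind_socket_to_cpu() and
unbind_socket_from_cpu() are hypothetical names made up for illustration;
they are not existing kernel or libbpf API, and a real implementation
would also need locking.

```c
#include <assert.h>

#define MAX_CPUS 64  /* arbitrary limit chosen for this sketch */

/* Hypothetical table: number of AF_XDP sockets bound to each CPU. */
struct cpu_table {
    unsigned int nsocks[MAX_CPUS];
};

/*
 * Account for one more socket on `cpu`.
 * Returns 1 if this is the first socket there (a kthread would have to be
 * spawned), 0 if an already existing kthread is shared, -1 on a bad CPU.
 */
int bind_socket_to_cpu(struct cpu_table *t, int cpu)
{
    if (cpu < 0 || cpu >= MAX_CPUS)
        return -1;
    return t->nsocks[cpu]++ == 0;
}

/*
 * Drop one socket from `cpu`.
 * Returns 1 if it was the last one (the kthread could be stopped),
 * 0 if other sockets still use it, -1 if nothing was bound there.
 */
int unbind_socket_from_cpu(struct cpu_table *t, int cpu)
{
    if (cpu < 0 || cpu >= MAX_CPUS || t->nsocks[cpu] == 0)
        return -1;
    return --t->nsocks[cpu] == 0;
}
```

With this, two sockets bound to the same cpu X share one kthread, which
matches the "then new kthread is shared" case in the proposal.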
> imo that would be the simplest and performant way of using af_xdp.
> All configuration apis are under libbpf (or libxdp if we choose to
> fork it)
> End result is one af_xdp rx queue - one napi - one kthread - one user
> thread.
> All pinned to the same cpu with irq on that cpu.
> Both poll and busy-poll approaches will not bounce data between cpus.
> No 'shadow' queues to speak of and should solve the issues that
> folks were bringing up in different threads.
Sounds like a sensible model from my POV.
--
Jonathan

On 5/13/2019 1:42 PM, Jonathan Lemon wrote:
> Tossing in my .02 cents:
>
>
> I anticipate that most users of AF_XDP will want packet processing
> for a given RX queue occurring on a single core - otherwise we end
> up with cache delays. The usual model is one thread, one socket,
> one core, but this isn't enforced anywhere in the AF_XDP code and is
> up to the user to set this up.
AF_XDP with busypoll should allow a single thread to poll a given RX
queue and use a single core.
>
> On 7 May 2019, at 11:24, Alexei Starovoitov wrote:
>> I'm not saying that we shouldn't do busy-poll. I'm saying it's
>> complementary, but in all cases single core per af_xdp rq queue
>> with user thread pinning is preferred.
>
> So I think we're on the same page here.
>
>> Stack rx queues and af_xdp rx queues should look almost the same from
>> napi point of view. Stack -> normal napi in softirq. af_xdp -> new
>> kthread
>> to work with both poll and busy-poll. The only difference between
>> poll and busy-poll will be the running context: new kthread vs user
>> task.
> ...
>> A burst of 64 packets on stack queues or some other work in softirqd
>> will spike the latency for af_xdp queues if softirq is shared.
>
> True, but would it be shared? This goes back to the current model,
> which
> as used by Intel is:
>
> (channel == RX, TX, softirq)
>
> MLX, on the other hand, wants:
>
> (channel == RX.stack, RX.AF_XDP, TX.stack, TX.AF_XDP, softirq)
>
> Which would indeed lead to sharing. The more I look at the above, the
> stronger I start to dislike it. Perhaps this should be disallowed?
>
> I believe there was some mention at LSF/MM that the 'channel' concept
> was something specific to HW and really shouldn't be part of the SW API.
>
>> Hence the proposal for new napi_kthreads:
>> - user creates af_xdp socket and binds to _CPU_ X then
>> - driver allocates single af_xdp rq queue (queue ID doesn't need to be
>> exposed)
>> - spawns kthread pinned to cpu X
>> - configures irq for that af_xdp queue to fire on cpu X
>> - user space with the help of libbpf pins its processing thread to
>> that cpu X
>> - repeat above for as many af_xdp sockets as there are cpus
>> (it's also ok to pick the same cpu X for different af_xdp sockets;
>> then the new kthread is shared)
>> - user space configures hw to RSS to these set of af_xdp sockets.
>> since ethtool api is a mess I propose to use af_xdp api to do this
>> rss config
>
>
> From a high level point of view, this sounds quite sensible, but does
> need
> some details ironed out. The model above essentially enforces a model
> of:
>
> (af_xdp = RX.af_xdp + bound_cpu)
> (bound_cpu = hw.cpu + af_xdp.kthread + hw.irq)
>
> (temporarily ignoring TX for right now)
>
>
> I foresee two issues with the above approach:
> 1. hardware limitations in the number of queues/rings
> 2. RSS/steering rules
>
>> - user creates af_xdp socket and binds to _CPU_ X then
>> - driver allocates single af_xdp rq queue (queue ID doesn't need to be
>> exposed)
>
> Here, the driver may not be able to create an arbitrary RQ, but may need
> to
> tear down/reuse an existing one used by the stack. This may not be an
> issue
> for modern hardware.
>
>> - user space configures hw to RSS to these set of af_xdp sockets.
>> since ethtool api is a mess I propose to use af_xdp api to do this
>> rss config
>
> Currently, RSS only steers default traffic. On a system with shared
> stack/af_xdp queues, there should be a way to split the traffic types,
> unless we're talking about a model where all traffic goes to AF_XDP.
>
> This classification has to be done by the NIC, since it comes before RSS
> steering - which currently means sending flow match rules to the NIC,
> which
> is less than ideal. I agree that the ethtool interface is non optimal,
> but
> it does make things clear to the user what's going on.
'tc' provides another interface to split NIC queues into groups of
queues, each with its own RSS. For example:
  tc qdisc add dev <i/f> root mqprio num_tc 3 map 0 1 2 queues 2@0 32@2 8@34 hw 1 mode channel
will split NIC queues into 3 groups of 2, 32 and 8 queues.
By default all the packets go only to the first queue group with 2
queues. Filters can be added to redirect packets to the other queue groups:
  tc filter add dev <i/f> protocol ip ingress prio 1 flower dst_ip 192.168.0.2 ip_proto tcp dst_port 1234 skip_sw hw_tc 1
  tc filter add dev <i/f> protocol ip ingress prio 1 flower dst_ip 192.168.0.3 ip_proto tcp dst_port 1234 skip_sw hw_tc 2
Here hw_tc indicates the queue group.
It should be possible to run AF_XDP on queue group 3 by creating 8
af-xdp sockets and binding them to queues 34-41.
Does this look like a reasonable model to use a subset of nic queues for
af-xdp applications?
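To make the queue numbering concrete, here is a small C sketch, assuming
each entry of the mqprio "queues" list is read as count@offset. The struct
and function names are made up for illustration and are not part of tc or
of any kernel API.

```c
#include <assert.h>
#include <stdio.h>

/* One "count@offset" entry of the mqprio "queues" list (illustrative type). */
struct queue_group {
    unsigned int count;   /* number of queues in the group */
    unsigned int offset;  /* index of the first queue in the group */
};

/* Parse a "count@offset" string; returns 0 on success, -1 on malformed input. */
int parse_queue_group(const char *s, struct queue_group *g)
{
    if (sscanf(s, "%u@%u", &g->count, &g->offset) != 2 || g->count == 0)
        return -1;
    return 0;
}

/* Index of the last queue that belongs to the group. */
unsigned int last_queue(const struct queue_group *g)
{
    return g->offset + g->count - 1;
}
```

For the third group, 8@34, this gives a first queue of 34 and a last queue
of 34 + 8 - 1 = 41, so those are the eight queue ids the AF_XDP sockets
would bind to.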
>
> Perhaps an af_xdp library that does some bookkeeping:
> - open af_xdp socket
> - define af_xdp_set as (classification, steering rules, other?)
> - bind socket to (cpu, af_xdp_set)
> - kernel:
> - pins calling thread to cpu
> - creates kthread if one doesn't exist, binds to irq and cpu
> - has driver create RQ.af_xdp, possibly replacing RQ.stack
> - applies (af_xdp_set) to NIC.
>
> Seems workable, but a little complicated? The complexity could be moved
> into a separate library.
>
>
>> imo that would be the simplest and performant way of using af_xdp.
>> All configuration apis are under libbpf (or libxdp if we choose to
>> fork it)
>> End result is one af_xdp rx queue - one napi - one kthread - one user
>> thread.
>> All pinned to the same cpu with irq on that cpu.
>> Both poll and busy-poll approaches will not bounce data between cpus.
>> No 'shadow' queues to speak of and should solve the issues that
>> folks were bringing up in different threads.
>
> Sounds like a sensible model from my POV.
>

On Mon, 13 May 2019 at 22:44, Jonathan Lemon <bsd@fb.com> wrote:
>
> Tossing in my .02 cents:
>
>
> I anticipate that most users of AF_XDP will want packet processing
> for a given RX queue occurring on a single core - otherwise we end
> up with cache delays. The usual model is one thread, one socket,
> one core, but this isn't enforced anywhere in the AF_XDP code and is
> up to the user to set this up.
>
Hmm, I definitely see use-cases where one would like multiple Rx
sockets per core, and say, multiple Tx sockets per core. Enforcing it
at the uapi is IMO not correct. (Maybe in libbpf, but that's another
thing.)
> On 7 May 2019, at 11:24, Alexei Starovoitov wrote:
> > I'm not saying that we shouldn't do busy-poll. I'm saying it's
> > complementary, but in all cases single core per af_xdp rq queue
> > with user thread pinning is preferred.
>
> So I think we're on the same page here.
>
> > Stack rx queues and af_xdp rx queues should look almost the same from
> > napi point of view. Stack -> normal napi in softirq. af_xdp -> new
> > kthread
> > to work with both poll and busy-poll. The only difference between
> > poll and busy-poll will be the running context: new kthread vs user
> > task.
> ...
> > A burst of 64 packets on stack queues or some other work in softirqd
> > will spike the latency for af_xdp queues if softirq is shared.
>
> True, but would it be shared? This goes back to the current model,
> which
> as used by Intel is:
>
> (channel == RX, TX, softirq)
>
> MLX, on the other hand, wants:
>
> (channel == RX.stack, RX.AF_XDP, TX.stack, TX.AF_XDP, softirq)
>
> Which would indeed lead to sharing. The more I look at the above, the
> stronger I start to dislike it. Perhaps this should be disallowed?
>
> I believe there was some mention at LSF/MM that the 'channel' concept
> was something specific to HW and really shouldn't be part of the SW API.
>
I'm probably stating things people already know, but I'll take a
detour here anyway, hijacking this thread for a queue rant.
AF_XDP sockets have two modes: zero-copy mode and copy-mode. A socket
has different flavors: Rx, Tx or both. Sockets with Rx flavors (Rx or
'both') can be attached to an Rx queue. Today, the only Rx queues are
the ones attached to the stack.
Zero-copy sockets with Rx flavors require hardware steering, and to be
useful, a mechanism to create a set of queues is needed. When stack
queues and AF_XDP sockets reside on a shared netdev, one way is to create
queues separate from the stack (how do we represent that to a user?).
Another way is to create a new netdev (say, macvlan with zero-copy
support), and have all the Rx queues be represented by AF_XDP sockets.
Copy-mode Rx sockets, OTOH, do not require steering. In the
copy-mode case, the XDP program is a switchboard where some packets
can go to the stack, some to user-space and some elsewhere.
So, what does AF_XDP need, that's not in place yet (from my perspective)?
* For zero-copy: a mechanism to create new sets of Rx queues, and a
mechanism to direct flows (via, say, a new bpf hook)
* For zero-copy/copy: a mechanism to create new Tx queues, and from
AF_XDP select that queue to be used by a socket. This would be good
for the generic XDP redirect case as well.
Zero-copy AF_XDP Rx sockets are typically used when the hardware
supports that kind of steering, and typically a minimal XDP program
would then be used (if any, going forward). Copy-mode is for the
software fallback, where a more capable XDP program is needed. One
problem is that zero-copy and copy-mode behave differently, so
copy-mode can't really be seen as a fallback to zero-copy today. In
copy-mode you cannot receive from Rx queue X, and redirect to socket
bound to queue Y (X != Y). In zero-copy mode, you just bind to a
queue, and do the redirection from configuration.
So, it would be nice with "unbound/anonymous queue sockets" or
"virtual/no queue ids", which I think is what Alexei is proposing?
Create a bunch of sockets. For copy-mode, the XDP program will do the
steering (receive from HW queue X, redirect to any socket). For
zero-copy, the configuration will solve that. The userspace
application doesn't have to change, which is a good abstraction. :-)
For the copy-mode it would be a performance hit (relaxing the SPSC
relationship), but maybe we can care about that later.
From my perspective, a mechanism to create Tx *and* Rx queues separate
from the stack is useful even outside the scope of AF_XDP. Create a
set of Rx queues and Tx queues, configure flows to those Rx queues
(via BPF?), and let an XDP program do, say, load-balancing using the
setup Tx queues. This makes sense without AF_XDP as well. The
anonymous queue path is OTOH simpler, but is an AF_XDP only
mechanism...
> > Hence the proposal for new napi_kthreads:
> > - user creates af_xdp socket and binds to _CPU_ X then
> > - driver allocates single af_xdp rq queue (queue ID doesn't need to be
> > exposed)
> > - spawns kthread pinned to cpu X
> > - configures irq for that af_xdp queue to fire on cpu X
> > - user space with the help of libbpf pins its processing thread to
> > that cpu X
> > - repeat above for as many af_xdp sockets as there are cpus
> > (it's also ok to pick the same cpu X for different af_xdp sockets;
> > then the new kthread is shared)
> > - user space configures hw to RSS to these set of af_xdp sockets.
> > since ethtool api is a mess I propose to use af_xdp api to do this
> > rss config
>
>
> From a high level point of view, this sounds quite sensible, but does
> need
> some details ironed out. The model above essentially enforces a model
> of:
>
> (af_xdp = RX.af_xdp + bound_cpu)
> (bound_cpu = hw.cpu + af_xdp.kthread + hw.irq)
>
> (temporarily ignoring TX for right now)
>
...and multiple Rx queues per core in there as well.
>
> I foresee two issues with the above approach:
> 1. hardware limitations in the number of queues/rings
> 2. RSS/steering rules
>
> > - user creates af_xdp socket and binds to _CPU_ X then
> > - driver allocates single af_xdp rq queue (queue ID doesn't need to be
> > exposed)
>
> Here, the driver may not be able to create an arbitrary RQ, but may need
> to
> tear down/reuse an existing one used by the stack. This may not be an
> issue
> for modern hardware.
>
Again, let's focus on usability first. If the hardware cannot support
it efficiently, provide a software fallback. We don't want an API that's a
trash bin of different "stuff" hardware vendors want to put in because
they can. (Hi ethtool! ;-))
> > - user space configures hw to RSS to these set of af_xdp sockets.
> > since ethtool api is a mess I propose to use af_xdp api to do this
> > rss config
>
> Currently, RSS only steers default traffic. On a system with shared
> stack/af_xdp queues, there should be a way to split the traffic types,
> unless we're talking about a model where all traffic goes to AF_XDP.
>
> This classification has to be done by the NIC, since it comes before RSS
> steering - which currently means sending flow match rules to the NIC,
> which
> is less than ideal. I agree that the ethtool interface is non optimal,
> but
> it does make things clear to the user what's going on.
>
I would also like to see something other than ethtool, but not
limited to AF_XDP. Maybe a BPF configuration hook: Probe the HW for
capabilities; Missing support? Fallback and load an XDP program for
software emulation. Hardware support for BPF? Pass the "fallback" XDP
program to the hardware.
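One possible reading of that probe-then-fallback idea, as a C sketch. All
names here (struct hw_caps, enum steering_backend, choose_steering()) are
hypothetical and only illustrate the decision order; no such hook or
capability query exists today, and preferring native steering over BPF
offload is my assumption.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical result of probing the NIC for steering capabilities. */
struct hw_caps {
    bool flow_steering;  /* NIC can steer the requested flows natively */
    bool bpf_offload;    /* NIC can run an offloaded BPF program */
};

enum steering_backend {
    STEER_HW_NATIVE,  /* program the NIC's own flow steering */
    STEER_HW_BPF,     /* pass the "fallback" XDP program to the hardware */
    STEER_SW_XDP,     /* load the XDP program for software emulation */
};

/* Pick the cheapest backend the hardware supports, software XDP last. */
enum steering_backend choose_steering(struct hw_caps caps)
{
    if (caps.flow_steering)
        return STEER_HW_NATIVE;
    if (caps.bpf_offload)
        return STEER_HW_BPF;
    return STEER_SW_XDP;
}
```

The point is that the application asks for the same steering in all three
cases and only the backend differs.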
> Perhaps an af_xdp library that does some bookkeeping:
> - open af_xdp socket
> - define af_xdp_set as (classification, steering rules, other?)
> - bind socket to (cpu, af_xdp_set)
> - kernel:
> - pins calling thread to cpu
> - creates kthread if one doesn't exist, binds to irq and cpu
> - has driver create RQ.af_xdp, possibly replacing RQ.stack
> - applies (af_xdp_set) to NIC.
>
> Seems workable, but a little complicated? The complexity could be moved
> into a separate library.
>
Yes. :-)
>
> > imo that would be the simplest and performant way of using af_xdp.
> > All configuration apis are under libbpf (or libxdp if we choose to
> > fork it)
> > End result is one af_xdp rx queue - one napi - one kthread - one user
> > thread.
> > All pinned to the same cpu with irq on that cpu.
> > Both poll and busy-poll approaches will not bounce data between cpus.
> > No 'shadow' queues to speak of and should solve the issues that
> > folks were bringing up in different threads.
>
> Sounds like a sensible model from my POV.
No "shadow queues", but AF_XDP-only queues. Maybe that's ok. OTOH the
XDP Tx queues are still there, and they cannot (today at least) be
configured.
Björn
> --
> Jonathan

On Wed, May 8, 2019 at 2:10 PM Magnus Karlsson
<magnus.karlsson@gmail.com> wrote:
>
> On Tue, May 7, 2019 at 8:24 PM Alexei Starovoitov
> <alexei.starovoitov@gmail.com> wrote:
> >
> > On Tue, May 07, 2019 at 01:51:45PM +0200, Magnus Karlsson wrote:
> > > On Mon, May 6, 2019 at 6:33 PM Alexei Starovoitov
> > > <alexei.starovoitov@gmail.com> wrote:
> > > >
> > > > On Thu, May 02, 2019 at 10:39:16AM +0200, Magnus Karlsson wrote:
> > > > > This RFC proposes to add busy-poll support to AF_XDP sockets. With
> > > > > busy-poll, the driver is executed in process context by calling the
> > > > > poll() syscall. The main advantage with this is that all processing
> > > > > occurs on a single core. This eliminates the core-to-core cache
> > > > > transfers that occur between the application and the softirqd
> > > > > processing on another core, that occurs without busy-poll. From a
> > > > > > systems point of view, it also provides an advantage that we do not
> > > > > have to provision extra cores in the system to handle
> > > > > ksoftirqd/softirq processing, as all processing is done on the single
> > > > > core that executes the application. The drawback of busy-poll is that
> > > > > max throughput seen from a single application will be lower (due to
> > > > > the syscall), but on a per core basis it will often be higher as
> > > > > the normal mode runs on two cores and busy-poll on a single one.
> > > > >
> > > > > The semantics of busy-poll from the application point of view are the
> > > > > following:
> > > > >
> > > > > * The application is required to call poll() to drive rx and tx
> > > > > processing. There is no guarantee that softirq and interrupts will
> > > > > do this for you. This is in contrast with the current
> > > > > implementations of busy-poll that are opportunistic in the sense
> > > > > that packets might be received/transmitted by busy-poll or
> > > > > softirqd. (In this patch set, softirq/ksoftirqd will kick in at high
> > > > > loads just as the current opportunistic implementations, but I would
> > > > > like to get to a point where this is not the case for busy-poll
> > > > > enabled XDP sockets, as this slows down performance considerably and
> > > > > starts to use one more core for the softirq processing. The end goal
> > > > > is for only poll() to drive the napi loop when busy-poll is enabled
> > > > > on an AF_XDP socket. More about this later.)
> > > > >
> > > > > * It should be enabled on a per socket basis. No global enablement, i.e.
> > > > > the XDP socket busy-poll will not care about the current
> > > > > /proc/sys/net/core/busy_poll and busy_read global enablement
> > > > > mechanisms.
> > > > >
> > > > > * The batch size (how many packets that are processed every time the
> > > > > napi function in the driver is called, i.e. the weight parameter)
> > > > > should be configurable. Currently, the busy-poll size of AF_INET
> > > > > sockets is set to 8, but for AF_XDP sockets this is too small as the
> > > > > amount of processing per packet is much smaller with AF_XDP. This
> > > > > should be configurable on a per socket basis.
> > > > >
> > > > > * If you put multiple AF_XDP busy-poll enabled sockets into a poll()
> > > > > call the napi contexts of all of them should be executed. This is in
> > > > > > contrast to the AF_INET busy-poll that quits after the first one that
> > > > > finds any packets. We need all napi contexts to be executed due to
> > > > > the first requirement in this list. The behaviour we want is much more
> > > > > like regular sockets in that all of them are checked in the poll
> > > > > call.
> > > > >
> > > > > * Should be possible to mix AF_XDP busy-poll sockets with any other
> > > > > sockets including busy-poll AF_INET ones in a single poll() call
> > > > > without any change to semantics or the behaviour of any of those
> > > > > socket types.
> > > > >
> > > > > * As suggested by Maxim Mikityanskiy, poll() will in the busy-poll
> > > > > mode return POLLERR if the fill ring is empty or the completion
> > > > > queue is full.
> > > > >
> > > > > Busy-poll support is enabled by calling a new setsockopt called
> > > > > XDP_BUSY_POLL_BATCH_SIZE that takes batch size as an argument. A value
> > > > > between 1 and NAPI_WEIGHT (64) will turn it on, 0 will turn it off and
> > > > > any other value will return an error.
> > > > >
> > > > > A typical packet processing rxdrop loop with busy-poll will look something
> > > > > like this:
> > > > >
> > > > > > for (i = 0; i < num_socks; i++) {
> > > > > >     fds[i].fd = xsk_socket__fd(xsks[i]->xsk);
> > > > > >     fds[i].events = POLLIN;
> > > > > > }
> > > > > >
> > > > > > for (;;) {
> > > > > >     ret = poll(fds, num_socks, 0);
> > > > > >     if (ret <= 0)
> > > > > >         continue;
> > > > > >
> > > > > >     for (i = 0; i < num_socks; i++)
> > > > > >         rx_drop(xsks[i], fds); /* The actual application */
> > > > > > }
> > > > >
> > > > > Need some advice around this issue please:
> > > > >
> > > > > In this patch set, softirq/ksoftirqd will kick in at high loads and
> > > > > render the busy poll support useless as the execution is now happening
> > > > > in the same way as without busy-poll support. Everything works from an
> > > > > application perspective but this defeats the purpose of the support
> > > > > and also consumes an extra core. What I would like to accomplish when
> > > >
> > > > Not sure what you mean by 'extra core' .
> > > > The above poll+rx_drop is executed for every af_xdp socket
> > > > and there are N cpus processing exactly N af_xdp sockets.
> > > > Where is 'extra core'?
> > > > Are you suggesting a model where single core will be busy-polling
> > > > all af_xdp sockets? and then waking processing threads?
> > > > or single core will process all sockets?
> > > > I think the af_xdp model should be flexible and allow easy out-of-the-box
> > > > experience, but it should be optimized for 'ideal' user that
> > > > does the-right-thing from max packet-per-second point of view.
> > > > I thought we've already converged on the model where af_xdp hw rx queues
> > > > bind one-to-one to af_xdp sockets and user space pins processing
> > > > threads one-to-one to af_xdp sockets on corresponding cpus...
> > > > If so that's the model to optimize for on the kernel side
> > > > while keeping all other user configurations functional.
> > > >
> > > > > XDP socket busy-poll is enabled is that softirq/ksoftirq is never
> > > > > invoked for the traffic that goes to this socket. This way, we would
> > > > > get better performance on a per core basis and also get the same
> > > > > behaviour independent of load.
> > > >
> > > > I suspect separate rx kthreads of af_xdp socket processing is necessary
> > > > with and without busy-poll exactly because of 'high load' case
> > > > you've described.
> > > > If we do this additional rx-kthread model why differentiate
> > > > between busy-polling and polling?
> > > >
> > > > af_xdp rx queue is completely different from stack rx queue because
> > > > of target dma address setup.
> > > > Using stack's napi ksoftirqd threads for processing af_xdp queues creates
> > > > the fairness issues. Isn't it better to have separate kthreads for them
> > > > and let scheduler deal with fairness among af_xdp processing and stack?
> > >
> > > When using ordinary poll() on an AF_XDP socket, the application will
> > > run on one core and the driver processing will run on another in
> > > softirq/ksoftirqd context. (Either due to explicit core and irq
> > > pinning or due to the scheduler or irqbalance moving the two threads
> > > apart.) In AF_XDP busy-poll mode of this RFC, I would like the
> > > application and the driver processing to occur on a single core, thus
> > > there is no "extra" driver core involved that need to be taken into
> > > account when sizing and/or provisioning the system. The napi context
> > > is in this mode invoked from syscall context when executing the poll
> > > syscall from the application.
> > >
> > > Executing the app and the driver on the same core could of course be
> > > accomplished already today by pinning the application and the driver
> > > interrupt to the same core, but that would not be that efficient due
> > > to context switching between the two.
> >
> > Have you benchmarked it?
> > I don't think context switch will be that noticeable when kpti is off.
> > napi processes 64 packet descriptors and switches back to user to
> > do payload processing of these packets.
> > I would think that the same job on two different cores would be
> > a bit more performant with user code consuming close to 100%
> > and softirq is single digit %. Say it's 10%.
> > I believe combining the two on single core is not 100 + 10 since
> > there is no cache bouncing. So Mpps from two cores setup will
> > reduce by 2-3% instead of 10%.
> > There is a cost of going to sleep and being woken up from poll(),
> > but 64 packets is probably large enough number to amortize.
> > If not, have you tried to bump napi budget to say 256 for af_xdp rx queues?
> > Busy-poll avoids sleep/wakeup overhead and probably can make
> > this scheme work with lower batching (like 64), but fundamentally
> > they're the same thing.
> > I'm not saying that we shouldn't do busy-poll. I'm saying it's
> > complementary, but in all cases single core per af_xdp rq queue
> > with user thread pinning is preferred.
> >
> > > A more efficient way would be to
> > > call the napi loop from within the poll() syscall when you are inside
> > > the kernel anyway. This is what the classical busy-poll mechanism
> > > operating on AF_INET sockets does. Inside the poll() call, it executes
> > > the napi context of the driver until it finds a packet (if it is rx)
> > > and then returns to the application that then processes the packets. I
> > > would like to adopt something quite similar for AF_XDP sockets. (Some
> > > of the differences can be found at the top of the original post.)
> > >
> > > From an API point of view with busy-poll of AF_XDP sockets, the user
> > > would bind to a queue number and taskset its application to a specific
> > > core and both the app and the driver execution would only occur on
> > > that core. This is in my mind simpler than with regular poll or AF_XDP
> > > using no syscalls on rx (i.e. current state), in which you bind to a
> > > queue, taskset your application to a core and then you also have to
> > > take care to route the interrupt of the queue you bound to to another
> > > core that will execute the driver part in the kernel. So the model is
> > > in both cases still one core - one socket - one napi. (Users can of
> > > course create multiple sockets in an app if they desire.)
> > >
> > > The main reasons I would like to introduce busy-poll for AF_XDP
> > > sockets are:
> > >
> > > * It is simpler to provision, see arguments above. Both application
> > > and driver runs efficiently on the same core.
> > >
> > > * It is faster (on a per core basis) since we do not have any core to
> > > core communication. All header and descriptor transfers between
> > > kernel and application are core local which is much
> > > faster. Scalability will also be better. E.g., 64 bytes desc + 64
> > > bytes packet header = 128 bytes per packet less on the interconnect
> > > between cores. At 20 Mpps/core, this is ~20Gbit/s and with 20 cores
> > > this will be ~400Gbit/s of interconnect traffic less with busy-poll.
> >
> > exactly. don't make cpu do this core-to-core stuff.
> > pin one rx to one core.
> >
> > > * It provides a way to seamlessly replace user-space drivers in DPDK
> > > with Linux drivers in kernel space. (Do not think I have to argue
> > > why this is a good idea on this list ;-).) The DPDK model is that
> > > application and driver run on the same core since they are both in
> > > user space. If we can provide the same model (both running
> > > efficiently on the same core, NOT drivers in user-space) with
> > > AF_XDP, it is easy for DPDK users to make the switch. Compare this
> > > to the current way where there are both application cores and
> > > driver/ksoftirqd cores. If a systems builder had 12 cores in his
> > > appliance box and they had 12 instances of a DPDK app, one on each
> > > core, how would he/she reason when repartitioning between
> > > application and driver cores? 8 application cores and 4 driver
> > > cores, or 6 of each? Maybe it is also packet dependent? Etc. Much
> > > simpler to migrate if we had an efficient way to run both of them on
> > > the same core.
> > >
> > > Why no interrupt? That should have been: no interrupts enabled to
> > > start with. We would like to avoid interrupts as much as possible
> > > since when they trigger, we will revert to the non busy-poll model,
> > > i.e. processing on two separate cores, and the advantages from above
> > > will disappear. How to accomplish this?
> > >
> > > * One way would be to create a napi context with the queue we have
> > > bound to but with no interrupt associated with it, or it being
> > > disabled. The socket would in that case only be able to receive and
> > > send packets when calling the poll() syscall. If you do not call
> > > poll(), you do not get any packets, nor are any packets sent. It
> > > would only be possible to support this with a poll() timeout value
> > > of zero. This would have the best performance
> > >
> > > * Maybe we could support timeout values >0 by re-enabling the interrupt
> > > at some point. When calling poll(), the core would invoke the napi
> > > context repeatedly with the interrupt of that napi disabled until it
> > > found a packet, but max for a period of time up until the busy poll
> > > timeout (like regular busy poll today does). If that times out, we
> > > go up to the regular timeout of the poll() call and enable
> > > interrupts of the queue associated with the napi and put the process
> > > to sleep. Once woken up by an interrupt, the interrupt of the napi
> > > would be disabled again and control returned to the application. We
> > > would with this scheme process the vast majority of packets locally
> > > on a core with interrupts disabled and with good performance and
> > > only when we have low load and are sleeping/waiting in poll would we
> > > process some packets using interrupts on the core that the
> > > interrupt has been bound to.
> >
> > I think both 'no interrupt' solutions are challenging for users.
> > Stack rx queues and af_xdp rx queues should look almost the same from
> > napi point of view. Stack -> normal napi in softirq. af_xdp -> new kthread
> > to work with both poll and busy-poll. The only difference between
> > poll and busy-poll will be the running context: new kthread vs user task.
> > If busy-poll drained the queue then new kthread napi has no work to do.
> > No irq approach could be marginally faster, but more error prone.
> > With new kthread the user space will still work in all configuration.
> > Even when single user task is processing many af_xdp sockets.
> >
> > I'm proposing new kthread only partially for performance reasons, but
> > mainly to avoid sharing stack rx and af_xdp queues within the same softirq.
> > Currently we share softirqd for stack napis for all NICs in the system,
> > but af_xdp depends on isolated processing.
> > Ideally we have rss into N queues for stack and rss into M af_xdp sockets.
> > > The same host will be receiving traffic on both.
> > Even if we rss stack queues to one set of cpus and af_xdp on another cpus
> > softirqds are doing work on all cpus.
> > A burst of 64 packets on stack queues or some other work in softirqd
> > will spike the latency for af_xdp queues if softirq is shared.
> > Hence the proposal for new napi_kthreads:
> > - user creates af_xdp socket and binds to _CPU_ X then
> > - driver allocates single af_xdp rq queue (queue ID doesn't need to be exposed)
> > - spawns kthread pinned to cpu X
> > - configures irq for that af_xdp queue to fire on cpu X
> > - user space with the help of libbpf pins its processing thread to that cpu X
> > > - repeat above for as many af_xdp sockets as there are cpus
> > > (it's also ok to pick the same cpu X for different af_xdp sockets,
> > > then the new kthread is shared)
> > > - user space configures hw to RSS to this set of af_xdp sockets.
> > since ethtool api is a mess I propose to use af_xdp api to do this rss config
> >
> > imo that would be the simplest and performant way of using af_xdp.
> > All configuration apis are under libbpf (or libxdp if we choose to fork it)
> > End result is one af_xdp rx queue - one napi - one kthread - one user thread.
> > All pinned to the same cpu with irq on that cpu.
> > Both poll and busy-poll approaches will not bounce data between cpus.
> > No 'shadow' queues to speak of and should solve the issues that
> > folks were bringing up in different threads.
> > How crazy does it sound?
>
> Actually, it sounds remarkably sane :-). It will create something
> quite similar to what I have been wanting, but you take it at least
> two steps further. Did not think about introducing a separate kthread
> as a potential solution, and the user space configuration of RSS (and
> maybe other flow steering mechanisms) from AF_XDP Björn and I have
> only been loosely talking about. Anyway, I am producing performance
> numbers for the options that we have talked about. I will get back to
> you with them as soon as I have them and we can continue the
> discussions based on those.
>
> Thanks: Magnus

After a number of surprises and issues in the driver, here is the
first set of results: 64-byte packets at 40 Gbit/s line rate. All
results in Mpps. Note that I just used my local system and kernel build
for these numbers so they are not performance tuned. Jesper would
likely get better results on his setup :-). Explanation follows after
the table.

                                   Applications
method    cores  irqs    txpush    rxdrop    l2fwd
---------------------------------------------------------------
r-t-c     2      y       35.9      11.2      8.6
poll      2      y       34.2      9.4       8.3
r-t-c     1      y       18.1      N/A       6.2
poll      1      y       14.6      8.4       5.9
busypoll  2      y       31.9      10.5      7.9
busypoll  1      y       21.5      8.7       6.2
busypoll  1      n       22.0      10.3      7.3

r-t-c    = Run-to-completion, the mode where Rx uses no syscalls and
           only spins on the pointers in the ring.
poll     = Use the regular syscall poll().
busypoll = Use the regular syscall poll() in busy-poll mode. The RFC I
           sent out.

cores == 2 means that softirq/ksoftirqd is on a different core from
           the application. 2 cores are consumed in total.
cores == 1 means that both softirq/ksoftirqd and the application run
           on the same core. Only 1 core is used in total.

irqs == 'y' is the normal case. irqs == 'n' means that I have created a
           new napi context with the AF_XDP queues inside that does not
           have any interrupts associated with it. No other traffic goes
           to this napi context.

N/A = This combination does not make sense since the application will
      not yield due to run-to-completion without any syscalls
      whatsoever. It works, but it crawls in the 30 Kpps
      range. Creating huge rings would help, but I did not do that.

The applications are the ones from the xdpsock sample application in
samples/bpf/.

Some things I had to do to get these results:

* The current buffer allocation scheme in i40e, where we continuously
  try to access the fill queue until we find some entries, is not
  effective if we are on a single core. Instead, we try once and call
  a function that sets a flag. This flag is then checked in the xsk
  poll code, and if it is set we schedule napi so that it can try to
  allocate some buffers from the fill ring again. Note that this flag
  has to propagate all the way to user space so that the application
  knows that it has to call poll(). I currently set a flag in the Rx
  ring to indicate that the application should call poll() to resume
  the driver. This is similar to what io_uring in the storage
  subsystem does. It is not enough to return POLLERR from poll() as
  that will only work for the case when we are using poll(). But I do
  that as well.

* Implemented Sridhar's suggestion on adding busy_loop_end callbacks
  that terminate the busy poll loop if the Rx queue is empty or the Tx
  queue is full.

* There is a race in the setup code in i40e when it is used with
  busy-poll. The fact that busy-poll calls the napi_busy_loop code
  before interrupts have been registered and enabled seems to trigger
  some bug where nothing gets transmitted. This only happens for
  busy-poll. Poll and run-to-completion only enter the napi loop of
  i40e by interrupts, and only after interrupts have been enabled,
  which is the last thing that is done after setup. I have just worked
  around it by introducing a sleep(1) in the application for these
  experiments. Ugly, but should not impact the numbers, I believe.

* The 1 core case is sensitive to the amount of work done reported
  from the driver. This was not correct in the XDP code of i40e and
  led to bad performance. Now it reports the correct values for
  Rx. Note that i40e does not honor the napi budget on Tx and sets
  that to 256, and this Tx work is not reported back to the napi
  library.
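
From the application side, the flag handshake described in the first
item above might look roughly like the sketch below. It is an
illustration only: struct rx_ring_view, RING_F_NEED_POLL and
kick_driver_if_needed() are hypothetical names and not the ring layout
or API used in the patches.

```c
#include <assert.h>
#include <poll.h>
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical flag bit: the driver found no fill-ring entries and
 * stopped; the application must call poll() to get napi scheduled again. */
#define RING_F_NEED_POLL 0x1u

/* Hypothetical user-space view of the flags word in the shared Rx ring. */
struct rx_ring_view {
	_Atomic uint32_t flags;
};

/*
 * Call after refilling the fill ring. Returns 1 if poll() was issued,
 * 0 if the driver did not ask for it, -1 if poll() failed.
 */
int kick_driver_if_needed(struct rx_ring_view *ring, int xsk_fd)
{
	struct pollfd pfd = { .fd = xsk_fd, .events = POLLIN };
	uint32_t flags;

	assert(ring);
	flags = atomic_load_explicit(&ring->flags, memory_order_acquire);
	if (!(flags & RING_F_NEED_POLL))
		return 0;	/* driver is still running, no syscall needed */

	/* Zero timeout: we only want to re-enter the kernel, not sleep. */
	return poll(&pfd, 1, 0) < 0 ? -1 : 1;
}
```

The point of checking the flag in user space first is that a
run-to-completion application pays for a syscall only when the driver
has actually stalled.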

Some observations:

* Cannot really explain the drop in performance for txpush when going
  from 2 cores to 1. As stated before, the reporting of Tx work is not
  really propagated to the napi infrastructure. Tried reporting this
  in a correct manner (completely ignoring Rx for this experiment) but
  the results were the same. Will dig deeper into this to screen out
  any stupid mistakes.

* With the fixes above, all my driver processing is in softirq for 1
  core. It never goes over to ksoftirqd. Previously when work was
  reported incorrectly, this was the case. I would have liked
  ksoftirqd to take over as that would have been more like a separate
  thread. How to accomplish this? There might still be some reporting
  problem in the driver that hinders this, but actually think it is
  more correct now.

* Looking at the current results for a single core, busy poll provides
  a 40% boost for Tx but only 5% for Rx. But if I instead create a
  napi context without any interrupt associated with it and drive that
  from busy-poll, I get a 15% - 20% performance improvement for Rx. Tx
  increases only marginally from the 40% improvement as there are few
  interrupts on Tx due to the completion interrupt bit being set quite
  infrequently. One question I have is: what am I breaking by creating
  a napi context not used by anyone else, only AF_XDP, that does not
  have an interrupt associated with it?
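
For anyone who wants to redo the comparisons from the table, two small
helpers are enough. This is a sketch; mpps_per_core() and gain_pct()
are names made up for this note.

```c
#include <assert.h>

/* Throughput per consumed core, in Mpps. */
double mpps_per_core(double mpps, int cores)
{
	assert(cores > 0);
	return mpps / cores;
}

/* Relative change of 'value' over 'base', in percent. */
double gain_pct(double base, double value)
{
	assert(base > 0.0);
	return (value - base) / base * 100.0;
}
```

With the table values, gain_pct(8.7, 10.3) is about 18.4 for rxdrop and
gain_pct(6.2, 7.3) is about 17.7 for l2fwd when going from busypoll on
1 core with irqs to busypoll on 1 core without irqs.
mpps_per_core(11.2, 2) gives 5.6 Mpps per core for r-t-c rxdrop on 2
cores, against 10.3 Mpps on the single core used by interrupt-less
busy-poll.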

Todo:

* Explain the drop in Tx push when going from 2 cores to 1.

* Really run a separate thread for kernel processing instead of softirq.

* What other experiments would you like to see?

/Magnus

On 5/16/2019 5:37 AM, Magnus Karlsson wrote:
> [...]

Thanks for sharing the results.

For the busypoll tests, I guess you may have increased the busypoll
budget to 64.
What is the busypoll timeout you are using?

Can you try a test that skips calling the bpf program for queues that
are associated with an af-xdp socket? I remember seeing a significant
bump in rxdrop performance with this change.

The other overhead I saw was with the dma_sync_single calls in the driver.

Thanks
Sridhar

On Fri, May 17, 2019 at 1:50 AM Samudrala, Sridhar
<sridhar.samudrala@intel.com> wrote:
>
> On 5/16/2019 5:37 AM, Magnus Karlsson wrote:
> > [...]
>
> Thanks for sharing the results.
> For busypoll tests, I guess you may have increased the busypoll budget
> to 64.
Yes, I am using a batch size of 64 for all experiments as the
NAPI_POLL_WEIGHT is also 64. Note that the i40e driver batches 256
packets on Tx as it does not care what the budget parameter is in the
NAPI function. Rx is according to budget though.
> What is the busypoll timeout you are using?
0, as I am slamming the system as hard as I can with packets. The CPU
is always at close to 100% due to this and there is always something
to do. With a busy-poll timeout value of 100, I see a performance
degradation of between 2% (slowest Rx) and 7% (fastest Tx). But any other
value than 0 for the busy-poll timeout does not make much sense when I
am running the driver and the application on the same core. I am
better off trying to get into softirq/ksoftirqd quicker to get some
new packets and/or send my Tx ones.
For regular poll(), I use a timeout value of 1000 in the poll() syscall,
as it needs to yield to the driver processing. With 0 there are, to my
surprise, some performance improvements of a couple of percent.
Looking at the code, the code path for a 0 timeout is shorter which
might explain this.
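
The two settings differ only in the third argument to poll(), which is
in milliseconds. A minimal sketch of the call, with rx_ready() and
demo_on_pipe() as made-up names and a pipe standing in for the AF_XDP
socket so that it runs without a NIC:

```c
#include <assert.h>
#include <poll.h>
#include <unistd.h>

/*
 * Ask the kernel whether fd has something to receive.
 * timeout_ms == 0 returns immediately; timeout_ms == 1000 may sleep
 * for up to one second. Returns 1 if readable, 0 if not, -1 on error.
 */
int rx_ready(int fd, int timeout_ms)
{
	struct pollfd pfd = { .fd = fd, .events = POLLIN };
	int ret;

	assert(timeout_ms >= 0);	/* never block forever in this loop */
	ret = poll(&pfd, 1, timeout_ms);
	if (ret < 0)
		return -1;
	return (ret > 0 && (pfd.revents & POLLIN)) ? 1 : 0;
}

/* Stand-in demo: the read end of a pipe plays the role of the socket. */
int demo_on_pipe(void)
{
	int p[2], before, after;

	if (pipe(p) < 0)
		return -1;
	before = rx_ready(p[0], 0);		/* nothing queued yet */
	if (write(p[1], "x", 1) != 1)
		after = -1;
	else
		after = rx_ready(p[0], 1000);	/* data queued, no sleep needed */
	close(p[0]);
	close(p[1]);
	return (before == 0 && after == 1) ? 0 : -1;
}
```

With a real AF_XDP socket the packets would then be taken from the Rx
ring; nothing is read from the file descriptor itself.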
> Can you try a test that skips calling bpf program for queues that are
> associated with af-xdp socket? I remember seeing a significant bump in
> rxdrop performance with this change.
Björn is working on this. This should improve performance much more
than busy-poll in my mind.
> The other overhead I saw was with the dma_sync_single calls in the driver.
I will do a "perf top" and check out the bottlenecks in more detail.
Thanks: Magnus
> Thanks
> Sridhar