Keep up to date with Oracle's mainline Linux kernel team

Tuesday Dec 02, 2014

The following is a write-up by Oracle mainline Linux kernel
engineer, Sowmini Varadhan, detailing her recent work on improving the
performance of the Sunvnet driver on Linux for SPARC.

Background

In the typical device-driver, the Producer (I/O device) notifies
the Consumer (device-driver) that data is available for consumption
by triggering a hardware interrupt at a fixed Interrupt Priority
Level (IPL). In the purely interrupt-driven model,
the Consumer then masks off any additional Rx interrupts from the device,
and drains the read-buffers in hardware-interrupt context. A network
device-driver would then enqueue packets for the TCP/IP stack where
they would typically be processed in software interrupt (softirq)
context.

Dispatching an interrupt is an expensive operation, thus network
device drivers should attempt to batch interrupts, i.e., process as many
packets as possible within the context of one interrupt. Also, hardware
interrupts preempt all tasks running at a lower IPL. Thus the amount
of time spent in hardware interrupt context should be kept to a
minimum. As pointed out in Mogul [1], "If the event rate is high enough
to cause the system to spend all of its time responding to interrupts,
then nothing else will happen, and the system throughput will drop to
zero". This condition is called receive-livelock, and all purely
interrupt-driven systems are susceptible to it.

We will now talk about the various improvements made to the sunvnet
driver on Linux to convert it from being a purely interrupt-driven
network device driver to one that implements all of the above
prescriptions using Linux's most current device-driver infrastructure.

What is Sunvnet?

In a virtualized environment such as LDoms, the guest Operating
Systems (DomU) communicate with each other using a virtual
link-layer abstraction called Logical Domain Channel (LDC) on SPARC. The LDC provides point-to-point communication channels between the guests,
or between the domU and an external entity such as a service processor
or the Hypervisor itself. The LDC provides an encapsulation
protocol for other upper-layer protocols such as TCP/IP and Ethernet.

Sunvnet is the device driver that implements this virtual link-layer
on Linux.

Batching Interrupts

In its simplest mode of operation, when the LDC Producer wishes to send
an IP packet to the consumer, it needs to do two things:

Copy the data packet to a descriptor buffer. In the "TxDring" mode,
this buffer is a shared-memory region that is "owned" by the Producer.

After the packet has been successfully copied, the Producer needs to
signal to the Consumer that data is available. This is achieved by
sending a "start" message over the LDC. A "start" message is a
64-byte message sent over the LDC in the format specified by the
VIO protocol. The start message has a subtype of VIO_SUBTYPE_DATA,
and specifies the index of the descriptor buffer at which data is
available.

The transmission of the LDC "start" message is processed at the
Hypervisor, and will result in a hard interrupt at the Consumer,
which will invoke the ldc_rx() interrupt handler.
The Consumer would then process the interrupt in hardirq context,
and when it is done, if the Producer had requested a "stopped"
ack for the packet, the Consumer will send back a "stopped" message over
LDC. Just like the "start" message, the "stopped" message is
specified by the VIO protocol. It has a subtype of VIO_SUBTYPE_ACK (0x2)
and allows the Consumer to specify the index at which data was last read.

Note that the VIO protocol does not mandate a "stopped" LDC message
for every descriptor read/write: the Consumer is required
to send back an LDC "stopped" message if, and only if:

the Producer has requested it for the descriptor; or,

the Consumer has read a full burst of ready data in descriptors,
and there are no more ready descriptors.
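
The two rules above can be sketched as a small user-space model. This is only an illustration under stated assumptions: `struct desc_model` and `consume_burst()` are hypothetical names, the ring does not wrap, and the real driver's descriptor layout and VIO message handling are not shown.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of one shared-memory descriptor. */
struct desc_model {
    bool ready;          /* Producer has filled this descriptor */
    bool ack_requested;  /* Producer asked for a "stopped" ACK here */
};

/*
 * Consume ready descriptors starting at index `start` and return the
 * number of "stopped" messages the Consumer would send under the two
 * rules above. `*next` receives the index following the last
 * descriptor read. Ring wrap-around is deliberately left out.
 */
static int consume_burst(struct desc_model *ring, size_t n, size_t start,
                         size_t *next)
{
    int stopped_msgs = 0;
    size_t i = start;
    bool last_acked = false;
    bool consumed_any = false;

    assert(ring != NULL && next != NULL && start <= n);

    while (i < n && ring[i].ready) {
        ring[i].ready = false;               /* descriptor consumed */
        consumed_any = true;
        last_acked = ring[i].ack_requested;  /* rule 1: Producer asked */
        if (last_acked)
            stopped_msgs++;
        i++;
    }

    /* Rule 2: the burst is drained and nothing more is ready. */
    if (consumed_any && !last_acked)
        stopped_msgs++;

    *next = i;
    return stopped_msgs;
}
```

In this model a burst of three ready descriptors with no explicit ACK requests costs a single "stopped" message rather than three, which is the saving that batching is after.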

LDC messaging is expensive: each message requires a slot in the
LDC ring, in addition to triggering a hardware interrupt at the receiver.
Thus the first step to improving sunvnet performance was to reduce
the number of LDC messages sent and received, and to batch packets
as much as possible.

These changes, along with some other bug fixes, brought sunvnet to a
more stable performance level: we observed fewer dev_watchdog hangs
(previously seen due to flow-control assertions caused by a full LDC
channel) and fewer soft-lockups. It also gave a 25% bump to
performance. In iperf tests on a T5-2 using 16 VCPUs and 16 iperf threads,
we were now able to handle approximately 100k pps, whereas
we were only able to handle a maximum of 80k pps prior to the fixes. (See diagram below).

But all packets were still being received in hard-interrupt context.
And as Mogul [1] established in the 1990s, that is toxic to
performance.

NAPI

Linux implements the concepts described in Mogul [1] through
a common device-driver infrastructure called NAPI.
The NAPI framework allows a driver to defer reception of packet-bursts
from hardware-interrupt context to a polling-mechanism that
is invoked in softirq context. In addition to the benefits of
interrupt mitigation and avoidance of receive live-lock, this also
has other ramifications:

Since packet transmission via NET_TX_ACTION is already done in softirq
context, moving Rx processing to softirq context now allows Tx reclaim,
and recovery from link-congestion, to be done more efficiently.

The locking model is simplified, eliminating a number of
spin_[un]lock_irqsave[restore] invocations, and improving system
performance in general.

Moving the Rx processing to softirq context allows the driver to
use the vastly more efficient netif_receive_skb() to pass the packet
up to the network-stack, instead of being constrained to defer to
netif_rx(), which only queues the packet on a per-CPU backlog for
later processing.

We also get the benefits of ksoftirqd to schedule
softirq under scheduler control. Otherwise, everything would get processed
on the CPU that receives the hardware interrupt, and you would
have to configure RPS to distribute those hardirqs (can be
done, but requires extra administration).

We'll now walk through the changes made to NAPIfy sunvnet, to examine
each of these items.

The details...

The sunvnet driver has a `struct vnet_port' data-structure for
each connected peer. At the minimum, there is one such structure for
the vswitch peer in Dom0. In addition, if the Dom0 ldm property
`inter-vnet-link' has been set to `on' (the default), DomU's on
the same physical host will have a virtual point-to-point channel
over LDC. Each such channel is represented by a unique
`struct vnet_port' and has its own LDC ring and Rx descriptor buffers.

As part of the device probe callback, sunvnet allocates one `struct
napi_struct' instance for each `struct vnet_port'.

struct vnet_port {
        /* ... */
        struct napi_struct napi;   /* one NAPI instance per port */
        /* ... */
};

The next NAPI requirement is to modify the driver's Rx interrupt
handler. When a new packet becomes available, the driver must
disable any additional Rx interrupts (LDC Rx interrupts in this
case), and arrange for polling by invoking napi_schedule. This is
achieved as follows:

Both sunvnet and the VDC (virtual disk driver) infrastructure share
a common set of routines for processing the VIO messages and LDC
interrupts. Thus the Rx interrupt handler (`ldc_rx()') is common
to both modules; it hands off packets destined for sunvnet by
invoking the `vnet_event()' callback that is registered by sunvnet.
In `vnet_event()', we defer packet processing to the NAPI poll callback
by recording the events (which may include both LDC control events
such as UP/DOWN notifications, as well as notification about incoming
data), disabling hardware interrupts, and scheduling a NAPI callback
for the poll handler.

We now need to set up the poll handler itself. We do this in
`vnet_poll()' which has the signature:

static int vnet_poll(struct napi_struct *napi, int budget);

Thus vnet_poll will be called with a pointer to the NAPI instance,
so that the `struct vnet_port' can be obtained as

struct vnet_port *port = container_of(napi, struct vnet_port, napi);
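
For readers unfamiliar with the idiom, here is a minimal user-space sketch of how container_of() recovers the enclosing structure from a pointer to an embedded member. The structures are trimmed-down stand-ins (the `weight' and `port_id' fields and the `port_from_napi()' helper are assumptions made for the demo), and the macro is a simplified version of the kernel's.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in for the kernel's struct napi_struct (field is hypothetical). */
struct napi_struct {
    int weight;
};

/* Trimmed-down stand-in for the driver's per-peer structure. */
struct vnet_port {
    int port_id;                /* hypothetical field, for the demo only */
    struct napi_struct napi;    /* embedded, as described in the text */
};

/* Simplified user-space equivalent of the kernel's container_of(). */
#define container_of(ptr, type, member) \
    ((type *)((char *)(ptr) - offsetof(type, member)))

/* Recover the enclosing port from the NAPI pointer a poll callback gets. */
static struct vnet_port *port_from_napi(struct napi_struct *napi)
{
    assert(napi != NULL);
    return container_of(napi, struct vnet_port, napi);
}
```

Because the NAPI instance is embedded in the port rather than pointed to, no lookup table is needed: the port address is simply the member address minus the member's offset.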

The `budget' parameter is an upper-bound on the number of
packets that can be processed in any single ->poll invocation. The
intention of the `budget' parameter is to ensure fair-scheduling
across drivers, and avoid starvation when a single driver gets
flooded with a packet burst. The ->poll() callback, i.e., vnet_poll(),
must return the number of packets processed. A return value that
is less than the budget can be taken to indicate that we
are at the end of a packet burst, i.e., hard-interrupts can be
re-enabled. We do this in `vnet_poll()', which calls
`vnet_event_napi()' to examine and process the `rx_event' bits
available on the `vnet_port'. If data is available on the port,
`vnet_event_napi()' will read the LDC channel for information about
the starting descriptor index, and process a batch of descriptors
in softirq mode, passing up the received packets to the network stack
using `napi_gro_receive()'. The batch processing of descriptors
is constrained to at most `budget' descriptors per vnet_event_napi()
invocation.

The final step is to inform the NAPI infrastructure that
`vnet_poll()' is the poll callback. We do this in the vnet_port_probe()
routine.

Some caveats specific to sunvnet/LDC

The `budget' parameter passed by the NAPI infra to `vnet_poll()' places
an upper-bound on the number of packets that may be processed in a
single ->poll callback. While this ensures fair-scheduling across
drivers, we should be careful not to unnecessarily send LDC stop/start
messages at each `budget' boundary when the packet burst size is
larger than the `budget'.

This entails tracking additional state in the `vnet_port' to remember
(a) when packet processing is truncated prematurely due to `budget'
constraints,
(b) the last index processed, when (a) occurs.

Both of these items are tracked in the `vnet_port' as

bool napi_resume;
u32 napi_stop_idx;

Benefits of NAPI

The most obvious benefit of NAPIfication is interrupt mitigation. The
ability to process packets in softirq context and pass up packets
using napi_gro_receive() by itself results in a significant increase
in packet processing rate. On a T5-2 with 16 VCPUs, iperf tests using 16 threads result in 230k pps (compared to the newer baseline of 100k pps!). This is a further 130% increase in performance.

In addition, conforming to the NAPI infrastructure automatically
provides access to all the newest features and enhancements
in the Linux driver infra, such as enhanced RPS.

But there are other benefits as well. With both Tx and Rx packets
now being processed in softirq context, the irq save/restore
locking done in sunvnet at the port level is eliminated, resulting
in lock-less processing. The netif_tx_lock() can instead be used
to synchronize access in the critical sections such as Tx reclaim
which can now be inlined from the ->poll routine without any
pre-emption concerns with dev_watchdog().

Multiple Tx queues

We've mostly talked about Rx side handling here, but on Tx side,
when inter-vnet-link is on, we have a virtual point-to-point link
between guests on the same physical host. As mentioned earlier, each
such point-to-point link is represented as its own data-structure (`struct
vnet_port') and has its own LDC ring and Rx descriptor buffers. Thus a
flow-controlled path due to bursty traffic between peers A
and B should not impact traffic between peers A and C. The Linux
driver infrastructure makes this possible through the support
for multiple Tx queues.

Briefly, these were the steps to set up multiple Tx queues:

Queue allocation: invoke alloc_etherdev_mqs(), to set up VNET_MAX_TXQS
queues when creating the `struct net_device'.

As each port is added, assign a queue index to the port in a
round-robin fashion. The assigned index is tracked in the `vnet_port'
structure.

Supply a ->ndo_select_queue callback that returns the selected
queue to dev_queue_xmit() when it calls netdev_pick_tx(). In the
case of sunvnet, the vnet_select_queue() should simply return the
index assigned to the vnet_port that would be selected for the outgoing
packet.

After the integration of multiple Tx queues, we can do even
better at recovering from flow-control.

Flow control is asserted on the Tx side when
we either exhaust the descriptor rings for data or run out
of resources to send LDC messages. After the batched LDC processing
optimizations, it is uncommon to run out of resources for
LDC messages. Thus flow-control is typically asserted when the
Producer generates data much faster than the Consumer can drain it,
at which point netif_tx_stop_queue() is invoked, blocking the Tx queue
for a specific peer.

The flow-control can thus be released when we get back an LDC
stopped ACK from the blocked peer (neatly identified by the LDC
message, and by the specific vnet_port and Tx queue!).

Conclusions and Future Work

In addition to NAPI, Linux offers other alternatives to drivers
for deferring work away from hard-interrupt context, such as
bottom-half (BH) handlers and tasklets.

A BH Rx handler will eliminate the problems of the interrupt context,
and packets can now be received in process context, which speeds up
things somewhat. But it still cannot call netif_receive_skb(), since
that can deadlock on socket locks with the softirq-based tasklets that do
TCP timers, packet rexmit, etc. So the BH handler is constrained to use
netif_rx_ni(), which is still less efficient than the straight-through
call to pass up the packet via netif_receive_skb().

Both NAPI and tasklet based implementations offer softirq context,
which allows the driver to safely invoke netif_receive_skb() to deliver
the packet to the IP stack. NAPI, which seamlessly
allows softirq context for both Tx and Rx processing, and already has
the infrastructure to handle bursts of packets with fair-scheduling,
proved to be the best option for sunvnet.

In the near future, we will be adding support for Jumbo Frames and
TCP Segmentation Offload, to further leverage hardware support
by offloading features where possible. Another feature that offers
potential for improving performance is the "RxDring" model, where
the Consumer owns the shared-memory buffer for receiving data that
the Producer then populates. In the RxDring model, the buffer can
then be part of the sk_buff itself, thereby saving one memcpy
for the Consumer.

Monday Aug 11, 2014

The following is a write-up by Oracle mainline Linux kernel engineer, Khalid Aziz, detailing his and others' work on improving the performance of Transparent Huge Pages in the Linux kernel.

Introduction

The Linux kernel uses small page size
(4K on x86) to allow for efficient sharing of physical memory among
processes. Even though this can maximize utilization of physical
memory, it results in large numbers of pages associated with each
process and each page requires an entry in the Translation Look-aside
Buffer (TLB) to be able to associate a virtual address with the
physical memory page it represents. The TLB is a finite resource, and
the large number of entries required for each process forces the kernel
to constantly evict entries from the TLB. There is a performance impact any
time the TLB entry for a virtual address is missing. This impact is
especially large for data intensive applications like large
databases.

To alleviate this, the Linux kernel added support for
Huge Pages, which can support significantly larger page sizes for
specific uses. This larger page size is variable and depends upon
architecture (a few megabytes to gigabytes). Huge Pages can be used for
shared memory or for memory mapping. Huge Pages reduce the number of
TLB entries required for a process's data by a factor of hundreds and thus
reduce the number of TLB misses for the process significantly.

Huge Pages are statically allocated and need to be used through a
hugetlbfs API, which requires changing applications at source level to
take advantage of this feature. The Linux kernel added a Transparent
Huge Pages (THP) feature that coalesces multiple contiguous pages in
use by a process to create a Huge Page transparently without the
process needing to even know about it. This makes the benefits of
Huge Pages available to every application without having to rewrite
it.

Unfortunately, adding THP had side effects on performance. In this
article we will explore these performance impacts in more
detail, and how those issues have been addressed.

The Problem

When Huge Pages were introduced in
the kernel, they were meant to be statically allocated in physical
memory and never swapped out. This made for simple accounting through
use of refcounts for these hugepages. Transparent hugepages on the
other hand need to be swappable so a process could take advantage of
performance improvements through hugepages and yet not tie up the
physical memory for these transparent hugepages. Since the swap subsystem
only deals with base page size, it cannot swap out larger hugepages.
The kernel breaks the hugepages up into base page sizes before swapping
transparent huge pages out.

A page is identified as a hugepage via page flags, and each hugepage
is composed of one head page and a
number of tail pages. Each tail page has a pointer, first_page,
that points back to the head page. The kernel can break the transparent
hugepages up any time there is memory pressure and pages need to be
swapped out. This creates a race between the code that breaks
hugepages up and the code managing free and busy hugepages. When
marking a hugepage busy or free, the code needs to ensure the hugepage
is not broken up underneath it. This requires taking a reference to the
page multiple times, locking the page to ensure it is not broken up,
and executing memory barriers a few times to ensure any updates to
the page flags get flushed out to memory so we retain consistency.

Before THP was introduced into the
kernel in 2.6.38, the code to release a page was fairly
straightforward. A call to put_page() was
made, and the first thing put_page() did
was check whether it was dealing with a hugepage (also known
as a compound page) or a base page:

If the page being released is a
hugepage, put_compound_page()
verifies that the reference count is 0 and then calls the free routine
for the compound page, which walks the head page and tail pages and
frees them all:

This is fairly straightforward code
and has virtually no impact on the performance of the page release code.
After THP was introduced, additional checks, locks, page references
and memory barriers were added to ensure correctness. The new
put_compound_page() in 2.6.38 looks like:

The level of complexity of code
went up significantly. This complexity guaranteed correctness but
sacrificed performance.

Large database applications read
large chunks of the database into memory using AIO. When databases
started using hugepages for these reads into memory, performance went
up significantly, thanks to a much lower number of TLB
misses and a significantly smaller amount of memory being used up by
page tables, resulting in lower swapping activity. When a database
application reads data from disk into memory using AIO, pages from the
hugepages pool are allocated for the read, and the block I/O subsystem
grabs a reference to these pages for the read and later releases the
reference when the read is done. This causes traversal of the code
referenced above, starting with the call to put_page().
With the newly introduced THP code, the additional overhead added up
to a significant performance penalty.

Over the next several kernel
releases, the THP code was refined and optimized, which helped slightly in
some cases while performance got worse in other cases. Subsequent
refinements to the THP code to do accurate accounting of tail pages
introduced the routine __get_page_tail(), which is called by get_page()
to grab tail pages for the hugepage. This added further performance
impact to AIO into hugetlbfs pages. All of this code stays in the
code path as long as the kernel is compiled with
CONFIG_TRANSPARENT_HUGEPAGE. Running
echo never > /sys/kernel/mm/transparent_hugepage/enabled
does not bypass the additional code added to support THP. Results from
a database performance benchmark run using two common read sizes used
by databases show this performance degradation clearly:

              2.6.32 (pre-THP)    2.6.39 (with THP)    3.11-rc5 (with THP)
1M read       8384 MB/s           5629 MB/s            6501 MB/s
64K read      7867 MB/s           4576 MB/s            4251 MB/s

Comparing 3.11-rc5 against 2.6.32, this amounts to about a 22%
degradation for 1M reads and about a 46% degradation for 64K reads!
perf top during benchmark runs showed the CPU spending more than 40% of
its cycles in __get_page_tail()
and put_compound_page().
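
The degradation figures follow directly from the table, taking 2.6.32 as the baseline and 3.11-rc5 as the comparison. A small helper (the name `degradation_pct()' is made up for this illustration) shows the arithmetic:

```c
#include <assert.h>

/* Percentage drop in throughput going from `before` to `after` (MB/s). */
static double degradation_pct(double before, double after)
{
    assert(before > 0.0);
    return (before - after) * 100.0 / before;
}
```

For example, degradation_pct(8384, 6501) covers the 1M read case and degradation_pct(7867, 4251) the 64K read case.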

The Solution

An immediate solution to the performance degradation comes from the fact
that hugetlbfs pages can never be split and hence all the overhead
added for THP can be bypassed. I added code to
__get_page_tail() and
put_compound_page() to check
for a hugetlbfs page up front and bypass all the additional checks for
those pages:

This resulted in an immediate performance gain. Running the same
benchmark as before with THP enabled, the new performance numbers for
AIO reads are below:

              2.6.32              3.11-rc5             3.11-rc5 + patch
1M read       8384 MB/s           6501 MB/s            8371 MB/s
64K read      7867 MB/s           4251 MB/s            6510 MB/s

This patch was sent to the linux-mm and linux-kernel mailing lists in
August 2013 [link]
and was subsequently integrated into kernel version 3.12. This is a
significant performance boost for database applications.

Further review of the original patch by Andrea Arcangeli during integration
of this patch into stable kernels exposed issues with refcounting of
pages and revealed this patch had introduced a subtle bug where a
page pointer could become a dangling link under certain
circumstances. Andrea Arcangeli and the author worked to address these
issues and revised the code in __get_page_tail() and put_compound_page() to
eliminate extraneous locks and memory barriers, fix incorrect
refcounting of tail pages, and eliminate some of the inefficiencies in
the code.

Andrea sent out an initial series of patches to address all
of these issues[link].

Further discussions and refinements led to the final version of these
patches which were integrated into kernel version 3.13 [link],[link].

AIO performance has gone up
significantly with these patches, but for smaller block sizes it is
still not at the level it was before THP was
introduced to the kernel. The THP and hugetlbfs code in the kernel is
now better at guaranteeing correctness, but that still comes at the cost of
performance, so there is room for improvement.