Dave Miller, Red Hat. "I tell you what to do, in return I merge
all your crap." State of the tree, what Eric Dumazet is up to.

Soyoung Park, pointing at DaveM. "I tell him what to do."

Herbert Xu, Red Hat. Tx, multicast with bridging.

Andy Grover, Oracle. RDS (Reliable Datagram Sockets) over
Infiniband.

Saturday, September 19, 2009.

Arnaldo Melo: Batch Datagram Receiving

Summary:

Reduce per-packet overhead on receive by batching packets,
so that protocol-stack overhead is amortized over all
packets making up a given batch. This is accomplished
via a change to the syscall layer allowing a vector of message
headers (each with its own iovec) to be passed to the
message-receive syscall, which returns either the number of
messages received or an error. Lower UDP layers were changed
to reduce locking overhead.

Although financial institutions are said to be intensely
interested in this optimization, they are unwilling to
share performance results. Fortunately, Nir Tzachar tested
on 1Gb/s hardware and noted a latency reduction for 100-byte
packets from 750us to 470us. For larger packets, Nir noted
that throughput doubled. Future work might include batching
on the send side, but this requires lots of consecutive
sends to the same destination, which appears to be less common.

API returns the number of datagrams received or an error. If a
partial fill is followed by an error, the call returns the count
of correctly received datagrams, and the error is reported on the
next call. A timeout covering all receives is available -- "give
me 20, or whatever shows up in the next millisecond."
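
A minimal userspace sketch of how such a batched receive looks,
assuming the recvmmsg() interface that this work became in later
mainline kernels (exact flags and timeout semantics may differ from
what was presented; the 1ms timeout and batch of 30 echo the
numbers above):

    /* Receive up to 30 datagrams in one syscall. */
    #define _GNU_SOURCE
    #include <stdio.h>
    #include <string.h>
    #include <sys/socket.h>
    #include <sys/uio.h>
    #include <time.h>

    #define VLEN  30
    #define BUFSZ 1500

    int receive_batch(int fd)
    {
        struct mmsghdr msgs[VLEN];
        struct iovec iovecs[VLEN];
        static char bufs[VLEN][BUFSZ];
        struct timespec timeout = { .tv_sec = 0, .tv_nsec = 1000000 };
        int i, n;

        memset(msgs, 0, sizeof(msgs));
        for (i = 0; i < VLEN; i++) {
            iovecs[i].iov_base = bufs[i];
            iovecs[i].iov_len  = BUFSZ;
            msgs[i].msg_hdr.msg_iov    = &iovecs[i];
            msgs[i].msg_hdr.msg_iovlen = 1;
        }

        /* "Give me up to 30, or whatever shows up in the next ms." */
        n = recvmmsg(fd, msgs, VLEN, 0, &timeout);
        if (n < 0)
            return -1;  /* an error after a partial batch shows up
                           here, on the following call */

        for (i = 0; i < n; i++)
            printf("datagram %d: %u bytes\n", i, msgs[i].msg_len);
        return n;
    }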

Lower layer:
UDP locking changed, release at end of full batch, not
on each packet free.

Nir Tzachar testing. 1Gb/s. 100-byte packets in batches
of 30 reduces latency from 750us to 470us. For larger
packets, get double the throughput.

Networking bandwidths have increased by three orders of magnitude
over the past 25 years, but the MTU remains firmly fixed at
the ancient Ethernet value of 1500 bytes. This means that the
number of packets to be handled per unit time -- and thus the
aggregate per-packet overhead -- has also increased by three
orders of magnitude during that time.

The obvious solution would be to increase the MTU, but the
tendency to drop ICMP packets defeats path-MTU discovery, so that
connections spanning the Internet are still required to use small
MTU values. In addition, many important applications use small
packets, and thus cannot benefit from an increased MTU. Finally,
jumbograms can increase queueing latencies within the Internet,
which degrades response times.

So we need to live with small packets, and one way to do so
is to decrease interrupt overhead. NAPI (the not-so-new API)
has done this for some time, but only for the receive side.
At 10Gb/s speeds, we must also deal with the transmit side. Herbert
has implemented a work-withholding approach in which completion
interrupts are requested only 3-4 times per transmit-ring
traversal, as opposed to on each and every packet. But this
approach does not help virtual NICs, since there is not sufficient
per-packet transmission delay for work to accumulate. Herbert
noted that very few transmitters want or need timely completion
notification, so he is proposing a boolean flag in the skb
structure indicating whether this particular packet requires a
completion interrupt.

TSO is one way to reduce per-packet cost by essentially
increasing the MTU within the host; GSO generalizes this on the
send side, and GRO does the same on the receive side.

But still need to decrease interrupt overhead. Want to batch
up the interrupts. Been there for receive for quite some
time -- "NAPI". 10Gb Enet imposes same problem on transmit
side.

100-byte packets at 10Gb/s come to an interrupt every 100ns or so
at wire speed (a 100-byte packet plus framing is roughly 1000 bits,
and 10Gb/s moves 10 bits per nanosecond)...

On receive, just poll at end of interrupt processing, thus
eliminating the need for other interrupts. But need to
make sure to re-enable interrupts eventually...
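
As a reminder of how the receive side does this, here is a
schematic NAPI poll routine (a generic sketch, not from any
particular driver; the mydev_* helpers and struct are
placeholders):

    #include <linux/netdevice.h>
    #include <linux/interrupt.h>

    struct mydev {
        struct napi_struct napi;
        /* ... device state ... */
    };

    /* Placeholders for device-specific helpers: */
    static void mydev_disable_rx_irq(struct mydev *priv);
    static void mydev_enable_rx_irq(struct mydev *priv);
    static struct sk_buff *mydev_next_rx_skb(struct mydev *priv);

    /* Interrupt handler: mask further RX interrupts, switch to polling. */
    static irqreturn_t mydev_rx_irq(int irq, void *dev_id)
    {
        struct mydev *priv = dev_id;

        mydev_disable_rx_irq(priv);
        napi_schedule(&priv->napi);
        return IRQ_HANDLED;
    }

    /* Poll routine: process up to "budget" packets; only when the ring
     * is drained do we stop polling and re-enable the RX interrupt. */
    static int mydev_poll(struct napi_struct *napi, int budget)
    {
        struct mydev *priv = container_of(napi, struct mydev, napi);
        int work_done = 0;

        while (work_done < budget) {
            struct sk_buff *skb = mydev_next_rx_skb(priv);

            if (!skb)
                break;
            napi_gro_receive(napi, skb);  /* feeds GRO; see Herbert's talk */
            work_done++;
        }

        if (work_done < budget) {
            napi_complete(napi);
            mydev_enable_rx_irq(priv);
        }
        return work_done;
    }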

UDP needs to be careful, as there is no congestion control.
(Though current wire speeds are making this a non-problem.)

Need local flow control in the general case -- especially when
you have multiple transmitting sockets, and fair access is
required.

But cannot predict the future. So keep a ringbuffer, and
exponentially increase the batching. This works fine for
hardware, where packet-transmission delays allow work to
accumulate. But in conjunction with flow-control layers,
you don't get any time to accumulate work. Same thing happens
with virtualization (thank you, Rusty!).

KVM hack, "tx mitigation", postpones the work until some
time later. The initial postponement was 2ms, but
high-resolution timers have helped somewhat. Not as nice
as dedicated hardware, but much better than 2ms timers.

And hardware will likely continue to get faster, and so the
"virtualization problem" will likely hit real hardware
soon.

One trick for things like UDP is to dispense with completion
interrupts for most traffic -- routers don't care about
completion, for example. So have a per-packet flag that
says whether sender cares about completion feedback. Note
that hardware will still be within its rights to delay this
feedback, permitting the feedback for multiple senders
and streams to be batched up into a single interrupt.
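
A purely hypothetical illustration of that proposal (none of the
names below exist in mainline; they are made up only to show the
shape of the idea):

    /* Hypothetical TX descriptor and fill routine; "wants_completion"
     * stands in for the proposed per-skb boolean flag. */
    struct fake_tx_desc {
        unsigned long addr;     /* DMA address of the packet data */
        unsigned int  len;      /* length of the packet data */
        unsigned int  flags;
    };
    #define FAKE_TX_DESC_IRQ 0x1    /* request a completion interrupt */

    static void fill_tx_desc(struct fake_tx_desc *desc,
                             unsigned long dma_addr, unsigned int len,
                             int wants_completion)
    {
        desc->addr  = dma_addr;
        desc->len   = len;
        desc->flags = wants_completion ? FAKE_TX_DESC_IRQ : 0;
        /* Even when the bit is set, the hardware remains free to delay
         * and batch several completions into a single interrupt. */
    }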

When utilization is low, a greater number of interrupts can
be tolerated -- and is desired, in order to reduce latency.
When utilization is high, aggressively mitigate interrupts
in order to increase throughput.

SH: Swapping over NFS requires special care, as you can livelock
the system due to OOM issues.

In virtualized environments, the hypervisor can supply the private
memory pool used to get guests out of trouble. If apps are running
on the host, then the host must avoid swapping.

Stephen Hemminger: Bridging

Summary:

Bridging is now receiving much more attention due to its
new-found uses in virtual environments. The setup for these
environments is mostly automated, and works quite well --
except for the spanning-tree implementation, which can take
up to 30 seconds to sync up. RSTP (Rapid Spanning Tree
Protocol) would be a great improvement, and it also better
handles leaf nodes. There is an rstplib, but it is only
occasionally used, mostly by embedded people: problems include
lack of distro uptake and problems with user-kernel version
synchronization.

EMC has an RSTP implementation, which is now in the repository,
and which EMC wants to replace with GPLed code from VMWare,
which is now also in the repository.

Much discussion of VEPA (Virtual Ethernet Port Aggregator),
especially regarding the need for solutions that work across
a wide range of hardware.

Userspace difficult due to need to keep kernel and user-space
library in sync. In theory better security, but...

EMC coded up a version, which is now in the repository. But
EMC wants to replace with GPLed code from VMWare.

VEPA (Virtual Ethernet Port Aggregator). But want solutions
to work across a wide range of hardware. Lots of competing
patches and approaches.

Link detection pretty much there, but MTU issues remain.
Bonding issues with min-MTU. Need to make sure that optimizations
like GRO all work correctly.

Jesper Dangaard Brouer: 10Gb bi-directional routing

Summary:

Jesper described ComX's (a Danish ISP) use of a Linux box as a
10Gbit/s Internet router as an alternative to a proprietary
solution that is an order of magnitude more costly.
Jesper's testing achieved full wire speed on dual-interface
unidirectional workloads for packet sizes of 420 bytes or
larger, and on dual-interface bidirectional workloads for
packet sizes of 1280 bytes or larger. Also showed good
results on Robert Olsson's internet-traffic benchmark.

Perhaps needless to say, careful choice of hardware and careful
tuning are required to achieve these results.

Bottom-line findings: 10Gbit/s bi-directional routing is
possible, but we are limited by packet-per-second processing
power; there is still memory bandwidth available.

[discussion of solving the serialization issue with bandwidth
shapers by applying per-CPU value-caching tricks
to allow scalable traffic shaping. At 1Mb/s with 1500-byte
packets, a megabit is only about 83 packets, so hand
out (say) 10 tokens at a time as requested, varying the handout
based on the number of CPUs and the size of the share.]
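
An illustrative userspace sketch of that value-caching idea (not
kernel code; the chunk size of 10 is just the example from the
discussion): each CPU draws tokens from the shared budget in
chunks, so the shared counter is touched once per chunk rather
than once per packet.

    #include <stdatomic.h>

    #define CHUNK 10                        /* tokens handed out per refill */

    static atomic_long global_tokens;        /* refilled at the shaped rate */
    static _Thread_local long local_tokens;  /* stands in for a per-CPU cache */

    /* Returns 1 if one packet's worth of tokens was obtained, 0 to throttle. */
    static int consume_token(void)
    {
        if (local_tokens == 0) {
            long avail = atomic_fetch_sub(&global_tokens, CHUNK);

            if (avail < CHUNK) {
                /* Overdrew the shared budget: give it back and throttle. */
                atomic_fetch_add(&global_tokens, CHUNK);
                return 0;
            }
            local_tokens = CHUNK;
        }
        local_tokens--;
        return 1;
    }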

Affinitize IRQs to spread load over CPUs.

Use "mpstat -A -P ALL" to validate irq spreading.

Make sure that corresponding RX and TX queues are
on same CPU.
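
IRQ affinity itself is just a CPU mask written to
/proc/irq/<N>/smp_affinity; a minimal sketch (the IRQ number and
CPU below are made-up examples, and root is required):

    #include <stdio.h>

    int set_irq_affinity(int irq, unsigned long cpu_mask)
    {
        char path[64];
        FILE *f;

        snprintf(path, sizeof(path), "/proc/irq/%d/smp_affinity", irq);
        f = fopen(path, "w");
        if (!f)
            return -1;
        fprintf(f, "%lx\n", cpu_mask);   /* hex CPU mask */
        fclose(f);
        return 0;
    }

    int main(void)
    {
        return set_irq_affinity(73, 1UL << 2);  /* IRQ 73 -> CPU 2 */
    }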

Three usage cases for staying on the same CPU:

Forwarding (RXq to TXq on other NIC): record the RX
queue number and use it as the TX queue (see the sketch
after this list).

Server (RXq to TXq): cache socket info (thanks to
Eric Dumazet).

Client (TXq to RXq): Hard!!! Need to use the
flow director in the 10GbE Intel 82599 NIC.
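
For the forwarding case, a schematic kernel-side sketch of carrying
the RX queue number in the skb (the skb_record_rx_queue() family of
helpers is real; the mydev_* names and the select-queue hookup are
illustrative, not from any particular driver):

    #include <linux/skbuff.h>
    #include <linux/netdevice.h>

    /* Receive path: note which RX queue the packet arrived on. */
    static void mydev_rx_packet(struct sk_buff *skb, u16 rx_queue)
    {
        skb_record_rx_queue(skb, rx_queue);
        netif_receive_skb(skb);
    }

    /* Transmit path on the other NIC: reuse the recorded RX queue as
     * the TX queue so the forwarding work stays on the same CPU. */
    static u16 mydev_select_queue(struct net_device *dev, struct sk_buff *skb)
    {
        if (skb_rx_queue_recorded(skb))
            return skb_get_rx_queue(skb) % dev->real_num_tx_queues;
        return 0;
    }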

Be skeptical about packet generators:
First runs were unidirectional at wire speed for packet sizes of 768 bytes
or larger. But limited by generator! Note that
pktgen has some known limitations -- want to run with delay zero,
otherwise pktgen does per-frame gettimeofday(). Also need faster
NICs on the packet-generating systems. Stephen Hemminger is looking
into optimizing pktgen.
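
For reference, pktgen is driven through its /proc interface; a
rough sketch of a minimal configuration with delay 0 (assumes the
pktgen module is loaded; the interface name, addresses, packet
size, and count are made-up examples):

    #include <stdio.h>

    static int pgwrite(const char *path, const char *cmd)
    {
        FILE *f = fopen(path, "w");

        if (!f)
            return -1;
        fprintf(f, "%s\n", cmd);
        fclose(f);
        return 0;
    }

    int main(void)
    {
        pgwrite("/proc/net/pktgen/kpktgend_0", "rem_device_all");
        pgwrite("/proc/net/pktgen/kpktgend_0", "add_device eth1");
        pgwrite("/proc/net/pktgen/eth1", "count 1000000");
        pgwrite("/proc/net/pktgen/eth1", "pkt_size 100");
        pgwrite("/proc/net/pktgen/eth1", "delay 0"); /* avoid per-frame timing */
        pgwrite("/proc/net/pktgen/eth1", "dst 192.0.2.1");
        pgwrite("/proc/net/pktgen/eth1", "dst_mac 00:11:22:33:44:55");
        return pgwrite("/proc/net/pktgen/pgctrl", "start");
    }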

With faster packet generators, generators get close to wire
speed -- at wire speed for 420-byte packets and larger.

Some limitations due to memory bandwidth -- may need more
NUMA-awareness in drivers and possibly also the network
stack... Will need API from driver to give preferred
buffer/queue layout in memory.

Future: use more queues than CPUs to implement QoS. Even better,
use per-socket queues -- but need thousands, or even millions
of queues.

Thomas Graf: Control Groups (cgroups) Networking Subsystem

Summary:

Thomas described his extension of cgroups to cover networking.
The administrator can create networking classIDs and assign
them to cgroups. These classIDs can then be used by the
traffic classifier.

This approach does not cover incoming traffic, nor does it
cover delayed traffic (where packets are sent from within
softirq context rather than from the context of the originating
task). However, according to DaveM, Thomas's approach covers
the cases that most people care about.

Does not cover incoming traffic, does not necessarily cover
delayed traffic -- need to still be in the sending process's
context to be able to use the cgroup classID. But does cover
the cases most people care about, says DaveM.
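
A rough userspace sketch of the resulting workflow (the mount
point /cgroup/net_cls and the group name "bulk" are made-up; the
classID encoding, major:minor packed into a 32-bit value, is the
net_cls convention): write the classID, add a task, and let a
cgroup-aware tc classifier match on it.

    /* Tag a cgroup with networking classID 10:1 (0x00100001) and
     * move the current task into it. */
    #include <stdio.h>
    #include <unistd.h>

    static int write_str(const char *path, const char *val)
    {
        FILE *f = fopen(path, "w");

        if (!f)
            return -1;
        fprintf(f, "%s\n", val);
        fclose(f);
        return 0;
    }

    int main(void)
    {
        char pid[16];

        write_str("/cgroup/net_cls/bulk/net_cls.classid", "0x00100001");
        snprintf(pid, sizeof(pid), "%d", getpid());
        /* Traffic from this task can now be matched by classID 10:1. */
        return write_str("/cgroup/net_cls/bulk/tasks", pid);
    }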

Thomas Graf: libnl (Netlink library)

Summary:

Thomas reported progress on libnl, including the new
extended-match support that allows the rules to use
protocol field names rather than byte offsets. This
change is likely to be quite welcome.

Gerrit Renker: DCCP (Datagram Congestion Control Protocol)

Summary:

Gerrit reported on progress with DCCP, a protocol that is unusual
in that it has been "synthesized in the lab" rather than being
refined through experience, bakeoffs, and consumption of large
quantities of alcohol, as was the process used for TCP and SCTP.
DCCP has not seen great uptake, with the result that the only
surviving in-kernel implementation is in Linux. Nevertheless,
a number of applications, including GStreamer, have been ported to
DCCP, and a number of people are actively working to improve it.
Most notably, a group in Italy has applied formal control-theory
results to DCCP's CCID-3 protocol to obtain a simple and effective
congestion-avoidance algorithm that is expected to allow CCID-3
to dispense with high-resolution timers, thereby increasing its
efficiency.

It is expected that increased application usage of DCCP may
eventually require expanding the kernel/user interface to
pass timing information from DCCP to the user application.
Such a change could well permit DCCP to come into its own as a
first-class production-quality protocol suite for time-sensitive
multi-media applications.

Details:

DCCP originally by Arnaldo. Only surviving in-kernel
implementation.

TCP and SCTP were refined through experience, with many bake-offs
and incremental improvements. DCCP was synthesized in the lab.
For example, there are difficulties routing it through the Internet.

No really compelling reasons to use these vs. TCP or UDP.
But there are 251 remaining CCIDs left to implement!!! :-)

IETF let this through. [IETF has certainly changed a lot
in 20 years!!!]

Test tree at Aberdeen.

ECN/ECT(0) patches for DCCPv4/6.
CCID-4 (RFC 5622) in development in Brazil.
New CCID-3 algorithm from Italy, applying control theory.
Hopefully dispense with need for high-res timers.

A number of applications ported to DCCP, including GStreamer.
These might require additional information piped to
user space, for example, timing information.
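
For reference, an application opts into DCCP much as it would TCP,
plus a service code; a minimal client-side sketch (the service
code 42, port, and address are made-up examples; the constants are
defined locally in case older libc headers lack them):

    #include <stdio.h>
    #include <sys/socket.h>
    #include <netinet/in.h>
    #include <arpa/inet.h>

    #ifndef SOCK_DCCP
    #define SOCK_DCCP            6
    #endif
    #ifndef IPPROTO_DCCP
    #define IPPROTO_DCCP         33
    #endif
    #ifndef SOL_DCCP
    #define SOL_DCCP             269
    #endif
    #ifndef DCCP_SOCKOPT_SERVICE
    #define DCCP_SOCKOPT_SERVICE 2
    #endif

    int main(void)
    {
        int fd = socket(AF_INET, SOCK_DCCP, IPPROTO_DCCP);
        unsigned int service = htonl(42);   /* DCCP service code */
        struct sockaddr_in sa = {
            .sin_family = AF_INET,
            .sin_port = htons(5001),
        };

        if (fd < 0)
            return 1;
        inet_pton(AF_INET, "192.0.2.1", &sa.sin_addr);

        /* The service code must be set before connect(). */
        setsockopt(fd, SOL_DCCP, DCCP_SOCKOPT_SERVICE,
                   &service, sizeof(service));
        if (connect(fd, (struct sockaddr *)&sa, sizeof(sa)) < 0) {
            perror("connect");
            return 1;
        }
        return 0;
    }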

PJ: DCB (Data Center Bridging). IEEE standard.

Summary:

PJ gave a rundown of DCB, which can be very roughly thought
of as a member of the DCE family. PJ discussed tagging
traffic via VLAN egress, which might simplify filtering
setup by avoiding the need for filtering rules that are
aware of both TCP/IP and Ethernet header fields. Another
improvement proposed was to bypass the qdisc when it is empty,
to avoid the qdisc chokepoint (but see DaveM's talk).

Details:

RDMA over Infiniband. Can configure via the netlink layer.
Tools are quite rough; considering putting this function into
ethtool for FCoE (Fibre Channel over Ethernet).

Tag traffic via VLAN egress, simplifying filtering setup.
Can filter based on ethertype, for example. Avoids the
need to make filtering rules that are aware of both Ethernet
and TCP/IP headers.

Problem is that completion interrupts might happen
on other CPUs. Possible mitigation: move clean-up to
transmit side as Chelsio folks do.

But real fix is to more carefully distribute the traffic
so that CPUs don't go after each others' locks.

Dave Miller: Linux Multiqueue Networking

Summary:

Dave gave a compressed version of his NYLUG talk on multiqueue
networking. This work parallelizes the networking stack in almost
all cases -- remaining cases include the complex-qdisc scenario
discussed in Jesper's talk, though backwards compatibility
is important given that qdisc API changes touch something like
450 drivers.

Changes include interrupt mitigation, rework of NAPI, keeping
queue-selection state in the SKB (thus avoiding the need to
acquire locks to access or change this state), and many others
besides. Future challenges include wakeup mapping (perhaps using
some of the tricks that Jens Axboe is applying in his block-I/O
work), Tom Herbert's per-device packet-steering table, a software
version of Intel's flow director, and any number of changes to
accommodate the increasingly common virtualized environments.

Details:

End of Moore's Law frequency scaling. More networking
flows per system. Single-queue/stream approach no longer
works. Need multiple queues.

Some difficulties in wakeup mapping. Possibly use some tricks
that Jens Axboe is applying to the block-I/O subsystem, some
of which got 10% improvement at system level. DaveM has
prototyped it, but dropped it in favor of multiqueue hardware.

Tom Herbert of Google uses per-device packet-steering table
that is set via sysctl.

Another approach: software version of Intel's flow director.
Space, time, locality issues.

Small hash table, no chaining. Track transmits, correlate
TX and RX.

Google would prefer using hardware-generated RSS hash value.

Virtualization: virtual non-multiqueue NICs.

TCAM (ternary content-addressable memory) might also be
applied to this scenario.

Bridging through hypervisor... Might not be so high overhead,
but need GRO/GSO/&c. KVM issues would remain.

PJ: numerous VM-to-VM benchmarking efforts.

DaveM re-implemented multi-queue -three- times... ;-)

Herbert Xu: Bridging and Multicasting

Summary:

Although bridging currently satisfies most needs, even in
virtualized environments, and even in conjunction with
multicasting, the combination of the three can be inefficient.
The problem is that the Linux kernel's bridging drivers are
unaware of multicast state, and therefore simply flood all
multicast packets out all interfaces (except of course for
the interface on which the packet was received). One solution
would be to leverage IGMP (Internet Group Management Protocol)
to allow the bridging drivers to send multicast packets only
where needed.

Details:

"Bridging does what most people want, even with multicasting."

Some interesting applications: IPTV, but mostly banks.

Main use of bridging is virtualization -- bridging is used to
connect multiple guests to a single networking device.

Can use multicasting over bridges, and it works -- but it simply
floods to all other possible destinations. Strongly
suboptimal, as it is often the case that the multicast
isn't going to all the possible destinations.

So, thinking of implementing IGMP routing protocol for the
bridging driver. SCH: how about -associated- -with- the
bridging driver rather than -in- the bridging driver???

HX: Yes, to be cleanly implemented.

Herbert Xu: GRO (Generic Receive Offload)

Summary:

GRO has been quite successful, but Herbert sees several ways
to usefully expand on it. One example is receive steering,
directing packets to the CPU on which the destination thread
is running. Another example is to reduce processing cost by
short-circuiting TCP ACK processing, as opposed to the current
practice of running all TCP ACK packets through the full protocol
stack. A final example is the creation of monstergrams, allowing
a large group of packets from the same connection to be run
through the protocol stack as a unit.

Given fixed MTU and ever-increasing bandwidths, the opportunity
(and need) for such tricks can be expected to increase over time.

Reduce processing cost -- helping general processing.
For example, short-circuit TCP ACK processing: send and
receive the ACKs at the GRO level rather than running
ACKs through the full protocol stack.

Make monstergrams -- wait for full NAPI interval,
accumulate full set of data, run it up the protocol
stack as a unit. Expect things like video to increase
their bandwidth requirements.

The fact that increasing numbers of packets arrive within
a NAPI interval means that the opportunity for such improvements
can be expected to increase over time.

Peter P. Waskiewicz Jr. (PJ): I/O MMU

Summary:

Much discussion of possible hardware and software optimizations
for I/O MMUs, which incur high overheads. It is safe to say that
more work will be required here, both in hardware and in software.

Details:

Optimizations to keep buffers in cache, and repeatedly
run through same set of buffers.

SCH: But user might not read the buffer for a long time...

DM: Some NICs have had a problem where certain conditions
can cause the associated buffer to be forevermore useless.
IO MMU very expensive to change.

Order of magnitude decrease in throughput when enabling
IO MMU, even if not remapping it.

DM: If you have direct HW access to IO MMU, you can run
through the registers, and only touch hardware when you
have overflowed. Can also have hardware update the status
block intermittently.