I've ported myri10ge to use the new LRO interface. I have attached a preliminary patch to myri10ge. I'm very pleased to note that the performance is on par with my own LRO used by our out-of-tree driver (except when using mixed MTUs; see performance data below).

As I expected, actually porting our driver to use the LRO interface gave me a far better understanding of the interface, and allowed for better feedback. I have attached a patch to the LRO code which addresses some of the issues I mention below.

Please find below a performance summary, as well as my comments on the LRO code, and patches to myri10ge and inet_lro. Both patches are Signed-off-by: Andrew J. Gallatin <gallatin@myri.com>

Cheers,

Drew

===================
Performance:
===================
Here is a performance summary taken on my very low-end 2.0GHz AMD Athlon(tm) 64 X2 Dual Core Processor 3800+ running 2.6.23-rc1 and receiving a netperf TCP_SENDFILE test from an identical sender (which was running 2.6.22 and our 1.3.1 "out of tree" driver). The netserver process was bound to a different core than the interrupt handler. The data reported is from the median of 5 60-second netperf tests. The following settings were in /etc/sysctl.conf on both machines:

1) Checksums

Our NIC passes partial checksums to our driver. In the current code, it seems impossible for page-based CHECKSUM_COMPLETE drivers to behave correctly in the case of "rejected" frames. E.g., there is no way to pass the partial checksum to the LRO module so that it gets set in the skb header and passed up the stack.

Further, there seems to be no (easy) way to use CHECKSUM_COMPLETE on an aggregated packet at LRO flush time. By the time a packet is aggregated, the partial checksum from the first segment is out of date.

I think it would make sense to require that a CHECKSUM_COMPLETE-style driver verify the checksum in its get_frag_header / get_skb_header callback. This allows the LRO code to unconditionally set CHECKSUM_UNNECESSARY.

The attached patch modifies the code to do this.
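
To make the idea concrete, a verifying page-based callback would look roughly like the condensed sketch below. This is not the attached patch verbatim; in particular, passing the NIC's partial checksum through the opaque priv argument is just one way of getting it to the callback:

static int
myri10ge_get_frag_header(struct skb_frag_struct *frag, void **mac_hdr,
                         void **ip_hdr, void **tcpudp_hdr, u64 *hdr_flags,
                         void *priv)
{
        u8 *va = page_address(frag->page) + frag->page_offset;
        struct ethhdr *eh = (struct ethhdr *)va;
        struct iphdr *iph;
        /* partial checksum from the NIC, passed opaquely via priv */
        __wsum csum = (__force __wsum)(unsigned long)priv;

        /* sketch ignores VLAN tags; see item 2 below */
        if (eh->h_proto != htons(ETH_P_IP))
                return -1;
        *mac_hdr = eh;
        *hdr_flags = LRO_IPV4;

        iph = (struct iphdr *)(va + ETH_HLEN);
        *ip_hdr = iph;
        if (iph->protocol != IPPROTO_TCP)
                return -1;
        *hdr_flags |= LRO_TCP;
        *tcpudp_hdr = (u8 *)iph + (iph->ihl << 2);

        /* verify the IP header checksum ... */
        if (unlikely(ip_fast_csum((u8 *)iph, iph->ihl)))
                return -1;

        /* ... and fold the NIC's partial checksum over the pseudo-header
         * to verify the TCP checksum, so the LRO code can safely mark
         * the resulting skb CHECKSUM_UNNECESSARY */
        if (unlikely(csum_tcpudp_magic(iph->saddr, iph->daddr,
                                       ntohs(iph->tot_len) - (iph->ihl << 2),
                                       IPPROTO_TCP, csum)))
                return -1;
        return 0;
}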

2) Non-accelerated VLAN tags

Our firmware currently does not do VLAN tag insertion and removal. This causes a problem in __lro_proc_segment(), where the TCP and IP headers are set up to point into the newly created skb. A frame containing an unstripped VLAN tag will cause those header pointers to be garbage.

The attached patch modifies __lro_proc_segment() to offset those pointers by VLAN_HLEN when required.
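
The core of the change is just an offset; the helper below sketches the idea (the function name is illustrative, not the patch itself):

/* Recompute the IP/TCP header pointers for an skb generated by
 * lro_gen_skb(), accounting for a VLAN tag the NIC left in place. */
static void lro_vlan_hdr_offset(struct sk_buff *skb, struct iphdr **iph,
                                struct tcphdr **tcph)
{
        int vlan_hdr_len = 0;

        /* eth_type_trans() has already set skb->protocol; an unstripped
         * tag shows up as ETH_P_8021Q and shifts the encapsulated
         * IP/TCP headers by VLAN_HLEN (4 bytes) */
        if (skb->protocol == htons(ETH_P_8021Q))
                vlan_hdr_len = VLAN_HLEN;

        *iph = (struct iphdr *)(skb->data + vlan_hdr_len);
        *tcph = (struct tcphdr *)(skb->data + vlan_hdr_len +
                                  ((*iph)->ihl << 2));
}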

3) Padded frames.

I may be missing something, but I don't see where you either strip padding from frames or reject padded frames (see the pskb_trim_rcsum() call in net/ipv4/ip_input.c:ip_rcv()).

I did not add such a feature as I was confused about the intended use of len/true_size.

Also, trimming is a pain when dealing with pure frags (without a containing skb). We have code in our out-of-kernel driver to deal with it, which you are welcome to use.
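
For the skb case the check is simple enough; something along these lines would mirror what ip_rcv() does (an illustrative sketch, not the code from our driver, and it does not cover the pure-frags case):

static int lro_trim_padding(struct sk_buff *skb, struct iphdr *iph)
{
        unsigned int ip_len = ntohs(iph->tot_len);
        unsigned int hdr_offset = (u8 *)iph - skb->data;

        /* frame shorter than the IP header claims: not LRO material */
        if (skb->len < hdr_offset + ip_len)
                return -1;

        /* Ethernet padding: trim it (and fix up the checksum) the same
         * way ip_rcv() does with pskb_trim_rcsum() */
        if (skb->len > hdr_offset + ip_len)
                return pskb_trim_rcsum(skb, hdr_offset + ip_len);

        return 0;
}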

4) LRO_MIN_PG_HLEN (== 80)

This confuses me. Can you please explain what you're trying to do? Because of this, I kept getting crashes in the skb_pull() done by eth_type_trans(): I was passing segments which were 60 bytes, and the skb->data_len of the skb constructed by lro_gen_skb() was -20. I added my own code to bump the length to a magic 80 bytes, and the panics disappeared. This may cause data corruption because of #3 above!
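
The band-aid I added on the driver side amounts to no more than this (sketch; the constant simply mirrors LRO_MIN_PG_HLEN):

        /* never hand inet_lro a fragment length smaller than the header
         * size it copies into the linear area, so that lro_gen_skb()'s
         * skb->data_len = len - hlen cannot go negative for runt frames.
         * Any padding this exposes is exactly the concern in #3 above. */
        if (len < 80)
                len = 80;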

5) NAPI/non-NAPI

The LRO code assumes the underlying driver uses NAPI, and calls netif_receive_skb() rather than netif_rx(). Perhaps there should be a field in the lro_mgr struct to specify NAPI / non-NAPI.
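
For example, a flags field in the lro_mgr struct plus a check at flush time would do; something like the following (the field and flag names are just a suggestion, not existing interface):

#define LRO_F_NAPI      1       /* driver polls via NAPI */

static void lro_flush_skb(struct net_lro_mgr *lro_mgr, struct sk_buff *skb)
{
        /* hand the aggregated skb up the stack using whichever receive
         * call matches the driver's model */
        if (lro_mgr->features & LRO_F_NAPI)
                netif_receive_skb(skb);
        else
                netif_rx(skb);
}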

6) The checks for when to stop aggregating and flush in __lro_proc_{segment|skb} need some improvement.

The skb variant currently uses a pure count (max_aggr). In order to keep the resulting aggregated frame below 64KB, the underlying driver computes max_aggr as 0xffff / MTU. This reduces the effectiveness of LRO on mixed-MTU networks. E.g., this causes packets coming from a source with a 1500b MTU to be aggregated after 7 frames when using a 9000b MTU on the receiver, rather than after 43 frames. As you can see from the differences in inet_lro's performance in the table above, this is a real problem.

I believe that a hybrid byte / max_aggr model should be used. The __lro_proc_segment() path takes this approach. Note that there is a subtle bug in that the maximum aggregated byte count should not be 0xffff. Rather, one must leave room for the next frame to arrive by setting the maximum aggregated bytes to 0xffff - dev->mtu. This is masked by the driver computing max_aggr as above.
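
Concretely, the kind of hybrid check I have in mind is roughly the following (a sketch, using the descriptor and manager fields as I understand them):

        /* flush when either the packet-count limit is reached or the
         * aggregated frame is close enough to 64KB that one more
         * full-MTU segment could overflow it */
        if (lro_desc->pkt_aggr_cnt >= lro_mgr->max_aggr ||
            lro_desc->parent->len > (0xffff - lro_mgr->dev->mtu))
                lro_flush(lro_mgr, lro_desc);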