We provide a zero-copy method by which the driver side may get external
buffers to DMA into. Here "external" means the driver does not use kernel
space to allocate skb buffers. Currently the external buffers can come from
the guest virtio-net driver.

The idea is simple: pin the guest VM user space and then give the host NIC
driver the chance to DMA directly to it. The patches are based on the
vhost-net backend driver. We add a device which provides proto_ops
(sendmsg/recvmsg) to vhost-net to send/recv directly to/from the NIC
driver. A KVM guest that uses the vhost-net backend may bind any ethX
interface on the host side to get copyless data transfer through the guest
virtio-net frontend.

patch 01-11:	net core and kernel changes.
patch 12-14:	new device as interface to manipulate external buffers.
patch 15:	for vhost-net.
patch 16:	An example of modifying a NIC driver to use napi_gro_frags().
patch 17:	An example of how to get guest buffers, based on a driver that
		uses napi_gro_frags().

The guest virtio-net driver submits multiple requests through the vhost-net
backend driver to the kernel. The requests are queued and then completed
after the corresponding actions in h/w are done.

For read, user space buffers are dispensed to the NIC driver for rx when
a page constructor API is invoked. This means NICs can allocate user buffers
from a page constructor. We add a hook in the netif_receive_skb() function
to intercept the incoming packets and notify the zero-copy device.
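The read path above can be pictured with a small user-space model. This is
only a sketch under stated assumptions, not the code from the patches: the
names buf_pool, pool_post() and pool_ctor() are hypothetical, and a fixed
FIFO ring stands in for whatever bookkeeping the mp device really uses. The
guest posts receive buffers, and the "page constructor" hands the oldest one
to the NIC driver, or NULL when none is posted.

```c
#include <assert.h>
#include <stddef.h>

#define POOL_SLOTS 8	/* illustrative ring size, not taken from the patches */

/* Hypothetical pool of guest buffers waiting to be used for rx. */
struct buf_pool {
	void  *slot[POOL_SLOTS];
	size_t head;	/* index of the oldest posted buffer */
	size_t count;	/* number of buffers currently posted */
};

/* Guest side: post one receive buffer. Returns 0, or -1 if the ring is full. */
static int pool_post(struct buf_pool *p, void *buf)
{
	assert(p != NULL && buf != NULL);
	if (p->count == POOL_SLOTS)
		return -1;
	p->slot[(p->head + p->count) % POOL_SLOTS] = buf;
	p->count++;
	return 0;
}

/*
 * NIC side: the "page constructor". Returns the oldest posted guest
 * buffer, or NULL when the guest has posted nothing.
 */
static void *pool_ctor(struct buf_pool *p)
{
	void *buf;

	assert(p != NULL);
	if (p->count == 0)
		return NULL;
	buf = p->slot[p->head];
	p->head = (p->head + 1) % POOL_SLOTS;
	p->count--;
	return buf;
}
```

A driver in this model would call pool_ctor() where it normally allocates an
rx page and would need some fallback when NULL comes back; what the real
patches do in that case is not modelled here.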

For write, the zero-copy device allocates a new host skb, puts the
payload on skb_shinfo(skb)->frags, and copies the header to skb->data.
The request remains pending until the skb is transmitted by h/w.
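The split between a copied header and referenced payload can be sketched in
user space as follows. This is an illustration only: struct sk_tx,
sk_tx_build() and the SK_* constants are hypothetical names, and plain
pointers into the caller's buffer stand in for pinned guest pages. Real
fragments would be cut at page boundaries; here the payload is simply cut
into chunks of at most one assumed page size.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define SK_PAGE_SIZE 4096u	/* assumed page size */
#define SK_MAX_FRAGS 17		/* illustrative fragment limit */
#define SK_HDR_MAX   128	/* illustrative linear header room */

/* One payload fragment: a reference into the guest buffer, no copy. */
struct sk_frag {
	const unsigned char *base;
	size_t len;
};

/* Hypothetical stand-in for an skb: copied header plus referenced payload. */
struct sk_tx {
	unsigned char  hdr[SK_HDR_MAX];
	size_t         hdr_len;
	struct sk_frag frags[SK_MAX_FRAGS];
	size_t         nr_frags;
};

/*
 * Copy the first hdr_len bytes of buf into the linear area and describe
 * the rest as fragments pointing into buf. Returns 0 on success, -1 if
 * the header or the payload does not fit.
 */
static int sk_tx_build(struct sk_tx *tx, const unsigned char *buf,
		       size_t len, size_t hdr_len)
{
	const unsigned char *p;
	size_t left;

	assert(tx != NULL && buf != NULL);
	if (hdr_len > len || hdr_len > SK_HDR_MAX)
		return -1;

	memcpy(tx->hdr, buf, hdr_len);	/* only the header is copied */
	tx->hdr_len = hdr_len;
	tx->nr_frags = 0;

	p = buf + hdr_len;
	left = len - hdr_len;
	while (left > 0) {
		size_t chunk = left < SK_PAGE_SIZE ? left : SK_PAGE_SIZE;

		if (tx->nr_frags == SK_MAX_FRAGS)
			return -1;
		tx->frags[tx->nr_frags].base = p;	/* reference, not copy */
		tx->frags[tx->nr_frags].len = chunk;
		tx->nr_frags++;
		p += chunk;
		left -= chunk;
	}
	return 0;
}
```

Because the fragments point into the caller's buffer, that buffer must stay
valid until transmission completes, which mirrors why the request stays
pending until the h/w has sent the skb.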

We provide multiple submits and asynchronous notification to vhost-net too.

Our goal is to improve the bandwidth and reduce the CPU usage.
Exact performance data will be provided later.

What we have not done yet: Performance tuning

what we have done in v1:
	polish the RCU usage
	deal with write logging in asynchronous mode in vhost
	add notifier block for mp device
	rename page_ctor to mp_port in netdevice.h to make it look generic
	add mp_dev_change_flags() for mp device to change NIC state
	add CONFIG_VHOST_MPASSTHRU to limit the usage when module is not loaded
	a small fix for missing dev_put on failure
	use dynamic minor instead of static minor number
	a __KERNEL__ protect for mp_get_sock()

what we have done in v2:

	remove most of the RCU usage, since the ctor pointer is only
	changed by the BIND/UNBIND ioctls, and during that time the NIC is
	stopped to get a good cleanup (all outstanding requests are
	finished), so the ctor pointer cannot be raced into a wrong state.

	Replace struct vhost_notifier with struct kiocb.
	Let the vhost-net backend alloc/free the kiocb and transfer them
	via sendmsg/recvmsg.

	use get_user_pages_fast() and set_page_dirty_lock() on read.

	Add some comments for netdev_mp_port_prep() and handle_mpassthru().

what we have done in v3:
	the async write logging is rewritten
	a drafted synchronous write function for qemu live migration
	a limit for locked pages from get_user_pages_fast(), using
	RLIMIT_MEMLOCK, to prevent DoS
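The locked-page limit mentioned for v3 amounts to a check of the following
kind before more pages are pinned. This is a user-space sketch, not the
patch code: may_lock_pages() and its parameters are hypothetical, and the
caller is assumed to track how many pages it has already pinned.

```c
#include <assert.h>
#include <stdbool.h>
#include <sys/resource.h>

/*
 * Hypothetical helper: may `want` more pages be pinned when `locked`
 * pages are pinned already and the RLIMIT_MEMLOCK value is limit_bytes?
 * Written to avoid overflow in locked + want.
 */
static bool may_lock_pages(unsigned long locked, unsigned long want,
			   rlim_t limit_bytes, unsigned long page_size)
{
	rlim_t limit_pages;

	assert(page_size != 0);
	if (limit_bytes == RLIM_INFINITY)
		return true;	/* no limit configured */

	limit_pages = limit_bytes / page_size;
	if (locked > limit_pages)
		return false;	/* already over the limit */
	return want <= limit_pages - locked;
}
```

In user space the limit itself would come from getrlimit(RLIMIT_MEMLOCK,
&rl) and rl.rlim_cur; refusing the request when the check fails is what
keeps a guest from pinning an unbounded amount of host memory.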

what we have done in v4:
	add iocb completion callback from vhost-net to queue iocb in mp device
	replace vq->receiver by mp_sock_data_ready()
	remove stuff in mp device which accesses structures from vhost-net
	modify skb_reserve() to ignore host NIC driver reserved space
	rebase to the latest vhost tree
	split large patches into small pieces, especially for the net core part.

what we have done in v6:
	move create_iocb() out of page_dtor, which may be called in
	interrupt context
	 - this removes the potential issue of taking a lock in interrupt
	   context
	make the caches used by mp and vhost static, and create/destroy them
	in the module init/exit functions
	 - this allows multiple mp guests to be created at the same time

what we have done in v7:
	some cleanup in preparation for supporting PS mode

what we have done in v8:
	discard the modifications that pointed skb->data to the guest buffer
	directly.
	Add code to modify the driver to support napi_gro_frags(), following
	Herbert's comments, in order to support PS mode.
	Add mergeable buffer support in mp device.
	Add GSO/GRO support in mp device.
	Address comments from Eric Dumazet about cache line and RCU usage.

what we have done in v9:
	The v8 patch was based on a fix in dev_gro_receive().
	But Herbert did not agree with the fix we sent out, and he
	suggested another fix. v9 is rebased on that fix.

what we have done in v10:
	Fix a partial csum error.
	Clean up some unused fields in struct page_info{} in mp device.
	Change kmem_cache_zalloc() to kmem_cache_alloc(), based on comments
	from Michael S. Tsirkin.

what we have done in v11:
	Address comments from Michael S. Tsirkin to add two new ioctls in
	mp device. This still needs to be revised.

what we have done in v12:
	Address most comments from Ben Hutchings, except the compat ioctls.
	Since the comments are sparse, they are not split into a separate
	patch.
	Change struct mpassthru_port to struct mp_port, and struct page_ctor
	to struct page_pool.

what we have done in v13:
	Export functions to other drivers such as macvtap, in case they
	want to reuse them to get zero-copy.
	Rebase on 2.6.36-rc7.

what we have done in v14:
	Address the comments from David Miller on the bonding device issue.
	Currently, we handle two cases.
	In the first case, bonding is created before zero-copy mode is
	enabled for a device. The code checks whether all the slaves are
	capable of zero-copy. If yes, it forces all the slaves into
	zero-copy mode. If not, enabling zero-copy fails.
	In the other case, zero-copy is enabled before bonding is created,
	and creating the bond simply fails.
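The two bonding cases can be written down as a small decision model. This
is a sketch of the stated policy only: bond_enable_zero_copy() and
bond_may_enslave() are hypothetical names, and plain bool arrays stand in
for per-slave device state.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Case 1: the bond already exists and zero-copy is requested.
 * All-or-nothing: if every slave is capable, switch all of them on and
 * return 0; otherwise change nothing and return -1.
 */
static int bond_enable_zero_copy(const bool *capable, bool *zc_on, size_t n)
{
	size_t i;

	assert(capable != NULL && zc_on != NULL);
	for (i = 0; i < n; i++)
		if (!capable[i])
			return -1;	/* one incapable slave fails the request */
	for (i = 0; i < n; i++)
		zc_on[i] = true;
	return 0;
}

/*
 * Case 2: a device already in zero-copy mode is about to join a bond.
 * The policy described above is simply to refuse.
 */
static bool bond_may_enslave(bool dev_zc_on)
{
	return !dev_zc_on;
}
```

Checking every slave before changing any of them keeps the bond from ending
up with a mix of zero-copy and ordinary slaves when the request fails.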

what we have done in v15:
	Address comments from Eric Dumazet about how to clear the
	destructor_arg field of shinfo.

Performance:
	We have seen the request for performance data on the mailing list,
	and we are now looking into this.