This adds support to Linux for using virtio between two computers linked by
a PCI interface. This allows the use of virtio_net to create a familiar,
fast interface for communication. It should be possible to use other virtio
devices in the future, but this has not been tested.
I have implemented guest support for the Freescale MPC8349EMDS board, which
is capable of running in PCI agent mode (it acts like a PCI card, but is a
complete computer system running Linux). The driver is trivial to port to
any MPC83xx system.
It was developed to work in a CompactPCI crate of computers, one of which
is a standard x86 system (acting as the host) and many PowerPC systems
(acting as guests).
I have only tested this driver with a single board in my system. The host
is a 1066MHz Pentium3-M, and the guest is a 533MHz PowerPC. I am able to
achieve transfer rates of about 150 Mbit/s host->guest and 350 Mbit/s
guest->host. A few tests showed that an MTU of 4000 gave much better
results than an MTU of 1500, while an MTU of 64000 significantly hurt
performance. The performance is equivalent to my PCINet driver for
host->guest transfers, and about 20% faster for guest->host transfers.
I have included a short document explaining what I think is the most
complicated part of the driver: using the DMA engine to transfer data. I
hope everything else is readily obvious from the code. Questions are
welcome.
I will not be able to work on this full time for at least a few weeks, so I
would appreciate actual review of this driver. Nitpicks are fine, I just
won't be able to respond to them quickly.
RFCv1 -> RFCv2:
* fix major brokenness of host detach_buf()
* support VIRTIO_NET_F_CSUM
* support VIRTIO_NET_F_GSO
* support VIRTIO_NET_F_MRG_RXBUF
* rewrote DMA transfers to support merged rxbufs
* added a hack to fix the endianness of virtio_net's metadata
* lots more performance for guest->host transfers (~40MB/sec)
* updated documentation
* allocate 128 feature bits instead of 32
Signed-off-by: Ira W. Snyder <iws@ovro.caltech.edu>
---
Yes, the commit message has too much information. This is an RFC after
all. I fully expect to have to make changes. In fact, I am posting this
more to "get it out there" than anything else, since I have other tasks
that need doing.
I'd appreciate a serious review of the design by the people who have
been pressuring me to use virtio. I'm very happy to answer any questions
you have.
Thanks to everyone who gave feedback for RFCv1!
Ira
Documentation/virtio-over-PCI.txt | 60 +
arch/powerpc/boot/dts/mpc834x_mds.dts | 7 +
drivers/virtio/Kconfig | 22 +
drivers/virtio/Makefile | 2 +
drivers/virtio/vop.h | 119 ++
drivers/virtio/vop_fsl.c | 2020 +++++++++++++++++++++++++++++++++
drivers/virtio/vop_host.c | 1071 +++++++++++++++++
drivers/virtio/vop_hw.h | 80 ++
8 files changed, 3381 insertions(+), 0 deletions(-)
create mode 100644 Documentation/virtio-over-PCI.txt
create mode 100644 drivers/virtio/vop.h
create mode 100644 drivers/virtio/vop_fsl.c
create mode 100644 drivers/virtio/vop_host.c
create mode 100644 drivers/virtio/vop_hw.h

On Thursday 26 February 2009, Ira Snyder wrote:
> On Thu, Feb 26, 2009 at 05:15:27PM +0100, Arnd Bergmann wrote:
>
> I think so too. I was just getting something working, and thought it
> would be better to have it "out there" rather than be working on it
> forever. I'll try to break things up as I have time.
Ok, perfect!
> For the "libraries", would you suggest breaking things into separate
> code files, and using EXPORT_SYMBOL_GPL()? I'm not very familiar with
> doing that, I've mostly been writing code within the existing device
> driver frameworks. Or do I need export symbol at all? I'm not sure...
You have both options. When you list each file as a separate module
in the Makefile, you use EXPORT_SYMBOL_GPL to mark functions that
get called by dependent modules, but this will work only in one way.
You can also link multiple files together into one module, although
it is less common to link a single source file into multiple modules.
> I always thought you were supposed to use packed for data structures
> that are external to the system. I purposely designed the structures so
> they wouldn't need padding.
That would only make sense for structures that are explicitly unaligned,
like a register layout using

struct my_registers {
	__le16 first;
	__le32 second __attribute__((packed));
	__le16 third;
};

Even here, I'd recommend marking the individual members as packed
rather than the entire struct. Obviously, if you lay out the members
in a sane way, you don't need either.
> I mostly don't need it. In fact, the only place I'm using registers not
> specific to the messaging unit is in the probe routine, where I setup
> the 1GB window into host memory and setting up access to the guest
> memory on the PCI bus.
You could add the registers you need for this to the "reg" property
of your device, to be mapped with of_iomap.
If the registers for setting up this window don't logically fit
into the same device as the one you already use, the cleanest
solution would be to have another device just for this and then
make a function call into that driver to set up the window.
> Now, I wouldn't need to access these registers at all if the bootloader
> could handle it. I just don't know if it is possible to have Linux not
> use some memory that the bootloader allocated, other than with the
> mem=XXX trick, which I'm sure wouldn't be acceptable. I've just used
> regular RAM so this is portable to my custom board (mpc8349emds based)
> and a regular mpc8349emds. I didn't want to change anything board
> specific.
>
> I would love to have the bootloader allocate (or reserve somewhere in
> the memory map) 16K of RAM, and not be required to allocate it with
> dma_alloc_coherent(). It would save me plenty of headaches.
I believe you can do that through the "memory" devices in the
device tree, by leaving out a small part of the description of
main memory and putting it into the "reg" property of your own
device.
> Code complexity only. Also, it was easier to write 80-char lines with
> something like:
>
> vop_get_desc(vq, idx, &desc);
> if (desc.flags & VOP_DESC_F_NEXT) {
> 	/* do something */
> }
>
> Instead of:
> if (le16_to_cpu(vq->desc[idx].flags) & VOP_DESC_F_NEXT) {
> 	/* do something */
> }
>
> Plus, I didn't have to remember how many bits were in each field. I just
> thought it made everything simpler to understand. Suggestions?
hmm, in this particular case, you could change the definition
of VOP_DESC_F_NEXT to

#define VOP_DESC_F_NEXT cpu_to_le16(1)

and then do the code as the even simpler (source and object code wise)

if (vq->desc[idx].flags & VOP_DESC_F_NEXT)

I'm not sure if you can do something along these lines for the other
cases as well though.
> I used 3 so they would align to 1024 byte boundaries within a 4K
> page. Then the layout was 16K on the bus, each 4K page is a single
> virtio-device, and each 1K block is a single virtqueue. The first 1K is
> for virtio-device status and feature bits, etc.
>
> Packing them differently isn't a problem. It was just easier to code
> because setting up a window with the correct size is so platform
> specific.
Ok. I guess the important question is what part of the code makes
this decision. Ideally, the virtio-net glue would instantiate
the device with the right number of queues.
> > > +/*
> > > + * This function abuses some of the scatterlist code and implements
> > > + * dma_map_sg() in such a way that we don't need to keep the scatterlist
> > > + * around in order to unmap it.
> > > + *
> > > + * It is also designed to never merge scatterlist entries, which is
> > > + * never what we want for virtio.
> > > + *
> > > + * When it is time to unmap the buffer, you can use dma_unmap_single() to
> > > + * unmap each entry in the chain. Get the address, length, and direction
> > > + * from the descriptors! (keep a local copy for speed)
> > > + */
> >
> > Why is that an advantage over dma_unmap_sg?
> >
> When running dma_map_sg(), the scatterlist code is allowed to alter the
> scatterlist to store data it needs for dma_unmap_sg(), along with
> merging adjacent buffers, etc.
>
> I don't want any of that behavior. The generic virtio code does not
> handle merging of buffers.
>
> Also, all of the generic virtio code allocates its scatterlists on the
> stack. This means I cannot save the pointers between add_buf() and
> get_buf(). If I used dma_map_sg(), I'd have to allocate memory to copy
> the scatterlist, map it, and save the pointer. Later, retrieve the
> pointer, unmap it, and free the memory.
>
> This is simpler than all of that.
Not sure if I'm following, but you seem to have put enough thought
into it ;-)
Arnd <><

On Thu, Feb 26, 2009 at 09:37:14PM +0100, Arnd Bergmann wrote:
> On Thursday 26 February 2009, Ira Snyder wrote:
> > On Thu, Feb 26, 2009 at 05:15:27PM +0100, Arnd Bergmann wrote:
> >
> > I think so too. I was just getting something working, and thought it
> > would be better to have it "out there" rather than be working on it
> > forever. I'll try to break things up as I have time.
>
> Ok, perfect!
>
> > For the "libraries", would you suggest breaking things into separate
> > code files, and using EXPORT_SYMBOL_GPL()? I'm not very familiar with
> > doing that, I've mostly been writing code within the existing device
> > driver frameworks. Or do I need export symbol at all? I'm not sure...
>
> You have both options. When you list each file as a separate module
> in the Makefile, you use EXPORT_SYMBOL_GPL to mark functions that
> get called by dependent modules, but this will work only in one way.
>
> You can also link multiple files together into one module, although
> it is less common to link a single source file into multiple modules.
>
Ok. I'm more familiar with the EXPORT_SYMBOL_GPL interface, so I'll do
that. If we decide it sucks later, we'll change it.
> > I always thought you were supposed to use packed for data structures
> > that are external to the system. I purposely designed the structures so
> > they wouldn't need padding.
>
> That would only make sense for structures that are explicitly unaligned,
> like a register layout using
>
> struct my_registers {
> 	__le16 first;
> 	__le32 second __attribute__((packed));
> 	__le16 third;
> };
>
> Even here, I'd recommend marking the individual members as packed
> rather than the entire struct. Obviously, if you lay out the members
> in a sane way, you don't need either.
>
Ok. I'll drop the __attribute__((packed)) and make sure there aren't
problems. I don't suspect any, though.
> > I mostly don't need it. In fact, the only place I'm using registers not
> > specific to the messaging unit is in the probe routine, where I setup
> > the 1GB window into host memory and setting up access to the guest
> > memory on the PCI bus.
>
> You could add the registers you need for this to the "reg" property
> of your device, to be mapped with of_iomap.
>
> If the registers for setting up this window don't logically fit
> into the same device as the one you already use, the cleanest
> solution would be to have another device just for this and then
> make a function call into that driver to set up the window.
>
The registers are part of the board control registers. They don't fit at
all in the message unit. Doing this in the bootloader seems like a
logical place, but that would require any testers to flash a new U-Boot
image into their mpc8349emds boards.
The first set of accesses sets up a 1GB region in the memory map
that accesses the host's memory. Any reads/writes to addresses
0x80000000-0xc0000000 actually hit the host's memory.
The last access sets up PCI BAR1 to hit the memory from
dma_alloc_coherent(). The bootloader already sets up the window as 16K,
it just doesn't point it anywhere. Maybe this /should/ go into the
bootloader. Like above, it would require testers to flash a new U-Boot
image into their mpc8349emds boards.
> > Now, I wouldn't need to access these registers at all if the bootloader
> > could handle it. I just don't know if it is possible to have Linux not
> > use some memory that the bootloader allocated, other than with the
> > mem=XXX trick, which I'm sure wouldn't be acceptable. I've just used
> > regular RAM so this is portable to my custom board (mpc8349emds based)
> > and a regular mpc8349emds. I didn't want to change anything board
> > specific.
> >
> > I would love to have the bootloader allocate (or reserve somewhere in
> > the memory map) 16K of RAM, and not be required to allocate it with
> > dma_alloc_coherent(). It would save me plenty of headaches.
>
> I believe you can do that through the "memory" devices in the
> device tree, by leaving out a small part of the description of
> main memory and putting it into the "reg" property of your own
> device.
>
I'll explore this option. I didn't even know you could do this. Is a
driver that requires the trick acceptable for mainline inclusion? Just
like setting up the 16K PCI window, this is very platform specific.
This limits the guest driver to systems which are able to change Linux's
view of their memory somehow. Maybe this isn't a problem.
> > Code complexity only. Also, it was easier to write 80-char lines with
> > something like:
> >
> > vop_get_desc(vq, idx, &desc);
> > if (desc.flags & VOP_DESC_F_NEXT) {
> > 	/* do something */
> > }
> >
> > Instead of:
> > if (le16_to_cpu(vq->desc[idx].flags) & VOP_DESC_F_NEXT) {
> > 	/* do something */
> > }
> >
> > Plus, I didn't have to remember how many bits were in each field. I just
> > thought it made everything simpler to understand. Suggestions?
>
> hmm, in this particular case, you could change the definition
> of VOP_DESC_F_NEXT to
>
> #define VOP_DESC_F_NEXT cpu_to_le16(1)
>
> and then do the code as the even simpler (source and object code wise)
>
> if (vq->desc[idx].flags & VOP_DESC_F_NEXT)
>
> I'm not sure if you can do something along these lines for the other
> cases as well though.
>
That's a good idea. It wouldn't fix the addresses, lengths, and next
fields, though. I'll make the change and see how bad it is, then report
back. It may not be so bad after all.
> > I used 3 so they would align to 1024 byte boundaries within a 4K
> > page. Then the layout was 16K on the bus, each 4K page is a single
> > virtio-device, and each 1K block is a single virtqueue. The first 1K is
> > for virtio-device status and feature bits, etc.
> >
> > Packing them differently isn't a problem. It was just easier to code
> > because setting up a window with the correct size is so platform
> > specific.
>
> Ok. I guess the important question is what part of the code makes
> this decision. Ideally, the virtio-net glue would instantiate
> the device with the right number of queues.
>
Yeah, virtio doesn't work that way.
The virtio drivers just call find_vq() with a different index for each
queue they want to use. You have no way of knowing how many queues each
virtio driver will want, unless you go read their source code.
virtio-net currently uses 3 queues, but we only support the first two.
The third is optional (for now...), and non-symmetric.
Thanks again,
Ira

On Thursday 26 February 2009, Ira Snyder wrote:
> On Thu, Feb 26, 2009 at 09:37:14PM +0100, Arnd Bergmann wrote:
>
> The registers are part of the board control registers. They don't fit at
> all in the message unit. Doing this in the bootloader seems like a
> logical place, but that would require any testers to flash a new U-Boot
> image into their mpc8349emds boards.
>
> The first set of accesses sets up a 1GB region in the memory map
> that accesses the host's memory. Any reads/writes to addresses
> 0x80000000-0xc0000000 actually hit the host's memory.
>
> The last access sets up PCI BAR1 to hit the memory from
> dma_alloc_coherent(). The bootloader already sets up the window as 16K,
> it just doesn't point it anywhere. Maybe this /should/ go into the
> bootloader. Like above, it would require testers to flash a new U-Boot
> image into their mpc8349emds boards.
Ok, I see.
I guess the best option for doing it in Linux then would be to have
a board control driver (not sure if this already exists) that exports
high-level functions to set up the inbound and outbound windows.
> Yeah, virtio doesn't work that way.
>
> The virtio drivers just call find_vq() with a different index for each
> queue they want to use. You have no way of knowing how many queues each
> virtio driver will want, unless you go read their source code.
>
> virtio-net currently uses 3 queues, but we only support the first two.
> The third is optional (for now...), and non-symmetric.
I mean the part of your driver that calls register_virtio_device()
could make the decision; this is the part I was referring to
as virtio_net glue, because it is the only part that actually needs
to know about the features etc.
Right now, you just call register_virtio_net from vop_probe(), which
is absolutely appropriate for the specific use case. In the most
general case though, you would have a user interface on one or
both sides that allows a (root) user to trigger the creation of
a virtio_net (or other virtio) device with specific characteristics
such as MAC address or number of virtqueues.
One idea I had earlier was that there could be a special device
with just one virtqueue that is always present and that allows you
to communicate configuration changes regarding the available devices
to the remote VOP driver.
Arnd <><

On Thu, Feb 26, 2009 at 11:34:33PM +0100, Arnd Bergmann wrote:
> On Thursday 26 February 2009, Ira Snyder wrote:
> > On Thu, Feb 26, 2009 at 09:37:14PM +0100, Arnd Bergmann wrote:
> >
> > The registers are part of the board control registers. They don't fit at
> > all in the message unit. Doing this in the bootloader seems like a
> > logical place, but that would require any testers to flash a new U-Boot
> > image into their mpc8349emds boards.
> >
> > The first set of accesses sets up a 1GB region in the memory map
> > that accesses the host's memory. Any reads/writes to addresses
> > 0x80000000-0xc0000000 actually hit the host's memory.
> >
> > The last access sets up PCI BAR1 to hit the memory from
> > dma_alloc_coherent(). The bootloader already sets up the window as 16K,
> > it just doesn't point it anywhere. Maybe this /should/ go into the
> > bootloader. Like above, it would require testers to flash a new U-Boot
> > image into their mpc8349emds boards.
>
> Ok, I see.
>
> I guess the best option for doing it in Linux then would be to have
> a board control driver (not sure if this already exists) that exports
> high-level functions to set up the inbound and outbound windows.
>
Nothing like it exists. The OF device tree doesn't even describe these
registers. The code in arch/powerpc/sysdev/fsl_pci.c uses some registers
near these, but it gets their address by masking the low bits off the
addresses from the device tree and adding the offsets of the new
registers. Nasty.
I'll do this for now:
1) Get the message unit registers from my device tree
2) Encapsulate all use of get_immrbase() to a single function
That way it could be easily replaced in the future when something more
suitable comes along.
> > Yeah, virtio doesn't work that way.
> >
> > The virtio drivers just call find_vq() with a different index for each
> > queue they want to use. You have no way of knowing how many queues each
> > virtio driver will want, unless you go read their source code.
> >
> > virtio-net currently uses 3 queues, but we only support the first two.
> > The third is optional (for now...), and non-symmetric.
>
> I mean the part of your driver that calls register_virtio_device()
> could make the decision, this is the one I was referring to
> as virtio_net glue because it is the only part that actually needs
> to know about the features etc.
>
> Right now, you just call register_virtio_net from vop_probe(), which
> is absolutely appropriate for the specific use case. In the most
> general case though, you would have a user interface on one or
> both sides that allows a (root) user to trigger the creation of
> a virtio_net (or other virtio) device with specific characteristics
> such as MAC address or number of virtqueues.
>
I didn't think about this at all. This driver could be used to boot a
(guest) system over NFS, so in that case there isn't a userspace running
yet, to allow configuration. This is essentially my use case, though I
haven't implemented it yet.
Also, I hate designing user interfaces :) Any concrete suggestions on
design would be most welcome.
> One idea I had earlier was that there could be a special device
> with just one virtqueue that is always present and that allows you
> to communicate configuration changes regarding the available devices
> to the remote VOP driver.
>
That's an interesting idea that I didn't consider, either. It wouldn't
have to be fast, just reliable. When you're doing small transfers, the
CPU is just fine.
Ira

On Friday 27 February 2009, Ira Snyder wrote:
> On Thu, Feb 26, 2009 at 11:34:33PM +0100, Arnd Bergmann wrote:
> > I guess the best option for doing it in Linux then would be to have
> > a board control driver (not sure if this already exists) that exports
> > high-level functions to set up the inbound and outbound windows.
>
> Nothing like it exists. The OF device tree doesn't even describe these
> registers. The code in arch/powerpc/sysdev/fsl_pci.c uses some registers
> near these, but it gets their address by masking the low bits off the
> addresses from the device tree and adding the offsets of the new
> registers. Nasty.
>
> I'll do this for now:
> 1) Get the message unit registers from my device tree
> 2) Encapsulate all use of get_immrbase() to a single function
>
> That way it could be easily replaced in the future when something more
> suitable comes along.
Ok. However, I don't expect this to get fixed magically. Ideally,
you would start a new file for the board control in arch/powerpc/sysdev
and export the function from there, otherwise you do it the way you
suggested.
Then we can tell the fsl_pci and other people to use the same
method and source file to access the board control.
> I didn't think about this at all. This driver could be used to boot a
> (guest) system over NFS, so in that case there isn't a userspace running
> yet, to allow configuration. This is essentially my use case, though I
> haven't implemented it yet.
>
> Also, I hate designing user interfaces :) Any concrete suggestions on
> design would be most welcome.
Don't worry about it for now, just put all the hardcoded virtio_net
specific stuff into a file separate from the hardware specific
files so that we have a nice kernel level abstraction to build a
user abstraction on top of.
Arnd <><

Hi Grant,
> I like this a lot. I need to do much the same thing on one of my
> platforms, so I'm going to use your patch as my starting point. Have
> you made many changes since you posted this version of your patch?
> I'd like to collaborate on the development and help to get it
> mainlined.
>
> In my case I've got an MPC5200 as the 'host' and a Xilinx Virtex
> (ppc440) as the 'client'. I intend to set aside a region of the Xilinx
> Virtex's memory space for the shared queues. I'm starting work on it
> now, and I'll provide you with feedback and/or patches as I make
> progress.
I'll let Ira update you on the patch status.
If you want someone to chat about the hardware-level interaction,
feel free to chat off-list - assuming of course that no one wants
to hear us talk hardware :)
I selected the MPC8349EA in-part due to its PCI mailboxes, so
that we could implement this type of driver (an interlocked
handshake for flow control of data transfers). The previous
chip I'd used was a PLX PCI9054 Master/Target, and it has
similar registers.
I'm not sure if the Xilinx PCI core, or whatever PCI core you
are using, already has something like the mailboxes implemented,
but if not, it won't take much to code up some logic.
I prefer VHDL myself, but can speak Verilog if forced to :)
Cheers,
Dave

On Tue, Apr 14, 2009 at 3:23 PM, David Hawkins <dwh@ovro.caltech.edu> wrote:
> I'll let Ira update you on the patch status.
>
> If you want someone to chat about the hardware-level interaction,
> feel free to chat off-list - assuming of course that no one wants
> to hear us talk hardware :)
>
> I selected the MPC8349EA in-part due to its PCI mailboxes, so
> that we could implement this type of driver (an interlocked
> handshake for flow control of data transfers). The previous
> chip I'd used was a PLX PCI9054 Master/Target, and it has
> similar registers.
>
> I'm not sure if the Xilinx PCI core, or whatever PCI core you
> are using, already has something like the mailboxes implemented,
> but if not, it won't take much to code up some logic.
> I prefer VHDL myself, but can speak Verilog if forced to :)
Thanks David. I haven't looked closely at the xilinx pci data sheet
yet, but I don't expect too many issues in this area. As you say, it
won't take much to code it up. I'll be poking my VHDL engineer to
make it do what I want it to. :-)
I'll keep you up to date on my progress.
g.

Hi Grant,
> Thanks David. I haven't looked closely at the xilinx pci data sheet
> yet, but I don't expect too many issues in this area. As you say, it
> won't take much to code it up. I'll be poking my VHDL engineer to
> make it do what I want it to. :-)
The key aspects of the core will be that it is Master/Target
so that it can take over the PCI bus, and that it has a
DMA engine that can take care of most of the work. In
your case, since you have a DMA controller on the host
(MPC5200) and the target (Xilinx), your driver might end
up having nicer symmetry than our application. The
most efficient implementation will be the one that
uses PCI writes, i.e., MPC5200 DMAs to the Xilinx core,
and the Xilinx core DMAs to the MPC5200. If you use
a PCI Target only core, then the MPC5200 DMA controller
will have to do all the work, and read transfers might
be slightly less efficient.
Our target boards (PowerPC) live in compactPCI backplanes
and talk to x86 boards that do not have DMA controllers.
So the PCI target board DMA controllers are used to
transfer data efficiently to the x86 host (writes)
and less efficiently from the host to the boards
(reads). Our bandwidth requirements are 'to the host',
so we can live with the asymmetry in performance.
> I'll keep you up to date on my progress.
Sounds good.
Cheers,
Dave

On Tue, Apr 14, 2009 at 02:28:26PM -0600, Grant Likely wrote:
> On Mon, Feb 23, 2009 at 6:00 PM, Ira Snyder <iws@ovro.caltech.edu> wrote:
> > This adds support to Linux for using virtio between two computers linked by
> > a PCI interface. This allows the use of virtio_net to create a familiar,
> > fast interface for communication. It should be possible to use other virtio
> > devices in the future, but this has not been tested.
>
> Hey Ira,
>
> I like this a lot. I need to do much the same thing on one of my
> platforms, so I'm going to use your patch as my starting point. Have
> you made many changes since you posted this version of your patch?
> I'd like to collaborate on the development and help to get it
> mainlined.
>
This would be great. I'd really appreciate the help. I haven't had time
to make any changes since I last posted the patch. I started work on
converting all of the usage of struct vop_loc_* to just use the on-wire
structures, but I didn't get very far before other work got in the way.
> In my case I've got an MPC5200 as the 'host' and a Xilinx Virtex
> (ppc440) as the 'client'. I intend to set aside a region of the Xilinx
> Virtex's memory space for the shared queues. I'm starting work on it
> now, and I'll provide you with feedback and/or patches as I make
> progress.
>
I'm looking forward to seeing your implementation. If you have any
questions, I'd be happy to attempt to answer them :)
Ira

On Tue, Apr 14, 2009 at 3:52 PM, David Hawkins <dwh@ovro.caltech.edu> wrote:
> Hi Grant,
>
> > Thanks David. I haven't looked closely at the xilinx pci data sheet
> > yet, but I don't expect too many issues in this area. As you say, it
> > won't take much to code it up. I'll be poking my VHDL engineer to
> > make it do what I want it to. :-)
>
> The key aspects of the core will be that it is Master/Target
> so that it can take over the PCI bus, and that it has a
> DMA engine that can take care of most of the work. In
> your case, since you have a DMA controller on the host
> (MPC5200) and the target (Xilinx), your driver might end
> up having nicer symmetry than our application. The
> most efficient implementation will be the one that
> uses PCI writes, i.e., MPC5200 DMAs to the Xilinx core,
> and the Xilinx core DMAs to the MPC5200.
Hmmm, I hadn't thought about this. I was intending to use the
Virtex's memory region for all virtio, but if I can allocate memory
regions on both sides of the PCI bus, then that may be best.
> If you use
> a PCI Target only core, then the MPC5200 DMA controller
> will have to do all the work, and read transfers might
> be slightly less efficient.
I definitely intend to enable master mode on the Xilinx PCI controller.
> Our target boards (PowerPC) live in compactPCI backplanes
> and talk to x86 boards that do not have DMA controllers.
> So the PCI target board DMA controllers are used to
> transfer data efficiently to the x86 host (writes)
> and less efficiently from the host to the boards
> (reads). Our bandwidth requirements are 'to the host',
> so we can live with the asymmetry in performance.
Fortunately I don't have very high bandwidth requirements for the
first spin, so I have some room to experiment. :-)
g.

Hi Grant,
> Hmmm, I hadn't thought about this. I was intending to use the
> Virtex's memory region for all virtio, but if I can allocate memory
> regions on both sides of the PCI bus, then that may be best.
Sounds like you can experiment and see what works best :)
> > If you use
> > a PCI Target only core, then the MPC5200 DMA controller
> > will have to do all the work, and read transfers might
> > be slightly less efficient.
>
> I'll definitely intend to enable master mode on the Xilinx PCI controller.
Since you understand the lingo, you clearly understand
there are core differences :)
> > Our target boards (PowerPC) live in compactPCI backplanes
> > and talk to x86 boards that do not have DMA controllers.
> > So the PCI target board DMA controllers are used to
> > transfer data efficiently to the x86 host (writes)
> > and less efficiently from the host to the boards
> > (reads). Our bandwidth requirements are 'to the host',
> > so we can live with the asymmetry in performance.
>
> Fortunately I don't have very high bandwidth requirements for the
> first spin, so I have some room to experiment. :-)
Yes, in theory you have enough bandwidth ... then a
few features are added, the PCI core is not quite as
fast as advertised, etc etc :)
Cheers,
Dave

On Thu, Feb 26, 2009 at 3:49 PM, Ira Snyder <iws@ovro.caltech.edu> wrote:
> On Thu, Feb 26, 2009 at 09:37:14PM +0100, Arnd Bergmann wrote:
> > If the registers for setting up this window don't logically fit
> > into the same device as the one you already use, the cleanest
> > solution would be to have another device just for this and then
> > make a function call into that driver to set up the window.
>
> The registers are part of the board control registers. They don't fit at
> all in the message unit. Doing this in the bootloader seems like a
> logical place, but that would require any testers to flash a new U-Boot
> image into their mpc8349emds boards.
Alternately, the board platform code (arch/powerpc/platforms/83xx) is
an ideal place for 'fixups', i.e. to set up things that the firmware
really should do, but doesn't.
> > > Now, I wouldn't need to access these registers at all if the bootloader
> > > could handle it. I just don't know if it is possible to have Linux not
> > > use some memory that the bootloader allocated, other than with the
> > > mem=XXX trick, which I'm sure wouldn't be acceptable. I've just used
> > > regular RAM so this is portable to my custom board (mpc8349emds based)
> > > and a regular mpc8349emds. I didn't want to change anything board
> > > specific.
> > >
> > > I would love to have the bootloader allocate (or reserve somewhere in
> > > the memory map) 16K of RAM, and not be required to allocate it with
> > > dma_alloc_coherent(). It would save me plenty of headaches.
> >
> > I believe you can do that through the "memory" devices in the
> > device tree, by leaving out a small part of the description of
> > main memory and putting it into the "reg" property of your own
> > device.
>
> I'll explore this option. I didn't even know you could do this. Is a
> driver that requires the trick acceptable for mainline inclusion? Just
> like setting up the 16K PCI window, this is very platform specific.
Yup. You wouldn't even need to write any code to do this. Just
reduce the memory node's RAM size listed in the .dts file by 16K and
add a 16K region to the reg property for the messaging region.
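As a concrete illustration of that suggestion, a device tree fragment
might look like the following (the 256MB RAM size, addresses, and
compatible string here are hypothetical, not taken from the real
mpc8349emds.dts):

```dts
/* Hypothetical fragment: carve the top 16K out of a 256MB memory node
 * and describe it in a dedicated messaging-region node instead. */
memory {
	device_type = "memory";
	reg = <0x00000000 0x0fffc000>;	/* 256MB minus 16K visible to Linux */
};

vop-messaging@0fffc000 {
	compatible = "fsl,vop-messaging";	/* made-up binding name */
	reg = <0x0fffc000 0x00004000>;		/* the reserved 16K */
};
```

The driver would then ioremap() the region described by its own node
rather than calling dma_alloc_coherent(), since Linux never manages
that RAM.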
Speaking of which, the device tree changes should be adding two nodes:
one node to describe the messaging unit, and one node to describe the
virtio instance. The messaging unit is a general purpose piece of
hardware, so it is not appropriate to write a usage-specific device
driver that binds against it. I'm kind of working on this right now,
so I'll show you what I mean in patch form when I actually get things
running.
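A sketch of what that two-node split could look like (register
offsets, interrupt numbers, and compatible strings are guesses for
illustration only; Grant's actual patch may differ):

```dts
/* Hypothetical: a general-purpose message unit node, plus a separate
 * virtio node that merely references it by phandle. */
mu: message-unit@8030 {
	compatible = "fsl,mpc8349-mu";		/* made-up binding name */
	reg = <0x8030 0xd0>;
	interrupts = <69 0x8>;
	interrupt-parent = <&ipic>;
};

virtio@0 {
	compatible = "virtio,over-pci";		/* made-up binding name */
	msg-unit = <&mu>;			/* phandle to the message unit */
};
```

This keeps the message-unit binding reusable by other drivers, which
is the point being made above.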
g.

On Tue, Apr 14, 2009 at 3:53 PM, Ira Snyder<iws@ovro.caltech.edu> wrote:
> On Tue, Apr 14, 2009 at 02:28:26PM -0600, Grant Likely wrote:
>> On Mon, Feb 23, 2009 at 6:00 PM, Ira Snyder <iws@ovro.caltech.edu> wrote:
>> > This adds support to Linux for using virtio between two computers linked by
>> > a PCI interface. This allows the use of virtio_net to create a familiar,
>> > fast interface for communication. It should be possible to use other virtio
>> > devices in the future, but this has not been tested.
>>
>> Hey Ira,
>>
>> I like this a lot. I need to do much the same thing on one of my
>> platforms, so I'm going to use your patch as my starting point. Have
>> you made many changes since you posted this version of your patch?
>> I'd like to collaborate on the development and help to get it
>> mainlined.
>
> This would be great. I'd really appreciate the help. I haven't had time
> to make any changes since I last posted the patch. I started work on
> converting all of the usage of struct vop_loc_* to just use the on-wire
> structures, but I didn't get very far before other work got in the way.
>
>> In my case I've got an MPC5200 as the 'host' and a Xilinx Virtex
>> (ppc440) as the 'client'. I intend to set aside a region of the Xilinx
>> Virtex's memory space for the shared queues. I'm starting work on it
>> now, and I'll provide you with feedback and/or patches as I make
>> progress.
>
> I'm looking forward to seeing your implementation. If you have any
> questions, I'd be happy to attempt to answer them :)
Hey Ira,
I've been slowly hacking on your virtio-over-pci stuff. I've got an
initial series of cleanup patches which address some of the comments
from this thread. Before I send them to you, have you made any
changes on your end? They likely won't apply without changes to the
core code, so I'd like to sync up with you first.
Cheers,
g.

On Thu, Jun 11, 2009 at 08:22:54AM -0600, Grant Likely wrote:
> [snip]
>
> Hey Ira,
>
> I've been slowly hacking on your virtio-over-pci stuff. I've got an
> initial series of cleanup patches which address some of the comments
> from this thread. Before I send them to you, have you made any
> changes on your end? They likely won't apply without changes to the
> core code, so I'd like to sync up with you first.
I haven't made any changes to the code since you've last seen it. I've
been busy with other stuff for quite a while now. I can't wait to see
what you've done :)
At least for the use of the 83xx DMA controller, the DMA_SLAVE mode
patch I posted about a week ago (to the linuxppc-dev list) could make
the DMA setup much simpler. It hasn't been accepted into mainline yet.
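For readers unfamiliar with the DMA_SLAVE flow being referred to, a
rough sketch of how a client driver would push a buffer through a
slave-mode dmaengine channel follows. This is a hedged illustration
against the present-day dmaengine client API, not the 2009 patch
itself; the function name `vop_dma_send`, the channel filter, and the
bus-width choice are all assumptions.

```c
#include <linux/dmaengine.h>
#include <linux/dma-mapping.h>

/* Hypothetical: transfer one already-mapped buffer across the PCI
 * window using a DMA_SLAVE channel.  Error handling is minimal. */
static int vop_dma_send(dma_addr_t src, size_t len, dma_addr_t pci_window)
{
	dma_cap_mask_t mask;
	struct dma_chan *chan;
	struct dma_async_tx_descriptor *tx;
	struct dma_slave_config cfg = {
		.direction	= DMA_MEM_TO_DEV,
		.dst_addr	= pci_window,	/* target address on the bus */
		.dst_addr_width	= DMA_SLAVE_BUSWIDTH_4_BYTES,
	};

	dma_cap_zero(mask);
	dma_cap_set(DMA_SLAVE, mask);
	chan = dma_request_channel(mask, NULL, NULL);
	if (!chan)
		return -ENODEV;

	dmaengine_slave_config(chan, &cfg);
	tx = dmaengine_prep_slave_single(chan, src, len, DMA_MEM_TO_DEV,
					 DMA_PREP_INTERRUPT);
	if (!tx) {
		dma_release_channel(chan);
		return -EIO;
	}
	dmaengine_submit(tx);
	dma_async_issue_pending(chan);	/* completion arrives via callback */
	return 0;
}
```

The appeal of DMA_SLAVE here is that the controller-specific window
programming lives in the dmaengine provider, so the virtio transport
code stays generic.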
Ira