> What about block devices that could usefully use multi-path to achieve
> network redundancy, like nbd? If it's in the block layer or above, they
> can be made to work with minimal effort.
>
> My basic point is that the utility of the feature transcends SCSI, so
> SCSI is too low a layer for it.

I agree it has potential uses outside of SCSI, but this does not directly imply that we need to create a generic implementation. I have found no code to reference in other block drivers or in the block layer. I've looked some at the dasd code but can't figure out if or where there is any multi-path code.

Putting multi-path into the block layer means it would have to acquire and maintain a handle (i.e. path) for each device it knows about, and then eventually pass this handle down to the lower level. I don't see this happening in 2.5/2.6, unless someone is coding it right now.

It makes sense to at least expose the topology of the IO storage, whether or not the block or other layers can figure out what to do with the information. That is, ideally for SCSI we should have a representation of the target - like struct scsi_target - and then the target is multi-pathed, not the devices (LUNs, block or character devices) attached to the target. We should also have a bus or fabric representation, showing multi-path from the adapter's view (multiple initiators on the fabric or bus).
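To make the idea concrete, here is a minimal sketch of what hanging the paths off the target (rather than off each Scsi_Device/LUN) might look like. The names (scsi_path, the scsi_target layout, the helper) are hypothetical illustrations, not from any posted patch, and it's written as plain userspace C rather than real kernel code:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical sketch: paths hang off the target, not off each LUN.
 * Names and layout are illustrative only. */
struct scsi_path {
	int host_no;             /* adapter (initiator) this path goes through */
	int channel, id;         /* bus/fabric address of the target port */
	int active;              /* usable for I/O right now? */
	struct scsi_path *next;
};

struct scsi_target {
	struct scsi_path *paths; /* all known paths to this target */
	/* LUNs (sd/st/sg/sr devices) would attach here once,
	 * not once per path */
};

/* Count paths currently usable, for routing or failover decisions. */
static int scsi_target_active_paths(const struct scsi_target *t)
{
	int n = 0;
	const struct scsi_path *p;

	for (p = t->paths; p; p = p->next)
		if (p->active)
			n++;
	return n;
}
```

With this shape, replacing a switch or a failed array controller maps to marking the affected paths inactive while the target (and its LUNs) stays put.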

Whether or not the fabric or target information is used to route IO, they are useful for hardware removal/replacement. Imagine replacing a fibre switch, or replacing a failed controller on a raid array.

If all this information was in the device model (driver?), with some sort of function or data pointers, perhaps (in the 2.7.x timeframe) we could route IO and call appropriate drivers based on that information.

> > A major problem with multi-path in md or other volume manager is that
> > we use multiple (block layer) queues for a single device, when we
> > should be using a single queue. If we want to use all paths to a
> > device (i.e. round robin across paths or such, not a failover model)
> > this means the elevator code becomes inefficient, maybe even
> > counterproductive. For disk arrays, this might not be bad, but for
> > actual drives or even plugging single ported drives into a switch or
> > bus with multiple initiators, this could lead to slower disk
> > performance.
>
> That's true today, but may not be true in 2.6. Suparna's bio splitting
> code is aimed precisely at this and other software RAID cases.

Yes, but then we need some sort of md/RAID/volume manager aware elevator code + bio splitting, and perhaps avoid calling the elevator code normally called for a Scsi_Device. Though I can imagine splitting the bio in md and then still merging and sorting requests for SCSI.
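For the round-robin (non-failover) case the routing decision itself is simple once there is a single queue per device; the hard part is the elevator interaction above. A sketch of the path selection only, with all names (mp_device, mp_next_path) hypothetical:

```c
#include <assert.h>

/* Hypothetical: pick the next active path in round-robin order, so all
 * initiators share the load instead of one path doing failover-only duty. */
struct mp_device {
	int nr_paths;
	int last_path;           /* path used for the previous command */
	unsigned char active[8]; /* active[i] != 0 => path i is usable */
};

static int mp_next_path(struct mp_device *d)
{
	int tries;

	for (tries = 0; tries < d->nr_paths; tries++) {
		int i = (d->last_path + 1 + tries) % d->nr_paths;

		if (d->active[i]) {
			d->last_path = i;
			return i;
		}
	}
	return -1; /* no usable path: all have failed */
}
```

The failover model is the degenerate case of the same loop: all IO takes the first active path until it goes away.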

> > In the current code, each path is allocated a Scsi_Device, including
> > a request_queue_t, and a set of Scsi_Cmnd structures. Not only do we
> > end up with a Scsi_Device for each path, we also have an upper level
> > (sd, sg, st, or sr) driver attached to each Scsi_Device.
>
> You can't really get away from this. Transfer parameters are negotiated
> at the Scsi_Device level (i.e. per device path from HBA to controller),
> and LLDs accept I/O's for Scsi_Devices. Whatever you do, you still need
> an entity that performs most of the same functions as the Scsi_Device,
> so you might as well keep Scsi_Device itself, since it works.

Yes, negotiation is at the adapter level, but that does not have to be tied to a Scsi_Device. I need to search for Scsi_Device::hostdata usage to figure out details, and to figure out if anything is broken in the current scsi multi-path code - right now it requires that the same adapter drivers be used and that certain Scsi_Host parameters are equal if multiple paths to a Scsi_Device are found.

> > For sd, this means if you have n paths to each SCSI device, you are
> > limited to whatever limit sd has divided by n, right now 128 / n.
> > Having four paths to a device is very reasonable, limiting us to 32
> > devices, but with the overhead of 128 devices.
>
> I really don't expect this to be true in 2.6.

If we use a Scsi_Device for each path, we always have the overhead of the number of paths times the number of devices - upping the limits of sd certainly helps, but we are then increasing the possibly large amount of memory that we can waste. And, other devices besides disks can be multi-pathed.

> > Using a volume manager to implement multiple paths (again non-failover
> > model) means that the queue_depth might be too large if the
> > queue_depth (i.e. number of outstanding commands sent to the drive)
> > is set as a per-device value - we can end up sending n * queue_depth
> > commands to a device.
>
> The queues tend to be in the controllers, not in the RAID devices, thus
> for a dual path RAID device you usually have two caching controllers and
> thus twice the queue depth (I know this isn't always the case, but it
> certainly is enough of the time for me to argue that you should have the
> flexibility to queue per path).

You can have multiple initiators on FCP or SPI, without dual controllers involved at all. Most of my multi-path testing has been with dual ported FCP disk drives, with multiple FCP adapters connected to a switch, not with disk arrays (I don't have any non-failover multi-ported disk arrays available, I'm using a fastt 200 disk array); I don't know the details of the drive controllers for my disks, but putting multiple controllers in a disk drive certainly would increase the cost.

Yes, per path queues and per device queues are reasonable; per path queues require knowledge of the actual device ports, which is not in the current scsi multi-path patch. The code I have now uses the Scsi_Host::can_queue to limit the number of commands sent to a host. I really need slave_attach() support in the host adapter (like Doug L's patch a while back), plus maybe a slave_attach_path(), and/or a queue limit per path.

Per path queues are not required, as long as any queue limits do not hinder performance.

> SCSI got into a lot of trouble by going down the "kernel doesn't have X
> feature I need, so I'll just code it into the SCSI mid-layer instead",
> I'm loath to accept something into SCSI that I don't think belongs there
> in the long term.
>
> Answer me this question:
>
> - In the foreseeable future does multi-path have uses other than SCSI?
>
> I've got to say, I can't see a "no" to that one, so it fails the high
> level bar to getting into the scsi subsystem. However, the kernel, as
> has been said before, isn't a theoretical exercise in design, so is
> there a good expediency argument (like "it will take one year to get
> all the features of the block layer to arrive and I have a customer
> now"). Also, to go in under expediency, the code must be readily
> removable against the day it can be redone correctly.

Yes, there could be future multi-path users, DASD maybe among them. If we take SCSI and DASD as existing usage, they could be a basis for a block layer (or generic) set of multi-path interfaces.

There is code available for scsi multi-path; this is not a design in theory. Anyone can take the code and fold it into a block layer implementation or other approach. I would be willing to work on scsi usage or such for any new block level or other such code for generic multi-path use. At this time I wouldn't feel comfortable adding to or modifying block layer interfaces and code, nor do I think it is possible to come up with the best interface given only one block driver implementation, nor do I think there is enough time to get this into 2.5.x.

IMO, there is demand for scsi multi-path support now, as users move to large databases requiring higher availability. md or a volume manager for failover is adequate in some of these cases.

I see other issues as being more important to scsi - like cleaning it up or rewriting portions of the code, but we still need to add new features as we move forward.

> > Generic device naming consistency is a problem if multiple devices
> > show up with the same id.
>
> Patrick Mochel has an open task to come up with a solution to this.

I don't think this can be solved if multiple devices show up with the same id. If I have five disks that all say "I'm disk X", how can there be one name or handle for it from user level?

> > With the scsi layer multi-path, ide-scsi or usb-scsi could also do
> > multi-path IO.
>
> The "scsi is everything" approach got its wings shot off at the kernel
> summit, and subsequently confirmed its death in a protracted wrangle on
> lkml (I can't remember the reference off the top of my head, but I'm
> sure others can).

Agreed, but having the block layer be everything is also wrong.

My view is that md/volume manager multi-pathing is useful with 2.4.x, scsi layer multi-path for 2.5.x, and this (perhaps with DASD) could then evolve into generic block level (or perhaps integrated with the device model) multi-pathing support for use in 2.7.x. Do you agree or disagree with this approach?