Please review this new (although, perhaps, the oldest) SCSI target framework for Linux, SCST, and 4 target drivers for it: for QLogic 22xx/23xx cards (qla2x00t), for iSCSI (iscsi-scst), for InfiniBand SRP (srpt) and for local access to SCST-provided backends, intended for testing and for creating target drivers in user space (scst_local).

The activity on the mailing lists of the existing Linux SCSI storage target implementations shows that many people are using Linux to build networked storage solutions. We believe that a SCSI storage target implementation should run inside the Linux kernel and not in user space. It would be a great service to the users of SCSI target software to include a mature and high-performance SCSI target framework in the mainline kernel. We are convinced that SCST is the implementation that is best suited for inclusion in the mainline Linux kernel. The current SCST implementation and selected target drivers are posted here as a series of patches to gather feedback about how inclusion in the mainline kernel should proceed. Any comments and suggestions would be greatly appreciated.

The posted modules are almost ready for inclusion into the kernel; the main thing left to be done is changing the user space interface from procfs to sysfs. If a procfs-based interface were allowed for new kernel modules, SCST and the above target drivers could be called fully mainline-ready. See the SCST proc interface patch below for a description of the proposed sysfs replacement.

The strengths of SCST are explained below by comparing SCST with STGT. Although we have a lot of respect for the STGT project and the people who have worked on it, the STGT SCSI framework doesn't satisfy its users in the following areas:

I. Unavoidable latency of extra user space processing.

Modern SCSI transports, e.g. InfiniBand, have a link latency in the range of 1-3 microseconds. This is the same order of magnitude as the time needed for a non-trivial system call on a modern system (between 1 and 4 microseconds) and only about ten times the cost of an empty system call (about 0.1 microseconds). So, just ten empty syscalls or a single context switch add as much latency as the link itself. Even 1 Gbps Ethernet has less than 100 microseconds of round-trip latency.

For instance, there was recently a comparison, using the SCST SRP target driver with a NULLIO backend, between processing done in a thread and processing done in SIRQ context (tasklet). The load was multithreaded 4K random writes from a single initiator. The advantage of SIRQ processing was quite impressive:

Another source of unavoidable additional latency with the user space approach is copying data to and from the page cache. Modern memory has a copy throughput of about 2.5 GB/s. So, with a high speed interface like 20 Gbps InfiniBand, the memory copy almost doubles the overall latency. With the fully in-kernel approach, the page cache can be used directly in a zero-copy manner, including by user space backend handlers (see below). The only way a zero-copy cache could be implemented in STGT is to completely duplicate the functionality of the page cache, including read-ahead, in user space.
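To put numbers on that: 20 Gbps is 2.5 GB/s, so, for example, transferring 1 MB takes 1 MB / 2.5 GB/s = 400 microseconds on the wire, and a single pass over that megabyte by a 2.5 GB/s memory copy adds another 400 microseconds. In other words, ignoring protocol overhead, the copy alone doubles the data transfer time.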

In the past there have been objections to the above argumentation, pointing out that the latency of backing storage (e.g. rotating magnetic disks) is a lot more than one microsecond, especially for seek-intensive workloads. This argument does not hold, however: system operators know very well that in order to reach acceptable performance, a network storage device has to run in write-back mode and has to be equipped with sufficient RAM. Nothing prevents a target from having 8 or even 64 GB of cache, so even most random accesses could be served from it.

II. Microkernel-like architecture and overall simplicity.

The general architecture of STGT doesn't fit well with the widely acknowledged Linux paradigm of avoiding microkernel-like distributed processing. STGT has two parts: an in-kernel "microkernel" and a user space part, which does most of the job, but not all of it. For a SCSI target, especially with a hardware target card, data arrives in the kernel and is eventually served by the kernel, which does the actual I/O or gets/puts the data from/to the cache. Dividing the request processing job between user and kernel space creates an unnecessary interface boundary and effectively makes request processing distributed, with all the complexity and reliability problems that entails. As an example, what currently happens in STGT if the user space part suddenly dies? Does the kernel part gracefully recover from it? How much effort would be needed to implement that?

Yes, such an architecture results in less kernel code, but at the expense of more complex user space code, a more complex interface between kernel and user space and, hence, more complicated maintenance of the combined code. See http://lkml.org/lkml/2007/4/24/364 for a perfect description of why.

III. Complete pass-through mode.

When a SCSI target provides direct access to its backend SCSI devices in pass-through mode, to comply with SAM it must intercept incoming commands and emulate some of the functionality of a SCSI host. This is necessary because the backend devices see only a single nexus (the SCSI target host), not each initiator as a separate nexus. See http://thread.gmane.org/gmane.linux.scsi/31288 for more details.

In particular, local access from the target to the exported backend SCSI devices should come from a separate nexus as well. To achieve that, all commands to those devices should go through the SCSI target engine to allow it to perform all the necessary emulation. Passing all local commands through a user space process is clearly unacceptable, because it would ruin the performance of the system. But with an in-kernel SCSI target it can be done simply and painlessly.

SCST addresses all the above issues. It has the best performance, a solid processing architecture and allows a straightforward, high performance implementation of local nexus handling. It also provides:

- Faster backend handlers in user space. The architecture, in which the SCSI target state machine and memory management live in the kernel, allows user space backend handlers to handle SCSI commands with 1 syscall per command, no additional context switches (no kernel threads involved in the processing) and no need to map/unmap user space pages. In the future, it is possible to add a splice-like interface, which would allow complete zero-copy with the page cache. (At the moment zero-copy is only on the target side, i.e. between SCST and the user space handlers; backend IO through the cache has to be done by the handlers using regular read()/write() calls, which copy data.) A rough sketch of such a handler loop follows after this list.

- Near complete 1:N pass-through mode.

- Advanced per-initiator device visibility management (LUN masking), which allows different initiators to see different sets of devices with different access permissions. For instance, initiator A could see devices X and Y exported from target T as read-writable, while initiator B could see from the same target T device Y read-only and device Z read-writable.

- Stable, mature and complete for all necessary basic functionality[*]. At least 3 companies have already designed and are selling products with SCST-based engines (for instance, http://www.mellanox.com/content/pages.php?pg=products_dyn&product_family=35&menu_section=34; unfortunately, I don't have the rights to name the others) and at least 5 companies are developing SCST-based products in the areas of HA solutions, VTLs, FC/iSCSI bridges, thin provisioning appliances, regular SANs, etc. How many commercially sold STGT-based products are there? You can also compare for yourself how widely SCST and STGT are used in production by simply looking at their mailing list archives: http://sourceforge.net/mailarchive/forum.php?forum_name=scst-devel for SCST and http://lists.wpkg.org/pipermail/stgt for STGT. In the SCST archive you can find many questions from people actively using it in production or developing their own SCST-based products. In the STGT archive you can find people implementing basic things, like a management interface, and fixing basic problems, like a crash when there are more than 40 LUNs per target (SCST has been successfully tested with >4000 LUNs). (STGT developers, no personal offense intended.)

Neither STGT nor LIO can claim to be complete in the basic functionality area. For instance, both lack support for transports which don't supply expected transfer values, e.g. parallel SCSI/SAS, and adding this feature would require major code surgery. SCST can do it *now*. Sure, there are still several areas for improvement (see, e.g., http://scst.sourceforge.net/contributing.html), but those are pretty advanced features, like zero-copy cache IO, persistent reservations, dynamic flow control, etc.
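Coming back to the user space backend handlers mentioned above, here is a minimal sketch of what such a one-syscall-per-command handler loop could look like. The device node, ioctl number and command structure below are invented for illustration only; they are not SCST's actual user space interface.

#include <fcntl.h>
#include <stdio.h>
#include <sys/ioctl.h>

/* Hypothetical per-command descriptor shared with the kernel. */
struct tgt_user_cmd {
	unsigned char cdb[16];   /* SCSI CDB to execute */
	void *buf;               /* data buffer, mapped once at setup time */
	unsigned int bufflen;
	int status;              /* SCSI status to return for this command */
};

int main(void)
{
	struct tgt_user_cmd cmd = { { 0 } };
	int fd = open("/dev/scst_user_example", O_RDWR); /* placeholder node */

	if (fd < 0) {
		perror("open");
		return 1;
	}
	for (;;) {
		/*
		 * One syscall per command: deliver the reply for the previous
		 * command (if any) and block until the next one is available.
		 * 0x1234 is a placeholder ioctl number.
		 */
		if (ioctl(fd, 0x1234, &cmd) < 0)
			break;
		/* Execute cmd.cdb against the backend storage here, reading
		 * or filling cmd.buf in place: no extra copies, no extra
		 * syscalls, no extra context switches. */
		cmd.status = 0; /* GOOD */
	}
	return 0;
}

The point is that the reply to the previous command and the fetch of the next one travel in a single kernel crossing, and the data buffer stays mapped for the lifetime of the handler.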

Thus, we believe that SCST should replace STGT in the kernel. From the kernel point of view STGT doesn't have any advantage over SCST, because SCST also allows creating backend device handlers and target mode drivers in user space. The only exception is the in-kernel target driver for IBM virtual SCSI (ibmvscsi), which was written for STGT and doesn't support SCST. So, until it is changed to work with SCST, I guess both SCSI target frameworks should coexist.

The target driver for iSER is completely in user space, so it will not be affected. And such drivers are exactly where SCST and the user space part of STGT can (and should) complement each other. The user space part of STGT is a good framework/library for creating user space target drivers, and together with the scst_local driver SCST can be a good backend for them. It would allow them to exploit all of SCST's features, including pass-through mode and user space backend handlers. The overhead of such an architecture would be (per command):

2. For vdisk mode (FILEIO/BLOCKIO) - the overhead of the SG/BSG driver, the SCSI subsystem/block layer and 2 context switches, which can be avoided (see below), minus STGT's own overhead, which it currently incurs to submit requests in this mode using pthreads (quite high, in fact).

In all three cases the data would be passed in a zero-copy manner; for user space backend handlers, via shared memory.

In all the cases I don't count the SCST overhead, because, basically, STGT at the moment does about the same processing. If the duplicated code were removed, this overhead would be almost completely eliminated.

The context switches referred to in (1) and (2) are necessary because at the moment queuecommand() is called with the host_lock held and IRQs disabled. If we can find a way to drop that lock and re-enable IRQs, those context switches wouldn't be needed.
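For illustration only, the kind of change meant here would look roughly like this inside an LLD's queuecommand() (a sketch against the 2.6.27 prototype, not code from any of the posted drivers; whether dropping the host lock there is acceptable is exactly the open question):

#include <linux/spinlock.h>
#include <scsi/scsi_cmnd.h>
#include <scsi/scsi_host.h>

static int example_queuecommand(struct scsi_cmnd *cmd,
				void (*done)(struct scsi_cmnd *))
{
	struct Scsi_Host *shost = cmd->device->host;

	/*
	 * The SCSI midlayer calls queuecommand() with shost->host_lock held
	 * and IRQs disabled, which is why the command currently has to be
	 * handed off to a thread (one extra context switch).
	 */
	spin_unlock_irq(shost->host_lock);

	/* ... if this were allowed, the command could be processed inline
	 * by the target engine right here, with no context switch ... */

	spin_lock_irq(shost->host_lock);
	done(cmd);	/* report completion to the midlayer */
	return 0;
}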

Thus, since people who decide to do something in user space are already willing to accept some performance loss, the overhead of the SG/BSG driver and the SCSI subsystem/block layer should be quite acceptable for them.

This set of patches places SCST in drivers/scst. It isn't drivers/scsi/scst, as one might expect, because SCST in fact shares almost nothing with the Linux SCSI subsystem. The relation between them is the same as between a client and a server: for instance, between Apache (HTTP server) and Firefox (HTTP client), or between the NFS client and server. How much code is shared between those? Near zero. For instance, in Linux 2.6.27:

$ find fs/nfs/ -name *.[ch] | xargs wc -l
...
30004 total
$ find fs/nfsd/ -name *.[ch] | xargs wc -l
...
19905 total
$ find fs/nfs_common/ -name *.[ch] | xargs wc -l
...
277 total

I.e., the NFS client and server share only 277 lines of code out of almost 50000 (<0.6%).

The same is true for the SCSI target and initiator. The only code I can see that could be shared between them is scsi_alloc_sgtable(), if it were exported instead of static as it is at the moment. Using this function would make scst_alloc() more robust, because of its use of a mempool. scst_alloc() is used for buffer allocation in corner-case processing. Everything else, including memory management, serves different needs, so making it shared would only make it more complicated for no gain. In particular, nothing of the SCSI target code can be used by the SCSI initiator code. For example, the SCSI initiator subsystem deals with memory already mapped from user space or allocated by the VM layer, and after the corresponding command completes, that memory can't be reused by it. In contrast, the SCSI target subsystem always manages memory itself, so reusing that memory after the corresponding command completes is a good opportunity to gain some performance. If you object and think that there is much more code which could be shared, please be specific and point out the *exact* functions to share.

STGT gives another example of why coupling the SCSI target and initiator subsystems together is a bad idea. Consider that I build a general purpose kernel for which 1% of the users would run target mode. I would have to enable as modules "SCSI target support" as well as "SCSI target support for transport attributes". Now the 99% of users of my kernel who don't need a SCSI target, but do need SCSI initiator drivers, would have to have scsi_tgt loaded, because the transport attribute drivers would depend on it:

No target functionality is needed, but the target mode subsystem has to be loaded. Is that good design? SCST doesn't have this issue.

Finally, why SCST and not LIO? The most important reasons:

1. LIO is terribly overengineered, hence the code isn't clear and is very hard (nearly impossible, actually) to review and audit. Its interfaces are a lot more complicated than SCST's, with basically the same functionality.

2. LIO only supports a software iSCSI target and is iSCSI-centric by design. A lot of work would be needed to allow a non-iSCSI target driver to run without any iSCSI-specific code loaded.

Also, a few years ago one of the main reasons for rejecting the core-iscsi iSCSI initiator from inclusion in the kernel was its support for MC/S. Later, to be accepted, support for MC/S was removed from open-iscsi. The LIO iSCSI target supports MC/S, and I'm not sure that times have changed so much that MC/S has now become an advantage from the kernel inclusion point of view.

SCST, in contrast, has clear, well commented and documented code (see http://scst.sourceforge.net/scst_pg.html), supports many target drivers, is transport-neutral by design, supports user space backend/target drivers and has an iSCSI target driver that is not overcomplicated by MC/S.

So, these patches add a completely new subsystem to Linux. The code layout follows the NFS example referred to above: the client is in drivers/scsi, the server is in drivers/scst. Possible future shared code can be placed in drivers/scsi_common.

Currently SCST is maintained separately from the kernel, so the patches sometimes contain code which isn't needed once SCST is included in the kernel. This code will be removed in the next iteration.

The patches in this iteration are made against 2.6.27.x. In the next iteration they will be prepared against whichever kernel version we are told to use (linux-next, I guess?).

The patch set is quite big, but self-contained, and makes only minimal, straightforward changes in other parts of the kernel. The only exception is the put_page_callback patch, which implements notifications from TCP about completion of data transmission. This patch is an optimization necessary for iSCSI-SCST to implement zero-copy data transfer from user space backend handlers. But this patch isn't required. Without it, data from the user space backend handlers will be transferred the regular way, with a data copy into the TCP send buffers, i.e. the same way as the STGT iSCSI target driver does. Data from in-kernel backends will still be transmitted zero-copy.
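Conceptually, the optimization is a notification when the networking stack releases its last reference to a transmitted page, so the sender knows the buffer may be reused. A rough sketch of that idea follows; the names and structure below are invented for illustration and are not the patch's actual interface.

#include <asm/atomic.h>

/* Invented bookkeeping for one zero-copy transmission. */
struct zc_tx_state {
	atomic_t pages_in_flight;          /* pages still referenced by TCP */
	void (*all_released)(void *data);  /* fires when that count hits 0 */
	void *data;
};

/*
 * Imagined hook called from put_page() for pages that were handed to the
 * stack as part of a zero-copy transmission.
 */
static void zc_page_released(struct zc_tx_state *state)
{
	if (atomic_dec_and_test(&state->pages_in_flight))
		state->all_released(state->data); /* buffer can be reused now */
}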

Here is the list of patches. They depend on each other; only put_page_callback.diff is independent and optional.

2. It looks like if a SCSI card is modern, i.e. it is still possible to buy it from its manufacturer (hence it is worth writing a driver for it), and it supports target mode, then it's sufficiently advanced to have sane limitations, like support for the full 64-bit address space, no length alignment requirements, etc. At least, so far I've not seen exceptions.

In Linux it has always been accepted that only features meeting real-life needs should be implemented, and "just in case" features with no real-life use should be rejected. Hence, I will start thinking about handling more hardware restrictions when such target hardware exists, not earlier. Until then, I'd prefer memory management simplicity. See section 2 subsection 4 of SubmittingPatches: "Don't over-design".

Also, I don't believe that the lack of SG chaining support in SCST is something to be fixed, because this is one of the areas where the requirements for target and initiator are completely opposite. To minimize latency, an initiator should build commands with as much data to transfer as possible, but a target should transfer that data in chunks as small as practical. This technique is called pipelining and is widely used in many areas, especially in modern CPUs. All SCSI transports I know of, including iSCSI, Fibre Channel, SRP, parallel SCSI and SAS, support pipelining. So, the 512K limit imposed by a single page of SG entries is more than sufficient for a SCSI target. I can elaborate more if someone is interested.
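For reference, the 512K figure is simply what fits in one page of SG entries: assuming 4 KB pages and 32-byte scatterlist entries, a single page holds 4096 / 32 = 128 entries, and 128 entries * 4 KB per entry = 512 KB per command.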

P.S. SCST can also be used with non-SCSI transports, like AoE. This is possible because, AFAIK, those transports' commands can be seen as a subset of SCSI. So, the only thing necessary to use them with SCST is to convert their internal commands to the corresponding SCSI commands, send those to SCST, and on the way back from SCST convert the SCSI sense codes to their internal status codes.
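As an illustration of that kind of conversion (a sketch only; build_read16_cdb() is a hypothetical helper, not part of the posted code), an AoE-style read of some number of sectors at a given LBA maps naturally onto a SCSI READ(16) CDB:

#include <stdint.h>
#include <string.h>

/* Fill a 16-byte READ(16) CDB for the given LBA and sector count. */
static void build_read16_cdb(uint8_t cdb[16], uint64_t lba, uint32_t sectors)
{
	memset(cdb, 0, 16);
	cdb[0] = 0x88;            /* READ(16) opcode */
	cdb[2] = lba >> 56;       /* bytes 2..9: logical block address */
	cdb[3] = lba >> 48;
	cdb[4] = lba >> 40;
	cdb[5] = lba >> 32;
	cdb[6] = lba >> 24;
	cdb[7] = lba >> 16;
	cdb[8] = lba >> 8;
	cdb[9] = lba;
	cdb[10] = sectors >> 24;  /* bytes 10..13: transfer length in blocks */
	cdb[11] = sectors >> 16;
	cdb[12] = sectors >> 8;
	cdb[13] = sectors;
}

On the way back, the SCSI status and sense data returned by SCST would be mapped onto the transport's own status codes.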