SCSI Mid-layer
Eric Youngdale
2nd Annual Linux Storage Management Workshop
October 2000
Introduction
Main points of this talk:
– Historical evolution of Linux SCSI.
– Explain state of the art in Linux 2.2.
– Discuss changes for 2.4.
– Discuss pending changes in the 2.5 kernel.
Block devices and Linux
• Linux has a generic block device layer with
which all filesystems will interact.
• SCSI is no different in this regard – it
registers itself with the block device layer
so it can receive requests.
• SCSI also handles character device requests
and ioctls that do not originate in the block
device layer.
What is the “Mid-Layer”?
• Linux SCSI support can be viewed as 3 levels.
• Upper level is device management, such as
tape, cdrom, disk, etc.
• Lower level talks to host adapters.
• Middle layer is essentially a traffic cop,
handling requests from the rest of the kernel, and
dispatching them to the rest of SCSI.
State of the art in Linux-2.2
• Error handling is better for drivers
that make use of the new error handling code,
which was introduced in 2.2.
• Queue management fundamentally
unchanged since the Linux 1.x days. “The
Code that Time Forgot”. Lots of dinosaurs
running around in the code.
• Rest of mid-level largely stagnant.
What was wrong in 2.2?
• The elevator algorithms in 2.2 allowed requests to
grow regardless of the capabilities of the
underlying device.
• All SCSI disks were handled in a single queue.
• Disk driver had to split requests that had become
too large.
• One set of common logic for verifying that requests
had not become too large.
What was wrong in 2.2 (cont)
• Character device requests not in queue.
• SMP safety was clumsily handled, leading
to race conditions and poor performance.
• Poor scalability.
• Many drivers continue to use old error
handling code.
Queue handling in 2.2
[Diagram: a single disk queue. The queue head points to a chain of
requests for different disks, in order: Disk1, Disk2, Disk1, Disk3, Disk1.]
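With every disk sharing one queue, as on this slide, the driver has to walk past requests for busy disks to find one it can issue. The sketch below is a user-space model of that scan, not the 2.2 kernel code; the names sim_request and sim_find_queueable and the fixed NUM_DISKS are hypothetical and exist only for illustration.

```c
#include <stddef.h>

/* Hypothetical user-space model; none of these names are kernel APIs. */
#define NUM_DISKS 4

struct sim_request {
    int disk;                   /* index of the target disk */
    struct sim_request *next;   /* next request in the shared queue */
};

/*
 * Walk the single shared queue and unlink the first request whose
 * target disk is idle. Requests for busy disks are skipped, so the
 * cost grows with the number of requests queued for busy disks.
 */
static struct sim_request *
sim_find_queueable(struct sim_request **head, const int busy[NUM_DISKS])
{
    struct sim_request **pp;

    for (pp = head; *pp != NULL; pp = &(*pp)->next) {
        struct sim_request *req = *pp;

        if (!busy[req->disk]) {
            *pp = req->next;    /* unlink from the shared queue */
            req->next = NULL;
            return req;
        }
    }
    return NULL;                /* every queued request targets a busy disk */
}
```

With the queue order shown on the slide and Disk1 busy, the scan skips the first Disk1 request and returns the Disk2 request behind it.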
Changes for Linux-2.4
• Block device layer was generalized to
support a “request_queue_t” abstract
datatype that represents a queue.
• Contains function pointers that drivers can
use for managing the size of requests
inserted into queues.
• Requests no longer can grow to be too large
to be handled at one time.
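A rough sketch of the idea on this slide: the queue carries a function pointer, and the generic layer asks the owning driver before it lets a request grow. This is a user-space model under stated assumptions, not the kernel's request_queue_t; sim_queue, sim_req, sim_merge_ok, sim_try_merge and the max_sectors field are all hypothetical names.

```c
#include <stddef.h>

/* Hypothetical user-space model; not the kernel's request_queue_t. */
struct sim_req {
    unsigned int nr_sectors;    /* current size of this request */
};

struct sim_queue;
typedef int (sim_merge_fn)(struct sim_queue *q, struct sim_req *req,
                           unsigned int extra_sectors);

struct sim_queue {
    unsigned int max_sectors;   /* limit imposed by the underlying device */
    sim_merge_fn *merge_fn;     /* supplied by the driver that owns the queue */
};

/* Driver callback: allow the merge only if the result still fits. */
static int sim_merge_ok(struct sim_queue *q, struct sim_req *req,
                        unsigned int extra_sectors)
{
    return req->nr_sectors + extra_sectors <= q->max_sectors;
}

/* Generic layer: ask the driver before growing a request. */
static int sim_try_merge(struct sim_queue *q, struct sim_req *req,
                         unsigned int extra_sectors)
{
    if (!q->merge_fn(q, req, extra_sectors))
        return 0;               /* caller must start a new request instead */
    req->nr_sectors += extra_sectors;
    return 1;
}
```

Because the check happens at insertion time, a request in this model never exceeds the device limit, so nothing downstream has to split it.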
Changes for 2.4 (cont)
• No longer any need for splitting requests.
• No need for ugly logic to scan a queue for a
queueable request.
• SMP locking in mid-layer cleaned up to
provide finer granularity.
Changes for 2.4 (cont)
• A SCSI queuing library was created – a set
of functions for queue management that are
tailored to different sets of requirements.
• SCSI was modified to use a single queue for
each physical device.
• Character device requests and ioctls are
inserted into the same queue at the tail, and
handled the same as other requests.
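The per-device queue with tail insertion can be modelled in a few lines. This is a simplified user-space sketch, not the 2.4 implementation: every name is hypothetical, and it appends all commands at the tail, whereas in the kernel the placement of block-layer requests is decided by the elevator.

```c
#include <stddef.h>

/* Hypothetical user-space model of one queue per physical device. */
enum sim_origin { SIM_FROM_BLOCK_LAYER, SIM_FROM_CHAR_OR_IOCTL };

struct sim_cmd {
    enum sim_origin origin;     /* where the command came from */
    struct sim_cmd *next;
};

struct sim_device_queue {
    struct sim_cmd *head;       /* next command to dispatch */
    struct sim_cmd *tail;       /* last command queued */
};

/* Append a command at the tail of the device's queue. */
static void sim_enqueue_tail(struct sim_device_queue *q, struct sim_cmd *cmd)
{
    cmd->next = NULL;
    if (q->tail != NULL)
        q->tail->next = cmd;
    else
        q->head = cmd;
    q->tail = cmd;
}

/* Remove and return the command at the head, or NULL if the queue is empty. */
static struct sim_cmd *sim_dequeue(struct sim_device_queue *q)
{
    struct sim_cmd *cmd = q->head;

    if (cmd != NULL) {
        q->head = cmd->next;
        if (q->head == NULL)
            q->tail = NULL;
        cmd->next = NULL;
    }
    return cmd;
}
```

The dispatch path only ever looks at the head of one device's queue, so an ioctl queued behind a read is issued after that read without any special casing.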
Queuing library
Maintainability is a problem if multiple instances of
code perform similar functions.
__inline static int
__scsi_merge_requests_fn(request_queue_t * q,
                         struct request * req,
                         struct request * next,
                         int use_clustering,
                         int dma_host)
{
    /* Appropriate contents */
}
Queuing library (cont)
#define MERGEREQFCT(_FUNCTION, _CLUSTER, _DMA) \
static int _FUNCTION(request_queue_t * q, \
                     struct request * req, \
                     struct request * next) \
{ \
    return __scsi_merge_requests_fn(q, req, next, _CLUSTER, _DMA); \
}
MERGEREQFCT(scsi_merge_requests_fn_, 0, 0)
MERGEREQFCT(scsi_merge_requests_fn_d, 0, 1)
MERGEREQFCT(scsi_merge_requests_fn_c, 1, 0)
MERGEREQFCT(scsi_merge_requests_fn_dc, 1, 1)
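The same technique can be tried outside the kernel. The sketch below is a user-space analogue, not the SCSI code: sim_merge_common, SIM_MERGEFCT, the integer request lengths and the 64/128 limits are made up for illustration. Because the common function is inline and each wrapper passes constants, the compiler can discard the branches that do not apply to that variant.

```c
/* Hypothetical stand-in for the shared merge logic. */
static inline int
sim_merge_common(int req_len, int next_len, int use_clustering, int dma_host)
{
    int limit = dma_host ? 64 : 128;    /* made-up limits for illustration */

    if (!use_clustering)
        return 0;                       /* never merge without clustering */
    return req_len + next_len <= limit;
}

/* Stamp out one thin wrapper per combination of the two flags. */
#define SIM_MERGEFCT(_FUNCTION, _CLUSTER, _DMA)                 \
static int _FUNCTION(int req_len, int next_len)                 \
{                                                               \
    return sim_merge_common(req_len, next_len, _CLUSTER, _DMA); \
}

SIM_MERGEFCT(sim_merge_, 0, 0)
SIM_MERGEFCT(sim_merge_d, 0, 1)
SIM_MERGEFCT(sim_merge_c, 1, 0)
SIM_MERGEFCT(sim_merge_dc, 1, 1)
```

All four wrappers share one body of logic, so a fix to sim_merge_common reaches every variant, which is the maintainability point the slides make.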
Changes for 2.4 (cont)
• In 2.2, there were separate functions and
code paths for initializing SCSI for the case
of compiled into kernel and loaded via
modules.
• In 2.4, this was cleaned up – redundant code
was removed, and the same code is used to
initialize for both modules and compiled
into kernel.
Upcoming changes for 2.5
• All drivers will be forced to use new error
handling code.
• Disk driver will be updated to handle larger
number of disks.
• SMP locking will be cleaned up some more
to improve scalability.
Old error handling code
• Essentially a bad state machine.
• Has tons of SMP problems that are not
easily fixed.
• Tries to resolve errors while allowing new
requests to be queued.
• Many kernel reliability problems are
caused by the old error handling code.
• Needs to be discarded in the worst way.
New error handling code
• The new error handling code has been
available since the 2.1.75 kernel.
• To force driver authors to update their
drivers, the old error handling code will
simply be removed. Drivers that have not
been updated will fail to compile.
• Orphaned drivers will be handled on a case-
by-case basis.
Further SMP cleanups
• All low-level drivers currently use
io_request_lock for SMP safety.
• This lock is also used by all other block
devices on the system to protect their
queues.
• Plans are in the works to switch the block
device layer to use a per-queue lock,
thereby isolating SCSI from other devices.
SMP Cleanups (cont)
• Low-level drivers don’t need to protect
queue – they don’t have access to it.
• Each low-level driver should have a
separate lock – ideally one per instance of
host, but could be a driver-wide lock
initially. This should be up to the low-level
driver.
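One way to picture a lock per host instance is sketched below. It is a user-space illustration built on a C11 atomic_flag spinlock, not kernel spinlock code, and sim_host, sim_host_lock, sim_host_unlock and sim_host_start_cmd are hypothetical names.

```c
#include <stdatomic.h>

#define SIM_MAX_HOSTS 2

/* Hypothetical per-host state; each host instance carries its own lock. */
struct sim_host {
    atomic_flag lock;           /* held only while touching this host */
    unsigned int active_cmds;   /* state protected by the lock */
};

static struct sim_host sim_hosts[SIM_MAX_HOSTS] = {
    { ATOMIC_FLAG_INIT, 0 },
    { ATOMIC_FLAG_INIT, 0 },
};

static void sim_host_lock(struct sim_host *h)
{
    while (atomic_flag_test_and_set_explicit(&h->lock, memory_order_acquire))
        ;                       /* spin until the holder releases the lock */
}

static void sim_host_unlock(struct sim_host *h)
{
    atomic_flag_clear_explicit(&h->lock, memory_order_release);
}

/* Account for a new command on one host without touching any other host. */
static void sim_host_start_cmd(struct sim_host *h)
{
    sim_host_lock(h);
    h->active_cmds++;
    sim_host_unlock(h);
}
```

Holding the lock of one host does not delay work on another, which is the isolation a single system-wide lock cannot give. A driver-wide lock would correspond to all entries of sim_hosts sharing one atomic_flag.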
SMP Cleanups (cont)
• Block device layer has a number of arrays,
indexed by major/minor:
blksize_size[MAJOR(dev)][MINOR(dev)]
• Access is not protected by any locks.
• Impossible for block drivers to resize
without introducing race condition.
Large numbers of disks
• Current disk driver allocates 8 majors,
allowing for only 128 disks.
• Plans are in the works to allow disk driver
to dynamically allocate major numbers.
• Would support up to about 4000 disks,
at which point major numbers are exhausted.
• Possible to go beyond this by using fewer
bits for partitions.
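The figures on this slide follow from the device number layout. The sketch below assumes 8-bit major and 8-bit minor numbers with 4 minor bits reserved for partitions; the helper sim_max_disks and the SIM_ constants are hypothetical names. The 4096 result is an upper bound, since other drivers also occupy major numbers.

```c
/* Assumption: 8-bit major and 8-bit minor device numbers. */
enum {
    SIM_MINOR_BITS = 8,
    SIM_MAJOR_BITS = 8,
    SIM_PARTITION_BITS = 4      /* 16 minors per disk */
};

/* Disks addressable with a given number of majors and partition bits. */
static unsigned int sim_max_disks(unsigned int majors,
                                  unsigned int partition_bits)
{
    unsigned int disks_per_major = 1u << (SIM_MINOR_BITS - partition_bits);

    return majors * disks_per_major;
}
```

Under these assumptions, 8 majors give 8 * 16 = 128 disks, all 256 majors give 256 * 16 = 4096, and dropping to 3 partition bits doubles that to 8192.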
Wish list.
• Implement some SCSI-3 features (larger
commands, sense buffers).
• Improve support for shared busses.
• Support target-mode.
• Check module add/remove code for SMP
safety, implement locks.
• Improvements related to high-availability.
Conclusions
The major goal of a rewrite of SCSI queuing has
been accomplished. A number of architectural
problems were resolved at the same time.
There are still some interesting tasks to be
addressed for 2.5.
See http://www.andante.org/scsi.html for more
info, and http://www.andante.org/scsi_todo.html for
“todo” list.
Contacts
Email: eric@andante.org
Web: http://www.andante.org
The notes for this talk are on the website.