Kernel Korner - Storage Improvements for 2.6 and 2.7

The Linux 2.6 kernel has improved Linux's storage capabilities with advances such as the anticipatory I/O scheduler and support for storage arrays and distributed filesystems.

Storage has changed rapidly during the last decade. Prior to that,
server-class disks were proprietary in all senses of the word:
they used proprietary protocols, they generally were sold by the server
vendor, and a given server generally owned its disks, with shared-disk
systems being few and far between.

When SCSI moved up from PCs to mid-range servers in the mid 1990s,
things opened up a bit. The SCSI standard permitted multiple initiators
(servers) to share targets (disks). If you carefully chose
compatible SCSI components and did a lot of stress testing, you could build
a shared SCSI disk cluster. Many such clusters were used in datacenter
production in the 1990s, and some persist today.

One also had to be careful not to exceed the 25-meter SCSI-bus length
limit, particularly when building three- and four-node clusters. Of course,
the penalty for exceeding the length is not a deterministic oops
but flaky disk I/O. This limitation required that disks be
interspersed among the servers.

The advent of FibreChannel (FC) in the mid-to-late 1990s improved this
situation considerably. Although compatibility was, and to some extent
still is, a problem, the multi-kilometer FC lengths greatly simplified
datacenter layout. In addition, most of the FC-connected RAID arrays
export logical units (LUNs) that can, for example, be striped or mirrored
across the underlying physical disks, simplifying storage administration.
Furthermore, FC RAID arrays provide LUN masking and FC switches provide
zoning, both of which allow controlled disk sharing. Figure 1 illustrates an
example in which server A is permitted to access disks 1 and 2 and server
B is permitted to access disks 2 and 3. Disks 1 and 3 are private,
while disk 2 is shared, with the zones indicated by the grey rectangles.

Figure 1. FibreChannel allows for LUN masking and zoning. Server A can
access disks 1 and 2, and server B can access disks 2 and 3.

This controlled sharing makes block-structured centralized storage
much more attractive. This in turn permits distributed filesystems to
provide the same semantics as do local filesystems, while still providing
reasonable performance.

Block-Structured Centralized Storage

Modern inexpensive disks and servers have greatly reduced the cost of
large server farms. Properly backing up each server can be
time consuming, however, and keeping up with disk failures can be a challenge.
The need for backup motivates centralizing data, so that disks physically
located on each server need not be backed up. Backups then can be
performed at the central location.

The centralized data might be stored on an NFS server. This is a
reasonable approach, one that is useful in many cases, especially as
NFS v4 goes mainstream. However, servers sometimes need direct block-level
access to their data:

A given server may need a specific filesystem's features, such
as ACLs, extended attributes or logging.

A particular application may need better performance or robustness
than protocols such as NFS can provide.

Some applications may require local filesystem semantics.

In some cases, it may be easier to migrate from local disks to
RAID arrays.

However, the 2.4 Linux kernel presents some challenges in working with large
RAID arrays, including storage reconfiguration, multipath I/O, support for
large LUNs and support for large numbers of LUNs.
The 2.6 kernel promises to help in many of these areas, although there
are some areas of improvement left for the 2.7 development effort.

Storage Reconfiguration

Because most RAID arrays allow LUNs to be created, removed
and resized dynamically, it is important that the Linux kernel react
to these actions,
preferably without a reboot. The Linux 2.6 kernel permits this by way of the
/sys filesystem, which replaced the earlier /proc interfaces. For example,
the following command causes the kernel to forget about the device at busid
3, channel 0, target 7, LUN 1:

echo "1" > \
/sys/class/scsi_host/host3/device/3:0:7:1/delete

The busid of 3 is redundant with the 3 in host3.
This format also is used, however, in contexts where the busid is required, such
as in /sys/bus/scsi/devices.
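
The addressing scheme above can be sketched as a pair of small shell helpers. This is only an illustration: the function names scsi_delete_path and parse_hctl are hypothetical, and the path layout is taken directly from the example command above, so it may need adjusting if a given kernel arranges its /sys tree differently. The helpers only build and print strings; nothing is written to /sys.

```shell
#!/bin/sh
# Hypothetical helpers (not part of any kernel or distribution tooling).

# Build the sysfs "delete" path for a device from its busid, channel,
# target and LUN. The layout is assumed from the example in the text.
scsi_delete_path() {
    host=$1 channel=$2 target=$3 lun=$4
    printf '/sys/class/scsi_host/host%s/device/%s:%s:%s:%s/delete\n' \
        "$host" "$host" "$channel" "$target" "$lun"
}

# Split an address such as 3:0:7:1 into four space-separated fields.
parse_hctl() {
    old_ifs=$IFS
    IFS=:
    set -- $1          # word-split the address on ':'
    IFS=$old_ifs
    echo "$1 $2 $3 $4"
}

# Print the path for the device used in the text.
scsi_delete_path 3 0 7 1
```

Assuming the path exists on the running kernel, the printed path could then be used as root in the same way as the command above, for example: echo "1" > "$(scsi_delete_path 3 0 7 1)". Printing the path first makes it easy to check the address before the kernel is told to forget the device.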