Kernel Korner - Using DMA

DMA makes I/O go faster by letting devices read and write memory without bothering the CPU. Here's how the kernel keeps track of changes that happen behind the CPU's back.

DMA stands for direct memory access and refers to the ability of
devices or other entities in a computing system to modify
main memory contents without going through the CPU. The desirability of DMA
lies in not troubling the CPU; the system can simply request that the
data be fetched into a particular memory region and continue with
other tasks until the data is ready. Most of the problems in DMA,
however, stem from that same lack of CPU involvement.

The problems with DMA are threefold. First,
the CPU probably is operating a memory management unit.
Therefore, the address the CPU uses to describe the memory region is not the
same as the physical address of main memory.
Second, because the transfer is to main memory, the caches
between that memory and the CPU probably are not coherent (see
“Understanding Caching”, LJ, January
2004).
Third, there also may be a memory
management unit on the I/O bus (called an IOMMU). This means the bus
address the device uses to transfer the data may not be the same as
the physical memory address or the CPU's virtual memory address.
This concept is alien to most x86 developers. Even there, though, the
use of GARTs (graphics address remapping tables) for the AGP bus is
eroding the x86 world's traditional avoidance of
IOMMUs.

The API that manages DMA in the Linux kernel must take into
account and solve all three of these problems. In addition, because
most DMA is done from devices on an external bus, three additional
problems may occur. First,
the I/O device addressing width may be different from
the address width of physical memory. For instance, an ISA device is
limited to addressing 24 bits, and some PCI devices in 64-bit systems
are limited to addressing 32 bits.
Second, the I/O bus controller circuitry itself may cache
requests. This occurs mainly on the PCI bus, where write requests
may be held in the PCI controller in the hope that it may accumulate
them for rapid transfer to the device. This phenomenon is called PCI
posting. Third, the operating system may request a transfer to a region
that is contiguous in its virtual memory space but
fragmented in physical memory, usually because the
requested transfer crosses multiple pages. Such a transfer must be
accomplished using scatter/gather (SG) lists.
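
To make the first two of these bus problems concrete, the sketch
below shows a driver declaring a 32-bit addressing limit and then
flushing a posted PCI write by reading a register back. It is only an
illustration: the register offset, command bit and surrounding
function are invented, though dma_set_mask(), writel() and readl()
are the kernel's real interfaces.

#include <linux/device.h>
#include <linux/dma-mapping.h>
#include <asm/io.h>

#define REG_CMD   0x00  /* hypothetical command register */
#define CMD_START 0x01  /* hypothetical "go" bit */

static int hypothetical_start(struct device *dev, void __iomem *ioaddr)
{
        /* Declare the device's addressing width: here, a device
         * limited to 32 bits on a possibly 64-bit system.  The
         * platform code refuses if it cannot guarantee bus
         * addresses that fit within the mask. */
        if (dma_set_mask(dev, 0xffffffffULL))
                return -EIO;

        /* PCI posting: this write may linger in the PCI
         * controller rather than reach the device at once... */
        writel(CMD_START, ioaddr + REG_CMD);

        /* ...so read any register of the same device back,
         * which forces all posted writes out ahead of it. */
        readl(ioaddr + REG_CMD);
        return 0;
}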

This article deals strictly with the DMA API for devices.
The new generic device model in Linux 2.6 provides a nice way of
describing device characteristics and finding their bus properties
using a hierarchical tree. The interfaces described
have undergone considerable revision in the transition from 2.4 to
2.6. Although the general principles of this article apply to
2.4, the API described and the kernel capabilities apply only to the
2.6 kernel.
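
As a taste of the 2.6 API before going further, its simplest
operation allocates a buffer the CPU and the device can share: one
call returns both a CPU virtual address and a bus address, with the
platform code taking care of any IOMMU programming and cache
coherency. The sketch below is illustrative only; the surrounding
function is invented, but dma_alloc_coherent() and
dma_free_coherent() are the real interfaces.

#include <linux/device.h>
#include <linux/gfp.h>
#include <linux/dma-mapping.h>

static int hypothetical_alloc_ring(struct device *dev)
{
        dma_addr_t bus_addr;
        void *cpu_addr;

        /* One allocation, two views of it: cpu_addr is what the
         * CPU dereferences; bus_addr is what the device is
         * programmed with.  The region stays coherent, so no
         * explicit cache management is needed. */
        cpu_addr = dma_alloc_coherent(dev, 4096, &bus_addr, GFP_KERNEL);
        if (!cpu_addr)
                return -ENOMEM;

        /* ... hand bus_addr to the device; read and write the
         * buffer through cpu_addr ... */

        dma_free_coherent(dev, 4096, cpu_addr, bus_addr);
        return 0;
}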

SG Lists

For any DMA transfer, the first problem to consider is that the
user may request a large transfer (kilobytes to megabytes) to a given
buffer. Because of the way virtual memory is managed, however, this
area, which is contiguous in virtual space, may be composed of a
sequence of pages fragmented all over physical memory. Linux expects
that any transfer above a page size (4KB on an x86 system) needs
to be described by an SG list.
Ordinarily, these lists are constructed by the block I/O (BIO) layer.
A key job of the device driver is to tell the BIO layer how it may
divide the I/O into SG list elements, as the sketch below shows.
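
A hedged sketch of that parameterization follows, assuming a device
whose hardware accepts at most 16 SG entries of up to 64KB each; the
limits and the function are invented, but the blk_queue_*() calls are
the real 2.6 block-layer interfaces.

#include <linux/blkdev.h>

static void hypothetical_setup_queue(struct request_queue *q)
{
        /* Tell the BIO layer how it may slice I/O for this device. */
        blk_queue_max_phys_segments(q, 16);   /* entries as laid out in
                                                 physical memory */
        blk_queue_max_hw_segments(q, 16);     /* entries the hardware takes,
                                                 after any IOMMU merging */
        blk_queue_max_segment_size(q, 65536); /* no element over 64KB */
        blk_queue_segment_boundary(q, 0xffffffff); /* none crossing 4GB */
}

With these limits in place, the BIO layer never builds a request the
device cannot express.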

Almost every device that transfers large amounts of data is
designed to accept these transfers as some form of SG
list. Although the exact form of this list is likely to differ from
the one supplied by the kernel, conversion usually is trivial.
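
The conversion usually amounts to mapping the kernel's list and
copying each resulting bus address and length into the device's own
descriptor format, roughly as in the sketch below (the function and
the descriptor-writing step are invented; dma_map_sg(),
sg_dma_address(), sg_dma_len() and dma_unmap_sg() are the real
interfaces):

#include <linux/device.h>
#include <linux/dma-mapping.h>
#include <asm/scatterlist.h>

static void hypothetical_program_sg(struct device *dev,
                                    struct scatterlist *sg, int nents)
{
        int i, count;

        /* Map the list for a device-to-memory transfer.  count may
         * come back smaller than nents: an IOMMU may have coalesced
         * adjacent entries into fewer bus-contiguous regions. */
        count = dma_map_sg(dev, sg, nents, DMA_FROM_DEVICE);

        for (i = 0; i < count; i++) {
                dma_addr_t addr = sg_dma_address(&sg[i]);
                unsigned int len = sg_dma_len(&sg[i]);

                /* ... write addr and len into the device's own
                 * SG descriptor format here ... */
        }

        /* after the I/O completes, unmap with the original nents: */
        dma_unmap_sg(dev, sg, nents, DMA_FROM_DEVICE);
}

Note that the loop runs over the count dma_map_sg() returns, not over
nents; the IOMMU coalescing described next is exactly what makes the
two differ.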

I/O Memory Management Units (IOMMUs)

Figure 1. Address Domains in DMA

An IOMMU is a memory management unit that goes between the I/O
bus (or hierarchy of buses) and main memory. This MMU is
separate from the MMU built in to the CPU. In order to effect a
transfer from the device to main memory, the IOMMU must be programmed
with the address translations for the transfer in almost exactly the
same way as the CPU's MMU would be. One advantage of
this arrangement is that an SG list generated by the BIO layer can be
programmed into the IOMMU so that the memory region
appears contiguous again to the device on
the bus.

GARTs and IOMMU Bypass

A GART basically is like a
simple IOMMU. It consists of a window in physical memory and a list
of pages. Its job is to remap physical addresses in the window to
physical pages in the list. The window typically is narrow, only about
128MB or so, and any accesses to physical memory outside
this window are not remapped.
A GART has only a limited amount of remapping space, however, and
once that is exhausted, nothing more may be mapped until some I/O
completes and frees up mapping space. This limitation exposes a
weakness in the way the Linux kernel currently handles DMA: none of
the DMA APIs has a failure return for failing to map the memory.

Sometimes, like a GART, an IOMMU may be programmed
not to do address remapping between the I/O bus
and the memory in certain windows. This is called bypass mode and
may not be possible for all types of IOMMU. Bypass mode is
sometimes desirable, because remapping adds a performance cost to
every transfer, so lifting the IOMMU out of the way can increase
throughput.

The BIO layer, however, assumes that if an IOMMU is present,
it is being used, and it calculates the space needed for the device
SG list accordingly. Currently, no way exists to inform the BIO
layer that the device wishes to bypass the IOMMU. Worse, when the
BIO layer assumes an IOMMU is present, it also assumes the IOMMU
is coalescing SG entries. Thus, if the device driver decides to
bypass the IOMMU, it may find itself with more SG entries than the
device allows.

Both of these issues are being worked on in the 2.6
kernel. A fix for the IOMMU bypass already is under consideration
and will be invisible to driver writers, because the platform code will choose
when to do the bypass. The fix for the inability to map probably will
consist of making the
mapping APIs return failure. Because
this fix affects every DMA driver in the system, implementing it is going
to be slow.
