18.8. Solaris Device Driver Framework

Let's quickly look at how network device drivers were implemented before Solaris 10 and why they needed to change with the new Solaris 10 stack.

18.8.1. GLDv2 and DLPI Drivers (Solaris 9 and Prior)

Before the Solaris 10 release, the network stack depended on Data-Link Provider Interface (DLPI) providers, which were normally implemented in one of two ways. Figure 18.5 illustrates two stacks: one based on a monolithic DLPI driver and one based on a driver utilizing the generic LAN driver (GLDv2) module.

Figure 18.5. GLDv2 and DLPI Stacks

The GLDv2 module essentially behaves like a library. The client still talks to the driver instance bound to the device, but the DLPI protocol processing is handled by a call into the GLDv2 module, which then calls back into the driver to access the hardware. Using the GLD module has a clear advantage in that the driver writer need not reimplement large amounts of mostly generic DLPI protocol processing. Layer 2 (Data-Link) features such as 802.1q virtual LANs (VLANs) can also be implemented centrally in the GLD module, where they can be leveraged by all drivers. The architecture still poses a problem, though, with respect to implementing features such as 802.3ad link aggregation (a.k.a. trunking) where the one-to-one correspondence between network interface and device is broken.

Both GLDv2 and monolithic drivers depend on DLPI messages and communicate with upper layers through the STREAMS framework. This mechanism was relatively ineffective for link aggregation or 10-Gbit NICs. With the new stack, we needed a better mechanism that could ensure data locality and allow the stack to control the device drivers at much finer granularity to deal with interrupts.

18.8.2. A New Architecture: GLDv3

The Solaris 10 release introduced a new device driver framework called GLDv3 (internal name project Nemo) along with the new stack. Most of the major device drivers were ported to this framework, and all future and 10-Gbit device drivers will be based on this framework. This framework also provided a STREAMS-based DLPI layer for backward compatibility (to allow external, non-IP modules to continue to work).

The GLDv3 architecture virtualizes layer 2 of the network stack. There is no longer a one-to-one correspondence between network interfaces and devices. Figure 18.6 shows multiple devices registered with a MAC Services (MAC) module. It also shows two clients: one traditional client that communicates through DLPI to a data-link driver (DLD) and a kernel-based client that simply makes direct function calls into the Data-Link Services (DLS) module.

Figure 18.6. GLDv3 Architecture

18.8.2.1. GLDv3 Drivers

GLDv3 drivers are similar to GLD drivers. The driver must be linked with a dependency on the misc/mac and misc/dld kernel modules. To register with the MAC module, it must call mac_register() with a pointer to an instance of the mac_t structure.

This structure must persist for the lifetime of the registration; that is, it cannot be deallocated until after mac_unregister() is called. A GLDv3 driver's _init(9E) entry point is also required to call mac_init_ops() before calling mod_install(9F), and its _fini(9E) entry point is required to call mac_fini_ops() after calling mod_remove(9F).
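The call ordering described above can be sketched in user-space C. This is only an illustration: the mock_* routines are hypothetical stand-ins that record the order of calls, and they do not reproduce the real kernel signatures of mac_init_ops(), mod_install(9F), mod_remove(9F), or mac_fini_ops(). Undoing mac_init_ops() when the install step fails is an assumption of this sketch, not something the text states.

```c
#include <assert.h>

/* Steps recorded by the mock routines, in call order. */
enum step { MAC_INIT_OPS, MOD_INSTALL, MOD_REMOVE, MAC_FINI_OPS };

#define MAX_STEPS 8
enum step steps[MAX_STEPS];
int nsteps;
int fail_install;               /* non-zero simulates a failing mod_install() */

void record(enum step s)
{
    assert(nsteps < MAX_STEPS);
    steps[nsteps++] = s;
}

/* Hypothetical stand-ins; the real kernel routines take arguments. */
void mock_mac_init_ops(void) { record(MAC_INIT_OPS); }
void mock_mac_fini_ops(void) { record(MAC_FINI_OPS); }
int mock_mod_install(void) { record(MOD_INSTALL); return fail_install ? -1 : 0; }
int mock_mod_remove(void) { record(MOD_REMOVE); return 0; }

/* Shape of _init(9E): mac_init_ops() first, then mod_install(). */
int drv_init(void)
{
    int rc;

    mock_mac_init_ops();
    rc = mock_mod_install();
    if (rc != 0)
        mock_mac_fini_ops();    /* assumption: roll back on failure */
    return rc;
}

/* Shape of _fini(9E): mod_remove() first, then mac_fini_ops(). */
int drv_fini(void)
{
    int rc = mock_mod_remove();

    if (rc == 0)
        mock_mac_fini_ops();
    return rc;
}
```

Calling drv_init() followed by drv_fini() records the four steps in the order the text requires.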

The following are important members of the mac_t structure:

m_impl. This field is used by the MAC module to point to its private data. It must not be read or modified by a driver.

m_driver. This field should be set by the driver to point to its private data. This value is supplied as the first argument to the driver entry points.

m_dip. This field must be set to the dev_info_t pointer of the driver instance calling mac_register().
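A minimal user-space sketch of how a driver might fill in these three members and respect the lifetime rule follows. The mac_t here contains only the members named above, and mock_mac_register(), mock_mac_unregister(), drv_attach(), drv_detach(), and drv_soft_t are hypothetical names introduced for illustration; the real structure has more members and the real kernel routines may differ.

```c
#include <assert.h>
#include <stdlib.h>

/* Stand-in for the kernel's dev_info_t. */
typedef struct dev_info { int instance; } dev_info_t;

/* Reduced mac_t: only the three members discussed in the text. */
typedef struct mac {
    void       *m_impl;     /* owned by the MAC module, driver must not touch */
    void       *m_driver;   /* driver private data, passed to entry points */
    dev_info_t *m_dip;      /* dev_info_t of the registering instance */
} mac_t;

int registered;             /* mock MAC module state */

/* Hypothetical stand-ins for mac_register()/mac_unregister(). */
int mock_mac_register(mac_t *mp)
{
    assert(mp != NULL && mp->m_dip != NULL);
    mp->m_impl = &registered;   /* the MAC module sets its own private pointer */
    registered = 1;
    return 0;
}

int mock_mac_unregister(mac_t *mp)
{
    assert(mp != NULL);
    mp->m_impl = NULL;
    registered = 0;
    return 0;
}

/* Hypothetical per-instance driver state. */
typedef struct drv_soft { mac_t *macp; } drv_soft_t;

int drv_attach(drv_soft_t *sp, dev_info_t *dip)
{
    /* Heap allocation: the structure must outlive this function. */
    mac_t *mp = calloc(1, sizeof (*mp));

    if (mp == NULL)
        return -1;
    mp->m_driver = sp;
    mp->m_dip = dip;
    if (mock_mac_register(mp) != 0) {
        free(mp);
        return -1;
    }
    sp->macp = mp;
    return 0;
}

int drv_detach(drv_soft_t *sp)
{
    if (mock_mac_unregister(sp->macp) != 0)
        return -1;
    free(sp->macp);         /* safe only after unregistration */
    sp->macp = NULL;
    return 0;
}
```

The point of the design is that the structure is allocated once per instance and released only after unregistration succeeds, never held on the stack of the attach routine.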

Key MAC layer functions include the following:

m_stat(). Entry point that retrieves a value for one of the statistics defined in the mac_stat_t enumeration (below). All values are stored and returned in 64-bit unsigned integers. Values are not requested for statistics that the driver has not explicitly declared to be supported.

m_start(). Entry point that brings the device out of the reset/quiesced state it was in when the interface was registered. No packets are submitted by the MAC module for transmission, and no packets are submitted by the driver for reception before this call is made. If this function succeeds, then zero is returned. If it fails, then an appropriate errno value is returned.

m_stop(). Entry point that stops the device and puts it in a reset/quiesced state such that the interface can be unregistered. No packets are submitted by the MAC for transmission once this call has been made, and no packets are submitted by the driver for reception once it has completed.

m_promisc(). Entry point that sets the promiscuity of the device. If the second argument is B_TRUE, then the device receives all packets on the media. If it is set to B_FALSE, then only packets destined for the device's unicast address and the media broadcast address are received.

m_multicst(). Entry point that adds and removes addresses to and from the set of multicast addresses for which the device will receive packets. If the second argument is B_TRUE, then the address pointed to by the third argument is added to the set. If the second argument is B_FALSE, then the address pointed to by the third argument is removed.

m_unicst(). Entry point that sets a new device unicast address. Once this call is made, then only packets with the new address and the media broadcast address are received unless the device is in promiscuous mode.

m_resources(). Entry point that requests that the driver register its individual receive resources or RX rings.

m_tx(). Entry point that submits packets for transmission by the device. The second argument points to one or more packets contained in mblk_t structures. Fragments of the same packet are linked by the b_cont field. Separate packets are linked by the b_next field in the leading fragment. Packets are scheduled for transmission in the order in which they appear in the chain. Any remaining chain of packets that cannot be scheduled is returned. If m_tx() returns packets that cannot be scheduled, the driver must call mac_tx_update() when resources become available. If all packets are scheduled for transmission, then NULL is returned.
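The return convention of m_tx() can be illustrated with a small user-space model. The mblk_t below is a reduced stand-in carrying only the two link fields, and drv_state_t, frag_count(), and drv_m_tx() are hypothetical names; the sketch also assumes, for simplicity, that each fragment consumes one transmit descriptor.

```c
#include <assert.h>
#include <stddef.h>

/* Reduced stand-in for the STREAMS mblk_t: link fields only. */
typedef struct mblk {
    struct mblk *b_next;    /* next packet in the chain */
    struct mblk *b_cont;    /* next fragment of the same packet */
} mblk_t;

/* Hypothetical driver state. */
typedef struct {
    unsigned tx_free;       /* free transmit descriptors */
    unsigned tx_sent;       /* packets scheduled so far */
    int      tx_blocked;    /* set when packets were handed back unsent */
} drv_state_t;

/* Number of fragments in one packet (walks b_cont). */
unsigned frag_count(const mblk_t *mp)
{
    unsigned n = 0;

    for (; mp != NULL; mp = mp->b_cont)
        n++;
    return n;
}

/*
 * Schedule packets in chain order. Returns NULL if every packet was
 * scheduled, otherwise the remaining chain starting at the first
 * packet that did not fit.
 */
mblk_t *drv_m_tx(void *arg, mblk_t *chain)
{
    drv_state_t *dp = arg;

    assert(dp != NULL);
    while (chain != NULL) {
        unsigned need = frag_count(chain);
        mblk_t *next;

        if (need > dp->tx_free) {
            /* A real driver would later call mac_tx_update(). */
            dp->tx_blocked = 1;
            break;
        }
        next = chain->b_next;
        chain->b_next = NULL;   /* detach the packet being scheduled */
        dp->tx_free -= need;
        dp->tx_sent++;
        chain = next;
    }
    return chain;
}
```

With three free descriptors and a chain of a one-fragment packet, a two-fragment packet, and another one-fragment packet, the first two are scheduled and the third is returned to the caller.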

The mac_resource_add() function should be called from the m_resources() entry point to register individual receive resources (commonly, ring buffers of DMA descriptors) with the MAC module. The returned mac_resource_handle_t value should then be supplied in calls to mac_rx(). The second argument to mac_resource_add() specifies the resource being added. Resources are specified by the mac_resource_t structure. Currently, only resources of type MAC_RX_FIFO are supported. MAC_RX_FIFO resources are described by the mac_rx_fifo_t structure.

The upper layers use the mac_blank() function to control the interrupt rate of the device. The first argument is the device context that is to be used as the first argument to the poll_blank() function.

The fields mrf_normal_blank_time and mrf_normal_pkt_cnt specify the default interrupt interval and packet count threshold, respectively. These parameters can be the second and third arguments to mac_blank() when the upper layer wants the driver to revert to the default interrupt rate.

The interrupt rate is controlled by the upper layer by a call to poll_blank() with different arguments. The interrupt rate can be increased or decreased: the upper layer passes a multiple of these values to the last two arguments of mac_blank(). Setting these values to zero disables the interrupts, and the NIC is deemed to be in polling mode.

mac_poll() is the driver-supplied function that upper layers use to retrieve a chain of packets from an RX ring. The first argument identifies the ring; it is the mrf_arg value supplied earlier through mac_resource_add(). The second argument specifies the maximum number of packets to return.
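A polling routine of this shape might look like the sketch below. It is a simplification: the ring is modeled as a list of already-received packets rather than DMA descriptors, and rx_ring_t and ring_poll() are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>

/* Reduced stand-in for the STREAMS mblk_t: link fields only. */
typedef struct mblk {
    struct mblk *b_next;    /* next packet in the chain */
    struct mblk *b_cont;    /* next fragment of the same packet */
} mblk_t;

/* Hypothetical RX ring: packets received but not yet delivered. */
typedef struct {
    mblk_t *head;
    mblk_t *tail;
} rx_ring_t;

/*
 * Remove up to max_pkts packets from the front of the ring and
 * return them as a b_next-linked chain, or NULL if none are waiting.
 */
mblk_t *ring_poll(void *arg, unsigned max_pkts)
{
    rx_ring_t *rp = arg;
    mblk_t *first, *last;
    unsigned n;

    assert(rp != NULL);
    if (rp->head == NULL || max_pkts == 0)
        return NULL;

    first = last = rp->head;
    for (n = 1; n < max_pkts && last->b_next != NULL; n++)
        last = last->b_next;

    rp->head = last->b_next;    /* what stays on the ring */
    if (rp->head == NULL)
        rp->tail = NULL;
    last->b_next = NULL;        /* terminate the returned chain */
    return first;
}
```

Bounding the chain by a caller-supplied count lets the upper layer decide how much work to pull from the ring on each pass.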

The function mac_resource_update() is invoked by the driver when available resources have changed.

The mac_rx() function delivers a chain of packets, contained in mblk_t structures, for reception. The b_cont field links fragments of the same packet. The b_next field of the leading fragment links separate packets. If the packet chain was received by a registered resource, then the appropriate mac_resource_handle_t value should be supplied as the second argument to the function. The protocol stack uses this value as a hint when trying to load-spread across multiple CPUs. It is assumed that packets belonging to the same flow are always received by the same resource. If the resource is unknown or is unregistered, then NULL should be passed as the second argument.
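The chain layout that mac_rx() expects can be illustrated with a small helper pair. The mblk_t here is a reduced stand-in with only the two link fields, and pkt_chain_t, chain_append(), and chain_count() are hypothetical names used for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Reduced stand-in for the STREAMS mblk_t: link fields only. */
typedef struct mblk {
    struct mblk *b_next;    /* next packet, set on the leading fragment */
    struct mblk *b_cont;    /* next fragment of the same packet */
} mblk_t;

/* Hypothetical helper for building a chain in arrival order. */
typedef struct {
    mblk_t *head;
    mblk_t *tail;
} pkt_chain_t;

/* Append one packet (its leading fragment) to the end of the chain. */
void chain_append(pkt_chain_t *c, mblk_t *pkt)
{
    assert(c != NULL && pkt != NULL);
    pkt->b_next = NULL;
    if (c->tail == NULL)
        c->head = pkt;
    else
        c->tail->b_next = pkt;
    c->tail = pkt;
}

/* Count packets (b_next links) and fragments (b_cont links). */
void chain_count(const mblk_t *chain, unsigned *pkts, unsigned *frags)
{
    const mblk_t *p, *f;

    assert(pkts != NULL && frags != NULL);
    *pkts = 0;
    *frags = 0;
    for (p = chain; p != NULL; p = p->b_next) {
        (*pkts)++;
        for (f = p; f != NULL; f = f->b_cont)
            (*frags)++;
    }
}
```

A driver following this layout would build one such chain per receive resource, so that the chain can be delivered together with that resource's handle.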

18.8.2.3. Data-Link Services Module

The Data-Link Services (DLS) module provides the Data-Link Services interface analogous to DLPI. The DLS interface is a kernel-level functional interface, as opposed to the STREAMS message-based interface specified by DLPI. This module provides the interfaces necessary for the upper layer to create and destroy a data link service. It also provides the interfaces necessary to plumb and unplumb the NIC. The plumbing and unplumbing of an NIC for GLDv3-based device drivers is unchanged from the older GLDv2 or monolithic DLPI device drivers. The major changes are in data paths that allow direct calls, packet chains, and much finer-grained control over an NIC.

18.8.2.4. Data-Link Driver

The Data-Link Driver (DLD) provides a DLPI by using interfaces from the DLS and MAC modules. The driver is configured by ioctls passed to a control node. These ioctls create and destroy separate DLPI provider nodes. This module deals with DLPI messages necessary to plumb and unplumb the NIC and affords backward compatibility for the data path through STREAMS for non-GLDv3-aware clients.

18.8.3. GLDv3 Link Aggregation Architecture

The GLDv3 framework supports link aggregation as defined by IEEE 802.3ad. The key principles governing the design of this facility are these:

Allow GLDv3 MAC drivers to be aggregated without code change.

Preserve the performance of nonaggregated devices.

Keep overhead due to aggregation to a minimum. That is, the performance of aggregated devices should be the cumulative line rate for each member.

Support both manual configuration and the Link Aggregation Control Protocol (LACP).

GLDv3 link aggregation is implemented by means of a pseudo-driver called aggr. It registers virtual ports corresponding to link aggregation groups with the GLDv3 MAC layer. It uses the client interface provided by the MAC layer to control and communicate with aggregated MAC ports as illustrated in Figure 18.7. It also exports a pseudo aggr device driver that the dladm(1M) command uses to configure and control the link-aggregated interface. Once a MAC port is configured to be part of a link aggregation group, it cannot be simultaneously accessed by other MAC clients such as the DLS layer. The exclusive access is enforced by the MAC layer. LACP is implemented by the aggr driver, which has access to individual MAC ports or links.

Figure 18.7. GLDv3 Link Aggregation Architecture

The GLDv3 aggr driver acts as a normal MAC module to the upper layer and appears as a standard NIC interface which, once created with dladm(1M), can be configured and managed by the ifconfig(1M) command. The aggr module registers each MAC port that is part of the aggregation with the upper layer by using the mac_resource_add() function, such that the data paths and interrupts from each MAC port can be independently managed by the upper layers (see Section 18.9.2).

In short, the aggregated interface is managed as a single interface with possibly one IP address, and the data paths are managed as individual NICs by unique CPUs and squeues. This management scheme gives aggregation capability to Solaris 10 with near zero overhead and linear scalability with respect to the number of MAC ports that are part of the aggregation group.

18.8.4. Checksum Offload

Solaris 10 improved the hardware checksum offload capability further to improve overall performance for most applications. A 16-bit, one's complement, checksum offload framework has existed in Solaris for some time. It was originally added as a requirement for Zero Copy TCP/IP in the Solaris 2.6 release but was only recently extended to handle other protocols. Solaris 10 defines two classes of checksum offload:

Full. Complete checksum calculation in the hardware, including pseudo-header checksum computation for TCP and UDP packets. The hardware is assumed to be able to parse protocol headers.

Partial. Dumb one's complement checksum based on start, end, and stuff offsets describing the span of the checksummed data and the location of the transport checksum field, with no pseudo-header calculation ability in the hardware.
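The partial class can be modeled in a few lines of user-space C. This is a sketch of what such hardware is assumed to do, not any adapter's actual behavior: ocsum() and partial_cksum_stuff() are hypothetical names, offsets are byte offsets into the packet buffer, and the checksum field is assumed to already hold whatever seed the stack wants included in the sum (for example, a pseudo-header sum).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * 16-bit one's complement sum of buf[start..end), big-endian words.
 * An odd trailing byte is padded with a zero byte.
 */
uint16_t ocsum(const uint8_t *buf, size_t start, size_t end)
{
    uint32_t sum = 0;
    size_t i;

    assert(buf != NULL && start <= end);
    for (i = start; i + 1 < end; i += 2)
        sum += ((uint32_t)buf[i] << 8) | buf[i + 1];
    if (i < end)
        sum += (uint32_t)buf[i] << 8;
    while (sum >> 16)                   /* fold carries back in */
        sum = (sum & 0xffff) + (sum >> 16);
    return (uint16_t)sum;
}

/*
 * Emulate partial offload: sum the span from start to end and store
 * the complement at the stuff offset. No headers are parsed.
 */
void partial_cksum_stuff(uint8_t *pkt, size_t start, size_t end,
    size_t stuff)
{
    uint16_t ck;

    assert(pkt != NULL && start <= stuff && stuff + 2 <= end);
    ck = (uint16_t)~ocsum(pkt, start, end);
    pkt[stuff] = (uint8_t)(ck >> 8);
    pkt[stuff + 1] = (uint8_t)(ck & 0xff);
}
```

Because the hardware only needs three offsets, it can checksum any protocol whose checksum is a plain one's complement sum, which is what distinguishes this class from full offload.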

Adding support for nonfragmented IPV4 cases (unicast or multicast) is trivial for both transmit and receive since most modern network adapters support either class of checksum offload with minor differences in the interface. The IPV6 cases are not as straightforward, because very few full-checksum network adapters can handle checksum calculation for TCP/UDP packets over IPV6.

The fragmented IP cases have similar constraints. On transmit, checksumming applies to the unfragmented datagram. An adapter that is to support checksum offload must be able to buffer all the IP fragments (or perform the fragmentation in hardware) before finally calculating the checksum and sending the fragments over the wire; until then, checksum offloading for outbound IP fragments cannot be done. On the other hand, the receive fragment reassembly case is more flexible since most full-checksum (and all partial-checksum) network adapters can compute and provide the checksum value to the network stack. During the fragment reassembly stage, the network stack can derive the checksum status of the unfragmented datagram by combining all the values.

Things are simplified by not offloading the checksum when IP options are present. For partial-checksum offload, certain adapters limit the start offset to a width sufficient for simple IP packets. When the length of protocol headers exceeds such a limit (because certain options are present), the start offset wraps around, causing an incorrect calculation. For full-checksum offload, none of the capable adapters correctly handle the IPV4 source routing option.

When transmit checksum offload takes place, the network stack associates eligible packets with ancillary information needed by the driver to offload the checksum computation to hardware.

In the inbound case, the driver has full control over the packets that become associated with hardware-calculated checksum values. Once a driver advertises its capability through DL_CAPAB_HCKSUM, the network stack accepts full- or partial-checksum information for IPV4 and IPV6 packets. This process happens for both nonfragmented and fragmented payloads.

Fragmented packets first need to be reassembled because checksum validation happens for fully reassembled datagrams. During reassembly, the network stack combines the hardware-calculated checksum value of each fragment.
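Combining per-fragment values works because one's complement addition is associative. The sketch below shows the arithmetic only; ocsum() and ocsum_add() are hypothetical helper names, and the handling of per-fragment IP headers and the pseudo-header is left out. It relies on every fragment except the last having an even length, which holds for IP fragments because their payloads are multiples of 8 bytes.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* 16-bit one's complement sum of buf[start..end), big-endian words. */
uint16_t ocsum(const uint8_t *buf, size_t start, size_t end)
{
    uint32_t sum = 0;
    size_t i;

    assert(buf != NULL && start <= end);
    for (i = start; i + 1 < end; i += 2)
        sum += ((uint32_t)buf[i] << 8) | buf[i + 1];
    if (i < end)                        /* odd trailing byte */
        sum += (uint32_t)buf[i] << 8;
    while (sum >> 16)
        sum = (sum & 0xffff) + (sum >> 16);
    return (uint16_t)sum;
}

/* One's complement addition of two 16-bit partial sums. */
uint16_t ocsum_add(uint16_t a, uint16_t b)
{
    uint32_t s = (uint32_t)a + b;

    /* A single end-around carry is enough for two 16-bit inputs. */
    return (uint16_t)((s & 0xffff) + (s >> 16));
}
```

For a 24-byte payload split into an 8-byte fragment and a 16-byte fragment, ocsum_add() applied to the two per-fragment sums gives the same value as ocsum() over the whole payload, which is the property reassembly depends on.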