Oracle Blog

Reflections on OS integration

Wednesday Dec 08, 2010

In my seven years at Sun and Oracle, I've had the opportunity to work with some incredible people on some truly amazing technology. But the time has come for me to move on, and today will be my last day at Oracle.

When I started in the Solaris kernel group in 2003, I had no idea that I was entering a truly unique environment - a culture of innovation and creativity that is difficult to find anywhere, let alone working on a system as old and complex as Solaris in a company as large as Sun. While there, I worked with others to reshape the operating system landscape through technologies like Zones, SMF, DTrace, and FMA, and fortunate to be part of the team that created ZFS, one of the most innovative filesystems ever. From there I became a member of the Fishworks team that created the Sun Storage 7000 series; working with such a close-knit talented team to create a groundbreaking integrated product like that was an experience that I will never forget.

I learned so much and grew in so many ways thanks to the people I had the chance to meet and work with over the past seven years. I would not be the person I am today without your help and guidance. While I am leaving Oracle, I will always be part of the community, and I look forward to our paths crossing in the future.

Wednesday Mar 10, 2010

When the Sun Storage 7000 was first introduced, a key design decision was to allow only a single ZFS storage pool per host. This forces users to fully take advantage of the ZFS pool storage model, and prevents them from adopting ludicrous schemes such as "one pool per filesystem." While RAID-Z has non-trivial performance implications for IOPs-bound workloads, the hope was that by allowing logzilla and readzilla devices to be configured per-filesystem, users could adjust relative performance and implement different qualities of service on a single pool.

While this works for the majority of workloads, there are still some that benefit from mirrored performance even in the presence of cache and log devices. As the maximum size of Sun Storage 7000 systems increases, it became apparent that we needed a way to allow pools with different RAS and performance characteristics in the same system. With this in mind, we relaxed the "one pool per system" rule1 with the 2010.Q1 release.

The storage configuration user experience is relatively unchanged. Instead of having a single pool (or two pools in a cluster), and being able to configure one or the other, you can simply click the '+' button and add pools as needed. When creating a pool, you can now specify a name for the pool. When importing a pool, you can either accept the existing name or give it a new one at the time you select the pool. Ownership of pools in a cluster is now managed exclusively through the Configuration -> Cluster screen, as with other shared resources.

When managing shares, there is a new dropdown menu at the top left of the navigation bar. This controls which shares are shown in the UI. In the CLI, the equivalent setting is the 'pool' property at the 'shares' node.

While this gives some flexibility in storage configuration, it also allows users to create poorly constructed storage topologies. The intent is to allow the user to create pools with different RAS and performance characteristics, not to create dozens of different pools with the same properties. If you attempt to do this, the UI will present a warning summarizing the drawbacks if you were to continue:

Wastes system resources that could be shared in a single pool.

Decreases overall performance

Increases administrative complexity.

Log and cache devices can be enabled on a per-share basis.

You can still commit the operation, but such configurations are discouraged. The exception is when configuring a second pool on one head in a cluster.

We hope this feature will allow users to continue to consolidate storage and expand use of the Sun Storage 7000 series in more complicated environments.

Clever users figured out that this mechanism could be circumvented in a cluster to have two pools active on the same host in an active/passive configuration.

Sunday Oct 18, 2009

In my previous entry, I described the overall architecture of shadow migration. This post will dive into the details of how it's actually implemented, and the motivation behind some of the original design decisions.

VFS interposition

A very early desire was that we wanted something that could migrate data from many different sources. And while ZFS is the primary filesystem for Solaris, we also wanted to allow for arbitrary local targets. For this reason, the shadow migration infrastructure is implemented entirely at the VFS (Virtual FileSystem) layer. At the kernel level, there is a new 'shadow' mountpoint option, which is the path to another filesystem on the system. The kernel has no notion of whether a source filesystem is local or remote, and doesn't differentiate between synchronous access and background migration. Any filesystem access, whether it is local or over some other protocol (CIFS, NFS, etc) will use the VFS interfaces and therefore be fed through our interposition layer.

The only other work the kernel does when mounting a shadow filesystem is check to see if the root directory is empty. If it is empty, we create a blank SUNWshadow extended attribute on the root directory. Once set, this will trigger all subsequent migration as long as the filesystem is always mounted with the 'shadow' attribute. Each VFS operation first checks to see if the filesystem is shadowing another (a quick check), and then whether the file or directory has the SUNWshadow attribute set (slightly more expensive, but cached with the vnode). If the attribute isn't present, then we fall through to the local filesystem. Otherwise, we migrate the file or directory and then fall through to the local filesystem.

Migration of directories

In order to migrate a directory, we have to migrate all the entriest. When migrating an entry for a file, we don't want to migrate the complete contents until the file is accessed, but we do need to migrate enough metadata such that access control can be enforced. We start by opening the directory on the remote side whose relative path is indicated by the SUNWshadow attribute. For each directory entry, we create a sparse file with the appropriate ownership, ACLs, system attributes and extended attributes.

Once the entry has been migrated, we then set a SUNWshadow attribute that is the same as the parent but with "/" appended where "name" is the directory entry name. This attribute always represents the relative path of the unmigrated entity on the source. This allows files and directories to be arbitrarily renamed without losing track of where they are located on the source. It also allows the source to change (i.e. restored to a different host) if needed. Note that there are also types of files (symlinks, devices, etc) that do not have contents, in which case we simply migrate the entire object at once.

Once the diretory has been completely migrated, we remove the SUNWshadow attribute so that future accesses all use the native filesystem. If the process is interrupted (system reset, etc), then the attribute will still be on the parent directory so we will migrate it again when the user (or background process) tries to access it.

Migration of files

Migrating a plain file is much simpler. We use the SUNWshadow attribute to locate the file on the source, and then read the source file and write the corresponding data to the local filesystem. In the current software version, this happens all at once, meaning the first access of a large file will have to wait for the entire file to be migrated. Future software versions will remove this limitation and migrate only enough data to satisfy the request, as well as allowing concurrent accesses to the file. Once all the data is migrated, we remove the SUNWshadow attribute and future accesses will go straight to the local filesystem.

Dealing with hard links

One issue we knew we'd have to deal with is hard links. Migrating a hard link requires that the same file reference appear in multiple locations within the filesystem. Obviously, we do not know every reference to a file in the source filesystem, so we need to migrate these links as we discover them. To do this, we have a special directory in the root of the filesystem where we can create files named by their source FID. A FID (file ID) is a unique identifier for the file within the filesystem. We create the file in this hard link directory with a name derived from its FID. Each time we encounter a file with a link count greater than 1, we lookup the source FID in our special directory. If it exists, we create a link to the directory entry instead of migrating a new instance of the file. This way, it doesn't matter if files are moved around, removed from the local filesystem, or additional links created. We can always recreate a link to the original file. The one wrinkle is that we can migrate from nested source filesystems, so we also need to track the FSID (filesystem ID) which, while not persistent, can be stored in a table and reconstructed using source path information.

Completing migration

A key feature of the shadow migration design is that it treats all accesses the same, and allows background migration of data to be driven from userland, where it's easier to control policy. The downside is that we need the ability to know when we have finished migrating every single file and directory on the source. Because the local filesystem is actively being modified while we are traversing, it's impossible to know whether you've visited every object based only on walking the directory hierarchy. To address this, we keep track of a "pending" list of files and directories with shadow attributes. Every object with a shadow attribute must be present in this list, though this list can contain objects without shadow attributes, or non-existant objects. This allows us to be synchronous when it counts (appending entries) and lazy when it doesn't (rewriting file with entries removed). Most of the time, we'll find all the objects during traversal, and the pending list will contain no entries at all. In the case we missed an object, we can issue an ioctl() to the kernel to do the work for us. When that list is empty we know that we are finished and can remove the shadow setting.

ZFS integration

The ability to specify the shadow mount option for arbitrary filesystems is useful, but is also difficult to manage. It must be specified as an absolute path, meaning that the remote source of the mount must be tracked elsewhere, and has to be mounted before the filesystem itself. To make this easier, a new 'shadow' property was added for ZFS filesystems. This can be set using an abstract URI syntax ("nfs://host/path"), and libzfs will take care of automatically mounting the shadow filesystem and passing the correct absolute path to the kernel. This way, the user can manage a semantically meaningful relationship without worrying about how the internal mechanisms are connected. It also allows us to expand the set of possible sources in the future in a compatible fashion.

Hopefully this provides a reasonable view into how exactly shadow migration works, and the design decisions behind it. The goal is to eventually have this available in Solaris, at which point all the gritty details should be available to the curious.

Thursday Sep 17, 2009

When ZFS was first developed, the engineering team had the notion that pooled storage would make filesystems cheap and plentiful, and we'd move away from the days of /export1, /export2, ad infinitum. From the ZFS perspective, they are cheap. It's very easy to create dozens or hundreds of filesystems, each which functions as an administrative control point for various properties. However, we quickly found that other parts of the system start to break down once you get past 1,000 or 10,000 filesystems. Mounting and sharing filesystems takes longer, browsing datasets takes longer, and managing automount maps (for those without NFSv4 mirror mounts) quickly spirals out of control.

For most users this isn't a problem - a few hundred filesystems is more than enough to manage disparate projects and groups on a single machine. There was one class of users, however, where a few hundred filesystems wasn't enough. These users were university or other home directory environments with 20,000 or more users, each which needed to have a quota to guarantee that they couldn't run amok on the system. The traditional ZFS solution, creating a filesystem for each user and assigning a quota, didn't scale. After thinking about it for a while, Matt developed a fairly simple architecture to provide this functionality without introducing pathological complexity into the bowels of ZFS. In build 114 of Solaris Nevada, he pushed the following:

This provides full support for user and group quotas on ZFS, as well as the ability to track usage on a per-user or per-group basis within a dataset.

This was later integrated into the 2009.Q3 software release, with an additional UI layer. From the 'general' tab of a share, you can query usage and set quotas for individual users or groups quickly. The CLI allows for automated batch operations. Requesting a single user or group is significantly faster than requesting all the current usage, but you an also get a list of the current usage for a project or share. With integrated identity management, users and groups can be specified either by UNIX username or Windows name.

There are some significant differences between user and group quotas and traditional ZFS quotas. The following is an excerpt from the on-line documentation on the subject:

User and group quotas can only be applied to filesystems.

User and group quotas are implemented using delayed enforcement. This means that users will be able to exceed their quota for a short period of time before data is written to disk. Once the data has been pushed to disk, the user will receive an error on new writes, just as with the filesystem-level quota case.

User and group quotas are always enforced against referenced data. This means that snapshots do not affect any quotas, and a clone of a snapshot will consume the same amount of effective quota, even though the underlying blocks are shared.

User and group reservations are not supported.

User and group quotas, unlike data quotas, are stored with the regular filesystem data. This means that if the filesystem is out of space, you will not be able to make changes to user and group quotas. You must first make additional space available before modifying user and group quotas.

User and group quotas are sent as part of any remote replication. It is up to the administrator to ensure that the name service environments are identical on the source and destination.

NDMP backup and restore of an entire share will include any user or group quotas. Restores into an existing share will not affect any current quotas. (There is currently a bug preventing this from working in the initial release, which will be fixed in a subsequent minor release.)

This feature will hopefully allow the Sun Storage 7000 series to function in environments where it was previously impractical to do so. Of course, the real person to thank is Matt and the ZFS team - it was a very small amount of work to provide an interface on top of the underlying ZFS infrastructure.

Wednesday Sep 16, 2009

In the Sun Storage 7000 2009.Q3 software release, one of the major new features I worked on was the addition of what we termed "shadow migration." When we launched the product, there was no integrated way to migrate data from existing systems to the new systems. This resulted in customers rolling it by hand (rsync, tar, etc), or paying for professional services to do the work for them. We felt that we could present a superior model that would provide for a more integrated experience as well as let the customer leverage the investment in the system even before the migration was complete.

The idea in and of itself is not new, and various prototypes of this have been kicking around inside of Sun under various monikers ("brain slug", "hoover", etc) without ever becoming a complete product. When Adam and myself sat down shortly before the initial launch of the product, we decide we could do this without too much work by integrating the functionality directly in the kernel. The basic design requirements we had were:

We must be able to migrate over standard data protocols (NFS) from arbitrary data sources without the need to have special software running on the source system.

Migrated data must be available before the entire migration is complete, and must be accessible with native performance.

All the data to migrate the filesystem must be stored within the filesystem itself, and must not rely on an external database to ensure consistency.

With these requirements in hand, our key insight was that we could create a "shadow" filesystem that could pull data from the original source if necessary, but then fall through to the native filesystem for reads and writes one the file has been migrated. What's more, we could leverage the NFS client on Solaris and do this entirely at the VFS (virtual filesystem) layer, allowing us to migrate data between shares locally or (eventually) over other protocols as well without changing the interpositioning layer. The other nice thing about this architecture is that the kernel remains ignorant of the larger migration process. Both synchronous requests (from clients) and background requests (from the management daemon) appear the same. This allows us to control policy within the userland software stack, without pushing that complexity into the kernel. It also allows us to write a very comprehensive automated test suite that runs entirely on local filesystems without need a complex multi-system environment.

So what's better (and worse) about shadow migration compared to other migration strategies? For that, I'll defer to the documentation, which I've reproduced here for those who don't have a (virtual) system available to run the 2009.Q3 release:

Migration via synchronization

This method works by taking an active host X and migrating data to the new host Y while X remains active. Clients still read and write to the original host while this migration is underway. Once the data is initially migrated, incremental changes are repeatedly sent until the delta is small enough to be sent within a single downtime window. At this point the original share is made read-only, the final delta is sent to the new host, and all clients are updated to point to the new location. The most common way of accomplishing this is through the rsync tool, though other integrated tools exist. This mechanism has several drawbacks:

The anticipated downtime, while small, is not easily quantified. If a user commits a large amount of change immediately before the scheduled downtime, this can increase the downtime window.

During migration, the new server is idle. Since new servers typically come with new features or performance improvements, this represents a waste of resources during a potentially long migration period.

Coordinating across multiple filesystems is burdensome. When migrating dozens or hundreds of filesystems, each migration will take a different amount of time, and downtime will have to be scheduled across the union of all filesystems.

Migration via external interposition

This method works by taking an active host X and inserting a new appliance M that migrates data to a new host Y. All clients are updated at once to point to M, and data is automatically migrated in the background. This provides more flexibility in migration options (for example, being able to migrate to a new server in the future without downtime), and leverages the new server for already migrated data, but also has significant drawbacks:

The migration appliance represents a new physical machine, with associated costs (initial investment, support costs, power and cooling) and additional management overhead.

The migration appliance represents a new point of failure within the system.

The migration appliance interposes on already migrated data, incurring extra latency, often permanently. These appliances are typically left in place, though it would be possible to schedule another downtime window and decommission the migration appliance.

Shadow migration

Shadow migration uses interposition, but is integrated into the appliance and doesn't require a separate physical machine. When shares are created, they can optionally "shadow" an existing directory, either locally (see below) or over NFS. In this scenario, downtime is scheduled once where the source appliance X is placed into read-only mode, a share is created with the shadow property set, and clients are updated to point to the new share on the Sun Storage 7000 appliance. Clients can then access the appliance in read-write mode.

Once the shadow property is set, data is transparently migrated in the background from the source appliance locally. If a request comes from a client for a file that has not yet been migrated, the appliance will automatically migrate this file to the local server before responding to the request. This may incur some initial latency for some client requests, but once a file has been migrated all accesses are local to the appliance and have native performance. It is often the case that the current working set for a filesystem is much smaller than the total size, so once this working set has been migrated, regardless of the total native size on the source, there will be no perceived impact on performance.

The downside to shadow migration is that it requires a commitment before the data has finished migrating, though this is the case with any interposition method. During the migration, portions of the data exists in two locations, which means that backups are more complicated, and snapshots may be incomplete and/or exist only on one host. Because of this, it is extremely important that any migration between two hosts first be tested thoroughly to make sure that identity management and access controls are setup correctly. This need not test the entire data migration, but it should be verified that files or directories that are not world readable are migrated correctly, ACLs (if any) are preserved, and identities are properly represented on the new system.

Shadow migration implemented using on-disk data within the filesystem, so there is no external database and no data stored locally outside the storage pool. If a pool is failed over in a cluster, or both system disks fail and a new head node is required, all data necessary to continue shadow migration without interruption will be kept with the storage pool.

In a subsequent post, I'll discuss some of the thorny implementation detail we had to solve, as well as provide some screenshots of migration in progress. In the meantime, I suggest folks download the simulator and upgrade to the latest software to give it a try.

Tuesday Jun 02, 2009

Last week, we announced the Sun Storage 7310 system. At the same time, a less significant but still notable change was made to the Sun Storage product line. The Sun Storage 7210 system is now expandable via J4500 JBODs. These JBODs have the same form factor as the 7210 - 48 drives in a top loading 4u form factor. Up to two J4500s can be added to a 7210 system via a single HBA, resulting in up to 142 TB of storage in 12 RU of space. JBODs with 500G or 1T drives are supported.

For customers looking for maximum density without high availability, the combination of 7210 with J4500 provides a perfect solution.

Thursday Nov 20, 2008

In the past, I've discussed the evolution of disk FMA. Much has been accomplished in the past year, but there are still several gaps when it comes to ZFS and disk faults. In Solaris today, a fault diagnosed by ZFS (a device failing to open, too many I/O errors, etc) is reported as a pool name and 64-bit vdev GUID. This description leaves something to be desired, referring the user to run zpool status to determine exactly what went wrong. But the user is still has to know how to go from a cXtYdZ Solaris device name to a physical device, and when they do locate the physical device they need to manually issue a zpool replace command to initiate the replacement.

While this is annoying in the Solaris world, it's completely untenable in an appliance environment, where everything needs to "just work". With that in mind, I set about to plug the last few holes in the unified plan:

ZFS faults must be associated with a physical disk, including the human-readable label

A disk fault (ZFS or SMART failure) must turn on the associated fault LED

Removing a disk (faulted or otherwise) and replacing it with a new disk must automatically trigger a replacement

While these seem like straightforward tasks, as usual they are quite difficult to get right in a truly generic fashion. And for an appliance, there can be no Solaris commands or additional steps for the user. To start with, I needed to push the FRU information (expressed as a FMRI in the hc libtopo scheme) into the kernel (and onto the ZFS disk label) where it would be available with each vdev. While it is possible to do this correlation after the fact, it simplifies the diagnosis engine and is required for automatic device replacement. There are some edge conditions around moving and swapping disks, but was relatively straightforward. Once the FMRI was there, I could include the FRU in the fault suspect list, and using Mike's enhancements to libfmd_msg, dynamically insert the FRU label into the fault message. Traditional FMA libtopo labels do not include the chassis label, so in the Fishworks stack we go one step further and re-write the label on receipt of a fault event with the user-defined chassis name as well as the physical slot. This message is then used when posting alerts and on the problems page. We can also link to the physical device from the problems page, and highlight the faulty disk in the hardware view.

With the FMA plumbing now straightened out, I needed a way to light the fault LED for a disk, regardless of whether it was in the system chassis or an external enclosure. Thanks to Rob's sensor work, libtopo already presents a FMRI-centric view of indicators in a platform agnostic manner. So I rewrote the disk-monitor module (or really, deleted everything and created a new fru-monitor module) that would both poll for FRU hotplug events, as well as manage the fault LEDs for components. When a fault is generated, the FRU monitor looks through the suspect list, and turns on the fault LED for any component that has a supported indicator. This is then turned off when the corresponding repair event is generated. This also had the side benefit of generating hotplug events phrased in terms of physical devices, which the appliance kit can use to easily present informative messages to the user.

Finally, I needed to get disk replacement to work like everyone expects it to: remove a faulted disk, put in a new one, and walk away. The genesis of this functionality was putback to ON long ago as the autoreplace pool property. In Solaris, this functionality only works with disks that have static device paths (namely SATA). In the world of multipathed SAS devices, the device path is really a scshi_vhci node identified by the device WWN. If we remove a disk and insert a new one, it will appear as a new device with no way to correlate it to the previous instance, preventing us from replacing the correct vdev. What we need is physical slot information, which happens to be provided by the FMRI we are already storing with the vdev for FMA purposes. When we receive a sysevent for a new device addition, we look at the latest libtopo snapshot and take the FMRI of the newly inserted device. By looking at the current vdev FRU information, we can then associate this with the vdev that was previously in the slot, and automatically trigger the replacement.

This process took a lot longer than I would have hoped, and has many more subtleties too boring even for a technical blog entry, but it is nice to sit back and see a user experience that is intuitive, informative, and straightforward - the hallmarks of an integrated appliance solution.

Wednesday Nov 12, 2008

Since our initial product was going to be a NAS appliance, we knew early on that
storage configuration would be a critical part of the initial Fishworks experience. Thanks to the power
of ZFS storage pools, we have the ability to present a radically simplified interface,
where the storage "just works" and the administrator doesn't need to worry about
choosing RAID stripe widths or statically provisioning volumes. The first decision
was to create a single storage pool (or really one per head in a cluster)1,
which means that the administrator only needs to make this decision once, and
doesn't have to worry about it every time they create a filesystem or LUN.

Within a storage pool, we didn't want the user to be in charge of making
decisions about RAID stripe widths, hot spares, or allocation of devices. This
was primarily to avoid this complexity, but also represents the fact that we
(as designers of the system) know more about its characteristics than you.
RAID stripe width affects performance in ways that are not immediately
obvious. Allowing for JBOD failure requires careful selection of stripe widths.
Allocation of devices can take into account environmental factors (balancing
HBAs, fan groups, backplance distribution) that are unknown to the user.
To make this easy for the user, we pick several different profiles that
define parameters that are then applied to the current configuration to figure
out how the ZFS pool should be laid out.

Before selecting a profile, we ask the user to verify the storage that
they want to configure. On a standalone system, this is just a check
to make sure nothing is broken. If there is a broken or missing disk, we
don't let you proceed without explicit confirmation. The reason we do
this is that once the storage pool is configured, there is no way to add those
disks to the pool without changing the RAS and performance characteristics
you specified during configuration. On a 7410 with multiple JBODs, this verification step is slightly
more complicated, as we allow adding of whole or half JBODs. This step is
where you can choose to allocate half or all of
the JBOD to a pool, allowing you to split storage in a cluster or reserve
unused storage for future clustering options.

Fundamentally, the choice of redundancy is a business decision. There is
a set of tradeoffs that express your tolerance of risk and relative cost. As
Jarod told us very early on in the project: "fast, cheap, or reliable - pick two."
We took this to heart, and our profiles are displayed in a table with
qualitative ratings on performance, capacity, and availability. To further
help make a decision, we provide a human-readable description of the
layout, as well as a pie chart showing the way raw storage will be used
(data, parity, spares, or reserved). The last profile parameter is called
"NSPF," for "no single point of failure." If you are on a 7410 with multiple
JBODs, some profiles can be applied across JBODs such that the loss
of any one JBOD cannot cause data loss2. This often forces arbitrary stripe
widths (with 6 JBODs your only choice is 10+2) and can result in
less capacity, but with superior RAS characteristics.

This configuration takes just two quick steps, and for the common case
(where all the hardware is working and the user wants double parity RAID),
it just requires clicking on the "DONE" button twice. We also support adding
additional storage (on the 7410), as well as unconfiguring and importing
storage. I'll leave a complete description of the storage configuration
screen for a future entry.

[1] A common question we get is "why allow only one storage pool?" The actual
implementation clearly allows it (as in the failed over active-active cluster), so it's
purely an issue of complexity. There is never a reason to create multiple
pools that share the same redundancy profile - this provides no additional value
at the cost of significant complexity. We do acknowledge that mirroring
and RAID-Z provide different performance characteristics, but we hope that with the
ability to turn on and off readzilla and (eventually) logzilla usage on a per-share basis,
this will be less of an issue. In the future, you may see support for multiple pools, but
only in a limited fashion (i.e. enforcing different redundancy profiles).

[2] It's worth noting that all supported configurations of the 7410 have
multiple paths to all JBODs across multiple HBAs. So even without NSPF, we
have the ability to survive HBA, cable, and JBOD controller failure.

Tuesday Nov 11, 2008

With any product, there is always some talk from the enthusiasts about how they
could do it faster, cheaper, or simpler. Inevitably, there's a little bit of truth to
both sides. Enthusiasts have been doing homebrew NAS for as long as free software
has been around, but it takes far more
work to put together a complete, polished solution that stands up under the stress
of an enterprise environment.

One of the amusing things I like to do is to look
back at the total amount of source code we wrote. Lines of source code by itself
is obviously not a measure of complexity - it's possible to write complex software
with very few lines of source, or simple software that's over engineered - but it's
an interesting measure nonetheless. Below is the current output of a little script
I wrote to count lines of code1 in our fish-gate. This does not
include the approximately 40,000 lines of change made to
the ON (core Solaris) gate, most of which we'll be putting back gradually over
the next few months.

Sunday Nov 09, 2008

It's hard to believe that this day has finally come. After more
than two and a half years, our first Fishworks-based product has been
released. You can keep up to date with the latest info at the
Fishworks blog.

For my first technical post, I'd thought I'd give an introduction to
the chassis subsystem at the heart of our hardware integration strategy. This
subsystem is responsible for gathering, cataloging, and presenting a unified
view of the hardware topology. It
underwent two major rewrites (one by myself and one by Keith) but the
fundamental design has remained the same. While it
may not be the most glamorous feature (no one's going to purchase a box
because they can get model information on their DIMMs), I found it an
interesting cross-section of disparate technologies and awash in subtle
complexity. You can find a video of myself talking about and
demonstrating this feature
here.

libtopo discovery

At the heart of the chassis subsystem is the FMA topology as
exported by
libtopo.
This library is already
capable of enumerating hardware in a physically meaningful manner, and
FMRIs (fault managed resource identifiers) form the basis of
FMA fault diagnosis. This alone provides us the following basic
capabilities:

Much of this requires platform-specific XML files, or leverages IPMI
behind the scenes, but this minimal integration work is common to Solaris. Any
platform supported by Solaris is supported by the FishWorks software
stack.

Additional metadata

Unfortunately, this falls short of a complete picture:

No way to identify absent CPUs, DIMMs, or empty PCI slots

DIMM enumeration not supported on all platforms

Human-readable labels often wrong or missing

No way to identify complete PCI cards

No integration with visual images of the chassis

To address these limitations (most of which lie outside the purview of
libtopo), we leverage additional metadata for each supported
chassis. This metadata identifies all physical slots (even those that
may not be occupied), cleans up various labels, and includes visual
information about the chassis and its components. And
we can identify physical cards based on devinfo properties extracted
from firmware and/or the pattern of PCI functions and their attributes
(a process worthy of its own blog entry). Combined with libtopo, we have
images that we can assemble into a
complete view based on the current physical layout, highlight
components within the image, and respond to user mouse clicks.

Supplemental information

However, we are still missing many of
the component details. Our goal is to be able to provide complete
information for every FRU on the system. With just libtopo, we can get
this for disks but not much else. We need to look to alternate
sources of information.

kstat

For CPUs, there is a rather rich set of information available via
traditional kstat interfaces. While we use libtopo to identify CPUs
(it lets us correlate physical CPUs), the
bulk of the information comes from kstats. This is used to get model,
speed, and the number of cores.

libdevinfo

The device tree snapshot provides additional information for PCI
devices that can only be retrieved by private driver interfaces.
Despite the existence of a VPD (Vital Product Data)
standard, effectively no vendors implement it. Instead, it is read by some firmware-specific
mechanism private to the driver. By exporting these as properties in
the devinfo snapshot, we can transparently pull in dynamic FRU
information for PCI cards. This is used to get model, part, and
revision information for HBAs and 10G NICs.

IPMI

IPMI (Intelligent Platform Management Interface) is used to
communicate with the service processor on most enterprise class
systems. It is used within libtopo for power supply and fan
enumeration in libtopo as well as LED management. But IPMI
also supports FRU data, which includes a lot of juicy tidbits
that only the SP knows. We reference this FRU information directly to
get model and part information for power supplies and DIMMs.

SMBIOS

Even with IPMI, there are bits of information that exist only in SMBIOS,
a standard is supposed to provide information about the physical
resources on the system. Sadly, it does not provide enough information
to correlate OS-visible abstractions with their underlying physical
counterparts. With metadata, however, we can use SMBIOS to make this
correlation. This is used to enumerate DIMMs on platforms not
supported by libtopo, and to supplement DIMM information with data
available only via SMBIOS.

Metadata

Last but not least, there is chassis-specific metadata. Some
components simply don't have FRUID information, either because they are
too simple (fans) or there exists no mechanism to get the information
(most PCI cards). In this situation, we use metadata to provide
vendor, model, and part information as that is generally static for a
particular component within the system. We cannot get information
specific to the component (such as a serial number), but at least the
user will be able to know what it is and know how to order another
one.

Putting it all together

With all of this information tied together under one subsystem, we
can finally present the user complete information about their hardware,
including images showing the physical layout of the system. In addition,
this also forms the basis for reporting problems and analytics (using
labels from metadata), manipulating chassis state (toggling LEDs, setting
chassis identifiers), and making programmatic distinctions about the
hardware (such as whether external HBAs are present). Over the
next few weeks I hope to expound on some of these details in further
blog posts.

Thursday Aug 07, 2008

Last week, Rob Johnston and I coordinated two putbacks to Solaris to further the cause of Solaris platform integration, this time focusing on sensors and indicators. Rob has a great blog post with an overview of the new sensor abstraction layer in libtopo. Rob did most of the hard work- my contribution consisted only of extending the SES enumerator to support the new facility infrastructure.

You can find a detailed description of the changes in the original FMA portfolio here, but it's much easier to understand via demonstration. This is the fmtopo output for a fan node in a J4400 JBOD:

Here you can see the available indicators (locate and service), the fan speed (3490 RPM) and if the fan is faulted. Right now this is just interesting data for savvy administrators to play with, as it's not used by any software. But that will change shortly, as we work on the next phases:

Monitoring of sensors to detect failure in external components which have no visibility in Solaris outside libtopo, such as power supplies and fans. This will allow us to generate an FMA fault when a power supply or fan fails, regardless of whether it's in the system chassis or an external enclosure.

Generalization of the disk-monitor fmd plugin to support arbitrary disks. This will control the failure indicator in response to FMA-diagnosed faults.

Correlation of ZFS faults with the associated physical disk. Currently, ZFS faults are against a "vdev" - a ZFS-specific construct. The user is forced to translate from this vdev to a device name, and then use the normal (i.e. painful) methods to figure out which physical disk was affected. With a little work it's possible to include the physical disk in the FMA fault to avoid this step, and also allow the fault LED to be controlled in response to ZFS-detected faults.

Expansion of the SCSI framework to support native diagnosis of faults, instead of a stream of syslog messages. This involves generating telemetry in a way that can be consumed by FMA, as well as a diagnosis engine to correlate these ereports with an associated fault.

Even after we finish all of these tasks and reach the nirvana of a unified storage management framework, there will still be lots of open questions about how to leverage the sensor framework in interesting ways, such as a prtdiag-like tool for assembling sensor information, or threshold alerts for non-critical warning states. But with these latest putbacks, it feels like our goals from two years ago are actually within reach, and that I will finally be able to turn on that elusive LED.

Sunday Jul 13, 2008

Over the past few years, I've been working on various parts of Solaris platform integration, with an emphasis on disk monitoring. While the majority of my time has been focused on fishworks, I have managed to implement a few more pieces of the original design.

About two months ago, I integrated the libscsi and libses libraries into Solaris Nevada. These libraries, originally written by Keith Wesolowski, form an abstraction layer upon which higher level software can be built. The modular nature of libses makes it easy to extend with vendor-specific support libraries in order to provide additional information and functionality not present in the SES standard, something difficult to do with the kernel-based ses(7d) driver. And since it is written in userland, it is easy to port to other operating systems. This library is used as part of the fwflash firmware upgrade tool, and will be used in future Sun storage management products.

While libses itself is an interesting platform, it's true raison d'etre is to serve as the basis for enumeration of external enclosures as part of libtopo. Enumeration of components in a physically meaningful manner is a key component of the FMA strategy. These components form FMRIs (fault managed resource identifiers) that are the target of diagnoses. These FMRIs provide a way of not just identifying that "disk c1t0d0 is broken", but that this device is actually in bay 17 of the storage enclosure whose chassis serial number is "2029QTF0809QCK012". In order to do that effectively, we need a way to discover the physical topology of the enclosures connected to the system (chassis and bays) and correlate it with the in-band I/O view of the devices (SAS addresses). This is where SES (SCSI enclosure services) comes into play. SES processes show up as targets in the SAS fabric, and by using the additional element status descriptors, we can correlate physical bays with the attached devices under Solaris. In addition, we can also enumerate components not directly visible to Solaris, such as fans and power supplies.

The SES enumerator was integrated in build 93 of nevada, and all of these components now show up in the libtopo hardware topology (commonly referred to as the "hc scheme"). To do this, we walk over al the SES targets visible to the system, grouping targets into logical chassis (something that is not as straightforward as it should be). We use this list of targets and a snapshot of the Solaris device tree to fill in which devices are present on the system. You can see the result by running fmtopo on a build 93 or later Solaris machine:

So what does this mean, other than providing a way for you to finally figure out where disk 'c3t0d6' is really located? Currently, it allows the disks to be monitored by the disk-transport fmd module to generate faults based on predictive failure, over temperature, and self-test failure. The really interesting part is where we go from here. In the near future, thanks to work by Rob Johnston on the sensor framework, we'll have the ability to manage LEDs for disks that are part of external enclosures, diagnose failures of power supplies and fans, as well as the ability to read sensor data (such as fan speeds and temperature) as part of a unified framework.

I often like to joke about the amount of time that I have spent just getting a single LED to light. At first glance, it seems like a pretty simple task. But to do it in a generic fashion that can be generalized across a wide variety of platforms, correlated with physically meaningful labels, and incorporate a diverse set of diagnoses (ZFS, SCSI, HBA, etc) requires an awful lot of work. Once it's all said and done, however, future platforms will require little to no integration work, and you'll be able to see a bad drive generate checksum errors in ZFS, resulting in a FMA diagnosis indicating the faulty drive, activate a hot spare, and light the fault LED on the drive bay (wherever it may be). Only then will we have accomplished our goal of an end-to-end storage strategy for Solaris - and hopefully someone besides me will know what it has taken to get that little LED to light.

Saturday Jun 09, 2007

For those of you who have been following my recent work with Solaris platform integration, be sure to check out the work Cindi and the FMA team are doing as part of the Sensor Abstraction Layer project. Cindi recently posted an initial version of the Phase 1 design document. Take a look if you're interested in the details, and join the discussion if you're interested in defining the Solaris platform experience.

The implications of this project for unified platform integration are obvious. With respect to what I've been working on, you'll likely see the current disk monitoring infrastructure converted into generic sensors, as well as the sfx4500-disk LED support converted into indicators. I plan to leverage this work as well as the SCSI FMA work to enable correlated ZFS diagnosis across internal and external storage.

Saturday May 26, 2007

Two weeks ago I putback PSARC 2007/202, the second step in generalizing the x4500 disk monitor. As explained in my previous blog post, one of the tasks of the original sfx4500-disk module was reading SMART data from disks and generating associated FMA faults. This platform-specific functionality needed to be generalized to effectively support future Sun platforms.

This putback did not add any new user-visible features to Solaris, but it did refactor the code in the following ways:

A new private library, libdiskstatus, was added. This generic library uses uSCSI to read data from SCSI (or SATA via emulation) devices. It is not a generic SMART monitoring library, focusing only on the three generally available disk faults: over temperature, predictive failure, and self-test failure. There is a single function, disk_status_get() that reurns an nvlist describing the current parameters reported by the drive and whether any faults are present.

This library is used by the SATA libtopo module to export a generic TOPO_METH_DISK_STATUS method. This method keeps all the implementation details within libtopo and exports a generic inerface for consumers.

A new fmd module, disk-transport, periodically iterates over libtopo nodes and invokes the TOPO_METH_DISK_STATUS method on any supported nodes. The module generates FMA ereports for any detected errors.

These ereports are translated to faults by a simple eversholt DE. These are the same faults that were originally generated by the sfx4500-disk module, so the code that consumes them remains unchanged.

These changes form the foundation that will allow future Sun platforms to detect and react to disk failures, eliminating 5200 lines of platform-specific code in the process. The next major steps are currently in progress:

The FMA team, as part of the sensor framework, is expanding libtopo to include the ability to represent indicators (LEDs) in a generic fashion. This will replace the x4500 specific properties and associated machinery with generic code.

The SCSI FMA team is finalizing the libtopo enumeration work that will allow arbitrary SCSI devices (not just SATA) to be enumerated under libtopo and therefore be monitored by the disk-transport module. The first phase will simply replicate the existing sfx4500-disk functionality, but will enable us to model future non-SATA platforms as well as external storage devices.

Saturday Mar 17, 2007

As I continue down the path of improving various aspects of ZFS and Solaris platform integration, I found myself in the thumper (x4500) fmd platform module. This module represents the latest attempt at Solaris platform integration, and an indication of where we are headed in the future.

When I say "platform integration", this is more involved than the platform support most people typically think of. The platform teams make sure that the system boots and that all the hardware is supported properly by Solaris (drivers, etc). Thanks to the FMA effort, platform teams must also deliver a FMA portfolio which covers FMA support for all the hardware and a unified serviceability plan. Unfortunately, there is still more work to be done beyond this, of which the most important is interacting with hardware in response to OS-visible events. This includes ability to light LEDs in response to faults and device hotplug, as well as monitoring the service processor and keeping external FRU information up to date.

The sfx4500-disk module is the latest attempt at providing this functionality. It does the job, but is afflicted by the same problems that often plague platform integration attempts. It's overcomplicated, monolithic, and much of what it does should be generic Solaris functionality. Among the things this module does:

Reads SMART data from disks and creates ereports

Diagnoses ereports into corresponding disk faults

Implements an IPMI interface directly on top of /dev/bmc

Responds to disk faults by turning on the appropriate 'fault' disk LED

Listens for hotplug and DR events, updating the 'ok2rm' and 'present' LEDs

Updates SP-controlled FRU information

Monitors the service process for resets and resyncs necessary information

Needless to say, every single item on the above list is applicable to a wide variety of Sun platforms, not just the x4500, and it certainly doesn't need to be in a single monolithic module. This is not meant to be a slight against the authors of the module. As with most platform integration activities, this effort wasn't communicated by the hardware team until far too late, resulting in an unrealistic schedule with millions of dollars of revenue behind it. It doesn't help that all these features need to be supported on Solaris 10, making the schedule pressure all the more acute, since the code must soak in Nevada and then be backported in time for the product release. In these environments even the most fervent pleas for architectural purity tend to fall on deaf ears, and the engineers doing the work quickly find themselves between a rock and a hard place.

As I was wandering through this code and thinking about how this would interact with ZFS and future Sun products, it became clear that it needed a massive overhaul. More specifically, it needed to be burned to the ground and rebuilt as a set of distinct, general purpose, components. Since refactoring 12,000 lines of code with such a variety of different functions is non-trivial and difficult to test, I began by factoring out different pieces individually, redesigning the interfaces and re-integrating them into Solaris on a piece-by-piece basis.

Of all the functionality provided by the module, the easiest thing to separate was the IPMI logic. The Intelligent Platform Management Interface is a specification for communicating with service Pprocessors to discover and control available hardware. Sadly, it's anything but "intelligent". If you had asked me a year ago what I'd be doing at the beginning of this year, I'm pretty sure that reading the IPMI specification would have been at the bottom of my list (right below driving stakes through my eyeballs). Thankfully, the IPMI functionality needed was very small, and the best choice was a minimally functional private library, designed solely for the purpose of communicating with the Service Processor on supported Sun platforms. Existing libraries such as OpenIPMI were too complicated, and in their efforts to present a generic abstracted interface, didn't provide what we really needed. The design goals are different, and the ON-private IPMI library and OpenIPMI will continue to develop and serve different purposes in the future.

Last week I finally integrated libipmi. In the process, I eliminated 2,000 lines of platform-specific code and created a common interface that can be leveraged by other FMA efforts and future projects. It is provided for both x86 and SPARC, even though there are currently no supported SPARC machines with an IPMI-capable service processor (this is being worked on). This library is private and evolving quite rapidly, so don't use it in any non-ON software unless you're prepared to keep up with a changing API.

As part of this work, I also created a common fmd module, sp-monitor, that monitors the service processor, if present, and generates a new ESC_PLATFORM_RESET sysevent to notify consumers when the service processor is reset. The existing sfx4500-disk module then consumes this sysevent instead of monitoring the service processor directly.

This is the first of many steps towards eliminating this module in its current form, as well as laying groundwork for future platform integration work. I'll post updates to this blog with information about generic disk monitoring, libtopo indicators, and generic hotplug management as I add this functionality. The eventual goal is to reduce the platform-specific portion of this module to a single .xml file delivered via libtopo that all these generic consumers will use to provide the same functionality that's present on the x4500 today. Only at this point can we start looking towards future applications, some of which I will describe in upcoming posts.