Monday, November 5, 2012

OpenAFS Windows IFS Thirteen Months Later

On 18 September 2011, I discussed the first OpenAFS release that included a native installable file system redirector. It is often said that it takes ten developer years to shake out all of the bugs and performance glitches in a new file system. The last year has certainly seen its fill of BSODs, deadlocks, hiccups, and application interoperability issues. Today, I am releasing version 1.7.18. Over the last thirteen months, more than 750 changes have been implemented to improve performance, stability, and application compatibility. This post will highlight some of the challenges and lessons learned in the process.

Antimalware Filter Driver compatibility
The vast majority of problems that end users have experienced with the AFS redirector have been related to interactions with anti-virus products and other content scanners that install filter drivers on the system. Life would be much easier if there were a standard set of hooks that these products could use to scan files and deny access, quarantine, or otherwise alter the normal application data access patterns. Unfortunately, that is not the case, and learning what works and what doesn't has often been left to trial and error.

Since AFS is a network file system that relies upon credentials that are independent of the local operating system, there are added complexities. For example, when Excel opens a spreadsheet file it uses the AFS tokens which are available to the active logon session. The anti-virus service, on the other hand, runs as an NT service under the SYSTEM account or another account in a different logon session. As such, it does not have access to the user's AFS tokens unless the request to scan the file content is performed by borrowing the File Object from Excel or by impersonating the Excel process' security context. Most anti-virus products do impersonate the calling thread or borrow the File Object, but not all do. Versions of Microsoft Security Essentials prior to 2.0 did not, and it was a significant problem for OpenAFS.

Anti-virus scanners can choose to scan during the CreateFile operation and during the CloseHandle operation (aka File Cleanup). The challenge here for the AFS redirector is that it must hold various locks in order to protect the integrity of the data and provide cache coherency with the file server managed data versions. Anti-virus scanners can hijack the thread performing the CreateFile or Cleanup and inherit the locks that are already held, or they can spawn a worker thread to re-open the file, perform a scan, and close it again while the application-initiated CreateFile or Cleanup is blocked. Any locks held across CreateFile or Cleanup that are also required by the anti-virus worker thread will result in a deadlock. Failure to hold the locks can result in data corruption. Sophos and Kaspersky were two of the most challenging products to learn to interact with safely.
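The worker-thread case can be illustrated with a toy model. The sketch below is user-mode Python, not redirector code: `scan_from_worker`, `create_holding_lock`, and `create_releasing_lock` are hypothetical names, a single `threading.Lock` stands in for the redirector's locks, and a lock timeout stands in for what would be an indefinite hang in the kernel.

```python
import threading


def scan_from_worker(file_lock: threading.Lock, timeout: float = 0.2) -> bool:
    """Hypothetical scanner: re-open the file on a worker thread.

    The worker needs file_lock to open the file. Returns True if the scan
    could proceed, False if it could not get the lock in time.
    """
    result = []

    def worker():
        got = file_lock.acquire(timeout=timeout)
        if got:
            file_lock.release()
        result.append(got)

    t = threading.Thread(target=worker)
    t.start()
    t.join()  # the application's CreateFile is blocked until the scan ends
    return result[0]


def create_holding_lock(file_lock: threading.Lock) -> bool:
    """CreateFile path that keeps the lock while the scanner runs."""
    with file_lock:
        return scan_from_worker(file_lock)


def create_releasing_lock(file_lock: threading.Lock) -> bool:
    """CreateFile path that drops the lock before the scanner runs."""
    with file_lock:
        pass  # do the work that needs the lock, then release it
    return scan_from_worker(file_lock)


if __name__ == "__main__":
    print("lock held across scan:", create_holding_lock(threading.Lock()))
    print("lock released first:  ", create_releasing_lock(threading.Lock()))
```

In this model the scan only succeeds when the lock is released first. That is only half of the problem described above: once the lock is dropped, anything that was read under it may be stale, so a real implementation would presumably need to revalidate that state after the scanner returns.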

Microsoft periodically organizes File System Filter Driver PlugFests, which give file system developers, anti-virus vendors, vendors of encryption products and content scanners, and others an opportunity to test their forthcoming products against Microsoft's upcoming operating system releases. The PlugFest is also an opportunity for third-party vendors to perform interoperability testing with each other. Unfortunately, due to increased secrecy regarding the development of Windows 8 and Server 2012, Microsoft was unable to hold a PlugFest for more than a year. In 2012, however, there were two events, in February and August.

The February PlugFest was the first opportunity to interop with a broad range of vendors since the release of 1.7.1. At that event every interop session was a painful experience. Version 1.7.7 was scheduled to be released during that week, but it had to be pulled because of the many problems (deadlocks, BSODs, and data corruption) that were identified during the interop testing sessions.

This past August's experience was the complete opposite. The code that would become the 1.7.17 release, including Windows 8 and Server 2012 specific functionality, was tested. Other than a minor error uncovered during the first interop session with Microsoft's own anti-virus engine, which is used in Security Essentials and Windows Defender, there was not a single hiccup the rest of the week. As it turns out, the AFS redirector was the only non-Microsoft file system to implement all of the required new interfaces for Windows 8.

Application Compatibility
Of course, compatibility with deployed applications is the goal. Whenever possible, applications should be unaware that their data is being stored in AFS as opposed to Windows built-in file systems such as NTFS and CIFS. This challenge is made more complicated by the fact that most applications do not implement feature tests for optional file system APIs. Instead they just assume that every feature implemented by NTFS or CIFS will be available everywhere. Whether an application treats the file system as local or remote is often decided by whether or not UNC path notation is used. Things should become easier for non-Microsoft file systems now that Microsoft has introduced ReFS, a new file system that does not implement many features of NTFS including transactions, short names, extended attributes, or alternate data streams, none of which are implemented by the AFS redirector.

Still, it is worth noting that the AFS redirector is a very complete implementation of the NTFS and CIFS feature set, including support for CIFS Pipe Services such as WKSSVC and SRVSVC and a full implementation of the Network Provider API. Both the Pipe Services and the Network Provider API are used by applications to browse the capabilities of the network file system and the available resources such as server and share names. The Network Provider API is also responsible for managing drive letter to UNC path mappings and path name normalization. One example of a Network Provider incompatibility was the failure to implement network performance statistics, which resulted in periodic 20 second delays from within the Explorer Shell.

Reparse Points
One of the most significant visible changes between the SMB gateway interface and the native AFS redirector is the use of file system Reparse Points to represent AFS Mount Points and Symlinks. Unlike POSIX symlinks, which are unstructured data, a Windows File System Reparse Point is a tagged structured data type. Microsoft maintains a registry of all of the tag values and the organization each is assigned to. More than 50 reparse point tags have been registered, and OpenAFS is the proud assignee of IO_REPARSE_TAG_OPENAFS_DFS (0x00000037L). The OpenAFS Reparse Tag Data has three sub-types (Mount Point, Symlink, UNC Referral) which are used to export the target information for each.
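To make "tagged structured data" concrete, here is a minimal sketch that decodes the generic header of a reparse buffer. It assumes the layout Windows documents for non-Microsoft tags (REPARSE_GUID_DATA_BUFFER: a 32-bit tag, a 16-bit data length, a 16-bit reserved field, a 16-byte GUID, then the data). The GUID and payload in the sample are placeholders, not real OpenAFS values, and the payload is left opaque because the layout of the three OpenAFS sub-types is not described here. `parse_reparse_header` is a hypothetical helper name.

```python
import struct
import uuid

IO_REPARSE_TAG_OPENAFS_DFS = 0x00000037  # tag value quoted in the post


def parse_reparse_header(buf: bytes) -> dict:
    """Decode the generic reparse header; the tag-specific data stays opaque."""
    if len(buf) < 8:
        raise ValueError("buffer too short for a reparse header")
    tag, data_len, _reserved = struct.unpack_from("<IHH", buf, 0)
    info = {
        "tag": tag,
        "is_microsoft": bool(tag & 0x80000000),       # high bit: Microsoft-owned tag
        "is_name_surrogate": bool(tag & 0x20000000),  # tag represents another named entity
        "data_length": data_len,
    }
    offset = 8
    if not info["is_microsoft"]:
        # Third-party tags carry a GUID identifying the owner of the data.
        if len(buf) < offset + 16:
            raise ValueError("buffer too short for the reparse GUID")
        info["guid"] = uuid.UUID(bytes_le=buf[offset:offset + 16])
        offset += 16
    info["data"] = buf[offset:offset + data_len]
    return info


# Placeholder values for illustration only (not OpenAFS's real GUID or data).
PLACEHOLDER_GUID = uuid.UUID("00000000-0000-0000-0000-000000000001")
PAYLOAD = b"opaque"
SAMPLE = (
    struct.pack("<IHH", IO_REPARSE_TAG_OPENAFS_DFS, len(PAYLOAD), 0)
    + PLACEHOLDER_GUID.bytes_le
    + PAYLOAD
)

if __name__ == "__main__":
    info = parse_reparse_header(SAMPLE)
    print(hex(info["tag"]), info["is_microsoft"], info["data_length"])
```

An application that recognizes the tag still needs the owner's definition of the data that follows the header before it can display a target, which is why the header alone is not enough for most tools.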

When the SMB gateway was used, the entire AFS name space appeared to applications as a single volume exported as a single Windows File Share. It was not possible for Windows to report volume information (quota, readonly status, etc.) or detect out of space conditions prior to the application filling the Windows page cache. Now that reparse points are in use, Windows applications can recognize that a path might have crossed from one volume to another. Tools such as robocopy that are Junction (aka Reparse Point) aware can perform operations without crossing volume boundaries.

While this is a major improvement in capability, it is also a dramatic change in behavior for applications. Some applications rely upon the assumption that a Windows File Share can only refer to a single volume and further assume that any file path using UNC notation is a path to a Windows File Share. Such applications can become confused when they query the volume information of \\afs\example.org\ and are told that the volume is READ_ONLY when the volume containing the full target path \\afs\example.org\user\j\johndoe\ is not. This is a deficiency in the application and not a fault of the file system.
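The mistake can be shown with a toy mount table. This is a sketch under stated assumptions: the volume names `root.cell` and `user.johndoe` and the helper `volume_for` are hypothetical, and the lookup is a simple longest-prefix match rather than anything the redirector actually does.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Volume:
    name: str
    read_only: bool


# Hypothetical mount table: each key is the path at which a volume is mounted.
MOUNTS = {
    r"\\afs\example.org": Volume("root.cell", read_only=True),
    r"\\afs\example.org\user\j\johndoe": Volume("user.johndoe", read_only=False),
}


def volume_for(path: str) -> Volume:
    """Return the volume of the deepest mount point that contains path."""
    norm = path.rstrip("\\").lower()
    best_prefix, best_volume = "", None
    for prefix, volume in MOUNTS.items():
        p = prefix.lower()
        inside = norm == p or norm.startswith(p + "\\")
        if inside and len(p) > len(best_prefix):
            best_prefix, best_volume = p, volume
    if best_volume is None:
        raise LookupError(f"no volume contains {path!r}")
    return best_volume


if __name__ == "__main__":
    share_root = "\\\\afs\\example.org\\"
    target = "\\\\afs\\example.org\\user\\j\\johndoe\\report.xlsx"
    print(volume_for(share_root))  # what a share-root query reports
    print(volume_for(target))      # what actually governs the write
```

An application that asks about the share root and applies the answer to every path beneath it will wrongly refuse to write into the home directory. Querying the volume of the path it actually intends to use avoids the problem.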

One downside of the reparse point model is that applications need to understand the format of the structured data to make use of it. Tools such as JPSoftware's Take Command are reparse point aware but cannot at present properly display the target information. The same is true for Cygwin and related tools.

Authentication Groups
The SMB gateway client associated credentials with Windows account usernames (or SIDs). The AFS redirector tracks process creation and associates credentials with Authentication Groups (AG). Each process inherits an AG from the creating thread and can create additional AGs to store alternate sets of credentials. When background services such as csrss.exe and svchost.exe execute tasks on behalf of foreground processes they impersonate the credentials of the requesting thread. By impersonating the caller, the background thread informs the AFS redirector which credentials should be used.

Sometimes a mistake is made and the background service fails to impersonate the caller and instead attempts to rely upon the service's own credentials to perform its job. This is the case with conhost.exe when it attempts to access or manipulate the contents of the "Command Prompt.lnk" shortcut. As a result the contents of cmd.exe shortcuts are ignored when initiating command prompt console sessions.
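The difference between a service that impersonates its caller and one that does not can be sketched with a small model. Everything here is illustrative: the `Redirector` and `Process` classes, the `"anonymous"` fallback, and the token string are assumptions, and a process stands in for the creating thread to keep the model short.

```python
import itertools
from contextlib import contextmanager


class Redirector:
    """Toy model: maps Authentication Group ids to credentials."""

    def __init__(self):
        self._ids = itertools.count(1)
        self.tokens = {}  # AG id -> credentials

    def new_ag(self) -> int:
        return next(self._ids)


class Process:
    def __init__(self, rdr: Redirector, name: str, parent: "Process" = None):
        self.rdr = rdr
        self.name = name
        # A new process inherits the AG of its creator, if it has one.
        self.ag = parent.effective_ag() if parent else rdr.new_ag()
        self._impersonated = None

    def effective_ag(self) -> int:
        return self.ag if self._impersonated is None else self._impersonated

    def create_ag(self) -> int:
        """Switch to a fresh AG to hold an alternate set of credentials."""
        self.ag = self.rdr.new_ag()
        return self.ag

    @contextmanager
    def impersonate(self, caller: "Process"):
        previous = self._impersonated
        self._impersonated = caller.effective_ag()
        try:
            yield
        finally:
            self._impersonated = previous

    def credentials_for_request(self) -> str:
        # Assumption: a request with no tokens in its AG is treated as anonymous.
        return self.rdr.tokens.get(self.effective_ag(), "anonymous")


if __name__ == "__main__":
    rdr = Redirector()
    shell = Process(rdr, "explorer.exe")
    rdr.tokens[shell.ag] = "johndoe@example.org"
    excel = Process(rdr, "excel.exe", parent=shell)
    service = Process(rdr, "svchost.exe")

    print(excel.credentials_for_request())    # inherited from the shell
    print(service.credentials_for_request())  # no impersonation
    with service.impersonate(excel):
        print(service.credentials_for_request())
```

In this model, the conhost.exe behavior described above corresponds to calling `credentials_for_request()` outside the `impersonate` block: the request is evaluated against the service's own AG, which holds none of the user's tokens.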

When Will 1.8 Ship?
Users frequently ask "when will 1.8 ship? I don't want to deploy the new OpenAFS client until it is production quality." The reason that the OpenAFS client is 1.7.x and not 1.8.x has less to do with stability than it has to do with the rate of change and unfinished work. The Windows platform has new releases issued every one to two months, whereas the rate of issue for the servers and UNIX clients is one every six to twelve months. The rate of change to support new features or improve compatibility and performance on Windows is significantly higher. Nearly 1/3 of all patches contributed to OpenAFS.org are new functionality for Windows. Please do not focus so much on the version label.

1.8 will be issued when the rate of change in the Windows client drops to the point where a new release each month is no longer desirable. The two most significant areas of work that need to be addressed before a 1.8 release are the Kerberos bindings and the Installer. At present, the 1.7.x binaries are built directly against the MIT KFW 3.2 libraries. This permits OpenAFS to work with KFW 3.2 and the KFW translation layer provided by Heimdal 1.5. However, the KFW 3.2 API does not permit fine-grained control over the use of DES encryption types, nor is it guaranteed to work with future KFW releases from MIT. The installer requires ease of use improvements. The user should not be prompted when files are in use but should always be prompted to provide a cell name unless the installation is an upgrade.

What Comes After 1.8?
With large scale deployment comes operational experience. The AFS Redirector design has been shown to have weaknesses that result in a larger than desired in-kernel memory footprint. There are four areas in which a redesign would be desirable:

1. The File Control Blocks (FCB) and the Object Information Control Blocks (OICB) are bound to one another even though they could very well have different life spans. An FCB must exist as long as there is an open HANDLE. Multiple open handles for the same file system object refer to the same FCB. The FCB contains metadata about the file object that is specific to the file system in-kernel. It tracks the allocated file size, the list of data extents that are present in-kernel, etc. For each FCB there must exist an OICB which contains the AFS specific metadata associated with the file object, including AFS data version, AFS FileID, etc. While an OICB must exist for an FCB, it does not have to be the other way around.

The mutual binding of the OICB and the FCB makes garbage collection more difficult than it needs to be. Some of the race conditions that were fixed in the 1.7.18 release were the result of this complexity. One of the important goals of a redesign is to break this mutual dependency and instead maintain a reference only from the FCB to the OICB. Doing so will permit FCBs to be garbage collected when the last handle is closed and OICB objects to be garbage collected when their active reference counts reach zero. The garbage collection worker thread will hold fewer locks and have a smaller impact on file system performance.

2. The Directory Entry Control Blocks (DECB) also maintain a reference to the OICB. In fact, each time a directory is enumerated to satisfy FindFirst/FindNext API requests, not only is a DECB allocated but an OICB is as well. Permitting the OICB to be allocated only when an FCB is allocated instead of as part of directory enumeration will reduce the in-kernel memory footprint.

3. Directory enumeration is currently performed for the entire directory not only when the directory object is opened by an application but also when a FindFirst API is issued for a non-wildcard search. The vast majority of FindFirst searches are non-wildcard searches for explicit names. Instead of populating the full contents of the directory in-kernel, the memory footprint can be further reduced by pushing those queries to the afsd_service process.

4. File data is exchanged between the afsd_service and the Windows page cache by sharing a memory-mapped backing store between the AFS Redirector and the afsd_service. The control over specific file extents is managed by a reverse ioctl interface between the redirector and the user-land service. This protocol is racy and can result in inefficient exchanges of control. Replacing the existing protocol with one that tracks extent request counts and active reference counts will reduce wasteful exchanges and improve data throughput.
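The one-way reference proposed in the first item can be sketched as follows. This is a toy model in Python, not a description of the redirector's data structures: the `Cache`, `FCB`, and `OICB` classes and the string FileIDs are illustrative assumptions, and locking is omitted entirely.

```python
class OICB:
    """AFS-specific metadata; lives while anything holds a reference."""

    def __init__(self, fid: str):
        self.fid = fid
        self.refs = 0


class FCB:
    """Per-file kernel state; holds one reference to its OICB, never the reverse."""

    def __init__(self, oicb: OICB):
        self.oicb = oicb
        oicb.refs += 1
        self.handles = 0


class Cache:
    def __init__(self):
        self.oicbs = {}
        self.fcbs = {}

    def acquire(self, fid: str) -> OICB:
        """Take a reference on an OICB without creating an FCB."""
        oicb = self.oicbs.get(fid)
        if oicb is None:
            oicb = self.oicbs[fid] = OICB(fid)
        oicb.refs += 1
        return oicb

    def release(self, oicb: OICB) -> None:
        oicb.refs -= 1
        if oicb.refs == 0:
            del self.oicbs[oicb.fid]  # collected when the last reference goes

    def open(self, fid: str) -> FCB:
        fcb = self.fcbs.get(fid)
        if fcb is None:
            oicb = self.oicbs.get(fid)
            if oicb is None:
                oicb = self.oicbs[fid] = OICB(fid)
            fcb = self.fcbs[fid] = FCB(oicb)
        fcb.handles += 1  # multiple handles share one FCB
        return fcb

    def close(self, fcb: FCB) -> None:
        fcb.handles -= 1
        if fcb.handles == 0:
            del self.fcbs[fcb.oicb.fid]  # collected when the last handle closes
            self.release(fcb.oicb)


if __name__ == "__main__":
    cache = Cache()
    h1 = cache.open("fid-1")
    h2 = cache.open("fid-1")
    pin = cache.acquire("fid-1")
    cache.close(h1)
    cache.close(h2)
    print("FCB alive:", "fid-1" in cache.fcbs, "OICB alive:", "fid-1" in cache.oicbs)
    cache.release(pin)
    print("OICB alive:", "fid-1" in cache.oicbs)
```

Because the OICB never points back at the FCB, each object can be freed by a simple count reaching zero, which is the property that would let a garbage collection thread do its work while holding fewer locks.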

These proposed changes are a significant undertaking and they will not appear in the 1.7.x/1.8.x release series.