A standard technique in operating system development is to cache data in order to improve access times. This approach works well because data that has been recently accessed is likely to be accessed again soon. For a system where only one program ever accesses the data, caching works fine. However, once a second program attempts to access the same data, the issue of cache consistency becomes critical.

Because there are now potentially many copies of the data, each one in a different cache, there must be some mechanism to keep these copies in sync with one another. Without one, different users of the data will have different views: in effect there are many copies of the data, each one different from the others.

When accessing files across a network, the simplest scheme is to always store all data back on the file server. Thus, whenever an application program reads data, that read request is satisfied from the file server, across the network. This ensures a consistent view of the data, since only a single copy of the data exists: the one on the file server.

Unfortunately, the performance characteristics of such systems are not ideal. While file servers can be built to be very fast, there are numerous bottlenecks between the client accessing the data and the file server storing it, not to mention the added latency of fetching the data from the file server each time.

Studying this problem at length reveals that the majority of data retrieved from the file server is never modified; it is only read. Data that is being modified is almost always being modified by a single program, such as a word processing document. Data that is being modified by multiple programs represents an incredibly tiny fraction of the total data traffic. In spite of this, users expect their file systems to ensure that any data they access is correct: not most of the time, but all of the time.

The typical access characteristics for such data make network file system clients ideal candidates for caching data, and many of them do so, using a variety of techniques to ensure cache consistency. For example, the venerable NFS file system protocol checks file time stamps on the remote file server to detect when data in its cache may have become stale. While this solution is not perfect, since there is a window in which old data may be cached by the NFS client, it has worked well for many years.

For Windows NT, network file system caching is implemented by the LanManager redirector (the "client") and file server (the "server"). To ensure correctness of the cached data, LanManager implements a basic cache consistency scheme that covers the entire file contents. When files are simultaneously accessed across the network by multiple users for both read and write access, caching is disabled: clients must fetch data from the file server each time it is read, and must store it back immediately each time it is written. In the vast majority of cases, however, the client will cache data locally. This minimizes network traffic and vastly improves performance for most file access on Windows NT.

This is implemented by Windows NT using a cache consistency scheme known as opportunistic locking. An opportunistic lock is known as an "oplock" in the parlance of Windows NT file systems. Further, the implementation of oplocks by Microsoft impacts both their network and local file systems. Because the details of the local implementation are tightly coupled to how oplocks are used by network file systems, we describe the network implementation initially and then return to discussing issues associated with their local implementation for NT file systems.

In Figure 1 below, we provide our basic reference diagram for this discussion of oplocks. Oplocks are granted by SRV to instances of RDR running on systems across the network, possibly even including the same system on which SRV is running.

When a client opens a file across the network, it is typically the only user accessing that file. In this very common case, the network client need not store data back to the server immediately, nor need it fetch data repeatedly from the server. Allowing this optimization minimizes unnecessary network traffic which in turn provides overall better perceived performance for both the network client and all other clients using the network.

Figure 1

"Cache consistency" requires that any two clients on the network must see the same information in the file at the same point in time. Thus, if one client is not writing data back to the file server on a regular basis, a second client reading data from the server would receive stale data. This would violate the requirement that two clients on the network see the same information in the file at a given point in time.

Allowing client-side caching without suffering from cache consistency problems requires a cache consistency protocol: a mechanism whereby a client that keeps data locally, rather than writing it back to disk or refetching it from the server each time it is needed, can be informed when it must write the data back or reread it from the file server.

On Windows NT this is done via the "opportunistic locking" protocol, or oplock. In the balance of this section we describe the various types of oplocks, their uses, and how an FSD should deal with them.

There are three types of oplocks: level 1, batch, and level 2. Level 1 and batch oplocks are "exclusive access" grants. They are used slightly differently, however, and hence have somewhat different semantics. A level 2 oplock is a "shared access" grant on the file.
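As a rough illustration, the caching rights conferred by the three oplock types can be modeled in a few lines of Python. This is a toy model of our own, not the real NT interface; the names and simplified rules are ours:

```python
# Toy model of the three oplock types; simplified, not kernel code.
LEVEL_1 = "level1"   # exclusive: holder may cache both reads and writes
BATCH   = "batch"    # exclusive: like level 1, also spans open/close cycles
LEVEL_2 = "level2"   # shared: holder may cache reads only

EXCLUSIVE = {LEVEL_1, BATCH}

def may_cache_writes(level):
    """Only the exclusive oplock types permit write-back caching."""
    return level in EXCLUSIVE

def may_cache_reads(level):
    """All three oplock types permit read caching."""
    return level in (LEVEL_1, BATCH, LEVEL_2)
```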

Level 1 is used by a remote client that wishes to modify the data. Once granted a Level 1 oplock, the remote client may cache the data, modify the data in its cache and need not write it back to the server immediately.

Batch oplocks are used by remote clients for accessing script files, where the file is opened, read or written, and then closed, repeatedly. Thus, a batch oplock corresponds not to a particular application opening the file, but rather to a remote client's network file system caching the file because it knows something about the semantics of the given file access. The name "batch" comes from the fact that Microsoft observed this behavior with "batch files" being processed by command line utilities. Log files especially exhibit this behavior: when a script is being processed, each command is executed in turn. If the output of the script is redirected to a log file, the file fits the pattern described earlier, namely open/write/close. With many lines in a script this pattern can be repeated hundreds of times.

Level 2 is used by a remote client that merely wishes to read the data. Once granted a Level 2 oplock, the remote client may cache the data and need not worry that the data on the remote file server will change without it being advised of that change.

An oplock must be broken whenever the cache consistency guarantee provided by the oplock can no longer be provided. Thus, whenever a second network client attempts to access data in the same file across the network, the file server is responsible for "breaking" the oplocks and only then allowing the remote client to access the file. This ensures that the data is guaranteed to be consistent and hence we have preserved the consistency guarantees essential to proper operation.

An oplock break occurs whenever SRV detects that some condition necessary to maintaining the oplock has ceased to be correct. In that case, SRV begins breaking the oplock. Depending upon the type of oplock being broken, SRV may have to engage in a multi-message protocol to complete the oplock break.

The simplest oplock break is for a level 2 oplock. In this case, SRV merely advises the remote client that it must invalidate any cached data it has and reread it from the file server.

Figure 2

Breaking a level 1 oplock, however, is a bit more complicated. In that case the client may have data in memory that must be written back to the file server before the oplock break can be considered complete. A graphical description of the control flow between SRV and RDR is shown in Figure 2. It shows the call from SRV indicating that an oplock break is in progress. The remote client then initiates a series of write operations back to the server; this write-back can consist of many operations between the server and client. Once all data has been written back to the server, the client acknowledges the oplock break. Microsoft's protocol allows the server to grant a level 2 oplock to the client if the client so desires. This allows the client to retain the data in its cache (as it is still valid), minimizing unnecessary network traffic.
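The message exchange in Figure 2 can be sketched as a small simulation. This models only the flow of the break (notify, flush dirty data, acknowledge); the class and method names are ours, not the SMB protocol's:

```python
# Simplified model of a level 1 oplock break: the server notifies the
# client, the client flushes its dirty cached data, then acknowledges.
class Client:
    def __init__(self):
        self.dirty = ["block0", "block1"]   # unwritten cached data
        self.oplock = "level1"

    def on_break(self, server, allow_level2=True):
        # Write all dirty data back before acknowledging the break.
        for block in self.dirty:
            server.write(block)
        self.dirty.clear()
        # Optionally keep a level 2 oplock so cached (clean) data stays valid.
        self.oplock = "level2" if allow_level2 else None
        return "ack"

class Server:
    def __init__(self):
        self.stored = []

    def write(self, block):
        self.stored.append(block)

    def break_oplock(self, client):
        # A second opener arrived: break the client's level 1 oplock.
        return client.on_break(self)

server, client = Server(), Client()
ack = server.break_oplock(client)
```

Note that the acknowledgment comes only after every dirty block has reached the server, which is exactly why a level 1 break may involve many round trips.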

Breaking a batch oplock is initiated by the file server (SRV) which indicates to the client that an oplock break is in progress. The client (RDR) then writes any dirty cached data back to the file server. When that is completed, the client then closes the file. This causes the file to be reread from the file server on a subsequent access.

In fact, closing a file always releases an oplock on the given file. A client is no longer interested in cache consistency once the file has been closed; no data may be cached by the client if the file is not open.

The oplock protocol itself is sufficient to ensure cache consistency between clients anywhere on the network. There is one case, however, that is not covered by this mechanism: the case of local file system access, perhaps from a local application program. In this case, the application will call directly into the FSD without using either the server (SRV) or client (RDR) components.

This case is, of course, essential to our fundamental requirement for cache consistency. It is the requirement that NT support cache consistency for local client access that requires oplocks be implemented in the FSD. Thus, an inherently network-oriented activity (remote caching of data) has an important impact on local file systems.

Now we turn our attention to describing the mundane details of how to take advantage of oplocks in the local file system.

FSCTL_REQUEST_OPLOCK_LEVEL_1

A level 1 oplock is an exclusive oplock on the file. That is, it gives the holder of the lock the right to cache the data and to modify the data in its cache. Essentially, no other process (on any system in the network) may be accessing the file.

An FSD will grant such an oplock when the file is only opened by a single process. Thus, if the file is already opened by two or more clients when a request for a level 1 lock is made, the request will be denied.
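The grant rule above reduces to a one-line check. A hypothetical sketch (an FSD would also consult pending break state, which we ignore here):

```python
# Sketch of the level 1 grant rule: an exclusive oplock is granted
# only when the requester is the sole opener of the file.
def grant_level1(open_count):
    """Return True iff a level 1 oplock request should be granted."""
    return open_count == 1
```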

Similarly, if a level 1 lock is already held by the remote client and a second client opens the file, the level 1 lock previously granted must be revoked. This will trigger a write-back of any dirty data stored by the first client before the oplock break is completed.

An interesting requirement of the oplock protocol is that it requires the interface be implemented synchronously. The oplock is granted when STATUS_PENDING is returned for the IRP containing the oplock request. Thus, an FSD must process the original IRP synchronously, because returning STATUS_PENDING would indicate to the caller that the oplock grant was successful.

Once an oplock has been granted, the IRP representing that oplock is queued and held. The oplock break is performed by completing the original IRP that requested the oplock. The IRP must be completed with the Information field of the IoStatus block set to either FILE_OPLOCK_BROKEN_TO_LEVEL_2 or FILE_OPLOCK_BROKEN_TO_NONE.

At this stage, however, the oplock break has not completed. Instead, the owner of the oplock must do any internal processing required. Once that processing has completed, the oplock owner must acknowledge the oplock break. If FILE_OPLOCK_BROKEN_TO_LEVEL_2 was returned, the owner of the oplock may respond with FSCTL_OPLOCK_BREAK_ACKNOWLEDGE, in which case the acknowledgment IRP is treated as a request for a level 2 oplock (c.f., Section 0). Alternatively, the oplock owner may acknowledge the break but decline the offer of a level 2 oplock (c.f., Section 0) by responding with FSCTL_OPLOCK_BREAK_ACK_NO_2.
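The two acknowledgment paths after a break to level 2 can be modeled as follows. The control codes are real; the function and return convention are our own simplification:

```python
# Model of acknowledging a level 1 oplock broken to level 2: the ack
# either doubles as a level 2 request or declines the level 2 offer.
FSCTL_OPLOCK_BREAK_ACKNOWLEDGE = "ack"        # accept the level 2 offer
FSCTL_OPLOCK_BREAK_ACK_NO_2    = "ack_no_2"   # decline the level 2 offer

def acknowledge(fsctl):
    """Return the oplock level held after the acknowledgment."""
    if fsctl == FSCTL_OPLOCK_BREAK_ACKNOWLEDGE:
        return "level2"     # the ack IRP is treated as a level 2 request
    if fsctl == FSCTL_OPLOCK_BREAK_ACK_NO_2:
        return None         # no oplock remains on the file
    raise ValueError("unexpected control code")
```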

The principal reason a level 1 oplock is broken is because another caller opens the file. Normally a caller who wishes to open the file will block until the oplock break is completed. However, SRV (the LanManager file server) requires, for internal deadlock prevention reasons, that a create be completed before the oplock break is completed. This is done by setting (in the create request) the FILE_COMPLETE_IF_OPLOCKED bit in the option flags.

However, before SRV can use the file thus created, it must later verify that the oplock break has really completed. It does this by making a subsequent call to the FSD to wait until the oplock break on the given file is completed (c.f., Section 0).

FSCTL_REQUEST_OPLOCK_LEVEL_2

A level 2 oplock is a shared oplock on the contents of the file. It allows a network client (RDR) to cache the data in memory without fear that the data will change.

As with a level 1 oplock, the oplock is requested via an IRP and granted when the FSD returns STATUS_PENDING. Unlike a level 1 oplock, however, a level 2 oplock may be granted when the file has previously been opened. Further, a level 2 oplock may be granted even when other opens of the file allow write access. This point is very important: as it turns out, many applications will open a file for write access even if they never intend to modify the contents of the file.

Thus, when a write is done to a file, an FSD must check whether any level 2 oplocks have been granted against the file, and hence need to be invalidated. This ensures that if the remote client did cache data, that data will be properly invalidated.
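This write-path check can be sketched as a toy model. The classes here are hypothetical stand-ins for the FSD's per-file state and a caching client; only the break-on-write behavior is modeled:

```python
# Sketch of the write path check: a write to a file breaks every
# level 2 oplock so that remote caches are invalidated.
class File:
    def __init__(self):
        self.level2_holders = []

    def grant_level2(self, client):
        self.level2_holders.append(client)

    def write(self, data):
        # Break all level 2 oplocks before letting the write through.
        broken, self.level2_holders = self.level2_holders, []
        for client in broken:
            client.invalidate_cache()
        self.data = data

class CachingClient:
    def __init__(self):
        self.cache_valid = True

    def invalidate_cache(self):
        self.cache_valid = False

f, c = File(), CachingClient()
f.grant_level2(c)
f.write(b"new contents")
```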

The oplock is broken by completing the pending IRP. In the case of a level 2 oplock nothing is set in the Information field - the IRP is simply completed with STATUS_SUCCESS. This ensures that the oplock holder has received notification that their cached data is now stale and must be refreshed prior to subsequent use.

FSCTL_REQUEST_BATCH_OPLOCK

A batch oplock is an exclusive oplock against a file's contents and against changes in the attributes of the file (notably, but not exclusively, its name). It allows a network client to keep a file "oplocked" even though the application on the remote client is opening and closing the file repeatedly (as is the case for a batch file, hence the name of the oplock).

A batch oplock can only be granted under the same circumstances as a level 1 oplock (c.f., Section 0.) The oplock itself is requested via an IRP. Returning STATUS_PENDING for that IRP indicates the oplock itself has been granted.

Breaking a batch oplock is different than breaking a level 1 oplock. The emphasis with a batch oplock is protecting the file attributes, while with a level 1 oplock the emphasis is protecting the data within the file. Batch oplocks can be held across open instances of the file (that is, you can open the file, acquire the batch oplock, close the file, re-open the file, and still hold the batch oplock), which you cannot do with a level 1 oplock. Thus, in addition to being broken whenever the data itself has changed, a batch oplock must also be broken whenever the name of the file changes. This is because a batch oplock covers the file even though the client may be opening and closing the file repeatedly. If a rename occurred in that situation, the client needs to be advised that the file handle it is using no longer represents the file it used to represent.
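The extra break condition for batch oplocks can be captured in a small decision function. This reflects only the distinction drawn above (batch oplocks additionally break on rename); the function and event names are ours:

```python
# Model of the break conditions described in the text: both exclusive
# oplock types break when the data changes, but only a batch oplock
# must additionally break when the file is renamed.
def must_break(oplock_type, event):
    """Return True if the given event forces a break of this oplock."""
    if oplock_type in ("level1", "batch") and event == "data_changed":
        return True
    if oplock_type == "batch" and event == "renamed":
        return True
    return False
```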

One interesting side effect of using batch oplocks is that certain CREATE operations may fail with the Information field set to FILE_OPBATCH_BREAK_UNDERWAY. This occurs when the caller indicated it was unwilling to wait for the oplock break to complete by setting the FILE_COMPLETE_IF_OPLOCKED option flag, as is typically the case for SRV, the LanManager file server. In this case the create operation fails (typically with STATUS_SHARING_VIOLATION) to indicate to the caller that the problem is a batch oplock presently held on the file, and that a blocking call to CREATE would not necessarily fail.
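A sketch of that create-time behavior, returning a (status, information) pair. The status and flag names mirror the real ones, but the function and its simplified logic are our own (a caller willing to block would simply wait for the break and then succeed, which we collapse into an immediate success here):

```python
# Model of a create against a file whose batch oplock is being broken.
FILE_OPBATCH_BREAK_UNDERWAY = "opbatch_break_underway"
STATUS_SHARING_VIOLATION = "sharing_violation"

def create(batch_break_in_progress, complete_if_oplocked):
    """Return (status, information) for the create attempt."""
    if batch_break_in_progress and complete_if_oplocked:
        # Caller refused to block: fail, but flag that a batch break is
        # underway, so a blocking retry would not necessarily fail.
        return (STATUS_SHARING_VIOLATION, FILE_OPBATCH_BREAK_UNDERWAY)
    # No break in progress (or caller is willing to wait it out).
    return ("success", None)
```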

FSCTL_OPLOCK_BREAK_ACKNOWLEDGE

Once an exclusive (level 1 or batch) oplock has been broken, other file system requests cannot continue until the oplock break is acknowledged. This can be done in one of two ways: either by a subsequent call to the FSD indicating a control code of FSCTL_OPLOCK_BREAK_ACKNOWLEDGE, or by closing the file handle.

A batch oplock break is normally acknowledged by the file object being closed. A level 1 oplock is normally acknowledged by way of this call.

FSCTL_OPLOCK_BREAK_NOTIFY

When SRV opens a new file, indicating that it does not wish to wait for the oplock break to complete (c.f., Section 0), it must subsequently call the underlying FSD to ensure that the oplock break has successfully completed.

This is accomplished by indicating FSCTL_OPLOCK_BREAK_NOTIFY as the control code in the IRP. This IRP will then block waiting for any oplock break activity to complete on the file. Once this call returns (STATUS_SUCCESS) the FSD may use the file object safely.
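The blocking behavior of this control code can be modeled with a per-file flag and a list of pended waiters. In a real FSD the waiter would be a pended IRP; here it is just a callback, and all the names are ours:

```python
# Model of FSCTL_OPLOCK_BREAK_NOTIFY: the request completes only once
# no oplock break is in progress on the file.
class FileState:
    def __init__(self):
        self.break_in_progress = False
        self.waiters = []

    def break_notify(self, on_complete):
        if not self.break_in_progress:
            on_complete("success")            # complete immediately
        else:
            self.waiters.append(on_complete)  # pend until the break ends

    def complete_break(self):
        # The oplock break finished: release every pended notify request.
        self.break_in_progress = False
        for callback in self.waiters:
            callback("success")
        self.waiters.clear()

results = []
fs = FileState()
fs.break_in_progress = True
fs.break_notify(results.append)   # pends: a break is still underway
fs.complete_break()               # now the pended waiter completes
```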

For SRV, proper implementation of these semantics by the FSD is essential to correct behavior. If a normally asynchronous CREATE operation by SRV is forced to be synchronous (perhaps by a filter driver) SRV will experience internal deadlock conditions.

FSCTL_OPBATCH_ACK_CLOSE_PENDING

Earlier we mentioned that an oplock break could be acknowledged by closing the file. This control code is used by the oplock owner to indicate that the oplock break has been acknowledged and a close of the file is imminent.

In this case, a level 2 oplock is not necessary. No further use should be made of this file object except to close the file.

FSCTL_OPLOCK_BREAK_ACK_NO_2

This control code is a variation on the general acknowledgment operation. In this instance, the owner of the oplock is declining the offer (by the FSD) of a level 2 oplock. This is typically because the owner of the oplock does not use or support level 2 oplocks.

User Comments

"RE: How is that asynchronous?"
Looks like you're right, that line should indeed read:

"An interesting requirement of the oplock protocol is that it requires the interface be implemented synchronously."

-scott

12-Aug-10, Scott Noone

"How is that asynchronous?"
I'm a bit confused by this paragraph:

An interesting requirement of the oplock protocol is that it requires the interface be implemented asynchronously. The oplock is granted when STATUS_PENDING is returned to the IRP containing the oplock request. Thus, an FSD must complete the processing of the original IRP synchronously because returning STATUS_PENDING would indicate the oplock grant was successful to the caller.

I think maybe the word "asynchronously" in the first sentence is a typo for "synchronously", is this true?