Thursday, July 29, 2010

Over at Real World Technologies, Tarek Chammah has a wonderful, long, detailed review of HotPar 2010, the recent USENIX Workshop on Hot Topics in Parallelism. I particularly liked the section describing recent work on parallelizing Firefox, and the section describing the techniques for parallelizing rendering in gaming engines.

But meanwhile, it's also interesting to read Andrew Tridgell's recent post about his efforts at upgrading the operating system (Ubuntu) on one of his machines. It's amazing that he succeeded at all; I can't think of many engineers that I know of who would have been able to muster up the tools and techniques necessary to break through and resolve this problem. But the reason I thought it was fascinating to read both these articles on the same day was Tridge's description of how the recent work on speeding up Ubuntu boot times by parallelizing the startup work had made it so hard to diagnose the boot failures:

To debug startup problems you need to be able to watch the startup process in action, to see what is waiting. This is much harder these days with the new upstart init system now used in Ubuntu, as startup is much more parallel than it used to be. Adding some echo lines to init scripts used to be a useful technique, but it is much harder to get anything sensible out of that when using upstart.

Really, though, there's no getting around it: parallelism is both crucial for future development work and the source of tremendous complexity and intricacy when things go wrong. Two sides of the same coin.

Wednesday, July 28, 2010

On July 16th, I severely strained my hamstring playing soccer, so I've been more-or-less laid up since then (it's recovering nicely, thank you!). I was Mr Limp-a-lot for the Chicago trip, which turned out to be OK, because between the 98 degree heat and the unbelievable thunderstorms, nobody was really up for a lot of walking anyway.

Then, when I get back, the dry, scratchy feeling in my throat flares up into a full-blown case of Strep. Ugh. Now it's 5 days with me and the Z-pak -- did you know it was discovered in Croatia?

My goal at this point is just to get healthy in time to go backpacking in 10 days. Please! Getting back onto the regular exercise regimen will just have to wait a few weeks...

I remember the Stingley incident, although I was living in New England at the time.

I don't remember much about the Immaculate Reception, but I've seen a million replays.

Jack Tatum was as controversial as they get, but he stood up for what he believed in and didn't try to pretend he was somebody he wasn't. That was very much a different generation of the NFL; it's quite a different form of sport now. I don't want to go back to those days, but it doesn't mean we shouldn't remember them.

Wednesday, July 21, 2010

I've been whiling away an hour or two thinking about Unix file locks, and fair scheduling, and from what I can tell, Unix file locking does not implement fair scheduling. Below are some more words to explain what that means, in more detail.

Unix file locking is a fairly common set of APIs on most Unix and Unix-like systems to support what are called advisory locks:

UNIX does not automatically lock open files and programs. There are different kinds of file locking mechanisms available in different flavours of UNIX and many operating systems support more than one kind for compatibility. The two most common mechanisms are fcntl(2) and flock(2). Although some types of locks can be configured to be mandatory, file locks under UNIX are by default advisory. This means that cooperating processes may use locks to coordinate access to a file among themselves, but programs are also free to ignore locks and access the file in any way they choose to.

Unix file locking is available on a very broad set of Unix-like operating systems, including Solaris, Linux, FreeBSD, HPUX, etc.
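For readers who haven't used these calls, here is a minimal sketch of advisory locking with flock(2). The helper names (open_lockfile, try_lock, release_lock) are made up for illustration, and the sketch uses the non-blocking form so that a conflict is reported rather than waited on. It assumes a system that provides flock(2); fcntl(2) record locks behave differently in several respects and are not shown.

```c
#include <errno.h>
#include <fcntl.h>
#include <sys/file.h>
#include <unistd.h>   /* close(2); closing a descriptor also drops its lock */

/* Hypothetical helper: open (creating if needed) the file used for locking.
 * Returns a descriptor, or -1 on error. */
int open_lockfile(const char *path)
{
    return open(path, O_RDWR | O_CREAT, 0644);
}

/* Hypothetical helper: try to take an advisory lock without blocking.
 * exclusive != 0 asks for an X lock, otherwise an S lock.
 * Returns 0 if granted, 1 if a conflicting lock is held, -1 on error. */
int try_lock(int fd, int exclusive)
{
    int op = (exclusive ? LOCK_EX : LOCK_SH) | LOCK_NB;
    if (flock(fd, op) == 0)
        return 0;
    return (errno == EWOULDBLOCK) ? 1 : -1;
}

/* Hypothetical helper: release whatever lock this descriptor holds. */
int release_lock(int fd)
{
    return flock(fd, LOCK_UN);
}
```

Because the locks are advisory, nothing stops a program that never calls try_lock from reading or writing the file anyway. Dropping LOCK_NB from the operation turns the call into a blocking request, which is the case where the order of granting becomes interesting.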

Fair Scheduling of lock requests is a term introduced 35 years ago (at least) by that titan of system programming, Jim Gray. In section 4.1 of the ultra-classic paper Granularity of Locks in a Shared Data Base (co-authored by Franco Putzolu and Ray Lorie), Gray writes:

The set of all requests for a particular resource are kept in a queue sorted by some fair scheduler. By "fair" we mean that no particular transaction will be delayed indefinitely. First-in first-out is the simplest fair scheduler and we adopt such a scheduler for this discussion modulo deadlock preemption decisions.

Let's try to illustrate fair scheduling with a short example:

Suppose that I have a set of processes (A, B, C, etc.) who are trying to co-operate by locking and unlocking a common file named file. The processes may lock the file in either shared (S) or exclusive (X) mode.

First, A locks file in shared mode.

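Second, B locks file in shared mode. These first two locks are immediately granted, as they conflict with nothing.

Third, C locks file in exclusive mode. This request conflicts with the S locks held by A and B, so C must wait.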

Fourth, A unlocks file. At this point, B still holds file (S), and so C still waits.

Fifth, D locks file in shared mode. Here, we have a scheduling decision.

On the one hand, D's lock request can be immediately granted, because it is compatible with all currently-granted locks. This scheduling regimen is sometimes called greedy scheduling, because it attempts to grant locks whenever possible. On the other hand, if D's lock request is immediately granted, then it has been given "unfair" access to the resource, because it made its request after C, and yet is being granted access before C. If, at this point, the locking implementation forces D to wait for its S lock until C has first been granted, then released, its X lock, then the lock manager is practicing fair scheduling, because it is ensuring that C is not delayed indefinitely. (The delay might be indefinite because S lock requests might keep coming along from various processes, and if they arrive fast enough that there is never a moment when all the S locks have been released, then the X lock request may never be granted.)
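The decision at the fifth step can be written down as a tiny model. This is only a sketch of the two policies as described above, not how any real kernel implements them; the type and function names (lock_state, grant_now, and so on) are invented for the illustration.

```c
#include <stdbool.h>

typedef enum { MODE_S, MODE_X } lock_mode;
typedef enum { POLICY_GREEDY, POLICY_FAIR } sched_policy;

/* Hypothetical model of one lockable resource. */
typedef struct {
    int  shared_holders;   /* number of granted S locks */
    bool exclusive_held;   /* true if an X lock is granted */
    int  waiters;          /* requests already queued ahead of a new arrival */
} lock_state;

/* Is the requested mode compatible with what is currently granted? */
static bool compatible(const lock_state *st, lock_mode mode)
{
    if (st->exclusive_held)
        return false;
    if (mode == MODE_X)
        return st->shared_holders == 0;
    return true;           /* S is compatible with other S locks */
}

/* Decide whether a newly arriving request is granted immediately. */
bool grant_now(const lock_state *st, lock_mode mode, sched_policy policy)
{
    if (!compatible(st, mode))
        return false;
    /* FIFO fairness: never overtake a request that is already waiting. */
    if (policy == POLICY_FAIR && st->waiters > 0)
        return false;
    return true;
}
```

At step five the state is one shared holder (B), no exclusive holder, and one waiter (C). With that state, grant_now for D's S request is true under POLICY_GREEDY and false under POLICY_FAIR, which is exactly the difference between the two regimens.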

So, now it should be clear what I mean when I frame the question:

Does Unix file locking implement fair scheduling?

To try to answer the question, I did a simple experiment: I wrote a short C program (actually, one of my colleagues at my day job wrote it for me) which has a simple command-line interface and allows me to interactively lock and unlock files in shared and exclusive mode, and I fired up about half-a-dozen terminal windows, and I started experimenting.

Sure enough, on the systems that I tried:

Linux 2.6 (Ubuntu)

MacOS X 10.6 (Darwin)

FreeBSD 8.0

None of the systems appeared to offer fair scheduling. Late-arriving S lock requests were granted in preference to waiting X lock requests. In fact, on some of the systems, when multiple X lock requests were waiting, and all the S lock requests were released, the X request that was then granted was not always the first X that had made its request! So not only did I observe lock starvation by late-arriving S lock requests, I also observed lock starvation by late-arriving X lock requests.

Interestingly, there was one system that did implement fair scheduling:

Solaris 10

So it appears to be a bit of a mixed bag: some Unix-like systems implement fair scheduling, some do not, and, furthermore, for those that do not, some appear to implement FIFO scheduling of blocked requests, where others will grant an (apparently) arbitrary request from the pool of blocked requests when the last blocking lock is released.

You may be wondering why this matters.

Well, lock scheduling and, specifically, the fairness or unfairness of lock scheduling, is known to be quite closely related to a number of thorny and very annoying performance problems, including starvation and convoying.

Starvation is fairly easy to conceptualize: it is a situation where one request is not making any progress because it is being starved of its resources. I tried to illustrate starvation in my example above, where the exclusive lock requester can wait an unexpectedly long time for a lock request to be granted, because the system is unfairly allowing shared lock requests to be granted in preference to it.
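To make the starvation case concrete, here is a toy simulation. Everything in it is an assumption chosen for the illustration: time advances in ticks, a new S request arrives every 2 ticks and holds its lock for 3 ticks, and a single X request arrives at tick 1. The function name writer_grant_tick is hypothetical.

```c
/* Returns the tick at which the X request is granted, or -1 if it is
 * still waiting when the simulation ends. */
int writer_grant_tick(int fair, int max_ticks)
{
    const int reader_period  = 2;  /* a new S request every 2 ticks */
    const int reader_hold    = 3;  /* each S lock is held for 3 ticks */
    const int writer_arrival = 1;  /* the X request arrives at tick 1 */
    int last_reader_release  = 0;  /* tick when the last granted S lock ends */

    for (int t = 0; t < max_ticks; t++) {
        int writer_waiting = (t >= writer_arrival);
        int readers_active = (t < last_reader_release);

        if (writer_waiting && !readers_active)
            return t;              /* no S locks remain: grant the X lock */

        if (t % reader_period == 0) {
            /* Greedy: S is compatible with the held S locks, so grant it.
             * Fair: a late S request queues behind the waiting X request. */
            if (!(fair && writer_waiting))
                last_reader_release = t + reader_hold;
        }
    }
    return -1;                     /* starved for the whole run */
}
```

With these numbers the S locks always overlap, so under the greedy policy the X request is never granted no matter how long the simulation runs, while under the fair policy it is granted at tick 3, as soon as the S lock taken at tick 0 is released.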

Lock convoys are a more complicated problem, and have to do with situations in which the system is fairly busy, and so there is a fair amount of contention for the locks, and additionally the processes are contending for CPU resources and are being time-sliced. In such a case, it is unfortunately quite common for the lock scheduling algorithms to interact very poorly with the time-slicing and context-switching algorithms, leading to a poorly-performing system which is called the "convoy" or "boxcar" effect. I'm not going to have time (nor do I have the brainpower) to describe it very well myself, so let me just point you to a bunch of well-written descriptions:

Joe Duffy's Weblog: "This wait, unfortunately, causes a domino effect from which the system will never recover."

Raymond Chen: "It's not how long you hold the lock, it's how often you ask for it."

Sue Loh: "If all threads at that priority level are contending over the object, what you see is very frequent context switching between them all."

Larry Osterman: "You spend all your valuable CPU time executing context switches between the various threads and none of the CPU time is spent actually processing work items."

Rick Vicik: "when there are many threads, each one runs only briefly before waiting for a lock and by the time it runs again, its cache-state has been wiped out by the others."

Fred on Programming: "You can get this easily if you wake up a large number of threads with the same priority and have them all immediately try to get another lock."

Well, anyway, there you go: a short essay about Unix file locking, fair scheduling, and lots of pointers to fascinating discussions about lock convoys and alternate scheduling algorithms. What more could you want for a summer afternoon?!

Monday, July 19, 2010

The overall presentation is a bit silly and over-blown, but I really like the basic ideas presented in Alberto Savoia's The Way of Testivus. It's hard-earned wisdom, and worth keeping in mind for all programming activities.

Tuesday, July 13, 2010

As the commenters on my earlier post surmised, the notion of a non-elevated administrator is tightly connected with the Windows concept of User Account Control, which was introduced with Windows Vista.

UAC is the Microsoft answer to reducing the privileges users run with by default in Windows Vista. Strategically, Microsoft is moving to an environment where users do not have or need privileges that can affect the operating system and machine-wide configuration in order to perform day-to-day tasks.

So, the important part involves something called the "split token":

UAC starts working when a user logs onto a machine. During an interactive login, the Local Security Authority (LSA) takes the user's credentials and performs the initial logon, evaluating the user's token to see if it has what are defined as elevated privileges. If the LSA determines that the user has elevated privileges, it will filter this token and then perform a second logon with the filtered token.

The result is that:

The desktop session and explorer.exe will always be created with a token that approximates the token of a member of the Users group. Any process that is initiated from the Start Menu or by a user double-clicking in an Explorer window that doesn't require elevation will simply inherit this filtered token. Therefore, by default, every application will be running with the standard user token.

Thus, a "non-elevated administrator" is simply a user who is a member of the Administrators group, who has performed a normal logon to a Vista+ Windows machine, with User Account Control in effect, and thus is running with a special minimal set of privileges designed to approximate a normal non-Administrator user.

And an "elevated" administrator is such a user, running an application using the "Run As Administrator" feature to run with elevated Administrator privileges. When starting such an application, the user is made aware of the elevation, as Russinovich describes:

Granting a process administrative rights is called elevation. When it’s performed by a standard user account, it’s referred to as an Over the Shoulder (OTS) elevation because it requires the entry of credentials for an account that’s a member of the administrator’s group, something that’s usually completed by another user typing over the shoulder of the standard user. An elevation performed by an AAM user is called a Consent elevation because the user simply has to approve the assignment of his administrative rights.

This process is what gives rise to those dialog boxes that you've become so familiar with as part of using Vista or Windows 7.

Another interesting part of User Account Control, which I think is perhaps not very well known, is that there are actually three types of application privilege configurations:

asInvoker

highestAvailable

requireAdministrator

asInvoker and requireAdministrator are the ones most people are familiar with, and they are fairly simple to understand. highestAvailable is more complex. Corio describes the run levels for us:

When a new process is created, the AIS will inspect the binary to determine whether it requires elevation. The first thing that gets checked is the application manifest that is embedded into the application's resources. This takes precedence over any other type of application marking including an application compatibility marking or UAC's Installer Detection, which is described later. The manifest defines a run level that tells Windows the privileges needed to run the binary. The three choices for run level are: asInvoker, highestAvailable, and requireAdministrator.

When AIS finds a binary that is marked with the "asInvoker" run level, it takes no action and the process inherits the process token of the parent process that created it. The "requireAdministrator" run level is pretty straightforward as well and defines that the process must be created by a user token that is a member of the administrator group. If the user who attempted to create this process is not an administrator, he will be presented with the Credential dialog to input his credentials.

The highestAvailable run level is a little more complicated. It denotes that if a user has a linked token, then the application should run with the higher privileged token. This is generally used for applications that have a UI designed for the Users and Administrators groups and it ensures that the application gets the user's full privileges.

I've now got a pile of new documentation to read, but at least the basic notion of being a non-elevated Administrator is starting to make more sense now. After nearly a decade of using Windows XP, I've got a bit of un-learning to do, and a lot of new concepts and ideas to shove into those empty spaces in my brain...

We've reached the midway point in the 2010 Google Summer of Code, and I've just completed the mid-term evaluation process. This year, Derby is mentoring three students, and all three seem to be doing well. In the project I'm involved with, Nirmal Fernando is making great progress on building Query Plan visualization tools, which are crucial for comprehending the behavior of Derby on large and complex queries.

The core PlanExporter functionality that Nirmal is working on is now a solid piece of software, backed up by automated regression tests that demonstrate its functionality. We're starting to work through more advanced issues, such as packaging, security, and looking for more advanced test cases.

I'm particularly pleased to see Nirmal's comfort level growing as he gains experience working with the Derby community. For example, Nirmal recently approached the community with questions about Java security policy and privileged action implementation (not simple subjects). He raised the topic, worked through several iterations of follow-up and clarification, and solved the problem, all without any assistance from me. You can see the discussions here and here.

We've still got a lot of work left to do on the visualization tools, but if the second half of GSoC 2010 goes as well as the first half, I'll be quite pleased with the results.

Sunday, July 4, 2010

What sort of an API would we like to have in a library for working with Windows symbolic links? I don't think it has to be terribly complicated. We could probably get by with 3 basic functions:

NTFS_makelink(name, target) -- create a new symbolic link by the given name, pointing at the indicated target

NTFS_islink(name) -- return 1 if there is a symbolic link by the given name, return 0 otherwise

NTFS_readlink(name) -- return the target that is pointed to by the symbolic link by the given name

NTFS_makelink looks pretty straightforward: we just have to call CreateSymbolicLink. One bit of complexity is that we have to figure out if the link target is a directory, and pass the SYMBOLIC_LINK_FLAG_DIRECTORY flag if it is. Why do you suppose this flags argument exists? Couldn't the Windows API have figured that out for us, instead of making us figure it out ourselves?

NTFS_islink is slightly more complicated, but still not too hard: we call GetFileAttributes, then check the FILE_ATTRIBUTE_REPARSE_POINT flag to see if we have a reparse point. If we do, then we call FindFirstFile to get the WIN32_FIND_DATA for the file, and look in the dwReserved0 member of that structure for the value IO_REPARSE_TAG_SYMLINK (the value IO_REPARSE_TAG_MOUNT_POINT would instead indicate a junction or mounted folder).

NTFS_readlink could be implemented in either of two ways:

We could use DeviceIoControl to read the FSCTL_GET_REPARSE_POINT data, then look at the returned value to see what the target is.

Or, we could use Raymond Chen's technique, and open the symbolic link, allowing the file system to automatically follow the reference and locate the target, then call GetFinalPathNameByHandle to figure out what file we got to.

Overall, this leaves us with an API looking something like:

// Makes a symbolic link by the given name, pointing at the given target.
//
// Returns 0 if all went well, <0 if there was a problem (call GetLastError()
// to find out the details of the error)
extern int NTFS_makelink(const char *name, const char *target);

// Checks the given name to see if it is a symbolic link.
//
// Returns 1 if it is a link, 0 if it is not, <0 if there was a problem
// (call GetLastError() to find out the details of the problem)
extern int NTFS_islink(const char *name);

// Fills in the provided buffer with the target of the given symbolic link.
//
// Copies the target into the buffer and null-terminates it, unless
// the target is longer than bufSize, in which case it is truncated
// and NOT null-terminated. So you probably want to pass an allocated
// buffer of size MAX_PATH to this function.
//
// Returns 0 if all went well, <0 if there was a problem (call GetLastError()
// to find out the details of the error)
extern int NTFS_readlink(const char *name, char *targetBuf, int bufSize);
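The truncation rule in that last comment is worth pinning down, since it is easy to get wrong on the calling side. Here is a small portable sketch of just the buffer-copy step, with no Windows calls in it. The helper names copy_link_target and is_terminated are hypothetical, and using strncpy is my assumption about one reasonable way to get the described behaviour: it pads with nulls when the target is shorter than the buffer and leaves the buffer unterminated otherwise.

```c
#include <string.h>

/* Hypothetical helper: copy target into buf following the contract in the
 * NTFS_readlink comment. Returns 0 on success, -1 on bad arguments. */
int copy_link_target(char *buf, int bufSize, const char *target)
{
    if (buf == NULL || target == NULL || bufSize <= 0)
        return -1;
    /* Copies at most bufSize bytes; no terminator is written if the
     * target fills the whole buffer. */
    strncpy(buf, target, (size_t)bufSize);
    return 0;
}

/* Caller-side guard: returns 1 if buf holds a complete, terminated string. */
int is_terminated(const char *buf, int bufSize)
{
    return memchr(buf, '\0', (size_t)bufSize) != NULL;
}
```

One edge case falls out of this: a target that is exactly bufSize characters long also ends up without a terminator, so a caller should check with something like is_terminated before treating the buffer as a C string.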

The next step, which will have to wait until next week sometime when I get access to my Windows 7 development machine, is to try writing this code, and seeing what it does :)

the POSIX standard requires the file system to support case-sensitive file and directory names, a "file-change-time" time stamp (which is different from the MS-DOS "time-last-modified" stamp), and hard links. NTFS implements each of these features. NTFS does not implement POSIX symbolic links in its first release, but it can be extended to do so.

Custer briefly describes how hard links work:

When a hard link to a POSIX file is created, NTFS adds another file name attribute to the file's MFT file record. When a user deletes a POSIX file that has multiple names (hard links), the file record and the file remain in place. The file and its record are deleted only when the last file name (hard link) is deleted.
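The behaviour Custer describes is the same one you can observe on any POSIX system, so here is a small sketch using the POSIX calls link(2), unlink(2) and stat(2) rather than the Windows API. The helper names create_empty and link_count are invented for the example, and it assumes both names live on the same file system.

```c
#include <fcntl.h>
#include <sys/stat.h>
#include <unistd.h>

/* Hypothetical helper: create an empty file.
 * Returns 0 on success, -1 on error. */
int create_empty(const char *path)
{
    int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0)
        return -1;
    return close(fd);
}

/* Hypothetical helper: how many names (hard links) the file has,
 * or -1 if the path cannot be examined (for example, it no longer exists). */
long link_count(const char *path)
{
    struct stat st;
    if (stat(path, &st) != 0)
        return -1;
    return (long)st.st_nlink;
}
```

After create_empty("a") followed by link("a", "b"), link_count reports 2 for either name. Calling unlink("a") removes only that name, and the file stays reachable through "b" with a count of 1; it goes away when the last name is unlinked.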

So hard links have been part of Windows/NTFS for over 15 years; there is a CreateHardLink function in the Windows API; you can read more about hard links at the MSDN web site.

The second stage of support for file links was added in Windows 2000, and was called junctions. As the MSDN documentation describes:

A junction (also called a soft link) differs from a hard link in that the storage objects it references are separate directories, and a junction can link directories located on different local volumes on the same computer. Otherwise, junctions operate identically to hard links. Junctions are implemented through reparse points.

Note that hard links are alternate names for files, whereas junction points occur at the directory level of a path, and include the name of another directory. So junction points are always used to have a soft link from one directory to another directory.

Junction points appear to have been originally directed at the problem of stitching together the multi-volume Windows file system into a single logical file system, with links that crossed from one volume to another. As Knowledge Base article 205524 describes:

You can surpass the 26 drive letter limitation by using NTFS junction points. By using junction points, you can graft a target folder onto another NTFS folder or "mount" a volume onto an NTFS junction point. Junction points are transparent to programs.

The KB article describes the commands "linkd", "mountvol", and "delrp", and notes that the "rp" in "delrp" refers to reparse points, which are the underlying feature that supports junction points. The Sysinternals section of TechNet additionally provides a utility named Junction, for working with junction points. (Sadly, unlike many of the Sysinternals tools, the Junction tool does not appear to come with source code; it's an executable only.)

It's not clear how you work with a Junction point programmatically. Is there a call in the Windows API to create or delete a Junction point? Or is it only possible using these special command-line tools? It must be possible, as various people have implemented their own tools and extensions for working with junction points: here are two: (a) Hermann Schinagl's LinkShellExtension, and (b) FlexHex.com's CreateJunction utility, which does include a snippet of source describing their use of the FSCTL_SET_REPARSE_POINT IOControl operation. Yikers!

Junction points are implemented using reparse points, which are also described in the MSDN documentation:

A file or directory can contain a reparse point, which is a collection of user-defined data. The format of this data is understood by the application which stores the data, and a file system filter, which you install to interpret the data and process the file. When an application sets a reparse point, it stores this data, plus a reparse tag, which uniquely identifies the data it is storing. When the file system opens a file with a reparse point, it attempts to find the file system filter associated with the data format identified by the reparse tag. If a file system filter is found, the filter processes the file as directed by the reparse data. If a file system filter is not found, the file open operation fails.

The third stage of support for file links in Windows came rather quietly, as part of Windows Vista. I don't recall a lot of fanfare about this functionality when Vista was announced; I guess I wasn't paying attention! As described in the MSDN documentation:

A symbolic link is a file-system object that points to another file system object. The object being pointed to is called the target.

Symbolic links are transparent to users; the links appear as normal files or directories, and can be acted upon by the user or application in exactly the same manner.

Symbolic links are designed to aid in migration and application compatibility with UNIX operating systems. Microsoft has implemented its symbolic links to function just like UNIX links.

Symbolic links can either be absolute or relative links. Absolute links are links that specify each portion of the path name; relative links are determined relative to where relative–link specifiers are in a specified path.

There is a CreateSymbolicLink function that allows a program to, naturally, create a symbolic link.

And there is a small set of notes entitled Programming Considerations, which really seems like it belonged in the "Remarks" section under the CreateSymbolicLink function.

And there is a short article about how to open a filesystem object to read the reparse point information:

To determine if a specified directory is a mounted folder, first call the GetFileAttributes function and inspect the FILE_ATTRIBUTE_REPARSE_POINT flag in the return value to see if the directory has an associated reparse point. If it does, use the FindFirstFile and FindNextFile functions to obtain the reparse tag in the dwReserved0 member of the WIN32_FIND_DATA structure. To determine if the reparse point is a mounted folder (and not some other form of reparse point), test whether the tag value equals the value IO_REPARSE_TAG_MOUNT_POINT. For more information, see Reparse Points.

To obtain the target volume of a mounted folder, use the GetVolumeNameForVolumeMountPoint function.

In a similar manner, you can determine if a reparse point is a symbolic link by testing whether the tag value is IO_REPARSE_TAG_SYMLINK.
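The decision logic in that procedure can be separated from the API calls, which makes it easy to reason about. The sketch below is only that decision step: on Windows the two inputs would come from GetFileAttributes and from the dwReserved0 member of WIN32_FIND_DATA, as the article describes. The function name classify_link and the link_kind enum are made up for illustration, and the constants are defined locally (with the values I believe the Windows SDK headers use) only so that the sketch compiles without <windows.h>.

```c
#include <stdint.h>

/* Defined here only when the Windows headers have not already done so. */
#ifndef FILE_ATTRIBUTE_REPARSE_POINT
#define FILE_ATTRIBUTE_REPARSE_POINT 0x00000400u
#endif
#ifndef INVALID_FILE_ATTRIBUTES
#define INVALID_FILE_ATTRIBUTES 0xFFFFFFFFu
#endif
#ifndef IO_REPARSE_TAG_MOUNT_POINT
#define IO_REPARSE_TAG_MOUNT_POINT 0xA0000003u
#endif
#ifndef IO_REPARSE_TAG_SYMLINK
#define IO_REPARSE_TAG_SYMLINK 0xA000000Cu
#endif

/* Hypothetical result type for the classification. */
typedef enum {
    LINK_ERROR = -1,       /* the attribute query itself failed */
    LINK_NONE = 0,         /* ordinary file or directory */
    LINK_SYMLINK,          /* symbolic link */
    LINK_MOUNT_POINT,      /* junction or mounted folder */
    LINK_OTHER_REPARSE     /* some other kind of reparse point */
} link_kind;

/* attributes: value returned by the attribute query.
 * reparse_tag: tag value; only meaningful for reparse points. */
link_kind classify_link(uint32_t attributes, uint32_t reparse_tag)
{
    if (attributes == INVALID_FILE_ATTRIBUTES)
        return LINK_ERROR;
    if ((attributes & FILE_ATTRIBUTE_REPARSE_POINT) == 0)
        return LINK_NONE;
    if (reparse_tag == IO_REPARSE_TAG_SYMLINK)
        return LINK_SYMLINK;
    if (reparse_tag == IO_REPARSE_TAG_MOUNT_POINT)
        return LINK_MOUNT_POINT;
    return LINK_OTHER_REPARSE;
}
```

Note that the tag is only consulted after the attribute check passes, which matches the order of the steps in the documentation.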

So there's fairly clear documentation about how to create a symbolic link, and it's also clear that you can delete one by simply calling DeleteFile. It's less clear how to do the equivalent of the Unix readlink function; that is, how do you read the file system to find out whether a particular object is a symbolic link or not, and, if it is, what it points to? It's clear I'm not the only person confused about this. There's apparently a fsutil tool in the Windows 7 command line that does this, but what APIs does it call to get its job done? Even the usually authoritative Raymond Chen doesn't cover this?

Overall, it's quite clear that the "accepted wisdom" has become somewhat stale and obsolete: Windows does support file links, and with relatively complete support. However, as is often the case with Windows, they have their own set of strange and unique APIs for working with the functionality, the documentation about how to use the APIs is scattered and terse, and there is the complexity of dealing with the enormous Windows installed base, and the fact that the support for file links was introduced over time, and hence varies from Windows platform to Windows platform. But, if you are running Windows 7 (and if you aren't, why aren't you?), it seems like you should have enough operating system support to build an application with fairly complete support for file links.

Enough of this overview-level discussion, it's time to write some code! I'll see you later, when I have some actual code to discuss...