January 17, 2010

Adam Megacz posted to the openafs-info mailing list a kernel panic that he encountered with the 1.4.12 release candidate. Derrick and I debugged this over Jabber (with Derrick doing the kernel disassembly). I thought it might be useful to share the methods we used, and to record the command invocations for future googling.

The Problem

A Mac OS X kernel panic. Adam first posted the panic itself, then followed up with the results of running OpenAFS’s decode-panic tool. decode-panic basically drives gdb to turn a panic log full of addresses into a readable backtrace with symbols – it needs to be run with the same kernel, and on the same architecture, that created the panic in the first place.

The line in bold in the decoded backtrace is the point where the panic actually occurred (all of the stuff before it relates to the trap mechanism). The lines after it tell us how we arrived in afs_GetDCache – for this particular problem, we’re not concerned with the callers, so we can ignore them.

So, at the time of the panic, EDX was probably NULL. This implies that the bug is a NULL pointer dereference of some kind, and that we’re looking up something that’s 0x64 bytes into the structure that’s pointed at by EDX.

If we had a kernel with debugging symbols, then we could look directly at the C source which corresponds to afs_GetDCache+7832. Unfortunately, we didn’t have one to hand, so Derrick improvised. By disassembling the current kernel module, we can see exactly what code occurs at that location…

(The call 0x0 is because we’re debugging a module that isn’t loaded. You can get a dump that fills in the blanks by generating symbols for the module with kextutil. When kextutil asks for the load address, give it the one that the panic log said the OpenAFS module was loaded at, in this case 0x4606c000.)
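The exact invocation wasn’t recorded, but the procedure looks roughly like this (the kext path and symbol directory are assumptions, so adjust to taste):

```
# Hypothetical sketch: -n avoids actually loading the module, -s names a
# directory for the generated .sym files, and interactive mode prompts
# for each module's load address.  Enter 0x4606c000, the address from
# the panic log, when asked about the OpenAFS module.
kextutil -n -i -s /tmp/symbols /Library/Extensions/openafs.kext
# The generated .sym file (named after the kext's bundle identifier)
# can then be loaded into gdb to disassemble with call targets resolved.
```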
We’re relatively lucky here, as near the panic location there is a call to rx_EndCall(), the second of three that occur in afs_GetDCache. Lining this up with the corresponding source code, the last call to rx_EndCall occurs at line 2219 of afs_dcache.c, which looks something like:

    if (length > size) {
        /* The fileserver told us it is going to send more data
         * than we requested. It shouldn't do that, and
         * accepting that much data can make us take up more
         * cache space than we're supposed to, so error. */
        code = rx_Error(tcall);
        RX_AFS_GUNLOCK();
        code1 = rx_EndCall(tcall, code);
        RX_AFS_GLOCK();
        tcall = (struct rx_call *)0;
        code = EIO;
    }

So, assuming that the compiler hasn’t reordered our code, we’ve got rx_Error(), followed by RX_AFS_GUNLOCK(). It isn’t possible for the compiler to reorder over a lock, so it’s pretty likely that the ordering is as shown. Both of these are macros masquerading as functions. Looking at the simpler of the two, rx_Error(), we have:

#define rx_Error(call) ((call)->error)

Now, if you recall, we panic’d because we were looking up something that has an offset of 0x64 from the base of the structure. We can verify that ‘error’ is stored 0x64 bytes into the rx_call structure by visually examining the structure definition or, if you’ve got a build tree from the same architecture, by running:
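For instance (the object path below is a guess at a typical build-tree location), gdb can compute the offset directly:

```
# Evaluate the address of 'error' within a struct rx_call based at 0;
# this should show 0x64 if 'error' really sits 0x64 bytes in.
gdb --batch -ex 'print/x &((struct rx_call *)0)->error' \
    src/libafs/MODLOAD/libafs.o
```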

All of this points pretty conclusively at a NULL value for ‘tcall’. Examining the code, there is a situation where that can occur: if we get a bogus length from the network, we end up double-freeing the rx call. It seems highly likely that this is the bug.

Debugging a Mac OS X Kernel Panic

July 26, 2009

During the course of a project here it became apparent that the Linux OpenAFS cache manager is slow when performing reads from the local disk. In this case, all of the data is already on the local disk, and the cache manager knows that the data is up to date. Naively, you would imagine that reading this data would take roughly the same time as if you were reading directly from the cache filesystem. However, that is not the case – in fact, reads appear to be more than twice as slow when fetched through the AFS cache manager, as compared to fetching the equivalent files from the local disk.

I’ve implemented modifications to the cache manager which attempt to reduce this speed deficit. These modifications can be broadly split into five sections:

Remove crref() calls

Pretty much every call into the OpenAFS VFS does a crref(), to get a reference to the user’s current credentials, despite the fact that this information isn’t always required. crref() is relatively expensive – it acquires a number of locks in order to perform its copies, and can be a cause of unnecessary serialisation. By only calling crref() when required, we gain a small, but measurable, performance increase.

Reduce the code path leading to a cache hit

In readpages, we perform a lot of setup operations before we discover whether the data we’re interested in is cached or not. By making the cached case the fast path, we can gain a performance increase for cache hits, without causing a noticeable degradation for cache misses.

Remove abstraction layers, and use native interfaces

The code currently uses operating system independent abstraction layers to perform the reads from the disk cache. These don’t know anything about the way in which Linux organises its virtual memory, and do a significant amount of extra, unnecessary work. For example, we use the ‘read’ system call to read in the data, rather than the significantly faster readpages(). As we’re being invoked through the AFS module’s readpages() entry point, we can guarantee that we’re going to be fetching a page off disk. Read() also gets called from a user, rather than kernel, memory context, adding to the overhead.

Implement readahead

The Linux Cache Manager currently has no support for readpages(), instead requiring the VFS layer request each page independently with readpage(). This not only means that we can’t take advantage of cache locality, it also means that we have no support for readahead. Doing readahead is important, because it means that we can get data from the disk into the page cache whilst the application is performing other tasks. It can dramatically increase our throughput, particularly where we are serving data out to other clients, or copying it to other locations. Implementing readpages() on its own gives a small speed improvement, although blocking the client until the readpages completes kind of defeats the point, and leads to sluggish interactive performance!

Make readahead copies occur in the background

The next trick, then, is to make the readahead occur in the background. By having a background kernel thread which waits until each page of data is read from the cache, and then handles copying it over into the corresponding AFS page, the business of reading and copying data from the cache can be hidden from the user.

Conclusions

This set of changes actually makes a significant improvement to cache read speed. In simple tests where the contents of the cache are copied to /dev/null, the new cache manager is around 55% faster than the old one. Tests using Apache to serve data from AFS show significant (but slightly less dramatic, due to other overheads) performance improvements.

Sadly, the Linux Memory Management architecture means that we’re never going to obtain speeds equivalent to using the native filesystem directly. The architecture requires that a page of memory must be associated with a single filesystem. So, we end up reading a page from the disk cache, copying that page into the AFS page, and returning the AFS page to the user. Ideally, we’d be able to dispense with this copy and read directly into the AFS page by switching the page mappings once the read was complete. However, this isn’t currently an option, and the performance benefits obtained through the current approach are still significant.

Making the OpenAFS client faster

May 26, 2009

For a while now, there have been plans afoot to convert OpenAFS’s CVS repository to git. A number of attempts have been made, which have all stalled due to the complexity of the underlying problem, and issues with the existing tools. Previously, it was felt that the main hurdle to a successful conversion was OpenAFS’s use of ‘deltas’ to provide a changeset style interface on top of CVS. A delta is a collection of related changes, grouped using a comment in the CVS revision log. However, unlike a real changeset, there is no requirement that a delta’s changes be contiguous. A file may be modified by delta A, then by delta B, and then modified by delta A again. This makes it impossible to properly represent all deltas as single changesets. In addition, abuse of deltas within OpenAFS has caused some to span branch or tag points, again making it impossible to represent those deltas as a changeset without destroying the repository history. For many months now, people have been trying to produce conversion tools that achieve as close to a 1 to 1 correspondence between deltas and changesets as is possible, just leaving the troublesome cases as multiple commits.

Frustrated with the lack of progress of this approach, I decided to do a quick and dirty conversion, with the view to getting something completed by the start of coding for this year’s Summer of Code (which I’ve missed) and the yearly Best Practices Conference (which I might just make). I decided to not concern myself with merging deltas at all, but instead use cvsps and the existing git-cvsimport tool to produce a tree where the branch heads and all tag points matched, and which retained enough information to reconstruct deltas without forcing them to be single changesets. In order to be able to perform simple manipulations, I decided to create a perl script which would post-process the cvsps output before feeding it to git. I also created a tool which would check out every branch and tag from cvs, compare them to the corresponding item in git, and report on any errors. Pretty straightforward, I thought …

Unfortunately, I rapidly discovered that cvsps had significant problems with the OpenAFS repository. Many tags in CVS were simply not in the cvsps output, other tags (both those marked as FUNKY and INVALID, and those not) were in the wrong place and branchpoints were being incorrectly determined. Rather than get into cvsps’s internals, I ended up extending my post processing script to deal with these errors. It now performs a number of tasks:

Reordering inverted patchsets

Some of cvsps’s output gets the patchset ordering wrong, such that a patchset that does fileA:1.2->1.3 comes before fileA:1.1->1.2. The script scans through all of the patchsets for this problem and swaps any that it finds.

Tag point determination

Using the output from CVS’s rls command, it is possible to get the revision numbers of every file in a given tag. With this information, the set of patchsets from cvsps can be walked in order to identify the first patchset to satisfy every revision contained within the tag. Unfortunately, cvsps patchsets aren’t correctly ordered, so this process also works out how to reorder the patchsets such that no patchsets with file revisions higher than those in the tag occur before the tag point. This reordering is carefully performed so as not to break any tag or branch points which we have already placed! In addition, cvsps sometimes merges commits which occur over a tag point, so we also need to split patchsets which contain both files with revisions before the tag point, and files with revisions after it.

Branch point determination

The cvsps output incorrectly places many of OpenAFS’s branchpoints. Fortunately, many of these were marked by tags at the time they were created, and a hardcoded list of these is used to place some branch points in the correct position. For branches that don’t have a corresponding tag, a brute force approach is used. By examining all of the patchsets on the branch, it’s possible to determine the starting revision number of every file that’s modified on that branch – combining this with the contents of the branch head tag from cvs rls gives the equivalent of a tag marking the branchpoint. This can then be processed by the tag point algorithm to work out the correct position in which to place the branch point. This gives the patchset that the branch occurs after, rather than cvsps’s “Ancestor branch” field, which gives the first patchset on the branch. Ancestor branch is fundamentally flawed, as it doesn’t allow for changes which occur on HEAD after a branchpoint is created, but before the first patch on that branch. As part of this process, git-cvsimport was modified to understand a new ‘post-patchset’ branch syntax.

Unknown branch determination

cvsps fails to place some patchsets on the correct branch. By comparing the revision numbers of files in these patchsets with those elsewhere in the list, the correct branch for all of these can be determined. (This needs to be done in order that we can work out tag points, as well as being necessary for repository consistency.)

We also clean up the output to deal with problems of our own making:

Delta naming

Whilst there is a style of delta name where the branch is given first, a number of deltas don’t conform to this style, and have the same name across multiple branches. All deltas are renamed such that they look something like STABLE14-delta-name-yyyy-mm-dd.

Mistagged files

In some places, tags have been incorrectly applied such that files on HEAD are tagged as part of a branch tag. The script contains manual overrides to fix these to tag a revision on the correct branch.

Finally, because having done all of the above I had a pretty good toolset for dealing with patchsets, I implemented support for merging deltas. This merges all bar 100 or so of the 15,000 deltas into single patchsets. The remaining items are comprised of deltas which span tag or branch points (and which can never be merged) and deltas which contain conflicting changes to a single file (which it might be possible to merge, but which would require manual intervention). These deltas are tagged in a set of git references at refs/deltas/branches/&lt;branch&gt;/&lt;delta&gt;. We separate them from tags so that git tag doesn’t have to deal with over 10,000 tags, and split them into branches to avoid directory size limitations.
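As a toy illustration of that ref layout (the repository and delta names here are invented, not taken from the real conversion), deltas recorded under refs/deltas/ stay out of the ordinary tag namespace:

```shell
# Create a throwaway repo with one commit
git init -q deltas-demo
git -C deltas-demo -c user.name=demo -c user.email=demo@example.org \
    commit -q --allow-empty -m "example commit"
# Record a delta as a ref under refs/deltas/branches/<branch>/<delta>
git -C deltas-demo update-ref \
    refs/deltas/branches/STABLE14/STABLE14-example-delta-2009-05-26 HEAD
# Deltas on one branch can still be listed on demand...
git -C deltas-demo for-each-ref 'refs/deltas/branches/STABLE14/*'
# ...but they do not clutter the tag list: this prints nothing
git -C deltas-demo tag
```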

The resulting git tree isn’t a perfect replica of the CVS repository. It has a number of issues which it’s going to be really difficult to fix, and which probably aren’t earth-shattering:

It contains additional files

There are a number of places where a user has added additional directories to cvs. When another user has subsequently tagged their working directory for a release, they haven’t done a cvs update -d, and so these additional directories (and in a small number of cases, files) aren’t contained in the tag for that release. It’s impossible to create a patchset ordering which allows a git tag to not include these directories, so we end up with additional files in the git tag. I don’t think that this is a particular problem.

It’s missing a tag

There is a tag in the CVS repository (BP-openafs-rxkad-krb5-lha) which is so broken that it’s impossible to construct a patchset ordering that matches it. It is just omitted from the resulting git repository.

One branch is bad

The openafs-rxkad-krb5-lha branch was created by only branching certain files in the tree. This means that it’s impossible to create a git branch which mimics this one without creating a huge number of additional ‘pull-up’ patchsets. Whilst we include all of the changes that were made on this branch, the final branch state is very different from the one in CVS.

Some deltas are disjoint

As discussed, some deltas cannot be merged into single patchsets. This is going to require a new wdelta style tool which understands how to merge these deltas.

April 26, 2009

All of the OpenAFS Fedora/RHEL package builds are done from a machine running Scientific Linux 5. This is Informatics’s managed Linux platform, and using it for the build machine means that it is administered and updated along with the rest of our services, leaving me with more time to do other things. We use the ‘mock’ command to perform builds for all of our x86_64 and i386 architectures – mock uses yum and rpm to construct a chroot for each build architecture, and then runs the build within that chroot. Unfortunately, this is where the fly in the ointment occurs.
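A build for each target looks something like the following sketch (the config names are typical mock configs, and the src.rpm name is made up):

```
# Build the same source RPM inside a fresh chroot per target distro/arch
mock -r epel-5-i386     --rebuild openafs-1.4.10-1.src.rpm
mock -r epel-5-x86_64   --rebuild openafs-1.4.10-1.src.rpm
mock -r fedora-10-i386  --rebuild openafs-1.4.10-1.src.rpm
```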

RPM has been pretty stable for years, allowing this cross platform building to occur. With Fedora 11, however, a new version of rpm has been shipped. This contains support for longer file digests, and packages from Fedora 11 cannot be installed by older versions of rpm. Unfortunately, this means that we can’t mock Fedora 11 from a normal SL5/EL5 build host. Fortunately, solutions are available.

A version of RPM 4.6 with support for the extended hashes, and which builds on EL5, is available from http://people.redhat.com/mitr/sha256-rpm/. This does change the RPM soname, and will require that packages with dependencies on rpm be rebuilt, and (in some cases) modified to support the newer API.

In addition to the change in RPM hashes, yum itself also needs to be modified to support Fedora 11. The pyhashlib package is needed to give yum support for other hash formats, and a newer version of yum is required to use them.

I’m giving two talks: the first on Prometheus, our new Identity Management System; the second on how to contribute to OpenAFS. The abstracts are:

Prometheus is an LDAP based provisioning system, which is designed to manage a wide variety of user databases, including AFS’s PTS and a Kerberos KDC. It is highly flexible in the databases it supports, and permits very fine grained delegation of control. It has a role-based access control model, and allows the creation and management of roles by any authorized user. It is instance aware, allowing users to create many instances of a primary account, request keytabs for those instances, and delegate particular permission sets to individual instances. Prometheus is designed to be as distributed as possible, permitting provisioning of systems maintained by disparate groups without requiring that those groups be trusted by the system itself. This talk will discuss the design goals behind Prometheus, provide an update on implementation progress, and demonstrate a running system.

and …

OpenAFS has a huge, daunting codebase, with a relatively opaque system of patch submission, review and application. It takes mountains of skill, and years of persistence to get your first patch into a state where it can be submitted, let alone accepted into the hallowed halls of the code tree…

Nonsense!

This talk will attempt to blow away some of the misconceptions with regards to contributing to OpenAFS. It will provide a first-timer’s view of the steps, both technical and political, involved in crafting a patch for submission to OpenAFS. We’ll take a whistle-stop tour of the tools now involved in the process, from the code repository, to the patch review system and the bug tracker. We’ll talk about code review, bug triage and testing, with a view to inspiring participation in these areas.

Finally, we’ll talk about some low hanging fruit that anyone could get started on, and write their first bit of OpenAFS code …

In addition to keynotes from Morgan Stanley and Carnegie Mellon, the conference features a number of talks about research computing storage (including one from the nanoCmos project), and looks like it will have a great mixture of academic and commercial topics.

The hotel block (at the very reasonable Stanford Guest House) expires April 1st, with the early bird deadline being April 21st.

AFS &amp; Kerberos Best Practices Workshop

January 24, 2009

Fstrace is an AFS utility which logs a huge amount of information about the internal operations of the AFS cache manager. It can provide details of the process flow through the AFS cache manager, without requiring any recompilation or specific debug options.

To use fstrace, you need a /usr/vice/etc/C/afszcm.cat file to be installed. Unfortunately, this file is not currently installed as part of RPM builds, and without it, the output from fstrace is pretty much useless. However, you should be able to take the afszcm.cat file from any build of the same AFS version and use it with your kernel module.

To start logging, initialize the logging system with

fstrace clear cm

fstrace setlog cmfx -buffers 100

and then, to start the logging session, and dump the log output to a file

fstrace sets cm -active

fstrace dump -follow cmfx -file /tmp/log -sleep 10 &

Then, perform the operations which you want to obtain a log for. You may find that the dump needs a short period of time (or another command to be invoked) in order to flush all of the data from the command that you’re logging onto disk.

To stop logging, run

fstrace sets cm -inactive

Update 27/03/09: As of OpenAFS 1.4.9, the afszcm.cat file will ship as part of the standard RPM installation.

March 14, 2008

I’ve been experimenting with Mercurial, as a means of streamlining the way I work with the OpenAFS CVS repository. In particular, I’m trying to improve the management of my disconnected operation code, as well as better controlling the large number of patches I’m producing as part of the prototyping and error removal exercise.

Because I tend to flit backwards and forwards between different pieces of code, I tend to find that with CVS I have a large number of different checked out sandboxes. For big projects, such as the disconnection work, there’s no history, or ability to revert changes without taking a snapshot of the sandbox, which is both time consuming and inefficient. For smaller projects, there’s either a huge number of different sandboxes (and the related ‘where did I do X?’ problem), or lots of code ends up being intermingled within the same sandbox, and lots of things have to be unpicked before patches can be sent upstream.

These actually end up being two different problems, and it looks like there are two different mercurial workflows best suited to handling them. For disconnection, what I really need is a way of tracking, and managing, my code changes, and I’m using mercurial as a normal SCM to achieve this. With the prototyping changes, what I really need are patch queues – I have a large number of changes which I’m trying to arrange into manageable chunks in order to send upstream. Depending on my testing schedule, I may have a large number of patches awaiting submission. In this case, mercurial’s patch queues seem like by far and away the best fit.
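For the curious, the patch queue workflow looks something like this (patch names invented; assumes the mq extension is enabled with “[extensions] mq =” in ~/.hgrc):

```
hg qnew error-removal-1.patch   # start a new patch at the top of the queue
# ... hack on the tree ...
hg qrefresh                     # capture the working-directory changes into the patch
hg qnew error-removal-2.patch   # stack a second, independent patch on top
hg qpop                         # unapply it, to go back and rework the first
hg qpush                        # reapply it when done
hg qdiff                        # review the patch currently at the top
```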

I’m intending on making my mercurial repositories for both of these tasks publicly available. For now, I can offer a mercurial import of the OpenAFS ‘upstream’ CVS at http://lochranza.inf.ed.ac.uk/upstream/. Other repositories are likely to appear there over time.

Local users may be interested to note that they can get mercurial on a DICE machine by including dice/options/mercurial.h in their profile.
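If I’m remembering the LCFG conventions correctly, that should just be the usual cpp-style include in the profile source, something like:

```
#include <dice/options/mercurial.h>
```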