Writing about open source software, virtualization & more

For a long time the libvirtd daemon was single threaded, using a poll() loop to handle and dispatch all I/O requests in a single thread. This was reasonable when each operation was guaranteed to be a fast, but as libvirt has grown in scope, so has the complexity and execution time of many requests. Ultimately we had no choice but to make all the internal drivers thread-safe, and the libvirtd daemon itself multi-threaded. Each libvirt virConnectPtr connection can now be safely used from multiple threads at once. The libvirtd daemon will also process RPC calls from multiple clients concurrently, and even parallelize requests from a single client. Achieving this obviously required that we make significant use of pthread mutexes for locking data structures can be accessed concurrently.

To try and keep the code within the realm of human understanding, we came up with a fixed 2-level locking model. At the first level, each libvirt driver has a lock protecting its global shared state. At the second level, individual managed objects each have a lock (eg, each instance of virDomainObj, virNetworkObj, virStoragePoolObj has a lock). The rules for locking, can be stated pretty concisely

The driver lock must be held when accessing any driver shared state

The managed object lock must be held when accessing any managed object state

The driver lock must be held before obtaining a lock on a managed object

The driver and managed object locks may be released in any order

The locks may not be recursively obtained

With these rules there are three normal locking sequences for each public API entry pointer implemented by a driver. The first, simplest, pattern is for a API that only accesses driver state

For the 0.6.0 release I updated several hundred methods in libvirt to follow these 3 locking patterns. Inevitably there would be mistakes along the way, and those reviewing the patches did indeed find some. End users also found some more when we released this. And there is a continual risk of causing accidental regressions. One approach to validating the locking rules is via functional testing, but libvirt is a very complicated and expansive codebase, making it incredibly hard to exercise all possible codepaths in a functional test.

About a year ago, Richard Jones (of OCaml Tutorial fame) did an proof of concept using CIL to analyse libvirt. His examples loaded all libvirt code, generated control flow graphs and reported on exported functions. CIL is an enormously powerful toolset able to parse even very complicated C programs and perform complex static analysis checks. It sounded like just the tool for checking libvirt locking rules, as well as being a good excuse to learn OCaml

Functional languages really force you to think about the concepts you wish to express & check up front, before hacking code. This is possibly what discourages alot of people from using functional languages ;-) So after trying and failing to hack something up quickly, I stopped and considered what I needed to represent. The libvirt internal code has a pretty consistent pattern across different drivers and managed objects. So we can generalize some concepts

Driver data types (eg, struct qemu_driver, xenUnifiedPrivatePtr)

Managed object data types (eg, virDomainObjPtr, virNetworkObjPtr)

A set of methods which lock drivers (eg, qemuDriverLock, xenUnifiedLock)

A set of methods which unlock drivers (eg qemuDriverUnlock, xenUnifiedUnlock)

A set of methods which obtain a locked managed object (eg, virDomainObjFindByName, virDomainObjAssignDef, virNetworkObjectFindByUUID)

A set of methods which unlock managed objects (virDomainObjUnlock)

A set of public API entry points for each driver (the function pointers in each virDriver struct instance)

I won’t go into fine details about CIL, because it is quite a complex beast. Suffice to say, it loads C code, parses it, and builds up an in-memory data structure representing the entire codebase. Todo this it needs the intermediate files from after pre-processing, eg, the ‘.i’ files generated by GCC when the ‘-save-temps’ flag is given. Once it has loaded the code into memory you can start to extract and process the interesting information.

The first step in libvirt analysis is to find all the public API entry points for each driver. Todo this, we search over all the global variables looking for instances of driver structs virDriver, virNetworkDriver, etc, and then processing their initializers to get all the function pointers. At this point we now have a list containing all the functions we wish to check

The next step is to iterate over each function, and run a data flow analysis. CIL provides two useful modules for data flow analysis. One of them runs analysis from the entry point, forwards, to each return point. The other runs analysis from each return point, backwards to the entry point. I can’t really say which module is “best”, but for libvirt locking analysis it felt like a forward data flow analysis would work pretty well. To perform analysis, you provide an implementation of the Cil.DataFlow.ForwardsTransfer interface which has a couple of methods, of which the interesting ones are doStmt and doInstr. The first is invoked once for each C code block, the second is invoked once for each C statement. When invoked they are passed some initial state. The interface implementation looks at the code block or statement in question, and decides what state has changed, and returns the new state. For lock analysis, I decided to store the following state about variables:

The set of driver variables that are currently locked

The set of driver variables that are currently unlocked

The set of object variables that are currently locked

The set of object variables that are currently unlocked

Thus, the first part of the implementation of the doInstr method was simply looking for function calls which lock or unlock a managed object or driver.
From this core state, some additional state can be derived at each C level statement, recording locking mistakes. The mistakes currently identified are

Statements using an unlocked driver variable

Statements using an unlocked object variable

Statement locking a object variable while not holding a locked driver variable

Statements causing deadlock by fetching a lock object, while an object is already locked

The lock checker runs a forwards code flow analysis, constructing this state for each interesting function we identified earlier. When complete, it simply looks at the final state accumulated for each function. For each of the 5 types of mistake, it prints out the function name, the line number and the C code causing the mistake. In addition to those 5 mistakes, it also checks the final list of locked variables, and reports on any functions which forget to unlock a managed object, or driver, in any of their exit points.

In summary, we have been able to automatically detect and report on 7 critical locking mistakes across the entire driver codebase without needing to ever run the code. We also have a data flow analysis framework that can be further extended to detect other interesting locking problems. For example, it would be desirable to check whether one public API function calls into another, since this would cause a deadlock with non-recursive mutexes.

The final reports generated from the lock checking tool are designed to make it easily for the person reading them to figure out what locking rule has been violated. An example report from current libvirt CVS codebase looks like this

In summary, CIL is an incredibly powerful tool, and it is well worth learning OCaml in order to use the CIL framework for doing static analysis of C code. The nummber of bugs it has allowed us to identify and fix more than compensates for effort developing this test harness.