Scientific Bug Hunting in the Cloud: An Unexpected CEO Adventure

April 16, 2015

The Wolfram Cloud Needs to Be Perfect

The Wolfram Cloud is coming out of beta soon (yay!), and right now I’m spending much of my time working to make it as good as possible (and, by the way, it’s getting to be really great!). Mostly I concentrate on defining high-level function and strategy. But I like to understand things at every level, and as a CEO, one’s ultimately responsible for everything. And at the beginning of March I found myself diving deep into something I never expected…

Here’s the story. As a serious production system that lots of people will use to do things like run businesses, the Wolfram Cloud should be as fast as possible. Our metrics were saying that typical speeds were good, but subjectively when I used it something felt wrong. Sometimes it was plenty fast, but sometimes it seemed way too slow.

We’ve got excellent software engineers, but months were going by, and things didn’t seem to be changing. Meanwhile, we’d just released the Wolfram Data Drop. So I thought, why don’t I just run some tests myself, maybe collecting data in our nice new Wolfram Data Drop?

A great thing about the Wolfram Language is how friendly it is for busy people: even if you only have time to dash off a few lines of code, you can get real things done. And in this case, I only had to run three lines of code to find a problem.

Then I called the API 50 times, measuring how long each call took (% here stands for the previous result):

Then I plotted the sequence of times for the calls:

And immediately there seemed to be something crazy going on. Sometimes the time for each call was 220 ms or so, but often it was 900 ms, or even twice that long. And the craziest thing was that the times seemed to be quantized!

I made a histogram:

And sure enough, there were a few fast calls on the left, then a second peak of slow calls, and a third “outcropping” of very slow calls. It was weird!

I wondered whether the times were always like this. So I set up a periodic scheduled task to do a burst of API calls every few minutes, and put their times in the Wolfram Data Drop. I left this running overnight… and when I came back the next morning, this is what I saw:

Even weirder! Why the large-scale structure? I could imagine that, for example, a particular node in the cluster might gradually slow down (not that it should), but why would it then slowly recover?

My first thought was that perhaps I was seeing network issues, given that I was calling the API on a test cloud server more than 1000 miles away. So I looked at ping times. But apart from a couple of weird spikes (hey, it’s the internet!), the times were very stable.

Something’s Wrong inside the Servers

OK, so it must be something on the servers themselves. There’s a lot of new technology in the Wolfram Cloud, but most of it is pure Wolfram Language code, which is easy to test. But there’s also generic modern server infrastructure below the Wolfram Language layer. Much of this is fundamentally the same as what Wolfram|Alpha has successfully used for half a dozen years to serve billions of results, and what webMathematica started using even nearly a decade earlier. But being a more demanding computational system, the Wolfram Cloud is set up slightly differently.

And my first suspicion was that this different setup might be causing something to go wrong inside the webserver layer. Eventually I hope we’ll have pure Wolfram Language infrastructure all the way down, but for now we’re using a webserver system called Tomcat that’s based on Java. And at first I thought that perhaps the slowdowns might be Java garbage collection. Profiling showed that there were indeed some “stop the world” garbage-collection events triggered by Tomcat, but they were rare, and were taking only milliseconds, not hundreds of milliseconds. So they weren’t the explanation.

By now, though, I was hooked on finding out what the problem was. I hadn’t been this deep in the trenches of system debugging for a very long time. It felt a lot like doing experimental science. And as in experimental science, it’s always important to simplify what one’s studying. So I cut out most of the network by operating “cloud to cloud”: calling the API from within the same cluster. Then I cut out the load balancer, that dispatches requests to particular nodes in a cluster, by locking my requests to a single node (which, by the way, external users can’t do unless they have a Private Cloud). But the slowdowns stayed.

So then I started collecting more-detailed data. My first step was to make the API return the absolute times when it started and finished executing Wolfram Language code, and compare those to absolute times in the wrapper code that called the API. Here’s what I saw:

The blue line shows times before the Wolfram Language code is run; the gold line after. I collected this data in a period when the system as a whole was behaving pretty badly. And what I saw was lots of dramatic slowdowns in the “before” times—and just a few quantized slowdowns in the “after” times.

Once again, this was pretty weird. It didn’t seem like the slowdowns were specifically associated with either “before” or “after”. Instead, it looked more as if something was randomly hitting the system from the outside.

One confusing feature was that each node of the cluster contained (in this case) 8 cores, with each core running a different instance of the Wolfram Engine. The Wolfram Engine is nice and stable, so each of these instances was running for hours to days between restarts. But I wondered if perhaps some instances might be developing problems along the way. So I instrumented the API to look at process IDs and process times, and then for example plotted total process time against components of the API call time:

And indeed there seemed to be some tendency for “younger” processes to run API calls faster, but (particularly noting the suppressed zero on the x axis) the effect wasn’t dramatic.

What’s Eating the CPU?

I started to wonder about other Wolfram Cloud services running on the same machine. It didn’t seem to make sense that these would lead to the kind of quantized slowdowns we were seeing, but in the interest of simplifying the system I wanted to get rid of them. At first we isolated a node on the production cluster. And then I got my very own Wolfram Private Cloud set up. Still the slowdowns were there. Though, confusingly, at different times and on different machines, their characteristics seemed to be somewhat different.

On the Private Cloud I could just log in to the raw Linux system and start looking around. The first thing I did was to read the results from the “top” and “ps axl” Unix utilities into the Wolfram Language so I could analyze them. And one thing that was immediately obvious was that lots of “system” time was being used: the Linux kernel was keeping very busy with something. And in fact, it seemed like the slowdowns might not be coming from user code at all; they might be coming from something happening in the kernel of the operating system.

So that made me want to trace system calls. I hadn’t done anything like this for nearly 25 years, and my experience in the past had been that one could get lots of data, but it was hard to interpret. Now, though, I had the Wolfram Language.

Running the Linux “strace” utility while doing a few seconds of API calls gave 28,221,878 lines of output. But it took just a couple of lines of Wolfram Language code to knit together start and end times of particular system calls, and to start generating histograms of system-call durations. Doing this for just a few system calls gave me this:

Interestingly, this showed evidence of discrete peaks. And when I looked at the system calls in these peaks they all seemed to be “futex” calls—part of the Linux thread synchronization system. So then I picked out only futex calls, and, sure enough, saw sharp timing peaks—at 250 ms, 500 ms and 1s:

But were these really a problem? Futex calls are essentially just “sleeps”; they don’t burn processor time. And actually it’s pretty normal to see calls like this that are waiting for I/O to complete and so on. So to me the most interesting observation was actually that there weren’t other system calls that were taking hundreds of milliseconds.

The OS Is Freezing!

So… what was going on? I started looking at what was happening on different cores of each node. Now, Tomcat and other parts of our infrastructure stack are all nicely multithreaded. Yet it seemed that whatever was causing the slowdown was freezing all the cores, even though they were running different threads. And the only thing that could do that is the operating system kernel.

But what would make a Linux kernel freeze like that? I wondered about the scheduler. I couldn’t really see why our situation would lead to craziness in a scheduler. But we looked at the scheduler anyway, and tried changing a bunch of settings. No effect.

Then I had a more bizarre thought. The instances of the Wolfram Cloud I was using were running in virtual machines. What if the slowdown came from “outside The Matrix”? I asked for a version of the Wolfram Cloud running on bare metal, with no VM. But before that was configured, I found a utility to measure the “steal time” taken by the VM itself—and it was negligible.

By this point, I’d been spending an hour or two each day for several days on all of this. And it was time for me to leave for an intense trip to SXSW. Still, people in our cloud-software engineering team were revved up, and I left the problem in their capable hands.

By the time my flight arrived there was already another interesting piece of data. We’d divided each API call into 15 substeps. Then one of our physics-PhD engineers had compared the probability for a slowdown in a particular substep (on the left) to the median time spent in that substep (on the right):

With one exception (which had a known cause), there was a good correlation. It really looked as if the Linux kernel (and everything running under it) was being hit by something at completely random times, causing a “slowdown event” if it happened to coincide with the running of some part of an API call.

So then the hunt was on for what could be doing this. The next suspicious thing noticed was a large amount of I/O activity. In the configuration we were testing, the Wolfram Cloud was using the NFS network file system to access files. We tried tuning NFS, changing parameters, going to asynchronous mode, using UDP instead of TCP, changing the NFS server I/O scheduler, etc. Nothing made a difference. We tried using a completely different distributed file system called Ceph. Same problem. Then we tried using local disk storage. Finally this seemed to have an effect—removing most, but not all, of the slowdown.

We took this as a clue, and started investigating more about I/O. One experiment involved editing a huge notebook on a node, while running lots of API calls to the same node:

The result was interesting. During the period when the notebook was being edited (and continually saved), the API times suddenly jumped from around 100 ms to 500 ms. But why would simple file operations have such an effect on all 8 cores of the node?

The Culprit Is Found

We started investigating more, and soon discovered that what seemed like “simple file operations” weren’t—and we quickly figured out why. You see, perhaps five years before, early in the development of the Wolfram Cloud, we wanted to experiment with file versioning. And as a proof of concept, someone had inserted a simple versioning system named RCS.

Plenty of software systems out there in the world still use RCS, even though it hasn’t been substantially updated in nearly 30 years and by now there are much better approaches (like the ones we use for infinite undo in notebooks). But somehow the RCS “proof of concept” had never been replaced in our Wolfram Cloud codebase—and it was still running on every file!

One feature of RCS is that when a file is modified even a tiny bit, lots of data (even several times the size of the file itself) ends up getting written to disk. We hadn’t been sure how much I/O activity to expect in general. But it was clear that RCS was making it needlessly more intense.

Could I/O activity really hang up the whole Linux kernel? Maybe there’s some mysterious global lock. Maybe the disk subsystem freezes because it doesn’t flush filled buffers quickly enough. Maybe the kernel is busy remapping pages to try to make bigger chunks of memory available. But whatever might be going on, the obvious thing was just to try taking out RCS, and seeing what happened.

And so we did that, and lo and behold, the horrible slowdowns immediately went away!

So, after a week of intense debugging, we had a solution to our problem. And repeating my original experiment, everything now ran cleanly, with API times completely dominated by network transmission to the test cluster:

The Wolfram Language and the Cloud

What did I learn from all this? First, it reinforced my impression that the cloud is the most difficult—even hostile—development and debugging environment that I’ve seen in all my years in software. But second, it made me realize how valuable the Wolfram Language is as a kind of metasystem, for analyzing, visualizing and organizing what’s going on inside complex infrastructure like the cloud.

When it comes to debugging, I myself have been rather spoiled for years—because I do essentially all my programming in the Wolfram Language, where debugging is particularly easy, and it’s rare for a bug to take me more than a few minutes to find. Why is debugging so easy in the Wolfram Language? I think, first and foremost, it’s because the code tends to be short and readable. One also typically writes it in notebooks, where one can test out, and document, each piece of a program as one builds it up. Also critical is that the Wolfram Language is symbolic, so one can always pull out any piece of a program, and it will run on its own.

Debugging at lower levels of the software stack is a very different experience. It’s much more like medical diagnosis, where one’s also dealing with a complex multicomponent system, and trying to figure out what’s going on from a few measurements or experiments. (I guess our versioning problem might be the analog of some horrible defect in DNA replication.)

My whole adventure in the cloud also very much emphasizes the value we’re adding with the Wolfram Cloud. Because part of what the Wolfram Cloud is all about is insulating people from the messy issues of cloud infrastructure, and letting them instead implement and deploy whatever they want directly in the Wolfram Language.

Of course, to make that possible, we ourselves have needed to build all the automated infrastructure. And now, thanks to this little adventure in “scientific debugging”, we’re one step closer to finishing that. And indeed, as of today, the Wolfram Cloud has its APIs consistently running without any mysterious quantized slowdowns—and is rapidly approaching the point when it can move out of beta and into full production.

This is a clear indication on how the von Neumann architecture of computers and the Linux monolithic kernel are screwing all of us. To begin with, file versioning should be built-in feature of the OS. Secondly, as Ted Nelson put it, we have to get rid of the filesystem tyranny. Wolfram should give priority to develop a version running on bare metal, as said. At that point they may consider developing a Mathematica suitable hardware architecture altogether. Last but not least – the whole multithreading scheme sucks. Go for functional programming, people! It’s math on computers, so go for it. It’s harder, but this thing is mainly for smart people. We can take care later of those who want to calculate the volume of their apple pie using mathematica’s image processing power…