Monday, January 31, 2011

The author of the manual page gives an early hint of his attitude toward this function call at the start of the documentation:

DESCRIPTION

The statvfs() and fstatvfs() functions fill the structure pointed to by buf with garbage. This garbage will occasionally bear resemblance to file system statistics, but portable applications must not depend on this.

The manual page then proceeds to describe various useful information that might (or might not) be provided by this function, and observes that the statvfs function in turn is built upon the statfs function.

Then the author further observes, in the section regarding POSIX standards:

STANDARDS

The statvfs() and fstatvfs() functions conform to IEEE Std 1003.1-2001 (``POSIX.1''). As standardized, portable applications cannot depend on these functions returning any valid information at all.

I'm not sure what the author of the manual page was trying to achieve here. Was this a warning about some specific inadequacy or weakness in the implementation? Is there some known bug? Is there a common pitfall that applications might fall into? Are there ways to invoke this function profitably, while other ways are fraught with danger?

Or was this the author's way of making some sort of a joke?

All the manual page seems to do, for me at least, is cause me to raise an eyebrow and wonder what's going on here.

The manual pages for the statvfs function on other operating systems do not appear to have such cautions and warnings. Is there something special about FreeBSD?
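For what it's worth, here is a minimal sketch of how an application might call this function defensively, taking the manual page's warning at face value. It uses Python's os.statvfs wrapper, which is only available on Unix-like systems; the helper name available_bytes and the decision to return None for unusable results are my own assumptions, not anything the manual page recommends.

```python
import os

def available_bytes(path):
    """Return the space reported as available to unprivileged users, or None.

    None means the call failed or the reported numbers look unusable,
    so the caller has to cope without file system statistics.
    """
    try:
        stats = os.statvfs(path)          # Unix-only wrapper around statvfs()
    except OSError:
        return None
    if stats.f_frsize <= 0:               # treat a nonsensical block size as "no data"
        return None
    return stats.f_bavail * stats.f_frsize

if __name__ == "__main__":
    print(available_bytes("/"))
```

The point of the sketch is only that the caller never assumes the numbers are valid; whether such caution is actually needed on FreeBSD is exactly the question the manual page leaves open.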

Sunday, January 30, 2011

It's looking like 2011 will be a watershed year for low-level system software wonks, as we will be seeing some major transitions in some core software behaviors that have been around for decades. Here are a couple that I've been following:

Another transition which isn't getting quite so much press, but which will also affect us systems sorts, is the storage systems transition from 512-byte sectors to 4096-byte sectors. The new drives, called "Advanced Format" drives, bring both capacity and efficiency gains, but will require careful attention by those who implement file systems, database systems, and the like. One particular bit of complexity is the transition mode on some of these drives, called "512e" mode, which involves the use of emulation firmware that makes a 4K Advanced Format drive appear to be operating as an old-school 512-byte sector device. As some people have observed, this is very tricky stuff, and if not done exactly right could introduce some extremely mysterious failure modes. The good news is that there are lots of tools and information about Advanced Format drives available.
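To make the alignment concern concrete, here is a small sketch of the arithmetic involved. The function names are hypothetical, and the example offsets (a partition starting at logical sector 63 versus one starting at sector 64) are chosen purely for illustration; the sketch says nothing about how any particular drive's firmware behaves.

```python
LOGICAL_SECTOR = 512     # what a 512e drive presents to the host
PHYSICAL_SECTOR = 4096   # what an Advanced Format drive stores internally

def is_aligned(offset_bytes, length_bytes, sector=PHYSICAL_SECTOR):
    """True when a request starts and ends on physical sector boundaries."""
    return offset_bytes % sector == 0 and length_bytes % sector == 0

def physical_sectors_touched(offset_bytes, length_bytes, sector=PHYSICAL_SECTOR):
    """Count the physical sectors that a byte range overlaps."""
    first = offset_bytes // sector
    last = (offset_bytes + length_bytes - 1) // sector
    return last - first + 1

if __name__ == "__main__":
    for start_lba in (63, 64):
        offset = start_lba * LOGICAL_SECTOR
        print(start_lba, is_aligned(offset, 4096),
              physical_sectors_touched(offset, 4096))
```

A 4096-byte write that begins at logical sector 63 straddles two physical sectors, so an emulating drive has to read, modify, and rewrite both; the same write beginning at logical sector 64 maps onto exactly one.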

Although I had already read a number of the articles he references, several of them were worthy of a re-read in reflection, and there were a number of articles that he pointed out that were new to me, in sources I hadn't seen before.

Here are two that Houston referenced that I hadn't seen before, and that I particularly enjoyed:

Go through all of Houston's list, if you have some time. Almost certainly you'll find something new and interesting that you hadn't been following, or hadn't reflected on in months. It's a lot to chew on, but it was a busy year in the tech world.

Megastore blends the scalability of a NoSQL datastore with the convenience of a traditional RDBMS in a novel way, and provides both strong consistency guarantees and high availability. We provide fully serializable ACID semantics within fine-grained partitions of data. This partitioning allows us to synchronously replicate each write across a wide area network with reasonable latency and support seamless failover between datacenters.

The paper is excellent: clearly-written, thorough, and relevant. It covers the topic from the high-level requirements, through architecture and design, down to the important aspects of the implementation. You should read the entire paper (I'm just starting on my second pass through it). Of course, if you're not already comfortable with BigTable, Chubby, Paxos, etc., you're going to spend a lot of time chasing references, and probably need to come back to this paper later. But hopefully you've been Keeping Up With The Times, and so this caution isn't necessary...

Anyway, although I don't have a lot of insight regarding the basic content of the paper (except to say: "Thanks, Google, for once again sharing the details of your fascinating work!"), I wanted to share one part that I found particularly interesting, from a section near the end of the paper:

Development of the system was aided by a strong emphasis on testability. The code is instrumented with numerous (but cheap) assertions and logging, and has thorough unit test coverage. But the most effective bug-finding tool was our network simulator: the pseudo-random test framework. It is capable of exploring the space of all possible orderings and delays of communications between simulated nodes or threads, and deterministically reproducing the same behavior given the same seed. Bugs were exposed by finding a problematic sequence of events triggering an assertion failure (or incorrect result), often with enough log and trace information to diagnose the problem, which was then added to the suite of unit tests. While an exhaustive search of the scheduling state space is impossible, the pseudo-random simulation explores more than is practical by other means. Through running thousands of simulated hours of operation each night, the tests have found many surprising problems.

What I particularly enjoy about this passage is the way it delivers that hard-won, hard-earned, worth-reflecting-on knowledge that building reliable systems of significant complexity requires not just a single approach, but a collection of techniques. Similar to the way that security experts will often argue for "defense in depth", observe the overall plan of attack used by the Google team:

testability:

assertions

logging

unit tests

coverage

simulators

pseudo-random test generators

event-sequence tracking

for each found problem, adding the case back to the suite of unit tests

It wasn't enough to have one technique, or a simple approach; all of these tools must be taken out of the toolbox and used, routinely, throughout the lifetime of the software.
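The seeded, reproducible part of that approach is easy to illustrate in miniature. What follows is only a toy sketch, not anything from the Megastore paper: the "replica", its three versioned writes, and the function names are all invented for the example. The idea it shows is that a pseudo-random scheduler driven by a seed can reorder message deliveries, an assertion-style check can flag a bad outcome, and the same seed replays the same failing schedule.

```python
import random

def run_simulation(seed, check_versions):
    """Deliver three versioned writes to a toy replica in a seed-determined order."""
    rng = random.Random(seed)                    # private generator: same seed, same schedule
    messages = [(1, "a"), (2, "b"), (3, "c")]    # (version, value) pairs
    rng.shuffle(messages)                        # simulated network reordering
    version, value = 0, None
    for msg_version, msg_value in messages:
        if check_versions and msg_version <= version:
            continue                             # a correct replica ignores stale writes
        version, value = msg_version, msg_value  # a buggy replica applies everything
    return messages, value

def find_failing_seed(check_versions, max_seeds=1000):
    """Search seeds for a schedule that leaves the replica with a stale value."""
    for seed in range(max_seeds):
        _, value = run_simulation(seed, check_versions)
        if value != "c":                         # the newest write must win
            return seed
    return None

if __name__ == "__main__":
    seed = find_failing_seed(check_versions=False)
    print("failing seed:", seed, run_simulation(seed, check_versions=False))
```

Once a failing seed turns up, rerunning with that seed reproduces the exact delivery order, which is what makes it practical to turn the failure into a permanent unit test.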

This is how real systems are built; this is how enduring software gets made. As they say, it's all in the details, and certainly this part of the paper says nothing ground-breaking or startling.

Sunday, January 23, 2011

On the one hand, my initial reaction mirrored that of a number of the commenters on the article: just make a change and commit it directly, from your browser? You didn't test the change? You didn't even write any new tests? Heck, you didn't even compile it? Impulsive change is a dangerous behavior, and software is astonishingly hard to build; without the protection and warm cocoon of my multi-platform build system and my extensive suite of regression tests, I shudder at making any change to my software, even the smallest, most trivial, most elementary. I've seen too many times how "this couldn't possibly be the wrong change" comes back to bite.

On the other hand, this is the entire point of having a Version Control System, such as Subversion, Mercurial, or Perforce. Go ahead, and make a change! If it doesn't work, we can change it again. If we want to know what you changed, we can look at the differences between the two versions. If we decide we don't like that change, we can go back to the previous version. If we decide we aren't sure, and want to run more experiments, we can make a branch, and evaluate the two possibilities independently. Modern version control systems work extremely hard to ensure that digital objects can be changed safely, conveniently, and securely.

Furthermore, making the barrier to submit as low as possible encourages incremental evolution of software. In this I find much to appreciate in the writings of the Extreme Programming school:

So always do the simplest thing that could possibly work next.

...

Break your system into small testable units.

...

The best approach is to create code only for the features you are implementing while you search for enough knowledge to reveal the simplest design. Then refactor incrementally to implement your new understanding and design. Keep things as simple as possible as long as possible by never adding functionality before it is scheduled.

If you make the barrier to submit high, if you make it too hard, too scary, too imposing to submit, then I've seen all too well what happens: people hold on to their changes. They keep them on their machine, accumulate more changes, wait forever to submit. When they finally submit, their changes are too late, and too large. It's hard to comprehend a large change, and it's hard to figure out exactly what caused a small and subtle change in behavior in a multi-thousand-line submit. Worse, when a problem is found late in the process, there is terrible resistance to going back and fixing it: "it's too late, we already built this; we can't change it now." Aargh!

Can I see myself making a change directly in the browser, then submitting it? No, absolutely not.

But I often find myself in the situation of wanting to propose a change. For example, while evaluating a bug report, or discussing a potential project on a mailing list, I often want to say:

Hey, you know this code over here? What if we made a change like this? Would that be taking us in the right direction?

And for purposes like this, I'll routinely whack out a change in my editor, then directly paste the diff output into my message, or into the comments field of the bug report, or into my design spec, or into my project tracking system. Kind of like "a picture is worth a thousand words", a little bit of concrete discussion about the actual code is far better than paragraphs and paragraphs of abstract text. Show me the code!

So in the end I find myself a clear proponent of Google Project Hosting's new "Edit File" feature. Is it the way I work? Not exactly. But is it a useful tool with the right intentions? Absolutely.

And that, finally, led to Prof. Zeller's book on systematic debugging. Amazon promises that my copy is in the mail.

It's dramatic how much difference there is between a good testcase and a poor testcase. A clearly described problem that has been reduced to the simplest set of reproduction steps gets fixed immediately; a vague and complex bug report languishes.

During the fix process, that clear test case can often suggest the location of the problem, and it aids in code review, as you endeavor to explain to your colleagues the problem, and the resolution. Then, after the bug is fixed, that test case turns into a regression test, and a code example in the documentation, and has a life of its own, above and beyond the result of enabling the bug fix.

If you are in the business, you are both a bug finder and a bug fixer; even if you think that learning to produce the best possible test cases is not your problem, it will be, eventually, so you should take the time to learn how to do it well.

Wednesday, January 19, 2011

... it was 62 degrees last night at 9:15 PM in Alameda, CA. That's about 15 degrees above normal for this time of year. And after that wonderful wet December, January has been as dry as dry could be. Hmmm...

Tuesday, January 18, 2011

At my day job, several of our customers independently reported a bug to us over the weekend. After a bit of analysis, our support staff identified the commonality among the cases, and sure enough, there was a particular configuration, a "perfect storm" if you will, involving several different aspects of the server configuration which, if they were all just right, caused a small memory leak.

Well, I say a small memory leak, and it was in fact less than 15 bytes for each TCP/IP connection that the server accepted. Unfortunately, since our server routinely accepts hundreds or even thousands of connections a minute, that can really add up.
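Just to put a rough number on "add up", here is the back-of-the-envelope arithmetic as a tiny sketch. The figure of 1,000 connections per minute is an assumed round number standing in for "thousands"; it is not a measurement from any customer site.

```python
def leaked_bytes_per_day(bytes_per_connection, connections_per_minute):
    """Total bytes leaked over 24 hours at a steady connection rate."""
    minutes_per_day = 60 * 24
    return bytes_per_connection * connections_per_minute * minutes_per_day

if __name__ == "__main__":
    # 15 bytes per connection at an assumed 1,000 connections per minute
    print(leaked_bytes_per_day(15, 1000))
```

At that assumed rate the leak comes to about 21.6 MB a day, which is easy to miss in a short test run and hard to ignore on a server that stays up for months.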

It was my bad for introducing the bug in the first place, and for not catching it during the 8 months (!) of internal testing, but things like this happen. I'm pleased that our support team was able to isolate the configuration conditions so rapidly; it saved me an immense amount of time to be able to demonstrate the problem in just a few simple commands.

Unfortunately, you can't fix a bug until you find it. But now that we've found it, and fixed it, I've done what I think is the best I can do:

I searched the code for any similar mistakes that I might have made, and didn't find any.

I ran my fix past 3 separate code reviewers, who each found small areas where I could improve the fix.

I added a test case based on the reproduction script to our nightly regression suites, and I verified that the test case fails without the fix, and passes with the fix in place. The presence of this test case greatly increases my confidence that this bug won't slip back into the product in some future release. It's a substantial bit of effort to add a test for every single bug fix, but the alternative is worse, so I always try to add that test case if at all possible.

And I fixed the problem in a way that I hope will lay the infrastructure for future improvements in subsequent releases.

Since I know people will suggest this: yes, we do make use of a number of resource leak detection tools; getting a clean valgrind run is important to us, and we have a large collection of stress and load test suites. Unfortunately, there are many more operational configurations than you might think, and this particular configuration, although not that unusual, still happened to be a configuration that we don't have leak detection tools for.

No more crying over spilt milk. This bug is fixed, I believe, and fixed well, and there are more tasks lying ahead. That is the way of software, after all.

Saturday, January 15, 2011

I came across a new term recently, one which I hadn't heard before, but immediately recognized: stack ripping.

I was reading the intriguing paper by Krohn, Kohler, and Kaashoek: Events Can Make Sense, where they say that they are working on:

A high-level, type-safe API for event-based programming that frees it from the stack-ripping problem but is still backwards compatible with legacy event code.

The Tame team explain the stack-ripping problem as follows:

But a key advantage of events -- a single stack -- is also a liability. Sharing one stack for multiple tasks requires stack ripping, which plagues the development, maintenance, debugging and profiling of event code. The programmer must manually split (or "rip") each function that might block (due to network communication or disk I/O), as well as all of its ancestors in the call stack.

Stack ripping is introduced in section 3.1 of the Microsoft Research paper, under the sub-title Automatic versus manual:

Programmers can express a task employing either automatic stack management or manual stack management. With automatic stack management, the programmer expresses each complete task as a single procedure in the source language. Such a procedure may call functions that block on I/O operations such as disk or remote requests. While the task is waiting on a blocking operation, its current state is kept in data stored on the procedure's program stack.

In contrast, manual stack management requires a programmer to rip the code for any given task into event handlers that run to completion without blocking. Event handlers are procedures that can be invoked by an event-handling scheduler in response to events, such as the initiation of a task or the response from a previously-requested I/O.

The data structure that is used to track the in-progress task, including the state of the task, and a reference to the callback procedure that will be invoked when the blocked operation completes, is generally called a "continuation". The term "continuation" is nothing new; credit for this term is generally given to the Scheme programming language. For example, see Daniel Friedman's work nearly 35 years ago, probably starting with this paper (which I haven't read).

Although I've never programmed in Scheme, I'm very familiar with the basic events-versus-threads tradeoffs, and with the complexities of trying to write resource-efficient task processing code in a multi-threaded server. And when the Microsoft Research team describe the core problem, I absolutely feel their pain:

Software evolution substantially magnifies the problem of function ripping: when a function evolves from being compute-only to potentially yielding, all functions, along every path from the function whose concurrency semantics have changed to the root of the call graph may potentially have to be ripped in two.
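To see what ripping looks like in practice, here is a toy sketch that writes the same two-step task both ways. Everything in it is invented for illustration (the event queue, async_read, and the handler names); it is not code from either paper, and the "I/O" is faked so that the example runs anywhere.

```python
from collections import deque

pending = deque()   # toy event queue holding (callback, result) pairs

def async_read(key, callback):
    """Pretend I/O: queue the callback with its result instead of blocking."""
    pending.append((callback, "data-for-" + key))

def run_event_loop():
    """Run queued handlers to completion, one at a time, on a single stack."""
    while pending:
        callback, result = pending.popleft()
        callback(result)

# Automatic stack management: one procedure, state lives in local variables.
def blocking_read(key):
    return "data-for-" + key

def fetch_both_direct(first, second):
    a = blocking_read(first)
    b = blocking_read(second)
    return a + "|" + b

# Manual stack management: the same task ripped into three handlers.
def fetch_both_ripped(first, second, done):
    state = {"second": second, "done": done}     # the continuation's saved state
    async_read(first, lambda a: _after_first(state, a))

def _after_first(state, a):
    state["a"] = a                               # a local variable becomes explicit state
    async_read(state["second"], lambda b: _after_second(state, b))

def _after_second(state, b):
    state["done"](state["a"] + "|" + b)

if __name__ == "__main__":
    results = []
    fetch_both_ripped("x", "y", results.append)
    run_event_loop()
    print(results, fetch_both_direct("x", "y"))
```

The direct version is three lines; the ripped version needs a state dictionary and two extra handlers, and any caller of fetch_both_ripped must itself be ripped to accept a completion callback, which is the evolution problem the quote describes.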

The observation that this problem has been around for pretty much as long as computer programming has existed means that we're likely to continue living with it for some time to come. It's great to see the programming language research community continue to discuss and debate it; it's a hard problem and worth working on. One day, perhaps, some elegant programming language of the future will make life easier for systems programming grunts like me. For now, I much enjoyed reading the Tame paper and the Farsite paper; each time I follow through the details of a clearly-written description such as these, it helps me understand and reason about the programming patterns in my own code, and gives me better terminology and concepts to use when I'm trying to describe my work to others.

local incumbent Safaricom had started a minute-sharing service for its prepaid cell phone plans a few years back. The idea was to enable users to send minutes to family members in rural areas, who weren’t otherwise able to buy prepaid phone cards. However, Kenyans quickly came up with other uses. “Lots and lots of people were using it as a surrogate for currency,” Eagle said. “[You] could literally pay for taxi cab rides using cell phone credit."

I was sort of dimly aware that mobile phones had transformed African society in ways that Americans only barely understood, but (I am an American after all) I hadn't paid much attention. This story in Wired discusses other uses of the new technology; it's strange and new and takes some thinking about.

Getting back to Koch's article, he makes the fascinating point that America and other societies are behind in adopting some of these new technologies and techniques, precisely because we are so far ahead in other respects:

Part of the reason mobile banking is so successful in Kenya is that the majority of the population is not eligible for a bank account, let alone a credit card. Thus they are actively looking for an easy alternative to carrying around lots of bank notes.

In contrast, the West has solved this problem a while back with its existing banking and credit card system. That system doesn’t really work very well on the Internet, but we’ve all grown used to that situation and haven’t really looked for solutions.

Thus it’s the West that is behind here. The mobile economy needs something that’s more user-friendly and more accessible than the current credit card system. Operator billing fits the ... well ... bill.

All told, a lot of very interesting things to read and think about, though I don't have much to add. Are you up on this stuff? Send me good links to make me smarter!

Monday, January 10, 2011

(You should probably cut-and-paste the command to get it exactly, as that turns out to be important for this example.)

Now bring up the file c:\temp\bryan.txt in your favorite editor.

Meanwhile, try this command, too:

echo bryan> c:\temp\bryan2.txt

And bring up the file c:\temp\bryan2.txt in your favorite editor.

Do you see the difference?

Yes, the two commands look very similar, and the two files they produce look very similar, but they are slightly different.

Do you see it yet? It's pretty subtle ...

OK, here's the answer: in the first case, there is a trailing space character in the resulting file, but in the second case there is not! That is, the first command results in a single-line file with a line that is six characters long: 'b', 'r', 'y', 'a', 'n', ' ', while the second command results in a single-line file with a line that is five characters long: 'b', 'r', 'y', 'a', 'n'.

You'll get similar behavior when you use the "pipe" symbol ('|') to construct a command pipeline; that is,

echo bryan | some-program-that-reads-stdin

is slightly different than

echo bryan| some-program-that-reads-stdin

because in the first case the program will read a 6-character line from stdin, while in the second case the program will read a 5-character line from stdin.

I'm sure this is probably documented somewhere, but it was a surprise to me so I thought it would be worth a short note.
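If you'd rather not squint at an editor, a few lines of script can report the difference. This is just a sketch: the helper name describe_line is made up, and the byte strings stand in for what a program would read from the file or from stdin, assuming the usual carriage-return/line-feed line ending.

```python
def describe_line(raw):
    """Return (length, has_trailing_space) for one line, ignoring the line ending."""
    line = raw.rstrip(b"\r\n")
    return len(line), line.endswith(b" ")

if __name__ == "__main__":
    print(describe_line(b"bryan \r\n"))   # line as written with a space before the redirect
    print(describe_line(b"bryan\r\n"))    # line as written with no space before the redirect
```

Reading the file in binary mode (for example, open(path, "rb").readline()) and passing the result to describe_line makes the invisible sixth character visible.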

Of course, nobody sits down and writes something like this from scratch. A bit of code like this evolves, over time, incrementally, as developers add to it.

And, as many of the commenters note, this statement actually packs an enormous amount of functionality into a very compact form. It is dense.

There's something like 20 tables mentioned in the FROM clause; I bet this monster is fun to run through a modern query optimizer!

In a previous job, I built and maintained a Continuous Integration system which operated a fleet of automated build robots that performed build and test tasks and provided tools to analyze and interpret the results. The core logic of the system was as follows:

Design the underlying database schema as carefully as you can

Express the primary operations of the system as database queries; pack as much intelligence as possible into the query itself

Provide a thin layer of execution (in the build bots) and visualization (in the web UI) logic around the underlying database system; let the basic database structure show through (see point #1)

Many of our queries were rather complex; the most complex query was the one which scheduled the waiting jobs to the available bots, matching priorities and capabilities according to the system's rules. The query was hell to write, but once we got it built, the rest of the system just flowed.
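To give a flavor of the "put the intelligence in the query" idea, here is a drastically simplified sketch using SQLite. The schema, the column names, and the matching rule (same platform, highest priority first) are all invented for this example; the real system's rules were far more involved and are not reproduced here.

```python
import sqlite3

def make_database():
    """Create an in-memory database with a hypothetical two-table schema."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE bots (id INTEGER PRIMARY KEY, platform TEXT, busy INTEGER);
        CREATE TABLE jobs (id INTEGER PRIMARY KEY, platform TEXT,
                           priority INTEGER, state TEXT);
    """)
    return conn

def next_assignment(conn):
    """Return (job_id, bot_id) for the best waiting job that an idle bot can run."""
    return conn.execute("""
        SELECT j.id, b.id
        FROM jobs AS j
        JOIN bots AS b ON b.platform = j.platform
        WHERE j.state = 'waiting' AND b.busy = 0
        ORDER BY j.priority DESC, j.id, b.id
        LIMIT 1
    """).fetchone()

if __name__ == "__main__":
    conn = make_database()
    conn.executemany("INSERT INTO bots VALUES (?, ?, ?)",
                     [(1, "linux", 1), (2, "linux", 0), (3, "windows", 0)])
    conn.executemany("INSERT INTO jobs VALUES (?, ?, ?, ?)",
                     [(10, "linux", 5, "waiting"), (11, "windows", 9, "waiting")])
    print(next_assignment(conn))
```

The surrounding code only has to run the query and act on the row it gets back; the scheduling policy lives entirely in the JOIN, WHERE, and ORDER BY clauses.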

PdaNet does NOT require root access or hacking your phone's firmware in order to work. It is just a regular Android application that works on all Android phones as-is. Your phone can connect to the data service, WiFi, or even through VPN and PdaNet will share the connection with your computer.

The PDANet installer whirled and hummed, and it, too, was having trouble getting the right drivers installed.

Sometimes if USB debugging is checked but there is still no "Android ADB Interface" listed when phone is attached or if there is a driver error, your system may need the official USB driver for some reason. If that is the case, install one of the following drivers on your computer first:

Driver for LG Android Phones

Once we got that special LG driver installed, the rest of the PDANet installation proceeded without complaint, and a few minutes later we were online, tethered via the phone!

Quite possibly, the installation of the LG driver would have enabled Proxoid to work, but since the PDANet software seemed to be working and diagnosed the problem well and clearly, we've decided to stick with that for now.

Smart phones are a fast-moving technology; I suspect that, in a year, we'll be setting up a new computer with a new smartphone, and we'll be amazed how much things have changed, and how easy they've become. Still, the PDANet installation process was overall pretty reasonable, so I'm suitably impressed.

I find this technique baffling. Why would you want to go to all this trouble and complexity when you could just propagate true virtual machines? It all seems so last century, one of those incredible feats of virtuosity that makes your jaw drop in admiration, then shake your head and think: Why? Why? Why?

If anything, I think that I am still setting up too few virtual machines, not too many. With disk space running at about $100/TB, and memory at somewhere around $15/GB, any desktop machine that you set up this year for serious development work should have at least 5 TB of disk and 16 GB of memory, plenty of horsepower to support 50-100 installed virtual machine images, with at least 2-3 live and active at any given time.

And it's about 30 seconds to suspend one VM and revive another.

Well, anyway, it's always interesting to read about alternate techniques, and I hadn't known about debootstrap before, nor about rinse, and my experience with chroot mostly involved security considerations, not alternate operating system configurations.

So I learned, and I'm smarter, and I'm grateful for that, but thank you very much I think I'll stick with my virtual machines for now.

OCZ let me run some of my own Iometer tests on the drives to verify the claims. Surprisingly enough, the Vertex 3 Pro looks like it’s really as fast as OCZ and SandForce are claiming. When running highly compressible data (pseudo random in Iometer) at low queue depths, I get 518MB/s sequential write speed and nearly 500MB/s for sequential read speed.

SandForce’s controller gets around the inherent problems with writing to NAND by simply writing less. Using real time compression and data deduplication algorithms, the SF controllers store a representation of your data and not the actual data itself. The reduced data stored on the drive is also encrypted and stored redundantly across the NAND to guarantee against dataloss from page level or block level failures. Both of these features are made possible by the fact that there’s simply less data to manage.

These are really, really, really fast devices:

Now the shocker. Thanks to 6Gbps and ONFI 2/Toggle support, the SF-2000 will support up to 500MB/s sequential read and write speeds. On an 8 channel device that’s actually only 62.5MB/s per channel but the combined bandwidth is just ridiculous for a single drive. At full speed you could copy 1GB of data from a SF-2000 drive to another SF-2000 drive in 2 seconds. If SandForce can actually deliver this sort of performance I will be blown away.

On my own moderately-high-end hardware nowadays, I tend to see about 50 MB/s from my (traditional technology) storage devices, and on our fast internal lab machines we see double that. So these new devices will be approximately 5-10 times faster than the storage devices we're using now!

Tuesday, January 4, 2011

One of the first systems our engineers built in AWS is called the Chaos Monkey. The Chaos Monkey’s job is to randomly kill instances and services within our architecture. If we aren’t constantly testing our ability to succeed despite failure, then it isn’t likely to work when it matters most – in the event of an unexpected outage.

I love the idea of the Chaos Monkey!

Testing error recovery is hard. It's often hard to provoke errors. It's fairly straightforward to provoke errors that are caused by bad input, so you should always have a thorough suite of tests which tries lots of invalid invocations of your code: syntax errors, parameters out of range, missing values for required fields, invalid combinations of requests, etc.

It's harder to provoke errors that are caused by other conditions: resource shortages, disk or network I/O errors, etc. In a number of my tests I provoke these errors using surrogate mechanisms:

To simulate I/O-related problems I tamper with file or directory permissions, or I remove or rename files and directories that the program wants to access

To simulate network problems I shut down one or the other end of a conversation, or I use invalid network addresses
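Here is a minimal sketch of the first of those surrogate mechanisms: provoking an I/O error by removing a file the program expects to find. The function read_setting and its fallback behavior are hypothetical stand-ins for whatever code is under test. I remove the file rather than changing its permissions because permission tricks tend not to work when the tests run as an administrator.

```python
import os
import tempfile

def read_setting(path, default="fallback"):
    """Code under test (hypothetical): read a setting, recovering from I/O errors."""
    try:
        with open(path, encoding="utf-8") as handle:
            return handle.read().strip()
    except OSError:
        return default

def check_recovery_from_missing_file():
    """Read the setting normally, then remove the file and read it again."""
    with tempfile.TemporaryDirectory() as workdir:
        path = os.path.join(workdir, "setting.txt")
        with open(path, "w", encoding="utf-8") as handle:
            handle.write("configured\n")
        before = read_setting(path)
        os.remove(path)                  # provoke the I/O error
        after = read_setting(path)
    return before, after

if __name__ == "__main__":
    print(check_recovery_from_missing_file())
```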

There are more sophisticated tools for doing this; for example, check out Holodeck.

And you can also modify your application so that it is easier to test; quite commonly this involves modifications which allow testers to force the software through error conditions. This is often called "testability"; here's a pointer to a recent testing conference -- note how many of the talks are focused on various aspects of testability. Currently, much of the focus in the testing world is on the notion of "mock" objects, and indeed they can be very powerful and worth building into your test harness. Here's an interesting recent example: Mocking the File System to Improve Testability.
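As a small illustration of the mock-object idea, here is a sketch that uses Python's standard unittest.mock to force a "disk full" error that would be awkward to provoke for real. The function save_report is a hypothetical piece of code under test; it is not taken from the article mentioned above.

```python
import errno
from unittest import mock

def save_report(path, text):
    """Code under test (hypothetical): return False instead of crashing on I/O errors."""
    try:
        with open(path, "w", encoding="utf-8") as handle:
            handle.write(text)
        return True
    except OSError:
        return False

def save_result_when_disk_is_full():
    """Replace the built-in open() with a mock that always raises ENOSPC."""
    disk_full = OSError(errno.ENOSPC, "No space left on device")
    with mock.patch("builtins.open", side_effect=disk_full):
        return save_report("report.txt", "hello")

if __name__ == "__main__":
    print(save_result_when_disk_is_full())
```

Because the mock is only active inside the with block, nothing is written to disk and the real open() is restored afterward.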

You can also use randomization, and stress: here at my day job we have something we call the Submitatron, which is a tiny little script that simply loops around, generating arbitrary data and sending it to the server. Similar techniques, which focus more on randomization, are often referred to as "fuzz testing".
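A fuzz loop can be very small indeed. The sketch below is not the Submitatron; it is a toy with an invented parse_record function standing in for the server. It generates seeded random byte strings, treats a clean ValueError as the expected way to reject bad input, and lets any other exception escape so that it shows up as a bug.

```python
import random

def parse_record(data):
    """Hypothetical parser: accepts ASCII 'key=value', raises ValueError otherwise."""
    text = data.decode("ascii")          # UnicodeDecodeError is a subclass of ValueError
    key, separator, value = text.partition("=")
    if not separator or not key:
        raise ValueError("expected key=value")
    return key, value

def fuzz(iterations, seed):
    """Throw random blobs at the parser; return how many were cleanly rejected."""
    rng = random.Random(seed)            # seeded so that a failing run can be replayed
    rejected = 0
    for _ in range(iterations):
        length = rng.randrange(0, 16)
        blob = bytes(rng.randrange(256) for _ in range(length))
        try:
            parse_record(blob)
        except ValueError:
            rejected += 1                # clean rejection is fine
        # any other exception propagates and fails the run
    return rejected

if __name__ == "__main__":
    print(fuzz(1000, seed=1))
```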

But the most important thing is to think about testing, think about errors, think about failures, and try it out!

Sunday, January 2, 2011

First, from a short article in the ACM Queue regarding the performance of the NFS networking protocols across trans-oceanic links, Bound by the Speed of Light:

Unfortunately, the speed of light gets involved when you start creating networks over global distances. It's typical for a transpacific network link to have a 120-ms round-trip time.

...

For every mile between the client and the server, a message cannot get to the server and back to the client in less than 10 microseconds, because light travels one mile in 5.4 microseconds in a vacuum. In a fiber-optic network, or in a copper cable, the signal travels considerably slower. If your server is 1,000 miles from your client, then the best round-trip time you could possibly achieve is 10 milliseconds.

In August, Spread Networks of Ridgeland, Miss., completed an 825-mile fiber optic network connecting the South Loop of Chicago to Carteret, N.J., cutting a swath across central Pennsylvania and reducing the round-trip trading time between Chicago and New York by three milliseconds, to 13.33 milliseconds.

Then there are the international projects. Fractions of a second are regularly being shaved off of the busy Frankfurt-to-London route. And in October, a company called Hibernia Atlantic announced plans for a new fiber-optic link beneath the Atlantic from Halifax, Nova Scotia, to Somerset, England that will be able to send shares from London to New York and back in 60 milliseconds.

If the best possible round-trip time to a server 1,000 miles away is indeed about 10 milliseconds, then achieving 13.3 milliseconds over an 825-mile route is superb progress.
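The arithmetic is easy to check with a few lines of script. This is a rough sketch only: the function name is mine, and the refractive index of 1.47 used for optical fiber is an assumed typical value, not a figure from either article.

```python
SPEED_OF_LIGHT_M_PER_S = 299_792_458   # in a vacuum
METERS_PER_MILE = 1609.344

def min_round_trip_ms(one_way_miles, refractive_index=1.0):
    """Lower bound on round-trip time, in milliseconds, over a given one-way distance."""
    one_way_seconds = (one_way_miles * METERS_PER_MILE * refractive_index
                       / SPEED_OF_LIGHT_M_PER_S)
    return 2 * one_way_seconds * 1000

if __name__ == "__main__":
    print(min_round_trip_ms(1000))                          # vacuum, 1,000 miles each way
    print(min_round_trip_ms(825, refractive_index=1.47))    # assumed index for fiber
```

In a vacuum the 1,000-mile case comes out to roughly 10.7 milliseconds, and with the assumed fiber index the 825-mile route comes out to roughly 13 milliseconds, so the quoted 13.33 milliseconds leaves very little room for further improvement on that route.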

But even 60 milliseconds to make a round-trip between London and New York can feel like an eternity, as the New York Times article points out:

Almost each week, it seems, one exchange or another claims a new record: Nasdaq, for example, says its time for an average order “round trip” is 98 microseconds — a mind-numbing speed equal to 98 millionths of a second.

These are different uses of the term "round-trip", of course, but the basic conclusion holds.

Unfortunately, both articles concentrate mostly on the mechanical details of order processing speed, pointing out why financial engineering companies tend to operate within a few miles of the major financial centers: New York, Chicago, London, Frankfurt, etc. I was hoping that the New York Times article would focus on what is, in my opinion, the more interesting question: the fairness and openness of modern electronic exchanges, including questions that I think are still open from last spring's "Flash Crash":

But some analysts fear that some aspects of the flash crash may portend dangers greater than mere mechanical failure. They say some wild swings in prices may suggest that a small group of high-frequency traders could manipulate the market. Since May, there have been regular mini-flash crashes in individual stocks for which, some say, there are still no satisfactory explanations. Some experts say these drops in individual stocks could herald a future cataclysm.

For example, the NYT mentions the on-going debate about when information becomes visible on these exchanges:

Most of the exchanges have already eliminated a controversial electronic trading technique known as flash orders, which allow traders’ computers to peek at other investors’ orders a tiny fraction of a second before they are sent to the wider marketplace. Direct Edge, however, still offers a version of this service.

Nothing definite to report here, I guess: computers get faster, the world gets smaller, we continue to try to understand how to build fair and equitable markets. Just more postcards from the bleeding edge.