Most of us, as we leave the theatre, can no more remember which company produced the film we just saw than we could tell you who manufactured the hand dryer in the men's room. The exception is Pixar, the only studio whose products people actively seek out. Everyone knows Pixar.

So Lane goes on a road trip, to visit Pixar, to see if he can understand what makes it special.

I'm not sure if it's because I used to work (though not for Pixar) on that same campus in "the orderly town of Emeryville" some 15 years ago, or because, like John Lasseter, I too hail from Whittier, California, or just because I'm fascinated by what makes the difference between a good company, a great company, and a legendary company, but I too am captivated by Pixar, and I devoured Lane's story.

Lane tours the campus, visiting both new buildings and existing ones, noting work areas, the soccer pitch, volleyball court, and pool, the open-air deck with the view of the Golden Gate Bridge, and Pixar University, which he finds intriguing:

they can finish their tasks for the day, in any department of the company, then head over to P.U. for a course in live-action moviemaking, sculpting, fencing, or whatever. "Why are we teaching filmmaking to accountants? Well, if you treat accountants like accountants, they're going to act like accountants." So said Randy Nelson, the first person to head the program, which started in 1995.

I love the way that Lane deconstructs Pixar's history by way of its art: finding in The Incredibles a critique of modern corporate America; seeing Elastigirl as embodying the changing role of women in society; noting that Up, a story about "the bonding of young and old," is produced by a company whose entire body of work is about the bonding of young and old. Lane is right to be fascinated by how all this art can emerge in a world where "none of this exists":

the storyboard? It's not there. What Larsen drew on was a digital sketch pad, and he held an electronic pen. There was no board. ... It was born and cradled in the mind of a computer, and there it lived and grew. ... there are no lenses, or none that you can hold in your palm. They are purely options on a toolbar, and you scroll between them.

Of course, the greatest art is all about creating something in your mind that isn't there.

As Lane points out, there are several aspects to the Pixar office culture that are notable. Firstly, there is the idea of serendipity, and of providing the opportunity for casual, almost accidental, encounters that lead to cross-fertilization and inspiration:

The hope is that, as you head toward storyboarding, you will bump into the woman from the art department whose craving for a Danish has lured her to the cafe, and the film on which you are both toiling will be advanced a notch as your paths intersect.

Secondly, there is the celebration of creativity, inspiration, and innovation:

The funkiest parish in the building is where the animators, a hundred and twenty in number, dwell, and where their fancies are encouraged to sprout.

And thirdly, there is the idea that Pixar is a place where you want to be, where you feel involved, engaged, and inspired:

it is a democratic initiative, whereby those who make the movie at ground level -- the animators, and other interested parties -- should get to watch the movie, and then buff it up, as they go along. No mandatory notes, no development executives, no clipboard. In short, it is the filmmakers who run the show.

All of this, together, creates that magical environment that "drives each employee at Pixar to take more pains than the next one."

Several times, Lane observes that one of Pixar's particular successes is to understand that the technology is just a tool, and that what is really important is the desire to make the perfect movie:

The same goes for Sulley's turquoise fur -- a famous test for the Pixar software engineers, who had to write new simulation programs to engender 2.3 million separate strands. But Sulley isn't furry because Pixar wanted to march ahead with the technology; that was merely a happy offshoot of the task. Sulley is furry because the tiny human who befriends him, Boo, needs to have something to grip as she clings, unnoticed, to his back.

Lane re-tells the famous story about John Lasseter's first showing of Luxo, Jr. at SIGGRAPH 1986:

After the screening, Lasseter watched in trepidation as Jim Blinn, a computer scientist he knew, approached. What would Blinn want to talk about: rendering algorithms? Z-buffers? Jaggies? But Blinn had only one question: "John, was the parent lamp a mother or a father?"

Lane wonders whether Lasseter can truly live up to this reputation, and discovers that he does:

In the mouths of most bosses, such sentiments would be mush, or self-delusion, but Lasseter, like his movies, is there to be believed. When he talks like this, he doesn't sound like a movie supremo. He sounds like Buzz Lightyear. The key to Pixar, I came to realize, is that what it seeks to enact, as corporate policy, and what it strives to dramatize, in its art, spring from a common purpose, and a single clarion call: You've got a friend in me.

Pixar is indeed a fascinating place. It's not the only fascinating place in the world, and it's arguably not even the most interesting company in the Bay Area, but the people there have clearly built something very, very special, and it was quite interesting to read about the company in detail.

Thursday, May 26, 2011

If you are interested in staying up to date regarding the Oracle/Google Android lawsuit, one good resource is Florian Mueller's blog; in particular, he has recently been publishing a number of articles about the lawsuit:

Monday, May 23, 2011

In my opinion, the relational data model is powerful and, for what it does, it does it very well. And other data models exist, and they do what they do, very well. I haven't used MySQL a tremendous amount, but it always seems like that community finds ways to do what other database implementations do, but differently.

Telecommunications companies such as Western Union and American Telephone and Telegraph

Broadcasting companies such as Radio Corporation of America and National Broadcasting Corporation

Motion picture companies such as Paramount, Fox, and Warner Brothers

Cable Television companies such as Ted Turner's Cable News Network

Internet companies such as AOL, Apple, and Google

It's a long list, and a broad scope.

The Master Switch covers a lot of ground. There are sections on patent law, on monopolistic behavior, and on trust-busting; there are sections on politics and policy and on the ability of corporate behemoths to use lobbying and political contributions to influence government behavior; there are sections on the differences between "open" and "closed" communications networks; there is a wonderful section that I hadn't known about before regarding the involvement of Leo Beranek and J.C.R. Licklider with Harry Tuttle's Hush-A-Phone lawsuit in 1950.

In my reading, Wu makes three major points:

These corporate behemoths may have good intentions, but they continually succumb to the Kronos Effect

Information industries deserve special attention, because information is not just a commodity

A firm and clear set of Separation Principles is necessary to achieve the right balance of profitability and freedom

Let's take these each in turn.

Wu explains the Kronos Effect thusly:

Kronos, the second ruler of the universe according to Greek mythology, had a problem. The Delphic oracle having warned him that one of his children would dethrone him, he was more than troubled to hear his wife was pregnant. He waited for her to give birth, then took the child and ate it. His wife got pregnant again and again, so he had to eat his own more than once.

And so derives the Kronos Effect: the efforts undertaken by a dominant company to consume its potential successors in their infancy.

Wu illustrates the Kronos Effect with a number of stories, such as this one, about how David Sarnoff and Radio Corporation of America reacted when Professor Edwin Armstrong of Columbia University showed them his new invention for Frequency Modulation (FM) radio:

You might think that the possibility of more radio stations with less interference would be generally recognized as an unalloyed good. More radio stations means more choices for the consumer and more opportunities for speakers, musicians, and other performers. But by this point the radio industry, supported by the federal government, had invested heavily in the status quo of fewer stations. Radio's business model, as we've seen, was essentially "entertainment that sells" -- shows produced by advertisers, with revenues dependent on maximizing one's share of the listenership. Hence, the fewer options the better.

The story of companies defending their dying market has been told many a time, notably by Clayton Christensen's theory of disruptive innovation, but Wu tells the story well and illustrates it with a number of compelling anecdotes.

Wu's claim that "information" industries are special is key to his book, and takes some time to establish and defend. Wu asserts that information industries are the keepers of the "master switch", a heavy responsibility. Here are several passages that I think do a particularly good job of stating his core premise:

But what if the figurative "marketplace of ideas" is lodged in the actual and less lofty markets for products of communication and culture, and these markets are closed, or so costly to enter as to admit only a few? If making yourself heard cannot be practically accomplished in an actual public square but rather depends upon some medium, and upon that medium is built an industry restricting access to it, there is no free market for speech. Seen this way, the Hays Code was a barrier to trade in the marketplace of ideas. And even without them, the higher the costs of entry, the fewer will be the ideas that can vie for attention.

And later, speaking about Fred Friendly, the father of broadcast T.V. news:

Friendly had identified a new reality of the age of mass information: the power of concentrated media to narrow the national conversation. It may seem paradoxical to suggest that new means of facilitating communication could result in less, not more, freedom of expression. But a medium, after all, is literally something that comes between the speaker and the potential listeners. It can facilitate speech only if it is freely accessible. And if it becomes the means by which most people inform themselves, it can decisively reduce free speech by becoming, whether by malign intent or merely benign effect, the arbiter of who gets heard. It was by such means, Friendly believed, that the shortage of TV stations had given exclusive custody of a "master switch" over speech, creating "an autocracy where a very few citizens are more equal than all the others."

Wu's argument reaches its height in this chilling description of the increasingly polarized and bitter public debate that seems to be commonplace today, discussing an essay by Professor Cass Sunstein:

"In a democracy," writes Sunstein, "people do not live in echo chambers or information cocoons. They see and hear a wide range of topics and ideas." There is a bit of a paradox to this complaint that must be sorted out. The concern is not that there are too many outlets of information -- surely that serves the purpose of free expression that sustains democratic society. Rather, the concern is that in our society, we have been freed to retreat into bubbles of selective information and parochial concern (Sunstein's "information cocoons"), in flight from the common concerns we must address as Americans. ... There is little to talk about around the proverbial water cooler in a nation segmented by divides of gender, generation, political inclination, and so on.

"Bubbles of selective information and parochial concern." That description of the modern socially-networked world is extremely apt and accurate; Wu totally nails it with this observation.

To protect us from these maladies and misdeeds, he proposes a surprisingly delicate approach to managing the risks, grounded in the rich heritage of American culture:

The better part of compliance with rules of all sorts actually depends on the power of self-regulation, not the threat of force, though of course that threat can help. Both church and state (or at least individual politicians) may occasionally feel motivated to push the boundaries of their coexistence, but overall both institutions tend to accept the wisdom of the divide between them, which is why it works.

There are uncodified norms governing the behavior of infotel firms in the twenty-first century, ones that did not exist decades ago, such as the norms that stigmatize site blocking, content discrimination, and censorship, broadly defined. Consider that when phone or cable companies have been accused of blocking an Internet site, their tendency has been to deny it, or to blame a low-ranking official, rather than to baldly defend a right to block or censor, as for instance the Edison Trust once did.

As Wu's book shows, this is in fact progress, and so we should be grateful for it, reward it, and build upon it.

Overall, The Master Switch is well-written, well-documented, and relevant; it explores important current issues from the perspective of how we got here. That is the mark of a historian who has done his job well. I enjoyed the book, and if you get a chance to read it, I think you'll enjoy it too.

I really like the format that this workshop used of having 5-page papers. The 5 page limit is long enough to provide significant information about the research team's ideas, but short enough to force the writers to concentrate on the major ideas and not get bogged down in detail (though some of the papers are pretty dense). Overall, I've found the papers to be very clear and well-written, and I'd like to particularly recommend a few of them:

a typical OS paper usually uses the phrase "on commodity hardware." As a community, we assume we are stuck with whatever flaws the hardware has.

The authors offer several examples of ideas in which they feel OS research could contribute to hardware enhancements, including: distributed/multicore messaging, efficient context switching, cache management, and instrumentation. But their goal is larger, to try to "close the gap" and encourage computer system researchers and systems software researchers to collaborate more closely.

Multicore OSes: Looking Forward from 1991, er, 2011. This paper makes some broadly similar points to the Mind The Gap paper, but focuses specifically on the increasing prevalence of multicore processors. The authors urge systems software researchers to review their history:

the parallel supercomputers of the 1980's and 1990's exhibited all the same scaling limitations and ensuing software issues that we expect to see in multicore systems. Rather than waste time repeating that history, we should look at where that work led. ... experience in that sector says that conventional thread programming using locks and shared memory does not scale ... we should be looking to programming models, concurrency paradigms, and languages that natively support, or are based entirely on, messages rather than shared memory.
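To make the "messages rather than shared memory" idea a bit more concrete, here is a minimal sketch of my own (it is not from the paper). Each worker owns its running total privately and communicates only through queues; the names `worker` and `parallel_sum` are hypothetical. Note that Python's `queue.Queue` uses locks internally and CPython threads share one interpreter lock, so this only illustrates the programming model and demonstrates nothing about scalability.

```python
import queue
import threading


def worker(inbox: queue.Queue, outbox: queue.Queue) -> None:
    """Sum the numbers received on inbox, then report the total on outbox."""
    total = 0  # private state: no other thread ever touches this variable
    while True:
        item = inbox.get()
        if item is None:  # sentinel message meaning "no more work"
            break
        total += item
    outbox.put(total)


def parallel_sum(values, n_workers=4):
    """Distribute values to workers as messages and combine their replies."""
    inboxes = [queue.Queue() for _ in range(n_workers)]
    results = queue.Queue()
    threads = [
        threading.Thread(target=worker, args=(inbox, results))
        for inbox in inboxes
    ]
    for t in threads:
        t.start()
    # Send each value to one worker, round-robin.
    for i, value in enumerate(values):
        inboxes[i % n_workers].put(value)
    # Tell every worker to finish.
    for inbox in inboxes:
        inbox.put(None)
    for t in threads:
        t.join()
    # Each worker sends back exactly one message: its partial sum.
    return sum(results.get() for _ in range(n_workers))
```

Calling `parallel_sum(range(1, 101))` adds up the numbers 1 through 100 without any worker reading or writing another worker's data.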

Network latency has been an increasing source of frustration and disappointment over the last thirty years. While nearly every other metric of computer performance has improved drastically, the latency of network communication has not. System designers have consistently chosen to sacrifice latency in favor of other goals such as bandwidth, and software developers have focused their efforts more on tolerating latency than improving it.

Using some simple measurements, the authors suggest that it is not unreasonable to aim for 1-2 orders of magnitude improvements in latency (for those unfamiliar with this terminology, this means ten to one hundred times faster) in the short term. They argue, and I agree, that this would allow dramatic breakthroughs in many systems and techniques. In what may be a controversial proposal, they contend that current computer system design is pointed in the wrong direction:

In recent years some NIC vendors have attempted to offload as much functionality as possible from the CPU to the NIC, including significant portions of network protocols, but we argue that this is the wrong approach. For optimal performance, operations should be carried out on the fastest processor, which is the main CPU; cycles there are now plentiful, thanks to increases in the number of cores.

From experience, I know that it's quite common, when performing distributed system benchmarks, to encounter situations where the individual systems have plentiful CPU, memory, and disk resources, but the overall system is bottlenecked because the networking libraries can't run fast enough. Networking performance improvements have tremendous potential.
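For anyone who wants to see the "orders of magnitude" arithmetic spelled out, here is a small sketch. The 100-microsecond baseline is a number I made up purely for illustration; it is not a measurement from the paper, and the function names are hypothetical.

```python
def improved_latency(current_us: float, orders_of_magnitude: int) -> float:
    """Latency after improving by the given number of orders of magnitude."""
    return current_us / (10 ** orders_of_magnitude)


def sequential_round_trips_per_second(latency_us: float) -> float:
    """How many back-to-back round trips fit into one second."""
    return 1_000_000 / latency_us


if __name__ == "__main__":
    baseline_us = 100.0  # hypothetical round-trip latency, in microseconds
    for orders in (0, 1, 2):
        latency = improved_latency(baseline_us, orders)
        rate = sequential_round_trips_per_second(latency)
        print(f"{orders} orders: {latency} us per round trip, {rate} round trips/s")
```

The point of the second function is that code which must wait for each reply before sending the next request speeds up in direct proportion to the latency improvement, with no change to the code itself.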

Disks Are Like Snowflakes: No Two Are Alike. This paper came totally out of left field for me; I wonder how many other people were as shocked and fascinated by it as I was? The authors break the news, which they claim has been known for years, but somewhat concealed, that disk drives have tremendous complexity and variability in their behavior:

every disk has, by design, unique performance characteristics individually determined according to the capabilities of its physical components; for a given system setup and workload, and for the same corresponding physical regions across disks, some disks are slower, some disks are faster, and no two disks are alike.

[...]

the root cause is manufacturing variations, especially of the disk head electronics, that were previously masked and are now being exploited. Like CPUs that are binned by clock frequency, different disk heads can store and read data at different maximum linear densities. Instead of only using each head at pre-specified densities, wasting the extra capabilities of most, manufacturers now configure per-head zone arrangements, running each head as densely as possible. We refer to this approach as adaptive zoning.

The authors describe their work to explore these behaviors, by measuring different devices in minute detail, using several benchmarks of clever design, and then conclude by pointing out half-a-dozen ways in which this basic incorrect assumption about disk drive behaviors calls for a re-thinking of many basic algorithms and data-handling techniques.
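Here is a toy model of the difference between fixed and adaptive zoning, as I understand the idea. It is my own sketch, not the authors' code: the density numbers are invented, the units are arbitrary, and the function names are hypothetical. It assumes only that, at a fixed rotational speed, a head's sequential throughput is proportional to the linear density it runs at.

```python
def fixed_zoning_capacity(head_max_density, spec_density):
    """Every head runs at the same pre-specified density.

    Any capability a head has beyond spec_density is simply wasted.
    """
    if any(d < spec_density for d in head_max_density):
        raise ValueError("a head cannot meet the pre-specified density")
    return spec_density * len(head_max_density)


def adaptive_zoning_capacity(head_max_density):
    """Each head runs as densely as it individually can."""
    return sum(head_max_density)


def relative_throughput(head_max_density, spec_density):
    """Per-head throughput under adaptive zoning, relative to the fixed spec."""
    return [d / spec_density for d in head_max_density]


# Two hypothetical disks of the same model with different head populations.
disk_a = [100, 110, 125, 105]
disk_b = [120, 100, 100, 115]
```

With these made-up numbers, both disks store the same amount under fixed zoning, but under adaptive zoning they differ in capacity, and the same head position on each disk delivers a different throughput, which is the "no two are alike" effect.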

I've only touched on the richness of this conference: there are some 35 different papers, covering a broad range of systems software topics, and most of them are very high quality and well worth your time. It's exciting to see that the systems software research community remains alive and active; there was widespread concern a few years ago that this field was dying out, but conferences such as these make it clear that there is still a long way to go and lots of fascinating work to be done.

Friday, May 20, 2011

I mean, come on, isn't there something more useful and productive that the world could be spending its resources on than this?

These statements would seem to admit the obvious: that an app store is a store that sells apps. Apple, however, argues the opposite. "Apple denies that, based on their common meaning, the words 'app store' together denote a store for apps," the filing reads.

“Engineers are worth half a million to one million,” said Vaughan Smith, Facebook’s director of corporate development, who has helped negotiate many of the 20 or so talent acquisitions made by Facebook in the last four years. The money — in the form of stock — is often distributed among the start-up’s founders, employees and investors. The acquired employees also get a rich salary and often more stock options, which makes this a good time for entrepreneurial engineers.

The article primarily focuses on the tension between the two basic ways of acquiring talent for companies such as Google, Microsoft, Facebook, etc:

The company can individually recruit employees directly

The company can acquire other companies, including in those acquisition calculations the fact that it is acquiring the target as much for its people as for its assets (products, customer base, physical assets, etc.)

Once a company reaches a certain size, individually recruiting talent can seem slow and ineffective, and it can be tempting to try to grow by acquisition. It's an old story (at least, old by Silicon Valley standards).

Meanwhile, reacting in part to a blog post from Pablo Villalba of Teambox, there's an interesting thread on Google's hiring process over at Hacker News, including this observation by Chuck McManis of Blekko. As Nick Carlson points out on Business Insider, one of the basic truths about Google's hiring process is that it emphasizes a quantifiable approach to recruiting, with lots of Google "algorithms" to rate, rank, and measure candidates. This can open gaps in your hiring, causing you to miss candidates who are hard to quantify, as well as to miss skills that your measurements aren't measuring.

As they say, when you measure, you get what you measure. But if you don't measure, how do you know? It's an eternal struggle.

I went through the Google process myself, about 7 years ago. At the time, I was very involved with my family and kids, and decided Google was not right for me (I live rather far from Google HQ). I'm told that Google has changed its policies since then, and it's easier for Googlers to balance their work life with their family life. How to hire people is a very interesting problem, and I think Google has clearly done a much better job of it than most high tech companies that I know of; the Google talent base is astounding. Still, it's always interesting to examine that process and understand its implications.

I guess I'm a lucky picker! It's Gelfand v Grischuk for the final. Grischuk has sure been playing well of late, and I think he'd be a very exciting player to contest the World Championship. But I'll go with the old guy, and pick Gelfand to take the final match. Here's a fun picture of the two of them onstage after the second round completed.

I had a ThinkPad for 3 years; it was a fabulous computer. Unfortunately it was a company-provided machine, and I had to give it back when I switched jobs (sadly, they probably threw it in a pile and recycled it; that's the way these things usually go).

Cory Doctorow has written a nice essay commemorating the 5th year of running ThinkPads with various Ubuntu releases on them.

My ThinkPad switch was inspired by a desire to try out the Ubuntu flavour of GNU/Linux, which I'd heard great things about. So I downloaded the latest version of Ubuntu – Canonical, the company that oversees Ubuntu, does two releases per year – burned it to a CD and stuck it in the computer, and, a few minutes later, I was up and running. At the time, I promised to document my joys and frustrations with GNU/Linux, but a few months later, once I'd been soaking in the OS for a while, I went back over my notes and discovered that there was practically nothing to report on that score.

My own Ubuntu experience has been very similar. I'm on my third year of running Ubuntu on various machines, and it's been vastly easier and more straightforward than I had expected. I'm not such a big fan of the new "Unity" window manager, so I still use "Ubuntu Classic" but that's just a little whine about what has been, overall, a very successful experience.

Saturday, May 14, 2011

Donna and I decided to take the puppy on a walk, so we climbed into the car and drove out to Morgan Territory. Morgan Territory is about 45 minutes east of Oakland. You can reach it either from Clayton or from Livermore, depending on what is more convenient for you. The trip in from Livermore is dramatically faster; the road from Clayton is longer, but you can make a round trip of it if you like. The road in is a narrow, one-lane road which can make the driving a bit more exciting than usual; get an early start, don't rush, and enjoy the beautiful road!

Morgan Territory itself is one of the largest and least crowded of the East Bay Regional Parks. Usually we have it almost to ourselves, but today was not to be one of those days. We arrived at the parking lot to find it, surprisingly, almost full, with a sign out front: Roger Epperson Dedication. Apparently the park district was having a special event to honor long-time superintendent Roger Epperson by naming "Roger's Ridge" in his honor. Happily, a friendly ranger waved us to an available spot in the parking lot, right at the front by the entrance gate!

The weather today was most unusual: a major spring storm was headed into the Bay Area, the temperature was barely 50 degrees, and the wind was gale force. We bundled up as best we could and set out on our hike. Thankfully, once we cleared the first ridge, we spent most of our hike wandering in and among the canyons, and the wind was not so bad. It was rather amusing, actually: the last time we were there it was in late June a few years back, and the temperature was in the mid-90's, and I said to Donna that we must remember to come in early May, when it would probably be only 85 degrees. How wrong I was today!

Morgan Territory is a dog's heaven. Although the skies were gray, the park indeed looks just like this picture. Penny spent a glorious 90 minutes trotting around from bush to rock to tree, sniffing all the good sniffs, chasing squirrels, and generally behaving, well, like a dog. Although we could have done without the howling winds and the chilly air, we both agreed that it was pretty close to a perfect walk in the park.

(Well, OK, I'm not sure those last folks were actually at the conference...)

From a big-picture point of view, it seems like the biggest questions remain:

How will the Chrome-vs-Android strategy play out? Will one or the other turn out to dominate? Or will both platforms find broad user bases?

What happens with Android and Java as the Oracle lawsuit plays out?

Will Google try to compete directly with Amazon, Microsoft, EMC, etc. in the cloud? Or will they continue on their own very Google-like path?

At a more detailed, technical level, the announcement that I found most interesting was that the Go programming language is being offered on Google App Engine. Go is obviously developing very rapidly. I've been paying attention to it when I can, but I haven't really gotten serious about it; perhaps it's time to start digging in and doing more than just watching the language from the sidelines.

Thursday, May 12, 2011

I'm with Vinay Bhat (well, I have far less of a chance to ever be a second...): Gelfand will defeat Kamsky and Grischuk will defeat Kramnik. Here's the schedule of games (I think the next games start in about 20 hours...?)

Wednesday, May 11, 2011

As quoted in El Reg, Google's Andy Rubin gave a press briefing today at Google I/O about the upcoming Android operating system releases, and said:

the Android project is "light on community and heavy on open source". He said it was not an option to develop Android completely in the open because it becomes difficult to tell when the platform is ready for release.

"Open source is different than a community-driven project," Rubin said. "We're building a platform. We're not building an app. When you're building a platform, you evolve APIs, you add APIs, you deprecate APIs. We're always adding new functionality ... so when you add new APIs, typically, in my opinion, community processes don't work. It's really hard to tell when you're done. It's hard to tell what's a release and what's a beta. When you're dealing with a platform, that just doesn't work, because all these [app] developers have an expectation that all those APIs are done and completed on a certain date.

Lance Ulanoff of PC Magazine was the one who asked the actual question, apparently; here's his reporting of Rubin's answer. I'm glad he asked the question, and I'm glad Rubin was frank with his answer. I don't know what the right answer is, but I think it's very interesting to compare what Rubin said with this viewpoint from the ASF website.

Meanwhile, as I said to a friend a few days ago, have you noticed that the word of the day in the computer industry seems to be "ecosystem"?

I think Ford is a great writer, if somewhat unusual. It's well worth digging through his archive and paying attention to what he writes. Sometimes one of his pieces just entirely passes me by, but usually I find lots to think about in his work.

Monday, May 9, 2011

Over at Real World Technologies, David Kanter has a nice short writeup of some background information about Intel's new Tri-Gate Transistors, including some analysis of the possible implications for the near-term.

Kanter observes that it is still early days with this new technology, and that right now it holds a lot of promise, but not many actual results:

The tri-gate transistors are a tremendous breakthrough in performance and the 22nm process also improves density by the traditional 2X. But it is important to realize that the performance gains Intel is citing are not simultaneous. Transistors will not get 37% faster, 50% more efficient and reduce leakage by 10X all at the same time; nor will entire chips see the same gains as the individual transistor level. Intel’s circuit designers will have to pick and choose how to use the newfound advantages throughout each chip to achieve the best overall results, given the product.

Kanter notes that Intel is making tremendous investments in this area, and suggests a possible strategy behind that work:

Intel is planning to build or upgrade five 22nm fabs in Oregon, Arizona and Israel throughout 2011 and 2012. Presently, they have four 32nm fabs in production. The increase in capacity is clearly intended for expansion into the embedded and mobile markets. While smartphones may be an uncertain proposition, Intel is doing well in other embedded areas and looking to continue that growth. This creates risk, should their plans for the handset market go awry, but in the case of owning a fab – there is no opportunity without risk.

Go go hardware guys! Keep making the hardware better, because we software types still have lots more code that we are writing!

Saturday, May 7, 2011

For some time now, I've been needing to set up a new machine to do Derby development on. My current machine (a 6 year old laptop) is still running, but it's no longer powerful enough to use as a Derby development machine.

Meanwhile, I've been thinking that I really need to learn more about the cloud. I understand a lot of the principles and techniques, but I need to get more hands-on.

So, I'm going to try to set up an Amazon EC2 machine to use for Derby development. Has anybody done this before? Anything special I need to know? My plan is to set up the smallest possible Linux VM on EC2, then learn how to sign on and access it, then set up the necessary software to be able to build Derby and run the tests (Mostly that just means: Subversion, Java, JUnit, and Ant -- pretty simple).

More on this as it develops; let me know if you see any roadblocks standing in my way...

Update: So far, so good. Herewith, a few notes about stuff that was useful along the way:

The built-in Amazon VM has the 1.6 JRE, not the JDK. Since I wanted 1.5 anyway, I was off to the Sun website where they still make a JDK 1.5 download that installs on Linux. Thank goodness.

Ant was easy -- just grabbed the 1.7.1 binary from ant.apache.org

I needed subversion. This was a bit trickier, but I eventually figured out that sudo yum install subversion did the trick.

Derby still uses the 3.8.2 version of junit.jar, which happily I found without too much trouble at SourceForge.

In the end, it took me only about 2 hours to get far enough with EC2 to create a virtual machine, connect to it, load it up with Java software and with Subversion, download the Derby source code, and build it.
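For anyone repeating this setup, a small check like the one below can confirm that the pieces listed above ended up on the PATH of the new machine. It is only a sketch: the tool names in REQUIRED_TOOLS are my assumption about what the commands are called on the VM, and it does not check versions (JDK 1.5, Ant 1.7.1) or the location of junit.jar.

```python
import shutil

# Assumed command names for the tools mentioned in the notes above.
REQUIRED_TOOLS = ["svn", "java", "javac", "ant"]


def missing_tools(tools=REQUIRED_TOOLS):
    """Return the entries of `tools` that cannot be found on the PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("Still need to install:", ", ".join(missing))
    else:
        print("Toolchain looks complete.")
```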

Tomorrow, when my head is fresh, I might even try running the Derby test suite!

Over on BoingBoing, Maggie Koerth-Baker has a great interview with Mike Purcell of the Woods Hole team. Even after all the years of effort trying to narrow down where to look, the team still had a massive area to search with their robotic submarine, and searching the ocean floor with a robotic submarine is very time-consuming and detail-oriented work:

There's no data transmission back to the ship while the vehicles are in the water. We get status messages—acoustic messages that come in periodically and tell us how deep it is, latitude and longitude, just a status check to tell us whether there's problems. When the AUV gets back, it takes 45 minutes to download the data and then another half hour to process and get a good look at it. During that time the other team is switching out the battery and getting the vehicle ready to go back in. The ideal is that a vehicle is only out of the water for three hours, while somebody is looking at the data to decide where we go from here and are there things we want to look at again. When we're running three vehicles we get a data dump three times a day.

I love the picture that shows what "the find" actually looked like to the research team: just funny different-colored squiggles on a computer screen. As Purcell delicately notes, "It's good to have an experienced person looking at this stuff."

Big time congratulations to the team(s) for their breakthrough; I'm eagerly anticipating what we'll learn from the material they've been able to bring up to the surface.

Thursday, May 5, 2011

This month, FIDE are holding the candidates matches, the next step in the process of choosing the next challenger to contend for the World Chess Championship against Vishy Anand. Knowing nothing, I'll go out on a limb and predict: Kamsky, Aronian, Kramnik, and Gelfand as the four winners. What's your pick?

Tuesday, May 3, 2011

Generally, in computer science as in other scientific fields, we study success. We gather the best known algorithms for solving certain problems, study them, consider ways to improve them, and publish and share the best ones we can find.

However, it is just as essential, perhaps even more so, to study failure. Cryptographers analyze security breaks, to understand where the reasoning contained flaws. Wise engineers know that when you find one bug, it's worth looking around for other similar problems in the code. We do the best that we can in our designs and implementations, but, as Richard Feynman said in his report on the explosion of the space shuttle Challenger:

For a successful technology, reality must take precedence over public relations, for nature cannot be fooled.

These past weeks have provided a wonderful new opportunity for the studying of failure, in the form of the recent Amazon Web Services Elastic Block Storage re-mirroring outage. Even more interestingly, a number of organizations have published their own analyses of their own mistakes, errors, and mis-steps in using AWS EBS to build their applications.

Here's a brief selection of some of the more interesting post-mortems that I've seen so far:

The basic cause of the outage was that a configuration change was made incorrectly, routing a large amount of traffic from a primary network path to a secondary one, which couldn't handle the load.

This caused nodes to lose connection to their replicas, and to initiate re-mirroring.

In order for the re-mirroring to succeed, a significant amount of extra disk storage had to be provisioned into the data center.

Over-provisioning resources in the data center could have prevented some of the failures: "We now understand the amount of capacity needed for large recovery events."

Keeping enough of the system running in order to work on the failing parts is very challenging: "Our initial attempts to bring API access online to the impacted Availability Zone centered on throttling the state propagation to avoid overwhelming the EBS control plane. ... We rapidly developed throttles that turned out to be too coarse-grained to permit the right requests to pass through and stabilize the system. Through the evening of April 22nd into the morning of April 23rd, we worked on developing finer-grain throttles"
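To make the coarse-versus-fine distinction concrete, here is a minimal sketch of a throttle that keeps a separate limit for each class of request, so that cheap, stabilizing requests can be let through while expensive ones stay tightly limited. It is not Amazon's implementation: the token-bucket technique, the class names, and the limits are all my own assumptions for illustration.

```python
import time


class TokenBucket:
    """Allow roughly `rate` requests per second, with bursts up to `burst`."""

    def __init__(self, rate, burst, clock=time.monotonic):
        self.rate = rate
        self.burst = burst
        self.clock = clock
        self.tokens = float(burst)
        self.last = clock()

    def allow(self):
        now = self.clock()
        # Refill in proportion to elapsed time, capped at the burst size.
        self.tokens = min(self.burst, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


class PerClassThrottle:
    """One bucket per request class; unknown classes are rejected."""

    def __init__(self, limits, clock=time.monotonic):
        # `limits` maps a (hypothetical) class name to a (rate, burst) pair.
        self.buckets = {
            name: TokenBucket(rate, burst, clock)
            for name, (rate, burst) in limits.items()
        }

    def allow(self, request_class):
        bucket = self.buckets.get(request_class)
        return bucket is not None and bucket.allow()


if __name__ == "__main__":
    throttle = PerClassThrottle({
        "describe": (10.0, 2),       # cheap read-only calls: generous limit
        "create_volume": (1.0, 1),   # expensive calls: tight limit
    })
    for name in ["describe", "describe", "create_volume", "create_volume"]:
        print(name, throttle.allow(name))
```

A single global bucket would be the coarse-grained version: it cannot tell the two classes apart, so it either starves the cheap requests or admits too many of the expensive ones.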

The Netflix team also talked about the benefits of over-provisioning capacity. They also described the complexity of trying to perform massive online reconfiguration: "While we have tools to change individual aspects of our AWS deployment and configuration they are not currently designed to enact wholesale changes, such as moving sets of services out of a zone completely." To simulate these sorts of situations for testing, Netflix are considering replacing their "Chaos Monkey" with a "Chaos Gorilla"!

The Conversations Network team (a large podcasting site) described an interesting rule of thumb for determining when to initiate manual disaster recovery schemes: "Once the length of the outage exceeds the age of the backups, it makes more sense to switch to the backups. If the backups are six hours old, then after six hours of downtime, it makes sense to restart from backups." However, they also commented that they had overlooked the need to test their backups, as it turned out that one of their crucial data sets was not being backed up on the same schedule as others.
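The rule of thumb is simple enough to write down directly. The sketch below is just the quoted comparison expressed as a function; the function name and the use of timedelta values are my own choices, not anything from the Conversations Network write-up.

```python
from datetime import timedelta


def should_switch_to_backups(outage_duration, backup_age):
    """Return True once the outage has lasted longer than the backups are old.

    Both arguments are timedelta values. At that point, waiting any longer
    costs more than the work that would be lost by restoring.
    """
    return outage_duration > backup_age


if __name__ == "__main__":
    backup_age = timedelta(hours=6)
    for hours in (2, 7):
        outage = timedelta(hours=hours)
        print(hours, "hours down:", should_switch_to_backups(outage, backup_age))
```

The comparison is only as good as the backup age fed into it, which is where the team's second lesson comes in: a data set on a slower backup schedule has an older backup than the rest.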

Bryan Cantrill of Joyent talked about the danger of adopting a cloud strategy that leads to "the concentration of load and risk in a single unit (even one that is putatively highly available)."

The Heroku team pointed out that not all Amazon Web Services are the same, particularly when it comes to availability: "EC2, S3, and other AWS services have grown much more stable, reliable, and performant over the four years we've been using them. EBS, unfortunately, has not improved much, and in fact has possibly gotten worse. Amazon employs some of the best infrastructure engineers in the world: if they can't make it work, then probably no one can."

The SimpleGeo team pointed out one of the reasons that over-provisioning is crucial: "When EBS service degradation occurred in one AZ, I suspect that customers started provisioning replacement volumes elsewhere via the AWS API. This is largely unavoidable. The only way to address this issue is through careful capacity planning -- over-provisioning isolated infrastructure so that it can absorb capacity from sub-components that might fail together. This is precisely what we do, and it's one of the reasons we love Amazon. AWS has reduced the lead time for provisioning infrastructure from weeks to minutes while simultaneously reducing costs to the point where maintaining slack capacity is feasible."
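A bit of arithmetic shows how much slack this kind of planning implies. The sketch below is my own simplification, not SimpleGeo's method: it assumes load is spread evenly across zones and that the surviving zones must absorb everything when some zones fail together.

```python
def capacity_per_zone(total_load, zones, zones_lost=1):
    """Capacity each zone needs so the survivors can carry the full load."""
    surviving = zones - zones_lost
    if surviving <= 0:
        raise ValueError("at least one zone must survive")
    return total_load / surviving


def slack_fraction(zones, zones_lost=1):
    """Extra capacity per zone, as a fraction of its normal (no-failure) share."""
    surviving = zones - zones_lost
    if surviving <= 0:
        raise ValueError("at least one zone must survive")
    return zones / surviving - 1


if __name__ == "__main__":
    # Example with made-up numbers: 300 units of load over 3 zones.
    print(capacity_per_zone(300, 3))  # capacity needed per zone
    print(slack_fraction(3))          # fraction of extra capacity per zone
```

With three zones, each zone normally carries a third of the load but has to be sized for half of it, which is 50% slack. Adding zones reduces the fraction of slack needed per zone, which is part of why cheap, quickly provisioned capacity makes the approach feasible.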

At my day job I've been spending a lot of time recently thinking about how to build reliable, dependable, scalable distributed systems. It's a big problem, and one that takes years to address.

Building distributed systems is extremely hard; building reliable high-performing highly-available systems is harder still. There is still much to learn, so Amazon are to be commended, praised, and thanked for their openness and their timely release of detailed information about the failure, which is greatly appreciated by us all.

Monday, May 2, 2011

I'm cautiously optimistic that the problems I was having with graphics crashes under Ubuntu 11.04 on my Dell Latitude 610 were caused by this bug. Ubuntu pushed the fix via Update Manager and I installed it today, so we should soon be able to tell if the system is more stable.