I've only had this new Mac for 1 week, and Software Update just told me I have 1 GB of updates to apply!

Apple's Software Update download servers seem kind of slow... will it really take 5 hours to download these updates?

Meanwhile, how do I configure Software Update so that it grabs the "important" updates automatically? Windows and Ubuntu seem to be pretty good about getting the most important updates in a timely fashion; can I do that for Mac OS X software updates too?

Tuesday, March 30, 2010

Russ Cox has posted the third of his detailed articles on the RE2 regular expression engine behind Code Search. The RE2 project is open source, and this third paper goes into considerable detail about the implementation of the engine.

Like most people who have thought about how to use Regular Expressions effectively, I am of two minds about the regex technology. It can be extremely powerful, but it can also be overused.

Still, as an abstract programming exercise, regular expressions are a fascinating sub-field, and I've been interested in them ever since learning about automata theory in school thirty years ago.

As I read through Cox's paper, I found these observations to be the most compelling:

Simplify the problem to simplify the implementation: Cox's engine puts a lot of effort into simplifying complex regular expressions:

One thing that surprised me is the variety of ways that real users write the same regular expression. For example, it is common to see singleton character classes used instead of escaping -- [.] instead of \. -- or alternations instead of character classes -- a|b|c|d instead of [a-d]. The parser takes special care to use the most efficient form for these, so that [.] is still a single literal character and a|b|c|d is still a character class. It applies these simplifications during parsing, rather than in a second pass, to avoid a larger-than-necessary intermediate memory footprint.
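As a toy illustration of those two simplifications (my own sketch; a real engine applies these on the parse tree during parsing, whereas this just rewrites the pattern text):

```python
import re

def simplify(pattern):
    """Toy version of two of the simplifications described above:
    singleton character classes become escaped literals, and
    alternations of single literals become character classes."""
    # Singleton character class -> escaped literal: [.] becomes \.
    pattern = re.sub(r'\[(.)\]', lambda m: re.escape(m.group(1)), pattern)
    # Alternation of single literals -> character class: a|b|c|d becomes [abcd]
    if re.fullmatch(r'\w(\|\w)+', pattern):
        pattern = '[' + pattern.replace('|', '') + ']'
    return pattern
```

For example, `simplify('[.]')` yields the literal `\.`, and `simplify('a|b|c|d')` yields `[abcd]`.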

Recognize when one sub-problem can be solved using techniques you've already developed for other sub-problems. Regular Expression processing is full of situations where the larger problems can be decomposed into smaller problems, and Cox identifies dozens of these. I particularly enjoyed this one:

Run the DFA backward to find the start. When studying regular expressions in a theory of computation class, a standard exercise is to prove that if you take a regular expression and reverse all the concatenations (e.g., [Gg]oo+gle becomes elgo+o[Gg]) then you end up with a regular expression that matches the reversal of any string that the original matched. In such classes, not many of the exercises involving regular expressions and automata have practical value, but this one does! DFAs only report where a match ends, but if we run the DFA backward over the text, what the DFA sees as the end of the match will actually be the beginning. Because we're reversing the input, we have to reverse the regular expression too, by reversing all the concatenations during compilation.
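The reversal property is easy to demonstrate with any regex engine (a quick Python check, using the pattern from the quote):

```python
import re

# Reversing the concatenations in a pattern yields a pattern that
# matches the reversal of any string the original matched.
forward  = re.compile(r'[Gg]oo+gle')
backward = re.compile(r'elgo+o[Gg]')

for s in ['Google', 'gooogle']:
    assert forward.fullmatch(s)
    assert backward.fullmatch(s[::-1])    # the reversed pattern matches
                                          # the reversed string
```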

When testing is a challenge, be smart about how you test. Cox tested his engine by piggy-backing on the testing that has already occurred on several other engines:

How do we know that the RE2 code is correct? Testing a regular expression implementation is a complex undertaking, especially when the implementation has as many different code paths as RE2. Other libraries, like Boost, PCRE, and Perl, have built up large, manually maintained test suites over time. ... Given a list of small regular expressions and operators, the RegexpGenerator class generates all possible expressions using those operators up to a given size. Then the StringGenerator generates all possible strings over a given alphabet up to a given size. Then, for every regular expression and every input string, the RE2 tests check that the outputs of the four different regular expression engines agree with each other.
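The generate-and-compare loop is easy to miniaturize (my own toy sketch: the pattern set and alphabet are invented, and where RE2's tests compare four independent engines, this only pits Python's re against a trivially equivalent recompilation of itself, just to show the shape of the loop):

```python
import itertools
import re

# Exhaustively cross-check engines over all small inputs.
patterns = ['a', 'b', 'a*', 'ab', 'a|b', '(a)b*']
alphabet = 'ab'
checks = 0

for pat in patterns:
    engine_a = re.compile(pat)
    engine_b = re.compile('(?:%s)' % pat)    # stand-in for a second engine
    for n in range(4):                       # all strings of length 0..3
        for chars in itertools.product(alphabet, repeat=n):
            s = ''.join(chars)
            assert bool(engine_a.fullmatch(s)) == bool(engine_b.fullmatch(s))
            checks += 1
```

Even this tiny version performs 90 comparisons; RE2's real suite scales the same idea up to millions.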

Learn from those that have gone before you, and study alternate implementations. Cox goes out of his way to credit the work that has occurred over the years in this field. For example:

Philip Hazel's PCRE is an astonishing piece of code. Trying to match most of PCRE's regular expression features has pushed RE2 much farther than it would otherwise have gone, and PCRE has served as an excellent implementation against which to test.

I strongly recommend this series of papers. Cox's articles are clear and well-written, with good examples and great explanations, and they are filled with all the references and citations you might want should you choose to pursue this topic further.

Now, given the particular software that I work on, my use of Mac OS X so far has mostly involved learning enough about Spaces and Exposé so that I can manage the 30-or-so terminal windows that I tend to have open; I'm a bash-and-vim guy at the end of the day, after all.

And most of my minute-by-minute work tends to be edit-compile-test cycles, so I'm mostly intimate with GCC and JAM for those operations.

Thursday, March 25, 2010

It's a wonderful team, and a great product, and I'm really glad to have this opportunity.

Change is constant, as they say.

I don't expect to be blogging about work directly; I never do that, and, besides, Perforce already have a wonderful blog you can read.

But my blog interests are certainly driven by my work interests, so you'll probably notice that the sorts of things I'm interested in start to drift a bit closer to the world of software configuration management.

Tuesday, March 23, 2010

Magnus Carlsen is currently the number-one-ranked chess player. Here's a fun interview with him, reprinted from the German magazine Der Spiegel.

He talks about the first moment when he remembered he wanted to become good at chess:

SPIEGEL: There was no crucial experience?

Carlsen: I saw Ellen, my sister, playing. I think I wanted to beat her at it.

SPIEGEL: And?

Carlsen: After the game she didn’t touch a board again for four years.

He talks about the impact of computers as training and education tools:

Carlsen: I was eleven or twelve. I used the computer to prepare for tournaments, and I played on the Internet. Nowadays, children start using a computer at an even earlier age; they are already learning the rules on screen. In that sense I am already old-fashioned. Technological progress leads to younger and younger top players, everywhere in the world.

And, of course, he's probably the only person on the planet who can say this about Garry Kasparov:

Carlsen: No. In terms of our playing skills we are not that far apart. There are many things I am better at than he is. And vice versa. Kasparov can calculate more alternatives, whereas my intuition is better. I immediately know how to rate a situation and what plan is necessary. I am clearly superior to him in that respect.

Monday, March 22, 2010

The Fields Medal is the top honor in mathematics. It is as prestigious as the Nobel Prize; there is no Nobel Prize for mathematics. Here is a list of the winners of the Fields Medal.

I was intrigued by this short story in the weekend NYTimes about Dr. Grigori Perelman.

Dr. Perelman was awarded the Fields Medal in 2006, but declined to accept the award.

Now he has been named the winner of the Clay Prize; he is the first person to be awarded one since the prizes were established a decade ago. Dr. Perelman's Clay Prize, like the Fields Medal before it, is for his work in solving the Poincaré Conjecture.

Although I've been a committed Ubuntu user for several years now, I haven't been paying as much attention to Canonical the company, so I totally missed Mark Shuttleworth's announcement last December that he is substantially changing his role at Canonical. It certainly sounds like this is one of those cases where Canonical the company had grown enough that Shuttleworth wasn't the right person to lead it anymore, and hence the re-arrangement.

Now, this week, comes a variety of news stories talking about the impact of that change, and of the follow-on change in which Matt Asay joins Canonical as the new COO, taking over the role that Jane Silber held before she moved up to take Shuttleworth's CEO spot.

As I said, I haven't paid a lot of attention to Canonical-the-company, but have been a thoroughly happy Ubuntu user, so hopefully this is all good news for Canonical as they attempt to expand their success in providing Linux-for-the-masses.

Sounds like it's a bit of a change for Asay, though:

I will remain living in Utah but will commute regularly to London, where most of the Canonical executives are based. (I will be starting my days at 4 a.m. Mountain Time to try to overlap as much as possible with U.K. working hours. Ouch.)

In the modern world, all software companies are global, and we are all learning to work together across the planet. Good luck, Matt!

Or you can try to harvest the wisdom of crowds, and use the law of large numbers at a site like ESPN's National Bracket. Wow! The country thinks Cal is just going to get stuffed! If you do this, note these good bits of advice about how you might get a good result, just not good enough.

Or you can try to work your way through the bracket carefully, using lots of heuristics like who's the better coach, or who has a better back court, or who excels at the 3-point game.

The National Security Cutter, for example, will enable the Coast Guard to screen and target vessels faster, more safely and reliably before they arrive in U.S. waters – to include conducting onboard verification through boardings and, if necessary, taking enforcement-control actions.

In addition to its well-known search-and-rescue and anti-terrorist missions, the Coast Guard is involved in issues of customs, immigration, and drug enforcement, and the new cutters will be part of all these efforts.

Tuesday, March 16, 2010

The February 27th, 2010 issue of the Economist has an interesting special section:

Data, data everywhere: A special report on managing information

The special section contains about ten articles, looking at the subject from various angles. From the first article:

The world contains an unimaginably vast amount of digital information which is getting ever vaster ever more rapidly. This makes it possible to do many things that previously could not be done: spot business trends, prevent diseases, combat crime and so on. Managed well, the data can be used to unlock new sources of economic value, provide fresh insights into science and hold governments to account.

But they are also creating a host of new problems.

The companies and organizations which are working with this "big data" are the obvious ones: Google, IBM, Microsoft, Oracle, etc. Here's an example:

Wal-Mart's inventory-management system, called RetailLink, enables suppliers to see the exact number of their products on every shelf of every store at that precise moment. The system shows the rate of sales by the hour, by the day, over the past year, and more. Begun in the 1990s, RetailLink gives suppliers a complete overview of when and how their products are selling, and with what other products in the shopping cart. This lets suppliers manage their stocks better.

The article gives a clever use of such data mining:

In 2004 Wal-Mart peered into its mammoth databases and noticed that before a hurricane struck, there was a run on flashlights and batteries, as might be expected; but also on Pop-Tarts, a sugary American breakfast snack. On reflection it is clear that the snack would be a handy thing to eat in a black-out, but the retailer would not have thought to stock up on it before a storm.

The articles touch on many of the problems in dealing with big data:

Size/cost/time to process such enormous amounts of data

False conclusions, confusing interpretation issues

Security and privacy concerns

Ownership issues

Fair access to data

Environmental issues

You might be surprised to see "Environmental issues" in the list.

Another concern is energy consumption. Processing huge amounts of data takes a lot of power. "In two to three years we will saturate the electric cables running into the building," says Alex Szalay at Johns Hopkins University. "The next challenge is how to do the same things as today, but with ten to 100 times less power."

Both Google and Microsoft have had to put some of their huge data centers next to hydroelectric plants to ensure access to enough energy at a reasonable price.

The articles also talk about many of the fascinating areas of technology being driven/created by these new big-data efforts:

Cloud computing

Query processing

Statistical algorithms, such as

collaborative filtering

statistical spelling correction

statistical translation

Bayesian spam filtering

Predictive analytics

Storage and networking advancements

Data visualization

Flash trading

It's a long list, and a lot of exciting areas.

As is often true with Economist special reports, the writing is fairly dry, and the presentation tends to provide a fairly high-level overview of a lot of related areas, without providing much in the way of resources for digging into those areas more deeply.

I’ve been working with a bunch of astronomers lately and we need to send around huge databases. I started writing my databases to disk and mailing the disks. At first, I was extremely cautious because everybody said I couldn’t do that—that the disks are too fragile. I started out by putting the disks in foam. After mailing about 20 of them, I tried just putting them in bubble wrap in a FedEx envelope. Well, so far so good.

The last time I looked into web server architecture was six or seven years ago, when someone pointed me at Dan Kegel's C10K paper. I'm pleased to see that he tried for a while to keep that paper up to date, although it looks like it's been a few years since it was updated.

Recently, I followed some links and ended up at Matt Welsh's SEDA work. Welsh appears to have moved on to other projects, but the SEDA work remains live at SourceForge.

I read through the SOSP 2001 paper and enjoyed it; it's quite readable and well-written, and explains the basic principles clearly.

The primary focus of SEDA is on helping software developers build what Welsh calls "well-conditioned services":

The key property of a well-conditioned service is graceful degradation: as offered load exceeds capacity, the service maintains high throughput with a linear response-time penalty that impacts all clients equally, or at least predictably according to some service-specific policy. Note that this is not the typical Web experience; rather, as load increases, throughput decreases and response time increases dramatically, creating the impression that the service has crashed.

Modern network servers tend to use one of two basic programming techniques: thread-based concurrency, or event-driven programming. Welsh compares and contrasts the two models:

The most commonly used design for server applications is the thread-per-request model. ... In this model, each accepted request consumes a thread to process it, with synchronization operations protecting shared resources. The operating system overlaps computation and I/O by transparently switching among threads. ... When each request is handled by a single thread, it is difficult to identify internal performance bottlenecks in order to perform tuning and load conditioning.

...

The scalability limits of threads have led many developers to eschew them almost entirely and employ an event-driven approach to managing concurrency. In this approach, a server consists of a small number of threads (typically one per CPU) that loop continuously, processing events of different types from a queue. ... Event-driven systems tend to be robust to load, with little degradation in throughput as offered load increases beyond saturation. ... The choice of an event scheduling algorithm is often tailored to the specific application, and introduction of new functionality may require the algorithm to be redesigned. Also, modularity is difficult to achieve, as the code implementing each state must be trusted not to block or consume a large number of resources that can stall the event-handling thread.
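The event-driven model Welsh describes can be sketched in a few lines (a minimal Python sketch of my own; the event types and handlers are invented for illustration):

```python
import queue

# A single loop pulls typed events off one queue and dispatches
# to per-type handlers -- the heart of an event-driven server.
results = []
handlers = {
    'read':  lambda payload: results.append('read:' + payload),
    'write': lambda payload: results.append('write:' + payload),
}

events = queue.Queue()
events.put(('read', 'req1'))
events.put(('write', 'resp1'))
events.put(None)                      # sentinel: stop the loop

while True:
    event = events.get()
    if event is None:
        break
    kind, payload = event
    handlers[kind](payload)           # handlers must never block; a
                                      # blocked handler stalls all events
```

The last comment is exactly the modularity problem Welsh identifies: every handler must be trusted not to stall the one event-handling thread.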

To overcome the problems of the two basic approaches, the SEDA work proposes a synthesis and unification:

The staged event-driven architecture (SEDA) ... decomposes an application into a network of stages separated by event queues and introduces the notion of dynamic resource controllers to allow applications to adjust dynamically to changing load.

...

A stage is a self-contained application component consisting of an event handler, an incoming event queue, and a thread pool. ... Each stage is managed by a controller that affects scheduling and thread allocation. ... Stage threads operate by pulling a batch of events off of the incoming event queue and invoking the application-supplied event handler. The event handler processes each batch of events, and dispatches zero or more events by enqueuing them on the event queues of other stages.
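A stage as described can be sketched compactly (a toy Python version of my own; the batch size, the two-stage example wiring, and all names are invented, and a real SEDA stage adds the dynamic resource controllers):

```python
import queue
import threading
import time

# A toy SEDA-style stage: an incoming event queue, a small thread pool,
# and a handler that can dispatch events to other stages' queues.
class Stage:
    def __init__(self, handler, threads=2, batch=4):
        self.handler = handler        # takes a batch, returns (stage, event) pairs
        self.queue = queue.Queue()
        self.batch = batch
        for _ in range(threads):
            threading.Thread(target=self._run, daemon=True).start()

    def enqueue(self, event):
        self.queue.put(event)

    def _run(self):
        while True:
            batch = [self.queue.get()]          # block for the first event...
            while len(batch) < self.batch:
                try:                            # ...then drain up to a full batch
                    batch.append(self.queue.get_nowait())
                except queue.Empty:
                    break
            for stage, event in self.handler(batch):
                stage.enqueue(event)            # hand off to downstream stages

# Two-stage pipeline: 'upper' transforms events, 'sink' collects them.
out = []
sink  = Stage(lambda batch: out.extend(batch) or [], threads=1)
upper = Stage(lambda batch: [(sink, e.upper()) for e in batch], threads=1)
upper.enqueue('hello')
time.sleep(0.2)                                 # let the pipeline drain
```

In the real architecture, a per-stage controller would be adjusting the thread count and batch size as load changes, rather than leaving them fixed as here.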

The paper goes on to work through many of the details, and to demonstrate a variety of different resource controller implementations, showing how the alternate controllers can provide elegant implementations of adaptive load shedding without forcing complexity into the system developer's core logic.

It's quite a nice presentation, and I liked the way it was all pulled together. I was intimately familiar with the basic techniques and approaches, but it was very helpful to see them pulled together and structured in this fashion.

If you're at all interested in how to build scalable and robust network services that perform reliably under extreme conditions, it's worth spending some time with this work.

Thursday, March 11, 2010

A good friend of mine is embarking on a new career at Google. Bravo! I think this will be a really exciting move for him; he's a special individual and I think he will do quite well. He's spent too much time working in organizations that wasted his talents, and it's past time for him to have an opportunity like this.

We happened to be talking the other night, and he was describing the process known as Allocation. As I understand it, it goes something like this:

You interview at Google. You do well, you like Google, they like you, you have the right stuff, and they make you an offer.

You accept.

This process is being repeated, simultaneously, with dozens of other individuals (Google are now quite a large organization). So you become part of a pool of talent who are all entering Google at the same time.

Every so often (monthly? bi-weekly?), Google convene the Allocation Review Board, comprised of some set of HR personnel, together with the hiring managers of various projects who have budget ("open reqs").

The Allocation Review Board sit and study the available talent in this pool. Each project has a certain amount of "points" to spend; they discuss things somewhat, then they each "bid" on the incoming fresh meat.

This information is then entered into a computer, possibly along with the results of a questionnaire you've completed about yourself. An algorithm then crunches the data, and performs the allocations.

Your precise position, including your title, job description, chosen team/manager, and initial project(s), are now known, and you are informed.

You show up for your new job!

When my friend described this process to me, I was aghast, horrified. It sounded paternalistic and demeaning; I am not a number! Is it just my age? Do all companies do this nowadays? Am I misunderstanding the process?

Has anyone been through this, and willing to share their experience? I'd love to learn more...

Thursday, March 4, 2010

Hackers who breached Google and other companies in January targeted source code management systems, security firm McAfee asserted Wednesday, manipulating a little-known trove of security flaws that would allow easy unauthorized access to the intellectual property it is meant to protect. The software management systems, widely used at businesses unaware that the holes exist, were exploited by the so-called Aurora hackers in a way that would have enabled them to siphon source code as well as modify it to make customers of the software vulnerable to attack.

As the McAfee researchers point out, your SCM system is the most important system in your entire software development process, and you must pay the utmost attention to administering it properly:

these were the crown jewels of most of these companies in many ways — much more valuable than any financial or personally identifiable data that they may have and spend so much time and effort protecting.

It is certainly true that some of the default security settings on SCM systems are weak, and allow too much access if left unchanged. However, administrators can definitely provide alternate settings.

An important principle of security protection is "defense in depth"; that is, providing multiple layers of defense so that if one is breached, the entire system does not fail. It sounds like in at least some cases, there was insufficient attention paid to administering security at all the various layers where it can be done.

Wednesday, March 3, 2010

Some of the most exciting work in programming language implementations is occurring in the JavaScript world. Mozilla's TraceMonkey project brought massive performance improvements to JavaScript starting with Firefox 3.5, using a technique called a tracing JIT.

So, a type-specializing trace JIT generates really fast code, but can’t generate native code for the situations described above. Conversely, an optionally specializing method JIT generates moderately fast code, but can generate native code for any JavaScript program. So the two techniques are complementary–we want the method JIT to provide good baseline performance, and then for code that traces well, tracing will push performance even higher.

It's interesting that much of this progress is able to occur because the various competing teams (Google, WebKit, Mozilla, etc.) are all doing their work out in the open:

We decided to import the assembler from Apple’s open-source Nitro JavaScript JIT. (Thanks, WebKit devs!) We know it’s simple and fast from looking at it before (I did measurements that showed it was very fast at compiling regular expressions), it’s open-source, and it’s well-designed C++, so it was a great fit.

This work is delightful for multiple reasons:

As a user, I can continue to anticipate better browsers, faster web pages, and a happier overall experience

As a programmer, I can continue to benefit from learning about all these different techniques

Tuesday, March 2, 2010

The world of extreme ultra-high-end scalable systems begins with Google, and their well-known technologies: BigTable, GFS, Map/Reduce, Protocol Buffers, and so forth. But that world certainly doesn't end with Google; it's a big world, and there's lots of fascinating work being done at places like Amazon, Flickr, Yahoo, and more.

I'm still wrapping my head around these eventually-consistent, non-relational data stores; after all, I'm a relational DBMS guy from a long time back. The Cassandra papers are quite approachable, and give a lot of fascinating insight into the behavior of these systems:

we will focus on the core distributed systems techniques used in Cassandra: partitioning, replication, membership, failure handling, and scaling. All these modules work in synchrony to handle read/write requests. Typically a read/write request for a key gets routed to any node in the Cassandra cluster. The node then determines the replicas for this particular key. For writes, the system routes the requests to the replicas and waits for a quorum of replicas to acknowledge the completion of the writes. For reads, based on the consistency guarantees required by the client, the system either routes the requests to the closest replica or routes the requests to all replicas and waits for a quorum of responses.
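The quorum write path described there reduces to a few lines (my own toy sketch; the node names, replica list, failure model, and send callback are all invented, and a real system must also handle hinted handoff, retries, and read repair):

```python
# Send the write to each replica for the key and succeed once a
# majority acknowledge -- the quorum rule from the Cassandra paper.
def quorum_write(key, value, replicas, send):
    """send(node, key, value) returns True if the node acknowledged."""
    needed = len(replicas) // 2 + 1       # a majority of the replicas
    acks = sum(1 for node in replicas if send(node, key, value))
    return acks >= needed

store = {}
def send(node, key, value):
    if node == 'n3':
        return False                      # simulate one failed replica
    store[(node, key)] = value
    return True

# With three replicas, two acknowledgements still form a quorum,
# so the write succeeds even though n3 is down.
ok = quorum_write('user:42', 'alice', ['n1', 'n2', 'n3'], send)
```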