Peter Norvig digs deeply into the game of Set, which I've never played, but enjoy reading about. In my group of gamers, there is one other besides myself who is this sort of gamer, and 4 who are not, so by majority rule, this is not our sort of game. :)

In my youth (1979), I was a Transocean employee. I went through rig training and worked for 3 months as a roustabout on various rigs. This rig didn't exist then, and I no longer recall what rigs I worked on. But it was a very exciting job to have, at the time. It took me only 3 months to realize that I was much better cut out for mathematics than for rig work :)

The first paper is almost 30 years old, and dates from Tichy's work at Purdue during the development of RCS. From the introduction:

The string-to-string correction problem is to find a minimal sequence of edit operations for changing a given string into another given string. The length of the edit sequence is a measure of the differences between the two strings.

At the time, the best-known diff algorithm was Doug McIlroy's Unix diff algorithm (more on that in a future post), which is based on the detection of the Longest Common Subsequence. As Tichy shows, the LCS-based algorithms, while computationally related to the edit sequence programs, are not necessarily the best for use in difference construction.

Tichy's basic algorithm is surprisingly simple to state:

Start at the left end of the target string T, and try to find prefixes of T in S. If no prefix of T occurs in S, remove the first symbol from T and start over. If there are prefixes, choose the longest one and record it as a block move. Then remove the matched prefix from T and try to match a longest prefix of the remaining tail of T, again starting at the beginning of S. This process continues until T is exhausted. The recorded block moves constitute a minimal covering set of block moves.
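The quoted procedure is short enough to sketch directly. Here's a simplified, hypothetical Python version; it searches S naively at every step, whereas a real implementation (and much of Tichy's paper) is about doing that longest-match search efficiently:

```python
def block_moves(S, T):
    """Greedy block-move cover of T using substrings of S: a sketch of
    Tichy's basic algorithm, with a naive longest-match search."""
    moves = []          # list of (start_in_S, start_in_T, length)
    t = 0
    while t < len(T):
        best_len, best_pos = 0, -1
        # Try every starting position in S; keep the longest match.
        for s in range(len(S)):
            k = 0
            while s + k < len(S) and t + k < len(T) and S[s + k] == T[t + k]:
                k += 1
            if k > best_len:
                best_len, best_pos = k, s
        if best_len == 0:
            t += 1          # T[t] never occurs in S; skip it
        else:
            moves.append((best_pos, t, best_len))
            t += best_len
    return moves

print(block_moves("abcdef", "defabc"))   # -> [(3, 0, 3), (0, 3, 3)]
```

Two block moves cover "defabc" completely: copy 3 symbols from position 3 of S, then 3 symbols from position 0.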

After working through a proof of the basic algorithm, Tichy briefly touches on two variations:

Program text and prose have the property of few repeated lines. ... To speed up comparisons, the program should use hashcodes for lines of text rather than performing character-by-character comparisons.

An important element in the Knuth-Morris-Pratt algorithm is an auxiliary array N which indicates how far to shift a partially matched pattern or block move after a mismatch. ... Fortunately, N can also be computed incrementally.
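For context, the array N that Tichy refers to is the standard Knuth-Morris-Pratt failure function. Here's a minimal sketch of its computation; note that each entry depends only on earlier entries and the next symbol, which is exactly what makes the incremental computation Tichy mentions possible:

```python
def failure_array(pattern):
    """KMP failure array: N[i] is the length of the longest proper
    prefix of pattern[:i+1] that is also a suffix of it.  Computed
    incrementally, one symbol at a time."""
    N = [0] * len(pattern)
    k = 0                      # length of the current matched prefix
    for i in range(1, len(pattern)):
        while k > 0 and pattern[i] != pattern[k]:
            k = N[k - 1]       # fall back to a shorter border
        if pattern[i] == pattern[k]:
            k += 1
        N[i] = k
    return N

print(failure_array("ababc"))   # -> [0, 0, 1, 2, 0]
```

After a mismatch while matching a block move, the array tells you how far the partially matched pattern can safely be shifted.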

Delta Algorithms: An Empirical Analysis describes Tichy's work in benchmarking diff algorithms. The paper contains dozens of scatter-plot diagrams of the various benchmark tests, as well as a fine high-level discussion of the complexity of building a suitable benchmark for diff:

The first problem encountered when defining a benchmark is finding an appropriate data set that is both large enough for the results to be statistically significant and representative of real world applications. For delta algorithms, the most important quality of any benchmark is that it contain a wide spectrum of change examples. This means that both the size of the changes represented and the size of the files involved should vary considerably. Large changes on small files and small changes on large files should be included as well as small changes on small files and large changes on large files.

Furthermore, the benchmark should contain a variety of formats, in particular pure text, pure object code, and pseudo text.

The paper also describes a diff algorithm variation which they call vdelta:

Vdelta is a new technique that combines both data compression and data differencing. It is a refinement of W.F. Tichy's block-move algorithm, in that, instead of a suffix tree, vdelta uses a hash table approach inspired by the data parsing scheme in the 1978 Ziv-Lempel compression technique. Like block-move, the Ziv-Lempel technique is also based on a greedy approach in which the input string is parsed by longest matches to previously seen data. ... Vdelta generalizes Ziv-Lempel and block-move by allowing for string matching to be done both within the target data and between a source data and a target data. For efficiency, vdelta relaxes the greedy parsing rule so that matching prefixes are not always maximally long.
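The hash-table idea can be illustrated with a toy. This is not vdelta itself, just a sketch of LZ-style longest-match parsing: index every K-symbol substring of the source in a hash table, then at each target position look up the K-symbol seed and extend each candidate match (the seed length K = 3 and all names here are my own choices; vdelta additionally indexes the target as it goes and relaxes strict greediness):

```python
from collections import defaultdict

K = 3   # seed length for the hash table (an arbitrary choice)

def build_index(source):
    """Map each K-symbol substring of source to its start positions."""
    index = defaultdict(list)
    for s in range(len(source) - K + 1):
        index[source[s:s + K]].append(s)
    return index

def longest_match(source, target, t, index):
    """Find a longest match for target[t:] against the indexed source:
    look up the K-symbol seed, then extend each candidate match."""
    best_pos, best_len = -1, 0
    for s in index.get(target[t:t + K], []):
        k = 0
        while s + k < len(source) and t + k < len(target) and source[s + k] == target[t + k]:
            k += 1
        if k > best_len:
            best_pos, best_len = s, k
    return best_pos, best_len

src = "the quick brown fox"
idx = build_index(src)
print(longest_match(src, "a quick brown cow", 1, idx))   # -> (3, 13)
```

The lookup finds the 13-symbol match " quick brown " starting at position 3 of the source, without ever scanning the source linearly.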

Over the years, there have been a number of fundamental attempts to construct differencing algorithms. In an ideal world, the "best" algorithm would always become the best known and most widely-used. However, the discussion and analysis of algorithms is a complex intellectual activity, and many factors other than the qualities of the algorithm itself come into play. Perhaps most importantly, if an algorithm is not well-presented and well-described, it can be overlooked and underused, even if it is valuable and effective.

With diff algorithms, it is becoming clear that two things are true:

There have been a variety of diff algorithms discovered and re-discovered over the years, but many of them are neither well-described nor easy to find: the papers are scattered, hard to locate, and behind ACM or IEEE paywalls; and when the papers are tracked down, they are confusing and hard to read.

The two Myers papers ("A file comparison program" and "An O(ND) difference algorithm and its variations") are so well-written and so well-known that they have pretty much dominated the discussion.

I've still got a number of other papers to study, but for now I think I've learned what I can from Tichy's work.

She gets rather frustrated when she loses, so I tried to explain to her that it's my game, and I've had more practice, and so I'm better at the game (right now) and hence I win.

The other day, she and my wife were playing at bowling. Somewhat accidentally, the game was configured so that the two "Miis" (virtual characters) were mis-set, and my grand-daughter was actually playing as me.

Well, she proceeded to lose to her grand-mother.

At which point she stomped her feet in frustration and exclaimed:

I don't get it. Grandpa said that he always wins, so why didn't he win this time?

We had a number of very good proposals this year. Overall, Google reported that they received nearly 6,000 proposals, and awarded 1,025 of them. As you can see, it is a very competitive program and it is not easy to qualify among so many excellent students.

Congratulations to the students who were accepted!

This is my third year being involved with the Google Summer of Code program, and my second year as a mentor with an accepted project. Last year's project was a very rewarding experience; I hope that I have another good experience this summer.

Saturday, April 24, 2010

The article discusses the role that open-ness played in the CDO/CDS ratings disaster:

The rating agencies made public computer models that were used to devise ratings to make the process less secretive. That way, banks and others issuing bonds — companies and states, for instance — wouldn’t be surprised by a weak rating that could make it harder to sell the bonds or that would require them to offer a higher interest rate.

But by routinely sharing their models, the agencies in effect gave bankers the tools to tinker with their complicated mortgage deals until the models produced the desired ratings.

That paragraph beginning "But..." is a common knee-jerk reaction to the concept of open source, particularly for sensitive topics such as financial ratings, and it's wrong, so wrong, 100% wrong.

At first, it's hard to understand how sharing and opening-up a complex and sensitive process can make it more secure. So it's time for a bit of history.

A similar process has occurred with computer security software over the years. Initially, the prevailing notion was that, the less known about the security software and its implementation, the better. At one company I worked for in the early 80's, the subroutine package which handled security went by the internal acronym CTTC; this string of letters was used to name the library, the APIs and variables, etc. When I asked what it meant, somebody finally told me: "Confusing To The Curious".

Complex multi-symptom errors also tend to have multiple trace paths from surface symptoms back to the actual bug. Which of the trace paths a given developer or tester can chase may depend on subtleties of that person's environment, and may well change in a not obviously deterministic way over time. In effect, each developer and tester samples a semi-random set of the program's state space when looking for the etiology of a symptom. The more subtle and complex the bug, the less likely that skill will be able to guarantee the relevance of that sample.

It's an argument that takes time to comprehend and grow comfortable with. If you've never done so, I encourage you to read Raymond's entire paper.

So, back in the world of finance, how does this argument apply? Well, I think that, if handled properly, the release of rating model details should actually make the ratings models better, not worse. When computer software security teams first started releasing the details of security bugs, people were horrified, as they thought that this was tantamount to giving the bad guys the keys to the storehouse.

But in fact, it was exactly the opposite: the bad guys already have the keys to the storehouse (and we see exactly this in the NYT article that started this discussion); what the open-ness process does is to give the good guys a chance to improve and refine their models.

Back to the New York Times article:

Sometimes agency employees caught and corrected such entries. Checking them all was difficult, however.

“If you dug into it, if you had the time, you would see errors that magically favored the banker,” said one former ratings executive, who like other former employees, asked not to be identified, given the controversy surrounding the industry. “If they had the time, they would fix it, but we were so overwhelmed.”

Do you see the comparison now? The agencies were trying to make the models better, but they needed more eyeballs.

Some years ago, the Secure Hash Algorithm was broken. If you have no idea what I'm talking about, here's a good place to start. That discovery, as it percolated through the computer security community, could have had a variety of results: it could have been covered up, it could have been "classified", it could have been played down.

Instead, the computer security world made it big news, as big as they possibly could. They told everybody. They blogged about it, wrote articles for the trade press, called up journalists and tried to explain to them what it meant.

The end result was this: The National Institute of Standards and Technology is holding a world-wide competition to design and implement a new hash function, the best one that the entire planet can possibly construct.

This is the process that we need for the ratings agencies and their models: instead of having a closed, secret process for developing ratings, we need an open one. We need eager graduate students pursuing their research projects in ratings models. We need conferences where people from different organizations and communities meet and discuss how to comprehend the financial models and their complexity. We need sunshine. We need a mindset in which anyone who reads this paragraph recoils in horror at how wrong the second sentence is, especially after the first sentence seems to show that they were starting to 'get it':

David Weinfurter, a spokesman for Fitch, said via e-mail that rating agencies had once been criticized as opaque, and that Fitch responded by making its models public. He stressed that ratings were ultimately assigned by a committee, not the models.

A committee??!?

Make no mistake about it: modern structured financial instruments are extraordinarily complex. As this paper shows, they are equivalent in complexity to the hardest problems that computer scientists tackle. So hopefully the task forces and politicians investigating these problems will realize that they need to be handled using the techniques we have evolved for tackling that kind of complexity.

Tuesday, April 20, 2010

If you're familiar with MapReduce, it's probably in the context of giant data centers like Google's, where enormous tasks are processed by decomposing them into smaller jobs which are then distributed over thousands of individual servers, with the results re-assembled at the end.

Well, here's a nice body of work by a team at Stanford, showing how they are using the basic concepts of MapReduce (functional programming, problem decomposition, parallelism, resource management) to accomplish parallel programming tasks on single-system configurations.

Google's MapReduce implementation facilitates processing of terabytes on clusters with thousands of nodes. The Phoenix implementation is based on the same principles but targets shared-memory systems such as multi-core chips and symmetric multiprocessors.
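To make the programming model concrete, here's a toy word-count in Python, with a thread pool standing in for Phoenix's worker threads on a shared-memory machine. This illustrates only the map/reduce abstraction; it is not the Phoenix API, and all the names are mine:

```python
from concurrent.futures import ThreadPoolExecutor
from collections import Counter
from functools import reduce

def map_task(chunk):
    """Map phase: produce a partial count table for one input chunk."""
    return Counter(chunk.split())

def merge(a, b):
    """Reduce phase: merge two partial count tables."""
    a.update(b)
    return a

def wordcount(chunks, workers=4):
    # The runtime farms map tasks out over the input chunks...
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(map_task, chunks))
    # ...and the partial results are merged in the reduce phase.
    return reduce(merge, partials, Counter())

print(wordcount(["a b a", "b c", "a"]))
```

The programmer writes only the two small functions; deciding how to split the input, how many workers to run, and when to merge is the runtime's job, which is the balance of power and simplicity the model is prized for.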

Most parallel and concurrent programming APIs are too hard: too complex to understand, and too easy to use incorrectly. MapReduce has been successful over the last 15 years because it nicely balances the power of parallelism with a clear and simple programming abstraction.

Code is expected to have high reliability as it is written, and we adhere to socially reinforced code review and check-in practices. Development teams write good tests because they care about the products, but also because they want more time to spend writing features and less on debugging. Teams with good testing hygiene upstream have more time to innovate, and are thus more adaptable and competitive. In addition, there is one source tree and poorly written code is quickly identified because it breaks other people's tests and projects. Aggressive rolling back is employed to keep the tree building "green".

Oh, what the heck, how about two money quotes:

Within milliseconds of a code check-in, our build process will automatically select the appropriate tests to run based on dependency analysis, run those tests and report the results. By reducing the window of opportunity for bad code to go unnoticed, overall debugging and bug isolation time is radically reduced. The net result is that the engineering teams no longer sink hours into debugging build problems and test failures.

Everybody "knows" that investment in development tools, infrastructure, and productivity support is important. But Google doesn't just talk the talk; they back it up with real systems and processes. Very impressive!

Wednesday, April 14, 2010

Here's an interesting thing I noticed yesterday. I only tried it on Windows 7, so perhaps the behavior is different on other versions of Windows. Also, this behavior seems to be specific to situations where Daylight Saving Time applies, so if your computer isn't observing Daylight Saving Time, you probably won't notice this.

Anyway, without further ado:

Open Windows Explorer and find a file somewhere on your machine which you worked on back in February. It can be any file: a document, a picture, etc.

In Windows Explorer, right-click on that file and choose Properties.

Now, run the Command Prompt. (Start -> Run, and type "cmd"). Change directory to the folder where you found the file from February, and type "dir" to display the directory information.

Now, tell me:

Does the last-modified time displayed by the "dir" command match the last-modified time displayed in the "Properties" window?

On the systems that I tried it on, the DIR command displays a last-modified time that is 1 hour later than the time shown in Windows Explorer (which, I think, is the correct time).
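One plausible explanation (an assumption on my part, not verified against the Windows internals) is that both tools read the same stored UTC timestamp, but one converts it using the UTC offset that was in effect back in February, while the other applies today's offset, which currently includes Daylight Saving Time. A sketch of the arithmetic, using a hypothetical US Pacific machine:

```python
from datetime import datetime, timedelta

# File last modified in February, stored as UTC.
mtime_utc = datetime(2010, 2, 10, 12, 0, 0)

std_offset = timedelta(hours=-8)   # offset in effect in February (standard time)
dst_offset = timedelta(hours=-7)   # offset in effect today (Daylight Saving Time)

# One tool applies the offset that was in force when the file was
# written; the other applies the offset in force right now.
explorer_style = mtime_utc + std_offset
dir_style = mtime_utc + dst_offset

print(dir_style - explorer_style)   # the one-hour discrepancy
```

If that's what's happening, the discrepancy should vanish for files modified after the spring DST transition, and reappear each time the clocks change.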

If you happen to have a system running a version of Windows other than Windows 7, I'd appreciate it if you tried this experiment and let me know what you found.

Sunday, April 11, 2010

Why, look! It's a sports story, and it's not about the Masters and what's-his-name!

This weekend's NYT has a story about last weekend's Philadelphia Open, and about Gata Kamsky, one of the top chess players in the United States. When I was paying a lot of attention to chess back in the early 1990's, Gata Kamsky was all the rage: a 17-year-old phenom who looked like he was unstoppable.

Well, fast-forward 18 years:

Most of the top players are European, and most of the top tournaments are in Europe. So it is not surprising that there are few spots in those competitions for non-Europeans.

For many years, if an American player was included in an elite event, the invitation went to Gata Kamsky. But Kamsky's world ranking has slipped to No. 34, while Hikaru Nakamura, the reigning United States champion, has risen to No. 17. Nakamura, at age 22, is 13 years younger than Kamsky, and he plays an exciting chess that is popular with fans. So Nakamura now seems to be claiming most of the choice tournament spots.

But, as the article goes on to discuss, Kamsky ain't done for yet, and his strong showing in the Philadelphia Open, one of the strongest tournaments held in this country, is a promising sign:

In the Philadelphia Open last weekend, he tied for first with three other grandmasters, and he took the title on a tie-breaker.

The column goes on to dissect an exciting 7th round match between Kamsky and Vladimir Romanenko of Belarus. Plenty of errors on both sides, but somehow Kamsky hung on and found a way to win.

OK, I've got a few years on Kamsky, but still, let's celebrate: way to go, old guys!

In literary studies, "close reading" is a technique in which the reader conducts an extremely thorough and detailed study of a relatively short passage. It's not an easy skill to learn, but, once learned, it can be very successful at revealing insights about the work you're studying.

Here are a few nice introductions to the notion of close reading. Lynch describes the purpose of close reading quite nicely:

That means reading every word: it's not enough to have a vague sense of the plot. Maybe that sounds obvious, but few people pay serious attention to the words that make up every work of literature. Remember, English papers aren't about the real world; they're about representations of the world in language. Words are all we have to work with, and you have to pay attention to them.

Although most software has centuries of development to go before it reaches the level of accomplishment that great works of literature have achieved, the serious study of software has many aspects in common with close reading.

Recently, I've been engaged in a close reading of "diff".

You know diff, of course: it's the program which computes the differences between a pair of files, and displays them to you.

Diff has been around a long while, and has been very closely studied. It's also a great example of a program where the proper choice of algorithm makes an enormous difference. So it's important to start, before reading the code, by understanding the algorithm.

In the case of diff, it turns out that there are some great papers that, over the years, have done a wonderful job of explaining how diff works. Here are the three papers most commonly credited as the most important:

The String-to-String Correction problem, Robert Wagner and Michael Fischer. This paper is 35 years old, and is certainly one of the great early efforts in algorithm description and analysis.

The String-to-String Correction Problem with Block Moves, Walter Tichy. Tichy is well-known not only for this work, but also for the development of RCS, one of the first widely-successful version control programs.

An O(ND) Difference Algorithm and its Variations, Eugene Myers. This is the paper which introduced the famous notion of "snakes". Here's a great pair of articles working through this algorithm in detail.

I'm planning to get to all of these papers, but for the last week, I've been painstakingly and minutely making my way, word by word, through

A File Comparison Program, Webb Miller and Eugene Myers

This last paper is in the journal Software, Practice and Experience, and was published in 1985. It's somewhat more recent than the other papers, and is therefore not "beginning at the beginning", but it's where I chose to start, and so here I am.

There are (at least) three aspects to a diff algorithm:

How good is the result?

How fast does the algorithm run?

How hard is the algorithm to understand?

To the first point, there are many different ways to describe the differences between two files; some are short and precise, others are long and verbose. We speak about the differences between two files by talking about an "edit script", which is a sequence of commands (insert this line here, remove that line there) that can be applied to transform the one file into the other. There is such a thing as the shortest possible edit script, and any diff algorithm worth respecting should produce such a script. There is usually more than one shortest possible edit script, however, so diff programs may vary in their output while all still producing the "best" result.

To the second point, it is very important that we be able to diff two files efficiently. Modern systems spend stupendous amounts of time computing differences between pairs of files, so improvements in execution time can have breakthrough results on larger systems.

To the last point, which is my current focus, some algorithms are simply harder to understand than others. So, I'm glad I started with the SP&E presentation, as it is very clear and readable. The algorithm described by Miller and Myers is a dynamic programming technique, inductively proved correct, that computes edit (sub-)scripts of length N + 1 by applying 3 basic rules to edit scripts of length N (greatly summarized here):

Move right: Given a script that converts A[1:i] to B[1:j-1], adding the command 'Insert B[j]' produces the next step in the edit script

Move down: Given a script that converts A[1:i-1] to B[1:j], adding the command 'Delete A[i]' produces the next step in the edit script

Slide down the diagonal: Given a script that converts A[1:i-1] to B[1:j-1], and given that A[i] equals B[j], the edit script that we already have produces the next step in the edit script, converting A[1:i] to B[1:j].
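The three rules translate directly into a dynamic programming table. Here's a small, hypothetical Python sketch that fills the table and then walks it backwards to recover one shortest edit script; note that the Miller-Myers algorithm computes the same scripts far more efficiently, without materializing the whole table:

```python
def edit_script(A, B):
    """One shortest insert/delete edit script converting A to B,
    built from the three rules via straightforward dynamic programming."""
    m, n = len(A), len(B)
    # D[i][j] = length of a shortest script converting A[:i] to B[:j]
    D = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        D[i][0] = i                          # delete everything
    for j in range(1, n + 1):
        D[0][j] = j                          # insert everything
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if A[i - 1] == B[j - 1]:
                D[i][j] = D[i - 1][j - 1]        # slide down the diagonal
            else:
                D[i][j] = 1 + min(D[i][j - 1],   # move right: insert B[j]
                                  D[i - 1][j])   # move down: delete A[i]
    # Walk back through the table to recover one shortest script.
    script, i, j = [], m, n
    while i > 0 or j > 0:
        if i > 0 and j > 0 and A[i - 1] == B[j - 1] and D[i][j] == D[i - 1][j - 1]:
            i, j = i - 1, j - 1
        elif j > 0 and D[i][j] == D[i][j - 1] + 1:
            script.append(('insert', B[j - 1]))
            j -= 1
        else:
            script.append(('delete', A[i - 1]))
            i -= 1
    return list(reversed(script))

print(edit_script("abc", "abd"))   # -> [('delete', 'c'), ('insert', 'd')]
```

Different tie-breaking choices during the walk-back produce different, equally short scripts, which is exactly why two correct diff programs can disagree on their output.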

I haven't really done a very good job of describing the algorithm. Hopefully, though, I've motivated you to go track down the references and read the algorithm(s) for yourself.

Monday, April 5, 2010

I think I want to do something fairly simple with my little iPod shuffle and my Ubuntu desktop:

Subscribe to a fairly small (5-6) set of podcasts

About twice a week, connect my iPod shuffle to my laptop

Get the latest podcasts from my feeds, copy them onto my shuffle, deleting the podcasts that were previously there, and disconnect the iPod

Repeat the process again, a few days later

I take the iPod over to the fitness center, so I have something to listen to on the treadmill.

I used to use iTunes on Windows, and I found the interface pretty easy:

I could tell what podcasts I was subscribed to,

I could fetch the latest episodes, after scanning their descriptions to see if they sounded interesting.

I could copy those episodes to the iPod, and could select and delete old episodes off the iPod

But with RhythmBox, I keep tying myself in knots:

I think I've deleted episodes, but they are still present on the iPod even though they no longer show up in the RhythmBox display! By still present I mean: when I press the "play" button on the iPod, the podcast episode plays, even though I deleted it via RhythmBox (and responded to a confirmation prompt!)

I can't easily see a list of which new podcast episodes are available for download

RhythmBox seems to have two displays: "Music" and "Podcasts" which seem to mostly show the same thing, but not quite exactly, and neither seems quite accurate

Is there an easier-to-use Ubuntu application for fetching recent podcast episodes, keeping my podcast episode collection trimmed to a manageable size, and copying recent episodes to my iPod, overwriting the previous ones as I do so?

Saturday, April 3, 2010

My development machine at my day job is quite amply provisioned with the two resources that no programmer ever gets enough of: memory (8 GB) and disk (2 TB). I've been periodically checking the display that shows my overall usage of disk space, waiting for the gauge to think that I've used enough disk space to be worthy of mention. With 2 TB, that may take a while...

But one wonderful thing that can be done with a computer like this is to run VMWare. Specifically, I have VMWare Fusion 3 on my Mac. It's hard to overstate what an amazing piece of software VMWare is.

So I've been building additional virtual machines, since programmers (especially programmers like me) love to have as many operating systems available to them as possible.

So far, I've got a fairly short list:

Mac OS X (Snow Leopard)

Windows 7 (64 bit)

Ubuntu 9.10 (32 bit)

OpenSolaris 2009.06

I'm confused by OpenSolaris version numbering: am I running OpenSolaris 9? Or 10? VMWare Fusion considers OpenSolaris 9 to be "experimental" while it thinks version 10 is "stable". However, when I actually boot the OpenSolaris VM, it seems to think I'm running version 5. Hmmm... Much more to learn about OpenSolaris.

Meanwhile, the question is, what to install next?

I'm intending to install FreeBSD version 8 next week, partly at the suggestion of my co-worker, who feels that there is a lot to learn from the FreeBSD implementations of various OS-level behaviors (networking, memory management, etc.).