Computational Complexity and other fun stuff in math and computer science from Lance Fortnow and Bill Gasarch

Thursday, February 22, 2007

Henzinger on Algorithms

A reader writes about coming across a 2003 CIO article about Google Research Director Monika Henzinger and being quite surprised to read the following:

But it was while teaching courses on her beloved algorithms at
Cornell University when she had a flash. "I realized that
efficient algorithms were fun but not very useful to the world
anymore," she says.

The reader asks whether there is any reasonable sense in which that could be true.

While most algorithms developed by theorists have little practical
value and will never see code, efficient algorithms play a critical
role in any large system, especially at Google. As Henzinger herself
states later in the same article:

20 comments:

I think the implication is that there is hardly any use for algorithms that run in time n log log log n instead of n log log n. That is to say, most of the pioneering work has been done already, and what remains are only incremental improvements. I would wager that the "good algorithms" that Google uses are largely based on research done several decades ago...

I disagree with all the previous posters. The issue is that the models of computation required have properties different from those of the standard RAM model.

This has meant that new algorithm development is needed to deal with these new models. Thus the need for algorithms is even greater than it would have been if people could simply take algorithms from their favorite textbook.

Lots of algorithms research since the mid-1990s has focused on these new models. This focus has helped to keep the field exciting.

Oh, and by the way, with the sizes of these data sets, getting that O(n log n) algorithm down to O(n) is even more important.
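As a toy illustration of this point (my example, not the commenter's): when keys come from a bounded universe, counting sort runs in O(n + k), sidestepping the O(n log n) comparison-sort bound.

```python
def counting_sort(keys, universe_size):
    """Sort integer keys in [0, universe_size) in O(n + k) time,
    avoiding the O(n log n) lower bound for comparison sorts."""
    counts = [0] * universe_size
    for key in keys:
        counts[key] += 1
    result = []
    for value, count in enumerate(counts):
        result.extend([value] * count)
    return result

print(counting_sort([3, 1, 4, 1, 5], 6))  # [1, 1, 3, 4, 5]
```

On Google-scale inputs, that constant-factor-friendly linear pass is exactly the kind of saving the comment has in mind.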

Whether improving O(lg n) to O(lglg n) is great work or theoretical nonsense depends very much on context. Tell somebody programming a router to use binary search and you are certain to be laughed at.
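To make the router example concrete (a sketch of mine, not from the thread): IP forwarding requires longest-prefix match, and routers use trie-like structures whose lookup cost is bounded by the address width, not by binary search over a sorted prefix table.

```python
class PrefixTrie:
    """Binary trie for longest-prefix match. Lookup walks at most
    one node per address bit, independent of how many prefixes are stored."""
    def __init__(self):
        self.root = {}

    def insert(self, prefix_bits, next_hop):
        node = self.root
        for bit in prefix_bits:
            node = node.setdefault(bit, {})
        node["hop"] = next_hop

    def lookup(self, addr_bits):
        node, best = self.root, None
        for bit in addr_bits:
            if "hop" in node:
                best = node["hop"]          # remember longest match so far
            if bit not in node:
                return best                  # no longer path exists
            node = node[bit]
        return node.get("hop", best)

t = PrefixTrie()
t.insert("10", "A")           # 10xxxxxx -> next hop A
t.insert("1011", "B")         # 1011xxxx -> next hop B
print(t.lookup("10111111"))   # B (longest matching prefix wins)
print(t.lookup("10000000"))   # A
```

Real routers refine this further (multibit tries, TCAMs), which is precisely where shaving factors like lg n to lg lg n, or to O(1), pays off.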

We should also understand that reducing 2^n to n^3 is equally useless if the instances people care about have n = 1M. The same goes for designing O(lg^2 n)-approximation algorithms for many problems.

As far as modern developments and practice are concerned, I think the main problem is that theoretical ideas are not promoted enough to the practical community. The old algorithms made it into intro classes, and everybody knows they exist. For new algorithms, many practitioners don't even realize they can hope for something better.

For certain problems, where practical heuristics are not great at all, it is obvious to every practitioner that they should hope for something better. When there's a theory talk about such a problem, the room is packed with people hoping to learn something nonobvious.

It's not that algorithms don't matter. They do. It's just that applications matter more to the average human in the world, and systems that implement good algorithms matter more than algorithms that cannot be reasonably implemented.

There is always some tension between theoretical branches of science and the more applied branches. To the extent that we can bridge those gaps, we'll improve science.

I think the RAM is actually a reasonable model in which to analyze serial algorithms. On the other hand, most of what we do at Google involves parallelism, and there are great opportunities for analysis of parallel algorithms. Unfortunately most of the academic literature on parallel algorithms involves impractical models of computation that have little to do with real systems. Think of it as an opportunity for theorists to do some work that has high impact outside of theory...

In my opinion, it is a major theoretical challenge to devise a good theoretical notion implying practical (as opposed to asymptotic) hardness or efficiency. It is partially approached by self-reducibility, but this is certainly not enough.

In particular, the social and economic unity of the 787's (enormous) development enterprise is intimately entangled with the mathematical and algorithmic integrity of its engineering design. They are directly related, one to the other.

From a mathematical point of view, the natural "complexification" of Boeing's classical MOR algorithms is embodied in the (many) emerging algorithms for efficient quantum system simulation.

Just as the notion of analytic continuation greatly increases the power of functional analysis, the natural mathematical complexification of classical MOR yields algorithms that greatly increase the efficiency of quantum system simulations.

We explain this point-of-view to nonspecialists (e.g., our biomedical colleagues) as follows.

----------

Richard Feynman said in 1966: "We are very lucky to live in an age in which we are still making discoveries. It is like the discovery of America—you only discover it once. The age in which we live is the one in which we are discovering the fundamental laws of nature, and that day will never come again."

Our view is similar to Feynman’s, but updated for a new century: "We are very lucky to be living at the dawn of the age of quantum system engineering. Our age is the one in which engineers will learn to create near-perfect technologies, for example quantum-limited microscopes that observe for the first time all biological molecules in situ, and thereby help humanity discover our own inner workings."

----------

From this engineering point of view (and from many other points of view too), we are truly entering "the century of algorithms." Fun!!!

There was an interesting talk at PODC 2006 by T. Chandra regarding the difference between theory and practice in implementing the Paxos protocol at Google. One take-home message was that the "page complexity" of an algorithm is more important in practice than in theory, particularly in distributed computing, where debugging is harder.

Research in TCS algorithms is mostly about ideas. Many of those don't lead to practical things. Even when they can, much extra work is needed either in terms of engineering or further development. An average TCS person does not gain anything by doing this follow-up work. Whether this is good or bad is hard to tell. I think there is some responsibility on the applied people to seek out the best algorithms that exist for a problem and also to interest theoreticians in working on their problems. It is not possible for a theory person to solve every variant that might come up in practice in advance.

What applications do you care about and where do you think current models of concurrency and parallelism fall short for analyzing these applications? What factors could added to the current models to make them more useful to you?

Having talked to researchers at Google, Microsoft, and Intel, I am confident that in some of the most important emerging areas (high-dimensional data analysis, object recognition/intelligent image and video processing, massive databases, machine learning) there is ample opportunity for theory to export _groundbreaking_ technology to industry. One problem is that the people working at those companies don't have the technical background to understand our approaches, and this is crucial because good implementations require high quality engineering to go along with high quality ideas.

Just as some of the bright young TCS researchers are importing ideas from mathematics on a regular basis, I suspect that in the next five years, a mathematically minded systems person is going to blow away the world by importing techniques from TCS into actual systems.

I'm not implying that this doesn't already happen, but the potential for more of it is overwhelming.

> Even when they can, much extra work is needed either in terms of engineering or further development. An average TCS person does not gain anything by doing this follow-up work.

There is *plenty* to be gained by a TCS person from putting in the "extra work" to make algorithms practical. There is (a) the satisfaction of seeing your algorithms used in practice, and (b) a better understanding of the structure underlying the problems, which can lead to further theoretical work. If this sounds too nebulous, there are also concrete carrots, such as (c) your citations will shoot up (if people find your algorithms useful, they will cite them) and (d) you will have a much easier time convincing funding agencies to give you money for research.

I won't attempt to map out the big problems in parallel computing, and different applications matter to different communities.

If you think abstractly about what matters to Google, you can at least identify index builds and index serving. These fit into a clear theme that has emerged over the history of high performance computing, namely that for technological reasons, our ability to perform computation seems to always outstrip our ability to move data around. That's why we have caches in microprocessors, and why the most expensive supercomputers like Blue Gene have a significant amount of their budget spent on the communication network.

In building an index, we have to take the (document, term) relations and organize them by term. This ends up performing a personalized all-to-all communication of what is essentially the entire data input set.
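The grouping step described above can be sketched in a few lines (a serial toy of my own, not Google's pipeline; the comment's point is that at scale this same inversion becomes an all-to-all shuffle of essentially the whole input):

```python
from collections import defaultdict

def build_index(docs):
    """Invert (document, term) relations: group postings by term.
    At scale, this grouping is the all-to-all communication step."""
    index = defaultdict(list)
    for doc_id, text in docs.items():
        for term in set(text.split()):   # dedupe terms within a document
            index[term].append(doc_id)
    # Sorted posting lists, as a serving index would want them.
    return {term: sorted(ids) for term, ids in index.items()}

docs = {1: "fast algorithms matter", 2: "fast systems matter more"}
index = build_index(docs)
print(index["fast"])    # [1, 2]
print(index["matter"])  # [1, 2]
```

In a distributed build, each `index[term].append(...)` becomes a message routed to the machine owning that term, which is why data movement, not computation, dominates the cost.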

The PRAM model essentially ignores the cost of data movement, focusing instead on the cost of computation once the data has been accessed. There have been other models proposed, but there is still a lot of room for better models of computation and programming that are more representative models of the technology that we can build.

Kevin McCurley said...

> The PRAM model essentially ignores the cost of data movement, focusing instead on the cost of computation once the data has been accessed. There have been other models proposed, but there is still a lot of room for better models of computation and programming that are more representative models of the technology that we can build.

What is the problem with other models, e.g. external-memory, cache-oblivious, etc.?

To build a more representative model, one would first have to know why the current models are not "representative enough".

Our QSE Group's impression is that most of the literature on algorithms is, well, algorithmic.

As with any branch of mathematics, other points of view are feasible -- algebraic, differential, geometric -- and one mark of maturity in a mathematical discipline is that these points of view begin to fuse.

Engineering algorithms are often focused upon simulation. The simulation of linear systems has reached a reasonable degree of maturity, based largely on the analytic continuation of linear response functions -- there is an immense body of literature on this topic.

The simulation of nonlinear systems (in our opinion) is poised for a similar leap forward, based upon the "complexification" of state space that is naturally associated with the simulation of quantum states. The mathematical invariances that are naturally associated with this complex state space provide (we think) a mathematical "handle" upon nonlinearity that has previously been lacking.

The scope of application for nonlinear quantum dynamical simulation is Google-scale (we think). After all, every cell of the human body has within it 100X as many atoms as there are stars in the galaxy. In principle, it seems that we are allowed by quantum mechanics, and challenged as engineers and mathematicians, to observe these individual atoms by quantum spin microscopy, and then to "fly" them at the system level by quantum dynamical simulation.

Conducted as a whole-biome survey, this structure-and-function survey would be a planetary-scale effort -- many orders of magnitude larger than the genome project.

This is just to point out that fundamental research in algorithms contributes not only to existing large-scale enterprises like Google, but to the genesis of wholly new large-scale projects.