Pages

Tuesday, July 26, 2005

Heard in passing: a student from a country that is, to put it delicately, not part of the US Axis of Friendship, decides to hold off on returning home for a visit because of fears that he will not be allowed back in the US. He finally returns, only to find that the US Consulate views his long absence from home as a sign that he is planning to emigrate to the US, and thus denies him a visa to return.

[Lowerbounds, Upperbounds] reproduces a famous speech titled 'You and your research' by Richard Hamming. The entire speech makes for excellent reading, but I thought this excerpt, from discussions at the end, was very insightful. Replace 'book' by 'blog', and there you go...

In the short-haul, papers are very important if you want to stimulate someone tomorrow. If you want to get recognition long-haul, it seems to me writing books is more contribution because most of us need orientation. In this day of practically infinite knowledge, we need orientation to find our way. [....]

The present growth of knowledge will choke itself off until we get different tools. I believe that books which try to digest, coordinate, get rid of the duplication, get rid of the less fruitful methods and present the underlying ideas clearly of what we know now, will be the things the future generations will value. Public talks are necessary; private talks are necessary; written papers are necessary. But I am inclined to believe that, in the long-haul, books which leave out what’s not essential are more important than books which tell you everything because you don’t want to know everything. I don’t want to know that much about penguins is the usual reply. You just want to know the essence.

The theory of computation appears in the most unexpected places. Consider the two following excerpts:

Cross-stitching is an entertaining pastime where one “paints” pictures with needle and thread, usually on specially prepared cross-stitch canvases. See cross-stitching.com for many links for cross-stitching. Mathematically speaking, a cross-stitch pattern is simply a rectangular grid where some of the squares in the grid are filled with colors.

and this:

Every scale can be reinterpreted as a rhythm. In particular, the diatonic major scale, which translates into box-like notation as [x . x . x x . x . x . x], is internationally the most well known of all African rhythms. It is traditionally played on an iron bell, and is known on the world scene mainly by its Cuban name Bembe´.

Tuesday, July 19, 2005

In a mathematical competition 6 problems were posed to the contestants. Each pair of problems was solved by more than 2/5 of the contestants. Nobody solved all 6 problems. Show that there were at least 2 contestants who each solved exactly 5 problems.

I had decided not to post anything from here, because after all, from a technical perspective, most of the people who'd be interested in the work are already here ! But then again, ...

The workshop is on sublinear algorithms, algorithms that use a sublinear amount of some resource (space or time). Sublinear time algorithms might seem odd, but they come under the general rubric of property testing, where the idea is that by probing an input in a small number of places, you can learn something about it. Property testing shows up in many places; indeed the PCP theorem itself can be viewed as a kind of property testing algorithm, where the object is the "proof" of membership, and a few (randomized) queries are made to check its correctness.

What is interesting (to me) is the amount of work in testing properties of distributions. For example, can I tell if a distribution is uniform with only a few samples ? Can I tell if it is the joint distribution of two independent variables ? Can I compute its entropy ?

In other matters, ...

The one think I like best is the randomized seat assignments. Lunch and dinner seating is fixed, and changes each meal. So not only is the tension of "where should I sit today" reduced, one gets to meet and chat with different folks each time.

Another is the size of the tables; at most of our conferences, the tables are big enough that disconnected conversation cliques form, which is really no fun. Here, the tables are small and seat only 5 people: you really have to have everyone involved in the conversation.

Talks are chugging along: lots more than one might think for an informal workshop. Thankfully mine is done, though I can't say I was very happy with it.

Sunday, July 17, 2005

Suppose you're sampling from a distribution, say of English words. You see a word, update its count, and repeat for a number of trials. After some trials, you will have counts that give you some idea of the frequency of words, and the simple process of normalizing the word counts by dividing by the total number of samples yields a distribution, the maximum likelihood estimator, that represents your best guess of the original distribution.

But is it ? The problem with this process is that it assigns all probabilities to things you have already seen. Any element you haven't seen is given a probability of zero, which seems wrong, as well as implying that you will never predict seeing a new element if you rely on the estimated distribution to predic tthe next item.

This seemingly dry problem has a rather colorful history. Back during World War II, the Allied forces had a crack team of cryptographers dedicated to decoding Enigma, the German code used to transmit messages to German commanders in the field. Alan Turing was heavily involved with this effort, which was fictionalized quite dramatically in Neal Stephenson's book, "Cryptonomicon".

At the time, British forces had captures the codebook used by the Germans to generate keys for their messages, and the problem that I. J. Good and Turing faced was how to determine the distribution governing which pages of the books were chosen. Messages that they decrypted told them which page a particular message had come from ("the sample").

I'll get to their solution in a second. But firstly, it's worth noting that this problem was examined long before Good and Turing. Laplace suggested adding one to each count before normalizing, thus guaranteeing that elements not seen had a non-zero probability. More precisely, the probability of seeing a new element was then given by

Pr(new) = (n - n>=1)/(n + s),

where n is the number of elements in the universe, s is the number of samples, and n>=1 is the number of elements seen thus far.

This is called the "add-one" solution for obvious reasons, and other constants were suggested as well (Krichevsky and Trofimov suggested an "add-1/2" solution that has some optimality properties). But in general, these methods were really only suitable for small universe sizes, where the number of samples was likely to be larger than the number of elements.

Good and Turing proposed an estimator that, in a bizarre twist, not only include terms relating to the frequency of elements, but also include terms that relate to the frequency of frequency of elements. A general form of the estimator they use is given as follows.

For a given word X, let r be the number of times you've seen X, and let Nr be the number of different words seen exactly r times. Then the probability of seeing X in the next sampling step (after seeing N samples) is

Pr(X) = [(r+1)/N] * [E(Nr)/E(Nr+1)].

Notice how the expected values of the frequencies Nr appear in the expression. Also, glossed over is a rather nontrivial issue of how to compute E(Nr), in a process called smoothing that accounts for many of the Good-Turing variants.

Good later wrote that Turing had some "intuitive demonstration" for the validity of this formula. It has been very successful in natural language processing since then, but a clear understanding of how it works has eluded researchers. The amount of work on this estimator is too voluminous to detail here; it's probably easiest to start backwards, with a recent paper in FOCS by Orlitsky, Santhanam and Zhang, and a slightly older paper in COLT by McAllester and Schapire.

It may come as a surprise to some (at least it did to me) that you can define interesting algebraic structures on words (sequences of letters from a (finite) alphabet). Combinatorics on Words is a great book that builds up the theory of free algebras over words. It was in this book that I first came across a beautiful proof of the precursor of the Robertson-Seymour program:

Consider any infinite sequence of finite-lengthwords w1, w2, ..., over a finite alphabet. There always exists an i and a j > i such that wi is a subsequence of wj.Viewed another way, what this says is that if we define a partial ordering on strings such that one string is greater than another if it contains the second as a substring, then

there is no infinite antichain

Why is this a precursor of the Robertson-Seymour program ? Well, the above fact shows that the set of strings with the given order is a well-quasi order, which in turn means that any filter (i.e any set that is "upward" closed, or for each element x contains all elements y greater than x) has a finite basis, which means that there is a finite set of strings whose upward closure defines the filter). Conversely, this means that any ideal (a set of strings closed under the substring operation) has as complement a filter, and thus membership in the ideal can be described in terms of "forbidden" strings (the elements of the finite basis)

The Robertson-Seymour program can be stated as: graphs with the order defined by minors (G is greater than G' if G' is a minor of G) form a well-quasi order, which thus means that any upward closed family has a finite basis, and thus any idea (any minor-closed family) has a finite set of excluded graphs that defines it.

The result on strings is not hard to prove: I used it in this paper to show that membership for a certain grammar was decidable. Unfortunately, we were never able to prove a stronger result, and the grammar seemed so simple that we felt there must be a more efficient membership result.

Here's the grammar, for your entertainment.

All production rules are of the form

a -> bacd -> d

where a, b, c, d are elements of the alphabet. There is a special symbol x, and the language of such a grammar is defined as all strings w such that wx ->* x. Notice that in the productions, a symbol either inserts or deletes a new symbol to its left. We called these grammers (somewhat clunkily) "leftist grammars".

Update (12/23/05): Leftist grammars are not context-free. In FCT 2005, Tomasz Jurdzinski and Krzysztof Lorys present a context-sensitive language that can be expressed as a leftist grammer.

As the titles suggest, the two workshops focus on approximation algorithms and randomization. They are not large (accepting 20/21 papers), and are a good place for more focused discussions on these topics.

If you're a researcher in either of these areas, you know about these anyway, but if not, and you happen to be in the Bay Area, you should definitely stop by there.

Wednesday, July 06, 2005

Update: And here they are. 437 papers, 22+1 committee members, average load of nearly 60 per person. Having done this last year, all I can say is, 'bwa ha ha'.

p.s. There were 392 longs last year, and a total of 487 papers. so depending on how you look at it, submissions either rose or dropped. If you look at just long papers, then the pattern is 224,283,331,360,392 for the last 5 years (averaging 42 per year), and this year the increment was 45.

What remains to be seen is whether any erstwhile short papers were submitted as longs, given that

I salute you in the name of the most high God.I was the former first lady Federal Republic of Nigeria, married tolate General Sani Abacha the late Nigerian military Head of State.I am presently in distress and under house arrest whilemy son Mohammed is undergoing trial in Oputa Panel Lagosand Abuja, this Panel was set up by the present civilian regime.

I now salute you in the name of Ghod,I who a piteous widow must complain.My son, my joy, arrested by a squad —And in far Lagos he shall soon be slain.The cash for his defense my husband hid(I mean the late Abacha, even he),I cannot use; for unjust laws forbidThat my funds can now be released to me.

And why did she do this ? Well, for a goal that us SCI-afflicted folks can relate to: to write a poem so bad that the International Library of Poetry would not declare it a semifinalist (and thus would not include it in their next anthology, which they would then offer to the author at a "reasonable price").

Is the International Library of Poetry a scam ? Well here's this, from Wocky Jivy, the primary participants in this poetic pursuit:

"Scam" is in the eye of the beholder and we leave that to you to decide for yourself. The National Library of Poetry (International Library of Poetry), doesn't say they'll use discretion in the poems they select as winners or semi-finalists -- so some will say it is the NLP's prerogative to select even really really bad "poetry" as "winners."

Exactly, and SCI can accept whatever papers it wants. It seems to me that the goal of getting a junk paper accepted to SCI was too shortsighted. The goal should be to get a paper rejected from SCI (of course Stribling, Kohn and Aguayo had their paper rejected, but they cheated by publicizing their actions :)).

It would be remiss of me if I were to end without a word from the Python brethren:

From the anthology: "The Poems of Ewen McTeagle"

Lend us a quid till the end of the week.

Lend us a quid till the end of the week.If you could see your wayTo lending me sixpenceI could at least buy a newspaper.That's not much to ask anyone.

Friday, July 01, 2005

In honor of its 125th anniversary, Science magazine is running a series on the top 125 questions on science that can be asked in the next 25 years. The top 25 have a detailed blurb each: all the usual suspects are there, with one pleasant surprise:

"What are the limits of conventional computing"

... Mathematicians have shown that if you could come up with a quick and easy shortcut to solving any one of the hardest type of NP problems, you'd be able to crack them all. In effect, the NP problems would turn into P problems. But it's uncertain whether such a shortcut exists--whether P = NP. Scientists think not, but proving this is one of the great unanswered questions in mathematics....

There is also a blurb on mining biological data, with the key point being made that this is an interdisciplinary area that requires biologists, mathematicians and computer scientists to work together.

Progress is limited by the difficulty of translating biological patterns into computer models. Network computer programs themselves are relatively simple, and the methods of portraying the results in ways that researchers can understand and interpret need improving. New institutions around the world are gathering interdisciplinary teams of biologists, mathematicians, and computer specialists to help promote systems biology approaches. But it is still in its early days.