When I was an undergrad, probably my favorite CS class I took was algorithms. I liked it (a) because my background was math so it was the closest match to what I knew and (b) because even though it was "theory," a lot of the stuff we learned was really relevant. Over time, it seemed like the area had distilled worthwhile algorithms from interesting-in-theory-but-you'll-never-actually use algorithms.

In fact, I think this is a large part of why most undergraduate CS degrees today require a course in algorithms. You have these very nice, clearly defined statements, and very elegant solutions to those statements that in most cases (at the UG level) are known to be optimal.

Fast forward N years.

My claim today---and I'm speaking really as an NLP person, which is how I self-identify---is that machine learning is the new core. Everything that algorithms was to computer science 15 years ago, machine learning is today. That's not to say it won't move in another 10 years, but that's how I see it.

Why?

For the most part, algorithms (especially as taught at th UG level) is the study of one thing: Given a perfect input, how do I most efficiently compute the optimal output.

The problem is the "perfect input" part.

All of my experience in the past N years has told me that you never have a perfect input, and that it's far far far more important to be able to synthesize information from a large number of sources and reason about it than it is to find the exact-right-solution to some problem that exists only to Plato.

Even within machine learning you see this effect. Lots of numerical analysis people have worked on good algorithms for getting that last little bit of precision out of optimization algorithms. Does it matter? Nope! Model specification, parameter tuning, features, and data matter infinitely more than that last little bit of precision. (In some fields, for instance, scientific computing, that last little bit of precision may matter. I don't know enough to know one way or the other.)

Let's play a thought game. Say you're an UG CS major. You graduate and get a job in CS (not grad school). Which are you more likely to use: (1) a weighted cost flow algorithm or (2) a perceptron/decision tree?

Clearly I think the answer is (2). And I loved flow algorithms when I was an undergrad and have actually spent since 2006 trying to figure out how I can use them for a problem I want to solve. No dice.

I would actually go further. Suppose you have a problem whose inputs are ill-specified (as they always are when dealing with data), and whose structure actually does look like a flow problem. There are two CS students trying to solve this problem. Akiko knows about machine learning but not flows; Bob knows about flows but not machine learning. Bob tries to massage his data by hand into the input to an optimal flow algorithm, and then solves it exactly. Akiko uses machine learning to get good edge weights and hacks together some greedy algorithm for flows, not even knowing it's called a flow. Who's solution works better? I would put almost any amount of money on Akiko.

Full disclosure: those who know about my research in structured prediction will recognize this as a recurring theme in my own research agenda: fancy algorithms always lose to better models.

There's another big difference between N years ago and today: almost every algorithm you could possibly care about (or learn about as an UG) is implemented in a library for any reasonable programming language. That's not to say that it's unimportant to know how things work in order to use them, but I would argue it's much less important in a field like algorithms whose knowledge is comparatively stable, versus a field like machine learning where things are still changing and there is no "one right answer" to the "machine learning problem." In a field that's still a bit of an art rather than a science, understanding how things work under the hood feels a lot more important. Quicksort, heaps, minimum spanning trees, ... these are all here to stay. Okay, so now I've convinced myself that we should yank algorithms out as an UG requirement and replace it with machine learning.

But wait, I can hear my colleagues yelling, taking algorithms isn't about learning algorithms: it's about learning how to think! But that's also what I think is great about machine learning: the distance between theory and algorithms is actually usually quite small (I try to get this across at various points in CiML, to varying degrees of success). If the only point of an algorithms class (I've heard exactly this argument made about automata theory, for instance) is to teach students how to think, I think we could do much better.

Okay, so I've thrown down the gauntlet. Someone should come smack me with theirs :P!

Edit after some comments:

I think I probably wrote badly and as a result my main point got lost. I'll try to restate it here briefly and then I'll edit the main post.

Main point: I feel like for 15 years, algorithms has been at the heart of most of what computer science does. I feel like that coveted position has now changed to machine learning or, more generically, statistical reasoning. I feel this way because figuring out how to map a real world problem into something an "algorithm" can consume, especially when that needs statistical modeling of various inputs, is (IMO) a more important and harder problem than details about flow algorithms.

let me give a concrete example that may actually be a real world example, but i don't know (though see this paper). that of path finding for taxis or cars. the world is a graph and given directed edge costs we can run dijkstra or whatever to find LEAST-TIME (shortest) paths. this is basically google maps/etc.

of course, we never know the true time to travel some segment. we might know itnow, but by the time the driver gets to some road (5 or 10 minutes from now) the conditions may have changed. and of course we have historical data on traffic from which we can predict what the condition of the road will be like in 10 minutes.

so here, "foo" is a function that takes the time of data, historical traffic data, weather and whathaveyou, and maps it to edge costs.

my claim is that if you really want to solve this problem, it's much more important to understand how to create foo than how to create bar. in particular, if i gave you a greedy or near greedy approach to bar, combined with a really good foo, i bet this would be significantly better than an optimal bar and a crappy foo.

Okay, necessary warning. I'm not a semanticist. I'm not even a linguist. Last time I took semantics was twelve years ago (sigh.)

Like a lot of people, I've been excited about AMR (the "Abstract Meaning Representation") recently. It's hard not to get excited. Semantics is all the rage. And there are those crazy people out there who think you can cram meaning of a sentence into a !#$* vector [1], so the part of me that likes Language likes anything that has interesting structure and calls itself "Meaning." I effluviated about AMR in the context of the (awesome) SemEval panel.

What I guess I'm interested in discussing is not the question of whether AMR is a good interlingua but whether it's a semantic representation. Note that it doesn't claim this: it's not called ASR. But as semantics is the study of the relationship between signifiers and denotation, [Edit: it's a reasonable place to look; see Emily Bender's comment.] it's probably the closest we have.

We've spent some time looking at the data (dummy), to try to understand what is actually there. What surprised me was how un-semantics-y AMR tends to be. The conclusion I reached is that it might be much closer to a sort of D-structure (admittedly with some word sense disambiguation) than a semantics (sorry, I grew up during GB days and haven't caught on to the whole minimalism thing). And actually it's kind of dubious even as a D-structure....

Why does this surprise me? I would strongly have expected "p" (the Little Prince) to be the ARG0 of help. But in this representation, help doesn't have any ARG0.

You could make some argument that you could write a tool to transform such matrix clauses and re-insert the ARG0 of the matrix verb as the ARG0 of the subordinate verb, but this doesn't work in general. For instance, "I always want to rest" in the dataset also doesn't have "I" as an argument of "rest." Unfortunately, the rest-er in the propbank/verbnet definition of rest is (correctly) its ARG1/theme. So this deterministic mapping doesn't work -- if you did this the interpretation would be that "I always want to rest" means "I always want to cause-something-unknown to rest" which is clearly different.

Perhaps this is an annotation error? I found the same "error" in "And I knew that I could not bear the thought of never hearing that laughter any more" (the ARG0 of "hear" is missing) but there are a few cases where the subordinate clause does get an agent; namely:

She did not wish to go out into the world all rumpled...("go" correctly gets "she" as the ARG0)

...that she wished to appear...("appear" gets "she" as the ARG0)

So I'm not sure what's going on here. At least it's inconsistent...

Noun-noun compounds

In a semantic representation, I would expect the interpretation of noun noun compounds to be disambiguated.

For instance, the string "only one ring of petals" is annotated as:

(r / ring :quant 1 :mod (o / only) :consist-of (p / petal)))

But the string "glass globe" is simply:

(g / globe :mod (g2 / glass)))

In other words, the "ring of petals" is a ring that consists of petals. But the "glass globe" is just a "globe" modified by "glass." We have no idea what sort of modification this is, though one could argue that it's the consist-of relation that was used for ring of petals.

As made poinant by Ewan Dunbar, disambiguating Noun-Noun compounds is important when translating into many other languages:

Possession

There are many types of possession, many of which can be turned into predicates:

Erin's student => the student Erin advises

Julio's paper => the paper Julio wrote

Smita's football team => the team Smita plays on; the team Smita owns; the team Smita roots for

Kenji's speech => the manner in which Kenji talks; the speech Kenji gave (these last are actually hard to predicatize given my inventory of English verbs)

(Thanks to Ani Nenkova for most of these examples.)

In AMR, it appears the rule is that apostrophe-s ('s) turns into :poss. You (appear to) get (student :poss Erin) and (paper :poss Julio) and (team :poss Smita) and (speech :poss Kenji).

This is a bit strange because the Norman genitive alternation ("of") of possession in English (as opposed to the Saxon genitive "'s") does not turn into :poss. For instance, "other side of the planet" becomes:

(s2 / side :mod (o / other) :part-of (p2 / planet))

Here, the "of" has been disambiguated into :part-of; in contrast, with "air of authority", we get:

(a / air :domain (a2 / authority))

where the "of" has turned into ":domain". In fact, I cannot find any cases where there is a :poss and no "'s" (or his/her/etc...).

Now, you could argue that "planet's side" and "authority's air" sound at best poetic and at worst wrong. (Though I find "planet's other side" pretty acceptable.) But this is basically a property of English. But these are totally fine in Japanese with の/no as the possessive marker (according first to my prior and then confirmed by my Japanese informant Alvin -- thanks Alvin :P). I'm guessing they're okay in Chinese too (with 的/de as the possessive marker), but I'm not positive.

Moreover, possession is more complicated that than in lots of languages that distinguish between alienable and inalienable possession. A classic example being "my aunt" versus "my car." My mother is inalienable because try as I might, I cannot sell, buy, exchange, etc., my aunt. My car is alienable because I can do these things. In lots of languages, (about half, according to WALS), there is more than one class of possession, and the two class (in)alienable distinction is the most common.

As an example (taken from that WALS link, originally due to Langdon (1970)): In Mesa Grande Diegueño (Yuman; California), inalienable nouns like mother (ətalʸ) take a simple prefix ʔ- ('my'), while house (ewa) takes the compound prefix ʔə-nʸ- as in:

a. ʔ-ətalʸ 1sg-mother ‘my mother’

b. ʔə-nʸ-ewaː 1sg-alienable-house ‘my house’

So you could argue that in this case it's purely a property of the possessed noun, and so even in Mesa Grande Digueño, you could say :poss and then disambiguate by the semantics of the possessee.

Wrap Up

I could continue to go on, but perhaps I've made my point. Which is not that AMR sucks or is uninteresting or anything like that. It's just that even if we can parse English into AMR, there's a long way to go before we can start doing semantic reasoning from it. And maybe along the way you learned something. I know I did.

I think what really drove it home for me that AMR is not so much of a semantic representation is the ease with which I could imagine writing a rule-based generator for AMR to English. Yes, the sentences would come out stodgy and kind of wrong, but given an AMR and doing an in-order traversal of the tree, I'm pretty sure I could generate some fairly reasonable sentences (famous last words, right?). I believe this is true even if you took the AMR, re-ified it into a Hobbs-esque flat form first. The first step would be un-reification, which would basically amount to choosing a root, and then going from there. Like all NLP papers written in the 80s, perhaps the devil is in the details, in which case I'll be happy to be proved wrong.

One interesting final point for me is that as established in the paper I stole this title from, AMR is actually a pretty reasonable interlingua. But it's a pretty crappy semantic representation (IMO). This sort of breaks the typical MT Vauquois triangle triangle. That alone is kind of interesting :).

[1] Due to Ray Mooney, saw on Twitter, but now can't find the picture of the slide any more.

I love handouts, so I made a handout for this one too. I paste below the handout. All good ideas are those of the respective authors; all errors and bad ideas are probably due to bad transcription on my part.

"Count models": * For each word type, collect context vectors * Context vectors look at n words on the left and right (with position info? together or separately?) varied in 2..5 * Each type is represented by the bag of contexts in which it appears * Contexts are scored by PMI or LLR (gets rid of useful but frequent contexts) * We might reduce dimensionality to k in { 200, ..., 500 } - using either SVD or NNMF

"Predict models" (aka deep learning): * Assume a mapping from word type -> k-dim vector * Learn a model to predict any word token given the vectors of the n words to its left and right varied in {2,5} * Words are thrown out proportional to their frequency - makes things faster - reduces importance of frequent words like IDF * Vary k in { 200, ..., 500 }

CW (Collobert & Weston) models: * freely available online * 100 dimensional vectors trained for 2 months on wikipedia * predict a word with 5 words to the left and right * used extensively in other literature

Okay, usually Hello World is the first program you learn to write in a new programming language. For fun, I've been collecting how to say hello world in different human languages, something remarkably difficult to search for (because of the overloading of the word "language"). I have 28. I'd like to make it to 280 :). If you have one (or more) to contribute, email me, post a comment, or tweet to me @haldaume3. And of course if you think any of these is wrong, please let me know that too.

Usual caveats: didn't see all papers, blah blah blah. Also look for #acl14nlp on twitter -- lots of papers were mentioned there too!

A Tabular Method for Dynamic Oracles in Transition-Based Parsing; Yoav Goldberg, Francesco Sartorio, Giorgio Satta.Jaokim Nivre, Ryan McDonald and I tried searnifying MaltParser back in 2007 and never got it to work. Perhaps this is because we didn't have dynamic oracles and we thought that a silly approximate oracle would be good enough. Guess not. Yoav, Francesco and Giorgio have a nice technique for efficiently computing the best possible-to-achieve dependency parse given some prefix, possibly incorrect, parse.

Joint Incremental Disfluency Detection and Dependency Parsing; Matthew Honnibal, Mark JohnsonThe basic idea is to do shift-reduce dependency parsing, but allow for "rewinds" in the case of (predicted) disfluencies. I like that they didn't just go with the most obvious model and actually thought about how might be a good way to solve this problem. Basic idea is if you get "Please book a flight to Boston uh to Denver..." is that you parse "to Boston" like usual but then when you get to the "uh", you remove old arcs. You do it this way because detecting the disfluent segment ("to Boston") is much easier when you hit "uh" than when you hit "to Boston."

Don't count, predict! A systematic comparison of context-counting vs. context-predicting semantic vectors; Marco Baroni; Georgiana Dinu; Germán KruszewskiThis paper is summarized best by its own statement, which should win it the award for most honest paper ever: "...we set out to conduct this study because we were annoyed by the triumphalist overtones often surrounding [neural network embeddings], despite the almost complete lack of a proper comparison.... Our secret wish was to discover that it is all hype... Instead, we found that the [embeddings] are so good that, while the triumphalist overtones still sound excessive, there are very good reasons to switch to the new architecture."

Learning to Automatically Solve Algebra Word Problems ; Nate Kushman; Luke Zettlemoyer; Regina Barzilay; Yoav ArtziAn algebra word problem is something like "I have twice as many dimes as nickles and have $2.53. How many nickles do I have." Of course usually they actually have an answer. They have a nice, fairly linguistically unstructured (i.e., no CCG) for mapping word problems to algebraic formulae and then solving those formulae. Code/data available.

Grounded Compositional Semantics for Finding and Describing Images with Sentences; Richard Socher, Quoc V. Le, Christopher D. Manning, and Andrew Y. NgThis is the follow-on work from Richard's NIPS workshop paper on text images from this past NIPS. They fixed the main bug in that paper (the use of l2 error, which gives a trivial and uninteresting global optimal solution) and get nice results. If you're in the langvis space, worth a read, even if you don't like neural networks :).->

Kneser-Ney Smoothing on Expected Counts; Hui Zhang; David ChiangI didn't actually see this talk or read the paper, but lots of people told me in hallways that this is a very nice result. Basically we like KN smoothing, but it only works for integral counts, which means it's hard to incorporate into something like EM, which produces fractional counts. This paper solves this problem.

Linguistic Structured Sparsity in Text Categorization; Dani Yogatama; Noah A. SmithAlso didn't see this one, but skimmed the paper. The reason I really like this paper is because they took a well known technique in ML land (structured sparsity) and applied it to NLP, but in an interesting way. I.e., it wasn't just "apply X to Y" but rather find a very linguistically clever/interesting way of mapping X to a problem that we care about. Very cool work.

Overall I really liked the conference, thanks to everyone who helped put it together. I can't help but notice that about half of my picks above were actually TACL papers. I suspect this will be more and more true over time.

In a previous post's comments, we talked about Bayes rule and things like that. This got me wondering about the following question:

If we know p(A) and p(B|A), we can reconstruct p(A|B) perfectly by Bayes' rule. What if we only have estimates of p(A) and p(B|A)? How does the quality of the reconstruction of p(A|B) vary as a function of the quality of the estimates of the marginal and conditional?

I feel like there have to be results along these lines, but I was unable to find them. My next attempt was to prove something, which failed miserably after a few hours. So, as a good empiricist and lazy(/bad) theorist, I designed a simple experiment.

Let A and B be binary variables. Let's generate a random joint distribution p(A,B), which has four cells for the four possible combinations of values of A and B. From this, we can directly compute the true marginal p(A) and the true conditionals p(B|A) and p(A|B).

Now, let's pick some "estimate" q(A) and q(B|A). You can think of these as a "noisy" version of p(A) and p(B|A). Given q(A) and q(B|A), we can compute an estimate a reconstructed joint distribution q(A,B) = q(A)q(B|A), as well as a reconstructed conditional distribution q(A|B) = q(A)q(B|A) / Z(q), where Z(q) is computed according to q. We can then compare q(A,B) to the true p(A,B) and q(A|B) to the true p(A|B) and measure how far they are.

At this point we have to decide what our measurement (divergence) function is. I tried three: variational distance (max absolute difference), l1 distance (sum absolute difference) and KL divergence. To be absolutely pedantic, I will define the versions of these that I used. First, the KL variants:

Note that the direction is q from p (chosen because p is the "true" distribution) and that this also has the advantage that the conditional KL is based on p(B), which (in this case) is known exactly and is "correct."

Okay, so now for the experiment. First I generate a random (uniform) true joint distribution p(A,B). I then run through 1,000,000 possible q(A,B), where each of the three sufficient statistics are chosen from [0.01, 0.02, ... 0.99]. I then conditionalize and marginalize these in all the relevant ways and compute KL. Finally, I generate plots like the following very representative example for KL:

On the left column, we're inspecting the recovered joint distribution and in the right column the recovered conditional distribution. The top row shows: for different divergences of q(A) from p(A), and for different divergences of q(B|A) from p(B|A), how far is (left) the recovered joint q(A,B) from the true joint q(A,B), or how far is the (right) recovered conditional q(A|B) from the true conditional p(A|B). The middle row is the projection of this into two dimensions, focusing on the divergence in the marginal, and the bottom row is the projection onto the divergence in the conditional. The title shows what the true distribution is in the form [p(a,b) p(a,~b) ; p(~a,b) p(~a,~b)]. I chose this example because the joint has a correlation between a and b.

This example is fairly benign: as the approximations become worse, so do both of the recovered distributions, in a fairly linear way until a plateau. From the bottom row, you can see that it's more important to get the conditional right than the marginal (you can have a marginal that's quite far--eg., a KL of 1.5--and still get an almost perfect recovery of the conditional or joint, but this is not true for large differences in the conditional B|A.

One strange thing is that you often (for different true joints) see results that look like:

There's a very strange effect here, in which a larger kl on B|A can actually be better at the recovery of the conditional, while worse at the recovery of the joint.

One can ask if this is an artifact of KL. So let's switch to L1 and variational for the first set of plots:

and variational:

So, in both L1 land and variational land, you can do better on the conditional by being worse on the (other) conditional.

For the example that gave rise to the weird KL results, we have the following for L1:

which shows almost an identical effect. For variational:

the effect is still the same.

Okay, so it's totally entirely possible (perhaps probable?) that there's a bug in my code. If you'd like to look, check out mykl.m and myklrun.m (yes, it's matlab). Let me know in the comments if there are bugs. If you'd like to look at more examples, check out all ten examples.

During grad school, my advisor suggested I identify a recent grad who has been, to me, successful. I could then use him or her as a guide. I picked someone (he now knows who he is), and the exercise was useful: there are lots of ways to be successful in research land, and this helped me focus.

RST-relation=Topic-Shift.

I'm fairly serious about yoga. I've had a lot of instructors over the years and noticed a high correlation between InstructorILike and InstructorWhoIsMale. Initially I believed this was because male instructors pushed more, and that worked for me. Over time I realized that was not the full story.

I spent two weeks going to classes by instructors I hadn't had before to try to understand what variable(s) made the difference. I've believe now that a large part of the reason I like male instructors is precisely because they're male. A female instructor would do some crazy pose and my brain would immediately say "I could never do that." A male instructor would do the same pose and my brain would say "If he can do it, so can I." (I'd then try and fail several times, but never with a defeatist attitude.)

Topic-UnShift.

I've heard for a long time that having role models you can identify with is important. As someone who has in almost all of my life fit into the overwhelming majority (white male in tech/academia), it's been rare that I've had the opportunity to really feel this effect for myself. I try to believe things even if they haven't happened to me, but it's always better when you can empathize rather than sympathize and it's easier to empathize when you've actually been there.

The first time I remember feeling the effect of a role model "who looks like me" was at the 1996 Olympics and Poul-Erik Høyer Larsen (Denmark) was the first European to ever win the badminton semi-finals; he then won the gold medal against Dong Jiong (China). (This sport is dominated by Indonesia, China and Malaysia.) Growing up in a particular part of Los Angeles and playing badminton as a kid, I was very much an outlier. Even though I'd never heard for Poul-Erik before (everyone knew who Jiong was), his win gave me something I could aspire to.

A few years ago I began broadcasting my support of the LGBT community, e.g., an HRC link on my web page and painting my laptop. Since then I've gotten emails from several people (mostly students) effectively asking why there aren't more/any LGBT role models in our community. You can interpret "community" meaning anything in the NLP/ML to CS to Science/Tech range. My answer: I don't know. It's hard to even know how large this community is because, unlike things like race and (binary) gender, it's not always outwardly inferrable (with noise). These issues effect tech is nuanced ways; see for instance an interview with the founder of Lesbians Who Tech or Queer in STEM for more.

This is all to say that having role models is important, and yes, it does matter who they are, where they came from, and what they look like. It mattered to the high school aged version of me, the grad school version of me, and the associate prof version of me. I'm not saying anything new here, but for our field to be healthy, we need a large number of successful people who can be role models for all sorts of students (and beyond). Token visibility is not enough because a single example of some particular label won't match with everyone who self-identifies with that label. The person I chose was, yes, a while male. There were plenty to choose from. But I chose him, and others would not have sufficed.

I took part in a wonderful Dagstuhl workshop this past February on translating morphologically rich languages. (Yeah, I also don't really know why I was invited :P.) But many thanks to Alex, Kevin, Philipp, Helmut and Hans for inviting me. I had a realization during this workshop that I thought I'd share. It's obvious in retrospect, and perhaps in front-spect for many of you. Much of this came up in the discussion with Bonnie Webber, Marion Weller, Martin Volk, Marine Carpuat, Jörg Tiedemann and Maja Popovic, and Maja deserves much credit for her awesome error analysis tool that helped shed some light on German.

One thing you commonly think of when translating into a morphologically rich language is that there's stuff you're going to have to hallucinate. Really this isn't an issue of morphology per se, but just that this is one place where it's obvious. For instance, even going from English to French you'll have to hallucinate gender on your determiners (un versus une and le versus la) that's unmarked in English. Or when going from Japanese (which roughly combines present and future tenses into a single tense) to English, you'll have to hallucinate "will" at appropriate places.

An abstraction that I think was pretty widespread among the initial discussions in the workshop was that if you're going from language X to Y, there are basically two options:

Phenomenon foo is explicit in Y but implicit in X, and therefore you'll have to hallucinate it (i.e., tense is explicit in English but not in Mandarin)

Phenomenon bar is explicit in Y and also explicit in X, and so you can just copy it.

The problem is that (2) is just false, even for things that you think it might not be false for. That is to say: Just because two languages explicitly code for something that we give a consistent (linguistic) name to, doesn't mean that they code for it consistently.

Okay, so you want examples.

An easy example is gender. I've been well assured that, for instance, French and Russian both have explicit gender. But just because some noun (eg moon/lune) is feminine in French doesn't mean it's also feminine in Russian. (In fact I think it's neuter.)

You might argue gender is a stupid thing to pick because it's essentially an artificial encoding of who-knows-what.

How about tense. That clearly has a semantic interpretation (did something happen in the past, the present or the future) and so if languages X and Y both express some particular tense, they must be consistent in how they do it.

Wrong. Now my memory is getting a bit shaky, but my recollection is that, for instance, in newswire text, it's very common in German to refer to things that have happened in the past in present tense. To English speakers this is a strange convention (we tend to refer to such things in past tense), but it doesn't have to be so. And of course English has it's own idiosyncrasies: see the plight of the native German speaker who cannot understand English tense usage in (New Zealand) news articles.

Part of this is probably because tense, even in English, is a pretty slippery concept. We (native English speakers) have no problem using present (or progressive) tense to refer to things that happened in the present (John runs) the past (so yesterday I'm running to the store and a hamburger falls on my head!) or the future (my flight leaves at 8:00 tonight).

Another easy example is definiteness (thanks to Kevin Knight for this inspiration). Again, our high school English teachers tell us that "the" (+Definite) has to refer to something that's already been introduced into context. I just went to cnn.com, clicked on the very first link, and the first sentence is "The boss resigned under pressure and other Veterans Affairs managers are likely on the way out." Ok you could argue that "The boss" is already in the context of the US news media (this is an article about Shinseki) but it's nonetheless very common to see (English) entities introduced using "the" and the precise rules that govern this may or may not be consistent across other languages.

The long and short of this is: I like the fact that translation into morphologically rich languages makes us pay attention to linguistic divergence. But that doesn't mean that divergences aren't there even when languages express the same set of linguistically-named phenomena. Usage can vary dramatically, be it for conventions, socio-linguistic reasons, or other things that are hard to pin down. It's just that by focusing all our energy on a very particular convention (newswire, parliament), we can pretty easily learn these mappings because there's no variability. Add some variability and we're hosed, even for languages with the same set of (overt) markings.

It's fair to say that perplexity is the de facto standard for evaluating language models. Perplexity comes under the usual attacks (what does it mean? does it correlate with something we care about? etc.), but here I want to attack it for a more pernicious reason: it locks us in to probabilistic models.

Background

Language modeling---or more specifically, history-based language modeling (as opposed to full sentence models)---is the task of predicting the next word in a text given the previous words. For instance, given the history "Mary likes her coffee with milk and", a good language model might predict "sugar" and a bad language model might predict "socks." (This is related to the notion of cloze probability.)

It's quite clear that there is no "right answer" to any of these prediction problem. As an extreme example, given the history " The", there are any number of possible words that could go next. There's just no way to know what the "right" answer is, whether you're a machine or a person.

This is probably the strongest justification for a perplexity-like measure. Since there's no "right" answer, we'll let our learned model propose a probability distribution over all possible next words. We say that this model is good if it assigns high probability to "sugar" and low probability to "socks."

Perplexity just measures the cross entropy between the empirical distribution (the distribution of things that actually appear) and the predicted distribution (what your model likes) and then divides by the number of words and exponentiates after throwing out unseen words.

The Issue

The issue here is that in order to compute perplexity, your model must produce a probability distribution. Historically we've liked probability distributions because they can be combined with other probability distributions according to the rules of probability (eg., Bayes rule or chain rule). Of course we threw that out a long time ago when we realized that combining things, for instance, in log-linear models, worked a lot better in practice, if you had a bit of data to tune the weights of the log-linear models.

So the issue in my mind is that there's plenty of good technology out there for making predictions that does not produce probability distributions. I think it's really unfortunate that non-probabilistic approaches don't get to play the language modeling game because they produce the "wrong" sort of output (according to the evaluation, but not according to the real world). I'm not saying there aren't good reasons to like probabilistic models, but just that alternatives are good. And right now those alternatives cannot compete. (For instance, Roark, Saraclar and Collins [2007] don't use perplexity at all and just go for word error rate of a speech recognizer around their perceptron-based language model.)

When I Ran Into This

I was curious about building a language model using vw, in the context of another project, and also to stress-test multiclass classification algorithms that scale well with respect to the number of classes.

As soon as I ran it, I discovered the issue: it produced results in the form of error rates. As I recall (it was a while ago), the error rate was somewhere in the 60s or 70s. I had absolutely no idea whether this was good or not. It seemed reasonable.

To get a sense of how standard language models fare, I decided to train a language model using srilm, and evaluate it according to error rate. To make my life easier, I just ran it on the WSJ portion of the Penn treebank. I used the first 48k sentences as train and the last 1208 sentences as test. I trained a 5gram Kneser-Ney smoothed language model and evaluated both perplexity and error rate (the latter required a bit of effort -- if anyone wants the scripts, let me know and I'll post them---but basically, I just take the LM's prediction to be the highest probability word given the context).

The language model I built had a perplexity (ppl1 in srilm) of 236.4, which seemed semi-reasonable, though of course pretty crappy. There was an OOV rate of 2.5% (ignored in the perplexity calculation).

The overall error rate for this model was 75.2%. This means that it was only guessing a quarter of words correct. (Note that this includes the 2.5% errors mandated by OOVs.)

I also tried another version, where all the model had to do was put the words in the right order. In other words, it knows ahead of time the set of words in the sentence and just has to pick between those 20, rather than between the full vocabulary (43k types). [This is maybe semi-reasonable for MT.] The error rate under this setting was 66.8%. Honestly I expected it would be a lot better.

Note that if you always guess ","---the most frequent type in this data---your error rate is 95.3%.

So why was it only moderately helpful (< 10% improvement) to tell the language model what the set of possible words was? Basically because the model was always guessing really high probability unigrams. Below are the top ten predicted words when the model made an error with their frequencies. They're basically all stop words. (This is in the unrestricted setting.) 1 14722 , 2 1393 . 3 1298 the 4 512 and 5 485 in 6 439 of 7 270 to 8 163 '' 9 157 a 10 108 is 11 54 's 12 52 have 13 49 said 14 41 `` 15 38

16 38 for 17 38 are 18 34 % 19 33 $ 20 31 be

The same list for the restricted setting is virtually identical, basically because most of these words are available in the average sentence:

Ok, so I don't have a "good" counter proposal to perplexity. Error rate certainly has many issues of its own. You could use IR-like metrics, like recall @ 10, mean average precision, etc., which are all questionable in their own ways.

I would just, in general, like if we could evaluate language models without having to be handcuffed by probabilities.

I'll be honest: I've had my feelings hurt by scathing reviews more than a few times. In grad school I remember even crying over a review that I thought was particular pernicious. My skin has thickened a bit over time, though often in the not-so-helpful manner of dismissing reviews that I don't like as "they didn't get it," which defeats one of the two primary purposes of reviews in the first place (providing feedback; the other: making accept/reject decisions).

The thing that's hard to reconcile is that I really like most of the people in our community, and everyone I meet at least seems really friendly.

When doing mock reviews with grad students, I'll often tell them to keep in mind that there's a good chance that the author is, or later will be, a friend of theirs. It's possible to provide feedback to a friend in such a way that you don't hurt their feelings.

I've recently started doing something else (in addition to the above suggestion). I don't use the words "you" or "the authors" or even "I." The review of a scientific contribution is not about me and it's not about the authors. It's about the method, the experiments and the contribution. I see little reason why you need to mention anything related to the people involved. (One exception: "I" is often useful in hedging, like the previous sentence, which would be more forceful if I just said "There is little reason...") Perhaps we could even integrate this into START...

This is, of course, similar to the pop-psych advice of talking to loved ones about "actions" rather than "the person." For instance: "I hate you for spilling coffee and not cleaning it up" versus "I hate having coffee spilt on the floor." Or something. I'm sure others can come up with better examples.

My current approach is to write my review with this in mind, and then go back and search for all occurrences of my outlawed nouns, and rewrite these sentences. Often in the process of doing this, I become aware that in many of the cases what I've said really does sound like an attack, and with the very small edit this effect is removed or at least greatly reduced.

I realize I've now just given a pretty good signal for people reading reviews to see if they were written by me or not. Here's a solution: everyone should adopt this policy and then my reviews will no longer be so obvious.But overall, I really think we should be nice to each other. Perhaps fewer people will depart from the field if they're not constantly battered down by harsh reviews, and then we'll all be better off.

Okay, so I've had this file called emnlp.txt sitting in my home directory since Oct 24 (last modification), and since I want to delete it, I figured I'd post it here first. I know this is super belated, but oh well, if anyone actually reads this blog any more, you're the first to know how I felt 6 months ago. I wonder if I would make the same calls today... :)

I have a colleague who wants to look through large amounts of (text) data for examples of a pretty rare phenomenon (maybe 1% positive class, at most). We have about 20 labeled positive examples and 20 labeled negative examples. The natural thing to do at this point is some sort of active learning.

But here's the thing. We have no need for a classifier. And we don't even care about being good at finding negative examples. All we care about is finding as many positive examples from a fixed corpus as possible.

That is to say: this is really a find-a-needle-in-a-haystack problem. The best discussion I've found of this is Section 4 of Inactive Learning, by Attenberg and Provost. I came across a couple other papers that basically do some sort of balancing to deal with the imbalanced data problem in active learning, but the Attenberg and Provost results suggest that even this is tricky to get right.

But I think this only partially addresses the problem. Here's why.

Let's say that my colleague (call her Alice) is willing to spend one hour of her life looking at examples. Alice estimates that she can look at about 500 examples in that time period (small caveat: some examples are larger than other and therefore slower, but let's ignore this for now). Alice's goal is to find as many positive examples in that one hour as possible. Here are some possible strategies Alice could deploy

Look at 500 randomly selected examples, getting 5 (more) positive examples in expectation (likely somewhere between 2 and 12, with 95% confidence). This gives her a total of 25 positive examples

Train a classifier on her 40 labeled examples, have it rank the entire set (minus those 40). Then look at the top 500 examples. If the learned model has a recall of r% in the top 500, then she should expect to get 5*r more examples, giving her a total of 20+5r examples. (Caveat: this is a bit dangerous, because with very few training examples, results can often be anti-correlated with the desired output and you get better performance by flipping the predicted label. There was a paper on this maybe 5-10 years ago but I can't find it... Maybe somewhat knows what I'm referring to.)

Train a classifier, spend 30 minutes looking at 250 random examples, getting 2.5r more positive examples, then train another classifier, then spend 30 minutes looking at it's top ranked 250 examples. If the second classifier has a recall of r'% in the top 250, then she should expect to get another 2.5*r', and she'll have a total of 20+2.5r+2.5r' = 20 + 2.5(r+r') so long as r>r' this should be better.

Taking this to the extreme, Alice could annotate one example at a time (subject to constraints on either the learning being very fast or Alice not actually being a human). As long as the recall-at-one of the classifiers learned is non-decreasing, this should (I think) be the optimal strategy.

So it seems that in a setting like this, what the classifier should optimize for is simply recall-at-one. In other words, it doesn't care about anything except that it's top ranked output is correct.

Okay, so why is this difficult? Basically because we don't have that much data. If the corpus that represents the haystack is huge, then we want to ensure that the top-one example from that haystack is a positive example. So in principle even if all we're trying to do is select between two different hypotheses (say our hypothesis class at each time step has cardinality two) then in principle we should use as much data as possible to evaluate these two hypotheses. In particular, looking at the recall-at-one on the subset of 40 labeled examples that we have is probably not representative, essentially because this loss function doesn't decompose nicely.

So what could we do instead? Well, we could pool the labeled and unlabeled examples and predict on those instead. Certainly if one of the positive labeled examples outranked everything else, that would be a good thing. But if truly 1% of the unlabeled dataset is positive, then this is not a necessary condition. In particular, the top-ranked positive labeled example could be as far as 1% down the list and we could still (in principle) be doing a good job.

Ok I'll admit: I really don't know how to do this. Maybe it's enough in most cases to optimize for zero-one loss (or a weighted version thereof) and just cross your fingers that the recall-at-one is good. But it does feel like there should be a more direct way to go about this problem.

One of the highlights for me at NAACL was the *SEM panel on "Toward Deep NLU", which had the following speakers: Kevin Knight (USC/ISI), Chris Manning (Stanford), Martha Palmer (UC Boulder), Owen Rambow (Columbia) and Dan Roth (UIUC). I want to give a bit of an overview the panel, interspersed with some opinion. I gratefully acknowledge my wonderful colleague Bonnie Dorr for taking great notes (basically a transcript) and sharing them with me to help my failing memory. For what it's worth, this basically seemed like the "here's what I'm doing for DEFT panel" :).

Here's the basic gist that I got from each of the panel members, who gave roughly 10 minute talks:

Dan Roth: doing role labeling restricted to verbs is not enough. As an easy example, "John, a fast-rising politician, slept on the train to Chicago"... by normal SRL we get that John is sleeping, but not the possibly more important fact that John is a politician. Another example is prepositions: "University of Illinois" versus "State of Illinois" -- "of" is ambiguous. They came up with a taxonomy of 32 relations and labeled data and then did some learning -- see the TACL paper that was presented at NAACL, Srikumar & Roth.

Commentary: the ambiguity of prepositions issue is cool and I really liked the TACL paper. It reminds me of learning Latin in high school and being confused that ablative case markers were abiguous across from/by/with. It astounded me that that was an acceptable ambiguity, but of course English has equally crazy ones that I've just gotten used to. But it does make me think that some cross-linguistic study/model might be cool here. Even more broadly, it made me think about noun-noun compound semantics: "farmers market" (market put on by farmers) versus "fruit market" (market where you buy fruit) versus "fruit pie" (pie made out of fruit). I went back and read Lucy Vanderwende's dissertation, which dealt exactly with these issues. She had far fewer relations than Srikumar and Roth, though perhaps once you allow explicit prepositions the range of things you can express grows (though somehow my gut feeling is that it doesn't, at least in English).

Kevin Knight: Basically talked about their deep semantic approach to MT: see the abstract meaning representation web page for more. The idea is that people who work on syntax don't Balkanize into those who do PPs, those who do VPs, etc., so why should semantics break apart like it does. AMR is very GOFAI-style representation for language, and they've annotated a Chinese-English bilingual copy of Le Petite Prince with this representation. Now they need analyzers (hard), generators (hard) and transformation formalisms (hard). The nice thing is that this one representation captures almost all relevant semantic issues: scoping, argument structure, coreference, etc. For instance, co-ref is not explicitly annotated: it's just that a single agent can participate in multiple predicates. (Note: not yet across sentences.)

Commentary: It's hard not to get excited about this stuff, especially when Kevin talks about it. His enthusiasm is infectious. I left the talk thinking "wow I want to work on that!" There's of course the worry that we've tried this before and failed and that's why things in semantics Balkanized, but maybe the time is right to revisit it. For instance, Bonnie herself (note: she didn't tell me this; it had come up in recent discussions with Philip Resnik and our postdoc Junhui Li) had a meaning representation very similar to AMR called Lexical Conceptual Structures (LCS), and Nizar Habash had a hybrid rule-based/statistical approach to translating there. The idea was that if you want to handle divergent translations (classic example: "the bottle floated across the river" (English) versus "the bottle crossed the river floatingly" (Spanish, I think)), you need a representation that abstracts means from predicate. But it's still very cool. (Actually in digging up refs, I just found this paper on mapping from LCS to AMR... from AMTA 1998!)

Martha Palmer: focused mostly on event relations that go across sentences, which includes things like even coreference, bridging relations (enablement, result) and so on. They're also looking seriously at type (evidential, aspectual, etc.), modality (actual, hypothetical, generic, hedged, etc.), polarity and aspect. They are currently doing a lot of work in the clinical domain, in which these distinctions are really important if you want to understand, say, patient medical histories.

Commentary: this is a bit outside things I usually think about, so I have less to say. I really like the hyper-sentence view, of course.

Owen Rambow: talked about some of my favorite work that I've seen recently: basically work on propositional attitudes. The view Owen put forth is that most of NLP is focused on a world of facts, and the goal of NLU is to figure out what these facts are. They are taking a much more social model of text meaning, in which you really care about inferring partipants' cognitive states (standard triumvirate: belief, desire and intention). This actually shows up in at least one English-German translation example, essentially in which Google translate misses a very important subjunctive.

Commentary: I really liked the original work Owen did on BDI inference and I'm thrilled it's going further. I think one of the historical reasons why I find this so interesting is that propositional attitudes is basically what I started doing when I started grad school, when looking at discourse analysis through RST. I think many people forget this, but the discourse relationships in RST (and other discourse theories) are really based on attitude. For instance, X is in a background relation to Y if (roughly) the listener already believes X and the listener also believes that X increases the chance of Y. (Or something like that: I just made that up :P.) But it's all about belief of listeners and utterers.

Chris Manning: focused on deep learning, basically asserting (in a manner designed to be a bit controversial) that Stanford dependencies are their meaning representation and that the big problems aren't in representations. Sure, Stanford dependencies miss out on a lot (quanitification, tense, semantic roles, modality, etc.) but he felt that there are more important problems to address. And then what we need instead is "soft" meaning representations, like vector space models and distributed representations give us. Giving rise to something akin to Natural Logic.

Commentary: to a large degree I agree with the notion that the "big problems" in language are probably not those that (eg) semanticists like to look at, at least from the typical view of NLE in which we want systems that do well on average across a distribution of examples that we've cultivated. But I also worry that there's a bit of magical thinking here, in the sense that it kind of feels like a cop-out: it's too hard to define categories by hand so let's let the machine figure it out. Now, don't get me wrong, I'm all for machines figuring out stuff (I gave a not-very-well-received talk to that degree at a workshop a couple years ago on linguistics in NLP), but I'm also a bit reticent to believe that this is really going to bring us any closer to really solving the NLU problem (whatever that is), though of course it will get us another 5-10% in standard benchmarks. (Ok this sounds way too negative: I actually really liked Chris' talk, and one of the things I liked about it was that it challenged my thinking. And I agree that there is a lot that we shouldn't be designing by hand -- some people, like Yoshua Bengio, would probably argue that we shouldn't be designing anything by hand, or at least that we shouldn't have to -- but I guess I still belong to the camp of "linguists give the structure, statistics gives the parameters.")

There was also a lot of really interesting discussion after the presentations, some of which I'll highlight below:

Lucy Vanderwende, I think mostly directed at Kevin, fell into the "we tried this X years ago" camp, basically said that whenever they tried to abstract more and more from the input representation, you ended up getting very boring sentences generated because you'd thrown out all the "nuance" (my word, not hers). The discussion afterward basically revolved about whether you annotate input sentences with meaning (which is currently the standard) or throw them out with the bathwater. Owen points out that the meaning of a passive sentence is not +passive but something much more nuanced, and if you could capture that correctly, then (in principle) generators could reflect it properly in the target language. (Me: For instance, maybe in some wacky language a passive sentence actually means that you're trying to emphasize the subject.)

There was also a lot of discussion around Chris, I think partially because he went last and partially because he was trying to be controversial. Mausam made an argument (akin to what I wrote above) that logicians have made a billion logics of language and nothing really has worked (in a sense it's been a series of negative results). What about inference rules or consistency?

Okay, that's all I want to write for now. Congrats if you made it this far. And thanks to the *SEM organizers for putting together this great panel!

I feel a bit odd doing my "what I liked at NAACL 2013" as one of the program chairs, but not odd enough to skip what seems to be the most popular type of post :). First, though, since Katrin Kirchhoff (my co-chair) and I never got a chance to formally thank Lucy Vanderwende (the general chair) and give her flowers (or wine or...) let me take this opportunity to say that Lucy was an amazing general chair and that working with her made even the least pleasant parts of PCing fun. So: thanks Lucy -- I can't imagine having someone better to have worked with! And all of the rest of you: if you see Lucy and if you enjoyed NAACL, please thank her!

I also wanted to really thank Matt Post for doing the NAACL app -- lots of people really liked it and I hope we do it in the future. I'm at ICML now and constantly wishing there were an ICML app :).

Okay, with that preface, let's get down to what you came for. Below is the list of my (complete) list of favorite papers from NAACL 2013 (also indexed on Braque) in no particular order:

Relation Extraction with Matrix Factorization and Universal Schemas (N13-1008 by Sebastian Riedel; Limin Yao; Andrew McCallum; Benjamin M. Marlin)Very cool paper. The idea is to try to jointly infer relations (think OpenIE-style) across text and databases, by writing everything down in a matrix and doing matrix completion. In particular, make the rows of this matrix equal to pairs of entities (Hal,UMD and UMD,DC-area) and the columns relations like "is-professor-at" and "is-located-in." These entity pairs and relations come both from free text and databases like FreeBase. Fill in the known entities and then think of it as a recommender system. They get great results with a remarkably straightforward approach. Reminds me a lot of my colleague Lise Getoor's work on multi-relational learning using tensor decompositions.

Combining multiple information types in Bayesian word segmentation (N13-1012 by Gabriel Doyle; Roger Levy)I guess this qualifies as an "obvious in retrospect" idea -- and please recognize that I see that as a very positive quality! The basic idea is that stress patterns (eg trochees versus iambs) are very useful for kids (who apparently can recognize such things at 4 days old!) and are also very useful for word segmentation algorithms.

Learning a Part-of-Speech Tagger from Two Hours of Annotation (N13-1014 by Dan Garrette; Jason Baldridge)Probably my overall favorite paper of the conference, and the title says everything. Also probably one of the best presentations I saw at the conference -- I can't even begin to guess how long Dan spent on his slides! I loved the question from Salim in the Q/A session, too: "Why did you stop at two hours?" (They have an ACL paper coming up, apparently, that answers this.) You should just read this paper.

Automatic Generation of English Respellings (N13-1072 by Bradley Hauer; Grzegorz Kondrak)This paper was the recipient of the best student paper award and, I thought, really great. It's basically about how English (in particular) has funny orthography and some times it's useful to map spellings to their pro-nun-see-ey-shuns, which most people find more useful than . It's a bit more of a bunch of stuff glued together than I usually go for in papers, but the ideas are solid and it seems to work pretty well -- and I'd never even thought this would be something interesting to look at, but it makes complete sense. Best part of presentation was when Greg tripped up pronouncing some example words :).

Linguistic Regularities in Continuous Space Word Representations (N13-1090 by Tomas Mikolov; Wen-tau Yih; Geoffrey Zweig) This is a paper that makes my list because it made me think. The basic idea is that if you do some representation learning thingamajig and then do vector space algebra like repr("King") - repr("man") + repr("woman") you end up with something that's similar to repr("Queen"). It's a really interesting observation, but I'm at a loss for why we would actually expect something like this to happen!

PPDB: The Paraphrase Database (N13-1092 by Juri Ganitkevitch; Benjamin Van Durme; Chris Callison-Burch)This is a paper about a dataset release that I think I'll find useful and I bet other people will too. Go download it and play with it. I'd encourage the authors (are you listening, Juri!) to make a web demo (or web service) so that I don't need to go through the pain of getting it all set up to see if it might be useful for me.

Supervised Learning of Complete Morphological Paradigms (N13-1138 by Greg Durrett; John DeNero) Basic idea: college morphological paradigms from Wiktionary and then train a supervised system to generalize from those to novel words. Works remarkably well and the model is well thought out. Plus I like papers that take morphology seriously: I wish we saw more stuff like this in NAACL.

And though I don't often do this, I have to mention the following paper because although I didn't see the talk or read the paper, enough independent people pointed it out to be as great that I figured I'd mention it:

Improved Reordering for Phrase-Based Translation using Sparse Features (N13-1003 by Colin Cherry)

Sparsity has been all the rage for a couple of years now. The standard notion of "sparse" vector u is that the number of non-zeros in u is small. This is simply the l_0 norm of u, ||u||_0. This norm is well studied, known to be non-convex, and often relaxed to the l_1 norm of u, ||u||_1: the sum of absolute values. (Which has the nice property of being the "tightest" convex approximation to l_0.)

In some circumstances, it might not be that most of u is zero, but simply that most of u is some fixed scalar constant a. The "non-constant" norm of u would be something like "the number of components that are not equal to some a" or, in more mathy terms, "min_a ||u-a||_0". I first came across it this in the Differentiable sparse coding paper by Bradley and Bagnell (NIPS 2008), where they claim physicists call it "Maximum entropy on the mean" if you're using a KL distance, but I don't know if there's a name for it in the l_0 sense, so I'l call it "l_0 from the mode" which seems to capture the right notion.

In many cases, we want to use a sparse norm to define a notion of a sparse distance function between two vectors u and v. For instance d_0(u,v) = ||u-v||_0 would be a reasonable notion of a sparse difference between two vectors.

Now, let's suppose that u and v are probability distributions over a finite event space, so that u_i is the probability of event i. In this case, both u and v belong to the simplex: u_i >= 0 for all i, and sum_i u_i = 1 (which is equivalent to saying ||u||_1 = 1, given the first condition).

It seems natural to ask ourselves whether there is a reasonable notion of sparse difference between two probability distributions. This is something I've thought about, but not come up with a satistfactory answer. The "natural" thing would be to define d_0(p,q) = ||p-q||_0, where I've switched from u/v to p/q to emphasize that these are distributions. This is certainly reasonable in some sense, but does not correspond to how I usually think of probability distributions, essentially because it ignores the "normalization" effect.

As a simple example, let's suppose that I generate a data set by rolling a dice with sides A-F. Suppose it comes up A five times, B twice, C once and F twice. Then my maximum likelihood estimate for p would be [.5, .2, .1, 0, 0, .2]. If I had a different dataset, in which exactly the same thing happened except I got one roll of D in addition to all the other rolls, I would estimate the distribution here as q = [.45, .18, .09, .09, 0, .18], which differs everywhere except the "E" position. The d_0 distance between these two distributions would be 4, which intuitively seems "wrong.

One possible way around this would be to allow us to rescale the distributions before computing the l_0 distance, something like d(p,q) = min_(a,b > 1) || ap - bq ||_0. When talking about l_0 as the underlying norm, this can be simplified to min_(0 < a < 1) || ap - (1-a)q ||_0. Note that the inequalities on "a" are necessary. With equalities, you could "erase" all of p (or q) by setting a to zero (or 1), which is definitely not desirable. Fortunately, the optimal "a" can be computed in linear time in the dimensionality: it's just the value that maximizes the number of i for which ap_i = (1-a)q_i. This definitlon captures the example above and gives a distance of "1" (corresponding to a = mode(q / (p+q)) = 10/21 = 0.4797).

We can attempt to make this definition more formal by invoking the notion of exponential families. Recall that an expfam distribution has the form log p(x;t) = t*f(x) - A(t) + B(x), where t are the natural parameters, f is the vector of sufficient statistics, * denotes dot product, and A() is the log partition function. A quick observation is that depending on how f looks, there might be lots of t that give rise to the same probability distribution. If we guarantee that the fs are always linearly independent, then this representation is called "minimal"; otherwise it is "overcomplete." (Formally, it is overcomplete if there exists a non-zero vector a for which a*f(x) is constant for all x.)

For a categorical distribution, an overcomplete representation would be a six dimensional indicator of which side of the die came up, and the natural parameters would be the (log of the) ps and qs shown above (it's overcomplete because one component is completely determined by the other 5). For example, the example p above would have exponential form with t=log[.5/.8, .2/.8, .1/.8, 0, 0] and log partition function A(t) = log 1.8. (See wikipedia's variant 3 for categorical distributions.) and q would have t=log[.45/.82, .18/.82, .09/.82, .09/.82, 0] with A(t) = log 1.82.

Supposing that we're working with a minimal representation, what I think this amounts to is that we want the difference in natural parameters to be sparse in the "l_0 from the mode" sense -- the difference is equal to a constant. In the above example, the difference in sufficient statistics is equal to 0.1301 in the first three components, -Infinity in the fourth, and something like zero in the last (who knows -- NaN?) which gives a distance of one (if you discard the last one).

The problem here is that there is not a unique minimal representation; I could also have chosen t=log[.2/.5, .1/.5, 0, 0, .2/.5] and A=log 1.5 for p and t=log[.18/.55 .09/.55 .09/.55 0/.55 .18/.55] and A=log 1.55 for q. In this case, the mode of the difference is 0.2007 (in the first, second and last components) there's one -Infinity and one NaN. So the answer works out the same, at least if you know how to deal with the NaNs.

There's an obviously needed lemma here: that any minimal representation will give you the same distance. I haven't thought nearly long enough about why this should be true (i.e., I've thought about if for about 30 seconds :P). There's also a lot of other questions: what if you relax this notion of sparsity to something more l_1-like? What other properties does it have? And of course is there a better notion of sparse difference of probability distributions for this or other circumstances?

I had a post a while back on teaching ML that I still basically stand by. I've taught intro grad NLP here at UMD twice now, and a sort-of-similar-course back at Utah once. I find these courses really hard to teach. And not for the usually bemoaned reason of the CS/linguistics mix -- I think it's possible to deal with that, and certainly it's an issue that's been talked about a lot.

What I find difficult is that NLP (and CL) is a collection of problems, techniques, ideas, frameworks, etc. that really are not tied together in any reasonable way other than the fact that they have to do with NLP. Even if you manage to answer questions about "what sort of topics are most interesting?" you're still faced with this problem that every time you switch topics, the entire context in which you're discussing them changes. This is exacerbated by the problem that things like tagging and parsing are hopelessly boring (in comparison to all the cool interesting stuff in NLP these days), but yet so many modern ideas are based on understanding basic dynamic programming for tree structures and things like that.

To make things a bit more concrete, a standard intro NLP class might start with morphology. Okay, so you have to explain what morphemes are and why they're important. Now, you probably will take a finite state approach, so you have to explain transducers. If you want these things to work, you have to explain weighted transducers. Do you do probabilities, in which case there's the whole local vs global normalization stuff that takes more time? So now you want to do POS tagging or something. Fine, you can do that with finite state models too. But no one actually does this any more (except lingpipe :P). So you have to explain POS stuff, perhaps how this works in non-English, and then you can leave them with HMMs (maybe talking about Viterbi algorithm) or do lots of ML so you can get to CRFs or structured perceptron or something. And we're still at POS tagging. Now you switch to parsing. Back to square one. And then you want to do compositional semantics, now there's lots more structure, lots more features and so on. But even now things are at least somewhat connected. But then you talk about lexical semantics: be it distributed representations or WSD or whatever, but the problem is new, the techniques are new (do you teach Yarowsky?), the evaluation is now and so on.

I think it's worth contrasting this with ML. I find ML remarkably easy to teach (so I'm flipping the classroom this coming Spring for the UG version to make it more exciting) despite the fact that the material is (in many ways) much harder for CS types. The thing that is nice about ML is that the problem is basically always the same (or at least changes only once, when you switch from supervised to unsupervised). In that sense, ML tends to be a course about techniques for a relatively fixed problem (or at least fixed problem type). This makes for significantly less context switching, which makes learning easier (and thereby makes teaching easier).

So the question I wanted to ask is: can we do something similar in NLP. The crazy idea that I'm sure everyone will say is insane is the following: teach NLP as a course about what you can do with log-linear models. Here's how I envision it. You spend the first day talking about NLP and why data is important, ambiguity, etc, just like normal. You spend the next two days explaining enough about log linear models that you can treat them as given for the rest of the semester. Maybe you tell how to optimize them by gradient descent or something, but basically enough that anyone who is simultaneously taking ML will get more out of it, but those that are not are fine with LL models as a black box.

Now, when you teach different topics, the framework in which you discuss them is the same. You have a structured problem (which forces you to talk about algorithms like Viterbi or CKY) with interesting ambiguities (which forces you to talk about features). Then, the class essentially becomes a sequence of problems, associated algorithms and relevant features. The rest is left as a black box, which can be provided off the shelf for programming projects, and they can focus on the interesting and more NLP-ish problems of algorithms and features. You could even start with something like sentiment classification (at a document level) to make the beginning gentle.

I realize there are some things you couldn't do this way, or would be very awkward to do this way. Anything generative or unsupervised, which often go together. For instance, word alignment via the IBM models won't fit. Topic models won't fit (though I don't usually do them anyway -- maybe I should). Probably there are some other things too.

Anyway, I'd be curious to hear what people think of this idea. I know it's biased by my own view of the world, but hey -- that's why I'm a professor (or at least why I assist professors...). Or if anyone has tried it.

NIPS is over as of last night. Overall I thought the program was strong (though I think someone, somewhere, is trying to convince me I need to do deep learning -- or at least that was the topic d'jour... or I guess d'an? this time). I wasn't as thrilled with the venue (details at the end) but that's life. Here were some of the highlights for me, of course excluding our own papers :P, (see the full paper list here)... note that there will eventually be videos for everything!

User-Friendly Tools for Studying Random MatricesJoel A TroppThis tutorial was awesome. Joel has apparently given it several times and so it's really well fine-tuned. The basic result is that if you love your Chernoff bounds and Bernstein inequalities for (sums of) scalars, you can get almost exactly the same results for (sums of) matrices. Really great talk. If I ever end up summing random matrices, I'm sure I'll use this stuff!

Emergence of Object-Selective Features in Unsupervised Feature LearningAdam Coates, Andrej Karpathy, Andrew Y. NgThey show that using only unlabeled data that is very heterogenous, some simple approaches can pull out faces. I imagine that some of what is going on is that faces are fairly consistent in appearance whereas "other stuff" often is not. (Though I'm sure my face-recognition colleagues would argue with my "fairly consistent" claim.)

Scalable nonconvex inexact proximal splittingSuvrit SraI just have to give props to anyone who studies nonconvex optimization. I need to read this -- I only had a glance at the poster -- but I definitely think it's worth a look.

FastEx: Hash Clustering with Exponential FamiliesAmr Ahmed, Sujith Ravi, Shravan M. Narayanamurthy, Alexander J. SmolaThe idea here is to replace the dot product between the parameters and sufficient statistics of an exp fam model with an approximate dot product achieved using locality sensitive hashing. Take a bit to figure out exactly how to do this. Cool idea and nice speedups.

Identifiability and Unmixing of Latent Parse TreesDaniel Hsu, Sham M. Kakade, Percy LiangShort version: spectral learning for unsupervised parsing; the challenge is to get around the fact that different sentences have different structures, and "unmixing" is the method they propose to do this. Also some identifiability results.

Multiple Choice Learning: Learning to Produce Multiple Structured OutputsAbner Guzman-Rivera Dhruv Batra Pushmeet KohliOften we want our models to produce k-best outputs, but for some reason we only train them to produce one-best outputs and then just cross our fingers. This paper shows that you can train directly to produce a good set of outputs (not necessarily diverse: just that it should contain the truth) and do better. It's not convex, but the standard training is a good initializer.

[EDIT Dec 9, 11:12p PST -- FORGOT ONE!]Query Complexity of Derivative-Free OptimizationKevin G. Jamieson, Robert D. Nowak, Benjamin RechtThis paper considers derivative free optimization with two types of oracles. In one you can compute f(x) for any x with some noise (you're optimizing over x). In the other, you can only ask whether f(x)>f(y) for two points x and y (again with noise). It seems that the first is more powerful, but the result of this paper is that you get the same rates with the second!

I didn't see it, but Satoru Fujishige's talk Submodularity and Discrete Convexity in the Discrete Machine Learning workshop was supposedly wonderful. I can't wait for the video.

I'm sure there was lots of other stuff that was great and that I missed because I was skiing working hard on NAACL.

Really my only gripe about NIPS this year was the venue. I normally wouldn't take the time to say this, but since we'll be enjoying this place for the next few years, I figured I'd state what I saw as the major problems, some of which are fixable. For those who didn't come, we're in Stateline, NV (on the border between CA and NV) in two casinos. Since we're in NV, there is a subtle note of old cigarette on the nose fairly constantly. There is also basically nowhere good to eat (especially if you have dietary restrictions) -- I think there are a half dozen places on yelp with 3.5 stars or greater. My favorite tweet during NIPS was Jacob Eisenstein who said: "stateline, nevada / there is nothing but starbucks / the saddest haiku". Those are the "unfixables" that make me think that I'll think twice about going to NIPS next year, but of course I'll go.

The things that I think are fixable... there was no where to sit. Presumably this is because the casino wants you to sit only where they can take your money, but I had most of my long discussions either standing or sitting on the ground. More chairs in hallways would be much appreciated. There was almost no power in rooms, which could be solved by some power strips. The way the rooms divided for tutorials was really awkward, as the speaker was clear on one side of the room and the screen was on the other (and too high to point to) so it was basically like watching a video of slides online without ever seeing the presenter. Not sure if that's fixable, but seems plausible. And the walls between the workshop rooms were so thin that often I could hear another workshop's speaker better than I could hear the speaker in the workshop I was attending. And the internet in my hotel room was virtually unusably slow (though the NIPS specific internet was great).

I'll warn in advance that this is probably one of the more controversial posts I've written, but realize that my goal is really to raise questions, not necessarily give answers. It's just more fun to write strong rhetoric :).

The thing I especially like about this paper is that it's not a complaint (like this blog post!) but rather a thoughtful critique of how one could do this sort of research right. This includes actually looking at what has been done before (political scientists have been studying this issue for a long time and perhaps we should see what they have to say; what can we do to make our results more reproducible; etc).

For me, personally, this goes back to my main reviewing criteria: "what did I learn from this paper?" The problem is that in the extreme, cartoon version of a data porn paper (my 1-4 list above), the answer is that I learned that machine learning works pretty well, even when asked to do arbitrary tasks. Well, actually I already knew that. So I didn't really learn anything.

Now, of course, many data porn-esque papers aren't actually that bad. There are many things one can do (and people often do do) that make these results interesting:

Picking a real problem -- i.e., one that someone else might actually care about. There's a temptation (that I suffer from, too) of saying "well, people are interested in X, and X' is kind of like X, so let's try to predict X' and call it a day." For example, in the context of looking at scientific articles, it's a joy in many communities to predict future citation counts because we think that might be indicative of something else. I've certainly been guilty of this. But where this work can get interesting is if you're able to say "yes, I can collect data for X' and train a model there, but I'll actually evaluate it in terms of X, which is the thing that is actually interesting."

Once you pick a real problem, there's an additional challenge: other people (perhaps social scientists, political scientists, humanities researchers, etc.) have probably looked at this in lots of different lights before. That's great! Teach me what they've learned! How, qualitatively, do your results compare to their hypotheses? If they agree, then great. If they disagree, then explain to me why this would happen: is there something your model can see that they cannot? What's going on?

On the other hand, once you pick a real problem, there's a huge advantage: other people have looked at this and can help you design your model! Whether you're doing something straightforward like linear classification/regression (with feature engineering) or something more in vogue, like building some complex Bayesian model, you need information sources (preferably beyond bag of words!) and all this past work can give you insights here. Teach me how to think about the relationship between the input and the output, not just the fact that one exists.

In some sense, these things are obvious. And of course I'm not saying that it's not okay to define new problems: that's part of what makes the world fun. But I think it's prudent to be careful.

One attitude is "eh, such papers will die a natural death after people realize what's going on, they won't garner citations, no harm done." I don't think this is all together wrong. Yes, maybe they push out better papers, but there's always going to be that effect, and it's very hard to evaluate "better."

The thing I'm more worried about is the impression that such work gives from our community to others. For instance, I'm sure we've all seen papers published in other venues that do NLP-ish things poorly (Joshua Goodman has his famous example in physics, but there's tons more). The thing I worry is that we're doing ourselves a disservice as a community to try to claim that we're doing something interesting in other people's spaces, without trying to understand and acknowledge what they're doing.

NLP obviously has a lot of potential impact on the world, especially in the social and humanities space, but really anywhere that we want to deal with text. I'd like to see ourselves set up to succeed there, by working on real problems and making actual scientific contributions, in terms of new knowledge gathered and related to what was previously known.

Since my job as a blogger is to express my opinion about things you don't want to hear my opinion about, I wish they'd select fewer workshops. I've always felt that NIPS workshops are significantly better than *ACL workshops because they tend to be workshops and not mini-conferences (where "mini" is a euphemism for non-selective :P). At NIPS workshops people go, really talk about problems and it's really the best people and the best work in the area. And while, yes, it's nice to be supportive of lots of areas, but what ends up happening is that people jump between workshops because there are too many that interest them, and then you lose this community feeling. This is especially troubling when workshops are already competing with skiing :).

Anyway, with that behind me, there are a number that NLP folks might find interesting:

Personalizing Education With Machine Learning: I don't think there's much in the way of NLP in personalizing education, but I see this as a potential big avenue for large breakthroughs in NLP in the coming 5 years, especially as we start talking about automated grading in MOOCs and whatnot. ETS folks, I hear you!

With the deadlines so close I don't imagine anyone's going to be submitting stuff that they just started, but if you have things that already exist, NIPS is fun and it would be fun to see more NLPers there!

Like many people, I spent last week in lovely Montreal (at least lovely for the second half) at NAACL. Despite the somewhat unfortunate submission date, I thought the program this year was quite good. Of course I didn't see every talk and haven't read many of the papers yet, but I figured I'd point out what I saw that I liked and other people can comment likewise.

Identifying High-Level Organizational Elements in Argumentative Discourse (Madnani, Heilman, Tetreault, Chodorow). This is maybe one of the first discourse papers I've seen where I actually believe that they have a chance of solving the problem that they've set out. Here, the problem is separating the meat (content of an essay) from the shell (the bits of discourse that hold the meat together). It's a cool problem and their solution seems to work well. Very nice paper. (And Nitin's talk was great.)

Risk Training of Approximate CRF-Based NLP Systems (Stoyanov, Eisner). This paper is basically about training approximate models based on some given loss function. Reminds me a lot of the Ross et al. CVPR 2011 paper on Message-Passing. It's a cool idea, and there's software available. Being me, the thing I wonder the most about is whether you can achieve something similar being completely greedy, and then whether you need to do all this work to get a good decision function. But that's me -- maybe other people like CRFs :).

MSR SPLAT, a language analysis toolkit (Quirk, Choudhury, Gao, Suzuki, Toutanova, Gamon, Yih, Cherry, Vanderwende). This is a demo of a system where you send them a sentence and they tell you everything you want to know about it. Never run your own parser/NER/etc. again. And, having see it in action at MSR, it's fast and high quality.

Parsing Time: Learning to Interpret Time Expressions (Angeli, Manning, Jurafsky). This was a great paper about semantic interpretation via compositional semantics (something sort of like lambda calculus) for time expressions. I cannot find myself getting super jazzed up about time, but it's a nice constrained problem and their solution is clean. I'm actually thinking of using something like this (or a subset thereof) as a course project for the intro CL course in the Fall, since I'm always starved for something to do with compositional semantics.

Unsupervised Translation Sense Clustering (Bansal, DeNero, Lin). If you want to build a bilingual dictionary from parallel text, you need to cluster translations into senses. Here's a way to do it. Nice results improving using bilingual contexts, which was nice to see.

I also feel like I should raise my glass to the organizers of NLP Idol and congrats to Ray for winning with Robert Wilensky's paper "PAM." (If anyone can find an online version, please comment!) Though I would actually encourage everyone to read all three papers if you haven't already. They all changed how I was thinking about problems. Here are the others: Burstiness (Church), Context (Akman), Suppertagging (Bangalore, Joshi).