14 August 2016

A conference just ended, so it's that time of year! Here are some papers I liked with the usual caveats about recall.

Before I go to the list, let me say that I really, really enjoyed ACL this year. I was completely on the fence about going, and basically decided to go only because I was giving a talk at Repl4NLP and wanted to attend the business meeting for the discussion of diversity in the ACL community, led by Joakim Nivre with an amazing report that he, Lyn Walker, Yejin Choi and Min-Yen Kan put together. (Likely I'll post more, separately, about both of these; for the latter, I tried to transcribe much of Joakim's presentation.)

All in all, I'm supremely glad I decided to go: it was probably my favorite conference in recent memory. This was not just because there were lots of great papers (there were!) but also because somehow it felt more like a large community conference than others I've attended recently. I'm not sure what made it like this, but I noticed it felt a lot less clique-y than NAACL, a lot more broad and interesting than ICML/NIPS (though that's probably because of my personal taste in research), and in general a lot friendlier. I don't know what the organizers did to manage this combination, but it was great!

I like this paper because it has a nice solution to a problem I spent a year thinking about on and off and never came up with. The problem is: suppose you're training a discriminative MT system (they're doing neural; that's essentially irrelevant). You usually have far more monolingual data than parallel data, and the monolingual data typically gets thrown away in neural systems because we have no idea how to incorporate it (other than as a feature, but that's blech). What they do here is, assuming you have translation systems in both directions, back-translate your monolingual target-side data, and then use that faux parallel data to train your MT system. The obvious question is: how much of the improvement in performance is due to language modeling versus some weird kind of reverse self-training? But regardless of the answer, this is a really cool (if somewhat computationally expensive) answer to a question that's been around for at least five years. Oh, and it also works really well.
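To make the trick concrete, here's a minimal sketch of the back-translation loop; `translate_tgt_to_src` and the toy reverse "translator" are hypothetical stand-ins of my own, not anything from the paper:

```python
# Minimal sketch of back-translation for data augmentation.
# `translate_tgt_to_src` stands in for a reverse-direction MT system;
# both names below are hypothetical, not from the paper.

def back_translate(monolingual_tgt, translate_tgt_to_src):
    """Turn monolingual target-side sentences into faux parallel pairs."""
    faux_parallel = []
    for tgt_sentence in monolingual_tgt:
        src_guess = translate_tgt_to_src(tgt_sentence)  # noisy source side
        # Pair the (possibly noisy) source with the *real* target sentence,
        # so the decoder still trains on genuine target-language text.
        faux_parallel.append((src_guess, tgt_sentence))
    return faux_parallel

# Toy reverse "translator": reverse word order, just to make this runnable.
toy_reverse_mt = lambda s: " ".join(reversed(s.split()))
pairs = back_translate(["the walrus sleeps"], toy_reverse_mt)
```

The point is that the faux pairs get mixed into the real parallel data; only the source side is synthetic, so the target-side language-modeling signal stays clean.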

I didn't see this paper presented, but it was suggested to me at Monday's poster session. Suppose we're trying to learn representations of adjective/noun pairs by modeling nouns as vectors and adjectives as matrices, evaluating on unseen pairs only. (Personally I don't love this style, but that's incidental to the main ideas in this paper.) This paper adjusts the adjective matrices depending on whether they're being used literally ("sweet candy") or metaphorically ("sweet dreams"). But then you can go further and posit that there's another matrix that can transform literal adjective matrices into metaphorical ones automatically, essentially implementing the Lakoff-style notion that there is great consistency in how metaphors are created.
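A tiny numpy sketch of the setup, with invented dimensions and random parameters standing in for learned ones (`M`, the shared literal-to-metaphorical map, is the hypothetical extra matrix):

```python
import numpy as np

# Sketch: nouns as vectors, adjectives as matrices, and one extra
# shared matrix M mapping a literal adjective matrix to its
# metaphorical use. All numbers are made up for illustration.

d = 3
rng = np.random.default_rng(0)
noun = {"candy": rng.normal(size=d), "dreams": rng.normal(size=d)}
sweet_literal = rng.normal(size=(d, d))   # learned per-adjective matrix

M = rng.normal(size=(d, d))               # shared literal -> metaphorical map
sweet_metaphorical = M @ sweet_literal    # derived, not separately learned

literal_phrase = sweet_literal @ noun["candy"]         # "sweet candy"
metaphor_phrase = sweet_metaphorical @ noun["dreams"]  # "sweet dreams"
```

The appeal is parameter sharing: one `M` for all adjectives, rather than a separate metaphorical matrix per adjective.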

This paper should win some sort of award for thoroughness. The idea is that many frames ("The walrus pummelled the sea squirt") carry implied connotation/polarity/etc., not only about the agent (walrus) and theme (sea squirt) of the frame, but also about the relationship between the writer/speaker and the agent/theme (the writer might be closer to the sea squirt in this example, versus s/pummelled/fought/). The connotation frame for pummelled collects all this information. The paper also describes an approach to predicting these complex frames using nice structured models. I totally want to try this stuff on our old plot units data, where we had a hard time getting even a much simpler type of representation (patient polarity verbs) to work!

This was perhaps my favorite paper of the conference, because it's trying to do something new and hard and takes a nice approach. At a high level, suppose you're Facebook and you're trying to improve your translation system, so you ask users to give 1-star to 5-star ratings. How can you use this to do better translation? This is basically the (structured) contextual bandit feedback learning problem. This paper approaches it from a dueling bandits perspective, where I show you two translations and ask which is better. (Some of the authors had an earlier MT-Summit paper on the non-dueling problem, which I imagine many people didn't see, but you should read it anyway.) The technical approach is basically probabilistic latent-variable models, optimized with gradient descent, with promising results. (I also like this because I've been thinking about similar structured bandit problems recently too :P.)
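The paper's actual model is more involved, but the core pairwise-feedback idea can be sketched with a simple Bradley-Terry-style model trained by gradient ascent (all data and names here are invented, not the paper's):

```python
import numpy as np

# Toy sketch of learning from dueling feedback: a Bradley-Terry-style
# model where P(a beats b) = sigmoid(s_a - s_b), fit by gradient ascent
# on the log-likelihood of observed duels. Purely illustrative.

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

scores = np.zeros(3)   # latent quality of 3 candidate translations
duels = [(0, 1), (0, 2), (1, 2), (0, 1), (0, 2)]  # (winner, loser) pairs
lr = 0.5
for _ in range(200):
    for w, l in duels:
        p = sigmoid(scores[w] - scores[l])  # model's P(winner beats loser)
        g = 1.0 - p                         # d log-likelihood / d (s_w - s_l)
        scores[w] += lr * g
        scores[l] -= lr * g

ranking = list(np.argsort(-scores))         # best candidate first
```

Pairwise judgments like these are much easier to elicit from users than calibrated absolute ratings, which is the appeal of the dueling setup.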

[EDIT 14 Aug 2:40p: I misunderstood from the talk and therefore the following is basically inaccurate. I'm leaving this description and paper here on the list because Yoav's comment will make no sense otherwise, but please understand that it's wrong and, I hate to say this, it does make the paper less exciting to me. The part that's wrong is struck-out below.] There's a theme in the past two years of basically repeating all the structured prediction stuff we did ten years ago on our new neural network technology. This paper is about using Collins & Roark-style incremental perceptron for transition-based dependency parsing on top of neural networks. The idea is that label-bias is perhaps still a problem for neural network dependency parsers, like their linear grandparents. Why do I like this? Because I think a lot of neural nets people would argue that this shouldn't be necessary: the network can do arbitrarily far lookahead into the future and therefore should be able to learn to avoid the label-bias problem. This paper shows that current techniques don't achieve that: there's a consistent win to be had by doing global normalization.

This paper shows pretty definitively that human evaluations against a reference translation are super biased toward the particular reference used (probably because evaluators are lazy and are basically doing n-gram matching anyway -- a story I heard from MSR friends a while back). The paper also shows that this gets worse over time, presumably as evaluators get more tired.

This is a nice paper summarizing four issues that come up in ethics that also come up in NLP. I mostly liked this paper because it gave names to things I've thought about off and on but didn't have a name for. In particular, they consider exclusion (hey, my ASR system doesn't work on people with an accent; I guess they don't get a voice), overgeneralization (to what degree are our models effectively stereotyping more than they should?), over- and under-exposure (hey, let's all work on parsing because that's what everyone else is working on, which then makes parsing seem more important... just to pick a random example :P), and dual-use (I made something for good, but XYZ organization used it for evil!). This is a position/discussion-starting paper, and I thought it was quite engaging.

13 comments:

I think you may be reaching the wrong conclusion from Andor et al.: their model does not attempt to look into the future (the NN is used only for automatic feature combinations + easy integration of pre-trained word representations), so of course it will benefit from search. The real comparison should be against a model that uses something like an LSTM, and indeed we have very competitive results with biLSTM features and a greedy parser. (Our TACL paper is at 93.9 UAS, and we also have as-yet-unpublished results of 94.7. http://aclweb.org/anthology/Q/Q16/Q16-1023.pdf)

Thanks for the intro here, I felt entirely the same way. Maybe so many of us being connected on social media is changing the fabric of ACL. +1 on the Connotation Frames paper and the Social Impact paper.


I guess it's partly my fault for not having ever provided a sticky explanation of label bias, but the benefit of globally normalized training is neither about search per se nor about "looking into the future." It's about proper credit assignment among the many decisions that go into making a structured result, which cannot be achieved with local normalization because, under local normalization, each decision is only penalized for its failure to match its gold label, not for its contribution to pushing the whole structured prediction too far from the gold structure.

I typically have thought of LB in terms of looking into the future, because we want the global structure to be able to overrule low-entropy (or low-branching-factor) states whose outgoing edges look artificially good, while the past (getting to this state) is somehow already accounted for.

But this isn't totally correct, because you could have label bias on the very final decision, and then it's really trading off with the past.

At any rate, it would be great to have a sticky definition so I can point reviewers to it when they misuse it worse than I've misused it here :)

Fernando: that's a good explanation of LB. But I am still curious whether this holds for models with "infinite" lookahead such as those we have now. Having the ability to look into the future (of the input, if not of the structure) can calibrate the probabilities of the local results much better, don't you think? I'm wondering what would be a good way to properly test this.

@Yoav: Looking into the indefinite future is not enough: you're still optimizing the wrong thing, the local loss. Decisions here need to compete with decisions there on a common scale, but local normalization prevents that, as it makes each decision worth the same, independently of how easy or hard it is locally (consider the extreme case of choices forced by the input and previous or following state). Bidirectional local models are often better than unidirectional ones, but they don't address label bias. In fact, back in the dark old days of linear models, the SOTA for sequence tagging/segmentation was bidirectional combinations of SVM-based classifiers, until we did CRFs. There's no reason why one could not do globally normalized training for BiLSTMs (I think, just off the top of my head); it's just a matter of changing the loss function and the training procedure, and thus the gradients you compute and normalize. It would be expensive training, but from my POV training is cheap given that our trained models are used in production.
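The forced-choice point can be made concrete with a tiny invented two-step lattice: under local normalization, a state with a single outgoing edge contributes probability one no matter how bad that edge's raw score is, while global normalization lets whole paths compete on total score (all numbers below are made up):

```python
import numpy as np

# Toy illustration of label bias: two paths through a 2-step lattice.
# Path A: step 1 picks option 0 (score 1.0), then a *forced* step whose
#         raw score is terrible (-5.0).
# Path B: step 1 picks option 1 (score 0.9), then a very good step (3.0).

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

# Locally normalized: normalize at each step, multiply probabilities.
step1 = softmax(np.array([1.0, 0.9]))
# The forced step has one outgoing edge, so its local probability is 1,
# regardless of its raw score of -5.0 -- that score simply vanishes.
p_path_A_local = step1[0] * 1.0
p_path_B_local = step1[1] * softmax(np.array([3.0, 0.0]))[0]

# Globally normalized: softmax over *total* path scores, so the bad
# forced edge actually hurts path A.
p_path_A_global, p_path_B_global = softmax(np.array([1.0 - 5.0, 0.9 + 3.0]))
```

Locally, path A wins despite its terrible forced transition; globally, path B wins, which is the credit-assignment failure described above.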

Thanks Fernando. Yes, this makes sense. We currently don't see benefits from beam training w/ the BiLSTMs beyond what one could get by changing the random seed, but admittedly didn't properly try to make it work. (Training ain't that much more expensive.)

Also note that with biLSTMs we are not really dealing with local decisions: a forced choice is only bad if it's wrong, and without proper lookahead you cannot know that it's wrong; but with a forward-looking LSTM you can, and it will potentially be reflected in the local score. I am talking here about models that are both trained and tested greedily. If you do structured inference, I can see why you'd like global training (though see a recent arXiv paper from Mirella Lapata's group providing a data point to the contrary).

Label bias is an interesting topic that has come up now and then in discussions. If there are any strong results showing that globally normalized models work better than locally normalized ones (WITHOUT independence assumptions), please let me know. I'll provide a recent empirical datapoint that seems to suggest powerful sequence models can deal quite well with complicated, global structure:

In answer to Oriol's question: one potential case where global training helps even with a full LSTM (no independence assumptions, if I understand correctly) is the following paper on speech recognition:

https://wiki.inf.ed.ac.uk/twiki/pub/CSTR/ListenTerm3201415/ctc.pdf

In Table 1 they quote results for a model with a bi-directional LSTM (modeling the full forward context), with both local and global training. My impression is that many (most?) state-of-the-art results in speech recognition use some form of global training.

This is doing named entity recognition using B-I-O tagging, based on biLSTM features. Adding tag-tag transition scores + viterbi search improves over greedy prediction. I assume that the reason is that, like in speech, your metric is not per-time-step accuracy, but a more global structure.
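A toy sketch of that contrast, with invented emission scores standing in for biLSTM outputs: greedy per-token tagging happily emits an illegal O-to-I transition, while adding tag-tag transition scores and Viterbi search rules it out:

```python
import numpy as np

# Greedy vs. Viterbi for B-I-O tagging over a 2-token sentence.
# Emission scores stand in for biLSTM outputs; all numbers are invented.

tags = ["O", "B", "I"]
emissions = np.array([
    [0.6, 0.5, 0.0],   # token 1: "O" looks slightly better than "B"
    [0.0, 0.0, 1.0],   # token 2: "I" looks best locally
])
# Transition scores: O -> I is illegal (huge negative score).
trans = np.array([
    [0.0, 0.0, -1e9],  # from O
    [0.0, 0.0, 0.0],   # from B
    [0.0, 0.0, 0.0],   # from I
])

# Greedy: per-token argmax, ignoring transitions -> produces illegal O, I.
greedy = [tags[i] for i in emissions.argmax(axis=1)]

# Viterbi over the tiny lattice: best predecessor for each step-2 tag.
cand = emissions[0][:, None] + trans + emissions[1][None, :]
best_prev = cand.argmax(axis=0)          # backpointers into step 1
best_last = int(cand.max(axis=0).argmax())
viterbi = [tags[best_prev[best_last]], tags[best_last]]
```

Here greedy yields the structurally invalid ["O", "I"], while Viterbi recovers ["B", "I"]; the per-time-step scores never had to change.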

Why doesn't it work (yet...) for transition-based parsing then? I don't know, but perhaps because the control structures of the parser are strong enough to disallow these "illegal" or "bad" cases from happening.