Wow. That piece about the bad adversarial NLG paper really struck a nerve. It’s been getting tons of attention, and some very positive comments. Thanks!

There are also a few points that people (especially, I think, younger researchers) raise, either on the web or in private, along the lines of this comment on reddit:

You could take Goodfellows original GAN paper, and critique it in a similar way: It’s not as good as the state of the art, it’s only on toy datasets, etc. Yet that method has been called “the coolest idea in machine learning the last 20 years”. And it probably is.

Now, you shouldn’t overstate your results. But are they really doing that? The blog author has a gripe with the paper title, because it claims to generate “natural language”, when the language doesn’t seem natural. I took that to just mean that it tries to generate human language as opposed to, say, a programming language.

The author seems to find some kind of arrogance in the paper that I really don’t see from the examples given.

This prompted me to write a few clarifications.

First and foremost, I would like to reiterate that this particular paper was spectacularly bad in my view on many levels, but my broader criticism was on a trend, not on a single paper. Now, for some specific points:

My criticism is not about the paper not getting state-of-the-art results.

The focus on SOTA results is very overrated in my view, especially in deep learning, where so many things are going on beyond the innovation described in each work. I don’t need to see SOTA results; I want to see a convincing series of experiments showing that the proposed method does something relevant, new and interesting.

My criticism is not about the paper using a toy task, or a toy grammar.

It is OK to use toy tasks. It is often desirable to use a toy task. For example, I could imagine some very interesting research that uses even smaller grammars than the one used by the authors in their simplest task. The idea would be to construct a grammar that demonstrates some phenomenon, and then correlate it with learnability, for example. But the toy task must be meaningful and relevant, and you have to explain why it is meaningful and relevant. And, I think it goes without saying, you should understand the toy task you are using. Here, the authors clearly had no idea what the grammar they were using was doing. Not only did they fail to distinguish lexical rules from non-terminal productions, they didn’t even realize that the vast majority of the production rules in the grammar file were not being used.
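To make "understanding your toy grammar" concrete, here is a minimal sketch of the kind of sanity check I have in mind. Everything in it is an assumption made for illustration: the tiny grammar, the start symbol `S`, and the helper names are hypothetical and are not taken from the paper or its grammar file. It separates lexical rules (right-hand side made only of terminals) from structural ones, and flags rules that can never fire because their left-hand side is unreachable from the start symbol. Unreachability is only one way a rule can go unused; a cap on sentence length can hide rules too.

```python
from collections import deque

# Hypothetical toy grammar: nonterminal -> list of right-hand sides.
# Any symbol that appears as a key is a nonterminal; everything else is a terminal.
GRAMMAR = {
    "S": [["NP", "VP"]],
    "NP": [["Det", "N"]],
    "VP": [["V", "NP"]],
    "Det": [["the"], ["a"]],
    "N": [["dog"], ["cat"]],
    "V": [["sees"]],
    # Nothing reachable from S mentions PP, so these rules can never fire.
    "PP": [["P", "NP"]],
    "P": [["with"]],
}


def reachable_nonterminals(grammar, start):
    """Breadth-first search over nonterminals reachable from the start symbol."""
    seen = {start}
    queue = deque([start])
    while queue:
        lhs = queue.popleft()
        for rhs in grammar.get(lhs, []):
            for symbol in rhs:
                if symbol in grammar and symbol not in seen:
                    seen.add(symbol)
                    queue.append(symbol)
    return seen


def is_lexical(grammar, rhs):
    """A rule is lexical if its right-hand side contains only terminals."""
    return all(symbol not in grammar for symbol in rhs)


def summarize(grammar, start):
    """Sort every rule into lexical, structural, or unused (unreachable)."""
    live = reachable_nonterminals(grammar, start)
    summary = {"lexical": [], "structural": [], "unused": []}
    for lhs, alternatives in grammar.items():
        for rhs in alternatives:
            rule = (lhs, tuple(rhs))
            if lhs not in live:
                summary["unused"].append(rule)
            elif is_lexical(grammar, rhs):
                summary["lexical"].append(rule)
            else:
                summary["structural"].append(rule)
    return summary


if __name__ == "__main__":
    for kind, rules in summarize(GRAMMAR, "S").items():
        print(kind, len(rules))
```

Running a check like this on a grammar file before building experiments on it takes minutes, and it tells you which rules your "task" actually exercises.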

My criticism is not about the paper “not solving natural language generation”.

Of course the paper did not solve natural language generation. That’s exactly the point: no single paper can “solve” NLG (what does that even mean?), just like no single biology paper will solve cancer. But the paper should be clear about the scope of the work it is actually doing. In the title, in the abstract, in the text.

(Another point on “natural language”: the reddit comment above says “I took that to mean it tries to generate human language as opposed to, say, a programming language”. That’s the problem. The paper claims to generate human language, yet it is not evaluated on human language, only on very narrow fragments of human language, which are waaay more similar to a very simple, stripped-down programming language, without semantics, than they are to human language. The paper also does not evaluate any property that relates to the language being “natural” or “human”. This makes the paper very misleading in its description of what it is doing. It misled that reader on reddit. It likely misled many others as well. I tend to believe the authors did this out of ignorance rather than malice. This is precisely where the arrogance comes in: working in a field that you do not understand, while not realizing that you do not understand it, or even that it is a complex field that needs understanding, and making broad, unsubstantiated and misleading claims as a result.)

My criticism is not about the paper being incremental.

This is very much related to the point above. I don’t have a problem with incremental papers. Most papers are incremental. That’s how progress is made, in small, incremental steps. (It is true that there’s also a trend in deep learning, fueled by arxiv-mania, of slicing things a bit too thin, pushing out papers for minuscule increments. Let’s put that aside for the current discussion.) Incrementality is perfectly fine, but you have to clearly define your contribution, position it w.r.t. existing work, and precisely state (and evaluate) your increment.

Combining the points above: if the paper had only simple CFG experiments, but was titled (and written to support) something like “An Adversarial Training Method for Discrete Sequences that can Recover Short Context-Free Fragments”, and had a discussion of the rules in the CFG and the kinds of structures they capture, followed by a proper, convincing evaluation, and a statement that the sentence set could very easily be learned by an RNN but not by any previous GAN-based method, yet the current GAN captures it, and that this is a first step in something that could at some point lead to NLG, then this would actually be a solid paper that I’d happily accept to a conference. (Not necessarily an NLP conference; this depends on other factors as well, for example the form of the CFG they were using and the classes of structures it captures.)