Fake language: two perspectives

Is there a role for fake language datasets in the AI ecosystem? (By “fake
language” I mean things like bAbI, CLEVR, Karthik’s “Home” world [pdf link],
and this DeepMind paper.)
Claims that various learning architectures can “do language processing” based on
results from these datasets have caused a lot of hand-wringing in the NLP
community. While a lot of this is due to good old-fashioned overclaiming, I’ve
become persuaded some of it is miscommunication between two groups that mean
totally different things by “language data”.

I’m going to focus here on instruction following, but I think there’s a similar
story to tell about lots of other grounded tasks like question answering,
generation, etc.[1] For a long time, an AI researcher’s view of an instruction
following problem was something like this:

Language -> Abstraction -> Behavior

That is, we start with whatever utterances people are generating, map them to
some kind of clean, structured representation, and then make decisions about how
to act on the basis of that structure. Because this whole pipeline was too hard
to tackle all at once, the community mostly started working on it from different
ends.[2]

“Language people” worked on this:

Language -> Abstraction

In this picture, language comes to you from the outside world—you don’t
control the distribution. You get to design the language of abstractions, but it
had better be able to handle (or at least fail gracefully on) whatever
utterances the world throws at you. Linguists give us a nice abstraction
formalism in the form of logic, and tell us that the way to get from abstraction
to behavior is just logical interpretation. So it’s very easy to say “abstraction =
formal semantics” and treat the Abstraction -> Behavior edge as someone else’s
problem.

Data is collected from human speakers who don’t necessarily know anything
about logical forms. Indeed, decisions about details of the logical language are
typically worked out after collecting initial annotations. What distinguishes
“language data” from other kinds of data is precisely the fact that it was
generated by human users. (If we’re generating data from a fake grammar and
mapping it onto logical forms, we generally haven’t learned anything about
language that we didn’t write down in the first place.)

“Policy people” (broadly understood to include everything from RL to planning
to classical control) worked on this:

Abstraction -> Behavior

In this picture, the scope of acceptable abstractions is up to the system
designer—it’s in behavior where details of the real world (physics, etc.)
intervene. Abstraction languages range from “do one of these 10 specific things”
to “satisfy this STRIPS goal”. In particular, an abstraction language
that doesn’t support all possible goals is no more problematic than a remote
control that doesn’t operate all appliances at once.[3] The thing that I
originally found difficult to appreciate is just how hard some of these problems
are even when we have complete control over the input distribution.
Reinforcement learning is hard. Planning is hard. There’s still lots of
room for interesting compositionality in these abstraction languages—if I have
some kind of structured representation of the goal, and I train on a subset of
structures, do I generalize to the rest? There’s lots we still can’t do.
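
To make the generalization question concrete, here is a minimal sketch of a compositional train/test split. Everything in it is an assumption for illustration: the goal vocabulary (`VERBS`, `COLORS`, `SHAPES`) and the helper `compositional_split` are hypothetical and don't come from any of the datasets mentioned above. The idea is only that every individual word shows up in training, while some combinations are held out.

```python
from itertools import product

# Hypothetical goal vocabulary, invented for this sketch.
VERBS = ["reach", "pick_up", "avoid"]
COLORS = ["red", "green", "blue"]
SHAPES = ["ball", "key", "box"]


def compositional_split(verbs, colors, shapes):
    """Split structured goals so that held-out goals are new *combinations*.

    A goal is a (verb, color, shape) tuple. Goals whose indices sum to
    0 mod 3 are held out; with three choices per slot this leaves every
    individual verb, color, and shape visible somewhere in training.
    """
    train, test = [], []
    indexed = product(enumerate(verbs), enumerate(colors), enumerate(shapes))
    for (i, verb), (j, color), (k, shape) in indexed:
        goal = (verb, color, shape)
        (test if (i + j + k) % 3 == 0 else train).append(goal)
    return train, test


train_goals, test_goals = compositional_split(VERBS, COLORS, SHAPES)
```

A policy that does well on `train_goals` but fails on `test_goals` has memorized goal identifiers rather than learned anything about their structure, which is exactly the Abstraction -> Behavior question posed above.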

To come up with problems that are within reach of current methods, data is
generated rather than collected. The distribution over abstractions and their
induced behaviors is hand-designed. There’s no language data here. From this
side, what separates “language data” from the data that does get used is that
language has no precise execution semantics, whereas a formalism like STRIPS does.
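
As a rough illustration of what “precise execution semantics” buys you, here is a small STRIPS-style sketch. It is a simplification under my own assumptions, not a full STRIPS implementation: states and goals are sets of propositions, and the toy key-and-door domain, the `Action` class, and the helpers `apply_action` and `satisfies` are all hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Action:
    """A STRIPS-style operator over sets of propositions (simplified)."""
    name: str
    preconditions: frozenset
    add_effects: frozenset
    delete_effects: frozenset


def apply_action(state, action):
    """Return the successor state, or raise if a precondition is unmet."""
    if not action.preconditions <= state:
        raise ValueError(f"{action.name}: unmet preconditions")
    return (state - action.delete_effects) | action.add_effects


def satisfies(state, goal):
    """A goal is a set of propositions that must all hold in the state."""
    return goal <= state


# Toy domain, invented for this sketch.
pick_up_key = Action("pick_up_key", frozenset({"at_key", "hand_empty"}),
                     frozenset({"holding_key"}), frozenset({"hand_empty"}))
go_to_door = Action("go_to_door", frozenset({"at_key"}),
                    frozenset({"at_door"}), frozenset({"at_key"}))
open_door = Action("open_door", frozenset({"holding_key", "at_door"}),
                   frozenset({"door_open"}), frozenset())

initial_state = frozenset({"at_key", "hand_empty"})
goal = frozenset({"door_open"})

final_state = initial_state
for action in (pick_up_key, go_to_door, open_door):
    final_state = apply_action(final_state, action)
```

There is never any doubt about whether `goal` holds in `final_state`: it is a subset check. Nothing comparable exists for a sentence typed by a person.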

In the last few years these two communities have run together, because the world
looks like this:

Language -> Behavior

Everything is end-to-end all the time. Abstraction isn’t gone, but now it lives
in some uninterpretable representation space rather than a formalism we’ve
designed by hand. This is a great thing! Language people no longer have to limit
themselves to worlds where they’re clever enough to construct a good enough
logical language.

And policy people (here’s where the trouble starts) no longer have to describe
their task inventory in terms of any particular formalism: they just need some
way of generating reward functions / goal tests synchronously with some kind of
(compositional?) identifier that describes them. So they generate interpretable
strings made of sequences of words. No execution semantics, uses English words:
natural language. Whence the confusion.
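
Here is a hedged sketch of what that synchronous generation might look like. The template, the vocabulary, the `make_task` helper, and the `agent_at` state key are all hypothetical, chosen only to show a string and a goal test being produced from the same random draw.

```python
import random

# Hypothetical vocabulary, invented for this sketch.
COLORS = ["red", "green", "blue"]
SHAPES = ["ball", "key", "box"]


def make_task(rng):
    """Sample one task: an instruction string and its goal test, together."""
    color = rng.choice(COLORS)
    shape = rng.choice(SHAPES)

    # The "language": a template filled with English words.
    instruction = f"go to the {color} {shape}"

    # The goal test: a predicate on environment states, built from the
    # same (color, shape) draw that produced the string.
    def goal_test(state):
        return state.get("agent_at") == (color, shape)

    return instruction, goal_test


rng = random.Random(0)
instruction, goal_test = make_task(rng)
```

The string here is an identifier for the goal test and nothing more. No person ever produced it, so however English it looks, calling it natural language invites exactly the confusion described above.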

I think there’s a lesson in this for people in both communities:

For “policy people” as researchers: please please please signpost
explicitly when your input strings were generated synthetically. The word
“language” is hopelessly overloaded at this point, but the bigram “natural
language” is not: avoid using the word “natural” unless people were involved.
(A couple of the papers linked in the first paragraph are guilty of this.)

For “language people” as reviewers: respond to appropriately qualified fake
language datasets by asking “Is an interesting Abstraction -> Behavior
problem being solved? Do strings index the target class of behaviors in an
interesting way?” There are lots of problems out there for which this is an
appropriate standard.

I think we’re still at a stage where there’s something to learn from fake
language, even for those of us who ultimately care about the distribution of
sentences produced by humans.

[1] Explicitly excluded from this discussion is work on doing linguistic
analysis of fake language data. Such work is not about language at all,
and is either about analyzing the formal expressive capacity of certain
model classes, or is garbage.

[2] Not everyone! Both Stefanie Tellex and Branavan had work that was trying
to tackle the complete pipeline well before the current end-to-end craze. I
think this research is a model for what the field should be trying to
accomplish, but for lots of problems our techniques just aren’t there yet.

[3] There’s a sense in which Jonathan Berant and Percy Liang’s work on
paraphrasing for semantic parsing [pdf link] actually belongs in this
category, rather than Language -> Abstraction. But they still hold
themselves to a “real language” evaluation standard.