
Thursday, May 21, 2015

Manufacturing facts: the case of Subject Advantage Effects

Real science data is not natural. It is artificial. It is
rarely encountered in the wild and (as Nancy Cartwright has emphasized (see
here for discussion)) it standardly takes a lot of careful work to create the
conditions in which the facts are observable. The idea that science proceeds by
looking carefully at the natural world is deeply misleading, unless, of course,
the world you inhabit happens to be CERN. I mention this because one of the
hallmarks of a progressive research program is that it supports the manufacture
of such novel artificial data and their bundling into large scale “effects,” artifacts
which then become the targets of theoretical speculation.[1]
Indeed, one measure of how far a science has gotten is the degree to which the
data it concerns itself with is factitious and the number of well-established
effects it has managed to manufacture. Actually, I am tempted to go further: as
a general rule only very immature scientific endeavors are based on naturally
available/occurring facts.[2]

Why do I mention this? Well, first, by this measure,
Generative Grammar (GG) has been a raging success. I have repeatedly pointed to
the large number of impressive effects that GG has collected over the last 60
years and the interesting theories that GGers have developed trying to explain
them (e.g. here).
Island and ECP effects, binding effects and WCO effects do not arise naturally
in language use. They need to be constructed, and in this they are like most facts
of scientific interest.

Second, one nice way to get a sense of what is happening in
a nearby domain is to zero in on the effects its practitioners are addressing.
Actually, more pointedly, one quick and dirty way of seeing whether some area
is worth spending time on is to canvass the variety and number of different
effects it has manufactured. In what follows I would like to discuss one of
these that has recently come to my attention and that is of some interest to a
GGer like me.

A recent paper (here)
by Jiwon Yun, Zhong Chen, Tim Hunter, John Whitman and John Hale (YCHWH)
discusses an interesting processing fact concerning relative clauses (RC) that
seems to hold robustly cross-linguistically. The effect is called the “Subject
Advantage” (SA). What’s interesting about this effect is that it holds both in
languages where the head precedes the relative clause (like English) and in
languages where the head follows it (like Japanese). Why is this
interesting?

Well, first, this argues against the idea that the SA simply
reflects increasing memory load as a function of linear distance between gap
and filler (i.e. head). Linear distance cannot be the relevant variable.
Though it could account for SA effects in languages like English, where the
head precedes the RC (thus making the subject gap closer to the head than the
object gap is), in Japanese-style RCs, where heads follow the clause, the
object gap is linearly closer to the head than the subject gap is. Linear
distance thus predicts an object advantage, contrary to experimental fact.
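The linear-distance reasoning above can be made concrete with a toy sketch. The schematic word orders and the `gap_head_distance` helper below are illustrative assumptions of mine, not anything from YCHWH:

```python
def gap_head_distance(words, gap, head):
    """Linear distance, in words, between the gap site and the RC head."""
    return abs(words.index(head) - words.index(gap))

# English-style RC: head precedes the clause (schematic word order).
eng_src = ["reporter", "who", "_gap_", "attacked", "senator"]
eng_orc = ["reporter", "who", "senator", "attacked", "_gap_"]
# Japanese-style RC: head follows the clause (schematic SOV order).
jpn_src = ["_gap_", "senator-o", "attacked", "reporter"]
jpn_orc = ["senator-ga", "_gap_", "attacked", "reporter"]

# English: the subject gap is closer to the head -> subject advantage predicted.
print(gap_head_distance(eng_src, "_gap_", "reporter"))  # 2
print(gap_head_distance(eng_orc, "_gap_", "reporter"))  # 4
# Japanese: the *object* gap is closer -> linear distance wrongly predicts
# an object advantage.
print(gap_head_distance(jpn_src, "_gap_", "reporter"))  # 3
print(gap_head_distance(jpn_orc, "_gap_", "reporter"))  # 2
```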

Second, and
here let me quote John Hale (p.c.):

SA effects defy explanation in terms of
"surprisal". The surprisal idea is that low probability words are
harder, in context. But in relative clauses surprisal values from simple
phrase structure grammars either predict effort on the wrong word (Hale 2001) or get it completely backwards --- an object
advantage, rather than a subject advantage (Levy 2008, page 1164).

Thus, SA effects are interesting in that they appear to be
stable over languages as diverse as English on the one hand and Japanese on the
other and seem refractory to many of the usual processing explanations.

Furthermore, SA effects suggest that grammatical structure
is important, or to put this in more provocative terms, that SA effects are
structure dependent in some way. Note that this does not imply that SA effects are grammatical effects, only that G
structure is implicated in their explanation. In this, SA effects are a little like Island Effects as understood (here).[3]
Purely functional stories that ignore G structure (e.g. like linearly dependent
memory load or surprisal based on word-by-word processing difficulty) seem to
be insufficient to explain these effects (see YCHWH 117-118).[4]

So how to explain the SA? YCHWH proposes an interesting
idea: what makes object relatives harder than subject relatives is that they
involve different amounts of “sentence medial ambiguity” (the former more than
the latter), and resolving this ambiguity takes work that is reflected in
processing difficulty. Or, put more flatfootedly, finding an object gap requires
getting rid of more grammatical
ambiguity than finding a subject gap, and getting rid of this ambiguity requires
work, which is reflected in processing difficulty. That’s the basic idea. The
work is in the details that YCHWH provides. And there are a lot of them. Here are some.

YCHWH defines a notion of “Entropy Reduction” based on the
weighted possible continuations available at a given point in a parse. One
feature of this is that the model provides a way of specifying how much work
parsing is engaged in at a particular
point. This contrasts with, for example, a structural measure of memory
load. As note 4 observes, such a measure could explain a subject advantage but
as John Hale (p.c.) has pointed out to me concerning this kind of story:

This general account is thus adequate but not very
precise. It leaves open, for instance, the question of where exactly greater
difficulty should start to accrue during incremental processing.
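To make the Entropy Reduction idea concrete, here is a minimal sketch; the encoding (unnormalized weights over the grammar's remaining continuations at each prefix) is my own toy illustration, not YCHWH's implementation:

```python
import math

def entropy(weights):
    """Shannon entropy (in bits) of a distribution given as unnormalized weights."""
    total = sum(weights)
    probs = [w / total for w in weights]
    return -sum(p * math.log2(p) for p in probs if p > 0)

def entropy_reductions(states):
    """Per-word Entropy Reduction. `states` lists, for each successive prefix
    of the sentence, the weights of the possible grammatical continuations.
    The work done at a word is the (non-negative) drop in uncertainty."""
    return [max(0.0, entropy(prev) - entropy(curr))
            for prev, curr in zip(states, states[1:])]

# Toy illustration: four equally weighted analyses narrow to two, then to one.
states = [[1, 1, 1, 1], [1, 1], [1]]
print(entropy_reductions(states))  # [1.0, 1.0]
```

Note that a word which *increases* uncertainty contributes zero on this measure; only reductions in entropy count as work.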

That said, whether to go for the YCHWH account or the less
precise structural memory load account is ultimately an empirical matter.[5]
One thing that YCHWH suggests is that it should be possible to obviate the SA
effect given the right kind of corpus data. Here’s what I mean.

YCHWH defines entropy reduction by (i) specifying a G for a
language that defines the possible G continuations in that language and (ii)
assigning probabilistic weights to these continuations. Thus, YCHWH shows how
to combine Gs with probabilities of their use. Parsing, not surprisingly,
relies on the details of a particular
G and the details of the corpus of usages of those G possibilities. Thus, what
options a particular G allows affects how much entropy reduction a given word
licenses, as do the details of the corpus that probabilizes the G. This means that it is possible that the SA
might disappear given the right corpus details. Or it allows us to ask what, if
any, corpus details could wipe out SA effects. This, as Tim Hunter noted (p.c.),
raises two possibilities. In his words:

An interesting (I think) question that arises is:
what, if any, different patterns of corpus data would wipe out the subject
advantage? If the answer were 'none', then that would mean that the grammar
itself (i.e. the choice of rules) was the driving force. This is almost
certainly not the case. But, at the other extreme, if the answer were 'any
corpus data where SRCs are less frequent than ORCs', then one would be forgiven
for wondering whether the grammar was doing anything at all, i.e. wondering
whether this whole grammar-plus-entropy-reduction song and dance were just a
very roundabout way of saying "SRCs are easier because you hear them more
often".
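One standard way of probabilizing a grammar from a corpus is relative-frequency (maximum-likelihood) estimation over rule uses. The sketch below is that textbook recipe, not YCHWH's exact procedure, and the toy "corpus" is invented for illustration:

```python
from collections import Counter, defaultdict

def relative_frequency_weights(rule_uses):
    """Maximum-likelihood rule probabilities from observed rule uses.
    `rule_uses` is a list of (lhs, rhs) pairs extracted from parsed corpus
    data. Returns a dict mapping (lhs, rhs) -> P(rhs | lhs)."""
    counts = Counter(rule_uses)
    lhs_totals = defaultdict(int)
    for (lhs, _), n in counts.items():
        lhs_totals[lhs] += n
    return {rule: n / lhs_totals[rule[0]] for rule, n in counts.items()}

# Toy corpus: subject relatives observed three times, object relatives once.
uses = [("RC", "SRC")] * 3 + [("RC", "ORC")]
probs = relative_frequency_weights(uses)
print(probs[("RC", "SRC")], probs[("RC", "ORC")])  # 0.75 0.25
```

Tim's question can then be restated: how far can these weights be perturbed (e.g. toward a 50-50 SRC/ORC split, or beyond) before the entropy-reduction profile flips?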

One of the nice features of the YCHWH discussion is that it
makes it possible to analytically
approach this problem. It would be nice to know what the answer is both
analytically as well as empirically.

Another one of the nice features of YCHWH is that it
demonstrates how to probabilize MGs of the Stabler variety so that one can view
parsing as a general kind of information
processing problem. In such a context difficulties in language parsing are
the natural result of general information processing demands. Thus, this
conception of parsing locates it in a more general framework of information
processing, parsing being one specific application where the problem is to
determine the possible G compatible continuations of a sentence. Note that this
provides a general model of how G knowledge can get used to perform some task.

Interestingly, on this view, parsing does not require a
parser. Why? Because parsing just is information processing when the relevant
information is fixed. It’s not like we do language parsing differently than we
do, say, visual scene interpretation once
we fix the relevant structures being manipulated. In other words, parsing
on the YCHWH view is just information
processing in the domain of language (i.e. there is nothing special about language processing except the fact that
it is Gish structures that are being manipulated). Or, to say this another way,
though we have lots of parsing, there is no parser that does it.

YCHWH is a nice
example of a happy marriage of grammar and probabilities to explain an
interesting parsing effect, the SA. The latter is a discovery about the ease of
parsing RCs that suggests that G structure matters and that language
independent functional considerations just won’t cut it. It also shows how easy
it is to combine MGs with corpora to deliver probabilistic Gs that are
plausibly useful in language use. All in all, fun stuff, and very instructive.

[2]
This is one reason why I find admonitions to focus on natural speech as a
source of linguistic data to be bad advice in general. There may be exceptions,
but as a general rule such data should be treated very gingerly.

[3]
See, for example, the discussion in the paper by Sprouse, Wagers and Phillips.

[4]
A measure of distance based on structure could explain the SA. For example, there
are more nodes separating the object trace and the head than separating the
subject trace and the head. If memory load were a function of depth of
separation, that could account for the SA, at least at the whole sentence level.
However, until someone
defines an incremental version of the Whole-Sentence structural memory load
theory, it seems that only Entropy Reduction can account for the word-by-word
SA effect across both English-type and Japanese-type languages.

[5]
The following is based on some correspondence with Tim Hunter. Thus he is
entirely responsible for whatever falsehoods creep into the discussion here.

A quick follow-up on the point about what patterns of corpus data would wipe out the subject advantage. It turns out (at least in Japanese, which is the only language I tested) that it is not the case that the subject advantage only appears when SRCs are more frequent than ORCs in the corpus. So we are not at the "other extreme" mentioned in the post: the theory put forward by YCHWH is not just a roundabout way of appealing to the corpus frequencies.

To be more precise: if you leave all the corpus weights as they are in the paper except for the SRC-vs-ORC frequency, and replace this with an artificial 50-50 split, then you still get the subject advantage. So the subject advantage appears even though SRCs and ORCs were equally frequent.

A minor point about surprisal: there's nothing inherently grammar-sensitive about entropy reduction or grammar-insensitive about surprisal. Surprisal in Hale (2001), Levy (2008) and elsewhere is calculated based on probabilistic grammars. In the other direction you can get entropy reduction estimates from language models that don't have any hierarchical structure (e.g. Stefan Frank's work). You could make the case that entropy reduction is in many cases more sensitive to representational assumptions than surprisal, though.

Empirically, surprisal and entropy reduction make very different predictions, which are right in some cases and wrong in others for both metrics (though we have more evidence for surprisal effects). But the debate over which metric is correct (or whether both are) is orthogonal to whether you use probabilistic grammars or more "linear" models.

Tal's own work, appearing soon in _Cognitive Science_, confirms Entropy Reduction using phrase structure grammars based on the Penn Treebank: "We find that uncertainty about the full structure...was a significant predictor of processing difficulty."

@Tal: You are absolutely right. And I don't think this is a minor point: any probabilistic or information-theoretic notion of language requires a clearly specified underlying model. But I think this point is missed by many practitioners in the cognitive science of language, who seem to interpret previous findings as a demonstration that language functions--in use, change, learning, etc.--to facilitate communication in some very general sense. Of course, one needs to ask the question: if you have a well-motivated specific model of language, are such general considerations still necessary? (And they may be wrong.) This is especially important because calculating surprisal or similar information-theoretic measures is computationally very difficult. It seems that YCHWH took an important step in this direction and I hope their paper is widely read and discussed.

The students in my MG parsing research project and I have a sort-of follow-up paper on this that will be presented at MOL in July (which is colocated with the LSA summer institute this year, btw). We're still working on the revisions, but I'll put a link here once it's done.

We approach the SA from a very different perspective. We completely ignore distributional facts and ask instead what assumptions about memory usage in an MG parser can derive the SA. It turns out that one needs a fairly specific (albeit simple and plausible) story, and it is a story that hooks directly into the movement dependencies one sees with subject relative clauses and object relative clauses. More specifically, there's three simple ways of measuring memory usage:

1) the number of items that must be held in memory
2) the duration that an item must be held in memory
3) the size of the items that must be held in memory

1 and 2 can at best get you a tie between subjects and objects; you need 3 to derive the SA. Intuitively, the structurally higher position of subjects in comparison to objects leads to shorter movement paths, which reduces the size of parse items and thus derives the SA.
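The three metrics can be sketched as summaries over a parse trace. The `(enter, exit, size)` encoding of items below is a hypothetical illustration of mine, not the MG-parser formalization from the follow-up paper:

```python
def memory_metrics(trace):
    """Three simple memory-load summaries over a parse trace.
    `trace` is a list of (enter_step, exit_step, size) triples, one per item
    the parser holds in memory."""
    steps = range(min(t[0] for t in trace), max(t[1] for t in trace) + 1)
    # 1) peak number of items held simultaneously
    peak_items = max(sum(1 for e, x, _ in trace if e <= s < x) for s in steps)
    # 2) longest time any single item is held
    max_duration = max(x - e for e, x, _ in trace)
    # 3) size of the largest item held (the metric needed to derive the SA)
    max_size = max(size for _, _, size in trace)
    return peak_items, max_duration, max_size

# Toy traces: in the ORC the moved item travels a longer path, so it is held
# longer and is bigger; metrics 1 tie, while 2 and 3 separate the two.
src = [(0, 2, 1), (1, 3, 1)]
orc = [(0, 2, 1), (1, 4, 2)]
print(memory_metrics(src), memory_metrics(orc))  # (2, 2, 1) (2, 3, 2)
```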

I'm not sure how the two approaches differ in the predictions they make for other constructions; one of the things that really is missing from the parsing literature right now is a formal method for comparing models, similar to how formal language theory provides a scaffolding for comparing macroclasses of syntactic proposals.

This sounds like a version of the whole-sentence theory mentioned in note 4. If so, it seems like it will still be missing the word-by-word effect that Entropy Reduction can accommodate, or am I missing something?

You're right that it only gives you an offline difficulty measure that does not map neatly to online performance. That's the main gripe some of my students have with the project, but I'm actually quite happy to abstract away from this for now. That's because the overall goal of the projects is slightly different:

We're building on Kobele, Gerth and Hale (2012), who use a memory-based mapping from tree structures to processing difficulty to argue against specific syntactic analyses on the basis that they cannot derive the observed processing effects with this simple machinery. So we are less concerned with modeling specific processing effects, the real issue is how far a purely syntactically informed mapping from structures to levels of processing difficulty can take us before we have to switch to (presumably) more powerful methods. The SA is interesting because it looks like something that intuitively could have a memory-based explanation, but at the same time it is rather tricky to accommodate. Our result confirms that picture: you have to add metric 3 to get it right, 1 and/or 2 by itself is not enough even if you're willing to play around with your analysis of relative clauses (wh VS promotion VS no movement).

I actually feel like this result is still way too specific. What I want is a general theory that tells us something like "if your metric is in complexity class C, and your structures satisfy property P, then you can only derive processing effects of type T". To the best of my knowledge nobody has ever tried to do something like that, but it seems to me that this is the only way to combat the problem of combinatorial explosion you run into when you want to do this kind of work on a bigger scale (dozens of syntactic analyses, parsing models, difficulty metrics, etc).

Actually, let me rephrase the first sentence: it might map neatly to online processing, but nobody has really worked that out yet. The MG parser is incremental and moves through the derivations in a specific way, so you can map its position in the structure to a specific position in the string. For all I know, this might make it straight-forward to turn the global difficulty metric into an incremental one. I haven't really thought about it all that much, so I can't even make an educated guess.

The project sounds fine as an analytic exercise, but it's not clear to me how hard it is. There are two obvious measures of load based on distance: linear proximity and hierarchical proximity. It sounds like you opted for door number 2. That's fine if we have any reason to think that this is a general metric for memory load. Is it? I dunno. But I guess I don't see why this was tricky. What's tricky is to see if this is more or less on the right track, and the SA suggests that it is. However, it also suggests that the SA is categorical (it should hold in every language regardless of frequency of usage). Tim has done some analytic work thinking about what kind of frequency data would overturn SA given an Entropy Reduction approach. If the SA did not hold categorically it would be very interesting. But thx for explaining what you are up to and why.

The challenge came mostly from spelling out that idea, grounding it in memory usage, and ensuring that it does not interfere with any of the previous findings --- the memory-load story works for at least three other phenomena: crossing VS nested dependencies, right embedding VS center embedding, and relative clause within a sentential clause VS sentential clause within a relative clause. I agree that it's not a spectacularly surprising finding, but it made for a nice seminar topic and gave the students quite a bit of material to chew on.

Regarding the universality of the SA, the next language to look at should probably be Basque, for which a fairly recent study has reported an object advantage. One conceivable story is that this is related to the fact that Basque is an ergative language and thus might use a very different structural configuration for subjects and objects.

@Thomas: It depends what you mean by a very different structural configuration. Basque, like most (if not all) ergative languages, exhibits all the regular subject-object asymmetries that you're familiar with from nominative-accusative languages (except for those having to do with case & agreement, of course). So, for example, the (ergative) subject binds the (absolutive) object but not vice versa, the (absolutive) object is structurally closer to the verb than the (ergative) subject is, etc. etc.

Now, none of that guarantees that the subject and object in Basque are literally in the same structural positions as, say, their counterparts in Japanese. But what I can tell you with confidence is that Basque is not what used to be called a "deep ergative" language, where subjecthood and objecthood (and Agenthood and Patienthood) are underlyingly inverted. As I intimated above, there are probably somewhere between zero and three (inclusive range) languages that are actually "deep ergative."

@Omer: If you're right then there's at least three possible outcomes: i) Basque is a clear-cut counterexample to structural explanations of subject/object preferences with relative clauses (which would be a neat and very strong result), ii) the metric can be refined even further to accommodate Basque (not particularly appealing imho), or iii) the experimental findings about Basque are wrong (though I can't imagine what kind of confound would produce a clear object advantage).

It will be interesting to see whether YCHWH's account has a good chance of extending to Basque. The Basque study (behind a paywall *sigh*) talks a little bit about structural ambiguity in relative clauses, and Basque RCs look very similar to East Asian RCs in that respect.

As I noted earlier, Tim has been working on some analytical results trying to explain what sort of frequencies would be required to reverse the SA in Korean-style RCs. The YCHWH account can accommodate a non-SA outcome, but it requires a pretty specific set of facts. In other words, it may come close to making a prediction about Basque data (though I suspect I am being over-optimistic here). Tim?

Yes, the YCHWH account (i.e. the Entropy Reduction Hypothesis) can definitely accommodate non-SA. The SA prediction in the three languages we looked at is a function of the grammars of those languages *and* the probabilities that describe the comprehender's expectations (which we drew from corpora, but you can get them from anywhere you want). I played around with the Japanese grammar, and found that there are definitely ways the corpus frequencies could have been which would have led to an object-advantage prediction, holding the grammar fixed. So knowing exactly what "the correct" grammar of Basque is would not itself be sufficient to get a prediction about whether SRCs or ORCs are predicted to be easier.

Put differently, it's in effect an empirical discovery that corpora tend to have the properties which lead the ERH to make the subject-advantage prediction. There's no necessary analytic connection between the ERH and the subject advantage.

I don't think we can confidently say that getting a non-SA prediction in the Chinese/Japanese/Korean cases that we looked at "requires a pretty specific set of facts". If I'm understanding right this would mean that in some sense "most of the ways the corpora could have been" end up producing the SA prediction, but I just have no idea whether this is right or not. (And of course the question of whether that's right or not is going to be different for each grammar you define.)

The ERH leads you to make the subject-advantage prediction based on a combination of your corpus and the syntactic representations you use to encode it, right? It would be interesting to see how robust the results are to various representational decisions in your Minimalist Grammar.

I also wonder about the consequences of analyzing a fragment of the grammar of the language rather than the full grammar - couldn't you be underestimating the entropy of a nonterminal that you happened not to care about that much in your fragment?

To the first point: yes -- but if I'm understanding what you mean correctly, that's just to say that the predictions are a function of the grammar and the corpus. The procedure that you follow to produce a probabilistic grammar from a grammar G and a corpus C will (typically?) involve working out what syntactic representations G assigns to the sentences in C. Even holding the grammar and corpus fixed, there are many such procedures: for one thing, you can parametrize the probability distribution in different ways (as I talked about here). That's even before considering different representational decisions in the grammar.

On the second point: yes, I think that's definitely possible (John may have thought about this more than I have), but I think the assumption/hope is that while the actual entropy values we compute are no doubt much smaller than the "real" values, the relationships among them (and therefore the points where higher and lower ER takes place) might be unaffected. The reasons for concentrating only on a small fragment are really just practical, at this point.

I should clarify one more thing: when I said above that "it's in effect an empirical discovery that corpora tend to have the properties which lead the ERH to make the subject-advantage prediction", this is not simply the discovery that corpora tend to have the property that SRCs are more frequent than ORCs. The properties that lead the ERH to make the subject-advantage prediction are much more subtle and complex, and dependent on all sorts of other things like the choice of grammar.

Yes, I don't think we disagree, it's just that the shorthand you used earlier ("the grammar of the language") could imply that there's only one possible grammar that derives the language, whereas in practice there are a wide range of grammars. And my hunch is that two grammars that have the same weak generative capacity can lead to different ERH predictions - it really depends on whether and where in the grammar you have "spurious" ambiguity.

As for the fragments, I think your assumption might hold if your productions are a representative sample of the grammar in some sense, though I'm not sure. But imagine a situation where your fragment doesn't have AdvP, but in reality AdvPs have a lot of internal entropy, and SRCs are more likely to have AdvPs than ORCs; wouldn't you underestimate the entropy of the SRCs in that case, and potentially derive the opposite predictions than if you estimated an empirical grammar?

Possibly... but we believe we considered a realistic subset of relevant alternative constructions. The burden of proof that we left out some really important alternative really falls on the proposer of an alternative theory :)

Tal wrote: Yes, I don't think we disagree, it's just that the shorthand you used earlier ("the grammar of the language") could imply that there's only one possible grammar that derives the language, whereas in practice there are a wide range of grammars. And my hunch is that two grammars that have the same weak generative capacity can lead to different ERH predictions - it really depends on whether and where in the grammar you have "spurious" ambiguity.

Right, I don't think we disagree either. When I wrote "the grammar of the language" I just meant "the grammar in the Basque speaker's head". There's only one of those (under the usual idealizing assumptions). There are no doubt many possible grammars that are "weakly equivalent to Basque", or "grammars which weakly generate Basque", and those would definitely make different ERH predictions, even holding the corpus and everything else fixed. The point I wanted to make was that even when you work out what that one true grammar is -- even when Thomas and Omer's questions about the Basque case system (note these are not just questions about the string language) were answered in full detail -- no predictions follow until you assign probabilities.

Oh, I understand what you meant now. I think the point still stands that we don't know exactly which of those weakly equivalent grammars are in the Basque speaker's head, how different are the grammars across Basque speakers, etc. This is clearly science fiction, but in principle you could even use the ERH in an individual differences study to figure out which grammar predicts each person's reading times best...