Monday, April 28, 2014

Here's a recent post on Andrew Gelman's blog. Gelman is a real big deal in stats, probability. This piece on the promise (or lack thereof) of Big Data seems to me to represent a turning of the zeitgeist. A while ago Big Data was going to solve every problem. Now, things are moving more cautiously. There's nothing wrong with data and nothing wrong with lots of it. But, it's no substitute for thought and it brings its own problems with it. Here's hoping the hype has peaked.

Sunday, April 27, 2014

This past Friday (April 25) I received word that Jim Higginbotham passed away. Jim made myriad contributions to both linguistic semantics and the philosophy of linguistics. He was among the first to argue for the relevance of Davidson's view of events for linguistic semantics. He also made major contributions to our understanding of pronominal binding and tense. Jim was a powerful thinker and his work influenced many younger researchers including Barry Schein, Paul Pietroski and Rich Larson. Sad news.

Friday, April 25, 2014

Ewan informs me that he and his buddies are launching a new site dedicated to: "issues big, small, and in between in the investigation of language and cognitive development using big data, reverse engineering, or by any manner of computational tools."

It looks like the kind of thing that people who have been reading material here might be interested in. Check it out at: bootphon.blogspot.fr (here).

Thursday, April 17, 2014

I have recently urged that we adopt a particular
understanding of the Strong Minimalist Thesis (SMT) (here). The version that I favor treats the SMT as a
thesis about systems that use
grammars and suggests that central features of the grammatical representations
that they use will be crucial to explaining why they are efficient. If this
proves to be doable, then it is reasonable to describe FL and the grammars it
makes available as “well designed” and “computationally efficient.” Stealing
from Bob Berwick (here),
I will take parsing efficiency to
mean real time parsing and acquisition efficiency to mean easy acquisition
given the PLD. Put this all together and
the SMT is the conjecture that the grammatical format of Gs and UG is critical
to allowing parsers, acquirers, producers, etc. to be very good at what they do
(i.e. to be well-designed). On this view, grammars are “well designed” or
“computationally efficient” in virtue of having properties that allow their
users to be good at what they do when such grammars are embedded transparently
within these systems.

One particularly attractive virtue of this interpretation
(for me) is that I understand how I could go about empirically investigating
it. I confess that this is not true for
other versions of the SMT that talk about neat fits between grammar and the CI
interface, for example. So far as I can tell, we know rather little about the
CI interface and so the question of fit is, at best, premature. On the other
hand we do know a bit about how parsing works and how acquisition proceeds so
we have something to fit the grammar to.[1]

So how to proceed? In two steps, I believe. The first is to
see if use systems (e.g. parsers) actually deploy grammars in real time, i.e.
as they parse. Thus, if it is true that the basic features of grammatical
representations are responsible for how (e.g.) parsers manage to efficiently do
what they do then we should find real time evidence implicating these representations
in real time parsing. Second, we should look for how exactly the implicated
features manage to make things so efficient. Thus, we should look for
theoretical reasons why parsers that transparently embody, say, Subjacency-like
principles would be efficient. Let
me discuss each of these points in turn.

There is increasing evidence from psycho-ling research indicating
that real time parsing respects grammatical distinctions, even very subtle
ones. Colin Phillips is a leader in this
kind of work and he and his (ex) students (e.g. Brian Dillon, Matt Wagers,
Ellen Lau, Dave Kush, Masaya Yoshida) have produced a body of work that
demonstrates how very closely parsers respect grammatical conditions like
islands, c-command, and local binding domains. And by closely I mean very closely. So, for example, Colin shows (here)
that online parsing respects the grammatical conditions that license parasitic
gaps. So, not only do parsers respect islands, but they even treat
configurations where island effects are amnestied as if they were not islands.
Thus, parsers respect both the general conditions that grammars lay down
regarding islands and the exceptions
to these general conditions that grammars allow. This is what I mean by
‘close.’

There is a recent excellent demonstration of this from
Masaya Yoshida, Lauren Ackerman, Morgan Purier and Rebekah Ward (YLPW) (here
are slides from a recent CUNY talk).[2]
YLPW analyzes the processing of backward sluicing constructions like (1):

(1) I don’t recall which writer, but the editor notified a writer about a new project

There is an ellipsis “gap” right after which writer that is redeemed by anchoring it to a writer in the following sentence. What
YLPW is looking to determine is whether the elided
gap site is sensitive to online
parsing effects. YLPW uses a plausibility effect as a probe, as follows.

First, it is well known that a wh in CP triggers an active search for a verb/gap that will give it
an interpretation. ‘Active’ here means that the parser uses a top-down
predictive process and eagerly looks to link the wh to a predicate without first consulting bottom-up information that
would indicate the link to be ill-advised. YLPW shows that the eagerness to
“fill a gap” is as true for implicit gaps within ellipsis sites as it is for
“real” gaps in regular wh sentences. YLPW shows this by demonstrating a
plausibility effect slowdown in
sentences like (2a) parallel to the one found in sentences like (2b):

(2) a. I don’t remember which writer/which book, but the editor notified a writer about a new book

    b. I don’t remember which writer/which book the editor notified GAP about a new book

When the wh is which book there is a significant
pause at notified in both sentences
in (2), as contrasted with the same sentences where which writer is the antecedent of the gap. This is because parsers, we know, greedily
try to relate the wh to the first
syntactically available position encountered, and in the case of which book the wh is not a plausible filler of the gap, so the attempted filling
results in a little lingering about the verb (*notify this book about…). If the antecedent is which writer no such pause occurs, for obvious reasons. The plausibility effect, then, is just a
version of the well-known filled gap effect, with a little semantic kicker to
add some frisson. At any rate, the first important discovery is that we find
the same plausibility effect in both (2a), with the gap inside a sluiced
ellipsis site, and (2b), where the gap is “overt.”

The next step is to see if this plausibility/filled gap
effect slowdown occurs when the relevant antecedent for the sluiced ellipsis
site is inside an island. It is well known that ellipsis, unlike movement, is not subject to island restrictions.
Thus, if the parser cleaves tightly
to the distinctions the grammar makes (as the SMT would lead us to expect) then
we should find no plausibility slowdowns inside islands except
when the gap is inside an ellipsis site, for the latter are not subject to
island restrictions [and so should induce filled gap/plausibility effects (added: thx Ewan)]. And that’s exactly
what YLPW finds. Though plausibility effects are not found at notified in cases like (3), they are found in cases like (4), where the
“gap” is inside a sluice site.

(3) I don’t remember which book [the editor who notified the publisher about some science book] had recommended to me

(4) I don’t remember which book, but [the editor who notified the publisher about some science book] recommended a new book to me

This is just what we expect from a parser that transparently
embeds a UG-like grammar that treats wh-dependencies, but not ellipsis, as products of
(long) movement.
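
The pattern of predictions in (2)-(4) can be spelled out as a small truth table. What follows is only an illustrative sketch in Python, not YLPW's model or analysis: the names `Condition` and `predicts_slowdown` are hypothetical, and the function simply encodes the assumption stated above, namely that an active parser posits a gap at the verb unless the grammar forbids it (a movement dependency into an island), while ellipsis is exempt from island restrictions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Condition:
    """One experimental condition (hypothetical encoding, for illustration only)."""
    filler_plausible: bool  # "which writer" (True) vs "which book" (False) as object of "notified"
    verb_in_island: bool    # is "notified" inside a relative clause island, as in (3) and (4)?
    sluice: bool            # is the wh-phrase a sluicing remnant (ellipsis) rather than overtly moved?


def predicts_slowdown(c: Condition) -> bool:
    """True if a grammar-respecting active parser should slow down at the verb."""
    if c.filler_plausible:
        return False  # the filler fits the verb, so nothing goes wrong
    if c.verb_in_island and not c.sluice:
        return False  # movement cannot reach into the island, so no gap is posited there
    return True       # a gap is posited and the implausible filler causes a slowdown


# The "which book" versions of examples (2a), (2b), (3) and (4):
examples = {
    "(2a)": Condition(filler_plausible=False, verb_in_island=False, sluice=True),
    "(2b)": Condition(filler_plausible=False, verb_in_island=False, sluice=False),
    "(3)": Condition(filler_plausible=False, verb_in_island=True, sluice=False),
    "(4)": Condition(filler_plausible=False, verb_in_island=True, sluice=True),
}

for label, condition in examples.items():
    print(label, predicts_slowdown(condition))
```

Only (3), the movement dependency into an island, comes out without a predicted slowdown, which is the contrast the text describes.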

The conclusion: it seems that parsers make just the
distinctions that grammars make when they parse in real time, just as the SMT
would lead us to expect.

So, there is growing evidence that parsers transparently
embed UG-like grammars. This readies us
for the second step. Why should they do so? Here, there is less current research
that bears on the issue. However, there is work from the 80s by Mitch Marcus,
Bob Berwick and Amy Weinberg showing that a Marcus-style parser that
incorporated grammatical features like Subjacency (and, interestingly, Extension)
could parse sentences efficiently (effectively, in real time). This is just what the doctor ordered. It goes
without saying (though I will say it) that this work needs updating to bear
more directly on the SMT and minimalist accounts of FL. However, it provides a
useful paradigm of how one might go
about connecting the discoveries concerning online parsing with computational questions
of parsing efficiency and their relationship to central architectural features
of FL/UG.

The SMT is a bold conjecture. Indeed, it is likely false, at
least in fine detail. This does not, however, detract from its programmatic
utility. The fact is that there is
currently lots of research that can be understood as bearing on its accuracy
and that fruitfully brings together work in syntax, psycholinguistics and
computational linguistics. The SMT, in
other words, is a terrific hypothesis that will generate fascinating work
regardless of its ultimate empirical fate. That’s what we want from a research program and that’s something that
the Strong Minimalist Thesis is ready to deliver. Were this all that the
Minimalist Program provided, it would have been enough (dayenu!). There is more, but for the nonce, this is more than
enough. Yay, for the Minimalist Program!!!

[1]
Let me modulate this: we know something about some other parts, see here
for discussion of magnitude estimation in the visual domain. Note that this
discussion fits well with the version of the SMT deployed here precisely
because we know something about how this part of the visual system works. We
cannot say as much about most of the other parts of CI. Indeed, we don’t really
know how many “parts” CI has.

[2]
They are running some more experiments, so this work is not yet finished.
Nonetheless, it illustrates the relevant point well, and it is really fun
stuff.

Friday, April 11, 2014

There are never enough good papers illustrating the poverty
of the stimulus. Here’s
a recent one that I read by Jennifer Culbertson and David Adger (yes, that
David Adger!) (C&A) that uses artificial language learning tasks as probes
into the kinds of generalizations that learners naturally (i.e. uninstructed) make.
Remember that generalization is the name of the game. Everyone agrees that no
generalizing beyond the input, no learning. The debate is not about whether
this exists, but what the relevant dimensions are that guide the generalization
process. One standard view is that it’s just frequency of some kind, often
bigram and trigram frequencies. Another is that the dimension along which a
learner generalizes is more abstract, e.g. along some dimension of linguistic
structure. C&A provide an
interesting example of the latter in the context of artificial language
learning, a technique, I believe, that is still new to most linguists.[1]

Let me say a word about this technique. Typological
investigation provides a standard method for finding UG universals. The method
is to survey diverse grammars (or more often, and more superficially,
languages) and see what properties they all share. Greenberg was a past master
of this methodology, though from the current perspective, his methods look
rather “shallow,” (though the same cannot be said of modern cartographers like
Cinque). And, looking for common features of diverse grammars seems like a plausible
way to search for invariances. The current typological literature is well
developed in this regard and C&A note that Greenberg’s U20, which their
experiment explores, is based on an analysis of 341 languages (p.2/6). So, these kinds of typological investigations
are clearly suggestive. Nonetheless, I think that C&A are correct in
thinking that supplementing this kind of typological evidence with experimental
evidence is a very good idea, for it allows one to investigate directly what typological surveys can
only do indirectly: to what degree the gaps in the record are principled. We know for a fact that the extant
languages/grammars are not the only possible
ones. Moreover, we know (or at least I believe) that the sample of grammars we
have at our disposal is a small subset of the possible ones. As artificial
language learning experiments promise to allow us to directly probe what typological comparison only allows us to indirectly infer, better to use the
direct method if it is workable. C&A’s paper offers a nice paradigm for how to do this, one that those
interested in exploring UG should look at with interest.

So what do C&A do? They expose learners to an artificial
version of English wherein the pre-nominal order of determiner, numeral and
adjective is flipped from the English case. So, in “real” English (RE), the
order and structure is [Dem [ num [ adj [ N ] ] ] ] (as in: these three yellow mice). C&A expose learners to nominal bits
of artificial English (AE) where the dem, num, and adj are postnominal. In
particular, they present learners with data like mice these, mice three, mice
yellow etc. and see how they generalize to examples with more than one
postnominal element, e.g. do learners prefer phrases in AE like mice yellow these or mice these yellow? If learners treat AE
as just like RE but for the postnominal order then they might be expected to
preserve postnominally in AE the word order they invariably see pre-nominally in RE
(thus to prefer mice these yellow).
However, if they prefer to retain the scope
structure of the expressions in RE and port that over to AE, then they will
prefer to preserve the bracketing noted above but flip the word order, i.e. [ [
[ [ N ] adj ] num ] dem ]. On the first hypothesis, learners prefer orders
they’ve encountered repeatedly in RE before, while on the second they prefer to
preserve RE’s more abstract scope relations when projecting to the new structures
in AE.
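
The two competing generalizations can be made concrete with a short Python sketch. This is an illustration under my own assumptions, not C&A's materials or analysis: the nested-tuple encoding of [Dem [ num [ adj [ N ] ] ] ] and the function names `linearize` and `surface_copy` are hypothetical.

```python
# Nested structure of the RE noun phrase: [Dem [ num [ adj [ N ] ] ] ]
# Each level is a (modifier, rest) pair; the innermost element is the noun.
RE_STRUCTURE = ("these", ("three", ("yellow", "mice")))


def linearize(tree, modifier_first=True):
    """Flatten the tree, placing each modifier before or after its sister."""
    if isinstance(tree, str):
        return [tree]
    modifier, rest = tree
    inner = linearize(rest, modifier_first)
    return [modifier] + inner if modifier_first else inner + [modifier]


def surface_copy(tree):
    """Keep the RE string order of the modifiers, but put the noun first."""
    words = linearize(tree, modifier_first=True)
    return [words[-1]] + words[:-1]


# RE order: modifiers precede the noun.
print(" ".join(linearize(RE_STRUCTURE, modifier_first=True)))
# Hypothesis 2 (scope-preserving): same bracketing, modifiers follow their sister.
print(" ".join(linearize(RE_STRUCTURE, modifier_first=False)))
# Hypothesis 1 (surface-order-preserving): noun first, modifier string unchanged.
print(" ".join(surface_copy(RE_STRUCTURE)))
```

The scope-preserving option is just the mirror image of RE, because flipping every modifier to the other side of its sister keeps the bracketing intact while reversing the string.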

So what happens? Well you already know, right? They go for
door number 2 and preserve the scope order of RE, thus reliably generalizing to
an order ‘N-adj-num-dem.’ C&A conclude, reasonably enough, that “learners
overwhelmingly favor structural similarity over preservation of superficial order”
(abstract, p.1/6) and that this means that “when they are pitted against one
another, structural rather than distributional knowledge is brought to bear
most strongly in learning a new language” (p.5/6). The relevant cognitive
bias, C&A conclude, is a constraint “enforcing an
isomorphism in the mapping between semantics and surface word order via
hierarchical syntax.”[2]

This actually coincides with similar biases young kids
exhibit in acquiring their first language. Lidz and Musolino (2006) (L&M)
show a similar kind of preference in relating quantificational scope and
surface word order. Together, C&A and L&M show a strong preference for
preserving a direct mapping between overt linear order and hierarchical
structure, at least in “early” learning, and, as C&A’s results show, that
this preference is not a simple L-R preference but a real structural one.

One further point struck me. We must couple the observed
preference for scope preserving order with a dispreference for treating surface
forms as a derived structure, i.e. a product of movement. C&A note that
‘N-dem-num-adj’ order is typologically rare. However, this order is easy enough
to derive from a structure like (1) via head movement given some plausible
functional structure. Given (1), N to F0 movement suffices.

(1) F0 [Dem [ num [ adj [ N ] ] ] ] → [N+F0 [Dem [ num [ adj [ N ] ] ] ] ]

We know that there are languages where N moves to above
determiners (so one gets the order N-det rather than Det-N) and though the
N-dem-num-adj order is “rare” it is, apparently, not unattested. So, there must be more going on. This, it goes without
saying I hope, does not detract from C&A’s conclusions, but it raises other
interesting questions that we might be able to use this technique to explore.

So, C&A have written a fun paper with an interesting
conclusion that deploys a useful method that those interested in FL might find
productive to incorporate into their bag of investigative tricks. Enjoy!

[1]
Though not to psychologists and some psycholinguists. Lidz and his student Eri
Takahashi (see here) have used
this technique to also argue against standard statistical approaches to
language acquisition.

Monday, April 7, 2014

I've been pretty busy recently and I doubt that I'll be able to post anything "meaty" this week (here's a good place to cheer btw). However, here are some things that might entertain you that people have sent me or that I have tripped over myself:

Thx to Avery for getting the right link to 4 below. The one I linked to earlier went nowhere. This one works.

1. An editorial on Big Data by Gary Marcus (here). It seems that the "hype-cycle" is cresting and that people are beginning to consider the problems with Big Data Science (BDS). BDS is the idea that big data sets can substitute for standard scientific practice whose aim is to uncover the causal structure of things. BDS seems happy substituting correlation for causation, the idea being that enough of the former and we can dispense with the latter. The recent Google flu failure has brought home to even the enthusiasts that there is no such thing as thought free science. At any rate, Gary here goes over in bullet form some of the drawbacks.

2. Pedro Martins sends me this link to an interesting interview with Marc Hauser. Those who want their bio-ling fix can get it here.

3. Talking about the "hype-cycle," here's a reaction to the MOOCification of education by someone who would have to implement it. Janet Napolitano (formerly head of the Department of Homeland Security, so not one of the usual go-to people) is the head of the UC system. Jerry Brown is a big enthusiast of MOOCs, seeing these as a way of providing a quality education to all at a reduced cost. Napolitano talks about the costs of MOOCs and what kinds of service they could provide. It is a reasonable reaction, IMO. Note her observations that these will not really save much money, if any. This, I believe, is a big deal. The fight is about transferring money from universities to education entrepreneurs. The total cost will not change much, if at all.

4. Last, here's a video of a recent talk by Chomsky at Keio (thx, Hisa). This one should occupy you for at least as much time as it takes you to read one of my long posts.

Thursday, April 3, 2014

I have been thinking again about the relationship between
Plato’s Problem and Darwin’s. The crux of the issue, as I’ve noted before (see
e.g. here)
is the tension between the two. Having a rich linguo-centric FL makes
explaining the acquisition of certain features of particular Gs easy (why? Because
they don’t have to be learned, they are given/innate). Examples include the
stubborn propensity for movement rules to obey island conditions, for
reflexives to resist non-local binding etc. However, having an FL with rich
language specific architecture makes it more difficult to explain how FL came
to be biologically fixed in humans. The problem gets harder still if one buys
the claim that human linguistic facility arose in the species in (roughly) only
the last 50,000-100,000 years. If this is true, then the architecture of FL must be
more or less continuous with what we find in other domains of cognition, with
the addition of a possible tweak or two (language is more or less an app in Jan
Koster’s sense). In other words, FL can’t be that
linguo-centric! This is the essential tension. The principal project of
contemporary linguistics (in particular that of the Minimalist Program (MP)), I
believe, should be to resolve this tension. In other words, to show how you can eat your Platonic cake and have Darwin’s
too.

How to do this? Well, here’s an unkosher way of resolving
the tension. It is not an admissible move in this game to deny Plato’s Problem
is a real problem. That does not “resolve” the tension. It denies that there is/was
one to begin with. Denying Plato’s Problem in our current setting includes
ignoring all the POS arguments that have been deployed to argue in favor of
linguo-centric structure for FL. Examples abound and I have been talking about
these again in recent posts (here,
here).
Indeed, most of the structure GB postulates, if an accurate description of FL,
is innate or stems from innate mental architecture. GB’s cousins (H/GPSG, LFG, RG) have their
corresponding versions of the GB modules and hence their corresponding linguo-centric
innate structures. The interesting MP question is how to combine the fact that
FL has the properties GB describes with a plausible story of how these GBish
features of FL could have arisen. To repeat: denying that Plato’s Problem is
real or denying that FL arose in the species at some time in the relatively
recent past does not solve the MP problem, it denies that there is any problem
to solve.[1]

There is one (and so far as I can tell, only one) way of
squaring this apparent circle: to derive the properties of GB from simpler
assumptions. In other words, to treat GB
in roughly the way the theory of Subjacency treats islands: to show that the
independent principles and modules are all special cases of a simpler more
plausible unified theory.

This project involves two separate steps.

First, we need to show how to unify the disparate modules. A
good chunk of my research over the last 15 years has aimed for this (with
varying degrees of success). I have argued (though I have persuaded few) that
we should try and reduce all non-local dependencies to “movement” relations.
Combine this with Chomsky’s proposal that movement and phrase building devolve
to the same operation ((E/I)-Merge) and one gets the result that all
grammatical dependencies are products of a single operation, viz. Merge.[2]
Or to put this now in Chomsky’s terms, once Merge becomes cognitively available
(Merge being the evolutionary miracle, aka, random mutation), the rest of GB does
as well, for GB is nothing other than a catalogue of the various kinds of Merge
dependencies available in a computationally well-behaved system.
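
For readers who like to see the idea in miniature, here is a hedged Python sketch of one common set-theoretic way of rendering Merge, in which Merge(a, b) simply forms the set {a, b}, and the E/I distinction depends only on whether one input is already contained in the other. The encoding is an assumption for illustration (plain strings as lexical items, `frozenset` as the syntactic object, hypothetical helper names `terms` and `is_internal`), not a claim about the details of Chomsky's formulation.

```python
def merge(a, b):
    """Combine two syntactic objects into the unordered set {a, b}."""
    return frozenset([a, b])


def terms(x):
    """Yield x itself and every object contained anywhere inside x."""
    yield x
    if isinstance(x, frozenset):
        for part in x:
            yield from terms(part)


def is_internal(a, b):
    """I-Merge: b is already a proper term of a. Otherwise the merge is external."""
    return a != b and any(t == b for t in terms(a))


# E-Merge builds phrase structure from independent objects.
vp = merge("saw", "what")
tp = merge("John", vp)

# I-Merge ("movement") re-merges something already inside the structure.
cp = merge("what", tp)

print(is_internal("saw", "what"))  # external: "what" is not inside "saw"
print(is_internal(tp, "what"))     # internal: "what" is a term of tp
```

On this rendering the same operation delivers both phrase building and displacement, which is the unification the paragraph above relies on.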

Second, we need to show that once Merge arises, the
limitations on the Merge dependencies that GB catalogues (island effects,
binding effects, control effects, etc.) follow from general (maybe ‘generic’ is
a better term) principles of cognitive computation. If we can assimilate
locality principles like the PIC and Minimality and Binding Domain to (plausibly)
more cognitively generic principles like Extension (conservativity) or
Inclusiveness then it is possible to understand that GB dependencies are what
one gets if (i) all operations “live on” Merge and (ii) these operations are
subject to non-linguocentric principles of cognitive computation.

Note that if this
can be accomplished, then the tension noted at the outset is resolved.
Chomsky’s hunch, the basic minimalist conjecture, is that this is doable; that
it is possible to reduce grammatical dependencies to (at most) one or two
specifically linguistic operations which, when combined with other cognitive operations
plus generic constraints on cognitive computation (ones not particularly
linguo-centric), yield the effects of GB.

There is a second, additional
conjecture that Chomsky advances that bears on the program. This second
independent hunch is the Strong Minimalist Thesis (SMT). IMO, it has not been
very clear how we are to understand the SMT. The slogan is that FL is the
“optimal solution to minimal design specifications.” However, I have never found the intent of
this slogan to be particularly clear. Lately, I have proposed (e.g.
here) that we understand the SMT in the context of one of the
four classic questions in Generative Grammar: How are Gs put to use? In
particular, the SMT tells us that grammars are well designed for use by the
interfaces.

I want to stress that SMT is an extra hunch about the structure of FL. Moreover, I believe that
this reconstruction of the problematic (thanks Hegel) might not (most likely, does
not) coincide with how Chomsky understands MP. The paragraphs above argue that
reconciling Darwin and Plato requires showing that most of the principles
operative in FL are cognitively generic
(viz. that they are operative in other non-linguistic cognitive domains). This
licenses the assumption that they pre-exist the emergence of FL and so we need
not explain why FL recruits them. All that is required is that they “be there”
for the taking. The conjecture that FL is optimal
computationally (i.e. that it is well-designed wrt use by the interfaces)
goes beyond the evolutionary assumption required to solve the Plato/Darwin
tension. The SMT postulates that these evolutionarily available principles are
also well designed. This second conjecture, if true, is very interesting
precisely because the first Darwinian one can be true without the second
optimal design assumption being true. Moreover, if the SMT is true, this might
require explanation. In particular, why
should the evolutionarily available mechanisms that FL embodies be well designed for
use (especially given that FL is of recent vintage)?[3]

That said, what’s “well designed” mean? Well, here’s a
proposal: that the competence constraints that linguists find suffice for efficient
parsing and easy learnability. There is actually a lost literature on this
conjecture that precedes MP. For example, the work by Marcus and Berwick &
Weinberg on parsing, and Culicover & Wexler and Berwick on learnability
investigate how the constraints on linguistic representations, when
transparently embedded in use systems, can allow for efficient parsing and easy
learnability.[4] It is natural to say that grammatical
principles that allow for efficient parsing and easy learning are themselves computationally optimal in a
biologically/psychologically relevant sense. The SMT can be (and IMO, should
be) understood as conjecturing that FL produces grammars that are
computationally optimal in this sense.

Two thoughts to end:

First, this way of conceiving of MP treats it as a very
conservative extension of the general generative program. One of the
misconceptions “out there” (CSers and Psychologists are particularly prone to
this meme) is that Generativists change
their minds and theories every 2 months and that this theoretical Brownian
motion is an indication that linguists know squat about FL or UG. This is
false. The outline of MP as necessarily incorporating GB results (with the aim
of making them “theorems” in a more general theoretical framework) emphasizes
that MP does not abandon GB results but tries to explain them. This is what
typically takes place in advancing sciences and it is no different in
linguistics. Indeed, a good Whig history of Generative Grammar would
demonstrate that this conservatism has been characteristic of most of the
results from LSLT to MP. This is not the place to show this, but I am planning
to demonstrate it anon.

Second, MP rests on two different but related Chomskyan
hunches (‘conjectures’ would sound more serious, so I suggest you use this term
when talking to the sciency types on the prestigious parts of campus): first,
that it is possible to resolve the tension between Plato and Darwin without
doing damage to the former, and second, that
the results will be embeddable in use systems that are computationally
efficient. We currently have schematic
outlines for how this might be done (though there are many holes to be filled).
Chomsky’s hunch is that this project can be completed.

IMO, we have made some progress towards showing that this is
not a vain hope, in fact that things are better than one might have initially
thought (especially if one is a pessimist like me).[5]
However, realizing this ambitious program requires a conservative attitude
towards past results. In particular, MP does not imply that GB is passé. Going beyond explanatory adequacy does not imply forgetting about
explanatory adequacy. Only cheap minimalism forgets what we have found, and as
my mother repeatedly wisely warned me “cheap is expensive in the long run.” So,
a bit of advice: think babies and bathwaters next time you are tempted to dump
earlier GB results for purportedly minimalist ends.

[1]
It is important to note that this is logically possible. Maybe the MP project
rests on a misdescription of the conceptual lay of the land. As you might
imagine, I doubt that this is so. However, it is a logical possibility. This is
why POS phenomena are so critical to the MP enterprise. One cannot go beyond explanatory adequacy without some
candidate theories that (purport to) have
it.

[2]
For the record, I am not yet convinced of Chomsky’s way of unifying things via
Merge. However, for current purposes, the disagreement is not worth pursuing.

[3]
Let me reiterate that I am not
interpreting Chomsky here. I am pretty sure that he would not endorse this
reconstruction of the Minimalist Problematic. Minimalists be warned!

[4]
In his book on learning, Berwick notes that it is a truism in AI that “having
the right restrictions on a given representation can make learning simple.”
Ditto for parsing. Note that this does not
imply that features of use cause
features of representations, i.e. this does not imply that demands for
efficient parsability cause grammars to have Subjacency-like locality
constraints. Rather, for example, grammars that have Subjacency-like
constraints will allow for simple transparent embeddings into parsers that will
compute efficiently and support learning algorithms that have properties that
support “easy” learning (See Berwick’s
book for lots of details).

[5]
Actually, if pressed, I would say that we have made remarkable progress in
cashing in Chomsky’s two bets. We have managed to outline plausible theories of
FL that unify large chunks of the GB modules and we have begun to find concrete
evidence that parsing, production and language acquisition all transparently
use the kinds of representations that competence theories have discovered. The
project is hardly complete. But, given the ambitious scope of Chomsky’s
hunches, IMO we have every reason to be sanguine that something like MP is
realizable. This, however, is also fodder for another post at another time.