
Wednesday, November 28, 2012

As readers may have noticed (even my mother has noticed!), I
am very fond of Poverty of Stimulus arguments (POS). Executed well, POSs
generate slews of plausible candidate structures for FL/UG. Given my delight in
these, I have always wondered why it is that many other otherwise intelligent
looking/sounding people don’t find them nearly as suggestive/convincing as I
do. It could be that they are not nearly as acute as they appear (unlikely), or
it could be that I am wrong (inconceivable!), or it could be that discussants
are failing to notice where the differences lie. I would like to explore this
last possibility by describing two different senses of pattern, one congenial
to an empiricist mind set, and one not so much. This is not, I suspect, a conscious
conviction and so highlighting it may allow for a clearer understanding of
where disagreement lies, even if it does not lead to a Kumbaya resolution of
differences. Here goes.

The point I want to make rests on a cute thought experiment
suggested by an observation by David Berlinski in his very funny, highly
readable and strongly recommended (especially for those who got off on
Feyerabend’s jazz style writing in Against Method) book Black Mischief. Berlinski discusses two kinds of patterns.
The first is illustrated in the following non-terminating decimal expansions:

1. (a) .222222…
   (b) .333333…
   (c) .454545…
   (d) .123412341234…

If asked to continue into the … range, a normal person (i.e.
a college undergrad, the canonical psych subject and the only person buyable
with a few “extra” credits, i.e. cheap) would continue (1a) with more 2s, (1b)
with more 3s, (1c) with 45s, and (1d) with 1234s.
Why? Because the average person would detect the indicated pattern and
generalize as indicated. People are good
at detecting patterns of this sort. Hume discussed this kind of pattern
recognition behavior, as have empiricists ever since. What the examples in (1)
illustrate is constant conjunction, and this leads to a simple pattern that
humans have little trouble extracting (at least in the simple cases[1]).
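As an illustration (my sketch, not Berlinski's), the inductive bias that suffices for the cases in (1) can be stated in a few lines: find the shortest repeating block consistent with the observed digits, then keep repeating it. The function names here are my own inventions.

```python
def shortest_period(digits):
    """Return the shortest block whose repetition generates `digits`."""
    for length in range(1, len(digits) + 1):
        block = digits[:length]
        # Check the candidate block against every observed position.
        if all(digits[i] == block[i % length] for i in range(len(digits))):
            return block
    return digits

def continue_expansion(digits, n_more):
    """Extend the observed digits by n_more digits using the detected period."""
    block = shortest_period(digits)
    start = len(digits)
    return "".join(block[(start + i) % len(block)] for i in range(n_more))

print(continue_expansion("454545", 4))        # -> 4545
print(continue_expansion("123412341234", 4))  # -> 1234
```

Note that footnote [1]'s caveat falls out of this sketch directly: a period of 2,500 digits cannot even be identified until at least 2,500 digits have been observed.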

Now as we all know, this will not get us great results for
examples like (2).

2. (a) .141592653589793…
   (b) .718281828459045…

The cognoscenti will
have recognized (2a) as the decimal part of the decimal expansion of π (first
15 digits) and (2b) as the decimal part of the decimal expansion of e (first 15 digits). If our all-purpose
undergrad were asked to continue the series he would have a lot of trouble
doing so (Don’t take my word for it. Try the next three digits[2]).
Why? Because these decimal expansions don’t display a regular pattern as they
have none. That’s what makes these numbers irrational in contrast with the
rational numbers in (1). However, and this is important, the fact that
they don’t display a pattern does not mean that it is impossible to generate the decimal expansions in (2). It
is possible and there are well known algorithms for doing so (as we display
anon). However, though there are generative procedures for calculating the decimal
expansions of π and e, these
procedures differ from the ones underlying (1) in that the products of the procedures don’t exhibit a perceptible pattern.
The patterns, we might say, contrast in that the patterns in (1) carry the
procedures for generating them in their
patterning (Add 2,3, 45, 1234, to the end), while this is not so for the
examples in (2). Put crudely, constant conjunction and association exercised on
the patterning of 2s in (1a) lead to the rule ‘keep adding 2’ as the rule for
generating (1a), while inspecting the patterning of digits in (2a) suggests
nothing whatsoever about the rule that generates it (e.g. (3a)). And this, I believe, is an important conceptual
fault line separating empiricists from rationalists. For empiricists, the
paradigm case of a generative procedure is intimately related to the observable
patternings generated while Rationalists have generally eschewed any
“resemblance” between the generative procedure and the objects generated. Let
me explain.

As Chomsky has
repeatedly correctly insisted, everybody
assumes that learners come to the task of language acquisition with
biases. This just means that everyone
agrees that what is acquired is not a list, but a procedure that allows for
unbounded extension of the given (finite) examples in determinate ways. Thus, everyone (viz. both empiricists and
rationalists (thus, both Chomsky and his critics)) agrees that the aim is to
specify what biases a learner brings to the acquisition task. The difference
lies in the nature of the biases each is willing to consider. Empiricists are
happy with biases that allow for the filtering of patterns from data.[3]
Their leading idea is that data reveals patterns and that learning amounts to
finding these in the data. In other
words, they picture the problem of learning as roughly illustrated by the
example in (1). Rationalists agree that
this kind of learning exists,[4] but that there are learning problems
akin to that illustrated in (2), and that this kind of learning demands departure
from algorithms that look for “simple” patternings of data. In fact, it requires
something like a pre-specification of the possible
generative procedures. Here’s what I
mean.

Consider learning the
decimal expansion of π. It’s possible to “learn” that some digit sequence is
that of π by sampling the data (i.e. the digits) if, for example, one is biased
to consider only a finite number of pre-specified
procedures. Concretely, say I am given the generative procedures in (3a)
and (3b) and am shown the digits in (2a). Could I discover how to continue the
sequence so armed? Of course. I could quickly come to “know” that (3a) is the
right generative procedure and so I could continue adding to the … as desired. (Excuse 'infinity' below. Blogspot doesn't like the infinity sideways 8)

How would I come to
know this? By plugging several values for k,
n into (3a,b) and seeing what pops
out. (3a) will spit out the sequence in (2a) and (3b) that of (2b). These
generative procedures diverge very quickly. Indeed, the very first computed digit
makes the choice easy: asked to choose between (3a) and (3b) given the data in (2a),
(3a) wins hands down. The moral: even
if there are no patterns in the data,
learning is possible if the range of relevant choices is sufficiently
articulated and bounded.
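The formulas (3a,b) were images that evidently did not survive Blogspot, so here is a stand-in sketch of the selection procedure with two well-known generative procedures in their place: Machin's formula for π and the factorial series for e. All names are my own; this is an illustration, not the post's original (3a,b).

```python
from decimal import Decimal, getcontext

getcontext().prec = 30  # enough working precision for 15 reliable digits

def arctan_inv(x, terms=50):
    """arctan(1/x) via its Taylor series, in Decimal arithmetic."""
    inv = Decimal(1) / x
    term, total, sign = inv, Decimal(0), 1
    for k in range(terms):
        total += sign * term / (2 * k + 1)
        term *= inv * inv
        sign = -sign
    return total

def pi_digits(n):
    """First n decimal digits of pi, via Machin's formula."""
    pi = 16 * arctan_inv(Decimal(5)) - 4 * arctan_inv(Decimal(239))
    return str(pi)[2:2 + n]  # strip the leading "3."

def e_digits(n):
    """First n decimal digits of e, via the series sum 1/k!."""
    total, term = Decimal(0), Decimal(1)
    for k in range(1, 40):
        total += term
        term /= k
    return str(total)[2:2 + n]  # strip the leading "2."

def choose(observed, hypotheses):
    """Return the name of the first hypothesis matching the observed digits."""
    for name, generate in hypotheses.items():
        if generate(len(observed)) == observed:
            return name
    return None

print(choose("141592653589793",
             {"(3a) pi": pi_digits, "(3b) e": e_digits}))  # -> (3a) pi
```

Given the 15 observed digits, the very first digit already eliminates one hypothesis, which is the point: the data select among pre-specified generative procedures even though they display no perceptible pattern.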

This is just a thought
experiment, but I think that it highlights several features of importance.
First, that everyone is knee-deep in biases, a.k.a. innate, given modes of
generalization. The question is not
whether these exist but what they are. Empiricists, from the Rationalist point
of view, unduly restrict the admissible biases to those constructed to find
patterns in the data. Second, that even in the absence of patterned
data, learning is possible if we consider it as a choice among given hypotheses. Structured hypothesis
spaces allow one to find generative procedures whose products display no
obvious patterns. Bayesians, by the way, should be happy with this last point,
as nothing in their methods restricts what’s in the hypothesis space. Bayes
instructs us how to navigate the space given input data. It has nothing to say
about what’s in the space of options to begin with. Consequently there is no a priori reason for restricting it to
some functions rather than others. The matter, in other words, is entirely
empirical. Last, for any problem of interest, it pays to ask whether it is more
like that illustrated in (1) or in (2). One way of understanding Chomsky’s
point is that when we understand what we want to explain, i.e. that linguistic
competence amounts to a mastery of “constrained homophony” over an unbounded
domain of linguistic objects (see here), then the problem looks much more like that in (2)
than in (1), viz. there are very few (1) type patterns in the data when you
look closely and there are even fewer when the nature of the PLD is
considered. In other words, Chomsky’s
bet (and on this I think he is exactly right) is that the logical problem of
language acquisition looks much more like (2) than like (1).

A historical aside:
Here, Cartwright provides the ingredients for a nice reconstructed history.
Putting more than a few words in her mouth, it would go something like this:

In the beginning there was Aristotle. For him minds could form concepts/identify
substances from observation of the elements that instanced them (you learn
‘tiger’ by inspecting tigers, tiger-patterns lead to ‘tiger’ concepts/extracted
tiger-substances). The 17th century dumped Aristotle’s epistemology
and metaphysics. One strain rejected the substances and substituted the
patterns visible to the naked eye (there is no concept/substance ‘tiger’ just
some perceptible tiger patternings). This grew up to become Empiricism. The
second retained the idea of concepts/substances but gave up the idea that
these were necessarily manifest in visible surface properties of experience (so
‘tiger’ may be triggered by tigers but the concept contains a whole lot more
than what was provided in experience, even in the patternings). This view grew up to be Rationalism.
Empiricists rejected the idea that conceptual contents contain more than meets
the eye. Rationalists gave up the idea that the contents of concepts are exhausted by
what meets the eye.

Interestingly, this
discussion persists. See for example Marr’s critique of Gibsonian theories of
visual perception here. In sum, the idea that learning is restricted to
patterns extractable from experience, though wrong, has a long and venerable
pedigree. So too the Rationalist alternative. A rule of thumb: for every
Aristotle there is a corresponding Plato (and, of course, vice versa).

[1]
There is surely a bound to this. Consider a decimal expansion whose period is a
sequence of 2,500 digits. This would likely be hard to spot and the wonders of
“constant” conjunction would likely be much less apparent.

[3]
Hence the ton of work done on categorization, categorization of prior
categorizations, categorization of prior categorizations of prior
categorizations…

[4]
Or may exist. Whether it does is
likely more complicated than usually assumed as Randy Gallistel’s work has
shown. If Randy is right, then even the parade cases for associationism are
considerably less empiricist than often assumed.

Monday, November 26, 2012

In the last several years I have become a really big fan of
singing mice. It seems that unbeknownst
to us, these white little fur balls have been plunging from aria to aria while
gorging on food pellets and simultaneously training their ever-vigilant grad
student minders to react appropriately whenever they pressed a bar. Their songs sound birdish though at a higher
pitch. Now it seems that many kinds of mice sing, not only those complaining of incarceration. I was delighted and amazed
(though as my daughter pointed out, we’ve known since the first Fievel film
that mice are great singers).

I don’t know how extensively rodent operettas have been
studied, but recently there has been a lot of research on the structure of bird
song and interesting speculation about what it may tell us about the species
specificity of the kind of hierarchical recursion we find in natural language
(NL). Berwick, Beckers, Okanoya and Bolhuis (BBOB; hmm, kind of a stuttering
version of Berwick’s first name) provide an extensive linguist friendly review
of the relevant literature which I recommend to the ornithophile with interests
in UG.

BBOB’s review is especially relevant to anyone interested in
the evolution of the faculty of language (FL) (ahem, I’m talking to all you
minimalists out there!). They note “many striking parallels between speech and vocal
production and learning in birds and humans” but also note qualitative
differences “when one compares language syntax and birdsong more generally
(5/1).” The value of the review, however, is not in these broad conclusions but
in the detailed comparisons between phonological vs syntactic vs birdsong
structure that it outlines. In particular, both birdsong and the human sound
system display precedence-based dependencies (first-order Markov), adjacency-based
dependencies, some (limited) non-adjacent dependencies, and the grouping of
elements into “chunks” (“phrases,” “syllables”). In effect, birdsongs seem restricted to
linear precedence relations alone, just what Heinz and Idsardi propose suffices
to represent the essentials of the human sound system. Importantly, there is no
evidence that birdsong allows for the kind of hierarchical recursion that is
typical of syntactic structures:

Birdsong does not admit such extended self-nested structures; even in
the nightingale, song chunks are not contained within other song chunks, or song
packets within other song packets, or contexts within contexts (5/6) (my
emphasis).

Nor do they provide any evidence for unbounded dependencies,
unboundedly hierarchical asymmetric “phrases,” or displacement relations (aka
movement), all characteristic features of NLs.

The BBOB paper also contains an interesting comparison of
songbird and human brains remarking on various possible shared vocalization
homologies in human and bird brain architecture. Even FoxP2, (that ubiquitous
rascal) makes a cameo appearance, with BBOB briefly reviewing the current
speculations concerning how “this system may be part of a “molecular toolkit
that is essential for sensory-guided motor learning” in the relevant regions of
songbirds and humans (5/9).”

All in all then I found this a very useful guide to the current
state of the art, especially for those
with minimalist interests.

Why minimalists in particular? Because it has possible
bearing on a currently active speculation regarding the species specificity and
domain specificity of Merge. Merge,
recall, is the minimalist replacement for phrase structure rules (and movement). It’s the operation
responsible both for unbounded hierarchical embedding and displacement. So if birdsong displays context-free patterns,
one source for this could be the presence of Merge as a basic operation in the
songbird brain. BBOB carefully review the evidence that birdsong patterns
exceed the descriptive power of finite transition networks and demand the
resources of context free grammars. They conclude that there is currently “no compelling
evidence” that they do (5/14). Furthermore, BBOB note that there is no evidence
for displacement-like operations in birdsong, the second product of a merge-like
operation. Thus, at this time, NLs alone provide clear evidence of context-free
and displacement structures. So, if Merge is the operation that generates
such structures, there is currently no evidence that Merge has arisen in any
species other than humans or in any domain other than syntax.
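To make the formal contrast concrete (my illustration, not BBOB's): a first-order Markov, i.e. bigram, acceptor of the kind adequate for birdsong-like patterns can license repeated-chunk songs such as (ab)+, but no bigram acceptor can capture the nested aⁿbⁿ dependencies that context-free grammars (and Merge-built structures) exhibit.

```python
def bigram_accepts(string, allowed_pairs, start, end):
    """Accept iff every adjacent pair in the edge-padded string is licensed."""
    if not string:
        return False
    padded = start + string + end
    return all(padded[i:i + 2] in allowed_pairs for i in range(len(padded) - 1))

# A toy "song grammar": chunks of 'ab' repeated, i.e. the language (ab)+.
song_pairs = {">a", "ab", "ba", "b<"}

print(bigram_accepts("ababab", song_pairs, ">", "<"))  # True
print(bigram_accepts("aabb", song_pairs, ">", "<"))    # False ('aa' unlicensed)
```

By contrast, 'aabb' and 'aaabb' contain exactly the same set of adjacent pairs, so any bigram acceptor must treat them alike even though only one belongs to aⁿbⁿ; capturing that dependency requires counting over unbounded distances, i.e. context-free power.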

Why is this important for minimalists? The minimalist
Genesis story goes as follows: Some “miracle” occurred in the last 100,000
years that allowed for NLs to arise in humans. Following Chomsky, let’s call
this miracle “Merge.” By hypothesis, Merge is a very “simple” addition to the
cognitive repertoire. Conceptually, there are (at least) two ways it might have
been added: (i) Merge is a linguistically specific miracle or (ii) it is a more
general cognitive one. If (ii), then we might expect Merge to have arisen
before in other species and to be expressed in other cognitive domains, e.g.
birdsong. This is where BBOB’s
conclusions are important for they indicate that there is currently no evidence
in birdsong for the kind of structures (i.e. ones displaying unbounded nested
dependencies and displacement) Merge would generate. Thus, at present, the only
cognitive products of Merge we have found occur in species that have NLs, i.e.
us.

Moreover, as BBOB emphasize, the impact of Merge is only
visible in a subpart of our linguistic products. It is a property of syntactic
structures not phonological ones. Indeed, as BBOB show, human sound systems and
birdsong systems look very similar. This
suggests that Miracle Merge is quite a picky operation, exercising its powers
in just a restricted part of FL (widely construed). So not only is Merge not cognitively general, it’s not even linguistically general. Its signature properties are restricted to
syntactic structures.

If this is correct, then it suggests (to me at least) that
Merge is a linguistically local miracle and so proprietary to FL and so part of
UG. This, I believe, comports more with Chomsky’s earlier conception of Merge,
than his current one. The former sees
the capacity to build bigger and bigger hierarchically embedded structures (and
movement) as resting on being able to spread “edge features” (EF) from lexical
items to the complexes of lexical items that Merge forms. So given two lexical items (LI) (each with an
inherent EF), a complex inherits an EF (presumably from one of its participants)
and this inherited EF is what licenses the further merging of the created
complex with other EF bearing elements (LIs and earlier products of Merge).
Inherited EFs then are essentially the products of labeling (full disclosure: I
confess to liking this idea, as I outlined/adopted a version of it here (btw, it makes a wonderful stocking stuffer so buy early, buy often!)), and
labeling is the miracle primarily responsible for the e(/I)mergence (like that?) of
both phrase structure and displacement.

Chomsky’s more current view seems to be that labeling (and
so EFs) are dispensable and that Merge alone is the source of phrase structure
and movement. There is no need for EFs as Merge is defined as being able to apply to any cognitive objects at all,
primitive or constructed. In particular,
both lexical items and complexes of lexical items formed by prior applications
of Merge are in the domain of Merge. EFs are unnecessary and so, with a hat tip
to Ockham, should be dispensed with.

And this brings us back to birds, their songs and their
brains. It would have been a powerful
piece of evidence in favor of this latter conception were a signature of Merge attested
in the cognitive products of some other species, for it would have been evidence
that the operation isn’t FL/UG-peculiar.
Birdsong was a plausible place to look, and it appears that it isn’t
there. BBOB’s review locates the effects
of Merge exclusively in the syntax of NL.
Were Merge more domain general and less species specific we might have
expected other dogs to bark (or sing more complex songs). And though absence of evidence should not be
mistaken for evidence of absence, at least right now, it looks like Merge is
very domain specific, something more compatible with Chomsky’s first version of
Merge than his second.

Sunday, November 25, 2012

In this morning's NY Times, James Atlas has an interesting opinion piece about rising tides and the human tendency to be willfully ignorant. In his essay, there is also a passage that will leap out for anyone familiar with questions about what city names denote. (Does 'London' denote a geographic region that might become uninhabited, a polis that might be relocated along with important buildings, or something else?) Mr. Atlas says that while there is a "good chance that New York City will sink beneath the sea,"

...the city could move to another island, the way Torcello was moved to Venice, stone by stone, after the lagoon turned into a swamp and its citizens succumbed to a plague of malaria. The city managed to survive, if not where it had begun. Perhaps the day will come when skyscrapers rise out of downtown Scarsdale.

Not cheery, even given the most optimistic assumptions about Scarsdale. But it seems that a competent speaker--indeed, a very competent user of language--can talk about cities in this way, expect to be understood, and expect the editors at The Newspaper of Record to permit such talk on their pages. But when discussing this kind of point about how city/country names can be used, often in the context of Chomsky's Austinian/Strawsonian remarks about reference, I'm sometimes told that "real people" don't talk this way. (You know who you are out there.) And if global warming can make such usage standard, then theorists can't bracket the usage as marginal, at least not in the long run. It may be that Venice, née Torcello but not identical with current Torcello, will need to be moved again.
Someday, everyone will admit that natural language names are not parade cases for a denotational conception of meaning. The next day, The Messiah will appear. (Apologies to Jerry Fodor for theft of joke.) Once we get beyond the alleged analogy of numbers being denoted by logical constants in an invented language, things get pretty complicated: many Smiths, one Paderewski; Hesperus and Venus; Neptune and Vulcan; the role of Macbeth in Macbeth, and all the names he could have given to the dagger that wasn't there; Julius and the zip(per); the Tyler Burge we all know about, a.k.a. Professor Burge; The Holy Roman Empire, The Sun, The Moon; all those languages in which "names" very often look/sound like phrases that have proper nouns as predicative components; etc. It's also very easy to use 'name' in ways that confuse claims about certain nouns, which might appear as constituents of phrases headed by (perhaps covert) demonstratives or determiners, with hypothesized singular concepts that may well be atomic. This doesn't show that names don't denote. But it should make one wonder.
Yet in various ways, various people cling to the idea that a name like 'London' is an atomic expression of type <e> that denotes its bearer. Now I have nothing against idealizations. But there is a difference between a refinable idealization that gets at an important truth (e.g., PV = k, PV = nRT, the van der Waals equation) and a simplification that is just false though perhaps convenient for certain purposes (e.g., taking the earth to be the center of the universe when navigating on a moonless night). One wants an idealization to do some explanatory work, and ideally, to provide tolerably good descriptions of a few model cases. So if we agree to bracket worries about Vulcan and Macbeth, along with worries about Smiths and Tyler Burge and so on--in order to see how fruitful it is to suppose that names denote things--then it's a bit of a letdown to be told that 'London' denotes a funny sort of thing, and that to figure out what 'London' denotes (and sorry Ontario, there's only one London), we'll have to look very carefully at how competent speakers of a human language can use city names.
Perhaps 'New York City', as opposed to 'Gotham', is a grammatically special case. And perhaps names introduced for purposes of story telling are semantically special in some way that doesn't bother kids. Believe it if you must. But if a semanticist tells you that 'London' denotes London, while declining to say what the alleged denotatum is (except by offering coy descriptions like 'the largest city in England'), then the semanticist doesn't also get to tell you that a denotational conception of meaning is confirmed by the "truth" that 'London' denotes London.
One doesn't just say that '4' denotes Four, and then declare victory. In this case, it's obvious that theorists need to say a little more about what (the number) Four is--perhaps by saying what Zero is, appealing to some notion of succession, and then showing that our best candidate for (being) Four has the properties that Four seems to have. But once characterized, the fifth natural number stays put, ontologically speaking. While it may be hard to know what abstracta are, there is little temptation to talk about them as if they were spatiotemporally located. More generally, we can say that '4' denotes a certain number without implying that some thing in the domain over which we quantify has a cluster of apparently incompatible properties. To that extent, saying that '4' denotes doesn't get us into trouble.
In principle, one can likewise cash out the idealization regarding city names. But to do so, one needs an independent characterization of the cities allegedly denoted, such that the domain entities thereby characterized can satisfy predicates like 'is on an island', 'was moved onto an island', 'could be moved inland', 'is crowded', 'will be uninhabited', etc. Perhaps this can be done. I won't be holding my breath. But even if you think it can be done, that's not an argument that it has been done modulo a few details that can be set aside. Prima facie, natural language names provide grief for denotational conceptions of meaning. Given this, some denotationalists have developed a brilliant rhetorical strategy: take it to be a truism that names denote, and ask whether this points the way to a more general conception of meaning. But this may be taking advantage of the human tendency to be willfully ignorant.

Wednesday, November 21, 2012

It was apparently Max Planck who discovered the unit time of
scientific change to be the funeral (the new displacing the old one funeral at
a time). In the early 1990s, I
discovered a second driving force: boredom. As some of you may know, since about the mid-1990s I have been a
minimalist enthusiast. For the record, I became one despite my initial
inclinations. On first reading A
minimalist program for linguistic theory (a Korean bootlegged version purportedly
whisked off Noam’s desk and quickly disseminated), I was absolutely convinced
that it had to be on the wrong track, if not the aspirations, then the
tentative conclusions. I was absolutely certain that one of the biggest
discoveries of generative grammar had been the centrality of government as a
core relation and S-structure as the indispensable level (I can still see
myself making just these points in graduate intro syntax). Thus the idea that
we dispense with government as a fundamental relation (it’s called Government-Binding theory after all!),
or that we eliminate S-structure as a fundamental level (D-structure, I
confess, I was willing to throw under the bus) struck me as nuts, just
another maneuver by Chomsky to annoy former graduate students.

Three things worked together to open (more accurately, pry
open) my mind.

First, my default strategy is to agree with Chomsky, even if
I have no idea what he’s talking about. In fact, I often try to figure out
where he’s heading so that I can pre-agree
with him. Sadly, he tends not to run in a straight line so I can often be seen
going left when he zags right or right when he zigs left. This has proven to be
both healthful (I am very fit!) and fruitful. More often than not, Chomsky
identifies fecund research directions, or at least ones that in retrospect I
have found interesting. No doubt this is
just dumb luck on Chomsky’s part, but if someone is lucky often enough, it is
worth paying very careful attention (as my mother says: “better lucky than
smart”). So, though I have often found
my work at a slant (even perpendicular) to his detailed proposals (e.g. just
look at how delighted Noam is with Movement Theory of Control, a theory near
and dear to my heart), I have always found it worthwhile to try to figure out
what he is proposing and why.

Second, fear: when the first minimalist paper began to
circulate in the early 1990s I was invited to teach a graduate syntax seminar
at Nijmegen (populated by eager, smart, hungry (and so ill-tempered) grad
students from Holland and the rest of Europe) and I needed something new to talk about. If you just get up
and repeat what you’ve already done, they could be ready for you. Better to
move in some erratic direction and keep them guessing. Chomsky’s recent
minimalist musings seemed like perfect cover.

Third, and truth be told I believe that this is the main
reason, the GB stuff I/we had been exploring had become really boring. Why? For
the best of possible reasons: viz. we really understood what made GB style
theories tick and we/I needed something new to play with, something that would
allow me/us to approach old questions in a different way (or at least not put
us/me to sleep). That new thing was the Minimalist Program. I mention this,
because at the time there was a lot of toing and froing about why so many had lemming-like
(this is apparently a rural legend; they don’t fling themselves off cliffs) jumped
off of the GB bandstand and onto the minimalist bandwagon. As I faintly recall,
there was an issue of the Linguistic
Review dedicated to this timely question with many authoritative voices giving
very reasonable explanations for why they were taking the minimalist turn. And most of these reasons were in fact good
ones. However, if my conversion was not completely atypical, the main thrust
came from simple thasaphobia (fear of boredom) and the discovery of the well-established fact
that intensive study of the Barriers
framework could be deleterious to one’s health (good reason to avoid going
there again all you phase-lovers out there!).

These three motivations joined to prompt me, as an exercise,
to stow the skepticism, at least for the duration of the Dutch lectures, assume
that this minimalist stuff was on the right track, and see how far I could get
with it. Much to my surprise, it did not
fall apart on immediate inspection (a surprisingly good reason to persist, in my
experience), it was really fun to play with, and, if you got with the program,
there was a lot to do given that few GB details survived minimalism’s dumping of
government as a core grammatical relation (not so surprising given that it is government-binding theory). So I was hooked, and busy. (p.s. I also
enjoyed the fact that, at the time, playing minimalist partisan could get one
into a lot of arguments, and nothing is more fun than heated polemics.)

These were the basic causes
for my theoretical conversion. Were there any good reasons? Yes, one. Minimalism was the next natural scientific
step to take given the success of the GB enterprise.

This actually became more apparent to me several years
later than it was on my road to Damascus Nijmegen. The GB era produced a rich description of the
structure of UG; internally modular with distinctive conditions, primitives and
operations characterizing each sub-part. In effect, GB delivered a dozen or so “laws”
of grammar (e.g. subjacency, ECP, principles A-C of binding theory, X’-theory
etc.), of pretty good (no, not perfect, but pretty good) empirical standing
(lots of cross linguistic support). This put generative grammar in a position
to address a new kind of question: why these laws and not others? Note: you
can’t ask this question if there are
no “laws.” Attacking it requires that we rethink the structure of UG in a new
way; not only to ask “what’s in UG?” but also “what in UG is
distinctively linguistic, and what is traceable to more general powers, cognitive,
computational, or physical?”. This put a version of what we might call Darwin’s
Problem (the logical problem of language evolution) on the agenda alongside
Plato’s Problem (the logical problem of language acquisition). The latter has not been solved, not by a long
shot, but fortunately adding a question to the research agenda does not require
that previous problems have been put to bed and snuggly tucked in. So though in
one sense, minimalism was nothing new, just the next reasonable scientific step
to take, it was also entirely new in that it raised to prominence a question
whose time, we hoped, had come.[1]

Chomsky has repeatedly emphasized the programmatic aspects
of minimalism.And, as he has correctly
noted, programs are not true or false but fecund or barren. However, after 20
years, it’s perhaps (oh what a weasel word!) time to sit back and ask how
fertile the minimalist turn has been? In my view, very, precisely because it
has spawned minimalist theories that
advance the programmatic agenda, theories that can be judged not merely in
terms of their fertility but also in terms of their verisimilitude. I have my
own views about where the successes lie, and I suspect that they may not
coincide with either Noam’s or yours. However, I believe it is time we identified what we take to be
our successes and asked ourselves how (or whether?) they reflect the principal
ambitions and intuitions of the minimalist program.

Let me put this another way: in one sense minimalism and GB
are not competitors, for the aims of the former presuppose the success of the
latter. However, minimalist theories and GB theories often are (or can be) in direct competition, and it is
worth evaluating them against each other. So, for example, to take an example at random (haha!), GB has a theory of
control and current minimalism has several. We can ask: In what
ways do the GB and minimalist accounts differ? How do they stack up
empirically? What minimalist precepts do the minimalist theories reflect? What GB principles are the minimalist
accounts (in)compatible with? What larger minimalist goals do the minimalist
theories advance? What does the
minimalist story tell us that the earlier GB story didn’t? And vice versa?
Etc. etc. etc.

IMHO, these are not questions that we have asked often
enough. I believe that we have failed to effectively use GB as the foil (and
measuring rod) it can be. Why? I’m not sure. Perhaps because we have concluded
that because the minimalist program
is worth pursuing, specific minimalist theories
that brandish distinctive minimalist technology (feature checking, merge,
Agree, probe-goal architecture, phases, etc.) must be “better” or “truer” than those
exploiting the quaint, out-of-date GB apparatus. If so, we were wrong. We always
need to measure our advances. One good way to do this is to compare your
spanking new minimalist proposal with the Model T GB version. I hereby propose
that going forward we adopt the mantra “What would GB say?” (WWGBS; might even
make for a good license plate) and compare our novel proposals with this
standard to make clear to ourselves and others where and how we’ve progressed.

I will likely blog more on this topic soon and identify what
I take to be some of the more interesting lines of investigation to date. However, I am very interested in what others
take the main minimalist successes to be. What are the parade-case achievements? Let me know. After 20 years, it
seems reasonable to try to make a rough estimate of how far we’ve come.

[1] The actual laws of nature are
interesting, but it’s also interesting that there are laws at all… We want to
know what those laws are. More ambitiously, we’d like to know if those laws
could possibly have been different… We may or may not be able to answer such a
grandiose question, but it’s the kind of thing that lights the imagination of
the working scientist (p. 23).

This is what I mean by the next obvious scientific
step to take. First find laws, then ask
why these laws and not others. That’s the way the game is played, at least by
the real pros.

I am still playing impresario to Bob Berwick's arias. Enjoy! What follows is all Robert C. Berwick:

Excellent questions all; each deserves a
reply in itself. But for now, we’ll have to content ourselves with this
Thanksgiving appetizer, with more to come. The original post made just two simple
points: (1) wrt Gold, virtually everyone who’s done serious work in the field
immediately moved to a stochastic setting, more than 40 years ago, in the best case
coupling it to a full-fledged linguistic theory (e.g., Wexler’s Degree-2 learnability);
and (2) that simply adding probabilities to get PCFGs doesn’t get us out of the
human language learning hot water. So far as I can make out, none of the
replies to date have really blunted the force of these two points. If anything,
the much more heavy-duty (and much more excellent and subtle) statistical
armamentarium that Shay Cohen & Noah Smith bring to bear on the PCFG
problem actually reinforces the second message. Evidently, one has to use
sophisticated estimation techniques to get sample complexity bounds, and even
then, one winds up with an unsupervised learning method that is computationally
intractable, solvable only by approximation (a point I’ll have to take up later).
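For readers who want the second point made concrete, this is all that “adding probabilities to get PCFGs” amounts to: each rewrite rule of a CFG carries a probability (summing to one per nonterminal), and a parse tree’s probability is the product of the probabilities of the rules it uses. The toy grammar, numbers, and tree encoding below are invented for illustration and come from no one’s actual model:

```python
# Toy PCFG sketch. RULES maps a nonterminal to (right-hand side, probability)
# pairs; the probabilities for each left-hand side sum to 1. All rules and
# numbers here are invented for illustration.
RULES = {
    "S":  [(("NP", "VP"), 1.0)],
    "NP": [(("he",), 0.4), (("Ted",), 0.3), (("Morris",), 0.3)],
    "VP": [(("V", "NP"), 0.7), (("V", "S"), 0.3)],
    "V":  [(("said",), 0.5), (("criticized",), 0.5)],
}

def tree_prob(tree):
    """Probability of a parse tree: the product of the probabilities of
    the rules applied at each node. Leaves contribute no extra factor."""
    node, children = tree
    if not children:
        return 1.0
    rhs = tuple(child[0] for child in children)
    p = dict(RULES[node])[rhs]          # probability of this rule application
    for child in children:
        p *= tree_prob(child)
    return p

# Parse of "He said Ted criticized Morris", as (label, children) tuples.
tree = ("S", [("NP", [("he", [])]),
              ("VP", [("V", [("said", [])]),
                      ("S", [("NP", [("Ted", [])]),
                             ("VP", [("V", [("criticized", [])]),
                                     ("NP", [("Morris", [])])])])])])

# Product of the nine rule probabilities used in this derivation:
print(tree_prob(tree))  # 1.0 * 0.4 * 0.3 * 0.5 * 1.0 * 0.3 * 0.7 * 0.5 * 0.3
```

The probabilities let a learner rank parses, but the rule inventory itself, which categories exist and which configurations are possible, still has to come from somewhere; that is the sense in which PCFGs, by themselves, leave the acquisition problem standing.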

Now, while I personally find such results
enormously valuable, as CS themselves say in their introduction, they’re
considering probabilistic grammars that are “used in diverse NLP applications…”
“ranging from syntactic and morphological processing to applications like
information extraction, question answering, and machine translation.” Here, I
suspect, is where my point of view probably diverges from what Alex and Noah
subscribe to (though they ought to speak for themselves, of course): what counts
as a result? I can dine on Cohen and Smith’s fabulous statistical meal, but then I
walk away hungry: What does it tell us about human language and human language acquisition
that we did not already know? Does it tell us why, e.g., in a sentence like “He said Ted criticized Morris,” “he” must be used deictically, and can’t
be the same person as Morris? And, further,
why this constraint just happens to
mirror the one in sentences such as “Who
did he say Ted criticized,” where again “he”
must be deictic, and can’t be “who”;
and, further, why this just happens to
mirror the one in sentences such as “He
said Ted criticized everyone,” where again “he” must be deictic, and can’t mean everybody; and then, finally, and most crucially of all, why these
same constraints appear to hold, not only for English but for every language we’ve
looked at, and, further, how it is that children come to know all this before age 3? (Examples from Stephen Crain by way of
Chomsky.) Now that, as they say, is a
consummation devoutly to be wished. And yet that’s what linguistic science
currently can deliver. Now, if a
math-based account could do the same, show why this pattern must hold, ineluctably, well, then that
would be a Happy Meal deserving of the name. But this, for me, is what linguistic
science is all about. In short, I
hunger for explanations in the usual scientific sense, rather than theorems
about formal systems. Explanations like the ones we have about other biological
systems, or the natural world generally. Again, remember, this is what Ken
Wexler's Degree-2 theory bought us – to ensure feasible learnability, one had
to impose locality constraints on grammars, constraints that are empirically attested. In
this regard it seems my taste runs along the lines of Avery Andrews’ comment
regarding the disability placard to be awarded to PCFGs. (Though, as I’ll write
in a later post, it turns out, perhaps surprisingly, that in purchasing Bayes as a “major upgrade over Chomsky”
he has in fact bought the older, original
model.) Show me a better explanation and
I’ll follow you anywhere.
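The Crain-style examples above all turn on one structural relation, c-command, together with Principle C of the binding theory: a pronoun cannot corefer with a name it c-commands. A minimal sketch of the check (the tuple tree encoding and the sister-containment shortcut for c-command are my simplifications for illustration, not anyone’s implementation):

```python
# Toy illustration of Principle C: a pronoun cannot corefer with a name
# it c-commands. Trees are (label, children) tuples; nodes are identified
# by their path (sequence of child indices) from the root.

def node_paths(tree, path=()):
    """Yield (path, label) for every node in the tree."""
    label, children = tree
    yield path, label
    for i, child in enumerate(children):
        yield from node_paths(child, path + (i,))

def dominates(p, q):
    """p properly dominates q iff p's path is a proper prefix of q's."""
    return len(p) < len(q) and q[:len(p)] == p

def c_commands(p, q):
    """p c-commands q iff neither dominates the other and p's mother
    dominates q (a sister-containment simplification of the usual
    first-branching-node definition; the root c-commands nothing)."""
    if not p or p == q or dominates(p, q) or dominates(q, p):
        return False
    return q[:len(p) - 1] == p[:-1]

# "He said Ted criticized Morris" (leaf NPs uniquely labeled for lookup).
tree = ("S", [
    ("NP-he", []),
    ("VP", [
        ("V-said", []),
        ("S", [
            ("NP-Ted", []),
            ("VP", [
                ("V-criticized", []),
                ("NP-Morris", []),
            ]),
        ]),
    ]),
])

pos = {label: path for path, label in node_paths(tree)}
# The pronoun c-commands the name, so Principle C bars coreference:
print(c_commands(pos["NP-he"], pos["NP-Morris"]))  # True: "he" cannot be Morris
```

The point of Berwick’s examples is precisely that this one configurational condition covers the declarative, the wh-question, and the quantifier cases alike, across languages, which is the kind of unification a bare statistical fit over strings does not supply.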

As for offering constructive options,
well, but of course! The original post mentioned two: the first, Ken Wexler's
degree-2 learnability demonstration with an EST-style TG; and the second, Mark Steedman's
more recent combinatory categorial grammar approach (which, as I've conjectured
just the other day with Mark, probably has a degree-1 learnability proof). And
in fact both of these probably have nice Vapnik-Chervonenkis learnability
results hiding in there somewhere, something I’m confident CS could make quick
work of, given their obvious talents. The EST version badly wants updating,
so there’s plenty of work to do. Time to be fruitful and multiply.