Saturday, June 28, 2014

This week I got to present at CMCL 2014, a workshop on computational models of language-related cognition, i.e. processing, acquisition, discourse representation, and so on. My talk was about the connection between Stabler's top-down parser for Minimalist grammars and the processing of relative clauses, something I've been working on for a while now with Bradley Marcinek, a student of mine. Thanks to Greg Kobele, John Hale and Sabrina Gerth, we already know that the predictions of this parser depend on one's syntactic analysis in interesting ways, so we wanted to extend their line of work to some other well-known phenomena. Long story short, our results are rather messy and it will be a while until we can get this idea truly off the ground.

That is why I won't blog about this research quite yet (except for the shameless self-promotion above) and instead focus on the talks I heard, rather than the one I gave. Don't get me wrong, many of them were very interesting to me on a technical level; some of them even pierced my 90s habitus of acerbic cynicism and got me a bit excited. Quite generally, a fun time was had by all. But the talks made me aware of a gaping hole in my understanding of the field, a hole that one of you (I believe we have some readers with serious modelling chops) may be able to plug for me: Just what is the point of cognitive modelling?

Sunday, June 22, 2014

In the first post (here),
I discussed Chomsky’s version of Merge and the logic behind it. The main idea is that Merge, the conceptually simplest conception of
recursion, has just the properties to explain why NL Gs generate structures
with unbounded hierarchical structure, why NLs allow displacement, show
reconstruction effects, and why rules of G are structure dependent. Not bad for
any story. Really good (especially for DP concerns) if we get all of this from
a very simple (nay, simplest) conception. In what follows I turn to a discussion
of the last three properties Chomsky identified and see how he aims to account
for them. I repeat them here for convenience.

(v) its operations apply cyclically

(vi) it can have lots of morphology

(vii) in externalization only a single “copy” is pronounced

In contrast to the first four properties, the last three do
not follow simply from the properties of the conceptually “simplest” combination
operation. Rather Chomsky argues that they reflect principles of computational
efficiency. Let’s see how.

With respect to (vii), Chomsky assumes that externalization
(i.e. “vocalizing” the structures) is computationally costly. In other words,
actually saying the structures out loud is hard. How costly? Well, it must be more
costly than copy deletion at Transfer is. Here’s why. Given the copy theory as
a consequence of Merge, FL must contain a procedure to choose which
copy/occurrence is pronounced (note: this is not a conceptual observation but
an inference based on the fact that typically only one copy is pronounced).
This decision/choice, I assume, requires some computation. I further assume that
choosing which copies/occurrences to externalize requires some computation that
would not be required were all
copies/occurrences pronounced. Chomsky’s assumption is that the cost of
choosing is less than the cost of externalizing. Thus, FL’s choice lowers overall
computational cost.

Furthermore, we must assume that the cost of
pronunciation also exceeds the computational cost of being misunderstood, for
otherwise it would make sense for FL to facilitate parsing by pronouncing all
the copies, or at least those that would facilitate a hearer’s parsing of our
sentences. None of these assumptions are self-evidently true or false. Plus,
the supposition that deleting copies is more computationally efficient than pronouncing
them would be does not follow simply from considerations of conceptual
simplicity, at least as far as I can tell. It involves substantive assumptions
about actual computational costs, for which, so far as I can tell, we have
little independent evidence.

One more point: If copy deletion exists in Transfer to the
CI interface (as Chomsky argued in his original 1993 paper and that underlies
standard accounts of reconstruction effects and that so far as I know is still
part of current theory) then in the normal case only a single copy/occurrence
makes it to either interface, though
which copy is interpreted at CI can be different from the copy spoken at AP
(and this is typically how displacement is theoretically described). But if
this is correct, then it suggests that Chomsky’s argument here might need some
rethinking. Why? If deletion is part of Transfer to CI then copy deletion cannot
be simply a fact about the computational cost of externalization, as it applies to the mapping of linguistic objects
to the internal thought system as
well. It seems that copies per se are
the problem, not just copies that must be pronounced.

Before moving on to (v) and (vi) it is worth pausing to note
that Chomsky’s discussion here reverberates with pretty standard conceptions of
computational efficiency (viz. he is making claims about how hard it is to do something). This moves away from the
purely conceptual matters that motivated the discussion of the first four
features of FL. There is a very interesting hypothesis that might link the two:
that the simplest computational operation will necessarily be embedded in a
computationally efficient system. This is along the lines of how I interpreted
the SMT in earlier posts (linked to in the first part of this post). However, whether or not you think this is feasible,
it appears, at least to me, that there are two different kinds of arguments
being deployed to SMT ends, a purely conceptual one and a more conventional
“resource” argument.

Ok, let’s return to (v) and (vi). Chomsky suggests that
considerations of computational efficiency also account for these
properties. In particular, they follow from something like the strict cycle as embodied
in phase theory. So the question is:
what’s the relation between the strict cycle and efficient computation?

Chomsky supposes that the strict cycle, or something like
it, is what we would expect from a computationally well-designed system. There
are times that (to me) Chomsky sounds like he is assuming that the
conceptually simplest system will necessarily
be computationally efficient.[1]
I don’t see why. In particular, if I understand the lecture correctly, Chomsky
is suggesting that the link between conceptual simplicity and computational
efficiency should follow as a matter of natural law. Even if correct, it is
clear that this line of reasoning goes considerably beyond considerations of conceptual
simplicity. What I mean is that even if one grants that the simplest
computational operation will be something like Merge, it does not follow that
the simplest system that includes Merge will also incorporate the strict cycle. Phases, then (Chomsky’s mechanism for
realizing the strict cycle) are motivated not on grounds of conceptual simplicity
alone but on grounds of efficiency (i.e. a well/optimally designed system will
incorporate something like the strict cycle). So far as I can tell Chomsky does
not explain the relation (if any) between conceptual simplicity and
computational efficiency, though to be fair, I may be over-interpreting his
intent here.

This said, how does the strict cycle bear on computational
efficiency? It allows computational decisions to be made locally and
incrementally. This is a generically nice feature for computational systems to
have for it simplifies computations.[2]
Chomsky notes that it also simplifies the process of distinguishing two
selections of the same expression from the lexicon vs two occurrences of the
same expression. How does it simplify it? By making the decision a bounded one.
Distinguishing them, he claims, requires recalling whether a given
occurrence/copy is a product of E- or I-Merge. If such decisions are made
strict cyclically (at every phase) then phases reduce memory demand: because
phases are bounded, you need not retain information in memory regarding the
provenance of a valued occurrence beyond the phase where an expression’s
features are valued.[3]
So phases ease the memory burdens that computations impose. Let me note again, without
further comment, that if this is
indeed a motivation for phases, then it presupposes some conception of
performance, for only in this kind of context do resource issues (viz. memory
concerns) arise. God has no need for bounding computation.

Now I have a confession to make. I could not come up with a concrete example where
this logic is realized involving DP copies, given standard views. It’s easy enough to come up with a relevant
case if e.g. reflexivization is a product of movement.[4]
If reflexives involve A-chains with two thematically marked “links” then we
need to distinguish copies from originals (e.g. Everyone loves himself differs from everyone loves everyone in that the first involves one selection of
everyone from the lexicon (and so one
chain with two occurrences of everyone)
while the second involves two selections of everyone
from the lexicon and so two different chains). However, if you don’t assume
this, I personally had a hard time finding an example of what’s worrying
Chomsky, at least with copies. This might mean that Chomsky is finally coming
to his senses and appreciating the beauty of movement theories of Control and
Binding OR it might mean that I am a bear of little brain and just couldn’t
come up with a relevant case. I know which option I would bet on, even given my
little brain, and it’s not the first. So, anyone with a nice illustration is
invited to put it in the comments section or send it to me and I will post it.
Thanks.

It is not hard to come up with cases that do not involve
DPs, but the problem then is not distinguishing copies from originals. Take the
standard case of Subject-Predicate agreement for example. Here the unvalued
features of T are valued by the inherently valued features of the subject
DP. Once valued, the features on T and D
are indistinguishable qua features.
However, there is assumed to be an important difference between the two, one
relevant to the interpretation at the CI interface. Those on D are meaning
relevant but those on T are uninterpretable.
What, after all, could it mean to say that the past tense is first person and
plural?[5]
If one assumes that all features at the interfaces must be interpretable at
those interfaces if they make it there, then the valued features on T must
disappear at Transfer to CI. But if (by assumption) they are indistinguishable
from the interpretable ones on D, the computational system must remember how the features got onto T (i.e.
by valuation or inherently). The ones that get there by valuation in the
grammar must be removed or the derivation will not converge. Thus, Gs need to
know how features get onto the expressions they sit on and it would be very
nice memory-wise if this was a bounded decision.

Before moving on, it’s worth noting that even this version
of the argument is hardly straightforward. It assumes that phi-features on T
are not interpretable and that these cause derivations to crash (rather than, for example, converge as gibberish) (also see note 5). It also requires that
deletion not be optional, otherwise there would be derivations where all the
good features remained on all of the right objects and all of the uninterpretable
ones freely deleted. Nor does it allow Transfer (which, after all, straddles
the syntax and CI) to peek at the meaning of T during Transfer, thereby determining which features are interpretable
on which items and so which should be deleted and which retained. Note that
such a peek-a-boo decision to delete during Transfer would be very local,
relying just on the meaning of T and the meaning of phi-features. Were this
possible, we could delay Transfer indefinitely. So, to make Chomsky’s argument we
must assume that Transfer is completely “blind” to the interpretation of the
syntactic objects at every point in the syntactic computation including the one
that interfaces with CI. This amounts to a very strong version of the autonomy
of syntax thesis; one in which no part of the syntax, even the rules that directly
interface with the interpretive interfaces, can see any information that the
interfaces contain.[6]

Let’s return to the main point. Must the simplest system
imaginable be computationally efficient? It’s not clear. One might imagine that
the conceptually “simplest” system would not worry about computational
efficiency at all (damn memory considerations!). The simplest system might just
do whatever it can and produce whatever structured products it can without
complicating FL with considerations of resource demands like memory burdens.
True, this might render some products of FL unusable or hard to use (and so we
would probably perceive them as unacceptable) but then we
just wouldn’t use them (sort of like what we say about self-embedded
clauses). So, for example, we would tend
not to use sentences with multiple occurrences of the same expressions where
this made life computationally difficult (e.g. you would not talk about two
Norberts in the same sentence). Or without phases we might leave to context the
determination of whether an expression is a copy or a lexical primitive or we
might allow Transfer to see if features on an expression were kosher or not. At
any rate, it seems to me that all of these options are as conceptually “simple”
as adding phases to FL unless, of course, phases come for free as a matter of
“natural law.” I confess to being skeptical
about this supposition. Phases come with a lot of conceptual baggage, which I
personally find quite cumbersome (reminds me of Barriers actually, not one of the aesthetic high points in GG
(ugh!)). That said, let’s accept that the “simplest” theory comes with
phases.

As Chomsky notes, phases themselves have complex
properties. For example, phases bring
with them a novel operation, feature lowering, which now must be added to the
inventory of FL operations. However, feature lowering does not seem to be either
a conceptually simple or cognitively/computationally generic kind of operation.
Indeed, it seems (at least to me) quite linguistically parochial. This, of
course, is not a good thing if one’s sights are set on answering Darwin’s
problem. If so, phases don’t fit snugly
with the SMT. This does not mean there are none. It just means that they
complicate matters conceptually and pull against Chomsky’s first conceptual
argument wrt Merge.

Again, let’s put this all aside and assume that strict
cyclicity is a desirable property to have and that phases are an optimal way of
realizing this. Chomsky then asks how we identify phases. He argues that we can
identify phases by their heads, as phase heads are where unvalued features live.
Thus a phase is the minimal domain of a phase head with unvalued features.[7]
A possible virtue of this way of looking at things is that it might provide a
way of explaining why languages contain so much morphology. It is the
adventitious by-product of identifying the units/domains of the optimal
computational system. Chomsky notes that
what he means by morphology is abstract (a la Vergnaud), so a little more has
to be said, especially given that externalization is costly, but it’s an idea
in an area where we don’t have many (see here).[8]

One remark: on this reconstruction of Chomsky’s arguments,
unvalued features play a very big role. They identify phases, which implement
strict cyclicity and are the source of overt morphology. I confess to being wary here. Chomsky
originally introduced unvalued features to replace uninterpretable ones. Now he
assumes that features are both +/- valued and +/- interpretable. As unvalued
features are always uninterpretable, this seems like an unwanted redundancy in
the feature system. At any rate, as
Chomsky notes, uninterpretable features really do look sort of strange in a
perfect system. Why have them only to get rid of them? Chomsky’s big idea is that they exist to make
FL computationally efficient. Color me very unconvinced.

So this is the main lay of the land. I should mention that,
as others have pointed out (especially Dennis O), part of Chomsky’s SMT argument
here (i.e. the one linked to conceptual simplicity concerns) is different from
the interpretation of the SMT that I advanced in other posts (here, here, here). Thus, my version is definitely NOT the one
that Chomsky elaborates when considering these. However, there is a clear
second strand dealing with pretty standard efficiency concerns, and here my
speculations and his might find some common ground. That said, Chomsky’s proposals
rest heavily on certain assumptions about conceptual simplicity, and of a very
strong kind. In particular, Chomsky’s argument rests on a very aggressive use
of Occam’s razor. Here’s what I mean.
The argument he offers is not that we should adopt Merge because all other notions
are too complex to be biologically plausible units of genetic novelty. Rather,
he argues that in the absence of information to the contrary, Occamite
considerations should rule: choose the simplest
(not just a simple) starting point
and see where you get. Given that we don’t know much about how operations that
describe the phenotype (the computational properties of FL) relate to the
underlying biological substrate that is the thing that actually evolved, it is
not clear (at least to me) how to weight such strong Occamite considerations.
They are not without power, but, to me at least, we don’t really know how to
assess whether all things are indeed equal and how seriously to weight this
very strong demand for simplicity.

Let me end by fleshing this out a bit. I confess to not being moved by Chomsky’s
conceptual simplicity arguments. There are lots of simple starting points (even if some may be simpler than others).
Ordered pairs are not that much more conceptually
complex than sets. Symmetric operations are not obviously simpler than
asymmetric ones, especially given that it appears that syntax abhors symmetry
(see Moro and Chomsky). So, the general starting point that we need to start
with the conceptually simplest
conception of “combination” and that this means an operation that creates sets
of expressions seems based on weak considerations. IMO, we should be looking
for basic concepts that are simple enough
to address DP (and there may be many) and evaluate them in terms of how well
they succeed in unifying the various apparently disparate properties of FL. Chomsky
does some of this here, and it’s great. But we should not stop here. Let me
give an example.

One of the properties that modern minimalist theory has had
trouble accounting for is the fact that the unit of syntactic
movement/interpretation/deletion is the phrase.
We may move heads, but we typically move/delete phrases. Why? Right now
standard minimalist accounts have no explanation on hand. We occasionally hear
about “pied piping” but more as an exercise in hand waving than in explanation.
Now, this feature of FL is not exactly difficult to find in NL Gs. That
constituency matters is one of the obvious facts about how
displacement/deletion/binding operates. There is a simple story about this that
labels and headedness can be used to deliver.[9]
If this means that we need a slightly less conceptually simple starting point
than sets, then so be it.

More generally: the problem that motivates the minimalist
program is DP. To address DP we need to factor out most of the linguistic
specific structure of FL and attribute it to more cognitively generic
operations (or/and, if Chomsky is right, natural laws). What’s simple in a DP context is not what is conceptually most basic, but what is
simple given what our ancestors had
available cognitively about 100k years ago. We need a simple addition to this, not something that is conceptually
simple tout court.[10] In this context it’s not clear to me that
adding a set construction operation
(which is what Merge amounts to) is the simplest evolutionary alternative. Imagine,
for example, that our forebears already had an iterative concatenation
operation.[11] Might not some addition to this be just as
simple as adding Merge in its entirety? Or imagine that our ancestors could
combine lexical atoms together into arbitrarily big unstructured sets. Might not an addition that allowed that
operation to yield structured sets be just as simple in the DP context as
adding Merge? Indeed, it might be simpler depending on what was cognitively
available in the mental life of our ancestors. And once we are at it, how “simple” is an operation that forms arbitrary
sets from atoms and other sets? Sets may
be simple objects with just the properties we need, but I am not sure that
operations that construct them are particularly simple.[12]

Ok, let me end this much too long second post. And moreover,
let me end on a very positive note. In the second lecture Chomsky does what we
all should be doing when we are doing minimalist syntax. He is interested in
finding simple computational systems that derive the basic properties of FL. He
concentrates on some very interesting key features: unbounded hierarchy,
displacement, reconstruction, etc. and makes concrete proposals (i.e. he offers
a minimalist theory) that seem
plausible. Whether he is right in detail is less important IMO than that his
ambitions and methods are worth copying. He identifies non-trivial properties
of FL that GG has discovered over the last 60 years and he tries to explain why
they should exist. This is exactly the
right kind of thing MPers should be doing. Is he right? Well, let’s just say
that I don’t entirely agree with him (yet!). Does lecture 2 provide a nice
example of what MP research should look like? You bet. It identifies real deep
properties of FL and sees how to derive them from more general principles and
operations. If we are ever to solve Darwin’s problem, we will need simple
systems that do just what Chomsky is proposing.

[1]
Note, we want the necessarily here.
That it is both simple and efficient
does not explain why it need be efficient if
simple.

[2]
It is also a necessary condition for incrementality in the use systems (e.g.
parsing), as Bill Idsardi pointed out to me. I know that the SMT does not care about use systems according to some
(Dennis and William, this is a shout-out to you), but this is a curious and
interesting fact nonetheless. Moreover,
if I am right that the last three properties do not follow (at least not
obviously) from conceptual considerations, it seems that Chomsky might be
pursuing a dual route strategy for explaining the properties of FL.

[3]
Note that this assumes that there is no syntactic difference between inherent
features and features valued in the course of the derivation.

[4]
And even this requires a special version of the theory, one like Idsardi and
Lidz’s rather than Zwart’s.

[5]
However, if v raised to T before Transfer then one might try and link these
features to the thematic argument that v licenses. And then it might make lots
of sense to say that phi-features are interpretable on T. They would say that
the variable of the predicate bound by the subject must have such and such an
interpretation. This information might be redundant, but it is not obviously
uninterpretable.

[6]
The ‘autonomy of syntax’ thesis refers to more than one claim. The simplest one
is that syntactic primitives/operations are not reducible to phonetic or
semantic ones. This is not the version adverted to above. This is a more
specific version of the thesis; one that requires a complete separation between
syntactic and semantic information in the course of a derivation. Note, that
the idea that one can add EPP/edge features only if it affects interpretation
(the Reinhart-Fox view that Chomsky has at times endorsed) violates this strong
version of the autonomy thesis.

[8]
Note, incidentally, that Chomsky assumes both that features are +/- valued and
that they are +/- interpretable. At one time, the former was considered a
substitute for the latter. Now, they are both theoretically required, it seems.
As -valued features seem to always be -interpretable, this seems like an
unwanted redundancy.

[10]
A question: we can define ordered pairs set theoretically. I assume the
argument against labels is that ordered sets are conceptually more complex than
unordered sets. So {a,b} is conceptually simpler than {a,{a,b}}. If this is the argument, it is very, very
subtle. I find it hard to believe that whereas the former is simple enough to
be biologically added, the latter is not. Or even that the relative simplicity
of the two could possibly matter. Ditto for other operations like concatenation
in place of Merge as the simplest operation. Given how long this post is already, I will refrain from elaborating
these points here.
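
For concreteness, here is the contrast in a few lines of Python, using frozensets as a stand-in for syntactic objects. This is only a toy encoding assumed for illustration (the function names are made up), not a claim about how FL represents anything.

```python
def merge(x, y):
    """Bare set formation: Merge(x, y) = {x, y}, with no order and no label."""
    return frozenset({x, y})

def labeled_merge(head, other):
    """Set formation plus a label, encoded as {head, {head, other}}."""
    return frozenset({head, merge(head, other)})

# Unordered: the two arguments are interchangeable.
print(merge("a", "b") == merge("b", "a"))                  # True

# Labeled: swapping the arguments yields a different object.
print(labeled_merge("a", "b") == labeled_merge("b", "a"))  # False
```

In this encoding the labeled object is built by applying the same set formation step one more time, which is one way of seeing why the difference in complexity looks so slight.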

[11]
Birds (and mice and other animals) can string “syllables” together (put them
together in a left/right order) to make songs. From what I can tell, there is
no hard upper bound on how many syllables can be so combined. These do not display hierarchy, but they may
be recursive in the sense that the combination operation can iterate. Might it
not be possible that what we find in FL builds on this iteration operation?
That the recursion we find in FL is iteration plus something novel (I have
suggested labeling is the novelty)? My point here is not that this is correct,
but that the question of simplicity in a DP context need not just be a matter
of conceptual simplicity.

[12]
How are sets formed? How computationally simple is the comprehension axiom in
set theory, for example? It is actually logically quite involved (see here). I
ask because Merge is a set forming
operation, so the relevant question is how cognitively complex is it to form arbitrary sets. We have been assuming
that this is conceptually simple and hence cognitively easy. However, it is
worth considering just how easy. The Wikipedia entry suggests that it is not a
particularly simple operation. Sets are funny things and what mental powers go
into being able to construct them is not all that clear.

Saturday, June 21, 2014

How does the learner acquire the following patterns in dative constructions:

(1)a. John told a story to Bill.

John told Bill a story.

b. John promised a car to Bill.

John promised Bill a car.

c. John donated a painting to the museum/them.

*John donated the museum/them a painting.

Lexical conservatism is not the way to go. Children productively (over)generalize both constructions (“I said him no”) about 5% of the time (Gropen et al. 1989 Lg.), a rate comparable to that of past tense overregularization. As young as age 3, they can extend novel verbs from one construction to another (“I pilked the cup to Petey” => “I pilked Petey the cup”; Conwell & Demuth 2007 Cognition), though the DOC to PC extension is more robust than the other way around.

There is pretty good agreement on the semantic conditions for the dative constructions: DOC generally involves caused possession of the theme by the goal and PC requires caused motion of the theme along the path to the goal. These are what Pinker (1989) calls “broad range rules” but they are clearly only necessary conditions on the dative constructions, not sufficient ones, as the examples in (1) illustrate. Moreover, there is considerable crosslinguistic variation: in some languages, (the equivalent of) dative constructions are limited to a handful of verbs.

Pinker then proposes a set of “narrow range rules”, each defining a subclass of verbs on the basis of semantics, e.g., verbs of instantaneous causation of ballistic motion (“throw”), verbs of future having (“leave”), verbs of instrument of communication (“telegraph”), etc., which allow DOC, and verbs of fulfilling (“present”), verbs of manner of speaking (“shout”) etc., which allow PC only. Beth Levin refined these lists in her 1993 EVCA book. But as noted by Melissa Bowerman and others, these subclasses do not solve the learning problem. First, it’s not clear how the child learner can conjure up these subclasses: we probably don’t want to build the telecommunication class into an innate UG. Second, these subclasses do not behave consistently across languages (Levin 2008 Stanford ms.); even if they are available for the learner’s consideration, their productivity still needs to be determined.

You know where we are going with this. I looked at a 3 million word corpus of child directed English and found a total of 49 verbs attested in either dative construction:

(2) a. 48 appear in PC, of which 37 also appear in DOC.

b. 38 appear in DOC, of which 37 also appear in PC.

Applying the N/ln N formula, we see that both PC=>DOC and DOC=>PC are productive generalizations. That is, if the child sees a verb used in one of the constructions, it will automatically generalize to the other. This appears to be what children do; see above. The DOC=>PC rule is a far more reliable generalization, virtually exceptionless, than the PC=>DOC rule, which may account for the asymmetry in the extension of novel verbs in Conwell and Demuth’s study.
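
To make the arithmetic behind (2) explicit, here is a minimal Python sketch. It assumes the threshold works the way the N/ln N formula is used in this post: a rule over N items counts as productive when its exceptions do not exceed N/ln N. The function names are mine, purely for illustration.

```python
import math

def tolerance_threshold(n: int) -> float:
    """Largest number of exceptions a rule over n items can carry (N / ln N)."""
    return n / math.log(n)

def is_productive(n: int, exceptions: int) -> bool:
    """Assumption: a rule is productive when exceptions <= N / ln N."""
    return exceptions <= tolerance_threshold(n)

# Counts from (2): 48 verbs in PC, 38 in DOC, 37 in both.
PC_VERBS, DOC_VERBS, BOTH = 48, 38, 37

# PC=>DOC: of the 48 PC verbs, 48 - 37 = 11 are never seen in DOC.
pc_to_doc = is_productive(PC_VERBS, PC_VERBS - BOTH)
# DOC=>PC: of the 38 DOC verbs, 38 - 37 = 1 is never seen in PC.
doc_to_pc = is_productive(DOC_VERBS, DOC_VERBS - BOTH)

print(f"PC=>DOC: 11 exceptions vs threshold {tolerance_threshold(PC_VERBS):.1f} -> {pc_to_doc}")
print(f"DOC=>PC: 1 exception vs threshold {tolerance_threshold(DOC_VERBS):.1f} -> {doc_to_pc}")
```

On these counts PC=>DOC squeaks by (11 exceptions against a threshold of about 12.4), while DOC=>PC clears it with room to spare (1 against about 10.4).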

So there is no Baker’s paradox for a 3 year old, as both constructions can be productively learned. The paradox arises for certain verbs such as the Latinate class but there are hardly any Latinate dative verbs in the child directed data (and not a single instance of the telecommunication verbs; these are data collected before everyone was online). As the child grows older, especially after the onset of literacy which will begin to feature more Latinate words, his vocabulary will expand and he will encounter more examples of dative constructions: some verbs will appear in both DOC and PC while others will only appear in PC. But even the ungrammaticality of Latinate verbs in DOCs is a matter of tendency, not to mention individual variation. Those such as “assign”, “advance”, “award”, “guarantee” etc. do allow DOC and Germanic verbs such as “shout”, “trust”, “lift”, “pick” do not. Collectively, Gropen et al’s list contains 54 Latinate verbs that can participate in PC but only 14 can be used in DOC: Latinate verbs, then, do not productively participate in DOC and the learner will have to lexicalize the 14. Levin’s longer list shows the same pattern.

So the child grows into a paradox: in other words, the productivity of rules/constructions must change over the course of language acquisition. Gropen et al. (1989) list 73 DOC/PC verbs and 34 PC only verbs for a total of 107, which yields a threshold of 23. If the child learns all of the 107 verbs, the PC=>DOC extension will no longer be justified, since the 34 PC only verbs exceed that threshold. A productive rule when he was three will cease to be productive when he’s 30.
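
The change in status is easy to check by hand; the sketch below does it in Python, again assuming that a rule over N verbs is productive only if its exceptions stay at or below N/ln N. The helper name and the dictionary layout are illustrative, not part of anyone's published model.

```python
import math

def check_rule(n_items: int, n_exceptions: int) -> tuple[float, bool]:
    """Return (N / ln N, whether the exceptions fit under that threshold)."""
    threshold = n_items / math.log(n_items)
    return threshold, n_exceptions <= threshold

# (items the rule applies to, items that fail to follow it)
cases = {
    "PC=>DOC, child-directed corpus": (48, 11),
    "PC=>DOC, Gropen et al. list": (107, 34),   # 73 DOC/PC + 34 PC-only
    "PC=>DOC, Latinate verbs only": (54, 40),   # 54 in PC, only 14 in DOC
}

for name, (n, e) in cases.items():
    threshold, ok = check_rule(n, e)
    print(f"{name}: {e} exceptions, threshold {threshold:.1f}, productive={ok}")
```

The 107-verb list gives a threshold of about 22.9 (the 23 in the text), well short of the 34 PC only verbs, and the Latinate subclass fails by an even wider margin.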

I think this is when the child will be prompted to look for subclasses or narrow range rules. Not having a productive linguistic system is a crime against nature. Sometimes we are genuinely stuck when there isn’t any to be found (such as the paradigmatic gap examples I mentioned in the previous post) but the child will not give up trying. In a paper published in the same volume as Berwick, Chomsky and Piattelli-Palmarini, Julie Legate and I studied how the metrical stress parameters of English can be acquired. It’s well known that the overwhelming majority of English words are stress initial (up to 80-90%; Cutler & Carter 1987, Comp. Speech & Lg.), but no metrical theory of English, or any English speaker, treats English as a quantity insensitive (QI) system like Afrikaans while lexically listing 10% of exceptions. Using child directed English words, we found that indeed, the QI system fails to reach productivity despite being the overwhelming majority, and a productive system (as described in Halle 1998 LI) can only be established if the child subdivides the vocabulary into nouns and verbs and considers different stress marking options for these subclasses. Conceivably, this is how they learn the narrow range rules. OK, PC=>DOC may be bust, but if I cut up the verbs into semantic classes, I can still find some productive ones.

****

This work has tormented me for quite some time. I have argued for a variational conception of language learning, where the learner acquires a probabilistic distribution over grammatical hypotheses, in contrast with what can be called a “transformational” model of learning, where the learner goes from one grammar to another. Yet what we have on hand is exactly a transformational model of language learning a la hypothesis testing (see Aspects), where the hypotheses are confirmed or rejected by an evaluation metric for productivity.

There really do seem to be two kinds of learning in child language. On the one hand, there is probabilistic adjustment, where non-target grammars show up. The case for parameters remains strong; I hope to provide a report on some recent collaborative work soon. On the other, we have tipping point phenomena such as U-shaped learning curves and other forms of linguistic induction, where a hypothesis suddenly emerges.

I’m happy to concede that I’m treating unattested examples as negative evidence. As noted earlier, the child must be able to generalize over unseen data so that much seems unavoidable. But I still think this work is different from at least the conventional use of indirect negative evidence. Under the standard view, the learner has two (or many) hypotheses and performs some kind of comparison, discrete or probabilistic, to select the best. (For a recent take on the dative acquisition, see Perfors et al. JCL 2010 and Villavicencio et al. ACL 2013.) The model developed here considers one hypothesis at a time by working over two numbers: it keeps a hypothesis that is good enough and moves on to find another if not. This is the classic error driven learning in much of the inductive learning business (Aspects, Wexler & Culicover, Berwick).

In any case, I think the empirical aspects of productivity are far more important than theoretical formulations and deserve much more attention:

A productive system requires a super duper majority: see English metrical stress.

Productivity can change over the course of language acquisition.

The failure of productivity results in ineffability such as paradigmatic gaps: sometimes the best isn't good enough.

Thursday, June 19, 2014

This was once a 10 page post. I’ve decided to break it into
two to make it more manageable. I welcome discussion as there is little doubt
that I got many things wrong. However, it’s been my experience that talking
about Chomsky’s stuff with others, even if it begins in the wrong place, ends
up being very fruitful. So engage away.

In lecture 2, Chomsky starts getting down to details. Before reviewing these, however, let me draw
attention to one of Chomsky’s standard themes concerning semantics, with which he
opens. He does not really believe that
semantics exists (at least as part of FL). Or more accurately, he doubts that
there is any part of FL that recursively specifies truth (or satisfaction)
conditions on the basis of reference relations that lexical atoms have to
objects “in the world.”

Chomsky argues that lexical atoms within natural language
(viz. words, more or less) do not refer. Speakers can use words to
refer, but words in natural languages (NL) have no intrinsic reference relation
to objects or properties or relations or qualities or whatever favorite in the
world “things” one cares to name. Chomsky
interestingly contrasts words with animal symbols, which he observes really do
look like they fit the classical referential conception as they are tightly
linked to external states or current appetites on every occasion of use. As
Chomsky has repeatedly stressed, this contrast between our “words” and animal
“words” needs explaining, as it appears to be a distinctive (dare I say species
specific) feature of NL atoms.

Interestingly (at least to me), the point Chomsky makes here
echoes ideas in Wittgenstein’s (W) later writings. Take a look at W’s slab
language in the Investigations. This
is a “game” in which terms are explicitly referentially anchored. This language
has a very primitive tone (a point that W wants to make IMO) and has none of
the suppleness characteristic of even the simplest words in NL. This resonates very clearly with Chomsky’s Aristotelian
observations about how words function.

Chomsky pushes these observations further. If he is right
about the absence of an intrinsic reference relation between words and the
world and that words function in a quasi Aristotelian way, then semantics is just
a species of syntax, in that it specifies internal
relations between different types of symbols. Chomsky once again (he does this here
for example) urges an analogy with phonological primitives, which also have no
relations to real world objects but can be used to create physical effects that
others built like us can interpret. So, no semantics, just various kinds of
syntax and some pragmatics describing how these different sets of symbols are
used by speakers.

Two remarks and we move on to discuss the meat of the
lecture: (i) Given Chomsky’s skepticism concerning theories of use, this
suggests that there is unlikely to be a “theory” of how linguistic structures
are used to “refer,” make assertions, ask questions etc. We can get informal descriptions that are
highly context sensitive, but Chomsky is likely skeptical about getting much
more, e.g. a general theory of how sentences are used to assert truths.
Interestingly, here too Chomsky echoes W. W noted that there are myriad
language games, but he doubted that there could be theories of such games. Why?
Because games, W observes, are very loosely related to one another and a game’s
rules are often constructed on the fly.

With very few exceptions semanticists, both linguists and
philosophers, have not reacted well to these observations. Most of the
technology recursively specifies truth conditions based on satisfaction
conditions of predicates. There is a whole referentialist metaphysics based on
this. If Chomsky is right, then this will all have to be re-interpreted (and
parts scrapped). So far as I know, Paul Pietroski (see here)
is unique among semanticists in developing interpretive accounts of sentence
meaning not based on these primitive referential conceptions.

Ok, let’s now move on to the main event. Chomsky, despite his
standard comments noting that Minimalism is a program and not a theory,
outlines a theory that, he argues, addresses
minimalist concerns.[1]
The theory he outlines aims to address Darwin’s Problem (DP). In reviewing the
intellectual lay of the land (as described more fully in lecture 1) he observes
that FL arose quickly, all at once, in the recent past, and has remained stable
ever since. He concludes from this that
the change, whatever it was, was necessarily “simple.” Further, Chomsky
specifies the kinds of things that this “simple” change should account for,
viz. a system with (at least) the following characteristics:

(i) it generates an infinite number of hierarchically
structured objects

(ii) it allows for displacement

(iii) it displays reconstruction effects

(iv) its operations are structure dependent

(v) its operations apply cyclically

(vi) it can have lots of morphology

(vii) in externalization only a single “copy” is
pronounced

Chomsky argues that these seven properties are consequences of
the simplest conceivable conception
of a recursive mechanism. Let’s follow the logic.

Chomsky assumes that whatever emerged had to be “simple.”
Why? One reason is that complexity requires time, and if the timeline that
experts like Tattersall have provided is more or less correct, then the
timeline is very short in evo terms (roughly 50-100k years). So whatever
change occurred must have been a simple modification of the previous cognitive
system. Another reason for thinking it was simple is that it has been stable since
it was first introduced. In particular, human FLs have not changed since humans
left Africa and dispersed across the globe about 50-100k years ago. How do we
know? Because any human kid acquires any human language in effectively the
same way. So, whatever the change was, it was simple.

Next question: what’s “simple” mean? Here Chomsky makes an
interesting (dare I say, bold?) move. He equates evolutionary simplicity with conceptual
simplicity. So he assumes that what we recognize as conceptually simple
corresponds to what our biochemistry takes to be simple. I say that this is
“interesting/bold” for I see no obvious reason why it need be true. The change
was “simple” at the genetic/chemical level. It was anything but at the
cognitive one. Indeed, that’s the point;
a small genetic/biochemical change can have vast phenotypic effects, language
being the parade case. However, what Chomsky is assuming, I think, is that the
addition of a simple operation to our cognitive
inventory will correspond to a simple change at the genetic/developmental
level.[2]
We return to this assumption towards the end.

As is well known, Chomsky’s candidate for the “simplest” change
is the addition of an operation that “takes two things already (my emphasis NH) constructed and forms a new thing from
them” (at about 28:20). Note the ‘already.’ The simplest operation, let’s call
it by its common name, “Merge,” does not put any two things together. It puts two constructed things together. We return to this too.

How does it put them together? Again, the simplest operation
will leave the combinees unchanged in putting them together (it will obey the No
Tampering Condition (NTC)) and the simplest operation will be symmetric (i.e.
impose no order on the elements combined).[3] So the operation will be something like
“combine A and B,” not “combine A with B.” The latter is asymmetric and so
imposes a kind of order on the combiners. Merge so conceived can be represented as an operation that creates
sets. Sets have both the required properties. Their elements are unordered and putting
things into sets (i.e. making things elements of a set) does not thereby change
the elements so besetted.[4]
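The two properties just mentioned can be made concrete with a small sketch. It is only an illustration and not Chomsky’s own formalism: it assumes lexical items can be modeled as Python strings and the output of Merge as a `frozenset`, and the name `merge` and the sample items are hypothetical.

```python
def merge(a, b):
    """Set Merge (illustrative): form the unordered set {a, b}.

    Neither a nor b is modified, which is the No Tampering Condition
    as it is modeled here.
    """
    return frozenset({a, b})

the_man = merge("the", "man")
vp = merge("saw", the_man)

# Symmetry: "combine A and B" imposes no order on the combinees.
print(merge("the", "man") == merge("man", "the"))  # True

# No tampering: the object built earlier is an element of the new one, unchanged.
print(the_man in vp)                               # True
```

Immutable sets are used so that an already constructed object can itself be an element of a larger one, which is what lets the structures nest.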

We have heard this song before. However, Chomsky puts a new
spin on things here. He notes that the “simplest” application of Merge is one
where you pick an expression X that is within
another expression Y and combine X and Y. Thus I(nternal)-Merge is the simplest application/instance of Merge. The
cognoscenti will recognize that this is not how Chomsky elaborated things
before. In earlier versions, taking two things neither of which was contained
in the other and Merging them (viz. E-merge) was taken to be simpler. Not now, however. Chomsky does not go into
why he changes his mind, but he hints that the issue is related to “search.” It
is easier to “find” a term within a term than to find two terms in a workspace
(especially one that contains a lexicon).[5] So, the simplest operation is I-merge,
E-merge being only slightly more complex, and so also available.

Comments: I found this discussion a bit hard to follow. Here’s
why. A logical precondition for the application of I-merge is the existence of
structured objects and many (most) of these will be products of E-merge. That
would seem to suggest that the “simplest” version of the operation is not the conceptually
most basic as it logically presupposes that another operation exist. It is coherent to assume that even if E-merge
is more conceptually basic, I-merge is easier to apply (think search). But if
one is trucking in conceptual simplicity, it sure looks like E-merge is the
more basic notion. After all, one can imagine derivations with E-merges and no
I-merges but not the reverse.[6]
Clearly we will be hearing more about this in later lectures (or so I assume).
Note that this eliminates the possibility of Economy notions like “merge over
move” (MoM). This is unlikely to worry Chomsky given the dearth of effects
regulated by this MoM economy condition (Existential constructions?
Fougetaboutit!).[7]
Nonetheless, it is worth noting. Indeed, it looks like Chomsky is heading
towards a conception more like “move over merge” or “I over E merge” (aka:
Pesetsky’s Earliness principle), but stay tuned.

Chomsky claims that these are the simplest conceivable pair
of operations and so we should eschew all else.[8]
Some may not like this (e.g. moi) as it purports to eliminate operations like
inter-arboreal/sidewards Merge (where one picks a term within one expression
and merges it with a term from the lexicon). I am not sure, however, why this
should not be allowed. If we grant that finding mergeables in the lexicon is
more complex than finding a mergeable within a complex term, then why shouldn’t
finding a term within a term (bounded search here) and merging it with a term
from the lexicon not be harder than I-merge but simpler than E-merge? After all, for interarboreal merge we need
scour the big vast nasty lexicon but once rather than twice, as is the case
with many cases of E-merge (e.g. forming {the,man}). At any rate, Chomsky wants
none of this, as it goes beyond the conceptually simplest possibilities.

Chomsky also does not yet mention pair merge, though in
other places he notes that this operation, though more complex than set merge
(note: it does imply an ordering,
hence the ‘pair’ in (ordered?) pair merge) is also required. If this is correct, it would be useful to
know how pair merge relates to I and E merge: is it a different operation
altogether (that would not be good for DP purposes as we need keep miracles to
a small minimum) and where does it sit in the conceptual complexity hierarchy
of merge operations? Stay tuned.

So, to return to the main theme, the candidate for the small
simple change that occurred is the arrival of Merge, an operation that forms
new sets of expressions both from already constructed sets of expressions
(I-merge) and from lexical items (which are themselves atomic, at least as far
as merge is concerned) (E-merge). The
strong minimalist thesis (SMT) is the proposal that these conceptual bare bones
suffice to get us many (in the best case, all) of the distinctive properties of
NL Gs. In other words, that the
conceptually “simplest” operation (i.e. the one that would have popped into our
genomes/developmental repertoires if anything did) suffices to explain the
basic properties of FL. Let’s see how merge manages this.

Recall that Merge forms sets in accord with the NTC. Thus,
it can form bigger and bigger (with
no bound to how big) hierarchically
structured objects. The hierarchy is a product of the NTC. The recursion is endemic
to Merge. Thus, Merge, the “simplest” recursive operation, suffices to derive
(i) above (i.e. the fact that NLs contain an infinite number of hierarchically
structured objects).

In addition, I-merge models displacement (an occurrence of
the same expression in two different places) and as I-merge is the simplest
application of Merge, we expect any system built on Merge to have displacement
as an inherent property (modulo AP deletion, see next post).[9]

We also expect to find (iii) reconstruction effects for
Merge plus NTC implies the copy theory of movement. Note that we are forming
sets, so when we combine A (contained in B) with B via merge we don’t change A
(due to NTC) and so we get another instance of A in its newly merged
position. In effect, movement results in
two occurrences of the same expression in the two places. These copies suffice to support
reconstruction effects so the simplest operation explains (iii), at least in
part (see note 10).[10]
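A small sketch may help show why set formation plus the NTC yields two occurrences. As before, this is an illustration under assumptions, not Chomsky’s formalism: lexical items are modeled as strings, Merge as `frozenset` formation, and the helper `occurrences` and the example items are hypothetical.

```python
def merge(a, b):
    """Set Merge (illustrative): form {a, b} without altering a or b."""
    return frozenset({a, b})

def occurrences(x, obj):
    """Count how many times x occurs as a term of obj (obj itself included)."""
    count = 1 if obj == x else 0
    if isinstance(obj, frozenset):
        count += sum(occurrences(x, member) for member in obj)
    return count

a = merge("which", "book")   # A
b = merge("read", a)         # B contains A (built by E-merge)
moved = merge(a, b)          # I-merge: combine A (contained in B) with B

print(occurrences(a, b))      # 1
print(occurrences(a, moved))  # 2
print(b in moved)             # True: B is untouched, so the lower A is still there
```

Because B is left intact, the lower occurrence of A is not removed by the operation itself; whether it is then interpreted or deleted is a separate matter, as note 10 points out.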

(iv) follows as well. The objects created have no left/right
order, as the objects created are sets and sets have no order at all, and so no
left/right order.[11]
This means that operations on such set theoretic structures cannot exploit
left/right order as such relations are not defined for the set theoretic objects
that are the objects of syntactic manipulation. Thus, syntactic operations must
be structure dependent as they cannot be structure independent.[12]

This seems like a good place to stop. The discussion continues
in the next post where I discuss the last three properties outlined above.

[1]
Chomsky would argue, correctly, that should his particular theory fail then
this would not impugn the interest of the program. However, he is also right in thinking that
the only way to advance a program is by developing specific theories that
embody its main concerns.

[2]
I use ‘genetic/developmental’ as shorthand for whatever physical change was
responsible for this new cognitive operation. I have no idea what the relation
between cognitive primitives and biological primitives is. But, from what I can
tell, neither does anyone else. Ah dualism! What a pain!!

[3]
We need to distinguish order from a left/right ordering. For example, in
earlier proposals, labels were part of Merge. Labels served to order the
arguments: {a,{a,b}} is equivalent to the ordered pair <a,b>. However,
Merge plus label does not impose a left-right ordering on ‘a’ and ‘b’. Chomsky
in this lecture explicitly rejects a label based conception of Merge so he is
arguing that the combiners are formed into simple sets, not ordered sets. The
issue about ordering, then, is more general than whether Merge, like earlier
Phrase Structure rules in GG, imposes a left-right order on the atoms in
addition to organizing them into “constituents.”

[4]
If it did, we could not identify a set in terms of the elements it contains.

[5]
I heard Chomsky analogize this to finding something in your pocket vs finding
it on your desk, the former being clearly simpler. This clearly says something
about Chomsky’s pockets versus his desks. But substitute purses or school bags for pockets and the analogy, at
least in my case, strains. This said, I like this analogy better than Chomsky’s
old refinery analogy in his motivation of numerations.

[6]
Indeed, one can imagine an FL that only has an operation like E-merge (no
I-merge) but not the converse. Restricting Merge to E-merge might be conceptually ad hoc, as Chomsky
has argued before, but it is doable. A system with I-merge alone (no E-merge at
all) is, at least for me, inconceivable.

[7]
I know of only three cases where MoM played a role: Existential constructions,
adjunct control and the order of shifted objects and base generated subjects. I
assume that Chomsky is happy to discount all three, though a word or two why
they fail to impress would be worth hearing given the largish role MoM played
in earlier proposals. In particular, what is Chomsky’s current account of *there seems a man to be here?

[8]
Chomsky talks as if these are two different
operations with one being simpler than the other. But I doubt that this is what
he means. He very much wants to see structure building and movement as products
of the same single operation and that on the simplest story, if you get one you
necessarily get the other. This is not
what you get if the two merge operations are different, even slightly so.
Rather I think we should interpret Chomsky as saying that E/I-Merge are two applications of the same operation with
the application of I-merge being simpler than E-merge.

[9]
What I mean is that I-merge implies the presence of non-local dependencies. It
does not yet imply displacement phenomena if these are understood to mean that
an expression appears at AP in a position different from where it is interpreted
at CI. For this we need copy deletion
as well as I-merge.

[10]
Actually, this is not quite right. Reconstruction requires allowing information
from lower copies to be retained for
binding. This need not have been
true. For example, if CI objects were like the objects that AP interprets, the
lower copies would be minimized (more or less deleted) and so we would not
expect to find reconstruction effects. So what Merge delivers is a necessary (not a sufficient) condition for
reconstruction effects. Further technology is required to actually deliver the
goods. I mention this, for any additional technology must find its way into the
FL genome and so complicates DP. It
seems that Chomsky here may be claiming a little more for the simplest
operation than Merge actually delivers. In Chomsky’s original 1993 paper,
Chomsky recognized this. See his discussion of the Preference Principle,
wherein minimizing the higher copy is preferred to minimizing the lower one.

[11]
As headedness imposes an ordering on the arguments (it effectively delivers
ordered pairs), headedness is also excluded as a basic part of the
computational system as it does not follow from the conceptually “simplest”
possible combination operation. I discuss this a bit more in the next post.

[12]
Note that we need one more assumption to really seal the deal, viz. that there
are no syntax like operations that apply after Transfer. Thus, there can be no
“PF” operations that move things around. Why not? Because Transfer results in
left-right ordered objects. Such kinds of operations were occasionally proposed
and it would be worth going back to look at these cases to see what they imply
for current assumptions.