LINGUIST List 12.2607

Fri Oct 19 2001

Review: Bertolo, Language Acquisition and Learnability

Editor for this issue: Terence Langendoen <terry@linguistlist.org>

What follows is another discussion note contributed to our Book Discussion
Forum. We expect these discussions to be informal and interactive; and
the author of the book discussed is cordially invited to join in.
If you are interested in leading a book discussion, look for books
announced on LINGUIST as "available for discussion." (This means that
the publisher has sent us a review copy.) Then contact Simin Karimi at
simin@linguistlist.org or Terry Langendoen at terry@linguistlist.org.

Bertolo, Stefano, ed. (2001) Language Acquisition and Learnability.
Cambridge University Press, viii+247pp, hardback ISBN 0-521-64149-7,
$64.95; paperback ISBN 0-521-64620-0, $22.95.
Lee Fullerton, University of Minnesota.
This book is not an anthology in the normal sense, but rather
a five-chapter introduction to the topics of its title by six
authors. The editor has made an effort to uniformize
terminology and formal notation and sets out his norms in the
fourteen-page Chapter 1. He claims the book is accessible,
but the reader definitely needs to be versed in set theory,
probability and statistics, as well as in Chomsky's Principles
and Parameters (1981; hereafter PPH,
where H stands for 'hypothesis') and his Minimalist Program
(1995). The book contains almost no discussion of results in
first-language acquisition research. Rather, it is a
theoretical, mostly abstract treatment of the nature of
parameters, universal grammar (UG) and the learning device.
Each chapter has exercises interrupting the text. (Those for
Chapter 5 are all at the end.)

Chapter 1, A brief overview of learnability, by Stefano Bertolo
Chapter 2, Learnability and the acquisition of syntax, by
  Martin Atkinson
Chapter 3, Language change and learnability, by Ian Roberts
Chapter 4, Information theory, complexity and linguistic
  descriptions, by Robin Clark
Chapter 5, The Structural Triggers Learner, by William G.
  Sakas and Janet D. Fodor

In the first chapter Bertolo describes the book as a
collaboration of linguistics, psychology and learning theory
in which linguists elaborate all the theoretical
possibilities, learning researchers eliminate some of these
based on theory supported by empirical studies, and
psychologists eliminate others based on their studies of
learning schedules and types of data learned. B then
elaborates briefly on the first four of the following five
questions central to learnability researchers: (i) What is
being learned, exactly? (ii) What kind of hypotheses is the
learner capable of entertaining? (iii) How are the data of
the target language presented to the learner? (iv) What are
the restrictions that govern how the learner updates her
conjectures in response to the data? (v) Under what
conditions, exactly, do we say that a learner has been
successful in the language learning task? B's elaboration of
the last question begins by noting that the current consensus
rejects the notion that humans learn language by
Identification by Enumeration (of conjectures about the
finite values of a finite number of parameters), but he
proceeds anyway to outline three types of such learning, all
of which are theoretically compatible with the PPH:
Identification in the Limit, the Wexler and Culicover
Criterion, and the Probably Approximately Correct Criterion.
B describes each formally and proves PPH grammars learnable
mathematically under each. More recent work, B says, has
turned to investigating another possible organization of the
space which parameters occupy: the Subset Principle. In the
last three pages of the chapter B gives formal definitions of
parameter spaces and a notation for exploring interesting
regions of them.
Atkinson's Chapter 2 has three parts, which incidentally
retrace the successive foci of recent learning theory: the
child's linguistic environment, the subset problem, and
algorithms accounting for the effects of interacting
parameters. Regarding the child's environment, A concludes
first that negative feedback (i.e. correction or other
expression of disapproval of ill-formed sentences) is nearly
nonexistent and would not make a difference even if it were
frequent. Second, the environment of Motherese does indeed
present the child with simple sentences: embeddings are very
rare. Parameters are therefore settable without reference to
embedding, and such phenomena as WH-movement over one or more
clause boundaries are acquired in the form of movement over
one or more nodes within the simple sentence. Thirdly, A
considers whether Motherese is "graded," i.e. whether the
data presented are ordered in a way that matches the brain's
predetermined sequence of structure-types acquired. The
reasonable answer is no, but empirical evidence is lacking.
Brain maturation might cause the child to focus on different
aspects of the consistently simple data in a predetermined
sequence.
A introduces the subset problem with a set-theoretical
analysis and illustrates it later with the phenomenon of
across-clause binding of anaphors and pronouns. Any language
that binds an anaphor across x number of clauses will also
allow that binding across x-1 clauses, which results in
language X properly including language X-1. This discussion
is quite lengthy, including at the end treatment of the null
subject parameter as a possible subset problem, one in which
the values of the parameter are only two: pronoun subject
expressed or not. In fact, if one ignores the across-
clause(s) binding problem, it turns out that maybe all
parameters have binary values, neither of which has the
qualifier 'optional'. A suggests that current research on
the proper articulation of the binding theory will lead to
the conclusion that it too is binary. I see the binding of
anaphors as part of the same problem as the distance of WH-
movement mentioned above. Both are acquired as binding
across nodes within the simple sentence. A closes this
section with the conclusion that subset relations are not a
part of human grammar.
A's last section deals with problems that arise with the
interaction of two or more parameters. Given XP = Spec, X' and
X' = Comp(lement), X, where Spec is S(ubject), Comp is O(bject)
and unbarred X is V(erb), four arrangements of S, V, O are
possible depending on the (binary) ordering values assigned
to the two parameters. The first yields either S-first or S-
last while the second yields either VO or OV. Given initial
settings and a wealth of data sentences, usually including
more than S, V, O (e.g. a second complement of the verb, free
adverbials, auxiliary verbs), what triggering word orders are
needed to move the learner from his/her initial settings to
the target settings of his/her language? Are there setting
pairs (initial, target) for which no sequence of triggers
will move the learner to the target settings? Once A
introduces a third parameter, plus or minus verb second (V2),
the second question is answered with yes. These "local
maxima" occur when the initial setting for V2 is minus and
the target setting is plus. A gets around this dilemma by
assuming that learners set parameter values according to a
particular ordering of parameters. If the two X-bar
parameters are set first, then the later setting of V2 can
avoid the problem. The ordering of parameters set is of
course consistent with the above-mentioned ordering of focal
points on the input data according to brain maturation.
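The four arrangements produced by the two X-bar ordering parameters
can be enumerated mechanically. The following is a minimal sketch,
not anything taken from Atkinson's chapter: the function name
`linearize` and the Boolean encoding of the two parameters are
assumptions made here for illustration only.

```python
from itertools import product

def linearize(spec_first: bool, comp_first: bool) -> str:
    """Return the S/V/O order for one setting of two binary parameters.

    spec_first: Spec (S) precedes X' if True, follows it if False.
    comp_first: Comp (O) precedes the head X (V) if True, follows it if False.
    Both names are hypothetical labels for the two ordering parameters.
    """
    # X' = Comp, X in one of two orders
    x_bar = ["O", "V"] if comp_first else ["V", "O"]
    # XP = Spec, X' in one of two orders
    xp = ["S"] + x_bar if spec_first else x_bar + ["S"]
    return "".join(xp)

# Every combination of the two binary settings and the order it yields
orders = {
    (spec_first, comp_first): linearize(spec_first, comp_first)
    for spec_first, comp_first in product([True, False], repeat=2)
}

for setting, order in orders.items():
    print(setting, order)
```

The first parameter only decides whether S is initial or final, and
the second only decides VO versus OV, which is why no setting of
these two parameters alone yields a V2-type order.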
Ian Roberts begins Chapter 3 by noting that language change
requires us to believe that a generation of learners can
sometimes set a parameter's value differently from the
members of the generation providing input. Latin pretty
clearly has underlying OV order, while its daughters, the
modern Romance languages, all have VO. Such things happen, R
says, because between the parents' setting and the input
there is the learning algorithm (device), which in cases of
change finds the parent setting unlearnable. Citing Cinque
(1997), who finds in a large sample of unrelated languages 32
ordered parameters expressed in IP alone, R calculates 2-to-
the-64th-power grammars. Setting one parameter per second,
the learner's acquisition would in the worst case take more
than 34 years. What allows acquisition within three to five
years is the learning algorithm.
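The scale of worst-case estimates of this kind is easy to check. The
sketch below assumes binary parameters and a learner that tries one
complete candidate grammar per second; both are assumptions of the
sketch, and the function name `worst_case_years` is hypothetical.
Under these assumptions, 30 binary parameters already require about
34 years of exhaustive search.

```python
SECONDS_PER_YEAR = 365.25 * 24 * 60 * 60  # Julian year, in seconds

def worst_case_years(n_parameters, seconds_per_grammar=1.0):
    """Years needed to try every setting of n binary parameters, one at a time."""
    n_grammars = 2 ** n_parameters  # each binary parameter doubles the space
    return n_grammars * seconds_per_grammar / SECONDS_PER_YEAR

for n in (10, 20, 30):
    print(n, 2 ** n, round(worst_case_years(n), 4))
```

Because the space doubles with every added parameter, exhaustive
search quickly becomes implausible, which is the point of crediting
the learning algorithm rather than brute enumeration.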
Drawing heavily on Kauffman's (1995) study of the clumping of
matter in galaxies using Boolean networks and their states, R
credits the learning algorithm with the notions of markedness
between the two values of a parameter and an implicational
relation of one parameter to another. The latter is only two
deep, that is, parameter X can have either value 0 or value 1
and, if 1, then parameter Y is in play with its two values,
but there are no further parameters involved. Not only the
values are binary, but so is the size of the network.
Following a long discussion of feature checking within
Chomsky's (1995) Minimalist Program, R boils it down to the
same kind of binary network: any feature F is either
expressed at the level of Phonetic Form or it is not: if
expressed, then by either Merge (lexical-phonological
insertion) or by Move.
Markedness in acquisition has the (always conservative)
learner setting parameters only in response to trigger input
which expresses the marked value. Otherwise the unmarked
value is the default setting. Move is marked with regard to
no phonological expression of the feature, and Merge is
marked with respect to Move. This notion of markedness is
part of the learning algorithm and is distinct from Cinque's
parameters within IP, which are part of UG. These latter R
calls Jakobsonian parameters and illustrates them with four
mood parameters, e.g. Mood-sub-Speech-Act is unmarkedly
declarative and markedly non-declarative.
Both types of markedness are subject to the observation that
the default (unmarked) setting unmarkedly lacks overt
expression while the marked setting is expressed. Thus
declarative sentences in English have English's underlying
order SVO while yes/no-interrogative and imperative sentences
show subject-verb inversion (VSO), a result of Move.
By all these notions R claims to put syntactic variation and
syntactic change into the realm of cognitive theory, which in
turn allows variation and change to be evidence for
describing UG and the learning algorithm. R's first case is
the loss of two sites (C and AgrS) for the main verb in
Modern English (ModE). In the absence of an auxiliary, ModE
requires DO-Support with negation and interrogation. It also
places the verb after certain adverbials and after a floated
quantifier. None of this was true in Middle English (ME):
no DO-Support; verbs preceded a floated quantifier and an
adverbial. By R's analysis the ME order shows movement of the
main verb into AgrS while ModE lacks such movement and is
therefore simpler and more "elegant" to the learning
algorithm. The ModE situation shows no overt expression of
the unmarked setting zero of the V-to-AgrS parameter, as
opposed to the ME situation, where Move correlates with the
marked setting 1. A further change is the loss of
distinctive verb inflection for person and number (four
different endings in 1400, only one (3sg -s) today). It too
illustrates elegant as unmarked: ME has the marked setting
of phonological expression, i.e. Merge, while ModE has the
unmarked setting of no overt expression.
R treats two more cases of syntactic change in similar
fashion: (1) the shift of 'habere' from an independent verb of
possession in Latin to a suffix for future and conditional in
the modern Romance languages and (2) the shift from SOV in
Latin and Proto-Germanic to SVO in Romance and English,
respectively.
Robin Clark opens Chapter 4 by describing it as an
invitation to linguists to explore the mathematics of
information theory and statistics. At the end of his second
section (page 136) C summarizes his intuitive account of the
relationship between "texts," i.e. adult input, and the
setting of parameters:
"...[A] system has the learnability property just in case
there is some learner that learns the languages [including
variation within a single language] determined by that
system from any arbitrarily selected 'fair' text, one where
each parameter value is expressed above the learnability
threshold [frequency]. The complexity bound U...should
serve to limit the complexity of the input text; in
particular, given U we can establish an upper bound of both
the sample size and the time required by the learner.
...[A]s the complexity bound U grows, the sentences which
express structures near the bound become less likely. It
will take increasingly large samples to learn more complex
parameter values. Assuming, as seems reasonable, that the
time to converge is a function of the size of the text [the
learner] learns on, then the time-complexity of learning is
also a function of U. But U is a bound on parameter
expression: no parameter can contain more information than
can be expressed by a phrase marker of complexity at most
U. In other words, the information content of a parameter
value is directly related to probabilities. Finally, since
cross-linguistic variation is determined by the different
parameter values, U also limits the amount of variation
that is possible across languages. We now turn to the
formalization of these intuitions."
As implied above, C assumes that parameters have a finite
number of values greater than two. He seems also to assume a
critical learning period extending beyond age three. He
assumes further that the learner comes equipped with some
innate mathematics.
C's sketch of probability and information theory begins with
three axioms about sets of events and four axioms about the
probabilities of events within the set. After introducing the
notion of entropy, a measure of the uncertainty in a system,
C defines conditional entropy and relative entropy.
Examples: Heads reduce entropy by selecting for semantic,
syntactic and morphological properties of the constituents
they govern (conditional entropy). Which sense of a two-
meaning word like 'grade' is intended (school, slope) can be
approached by calculating the probability of each in a given
context (relative entropy).
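For readers who want to see the three quantities side by side, here
is a minimal sketch using their standard textbook definitions. The
function names and all probabilities are invented for illustration
(the two senses of 'grade' in two hypothetical contexts); none of the
numbers come from Clark's chapter.

```python
from math import log2

def entropy(p):
    """Shannon entropy H(P) in bits of a discrete distribution."""
    return -sum(x * log2(x) for x in p if x > 0)

def conditional_entropy(joint):
    """H(Y|X) in bits, where joint[x][y] is the joint probability P(x, y)."""
    h = 0.0
    for row in joint:
        p_x = sum(row)  # marginal probability of this context x
        for p_xy in row:
            if p_xy > 0:
                h -= p_xy * log2(p_xy / p_x)
    return h

def relative_entropy(p, q):
    """Relative entropy D(P||Q) in bits; assumes q > 0 wherever p > 0."""
    return sum(x * log2(x / y) for x, y in zip(p, q) if x > 0)

# Invented numbers: the senses (school, slope) of 'grade' are equally
# likely overall, but each context strongly favours one of them.
senses = [0.5, 0.5]
joint = [[0.45, 0.05],   # hypothetical classroom context
         [0.05, 0.45]]   # hypothetical road-building context

print(entropy(senses))             # uncertainty with no context
print(conditional_entropy(joint))  # uncertainty once the context is known
print(relative_entropy([0.9, 0.1], senses))  # in-context vs. overall distribution
```

Knowing the context lowers the uncertainty about the intended sense,
which is the sense in which a selecting head or a disambiguating
context "reduces entropy."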
C's discussion of parameters begins with the idea of
describing an object, any object, including linguistic
objects. The complexity of the description depends on the
degree to which the object has structure: Objects with a
great deal of internal predictability, like languages, have
short descriptions, which may be formulated as instructions;
objects with no structure (random objects, like a sequence of
coin tosses) cannot have a compressed description. To tie
these notions to symbolic descriptions, C discusses Turing
machines, including two-tape ones (for finding e.g.
palindromes) and universal ones, which can simulate any
particular Turing machine. This leads to a discussion of
data compression and codes, in particular, instantaneous
codes, in which no codeword is identical with any initial
sequence ("prefix") of any other codeword. Optimal
instantaneous codes give the shortest codewords to those
entities with the highest probability of occurrence. To end
this section C provides a binary, optimal, instantaneous code
description of a universal Turing machine.
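The two properties just described (no codeword is a prefix of
another; likelier items get shorter codewords) can be illustrated
with a small Huffman construction, a standard way of building an
optimal instantaneous code. This is a sketch under assumed symbol
probabilities; the names `huffman_code` and `is_instantaneous` are
hypothetical and it is not Clark's own encoding.

```python
import heapq
from itertools import count

def huffman_code(probabilities):
    """Build a binary Huffman code for a dict mapping symbol -> probability."""
    tiebreak = count()  # unique counter so the heap never compares dicts
    heap = [(p, next(tiebreak), {sym: ""}) for sym, p in probabilities.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        # Merge the two least probable subtrees, extending their codewords
        p0, _, code0 = heapq.heappop(heap)
        p1, _, code1 = heapq.heappop(heap)
        merged = {sym: "0" + word for sym, word in code0.items()}
        merged.update({sym: "1" + word for sym, word in code1.items()})
        heapq.heappush(heap, (p0 + p1, next(tiebreak), merged))
    return heap[0][2]

def is_instantaneous(code):
    """True if no codeword is an initial sequence (prefix) of another."""
    words = list(code.values())
    return not any(
        i != j and words[j].startswith(words[i])
        for i in range(len(words))
        for j in range(len(words))
    )

# Assumed probabilities for four arbitrary symbols
code = huffman_code({"a": 0.5, "b": 0.25, "c": 0.125, "d": 0.125})
print(code, is_instantaneous(code))
```

With these probabilities the most likely symbol receives a one-bit
codeword and the two least likely receive three bits each, so
codeword length tracks improbability.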
The Kolmogorov complexity of an object is the length of the
briefest program (formal description) of the object by a
Turing machine. C demonstrates a relationship between K-
complexity and entropy, namely, as sample size grows the two
approach each other. For linguists this means that as input
to the learner grows, its uncertainty about parameter
settings shrinks. There must be an upper limit on the
complexity of any given parameter setting, for the input data
that allow the learner to make the setting become ever less
frequent the more complex the setting. At some level of
complexity the frequency of corresponding data becomes so low
that it fails to meet the threshold required for
learnability. Corpus-based studies can reveal the upper
bound of complexity and thereby inform both the typologist
and the learnability researcher as to how much information
can be packed into a parameter. This in turn should have
consequences for theoretical syntax and developmental
psycholinguistics.
In Chapter 5 Sakas and Fodor (S&F) introduce the Structural
Triggers Learner (STL), in three versions: the strong STL,
which uses parallel processing of many candidate grammars,
the weak STL, which uses serial processing and throws out all
ambiguous input sentences, and the dynamic weak STL, which
gleans what it can from ambiguous sentences. Before
examining any of these, however, S&F discuss four general
problems (summarized below) and show the unsupportability of
Gibson and Wexler's (1994) Triggering Learning Algorithm
(ignored here).
S&F posit three phases for the learner's work in establishing
a parameter setting: I. recognizing the trigger structure
when present in the input; II. adopting the corresponding
value for the parameter; III. finding any other parameter
settings that are now in error and resetting them by I and
II. That underlying phrase structures get changed by
derivational processes is a problem for the learner. For
example, how is a learner of German, underlyingly verb last,
going to arrive at that setting when nearly all of the input
consists of independent clauses, which have the verb first or
second? Of course, the input reflects movement of the German
verb, but that brings up another problem. In an SVO string
the German verb has been moved into the C(omplementizer)
position, the English verb remains in its underlying position
in VP, and the French verb has been moved up to an
inflectional head. However, in none of these languages is
there anything in the input's surface string to indicate what
node dominates the verb. How is a learner to set a parameter
for the verb's landing site? S&F assume a parser, which sets
parameters, including "deep" parameters, i.e. those for which
there is never any triggering information in the surface string
of the input. The parser can work only with the learner's
current grammar, but triggers from which the learner can
learn are contained in only that input which the parser
cannot yet parse. Sidestepping this "parsing paradox", what
the parser delivers is a phrase structure tree for the
surface string together with all the information telling how
this tree is derived from the underlying tree. The string
SVO is necessarily ambiguous to the learner because its verb
position could be underlying, as in English, or derived from
SOV, as in German. The parser's output ought to reveal a
trigger for setting the verb-landing-site parameter, but that
can't happen here because the parser yields two mutually
incompatible outputs. What does the learning device do with
ambiguous input data?
Although S&F consider both a strong and a weak version of the
STL, they decide in the end on a third version. Basic to all
three is the Parametric Principle, by which the value of
every parameter is set for all time and independently of all
other parameters. This yields rapid acquisition, because each
successive setting act eliminates a (progressively smaller)
host of candidates for target grammar. Independence of
parameters also eliminates the need for III above.
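The "progressively smaller host" of eliminated candidates can be made
concrete. The sketch below assumes binary, mutually independent
parameters that are set once and never revised; the function name
`prune` and the five-parameter target are invented for illustration
and are not taken from Sakas and Fodor.

```python
from itertools import product

def prune(n_parameters, target):
    """Set parameters one at a time and record how many grammars each act removes.

    Grammars are tuples of 0/1 values, one per parameter (an assumed encoding).
    Returns the list of elimination counts and the surviving candidates.
    """
    candidates = list(product([0, 1], repeat=n_parameters))
    eliminated = []
    for index, value in enumerate(target):
        before = len(candidates)
        # Fixing one parameter discards every grammar with the other value
        candidates = [g for g in candidates if g[index] == value]
        eliminated.append(before - len(candidates))
    return eliminated, candidates

eliminated, survivors = prune(5, (1, 0, 1, 1, 0))
print(eliminated, survivors)
```

Each setting act halves what remains, so n setting acts suffice to
isolate one grammar out of 2-to-the-nth-power candidates, provided
that no setting ever has to be undone.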
The S&F process of setting parameters has the parser working
with the learner's current grammar until it encounters a
(sub)string that it cannot parse; it then turns to a
"supergrammar" which contains "treelets" supplied by UG.
These treelets are minimal structures--S&F's examples show
only two branches--in which the terminals and the node(s) are
labeled. Each represents one value of one parameter; a
binary parameter thus has two treelets, an n-ary parameter
has n treelets. Assuming that treelet selection (parameter
setting) leads the learner to build the underlying syntactic
structure of the target language, what happens when the
(sub)string the parser is looking at is derived, i.e.
distorted by deletion, movement, etc.? If I understand
correctly, S&F offer two answers: first, the (sub)string contains
traces reflecting underlying structure; second, the supergrammar
contains, in addition to underlying treelets, treelets
for all possible derived structures, i.e. derivational steps
also have parameters.
This model eliminates I above, since the triggers are
discovered by the parser's getting hung up. The parsing
paradox is eliminated by the parser's ability to turn to the
supergrammar. The derivation problem is solved by the
presence of derived treelets in the supergrammar. Problems
like that of the verb's landing site are solved by labels on
the UG treelets. The problem of ambiguous input remains, but
S&F suggest that in practice it may be a small problem, since
adult input during early learning may be very simple,
expressing no more than six parameters per sentence.
Overall, this book focuses on the mathematics of formalizing
and testing theories. Actual adult sentences addressed to
children are rarely cited, and children's own speech is never
discussed. Of course, it's probably true that acquisition is
nearly complete when children begin uttering three- and four-
word strings, so speculation about what's going on before
that point should not be unwelcome. Yet the formal,
mathematical approach forces the practitioner to make
simplifying assumptions, and each additional one takes him/her
farther from common sense and the real world. All of these
authors know this. Sakas and Fodor even illustrate it in
their endnote 12 with a joke that circulated once among
mathematicians: "A Mafia boss kidnaps a mathematician, locks
him into a dank cellar, says 'I'll be back in six months and
you must then give me a formula to predict whether my horse
will win at the races. If you don't, I'll shoot you.' He
leaves. He returns in six months, asks the mathematician for
the formula, the mathematician doesn't have it, the Mafia man
pulls out his gun. But the mathematician says 'No, don't
shoot me. I don't have the formula yet but I have made
significant progress. I have it worked out for the case of
the perfectly spherical horse.'"
Among models discussed, that of Sakas and Fodor is the most
detailed, but it too falls short. Traces are as discernible
in the input as node labels, i.e. not at all. The
supergrammar is so loaded with treelets that it threatens to
lose the internal predictability of structured objects in the
sense of Clark, Chapter 4. Assuming binarity, a parameter
ought to have only plus and minus values, e.g. plus for the
VP parameter could mean VO, and minus would then mean OV.
The same for what we used to call rules, e.g. plus or minus
V2. If research supports them, the values M(arked) and
U(nmarked) would be even better. The claim that every
parameter is set independently of every other loses the
insight of implicational universals, e.g. underlying verb-
last structure implies postpositions. If early input were to
contain no adpositions but clearly trigger OV, that should in
turn trigger an initial and unmarked setting for
postpositions (contingent unmarkedness as per Roberts, Chapter
3). If later input revealed only prepositions, the learning
device ought to be able to change the adposition parameter
setting to prep. Finally, Sakas and Fodor worry too much
about ambiguous input. If they were to take intonation and
prosody into account, ambiguity would drop sharply. For
example, they label the following sentence ambiguous: He fed
her dog biscuits (biscuits to her dog or dog biscuits to
her). Speak the sentence aloud once for each reading; you
will find no ambiguity. Note also that, from the page, the
second reading is hard to fetch--and a literate child would
never fetch it--because it expresses anomalous behavior.
This fact is not unimportant.
REFERENCES
Chomsky, N. (1981) Principles and Parameters in Syntactic
Theory, in N. Hornstein and D. Lightfoot, eds., Explanation
in Linguistics: the Logical Problem of Language Acquisition,
Longman.
Chomsky, N. (1995) The Minimalist Program, MIT Press.
Cinque, G. (1997) Adverbs and the Universal Hierarchy of
Functional Projections, unpublished manuscript, University of
Venice.
Gibson, E. and K. Wexler (1994) Triggers, Linguistic Inquiry
25: 407-54.
Kauffman, S. (1995) At Home in the Universe, Viking Press.
Wexler, K. and P. Culicover (1980) Formal Principles of
Language Acquisition, MIT Press.
ABOUT THE REVIEWER
Lee Fullerton is an Associate Professor on leave from the
University of Minnesota. His main scholarly interests are
historical Germanic morphology and phonology and the syntax
of Modern German.