Thursday, April 30, 2015

Those interested in this topic might find the following interesting (here). It is now the received wisdom that MOOCs were hyped when they started and that their promise, if there is any, was largely in the minds of those who stood to benefit most. The Gates Foundation is now funding research on this topic (which, given the monetary source, may or may not be "objective," pardon my cynicism) so that we can find out what MOOCs are good for, if anything.

It appears that the one group they do serve is "non-traditional" students, and the problem appears to be keeping their attention.

The report names student engagement as a prominent theme. Many students enrolled in MOOCs are nontraditional, so making sure that they are engaged and able to succeed in such a course is even more important. Figuring out how to maintain students’ interest during an online course when “a distraction is literally just a click away” is another important element, Mr. Siemens said.

So putting things online has some potential drawbacks that researchers are now addressing. Note too the audience: "non-traditional" students. It seems that for the regular college crowd MOOCs may not be on the agenda. Effectively, MOOCs are now filling the role that correspondence courses filled in the pre-digital era. And it seems that they are encountering problems analogous to those that such courses traditionally faced: keeping the student's attention focused on the material. This does not strike me as very surprising. It was never clear to me why presenting the material online on a screen should make it more engaging than doing so in a book on your lap. At any rate, the discussion goes on, this time with much less hype.

Wednesday, April 29, 2015

Thanks to Shigeru and Vitor for taking the time to elaborate on the points they make in their paper.

***

Dear Norbert,

Thanks for
taking up our paper in your blog (Nóbrega and
Miyagawa, 2015, Frontiers in Psychology). We are glad that
you appreciate our arguments against the gradualist approach to language
evolution. There are two things that don't come out in your blog that we want
to note.

First, our arguments against the gradualist view are predicted by the
Integration Hypothesis, which Miyagawa proposed with colleagues in earlier Frontiers articles (Miyagawa et al.
2013, 2014). The gradualists such as Progovac and Jackendoff claim that
compounds such as doghouse and daredevil are living fossils of an
earlier stage in language, which they call protolanguage. The reason is that
the two "words" are combined without structure, due to the fact that
these compounds (i) have varied semantic interpretations (NN compounds), and
(ii) are unproductive and not recursive (VN compounds). We argued that if one
looks beyond these few examples, one finds plenty of similar compounds that are
fully productive and recursive, such as those in Romance and Bantu. These
productive forms show that the members that make up the compound are not bare
roots, but are "words" in the sense that they are associated with
grammatical features of category and sometimes even case.

This is precisely what the Integration Hypothesis (IH) predicts. IH
proposes that the structure found in modern language arose from the integration
of two pre-adapted systems. One is the Lexical system, found in monkeys, for
example. The defining characteristic of the L-system is that it is composed of
isolated symbols, verbal or gestural, that have some reference in the real
world. The symbols do not combine. The other is the Expressive system found in
birdsong. The E-system is a series of well-defined, finite state song patterns,
each song without specific meaning. For instance, the nightingale may sing up
to 200 different songs to express a limited range of intentions such as the
desire to mate. The E-system is akin to human language grammatical features.
These are the two major systems found in nature that underlie communication. IH
proposes that these two systems integrated uniquely in humans to give rise to
human language.

Based on the nature of these two systems, IH predicts that the members of
the L-system do not combine directly, since that is a defining characteristic
of the L-system. E must mediate any such combination. This is why the IH
predicts that there can't be compounds of the form L-L, but instead, IH
predicts L-E-L. Such an assumption bears a close relation to how human language
roots are ontologically defined, as feature-less syntactic objects. Once roots
are feature-less, they are invisible to the generative system, and thus there is no
motivation a priori to assume that syntax merges two bare roots, that is, two
syntactically invisible objects.

The second point is that the L-system is related to such verbal behavior
as the alarm calls of Vervet monkeys. We focus on the fact that these calls are
isolated symbols, each with reference to something in the real world (thus,
they are closer to concepts than to full-blown propositions). You
question the correlation by noting that while the elements in a monkey's alarm
calls appear purely to be referential, words in human language are more
complex, a point Chomsky also makes. We also accept this difference, but
separate from this, roots and alarm calls share the property, if we are right,
that they are isolated elements that do not directly combine. This is the
property we key in on in drawing a correlation between roots and alarm calls as
belonging to the L-system. In addition to the referential aspect of alarm
calls, there is another important question to solve: what paved the way for the
emergence of the open vocabulary stored in our long-term memory, since alarm
calls are very restricted? Perhaps what you’ve mentioned as “something
‘special’ about lexicalization”, that is, the effect that Merge had on the
pre-existing L-system, may have played a role in the characterization of human
language roots, allowing the proliferation of a great number of roots in modern
language. Nevertheless, we will only get a satisfactory answer to this question
when we have a better understanding of the nature of human language roots.

Finally, you might be interested to know that Nature just put up a
program on primate communication and human language on Nature Podcast in which
Chomsky and Miyagawa are the linguists interviewed.

Tuesday, April 28, 2015

MIT News (here)
mentions a paper that recently appeared in Frontiers
in Psychology (here)
by Vitor Nóbrega and Shigeru Miyagawa (N&M). The paper is an Evolang
effort that argues for a rapid (rather than a gradual) emergence of FL. The
blessed event was “triggered” by the emergence of Merge which allowed for the
“integration” of two “pre-adapted systems,” one relating to outward expression
(think AP) and one related to referential meaning (think CI). N&M calls the
first the E-system and the second the L-system. The main point of the paper is
that the L-system does not correspond to anything like a word. Why? Because
words found in Gs are themselves hierarchically structured objects, with
structures very like the kind we find in phrases (a DMish perspective). The
paper is interesting and worth looking at, though I have more than a few
quibbles with some of the central claims. Here are some comments.

N&M has two aims: the first is to rebut gradualist claims
concerning the evolution of FL. The second is to provide a story for the rapid
emergence of the faculty. I personally found the criticisms more compelling
than the positive proposal. Here’s why.

The idea that FL emerged gradually generally rests on the
idea that FL builds on more primitive systems that went from 1-word to 2-word
to arbitrarily large n-word sequences. My problem with these kinds of stories has always been how we get from 2
to arbitrarily large n. As Chomsky has noted, “go on indefinitely” does not
obviously arise from “go to some fixed n.” The recursive trick that Merge embodies
does not conceptually require priming by finite instances to get it going. Why?
Because there is no valid inference from “I can do X once, twice” to “I can do
X indefinitely many times.” True, to get to ‘indefinitely many X’ might causally (if not conceptually) require seguing via finite instances of
X, but if it does, nobody has explained how it does.[1]
Brute facts causing other brute facts does not an explanation make.

Let me put this another way: Perhaps as a matter of historical
fact our ancestors did go through a protolanguage to get to FL. However, it has
never been explained how going
through such a stage was/is required
to get to the recursive FL of the kind we have. The gradualist idea seems to be
that first we tried 1-word sequences then 2-word and that this prompted the
idea to go to 3, 4, n-word sequences for arbitrary n. How exactly this is
supposed to have happened absent already having the idea that “going on
indefinitely” was ok has never been explained (at least to me). As this is
taken to be a defining characteristic of FL, failing to show the link between the
finite stages and the unbounded one (a link that I believe is conceptually
impossible to show, btw) leaves the causal relevance of the earlier finite
stages (should they even exist) entirely opaque (if not worse). So, the argument that recursion “gradually”
emerged is not merely wrong, IMO, it is barely coherent, at least if one’s
interest is in explaining how unbounded hierarchical recursion arose in the
species.[2]

N&M hints at a second account that, IMO, is not as
conceptually handicapped as the one above. Here it is: One might imagine a
system in place in our ancestors capable of generating arbitrarily big “flat”
structures. Such structures would be different from our FL in not being hierarchical,
and the same in being unbounded. These procedures, then, could generate
arbitrarily “long” structures (i.e. the flat structures could be indefinitely
long (think beads on a string) but have 0-depth). Now we can ask a question: how can one get
from the generative procedures that deliver arbitrarily long strings to our
generative procedures which deliver structures that are both long and deep? I
confess to having been very attracted to this conception of Darwin’s Problem
(DP). DP so understood asks for the secret sauce required to go from “flat”
n-membered sets (or sequences for arbitrary n) to the kind of arbitrarily
deeply hierarchically structured sets (or graphs or whatever) we find in Gs
produced by FL. I have a dog in this fight (see here),
though I am not that wedded to the answer I gave (in terms of labeling being
the novelty that precipitated change). This version of the problem finesses the
question of where recursion came from (after all, it assumes that we have a procedure to generate arbitrarily long flat
structures) and substitutes the question where did hierarchical recursion come from. At any rate, the two strike me as
different, the second not suffering from the conceptual hurdle besetting the
first.

N&M provides more detailed arguments against several current
proposals for a gradualist conception for the evolution of FL. Many of these
seem to take words as fossils of the earlier evolutionary stages. N&M
argues that words cannot be the missing link that gradualists have hoped for.
The discussion is squarely based on Distributed Morphology reasoning and
observations. I found the points N&M makes very much to the point. However,
given the technical requirements needed to follow the details, I fear that
tyros (i.e. the natural readership of Frontiers)
will remain unconvinced. This said, the points seem dead on target.

This brings us to the second aim of the paper, and here I
confess to having a hard time following the logic. The idea seems to be that Merge
when added to the E systems we find in bird song and the L system we find in
vervets gets us (sort of) the kinds of generative systems we find in G products of FL. This is a version of the classical Minimalist
answer to DP favored by Chomsky. I say “sort of” as Chomsky, at least lately,
has been making a big deal of the claim that the mapping to E systems is a late
accretion and the real action is in the mapping to thought. I am not sure that
N&M disagrees with this (the paper doesn’t really discuss this point) as I
am not sure how the L-system and Chomsky’s CI interface relate to one another.
The L-system seems closer to concepts than full-blown propositional
representations, but I could be wrong here. At any rate, this seems to be the N&M view.

Here’s my problem; in fact a few. First, this seems to
ignore the various observations that whatever our L-atoms are they seem
different in kind from what we find in animal communication systems. The fact
seems to be that vervet calls are far more “referential” than human “words”
are. Ours are pretty loosely tied to whatever humans may use words to refer to.
Chomsky has discussed these differences at length (see here for a recent critique of “referentialism”) and if he
is in any way correct it suggests that vervet calls are not a very good proxy for what our linguistic atoms do as the two
have very different properties. N&M might agree with this, distinguishing
roots from words and saying that our words have the Chomsky properties but our
concepts are vervetish. But how turning roots into words manages this remains,
so far as I can see, a mystery. Chomsky notes that the question of where the
properties of our lexical items come from is at present completely mysterious.
But the bottom line, as Chomsky sees it (and I agree with him here), is that “[t]he minimal meaning-bearing
elements of human languages – word-like, but not words -- are radically
different from anything known in animal communication systems.” And if
this is right, then it is not clear to me that Merge alone is sufficient to
explain what our language manages to do, at least on the lexical side. There is
something “special” about lexicalization that we really don’t yet understand
and it does not seem to be reducible to Merge and it does not seem to really
resemble the kinds of animal calls that N&M invokes. In sum, if Merge is
the secret sauce, then it did more than link to a pre-existing L-system of the
kind we find in vervet calls. It radically changed their basic character. How
Merge might have done this is a mystery (at least to me (and, I believe,
Chomsky)).

Again, N&M might agree, for the story it tells does not
rely exclusively on Merge to bridge the gap. The other ingredient involves
checking grammatical features. By
“grammatical” I mean that these features are not reducible to the features of
the E or L systems. Merge’s main grammatical contribution is to allow these
grammatical features to talk to one another (to allow valuation to apply). As
roots don’t have such features, merging roots would not deliver the kinds of
structures that our Gs do as roots do not have the wherewithal to deliver “combinatorial
systems.” So it seems that in addition
to Merge, we need grammatical features to deliver what we have.

The obvious question is where these syntactic features come
from. More pointedly, Merge for N&M
seems to be combinatorially idle absent these features. So Merge as such is
not sufficient to explain Gish generative procedures. Thus, the real secret
sauce is not Merge but these features and the valuation procedures that they
underwrite. If this is correct, the deep Evolang question concerns the genesis
of these features, not the operation instructing how to put grammatical objects
together given their feature structures. Or, put another way: once you have the
features how to put them together seems pretty straightforward: put them together
as the features instruct (think combinatorial grammar here or type theory).
Darwin’s Problem on this conception reduces to explaining how these syntactic features got a mental toehold.
Merge plays a secondary role, or so it seems to me.
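The point that the features do most of the combinatorial work can be illustrated with a toy categorial-grammar fragment (my own sketch, forward application only; the lexicon and categories are invented for illustration): each item's category fully instructs how it may combine, and items without categories give the operation nothing to consult.

```python
# Toy categorial-grammar sketch (mine, not from the post): once
# items carry combinatorial features (categories), "how to put
# them together" is read directly off those features.

def apply_forward(fn, arg):
    """Forward application: an item of category X/Y applied to an
    item of category Y yields an item of category X."""
    word_f, cat_f = fn
    word_a, cat_a = arg
    if "/" in cat_f and cat_f.split("/", 1)[1] == cat_a:
        return (f"[{word_f} {word_a}]", cat_f.split("/", 1)[0])
    raise ValueError(f"{cat_f} gives no instruction for combining with {cat_a}")

# A VSO-style toy lexicon, so forward application alone suffices:
barks = ("barks", "S/NP")
the = ("the", "NP/N")
dog = ("dog", "N")

np = apply_forward(the, dog)     # yields a noun phrase of category NP
s = apply_forward(barks, np)     # yields a sentence of category S
print(s)

# A featureless "root" (no category) leaves apply_forward nothing
# to consult -- the sense in which bare combination, without
# features, is combinatorially idle.
```

This is of course a cartoon of combinatorial grammar/type theory, but it shows why, given the features, the "putting together" looks straightforward.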

To be honest, the above problem is a problem for every
Minimalist story addressing DP. The Gs we are playing with in most contemporary
work have two separate interacting components: (i) Merge serves to build
hierarchy, (ii) AGREE in Probe-Goal configurations check/value features. AGREE
operations, to my knowledge, are not generally reducible to Merge (in
particular I-merge). Indeed trying to unify them, as in Chomsky’s early
minimalist musings, has (IMO, sadly) fallen out of fashion.[3]
But if they are not unified and most/many non-local dependencies are the
province of AGREE rather than I-merge, then Merge alone is not sufficient to explain the emergence of Gs with the characteristic
dependencies ours embody. We also need a story about the etiology of the long
distance AGREE operation and a story about the genesis of the syntactic
features they truck in.[4]
To date, I know of no story addressing this, not even very speculative ones. We
could really use some good ideas here (or, as in note 3, begin to rethink the
centrality of Probe/Goal Agree).

I don’t want to come off sounding overly negative. N&M,
unlike many evolangers, know a lot about FL. Their critique of gradualist
stories seems to be very well aimed. However, precisely because the authors
know so much about FL while trying to give a responsible positive outline of an
answer to DP, the paper makes clear the outstanding problems that
providing an adequate explanation sketch
faces. For this alone, N&M is worth reading.

So what’s the takeaway message here? I think we know what a
solution to DP in the domain of language should involve. It should provide an
account of how the generative procedures responsible for the G properties we
have discovered over the last 60 years arose in the species. The standard
Minimalist answer has been to focus on Merge and argue that adding it to the
capacities of our non-linguistic ancestors suffices to give them our kinds of
grammatical powers. Now, there is no doubting that Merge does work wonders.
However, if current theoretical thinking is on the right track, then Merge
alone is insufficient to account for the various non-local dependencies that we
find in Gs. Thus, Merge alone does not deliver what we need to fully explain
the origins of our FL (i.e. it leaves out a large variety of agreement
phenomena).[5]
In this sense, either we need some ideas
about where AGREE comes from, or we need some work showing how to accommodate
the phenomena that AGREE does via I-merge. Either way, the story that ties the
evolutionary origins of our FL to the
emergence of a single novel Merge operation is, at best, incomplete.

[1]
Here from Edward St Aubyn in At Last: the
final Patrick Melrose Novel:

“Ok, so who created
infinite regress.” That’s the right question.

[2]
No less a figure than Wittgenstein had a field day with this observation. “And so on” is not a concept that finite
sequences of anything embody.

[3]
I may be one of the last thinking that moving to AGREE systems was a bad idea
if one’s interest is in DP. I argue this here.
I don't think I’ve convinced many of the virtues either of disagreement in
general or dis-AGREE-ment in particular. So it goes.

[4]
It is tempting to see Chomsky’s latest discussions of labeling as an attempt to
resolve this problem. Agreement on this view is what is required to get
interpretable objects at the CI interface. It is not the product of AGREE but
of the labeling algorithm. Chomsky does not say this. But this is where he
might be heading. It is an attempt to reduce “morphology” to Bare Output
Conditions. I personally am not convinced by his detailed arguments, but if
this is the intent, I am very sympathetic to the project.

[5]
I am currently co-teaching intro to contemporary minimalism with Omer
Preminger. He has inundated me with arguments (good ones) that something like
AGREE does excellent work in accounting for huge swaths of intricate data.
Thus, at the very least, it seems that the current consensus among minimalist
syntacticians is that Merge is not
the only basic syntactic operation and so an account that ties all of our
grammatical prowess to Merge is either insufficient or the current consensus is
wrong. If I were a betting person, I would put my money on the first disjunct.
But…

Thursday, April 23, 2015

For various reasons, Mark J could not post this as a comment on this post. As he knows much more than I do about these matters, I thought it a public service to lift these remarks from the comments section to make them more visible. I think that this is worth reading in conjunction with Charles' recent post (here). At any rate, this is interesting stuff and I don't disagree much (or have not found reasons to disagree much) with what Mark J says below. I will, of course, allow myself some comments later. Thx Mark.

***

This was originally written as
a comment for the "Bayes and Gigerenzer" post, but a combination of a
length restriction on comments and my university's not enabling blog posts from
our accounts meant I had to email Norbert directly.

As Norbert has remarked,
Bayesian approaches are often conflated with strong empiricist approaches, and
I think this post does that too. But even within a Bayesian approach,
there are powerful reasons not to be a "tabula rasa"
empiricist. The "bias-variance dilemma" is a mathematical
statement of something I've seen Norbert say in this blog: learning only works
when the hypothesis space is constrained. In mathematical terms, you can
characterise a learning problem in terms of its bias -- the range of hypotheses
being considered -- and the variance or uncertainty with which you can identify
the correct hypothesis. There's a mathematical theorem that says that in
general as the bias goes down (i.e., the class of hypotheses increases) the
variance increases.
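A minimal simulation of the trade-off (my own toy, with made-up numbers, not anything from Mark's comment): a learner whose hypothesis space is the whole interval [0, 1] versus one constrained a priori to a two-hypothesis space that happens to contain the truth.

```python
# Toy bias-variance simulation (my construction): on small data,
# the learner with the wider hypothesis space shows higher variance.
import random
import statistics

random.seed(0)
TRUE_P = 0.8        # the "correct hypothesis": a coin's true bias
N = 5               # each learner sees only five observations

def sample_data():
    return [random.random() < TRUE_P for _ in range(N)]

def wide_learner(data):
    """Unconstrained space [0, 1]: the maximum-likelihood estimate."""
    return sum(data) / len(data)

def narrow_learner(data):
    """Space constrained a priori to {0.2, 0.8}: pick the closer one."""
    mle = sum(data) / len(data)
    return min((0.2, 0.8), key=lambda h: abs(h - mle))

wide = [wide_learner(sample_data()) for _ in range(5000)]
narrow = [narrow_learner(sample_data()) for _ in range(5000)]

# Lower bias (a bigger space) buys higher variance, and vice versa;
# the constrained learner identifies the truth far more stably.
print("wide variance:  ", statistics.pvariance(wide))
print("narrow variance:", statistics.pvariance(narrow))
```

The constrained learner wins here only because the truth lies in its space; the empiricist's gamble is that an unconstrained space avoids that risk, and the theorem's point is what that gamble costs in variance.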

Given this, I think a very
reasonable approach is to formulate a model that includes as much relevant
information from universal grammar as we can put into it, and perform inference
that is as close to optimal as we can achieve from data that is as close as
possible to what the child receives. I think this ought to be every generative
linguist's baseline model of acquisition! Even with an incomplete model
and incomplete data, we can obtain results of the kind "innate knowledge X
plus data Y can yield knowledge Z".

But for some strange (I
suspect largely historical) reason, this is not how Chomskyian linguists think
of computational models of language acquisition. Instead, they prefer ad
hoc procedural models. Everyone agrees there has to be some kind of
algorithm which children use to learn language. I know there are lots of
psychologists who are sure they have a good idea of the kinds of things kids
can and can't do, but I suspect nobody really has the faintest idea of what
algorithms are "cognitively plausible". We have little idea of
how neural circuitry computes, especially over the kinds of hierarchical
representations we know are involved in language. Algorithms which can be
coded up as short computer programs (which is what most people have in mind
when they say simple) might turn out to be neurally complex, while we know that
the massive number of computational elements in the brain enable it to solve
computationally very complex problems. In vision -- a domain we can
sort-of study because we can stick electrodes into animals' brains -- it turns
out that the image processing algorithms implemented in brains are actually
very sophisticated and close to Bayes-optimal, backed up with an incredible
amount of processing power. Why not start with the default assumption
that the same is true of language?

It's true that in word
segmentation, a simple ad hoc procedure -- a simple greedy learning algorithm
that ignores all interword dependencies -- actually does a "reasonable
job" of segmentation, and that improving either the algorithm's search
procedure or making it track more complex dependencies actually decreases the
overall word segmentation accuracy. Here I think Ben Borschinger's
comment has it right: sometimes a suboptimal algorithm can correct for the
errors of a deficient model if the errors of each go in opposite directions.
We've known since at least Goldwater's work that an inaccurate
"unigram" model that ignores inter-word interactions will prefer to
find multi-word collocations and hence undersegment. On the other hand, a
naive greedy search procedure tends to over-segment, i.e., find word boundaries
where there are none. Because the unigram model under-segments, while a
naive greedy algorithm over-segments, the combination actually does better than
approaches where you improve only the search procedure or only the model
(by incorporating inter-word dependencies), since those would leave you with
"uncancelled errors".
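The arithmetic behind the undersegmentation preference can be sketched with toy numbers (mine, not Goldwater's actual model; in particular this ignores the prior over lexicons, which in the real model penalizes large vocabularies):

```python
# Toy sketch of why a unigram model undersegments: splitting a
# collocation costs a probability factor per extra token, and the
# model gets no credit for "dog" being predictable after "the".
import math

utterances = 50     # a tiny corpus: 50 repetitions of "thedog"

# (a) Correct segmentation: two word types, each with unigram
#     probability 1/2, so each utterance costs 0.5 * 0.5.
log_p_segmented = utterances * (math.log(0.5) + math.log(0.5))

# (b) Undersegmentation: one type "thedog" with probability 1,
#     so each utterance costs a single factor of 1.
log_p_collocation = utterances * math.log(1.0)

# The unigram model assigns the undersegmented lexicon the higher
# likelihood, hence the preference for multi-word collocations.
print(log_p_collocation, log_p_segmented)
assert log_p_collocation > log_p_segmented
```

Tracking the inter-word dependency would let the segmented lexicon recover that lost probability mass (since "dog" after "the" would be near-certain), which is why richer models undersegment less.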

Of course it's logically
possible that children use ad hoc learning procedures that rely on this kind of
"happy coincidence", but I think it's unlikely for several reasons.

First, these procedures are ad
hoc -- there is no theory, no principled reason why they should work.
Their main claim to fame is that they are simple, but there are lots of other
"simple" procedures that don't actually solve the problem at hand
(here, learn words). We know that they work (sort of) because we've tried
them and checked that they do. But a child has no way of knowing that
this simple procedure works while this other one doesn't, so the procedure
would need to be innately associated with the word learning task. This
raises Darwin's problem for the ad hoc algorithm (as well as other related
problems: if the learning procedure is really innately specified, then we ought
to see dissociation disorders in acquisition, where the child's knowledge of
language is fine, but their word learning algorithm is damaged somehow).

Second, ad hoc procedures like
these only partially solve the problem, and there's usually no clear way to
extend them to solve the problem fully, so some other learning mechanism will
be required anyway. For example, the unigram+greedy approach can find
around 3/4 of tokens and 1/2 of the types, and there's no obvious way to extend
it so it learns all the tokens and all the types. But children do eventually
learn all the tokens and all the types, and we'll need another procedure for
doing this. Note that the Bayesian approach that relies on more complex
models does have an account here, even though it currently involves
"wishful thinking": as the models become more accurate by including
more linguistic phenomena and the search procedures become more accurate, the
word segmentation accuracy continues to improve. We don't know how to
build models that include even a fraction of the linguistic knowledge of a 3 year
old, but the hope is that eventually these models would achieve perfect word
segmentation, and indeed, be capable of learning all of a language. In
other words, there isn't a plausible path by which the ad hoc approach would
generalise to learning all of a language, while there is a plausible path for the
Bayesian approach that relies on more and more accurate linguistic models.

Finally -- and I find it
strange to be saying this to a linguist who is otherwise providing very cogent
arguments for linguistic structure -- there really are linguistic structures
and linguistic dependencies, and it seems weird to assume that children use a
learning procedure that just plain ignores them. Maybe there is a stage
where children think language consists of isolated words (this is basically
what a unigram model assumes), and the child only hypothesises larger
linguistic structures after some "maturation" period. But our
work shows that you don't need to assume this; instead, a single model that
does incorporate these dependencies combined with a more effective search
procedure actually learns words from scratch more accurately than the ad hoc
procedures.

Norbert sometimes seems very
sure he knows what aspects of language have to be innate. I'm much less
sure myself of just what has to be innate and what can be learned, but I
suspect a lot has to be innate (I think modern linguistic theory is as good a
bet as any). I think an exciting thing about Bayesian models is that they
give us a tool for investigating the relationship between innate knowledge and
learnability. For example, if we can show that a model with innate
knowledge X+X' can learn Z from data Y, but a model with only innate knowledge
X fails to learn Z, then probably innate knowledge X' plays a role in learning
Z. I said probably because someone could claim that the child's data
isn't just Y but also includes Y' and from model X and data Y+Y' it's possible
to infer Z. Or someone might show that a completely different set
of innate knowledge X'' suffices to learn Z from Y. So of course a
Bayesian approach won't definitely answer all the questions about language
acquisition, but it should provide another set of useful constraints on the
process.
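The comparison methodology can be made concrete with a minimal sketch (entirely my toy setup; the hypotheses and likelihood numbers are invented): hold the data Y fixed, vary the innate hypothesis space, and see how firmly the learner homes in on Z.

```python
# Minimal sketch (toy numbers, mine) of "innate knowledge X plus
# data Y can yield knowledge Z": the same Bayesian learner, run
# with and without an extra innate constraint X'.
from math import prod

def likelihood(h, obs):
    # Invented numbers: hypothesis "Z" fits the observed pattern
    # well, "alt" poorly, and "anything-goes" only at chance.
    return {"Z": 0.9, "alt": 0.2, "anything-goes": 0.5}[h]

def posterior(space, data):
    """Uniform prior over the innate hypothesis space."""
    scores = {h: prod(likelihood(h, d) for d in data) for h in space}
    total = sum(scores.values())
    return {h: s / total for h, s in scores.items()}

data_Y = ["pattern"] * 3    # three observations of the target pattern

with_Xprime = posterior({"Z", "alt"}, data_Y)                      # X + X'
without_Xprime = posterior({"Z", "alt", "anything-goes"}, data_Y)  # X only

# The constraint X' (ruling out "anything-goes") lets the learner
# commit to Z more strongly from the same data Y.
print(with_Xprime["Z"], without_Xprime["Z"])
```

And, per the caveat in the comment, the inference is defeasible: enriching the data to Y+Y', or swapping in different innate knowledge X'', could close the gap by other means.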