Comments

Tuesday, April 19, 2016

One of the features of Charles’ paper (CY) that I did
not comment on before and that I would like to bring to your attention here is
the relevance (or, more accurately, lack thereof) of indirect negative evidence
(INE) for real time acquisition. CY’s claim is that it is largely toothless and
unlikely to play much of a role in explaining how kids acquire their Gs. A few comments.

CY is not the first place I have encountered this
observation. I recall that my “good and great friend” Elan Dresher said as much
when he was working on learnability of stress with Jonathan Kaye. He noted (p.c.)
that very few Gs lined up in the sub/superset configuration relevant
for an application of the principle. Thus, though it is logically possible that
INE could provide info to the LAD for zeroing in on the right G, in fact it was all but useless given the
nature of the parameter space and the Gs that such a space supports. So, nice
try INE, but no cigar.[1]

CY makes this point elaborately. It notes several problems
with INE as, for example, embodied in Bayes models (see pp. 14-15).

First, generating the sets
necessary to make the INE comparison is computationally expensive. CY cites
work by Osherson et al (1986) noting that generating such sets may not even be
computable and by Fodor and Sakas that crunches the numbers in cases with a
finite set of G alternatives and finds that here too computing the extensions
of the relevant Gs in order to apply the INE is computationally costly.

Nor should this be surprising. If even updating several Gs
wrt data quickly gets out of computational control, then it is hardly
surprising that using Gs to generate sets
of outputs and then comparing them wrt containment is computationally
demanding. In sum, surprise, surprise, INE runs into the same kind of
tractability issues that Bayes is already rife with.[2]
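To make the cost concrete, here is a toy sketch (mine, not CY’s): two grammars given only as recognizers, whose extensions must be enumerated before the subset comparison can even be stated. The grammars, alphabet, and length bound are invented for illustration.

```python
from itertools import product

# Two toy "grammars" over the alphabet {a, b}, given only as recognizers.
# G1 accepts every string; G2 accepts strings with no 'bb' substring,
# so the extension of G2 is a proper subset of that of G1.
g1 = lambda s: True
g2 = lambda s: "bb" not in s

def extension(g, n, alphabet="ab"):
    """Brute-force the extension of grammar g up to string length n."""
    return {"".join(p) for k in range(n + 1)
            for p in product(alphabet, repeat=k) if g("".join(p))}

# Applying INE means computing both extensions and checking containment;
# the candidate space alone grows as 2^(n+1) - 1 strings.
for n in (2, 5, 8, 11):
    e1, e2 = extension(g1, n), extension(g2, n)
    print(n, len(e1), len(e2), e2 <= e1)
```

Even for this two-symbol toy the enumeration is exponential in string length; for realistic grammars with unbounded extensions the comparison is not effectively computable at all, which is the Osherson et al point.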

Second, and maybe more interesting still, CY diagnoses why
it is that INE is not useful in real world contexts. Here is CY (note:
‘super-hypothesis’ is what some call the supersets):

The fundamental problem can be
stated simply: the super-hypothesis cannot be effectively ruled out due to the
statistical properties of child directed English. (16)

What exactly is the source of the problem? Zipf’s law.

The failure of indirect negative
evidence can be attributed to the inherent statistical distribution of
language. Under Zipf’s law, which applies to linguistic units (e.g. words) as
well as their combinations (e.g. N-grams, phrases, rules; see Yang (2013)), it
is very difficult to differentiate low probability events and impossible
events.

And this makes it inadvisable to use the absence of a particular
form as evidence of its non-generability. In other words, Zipf’s law cuts the ground
from under INE.
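A minimal simulation (my toy numbers, not CY’s) shows the problem: under a Zipfian distribution, a large fraction of perfectly licit forms never show up in a plausibly sized sample, so their absence is indistinguishable from the absence of ungenerable forms.

```python
import random

random.seed(1)

# A Zipfian distribution over 10,000 hypothetical licit forms:
# p(rank r) is proportional to 1/r.
V = 10_000
weights = [1.0 / r for r in range(1, V + 1)]

# A toy "child-directed corpus" of 100,000 tokens drawn from it.
corpus = random.choices(range(1, V + 1), weights=weights, k=100_000)
attested = set(corpus)

# Many licit forms are simply never observed, so a learner using INE
# would wrongly treat them exactly like impossible forms.
unattested = V - len(attested)
print(f"{unattested} of {V} licit forms never occur in the sample")
```

The tail of the distribution is where the damage is done: absence there is the expected outcome for licit and illicit forms alike.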

Here CY (as it notes) is making a point quite similar to
that made over 25 years ago by Steve Pinker (here)
(14):

…it turns out to be far from clear
what indirect negative evidence could be. It can’t be true that the child
literally rules out any sentence he or she hasn’t heard, because there is
always an infinity of sentences that he or she hasn’t heard that are
grammatical …And it is trivially true that the child picks hypothesis grammars
that rule out some of the sentences
that he or she hasn’t heard, and that if a child hears a sentence he or she
will often entertain a different hypothesis grammar than if he or she hasn’t
heard it. So the question is, under exactly what circumstances does a child
conclude that a nonwitnessed sentence is ungrammatical?

What CY notes is that this is not only a conceptual
possibility given the infinite number of grammatical linguistic objects, but it
is statistically likely that, because of the Zipfian distribution of linguistic
forms in the PLD, the evidence relevant to inferring G absence from statistical
absence (or rarity) will be very spotty, and that building on such absence will
lead in very unfortunate directions. CY discusses a nice case of this wrt
adjectives, but the point is quite general. It seems like Zipf’s law makes
relying on gaps in the data to draw
conclusions about (il)licit grammatical structures a bad strategy.

This is a very nice point, which is why I have belabored it.
So, not only are the computations intractable but the evidence relevant for
using INE is inadequate for principled reasons. Conclusion: forget about
INE.

Why mention this? It is yet another problem with Bayes. Or,
more directly, it suggests that the premier theoretical virtue of Bayes (the
one that gets cited whenever I talk to a Bayesian) is empirically nugatory.
Bayes incorporates the subset principle (i.e. Bayesian reasoning can explain
why the subset principle makes sense). This might seem like a nice feature. And
it would be were INE actually an important feature of the LAD’s learning
strategy (i.e. a principle that guided learning). But, it seems that it is not.
It cannot be used, for both computational and statistical reasons. Thus, it is a strike against any theory of the
ideal learner that it incorporates the subset principle in a principled manner.
Why? Because the idealization points in the wrong direction. It suggests that
negative evidence is important to the LAD in getting to its G. But if this is
false, then a theory that incorporates it in a principled fashion is, at best,
misleading. And being misleading is a major strike against an idealization. So,
bad idealization! Again!
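For readers who want the “premier virtue” spelled out: it is the size principle. In a toy subset/superset setup (my numbers, not from CY), the Bayesian likelihood automatically favors the subset hypothesis once data consistent with both accumulate — elegant, but, per the argument above, empirically idle.

```python
# Size principle sketch: a subset hypothesis licensing 10 forms vs a
# superset licensing 100, each sampled from uniformly when true.
h_sub = set(range(10))
h_sup = set(range(100))
prior = {"sub": 0.5, "sup": 0.5}

def posterior_sub(data):
    """Posterior probability of the subset hypothesis given the data."""
    like_sub = (1 / len(h_sub)) ** len(data) if all(d in h_sub for d in data) else 0.0
    like_sup = (1 / len(h_sup)) ** len(data) if all(d in h_sup for d in data) else 0.0
    z = prior["sub"] * like_sub + prior["sup"] * like_sup
    return prior["sub"] * like_sub / z

# Each subset-consistent observation multiplies the odds by 10,
# so the subset hypothesis wins without any negative evidence.
for n in (1, 3, 10):
    print(n, round(posterior_sub([0] * n), 4))
```

This is the sense in which Bayes “incorporates” the subset principle; the argument above is that nothing in actual acquisition turns on it.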

And it’s worse still because there is an alternative. Here’s CY (18):

The alternative strategy is a
positive one, as it exploits the distributional similarities … Under this approach, the over-hypothesis is never
available to the learner, and there is no need to rule it out.

So, frame the problem well (i.e. adopt the right
idealization) and you point yourself in the right direction (i.e. by avoiding
dealing with problems that the wrong idealization generates).

As CY notes, none of these arguments are “decisive.”
Arguments against idealizations never are (though the ones CY presents and that
I have rehearsed wrt Bayes in the last several posts seem to me pretty close
to dispositive). But, they are important. Like all matters scientific,
idealizations need to be defended. One way to defend them is to note that they
point to the right kinds of problems and suggest the kinds of solutions we
ought to explore. If an idealization consistently points in the wrong
direction, then it’s time to chuck it. It’s worse than false, it is
counter-productive. In the domain of language, whatever the uses of the
technology Bayes makes available, it looks like it is misleading in every possible
way. The best that we seem to be able to say for it is that if we don’t take any
of its claims seriously then it won’t cause too much trouble. Wow, what an
endorsement. Time to let the thing go and declare the “revolution” over. Let’s
say this loudly and all together: Bye bye Bayes!

[1]
It is worth noting that the Dresher-Kaye system was pretty small, about 10
parameters. Even in this small system, the subset principle proved to be idle.

[2]
In fact, it might be worse in this case. The Bayes maneuver generally
circumvents the tractability issue by looking for algorithms that can serve to
“update” the hypotheses without actually directly updating them. For INE we
will need cheap algorithms to generate the required sets and then compare them.
Do such quick and dirty algorithms exist for generation and comparison of the
extensions of hypotheses?

Monday, April 18, 2016

Here is a recent interview with Chomsky (thx to Elika for the link) where he talks about things like Big Data in linguistics, Experimental Syntax, islands, superiority and other things. The interview (Jeff Runner doing the ling questioning) is short but interesting.

He makes at least three important points.

First, that there is a difference between data collection and scientific experimentation. The idea, implicit in most of the big data PR, is that one can collect data quite a-theoretically and expect to gain scientific insight. As Chomsky notes, this runs against the accumulated wisdom of the last 200 years of scientific research. As Chomsky compactly put it:

...theory-driven experimental investigation has been the nature of the sciences for the last 500 years.

Quite right. Experiments are not just looking. They are looking with an attitude, and the ’tude is a function of theory.

Second, much of what linguistics studies has NO relevant data in any conceivable corpus. He cites the ECP, but this is just the tip of a very large iceberg. If there is no relevant data, then big data collection is beside the point:

In linguistics we all know that the kind of phenomena that we inquire about are often exotic. They are phenomena that almost never occur. In fact, those are the most interesting phenomena, because they lead you directly to fundamental principles. You could look at data forever, and you’d never figure out the laws, the rules, that are structure dependent. Let alone figure out why. And somehow that’s missed by the Silicon Valley approach of just studying masses of data and hoping something will come out. It doesn’t work in the sciences, and it doesn’t work here.

Let me underline one point Chomsky makes: it's the manufactured experimental data that is important to gaining insight. As in the other sciences, linguists create data not found in the wild and use this factitious data to understand what is happening. Real life data is often (IMO, generally) useless because it is too complex. The aim of good data is to reduce irrelevant interference effects that arise from the interaction of many component causes. Real life data is just that: too complex. In linguistics, of particular importance is negative data; data that some structure is unacceptable or cannot have a specific meaning. This is not the kind of data that Big Data can get because it is data that is missing from everyday usage of language. And yes, PoS arguments are built from this kind of data and that is why they are so useful.

Third, I am still not sure what Chomsky's take on island effects is. One of the interesting debates in the Sprouse and Hornstein volume revolved around whether these were reducible to simple complexity effects. My read on this is that Sprouse and Wagers and Phillips got the better of the discussion and that reducing islands to complexity just wasn't going to fly. I'd be interested to know what others think.

At any rate, take a quick look, as it is short and interesting.

Chomsky's recent Sophia Lectures are another excellent source of Chomsky syntax speculation. The lectures (plus an excellent interview by Naomi Fukui and Mihoko Zushi) are contained in volume 64 of Sophia Linguistica. I have no online link, unfortunately. But I recommend getting hold of the volume and reading it. Interesting stuff.

Friday, April 15, 2016

Here are three pieces (and one youtube clip) that you might
find interesting and provocative. In the last one, Chomsky discusses Marr.

The first is a piece on teaching. It
responds to a piece by Brian Leiter on teaching philosophy in mixed gender
environments and whether or not males create environments which make it harder
for females to participate and learn. Leiter and the blogger Harry Brighouse
(HB) are philosophers so their concern is with philo pedagogy. But I believe
that ling classes and philo classes have very similar dynamics (less lecture
and more discussion, “discovery,” give and take, argument) and so the
observations HB makes on Leiter’s original post (link included in above piece)
seem relevant to what we do. Take a
look and let me know what you think.

FWIW, I personally found some of the suggestions useful, and
not only as applied to women. In my experience some very smart people can be
quite reluctant to participate in class discussion. This is unfortunate for I
know for a fact that the class (and me too) would benefit from their
participation (as, I suspect, would they). IMO, learning takes place in classes
less because information is imparted and more because a certain style of
exploration of ideas is promoted. If lucky, the process is fun and develops a
dynamic of its own, which leads to new ideas which leads to more discussion
which promotes more amusement which… A really good class shows how to ride this
kind of enthusiasm and think more clearly and originally. The problem that
Leiter and HB identify would impede this. So, is it a problem for linguistics?
My guess is absolutely. If so, what to do? Comments welcome.

The second paper (here)
is on a new IARPA funded project to get machines to “think” more like brains. I
don’t really care about the technology concerns (though I don’t think that they
are uninteresting or trivial either, though the ends to which they will be put
are no doubt sinister), but it is interesting to hear how leaders in cog-neuro
see the problem. The aim is to get machines to think like brains and so what do
they fund? Projects aimed at complete wiring diagrams of the brain. So, for
example, Christof Koch and his team at the very well endowed Allen Institute
are going to do a “complete wiring diagram of a small cube of brain – a million
cubic microns, totaling one five-hundredth of cortex.” The idea is that once
we have complete wiring diagrams we will know how brains do what they do.
Here’s Andreas Tolias being quoted: without knowing all of the component
parts, he said, “maybe we’re missing the beauty of the structure.” Maybe. Then
again, maybe not. Who knows? Well, I think I do and that’s because of observations
that Koch has made in the past.

It is simply false that we do not have complete wiring
diagrams. We do. We have the complete wiring diagram and genome of the nematode
c-elegans. Despite this we know very little about what the little bugger does
(actually we do know a lot about how it defecates, David Poeppel informed me
recently). So, having the complete diagram and genome has not helped crack the
critter’s cognitive issues. Once you see this, you understand that the whole
project discussed here is based on the assumption that the relation of human
cognition/behavior to brain diagrams is simpler than that of the
behavior/cognition of a very simple worm to its wiring diagram and genome. A
bold conjecture, you might say. Yup, very bold. Foolhardy anyone? But see
below.

It is hard to avoid the suspicion that this is another case
of research following the money. Koch knows that there is little reason to think
that this will work. But big deal, there’s money there so work it will. And if
it fails, then it means we have not gotten to the right level of wiring detail.
We need yet more fine grained maps, or maps of other things, or maps between
maps of other things and the connectome or.... There really is no end to this
and so it is the perfect project.

The little piece is also worth reading for it reports many
of the assumptions that our leaders in neuroscience make about brains. Here’s
one I liked: Some brain types really believe the neural networks of the 1980s
vintage “mimic the basic structure of brains.” So now we know why neural nets
were so popular: they looked “brainy”! I used to secretly think that this kind
of belief was too silly to attribute to anyone. But, nope, it seems that some
really take arguments from pictorial resemblance to be dispositive.

We also know that they have no idea what “feedback loops”
are doing, especially from higher order to lower order layers. Despite the
mystery surrounding what top down loops do, the assumption still seems to be
that, largely, “information flows from input to output through a series of
layers, each layer is trained to recognize certain features…with each
successive layer performing complex computations on the data.” In other words,
the standard learning model is a “discovery procedure,” and the standard view
of the learning involved is standard Empiricism/Associationism, the only tweak
being that maybe we can do inductions over inductions over inductions as well
as inductions over just the initial input. This is the old discredited idea central to American Structuralist Linguistics. Early
GG showed that this could not be true and that the relations between levels are
much more complex than this picture envisaged. However, the idea that levels
might be autonomous is not even on the neuroscience agenda, or so it appears.

In truth, none of this should be surprising. If the report
in Quanta accurately relays the
standard wisdom, neuroscience is completely unencumbered by any serious
theories of cognition. The idea seems to be that we will reverse engineer
cognition from wiring diagrams. This is nuts. Imagine reverse engineering the
details of a word processing program from a PC’s wiring diagram. It would be a
monumental task, though a piece of cake compared to the envisioned project of
reverse engineering brains from connectomes.

At any rate, read the piece (and weep).

As a relevant addendum to the above piece take a look at the
following. Ellen Lau sent me a link to a
debate about the utility of studying the connectome moderated by David
Poeppel at the last CNS meeting in NYC. It is quite amusing. The protagonists
are Moritz Helmstaeder (MH) and Tony Movshon (TM). The former holds the pro
connectome position (don’t let his first remarks fool you, they are intended to
be funny), while the latter embraces a more skeptical Marr like view.

Here’s one remarkable bit: MH presents an original argument
regarding the recognized failure of c-elegans connectomics to get much function
out of structure. He claims that simple systems are more complex than more
complex ones. As TM notes, this is more guess than argument (and there is no
argument given). I am pretty sure that were the c-elegans case “successful”
this would be generally advertised. David P questions him on this with, IMO,
little satisfactory reply. Let’s just say that the position he holds is, ahem,
possible but a stretch.

The one thing about the debate that I found interesting is
that MH seems to be defending a position that nobody could object to while TM
is addressing a question that is very hard to be dispositive about. MH is
arguing that connectomics has been and can be useful. TM is arguing that there
are other better ways to proceed right now. Or, more accurately, that the Marr
three-pronged attack is the way to go and that we will not get cognition from
wiring diagrams, no matter how carefully drawn they are.

IMO, TM has the better of this discussion because he notes
that the cases that MH points to as success stories for connectomics are areas
where we have had excellent functional stories (Barlow’s results are the basis of
MH’s results) for a while. And in this context, looking at the physiology is
likely to be very useful and likely successful. To put this crudely, TM (who
cited Marr) seems to appreciate that questions of CN interest can be pursued at
different levels, which are somewhat independent. And of course, we want them
to be related to each other. MH seems to think in a more reductive manner, that
level 3 is basic and that we will be able to deduce/infer level 2 and level 1
stories once we understand the connectomic details. Thus, we can get cognition
from wiring diagrams (hence the relevance of the failure of c-elegans).

You know where I stand on this. But the discussion is
interesting and worth the 90 minutes. There is a lot of mother and apple pie
here (as questioners point out). Nobody argues (reasonably enough) against
doing any connectomics work. The argument should be (but often isn’t) about
research strategy: about whether connectomics can bypass the C part of CNS. As
David P puts it: can one reverse engineer the other two levels given level 3 (see
discussion from about 1:15 ff)? Connectomics (MH) leans towards a ‘yes’; the
critic (TM) thinks ‘no.’ Given the money at stake, this is no small debate. Those
who want to see the relevance of Marrian methodological reasoning, need look no
further than here.

The last piece is something that I probably already posted
once before but might be of interest to those following the Marr discussion in
recent posts. It’s Chomsky talking about AI and its prospects (here).
It’s a fun interview and a good antidote to the second piece I linked to. It
also has the longest extended discussion of Marr as it relates to linguistics
that I know of.

Chomsky makes two points. First, the point that David Adger
made that there is “no real algorithmic level” when it comes to GG because “it’s
just a system of knowledge” and “there is no process”: a system of knowledge is not
reducible to how it gets used. (24)

He also makes a second point. Chomsky allows that “[m]aybe information
about how it’s used [can] tell you something about the mechanisms.” So
ontologically speaking, Gs are not things that do anything, but it might be
possible for us (Chomsky notes that
some higher (Martian?) intelligence might not require this) to learn something
about the knowledge by inspecting how it is used: “Maybe looking at process by
which it’s used gives you helpful information” about the structure of the
knowledge. (26)

The upshot: there is an ontological/conceptual difference
between the knowledge structures that GG describes and how this knowledge is
put to use algorithmically but looking at how the system of knowledge is used
may be helpful in figuring out what the structure of that knowledge is.

I agree with the ontological point, but I think that Marr
might too. Level 2 theories, as I read him, are not less abstract descriptions
of level 1 theories. Rather, level 1 theories specify computational problems that
level 2 theories must solve if they are to explain how humans see or speak or
hear or…. In other words, level 2 theories must solve level 1 problems to do
what they do. So, for example, in the domain of language, to (at least in part)
explain linguistic creativity (humans can produce and understand sentences
never before encountered) we must show how the information Gs describe (i.e. rules
relating sound with meaning) is extracted by parsers in real time. So, the Marr
schema does not deny the knowledge/use distinction that Chomsky emphasizes
here, and that is a good thing as the two are not the same thing.

However, putting things in this way misidentifies the value
of the Marr schema. It is less a metaphysical doctrine than a methodological
manual. It notes that it is very useful in vision to parse a problem into three
parts and ask how they talk to one another. Why is it helpful? Because it seems
that the parts do often enough talk to one another. In other words, asking how
the knowledge is put to use can be very helpful in figuring out what the
structure of that knowledge is. I think that this is especially true in
linguistics where there is really nothing like physical optics or arithmetic to
ground level 1 speculations. Rather we discover the content of level 1 theories
by inspecting a particular kind of use (i.e. judgments in reflective
equilibrium). It seems very reasonable (at least to me) to think that insight
we get into the structures using this kind of data will carry over to our study
of processing and real time acquisition. Thus, the structures that the
processor or LAD is looking for are very close to those that our best theories of
linguistic knowledge say that they are. Another way of saying this is that we
assume that there is a high level of transparency between what we know and
those things we parse. There may even be a pretty close relation between
derivations that represent knowledge and variables that measure occurrent
psychological processes (think the DTC). This need not be the case, for Chomsky and Adger are right that there is
an ontological distinction between knowledge and how knowledge is put to use,
but it might be the case. Moreover, if
it is, that would offer a terrifically useful probe into the structure of
linguistic knowledge. And this is precisely what a methodological reading of
Marr’s schema suggests, which is why I would like to emphasize that reading.

Let me add one more point since I am beating a hobbyhorse
that I have lately ridden silly: not only is this a possibility, but we have
seen recent efforts that suggest its fecundity. Transparency plays an important
conceptual role in Pietroski et al’s argument for their proposed semantic
structure of most and it also plays
an important role in Yang’s understanding of the Elsewhere Principle. I found
these arguments very compelling. They use a strong version of transparency to
motivate the conclusions. This provides a reason for valuing transparency as a
regulative ideal. And this is what, IMO, a Marr schema encourages.

Ok, I’ll stable the pony now with the following closing
comments: Chomsky and Adger are right about the ontology. However, there is an
interesting reading of Marr where the methodology urged is very relevant to
linguistic practice. And Marr is very worthwhile under that reading for it urges
a practice where competence and performance issues are more tightly aligned, to
the benefit of each.

Oh yes: there is lots more interesting stuff in the Chomsky
interview. He takes shots at big data, the distinction between engineering and
science, and the difference between reduction and unification. You’ve no doubt
seen/heard/read him make these points before, but the interview is compact and
easy to read.

Thursday, April 14, 2016

So, what makes an inductive theory Bayesian? I have no idea. Nor, it appears, does anyone else. This is too bad. Why? Because though it is always the case that particular models must be evaluated on their own merits (as Charles rightly notes in the previous post), the interest in particular models, IMO, stems from the light they shine on the class of models of which they are a particular instance. In other words, specific models are interesting both for their empirical coverage AND (IMO, more importantly) for the insight they provide into the theoretical commitments a model embodies (hence one model from the class of models).

My discussion of Bayes rested on the assumption that Bayes commits one to some interesting theoretical claims and that the specific models offered are in service of advancing more general claims that Bayes embodies. From where I sit, it seems to me that for many there are no theoretical claims that Bayes embodies, so the supposition that a Bayes model intends to tell us something beyond what the specific model is a model of is off base. Ok. I can live with that. It just means that the whole Bayes thing is not that interesting, except technologically. What's of potential interest are the individual proposals, but they don't have theoretical legs as they are not in service of larger claims.

I should add, however, that many "experts" are not quite so catholic. Here is a quote from Gelman and Shalizi's paper on Bayes.

The common core of various conceptions of
induction is some form of inference from particulars to the general – in the
statistical context, presumably, inference from the observations y to
parameters describing the data-generating process. But if that were all
that was meant, then not only is ‘frequentist statistics a theory of inductive
inference’ (Mayo & Cox, 2006), but the whole range of guess-and-test
behaviors engaged in by animals (Holland, Holyoak, Nisbett, & Thagard,
1986), including those formalized in the hypothetico-deductive method, are also
inductive. Even the unpromising-sounding procedure, ‘pick a model at random and
keep it until its accumulated error gets too big, then pick another model
completely at random’, would qualify (and could work surprisingly well under
some circumstances – cf. Ashby, 1960; Foster & Young, 2003). So would
utterly irrational procedures (‘pick a new random when the sum of the least
significant digits in y is 13’). Clearly something more is required, or
at least implied, by those claiming that Bayesian updating is inductive. (25-26)

Note the theories that they count as "inductive" under the general heading but find to be unlikely candidates for the Bayes moniker. See what they consider not Bayes inductive rules? Here are two, in case you missed them: "the whole range of guess-and-test behaviors" and even the "pick a model at random and keep it until its accumulated error gets too big, then pick another model completely at random." G&S take it that if even these methods are instances of Bayesian updating, then there is nothing interesting to discuss, for it denudes Bayes of any interesting content.
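For concreteness, the second “unpromising-sounding” procedure G&S describe can be written down in a few lines (a toy illustration of mine; the hypotheses, data stream, and error threshold are all invented):

```python
import random

random.seed(0)

# "Pick a model at random and keep it until its accumulated error gets
# too big, then pick another model completely at random."
hypotheses = [
    lambda x: x % 2 == 0,   # the target regularity ("even")
    lambda x: x % 3 == 0,
    lambda x: x > 50,
]

def target(x):
    return x % 2 == 0

def guess_and_test(stream, threshold=3):
    h = random.choice(hypotheses)
    errors = 0
    for x in stream:
        if h(x) != target(x):
            errors += 1
        if errors >= threshold:  # too much error: resample blindly
            h, errors = random.choice(hypotheses), 0
    return h

learned = guess_and_test([random.randrange(100) for _ in range(2000)])
print(learned(4), learned(7))  # probe the result on an even and an odd input
```

Wrong hypotheses rack up errors quickly and get discarded, while the target survives indefinitely, so the procedure tends to converge despite involving no posterior updating at all — which is G&S’s point: calling this “inductive” in the Bayesian sense empties the label.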

Of course, you will have noticed that these two procedures are in fact the ones that people (e.g. Charles, Trueswell and Gleitman and Co) have been arguing in fact characterize acquisition in various linguistic domains of interest. Thus, they reasonably enough (at least if they understand things the way Gelman and Shalizi do) conclude that these methods are not interestingly Bayesian (or for that matter "inductive," except in a degenerate sense).

So, there is a choice: treat "Bayes" as an honorific, in which case there is no interesting content to being Bayesian beyond "hooray!!", or treat it as having content, in which case it seems opposed to systems like "guess-and-test" or "pick at random." Which one picks is irrelevant to me. It would be nice to know, however, which is intended when someone offers up a Bayesian model. In the first case 'Bayesian' just means "one that I think is correct." In the second, it has slightly more content. But what that is? Beats me.

One last thing. It is possible to understand the Aspects model of G acquisition as Bayesian (I have this from an excellent (let's say, impeccable) source). Chomsky took the computational intractability of that model (its infeasibility) to imply that we need to abandon the Aspects model in favor of a P&P view of acquisition (though whether this is tractable is an open question as well). In other words, Chomsky took seriously the mechanics of the Aspects model and thought that its intractability indicated that it was fatally flawed. Good for him. He opted for being wrong over being vacuous. May this be a lesson for us all.

Wednesday, April 13, 2016

Norbert has brought out the main themes of my paper much more clearly than I could have (many thanks for that). This entry is something of a postscript triggered by the comments over the past few days.

The comments remind me of the early days in the Past Tense debate. What does it mean to be a connectionist model? Can't it pass the Wug test if we just get rid of those awful Wickelfeatures? If not backprop, maybe a recurrent net? Most commentators tread a similar terrain: What’s the distinction between a normative Bayesian model and a cognitive one? How essential is the claim of optimality? Is a model that uses non-Bayesian approximations Bayesian in name only? If not MAP, then how about a full posterior interpretation … [1]

These questions can never be fully resolved because they are questions about frameworks. As Norbert notes, frameworks can only be evaluated by the questions they raise and the answers they provide, not by whether they can or cannot do X, because one can always patch things up. (Of course this holds for the Minimalist Framework as well.) A virtue of the Past Tense debate was that it grounded a largely conceptual/philosophical discussion in a well-defined empirical domain, and we have it to thank for a refined understanding of morphology and language acquisition. That represents progress, even if no minds were changed. So let’s focus on some concrete empirical cases, be it probability matching by rodents or Polish genitives by kids. Framework-level questions go nowhere, especially when the highest priests of Bayesianism disagree.

As I said in the paper, none of my criticisms is necessarily decisive but taken together, I hope they make it worthwhile to pursue alternatives [2]: alternatives that linguists have always been good at (e.g., restricting hypothesis space), alternatives that take the psychological findings of language acquisition seriously, and alternatives that do not take forever to run. It’s disappointing to see all the hard lessons are forgotten. For instance, indirect negative evidence, which was always viewed with suspicion, is now freely invoked without actually working through its complications. The problem doesn't go away when the modeler peeks at the target grammar and rigs the machinery accordingly, even though the modeler is some kind of idealized observer.

Somewhere during the Second Act of the Past Tense debate, connectionist models that implicitly implemented the regular/irregular distinction started to appear. I remember it annoyed the heck out of a young Gary Marcus, but I suspect that an older and wiser Gary would take that as a compliment.

[1] A "true" Bayesian model does not necessarily do better. As I noted in the paper, one such model for morphological learning took a week to train on supervised data but offered only a very marginal improvement over an online, incremental, and psychologically motivated unsupervised model, which processed almost a million words in under half an hour.

[2] The paper does offer an alternative, one embedded in a framework that insists on a transparent mapping between the Marrian levels. Like in the Past Tense debate, a critique is never enough, and one needs a positive counterproposal. So let's hear some counter-counter-proposals.

Monday, April 11, 2016

I want to pour some oil on the flames. Which flames? The
ones that I had hoped that my two recent posts on Yang’s critique of Bayes (here
and here)
would engender. There has been some mild pushback (from Ewan, Tal, Alex and
Avery). But the comments section has been pretty quiet. I want to restate what
I take to be the heart of the critique because, if correct, it is very
important. If correct, it suggests that there is nothing worth salvaging from
the Bayes “revolution” for there is no there there. Let me repeat this. If Yang
is right, then Bayes is a dead end with no redeeming scientific (as opposed to
orthographic) value. This does not
mean that specific Bayes proposals are worthless. They may not be. What it
means is not only that Bayes per se
adds nothing to the discussion, but that taking its tenets to heart will mislead inquiry. How so? It endorses the
wrong
idealization of how stats are relevant to cognition. And
misidealizations are as big a mistake as one can make, scientifically speaking.
Here’s the bare bones of the argument.

(3a) The hypothesis space is cast very wide. In the limit, all possible hypotheses are in the space of options.

(3b) All potentially relevant data are considered, i.e. any data that could decide between competing hypotheses are used to adjudicate among the hypotheses in the space.

(3c) All hypotheses are evaluated wrt all of the data. So, as data come in, every hypothesis’ chance of being true is evaluated wrt every data point considered.

(3d) When all the data have been considered, the rule is to choose the hypothesis in the space with the highest score.
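A toy sketch can make the computational shape of this recipe concrete. The code below is a hypothetical illustration of my own (not any model from Yang's paper or the Bayes literature): it keeps every candidate hypothesis alive, rescores all of them on every data point, and picks the MAP winner at the end. The function name and the coin-flip "grammars" are invented for the example.

```python
# A minimal sketch of the normative-Bayes recipe (3a-d):
# keep every hypothesis in play, score each against every
# data point, then choose the MAP hypothesis at the end.

def normative_bayes(hypotheses, priors, likelihood, data):
    """hypotheses: list of candidate Gs; priors: prior prob per G;
    likelihood(h, d) returns P(d | h). Returns the MAP hypothesis."""
    scores = dict(zip(hypotheses, priors))
    for d in data:                      # (3b): all relevant data
        for h in hypotheses:            # (3c): every G scored on every datum
            scores[h] *= likelihood(h, d)
        total = sum(scores.values())    # renormalize to keep a posterior
        for h in hypotheses:
            scores[h] /= total
    return max(scores, key=scores.get)  # (3d): pick the highest scorer

# Toy example: "grammars" are biased coins; data is a run of flips.
hs = [0.1, 0.5, 0.9]                    # (3a): the whole candidate space
lik = lambda h, d: h if d == 1 else 1 - h
best = normative_bayes(hs, [1/3] * 3, lik, [1, 1, 1, 0, 1])  # -> 0.9
```

Note the cost profile: len(data) * len(hypotheses) likelihood evaluations, every hypothesis touched on every datum. That is exactly what blows up when the space is cast wide, which is why practice retreats to tiny spaces.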

Two things are worth noting about the above.

First, (3) provides serious content to a Bayesian
theory, unlike (1) and (2). The latter are trivial in that nobody has ever
thought otherwise. Nobody. Ever. So if this is the point of Bayes, then this
ain’t no revolution!

Second, (3) has serious normative motivation. It is a good
analysis of what kind of inference an inference to the best explanation might
be. Normatively, an explanation is best if it is better than all other possible
explanations and accounts for all of the possibly relevant data. Ideally, this
implies evaluating all alternatives wrt to all the data and choosing the best.
This gives us (3a-d). Cognitive Bayes (CB) is the hypothesis that normative
Bayes (NB) is a reasonable idealization
of what people actually do when they learn/acquire something. And we should
appreciate that this could be the case. Let’s consider how for a moment.

The idealization would make sense for the following kind of
case (let’s restrict ourselves to language). Say that the hypothesis space of
potential Gs was quite big. For concreteness, say that we were always
considering about 50 different candidate Gs. This is not all possible Gs, but 50 is a pretty big number computationally
speaking. So say 50 or more alternatives is the norm. Then Bayes (3a) would
function a lot like the standard linguistic assumption that the set of well-formed
syntactic objects in a given language is effectively infinite. Let me unpack
the analogy.

This infinity assumption need not be accurate to be a good
idealization. Say it turns out that the number of well-formed sentences a
native speaker of English is competent wrt is “only” 10^1000.
Wouldn’t this invalidate the infinity assumption? No, it would show that it is
false, but not that it is a bad idealization. Why? Because the idealization
focuses attention on the right problem. Which one? The
Projection Problem: how do native speakers go from a part of the language to all
of it? How, given exposure to only a subset of the language, does a LAD get
mastery over a whole language? The answer: you acquire recursive rules, a G,
that’s how. And this is true whether or not the “language” is infinite or just
very big. The problem, going from a subset to its containing superset, will
transit via a specification of rules whether or not the set is actually
infinite. All the infinity idealization does is concentrate the mind on the
projection problem by making the tempting alternative idea (learning by
listing) look silly. This is what Chomsky means when he says in Current Issues: “once we have mastered a language, the class of
sentences with which we can operate fluently and without difficulty or hesitation is so vast that for all practical purposes (and, obviously,
for all theoretical purposes), we may regard it as infinite” (7, my
emphasis NH). See: the idealization is reasonable because it does not
materially change the problem to be solved (i.e. how to go from part of the
language you are exposed to, to the whole language that you have mastery over).

A similar claim could
be true of Bayes. Yes, the domain of Gs a LAD considers is in fact big. Maybe
not thousands or millions of alternatives, but big enough to be worth
idealizing to a big hypothesis space in the same way that it is worth assuming
that the class of sentences a native speaker is competent wrt is infinite. Is this
so? Probably not. Why not? Because even moderately large hypothesis spaces (say
with over 5 competing alternatives) turn out to be very hard to manage. So the
standard practice is to use really truncated spaces, really small SWSs. But
when you so radically truncate the space, there is no reason to think that the
inductive problem remains the same. Just think if the number of sentences we
actually knew was about 5 (roughly what happens in animal communication
systems). Would the necessity of rules really be obvious? Might we not reject
the idealization Chomsky argues for
(and note that I emphasize ‘argue’)? So, rejecting (3a) means rejecting part of the Bayes idealization.

What of the other parts, (3b-d)? Well, as I noted in my
posts, Charles argues that each and every one is wrong in such a way as to be
not worth making. It gets the shape of the problem wrong. He may be right. He
may be wrong (not really, IMO), but he makes an argument. And if he is right,
then what’s at stake is the utility of NB as a useful idealization for
cognitive purposes. And, if you accept this, we are left with (1-2), which is
methodological pablum.

I noted one other thing: the normative idealization above was
once considered as a cognitive option within linguistics. It was known as the
child-as-little-linguist theory. And it had exactly the same problems that Bayes has. It suggests that what kids do is what linguists do. But it is not the same thing at all. And realizing
this helped focus on what the problem the LAD faces is. Bayes is not unique in
misidealizing a problem.

Three more points and I end today’s diatribe.

First, one can pick and choose among the four features
above. In other words, there is no law saying that one must choose the various
assumptions as a package. One can adopt a SWS assumption (rejecting 3a) while
adopting a panoramic view of the updating function (assuming that every
hypothesis in the space is updated wrt every new data point) and rejecting
choice optimization (3d). In other words, mixing and matching is fine and worth
exploring. But what gives Bayes content, and makes it more than one of many
bookkeeping notations, is the idealization implicit in CB as NB.

Second, what makes Bayes scientifically interesting is the
idealization implicit in it. I mention
this because as Tal notes in a comment (here),
it seems that current Bayesians are promoting their views as just a “set of
modeling practices.” The ‘just’ is mine, but this seems to me what Tal is
indicating about the paper he links to. But the “just” matters. Modeling
practices are scientifically interesting to the degree that they embody ideas
about the problem being modeled. The good ones are ones that embody a good
idealization. So, either these practices are based on substantive assumptions
or they are “mere” practices. If the latter, then the Bayes modeling is in itself of zero scientific interest.
Does anyone really want to defend Bayes in this way? I confess that if this is
the intent then there is nothing much to argue about given how modest (how really modest, how really really modest) the
Bayes claim is.

Last, there is a tendency to insulate one’s work from
criticism. One way of doing this is to refuse to defend the idealizations
implicit in one’s technology. But technology is never innocent. It always
embodies assumptions about the way the world is: assumptions that make it a
good technology, in that it allows one to see/do things that other
technologies do not permit or, at least, that it does not distort how the basic
problems of interest are to be investigated. But researchers hate having to
defend their technology, more often favoring the view that how it runs is its
own defense. I have been arguing that this is incorrect. It does matter. So, if
it turns out that Bayesians now are urging us to use the technology but are
backing away from the idealizations implicit in it, that is good to know. This
was not how it was initially sold. It was sold as a good way of developing
level 1 cognitive theories. But if Bayes has no content, then this is false. On
the revised view of Bayes as a “set of modeling practices,” Bayes per se has no
content, so it is not and cannot be a level 1 theory of anything. It is
vacuous. Good to know. I would be
happy if this is now widely conceded by our most eminent Bayesians. If this is
now the current view of things, then there is nothing to argue about. If only
Bayes had told us this sooner.