Comments

Monday, February 24, 2014

Syntacticians have effectively used just one kind of probe
to investigate the structure of FL, viz. acceptability judgments. These come in
two varieties: (i) simple “sounds good/sounds bad” ratings, with possible
gradations of each (effectively a 6-ish point scale: ok, ?, ??, ?*, *, **), and
(ii) “sounds good/sounds bad under this interpretation” ratings (again with
possible gradations). This rather crude empirical instrument has proven to be
very effective as the non-trivial nature of our theoretical accounts indicates.[1]
Nowadays, this method has been partially systematized under the name
“experimental syntax.” But, IMO, with a few important conspicuous exceptions,
these more refined rating methods have effectively endorsed what we knew
before. In short, the precision has been useful, but not revolutionary.[2]

In the early heady days of Generative Grammar (GG), there
was an attempt to find other ways of probing grammatical structure.
Psychologists (following the lead that Chomsky and Miller (1963) (C&M)
suggested) took grammatical models and tried to correlate them with measures
involving things like parsing complexity or rate of acquisition. The idea was a
simple and appealing one: more complex grammatical structures should be more
difficult to use than less complex
ones and so measures involving language use (e.g. how long it takes to
parse/learn something) might tell us something about grammatical structure.
C&M contains the simplest version of this suggestion, the now infamous
Derivational Theory of Complexity (DTC). The idea was that there was a
transparent (i.e. at least a homomorphic) relation between the rules required
to generate a sentence and the rules used to parse it and so parsing complexity
could be used to probe grammatical structure.

Though appealing, this simple picture can (and many believed
did) go wrong in very many ways (see Berwick and Weinberg 1983 (BW) here
for a discussion of several).[3]
Most simply, even if it is correct that there is a tight relation between the
competence grammar and the one used for parsing (which there need not be, though
in practice there often is, e.g. the Marcus Parser), the effects of this algorithmic complexity need not show up
in the usual temporal measures of complexity, e.g. how long it takes to parse a
sentence. One important reason for this is that parsers need not apply their
operations serially and so the supposition that every algorithmic step takes
one time step is just one reasonable assumption among many. So, even if there is a strong transparency
between competence Gs and the Gs parsers actually deploy, no straightforward measurable
time prediction follows.
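To make the DTC's quantitative core concrete, here is a toy sketch (my own construction; the transformation inventory is invented for illustration, in the spirit of the classic active/passive/negative experiments):

```python
# Toy illustration of DTC-style reasoning (a sketch, not the original model):
# score each sentence type by the number of optional transformations applied
# in a 1960s-style derivation, and predict a parsing-difficulty ordering.

TRANSFORMATIONS = {
    "active":           [],                        # kernel sentence
    "passive":          ["passive"],
    "negative":         ["negative"],
    "passive-negative": ["passive", "negative"],
}

def derivational_complexity(sentence_type):
    """Complexity = number of transformations in the derivation."""
    return len(TRANSFORMATIONS[sentence_type])

# The DTC's crude prediction: parse time increases with this count, so a
# serial, one-operation-per-time-step parser predicts the ordering
# active < passive = negative < passive-negative.
ordering = sorted(TRANSFORMATIONS, key=derivational_complexity)
print(ordering)
```

Note how directly the caveat above applies: a parser that executed the passive and negative operations in parallel could handle the most complex type in a single time step, severing the link between operation counts and measured latencies.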

This said, there remains something very appealing about DTC
reasoning (after all, it’s always nice
to have different kinds of data converging on the same conclusion, i.e. Whewell’s
consilience) and though the DTC need not be true, it might be worth looking for places where the
reasoning succeeds. In other words, though the failure of DTC style reasoning
need not in and of itself imply
defects in the competence theory used, a successful DTC style argument can tell
us a lot about FL. And because there are many ways for a DTC style explanation
to fail and only a few ways that it can succeed, successful stories, if they exist, can shed interesting light
on the basic structure of FL.

I mention this for two reasons. First, I have been reading
some reviews of the early DTC literature and have come to believe that its
demonstrated empirical “failures” were likely oversold. And second, it seems
that the simplicity of MP grammars has made it attractive to go back and look
for more cases of DTC phenomena. Let me elaborate on each point a bit.

First, the apparent demise of the DTC. Chapter 5 of Colin
Phillips’ thesis (here)
reviews the classical arguments against the DTC. Fodor, Bever and Garrett (in their 1974 text)
served as the three horsemen of the DTC apocalypse. They interred the DTC by
arguing that the evidence for it was inconclusive. There was also some
experimental evidence against it (BW note the particular importance of Slobin
(1966)). Colin’s review goes a very long way in challenging this pessimistic
conclusion. He sums up his in-depth review as follows (p. 266):

…the received view that the
initially corroborating experimental evidence for the DTC was subsequently discredited
is far from an accurate summary of what happened. It is true that some of the
experiments required reinterpretation, but this never amounted to a serious
challenge to the DTC, and sometimes even lent stronger support to the DTC than
the original authors claimed.

In sum, Colin’s review strongly implies that linguists
should not have abandoned the DTC so quickly.[4]
Why, after all, give up on an interesting hypothesis, just because of a few
counter-examples, especially ones that when considered carefully seem on the
weak side? In retrospect, it looks like the abandonment of the strong
hypothesis was less a matter of reasonable retreat in the face of overwhelming
evidence than a decision that disciplines occasionally make to leave one
another alone for self-interested reasons. With the demise of the DTC,
linguists could assure themselves that they could stick to their investigative
methods and didn’t have to learn much psychology and psychologists could
concentrate on their experimental methods and stay happily ignorant of any
linguistics. The DTC directly threatened this comfortable “live and let live”
world and perhaps this is why its demise was so quickly embraced by all sides.

This state of comfortable isolation is now under threat,
happily. This is so for several reasons.
First, some kind of DTC reasoning is really the only game in town in cog-neuro.
Here’s
Alec Marantz’s take:

...the “derivational theory of
complexity” … is just the name for a standard methodology (perhaps the dominant
methodology) in cognitive neuroscience (431).

Alec rightly concludes that given the standard view within GG
that what linguists describe are real mental structures, there is no choice but
to accept some version of the DTC as the null hypothesis. Why? Because, ceteris paribus:

…the more complex a representation-
the longer and more complex the linguistic computations necessary to generate
the representation- the longer it should take for a subject to perform any task
involving the representation and the more activity should be observed in the
subject’s brain in areas associated with creating or accessing the
representation or performing the task (439).

This conclusion strikes me as both obviously true and
salutary, with one caveat. As BW has shown us, the ceteris paribus clause can in practice be quite important. Thus, the common indicators of complexity
(e.g. time measures) may be only indirectly related to algorithmic complexity.
This said, GG is (or should be) committed to the view that algorithmic
complexity reflects generative complexity and that we should be able to find
behavioral or neural correlates of this (e.g. Dehaene’s work (discussed here)
in which BOLD responses were seen to track phrasal complexity in pretty much a
linear fashion or Forster’s work finding temporal correlates mentioned in note
4).
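For concreteness, here is a schematic (my own toy, not Dehaene's actual model) of the kind of linking hypothesis at issue: if each incoming word triggers structure building, cumulative Merge counts grow linearly with phrase size, which is just the sort of quantity a behavioral or BOLD measure could track:

```python
# Schematic of the linking hypothesis between generative complexity and a
# behavioral/neural measure. This is an illustrative sketch only.

def merge_count(n_words):
    """Merges needed to combine n words into one phrase: any binary tree
    over n leaves has exactly n - 1 internal nodes, i.e. n - 1 Merges,
    regardless of branching direction."""
    return max(n_words - 1, 0)

# Cumulative structure-building work grows linearly with phrase length,
# which is the shape of the linear tracking mentioned above.
cumulative = [merge_count(n) for n in range(1, 7)]
print(cumulative)  # linear in phrase length
```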

Alec (439) makes an additional, IMO correct and important,
observation. Minimalism in particular, “in denying multiple routes to
linguistic representations,” is committed to some kind of DTC thinking.[5]
Furthermore, by emphasizing the centrality of interface conditions to the
investigation of FL, Minimalism has embraced the idea that how linguistic
knowledge is used should reveal a great deal about what it is. In fact, as I’ve
argued elsewhere, this is how I would like to understand the “strong minimalist
thesis,” (SMT) at least in part. I have suggested that we interpret the SMT as
committed to a strong “transparency hypothesis” (TH) (in the sense of Berwick
& Weinberg), a proposal that can only be systematically elaborated by how
linguistic knowledge is used.

Happily, IMO, paradigm examples of how to exploit “use” and
TH to probe the representational format of FL are now emerging. I’ve already
discussed how Pietroski, Hunter, Lidz and Halberda’s work relates to the SMT
(e.g. here
and here).
But there is other stuff too of obvious relevance: e.g. BW’s early work on
parsing and Subjacency (aka Phase Theory) and Colin’s work on how islands are
evident in incremental sentence processing. This work is the tip of an
increasingly impressive iceberg. For example, there is analogous work showing
that parsing exploits binding restrictions incrementally during processing
(e.g. by Dillon, Sturt, Kush).

This latter work is interesting for two reasons. It validates
results that syntacticians have independently arrived at using other methods
(which, to re-emphasize, is always worth doing on methodological grounds). And,
perhaps even more importantly, it has started raising serious questions for
syntactic and semantic theory proper. This is not the place to discuss this in
detail (I’m planning another post dedicated to this point), but it is worth
noting that given certain reasonable assumptions about what memory is like in
humans and how it functions in, among other areas, incremental parsing, the
results on the online processing of binding noted above suggest that binding is
not stated in terms of c-command but some other notion that mimics its effects.

Let me say a touch more about the argument form, as it is
both subtle and interesting. It has the following structure: (i) we have
evidence of c-command effects in the domain of incremental binding, (ii) we
have evidence that the kind of memory we use in parsing cannot easily code a
c-command restriction, thus (iii) what the parsing Grammar (G) employs is not
c-command per se but another notion
compatible with this sort of memory architecture (e.g. clausemate or
phasemate). But, (iv) if we adopt a strong SMT/TH (as we should), (iii) implies
that c-command is absent from the competence G as well as the parsing G. In
short, the TH interpretation of SMT in this context argues in favor of a
revamped version of Binding Theory in which FL eschews c-command as a basic
relation. The interest of this kind of argument should be evident, but let me
spell it out. We S-types are starting to face the very interesting prospect
that figuring out how grammatical information is used at the interfaces will help us choose among alternative competence theories by placing interface
constraints on the admissible primitives. In other words, here we see a
non-trivial consequence of Bare Output Conditions on the shape of the grammar.
Yessss!!!
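Here is a toy rendering (entirely my own construction; the feature inventory is invented) of the contrast between the two kinds of antecedent search in steps (i)-(iii):

```python
# Toy contrast between retrieving an antecedent via a relational constraint
# (c-command, which requires consulting the tree) and via a feature cue
# (clausemate, checkable on flat memory items). Illustrative sketch only.

# Flat, content-addressable memory: each item is just a feature bundle.
memory = [
    {"word": "John", "clause": 1},  # matrix subject
    {"word": "Bill", "clause": 2},  # embedded subject, local to the anaphor
]
anaphor = {"word": "himself", "clause": 2}

def retrieve_by_cue(memory, cues):
    """Cue-based retrieval: return items whose features match the cues.
    No tree relations are consulted -- only stored features."""
    return [item for item in memory
            if all(item.get(k) == v for k, v in cues.items())]

# 'Clausemate of the anaphor' is directly expressible as a retrieval cue:
print(retrieve_by_cue(memory, {"clause": anaphor["clause"]}))

# 'C-commander of the anaphor', by contrast, is a relation between tree
# positions: checking it would require walking the parse tree rather than
# matching features held on the memory items themselves.
```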

We live in exciting times. The SMT (in the guise of TH)
conceptually moves DTC-like considerations to the center of theory evaluation.
Additionally, we now have some useful parade cases in which this kind of reasoning
has been insightfully deployed (and which, thereby, provide templates for
further mimicking). If so, we should expect that these kinds of considerations
and methods will soon become part of every good syntactician’s armamentarium.

[1]
The fact that such crude data can be used so effectively is itself quite remarkable.
This speaks to the robustness of the system being studied, for such weak signals
should not be expected to be so useful otherwise.

[2]
Which is not to say that such more careful methods don’t have their place.
There are some cases where being more careful has proven useful. I think that
Jon Sprouse has given the most careful thought to these questions. Here
is an example of some work where I think that the extra care has proven to be
useful.

[4]
BW note that Forster provided evidence in favor of the DTC even as Fodor et
al. were in the process of burying it. Forster effectively found temporal
measures of psychological complexity that tracked the grammatical complexity
the DTC identified by switching the experimental task a little (viz. he used an
RSVP presentation of the relevant data).

[5]
I believe that what Alec intends here is that in a theory where the only real
operation is merge then complexity is easy to measure and there are pretty
clear predictions of how this should impact algorithms that use this
information. It is worth noting that the heyday of the DTC was in a world where
complexity was largely a matter of how many transformations applied to derive a
surface form. We have returned to that world again, though with a vastly
simpler transformational component.

Sunday, February 23, 2014

Here's another essay investigating the MOOC issue and its relation to education. There are two points. First, much of what is intended when people speak of "education" is really the effective provision of credentials that will enhance job prospects. Second, what you learn is less important than who you meet. MOOCs may help with the first, but won't address the second. And as the second dominates in determining one's life opportunities, MOOCs will simply serve to further disadvantage the less advantaged. This, in part, reflects my own jaundiced views about MOOCs, with one more perverse twist. Should MOOCs win the day, we might discover that we have destroyed research as well as education. Look at what happened to Bell Labs when we made telecommunication more efficient. A by-product of MOOCing the university might be the elimination of any venue for basic research. At least in the US and Canada, the only place basic research happens is the university, and the university in the US and Canada effectively tethers undergrad instruction to a research engine. Break the bond and there is no reason to suppose that there will be any place left for basic research. So, MOOCs may not only kill education (that's my hunch) but destroy basic inquiry as well.

Friday, February 14, 2014

Another snow day, another blog post. Last week we took a gander at derivation trees and noticed that they satisfy a number of properties that should appeal to Minimalists. Loudmouth that I am, I took things a step further and proclaimed that there is no good reason to keep using phrase structure trees now that we have this shiny new toy that does everything phrase structure trees do, just better. Considering the central role of phrase structure trees, that's a pretty bold claim... or is it?

Tuesday, February 11, 2014

Alex C (in the comment section here
(Feb. 1)) makes a point that I’ve encountered before that I would like to
comment on. He notes that Chomsky has stopped worrying about Plato’s Problem
(PP) (as has much of “theoretical” linguistics as I noted in the previous post)
and suggests (maybe this is too much to attribute to him, if so, sorry Alex)
that this is due to Darwin’s Problems (DP) occupying center stage at present.
I don’t want to argue with this factual claim, for I believe that there’s lots
of truth to it (though IMO, as readers of the last several posts have no doubt
gathered, theory of any kind is largely absent from current research). What I want
to observe is that (1) there is a tension between PP and DP and (2) that
resolving it opens an important place for theoretical speculation. IMO, one of
the more interesting facets of current theoretical work is that it proposes a
way of resolving this tension in an empirically interesting way. This is what I
want to talk about.

First the tension: PP is the observation that the PLD the
child uses in developing its G is impoverished in various ways when one
compares it to the properties of Gs that children attain. PP, then, is another
name for the Poverty of Stimulus Problem (POS). Generative Grammarians have proposed to “solve” this problem by packing
FL with principles of UG, many of which are very language specific (LS), at
least if GB is taken as a guide to the content of FL. By LS, I mean that the principles advert to
very linguisticky objects (e.g. Subjects, tensed clauses, governors, case
assigners, barriers, islands, c-command, etc.) and very linguisticky operations
(agreement, movement, binding, case assignment, etc.). The idea has been that making UG rich enough
and endowing it with LS innate structure will allow our theories of FL to
attain explanatory adequacy, i.e. to explain how, say, Gs obey islands despite
the absence from the PLD of the good and bad data relevant to fixing them.

By now, all of this is pretty standard stuff (which is not
to say that everyone buys into the scheme (Alex?)), and, for the most part, I
am a big fan of POS arguments of this kind and their attendant conclusions. However,
even given this, the theoretical problem that PP poses has hardly been solved. What
we do have (again assuming that the POS arguments are well founded (which I do
believe)) is a list of (plausibly) invariant(ish) properties of Gs and an explanation
for why these can emerge in Gs in the absence of the relevant data in the PLD
required to fix them. Thus, why do movement rules in a given G resist
extraction from islands? Because something like the Subjacency/Barriers theory
is part of every Language Acquisition Device’s (LAD) FL, that’s why.

However, even given this, what we still don’t have is an
adequate account of how the variant
properties of Gs emerge when planted in a particular PLD environment. Why is
there V to T in French but not in English? Why do we have inverse control in
Tsez but not Polish? Why wh-in-situ in Chinese but multiple wh-movement to C in
Bulgarian? The answer GB provided (and, so far as I can tell, the answer still)
is that FL contains parameters that can be set in different ways on the basis
of PLD and the various Gs we have are the result of differential parameter
setting. This is the story, but we have known for quite a while that this is
less a solution to the question of how Gs emerge in all their variety than it
is an explanation schema for a
solution. P&P models, in other words, are not so much well worked out
theories as part of a general recipe for a theory that, were we able to cook it, would produce just the kind of FL
that could provide a satisfying answer to the question of how Gs can vary so
much. Moreover, as many have observed (Dresher and Janet Fodor are two notable
examples, see below) there are serious problems with successfully fleshing out
a P&P model.

Here are two: (i) the hope that many variant properties of
Gs would hinge on fixing a small number of parameters seems increasingly
empirically uncertain. Cedric Boeckx and Fritz Newmeyer have been arguing this
for a while, and while their claims are debated (and by very intelligent people
so, at least for a non-expert like me, the dust is still too unsettled to reach
firm conclusions), it seems pretty clear that the empirical merits of earlier proposed parameterizations are less
obvious than we took them to be. Indeed, there appears to be some skepticism about
whether there are any macro-parameters (in Baker’s sense[1])
and many of the micro-parametric proposals seem to end up restating what we
observe in the data: that languages can differ. What made early macro-parameter
theories interesting is the idea that differences among Gs come in largish
clumps. The relation between a given parameter setting and the attested surface
differences was understood as one to many. If, however, it turns out that every
parameter correlates with just a single difference then the value of a
parametric approach becomes quite unclear, at least so far as acquisition
considerations are concerned. Why? Because it implies that surface differences
are just due to differing PLD, not to the different options inherent in the
structure of FL. In other words, if we end up with one parameter per surface
difference then variation among Gs will not be as much of a window into the
structure of FL as we thought it could be.

Here’s another problem: (ii) the likely parameters are not independent.
Dresher (and friends) has demonstrated this for stress systems and Fodor (and
friends) has provided analogous results for syntax. The problem with a theory where parameters
are not independent is that they make it very hard to see how acquisition could
be incremental. If it turns out that the value of any parameter is conditional
on the value of every other parameter (or very many others) then it would seem
that we are stuck with a model in which all parameters must be set at once
(i.e. instantaneous learning). This is not good! To evade this problem, we need
some way of imposing independence on the parameters so that they can be set
piecemeal without fear of having to re-set them later on. Both Dresher and
Fodor have proposed ways of solving this independence problem (both elaborate a
richer learning theory for parameter values to accommodate this problem). But,
I think that it is fair to say that we are still a long way from a working
solution. Moreover, the solutions provided all involve greatly enriching FL in
a very LS way. This is where PP runs into DP. So let’s return to the
aforementioned tension between PP and
DP.
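A toy example (my own construction; the XOR-style grammar is invented purely for illustration) shows the shape of the non-independence problem:

```python
# Toy illustration of why non-independent parameters frustrate incremental
# setting: two binary parameters jointly determine the surface pattern, so
# neither can be fixed from its "own" data alone. Illustrative sketch only.

from itertools import product

def surface_pattern(p1, p2):
    """Hypothetical grammar in which the same surface pattern arises from
    different parameter combinations (p1 XOR p2, say)."""
    return "pattern-A" if p1 != p2 else "pattern-B"

# An incremental learner that hears "pattern-A" cannot fix p1 without
# already knowing p2: both (0, 1) and (1, 0) generate it. Setting
# parameters one at a time risks having to reset them later -- unless
# independence is imposed, or all parameters are set at once.
consistent = [(p1, p2) for p1, p2 in product([0, 1], repeat=2)
              if surface_pattern(p1, p2) == "pattern-A"]
print(consistent)
```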

One way to solve PP is to enrich FL. The problem is that the
richer and more linguistically parochial FL is, the harder it becomes to
understand how it might have evolved. In other words, our standard GB tack in
solving PP (LS enrichment of FL) appears to make answering DP harder. Note I
say ‘appears.’ There are really two problems, and they are not equally acute.
Let me explain.

As noted above, we have two things that a rich FL has been
used to explain; (a) invariances characteristic of all Gs and (b) the attested
variation among Gs. In a P&P model, the first ‘P’ handles (a) and the
second (b). I believe that we have seen glimmers of how to resolve the tension
between PP’s demands on FL versus DP’s as regards the principles part of P&P. Where things have become far more
obscure (and even this might be too kind) involves the second parametric P. Here’s what I mean.

As I’ve argued in the past, one important minimalist project
has been to do for the principles of GB what Chomsky did for islands and
movement via the theory of subjacency in On
Wh Movement (OWM). What Chomsky did in this paper is theoretically unify
the disparate island effects by unifying all non-local (A’) dependency
constructions by proposing that they have a common movement core (viz. move WH) subject to locality
restrictions characterized by Bounding Theory (BT). This was a terrifically
inventive theory and, aside from
rationalizing/unifying Ross’s very disparate Island Effects, the combination of
Move WH + BT predicted that all long movement would have to be
successive cyclic (and even predicted a few more islands, e.g. subject islands
and Wh-islands).[2]

But to get back to PP and DP, one way of regarding MP work
over the last 20 years is as an attempt to do for GB modules what Chomsky did
for Ross’s Islands. I’ve suggested this many times before but what I want to
emphasize here is that this MP project is perfectly in harmony with the PP
observation that we want to explain many of the invariances witnessed across Gs
in terms of an innately structured FL. Here there is no real tension if this
kind of unification can be realized. Why not? Because if successful we retain
the GB generalizations. Just as Move WH
+ BT retain Ross’s generalizations, a successful unification within MP will
retain GB’s (more or less) and so we can continue to tell the very same story
about why Gs display the invariances attested as we did before. Thus, wrt this POS problem, there is a way to
harmonize DP concerns with PP concerns. Of course, this does not mean that we
will successfully manage to unify the GB modules in a Move WH + BT way, but we understand what a successful solution would
look like and, IMO, we have every reason to be hopeful, though this is not the
place to defend this view.

So, the principles
part of P&P is, we might say, DP compatible (little joke here for the
cognoscenti). The problem lies with the second P. FL on GB was understood to
provide not only the principles of invariance but also to specify all the
possible ways that Gs could differ. The
parameters in GB were part of FL! And it is hard to see how to square this
with DP given the terrific linguistic
specificity of these parameters. The MP conceit has been to try and
understand what Gs do in terms of one (perhaps)[3]
linguistically specific operation (Merge) interacting with many general
cognitive/computational operations/principles. In other words, the aim has been to reduce the parochialism of the GB
version of FL. The problem with the GB conception of parameters is that it is
hard to see how to recast them in similarly general terms. All the parameters
exploit notions that seem very very linguo-centric. This is especially true of micro parameters, but it is even true of
macro ones. So, theoretically, parameters present a real problem for DP, and
this is why the problems alluded to earlier have been taken by some (e.g. me)
to suggest that maybe FL has little to say about G-variation. Moreover, it
might explain why it is that, with DP becoming prominent, some of the interest
in PP has seemed to wane. It is due to a dawning realization that maybe the
structure of FL (our theory of UG) has little to say directly about grammatical variation and typology. Taken together
PP and DP can usefully constrain our theories of FL, but mainly in licensing
certain inferences about what kinds of invariances we will likely discover
(indeed have discovered). However, when it comes to understanding variation, if parameters cannot be bleached of
their LSity (and right now, this looks to me like a very rough road), it looks
to me like they will never be made to fit with the leading ideas of MP, which
are in turn driven by DP.

So, Alex C was onto something important IMO. Linguists tend
to believe that understanding variation is key to understanding FL. This is
taken as virtually an article of faith. However, I am no longer so sure that
this is a well founded presumption. DP provides us with some reasons to doubt
that the range of variation reflects intrinsic
properties of FL. If that is correct, then variation per se may be of little interest for those interested in limning the
basic architecture of FL. Studying various Gs will, of course, remain a useful
tool for getting the details of the invariant principles and operations
right. But, unlike earlier GB P&P models, there is at least an argument to
be made (and one that I personally find compelling) that the range of
G-variation has nothing whatsoever to do with the structure of FL and so will
shed no light on two of the fundamental questions in Generative Grammar: what’s
the structure of FL and why?[4]

[1]
Though Baker, a really smart guy, thinks that there are, so please don’t take me
as endorsing the view that there aren’t any. I just don’t know. This is just my
impression from linguist-in-the-street interviews.

[2]
The confirmation of this prediction was one of the great successes of
generative grammar and the papers by, e.g. Kayne and Pollock, McCloskey, Chung,
Torrego, and many others are still worth reading and re-reading. It is worth
noting that the Move WH + BT story
was largely driven by theoretical considerations,
as Chomsky makes clear in OWM. The
gratifying part is that the theory proved to be so empirically fecund.

[3]
Note the ‘perhaps.’ If even merge is, in the current parlance, “third factor,”
then there is nothing taken to be linguistically special about FL.

[4]
Note that this leaves quite a bit of room for “learning” theory. For if the range of
variation is not built into FL then why we see the variation we do must be due
to how we acquire Gs given FL/UG. The
latter will still be important (indeed critical) in that any learning theory
will have to incorporate the isolated invariances. However, a large part of the
range of variation will fall outside the purview of FL. I discuss this somewhat
in the last chapter of A Theory of Syntax
for any of you with a prurient interest in such matters. See, in particular,
the suggestion that we drop the switch analogy in favor of a more geometrical
one.

Monday, February 10, 2014

In my haste to get Chris's opinions out there, I jumped the gun and posted an early draft of the piece he intended to see the light of day. So all of you who read the earlier post, fogetabouit!!! You are to ignore all of its contents and concentrate on the revised edition below. I will try (but no doubt fail) never to screw up again. So, sorry Chris. And to readers, enjoy Chris's "real" post.

*****

Why Formalize?

I read with interest Norbert’s recent post on formalization:
“Formalization and Falsification in Generative Grammar”. Here I write some
preliminary comments on his post. I have
not read other relevant posts in this sprawling blog, which I am only now
learning how to navigate. So some of what I say may be redundant. Lastly, the
issues that I discuss below have come up in my joint work with Edward Stabler
on formalizing minimalism, to which I refer the reader for more details.

I take it that the goal of linguistic theory is to
understand the human language faculty by formulating UG, a theory of that
faculty. Formalization is a tool toward that goal. Formalization means stating a theory
clearly and formally enough that one can establish conclusively (i.e., with a
proof) the relations between various aspects of the theory and between claims
of the theory and claims of alternative theories.

Frege in the Begriffsschrift (p. 6 of the Begriffsschrift as reprinted in
the collection From Frege to Gödel) analogizes the “ideography” (basically first and
second order predicate calculus) to a microscope: “But as soon as scientific
goals demand great sharpness of resolution, the eye proves to be insufficient.
The microscope, on the other hand, is perfectly suited to precisely such goals,
but that is just why it is useless for all others.” Similarly, formalization in
syntax is a tool that should be employed when needed. It is not an absolute
necessity, and there are many ways of going about things (as I discuss below).
By citing Frege, I am in no way claiming that we should aim for the same level of
formalization that Frege aimed for.

There is an important connection with the ideas of Rob
Chametzky (posted by Norbert in another place on this blog). As we have seen,
Rob divides up theorizing into meta-theoretical, theoretical and analytical. Analytical work, according to Chametzky, is:
“concerned with investigating the (phenomena of the) domain in question. It
deploys and tests concepts and architecture developed in theoretical work,
allowing for both understanding of the domain and sharpening of the theoretical
concepts.” It is clear that more than 90% of all linguistics work (maybe 99%)
is analytical, and that there is a paucity of true theoretical work.

A good example of analytical work would be Noam Chomsky’s
“On Wh-Movement”, which is one of the most beautiful and important papers in
the field. Chomsky proposes the wh-diagnostics and relentlessly subjects a
series of constructions to those diagnostics uncovering many interesting patterns
and facts. The consequence that all these various constructions can be reduced
to the single rule of wh-movement is a huge advance, allowing one insight into
UG. Ultimately, this paper led to the Move-Alpha framework, which then led to
Merge (the simplest and most general operation yet).

“On Wh-Movement” is
what I would call “semi-formal”. It has semi-formal statements of various
conditions and principles, and also lots of assumptions are left implicit. As a
consequence it has the hallmark property of semi-formal work: there are no
theorems and no proofs.

Certainly, it would have been a waste of time to fully
formalize “On Wh-Movement”. It would have expanded the text 10-20 fold at
least, and added nothing. This is something that I think Pullum completely
missed in his 1989 NLLT contribution on formalization. The semi-formal nature
of syntactic theory, also found in such classics as “Infinite Syntax” by Haj Ross
and “On Raising” by Paul Postal, has led to a huge explosion of knowledge that
people outside of linguistics/syntax do not really appreciate (hence all the
uninformed and uninteresting discussion out there on the internet and Facebook about
what the accomplishments of generative grammar have been), in part because
syntacticians are generally not very good popularizers.

Theoretical work, according to Chametzky, is “concerned
with developing and investigating primitives, derived concepts and architecture
within a particular domain of inquiry.” There are many good examples of this
kind of work in the minimalist literature. I would say Juan Uriagereka’s
original work on multi-spell-out qualifies and so does Sam Epstein’s work on
c-command, amongst others.

My feeling is that theoretical work (in Chametzky’s sense)
is the natural place for formalization in linguistic theory. One reason is that
it is possible, using formal assumptions, to show clearly the relationships
between various concepts, assumptions, operations and principles. For example, it
should be possible to show, from formal work, that things like the NTC (No-Tampering
Condition), the Extension Condition and Inclusiveness should really be thought of
as theorems proved on the basis of assumptions about UG. If
they were theorems, they could be eliminated from UG. One could ask if this
program could be extended to the full range of what syntacticians normally
think of as constraints.
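To give a flavor of the kind of argument involved (this is my own illustrative sketch, not a result from the post): if Merge is taken in its simplest set-theoretic form, a condition like the NTC starts to look derivable rather than stipulated:

```latex
% Simplest Merge: set formation over syntactic objects X, Y
\mathrm{Merge}(X, Y) = \{X, Y\}
% Because Merge only builds a new set and never modifies X or Y
% internally, a No-Tampering-style condition (operations leave
% their inputs intact) follows from the definition itself,
% rather than needing to be stipulated as a separate axiom of UG.
```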

In this, I agree with Norbert, who states: “It can lay
bare what the conceptual dependencies between our basic concepts are.”
Furthermore, as my previous paragraph makes clear, this mode of reasoning is
particularly important for pushing the SMT (Strong Minimalist Thesis) forward.
How can we know, with certainty, how some concept/principle/mechanism fits into
the SMT? We can formalize and see if we can prove relations between our
assumptions about the SMT (assumptions about the interfaces and computational
efficiency) and the various concepts/principles/mechanisms. Using the ruthless
tools of definition, proof and theorem, we can gradually whittle away at UG,
until we have the bare essence. I am sure that there are many surprises in
store for us. Given the fundamental, abstract and subtle nature of the elements
involved, such formalization is probably a necessity, if we want to avoid
falling into a muddle of unclear conclusions.

A related reason for formalization (in addition to clearly
stating/proving relationships between concepts and assumptions) is that it
allows one to compare competing proposals. One of the biggest such areas
nowadays is whether syntactic dependencies make use of chains, multi-dominance
structures or something else entirely. Chomsky’s papers, including his recent
ones, make reference to chains at many points. But other recent work invokes
multi-dominance. What are the differences between these theories? Is either of them really necessary? The SMT
makes it clear that one should not go beyond Merge, the lexicon, and the
structures produced by Merge. So any additional assumptions needed to implement
multi-dominance or chains are suspect. But what are those additional
assumptions? I am afraid that without formalization it will be impossible to
answer these questions.

Questions about syntactic dependencies interact closely with
TransferPF (Spell-Out) and TransferLF, which, to my knowledge, have not only not
been formalized but not even been stated explicitly (other than the
initial attempt in Collins and Stabler 2013). Investigating the question of
whether multi-dominance, chains or something else entirely (perhaps
nothing else) is needed to model human language syntax will require a
concomitant formalization of TransferPF and TransferLF, since these are the
functions that make use of the structures formed by Merge. Giving explicit and
perhaps formalized statements of TransferPF and TransferLF should in turn lead
to new empirical work exploring the predictions of the algorithms used to
define these functions.

A last reason for formalization is that it may bring out
complications in what appear to be innocuous concepts (e.g., “workspaces”,
“occurrences”, “chains”). It will also
help one to understand what alternative theories without these concepts would
have to accomplish. In accordance with the SMT, we would like to formulate UG
without reference to such concepts, unless they are really needed.

Minimalist syntax calls for formalization in a way that
previous syntactic theories did not. First, the nature of the basic operations
is simple enough (e.g., Merge) to make formalization a real possibility. The
baroque and varied nature of transformations in the “On Wh-Movement” framework
and preceding work made the prospect for a full formalization more daunting.
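As a toy illustration of that simplicity (a sketch of my own; modeling lexical items as strings and syntactic objects as nested sets is my assumption here, not Collins and Stabler's actual formalization), simplest binary Merge can be written in a few lines:

```python
# Toy set-theoretic Merge (illustrative sketch only).
# Lexical items are modeled as plain strings; complex syntactic
# objects are unordered two-membered frozensets built by Merge.

def merge(x, y):
    """Combine two syntactic objects into the unordered set {x, y}."""
    return frozenset({x, y})

# Build {the, {read, {the, book}}} bottom-up.
dp = merge("the", "book")   # the object {the, book}
vp = merge("read", dp)      # the object {read, {the, book}}

# Merge is symmetric: its output carries no linear order.
assert merge("the", "book") == merge("book", "the")
```

Nothing this small captures labeling, workspaces, or Transfer, of course; the point is only that the core operation is simple enough that a full formalization is within reach.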

Second, the concepts involved in minimalism, because
of their simplicity and generality (e.g., the notion of a copy), are just too
fundamental, subtle and abstract to resolve by talking through them in an
informal or semi-formal way. With formalization we can hope to state things in
such a way to make clear the conceptual and the empirical properties of the
various proposals, and compare and evaluate them.

My expectation is that selective formalization in syntax
will lead to an explosion of interesting research issues, both of an empirical
and conceptual nature (in Chametzky’s terms, both analytical and theoretical).
One can only look at a set of empirical problems against the backdrop of a
particular set of theoretical assumptions about UG and I-language. The more
these assumptions are articulated, the more one will be able to ask
interesting questions about UG.