Statistical models of grammar induction, by which I mean models that estimate a probability distribution over both syntactic analyses and word strings after observing only word strings, are typically a type of model called a generative model. This notion of "generative" is actually pretty similar to the notion of "generative" in modern syntax: there is an inventory of re-usable substructures or pieces (like context free grammar rules, which are really just local subtrees), and structures and word strings are produced by selecting pieces from your inventory. The primary difference from standard syntax is that each choice is associated with a probability score.

Now, there's a little bit of a technical wrinkle here, and I was wondering if the other commenters had any reactions. The thing that makes these models "generative" in the statistical sense is that each "probability score" is restricted to be a conditional probability of the right-hand side given the parent node. So if you fix the parent node and add up the probability scores of all the possible right-hand sides for that parent node, the sum will be one. This constraint on the probability scores is computationally convenient, because it guarantees that the probabilities of all trees sum to one (since the probabilities of all subtrees rooted at a given node sum to one, in this recursive fashion), and we can get the probability of any particular structure by just multiplying the probability scores of the pieces that actually appear in that structure. However, this has the consequence that the model as a whole is parameterized in terms of the probability of the observations given the latent variables.
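To make the "sum-to-one" constraint concrete, here's a minimal sketch (my own toy example, not from any real system) of a tiny PCFG where each rule's score is a conditional probability of the right-hand side given the parent, and a tree's probability is just the product of the scores of its pieces:

```python
# Toy PCFG: each parent maps to a list of (right-hand side, probability)
# pairs, and the probabilities for any fixed parent sum to one.
rules = {
    "S":  [(("NP", "VP"), 1.0)],
    "NP": [(("dogs",), 0.5), (("cats",), 0.5)],
    "VP": [(("bark",), 0.7), (("sleep",), 0.3)],
}

# Sanity check: for every parent, the rule probabilities sum to one.
for parent, expansions in rules.items():
    assert abs(sum(p for _, p in expansions) - 1.0) < 1e-9

def rule_prob(parent, rhs):
    """Look up the conditional probability P(rhs | parent)."""
    return dict(rules[parent])[rhs]

def tree_prob(tree):
    """Probability of a tree: the product of the scores of its pieces.
    A tree is (parent, [children]); leaf children are plain strings."""
    parent, children = tree
    rhs = tuple(c if isinstance(c, str) else c[0] for c in children)
    p = rule_prob(parent, rhs)
    for c in children:
        if not isinstance(c, str):
            p *= tree_prob(c)
    return p

t = ("S", [("NP", ["dogs"]), ("VP", ["bark"])])
print(tree_prob(t))  # 0.5 * 0.7 = 0.35
```

Because every parent's expansions sum to one, these products are already proper probabilities with no extra normalization step.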

An alternative is to allow the "probability scores" to be any non-negative number, using what's called an undirected or energy-based model. There are two reasons that people don't do this in practice. First, if probability scores are any non-negative number, and there are unobserved variables (we only see words, not dependency arcs, for example), the computational complexity gets really bad. If we multiply the probability scores of the pieces in a structure, the resulting number is only proportional to the probability of the structure (the constraint in generative models just guarantees that the proportionality constant is always 1). To actually get the probability of the structure, we have to divide by the sum of these products over all the structures we could have built, including the strings that we could have seen but did not. Doing this kind of sum exactly is intractable (although stochastic approximations are possible). Second, it turns out that undirected models do not have any more expressive power than directed models: it's been proven that, given any set of weights for an undirected model, it is possible to find a set of conditional probabilities for the corresponding directed model that produces exactly the same probability distribution over structures.
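Here's a sketch of the normalization problem, again using a made-up toy grammar. The weights are arbitrary non-negative numbers rather than conditional probabilities, so the product of weights for a tree is only proportional to its probability, and we have to compute a normalizing constant (the partition function) by enumerating every structure:

```python
import itertools

# Arbitrary non-negative weights for each (parent, rhs) piece -- no
# requirement that the weights for a fixed parent sum to one.
weights = {
    ("S",  ("NP", "VP")): 2.0,
    ("NP", ("dogs",)):    1.5,
    ("NP", ("cats",)):    0.5,
    ("VP", ("bark",)):    3.0,
    ("VP", ("sleep",)):   1.0,
}

def score(np_word, vp_word):
    """Unnormalized weight-product for the tree [S [NP w1] [VP w2]]."""
    return (weights[("S", ("NP", "VP"))]
            * weights[("NP", (np_word,))]
            * weights[("VP", (vp_word,))])

# The partition function Z sums the weight-products over every structure
# the grammar can produce. Feasible for this toy grammar's four trees;
# intractable in general, when the set of possible strings is unbounded.
Z = sum(score(n, v)
        for n, v in itertools.product(["dogs", "cats"], ["bark", "sleep"]))

prob = score("dogs", "bark") / Z
print(prob)
```

In the directed/generative case the local sum-to-one constraints force Z = 1, which is exactly why that parameterization is so computationally convenient.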

So people pretty much never use undirected models for grammar induction: even though they are perfectly well-defined, they are a nightmare computationally, and don't buy you anything in principle. On the other hand, they have different biases, so we might end up learning a different distribution over structures in practice. Also, it leads to this weird situation where we are learning models that are parameterized in terms of the probability of the observations given the unobserved stuff:

even though we usually don't care about the probability of the stuff we saw: if you "run" the models in this direction to generate new "observed stuff" like word strings, you typically get garbage. What we usually care about is the probability of the unobserved stuff given the observed stuff. You can do this by using Bayes' rule to "run the models in reverse:"

i.e., we see a sentence and we want to know how to parse it; we don't care what the probability of that sentence is.
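The "running in reverse" step can be sketched in a couple of lines. Suppose (hypothetically) a string has two parses with known joint probabilities; the posterior over parses renormalizes the joint, and the string's own probability cancels out of any comparison between parses:

```python
# Made-up joint probabilities P(tree, string) for one ambiguous string
# with two candidate parses.
joint = {"parse_A": 0.03, "parse_B": 0.01}

# P(string) marginalizes over that string's parses. Bayes' rule:
# P(tree | string) = P(tree, string) / P(string).
p_string = sum(joint.values())

posterior = {tree: p / p_string for tree, p in joint.items()}
print(posterior)  # parse_A gets 0.75, parse_B gets 0.25
```

Note that the model can assign a wildly wrong value to P(string) and still rank the parses of that string correctly, which is why running the model "forwards" to generate strings can produce garbage even when parsing works well.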

On a more theoretical level, though, I wonder how important this issue is. I mentioned in another thread that a notion of grammaticality based on typical sets might be interesting, but the feasibility of such a model relies on getting good probabilities when running the models "forward." Another thing to consider is the potential for theoretical tidiness, with one generative model responsible for both production (running "forwards") and perception (running "in reverse"). On the other hand, maybe it just isn't possible to partial out specifically linguistic influences (subject-verb agreement) from general knowledge (movie-popcorn agreement) in string probabilities.

I don't necessarily have a specific question, but these are some things I've been thinking about lately and wondered if anybody had any reactions.

(PS feel free to move this to the computational or out-of-the-box section... I wasn't sure whether to classify based on topic or methodology...)

Interesting post. I don't have too much of a direct response, but I have several related ideas. Sorry this might drag the discussion to a tangent...

1. Computational complexity

This is a huge issue in engineering (speech recognition, machine translation, etc.). But it apparently isn't much of an issue in cognition-- we do this quite easily. It's hard to know exactly what the complexity of the brain is compared to a computer, but whatever the correct algorithm is, we should be able to implement it in linear time, for example. If the best algorithm found is one that is computationally impractical, it is probably the wrong algorithm. That's an odd puzzle, because the problem really does seem to be very complicated.

2. What is the object of study?

Quote

On the other hand, maybe it just isn't possible to partial out specifically linguistic influences (subject-verb agreement) from general knowledge (movie-popcorn agreement) in string probabilities.

This is where my current thoughts are about syntax, but beyond that point. You also said this:

Quote

there is an inventory of re-usable substructures or pieces (like context free grammar rules, which are really just local subtrees), and structures and word strings are produced by selecting pieces from your inventory.

That's a standard position, and one that I question, actually.

Language has hierarchical structure (as much as it bothers me to admit that-- I've been trying to find some alternative for years and have finally accepted that it must be correct). But WHERE is that structure? I think this can lead us to a better answer.

I'll copy two paragraphs from a short proposal I wrote for my dissertation project:

Quote from: my in-progress dissertation

It is widely accepted in the field of Theoretical Linguistics that a major research goal is determining the relationship between form and meaning in natural languages. Since Chomsky’s early work (1956, 1957, 1959), the formal study of Syntax has assumed a model of individual nodes arranged in a hierarchical structure. Essentially each word is given a position within a hierarchy such that structural relationships can be determined between the nodes in a sentence. This approach has been extremely popular and productive for the field. It is the purpose of this dissertation to take another look at these basic assumptions in order to introduce a more accurate method for understanding the relationship between linguistic form (“Syntax”) and underlying meaning (“Semantics”). In short, the exact form of a linguistic utterance will be seen as the result of linearization and widely variable language-specific rules, while the underlying semantic structure will resemble what is assumed in the standard hierarchical models but without many of the complications introduced by trying to account for peculiarities of form.

While the standard theory often appears successful and leads to insightful results, I argue that this is an illusion: most of the time, linguistic form is isomorphic to semantic structure, but exceptions reveal that the situation is more complex. It is not unexpected that, ceteris paribus, linguistic form would reflect semantics, but this is not a strict rule in human language.

To name some specific examples:

1. Coordination

I'm now convinced that "and" is not actually a node in the tree, but rather the linear representation of a coordinated structure (whatever that actually is is irrelevant).

2. Control verbs, as in "John tries [John] to swim."

There has been a huge amount of debate about how that extra "[John]" gets interpreted semantically. Maybe it's movement. Maybe there's some invisible pronoun. Etc. My answer is fairly simple: a preliminary linear parse adds "John" back in to the structure so that it is available for semantics.

3. Periphrasis

Some languages have multiple realizations for a single meaning. In a single paradigm, something that might be realized as an affix in one case might be realized as a separate word in another. There are some very convincing examples of this, but I can't list them off the top of my head. To take a familiar (and probably slightly less convincing) example, consider English tense:

Present: sometimes -s
Past: -ed
Future: will

The future is oddly a separate word with very different syntax compared to the others. My suggestion: this is just a surface-level issue, not relevant to the real hierarchical structure.

So what's left at the actual hierarchical structure? Core relationships like modification and argument-selection. Maybe almost nothing else.

(There is a complication here in that obviously we need some kind of hierarchical structure to begin interpreting in the first place, such as recognizing "try to" as a control structure rather than some other linear string. But I'll set that aside for now. It's probably some kind of bootstrapping.)

In short: I think syntacticians are looking at the wrong thing.

So for your question, this suggests that building up a bunch of subtrees from surface form is the wrong approach, which is why it is computationally difficult and not especially reliable.

Interesting post. I don't have too much of a direct response, but I have several related ideas. Sorry this might drag the discussion to a tangent...

1. Computational complexity

This is a huge issue in engineering (speech recognition, machine translation, etc.). But it apparently isn't much of an issue in cognition-- we do this quite easily. It's hard to know exactly what the complexity of the brain is compared to a computer, but whatever the correct algorithm is, we should be able to implement it in linear time, for example. If the best algorithm found is one that is computationally impractical, it is probably the wrong algorithm. That's an odd puzzle, because the problem really does seem to be very complicated.

Yes, this is what I was trying to get at. Researchers have been avoiding undirected models because they are hard to engineer, but, since people probably use a resource-limited approximation algorithm anyway, this isn't really a justification for completely ignoring undirected models. There are still practical issues, however, in interpreting an implementation that relies on lots of approximations: is the behavior of our implementation due to the model, or to the kind of approximation our inference makes, or both?

For me, at this point, the object of study is the data available to children. What kinds of assumptions about the shape of linguistic structures turn that data into good evidence for structures? The algorithm the child uses is in principle separate from the assumptions about possible linguistic structures the algorithm makes; or, to put it another way, there are many different algorithms that rely on the same set of assumptions (or biases, or prior expectations) about what linguistic structures look like, and I'd like to tackle the issue of learning biases without having to get the details of the learning procedure right. The clearest way to do this is to use a really accurate inference procedure, so we can be confident that the conclusions about linguistic structure that our implementation reaches after seeing data are due to the model assumptions.

So what's left at the actual hierarchical structure? Core relationships like modification and argument-selection. Maybe almost nothing else.

This sounds reasonable to me. In practice, the grammar induction systems that people have gotten to work have relied on dependency grammar or CCG representations that focus on modification and argument-selection. Grammar induction systems that model the details of linear order have not worked.

(There is a complication here in that obviously we need some kind of hierarchical structure to begin interpreting in the first place, such as recognizing "try to" as a control structure rather than some other linear string. But I'll set that aside for now. It's probably some kind of bootstrapping.)

In short: I think syntacticians are looking at the wrong thing.

So for your question, this suggests that building up a bunch of subtrees from surface form is the wrong approach, which is why it is computationally difficult and not especially reliable.

I'm a bit confused by this suggestion. Building up a bunch of subtrees from surface forms is not that hard and can be done relatively efficiently. What is hard is estimating probabilities for those subtrees if we don't constrain the probability score of each piece to be a conditional probability. Or maybe I haven't understood your suggestion?

Yes, this is what I was trying to get at. Researchers have been avoiding undirected models because they are hard to engineer, but, since people probably use a resource-limited approximation algorithm anyway, this isn't really a justification for completely ignoring undirected models. There are still practical issues, however, in interpreting an implementation that relies on lots of approximations: is the behavior of our implementation due to the model, or to the kind of approximation our inference makes, or both?

The word "approximation" bothers me. Given the fact that children do learn language, they probably end up adapting it to the most efficient form for processing in some sense (broadly), so it probably isn't something that needs to be approximated-- it probably is relatively optimized, as a system.

Of course in terms of surface form, they are approximating something, such as distribution probabilities.

So I think we agree on this.

Quote

For me, at this point, the object of study is the data available to children. What kinds of assumptions about the shape of linguistic structures turn that data into good evidence for structures?

I agree, but the details are crucial: how is that input interpreted? What is it mapped onto?

I would tentatively say something like: they are identifying relationships in the sentences (dependencies) and mapping them probabilistically to forms. I wouldn't say that they are analyzing surface forms directly.

Quote

This sounds reasonable to me. In practice, the grammar induction systems that people have gotten to work have relied on dependency grammar or CCG representations that focus on modification and argument-selection. Grammar induction systems that model the details of linear order have not worked.

That sounds right to me. I'm not suggesting a purely linear model. Instead, I'm suggesting one that is designed to intelligently/complexly map a linear form onto a hierarchical structure. The indirectness is what will help.

Quote

I'm a bit confused by this suggestion. Building up a bunch of subtrees from surface forms is not that hard and can be done relatively efficiently. What is hard is estimating probabilities for those subtrees if we don't constrain the probability score of each piece to be a conditional probability. Or maybe I haven't understood your suggestion?

I'm saying that the subtrees are the wrong objects to be working with, that they don't exist in an algorithm of language. Instead, the assembly should be done at the level of something like conceptual structure (dependencies, whatever) then that should be mapping onto a linear form.

Let me explain with an example of coordination. I think the actual structure is something like this: "a and b". I'm not certain about the tree, but I am fairly confident that the word "and" is not a node in it, not at the relevant level. Then, as a coordinated structure, "and" gets inserted during linearization.

(Don't take my ideas too seriously yet-- I'm working out a lot of details/complications.)

So in short, what I'm suggesting is that attempting to map 1-to-1 from words to nodes in a subtree is a bad plan. The much better approach is to work indirectly, sort of like in machine translation: going from source language to semantic metalanguage to output language.

Yes, this is what I was trying to get at. Researchers have been avoiding undirected models because they are hard to engineer, but, since people probably use a resource-limited approximation algorithm anyway, this isn't really a justification for completely ignoring undirected models. There are still practical issues, however, in interpreting an implementation that relies on lots of approximations: is the behavior of our implementation due to the model, or to the kind of approximation our inference makes, or both?

The word "approximation" bothers me. Given the fact that children do learn language, they probably end up adapting it to the most efficient form for processing in some sense (broadly), so it probably isn't something that needs to be approximated-- it probably is relatively optimized, as a system.

Of course in terms of surface form, they are approximating something, such as distribution probabilities.

So I think we agree on this.

Quote

For me, at this point, the object of study is the data available to children. What kinds of assumptions about the shape of linguistic structures turn that data into good evidence for structures?

I agree, but the details are crucial: how is that input interpreted? What is it mapped onto?

I would tentatively say something like: they are identifying relationships in the sentences (dependencies) and mapping them probabilistically to forms. I wouldn't say that they are analyzing surface forms directly.

Hmm, maybe my confusion has to do with the terms "surface form" and "distribution probabilities" that are in "terms of surface form." What do you consider to be surface form? Sometimes people use "surface form" to mean the string that is generated, and sometimes they use "surface form" to mean the whole structure after a series of operations, like transformations, have been applied. These models define probabilities over structures; they're not just relying on n-gram statistics from the string.

Many models take the input string, separate it into chunks, and then attempt to assign those chunks to a structure. I'm suggesting that those chunks will never match the correct structure, because the structure operates at a different level. Of course it's related, so you'll get partial accuracy, but it won't ever be 100% accurate working with only chunks of the input string.

On the other hand, maybe it just isn't possible to partial out specifically linguistic influences (subject-verb agreement) from general knowledge (movie-popcorn agreement) in string probabilities.

I love it. I'm going to use that phrase next time I teach competence.

I've never been a syntactician, but certainly this is the direction that phonology has been moving for the last ten years or so. Optimality Theory was an extremely ambitious formalism in its day, but it has more or less crumbled under its own weight. Most contemporary work attributes a great deal more importance to lateral forces: acoustic and articulatory mechanics, repetition, metalinguistic awareness, even "knowledge" located not in individuals but rather in groups collectively. I'm biased here of course, and I'd love to see phonology reclaim its seat as the heart of language theory, but if the field offers anything to linguistics generally, it's the notion that we really need to start considering a wider variety of *things* to constitute language.

I'm also struck by the (very non-scientific) experience of raising a couple of language-acquiring kids. Your average child is "fluent" in pragmatic function long, long before anything like compositional syntax shows up. That seems significant.

Many models take the input string, separate it into chunks, and then attempt to assign those chunks to a structure. I'm suggesting that those chunks will never match the correct structure, because the structure operates at a different level. Of course it's related, so you'll get partial accuracy, but it won't ever be 100% accurate working with only chunks of the input string.

Haha, what do you call an ambiguous typo? I thought "fro" was supposed to be for "from", and djr33 was asking how you got to talking about phonology when I started by talking about syntax (although in principle the same issue applies in computational modeling of phonology acquisition, there just hasn't been as much work on unsupervised, large scale speech recognition as there has been on unsupervised grammar induction).

(although in principle the same issue applies in computational modeling of phonology acquisition, there just hasn't been as much work on unsupervised, large scale speech recognition as there has been on unsupervised grammar induction).

Actually, there is a parallel here: syntax still separates the input into chunks/sections to be modelled hierarchically. But a lot of work in speech recognition for phonology uses something like "landmark theory" to instead look for important regions rather than just raw "phones"/segments in the signal. That's a little more like what I'm suggesting for syntax, in a sense.

Quote

Maybe you are talking about Steven Abney's 1991 "parsing by chunks" paper? Or Elias Ponvert's chunking-based parser? Or, if I may be so vain, my own paper on learning syntactic chunks using words and acoustic measures? Or was there another example you had in mind? (a serious question, since I can't think of any other chunk-based parsers with a cognitive spin).

Not quite. I don't mean "chunk" as a technical term like that. I'll define it here:

Input: a linear string (or acoustic signal), an "utterance"
Chunk: any linear subsection of that input, such that the set of all chunks is equal (when recombined linearly) to the input.

If the input is:

abcdefgh

Then the chunks might be:

ab + c + d + e + fg + h
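The definition of a chunking can be stated as a tiny check (my own illustrative sketch): a chunking is any split of the input into contiguous pieces that concatenate back to the original string.

```python
def is_valid_chunking(chunks, utterance):
    """True iff the chunks, concatenated in order, recover the input."""
    return "".join(chunks) == utterance

print(is_valid_chunking(["ab", "c", "d", "e", "fg", "h"], "abcdefgh"))  # True
print(is_valid_chunking(["ab", "d"], "abcdefgh"))  # False: pieces are missing
```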

Obviously an initial parse would start with something like that, but I am suggesting that (at least wondering whether) the best approach for working out hierarchical structure (as children learn it) is not to use those chunks as nodes. Instead, there is a mapping process from the linear sequence (via bootstrapping of some sort?) onto a cleaner hierarchical structure.

Some chunks become nodes in the hierarchical structure, while others are operators or indicators of other properties of that structure.

To take one example, consider "respectively":

John and Mary ate pie and cake respectively.

Clearly "respectively" is some sort of instruction to the parser to build up a meaning along the lines of:

John ate pie and Mary ate cake.

It is, in my opinion, bizarre to come up with a hierarchical semantic structure that directly involves the word "respectively" as if it has direct semantic value-- instead it has parsing value.

But... I realize this is all speculative and I haven't worked out the details!!

For or from, I'm not really following the question I'm afraid. He said "maybe it just isn't possible to partial out specifically linguistic influences", and I said (to paraphrase) "this is the rising opinion in phonology". That didn't seem like a particularly large jump to me when I wrote it.

I'd love to see phonology reclaim its seat as the heart of language theory

Was that just a casual aside or did you mean that there was some implication it might happen?

If the past is any guide to the future, it will almost certainly happen. Linguistics has a very long history, and (like everything else) phonology has risen and fallen in prominence several times. In fact, before Chomsky, syntax wasn't really seen as requiring much explanation.

Though the generative turn shifted attention strongly away from phonology and morphology, generative models have actually been extremely productive in phonological research. But, they've never been quite sufficient. The lines have always been blurrier, and phonologists have generally needed to appeal to "something else" in addition to computational formalism in order to make the data work. Jkpate speculated here that syntax too might need a something else, and I'm curious (professionally and personally) to see if that ends up being true.