17 January 2011

I remember when I took my first "real" Syntax class, where by "real" I mean "Chomskyan." It was at USC in Fall 2001, taught by Roumyana Pancheva. It was hard as hell but I loved it. However, as a computationally minded guy, I remember snickering to myself the whole time we were talking about movements that get you from deep structure to surface structure. This stuff was all computationally ridiculous.

But why was it computationally ridiculous? It was ridiculous because my mindset, and I think the mindset of most computational folks at the time, was that of n^3 CKY or Earley style parsing. Namely, exact parsing in a context-free manner. This whole idea of transformations would kill anything like that in a very bad way.
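For readers who haven't seen it, here is a minimal sketch of where the n^3 comes from: a CKY recognizer over a grammar in Chomsky normal form. The toy lexicon and rules are made up for illustration, and the function name `cky_recognize` is just a label for this sketch, not anyone's actual parser.

```python
from itertools import product

def cky_recognize(words, lexicon, binary_rules, start="S"):
    """Return True if `words` can be derived from `start` under a CNF grammar."""
    n = len(words)
    # chart[i][j] holds the set of nonterminals that span words[i:j]
    chart = [[set() for _ in range(n + 1)] for _ in range(n + 1)]
    for i, word in enumerate(words):
        chart[i][i + 1] = set(lexicon.get(word, ()))
    # Three nested loops over positions: this is the n^3 factor.
    for width in range(2, n + 1):              # span length
        for i in range(0, n - width + 1):      # span start
            j = i + width
            for k in range(i + 1, j):          # split point
                for left, right in product(chart[i][k], chart[k][j]):
                    chart[i][j] |= binary_rules.get((left, right), set())
    return start in chart[0][n]

# Hypothetical toy grammar, for illustration only.
LEXICON = {"she": {"NP"}, "eats": {"V"}, "fish": {"NP"}}
BINARY_RULES = {("NP", "VP"): {"S"}, ("V", "NP"): {"VP"}}

print(cky_recognize(["she", "eats", "fish"], LEXICON, BINARY_RULES))
print(cky_recognize(["eats", "she", "fish"], LEXICON, BINARY_RULES))
```

The chart only ever combines adjacent spans, which is exactly why movement is awkward here: a moved constituent relates two positions that are not adjacent in the surface string.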

However, there's been a recent shift in attitudes. Sure, people still do their n^3 parsing, but of course none of it is exact anyway (due to pruning). But more than that, linear time parsing algorithms, as popularized by people like Joakim Nivre, Kenji Sagae, Brian Roark and Joseph Turian, have proved very useful. They work well, are incredibly efficient, and are easy to implement. They're also a bit more psychologically plausible (as Eugene Charniak said recently, "we don't know what people are doing, but they're definitely not doing CKY.").
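To make "linear time" concrete, here is a rough sketch of a greedy arc-standard shift-reduce dependency parser. It is not any particular author's system: real parsers replace `toy_policy` with a trained classifier, and the part-of-speech table and the policy rules below are hypothetical stand-ins that only cover the toy sentence.

```python
def arc_standard(words, choose_action):
    """Greedy arc-standard parsing. Returns ({dependent: head}, number of steps)."""
    stack, next_word, heads, steps = [], 0, {}, 0
    while next_word < len(words) or len(stack) > 1:
        action = choose_action(stack, next_word, words)
        if action == "SHIFT" and next_word < len(words):
            stack.append(next_word)        # move the next word onto the stack
            next_word += 1
        elif action == "LEFT" and len(stack) >= 2:
            dep = stack.pop(-2)            # second-from-top attaches to the top
            heads[dep] = stack[-1]
        elif action == "RIGHT" and len(stack) >= 2:
            dep = stack.pop()              # top attaches to the new top
            heads[dep] = stack[-1]
        else:
            raise ValueError(f"illegal action {action!r}")
        steps += 1
    return heads, steps

# Hypothetical stand-in for a learned classifier.
POS = {"she": "N", "eats": "V", "fish": "N"}

def toy_policy(stack, next_word, words):
    if len(stack) >= 2:
        second, top = POS[words[stack[-2]]], POS[words[stack[-1]]]
        if second == "N" and top == "V":
            return "LEFT"
        if second == "V" and top == "N":
            return "RIGHT"
    return "SHIFT"

heads, steps = arc_standard(["she", "eats", "fish"], toy_policy)
print(heads, steps)
```

Every word is shifted once and attached at most once, so a sentence of n words takes 2n - 1 actions, with no chart at all.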

So I'm led to wonder: could we actually do parsing in a transformational grammar using all the new stuff we know about (for instance) left-to-right parsing?

One thing that stands in our way, of course, is the stupid Penn Treebank, which was annotated only with very simple transformations (mostly noun phrase movements) and not really "deep" transformations as most Chomskyan linguists would recognize them.

But I think you could still do it. It would end up as being partially unsupervised, but at least from a minimum description length perspective, I can either spend weights learning more special cases, or I can learn general transformational rules. It would take some thought and effort to write it out and figure out how to actually optimize such a thing, but I bet it could be done in a semester.

So then the question is: aside from smaller models (potentially), is there any other reason to do it?

I can think of at least one: parsing non-declarative sentences. Since almost all sentences in the Treebank are declarative, parsers do pretty crappy when tested on other things. Slav Petrov had a paper at EMNLP 2010 on parsing questions. Here is the abstract, which says pretty much everything:

... We show that dependency parsers have more difficulty parsing questions than constituency parsers. In particular, deterministic shift-reduce dependency parsers ... drop to 60% labeled accuracy on a question test set. We propose an uptraining procedure in which a deterministic parser is trained on the output of a more accurate, but slower, latent variable constituency parser (converted to dependencies). Uptraining with 100K unlabeled questions achieves results comparable to having 2K labeled questions for training. With 100K unlabeled and 2K labeled questions, uptraining is able to improve parsing accuracy to 84%, closing the gap between in-domain and out-of-domain performance.

Now, at least in principle, if you can parse declarative sentences, you should be able to parse questions. At least if you know about some basic syntactic transformations in English. (As an aside, the "uptraining" idea is almost exactly the same as the structure compilation idea that Percy, Dan and I had at ICML 2008, though Slav and colleagues apply it to a domain adaptation problem, while we just did simple semi-supervised learning.)

We have observed similar effects in the parsing of commands, such as "Put your head in a noose" where parsers -- even constituency ones -- really really want "Put" to be a noun! Again, if you know simple transformations -- like subject dropping -- you should be able to parse commands if you can already parse declarations.

As with any generalization, the hope is that by realizing the generalization, you don't need to store so many specific cases. So if you can learn that commands and questions are simple transformations of declarative sentences, and you can learn to parse declaratives, you should be able to handle the other cases.
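As a very rough illustration of what "knowing" subject dropping could mean, here is a hand-written version of that one transformation over trees stored as nested tuples. Everything here is an assumption made for the sketch: the tuple encoding, the function names, and the simplification that the verb tag stays the same between the declarative and the command. The proposal above is to learn such rules, not to write them by hand.

```python
YOU = ("NP", ("PRP", "you"))

def drop_subject(tree):
    """Map a declarative (S (NP you) (VP ...)) to the command (S (VP ...))."""
    label, *children = tree
    if (label == "S" and len(children) == 2
            and children[0] == YOU and children[1][0] == "VP"):
        return ("S", children[1])
    return tree

def restore_subject(tree):
    """Inverse direction: give a subjectless (S (VP ...)) an explicit 'you'."""
    label, *children = tree
    if label == "S" and len(children) == 1 and children[0][0] == "VP":
        return ("S", YOU, children[0])
    return tree

def leaves(tree):
    """Read the words off a tree, left to right."""
    if isinstance(tree, str):
        return [tree]
    return [word for child in tree[1:] for word in leaves(child)]

declarative = ("S", YOU,
               ("VP", ("VB", "put"),
                ("NP", ("PRP$", "your"), ("NN", "head")),
                ("PP", ("IN", "in"), ("NP", ("DT", "a"), ("NN", "noose")))))

command = drop_subject(declarative)
print(" ".join(leaves(command)))
print(restore_subject(command) == declarative)
```

One rule like this covers every imperative, which is the description-length argument in miniature: the alternative is to relearn, case by case, that a sentence-initial "put" can be a verb.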

(Anticipating comments: yes, I know you could try to pre-transform your data, like they do in MT, but that's quite inelegant. And yes, I know you could probably take the treebank and turn a lot of the sentences into commands or questions to create a new data set. But that's kind of missing the point: I don't want to just handle commands or questions... I want to handle anything, even things that I might not have anticipated.)

@Suresh: I have no idea what a derivative is -- I'll look into it. Others: see here for what Suresh is talking about ;). The key claim: "This post also describes compaction (not in the draft), and makes a formal argument that the cost of parsing with derivatives is O(n|G|) on average."

I remember having the same mixed feelings that you describe when I took *my* syntax class, and I definitely agree that this kind of parsing could be really cool.

I am not certain about your motivation, though -- if all you want to do is parse questions or commands, there are many much easier ways of doing it.

Also, if we had a good model of transformations, we could potentially apply it in reverse on the treebank trees, and get all sorts of trees (including questions and commands but all the "other stuff" also), which we could then ("efficiently") parse. Might be worth a shot.

While I'm happy to see that you think adding linguistic knowledge like this would be a good idea, I'm dismayed at the apparent presupposition that the only approach to parsing is treebank-based machine learning (or unsupervised machine-learning). The logical extreme of adding linguistic knowledge is to create a hand-engineered grammar. This is not impossible!

The actual computational problem with transformational grammar is not the metaphor of transformations, but that the theoretical work is too imprecise for implementation. But there are theoretical approaches to syntax which are precise enough (HPSG, LFG, CCG, ...). It's a major failing of linguistics instruction that these are not at least mentioned at institutions where they are not practiced.

That lack of precision is I think part of what is behind Sproat's challenge.

@Kallerdis: indeed -- I had read that a while ago but hadn't thought about it recently. As far as I know, not much :).

@Emily: I agree and disagree :). I definitely think that it's unfortunate that one side (non-Chomskyan) tends to acknowledge the other, but not vice versa. That said, I at least personally got filled up with LFG stuff at CMU as an undergrad, and Roumy actually talked about TAG quite a bit during our Syntax class. But I think that's definitely the exception and I think it's unfortunate. I was ignoring stuff like LFG and HPSG in the post because when I talked about community acceptance, I really meant "ACL community" acceptance, which only once in a while has a smattering of these things.

I also agree with your assessment that things like transformational grammar are too imprecise to implement. And certainly things like CCG, LFG, etc., have gone a long way in this direction.

What I wonder is whether it's actually okay for things to be imprecise. This was actually sort of the thesis of my position talk at the Linguistics meets NLP workshop in Uppsala. Namely, that linguists are great at getting the big picture structure, but perhaps not so good at getting the low level details. But machine learning is rubbish at getting big picture structures, but really good at getting low level details. So perhaps it's actually enough to know that things like transformations exist (for whatever definition of "exist" you like -- you can read "exist" simply as "are a potential description length reduction" if you want), and maybe some of the parameters that control these things (notions like c-command and the like). And then let machine learning figure out the details. Even though I ended up getting trounced a bit at the workshop, I still feel like this is a reasonable direction to go!

I'm sorry I couldn't attend the workshop in Uppsala. I suspect if I could have been there, though, even while enjoying it, I would have been frustrated by the same thing as I was at the RING session at COLING 2010, namely, the tendency for those members of the 'ACL community' who are interested in these things---even those who know better!---to perpetuate a false dichotomy where the only two choices for sources of knowledge in parsing (that is, parsing English) are transformational grammar (and its current mainstream descendants) and the PTB.

To reply to your point, some linguists (often those who brand themselves as theorists) do seem to be doing work that is not concerned with getting the details right. You give them credit for seeing the big picture. I'd be more cynical, and say that they are doing work that is not empirical (in the original sense), and at worst not even falsifiable. But there are linguists who care very much about getting the details right. (And many, though not all, of us use computers to help us do so.)

I do not mean to say that machine learning is not interesting. I think it could be very interesting indeed to use machine learning in the service of linguistic analysis. But to start with the PTB and a vague notion of transformations is to fail to start from the state of the art. There exist broad coverage grammars (and associated parsers) in multiple frameworks --- some hand engineered, some treebank derived --- that include detailed analyses of questions, long-distance dependencies, and all kinds of wonderful constructions. Perhaps you could use data annotated with those grammars/parsers as a gold-standard in an experiment that starts with the PTB and a notion of transformations. But perhaps it would be more interesting to talk to the linguists behind those grammars and find out what they think is both missing from their accounts and difficult to get at through the manual analysis they have done so far.

(a) I totally agree about the dichotomy. This is actually why I referred to the Treebank as the "stupid" Treebank, because one thing it's done is resulted in a very weird definition of syntax that I don't think anyone would really agree with: it's not "deep" enough to appeal to Chomskyans, but not functional/lexical enough to appeal to LFG or HPSG types. I'm actually a big fan of LFG, and really enjoyed the paper at the workshop by Barbara Plank and Gertjan van Noord on grammar- versus data-driven parsing in an adaptation setting. It was a very cool way to see that "fancy" linguistic syntax can help.

(b) I don't really want to have the "is linguistics a science" debate, but you're right. Maybe I'm giving people too much credit. I can't even read Chomskyan linguistics papers any more -- I've forgotten too much to understand them at all. But I think you actually could make it falsifiable by doing what linguists on the more empirical side are doing. That said, I also think that there's a false dichotomy between Chomskyan and empirical. As Sproat's challenge hinted, there's no reason you couldn't do both.

(c) Actually that's a great question: how to combine the knowledge in human-written grammars and data!

@hal, Emily: one superficial but substantial obstacle standing in the way of integrating grammar-based and data-driven (=treebank) approaches is that of evaluation, or in other words 'proving to the "acl-community" that it works'.

Currently, the preferred (only?) evaluation metric is F-measure on the treebank, specifically on section 23. I suspect that every grammar-based parser would perform worse on this metric than a treebank-only parser.
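For concreteness, here is a small sketch of the labeled-bracket F-measure being discussed. It is a simplification, not a reimplementation of the standard evalb scorer: trees are nested tuples (an encoding assumed for this sketch), preterminal (part-of-speech) nodes are skipped, and punctuation is given no special treatment.

```python
from collections import Counter

def brackets(tree, start=0):
    """Return (Counter of (label, start, end) spans, end position) for a tree."""
    label, children = tree[0], tree[1:]
    if len(children) == 1 and isinstance(children[0], str):
        return Counter(), start + 1        # preterminal: covers one word, not scored
    spans, pos = Counter(), start
    for child in children:
        if isinstance(child, str):
            pos += 1                       # bare word directly under a phrase
        else:
            child_spans, pos = brackets(child, pos)
            spans += child_spans
    spans[(label, start, pos)] += 1
    return spans, pos

def bracket_f1(gold, predicted):
    """Harmonic mean of labeled bracket precision and recall."""
    gold_spans, _ = brackets(gold)
    pred_spans, _ = brackets(predicted)
    matched = sum((gold_spans & pred_spans).values())
    if matched == 0:
        return 0.0
    precision = matched / sum(pred_spans.values())
    recall = matched / sum(gold_spans.values())
    return 2 * precision * recall / (precision + recall)

gold = ("S", ("NP", ("PRP", "she")),
        ("VP", ("VBZ", "eats"), ("NP", ("NN", "fish"))))
# A flatter analysis that leaves the object outside the VP.
flat = ("S", ("NP", ("PRP", "she")),
        ("VP", ("VBZ", "eats")),
        ("NP", ("NN", "fish")))

print(bracket_f1(gold, gold))
print(bracket_f1(gold, flat))
```

Note that the score only rewards reproducing the treebank's own bracketing conventions. A parser whose grammar attaches things differently, even defensibly, loses points in the same way the flat tree does here.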

This is not because the grammar-based parsers would provide bad analyses -- I believe that they'd in fact be much better in many respects, but they will probably fail to capture the many idiosyncrasies and ad-hoc annotations in the treebank. I further suspect that our current, ML-based parsers get a lot of their performance advantage from fitting these idiosyncrasies very well.

@Yoav: Totally agree. But see the paper I pointed to in a previous comment for a potential way around this! Of course whether "ACL" cares about adaptation performance is another question, but I think they at least sort of maybe care.

(Incidentally, I think that it's not just machines that are "overfit" to the idiosyncrasies, but the annotators themselves, as well... have you ever read the treebank annotation guidelines???? Gives me nightmares!)

@yoav: In addition to papers like the one Hal points to (and the work of Laura Rimell and colleagues), I think what is called for is more extrinsic evaluation of parsing technology. An example of that is the Miyao et al. 2008 ACL paper (and follow-up journal paper).

Hi everyone, I'm not trying to open already opened doors, but the question of parsing evaluation is really becoming more crucial than ever. Whenever I read a report on experiments stating that this parser performs better than another using only one domain (WSJ section 23) and one metric (parseval or LAS), I always wonder whether I'm actually learning something about the capacity of a model to provide an analysis of an expressed linguistic fact (possibly learned or not) or about its capacity to optimize the score for this or that metric. Two years ago, at IWPT'09, there was a public discussion about this, and Mark Johnson said, in a tired cowboy voice, something like "Guys, we're only trying to optimize [our parsers] to get higher parseval F-score. Not more, not less". The implication being that a higher evaluation score means: 1) scientific knowledge improved => getting published; 2) the intellectual approach has been validated both empirically and academically => providing the assurance that if someone uses this for a real world task he won't get fired for doing so (this point actually comes from Owen Rambow).

Of course, everyone is aware of that, but that doesn't prevent us from chasing the Score as much as we can.

I think that what is really needed at this stage is a multilingual, framework-neutral, multidomain dataset that would provide clear insight into what to use if one needs to parse unrestricted text for a given language. I'm not certain about extrinsic evaluation (say, parser performance in a syntax-based MT system): to fully compare the parsing component, all the other modules have to be the same, so in the end we're just replacing one metric with another, which is likely to be optimized for after a couple of years (that was also said by Nizar Habash at SPMRL 2010). But it should be tried, of course; the first evaluation results will be really meaningful.

@hal: how about launching another survey on what parsers are used for (and which ones) these days?

Heilman & Smith's approach to learning tree edit models for sentence pair tasks seems relevant (at a high level, anyway). They are not seeking to learn a transformational grammar, but they do discover ways of transforming one tree into another that correlate with the nature of the relation between them. Perhaps knowing something about the edits "licensed" by a type of sentence relation could help in parsing?

There is a vast literature from the 1980s and early 1990s in ACL, COLING, IWPT, CUNY Sentence Processing Workshop, etc., that addresses (a) resource bounded parsing, (b) left-to-right parsing, and (c) transformations.

Left-to-right parsing was the whole motivation for Steedman's CCG, for instance (though it's actually broken on the left-to-right front unlike the fully associative Lambek calculus (circa 1957), which can be proven left associative).

The first paper I ever wrote in stat NLP was an IWPT paper with Chris Manning on using left-corner parsers (trained on the treebank, natch) to do bounded-memory left-to-right parsing that was more natural than shift-reduce or pure top-down from a psychological perspective (e.g., it disfavors center embeddings, which grow the stack, but allows right branching on a bounded stack).

There's a whole tradition of using rich features that goes under the heading "history based parsing".

There's also a vast literature connecting transformational grammar parsing to formalisms like CG or GPSG. I'd first look at Ed Stabler's work along those lines.

As to our not doing CKY, how about our doing sub-linear parallel CKY? If you have n**6 processors, you can process in log(n) time. If you have n**3 processors you can process in linear time. The real argument is not that our brains don't have the processing power to do CKY, but that we're not actually very good parsers without semantic coherence in a real world context.

But really, why would you want to use transformations? Have you ever actually tried to write a transformational grammar?

It's a pity you guys don't have more contact with computationally oriented psycholinguists (not me, I stepped out 2 years ago). The human mind doesn't do CKY or Earley, it also doesn't do transformations. There's no evidence for that. For the studies that found some effects of traces, other explanations exist.

Instead, there is a whole set of different ideas in psycholinguistics, but (I have to warn you) it's not ready for large scale parsing. If you want a good starting point, check out CogSci proceedings of the last few years.

If you can't be bothered with that, then also forget about transformations. They're just going to generate more alternatives than you care for.