29 March 2006

Deepak brings up a great discussion topic that really needs its own post. Hopefully I'll catch this here before anyone continues to reply there (don't). The basic question is this:

Why cannot people simply use heuristicy and hackery approaches that have been proven over years to work well?

Deepak's point, which I think is well taken and should not be forgotten, is that a simple "hacky" thing (I don't intend "hacky" to be condescending...Deepak used it first!) often does at most epsilon worse than a mathematically compelling technique, and sometimes even does better. I think you can make a stronger argument. Hacky things allow us to essentially encode any information we want into a system. We don't have to find a mathematically convenient way of doing so (though things are better now that we don't use generative models for everything).

I think there are several responses to these arguments, but I don't think anything topples the basic point. But before I go into those, I want to say that I think there's a big difference between simple techniques and heuristicy techniques. I am of course in favor of the former. The question is: why shouldn't I just use heuristics? Here are some responses.

1. That's just how I am. I studied math for too long. It's purely a personal choice. This is not compelling on a grand scale, but at a personal level, it is unlikely one will do good research if one is not interested in the topic and approach.

2. We don't want to just solve one problem. Heuristic techniques are fantastic for solving a very particular problem. But they don't generalize to other similar problems. The extent to which they don't generalize (or the difficulty to force generalization) of course varies.

3. Similar to #2, I want a technique that can be described more briefly than simply the source code. Heuristic techniques by definition cannot. Why? I'd like to know what's really going on. Maybe there's some underlying principle that can be applied to other problems.

4. Lasting impact. Heuristics rarely have lasting (scientific) impact because they're virtually impossible to reproduce. Of course, a lasting impact for something mathematically clever but worthless is worse than worthless.

...a more complicated model gives very little improvements and generally never scales easily.

I think it's worthwhile separating the problem from the solution. The issue with a lot of techniques not scaling is very true. This is why I don't want to use them :). I want to make my own techniques that do scale and that make as few assumptions as possible (regarding features, loss, data, etc.). I think we're on our way to this. Together with John and Daniel, we have some very promising results.

...working on such (SP) problems means getting married more to principles of Machine Learning than trying to make any progress towards real world problems.

...a lot of lot of smart (young?) people in the NLP/ML community simply cannot admit the fact that simple techniques go a long way...

This is very unfortunate if true. I, for one, believe that simple techniques do go a long way. And I'm not at all in favor of using a steamroller to kill a fly. But I just don't feel like they can or will go all the way in any scalable manner (scalable in O(person time) not O(computation time)). I would never (anymore!) build a system for answering factoid questions like "how tall is the Empire State Building?" that is not simple. But what about the next level "how much taller is the Empire State Building than the Washington Monument?" Okay, now things are interesting, but this is still a trivial question to answer. Google identifies this as the most relevant page. And it does contain the necessary information. And I can imagine building heuristics that could answer "how much Xer is X than Y" based on asking two separate questions. But where does this process end? I will work forever on this problem. (Obviously this is just a simple stupid example which means you can find holes in it, but I still believe the general point.)
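To make the "two separate questions" heuristic concrete, here is a minimal sketch. Everything in it is an assumption for illustration: the `FACTOID_ANSWERS` table stands in for a simple factoid system, the heights are approximate, and the regular expression is just one hacky pattern one might write.

```python
import re

# Hypothetical stand-in for a simple factoid QA system.
# Heights in feet; approximate values, for illustration only.
FACTOID_ANSWERS = {
    ("tall", "empire state building"): 1454.0,
    ("tall", "washington monument"): 555.0,
}

# Hacky pattern for "how much Xer is X than Y?".
# Assumes the comparative is simply adjective + "er".
COMPARATIVE = re.compile(
    r"how much (?P<adj>\w+)er is (?:the )?(?P<x>.+?) than (?:the )?(?P<y>.+?)\??$",
    re.IGNORECASE,
)


def answer_factoid(adjective, entity):
    """Stand-in for asking 'how <adjective> is <entity>?'."""
    return FACTOID_ANSWERS.get((adjective, entity.lower()))


def answer_comparative(question):
    """Answer a comparative question by asking two factoid questions."""
    match = COMPARATIVE.match(question.strip())
    if match is None:
        return None
    adjective = match.group("adj").lower()
    x_value = answer_factoid(adjective, match.group("x"))
    y_value = answer_factoid(adjective, match.group("y"))
    if x_value is None or y_value is None:
        return None
    return x_value - y_value


print(answer_comparative(
    "how much taller is the Empire State Building than the Washington Monument?"
))
```

The holes show up immediately: "bigger" is stripped to "bigg", "heavier" does not even end in the right stem, and every such case needs another special rule. That is the sense in which the process does not obviously end.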

24 comments:

"... a lot of lot of smart (young?) people in the NLP/ML community simply cannot admit the fact that simple techniques go a long way ..."

I suspect this sentiment is much wider than smart/young and NLP/ML. David Hand has a paper ("Classifier technology and the illusion of progress") dealing with this phenomenon (see reference R122 on his home page and an abstract here). I believe that the academic reward system promotes complexity over practicality, so that fields of study tend to become more esoteric over time until eventually they asphyxiate due to the intellectually rarefied atmosphere having become a vacuum.

Despite that belief I am quite certain there are some problem domains where a scientific understanding of the phenomena requires complex techniques/models. Take text indexing as an example. Latent Semantic Indexing is unreasonably successful on average. Can the meaning of a text really be captured by treating it as a bag of words? I am absolutely convinced that capturing the semantics of text requires more complexity (possibly many orders of magnitude more complexity) than LSI. That level of complexity will ultimately have to be reflected in our scientific understanding. Will that level of complexity ever be economically justifiable in a practical problem solving context? That is much less certain.

Both approaches have their place, but I would prefer it if authors would indicate whether their objectives are practical or scientific, so that I know which evaluation criteria to apply.

My advisor Eduard Hovy thinks of the following: There's a complicated relationship between simple Wal-Mart vs. complex Saks approaches and the amount of storage and compute power you have available.

If it is the case that you can get essentially the same results both ways, but the former takes longer and is more wasteful, then the decision about which to use depends simply on the economics of storage and compute power. Deepak believes (and I do too) that the economics and Moore's Law mean a lot more problems are amenable to Wal-Mart than most people think, including MT (for all its complexity, Wal-Mart is in effect Franz Och's approach to MT, slightly modified by storing only ngram pairs that are 'useful' and not longer than (say) 5 words). (The MT case is true only when there is a lot of training material easily available though.)

Wal-Mart is ok even for precomputing all summaries, at all lengths, of every doc, etc., though that does seem a little excessive, even if you did have an oracle that could tell you which summary variant is likely to be needed how many times over the next 100 years. But the innate complexity of some problems prohibits Wal-Mart from ever being used. If no-one has ever done the transformation computation before you, then there is no training data available and no examples, and then your system has to do the (probably expensive) work. So you have to go Saks. (The first guy to compute the log tables had to do it the hard way; all subsequent guys could store his results in a table and use that.)

We have new papers coming up in almost every NLP conference about a different way to perform syntactic parsing and chunking. (I am not going to argue here whether syntactic parsing and chunking are useful.) Agreed, some of the older techniques are not mathematically very strong -- but my point is that even the new techniques do not perform that well.

It reminds me of the NLP days of the late 80s and early 90s, when many people did not evaluate their stuff. I am also going to argue that we have trained so many models on the Penn Treebank that I strongly suspect overfitting, and hence the results make no sense.

Hal's comments: ...That's just how I am. I studied math for too long. It's purely a personal choice. This is not compelling on a grand scale, but at a personal level, it is unlikely one will do good research if one is not interested in the topic and approach...

The above argument does not seem very logical to me. I have relatively good math skills, but I am not going to write esoteric papers with complicated math if it does not produce superior results. One clear example is working on generative models (Bayesian learning) for supervised learning. Why even work on these (generative) problems when we have better and faster discriminative solutions?

Prof. Hand was kind enough to pass me an unpublished draft of his paper: "Classifier technology and the illusion of progress". It aptly states something I truly believe in: ...increasing model complexity leads to decreasing rate of improvement.

Hal says: We don't want to just solve one problem. Heuristic techniques are fantastic for solving a very particular problem. This is the argument given all the time. But we know for sure that we are all working on such hard problems that the "one model, many problems" approach is not going to be realized for a long, long time. Every model (even mathematically sound ones) makes assumptions anyway.

For O(person time) I believe simple models are still the best. I do not think a complicated mathematical model can be developed faster. Coding time is worse!

I saw the postings. Pretty good stuff you wrote there. One point that troubled me was that people think that complicated models are actually modeling "understanding", which I don't think is the case. Rather, the simple Wal-mart approaches are more about "understanding", because the most striking features of languages are modeled and modeled well. The inherent properties of "understanding" probably could not be modeled correctly with some mathematical functions, or conjunctions of them. Because these properties are not that evident to "a human eye", how does one make the leap between them and some mathematical/structural representations, and make sure the mappings are correct?

First, I think there's a difference between "simple" approaches and "memorization" approaches. If I have a summarization system and Google wants to use it, I don't care if they precompute and store everything or compute things on the fly. (In practice, you'd probably want to trade off storage for runtime and do some smart indexing.) The question is: where do these original summaries come from? Similarly, I don't really think phrase-based MT is the Wal-Mart approach. Perhaps more so than, say, Interlingua-based MT. But, to me, Wal-Mart MT would be "store all parallel sentences; when a new F sentence comes in, find the closest match in your corpus and output the translation." Phrase-based MT is (IMO) actually *too* complicated :). Or, at least the current models we have for it. (More specifically, its complexity is in the wrong places.)
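As a rough illustration of what such a "store everything, return the closest match" translator could look like, here is a minimal sketch. The three-sentence corpus, the function names and the use of token-level `difflib.SequenceMatcher` as the notion of "closest" are all assumptions made for the example, not anything proposed in the discussion.

```python
from difflib import SequenceMatcher

# Tiny hypothetical parallel corpus of (foreign, English) sentence pairs.
PARALLEL_CORPUS = [
    ("le chat dort", "the cat sleeps"),
    ("le chien mange", "the dog eats"),
    ("je vois le chat", "i see the cat"),
]


def similarity(sentence_a, sentence_b):
    """Token-level similarity in [0, 1]; one arbitrary choice of 'closeness'."""
    return SequenceMatcher(None, sentence_a.split(), sentence_b.split()).ratio()


def memorize_and_translate(foreign_sentence, corpus=PARALLEL_CORPUS):
    """Return the stored translation of the closest stored foreign sentence."""
    _, best_target = max(
        corpus, key=lambda pair: similarity(foreign_sentence, pair[0])
    )
    return best_target


print(memorize_and_translate("le chat dort bien"))
```

Note that the system can only ever output a translation it has already seen, so any input that is not close to a stored sentence gets a wrong answer with no way to repair it.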

I think regarding the math/empirical results/etc., everyone has his or her own threshold for believability: the minimum evidence they require to believe a technique is useful (i.e., decide to try to apply it to one's own problem). For some people this is "you can prove a nice theorem about it and I agree with the assumptions"; to some people this is "it improves state of the art performance on some benchmark"; to some it is "easy to implement"; to others it is "psychologically plausible."

For me, I fall somewhere between the "theorem" statement and the "empirical" statement. I will forgive a small lack in one for strong evidence in the other. Why? None of these pieces of evidence is ever going to be fully sufficient. I see both of these as sanity checks. The "empirical" part tells me that it works on one problem (essentially, the bias is correct for some problem), while the "theorem" part tells me that it's probably not completely insane to believe it will also work (reasonably well) on another problem. I'm curious, Deepak: what is your threshold? :)

I agree that one model for everything is not going to be realized for a long time (I hesitate to even say that it is possible). I disagree that a one-model-for-many-things approach is not going to be realized for a long time. I currently have an approach for structured prediction that works really well on sequence labeling tasks, coreference resolution and automatic summarization. I can see how (though I don't have the time) to apply it to MT, parsing and other NLP tasks, as well as a few things in vision and biology. It's also very easy to implement, assuming you have access to some pre-implemented binary classifier.

Wrapping up for now, I think that many of us probably agree on most things (reusing the same data sets, etc.). I think this is a really useful discussion, but for the purposes of continuing it, it might be worth trying to semi-formally define what we mean by a Wal-Mart approach and a Saks approach. Based on comments, I get the impression everyone has a slightly different opinion.

Incidentally, the Hand paper (which I largely agree with: in fact, several posts so far have talked about many of the issues pointed out in that paper) seems to focus on improved machine learning techniques, which I think many people here associate with "overly complex" techniques. But I don't really think that anything said there does not also apply to the increasing complexity of more "practical" solutions, such as rule based translation (for specific domains), or any other "practicality" focused solution. This is the typical 90/10 rule, where one year will get you 90% of the way and you need 10 more years to get the last 10%. This last 10% will involve tons of complexity, regardless of how you solve it and regardless of whether these are "theoretical" improvements or practical ones.

It is clear from the posts that people have different definitions about Wal-mart and Saks approaches.

Answering Hal's question: I do care more about getting good results on benchmark tests. I don't care about theorem proving anymore. (My take on doing things the sound way is that there are assumptions in every case. For me, having 1 assumption or 10 assumptions makes no difference empirically.)

However (I might sort of contradict myself here), I believe there is practically *no* difference between two systems whose output numbers are 88.3% and 89.2% (these numbers are made up BTW for illustration purposes) on algorithms that have been tested on the same test sets over several years. Clear examples of such problems are parsing, chunking and sequence labelling.
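A rough way to see how small such a gap can be is to compare it with plain sampling noise. The sketch below is only a crude approximation under stated assumptions: the test set size `n_test` is hypothetical (none is given here), the two systems are treated as independent samples even though in practice they are scored on the same items, and it says nothing about the separate worry of reusing one test set for years.

```python
import math


def accuracy_gap_z(acc_a, acc_b, n_test):
    """Approximate z-score for the gap between two accuracies.

    Assumes each accuracy is a binomial proportion measured on n_test
    items and that the two measurements are independent (a simplification).
    """
    var_a = acc_a * (1.0 - acc_a) / n_test
    var_b = acc_b * (1.0 - acc_b) / n_test
    return (acc_b - acc_a) / math.sqrt(var_a + var_b)


# Hypothetical test set sizes; the accuracies are the made-up ones above.
for n_test in (1000, 50000):
    z = accuracy_gap_z(0.883, 0.892, n_test)
    print(n_test, round(z, 2), "beyond 1.96" if abs(z) > 1.96 else "within noise")
```

Under these assumptions the same 0.9 point gap is within noise on a small test set and well beyond it on a large one, so the raw numbers alone do not settle whether the difference is real.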

Now coming back to the 90/10 rule: ...typical 90/10 rule, where one year will get you 90% of the way and you need 10 more years to get the last 10%.

Now, we have a big problem in the NLP community here. A lot of us work on the same test set for several years and hence 10% progress made over 10 years is sometimes questionable. I would call it more like the 98-2 rule. One makes 98% of the actual progress in the first year and then people make 2% progress over the next 10 years.

My definition for Wal-mart and Saks approaches:

Wal-mart approach: For me it involves quick, dirty, hacky and heuristicy approaches to get the first 98% (of the 98-2% rule) and then I try to work on scaling it up. [As a research direction I have chosen to work on unsupervised/semi-supervised approaches to work with large data.]

Saks approach: Doing it the *correct* way. It may involve proving theorems, making only minimum assumptions and possibly getting more than the first 98% performance (assuming 100% is the upper ceiling achieved by mature algorithms on the same problem over several years of research.)

So the thesis of my original post and (perhaps this one too) is:

Why work on 2% gain problem over 10 years when we have zillion different problems to work on? I agree that working on hacky techniques is not intellectually stimulating. However, there are still many creative ways of coming up with new features for many new problems that will keep us engaged for years.

I completely agree on improving on stationary test sets and will probably post about this topic at some point. But this is really a concern with the treebank based evals. Everyone else (MT, summarization, IE, QA, etc.) has yearly evals where the data changes. There's also the question of whether these improvements are real.

I agree about making assumptions, but the hacky approach is also making assumptions. They're just never written down. A big assumption made, for instance, is: the data I'm hacking on is the only data I ever need to do well on. This assumption scares me. A lot.

Your definitions about Wal-mart and Saks agree with what I have in my head. And I think I (largely, at least) agree about working on fun new problems. Where I disagree is on the claim that for many interesting problems we can get 98% (or even just 90%) of the way there in a year of work (hackily or correctly). For many things it seems more like we can get 50% of the way there easily. And then the question is whether we're satisfied with 50% or not. (How you define the %ages is not so important. In fact, for many problems for which current systems report "95% accuracy," this is an artifact of the cost function, not of "real system performance.") For instance, though QA systems report high accuracies on TREC-style data, we're clearly very far from actually solving the QA problem.

One answer to your thesis at the end is: yes, we have a zillion different problems. This is precisely why I don't want to hack up a solution to each one individually!

Last summer I was working on all sorts of ways to train a dependency parser. Being a person who likes math, I studied all sorts of complicated models. In the end I discovered that many training methods gave similar results and the real impact came from defining good features. One may call such feature engineering hacky, but now I have a deeper appreciation for those who have the linguistic insight to come up with simple features for a particular problem.

I liked Deepak's statement: "Why work on 2% gain problem over 10 years when we have zillion different problems to work on". Can you tell me what are the zillion different problems, though? (I'm in thesis-searching mode now... :) It would be cool if we could identify those new problems--it seems that it's often easier to work on the same problem (and use the same test set) than identifying a new important area. Finding a problem requires true innovation; solving a problem can be done with either hacky or complex solutions.

"In the end I discovered that many training methods gave similar results and the real impact came from defining good features."

That is the essence of David Hand's paper, that if you partition the variance in the accuracy over studies into sources, you find that the importance of algorithms is pretty low on the list and having better data is pretty near the top.

My point was that academic reward systems tend to reward pretty equations and that feature engineering tends to be invisible or seen as ad hocery to be ashamed of.

Several writers made statements along the lines of "The first 10% of effort yields 90% of the benefit." I think it is actually worse than this. Say you have a system that is at the 10%-effort/90%-results point. Now you add on some more complexity to the model that is intended to improve performance (it may be an addition to the algorithm or maybe some more feature engineering). At this point you don't know what the "true" end point is, so it is unlikely that your addition is on the shortest path between the current form and the ultimate form of the system. This off-path component is equivalent to injecting some noise in the system, so you may actually be worse off adding to the model even if it is qualitatively of the right form.

For example, you may have a system based on a bag-of-words representation so you decide to augment the input by somehow encoding the meaning of the text. Unfortunately, the way you have chosen to encode the semantics is not the right way for this problem and the noise introduced by the mismatch of the semantic representation overwhelms the increase in performance due to incorporating semantics, so your system accuracy drops.

Kevin: I doubt anyone would be really surprised at your situation with the parser. I think, given this experience, that the request to the machine learning people is for an SP framework that enables us to incorporate whatever kinds of features we want and not worry about messy things like efficiency and learning and the like. You just want it to work.

Ross: Perhaps a second issue (and this relates to Kevin's request for topics) is that the academic system rewards staying inside the box and not branching out to "odd" problems. This means that the only possible impact is by incremental improvements on the state of the art.

It does seem a bit strong to say that any step off the shortest path is equivalent to noise. I fear that such arguments, while often reasonable, are on a very slippery slope. My concern is that it's very easy to conclude that anything done to improve an existing system should not be done. But clearly progress has been made in lots of problems this way. As long as you're not doing human-in-the-loop learning (i.e., running your system against test data, adding more features, running against the same test data, and so on), I feel like you're fairly safe. It is common practice to either change the test set every year or at least to do the feature engineering against a held-out development set. Sure, you're at a slight risk of overfitting, but this seems better to me than to never actually solve any problem.

the academic system rewards staying inside the box and not branching out to "odd" problems. This means that the only possible impact is by incremental improvements on the state of the art.

Readers of this blog might be interested in Don Braben. He believes that peer-review funding systems are strongly biased in favour of incremental research over revolutionary research. He has some very interesting case studies from a period when he developed an alternative methodology for assessing research funding requests and actually distributed significant amounts of money on that basis.

Hal also said

My concern is that it's very easy to conclude that anything done to improve an existing system should not be done. But clearly progress has been made in lots of problems this way. ... This seems better to me than to never actually solve any problem.

I would rather state that differently: Under what circumstances is it appropriate to cease attempting to incrementally improve an imperfect system?

If you are approaching the problem as a practical exercise then you are entitled to use goodness of fit as your primary metric. When you first approach the problem you don't know where you are on the performance vs complexity curve (and I don't think that's the same thing as overfitting), so you can try out improvements with a clear conscience. When you get to the point where adding more complexity mostly doesn't improve the model then you should probably stop trying in that direction because you are likely to be wasting someone's time and money. In a practical setting I would either move on to a different problem or find some way to reconceptualise the current problem so that I am at the beginning of the curve again but the line of development points in a different direction to before, so I can expect some growth in performance before I hit the wall.

If you are approaching the problem as a scientific exercise then I would not take goodness of fit as the only or primary assessment criterion. If you have to add on some theoretical complexity to make the model internally coherent or coherent with some external theory it may decrease the goodness of fit of your model. That would be inconvenient, but not an adequate reason to drop the theoretical extensions. A lot of theory development is effectively a long term bet that a particular conceptual framework will turn out to be more productive (and accurate) in the long run. The gap between proposal of a theory and reasonable empirical confirmation of that theory (suitably elaborated and refined) may be decades. So if you're in this game for the science of it you need to evaluate progress differently from someone who is in the game purely for practical outcomes.

You could draw a link between this point and the first point by noting that the institutional bias to incrementalism can be viewed as the (inappropriate) application of a practical evaluation metric to scientific problems.

Ross: I really like your post. The first point is essentially a "trying to get to the moon by climbing a tree" argument. I think it's often really hard to stop climbing the tree, even if you seem to be hitting the top, because it's a long way down. I think regardless of whether you are practically oriented or scientifically oriented, this can be a challenge, essentially because it can take a while to climb back up. What is the best mechanism to avoid this in the academic world?

Hey Ross and Hal--thanks for the reference to the Braben book. I like his quote: "An expert opinion is one thing; the consensus of experts is another." (ps: This reminds me of a common phenomenon in training mixture of experts in machine learning--you want lots of diversity among your experts to achieve good accuracy!) I think this discussion has gone from the debate on heuristics specifically to a more general forum on what it takes to do good research, to be innovative, etc. I'll summarize some of my own take-home messages so far:

- There are Wal-mart style and Saks approaches to solving problems. Some think that simple Wal-mart style approaches get you most of the way. However, it seems that we all have different opinions on what approach is Wal-mart and what is not.

- Quoting Hal: "Everyone has his or her own threshold for ... the minimum evidence they require to believe a technique is useful... For some people this is 'you can prove a nice theorem about it and I agree with the assumptions'; to some people this is 'it improves state of the art performance on some benchmark'; to some it is 'easy to implement'; to others it is 'psychologically plausible.'"

- It is difficult to find new problems; further, peer-review systems tend to disfavor thinking out-of-the-box, so both new problems and new solutions have less chance to be pursued. Innovation comes from dissent; it comes from the feeling "Why do we have to do things like this?". I think this is true whether we have a Wal-mart or Saks approach to solutions.
