I co-wrote this paper during the first summer I started doing NLP research, but it didn’t see the light of day until a year after I’d finished my Master’s degree. Yeesh!

It all started when I decided to spend the Summer of 2001 (between my junior and senior years at Stanford) doing research at UC San Diego with Charles Elkan. I’d met Charles through my dad on an earlier visit to UCSD, and his research exhibited exactly the mix of drive for deep understanding and desire to solve real-world problems that I was looking for. I was also working at the time for a startup called MedExpert that was using AI to help provide personalized medical advice. Since one of the major challenges was digesting the staggering volume of medical literature, MedExpert agreed to fund my summer research in information extraction. So I joined the UCSD AI lab for the summer and started working on tools for extracting information from text, a field that I would end up devoting most of my subsequent research to in one form or another.

As it happened, one of Charles’s PhD students, Dave Kauchak, was also working on information extraction, and he had recently gotten interested in a technique called Boosted Wrapper Induction. So Dave, Charles, and I ended up writing a lengthy paper that analyzed how BWI worked and how to improve it, including some specific work on medical literature using data from Mark Craven. By the end of the summer we had some interesting results, a well-written paper (or so I thought), and I was looking forward to being a published author.

Then the fun began. We submitted the paper for publication in an AI journal (it was too long to be a conference paper) and it got rejected, but with a request to re-submit it once we had made a list of changes. Many of the changes seemed to be more about style than substance, but we decided to make them anyway, and in the process we ran some additional experiments to shore up any perceived weaknesses (by this time I was back at Stanford and Dave was TAing classes, so re-running old research was not at the top of our wish list). Finally we submitted our revised paper to a new set of reviewers, who came back with a different set of issues they felt we had to fix first.

To make a long story short, we kept fiddling with it until finally, long after I had stopped personally working on this paper (and NLP altogether, for that matter), I got an e-mail from Dave saying the paper had finally been accepted, and would be published in the highly respected Journal of Machine Learning Research. It was hard to believe, but sure enough at the end of 2004–more than three years since we first wrote the paper–it finally saw the light of day. It was the paper that would not die.

Charles had long since published an earlier version of the paper as a technical report, so at least our work was able to have a more timely impact while it was “in the machine”. I’m glad it finally did get published, and I know that academic journals are notoriously slow, but given how fast the frontier of computer science and NLP is moving, waiting 3+ years to release a paper is almost comical. I can’t wait until this fall to get the new issue and find out what Dave did the following summer. :p

Download PPT (500KB; presentation to NLP group, including work discussed in this paper)

This is another paper I wrote that didn’t get accepted for publication. Like my character-level paper, it was interesting and useful but not well targeted to the mindset and appetite of the academic NLP community. Also like my other paper, the work here ended up helping us build our CoNLL named-entity recognition model, which performed quite well and became a well-cited paper. If for no other reason, this paper is worth looking at because it contains a number of neat diagrams and graphs (as well as some fancy math that I can barely comprehend any more, heh).

One reason why I think this paper failed to find acceptance is that it wasn’t trying to get a high score in extraction accuracy. Rather, it was trying to use smaller models and simpler data to gain a deeper understanding of what’s working well and what’s not. When you build a giant HMM and run it on 1000 pages of text, it does so-so and there’s not a lot you can learn about what went wrong. It’s way too complex and detailed to look at and grok what it did and didn’t learn. Our approach was to start with a highly restricted toy domain and minimal model so we could see exactly what was going on and test various hypotheses. We then scaled the models up slightly to show that the results held in the real world, but we never tried to beat the state-of-the-art numbers. Sadly, it’s a lot harder to get a paper published when your final numbers aren’t competitive, even if the paper contributes some useful knowledge in the process.

It seems both odd and unfortunate to me that academic NLP, which is supposedly doing basic scientific research for the long-term interest, is culturally focused more on engineering and tweaking systems that can boost the numbers by a few percent than on really trying to understand what’s going on under the covers. After all, most of these systems aren’t close to human-level performance, and the current generation of technology is unlikely to get us there, so just doing a little better is a bit like climbing a tree to get to the moon (to quote Hubert Dreyfus, who famously said as much about the field of AI in general).

If companies are trying to use AI in the real world, their interest is performance first, understanding second (make it work). But in academia, it should be just the opposite–careful study of techniques and investigation of hypotheses with the aim of making breakthroughs in understanding today that will lead to high-performance systems in the future. But I guess the reality is that it’s much easier (in any discipline) to pick a metric and compete for the high score. (The race for a 3.6GHz processor to out-do the 3.5GHz competition in consumer desktop computers comes to mind, when both computers are severely bottlenecked on disk-IO and memory size and rarely stress the CPU in either case. Ok, that was either a lucid metaphor or complete gibberish, depending on who you are. :))

In any event, I enjoyed doing this research, and I’m proud of the paper we wrote.

The Semantic Web is a great idea: expose all of the information on the web in a machine-readable format, and intelligent agents will then be able to read it and act on your behalf (“Computer: When can I fly to San Diego? Where can I stay that has a hot tub? Ok, book it and add it to my calendar”). There’s just one problem: the humans writing web pages are writing them for other humans, and no one is labeling them for computers. (A notable exception is blogs, like this one, whose authoring tools also generate machine-readable versions in RSS or Atom that can be consumed by sites like Bloglines. In a way, Bloglines is one of the few sites making good on the vision of the Semantic Web.)

What do people do when they’re looking for a piece of information, say a list of syllabi for NLP classes? There’s no database that lists that type of information in a structured and curated form. Rather, there is a set of web pages that describe these classes, and they’re all a bit different. But most of them contain similar information–the title of the class, the course number, the professor, and so on. So, in a way, these pages do constitute a database of information; it just takes more work to access it.

That’s where NLP comes in. One of the ways we were using information extraction in the Stanford NLP group was to automatically extract structured information from web pages and represent it in a semantic web format like DAML+OIL and RDF. The idea is that you send your intelligent agent out to the web (“find me a bunch of NLP classes”) and when it comes across a page that looks promising, it first looks for semantic web markup. If it can’t find any (which will usually be the case for the foreseeable future), it tries running the information extraction engine on the site to pull out the relevant data anyway. If the site allows it, it could then write that data back in a machine-readable format so the web becomes semantically richer the more agents go looking for information.

Specifically, we built a plug-in to the Protege tool developed by the Stanford Medical Informatics group. Protege is a Java-based tool for creating and managing ontologies (a form of knowledge representation used by the semantic web). Our plug-in let you load a web page, run our information extraction tools on it, and add the extracted semantic information to your ontology. You could build up a collection of general-purpose information extraction tools (either hand-built or trained from data) and then use them as you found web pages you wanted to annotate.

Cynthia Thompson, a visiting professor for the year, used this system to find and extract information about educational materials on the web as part of the Edutella project. It ended up working well, and this paper was accepted to the Workshop on Adaptive Text Extraction and Mining as part of the annual European Conference on Machine Learning (ECML). I declined the offer to go to Croatia for the conference (though I’m sure it would have been a memorable experience), but I’m glad that my work contributed to this project.

Every year the Conference on Computational Natural Language Learning (CoNLL) has a “shared task” where they define a specific problem to solve, provide a standard data set to train your models on, and then host a competition for researchers to see who can get the best score. In 2003 the shared task was named-entity recognition (labeling person, place, and organization names in free text) with the twist that they were going to run the final models on a foreign language that wouldn’t be disclosed until the day of the competition. This meant that your model had to be flexible enough to learn from training data in a language it had never seen before (and thus you couldn’t hard-code English rules like “CEO of X” –> “X is an organization”).

Even though my first paper on character-level models got rejected, we kept working on it in the Stanford NLP group because we knew we were on to something. Since one of the major strengths of the model was its ability to distinguish different types of proper names based on their composition (i.e. it recognized that people’s names and company names usually look different), this seemed like an ideal task in which it could shine (see my master’s thesis for more on this work). By this time, I’d started working with Dan Klein, and he was able to take my model to the next level by combining it with a discriminatively trained maximum-entropy sequence model that allowed us to try lots of different character-level features without worrying about violating independence assumptions (a common problem with generative models like my original version). Dan’s also just brilliant and relentless when it comes to analyzing the errors a model is making and then iteratively refining it to perform better and better. The final piece of the puzzle came from my HMM work with Huy Nguyen, which let us combine segmentation (finding the boundaries of proper names in text) and classification (figuring out which type of proper name it is) into a single model.

Our paper was accepted (yay!) and Dan and I flew to Canada to present our work. This was my first NLP conference and it was awesome to meet all these famous researchers whom I’d previously read and learned from. Luckily for me, Dan was just about to finish his PhD, and he was actively being courted by the top NLP programs, so by sticking with him I quickly met most of the important people in the field. Statistical NLP attracts a fascinating mix of people with strong math backgrounds, interest in language, and a passion for empirical (data-driven) research, so this was an amazing group of people to interact with.

On the last day of the conference (CoNLL was held inside HLT-NAACL, which were two larger NLP conferences that had also merged), the big day had come at last. My first presentation as an NLP researcher (Dan let me give the talk on behalf of our team), and the announcement of the competition results. There were 16 entries in the competition. In English (the language we had been given ahead of time), our model got the 3rd highest score; in German (the secret language), our model came in 2nd, though the difference between our model and the one in 1st place was not statistically significant. In other words, had the test data been slightly different, we might easily have had the highest score.

Doing so well was certainly gratifying, but what made us even happier was the fact that our model was far simpler and purer than most in the competition. For instance, the model that got first place in both languages was itself a combination of four separate classifiers, and in addition to the training data provided by the conference, it also used a large external list of known person, place, and organization names (called a gazetteer). While piling so much on certainly helped eke out a slightly higher score, it also makes it harder to learn any general results about what pieces contributed and how that might be applied in the future.

In contrast, our model was almost exclusively a demonstration of the valuable information contained in character-level features. Despite leaving out many of the bells-and-whistles used by other systems, our model performed well because we gave it good features and let it combine them well. As a wise man once said, “let the data do the talking”. Perhaps because of the simplicity of our model and its novel use of character features, our paper has been widely cited, and is certainly the most recognized piece of research I did while at Stanford. It makes me smile because the core of the work never got accepted for publication, but it managed to live on and make an impact regardless.

As I describe in my post about my master’s thesis, I started doing research in Natural Language Processing after Chris Manning, the professor that taught my NLP class at Stanford, asked me to further develop the work I did for my class project. He helped me clean up my model, suggested some improvements, and taught me the official way to write and style a professional academic paper (I narrowly avoided having to write it in LaTeX!). I was proud of the final paper, but it wasn’t accepted (I believe we submitted it to EMNLP 02).

This was the start of a series of lessons I learned at Stanford about the difference between what I personally found interesting (and how I wanted to explain it) and what the academic establishment (that decides what papers are published by peer review) thought the rules and conventions had to be for “serious academic work”. While I got better at “playing the game” during my time at Stanford–and to be fair, some of it was actually good and helpful in terms of how to be precise, avoid overstating results, and so on–I still feel that the academic community has lost sight of their original aspirations in some important ways.

At its best, academic research embarks on grand challenges that will take many years to accomplish but whose results will change society in profound ways. It’s a long-term investment for a long-term gain. NLP has no shortage of these lofty goals, including the ability to carry on a natural conversation with your computer, high quality machine-translation of text in foreign languages, the ability to automatically summarize large quantities of text, and so on. But in practice I have found that in most of these areas, the sub-community that is ostensibly working on one of these problems has actually constructed its own version of the problem, along with its own notions of what’s important and what isn’t, that doesn’t always ground out in the real world at the end of the day. This limits progress when work that could contribute to the original goal is not seen as important in the current academic formulation. And since, in most cases, the final challenge is not yet solvable, it’s often difficult to offer empirical counter-evidence to the opinions of the establishment as to whether a piece of work will or will not end up making an important difference.

I found this particularly vexing because my intuition is driven strongly by playing with a system, noting its current shortcomings, and then devising clever ways to overcome them. Some of the shortcomings I perceived were not considered shortcomings in the academic version of these challenges, and thus my interest in improving those aspects fell largely on deaf ears.

For instance, I did a fair amount of work in information extraction, which is about extracting structured information from free text (e.g. finding the title, author, and price of a book on an Amazon web page or determining which company bought which other one and for how much in a Reuters news article). The academic formulation of this problem is to run your system fully autonomously over a collection of pages, and your score is based on how many mistakes you make. There are two kinds of mistakes–extracting the wrong piece of information, or not extracting anything when you should have–and both are usually counted as equally bad (the main score used in papers is F1, which is the harmonic mean of precision and recall, which measure those two types of errors respectively). If your paper doesn’t show a competitive F1, it’s difficult to convince the community that you’re advancing the state-of-the-art, and thus it’s difficult to get it published.
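To make the scoring concrete, here is a minimal sketch of how precision, recall, and F1 fall out of a set of extracted fields and a hand-labeled answer key. The function name and the toy book data are made up for illustration, and real evaluations usually have fussier matching rules than exact set membership.

```python
def precision_recall_f1(extracted, gold):
    """Score extracted items against a gold-standard answer key.

    Both arguments are collections of hashable items, here
    (field, value) pairs. Matching is exact, which is an assumption.
    """
    extracted, gold = set(extracted), set(gold)
    correct = len(extracted & gold)
    # Precision: what fraction of what we extracted was right.
    precision = correct / len(extracted) if extracted else 0.0
    # Recall: what fraction of what we should have found we did find.
    recall = correct / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    # F1 is the harmonic mean of the two.
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical example: three fields to find, two extracted, one correct.
gold = {("title", "Moby Dick"), ("author", "Herman Melville"), ("price", "$9.99")}
extracted = {("title", "Moby Dick"), ("author", "Melville")}
precision, recall, f1 = precision_recall_f1(extracted, gold)
print(precision, recall, f1)
```

In this toy case precision is 1/2, recall is 1/3, and F1 works out to 0.4: the wrong author and the missing price each drag the score down, and F1 does not care which kind of mistake was made.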

However, in many real-world applications, the computer is not being run completely autonomously, and mistakes and omissions are not equally costly. In fact, if you’re trying to construct a high-quality database of information starting from free text, I’d say the general rule is that people are ultimately responsible for creating the output (the computer program is a means to that end), and that the real challenge is to see how much text you can automatically extract given that what you do extract has to be extremely high quality. In most cases, returning garbage data is much worse than not being able to cover every piece of information possible, and if humans can clean up the computer’s output, they will definitely want to do so. Thus the real-world challenges are maximizing recall at a fixed high-level of precision (not maximizing F1) and accurately estimating confidence scores for each piece of information extracted (so the human can focus on just cleaning up the tricky parts), neither of which fit cleanly into the academic conception of the problem. And this is to say nothing about how quickly or robustly the systems can process the information they’re extracting, which would clearly also be of utmost importance in a functioning system.
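Here is a rough sketch of what “recall at a fixed level of precision” could look like as a metric, assuming each extraction comes with a confidence score. The function name, the scores, and the correctness labels are all invented for illustration; the idea is just to rank extractions by confidence and find the deepest cut-off that still meets the precision bar.

```python
def recall_at_precision(scored, total_gold, min_precision):
    """Find the lowest confidence cut-off that still meets min_precision.

    scored: list of (confidence, is_correct) pairs, one per extraction.
    total_gold: how many items there were to find in total.
    Returns (threshold, recall), or (None, 0.0) if no cut-off qualifies.
    Ties in confidence are not treated specially in this sketch.
    """
    ranked = sorted(scored, key=lambda pair: pair[0], reverse=True)
    best = (None, 0.0)
    correct = 0
    for kept, (confidence, is_correct) in enumerate(ranked, start=1):
        correct += is_correct
        precision = correct / kept
        if precision >= min_precision:
            # Keeping everything down to this confidence is still clean enough.
            best = (confidence, correct / total_gold)
    return best


# Hypothetical extractions: (confidence, was it actually right?)
scored = [(0.99, True), (0.95, True), (0.90, True),
          (0.70, False), (0.60, True), (0.30, False)]
strict = recall_at_precision(scored, total_gold=5, min_precision=0.95)
loose = recall_at_precision(scored, total_gold=5, min_precision=0.80)
print(strict, loose)
```

With the strict bar you keep only the top three extractions and get 60% recall; relaxing the bar to 80% precision lets you go down to the 0.60 cut-off and reach 80% recall. Everything below the chosen threshold is what you would hand to a human to check, which is why well-calibrated confidence scores matter so much here.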

I witnessed firsthand this difference between the problem academics are trying to solve and the solution that real applications need when I started working for Plaxo. A core component of the original system was the ability to let people e-mail you their current contact info (either in free text, like “hey, i got a new cell phone…” or in the signature blocks at the bottom of messages) and automatically extract that information and stick it in your address book. This would clearly be very useful if it worked well (the status quo is you have to copy-and-paste it all manually, and as a result, most people just leave that information sitting in e-mail), and it clearly fits the real-world description above (sticking garbage in your address book is unacceptable, whereas failing to extract 100% of the info is still strictly better than not doing anything). None of the academic systems being worked on had a chance of doing a good job at this problem, and so I had to write a custom solution involving a lot of complicated regular expressions and other pattern-matching code. My system ended up working very well–and very quickly (it could process a typical message in under 50 msec, whereas most academic systems are a “start it running and then go for coffee” kind of affair)–and developing it required a lot of clever ideas, but it was certainly nothing I could get an academic paper published about.

The irony cuts both ways–when I tried to solve the real problem, I couldn’t get published, but the work that was published didn’t help. And yet the academic community could surely do a much better job of solving the real problem if only they hadn’t decided it wasn’t the problem they were interested in. I only bring this up because I am a big believer in the power and potential of academic research, and I still optimistically hope that its impact could be that much greater if its goals were more closely aligned with the ultimate problems they’re trying to solve. By bridging the gap between academia and companies, both should be able to benefit tremendously.

If you’ve read this far in the hope of knowing more about the contents of my first NLP paper, I’m sorry to say it has nothing to do with information extraction, and certainly nothing to do with the academic/real-world divide. But it’s a neat paper (and probably shorter than this blog post!) and despite its not being published, the work it describes ended up influencing other work that I and people at the Stanford NLP group did, some of which did end up gaining a fair bit of notoriety in academic circles.

After four years as an undergraduate at Stanford, I wasn’t ready to leave yet. There were more classes I wanted to take, and I wanted to do more research. Since I was in the Symbolic Systems program, I was taking a mix of Computer Science, Linguistics, Psychology, and Philosophy classes for my major. I was particularly interested in CS and Linguistics, and I wanted to take many of the graduate-level classes in each department, so I really needed a fifth year at school.

During my senior year, I had started doing some NLP research with Chris Manning, which I was really enjoying. When I took his CS224N class, I did a final paper with Steve “Sensei” Patel in which we built a model to recognize unknown words as drug, company, person, place, and movie names based on their composition (e.g. “cotramoxizole” looks like a drug word, and “InterTrust” looks like a company name, and we trained our model to learn these patterns). Our model performed very well–in fact, it did better than our friends on the same tests!–and Chris asked me if I’d like to develop this research further with him. After the work we did during my senior year, he offered to fund me as a research assistant during my fifth year.

Stanford has this amazing co-terminal master’s program where you can start taking master’s classes before you finish your undergraduate degree, and so you end up getting both degrees in five years (some people, like my wife, even manage to squeeze both degrees into four years, but like I said, I wasn’t ready to leave yet). The symbolic systems program had just started offering a co-term, but it was research-based (in some departments you just have to take more classes) and so one requirement was you had to have a professor sponsoring your research and vouching that you were serious. The timing was perfect, and I was selected as one of a few students to do a research MS in SSP that year.

(That summer, I also met the founders of Plaxo and started working “part time” building some NLP tools for them. That’s another story, but let me say it’s really not possible to do research and a startup at the same time and do both of them well.)

While working in the Stanford NLP group, I spent a lot of time with Dan Klein, one of Chris’s star PhD students, who’s now a professor at Berkeley. He had a major influence on my work, as well as on me personally. During my co-term year, I also started working with a CS master’s student named Huy Nguyen. We became good friends and he’s now an engineer at Plaxo (hmm, I wonder how that happened ;)).

I wrote quite a few academic NLP papers during my time at Stanford, some of which got published and some of which didn’t. The original paper I did with Chris based on my CS224N project got rejected, but it ended up forming the core of the model Dan, Huy, and I used at the CoNLL-03 competition, which was very successful and has since been widely cited.

My thesis represents the culmination of the work I did at Stanford. Its central thrust is that you can tell a surprisingly great deal about a proper name by looking at its composition at the character-level. Most NLP systems just treat words as opaque symbols (“dog” = x1, “cat” = x2, etc.) and treat all unknown (previously unseen) words as a generic UNK word (that’s really all you can do if you’re only gathering statistics at the word-level). As a result, these systems often perform poorly when dealing with unknown words, which are increasingly common as they are applied to the untamed world-wide-web or to domains like medicine and biology that are full of specific technical words.
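To illustrate the difference, here is a tiny sketch contrasting the word-level view with a character-level one. This is not the model from my thesis, just a toy: the vocabulary, the function names, and the choice of padded character trigrams are all assumptions made for the example.

```python
def word_level_id(word, vocabulary):
    """Word-level view: every unseen word collapses to the same UNK symbol."""
    return vocabulary.get(word, vocabulary["<UNK>"])


def char_ngrams(word, n=3):
    """Character-level view: overlapping n-grams, with ^ and $ marking the
    start and end of the word so prefixes and suffixes stay distinct."""
    padded = "^" + word.lower() + "$"
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


# Hypothetical three-word vocabulary.
vocabulary = {"<UNK>": 0, "dog": 1, "cat": 2}

for word in ["dog", "cotramoxizole", "InterTrust"]:
    print(word, word_level_id(word, vocabulary), char_ngrams(word))
```

At the word level, “cotramoxizole” and “InterTrust” are indistinguishable (both map to the UNK id), so nothing can be learned about either. At the character level they produce completely different n-grams, such as “ole” and “le$” for the drug name versus “tru” and “ust” for the company name, and those are the kinds of regularities a classifier can pick up from names it has seen before.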

My research looked at a variety of ways you could exploit regularities in the character sequences of unknown words to segment and classify them semantically, even though you’d never seen them before. In addition to presenting experimental results in a number of domains and in multiple languages, I also investigated why there appears to be this sound-symbolic regularity in naming, looking at language evolution and professional brand-name creation in particular.

When my thesis was complete, I had to decide whether to apply to a PhD program to continue my research or to instead join Plaxo full-time as an engineer. As you probably know, I ended up choosing Plaxo, mainly because I really believed in the founders and the company’s vision, but also because I wanted to do something tangible that would have immediate impact in the real-world. But I still think that someday I might like to go back to school and continue doing NLP research. The way I look at it, I can’t lose: by the time I’m ready to go back, either all the interesting problems in NLP will have already been solved–in which case the world will be a truly amazing place to live–or there will still be plenty for me left to work on.