05 July 2009

If you compare vision research with NLP research, there are a lot of interesting parallels. Like we both like linear models. And conditional random fields. And our problems are a lot harder than binary classification. And there are standard data sets that we've been evaluating on for decades and continue to evaluate on (I'm channeling Bob here :P).

But there's one difference so striking that I'd like to bring it to center stage. It has to do with "messing with our inputs."

I'll spend a bit more time describing the vision approach, since it's probably less familiar to the average reader. Suppose I'm trying to do handwriting recognition to identify digits from zero to nine (aka MNIST). I get, say, 100 labeled zeros, 100 labeled ones, 100 labeled twos and so on. So a total of 1000 data points. I can train any off-the-shelf classifier based on pixel-level features and get some reasonable performance (accuracy maybe in the 80s or 90s, depending).

Now, I want to insert knowledge. The knowledge that I want to insert is some notion of invariance. I.e., if I take an image of a zero and translate it left a little bit, it's still a zero. Or up a little bit. Or if I scale it up 10%, it's still a zero. Or down 10%. Or if I rotate it five degrees. Or negative five. All zeros. The same holds for all the other digits.

One way to insert this knowledge is to muck with the learning algorithm. That's too complicated for me: I want something simpler. So what I'll do is take my 100 zeros and 100 ones and so on and just manipulate them a bit. That is, I'll sample a random zero, and apply some small random transformations to it, and call it another labeled example, also a zero. Now I have 100,000 training points. I train my off-the-shelf classifier based on pixel-level features and get 99% accuracy or more. The same trick works for other vision problems (e.g., recognizing animals). (This process is so common that it's actually described in Chris Bishop's new-ish PRML book!)
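To make the trick concrete, here is a minimal sketch in plain Python. It only does random translations (no scaling or rotation), and the function names, the `max_shift` parameter, and the list-of-rows image format are all assumptions made for illustration, not anything taken from a particular library or paper.

```python
import random


def translate(image, dx, dy, fill=0):
    """Shift a 2-D image (a list of rows) by dx columns and dy rows.

    Pixels shifted off the edge are dropped; vacated pixels get `fill`.
    """
    height, width = len(image), len(image[0])
    shifted = [[fill] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            new_y, new_x = y + dy, x + dx
            if 0 <= new_y < height and 0 <= new_x < width:
                shifted[new_y][new_x] = image[y][x]
    return shifted


def augment(examples, copies_per_example, max_shift=2, seed=0):
    """Return the original (image, label) pairs plus randomly shifted copies.

    Every shifted copy keeps the label of the image it came from, which is
    exactly the invariance assumption: a slightly moved zero is still a zero.
    """
    rng = random.Random(seed)
    augmented = list(examples)
    for image, label in examples:
        for _ in range(copies_per_example):
            dx = rng.randint(-max_shift, max_shift)
            dy = rng.randint(-max_shift, max_shift)
            augmented.append((translate(image, dx, dy), label))
    return augmented
```

With 1000 labeled digits and `copies_per_example=99`, this would give the 100,000 training points mentioned above. Keeping `max_shift` small is the "be careful" part: the transformations have to stay inside the range where the label really is unchanged.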

This is what I mean by small changes (to the input) begetting good examples. A slightly transformed zero is still a zero.

Of course, you have to be careful. If you rotate a six by 180 degrees, you get a nine. If you rotate a cat by 180 degrees, you get an unhappy cat. More seriously, if you're brave, you might start looking at a class of transformations called diffeomorphisms, which are fairly popular around here. These are nice because of their nice mathematical properties, but un-nice because they can be slightly too flexible for certain problems.

Now, let's go over to NLP land. Do we ever futz with our inputs?

Sure!

In language modeling, we'll sometimes permute words or replace one word with another to get a negative example. Noah Smith futzed with his inputs in contrastive estimation to produce negative examples by swapping adjacent words, or deleting words.
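As a rough illustration of this kind of input futzing, the sketch below builds a "neighborhood" of presumed-bad sentences from one good sentence by swapping adjacent words or deleting a single word. It is only the perturbation step, under my own assumptions about tokenization (whitespace splitting) and naming; it is not the contrastive estimation method itself, which also involves training a model against these neighborhoods.

```python
def swap_adjacent(tokens):
    """All token sequences obtained by swapping one pair of adjacent tokens."""
    return [
        tokens[:i] + [tokens[i + 1], tokens[i]] + tokens[i + 2:]
        for i in range(len(tokens) - 1)
    ]


def delete_one(tokens):
    """All token sequences obtained by deleting exactly one token."""
    return [tokens[:i] + tokens[i + 1:] for i in range(len(tokens))]


def negative_neighborhood(sentence):
    """Perturbed variants of `sentence`, treated as negative examples.

    Duplicates and variants identical to the original (e.g. swapping two
    equal words) are dropped.
    """
    tokens = sentence.split()
    seen = {tuple(tokens)}
    negatives = []
    for candidate in swap_adjacent(tokens) + delete_one(tokens):
        key = tuple(candidate)
        if key not in seen:
            seen.add(key)
            negatives.append(" ".join(candidate))
    return negatives
```

For "the cat sat" this yields five variants such as "cat the sat" and "the sat". Note that every output is assumed to be worse than the input, which is the opposite of what the image transformations above are used for.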

In fact, try as I might, I cannot think of a single case in NLP where we make small changes to an input to get another good input: we always do it to get a bad input!

In a sense, this means that one thing that vision people have that we don't have is a notion of semantics-preserving transformations. Sure, linguists (especially those from that C-guy) study transformations. And there's a vague sense that work in paraphrasing leads to transformations that maintain semantic equivalence. But the thing is that we really don't know any transformations that preserve semantics. Moreover, some transformations that seem benign (e.g., passivization) actually are not: one of my favorite papers at NAACL this year by Greene and Resnik showed that syntactic structure affects sentiment (well, them, drawing on a lot of psycholinguistics work)!

I don't have a significant point to this story other than it's kind of weird. I mentioned this to some people at ICML and got a reaction that replacing words with synonyms should be fine. I remember doing this in high school, when word processors first started coming with thesauri packed in. The result seemed to be that if I actually knew the word I was plugging in, life was fine... but if not, it was usually a bad replacement. So this seems like something of a mixed bag: depending on how liberal you are with defining "synonym" you might be okay doing this, but you might also not be.

In MT, there are a couple of marginal examples of futzing with the input to produce positive training examples. When you're translating, for example, from Chinese to English, where the source segmentation is not given, it is not unheard of to use two different segmentations (say, word-based and character-based). There was also a workshop paper (WMT08 maybe?) from CMU maybe? on using alignments computed over alternative segmentations to improve word alignment of a particular segmentation.

We paraphrased MT training data using a deep (HPSG-based) grammar to produce more positive training examples and it worked fairly well. Other people have done it before us (cited in the paper).

The problem is that paraphrasing itself is as hard a problem as MT, so there is no guarantee that this should help. We are taking advantage of the fact that there are often better monolingual resources available than bilingual.

Images and sound are pretty low-level; language at the text level (characters, words) is comparatively high-level. As such, it's very easy to come up with random perturbations to image or sound that preserve their meaning. But random futzing with text isn't very safe. Add or delete random characters? Words? Sentences? It's a research problem just to know what futzings are meaning-preserving...

The difference between NLP and, for example, vision is that NLP deals with discrete tokens, whereas in vision images commonly consist of "continuous" pixel values, neighborhoods are clearly defined, and thus you can wiggle them around.

The same treatment (i.e. creating new samples by random perturbations) is also done in speech recognition, where the input also consists of continuous data.

In terms of input, a fundamental difference between the two is that word representations already encode significant information, whereas pixels do not. In particular, humans can "think" or reason in terms of words; we do not do the same with pixels. Instead we reason in terms of higher order concepts such as shapes, which can be represented by many different, but similar, pixel representations.

One somewhat related idea in IR is pseudo-relevance feedback. It's a very hands-on approach. Suppose a user issues a very short query to a search engine. The user has some information need that's related to what's available in the indexed corpus. One way to get at this information is to first run the original query through our search engine, collect the top K results, and finally take the important words of the top K results as a new query to issue to the search engine. This is basically a smoothing heuristic. The idea is that the information need of the user can be "equivalently" represented using different bags of words (at least given the current level of sophistication of modern search engines).
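The loop just described can be sketched with a toy in-memory "search engine". Everything here is an assumption for illustration: the term-count scoring, the stopword list, and picking the "important words" simply as the most frequent non-stopwords in the top K results. Real systems use proper retrieval models and term weighting.

```python
from collections import Counter

# Tiny illustrative stopword list; a real one would be much longer.
STOPWORDS = frozenset({"the", "a", "an", "of", "and", "in", "to", "is"})


def search(query_terms, documents, k):
    """Rank documents by total occurrences of the query terms; return top k."""
    scored = []
    for doc in documents:
        counts = Counter(doc.lower().split())
        score = sum(counts[term] for term in query_terms)
        if score > 0:
            scored.append((score, doc))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for _, doc in scored[:k]]


def expand_query(query, documents, k=2, num_terms=3):
    """Pseudo-relevance feedback: build a new query from the top-k results.

    Runs the original query, assumes the top k hits are relevant, and
    returns the most frequent non-stopword terms found in them.
    """
    query_terms = query.lower().split()
    top_docs = search(query_terms, documents, k)
    counts = Counter(
        word
        for doc in top_docs
        for word in doc.lower().split()
        if word not in STOPWORDS
    )
    return [word for word, _ in counts.most_common(num_terms)]
```

The returned terms would then be issued as the second query. The connection to the post is that the expanded bag of words is treated as an equivalent statement of the same information need, i.e. a perturbed input that is assumed to still be a good one.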

Maybe someone could define a distance measure for synonyms (something like a KL-divergence of neighbor distributions?) and then perturb the examples using "near" synonyms. Or has anybody done this already?
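One possible reading of that suggestion, sketched below under loudly stated assumptions: each word's "neighbor distribution" is the smoothed distribution over words appearing within a small window of it in some corpus, and the distance from one word to another is the KL divergence between those distributions. The window size, the add-alpha smoothing, and all function names are my own choices for illustration; nothing here says this would actually produce safe synonym substitutions.

```python
import math
from collections import Counter


def neighbor_counts(sentences, window=1):
    """Map each word to a Counter of words seen within `window` positions."""
    counts = {}
    for sentence in sentences:
        tokens = sentence.lower().split()
        for i, word in enumerate(tokens):
            context = counts.setdefault(word, Counter())
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    context[tokens[j]] += 1
    return counts


def smoothed_distribution(counter, vocabulary, alpha=0.1):
    """Add-alpha smoothed distribution over `vocabulary` (no zero entries)."""
    total = sum(counter.values()) + alpha * len(vocabulary)
    return {w: (counter[w] + alpha) / total for w in vocabulary}


def kl_divergence(p, q):
    """KL(p || q) for two distributions defined over the same keys."""
    return sum(p[w] * math.log(p[w] / q[w]) for w in p)


def neighbor_kl(word_a, word_b, counts, alpha=0.1):
    """KL divergence between the neighbor distributions of two words."""
    vocabulary = set(counts)
    p = smoothed_distribution(counts[word_a], vocabulary, alpha)
    q = smoothed_distribution(counts[word_b], vocabulary, alpha)
    return kl_divergence(p, q)
```

KL divergence is asymmetric, so `neighbor_kl(a, b, counts)` and `neighbor_kl(b, a, counts)` generally differ; a symmetrized variant might be preferable if this were used to pick "near" synonyms for perturbation.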

As other people have already mentioned, paraphrasing has been used to futz with the inputs for the various components in the MT pipeline: paraphrasing to create new training data and paraphrasing to increase the coverage for a test set. My work is part of the same group and uses sentence-level paraphrasing to create additional inputs for the MERT level of the MT pipeline, i.e., to improve the tuning of the decoder feature weights. And the neat thing is that the sentence-level English paraphraser is built by using nothing more than what you already have for the MT system (by extending CCB's idea).

While not "futzing with the data" per se, I do have some positive experience with using data to automatically generate positive training samples for an NLP task.

Specifically, for the task of identifying transliterated foreign words (in Hebrew), I got pretty decent results by starting with the CMU-pronunciation dictionary and an English corpus, and generating many possible transliterations for each of the English words based on its pronunciation and some simple and ambiguous phoneme transliteration rules. The vast majority of the generated transliterations would not have been accepted as valid by a human reader, but were quite adequate for training a statistical model to recognize other transliterated words.

Interesting observation. I wonder if part of the difference, though, arises from the granularity at which you're working. If you have a classification problem in which the universe of images consists of only ten distinct item-types (digits 0 through 9), then there's a lot of room for perturbation and invariance. In the NLP world, this would be like classifying sentences into simple declarations, compound statements, interrogatives, and exclamations; at that level, even in language there is a lot of room for (valid) permutation that wouldn't affect the class of the object. (Sentiment classification might be another such example.) But for parsing or machine translation or automatic summary generation or question answering, the distinctions being made seem to be at a much finer level -- where there would be less room for any valid (or meaningful?) perturbations. These problems are perhaps more akin to digit classification in which the font, weight, and point size all matter. The more details you care about, the fewer invariant transformations exist.
