2
Intro to NLP - J. Eisner

Why Text Categorization?
- Is it spam?
- Is it Spanish?
- Is it interesting to this user?
  - News filtering
  - Helpdesk routing
- Is it interesting to this NLP program?
  - e.g., should my calendar system try to interpret this as an appointment (using info. extraction)?
- Where should it go in the directory?
  - Yahoo! / Open Directory / digital libraries
  - Which mail folder? (work, friends, junk, urgent...)

3
Measuring Performance
- Classification accuracy: What % of messages were classified correctly?
- Is this what we care about?

            Overall accuracy   Accuracy on spam   Accuracy on gen
System 1    95%                99.99%             90%
System 2    95%                90%                99.99%

- Which system do you prefer?
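A minimal sketch of why a single overall figure can hide the difference between two systems: it reports accuracy separately for each true class. The function name `per_class_accuracy` and the toy labels are hypothetical, not from the slides.

```python
from collections import Counter

def per_class_accuracy(gold, predicted):
    """Return (overall accuracy, {label: accuracy on items whose true label is `label`})."""
    total = Counter(gold)
    correct = Counter(g for g, p in zip(gold, predicted) if g == p)
    overall = sum(correct.values()) / len(gold)
    return overall, {label: correct[label] / total[label] for label in total}

# Toy data (made up): 4 spam messages and 6 genuine ("gen") messages.
gold      = ["spam"] * 4 + ["gen"] * 6
predicted = ["spam", "spam", "spam", "gen"] + ["gen"] * 5 + ["spam"]
overall, by_class = per_class_accuracy(gold, predicted)
print(overall, by_class)
```

Two systems with the same `overall` value can differ sharply in `by_class`, which is the point of the table above.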

5
Measuring Performance
[Figure: precision-recall curve, with these annotations]
- Low threshold: keep all the good stuff, but a lot of the bad too. OK for spam filtering and legal search.
- High threshold: all we keep is good, but we don’t keep much. OK for search engines (maybe).
- Region where precision and recall are both high: would prefer to be here!
- Point where precision = recall (sometimes reported)
- F-measure = 1 / (average(1/precision, 1/recall))
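The F-measure formula above is the harmonic mean of precision and recall. Here is a small sketch of how moving the threshold trades one for the other. The function name, the scores, and the convention that precision is 1.0 when nothing is kept are all assumptions made for illustration.

```python
def precision_recall_f(scores, is_good, threshold):
    """Keep items whose score >= threshold; return (precision, recall, F-measure)."""
    kept = [good for s, good in zip(scores, is_good) if s >= threshold]
    true_pos = sum(kept)
    precision = true_pos / len(kept) if kept else 1.0  # assumed convention when nothing is kept
    recall = true_pos / sum(is_good)                   # assumes at least one good item exists
    if precision == 0 or recall == 0:
        return precision, recall, 0.0
    f = 1 / ((1 / precision + 1 / recall) / 2)         # 1 / average(1/P, 1/R)
    return precision, recall, f

# Made-up classifier scores and gold judgments.
scores  = [0.9, 0.8, 0.6, 0.4, 0.2]
is_good = [True, True, False, True, False]
low  = precision_recall_f(scores, is_good, 0.3)   # keeps more: higher recall
high = precision_recall_f(scores, is_good, 0.85)  # keeps less: higher precision
print(low, high)
```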

6
More Complicated Cases of Measuring Performance
- For multi-way classifiers:
  - Average accuracy (or precision or recall) of 2-way distinctions: Sports or not, News or not, etc.
  - Better, estimate the cost of different kinds of errors
    - e.g., how bad is each of the following?
      - putting Sports articles in the News section
      - putting Fashion articles in the News section
      - putting News articles in the Fashion section
    - Now tune system to minimize total cost
- For ranking systems (Which articles are most Sports-like? Which articles / webpages most relevant?):
  - Correlate with human rankings?
  - Get active feedback from user?
  - Measure user’s wasted time by tracking clicks?
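One way to make "estimate the cost of different kinds of errors" concrete is a table of costs indexed by (true section, predicted section). The cost values below are invented for illustration; real costs would have to come from users or editors.

```python
def total_cost(gold, predicted, cost):
    """Sum cost[(true, predicted)] over all misclassified items."""
    return sum(cost[(g, p)] for g, p in zip(gold, predicted) if g != p)

# Hypothetical costs: how bad is it to file a `true` article under `predicted`?
cost = {
    ("Sports", "News"): 1.0,
    ("Fashion", "News"): 2.0,
    ("News", "Fashion"): 5.0,
}
gold      = ["Sports", "Fashion", "News",    "News"]
predicted = ["News",   "News",    "Fashion", "News"]
cost_of_run = total_cost(gold, predicted, cost)  # 1.0 + 2.0 + 5.0
print(cost_of_run)
```

Tuning then means choosing the thresholds or parameters that minimize this total on held-out data. A full table would need an entry for every possible confusion; this sketch raises `KeyError` for a pair that is missing.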

7
How to Categorize?

Subject: would you like to drive a new vehicle for free ? ? ?

this is not hype or a hoax, there are hundreds of people driving brand new cars, suvs, minivans, trucks, or rvs. it does not matter to us what type of vehicle you choose. if you qualify for our program, it is your choice of vehicle, color, and options. we don ' t care. just by driving the vehicle, you are promoting our program. if you would like to find out more about this exciting opportunity to drive a brand new vehicle for free, please go to this site : http : / / / ntr to watch a short 4 minute audio / video presentation which gives you more information about our exciting new car program. if you do n't want to see the short video, but want us to send you our information package that explains our exciting opportunity for you to drive a new vehicle for free, please go here : http : / / / ntr / form. htm we would like to add you the group of happy people driving a new vehicle for free. happy motoring.

8
How to Categorize? (supervised)
We’ve seen lots of options in this course!
1. Build n-gram model of each category
   - Question: How to classify test message?
   - Answer: Bayes’ Theorem
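A minimal sketch of option 1, under simplifying assumptions: it uses unigram models (n = 1) with add-one smoothing, and picks the category c that maximizes p(c) * p(message | c), as Bayes' Theorem suggests. The training messages and function names are made up for illustration.

```python
import math
from collections import Counter

def train_unigram(docs):
    """Count word frequencies over a list of tokenized documents."""
    counts = Counter(w for doc in docs for w in doc)
    return counts, sum(counts.values())

def log_likelihood(doc, model, vocab_size):
    """log p(doc | category) under an add-one-smoothed unigram model."""
    counts, total = model
    return sum(math.log((counts[w] + 1) / (total + vocab_size)) for w in doc)

def classify(doc, models, priors, vocab_size):
    """Bayes' Theorem: argmax over c of log p(c) + log p(doc | c)."""
    return max(sorted(models),
               key=lambda c: math.log(priors[c]) + log_likelihood(doc, models[c], vocab_size))

# Toy training data (made up).
training = {
    "spam": [["free", "vehicle", "free", "offer"], ["exciting", "free", "opportunity"]],
    "gen":  [["meeting", "at", "noon"], ["lecture", "notes", "attached"]],
}
n_docs = sum(len(docs) for docs in training.values())
priors = {c: len(docs) / n_docs for c, docs in training.items()}
models = {c: train_unigram(docs) for c, docs in training.items()}
# +1 reserves one vocabulary slot for words never seen in training.
vocab_size = len({w for docs in training.values() for doc in docs for w in doc}) + 1

label = classify("free offer today".split(), models, priors, vocab_size)
print(label)
```

A higher-order n-gram model would replace `log_likelihood` with a sum over smoothed conditional probabilities; the decision rule stays the same.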

9
How to Categorize? (supervised)
We’ve seen lots of options in this course!
2. Represent each document as a vector (must choose representation and distance measure; use SVD?)
   - Question: How to classify test message?
   - Answer 1: Category whose centroid is most similar (may not work well if category is diverse)
   - Answer 2: Cluster each category into subcategories (then use answer 1 to pick a subcategory) (return the category that the subcategory is in) (this can also be useful for n-gram models)
   - Answer 3: Just look at labels of nearby training docs (e.g., let the k nearest neighbors vote – flexible!) (maybe the closer ones get a bigger vote)
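Answers 1 and 3 can be sketched as follows, assuming raw word counts as the vector representation and cosine similarity as the measure. Both choices are assumptions, since the slide leaves them open, and the training documents are made up.

```python
import math
from collections import Counter

def cosine(u, v):
    """Cosine similarity between two sparse vectors (dicts mapping word -> weight)."""
    dot = sum(w * v.get(k, 0.0) for k, w in u.items())
    norm = math.sqrt(sum(w * w for w in u.values())) * math.sqrt(sum(w * w for w in v.values()))
    return dot / norm if norm else 0.0

def centroid(vectors):
    """Average a list of sparse vectors."""
    total = Counter()
    for vec in vectors:
        total.update(vec)  # Counter.update adds the values
    return {k: w / len(vectors) for k, w in total.items()}

def classify_by_centroid(doc, train):
    """Answer 1: pick the category whose centroid is most similar to doc."""
    by_label = {}
    for vec, label in train:
        by_label.setdefault(label, []).append(vec)
    return max(sorted(by_label), key=lambda c: cosine(doc, centroid(by_label[c])))

def classify_by_knn(doc, train, k=3):
    """Answer 3: the k nearest training docs vote; closer ones get a bigger vote."""
    nearest = sorted(train, key=lambda ex: cosine(doc, ex[0]), reverse=True)[:k]
    votes = Counter()
    for vec, label in nearest:
        votes[label] += cosine(doc, vec)
    return votes.most_common(1)[0][0]

train = [
    (Counter("free car free offer".split()), "spam"),
    (Counter("free money offer".split()), "spam"),
    (Counter("lecture notes for class".split()), "gen"),
    (Counter("class meeting at noon".split()), "gen"),
]
doc = Counter("free car offer today".split())
print(classify_by_centroid(doc, train), classify_by_knn(doc, train))
```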

10
How to Categorize? (supervised)
We’ve seen lots of options in this course!
3. Treat it like word-sense disambiguation
   a) Vector model – use all the features (we just saw this)
   b) Decision list – use single most indicative feature
   c) Naive Bayes – use all the features, weighted by how well they discriminate among the categories
   d) Decision tree – use some of the features in sequence
   e) Other options from machine learning, like perceptron, Support Vector Machine (SVM), logistic regression, …
Features matter more than which machine learning method

12
Review: Decision Lists
(slide courtesy of D. Yarowsky, modified)
To disambiguate a token of lead:
- Scan down the sorted list
- The first cue that is found gets to make the decision all by itself
- Not as subtle as combining cues, but works well for WSD
Cue’s score is its log-likelihood ratio: log [ p(cue | sense A) / p(cue | sense B) ], where the probabilities are smoothed
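The scan-down procedure can be sketched as below. Sorting by the absolute value of the log-likelihood ratio, the additive smoothing constant `alpha`, and the toy cues for the two senses of "lead" are assumptions made for illustration, not details taken from the slide.

```python
import math
from collections import Counter

def build_decision_list(examples, alpha=0.1):
    """examples: list of (set_of_cues, sense) with sense in {"A", "B"}.
    Returns [(|score|, cue, sense)] sorted from most to least decisive."""
    count = {"A": Counter(), "B": Counter()}
    n = Counter()
    for cues, sense in examples:
        n[sense] += 1
        count[sense].update(cues)
    rules = []
    for cue in set(count["A"]) | set(count["B"]):
        p_a = (count["A"][cue] + alpha) / (n["A"] + 2 * alpha)  # smoothed p(cue | sense A)
        p_b = (count["B"][cue] + alpha) / (n["B"] + 2 * alpha)  # smoothed p(cue | sense B)
        llr = math.log(p_a / p_b)
        rules.append((abs(llr), cue, "A" if llr > 0 else "B"))
    return sorted(rules, reverse=True)

def decide(cues, rules, default="A"):
    """Scan down the sorted list; the first cue found decides all by itself."""
    for _, cue, sense in rules:
        if cue in cues:
            return sense
    return default

# Hypothetical training contexts: sense A = the metal, sense B = to be ahead.
examples = [
    ({"zinc", "mine"}, "A"), ({"pipe", "poisoning"}, "A"), ({"mine", "poisoning"}, "A"),
    ({"follow", "team"}, "B"), ({"team", "narrow"}, "B"), ({"follow", "mine"}, "B"),
]
rules = build_decision_list(examples)
print(decide({"follow", "mine"}, rules))
```

In the last call the weak cue "mine" leans toward sense A, but "follow" appears higher in the list and decides alone.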

16
Features Besides Unigrams
- All these approaches (except n-gram model) can use “interesting” features, not just unigrams.
  - There’s generally a heuristic feature selection problem
- Use some very large set of features defined by a template
  - Maybe restrict to features that look useful in isolation?
- Add features greedily, one at a time
  - Measure or guess expected improvement of each feature
  - Make sure to smooth when doing this – why?
- At the end, remove features that hurt performance on held-out data
- What does SpamAssassin use?
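Greedy addition can be sketched as a loop that repeatedly adds whichever feature most improves a held-out score. The `score` callback stands in for "train and evaluate on held-out data"; `toy_score`, the feature names, and the numbers are all hypothetical.

```python
def greedy_select(candidates, score, min_gain=1e-9):
    """Add the feature that most improves score(selected) until nothing helps.
    `score` maps a frozenset of features to held-out performance."""
    selected = frozenset()
    best = score(selected)
    while True:
        gains = {f: score(selected | {f}) - best for f in candidates - selected}
        if not gains:
            break
        f = max(sorted(gains), key=gains.get)  # sorted() makes tie-breaking deterministic
        if gains[f] < min_gain:
            break
        selected = selected | {f}
        best = score(selected)
    return selected, best

def toy_score(features):
    """Stand-in for held-out accuracy (made-up numbers)."""
    acc = 0.5
    if "contains_free" in features:
        acc += 0.2
    if "all_caps_subject" in features:
        acc += 0.1
    if "font_is_magenta" in features:
        acc -= 0.05  # this feature hurts on held-out data
    return acc

selected, best = greedy_select(
    {"contains_free", "all_caps_subject", "font_is_magenta"}, toy_score)
print(sorted(selected), best)
```

Because the loop only ever adds features, a final pass that tries removing each selected feature would still be needed to match the last step on the slide.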

17
SpamAssassin Features
- 100 From: address is in the user's black-list
- 4.0 Sender is on Habeas Infringer List
- Invalid Date: header (timezone does not exist)
- Written in an undesired language
- Listed in Razor2, see
- Subject is full of 8-bit characters
- Claims compliance with Senate Bill
- exists:X-Precedence-Ref
- Reverses Aging
- Claims you can be removed from the list
- 'Hidden' assets
- Claims to honor removal requests
- Contains "Stop Snoring"
- Received: contains a name with a faked IP-address
- Received via a relay in list.dsbl.org
- Character set indicates a foreign language

31
SpamAssassin Features
- Spam phrases score is 08 to 13 (medium)
- URL uses words and phrases which indicate porn (4)
- As seen on national TV!
- Message text disguised using base-64 encoding
- Date: is 3 to 6 hours after Received: date
- Score with babes!
- From and To are same (6)
- 'From' yahoo.com does not match 'Received' headers
- Spam phrases score is 13 to 21 (high)
- Not intended for residents of XYZ
- Faked To "Undisclosed-Recipients"
- From and To are same (5)
- Only thing addresses on CD are useful for is spam
- Contains "Vjestika Aphrodisia"
- Lower Monthly Payment
- HTML comment has 3 consecutive 8-bit characters

32
SpamAssassin Features
- From: does not include a real name
- Uses a dotted-decimal IP address in URL
- Contains link without prefix
- 'Subject' contains G.a.p.p.y-T.e.x.t
- Marketing Solutions
- Spam tool pattern in MIME boundary
- 'Prestigious Non-Accredited Universities'
- Spam tool pattern in MIME boundary
- Incorporates a tracking ID number
- From and To are same (2)
- Contains 'free sample' with capitals
- Claims compliance with spam regulations
- Online Pharmacy
- Received via SMTPD32 server (SMTPD32-n.n)
- Includes a form which will send an email
- While you Sleep

46
SpamAssassin Features
- Refinance Home
- Received via a relay in relays.ordb.org
- Contains 'free access' with capitals
- Uses a long numeric IP address in URL
- Have you been turned down?
- Includes a URL link to send an email with the subject 'remove'
- No Credit Check
- No Inventory
- To: has a malformed address
- Be your own boss
- Information on how to work at home (2)
- Contains mail-in order form
- One hundred percent guaranteed
- Guaranteed Stuff
- Information on mortgage rates
- Frequent SPAM content

47
SpamAssassin Features
- From and To the same (1)
- Bulk software fingerprint (screwup 2) found in headers
- Gives an excuse for why message was sent
- Avoid Bankruptcy
- Includes a link for AOL users to click
- Form for changing address
- Apply online (with capital O)
- List removal information
- Date: is 12 to 24 hours after Received: date
- Asks you for your signature on a form
- Subject talks about losing pounds
- Lower Interest Rates
- Do it Today
- Unsecured Credit/Debt
- The best Rates
- From: starts with nums

49
SpamAssassin Features
- Possible porn - Porn Fest
- Sent with 'X-Priority' set to high
- Local part containing a "4u" variant
- HTML font color is magenta
- Join Millions of Americans
- Asks for a billing address
- Nigerian scam key phrase ((dollar) NNN.Nm/USDNNN.N m/US(dollar) NN.N m)
- Claims "This is not spam"
- Sent with 'X-Msmail-Priority' set to high
- Subject contains "FREE" in CAPS
- exists:X-MailingID
- MIME section missing boundary
- Asks you to fill out a form
- HTML font color is unknown to us
- Domain name containing a "4u" variant
- HTML font color is yellow

50
SpamAssassin Features
- Includes a link to send a mail with a subject
- Standard investment opportunity spam
- Javascript to hide URLs in browser
- Offers Extra Cash
- Eliminate Bad Credit
- Lose Weight Spam
- Subject talks about savings
- Subject ends with lots of white space
- Offers a full refund
- Gives instructions for removal from list
- Free Cell Phone
- Frontpage used to create the message
- Offers a limited time offer
- Claims you can be removed from the list
- Attempt at obfuscating the word "mortgage"
- Opportunity - What a deal!

51
SpamAssassin Features
- Nobody's perfect
- Tells you about a strong buy
- HTML table has thick border
- Buy Direct
- Instant Access button
- HTML font color is green
- HTML font color is cyan
- Discusses money making
- Asks you to click below (in caps)
- Uses open redirection service
- exists:X-ServerHost
- Claims you can be removed from the list
- List removal information
- Message with extraneous Content-type:...type=header
- There is no obligation
- Talks about lots of money

52
SpamAssassin Features
- Contains 'Get it now' with capitals
- Supplies are Limited
- No such thing as a free lunch (2)
- You won't be dissapointed
- Possible porn - Offers Instant Access
- Nigerian scam key phrase ((dollar)NN,NNN,NNN.NN)
- How dear can you be if you don't know my name?
- No Strings Attached
- HTML with embedded plugin object
- Received via a relay in relays.osirusoft.com
- Off Shore Scams
- Information on how to work at home (1)
- Possible porn - Hot, Nasty, Wild, Young
- Contains word 'amazing' in all-caps
- exists:X-SMTPExp-Version
- There is no catch.

53
SpamAssassin Features
- sent to or similar
- Received from first hop dialup listed in relays.osirusoft.com
- HTML font color is same as background
- Subject: is empty or missing
- FONT Size +2 and up or 3 and up
- Lowest Price
- HTML font color has unusual name
- Contains word 'profits' in all-caps
- HTML font color is gray
- What are you waiting for
- One Time Rip Off
- Talks about prizes
- Free Website
- To: and Cc: contain similar usernames at least 5 times
- HTML font face is not a commonly used face
- Quoted-printable line longer than 76 characters

54
SpamAssassin Features
- From: has a malformed address
- exists:X-SMTPExp-Registration
- Message-Id has sign
- No such thing as a free lunch (1)
- URL of CGI script called "unsubscribe" or "remove"
- Satisfaction Guaranteed
- "if you do not wish to receive any more"
- Message contains a lot of ^M characters
- exists:x-esmtp
- Claims you are a winner
- From: contains numbers mixed in with letters
- Can't live without?
- HTML mail with non-white background
- Talks about marketing
- Save big money
- HTML font color is red

58
SpamAssassin Features
- Free DVD
- Date: is 12 to 24 hours before Received: date
- JavaScript code
- Header with all capitals found
- HTML font color is blue
- Winner in Caps
- HTML font face is not a word
- Fantastic Deal
- Includes a 'remove' address
- Includes a URL link to send an email
- Possible porn - Large Number of movies, pics
- Free Offer
- Contains a tollfree number
- illegal Nigerian transactions (1)
- Image tag with an ID code to identify you
- Frame wanted to load outside URL

59
SpamAssassin Features
- Contains 'for only' some amount of cash
- X-Mailer header indicates a non-spam MUA (Outlook Express)
- Spam tool pattern in MIME boundary
- Cancel at any time!
- Talks about social security numbers
- Click to perform an action on an account
- Gives an excuse about why you were sent this spam
- Nigerian scam key phrase ((dollar) NNN.Nm/USDNNN.N m/US(dollar) NN.N m)
- Contains a comment with nothing but unique ID
- No Claim Forms
- 'Message-Id' was added by a relay (2)
- Free Trial
- They're just giving it away!
- Message-Id has characters indicating spam
- Dear
- Free Hosting

60
SpamAssassin Features
- Contains an ASCII-formatted form
- I wonder how many emails they sent in error
- URL of page called "unsubscribe"
- Subject has exclamation mark and question mark
- Offer Expires
- Contains 'Dear Somebody'
- Javascript protocol in a URI
- Message includes Microsoft executable program
- MIME filename does not match content
- Spam tool pattern in MIME boundary
- 'Received:' has 'may be forged' warning
- Message-Id is not valid, according to RFC
- Offers Coupon
- Please read this! Please oh please oh please!
- Shopping Spree
- Contains a line >=199 characters long

65
SpamAssassin Features
- Common footer for MSN
- Contains a password retrieval system
- Something about registration
- User-Agent header indicates a non-spam MUA (Mutt)
- Came from MSN Communities
- exists:X-Cron-Env
- Subject looks like order info
- From the Mailer-Daemon
- Subject contains a date
- Contains what looks like an attribution
- Common footer for Hotmail
- X-Mailer header indicates a non-spam MUA (Appl )
- Common footer for Hotmail
- Sent through Microsoft's ListBuilder service
- Short signature present (empty lines)
- Common footer for Hotmail

67
How to Categorize? (unsupervised)
What if we don’t have supervised training data?
Might try an iterative approach as usual:
1. Cluster the messages
2. Train n-gram, Naive Bayes, or decision list model to discriminate among the clusters
3. Use the model to reassign messages to clusters (most will stay put but some will move)
4. Return to step 2 until convergence
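Steps 2 to 4 can be sketched as a hard-assignment loop. This version assumes an add-one-smoothed unigram model per cluster and takes the initial clustering (step 1) as given; the messages and the initial assignment are made up.

```python
import math
from collections import Counter

def train(docs, assignment, k, vocab_size):
    """Step 2: fit one add-one-smoothed unigram model per cluster.
    Returns a function log_p(doc, cluster)."""
    counts = [Counter() for _ in range(k)]
    for doc, c in zip(docs, assignment):
        counts[c].update(doc)
    totals = [sum(c.values()) for c in counts]
    return lambda doc, c: sum(
        math.log((counts[c][w] + 1) / (totals[c] + vocab_size)) for w in doc)

def recluster(docs, assignment, k, max_iters=20):
    """Steps 2-4: train on the current clusters, reassign, repeat until stable."""
    vocab_size = len({w for doc in docs for w in doc})
    for _ in range(max_iters):
        log_p = train(docs, assignment, k, vocab_size)
        new = [max(range(k), key=lambda c: log_p(doc, c)) for doc in docs]  # step 3
        if new == assignment:  # convergence: nobody moved
            break
        assignment = new
    return assignment

docs = [
    ["free", "car", "free"], ["free", "offer", "car"],
    ["class", "notes"], ["class", "meeting", "notes"],
    ["free", "offer"],
]
initial = [0, 0, 1, 1, 1]  # hypothetical rough clustering; the last message is misplaced
final = recluster(docs, initial, k=2)
print(final)
```

Here most messages stay put and the misplaced one moves, as the slide describes.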

68
How to Categorize? (semisupervised)
What if we have only a little supervised data?
Could try bootstrapping like Yarowsky’s WSD:
1. Start with very small, rather accurate classes
2. Train n-gram, Naive Bayes, or decision list model to discriminate among the classes
3. Augment each class with new messages that the model confidently classifies there (maybe also move or remove some existing messages)
4. Return to step 2 until convergence
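A minimal bootstrapping sketch under stated assumptions: the model is a smoothed unigram model per class, "confident" means the best class beats the runner-up by `margin` in log-probability, and labeled messages are never moved or removed. The seed labels and messages are invented.

```python
import math
from collections import Counter

def train_nb(labeled_docs, vocab_size):
    """labeled_docs: list of (tokens, label). Returns score(tokens) -> {label: log-prob}."""
    counts = {}
    for tokens, label in labeled_docs:
        counts.setdefault(label, Counter()).update(tokens)
    def score(tokens):
        return {label: sum(math.log((c[w] + 1) / (sum(c.values()) + vocab_size))
                           for w in tokens)
                for label, c in counts.items()}
    return score

def bootstrap(docs, seeds, margin=1.0, max_iters=10):
    """seeds: {doc index: label}, at least two classes. Grow the labeled set."""
    vocab_size = len({w for d in docs for w in d})
    labels = dict(seeds)
    for _ in range(max_iters):
        score = train_nb([(docs[i], y) for i, y in labels.items()], vocab_size)  # step 2
        added = False
        for i, doc in enumerate(docs):
            if i in labels:
                continue
            ranked = sorted(score(doc).items(), key=lambda kv: kv[1], reverse=True)
            if ranked[0][1] - ranked[1][1] >= margin:  # step 3: confident enough
                labels[i] = ranked[0][0]
                added = True
        if not added:  # step 4: converged
            break
    return labels

docs = [
    ["free", "car", "offer"], ["class", "notes", "meeting"],
    ["free", "offer", "now"], ["meeting", "notes", "today"],
    ["car", "now", "cheap"], ["hello"],
]
labels = bootstrap(docs, seeds={0: "spam", 1: "gen"})
print(labels)
```

Message 4 is not confident enough in the first round and is only labeled once message 2 has joined the spam class; message 5 never becomes confident and stays unlabeled.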

69
How to Categorize? (adaptive)
What if we gradually get more new data over time?
- User feedback (active or passive) on our classifications
- News / systems that categorize, or judge relevance
  - Add new articles / messages to training data
  - If they’re unlabeled (no supervision), label them automatically
    - Add them only if we’re confident? Add them fractionally, like EM?
So model adjusts over time:
- E.g., change the cluster centroids or n-gram parameters
- May want to weight the more recent data more heavily, since the future is more like the present than the past
  - E.g., message from k days ago has weight 0.9^k (k=0,1,2,...)
  - So today’s model = today’s data + 0.9 * yesterday’s model
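Weighting a message from k days ago by 0.9^k does not require storing old messages: decaying yesterday's counts by 0.9 and adding today's counts gives the same totals. A small sketch, assuming the model is just a table of word counts (the function name and daily counts are hypothetical):

```python
from collections import Counter

def update_model(yesterday, today_counts, decay=0.9):
    """New model = today's counts + decay * yesterday's model."""
    model = Counter({w: decay * c for w, c in yesterday.items()})
    model.update(today_counts)  # Counter.update adds the counts
    return model

# Made-up daily word counts, oldest day first.
daily = [Counter({"free": 2}), Counter({"class": 1}), Counter({"free": 1})]
model = Counter()
for counts in daily:
    model = update_model(model, counts)

# Same totals computed directly: counts from k days ago weighted by 0.9 ** k.
direct_free = 2 * 0.9 ** 2 + 1 * 0.9 ** 0
direct_class = 1 * 0.9 ** 1
print(model["free"], direct_free, model["class"], direct_class)
```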

70
How to Categorize? (hierarchical)
What if we are putting a document in a Yahoo! category?
- There are thousands of categories (at least) – too hard!
  - Choose one of the 14 top-level categories, e.g., Science
  - Then use a Science-specific classifier to choose one of the 54 second-level categories within Science (14 are symlinks)
  - Continue working your way down the tree...
- When you can’t classify with high confidence, ask a human (then use the human’s answer as more training data)
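The top-down walk can be sketched as below. The tree layout, the `keyword_classifier` stand-in for a trained per-node classifier, the category names under Science, and the 0.6 confidence cutoff are all hypothetical.

```python
def classify_hierarchically(doc, node, min_confidence=0.6):
    """Walk down the category tree; stop and defer to a human when unsure.
    node = {"classifier": doc -> (child_name, confidence), "children": {name: node}}"""
    path = []
    while node.get("children"):
        child, confidence = node["classifier"](doc)
        if confidence < min_confidence:
            return path, "ask a human"
        path.append(child)
        node = node["children"][child]
    return path, "done"

def keyword_classifier(keywords):
    """Hypothetical stand-in for a trained classifier: pick the child whose
    keyword set overlaps the document most; confidence = its share of all hits."""
    def classify(doc):
        hits = {name: len(words & set(doc)) for name, words in keywords.items()}
        best = max(sorted(hits), key=hits.get)
        total = sum(hits.values())
        return best, (hits[best] / total if total else 0.0)
    return classify

tree = {
    "classifier": keyword_classifier(
        {"Science": {"physics", "cell", "quantum"}, "Sports": {"goal", "team"}}),
    "children": {
        "Science": {
            "classifier": keyword_classifier(
                {"Physics": {"quantum", "physics"}, "Biology": {"cell"}}),
            "children": {"Physics": {}, "Biology": {}},
        },
        "Sports": {},
    },
}
clear = classify_hierarchically(["quantum", "physics", "lecture"], tree)
unsure = classify_hierarchically(["quantum", "cell"], tree)
print(clear, unsure)
```

The second document is classified confidently at the top level but not below it, so the walk stops at Science and defers to a human; the human's answer could then be added to that node's training data.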