
6.
Pragmatix (1 of 3)
• The hands-on programs use Python 2.x and the NLTK stack
  o Installing EPD Free is a good idea
  o http://www.enthought.com/products/epd_free.php
  o But just pip install nltk should work
• Did you attend Prof. Downey's tutorial last week?

7.
Pragmatix (2 of 3)
• Data
  o We will use the Enron spam dataset
  o Background: http://csmining.org/index.php/enron-spam-datasets.html
  o The files are also available at http://labs-repos.iit.demokritos.gr/skel/i-config/
• We will work with the data in the enron-pre-processed folder
• FYI, the full Enron dataset is available at http://www.cs.cmu.edu/~enron/
• The version with attachments is available at http://www.edrm.net/resources/data-sets/edrm-enron-email-data-set-v2
  o It is ~210 GB, an AWS public dataset

8.
Pragmatix (3 of 3)
• Programs
  o The Python code is on GitHub
  o git@github.com:xsankar/pydata.git
  o https://github.com/xsankar/pydata
  o Three files:
    • cat_in_the_hat-xx.py
    • spam-01.py
    • spam-02.py
• This slide set is on SlideShare
  o http://www.slideshare.net/ksankar/pydata-19
  o as well as in the GitHub repository

20.
“..Whilst the chance of a future event happening may be expressed in percentage terms the same is not true of past events. It cannot be said that there is a particular percentage chance that a particular event happened. The event either happened or it did not…”
“.. the “balance of probabilities” standard when expressed mathematically as “50+% probability” carried with it a “danger of pseudo-mathematics” ..”
Ref:
http://understandinguncertainty.org/court-appeal-bans-bayesian-probability-and-sherlock-holmes
http://www.lexology.com/library/detail.aspx?g=471d7904-20d2-4fdb-a061-8d49b10de60d&l=7HVPC65
http://www.12kbw.co.uk/cases/commentary/id/161/

23.
Problem #1
“Of Colossal Conglomerate’s 16,000 clients, 3,200 own their own business, 1,600 are "gold class" customers, and 800 own their own business and are also "gold class" customers. What is the probability that a randomly chosen client who owns his or her own business is a "gold class" customer?”
Hint: build a 2 x 2 contingency matrix (plus row and column totals)

24.
Problem #1
“Of Colossal Conglomerate’s 16,000 clients, 3,200 own their own business, 1,600 are "gold class" customers, and 800 own their own business and are also "gold class" customers. What is the probability that a randomly chosen client who owns his or her own business is a "gold class" customer?”

          Owner   !Owner    Total
Gold        800      800    1,600
!Gold     2,400   12,000   14,400
Total     3,200   12,800   16,000
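The matrix above can be checked with plain arithmetic (variable names here are mine): restricting the sample space to business owners gives P(Gold | Owner) = 800 / 3,200.

```python
# Problem #1: P(gold class | business owner), from the counts in the matrix.
total = 16000            # all clients
owners = 3200            # own their own business
gold = 1600              # "gold class" customers
gold_and_owner = 800     # both

# Conditional probability: restrict the sample space to business owners.
p_gold_given_owner = gold_and_owner / float(owners)
print(p_gold_given_owner)  # 0.25
```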

27.
Problem #2
A test for a disorder has the following properties. If you have the disorder, the probability that the test returns positive is 0.999. If you don’t have the disorder, the probability that the test returns positive is 0.004. Assume that 3% of the population has the disorder. If a person is randomly chosen from the population and tested, and the test returns positive, what is the probability that the person has the disorder?
Hint: use Bayes’ 2nd form
P(θ|χ) = P(χ|θ)P(θ) / [P(χ|θ)P(θ) + P(χ|~θ)P(~θ)]
θ = has disorder, χ = positive result
http://www.cecs.csulb.edu/~ebert/teaching/lectures/551/bayes/bayes.pdf
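Plugging the numbers from the problem into Bayes’ 2nd form (variable names are mine):

```python
# Problem #2: P(disorder | positive test) via Bayes' 2nd form.
p_pos_given_d = 0.999      # P(positive | disorder)
p_pos_given_not_d = 0.004  # P(positive | no disorder)
p_d = 0.03                 # prevalence, P(disorder)

posterior = (p_pos_given_d * p_d) / (
    p_pos_given_d * p_d + p_pos_given_not_d * (1.0 - p_d))
print(posterior)  # ~0.885
```

Even with a very accurate test, the low prevalence keeps the posterior well below the 0.999 sensitivity.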

29.
Problem #3
• When observing a blue vehicle driving by on a dimly-lit night, there is a 25% chance that its true color is green. The same can be said when observing a green vehicle under the same conditions: there is a 25% chance that its true color is blue.
• Suppose that 90% of taxis on the streets are blue, while the other 10% are green.
• On a dimly-lit night you see a taxi randomly drive by and observe its color as blue.
• What is the probability that its true color is blue?
• Hint: condition the event that the true color is blue on the event that you’ve observed it to have blue color.
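Following the hint, the same Bayes machinery as in Problem #2 applies (variable names are mine):

```python
# Problem #3: P(true color is blue | observed blue) for the taxi.
p_obs_blue_given_blue = 0.75   # observation is correct 75% of the time
p_obs_blue_given_green = 0.25  # a green taxi looks blue 25% of the time
p_blue, p_green = 0.90, 0.10   # street-level priors

posterior = (p_obs_blue_given_blue * p_blue) / (
    p_obs_blue_given_blue * p_blue + p_obs_blue_given_green * p_green)
print(posterior)  # ~0.964
```

The strong 90% prior pushes the posterior above the 75% observation reliability.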

35.
Bayes Classifier
Bayes’ Theorem says that
P(c|d) ∝ P(c) P(d|c)
So our problem reduces to finding P(c) & P(d|c).
If we have a training set of N′ documents, without any prior knowledge our estimate would be
P(c) = Nc / N′
where Nc = the number of documents in class c
http://nlp.stanford.edu/IR-book/pdf/13bayes.pdf

36.
Bayes Classifier
We can define a document in terms of a set of features, here the tokens it has; then
P(d|c) = ∏ P(tk|c), with the product taken over k = 1 … n
  o n = the number of tokens in the document
  o P(tk|c) = the probability of tk occurring in class c, the likelihood as inferred from the training set, a.k.a. the “conditional attribute probability”
Note: since multiplying many small numbers can result in floating-point underflow, we take logs and add them instead.
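The underflow note is easy to see directly. A tiny sketch (the numbers are made up): a long product of small likelihoods collapses to 0.0 in floating point, while the equivalent sum of logs stays perfectly representable.

```python
import math

# 1000 token likelihoods of 0.01 each: 0.01**1000 = 1e-2000,
# far below the smallest representable double (~1e-308).
likelihoods = [0.01] * 1000

product = 1.0
for p in likelihoods:
    product *= p          # underflows to exactly 0.0

log_sum = sum(math.log(p) for p in likelihoods)  # no underflow

print(product)   # 0.0
print(log_sum)   # about -4605.17
```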

37.
Bayes Classifier
We are still left with calculating the likelihood P(tk|c):
P(tk|c) = Ntc / ∑t′ Nt′c
which is the relative frequency of tk occurring in class c in the training set, a.k.a. the conditional attribute probability.
In order to make this work for unseen tokens, i.e. when Ntc = 0, we add Laplace smoothing as follows:
P(tk|c) = (Ntc + 0.5) / (∑t′ Nt′c + Nt/2)
where Nt = the number of distinct tokens (adding 0.5 per token contributes Nt/2 to the denominator, so the estimates still sum to 1).
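A minimal sketch of the slide’s smoothed estimate (the toy counts and names are mine, not from the tutorial code):

```python
# Token counts Ntc for one class c, tallied from a toy training set.
counts_c = {"green": 3, "eggs": 2, "ham": 2, "sam": 0}  # "sam" unseen in c

n_tokens = float(sum(counts_c.values()))  # sum over t of Ntc  (= 7)
n_distinct = len(counts_c)                # Nt, vocabulary size (= 4)

def p_smoothed(token):
    # P(t|c) = (Ntc + 0.5) / (sum Ntc + Nt/2)
    return (counts_c.get(token, 0) + 0.5) / (n_tokens + n_distinct / 2.0)

print(p_smoothed("green"))  # (3 + 0.5) / (7 + 2) ~ 0.389
print(p_smoothed("sam"))    # (0 + 0.5) / 9 ~ 0.056, nonzero for the unseen token
```

The smoothed estimates over the vocabulary still sum to 1, which is why the 0.5 in the numerator is paired with Nt/2 in the denominator.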

40.
The Federalist Papers
• 85 essays published anonymously in 1787-88 to persuade New Yorkers to ratify the US Constitution
• Authorship of 12 papers was not clear: between Hamilton and Madison
• In 1962, Mosteller & Wallace undertook a study to fingerprint the authorship based on word frequencies in previous writings
• Their work revived Bayesian classification methods
http://www.freerepublic.com/focus/f-news/1386274/posts

44.
Problem Statement
• We have two small paragraphs
  o One from “Cat In The Hat” &
  o Another from “Green Eggs and Ham”
• We train the classifier with them
• Then we test our classifier with other lines from the books
• This gives us a starting point for a Bayesian classifier
• Hands on: load “cat_in_the_hat-xx.py”

45.
Code Walk thru (1 of 2)

#!/usr/bin/env python
#
# pydata Tutorial March 18, 2013
# Thanks to StreamHacker a.k.a. Jacob Perkins
# Thanks to Prof. Todd Ebert
#
import nltk.classify.util
from nltk.classify import NaiveBayesClassifier
from nltk import tokenize
#
# Data
#
label_1 = "Cat In The Hat"
train_text_1 = """So we sat in the house all that cold, wet day.
And we saw him! The Cat in the Hat!
Your mother will not mind.
He should not be here when your mother is out.
With a cake on the top of my hat!
I can hold the ship."""

test_text_1 = """That cat is a bad one,
That Cat in the Hat.
He plays lots of bad tricks.
Don't you let him come near."""

label_2 = "Green Eggs and Ham"
train_text_2 = """I am Sam.
Sam I am.
I do not like green eggs and ham.
I would not like them anywhere.
Would you like them in a house?
I will not eat them on a train.
Thank you, thank you, Sam I am."""

test_text_2 = """I would not eat them
here or there. I would not eat them anywhere.
I would not eat green eggs and ham."""
#
# For testing classification
#
classify_cith = "We saw the cat in the house."
classify_geah = "And I will eat them in a house."
classify_other = "A man a plan a canal Panama!"
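The slide above only loads the data; before training, each text has to be turned into the feature dict that NLTK’s NaiveBayesClassifier expects. A minimal bag-of-words sketch of that step (pure Python; the actual script presumably uses NLTK tokenizers, and the function name here is mine):

```python
def word_feats(text):
    # Bag-of-words features in the {feature_name: True} shape that
    # nltk.classify.NaiveBayesClassifier.train() consumes.
    return dict((word.lower(), True) for word in text.split())

feats = word_feats("We saw the cat in the house.")
print(feats["cat"])                       # True
print("cat" in word_feats("I am Sam."))   # False
```

Training would then pair such dicts with labels, e.g. (word_feats(line), label_1) for each training line, and pass the list to NaiveBayesClassifier.train().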

50.
Background
• Spam & ham data are hard to come by
• In 2005, the Enron investigations made public the e-mails of 150 Enron employees
• That gave impetus to spam research work
• The data we are using are the processed e-mails of 6 employees, mixed with 3 spam collections
• 16,545 ham + 17,171 spam, for a total of 33,716 items

          Ham    Spam
enron1   3672    1500
enron2   4361    1496
enron3   4012    1500
enron4   1500    4500
enron5   1500    3675
enron6   1500    4500
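The per-folder counts above can be cross-checked against the stated totals:

```python
# Ham/spam counts per folder, from the table above.
counts = {
    "enron1": (3672, 1500),
    "enron2": (4361, 1496),
    "enron3": (4012, 1500),
    "enron4": (1500, 4500),
    "enron5": (1500, 3675),
    "enron6": (1500, 4500),
}

ham = sum(h for h, s in counts.values())
spam = sum(s for h, s in counts.values())
print(ham, spam, ham + spam)  # 16545 17171 33716
```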

51.
Data Prep
• Unzip the data
• Each user has a ham & a spam directory
• The program spam-01.py runs over these files
• You should change the directory name [the root_path variable] to where you downloaded the data
• Caution:
  o The program takes ~10 minutes to run over the entire dataset
  o I created a directory enron-small with 20 ham & spam e-mails to test during development
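A sketch of the kind of directory walk spam-01.py presumably does over the ham/spam folders (the layout, root_path, and variable names here are assumptions; a throwaway layout is fabricated in a temp directory just so the sketch runs end to end):

```python
import os
import tempfile

# Fabricate a tiny enron-small-like layout: <root>/<user>/{ham,spam}/<n>.txt
root_path = tempfile.mkdtemp()
for user in ("enron1", "enron2"):
    for kind in ("ham", "spam"):
        d = os.path.join(root_path, user, kind)
        os.makedirs(d)
        for i in range(3):
            with open(os.path.join(d, "%d.txt" % i), "w") as f:
                f.write("hello" if kind == "ham" else "buy now")

# Walk the tree and label every file by its parent directory name.
labeled = []
for dirpath, dirnames, filenames in os.walk(root_path):
    label = os.path.basename(dirpath)
    if label in ("ham", "spam"):
        for name in filenames:
            labeled.append((os.path.join(dirpath, name), label))

print(len(labeled))  # 12 files: 6 ham + 6 spam
```

Pointing root_path at the real enron-small (or enron-pre-processed) directory instead of the fabricated one gives the loader the tutorial describes.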

57.
Summary
• Naïve Bayes
  o Assumes independence between the features
  o Assumes order is irrelevant
  o Assumes all features have equal importance
  o Has a skewed-data bias (it favors the class with more data)
  o http://www.stanford.edu/class/cs276/handouts/rennie.icml03.pdf has some good ideas
  o http://courses.ischool.berkeley.edu/i290-dm/s11/SECURE/Optimality_of_Naive_Bayes.pdf is another good paper
http://www.youtube.com/watch?v=pc36aYTP44o