
Feb. 27, 2014

Folks met and hacked on the Noisebridge discuss mailing list. We created a 102 MB text dump and a Python script to parse it, File:Py-piper-parser.txt. We wrote pseudocode for a naive Bayes filter to protect the world from trolls. Will implement soon.
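(A minimal sketch of reading such a dump, assuming it is in mbox format; the actual parser is File:Py-piper-parser.txt, and 'discuss.mbox' here is a placeholder file name:)

 import mailbox
 
 # iterate over every message in the mailing-list dump
 for msg in mailbox.mbox('discuss.mbox'):
     subject = msg['subject'] or ''
     if not msg.is_multipart():
         print(subject, len(msg.get_payload()))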

March 6, 2014

The group compared notes on list-scraping code and then delved into the details of the algorithm by tracing a simple example.
user is presented with a message
user labels it spam / not-spam (or on a scale)
user is presented with a rating: 50% (the naive starting point)
after the user labels, the algorithm readjusts

at the very beginning:
first step: assign a prior (50% of all messages are spam)
we also have a likelihood that each word is spam or not spam
- at the start these combine to a total of 50% (also consider log-likelihoods)
next step: the algorithm decides per message
this is the end of phase zero (PERFORMANCE PHASE)
the algorithm performs without human intervention

next: TRAINING PHASE
first step of the training phase:
examine the message, assign a rating (a spamicity score, e.g. 7/7, or a drama tag, or spam/not-spam)
then save the rating associated with the message
i.e. write it to a vector where each index corresponds to a message, as sketched below
spamicity vector:
  msg0  msg1  msg2  msg3
     1     0     0     3
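A rough Python sketch of the two phases described above; classify, human_rate, and update are placeholder names we are assuming, not fixed parts of the design (update is sketched further down):

 PRIOR_SPAM = 0.5          # starting prior: 50% of all messages are spam
 spamicity = []            # the vector above: spamicity[i] rates msg i
 
 def performance_phase(messages, classify):
     # phase zero: the algorithm rates messages with no human intervention
     return [classify(m) for m in messages]
 
 def training_phase(messages, human_rate, update):
     # a human rates each message; save the rating, then update word counts
     for m in messages:
         rating = human_rate(m)      # e.g. 1 for spam, 0 for ham
         spamicity.append(rating)    # each index corresponds to a message
         update(m, rating)           # update the word-frequency dictionary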
then update the dictionary
what dictionary? a temporary dictionary, in the first step
every dictionary item has a frequency count
now ...
after (say) 1000 messages, the algorithm guesses 'spam' or 'not-spam' mostly correctly (is this the TESTING PHASE?)
training continues via occasional instances of human correction
update the dictionary with each word in the (human-)rated message
one possibly viable dictionary structure:
{'word':[counted_in_spam, counted_in_not_spam]}
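A minimal sketch of the update step in Python, assuming the structure proposed above (the names counts and update are ours):

 from collections import defaultdict
 
 # {'word': [counted_in_spam, counted_in_not_spam]}
 counts = defaultdict(lambda: [0, 0])
 
 def update(message, is_spam):
     # add each word of a (human-)rated message to the frequency counts
     col = 0 if is_spam else 1
     for word in message.split():
         counts[word][col] += 1

Replaying the trace below with this function gives the same dictionary states: update('foo bar', True) yields {'foo': [1, 0], 'bar': [1, 0]}, and update('foo foo bar foo bar', False) then yields {'foo': [1, 3], 'bar': [1, 2]}.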
so, the algorithm might operate as per this trace:

 msg[0]: 'foo bar'               SPAM
 msg[1]: 'foo foo bar foo bar'   HAM
 msg[2]: 'bar bar bar foo'       ... WHAT IS THIS ????

We can consult the algorithm, because we now have examples of both SPAM and HAM, so we can get a Bayes-informed result.

dictionary:
 after msg[0]: {'foo': [1, 0], 'bar': [1, 0]}
 after msg[1]: {'foo': [1, 3], 'bar': [1, 2]}

Now we are at msg[2]: what is this? This looks easier to solve this time, because we have a vector of spam/ham labels. It looks like this: ['s', 'h'], or in binary [True, False], or [1, 0].

OK: we have a vector, a dictionary, and a message. Now what?
 dictionary: {'foo': [1, 3], 'bar': [1, 2]}
 labels:     [1, 0]
 msg[2]:     'bar bar bar foo' ... WHAT IS THIS ????
A: count of 'foo' given spam = 1
B: probability of spam = 0.5
C: count of 'foo' given ham = 3
... wait, a count of 3 is not a probability ???
this gets normalized later? maybe
Sam hopes this cancels out without being painful
D: probability of ham = 0.5

likelihood given 'foo':
 A * B / ((A * B) + (C * D))
 = (1 * 0.5) / ((1 * 0.5) + (3 * 0.5))
 = 0.25
normalizing those counts:
 1   = count of 'foo' in spam
 1/2 = probability of any one word slot | spam (2 words total in spam)
 .5  = probability of 'foo' | spam
 .5  = A (normed)
 3   = count of 'foo' in ham
 1/5 = probability of any one word slot | ham (5 words total in ham)
 .6  = probability of 'foo' | ham
 .6  = C (normed)
likelihood given 'bar':
 A * B / ((A * B) + (C * D))
 = (1 * 0.5) / ((1 * 0.5) + (2 * 0.5))
 = 0.3333... (1/3)

combining per-word likelihoods for msg[2] ('bar' appears three times, 'foo' once):
 (1/3.0)**3 * (1/4.0) / (((1/3.0)**3 * (1/4.0)) + ((2/3.0)**3 * (3/4.0)))
 = 0.04
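The same arithmetic in Python, using the counts from the trace (a sketch; word_spamicity and message_spamicity are our names):

 counts = {'foo': [1, 3], 'bar': [1, 2]}
 
 def word_spamicity(word, prior_spam=0.5):
     # A * B / ((A * B) + (C * D)), with raw counts for A and C
     spam, ham = counts[word]
     return (spam * prior_spam) / ((spam * prior_spam) + (ham * (1 - prior_spam)))
 
 def message_spamicity(message):
     # combine per-word spamicities, assuming independence between words
     p_spam = p_ham = 1.0
     for w in message.split():
         p = word_spamicity(w)
         p_spam *= p
         p_ham *= 1 - p
     return p_spam / (p_spam + p_ham)
 
 word_spamicity('foo')                 # 0.25
 word_spamicity('bar')                 # 0.3333...
 message_spamicity('bar bar bar foo')  # 0.04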
The way this is not fully Bayesian: p(foo) and p(bar) are interacting.
Also, are we normalizing correctly? If we normalized, we would take into account:
the average frequency of words in spam
the average frequency of words in ham
But even then this is not fully Bayesian, because so far we have been assuming independence between words (at the full-message level).
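For comparison, a sketch of the "normalized" variant discussed above, where the raw counts A and C become P(word | class) = count / total words seen in that class (same assumed names as before):

 counts = {'foo': [1, 3], 'bar': [1, 2]}
 total_spam = sum(c[0] for c in counts.values())  # 2 words seen in spam
 total_ham = sum(c[1] for c in counts.values())   # 5 words seen in ham
 
 def normalized_spamicity(message, prior_spam=0.5):
     like_spam, like_ham = prior_spam, 1.0 - prior_spam
     for w in message.split():
         like_spam *= counts[w][0] / total_spam   # P(w | spam)
         like_ham *= counts[w][1] / total_ham     # P(w | ham)
     return like_spam / (like_spam + like_ham)
 
 normalized_spamicity('bar bar bar foo')  # ~0.62, a different verdict than 0.04

Note this still assumes independence between words; the normalization only fixes the counts-vs-probabilities question, not the naive independence assumption.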