[Neil Schemenauer]
> I've written a driver script that does "all but one testing". The basic
> algorithm is:
>     gb = GrahamBayes()
>     for msg in spam:
>         gb.learn(msg, is_spam=True)
>     for msg in ham:
>         gb.learn(msg, is_spam=False)
>     for msg in spam:
>         gb.unlearn(msg, is_spam=True)
>         gb.spamprob(msg)
>         gb.learn(msg, is_spam=True)
>     for msg in ham:
>         gb.unlearn(msg, is_spam=False)
>         gb.spamprob(msg)
>         gb.learn(msg, is_spam=False)
>     print summary
>
> Is this type of testing useful?
It's sure better than nothing <wink>. Also better than nothing, but not as
good, is doing the same thing but skipping the learn/unlearn calls after
initial training.
> As I understand it, it's most useful when you have a small amount of testing
> and training data.
I've run no experiments on training set size yet, and won't hazard a guess
as to how much is enough. I'm nearly certain that the 4000h+2750s I've been
using is way more than enough, though. It's a question of practical
importance open for fresh triumphs <wink>.
> That doesn't seem to be a problem for us. Also, it's really slow.
Each call to learn() and to unlearn() computes a new probability for every
word in the database. There's an official way to avoid that in the first
two loops, e.g.
    for msg in spam:
        gb.learn(msg, True, False)  # third arg False: skip the per-call update
    gb.update_probabilities()       # recompute all the probabilities just once
In each of the last two loops, the total # of ham and total # of spam in the
"learned" set is invariant across loop trips, and you *could* break into the
abstraction to exploit that: the only probabilities that actually change
across those loop trips are those associated with the words in msg. Then
the runtime for each trip would be proportional to the # of words in the msg
rather than the number of words in the database.
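For concreteness, here's an untested sketch of what that could look like,
assuming the database is a dict mapping words to records with
spamcount/hamcount/spamprob attributes, that gb.nspam/gb.nham hold the
invariant totals, and that unlearn() (like learn()) takes an
update_probabilities argument -- all illustrative names, not necessarily the
real GrahamBayes internals:

    def update_word_probabilities(wordinfo, words, nspam, nham):
        """Recompute spamprob for just the given words.

        Valid here because a word's probability depends only on its own
        counts plus the global nspam/nham totals, and those totals don't
        change across the unlearn/score/relearn loop trips.
        """
        for word in words:
            record = wordinfo.get(word)
            if record is None:
                continue
            # Graham-style estimate: ham count double-weighted, result
            # clamped away from 0.0 and 1.0.
            hamratio = min(1.0, 2.0 * record.hamcount / nham)
            spamratio = min(1.0, float(record.spamcount) / nspam)
            if hamratio + spamratio == 0.0:
                record.spamprob = 0.5
            else:
                prob = spamratio / (hamratio + spamratio)
                record.spamprob = min(max(prob, 0.01), 0.99)

    for msg in spam:
        words = set(tokenize(msg))
        gb.unlearn(msg, True, False)
        update_word_probabilities(gb.wordinfo, words, gb.nspam, gb.nham)
        gb.spamprob(msg)
        gb.learn(msg, True, False)
        update_word_probabilities(gb.wordinfo, words, gb.nspam, gb.nham)

Each trip then touches only len(words) records instead of every word in the
database.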
Another area for potentially fruitful study: it's clear that the
highest-value indicators usually appear "early" in msgs, and for spam
there's an actual reason for that: advertising has to strive to get your
attention early. So, for example, if we only bothered to tokenize the first
90% of a msg, would results get worse? I doubt it. And if not, what about
the first 50%? The first 10%? The first 1000 bytes? max(1000 bytes, first
10%)? That could also yield a major speed boost, and *may* even improve
results -- e.g., sometimes an on-topic message starts well but then rambles.
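A minimal sketch of the kind of cutoff meant, assuming the message body is
available as plain text and using the max(1000 bytes, first 10%) variant
floated above (the numbers are the ones from the question, not measured
optima):

    def truncated(msg_text, min_bytes=1000, fraction=0.10):
        """Return the prefix of the message worth tokenizing."""
        cutoff = max(min_bytes, int(len(msg_text) * fraction))
        return msg_text[:cutoff]

Feeding tokenize(truncated(msg_text)) to the classifier instead of the whole
body would make tokenization cost roughly proportional to the cutoff rather
than to the message size.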