[Rob Hooft]
> ...
> Tim: It does look like your messages are a bit easier to classify than
> mine....
I don't know. The results I reported were:
> Here's a use_central_limit2 run with max_discriminators=50, trained
> on 5000 ham and 5000 spam, then predicting against 7500 of each
and all runs were on the same set of msgs.
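For readers who haven't tweaked it: max_discriminators caps how many word
clues the classifier actually scores per message. A minimal sketch of that
selection step, assuming the usual spambayes-style rule of keeping the word
probabilities farthest from the neutral 0.5 (the function name and data here
are mine, not the real code):

```python
def pick_discriminators(word_probs, max_discriminators=50):
    """Keep the strongest clues: spamprobs farthest from neutral 0.5."""
    ranked = sorted(word_probs, key=lambda p: abs(p - 0.5), reverse=True)
    return ranked[:max_discriminators]

# Illustrative probabilities only.
probs = [0.01, 0.5, 0.99, 0.6, 0.4, 0.95]
print(pick_discriminators(probs, 3))  # -> [0.01, 0.99, 0.95]
```

With a cap of 50, a long message contributes only its 50 most extreme
words to the final score, which is why the setting is worth optimizing.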
The last time you mentioned how "big" your tests are was:
> I focussed for our night on optimizing the max_discriminators for
> clt2 using 10x(200+200) messages out of my corpses,
I'm not sure exactly what 10x(200+200) means, but at the plausible extremes
it means your classifiers were trained on 200 of each, or on 1800 of each.
So at worst, my classifier was trained on 3x as much data, and at best on
25x as much data. Error rates certainly improve with more training data,
albeit slowly.
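The arithmetic behind those extremes: if 10x(200+200) means ten independent
runs each training on a single batch of 200 ham + 200 spam, the training size
is 200 of each; if it means 10-fold cross-validation over 2000+2000, each
classifier trains on the other nine folds, i.e. 1800 of each. A quick check
(the variable names are mine):

```python
folds = 10
per_fold = 200  # 200 ham + 200 spam in each of the 10 batches

# Reading 1: each run trains on just its own batch.
independent_runs = per_fold                 # 200 of each

# Reading 2: 10-fold cross-validation trains on the other 9 folds.
cross_validation = (folds - 1) * per_fold   # 1800 of each

# My runs trained on 5000 of each, so the ratios are:
print(round(5000 / independent_runs))       # -> 25 (best case)
print(round(5000 / cross_validation))       # -> 3  (worst case)
```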
OTOH, later you showed output saying
> Reading climbig12.pk ...
> Nham= 12800
> RmsZham= 2.76178782393
> Nspam= 5600
so at *some* point you stopped predicting against equal amounts of ham and
spam, but there's no way to guess how much was trained on for that result.
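For context on that RmsZham line: in the central-limit schemes each message
gets a z-score measuring how far its word evidence sits from the population
mean, and RmsZham is, as I read it, the root-mean-square of those z-scores
over the ham messages. A sketch under that assumption (the z-score values
are made up, not from the reported run):

```python
import math

def rms(zscores):
    """Root-mean-square of a sequence of per-message z-scores."""
    return math.sqrt(sum(z * z for z in zscores) / len(zscores))

# Illustrative per-message ham z-scores only.
zham = [1.2, -0.8, 2.5, -3.1, 0.4]
print(rms(zham))  # -> 1.9026...
```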
Interpreting results here gets very difficult because it's often not clear
what a tester is reporting on (how much training data, how much prediction
data, which test driver produced the results, what the relevant options
were).
That said, I expect my ham is easier than most, because newsgroup traffic
almost never contains personal msgs -- no screaming red HTML birthday wishes
from 9-year-old nieces, no confirmations of payment received, no opt-in
marketing newsletters, no chain letters forwarded from naive brothers, etc.