T. Alexander Popiel wrote:
> Argh. Most of the confusion arises from a complete lack of
> documentation on the interface to the regimes: what their
> parameters mean, what the return code means, etc. I'll try
> to get to that soon... unless someone beats me to it. Reading
> incremental.py is pretty much required until such docs get
> written.
Somewhat tangential, but...
Last night I set up the default Data/{Ham,Spam}/SetN testing structure
and was able to run incremental.py (with the balance_corrected regime
added) on the lot of it. I have 164 * 10 ham and 54 * 10 spam. The
spam rate has increased steadily since I started collecting - the first
10 spam took 100 days to come in (ahh the joys of a private domain name
and practicing safe computing! Alas, those days are no more). I used a
modified version of the dotest.sh script to run each set against each
regime, which produced 70 graphs that, while nice, don't allow for easy
comparative analysis*.
The docs in timtest.py and timcv.py don't suggest any easy/automatic
way to change .ini settings or regimes (I haven't gone through the code
yet, however), but those scripts seem to be the standard for assessing
the impact of a change to the tokenizer, etc.
I'd like to cook up something that will take a list of .ini files (or
Option objects, if I understand correctly - are they equivalent?) and a
list of regimes and run all the combinations, outputting a few pretty
graphs. The end goal is a suite that easily tells a) what effect a
regime change has across a range of .ini settings (or, conversely, what
effect an .ini change has across the various regimes) and, more
pragmatically, b) what the "best" .ini options and regime are for my
mail stream. We'll
see how much happens this weekend. :)
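
A rough sketch of the kind of harness I have in mind, in case it helps
the discussion. Nothing here is an existing spambayes interface: run_one
is a hypothetical callable that would wrap a single incremental.py run
for one (.ini file, regime) pair and return a dict of metrics, and the
"fp_rate" key, the .ini file names, and "other_regime" are placeholders
as well. Only balance_corrected is a name taken from my actual setup.

```python
import itertools


def run_grid(ini_files, regimes, run_one):
    """Run every (.ini file, regime) combination.

    run_one is a hypothetical callable taking (ini_file, regime) and
    returning a dict of metrics; it stands in for whatever actually
    drives a single test run.
    """
    results = {}
    for ini, regime in itertools.product(ini_files, regimes):
        results[(ini, regime)] = run_one(ini, regime)
    return results


def best_combo(results, metric="fp_rate"):
    """Return the (ini_file, regime) pair with the lowest value of metric."""
    return min(results, key=lambda combo: results[combo][metric])


def format_table(results, ini_files, regimes, metric):
    """Lay out one metric as plain text: one row per .ini, one column per regime."""
    width = max(len(name) for name in list(ini_files) + list(regimes)) + 2
    header = "".ljust(width) + "".join(r.ljust(width) for r in regimes)
    lines = [header.rstrip()]
    for ini in ini_files:
        cells = "".join(
            f"{results[(ini, r)][metric]:.3f}".ljust(width) for r in regimes
        )
        lines.append((ini.ljust(width) + cells).rstrip())
    return "\n".join(lines)


if __name__ == "__main__":
    # Stub standing in for a real run; the numbers are made up purely
    # so that the script has something to print.
    def fake_run(ini, regime):
        return {"fp_rate": 0.01 * len(ini) + 0.001 * len(regime)}

    inis = ["default.ini", "experiment.ini"]        # placeholder file names
    regimes = ["balance_corrected", "other_regime"]  # second name is a placeholder
    grid = run_grid(inis, regimes, fake_run)
    print(format_table(grid, inis, regimes, "fp_rate"))
    print("lowest fp_rate:", best_combo(grid))
```

Keeping the results keyed by (ini, regime) means the same dict can be
sliced either way: by row to see what a regime change does for one .ini,
or by column to see what an .ini change does for one regime.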
Any suggestions, ideas for features, pointers, etc.?
Eli
[*] - Though a few spikes in the FP line did lead me to find a few spam
in my ham corpus that I had missed previously. ;)