[Skip Montanaro]
> Interesting scheme. When I tried that I got swamped by '(qmail NNN
> ...' stuff, where it appears that NNN is a process id. To retain
> this in its current form I suspect we'd have to either specifically
> eliminate such features or implement hapax expiration.
Changing the regexp to use [a-z] instead of \w would weed out all that
stuff. I didn't see any matches containing digits that looked promising. The
all-text ones looked interesting, though.
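
To illustrate the difference, here is a minimal sketch. The two patterns and
the sample Received line are made up for the example; neither pattern is the
regexp actually in the tokenizer. It assumes we're grabbing parenthesized
comments and only changing the character class:

```python
import re

# Hypothetical patterns, for illustration only (not the tokenizer's regexp).
WIDE = re.compile(r'\(([\w ]+)\)')     # \w admits digits, so qmail pids get through
NARROW = re.compile(r'\(([a-z ]+)\)')  # lowercase letters and spaces only

# Made-up Received line.
received = ("from unknown (HELO mail.example.com) (may be forged) "
            "by host.example.net (qmail 12345 invoked by uid 0); "
            "Mon, 1 Jan 2001 00:00:00 -0000")

wide_hits = WIDE.findall(received)      # includes the qmail comment
narrow_hits = NARROW.findall(received)  # keeps only the all-text comment
```

With this sample, the \w version picks up both "may be forged" and the qmail
comment with its process id, while the [a-z] version keeps only "may be
forged".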
> Perhaps we should add
>     header = re.sub(r'\s+', ' ', header)
> to the "for header ..." loop in any case?
There are many "for header" loops, and I'm not sure which one(s) you're
talking about here. If you want to do this somewhere,
    header = ' '.join(header.split())
is faster.
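
A small sketch comparing the two spellings on a made-up header value. The one
behavioral difference to keep in mind is that the split/join form also drops
leading and trailing whitespace, while the re.sub form leaves a single space
there:

```python
import re

# Made-up folded header value with mixed runs of whitespace.
header = "from  a.example.com\n\tby b.example.com   with SMTP "

via_regex = re.sub(r'\s+', ' ', header)  # collapses runs, keeps the trailing space
via_split = ' '.join(header.split())     # collapses runs and strips both ends
```

If the speed difference matters for a particular loop, the standard timeit
module is an easy way to measure it on real headers.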
> It seems that many other headers get split that way. If we're
> looking for features which include whitespace we should probably
> normalize it.
I doubt this is often a concern. It's dangerous to make basic changes "in
general", so don't do it except where there's a specific need. It should be
fine in Received lines. As a counter-example, Subject line parsing *wants*
to know whether tab characters appear, and runs of multiple spaces are also
significant there. It's irrelevant to parsing of multi-line address headers
(like To and Cc) because email.Utils.getaddresses() is already used for
those, and already hides the line structure.
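
As a sketch of why folding doesn't matter for address headers, here is a
made-up To value split across two lines. It uses the modern lowercase module
name, email.utils (email.Utils is the older spelling of the same module):

```python
from email.utils import getaddresses

# Made-up folded To value: the second address sits on a continuation line.
to_value = "Alice <alice@example.com>,\n\tBob <bob@example.com>"

# getaddresses() takes a list of header values and returns (name, address) pairs.
pairs = getaddresses([to_value])
addresses = [addr for _name, addr in pairs]
```

The newline and tab between the two entries never show up in the result, so
normalizing whitespace beforehand buys nothing for these headers.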
> I'm willing to tuck the more general received sifting into the
> tokenizer controlled by a new experimental option. Let me know if
> you want me to take that step.
No, I don't want another experimental option just for this. It seems clear
enough already that "may be forged" is potentially interesting, and also
that "may be forged" isn't the only potentially interesting string. We
should suck up a bunch of them, or none of them. The classifier will learn
which are and aren't useful, and it sure looks like that will vary depending
on the user (that one of my ISPs is Comcast and one of yours isn't is not a
good reason to pooh-pooh the clues Comcast leaves behind <wink>).