On 3.3.2 version command sa-learn from maibox I have next on any input mailboxes:
> sa-learn --spam --mbox spam-2011-10-16.mbox
> Learned tokens from 0 message(s) (0 message(s) examined)
After downgrade to v 3.2.5 on same files I have next output:
> sa-learn --spam --mbox spam-2011-10-16.mbox
> Learned tokens from 0 message(s) (3303 message(s) examined)
learning from splitted messages does work correctly.

When saving messages with kmail 1 the From Line has following format which is not iaw RFC 822:
From info@ende-18-06.com Fri Jun 17 16:03:07 2011
which is not iaw RFC 822 but will be parsed by sa-learn reporting:
Learned tokens from 1 message(s) (1 message(s) examined
With kmail 2 the format is changed to the format which is iaw RFC 822
From thomas@arend-rhb.de Tue, 15 May 2012 22:01:41 +0200
which is not parsed correctly by sa-learn. sa-learn --spam reports:
Learned tokens from 0 message(s) (0 message(s) examined)
Proposed to change the behavior in a way that the old malformed From lines and the new correct ones are parsed.

> When saving messages with kmail 1 the From Line has following format
> which is not iaw RFC 822:
> From info@ende-18-06.com Fri Jun 17 16:03:07 2011
> With kmail 2 the format is changed to the format which is iaw RFC 822
> From thomas@arend-rhb.de Tue, 15 May 2012 22:01:41 +0200
> which is not parsed correctly by sa-learn. sa-learn --spam reports: [...]
Oh, no, not yet another incompatible mbox format!!!
> Proposed to change the behavior in a way that the old malformed From lines
> and the new correct ones are parsed.
It is the other way around, the new one differs from everybody else.
The format of the mbox file (along with its separator From_ lines)
is *not* governed by RFC 822 or its successors. There is no formal
standard for an mbox format, the RFC 4155 comes closest:
http://tools.ietf.org/html/rfc4155
See also a Wikipedia article:
http://en.wikipedia.org/wiki/Mbox
RFC 4155 says:
| a timestamp indicating the UTC date and time when the message
| was originally received, conformant with the syntax of the
| traditional UNIX 'ctime' output sans timezone (note that the
| use of UTC precludes the need for a timezone indicator);
This matches qmail docs:
http://qmail.org/qmail-manual-html/man5/mbox.html
and matches Postfix and sendmail's local delivery agent.
To accommodate the new incompatible format it seems that the
two instances of a regexps in ArchiveIterator.pm need to be
extended, or just relaxed. Not sure if the date would still
be correctly parsed.
Best would be to persuade kmail folks to back off the change!

> To accommodate the new incompatible format it seems that the
> two instances of a regexps in ArchiveIterator.pm need to be
> extended, or just relaxed. Not sure if the date would still
> be correctly parsed.
>
> Best would be to persuade kmail folks to back off the change!
Agreeing with Mark. There is no standard mbox format.
However, I can't find the ticket but I made some small tweaks recently-ish to trunk to help another mbox format issue. I wonder if trunk can parse this new format?
1 - Can you attach a file with 3 mbox format files that aren't read
2 - Perhaps we need a relaxed option in ArchiveIterator that uses a less stringent From line test or even a way to specify a regex to use in a cf?
Regards,
KAM

> 1 - Can you attach a file with 3 mbox format files that aren't read
Two are already attached, one old and one new.
I tried it, and the trunk can't parse the new one.
> 2 - Perhaps we need a relaxed option in ArchiveIterator that uses a less
> stringent From line test or even a way to specify a regex to use in a cf?
I thought of that. The From line test is really quite stringent.
The concern is that one could encounter an mbox file in an mboxcl or mboxcl2
format (see the Wikipedia article on mbox), where a From may not even
be protected by a '>', but parsing relies on a Content-Length header
field. I believe this is also being adopted by Dovecot.

(In reply to comment #8)
> > 1 - Can you attach a file with 3 mbox format files that aren't read
>
> Two are already attached, one old and one new.
> I tried it, and the trunk can't parse the new one.
I meant a single mbox file with the new format with 3 emails in it.
> > 2 - Perhaps we need a relaxed option in ArchiveIterator that uses a less
> > stringent From line test or even a way to specify a regex to use in a cf?
>
> I thought of that. The From line test is really quite stringent.
Agreed.
> The concern is that one could encounter an mbox file in an mboxcl or mboxcl2
> format (see the Wikipedia article on mbox), where a From may not even
> be protected by a '>', but parsing relies on a Content-Length header
> field. I believe this is also being adopted by Dovecot.
Well at least mboxcl is a different format and we'll have to cross that bridge if it looks like it will be adopted.

A first quick and dirty solution to work around this problem was encouraging.
I converted the kmail2 mbox with the following command to the older version:
sed '/^From / s#\(...\), \([0-3][0-9]\) \(...\) 2012 \([0-2][0-9]:[0-5][0-9]:[0-5][0-9]\) [+-][0-9]\{4\}#\1 \3 \2 \4 2012#' <spam-old.mbox >spam-new.mbox
sa-learn reported:
Learned tokens from 1489 message(s) (1733 message(s) examined)
Which may be ok some messages may be learned earlier.
A trailing "0" in the day - which may not comply to ctime format - does not effect the parsing of the mbox file.
This work-araound will help me over the next days and weeks. But learning in kmail2 may not work or will need a modification.
I understand that there are different definitions for the date time in the "From_" line. What a mess!
I agree that the best would be to persuade the kmail folks to back off the change. You may refer to the following bug report: 297198
https://bugs.kde.org/show_bug.cgi?id=297198

(In reply to comment #13)
> A first quick and dirty solution to work around this problem was encouraging.
>
> I converted the kmail2 mbox with the following command to the older version:
>
> sed '/^From / s#\(...\), \([0-3][0-9]\) \(...\) 2012
> \([0-2][0-9]:[0-5][0-9]:[0-5][0-9]\) [+-][0-9]\{4\}#\1 \3 \2 \4 2012#'
> <spam-old.mbox >spam-new.mbox
>
> sa-learn reported:
> Learned tokens from 1489 message(s) (1733 message(s) examined)
>
> Which may be ok some messages may be learned earlier.
>
> A trailing "0" in the day - which may not comply to ctime format - does not
> effect the parsing of the mbox file.
>
> This work-araound will help me over the next days and weeks. But learning in
> kmail2 may not work or will need a modification.
>
> I understand that there are different definitions for the date time in the
> "From_" line. What a mess!
>
> I agree that the best would be to persuade the kmail folks to back off the
> change. You may refer to the following bug report: 297198
>
> https://bugs.kde.org/show_bug.cgi?id=297198
Very interesting idea to use sed. You should be able to pipe that directly to sa-learn as well instead of even using a temporary file. That's a good temporary fix.

>Very interesting idea to use sed. You should be able to pipe that directly to >sa-learn as well instead of even using a temporary file. That's a good >temporary fix.
Thanks,
I did not time zone adjustment. Would be a bit complicate for sed. I hope this will not effect the spam tokens to much.

(In reply to comment #7)
> However, I can't find the ticket but I made some small tweaks recently-ish
> to trunk to help another mbox format issue. I wonder if trunk can parse
> this new format?Bug 6413 and no, that was for communigate not kmail. Not relevant except it touches on the same non-standardization for mbox formats.

It's getting worse with the kmail folks. The sometimes put to From_ lines in the mbox file.
From thomas@arend-rhb.de Sat, 19 May 2012 00:10:44 +0200
From thomas@arend-rhb.de Sat May 19 00: 16:57 2012
[..]
In teh second line you can see that they spend an extra space in case the hour would extend to more than 99 min in the future.
I will report this to kmail.
Have a nice Sunday!

(In reply to comment #27)
> Created attachment 5071[details]
> Test mbox showing two From_ lines
>
> This is an export of four messages which show two from lines.
Just found that the corrupted From_ line was added when filtering through formail. Don`t ask me why. Formail works on command line as expected.

(In reply to comment #29)
> To allow for variations in whitespace, perhaps:
>
> /^From\s{1,5}\S+\s{1,5}[[:upper:]][[:lower:]]{2}(?:, \d\d
> [[:upper:]][[:lower:]]{2} \d{4} [0-2]\d:\d\d:\d\d [+-]\d{4}|
> [[:upper:]][[:lower:]]{2} [ 1-3]\d [ 0-2]\d:\d\d:\d\d \d{4})/
I never saw more than one whitespace after the "From" but one or two white spaces after the "e-mail" address / before the weekday.

> Patch committed.
This doesn't work as intended, sa-learn now rejects most messages
as oversized.
The problem is that mbox_format_from_regex in Conf.pm defaults to
an empty string, but _set_default_message_selection_opts() tests
for undef, so the regexp in $self->{opt_from_regex} remains an
empty string, which according to some bizarre Perl brain damage
defaults to a last regexp that matched, so the ArchiveIterator
keeps finding huge messages.
The fix is trivial, just removing the default => ''
from the Conf.pm entry on a config option 'mbox_format_from_regex'.
Also I think the /$self->{opt_from_regex}/ in sub _run_mailbox
could deserve an /o flag, as I don't think its value can change
once it has been set and was used for the first time.
Btw, while investigating why sa-learn broke I ended up in
letting ArchiveIterator and sa-learn be able to take a message
size limit option. Will open a separate ticket for this.
Sending ArchiveIterator.pm
Sending Conf.pmT
Committed revision 1354072.

(In reply to comment #34)
> Thanks, Mark. That would have driven me up a wall!
Does mbox_format_from_regex break anything with mass-check?
From John Hardin:
When I run masschecks locally against an up-to-date repo, it is not setting the message boundary RE properly end gets scads of uninitialized variable errors trying to parse the corpus mailbox files. Last I looked, I added some warn() output and it was setting the default RE properly but then appeared to be resetting it later somewhere.

(In reply to comment #36)
> mass_checks child process:
> mass_checks::generate_queue
> -> ArchiveIterator::_scan_targets
> -> ArchiveIterator::_set_default_message_selection_opts
> (sets opt_from_regex for a child process)
>
> mass_checks parent process:
> waits for child;
> mass_checks::run_post_scan
> -> run_through_messages
> --> $iter->_run_message (runs with an undefined opt_from_regex)
This is also happening to Marc Andre Selig.
I think we need to revert the patches related to this, see if mbox numbers improve and then figure out a masscheck safe way to implement the fixes.
That's commits 1350839 & 1354072 however fundamentally, I thought I set a sane default for the option and this is confusing me a lot.

(In reply to comment #37)
> (In reply to comment #36)
> > mass_checks child process:
> > mass_checks::generate_queue
> > -> ArchiveIterator::_scan_targets
> > -> ArchiveIterator::_set_default_message_selection_opts
> > (sets opt_from_regex for a child process)
> >
> > mass_checks parent process:
> > waits for child;
> > mass_checks::run_post_scan
> > -> run_through_messages
> > --> $iter->_run_message (runs with an undefined opt_from_regex)
>
>
> This is also happening to Marc Andre Selig.
>
> I think we need to revert the patches related to this, see if mbox numbers
> improve and then figure out a masscheck safe way to implement the fixes.
>
> That's commits 1350839 & 1354072 however fundamentally, I thought I set a
> sane default for the option and this is confusing me a lot.
The issues with masscheck and archiveiterator are related and at the same time unrelated.
IMO, we are improperly using the ArchiteIterator class without calling _set_default_message_selection_opts which is where I set the sanity check for the opt_from_regex.
I think this should be called from new when the class is constructed rather than rely on someone using the API to call it. However, bowing to existing structure, I have added a call to _set_default_message_selection_opts in mass-check.sh.
I've checked and this change appears to be the fix.
Index: masses/mass-check
===================================================================
--- masses/mass-check (revision 1373195)
+++ masses/mass-check (working copy)
@@ -419,6 +419,9 @@
$iter->set_functions(\&wanted, \&result);
}
+#AFTER SETTING ALL OPTS, RUN _set_default_message_selection_opts() TO GET SANE DEFAULTS
+$iter->_set_default_message_selection_opts();
+
my $messages;
# normal mode as well as a server do scan mode and get a temp file
svn commit -m 'change to mass-check to get ArchiteIterator sane defaults like the from separator'
Sending lib/Mail/SpamAssassin/ArchiveIterator.pm
Sending masses/mass-check
Transmitting file data ..
Committed revision 1373202.

This is SA Bugzilla: the Apache SpamAssassin bug system. In case
of problems with the functioning of SA Bugzilla, please contact
bugzilla-admin@apache.org.
Please Note: this e-mail address is only for reporting problems
with SA Bugzilla. Mail about any other subject will be silently
ignored.