>> It's not clear much can be done, though it might be interesting to
>> try an option to map Latin-1 accented characters to their unadorned
>> ASCII counterparts, at least in subjects (strip_subject_accents?).
Alex> I suspect that would have serious detrimental effects for foreign
Alex> language users.
My thought was that whether or not to enable the option would be under user
control. If it's a good spaminator for me why should I suffer because the
effbot's native language includes accented characters? (No offense
intended toward any Swedes who might be reading this, BTW. ;-)
>> The problem with trying such an experiment isn't that it might not be
>> worthwhile, but that if it's a new spammer technique, there won't be
>> many messages in our existing spam/ham databases which would exercise
>> the technique.
Alex> I don't see this as any different from any of the other neologisms
Alex> that spammers come up with; if they persist in using such words
Alex> (and you're still training), then the odd words with accents will
Alex> quickly become strong spam indicators. No need for us to do
Alex> anything... it's already going to be handled properly.
Except note that they weren't accenting every vowel and there were many
other accents to choose from. The message I received had "makë" and "teën".
There are several other accented characters with "a" or "e" as their base
character. I would have to receive many messages using this technique to
build up enough such odd words to make a difference. I think that's the
spammer's basic idea with this - keep it readable but fly below the word
count radar. Like I said, "subject:love" is very spammy for me, but I'd
never seen "subject:löve" before, so it wasn't used to score the message.
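
To make the idea concrete, here is a rough sketch of what such an option might do. It is not SpamBayes code: the tokenize_subject function and its strip_subject_accents flag are hypothetical names for illustration, and it assumes the subject has already been decoded to a Unicode string. It leans on the standard unicodedata module to split each accented character into its base letter plus combining marks, then drops the marks.

```python
import unicodedata


def strip_accents(text):
    """Map accented characters to their unadorned base characters."""
    # NFKD decomposition splits e.g. "ë" into "e" plus a combining diaeresis.
    decomposed = unicodedata.normalize("NFKD", text)
    # Drop the combining marks and keep the base characters.
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def tokenize_subject(subject, strip_subject_accents=False):
    """Yield "subject:word" tokens.

    Hypothetical stand-in for a real tokenizer; the option name is an
    assumption taken from the suggestion quoted earlier in this thread.
    """
    if strip_subject_accents:
        subject = strip_accents(subject)
    for word in subject.lower().split():
        yield "subject:" + word


if __name__ == "__main__":
    # With the option off, "löve" stays a never-before-seen token.
    print(list(tokenize_subject("Löve to makë teën")))
    # With the option on, it folds into the already-trained "love".
    print(list(tokenize_subject("Löve to makë teën", strip_subject_accents=True)))
```

With the option enabled, "subject:löve" collapses into "subject:love", so the existing training data gets used. Characters that have no decomposition (such as "ø") pass through unchanged, so this only covers part of what a spammer could try.
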
The fundamental problem when dealing with new spam techniques is (and will
always be, I think) when to mount a counterattack. That's certainly the
case here.
Skip