Sat, 08 Dec 2012

Having not had much luck with spam filtering solutions like SpamAssassin,
I'm forever having to add new spam filters by hand. For instance, after
about the sixth time I get "President Waives Refi Requirement"
or "Melt your fat! MUST WATCH this video now!" within a couple of
hours, I'm pretty tired of it and don't want to see any more of them.

With mail filtering programs like procmail or maildrop, it's easy
enough to match a pattern like "Subject:.*Refi Requirement" or
"Subject:.*Melt your fat" and filter that message to a spam folder
(or /dev/null).

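But increasingly, I add patterns I'm seeing in spam messages, and yet
the messages with those patterns keep coming in. Why? Because the
spammers are using RFC 2047 encoded words: the subject is sent as
base64 or quoted-printable text tagged with a character set, so the
words I'm matching never appear literally in the header.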

Here's how it works. A spammer sends a subject line that looks
something like this:

Subject: =?utf-8?B?U3RvcCBPdmVycGF5aW5nIGZvciBQcmludGVyIEluaw==?=

Mail programs are smart enough to decode this into:

Subject: Stop Overpaying for Printer Ink

but spam filtering programs often aren't, so your "printer ink" filter
won't catch it. And if you look through your spam folder with tools like
grep to see why it didn't get caught, or to find particularly spammy
subjects that might call for a filter
(grep Subject spamfolder | sort is pretty handy),
these encoded subjects will be incognito.
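
To make the mechanism concrete, here is a small sketch that takes the
example header above apart by hand. An RFC 2047 encoded word has the
shape =?charset?encoding?text?=, where the encoding is B (base64) or
Q (a quoted-printable variant). The variable names are only
illustrative, and the hand-splitting assumes a single base64 encoded
word on one line:

```python
import base64
import re

raw = "Subject: =?utf-8?B?U3RvcCBPdmVycGF5aW5nIGZvciBQcmludGVyIEluaw==?="

# The kind of pattern a procmail-style filter would look for.
pattern = re.compile(r"printer ink", re.IGNORECASE)

# The raw header never contains the words, so the pattern finds nothing.
raw_match = pattern.search(raw)

# Split the encoded word on "?" into: "=", charset, encoding, payload, "=".
encoded_word = raw.split(" ", 1)[1]
_, charset, encoding, payload, _ = encoded_word.split("?")

# "B" means the payload is base64; decode it, then apply the charset.
decoded = base64.b64decode(payload).decode(charset)

print(raw_match)                        # None
print(decoded)                          # Stop Overpaying for Printer Ink
print(bool(pattern.search(decoded)))    # True
```

The same pattern that misses the raw header matches once the payload
has been decoded, which is the whole problem in miniature.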

I briefly tried setting up a filter that spam-filed anything with =? in the
Subject line. But that's way too broad a brush -- there are
legitimate reasons for using other charsets even in English
language email. It's relatively rare, but it happens. And some bots,
notably the Adafruit forum notification bot
and the bot that sends out announcements from my alma mater,
unaccountably encode the charset even when they're sending mail
entirely in US ASCII.

So what's really needed is not to filter out all messages that specify
a charset, but to decode the Subject so the spam filter can see it and
filter it accordingly.

How? I couldn't find any ready-made tool available for Linux that
could decode RFC 2047 headers, but the Python email package makes
decoding a one-line task.
In the Python interpreter:
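
A minimal sketch of such a session, written as a script and assuming
Python 3 (where the module is spelled email.header, lowercase, rather
than the email.Header named in this post). The variable names are
illustrative:

```python
from email.header import decode_header

raw = "=?utf-8?B?U3RvcCBPdmVycGF5aW5nIGZvciBQcmludGVyIEluaw==?="

# decode_header() returns a list of (text, charset) pairs,
# one per run of text that shares a charset.
parts = decode_header(raw)
text, charset = parts[0]

# For an encoded word, text is bytes and charset says how to decode it.
subject = text.decode(charset)
print(subject)    # Stop Overpaying for Printer Ink
```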

So it's easy to write a script that can pull headers out of email
messages (files) and decode them. Just look for the line starting with
the header you want to match -- e.g. "Subject:" -- and pass that line
to email.Header.decode_header() (spelled email.header.decode_header()
in Python 3, which dropped the capitalized module name).
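
Such a script might look roughly like the sketch below. It assumes
Python 3, and the helper names decode_line() and find_header() are
made up for illustration. It only examines the single physical line
that starts with the header name, and it does not guard against
malformed encoded words, which decode_header() may reject:

```python
import sys
from email.header import decode_header


def decode_line(line):
    """Decode any RFC 2047 encoded words in one header line."""
    pieces = []
    for text, charset in decode_header(line):
        if isinstance(text, bytes):
            try:
                text = text.decode(charset or "ascii", errors="replace")
            except LookupError:
                # Unknown charset name: fall back to ASCII with replacement.
                text = text.decode("ascii", errors="replace")
        pieces.append(text)
    return "".join(pieces)


def find_header(lines, name="Subject"):
    """Return the decoded first line starting with 'name:', or None."""
    prefix = name.lower() + ":"
    for line in lines:
        if line.strip() == "":
            break    # a blank line ends the header block
        if line.lower().startswith(prefix):
            return decode_line(line.rstrip("\r\n"))
    return None


if __name__ == "__main__":
    # Usage: python decodemail.py message1 message2 ...
    for path in sys.argv[1:]:
        with open(path, encoding="ascii", errors="replace") as f:
            print(find_header(f))
```

The decoded line can then be matched against the same patterns that
would be used on an unencoded subject.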

Only one snag. If the subject is longer than about 20 characters,
spammers will often opt to split it up into multiple groups, sometimes
even in different character sets. So for example, you might see
something like this, spread over multiple lines: