Posted by timothy on Tuesday January 13, 2004 @09:16PM
from the re:-claire-yum-donut-manhattan-regrets-cute dept.

hcg50a writes "Wired has a story about the random words which have recently been appearing in spam. Antispam experts agreed that this isn't a brand-new technique, but said the addition of potentially filter-foiling gibberish is rapidly becoming a common component of spam."

"Most of the illegal-exploit spammers use hash busters and any other trick they can to get past filters, refusing to accept that people use spam filters because they really don't want spam," Linford added.

I really don't understand this part: going after people who are taking active measures against your enterprise because they aren't interested. Why bother to market to them at all? Is the rate of return worth all the ill will, DoS attacks and legislation?

I can see them doing this to overcome Bayesian filters, but why? AFAIK, Bayesian filters are not used much (if at all) on mail servers. These filters are run at home by geeks.

Granted, this may get them past the filters, but if somebody's gone through the effort of setting up a Bayesian filter, they're not going to buy your product even if you get into their inbox. It seems like a waste of everybody's effort, and I mean including the spammers.

A Bayesian spam filter teamed with a standard grammar checker adapted from an open-source word processor.

It'll take more processing power, and lead to spammers following proper grammar in their pseudo-nonsense, but it's the way to raise the bar against this attack (making those spammers that can't clear the bar out of luck).

The solution to randomness is to spell check and grammar check incoming e-mail, and consider violations as cause to add points to the score indicating that it's spam-like.

Sure, a few strange words might be a name that's not in the filter yet, but pure gibberish should be a red flag that either somebody's cat walked on the keyboard, or there's spam going on here. Heavy use of "non-spam" words can override to indicate it's good mail... but a poorly composed mail that doesn't use language seen in friendly mail is highly likely to be spam....

Spam is a perfect carrier for steganographic data since it's broadcast to millions of people and nobody can fall under suspicion merely by receiving it. When the government wants to monitor people's communications to search for steganography but doesn't do anything about spam, the purpose of the monitoring is probably not the stated one.

Try this: turn on the "size" column in your favourite email client. I use Eudora (Tools-options-Mailbox). Note that a normal plaintext email is 3k. Now look at the size of a spam. You're paying for that, or someone is. Soon the spam arms race is going to require everyone to have broadband just to check their email.

It is not very often that people send random gibberish in e-mail. Why not look for the gibberish? Hell, even MS Word can detect gibberish; I think a spam filter could score a message on non-linguistic gibberish.

As the article points out, the technique isn't as effective as one might initially think. However, there's a clear "next generation" method that I'm sure we'll soon be seeing:

Insert four or five lines of valid extra text -- lines from books, selections from recent USENET postings, etc, etc -- into the spam.
Make the selection semi-random.
Now do it 100 times and send 100 copies to each person on the mailing list.

At first glance it doesn't seem to make sense, but think about it. They take a little time and effort to thwart your filter and they may increase distribution slightly. When you're sending something like a billion emails a day, even a 1% increase is significant. If they can then get 1% of that 1% of a billion emails to buy something, they rake it in. Sending the email doesn't cost them a dime and they have everything to gain.
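To put rough numbers on that, here is a back-of-the-envelope sketch. The figures are just the hypothetical ones from the comment above (a billion emails a day, 1% extra getting through, 1% of those buying), not measured data, and the variable names are made up for illustration.

```python
# Hypothetical figures taken from the comment, not real measurements.
emails_per_day = 1_000_000_000

# 1% more messages slip past filters thanks to the padding trick.
extra_delivered = emails_per_day // 100

# 1% of those extra deliveries turn into a purchase.
extra_buyers = extra_delivered // 100

print(f"extra messages delivered per day: {extra_delivered:,}")
print(f"extra buyers per day: {extra_buyers:,}")
```

Even if the real response rate were orders of magnitude lower than the assumed 1%, the near-zero cost of sending means the spammer still comes out ahead.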

The technique also makes obvious the lie of their "we're just innocent entrepreneurs trying to make a buck" defense. Innocent entrepreneurs don't go out of their way to try to hack their data into other people's computers, past programs that are every bit as clear a sign of intent as a "No Soliciting" sign on your door.

On every spam thread on Slashdot, there's someone complaining that technical measures won't solve the problem, and another saying legal measures won't solve the problem. The answer is that you need both: technical measures to assure the identity of the sender -- both spammer and sponsor -- as well as legal measures to provide for punishment.

Unfortunately, spammers are not in the business of selling things to consumers. They are in the business of selling advertising space to other companies. As long as they can convince unscrupulous business owners that advertising via spam is worthwhile, the spam will continue.

It just goes to show, they're not just motivated by greed. They, or at least the people making the programs that do this, actually *want* to annoy the shit out of people. They think it's their right to annoy us like this and they're on a mission to assert that right by subverting all attempts to tune them out. It's not just greed; it's a weird kind of sociopathy.

Most of them are using random word sequences; the random strings like xdwexe are not usually an important percentage of the overall text, no more than names might be. Besides, how large a corpus of "valid" words do you want to use? The OED weighs in at almost 0.5M words; and then there's another 0.5M uncatalogued scientific terms and neologisms, plus common misspellings and typos and jargon and dialect orthography (like our color, meter, checker, jail etc. for the Brits' colour, metre, chequer, gaol)...

If you don't want to keep the entire corpus of "valid" words in your code, you're going to have to make some compromises. Maybe you'll want to exclude words like "thou," "hauberk," and "coney." Not so good if you're subscribing to an Early Modern Literature listserv.

So you're going to need some logic to determine whether or not a "valid" word that occurs in a message is meaningful. Here's how one rather well known discussion [paulgraham.com] of Bayesian filtering deals with this issue (of unknown words); this is precisely the logic that spammers with random meaningful words are exploiting:

One question that arises in practice is what probability to assign to a word you've never seen, i.e. one that doesn't occur in the hash table of word probabilities. I've found, again by trial and error, that .4 is a good number to use. If you've never seen a word before, it is probably fairly innocent; spam words tend to be all too familiar.
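A minimal sketch of how that default plays out, assuming the usual naive Bayes combination of per-word probabilities, prod(p) / (prod(p) + prod(1 - p)). The function name, the toy probability table and its values are made up for illustration; only the .4 default comes from the quoted passage. The sketch also combines every token for brevity, whereas the essay restricts itself to the most "interesting" ones.

```python
from math import prod

UNKNOWN_WORD_PROB = 0.4  # default from the quoted passage for never-seen words


def combined_spam_probability(tokens, word_probs, unknown=UNKNOWN_WORD_PROB):
    """Combine per-word spam probabilities into one score for a message.

    Words missing from word_probs get the `unknown` default.
    """
    probs = [word_probs.get(tok.lower(), unknown) for tok in tokens]
    if not probs:
        return unknown
    spam = prod(probs)
    ham = prod(1.0 - p for p in probs)
    return spam / (spam + ham)


# Hypothetical learned table; these values are invented for the example.
word_probs = {"viagra": 0.99, "free": 0.90, "meeting": 0.05}

spammy = ["free", "viagra"]
# Padding with words the filter has never seen: each one counts as 0.4.
padded = spammy + ["hauberk", "coney", "thou", "donut"]

print(combined_spam_probability(spammy, word_probs))
print(combined_spam_probability(padded, word_probs))
```

Each unseen word pulls the score slightly toward "innocent", which is exactly the lever the random-word padding is pushing on. With only a handful of padding words the effect in this toy table is small.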

So, what if all the words are valid, but the sentences aren't? Grammar checkers involve a lot more logic than spellcheckers do, and are consequently a lot less accurate. Fact is, you can also fool a grammar checker filter: just pad with random quotations from novels, etc. instead of padding with random words or random misspelled strings.

So the Bayesian approach of identifying spam and ham words is a pretty effective one, given the limitations.

I've wondered why Bayesian filtering didn't also include word pairs as input. Doing so would make gibberish and actual language easier to distinguish, since using pairs (or even triads/trios if absolutely necessary) maintains some of the word order statistics for the Bayesian filters to key off of. Also, lots of spam now separates letters with spaces or punctuation to fool filters that would key off words. Using word pairs would identify these types of spam easily, since the bulk of legitimate mail won't have word pairs like "v-i" "i-a" "a-g" "g-r" and "r-a".
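Something like the following is what I have in mind, as a rough sketch only. The function name `tokens_with_pairs` and the tokenizing regex are my own assumptions; a real filter would have its own tokenizer, and the pair tokens would simply be fed into the same probability table as single words.

```python
import re


def tokens_with_pairs(text):
    """Return single-word tokens followed by adjacent word-pair tokens.

    Anything that is not a letter, digit or apostrophe splits words, so
    "V-I-A-G-R-A" falls apart into single letters, and the pairs of those
    letters become tokens that legitimate mail rarely contains.
    """
    words = re.findall(r"[a-z0-9']+", text.lower())
    pairs = [f"{a}-{b}" for a, b in zip(words, words[1:])]
    return words + pairs


print(tokens_with_pairs("V-I-A-G-R-A for sale"))
```

The pairs cost extra table space, but they keep a little of the word-order information that single-word tokens throw away.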

Another input I wish Mozilla (or other Bayesian filtering systems) would include is a dictionary look-up on words, with the resulting statistics of the message fed in as input. For instance, a message where > 60% of the words don't match my English dictionary and 40% do match is most likely spam in my mailbox. This additional stat would give those filters more power.
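A quick sketch of that statistic, under stated assumptions: the function names are hypothetical, the tiny `english` set stands in for a real dictionary, and the 0.6 threshold is just the 60% figure from above rather than a tuned value.

```python
import re


def unknown_word_ratio(text, dictionary):
    """Fraction of words in text that are not in the dictionary."""
    words = re.findall(r"[a-z']+", text.lower())
    if not words:
        return 0.0
    unknown = sum(1 for w in words if w not in dictionary)
    return unknown / len(words)


def looks_like_gibberish(text, dictionary, threshold=0.6):
    """True when more than `threshold` of the words are unknown."""
    return unknown_word_ratio(text, dictionary) > threshold


# Stand-in for a real English word list.
english = {"the", "meeting", "is", "at", "noon", "buy", "now"}

print(unknown_word_ratio("The meeting is at noon", english))
print(looks_like_gibberish("buy now xdwexe qpzt vrrk lmno jjkq", english))
```

Rather than rejecting mail outright, the ratio could be handed to the Bayesian filter as one more token (say, bucketed into ranges), so that people on lists full of jargon or names aren't punished by a hard cutoff.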

So I wonder... Would adding these things to existing Bayesian filtering systems solve this issue to some degree? My gut instinct is that it would.

In the past many ISPs would add filters and NOT tell the users they were doing it. Nowadays, however, ISPs (most notably Earthlink and MSN) advertise spam blocking as a feature. If people wanted this stuff you'd think non-filtering ISPs would advertise "You get ALL your e-mail".

But back to the original point. Spammers have used misleading topics in e-mail if only to make sure you don't delete the message. That, and creating spam lists based on people who DO NOT like spam or of people who have manually opted out of spam lists. The people who actually make money with spam don't care about selling products via spam, as they sell spam services. The people who sell stuff via spam aren't making money because they are reaching markets who are wholly uninterested in buying stuff from them.

It's really simple. The ONLY way spammers can defeat Bayesian filters is if they imitate what you call ham. ham = What you want; spam = what you don't want. Unless they custom tailor each message or random words to each user and guess (through some form of magical powers) what kind of email you call ham, then they fail.

Besides, if they could guess what your ham looked like, then they wouldn't be spammers... they'd be advertising folks pulling in 7 figures.

I'm pretty sure that the big worry is about third party filtering. If I install a spam filter, that means that I don't want to see spam and am unlikely to buy something advertised therein. If my ISP installs a spam filter, it removes spam for everyone, including the idiots who might actually buy something from a spammer. Since my ISP theoretically might be using the same technology in their filter that I'm using in mine, it would still make sense for the spammer to work on defeating my filter.

It's possible, if not likely, that some of the spamware authors are doing it for the challenge. Some of those guys are allegedly pretty good programmers, and I suspect that many of them are essentially hackers with no sense of morals. I could easily imagine somebody like that trying to figure out how to bypass spam filters just because it was a challenge, not because he actually expected any particular rewards for it. It's like trying to break into the computers in the Pentagon; it's stupid and illegal but a big enough challenge that some people with more brains than common sense will try it anyway.

Nigerian scam spam is very different from most spam. It is a story that can be carefully written to use only words that are commonly used, assuming that the people who author them are able to go beyond their broken English all the way to use of statistically hammy correctly spelled text.

But how would you sell more inches on your male member enhanced with V*@gra to make money fast watching celeb teenie nymphos doing it on the farm while only using ordinary non-spammy words?

There are only so many ways to get someone to click here to get all the hot action and a long boring story full of erudite euphemisms is not one of them.

It would be interesting to see if your method of disguising spam can work on a wider range of topics.

So just modify the Bayesian filters to act on a set number of mispilled/garbled words, say 10 or so. Of course this might make us have to learn how to spell correctly if we aver want anyone to get the emails we send :0)

What it will take is the enforcement of existing computer-cracking laws. Spammers will then have a choice between 5-10 year sentences or sending spam with no munged words, forged headers, misleading subject lines, etc.

Twice in this thread, I see you talking about training the Bayesian filter. You seem to think this is something of a burden, like training a big dog...

I think you misunderstand how easily one trains the current Mozilla email client's Bayesian filter.

Day 1:
1: the mail comes in, spam included.
2: one of the inbox columns is a blue 'recycle' lookin' symbol. It is a toggle that acts like the 'new' indicator column, and a click on it turns state on or off.
3: glancing through the list, one clicks on the obvious spam, on this column. If there are chunks or patterns that help, you sort them via whatever useful column, then highlight a group, and hit a 'junk' button up in the toolbar. The messages marked as junk disappear (into a 'junk' folder), where they are automatically parsed by the Bayes filter. This is, I guess, what you'd mean by training the filter. For me, it took about 4 minutes the first day, for over 100 messages at a 90% spam ratio. No disrespect, but I doubt you could write your whole stack of filters in 4 minutes.

Day 2:
Most of the junk mail gets caught. I'd say well over 3/4ths of the spam goes away on day 2. You see it come into your inbox, and then a second later all the junk items get the little blue icon turned on, then flash away to the junk folder. A few missed items or new junky things surface.

Days 3 and on: same thing, only better. By the 4th day, my 100 messages a day had fallen back to the dozen nonspams, plus one or two bogus items. It's an automatic 'In, ZZAP! Junk!' Every few days, I glance at the junk folder as you mention, and so far in the last 4 months I've had 5 misfiled messages declared as junk. 3 of them were atypically 'spammy' messages on usually-clean lists.

Now, compared to your way, I have:

No rules to maintain,

no problems with exceptions that are hard to write filters for. In my case, I'm on a couple mailing lists that broadcast all messages with the true sender (not the list) as the 'from' field, and nothing obvious in the subject line to filter on.

Oh, and I'm lazy, too. What you describe sounds like it would take a few dozen built/tested filters, plus maintenance each time I get a new customer or the like.

no problems if a prospective customer sends me a request for a bid 'out of the blue',

My way's sorta fun: Each morning, I see a message like 'getting 1 of 103 messages'... it counts up to 103, then I watch as the stack gets filtered back to just the real ones. Instead of admiring my own cleverness (advantage here to your way), I get to admire this nifty gadget that 'Just Works.' In fact, the one thing I'd like to see in this mail client is a 'Why' button, just so I could see diagnostics on a message's Bayesian results. That, and a ranking to keep track of the spammiest message scores my filter ever sees!

no lost messages from people I neglected to include in my filters.

Granted, you'll find those lost in your method in the spam folder. I say Mozilla's built-in Bayes approach is better because these messages don't get misfiled in the first place.

Oh, and people I could never expect to set/maintain filters can intuitively 'click' the spam away. That's my favorite advantage to my way.

The spammers can't go too far with this stuff because they'd eventually start to stifle their sales.

What makes you think they have any sales (of the advertised product)? I would guess that almost all spam (maybe excluding that for pr0n sites) is either being sent by a MAKEMONEYFAST sucker or by a professional spammer who charges such suckers to send their spam out. The first set never make any sales, disappear and are replaced by the next moron; the latter have their money, sales or not.

But then again, Joe Sixpack and Jane Astrology aren't all that smart.

And you think Sam Slashdot is? How many pieces of dead-end technology do you think you could find in the average /.er's home? `Early Adoption' is geek herbal viagra.

To start with the punchline: well, so filter them away anyway. The way I view "l33t" or "netspeak" is: if it's not important enough for you to bother writing correct, easily readable text, it's not important for me to read either.

So yes, as far as I'm concerned, a good filter should throw that kind of message away anyway. I don't care if the l33t spelled part was "|-|3rb@1 \/1@gr@" or "Ph34r my 1337 D34thm4tch ski11z", I just don't want to receive it. They're both garbage.

That said... I can somewhat see your point.

Having once written a walkthrough for a game, I have had the dubious honour of receiving tons of mail from people who were both 1 and 2. I.e., 14 year old _and_ gamers.

Ooer. Stuff like "u sux & ur walkthru sux becuz u never sed which of teh terminal 2 klik on & y duzent ne1 make maps" were more common than I would have thought. (The above sequence was about a small level with 3 blinking terminals. You'd think someone could just try all 3 of them if it isn't clear enough.)

But... I don't think it's fair to blame it on the "gamer" part. Some people are simply retards. Plain and simple. Completely coincidental, some of them also play games. But even without the "gamer" part, they'd still be retards. And they'd still write like total analphabets.