Once you've applied the mechanism described in the SO answer, surely anything else is most easily managed via Bayesian filtering?
– symcbean May 21 '13 at 12:44


If you are specifically looking for English-language filtering, why do you need a dictionary at all? Isn't the presence of non-Latin characters evidence of spam by itself?
– AJ Henderson May 21 '13 at 13:06

@AJHenderson Good observation; yes, mixed locales would be an indicator as well. I removed the English requirement to make this question useful to other people.
– makerofthings7 May 21 '13 at 13:40

@makerofthings7 That's cool. Is there a generator for English letters that produces that example text?
– Nikos Apr 15 '14 at 14:26

2 Answers

I would add to my spam classification algorithm something that detects multiple encodings in the same word/sentence. E.g., lαѕт, with a Latin l, a Greek/Coptic alpha, a Cyrillic dze, and a Cyrillic te, seems very suspicious. A very quick and dirty check can be done in Python using the unicodedata module.
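A sketch of that quick-and-dirty check (my reconstruction, not the answer's original code; the function name is mine). The first token of a character's Unicode name identifies its script:

```python
import unicodedata

def scripts_in_word(word):
    """Return the set of script prefixes (LATIN, GREEK, CYRILLIC, ...)
    among a word's alphabetic characters, taken from each character's
    Unicode name, e.g. 'GREEK SMALL LETTER ALPHA' -> 'GREEK'."""
    return {unicodedata.name(ch).split()[0] for ch in word if ch.isalpha()}

print(sorted(scripts_in_word("lαѕт")))  # ['CYRILLIC', 'GREEK', 'LATIN']
```

A word that mixes three scripts, as here, is an immediate red flag.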

The fact that there are several words with 3 different letter encodings is highly suspicious (even two is suspicious). So are other tricks, e.g., the soft hyphens (code point U+00AD) alternating with the characters of the .com.

Pulling some brief foreign-text samples off the web, you find that most of the time a word uses only one encoding:
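For illustration (the short phrases below are stand-ins for real samples scraped from the web, and the helper is my reconstruction), counting scripts per word in ordinary foreign-language text gives exactly one script per word:

```python
import unicodedata

def scripts_in_word(word):
    # First token of a character's Unicode name is its script, e.g. 'GREEK'.
    return {unicodedata.name(ch).split()[0] for ch in word if ch.isalpha()}

# Hypothetical samples standing in for brief foreign text pulled off the web.
samples = ["bonjour le monde", "καλημέρα κόσμε", "привет мир"]
for phrase in samples:
    for word in phrase.split():
        print(word, sorted(scripts_in_word(word)))
# Every word reports a single script: LATIN, GREEK, or CYRILLIC respectively.
```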

Obviously, you would have to thoroughly test with machine learning as there may be benign circumstances where encodings are mixed, but this could be useful input to your spam classifier that's trained on a large dataset.

On second thought, you'd probably have to add a list of "ignore-me" characters that can be inserted into a string to make it compare differently while looking the same, for example U+0082. And this could likewise be used to defeat the "at most two encodings per sentence" rule. A word such as "déjà vu" can be used legitimately (I remember seeing it come out of a Mac editor), but a combining accent like U+0300 can be used to make "Víágŕa" look like something else altogether.

So first all combining marks should be removed; then some legitimate characters must be ignored (e.g., the ellipsis, which word processors adore, and the various styles of typographic quotes). Finally the encodings can be counted, or characters can be replaced with their OCR look-alikes as above.