In the second article, I'd mentioned that the hardest part of the exercise
was figuring out where we needed backslashes.
Devdas (f3ew) asked on Twitter
whether I would still need all the backslash escapes even
if I put the pattern in a file -- in other words, are the backslashes
merely to get the shell to pass special characters unchanged?

A good question, and I suspected the need for some of the backslashes
would disappear. So I tried it, using echo to write the pattern to a file.

The problem, it turns out, is that my shell, zsh, changed both instances
of \b to an ASCII backspace, ^H. Editing the file fixes that, and so does
passing -E to echo:

$ echo -E ':a;s/\b\([0-9]\+\)\([0-9]\{3\}\)\b/\1,\2/;ta' >/tmp/commas

But that only applies to echo: zsh doesn't do the \b -> ^H substitution
in the original command, where you pass the string directly as a sed argument.
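
One way to sidestep echo's escape handling altogether is printf with a
%s format, which writes its argument without interpreting backslashes.
Here is a minimal sketch, assuming GNU sed (for \b and \+); the mktemp
scratch file and the variable names are made up for illustration and
stand in for /tmp/commas.

```shell
# Hypothetical scratch file standing in for /tmp/commas.
script_file=$(mktemp)

# With a %s format, printf copies its argument verbatim, so \b lands in
# the file as a backslash followed by b, not as a backspace character.
printf '%s\n' ':a;s/\b\([0-9]\+\)\([0-9]\{3\}\)\b/\1,\2/;ta' > "$script_file"

# Run the saved script with sed -f (GNU sed assumed).
result=$(echo 12345678 | sed -f "$script_file")
echo "$result"
```

With GNU sed this should print 12,345,678, the same result as passing
the pattern directly on the command line.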

Okay, with that straightened out, what about Devdas' question?

Surprisingly, it turns out that all the backslashes are still needed.
None of them go away when you echo > file, so they
weren't there just to get special characters past the shell; and if
you edit the file and try removing some of the backslashes, you'll
see that the pattern no longer works. I had thought at least some of them,
like the ones before the \{ \}, were extraneous, but even those are
still needed.
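
The reason none of the backslashes go away is that they belong to sed's
own syntax: by default sed uses basic regular expressions, where ( ) and
{ } are literal characters unless backslashed (and \+ is a GNU extension
in that mode). With -E, sed switches to extended regular expressions and
the unescaped forms become the special ones. A quick comparison, assuming
GNU sed; the variable names are just for illustration:

```shell
# Basic regular expressions (sed's default): groups, intervals and +
# all need backslashes to be special.
bre=$(echo 1234567 | sed ':a;s/\b\([0-9]\+\)\([0-9]\{3\}\)\b/\1,\2/;ta')

# Extended regular expressions (-E): the same pattern without them.
# \b and the \1, \2 backreferences are still written with backslashes.
ere=$(echo 1234567 | sed -E ':a;s/\b([0-9]+)([0-9]{3})\b/\1,\2/;ta')

echo "$bre"
echo "$ere"
```

Both commands should print 1,234,567.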

Filtering unprintable characters

As long as I'm writing about regular expressions, I learned a nice
little tidbit last week. I'm getting an increasing
flood of Asian-language spam which my mail ISP doesn't filter out (they
use SpamAssassin, which is pretty useless for this sort of filtering).
I wanted a simple pattern I could pass to egrep (via procmail) that
would filter out anything with a run of 4 or more unprintable characters
in a row. [^[:print:]]{4,} should do it, but it wasn't working.

The problem, it turns out, is the definition of what's printable.
Apparently when the default system character set is UTF-8, just about
everything is considered printable! So the trick is that you need to
set LC_ALL to something more restrictive, like C (which basically means
ASCII), before [:print:] becomes useful for language-based filtering.
(Thanks to Mikachu for spotting the problem.)

So in a terminal, you can do something like

LC_ALL=C egrep -v '[^[:print:]]{4,}' filename
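
To see the effect on a small sample, here is a rough sketch. The sample
file and variable names are hypothetical, grep -E is used as the
equivalent spelling of egrep, and the -a flag (treat input as text) is an
extra precaution that isn't part of the command above.

```shell
# Two sample lines: one plain ASCII, one starting with two CJK characters
# written as raw UTF-8 bytes (octal escapes), so the script itself is ASCII.
sample=$(mktemp)
printf 'plain ascii line\n\344\270\255\346\226\207 spam\n' > "$sample"

# In the C locale every byte above 0x7F is unprintable, so the six UTF-8
# bytes form a run of four or more and -v drops that line.
kept=$(LC_ALL=C grep -a -E -v '[^[:print:]]{4,}' "$sample")
echo "$kept"
```

Only the plain ASCII line should survive the filter.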

In procmail it was a little harder: I couldn't figure out any way to
change LC_ALL from within a procmail recipe, so the only solution I came
up with was to add this to ~/.procmailrc: