Perl One-Liner of the Month: The Case of the Evil Spambots
By Ben Okopnik

A REPORTER'S NOTE

To forestall some sure-to-happen complaints, I'd like to underscore
the necessity of having the current version of Perl (at least 5.8.0, as
of this writing) in order to play with the scripts presented in these articles.
One-liners, to a far greater degree than proper scripts, rely on new and
unusual language features, and languages tend to "grow" new features and
drop old, outdated ones as version numbers rise. Perl, heading for its
17th year of growth and development, is no exception.

One of a number of possible problems with one-liners is fragility,
especially in those (many of them) which are dependent on cryptocontext,
side effects, and undocumented features, which are likely - in fact, are
certain - to change without notice. One-liners are hacks which often demonstrate
some clever twist or feature, which encourages the use of all of the above.
Remember - these are fun toys which (hopefully) lead to a better understanding
of Perl; trying to use them as you would robust, solid code would be a
serious error. If you don't understand the basics
of Perl, this is not the place to start.

Debugging is twice as hard as writing the code in the first place.
Therefore, if you write the code as cleverly as possible, you are,
by definition, not smart enough to debug it.
-- Brian W. Kernighan

Caveat Lector (Let the reader beware).

Ben Okopnik
On board S/V "Ulysses", Saint Augustine, Florida

Frink Ooblick had fallen asleep at the keyboard. He had been alternately
playing and trying to puzzle out the number-guessing game that Woomert
had written (the first had proven easy, but the second still eluded him);
in fact, his last unfinished game was still visible on the screen:

perl -wlne'BEGIN{$b=rand$=}$a=qw/Up exit Down/[($_<=>int$b)+1];print eval$a'
50
Down
25
Up
37
Up
44
Up

What was the secret? How did it work? [1] Frink's dreams
were full of floating bits of code which spiraled off into the distance
or mutated into monstrous shapes, threatening to consume the world. The
hand shaking his shoulder, waking him, was therefore a welcome relief.
Woomert stood at his side, looking impatient.

- "In the living room. Come on, come on, there's not a moment
to lose!"

Frink's first sight of their visitor brought him to a stop. Used to
dealing with the working crowd - sysadmins, techs, etc. - he had expected
the usual scruffy-and-competent look, perhaps complete with hiking boots;
what greeted his eyes was a fellow in a pinstripe suit, crisp white shirt,
a red "power" tie, and lacquered black shoes. He had been impatiently pacing
the floor, and brightened up considerably at the sight of Frink.

- "Ah, this must be the second team member in your organizational
hierarchy! Excellent; now, we can get into actualizing the power strategies
that will reorganize this, erm, unpredicted opportunity into the profit
slot on the balance sheet. All right, here's how we wind-tunnel this: the
securitization of the computing resources is predicated on leveraging..."

Keeping a cautious eye on their visitor, Frink prison-whispered to Woomert:
"What's he saying? And what language is it in?"

- "It's Marketroid. You need to learn at least the basics of it;
not that it's spoken by the people who sign the checks - they don't have
much time for that sort of thing - but you're going to run into it in the
business world, and it's best to be prepared. Usually, though, most of
these people can still speak English; let's see if this fellow remembers
how. Oh, Mr. Wibbley!"

Their visitor had just finished what he obviously considered an explanation
of the problem, had switched off the overhead LCD projector, put away his
laser pointer, and was looking at them in an expectant manner. Clearly,
he had heard of Woomert's reputation and was relying on the famous Hard-Nosed
Computer Detective to deal with... well, whatever it was.

- "Mr. Wibbley - that was an excellent presentation, but I wonder
if you could restate the problem in more basic terms for my assistant here.
I'm afraid he's not up on proper business terminology, and has missed the
more subtle points."

Their visitor heaved a sigh, and dropped into the nearby easy chair.

- "Oh, sure. You know, they were going to send one of the system
administrators to talk to you, but of course I insisted on doing the presentation
myself as soon as I heard about it. After all, one of them wouldn't
have even thought of using that textured salmon-and-peach background on
the slides, and that's all the rage these days! Anyway, I did get
a note from him that explains it in his own words; it's crude and unsophisticated,
not at all proper marketing technique, but I suppose you fellows will understand
it..."

The crumpled and coffee-stained napkin, most of which was covered with
calculations, reminders, and something that looked like firewall rules,
contained a short note framed with a red marker pen:

Woomert, spambots are harvesting the e-mail addresses on our website
(we've tagged them with the "plus hack", [2] so we know
where it's coming from); the amount of spam we're getting is growing by
leaps and bounds. We need to have the addresses out there - it's our contact
info, site problem reports, etc. - but we've got to stop the 'bots somehow!
I've already written the CGI to handle the hot links, but we need to have
the actual addresses displayed on the pages, and the 'bots are getting
those. Any ideas? The page is at http://xxxxxxxxxxxx.xxx.
I've created an account for you; just go to ssh://xxxxxxxxx.xxx/xxx,
password 'xxxxxxxxxx'. Thanks!
- Int Main

After Woomert had ushered out their visitor (and reassured him that,
indeed, the salmon-and-peach background was delightful), he returned to
the living room where Frink awaited him.

- "What are you going to do, Woomert? Any plans?"

- "Yes; let's take a peek at their website, then get out there and
look around. It's a mistake to make decisions ahead of your facts, and we
have few facts at hand."

...

Once again, Woomert and Frink found themselves surrounded by the familiar
sights and sounds of a working web site. They could see the Web server
easily spawning off threads without significantly affecting CPU load; clearly,
the local sysadmin had installed mod_perl [3]. Here and
there, data streams whisked by, and everything moved like a smoothly-oiled
machine.

A sudden shadow made Frink look up. "What the..." Before he could go
any further, a horrifying creature, all tentacles, lenses, and evil intent
[4]
leaped upon the scene, sucked up a copy of every HTML file at once, and
was gone in a blink.

- "What was that, Woomert - a spambot?"

- "Yep. These things traverse the Net, collecting e-mail addresses
and reporting them to their scummy spammer masters. Given the nature of
the Net, you can't stop them - but you can make them much less effective.
Spammers are stupid, their bots even more so, and that's what we're going
to rely on. Mind you, whatever we do is only going to be a temporary solution;
eventually, spammers (or at least their hired techie help) will catch on
to this particular method - but by then, we'll implement other solutions."

Walking up to a convenient terminal, Woomert slipped on his favorite
typing gloves and fired off a rapid volley.

This time, there was no output; however, Woomert looked satisfied. He quickly
shot off an email to the local sysadmin that contained some instructions
and included a shorter version of the last one-liner -

perl -we'map{printf"&#%s;",ord}split//,pop' user@host.com

- "All right-o, Frink; our work here is done. Home, here we come!"

...

The old-fashioned coal-fired samovar [6] was gently
perking; the zavarka (tea concentrate), made with excellent Georgian
tea, gave off a marvelous smell. A plate of canapés, ranging from the
best Russian butter and wild blackberry jam on freshly-baked fluffy white
bread to beluga caviar on a heavy, dark rye rubbed with just a touch of
garlic, was set close at hand, and both Woomert and Frink were merrily
foraging in the gourmet field thus presented. Eventually they settled back,
replete with good food, and Frink's curiosity could be contained no longer.

- "Woomert, when I try to puzzle out your one-liners, I can only
get so far; then I run out of steam. Can you tell me about what you did?"

Lying back in his favorite armchair, Woomert smiled.

- "Instead, why don't you start by telling me what part you understood?
I like to see how far you've advanced, Frink; it's been a pleasure to me
to see you picking up some of the finer points. I'll take it from there."

-Mmodule    Use the specified module
-w          Enable warnings
-n          Non-printing loop
-e          Execute the following commands

However, I couldn't quite puzzle out the '-MRFC::RFC822::Address=valid' syntax
- what was that?"

- "Ah. As 'perldoc perlrun' tells us, in the entry for '-M', it's
a bit of syntactic sugar; '-MBar=foo' is a shortcut for 'use
Bar qw/foo/', which imports the specified function 'foo' from module
'Bar'. Go on, you're doing well."
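
For readers who want to see the shortcut side by side with its long form, here is a minimal sketch. It uses the core module List::Util purely as a stand-in (it is not part of Woomert's one-liner), and the variable names are made up for illustration. On the command line, the same import would be written as -MList::Util=sum,max.

```perl
use strict;
use warnings;

# Long form of the command-line switch -MList::Util=sum,max
# (List::Util is a core module, used here only as a stand-in).
use List::Util qw(sum max);

my @numbers = (3, 1, 4, 1, 5);    # hypothetical data
my $total   = sum @numbers;       # 14
my $largest = max @numbers;       # 5
print "total=$total largest=$largest\n";
```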

Frink cleared his throat.

- "In that case, I think I have it figured out... almost. Let
me take a quick look at 'perldoc perlvar' and 'perldoc RFC::RFC822::Address'...
Yes, that's what I thought - I've got it! The regex at the beginning -

/[\w-]+@[\w.-]+/

tries to match e-mail addresses - it's not perfect, but should do reasonably
well. What it says is "match any character in [a-zA-Z0-9_-] repeated
one or more times, followed by '@', followed by any character
in [a-zA-Z0-9_.-] repeated one or more times". If the match does
not succeed - the '||' logical-or operator handles that - go to
the next line."
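
The "match or move on" idiom Frink describes can be tried in a short script. This is only a sketch: the sample lines and the @matched array are invented for illustration, and a plain 'for' loop stands in for the implicit loop that '-n' wraps around the code.

```perl
use strict;
use warnings;

# Hypothetical sample input; the one-liner in the story read real HTML files.
my @lines = (
    'Contact: <a href="/cgi-bin/mail">webmaster@example.com</a>',
    '<p>No address on this line.</p>',
    'Report problems to site-admin@mail.example.org today.',
);

my @matched;
for (@lines) {
    # Same idiom as in the one-liner: if the match fails, skip this line.
    /[\w-]+@[\w.-]+/ || next;
    push @matched, $_;
}

print scalar(@matched), " of ", scalar(@lines), " lines contain an address\n";
```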

- "Brilliant, Frink! What happens then?"

- "If it does succeed, 'next' is skipped over, and 'print
valid$&' is invoked. The module documentation tells me that the
'valid' function tests an e-mail address for RFC822 (e-mail specification)
conformance, and returns true or false based on validity. '$&',
according to 'perldoc perlvar', is the last successful pattern match -
in other words, whatever was matched by the regex. Since you saw all '1's
and no errors - any matches that weren't RFC822-valid would have returned
something like "Use of uninitialized value in print at -e line 1"
- what you matched was all valid. What you were doing here is checking
to see that your regex only matched actual addresses. How did I do?"

- "Excellent, my dear Frink; you're coming along well! As a side
note, it's generally best to avoid the use of $&, $`,
and $' as well as 'use English' in scripts; there's a
rather large performance penalty associated with them (see 'perldoc perlvar').
However, here we had a very small list of matches, and so I went ahead
with it. Go on, see what you can make of the next one."
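
As a rough illustration of the alternative Woomert hints at, the sketch below reads the same match once through '$&' and once through explicit capturing parentheses and '$1'. The sample text and variable names are assumptions made for the example. Recent editions of 'perldoc perlvar' also report that the penalty was largely removed in Perl 5.20, so treat the advice as dependent on the version in use.

```perl
use strict;
use warnings;

my $text = 'Mail sales@example.com for a quote.';    # hypothetical input

# As in the one-liner: $& holds whatever the last successful match matched.
my $via_ampersand;
if ($text =~ /[\w-]+@[\w.-]+/) {
    $via_ampersand = $&;
}

# The usual alternative in longer scripts: capture explicitly and read $1
# (here returned directly by the match in list context).
my ($via_capture) = $text =~ /([\w-]+@[\w.-]+)/;

print "$via_ampersand\n$via_capture\n";
```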

- "Mmmm... I got sorta lost here, Woomert. I see that regex that you'd
used before, but what's that 's=' bit?"

- "It's one of those convenient tweaks that Perl provides - although,
admittedly, the basic idea was stolen from 'sed'. It's simply an alternate
delimiter used with the 's' (substitute) operator; there are times when
using the default delimiter ("/") is highly inconvenient and leads to "toothpick
Hell" - as, for example, in matching a directory name:

s/\/path\/to\/my\/directory/my home directory/

Far better to use an alternate delimiter, one that is not contained
in the text of either the pattern or the replacement:

s#/path/to/my/directory#my home directory#

As long as it's non-alphanumeric and non-whitespace, it'll work fine.
There are some special cases, but they're all sensible ones; using a single
quote disables interpolation in both the pattern and the replacement (see
the rules in 'perldoc perlop'), and using braces or brackets as delimiters
requires rather obvious syntax:

s{a}{b}
s(a)(b)
s[a][b]

Many people like '#' as a delimiter; I prefer '=', since '#' tends to
come up in HTML and comments. Can you make sense of any of the rest?"
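
The delimiter rules above can be checked with a small sketch. The path, the greeting string, and all variable names are hypothetical; the first three substitutions are meant to be equivalent, and the last one shows the single-quote special case leaving '$name' uninterpolated in the replacement.

```perl
use strict;
use warnings;

my $path = '/path/to/my/directory/notes.txt';    # hypothetical path

# Default delimiter: every slash in the pattern needs a backslash.
(my $slashes = $path) =~ s/\/path\/to\/my\/directory/my home directory/;

# '=' as the delimiter: no escaping needed.
(my $equals = $path) =~ s=/path/to/my/directory=my home directory=;

# Paired delimiters: pattern and replacement each get their own pair.
(my $braces = $path) =~ s{/path/to/my/directory}{my home directory};

# Single quotes: nothing is interpolated, so the replacement is the
# literal text $name rather than the contents of the variable.
my $name = 'Frink';
(my $quoted = 'Hello, NAME') =~ s'NAME'$name';

print "$_\n" for $slashes, $equals, $braces, $quoted;
```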

- "I'm afraid not. You're matching the email addresses as previously,
and replacing them with something, but I can't figure out what."

- "All right; it is rather involved. The replacement part of
the substitution is actual Perl code; we can do that thanks to the 'e'
(evaluate) modifier on the end of the 's' operator. Let's parse the relevant
code from right to left:

join"",map{sprintf"&#%s;",ord}split//,$&

We know that '$&' contains an email address; the next thing
we do is use the 'split' function which converts a scalar to a
list, splitting it on whatever is specified between the delimiters. In
this case, however, the delimiter is empty, a null - so the returned list
has each character of the address as a separate element in the list. We
now pass this list to the 'map' function, which will evaluate
the code specified in the {block} for each element of the supplied
list and return the result - as another list.

Within the block itself, each character is used as an argument to the
'ord' function, which returns the ASCII value of that character;
this, in turn, is used as the argument for the 'sprintf' function
which returns the following formatted string:

&#<ASCII_value>;

for each value so specified. After all the characters in the list have
been processed, we use the 'join' function to convert the list
back to a scalar - which the substitute operator will now use as a replacement
string for the original email address. What used to be "foo@bar.com"
now looks like

&#102;&#111;&#111;&#64;&#98;&#97;&#114;&#46;&#99;&#111;&#109;

This, you must admit, looks nothing like an e-mail address - so spambots
will not be able to read it!"
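
The walk-through above can be spelled out one step per statement. This is a sketch rather than Woomert's actual command: the function name encode_address, the sample HTML line, and the use of capturing parentheses with the 'g' modifier are assumptions made for the example.

```perl
use strict;
use warnings;

# Same steps as the replacement expression, one per statement.
sub encode_address {
    my ($address) = @_;
    my @chars    = split //, $address;                    # one element per character
    my @entities = map { sprintf "&#%s;", ord } @chars;   # "f" becomes "&#102;"
    return join "", @entities;                            # back to a single string
}

# Applied to a line of HTML with the 'e' modifier, so that the replacement
# is evaluated as code (the sample line is hypothetical).
my $html = '<p>Write to foo@bar.com for details.</p>';
$html =~ s=([\w-]+@[\w.-]+)=encode_address($1)=eg;
print "$html\n";
```

Only the address is rewritten; the surrounding markup is left untouched.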

Frink looked troubled.

- "Woomert, I hate to tell you... but human beings won't be able
to read it either!"

Woomert took another sip of his tea and smiled.

- "You're forgetting one thing, Frink. Humans aren't going
to be reading this; since it's part of the HTML files, it's going to be
read by browsers. As it happens, the HTML specification for showing
ASCII characters by their value is

&#<ASCII_value>;

which is exactly what we've produced. Try this yourself: save the text
between the following lines as "text.html" and view it in a browser.
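
Short of opening a browser, the decoding step can also be approximated in Perl. The function below is only a rough stand-in for what a browser does with decimal character references; its name is made up, and it ignores named and hexadecimal references entirely. It also shows why Woomert calls this a temporary measure: a harvester that bothers to decode would recover the address just as easily.

```perl
use strict;
use warnings;

# Rough stand-in for a browser's handling of decimal character references:
# turn each &#NNN; back into the character with that code.
sub decode_entities_sketch {
    my ($text) = @_;
    $text =~ s/&#(\d+);/chr $1/ge;
    return $text;
}

my $encoded = '&#102;&#111;&#111;&#64;&#98;&#97;&#114;&#46;&#99;&#111;&#109;';
print decode_entities_sketch($encoded), "\n";    # foo@bar.com
```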

- "Woomert, what a great solution! Your client will be able to
display the addresses without them being harvested, and the Web page will
still look the same as it did before. I can tell by comparison that the
last bit of code:

perl -we'map{printf"&#%s;",ord}split//,pop' user@host.com

simply enables the sysadmin to convert any new addresses before popping
them into the HTML. Wonderful!"

- "A large part of the complete solution, of course, was the CGI
that the local admin had written - that takes a bit more than a one-liner,
although not very much more, given the power of the CGI module. Remember,
Frink: as your powers grow, make certain to align yourself with the side
of Good rather than Evil. Not only is it the right thing to do; the people
around you are far more likely to have brains!"

[1] Oddly enough, my mysterious correspondent did not
include the solution to this, perhaps deeming it simple enough (!) for
the public to figure out - or (and I suspect this to be the more likely
scenario) he has not yet figured it out himself. Readers are welcome to
write in with their ideas... but for now, the workings of Woomert's game
remain a puzzle.

[2] A number of commonly-used Mail Transfer Agents will
ignore anything that follows a plus sign in the username part of the address,
e.g. <smith+yahoo@joe.com> will be routed exactly the same as <smith@joe.com>.
This can be a very useful mechanism for tracing and reducing spam: a "plus-hacked"
address that becomes too spam-loaded can be directed to "/dev/null"
and replaced by a newly generated one (say, <smith+yahoo1@joe.com> -
which would also go to <smith@joe.com>.)
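
The routing rule can be sketched in a few lines. This is purely illustrative: real mail servers apply the rule through their own configuration, and the function name strip_plus_tag is invented here.

```perl
use strict;
use warnings;

# Drop a "+tag" suffix from the username part of an address, leaving the
# mailbox the message would actually be delivered to.
sub strip_plus_tag {
    my ($address) = @_;
    (my $plain = $address) =~ s/\+[^\@]*(?=\@)//;
    return $plain;
}

for my $addr ('smith+yahoo@joe.com', 'smith+yahoo1@joe.com', 'smith@joe.com') {
    printf "%-22s -> %s\n", $addr, strip_plus_tag($addr);
}
```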

[3] A.K.A. "Apache On Steroids". From the mod_perl documentation:

The Apache/Perl integration project brings together the full power
of the Perl programming language and the Apache HTTP server. This
is achieved by linking the Perl runtime library into the server and
providing an object oriented Perl interface to the server's C language
API.

These pieces are seamlessly glued together by the `mod_perl' server
plugin, making it is possible to write Apache modules entirely
in Perl. In addition, the persistent interpreter embedded in the server
avoids the overhead of starting an external interpreter program
and the additional Perl start-up (compile) time.

There are many major benefits to using mod_perl; if you use Apache in
any serious fashion without it, you're almost certainly throwing away some
of your time and effort.

[4] If you've seen "The Matrix", just picture the Sentinels.
If you haven't seen it, hey, you've got only yourself to blame. :)

[5] Gibberish is the written form of the Marketroid
language. It was formerly spoken by the Gibbers, who all died out as a
result of their complete inability to do anything (as opposed to talking
about it.) It is exactly as comprehensible as its spoken counterpart, although
many people confuse the two: "it's all marketroid gibberish!" is a highly
redundant statement.

[6] See the "Russian Tea HOWTO", by Dániel Nagy,
for the proper way to make and serve Russian tea. The man knows
what he's talking about.

Ben is a Contributing Editor for Linux Gazette and a member of
The Answer Gang.

Ben was born in Moscow, Russia in 1962. He became interested in
electricity at age six--promptly demonstrating it by sticking a fork into
a socket and starting a fire--and has been falling down technological mineshafts
ever since. He has been working with computers since the Elder Days, when
they had to be built by soldering parts onto printed circuit boards and
programs had to fit into 4k of memory. He would gladly pay good money to any
psychologist who can cure him of the resulting nightmares.

Ben's subsequent experiences include creating software in nearly a dozen
languages, network and database maintenance during the approach of a hurricane,
and writing articles for publications ranging from sailing magazines to
technological journals. Having recently completed a seven-year
Atlantic/Caribbean cruise under sail, he is currently docked in Baltimore, MD,
where he works as a technical instructor for Sun Microsystems.

Ben has been working with Linux since 1997, and credits it with his complete
loss of interest in waging nuclear warfare on parts of the Pacific Northwest.