Defending Your Site Against Spam

Like so many other people out on the Internet, I get unsolicited
commercial email or "spam". Until recently, I could handle spam by
just deleting it or using email aliases. Unfortunately, my server was
rendered useless by a spam attack launched by an unknown spammer. The
experience forced me to improve my spam defenses. In two articles, I
will share the research and results of my effort to implement an
anti-spam system. In this first installment, I will briefly cover
various anti-spam systems and the system I chose, a network level
defense. In the next installment, I'll dig deeper into the details of
an implementation with qmail. (The information is general enough that
it could be applied to other email systems such as Postfix or
Sendmail.)

Spam Defenses

Let's begin by covering the current state of the anti-spam
world. Since spam is such a widespread problem, an increasing number
of anti-spam measures have been devised in the last four
years. Some measures involve legislation, some techniques require
large groups of people, and some are simple techniques that
individuals can use. I will cover some of the more popular defenses
that individuals or administrators can implement on their own as well
as mention a few of the up-and-coming systems. None of these systems
is perfect. I do not claim that any of these will work without a
hitch.

Here is a brief rundown of the most popular techniques in use, in
order of increasing sophistication:

The first technique is to choose a hard-to-guess email address and
hide it from all publicly viewable content. This makes it hard for a
spammer to guess your address or find it with a dictionary attack. If
you do get spam, you will just have
to ignore it or live with it. If you want to get email from people,
you must somehow give them your email address in advance through a
non-Internet channel or use a web form as an email proxy to your
address. For some people this works, but it usually means their email
address is hard to remember. This technique defeats the purpose of
the Internet by making it hard to communicate with other people.

The next technique, aliases, escalates the previous one.
Essentially you use several email aliases for the same account. For
example, it is easy to configure some mail systems to deliver all
email to user-something@company.com to
user@company.com, where something is any valid
email character string. You could have an alias for public display on
the Internet, such as on a web site or in a Usenet post. You could
also have an alias for each web site where you register. This is a
great way to find out if that web site you registered with is selling
your email address or being lax about protecting your email
address.

If a spammer (or web site) abuses one of your addresses, you can
just block that particular alias. I used this technique in the past,
and it worked pretty well. Still, it doesn't prevent all spam.
Another issue to consider is that most popular email clients don't
support different aliases well. Finally, it is worth noting that some
anti-spam companies have sprung up around just this idea.
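
To make the alias idea concrete, here is a minimal Python sketch. It
assumes the user-something convention described above and that the
user name itself contains no dash. The function names and the blocked
alias are hypothetical, chosen only for illustration; real mail
systems handle this in their own configuration.

```python
def split_alias(address):
    """Split 'user-something@domain' into ('user@domain', 'something')."""
    local, _, domain = address.rpartition("@")
    # Assumption: everything after the first dash is the alias tag.
    user, dash, tag = local.partition("-")
    return f"{user}@{domain}", (tag if dash else None)


# Hypothetical alias that started attracting spam and was switched off.
BLOCKED_TAGS = {"shadysite"}


def deliver_to(address, blocked=BLOCKED_TAGS):
    """Return the real mailbox for an address, or None if its alias is blocked."""
    mailbox, tag = split_alias(address)
    if tag in blocked:
        return None
    return mailbox
```

With this sketch, mail for user-usenet@company.com would land in
user@company.com, while mail for the blocked alias would be refused.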

The next technique is content filtering, which is actually an
entire family of techniques. It is also the most widely implemented
and discussed technique. All the systems work in roughly the same
manner. Incoming email is processed by a filter. This filter scans
for patterns that may characterize the email as spam or not. It's
that simple. The techniques all vary on where and what they filter
on. The filter could be at the server or it could be at the client.
It might look at the email headers, the email body, or both.
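
A toy version of such a filter might look like the following Python
sketch. The pattern list, the threshold, and the function name are
all assumptions made up for illustration. It only looks at the
Subject header and a plain-text body, and real filters are far more
elaborate.

```python
import re
from email import message_from_string

# Hypothetical patterns; a real filter would use a much larger, evolving set.
SPAM_PATTERNS = [
    re.compile(p, re.IGNORECASE)
    for p in (r"free money", r"act now", r"100% guaranteed")
]


def looks_like_spam(raw_message, patterns=SPAM_PATTERNS, threshold=2):
    """Return True if at least `threshold` patterns match the subject or body."""
    msg = message_from_string(raw_message)
    subject = msg.get("Subject", "")
    body = msg.get_payload()
    if not isinstance(body, str):
        body = ""  # multipart messages are ignored in this sketch
    text = subject + "\n" + body
    hits = sum(1 for pattern in patterns if pattern.search(text))
    return hits >= threshold
```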

Another interesting variation is where the patterns come from. Some
systems use your address book for a list of valid senders. Others let
you enter your own list of words. Some compile patterns from email
sent in by people on the Internet. Others build patterns based on your
past email. The most popular variations include collaborative
filtering, Bayesian filtering, and fingerprinting. No matter what
anyone claims, no system works 100% for everyone. As filtering
techniques have improved, spammers have continued to work around
them.

One content-filtering technique, called challenge-response, is
on the rise and deserves a separate description. Essentially, incoming
email is filtered by comparing the sender's email address
against a whitelist of approved addresses. If the sender's email
address isn't in the list, an email is sent back to the sender, who
is challenged to respond. The challenge is usually simple for a human
to perform but hard for a computer.
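
The bookkeeping behind challenge-response can be sketched in a few
lines of Python. This is only an illustration under simple
assumptions: the class and method names are hypothetical, the
challenge itself is not modeled, and held mail is kept in memory.

```python
class ChallengeResponseFilter:
    """Hold mail from unknown senders until they answer a challenge."""

    def __init__(self, whitelist=()):
        self.whitelist = {address.lower() for address in whitelist}
        self.pending = {}  # sender address -> list of held messages

    def receive(self, sender, message):
        """Return 'deliver' for approved senders, otherwise hold and 'challenge'."""
        sender = sender.lower()
        if sender in self.whitelist:
            return "deliver"
        self.pending.setdefault(sender, []).append(message)
        return "challenge"

    def challenge_passed(self, sender):
        """Approve the sender and release any messages that were held."""
        sender = sender.lower()
        self.whitelist.add(sender)
        return self.pending.pop(sender, [])
```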

Some challenges require the user to type a word in from a noisy
image. I've even seen one that asks the user to "count the kittens" in
a picture. I think these techniques are very successful, but I worry
that they may be alienating certain groups like non-English users or
the visually impaired. Also, some people refuse to use this technique
since they don't want to annoy or offend the senders. Still, some
people praise the technique and consider it so special that there has
even been patent activity around it.

Another category of spam defense is the network level
defense. This technique simply involves looking at the IP address of
the machine sending an email and deciding if it is allowed or
not. This lookup is done against a blocklist, which is just a list of
IP addresses considered to be bad. If the IP address is allowed, then
the mail connection proceeds and the email is processed. If the IP
address is not allowed, then the TCP connection is dropped or the SMTP
connection is aborted with a descriptive error like "Your machine is
on the XYZ blocklist, bye bye".
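
As a rough Python sketch of that decision, assume the blocklist is a
small local list of networks. The addresses below come from
documentation ranges, and the list name, host name, and function name
are hypothetical. The 554 and 220 codes are the standard SMTP replies
for refusing and accepting a new connection.

```python
import ipaddress

# Hypothetical blocklist entries, taken from documentation address ranges.
BLOCKLIST = [
    ipaddress.ip_network("192.0.2.0/24"),
    ipaddress.ip_network("198.51.100.7/32"),
]


def smtp_greeting(client_ip, blocklist=BLOCKLIST, list_name="XYZ"):
    """Return the first SMTP reply line to send to a connecting client."""
    address = ipaddress.ip_address(client_ip)
    if any(address in network for network in blocklist):
        # Refuse before any message data is accepted.
        return f"554 Your machine is on the {list_name} blocklist, bye bye"
    return "220 mail.example.com ESMTP"  # hypothetical host name
```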

This system works because IP addresses are hard to forge and people
can't get new IP addresses easily. If an IP becomes useless to a
large portion of the Internet, the spammer must spend energy to get a
new IP. The benefits of this defense are unique. If it works, it
prevents the wasted network and CPU utilization that spam causes on
mail servers. It is also unique in that it is geared more toward
administrators than end users. Unfortunately, it too has its
weaknesses. Currently, spammers routinely take over other unsecured
hosts on the Internet in order to relay spam. Also, some blocklists
end up being ineffective as they are incomplete, inconsistent, or too
extreme in their practices.

One last technique is cryptographic authentication. This is
more of a proposed or future technique than one that is currently used
in practice. The idea is similar to using a whitelist of approved
emails or hosts. The difference is that you only allow senders that
have the proper credentials based on modern cryptography. These
credentials would be impossible to forge and expensive to re-purchase
continually.

This technique is worth mentioning since there are groups working
hard on a secure email infrastructure. Such a system would require an
authentication piece as well. If this were built, not only would we be
able to send email securely, but we could have the ability to filter
spam. Unfortunately, since the existing email infrastructure is so
huge and entrenched, it will take a long time for such a system to get
built.

All of these techniques have their pros and cons. The cons are
especially annoying with naive or poor implementations. They may
filter an email and never let the original sender know that their
message was marked as spam. They may block a host that should
actually be allowed. Some require a lot of user intervention. Most
will accidentally block subscribed mailing lists. Some systems that
share patterns over the Internet mark forwarded "joke" emails as spam,
though this may be a feature. Some require lots of email or time to
learn your valid email patterns. All of this will make it harder for
new people (customers or anonymous contributors) to communicate with
you.

To cut to the chase, I looked at the above techniques and chose a
network level defense. The choice was easy for me since that system
was the only one that could have protected my machine's resources from
the recent spam attack I endured.

The Attack of 2002

On November 19, 2002, I was getting 10-20 TCP connections per
second from around 300 different IP networks to my machine at a
colocation facility. I checked the source IPs and they were coming
from all over the globe. The destination email addresses all conformed
to a simple pattern; this indicated that something was performing a
simple algorithmic attack. My computer was really sluggish from
queuing all of the bounce messages. My qmail queue was
over 13,500 messages at that point. In fact, I couldn't even send out
email through the machine's localhost interface. The load
caused all sorts of timeouts for other systems on the machine.

After about 24 hours, the attack ended. I didn't receive or relay
any spam, but I was really upset. I did some research on the net and a
few people in the spam community believed that this had the signature
of a Klez Cluster Attack. This is an attack where the spammer uses a
cluster of machines infected with the Klez virus to relay spam to
various hosts on the Internet. Think of it as unsuspecting users
donating their machine time to the spam@home project. These types of
attacks appear to be increasing steadily, and I'm not the only one
upset about it.

So, with that experience, I came up with a simple list of features
that my spam defense would have to provide:

Obviously, it must block or significantly reduce incoming spam.

I wanted any email that was blocked to bounce immediately. That
way the sender would immediately know if there was a problem and then
take action.

I wanted the email to be blocked at the network level to prevent
my machine from being overwhelmed with email processing.

I wanted to be able to use any email client.

I wanted to avoid maintaining filters.

I wanted it to be a small load on my server.

With that in hand, I did some research on the existing network
level spam defenses and talked to a few friends. Let me go over that
research right now.

History of Network Level Spam Defenses

Network level spam defenses owe their design to the original group,
MAPS. The MAPS project started in 1997 as a small private mailing list
called the Realtime Blackhole List (RBL). It was composed of
like-minded anti-spammers. Paul Vixie, a widely known netizen, was
one of the main people involved with the group and he helped
publicize their efforts. They created a list of IP addresses that
spammers were using and allowed other members to query their database
in real-time, over the Internet. If an IP address was in that list,
and it attempted to send mail through any of the MAPS subscribers'
networks, the packets were "black holed" or dropped. This worked well
against some of the main spammers who were coming from known
networks.

At first, the RBL group used the Border Gateway Protocol (BGP) for
distributing this blackhole list or database to other
systems. Although BGP was normally used for exchanging global routes
between core Internet routers, it could also be used for distributing
the RBL database. Since almost all of the systems that could talk BGP
were routers, the RBL system was mostly useful to people in control of
their routers. It also required good knowledge of the protocol and a
decent Internet connection. These features kept the RBL from being
useful to a larger set of administrators.

A simpler system was devised in order to make the system much more
approachable to normal administrators with fewer resources. In the
same way that the MAPS group used an existing protocol for a new
purpose, they found another system that would fit this new set of
requirements. They chose to use the most successful distributed
database system that was already in use, the Domain Name System (DNS).

Paul Vixie and a few others already had expert knowledge of the DNS
system, having worked on the BIND DNS server, the most widely used DNS
server of the time. Choosing DNS allowed them to reuse a lot of
existing software and avoid conflicts with existing firewall
rules. Also, because it was DNS, it was already lightweight and well
tested. This adaptation on top of DNS is the system used by probably
99% of the network level spam providers today. In the anti-spam
community, the protocol is called IP4R, which is probably derived from
the phrase "IPv4 Reverse Lookup".

The IP4R Protocol

When you query a server via IP4R for IP addresses, your query is
similar to the query that a host uses when looking up the name
associated with an IP address. Suppose you use the 1.2.3.4 IP address
for your system, which is called a.b.com. When you
connect to a server on the Internet, the destination server will query
its DNS servers for the name associated with your IP address. It does
that by querying for a DNS record in the namespace
4.3.2.1.in-addr.arpa from the root name servers. If there
is a PTR record set up by your ISP for 1.2.3.4 (or the destination
server's DNS was set up correctly), the query will eventually return
a.b.com as the name associated with that IP.
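
The reversed name used for this ordinary reverse lookup can be built
with Python's standard ipaddress module. The helper name below is
hypothetical, and the sketch only builds the name without touching
the network.

```python
import ipaddress


def reverse_lookup_name(ip):
    """Return the in-addr.arpa name that a PTR query for this IPv4 address uses."""
    return ipaddress.ip_address(ip).reverse_pointer


# To perform the lookup itself, socket.gethostbyaddr("1.2.3.4") would ask
# the resolver for the PTR record; it is omitted here to stay offline.
print(reverse_lookup_name("1.2.3.4"))  # prints 4.3.2.1.in-addr.arpa
```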

When email servers want to use IP4R to see if a host is from a
hostile IP address, they do a similar lookup with a few
differences. The first difference is that IP4R does not use the same
DNS namespace. You would use the blocklist provider's namespace rather
than in-addr.arpa. For example, if we used a service from
example.com, they may tell us to use the namespace
rbl.example.com. The second difference is in the DNS
reply from the lookup. A normal IP reverse lookup would expect a reply
to return a hostname. An IP4R reply, on the other hand, returns a
special IP address to indicate its answer. Let's step through a simple
IP4R lookup to illustrate.

First, we would set up our software to query for IP addresses in
our example provider's namespace: rbl.example.com.
Suppose we are checking on a host with the IP of 1.2.3.4. The DNS
query would go out on the Internet for the name
4.3.2.1.rbl.example.com. If the IP address is in the
blocklist, the server will reply with the address
127.0.0.2 (in DNS parlance, an "A" or address record of
127.0.0.2). If the IP address isn't in the database, the query will
return no address record (typically a "no such name" reply). That is
all there is to it. Reusing DNS and keeping the protocol as simple as
possible was a great idea.
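
Building the query name is the only protocol-specific step, and it
fits in a few lines of Python. In this sketch the function name is
hypothetical and rbl.example.com is the same made-up provider
namespace used in the walkthrough above.

```python
import ipaddress


def ip4r_query_name(ip, zone="rbl.example.com"):
    """Reverse the octets of an IPv4 address and append the provider's zone."""
    # IPv4Address validates the input and raises ValueError on bad addresses.
    octets = str(ipaddress.IPv4Address(ip)).split(".")
    return ".".join(reversed(octets)) + "." + zone


print(ip4r_query_name("1.2.3.4"))  # prints 4.3.2.1.rbl.example.com
```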

There is also another optional record that an IP4R provider can
send as the result of a query. In DNS terminology the record is called
a TXT or "text" record. These records just hold character strings. If
an IP is in the database, the provider can also return a TXT record in
the reply to the query. Within the reply, the IP4R provider can put
an explanation of why the record exists in the database and who to
contact about it. This is important, and I'll show you how this comes
into play in the actual implementation.

The last part of the protocol to mention is required for testing
purposes. All IP4R providers should have the 127.0.0.2
address in their database. This is in the 127/8 localhost
network and should never be an address on the Internet. Since it can't
be on the Internet, we can safely use it to test queries to the IP4R
provider. A neat trick for doing this is to use the ping
command on the command line. For example, if we use the
rbl.example.com again, you should be able to do:

# ping 2.0.0.127.rbl.example.com.

If your ping starts getting replies, then you have
a proper setup. If you get some other error (usually an "unknown
host"), then you know to review your configuration.

Also, since we get the address of '127.0.0.2' if an IP is in the
blocklist, we can use the same simple technique to see if arbitrary
addresses are in the IP4R provider's database. Using the
1.2.3.4 address example again:

# ping 4.3.2.1.rbl.example.com.

If it's in there, then you should get ping replies. If the address
isn't in the blocklist, then you should get an 'unknown host' or
similar error.
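
The same two checks can be scripted instead of run by hand with ping.
The following Python sketch assumes a provider that answers with
127.0.0.2 for listed addresses, as described above. The function
names are hypothetical, and the resolver is passed in as a parameter
so the logic can be exercised without a real provider.

```python
import socket

LISTED_ANSWER = "127.0.0.2"  # the reply described above for listed addresses


def ip4r_lookup(ip, zone="rbl.example.com", resolve=socket.gethostbyname):
    """Return True if the provider lists this IPv4 address."""
    query = ".".join(reversed(ip.split("."))) + "." + zone
    try:
        return resolve(query) == LISTED_ANSWER
    except socket.gaierror:
        # The name does not resolve: the "unknown host" case.
        return False


def provider_works(zone="rbl.example.com", resolve=socket.gethostbyname):
    """Check the test entry (127.0.0.2) that every provider should list."""
    return ip4r_lookup("127.0.0.2", zone, resolve)
```

Calling provider_works() first and ip4r_lookup("1.2.3.4") second
mirrors the two ping commands, with the zone replaced by your
provider's actual namespace.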

Before we move on, let me briefly cover something I glossed over in
that last section. In order to use the IP4R protocol, your mail
software must support it. The good news is that most email servers
(or, more pedantically, Mail Transfer Agents or MTAs) support
this. If a protocol is simple in design, it is usually simple to
implement.

Conclusion for Today

In the beginning of this article we took a brief tour of the
various anti-spam techniques in order to determine the right
solution. Then we went deeper into the best one for my situation, a
network level spam defense. In the next article, I'll go over which
blocklist provider I chose, giving a detailed description of an
install with my mail server and discussing the positive results of
this effort.

Dru Nelson
has been on the Internet since 1988. After starting an ISP in
Florida, he moved to the San Francisco Bay area and has been involved
with large Internet infrastructure at companies like Four11 (Yahoo
Mail), eGroups (Yahoo Groups), and Plaxo. He is now the
CTO and co-founder of BrightRoll.com.