Sniffing out Spammers

To identify the geographic regions from which link spam originated, a database locates IP addresses and the Google Charts service puts them onto a world map.

Sometimes I imagine how satisfying it would be to track down a spammer's or telemarketer's office, like the man in the Snickers commercial [1] who arrives at the office and gets his revenge. Unfortunately, legal and logistical reasons usually prevent this. Besides, the requests often come from botnets rather than from the spammers' own machines. Still, it would be interesting to create a graph that pinpoints the geographic regions in which most spam activity originates.

The Internet is the ideal platform for anonymous trickery, but the perpetrator's deeds actually leave a trail – each incoming request on a website includes the sender's IP address (see Figure 1).

Of course, the address could be spoofed, but this is not so simple and just too much trouble for most link spammers.

The DNS system, which resolves hostnames into IP addresses, can often do the same thing in reverse gear. A DNS reverse lookup expects an IP address, and if the spammer's service provider has set up everything as it should be, the script in Listing 1, revlookup, will return a hostname from which you usually can identify the provider. Figure 2 shows that the address I caught spamming, IP 69.162.110.146, belongs to an ISP called lstn.net; a friendly email to the ISP's webmaster, stating the IP and the time (which is important because these IPs are often assigned dynamically) might just be the ticket to stop the spammer's illegal activity for good.

Figure 2: A reverse DNS lookup often reveals the domain associated with the IP address.

Listing 1

revlookup

The inet_aton() function in Perl's Socket module accepts an IP address in string notation ("x.x.x.x") and returns it in packed binary form for a subsequent call to Perl's gethostbyaddr() function. When called with the AF_INET parameter, as shown in line 17 in revlookup, gethostbyaddr() performs the DNS reverse lookup in the IPv4 address space and, if successful, returns a string with the hostname; if an error occurs, it returns undef. Depending on how busy the DNS server is at the time and how many of its peers it needs to consult to answer your request, this process can take a couple of seconds.
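As a minimal sketch of the two calls just described (not a reproduction of Listing 1), the following script wraps them in a function. The reverse_lookup() name and the up-front format check are my own additions; the check is there because inet_aton() would otherwise try to resolve a hostname if it were handed one.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Socket qw(inet_aton AF_INET);

# Reverse-resolve a dotted-quad IPv4 address.
# Returns the hostname, or undef if the input is malformed
# or no reverse entry can be found.
sub reverse_lookup {
    my ($ip) = @_;

    # Accept only dotted-quad notation with octets 0-255, so that
    # inet_aton() is never asked to resolve a hostname instead.
    my @octets = split /\./, $ip, -1;
    return undef unless @octets == 4;
    for my $octet (@octets) {
        return undef unless $octet =~ /^\d{1,3}$/ && $octet <= 255;
    }

    my $packed = inet_aton($ip);    # 4-byte packed address
    return undef unless defined $packed;

    # In scalar context, gethostbyaddr() returns just the hostname.
    my $host = gethostbyaddr($packed, AF_INET);
    return $host;
}

# Look up every address given on the command line.
for my $ip (@ARGV) {
    my $host = reverse_lookup($ip);
    print defined $host ? "$ip -> $host\n" : "$ip: no reverse entry\n";
}
```

Called with one or more addresses as arguments, the script prints one line per address; what it prints depends entirely on what the DNS returns at the time.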

As another option, the whois command-line utility doesn't just work with domains; it also accepts IP addresses as arguments. Figure 3 shows that the provider, Limestone Networks, has registered everything correctly and even provides an email address that spammed webmasters can contact with their complaints. The lookup can be automated in Perl with, for example, the CPAN Net::Whois::Raw module; however, it puts significant load on the servers hosted by Network Solutions, who will block access if you perform 100 lookups in quick succession. In other words, searching a complete access log with this module is impossible, even if you cache queries you have already made.

Many spammers use IP addresses without a reverse lookup entry on the DNS system. But even then you can still locate the culprit; IP addresses are assigned to service providers in blocks, and you can download databases with the information necessary to discover the approximate geographic position of any given IP address. MaxMind offers a database file [2] that is free for non-commercial use. The licensing conditions are available in the same directory as the database itself. The CPAN IP::Country::MaxMind module provides an API to match, thus avoiding the need to mess around with data blobs. The IP mappings stored in the database change very slowly; updating once every couple of months should be fine.

After installing the module, you will need another CPAN module, Geo::IP::PurePerl. The MaxMind module's open() constructor loads the local database that you specify, and the inet_atocc() function returns a country code for any IP address (for example, DE for Germany).

The Google Charts API [3] gives you a useful option for plotting these codes on a world map. If you pass in pairs of values to the Google server, it will respond with a PNG-formatted image file. The data format for the pairs of values is slightly unusual in that you need to squash largish volumes of data into the very restricted space offered by a URL and its query parameters.

Simplicity Itself

The API's aptly named Simple Encoding data format will only allow values between 0 and 61, encoded as A-Z (0-25), a-z (26-51), and 0-9 (52-61).

If you assign a value of 23 to Germany, 3 to the USA, and 60 to Japan, you can encode the country codes in the chld URL parameter as "DEUSJP" (DE, US, and JP, concatenated without blanks), and the values as "s:XD8" (s = simple encoding, X = 23, D = 3, and 8 = 60) in chd.
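To make the mapping concrete, here is a small sketch that reproduces the example above. The encode_simple() helper is a name I made up for illustration; it is not part of any Google or CPAN API.

```perl
use strict;
use warnings;

# The 62 symbols of Simple Encoding: A-Z, a-z, 0-9 stand for 0..61.
my @SYMBOLS = ('A' .. 'Z', 'a' .. 'z', '0' .. '9');

# Turn a list of integers in 0..61 into a chd parameter value.
sub encode_simple {
    my @values = @_;
    for my $v (@values) {
        die "value '$v' is outside 0..$#SYMBOLS\n"
            unless $v =~ /^\d+$/ && $v <= $#SYMBOLS;
    }
    return 's:' . join '', map { $SYMBOLS[$_] } @values;
}

my %spam  = (DE => 23, US => 3, JP => 60);
my @codes = qw(DE US JP);    # fixed order for both parameters

my $chld = join '', @codes;
my $chd  = encode_simple(@spam{@codes});

print "chld=$chld chd=$chd\n";    # chld=DEUSJP chd=s:XD8
```

The only thing that matters is that country codes and values are emitted in the same order, which is why both are derived from the same @codes list.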

The script in Listing 2, spam2geo, implements the steps I have identified thus far; it analyzes the access.log file from an Apache server under heavy fire from link spammers. The CPAN ApacheLog::Parser module provides a parse_line_to_hash function, which understands the access.log format and returns the individual fields of each log entry as a hash. The client entry includes the spammer's IP address in each case, and a call to the inet_atocc method in line 32 returns the two-letter country code, assuming the database knows it.

Listing 2

spam2geo

If successful, line 36 increments the hash entry for the country, and the program moves on to the next line in the logfile. Because you are not interested in all the URLs, just the ones generated by spammers, line 28 filters out all entries whose path (the file hash key) does not match the regular expression /posting/. The regex should only match URLs used by spammers to post on the forums you are monitoring, so you must modify it to match your local conditions.
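The filter-and-count loop can be sketched roughly as follows. This is not Listing 2: to keep the example free of CPAN dependencies, a plain regex stands in for ApacheLog::Parser, and the country lookup is passed in as a code reference. The count_by_country() name, the sample log lines, and the lookup table are all made up for illustration.

```perl
use strict;
use warnings;

# Count requests per country for log lines whose path matches $filter.
#   $lines      - array ref of access.log lines (common/combined format)
#   $filter     - compiled regex identifying spam URLs, e.g. qr/posting/
#   $to_country - code ref mapping an IP address to a country code
#                 (hypothetical hook; returns undef for unknown IPs)
sub count_by_country {
    my ($lines, $filter, $to_country) = @_;
    my %by_country;

    for my $line (@$lines) {
        # The client IP is the first field; the request line is quoted.
        my ($client, $path) =
            $line =~ /^(\S+) .*? "[A-Z]+ (\S+)[^"]*"/ or next;

        next unless $path =~ $filter;    # only spammer URLs
        my $cc = $to_country->($client);
        next unless defined $cc;         # IP not in the database
        $by_country{$cc}++;
    }
    return \%by_country;
}

# Invented sample data: documentation-range IPs, arbitrary countries.
my %fake_db = ('192.0.2.10' => 'CN', '198.51.100.7' => 'US',
               '203.0.113.5' => 'DE');
my @log = (
    '192.0.2.10 - - [01/Jan/2010:00:00:01 +0000] "POST /forum/posting.php HTTP/1.1" 200 512',
    '192.0.2.10 - - [01/Jan/2010:00:00:02 +0000] "POST /forum/posting.php?mode=reply HTTP/1.1" 200 512',
    '198.51.100.7 - - [01/Jan/2010:00:00:03 +0000] "POST /forum/posting.php HTTP/1.1" 200 512',
    '203.0.113.5 - - [01/Jan/2010:00:00:04 +0000] "GET /index.html HTTP/1.1" 200 1024',
);

my $counts = count_by_country(\@log, qr/posting/, sub { $fake_db{ $_[0] } });
printf "%s: %d\n", $_, $counts->{$_} for sort keys %$counts;
```

With the real database, the callback would presumably wrap the inet_atocc() call on the object returned by the MaxMind module's open() constructor.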

Normalization and conversion of the data to the Google format starts in line 42. Because the numeric values for each country in the %by_country hash are not necessarily in the range 0-61 but can assume arbitrary values, spam2geo must determine the limits of the range using min and max from List::Util. After doing so, it subtracts $min and divides by $max to squash the numeric values into the range between 0 and 1, then multiplies the result by the number of encoding characters minus 1. Thus, $norm contains a floating point number, which can be converted to an integer and used as an index into the @SYMBOLS array, mapping the whole range of values to elements of the array.
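A sketch of this normalization step might look like the following. Note one deliberate difference, which is my choice and not necessarily what Listing 2 does: the sketch divides by the span ($max - $min) rather than by $max, so that the largest count lands exactly on the last symbol, and it guards against the case in which all counts are equal. The normalize_counts() name and the sample counts are invented.

```perl
use strict;
use warnings;
use List::Util qw(min max);

my @SYMBOLS = ('A' .. 'Z', 'a' .. 'z', '0' .. '9');

# Map arbitrary counts onto Simple Encoding symbols.
# Returns a hash ref: country code => one-character symbol.
sub normalize_counts {
    my ($by_country) = @_;
    return {} unless %$by_country;

    my $min  = min values %$by_country;
    my $max  = max values %$by_country;
    my $span = $max - $min;

    my %symbol_for;
    for my $cc (keys %$by_country) {
        # Assumption: scale by the span; if all counts are equal,
        # fall back to 0 to avoid dividing by zero.
        my $norm = $span ? ($by_country->{$cc} - $min) / $span : 0;

        # int() truncates, giving an index between 0 and $#SYMBOLS.
        $symbol_for{$cc} = $SYMBOLS[ int($norm * $#SYMBOLS) ];
    }
    return \%symbol_for;
}

my $symbols = normalize_counts({ CN => 120, US => 80, DE => 10 });
print "$_ => $symbols->{$_}\n" for sort keys %$symbols;
```

For the invented counts above, the smallest value (DE) maps to A and the largest (CN) to 9, with US falling in between.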

Lines 68 and 70 then concatenate the calculated symbols to give strings without separating blanks, for passing in with the chld (country codes) and chd (values) URL parameters. From the programmer's point of view, the order in which the keys and values functions return results is arbitrary, but consistent within the Perl script, and irrelevant to the Google service.

Communications with the Google server are handled by LWP::UserAgent via HTTP. The URL parameters are set by the query_form() method, which also performs any URL encoding required. The cht parameter specifies the chart type used by the Google Charts service and is set to "t" (topological) for a world map. You can optionally restrict the view to individual continents; however, you need to set the chtm parameter to "world" for a world map.

The chs parameter sets the dimensions of the resulting image to 440x220 pixels. Google Charts uses the colors white, yellow, and red specified as hex RGB values in chco to shade the countries, thus reflecting minimum, medium, and maximum values. So, the settings in Listing 2 leave countries with normalized spam counts around 0 white, values of around 20 yellow, and values of 60 or more red. The "bg,s,EAF7FE" string for the chf parameter stands for background, solid, and the hex value for light blue to color the world's oceans.

All told, the URL will look something like this:

http://chart.apis.google.com/chart?cht=t&chs=440x220&chtm=world&chd=s%3ABFAABAHGQAAA8BAAAAAAAaBAA&chco=ffffff%2Cf4ed28%2Cf11414&chld=GBNLHKEELVKRRUSAPAMDCASECNDEPKITPLINMEBRCZUSUAESFR&chf=bg%2Cs%2CEAF7FE
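To see how such a URL comes together, here is a dependency-free sketch that assembles one from the parameters discussed above. It is a stand-in for query_form(), not the code from Listing 2; the chart_url() and uri_escape_basic() helpers are names I made up, and the script only builds the URL without sending a request.

```perl
use strict;
use warnings;

# Percent-encode everything except RFC 3986 unreserved characters.
sub uri_escape_basic {
    my ($text) = @_;
    $text =~ s/([^A-Za-z0-9\-._~])/sprintf '%%%02X', ord $1/ge;
    return $text;
}

# Build a chart URL from an ordered list of key/value pairs.
sub chart_url {
    my (@pairs) = @_;
    my @query;
    while (my ($key, $value) = splice @pairs, 0, 2) {
        push @query, "$key=" . uri_escape_basic($value);
    }
    return 'http://chart.apis.google.com/chart?' . join '&', @query;
}

my $url = chart_url(
    cht  => 't',                       # map chart
    chs  => '440x220',                 # image size in pixels
    chtm => 'world',                   # whole world, not a continent
    chd  => 's:XD8',                   # values in Simple Encoding
    chco => 'ffffff,f4ed28,f11414',    # white, yellow, red
    chld => 'DEUSJP',                  # country codes, same order as chd
    chf  => 'bg,s,EAF7FE',             # solid light-blue background
);
print "$url\n";
```

Passing the parameters as a list rather than a hash keeps them in a fixed order, which makes the generated URL easy to compare with the one shown above.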

Google takes just a couple of seconds to render and deliver this as the graph shown in Figure 4. If you comment out lines 27-29 in spam2geo, the graph will give you a geographic distribution of all incoming URLs instead (Figure 5).

Figure 4: Spammers mainly come from China and North America.

Figure 5: Users of the website mainly come from Germany.

Although most spam requests originate in China and the US, most of the website's bona fide customers come from Germany. To view the image that the script retrieved from Google, the eog file.png command opens it in the Eye of GNOME viewer.

Installation

After downloading the MaxMind GeoIP.dat.gz database [2], unpack the GeoIP.dat file and place the spam2geo script into your current working directory. The CPAN IP::Country::MaxMind, Geo::IP::PurePerl, List::Util, and ApacheLog::Parser modules and all their dependencies are best installed from a CPAN shell. To use the Google API, you do not need to register. You just need to modify line 28 in spam2geo to match your local conditions by changing the /posting/ pattern to match URLs used only by spammers to clutter your discussion groups with parasitic entries.

For more detailed analysis including, for example, the number of forum requests compared with other activities or the preferred browser type used by the spammers (at least what they say they're using), check out the enormous choice provided by the Google Charts API [3], which gives you an easy approach to render any statistical information elegantly in polished chart form.
