2 Spam: More than Just a Nuisance
Spam makes up 95% of all email traffic
  – Image and PDF spam (PDF spam ~12%)
As of August 2007, one in every 87 emails constituted a phishing attack
Targeted attacks on the rise
  – 20k-30k unique phishing attacks per month
Source: CNET (January 2008), APWG

3 Filtering
Prevent unwanted traffic from reaching a user's inbox by distinguishing spam from ham
Question: What features best differentiate spam from legitimate mail?
  – Content-based filtering: What is in the mail?
  – IP address of sender: Who is the sender?
  – Behavioral features: How is the mail sent?

5 Problems with Content Filtering
Low cost of evasion: Spammers can easily alter features of an email's content
Customized emails are easy to generate: Content-based filters need fuzzy hashes over content, etc.
High cost to filter maintainers: Filters must be continually updated as content-changing techniques become more sophisticated

7 Problem: Addresses Keep Changing
[Figure; axis label: Fraction of IP Addresses]
About 10% of IP addresses were never seen before in the trace

8 Key Idea: Network-Based Filtering
Filter email based on how it is sent, in addition to what is sent.
Network-level properties are less malleable
  – Set of target recipients
  – Hosting or upstream ISP (AS number)
  – Membership in a botnet (spammer, hosting infrastructure)
  – Network location of sender and receiver

9 Challenges (Talk Outline)
Understanding the network-level behavior
  – What behaviors do spammers have?
  – How well do existing techniques work?
Building classifiers using network-level features
  – Key challenge: Which features to use?
  – Two Algorithms: SpamTracker and SNARE
Building the system
  – Dynamism: Behavior itself can change
  – Scale: Lots of email messages (and spam!) out there

14 Why Such Big Prefixes?
Visibility: Route typically won't be filtered (nice and short)
Flexibility: Client IPs can be scattered throughout dark space within a large /8
  – Same sender usually returns with different IP addresses

15 Other Findings
Top senders: Korea, China, Japan
  – Still about 40% of spam coming from U.S.
More than half of sender IP addresses appear fewer than twice
~90% of spam sent to traps comes from Windows hosts

17 Two Metrics
Completeness: The fraction of spamming IP addresses that are listed in the blacklist
Responsiveness: The time for the blacklist to list the IP address after the first occurrence of spam
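
The two metrics can be illustrated with a minimal sketch. The function names, the sample addresses (documentation ranges), and the timestamps are hypothetical; never-listed IPs are simply left out of the responsiveness result, which is one possible convention and not necessarily the one used in the study.

```python
from datetime import datetime, timedelta

def completeness(spam_ips, listed_ips):
    """Fraction of spamming IP addresses that appear in the blacklist."""
    spam_ips = set(spam_ips)
    if not spam_ips:
        return 0.0
    return len(spam_ips & set(listed_ips)) / len(spam_ips)

def responsiveness(first_spam_seen, first_listed):
    """Per-IP delay between first observed spam and blacklist listing.

    IPs that were never listed are omitted; an IP listed before its
    first spam gets a delay of zero.
    """
    delays = {}
    for ip, seen in first_spam_seen.items():
        listed = first_listed.get(ip)
        if listed is not None:
            delays[ip] = max(listed - seen, timedelta(0))
    return delays

# Hypothetical sample data, not taken from the trace.
first_spam_seen = {
    "192.0.2.1": datetime(2007, 3, 1, 12, 0),
    "192.0.2.2": datetime(2007, 3, 2, 9, 30),
    "198.51.100.7": datetime(2007, 3, 3, 18, 0),
    "203.0.113.9": datetime(2007, 3, 4, 6, 0),
}
first_listed = {
    "192.0.2.1": datetime(2007, 3, 1, 10, 0),    # listed before spam arrived
    "192.0.2.2": datetime(2007, 3, 5, 9, 30),    # listed three days later
    "198.51.100.7": datetime(2007, 3, 3, 20, 0), # listed two hours later
}

coverage = completeness(first_spam_seen, first_listed)
delays = responsiveness(first_spam_seen, first_listed)
```

Here three of the four spamming addresses are listed, and the unlisted one contributes to incompleteness rather than to the delay figures.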

18 Completeness and Responsiveness
10-35% of spam is unlisted at the time of receipt
8.5-20% of these IP addresses remain unlisted even after one month
Data: Trap data from March 2007, Spamhaus from March and April 2007

19 What's Wrong with IP Blacklists?
Based on ephemeral identifier (IP address)
  – More than 10% of all spam comes from IP addresses not seen within the past two months
    – Dynamic renumbering of IP addresses
    – Stealing of IP addresses and IP address space
    – Compromised machines
IP addresses of senders have considerable churn
Often require a human to notice/validate the behavior
  – Spamming is compartmentalized by domain and not analyzed across domains

20 How to Fix This Problem?
Option 1: Stronger sender identity
  – Stronger sender identity/authentication may make reputation systems more effective
  – May require changes to hosts, routers, etc.
Option 2: Filtering based on sender behavior
  – Can be done on today's network
  – Identifying features may be tricky, and some may require network-wide monitoring capabilities

21 Outline
Understanding the network-level behavior
  – What behaviors do spammers have?
  – How well do existing techniques work?
Building classifiers using network-level features
  – Key challenge: Which features to use?
  – Algorithms: SpamTracker and SNARE
Building the system (SpamSpotter)
  – Dynamism: Behavior itself can change
  – Scale: Lots of email messages (and spam!) out there

22 SpamTracker
Idea: Blacklist sending behavior (Behavioral Blacklisting)
  – Identify sending patterns commonly used by spammers
Intuition: It is much more difficult for a spammer to change the technique by which mail is sent than it is to change the content

23 SpamTracker Approach
Construct a behavioral fingerprint for each sender
Cluster senders with similar fingerprints
Filter new senders that map to existing clusters
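
A minimal sketch of the first step, assuming a fingerprint is a vector of per-domain message counts built from (sender IP, receiving domain) log entries. The function names, domains, and log entries are hypothetical. The clustering step itself is not implemented here: the cluster membership below is assigned by hand purely to show how an average pattern for a cluster could be formed.

```python
from collections import Counter, defaultdict

def build_fingerprints(log_entries, domains):
    """Map each sender IP to a vector of message counts, one entry per domain."""
    counts = defaultdict(Counter)
    for sender_ip, receiving_domain in log_entries:
        counts[sender_ip][receiving_domain] += 1
    return {
        ip: [per_domain[d] for d in domains]
        for ip, per_domain in counts.items()
    }

def cluster_average(fingerprints, member_ips):
    """Average the fingerprints of the senders assigned to one cluster."""
    vectors = [fingerprints[ip] for ip in member_ips]
    return [sum(column) / len(vectors) for column in zip(*vectors)]

# Hypothetical log entries: (sender IP, receiving domain).
domains = ["a.example", "b.example", "c.example"]
log_entries = [
    ("192.0.2.1", "a.example"),
    ("192.0.2.1", "b.example"),
    ("192.0.2.2", "a.example"),
    ("192.0.2.2", "b.example"),
    ("192.0.2.2", "b.example"),
    ("198.51.100.7", "c.example"),
]

fingerprints = build_fingerprints(log_entries, domains)
# Assumed cluster membership; a real system would derive it by clustering.
pattern = cluster_average(fingerprints, ["192.0.2.1", "192.0.2.2"])
```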

27 Evaluation
Emulate the performance of a system that could observe sending patterns across many domains
  – Build clusters/train on given time interval
Evaluate classification
  – Relative to labeled logs
  – Relative to IP addresses that were eventually listed

28 Data
30 days of Postfix logs from email hosting service
  – Time, remote IP, receiving domain, accept/reject
  – Allows us to observe sending behavior over a large number of domains
  – Problem: About 15% of accepted mail is also spam
    – Creates problems with validating SpamTracker
30 days of Spamhaus database in the month following the Postfix logs
  – Allows us to determine whether SpamTracker detects some sending IPs earlier than Spamhaus

34 Density of Senders in IP Space
For spammers, the k nearest senders are much closer in IP space
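
One way to picture this feature is the sketch below. It assumes that "distance in IP space" means the absolute difference between IPv4 addresses taken as 32-bit integers; that definition, the function name, and the sample addresses are assumptions for illustration only.

```python
import ipaddress

def mean_distance_to_k_nearest(sender, other_senders, k):
    """Average numeric distance from `sender` to its k nearest senders.

    Distance is the absolute difference of the addresses as integers.
    Returns None when there are no other senders to compare against.
    """
    target = int(ipaddress.IPv4Address(sender))
    distances = sorted(
        abs(int(ipaddress.IPv4Address(ip)) - target)
        for ip in other_senders
        if ip != sender
    )
    nearest = distances[:k]
    if not nearest:
        return None
    return sum(nearest) / len(nearest)

# Hypothetical senders seen in the same time window.
others = ["192.0.2.11", "192.0.2.14", "198.51.100.1"]
density = mean_distance_to_k_nearest("192.0.2.10", others, k=2)
```

A small value means the sender sits in a densely populated neighborhood of other senders, which is the pattern the slide attributes to spammers.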

35 Local Time of Day at Sender
Spammers peak at different local times of day
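
As a rough illustration, the local hour at the sender could be derived from the receipt time and a UTC offset for the sender's location. How the offset is obtained (for example from IP geolocation) is not covered by the slide and is assumed to be given; the function name is hypothetical.

```python
from datetime import datetime, timedelta, timezone

def local_hour_at_sender(received_utc, utc_offset_hours):
    """Hour of day (0-23) at the sender's location when the mail was received.

    `received_utc` must be a timezone-aware datetime; `utc_offset_hours`
    is the sender's assumed offset from UTC.
    """
    sender_zone = timezone(timedelta(hours=utc_offset_hours))
    return received_utc.astimezone(sender_zone).hour

# Hypothetical receipt time: 20:00 UTC.
received = datetime(2007, 3, 1, 20, 0, tzinfo=timezone.utc)
hour_east = local_hour_at_sender(received, 9)    # sender nine hours ahead of UTC
hour_west = local_hour_at_sender(received, -5)   # sender five hours behind UTC
```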

36 Combining Features: RuleFit
Put features into the RuleFit classifier
10-fold cross validation on one day of query logs from a large spam filtering appliance provider
Using only network-level features
Completely automated
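
The evaluation procedure can be sketched as follows. This shows only the 10-fold cross-validation loop; RuleFit itself is not implemented, and a trivial one-feature threshold classifier stands in for it. All names and the toy data are hypothetical, and a real evaluation would normally shuffle the samples before splitting.

```python
def k_fold_splits(n_samples, k=10):
    """Yield (train_indices, test_indices) for each of k contiguous folds."""
    sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in sizes:
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, test
        start += size

def cross_validate(features, labels, train_fn, k=10):
    """Mean accuracy over k folds; train_fn returns a predict(feature) function."""
    accuracies = []
    for train_idx, test_idx in k_fold_splits(len(features), k):
        model = train_fn([features[i] for i in train_idx],
                         [labels[i] for i in train_idx])
        correct = sum(model(features[i]) == labels[i] for i in test_idx)
        accuracies.append(correct / len(test_idx))
    return sum(accuracies) / len(accuracies)

def train_threshold(features, labels):
    """Stand-in classifier (NOT RuleFit): split at the midpoint of class means."""
    spam = [f for f, l in zip(features, labels) if l == "spam"]
    ham = [f for f, l in zip(features, labels) if l == "ham"]
    midpoint = (sum(spam) / len(spam) + sum(ham) / len(ham)) / 2
    return lambda f: "spam" if f < midpoint else "ham"

# Toy single-feature data: small values for spam, large values for ham.
features = list(range(1, 11)) + list(range(101, 111))
labels = ["spam"] * 10 + ["ham"] * 10
accuracy = cross_validate(features, labels, train_threshold, k=10)
```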

39 Challenges
Scalability: How to collect and aggregate data, and form the signatures without imposing too much overhead?
Dynamism: When to retrain the classifier, given that sender behavior changes?
Reliability: How should the system be replicated to better defend against attack or failure?
Evasion resistance: Can the system still detect spammers when they are actively trying to evade?

44 Next Steps: Applications to Scams
Scammers host Web sites on dynamic scam hosting infrastructure
  – Use the DNS to redirect users to different sites when the location of the sites moves
State of the art: Blacklist URL
Our approach: Blacklist based on network-level fingerprints

45 Example: Time Between Record Changes
Fast-flux domains tend to change their DNS records much more frequently than legitimately hosted sites
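
A minimal sketch of how the time between record changes might be measured, assuming periodic observations of a domain's record set as (timestamp in seconds, set of addresses) pairs. The function name and the sample observations are hypothetical, and the sampling interval limits how short a measured gap can be.

```python
def change_intervals(observations):
    """Seconds between consecutive changes of a domain's record set.

    A change is any observation whose record set differs from the one
    seen immediately before it.
    """
    change_times = []
    previous = None
    for timestamp, records in sorted(observations, key=lambda obs: obs[0]):
        current = frozenset(records)
        if previous is not None and current != previous:
            change_times.append(timestamp)
        previous = current
    return [later - earlier
            for earlier, later in zip(change_times, change_times[1:])]

# Hypothetical observations taken every 300 seconds.
observations = [
    (0, {"192.0.2.1"}),
    (300, {"192.0.2.2"}),     # change
    (600, {"192.0.2.2"}),
    (900, {"198.51.100.7"}),  # change
    (1200, {"192.0.2.1"}),    # change
]
intervals = change_intervals(observations)
```

Short intervals across many observations would be consistent with the fast-flux behavior described on the slide.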

46 Summary: Network-Based Behavioral Filtering
Spam increasing, spammers becoming agile
  – Content filters are falling behind
  – IP-based blacklists are evadable
    – Up to 30% of spam not listed in common blacklists at receipt; ~20% remains unlisted after a month
Complementary approach: behavioral blacklisting based on network-level features
  – Blacklist based on how messages are sent
  – SpamTracker: Spectral clustering
    – Catches significant amounts of spam faster than existing blacklists
  – SNARE: Automated sender reputation
    – ~90% accuracy of existing with lightweight features
  – SpamSpotter: Putting it together in an RBL system

49 Classifying IP Addresses
Given a new IP address, build a feature vector based on its sending pattern across domains
Compute the similarity of this sending pattern to that of each known spam cluster
  – Normalized dot product of the two feature vectors
  – Spam score is maximum similarity to any cluster
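
The scoring rule above can be sketched as follows, reading "normalized dot product" as cosine similarity. The function names and sample vectors are hypothetical, and this sketch yields scores between 0 and 1 for non-negative vectors; the scale used in the actual system may differ.

```python
import math

def normalized_dot_product(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def spam_score(fingerprint, cluster_patterns):
    """Maximum similarity between a sender's fingerprint and any spam cluster."""
    return max(
        (normalized_dot_product(fingerprint, pattern) for pattern in cluster_patterns),
        default=0.0,
    )

# Hypothetical per-domain sending patterns for two known spam clusters.
clusters = [[1, 1, 0], [0, 0, 1]]
score_match = spam_score([2, 2, 0], clusters)     # same direction as cluster 1
score_partial = spam_score([1, 0, 0], clusters)   # overlaps cluster 1 only partly
score_empty = spam_score([0, 0, 0], clusters)     # no mail observed
```

Because the dot product is normalized, a sender that mails the same domains in the same proportions as a cluster scores highly regardless of its total volume.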

51 Additional History: Message Size Variance
Senders of legitimate mail have a much higher variance in sizes of messages they send
[Figure: Message Size Range for Certain Spam, Likely Spam, Likely Ham, Certain Ham]
Surprising: Including this feature (and others with more history) can actually decrease the accuracy of the classifier
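
For illustration, the feature could be computed per sender as the variance of observed message sizes. Whether the original work uses population variance, sample variance, or a size range is not stated on the slide, so the choice of population variance here is an assumption, as are the names and sample sizes.

```python
from statistics import pvariance

def message_size_variance(sizes_in_bytes):
    """Population variance of the message sizes seen from one sender.

    Returns None when no messages have been observed.
    """
    if not sizes_in_bytes:
        return None
    return pvariance(sizes_in_bytes)

# Hypothetical message sizes in bytes.
template_sender = [2000, 2000, 2000, 2000]   # identical messages
varied_sender = [1000, 3000, 1000, 3000]     # mixed message sizes

low_variance = message_size_variance(template_sender)
high_variance = message_size_variance(varied_sender)
```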

52 Completeness of IP Blacklists
~80% listed on average
~95% of bots listed in one or more blacklists
Only about half of the IPs spamming from short-lived BGP routes are listed in any blacklist
Spam from IP-agile senders tends to be listed in fewer blacklists
[Figure axes: Number of DNSBLs listing this spammer; Fraction of all spam received]

53 Low Volume to Each Domain
[Figure axes: Lifetime (seconds); Amount of Spam]
Most spammers send very little spam, regardless of how long they have been spamming.

55 Characteristics of Agile Senders
IP addresses are widely distributed across the /8 space
IP addresses typically appear only once at our sinkhole
Depending on which /8, 60-80% of these IP addresses were not reachable by traceroute when we spot-checked
Some IP addresses were in allocated, albeit unannounced space
Some AS paths associated with the routes contained reserved AS numbers

56 Early Detection Results
Compare SpamTracker scores on accepted mail to the Spamhaus database
  – About 15% of accepted mail was later determined to be spam
  – Can SpamTracker catch this?
Of 620 emails that were accepted, but sent from IPs that were blacklisted within one month
  – 65 emails had a score larger than 5 (85th percentile)