PageRank-like algorithm creates predictive malware blacklist

Researchers find that malware attacks show clustering that mimics the …

It's easy to create a blacklist of sites that have initiated malware attacks on a server and use it to configure a firewall against further problems. But such blacklists are purely retrospective, since sources only appear on them after attacks have occurred. The DShield project is an attempt to improve on this. System administrators can upload their firewall logs, which are then processed to identify sources of malware, allowing those sources to be blacklisted on servers they haven't attacked yet. Some computer scientists have now used the information in DShield to predict future attacks on specific servers, exploiting the fact that malware attacks display network effects.

The motivation for the work is that some malware sources will be more relevant to a given server than others; the trick is identifying them based on firewall logs. The authors attempt to do this via a two-pronged approach. The first prong is simply an evaluation of a source's maliciousness, which produces a score based on the potential for havoc that a given attack might create. The second prong is where network effects are evaluated in order to improve the predictive value of the blacklist.
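The two prongs could be combined in something like the following sketch, where a source's overall threat is the product of a global maliciousness score and a per-contributor relevance score. The function and variable names here are illustrative, not the paper's actual terminology.

```python
# Hypothetical sketch of the two-pronged scoring: each attack source gets
# a global maliciousness score and a per-contributor relevance score, and
# the blacklist is ranked by the combination of the two.

def blacklist_score(maliciousness, relevance):
    """Combine global maliciousness with relevance to this contributor."""
    return maliciousness * relevance

# (maliciousness, relevance) for each candidate source, both in [0, 1]
sources = {
    "198.51.100.7": (0.9, 0.2),  # very malicious, but rarely hits peers like us
    "203.0.113.5": (0.6, 0.8),   # less malicious, but targets our cluster
}

ranked = sorted(sources,
                key=lambda ip: blacklist_score(*sources[ip]),
                reverse=True)
```

Under this scheme the moderately malicious source that targets similar victims outranks the highly malicious one that doesn't, which is the point of the second prong.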

The authors noted that patterns of malware attacks often show network effects. Individual pairs of DShield log contributors often see similar patterns of attacks, meaning that if a malware source attacks one member of a pair, it's likely to go after the other. On a larger scale, these pairs form clusters where, once an attacker goes after several members of a cluster, it's likely to eventually attack the rest. Individual contributors may belong to many clusters, but those clusters appear to be stable over time.
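The pairwise similarity behind these clusters can be measured by comparing the sets of attack sources two contributors have each observed; a simple Jaccard overlap, sketched below with made-up victim names and addresses, is one plausible way to do it (the paper's exact similarity measure may differ).

```python
# Illustrative measure of how similarly two log contributors are attacked:
# the Jaccard overlap of the attacker sets each one observed.

def attack_overlap(a, b):
    """Fraction of attackers shared between two contributors' logs."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

logs = {
    "victim_A": {"1.2.3.4", "5.6.7.8", "9.9.9.9"},
    "victim_B": {"1.2.3.4", "5.6.7.8"},      # mostly the same attackers as A
    "victim_C": {"203.0.113.99"},            # entirely different attackers
}

sim_ab = attack_overlap(logs["victim_A"], logs["victim_B"])  # 2/3
sim_ac = attack_overlap(logs["victim_A"], logs["victim_C"])  # 0.0
```

High-overlap pairs like A and B are the ones that aggregate into the stable clusters described above.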

This sort of behavior is similar to the link structure exploited by search algorithms, such as Google's PageRank system, meaning that there is substantial experience with identifying such patterns. In the abstract, the analysis involves determining what fraction of a given cluster a malware source has already attacked, and using that to predict the probability that the remaining members of the cluster will see an attack. Because this analysis is based on a given site's membership in various clusters, the resulting predictions are specific to each individual DShield contributor.
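A minimal sketch of how a PageRank-style iteration could propagate risk from contributors an attacker has already hit to similar contributors is shown below. The similarity matrix, damping factor, and iteration count are all assumptions for illustration, not the paper's actual algorithm.

```python
# Hypothetical PageRank-style relevance propagation. W[i][j] is the
# attack-pattern similarity between log contributors i and j (rows sum
# to 1); hit[i] is 1.0 if the attacker already appears in i's logs.

W = [
    [0.0, 0.7, 0.3],  # contributor 0 is strongly linked to contributor 1
    [0.7, 0.0, 0.3],
    [0.3, 0.3, 0.4],
]
hit = [1.0, 0.0, 0.0]  # attacker seen only by contributor 0 so far
damping = 0.85

relevance = hit[:]
for _ in range(50):  # iterate toward a fixed point, as PageRank does
    relevance = [
        damping * sum(W[i][j] * relevance[i] for i in range(3))
        + (1 - damping) * hit[j]
        for j in range(3)
    ]
# Contributor 1, tightly linked to the victim, ends up scored as more
# at risk than the loosely linked contributor 2.
```

The per-contributor `relevance` values are what make the resulting blacklist specific to each DShield contributor rather than global.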

To implement this sort of screening, the authors evaluated the DShield contributors' logs and selected 700 contributors that consistently submitted logs large enough to support the analysis. As a first step, their software scanned the logs and eliminated traffic that came from things like search bots, along with false alarms caused by timed-out connections. The filtered logs were evaluated separately for maliciousness and to predict future targets based on network effects. The scores were then combined to produce a blacklist ranked by overall threat level.
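The pre-filtering step might look something like the sketch below; the record fields, crawler list, and flag values are all invented for illustration, since the paper's actual log format isn't described here.

```python
# Illustrative log pre-filtering: drop known crawler traffic and
# timed-out connections that would otherwise look like attacks.
# Field names and addresses are made up for the sketch.

KNOWN_CRAWLERS = {"192.0.2.1"}  # e.g. addresses of search-engine bots

def filter_log(entries):
    """Keep only entries that plausibly represent real attack traffic."""
    return [
        e for e in entries
        if e["src"] not in KNOWN_CRAWLERS and e["flags"] != "timeout"
    ]

raw = [
    {"src": "192.0.2.1", "flags": "syn"},       # crawler, drop
    {"src": "203.0.113.5", "flags": "timeout"}, # stale connection, drop
    {"src": "198.51.100.7", "flags": "syn"},    # genuine probe, keep
]
clean = filter_log(raw)
```

Only the cleaned entries would then feed the maliciousness and relevance scoring described above.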

The authors took several weeks' worth of DShield logs and subjected them to analysis to produce a predictive blacklist for each contributor. Averaged across contributors, their predictive algorithm beat the unprocessed blacklists generated by DShield in every variation they tried. There were exceptions, however: for any given data set, about seven percent of contributors actually fared worse under the predictive blacklist. The authors suggest that these contributors can be identified in advance by the fact that they cluster poorly with the others.

The authors also explored how changing a variety of parameters affected the quality of the blacklist. Initial tests were performed using a 1,000-member blacklist. Dropping that to 500 severely decreased the value of the blacklist, while expanding it to 5,000 provided a significant improvement, although gains tailed off beyond that point. Using two days of logs to train the system was less effective than a five-day window, but going much beyond a week of logs provided little benefit. The resulting blacklist's effectiveness faded gradually: it started off correctly predicting about 45 percent of future attack sources and, 10 days out, was still accurately predicting roughly 35 percent of the attacks.

The resulting system isn't psychic, as it still relies on a number of servers getting attacked in order to predict future victims, but it consistently outperforms standard blacklist techniques. Best of all, it's already working in the real world: for the last year, administrators have been able to obtain customized blacklists at DShield's Highly Predictive Blacklist site.

Further reading:

A copy of the paper (PDF), which the authors will present at the Usenix Security meeting, is available.