Security Inference from Noisy Data

Li Zhuang

My thesis is that contemporary information systems allow automatic extraction of security-related information from large amounts of noisy data. Extracting this information is the security inference problem: attackers or defenders extract information from noisy data that helps to compromise an adversary's security goals. I believe security inference is an important problem. Security inference often reveals a large amount of sensitive information that may be useful either to attackers or to system administrators. Attackers can use security inference to extract private information; system administrators can use security inference to determine the nature of attackers. Security inference is often a challenging problem because of the size and noisy nature of many real-world datasets. Our solution is to apply statistical analysis to this problem. We present two case studies that extract meaningful security knowledge from noisy data using statistical analysis. One goal is to explore selection of proper statistical analysis tools for security inference. The two case studies use a diverse set of statistical methods, which we believe to be applicable to other settings. We also propose a general framework for modeling security inference problems, which identifies key steps in the security inference process.

In the first case study, we examine the problem of keyboard acoustic emanations. Attackers use security inference to analyze sound signals from typing on computer keyboards. We present a novel attack that takes as input a 10-minute sound recording of a user typing English text on a keyboard and recovers up to 96% of the characters typed. There is no need for a labeled training recording. Moreover, the recognizer bootstrapped this way can even recognize random text such as passwords: in our experiments, with 20 or fewer attempts to guess a random letter-only password, an attacker can guess 90% of 5-character passwords and 70% of 10-character passwords. This case study demonstrates that applying statistical analysis to security problems provides new tools for drawing powerful conclusions.

In the second case study, system administrators (or defenders) use security inference to determine the nature of attackers. We develop new techniques to map botnet membership and other characteristics of botnets using spam traces. The data consist of side channel traces from attackers: spam email messages received by Hotmail, one of the largest Web mail services. The basic assumption is that spam email messages with similar content often originate from the same controlling entity. These email messages share a common economic interest, so it is likely that a single entity also controls the machines sending these spam email messages. By grouping spam email messages with similar content and determining the senders of these email messages, one can infer the composition of the botnet. This approach can analyze botnets regardless of their internal organization and means of communication. This work also reports new statistics about botnets.

In this thesis, we leverage recent developments in the areas of applied data mining, statistical learning, and distributed data analysis. The approaches we discuss are easily deployable to real systems.
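As a rough illustration of the grouping idea in the second case study (cluster spam messages with similar content, then collect the senders of each cluster), the following minimal sketch may help. It is not the method used in the thesis: the word-shingle Jaccard similarity, the greedy single-pass grouping, the 0.5 threshold, and the names `shingles`, `jaccard`, and `group_senders` are all assumptions chosen for brevity, and the sample messages and addresses are invented.

```python
def shingles(text, k=3):
    """Return the set of k-word shingles of a message body (hypothetical helper)."""
    words = text.lower().split()
    if len(words) < k:
        return {" ".join(words)}
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def group_senders(messages, threshold=0.5):
    """Greedily group (sender, body) pairs by content similarity.

    Each message joins the first group whose representative (the first
    message placed in that group) is at least `threshold` similar;
    otherwise it starts a new group. Returns one set of senders per group.
    The threshold is an illustrative assumption, not a value from the thesis.
    """
    groups = []  # list of (representative shingle set, set of senders)
    for sender, body in messages:
        sh = shingles(body)
        for rep, senders in groups:
            if jaccard(sh, rep) >= threshold:
                senders.add(sender)
                break
        else:
            groups.append((sh, {sender}))
    return [senders for _, senders in groups]


if __name__ == "__main__":
    # Invented sample data; addresses come from the documentation range 192.0.2.0/24.
    sample = [
        ("192.0.2.1", "Buy cheap watches now at our online store today"),
        ("192.0.2.2", "Buy cheap watches now at our online store tonight"),
        ("192.0.2.3", "Your account statement is ready for review please log in"),
    ]
    for i, senders in enumerate(group_senders(sample)):
        print(i, sorted(senders))
```

Under the abstract's assumption, each resulting sender set is a candidate view of the machines controlled by one entity. A real trace would likely need a more robust similarity measure and a way to handle senders whose addresses change over time.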

Advisor: Doug Tygar

BibTeX citation:

@phdthesis{Zhuang:EECS-2008-32,
Author = {Zhuang, Li},
Title = {Security Inference from Noisy Data},
School = {EECS Department, University of California, Berkeley},
Year = {2008},
Month = {Apr},
URL = {http://www.eecs.berkeley.edu/Pubs/TechRpts/2008/EECS-2008-32.html},
Number = {UCB/EECS-2008-32},
Abstract = {My thesis is that contemporary information systems allow automatic extraction of security-related information from large amounts of noisy data. Extracting this information is the security inference problem: attackers or defenders extract information from noisy data that helps to compromise an adversary's security goals. I believe security inference is an important problem. Security inference often reveals a large amount of sensitive information that may be useful either to attackers or to system administrators. Attackers can use security inference to extract private information; system administrators can use security inference to determine the nature of attackers. Security inference is often a challenging problem because of the size and noisy nature of many real-world datasets. Our solution is to apply statistical analysis to this problem. We present two case studies that extract meaningful security knowledge from noisy data using statistical analysis. One goal is to explore selection of proper statistical analysis tools for security inference. The two case studies use a diverse set of statistical methods, which we believe to be applicable to other settings. We also propose a general framework for modeling security inference problems, which identifies key steps in the security inference process.
In the first case study, we examine the problem of keyboard acoustic emanations. Attackers use security inference to analyze sound signals from typing on computer keyboards. We present a novel attack that takes as input a 10-minute sound recording of a user typing English text on a keyboard and recovers up to 96% of the characters typed. There is no need for a labeled training recording. Moreover, the recognizer bootstrapped this way can even recognize random text such as passwords: in our experiments, with 20 or fewer attempts to guess a random letter-only password, an attacker can guess 90% of 5-character passwords and 70% of 10-character passwords. This case study demonstrates that applying statistical analysis to security problems provides new tools for drawing powerful conclusions.
In the second case study, system administrators (or defenders) use security inference to determine the nature of attackers. We develop new techniques to map botnet membership and other characteristics of botnets using spam traces. The data consist of side channel traces from attackers: spam email messages received by Hotmail, one of the largest Web mail services. The basic assumption is that spam email messages with similar content often originate from the same controlling entity. These email messages share a common economic interest, so it is likely that a single entity also controls the machines sending these spam email messages. By grouping spam email messages with similar content and determining the senders of these email messages, one can infer the composition of the botnet. This approach can analyze botnets regardless of their internal organization and means of communication. This work also reports new statistics about botnets.
In this thesis, we leverage recent developments in the areas of applied data mining, statistical learning, and distributed data analysis. The approaches we discuss are easily deployable to real systems.}
}