(c) ACM, 2014. This is the author's version of the work. It is posted here by permission of ACM for your personal use. Not for redistribution. The definitive version was published in 2014, 2014-02-26.

{Cellular network operators are expected to maintain a good quality-of-experience (QoE) for many services beyond circuit-switched voice and messaging. However, new smart-phone ``app'' services, such as over-the-top video delivery and real-time video calling, are not under an operator's control. Furthermore, operators cannot practically deploy active probes to monitor app QoE in all the places their subscribers may be, and complex interactions between network protocol layers make it challenging for operators to understand how network-level parameters and performance influence an app's QoE. This paper takes a first step toward addressing these challenges by presenting a novel approach to estimating app QoE using passive network measurements. Our approach uses machine learning to obtain a function that relates passive measurements to an app's QoE. In contrast to previous approaches, ours does not require any control over app services or domain knowledge about how an app's network traffic relates to QoE. We implemented our approach in Prometheus, a prototype system at a large U.S. cellular operator. We show that Prometheus can track the QoE of real video-on-demand and VoIP apps with over 80% accuracy, which is at least as accurate as approaches suggested by domain experts.}

(c) ACM, 2014. This is the author's version of the work. It is posted here by permission of ACM for your personal use. Not for redistribution. The definitive version was published in 2014, 2014-09-07.

{Recent studies have shown that web browsing is one of the most prominent cellular applications. It is therefore important for cellular network operators to understand how radio network characteristics (such as signal strength, handovers, load, etc.) influence users' web browsing Quality-of-Experience (web QoE). Understanding the relationship between web QoE and network characteristics is a pre-requisite for cellular network operators to detect when and where degraded network conditions actually impact web QoE. Unfortunately, cellular network operators do not have access to detailed server-side or client-side logs to directly measure web QoE metrics, such as abandonment rate and session length. In this paper, we first devise a machine-learning-based mechanism to infer web QoE metrics from network traces accurately. We then present a large-scale study characterizing the impact of network characteristics on web QoE using a month-long anonymized dataset collected from a major cellular network provider. Our results show that improving signal-to-noise ratio, decreasing load and reducing handovers can improve user experience. We find that web QoE is very sensitive to inter-radio-access-technology (IRAT) handovers. We further find that radio data link rate is not correlated with web QoE. Since many network characteristics are interrelated, we also use machine learning to accurately model the influence of radio network characteristics on user experience metrics. This model can be used by cellular network operators to prioritize the improvement of network factors that most influence web QoE.}

The definitive version was published in 2013, 2013-02-24, http://www.cs.unc.edu/~amw/ndss2013/ISOC_copyright_forms.txt

{Internet-based services routinely contend with a range of
malicious activity (e.g., spam, scans, botnets) that can potentially
arise from virtually any part of the global Internet
infrastructure and that can shift longitudinally over time. In
this paper, we develop the first algorithmic techniques to automatically
infer regions of the Internet with shifting security
characteristics in an online fashion. Conceptually, our
key idea is to model the malicious activity on the Internet as
a decision tree over the IP address space, and identify the
dynamics of the malicious activity by inferring the dynamics
of the decision tree. Our evaluations on large corpuses of
mail data and botnet data indicate that our algorithms are
fast, can keep up with Internet-scale traffic data, and can
extract changes in sources of malicious activity substantially
better (a factor of 2.5) than approaches based on using
predetermined levels of aggregation such as BGP-based
network-aware clusters. Our case studies demonstrate our
algorithm’s ability to summarize large shifts in malicious
activity to a small number of IP regions (by as much as two
orders of magnitude), and thus help focus limited operator
resources. Using our algorithms, we find that some regions
of the Internet are prone to much faster changes than others,
such as a set of small and medium-sized hosting providers
that are of particular interest to mail operators.}
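The address-space decision tree at the heart of this approach can be sketched in a few lines. The toy below is an assumption-laden illustration, not the paper's online algorithm: it scores the malicious-event ratio per /16 prefix and refines into /24s only where the children disagree, which is the basic split criterion behind decision trees over the IP address space. The function names, prefix lengths, and the 0.5 divergence threshold are all invented for illustration.

```python
from ipaddress import ip_address

def prefix_key(ip: str, bits: int) -> str:
    """Integer prefix of an IPv4 address truncated to `bits` bits."""
    return f"{int(ip_address(ip)) >> (32 - bits)}/{bits}"

def score_prefixes(samples, bits):
    """Fraction of malicious events per prefix of the given length."""
    stats = {}
    for ip, malicious in samples:
        key = prefix_key(ip, bits)
        total, bad = stats.get(key, (0, 0))
        stats[key] = (total + 1, bad + int(malicious))
    return {k: bad / total for k, (total, bad) in stats.items()}

def refine(samples, coarse_bits=16, fine_bits=24, gap=0.5):
    """Split a coarse prefix into finer ones when its children's
    malicious ratios diverge by more than `gap`: the core split
    criterion of a decision tree grown over the address space."""
    fine = score_prefixes(samples, fine_bits)
    coarse = score_prefixes(samples, coarse_bits)
    refined = {}
    for key, ratio in coarse.items():
        base = int(key.split("/")[0])
        children = {k: v for k, v in fine.items()
                    if int(k.split("/")[0]) >> (fine_bits - coarse_bits) == base}
        if children and max(children.values()) - min(children.values()) > gap:
            refined.update(children)      # children disagree: keep fine prefixes
        else:
            refined[key] = ratio          # homogeneous: keep the coarse prefix
    return refined

samples = [("1.2.3.4", True), ("1.2.3.5", True),
           ("1.2.9.9", False), ("9.9.9.9", False)]
tree = refine(samples)  # 1.2.0.0/16 splits; 9.9.0.0/16 stays coarse
```

An online variant would update these counts incrementally and re-evaluate splits as the target function drifts, rather than recomputing from a batch as this sketch does.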

{Effective management of large-scale cellular data
networks is critical to meet customer demands and expectations.
Customer calls for technical support provide a direct indication of
the issues and problems customers encounter. In this paper we
study the customer tickets – free-text recordings and classifications
by customer support agents – collected at a large cellular network
provider, with two inter-related goals: i) to characterize and
understand the major factors that lead customers to call
and seek support; and ii) to utilize such customer tickets to
help identify potential network problems. For this purpose, we
develop a novel statistical approach to model customer call rates
that accounts for customer-side factors (e.g., user tenure and
handset types) as well as geo-locations. We show that most calls
are due to customer-side factors and can be well captured by the
model. Furthermore, we also demonstrate that location-specific
deviations from the model provide a good indicator of potential
network-side issues. The latter is corroborated with the detailed
analysis of customer tickets and other independent data sources
(non-ticket customer feedback and network performance data).}

(c) ACM, 2011. This is the author's version of the work. It is posted here by permission of ACM for your personal use. Not for redistribution. The definitive version was published in ACM SIGCOMM Workshop on Measurements Up the STack (W-MUST), 2011-08-19.

{Multidimensional distributions are often used in data mining to describe and summarize different features of large datasets. It is natural to look for distinct classes in such datasets by clustering the data. A common approach entails the use of methods like k-means clustering. However, the k-means method inherently relies on the Euclidean metric in the embedded space and does not account for additional topology underlying the distribution.
In this paper, we propose using the Earth Mover's Distance (EMD) to compare multidimensional distributions. For an n-bin histogram, the EMD is based on a solution to the transportation problem with time complexity O(n³ log n). To mitigate the high computational cost of EMD, we propose an approximation that reduces the cost to linear time.
Other notions of distance, such as the information-theoretic Kullback-Leibler divergence and the statistical χ² distance, account only for the correspondence between bins with the same index, do not use information across bins, and are sensitive to bin size. A cross-bin distance measure like EMD is not affected by binning differences and meaningfully matches the perceptual notion of “nearness”.
Our technique is simple, efficient and practical for clustering distributions. We demonstrate the use of EMD on a practical application of clustering over 400,000 anonymous mobility usage patterns, which are defined as distributions over a manifold. EMD allows us to represent inherent relationships in this space. We show that EMD allows us to successfully cluster even sparse signatures, and we compare the results with other clustering methods. Given the large size of our dataset, a fast approximation is crucial for this application.}
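As a concrete special case: for one-dimensional histograms with unit ground distance between adjacent bins, the transportation problem underlying EMD collapses to a linear-time closed form, the L1 distance between the two cumulative distributions. The sketch below (with invented function names) illustrates only this special case; the paper's linear-time approximation for distributions over a manifold is a different construction.

```python
def emd_1d(p, q):
    """Earth Mover's Distance between two 1-D histograms with unit
    ground distance between adjacent bins. In this special case the
    transportation problem has a closed form: the L1 distance between
    the cumulative distributions, computable in linear time."""
    assert len(p) == len(q)
    total_p, total_q = sum(p), sum(q)
    # Normalize so both histograms carry the same total mass.
    p = [x / total_p for x in p]
    q = [x / total_q for x in q]
    cdf_gap, dist = 0.0, 0.0
    for pi, qi in zip(p, q):
        cdf_gap += pi - qi  # running difference of the two CDFs
        dist += abs(cdf_gap)
    return dist

# Shifting all mass one bin over costs exactly 1 unit of work,
# whereas a same-index measure (e.g. KL divergence) sees no overlap at all.
assert emd_1d([1, 0, 0], [0, 1, 0]) == 1.0
```

Unlike the bin-by-bin measures discussed above, the cost here grows with how far mass must move, which is exactly the "nearness" property the abstract highlights.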

The definitive version was published in Proceedings of the USENIX Annual Technical Conference, 2011-06-15.

{Operators typically monitor the performance of network server farms
using rule-based scripts to automatically flag "events of interest" on
an array of active and passive performance measurement feeds.
However, such automatic detection is typically limited to events with
known properties. A different challenge involves detecting the
"unknown unknowns" -- the events of interest whose properties are
unknown, and therefore, cannot be defined beforehand. Visualization
can significantly aid the rapid discovery of such unknown patterns, as
network operators, with domain expertise, may quickly notice
unexpected shifts in traffic patterns when represented visually.
However, the volume of Internet-wide raw performance data can easily
overwhelm human comprehension, and therefore, an effective
visualization needs to be sparse in representation, yet discriminating
of good and poor performance.
This paper presents a tool that can be used to visualize performance metrics at Internet-scale. At its core, the tool builds decision trees over the IP address space using performance measurements, so that IP addresses with similar performance characteristics are clustered together, and those with significant performance differences are separated. These decision trees need to be dynamic -- i.e., learnt online, and adapt to changes in the underlying network. We build these adaptive decision trees by extending online decision-tree learning algorithms to the unique challenges of classifying performance measurements across the Internet, and
our tool then visualizes these adaptive decision trees, distinguishing parts of the
network with good performance from those with poor performance. We
show that the differences in the visualized decision trees help us
quickly discover new patterns of usage and novel anomalies in latency
measurements at a large server farm.
}

(c) ACM, 2011. This is the author's version of the work. It is posted here by permission of ACM for your personal use. Not for redistribution. The definitive version was published in Proc. of ACM Internet Measurement Conference (IMC), 2011-11-01.

{Smartphone users increasingly rely on apps as gateways to Internet services, rather than traditional web
browsers. Application marketplaces for iOS, Android, and
Windows Mobile platforms have made it attractive for developers
to deploy apps and easy for users to discover and
start using many network-enabled apps quickly. For example,
it was recently reported that the iOS AppStore has more
than 350K apps and more than 10 billion downloads. Furthermore,
the appearance of tablets and mobile devices with
other form factors, which also use these marketplaces, has
increased the diversity in apps and their user population.
Despite the increasing importance of apps as gateways to
network services, we have a much sparser understanding
of how, where, and when they are used compared to traditional
web services, particularly at scale. This paper takes
a first step in addressing this knowledge gap by presenting
results on smartphone app usage at a national level using
anonymized network measurements from a tier-1 cellular
carrier in the U.S. We identify traffic from distinct marketplace
apps based on HTTP signatures and present aggregate
results on their spatial and temporal prevalence, locality, and
correlation.}

(c) ACM, 2011. This is the author's version of the work. It is posted here by permission of ACM for your personal use. Not for redistribution. The definitive version was published in ACM Conference on Computer and Communications Security , 2011-10-04.

The sheer volume of new malware found each day is enormous.
Worse, current trends show the amount of malware is doubling
each year. This exponential growth is quickly outpacing existing
malware analysis techniques. As a result, defenders in practice are
constantly faced with hard choices about which malware to analyze
and how much analysis should be done given the limited computing
resources available.
In this paper we propose efficient techniques for identifying malware
families and variants in order to provide scalable malware
triage capabilities. By providing the capability to quickly compare
and cluster malware into families, we enable defenders to make
more intelligent choices on which subsequent (potentially expensive)
analysis to perform, and on which samples. In particular, we
do not claim 100% accuracy, but instead strive for a balance that
maintains high accuracy for common variants but provides significantly
better scalability than previous approaches.
At the core of our work is an algorithm called BitShred for fast
similarity detection within binary code. We have implemented BitShred,
and show that it is several times to several orders of magnitude
faster than previous approaches and can take advantage of distributed
resources such as Hadoop, all while offering similar accuracy
at identifying malware families to previous approaches. We
also note that our techniques are applicable in other settings, e.g.,
measuring similarity based upon dynamic traces, and automatic
code reuse detection (such as plagiarism) in binary code.
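A minimal sketch of the fingerprint-and-compare idea behind this kind of triage: hash every byte n-gram of a sample into a fixed-size bitvector, then measure Jaccard similarity between fingerprints with bitwise operations. The n-gram size, fingerprint width, and hash function below are illustrative assumptions, not BitShred's actual parameters.

```python
import hashlib

FP_BITS = 256  # illustrative fingerprint width; real systems use larger vectors

def fingerprint(data: bytes, ngram: int = 4) -> int:
    """Hash every byte n-gram of `data` into a fixed-size bitvector,
    stored as a Python int: the feature-hashing step of fast triage."""
    fp = 0
    for i in range(len(data) - ngram + 1):
        h = int.from_bytes(
            hashlib.blake2b(data[i:i + ngram], digest_size=4).digest(), "big")
        fp |= 1 << (h % FP_BITS)
    return fp

def jaccard(a: int, b: int) -> float:
    """Jaccard similarity of two fingerprints via bitwise AND/OR."""
    union = bin(a | b).count("1")
    return bin(a & b).count("1") / union if union else 1.0

# Toy byte strings standing in for binary code sections.
fp_a = fingerprint(b"push ebp; mov ebp, esp; sub esp, 0x40")
fp_b = fingerprint(b"push ebp; mov ebp, esp; sub esp, 0x80")  # near variant
fp_c = fingerprint(b"completely unrelated byte sequence here")
```

Because comparison reduces to AND, OR, and popcount, pairwise similarity over large corpora is cheap and parallelizes naturally, which is what makes the approach amenable to distributed platforms like Hadoop.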

(c) ACM, 2010. This is the author's version of the work. It is posted here by permission of ACM for your personal use. Not for redistribution.

The definitive version was published in ACM Internet Measurement Conference , 2010-11-01.

{How fast is the network? The speed at which real users can download content at different locations and at different times is an important metric for service providers. Knowledge of this speed helps determine where to provision more capacity and helps detect network problems. However, most network-level estimates of these speeds today are obtained using active “speed tests” that place substantial load on the network and are not necessarily representative of actual user experiences due to limited vantage points. These problems are exacerbated in wireless networks where the physical locations of users play an important role in performance. To redress these problems, this paper presents a new technique to estimate achievable download speed using only flow records collected passively. Estimating achievable speed passively is non-trivial because the measured throughput of real flows is often not comparable to the achievable steady-state TCP rate. This can be because, for example, flows are small and never exit TCP slow start or are rate-limited by the content-provider. Our technique addresses these issues by constructing a Throughput Index, a list of flow types that accurately estimate achievable speed. We show that our technique estimates achievable throughput more accurately than other techniques in a large 3G wireless network.}
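The essence of the technique can be sketched as a filter-then-aggregate computation over flow records. The whitelist of flow types and the thresholds below are invented for illustration; the paper constructs its Throughput Index empirically from observed flow behavior.

```python
# Hypothetical flow records: (app, provider, bytes, duration_seconds).
# The "throughput index" here is an illustrative whitelist of flow
# types assumed not to be rate-limited or slow-start dominated.
THROUGHPUT_INDEX = {("http-download", "cdn-a"), ("file-sync", "cloud-b")}
MIN_BYTES = 1_000_000  # skip small flows that never exit TCP slow start

def achievable_speed(flows):
    """Median per-flow rate (bits/s) over index-eligible flows only."""
    rates = sorted(
        8 * nbytes / dur
        for app, provider, nbytes, dur in flows
        if (app, provider) in THROUGHPUT_INDEX and nbytes >= MIN_BYTES and dur > 0
    )
    if not rates:
        return None
    mid = len(rates) // 2
    return rates[mid] if len(rates) % 2 else (rates[mid - 1] + rates[mid]) / 2

flows = [
    ("http-download", "cdn-a", 5_000_000, 4.0),    # 10 Mbit/s
    ("file-sync",     "cloud-b", 2_000_000, 2.0),  # 8 Mbit/s
    ("video",         "ott-c", 9_000_000, 30.0),   # rate-limited: excluded
    ("http-download", "cdn-a", 40_000, 0.1),       # too small: excluded
]
```

Filtering out the rate-limited video flow and the tiny slow-start flow is what keeps the passive estimate comparable to an achievable steady-state TCP rate rather than an average over all traffic.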

{We formulate and address the problem of discovering dynamic malicious regions on the Internet. We model this problem as one of adaptively pruning a known decision tree, but with additional challenges: (1) severe space requirements, since the underlying decision tree has over 4 billion leaves, and (2) a changing target function, since malicious activity on the Internet is dynamic. We present a novel algorithm that addresses this problem by putting together a number of different "experts" algorithms and online paging algorithms. We prove guarantees on our algorithm's performance as a function of the best possible pruning of a similar size, and our experiments show that our algorithm achieves high accuracy on large real-world data sets, with significant improvements over existing approaches.}

System For Internet Scale Visualization And Detection Of Performance Events,
March 11, 2014

A system for visualization of performance measurements is disclosed. The system may include an electronic data processor configured to receive a stream of the performance measurements and select a maximum number of leaf nodes of a plurality of nodes for use in an adaptive decision tree. Additionally, the electronic processor may be configured to determine a depth of each branch in the adaptive decision tree needed to differentiate performance among internet protocol addresses in an internet protocol prefix of each node. Each of the plurality of nodes may be annotated with a predicted latency category and the processor may be configured to generate the adaptive decision tree based on the maximum number of leaf nodes selected, the depth of each branch determined, the predicted latency category, and on the stream of performance measurements associated with the network. Moreover, the processor may display the adaptive decision tree.

System For Estimating Download Speed From Passive Measurements,
June 11, 2013

A system for passive estimation of throughput in an electronic network is disclosed. The system may include a plurality of mobile devices configured to operate in the network and may further include an electronic data processor. The processor may be configured to access flow records for data flows associated with the mobile devices during a predetermined time interval. Additionally, the processor may be configured to annotate the flow records with an application field and a content provider field. The processor may also be configured to determine a flow type of each data flow based on the application field and the content provider field of the flow records. Furthermore, the processor may be configured to generate a throughput index that only includes non-rate-limited flow types. Moreover, the processor may be configured to estimate maximum throughput for each data flow having non-rate-limited flow types in the throughput index.