Classifiers Unclassified

Many network providers apply different policies to different kinds of network traffic. For example, T-Mobile's Binge On
program zero-rates (i.e., does not charge against the monthly data quota) network traffic identified as video streaming
and also throttles this traffic to a maximum of 1.5 Mbps. However, a network provider generally does not know which
app you are using; it sees only the app's network traffic. As a result, it has to make educated guesses based on the
traffic that the app generates.
To address this challenge, network providers usually deploy one or more devices (typically called middleboxes) that perform this mapping
between network traffic and applications. Specifically, such middleboxes include a classification rule that maps network traffic into
a specific category, and an action that specifies what should be done with this category of traffic. Little is known about these
classification rules, since middleboxes use proprietary, closed-source hardware and software.
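
To make the rule/action split concrete, here is a minimal sketch in Python. Everything in it is an illustrative assumption rather than any vendor's actual configuration: the `Rule` and `Policy` types, the keyword `video.example.com`, and the first-match ordering are all hypothetical. Only the zero-rating and 1.5 Mbps throttle mirror the Binge On example above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Policy:
    """Action applied to a traffic category."""
    zero_rated: bool
    rate_limit_mbps: Optional[float]  # None means no throttling


@dataclass(frozen=True)
class Rule:
    """Classification rule: payloads containing `keyword` map to `category`."""
    keyword: bytes  # hypothetical text the middlebox looks for
    category: str


# Hypothetical rule set; earlier rules take priority over later ones.
RULES = [Rule(keyword=b"video.example.com", category="video")]

# Hypothetical actions, modeled on the Binge On example in the text.
ACTIONS = {"video": Policy(zero_rated=True, rate_limit_mbps=1.5)}
DEFAULT_POLICY = Policy(zero_rated=False, rate_limit_mbps=None)


def classify(payload: bytes) -> Optional[str]:
    """Return the category of the first rule whose keyword appears in the payload."""
    for rule in RULES:
        if rule.keyword in payload:
            return rule.category
    return None


def policy_for(payload: bytes) -> Policy:
    """Look up the action for a payload; unclassified traffic gets the default."""
    return ACTIONS.get(classify(payload), DEFAULT_POLICY)
```

In this sketch, a request carrying `Host: video.example.com` would be zero-rated and throttled, while any other request falls through to the default policy.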

In this work, we develop a general approach for identifying classification rules (i.e., the network provider's "educated guesses")
that map network traffic to applications. Specifically, we use an efficient binary search and carefully generated network flows to minimize
the number of tests needed to reverse engineer the rules. We also characterize the classification rules for HTTP(S) traffic implemented
in today's carrier-grade middleboxes and identify examples of misclassification (traffic from application A mistakenly labeled as application B).
In summary, our analysis shows that different vendors use different matching rules, but all generally focus on a small number of fields inside HTTP(S) traffic.
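
The binary-search idea can be sketched as follows. This is a simplified illustration, not necessarily the exact procedure used in this work. It assumes the rule matches a single contiguous keyword, that each test replays a modified payload and observes whether it is still classified, and that overwriting bytes with a filler character never creates a match by accident. The names `mask`, `find_keyword`, and `demo_oracle` are hypothetical, and the oracle is simulated locally instead of measuring a real network.

```python
from typing import Callable, Tuple

Oracle = Callable[[bytes], bool]  # True if the replayed flow is still classified


def mask(payload: bytes, start: int, end: int, filler: int = ord("X")) -> bytes:
    """Keep payload[start:end] intact and overwrite every other byte with filler."""
    return bytes(b if start <= i < end else filler for i, b in enumerate(payload))


def find_keyword(payload: bytes, classified: Oracle) -> Tuple[bytes, int]:
    """Locate one contiguous matched keyword; return it and the number of tests run."""
    tests = 0

    def probe(start: int, end: int) -> bool:
        nonlocal tests
        tests += 1
        return classified(mask(payload, start, end))

    if not probe(0, len(payload)):
        raise ValueError("the unmodified payload is not classified")

    # Smallest end such that keeping payload[0:end] still triggers the rule.
    lo, hi = 0, len(payload)
    while lo < hi:
        mid = (lo + hi) // 2
        if probe(0, mid):
            hi = mid
        else:
            lo = mid + 1
    end = lo

    # Largest start such that keeping payload[start:end] still triggers the rule.
    lo, hi = 0, end
    while lo < hi:
        mid = (lo + hi + 1) // 2
        if probe(mid, end):
            lo = mid
        else:
            hi = mid - 1
    return payload[lo:end], tests


def demo_oracle(payload: bytes) -> bool:
    """Stand-in for a real middlebox: matches a hypothetical hostname keyword."""
    return b"video.example.com" in payload


request = (b"GET /watch HTTP/1.1\r\nHost: video.example.com\r\n"
           b"User-Agent: demo\r\n\r\n")
keyword, tests = find_keyword(request, demo_oracle)
print(keyword, tests)
```

Each search halves the candidate range, so the number of tests grows logarithmically with the payload length instead of requiring one test per byte.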

Key Contributions

We develop a general methodology for identifying the matching rules used by a classifier.

We conduct a detailed study of the classification rules used by devices in a controlled setting and in the wild.

We find that the devices use simple text-based matching in HTTP and TLS handshakes.

We find that the devices exhibit simple matching rules with deterministic matching-rule priorities.

We publish the code below, which can be used to run this analysis on any network that performs DPI-based zero-rating or shaping.

This material is based upon work supported by the National Science Foundation under Grant No. (CNS-1617728) and by a Google Faculty Research Award. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation or Google.