Problem

Application-level protocol specifications are useful for many security applications, including intrusion prevention and detection systems that perform deep packet inspection and traffic normalization, and penetration testing tools that generate network inputs to an application to uncover potential vulnerabilities. However, current practice in deriving protocol specifications is mostly manual.

Approach

Discoverer is a tool that automatically reverse engineers the protocol message formats of an application from its network trace. Doing this for a wide range of protocols poses three main challenges: (1) The network trace offers very few hints; the only evident information is the directionality of the byte streams. (2) Protocols differ significantly from one another. (3) Protocol message formats are often context-sensitive: earlier fields dictate how the rest of the message is parsed.

To make our tool general, we base our design on inferring protocol idioms that are common to the message formats of many protocols. To cope with the scarcity of hints, we first dissect the formless byte streams into text and binary segments, or tokens, and use them as a starting point for clustering messages with similar patterns, where each cluster approximates a message format. By comparing the messages in a cluster and looking for the characteristics of known cross-field dependencies (such as a length field followed by a string of that length), we infer additional properties of the tokens. These properties in turn let us refine and divide the message clusters, so that each subcluster approximates a more precise format. This process continues recursively until the newly inferred properties no longer allow us to divide any message cluster. After this recursive clustering phase, we examine all message clusters globally using a type-based sequence alignment algorithm and merge similar clusters into one, which yields more concise message formats.
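As a rough illustration of the first steps described above (tokenization, initial clustering, and inference of a length-field dependency), the following Python sketch may help. It is not Discoverer's implementation. The function names, the minimum text length of 3 bytes, the use of printable ASCII as the test for text, the single-byte length field, and the direction labels "c2s" and "s2c" are all assumptions made for this example.

```python
from collections import defaultdict

MIN_TEXT_LEN = 3  # assumed threshold for this sketch; not taken from the source


def is_printable(byte):
    """True for printable ASCII, including space (an assumed test for text)."""
    return 0x20 <= byte <= 0x7E


def tokenize(message, min_text_len=MIN_TEXT_LEN):
    """Split a byte string into ('text', bytes) and ('binary', bytes) tokens.

    A run of printable bytes at least min_text_len long becomes one text
    token; every other byte becomes its own single-byte binary token.
    """
    tokens = []
    i = 0
    while i < len(message):
        j = i
        while j < len(message) and is_printable(message[j]):
            j += 1
        if j - i >= min_text_len:
            tokens.append(("text", message[i:j]))
            i = j
        else:
            # Run too short to count as text, or byte not printable:
            # emit the bytes one at a time as binary tokens.
            end = max(j, i + 1)
            for k in range(i, end):
                tokens.append(("binary", message[k:k + 1]))
            i = end
    return tokens


def token_pattern(tokens):
    """Reduce a token list to its sequence of token kinds."""
    return tuple(kind for kind, _ in tokens)


def cluster_by_pattern(messages):
    """Group (direction, payload) pairs by direction and token pattern."""
    clusters = defaultdict(list)
    for direction, payload in messages:
        key = (direction, token_pattern(tokenize(payload)))
        clusters[key].append(payload)
    return dict(clusters)


def length_field_positions(payloads):
    """Find token indices that behave like a one-byte length field.

    Index i qualifies if, in every message, token i is binary, token i + 1
    is text, and the value of token i equals the byte length of token i + 1.
    """
    token_lists = [tokenize(p) for p in payloads]
    if not token_lists:
        return []
    n = len(token_lists[0])
    positions = []
    for i in range(n - 1):
        if all(
            len(toks) == n
            and toks[i][0] == "binary"
            and toks[i + 1][0] == "text"
            and toks[i][1][0] == len(toks[i + 1][1])
            for toks in token_lists
        ):
            positions.append(i)
    return positions


if __name__ == "__main__":
    trace = [
        ("c2s", b"\x00\x05hello"),
        ("c2s", b"\x00\x03abc"),
        ("c2s", b"GET /index"),
        ("s2c", b"\x00\x05world"),
    ]
    for key, payloads in cluster_by_pattern(trace).items():
        print(key, len(payloads), length_field_positions(payloads))
```

In this toy trace, the two client messages that start with two binary bytes fall into the same cluster, and the second token of that cluster is flagged as a candidate length field. A property inferred this way is the kind of information the recursive phase could use to split a cluster further. The sketch leaves out the final merging step based on type-based sequence alignment.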