Problem

Automatic reverse engineering of protocol or file format specifications is important for many security applications. For instance, such specifications give security applications like firewalls or intrusion detection systems the context of a network communication or file parsing session, which is crucial for accurately detecting or preventing intrusions. Reverse engineering these specifications automatically removes the need for time-consuming and error-prone manual effort. A key shortcoming of previous tools is that the formats they recover miss information that is critical for security applications. First, many input formats include arbitrary sequences of data elements (records). For example, most media files consist of sequences of chunks of compressed media data. Second, input fields may have arbitrary dependencies that cannot be captured by predefined semantics. For example, there are many different ways to compute checksums. The ShieldGen system has shown that understanding the record sequences and data dependencies in an attack instance is important for constructing a high-quality vulnerability signature for a zero-day vulnerability.

Approach

Tupni is a tool that can reverse engineer an input format with a rich set of information, given one or more inputs of the unknown format and a program that can process these inputs. The main novelty in Tupni is the identification and analysis of arbitrary record sequences. Unlike previous tools that either ignore record sequences or only work for some special cases, Tupni can identify arbitrary record sequences by analyzing loops in a program, using the fact that a program usually processes an unbounded record sequence in a loop. Tupni can also cluster records into a small set of types based on the set of instructions that process each record. In addition, Tupni can infer constraints that describe dependencies across fields or messages (e.g., checksums or sequence numbers), without relying on a predefined list of dependency kinds, by tracking symbolic predicates from dynamic data flow analysis. Furthermore, dynamic analysis has a fundamental limitation: its view is restricted to the execution path associated with a particular input. To mitigate this, Tupni can derive a more complete format specification by aggregating the format information inferred from multiple inputs.
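
The record clustering idea can be illustrated with a minimal sketch. This is not Tupni's implementation. It assumes a hypothetical trace in which each record is already known as an (offset, length, instruction addresses) triple, and it groups records only when their instruction sets match exactly. The function name cluster_records, the trace layout, and the addresses are all made up for illustration, and a real tool would likely tolerate small differences between instruction sets.

```python
from collections import defaultdict


def cluster_records(records):
    """Group records whose processing instruction sets are identical.

    `records` is an iterable of (offset, length, instructions) triples,
    where `instructions` is a set of instruction addresses observed
    touching the record's bytes. This layout is an assumption made for
    the sketch, not a format defined by Tupni.
    """
    clusters = defaultdict(list)
    for offset, length, instructions in records:
        # frozenset makes the instruction set hashable and order-independent.
        clusters[frozenset(instructions)].append((offset, length))
    # Each value is one inferred record type: a list of (offset, length) pairs.
    return list(clusters.values())


# Hypothetical trace: two kinds of records interleaved in one input.
trace = [
    (0, 8, {0x401000, 0x401010}),
    (8, 16, {0x402000, 0x402020, 0x402040}),
    (24, 8, {0x401010, 0x401000}),
    (32, 16, {0x402000, 0x402020, 0x402040}),
]

record_types = cluster_records(trace)
for type_id, members in enumerate(record_types):
    print(f"type {type_id}: {members}")
```

With this trace, the records at offsets 0 and 24 fall into one type and those at offsets 8 and 32 into another, because each pair was handled by the same set of instructions.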

Tupni is currently used inside Microsoft for inferring file formats for fuzz testing.