Description

Various tools for data-flow analysis.

Comparison with source code data-flow

Binary data-flow analysis differs from source code data-flow analysis in that source code has well-known, named variables and the analysis is often simplified by considering only those variables (e.g., whole arrays, not array element granularity), and by ignoring aliasing issues. But because binaries have no named variables but rather one large array of memory with computed addresses, a binary data-flow analysis must be at address granularity and account for aliasing. In fact, because of the rich set of addressing modes, it is very common that within a single instruction some of the inputs and/or outputs depend on other inputs.

Another difference is in the number of basic operations that need to be supported. Source-level languages typically have at most a few dozen basic operations for which data-flow must be explicitly encoded in the analysis, while binary analysis needs basic data-flow information about hundreds of CPU instructions across the supported architectures. In fact, there are so many unique instructions that the implementation here takes a different approach: it leverages the fact that ROSE has instruction semantics and is able to discern data-flow within an instruction (or even across an entire basic block) by executing the instruction (or block) in a special data-flow semantic domain. The domain is an aggregate domain consisting of a main user-specified domain (often symbolic), a domain to track data-flow, and any number of additional user-specified domains. This approach works to discover multiple data-flow within the executed item (instruction or basic block); most CPU instructions have multiple data flows (e.g., "PUSH" has at least three independent data flows: incrementing of the instruction pointer, moving data between stack and register, incrementing the stack pointer).

Two additional minor things to be aware of in binary data-flow are (1) the difference in size between registers and memory locations, and (2) the use of idioms to perform common tasks. For instance, a read from memory to a 32-bit register is actually four separate 8-bit reads at the data-flow level since memory holds 8-bit values. An example of an idiom is the "XOR EAX, EAX" instruction to set the EAX register to zero (this is important for taint analysis, since it effectively removes any taint from EAX).

The major components

The core component of the binary data-flow is an Engine class template. Its purpose is to collect all the types and objects necessary for a data-flow analysis and to iteratively solve a data-flow equation.

Prerequisites for the core data flow-engine

See Engine for how data-flow is solved. In short, data-flow consists of values (taken from a lattice) that are stored in variables (abstract locations, memory, registers). A set of variables and their values (a.k.a., state) is stored at each vertex in a control flow graph. The solver iterates over the control flow graph. It invokes a user-defined transfer function at each vertex. The transfer function takes an incoming state and produces an outgoing state. The incoming state for a vertex is calculated by merging the outgoing states of its predecessor vertices. When the iteration reaches a fixed point the analysis is complete.

Member Typedef Documentation

Vertices are abstract locations (register descriptors and/or memory addresses) and edges indicate the flow of data from abstract location to another. The edge IDs are in the order they were added; each edge also has a sequence number that is initially equal to the edge ID, although edge IDs can change if edges are erased from the graph).

This data structure holds a data-flow graph for each control flow graph vertex. The data-flow graph represents data-flow from one abstract location to another. The union of all abstract locations across all control flow vertices is the set of variables present in the specimen being analyzed. The map is keyed by CFG vertex IDs.

Constructor & Destructor Documentation

Constructs a new data-flow analysis framework using the specified user-level instruction semantics. The semantics should consist of an instruction dispatcher appropriate for the architecture being analyzed, and an associated semantic domain that will be used to calculate memory addresses. The SymbolicSemantics domain is typically used.

Computes and returns a graph describing how data-flow occurs for the specified instructions (which must have linear control flow, e.g., from a single basic block). The vertices of the returned graph contain the abstract locations that are referenced (read or written) and the edges indicate the data-flow. The edge values are integers imparting an ordering to the data-flow.

This method traverses the control flow graph to discover the abstract locations that are referenced by each vertex and the flow edges between those abstract locations.

The algorithm currently implemented here is to use an aggregate semantic domain consisting of a user-supplied semantic domain in conjunction with a data-flow discovery domain, and to process each CFG vertex one time in some arbitrary depth-first order. This isn't a rigorous analysis, but it is usually able to accurately identify local and global variables, although typically not through pointers or array indexing.

This method traverses the control flow graph to discover the abstract locations that are referenced by each vertex and the flow edges between those abstract locations.

The algorithm currently implemented here is to use an aggregate semantic domain consisting of a user-supplied semantic domain in conjunction with a data-flow discovery domain, and to process each CFG vertex one time in some arbitrary depth-first order. This isn't a rigorous analysis, but it is usually able to accurately identify local and global variables, although typically not through pointers or array indexing.

This returns a list of unique variables by computing the union of all variables across the index. Uniqueness is determined by calling AbstractLocation::mustAlias. The variables are returned in no particular order.