Take a look at the two procedures (written in the C-like Boogie language) down below. Although they look syntactically quite different, they compute virtually the same values. Click the "Compute Similarity" button and see what happens.

What's going on?

Say that you have a procedure in assembly code (a query), and you want to find similar procedures, say the same source code compiled by a different compiler or slightly patched, in a corpus of assembly procedures (targets).

Why would I want to do this? Imagine you have a vulnerable procedure (e.g. Heartbleed), and you want to check if it was embedded in some binary in your organization, but don't know what compiler was used, or whether the code was patched. Other use-cases can be clone detection for reducing code-base maintenance (by enabling code-reuse) or for identifying IP theft. That's where Esh comes in!

What's Esh?Esh is a tool that allows searching a corpus of procedures in assembly code, for a query procedure, even if the source code was compiled with a different compiler of was slightly patched. Esh does not rely on syntactic information and thus is able to operate even when procedures were stripped of debug information. Read on to learn more.

So what were those pieces of code? We included a demo of a key component of Esh - the ability to find semantic similarity of small linear pieces of code, where no matching of variable names exist. Why? because this is the common scenario when dealing with assembly code - no variable names exists, so you need to get creative to compute similarity. Other key aspects of Esh are procedure decomposition and statistical significance.

Decomposing the procedure into strands: We decompose procedures into smaller segments (we refer to as strands), which are feasible to compare, and try to use semantically similar strands from one procedure to compose the other.

Pairwise semantic comparison: We use a program verifier to check whether two strands are semantically equivalent by assuming input equivalence and checking if intermediate and output values are the same (this is essentially the demo you played with ;). When they are not equivalent, we define a quantitative notion of strand similarity based on the proportion of matching values to the total number of values in the strand. To allow for this semantic comparison, we translate the assembly code segments to higher level BoogieIR. (The translation is performed using BAP and SMACK.)

Statistical reasoning over strands: We compute whole-procedure similarity using local similarity between strands and applying statistical reasoning. A matching pair of strands is not enough to suggest similarity, we also require statistical evidence. A key aspect of our statistical framework is that we amplify matches of unique strands, and dampen the significance of “common” strands (with respect to the target database examined) as they are less indicative of similarity.

Inspiration

Look at the 3 pictures above. Which of the query pictures: 1 or 2, is more similar to the target picture? Can you say why?

If you thought (like we did) that image 2 is more similar to the target, then the above pictures explains why. The pictures mark and number the common regions of the queries and target image. Image 2 is more similar to the target, simply because they share more mutual regions, and these regions are big and non-trivial (for instance the black background in the pictures should not be used as evidence for similarity). This observation is inspired by work done by Michal Irani and Oren Boiman in the field of image processing (dancer images courtesy of Prof. Michal Irani). Esh shows that the same principals of similarity apply to code!

Consider the assembly code in the diagram above, extracted from binaries stripped of debug information. The blocks show partial assembly code of three procedures, two of them (t and q2) containing the “Heartbleed” vulnerability and another (q1) is unrelated. The vulnerable procedures were compiled using different compilers. Finding similarity between the procedures using syntactic techniques is challenging, as different compilers can produce significantly different assembly code.

Instead, we decompose each procedure to strands, semantically compare similarity of strands, and lift the results to procedures. In the diagram, the target code t and the query code q1, q2 share three matching strands, numbered in the figure as 1, 2, and 3 (within circles). Each strand is a sequence of instructions, and strands are considered as matches when they perform an equivalent computation. In the figure, we mark matching strands using the same circled number. Two syntactically different strands can be equivalent.

For example, the strands numbered 2 in q and t2 are different syntactically but are equivalent (they perform the same operation on inputs). Further, strands need not be syntactically contiguous, this is because they are based on data-flow dependencies rather than on syntactic properties. For example, strand 3 in the query procedure q2 ( mov r12, rbx; lea rdi, [r12+3] ). This strand matches the strand mov r13, rbx;lea rcx, [r13+3] in the target procedure t. In contrast, the code in q1 only matches the single strand 1 in the target, and that strand is of low statistical significance as it is common and thus suggests less of similarity.