Binary Hashing: Motivations and Algorithms

By Rylan O'Connell | November 26, 2019

As security researchers, we often spend a lot of time looking into the internals of libraries in products we are
assessing. With this come some common time sinks, such as identifying library versions.
While library version identification is relatively straightforward on the surface, other tasks, such as applying
signatures to stripped binaries and porting defined types across similar libraries and codebases, are clearly more
challenging.

For example, instead of upgrading an old, vulnerable library, development teams sometimes choose to back-port a patch
into the vulnerable version of the library they already use.
Whatever the reasons for doing so, the result is that it becomes difficult to accurately and automatically detect what
has changed and what remains vulnerable.

We have recently been working under DARPA’s SafeDocs program to develop new methods for understanding and simplifying
complex document formats (e.g., PDF) into safer, clearly understandable, and “verification-friendly” subsets.1
One way that we understand the format’s de facto standard is by conducting analysis of the common parsers which consume
the data.
As part of solving this problem, we often look at how exactly a parsing library is used. This includes how an input
file traverses through this library when that library is nested deep inside various other programs.
Binary hashing is one technique which can help us take knowledge that a human analyst generates about one instantiation of the parser (e.g., an annotation of a potentially exploitable function) and apply it to other instances where that parser is used.

Background

Cryptographic hashing is common and familiar in the security industry.
This involves taking some input and producing an output designed to minimize collisions (i.e., two separate inputs
should have a very low probability of producing the same hash digest).
This is critical for password comparison, signature checking on software, and proof-of-work in a blockchain.
In other instances, however, we want the opposite: to maximize collisions between similar inputs.
This includes applications such as
audio/video fingerprinting, spam detection, matching text such as articles, and near-duplicate detection.
In these cases, we leverage locality sensitive hashing which aims to put “similar” items into the same “bucket” together.
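To make the contrast concrete, here is a small, purely illustrative Python sketch; the inputs and the toy bucketing function are our own inventions rather than code from any tool mentioned in this post:

    import hashlib

    # Two inputs that differ by a single byte...
    a = b"mov eax, 1; ret"
    b = b"mov eax, 2; ret"

    # ...produce completely unrelated cryptographic digests (collisions are avoided by design).
    print(hashlib.sha256(a).hexdigest())
    print(hashlib.sha256(b).hexdigest())

    # A locality sensitive scheme does the opposite: it derives a coarse "bucket" from
    # features of the input so that near-duplicates land together. The feature used here
    # (the sorted set of non-numeric tokens) is a toy chosen only to illustrate the idea.
    def toy_bucket(data: bytes) -> str:
        tokens = sorted(t for t in data.replace(b";", b" ").split() if not t.isdigit())
        return hashlib.sha256(b" ".join(tokens)).hexdigest()[:8]

    print(toy_bucket(a), toy_bucket(b))  # the two near-duplicate inputs share a bucket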

In applying this to binaries, we encounter a number of issues.
Some, such as the order of instructions, are straightforward to address.
For example, instructions within a basic block can be treated as unordered (a brief sketch of this idea follows this paragraph).
There are also issues stemming from platform and architecture differences, which are harder to address.
Compiler optimizations present even more difficult challenges.
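As a minimal sketch of that first point (the mnemonic list and helper function below are hypothetical examples of ours, not code from any specific tool), one way to treat a block's instructions as unordered is to hash the sorted multiset of its mnemonics:

    import hashlib

    def unordered_block_hash(mnemonics):
        """Hash a basic block as a multiset of mnemonics so that reordered (but
        otherwise identical) instruction sequences still collide.

        `mnemonics` is assumed to be a list of strings such as ["mov", "xor", "jmp"];
        in practice these would come from a disassembler.
        """
        # Sorting discards instruction order; duplicates are preserved, so the digest
        # reflects which instructions appear and how often, not where they appear.
        canonical = "|".join(sorted(mnemonics))
        return hashlib.sha256(canonical.encode()).hexdigest()

    # Two compilers may schedule independent instructions differently,
    # yet both orderings map to the same digest:
    print(unordered_block_hash(["mov", "xor", "add", "jmp"]))
    print(unordered_block_hash(["xor", "mov", "add", "jmp"]))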

As a result of these and other challenges, there is no easy real-world answer to binary hashing: each approach we
discuss below has its own strengths and weaknesses.
Frequently, many methods are combined so that their weaknesses offset one another.
For example, Diaphora uses almost 50 heuristics2.
Simply put, binary hashing is often messy in practice.

Much of the prior art therefore uses a number of simple algorithms in combination.
Some of these are exact by nature and will never match fuzzily, such as cryptographic hashes: for example, one
heuristic takes an MD5 of all non-relative bytes in the function, while another compares the assembly code generated
by IDA after ignoring auto-generated names.
Others are fuzzier, such as a DeepToad fuzzy match or a small-primes-product of the mnemonics in the function (sketched below).
Finally, some methods look at the connectedness of the graph, such as call graph matching.
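To illustrate the small-primes-product idea mentioned above, here is a rough Python sketch; the prime assignment is arbitrary and this is our own rendering of the general technique, not the implementation used by Diaphora or any other tool:

    from itertools import count

    def primes():
        """Yield 2, 3, 5, ... by trial division (plenty fast for a few hundred mnemonics)."""
        found = []
        for n in count(2):
            if all(n % p for p in found):
                found.append(n)
                yield n

    _MNEMONIC_PRIMES = {}   # mnemonic -> small prime, assigned on first sight
    _PRIME_GEN = primes()

    def small_primes_product(mnemonics):
        """Order-insensitive 'hash' of a basic block: map each distinct mnemonic to a
        small prime and multiply the primes together. Multiplication is commutative,
        so instruction reordering does not change the product, and by unique
        factorization two blocks share a product only when they contain the same
        multiset of mnemonics."""
        product = 1
        for m in mnemonics:
            if m not in _MNEMONIC_PRIMES:
                _MNEMONIC_PRIMES[m] = next(_PRIME_GEN)
            product *= _MNEMONIC_PRIMES[m]
        return product

    print(small_primes_product(["mov", "xor", "add", "mov"]))
    print(small_primes_product(["xor", "mov", "mov", "add"]))  # same product, different order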

From our review of the current practice and literature, we break these approaches down into two components:

Hash of the Basic Block (BB): a hash of the instructions contained within each individual basic block.

Graph-Aware Measure: a secondary hash which accounts for the structure of a function’s CFG.

Next, we’ll dive into these two categories in more detail.

Basic Block Hashing

We first provide an overview of a number of methods which are sometimes applied to basic block hashing, whether in
implementations or in the academic literature.
The table below summarizes what we see as the features and downsides of the different algorithms.

For those interested, we include some high-level pseudocode for each algorithm discussed in an Appendix at the bottom of this article.

Locality Sensitive Hashing

Unlike the other algorithms described above, locality sensitive hashing algorithms are designed explicitly so that
similar inputs collide and produce the same hash.
One popular implementation relies on dividing an N-dimensional space with k random hyperplanes such that an input’s
“hash” can be derived from its position relative to each hyperplane.
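To make this concrete, here is a minimal NumPy sketch of the hyperplane approach. The dimensions, number of hyperplanes, and feature vectors below are invented for illustration, and how basic blocks get turned into such vectors is a separate design decision that this sketch does not address:

    import numpy as np

    rng = np.random.default_rng(0)

    def hyperplane_lsh(vector, hyperplanes):
        """Hash a feature vector by recording which side of each random hyperplane it
        falls on. Each hyperplane through the origin is represented by its normal
        vector; the sign of the dot product gives one bit of the hash. Vectors that
        are close together fall on the same side of most hyperplanes and therefore
        share most of their bits."""
        bits = (hyperplanes @ vector) >= 0
        return "".join("1" if b else "0" for b in bits)

    # Pretend basic blocks have already been embedded as fixed-length feature vectors
    # (e.g., counts of each mnemonic); these particular vectors are made up.
    dim, n_planes = 8, 16
    planes = rng.standard_normal((n_planes, dim))       # one random normal per hyperplane

    block_a = np.array([4, 1, 0, 2, 0, 0, 1, 3], dtype=float)
    block_b = np.array([4, 1, 0, 2, 0, 0, 2, 3], dtype=float)   # nearly identical to block_a
    block_c = np.array([0, 9, 7, 0, 5, 1, 0, 0], dtype=float)   # very different block

    print(hyperplane_lsh(block_a, planes))
    print(hyperplane_lsh(block_b, planes))   # differs from block_a in few (often zero) bits
    print(hyperplane_lsh(block_c, planes))   # differs in many bits

One knob worth noting: adding more hyperplanes makes the buckets finer-grained, reducing accidental collisions between unrelated inputs at the cost of more often splitting genuinely similar inputs across buckets.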

We think the best way to learn this algorithm (7, 8) is through some visual depictions, so we have simplified it down to a 2D space.
You can click through the steps in the diagrams below:

Comparing this description with the other pseudocode in the Appendix, you will see why this method is more
computationally intensive, though it also yields a number of the benefits identified above.

Graph Hashing

To support graph hashing (i.e., capturing which basic blocks are connected to which), we will compare Graph Edit
Distance and the Weisfeiler Lehman Isomorphism Test.

Our takeaway from studies in this area is that Weisfeiler Lehman is likely “good enough” for signature recognition,
while Graph Edit Distance (GED) may be better suited to “bindiff”-type applications.
Because GED gives an analog measure of the similarity between two functions, capturing its full potential would
require computing the edit distance between every single pair of functions in a binary.
While this may be feasible on small binaries with relatively few functions, it quickly becomes unrealistic on more
complex binaries.
Weisfeiler Lehman, on the other hand, allows us to build up a unique “ID” for each CFG, so we can compute it once per
function and use the resulting ID as a basic hash for comparison.

Weisfeiler Lehman Isomorphism Test

Using the Weisfeiler Lehman Isomorphism Test (WLI) allows us to perform a significant amount of pre-computation on data sets,
and later perform comparisons on the results. This is crucial to being able to scale our analysis.

WLI operates through a series of rounds of neighbor discovery and relabeling.
During neighbor discovery, the algorithm simply gathers every neighbor of a node and collects their labels into an
ordered list.
After every node has undergone neighbor discovery, each node is relabeled with (a compressed form of) its neighborhood list.
Finally, after several rounds of neighbor discovery and relabeling, a “hash” can be computed by visiting every node
in order and building up a list of labels (similar to the neighbor discovery process).
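For readers who want something more concrete than prose, below is a compact Python sketch of those rounds. The toy CFG, its labels, and the round count are invented for illustration, and this is the generic algorithm rather than our production implementation:

    import hashlib

    def weisfeiler_lehman_hash(adjacency, labels, rounds=3):
        """Compute a Weisfeiler Lehman style hash of a labeled graph.

        `adjacency` maps each node to the list of its neighbors and `labels` maps
        each node to its initial label (for a CFG this could be a basic block hash).
        """
        labels = dict(labels)
        for _ in range(rounds):
            new_labels = {}
            for node, neighbors in adjacency.items():
                # Neighbor discovery: gather this node's label plus its neighbors' labels...
                neighborhood = [labels[node]] + sorted(labels[n] for n in neighbors)
                # ...and relabel the node with a compressed form of that neighborhood.
                new_labels[node] = hashlib.sha256("|".join(neighborhood).encode()).hexdigest()[:16]
            labels = new_labels
        # The final hash summarizes every node's label, visited in a canonical order.
        return hashlib.sha256("|".join(sorted(labels.values())).encode()).hexdigest()

    # A toy diamond-shaped CFG: entry -> {left, right} -> exit.
    cfg = {"entry": ["left", "right"], "left": ["exit"], "right": ["exit"], "exit": []}
    block_labels = {"entry": "bb0", "left": "bb1", "right": "bb2", "exit": "bb3"}
    print(weisfeiler_lehman_hash(cfg, block_labels))

One natural way to tie the two components of this post together is to seed the initial labels with whichever basic block hash was chosen earlier, so that the final ID reflects both block contents and CFG structure.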

For N rounds, v vertices, and e edges, this algorithm’s runtime complexity is effectively O(N(v + e)).
The keen-eyed among you may have noticed that in a general graph (without multi-edges), |e| can be as large as |v|²,
so this bound can degrade to O(Nv²), which is no longer linear in the number of nodes.
However, because we are specifically concerned with running this algorithm on CFGs, we know that a node can have at
most 2 outgoing branches, so |e| ≤ 2|v| and WLI still runs in roughly linear time even on large functions with complex CFGs.

Again, we shall turn to a visual demonstration of how this algorithm runs:

Conclusion

We hope that this has given you a brief introduction to the subject area.
Keep an eye out for our next post, where we will release:

Details of our approach, specifically targeting the problem of porting information across similar binaries

Measurable metrics for algorithm performance

An open-source tool for you to use and extend!

River Loop Security works heavily in applying program analysis techniques to solve real-world problems in cybersecurity,
from research projects such as DARPA’s SafeDocs program to commercial penetration testing engagements where we audit
source code or binaries to find vulnerabilities.

We encourage you to contact us if you have any questions or comments based on this post
or have questions about applying binary analysis to solve your cybersecurity challenges.

This material is based upon work supported by the Defense Advanced Research Projects Agency (DARPA) under Contract No. HR001119C0074. Any opinions, findings and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the Defense Advanced Research Projects Agency (DARPA).