International Workshop on Obfuscation: Science, Technology, and TheoryApril 7-8, 2017 • New York University

OBFUSCATION WORKSHOP REPORT

Software Diversification as an Obfuscation Technique

Nicolas Harrand and Benoit Baudry, Inria

Context: worms and botnets

Along the ever increasing number of devices connected to the Internet comes an everlasting plague: computer worms. From the notorious Conficker that may have infected from 7 up to 25 millions computer running Microsoft Windows [1], to the more recent Mirai and Bashlite targeting IoT devices such as DVRs or IP cameras running Linux, they share a common operating pattern: they are able to penetrate the victim system and use it to spread to new targets. Resulting botnets can cause all sorts of troubles from world-wide spamming campaigns to massive DDoS attacks (a Mirai botnet is thought to be responsible for the latest DDoS on Dyn causing an array of sites, including Twitter, Amazon, Tumblr, Reddit, Spotify and Netflix, to be partly unreachable on October 16th 2016 [2]).

Ultimately, this malware can thrive on the Internet because it presents a combination of two dangerous properties:

Despite the large a number of nodes, there is a strong software mono-culture: for a given purpose, only a few different program are used by most nodes (e.g., WordPress or Drupal for content management). This property is equally true for the environment of these programs. Indeed only a few OSs (Linux for servers, Windows for PCs, Android for mobile and embedded devices) cover most devices. The risk of this monoculture is as follows: a hacker who finds an exploit for one these programs can potentially infect all machines on which it is installed.

It is a huge network made up of billions of interconnected devices. Consequently, all machines on which the same program is installed are connected to each other and an attacker who succeeds in accessing one of them can access all the others.

In the end, if a program offers an exploitable flaw (i.e. buffer overflow), it will be shared by many nodes providing entry points on each of them. Furthermore, as these nodes run in a similar environment, a unique malware may exploit their resources (i.e. shells or download tools such as wget) in order to replicate and spread over the network (see figure 1).

On the danger of software mono-culture

In a network of interconnected nodes, the presence of one vulnerability on one node is not the most serious issue. The threat comes the fact that huge clusters of connected nodes are hosting the same vulnerabilities.

The similarity among the nodes in the network is the way to obtain information on the node. This lack of diversity allows an opponent to test and predict the behavior of its attack offline, because having the information on one node means having the information on each and every similar node. This decreases the cost of an attack.

In addition, being able to hack one node means in fact being able to compromise all the nodes that share the flaw. The benefits of such an attack increase with the popularity of the program infected by the vulnerability.

The lack of software diversity is both how and why a node is attacked.

Diversity and Obfuscation

Software diversity refers to the fact that multiple variants of a program can offer the same functionality through different implementations. It can appear naturally, e.g., when different persons choose to implement software for similar purposes (e.g., most languages have their maths library, some even have several), or it can be synthesized through automatic transformations [3]. Automatic diversification can occur at multiple stages of the chain: at the source code level [4], at compile time [5, 6], at load time, or even at run time [7].

Diversification represents a form of obfuscation not because it makes it hard to predict the behavior of one variant, but rather because observing one variant does not leak information about other variants. Therefore it mitigates the capacity to gain information on the node making it a moving target [7] (see figure 2). Software diversification could greatly increase the cost of an attack, as it would have to be either tailored to each variant or adaptable enough to ignore their differences. Furthermore it may cripple the propagation of worms unable to penetrate the system that present different flaw, and prevent from using at will available resources.

Yet, automatic software diversification raises many challenges. In particular, there is a lack of appropriate metrics to evaluate diversity and its efficiency in achieving the desired properties. In this perspective, our work addresses the following aspects: (i) measure actual diversity through number of diverging code blocks, instructions, or execution traces, (ii) and directly test known attacks on a vulnerable program and compare the results with its variants. Furthermore depending on the goals of diversification, all software regions are not equally interesting to diversify, and independently of the goals, all regions do not offer the same opportunities for diversification. The characterization of such software regions is essential for the success of massive diversification.

How to evaluate the efficiency of diversity?

Determining the effectiveness of diversity at avoiding widespread vulnerabilities is a major challenge in the current state of the art. Given a set of N variants of a program, a set of attacks, we currently investigate the following metric. If we cluster variants based on their vulnerability or non vulnerability against each attack: 𝕍 = ⋃Vi , where Vi is the set of variants vulnerable to the attack i. Its cardinality is noted |𝕍|.

The distance between two clusters is defined as follows:

This distance can be seen as the ratio of variants that differ from one group to the other.

The quadratic entropy H(𝕍) is defined as follow:

where is the probability that a variant belongs to Vi. This quadratic entropy based on Rao’s quadratic entropy [8], can be seen as the measure of the average spread of the set of attacks on our set of variants.

Worst case scenario: each variant is vulnerable to each attack. (That is the case if we have only one variant). The distance between every two given groups is null because all of our groups overlap exactly, therefore H(𝕍) = 0.

Best case scenario: V partitions the set of variants, meaning every group is disjoined, and each group has the same size, and then H(𝕍) = 1. Note that some variants may not have any vulnerability and do not belong to any group.

From there, we can see that our goal is to maximize entropy on a set of variants grouped by vulnerabilities or, said in a different way, to minimize the information on one variant leaked by another. The more entropy a diversification scheme can generate the more efficient it is to achieve our fixed goal of preventing widespread vulnerabilities.

Feedback from the workshop

We presented this work at the International Workshop on Obfuscation in April 2017. The audience acknowledged the relevance of diversity for obfuscation and also provided extremely useful feedback about the challenges to actually deploy diverse versions of a program.

One challenge is about the transparency of the code and the build pipeline. Several open source communities aim at building secure programs by having a completely transparent build process that can be understood and analyzed by everyone. This transparency is necessary to ensure true open source. The integration of automatic diversification in such processes can be challenging to keep the transparency while preserving the benefits of diversity.

Software maintenance in the presence of diversity was mentioned as another challenge of this approach for obfuscation.