Monthly Archives: August 2016

Wait, what?

This concept was put forth in a paper titled “Symbolic Execution and Program testing” by James C. King. Rather than using the data objects provided by the programming language for storing values, we use arbitrary symbolic values for computation. You can see that the normal computation is a special case of symbolic execution wherein we “concretize” the values. Let us take a look at some assumptions made by the author.

Assumptions

Computer programs deal only with integers.
This assumption is fairly valid and is almost harmless in many analyses. Although a lot of programs work with float and double values, to reduce the search space, we will assume integral values only. We will also ignore integer overflows(in the case of 32/64 bit integers in C, for example).

The execution tree resulting from many programs is inifinite.

Symbolic execution of IF statements requires theorem proving which is mechanically possible.
This is a very critical assumption and forms the basis of all analysis performed on the execution tree.

Why symbolic execution?

It is the “workhorse” of modern software testing. As with my previous blog post, Symbolic execution saw a rise in interest and research due to increase in CPU speed and the ability to execute things parallely. We will look at how exactly this is useful. Also, it is one of the very few ideas which has made it to the industry and practice, breaking out of the academic bubble`.

How does it compare to the current techniques of static and dynamic analyses?

Symbolic Execution provides a some advantages over both these approaches. Let us look at them:

In comparison to (dynamic)testing it allows us to explore a more broader search space. We can make sure that the program is tested on almost all possible inputs before it is shipped.

Static Analysis has seen a lot of growth in the past few years. We have have better approaches for applying techniques on binaries and the source as well. But one of the fundamental issues with applying such algorithms is that they can point it out to you that there is a possibility of an error, but can never predict/output the input which can trigger the exploit. Symbolic Execution scores a +1 in that regard.

I think the above two reasons are enough to peek under the hood?

Control Flow graph and understanding the semantics of the flow

As wikipedia defines it,

A CFG is a representation using graph notation, of all the paths that might be traversed through a program during it’s execution.

A symbolic execution engine works on this exact representation of the program. Instead of variables having numeric(concrete) values, we choose one of the infinite abstract values and give it to the variable. Let us check out an example:

From the above graph we understand that at the IF conditional statement the graph branches in two. This means that depending on the conditions imposed on the various variables in the program in the IF statement, the program is going to behave differently. In this particular case, if the value of the variable i is 5, the control flow is going to be leading to the true branch and to the false branch otherwise. The rules for evaluation of these statements must be extended to include symbolic values as well.

Understanding the generation of the path equation

A program state is usually defined by the value of its variables(global and local, ofcourse) during normal executionl. But King, in his paper describes the generation of a path equation which has to be included in the program state. A path expression or a path condition(pc) is always initialized as true. As the program follows a particular execution path, more and more conditions are conjoined to the pc. Each statement thus contributes to the path expression(unless it’s a trivial assignment).

An interesting situation arises for the IF condition. A trivial case for this would be if the condition evaluates to one of the two boolean variables. If it does, we continue with that branch and do not consider the other one. But in case we cannot determine at symbolic time what the condition will evaluate to, then we need take a look at both the branches. This happens due to the involvement of the symbolic variable in the equation.

We now fork the execution into two branches. One, where the true branch was taken, and another where it was not. This will basically lead us to generate two expressions. One will be of the form: pc AND (i==5) and another pc AND (i NOT= 5). Now symbolic execution for the rest of the program analysis continues with both these states in memory.

Let us see the generation of the path expression of the above C program:

Statements 1, 2: A fresh new symbolic variable is assigned to i. Let us call it alpha. Current pc: true

Statement 3 : An if condition is encountered. Since it cannot be evaluated, we generate two conditions. pc: true AND (alpha == 5) | true AND (alpha NOT= 5).

Statement 4 : Since this is the last statement in the program, we have already evaluated all the branches, and hence the expression here would be the logical OR of both the expressions in memory.

Thus, after evaluating the entire (execution) tree in this manner, we will end up with certain conditions on the variables for each leaf node. This is a completely exhaustive search mechanism since we have assumed the values to the totally abstract and not provided any information on what they could be(apart from being integers).

Below is the tree representation:

This is a very simple example but it gets the idea through. Although things get a lot more messier with assignments and goto statements, the concept remains the same. The material above is enough to understand basic literature on the topic.

Theorem provers and SMT solvers

Well, now what? We have the path expression of the entire program. Assuming we want to generate test cases for the ERROR branch, we need to solve the boolean expression. Enter, SMT Solvers! SMT(Satisfiability Modulo Theory) Solvers try and solve decision problems. Since, throughout the program we are applying constraints on each variable, these are also known as constraint solvers. Microsoft’s z3 is one of the most used(in production) theorem prover freely available online.

They take in an expression containing symbolic variables and constraints which are imposed upon them. They then try and give us a result depending on the satisfiability of a problem. Sometimes it may give us our result in seconds whereas on other times it may tell us that the problem is unsolvable. But also, solvers fail to deduce either of the facts since it is a NP Complete problem. The algorithms used for these solvers internally, is a matter of a completely different blog post altogether but they basically try and combine multiple expressions from the given path condition and see if they can find a valid solution. Since the search space is very high(unless otherwise bound in a constraint), it takes a lot of time. This also, to certain extent depends on the number of conditions but it’s performance decreases as the number of expressions grow.

Are we going to have a demo or what?

In Clojure? Sure! We are going to try and solve a Project Euler problem 9. The problem statement is as follows:

Pretty easy to just iterate over all the possible values over 1000. But we will try and do something new! \o/

Okay, now let’s say that a, b and c are symbolic variables. We will use the clojure.core.logic library for this.

(ns clj.p9
;; conflicting operator
(:refer-clojure :exclude [==])
(:require [clojure.core.logic :refer :all]
[clojure.core.fd :as fd]))
(defn solution
;; Let us say we take as input the sum of the three pythagorean variables and input the result.
[sum]
;; We will have the result in a vec as [a b c].
;; Since we want the product a*b*c, let's just multiply them together.
;; We use the function first since the two results provided are:
;; [a b c] and [b a c] with c being the hypotenuse. Hence we choose one of them.
(reduce * (first
;; Here we go! `run*` implies that we want the computer to compute
;; all the possible values for the expression `q`.
(run* [q]
;; Declare 3 fresh symbolic variables for a, b and c.
(fresh [a b c]
;; The solution should be a vec of values of the variables a, b and c.
(== q [a b c])
;; Now we know that all that all the number have to be distinct.
;; We hence use the fd/distinct to add that constraint.
(fd/distinct [a b c])
;; Another contraint we can derive is that since the sum of two sides is always &gt; than the third,
;; each side has to be lesser than half of the `sum` value.
;; This can reduce our search space by quite a lot. fd/in provides us an option to specify the range.
;; (fd/interval [start] [end]) is the actual numeric range.
(fd/in a b c (fd/interval 1 (/ sum 2)))
;; We now add the given constraints on the variables.
(fd/eq
;; sum = a + b + c
(= sum (+ a b c))
;; c^2 = a^2 + b^2 (since they are pythagorean triplets)
(= (* c c) (+ (* a a) (* b b)))))))))

EZ PZ. Let’s run this!

(time (soln 1000))
"Elapsed time: 2345.989466 msecs"
31875000

There we have it! Solved in 2.3 seconds. Although this was fairly simple, but we learnt a lot on the way. core.logic is not as highly optimized as z3 but is fairly fast. It is made for a relational algebra which is a superset of symbolic analysis. z3 has a nice Python API to interact with it!

Issues involved

Current research trends include the inclusion of analysis of source and raw binaries as well. Program testing can be approached in an entirely new way if the inputs to methods(maybe?) are considered to be symbolic and we then fuzz test the over the entire search space. It is although very important to realize that there are a lot of problems which can come up during practical implementations:

Path explosion: If there are a lot of branching statements in the program, there will be generation of a new path expression at each branch hence creating a new instance of the symbolic runtime increasing the number of paths. This is a very serious problem and requires smart inputs from the user to prune certain branches off.

Environment interactions and system calls: A symbolic engine will find it very tough to interact with the underlying OS or the runtime environment. This can hinder a lot of core systems testing. The program may take decision based on the return value of a particular read statement or write to a particular file and then taking decisions. There is still ongoing research to find implement a virtual model of the underlying stack for the engine to interact with.

A thought experiment: Symbolic Execution in Healthcare?

Medical systems need to completely fool proof and verified before they are sent to production. Symbolic execution and analysis can be absolutely critical in such an analysis. Like normal programs, even though certain inputs are completely pointless(for example providing a negative number to a square root calculator) it is important that we implement correct error handling.

Although the above is practical, it would be more interesting to apply this logic to the human body directly which is… let’s just say it’s a though experiment. Some of my thoughts related to the ideas below stem from the discussion here. Below are some assumptions I make:

Since the human body is made up of a finite number of atoms, at any point in time there is a set of values which describe the entire body state(Assuming we manage to deconstruct the brain).

Although huge, we have a model of the human body which can be provided to a computer to generate an approximate system which will emulate the body internals and reactions to stimuli. Although far-fetched, we are slowly inching toward such a goal.

The computer we imagine is fast enough to perform calculations on such a scale.

Imagine medicine in such a scenario. Assume a medicinal drug to be a symbolic variable and the program to be the human body itself. The medicine can be thought to be an input to the body and hence the system should be able to track the breakdown of such a medicine in the body. The IF conditions in this case would represent exsiting patient conditions and his body state which would lead to different reactions in the body. Once we have all the leaf-nodes of a graph of such a system, the resulting information can be highly infomative in providing crucial healthcare and desigining of drugs, since we will have data regarding each breakdown of the chemicals. Drug design can be highly benefit from such a scenario!