symbolic execution

Two weeks ago, we talked about finding errors in a program with inputs using fuzzing. While the method is useful, its success is nonetheless strongly dependent on heuristics which guide the trajectory of parameter exploration. We can, however, approach the problem differently. Instead of generating random, but concrete, inputs and executing the program normally from there, we will represent program data symbolically.

During an evaluation of an ordinary program, each declared variable has a known concrete value, such as 2, true or "hello". In contrast, symbolic values are program expressions involving concrete values, their combinations (e.g. using arithmetic operators) and, crucially, unknowns standing for, well, unknown values coming from program inputs. Such expressions are usually called terms.

data flow

Consider the following simple program:

x ← input
y ← input
z ← x - y
out ← x + y / z

It contains an obvious error, which occurs precisely when z = 0. The value of z is a function of the initial values of x and y, and this is where the unknowns come into play. Let us mark those initial values of x and y by letters α and β, respectively, and simulate the execution of the program using such symbolic data:

statement	symbolic result
`x ← input`	α
`y ← input`	β
`z ← x - y`	α - β
`out ← x + y / z`	α + β / (α - β)

At the end of the symbolic execution, the variable z stores the term α - β while out contains α + β / (α - β). That is, each variable stores a term representing a mapping from the concrete values of the parameters α and β to the concrete value of the given variable. And because division is undefined precisely when the divisor is 0, we can immediately conclude that the program reaches an error if z = α - β = 0, or α = β, and synthesize a set of offending concrete values (such as α = β = 0).

control flow

The previous program was particularly easy. To spice things up, consider the following extension:

x ← input
y ← input
z ← x - y

if z % 2 = 1 then
    out ← x + y / z
else
    out ← x + y / ( z + 1 )

The difference here is that we have added nontrivial control flow in the form of conditional branching. Since we cannot tell in advance which branch is going to be executed for every concrete value of z, we will consider both possibilities:

statement	symbolic result
`x ← input`	α
`y ← input`	β
`z ← x - y`	α - β
`if z % 2 = 1 then`	(α - β) % 2 = 1
`out ← x + y / z`	α + β / (α - β)
`else`	(α - β) % 2 ≠ 1
`out ← x + y / ( z + 1 )`	α + β / (α - β + 1)

Does this program ever divide by zero? Looking at the values of both divisors, we clearly see that no matter which input values we supply, we will always divide by an odd number, which is safe. However, this is not the result we would get by a naïve local inspection of the conditional assignments and by substitution of just arbitrary concrete input values. The key here is that the expression on which we branch (here z % 2 = 1) constrains the possible values which α and β can take.

The constraint on the input values is called a path condition and we can imagine it as a set of terms that evaluate to boolean values. For reasons that will become apparent later, we cannot allow every combining operation available in the source language in such terms, though. In fact, we are usually restricted to a set of primitive operations consisting mainly of arithmetical, logical, bitwise and comparison operators.

We say that a path condition φ is satisfiable if there is a mapping which assigns a concrete value to each unknown appearing in some term of φ, such that each term in φ would evaluate to true under this mapping. Such a mapping is often called a model of φ.

A given program can take many possible execution trajectories, or runs, depending on the values of its arguments, with each run completely determined by the results of encountered branch conditions. By building the path condition as we encounter them, we are not only constraining the possible set of input values at a given point of the run, but also identifying a specific run with its path condition φ. For any realizable run, the corresponding φ must be satisfiable; moreover, every model of φ yields a set of inputs that follows the same run. On the other hand, path conditions that are not satisfiable do not correspond to possible program runs.

To see an example of an impossible run, consider a simple program:

x ← input

if x > 10 then
    if x < 5 then
        error

An error is syntactically present and a run that would reach it has a path condition φ = {x > 10, x < 5}. Strictly speaking, the satisfiability of φ depends on the exact semantics of the language, but under any reasonable definition of <, it is going to be a contradiction (i.e. unsatisfiable).

systematic execution

The infrastructure above is enough to sketch up an algorithm that performs an exhaustive search of all possible runs of a given program. A typical symbolic executor evaluates its input program using some, often user specified, information as to which data should be modelled using symbolic unknowns and interprets the program while keeping track of the path condition and the symbolic valuation of variables. Unlike an ordinary interpreter, a symbolic executor potentially needs to enter all branches of all conditional statements (e.g. both then and else clauses of an if statement, all case labels of a switch, etc.).

We can model the set of possible runs of a program using a (symbolic) execution tree, where nodes represent a specific execution state, i.e. a program location, a path condition and a variable valuation. The successors of a node U are nodes representing a state into which the executor can follow from the state U. The tree is rooted at a single node R given by the entrypoint location, empty path condition and empty valuation. A run is then a (not necessarily finite) path in this tree starting at R. However, we have to make sure that each run in an execution tree of a program corresponds to a realizable run of the same. To no one's surprise, if we include exactly the nodes with satisfiable path conditions, the tree represents the set of realizable runs, as desired.

Now consider the following, even spicier, program:

x ← input
out ← 0

if x % 2 = 1 then
    if x = 4 then
        error
    else
        out ← x - 1
else
    out ← x + 1

out ← 100 / out

And let us now draw a symbolic tree which models the execution of this program in the domain of, say, integers numbers (that is, the possible values of the variables are integers and each operator has integer semantics). The edges are labelled with statements of the program (or the constraint valid for the respective branch if the statement was an if), while the nodes contain the path condition and the variable valuation (in this order).

As you can see, there is one path in the tree ending in unreachable. This happens because the potential resulting path condition of the missing node ({α % 2 = 1, α = 4}) is clearly unsatisfiable over the integers. Another interesting point is that only one of the leaves contains an error, since α + 1 = 0 cannot happen with even α ({α % 2 ≠ 1}).

Now, the symbolic execution algorithm is conceptually pretty simple. All we have to do is perform a traversal of the execution tree and report back any interesting information (such as nodes with errors). If the tree is finite, depth-first traversal will do and is probably easier to imagine, as the executor evaluates a whole run until its end and, in essence, behaves like a backtracking interpreter. However, symbolic execution is still possible for trees with infinite branches. In this case, we can switch to either breadth-first or iteratively-deepening depth-first traversal. These algorithms do not terminate, but they are guaranteed to report, in finite time, any error location reachable in the original program.

The execution tree is generated on-the-fly during traversal. For each node, we have to disregard potential successors with unsatisfiable path conditions. As such, symbolic executors need to be able to algorithmically determine satisfiability of a given path condition. How is this done and whether it is even possible depends on the semantics of our language, and it will be a topic of a future article.

conclusion

Clearly, we are not done with this topic. In fact, we barely scratched the surface, and you can expect follow-up posts to go into more detail. In the meantime, here are some questions to consider:

How does an infinite execution tree look like? When can it happen?
How effective and efficient would you expect symbolic execution to be in cases of finite trees?
Why have we limited ourselves to primitive operations in path conditions?