symbolic execution
Two weeks ago, we talked about finding errors in a program with inputs using
fuzzing. While the method is useful, its success is nonetheless strongly
dependent on heuristics which guide the trajectory of parameter exploration. We
can, however, approach the problem differently. Instead of generating random,
but concrete, inputs and executing the program normally from there, we will
represent program data symbolically.
During an evaluation of an ordinary program, each declared variable has a known
concrete value, such as
2
, true
or "hello"
. In contrast, symbolic values
are program expressions involving concrete values, their combinations (e.g.
using arithmetic operators) and, crucially, unknowns standing for, well,
unknown values coming from program inputs. Such expressions are usually called
terms.
data flow
Consider the following simple program:
x ← input
y ← input
z ← x - y
out ← x + y / z
It contains an obvious error, which occurs precisely when
z = 0
. The value of
z
is a function of the initial values of x
and y
, and this is where the
unknowns come into play. Let us mark those initial values of x
and y
by
letters α
and β
, respectively, and simulate the execution of the program
using such symbolic data:
statement | symbolic result |
x ← input | α |
y ← input | β |
z ← x - y | α - β |
out ← x + y / z | α + β / (α - β) |
At the end of the symbolic execution, the variable
z
stores the term
α - β
while out
contains α + β / (α - β)
. That is, each variable stores a
term representing a mapping from the concrete values of the parameters α
and β
to the concrete value of the given variable. And because division is
undefined precisely when the divisor is 0
, we can immediately conclude that
the program reaches an error if z = α - β = 0
, or α = β
, and synthesize a
set of offending concrete values (such as α = β = 0
).
control flow
The previous program was particularly easy. To spice things up, consider the
following extension:
x ← input
y ← input
z ← x - y
if z % 2 = 1 then
out ← x + y / z
else
out ← x + y / ( z + 1 )
The difference here is that we have added nontrivial control flow in the
form of conditional branching. Since we cannot tell in advance which branch
is going to be executed for every concrete value of
z
, we will consider both
possibilities:
statement | symbolic result |
x ← input | α |
y ← input | β |
z ← x - y | α - β |
if z % 2 = 1 then | (α - β) % 2 = 1 |
out ← x + y / z | α + β / (α - β) |
else | (α - β) % 2 ≠ 1 |
out ← x + y / ( z + 1 ) | α + β / (α - β + 1) |
Does this program ever divide by zero? Looking at the values of both divisors,
we clearly see that no matter which input values we supply, we will always
divide by an odd number, which is safe. However, this is not the result we
would get by a naïve local inspection of the conditional assignments and by
substitution of just arbitrary concrete input values. The key here is that the
expression on which we branch (here
z % 2 = 1
) constrains the possible
values which α
and β
can take.
The constraint on the input values is called a path condition and we can
imagine it as a set of terms that evaluate to boolean values. For reasons that
will become apparent later, we cannot allow every combining operation
available in the source language in such terms, though. In fact, we are
usually restricted to a set of primitive operations consisting mainly of
arithmetical, logical, bitwise and comparison operators.
We say that a path condition
φ
is satisfiable if there is a mapping which
assigns a concrete value to each unknown appearing in some term of φ
, such
that each term in φ
would evaluate to true
under this mapping. Such a
mapping is often called a model of φ
.
A given program can take many possible execution trajectories, or runs,
depending on the values of its arguments, with each run completely determined
by the results of encountered branch conditions. By building the path
condition as we encounter them, we are not only constraining the possible set
of input values at a given point of the run, but also identifying a specific
run with its path condition
φ
. For any realizable run, the corresponding
φ
must be satisfiable; moreover, every model of φ
yields a set of inputs
that follows the same run. On the other hand, path conditions that are not
satisfiable do not correspond to possible program runs.
To see an example of an impossible run, consider a simple program:
x ← input
if x > 10 then
if x < 5 then
error
An error is syntactically present and a run that would reach it has a path
condition φ = {
x > 10
, x < 5
}. Strictly speaking, the satisfiability of
φ
depends on the exact semantics of the language, but under any reasonable
definition of <
, it is going to be a contradiction (i.e. unsatisfiable).
systematic execution
The infrastructure above is enough to sketch up an algorithm that performs an
exhaustive search of all possible runs of a given program. A typical
symbolic executor evaluates its input program using some, often user
specified, information as to which data should be modelled using symbolic
unknowns and interprets the program while keeping track of the path condition and
the symbolic valuation of variables. Unlike an ordinary interpreter, a symbolic
executor potentially needs to enter all branches of all conditional statements
(e.g. both
then
and else
clauses of an if
statement, all case
labels
of a switch
, etc.).
We can model the set of possible runs of a program using a (symbolic)
execution tree, where nodes represent a specific execution state, i.e. a
program location, a path condition and a variable valuation. The successors of
a node
U
are nodes representing a state into which the executor can follow
from the state U
. The tree is rooted at a single node R
given by the
entrypoint location, empty path condition and empty valuation. A run is then a
(not necessarily finite) path in this tree starting at R
. However, we have
to make sure that each run in an execution tree of a program corresponds to a
realizable run of the same. To no one's surprise, if we include exactly the
nodes with satisfiable path conditions, the tree represents the set of
realizable runs, as desired.
Now consider the following, even spicier, program:
x ← input
out ← 0
if x % 2 = 1 then
if x = 4 then
error
else
out ← x - 1
else
out ← x + 1
out ← 100 / out
And let us now draw a symbolic tree which models the execution of this program
in the domain of, say, integers numbers (that is, the possible values of the
variables are integers and each operator has integer semantics). The edges are
labelled with statements of the program (or the constraint valid for the
respective branch if the statement was an if), while the nodes contain the
path condition and the variable valuation (in this order).
As you can see, there is one path in the tree ending in unreachable. This
happens because the potential resulting path condition of the missing node
({
α % 2 = 1
, α = 4
}) is clearly unsatisfiable over the integers.
Another interesting point is that only one of the leaves contains an error,
since α + 1 = 0
cannot happen with even α
({α % 2 ≠ 1
}).
Now, the symbolic execution algorithm is conceptually pretty simple. All we
have to do is perform a traversal of the execution tree and report back any
interesting information (such as nodes with errors). If the tree is finite,
depth-first traversal will do and is probably easier to imagine, as the
executor evaluates a whole run until its end and, in essence, behaves like a
backtracking interpreter. However, symbolic execution is still possible for
trees with infinite branches. In this case, we can switch to either
breadth-first or iteratively-deepening depth-first traversal. These algorithms
do not terminate, but they are guaranteed to report, in finite time, any error
location reachable in the original program.
The execution tree is generated on-the-fly during traversal. For each node, we
have to disregard potential successors with unsatisfiable path conditions. As
such, symbolic executors need to be able to algorithmically determine
satisfiability of a given path condition. How is this done and whether it is
even possible depends on the semantics of our language, and it will be a topic
of a future article.
conclusion
Clearly, we are not done with this topic. In fact, we barely scratched the
surface, and you can expect follow-up posts to go into more detail. In the
meantime, here are some questions to consider:
- How does an infinite execution tree look like? When can it happen?
- How effective and efficient would you expect symbolic execution to be in cases of finite trees?
- Why have we limited ourselves to primitive operations in path conditions?