# symbolic execution

Two weeks ago, we talked about finding errors in a program with inputs using
fuzzing. While the method is useful, its success is nonetheless strongly
dependent on heuristics which guide the trajectory of parameter exploration. We
can, however, approach the problem differently. Instead of generating random,
but concrete, inputs and executing the program normally from there, we will
represent program data

*symbolically*.During an evaluation of an ordinary program, each declared variable has a known
concrete value, such as

`2`

, `true`

or `"hello"`

. In contrast, symbolic values
are program expressions involving concrete values, their combinations (e.g.
using arithmetic operators) and, crucially, *unknowns*standing for, well, unknown values coming from program inputs. Such expressions are usually called*terms*.# data flow

Consider the following simple program:

```
x ← input
y ← input
z ← x - y
out ← x + y / z
```

It contains an obvious error, which occurs precisely when

`z = 0`

. The value of
`z`

is a function of the initial values of `x`

and `y`

, and this is where the
unknowns come into play. Let us mark those initial values of `x`

and `y`

by
letters `α`

and `β`

, respectively, and *simulate*the execution of the program using such symbolic data:statement | symbolic result |

`x ← input` | α |

`y ← input` | β |

`z ← x - y` | α - β |

`out ← x + y / z` | α + β / (α - β) |

At the end of the

*symbolic execution*, the variable`z`

stores the term
`α - β`

while `out`

contains `α + β / (α - β)`

. That is, each variable stores a
term representing a *mapping*from the concrete values of the parameters`α`

and `β`

to the concrete value of the given variable. And because division is
undefined precisely when the divisor is `0`

, we can immediately conclude that
the program reaches an error if `z = α - β = 0`

, or `α = β`

, and synthesize a
set of offending concrete values (such as `α = β = 0`

).
# control flow

The previous program was particularly easy. To spice things up, consider the
following extension:

```
x ← input
y ← input
z ← x - y
if z % 2 = 1 then
out ← x + y / z
else
out ← x + y / ( z + 1 )
```

The difference here is that we have added nontrivial control flow in the
form of

*conditional branching*. Since we cannot tell in advance which branch is going to be executed for every concrete value of`z`

, we will consider both
possibilities:
statement | symbolic result |

`x ← input` | α |

`y ← input` | β |

`z ← x - y` | α - β |

`if z % 2 = 1 then` | (α - β) % 2 = 1 |

`out ← x + y / z` | α + β / (α - β) |

`else` | (α - β) % 2 ≠ 1 |

`out ← x + y / ( z + 1 )` | α + β / (α - β + 1) |

Does this program ever divide by zero? Looking at the values of both divisors,
we clearly see that no matter which input values we supply, we will always
divide by an odd number, which is safe. However, this is not the result we
would get by a naïve local inspection of the conditional assignments and by
substitution of just arbitrary concrete input values. The key here is that the
expression on which we branch (here

`z % 2 = 1`

) *constrains*the possible values which`α`

and `β`

can take.
The constraint on the input values is called a

*path condition*and we can imagine it as a set of terms that evaluate to boolean values. For reasons that will become apparent later, we cannot allow every combining operation available in the source language in such terms, though. In fact, we are usually restricted to a set of*primitive operations*consisting mainly of arithmetical, logical, bitwise and comparison operators.We say that a path condition

`φ`

is *satisfiable*if there is a mapping which assigns a concrete value to each unknown appearing in some term of`φ`

, such
that each term in `φ`

would evaluate to `true`

under this mapping. Such a
mapping is often called a *model*of`φ`

.
A given program can take many possible execution trajectories, or

*runs*, depending on the values of its arguments, with each run completely determined by the results of encountered branch conditions. By building the path condition as we encounter them, we are not only constraining the possible set of input values at a given point of the run, but also identifying a specific run with its path condition`φ`

. For any *realizable*run, the corresponding`φ`

must be satisfiable; moreover, every model of `φ`

yields a set of inputs
that follows the same run. On the other hand, path conditions that are not
satisfiable do not correspond to possible program runs.
To see an example of an impossible run, consider a simple program:

```
x ← input
if x > 10 then
if x < 5 then
error
```

An error is syntactically present and a run that would reach it has a path
condition φ = {

`x > 10`

, `x < 5`

}. Strictly speaking, the satisfiability of
`φ`

depends on the exact semantics of the language, but under any reasonable
definition of `<`

, it is going to be a contradiction (i.e. unsatisfiable).
# systematic execution

The infrastructure above is enough to sketch up an algorithm that performs an

*exhaustive search*of all possible runs of a given program. A typical*symbolic executor*evaluates its input program using some, often user specified, information as to which data should be modelled using symbolic unknowns and interprets the program while keeping track of the path condition and the symbolic valuation of variables. Unlike an ordinary interpreter, a symbolic executor potentially needs to enter all branches of all conditional statements (e.g. both`then`

and `else`

clauses of an `if`

statement, all `case`

labels
of a `switch`

, etc.).
We can model the set of possible runs of a program using a (symbolic)

*execution tree*, where nodes represent a specific execution*state*, i.e. a program location, a path condition and a variable valuation. The successors of a node`U`

are nodes representing a state into which the executor can follow
from the state `U`

. The tree is *rooted*at a single node`R`

given by the
entrypoint location, empty path condition and empty valuation. A run is then a
(not necessarily finite) path in this tree starting at `R`

. However, we have
to make sure that each run in an execution tree of a program corresponds to a
*realizable*run of the same. To no one's surprise, if we include exactly the nodes with satisfiable path conditions, the tree represents the set of realizable runs, as desired.Now consider the following, even spicier, program:

```
x ← input
out ← 0
if x % 2 = 1 then
if x = 4 then
error
else
out ← x - 1
else
out ← x + 1
out ← 100 / out
```

And let us now draw a symbolic tree which models the execution of this program
in the domain of, say, integers numbers (that is, the possible values of the
variables are integers and each operator has integer semantics). The edges are
labelled with statements of the program (or the constraint valid for the
respective branch if the statement was an if), while the nodes contain the
path condition and the variable valuation (in this order).

As you can see, there is one path in the tree ending in unreachable. This
happens because the potential resulting path condition of the missing node
({

`α % 2 = 1`

, `α = 4`

}) is clearly *unsatisfiable*over the integers. Another interesting point is that only one of the leaves contains an error, since`α + 1 = 0`

cannot happen with even `α`

({`α % 2 ≠ 1`

}).
Now, the symbolic execution algorithm is conceptually pretty simple. All we
have to do is perform a

*traversal*of the execution tree and report back any interesting information (such as nodes with errors). If the tree is finite, depth-first traversal will do and is probably easier to imagine, as the executor evaluates a whole run until its end and, in essence, behaves like a backtracking interpreter. However, symbolic execution is still possible for trees with infinite branches. In this case, we can switch to either breadth-first or iteratively-deepening depth-first traversal. These algorithms do not terminate, but they are guaranteed to report, in finite time, any error location reachable in the original program.The execution tree is generated on-the-fly during traversal. For each node, we
have to disregard potential successors with unsatisfiable path conditions. As
such, symbolic executors need to be able to algorithmically determine
satisfiability of a given path condition. How is this done and whether it is
even possible depends on the semantics of our language, and it will be a topic
of a future article.

# conclusion

Clearly, we are not done with this topic. In fact, we barely scratched the
surface, and you can expect follow-up posts to go into more detail. In the
meantime, here are some questions to consider:

- How does an infinite execution tree look like? When can it happen?
- How effective and efficient would you expect symbolic execution to be in cases of finite trees?
- Why have we limited ourselves to primitive operations in path conditions?