# logic and satisfiability solving

In the introduction to symbolic execution last week, we left several
questions unanswered. Among those was the problem of determining whether a
given path condition (which we defined as a set of boolean *terms* in the
source language) was *satisfiable*, and this will be the topic of today's
discussion.

In order to talk about satisfiability, our simplistic view of path conditions
has to give way to a more concrete and formal description, namely that of
*logical formulae*. Before we discuss satisfiability in this context, though,
we need to define some basic terminology. While this might seem like a
needless waste of time at the beginning, we will soon see the benefits of
this choice of abstraction, whose usefulness greatly transcends symbolic
execution.

# propositional logic

Our quick tour of logic starts at *propositional logic*, which is probably
the simplest generally useful logical system you can think of. *Propositional
formulae* consist of *atomic propositions*, which act as variables over the
domain {0, 1}, and means of their combination, i.e. logical connectives, such
as negation ¬, conjunction ∧, and disjunction ∨.

The set of propositional formulae over a set A of atomic propositions (that
is intentionally left arbitrary) is then generated by the following rules (we
will not discuss technical matters of precedence and bracketing, as those are
intuitive enough):

- Each atomic proposition a ∈ A is a formula.
- If φ and ψ are formulae, then ¬φ, φ ∧ ψ and φ ∨ ψ are also formulae.
- Nothing else is a formula.
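The inductive definition can be mirrored directly in code. The following is a
minimal sketch under one arbitrary encoding (not the only possible one): an
atomic proposition is a string, and compound formulae are nested tuples.

```python
# A sketch of one possible encoding of propositional formulae: an
# atomic proposition is a string, and compound formulae are the nested
# tuples ("not", f), ("and", f, g) and ("or", f, g).
def atoms(formula):
    """Collect the atomic propositions of a formula by structural recursion."""
    if isinstance(formula, str):        # base case: an atomic proposition
        return {formula}
    return set().union(*(atoms(sub) for sub in formula[1:]))

# the formula ¬(a ∧ b) ∨ a mentions exactly the atoms a and b
phi = ("or", ("not", ("and", "a", "b")), "a")
print(sorted(atoms(phi)))  # → ['a', 'b']
```

Note how the function has one case per formation rule; this pattern will
reappear whenever we recurse over formulae.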

Note that each formula is a *finite* formal string with no intrinsic meaning,
and that the set of formulae is defined *inductively*. This latter point
allows us to define operations on this set using structural recursion, and
indeed, we can recursively define the meaning (or *semantics*) of
propositional formulae.

The base case of our construction consists of the atomic propositions.
Keeping in mind that those range over {0, 1}, we can, given a *valuation* v
that maps each atomic proposition to 0 or 1, evaluate a formula recursively.
The evaluation function (which we will call *interpretation* from now on),
named ⟦·⟧ᵥ, is defined like this:

| formula | interpretation |
|---------|----------------|
| a (atomic) | ⟦a⟧ᵥ = v(a) |
| ¬φ | ⟦¬φ⟧ᵥ = ~⟦φ⟧ᵥ |
| φ ∧ ψ | ⟦φ ∧ ψ⟧ᵥ = ⟦φ⟧ᵥ ⊓ ⟦ψ⟧ᵥ |
| φ ∨ ψ | ⟦φ ∨ ψ⟧ᵥ = ⟦φ⟧ᵥ ⊔ ⟦ψ⟧ᵥ |

This construction depends on *boolean functions* ~, ⊓ and ⊔, which act not on
formulae, but on the values 0 and 1, according to the rules summarized in the
following table:

| a | b | ~a | a ⊓ b | a ⊔ b |
|---|---|----|-------|-------|
| 0 | 0 | 1  | 0     | 0     |
| 0 | 1 | 1  | 0     | 1     |
| 1 | 0 | 0  | 0     | 1     |
| 1 | 1 | 0  | 1     | 1     |
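The interpretation function can be sketched in the same nested-tuple encoding
as before. Note that on the domain {0, 1}, ~ is simply 1 − x, while ⊓ and ⊔
coincide with minimum and maximum.

```python
# A sketch of the interpretation function ⟦·⟧ᵥ over nested-tuple
# formulae (atoms are strings). On {0, 1}, negation is 1 - x and the
# boolean functions ⊓ and ⊔ are minimum and maximum respectively.
def interpret(formula, valuation):
    """Evaluate a formula under a valuation mapping atoms to 0 or 1."""
    if isinstance(formula, str):                  # atomic proposition
        return valuation[formula]
    op = formula[0]
    if op == "not":
        return 1 - interpret(formula[1], valuation)
    left = interpret(formula[1], valuation)
    right = interpret(formula[2], valuation)
    return min(left, right) if op == "and" else max(left, right)

# ⟦a ∨ ¬b⟧ under v(a) = 0, v(b) = 0 evaluates to 1
phi = ("or", "a", ("not", "b"))
print(interpret(phi, {"a": 0, "b": 0}))  # → 1
```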

After all this setup, we can finally say that a formula φ is *satisfiable* if
there is a valuation v such that ⟦φ⟧ᵥ = 1. Otherwise, it is *unsatisfiable*.
A satisfying valuation is also called a *model* of φ.

At last, we have definitions whose names coincide with those that we defined
last week in the context of pure symbolic execution. We are not quite there
yet, though, as propositional logic is simply too weak for modelling
real-world (or real-mathematics, if you prefer) phenomena. As we can see from
its syntax and semantics, it is a logic of pure truth and falsehood; indeed,
there are no objects other than boolean values. Notably, we lack *terms*,
which usually talk about more complex objects.

# propositional satisfiability solving

Before we extend our logic, though, let us look at how we would go about
determining, for a given propositional formula φ, whether it is satisfiable
and, if so, finding a model of φ. Since a propositional formula is finite and
thus contains at most a finite number of atomic propositions, there is a
simple algorithm for this problem: simply try all possible valuations and for
each compute the value of φ. If it is 1, we have a model. If we exhaust all
possible valuations, the formula is unsatisfiable.
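The naive algorithm fits in a few lines. In this sketch, the formula is
abstracted as a Python predicate over valuation dictionaries purely for
brevity (an assumption of the sketch, not a standard interface):

```python
from itertools import product

# A sketch of the naive SAT algorithm: enumerate all 2^n valuations of
# the given atoms and evaluate the formula (abstracted here as a Python
# predicate over valuation dicts) under each one.
def naive_sat(formula, atoms):
    for bits in product((0, 1), repeat=len(atoms)):
        valuation = dict(zip(atoms, bits))
        if formula(valuation) == 1:
            return valuation        # found a model
    return None                     # all valuations exhausted: unsatisfiable

# a ∧ ¬b is satisfied exactly by v(a) = 1, v(b) = 0
print(naive_sat(lambda v: v["a"] & (1 - v["b"]), ["a", "b"]))  # → {'a': 1, 'b': 0}
# a ∧ ¬a is unsatisfiable
print(naive_sat(lambda v: v["a"] & (1 - v["a"]), ["a"]))       # → None
```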

Now of course, this algorithm is not very efficient, since for a formula with
n atomic propositions, there are 2ⁿ possible valuations. Unfortunately, an
exponential algorithm is most probably the best we can hope for, since this
problem (often called *SAT*) is a well-known (even prototypical)
*NP-complete* problem. Nonetheless, there is a range of programs for this
exact problem (so-called SAT solvers) and they are practically usable for
most 'ordinary' formulae.

SAT solvers don't use our naive algorithm, though. They often implement a
more practical (though still exponential in the worst case) algorithm called
*DPLL*, which builds a valuation incrementally by ordering atomic
propositions and then choosing their values on the fly, simplifying the
formula after each choice (e.g. by eliminating trivially true parts) and
backtracking upon reaching a dead end (i.e. when the resulting formula
becomes unsatisfiable). Additionally, there are heuristics used to find
atomic propositions whose values can be determined quickly, and many other
improvements.

A downside of DPLL is that it only natively works with formulae of a specific
shape called the *conjunctive normal form* (CNF). A formula is in CNF if it
is a conjunction of *clauses*, i.e. C₁ ∧ C₂ ∧ … ∧ Cₙ, where each clause Cᵢ is
a disjunction of *literals* and each literal is a possibly negated atomic
proposition. Every formula can be translated into an *equivalent* (i.e.
having exactly the same models) formula in CNF. However, the size of the
resulting formula might be exponential when compared to the original formula.
Not all is lost, though. If we do not require the original and the resulting
formulae to be equivalent, but merely *equisatisfiable* (which means that the
original formula is satisfiable exactly when the resulting one is), there is
a translation with only a linear blowup in size. Equivalence is not preserved
by this translation, as the resulting formula contains newly added atomic
propositions.

Alternatively, solvers can be built around a specialised data structure
called the (ordered) *binary decision diagram*, which can be used to
represent propositional formulae and supports reasonably efficient operations
for formula composition, negation and satisfiability checking. This is,
however, beyond the scope of today's discussion.

# first-order logic

Let us now return to the problem of representing *terms* in logical formulae.
We have a problem, though: what a term is depends on our problem domain. Even
within our motivating example of symbolic execution, each programming
language has different semantics, different data types and thus also
different operations. We will, therefore, start by building an abstract
framework of terms that is independent of concrete semantics (and to a
degree, syntax as well).

The basic building blocks of our terms are *variables* (somewhat similar to
the atomic propositions that we have seen before), *constants* and their
means of combination, that is, *function symbols*. To each function symbol f,
we associate a natural number called the *arity* of f. For given sets V, C
and F of variables, constants and function symbols, respectively, we define
the set of terms inductively like so:

- Each variable v ∈ V is a term.
- Each constant c ∈ C is a term.
- If f ∈ F is a function symbol with arity n and t₁, t₂, ..., tₙ are terms,
  then f(t₁, t₂, ..., tₙ) is also a term.
- Nothing else is a term.
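Terms, too, can be encoded as nested tuples and processed by structural
recursion. The following sketch jumps slightly ahead: evaluating a term
requires interpreting its function symbols, which we only formalise later
(the choice of integers as the domain is arbitrary, purely for illustration).

```python
# A sketch of terms as nested tuples: a variable is a string, a
# constant is an integer, and a function application is a tuple
# (f, t1, ..., tn). Evaluation recurses over the inductive structure.
def eval_term(term, functions, valuation):
    if isinstance(term, str):       # variable: look up its value
        return valuation[term]
    if isinstance(term, int):       # constant: denotes itself here
        return term
    symbol, *args = term            # function application
    return functions[symbol](*(eval_term(a, functions, valuation) for a in args))

# the term add(x, mul(y, 2)) under x = 1, y = 3, over the integers
funcs = {"add": lambda a, b: a + b, "mul": lambda a, b: a * b}
print(eval_term(("add", "x", ("mul", "y", 2)), funcs, {"x": 1, "y": 3}))  # → 7
```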

We cannot simply combine propositional logic with terms, though, as we want
to make sure that strings like f(x) ∧ g(y) are not allowed (what would even
be the meaning of such 'formulae'?). To solve this technical obstacle, we
define, for a given set P of *predicates* (which are, too, associated with
their arities) and the set of terms from above, the set of (first-order)
*atomic formulae* (you guessed it) inductively:

- If p ∈ P is a predicate of arity n and t₁, t₂, ..., tₙ are terms, then
  p(t₁, t₂, ..., tₙ) is an atomic formula.
- Nothing else is an atomic formula.

There is a small problem with equality. The usual treatment is to hardwire
the symbol = into our language, i.e. it is not a predicate, but acts like one
(notably, we can build atomic formulae using equality). The reason for this
special treatment is mainly technical and has to do with the fact that we
require some special properties of equality. Notably, it has to be a
congruence over terms, which means that from x = y follows f(x) = f(y) for
any function symbol f.

But wait, you might think, we have forgotten *quantifiers*, the most
important feature of first-order logic! While it is true that quantifiers are
of prime importance in mathematical logic (and indeed in usual mathematical
practice), they are actually not as important for our applications (after
all, there was no mention of any quantification in path conditions). But
sure, quantifiers exist, and they unfortunately disrupt our simple definition
using propositional logic over atomic formulae. We therefore have to extend
the set of our formula formation rules by:

- If x is a variable and φ a formula, then (∀x)φ is also a formula.
- If x is a variable and φ a formula, then (∃x)φ is also a formula.

First-order formulae are interpreted over a *domain of discourse* (or
universe) U, and thus variables can be substituted with values of U.
Similarly, we have to assign a meaning to our function symbols and constants
(each symbol of arity n being mapped to a function of type Uⁿ → U) and
predicates (each mapped to a relation on U). Taken together, U and both
mappings are called a *structure* (while the tuple of function symbols,
constants and predicates is sometimes called a (first-order) *language*).

Similarly to propositional logic, we can evaluate a first-order formula φ,
given a structure and a valuation v which maps variables to values of U, to a
value in {0, 1}. The definition is again recursive, though this time a bit
more involved, since we have to work both with booleans and with objects of U
(and recurse across the inductive definitions of formulae, atomic formulae
and terms). We will not show the construction here, as it is a simple
extension of all we have seen so far. The only notable points are that a
predicate is evaluated as true exactly when its associated relation relates
the evaluated values of its terms, and that the quantifiers are defined using
*substitution* φ[x := u], where x is a variable free (i.e. unquantified) in φ
and u is an element of U.

# quick tour of first-order logics

We will now take a quick look at some concrete first-order languages and
logics (basically, languages together with an interpretation) that we can
spot in the wild. One of the simplest (though what this means is debatable)
is the language of set theory, which includes just one binary predicate ∈ and
no function symbols. Every other (non-propositional) symbol that is used
within set theory can be defined internally (in terms of ∈). Other well-known
examples are the logics of the integers, either with addition only, or with
addition and multiplication (as function symbols), and the corresponding
logics of the real numbers.
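As a small taste of such internal definitions, the subset relation of set
theory is not a primitive: it can be introduced as a mere abbreviation, using
only the ∈ predicate, a quantifier and the connectives we already have:

```
x ⊆ y   stands for   (∀z)(¬(z ∈ x) ∨ z ∈ y)
```

That is, x is a subset of y exactly when every element of x is also an
element of y.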

For our purposes, however, the most important example is the logic of
fixed-size *bitvectors*, which is a logical representation of *two's
complement arithmetic*. To make things technically simpler (i.e. to hand-wave
away the issue of many-sorted logic), let us consider just the set of 32-bit
binary vectors (i.e. sequences of exactly 32 binary digits). Our language
consists of function symbols representing arithmetic (`add`, `sub`, `mul`,
`neg`, ...) and bitwise (`and`, `or`, `xor`, `cpl`, `shl`, ...) operations on
such bitvectors, comparison predicates for unsigned and signed 'less than'
(remember, equality is present by default) and constants for each
representable natural number, that is 0 to 2³² − 1. The interpretation of all
these elements is the expected one, i.e. `add(1, 2)` stands for 3, `neg(12)`
evaluates to 2³² − 12 (in two's complement) and each operation has
*wrap-around* overflow semantics.

# satisfiability modulo theories

We will end today's discussion with the problem of satisfiability within a
first-order structure, also called satisfiability modulo theories (or SMT for
short). However, we will restrict ourselves to formulae without quantifiers,
as quantified satisfiability is relatively problematic (enough to be an
active research area) and not really needed for our purposes. Recall, then,
that we can represent a quantifier-free first-order formula as a
propositional formula over a set of first-order atomic formulae.
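This propositional abstraction can be sketched concretely, reusing the
nested-tuple encoding from before and representing first-order atomic
formulae as opaque strings (an arbitrary choice for illustration):

```python
# A sketch of computing the propositional (SAT) skeleton of a
# quantifier-free formula: every first-order atomic formula (an opaque
# string here) is replaced by a fresh atomic proposition p0, p1, ...
def skeleton(formula, table):
    if isinstance(formula, tuple):              # a connective: recurse
        op, *subs = formula
        return (op,) + tuple(skeleton(sub, table) for sub in subs)
    if formula not in table:                    # a first-order atom
        table[formula] = f"p{len(table)}"
    return table[formula]

# x < y ∧ ¬(x = y) abstracts to p0 ∧ ¬p1
table = {}
phi = ("and", "x < y", ("not", "x = y"))
print(skeleton(phi, table))   # → ('and', 'p0', ('not', 'p1'))
print(table)                  # → {'x < y': 'p0', 'x = y': 'p1'}
```

The table remembers which proposition stands for which atomic formula, which
is exactly the information a solver needs to translate a propositional model
back into a set of first-order constraints.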

In general, *SMT solvers* build upon SAT solvers and extend them in a manner
that depends on the structure we are working with. The easiest approach,
often called *eager* SMT solving in the literature, is to abandon any
first-order abstraction and translate each atomic first-order formula into a
propositional formula. This approach is suitable for the bitvector logic,
mainly due to the fact that we are working with *finite* objects.

The idea is rather simple: we represent each individual bit of each bitvector
by its own atomic proposition (or by the constant 0 or 1). This translation
is straightforward for equality and the bitwise operations. For example, the
bitwise `z = and(x, y)` can simply be translated to a conjunction of formulae
stating that the i-th bit of z is the conjunction of the i-th bits of x and y
(for every bit i of the arguments). The construction is a bit more
complicated for addition, where the common approach is to implement a chained
binary *adder circuit* in propositional logic, which results in a
propositional formula linear in size with respect to the bit width of the
summands. Multiplication is much harder, though, and the blowup in size can
be *quadratic* (in a typical implementation, a single 64-bit multiplication
adds 64² new variables and tens of thousands of clauses).

Unfortunately, not every logic can be solved in this manner, and even when it
can be, the blowup of formula sizes is often intractable. This leads us to
the so-called *lazy* approach to SMT solving, which employs, alongside a
general SAT solver, a specialised *T-solver* that is able to solve
conjunctions of atomic formulae over a specific structure.

The lazy approach itself comes in two variants. The *offline* variant works
by first running a SAT solver on the formula, interpreting each atomic
formula as an atomic proposition, and only if this *SAT skeleton* is
satisfiable, runs a T-solver in order to find a satisfying valuation for the
set of 'true' atomic formulae. If this fails, the process repeats with a
different model of the SAT skeleton. The *online* variant, itself called
*DPLL(T)*, is an extension of the aforementioned DPLL algorithm, which builds
a potential model of a formula incrementally. In this case, the interaction
between the SAT solver and the T-solver is more direct, and the T-solver is
consulted on the fly as the propositional valuation is being built.

You may be thinking now that all this is potentially very expensive. After
all, even propositional SAT was already NP-complete. Indeed, SMT solving is
in general at least *NP-hard*, but it can be, depending on the logic in
question, much worse still, up to *undecidable*. Even such a simple logic as
that of the integers with addition and multiplication is, in fact, already
undecidable (though, funnily enough, the same logic over the reals is
decidable). Luckily for us, bitvectors are decidable, and even for
undecidable logics, SMT solvers are often available, though with the added
caveat of a possible *unknown* result (which unfortunately already occurs for
relatively simple formulae).

# conclusion

And that is all there is about logic, at least for today. Though the discussion
was long, it will surely come in handy many times in the future.

We will conclude with some problems to think about:

- What kind of simplifying heuristics can DPLL make? Remember that the formula must be in CNF.
- Give formal semantics of the universal quantifier using substitution.
- Give an example of at least two different structures for the language of set theory.
- Give an example of a formula that is valid (i.e. true in every valuation) when interpreted over the real numbers, but not valid over the integers.
- Give an example of a formula that is valid over the integers and an intuitively similar formula that is not valid over 32-bit bitvectors.