logic and satisfiability solving
In last week's introduction to symbolic execution, we left several
questions unanswered. Among them was the problem of determining whether a
given path condition (which we defined as a set of boolean terms in the
source language) is satisfiable, and this will be the topic of today's
discussion.
In order to talk about satisfiability, our simplistic view of path conditions
has to give way to a more concrete and formal description, namely that of
logical formulae. Before we discuss satisfiability in this context, though, we
need to define some basic terminology. While this might seem like a needless
detour at first, we will soon see the benefits of this choice of abstraction,
whose usefulness greatly transcends symbolic execution.
propositional logic
Our quick tour of logic starts at propositional logic, which is probably the
simplest generally useful logical system you can think of. Propositional
formulae consist of atomic propositions, which act as variables over the
domain {0, 1}, and means of their combination, i.e. logical connectives,
such as negation (¬), conjunction (∧), and disjunction (∨).
The set Φ of propositional formulae over a set A of atomic propositions
(that is intentionally left arbitrary) is then generated by the following
rules (we will not discuss technical matters of precedence and bracketing, as
those are intuitive enough):
- Each atomic proposition is a formula.
- If φ and ψ are formulae, then ¬φ, φ ∧ ψ and φ ∨ ψ are also formulae.
- Nothing else is a formula.
The base case of our construction are atomic propositions. Keeping in mind that
those range over {0, 1}, we can, given a valuation v that maps each
atomic proposition to 0 or 1, evaluate a formula φ recursively. The
evaluation function (which we will call interpretation from now on), named
⟦·⟧, is defined like this:
formula | interpretation
a | v(a)
¬φ | ~⟦φ⟧
φ ∧ ψ | ⟦φ⟧ ⊓ ⟦ψ⟧
φ ∨ ψ | ⟦φ⟧ ⊔ ⟦ψ⟧
This construction depends on boolean functions ~, ⊓ and ⊔, which act
not on formulae, but on the values 0 and 1, according to the rules summarized
in the following table:
a | b | ~a | a ⊓ b | a ⊔ b |
0 | 0 | 1 | 0 | 0 |
0 | 1 | 1 | 0 | 1 |
1 | 0 | 0 | 0 | 1 |
1 | 1 | 0 | 1 | 1 |
And after all this setup, we can finally say that φ is
satisfiable if there is a valuation v such that ⟦φ⟧ = 1. Otherwise it is
unsatisfiable. The satisfying valuation v is also called a model of φ.
At last, we have definitions whose names coincide with those that we
defined last week in the context of pure symbolic execution. We are not
quite there yet, though, as propositional logic is simply too weak to model
real-world (or real-mathematics, if you prefer) phenomena. As we can see from
its syntax and semantics, it is a logic of pure truth and falsity: there are no
objects other than boolean values. Notably, we lack terms, which usually talk
about more complex objects.
propositional satisfiability solving
Before we extend our logic, though, let us look at how we would go about
determining, for a given propositional formula φ, whether it is satisfiable
and, if so, finding a model of φ. Since a propositional formula is finite and
thus contains at most a finite number of atomic propositions, there is a simple
algorithm for this problem: simply try all possible valuations v and for each
compute the value of ⟦φ⟧. If it is 1, we have a model. If we exhaust all
possible valuations, the formula is unsatisfiable.
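The naive algorithm fits in a few lines. A sketch, where the nested-tuple representation of formulae is our own illustrative choice (atomic propositions are strings; `("not", f)`, `("and", f, g)` and `("or", f, g)` apply the connectives):

```python
from itertools import product

def atoms(formula):
    """The set of atomic propositions occurring in a formula."""
    if isinstance(formula, str):
        return {formula}
    return set().union(*(atoms(sub) for sub in formula[1:]))

def interpret(formula, valuation):
    """The interpretation of a formula under a valuation (dict: atom -> 0/1)."""
    if isinstance(formula, str):
        return valuation[formula]
    op = formula[0]
    if op == "not":
        return 1 - interpret(formula[1], valuation)
    if op == "and":
        return interpret(formula[1], valuation) & interpret(formula[2], valuation)
    if op == "or":
        return interpret(formula[1], valuation) | interpret(formula[2], valuation)
    raise ValueError(f"unknown connective {op!r}")

def naive_sat(formula):
    """Try all 2**n valuations; return a model, or None if unsatisfiable."""
    props = sorted(atoms(formula))
    for bits in product((0, 1), repeat=len(props)):
        valuation = dict(zip(props, bits))
        if interpret(formula, valuation) == 1:
            return valuation
    return None
```

For instance, `naive_sat(("and", "a", ("not", "a")))` returns `None`, since no valuation makes a contradiction true.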
Now of course, this algorithm is not very efficient, since for a formula with
n atomic propositions, there are 2^n possible valuations. Unfortunately, an
exponential algorithm is most probably the best we can hope for, since this
problem (often called SAT) is a well-known (even prototypical) NP-complete
problem. Nonetheless, there is a range of programs for this exact problem (so-called
SAT solvers) and they are practically usable for most 'ordinary'
formulae.
SAT solvers don't use our naive algorithm, though. They often implement a more
practical (though still exponential in the worst case) algorithm called DPLL,
which builds a valuation incrementally by ordering atomic propositions and then
choosing their values on the fly, simplifying the formula after each choice (e.g.
by eliminating trivially true parts) and backtracking when they reach a dead
end (i.e. the resulting formula becomes unsatisfiable). Additionally, heuristics
are used to find atomic propositions whose values can be determined
quickly, along with many other improvements.
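A minimal sketch of DPLL, assuming the input is already in the conjunctive normal form discussed next, represented DIMACS-style as lists of integer literals (a positive integer stands for an atomic proposition, a negative one for its negation); real solvers add the ordering heuristics mentioned above:

```python
def unit_propagate(clauses, assignment):
    """Repeatedly assign literals forced by unit clauses, simplifying as we go."""
    changed = True
    while changed:
        changed = False
        simplified = []
        for clause in clauses:
            if any(lit in assignment for lit in clause):
                continue                            # clause already satisfied
            rest = [lit for lit in clause if -lit not in assignment]
            if not rest:
                return None, None                   # empty clause: dead end
            if len(rest) == 1:
                assignment = assignment | {rest[0]}  # unit clause forces a value
                changed = True
            else:
                simplified.append(rest)
        clauses = simplified
    return clauses, assignment

def dpll(clauses, assignment=frozenset()):
    """Return a set of true literals satisfying the CNF, or None if unsatisfiable."""
    clauses, assignment = unit_propagate(clauses, assignment)
    if clauses is None:
        return None                                 # backtrack
    if not clauses:
        return assignment                           # all clauses satisfied
    lit = clauses[0][0]                             # pick an unassigned literal
    return (dpll(clauses, assignment | {lit})       # try it true, ...
            or dpll(clauses, assignment | {-lit}))  # ... then false
```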
A downside of DPLL is that it only natively works with formulae in a specific
shape called conjunctive normal form (CNF). A formula φ is in CNF if it
is a conjunction of clauses, i.e. φ = C₁ ∧ C₂ ∧ ⋯ ∧ Cₙ, where each clause
Cᵢ is a disjunction of literals and each literal is a possibly
negated atomic proposition. Every formula can be translated into an equivalent
(i.e. having exactly the same models) formula in CNF. However, the size of the
resulting formula might be exponential when compared to the original formula.
Not all is lost, though. If we do not require the original and the resulting
formulae to be equivalent, but merely equisatisfiable (which means that the
original formula is satisfiable exactly when the resulting one is satisfiable),
there is a translation with only linear size blowup. Equivalence is not
preserved in this translation, as the resulting formula contains newly added
atomic propositions.
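The linear equisatisfiable translation is usually attributed to Tseitin. A sketch, reusing a nested-tuple formula representation (strings are atoms; `("not", f)`, `("and", f, g)`, `("or", f, g)` apply the connectives) and emitting DIMACS-style clauses; the negation case uses the common shortcut of flipping the literal instead of introducing a fresh proposition:

```python
from itertools import count

def tseitin(formula):
    """Translate a formula into an equisatisfiable list of clauses over
    integer literals, introducing a fresh proposition per connective."""
    numbering = {}                   # atomic proposition -> variable number
    fresh = count(1)
    clauses = []

    def var(atom):
        if atom not in numbering:
            numbering[atom] = next(fresh)
        return numbering[atom]

    def encode(f):
        """Return a literal standing for subformula f, emitting defining clauses."""
        if isinstance(f, str):
            return var(f)
        if f[0] == "not":            # shortcut: negate the literal
            return -encode(f[1])
        a, b = encode(f[1]), encode(f[2])
        x = next(fresh)              # fresh proposition naming this node
        if f[0] == "and":            # clauses for x <-> (a and b)
            clauses.extend([[-x, a], [-x, b], [x, -a, -b]])
        else:                        # clauses for x <-> (a or b)
            clauses.extend([[-x, a, b], [x, -a], [x, -b]])
        return x

    clauses.append([encode(formula)])    # finally, assert the whole formula
    return clauses, numbering
```

Each connective contributes a constant number of clauses, so the output is linear in the size of the input, at the cost of the fresh propositions that break equivalence.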
Alternatively, solvers can be built around a specialised data structure called
an (ordered) binary decision diagram, which can be used to represent
propositional formulae and supports reasonably efficient operations for formula
composition, negation and satisfiability checking. This is, however, beyond
the scope of today's discussion.
first-order logic
Let us now return to the problem of representing terms in logical formulae.
We have a problem, though: what a term is depends on our problem domain. Even
within our motivating example of symbolic execution, each programming language
has different semantics, different data types and thus also different
operations. We will, therefore, start by building an abstract framework of
terms that is independent of concrete semantics (and to a degree, syntax as
well).
The basic building blocks of our terms are variables (somewhat similar to the
atomic propositions that we have seen before), constants and their means of
combination, that is, function symbols. To each function symbol f, we
associate a natural number ar(f) called the arity of f. For given sets
V, C and F of variables, constants and function symbols, respectively,
we define the set of terms T inductively like so:
- Each variable is a term.
- Each constant is a term.
- If f is a function symbol with arity n and t₁, t₂, ..., tₙ are terms, then f(t₁, t₂, ..., tₙ) is also a term.
- Nothing else is a term.
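These formation rules translate directly into a membership check. A sketch, where the concrete sets V, C, F and the arities are illustrative choices of our own, not fixed by the text:

```python
VARIABLES = {"x", "y"}               # the set V
CONSTANTS = {"zero"}                 # the set C
ARITY = {"succ": 1, "plus": 2}       # the set F with the arity function ar

def is_term(t):
    """Membership test for the inductively defined set of terms: variables
    and constants are strings, a compound term is a tuple (f, t1, ..., tn)."""
    if isinstance(t, str):
        return t in VARIABLES or t in CONSTANTS
    if isinstance(t, tuple) and t and t[0] in ARITY:
        return len(t) - 1 == ARITY[t[0]] and all(is_term(s) for s in t[1:])
    return False                     # nothing else is a term
```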
We cannot simply combine propositional logic with terms, though, as we want to
make sure that strings like f(x) ∧ y are not allowed (what would even be the
meaning of such 'formulae'?). To solve this technical obstacle, we define, for
a given set P of predicates (which are, too, associated with their arities)
and terms T, the set of (first-order) atomic formulae (you
guessed it) inductively:
- If p is a predicate of arity n and t₁, t₂, ..., tₙ are terms, then p(t₁, t₂, ..., tₙ) is an atomic formula.
- Nothing else is an atomic formula.
There is a small problem with equality. The usual treatment is to hardwire the
symbol = into our language, i.e. it is not a predicate, but acts like one
(notably, we can build atomic formulae using equality). The reason for this
special treatment is mainly technical and has to do with the fact that we
require some special properties from equality. Notably, it has to be a
congruence over terms, which means that from s₁ = t₁, ..., sₙ = tₙ follows
f(s₁, ..., sₙ) = f(t₁, ..., tₙ) for any function symbol f.
But wait, you might think, we have forgotten quantifiers, the most important
feature of first-order logic! But while it is true that quantifiers are of
prime importance in mathematical logic (and indeed usual mathematical
practice), they are actually not as important for our applications (after all,
there was no mention of any quantification in path conditions). But sure,
quantifiers exist and they unfortunately disrupt our simple definition using
propositional logic over atomic formulae.
We therefore have to extend the set of our formula formation rules by:
- If x is a variable and φ a formula, then (∀x. φ) is also a formula.
- If x is a variable and φ a formula, then (∃x. φ) is also a formula.
Similarly to propositional logic, we can evaluate a first-order formula φ,
given a structure S and a valuation v which maps variables to values of
S, to a value ⟦φ⟧ ∈ {0, 1}. The definition is again recursive, though
this time a bit more involved, since we have to work both with booleans and
objects of S (and recurse across the inductive definitions of formulae,
atomic formulae and terms). We will not show the construction here, as it is
a simple extension of all we have seen so far. The only notable points are that
a predicate is evaluated as true exactly when its associated relation relates
the evaluated values of its terms, and that the quantifiers are defined using
substitution φ[x ↦ c], where x is a free variable (i.e.
unquantified in φ) and c is an element of S.
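Over a finite structure, this evaluation can be carried out directly. A sketch with an assumed toy structure (the integers 0–3 with the less-than relation as the only predicate); for simplicity it extends the valuation instead of substituting syntactically, which amounts to the same thing here:

```python
DOMAIN = range(4)                        # the universe of our toy structure S
RELATIONS = {"lt": lambda a, b: a < b}   # interpretation of predicate symbols

def term_value(t, valuation):
    """A term is an integer constant or a variable name."""
    return t if isinstance(t, int) else valuation[t]

def holds(formula, valuation={}):
    """Evaluate a first-order formula over the finite structure above."""
    op = formula[0]
    if op in RELATIONS:                  # atomic formula: apply the relation
        return RELATIONS[op](*(term_value(t, valuation) for t in formula[1:]))
    if op == "not":
        return not holds(formula[1], valuation)
    if op == "and":
        return holds(formula[1], valuation) and holds(formula[2], valuation)
    if op == "or":
        return holds(formula[1], valuation) or holds(formula[2], valuation)
    if op in ("forall", "exists"):       # quantifiers range over DOMAIN
        cases = (holds(formula[2], {**valuation, formula[1]: c}) for c in DOMAIN)
        return all(cases) if op == "forall" else any(cases)
    raise ValueError(f"unknown connective {op!r}")
```

For example, (∀x. ∃y. x < y) fails over this finite structure, since its largest element has nothing above it.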
quick tour of first-order logics
We will now take a quick look at some concrete first-order languages and logics
(basically languages together with an interpretation) that we can spot in the
wild. One of the simplest (though what this means is debatable) is the language
of set theory, which includes just one binary predicate ∈ and no function
symbols. Every other (non-propositional) symbol that is used within set theory
can be defined internally (in terms of ∈). Other well-known examples are
the logics of the integers, either with addition only, or with addition and
multiplication (as function symbols), and the corresponding logics of the real
numbers.
For our purposes, however, the most important example is the logic of fixed-size
bitvectors, which is a logical representation of two's complement
arithmetic. To make things technically simpler (i.e. to hand-wave away the
issue of many-sorted logic), let us consider just a set B of 32-bit binary
vectors (i.e. sequences of exactly 32 binary digits). Our language consists of
function symbols representing arithmetic (add, sub, mul, neg, ...) and
bitwise (and, or, xor, cpl, shl, ...) operations on such bitvectors,
predicates < and ≤ (remember, equality is present by default) and constants
for each representable natural number, that is 0 to 2³² − 1. The
interpretation of all these elements is the expected one, i.e. add stands for
addition modulo 2³², neg(12) evaluates to the encoding of −12 (in two's
complement) and each operation has wrap-around overflow semantics.
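The wrap-around semantics is easy to mimic in ordinary code by masking every result to 32 bits. A sketch (the function names mirror the symbols above):

```python
MASK = 0xFFFFFFFF                     # 32 ones: results are kept to 32 bits

def bv(n):
    """The 32-bit bitvector encoding an integer (two's complement)."""
    return n & MASK

def add(x, y): return (x + y) & MASK  # wrap-around overflow semantics
def sub(x, y): return (x - y) & MASK
def mul(x, y): return (x * y) & MASK
def neg(x):    return (-x) & MASK
```

Here `neg(12)` yields `0xFFFFFFF4`, the two's complement encoding of −12, and `add(0xFFFFFFFF, 1)` wraps around to 0.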
satisfiability modulo theories
We will end today's discussion with the problem of satisfiability within a
first-order structure, also called satisfiability modulo theories (or SMT for
short). However, we will restrict ourselves to formulae without quantifiers, as
quantified satisfiability is relatively problematic (enough to be an active
research area) and not really needed for our purposes. Recall then, that we can
represent a quantifier-free first-order formula as a propositional formula over
a set of first-order atomic formulae.
In general, SMT solvers build upon SAT solvers and extend them in a manner
that is dependent on the structure we are working with. The easiest approach,
often called eager SMT solving in the literature, is to abandon any
first-order abstraction and translate each atomic first-order formula into a
propositional formula. This approach is suitable for the bitvector logic,
mainly due to the fact that we are working with finite objects.
The idea is rather simple: we represent each individual bit of each bitvector
by its own atomic proposition (or by the constant 0 or 1). This translation is
straightforward for equality and bitwise operations. For example, bitwise
z = and(x, y) can simply be translated to a conjunction of formulae
zᵢ ↔ (xᵢ ∧ yᵢ) (one for every bit i of the arguments). The construction is a bit more
complicated for addition, where the common approach is to implement a chained
binary adder circuit in propositional logic, which results in a propositional
formula linear in size to the bit width of the summands. Multiplication is
much harder though, and the blowup in size can be quadratic (in a typical
implementation, a single 64-bit multiplication adds 64² new variables and
tens of thousands of clauses).
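The shape of these translations can be illustrated by running the same circuits on concrete 0/1 values (in the actual encoding, the bits would be atomic propositions and &, |, ^ the connectives); a small width keeps an exhaustive check against machine arithmetic cheap:

```python
from itertools import product

WIDTH = 4                                # small width for exhaustive checking

def to_bits(n):
    """Little-endian list of WIDTH bits."""
    return [(n >> i) & 1 for i in range(WIDTH)]

def blast_and(xbits, ybits):
    """Bitwise and: one formula z_i <-> x_i AND y_i per bit."""
    return [xi & yi for xi, yi in zip(xbits, ybits)]

def blast_add(xbits, ybits):
    """A chained (ripple-carry) adder: each full adder yields one sum bit
    and a carry; dropping the final carry gives wrap-around semantics."""
    zbits, carry = [], 0
    for xi, yi in zip(xbits, ybits):
        zbits.append(xi ^ yi ^ carry)
        carry = (xi & yi) | (carry & (xi ^ yi))
    return zbits

# exhaustive check against ordinary machine arithmetic
for x, y in product(range(2 ** WIDTH), repeat=2):
    assert blast_and(to_bits(x), to_bits(y)) == to_bits(x & y)
    assert blast_add(to_bits(x), to_bits(y)) == to_bits((x + y) % 2 ** WIDTH)
```

Note that the adder uses a constant number of connectives per bit, which is where the linear size of the addition formula comes from.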
Unfortunately, not every logic can be solved in this manner, and even if it can
be, the blowup of formula sizes is often intractable. This leads us to the
so-called lazy approach to SMT solving, which employs, alongside a general
SAT solver, a specialised T-solver that is able to solve conjunctions of
atomic formulae over a specific structure.
The lazy approach itself comes in two variants. The offline variant works by
first running a SAT solver on the formula, interpreting each atomic formula as
an atomic proposition, and only if the SAT skeleton is satisfiable, runs a
T-solver in order to find a satisfying valuation for the set of ‘true’ atomic
formulae. If this fails, the process repeats with a different model of the SAT
skeleton. The online variant, itself called DPLL(T), is an extension of the
aforementioned DPLL algorithm, which builds a potential model of a formula
incrementally. In this case, the interaction between the SAT solver and the
T-solver is more direct, and the T-solver is consulted on the fly as the
propositional valuation is being built.
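The offline loop can be sketched on an assumed toy structure of our own: atomic formulae constrain a single integer x to lie below or above a constant, the T-solver intersects the corresponding intervals, and the SAT skeleton is brute-forced (a real implementation would call an actual SAT solver):

```python
from itertools import product

def t_solver(literals, atoms):
    """Decide a conjunction of (possibly negated) atoms ("lt", c) meaning
    x < c and ("gt", c) meaning x > c, by intersecting integer intervals."""
    lo, hi = float("-inf"), float("inf")
    for lit in literals:
        kind, c = atoms[abs(lit) - 1]
        if (kind == "lt") == (lit > 0):          # an upper bound on x
            hi = min(hi, c - 1 if kind == "lt" else c)
        else:                                    # a lower bound on x
            lo = max(lo, c + 1 if kind == "gt" else c)
    return lo <= hi

def offline_lazy(clauses, atoms):
    """Enumerate models of the propositional skeleton (brute force stands in
    for a SAT solver) and accept the first T-consistent one."""
    n = len(atoms)
    for bits in product((False, True), repeat=n):
        literals = [i + 1 if bits[i] else -(i + 1) for i in range(n)]
        skeleton_ok = all(any((lit > 0) == bits[abs(lit) - 1] for lit in clause)
                          for clause in clauses)
        if skeleton_ok and t_solver(literals, atoms):
            return literals                      # a model of skeleton and theory
    return None
```

With atoms (x < 3) and (x > 5), the conjunction of both is propositionally fine but rejected by the T-solver, while their disjunction succeeds: exactly the division of labour described above.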
You may be thinking now that all this is potentially very expensive. After all,
even propositional SAT was already NP-complete. Indeed, SMT solving is in
general at least NP-hard, but can be, depending on the logic in question,
much worse still, up to undecidable. Even such simple logics as the integers
with addition and multiplication are, in fact, already undecidable (though,
funnily enough, the same logic with the reals is decidable). Luckily for us,
bitvectors are decidable, and even in cases of undecidable logics, SMT solvers
are often available, though with the added caveat of possible unknown result
(which unfortunately often already occurs for relatively simple formulae).
conclusion
And that is all there is about logic, at least for today. Though the discussion
was long, it will surely come in handy many times in the future.
We will conclude with some problems to think about:
- What kinds of simplifying heuristics can DPLL use? Remember that the formula must be in CNF.
- Give formal semantics of the universal quantifier using substitution.
- Give an example of at least two different structures for the language of set theory.
- Give an example of a formula that is valid (i.e. true in every valuation) when interpreted over the real numbers, but not valid over the integers.
- Give an example of a formula that is valid over the integers and an intuitively similar formula that is not valid over 32-bit bitvectors.