logic and satisfiability solving

In the introduction to symbolic execution last week, we left several questions unanswered. Among them was the problem of determining whether a given path condition (which we defined as a set of boolean terms in the source language) is satisfiable, and this will be the topic of today's discussion.
In order to talk about satisfiability, our simplistic view of path conditions has to give way to a more concrete and formal description, namely that of logical formulae. Before we discuss satisfiability in this context, though, we need to define some basic terminology. While this might seem like a waste of time at first, we will soon see the benefits of this choice of abstraction, whose usefulness greatly transcends symbolic execution.

propositional logic

Our quick tour of logic starts at propositional logic, which is probably the simplest generally useful logical system you can think of. Propositional formulae consist of atomic propositions, which act as variables over the domain {0, 1}, and means of their combination, i.e. logical connectives, such as negation (¬), conjunction (∧) and disjunction (∨).
The set P of propositional formulae over a set A of atomic propositions (which is intentionally left arbitrary) is then generated by the following rules (we will not discuss technical matters of precedence and bracketing, as those are intuitive enough):
  1. Each atomic proposition is a formula.
  2. If φ and ψ are formulae, then ¬φ, φ ∧ ψ and φ ∨ ψ are also formulae.
  3. Nothing else is a formula.
Our set of allowed connectives is just one out of many possible choices. What is important is that a formula is just a finite formal string with no intrinsic meaning, and that the set of formulae is defined inductively. This latter point allows us to define operations on P using structural recursion, and indeed, we can recursively define the meaning (or semantics) of propositional formulae.
The base case of our construction are atomic propositions. Keeping in mind that those range over {0, 1}, we can, given a valuation v that maps each atomic proposition to 0 or 1, evaluate a formula recursively. The evaluation function (which we will call interpretation from now on), written ⟦·⟧, is defined like this:
formula    interpretation
a ∈ A      v(a)
¬φ         ~⟦φ⟧
φ ∧ ψ      ⟦φ⟧ ⊓ ⟦ψ⟧
φ ∨ ψ      ⟦φ⟧ ⊔ ⟦ψ⟧
This construction depends on the boolean functions ~, ⊓ and ⊔, which act not on formulae, but on the values 0 and 1, according to the rules summarized in the following table:
a  b  |  ~a  a ⊓ b  a ⊔ b
0  0  |   1    0      0
0  1  |   1    0      1
1  0  |   0    0      1
1  1  |   0    1      1
And after all this setup, we can finally state that φ is said to be satisfiable if there is a valuation v such that ⟦φ⟧ = 1. Otherwise it is unsatisfiable. A satisfying valuation is also called a model of φ.
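To make the structural recursion concrete, here is a minimal Python sketch; the tuple representation and the name evaluate are our own choices, not anything standard. Note how ~, ⊓ and ⊔ on the values 0 and 1 become 1 - x, min and max:

    # A formula is a nested tuple: ('atom', a), ('not', f), ('and', f, g)
    # or ('or', f, g); a valuation is a dict mapping atom names to 0 or 1.

    def evaluate(formula, valuation):
        """Compute the interpretation of a formula under a valuation."""
        match formula:
            case ('atom', name):
                return valuation[name]    # base case: look the atom up
            case ('not', f):
                return 1 - evaluate(f, valuation)
            case ('and', f, g):
                return min(evaluate(f, valuation), evaluate(g, valuation))
            case ('or', f, g):
                return max(evaluate(f, valuation), evaluate(g, valuation))

    # (a ∨ b) ∧ ¬a evaluates to 1 under the valuation {a ↦ 0, b ↦ 1}:
    phi = ('and', ('or', ('atom', 'a'), ('atom', 'b')), ('not', ('atom', 'a')))
    assert evaluate(phi, {'a': 0, 'b': 1}) == 1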
At last, we have definitions whose names coincide with those that we defined last week in the context of pure symbolic execution. We are not quite there yet, though, as propositional logic is simply too weak to model real-world (or real-mathematics, if you prefer) phenomena. As we see from its syntax and semantics, it is a logic of pure truth and falsity: there are no objects other than boolean values. Notably, we lack terms, which usually talk about more complex objects.

propositional satisfiability solving

Before we extend our logic, though, let us look at how we would go about determining, for a given propositional formula φ, whether it is satisfiable and, if so, finding a model of φ. Since a propositional formula is finite and thus contains only finitely many atomic propositions, there is a simple algorithm for this problem: simply try all possible valuations and for each compute the value of φ. If it is 1, we have a model. If we exhaust all possible valuations, the formula is unsatisfiable.
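In code, the naive algorithm is only a few lines. This sketch reuses the evaluate function and the formula phi from the propositional sketch above:

    from itertools import product

    def atoms(formula):
        """Collect the names of the atomic propositions in a formula."""
        if formula[0] == 'atom':
            return {formula[1]}
        return set().union(*(atoms(sub) for sub in formula[1:]))

    def naive_sat(formula):
        """Try every valuation; return a model, or None if unsatisfiable."""
        names = sorted(atoms(formula))
        for bits in product((0, 1), repeat=len(names)):
            valuation = dict(zip(names, bits))
            if evaluate(formula, valuation) == 1:
                return valuation
        return None

    print(naive_sat(phi))    # a model: {'a': 0, 'b': 1}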
Now of course, this algorithm is not very efficient, since for a formula with n atomic propositions, there are 2ⁿ possible valuations. Unfortunately, an exponential algorithm is most probably the best we can hope for, since this problem (often called SAT) is a well known (even prototypical) NP-complete problem. Nonetheless, there is a range of programs for this exact problem (so called SAT solvers) and they are practically usable for most ‘ordinary’ formulae.
SAT solvers don't use our naive algorithm, though. They often implement a more practical (though still exponential in the worst case) algorithm called DPLL, which builds a valuation incrementally by ordering the atomic propositions and then choosing their values on the fly, simplifying the formula after each choice (e.g. by eliminating trivially true parts) and backtracking upon reaching a dead end (i.e. when the resulting formula becomes unsatisfiable). Additionally, heuristics are used to find atomic propositions whose values can be determined quickly, along with many other improvements.
A downside of DPLL is that it natively works only with formulae of a specific shape called the conjunctive normal form (CNF). A formula is in CNF if it is a conjunction of clauses, i.e. C₁ ∧ C₂ ∧ ⋯ ∧ Cₙ, where each clause Cᵢ is a disjunction of literals and each literal is a possibly negated atomic proposition. Every formula can be translated into an equivalent (i.e. having exactly the same models) formula in CNF. However, the size of the resulting formula might be exponential compared to the original. Not all is lost, though: if we do not require the original and the resulting formulae to be equivalent, but merely equisatisfiable (meaning that the original formula is satisfiable exactly when the resulting one is), there is a translation with only a linear size blowup. Equivalence is not preserved by this translation, as the resulting formula contains newly added atomic propositions.
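To illustrate, here is a toy sketch of the core DPLL loop over CNF; the DIMACS-style encoding (a clause as a frozenset of nonzero integers, with -n standing for the negated atom n) is our own choice, and real solvers add decision heuristics, watched literals and clause learning on top of this skeleton:

    def dpll(clauses, assignment=()):
        """Return a tuple of literals satisfying all clauses, or None."""
        # Unit propagation: a single-literal clause forces that literal.
        units = [next(iter(c)) for c in clauses if len(c) == 1]
        while units:
            lit = units[0]
            assignment += (lit,)
            # Drop satisfied clauses, remove the falsified literal.
            clauses = [c - {-lit} for c in clauses if lit not in c]
            units = [next(iter(c)) for c in clauses if len(c) == 1]
        if not clauses:
            return assignment            # every clause satisfied: a model
        if frozenset() in clauses:
            return None                  # empty clause: dead end, backtrack
        lit = next(iter(clauses[0]))     # branch on some remaining literal
        result = dpll(clauses + [frozenset({lit})], assignment)
        if result is not None:
            return result
        return dpll(clauses + [frozenset({-lit})], assignment)

    # (a ∨ b) ∧ ¬a as clauses, with a encoded as 1 and b as 2:
    print(dpll([frozenset({1, 2}), frozenset({-1})]))    # (-1, 2)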
Alternatively, solvers can be built around a specialised data structure called the (ordered) binary decision diagram, which can represent propositional formulae and supports reasonably efficient operations for formula composition, negation and satisfiability checking. This is, however, beyond the scope of today's discussion.

first-order logic

Let us now return to the problem of representing terms in logical formulae. We have a problem, though: what a term is depends on our problem domain. Even within our motivating example of symbolic execution, each programming language has different semantics, different data types and thus also different operations. We will, therefore, start by building an abstract framework of terms that is independent of concrete semantics (and to a degree, syntax as well).
The basic building blocks of our terms are variables (somewhat similar to the atomic propositions we have seen before), constants, and their means of combination, that is, function symbols. To each function symbol f, we associate a natural number ar(f) called the arity of f. For given sets V, C and F of variables, constants and function symbols, respectively, we define the set T of terms inductively like so:
  1. Each variable is a term.
  2. Each constant is a term.
  3. If f is a function symbol with ar(f) = n and t₁, t₂, ..., tₙ are terms, f(t₁, t₂, ..., tₙ) is also a term.
  4. Nothing else is a term.
For example, T contains the term x + 1 (or, more formally, +(x, 1), though this is exactly the kind of syntactic detail we are not concerned with). Similarly to propositional formulae, terms are purely syntactic objects independent of any meaning, which we are free to define as we see fit, as we will see shortly.
We cannot simply combine propositional logic with terms, though, as we want to make sure that strings like x ∨ (y + 1) are not allowed (what would the meaning of such a ‘formula’ even be?). To overcome this technical obstacle, we define, for a given set of predicates (which, too, are associated with arities) and the set T of terms, the set of (first-order) atomic formulae (you guessed it) inductively:
  1. If p is a predicate of arity n and t₁, t₂, ..., tₙ are terms, p(t₁, t₂, ..., tₙ) is an atomic formula.
  2. Nothing else is an atomic formula.
And that is almost all there is to it. Our first try at defining first-order formulae is then to use the set of propositional formulae over first-order atomic formulae, which brings terms into our logical language. An example formula with terms and predicates is then (x + 1 < y) ∨ ¬(y = z).
There is a small problem with equality. The usual treatment is to hardwire the symbol = into our language, i.e. it is not a predicate, but acts like one (notably, we can build atomic formulae using equality). The reason for this special treatment is mainly technical and has to do with the fact that we require some special properties of equality. Notably, it has to be a congruence over terms, which means that from x = y follows f(x) = f(y) for any (for simplicity, unary) function symbol f.
But wait, you might think, we have forgotten quantifiers, the most important feature of first-order logic! While it is true that quantifiers are of prime importance in mathematical logic (and indeed in usual mathematical practice), they are actually not that important for our applications (after all, there was no mention of any quantification in path conditions). But sure, quantifiers exist, and they unfortunately disrupt our simple definition using propositional logic over atomic formulae.
We therefore have to extend the set of our formula formation rules by:
  1. If x is a variable and φ a formula, (∀x)φ is also a formula.
  2. If x is a variable and φ a formula, (∃x)φ is also a formula.
And that finally finishes our syntax of first-order logic. The good news is that the semantics should not be too hard to understand after all this setup. The most important point is that we view every formula as occurring within a domain of discourse (or universe) D, and thus variables can be substituted with values of D. Similarly, we have to assign a meaning to our function symbols and constants (each function symbol of arity n being mapped to a function of type Dⁿ → D, and each constant to a value of D) and predicates (each mapped to an n-ary relation on D). Taken together, D and both mappings are called a structure (while the tuple of function symbols, constants and predicates is sometimes called a (first-order) language).
Similarly to propositional logic, we can evaluate a first-order formula φ, given a structure and a valuation v which maps variables to values of D, to a value 0 or 1. The definition is again recursive, though this time a bit more involved, since we have to work both with booleans and with objects of D (and recurse across the inductive definitions of formulae, atomic formulae and terms). We will not show the full construction here, as it is a simple extension of all we have seen so far. The only notable points are that a predicate is evaluated as true exactly when its associated relation relates the evaluated values of its terms, and that the quantifiers are defined using substitution φ[x ↦ d], where x is a free variable (i.e. unquantified in φ) and d is an element of D.
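As a sketch of the term and atomic-formula part of this recursion (with our own encoding, over the structure of the integers with + and <), consider the following; the propositional layer can then reuse the evaluate function from earlier, treating each atomic formula as an atom:

    # A structure: interpretations of function symbols and predicates.
    FUNCTIONS  = {'+': lambda a, b: a + b}
    PREDICATES = {'<': lambda a, b: a < b, '=': lambda a, b: a == b}

    def eval_term(term, valuation):
        """Evaluate a term to an element of the domain (here: an int)."""
        if isinstance(term, str):        # a variable: look it up
            return valuation[term]
        if isinstance(term, int):        # a constant: its own value
            return term
        symbol, *args = term             # e.g. ('+', 'x', 1)
        return FUNCTIONS[symbol](*(eval_term(t, valuation) for t in args))

    def eval_atomic(atom, valuation):
        """True (1) iff the predicate's relation relates the term values."""
        symbol, *args = atom             # e.g. ('<', ('+', 'x', 1), 'y')
        values = (eval_term(t, valuation) for t in args)
        return int(PREDICATES[symbol](*values))

    print(eval_atomic(('<', ('+', 'x', 1), 'y'), {'x': 3, 'y': 5}))    # 1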

quick tour of first-order logics

We will now take a quick look at some concrete first-order languages and logics (basically languages together with an interpretation) that we can spot in the wild. One of the simplest (though what this means is debatable) is the language of set theory, which includes just one binary predicate ∈ and no function symbols. Every other (non-propositional) symbol used within set theory can be defined internally (in terms of ∈). Other well known examples are the logics of the integers, either with addition only, or with addition and multiplication (as function symbols), and the corresponding logics of the real numbers.
For our purposes, however, the most important example is the logic of fixed-size bitvectors, which is a logical representation of two's complement arithmetic. To make things technically simpler (i.e. to hand-wave away the issue of many-sorted logic), let us consider just the set of 32-bit binary vectors (i.e. sequences of exactly 32 binary digits). Our language consists of function symbols representing arithmetic (add, sub, mul, neg, ...) and bitwise (and, or, xor, cpl, shl, ...) operations on such bitvectors, predicates < and ≤ (remember, equality is present by default) and constants for each representable natural number, that is 0 to 2³² − 1. The interpretation of all these elements is the expected one, i.e. 5 stands for 00...0101, neg(12) evaluates to 2³² − 12 (in two's complement) and each operation has wrap-around overflow semantics.
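We can experiment with this logic directly in an SMT solver. The sketch below assumes the z3-solver Python package; any solver with bitvector support would do:

    from z3 import BitVec, Solver

    x = BitVec('x', 32)      # a 32-bit bitvector variable
    s = Solver()
    s.add(x + 1 == 0)        # wrap-around: satisfied by x = 2**32 - 1
    print(s.check())         # sat
    print(s.model()[x])      # 4294967295

    s = Solver()
    s.add(x > x + 1)         # impossible over the integers, but...
    print(s.check())         # sat: signed overflow at x = 2**31 - 1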

satisfiability modulo theories

We will end today's discussion with the problem of satisfiability within a first-order structure, also called satisfiability modulo theories (or SMT for short). However, we will restrict ourselves to formulae without quantifiers, as quantified satisfiability is relatively problematic (enough to be an active research area) and not really needed for our purposes. Recall, then, that we can represent a quantifier-free first-order formula as a propositional formula over a set of first-order atomic formulae.
In general, SMT solvers build upon SAT solvers and extend them in a manner that is dependent on the structure we are working with. The easiest approach, often called eager SMT solving in the literature, is to abandon any first-order abstraction and translate each atomic first-order formula into a propositional formula. This approach is suitable for the bitvector logic, mainly due to the fact that we are working with finite objects.
The idea is rather simple: we represent each individual bit of each bitvector by its own atomic proposition (or by the constant 0 or 1). This translation is straightforward for equality and bitwise operations. For example, bitwise z = and(x, y) can simply be translated to the conjunction of the formulae zᵢ ↔ (xᵢ ∧ yᵢ), one for each bit i of the arguments. The construction is a bit more complicated for addition, where the common approach is to implement a chained (ripple-carry) binary adder circuit in propositional logic, which results in a propositional formula linear in size with respect to the bit-width of the summands. Multiplication is much harder, though, and the blowup in size can be quadratic (in a typical implementation, a single 64-bit multiplication adds 64² new variables and tens of thousands of clauses).
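The following sketch shows what bit-blasting an adder looks like, reusing the tuple formulae from the propositional sketch above (the helper names are our own). Note that a real bit-blaster would introduce fresh variables Tseitin-style instead of duplicating subformulae:

    def xor(f, g):    # f ⊕ g, expressed with our three connectives
        return ('or', ('and', f, ('not', g)), ('and', ('not', f), g))

    def blast_add(xs, ys):
        """Propositional formulae for the bits of x + y (LSB first)."""
        out = [xor(xs[0], ys[0])]                # half adder for bit 0
        carry = ('and', xs[0], ys[0])
        for x, y in zip(xs[1:], ys[1:]):
            out.append(xor(xor(x, y), carry))    # full adder sum bit
            carry = ('or', ('and', x, y),        # full adder carry out
                     ('and', carry, xor(x, y)))
        return out                               # the overflow carry is dropped

    bits = lambda name, n: [('atom', f'{name}{i}') for i in range(n)]
    sum_bits = blast_add(bits('x', 4), bits('y', 4))    # a 4-bit adder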
Unfortunately, not every logic can be solved in this manner, and even when it can, the blowup in formula size often makes the result intractable. This leads us to the so-called lazy approach to SMT solving, which employs, alongside a general SAT solver, a specialised T-solver that is able to solve conjunctions of atomic formulae over a specific structure.
The lazy approach itself comes in two variants. The offline variant works by first running a SAT solver on the formula, interpreting each atomic formula as an atomic proposition, and only if this SAT skeleton is satisfiable, running a T-solver in order to find a satisfying valuation for the set of ‘true’ atomic formulae. If this fails, the process repeats with a different model of the SAT skeleton. The online variant, called DPLL(T), is an extension of the aforementioned DPLL algorithm, which builds a potential model of a formula incrementally. In this case, the interaction between the SAT solver and the T-solver is more direct, and the T-solver is consulted on the fly as the propositional valuation is being built.
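In pseudocode, the offline loop looks roughly as follows; every function used here (sat_solve, t_solver, negate, conjoin, blocking_clause) is a hypothetical stand-in, not a real API:

    def lazy_smt(skeleton, atom_map):
        """skeleton: propositional abstraction of the formula;
        atom_map: each atomic proposition -> its atomic formula."""
        while True:
            model = sat_solve(skeleton)      # hypothetical SAT solver
            if model is None:
                return 'unsat'               # no propositional model left
            # The conjunction of (possibly negated) atomic formulae
            # implied by this propositional model:
            conjunction = [atom_map[a] if value else negate(atom_map[a])
                           for a, value in model.items()]
            if t_solver(conjunction):        # hypothetical T-solver
                return model                 # the theory agrees: satisfiable
            # Rule this propositional model out and try another one.
            skeleton = conjoin(skeleton, blocking_clause(model))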
You may be thinking now that all this is potentially very expensive. After all, even propositional SAT was already NP-complete. Indeed, SMT solving is in general at least NP-hard, but depending on the logic in question it can be much worse still, up to undecidable. Even such a simple logic as the integers with addition and multiplication is, in fact, already undecidable (though, funnily enough, the same logic over the reals is decidable). Luckily for us, bitvectors are decidable, and even for undecidable logics, SMT solvers are often available, though with the added caveat of a possible unknown result (which unfortunately occurs already for relatively simple formulae).

conclusion

And that is all there is about logic, at least for today. Though the discussion was long, it will surely come in handy many times in the future.
We will conclude with some problems to think about:
  1. What kind of simplifying heuristics can DPLL use? Remember that the formula must be in CNF.
  2. Give formal semantics of the universal quantifier using substitution.
  3. Give an example of at least two different structures for the language of set theory.
  4. Give an example of a formula that is valid (i.e. true in every valuation) when interpreted over the real numbers, but not valid over the integers.
  5. Give an example of a formula that is valid over the integers and an intuitively similar formula that is not valid over 32-bit bitvectors.