abstraction basics

This post will serve as an introduction to a rather broad topic that is of considerable interest to our group: (automated) software abstraction. Exhaustive analysis of software is quite hard:
  1. if we consider programs to have infinite state spaces (a common simplification that is invoked every time Turing-completeness is mentioned), we immediately run into limits on decidability,
  2. if we take a more realistic (though not necessarily more useful) stance that computers are finite, they have finite memory and hence the state space of all programs must be finite, we instead crash, at about the same speed, into the immense size of those technically finite state spaces (a computer with a very modest 1 GiB of memory has 2^(2^33) – roughly 10 to the power of 2.6 billion – distinct memory states; yeah, that is not going to fly).
If the latter number doesn't seem big enough, consider the worst-case runtime of an exhaustive search of that state space. Clearly, there is no hope of storing that many states to use a standard graph search. But since we know there is a finite number of states, we can employ a simple trick: after visiting that many states, if the program still runs, it must have looped, so we can safely backtrack. Give or take, that means exploring something like N^N paths, where N = 2^(2^33) is the number of states. A finite number, surely, but it seems unlikely that the civilization is going to last long enough to explore even a tiny fraction of that search space. Which really brings into focus why the approach from point 1 above is so useful.
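To make the 'it must have looped, so backtrack' trick concrete, here is a minimal sketch in Python (the successors and is_fail functions are stand-ins for one computation step of the machine and for error detection, not anything from a real tool); with the bound set to the number of states, this is a complete, if utterly hopeless, exploration:
# Exhaustive path exploration with a depth bound: once a path is longer than
# the total number of states, it must have revisited a state, so there is no
# point in extending it further.
def explore(initial, successors, is_fail, bound):
    stack = [(initial, 0)]
    while stack:
        state, depth = stack.pop()
        if is_fail(state):
            return state                 # found a reachable failure
        if depth >= bound:
            continue                     # the path must have looped; backtrack
        for nxt in successors(state):
            stack.append((nxt, depth + 1))
    return None                          # no failure found within the bound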

problem statement

We are going to need to bring that number down. Way down. Clearly, it must be possible for at least some programs, because people have been writing correct algorithms and proving their correctness. Surely, they didn't enumerate anything like the above numbers of states or paths. Of course, programs are not random blobs of a state space.1 There is a humongous amount of symmetry in those state spaces – if only we could find and exploit it. We have discussed control flow graphs before: clearly, there is a relationship between a state space of a program and its control flow graph. While we normally think of the state space as an ‘unrolling’ of the control flow graph, we could just as well consider the CFG to be a ‘collapsing’ of the state space graph.
Of course, we can't just look at the CFG to decide whether a program is safe: in actual programs, data exists, too. But sometimes we can look at a CFG and at least say that the program is definitely not safe (think an unconditionally-reachable fail statement; not something a programmer would intentionally write, but can easily happen with an optimizing compiler).
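As a tiny illustration of a verdict that needs no data at all, here is a sketch of one sufficient check in Python (the CFG is assumed to be a plain successor map, which is just a convenient stand-in): if a fail node sits on the straight-line part of the graph before the first branch, the program fails on every execution, whatever the data does:
# A purely control-flow verdict: follow the CFG from the entry for as long as
# control cannot branch; if we hit a fail node before the first branch, the
# failure happens on every execution, regardless of data.
def definitely_unsafe(cfg, entry, fail_nodes):
    node, seen = entry, set()
    while node not in seen:
        if node in fail_nodes:
            return True                  # unconditionally-reachable fail
        seen.add(node)
        successors = cfg.get(node, [])
        if len(successors) != 1:
            return False                 # a branch (or an exit): data now matters
        node = successors[0]
    return False                         # looped back without ever failing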
1
Though some people are clearly trying to get there.

examples

It is rarely useful to collapse the state space beyond the CFG, in the sense that we would collapse states that correspond to different control locations in the program into a single abstract state. However, it is certainly useful to retain additional distinctions, causing fewer states to collapse and making the abstract state space richer and more useful for gleaning information about the program.
A very common, though largely underappreciated, example is type checking: in addition to control locations, we retain the types of pieces of data, but not their values. A set of states whose members only differ by the specific values of some pieces of data is merged into a single abstract state. This is still a very coarse abstraction, as witnessed by attempts to type-check dynamically typed programs, where types of values can depend on a specific path the program has taken (i.e. types in dynamic languages are just a very specific type of regular data). In general, type-checking a dynamically typed program is undecidable, for much the same reasons as is any other interesting (‘semantic’) property of programs. While a lot of state collapsing happens in this case, it is not enough to even make the abstract state space finite, much less of a manageable size.
But wait, what about statically typed languages? The type checkers for those normally terminate on arbitrary programs, often provably so. This is because static type systems place very restrictive limits on which pieces of data can have distinct types. In general, each object in such a program must have a type that is determined statically, i.e. by an annotation in the program (or, in the absence of such annotations, an annotation must at least be possible in principle). This is, after all, what static typing means. But that means that we only have finitely many classes of objects, each represented by a single piece of syntax, and we require all the objects within a class to share a type. Rather obviously, since there are only finitely many classes, but infinitely many objects, some of those classes must themselves be infinite. But since we only care about type-per-object and we know all objects within a class share a type, we can again collapse any set of states whose members only differ in which objects fall into a class. Now the abstract state space is finite and not even very large (for monomorphic type systems, it is roughly a product of the number of type annotations and the size of the control flow graph; in fact, in this one case, it can be desirable to collapse even distinct control locations).
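To see the collapse in code, here is a small sketch (Python; the representation of a concrete state as a control location plus a variable-to-value map is made up for illustration): states that differ only in specific values land in the same abstract state:
# Abstracting a concrete state (location, values) into (location, types): all
# concrete states that differ only in the specific values of variables
# collapse into one abstract state.
def abstract_state(location, values):
    return (location, tuple(sorted((var, type(val).__name__)
                                   for var, val in values.items())))

s1 = abstract_state("loc7", {"x": 3,  "y": "abc"})
s2 = abstract_state("loc7", {"x": 42, "y": "xyz"})
assert s1 == s2   # both are ('loc7', (('x', 'int'), ('y', 'str')))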
As a bit of a tangent, this also nicely illustrates why dependent typing is undecidable: there is sufficient polymorphism and sufficiently many types (in the extreme case, all values are also types) that it is entirely possible to write programs with state spaces that are still infinite after type-induced collapse (quite informally, there are ‘too many types’ – infinite families that can be instantiated in a single finite program, via recursion or loops).

state spaces

Our principal point of view so far has been in terms of the state space (a conspicuously operational view, to be sure) of the program as it is executed. What is a state space though? Anything you want it to be! You need a collection of states and a rule of evolution (I am specifically not saying ‘a rule for moving from one state to the next’ because nobody said a state space needs to be countable). What is also absent is a requirement for determinism, either into the past or into the future, but what is definitely present is a sense of past and future. Trajectories in this space – computations – can split and merge without limitation. In particular, this means that we cannot predict the future, or read off the past, from a single state.
Of course, that does not mean we cannot construct state spaces like that: a simple way to do that is to take a base space, say S, construct a disjoint union of all trajectories from S, and give them a structure of a state space: any ‘point’ of this state space evolves along the ‘strand’ of this new space that is the image of a single trajectory. Or a little more technically, we can take points in this state space to be pairs (t, s) of a trajectory t and one of its points s, and each (t, s) evolves to (t, s′) (with s′ being an evolution of s in S such that s′ lies on t). If our original state space was the prototypical example, where points (states) are machine configurations and the evolution is given by the computation step of this machine, the above construction simply gives us the set of all computations (the setting for linear-time analysis of systems).
We can also only go half the way: consider prefixes, or truncated trajectories, instead and again make pairs (p, s), this time such that s is the point of truncation of p. The evolution of (p, s) is then (p′, s′) such that s′ is an evolution of s in S and p′ is an extension of p to s′ along the same ‘arc’. Unlike the previous case, the future is not determined: if s had multiple future evolutions, so has (p, s). On the other hand, the past is: each point in the new space has exactly one past. In the discrete case, the space of such prefixes is then nothing else but a computation tree, of the kind we encountered when talking about symbolic execution. Of course, this particular representation of that space is not very useful for analysis.
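In the discrete case, the prefix construction is easy to write down as code; here is a sketch (Python, with the base space given by a hypothetical successors function, and the names being mine rather than anything standard):
# The prefix space over a base space given by `successors`: a point is a pair
# (prefix, s) where s is the last state of the prefix, and it evolves by
# extending the prefix with one evolution of s. The past of a point is unique
# (it is the prefix itself); the future, in general, is not.
def prefix_successors(point, successors):
    prefix, state = point
    return [(prefix + (nxt,), nxt) for nxt in successors(state)]

def succ(n):                   # a tiny nondeterministic base space on integers
    return [n + 1, n * 2]

root = ((0,), 0)
print(prefix_successors(root, succ))
# [((0, 1), 1), ((0, 0), 0)] -- two distinct points, even though the second
# one revisits the base state 0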

abstraction

In general, when analysing software, the state space (in its entirety) is not a terribly practical object. A lot of work goes into not constructing too much of the space, and into finding state spaces that are small and whose points (states) are easy to represent compactly. In particular, we want to avoid constructing a big state space just so we can collapse sets of its points into a smaller one. That would essentially defeat the purpose of the whole exercise.
So how do we work with the smaller, ‘more abstract’ space? The foremost question on our minds should be ‘what do the sets that we want to collapse look like?’. And a lot of the time, the answer will be ‘the values stored by the program are similar’. How similar? Well, that depends on the program. Ideally, we collapse sets where the differences in data are of no consequence to the evolution: more formally, we would like the collapsing relation to be a congruence of evolution. What does that mean? It's just a fancy way of saying that related states evolve into related states. There are some technical problems because of the non-deterministic nature of evolution, but we can brush those aside for now.
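For small, finite toy spaces we can even test whether a candidate collapse is a congruence; here is a sketch of one such check (Python; comparing sets of abstract successors is just one reasonable way to handle the non-determinism):
# Check that a collapsing map `alpha` is a congruence of the evolution: states
# with the same abstract value must have the same set of abstract successors.
def is_congruence(states, successors, alpha):
    abstract_post = {}
    for state in states:
        post = frozenset(alpha(nxt) for nxt in successors(state))
        a = alpha(state)
        if a in abstract_post and abstract_post[a] != post:
            return False       # two related states evolve into unrelated states
        abstract_post.setdefault(a, post)
    return True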
Notably, this will often fail, and when it fails, it means the new space will have trajectories whose preimage (in the original space) is not a trajectory. This is bad, because if our analysis decides that one of those trajectories is defective, we have a spurious counterexample on our hands. If we give it to the user, they are going to be rather confused. So, as I was saying, this is bad and we will deal with that some other day.

data

So, back to similar data. Imagine this very silly program:
if x > 5:
    fail
First, we will put forth a very reasonable requirement that all fail states are dissimilar from all non-fail states (i.e. we never collapse a set with both fail and non-fail states in it). That puts a firm lower bound on the number of equivalence classes: there must be at least 2. But that won't be enough, because there are points that are non-fail that evolve into fail (e.g. those that have x = 7) and then there are non-fail points that evolve into non-fail. That refines our lower bound to 3, and I think that is going to be enough this time: the set of all non-fail states in which x > 5, the set of all non-fail states in which x ≤ 5, and the set of all fail states. This program cannot distinguish (in the sense of leading to a fail vs a non-fail state) anything else. Coincidentally, this is exactly the state space that we get from symbolic execution (okay, not actually a coincidence).
Using terms to represent values is perhaps a little too meta. Any other options? Well, yes, of course. We can take the intervals (5, ∞) and (−∞, 5] as our non-fail states and everything still works out. Unfortunately, unlike terms, this is not always a congruence (actually, it only very rarely is). But we got lucky this time, perhaps we can get lucky other times? Or perhaps use some so far unspecified magic that would help us deal with spurious counterexamples. And the answer seems to be yes, a combination of luck and magic can sometimes produce useful results.
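Spelled out in code, the three-class collapse and its abstract evolution for the silly program might look like this (a sketch with made-up names, nothing more):
# The three abstract states of `if x > 5: fail` and their abstract evolution.
def alpha(state):
    failed, x = state
    if failed:
        return "FAIL"
    return "GT5" if x > 5 else "LE5"

def abstract_step(a):
    # the branch sends every x > 5 state to fail; the rest stay non-failing
    return {"GT5": "FAIL", "LE5": "LE5", "FAIL": "FAIL"}[a]

assert alpha((False, 7)) == alpha((False, 100)) == "GT5"
assert abstract_step("GT5") == "FAIL"   # just like the concrete states it stands for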

data abstraction

Before we leave this admittedly rambling think piece behind, we should at least look a bit more formally at data abstraction – the idea that we have been building up this entire time. While it is not universally true, it is very common that data is the main driver of state space size (and hence of intractability of analysis). In a ‘machine configuration’ model of computation, it is also very natural to construct an equivalence on states using an equivalence on values as a basis (one could also say that an equivalence on values induces an equivalence on states).
But values are only one part of the equation: we also need to think about evolution, because there is no way we are constructing the original state space. Instead, we need a compact representative for each equivalence class and we need to be able to look at a computation step of the machine and perform the effect it has on data using only our chosen representatives.
Therefore, to efficiently compute the collapsed state space, we replace values with their representatives (e.g. types from the type-checking example from way above) and operations on them (e.g. int + int evaluates to int, so if the original machine executed x ← 3 + 7 and the effect was to store 10 in register x, our abstract machine will execute x ← int + int with the effect of storing int in register x… same story for x ← 4 + 7, x ← 5 + 7 and so on).
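A minimal sketch of such an abstract step, with type names standing in as the representatives (the encoding is made up for illustration):
# Abstract execution of `x <- a + b`, with type names as the representatives.
def abstract_add(a, b):
    if a == "int" and b == "int":
        return "int"           # int + int evaluates to int
    raise TypeError(f"cannot add {a} and {b}")

def abstract_assign(registers, target, a, b):
    registers[target] = abstract_add(a, b)
    return registers

# x <- 3 + 7, x <- 4 + 7, x <- 5 + 7, ... all collapse into this single step:
print(abstract_assign({}, "x", "int", "int"))   # {'x': 'int'}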
What are our (formal) options for dealing with values and operations? First-order logic (with its language and structures) comes to mind: a signature tells us what operations there are and a structure tells us what they mean. In this sense, the ‘instruction set’ of the machine is the signature, its standard operational semantics is a structure, and the abstract semantics are a different structure (over the same signature). Or we can take a page from abstract (universal) algebra, where a signature is essentially the same thing (declaring which operations exist) and again a structure takes the role of semantics (did you ever notice how similar universal algebra and first-order logic really are?).
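The ‘one signature, two structures’ view is easy to mock up; in the sketch below (Python, with dictionaries standing in for structures; nothing about the encoding is standard), the signature only fixes names and arities, while each structure supplies an interpretation:
# One signature, two structures over it: the standard semantics interprets the
# operations on machine integers, the abstract one on type names.
SIGNATURE = {"add": 2, "neg": 1}                 # operation name -> arity

CONCRETE = {"add": lambda a, b: a + b,
            "neg": lambda a: -a}

ABSTRACT = {"add": lambda a, b: "int" if (a, b) == ("int", "int") else "error",
            "neg": lambda a: "int" if a == "int" else "error"}

def evaluate(structure, op, *args):
    assert op in SIGNATURE and len(args) == SIGNATURE[op]
    return structure[op](*args)

print(evaluate(CONCRETE, "add", 3, 7))           # 10
print(evaluate(ABSTRACT, "add", "int", "int"))   # int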
In both cases, math usually only deals with a single signature and a single set of values in any given context. A single program, on the other hand, often uses a variety of data types, each with its own operations. Of course, nothing stops us from taking a ‘union’ of all relevant signatures and assigning special values to ‘nonsense’ operations (which will never be actually used by well-typed programs). However, this resolution is somewhat unsatisfactory (though this is essentially an aesthetic quibble).
On the other hand, in programs with poor or nonexistent typing discipline (anything that came from C, pretty much), we will eventually need to take that union anyway. Moreover, universal algebra has a long and rich tradition of building structures from pieces and of taking them apart. So this might not be such a bad fit after all.

signatures and structures

It still makes sense to consider multiple signatures that can be combined to form bigger signatures, or take sub-signatures. The obstacles are purely formal: it is a perfectly intuitive thing to do. So what are some examples of useful signatures?
  1. integer arithmetic: addition +, subtraction -, multiplication *, integer division /, negation (unary -); with relational operations, we immediately run into the issue of types, or sorts (at least if we happen to be computer scientists, or worse, programmers)… but this is actually irrelevant to the classical concept of a signature, since signatures are just syntax – all we care about are arities, and there is no problem with those; so we put our worries aside and include ==, <, etc.
  2. the (fixed-width) bitvector signature extends the above with ‘bitwise’ operations (or, xor, and, shifts and so on) – but there is an entirely different problem with those, namely that they are actually all derived and hence ‘redundant’ (bitwise or can be defined entirely in terms of the integer signature), but programmers take bitwise operations to be primitives and our analysis will very much benefit from treating them as such, so we include them;
  3. the (generalized) bitvector signature: this is where things start getting interesting – like in the previous step, we have to temper our desire for orthogonality and minimalism with a pragmatic consideration of the situation on the ground; programs certainly mix and match bitvectors of different widths, and use distinct operations for each width; so we take a product of the fixed-width signature with ℕ (the set of possible widths) and get a (countably infinite!) signature with operations like add₃₂ and add₁₆ and or₆₄ and so on (of course we can restrict ourselves to a finite list of widths… that works too, in many cases); to make things even more interesting, we will also want to add conversion operations for extending and truncating bitvectors to different widths, something like extₘ→ₙ and truncₘ→ₙ (or perhaps sign and zero extend) – there is a small sketch of this width-indexed signature right after the list,
  4. the (monomorphic) pointer signature: this includes a subset of arithmetic (addition and subtraction) and pointer-specific operations – load and store, essentially… but wait, in actual programs there are different operations for loading different types of values (oh boy, I can feel types creeping up on us!),
  5. the (generalized) pointer signature: a load and a store operation for each type in some unspecified set of ‘types’ (so a product of the monomorphic one above and a set of types).
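As promised in item 3, here is a sketch of how the width-indexed signature could be spelled out (Python; the operation names and the finite list of widths are made up for illustration):
# A finite restriction of the generalized bitvector signature: one copy of
# each fixed-width operation per width, plus conversions between widths.
WIDTHS = [8, 16, 32, 64]
FIXED_WIDTH_OPS = {"add": 2, "sub": 2, "mul": 2, "or": 2, "xor": 2, "not": 1}

signature = {}
for w in WIDTHS:
    for op, arity in FIXED_WIDTH_OPS.items():
        signature[f"{op}{w}"] = arity            # add32, or64, not8, ...
for m in WIDTHS:
    for n in WIDTHS:
        if m != n:
            signature[f"convert_{m}_to_{n}"] = 1  # extend or truncate

print(len(signature))   # 24 width-indexed operations plus 12 conversions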
Actually, I think that's quite enough for today. We will look at structures and how to build them up some other time. There's still a lot of ground to cover.