abstraction basics
This post will serve as an introduction to a rather broad topic that is of
considerable interest to our group: (automated) software abstraction.
Exhaustive analysis of software is quite hard:
- if we consider programs to have infinite state spaces (a common simplification that is invoked every time Turing-completeness is mentioned), we immediately run into limits on decidability,
- if we take a more realistic (though not necessarily more useful) stance that computers are finite, they have finite memory and hence the state space of all programs must be finite, we instead crash, at about the same speed, into the immense size of those technically finite state spaces (a computer with a very modest 1GiB of memory has 2^(2^33) distinct memory states; yeah, that is not going to fly).
problem statement
We are going to need to bring that number down. Way down. Clearly, it must be
possible for at least some programs, because people have been writing
correct algorithms and proving their correctness. Surely, they didn't
enumerate anything like the above numbers of states or paths. Of course,
programs are not random blobs of a state space.¹ There is a humongous amount
of symmetry in those state spaces – if only we could find and exploit it. We have
discussed control flow graphs before: clearly, there is a relationship between
a state space of a program and its control flow graph. While normally, we
think of the state space as an ‘unrolling’ of the control flow graph, we could
just as well consider the CFG to be a ‘collapsing’ of the state space graph.
Of course, we can't just look at the CFG to decide whether a program is safe:
in actual programs, data exists, too. But sometimes we can look at a CFG and
at least say that the program is definitely not safe (think an
unconditionally-reachable fail statement; not something a programmer would
intentionally write, but one that can easily happen with an optimizing
compiler).
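As a quick illustration of how such a check might look (a minimal sketch – the CFG encoding, the ‘fail’ label and the function name are all invented here): if we can walk from the entry to a fail node along edges that never pass a branch, the program fails regardless of data.

    def definitely_fails(cfg, entry):
        # cfg maps each node to the list of its successor nodes
        node, seen = entry, set()
        while node not in seen:
            if node == "fail":
                return True         # reached fail without ever branching
            seen.add(node)
            succs = cfg.get(node, [])
            if len(succs) != 1:
                return False        # a branch: from here on, data matters
            node = succs[0]
        return False                # looped without reaching fail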
¹ Though some people are clearly trying to get there.
examples
It is rarely useful to collapse the state space beyond the CFG, in the sense
that we would collapse states that correspond to different control locations
in the program into a single abstract state. However, it is certainly useful
to retain additional distinctions, causing fewer states to collapse and making
the abstract state space richer and more useful for gleaning information about
the program.
A very common, though largely underappreciated, example is type checking: in
addition to control locations, we retain the types of pieces of data, but
not their values. A set of states whose members differ only in the specific
values of some pieces of data is merged into a single abstract state. This is
still a very coarse abstraction, as witnessed by attempts to type-check
dynamically typed programs, where types of values can depend on a specific
path the program has taken (i.e. types in dynamic languages are just a very
specific type of regular data). In general, type-checking a dynamically typed
program is undecidable, for much the same reasons as is any other interesting
(‘semantic’) property of programs. While a lot of state collapsing happens in
this case, it is not enough to make the state space finite, much less
manageably small.
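To make the path dependence concrete, a tiny (invented) example in a dynamically typed language – the type of x after the branch is a function of which path was taken, i.e. of plain data:

    def f(flag):
        if flag:
            x = 1        # on this path, x is an int
        else:
            x = "one"    # on this path, x is a str
        return x + 1     # well-typed only on paths where flag was truthy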
But wait, what about statically typed languages? The type checkers for those
normally terminate on arbitrary programs, often provably so. This is because
static type systems place very restrictive limits on which pieces of data can
have distinct types. In general, each object in such a program must have a
type that is determined statically, i.e. by an annotation in the program (or,
in the absence of explicit annotations, by one that could at least in
principle be written down).
This is, after all, what static typing means. But that means we only have
finitely many classes of objects, each represented by a single piece of
syntax, and we require all the objects within a class to share a type.
Rather obviously, since there are only finitely many classes, but
infinitely many objects, some of those classes must themselves be infinite.
But since we only care about type-per-object and we know all objects within a
class share a type, we can again collapse any set of states where the members
only differ in what objects are within a class. Now the abstract state space
is finite and not even very large (for monomorphic type systems, it is roughly
a product of the number of type annotations and the size of the control flow
graph; in fact, in this one case, it can be desirable to collapse even
distinct control locations).
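A back-of-the-envelope version of that estimate (the toy program, its locations and its annotations are all invented for this sketch): with monomorphic types, the typing of variables is a single fixed function of the annotations, so the abstract space is essentially the set of control locations paired with it:

    locations = {"entry", "loop", "exit"}         # the CFG of a toy program
    type_env = {"x": "int", "s": "str"}           # fixed by annotations
    abstract_space = {(loc, frozenset(type_env.items()))
                      for loc in locations}       # |CFG| × |typings| states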
As a bit of a tangent, this also nicely illustrates why dependent typing is
undecidable: there is sufficient polymorphism and sufficiently many types (in
the extreme case, all values are also types) that it is entirely possible to
write programs with state spaces that are still infinite after type-induced
collapse (quite informally, there are ‘too many types’ – infinite families
that can be instantiated in a single finite program, via recursion or
loops).
state spaces
Our principal point of view so far has been in terms of the state space (a
conspicuously operational view, to be sure) of the program as it is executed.
What is a state space though? Anything you want it to be! You need a collection
of states and a rule of evolution (I am specifically not saying ‘a rule for
moving from one state to the next’ because nobody said a state space needs to
be countable). What is also absent is a requirement for determinism, either
into the past or into the future, but what is definitely present is a sense
of past and future. Trajectories in this space – computations – can split
and merge without limitation. In particular, this means that we cannot predict
the future, or read off the past, from a single state.
Of course, that does not mean we cannot construct state spaces like that: a
simple way to do that is to take a base space, say S, construct a disjoint
union of all trajectories from S, and give them a structure of a state
space: any ‘point’ of this state space evolves along the ‘strand’ of this new
space that is the image of a single trajectory. Or a little more technically,
we can take points in this state space to be pairs (t, x) of a trajectory t
and one of its points x, and each (t, x) evolves to (t, x′) (with x′ being
an evolution of x in S such that x′ lies on t). If our original state
space was the prototypical example, where points (states) are machine
configurations and the evolution is given by the computation step of this
machine, the above construction simply gives us the set of all computations
(the setting for linear-time analysis of systems).
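In code, with the simplifying assumption that trajectories are finite tuples of base states (real ones need not be), a point of the derived space and its evolution look like this:

    def evolve(point):
        # a point is a pair (t, i): a trajectory t and (the index of) one
        # of its points; it can only evolve along its own strand, so its
        # future – unlike in the base space – is fully determined
        t, i = point
        return (t, i + 1) if i + 1 < len(t) else None   # end of the strand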
We can also only go half the way: consider prefixes, or truncated
trajectories, instead and again make pairs (t, x), this time such that x
is the point of truncation. The evolution of (t, x) is then (t′, x′) such
that x′ is an evolution of x in S and t′ is an extension of t to x′
along the same ‘arc’. Unlike the previous case, the future is not
determined: if x had multiple future evolutions, so has (t, x). On the
other hand, the past is: each point in the new space has exactly one past.
In the discrete case, the space of such prefixes is then nothing else but a
computation tree, of the kind we encountered when talking about symbolic
execution. Of course, this particular representation of that space is not
very useful for analysis.
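A sketch of the prefix construction, under the same finite-tuple assumption (the nondeterministic toy step function is invented for the example):

    def prefixes(initial, step, depth):
        # enumerate truncated trajectories – the computation tree – up to
        # a given depth; a prefix is a tuple whose last element is the
        # point of truncation, so its past is unique while its future
        # branches exactly like the base evolution
        frontier = [(initial,)]
        for _ in range(depth):
            frontier = [p + (s,) for p in frontier for s in step(p[-1])]
            yield from frontier

    # e.g. a counter that can either increment or reset to zero:
    tree = list(prefixes(0, lambda s: {s + 1, 0}, 3))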
abstraction
In general, when analysing software, the state space (in its entirety) is not
a terribly practical object. A lot of work goes into not constructing too
much of the space, and into finding state spaces that are small and whose
points (states) are easy to represent compactly. In particular, we want to
avoid constructing a big state space in order to collapse sets of points to
construct a smaller space. That would essentially defeat the purpose of the
whole exercise.
So how do we work with the smaller, ‘more abstract’ space? The foremost
question on our minds should be ‘what do the sets that we want to collapse
look like?’. And a lot of the time, the answer will be ‘the values stored by
the program are similar’. How similar? Well, that depends on the program.
Ideally, we collapse sets where the differences in data are of no consequence
to the evolution: more formally, we would like the collapsing relation to be
a congruence of evolution. What does that mean? It's just a fancy way of
saying that related states evolve into related states. There are some
technical problems because of the non-deterministic nature of evolution, but
we can brush those aside for now.
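For a finite space we can spell the condition out directly. A minimal sketch, assuming evolution is given as a successor-set function; note that comparing the sets of abstract successors is only one way of resolving the nondeterminism issue mentioned above:

    def is_congruence(states, step, collapse):
        # collapse maps a concrete state to its abstract class; we demand
        # that collapsed states have identical sets of collapsed successors
        image = {}
        for s in states:
            succs = {collapse(t) for t in step(s)}
            if image.setdefault(collapse(s), succs) != succs:
                return False    # related states evolve into unrelated states
        return True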
Notably, this will often fail, and when it fails, it means the new space will
have trajectories whose preimage (in the original space) is not a
trajectory. This is bad, because if our analysis decides that one of those
trajectories is defective, we have a spurious counterexample on our hands.
If we give it to the user, they are going to be rather confused. So, as I was
saying, this is bad and we will deal with that some other day.
data
So, back to similar data. Imagine this very silly program:
if x > 5:
    fail
First, we will put forth a very reasonable requirement that all fail states
are dissimilar from all non-fail states (i.e. we never collapse a set with
both fail and non-fail states in it). That puts a firm lower bound on the
number of equivalence classes: there must be at least 2. But that won't be
enough, because there are non-fail points that evolve into fail (e.g. those
that have x = 7) and then there are non-fail points that evolve into
non-fail. That refines our lower bound to 3, and I think that is going to be
enough this time: the set of all non-fail states in which x > 5, the set of
all non-fail states in which x ≤ 5, and the set of all fail states. This
program cannot distinguish (in the sense of leading to a fail vs a non-fail
state) anything else. Coincidentally, this is exactly the state space that we
get from symbolic execution (okay, not actually a coincidence).
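In code, the collapsing map is just three-valued (the (location, x) encoding of a concrete state is an assumption of this sketch):

    def collapse(state):
        loc, x = state          # assumed encoding: (location, value of x)
        if loc == "fail":
            return "fail"
        return "x>5" if x > 5 else "x<=5"

For this particular program the map is a congruence: states within a class evolve into states within a single class, which is exactly why the three-state abstraction is sound here.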
Using terms to represent values is perhaps a little too meta. Any other
options? Well, yes, of course. We can take the intervals [6, ∞) and (−∞, 5]
as our non-fail states and everything still works out. Unfortunately, unlike
terms, this is not always a congruence (actually, it only very rarely is). But
we got lucky this time, perhaps we can get lucky other times? Or perhaps use
some so far unspecified magic that would help us deal with spurious
counterexamples. And the answer seems to be yes, a combination of luck and
magic can sometimes produce useful results.
data abstraction
Before we leave this admittedly rambling think piece behind, we should at
least look a bit more formally at data abstraction – the idea that we have
been building up this entire time. While it is not universally true, it is
very common that data is the main driver of state space size (and hence of
intractability of analysis). In a ‘machine configuration’ mode of computation,
it is also very natural to construct an equivalence on states using an
equivalence on values as a basis (one could also say that an equivalence on
values induces an equivalence on states).
But values are only one part of the equation: we also need to think about
evolution, because there is no way we are constructing the original state
space. Instead, we need a compact representative for each equivalence class
and we need to be able to look at a computation step of the machine and
perform the effect it has on data using only our chosen representatives.
Therefore, to efficiently compute the collapsed state space, we replace
values with their representatives (e.g. types from the type-checking example
from way above) and operations on them (e.g. int + int evaluates to int,
so if the original machine executed x ← 3 + 7 and the effect was to store
10 in register x, our abstract machine will execute x ← int + int with
the effect of storing int in register x … same story for x ← 4 + 7,
x ← 5 + 7 and so on).
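The type-level ‘+’ from the example is then a one-liner (the string encoding of type representatives is just a convention of this sketch):

    def abstract_add(a, b):
        # int + int evaluates to int; we give up on anything else
        return "int" if a == b == "int" else "unknown"

Both x ← 3 + 7 and x ← 4 + 7 abstract to abstract_add("int", "int"), which stores "int" in x – infinitely many concrete steps collapsed into one abstract step.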
What are our (formal) options for dealing with values and operations?
First-order logic (with its language and structures) comes to mind:
a signature tells us what operations there are and a structure tells us what
they mean. In this sense, the ‘instruction set’ of the machine is the
signature, its standard operational semantics is a structure, and the abstract
semantics are a different structure (over the same signature). Or we can take
a page from abstract (universal) algebra, where a signature is essentially the
same thing (declaring which operations exist) and again a structure takes the
role of semantics (did you ever notice how similar universal algebra and
first-order logic really are?).
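To make the parallel concrete, here is a tiny invented signature with two structures over it – the standard one on integers and an abstract one on parities (a classic toy abstract domain, not something introduced earlier in this post):

    SIG = {"add": 2, "neg": 1}          # a signature: names and arities only

    # the standard structure: what the operations mean on actual integers
    standard = {"add": lambda a, b: a + b,
                "neg": lambda a: -a}

    # an abstract structure over the SAME signature: values are parities
    abstract = {"add": lambda a, b: "even" if a == b else "odd",
                "neg": lambda a: a}     # negation preserves parity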
In both cases, math usually only deals with a single signature and a single
set of values in any given context. A single program, on the other hand, often
uses a variety of data types, each with its own operations. Of course, nothing
stops us from taking a ‘union’ of all relevant signatures and assigning
special values to ‘nonsense’ operations (which will never be actually used by
well-typed programs). However, this resolution is somewhat unsatisfactory
(though this is essentially an aesthetic quibble).
On the other hand, in programs with poor or nonexistent typing discipline
(anything that came from C, pretty much), we will eventually need to take that
union anyway. Moreover, universal algebra has a long and rich tradition of
building structures from pieces and of taking them apart. So this might not be
such a bad fit after all.
signatures and structures
It still makes sense to consider multiple signatures that can be combined to
form bigger signatures, or take sub-signatures. The obstacles are purely
formal: it is a perfectly intuitive thing to do. So what are some examples of
useful signatures?
- integer arithmetic: addition +, subtraction -, multiplication *, integer division /, negation (unary -); with relational operations, we immediately run into the issue of types, or sorts (at least if we happen to be computer scientists, or worse, programmers)… but this is actually irrelevant to the classical concept of a signature, since signatures are just syntax – all we care about are arities, and there is no problem with those; so we put our worries aside and include ==, <, etc.,
- the (fixed-width) bitvector signature extends the above with ‘bitwise’ operations (or, xor, and, shifts and so on) – but there is an entirely different problem with those, namely that they are actually all derived and hence ‘redundant’ (bitwise or can be defined entirely in terms of the integer signature), but programmers take bitwise operations to be primitives and our analysis will very much benefit from treating them as such, so we include them;
- the (generalized) bitvector signature: this is where things start getting interesting – like in the previous step, we had to temper our desire for orthogonality and minimalism by a pragmatic consideration of the situation on the ground; programs certainly mix and match bitvectors of different widths, and use distinct operations for each width; so we take a product of the fixed-width signature with ℕ and get a (countably infinite!) signature with operations like add₃₂ and add₁₆ and or₆₄ and so on (of course we can restrict ourselves to a finite list of widths… that works too, in many cases); to make things even more interesting, we will also want to add conversion operations for extending and truncating bitvectors to different widths (or perhaps sign and zero extend) – see the sketch after this list,
- the (monomorphic) pointer signature: this includes a subset of arithmetic (addition and subtraction) and pointer-specific operations – load and store, essentially… but wait, in actual programs there are actually different operations for loading different types of values (oh boy, I can feel types creeping up on us!),
- the (generalized) pointer signature: a load and a store operation for each type in some unspecified set of ‘types’ (so a product of the monomorphic one above and a set of types).
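The width-indexed sketch promised above: taking a product with ℕ just means that every fixed-width operation becomes a countable family, one member per width (the names like add32 and the dictionary encoding are conventions of this sketch):

    def bv_ops(width):
        # one fixed-width ‘slice’ of the generalized bitvector signature
        mask = (1 << width) - 1
        return {f"add{width}": lambda a, b: (a + b) & mask,
                f"or{width}":  lambda a, b: (a | b) & mask}

    # the generalized signature is then the union over all widths:
    # bv_ops(8), bv_ops(16), bv_ops(32), ...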