data flow analysis

In previous blog posts, we have mentioned the term data flow multiple times. This week, we will take a closer look at some interesting properties of a program that can be inferred from its data flow, and at how to recover them algorithmically. The process is commonly known as data flow analysis and is a form of static analysis. The basic premise is to summarise the data coming into and leaving each basic block of a given program.
It is the summarisation that is the key to the usefulness of the analysis: clearly, if we simply ‘propagated’ the concrete values from variable to variable, performing the usual operations, we would get standard execution. Instead, we will use a suitable domain to keep track of variable properties of interest. This freedom in the choice of the domain is what makes data flow analysis a very flexible and general tool – picking a different domain leads to a different type of analysis.
Often enough, we can track the data flow separately for each variable or other basic data unit of the program, even though this is not always possible. When we track individual values, the domains involved in the analysis become much simpler and easier to understand, at the expense of a little bit of generality.

a motivating example

Let's consider a very simple domain first: the set of possible values a variable can take. In a program with only a single basic block, the results will coincide with straightforward execution:
L₁: x₁ ← 2            # x₁ = { 2 }
    y₁ ← 3            # y₁ = { 3 }
    x₂ ← x₁ + y₁      # x₂ = { 5 }
Let's add some non-trivial control flow to the mix, namely, an if-else statement. Assume that y = { 0, 1 } at this point:
L₁: x₁ ← 2            # x₁ = { 2 }, y = { 0, 1 }
    goto y ? L₂ : L₃  # both outgoing edges are possible
L₂: x₂ ← x₁ + 3       # x₂ = { 5 }
    goto L₄
L₃: x₃ ← x₁ + 2       # x₃ = { 4 }
    goto L₄
L₄: x₄ ← φ x₂ x₃      # x₄ = { 4, 5 } coming from either L₂ or L₃
Of course, the real challenge is loops. Let's see what happens if we try to analyse one (again, y = { 0, 1 } at the outset):
L₁: x₁ ← 2
L₂: x₂ ← φ x₁ x₃
    x₃ ← x₂ + 3
    goto y ? L₂ : L₃
L₃: end
Let's track the computation through a few iterations:
L₁: x₁ ← 2          # x₁ = { 2 }
L₂: x₂ ← φ x₁ x₃    # x₂ = { 2 }          coming from L₁
    x₃ ← x₂ + 3     # x₃ = { 5 }
L₂: x₂ ← φ x₁ x₃    # x₂ = { 2, 5 }       add 5 coming from L₂
    x₃ ← x₂ + 3     # x₃ = { 5, 8 }
L₂: x₂ ← φ x₁ x₃    # x₂ = { 2, 5, 8 }    add 8 coming from L₂
    x₃ ← x₂ + 3     # x₃ = { 5, 8, 11 }
L₂: x₂ ← φ x₁ x₃    # x₂ = { 2 + 3n | n ∈ ℕ }
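The loop trace can be reproduced mechanically with a short sketch (set union as the join, + applied element-wise; the iteration cap is artificial, since the analysis would otherwise keep growing forever):

```python
# Sketch of the loop example in the set-of-values domain. The join is set
# union; x2 <- phi x1 x3 becomes a union, x3 <- x2 + 3 is applied element-wise.

def analyse_loop(iterations):
    x1 = {2}
    x2, x3 = set(), set()
    for _ in range(iterations):   # cap the iterations artificially
        x2 = x2 | x1 | x3         # x2 <- phi x1 x3
        x3 = {v + 3 for v in x2}  # x3 <- x2 + 3
    return x2

print(analyse_loop(3))   # {2, 5, 8}
```

Each pass adds one more value of the form 2 + 3n, so no finite number of passes reaches a fixed point.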
While a fixed point clearly exists, there are two problems: the value of x₂ (and x₃) is infinite, and it takes infinitely many steps to find it. Before we tackle that problem, however, let's spell out the algorithm that we have intuitively followed in the above examples. That will also clarify the role of the domain and the operations it needs to provide.

the algorithm

So, the algorithm is very simple:
  1. assign initial values, taken from the domain, to each piece of ‘data’ in the program: constants and variables,
  2. propagate these values along the flow of data, substituting the operations in the program (e.g. x ← a + b) with suitable operations in the domain (e.g. x ← a ∨ b, if the domain is a simple semilattice),
  3. repeat (2) until a fixpoint is reached (i.e. further propagation does not change the data in any way).
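The three steps can be sketched generically; the encoding below (a program as a list of (target, operation, operands) triples, a domain given by hypothetical `join` and `transfer` functions) is an illustration, not a definitive implementation:

```python
# Steps (1)-(3) as a generic sketch: keep propagating domain values through
# the program until nothing changes.

def fixpoint(program, initial, join, transfer):
    values = dict(initial)                     # step (1): initial values
    changed = True
    while changed:                             # step (3): until fixpoint
        changed = False
        for target, op, operands in program:   # step (2): propagate
            new = join(values[target],
                       transfer(op, [values[a] for a in operands]))
            if new != values[target]:
                values[target] = new
                changed = True
    return values

# usage: the earlier if-else example, with sets of possible values as domain
union = lambda a, b: a | b
def transfer(op, args):
    if op == 'copy': return args[0]
    if op == 'phi':  return args[0] | args[1]
    if op == 'add':  return {x + y for x in args[0] for y in args[1]}

program = [('x1', 'copy', ['c2']),        # x1 <- 2
           ('x2', 'add',  ['x1', 'c3']),  # x2 <- x1 + 3
           ('x3', 'add',  ['x1', 'c2']),  # x3 <- x1 + 2
           ('x4', 'phi',  ['x2', 'x3'])]  # x4 <- phi x2 x3
initial = {'c2': {2}, 'c3': {3},
           'x1': set(), 'x2': set(), 'x3': set(), 'x4': set()}

print(fixpoint(program, initial, union, transfer)['x4'])  # {4, 5}
```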
Perhaps the least obvious sticking point in the algorithm is what ‘each piece of data’ actually means. Since we want the algorithm to terminate (and terminate quickly), we cannot really use the dynamic notion of a variable here: recursion, linked lists, etc. could all cause the analysis to blow up. Instead, the usual compromise is to do a static approximation, where each named variable is taken to represent all its possible instances. A similar trick (often incurring even more severe loss of precision) can be applied to arrays and memory (and other kinds of anonymous variables).
While φ instructions might seem special at first, it turns out that they play exactly the same role as the conflation of multiple dynamic variables into a single static ‘supervariable’. And that's also where those semilattices come into play.

semilattices and semantics

A finite join semilattice is a partially ordered set in which every non-empty subset has a join (supremum, least upper bound). We will write joins using the binary operator ∨, as is the convention.
Immediately, we can see that a function f that maps any x to some f(x) ≥ x has a fixed point x* – such that f(x*) = x* – which can be obtained by starting at an arbitrary x₀ and applying f sufficiently (but finitely) many times: x* = fⁿ(x₀). The sequence x₀ ≤ f(x₀) ≤ f(f(x₀)) ≤ … is non-decreasing, and since the semilattice is finite, it must eventually become stationary. Of course, different choices of x₀ can yield different fixed points x*.
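The iteration itself fits in a few lines; the function f below is a made-up example of such an inflationary function on a finite domain (subsets of { 0, 1, 2, 3 }), chosen only to illustrate the two points above:

```python
# Iterate an inflationary function (f(s) is always a superset of s) on a
# finite domain until it stops changing.

def iterate_to_fixpoint(f, x0):
    x = x0
    while f(x) != x:      # only finitely many steps on a finite domain
        x = f(x)
    return x

# example f: add the successor (mod 4) of every element
f = lambda s: s | {(v + 1) % 4 for v in s}

print(iterate_to_fixpoint(f, frozenset({0})))  # frozenset({0, 1, 2, 3})
print(iterate_to_fixpoint(f, frozenset()))     # frozenset() – a different
                                               # starting point, a different
                                               # fixed point
```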
So the analysis does something ‘until a fixpoint’, we want the analysis to terminate, and we have a choice of ‘domain’ which represents the state of the algorithm. I'm sure you can hear the pieces clicking into place.
It is perhaps worth mentioning that if we are using the ‘per variable’ or rather ‘per static supervariable’ form of the analysis, the above is simply a direct product of finitely many finite semilattices, where each factor represents a single variable. Therefore, from a theoretical perspective, we can assume a single semilattice that represents the entire state of the algorithm.
An astute reader will also have noticed that the hitherto abstract f is really just a form of operational semantics, though it's not the standard operational semantics of the language in question (the one that corresponds to execution and which models the dynamic behaviour of its programs). So the algorithm really boils down to:
Apply operational semantics to the initial state of the program until a fixpoint is reached.
The natural semantics of x₀ ← φ x₁ x₂ is the join operation of the semilattice, i.e. x₀ = x₁ ∨ x₂. The same (with some extra steps) holds for merging multiple dynamic instances of the same variable in one of those supervariables. As long as all the remaining mundane operations can get semantics that are monotonic, we have a clear winner. A simple way to ensure that is to define ← (assignment) itself to include a join: a ← b becomes a ← a ∨ b. Since the join is applied last, this is clearly monotonic.
Alternatively, if we have SSA, we know that all data flow loops must go through φ instructions, which do the joining. This means that, conveniently, φ instructions also represent the only values that need to be stored (everything else within SSA is derived combinatorially from φ instructions and ‘constants’). We immediately see that the entire semantic function is monotonic if the semantics of φ instructions are.

worked examples

Let us consider the following C++ function:
const char* color_name( color_t col )
{
    const char* name;
    switch ( col )
    {
        case color_t::red:
            name = "red"; break;
        case color_t::green:
            name = "green"; break;
        case color_t::blue:
            name = "blue"; break;
    }
    return name;
}
And the corresponding CFG:
L₁:      name     ← *             # undefined
         is_red   ← col == red
    goto is_red   ? L₅ : L₂
L₂:      is_green ← col == green
    goto is_green ? L₆ : L₃
L₃:      is_blue  ← col == blue
    goto is_blue ? L₇ : L₄

L₄: return name

L₅: name ← "red"
    goto L₄
L₆: name ← "green"
    goto L₄
L₇: name ← "blue"
    goto L₄
Or in picture form:
Especially from the picture, it should be quite obvious that we can reach ret while the value of name is still * (i.e. undefined). That is a problem, more so because name is a pointer that may be dereferenced by the caller of the function, in which case the program is going to crash, or worse, do something random and unexpected. So how do we detect the problem systematically, using data flow analysis? We need to find a suitable domain, which is going to be a very simple 3-element semilattice:
All constants are taken to be def, all variables are initially undef (in SSA, this is actually made explicit) and all operations are joins. Except assignment, which is regular assignment. As you can easily see from the Hasse diagram above, the joins look like this (recall that semilattices are commutative bands, so we leave out the redundant rows with flipped operands):
a            b            a ∨ b
def          def          def
def          undef        maybe_undef
def          maybe_undef  maybe_undef
undef        undef        undef
undef        maybe_undef  maybe_undef
maybe_undef  maybe_undef  maybe_undef
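The table has an easy-to-spot structure: def and undef are incomparable, and maybe_undef sits on top, so any two distinct elements join to maybe_undef. A sketch in Python (representing elements as strings is an assumption of this illustration):

```python
# Join of the 3-element semilattice: equal elements join to themselves,
# any two distinct elements join to the top element, maybe_undef.

def join(a, b):
    return a if a == b else 'maybe_undef'

# spot-checks against the join table
assert join('def', 'def') == 'def'
assert join('def', 'undef') == 'maybe_undef'
assert join('undef', 'undef') == 'undef'
assert join('maybe_undef', 'def') == 'maybe_undef'
```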
So how does it go? Looking at the CFG and remembering that the order of propagation does not matter (all our operations are associative – or rather, the one operation that we use for everything is):
  1. we don't know anything at all that would help us decide which way a given branch goes, so we always, automatically take both branches,
  2. propagate the undef associated with name down along the is_red edge,
  3. apply name ← "red", which translates to name ← def which means that name is now def,
  4. do the same along is_green and is_blue. The picture now looks something like this:
  5. propagate along the 4 incoming edges into the exit block; this gives us 3 def and one undef; thanks to associativity of ∨, it does not matter in which order we get there, the result is going to be maybe_undef,
  6. at this point, no propagation that we can do will change the assignments and we have reached our fixed point,
  7. we conclude that the original C++ function can return an uninitialized pointer and crash the program (the technical term is that it invokes undefined behaviour).
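The whole walk-through condenses into a few lines. The encoding below (per-block out-values for name; the names are made up for this sketch) follows the CFG from earlier; since that CFG is acyclic, a single pass in topological order already reaches the fixed point:

```python
# Definite-assignment analysis of color_name over the def/undef/maybe_undef
# semilattice. name_out[B] is the value of `name` at the end of block B.

def join(a, b):
    return a if a == b else 'maybe_undef'

def analyse():
    name_out = {}
    name_out['L1'] = 'undef'         # name <- * (uninitialized)
    name_out['L2'] = name_out['L1']  # L2 and L3 do not touch name
    name_out['L3'] = name_out['L2']
    name_out['L5'] = 'def'           # name <- "red"
    name_out['L6'] = 'def'           # name <- "green"
    name_out['L7'] = 'def'           # name <- "blue"
    # four edges enter the exit block L4: from L3, L5, L6 and L7
    at_return = name_out['L3']
    for pred in ('L5', 'L6', 'L7'):
        at_return = join(at_return, name_out[pred])
    return at_return

print(analyse())  # maybe_undef – the function may return an
                  # uninitialized pointer
```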


Data flow analysis use cases include various compiler optimization techniques like constant propagation, static checks like definite assignment analysis or detection of possible null pointer dereferences, or even construction of an SSA form of a program.
There are also many variants that we did not cover: for instance, propagation can be performed backwards (against the flow of data) in some analyses, or the requirements for the domain can be relaxed (the semilattice requirement is sufficient, but not necessary). For such relaxed domains, things can become a lot more complicated (e.g. the order of traversal of the control flow diagram may start to matter).