data flow analysis

In previous blog posts, we have mentioned the term data flow multiple times. This week, we will have a closer look at some interesting properties of programs that can be inferred from its data flow, and how to recover them algorithmically. The process is commonly known as data flow analysis and is a form of static analysis. The basic premise is to summarise the data, coming into and leaving each basic block of a given program.

It is the summarisation that is the key to usefulness of the analysis: clearly, if we simply ‘propagate’ the concrete values from variable to variable, performing the usual operations, we would simply get standard execution. Instead, we will use a suitable domain to keep track of variable properties of interest. This freedom in the choice of the domain is what makes data flow analysis a very flexible and general tool – picking a different domain leads to a different type of analysis.

Often enough, we can track the data flow separately for each variable or other basic data unit of the program, even though this is not always possible. When we track individual values, the domains involved in the analysis become much simpler and easier to understand, at the expense of a little bit of generality.

a motivating example

Let's consider a very simple domain first: the set of possible values a variable can take. In a program with only a single basic block, the results will coincide with straightforward execution:

L₁: x₁ ← 2            # x₁ = { 2 }
    y₁ ← 3            # y₁ = { 3 }
    x₂ ← x₁ + y₁      # x₂ = { 5 }
    end

Let's add some non-trivial control flow to the mix, namely, an if-else statement. Assume that y = { 0, 1 } at this point:

L₁: x₁ ← 2            # x₁ = { 2 }, y = { 0, 1 }
    goto y ? L₂ : L₃  # both outgoing edges are possible
L₂  x₂ ← x₁ + 3       # x₂ = { 5 }
    goto L₄
L₃: x₃ ← x₁ + 2       # x₃ = { 4 }
    goto L₄
L₄: x₄ ← φ x₂ x₃      # x₄ = { 3, 4 } coming from either L₁ or L₂
    end

Of course, the real challenge is loops. Let's see what happens if we try to analyze one (again, y = { 0, 1 } at the outset):

L₁: x₁ ← 2
L₂: x₂ ← φ x₁ x₃
    x₃ ← x₂ + 3
    goto y ? L₂ : L₃
L₃: end

Let's track the computation through a few iterations:

L₁: x₁ ← 2          # x₁ = { 2 }
L₂: x₂ ← φ x₁ x₃    # x₂ = { 2 }       coming from from L₁
    x₃ ← x₂ + 3     # x₃ = { 5 }
L₂: x₂ ← φ x₁ x₂    # x₂ = { 2, 5 }    add 5 coming from L₂
    x₃ ← x₂ + 3     # x₃ = { 2, 5, 8 }
L₂: x₂ ← φ x₁ x₂    # x₂ = { 2, 5, 8 } add 8 coming from L₂
    x₃ ← x₂ + 3     # x₃ = { 2, 5, 8, 11 }
…
L₂: x₂ ← φ x₁ x₂    # x₂ = { 2 + 3n | n ∈ ℕ }

While a fixed point clearly exists, there are two problems: the value of x₂ (and x₃) is infinite, and it takes infinitely many steps to find it. Before we tackle that problem, however, let's spell out the algorithm that we have intuitively followed in the above examples. That will also clarify the role of the domain and the operations it needs to provide.

the algorithm

So, the algorithm is very simple:

assign initial values, taken from the domain, to each piece of ‘data’ in the program: constants and variables,
propagate these values along the flow of data, substituting the operations in the program (e.g. x ← a + b) with suitable operations in the domain (e.g. x ← a ∨ b, if the domain is a simple semilattice),
repeat (2) until a fixpoint is reached (i.e. further propagation does not change the data in any way).

Perhaps the least obvious sticking point in the algorithm is what ‘each piece of data’ actually means. Since we want the algorithm to terminate (and terminate quickly), we cannot really use the dynamic notion of a variable here: recursion, linked lists, etc. could all cause the analysis to blow up. Instead, the usual compromise is to do a static approximation, where each named variable is taken to represent all its possible instances. A similar trick (often incurring even more severe loss of precision) can be applied to arrays and memory (and other kinds of anonymous variables).

While φ instructions might seem special at first, it turns out that they play exactly the same role as the conflation of multiple dynamic variables into a single static ‘supervariable’. And that's also where those semilattices come into play.

semilattices and semantics

A finite join semilattice is a partially ordered set that has a join (supremum, least upper bound) operation defined on all its non-empty subsets. We will write joins using the binary operator as is the convention.

Immediately, we can see that a function that maps any to some has a fixed point – such that – that can be obtained by starting at an arbitrary and applying sufficiently (but finitely) many times: . Of course, different choices of can yield different fixed points .

So the analysis does something ‘until a fixpoint’, we want the analysis to terminate, and we have a choice of ‘domain’ which represents the state of the algorithm. I'm sure you can hear the pieces clicking into place.

It is perhaps worth mentioning that if we are using the ‘per variable’ or rather ‘per static supervariable’ form of the analysis, the above is simply a direct product of finitely many finite semilattices, where each factor represents a single variable. Therefore, from a theoretical perspective, we can assume a single semilattice that represents the entire state of the algorithm.

An astute reader will also have noticed that the hitherto abstract is really just a form of operational semantics, though it's not the standard operational semantics of the language in question (the one that corresponds to execution and which models the dynamic behaviour of its programs). So the algorithm really boils down to:

Apply operational semantics to the initial state of the program until a fixpoint is reached.

a	b	a ∨ b
`def`	`def`	`def`
`def`	`undef`	`maybe_undef`
`def`	`maybe_undef`	`maybe_undef`
`undef`	`undef`	`undef`
`undef`	`maybe_undef`	`maybe_undef`
`maybe_undef`	`maybe_undef`	`maybe_undef`

data flow analysis

a motivating example

the algorithm

semilattices and semantics

worked examples

conclusion