# data flow analysis

In previous blog posts, we have mentioned the term *data flow* multiple times. This week, we will have a closer look at some interesting properties of programs that can be inferred from their data flow, and at how to recover them algorithmically. The process is commonly known as *data flow analysis* and is a form of *static analysis*. The basic premise is to summarise the data coming into and leaving each *basic block* of a given program.

It is the *summarisation* that is the key to the usefulness of the analysis: clearly, if we simply ‘propagated’ the concrete values from variable to variable, performing the usual operations, we would simply get standard execution. Instead, we will use a suitable *domain* to keep track of the variable properties of interest. This freedom in the choice of the domain is what makes data flow analysis a very flexible and general tool – picking a different domain leads to a different type of analysis.

Often enough, we can track the data flow separately for each variable or other
basic data unit of the program, even though this is not always possible. When
we track individual values, the domains involved in the analysis become much
simpler and easier to understand, at the expense of a little bit of
generality.

# a motivating example

Let's consider a very simple domain first: the set of possible values a
variable can take. In a program with only a single basic block, the results
will coincide with straightforward execution:

```
L₁: x₁ ← 2 # x₁ = { 2 }
y₁ ← 3 # y₁ = { 3 }
x₂ ← x₁ + y₁ # x₂ = { 5 }
end
```

Let's add some non-trivial *control flow* to the mix, namely an *if-else* statement. Assume that `y = { 0, 1 }` at this point:
```
L₁: x₁ ← 2 # x₁ = { 2 }, y = { 0, 1 }
goto y ? L₂ : L₃ # both outgoing edges are possible
L₂: x₂ ← x₁ + 3 # x₂ = { 5 }
goto L₄
L₃: x₃ ← x₁ + 2 # x₃ = { 4 }
goto L₄
L₄: x₄ ← φ x₂ x₃ # x₄ = { 4, 5 } coming from either L₂ or L₃
end
```

Of course, the real challenge is loops. Let's see what happens if we try to analyze one (again, `y = { 0, 1 }` at the outset):
```
L₁: x₁ ← 2
L₂: x₂ ← φ x₁ x₃
x₃ ← x₂ + 3
goto y ? L₂ : L₃
L₃: end
```

Let's track the computation through a few iterations:

```
L₁: x₁ ← 2 # x₁ = { 2 }
L₂: x₂ ← φ x₁ x₃ # x₂ = { 2 } coming from L₁
x₃ ← x₂ + 3 # x₃ = { 5 }
L₂: x₂ ← φ x₁ x₃ # x₂ = { 2, 5 } add 5 coming from L₂
x₃ ← x₂ + 3 # x₃ = { 5, 8 }
L₂: x₂ ← φ x₁ x₃ # x₂ = { 2, 5, 8 } add 8 coming from L₂
x₃ ← x₂ + 3 # x₃ = { 5, 8, 11 }
…
L₂: x₂ ← φ x₁ x₃ # x₂ = { 2 + 3n | n ∈ ℕ }
```

While a fixed point clearly exists, there are two problems: the value of `x₂` (and `x₃`) is infinite, and it takes infinitely many steps to find it. Before we tackle that problem, however, let's spell out the algorithm that we have intuitively followed in the above examples. That will also clarify the role of the *domain* and the operations it needs to provide.

# the algorithm

So, the algorithm is very simple:

1. assign initial values, taken from the domain, to each piece of ‘data’ in the program: constants and variables,
2. propagate these values along the flow of data, substituting the operations in the program (e.g. `x ← a + b`) with suitable operations in the *domain* (e.g. `x ← a ∨ b`, if the domain is a simple semilattice),
3. repeat (2) until a fixpoint is reached (i.e. further propagation does not change the data in any way).

Note that we cannot afford to use the truly *dynamic* notion of a variable here: recursion, linked lists, etc. could all cause the analysis to blow up. Instead, the usual compromise is to do a *static approximation*, where each named variable is taken to represent all its possible instances. A similar trick (often incurring an even more severe loss of precision) can be applied to *arrays* and *memory* (and other kinds of anonymous variables).

While `φ` instructions might seem special at first, it turns out that they play exactly the same role as the conflation of multiple dynamic variables into a single static ‘supervariable’. And that's also where those semilattices come into play.
# semilattices and semantics

A *finite join semilattice* is a partially *ordered set* that has a *join* (supremum, least upper bound) operation defined on all its non-empty subsets. We will write joins using the binary operator ∨, as is the convention.

Immediately, we can see that a function f that maps any x to some f(x) ≥ x has a fixed point – an x* such that f(x*) = x* – that can be obtained by starting at an arbitrary x₀ and applying f sufficiently (but finitely) many times: x* = fⁿ(x₀). This works because the chain x₀ ≤ f(x₀) ≤ f²(x₀) ≤ … can only grow strictly a finite number of times in a finite semilattice. Of course, different choices of x₀ can yield different fixed points x*.

So the analysis does something ‘until a fixpoint’, we want the analysis to
terminate, and we have a choice of ‘domain’ which represents the state of the
algorithm. I'm sure you can hear the pieces clicking into place.

It is perhaps worth mentioning that if we are using the ‘per variable’ or rather ‘per static supervariable’ form of the analysis, the above semilattice is simply a direct product of finitely many finite semilattices, where each factor represents a single variable. Therefore, from a theoretical perspective, we can assume a single semilattice that represents the entire state of the algorithm.

An astute reader will also have noticed that the hitherto abstract f is really just a form of *operational semantics*, though it's not the *standard* operational semantics of the language in question (the one that corresponds to execution and which models the dynamic behaviour of its programs). So the algorithm really boils down to:

*Apply operational semantics to the initial state of the program until a fixpoint is reached.*

The natural semantics of `x₀ ← φ x₁ x₂` is the join operation of the semilattice, i.e. `x₀ ← x₁ ∨ x₂`. The same (with some extra steps) holds for merging multiple dynamic instances of the same variable in one of those supervariables. As long as all the remaining mundane operations can get semantics that are monotonic, we have a clear winner. A simple way to ensure that is to define `←` (assignment) itself to include a join: `a ← b` becomes `a ← a ∨ b`. Since the join is applied last, this is clearly monotonic.
Alternatively, if we have SSA, we know that all data flow loops must go
through φ instructions, which do the joining. This means that, conveniently, φ
instructions also represent the only values that need to be stored (everything
else within SSA is derived combinatorially from φ instructions and
‘constants’). We immediately see that the entire semantic function is
monotonic if the semantics of φ instructions are.

# worked examples

Let us consider the following C++ function:

```
const char* color_name( color_t col )
{
    const char* name;

    switch ( col )
    {
        case color_t::red:
            name = "red"; break;
        case color_t::green:
            name = "green"; break;
        case color_t::blue:
            name = "blue"; break;
    }

    return name;
}
```

And the corresponding CFG:

```
L₁: name ← * # undefined
is_red ← col == red
goto is_red ? L₅ : L₂
L₂: is_green ← col == green
goto is_green ? L₆ : L₃
L₃: is_blue ← col == blue
goto is_blue ? L₇ : L₄
L₄: return name
L₅: name ← "red"
goto L₄
L₆: name ← "green"
goto L₄
L₇: name ← "blue"
goto L₄
```

Or in picture form:

Especially from the picture, it should be quite obvious that we can reach `ret` while the value of `name` is still `*` (i.e. undefined). That is a problem, more so because `name` is a pointer that may be dereferenced by the caller of the function, in which case the program is going to crash, or worse, do something random and unexpected. So how do we detect the problem systematically, using data flow analysis? We need to find a suitable domain, which is going to be a very simple 3-element semilattice: `undef` and `def` at the bottom, with `maybe_undef` as the top element above both.

All constants are taken to be `def`, all variables are initially `undef` (in SSA, this is actually made explicit) and all *operations are joins*. Except assignment, which is regular assignment. As you can easily see from the Hasse diagram above, the joins look like this (recall that semilattices are *commutative* bands, so we leave out the redundant rows with flipped operands):

| a | b | a ∨ b |
|---|---|---|
| `def` | `def` | `def` |
| `def` | `undef` | `maybe_undef` |
| `def` | `maybe_undef` | `maybe_undef` |
| `undef` | `undef` | `undef` |
| `undef` | `maybe_undef` | `maybe_undef` |
| `maybe_undef` | `maybe_undef` | `maybe_undef` |

So how does it go? Looking at the CFG and remembering that the order of propagation does not matter (all our operations are associative; I mean the one operation that we use for everything):

- we don't know anything at all that would help us decide which way a given branch goes, so we always, automatically take both branches,
- propagate the `undef` associated with `name` down along the `is_red` edge,
- apply `name ← "red"`, which translates to `name ← def`, which means that `name` is now `def`,
- do the same along `is_green` and `is_blue`.

The picture now looks something like this:

- propagate along the 4 incoming edges into the exit block; this gives us 3 `def` and one `undef`; thanks to the associativity of ∨, it does not matter in which order we get there, the result is going to be `maybe_undef`,
- at this point, no propagation that we can do will change the assignments and we have reached our fixed point,
- we conclude that the original C++ function can return an uninitialized pointer and crash the program (the technical term is that it invokes undefined behaviour).

# conclusion

Data flow analysis use cases include various compiler optimization techniques
like constant propagation, static checks like definite assignment analysis or
detection of possible null pointer dereferences, or even construction of an
SSA form of a program.

There are also many variants that we did not cover: for instance, propagation can be performed backwards (against the flow of data) in some analyses, or the requirements for the domain can be relaxed (the semilattice requirement is sufficient, but not necessary). For such relaxed domains, things can become a lot more complicated (e.g. the order of traversal of the control flow graph may start to matter).