data flow analysis
In previous blog posts, we have mentioned the term data flow multiple times.
This week, we will have a closer look at some interesting properties of
a program that can be inferred from its data flow, and how to recover them
algorithmically. The process is commonly known as data flow analysis and is
a form of static analysis. The basic premise is to summarise the data
coming into and leaving each basic block of a given program.
It is the summarisation that is the key to the usefulness of the analysis:
clearly, if we simply ‘propagate’ the concrete values from variable to
variable, performing the usual operations, we get standard execution.
Instead, we will use a suitable domain to keep track of variable
properties of interest. This freedom in the choice of the domain is what makes
data flow analysis a very flexible and general tool – picking a different
domain leads to a different type of analysis.
Often enough, we can track the data flow separately for each variable or other
basic data unit of the program, even though this is not always possible. When
we track individual values, the domains involved in the analysis become much
simpler and easier to understand, at the expense of a little bit of
generality.
a motivating example
Let's consider a very simple domain first: the set of possible values a
variable can take. In a program with only a single basic block, the results
will coincide with straightforward execution:
L₁: x₁ ← 2        # x₁ = { 2 }
    y₁ ← 3        # y₁ = { 3 }
    x₂ ← x₁ + y₁  # x₂ = { 5 }
    end
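To make this a little more concrete, here is a minimal sketch of this domain
in C++ (the names are mine, not part of any established API): the abstract
value of a variable is a std::set<int>, and the transfer function for +
produces all pairwise sums.

#include <set>

using values = std::set< int >;

// transfer function for ‘x ← a + b’ in the value-set domain: the
// result contains every sum of one value from a and one from b
values plus( const values &a, const values &b )
{
    values r;
    for ( int x : a )
        for ( int y : b )
            r.insert( x + y );
    return r;
}

With singleton sets, this coincides with ordinary execution: plus( { 2 }, { 3 } )
is { 5 }, exactly as in the block above.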
Let's add some non-trivial control flow to the mix, namely, an if-else
statement. Assume that y = { 0, 1 } at this point:
L₁: x₁ ← 2            # x₁ = { 2 }, y = { 0, 1 }
    goto y ? L₂ : L₃  # both outgoing edges are possible
L₂: x₂ ← x₁ + 3       # x₂ = { 5 }
    goto L₄
L₃: x₃ ← x₁ + 2       # x₃ = { 4 }
    goto L₄
L₄: x₄ ← φ x₂ x₃      # x₄ = { 4, 5 } coming from either L₂ or L₃
    end
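The φ in L₄ is where the two branches meet, and in this domain it is simply a
set union. A sketch, reusing the values type from above:

// the semantics of ‘x ← φ a b’ in the value-set domain: the result
// can be any value that arrives along either incoming edge
values join( const values &a, const values &b )
{
    values r = a;
    r.insert( b.begin(), b.end() );
    return r;
}

Here, join( { 5 }, { 4 } ) gives { 4, 5 }, matching x₄ above.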
Of course, the real challenge is loops. Let's see what happens if we try to
analyze one (again, y = { 0, 1 } at the outset):
L₁: x₁ ← 2
L₂: x₂ ← φ x₁ x₃
    x₃ ← x₂ + 3
    goto y ? L₂ : L₃
L₃: end
Let's track the computation through a few iterations:
L₁: x₁ ← 2        # x₁ = { 2 }
L₂: x₂ ← φ x₁ x₃  # x₂ = { 2 } coming from L₁
    x₃ ← x₂ + 3   # x₃ = { 5 }
L₂: x₂ ← φ x₁ x₃  # x₂ = { 2, 5 } add 5 coming from L₂
    x₃ ← x₂ + 3   # x₃ = { 5, 8 }
L₂: x₂ ← φ x₁ x₃  # x₂ = { 2, 5, 8 } add 8 coming from L₂
    x₃ ← x₂ + 3   # x₃ = { 5, 8, 11 }
…
L₂: x₂ ← φ x₁ x₃  # x₂ = { 2 + 3n | n ∈ ℕ }
While a fixed point clearly exists, there are two problems: the value of
x₂ (and x₃) is infinite, and it takes infinitely many steps to find it. Before
we tackle that problem, however, let's spell out the algorithm that we have
intuitively followed in the above examples. That will also clarify the role of
the domain and the operations it needs to provide.
the algorithm
So, the algorithm is very simple:
1. assign initial values, taken from the domain, to each piece of ‘data’ in
   the program: constants and variables,
2. propagate these values along the flow of data, substituting the operations
   in the program (e.g. x ← a + b) with suitable operations in the domain
   (e.g. x ← a ∨ b, if the domain is a simple semilattice),
3. repeat step 2 until a fixpoint is reached (i.e. further propagation does
   not change the data in any way).
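In C++, the loop might look something like the following sketch. It reuses
values, plus and join from the earlier sketches; the instruction
representation is hypothetical and heavily simplified (every instruction
applies a binary domain operation and joins the result into its output
variable, anticipating the monotonic assignment discussed below).

#include <map>
#include <string>
#include <vector>

struct instruction
{
    std::string result, arg1, arg2;
    values ( *op )( const values &, const values & );  // e.g. plus or join
};

using state = std::map< std::string, values >;

// steps 2 and 3 of the algorithm: keep propagating values through the
// instructions until one full pass changes nothing (the fixpoint)
void propagate( state &s, const std::vector< instruction > &program )
{
    bool changed = true;
    while ( changed )
    {
        changed = false;
        for ( const auto &i : program )
        {
            values next = join( s[ i.result ],
                                i.op( s[ i.arg1 ], s[ i.arg2 ] ) );
            if ( next != s[ i.result ] )
            {
                s[ i.result ] = next;
                changed = true;
            }
        }
    }
}

Of course, on the loop example above this would never terminate, since the
value sets keep growing; termination is exactly what the finite semilattices
of the next section buy us.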
While φ instructions might seem special at first, it turns out that they
play exactly the same role as the conflation of multiple dynamic variables
into a single static ‘supervariable’. And that's also where those semilattices
come into play.
semilattices and semantics
A finite join semilattice is a partially ordered set that has a join
(supremum, least upper bound) operation defined on all its non-empty subsets.
We will write joins using the binary operator ∨, as is the convention.
Immediately, we can see that a function f that maps any x to some f(x) ≥ x
has a fixed point – an x such that f(x) = x – that can be obtained by starting
at an arbitrary x₀ and applying f sufficiently (but finitely) many times: the
chain x₀ ≤ f(x₀) ≤ f(f(x₀)) ≤ … is non-decreasing, and since the semilattice
is finite, it must eventually stop changing. Of course, different choices of
x₀ can yield different fixed points.
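In code, the claim is just the following loop terminating (a sketch; T stands
for the semilattice and f for any function with f(x) ≥ x):

// repeatedly apply f until nothing changes; on a finite semilattice,
// x ≤ f(x) ≤ f(f(x)) ≤ … is a non-decreasing chain, so the loop
// must stop after finitely many steps
template< typename T, typename F >
T fixpoint( T x, F f )
{
    T y = f( x );
    while ( y != x )
    {
        x = y;
        y = f( x );
    }
    return x;
}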
So the analysis does something ‘until a fixpoint’, we want the analysis to
terminate, and we have a choice of ‘domain’ which represents the state of the
algorithm. I'm sure you can hear the pieces clicking into place.
It is perhaps worth mentioning that if we are using the ‘per variable’ or
rather ‘per static supervariable’ form of the analysis, the above semilattice
is simply a direct product of finitely many finite semilattices, where each
factor represents a single variable. Therefore, from a theoretical
perspective, we can assume a single semilattice that represents the entire
state of the algorithm.
An astute reader will also have noticed that the hitherto abstract f is
really just a form of operational semantics, though it's not the standard
operational semantics of the language in question (the one that corresponds to
execution and which models the dynamic behaviour of its programs). So the
algorithm really boils down to:
Apply operational semantics to the initial state of the program until a fixpoint is reached.
The natural semantics of x₀ ← φ x₁ x₂ is the join operation of the
semilattice, i.e. x₀ ← x₁ ∨ x₂. The same (with some extra steps) holds for
merging multiple dynamic instances of the same variable in one of those
supervariables. As long as all the remaining mundane operations can get
semantics that are monotonic, we have a clear winner. A simple way to ensure
that is to define ← (assignment) itself to include a join: a ← b becomes
a ← a ∨ b. Since the join is applied last, this is clearly monotonic.
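Using the set domain from earlier, the joining assignment is a one-liner
(again just a sketch):

// ‘a ← b’ with joining semantics: a can only ever grow, which is
// exactly the monotonicity we need
void assign( values &a, const values &b )
{
    a.insert( b.begin(), b.end() );
}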
Alternatively, if we have SSA, we know that all data flow loops must go
through φ instructions, which do the joining. This means that, conveniently, φ
instructions also represent the only values that need to be stored (everything
else within SSA is derived combinatorially from φ instructions and
‘constants’). We immediately see that the entire semantic function is
monotonic if the semantics of φ instructions are.
worked examples
Let us consider the following C++ function:
const char* color_name( color_t col )
{
    const char* name;

    switch ( col )
    {
        case color_t::red:
            name = "red"; break;
        case color_t::green:
            name = "green"; break;
        case color_t::blue:
            name = "blue"; break;
    }

    return name;
}
And the corresponding CFG:
L₁: name ← *  # undefined
    is_red ← col == red
    goto is_red ? L₅ : L₂
L₂: is_green ← col == green
    goto is_green ? L₆ : L₃
L₃: is_blue ← col == blue
    goto is_blue ? L₇ : L₄
L₄: return name
L₅: name ← "red"
    goto L₄
L₆: name ← "green"
    goto L₄
L₇: name ← "blue"
    goto L₄
Or in picture form:
Especially from the picture, it should be quite obvious that we can reach
ret while the value of name is still * (i.e. undefined). That is a
problem, more so because name is a pointer that may be dereferenced by the
caller of the function, in which case the program is going to crash, or worse,
do something random and unexpected. So how do we detect the problem
systematically, using data flow analysis? We need to find a suitable domain,
which is going to be a very simple 3-element semilattice:
All constants are taken to be def, all variables are initially undef (in
SSA, this is actually made explicit) and all operations are joins, except
assignment, which is regular assignment. As you can easily see from the Hasse
diagram above, the joins look like this (recall that semilattices are
commutative bands, so we leave out the redundant rows with flipped
operands):
a           | b           | a ∨ b
------------|-------------|------------
def         | def         | def
def         | undef       | maybe_undef
def         | maybe_undef | maybe_undef
undef       | undef       | undef
undef       | maybe_undef | maybe_undef
maybe_undef | maybe_undef | maybe_undef
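As a quick sketch, the entire domain fits into a few lines of C++ (the names
are again mine, chosen to match the table):

enum definedness { def, undef, maybe_undef };

// the join from the table above: equal elements join to themselves,
// any two distinct elements join to the top element, maybe_undef
definedness join( definedness a, definedness b )
{
    return a == b ? a : maybe_undef;
}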
So how does it go? Looking at the CFG and remembering that the order of
propagation does not matter (all our operations are associative; I mean the
one operation that we use for everything):
- we don't know anything at all that would help us decide which way a given
  branch goes, so we always, automatically take both branches,
- propagate the undef associated with name down along the is_red edge,
- apply name ← "red", which translates to name ← def, which means that name
  is now def,
- do the same along is_green and is_blue. The picture now looks something
  like this:
- propagate along the 4 incoming edges into the exit block; this gives us 3
  def and one undef; thanks to the associativity of ∨, it does not matter in
  which order we get there, the result is going to be maybe_undef,
- at this point, no propagation that we can do will change the assignments
  and we have reached our fixed point,
- we conclude that the original C++ function can return an uninitialized
  pointer and crash the program (the technical term is that it invokes
  undefined behaviour).
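For completeness, here is the whole fixpoint condensed into a few lines of
C++ on top of the definedness sketch above. The per-block values are written
out by hand rather than discovered by a worklist, but the joins are exactly
the ones the analysis performs:

#include <cstdio>

int main()
{
    definedness name_L1 = undef;  // name ← * in L₁
    definedness name_L5 = def;    // name ← "red"
    definedness name_L6 = def;    // name ← "green"
    definedness name_L7 = def;    // name ← "blue"

    // the exit block L₄ joins its four incoming edges; the edge from
    // L₃ carries the undef that fell through from L₁
    definedness name_L4 = join( join( name_L1, name_L5 ),
                                join( name_L6, name_L7 ) );

    if ( name_L4 == maybe_undef )
        std::puts( "color_name may return an uninitialized pointer" );
}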
conclusion
Data flow analysis use cases include various compiler optimization techniques
like constant propagation, static checks like definite assignment analysis or
detection of possible null pointer dereferences, or even construction of an
SSA form of a program.
There are also many variants that we did not cover: for instance, propagation
can be performed backwards (against the flow of data) in some analyses, or the
requirements for the domain can be relaxed (the semilattice requirement is
sufficient, but not necessary). For such relaxed domains, things can become a
lot more complicated (e.g. the order of traversal of the control flow graph
may start to matter).