Alias Analysis in LART
======================

The alias analyses implemented in LART are fairly expensive, and usually can't
provide on-the-fly answers. Instead, they compute a whole-module analysis that
gives points-to information for all "top level" values (mainly LLVM registers)
in the program, and for any "abstract memory locations" these values could be
pointing to.

As a rule, all alias analyses implemented in LART are "may" analyses, and
represent their result using points-to sets, where a particular pointer may
take any of the values in its points-to set. If this set is singleton, the
information is fully precise. Most importantly, the analyses guarantee that if
a particular location is *not* in a points-to set, the pointer must not take
that value.

Abstract Memory
---------------

In a running program, each memory (heap) allocation will result in a concrete
memory location being created in the program's address space and used. For the
purpose of alias analysis, we need to abstract those concrete memory locations
in some fashion, as it is impossible to statically compute the set of concrete
locations. The common abstraction used for this is assigning a single abstract
location to every callsite of a memory allocation routine.

In LART, the creation of abstract memory locations is a choice to be made by a
particular analysis. Initially, we will use the common approach of using one
abstract location for each static call that allocates memory.

Abstract memory locations are represented as unique integers.

Context and Flow Sensitivity
----------------------------

An alias analysis can be global, computing a single conservative "may point to"
solution for the entire program, meaning that for a particular pointer, it will
never "fall out" of its points-to set for the entire lifetime of the program.
In many cases, this analysis will be overly pessimistic. To that end, there are
two common ways in which to refine this global view. One computes distinct
solutions for each call-site (this is called context sensitivity), and another
for each control-flow location (this is called flow sensitivity). These two are
somewhat orthogonal: an analysis can be neither, one of them or both.

Metadata Format
---------------

As outlined above, the analyses produce a significant quantity of data and
could require fairly long time to do so. Moreover, it is desirable that this
data be readily available to external tools, even though LART itself is not
going to be part of LLVM in foreseeable future. As such, providing analysis
results in a well-defined format embedded inside LLVM bitcode as metadata which
standard LLVM libraries and tools can read seems to be a good compromise.

This makes the analysis results persistent, and easily re-usable without a
requirement for run-time linking or exporting stable APIs from LART.

The structure of the persistent metadata needs to be such that it can easily
represent results of multiple different alias analyses, with various degrees of
context and flow sensitivity. In particular, this means that the metadata needs
to be able to attach points-to sets to particular contexts, whether that
context is derived from flow sensitivity or from callsite/callgraph
sensitivity.

## Representing Points-To Sets

While top-level variables in LLVM are in an SSA form (meaning that they never
change their value during their lifetime), this is not true of allocated
memory, whether with alloca instructions or with malloc (and it wouldn't be
necessary to perform further alias analysis if it was). As such, the top-level
points-to sets (attached to toplevel pointers) never change during their
lifetime -- however, the points-to sets available indirectly through them do.
To take advantage of this fact, it is desirable to decouple transitive
points-to relations, so that top-level points-to sets can be attached to
definitions of toplevel values (i.e. to the instruction that defines them). As
such, they are entirely immutable. The points-to sets can only point at
abstract memory locations, toplevel variables in LLVM never have their
addresses taken and as such no pointers can point at them.

Apart from toplevel points-to sets, the metadata needs to also represent sets
that are attached to abstract memory locations. These sets will be (depending
on the type of the analysis) different at different point in the program, and
hence it needs to be possible to keep multiple such sets for each abstract
memory location, distinguished by context.

The representation of the context needs to be such as to be able to efficiently
decide if a particular callstack falls under the given context. For
flow-sensitive queries, only the topmost callframe is relevant, while for
context-sensitive queries, all but the topmost callframe can be potentially
relevant.

The contexts form a semi-lattice, and so do the points-to sets associated to
those contexts. Meets of contexts correspond to joins of points-to sets and
vice-versa. Intuitively, the broader the context the bigger (more
over-approximated) the points-to sets it entails. The idea here is that for a
particular context, we can obtain over-approximate points-to set as a union of
points-to sets in all contexts that are more specific. Conversely, for a more
specific context, if exact information for that context is not available, a
points-to set for any enclosing context is a sound over-approximation of the
desired answer. This gives us a good way to represent the context -> points-to
set mappings for abstract memory locations: each abstract memory location has a
context tree attached to it.

There are two types of context-tree nodes, internal and leaf nodes. An internal
node only has a single instruction pointer in it, always pointing to a callsite
(there is an additional virtual callsite that represents the entire program,
i.e. the root of the static call graph). Leaf nodes additionally contain an
actual points-to set. The points-to set for an internal node can be computed as
an union of all its children's points-to sets, and as such is not stored
explicitly. The depth of a context tree is entirely the discretion of the
analysis in question, and queries are satisfied by giving the most-specific
points-to set available in the context tree. (In the degenerate case of a
context-insensitive analysis, the context tree is singleton.)

To represent flow sensitivity in the data, context-trees are rather cumbersome
and inefficient, as they would need to allow representing instruction spans,
and for context-insensitive, flow-sensitive analysis, addition of many
redundant internal nodes to the context tree. Instead, a different
representation is used to represent flow sensitivity, orthogonal to context
sensitivity. Recall that instructions have a static points-to set attached to
them, representing the *result* of that instruction. Additionally, we can
attach a map from abstract memory locations to context trees to each
instruction, containing all the AMLs that any of the static points-to set in
any of its *arguments* refer to. The context trees in this map then represent
the points-to sets for those memory locations at the particular spot in the
control flow graph. Again, for context-insensitive, flow-sensitive analyses,
these context trees will be singleton. Conversely, for flow-insensitive,
context-sensitive analyses, they will just point to the global context trees
for the relevant memory locations (over-approximating the flow sensitivity
away).

## Mapping Points-To Set Representation to LLVM Metadata

To summarise the discussion above, we should set out the various types that
come into play in representing points-to sets.

abstract memory location
:   a unique representation for (a set of) memory locations
points-to set
:   a set of abstract memory locations
context tree
:   each node points to a particular callsite, representing the static
    callgraph of the program; leaf nodes additionaly contain a points-to set;
    representing context trees is particularly tricky, because it is not
    possible to directly store references to instructions in global metadata;
    instead, context tree nodes are referenced from callsites, and value
    use-def chains can be used to look up the callsite for a particular context
    tree node
AML map
:   a function from abstract memory locations to context trees
instruction
:   each instruction has a single points-to set attached, representing the
    points-to set of the (toplevel) result of this instruction, and a single
    AML map, which has entries for each element in all static points-to sets
    for all operands of the instruction, and transitively for all AMLs that
    arise in points-to sets of any already included AMLs
global points-to data
:   a single top-level AML map with entries for all AMLs that exist in the
    program; can be automatically summarised using results of a flow-sensitive
    analysis by unioning AML maps attached to all instructions, or can be
    obtained directly as a result of a flow-insensitive analysis

LLVM metadata is structured as a graph with labelled edges and nodes being
tuples of primitive values. The basic idea of the format is to represent the
above data types as LLVM metadata nodes. Abstract memory locations are simply
represented by nodes carrying a numeric ID. Points-to sets are represented by a
single node, with outgoing edges for each element of that set. Context tree are
mapped naturally to LLVM metadata trees, AML maps are represented as a list of
tuples (AML pointer, context tree pointer). Instructions get two named metadata
slots, `!aa_def` and `!aa_use`, first representing the result points-to set and
the other an AML map.

## Context Trees

As outlined earlier, context trees are particularly challenging to embed in
LLVM metadata. The trees are formed with no references to the actual callsites
they represent; instead, each callsite gets a unique ID and this ID is stored
in the context tree. To make lookups easy, each callsite is annotated with a
list of context tree nodes that reference its ID. Then, to obtain a pointer to
the particular callsite instruction from a context tree node, we look at the
node's use list -- the only instruction in the use list is the callsite.

## An Annotated IR Example

XXX

Future Extensions
-----------------

This spec is a work in progress. We expect to extend the metadata format to
cover type information, especially for agregate types, and field sensitivity in
the points-to sets.