intermediate languages
The main characteristic of a compiler (as opposed to an interpreter) is that
it transforms the input code. To make this a bit more precise, a compiler
reads a description of a program in some language, computes for a while, and
spits out a new description of a closely related program, usually in some
other language. In the most commonly encountered form, the input language is
tailored to be easy to write and read by human beings, and the output language
is designed to efficiently execute on some machine.
However, neither language is usually well suited for transforming or
analysing programs. For this reason, many compilers actually deal with 3
(or more) languages. The additional languages (neither input nor output) are
called intermediate. As you probably know, the most famous and fashionable
of those is LLVM. However, it is often the case that any given compiler will
have its own intermediate language (LLVM is a bit of an outlier).
abstractions
What do we want from such an intermediate language (IL)? It should make the
following things easy:
- Convert the input language into the IL.
- Analyse and transform the program in various ways, most commonly to make it run faster.
- Convert the IL into the output language.
Possibly by pure luck, the set of abstractions that works well for step 2
above also happens to offer a good decomposition for the other two steps.
What those abstractions are depends somewhat on the nature of the input and
output languages in question.
So this is a good point to assume a structured imperative input language and
some sort of machine code as the output. In this case, the middle ground in
terms of abstraction seems to be:
- split code into functions which can call each other,
- flat (unstructured) function-local control flow in the form of a directed graph,
- automatic variables (named, function-scoped memory locations).
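To make this a little more concrete, here is a minimal sketch of how an IL
built around these three abstractions might be represented. This is not the
IL of any particular compiler; every name in it is hypothetical:

```c
#include <stddef.h>

/* A hypothetical IL skeleton -- every name here is illustrative, not taken
   from any real compiler. */

enum opcode { OP_CONST, OP_ADD, OP_LOAD, OP_STORE, OP_CALL, OP_JUMP, OP_RET };

/* One instruction: an operation plus indices of the variables it uses. */
struct insn {
    enum opcode op;
    int operands[3];
};

/* One node of the control flow graph: a straight-line run of instructions,
   followed by edges to the blocks that may execute next. */
struct block {
    struct insn  *insns;
    size_t        insn_count;
    struct block *succ[2];   /* at most two successors: a conditional branch */
};

/* A function: its automatic (function-scoped) variables and the entry
   block of its control flow graph. */
struct function {
    const char   *name;
    size_t        var_count;
    struct block *entry;
};
```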
local concerns
Functions encapsulate and decompose both control and data flow of the program.
Even in high-level languages, this is precisely what makes functions powerful:
a function offers a meaningful unit of behaviour, sufficiently complex as to
be interesting, but sufficiently simple to be easily analysed (for some value
of easily).
What do I mean when I say encapsulate and decompose (lest we descend into
a meaningless buzzword salad)? The local control and data flow are protected
from the effects of other functions: a slightly idealized function call is
indistinguishable from a built-in operation of the language:
- the control flow of the caller is wholly unaffected by the call,
- data flows in (through the arguments, akin to operands of a primitive operation) and out (through the return value).
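In C terms (this is just an illustration of the idea, not any particular IL):
from where the caller stands, the idealized call has exactly the same shape
as a built-in operator:

```c
#include <stdio.h>

static int average(int a, int b) { return (a + b) / 2; }

int main(void) {
    int x = 3, y = 5;
    int sum = x + y;         /* built-in operation: operands in, value out */
    int avg = average(x, y); /* idealized call: exactly the same shape     */
    printf("%d %d\n", sum, avg);
    return 0;
}
```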
Clearly, this neat little world will immediately start falling apart when we
consider mutable state shared by multiple functions (global variables) or, to
a lesser degree, control flow that cuts through multiple functions (like
exceptions). These features clearly compromise the black-boxness of functions,
and its very compelling consequence that global behaviour is a straightforward
composition of local behaviours. You could almost understand people who want
to banish global state and go purely functional.
Crucially, it is the abstractions the IL provides that make this encapsulation
possible (however limited by the presence of mutable shared state). It ought
to be instructive to compare this functions + local control flow + local
variables model to a traditional register machine.
What's a register machine, you ask? It's a rather simple computational model
that is quite good at capturing the semantics of actual physical processors
(obviously with considerable simplifications). An abstract machine is a
device that has some internal state, and a program which tells the machine
how to update that state. A register machine, then, is an abstract machine
where:
- the program is a numbered list of instructions,
- the state is made of:
- a program counter which tells the machine which instruction from the above list should be executed in the next step,
- a register file (a fixed set of named global variables),
- some memory (a numbered array of global variables),
- there are rules that describe how to update the above state when a particular instruction gets executed.
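As a sketch, here is such a machine in a few dozen lines of C. The
instruction set is made up for this example; only the overall shape matters:

```c
#include <stdio.h>

/* A toy register machine with an invented instruction set. */

enum op { SET, ADD, LOAD, STORE, JNZ, HALT };
struct insn { enum op op; int a, b, c; };

int main(void) {
    int reg[4]  = {0};  /* register file: a fixed set of global variables */
    int mem[16] = {0};  /* memory: a numbered array of global variables   */
    int pc      = 0;    /* program counter                                */

    /* a program that sums the numbers 5, 4, ..., 1 into r0 */
    struct insn prog[] = {
        { SET,  0,  0, 0 },  /* 0: r0 = 0            */
        { SET,  1,  5, 0 },  /* 1: r1 = 5            */
        { SET,  2, -1, 0 },  /* 2: r2 = -1           */
        { ADD,  0,  0, 1 },  /* 3: r0 = r0 + r1      */
        { ADD,  1,  1, 2 },  /* 4: r1 = r1 - 1       */
        { JNZ,  1,  3, 0 },  /* 5: if r1 != 0 goto 3 */
        { HALT, 0,  0, 0 },  /* 6: stop              */
    };

    for (;;) {  /* each step: execute one instruction, update the state */
        struct insn i = prog[pc++];
        switch (i.op) {
        case SET:   reg[i.a] = i.b;                 break;
        case ADD:   reg[i.a] = reg[i.b] + reg[i.c]; break;
        case LOAD:  reg[i.a] = mem[reg[i.b]];       break;
        case STORE: mem[reg[i.a]] = reg[i.b];       break;
        case JNZ:   if (reg[i.a]) pc = i.b;         break;
        case HALT:  printf("r0 = %d\n", reg[0]);    return 0;
        }
    }
}
```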
Consider what happens when ‘logical piece f’ makes use of ‘logical piece g’
(in a way that would resemble a function call): we have to keep track of what
registers (and memory locations) are in use by f, so that g does not clobber
them. This is hardly compositional – g must be aware of all its callers and
the registers and memory locations that they use, and scrupulously avoid
using those resources for itself.
removing abstraction
But surely, you might protest, we can organize (part of) the memory of the
abstract machine into a call stack, have every ‘logical piece’ of code store
its local state on that stack, back up in-use registers (again, using this
call stack) before ‘calling’ other ‘logical pieces’ and so on and so forth.
The machinery that we are all intimately familiar with. That way, perhaps we
can have functions? And the answer to that is both ‘yes’ and ‘no’.
Just like the abstract concept of a stack can be implemented as a dynamic
array or as a linked list (or in any number of other ways), what we have
discovered is how to implement the abstract concept of a function using a
call stack and some discipline (a protocol, or an ABI, if you like).
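Here is roughly what that discipline amounts to, sketched in C with an
explicit stack in plain memory. The frame layout is an arbitrary choice,
which is rather the point – it is a convention the machine knows nothing
about:

```c
#include <stdio.h>

int mem[64];   /* the machine's memory            */
int sp = 64;   /* stack pointer: stack grows down */

/* the callee, by convention: argument at mem[sp], result left in its place
   (a real machine would also have pushed a return address) */
void square(void) {
    mem[sp] = mem[sp] * mem[sp];
}

int main(void) {
    int x = 7;

    sp -= 1;        /* caller: allocate a frame              */
    mem[sp] = x;    /* pass the argument through the stack   */
    square();       /* transfer control to the callee        */
    x = mem[sp];    /* read the result back out of the frame */
    sp += 1;        /* free the frame                        */

    printf("%d\n", x);  /* prints 49 */
    return 0;
}
```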
But consider this: you are given a program that uses a dynamic array to do
some stuff – perhaps it's used in a stack-like manner, but perhaps not; maybe
there's a bit in the program where it sneakily iterates the array from left to
right… In some sense, the information is present in the program, but it is
not manifest. You have to painstakingly analyze the program to discover that
yes, in fact, this array is being used as a stack.
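For instance, nothing in the following (perfectly ordinary) C program
announces that the array is a stack – the reader has to work that out:

```c
#include <stdio.h>

/* The array 'a' is used strictly as a stack, but nothing in the program
   says so -- the discipline lives entirely in the programmer's head. */
int main(void) {
    int a[8];
    int n = 0;

    a[n++] = 1;               /* push 1          */
    a[n++] = 2;               /* push 2          */
    printf("%d\n", a[--n]);   /* pop: prints 2   */
    a[n++] = 3;               /* push 3          */
    while (n > 0)             /* drain the stack */
        printf("%d\n", a[--n]);
    return 0;
}
```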
Clearly, the same thing happens if you are given an arbitrary register program:
perhaps it maintains a call stack and takes advantage of local state and so
on. But you cannot easily tell. Ask anyone who ever tried to decompile a
program.
In any case, if you know the trick (call stacks), going from a language with
abstract functions to a language without them becomes quite easy. Going back,
however, is fraught with difficulties.
The story with the local control flow graph is essentially the same: the
register program has a numbered list of instructions, and perhaps some
instructions to change the value of the program counter (what we would
recognize as a jump instruction). Clearly, we can encode our neat abstract
control flow by arranging the nodes in the control flow graph into a linear
list and connecting them with suitable jump instructions. Recovering that control
flow from the linear, jump-connected list is a much more difficult problem.
Even worse, there are register programs that don't translate into a neat
control flow graph at all (of course, the same thing happens with call stacks).
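To see both directions at once, here is the same loop twice in C: once with
structured control flow, and once with its CFG flattened into straight-line
code connected by jumps (the goto version is what a register program looks
like):

```c
#include <stdio.h>

int main(void) {
    int i = 0, sum = 0;

    /* structured control flow, as found in the input language */
    while (i < 5) {
        sum += i;
        i += 1;
    }

    /* the same loop with its CFG arranged into a linear sequence of
       'instructions' connected by jumps */
    int j = 0, sum2 = 0;
head:
    if (!(j < 5)) goto done;  /* conditional forward jump out of the loop */
    sum2 += j;
    j += 1;
    goto head;                /* unconditional back edge                  */
done:
    printf("%d %d\n", sum, sum2);  /* prints 10 10 */
    return 0;
}
```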
Hopefully, the lesson here is clear: abstraction, once removed, is very hard
to recover. This is why keeping the right abstractions in the intermediate
language is really important.
compilation
Removing (encoding) abstractions is basically what compilers do. So far, we
have been thinking about the abstractions that we chose to retain in our
intermediate language, and the impact of their loss on our ability to reason
about the program. But when we think about the role of the IL, it is also
important to think about the abstractions that we choose to remove from the
input program.
Perhaps the most obvious casualty is structured control flow: those fancy
if/elif/else statements, for loops, while loops, switch statements, and so
on, all got flattened into a control flow graph.
Interestingly, what we lost in abstraction, we gained in uniformity.
There's a couple of reasons why control flow graphs (CFGs) are a good
compromise. Remember those 3 steps that we wanted the IL to make easy? This is
how CFGs fit in:
- Encoding common structured control flow constructs into a CFG ranges from easy to trivial.
- The uniform representation and the explicitness of control flow (where can the program go next?) vastly simplify analysis – both compared to the structured control flow of the input, and compared to a numbered list of instructions with jumps between them (the output). A small example of such an analysis follows this list.
- Encoding the CFG using jumps is again somewhere between easy and trivial. Much easier than directly encoding a for loop as a list of instructions with a conditional backjump.
- Functions (along with methods, closures, coroutines, …) on the input get translated into functions in the IL – at least the function → function case is pretty easy.
- Local state and local control flow greatly simplify everything. It is hard to overstate how useful this is.
- Use a call stack to encode functions. Perhaps a little finicky, but reasonably easy.
- Local variables (a nearly universal feature of human-readable programming languages) in the input translate to local variables in the IL. Aggregate types might be decomposed into their constituent scalars, or some other type-related simplifications may be done. Mostly not a big deal.
- Local variables greatly simplify data flow analysis. What CFGs are for control flow, local variables are for data flow. We will get back to this in much more detail in a future post.
- Call stack again.
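As promised above, here is a small taste of why the uniform representation
makes analysis pleasant: a reachability pass over a CFG fits in a few lines,
because ‘where can the program go next’ is simply the successor array. The
encoding is made up for this example:

```c
#include <stdio.h>

/* A tiny CFG analysis: which blocks are reachable from the entry? */

#define NBLOCKS 5

int succ[NBLOCKS][2] = {   /* -1 means 'no successor'           */
    {  1,  2 },            /* 0: branch on a condition          */
    {  3, -1 },            /* 1: then-branch                    */
    {  3, -1 },            /* 2: else-branch                    */
    { -1, -1 },            /* 3: join, function returns         */
    {  3, -1 },            /* 4: dead code, nothing jumps here  */
};

int main(void) {
    int reachable[NBLOCKS] = { 1, 0, 0, 0, 0 };   /* entry is block 0 */
    int work[NBLOCKS], top = 0;
    work[top++] = 0;

    while (top > 0) {                  /* classic worklist loop */
        int b = work[--top];
        for (int k = 0; k < 2; ++k) {
            int s = succ[b][k];
            if (s >= 0 && !reachable[s]) {
                reachable[s] = 1;
                work[top++] = s;
            }
        }
    }

    for (int b = 0; b < NBLOCKS; ++b)
        printf("block %d: %s\n", b, reachable[b] ? "reachable" : "dead");
    return 0;
}
```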
conclusion
I actually wanted to do a post about SSA (static single assignment) this week,
which is really about representing the data flow in intermediate languages. So
what you got instead is a sort of ‘extended intro’. I don't have a plan yet
for next week, so perhaps we can finish what we started. After that, we will
start looking at processors, with the first post in that series covering the
basics of superscalar architectures.