The main characteristic of a compiler (as opposed to an interpreter) is that it transforms the input code. To make this a bit more precise, a compiler reads a description of a program in some language, computes for a while, and spits out a new description of a closely related program, usually in some other language. In the most commonly encountered form, the input language is tailored to be easy to write and read by human beings, and the output language is designed to efficiently execute on some machine.
However, neither language is usually well suited to transforming or analysing programs. For this reason, many compilers actually deal with 3 (or more) languages. The additional languages (neither input nor output) are called intermediate. As you probably know, the most famous and fashionable of those is LLVM. However, it is often the case that any given compiler will have its own intermediate language (LLVM is a bit of an outlier).
What do we want from such an intermediate language (IL)? It should make the following things easy:
- Convert the input language into the IL.
- Analyse and transform the program in various ways, most commonly to make it run faster.
- Convert the IL into the output language.
Possibly by pure luck, the set of abstractions that works well for step 2 above also happens to offer a good decomposition. Now, what those abstractions are depends somewhat on the nature of the input and output languages in question. So this is a good point to assume a structured imperative input language and some sort of machine code as the output. In this case, the middle ground in terms of abstraction seems to be:
- split code into functions which can call each other,
- linear (flat) function-local control flow in the form of an oriented graph,
- automatic variables (named, function-scoped memory locations).
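To make this concrete, here is a minimal sketch of what data structures for such an IL might look like. This is purely illustrative – all the names (`Instr`, `Block`, `Function`) and the instruction set are invented for this post, not taken from any real compiler:

```python
from dataclasses import dataclass, field

# One IL instruction: an opcode plus its operands (a deliberately tiny sketch).
@dataclass
class Instr:
    op: str            # e.g. 'store', 'add', 'ret'
    args: tuple = ()

# A node in the control flow graph: a straight-line run of instructions,
# plus the names of the successor blocks (the edges of the oriented graph).
@dataclass
class Block:
    name: str
    instrs: list = field(default_factory=list)
    successors: list = field(default_factory=list)

# A function: its automatic variables and its function-local CFG.
@dataclass
class Function:
    name: str
    variables: list    # named, function-scoped memory locations
    blocks: dict       # block name -> Block; 'entry' is where execution starts

# A toy function that increments x in a loop until some condition sends it
# to the exit block.
f = Function('f', variables=['x'], blocks={
    'entry': Block('entry', [Instr('store', ('x', 0))], ['loop']),
    'loop':  Block('loop',  [Instr('add', ('x', 1))], ['loop', 'exit']),
    'exit':  Block('exit',  [Instr('ret', ('x',))], []),
})
```

Note how all three abstractions show up directly in the types: functions are top-level units, control flow is a graph of blocks, and variables are just names scoped to the function.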
Functions encapsulate and decompose both control and data flow of the program. Even in high-level languages, this is precisely what makes functions powerful: a function offers a meaningful unit of behaviour, sufficiently complex as to be interesting, but sufficiently simple to be easily analysed (for some value of easily).
What do I mean when I say encapsulate and decompose (lest we descend into a meaningless buzzword salad)? The local control and data flow are protected from the effects of other functions: a slightly idealized function call is indistinguishable from a built-in operation of the language:
- the control flow of the caller is wholly unaffected by the call,
- data flows in (through the arguments, akin to operands of a primitive operation) and out (through the return value).
Clearly, this neat little world will immediately start falling apart when we consider mutable state shared by multiple functions (global variables) or, to a lesser degree, control flow that cuts through multiple functions (like exceptions). These features clearly compromise the black-boxness of functions, and with it the very compelling consequence that global behaviour is a straightforward composition of local behaviours. You could almost understand people who want to banish global state and go purely functional.
Crucially, it is the abstractions the IL provides that make this encapsulation possible (however limited by the presence of mutable shared state). It ought to be instructive to compare this functions + local control flow + local variables model to a traditional register machine.
What's a register machine, you ask? It's a rather simple computational model that is quite good at capturing the semantics of actual physical processors (obviously with considerable simplifications). An abstract machine is a device that has some internal state, and a program which tells the machine how to update that state. A register machine, then, is an abstract machine where:
- the program is a numbered list of instructions,
- the state is made of:
- a program counter which tells the machine which instruction from the above list should be executed in the next step,
- a register file (a fixed set of named global variables),
- some memory (a numbered array of global variables),
- there are rules that describe how to update the above state when a particular instruction gets executed.
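A toy version of such a machine is easy to write down. The following sketch (with an instruction set invented just for illustration) interprets a numbered list of instructions, using a dictionary as the register file; the example program sums the numbers 5 down to 1 with a conditional back-jump:

```python
def run(program, registers):
    """Interpret a numbered instruction list on a tiny register machine."""
    pc = 0                                   # program counter
    while pc < len(program):
        op, *args = program[pc]
        if op == 'set':                      # set r, value
            registers[args[0]] = args[1]
        elif op == 'add':                    # add r1, r2 : r1 += r2
            registers[args[0]] += registers[args[1]]
        elif op == 'dec':                    # dec r : r -= 1
            registers[args[0]] -= 1
        elif op == 'jnz':                    # jnz r, target : jump if r != 0
            if registers[args[0]] != 0:
                pc = args[1]
                continue
        pc += 1                              # default: fall through
    return registers

# sum = 5 + 4 + 3 + 2 + 1
regs = run([
    ('set', 'n', 5),
    ('set', 'sum', 0),
    ('add', 'sum', 'n'),     # instruction 2: the loop body
    ('dec', 'n'),
    ('jnz', 'n', 2),         # back-jump to instruction 2 while n != 0
], {})
# regs['sum'] is now 15
```

Notice that everything here is global: the registers, the memory, the program counter. There is no trace of functions or local state in the model itself, which is exactly the point of the comparison.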
Now, consider what happens when ‘logical piece f’ makes use of ‘logical piece g’ (in a way that would resemble a function call): we have to keep track of what registers (and memory locations) are in use by f, so that g does not clobber them. This is hardly compositional – g must be aware of all its callers and the register and memory locations that they use, and scrupulously avoid using those resources for itself.
But surely, you might protest, we can organize (part of) the memory of the abstract machine into a call stack, have every ‘logical piece’ of code store its local state on that stack, back up in-use registers (again, using this call stack) before ‘calling’ other ‘logical pieces’ and so on and so forth. The machinery that we are all intimately familiar with. That way, perhaps we can have functions? And the answer to that is both ‘yes’ and ‘no’.
Just like the abstract concept of a stack can be implemented as a dynamic array or as a linked list (or in any number of other ways), what we have discovered is how to implement the abstract concept of a function using a call stack and some discipline (a protocol, or an ABI, if you like).
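As a sketch of that discipline: the caller-saves protocol can be expressed over a plain array and a stack pointer. Everything below is invented for illustration – it is not any real ABI, just the shape of one:

```python
# Global machine state: a register file plus a region of memory used,
# by convention only, as a call stack.
regs = {'r0': 0, 'sp': 0}
stack = [0] * 16

def push(r):
    """Spill register r to the stack (part of the calling convention)."""
    stack[regs['sp']] = regs[r]
    regs['sp'] += 1

def pop(r):
    """Restore register r from the stack."""
    regs['sp'] -= 1
    regs[r] = stack[regs['sp']]

def g():
    # g freely clobbers r0 -- under a caller-saves protocol, that is
    # the caller's problem, not g's.
    regs['r0'] = 99

def f():
    regs['r0'] = 7        # f's local state, living in a register
    push('r0')            # back up in-use registers before the 'call'...
    g()
    pop('r0')             # ...and restore them afterwards
    return regs['r0']     # f's state survived the call: returns 7
```

The crucial observation is that nothing in the machine enforces this. The stack is just memory, and the protocol is just a convention that every piece of code must follow by hand.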
But consider this: you are given a program that uses a dynamic array to do some stuff – perhaps it's used in a stack-like manner, but perhaps not; maybe there's a bit in the program where it sneakily iterates the array from left to right… In some sense, the information is present in the program, but it is not manifest. You have to painstakingly analyze the program to discover that yes, in fact, this array is being used as a stack.
Clearly, the same thing happens if you are given an arbitrary register program: perhaps it maintains a call stack and takes advantage of local state and so on. But you cannot easily tell. Ask anyone who ever tried to decompile a program.
In any case, if you know the trick (call stacks), going from a language with abstract functions to a language without them becomes quite easy. Going back, however, is fraught with difficulties.
The story with the local control flow graph is essentially the same: the register program has a numbered list of instructions, and perhaps some instructions to change the value of the program counter (what we would recognize as a jump instruction). Clearly, we can encode our neat abstract control flow by arranging the nodes in the control flow graph into a linear list and connecting them with suitable jump instructions. Recovering that control flow from the linear, jump-connected list is a much more difficult problem. Even worse, there are register programs that don't translate into a neat control flow graph at all (of course, same thing happens with call stacks).
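The encoding direction (the easy one) can be sketched in a few lines. The toy linearizer below uses an invented block format – each block is a tuple of straight-line instructions, an optional conditional branch edge, and an optional fall-through edge – and turns a small loop CFG into a flat, numbered instruction list:

```python
def linearize(blocks, order):
    """Lay CFG blocks out in the given order, wiring edges into jumps.

    blocks: name -> (instrs, cond, nxt), where cond is (register, target
    block) for a conditional branch and nxt is the unconditional successor.
    A sketch of the idea, not a real encoder.
    """
    # Pass 1: compute where each block will start in the flat list.
    offsets, pos = {}, 0
    for name in order:
        instrs, cond, nxt = blocks[name]
        offsets[name] = pos
        pos += len(instrs) + (1 if cond else 0) + (1 if nxt else 0)
    # Pass 2: emit the instructions, turning CFG edges into jump targets.
    flat = []
    for name in order:
        instrs, cond, nxt = blocks[name]
        flat += instrs
        if cond:
            flat.append(('jnz', cond[0], offsets[cond[1]]))
        if nxt:
            flat.append(('jmp', offsets[nxt]))
    return flat

# A simple loop: entry -> loop; loop -> loop (while n != 0), else -> exit.
flat = linearize({
    'entry': ([('set', 'n', 3)], None, 'loop'),
    'loop':  ([('dec', 'n')], ('n', 'loop'), 'exit'),
    'exit':  ([('halt',)], None, None),
}, ['entry', 'loop', 'exit'])
```

Going the other way – recovering `blocks` from `flat` – requires scanning for jump targets, splitting the list at every one of them, and hoping the program never computes a jump target at runtime; that asymmetry is the whole point.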
Hopefully, the lesson here is clear: abstraction, once removed, is very hard to recover. This is why keeping the right abstractions in the intermediate language is really important.
Removing (encoding) abstractions is basically what compilers do. So far, we have been thinking about the abstractions that we chose to retain in our intermediate language, and the impact of their loss on our ability to reason about the program. But when we think about the role of the IL, it is also important to think about the abstractions that we choose to remove from the input program.
Perhaps the most obvious casualty is structured control flow: those fancy switch statements, and so on, all got flattened into a control flow graph. Interestingly, what we lost in abstraction, we have gained in uniformity. There are a couple of reasons why control flow graphs (CFGs) are a good compromise. Remember those 3 steps that we wanted the IL to make easy? This is how CFGs fit in:
- Encoding common structured control flow constructs into a CFG ranges from easy to trivial.
- The uniform representation and the explicitness of control flow (where can the program go next?) vastly simplify analysis – both when compared to the structured control (input), but also when compared to a numbered list of instructions with jumps between them (output).
- Encoding the CFG using jumps is again somewhere between easy and trivial.
Much easier than directly encoding a for loop as a list of instructions with a conditional backjump.
- Functions (along with methods, closures, coroutines, …) on the input get translated into functions in the IL – at least the function → function case is pretty easy.
- Local state and local control flow greatly simplify everything. It is hard to overstate how useful this is.
- Use a call stack to encode functions. Perhaps a little finicky, but reasonably easy.
- Local variables (a nearly universal feature of human-readable programming languages) in the input translate to local variables in the IL. Aggregate types might be decomposed into their constituent scalars, or some other type-related simplifications may be done. Mostly not a big deal.
- Local variables greatly simplify data flow analysis. What CFGs are for control flow, local variables are for data flow. We will get back to this in much more detail in a future post.
- Call stack again.
I actually wanted to do a post about SSA (static single assignment) this week, which is really about representing the data flow in intermediate languages. So what you got instead is a sort of an ‘extended intro’. I don't have a plan yet for next week, so perhaps we can finish what we started. After that, we will start looking at processors, with the first post in that series covering the basics of superscalar architectures.