garbage collection
The data of a running program is logically organised into objects (values),
whether bound to variables or referenced by other objects. During a
computation, such objects are created and used, but there is always a point
of last use for each, after which the program never refers to that object (at
the very latest, this happens when the program terminates). Creation (and
reading and updates) of objects is clearly part of what we would call useful
computation – it is what programs do to compute their results.
Since resources are finite, it is also desirable to destroy objects when
they are no longer useful: clearly, an object that is never going to be used
again (i.e. the program will not read or update its value) – a dead object –
does not need to be remembered. And not remembering an object will free up
precious memory in which to store new (more useful) objects. However,
reclamation of memory used by dead objects does not serve any direct
computational goal: it is only done to conserve resources.1
In complex programs, it can be hard to figure out the exact lifetimes of
objects (i.e. where exactly in the program their point of last use is, after
which they become dead). There are two ways in which this can go wrong:
- keeping a dead object around longer than necessary: this may be harmless, but if such objects start to accumulate, we call this a memory leak (or more generally a resource leak) and this may significantly reduce our ability to perform useful computation (since we run out of resources unnecessarily),
- recycling a live object: this is worse (much worse), because now the program is computing the wrong thing when it (inevitably) uses the object that no longer ‘exists’ (has been recycled).
A program which makes use of automatic garbage collection can be split into
three basic parts:
- a mutator which performs the useful computation (the program logic) and which therefore creates, reads and updates objects,
- an allocator which keeps track of unused memory (and hence can assign, or allocate memory for new objects as requested by the mutator), and finally
- a collector which collects dead objects (and recycles them, by telling the allocator that the memory they had occupied is now unused, or free).
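The split above can be illustrated with a toy sketch in Python. The class names and the cell-based 'heap' here are invented for illustration; a real allocator manages raw memory, not Python lists:

```python
# Toy sketch of the three parts (class names and the cell-based 'heap'
# are invented for illustration; a real allocator manages raw memory).

class Allocator:
    """Keeps track of unused memory: a set of free cell indices."""
    def __init__(self, size):
        self.free = set(range(size))
        self.heap = [None] * size

    def allocate(self, value):
        addr = self.free.pop()          # raises KeyError when out of memory
        self.heap[addr] = value
        return addr

class Collector:
    """Recycles dead objects by returning their cells to the allocator."""
    def __init__(self, allocator):
        self.allocator = allocator

    def reclaim(self, dead_addrs):
        for addr in dead_addrs:
            self.allocator.heap[addr] = None
            self.allocator.free.add(addr)

# The mutator is the program logic itself: it creates and uses objects.
alloc = Allocator(4)
collector = Collector(alloc)
a = alloc.allocate("result")            # useful computation...
b = alloc.allocate("scratch")
collector.reclaim([b])                  # ...'scratch' is dead, recycle it
```

The interesting question, of course, is how the collector decides which addresses are dead – which is what the rest of this post is about.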
1
Of course, there is a flip side to saving resources: a more complicated computation can be done with a given fixed amount of resources if they are managed more carefully.
2
There are other approaches to the same problem. They are collectively known as automatic memory management, and can be as mundane as a stack (for keeping local variables) or as sophisticated as a borrow checker.
3
It turns out that automatic memory management (and garbage collection in particular) enables quite a few language features that are otherwise hard to implement: lexical closures are a common example. There are two contributing factors: garbage collectors are dynamic (they perform computation at program runtime to infer lifetimes) and global (they ‘see’ all the objects in a program).
approximating lifetimes
Computing the exact lifetime of an object is a so-called ‘reachability
problem’ – is any use of the object reachable from this point in the program?
It is not surprising that this is undecidable (easily shown by reduction from
the halting problem, for instance). Therefore, computing exact lifetimes of
objects is hopeless – we have to find a suitable approximation.
The first priority is going to be that we never encounter the second type of
error (recycling a live object) – this means we need an approximation with a
one-sided error (also known as over- or under-approximation, depending on
the direction of the error – in this case, we will be over-approximating the
set of live objects). A natural choice for our question then is: can we say
with certainty that a given object is dead? If we do not know, then it might be
dead or alive, but we have to assume that it is alive (making the approximate
live set bigger).
How do we know that an object is certainly dead? Surely, if there are no live
references to that object, it cannot be used, and hence it must be dead. This,
at least, is something the collector can compute easily enough. Of
course, the first type of error – keeping dead objects around longer than
necessary – can and will happen (to various degrees, depending on how exactly
we implement the collector). Before we look at actual algorithms, though, we
need to make our notion of objects and references a little more formal.
First of all, the collector cannot treat objects as opaque blobs of data – it
needs to be able to extract information about references. A reference is, in
this context, any piece of data that can be used to access an object. While
their exact form is not very important for now, they usually take the form of
pointers – numbers that represent a memory address (at which the object is
actually stored).
Given a set of live objects, any reference stored in any of these objects is
a live reference, and hence the object that it refers to is in the
approximate live set (as defined above). This looks like a promising start
of an inductive definition: all we need is a starting point. To this end, we
define a root set – a set of objects we consider alive unconditionally.
Those are, essentially, the objects that are directly bound to variables.4
There is another way to look at this: if we take objects to be vertices and
references to be directed edges, together they form a directed graph. We can
then say that an object is certainly dead as soon as it becomes unreachable
from the root set. A collector which performs a literal reachability analysis
of this graph (starting from the root set) is called a tracing collector.
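The reachability analysis a tracing collector performs can be sketched in a few lines of Python, assuming the object graph is given as an adjacency list (a real collector walks actual heap objects instead):

```python
# Sketch of reachability over an object graph. The adjacency-list
# representation is illustrative only.

def reachable(roots, refs):
    """Return the set of objects reachable from `roots`.
    `refs` maps each object to the objects it references."""
    live = set()
    stack = list(roots)
    while stack:
        obj = stack.pop()
        if obj not in live:
            live.add(obj)
            stack.extend(refs.get(obj, []))
    return live

# A -> B -> C, and D -> C, but only A is in the root set:
refs = {"A": ["B"], "B": ["C"], "D": ["C"]}
live = reachable({"A"}, refs)   # D is unreachable, hence certainly dead
```

Note that C is kept alive by A's chain of references even though D (which also refers to C) is itself dead.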
These collectors face one significant obstacle. As they rely on traversing the
graph, we cannot allow it to be modified during the traversal. If a mutator
moved a reference while the traversal is in progress, it could cause objects
that are still alive to never be discovered.5 There are ways around this
(generational, incremental and concurrent collection), but the easiest way out
is to stop the world – that is, stop execution of all mutators until the
collection algorithm completes its work. Obviously, this comes at a cost, as it
freezes the entire program.
4
In most implementations, this translates to objects stored on the stack (or multiple stacks in multi-threaded programs) and those stored in global variables.
5
This could happen, for example, if the mutator moved a reference from a yet-unexplored part of the graph to a part that has already been examined. The objects in the tree rooted at the moved reference would never be explored by a standard traversal algorithm and would therefore be collected.
reference counting
Let us first look at an approach to garbage collection that does not trace the
object graph. The key idea is to store the count of incoming references for each
allocated object. This reference count (or refcount for short) is increased
each time a new reference is pointed to the object and similarly decreased when
a reference is lost (e.g. due to repointing or destruction of its source).
When the count reaches 0, we know that the object is surely dead and can be
safely collected.
The crucial point here is that pure refcounting collectors rely solely on
local information about individual references – their code is executed only
when a reference changes. This makes them less intrusive and easier to implement
than tracing collectors. As a result, they are sometimes available even in
‘low-level’ languages such as C++, where `std::shared_ptr` is nothing but a
pointer with refcount support.
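A toy model of the mechanism in Python (the `Obj` class and the `incref`/`decref` helpers are purely illustrative; CPython maintains its own refcounts internally):

```python
# Minimal manual refcounting sketch. In a real runtime, incref/decref
# are emitted by the compiler or interpreter on every reference change.

class Obj:
    def __init__(self, name):
        self.name = name
        self.refcount = 0
        self.collected = False

def incref(obj):
    obj.refcount += 1

def decref(obj):
    obj.refcount -= 1
    if obj.refcount == 0:
        obj.collected = True    # refcount hit zero: surely dead, reclaim

x = Obj("x")
incref(x)          # a new reference is pointed at x
incref(x)          # and another
decref(x)          # one reference lost
decref(x)          # last reference lost -> x is collected immediately
print(x.collected) # True
```

Notice that collection happens immediately and deterministically at the point where the last reference disappears, with no traversal of the heap.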
Reference counting has several important advantages over a tracing GC. The
aforementioned locality distributes the costs of garbage collection throughout
the run time, which leads to better latency (at the cost of potentially worse
throughput). Additionally, we can use this approach whenever we require
deterministic (i.e. predictable) collection, which is important for example
when a collection of an object releases further non-memory resources such as
file handles. On the other hand, reference counts have to be maintained, which
can cause significant overhead: making a copy of an object must also update the
reference counts of all objects that it refers to, triggering additional
memory writes and possibly compromising locality of reference.
There is a bigger drawback, though. Pure reference counting will never release
objects that form a loop in the object graph. Indeed, let us imagine a simple
case where we have a pair of objects A and B referencing each other and a local
reference, say R, to A. A has refcount 2 (from R and B) and B has refcount 1
(from A). When R is lost, the refcount of A is decreased, and the refcounts of
both objects now equal 1 even though neither is reachable any more.
This situation is too common (consider cyclic data structures such as
doubly-linked lists or pointer-based graphs) to be ignored and as such, there
are several workarounds. We can employ either weak references, which do not
count as incoming references and are reset when their target is collected, or a
tracing collector in addition to reference counting – an approach chosen by
Python, for example.
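The A/B cycle from the example above can be reproduced directly in Python, whose refcounting is backed by exactly such a supplementary tracing collector (exposed as the `gc` module):

```python
import gc
import weakref

class Node:
    def __init__(self):
        self.other = None

# Build the A <-> B cycle from the text:
a, b = Node(), Node()
a.other, b.other = b, a

probe = weakref.ref(a)   # lets us observe whether `a` gets collected

del a, b                 # the local reference (R) is gone...
# ...but each object still holds a reference to the other, so pure
# refcounting would leak the pair. Python's cycle detector finds it:
gc.collect()
print(probe() is None)   # True: the cycle was reclaimed
```

The `weakref` here also illustrates the other workaround: a weak reference does not keep its target alive and is simply reset to `None` once the target is collected.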
mark and sweep
We will now turn our attention to tracing garbage collection algorithms.
Recall that we have a good over-approximation for live objects – those objects
that are reachable from our root set. Therefore, a general strategy presents
itself: first traverse the object graph and mark all reachable objects (e.g.
by keeping a special ‘live’ bit for each object), then sweep (release) every
unmarked object and reset the marks.6
While the idea is simple, an implementation of a tracing GC can be quite
challenging. First, a collector is usually run when the available program memory
approaches exhaustion, and this places restrictions on the space complexity of
the collector. Notably, we cannot just implement a naive breadth-first search or
similar graph traversal algorithm and consider the mark phase solved.
Furthermore, the collector needs to go over the whole allocated memory several
times and mutators have to be stopped while the collector runs. It is not
surprising that this simple approach has unacceptable overhead, especially in
memory-constrained environments where the collector needs to run more often.
There are several modifications to our general strategy that make tracing GCs
usable in the real world. An important optimisation for the mark phase is called
tri-color marking. Instead of a single live/dead bit, we mark each object as
either white, gray or black. In short, we remember for each object not just its
reachability, but also whether we have visited its outgoing references.
Confirmed reachable objects are black if we have also visited their outgoing
references and gray otherwise; other objects are white. The algorithm proceeds
as follows:
- start with all objects white,
- mark every object reachable from the root set gray,
- while there are gray objects, pick any, mark it black and mark its white successors gray.
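The three steps above can be sketched as follows, again over an illustrative adjacency-list graph:

```python
# Tri-color marking sketch. White objects are represented implicitly:
# anything in neither `gray` nor `black` is white.

def mark(roots, refs):
    """Return the set of black (confirmed live) objects."""
    gray = set(roots)    # reachable, outgoing references not yet visited
    black = set()        # reachable, outgoing references visited
    while gray:
        obj = gray.pop()             # pick any gray object
        black.add(obj)               # mark it black
        for succ in refs.get(obj, []):
            if succ not in black and succ not in gray:
                gray.add(succ)       # white successor becomes gray
    return black

objects = {"R", "A", "B", "C", "D"}
refs = {"R": ["A"], "A": ["B"], "B": ["A"], "C": ["D"]}
live = mark({"R"}, refs)             # {'R', 'A', 'B'}
garbage = objects - live             # sweep phase: everything not black
```

The loop can be paused after any iteration and resumed later – only the contents of the gray and black sets need to survive between runs, which is what makes the scheme incremental.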
While this looks like a needless complication, it has an important advantage –
it can be performed incrementally. The collector can just pick one or a few
gray objects, perform the necessary markings, and return control back to a
mutator without actually collecting anything. Once a sweep is requested, it only
has to deal with remaining gray objects. Of course this comes at a cost in
implementation complexity: for instance, the GC now needs to ensure that new
objects reachable immediately from the root set are marked gray, and so on.
6
Can you improve this process so that we do not have to reset marks in every collection cycle?
other tracing algorithms of interest
Garbage collection is an old but still very active area of research. There are
many other approaches, sometimes complementary, sometimes incompatible with
the above strategy, and most are beyond the scope of this blog post. A few are
nonetheless worth mentioning.7
A very different strategy is called copying garbage collection. In the
simplest case, it works by splitting the program heap into two halves – a
working half and an empty half. Each time a collection is performed, the
collector traverses the memory (more specifically, the working half) similarly
to the mark and sweep one, but instead of marking the reachable objects, it
copies them to the empty half. Once all reachable objects are copied, the empty
half becomes the new working half and vice versa. This method has several
important advantages. Chiefly, it prevents memory fragmentation (by placing
objects next to each other in the working half) and makes object allocation
quite cheap (the working space is neatly partitioned into a single contiguous
‘free’ and one ‘used’ area).
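A minimal Cheney-style sketch of semispace copying, assuming a toy heap where each 'cell' is just a list of references (indices) to other cells:

```python
# Semispace copying sketch: live cells are copied from the working half
# into the empty half and their references repointed. The list-of-lists
# heap representation is invented for illustration.

def copy_collect(roots, from_space):
    to_space = []        # the empty half; allocation is a cheap append
    forwarding = {}      # old address -> new address

    def copy(addr):
        if addr not in forwarding:
            forwarding[addr] = len(to_space)
            to_space.append(list(from_space[addr]))  # copy the cell
        return forwarding[addr]

    # Copy the roots, then scan copied cells, copying their children
    # and repointing references as we go.
    new_roots = [copy(r) for r in roots]
    scan = 0
    while scan < len(to_space):
        cell = to_space[scan]
        cell[:] = [copy(ref) for ref in cell]
        scan += 1
    return new_roots, to_space       # to_space becomes the working half

# Cell 0 refers to cell 1; cell 2 is garbage (unreachable from roots).
from_space = [[1], [], [0]]
roots, heap = copy_collect([0], from_space)
```

After the collection, only the two reachable cells survive, packed contiguously at the start of the new working half – there is no fragmentation to manage.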
Of course, this algorithm has the obvious disadvantage of wasting half of the
heap memory. However, modern copying (also called compacting) collectors use a
similar method without splitting the heap – objects can be ‘shifted’ within the
heap towards the start, thus ensuring contiguous space of allocated objects.
Another quite serious complication lies in the fact that once objects start
moving in memory, extra work needs to be done to repoint all live references
to new locations. As a result, this approach makes interaction with code
unaware of the collector problematic.8
Another method is generational garbage collection. This builds on the
(empirically tested) observation that most objects die quite young, while old
objects are less likely to be collected. This leads to an idea of several heap
segments, or generations, managed more or less separately. New objects are
created in the ‘young space’, which is collected often, while those that survive
long enough are promoted to a more ‘mature’ segment, which is scanned less
frequently. This decreases the time spent on a single collector run, because
the GC scans only a part of the heap.
A key point here is that references from older to younger objects are
comparatively rare (this is especially true in functional languages, which
tend to avoid mutating existing objects). This means that such references can
be tracked separately, avoiding expensive tracing of the entire (possibly
quite large) mature segment.
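A sketch of a minor (young-generation-only) collection, with the remembered set of old-to-young references acting as extra roots; the representation is invented for illustration:

```python
# Minor collection sketch: trace only the young generation. Old objects
# are never scanned; old->young references recorded in the remembered
# set stand in for them as additional roots.

def minor_collect(young, roots, remembered):
    """young: {addr: [references into young]};
    roots, remembered: addresses of young objects."""
    live = set()
    stack = [a for a in roots | remembered if a in young]
    while stack:
        addr = stack.pop()
        if addr not in live:
            live.add(addr)
            stack.extend(young[addr])
    # A real collector would now promote long-lived survivors
    # to the mature segment.
    return live

young = {1: [2], 2: [], 3: []}
roots = {1}
remembered = {3}    # some old object still points at young object 3
survivors = minor_collect(young, roots, remembered)
print(sorted(survivors))    # [1, 2, 3]
```

Without the remembered set, object 3 would have been collected even though a mature object still refers to it – which is exactly the premature-release error we set out to avoid.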
7
We invite the interested reader to look up further information for example in the Garbage Collection Handbook.
8
This is important for example when calling a function written in a different language without garbage collector. The problem is solved by telling the collector to pin the object in place, which is of course another implementation pain point.
identifying references
We have previously mentioned that a collector needs to recognize which values
are references and which are other, uninteresting data. Until now, however, we
have assumed it ‘just works’. Let us now look at how a collector actually
achieves this.
If the programming language (and its runtime) was designed with garbage
collection in mind, the mutator itself will track which bits of data are
references and which are not. There are multiple ways to achieve this;
probably the most common is tagging (where each value carries a short type tag).
Other methods rely on insight into the layout of objects. A garbage collector
that relies on this information is called exact.
However, this information might not be available – this would be the case in
languages that do not natively use a garbage collector, for instance C or C++.
Fortunately, this does not make garbage collection impossible. What we need to
be able to do is follow references, and it is actually not a huge problem if we
accidentally ‘follow’ something that is not an actual pointer.9 In the worst
case, we will simply mark an object that is not actually a target of any live
references. This might cause additional dead objects to be retained, but it
cannot cause the opposite – more severe – type of error, where an object is
released prematurely.
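A toy model of this conservative root identification: any word whose value happens to coincide with a heap address is treated as a pointer (real conservative collectors such as Boehm GC scan raw stack and register contents in roughly this spirit):

```python
# Conservative scanning sketch: we cannot tell pointers from integers,
# so every word that *looks* like a heap address is kept as a root.
# The addresses and 'stack words' below are made up for illustration.

def conservative_roots(words, heap_addresses):
    """Over-approximate the root set: keep any word that coincides
    with the address of an allocated object."""
    return {w for w in words if w in heap_addresses}

heap_addresses = {0x1000, 0x1040, 0x1080}
stack_words = [7, 0x1040, 42, 0x2000]  # 0x1040 might be pointer or number
roots = conservative_roots(stack_words, heap_addresses)
```

If `0x1040` was really just an integer, the object at that address is retained unnecessarily – the harmless direction of error; nothing live is ever released.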
9
Collectors that take advantage of this principle are called conservative. Boehm GC is a well-known example. Since Boehm GC additionally cooperates with the system allocator, it can also be used to detect leaks.