garbage collection
The data of a running program is logically organised into objects (values),
whether bound to variables or referenced by other objects. During a
computation, such objects are created and used, but there is always a point
of last use for each, after which the program never refers to that object (at
the very latest, this happens when the program terminates). Creation (and
reading and updates) of objects is clearly part of what we would call useful
computation – it is what programs do to compute their results.
Since resources are finite, it is also desirable to destroy objects when
they are no longer useful: clearly, an object that is never going to be used
again (i.e. the program will not read or update its value) – a dead object –
does not need to be remembered. And not remembering an object will free up
precious memory in which to store new (more useful) objects. However,
reclamation of memory used by dead objects does not serve any direct
computational goal: it is only done to conserve resources.1
In complex programs, it can be hard to figure out the exact lifetimes of
objects (i.e. where exactly in the program their point of last use is, after
which they become dead). There are two ways in which this can go wrong:
- keeping a dead object around longer than necessary: this may be harmless, but if such objects start to accumulate, we call this a memory leak (or more generally a resource leak) and this may significantly reduce our ability to perform useful computation (since we run out of resources unnecessarily),
- recycling a live object: this is worse (much worse), because now the program is computing the wrong thing when it (inevitably) uses the object that no longer ‘exists’ (has been recycled).
A program which makes use of automatic garbage collection can be split into
three basic parts:
- a mutator which performs the useful computation (the program logic) and which therefore creates, reads and updates objects,
- an allocator which keeps track of unused memory (and hence can assign, or allocate memory for new objects as requested by the mutator), and finally
- a collector which collects dead objects (and recycles them, by telling the allocator that the memory they had occupied is now unused, or free).
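The split above can be illustrated with a toy sketch in Python. The class names and the cell-based 'heap' here are invented for illustration; a real allocator manages raw memory, not Python lists:

```python
# Toy sketch of the three parts (class names and the cell-based 'heap'
# are invented for illustration; a real allocator manages raw memory).

class Allocator:
    """Keeps track of unused memory: a set of free cell indices."""
    def __init__(self, size):
        self.free = set(range(size))
        self.heap = [None] * size

    def allocate(self, value):
        addr = self.free.pop()          # raises KeyError when out of memory
        self.heap[addr] = value
        return addr

class Collector:
    """Recycles dead objects by returning their cells to the allocator."""
    def __init__(self, allocator):
        self.allocator = allocator

    def reclaim(self, dead_addrs):
        for addr in dead_addrs:
            self.allocator.heap[addr] = None
            self.allocator.free.add(addr)

# The mutator is the program logic itself: it creates and uses objects.
alloc = Allocator(4)
collector = Collector(alloc)
a = alloc.allocate("result")            # useful computation...
b = alloc.allocate("scratch")
collector.reclaim([b])                  # ...'scratch' is dead, recycle it
```

The interesting question, of course, is how the collector decides which addresses are dead – which is what the rest of this post is about.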
1
Of course, there is a flip side to saving resources: a more complicated computation can be done with a given fixed amount of resources if they are managed more carefully.
2
There are other approaches to the same problem. They are collectively known as automatic memory management, and can be as mundane as a stack (for keeping local variables) or as sophisticated as a borrow checker.
3
It turns out that automatic memory management (and garbage collection in particular) enables quite a few language features that are otherwise hard to implement: lexical closures are a common example. There are two contributing factors: garbage collectors are dynamic (they perform computation at program runtime to infer lifetimes) and global (they ‘see’ all the objects in a program).
approximating lifetimes
Computing the exact lifetime of an object is a so-called ‘reachability
problem’ – is any use of the object reachable from this point in the program?
It is not surprising that this is undecidable (easily shown by reduction from
the halting problem, for instance). Therefore, computing exact lifetimes of
objects is hopeless – we have to find a suitable approximation.
The first priority is going to be that we never encounter the second type of
error (recycling a live object) – this means we need an approximation with a
one-sided error (also known as over- or under-approximation, depending on
the direction of the error – in this case, we will be over-approximating the
set of live objects). A natural choice for our question then is: can we say
with certainty that a given object is dead? If we do not know, then it might be
dead or alive, but we have to assume that it is alive (making the approximate
live set bigger).
How do we know that an object is certainly dead? Surely, if there are no live
references to that object, it cannot be used, and hence it must be dead. This,
at least, is something the collector can compute easily enough. Of
course, the first type of error – keeping dead objects around longer than
necessary – can and will happen (to various degrees, depending on how exactly
we implement the collector). Before we look at actual algorithms, though, we
need to make our notion of objects and references a little more formal.
First of all, the collector cannot treat objects as opaque blobs of data – it
needs to be able to extract information about references. A reference is, in
this context, any piece of data that can be used to access an object. While
their exact form is not very important for now, they usually take the form of
pointers – numbers that represent a memory address (at which the object is
actually stored).
Given a set of live objects, any reference stored in any of these objects is
a live reference, and hence the object that it refers to is in the
approximate live set (as defined above). This looks like a promising start
of an inductive definition: all we need is a starting point. To this end, we
define a root set – a set of objects we consider alive unconditionally.
Those are, essentially, the objects that are directly bound to variables.4
There is another way to look at this: if we take objects to be vertices and
references to be directed edges, together they form a directed graph. We can
then say that an object is certainly dead as soon as it becomes unreachable
from the root set. A collector which performs a literal reachability analysis
of this graph (starting from the root set) is called a tracing collector.
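The reachability analysis a tracing collector performs can be sketched in a few lines of Python, assuming the object graph is given as an adjacency list (a real collector walks actual heap objects instead):

```python
# Sketch of reachability over an object graph. The adjacency-list
# representation is illustrative only.

def reachable(roots, refs):
    """Return the set of objects reachable from `roots`.
    `refs` maps each object to the objects it references."""
    live = set()
    stack = list(roots)
    while stack:
        obj = stack.pop()
        if obj not in live:
            live.add(obj)
            stack.extend(refs.get(obj, []))
    return live

# A -> B -> C, and D -> C, but only A is in the root set:
refs = {"A": ["B"], "B": ["C"], "D": ["C"]}
live = reachable({"A"}, refs)   # D is unreachable, hence certainly dead
```

Note that C is kept alive by A's chain of references even though D (which also refers to C) is itself dead.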
These collectors face one significant obstacle. As they rely on traversing the
graph, we cannot allow it to be modified during the traversal. If a mutator
moved a reference while the traversal is in progress, it could cause objects
that are still alive to never be discovered.5 There are ways around this
(generational, incremental and concurrent collection), but the easiest way out
is to stop the world – that is, stop execution of all mutators until the
collection algorithm completes its work. Obviously, this comes at a cost, as it
freezes the entire program.
4
In most implementations, this translates to objects stored on the stack (or multiple stacks in multi-threaded programs) and those stored in global variables.
5
This could happen, for example, if the mutator moved a reference from a yet-unexplored part of the graph to a part that has already been examined. The objects in the tree rooted at the moved reference would never be explored by a standard traversal algorithm and would therefore be collected.
reference counting
Let us first look at an approach to garbage collection that does not trace the
object graph. The key idea is to store the count of incoming references for each
allocated object. This reference count (or refcount for short) is increased
each time a new reference is pointed to the object and similarly decreased when
a reference is lost (e.g. due to repointing or destruction of its source).
When the count reaches 0, we know that the object is surely dead and can be
safely collected.
The crucial point here is that pure refcounting collectors rely solely on
local information about individual references – their code is executed only
when a reference changes. This makes them less intrusive and easier to implement
than tracing collectors. As a result, they are sometimes available even in
‘low-level’ languages such as C++, where `std::shared_ptr` is nothing but a
pointer with refcount support.
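A toy model of the mechanism in Python (the `Obj` class and the `incref`/`decref` helpers are purely illustrative; CPython maintains its own refcounts internally):

```python
# Minimal manual refcounting sketch. In a real runtime, incref/decref
# are emitted by the compiler or interpreter on every reference change.

class Obj:
    def __init__(self, name):
        self.name = name
        self.refcount = 0
        self.collected = False

def incref(obj):
    obj.refcount += 1

def decref(obj):
    obj.refcount -= 1
    if obj.refcount == 0:
        obj.collected = True    # refcount hit zero: surely dead, reclaim

x = Obj("x")
incref(x)          # a new reference is pointed at x
incref(x)          # and another
decref(x)          # one reference lost
decref(x)          # last reference lost -> x is collected immediately
print(x.collected) # True
```

Notice that collection happens immediately and deterministically at the point where the last reference disappears, with no traversal of the heap.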
Reference counting has several important advantages over a tracing GC. The
aforementioned locality distributes the costs of garbage collection throughout
the run time, which leads to better latency (at the cost of potentially worse
throughput). Additionally, we can use this approach whenever we require
deterministic (i.e. predictable) collection, which is important for example
when a collection of an object releases further non-memory resources such as
file handles. On the other hand, reference counts have to be maintained, which
can cause significant overhead: making a copy of an object must also update the
reference counts of all objects that it refers to, triggering additional
memory writes and possibly compromising locality of reference.
There is a bigger drawback, though. Pure reference counting will never release
objects that form a loop in the object graph. Indeed, let us imagine a simple
case where we have a pair of objects A and B referencing each other and a local
reference, say R, to A. A has refcount 2 (from R and B) and B has refcount 1
(from A). When R is lost, the refcount of A is decreased, and the refcounts of
both objects now equal 1 even though neither is reachable any more.
This situation is too common (consider cyclic data structures such as
doubly-linked lists or pointer-based graphs) to be ignored and as such, there
are several workarounds. We can employ either weak references, which do not
count as incoming references and are reset when their target is collected, or a
tracing collector in addition to reference counting – an approach chosen by
Python, for example.
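The A/B cycle from the example above can be reproduced directly in Python, whose refcounting is backed by exactly such a supplementary tracing collector (exposed as the `gc` module):

```python
import gc
import weakref

class Node:
    def __init__(self):
        self.other = None

# Build the A <-> B cycle from the text:
a, b = Node(), Node()
a.other, b.other = b, a

probe = weakref.ref(a)   # lets us observe whether `a` gets collected

del a, b                 # the local reference (R) is gone...
# ...but each object still holds a reference to the other, so pure
# refcounting would leak the pair. Python's cycle detector finds it:
gc.collect()
print(probe() is None)   # True: the cycle was reclaimed
```

The `weakref` here also illustrates the other workaround: a weak reference does not keep its target alive and is simply reset to `None` once the target is collected.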
mark and sweep
We will now turn our attention to tracing garbage collection algorithms.
Recall that we have a good over-approximation for live objects – those objects
that are reachable from our root set. Therefore, a general strategy presents
itself: first traverse the object graph and mark all reachable objects (e.g.
by keeping a special ‘live’ bit for each object), then sweep (release) every
unmarked object and reset the marks.6
While the idea is simple, an implementation of a tracing GC can be quite
challenging. First, a collector is usually run when the available program memory
approaches exhaustion, and this places restrictions on the space complexity of
the collector. Notably, we cannot just implement a naive breadth-first search or
similar graph traversal algorithm and consider the mark phase solved.
Furthermore, the collector needs to go over the whole allocated memory several
times and mutators have to be stopped while the collector runs. It is not
surprising that this simple approach has unacceptable overhead, especially in
memory-constrained environments where the collector needs to run more often.
There are several modifications to our general strategy that make tracing GCs
usable in the real world. An important optimisation for the mark phase is called
tri-color marking. Instead of a single live/dead bit, we mark each object as
either white, gray or black. In short, we remember for each object not just its
reachability, but also whether we have visited its outgoing references.
Confirmed reachable objects are black if we have also visited their outgoing
references and gray otherwise; other objects are white. The algorithm proceeds
as follows:
- start with all objects white,
- mark every object reachable from the root set gray,
- while there are gray objects, pick any, mark it black and mark its white successors gray.
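The three steps above can be sketched as follows, again over an illustrative adjacency-list graph:

```python
# Tri-color marking sketch. White objects are represented implicitly:
# anything in neither `gray` nor `black` is white.

def mark(roots, refs):
    """Return the set of black (confirmed live) objects."""
    gray = set(roots)    # reachable, outgoing references not yet visited
    black = set()        # reachable, outgoing references visited
    while gray:
        obj = gray.pop()             # pick any gray object
        black.add(obj)               # mark it black
        for succ in refs.get(obj, []):
            if succ not in black and succ not in gray:
                gray.add(succ)       # white successor becomes gray
    return black

objects = {"R", "A", "B", "C", "D"}
refs = {"R": ["A"], "A": ["B"], "B": ["A"], "C": ["D"]}
live = mark({"R"}, refs)             # {'R', 'A', 'B'}
garbage = objects - live             # sweep phase: everything not black
```

The loop can be paused after any iteration and resumed later – only the contents of the gray and black sets need to survive between runs, which is what makes the scheme incremental.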
While this looks like a needless complication, it has an important advantage –
it can be performed incrementally. The collector can just pick one or a few
gray objects, perform the necessary markings, and return control back to a
mutator without actually collecting anything. Once a sweep is requested, it only
has to deal with remaining gray objects. Of course this comes at a cost in
implementation complexity: for instance, the GC now needs to ensure that new
objects reachable immediately from the root set are marked gray, and so on.
6
Can you improve this process so that we do not have to reset marks in every collection cycle?
other tracing algorithms of interest
Garbage collection is an old but still very active area of research. There are
many other approaches, sometimes complementary, sometimes incompatible with
the above strategy, and most are beyond the scope of this blog post. A few are
nonetheless worth mentioning.7
A very different strategy is called copying garbage collection. In the
simplest case, it works by splitting the program heap into two halves – a
working half and an empty half. Each time a collection is performed, the
collector traverses the memory (more specifically, the working half) similarly
to the mark and sweep one, but instead of marking the reachable objects, it
copies them to the empty half. Once all reachable objects are copied, the empty
half becomes the new working half and vice versa. This method has several
important advantages. Chiefly, it prevents memory fragmentation (by placing
objects next to each other in the working half) and makes object allocation
quite cheap (the working space is neatly partitioned into a single contiguous
‘free’ and one ‘used’ area).
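A minimal Cheney-style sketch of semispace copying, assuming a toy heap where each 'cell' is just a list of references (indices) to other cells:

```python
# Semispace copying sketch: live cells are copied from the working half
# into the empty half and their references repointed. The list-of-lists
# heap representation is invented for illustration.

def copy_collect(roots, from_space):
    to_space = []        # the empty half; allocation is a cheap append
    forwarding = {}      # old address -> new address

    def copy(addr):
        if addr not in forwarding:
            forwarding[addr] = len(to_space)
            to_space.append(list(from_space[addr]))  # copy the cell
        return forwarding[addr]

    # Copy the roots, then scan copied cells, copying their children
    # and repointing references as we go.
    new_roots = [copy(r) for r in roots]
    scan = 0
    while scan < len(to_space):
        cell = to_space[scan]
        cell[:] = [copy(ref) for ref in cell]
        scan += 1
    return new_roots, to_space       # to_space becomes the working half

# Cell 0 refers to cell 1; cell 2 is garbage (unreachable from roots).
from_space = [[1], [], [0]]
roots, heap = copy_collect([0], from_space)
```

After the collection, only the two reachable cells survive, packed contiguously at the start of the new working half – there is no fragmentation to manage.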
Of course, this algorithm has the obvious disadvantage of wasting half of the
heap memory. However, modern copying (also called compacting) collectors use a
similar method without splitting the heap – objects can be ‘shifted’ within the
heap towards the start, thus ensuring contiguous space of allocated objects.
Another quite serious complication lies in the fact that once objects start
moving in memory, extra work needs to be done to repoint all live references
to new locations. As a result, this approach makes interaction with code
unaware of the collector problematic.8
Another method is generational garbage collection. This builds on the
(empirically tested) observation that most objects die quite young, while old
objects are less likely to be collected. This leads to an idea of several heap
segments, or generations, managed more or less separately. New objects are
created in the ‘young space’, which is collected often, while those that survive
long enough are promoted to a more ‘mature’ segment, which is scanned less
frequently. This decreases the time spent on a single collector run, because
the GC scans only a part of the heap.
A key point here is that references from older to younger objects are
comparatively rare (this is especially true in functional languages, which
tend to avoid mutating existing objects). This means that such references can
be tracked separately, avoiding expensive tracing of the entire (possibly
quite large) mature segment.
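A sketch of a minor (young-generation-only) collection, with the remembered set of old-to-young references acting as extra roots; the representation is invented for illustration:

```python
# Minor collection sketch: trace only the young generation. Old objects
# are never scanned; old->young references recorded in the remembered
# set stand in for them as additional roots.

def minor_collect(young, roots, remembered):
    """young: {addr: [references into young]};
    roots, remembered: addresses of young objects."""
    live = set()
    stack = [a for a in roots | remembered if a in young]
    while stack:
        addr = stack.pop()
        if addr not in live:
            live.add(addr)
            stack.extend(young[addr])
    # A real collector would now promote long-lived survivors
    # to the mature segment.
    return live

young = {1: [2], 2: [], 3: []}
roots = {1}
remembered = {3}    # some old object still points at young object 3
survivors = minor_collect(young, roots, remembered)
print(sorted(survivors))    # [1, 2, 3]
```

Without the remembered set, object 3 would have been collected even though a mature object still refers to it – which is exactly the premature-release error we set out to avoid.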
7
We invite the interested reader to look up further information for example in the Garbage Collection Handbook.
8
This is important for example when calling a function written in a different language without garbage collector. The problem is solved by telling the collector to pin the object in place, which is of course another implementation pain point.
identifying references
We have previously mentioned that a collector needs to recognize which values
are references and which are other, uninteresting data. Until now, however, we
have assumed it ‘just works’. Let us now look at how a collector actually
achieves this.
If the programming language (and its runtime) was designed with garbage
collection in mind, the mutator itself will track which bits of data are
references and which are not. There are multiple ways to achieve this;
probably the most common is tagging (where each value carries a short type tag).
Other methods rely on insight into the layout of objects. A garbage collector
that relies on this information is called exact.
However, this information might not be available – this would be the case in
languages that do not natively use a garbage collector, for instance C or C++.
Fortunately, this does not make garbage collection impossible. What we need to
be able to do is follow references, and it is actually not a huge problem if we
accidentally ‘follow’ something that is not an actual pointer.9 In the worst
case, we will simply mark an object that is not actually a target of any live
references. This might cause additional dead objects to be retained, but it
cannot cause the opposite – more severe – type of error, where an object is
released prematurely.
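A toy model of this conservative root identification: any word whose value happens to coincide with a heap address is treated as a pointer (real conservative collectors such as Boehm GC scan raw stack and register contents in roughly this spirit):

```python
# Conservative scanning sketch: we cannot tell pointers from integers,
# so every word that *looks* like a heap address is kept as a root.
# The addresses and 'stack words' below are made up for illustration.

def conservative_roots(words, heap_addresses):
    """Over-approximate the root set: keep any word that coincides
    with the address of an allocated object."""
    return {w for w in words if w in heap_addresses}

heap_addresses = {0x1000, 0x1040, 0x1080}
stack_words = [7, 0x1040, 42, 0x2000]  # 0x1040 might be pointer or number
roots = conservative_roots(stack_words, heap_addresses)
```

If `0x1040` was really just an integer, the object at that address is retained unnecessarily – the harmless direction of error; nothing live is ever released.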
9
Collectors that take advantage of this principle are called conservative. Boehm GC is a well-known example. Since Boehm GC additionally cooperates with the system allocator, it can also be used to detect leaks.