fuzzing

Discovering errors in programs with non-trivial inputs is often challenging. It can be hard to find or predict patterns of failure, and hence to pin down examples that can be used as test cases which achieve good code coverage. This is especially true if testing is entirely ad hoc, without any kind of coverage feedback.
There are many possible improvements over just ‘thinking very hard’ about what to test. The one we are going to examine today in more detail is automatic generation of (pseudo-)random test inputs. The technique is known as fuzzing, and the program doing it is called a fuzzer. While fuzzers are often used for software testing, they also feature prominently in cybersecurity and hacking.
So what goes into a fuzzer?

what do we want?

The traditional starting point for writing software is requirements, so let's try to pin some down for our would-be fuzzer:
  1. It should be easy to use - we want to reduce the burden of testing, not pile additional complexity on the user. Everything that can be reasonably automated, should be.
  2. It should be lightweight and fast. We are not going to achieve anything that resembles a correctness guarantee (this would require a much heavier tool), so we might as well get cracking. Quickly finding bugs is the name of the game. And the improvement in code coverage certainly helps to make us more confident about the relative correctness of the code.
  3. Speaking of coverage, we want to examine as many code paths as possible - keeping track of branches and lines of code that we have managed to hit can provide useful guidance to the fuzzer. We will get to how that works in a bit.
Basically, we want a tool that gives us as much information as possible, as fast as possible, and with as little effort as possible (and who doesn't?). Of course, we are trading away the robustness of a negative result (no bug found) for speed and simplicity, so perhaps we are onto something here.

crafting a fuzzer

After laying out what we want from our fuzzer, we can try to actually design one. We should first think about what information will be available when running the fuzzer. Do we have some set of interesting inputs available? Do we have a (possibly formal) specification of how the input can look? Do we have some knowledge of the structure of the program we are examining?
Since we are designing a fuzzer for software testing, it is reasonable to expect some knowledge of the program we want to test. So let's say we have the program's source code, and we can easily create some initial set of inputs - often called a corpus or seed in this context. These inputs can all be valid, and are hence comparatively easy to come by.
This helps us satisfy the first requirement: providing examples of valid inputs is often the easiest part of testing. Of course, the extent of those examples (and the coverage they provide in their own right) directly affects the quality of results that the fuzzer can provide, as well as how quickly it can find bugs.
Armed with the source code, we can measure some form of coverage achieved by a single input, as well as the cumulative coverage achieved by our tool so far during the examination. We can use this code coverage metric to guide our fuzzer and satisfy the third requirement. I will explain this in more detail later.
Given the above, the main loop of our outlined fuzzer could look something like this:
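(The following is a rough sketch in Python; run_with_coverage, standing for an instrumented run of the program under test, and mutate are assumed helper functions rather than any particular tool's API.)
    import random

    def fuzz(seed_corpus, run_with_coverage, mutate):
        queue = list(seed_corpus)   # inputs waiting to be tried
        stored = list(seed_corpus)  # interesting inputs kept for re-use
        seen = set()                # coverage points hit so far
        crashes = []                # counterexamples found so far (for reporting)
        while True:                 # never 'done' on its own; stopped externally
            if not queue:
                # refill the queue by mutating a stored (interesting) input
                queue.extend(mutate(random.choice(stored)))
            candidate = queue.pop(0)
            crashed, coverage = run_with_coverage(candidate)
            if crashed:
                # truly interesting: keep the input for reporting and later re-use
                crashes.append(candidate)
                stored.append(candidate)
            elif not coverage <= seen:
                # the input triggered a new path: remember it and mutate it further
                seen |= coverage
                stored.append(candidate)
                queue.extend(mutate(candidate))
            # otherwise the input was boring: drop it and move on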
First, you might have noticed that the fuzzer lacks a termination condition – in other words, it is never ‘done’. Instead, we need to set some external stopping condition, usually a time limit.
Other than that, it seems straightforward enough: we maintain a queue of inputs to try, and based on the outcome of each, we do one of three things:
  1. for boring inputs (usually labelled as such because nothing new happened), we ignore them and move on to the next item in the queue,
  2. for less boring inputs (those that triggered a new path), we feed them into mutate, which then adds new variants of the successful input to the queue,
  3. for truly interesting outcomes (however we define that), we also store the input for later re-use, both within this run (as an additional input to the mutator) and for future runs.
Clearly, most of the magic is hidden in mutate, where existing inputs go in and new inputs come out.
Generally, it works by modifying the selected input (as hinted by the name). In practice, this process heavily relies on heuristics to minimize the amount of ‘chaff’ (boring inputs), often making use of the stored set of (interesting) inputs. How those heuristics work is a fairly big topic in its own right – for now, it will be sufficient to consider entirely random mutations like bit flips, insertions, and deletions.
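To make this concrete, a naive mutator along these lines might look as follows (a sketch only, with the number of variants chosen arbitrarily; real fuzzers layer many heuristics on top of this):
    import random

    def mutate(data, count=16):
        # produce `count` random variants of a bytes input, using nothing but
        # bit flips, insertions and deletions
        variants = []
        for _ in range(count):
            b = bytearray(data)
            choice = random.random()
            if choice < 1/3 and b:
                i = random.randrange(len(b))
                b[i] ^= 1 << random.randrange(8)          # flip a single bit
            elif choice < 2/3:
                b.insert(random.randrange(len(b) + 1),
                         random.randrange(256))           # insert a random byte
            elif b:
                del b[random.randrange(len(b))]           # delete a random byte
            variants.append(bytes(b))
        return variants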
In any case, the process results in an input slightly different from the original. Intuitively, mutations will quickly produce invalid inputs, even if the entire seed (the initial inputs provided by the user) was valid. Searching through the generated mutations in a roughly breadth-first order will result in counterexamples (inputs that cause crashes or other bugs to manifest) with small deviations (not quite minimal, but close enough for practical purposes). This is desirable, since smaller deviations from the expected format make any uncovered bugs easier to understand (and consequently fix).
The main limitation here is that this process will mainly explore the parts of the program which handle the syntactic validity of the input (this is where mutation heuristics can help improve the ‘depth of penetration’ of the fuzzer, pushing it further from the input-handling logic and into the ‘business’ part of the code).
Our other big problem is that, left to its own devices, the mutator will likely generate a large pile of garbage inputs that won't tell us much about the program (with almost all of them triggering the same ‘invalid input’ path). This is where coverage comes in: we restrict our attention to inputs that trigger new paths in the program (as measured by whichever coverage metric we are using). Those are the inputs that we retain and mutate further, in the hope that they will help us reach other interesting paths.
Moreover, to keep our set of stored inputs clean, we can remove inputs which are subsumed (again, as measured by coverage) by the new addition. This can be advantageous even in cases where a newly generated input does not provide any new coverage, as long as it can replace multiple existing inputs, or it is shorter than an equivalent (again, in terms of coverage) already-known input.
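A sketch of this bookkeeping might look like the following (coverage_of is an assumed helper that returns the coverage set recorded for a stored input):
    def refresh_corpus(corpus, new_input, new_cov, coverage_of):
        # stored inputs made redundant by the newcomer: strictly less coverage,
        # or the same coverage achieved by a longer input
        redundant = [old for old in corpus
                     if coverage_of(old) < new_cov
                     or (coverage_of(old) == new_cov and len(old) > len(new_input))]
        # does the newcomer hit anything the corpus as a whole has not seen yet?
        seen = set().union(*(coverage_of(old) for old in corpus)) if corpus else set()
        if (new_cov - seen) or redundant:
            for old in redundant:
                corpus.remove(old)
            corpus.append(new_input)
Real tools typically weigh additional factors when trimming the corpus, such as how quickly an input executes, but the basic idea is the same.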

does it work well?

You might be wondering how thorough a fuzzer can be, or how ‘deep’ it can go when examining the program. In line with expectations, the coverage is wider than it is deep, in the sense that the initial stages of the program (the input validation and parsing logic) are covered very densely, but the coverage of deeper parts rapidly falls off, as the fuzzer fails to combine the right mutations.
To better understand why this happens, we can reduce the input into an abstract set of features, in such a way that choices (conditional branching) in the program depend on the presence (or absence) of particular features. A feature could be ‘input starts with \x7fELF’ or ‘a line with a leading # is present in the input’. Mutation of the raw input can obviously lead to flips in this abstract feature set. The problem is that the deeper we reach into the program, the longer our paths get and the more branching they encounter, and each branch (that we have any chance of influencing) constrains the feature set in some way.
Deep paths therefore have a large set of feature bits fixed by those branches, but we do all our work near the start, randomly flipping raw input bits, each of which has only a small chance of flipping a ‘feature’ bit. Understandably, hitting a particular path this way gets harder the more bits we must correctly guess. Meanwhile, the total number of distinct paths as a function of their length does not fall off nearly as quickly, at least not in a typical program (every branch that is not immediately followed by termination of at least one of the resulting paths contributes to the surplus).
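To put rough numbers on this (a back-of-the-envelope estimate, not a measurement): if reaching a particular deep path requires, say, 20 feature bits to take specific values, and a randomly mutated input gets each of them right independently with probability 1/2, then a single attempt hits that path with probability 2^-20 – roughly one in a million – and every additional constrained feature halves the odds again.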
Despite these limitations, fuzzing is quite successful in practice, as witnessed by a number of tools based on the technique. A typical (and successful) example would be American Fuzzy Lop1, perhaps better known by its acronym AFL, which essentially takes the approach discussed here.
1
https://github.com/google/AFL

input structure

So we have created a fuzzer that successfully examines the input (syntactic) stage of the program. What can we do to go further? It will clearly help our effort if we restrict our attention to the kinds of inputs that are accepted by the parser and hence lead to paths that continue beyond the initial stage. Often, the structure of the input can be described by a grammar. Armed with such a grammar, we can help the fuzzer get further along.
First, let's take a quick detour to input generators (like the one that is part of QuickCheck). These tools start with a grammar-like specification and randomly generate samples that conform to this pseudo-grammar. A useful abstract model of this process is to imagine the generator as a function from bitvectors (a sequence of ‘untyped’ bits without additional meaning) to specification-conforming samples. The finiteness of these bitvectors poses a few technical (but not terribly interesting) challenges that we will sweep under the rug for now.
Internally, a grammar-based generator will use the input bits to decide which of the applicable rewrite rules in the grammar to apply at any given step.
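As a toy illustration (mine, not QuickCheck's actual API), here is a generator for nested lists of digits that consumes bits from a bitvector to decide which rule to apply at each step:
    def take_bits(bits, n):
        # pop n bits from the front of a list of 0/1 values, padding with zeros
        # once the bitvector runs out (sweeping the finiteness issue under the rug)
        value = 0
        for _ in range(n):
            value = (value << 1) | (bits.pop(0) if bits else 0)
        return value

    def gen_value(bits, depth=0):
        # toy grammar: value ::= digit | '[' value (',' value)* ']'
        # one bit picks the rule; deep nesting is cut off to guarantee termination
        if depth >= 3 or take_bits(bits, 1) == 0:
            return str(take_bits(bits, 4) % 10)       # the digit rule
        items = [gen_value(bits, depth + 1)]
        while take_bits(bits, 1) == 1:                # one more bit: another item?
            items.append(gen_value(bits, depth + 1))
        return "[" + ",".join(items) + "]"
For example, the bitvector 1 0 0101 0 produces [5], while flipping its first bit to 0 instead yields the bare digit 2 – the rule choice changed, so the remaining bits are consumed differently.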

back to fuzzing

Entirely random generation is a viable approach (as witnessed by the existence of many tools of this sort), but we want to do better. What happens if we sandwich an input generator between the fuzzer and the program under test? Suddenly, every possible bitvector (i.e. every possible input, as understood by the fuzzer) translates to a syntactically valid input for the program.
Let's illustrate this on an example. Consider a generator which constructs a list of lists of small positive integers, each stored in a single byte. One approach is to use the MSB of each byte to indicate whether that byte is the last item in its list.
0000 0101 | 0100 1000 | 0110 0101 | 1110 1100 | 0110 1100 | 0110 1111
This gives [ 5, 72, 101, 108 ], [ 108, 111 ]. Bit flips in any of the MSBs will then alter the structure of the result, while any other bit flip will alter its content.
Observe:
0000 0101 | 1100 1000 | 0110 0101 | 0110 1100 | 0110 1100 | 0110 1111
Yields [ 5, 72 ], [ 101, 108, 108, 111 ] – setting the MSB of the second byte and clearing the MSB of the fourth moved the list boundary.
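For completeness, a small sketch of the decoder half of such a generator, written to match the encoding described above:
    def decode(data):
        # the low 7 bits of each byte are the item; the MSB marks the last item of its list
        result, current = [], []
        for byte in data:
            current.append(byte & 0x7f)
            if byte & 0x80:               # MSB set: this item closes its list
                result.append(current)
                current = []
        if current:                       # the input ended mid-list
            result.append(current)
        return result

    decode(bytes([0x05, 0x48, 0x65, 0xEC, 0x6C, 0x6F]))  # [[5, 72, 101, 108], [108, 111]]
    decode(bytes([0x05, 0xC8, 0x65, 0x6C, 0x6C, 0x6F]))  # [[5, 72], [101, 108, 108, 111]]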
In a grammar (or a rewrite system in general), some bit flips will have an even more dramatic effect. In either case, we expect the coverage feedback loop in the fuzzer to focus the mutation on those structure-altering bits, since those are by far the most likely to cover new paths.
A much more detailed account of this approach can be found in a paper on Zest2. The algorithm covered therein is not an exact match for the above, but it builds on the same general idea. The paper also provides further examples of how mutations in the bitvector affect the generated input.
2
https://dl.acm.org/doi/abs/10.1145/3293882.3330576