fuzzing
Discovering errors in programs with non-trivial inputs is often challenging.
It can be hard to find or predict patterns of failure, and hence to pin down
examples that can be used as test cases which achieve good code coverage. This
is especially true if testing is entirely ad hoc, without any kind of
coverage feedback.
There are many possible improvements over just ‘thinking very hard’ about what
to test. The one we are going to examine today in more detail is automatic
generation of (pseudo-)random test inputs. The technique is known as
fuzzing, and the program doing it a fuzzer. While they are often used for
software testing, fuzzers also feature prominently in cybersecurity and
hacking.
So what goes into a fuzzer?
what do we want?
The traditional starting point for writing software is requirements, so
let's try to pin some down for our would-be fuzzer:
- It should be easy to use - we want to reduce the burden of testing, not pile additional complexity on the user. Everything that can be reasonably automated, should be.
- It should be lightweight and fast. We are not going to achieve anything that resembles a correctness guarantee (this would require a much heavier tool), so we might as well get cracking. Quickly finding bugs is the name of the game. And the improvement in code coverage certainly helps to make us more confident about the relative correctness of the code.
- Speaking of coverage, we want to examine as many code paths as possible - keeping track of branches and lines of code that we have managed to hit can provide useful guidance to the fuzzer. We will get to how that works in a bit.
crafting a fuzzer
After laying out what we want from our fuzzer, we can try to actually design
one. We should first think about what information will be available when
running the fuzzer. Do we have some set of interesting inputs available? Do we
have a (possibly formal) specification of how the input can look? Do we have
some knowledge of the structure of the program we are examining?
Since we are designing a fuzzer for software testing, it is reasonable to
expect some knowledge of the program we want to test. So let's say we have
the program's source code, and we can easily create some initial set of inputs,
often called a corpus or seed in this context. These inputs can all be valid,
and hence comparatively easy to come by.
This helps us satisfy the first requirement: providing examples of valid
inputs is often the easiest part of testing. Of course, the extent of those
examples (and the coverage they provide in their own right) directly affects
the quality of results that the fuzzer can provide, as well as how quickly it
can find bugs.
Armed with the source code, we can measure some form of coverage achieved by a
single input as well as coverage achieved by our tool so far during the
examination. We can use the code coverage metric to guide our fuzzer and
satisfy the third requirement. I will explain this in more detail later.
Given the above, the main loop of our outlined fuzzer could look something
like this:
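A minimal sketch of that loop in Python follows; the helpers target, mutate and
is_interesting stand in for the instrumented program, the mutation step and the
‘is this worth keeping’ heuristic, and are not the API of any particular tool.

    import random
    from collections import deque

    def fuzz(target, seeds, mutate, is_interesting):
        # target(data) -> (outcome, coverage), mutate(data) -> list of variants,
        # is_interesting(outcome, new_edges) -> bool are all supplied from outside
        queue = deque(seeds)     # inputs waiting to be tried
        stored = list(seeds)     # interesting inputs, re-used by the mutator
        seen = set()             # union of all coverage observed so far

        while True:              # never ‘done’ on its own; stopped by a time limit
            if not queue:        # keep the queue fed by mutating stored inputs
                queue.extend(mutate(random.choice(stored)))
            data = queue.popleft()
            outcome, coverage = target(data)
            new_edges = coverage - seen
            if not new_edges:
                continue                    # boring: nothing new happened
            queue.extend(mutate(data))      # a new path: derive further variants
            seen |= new_edges
            if is_interesting(outcome, new_edges):
                stored.append(data)         # keep for re-use, now and in future runs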
First, you might have noticed that the fuzzer lacks a termination condition
– in other words, it is never ‘done’. Instead, we need to set some external
condition, usually a time limit.
Other than that, it seems straightforward enough: we maintain a queue of
inputs to try, and based on the outcome of each, we can either:
- for boring inputs (usually labelled as such because nothing new happened), ignore them and move on to the next item on the queue,
- for less boring inputs (they triggered a new path), feed them into mutate, which then adds new variants of the successful input to the queue,
- for truly interesting outcomes (however we define that), also store the input for later re-use, both within this run (as an additional input into the mutator) and for future runs.
The other component that deserves a closer look is mutate, where existing
inputs are fed in and new inputs come out.
Generally, it works by modifying the selected input (as hinted by the name).
In practice, this process heavily relies on heuristics to minimize the amount
of ‘chaff’ (boring inputs), often making use of the stored set of
(interesting) inputs. How those heuristics work is a fairly big topic in its
own right – for now, it will be sufficient to consider entirely random
mutations like bit flips, insertions, and deletions.
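As a concrete (and deliberately naive) illustration, a mutator built only from
such random primitives might look as follows; real tools layer many heuristics
on top of these basic operations.

    import random

    def mutate(data, count=16):
        # produce 'count' random variants of 'data' (a bytes object) using
        # bit flips, byte insertions and byte deletions
        variants = []
        for _ in range(count):
            buf = bytearray(data)
            op = random.choice(('flip', 'insert', 'delete'))
            if op == 'flip' and buf:
                i = random.randrange(len(buf))
                buf[i] ^= 1 << random.randrange(8)     # flip one random bit
            elif op == 'insert':
                i = random.randrange(len(buf) + 1)
                buf.insert(i, random.randrange(256))   # insert a random byte
            elif op == 'delete' and len(buf) > 1:
                del buf[random.randrange(len(buf))]    # drop a random byte
            variants.append(bytes(buf))
        return variants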
In any case, the process results in an input slightly different from the
original. Intuitively, mutations will quickly produce invalid inputs, even if
all of the seed (initial inputs provided by the user) was valid. Searching
through the generated mutations in a roughly breadth-first order will result
in counterexamples (inputs that cause crashes or other bugs to manifest) with
small deviations (not quite minimal, but close enough for practical purposes).
This is desirable, since smaller deviations from expected format will make any
uncovered bugs easier to understand (and consequently fix).
The main limitation here is that this process will mainly explore the parts of
the program which handle the syntactic validity of the input (this is where
mutation heuristics can help improve the ‘depth of penetration’ of the fuzzer,
pushing it further from the input-handling logic and into the ‘business’ part
of the code).
Our other big problem is that, left to its own devices, the mutator will likely
generate a large pile of garbage inputs that won't tell us much about the
program (almost all of them triggering the same ‘invalid input’ path). This is where
coverage comes in: we restrict our attention to inputs that trigger new paths
in the program (as measured by whichever coverage metric we are using). Those
are the inputs that we retain and mutate further, in the hopes that it will
help us reach other interesting paths.
Moreover, to keep our set of stored inputs clean, we can remove inputs which
are subsumed (again, as measured by coverage) by the new addition. This can be
advantageous even in cases when a newly generated input does not provide any
new coverage, as long as it can replace multiple existing inputs, or it is
shorter than an equivalent (again, in terms of coverage) already-known input.
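A sketch of that bookkeeping, assuming coverage is represented as a set of
branch (edge) identifiers; the policy below is one possible reading of the
rules above, not a prescription.

    def add_to_corpus(corpus, new_input, new_cov):
        # corpus is a list of (input, coverage) pairs; new_cov is a set of edges
        union = set().union(*[cov for _, cov in corpus])
        adds_coverage = not new_cov <= union
        subsumed = [(inp, cov) for inp, cov in corpus if cov <= new_cov]
        replaces_many = len(subsumed) > 1
        shorter = any(cov == new_cov and len(new_input) < len(inp)
                      for inp, cov in corpus)
        if adds_coverage or replaces_many or shorter:
            # drop entries whose coverage the new input subsumes, then add it
            corpus[:] = [(i, c) for i, c in corpus if not c <= new_cov]
            corpus.append((new_input, new_cov))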
does it work well?
You might be wondering how thorough a fuzzer can be, or how ‘deep’ it can go
when examining the program. In line with expectations, the coverage is wider
than it is deep, in the sense that the initial stages of the program (the
input validation and parsing logic) are covered very densely, but the coverage
into deeper parts rapidly falls off, as the fuzzer fails to combine the right
mutations.
To better understand why this happens, we can reduce the input to an abstract
set of features, in such a way that choices (conditional branching) in the
program depend on the presence (or absence) thereof. A feature could be ‘input
starts with \x7fELF’ or ‘a line with a leading # is present in the input’.
Mutation of the raw input can obviously lead to flips in this abstract feature
set. The problem is that the deeper we reach into the program, the longer our
paths get and the more branching they encounter, and each branch (that we have
any chance of influencing) constrains the feature set in some way. Deep paths
therefore have a large set of bits fixed by those branches, but we do all our
work near the start, randomly flipping raw input bits, each of which has only
a small chance of flipping a ‘feature’ bit. Understandably, hitting a
particular path this way gets harder the more bits we must correctly guess: a
path that requires, say, 20 feature bits to take specific values will only be
hit by roughly one in 2^20 (about a million) random assignments of those bits.
Meanwhile, the total number of distinct paths as a function of their length is
not falling off nearly as quickly, at least not in a typical program (every
branch that is not immediately followed by termination of at least one of the
resulting paths contributes to the surplus).
Despite these limitations, fuzzing is quite successful in practice, as
witnessed by a number of tools based on the technique. A typical (and
successful) example would be American Fuzzy Lop1, perhaps better known by
its acronym AFL, which essentially takes the approach discussed here.
1
https://github.com/google/AFL
input structure
So we have created a fuzzer that successfully examines the input (syntactic)
stage of the program. What can we do to go further? It will clearly help our
effort if we restrict our attention to the kinds of inputs that are accepted
by the parser and hence lead to paths that continue beyond the initial stage.
Often, the structure of the input can be described by a grammar. Armed with
such a grammar, we can help the fuzzer get further along.
First, let's take a quick detour to input generators (like the one that is
part of QuickCheck). These tools start with a grammar-like specification and
randomly generate samples that conform to this pseudo-grammar. A useful
abstract model of this process is to imagine the generator as a function from
bitvectors (a sequence of ‘untyped’ bits without additional meaning) to
specification-conforming samples. The finiteness of these bitvectors poses a
few technical (but not terribly interesting) challenges that we will sweep
under the rug for now.
Internally, a grammar-based generator will use the input bits to decide which
of the applicable rewrite rules in the grammar to apply at any given step.
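To make this concrete, here is a toy generator for a small expression grammar,
written in Python; it consumes bits from the vector (padding with zeroes when
they run out) to decide which rule to expand next. The grammar and the bit
layout are invented purely for illustration.

    import itertools

    # toy grammar:  expr -> number | '(' expr '+' expr ')'
    def generate(bits):
        stream = itertools.chain(iter(bits), itertools.repeat(0))

        def take(n):                          # read n bits as an unsigned integer
            return sum(next(stream) << i for i in range(n))

        def expr(depth):
            if depth == 0 or take(1) == 0:    # one bit picks the rule to apply
                return str(take(4))           # a small numeric literal
            return '(' + expr(depth - 1) + '+' + expr(depth - 1) + ')'

        return expr(depth=3)

    # every bitvector yields a syntactically valid expression, e.g.
    print(generate([1, 0, 0, 1, 0, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0]))  # (10+(0+(0+0)))

Flipping the very first bit of this particular vector switches the top-level
rule, collapsing the whole expression into a single literal.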
back to fuzzing
Entirely random generation is a viable approach (as witnessed by the existence
of many tools of this sort), but we want to do better. What happens if we
sandwich an input generator between the fuzzer and the program under test?
Suddenly, every possible bitvector (i.e. every possible input, as understood
by the fuzzer) translates to a syntactically valid input for the program.
Let's illustrate this on an example. Consider a generator which constructs a
list of lists of small positive integers, each encoded in a single byte. One
approach is to use the most significant bit (MSB) of each byte to indicate
whether that byte is the last item in its list, with the item's value taken
from the remaining seven bits.
0000 0101 | 0100 1000 | 0110 0101 | 1110 1100 | 0110 1100 | 0110 1111
This gives [ 5, 72, 101, 108 ], [ 108, 111 ]. Bit flips in any of the MSBs will
then alter the structure, while any other bit flip will alter the content of
the result.
Observe:
0000 0101 | 1100 1000 | 0110 0101 | 0110 1100 | 0110 1100 | 0110 1111
Yields [ 5, 72 ], [ 101, 108, 108, 111 ] (two of the MSBs have been flipped,
changing where the lists are split).
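For reference, a small Python decoder for this encoding (the function name is
ours); running it on the first byte sequence reproduces the result above, and
the second sequence works the same way.

    def decode(data):
        # low 7 bits of each byte are the item's value, the MSB marks
        # the last item of its list; end of input closes the final list
        result, current = [], []
        for b in data:
            current.append(b & 0x7f)
            if b & 0x80:
                result.append(current)
                current = []
        if current:
            result.append(current)
        return result

    print(decode(bytes([0b00000101, 0b01001000, 0b01100101,
                        0b11101100, 0b01101100, 0b01101111])))
    # [[5, 72, 101, 108], [108, 111]]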
In a grammar (or a rewrite system in general), some bit flips will have an
even more dramatic effect. In either case, we expect the coverage feedback
loop in the fuzzer to focus the mutation on those structure-altering bits,
since those are by far the most likely to cover new paths.
A much more detailed account of this approach can be found in a paper on
Zest2. The algorithm covered therein is not an exact match for the above, but
it builds on the same general idea. The paper also provides further examples
of how mutations in the bitvector affect the generated input.
2
https://dl.acm.org/doi/abs/10.1145/3293882.3330576