verification vs file system
Verification tools and algorithms come in various flavours and with different
capabilities. What kinds of programs and properties we can examine largely
depends on how the execution environment for those programs is set up. Clearly,
the environment affects the possible behaviours of the program (otherwise, it
would be kind of pointless), and that in turn means that the capabilities of
the environment affect how realistic the observed behaviours of the program
under analysis are.
what is the environment?
The environment is basically everything that is not inherently part of the
program but still affects its execution. This ranges from obvious things like
system configuration and environment variables, which the program can directly
query, to less obvious items like available resources (e.g. the amount of
memory available, the maximum size of the C stack, or which physical devices
are attached to the system). It is actually quite hard to imagine a program
that would not interact with its environment in some way (even pure
computations need memory and stack). For most programs, the interaction is
quite extensive and often involves shared resources that may be used by other
programs at the same time.
The environment must be present in some form for any type of execution or
dynamic analysis. How exactly it is implemented is then an important part of
the analysis and determines both the degree of realism (how sensible are the
counterexamples that we get) and soundness (what kinds of counterexamples are
we going to miss).
We might not be super interested in what happens if we run the program with 15
bytes of memory, even if, strictly speaking, a correct program would not
crash. Of course there might be a boundary where the program has enough
memory to start, but an untimely allocation failure causes a critical
vulnerability when talking to an untrusted client. That is something you would
rather not miss, so drawing a line ‘this amount of memory is too small and it
does not make sense to check’ can be quite hard.
Restricting ourselves to comparatively simple cases, like data structures or
pure computations, does not entirely solve the problem either. Sure, a data
structure isn't going to directly interact with an untrusted client, but if
it becomes corrupted when an allocation fails, it could easily be the root
cause of the security problem mentioned earlier. Imagine we are using a tree
of access control rules, and an allocation fails during insertion, causing the
data structure to lose a pre-existing subtree (i.e. after the failed insert,
there are fewer elements than before it was attempted). Well, that subtree
could have been holding something really important, and it's no longer there,
and perhaps the client can now bypass those missing permission checks. But at
least there is no direct interaction between abstract trees and the file
system.
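To make the failure mode concrete, here is a minimal C sketch (the allocator, the tree, and the insert routine are all hypothetical, written for illustration): the insert signals allocation failure by returning NULL, and the idiomatic call site `t = insert(t, key)` then overwrites a live subtree with that NULL, silently dropping every element in it.

```c
#include <stdlib.h>

/* Toy fault injection: fail the allocator after a set number of calls. */
static int allocs_left = -1;            /* -1 means never fail */

static void *xmalloc(size_t n) {
    if (allocs_left == 0) return NULL;
    if (allocs_left > 0) allocs_left--;
    return malloc(n);
}

typedef struct node { int key; struct node *left, *right; } node;

/* Buggy insert: it reports failure by returning NULL, so when a caller
   wrote `root = insert(root, k)` and the failure happens deep in the
   recursion, NULL propagates upward and replaces the whole tree. */
static node *insert(node *t, int key) {
    if (!t) {
        node *n = xmalloc(sizeof *n);
        if (!n) return NULL;            /* allocation failed */
        n->key = key;
        n->left = n->right = NULL;
        return n;
    }
    if (key < t->key) {
        node *s = insert(t->left, key);
        if (!s) return NULL;            /* BUG: discards t and both subtrees */
        t->left = s;
    } else if (key > t->key) {
        node *s = insert(t->right, key);
        if (!s) return NULL;            /* same bug on the right spine */
        t->right = s;
    }
    return t;
}

static int count(node *t) {
    return t ? 1 + count(t->left) + count(t->right) : 0;
}
```

After five successful inserts, setting `allocs_left = 0` and inserting one more key makes `count` drop from 5 to 0: the failed insert lost data that was already in the structure, which is exactly the kind of behaviour a verifier that simulates allocation failure can catch.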
On the other hand, if we are unlucky, we might be trying to prove that an
implementation of a database never corrupts the stored data. In that case,
almost anything in the environment can come back to haunt us. Even then, we
might want to rule out things like the on-disk data randomly changing by
itself. Unfortunately, that is of course possible: it's hard to imagine a
system that would somehow preserve your data when the file system, or the
system administrator, decides to erase it… or all storage devices stop working
at the same time… or any number of other things that can happen in principle.
But a completely unrestricted environment would need to admit all of them,
which will produce an unending list of ridiculous counterexamples, all
essentially stating ‘the environment ate your data; the data is gone,
property violated’.
The file system is particularly vulnerable to these problems: it is a point
where many programs ‘meet’ and can have (possibly unwanted) interactions with
each other. This unfortunately includes absurd scenarios like one program
completely rewriting data of another program. Perhaps with random garbage,
perhaps with something malicious that makes the program under analysis do bad
things. Again, drawing the line can be really tricky.
leaky abstractions
The abstraction we are usually promised when it comes to files is that we can
write some bytes (a more or less unrestricted number of them, too) and come
later and read those same bytes. This will even work when the computer is
turned off and on again. Of course, there is a catch (more than one). Or, in
other words, this idealized abstraction is leaky. Writes fail. Reads fail.
Data sometimes disappears, other times it gets corrupted, yet other times it
gets overwritten by another program. Or it's accidentally erased by the
hapless user. When writes fail, sometimes you get back an error. Other times,
you don't. Sometimes retrying helps (you just happened to hit a hardware
timeout somewhere, or a bad sector, or whatever), but sometimes it doesn't
(the drive is busted). Maybe the file system is full. Maybe someone frees up
some space, or maybe they don't. Yeah, files are hard.
Many of these problems are actually reflected in the interface: have a look at
the ERRORS section of the open(2) manpage. To be honest, if you prefer to
pretend that you write bytes and later read the same bytes and that's the end
of the story, I can't say I blame you.
Some of the errors are tied to system resources: for example, the limits on
open file descriptors or open files. The limits depend on the operating
system and its configuration, but in some cases (e.g. the number of open
files) the limit is shared by all currently running processes, and whether we
run out depends on what those other processes are doing.
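For illustration, the per-process part of this limit can be inspected with the standard POSIX getrlimit call (`fd_soft_limit` is a hypothetical helper name; the system-wide open-file table that produces ENFILE is a separate kernel-level setting, e.g. /proc/sys/fs/file-max on Linux, and no single process controls it):

```c
#include <sys/resource.h>

/* Return the soft per-process limit on open file descriptors
   (RLIMIT_NOFILE), or -1 if the query fails. Hitting this limit
   produces EMFILE; ENFILE comes from the shared, system-wide table. */
long fd_soft_limit(void) {
    struct rlimit rl;
    if (getrlimit(RLIMIT_NOFILE, &rl) != 0)
        return -1;
    return (long)rl.rlim_cur;
}
```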
For simplicity, let's consider the case where the table of open files is full:
this is entirely out of the control of the program under analysis, hence
potentially any open call can fail with ENFILE. This means that there are
at least 2 possible outcomes of each open call that must be considered: the
limit was hit, and the limit wasn't hit. In fact, there are many possible
errors that can arise, each creating a new branch in the state space of the
program. Of course, this can easily get out of hand: not only do we have to
contend with absurd counterexamples, we also have to consider the limits of
analysis tools.
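A sketch of what a call site that takes these branches seriously might look like (`open_with_retry` is a hypothetical helper, not part of any library): each call to open is a branch point for the analysis, and a program that only checks `fd < 0` conflates outcomes that deserve different handling.

```c
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/* Distinguish transient failures (worth retrying) from permanent ones.
   Every errno value listed here is a separate branch the verifier must
   explore at each call site. */
int open_with_retry(const char *path, int flags) {
    for (int attempt = 0; attempt < 3; attempt++) {
        int fd = open(path, flags);
        if (fd >= 0)
            return fd;
        switch (errno) {
        case EINTR:    /* interrupted by a signal: retrying is always sane */
        case ENFILE:   /* system-wide table full: may clear up on its own */
            continue;
        case EMFILE:   /* per-process limit: retrying only helps if some
                          descriptor gets closed in the meantime */
            continue;
        default:       /* ENOENT, EACCES, …: retrying will not help */
            return -1;
        }
    }
    return -1;         /* transient condition persisted; give up */
}
```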
tool examples
Obviously, if we want to verify real-world programs, we have to provide access
to the file system (otherwise, the range of programs we are able to verify
becomes quite limited). However, implementing the entire API can be quite
challenging. How do existing tools cope?
One of the issues is that available system calls differ between operating
systems, so if we target multiple platforms, the situation gets even worse.
One option is to use the interface of the underlying OS on which the
verification tool is running. This is done by, for example, the symbolic
executor KLEE, which, except in a few special cases, simply invokes the
corresponding system call on the host system. Needless to say, this approach
has downsides.
Consider instead a model checker (like divine), which builds the state space
as a graph, and might execute the same code in the same context multiple
times. This is not going to work unless the execution environment is fully
deterministic (in the sense that the possible outcomes only depend on the
state of the program and its environment as recorded by the analysis tool). In
practice, this means that a model checker must be able to faithfully capture
the entire state of the environment, and even worse, restore it to exactly
match a previously captured snapshot. This is hard to even imagine unless the
environment is realized internally.
What does divine do? Basically, it gives up. Instead of trying to capture
and manipulate the external environment, it simply requires the system under
test to provide its own environment, which is then simulated along with
the program. Of course, that's not very satisfactory from a user's point of
view, which is why divine also provides DiOS, which does all the heavy
lifting for the user. In particular, it provides a POSIX-compatible file
system (which can be seeded from a snapshot of a directory on the host, but
after that point, it is entirely independent). While the complexity of
implementing the file system is considerable, at least it's a reasonably
self-contained component that can be modified independently of the analysis
algorithm.
soundness, coverage and realism
As was already mentioned, we might want to simulate failures of the operations
related to the file system. Implementation-wise, this is fairly
straightforward: analysis tools usually provide some form of non-deterministic
choice (essentially stating ‘one of these things happened, but we don't know
which’). Pessimistic tools will then check all the possibilities. Optimistic
tools (e.g. when hunting bugs) might only explore the most promising ones.
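A hand-enumerated sketch of such a model: the `choice` parameter stands in for the tool's nondeterministic-choice primitive (in divine this would be something like the `__vm_choose` hypercall; here the test driver simply loops over all values, which is what a pessimistic tool does implicitly). The outcome set is illustrative, not exhaustive.

```c
#include <errno.h>

/* A model of `open` as a verifier might see it: one nondeterministic
   choice selects which outcome this call takes. */
enum { OUTCOME_OK, OUTCOME_ENFILE, OUTCOME_EACCES, OUTCOME_COUNT };

/* Returns a file descriptor on success, -1 on failure; *err receives
   the errno value of the simulated outcome. The fd value 3 is an
   arbitrary stand-in for "some valid descriptor". */
int model_open(int choice, int *err) {
    switch (choice) {
    case OUTCOME_OK:     *err = 0;      return 3;
    case OUTCOME_ENFILE: *err = ENFILE; return -1;
    case OUTCOME_EACCES: *err = EACCES; return -1;
    }
    *err = EINVAL;                      /* unreachable with valid choices */
    return -1;
}
```

Exploring the model then means iterating `choice` over `0 … OUTCOME_COUNT - 1` and checking that the program under test copes with every branch.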
Errors are relatively simple, because there is a limited set of possible
outcomes, and which of them make sense is mostly independent of the program we
are testing. But what if everything succeeds, and we have read some bytes?
What should those bytes be?
Of course, the user might provide fixed file content as part of a file system
image. But then we rely on the user to provide all ‘reasonable’ initial images
as inputs. That's just testing with extra steps. We have pulled out the big
guns (symbolic execution, model checking, etc.) exactly because we didn't want
to enumerate test cases. We might actually even want some assurances that the
program works (for some limited definition of ‘work’) regardless of what it
finds in the file system.
What options do we have? You might recall symbolic data from a previous
article about symbolic execution. Can we use that here? The first important
observation is that files are really just arrays of bytes. Both the length and
the content of that array are, generally speaking, arbitrary. So we create a
‘symbolic input’ for the length of the file, and whenever a new byte is
needed, we create a new input variable for that byte. It's all rather simple,
but it runs into scalability limits very quickly. Especially if the program
keeps reading from the file forever (which it likely will, because if it uses
any sort of ‘end of input’ marker, there's always a feasible execution branch
where that marker has not yet been encountered).
Symbolic execution is pretty much doomed (or rather, bound to hit a depth
limit). We can perhaps hope that even if there are infinite branches in the
execution tree, the state space is still finite (i.e. those infinite
branches are in fact loops in the state space). Or rather, the state space is
certainly finite (the length of the file is a bit vector of finite width) but
it can still be enormous. Honestly, the chances of getting a state space of
manageable size this way are rather slim. An example where it might work out
would be a program that simulates a finite state machine on the file content,
discarding the actual bytes after processing each of them.
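A sketch of such a program: a three-state machine that checks whether the input contains the substring ‘ab’, keeping nothing but its current state. However long the (symbolic) file is, the program's retained memory only ever takes three values, so the state space stays tiny. The machine and its states are made up for this illustration.

```c
/* States of a tiny substring matcher for "ab". */
enum state { START, SAW_A, DONE };

/* Consume one byte and forget it: only the state survives. */
enum state step(enum state s, int byte) {
    if (s == DONE) return DONE;          /* match already found */
    if (byte == 'a') return SAW_A;       /* possible start of "ab" */
    if (s == SAW_A && byte == 'b') return DONE;
    return START;
}
```

A driver just folds `step` over the input bytes; whether they come from a concrete buffer or a symbolic file makes no difference to the size of the reachable state set.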
more challenges
So how well does it work in practice? Surprisingly, if we limit ourselves to
dealing with the standard input (stdin), things work out somewhat okay.
While we currently don't generate arbitrary content (we use grammars to
generate something more restricted), we do get manageable state spaces even
for infinite input streams.
However, things get worse when we try to simulate a regular file using the
same approach. Why? Basically, because files offer random access, while
stdin is a stream: once the program reads a byte, it is removed from stdin
forever. There is also no way to ask how many bytes are left in stdin.
There are a few problems related to the random-access nature of files:
- If the program asks for the file size (e.g. using stat, or lseek and
  ftell), it becomes hard to avoid excessive branching as the program takes
  various actions that depend on this size.
- The system needs to remember a position indicator: another number that can
  very easily get entangled with the program state, causing a blowup. To make
  things worse, the file system itself must compare this indicator to the
  file size, to correctly signal the end-of-file condition.
Still, even a compromise on the file size does not address the position
indicator problem:
within a given size, keeping track of the read position could be a serious
bottleneck, e.g. if the program reads one byte at a time. And even though most
programs use buffered reads, the problem in that case simply transfers to the
buffer position pointer (if we are lucky, it is maintained in libc, where we
can at least hope to do something about it; if we are less lucky, it is
maintained by program logic directly, and then the only hope is that the
analysis tool can somehow deal with it).
conclusion
So what can DiOS and its libc do to make the position indicator problem less
bad? One option might be to use zippers (or multi-zippers, in the case of a
file with multiple independent position indicators). You can imagine a zipper
as a pair of stacks, with the bytes behind the indicator on one of them, and
the bytes in front of the indicator on the other. Reading data translates into
movement of those bytes from one stack to the other. If multiple indicators
are required, the bytes between them become a deque (double-ended queue),
which is a bit trickier, but perhaps doable.
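A minimal sketch of such a zipper in C (fixed capacity, single position indicator; an illustration of the idea, not DiOS's actual implementation): reading pops a byte off the `front` stack and pushes it onto `back`, and the numeric offset only materializes on demand, as the height of `back`.

```c
#define CAP 4096

/* The file content, split at the position indicator. */
typedef struct {
    unsigned char back[CAP];  int nback;   /* bytes behind the indicator */
    unsigned char front[CAP]; int nfront;  /* bytes ahead, stored reversed */
} zipper;

/* Load content with the indicator at offset 0. The front stack is
   reversed so that the next byte to read sits on top. */
void zip_init(zipper *z, const unsigned char *data, int len) {
    z->nback = 0;
    z->nfront = len;
    for (int i = 0; i < len; i++)
        z->front[i] = data[len - 1 - i];
}

/* Read one byte, moving it from front to back; -1 signals end of file. */
int zip_read(zipper *z) {
    if (z->nfront == 0) return -1;
    unsigned char b = z->front[--z->nfront];
    z->back[z->nback++] = b;
    return b;
}

/* The offset, materialized only when explicitly requested (cf. ftell). */
long zip_tell(const zipper *z) { return z->nback; }
```

Seeking would move bytes in the opposite direction, and a multi-zipper would replace the two stacks with deques between adjacent indicators, as described above.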
The important part here is that the offset does not need to be materialized as
a number (unless the program explicitly asks for it). As long as the bytes
themselves (on either side of the position indicator) can be represented
succinctly, we should be okay. While the situation is clearly more complicated
than it was in the case of stdin, it is perhaps still manageable. Perhaps we
will be able to report some success in a future article.