Automatic abstraction is a powerful software verification technique. In this paper, we elaborate an abstract domain for C strings, that is, null-terminated arrays of characters. We describe the abstract semantics of basic string operations and prove their soundness with regards to previously established concrete semantics of those operations. In addition to a selection of string functions from the standard C library, we provide semantics for character access and update, which make it possible to automatically lift arbitrary string-manipulating code into the domain.

The domain we present (called M-String) has two other abstract domains as its parameters: an index (bound) domain and a character domain. Picking different constituent domains allows M-String to be tailored for specific requirements of a given verification task, balancing precision against complexity.

In addition to describing the domain theoretically, we also provide an executable implementation of the abstract operations, along with a tool to automatically lift existing programs into the M-String domain. Using this tool and an explicit-state model checker, we have evaluated the proposed domain experimentally on a few simple but realistic test programs.

- full text (pdf, revision 1, 6.4.2019)
- Source archive of the submission

## Installation

DIVINE should work on any reasonably recent Linux distribution. It might also be possible to make it work on macOS, *BSD or even on Windows using Windows Subsystem for Linux, but this is not tested routinely.

## Building from source

To build from th source, you will need some prerequisites, namely a C++11 compiler such as reasonably recent GCC (>= 4.9) or Clang (>= 3.2), `make`

, CMake (>= 3.2), libedit, perl, and about 12 GB of disk space and 4 GB of RAM. For the symbolic verification to work, you will also need the Z3 solver, including its libraries.

To build DIVINE, you have to xtract the archive and then run `make divine`

in its main directory. This should build DIVINE into `_build.release/tools`

subdirectory by default. For more details, please refet to the manual.

## Usage

Binary distribution contains a main binary of `divine`

. If you have built DIVINE from the source, the built binary resides in `_build.release/tools/divine`

.

To run divine verification with mstring abstraction on `source.c`

use:

`divine check --lart abstraction:mstring --symbolic -o ignore:memory -o ignore:control source.c`

## M-String interface

An abstract string with symbolic bounds can be created using a cunstructor `__mstring_sym_val`

. To get definition of this function, source code needs to include `rst/domains.h`

header.

`__mstring_sym_val`

takes a variadic number of arguments – characters and bounds.

*Example*: construction of an mstraing of form `a^i b^j`

where `i + j < 5, i > 0, j > 0`

`char * str = __mstring__sym_val(3, 'a', 1, 5, 'b', 1, 5, '\0', 5, 6);`

where the first argument is a number of characters including terminating zero. Following arguments consists of interleaved characters and bounds. The first bound is implicitly an interval [0, 1), the second bound is in the interval of [1, 5). The third bound is again in [1, 5), however the implementation guarantees that its greater than previous bound. End of the string buffer is guarded by last bound [5, 6).

Hence the resulting msting contains a sequence of **a**s in range from index 0 up to first bound. Succeeding by **b**s starting at index of the first bound and ending at the second bound. The rest of the mstring contains zeros up to the last bound. Since bounds are symbolic, the resulting mstring describe all strings of the required form.

The implementation of M-String domain is located in attached sources. It is split into multiple files:

`divine/dios/include/rst/segmentation.h`

which implements parametric representation`divine/dios/rst/mstring.c`

which contains C wrapper functions for LART.

## Evaluation

To run benchmarks unzip benchmark files and call `python benchmark.py`

with `divine`

in PATH.

### Abstract Operations (sources)

Benchmarks of abstract operations were evaluated on three types of M-Strings:

*Word*: a string of the form a^i b^j c^k , i + j + k ≤ length.states 8 64 1024 4096 LART strcmp 163 12.8s 14.7s 17.1s 27.2s 0.85s strcpy 36 1.67s 1.63s 2.18s 2.48s 0.51s strcat 477 32.2s 33.4s 31.9s 33.4s 0.92s strchr 24 0.28s 0.35s 0.53s 1.14s 0.88s strlen 26 0.45s 0.46s 0.69s 1.31s 0.86s *Sequence*: a string of the form a^i, i ≤ length.states 8 64 1024 4096 LART strcmp 12 0.55s 0.67s 1.15s 1.79s 0.65s strcpy 9 0.36s 0.27s 0.51s 0.83s 0.88s strcat 25 2.28s 2.5s 2.79s 3.19s 0.93s strchr 6 0.03s 0.08s 0.13s 0.26s 0.56s strlen 6 0.09s 0.11s 0.17s 0.33s 0.86s *Alternation*: is a string of form a^i b^j a^k b^l , i + j + k + l ≤ length.states 8 64 1024 4096 LART strcmp 744 63.4s 110s 129s 141s 0.77s strcpy 74 3.62s 4.9s 5.3s 4.36s 0.46s strcat 2406 208s 218s 220s 205s 0.95s strchr 45 0.54s 0.54s 0.92s 1.89s 0.83s strlen 53 1.01s 1.21s 1.94s 2.33s 0.82s

Columns in tables denote the size of state space, time of verification for various string sizes and time of transformation in LART.

### C Standard Library (sources)

Number in column headers denotes length of input string.

- pdclib

*Word*: a string of the form a^i b^j c^k , i + j + k ≤ length.time 4 states 4 time 8 states 8 time 16 states 16 strcmp 18.5s 90 597s 1228 2410s 11416 strcpy 8.4s 45 99s 438 775s 4410 strcat 75.5s 303 T - T - strchr 12.4s 39 166s 245 934s 1265 strlen 0.5s 27 7.9s 169 811s 1365 *Sequence*: a string of the form a^i, i ≤ length.time 4 states 4 time 8 states 8 time 16 states 16 strcmp 3.82s 14 12s 32 32.2s 68 strcpy 5.7s 14 17.5s 24 71.8s 44 strcat 23.7s 35 117s 149 769s 737 strchr 4.34s 8 16.5s 12 158s 20 strlen 0.27s 8 0.6s 14 1.7s 20 *Alternation*: is a string of form a^i b^j a^k b^l , i + j + k + l ≤ length.time 4 states 4 time 8 states 8 time 16 states 16 strcmp 70.7s 348 T - T - strcpy 11.6s 80 168s 928 3230s 19234 strcat 249s 1085 T - T - strchr 13.4s 57 316s 815 T - strlen 1.8s 48 69s 357 3250s 5307

- μCLibc

*Word*: a string of the form a^i b^j c^k , i + j + k ≤ length.time 4 states 4 time 8 states 8 time 16 states 16 strcmp 19 90 939s 1228 T - strcpy 11.3s 45 163s 438 1460s 4410 strcat 120s 315 T - T - strchr 12.6s 39 156s 245 1230s 1265 strlen 1.05s 27 21.7s 169 842s 1365 *Sequence*: a string of the form a^i, i ≤ length.time 4 states 4 time 8 states 8 time 16 states 16 strcmp 1.89s 14 4.94s 32 17.4s 68 strcpy 7.53s 14 22.8s 24 85.2s 44 strcat 26.5s 37 197s 157 1350s 757 strchr 4.52s 8 26.3s 12 209s 20 strlen 0.36s 8 1.13s 20 2.64s 68 *Alternation*: is a string of form a^i b^j a^k b^l , i + j + k + l ≤ length.time 4 states 4 time 8 states 8 time 16 states 16 strcmp 18.4s 168 T - T - strcpy 14.7s 58 457s 1010 T - strcat 86.6s 512 T - T - strchr 16.4s 57 383s 815 T - strlen 1.34s 37 70.1s 575 T -

- musl-libc

*Word*: a string of the form a^i b^j c^k , i + j + k ≤ length.time 4 states 4 time 8 states 8 time 16 states 16 strcmp 29.6s 94 928s 1316 T - strcpy 8.05s 45 152s 438 1150s 4410 strcat 125s 315 T - T - strchr 13.9s 39 130s 265 1450s 1265 strlen 0.87s 27 19.4s 236 726s 3306 *Sequence*: a string of the form a^i, i ≤ length.time 4 states 4 time 8 states 8 time 16 states 16 strcmp 3.16s 14 9.33s 32 32.2s 68 strcpy 4.9s 14 22.9s 24 69.4s 44 strcat 27.2s 37 187s 157 757s 805 strchr 4.34s 8 26.4s 12 207s 20 strlen 0.37s 8 1s 20 3.57s 68 *Alternation*: is a string of form a^i b^j a^k b^l , i + j + k + l ≤ length.time 4 states 4 time 8 states 8 time 16 states 16 strcmp 51.7s 180 T - T - strcpy 12.4s 58 311s 1010 T - strcat 200s 512 T - T - strchr 18.9s 57 224s 835 T - strlen 1.27s 37 65.3s 575 3085s 5307

### Bison Grammars

- Numerical expressions (sources)

time 10 | states 10 | time 20 | states 20 | time 25 | states 25 | time 35 | states 35 | |
---|---|---|---|---|---|---|---|---|

addition | 40.2s | 416 | 319s | 3548 | 622s | 13000 | T | - |

ones | 5.5s | 62 | 8.1s | 196 | 29.7s | 402 | 189s | 2186 |

alternation | 708s | 105 | 582s | 11000 | T | - | T | - |

*addition* is a string with two numbers with between them (the string is of form 123 + 123, where the whole length is bounded by the value from the table)

*ones* is a simplified version of addition where numbers consist of ones only

*alternation* is expression that contains numbers with alternating digits (1010)

- Simple programming language (sources)

time 10 | states 10 | time 50 | states 50 | time 100 | states 100 | time 1000 | states 1000 | |
---|---|---|---|---|---|---|---|---|

value | 6.58s | 38 | 44.4s | 238 | 90.4s | 488 | 1100s | 4988 |

loop | 1.53s | 23 | 3.28s | 23 | 4.88s | 23 | 33.3s | 23 |

wrong | 7.34s | 82 | 27.9s | 442 | 67.7s | 892 | 311s | 8992 |

*value* is a program with value declaration

*loop* is a program with a single loop

*wrong* is a program a syntactic error