centipede/doc/DESIGN.md - external/github.com/google/fuzztest - Git at Google


 [TOC]

 # Centipede Design

 We are trying to build Centipede based on our experience with libFuzzer (and to
 a lesser extent with AFL and others). We keep what worked well, and change what didn't.

 See [README](../README.md) for user documentation and most of the terminology.

 ## Execution Features

 Centipede reasons about the execution feedback in terms of *features*. A feature
 is some unique behavior of the target exercised by a given input. So, executing
 an input is a way to compute the input's features.

 The currently supported features (see [feature.h](../feature.h) for details)
 are:

 * Control flow edges with
   [8-bit counters](https://clang.llvm.org/docs/SanitizerCoverage.html#inline-8bit-counters)
   .
 * Simplified data flow edges: either {store-PC, load-PC} or {global-address,
   load-PC}.
 * Bounded control flow paths.
 * Instrumented CMP instructions.
 * ... more coming.

 However, the target may generate features of its own type without Centipede
 having to support it explicitly.

 ## Persistent state

 Centipede aims to handle both large and slow targets, so even a minimized corpus
 may consist of 100K-1M inputs and executing every input can take 1ms-1s. We
 also aim to support many more feature types than most other engines, which will
 in turn bloat the corpus further.

 At the same time, we aim to execute on cheap ("preemptible") cloud VMs, so
 we need to minimize the startup computations.

 Therefore, we try hard to eliminate all redundant executions.

 Centipede state consists of the following:

 * Corpus. A set of inputs. The corpus is a property of a group of fuzz targets
   sharing the same input data format.
 * Feature sets. For every corpus element we preserve the set of its features.
   Features are a property of a specific target binary. Different binaries (e.g.
   from a different revision, or different build options, or from different
   code) will have their own persistent feature set. A feature set is associated
   with an input via the input's hash.

 On startup, Centipede loads the corpus, and checks which corpus elements have
 their corresponding feature sets. Only when the feature set is not present for
 an input in the corpus, Centipede will recompute it.

 ## Concurrent execution

 Centipede jobs run concurrently (in separate processes, potentially on different
 machines). They peek at each other's corpus (and feature sets) periodically.
 Every job writes only to its own persistent state, but can read any other job's
 state concurrently with that job writing to it.

 Centipede implements this via appendable storage format.

 ## Storage format

 Very simple and inefficient home-brewed appendable data format is currently
 used, see `PackBytesForAppendFile()` in [util.h](../util.h). We may need to
 replace it with something more efficient when this one stops scaling.

 ## Out-of-process target execution

 Centipede executes the target out of process. In order to minimize the process
 startup costs, it passes inputs to the target in batches, and receives features
 in batches as well.

 The specific mechanism of execution and passing the data between processes can
 be overridden. The default implementation is `CentipedeCallbacks::Execute()` in
 [centipede_interface.h](../centipede_interface.h).

 It is possible to override the execution to do it in-process, but this way
 Centipede will lose the ability to set RAM and time limit, and it will not
 tolerate crashes in the target.

 ## Instrumentation

 Centipede is decoupled from the mechanism that collects the execution feedback.
 Any source of feedback can be used: compiler instrumentation, run-time
 instrumentation, simulation, hardware-based tracing, etc. The default
 implementation in [runner_main.cc](../runner_main.cc) and other `runner_*.cc`
 files
 relies on
 [SanitizerCoverage](https://clang.llvm.org/docs/SanitizerCoverage.html)

 ## Mutations

 Centipede is decoupled from Mutations - they are provided by the user via
 `CentipedeCallbacks::Mutate()`.

 Centipede provides a default implementation of mutations via `ByteArrayMutator`.

 ## Corpus management

 One of the important heuristics during fuzzing is which corpus elements to
 mutate, and which to discard.

 Centipede tries to mutate only those corpus elements that have *rare* features.

 TODO(kcc): explain how this works.

 ## Related reading

 Centipede currently doesn't do all of the following, but aspires to eventually
 do much more.

 * [Entropic] Boosting fuzzer efficiency: an information theoretic perspective.
   https://dl.acm.org/doi/abs/10.1145/3368089.3409748
 * [Nezha]: Efficient Domain-Independent Differential Testing
   https://www.cs.columbia.edu/~suman/docs/nezha.pdf

	[TOC]

	# Centipede Design

	We are trying to build Centipede based on our experience with libFuzzer (and to
	a lesser extent with AFL and others). We keep what worked well, and change what didn't.

	See [README](../README.md) for user documentation and most of the terminology.

	## Execution Features

	Centipede reasons about the execution feedback in terms of features. A feature
	is some unique behavior of the target exercised by a given input. So, executing
	an input is a way to compute the input's features.

	The currently supported features (see [feature.h](../feature.h) for details)
	are:

	* Control flow edges with
	[8-bit counters](https://clang.llvm.org/docs/SanitizerCoverage.html#inline-8bit-counters)
	.
	* Simplified data flow edges: either {store-PC, load-PC} or {global-address,
	load-PC}.
	* Bounded control flow paths.
	* Instrumented CMP instructions.
	* ... more coming.

	However, the target may generate features of its own type without Centipede
	having to support it explicitly.

	## Persistent state

	Centipede aims to handle both large and slow targets, so even a minimized corpus
	may consist of 100K-1M inputs and executing every input can take 1ms-1s. We
	also aim to support many more feature types than most other engines, which will
	in turn bloat the corpus further.

	At the same time, we aim to execute on cheap ("preemptible") cloud VMs, so
	we need to minimize the startup computations.

	Therefore, we try hard to eliminate all redundant executions.

	Centipede state consists of the following:

	* Corpus. A set of inputs. The corpus is a property of a group of fuzz targets
	sharing the same input data format.
	* Feature sets. For every corpus element we preserve the set of its features.
	Features are a property of a specific target binary. Different binaries (e.g.
	from a different revision, or different build options, or from different
	code) will have their own persistent feature set. A feature set is associated
	with an input via the input's hash.

	On startup, Centipede loads the corpus, and checks which corpus elements have
	their corresponding feature sets. Only when the feature set is not present for
	an input in the corpus, Centipede will recompute it.

	## Concurrent execution

	Centipede jobs run concurrently (in separate processes, potentially on different
	machines). They peek at each other's corpus (and feature sets) periodically.
	Every job writes only to its own persistent state, but can read any other job's
	state concurrently with that job writing to it.

	Centipede implements this via appendable storage format.

	## Storage format

	Very simple and inefficient home-brewed appendable data format is currently
	used, see `PackBytesForAppendFile()` in [util.h](../util.h). We may need to
	replace it with something more efficient when this one stops scaling.

	## Out-of-process target execution

	Centipede executes the target out of process. In order to minimize the process
	startup costs, it passes inputs to the target in batches, and receives features
	in batches as well.

	The specific mechanism of execution and passing the data between processes can
	be overridden. The default implementation is `CentipedeCallbacks::Execute()` in
	[centipede_interface.h](../centipede_interface.h).

	It is possible to override the execution to do it in-process, but this way
	Centipede will lose the ability to set RAM and time limit, and it will not
	tolerate crashes in the target.

	## Instrumentation

	Centipede is decoupled from the mechanism that collects the execution feedback.
	Any source of feedback can be used: compiler instrumentation, run-time
	instrumentation, simulation, hardware-based tracing, etc. The default
	implementation in [runner_main.cc](../runner_main.cc) and other `runner_*.cc`
	files
	relies on
	[SanitizerCoverage](https://clang.llvm.org/docs/SanitizerCoverage.html)

	## Mutations

	Centipede is decoupled from Mutations - they are provided by the user via
	`CentipedeCallbacks::Mutate()`.

	Centipede provides a default implementation of mutations via `ByteArrayMutator`.

	## Corpus management

	One of the important heuristics during fuzzing is which corpus elements to
	mutate, and which to discard.

	Centipede tries to mutate only those corpus elements that have rare features.

	TODO(kcc): explain how this works.

	## Related reading

	Centipede currently doesn't do all of the following, but aspires to eventually
	do much more.

	* [Entropic] Boosting fuzzer efficiency: an information theoretic perspective.
	https://dl.acm.org/doi/abs/10.1145/3368089.3409748
	* [Nezha]: Efficient Domain-Independent Differential Testing
	https://www.cs.columbia.edu/~suman/docs/nezha.pdf