| # All About Libpas, Phil's Super Fast Malloc |
| |
| Filip Pizlo, Apple Inc., February 2022 |
| |
| # License |
| |
| Copyright (c) 2022 Apple Inc. All rights reserved. |
| |
| Redistribution and use in source and binary forms, with or without |
| modification, are permitted provided that the following conditions |
| are met: |
| |
| 1. Redistributions of source code must retain the above copyright |
| notice, this list of conditions and the following disclaimer. |
| 2. Redistributions in binary form must reproduce the above copyright |
| notice, this list of conditions and the following disclaimer in the |
| documentation and/or other materials provided with the distribution. |
| |
| THIS SOFTWARE IS PROVIDED BY APPLE INC. ``AS IS'' AND ANY |
| EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE |
| IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR |
| PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL APPLE INC. OR |
| CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, |
| EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, |
| PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR |
| PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY |
| OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT |
| (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE |
| OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. |
| |
| # Introduction |
| |
| This document describes how libpas works as of [247029@main](https://commits.webkit.org/247029@main), so a bit ahead of |
| where WebKit was as of [246842@main](https://commits.webkit.org/246842@main). Libpas is a fast and memory-efficient |
| memory allocation toolkit capable of supporting many heaps at once, engineered with the hopes that someday it'll be used |
| for comprehensive isoheaping of all malloc/new callsites in C/C++ programs. |
| |
| Since WebKit [186504@main](https://commits.webkit.org/186504@main), we've been steadily enabling libpas as a |
| replacement for WebKit's bmalloc and MetaAllocator. This has so far added up to a ~2% Speedometer2 speed-up and |
| a ~8% memory improvement (on multiple memory benchmarks). Half of the speed-up comes from replacing the MetaAllocator, |
| which was JavaScriptCore's old way of managing executable memory. Now, JSC uses libpas's jit_heap to manage executable |
| memory. The other half of the speed-up comes from replacing everything that bmalloc provided -- the fastMalloc API, the |
| Gigacage API, and the IsoHeap<> API. All of the memory improvement comes from replacing bmalloc (the MetaAllocator was |
| already fairly memory-efficient). |
| |
| This document is structured as follows. First I describe the goals of libpas; these are the things that a |
| malloc-like API created out of libpas should be able to expose as fast and memory-efficient functions. Then I |
| describe the coding style. Next I tell all about the design. Finally I talk about how libpas is tested. |
| |
| # Goals of Libpas |
| |
| Libpas tries to be: |
| |
| - Fast. The goal is to beat bmalloc performance on single-threaded code. Bmalloc was previously the fastest |
| known malloc for WebKit. |
| - Scalable. Libpas is meant to scale well on multi-core devices. |
| - Memory-efficient. The goal is to beat bmalloc memory usage across the board. Part of the strategy for memory |
| efficiency is consistent use of first-fit allocation. |
| - External metadata. Libpas never puts information about a free object inside that object. The metadata is |
| always elsewhere. So, there's no way for a use-after-free to corrupt libpas's understanding of memory. |
| - Multiple heap configurations. Not all programs want the same time-memory trade-off. Some malloc users have |
| very bizarre requirements, like what JavaScriptCore does with its ExecutableAllocator. The goal is to support |
| all kinds of special allocator needs simultaneously in one library. |
| - Boatloads of heaps. Libpas was written with the dream of obviating the need for ownership type systems or |
| other compiler approaches to fixing the type-safety of use-after-frees. This means that we need one heap per |
type, and we need to be 100% strict about it. So, libpas supports tons of heaps.
| - Type-awareness. Sometimes, malloc decisions require knowing what the type's size and alignment are, like when |
| deciding how to split and coalesce memory. Libpas is designed to avoid type errors arising from the malloc's |
| "skewed" reuse of memory. |
| - Common free. Users of libpas isoheaps don't have to know the heap of an object when they free it. All objects |
| should funnel into the same free function. One kind of exception to this requirement is stuff like |
| ExecutableAllocator, which needs a malloc, but is fine with not calling a common free function. |
| - Cages. WebKit uses virtual memory reservations called *cages*, in which case WebKit allocates the virtual |
| memory and the malloc has to associate that memory with some heap. Libpas supports multiple kinds of cages. |
| |
| # Libpas Style |
| |
| Libpas is written in C. Ultimately, I chose C because I felt that the language provides better support for |
| extremely low-level code: |
| |
| - C++ is usually my favorite, because it makes code easier to write, but for libpas, I wanted something easier |
| to read. It's easier to read C when auditing for subtle bugs, because there's nothing hidden. C doesn't have |
| stuff like destructor invocations or operator overloading, which result in surprising effectfulness in |
| otherwise innocent-looking code. Memory management code like libpas has to be read a lot, so C is better. |
| - C makes casting between pointers and integers very simple with its style of cast operator. It feels weird to |
| use the C cast operator in C++, so when I have to do a lot of uintptr_t'ing, I prefer C. |
| |
C lets you do most of what C++ can if you rely on `always_inline`. This didn't use to be the case, but modern C
| compilers will meat-grind the code with repeated application of the following things: |
| |
| - Inlining any `always_inline` call except if it's recursive or the function uses some very weird features that |
| libpas doesn't use (like goto pointer). |
| - Copy-propagating the values from the callsite into the function that uses the value. |
| |
| Consequently, passing a function pointer (or struct of function pointers), where the pointer points to an |
`always_inline` function and the callee is `always_inline`, results in specialization akin to template
| monomorphization. This works to any depth; the compiler won't be satisfied until there are no more |
| `always_inline` function calls. This fortuitous development in compilers allowed me to write very nice template |
| code in C. Libpas achieves templates in C using config structs that contain function pointers -- sometimes to |
| `always_inline` functions (when we want specialization and inlining) and sometimes to out-of-line functions |
| (when we want specialization but not inlining). Additionally, the C template style allows us to have true |
| polymorphic functions. Lots of libpas slow paths are huge and not at all hot. We don't want that code |
| specialized for every config. Luckily, this works just fine in C templates -- those polymorphic functions just |
| pass around a pointer to the config they are using, and dynamically load and call things in that config, almost |
| exactly the same way that the specialized code would do. This saves a lot of code size versus C++ templates. |
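
As a minimal sketch of that idiom (all names here are invented for illustration; none of this is real libpas code),
a config struct can carry a function pointer that the compiler constant-propagates and inlines whenever the config
is a compile-time constant at an `always_inline` callsite:

    /* Sketch of the "C template" idiom described above. */
    #include <stddef.h>
    #include <stdio.h>

    typedef struct {
        /* Points to an always_inline function when we want specialization and inlining,
           or to an out-of-line function when we only want specialization. */
        size_t (*object_size)(size_t requested_size);
    } example_config;

    static inline __attribute__((always_inline)) size_t round_up_to_16(size_t requested_size)
    {
        return (requested_size + 15) & ~(size_t)15;
    }

    /* The "template": when config is a compile-time constant at an always_inline callsite,
       the indirect call becomes a direct call and then gets inlined, much like C++
       template monomorphization. */
    static inline __attribute__((always_inline)) size_t allocate_size(example_config config,
                                                                      size_t requested_size)
    {
        return config.object_size(requested_size);
    }

    #define EXAMPLE_CONFIG ((example_config){ .object_size = round_up_to_16 })

    int main(void)
    {
        /* Compiles down to a constant; no function pointer call survives. */
        printf("%zu\n", allocate_size(EXAMPLE_CONFIG, 100));
        return 0;
    }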
| |
| Most of libpas is written in an object-oriented style. Structs are used to create either by-value objects or |
heap-allocated objects. It's useful to think of these as classes, but in a loose way, since there are many ways to
| do classes in C, and libpas uses whatever techniques are best on a per-class basis. But heap allocated objects |
| have a clear convention: for a class named `foo`, we would call the struct `pas_foo`, and for a method `bar` on |
| `foo`, we would call the function `pas_foo_bar` and have the first parameter be `pas_foo*`. The function that |
| creates instances of `foo` is called `pas_foo_create` (or `pas_foo_create_blah_blah` in case of overloading) and |
| returns a `pas_foo*`. The function that destroys `foo` objects is called `pas_foo_destroy` and takes a |
| `pas_foo*`. |
| |
| Libpas classes are usually implemented in files called `pas_foo.h` (the header that defines the struct and a |
| subset of functions), `pas_foo_inlines.h` (the header that defines inline functions of `foo` that require |
| calling functions declared in headers that `pas_foo.h` can't include), and `pas_foo.c` (the implementations of |
| `foo` functions that can be out-of-line). |
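
For a hypothetical class `foo` (there is no real `pas_foo` in libpas), the convention sketched above looks like
this:

    /* pas_foo.h -- hypothetical illustration of the naming convention. */
    struct pas_foo;
    typedef struct pas_foo pas_foo;

    struct pas_foo {
        unsigned count;
    };

    /* Creation returns a pas_foo*; destruction takes one. */
    pas_foo* pas_foo_create(void);
    void pas_foo_destroy(pas_foo* foo);

    /* A "method" bar on foo takes pas_foo* as its first parameter. */
    void pas_foo_bar(pas_foo* foo);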
| |
| Some libpas "classes" are singletons. The standard way of implementing a singleton in libpas is that there is |
| really no struct, only global variables and functions that are declared in the header. See `pas_page_malloc` or |
| `pas_system_heap` for examples of singletons. |
| |
| Not everything in libpas is a class. In cases where a bunch of not-class-like things can be grouped together in |
| a way that makes sense, we usually do something like a singleton. In cases where a function can't easily be |
| grouped together with some class, even a singleton, we name the file it's in after the function. There are lots |
| of examples of this, like `pas_deallocate` or `pas_get_page_base`. Sometimes this gets fun, like |
| `pas_get_page_base_and_kind_for_small_other_in_fast_megapage.h`. |
| |
| Finally, libpas avoids abbreviations even more so than WebKit usually does. Functions that have a quirky |
| meaning typically have a long name that tells the story. The point is to make it easy to appreciate the subtlety |
| of the algorithm when reading the code. This is the kind of code where complex situations should look complex |
| at any abstraction level. |
| |
| # Design of Libpas |
| |
| Libpas is organized into roughly eleven areas: |
| |
| 1. Heap configurations. This is the way that we tell libpas how to organize a heap. Heap configurations can |
| control a lot. They can change obvious things like the minalign and page size, but also more crazy things, |
| like how to find a page header given a page and vice-versa. |
| 2. The large heaps. This is a first-fit heap based on arrays, cartesian trees, and hashtables. The large |
| heap has excellent type safety support and can be safely (though not efficiently) used for small objects. |
| 3. Metacircularity. Libpas uses malloc-like APIs internally for managing its state. These are provided by the |
| so-called bootstrap, utility, and immortal heaps. |
| 4. The segregated heaps and TLCs (thread-local caches). Libpas has a super fast simple segregated storage slab |
| allocator. It supports type safety and is the most commonly used kind of heap. |
| 5. The bitfit heaps. This is a fast and memory-efficient type-unsafe heap based on slabs and bitmaps. |
| 6. The scavenger. Libpas performs a bunch of periodic bookkeeping tasks in a scavenger thread. This includes, |
| but is not limited to, returning memory to the OS. |
| 7. Megapages and page header tables. Libpas has multiple clever tricks for rapidly identifying which kind of |
| heap an object belongs to. This includes an arithmetic hack called megapages and some lock-free hashtables. |
| 8. The enumerator. Libpas supports malloc heap enumeration APIs. |
| 9. The basic configuration template, used to create the `bmalloc_heap` API that is used as a replacement for |
| all of bmalloc's functionality. |
| 10. The JIT heap config. |
| 11. The fast paths. The various heaps, TLCs, megapages and page header tables are glued together by fast paths |
| provided for allocation, deallocation, and various utility functions. |
| |
| ## Heap Configurations |
| |
| The `pas_heap_config` struct defines all of the configurable behaviors of a libpas heap. This includes things |
| like how the heap gets its memory, what size classes use segregated, bitfit, or large allocators, and a bunch |
| of other things. |
| |
| Heap configs are passed by-value to functions that are meant to be specialized and inlined. To support this, |
| the convention for defining a heap config is that you create a macro (like `BMALLOC_HEAP_CONFIG`) that gives a |
| heap config literal expression. So, a call like `pas_get_allocation_size(ptr, BMALLOC_HEAP_CONFIG)` will give |
| you an optimized fast path for getting the allocation size of objects in bmalloc. This works because such fast |
| paths are `always_inline`. |
| |
| Heap configs are passed by-pointer to functions that are not meant to be specialized. To support this, all |
| heap configs also have a global variable like `bmalloc_heap_config`, so we can do things like |
| `pas_large_heap_try_deallocate(base, &bmalloc_heap_config)`. |
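
The convention looks roughly like the following self-contained sketch, using an invented `toy_heap_config`; the
real `BMALLOC_HEAP_CONFIG` macro and `bmalloc_heap_config` global initialize a much larger `pas_heap_config`:

    #include <stddef.h>

    typedef struct {
        size_t minalign;
        size_t page_size;
    } toy_heap_config;

    /* Header side: a macro that expands to a config literal, so that by-value
       (specialized, inlined) callsites see a compile-time constant... */
    #define TOY_HEAP_CONFIG ((toy_heap_config){ .minalign = 16, .page_size = 16384 })

    /* ...plus a global for by-pointer (unspecialized) callsites; in libpas the global
       is initialized from the same macro. */
    const toy_heap_config toy_heap_config_global = { .minalign = 16, .page_size = 16384 };

    /* Specialized fast path: config passed by value and always_inline. */
    static inline __attribute__((always_inline)) size_t toy_round_size(toy_heap_config config, size_t size)
    {
        return (size + config.minalign - 1) & ~(config.minalign - 1);
    }

    /* Unspecialized slow path: config passed by pointer and loaded dynamically. */
    static size_t toy_round_size_unspecialized(const toy_heap_config* config, size_t size)
    {
        return (size + config->minalign - 1) & ~(config->minalign - 1);
    }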
| |
| Heap configs can have up to two segregated page configs (`config.small_segregated_config` and |
| `config.medium_segregated_config`) and up to three bitfit page configs (`config.small_bitfit_config`, |
| `config.medium_bitfit_config`, and `config.marge_bitfit_config`). Any of the page configs can be disabled, |
| though weird things might happen if the smallest ones are disabled (rather than disabling the bigger ones). |
| Page configs (`pas_segregated_page_config`, `pas_bitfit_page_config`, and the common supertype, |
| `pas_page_base_config`) get used in much the same way as heap configs -- either by-value for specialized and |
| inlined functions or by-pointer for unspecialized functions. |
| |
| Heap and page configs also support specialized-but-not-inlined functions. These are supported using additional |
| function pointers in those configs that are filled in using macros -- so you don't need to fill them in |
| explicitly when creating your own config, like `BMALLOC_HEAP_CONFIG` or `JIT_HEAP_CONFIG`. The macros fill them |
| in to point at never_inline functions that call some specialized and inlined function with the config passed as |
| a constant. This means for example that: |
| |
    BMALLOC_HEAP_CONFIG.specialized_local_allocator_try_allocate_small_segregated_slow(...);
| |
| Is an out-of-line direct function call to the specialization of |
| `pas_local_allocator_try_allocate_small_segregated_slow`. And this would be a virtual call to the same |
| function: |
| |
    const pas_heap_config* config = ...;
    config->specialized_local_allocator_try_allocate_small_segregated_slow(...);
| |
| Note that in many cases where you have a `pas_heap_config`, you are in specialized code and the heap config is |
a known constant at compile time, so then:
| |
    config.specialized_local_allocator_try_allocate_small_segregated_slow(...);
| |
| Is an out-of-line direct function call. |
| |
| ## The Large Heaps |
| |
| Libpas's large heaps serve multiple purposes: |
| |
| - Everything is bootstrapped on large heaps. When segregated and bitfit heaps allocate memory, they do so from |
| some large heap. |
| - Segregated and bitfit heaps have object size ceilings in the tens or hundreds of kilobytes. So, objects that |
| are too large for the other heaps get allocated from large heaps. |
| |
Large heaps are broken into three parts:
| |
| 1. The large free heap. In libpas jargon, a *free heap* is a heap that requires that deallocation passes the |
| object size, requires that the freed object size matches the allocated object size for that object, and makes |
| no guarantees about what kind of mess you'll get yourself into if you fail to obey that rule. |
| 2. The large map. This maps object pointer to size and heap. |
| 3. The large heap. This is an abstraction over both (1) and (2). |
| |
| Large free heaps just maintain a free-list; they know nothing about allocated objects. But combined with the |
| large map, the large heaps provide a user-friendly deallocation API: you just need the object pointer, and the |
| large map figures out the rest, including identifying which free heap the object should be deallocated into, and |
| the size to pass to that free heap. |
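
Conceptually, the deallocation path looks like this sketch (the names are invented; the real code does this under
the heap lock with the actual large map and large free heap structures):

    #include <stdbool.h>
    #include <stddef.h>

    typedef struct example_heap example_heap;

    typedef struct {
        example_heap* heap; /* which heap (and thus which free heap) owns the object */
        size_t size;        /* the allocated size, so the caller does not have to pass it */
    } large_entry;

    bool large_map_take_entry(void* ptr, large_entry* entry);
    void large_free_heap_deallocate(example_heap* heap, void* ptr, size_t size);

    static bool large_heap_deallocate(void* ptr)
    {
        large_entry entry;
        if (!large_map_take_entry(ptr, &entry))
            return false; /* not an object that the large map knows about */
        large_free_heap_deallocate(entry.heap, ptr, entry.size);
        return true;
    }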
| |
| Large heaps operate under a single global lock, called the *heap lock*. Most libpas heaps use fine-grained |
| locking or avoid locking entirely. But for the large heap, libpas currently just uses one lock. |
| |
| ### Large Free Heap |
| |
| Large free heaps are built out of a generic algorithm that doesn't know how to represent the free-list and gets |
| instantiated with either of two free-list representations, *simple* and *fast*. The simple large free heap uses |
| an unordered array of coalesced free object descriptions. The fast large free heap uses a cartesian tree of |
| coalesced free object descriptions. |
| |
| A free object description is represented by the `pas_large_free` object in libpas; let's just call it a *large |
| free* for brevity. Large frees can tell you the beginning and end of a free chunk of memory. They can also tell |
| if the memory is already known to be zero and what the *type skew* of the free memory is. The large heap can be |
| used to manage arrays of some type that is either larger than the heap's minimum alignment, or that is smaller |
| than and not a divisor of the alignment. Especially when this is combined with `memalign`, the free heap will |
| have to track free memory that isn't type-aligned. Just consider a type of size 1000 that is allocated with |
| alignment 16384. The rules of memalign say that the size must be 16384 in that case. Assuming that the free heap |
| had 32768 contiguous bytes of free memory to begin with, it will now have 16384 bytes that starts with a type |
| skew of 384 bytes. The type skew, or `offset_in_type` as libpas calls it, is the offset of the beginning of the |
| large free inside the heap's type. In extremely complex cases, this means that finding where the first valid |
| object address is inside a large free for some type and alignment requires computing the offset least common |
multiple (see `pas_coalign`), which relies on the right Bézout coefficient of the extended GCD of the type size
| and alignment (see `pas_extended_gcd`). |
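
To make that arithmetic concrete, here is a brute-force version of what `pas_coalign` computes in closed form: the
first address in a free range that is simultaneously at type offset zero and aligned as requested. This is an
illustrative sketch, not the libpas algorithm, and it assumes a compatible answer exists.

    #include <stdint.h>
    #include <stdio.h>

    /* Brute-force search for the first address at or above `begin` that is both at type
       offset zero and `alignment`-aligned. pas_coalign gets the same answer in closed
       form via the extended GCD. */
    static uintptr_t first_valid_object_address(
        uintptr_t begin, uintptr_t offset_in_type, uintptr_t type_size, uintptr_t alignment)
    {
        uintptr_t address = begin;
        uintptr_t offset = offset_in_type;
        for (;;) {
            if (!offset && !(address & (alignment - 1)))
                return address;
            address++;
            offset++;
            if (offset == type_size)
                offset = 0;
        }
    }

    int main(void)
    {
        /* The example from the text: type size 1000, a free chunk that begins at a
           16384-aligned address with a type skew of 384 bytes, and 16-byte alignment. */
        printf("%lu\n", (unsigned long)first_valid_object_address(16384, 384, 1000, 16));
        return 0;
    }

This prints 18000: 616 bytes to reach type offset zero, then one more whole object to also satisfy the 16-byte
alignment, so the first valid object address is 1616 bytes into the free chunk.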
| |
| Large frees support an API for coalescing (merging as libpas calls it) and splitting. The generic large free |
| heap handles searching through large frees to find the first one that matches an allocation request for some |
| size and alignment. It also handles coalescing freed memory back into the heap, by searching for adjacent free |
memory. The searches are through a struct of function pointers that may be implemented either inefficiently (like
| the simple large free heap's O(n) search through an unordered array) or efficiently (like the O(1)-ish or |
| O(log n)-ish operations on the cartesian tree in the fast large free heap). The generic algorithm uses the C |
| generic idiom so there are no actual function pointer calls at runtime. |
| |
| Large free heaps allow you to give them callbacks for allocating and deallocating memory. The allocation |
| callback will get called if you ask a large free heap for memory and it doesn't have it. That allocation |
| callback could get the memory from the OS, or it could get it from some other heap. The deallocation callback |
| is for those cases where the large free heap called the allocation callback and then decided it wanted to give |
| some fraction of that memory back. Both callbacks are optional (can be NULL), though the case of a NULL |
| allocation callback and non-NULL deallocation callback is not useful since the deallocation callback only gets |
| called on the path where we had an allocation callback. |
| |
| Note that large free heaps do not do anything to decommit their free memory. All decommit of memory in large |
| free heaps is accomplished by the *large sharing pool*, which is part of the scavenger. |
| |
| ### Large Map |
| |
| The large map is a hashtable that maps object addresses to *large entries*, which contain the size of the object |
| and the heap it belongs to. The large map has a fairly complex hashtable algorithm because of my past attempts |
| at making the large heap at least somewhat efficient even for small objects. But it's conceptually a simple part |
| of the overall algorithm. It's also legal to ask the large map about objects it doesn't know about, in which |
| case, like a normal hashtable, it will just tell you that it doesn't know about your object. Combined with the |
| way that segregated and bitfit heaps use megapage tables and page header tables, this means that libpas can do |
| fall-through to another malloc for objects that libpas doesn't manage. |
| |
| Note that it might be OK to remove the small object optimizations in the large map. On the other hand, they are |
| reliable, and they aren't known to increase the cost of the algorithm. Having that capability means that as part |
of tuning the algorithm, it's safer than it would otherwise be to try putting some small objects into the
| large heap to avoid allocating the data structures required for operating a segregated or bitfit heap. |
| |
| ### The Large Heap |
| |
| The large free heap and large map are combined into a high-level API with the `pas_large_heap`. In terms of |
| state, this is just a |
| `pas_large_free_heap` plus some data to help with small-object optimizations in the large map. The functions |
| of the large heap do a lot of additional work: |
| |
| - They give the free heap an allocator for getting new memory. The large heap routes memory allocation |
| requests to the heap config's allocation callback. |
| - They ensure that each free heap allocation ends up in the large map. |
| - They implement deallocation by removing something from the large map and then deallocating it into the free |
| heap. |
| - They provide integration with the scavenger's large sharing pool so that free memory can be decommitted. |
| |
| The large heap is always used as a member of the `pas_heap` object. It's useful to think of `pas_large_heap` as |
| never being a distinct object; it's more of a way of compartmentalizing `pas_heap`. The heap object also |
| contains a segregated heap and some other stuff. |
| |
| ## Metacircularity |
| |
| I'm used to programming with dynamically allocated objects. This lets me build arrays, trees, hashtables, |
| look-up tables, and all kinds of lock-free data structures. So, I wanted libpas's internals to be able to |
| allocate objects just like any other kind of algorithm would do. But libpas is engineered so that it can |
| be a "bottom of the world" malloc -- where it is the implementation of `malloc` and `free` and cannot rely on |
| any memory allocation primitives other than what the kernel provides. So, libpas uses its own allocation |
| primitives for its own objects that it uses to implement those primitives. This is bootstrapped as follows: |
| |
| - The *bootstrap heap* is a simple large free heap. A simple large free heap needs to be able to allocate |
| exactly one variable-length array of large frees. The bootstrap heap has hacks to allow itself to allocate |
| that array out of itself. This trick then gives us a complete malloc implementation for internal use by |
| libpas, albeit one that is quite slow, can only be used under the heap lock, and requires us to know the |
| object's size when freeing it. All other simple large free heaps allocate their free lists from the bootstrap |
| heap. The bootstrap heap is the only heap in libpas that asks the OS for memory. All other heaps ask either |
| the bootstrap heap for memory, or they ask one of the other heaps. |
| - The *compact reservation* is 128MB of memory that libpas uses for objects that can be pointed at with 24-bit |
| (3 byte) pointers assuming 8-byte alignment. Libpas needs to manage a lot of heaps, and that requires a lot |
| of internal meta-data, and having compact pointers reduces the cost of doing this. The compact reservation is |
| allocated from the bootstrap heap. |
| - The *immortal heap* is a heap that bump-allocates out of the compact reservation. It's intended for small |
| objects that are immortal. |
| - The *compact bootstrap heap* is like the bootstrap heap, except that it allocates its memory from the compact |
| reservation, and allocates its free list from the bootstrap heap rather than itself. |
| - The *compact large utility free heap* is a fast large free heap that supports decommitting free memory (see |
| the scavenger section) and allocates its memory from the compact bootstrap heap. |
| - The *utility heap* is a segregated heap configured to be as simple as possible (no thread-local caches for |
| example) and can only be used while holding the heap lock. It only supports objects up to some size |
| (`PAS_UTILITY_LOOKUP_SIZE_UPPER_BOUND`), supports decommitting free memory, and gets its memory from the |
| compact bootstrap heap. One example of how the utility heap gets used is the nodes in the cartesian trees |
| used to implement fast large free heaps. So, for example, the compact large utility free heap relies on the |
| utility heap. |
| - The *large utility free heap* is a fast large free heap that supports decommitting free memory and allocates |
| its memory from the bootstrap heap. |
| |
| Note how the heaps pull memory from one another. Generally, a heap will not return memory to the heap it got |
| memory from except to "undo" part of an allocation it had just done. So, this arrangement of |
| who-pulls-memory-from-who is designed for type safety, memory efficiency, and elegantly supporting weird |
| alignments: |
| |
| - Libpas uses decommit rather than unmapping free memory because this ensures that we don't ever change the type |
| of memory after that memory gets its type for the first time. |
| - The lower-level heaps (like bootstrap and compact bootstrap) do not support decommit. So, if a higher-level |
| heap that does support decommit ever returned memory to the lower-level heap, then the memory would never get |
| decommitted. |
| - Page allocation APIs don't let us easily allocate with alignment greater than page size. Libpas does this by |
| over-allocating (allocating size + alignment and then searching for the first aligned start within that larger |
| reservation). This is all hidden inside the bootstrap heap; all other heaps that want memory on some weird |
| alignment just ask some other heap for memory (often the bootstrap heap) and that heap, or ultimately the |
| bootstrap heap, figure out what that means in terms of system calls. |
| |
| One missing piece to the metacircularity is having a *fast utility heap* that uses thread-local caches. There is |
| currently maybe one utility heap callsite that only grabs the heap lock just because it wants to allocate in the |
| utility heap. There's a possibility of a small speed-up if any callsite like that used a fast utility heap |
| instead, and then no locking would be required. It's not clear how easy that would be; it's possible that some |
| bad hacks would be required to allow code that uses TLCs to call into a heap that then also uses TLCs. |
| |
| ## Segregated Heaps and TLCs (Thread Local Caches) |
| |
| Libpas's great performance is mostly due to the segregated heaps and how they leverage thread-local caches. TLCs |
provide a cache of global memory. In the best case, this cache prevents threads from doing any
| synchronization during allocation and deallocation. Even when they do have to do some synchronization, TLCs |
| make it unlikely that one thread will ever want to acquire a lock held by another thread. The strategy is |
| three-fold: |
| |
| - The TLC has a per-size-class allocator that caches some amount of that size class's memory. This means that |
| the allocation fast path doesn't have to do any locking or atomic instructions except when its cache runs out. |
| Then, it will have to do some synchronization -- in libpas's case, fine-grained locking and some lock-free |
| algorithms -- to get more memory. The amount of memory each allocator can cache is bounded (usually 16KB) and |
| allocators can only hold onto memory for about a second without using it before it gets returned (see the |
| scavenger section). |
| - The TLC has a deallocation log. The fast path of deallocating a segregated heap object is just pushing it onto |
| the deallocation log without any locking. The slow path is to walk the log and free all of the objects. The |
| libpas deallocation log flush algorithm cleverly avoids doing per-object locking; in the best case it will |
| acquire a couple of locks before flushing and release them after flushing the whole log. |
| - When the deallocation log flush frees memory, it tries to first make that memory available exclusively to the |
| thread that freed it by putting the free memory into the *local view cache* for that size class in that |
| thread. Memory moves from the local view caches into the global heap only if the view cache is full (has about |
| 1.6MB of memory in it) or if it hasn't been used in about a second (see the scavenger section). |
| |
| This section lays out the details of how this works. *Segregated heaps* are organized into *segregated |
| directories*. Each segregated directory is an array of page *views*. Each page view may or may not have a *page* |
| associated with it. A view can be *exclusive* (the view has the page to itself), *partial* (it's a view into a |
| page shared by others), or *shared* (it represents a page shared by many partial views). Pages have a *page |
| boundary* (the address of the beginning of the page) and a *page header* (the object describing the page, which |
| may or may not actually be inside the page). Pages maintain *alloc bits* to tell which objects are live. |
| Allocation uses the heap's lookup tables to find the right *allocator index* in the TLC, which yields a *local |
| allocator*; that allocator usually has a cache of memory to allocate from. When it doesn't, it first tries to |
| pop a view from the local view cache, and if that fails, it uses the *find first eligible* algorithm on the |
corresponding directory to find an eligible view. Once the allocator has a view, it ensures that the view has a
| page, and then scans the alloc bits to create a cache of free memory in that page. Deallocation fast paths just |
| push the object onto the deallocation log. When the log is full, the TLC flushes its log while trying to |
| amortize lock acquisition. Freeing an object in a page means clearing the corresponding alloc bit. Once enough |
| alloc bits are clear, either the page's view ends up on the view cache, or the directory is notified to mark the |
| page either eligible or empty. The sections that follow go into each of these concepts in detail. |
| |
| ### The Thread Local Cache |
| |
| Each thread has zero or one `pas_thread_local_cache`s. Libpas provides slow paths for allocating and |
| deallocating without a TLC in cases where TLC creation is forbidden (like when a thread is shutting down). But |
| usually, allocation and deallocation create a TLC if the thread doesn't already have one. TLCs are structured |
| as follows: |
| |
| - TLCs contain a fixed-size deallocation log along with an index that tells how much of the log is full. |
| Deallocation pushes onto that log. |
| - TLCs contain a variable-length array of *allocators*, which are really either *local allocators* or *local |
| view caches*. Allocators are variable length. Clients access allocators using an allocator index, which they |
| usually get from the directory that the allocator corresponds to. |
| - TLCs can get reallocated and deallocated, but they always point to a `pas_thread_local_cache_node`, which is |
| an immortal and compact descriptor of a TLC. TLC nodes are part of a global linked list. Each TLC node may or |
| may not have a live TLC associated with it. TLCs cannot be created or destroyed unless the heap lock is held, |
| so if you hold the heap lock, you can iterate the TLC node linked list to find all TLCs. |
| - The layout of allocator indices in a TLC is controlled by both the directories and the TLC layout data |
| structure (`pas_thread_local_cache_layout`). This is a global data structure that can tell us about all of the |
| allocators in a TLC. When holding the heap lock, it's possible to loop over the TLC layout linked list to |
| find what all of the valid allocator indices are and to introspect what is at those indices. |
| |
| Thread local caches tend to get large because both the local allocators and local view caches have inline |
| arrays. The local allocator has an array of bits that tell the allocator where the free objects are. The local |
| view cache has an array of view pointers (up to 1.6MB / 16KB = 100 entries, each using a 24-bit pointer). When |
used in single-heap applications, these overheads don't matter -- they end up accounting for less than
| 10<sup>-5</sup> of the overall process footprint (not just in WebKit but when I experimented with libpas in |
| daemons). But when used for many heaps, these overheads are substantial. Given thousands or tens of thousands |
| of heaps, TLCs account for as much as 1% of memory. So, TLCs support partial decommit. Those pages that only |
| have allocators that are inactive get decommitted. Note that TLC decommit has landed in the libpas.git repo |
| as of [247029@main](https://commits.webkit.org/247029@main), but hasn't yet been merged into WebKit. |
| |
| The TLC deallocation log flush algorithm is designed to achieve two performance optimizations: |
| |
| - It achieves temporal locality of accesses to page headers. If freeing each object meant flipping a bit in the |
| page header, then many of those operations would miss cache, since the page header is not accessed by normal |
| program operation -- it's only accessed during some allocation slow paths and when deallocating. But because |
| deallocation accesses page headers only during log flush and log flush touches about 1000 objects, it's likely |
| that the flush will touch the same page header cache lines multiple times. |
| - It reduces the average number of lock acquisitions needed to free an object. Each page uses its own lock to |
| protect its page header, and the page header's `alloc_bits`. But deallocation log flush will do no locking or |
| unlocking if the object at index *i* and the object at index *i+1* use the same lock. Pages can dynamically |
select which lock they use (thanks to `page->lock_ptr`), and they select it so that the pages a given thread
allocates out of tend to share the same lock. So, deallocation log flush usually just acquires a lock at the start
of the 1000 objects and releases it when it finishes the 1000 objects, as in the sketch below.
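
A simplified sketch of that flush loop (all names are invented; the real flush also re-checks `page->lock_ptr`
after acquiring, as described in the page header section below, and deals with view caches and different page
kinds):

    #include <stddef.h>

    typedef struct example_lock example_lock;
    typedef struct { example_lock* lock_ptr; } example_page;

    void example_lock_lock(example_lock* lock);
    void example_lock_unlock(example_lock* lock);
    example_page* page_for_object(void* object);
    void free_object_in_page(example_page* page, void* object);

    static void flush_deallocation_log(void** log, size_t count)
    {
        example_lock* held_lock = NULL;
        for (size_t index = 0; index < count; ++index) {
            void* object = log[index];
            example_page* page = page_for_object(object);
            /* Consecutive objects usually share a lock, so this branch almost always
               skips the unlock/lock pair. */
            if (page->lock_ptr != held_lock) {
                if (held_lock)
                    example_lock_unlock(held_lock);
                held_lock = page->lock_ptr;
                example_lock_lock(held_lock);
            }
            free_object_in_page(page, object);
        }
        if (held_lock)
            example_lock_unlock(held_lock);
    }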
| |
| ### The Segregated Heap |
| |
| The `pas_segregated_heap` object is the part of `pas_heap` that facilitates segregated heap and bitfit heap |
| allocation. The details of the bitfit heap are discussed in a later section. Segregated heaps can be created |
| separately from a `pas_heap`, but segregated heaps are almost always part of a `pas_heap` (and it would be easy |
| to refactor libpas to make it so that segregated heaps are always part of heaps). |
| |
| Segregated heaps use *size directories* to track actual memory. Most of the exciting action of allocation and |
| deallocation happens in directories. Each directory corresponds to some size class. Segregated heaps make it |
| easy to find the directory for a particular size. They also make it easy to iterate over all the directories. |
| Also, segregated heaps make it easy to find the allocator index for a particular size (using a lookup table that |
| is essentially a cache of what you would get if you asked for the directory using the size-to-directory lookup |
| tables and then asked the directory for the allocator index). The most exciting part of the segregated heap |
| algorithm is `pas_segregated_heap_ensure_size_directory_for_size`, which decides what to do about allocating a |
| size it hadn't encountered before. This algorithm will either return an existing directory, create a new one, or |
| even retire an existing one. It handles all of the issues related to type size, type alignment, and the |
| alignment argument to the current malloc call. |
| |
| The lookup tables maintained by segregated heaps have some interesting properties: |
| |
| - They can be decommitted and rematerialized. This is a useful space saving when having lots of isoheaps. The |
| rematerialization happens because a heap also maintains a linked list of directories, and that linked list |
| never goes away. Each directory in the linked list knows what its representation would have been in the lookup |
| tables. |
| - They are optional. Some heaps can be configured to have a preferred size, called the *basic size class*. This |
| is very common for isoheaps, which may only ever allocate a single size. For isoheaps based on type, the |
| basic size class is just that type's size. Other isoheaps dynamically infer a preferred size based on the |
| first allocation. When a heap only has the basic size class, it will have no lookup tables. |
- There are separate lookup tables for smaller sizes (not related to the small_segregated_config -- the
threshold is set separately); these are just arrays whose index is the size divided by the heap's minalign, rounded
| up. These may be populated under heap lock while they are accessed without any locks. So, accesses to them |
| have some guards against races. |
| - There are separate lookup tables for medium sizes (anything above the threshold for small lookup tables). |
| The medium table is a sorted array that the allocator binary-searches. It may be mutated, decommitted, |
| rematerialized, or reallocated under heap lock. The algorithm defends itself against this with a bunch of |
compiler fences and a mutation count check. Mutating the table means incrementing-to-mutate before making changes
and incrementing-to-finish after making changes. So, the algorithm for lock-free lookup checks the mutation count
before and after, and makes sure they are the same and neither indicates that we are mutating. This involves
clever use of dependency threading (like the ARM64 eor-self trick) to make sure that the mutation count reads
really happen before and after the binary search. A sketch of the reader side follows this list.
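
The reader side of that check is roughly the following seqlock-style sketch (invented names; the real code needs
the fences and dependency threading described above to keep the loads ordered):

    #include <stdbool.h>
    #include <stddef.h>
    #include <stdint.h>

    /* Writers (under the heap lock) increment mutation_count before and after mutating,
       so an odd count means a mutation is in progress. */
    typedef struct {
        uint64_t mutation_count;
        /* ... the sorted array of medium size-class entries ... */
    } medium_table;

    bool medium_table_search(medium_table* table, size_t size, size_t* result);

    static bool medium_table_try_lookup(medium_table* table, size_t size, size_t* result)
    {
        uint64_t count_before = __atomic_load_n(&table->mutation_count, __ATOMIC_ACQUIRE);
        if (count_before & 1)
            return false; /* mutation in progress; fall back to the heap-lock slow path */

        bool found = medium_table_search(table, size, result);

        uint64_t count_after = __atomic_load_n(&table->mutation_count, __ATOMIC_ACQUIRE);
        if (count_before != count_after)
            return false; /* the table changed underneath us; retry or take the slow path */
        return found;
    }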
| |
| ### Segregated Directories |
| |
| Much of the action of managing memory in a segregated heap happens in the segregated directories. There are two |
| kinds: |
| |
| - Segregated size directories, which track the views belonging to some size class in some heap. These may be |
| *exclusive views*, which own a page, or *partial views*, which own part of a *shared page*. Partial views |
range in size from just below 512 bytes to possibly a whole page (in rare cases).
- Segregated shared page directories, which track *shared views*. Each shared view tracks a shared page and which
| partial views belong to it. However, when a shared page is decommitted, to save space, the shared view will |
| forget which partial views belong to it; they will re-register themselves the first time someone allocates in |
| them. |
| |
| Both of them rely on the same basic state, though they use it a bit differently: |
| |
| - A lock-free-access vector of compact view pointers. These are 4-byte pointers. This is possible because views |
| are always allocated out of the compact reservation (they are usually allocated in the immortal |
| heap). This vector may be appended to, but existing entries are immutable. So, resizing just avoids deleting |
| the smaller-sized vectors so that they may still be accessed in case of a race. |
| - A lock-free-access segmented vector of bitvectors. There are two bitvectors, and we interleave their 32-bit |
| words of bits. The *eligible* bitvector tells us which views may be allocated out of. This means different |
| things for size directories than shared page directories. For size directories, these are the views that have |
| some free memory and nobody is currently doing anything with them. For shared page directories, these are the |
| shared pages that haven't yet been fully claimed by partial views. The *empty* bitvector tells us which pages |
| are fully empty and can be decommitted. It's never set for partial views. It means the same thing for both |
| exclusive views in size directories and shared views in shared page directories. |
| |
| Both bitvectors are searched in order: |
| |
| - Eligible bitvectors are searched first-fit. |
| - Empty bitvectors are searched last-fit. |
| |
| Searches are made fast because the directory uses the lock-free tricks of `pas_versioned_field` to maintain two |
| indices: |
| |
| - A first-eligible index. This always points to the first eligible bit, except in cases where some thread has |
| set the bit but hasn't gotten around to setting the first-eligible index. In other words, this may have some |
| lag, but the lag is bounded. |
| - A last-empty-plus-one index. This always points to the index right after the last empty bit. If it's zero, it |
| means there are no set empty bits. If it's the number of views, then it means the last view is empty for sure |
| and there may be any number of other empty views. |
| |
| These versioned indices can be read without any atomic instructions in many cases, though most mutations to them |
| require a pair of 128-bit compare-and-swaps. |
| |
| The eligible bitvector together with the first-eligible index allow for very fast searches to find the first |
| eligible view. Bitvector searches are fast to begin with, even over a segmented vector, since the segmented |
| vector has large-enough chunks. Even searching the whole bitvector is quite efficient because of the properties |
| of bitvector simd (i.e. using a 32-bit or 64-bit or whatever-bit word to hold that many bits). But the |
| first-eligible index means that most searches never go past where that index points, so we get a mostly-O(1) |
| behavior when we do have to find the first eligible view. |
| |
| The empty bitvector gives a similar property for the scavenger, which searches backwards to find empty views. |
| The efficiency here arises from the fact that empty pages know the timestamp of when they became empty, and the |
| scavenger will terminate its backwards search when it finds a too-recently emptied page. |
| |
| Directories have to make some choices about how to add views. View addition happens under heap lock and must be |
| in this order: |
| |
| - First we make sure that the bitvectors have enough room for a bit at the new index. The algorithm relies on |
the size of the view vector telling us how many views there are, so it's fine if the bitvectors are too big
| for a moment. The segmented vector algorithm used for the bitvectors requires appending to happen under heap |
| lock but it can run concurrently to accesses to the vector. It accomplishes this by never deallocating the |
| too-small vector spines. |
- Then we append the view to the view vector, possibly reallocating the view vector. Reallocation keeps the
old too-small copy of the vector around, allowing concurrent reads to the vector. The vector append stores the
value, executes an acqrel fence (probably overkill -- probably could just be a store fence), and then
increments the size. This ensures nobody sees the view until we are ready (see the sketch below).
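
In sketch form (invented names; capacity and reallocation handling omitted), the publish protocol is:

    /* compact_view_ptr stands in for libpas's compact view pointers. The caller holds
       the heap lock; readers run concurrently and load size before reading entries. */
    typedef unsigned compact_view_ptr;

    typedef struct {
        compact_view_ptr* entries; /* old, too-small copies stay alive for racing readers */
        unsigned size;
    } view_vector;

    static void view_vector_append(view_vector* vector, compact_view_ptr view)
    {
        vector->entries[vector->size] = view;      /* 1. store the new entry */
        __atomic_thread_fence(__ATOMIC_ACQ_REL);   /* 2. the acqrel fence from the text */
        vector->size = vector->size + 1;           /* 3. publish by bumping the size */
    }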
| |
| Directories just choose what kind of view they will create and then create an empty form of that view. So, right |
| at the point where the vector append happens, the view will report itself as not yet being initialized. However, |
| any thread can initialize an empty view. The normal flow of allocation means asking a view to "start |
| allocating". This actually happens in two steps (`will_start_allocating` and `did_start_allocating`). The |
| will-start step checks if the view needs to commit its memory, which will cause empty exclusive views to |
| allocate a page. Empty partial views get put into the *partial primordial* state where they grab their first |
| chunk of memory from some shared view and prepare to possibly grab more chunks of that shared view, depending on |
| demand. But all of this happens after the directory has created the view and appended it. This means that there |
| is even the possibility that one thread creates a view, but then some other thread takes it right after it was |
| appended. In that case, the first thread will loop around and try again, maybe finding some other view that had |
been made eligible in the meantime, or again appending another new view.
| |
| Size directories maintain additional state to make page management easy and to accelerate allocation. |
| |
| Size directories that have enabled exclusive views have a `full_alloc_bits` vector that has bits set for those |
| indices in their pages where an object might start. Pages use bitvectors indexed by minalign, and only set those |
| bits that correspond to valid object offsets. The `full_alloc_bits` vector is the main way that directories tell |
| where objects could possibly be in the page. The other way they tell is with |
| `offset_from_page_boundary_to_first_object` and `offset_from_page_boundary_to_end_of_last_object`, but the |
| algorithm relies on those a bit less. |
| |
| Size directories can tell if they have been assigned an allocator index or a view cache index, and control the |
| policies of when they get them. A directory without an allocator index will allocate out of *baseline |
| allocators*, which are shared by all threads. Having an allocator index implies that the allocator index has |
| also been stored in the right places in the heap's lookup tables. Having a view cache index means that |
| deallocation will put eligible pages on the view cache before marking them eligible in the directory. |
| |
| ### Page Boundaries, Headers and Alloc Bits |
| |
| "Pages" in the segregated heap are a configurable concept. Things like their size and where their header lives |
| can be configured by the `pas_segregated_page_config` of their directory. The config can tell which parts of |
| the page are usable as object payload, and they can provide callbacks for finding the page header. |
| |
| The page header contains: |
| |
| - The page's kind. The segregated page header is a subtype of the `pas_page_base`, and we support safely |
| downcasting from `pas_page_base*` to `pas_segregated_page*`. Having the kind helps with this. |
| - Whether the page is in use for allocation right now. This field is only for exclusive views. Allocation in |
| exclusive pages can only happen when some local allocator claims a page. For shared pages, this bit is in |
each of the partial views.
| - Whether we have freed an object in the page while also allocating in it. This field is only for exclusive |
| views. When we finish allocating, and the bit is set, we do the eligibility stuff we would have done if we had |
| freed objects without the page being used for allocation. For shared pages, this bit is in each of the partial |
| views. |
- The size of the objects that the page manages. Again, this is only used for exclusive views. For shared pages,
| each partial view may have a different size directory, and the size directory tells the object size. It's also |
| possible to get the object size by asking the exclusive view for its size directory, and you will get the same |
| answer as if you had asked the page in that case. |
| - A pointer to the lock that the page uses. Pages are locked using a kind of locking dance: you load the |
| `page->lock_ptr`, lock that lock, and then check if `page->lock_ptr` still points at the lock you tried. |
| Anyone who holds both the current lock and some other lock can change `page->lock_ptr` to the other lock. |
| For shared pages, the `lock_ptr` always points at the shared view's ownership lock. For exclusive views, the |
| libpas allocator will change the lock of a page to be the lock associated with their TLC. If contention |
| on `page->lock_ptr` happens, then we change the lock back to the view's ownership lock. This means that in |
| the common case, flushing the deallocation log will encounter page after page that wants to hold the same |
| lock -- usually the TLC lock. This allows the deallocation log flush to only do a handful of lock acquisitions |
| for deallocating thousands of objects. |
| - The timestamp of when the page became empty, using a unit of time libpas calls *epoch*. |
| - The view that owns the page. This is either an exclusive view or a *shared handle*, which is the part of the |
| shared view that gets deallocated for decommitted pages. Note: an obvious improvement is if shared handles |
| were actually part of the page header; they aren't only because until recently, the page header size had to be |
| the same for exclusive and shared pages. |
| - View cache index, if the directory enabled view caching. This allows deallocation to quickly find out which |
| view cache to use. |
| - The alloc bits. |
| - The number of 32-bit alloc bit words that are not empty. |
| - Optionally, the granule use counts. It's possible for the page config to say that the page size is larger than |
| the system page size, but that the page is divided up into *granules* which are system page size. In that |
| case, the page header will have an array of 1-byte use counts per granule, which count the number of objects |
| in that granule. They also track a special state when the granule is decommitted. The medium_segregated_config |
| uses this to offer fine-grained decommit of 128KB "pages". |
| |
Currently we have two ways of placing the page header: either at the beginning of the page (at what we call
the page boundary), or in an object allocated in the utility heap. In the latter case, we use the
| mostly-lock-free page header table to map between the page boundary and page header, or vice-versa. The page |
| config has callbacks that allow either approach. I've also used page config hacking to attempt other kinds of |
| strategies, like saying that every aligned 16MB chunk of pages has an array of page headers at the start of it; |
| but those weren't any better than either of the two current approaches. |
| |
| The most important part of the page header is the alloc bits array and the `num_non_empty_words` counter. This |
| is where most of the action of allocating and deallocating happens. The magic of the algorithm arises from the |
| simple bitvector operations we can perform on `page->alloc_bits`, `full_alloc_bits` (from the size |
| directory in case of exclusive pages or from the partial view in case of shared pages), and the |
| `allocator->bits`. These operations allow us to achieve most of the algorithm: |
| |
| - Deallocation clears a bit in `page->alloc_bits` and if this results in the word becoming zero, it decrements |
| the `num_non_empty_words`. The bit index is just the object's offset shifted by the page config's |
`min_align_shift`, which is a compile-time constant in most of the algorithm. If the algorithm makes any bit
| (for partials) or any word (for exclusives) empty, it makes the page eligible (either by putting it on a view |
| cache or marking the view eligible in its owning directory). If `num_non_empty_words` hits zero, the |
| deallocator also makes the view empty. |
| - Allocation does a find-first-set-bit on the `allocator->bits`, but in a very efficient way, because the |
| current 64-bit word of bits that the allocator is on is cached in `allocator->current_word` -- so allocation |
| rarely searches an array. So, the allocator just loads the current word, does a `ctz` or `clz` kind of |
operation (which is super cheap on modern CPUs), left-shifts the result by the page config's minalign shift, and
| adds the `allocator->page_ish` (the address in memory corresponding to the first bit in `current_word`). |
| That's the allocator fast path. |
| - We prepare `allocator->bits` by basically saying `allocator->bits = full_alloc_bits & ~page->alloc_bits`. This |
| is a loop since each of the `bits` is an array of words of bits and each array is the same size. For a 16384 |
| size page (the default for `small_segregated_config`) and a minalign shift of 4 (so minalign = 16, the default |
| for `small_segregated_config`), this means 1024 bits, or 32 32-bit words, or 128 bytes. The loop over the 32 |
| 32-bit words is usually fully unrolled by the compiler. There are no loop-carried dependencies. This loop |
| shows up in profiles, and though I've tried to make it faster, I've never succeeded. |
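
Putting those three bullets together, the core bit math is small. The following is a condensed sketch with
invented names; the real code operates on the page's own words, handles partial versus exclusive views, and does
more bookkeeping around eligibility and emptiness:

    #include <stddef.h>
    #include <stdint.h>

    #define WORDS_PER_PAGE 32 /* 16384-byte page, 16-byte minalign, 32-bit words */

    /* Refill: the allocator's bits are "places where an object may start and no live
       object is there now". No loop-carried dependencies, so this unrolls well. */
    static void refill_allocator_bits(uint32_t* allocator_bits, const uint32_t* full_alloc_bits,
                                      const uint32_t* page_alloc_bits)
    {
        for (size_t word = 0; word < WORDS_PER_PAGE; ++word)
            allocator_bits[word] = full_alloc_bits[word] & ~page_alloc_bits[word];
    }

    /* Allocation fast path: find-first-set in the cached word, then turn the bit index
       back into an address. */
    static void* allocate_from_current_word(uint64_t* current_word, uintptr_t page_ish,
                                            unsigned min_align_shift)
    {
        if (!*current_word)
            return NULL; /* slow path: advance to the next word, refill, etc. */
        unsigned bit = (unsigned)__builtin_ctzll(*current_word);
        *current_word &= *current_word - 1; /* consume the bit */
        return (void*)(page_ish + ((uintptr_t)bit << min_align_shift));
    }

    /* Deallocation: clear the object's bit; if its word becomes empty, note it. */
    static void free_object_bit(uint32_t* page_alloc_bits, uint8_t* num_non_empty_words,
                                uintptr_t offset_in_page, unsigned min_align_shift)
    {
        size_t bit_index = offset_in_page >> min_align_shift;
        uint32_t* word = &page_alloc_bits[bit_index / 32];
        *word &= ~((uint32_t)1 << (bit_index % 32));
        if (!*word)
            (*num_non_empty_words)--;
    }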
| |
| ### Local Allocators |
| |
| Each size directory can choose to use either baseline allocators or TLC local allocators for allocation. Each |
| size directory can choose to have a local view cache or not. Baseline allocators are just local allocators that |
are global and not part of any TLC, and allocation needs to grab a lock to use them. TLC local allocators don't
| require any locking to get accessed. |
| |
| Local allocators can be in any of these modes: |
| |
| - They are totally uninitialized. All fast paths fail and slow paths will initialize the local allocator by |
| asking the TLC layout. This state happens if TLC decommit causes a local allocator to become all zero. |
| - They are in bump allocation mode. Bump allocation happens either when a local allocator decides to allocate in |
| a totally empty exclusive page, or for primordial partial allocation. In the former case, it's worth about 1% |
| performance to sometimes bump-allocate. In the latter case, using bump allocation is just convenient -- the |
| slow path will decide that the partial view should get a certain range of memory within a shared page and it |
| knows that this memory has never been used before, so it's natural to just set up a bump range over that |
| memory. |
| - They are in free bits mode. This is slightly more common than the bump mode. In this mode, the |
| `allocator->bits` is computed using `full_alloc_bits & ~page->alloc_bits` and contains a bit for the start of |
| every free object. |
| - They are in bitfit mode. In this mode, the allocator just forwards allocations to the `pas_bitfit_allocator`. |
| |
| Local allocators can be *stopped* at any time; this causes them to just return all of their free memory back to |
| the heap. |
| |
| Local allocators in a TLC can be used without any conventional locking. However, there is still synchronization |
| taking place because the scavenger is allowed to stop allocators. To support this, local allocators set an |
| `in_use` bit (not atomically, but protected by a `pas_compiler_fence`) before they do any work and clear it when |
done. The scavenger thread will suspend threads that have TLCs and then, while a thread is suspended, it can
stop any of that thread's allocators that are not `in_use`.
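
The allocator side of that handshake looks roughly like this (the struct and `allocate_impl` are invented, and a
GCC/Clang signal fence stands in for `pas_compiler_fence`):

    #include <stdint.h>

    /* The fences only need to stop compiler reordering, because the scavenger suspends
       the thread before it looks at in_use. */
    typedef struct {
        uint8_t in_use;
        /* ... bump range, free bits, etc. ... */
    } local_allocator;

    void* allocate_impl(local_allocator* allocator);

    static void* allocator_allocate(local_allocator* allocator)
    {
        allocator->in_use = 1;
        __atomic_signal_fence(__ATOMIC_SEQ_CST); /* compiler fence */
        void* result = allocate_impl(allocator);
        __atomic_signal_fence(__ATOMIC_SEQ_CST);
        allocator->in_use = 0;
        return result;
    }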
| |
| ## The Bitfit Heaps |
| |
| Libpas is usually used with a combination of segregated and large heaps. However, sometimes we want to have a |
| heap that is more space-efficient than segregated but not quite as slow as large. The bitfit heap follows a |
| similar style to segregated, but: |
| |
| - While bitfit has a bit for each minalign index like segregated, bitfit actually uses all of the bits. To |
| allocate an object in bitfit, all of the bits corresponding to all of the minalign granules that the object |
| would use have to be free before the allocation and have to be marked as not free after the allocation. |
| Freeing has to clear all of the bits. |
| - The same page can have objects of any size allocated in it. For example, if a 100 byte object gets freed, then |
| it's legal to allocate two 50 byte objects out of the freed space (assuming 50 is a multiple of minalign). |
| - A bitfit directory does not represent a size class. A bitfit heap has one directory per bitfit page config |
| and each page config supports a large range of sizes (the largest object is ~250 times larger than the |
| smallest). |
| |
| Bitfit pages have `free_bits` as well as `object_end_bits`. The `free_bits` indicates every minalign granule |
| that is free. For non-free granules, the `object_end_bits` has an entry for every granule that is the last |
| granule in some live object. These bits get used as follows: |
| |
| - To allocate, we find the first set free bit and then find the first clear free bit after that. If this range |
| is big enough for the allocation, we clear all of the free bits and set the object end bit. If it's not big |
| enough, we keep searching. We do special things (see below) when we cannot allocate in a page. |
| - To free, we find the first set object end bit, which then gives us the object size. Then we clear the object |
| end bit and set the free bits. |
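
Here is a sketch of the two operations on a single-word bitvector; the names and the single-word simplification
are mine, and the real code works on full bitvectors and handles runs that cross words:

    #include <stdbool.h>
    #include <stdint.h>

    typedef struct {
        uint64_t free_bits;       /* bit set => that minalign granule is free */
        uint64_t object_end_bits; /* bit set => last granule of a live object */
    } sketch_bitfit_page;

    /* Assumes 1 <= num_granules < 64. */
    static bool sketch_bitfit_allocate(sketch_bitfit_page* page, unsigned num_granules,
                                       unsigned* begin_index)
    {
        unsigned index = 0;
        while (index < 64 && (page->free_bits >> index)) {
            /* First set free bit at or after index, then the end of that free run. */
            unsigned start = index + (unsigned)__builtin_ctzll(page->free_bits >> index);
            unsigned end = start;
            while (end < 64 && (page->free_bits & ((uint64_t)1 << end)))
                end++;
            if (end - start >= num_granules) {
                uint64_t mask = ((uint64_t)1 << num_granules) - 1;
                page->free_bits &= ~(mask << start);
                page->object_end_bits |= (uint64_t)1 << (start + num_granules - 1);
                *begin_index = start;
                return true;
            }
            index = end;
        }
        return false; /* report max_free to the directory and move on to another page */
    }

    /* Assumes begin_index is the first granule of a live object. */
    static unsigned sketch_bitfit_free(sketch_bitfit_page* page, unsigned begin_index)
    {
        unsigned end_index =
            begin_index + (unsigned)__builtin_ctzll(page->object_end_bits >> begin_index);
        unsigned num_granules = end_index - begin_index + 1;
        page->object_end_bits &= ~((uint64_t)1 << end_index);
        page->free_bits |= (((uint64_t)1 << num_granules) - 1) << begin_index;
        return num_granules; /* the object's size in granules */
    }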
| |
| This basic style of allocation is usually called *bitmap* allocation. Bitfit is a special kind of bitmap |
| allocation that makes it cheap to find the first page that has enough space for an allocation of a given size. |
| Bitfit makes allocation fast even when it is managing lots of pages by using two tricks: |
| |
| - Bitfit directories have an array of bitfit views and a corresponding array of `max_free` bytes. Bitfit views |
| are monomorphic, unlike the polymorphic views of segregated heaps. Each bitfit view is either uninitialized or |
  has a bitfit page. A `max_free` byte for a page tells the maximum free object size in that page. So, in the
  worst case, we search the `max_free` vector to find the first byte that is large enough for our allocation.
| - Bitfit uses size classes to short-circuit the search. Bitfit leverages segregated heaps to create size |
| classes. Segregated size directories choose at creation time if they want to support segregated allocation or |
| bitfit allocation. If the latter, the directory is just used as a way of locating the bitfit size class. Like |
| with segregated, each local allocator is associated with a segregated size directory, even if it's a local |
| allocator configured for bitfit. Each size class maintains the index of the first view/page in the directory |
| that has a free object big enough for that size class. |
| |
| The updates to `max_free` and the short-circuiting indices in size classes happen when an allocation fails in |
| a page. This is an ideal time to set those indices since failure to allocate happens to also tell you the size |
| of the largest free object in the page. |
| |
| When any object is freed in a page, we mark the page as having `PAS_BITFIT_MAX_FREE_UNPROCESSED` and rather than |
| setting any short-circuiting indices in size classes, we just set the `first_unprocessed_free` index in the |
| `pas_bitfit_directory`. Allocation will start its search from the minimum of `directory->first_unprocessed_free` |
| and `size_class->first_free`. All of these short-circuiting indices use `pas_versioned_field` just like how |
| short-circuiting works in segregated directories. |
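
A sketch of the resulting search (flat arrays and plain fields here; the real directories use segmented vectors
and `pas_versioned_field`, and the `max_free` values are encoded bytes):

    #include <stddef.h>
    #include <stdint.h>

    typedef struct {
        uint8_t* max_free; /* per view: largest free run, or a large "unprocessed" marker after a free */
        size_t num_views;
        size_t first_unprocessed_free;
    } sketch_bitfit_directory;

    typedef struct {
        size_t first_free; /* first view that might satisfy this size class */
    } sketch_bitfit_size_class;

    static size_t sketch_find_view_index(sketch_bitfit_directory* directory,
                                         sketch_bitfit_size_class* size_class,
                                         uint8_t needed)
    {
        size_t index = size_class->first_free < directory->first_unprocessed_free
            ? size_class->first_free : directory->first_unprocessed_free;
        for (; index < directory->num_views; index++) {
            /* Unprocessed pages compare as large, so we retry them and learn their
               true max_free if the allocation fails. */
            if (directory->max_free[index] >= needed)
                return index;
        }
        return SIZE_MAX; /* nothing fits; create a new view/page */
    }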
| |
| Bitfit heaps use fine-grained locking in the sense that each view has its own locks. But, there's no attempt |
| made to have different threads avoid allocating in the same pages. Adding something like view caches to the |
| bitfit heap is likely to make it much faster. You could even imagine that rather than having the |
| `directory->first_unprocessed_free`, we instead have freeing in a page put the page's view onto a local view |
| cache for that bitfit directory, and then we allocate out of the view cache until it's empty. Failure to |
| allocate in a page in the view cache will then tell us the `max_free`, which will allow us to mark the view |
| eligible in the directory. |
| |
| ## The Scavenger |
| |
| Libpas returns memory to the OS by madvising it. This makes sense for a malloc that is trying to give strong |
| type guarantees. If we unmapped memory, then the memory could be used for some totally unrelated type in the |
| future. But by just decommitting the memory, we get the memory savings from free pages of memory and we also |
| get to preserve type safety. |
| |
| The madvise system call -- and likely any mechanism for saying "this page is empty" -- is expensive enough that |
| it doesn't make sense to do it anytime a page becomes empty. Pages often become empty only to refill again. In |
fact, last time I measured it, just about half of allocations went down the bump allocation path, and (except
at start-up) that path is for completely empty pages. So, libpas has mechanisms for stashing the information
| that a page has become empty and then having a *scavenger thread* return that memory to the OS with an madvise |
| call (or whatever mechanism). The scavenger thread is by default configured to run every 100ms, but will shut |
| itself down if we have a period of non-use. At each tick, it returns all empty pages that have been empty for |
| some length of time (currently 300ms). Those two thresholds -- the period and the target age for decommit -- are |
independently configurable, as it might make sense (for different reasons) to have either number be bigger than
the other.
| |
This section describes the scavenging algorithm in detail. This is a large fraction of what makes libpas fast
| and space-efficient. The algorithm has some crazy things in it that probably didn't work out as well as I wanted |
| but nonetheless those things seem to avoid showing up in profiles. That's sort of the outcome of this algorithm |
| being tuned and twisted so many times during the development of this allocator. First I'll describe the |
| *deferred decommit log*, which is how we coalesce madvise calls. Then I'll describe the *page sharing pool*, |
| which is a mechanism for multiple *participants* to report that they have some empty pages. Then I'll describe |
| how the large heap implements this with the large sharing pool, which is one of the singleton participants. Then |
| I'll describe the segregated directory participants -- which are a bit different for shared page directories |
| versus size directories. I'll also describe the bitfit directory participants, which are quite close to their |
| segregated directory cousins. Then I'll describe some of the things that the scavenger does that aren't to do |
| with the page sharing pool, like stopping baseline allocators, stopping utility heap allocators, stopping |
| TLC allocators, flushing TLC deallocation logs, decommitting unused parts of TLCs, and decommitting expendable |
| memory. |
| |
| ### Deferred Decommit Log |
| |
| Libpas's decommit algorithm coalesces madvise calls -- so if two adjacent pages, even from totally unrelated |
| heaps, become empty, then their decommit will be part of one syscall. This is achieved by having places in the |
| code that want to decommit memory instead add that memory to a deferred decommit log. This log internally uses |
| a minheap based on address. The log also stores what lock was needed to decommit the range, since the decommit |
| algorithm relies on fine-grained locking of memory rather than having a global commit lock. So, when the |
| scavenger asks the page sharing pool to go find some memory to decommit, it gives it a deferred decommit log. |
| The page sharing pool will usually return all of the empty pages in one go, so the deferred decommit logs can |
| get somewhat big. They are allocated out of the bootstrap free heap (which isn't the best idea if they ever get |
| very big, since bootstrap free heap memory is not decommitted -- but right now this is convenient because for |
| a heap to support decommit, it needs to talk to deferred decommit logs, so we want to avoid infinite recursion). |
| After the log is filled up, we can decommit everything in the log at once. This involves heapifying the array |
| and then scanning it backwards while detecting adjacent ranges. This is the loop that actually calls decommit. A |
| second loop unlocks the locks. |
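
A sketch of the coalescing pass, assuming the entries have already been ordered by address (the real code pops
them from the minheap, and the lock bookkeeping is omitted here):

    #include <stddef.h>
    #include <stdint.h>

    typedef struct {
        uintptr_t begin;
        uintptr_t end;
    } sketch_range;

    static void sketch_decommit_all(sketch_range* ranges, size_t count,
                                    void (*decommit)(uintptr_t begin, uintptr_t end))
    {
        size_t index = 0;
        while (index < count) {
            uintptr_t begin = ranges[index].begin;
            uintptr_t end = ranges[index].end;
            index++;
            /* Merge adjacent ranges, even ones from totally unrelated heaps. */
            while (index < count && ranges[index].begin == end) {
                end = ranges[index].end;
                index++;
            }
            decommit(begin, end); /* one syscall for the whole merged run */
        }
    }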
| |
| Two fun complications arise: |
| |
| - As the page sharing pool scans memory for empty pages, it may arrive at pages in a random order, which may |
| be different from any valid order in which to acquire commit locks. So, after acquiring the first commit lock, |
| all subsequent lock acquisitions are `try_lock`'s. If any `try_lock` fails, the algorithm returns early, and |
| the deferred decommit log helps facilitate detecting when a lock failed to be acquired. |
| - Libpas supports calling into the algorithm even if some commit locks, or the heap lock, are held. It's not |
| legal to try to acquire any commit locks other than by `try_lock` in that case. The deferred decommit log will |
| also make sure that it will not relock any commit locks that are already held. |
| |
| Libpas can be configured to return memory either by madvising it or by mmap-zero-filling it, which has a similar |
| semantic effect but is slower. Libpas supports both symmetric and asymmetric forms of madvise, though the |
| asymmetric form is faster and has a slight (though mostly theoretical) memory usage edge. By *asymmetric* I mean |
| that you call some form of madvise to decommit memory and then do nothing to commit it. This works on Darwin and |
| it's quite efficient -- the kernel will clear the decommit request and give the page real memory if you access |
| the page. The memory usage edge of not explicitly committing memory in the memory allocator is that programs may |
| allocate large arrays and never use the whole array. It's OK to configure libpas to use symmetric decommit, but |
| the asymmetric variant might be faster or more efficient, if the target OS allows it. |
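
For concreteness, here is roughly what the two styles can look like on Darwin. This is not libpas's actual code;
`MADV_FREE_REUSABLE` and `MADV_FREE_REUSE` are the Darwin-specific advice values, and other systems would use
`MADV_FREE` or `MADV_DONTNEED` instead:

    #include <sys/mman.h>

    /* Asymmetric: advise the kernel that the pages are unneeded and do nothing on
       commit; touching the memory later gives it real pages again. */
    static void sketch_decommit(void* base, size_t size)
    {
        madvise(base, size, MADV_FREE_REUSABLE);
    }

    /* Symmetric: commit is an explicit call too. */
    static void sketch_commit(void* base, size_t size)
    {
        madvise(base, size, MADV_FREE_REUSE);
    }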
| |
| ### Page Sharing Pool |
| |
| Different kinds of heaps have different ways of discovering that they have empty pages. Libpas supports three |
| different kinds of heaps (large, segregated, and bitfit) and one of those heaps has two different ways of |
discovering free memory (segregated shared page directories and segregated size directories). The page sharing
| pool is a data structure that can handle an arbitrary number of *page sharing participants*, each of which is |
| able to say whether they have empty pages and whether those empty pages are old enough to be worth decommitting. |
| |
Page sharing participants need to be able to answer the following queries (summarized as a sketch after the list):
| |
| - Getting the epoch of the oldest free page. This can be approximate; for example, it's OK for the participant |
| to occasionally give an epoch that is newer than the true oldest so long as this doesn't happen all the time. |
| We call this the `use_epoch`. |
| - Telling if the participant thinks it *is eligible*, i.e. has any free pages right now. It's OK for this to |
| return true even if there aren't free pages. It's not OK for this to return false if there are free pages. If |
| this returns true and there are no free pages, then it must return false after a bounded number of calls to |
| `take_least_recently_used`. Usually if a participant incorrectly says that it is eligible, then this state |
| will clear with exactly one call to `take_least_recently_used`. |
| - Taking the least recently used empty page (`pas_page_sharing_participant_take_least_recently_used`). This is |
| allowed to return that there aren't actually any empty pages. If there are free pages, this is allowed to |
| return any number of them (doesn't have to be just one). Pages are "returned" via a deferred decommit log. |
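
In other words, a participant behaves as if it implemented an interface like the following. This is a paraphrase
with invented names and signatures; the real code dispatches on the kind tag stored in the tagged participant
pointer rather than on function pointers:

    #include <stdbool.h>
    #include <stdint.h>

    struct pas_deferred_decommit_log; /* the real libpas type, used opaquely here */
    typedef struct sketch_participant sketch_participant;

    struct sketch_participant {
        /* Approximate age of the oldest free page; may occasionally err toward "newer". */
        uint64_t (*get_use_epoch)(sketch_participant* participant);

        /* May spuriously say true, but must stop doing so after a bounded number of
           take calls; must never say false while free pages exist. */
        bool (*is_eligible)(sketch_participant* participant);

        /* Move zero or more of the oldest empty pages onto the deferred decommit log;
           returns false if a try_lock failed and the caller should come back later. */
        bool (*take_least_recently_used)(sketch_participant* participant,
                                         struct pas_deferred_decommit_log* log);
    };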
| |
| Participants also notify the page sharing pool when they see a *delta*. Seeing a delta means one of: |
| |
| - The participant found a new free page and it previously thought that it didn't have any. |
| - The participant has discovered that the oldest free page is older than previously thought. |
| |
| It's OK to only report a delta if the participant knows that it was previously not advertising itself as being |
| eligible, and it's also OK to report a delta every time that a free page is found. Most participants try to |
| avoid reporting deltas unless they know that they were not previously eligible. However, some participants (the |
| segregated and bitfit directories) are sloppy about reporting the epoch of the oldest free page. Those |
| participants will conservatively report a delta anytime they think that their estimate of the oldest page's age |
| has changed. |
| |
| The page sharing pool itself comprises: |
| |
| - A segmented vector of page sharing participant pointers. Each pointer is tagged with the participant's type, |
| which helps the pool decide how to get the `use_epoch`, decide whether the participant is eligible, and take |
| free pages from it. |
| - A bitvector of participants that have reported deltas. |
| - A minheap of participants sorted by epoch of the oldest free memory. |
| - The `current_participant`, which the page sharing pool will ask for pages before doing anything else, so long |
| as there are no deltas and the current participant continues to be eligible and continues to report the same |
| use epoch. |
| |
| If the current participant is not set or does not meet the criteria, the heap lock is taken and all of the |
| participants that have the delta bit set get reinserted into the minheap based on updated use epochs. It's |
| possible that the delta bit causes the removal of entries in the minheap (if something stopped being eligible). |
| It's possible that the delta bit causes the insertion of entries that weren't previously there (if something has |
| just become eligible). And it's possible for an entry to be removed and then readded (if the use epoch changed). |
| Then, the minimum of the minheap becomes the current participant, and we can ask it for pages. |
| |
| The pool itself is a class but it happens to be a singleton right now. It's probably a good idea to keep it as a |
| class, because as I've experimented with various approaches to organizing memory, I have had versions where |
| there are many pools. There is always the physical page sharing pool (the singleton), but I once had sharing |
| pools for moving pages between threads and sharing pools for trading virtual memory. |
| |
| The page sharing pool exposes APIs for taking memory from the pool. There are two commonly used variants: |
| |
| - Take a set number of bytes. The page sharing pool will try to take that much memory unless a try_lock fails, |
| in which case it records that it should take this additional amount of bytes on the next call. |
- Take all free pages that are the same age as or older than some `max_epoch`. This API is called
| `pas_physical_page_sharing_pool_scavenge`. This algorithm will only return when it is done. It will reloop in |
| case of `try_lock` failure, and it does this in a way that avoids spinning. |
| |
| The expected behavior is that when the page sharing pool has to get a bunch of pages it will usually get a run |
| of them from some participant -- hence the emphasis on the `current_participant`. However, it's possible that |
| various tweaks that I've made to the algorithm have made this no longer be the case. In that case, it might be |
worthwhile to try to come up with a way of recomputing the `current_participant` that doesn't require holding the
| `heap_lock`. Maybe even just a lock for the page sharing pool, rather than using the `heap_lock`, would be |
| enough to get a speed-up, and in the best case that speed-up would make it easier to increase scavenger |
| frequency. |
| |
| The page sharing pool's scavenge API is the main part of what the scavenger does. The next sections describe the |
| inner workings of the page sharing participants. |
| |
| ### Large Sharing Pool |
| |
| The large sharing pool is a singleton participant in the physical page sharing pool. It tracks which ranges of |
pages are empty across all of the large heaps. The idea is to allow large heaps to sometimes split a page among
each other, in which case no single large heap knows whether the page can be decommitted. So, all large heaps as well
| as the large utility free heap and compact large utility free heap report when they allocate and deallocate |
| memory to the large sharing pool. |
| |
| Internally, the large sharing pool maintains a red-black tree and minheap. The tree tracks coalesced ranges of |
| pages and their states. The data structure thinks it knows about all of memory, and it "boots up" with a single |
| node representing the whole address space and that node claims to be allocated and committed. When heaps that |
| talk to the large sharing pool acquire memory, they tell the sharing pool that the memory is now free. Any |
| free-and-committed ranges also reside in the minheap, which is ordered by use epoch (the time when the memory |
| became free). |
| |
| The large sharing pool can only be used while holding the `heap_lock`, but it uses a separate lock, the |
| `pas_virtual_range_common_lock`, for commit and decommit. So, while libpas is blocked in the madvise syscall, |
| it doesn't have to hold the `heap_lock` and the other lock only gets acquired when committing or decommitting |
| large memory. |
| |
| It's natural for the large sharing pool to handle the participant API: |
| |
| - The large sharing pool participant says it's eligible when the minheap is non-empty. |
| - The large sharing pool participant reports the minheap's minimum use epoch as its use epoch (though it can be |
| configured to do something else; that something else may not be interesting anymore). |
| - The large sharing pool participant takes least recently used by removing the minimum from the minheap and |
| adding that node's memory range to the deferred decommit log. |
| |
| The large sharing pool registers itself with the physical page sharing pool when some large heap reports the |
| first bit of memory to it. |
| |
| ### Segregated Directory Participant |
| |
| Each segregated directory is a page sharing participant. They register themselves once they have pages that |
| could become empty. Segregated directories use the empty bits and the last-empty-plus-one index to satisfy the |
| page sharing pool participant API: |
| |
| - A directory participant says it's eligible when last-empty-plus-one is nonzero. |
| - A directory participant reports the use epoch of the last empty page before where last-empty-plus-one points. |
| Note that in extreme cases this means having to search the bitvector for a set empty bit, since the |
| last-empty-plus-one could lag in being set to a lower value in case of races. |
| - A directory participant takes least recently used by searching backwards from last-empty-plus-one and taking |
| the last empty page. This action also updates last-empty-plus-one using the `pas_versioned_field` lock-free |
| tricks. |
| |
| Once an empty page is found the basic idea is: |
| |
| 1. Clear the empty bit. |
| 2. Try to take eligibility; i.e. make the page not eligible. We use the eligible bit as a kind of lock |
| throughout the segregated heap; for example, pages will not be eligible if they are currently used by some |
| local allocator. If this fails, we just return. Note that it's up to anyone who makes a page ineligible to |
| then check if it should set the empty bit after they make it eligible again. So, it's fine for the scavenger |
| to not set the empty bit again after clearing it and failing to take the eligible bit. This also prevents a |
| spin where the scavenger keeps trying to look at this allegedly empty page even though it's not eligible. By |
| clearing the empty bit and not setting it again in this case, the scavenger will avoid this page until it |
| becomes eligible. |
| 3. Grab the *ownership lock* to say that the page is now decommitted. |
| 4. Grab the commit lock and then put the page on the deferred decommit log. |
| 5. Make the page eligible again. |
| |
| Sadly, it's a bit more complicated than that: |
| |
| - Directories don't track pages; they track views. Shared page directories have views of pages whose actual |
| eligibility is covered by the eligible bits in the segregated size directories that hold the shared page's |
| partial views. So, insofar as taking eligibility is part of the algorithm, shared page directories have to |
| take eligibility for each partial view associated with the shared view. |
| - Shared and exclusive views could both be in a state where they don't even have a page. So, before looking at |
| anything about the view, it's necessary to take the ownership lock. In fact, to have a guarantee that nobody |
| is messing with the page, we need to grab the ownership lock and take eligibility. The actual algorithm does |
| these two things together. |
| - It's possible for the page to not actually be empty even though the empty bit is set. We don't require that |
| the empty bit is cleared when a page becomes nonempty. |
| - Some of the logic of decommitting requires holding the `page->lock_ptr`, which may be a different lock than |
| the ownership lock. So the algorithm actually takes the ownership lock, then the page lock, then the ownership |
| lock again, and then the commit lock. |
| - If a `try_lock` fails, then we set both the eligible and empty bits, since in that case, we really do want |
| the page sharing pool to come back to us. |
| |
| Some page configs support segregated pages that have multiple system pages inside them. In that case, the empty |
| bit gets set when any system page becomes empty (using granule use counts), and the taking algorithm just |
| decommits the granules rather than decommitting the whole page. |
| |
| ### Bitfit Directory Participant |
| |
| Bitfit directories also use an empty bitvector and also support granules. Although bitfit directories are an |
| independent piece of code, their approach to participating in page sharing pools exactly mirrors what segregated |
| directories do. |
| |
| This concludes the discussion about the page sharing pool and its participants. Next, I will cover some of the |
| other things that the scavenger does. |
| |
| ### Stopping Baseline Allocators |
| |
The scavenger also routinely stops baseline allocators. Baseline allocators are easy for the scavenger to stop
because they can be used by anyone who holds their lock. So the scavenger can stop any baseline allocator it
| wants. It will only stop those allocators that haven't been used in a while (by checking and resetting a dirty |
| bit). |
| |
| ### Stopping Utility Heap Allocators |
| |
| The scavenger also does the same thing for utility allocators. This just requires holding the `heap_lock`. |
| |
| ### Stopping TLC Allocators |
| |
| Allocators in TLCs also get stopped, but this requires more effort. When the scavenger thread is running, it's |
| possible for any thread to be using any of its allocators and no locks are held when this happens. So, the |
| scavenger uses the following algorithm: |
| |
| - First it tries to ask the thread to stop certain allocators. In each allocation slow path, the allocator |
| checks if any of the other allocators in the TLC have been requested to stop by the scavenger, and if so, it |
| stops those allocators. That doesn't require special synchronization because the thread that owns the |
| allocator is the one stopping it. |
| - If that doesn't work out, the scavenger suspends the thread and stops all allocators that don't have the |
| `in_use` bit set. The `in_use` bit is set whenever a thread does anything to a local allocator, and cleared |
| after. |
| |
| One wrinkle about stopping allocators is that stopped allocators might get decommitted. The way that this is |
| coordinated is that a stopped allocator is in a special state that means that any allocation attempt will take |
| a slow path that acquires the TLC's scavenging lock and possibly recommits some pages and then puts the |
| allocator back into a normal state. |
| |
| ### Flushing TLC Deallocation Logs |
| |
| The TLC deallocation logs can be flushed by any thread that holds the scavenging lock. So, the scavenger thread |
| flushes all deallocation logs that haven't been flushed recently. |
| |
| When a thread flushes its own log, it holds the scavenger lock. However, appending doesn't grab the lock. To |
| make this work, when the scavenger flushes the log, it: |
| |
| - Replaces any entry in the log that it deallocated with zero. Actually, the deallocation log flush always does |
| this. |
| - Does not reset the `thread_local_cache->deallocation_log_index`. In fact, it doesn't do anything except read |
| the field. |
| |
| Because of the structure of the deallocation log flush, it's cheap for it to null-check everything it loads from |
the deallocation log. So, when a thread goes to flush its log after the scavenger has done so, it will see a
| bunch of null entries, and it will skip them. If a thread tries to append to the deallocation log while the |
| scavenger is flushing it, then this just works, because it ends up storing the new value above what the |
| scavenger sees. The object is still in the log, and will get deallocated on the next flush. |
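
A sketch of the flush loop (invented names): the owning thread's flush looks like this, and the scavenger's flush
runs the same loop under the scavenging lock except that it leaves `*index_ptr` alone and only zeroes the entries
it consumed:

    #include <stddef.h>
    #include <stdint.h>

    static void sketch_deallocate(uintptr_t begin); /* hypothetical helper */

    static void sketch_flush_deallocation_log(uintptr_t* log, size_t* index_ptr)
    {
        for (size_t index = *index_ptr; index--;) {
            uintptr_t entry = log[index];
            log[index] = 0;   /* the flush always zeroes what it consumes */
            if (!entry)
                continue;     /* the scavenger already deallocated this one */
            sketch_deallocate(entry);
        }
        *index_ptr = 0;       /* only the owning thread resets the index */
    }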
| |
| ### Decommitting Unused Parts of TLCs |
| |
The scavenger can decommit any page of a TLC that consists entirely of stopped allocators. The resulting
all-zero state triggers a slow path in the local allocator and local view cache that then commits the page and
| rematerializes the allocator. This is painstakingly ensured; to keep this property you generally have to audit |
| the fast paths to see which parts of allocators they access, and make sure that the allocator goes to a slow |
| path if they are all zero. That slow path then has to check if the allocator is stopped or decommitted, and if |
| it is either, it grabs the TLC's scavenger lock and recommits and rematerializes. |
| |
| This feature is particularly valuable because of how big local allocators and local view caches are. When there |
| are a lot of heaps, this accounts for tens of MBs in some cases. So, being able to decommit unused parts is a |
| big deal. |
| |
| ### Decommitting Expendable Memory |
| |
| Segregated heaps maintain lookup tables that map "index", i.e. the object size divided by the heap's minalign, |
to allocator indices and directories. There are three tables (a lookup sketch follows this list):
| |
| - The index-to-allocator-index table. This only works for small-enough indices. |
| - The index-to-directory table. This also only works for small-enough indices. |
| - The medium index-to-directory-tuple binary search array. This works for the not-small-enough indices. |
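
A sketch of the lookup (invented names and flat tables; the real tables live in the segregated heap and, as
described below, can themselves be decommitted):

    #include <stddef.h>

    typedef struct {
        unsigned min_align_shift;
        size_t small_index_upper_bound;
        unsigned* index_to_allocator_index; /* small indices only */
        /* ... index-to-directory table and medium binary-search array ... */
    } sketch_segregated_heap;

    static unsigned sketch_medium_lookup(sketch_segregated_heap* heap, size_t index); /* hypothetical */

    static unsigned sketch_allocator_index_for_size(sketch_segregated_heap* heap, size_t size)
    {
        size_t index = (size + ((size_t)1 << heap->min_align_shift) - 1) >> heap->min_align_shift;
        if (index < heap->small_index_upper_bound)
            return heap->index_to_allocator_index[index]; /* O(1) in the common case */
        return sketch_medium_lookup(heap, index); /* binary search over directory tuples */
    }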
| |
It makes sense to let those tables get large. It might even make sense for each isoheap to have a large lookup
table, though currently they use a smaller threshold. To keep that from costing a lot of memory, we need to be
able to decommit memory that is only used by the lookup tables of heaps that nobody is using.
| |
| So, libpas has this weird thing called `pas_expendable_memory`, which allows us to allocate objects of immortal |
| memory that automatically get decommitted if we don't "touch" them frequently enough. The scavenger checks the |
| state of all expendable memory, and decommits those pages that haven't been used in a while. The expendable |
| memory algorithm is not wired up as a page sharing participant because the scavenger really needs to poke the |
| whole table maintained by the algorithm every time it does a tick; otherwise the algorithm would not work. |
| Fortunately, there's never a lot of this kind of memory. So, this isn't a performance problem as far as I know. |
| Also, it doesn't actually save that much memory right now -- but also, right now isoheaps use smaller-size |
| tables. |
| |
| This concludes the discussion of the scavenger. To summarize, the scavenger periodically: |
| |
| - Stops baseline allocators. |
| - Stops utility heap allocators. |
| - Stops TLC allocators. |
| - Flushes TLC deallocation logs. |
| - Decommits unused parts of TLCs. |
| - Decommits expendable memory. |
| - Asks the physical page sharing pool to scavenge. |
| |
| While it does these things, it notes whether anything is left in a state where it would be worthwhile to run |
again. A steady state might be that all empty pages have been decommitted, rather than that only the old
enough ones were. If a steady state looks like it's being reached, the scavenger first sleeps for
| a while, and then shuts down entirely. |
| |
| ## Megapages and Page Header Tables |
| |
Libpas provides non-large heaps with two fast and scalable ways of figuring out information about an object
from its address, for example during deallocation or during user queries like `malloc_size`.
| |
| 1. Megapages and the megapage table. Libpas describes every 16MB of memory using a two-bit enum called |
| `pas_fast_megapage_kind`. The zero state indicates that this address does not have a *fast megapage*. The |
| other two describe two different kinds of fast megapages, one where you know exactly the type of page it |
| is (small exclusive segregated) and one where you have to find out by asking the page header, but at least |
| you know that it's small (so usually 16KB). |
| 2. Page header table. This is a lock-free-to-read hashtable that tells you where to find the page header for |
| some page in memory. For pages that use page headers, we also use the page header table to find out if the |
| memory address is in memory owned by some kind of "page" (either a segregated page or a bitfit page). |
| |
| Usually, the small segregated and small bitfit configs use megapages. The medium segregated, medium bitfit, and |
| marge bitfit configs use page header tables. Large heaps don't use either, since they have the large map. |
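
A sketch of the megapage lookup, assuming a packed table with four two-bit entries per byte (the names and the
exact encoding are mine; the real table and enum are more elaborate):

    #include <stdint.h>

    #define SKETCH_MEGAPAGE_SHIFT 24u /* 16MB megapages */

    enum sketch_megapage_kind {
        sketch_not_a_fast_megapage,
        sketch_small_exclusive_segregated_megapage,
        sketch_small_other_megapage
    };

    static enum sketch_megapage_kind
    sketch_megapage_kind_for_address(const uint8_t* table, uintptr_t address)
    {
        uintptr_t megapage_index = address >> SKETCH_MEGAPAGE_SHIFT;
        uint8_t byte = table[megapage_index >> 2];
        return (enum sketch_megapage_kind)((byte >> ((megapage_index & 3) << 1)) & 3);
    }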
| |
| ## The Enumerator |
| |
| Libpas supports the libmalloc enumerator API. It even includes tests for it. The way that this is implemented |
| revolves around the `pas_enumerator` class. Many of the data structures that are needed by the enumerator have |
| APIs to support enumeration that take a `pas_enumerator*`. The idea behind how it works is conceptually easy: |
| just walk the heap -- which is possible to do accurately in libpas -- and report the metadata areas, the areas |
| that could contain objects, and the live objects. But, we have to do this while walking a heap in a foreign |
| process, and we get to see that heap through little copies of it that we request a callback to make for us. The |
| code for enumeration isn't all that pretty but at least it's easy to find most of it by looking for functions |
| and files that refer to `pas_enumerator`. |
| |
| But a challenge of enumeration is that it happens when the remote process is stopped at some arbitrary |
| point. That point has to be an instruction boundary -- so CPU memory model issues aren't a problem. But this |
| means that the whole algorithm has to ensure that at any instruction boundary, the enumerator will see the right |
| thing. Logic to help the enumerator still understand the heap at any instruction is spread throughout the |
| allocation algorithms. Even the trees and hashtables used by the large heap have special hacks to enable them to |
| be enumerable at any instruction boundary. |
| |
| The enumerator is maintained so that it's 100% accurate. Any discrepancy -- either in what objects are reported |
| live or what their sizes are -- gets flagged as an error by `test_pas`. The goal should be to maintain perfect |
| enumerator accuracy. This makes sense for two reasons: |
| |
| 1. I've yet to see a case where doing so is a performance or memory regression. |
| 2. It makes the enumerator easy to test. I don't know how to test an enumerator that is not accurate by design. |
| If it's accurate by design, then any discrepancy between the test's understanding of what is live and what |
| the enumerator reports can be flagged as a test failure. If it wasn't accurate by design, then I don't know |
| what it would mean for a test to fail or what the tests could even assert. |
| |
| ## The Basic Configuration Template |
| |
| Libpas's heap configs and page configs allow for a tremendous amount of flexibility. Things like the utility |
| heap and the `jit_heap` leverage this flexibility to do strange things. However, if you're using libpas to |
| create a normal malloc, then a lot of the configurability in the heap/page configs is too much. A "normal" |
malloc is one that is exposed as a normal API rather than being internal to libpas, and that manages memory that
doesn't have special properties (like being marked executable and not writable).
| |
| The basic template is provided by `pas_heap_config_utils.h`. To define a new config based on this template, you |
| need to: |
| |
| - Add the appropriate heap config and page config kinds to `pas_heap_config_kind.def`, |
| `pas_segregated_page_config_kind.def`, and `pas_bitfit_page_config_kind.def`. You also have to do this if you |
| add any kind of config, even one that doesn't use the template. |
| - Create the files `foo_heap_config.h` and `foo_heap_config.c`. These are mostly boilerplate. |
| |
| The header file usually looks like this: |
| |
| #define ISO_MINALIGN_SHIFT ((size_t)4) |
| #define ISO_MINALIGN_SIZE ((size_t)1 << ISO_MINALIGN_SHIFT) |
| |
| #define ISO_HEAP_CONFIG PAS_BASIC_HEAP_CONFIG( \ |
| iso, \ |
| .activate = pas_heap_config_utils_null_activate, \ |
| .get_type_size = pas_simple_type_as_heap_type_get_type_size, \ |
| .get_type_alignment = pas_simple_type_as_heap_type_get_type_alignment, \ |
| .dump_type = pas_simple_type_as_heap_type_dump, \ |
| .check_deallocation = true, \ |
| .small_segregated_min_align_shift = ISO_MINALIGN_SHIFT, \ |
| .small_segregated_sharing_shift = PAS_SMALL_SHARING_SHIFT, \ |
| .small_segregated_page_size = PAS_SMALL_PAGE_DEFAULT_SIZE, \ |
| .small_segregated_wasteage_handicap = PAS_SMALL_PAGE_HANDICAP, \ |
| .small_exclusive_segregated_logging_mode = pas_segregated_deallocation_size_oblivious_logging_mode, \ |
| .small_shared_segregated_logging_mode = pas_segregated_deallocation_no_logging_mode, \ |
| .small_exclusive_segregated_enable_empty_word_eligibility_optimization = false, \ |
| .small_shared_segregated_enable_empty_word_eligibility_optimization = false, \ |
| .small_segregated_use_reversed_current_word = PAS_ARM64, \ |
| .enable_view_cache = false, \ |
| .use_small_bitfit = true, \ |
| .small_bitfit_min_align_shift = ISO_MINALIGN_SHIFT, \ |
| .small_bitfit_page_size = PAS_SMALL_BITFIT_PAGE_DEFAULT_SIZE, \ |
| .medium_page_size = PAS_MEDIUM_PAGE_DEFAULT_SIZE, \ |
| .granule_size = PAS_GRANULE_DEFAULT_SIZE, \ |
| .use_medium_segregated = true, \ |
| .medium_segregated_min_align_shift = PAS_MIN_MEDIUM_ALIGN_SHIFT, \ |
| .medium_segregated_sharing_shift = PAS_MEDIUM_SHARING_SHIFT, \ |
| .medium_segregated_wasteage_handicap = PAS_MEDIUM_PAGE_HANDICAP, \ |
| .medium_exclusive_segregated_logging_mode = pas_segregated_deallocation_size_aware_logging_mode, \ |
| .medium_shared_segregated_logging_mode = pas_segregated_deallocation_no_logging_mode, \ |
| .use_medium_bitfit = true, \ |
| .medium_bitfit_min_align_shift = PAS_MIN_MEDIUM_ALIGN_SHIFT, \ |
| .use_marge_bitfit = true, \ |
| .marge_bitfit_min_align_shift = PAS_MIN_MARGE_ALIGN_SHIFT, \ |
| .marge_bitfit_page_size = PAS_MARGE_PAGE_DEFAULT_SIZE, \ |
| .pgm_enabled = false, \ |
.delegate_large_user_allocations = true)
| |
| PAS_API extern const pas_heap_config iso_heap_config; |
| |
| PAS_BASIC_HEAP_CONFIG_DECLARATIONS(iso, ISO); |
| |
| Note the use of `PAS_BASIC_HEAP_CONFIG`, which creates a config literal that automatically fills in a bunch of |
| heap config, segregated page config, and bitfit page config fields based on the arguments you pass to |
| `PAS_BASIC_HEAP_CONFIG`. The corresponding `.c` file looks like this: |
| |
| const pas_heap_config iso_heap_config = ISO_HEAP_CONFIG; |
| |
| PAS_BASIC_HEAP_CONFIG_DEFINITIONS( |
| iso, ISO, |
| .allocate_page_should_zero = false, |
| .intrinsic_view_cache_capacity = pas_heap_runtime_config_zero_view_cache_capacity); |
| |
| Note that this just configures whether new pages are zeroed and what the view cache capacity for the intrinsic |
heap is. The *intrinsic heap* is one of the four categories of heaps that the basic heap configuration template
| supports: |
| |
| - Intrinsic heaps are global singleton heaps, like the common heap for primitives. WebKit's fastMalloc bottoms |
| out in an intrinsic heap. |
| - Primitive heaps are heaps for primitive untyped values, but that aren't singletons. You can have many |
| primitive heaps. |
| - Typed heaps have a type, and the type has a fixed size and alignment. Typed heaps allow allocating single |
| instances of objects of that type or arrays of that type. |
| - Flex heaps are for objects with flexible array members. They pretend as if their type has size and alignment |
| equal to 1, but in practice they are used for objects that have some base size plus a variable-length array. |
| Note that libpas doesn't correctly manage flex memory in the large heap; we need a variant of the large heap |
| that knows that you cannot reuse flex memory between different sizes. |
| |
| The basic heap config template sets up some basic defaults for how heaps work: |
| |
| - It makes small segregated and small bitfit page configs put the page header at the beginning of the page and |
| it arranges to have those pages allocated out of megapages. |
| - It makes medium segregated, medium bitfit, and marge bitfit use page header tables. |
| - It sets up a way to find things like the page header tables from the enumerator. |
| - It sets up segregated shared page directories for each of the segregated page configs. |
| |
| The `bmalloc_heap_config` is an example of a configuration that uses the basic template. If we ever wanted to |
| put libpas into some other malloc library, we'd probably create a heap config for that library, and we would |
| probably base it on the basic heap config template (though we don't absolutely have to). |
| |
| ## JIT Heap Config |
| |
| The JIT heap config is for replacing the MetaAllocator as a way of doing executable memory allocation in WebKit. |
| It needs to satisfy two requirements of executable memory allocation: |
| |
| - The allocator cannot read or write the memory it manages, since that memory may have weird permissions at any |
| time. |
| - Clients of the executable allocator must be able to in-place shrink allocations. |
| |
| The large heap trivially supports both requirements. The bitfit heap trivially supports the second requirement, |
| and can be made to support the first requirement if we use page header tables for all kinds of memory, not just |
| medium or marge. So, the JIT heap config focuses on just using bitfit and large and it forces bitfit to use |
| page header tables even for the small bitfit page config. |
| |
| ## Security Considerations |
| |
| ### Probabilistic Guard Malloc |
| |
| Probabilistic Guard Malloc (PGM) is a new allocator designed to catch use after free attempts and out of bounds accesses. |
| It behaves similarly to AddressSanitizer (ASAN), but aims to have minimal runtime overhead. |
| |
| The design of PGM is quite simple. Each time an allocation is performed an additional guard page is added above and below the newly |
| allocated page(s). An allocation may span multiple pages. When a deallocation is performed, the page(s) allocated will be protected |
using mprotect to ensure that any use after free will trigger a crash. Virtual memory addresses are never reused, so we will never run
into a case where object 1 is freed, object 2 is allocated over the same address space, and a stale pointer to object 1 then accesses
memory that now belongs to object 2.
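
A sketch of the guard-page idea (this is not PGM's actual code; it only illustrates the guard pages and the
mprotect-on-free behavior described above, and omits the virtual/wasted memory accounting):

    #include <stddef.h>
    #include <sys/mman.h>

    static void* sketch_pgm_allocate(size_t size, size_t page_size)
    {
        size_t payload_size = (size + page_size - 1) & ~(page_size - 1);
        char* base = (char*)mmap(NULL, payload_size + 2 * page_size, PROT_READ | PROT_WRITE,
                                 MAP_PRIVATE | MAP_ANON, -1, 0);
        if (base == MAP_FAILED)
            return NULL;
        mprotect(base, page_size, PROT_NONE);                            /* lower guard page */
        mprotect(base + page_size + payload_size, page_size, PROT_NONE); /* upper guard page */
        return base + page_size;
    }

    static void sketch_pgm_deallocate(void* payload, size_t size, size_t page_size)
    {
        size_t payload_size = (size + page_size - 1) & ~(page_size - 1);
        /* Keep the mapping but make any use after free crash immediately; the
           virtual addresses are never handed out again. */
        mprotect(payload, payload_size, PROT_NONE);
    }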
| |
| PGM does add notable memory overhead. Each allocation, no matter the size, adds an additional 2 guard pages (8KB for X86_64 and 32KB |
| for ARM64). In addition, there may be free memory left over in the page(s) allocated for the user. This memory may not be used by any |
| other allocation. |
| |
| We added limits on virtual memory and wasted memory to help limit the memory impact on the overall system. Virtual memory for this |
| allocator is limited to 1GB. Wasted memory, which is the unused memory in the page(s) allocated by the user, is limited to 1MB. |
| These overall limits should ensure that the memory impact on the system is minimal, while helping to tackle the problems of catching |
| use after frees and out of bounds accesses. |
| |
| ## The Fast Paths |
| |
| All of the discussion in the previous sections is about the innards of libpas. But ultimately, clients want to |
| just call malloc-like and free-like functions to manage memory. Libpas provides fast path templates that actual |
| heap implementations reuse to provide malloc/free functions. The fast paths are: |
| |
| - `pas_try_allocate.h`, which is the single object allocation fast path for isoheaps. This function just takes a |
| heap and no size; it allocates one object of the size and alignment that the heap's type wants. |
| - `pas_try_allocate_array.h`, which is the array and aligned allocation fast path for isoheaps. You want to use |
| it with heaps that have a type, and that type has a size and alignment, and you want to allocate arrays of |
| that type or instances of that type with special alignment. |
| - `pas_try_allocate_primitive.h`, which is the primitive object allocation fast path for heaps that don't have |
| a type (i.e. they have the primitive type as their type -- the type says it has size and alignment equal to |
| 1). |
| - `pas_try_allocate_intrinsic.h`, which is the intrinsic heap allocation fast path. |
| - `pas_try_reallocate.h`, which provides variants of all of the allocators that reallocate memory. |
| - `pas_deallocate.h`, which provides the fast path for `free`. |
| - `pas_get_allocation_size.h`, which is the fast path for `malloc_size`. |
| |
One thing to remember when dealing with the fast paths is that they are engineered so that malloc/free functions
have no stack frame, no callee saves, and no need to save the LR/FP to the stack. To facilitate this,
| we have the fast path call an inline-only fast path, and if that fails, we call a "casual case". The inline-only |
| fast path makes no out-of-line function calls, since if it did, we'd need a stack frame. The only slow call (to |
| the casual case) is a tail call. For example: |
| |
| static PAS_ALWAYS_INLINE void* bmalloc_try_allocate_inline(size_t size) |
| { |
| pas_allocation_result result; |
| result = bmalloc_try_allocate_impl_inline_only(size, 1); |
| if (PAS_LIKELY(result.did_succeed)) |
| return (void*)result.begin; |
| return bmalloc_try_allocate_casual(size); |
| } |
| |
| The way that the `bmalloc_try_allocate_impl_inline_only` and `bmalloc_try_allocate_casual` functions are created |
| is with: |
| |
| PAS_CREATE_TRY_ALLOCATE_INTRINSIC( |
| bmalloc_try_allocate_impl, |
| BMALLOC_HEAP_CONFIG, |
| &bmalloc_intrinsic_runtime_config.base, |
| &bmalloc_allocator_counts, |
| pas_allocation_result_identity, |
| &bmalloc_common_primitive_heap, |
| &bmalloc_common_primitive_heap_support, |
| pas_intrinsic_heap_is_designated); |
| |
| All allocation fast paths require this kind of macro that creates a bunch of functions -- both the inline paths |
| and the out-of-line paths. The deallocation, reallocation, and other fast paths are simpler. For example, |
| deallocation is just: |
| |
| static PAS_ALWAYS_INLINE void bmalloc_deallocate_inline(void* ptr) |
| { |
| pas_deallocate(ptr, BMALLOC_HEAP_CONFIG); |
| } |
| |
| If you look at `pas_deallocate`, you'll see cleverness that ensures that the slow path call is a tail call, |
| similarly to how allocators work. However, for deallocation, I haven't had the need to make the slow call |
| explicit in the client side (the way that `bmalloc_try_allocate_inline` has to explicitly call the slow path). |
| |
| This concludes the discussion of libpas design. |
| |
| # Testing Libpas |
| |
| I've tried to write test cases for every behavior in libpas, to the point that you should feel comfortable |
| dropping a new libpas in WebKit (or wherever) if `test_pas` passes. |
| |
| `test_pas` is a white-box component, regression, and unit test suite. It's allowed to call any libpas function, |
| even internal functions, and sometimes functions that libpas exposes only for the test suite. |
| |
| Libpas testing errs on the side of being comprehensive even if this creates annoying situations. Many tests |
| assert detailed things about how many objects fit in a page, what an object's offset is in a page, and things |
| like that. This means that some behavior changes in libpas that aren't in any way wrong will set off errors in |
| the test suite. So, it's common to have to rebase tests when making libpas changes. |
| |
| The libpas test suite is written in C++ and uses its own test harness that forks for each test, so each test |
| runs in a totally pristine state. Also, the test suite can use `malloc` and `new` as much as it likes, since in |
| the test suite, libpas does not replace `malloc` and `free`. |
| |
| The most important libpas tests are the so-called *chaos* tests, which randomly create and destroy objects and |
| assert that the heap's state is still sane (like that no live objects overlap, that all live objects are |
| enumerable, etc). |
| |
| # Conclusion |
| |
| Libpas is a beast of a malloc, designed for speed, memory efficiency, and type safety. May whoever maintains it |
| find some joy in this insane codebase! |
| |