| # All About Libpas, Phil's Super Fast Malloc |
| |
| Filip Pizlo, Apple Inc., February 2022 |
| |
| # License |
| |
| Copyright (c) 2022 Apple Inc. All rights reserved. |
| |
| Redistribution and use in source and binary forms, with or without |
| modification, are permitted provided that the following conditions |
| are met: |
| |
| 1. Redistributions of source code must retain the above copyright |
| notice, this list of conditions and the following disclaimer. |
| 2. Redistributions in binary form must reproduce the above copyright |
| notice, this list of conditions and the following disclaimer in the |
| documentation and/or other materials provided with the distribution. |
| |
| THIS SOFTWARE IS PROVIDED BY APPLE INC. ``AS IS'' AND ANY |
| EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE |
| IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR |
| PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL APPLE INC. OR |
| CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, |
| EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, |
| PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR |
| PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY |
| OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT |
| (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE |
| OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. |
| |
| # Introduction |
| |
| This document describes how libpas works as of [247029@main](https://commits.webkit.org/247029@main), so a bit ahead of |
| where WebKit was as of [246842@main](https://commits.webkit.org/246842@main). Libpas is a fast and memory-efficient |
| memory allocation toolkit capable of supporting many heaps at once, engineered with the hopes that someday it'll be used |
| for comprehensive isoheaping of all malloc/new callsites in C/C++ programs. |
| |
| Since WebKit [186504@main](https://commits.webkit.org/186504@main), we've been steadily enabling libpas as a |
| replacement for WebKit's bmalloc and MetaAllocator. This has so far added up to a ~2% Speedometer2 speed-up and |
| a ~8% memory improvement (on multiple memory benchmarks). Half of the speed-up comes from replacing the MetaAllocator, |
| which was JavaScriptCore's old way of managing executable memory. Now, JSC uses libpas's jit_heap to manage executable |
| memory. The other half of the speed-up comes from replacing everything that bmalloc provided -- the fastMalloc API, the |
| Gigacage API, and the IsoHeap<> API. All of the memory improvement comes from replacing bmalloc (the MetaAllocator was |
| already fairly memory-efficient). |
| |
| This document is structured as follows. First I describe the goals of libpas; these are the things that a |
| malloc-like API created out of libpas should be able to expose as fast and memory-efficient functions. Then I |
| describe the coding style. Next I tell all about the design. Finally I talk about how libpas is tested. |
| |
| # Goals of Libpas |
| |
| Libpas tries to be: |
| |
| - Fast. The goal is to beat bmalloc performance on single-threaded code. Bmalloc was previously the fastest |
| known malloc for WebKit. |
| - Scalable. Libpas is meant to scale well on multi-core devices. |
| - Memory-efficient. The goal is to beat bmalloc memory usage across the board. Part of the strategy for memory |
| efficiency is consistent use of first-fit allocation. |
| - External metadata. Libpas never puts information about a free object inside that object. The metadata is |
| always elsewhere. So, there's no way for a use-after-free to corrupt libpas's understanding of memory. |
| - Multiple heap configurations. Not all programs want the same time-memory trade-off. Some malloc users have |
| very bizarre requirements, like what JavaScriptCore does with its ExecutableAllocator. The goal is to support |
| all kinds of special allocator needs simultaneously in one library. |
| - Boatloads of heaps. Libpas was written with the dream of obviating the need for ownership type systems or |
| other compiler approaches to fixing the type-safety of use-after-frees. This means that we need one heap per |
type, and we need to be 100% strict about it. So, libpas supports tons of heaps.
| - Type-awareness. Sometimes, malloc decisions require knowing what the type's size and alignment are, like when |
| deciding how to split and coalesce memory. Libpas is designed to avoid type errors arising from the malloc's |
| "skewed" reuse of memory. |
| - Common free. Users of libpas isoheaps don't have to know the heap of an object when they free it. All objects |
| should funnel into the same free function. One kind of exception to this requirement is stuff like |
| ExecutableAllocator, which needs a malloc, but is fine with not calling a common free function. |
| - Cages. WebKit uses virtual memory reservations called *cages*, in which case WebKit allocates the virtual |
| memory and the malloc has to associate that memory with some heap. Libpas supports multiple kinds of cages. |
| |
| # Libpas Style |
| |
| Libpas is written in C. Ultimately, I chose C because I felt that the language provides better support for |
| extremely low-level code: |
| |
| - C++ is usually my favorite, because it makes code easier to write, but for libpas, I wanted something easier |
| to read. It's easier to read C when auditing for subtle bugs, because there's nothing hidden. C doesn't have |
| stuff like destructor invocations or operator overloading, which result in surprising effectfulness in |
| otherwise innocent-looking code. Memory management code like libpas has to be read a lot, so C is better. |
| - C makes casting between pointers and integers very simple with its style of cast operator. It feels weird to |
| use the C cast operator in C++, so when I have to do a lot of uintptr_t'ing, I prefer C. |
| |
C lets you do most of what C++ can if you rely on `always_inline`. This didn't use to be the case, but modern C
| compilers will meat-grind the code with repeated application of the following things: |
| |
| - Inlining any `always_inline` call except if it's recursive or the function uses some very weird features that |
| libpas doesn't use (like goto pointer). |
| - Copy-propagating the values from the callsite into the function that uses the value. |
| |
| Consequently, passing a function pointer (or struct of function pointers), where the pointer points to an |
`always_inline` function and the callee is `always_inline`, results in specialization akin to template
| monomorphization. This works to any depth; the compiler won't be satisfied until there are no more |
| `always_inline` function calls. This fortuitous development in compilers allowed me to write very nice template |
| code in C. Libpas achieves templates in C using config structs that contain function pointers -- sometimes to |
| `always_inline` functions (when we want specialization and inlining) and sometimes to out-of-line functions |
| (when we want specialization but not inlining). Additionally, the C template style allows us to have true |
| polymorphic functions. Lots of libpas slow paths are huge and not at all hot. We don't want that code |
| specialized for every config. Luckily, this works just fine in C templates -- those polymorphic functions just |
| pass around a pointer to the config they are using, and dynamically load and call things in that config, almost |
| exactly the same way that the specialized code would do. This saves a lot of code size versus C++ templates. |
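
As a minimal sketch of that idiom (all names here are invented for illustration; none of this is real libpas code),
a config struct can carry a function pointer that the compiler constant-propagates and inlines whenever the config
is a compile-time constant at an `always_inline` callsite:

    /* Sketch of the "C template" idiom described above. */
    #include <stddef.h>
    #include <stdio.h>

    typedef struct {
        /* Points to an always_inline function when we want specialization and inlining,
           or to an out-of-line function when we only want specialization. */
        size_t (*object_size)(size_t requested_size);
    } example_config;

    static inline __attribute__((always_inline)) size_t round_up_to_16(size_t requested_size)
    {
        return (requested_size + 15) & ~(size_t)15;
    }

    /* The "template": when config is a compile-time constant at an always_inline callsite,
       the indirect call becomes a direct call and then gets inlined, much like C++
       template monomorphization. */
    static inline __attribute__((always_inline)) size_t allocate_size(example_config config,
                                                                      size_t requested_size)
    {
        return config.object_size(requested_size);
    }

    #define EXAMPLE_CONFIG ((example_config){ .object_size = round_up_to_16 })

    int main(void)
    {
        /* Compiles down to a constant; no function pointer call survives. */
        printf("%zu\n", allocate_size(EXAMPLE_CONFIG, 100));
        return 0;
    }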
| |
| Most of libpas is written in an object-oriented style. Structs are used to create either by-value objects or |
heap-allocated objects. It's useful to think of these as classes, but in a loose way, since there are many ways to
| do classes in C, and libpas uses whatever techniques are best on a per-class basis. But heap allocated objects |
| have a clear convention: for a class named `foo`, we would call the struct `pas_foo`, and for a method `bar` on |
| `foo`, we would call the function `pas_foo_bar` and have the first parameter be `pas_foo*`. The function that |
| creates instances of `foo` is called `pas_foo_create` (or `pas_foo_create_blah_blah` in case of overloading) and |
| returns a `pas_foo*`. The function that destroys `foo` objects is called `pas_foo_destroy` and takes a |
| `pas_foo*`. |
| |
| Libpas classes are usually implemented in files called `pas_foo.h` (the header that defines the struct and a |
| subset of functions), `pas_foo_inlines.h` (the header that defines inline functions of `foo` that require |
| calling functions declared in headers that `pas_foo.h` can't include), and `pas_foo.c` (the implementations of |
| `foo` functions that can be out-of-line). |
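
For a hypothetical class `foo` (there is no real `pas_foo` in libpas), the convention sketched above looks like
this:

    /* pas_foo.h -- hypothetical illustration of the naming convention. */
    struct pas_foo;
    typedef struct pas_foo pas_foo;

    struct pas_foo {
        unsigned count;
    };

    /* Creation returns a pas_foo*; destruction takes one. */
    pas_foo* pas_foo_create(void);
    void pas_foo_destroy(pas_foo* foo);

    /* A "method" bar on foo takes pas_foo* as its first parameter. */
    void pas_foo_bar(pas_foo* foo);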
| |
| Some libpas "classes" are singletons. The standard way of implementing a singleton in libpas is that there is |
| really no struct, only global variables and functions that are declared in the header. See `pas_page_malloc` or |
| `pas_system_heap` for examples of singletons. |
| |
| Not everything in libpas is a class. In cases where a bunch of not-class-like things can be grouped together in |
| a way that makes sense, we usually do something like a singleton. In cases where a function can't easily be |
| grouped together with some class, even a singleton, we name the file it's in after the function. There are lots |
| of examples of this, like `pas_deallocate` or `pas_get_page_base`. Sometimes this gets fun, like |
| `pas_get_page_base_and_kind_for_small_other_in_fast_megapage.h`. |
| |
| Finally, libpas avoids abbreviations even more so than WebKit usually does. Functions that have a quirky |
| meaning typically have a long name that tells the story. The point is to make it easy to appreciate the subtlety |
| of the algorithm when reading the code. This is the kind of code where complex situations should look complex |
| at any abstraction level. |
| |
| # Design of Libpas |
| |
| Libpas is organized into roughly eleven areas: |
| |
| 1. Heap configurations. This is the way that we tell libpas how to organize a heap. Heap configurations can |
| control a lot. They can change obvious things like the minalign and page size, but also more crazy things, |
| like how to find a page header given a page and vice-versa. |
| 2. The large heaps. This is a first-fit heap based on arrays, cartesian trees, and hashtables. The large |
| heap has excellent type safety support and can be safely (though not efficiently) used for small objects. |
| 3. Metacircularity. Libpas uses malloc-like APIs internally for managing its state. These are provided by the |
| so-called bootstrap, utility, and immortal heaps. |
| 4. The segregated heaps and TLCs (thread-local caches). Libpas has a super fast simple segregated storage slab |
| allocator. It supports type safety and is the most commonly used kind of heap. |
| 5. The bitfit heaps. This is a fast and memory-efficient type-unsafe heap based on slabs and bitmaps. |
| 6. The scavenger. Libpas performs a bunch of periodic bookkeeping tasks in a scavenger thread. This includes, |
| but is not limited to, returning memory to the OS. |
| 7. Megapages and page header tables. Libpas has multiple clever tricks for rapidly identifying which kind of |
| heap an object belongs to. This includes an arithmetic hack called megapages and some lock-free hashtables. |
| 8. The enumerator. Libpas supports malloc heap enumeration APIs. |
| 9. The basic configuration template, used to create the `bmalloc_heap` API that is used as a replacement for |
| all of bmalloc's functionality. |
| 10. The JIT heap config. |
| 11. The fast paths. The various heaps, TLCs, megapages and page header tables are glued together by fast paths |
| provided for allocation, deallocation, and various utility functions. |
| |
| ## Heap Configurations |
| |
| The `pas_heap_config` struct defines all of the configurable behaviors of a libpas heap. This includes things |
| like how the heap gets its memory, what size classes use segregated, bitfit, or large allocators, and a bunch |
| of other things. |
| |
| Heap configs are passed by-value to functions that are meant to be specialized and inlined. To support this, |
| the convention for defining a heap config is that you create a macro (like `BMALLOC_HEAP_CONFIG`) that gives a |
| heap config literal expression. So, a call like `pas_get_allocation_size(ptr, BMALLOC_HEAP_CONFIG)` will give |
| you an optimized fast path for getting the allocation size of objects in bmalloc. This works because such fast |
| paths are `always_inline`. |
| |
| Heap configs are passed by-pointer to functions that are not meant to be specialized. To support this, all |
| heap configs also have a global variable like `bmalloc_heap_config`, so we can do things like |
| `pas_large_heap_try_deallocate(base, &bmalloc_heap_config)`. |
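
The convention looks roughly like the following self-contained sketch, using an invented `toy_heap_config`; the
real `BMALLOC_HEAP_CONFIG` macro and `bmalloc_heap_config` global initialize a much larger `pas_heap_config`:

    #include <stddef.h>

    typedef struct {
        size_t minalign;
        size_t page_size;
    } toy_heap_config;

    /* Header side: a macro that expands to a config literal, so that by-value
       (specialized, inlined) callsites see a compile-time constant... */
    #define TOY_HEAP_CONFIG ((toy_heap_config){ .minalign = 16, .page_size = 16384 })

    /* ...plus a global for by-pointer (unspecialized) callsites; in libpas the global
       is initialized from the same macro. */
    const toy_heap_config toy_heap_config_global = { .minalign = 16, .page_size = 16384 };

    /* Specialized fast path: config passed by value and always_inline. */
    static inline __attribute__((always_inline)) size_t toy_round_size(toy_heap_config config, size_t size)
    {
        return (size + config.minalign - 1) & ~(config.minalign - 1);
    }

    /* Unspecialized slow path: config passed by pointer and loaded dynamically. */
    static size_t toy_round_size_unspecialized(const toy_heap_config* config, size_t size)
    {
        return (size + config->minalign - 1) & ~(config->minalign - 1);
    }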
| |
| Heap configs can have up to two segregated page configs (`config.small_segregated_config` and |
| `config.medium_segregated_config`) and up to three bitfit page configs (`config.small_bitfit_config`, |
| `config.medium_bitfit_config`, and `config.marge_bitfit_config`). Any of the page configs can be disabled, |
| though weird things might happen if the smallest ones are disabled (rather than disabling the bigger ones). |
| Page configs (`pas_segregated_page_config`, `pas_bitfit_page_config`, and the common supertype, |
| `pas_page_base_config`) get used in much the same way as heap configs -- either by-value for specialized and |
| inlined functions or by-pointer for unspecialized functions. |
| |
| Heap and page configs also support specialized-but-not-inlined functions. These are supported using additional |
| function pointers in those configs that are filled in using macros -- so you don't need to fill them in |
| explicitly when creating your own config, like `BMALLOC_HEAP_CONFIG` or `JIT_HEAP_CONFIG`. The macros fill them |
| in to point at never_inline functions that call some specialized and inlined function with the config passed as |
| a constant. This means for example that: |
| |
    BMALLOC_HEAP_CONFIG.specialized_local_allocator_try_allocate_small_segregated_slow(...);
| |
| Is an out-of-line direct function call to the specialization of |
| `pas_local_allocator_try_allocate_small_segregated_slow`. And this would be a virtual call to the same |
| function: |
| |
    const pas_heap_config* config = ...;
    config->specialized_local_allocator_try_allocate_small_segregated_slow(...);
| |
| Note that in many cases where you have a `pas_heap_config`, you are in specialized code and the heap config is |
a known constant at compile time, so then:
| |
    config.specialized_local_allocator_try_allocate_small_segregated_slow(...);
| |
| Is an out-of-line direct function call. |
| |
| ## The Large Heaps |
| |
| Libpas's large heaps serve multiple purposes: |
| |
| - Everything is bootstrapped on large heaps. When segregated and bitfit heaps allocate memory, they do so from |
| some large heap. |
| - Segregated and bitfit heaps have object size ceilings in the tens or hundreds of kilobytes. So, objects that |
| are too large for the other heaps get allocated from large heaps. |
| |
Large heaps are broken into three parts:
| |
| 1. The large free heap. In libpas jargon, a *free heap* is a heap that requires that deallocation passes the |
| object size, requires that the freed object size matches the allocated object size for that object, and makes |
| no guarantees about what kind of mess you'll get yourself into if you fail to obey that rule. |
| 2. The large map. This maps object pointer to size and heap. |
| 3. The large heap. This is an abstraction over both (1) and (2). |
| |
| Large free heaps just maintain a free-list; they know nothing about allocated objects. But combined with the |
| large map, the large heaps provide a user-friendly deallocation API: you just need the object pointer, and the |
| large map figures out the rest, including identifying which free heap the object should be deallocated into, and |
| the size to pass to that free heap. |
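
Conceptually, the deallocation path looks like this sketch (the names are invented; the real code does this under
the heap lock with the actual large map and large free heap structures):

    #include <stdbool.h>
    #include <stddef.h>

    typedef struct example_heap example_heap;

    typedef struct {
        example_heap* heap; /* which heap (and thus which free heap) owns the object */
        size_t size;        /* the allocated size, so the caller does not have to pass it */
    } large_entry;

    bool large_map_take_entry(void* ptr, large_entry* entry);
    void large_free_heap_deallocate(example_heap* heap, void* ptr, size_t size);

    static bool large_heap_deallocate(void* ptr)
    {
        large_entry entry;
        if (!large_map_take_entry(ptr, &entry))
            return false; /* not an object that the large map knows about */
        large_free_heap_deallocate(entry.heap, ptr, entry.size);
        return true;
    }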
| |
| Large heaps operate under a single global lock, called the *heap lock*. Most libpas heaps use fine-grained |
| locking or avoid locking entirely. But for the large heap, libpas currently just uses one lock. |
| |
| ### Large Free Heap |
| |
| Large free heaps are built out of a generic algorithm that doesn't know how to represent the free-list and gets |
| instantiated with either of two free-list representations, *simple* and *fast*. The simple large free heap uses |
| an unordered array of coalesced free object descriptions. The fast large free heap uses a cartesian tree of |
| coalesced free object descriptions. |
| |
| A free object description is represented by the `pas_large_free` object in libpas; let's just call it a *large |
| free* for brevity. Large frees can tell you the beginning and end of a free chunk of memory. They can also tell |
| if the memory is already known to be zero and what the *type skew* of the free memory is. The large heap can be |
| used to manage arrays of some type that is either larger than the heap's minimum alignment, or that is smaller |
| than and not a divisor of the alignment. Especially when this is combined with `memalign`, the free heap will |
| have to track free memory that isn't type-aligned. Just consider a type of size 1000 that is allocated with |
| alignment 16384. The rules of memalign say that the size must be 16384 in that case. Assuming that the free heap |
| had 32768 contiguous bytes of free memory to begin with, it will now have 16384 bytes that starts with a type |
| skew of 384 bytes. The type skew, or `offset_in_type` as libpas calls it, is the offset of the beginning of the |
| large free inside the heap's type. In extremely complex cases, this means that finding where the first valid |
| object address is inside a large free for some type and alignment requires computing the offset least common |
multiple (see `pas_coalign`), which relies on the right Bézout coefficient of the extended GCD of the type size
| and alignment (see `pas_extended_gcd`). |
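
To make that arithmetic concrete, here is a brute-force version of what `pas_coalign` computes in closed form: the
first address in a free range that is simultaneously at type offset zero and aligned as requested. This is an
illustrative sketch, not the libpas algorithm, and it assumes a compatible answer exists.

    #include <stdint.h>
    #include <stdio.h>

    /* Brute-force search for the first address at or above `begin` that is both at type
       offset zero and `alignment`-aligned. pas_coalign gets the same answer in closed
       form via the extended GCD. */
    static uintptr_t first_valid_object_address(
        uintptr_t begin, uintptr_t offset_in_type, uintptr_t type_size, uintptr_t alignment)
    {
        uintptr_t address = begin;
        uintptr_t offset = offset_in_type;
        for (;;) {
            if (!offset && !(address & (alignment - 1)))
                return address;
            address++;
            offset++;
            if (offset == type_size)
                offset = 0;
        }
    }

    int main(void)
    {
        /* The example from the text: type size 1000, a free chunk that begins at a
           16384-aligned address with a type skew of 384 bytes, and 16-byte alignment. */
        printf("%lu\n", (unsigned long)first_valid_object_address(16384, 384, 1000, 16));
        return 0;
    }

This prints 18000: 616 bytes to reach type offset zero, then one more whole object to also satisfy the 16-byte
alignment, so the first valid object address is 1616 bytes into the free chunk.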
| |
| Large frees support an API for coalescing (merging as libpas calls it) and splitting. The generic large free |
| heap handles searching through large frees to find the first one that matches an allocation request for some |
| size and alignment. It also handles coalescing freed memory back into the heap, by searching for adjacent free |
memory. The searches are through a struct of function pointers that may be implemented either inefficiently (like
| the simple large free heap's O(n) search through an unordered array) or efficiently (like the O(1)-ish or |
| O(log n)-ish operations on the cartesian tree in the fast large free heap). The generic algorithm uses the C |
| generic idiom so there are no actual function pointer calls at runtime. |
| |
| Large free heaps allow you to give them callbacks for allocating and deallocating memory. The allocation |
| callback will get called if you ask a large free heap for memory and it doesn't have it. That allocation |
| callback could get the memory from the OS, or it could get it from some other heap. The deallocation callback |
| is for those cases where the large free heap called the allocation callback and then decided it wanted to give |
| some fraction of that memory back. Both callbacks are optional (can be NULL), though the case of a NULL |
| allocation callback and non-NULL deallocation callback is not useful since the deallocation callback only gets |
| called on the path where we had an allocation callback. |
| |
| Note that large free heaps do not do anything to decommit their free memory. All decommit of memory in large |
| free heaps is accomplished by the *large sharing pool*, which is part of the scavenger. |
| |
| ### Large Map |
| |
| The large map is a hashtable that maps object addresses to *large entries*, which contain the size of the object |
| and the heap it belongs to. The large map has a fairly complex hashtable algorithm because of my past attempts |
| at making the large heap at least somewhat efficient even for small objects. But it's conceptually a simple part |
| of the overall algorithm. It's also legal to ask the large map about objects it doesn't know about, in which |
| case, like a normal hashtable, it will just tell you that it doesn't know about your object. Combined with the |
| way that segregated and bitfit heaps use megapage tables and page header tables, this means that libpas can do |
| fall-through to another malloc for objects that libpas doesn't manage. |
| |
| Note that it might be OK to remove the small object optimizations in the large map. On the other hand, they are |
| reliable, and they aren't known to increase the cost of the algorithm. Having that capability means that as part |
of tuning the algorithm, it's safer than it would otherwise be to try putting some small objects into the
| large heap to avoid allocating the data structures required for operating a segregated or bitfit heap. |
| |
| ### The Large Heap |
| |
| The large free heap and large map are combined into a high-level API with the `pas_large_heap`. In terms of |
| state, this is just a |
| `pas_large_free_heap` plus some data to help with small-object optimizations in the large map. The functions |
| of the large heap do a lot of additional work: |
| |
| - They give the free heap an allocator for getting new memory. The large heap routes memory allocation |
| requests to the heap config's allocation callback. |
| - They ensure that each free heap allocation ends up in the large map. |
| - They implement deallocation by removing something from the large map and then deallocating it into the free |
| heap. |
| - They provide integration with the scavenger's large sharing pool so that free memory can be decommitted. |
| |
| The large heap is always used as a member of the `pas_heap` object. It's useful to think of `pas_large_heap` as |
| never being a distinct object; it's more of a way of compartmentalizing `pas_heap`. The heap object also |
| contains a segregated heap and some other stuff. |
| |
| ## Metacircularity |
| |
| I'm used to programming with dynamically allocated objects. This lets me build arrays, trees, hashtables, |
| look-up tables, and all kinds of lock-free data structures. So, I wanted libpas's internals to be able to |
| allocate objects just like any other kind of algorithm would do. But libpas is engineered so that it can |
| be a "bottom of the world" malloc -- where it is the implementation of `malloc` and `free` and cannot rely on |
| any memory allocation primitives other than what the kernel provides. So, libpas uses its own allocation |
| primitives for its own objects that it uses to implement those primitives. This is bootstrapped as follows: |
| |
| - The *bootstrap heap* is a simple large free heap. A simple large free heap needs to be able to allocate |
| exactly one variable-length array of large frees. The bootstrap heap has hacks to allow itself to allocate |
| that array out of itself. This trick then gives us a complete malloc implementation for internal use by |
| libpas, albeit one that is quite slow, can only be used under the heap lock, and requires us to know the |
| object's size when freeing it. All other simple large free heaps allocate their free lists from the bootstrap |
| heap. The bootstrap heap is the only heap in libpas that asks the OS for memory. All other heaps ask either |
| the bootstrap heap for memory, or they ask one of the other heaps. |
| - The *compact reservation* is 128MB of memory that libpas uses for objects that can be pointed at with 24-bit |
| (3 byte) pointers assuming 8-byte alignment. Libpas needs to manage a lot of heaps, and that requires a lot |
| of internal meta-data, and having compact pointers reduces the cost of doing this. The compact reservation is |
| allocated from the bootstrap heap. |
| - The *immortal heap* is a heap that bump-allocates out of the compact reservation. It's intended for small |
| objects that are immortal. |
| - The *compact bootstrap heap* is like the bootstrap heap, except that it allocates its memory from the compact |
| reservation, and allocates its free list from the bootstrap heap rather than itself. |
| - The *compact large utility free heap* is a fast large free heap that supports decommitting free memory (see |
| the scavenger section) and allocates its memory from the compact bootstrap heap. |
| - The *utility heap* is a segregated heap configured to be as simple as possible (no thread-local caches for |
| example) and can only be used while holding the heap lock. It only supports objects up to some size |
| (`PAS_UTILITY_LOOKUP_SIZE_UPPER_BOUND`), supports decommitting free memory, and gets its memory from the |
| compact bootstrap heap. One example of how the utility heap gets used is the nodes in the cartesian trees |
| used to implement fast large free heaps. So, for example, the compact large utility free heap relies on the |
| utility heap. |
| - The *large utility free heap* is a fast large free heap that supports decommitting free memory and allocates |
| its memory from the bootstrap heap. |
| |
| Note how the heaps pull memory from one another. Generally, a heap will not return memory to the heap it got |
| memory from except to "undo" part of an allocation it had just done. So, this arrangement of |
| who-pulls-memory-from-who is designed for type safety, memory efficiency, and elegantly supporting weird |
| alignments: |
| |
| - Libpas uses decommit rather than unmapping free memory because this ensures that we don't ever change the type |
| of memory after that memory gets its type for the first time. |
| - The lower-level heaps (like bootstrap and compact bootstrap) do not support decommit. So, if a higher-level |
| heap that does support decommit ever returned memory to the lower-level heap, then the memory would never get |
| decommitted. |
| - Page allocation APIs don't let us easily allocate with alignment greater than page size. Libpas does this by |
| over-allocating (allocating size + alignment and then searching for the first aligned start within that larger |
| reservation). This is all hidden inside the bootstrap heap; all other heaps that want memory on some weird |
| alignment just ask some other heap for memory (often the bootstrap heap) and that heap, or ultimately the |
| bootstrap heap, figure out what that means in terms of system calls. |
| |
| One missing piece to the metacircularity is having a *fast utility heap* that uses thread-local caches. There is |
| currently maybe one utility heap callsite that only grabs the heap lock just because it wants to allocate in the |
| utility heap. There's a possibility of a small speed-up if any callsite like that used a fast utility heap |
| instead, and then no locking would be required. It's not clear how easy that would be; it's possible that some |
| bad hacks would be required to allow code that uses TLCs to call into a heap that then also uses TLCs. |
| |
| ## Segregated Heaps and TLCs (Thread Local Caches) |
| |
| Libpas's great performance is mostly due to the segregated heaps and how they leverage thread-local caches. TLCs |
provide a cache of global memory. In the best case, this cache prevents threads from doing any
| synchronization during allocation and deallocation. Even when they do have to do some synchronization, TLCs |
| make it unlikely that one thread will ever want to acquire a lock held by another thread. The strategy is |
| three-fold: |
| |
| - The TLC has a per-size-class allocator that caches some amount of that size class's memory. This means that |
| the allocation fast path doesn't have to do any locking or atomic instructions except when its cache runs out. |
| Then, it will have to do some synchronization -- in libpas's case, fine-grained locking and some lock-free |
| algorithms -- to get more memory. The amount of memory each allocator can cache is bounded (usually 16KB) and |
| allocators can only hold onto memory for about a second without using it before it gets returned (see the |
| scavenger section). |
| - The TLC has a deallocation log. The fast path of deallocating a segregated heap object is just pushing it onto |
| the deallocation log without any locking. The slow path is to walk the log and free all of the objects. The |
| libpas deallocation log flush algorithm cleverly avoids doing per-object locking; in the best case it will |
| acquire a couple of locks before flushing and release them after flushing the whole log. |
| - When the deallocation log flush frees memory, it tries to first make that memory available exclusively to the |
| thread that freed it by putting the free memory into the *local view cache* for that size class in that |
| thread. Memory moves from the local view caches into the global heap only if the view cache is full (has about |
| 1.6MB of memory in it) or if it hasn't been used in about a second (see the scavenger section). |
| |
| This section lays out the details of how this works. *Segregated heaps* are organized into *segregated |
| directories*. Each segregated directory is an array of page *views*. Each page view may or may not have a *page* |
| associated with it. A view can be *exclusive* (the view has the page to itself), *partial* (it's a view into a |
| page shared by others), or *shared* (it represents a page shared by many partial views). Pages have a *page |
| boundary* (the address of the beginning of the page) and a *page header* (the object describing the page, which |
| may or may not actually be inside the page). Pages maintain *alloc bits* to tell which objects are live. |
| Allocation uses the heap's lookup tables to find the right *allocator index* in the TLC, which yields a *local |
| allocator*; that allocator usually has a cache of memory to allocate from. When it doesn't, it first tries to |
| pop a view from the local view cache, and if that fails, it uses the *find first eligible* algorithm on the |
corresponding directory to find an eligible view. Once the allocator has a view, it ensures that the view has a
| page, and then scans the alloc bits to create a cache of free memory in that page. Deallocation fast paths just |
| push the object onto the deallocation log. When the log is full, the TLC flushes its log while trying to |
| amortize lock acquisition. Freeing an object in a page means clearing the corresponding alloc bit. Once enough |
| alloc bits are clear, either the page's view ends up on the view cache, or the directory is notified to mark the |
| page either eligible or empty. The sections that follow go into each of these concepts in detail. |
| |
| ### The Thread Local Cache |
| |
| Each thread has zero or one `pas_thread_local_cache`s. Libpas provides slow paths for allocating and |
| deallocating without a TLC in cases where TLC creation is forbidden (like when a thread is shutting down). But |
| usually, allocation and deallocation create a TLC if the thread doesn't already have one. TLCs are structured |
| as follows: |
| |
| - TLCs contain a fixed-size deallocation log along with an index that tells how much of the log is full. |
| Deallocation pushes onto that log. |
| - TLCs contain a variable-length array of *allocators*, which are really either *local allocators* or *local |
| view caches*. Allocators are variable length. Clients access allocators using an allocator index, which they |
| usually get from the directory that the allocator corresponds to. |
| - TLCs can get reallocated and deallocated, but they always point to a `pas_thread_local_cache_node`, which is |
| an immortal and compact descriptor of a TLC. TLC nodes are part of a global linked list. Each TLC node may or |
| may not have a live TLC associated with it. TLCs cannot be created or destroyed unless the heap lock is held, |
| so if you hold the heap lock, you can iterate the TLC node linked list to find all TLCs. |
| - The layout of allocator indices in a TLC is controlled by both the directories and the TLC layout data |
| structure (`pas_thread_local_cache_layout`). This is a global data structure that can tell us about all of the |
| allocators in a TLC. When holding the heap lock, it's possible to loop over the TLC layout linked list to |
| find what all of the valid allocator indices are and to introspect what is at those indices. |
| |
| Thread local caches tend to get large because both the local allocators and local view caches have inline |
| arrays. The local allocator has an array of bits that tell the allocator where the free objects are. The local |
| view cache has an array of view pointers (up to 1.6MB / 16KB = 100 entries, each using a 24-bit pointer). When |
used in single-heap applications, these overheads don't matter -- they end up accounting for less than
| 10<sup>-5</sup> of the overall process footprint (not just in WebKit but when I experimented with libpas in |
| daemons). But when used for many heaps, these overheads are substantial. Given thousands or tens of thousands |
| of heaps, TLCs account for as much as 1% of memory. So, TLCs support partial decommit. Those pages that only |
| have allocators that are inactive get decommitted. Note that TLC decommit has landed in the libpas.git repo |
| as of [247029@main](https://commits.webkit.org/247029@main), but hasn't yet been merged into WebKit. |
| |
| The TLC deallocation log flush algorithm is designed to achieve two performance optimizations: |
| |
| - It achieves temporal locality of accesses to page headers. If freeing each object meant flipping a bit in the |
| page header, then many of those operations would miss cache, since the page header is not accessed by normal |
| program operation -- it's only accessed during some allocation slow paths and when deallocating. But because |
| deallocation accesses page headers only during log flush and log flush touches about 1000 objects, it's likely |
| that the flush will touch the same page header cache lines multiple times. |
| - It reduces the average number of lock acquisitions needed to free an object. Each page uses its own lock to |
| protect its page header, and the page header's `alloc_bits`. But deallocation log flush will do no locking or |
| unlocking if the object at index *i* and the object at index *i+1* use the same lock. Pages can dynamically |
select which lock they use (thanks to `page->lock_ptr`), and they select it so that the pages a given thread
allocates out of tend to share the same lock. So, deallocation log flush usually just acquires a lock at the start
of the 1000 objects and releases it when it finishes the 1000 objects, as in the sketch below.
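
A simplified sketch of that flush loop (all names are invented; the real flush also re-checks `page->lock_ptr`
after acquiring, as described in the page header section below, and deals with view caches and different page
kinds):

    #include <stddef.h>

    typedef struct example_lock example_lock;
    typedef struct { example_lock* lock_ptr; } example_page;

    void example_lock_lock(example_lock* lock);
    void example_lock_unlock(example_lock* lock);
    example_page* page_for_object(void* object);
    void free_object_in_page(example_page* page, void* object);

    static void flush_deallocation_log(void** log, size_t count)
    {
        example_lock* held_lock = NULL;
        for (size_t index = 0; index < count; ++index) {
            void* object = log[index];
            example_page* page = page_for_object(object);
            /* Consecutive objects usually share a lock, so this branch almost always
               skips the unlock/lock pair. */
            if (page->lock_ptr != held_lock) {
                if (held_lock)
                    example_lock_unlock(held_lock);
                held_lock = page->lock_ptr;
                example_lock_lock(held_lock);
            }
            free_object_in_page(page, object);
        }
        if (held_lock)
            example_lock_unlock(held_lock);
    }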
| |
| ### The Segregated Heap |
| |
| The `pas_segregated_heap` object is the part of `pas_heap` that facilitates segregated heap and bitfit heap |
| allocation. The details of the bitfit heap are discussed in a later section. Segregated heaps can be created |
| separately from a `pas_heap`, but segregated heaps are almost always part of a `pas_heap` (and it would be easy |
| to refactor libpas to make it so that segregated heaps are always part of heaps). |
| |
| Segregated heaps use *size directories* to track actual memory. Most of the exciting action of allocation and |
| deallocation happens in directories. Each directory corresponds to some size class. Segregated heaps make it |
| easy to find the directory for a particular size. They also make it easy to iterate over all the directories. |
| Also, segregated heaps make it easy to find the allocator index for a particular size (using a lookup table that |
| is essentially a cache of what you would get if you asked for the directory using the size-to-directory lookup |
| tables and then asked the directory for the allocator index). The most exciting part of the segregated heap |
| algorithm is `pas_segregated_heap_ensure_size_directory_for_size`, which decides what to do about allocating a |
| size it hadn't encountered before. This algorithm will either return an existing directory, create a new one, or |
| even retire an existing one. It handles all of the issues related to type size, type alignment, and the |
| alignment argument to the current malloc call. |
| |
| The lookup tables maintained by segregated heaps have some interesting properties: |
| |
| - They can be decommitted and rematerialized. This is a useful space saving when having lots of isoheaps. The |
| rematerialization happens because a heap also maintains a linked list of directories, and that linked list |
| never goes away. Each directory in the linked list knows what its representation would have been in the lookup |
| tables. |
| - They are optional. Some heaps can be configured to have a preferred size, called the *basic size class*. This |
| is very common for isoheaps, which may only ever allocate a single size. For isoheaps based on type, the |
| basic size class is just that type's size. Other isoheaps dynamically infer a preferred size based on the |
| first allocation. When a heap only has the basic size class, it will have no lookup tables. |
- There are separate lookup tables for smaller sizes (not related to the small_segregated_config -- the
threshold is set separately); these are just arrays whose index is the size divided by the heap's minalign, rounded
| up. These may be populated under heap lock while they are accessed without any locks. So, accesses to them |
| have some guards against races. |
| - There are separate lookup tables for medium sizes (anything above the threshold for small lookup tables). |
| The medium table is a sorted array that the allocator binary-searches. It may be mutated, decommitted, |
| rematerialized, or reallocated under heap lock. The algorithm defends itself against this with a bunch of |
compiler fences and a mutation count check. Mutating the table means incrementing-to-mutate before making changes
and incrementing-to-finish after making changes. So, the algorithm for lock-free lookup checks the mutation count
before and after, and makes sure they are the same and neither indicates that we are mutating. This involves
clever use of dependency threading (like the ARM64 eor-self trick) to make sure that the mutation count reads
really happen before and after the binary search. A sketch of the reader side follows this list.
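
The reader side of that check is roughly the following seqlock-style sketch (invented names; the real code needs
the fences and dependency threading described above to keep the loads ordered):

    #include <stdbool.h>
    #include <stddef.h>
    #include <stdint.h>

    /* Writers (under the heap lock) increment mutation_count before and after mutating,
       so an odd count means a mutation is in progress. */
    typedef struct {
        uint64_t mutation_count;
        /* ... the sorted array of medium size-class entries ... */
    } medium_table;

    bool medium_table_search(medium_table* table, size_t size, size_t* result);

    static bool medium_table_try_lookup(medium_table* table, size_t size, size_t* result)
    {
        uint64_t count_before = __atomic_load_n(&table->mutation_count, __ATOMIC_ACQUIRE);
        if (count_before & 1)
            return false; /* mutation in progress; fall back to the heap-lock slow path */

        bool found = medium_table_search(table, size, result);

        uint64_t count_after = __atomic_load_n(&table->mutation_count, __ATOMIC_ACQUIRE);
        if (count_before != count_after)
            return false; /* the table changed underneath us; retry or take the slow path */
        return found;
    }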
| |
| ### Segregated Directories |
| |
| Much of the action of managing memory in a segregated heap happens in the segregated directories. There are two |
| kinds: |
| |
| - Segregated size directories, which track the views belonging to some size class in some heap. These may be |
| *exclusive views*, which own a page, or *partial views*, which own part of a *shared page*. Partial views |
range in size from just below 512 bytes to possibly a whole page (in rare cases).
- Segregated shared page directories, which track *shared views*. Each shared view tracks a shared page and which
| partial views belong to it. However, when a shared page is decommitted, to save space, the shared view will |
| forget which partial views belong to it; they will re-register themselves the first time someone allocates in |
| them. |
| |
| Both of them rely on the same basic state, though they use it a bit differently: |
| |
| - A lock-free-access vector of compact view pointers. These are 4-byte pointers. This is possible because views |
| are always allocated out of the compact reservation (they are usually allocated in the immortal |
| heap). This vector may be appended to, but existing entries are immutable. So, resizing just avoids deleting |
| the smaller-sized vectors so that they may still be accessed in case of a race. |
| - A lock-free-access segmented vector of bitvectors. There are two bitvectors, and we interleave their 32-bit |
| words of bits. The *eligible* bitvector tells us which views may be allocated out of. This means different |
| things for size directories than shared page directories. For size directories, these are the views that have |
| some free memory and nobody is currently doing anything with them. For shared page directories, these are the |
| shared pages that haven't yet been fully claimed by partial views. The *empty* bitvector tells us which pages |
| are fully empty and can be decommitted. It's never set for partial views. It means the same thing for both |
| exclusive views in size directories and shared views in shared page directories. |
| |
| Both bitvectors are searched in order: |
| |
| - Eligible bitvectors are searched first-fit. |
| - Empty bitvectors are searched last-fit. |
| |
| Searches are made fast because the directory uses the lock-free tricks of `pas_versioned_field` to maintain two |
| indices: |
| |
| - A first-eligible index. This always points to the first eligible bit, except in cases where some thread has |
| set the bit but hasn't gotten around to setting the first-eligible index. In other words, this may have some |
| lag, but the lag is bounded. |
| - A last-empty-plus-one index. This always points to the index right after the last empty bit. If it's zero, it |
| means there are no set empty bits. If it's the number of views, then it means the last view is empty for sure |
| and there may be any number of other empty views. |
| |
| These versioned indices can be read without any atomic instructions in many cases, though most mutations to them |
| require a pair of 128-bit compare-and-swaps. |
| |
| The eligible bitvector together with the first-eligible index allow for very fast searches to find the first |
| eligible view. Bitvector searches are fast to begin with, even over a segmented vector, since the segmented |
| vector has large-enough chunks. Even searching the whole bitvector is quite efficient because of the properties |
| of bitvector simd (i.e. using a 32-bit or 64-bit or whatever-bit word to hold that many bits). But the |
| first-eligible index means that most searches never go past where that index points, so we get a mostly-O(1) |
| behavior when we do have to find the first eligible view. |
| |
| The empty bitvector gives a similar property for the scavenger, which searches backwards to find empty views. |
| The efficiency here arises from the fact that empty pages know the timestamp of when they became empty, and the |
| scavenger will terminate its backwards search when it finds a too-recently emptied page. |
| |
| Directories have to make some choices about how to add views. View addition happens under heap lock and must be |
| in this order: |
| |
| - First we make sure that the bitvectors have enough room for a bit at the new index. The algorithm relies on |
the size of the view vector telling us how many views there are, so it's fine if the bitvectors are too big
| for a moment. The segmented vector algorithm used for the bitvectors requires appending to happen under heap |
| lock but it can run concurrently to accesses to the vector. It accomplishes this by never deallocating the |
| too-small vector spines. |
- Then we append the view to the view vector, possibly reallocating the view vector. Reallocation keeps the
old too-small copy of the vector around, allowing concurrent reads to the vector. The vector append stores the
value, executes an acqrel fence (probably overkill -- probably could just be a store fence), and then
increments the size. This ensures nobody sees the view until we are ready (see the sketch below).
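
In sketch form (invented names; capacity and reallocation handling omitted), the publish protocol is:

    /* compact_view_ptr stands in for libpas's compact view pointers. The caller holds
       the heap lock; readers run concurrently and load size before reading entries. */
    typedef unsigned compact_view_ptr;

    typedef struct {
        compact_view_ptr* entries; /* old, too-small copies stay alive for racing readers */
        unsigned size;
    } view_vector;

    static void view_vector_append(view_vector* vector, compact_view_ptr view)
    {
        vector->entries[vector->size] = view;      /* 1. store the new entry */
        __atomic_thread_fence(__ATOMIC_ACQ_REL);   /* 2. the acqrel fence from the text */
        vector->size = vector->size + 1;           /* 3. publish by bumping the size */
    }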
| |
| Directories just choose what kind of view they will create and then create an empty form of that view. So, right |
| at the point where the vector append happens, the view will report itself as not yet being initialized. However, |
| any thread can initialize an empty view. The normal flow of allocation means asking a view to "start |
| allocating". This actually happens in two steps (`will_start_allocating` and `did_start_allocating`). The |
| will-start step checks if the view needs to commit its memory, which will cause empty exclusive views to |
| allocate a page. Empty partial views get put into the *partial primordial* state where they grab their first |
| chunk of memory from some shared view and prepare to possibly grab more chunks of that shared view, depending on |
| demand. But all of this happens after the directory has created the view and appended it. This means that there |
| is even the possibility that one thread creates a view, but then some other thread takes it right after it was |
| appended. In that case, the first thread will loop around and try again, maybe finding some other view that had |
been made eligible in the meantime, or again appending another new view.
| |
| Size directories maintain additional state to make page management easy and to accelerate allocation. |
| |
| Size directories that have enabled exclusive views have a `full_alloc_bits` vector that has bits set for those |
| indices in their pages where an object might start. Pages use bitvectors indexed by minalign, and only set those |
| bits that correspond to valid object offsets. The `full_alloc_bits` vector is the main way that directories tell |
| where objects could possibly be in the page. The other way they tell is with |
| `offset_from_page_boundary_to_first_object` and `offset_from_page_boundary_to_end_of_last_object`, but the |
| algorithm relies on those a bit less. |
| |
| Size directories can tell if they have been assigned an allocator index or a view cache index, and control the |
| policies of when they get them. A directory without an allocator index will allocate out of *baseline |
| allocators*, which are shared by all threads. Having an allocator index implies that the allocator index has |
| also been stored in the right places in the heap's lookup tables. Having a view cache index means that |
| deallocation will put eligible pages on the view cache before marking them eligible in the directory. |
| |
| ### Page Boundaries, Headers and Alloc Bits |
| |
| "Pages" in the segregated heap are a configurable concept. Things like their size and where their header lives |
| can be configured by the `pas_segregated_page_config` of their directory. The config can tell which parts of |
| the page are usable as object payload, and they can provide callbacks for finding the page header. |
| |
| The page header contains: |
| |
| - The page's kind. The segregated page header is a subtype of the `pas_page_base`, and we support safely |
| downcasting from `pas_page_base*` to `pas_segregated_page*`. Having the kind helps with this. |
| - Whether the page is in use for allocation right now. This field is only for exclusive views. Allocation in |
| exclusive pages can only happen when some local allocator claims a page. For shared pages, this bit is in |
each of the partial views.
| - Whether we have freed an object in the page while also allocating in it. This field is only for exclusive |
| views. When we finish allocating, and the bit is set, we do the eligibility stuff we would have done if we had |
| freed objects without the page being used for allocation. For shared pages, this bit is in each of the partial |
| views. |
- The size of the objects that the page manages. Again, this is only used for exclusive views. For shared pages,
| each partial view may have a different size directory, and the size directory tells the object size. It's also |
| possible to get the object size by asking the exclusive view for its size directory, and you will get the same |
| answer as if you had asked the page in that case. |
| - A pointer to the lock that the page uses. Pages are locked using a kind of locking dance: you load the |
| `page->lock_ptr`, lock that lock, and then check if `page->lock_ptr` still points at the lock you tried. |
| Anyone who holds both the current lock and some other lock can change `page->lock_ptr` to the other lock. |
| For shared pages, the `lock_ptr` always points at the shared view's ownership lock. For exclusive views, the |
| libpas allocator will change the lock of a page to be the lock associated with their TLC. If contention |
| on `page->lock_ptr` happens, then we change the lock back to the view's ownership lock. This means that in |
| the common case, flushing the deallocation log will encounter page after page that wants to hold the same |
| lock -- usually the TLC lock. This allows the deallocation log flush to only do a handful of lock acquisitions |
| for deallocating thousands of objects. |
| - The timestamp of when the page became empty, using a unit of time libpas calls *epoch*. |
| - The view that owns the page. This is either an exclusive view or a *shared handle*, which is the part of the |
| shared view that gets deallocated for decommitted pages. Note: an obvious improvement is if shared handles |
| were actually part of the page header; they aren't only because until recently, the page header size had to be |
| the same for exclusive and shared pages. |
| - View cache index, if the directory enabled view caching. This allows deallocation to quickly find out which |
| view cache to use. |
| - The alloc bits. |
| - The number of 32-bit alloc bit words that are not empty. |
| - Optionally, the granule use counts. It's possible for the page config to say that the page size is larger than |
| the system page size, but that the page is divided up into *granules* which are system page size. In that |
| case, the page header will have an array of 1-byte use counts per granule, which count the number of objects |
| in that granule. They also track a special state when the granule is decommitted. The medium_segregated_config |
| uses this to offer fine-grained decommit of 128KB "pages". |
| |
Currently we have two ways of placing the page header: either at the beginning of the page (at what we call
the page boundary), or in an object allocated in the utility heap. In the latter case, we use the
| mostly-lock-free page header table to map between the page boundary and page header, or vice-versa. The page |
| config has callbacks that allow either approach. I've also used page config hacking to attempt other kinds of |
| strategies, like saying that every aligned 16MB chunk of pages has an array of page headers at the start of it; |
| but those weren't any better than either of the two current approaches. |
| |
| The most important part of the page header is the alloc bits array and the `num_non_empty_words` counter. This |
| is where most of the action of allocating and deallocating happens. The magic of the algorithm arises from the |
| simple bitvector operations we can perform on `page->alloc_bits`, `full_alloc_bits` (from the size |
| directory in case of exclusive pages or from the partial view in case of shared pages), and the |
| `allocator->bits`. These operations allow us to achieve most of the algorithm: |
| |
| - Deallocation clears a bit in `page->alloc_bits` and if this results in the word becoming zero, it decrements |
| the `num_non_empty_words`. The bit index is just the object's offset shifted by the page config's |
`min_align_shift`, which is a compile-time constant in most of the algorithm. If the algorithm makes any bit
| (for partials) or any word (for exclusives) empty, it makes the page eligible (either by putting it on a view |
| cache or marking the view eligible in its owning directory). If `num_non_empty_words` hits zero, the |
| deallocator also makes the view empty. |
| - Allocation does a find-first-set-bit on the `allocator->bits`, but in a very efficient way, because the |
| current 64-bit word of bits that the allocator is on is cached in `allocator->current_word` -- so allocation |
| rarely searches an array. So, the allocator just loads the current word, does a `ctz` or `clz` kind of |
operation (which is super cheap on modern CPUs), left-shifts the result by the page config's minalign shift, and
| adds the `allocator->page_ish` (the address in memory corresponding to the first bit in `current_word`). |
| That's the allocator fast path. |
| - We prepare `allocator->bits` by basically saying `allocator->bits = full_alloc_bits & ~page->alloc_bits`. This |
| is a loop since each of the `bits` is an array of words of bits and each array is the same size. For a 16384 |
| size page (the default for `small_segregated_config`) and a minalign shift of 4 (so minalign = 16, the default |
| for `small_segregated_config`), this means 1024 bits, or 32 32-bit words, or 128 bytes. The loop over the 32 |
| 32-bit words is usually fully unrolled by the compiler. There are no loop-carried dependencies. This loop |
| shows up in profiles, and though I've tried to make it faster, I've never succeeded. |
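
Putting those three bullets together, the core bit math is small. The following is a condensed sketch with
invented names; the real code operates on the page's own words, handles partial versus exclusive views, and does
more bookkeeping around eligibility and emptiness:

    #include <stddef.h>
    #include <stdint.h>

    #define WORDS_PER_PAGE 32 /* 16384-byte page, 16-byte minalign, 32-bit words */

    /* Refill: the allocator's bits are "places where an object may start and no live
       object is there now". No loop-carried dependencies, so this unrolls well. */
    static void refill_allocator_bits(uint32_t* allocator_bits, const uint32_t* full_alloc_bits,
                                      const uint32_t* page_alloc_bits)
    {
        for (size_t word = 0; word < WORDS_PER_PAGE; ++word)
            allocator_bits[word] = full_alloc_bits[word] & ~page_alloc_bits[word];
    }

    /* Allocation fast path: find-first-set in the cached word, then turn the bit index
       back into an address. */
    static void* allocate_from_current_word(uint64_t* current_word, uintptr_t page_ish,
                                            unsigned min_align_shift)
    {
        if (!*current_word)
            return NULL; /* slow path: advance to the next word, refill, etc. */
        unsigned bit = (unsigned)__builtin_ctzll(*current_word);
        *current_word &= *current_word - 1; /* consume the bit */
        return (void*)(page_ish + ((uintptr_t)bit << min_align_shift));
    }

    /* Deallocation: clear the object's bit; if its word becomes empty, note it. */
    static void free_object_bit(uint32_t* page_alloc_bits, uint8_t* num_non_empty_words,
                                uintptr_t offset_in_page, unsigned min_align_shift)
    {
        size_t bit_index = offset_in_page >> min_align_shift;
        uint32_t* word = &page_alloc_bits[bit_index / 32];
        *word &= ~((uint32_t)1 << (bit_index % 32));
        if (!*word)
            (*num_non_empty_words)--;
    }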
| |
| ### Local Allocators |
| |
| Each size directory can choose to use either baseline allocators or TLC local allocators for allocation. Each |
| size directory can choose to have a local view cache or not. Baseline allocators are just local allocators that |
are global and not part of any TLC, and allocation needs to grab a lock to use them. TLC local allocators don't
| require any locking to get accessed. |
| |
| Local allocators can be in any of these modes: |
| |
| - They are totally uninitialized. All fast paths fail and slow paths will initialize the local allocator by |
| asking the TLC layout. This state happens if TLC decommit causes a local allocator to become all zero. |
| - They are in bump allocation mode. Bump allocation happens either when a local allocator decides to allocate in |
| a totally empty exclusive page, or for primordial partial allocation. In the former case, it's worth about 1% |
| performance to sometimes bump-allocate. In the latter case, using bump allocation is just convenient -- the |
| slow path will decide that the partial view should get a certain range of memory within a shared page and it |
| knows that this memory has never been used before, so it's natural to just set up a bump range over that |
| memory. |
| - They are in free bits mode. This is slightly more common than the bump mode. In this mode, the |
| `allocator->bits` is computed using `full_alloc_bits & ~page->alloc_bits` and contains a bit for the start of |
| every free object. |
| - They are in bitfit mode. In this mode, the allocator just forwards allocations to the `pas_bitfit_allocator`. |
| |
| Local allocators can be *stopped* at any time; this causes them to just return all of their free memory back to |
| the heap. |
| |
| Local allocators in a TLC can be used without any conventional locking. However, there is still synchronization |
| taking place because the scavenger is allowed to stop allocators. To support this, local allocators set an |
| `in_use` bit (not atomically, but protected by a `pas_compiler_fence`) before they do any work and clear it when |
done. The scavenger thread will suspend threads that have TLCs and then, while a thread is suspended, it can
stop any of that thread's allocators that are not `in_use`.
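
The allocator side of that handshake looks roughly like this (the struct and `allocate_impl` are invented, and a
GCC/Clang signal fence stands in for `pas_compiler_fence`):

    #include <stdint.h>

    /* The fences only need to stop compiler reordering, because the scavenger suspends
       the thread before it looks at in_use. */
    typedef struct {
        uint8_t in_use;
        /* ... bump range, free bits, etc. ... */
    } local_allocator;

    void* allocate_impl(local_allocator* allocator);

    static void* allocator_allocate(local_allocator* allocator)
    {
        allocator->in_use = 1;
        __atomic_signal_fence(__ATOMIC_SEQ_CST); /* compiler fence */
        void* result = allocate_impl(allocator);
        __atomic_signal_fence(__ATOMIC_SEQ_CST);
        allocator->in_use = 0;
        return result;
    }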
| |
| ## The Bitfit Heaps |
| |
| Libpas is usually used with a combination of segregated and large heaps. However, sometimes we want to have a |
| heap that is more space-efficient than segregated but not quite as slow as large. The bitfit heap follows a |
| similar style to segregated, but: |
| |
| - While bitfit has a bit for each minalign index like segregated, bitfit actually uses all of the bits. To |
| allocate an object in bitfit, all of the bits corresponding to all of the minalign granules that the object |
| would use have to be free before the allocation and have to be marked as not free after the allocation. |
| Freeing has to clear all of the bits. |
| - The same page can have objects of any size allocated in it. For example, if a 100 byte object gets freed, then |
| it's legal to allocate two 50 byte objects out of the freed space (assuming 50 is a multiple of minalign). |
| - A bitfit directory does not represent a size class. A bitfit heap has one directory per bitfit page config |
| and each page config supports a large range of sizes (the largest object is ~250 times larger than the |
| smallest). |
| |
| Bitfit pages have `free_bits` as well as `object_end_bits`. The `free_bits` indicates every minalign granule |
| that is free. For non-free granules, the `object_end_bits` has an entry for every granule that is the last |
| granule in some live object. These bits get used as follows: |
| |
| - To allocate, we find the first set free bit and then find the first clear free bit after that. If this range |
| is big enough for the allocation, we clear all of the free bits and set the object end bit. If it's not big |
| enough, we keep searching. We do special things (see below) when we cannot allocate in a page. |
| - To free, we find the first set object end bit, which then gives us the object size. Then we clear the object |
| end bit and set the free bits. |
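
Here is a sketch of the two operations on a single-word bitvector; the names and the single-word simplification
are mine, and the real code works on full bitvectors and handles runs that cross words:

    #include <stdbool.h>
    #include <stdint.h>

    typedef struct {
        uint64_t free_bits;       /* bit set => that minalign granule is free */
        uint64_t object_end_bits; /* bit set => last granule of a live object */
    } sketch_bitfit_page;

    /* Assumes 1 <= num_granules < 64. */
    static bool sketch_bitfit_allocate(sketch_bitfit_page* page, unsigned num_granules,
                                       unsigned* begin_index)
    {
        unsigned index = 0;
        while (index < 64 && (page->free_bits >> index)) {
            /* First set free bit at or after index, then the end of that free run. */
            unsigned start = index + (unsigned)__builtin_ctzll(page->free_bits >> index);
            unsigned end = start;
            while (end < 64 && (page->free_bits & ((uint64_t)1 << end)))
                end++;
            if (end - start >= num_granules) {
                uint64_t mask = ((uint64_t)1 << num_granules) - 1;
                page->free_bits &= ~(mask << start);
                page->object_end_bits |= (uint64_t)1 << (start + num_granules - 1);
                *begin_index = start;
                return true;
            }
            index = end;
        }
        return false; /* report max_free to the directory and move on to another page */
    }

    /* Assumes begin_index is the first granule of a live object. */
    static unsigned sketch_bitfit_free(sketch_bitfit_page* page, unsigned begin_index)
    {
        unsigned end_index =
            begin_index + (unsigned)__builtin_ctzll(page->object_end_bits >> begin_index);
        unsigned num_granules = end_index - begin_index + 1;
        page->object_end_bits &= ~((uint64_t)1 << end_index);
        page->free_bits |= (((uint64_t)1 << num_granules) - 1) << begin_index;
        return num_granules; /* the object's size in granules */
    }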
| |
| This basic style of allocation is usually called *bitmap* allocation. Bitfit is a special kind of bitmap |
| allocation that makes it cheap to find the first page that has enough space for an allocation of a given size. |
| Bitfit makes allocation fast even when it is managing lots of pages by using two tricks: |
| |
| - Bitfit directories have an array of bitfit views and a corresponding array of `max_free` bytes. Bitfit views |
| are monomorphic, unlike the polymorphic views of segregated heaps. Each bitfit view is either uninitialized or |
  has a bitfit page. A `max_free` byte for a page tells the maximum free object size in that page. So, in the
  worst case, we search the `max_free` vector to find the first byte that is large enough for our allocation.
| - Bitfit uses size classes to short-circuit the search. Bitfit leverages segregated heaps to create size |
| classes. Segregated size directories choose at creation time if they want to support segregated allocation or |
| bitfit allocation. If the latter, the directory is just used as a way of locating the bitfit size class. Like |
| with segregated, each local allocator is associated with a segregated size directory, even if it's a local |
| allocator configured for bitfit. Each size class maintains the index of the first view/page in the directory |
| that has a free object big enough for that size class. |
| |
| The updates to `max_free` and the short-circuiting indices in size classes happen when an allocation fails in |
| a page. This is an ideal time to set those indices since failure to allocate happens to also tell you the size |
| of the largest free object in the page. |
| |
| When any object is freed in a page, we mark the page as having `PAS_BITFIT_MAX_FREE_UNPROCESSED` and rather than |
| setting any short-circuiting indices in size classes, we just set the `first_unprocessed_free` index in the |
| `pas_bitfit_directory`. Allocation will start its search from the minimum of `directory->first_unprocessed_free` |
| and `size_class->first_free`. All of these short-circuiting indices use `pas_versioned_field` just like how |
| short-circuiting works in segregated directories. |
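
A sketch of the resulting search (flat arrays and plain fields here; the real directories use segmented vectors
and `pas_versioned_field`, and the `max_free` values are encoded bytes):

    #include <stddef.h>
    #include <stdint.h>

    typedef struct {
        uint8_t* max_free; /* per view: largest free run, or a large "unprocessed" marker after a free */
        size_t num_views;
        size_t first_unprocessed_free;
    } sketch_bitfit_directory;

    typedef struct {
        size_t first_free; /* first view that might satisfy this size class */
    } sketch_bitfit_size_class;

    static size_t sketch_find_view_index(sketch_bitfit_directory* directory,
                                         sketch_bitfit_size_class* size_class,
                                         uint8_t needed)
    {
        size_t index = size_class->first_free < directory->first_unprocessed_free
            ? size_class->first_free : directory->first_unprocessed_free;
        for (; index < directory->num_views; index++) {
            /* Unprocessed pages compare as large, so we retry them and learn their
               true max_free if the allocation fails. */
            if (directory->max_free[index] >= needed)
                return index;
        }
        return SIZE_MAX; /* nothing fits; create a new view/page */
    }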
| |
| Bitfit heaps use fine-grained locking in the sense that each view has its own locks. But, there's no attempt |
| made to have different threads avoid allocating in the same pages. Adding something like view caches to the |
| bitfit heap is likely to make it much faster. You could even imagine that rather than having the |
| `directory->first_unprocessed_free`, we instead have freeing in a page put the page's view onto a local view |
| cache for that bitfit directory, and then we allocate out of the view cache until it's empty. Failure to |
| allocate in a page in the view cache will then tell us the `max_free`, which will allow us to mark the view |
| eligible in the directory. |
| |
| ## The Scavenger |
| |
| Libpas returns memory to the OS by madvising it. This makes sense for a malloc that is trying to give strong |
| type guarantees. If we unmapped memory, then the memory could be used for some totally unrelated type in the |
| future. But by just decommitting the memory, we get the memory savings from free pages of memory and we also |
| get to preserve type safety. |
| |
| The madvise system call -- and likely any mechanism for saying "this page is empty" -- is expensive enough that |
| it doesn't make sense to do it anytime a page becomes empty. Pages often become empty only to refill again. In |
fact, last time I measured it, just about half of allocations went down the bump allocation path, and (except
at start-up) that path is for completely empty pages. So, libpas has mechanisms for stashing the information
| that a page has become empty and then having a *scavenger thread* return that memory to the OS with an madvise |
| call (or whatever mechanism). The scavenger thread is by default configured to run every 100ms, but will shut |
| itself down if we have a period of non-use. At each tick, it returns all empty pages that have been empty for |
| some length of time (currently 300ms). Those two thresholds -- the period and the target age for decommit -- are |
independently configurable, as it might make sense (for different reasons) to have either number be bigger than
the other.
| |
This section describes the scavenging algorithm in detail. This is a large fraction of what makes libpas fast
| and space-efficient. The algorithm has some crazy things in it that probably didn't work out as well as I wanted |
| but nonetheless those things seem to avoid showing up in profiles. That's sort of the outcome of this algorithm |
| being tuned and twisted so many times during the development of this allocator. First I'll describe the |
| *deferred decommit log*, which is how we coalesce madvise calls. Then I'll describe the *page sharing pool*, |
| which is a mechanism for multiple *participants* to report that they have some empty pages. Then I'll describe |
| how the large heap implements this with the large sharing pool, which is one of the singleton participants. Then |
| I'll describe the segregated directory participants -- which are a bit different for shared page directories |
| versus size directories. I'll also describe the bitfit directory participants, which are quite close to their |
| segregated directory cousins. Then I'll describe some of the things that the scavenger does that aren't to do |
| with the page sharing pool, like stopping baseline allocators, stopping utility heap allocators, stopping |
| TLC allocators, flushing TLC deallocation logs, decommitting unused parts of TLCs, and decommitting expendable |
| memory. |
| |
| ### Deferred Decommit Log |
| |
| Libpas's decommit algorithm coalesces madvise calls -- so if two adjacent pages, even from totally unrelated |
| heaps, become empty, then their decommit will be part of one syscall. This is achieved by having places in the |
| code that want to decommit memory instead add that memory to a deferred decommit log. This log internally uses |
| a minheap based on address. The log also stores what lock was needed to decommit the range, since the decommit |
| algorithm relies on fine-grained locking of memory rather than having a global commit lock. So, when the |
| scavenger asks the page sharing pool to go find some memory to decommit, it gives it a deferred decommit log. |
| The page sharing pool will usually return all of the empty pages in one go, so the deferred decommit logs can |
| get somewhat big. They are allocated out of the bootstrap free heap (which isn't the best idea if they ever get |
| very big, since bootstrap free heap memory is not decommitted -- but right now this is convenient because for |
| a heap to support decommit, it needs to talk to deferred decommit logs, so we want to avoid infinite recursion). |
| After the log is filled up, we can decommit everything in the log at once. This involves heapifying the array |
| and then scanning it backwards while detecting adjacent ranges. This is the loop that actually calls decommit. A |
| second loop unlocks the locks. |
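
A sketch of the coalescing pass, assuming the entries have already been ordered by address (the real code pops
them from the minheap, and the lock bookkeeping is omitted here):

    #include <stddef.h>
    #include <stdint.h>

    typedef struct {
        uintptr_t begin;
        uintptr_t end;
    } sketch_range;

    static void sketch_decommit_all(sketch_range* ranges, size_t count,
                                    void (*decommit)(uintptr_t begin, uintptr_t end))
    {
        size_t index = 0;
        while (index < count) {
            uintptr_t begin = ranges[index].begin;
            uintptr_t end = ranges[index].end;
            index++;
            /* Merge adjacent ranges, even ones from totally unrelated heaps. */
            while (index < count && ranges[index].begin == end) {
                end = ranges[index].end;
                index++;
            }
            decommit(begin, end); /* one syscall for the whole merged run */
        }
    }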
| |
| Two fun complications arise: |
| |
| - As the page sharing pool scans memory for empty pages, it may arrive at pages in a random order, which may |
| be different from any valid order in which to acquire commit locks. So, after acquiring the first commit lock, |
| all subsequent lock acquisitions are `try_lock`'s. If any `try_lock` fails, the algorithm returns early, and |
| the deferred decommit log helps facilitate detecting when a lock failed to be acquired. |
| - Libpas supports calling into the algorithm even if some commit locks, or the heap lock, are held. It's not |
| legal to try to acquire any commit locks other than by `try_lock` in that case. The deferred decommit log will |
| also make sure that it will not relock any commit locks that are already held. |
| |
| Libpas can be configured to return memory either by madvising it or by mmap-zero-filling it, which has a similar |
| semantic effect but is slower. Libpas supports both symmetric and asymmetric forms of madvise, though the |
| asymmetric form is faster and has a slight (though mostly theoretical) memory usage edge. By *asymmetric* I mean |
| that you call some form of madvise to decommit memory and then do nothing to commit it. This works on Darwin and |
| it's quite efficient -- the kernel will clear the decommit request and give the page real memory if you access |
| the page. The memory usage edge of not explicitly committing memory in the memory allocator is that programs may |
| allocate large arrays and never use the whole array. It's OK to configure libpas to use symmetric decommit, but |
| the asymmetric variant might be faster or more efficient, if the target OS allows it. |
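
For concreteness, here is roughly what the two styles can look like on Darwin. This is not libpas's actual code;
`MADV_FREE_REUSABLE` and `MADV_FREE_REUSE` are the Darwin-specific advice values, and other systems would use
`MADV_FREE` or `MADV_DONTNEED` instead:

    #include <sys/mman.h>

    /* Asymmetric: advise the kernel that the pages are unneeded and do nothing on
       commit; touching the memory later gives it real pages again. */
    static void sketch_decommit(void* base, size_t size)
    {
        madvise(base, size, MADV_FREE_REUSABLE);
    }

    /* Symmetric: commit is an explicit call too. */
    static void sketch_commit(void* base, size_t size)
    {
        madvise(base, size, MADV_FREE_REUSE);
    }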
| |
| ### Page Sharing Pool |
| |
| Different kinds of heaps have different ways of discovering that they have empty pages. Libpas supports three |
| different kinds of heaps (large, segregated, and bitfit) and one of those heaps has two different ways of |
discovering free memory (segregated shared page directories and segregated size directories). The page sharing
| pool is a data structure that can handle an arbitrary number of *page sharing participants*, each of which is |
| able to say whether they have empty pages and whether those empty pages are old enough to be worth decommitting. |
| |
Page sharing participants need to be able to answer the following queries (summarized as a sketch after the list):
| |
| - Getting the epoch of the oldest free page. This can be approximate; for example, it's OK for the participant |
| to occasionally give an epoch that is newer than the true oldest so long as this doesn't happen all the time. |
| We call this the `use_epoch`. |
| - Telling if the participant thinks it *is eligible*, i.e. has any free pages right now. It's OK for this to |
| return true even if there aren't free pages. It's not OK for this to return false if there are free pages. If |
| this returns true and there are no free pages, then it must return false after a bounded number of calls to |
| `take_least_recently_used`. Usually if a participant incorrectly says that it is eligible, then this state |
| will clear with exactly one call to `take_least_recently_used`. |
| - Taking the least recently used empty page (`pas_page_sharing_participant_take_least_recently_used`). This is |
| allowed to return that there aren't actually any empty pages. If there are free pages, this is allowed to |
| return any number of them (doesn't have to be just one). Pages are "returned" via a deferred decommit log. |
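
In other words, a participant behaves as if it implemented an interface like the following. This is a paraphrase
with invented names and signatures; the real code dispatches on the kind tag stored in the tagged participant
pointer rather than on function pointers:

    #include <stdbool.h>
    #include <stdint.h>

    struct pas_deferred_decommit_log; /* the real libpas type, used opaquely here */
    typedef struct sketch_participant sketch_participant;

    struct sketch_participant {
        /* Approximate age of the oldest free page; may occasionally err toward "newer". */
        uint64_t (*get_use_epoch)(sketch_participant* participant);

        /* May spuriously say true, but must stop doing so after a bounded number of
           take calls; must never say false while free pages exist. */
        bool (*is_eligible)(sketch_participant* participant);

        /* Move zero or more of the oldest empty pages onto the deferred decommit log;
           returns false if a try_lock failed and the caller should come back later. */
        bool (*take_least_recently_used)(sketch_participant* participant,
                                         struct pas_deferred_decommit_log* log);
    };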
| |
| Participants also notify the page sharing pool when they see a *delta*. Seeing a delta means one of: |
| |
| - The participant found a new free page and it previously thought that it didn't have any. |
| - The participant has discovered that the oldest free page is older than previously thought. |
| |
| It's OK to only report a delta if the participant knows that it was previously not advertising itself as being |
| eligible, and it's also OK to report a delta every time that a free page is found. Most participants try to |
| avoid reporting deltas unless they know that they were not previously eligible. However, some participants (the |
| segregated and bitfit directories) are sloppy about reporting the epoch of the oldest free page. Those |
| participants will conservatively report a delta anytime they think that their estimate of the oldest page's age |
| has changed. |
| |
| The page sharing pool itself comprises: |
| |
| - A segmented vector of page sharing participant pointers. Each pointer is tagged with the participant's type, |
| which helps the pool decide how to get the `use_epoch`, decide whether the participant is eligible, and take |
| free pages from it. |
| - A bitvector of participants that have reported deltas. |
| - A minheap of participants sorted by epoch of the oldest free memory. |
| - The `current_participant`, which the page sharing pool will ask for pages before doing anything else, so long |
| as there are no deltas and the current participant continues to be eligible and continues to report the same |
| use epoch. |
| |
| If the current participant is not set or does not meet the criteria, the heap lock is taken and all of the |
| participants that have the delta bit set get reinserted into the minheap based on updated use epochs. It's |
| possible that the delta bit causes the removal of entries in the minheap (if something stopped being eligible). |
| It's possible that the delta bit causes the insertion of entries that weren't previously there (if something has |
| just become eligible). And it's possible for an entry to be removed and then readded (if the use epoch changed). |
| Then, the minimum of the minheap becomes the current participant, and we can ask it for pages. |
| |
| The pool itself is a class but it happens to be a singleton right now. It's probably a good idea to keep it as a |
| class, because as I've experimented with various approaches to organizing memory, I have had versions where |
| there are many pools. There is always the physical page sharing pool (the singleton), but I once had sharing |
| pools for moving pages between threads and sharing pools for trading virtual memory. |
| |
| The page sharing pool exposes APIs for taking memory from the pool. There are two commonly used variants: |
| |
| - Take a set number of bytes. The page sharing pool will try to take that much memory unless a try_lock fails, |
| in which case it records that it should take this additional amount of bytes on the next call. |
- Take all free pages that are the same age as or older than some `max_epoch`. This API is called
| `pas_physical_page_sharing_pool_scavenge`. This algorithm will only return when it is done. It will reloop in |
| case of `try_lock` failure, and it does this in a way that avoids spinning. |
| |
| The expected behavior is that when the page sharing pool has to get a bunch of pages it will usually get a run |
| of them from some participant -- hence the emphasis on the `current_participant`. However, it's possible that |
| various tweaks that I've made to the algorithm have made this no longer be the case. In that case, it might be |
worthwhile to try to come up with a way of recomputing the `current_participant` that doesn't require holding the
| `heap_lock`. Maybe even just a lock for the page sharing pool, rather than using the `heap_lock`, would be |
| enough to get a speed-up, and in the best case that speed-up would make it easier to increase scavenger |
| frequency. |
| |
| The page sharing pool's scavenge API is the main part of what the scavenger does. The next sections describe the |
| inner workings of the page sharing participants. |
| |
| ### Large Sharing Pool |
| |
| The large sharing pool is a singleton participant in the physical page sharing pool. It tracks which ranges of |
pages are empty across all of the large heaps. The idea is to allow large heaps to sometimes split a page among
each other, in which case no single large heap knows whether the page can be decommitted. So, all large heaps as well
| as the large utility free heap and compact large utility free heap report when they allocate and deallocate |
| memory to the large sharing pool. |
| |
| Internally, the large sharing pool maintains a red-black tree and minheap. The tree tracks coalesced ranges of |
| pages and their states. The data structure thinks it knows about all of memory, and it "boots up" with a single |
| node representing the whole address space and that node claims to be allocated and committed. When heaps that |
| talk to the large sharing pool acquire memory, they tell the sharing pool that the memory is now free. Any |
| free-and-committed ranges also reside in the minheap, which is ordered by use epoch (the time when the memory |
| became free). |
| |
| The large sharing pool can only be used while holding the `heap_lock`, but it uses a separate lock, the |
| `pas_virtual_range_common_lock`, for commit and decommit. So, while libpas is blocked in the madvise syscall, |
| it doesn't have to hold the `heap_lock` and the other lock only gets acquired when committing or decommitting |
| large memory. |
| |
| It's natural for the large sharing pool to handle the participant API: |
| |
| - The large sharing pool participant says it's eligible when the minheap is non-empty. |
| - The large sharing pool participant reports the minheap's minimum use epoch as its use epoch (though it can be |
| configured to do something else; that something else may not be interesting anymore). |
| - The large sharing pool participant takes least recently used by removing the minimum from the minheap and |
| adding that node's memory range to the deferred decommit log. |
| |
| The large sharing pool registers itself with the physical page sharing pool when some large heap reports the |
| first bit of memory to it. |
| |
| ### Segregated Directory Participant |
| |
| Each segregated directory is a page sharing participant. They register themselves once they have pages that |
| could become empty. Segregated directories use the empty bits and the last-empty-plus-one index to satisfy the |
| page sharing pool participant API: |
| |
| - A directory participant says it's eligible when last-empty-plus-one is nonzero. |
| - A directory participant reports the use epoch of the last empty page before where last-empty-plus-one points. |
| Note that in extreme cases this means having to search the bitvector for a set empty bit, since the |
| last-empty-plus-one could lag in being set to a lower value in case of races. |
| - A directory participant takes least recently used by searching backwards from last-empty-plus-one and taking |
| the last empty page. This action also updates last-empty-plus-one using the `pas_versioned_field` lock-free |
| tricks. |
| |
| Once an empty page is found the basic idea is: |
| |
| 1. Clear the empty bit. |
| 2. Try to take eligibility; i.e. make the page not eligible. We use the eligible bit as a kind of lock |
| throughout the segregated heap; for example, pages will not be eligible if they are currently used by some |
| local allocator. If this fails, we just return. Note that it's up to anyone who makes a page ineligible to |
| then check if it should set the empty bit after they make it eligible again. So, it's fine for the scavenger |
| to not set the empty bit again after clearing it and failing to take the eligible bit. This also prevents a |
| spin where the scavenger keeps trying to look at this allegedly empty page even though it's not eligible. By |
| clearing the empty bit and not setting it again in this case, the scavenger will avoid this page until it |
| becomes eligible. |
| 3. Grab the *ownership lock* to say that the page is now decommitted. |
| 4. Grab the commit lock and then put the page on the deferred decommit log. |
| 5. Make the page eligible again. |
| |
| Sadly, it's a bit more complicated than that: |
| |
| - Directories don't track pages; they track views. Shared page directories have views of pages whose actual |
| eligibility is covered by the eligible bits in the segregated size directories that hold the shared page's |
| partial views. So, insofar as taking eligibility is part of the algorithm, shared page directories have to |
| take eligibility for each partial view associated with the shared view. |
| - Shared and exclusive views could both be in a state where they don't even have a page. So, before looking at |
| anything about the view, it's necessary to take the ownership lock. In fact, to have a guarantee that nobody |
| is messing with the page, we need to grab the ownership lock and take eligibility. The actual algorithm does |
| these two things together. |
| - It's possible for the page to not actually be empty even though the empty bit is set. We don't require that |
| the empty bit is cleared when a page becomes nonempty. |
| - Some of the logic of decommitting requires holding the `page->lock_ptr`, which may be a different lock than |
| the ownership lock. So the algorithm actually takes the ownership lock, then the page lock, then the ownership |
| lock again, and then the commit lock. |
| - If a `try_lock` fails, then we set both the eligible and empty bits, since in that case, we really do want |
| the page sharing pool to come back to us. |
| |
| Some page configs support segregated pages that have multiple system pages inside them. In that case, the empty |
| bit gets set when any system page becomes empty (using granule use counts), and the taking algorithm just |
| decommits the granules rather than decommitting the whole page. |
| |
| ### Bitfit Directory Participant |
| |
| Bitfit directories also use an empty bitvector and also support granules. Although bitfit directories are an |
| independent piece of code, their approach to participating in page sharing pools exactly mirrors what segregated |
| directories do. |
| |
| This concludes the discussion about the page sharing pool and its participants. Next, I will cover some of the |
| other things that the scavenger does. |
| |
| ### Stopping Baseline Allocators |
| |
The scavenger also routinely stops baseline allocators. Baseline allocators are easy for the scavenger to stop
because they can be used by anyone who holds their lock. So the scavenger can stop any baseline allocator it
| wants. It will only stop those allocators that haven't been used in a while (by checking and resetting a dirty |
| bit). |
| |
| ### Stopping Utility Heap Allocators |
| |
| The scavenger also does the same thing for utility allocators. This just requires holding the `heap_lock`. |
| |
| ### Stopping TLC Allocators |
| |
| Allocators in TLCs also get stopped, but this requires more effort. When the scavenger thread is running, it's |
| possible for any thread to be using any of its allocators and no locks are held when this happens. So, the |
| scavenger uses the following algorithm: |
| |
| - First it tries to ask the thread to stop certain allocators. In each allocation slow path, the allocator |
| checks if any of the other allocators in the TLC have been requested to stop by the scavenger, and if so, it |
| stops those allocators. That doesn't require special synchronization because the thread that owns the |
| allocator is the one stopping it. |
| - If that doesn't work out, the scavenger suspends the thread and stops all allocators that don't have the |
| `in_use` bit set. The `in_use` bit is set whenever a thread does anything to a local allocator, and cleared |
| after. |
| |
| One wrinkle about stopping allocators is that stopped allocators might get decommitted. The way that this is |
| coordinated is that a stopped allocator is in a special state that means that any allocation attempt will take |
| a slow path that acquires the TLC's scavenging lock and possibly recommits some pages and then puts the |
| allocator back into a normal state. |
| |
| ### Flushing TLC Deallocation Logs |
| |
| The TLC deallocation logs can be flushed by any thread that holds the scavenging lock. So, the scavenger thread |
| flushes all deallocation logs that haven't been flushed recently. |
| |
| When a thread flushes its own log, it holds the scavenger lock. However, appending doesn't grab the lock. To |
| make this work, when the scavenger flushes the log, it: |
| |
| - Replaces any entry in the log that it deallocated with zero. Actually, the deallocation log flush always does |
| this. |
| - Does not reset the `thread_local_cache->deallocation_log_index`. In fact, it doesn't do anything except read |
| the field. |
| |
| Because of the structure of the deallocation log flush, it's cheap for it to null-check everything it loads from |
the deallocation log. So, when a thread goes to flush its log after the scavenger has done so, it will see a
| bunch of null entries, and it will skip them. If a thread tries to append to the deallocation log while the |
| scavenger is flushing it, then this just works, because it ends up storing the new value above what the |
| scavenger sees. The object is still in the log, and will get deallocated on the next flush. |
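
A sketch of the flush loop (invented names): the owning thread's flush looks like this, and the scavenger's flush
runs the same loop under the scavenging lock except that it leaves `*index_ptr` alone and only zeroes the entries
it consumed:

    #include <stddef.h>
    #include <stdint.h>

    static void sketch_deallocate(uintptr_t begin); /* hypothetical helper */

    static void sketch_flush_deallocation_log(uintptr_t* log, size_t* index_ptr)
    {
        for (size_t index = *index_ptr; index--;) {
            uintptr_t entry = log[index];
            log[index] = 0;   /* the flush always zeroes what it consumes */
            if (!entry)
                continue;     /* the scavenger already deallocated this one */
            sketch_deallocate(entry);
        }
        *index_ptr = 0;       /* only the owning thread resets the index */
    }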
| |
| ### Decommitting Unused Parts of TLCs |
| |
The scavenger can decommit any page of a TLC that consists entirely of stopped allocators. The resulting
all-zero state triggers a slow path in the local allocator and local view cache that then commits the page and
| rematerializes the allocator. This is painstakingly ensured; to keep this property you generally have to audit |
| the fast paths to see which parts of allocators they access, and make sure that the allocator goes to a slow |
| path if they are all zero. That slow path then has to check if the allocator is stopped or decommitted, and if |
| it is either, it grabs the TLC's scavenger lock and recommits and rematerializes. |
| |
| This feature is particularly valuable because of how big local allocators and local view caches are. When there |
| are a lot of heaps, this accounts for tens of MBs in some cases. So, being able to decommit unused parts is a |
| big deal. |
| |
| ### Decommitting Expendable Memory |
| |
| Segregated heaps maintain lookup tables that map "index", i.e. the object size divided by the heap's minalign, |
to allocator indices and directories. There are three tables (a lookup sketch follows this list):
| |
| - The index-to-allocator-index table. This only works for small-enough indices. |
| - The index-to-directory table. This also only works for small-enough indices. |
| - The medium index-to-directory-tuple binary search array. This works for the not-small-enough indices. |
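
A sketch of the lookup (invented names and flat tables; the real tables live in the segregated heap and, as
described below, can themselves be decommitted):

    #include <stddef.h>

    typedef struct {
        unsigned min_align_shift;
        size_t small_index_upper_bound;
        unsigned* index_to_allocator_index; /* small indices only */
        /* ... index-to-directory table and medium binary-search array ... */
    } sketch_segregated_heap;

    static unsigned sketch_medium_lookup(sketch_segregated_heap* heap, size_t index); /* hypothetical */

    static unsigned sketch_allocator_index_for_size(sketch_segregated_heap* heap, size_t size)
    {
        size_t index = (size + ((size_t)1 << heap->min_align_shift) - 1) >> heap->min_align_shift;
        if (index < heap->small_index_upper_bound)
            return heap->index_to_allocator_index[index]; /* O(1) in the common case */
        return sketch_medium_lookup(heap, index); /* binary search over directory tuples */
    }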
| |
It makes sense to let those tables get large. It might even make sense for each isoheap to have a large lookup
table, though currently they use a smaller threshold. To keep that from costing a lot of memory, we need to be
able to decommit memory that is only used by the lookup tables of heaps that nobody is using.
| |
| So, libpas has this weird thing called `pas_expendable_memory`, which allows us to allocate objects of immortal |
| memory that automatically get decommitted if we don't "touch" them frequently enough. The scavenger checks the |
| state of all expendable memory, and decommits those pages that haven't been used in a while. The expendable |
| memory algorithm is not wired up as a page sharing participant because the scavenger really needs to poke the |
| whole table maintained by the algorithm every time it does a tick; otherwise the algorithm would not work. |
| Fortunately, there's never a lot of this kind of memory. So, this isn't a performance problem as far as I know. |
| Also, it doesn't actually save that much memory right now -- but also, right now isoheaps use smaller-size |
| tables. |
| |
| This concludes the discussion of the scavenger. To summarize, the scavenger periodically: |
| |
| - Stops baseline allocators. |
| - Stops utility heap allocators. |
| - Stops TLC allocators. |
| - Flushes TLC deallocation logs. |
| - Decommits unused parts of TLCs. |
| - Decommits expendable memory. |
| - Asks the physical page sharing pool to scavenge. |
| |
| While it does these things, it notes whether anything is left in a state where it would be worthwhile to run |
again. A steady state might be that all empty pages have been decommitted, rather than that only the old
enough ones were. If a steady state looks like it's being reached, the scavenger first sleeps for
| a while, and then shuts down entirely. |
| |
| ## Megapages and Page Header Tables |
| |
Libpas provides non-large heaps with two fast and scalable ways of figuring out information about an object
from its address, for example during deallocation or during user queries like `malloc_size`.
| |
| 1. Megapages and the megapage table. Libpas describes every 16MB of memory using a two-bit enum called |
| `pas_fast_megapage_kind`. The zero state indicates that this address does not have a *fast megapage*. The |
| other two describe two different kinds of fast megapages, one where you know exactly the type of page it |
| is (small exclusive segregated) and one where you have to find out by asking the page header, but at least |
| you know that it's small (so usually 16KB). |
| 2. Page header table. This is a lock-free-to-read hashtable that tells you where to find the page header for |
| some page in memory. For pages that use page headers, we also use the page header table to find out if the |
| memory address is in memory owned by some kind of "page" (either a segregated page or a bitfit page). |
| |
| Usually, the small segregated and small bitfit configs use megapages. The medium segregated, medium bitfit, and |
| marge bitfit configs use page header tables. Large heaps don't use either, since they have the large map. |
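
A sketch of the megapage lookup, assuming a packed table with four two-bit entries per byte (the names and the
exact encoding are mine; the real table and enum are more elaborate):

    #include <stdint.h>

    #define SKETCH_MEGAPAGE_SHIFT 24u /* 16MB megapages */

    enum sketch_megapage_kind {
        sketch_not_a_fast_megapage,
        sketch_small_exclusive_segregated_megapage,
        sketch_small_other_megapage
    };

    static enum sketch_megapage_kind
    sketch_megapage_kind_for_address(const uint8_t* table, uintptr_t address)
    {
        uintptr_t megapage_index = address >> SKETCH_MEGAPAGE_SHIFT;
        uint8_t byte = table[megapage_index >> 2];
        return (enum sketch_megapage_kind)((byte >> ((megapage_index & 3) << 1)) & 3);
    }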
| |
| ## The Enumerator |
| |
| Libpas supports the libmalloc enumerator API. It even includes tests for it. The way that this is implemented |
| revolves around the `pas_enumerator` class. Many of the data structures that are needed by the enumerator have |
| APIs to support enumeration that take a `pas_enumerator*`. The idea behind how it works is conceptually easy: |
| just walk the heap -- which is possible to do accurately in libpas -- and report the metadata areas, the areas |
| that could contain objects, and the live objects. But, we have to do this while walking a heap in a foreign |
| process, and we get to see that heap through little copies of it that we request a callback to make for us. The |
| code for enumeration isn't all that pretty but at least it's easy to find most of it by looking for functions |
| and files that refer to `pas_enumerator`. |
| |
| But a challenge of enumeration is that it happens when the remote process is stopped at some arbitrary |
| point. That point has to be an instruction boundary -- so CPU memory model issues aren't a problem. But this |
| means that the whole algorithm has to ensure that at any instruction boundary, the enumerator will see the right |
| thing. Logic to help the enumerator still understand the heap at any instruction is spread throughout the |
| allocation algorithms. Even the trees and hashtables used by the large heap have special hacks to enable them to |
| be enumerable at any instruction boundary. |
| |
| The enumerator is maintained so that it's 100% accurate. Any discrepancy -- either in what objects are reported |
| live or what their sizes are -- gets flagged as an error by `test_pas`. The goal should be to maintain perfect |
| enumerator accuracy. This makes sense for two reasons: |
| |
| 1. I've yet to see a case where doing so is a performance or memory regression. |
| 2. It makes the enumerator easy to test. I don't know how to test an enumerator that is not accurate by design. |
| If it's accurate by design, then any discrepancy between the test's understanding of what is live and what |
| the enumerator reports can be flagged as a test failure. If it wasn't accurate by design, then I don't know |
| what it would mean for a test to fail or what the tests could even assert. |
| |
| ## The Basic Configuration Template |
| |
| Libpas's heap configs and page configs allow for a tremendous amount of flexibility. Things like the utility |
| heap and the `jit_heap` leverage this flexibility to do strange things. However, if you're using libpas to |
| create a normal malloc, then a lot of the configurability in the heap/page configs is too much. A "normal" |
malloc is one that is exposed as a normal API rather than being internal to libpas, and that manages memory that
doesn't have special properties (like being marked executable and not writable).
| |
| The basic template is provided by `pas_heap_config_utils.h`. To define a new config based on this template, you |
| need to: |
| |
| - Add the appropriate heap config and page config kinds to `pas_heap_config_kind.def`, |
| `pas_segregated_page_config_kind.def`, and `pas_bitfit_page_config_kind.def`. You also have to do this if you |
| add any kind of config, even one that doesn't use the template. |
| - Create the files `foo_heap_config.h` and `foo_heap_config.c`. These are mostly boilerplate. |
| |
| The header file usually looks like this: |
| |
| #define ISO_MINALIGN_SHIFT ((size_t)4) |
| #define ISO_MINALIGN_SIZE ((size_t)1 << ISO_MINALIGN_SHIFT) |
| |
| #define ISO_HEAP_CONFIG PAS_BASIC_HEAP_CONFIG( \ |
| iso, \ |
| .activate = pas_heap_config_utils_null_activate, \ |
| .get_type_size = pas_simple_type_as_heap_type_get_type_size, \ |
| .get_type_alignment = pas_simple_type_as_heap_type_get_type_alignment, \ |
| .dump_type = pas_simple_type_as_heap_type_dump, \ |
| .check_deallocation = true, \ |
| .small_segregated_min_align_shift = ISO_MINALIGN_SHIFT, \ |
| .small_segregated_sharing_shift = PAS_SMALL_SHARING_SHIFT, \ |
| .small_segregated_page_size = PAS_SMALL_PAGE_DEFAULT_SIZE, \ |
| .small_segregated_wasteage_handicap = PAS_SMALL_PAGE_HANDICAP, \ |
| .small_exclusive_segregated_logging_mode = pas_segregated_deallocation_size_oblivious_logging_mode, \ |
| .small_shared_segregated_logging_mode = pas_segregated_deallocation_no_logging_mode, \ |
| .small_exclusive_segregated_enable_empty_word_eligibility_optimization = false, \ |
| .small_shared_segregated_enable_empty_word_eligibility_optimization = false, \ |
| .small_segregated_use_reversed_current_word = PAS_ARM64, \ |
| .enable_view_cache = false, \ |
| .use_small_bitfit = true, \ |
| .small_bitfit_min_align_shift = ISO_MINALIGN_SHIFT, \ |
| .small_bitfit_page_size = PAS_SMALL_BITFIT_PAGE_DEFAULT_SIZE, \ |
| .medium_page_size = PAS_MEDIUM_PAGE_DEFAULT_SIZE, \ |
| .granule_size = PAS_GRANULE_DEFAULT_SIZE, \ |
| .use_medium_segregated = true, \ |
| .medium_segregated_min_align_shift = PAS_MIN_MEDIUM_ALIGN_SHIFT, \ |
| .medium_segregated_sharing_shift = PAS_MEDIUM_SHARING_SHIFT, \ |
| .medium_segregated_wasteage_handicap = PAS_MEDIUM_PAGE_HANDICAP, \ |
| .medium_exclusive_segregated_logging_mode = pas_segregated_deallocation_size_aware_logging_mode, \ |
| .medium_shared_segregated_logging_mode = pas_segregated_deallocation_no_logging_mode, \ |
| .use_medium_bitfit = true, \ |
| .medium_bitfit_min_align_shift = PAS_MIN_MEDIUM_ALIGN_SHIFT, \ |
| .use_marge_bitfit = true, \ |
| .marge_bitfit_min_align_shift = PAS_MIN_MARGE_ALIGN_SHIFT, \ |
| .marge_bitfit_page_size = PAS_MARGE_PAGE_DEFAULT_SIZE, \ |
| .pgm_enabled = false, \ |
.delegate_large_user_allocations = true)
| |
| PAS_API extern const pas_heap_config iso_heap_config; |
| |
| PAS_BASIC_HEAP_CONFIG_DECLARATIONS(iso, ISO); |
| |
| Note the use of `PAS_BASIC_HEAP_CONFIG`, which creates a config literal that automatically fills in a bunch of |
| heap config, segregated page config, and bitfit page config fields based on the arguments you pass to |
| `PAS_BASIC_HEAP_CONFIG`. The corresponding `.c` file looks like this: |
| |
| const pas_heap_config iso_heap_config = ISO_HEAP_CONFIG; |
| |
| PAS_BASIC_HEAP_CONFIG_DEFINITIONS( |
| iso, ISO, |
| .allocate_page_should_zero = false, |
| .intrinsic_view_cache_capacity = pas_heap_runtime_config_zero_view_cache_capacity); |
| |
| Note that this just configures whether new pages are zeroed and what the view cache capacity for the intrinsic |
heap is. The *intrinsic heap* is one of the four categories of heaps that the basic heap configuration template
| supports: |
| |
| - Intrinsic heaps are global singleton heaps, like the common heap for primitives. WebKit's fastMalloc bottoms |
| out in an intrinsic heap. |
| - Primitive heaps are heaps for primitive untyped values, but that aren't singletons. You can have many |
| primitive heaps. |
| - Typed heaps have a type, and the type has a fixed size and alignment. Typed heaps allow allocating single |
| instances of objects of that type or arrays of that type. |
| - Flex heaps are for objects with flexible array members. They pretend as if their type has size and alignment |
| equal to 1, but in practice they are used for objects that have some base size plus a variable-length array. |
| Note that libpas doesn't correctly manage flex memory in the large heap; we need a variant of the large heap |
| that knows that you cannot reuse flex memory between different sizes. |
| |
| The basic heap config template sets up some basic defaults for how heaps work: |
| |
| - It makes small segregated and small bitfit page configs put the page header at the beginning of the page and |
| it arranges to have those pages allocated out of megapages. |
| - It makes medium segregated, medium bitfit, and marge bitfit use page header tables. |
| - It sets up a way to find things like the page header tables from the enumerator. |
| - It sets up segregated shared page directories for each of the segregated page configs. |
| |
| The `bmalloc_heap_config` is an example of a configuration that uses the basic template. If we ever wanted to |
| put libpas into some other malloc library, we'd probably create a heap config for that library, and we would |
| probably base it on the basic heap config template (though we don't absolutely have to). |
| |
| ## JIT Heap Config |
| |
| The JIT heap config is for replacing the MetaAllocator as a way of doing executable memory allocation in WebKit. |
| It needs to satisfy two requirements of executable memory allocation: |
| |
| - The allocator cannot read or write the memory it manages, since that memory may have weird permissions at any |
| time. |
| - Clients of the executable allocator must be able to in-place shrink allocations. |
| |
| The large heap trivially supports both requirements. The bitfit heap trivially supports the second requirement, |
| and can be made to support the first requirement if we use page header tables for all kinds of memory, not just |
| medium or marge. So, the JIT heap config focuses on just using bitfit and large and it forces bitfit to use |
| page header tables even for the small bitfit page config. |
| |
| ## Security Considerations |
| |
| ### Probabilistic Guard Malloc |
| |
| Probabilistic Guard Malloc (PGM) is a new allocator designed to catch use after free attempts and out of bounds accesses. |
| It behaves similarly to AddressSanitizer (ASAN), but aims to have minimal runtime overhead. |
| |
| The design of PGM is quite simple. Each time an allocation is performed an additional guard page is added above and below the newly |
| allocated page(s). An allocation may span multiple pages. When a deallocation is performed, the page(s) allocated will be protected |
using mprotect to ensure that any use after free will trigger a crash. Virtual memory addresses are never reused, so we will never run
into a case where object 1 is freed, object 2 is allocated over the same address space, and a stale pointer to object 1 then accesses
memory that now belongs to object 2.
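
A sketch of the guard-page idea (this is not PGM's actual code; it only illustrates the guard pages and the
mprotect-on-free behavior described above, and omits the virtual/wasted memory accounting):

    #include <stddef.h>
    #include <sys/mman.h>

    static void* sketch_pgm_allocate(size_t size, size_t page_size)
    {
        size_t payload_size = (size + page_size - 1) & ~(page_size - 1);
        char* base = (char*)mmap(NULL, payload_size + 2 * page_size, PROT_READ | PROT_WRITE,
                                 MAP_PRIVATE | MAP_ANON, -1, 0);
        if (base == MAP_FAILED)
            return NULL;
        mprotect(base, page_size, PROT_NONE);                            /* lower guard page */
        mprotect(base + page_size + payload_size, page_size, PROT_NONE); /* upper guard page */
        return base + page_size;
    }

    static void sketch_pgm_deallocate(void* payload, size_t size, size_t page_size)
    {
        size_t payload_size = (size + page_size - 1) & ~(page_size - 1);
        /* Keep the mapping but make any use after free crash immediately; the
           virtual addresses are never handed out again. */
        mprotect(payload, payload_size, PROT_NONE);
    }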
| |
| PGM does add notable memory overhead. Each allocation, no matter the size, adds an additional 2 guard pages (8KB for X86_64 and 32KB |
| for ARM64). In addition, there may be free memory left over in the page(s) allocated for the user. This memory may not be used by any |
| other allocation. |
| |
| We added limits on virtual memory and wasted memory to help limit the memory impact on the overall system. Virtual memory for this |
| allocator is limited to 1GB. Wasted memory, which is the unused memory in the page(s) allocated by the user, is limited to 1MB. |
| These overall limits should ensure that the memory impact on the system is minimal, while helping to tackle the problems of catching |
| use after frees and out of bounds accesses. |
| |
| ## The Fast Paths |
| |
| All of the discussion in the previous sections is about the innards of libpas. But ultimately, clients want to |
| just call malloc-like and free-like functions to manage memory. Libpas provides fast path templates that actual |
| heap implementations reuse to provide malloc/free functions. The fast paths are: |
| |
| - `pas_try_allocate.h`, which is the single object allocation fast path for isoheaps. This function just takes a |
| heap and no size; it allocates one object of the size and alignment that the heap's type wants. |
| - `pas_try_allocate_array.h`, which is the array and aligned allocation fast path for isoheaps. You want to use |
| it with heaps that have a type, and that type has a size and alignment, and you want to allocate arrays of |
| that type or instances of that type with special alignment. |
| - `pas_try_allocate_primitive.h`, which is the primitive object allocation fast path for heaps that don't have |
| a type (i.e. they have the primitive type as their type -- the type says it has size and alignment equal to |
| 1). |
| - `pas_try_allocate_intrinsic.h`, which is the intrinsic heap allocation fast path. |
| - `pas_try_reallocate.h`, which provides variants of all of the allocators that reallocate memory. |
| - `pas_deallocate.h`, which provides the fast path for `free`. |
| - `pas_get_allocation_size.h`, which is the fast path for `malloc_size`. |
| |
One thing to remember when dealing with the fast paths is that they are engineered so that malloc/free functions
have no stack frame, no callee saves, and no need to save the LR/FP to the stack. To facilitate this,
| we have the fast path call an inline-only fast path, and if that fails, we call a "casual case". The inline-only |
| fast path makes no out-of-line function calls, since if it did, we'd need a stack frame. The only slow call (to |
| the casual case) is a tail call. For example: |
| |
| static PAS_ALWAYS_INLINE void* bmalloc_try_allocate_inline(size_t size) |
| { |
| pas_allocation_result result; |
| result = bmalloc_try_allocate_impl_inline_only(size, 1); |
| if (PAS_LIKELY(result.did_succeed)) |
| return (void*)result.begin; |
| return bmalloc_try_allocate_casual(size); |
| } |
| |
| The way that the `bmalloc_try_allocate_impl_inline_only` and `bmalloc_try_allocate_casual` functions are created |
| is with: |
| |
| PAS_CREATE_TRY_ALLOCATE_INTRINSIC( |
| bmalloc_try_allocate_impl, |
| BMALLOC_HEAP_CONFIG, |
| &bmalloc_intrinsic_runtime_config.base, |
| &bmalloc_allocator_counts, |
| pas_allocation_result_identity, |
| &bmalloc_common_primitive_heap, |
| &bmalloc_common_primitive_heap_support, |
| pas_intrinsic_heap_is_designated); |
| |
| All allocation fast paths require this kind of macro that creates a bunch of functions -- both the inline paths |
| and the out-of-line paths. The deallocation, reallocation, and other fast paths are simpler. For example, |
| deallocation is just: |
| |
| static PAS_ALWAYS_INLINE void bmalloc_deallocate_inline(void* ptr) |
| { |
| pas_deallocate(ptr, BMALLOC_HEAP_CONFIG); |
| } |
| |
| If you look at `pas_deallocate`, you'll see cleverness that ensures that the slow path call is a tail call, |
| similarly to how allocators work. However, for deallocation, I haven't had the need to make the slow call |
| explicit in the client side (the way that `bmalloc_try_allocate_inline` has to explicitly call the slow path). |
| |
| This concludes the discussion of libpas design. |
| |
| # Testing Libpas |
| |
| I've tried to write test cases for every behavior in libpas, to the point that you should feel comfortable |
| dropping a new libpas in WebKit (or wherever) if `test_pas` passes. |
| |
| `test_pas` is a white-box component, regression, and unit test suite. It's allowed to call any libpas function, |
| even internal functions, and sometimes functions that libpas exposes only for the test suite. |
| |
| Libpas testing errs on the side of being comprehensive even if this creates annoying situations. Many tests |
| assert detailed things about how many objects fit in a page, what an object's offset is in a page, and things |
| like that. This means that some behavior changes in libpas that aren't in any way wrong will set off errors in |
| the test suite. So, it's common to have to rebase tests when making libpas changes. |
| |
| The libpas test suite is written in C++ and uses its own test harness that forks for each test, so each test |
| runs in a totally pristine state. Also, the test suite can use `malloc` and `new` as much as it likes, since in |
| the test suite, libpas does not replace `malloc` and `free`. |
| |
| The most important libpas tests are the so-called *chaos* tests, which randomly create and destroy objects and |
| assert that the heap's state is still sane (like that no live objects overlap, that all live objects are |
| enumerable, etc). |
| |
| # Conclusion |
| |
| Libpas is a beast of a malloc, designed for speed, memory efficiency, and type safety. May whoever maintains it |
| find some joy in this insane codebase! |
| |