PartitionAlloc Design

This document describes PartitionAlloc at a high level, with some architectural details. For implementation details, see the comments in partition_alloc_constants.h.

Overview

PartitionAlloc is a memory allocator optimized for space efficiency, allocation latency, and security.

This document largely avoids defining terms; consult the glossary for a complete reference.

Performance

PartitionAlloc is designed to be extremely fast in its fast paths. The fast paths of allocation and deallocation require very few (reasonably predictable) branches. The number of operations in the fast paths is minimal, leading to the possibility of inlining.

general architecture

However, even the fast path isn't the fastest, because it requires taking a per-partition lock. Although we optimized the lock, there was still room for improvement; to this end, we introduced the thread cache. The thread cache has been tailored to satisfy a vast majority of requests by allocating from and releasing memory to the main allocator in batches, amortizing lock acquisition and further improving locality while not trapping excess memory.

Security

Security is one of the important goals of PartitionAlloc.

PartitionAlloc guarantees that different partitions exist in different regions of the process's address space. When the caller has freed all objects contained in a page in a partition, PartitionAlloc returns the physical memory to the operating system, but continues to reserve the region of address space. PartitionAlloc will only reuse an address space region for the same partition.

Similarly, one page can contain only objects from the same bucket. When freed, PartitionAlloc returns the physical memory, but continues to reserve the region for this very bucket.

The above techniques help avoid type confusion attacks. Note, however, these apply only to normal buckets and not to direct map, as it'd waste too much address space.

PartitionAlloc also guarantees that:

  • Linear overflows/underflows cannot corrupt into, out of, or between partitions. There are guard pages at the beginning and the end of each memory region owned by a partition.

  • Linear overflows/underflows cannot corrupt the allocation metadata. PartitionAlloc records metadata in a dedicated, out-of-line region (not adjacent to objects), surrounded by guard pages. (Freelist pointers are an exception.)

  • Partial pointer overwrite of freelist pointer should fault.

  • Direct map allocations have guard pages at the beginning and the end.

Alignment

PartitionAlloc guarantees that returned pointers are aligned on partition_alloc::internal::kAlignment boundary (typically 16B on 64-bit systems, and 8B on 32-bit).

PartitionAlloc also supports higher levels of alignment, that can be requested via PartitionAlloc::AlignedAllocWithFlags() or platform-specific APIs (such as posix_memalign()). The requested alignment has to be a power of two. PartitionAlloc reserves the right to round up the requested size to the nearest power of two, greater than or equal to the requested alignment. This may be wasteful, but allows taking advantage of natural PartitionAlloc alignment guarantees. Allocations with an alignment requirement greater than partition_alloc::internal::kAlignment are expected to be very rare.

Architecture

Layout in Memory

PartitionAlloc handles normal buckets by reserving (not committing) 2MiB super pages. Each super page is split into partition pages. The first and the last partition page are permanently inaccessible and serve as guard pages, with the exception of one system page in the middle of the first partition page that holds metadata (32B struct per partition page).

anatomy of a super page

  • The slot span numbers provide a visual hint of their size (in partition pages).
  • Colors provide a visual hint of the bucket to which the slot span belongs.
    • Although only five colors are shown, in reality, a super page holds tens of slot spans, some of which belong to the same bucket.
  • The system page that holds metadata tracks each partition page with one 32B PartitionPage struct, which is either
  • Gray fill denotes guard pages (one partition page each at the head and tail of each super page).
  • In some configurations, PartitionAlloc stores more metadata than can fit in the one system page at the front. These are the bitmaps for StarScan and MTECheckedPtr<T>, and they are relegated to the head of what would otherwise be usable space for slot spans. One, both, or none of these bitmaps may be present, depending on build configuration, runtime configuration, and type of allocation. See SuperPagePayloadBegin() for details.

As allocation requests arrive, there is eventually a need to allocate a new slot span. Address space for such a slot span is carved out from the last super page. If not enough space, a new super page is allocated. Due to varying sizes of slot span, this may lead to leaving space unused (we never go back to fill previous super pages), which is fine because this memory is merely reserved, which is far less precious than committed memory. Note also that address space reserved for a slot span is never released, even if the slot span isn't used for a long time.

All slots in a newly allocated slot span are free, i.e. available for allocation.

Freelist Pointers

All free slots within a slot span are chained into a singly-linked free-list, by writing the next pointer at the beginning of each slot, and the head of the list is written in the metadata struct.

However, writing a pointer in each free slot of a newly allocated span would require committing and faulting in physical pages upfront, which would be unacceptable. Therefore, PartitionAlloc has a concept of provisioning slots. Only provisioned slots are chained into the freelist. Once provisioned slots in a span are depleted, then another page worth of slots is provisioned (note, a slot that crosses a page boundary only gets provisioned with slots of the next page). See PartitionBucket::ProvisionMoreSlotsAndAllocOne() for more details.

Freelist pointers are stored at the beginning of each free slot. As such, they are the only metadata that is inline, i.e. stored among the objects. This makes them prone to overruns. On little-endian systems, the pointers are encoded by reversing byte order, so that partial overruns will very likely result in destroying the pointer, as opposed to forming a valid pointer to a nearby location.

Furthermore, a shadow of a freelist pointer is stored next to it, encoded in a different manner. This helps PartitionAlloc detect corruptions.

Slot Span States

A slot span can be in any of 4 states:

  • Full. A full span has no free slots.
  • Empty. An empty span has no allocated slots, only free slots.
  • Active. An active span is anything in between the above two.
  • Decommitted. A decommitted span is a special case of an empty span, where all pages are decommitted from memory.

PartitionAlloc prioritizes getting an available slot from an active span, over an empty one, in hope that the latter can be soon transitioned into a decommitted state, thus releasing memory. There is no mechanism, however, to prioritize selection of a slot span based on the number of already allocated slots.

An empty span becomes decommitted either when there are too many empty spans (FIFO), or when PartitionRoot::PurgeMemory() gets invoked periodically (or in low memory pressure conditions). An allocation can be satisfied from a decommitted span if there are no active or empty spans available. The slot provisioning mechanism kicks back in, committing the pages gradually as needed, and the span becomes active. (There is currently no other way to unprovision slots than decommitting the entire span).

As mentioned above, a bucket is a collection of slot spans containing slots of the same size. In fact, each bucket has 3 linked-lists, chaining active, empty and decommitted spans (see PartitionBucket::*_slot_spans_head). There is no need for a full span list. The lists are updated lazily. An empty, decommitted or full span may stay on the active list for some time, until PartitionBucket::SetNewActiveSlotSpan() encounters it. A decommitted span may stay on the empty list for some time, until PartitionBucket<thread_safe>::SlowPathAlloc() encounters it. However, the inaccuracy can't happen in the other direction, i.e. an active span can only be on the active list, and an empty span can only be on the active or empty list.