InternalDocs/profiling_binary_format.md - external/github.com/python/cpython - Git at Google

 # Profiling Binary Format

 The profiling module includes a binary file format for storing sampling
 profiler data. This document describes the format's structure and the
 design decisions behind it.

 The implementation is in
 [`Modules/_remote_debugging/binary_io_writer.c`](../Modules/_remote_debugging/binary_io_writer.c)
 and [`Modules/_remote_debugging/binary_io_reader.c`](../Modules/_remote_debugging/binary_io_reader.c),
 with declarations in
 [`Modules/_remote_debugging/binary_io.h`](../Modules/_remote_debugging/binary_io.h).

 ## Overview

 The sampling profiler can generate enormous amounts of data. A typical
 profiling session sampling at 1000 Hz for 60 seconds produces 60,000 samples.
 Each sample contains a full call stack, often 20-50 frames deep, and each
 frame includes a filename, function name, and line number. In a text-based
 format like collapsed stacks, this would mean repeating the same long file
 paths and function names thousands of times.

 The binary format addresses this through two key strategies:

 1. **Deduplication**: Strings and frames are stored once in lookup tables,
    then referenced by small integer indices. A 100-character file path that
    appears in 50,000 samples is stored once, not 50,000 times.

 2. **Compact encoding**: Variable-length integers (varints) encode small
    values in fewer bytes. Since most indices are small (under 128), they
    typically need only one byte instead of four.

 Together with optional zstd compression, these techniques reduce file sizes
 by 10-50x compared to text formats while also enabling faster I/O.

 ## File Layout

 The file consists of five sections:

 ```
 +------------------+  Offset 0
 |     Header       |  64 bytes (fixed)
 +------------------+  Offset 64
 |                  |
 |   Sample Data    |  Variable size (optionally compressed)
 |                  |
 +------------------+  string_table_offset
 |   String Table   |  Variable size
 +------------------+  frame_table_offset
 |   Frame Table    |  Variable size
 +------------------+  file_size - 32
 |     Footer       |  32 bytes (fixed)
 +------------------+  file_size
 ```

 The layout is designed for streaming writes during profiling. The profiler
 cannot know in advance how many unique strings or frames will be encountered,
 so these tables must be built incrementally and written at the end.

 The header comes first so readers can quickly validate the file and locate
 the metadata tables. The sample data follows immediately, allowing the writer
 to stream samples directly to disk (or through a compression stream) without
 buffering the entire dataset in memory.

 The string and frame tables are placed after sample data because they grow
 as new unique entries are discovered during profiling. By deferring their
 output until finalization, the writer avoids the complexity of reserving
 space or rewriting portions of the file.

 The footer at the end contains counts needed to allocate arrays before
 parsing the tables. Placing it at a fixed offset from the end (rather than
 at a variable offset recorded in the header) means readers can locate it
 with a single seek to `file_size - 32`, without first reading the header.

 ## Header

 ```
  Offset   Size   Type      Description
 +--------+------+---------+----------------------------------------+
 |    0   |  4   | uint32  | Magic number (0x54414348 = "TACH")     |
 |    4   |  4   | uint32  | Format version                         |
 |    8   |  4   | bytes   | Python version (major, minor, micro,   |
 |        |      |         | reserved)                              |
 |   12   |  8   | uint64  | Start timestamp (microseconds)         |
 |   20   |  8   | uint64  | Sample interval (microseconds)         |
 |   28   |  4   | uint32  | Total sample count                     |
 |   32   |  4   | uint32  | Thread count                           |
 |   36   |  8   | uint64  | String table offset                    |
 |   44   |  8   | uint64  | Frame table offset                     |
 |   52   |  4   | uint32  | Compression type (0=none, 1=zstd)      |
 |   56   |  8   | bytes   | Reserved (zero-filled)                 |
 +--------+------+---------+----------------------------------------+
 ```

 The magic number `0x54414348` ("TACH" for Tachyon) identifies the file format
 and also serves as an **endianness marker**. When read on a system with
 different byte order than the writer, it appears as `0x48434154`. The reader
 uses this to detect cross-endian files and automatically byte-swap all
 multi-byte integer fields.

 The Python version field records the major, minor, and micro version numbers
 of the Python interpreter that generated the file. This allows analysis tools
 to detect version mismatches when replaying data collected on a different
 Python version, which may have different internal structures or behaviors.

 The header is written as zeros initially, then overwritten with actual values
 during finalization. This requires the output stream to be seekable, which
 is acceptable since the format targets regular files rather than pipes or
 network streams.

 ## Sample Data

 Sample data begins at offset 64 and extends to `string_table_offset`. Samples
 use delta compression to minimize redundancy when consecutive samples from the
 same thread have identical or similar call stacks.

 ### Stack Encoding Types

 Each sample record begins with thread identification, then an encoding byte:

 | Code | Name | Description |
 |------|------|-------------|
 | 0x00 | REPEAT | RLE: identical stack repeated N times |
 | 0x01 | FULL | Complete stack (first sample or no match) |
 | 0x02 | SUFFIX | Shares N frames from bottom of previous stack |
 | 0x03 | POP_PUSH | Remove M frames from top, add N new frames |

 ### Record Formats

 **REPEAT (0x00) - Run-Length Encoded Identical Stacks:**
 ```
 +-----------------+-----------+----------------------------------------+
 | thread_id       | 8 bytes   | Thread identifier (uint64, fixed)      |
 | interpreter_id  | 4 bytes   | Interpreter ID (uint32, fixed)         |
 | encoding        | 1 byte    | 0x00 (REPEAT)                          |
 | count           | varint    | Number of samples in this RLE group    |
 | samples         | varies    | Interleaved: [delta: varint, status: 1]|
 |                 |           | repeated count times                   |
 +-----------------+-----------+----------------------------------------+
 ```
 The stack is inherited from this thread's previous sample. Each sample in the
 group gets its own timestamp delta and status byte, stored as interleaved pairs
 (delta1, status1, delta2, status2, ...) rather than separate arrays.

 **FULL (0x01) - Complete Stack:**
 ```
 +-----------------+-----------+----------------------------------------+
 | thread_id       | 8 bytes   | Thread identifier (uint64, fixed)      |
 | interpreter_id  | 4 bytes   | Interpreter ID (uint32, fixed)         |
 | encoding        | 1 byte    | 0x01 (FULL)                            |
 | timestamp_delta | varint    | Microseconds since thread's last sample|
 | status          | 1 byte    | Thread state flags                     |
 | stack_depth     | varint    | Number of frames in call stack         |
 | frame_indices   | varint[]  | Array of frame table indices           |
 +-----------------+-----------+----------------------------------------+
 ```
 Used for the first sample from a thread, or when delta encoding would not
 provide savings.

 **SUFFIX (0x02) - Shared Suffix Match:**
 ```
 +-----------------+-----------+----------------------------------------+
 | thread_id       | 8 bytes   | Thread identifier (uint64, fixed)      |
 | interpreter_id  | 4 bytes   | Interpreter ID (uint32, fixed)         |
 | encoding        | 1 byte    | 0x02 (SUFFIX)                          |
 | timestamp_delta | varint    | Microseconds since thread's last sample|
 | status          | 1 byte    | Thread state flags                     |
 | shared_count    | varint    | Frames shared from bottom of prev stack|
 | new_count       | varint    | New frames at top of stack             |
 | new_frames      | varint[]  | Array of new_count frame indices       |
 +-----------------+-----------+----------------------------------------+
 ```
 Used when a function call added frames to the top of the stack. The shared
 frames from the previous stack are kept, and new frames are prepended.

 **POP_PUSH (0x03) - Pop and Push:**
 ```
 +-----------------+-----------+----------------------------------------+
 | thread_id       | 8 bytes   | Thread identifier (uint64, fixed)      |
 | interpreter_id  | 4 bytes   | Interpreter ID (uint32, fixed)         |
 | encoding        | 1 byte    | 0x03 (POP_PUSH)                        |
 | timestamp_delta | varint    | Microseconds since thread's last sample|
 | status          | 1 byte    | Thread state flags                     |
 | pop_count       | varint    | Frames to remove from top of prev stack|
 | push_count      | varint    | New frames to add at top               |
 | new_frames      | varint[]  | Array of push_count frame indices      |
 +-----------------+-----------+----------------------------------------+
 ```
 Used when the code path changed: some frames were popped (function returns)
 and new frames were pushed (different function calls).

 ### Thread and Interpreter Identification

 Thread IDs are 64-bit values that can be large (memory addresses on some
 platforms) and vary unpredictably. Using a fixed 8-byte encoding avoids
 the overhead of varint encoding for large values and simplifies parsing
 since the reader knows exactly where each field begins.

 The interpreter ID identifies which Python sub-interpreter the thread
 belongs to, allowing analysis tools to separate activity across interpreters
 in processes using multiple sub-interpreters.

 ### Status Byte

 The status byte is a bitfield encoding thread state at sample time:

 | Bit | Flag                  | Meaning                                    |
 |-----|-----------------------|--------------------------------------------|
 |  0  | THREAD_STATUS_HAS_GIL | Thread holds the GIL (Global Interpreter Lock) |
 |  1  | THREAD_STATUS_ON_CPU  | Thread is actively running on a CPU core   |
 |  2  | THREAD_STATUS_UNKNOWN | Thread state could not be determined       |
 |  3  | THREAD_STATUS_GIL_REQUESTED | Thread is waiting to acquire the GIL  |
 |  4  | THREAD_STATUS_HAS_EXCEPTION | Thread has a pending exception         |

 Multiple flags can be set simultaneously (e.g., a thread can hold the GIL
 while also running on CPU). Analysis tools use these to filter samples or
 visualize thread states over time.

 ### Timestamp Delta Encoding

 Timestamps use delta encoding rather than absolute values. Absolute
 timestamps in microseconds require 8 bytes each, but consecutive samples
 from the same thread are typically separated by the sampling interval
 (e.g., 1000 microseconds), so the delta between them is small and fits
 in 1-2 varint bytes. The writer tracks the previous timestamp for each
 thread separately. The first sample from a thread encodes its delta from
 the profiling start time; subsequent samples encode the delta from that
 thread's previous sample. This per-thread tracking is necessary because
 samples are interleaved across threads in arrival order, not grouped by
 thread.

 For REPEAT (RLE) records, timestamp deltas and status bytes are stored as
 interleaved pairs (delta, status, delta, status, ...) - one pair per
 repeated sample - allowing efficient batching while preserving the exact
 timing and state of each sample.

 ### Frame Indexing

 Each frame in a call stack is represented by an index into the frame table
 rather than inline data. This provides massive space savings because call
 stacks are highly repetitive: the same function appears in many samples
 (hot functions), call stacks often share common prefixes (main -> app ->
 handler -> ...), and recursive functions create repeated frame sequences.
 A frame index is typically 1-2 varint bytes. Inline frame data would be
 20-200+ bytes (two strings plus a line number). For a profile with 100,000
 samples averaging 30 frames each, this reduces frame data from potentially
 gigabytes to tens of megabytes.

 Frame indices are written innermost-first (the currently executing frame
 has index 0 in the array). This ordering works well with delta compression:
 function calls typically add frames at the top (index 0), while shared
 frames remain at the bottom.

 ## String Table

 The string table stores deduplicated UTF-8 strings (filenames and function
 names). It begins at `string_table_offset` and contains entries in order of
 their assignment during writing:

 ```
 +----------------+
 | length: varint |
 | data: bytes    |
 +----------------+  (repeated for each string)
 ```

 Strings are stored in the order they were first encountered during writing.
 The first unique filename gets index 0, the second gets index 1, and so on.
 Length-prefixing (rather than null-termination) allows strings containing
 null bytes and enables readers to allocate exact-sized buffers. The varint
 length encoding means short strings (under 128 bytes) need only one length
 byte.

 ## Frame Table

 The frame table stores deduplicated frame entries with full source position
 information and bytecode opcode:

 ```
 +----------------------------+
 | filename_idx: varint       |
 | funcname_idx: varint       |
 | lineno: svarint            |
 | end_lineno_delta: svarint  |
 | column: svarint            |
 | end_column_delta: svarint  |
 | opcode: u8                 |
 +----------------------------+  (repeated for each frame)
 ```

 ### Field Definitions

 | Field            | Type          | Description                                              |
 |------------------|---------------|----------------------------------------------------------|
 | filename_idx     | varint        | Index into string table for file name                    |
 | funcname_idx     | varint        | Index into string table for function name                |
 | lineno           | zigzag varint | Start line number (-1 for synthetic frames)              |
 | end_lineno_delta | zigzag varint | Delta from lineno (end_lineno = lineno + delta)          |
 | column           | zigzag varint | Start column offset in UTF-8 bytes (-1 if not available) |
 | end_column_delta | zigzag varint | Delta from column (end_column = column + delta)          |
 | opcode           | u8            | Python bytecode opcode (0-254) or 255 for None           |

 ### Delta Encoding

 Position end values use delta encoding for efficiency:

 - `end_lineno = lineno + end_lineno_delta`
 - `end_column = column + end_column_delta`

 Typical values:
 - `end_lineno_delta`: Usually 0 (single-line expressions) → encodes to 1 byte
 - `end_column_delta`: Usually 5-20 (expression width) → encodes to 1 byte

 This saves ~1-2 bytes per frame compared to absolute encoding. When the base
 value (lineno or column) is -1 (not available), the delta is stored as 0 and
 the reconstructed value is -1.

 ### Sentinel Values

 - `opcode = 255`: No opcode captured
 - `lineno = -1`: Synthetic frame (no source location)
 - `column = -1`: Column offset not available

 ### Deduplication

 Each unique (filename, funcname, lineno, end_lineno, column, end_column,
 opcode) combination gets one entry. This enables instruction-level profiling
 where multiple bytecode instructions on the same line can be distinguished.

 Strings and frames are deduplicated separately because they have different
 cardinalities and reference patterns. A codebase might have hundreds of
 unique source files but thousands of unique functions. Many functions share
 the same filename, so storing the filename index in each frame entry (rather
 than the full string) provides an additional layer of deduplication. A frame
 entry is typically 7-9 bytes rather than two full strings plus location data.

 ### Size Analysis

 Typical frame size with delta encoding:
 - file_idx: 1-2 bytes
 - func_idx: 1-2 bytes
 - lineno: 1-2 bytes
 - end_lineno_delta: 1 byte (usually 0)
 - column: 1 byte (usually < 64)
 - end_column_delta: 1 byte (usually < 64)
 - opcode: 1 byte

 **Total: ~7-9 bytes per frame**

 Line numbers and columns use signed varint (zigzag encoding) to handle
 sentinel values efficiently. Synthetic frames—generated frames that don't
 correspond directly to Python source code, such as C extension boundaries or
 internal interpreter frames—use -1 to indicate the absence of a source
 location. Zigzag encoding ensures these small negative values encode
 efficiently (−1 becomes 1, which is one byte) rather than requiring the
 maximum varint length.

 ## Footer

 ```
  Offset   Size   Type      Description
 +--------+------+---------+----------------------------------------+
 |    0   |  4   | uint32  | String count                           |
 |    4   |  4   | uint32  | Frame count                            |
 |    8   |  8   | uint64  | Total file size                        |
 |   16   | 16   | bytes   | Checksum (reserved, currently zeros)   |
 +--------+------+---------+----------------------------------------+
 ```

 The string and frame counts allow readers to pre-allocate arrays of the
 correct size before parsing the tables. Without these counts, readers would
 need to either scan the tables twice (once to count, once to parse) or use
 dynamically-growing arrays.

 The file size field provides a consistency check: if the actual file size
 does not match, the file may be truncated or corrupted.

 The checksum field is reserved for future use. A checksum would allow
 detection of corruption but adds complexity and computation cost. The
 current implementation leaves this as zeros.

 ## Variable-Length Integer Encoding

 The format uses LEB128 (Little Endian Base 128) for unsigned integers and
 zigzag + LEB128 for signed integers. These encodings are widely used
 (Protocol Buffers, DWARF debug info, WebAssembly) and well-understood.

 ### Unsigned Varint (LEB128)

 Each byte stores 7 bits of data. The high bit indicates whether more bytes
 follow:

 ```
 Value        Encoded bytes
 0-127        [0xxxxxxx]                    (1 byte)
 128-16383    [1xxxxxxx] [0xxxxxxx]         (2 bytes)
 16384+       [1xxxxxxx] [1xxxxxxx] ...     (3+ bytes)
 ```

 Most indices in profiling data are small. A profile with 1000 unique frames
 needs at most 2 bytes per frame index. The common case (indices under 128)
 needs only 1 byte.

 ### Signed Varint (Zigzag)

 Standard LEB128 encodes −1 as a very large unsigned value, requiring many
 bytes. Zigzag encoding interleaves positive and negative values:

 ```
  0 -> 0    -1 -> 1     1 -> 2    -2 -> 3     2 -> 4
 ```

 This ensures small-magnitude values (whether positive or negative) encode
 in few bytes.

 ## Compression

 When compression is enabled, the sample data region contains a zstd stream.
 The string table, frame table, and footer remain uncompressed so readers can
 access metadata without decompressing the entire file. A tool that only needs
 to report "this file contains 50,000 samples of 3 threads" can read the header
 and footer without touching the compressed sample data. This also simplifies
 the format: the header's offset fields point directly to the tables rather
 than to positions within a decompressed stream.

 Zstd provides an excellent balance of compression ratio and speed. Profiling
 data compresses very well (often 5-10x) due to repetitive patterns: the same
 small set of frame indices appears repeatedly, and delta-encoded timestamps
 cluster around the sampling interval. Zstd's streaming API allows compression
 without buffering the entire dataset. The writer feeds sample data through
 the compressor incrementally, flushing compressed chunks to disk as they
 become available.

 Level 5 compression is used as a default. Lower levels (1-3) are faster but
 compress less; higher levels (6+) compress more but slow down writing. Level
 5 provides good compression with minimal impact on profiling overhead.

 ## Reading and Writing

 ### Writing

 1. Open the output file and write 64 zero bytes as a placeholder header
 2. Initialize empty string and frame dictionaries for deduplication
 3. For each sample:
    - Intern any new strings, assigning sequential indices
    - Intern any new frames, assigning sequential indices
    - Encode the sample record and write to the buffer
    - Flush the buffer through compression (if enabled) when full
 4. Flush remaining buffered data and finalize compression
 5. Write the string table (length-prefixed strings in index order)
 6. Write the frame table (varint-encoded entries in index order)
 7. Write the footer with final counts
 8. Seek to offset 0 and write the header with actual values

 The writer maintains two dictionaries: one mapping strings to indices, one
 mapping (filename_idx, funcname_idx, lineno) tuples to frame indices. These
 enable O(1) lookup during interning.

 ### Reading

 1. Read the header magic number to detect endianness (set `needs_swap` flag
    if the magic appears byte-swapped)
 2. Validate version and read remaining header fields (byte-swapping if needed)
 3. Seek to end − 32 and read the footer (byte-swapping counts if needed)
 4. Allocate string array of `string_count` elements
 5. Parse the string table, populating the array
 6. Allocate frame array of `frame_count * 3` uint32 elements
 7. Parse the frame table, populating the array
 8. If compressed, decompress the sample data region
 9. Iterate through samples, resolving indices to strings/frames
    (byte-swapping thread_id and interpreter_id if needed)

 The reader builds lookup arrays rather than dictionaries since it only needs
 index-to-value mapping, not value-to-index.

 ## Platform Considerations

 ### Byte Ordering and Cross-Platform Portability

 The binary format uses **native byte order** for all multi-byte integer
 fields when writing. However, the reader supports **cross-endian reading**:
 files written on a little-endian system (x86, ARM) can be read on a
 big-endian system (s390x, PowerPC), and vice versa.

 The magic number doubles as an endianness marker. When read on a system with
 different byte order, it appears byte-swapped (`0x48434154` instead of
 `0x54414348`). The reader detects this and automatically byte-swaps all
 fixed-width integer fields during parsing.

 Writers must use `memcpy()` from properly-sized integer types when writing
 fixed-width integer fields. When the source variable's type differs from the
 field width (e.g., `size_t` written as 4 bytes), explicit casting to the
 correct type (e.g., `uint32_t`) is required before `memcpy()`. On big-endian
 systems, copying from an oversized type would copy the wrong bytes—high-order
 zeros instead of the actual value.

 The reader tracks whether byte-swapping is needed via a `needs_swap` flag set
 during header parsing. All fixed-width fields in the header, footer, and
 sample data are conditionally byte-swapped using Python's internal byte-swap
 functions (`_Py_bswap32`, `_Py_bswap64` from `pycore_bitutils.h`).

 Variable-length integers (varints) are byte-order independent since they
 encode values one byte at a time using the LEB128 scheme, so they require
 no special handling for cross-endian reading.

 ### Memory-Mapped I/O

 On Unix systems (Linux, macOS), the reader uses `mmap()` to map the file
 into the process address space. The kernel handles paging data in and out
 as needed, no explicit read() calls or buffer management are required,
 multiple readers can share the same physical pages, and sequential access
 patterns benefit from kernel read-ahead.

 The implementation uses `madvise()` to hint the access pattern to the kernel:
 `MADV_SEQUENTIAL` indicates the file will be read linearly, enabling
 aggressive read-ahead. `MADV_WILLNEED` requests pre-faulting of pages.
 On Linux, `MAP_POPULATE` pre-faults all pages at mmap time rather than on
 first access, moving page fault overhead from the parsing loop to the
 initial mapping for more predictable performance. For large files (over
 32 MB), `MADV_HUGEPAGE` requests transparent huge pages (2 MB instead of
 4 KB) to reduce TLB pressure when accessing large amounts of data.

 On Windows, the implementation falls back to standard file I/O with full
 file buffering. Profiling data files are typically small enough (tens to
 hundreds of megabytes) that this is acceptable.

 The writer uses a 512 KB buffer to batch small writes. Each sample record
 is typically tens of bytes; writing these individually would incur excessive
 syscall overhead. The buffer accumulates data until full, then flushes in
 one write() call (or feeds through the compression stream).

 ## Future Considerations

 The format reserves space for future extensions. The 12 reserved bytes in
 the header could hold additional metadata. The 16-byte checksum field in
 the footer is currently unused. The version field allows incompatible
 changes with graceful rejection. New compression types could be added
 (compression_type > 1).

 Any changes that alter the meaning of existing fields or the parsing logic
 should increment the version number to prevent older readers from
 misinterpreting new files.