Profiling Binary Format

The profiling module includes a binary file format for storing sampling profiler data. This document describes the format's structure and the design decisions behind it.

The implementation is in Modules/_remote_debugging/binary_io_writer.c and Modules/_remote_debugging/binary_io_reader.c, with declarations in Modules/_remote_debugging/binary_io.h.

Overview

The sampling profiler can generate enormous amounts of data. A typical profiling session sampling at 1000 Hz for 60 seconds produces 60,000 samples. Each sample contains a full call stack, often 20-50 frames deep, and each frame includes a filename, function name, and line number. In a text-based format like collapsed stacks, this would mean repeating the same long file paths and function names thousands of times.

The binary format addresses this through two key strategies:

  1. Deduplication: Strings and frames are stored once in lookup tables, then referenced by small integer indices. A 100-character file path that appears in 50,000 samples is stored once, not 50,000 times.

  2. Compact encoding: Variable-length integers (varints) encode small values in fewer bytes. Since most indices are small (under 128), they typically need only one byte instead of four.

Together with optional zstd compression, these techniques reduce file sizes by 10-50x compared to text formats while also enabling faster I/O.

File Layout

The file consists of five sections:

+------------------+  Offset 0
|     Header       |  64 bytes (fixed)
+------------------+  Offset 64
|                  |
|   Sample Data    |  Variable size (optionally compressed)
|                  |
+------------------+  string_table_offset
|   String Table   |  Variable size
+------------------+  frame_table_offset
|   Frame Table    |  Variable size
+------------------+  file_size - 32
|     Footer       |  32 bytes (fixed)
+------------------+  file_size

The layout is designed for streaming writes during profiling. The profiler cannot know in advance how many unique strings or frames will be encountered, so these tables must be built incrementally and written at the end.

The header comes first so readers can quickly validate the file and locate the metadata tables. The sample data follows immediately, allowing the writer to stream samples directly to disk (or through a compression stream) without buffering the entire dataset in memory.

The string and frame tables are placed after sample data because they grow as new unique entries are discovered during profiling. By deferring their output until finalization, the writer avoids the complexity of reserving space or rewriting portions of the file.

The footer at the end contains counts needed to allocate arrays before parsing the tables. Placing it at a fixed offset from the end (rather than at a variable offset recorded in the header) means readers can locate it with a single seek to file_size - 32, without first reading the header.

Header

 Offset   Size   Type      Description
+--------+------+---------+----------------------------------------+
|    0   |  4   | uint32  | Magic number (0x54414348 = "TACH")     |
|    4   |  4   | uint32  | Format version                         |
|    8   |  4   | bytes   | Python version (major, minor, micro,   |
|        |      |         | reserved)                              |
|   12   |  8   | uint64  | Start timestamp (microseconds)         |
|   20   |  8   | uint64  | Sample interval (microseconds)         |
|   28   |  4   | uint32  | Total sample count                     |
|   32   |  4   | uint32  | Thread count                           |
|   36   |  8   | uint64  | String table offset                    |
|   44   |  8   | uint64  | Frame table offset                     |
|   52   |  4   | uint32  | Compression type (0=none, 1=zstd)      |
|   56   |  8   | bytes   | Reserved (zero-filled)                 |
+--------+------+---------+----------------------------------------+

The magic number 0x54414348 (“TACH” for Tachyon) identifies the file format and also serves as an endianness marker. When read on a system with different byte order than the writer, it appears as 0x48434154. The reader uses this to detect cross-endian files and automatically byte-swap all multi-byte integer fields.

The Python version field records the major, minor, and micro version numbers of the Python interpreter that generated the file. This allows analysis tools to detect version mismatches when replaying data collected on a different Python version, which may have different internal structures or behaviors.

The header is written as zeros initially, then overwritten with actual values during finalization. This requires the output stream to be seekable, which is acceptable since the format targets regular files rather than pipes or network streams.
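For illustration, a reader might parse the fixed 64-byte header along these lines. This is a minimal sketch: the field offsets follow the table above, but the struct and helper names are hypothetical, not the ones used in binary_io_reader.c.

/* Hypothetical parse of the 64-byte header; offsets follow the table above. */
#include <stdint.h>
#include <string.h>

#define TACHYON_MAGIC          0x54414348u   /* "TACH" */
#define TACHYON_MAGIC_SWAPPED  0x48434154u   /* magic as seen cross-endian */

typedef struct {
    uint32_t version;
    uint8_t  python_version[4];   /* major, minor, micro, reserved */
    uint64_t start_timestamp_us;
    uint64_t sample_interval_us;
    uint32_t sample_count;
    uint32_t thread_count;
    uint64_t string_table_offset;
    uint64_t frame_table_offset;
    uint32_t compression_type;    /* 0 = none, 1 = zstd */
    int      needs_swap;          /* nonzero when the magic appears byte-swapped */
} ProfileHeader;

static int
parse_header(const uint8_t buf[64], ProfileHeader *h)
{
    uint32_t magic;
    memcpy(&magic, buf + 0, 4);
    if (magic == TACHYON_MAGIC) {
        h->needs_swap = 0;
    }
    else if (magic == TACHYON_MAGIC_SWAPPED) {
        h->needs_swap = 1;   /* caller must byte-swap every fixed-width field */
    }
    else {
        return -1;           /* not a profiling data file */
    }
    memcpy(&h->version,             buf + 4,  4);
    memcpy(h->python_version,       buf + 8,  4);
    memcpy(&h->start_timestamp_us,  buf + 12, 8);
    memcpy(&h->sample_interval_us,  buf + 20, 8);
    memcpy(&h->sample_count,        buf + 28, 4);
    memcpy(&h->thread_count,        buf + 32, 4);
    memcpy(&h->string_table_offset, buf + 36, 8);
    memcpy(&h->frame_table_offset,  buf + 44, 8);
    memcpy(&h->compression_type,    buf + 52, 4);
    return 0;
}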

Sample Data

Sample data begins at offset 64 and extends to string_table_offset. Samples use delta compression to minimize redundancy when consecutive samples from the same thread have identical or similar call stacks.

Stack Encoding Types

Each sample record begins with thread identification, then an encoding byte:

 Code   Name       Description
+------+----------+---------------------------------------------+
| 0x00 | REPEAT   | RLE: identical stack repeated N times       |
| 0x01 | FULL     | Complete stack (first sample or no match)   |
| 0x02 | SUFFIX   | Shares N frames from bottom of previous     |
|      |          | stack                                       |
| 0x03 | POP_PUSH | Remove M frames from top, add N new frames  |
+------+----------+---------------------------------------------+

Record Formats

REPEAT (0x00) - Run-Length Encoded Identical Stacks:

+-----------------+-----------+----------------------------------------+
| thread_id       | 8 bytes   | Thread identifier (uint64, fixed)      |
| interpreter_id  | 4 bytes   | Interpreter ID (uint32, fixed)         |
| encoding        | 1 byte    | 0x00 (REPEAT)                          |
| count           | varint    | Number of samples in this RLE group    |
| samples         | varies    | Interleaved: [delta: varint, status: 1]|
|                 |           | repeated count times                   |
+-----------------+-----------+----------------------------------------+

The stack is inherited from this thread's previous sample. Each sample in the group gets its own timestamp delta and status byte, stored as interleaved pairs (delta1, status1, delta2, status2, ...) rather than separate arrays.

FULL (0x01) - Complete Stack:

+-----------------+-----------+----------------------------------------+
| thread_id       | 8 bytes   | Thread identifier (uint64, fixed)      |
| interpreter_id  | 4 bytes   | Interpreter ID (uint32, fixed)         |
| encoding        | 1 byte    | 0x01 (FULL)                            |
| timestamp_delta | varint    | Microseconds since thread's last sample|
| status          | 1 byte    | Thread state flags                     |
| stack_depth     | varint    | Number of frames in call stack         |
| frame_indices   | varint[]  | Array of frame table indices           |
+-----------------+-----------+----------------------------------------+

Used for the first sample from a thread, or when delta encoding would not provide savings.

SUFFIX (0x02) - Shared Suffix Match:

+-----------------+-----------+----------------------------------------+
| thread_id       | 8 bytes   | Thread identifier (uint64, fixed)      |
| interpreter_id  | 4 bytes   | Interpreter ID (uint32, fixed)         |
| encoding        | 1 byte    | 0x02 (SUFFIX)                          |
| timestamp_delta | varint    | Microseconds since thread's last sample|
| status          | 1 byte    | Thread state flags                     |
| shared_count    | varint    | Frames shared from bottom of prev stack|
| new_count       | varint    | New frames at top of stack             |
| new_frames      | varint[]  | Array of new_count frame indices       |
+-----------------+-----------+----------------------------------------+

Used when a function call added frames to the top of the stack. The shared frames from the previous stack are kept, and new frames are prepended.

POP_PUSH (0x03) - Pop and Push:

+-----------------+-----------+----------------------------------------+
| thread_id       | 8 bytes   | Thread identifier (uint64, fixed)      |
| interpreter_id  | 4 bytes   | Interpreter ID (uint32, fixed)         |
| encoding        | 1 byte    | 0x03 (POP_PUSH)                        |
| timestamp_delta | varint    | Microseconds since thread's last sample|
| status          | 1 byte    | Thread state flags                     |
| pop_count       | varint    | Frames to remove from top of prev stack|
| push_count      | varint    | New frames to add at top               |
| new_frames      | varint[]  | Array of push_count frame indices      |
+-----------------+-----------+----------------------------------------+

Used when the code path changed: some frames were popped (function returns) and new frames were pushed (different function calls).
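All three stack encodings can be inverted with simple array splicing. The sketch below uses hypothetical helpers operating on arrays of frame-table indices stored innermost-first (index 0 is the currently executing frame, as described under Frame Indexing below); it is not the actual reader code.

/* Rebuild the current stack from the previous one (innermost frame first). */
#include <stdint.h>
#include <string.h>

/* SUFFIX: keep shared_count frames from the bottom of the previous stack
 * (the tail of the array) and place new_count fresh frames on top. */
static size_t
apply_suffix(const uint32_t *prev, size_t prev_len,
             const uint32_t *new_frames, size_t new_count,
             size_t shared_count, uint32_t *out)
{
    memcpy(out, new_frames, new_count * sizeof(uint32_t));
    memcpy(out + new_count, prev + (prev_len - shared_count),
           shared_count * sizeof(uint32_t));
    return new_count + shared_count;
}

/* POP_PUSH: drop pop_count frames from the top of the previous stack,
 * then place push_count fresh frames on top. */
static size_t
apply_pop_push(const uint32_t *prev, size_t prev_len,
               const uint32_t *new_frames, size_t push_count,
               size_t pop_count, uint32_t *out)
{
    memcpy(out, new_frames, push_count * sizeof(uint32_t));
    memcpy(out + push_count, prev + pop_count,
           (prev_len - pop_count) * sizeof(uint32_t));
    return push_count + (prev_len - pop_count);
}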

Thread and Interpreter Identification

Thread IDs are 64-bit values that can be large (memory addresses on some platforms) and vary unpredictably. Using a fixed 8-byte encoding avoids the overhead of varint encoding for large values and simplifies parsing since the reader knows exactly where each field begins.

The interpreter ID identifies which Python sub-interpreter the thread belongs to, allowing analysis tools to separate activity across interpreters in processes using multiple sub-interpreters.

Status Byte

The status byte is a bitfield encoding thread state at sample time:

 Bit   Flag                          Meaning
+-----+-----------------------------+-------------------------------------------+
|  0  | THREAD_STATUS_HAS_GIL       | Thread holds the GIL (Global Interpreter  |
|     |                             | Lock)                                     |
|  1  | THREAD_STATUS_ON_CPU        | Thread is actively running on a CPU core  |
|  2  | THREAD_STATUS_UNKNOWN       | Thread state could not be determined      |
|  3  | THREAD_STATUS_GIL_REQUESTED | Thread is waiting to acquire the GIL      |
|  4  | THREAD_STATUS_HAS_EXCEPTION | Thread has a pending exception            |
+-----+-----------------------------+-------------------------------------------+

Multiple flags can be set simultaneously (e.g., a thread can hold the GIL while also running on CPU). Analysis tools use these to filter samples or visualize thread states over time.
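Assuming each flag occupies the bit position listed above (bit 0 = value 1, bit 1 = value 2, and so on), testing a flag is simple masking, for example:

/* Flag values assumed from the bit positions in the table above. */
#define THREAD_STATUS_HAS_GIL        (1u << 0)
#define THREAD_STATUS_ON_CPU         (1u << 1)
#define THREAD_STATUS_UNKNOWN        (1u << 2)
#define THREAD_STATUS_GIL_REQUESTED  (1u << 3)
#define THREAD_STATUS_HAS_EXCEPTION  (1u << 4)

/* Example filter: was the thread holding the GIL and running on a core? */
static int
is_running_with_gil(unsigned char status)
{
    return (status & THREAD_STATUS_HAS_GIL) && (status & THREAD_STATUS_ON_CPU);
}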

Timestamp Delta Encoding

Timestamps use delta encoding rather than absolute values. Absolute timestamps in microseconds require 8 bytes each, but consecutive samples from the same thread are typically separated by the sampling interval (e.g., 1000 microseconds), so the delta between them is small and fits in 1-2 varint bytes. The writer tracks the previous timestamp for each thread separately. The first sample from a thread encodes its delta from the profiling start time; subsequent samples encode the delta from that thread's previous sample. This per-thread tracking is necessary because samples are interleaved across threads in arrival order, not grouped by thread.

For REPEAT (RLE) records, timestamp deltas and status bytes are stored as interleaved pairs (delta, status, delta, status, ...) - one pair per repeated sample - allowing efficient batching while preserving the exact timing and state of each sample.
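A reader might expand one REPEAT group roughly as follows (a sketch only: read_uvarint stands for the LEB128 decoder shown under Variable-Length Integer Encoding below, and the callback is purely illustrative):

/* Hypothetical expansion of a REPEAT (RLE) group for one thread. last_ts
 * holds the thread's previously reconstructed absolute timestamp (or the
 * profile start time before its first sample) and is updated in place. */
#include <stdint.h>

uint64_t read_uvarint(const uint8_t **p);   /* LEB128 decoder, sketched below */

typedef void (*sample_cb)(uint64_t timestamp_us, uint8_t status, void *arg);

static void
expand_repeat_group(const uint8_t **p, uint64_t *last_ts,
                    sample_cb emit, void *arg)
{
    uint64_t count = read_uvarint(p);
    for (uint64_t i = 0; i < count; i++) {
        uint64_t delta = read_uvarint(p);   /* interleaved pair: delta ... */
        uint8_t status = *(*p)++;           /* ... then status byte */
        *last_ts += delta;
        /* The stack itself is inherited from the thread's previous sample. */
        emit(*last_ts, status, arg);
    }
}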

Frame Indexing

Each frame in a call stack is represented by an index into the frame table rather than inline data. This provides massive space savings because call stacks are highly repetitive: the same function appears in many samples (hot functions), call stacks often share common prefixes (main -> app -> handler -> ...), and recursive functions create repeated frame sequences. A frame index is typically 1-2 varint bytes. Inline frame data would be 20-200+ bytes (two strings plus a line number). For a profile with 100,000 samples averaging 30 frames each, this reduces frame data from potentially gigabytes to tens of megabytes.

Frame indices are written innermost-first (the currently executing frame has index 0 in the array). This ordering works well with delta compression: function calls typically add frames at the top (index 0), while shared frames remain at the bottom.

String Table

The string table stores deduplicated UTF-8 strings (filenames and function names). It begins at string_table_offset and contains entries in order of their assignment during writing:

+----------------+
| length: varint |
| data: bytes    |
+----------------+  (repeated for each string)

Strings are stored in the order they were first encountered during writing. The first unique filename gets index 0, the second gets index 1, and so on. Length-prefixing (rather than null-termination) allows strings containing null bytes and enables readers to allocate exact-sized buffers. The varint length encoding means short strings (under 128 bytes) need only one length byte.
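For example, a reader that already knows string_count from the footer might load the table as follows (a sketch: read_uvarint is the LEB128 decoder shown under Variable-Length Integer Encoding below, and cleanup on error paths is omitted):

/* Hypothetical loading of the string table into an index -> string array. */
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

uint64_t read_uvarint(const uint8_t **p);   /* LEB128 decoder, sketched below */

static char **
load_string_table(const uint8_t *p, uint32_t string_count)
{
    char **table = calloc(string_count, sizeof(char *));
    if (table == NULL) {
        return NULL;
    }
    for (uint32_t i = 0; i < string_count; i++) {
        uint64_t len = read_uvarint(&p);    /* varint length prefix */
        char *s = malloc(len + 1);
        if (s == NULL) {
            return NULL;                    /* cleanup omitted for brevity */
        }
        memcpy(s, p, len);                  /* UTF-8 bytes; may embed NULs, so a
                                               real reader keeps the length too */
        s[len] = '\0';
        table[i] = s;
        p += len;
    }
    return table;
}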

Frame Table

The frame table stores deduplicated frame entries with full source position information and bytecode opcode:

+----------------------------+
| filename_idx: varint       |
| funcname_idx: varint       |
| lineno: svarint            |
| end_lineno_delta: svarint  |
| column: svarint            |
| end_column_delta: svarint  |
| opcode: u8                 |
+----------------------------+  (repeated for each frame)

Field Definitions

 Field              Type            Description
+------------------+---------------+---------------------------------------------+
| filename_idx     | varint        | Index into string table for file name       |
| funcname_idx     | varint        | Index into string table for function name   |
| lineno           | zigzag varint | Start line number (-1 for synthetic frames) |
| end_lineno_delta | zigzag varint | Delta from lineno                           |
|                  |               | (end_lineno = lineno + delta)               |
| column           | zigzag varint | Start column offset in UTF-8 bytes          |
|                  |               | (-1 if not available)                       |
| end_column_delta | zigzag varint | Delta from column                           |
|                  |               | (end_column = column + delta)               |
| opcode           | u8            | Python bytecode opcode (0-254) or 255       |
|                  |               | for None                                    |
+------------------+---------------+---------------------------------------------+

Delta Encoding

Position end values use delta encoding for efficiency:

  • end_lineno = lineno + end_lineno_delta
  • end_column = column + end_column_delta

Typical values:

  • end_lineno_delta: Usually 0 (single-line expressions) → encodes to 1 byte
  • end_column_delta: Usually 5-20 (expression width) → encodes to 1 byte

This saves ~1-2 bytes per frame compared to absolute encoding. When the base value (lineno or column) is -1 (not available), the delta is stored as 0 and the reconstructed value is -1.
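In code, reconstruction is plain addition; because a missing base value (-1) is always paired with a stored delta of 0, the sum comes out as -1 again and no special case is needed. A minimal sketch, taking the already-decoded zigzag values as inputs:

/* Hypothetical reconstruction of absolute end positions from stored deltas. */
#include <stdint.h>

static void
resolve_positions(int32_t lineno, int32_t end_lineno_delta,
                  int32_t column, int32_t end_column_delta,
                  int32_t *end_lineno, int32_t *end_column)
{
    /* lineno == -1 (synthetic frame) is stored with a delta of 0, so the
     * reconstructed end_lineno is also -1; the same holds for columns. */
    *end_lineno = lineno + end_lineno_delta;
    *end_column = column + end_column_delta;
}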

Sentinel Values

  • opcode = 255: No opcode captured
  • lineno = -1: Synthetic frame (no source location)
  • column = -1: Column offset not available

Deduplication

Each unique (filename, funcname, lineno, end_lineno, column, end_column, opcode) combination gets one entry. This enables instruction-level profiling where multiple bytecode instructions on the same line can be distinguished.

Strings and frames are deduplicated separately because they have different cardinalities and reference patterns. A codebase might have hundreds of unique source files but thousands of unique functions. Many functions share the same filename, so storing the filename index in each frame entry (rather than the full string) provides an additional layer of deduplication. A frame entry is typically 7-9 bytes rather than two full strings plus location data.

Size Analysis

Typical frame size with delta encoding:

  • file_idx: 1-2 bytes
  • func_idx: 1-2 bytes
  • lineno: 1-2 bytes
  • end_lineno_delta: 1 byte (usually 0)
  • column: 1 byte (usually < 64)
  • end_column_delta: 1 byte (usually < 64)
  • opcode: 1 byte

Total: ~7-9 bytes per frame

Line numbers and columns use signed varint (zigzag encoding) to handle sentinel values efficiently. Synthetic frames—generated frames that don't correspond directly to Python source code, such as C extension boundaries or internal interpreter frames—use -1 to indicate the absence of a source location. Zigzag encoding ensures these small negative values encode efficiently (−1 becomes 1, which is one byte) rather than requiring the maximum varint length.

Footer

 Offset   Size   Type      Description
+--------+------+---------+----------------------------------------+
|    0   |  4   | uint32  | String count                           |
|    4   |  4   | uint32  | Frame count                            |
|    8   |  8   | uint64  | Total file size                        |
|   16   | 16   | bytes   | Checksum (reserved, currently zeros)   |
+--------+------+---------+----------------------------------------+

The string and frame counts allow readers to pre-allocate arrays of the correct size before parsing the tables. Without these counts, readers would need to either scan the tables twice (once to count, once to parse) or use dynamically-growing arrays.

The file size field provides a consistency check: if the actual file size does not match, the file may be truncated or corrupted.

The checksum field is reserved for future use. A checksum would allow detection of corruption but adds complexity and computation cost. The current implementation leaves this as zeros.

Variable-Length Integer Encoding

The format uses LEB128 (Little Endian Base 128) for unsigned integers and zigzag + LEB128 for signed integers. These encodings are widely used (Protocol Buffers, DWARF debug info, WebAssembly) and well-understood.

Unsigned Varint (LEB128)

Each byte stores 7 bits of data. The high bit indicates whether more bytes follow:

Value        Encoded bytes
0-127        [0xxxxxxx]                    (1 byte)
128-16383    [1xxxxxxx] [0xxxxxxx]         (2 bytes)
16384+       [1xxxxxxx] [1xxxxxxx] ...     (3+ bytes)

Most indices in profiling data are small. A profile with 1000 unique frames needs at most 2 bytes per frame index. The common case (indices under 128) needs only 1 byte.

Signed Varint (Zigzag)

Standard LEB128 encodes −1 as a very large unsigned value, requiring many bytes. Zigzag encoding interleaves positive and negative values:

 0 -> 0    -1 -> 1     1 -> 2    -2 -> 3     2 -> 4

This ensures small-magnitude values (whether positive or negative) encode in few bytes.
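A compact implementation of both encodings might look like the following sketch. It matches the description above but is not the exact code in binary_io_writer.c or binary_io_reader.c.

/* LEB128 unsigned varints plus zigzag mapping for signed values (sketch). */
#include <stddef.h>
#include <stdint.h>

/* Write value as LEB128; returns the number of bytes produced (1-10). */
size_t
write_uvarint(uint64_t value, uint8_t *out)
{
    size_t n = 0;
    while (value >= 0x80) {
        out[n++] = (uint8_t)(value & 0x7f) | 0x80;  /* high bit: more follows */
        value >>= 7;
    }
    out[n++] = (uint8_t)value;                      /* final byte, high bit clear */
    return n;
}

/* Read a LEB128 value and advance *p past it. */
uint64_t
read_uvarint(const uint8_t **p)
{
    uint64_t value = 0;
    int shift = 0;
    uint8_t byte;
    do {
        byte = *(*p)++;
        value |= (uint64_t)(byte & 0x7f) << shift;
        shift += 7;
    } while (byte & 0x80);
    return value;
}

/* Zigzag: interleave signed values so small magnitudes stay small. */
uint64_t
zigzag_encode(int64_t v)
{
    return ((uint64_t)v << 1) ^ (uint64_t)(v >> 63);
}

int64_t
zigzag_decode(uint64_t u)
{
    return (int64_t)(u >> 1) ^ -(int64_t)(u & 1);
}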

Compression

When compression is enabled, the sample data region contains a zstd stream. The string table, frame table, and footer remain uncompressed so readers can access metadata without decompressing the entire file. A tool that only needs to report “this file contains 50,000 samples of 3 threads” can read the header and footer without touching the compressed sample data. This also simplifies the format: the header's offset fields point directly to the tables rather than to positions within a decompressed stream.

Zstd provides an excellent balance of compression ratio and speed. Profiling data compresses very well (often 5-10x) due to repetitive patterns: the same small set of frame indices appears repeatedly, and delta-encoded timestamps cluster around the sampling interval. Zstd's streaming API allows compression without buffering the entire dataset. The writer feeds sample data through the compressor incrementally, flushing compressed chunks to disk as they become available.

Level 5 is the default: it provides good compression with minimal impact on profiling overhead. Lower levels (1-3) are faster but compress less; higher levels (6+) compress more but slow down writing.
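A rough sketch of the streaming path, using zstd's public streaming API (the function names here are illustrative, and buffer sizes and error handling are simplified):

/* Sketch: push encoded sample bytes through a zstd compression stream. */
#include <stdio.h>
#include <zstd.h>

/* One-time setup: create a streaming compression context at level 5. */
static ZSTD_CCtx *
make_compressor(void)
{
    ZSTD_CCtx *cctx = ZSTD_createCCtx();
    if (cctx != NULL) {
        ZSTD_CCtx_setParameter(cctx, ZSTD_c_compressionLevel, 5);
    }
    return cctx;
}

/* Feed one buffer; pass last=1 after the final sample to flush the frame.
 * Compressed output is appended to fp as it becomes available. */
static int
compress_chunk(ZSTD_CCtx *cctx, FILE *fp,
               const void *data, size_t size, int last)
{
    ZSTD_inBuffer in = { data, size, 0 };
    ZSTD_EndDirective mode = last ? ZSTD_e_end : ZSTD_e_continue;
    char buf[1 << 16];
    size_t remaining;
    do {
        ZSTD_outBuffer out = { buf, sizeof(buf), 0 };
        remaining = ZSTD_compressStream2(cctx, &out, &in, mode);
        if (ZSTD_isError(remaining)) {
            return -1;
        }
        fwrite(buf, 1, out.pos, fp);
    } while (last ? (remaining != 0) : (in.pos < in.size));
    return 0;
}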

Reading and Writing

Writing

  1. Open the output file and write 64 zero bytes as a placeholder header
  2. Initialize empty string and frame dictionaries for deduplication
  3. For each sample:
    • Intern any new strings, assigning sequential indices
    • Intern any new frames, assigning sequential indices
    • Encode the sample record and write to the buffer
    • Flush the buffer through compression (if enabled) when full
  4. Flush remaining buffered data and finalize compression
  5. Write the string table (length-prefixed strings in index order)
  6. Write the frame table (varint-encoded entries in index order)
  7. Write the footer with final counts
  8. Seek to offset 0 and write the header with actual values

The writer maintains two dictionaries: one mapping strings to indices, and one mapping frame keys (filename index, function name index, source position fields, and opcode) to frame indices. These enable O(1) lookup during interning.
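As a simplified illustration of the interning logic only (a linear scan for clarity; the actual writer presumably uses a hash table to get the O(1) lookups mentioned above):

/* Simplified string interning: return the existing index for s, or append
 * it and hand out the next sequential index. Linear scan for clarity. */
#include <stdint.h>
#include <string.h>

typedef struct {
    const char *strings[65536];   /* interned strings, in index order */
    uint32_t count;
} StringTable;

static uint32_t
intern_string(StringTable *t, const char *s)
{
    for (uint32_t i = 0; i < t->count; i++) {
        if (strcmp(t->strings[i], s) == 0) {
            return i;                   /* already interned */
        }
    }
    t->strings[t->count] = s;           /* first occurrence gets the next index */
    return t->count++;
}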

Reading

  1. Read the header magic number to detect endianness (set needs_swap flag if the magic appears byte-swapped)
  2. Validate version and read remaining header fields (byte-swapping if needed)
  3. Seek to end − 32 and read the footer (byte-swapping counts if needed)
  4. Allocate string array of string_count elements
  5. Parse the string table, populating the array
  6. Allocate frame array of frame_count entries
  7. Parse the frame table, populating the array
  8. If compressed, decompress the sample data region
  9. Iterate through samples, resolving indices to strings/frames (byte-swapping thread_id and interpreter_id if needed)

The reader builds lookup arrays rather than dictionaries since it only needs index-to-value mapping, not value-to-index.

Platform Considerations

Byte Ordering and Cross-Platform Portability

The binary format uses native byte order for all multi-byte integer fields when writing. However, the reader supports cross-endian reading: files written on a little-endian system (x86, ARM) can be read on a big-endian system (s390x, PowerPC), and vice versa.

The magic number doubles as an endianness marker. When read on a system with different byte order, it appears byte-swapped (0x48434154 instead of 0x54414348). The reader detects this and automatically byte-swaps all fixed-width integer fields during parsing.

Writers must use memcpy() from properly-sized integer types when writing fixed-width integer fields. When the source variable's type differs from the field width (e.g., size_t written as 4 bytes), explicit casting to the correct type (e.g., uint32_t) is required before memcpy(). On big-endian systems, copying from an oversized type would copy the wrong bytes—high-order zeros instead of the actual value.
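For example, writing the 4-byte sample-count field at header offset 28 from a size_t counter (a sketch with a hypothetical header buffer):

#include <stddef.h>
#include <stdint.h>
#include <string.h>

static void
write_sample_count(uint8_t *header, size_t total_samples)
{
    /* Correct: narrow to the on-disk width, then copy all of it. */
    uint32_t sample_count = (uint32_t)total_samples;
    memcpy(header + 28, &sample_count, sizeof(sample_count));

    /* Wrong on big-endian systems: memcpy(header + 28, &total_samples, 4)
     * would copy the high-order (zero) bytes of the 8-byte value. */
}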

The reader tracks whether byte-swapping is needed via a needs_swap flag set during header parsing. All fixed-width fields in the header, footer, and sample data are conditionally byte-swapped using Python's internal byte-swap functions (_Py_bswap32, _Py_bswap64 from pycore_bitutils.h).

Variable-length integers (varints) encode values one byte at a time using the LEB128 scheme, so they are byte-order independent and need no special handling for cross-endian reading.

Memory-Mapped I/O

On Unix systems (Linux, macOS), the reader uses mmap() to map the file into the process address space. This has several benefits: the kernel handles paging data in and out as needed; no explicit read() calls or buffer management are required; multiple readers can share the same physical pages; and sequential access patterns benefit from kernel read-ahead.

The implementation uses madvise() to hint the access pattern to the kernel: MADV_SEQUENTIAL indicates the file will be read linearly, enabling aggressive read-ahead. MADV_WILLNEED requests pre-faulting of pages. On Linux, MAP_POPULATE pre-faults all pages at mmap time rather than on first access, moving page fault overhead from the parsing loop to the initial mapping for more predictable performance. For large files (over 32 MB), MADV_HUGEPAGE requests transparent huge pages (2 MB instead of 4 KB) to reduce TLB pressure when accessing large amounts of data.
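A sketch of the Unix mapping path (flags and error handling simplified; MAP_POPULATE is guarded because it is Linux-only, and the MADV_HUGEPAGE path for large files is omitted):

/* Sketch: map a profile file read-only and hint the access pattern. */
#include <stddef.h>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

static void *
map_profile(const char *path, size_t *size_out)
{
    int fd = open(path, O_RDONLY);
    if (fd < 0) {
        return NULL;
    }
    struct stat st;
    if (fstat(fd, &st) != 0) {
        close(fd);
        return NULL;
    }
    int flags = MAP_PRIVATE;
#ifdef MAP_POPULATE
    flags |= MAP_POPULATE;              /* Linux: pre-fault pages up front */
#endif
    void *base = mmap(NULL, (size_t)st.st_size, PROT_READ, flags, fd, 0);
    close(fd);                          /* the mapping stays valid after close */
    if (base == MAP_FAILED) {
        return NULL;
    }
    madvise(base, (size_t)st.st_size, MADV_SEQUENTIAL);   /* linear read-ahead */
    madvise(base, (size_t)st.st_size, MADV_WILLNEED);     /* pre-read pages */
    *size_out = (size_t)st.st_size;
    return base;
}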

On Windows, the implementation falls back to standard file I/O with full file buffering. Profiling data files are typically small enough (tens to hundreds of megabytes) that this is acceptable.

The writer uses a 512 KB buffer to batch small writes. Each sample record is typically tens of bytes; writing these individually would incur excessive syscall overhead. The buffer accumulates data until full, then flushes in one write() call (or feeds through the compression stream).

Future Considerations

The format reserves space for future extensions. The reserved bytes in the header (the zero-filled field at offset 56 and the spare byte in the Python version field) could hold additional metadata. The 16-byte checksum field in the footer is currently unused. The version field allows incompatible changes with graceful rejection. New compression types could be added (compression_type > 1).

Any changes that alter the meaning of existing fields or the parsing logic should increment the version number to prevent older readers from misinterpreting new files.