| # Profiling Binary Format |
| |
| The profiling module includes a binary file format for storing sampling |
| profiler data. This document describes the format's structure and the |
| design decisions behind it. |
| |
| The implementation is in |
| [`Modules/_remote_debugging/binary_io_writer.c`](../Modules/_remote_debugging/binary_io_writer.c) |
| and [`Modules/_remote_debugging/binary_io_reader.c`](../Modules/_remote_debugging/binary_io_reader.c), |
| with declarations in |
| [`Modules/_remote_debugging/binary_io.h`](../Modules/_remote_debugging/binary_io.h). |
| |
| ## Overview |
| |
| The sampling profiler can generate enormous amounts of data. A typical |
| profiling session sampling at 1000 Hz for 60 seconds produces 60,000 samples. |
| Each sample contains a full call stack, often 20-50 frames deep, and each |
| frame includes a filename, function name, and line number. In a text-based |
| format like collapsed stacks, this would mean repeating the same long file |
| paths and function names thousands of times. |
| |
| The binary format addresses this through two key strategies: |
| |
| 1. **Deduplication**: Strings and frames are stored once in lookup tables, |
| then referenced by small integer indices. A 100-character file path that |
| appears in 50,000 samples is stored once, not 50,000 times. |
| |
| 2. **Compact encoding**: Variable-length integers (varints) encode small |
| values in fewer bytes. Since most indices are small (under 128), they |
| typically need only one byte instead of four. |
| |
| Together with optional zstd compression, these techniques reduce file sizes |
| by 10-50x compared to text formats while also enabling faster I/O. |
| |
| ## File Layout |
| |
| The file consists of five sections: |
| |
| ``` |
| +------------------+ Offset 0 |
| | Header | 64 bytes (fixed) |
| +------------------+ Offset 64 |
| | | |
| | Sample Data | Variable size (optionally compressed) |
| | | |
| +------------------+ string_table_offset |
| | String Table | Variable size |
| +------------------+ frame_table_offset |
| | Frame Table | Variable size |
| +------------------+ file_size - 32 |
| | Footer | 32 bytes (fixed) |
| +------------------+ file_size |
| ``` |
| |
| The layout is designed for streaming writes during profiling. The profiler |
| cannot know in advance how many unique strings or frames will be encountered, |
| so these tables must be built incrementally and written at the end. |
| |
| The header comes first so readers can quickly validate the file and locate |
| the metadata tables. The sample data follows immediately, allowing the writer |
| to stream samples directly to disk (or through a compression stream) without |
| buffering the entire dataset in memory. |
| |
| The string and frame tables are placed after sample data because they grow |
| as new unique entries are discovered during profiling. By deferring their |
| output until finalization, the writer avoids the complexity of reserving |
| space or rewriting portions of the file. |
| |
| The footer at the end contains counts needed to allocate arrays before |
| parsing the tables. Placing it at a fixed offset from the end (rather than |
| at a variable offset recorded in the header) means readers can locate it |
| with a single seek to `file_size - 32`, without first reading the header. |
| |
| ## Header |
| |
| ``` |
| Offset Size Type Description |
| +--------+------+---------+----------------------------------------+ |
| | 0 | 4 | uint32 | Magic number (0x54414348 = "TACH") | |
| | 4 | 4 | uint32 | Format version | |
| | 8 | 4 | bytes | Python version (major, minor, micro, | |
| | | | | reserved) | |
| | 12 | 8 | uint64 | Start timestamp (microseconds) | |
| | 20 | 8 | uint64 | Sample interval (microseconds) | |
| | 28 | 4 | uint32 | Total sample count | |
| | 32 | 4 | uint32 | Thread count | |
| | 36 | 8 | uint64 | String table offset | |
| | 44 | 8 | uint64 | Frame table offset | |
| | 52 | 4 | uint32 | Compression type (0=none, 1=zstd) | |
| | 56 | 8 | bytes | Reserved (zero-filled) | |
| +--------+------+---------+----------------------------------------+ |
| ``` |
| |
| The magic number `0x54414348` ("TACH" for Tachyon) identifies the file format |
| and also serves as an **endianness marker**. When read on a system with |
| different byte order than the writer, it appears as `0x48434154`. The reader |
| uses this to detect cross-endian files and automatically byte-swap all |
| multi-byte integer fields. |
| |
| The Python version field records the major, minor, and micro version numbers |
| of the Python interpreter that generated the file. This allows analysis tools |
| to detect version mismatches when replaying data collected on a different |
| Python version, which may have different internal structures or behaviors. |
| |
| The header is written as zeros initially, then overwritten with actual values |
| during finalization. This requires the output stream to be seekable, which |
| is acceptable since the format targets regular files rather than pipes or |
| network streams. |
| |
| ## Sample Data |
| |
| Sample data begins at offset 64 and extends to `string_table_offset`. Samples |
| use delta compression to minimize redundancy when consecutive samples from the |
| same thread have identical or similar call stacks. |
| |
| ### Stack Encoding Types |
| |
| Each sample record begins with thread identification, then an encoding byte: |
| |
| | Code | Name | Description | |
| |------|------|-------------| |
| | 0x00 | REPEAT | RLE: identical stack repeated N times | |
| | 0x01 | FULL | Complete stack (first sample or no match) | |
| | 0x02 | SUFFIX | Shares N frames from bottom of previous stack | |
| | 0x03 | POP_PUSH | Remove M frames from top, add N new frames | |
| |
| ### Record Formats |
| |
| **REPEAT (0x00) - Run-Length Encoded Identical Stacks:** |
| ``` |
| +-----------------+-----------+----------------------------------------+ |
| | thread_id | 8 bytes | Thread identifier (uint64, fixed) | |
| | interpreter_id | 4 bytes | Interpreter ID (uint32, fixed) | |
| | encoding | 1 byte | 0x00 (REPEAT) | |
| | count | varint | Number of samples in this RLE group | |
| | samples | varies | Interleaved: [delta: varint, status: 1]| |
| | | | repeated count times | |
| +-----------------+-----------+----------------------------------------+ |
| ``` |
| The stack is inherited from this thread's previous sample. Each sample in the |
| group gets its own timestamp delta and status byte, stored as interleaved pairs |
| (delta1, status1, delta2, status2, ...) rather than separate arrays. |
| |
| **FULL (0x01) - Complete Stack:** |
| ``` |
| +-----------------+-----------+----------------------------------------+ |
| | thread_id | 8 bytes | Thread identifier (uint64, fixed) | |
| | interpreter_id | 4 bytes | Interpreter ID (uint32, fixed) | |
| | encoding | 1 byte | 0x01 (FULL) | |
| | timestamp_delta | varint | Microseconds since thread's last sample| |
| | status | 1 byte | Thread state flags | |
| | stack_depth | varint | Number of frames in call stack | |
| | frame_indices | varint[] | Array of frame table indices | |
| +-----------------+-----------+----------------------------------------+ |
| ``` |
| Used for the first sample from a thread, or when delta encoding would not |
| provide savings. |
| |
| **SUFFIX (0x02) - Shared Suffix Match:** |
| ``` |
| +-----------------+-----------+----------------------------------------+ |
| | thread_id | 8 bytes | Thread identifier (uint64, fixed) | |
| | interpreter_id | 4 bytes | Interpreter ID (uint32, fixed) | |
| | encoding | 1 byte | 0x02 (SUFFIX) | |
| | timestamp_delta | varint | Microseconds since thread's last sample| |
| | status | 1 byte | Thread state flags | |
| | shared_count | varint | Frames shared from bottom of prev stack| |
| | new_count | varint | New frames at top of stack | |
| | new_frames | varint[] | Array of new_count frame indices | |
| +-----------------+-----------+----------------------------------------+ |
| ``` |
| Used when a function call added frames to the top of the stack. The shared |
| frames from the previous stack are kept, and new frames are prepended. |
| |
| **POP_PUSH (0x03) - Pop and Push:** |
| ``` |
| +-----------------+-----------+----------------------------------------+ |
| | thread_id | 8 bytes | Thread identifier (uint64, fixed) | |
| | interpreter_id | 4 bytes | Interpreter ID (uint32, fixed) | |
| | encoding | 1 byte | 0x03 (POP_PUSH) | |
| | timestamp_delta | varint | Microseconds since thread's last sample| |
| | status | 1 byte | Thread state flags | |
| | pop_count | varint | Frames to remove from top of prev stack| |
| | push_count | varint | New frames to add at top | |
| | new_frames | varint[] | Array of push_count frame indices | |
| +-----------------+-----------+----------------------------------------+ |
| ``` |
| Used when the code path changed: some frames were popped (function returns) |
| and new frames were pushed (different function calls). |
| |
| ### Thread and Interpreter Identification |
| |
| Thread IDs are 64-bit values that can be large (memory addresses on some |
| platforms) and vary unpredictably. Using a fixed 8-byte encoding avoids |
| the overhead of varint encoding for large values and simplifies parsing |
| since the reader knows exactly where each field begins. |
| |
| The interpreter ID identifies which Python sub-interpreter the thread |
| belongs to, allowing analysis tools to separate activity across interpreters |
| in processes using multiple sub-interpreters. |
| |
| ### Status Byte |
| |
| The status byte is a bitfield encoding thread state at sample time: |
| |
| | Bit | Flag | Meaning | |
| |-----|-----------------------|--------------------------------------------| |
| | 0 | THREAD_STATUS_HAS_GIL | Thread holds the GIL (Global Interpreter Lock) | |
| | 1 | THREAD_STATUS_ON_CPU | Thread is actively running on a CPU core | |
| | 2 | THREAD_STATUS_UNKNOWN | Thread state could not be determined | |
| | 3 | THREAD_STATUS_GIL_REQUESTED | Thread is waiting to acquire the GIL | |
| | 4 | THREAD_STATUS_HAS_EXCEPTION | Thread has a pending exception | |
| |
| Multiple flags can be set simultaneously (e.g., a thread can hold the GIL |
| while also running on CPU). Analysis tools use these to filter samples or |
| visualize thread states over time. |
| |
| ### Timestamp Delta Encoding |
| |
| Timestamps use delta encoding rather than absolute values. Absolute |
| timestamps in microseconds require 8 bytes each, but consecutive samples |
| from the same thread are typically separated by the sampling interval |
| (e.g., 1000 microseconds), so the delta between them is small and fits |
| in 1-2 varint bytes. The writer tracks the previous timestamp for each |
| thread separately. The first sample from a thread encodes its delta from |
| the profiling start time; subsequent samples encode the delta from that |
| thread's previous sample. This per-thread tracking is necessary because |
| samples are interleaved across threads in arrival order, not grouped by |
| thread. |
| |
| For REPEAT (RLE) records, timestamp deltas and status bytes are stored as |
| interleaved pairs (delta, status, delta, status, ...) - one pair per |
| repeated sample - allowing efficient batching while preserving the exact |
| timing and state of each sample. |
| |
| ### Frame Indexing |
| |
| Each frame in a call stack is represented by an index into the frame table |
| rather than inline data. This provides massive space savings because call |
| stacks are highly repetitive: the same function appears in many samples |
| (hot functions), call stacks often share common prefixes (main -> app -> |
| handler -> ...), and recursive functions create repeated frame sequences. |
| A frame index is typically 1-2 varint bytes. Inline frame data would be |
| 20-200+ bytes (two strings plus a line number). For a profile with 100,000 |
| samples averaging 30 frames each, this reduces frame data from potentially |
| gigabytes to tens of megabytes. |
| |
| Frame indices are written innermost-first (the currently executing frame |
| has index 0 in the array). This ordering works well with delta compression: |
| function calls typically add frames at the top (index 0), while shared |
| frames remain at the bottom. |
| |
| ## String Table |
| |
| The string table stores deduplicated UTF-8 strings (filenames and function |
| names). It begins at `string_table_offset` and contains entries in order of |
| their assignment during writing: |
| |
| ``` |
| +----------------+ |
| | length: varint | |
| | data: bytes | |
| +----------------+ (repeated for each string) |
| ``` |
| |
| Strings are stored in the order they were first encountered during writing. |
| The first unique filename gets index 0, the second gets index 1, and so on. |
| Length-prefixing (rather than null-termination) allows strings containing |
| null bytes and enables readers to allocate exact-sized buffers. The varint |
| length encoding means short strings (under 128 bytes) need only one length |
| byte. |
| |
| ## Frame Table |
| |
| The frame table stores deduplicated frame entries with full source position |
| information and bytecode opcode: |
| |
| ``` |
| +----------------------------+ |
| | filename_idx: varint | |
| | funcname_idx: varint | |
| | lineno: svarint | |
| | end_lineno_delta: svarint | |
| | column: svarint | |
| | end_column_delta: svarint | |
| | opcode: u8 | |
| +----------------------------+ (repeated for each frame) |
| ``` |
| |
| ### Field Definitions |
| |
| | Field | Type | Description | |
| |------------------|---------------|----------------------------------------------------------| |
| | filename_idx | varint | Index into string table for file name | |
| | funcname_idx | varint | Index into string table for function name | |
| | lineno | zigzag varint | Start line number (-1 for synthetic frames) | |
| | end_lineno_delta | zigzag varint | Delta from lineno (end_lineno = lineno + delta) | |
| | column | zigzag varint | Start column offset in UTF-8 bytes (-1 if not available) | |
| | end_column_delta | zigzag varint | Delta from column (end_column = column + delta) | |
| | opcode | u8 | Python bytecode opcode (0-254) or 255 for None | |
| |
| ### Delta Encoding |
| |
| Position end values use delta encoding for efficiency: |
| |
| - `end_lineno = lineno + end_lineno_delta` |
| - `end_column = column + end_column_delta` |
| |
| Typical values: |
| - `end_lineno_delta`: Usually 0 (single-line expressions) → encodes to 1 byte |
| - `end_column_delta`: Usually 5-20 (expression width) → encodes to 1 byte |
| |
| This saves ~1-2 bytes per frame compared to absolute encoding. When the base |
| value (lineno or column) is -1 (not available), the delta is stored as 0 and |
| the reconstructed value is -1. |
| |
| ### Sentinel Values |
| |
| - `opcode = 255`: No opcode captured |
| - `lineno = -1`: Synthetic frame (no source location) |
| - `column = -1`: Column offset not available |
| |
| ### Deduplication |
| |
| Each unique (filename, funcname, lineno, end_lineno, column, end_column, |
| opcode) combination gets one entry. This enables instruction-level profiling |
| where multiple bytecode instructions on the same line can be distinguished. |
| |
| Strings and frames are deduplicated separately because they have different |
| cardinalities and reference patterns. A codebase might have hundreds of |
| unique source files but thousands of unique functions. Many functions share |
| the same filename, so storing the filename index in each frame entry (rather |
| than the full string) provides an additional layer of deduplication. A frame |
| entry is typically 7-9 bytes rather than two full strings plus location data. |
| |
| ### Size Analysis |
| |
| Typical frame size with delta encoding: |
| - file_idx: 1-2 bytes |
| - func_idx: 1-2 bytes |
| - lineno: 1-2 bytes |
| - end_lineno_delta: 1 byte (usually 0) |
| - column: 1 byte (usually < 64) |
| - end_column_delta: 1 byte (usually < 64) |
| - opcode: 1 byte |
| |
| **Total: ~7-9 bytes per frame** |
| |
| Line numbers and columns use signed varint (zigzag encoding) to handle |
| sentinel values efficiently. Synthetic frames—generated frames that don't |
| correspond directly to Python source code, such as C extension boundaries or |
| internal interpreter frames—use -1 to indicate the absence of a source |
| location. Zigzag encoding ensures these small negative values encode |
| efficiently (−1 becomes 1, which is one byte) rather than requiring the |
| maximum varint length. |
| |
| ## Footer |
| |
| ``` |
| Offset Size Type Description |
| +--------+------+---------+----------------------------------------+ |
| | 0 | 4 | uint32 | String count | |
| | 4 | 4 | uint32 | Frame count | |
| | 8 | 8 | uint64 | Total file size | |
| | 16 | 16 | bytes | Checksum (reserved, currently zeros) | |
| +--------+------+---------+----------------------------------------+ |
| ``` |
| |
| The string and frame counts allow readers to pre-allocate arrays of the |
| correct size before parsing the tables. Without these counts, readers would |
| need to either scan the tables twice (once to count, once to parse) or use |
| dynamically-growing arrays. |
| |
| The file size field provides a consistency check: if the actual file size |
| does not match, the file may be truncated or corrupted. |
| |
| The checksum field is reserved for future use. A checksum would allow |
| detection of corruption but adds complexity and computation cost. The |
| current implementation leaves this as zeros. |
| |
| ## Variable-Length Integer Encoding |
| |
| The format uses LEB128 (Little Endian Base 128) for unsigned integers and |
| zigzag + LEB128 for signed integers. These encodings are widely used |
| (Protocol Buffers, DWARF debug info, WebAssembly) and well-understood. |
| |
| ### Unsigned Varint (LEB128) |
| |
| Each byte stores 7 bits of data. The high bit indicates whether more bytes |
| follow: |
| |
| ``` |
| Value Encoded bytes |
| 0-127 [0xxxxxxx] (1 byte) |
| 128-16383 [1xxxxxxx] [0xxxxxxx] (2 bytes) |
| 16384+ [1xxxxxxx] [1xxxxxxx] ... (3+ bytes) |
| ``` |
| |
| Most indices in profiling data are small. A profile with 1000 unique frames |
| needs at most 2 bytes per frame index. The common case (indices under 128) |
| needs only 1 byte. |
| |
| ### Signed Varint (Zigzag) |
| |
| Standard LEB128 encodes −1 as a very large unsigned value, requiring many |
| bytes. Zigzag encoding interleaves positive and negative values: |
| |
| ``` |
| 0 -> 0 -1 -> 1 1 -> 2 -2 -> 3 2 -> 4 |
| ``` |
| |
| This ensures small-magnitude values (whether positive or negative) encode |
| in few bytes. |
| |
| ## Compression |
| |
| When compression is enabled, the sample data region contains a zstd stream. |
| The string table, frame table, and footer remain uncompressed so readers can |
| access metadata without decompressing the entire file. A tool that only needs |
| to report "this file contains 50,000 samples of 3 threads" can read the header |
| and footer without touching the compressed sample data. This also simplifies |
| the format: the header's offset fields point directly to the tables rather |
| than to positions within a decompressed stream. |
| |
| Zstd provides an excellent balance of compression ratio and speed. Profiling |
| data compresses very well (often 5-10x) due to repetitive patterns: the same |
| small set of frame indices appears repeatedly, and delta-encoded timestamps |
| cluster around the sampling interval. Zstd's streaming API allows compression |
| without buffering the entire dataset. The writer feeds sample data through |
| the compressor incrementally, flushing compressed chunks to disk as they |
| become available. |
| |
| Level 5 compression is used as a default. Lower levels (1-3) are faster but |
| compress less; higher levels (6+) compress more but slow down writing. Level |
| 5 provides good compression with minimal impact on profiling overhead. |
| |
| ## Reading and Writing |
| |
| ### Writing |
| |
| 1. Open the output file and write 64 zero bytes as a placeholder header |
| 2. Initialize empty string and frame dictionaries for deduplication |
| 3. For each sample: |
| - Intern any new strings, assigning sequential indices |
| - Intern any new frames, assigning sequential indices |
| - Encode the sample record and write to the buffer |
| - Flush the buffer through compression (if enabled) when full |
| 4. Flush remaining buffered data and finalize compression |
| 5. Write the string table (length-prefixed strings in index order) |
| 6. Write the frame table (varint-encoded entries in index order) |
| 7. Write the footer with final counts |
| 8. Seek to offset 0 and write the header with actual values |
| |
| The writer maintains two dictionaries: one mapping strings to indices, one |
| mapping (filename_idx, funcname_idx, lineno) tuples to frame indices. These |
| enable O(1) lookup during interning. |
| |
| ### Reading |
| |
| 1. Read the header magic number to detect endianness (set `needs_swap` flag |
| if the magic appears byte-swapped) |
| 2. Validate version and read remaining header fields (byte-swapping if needed) |
| 3. Seek to end − 32 and read the footer (byte-swapping counts if needed) |
| 4. Allocate string array of `string_count` elements |
| 5. Parse the string table, populating the array |
| 6. Allocate frame array of `frame_count * 3` uint32 elements |
| 7. Parse the frame table, populating the array |
| 8. If compressed, decompress the sample data region |
| 9. Iterate through samples, resolving indices to strings/frames |
| (byte-swapping thread_id and interpreter_id if needed) |
| |
| The reader builds lookup arrays rather than dictionaries since it only needs |
| index-to-value mapping, not value-to-index. |
| |
| ## Platform Considerations |
| |
| ### Byte Ordering and Cross-Platform Portability |
| |
| The binary format uses **native byte order** for all multi-byte integer |
| fields when writing. However, the reader supports **cross-endian reading**: |
| files written on a little-endian system (x86, ARM) can be read on a |
| big-endian system (s390x, PowerPC), and vice versa. |
| |
| The magic number doubles as an endianness marker. When read on a system with |
| different byte order, it appears byte-swapped (`0x48434154` instead of |
| `0x54414348`). The reader detects this and automatically byte-swaps all |
| fixed-width integer fields during parsing. |
| |
| Writers must use `memcpy()` from properly-sized integer types when writing |
| fixed-width integer fields. When the source variable's type differs from the |
| field width (e.g., `size_t` written as 4 bytes), explicit casting to the |
| correct type (e.g., `uint32_t`) is required before `memcpy()`. On big-endian |
| systems, copying from an oversized type would copy the wrong bytes—high-order |
| zeros instead of the actual value. |
| |
| The reader tracks whether byte-swapping is needed via a `needs_swap` flag set |
| during header parsing. All fixed-width fields in the header, footer, and |
| sample data are conditionally byte-swapped using Python's internal byte-swap |
| functions (`_Py_bswap32`, `_Py_bswap64` from `pycore_bitutils.h`). |
| |
| Variable-length integers (varints) are byte-order independent since they |
| encode values one byte at a time using the LEB128 scheme, so they require |
| no special handling for cross-endian reading. |
| |
| ### Memory-Mapped I/O |
| |
| On Unix systems (Linux, macOS), the reader uses `mmap()` to map the file |
| into the process address space. The kernel handles paging data in and out |
| as needed, no explicit read() calls or buffer management are required, |
| multiple readers can share the same physical pages, and sequential access |
| patterns benefit from kernel read-ahead. |
| |
| The implementation uses `madvise()` to hint the access pattern to the kernel: |
| `MADV_SEQUENTIAL` indicates the file will be read linearly, enabling |
| aggressive read-ahead. `MADV_WILLNEED` requests pre-faulting of pages. |
| On Linux, `MAP_POPULATE` pre-faults all pages at mmap time rather than on |
| first access, moving page fault overhead from the parsing loop to the |
| initial mapping for more predictable performance. For large files (over |
| 32 MB), `MADV_HUGEPAGE` requests transparent huge pages (2 MB instead of |
| 4 KB) to reduce TLB pressure when accessing large amounts of data. |
| |
| On Windows, the implementation falls back to standard file I/O with full |
| file buffering. Profiling data files are typically small enough (tens to |
| hundreds of megabytes) that this is acceptable. |
| |
| The writer uses a 512 KB buffer to batch small writes. Each sample record |
| is typically tens of bytes; writing these individually would incur excessive |
| syscall overhead. The buffer accumulates data until full, then flushes in |
| one write() call (or feeds through the compression stream). |
| |
| ## Future Considerations |
| |
| The format reserves space for future extensions. The 12 reserved bytes in |
| the header could hold additional metadata. The 16-byte checksum field in |
| the footer is currently unused. The version field allows incompatible |
| changes with graceful rejection. New compression types could be added |
| (compression_type > 1). |
| |
| Any changes that alter the meaning of existing fields or the parsing logic |
| should increment the version number to prevent older readers from |
| misinterpreting new files. |