 .. highlight:: shell-session

 .. _profiling-sampling:

 ***************************************************
 :mod:`profiling.sampling` --- Statistical profiler
 ***************************************************

 .. module:: profiling.sampling
    :synopsis: Statistical sampling profiler for Python processes.

 .. versionadded:: 3.15

 **Source code:** :source:`Lib/profiling/sampling/`

 .. program:: profiling.sampling

 --------------

 .. image:: tachyon-logo.png
    :alt: Tachyon logo
    :align: center
    :width: 300px

 The :mod:`profiling.sampling` module, named **Tachyon**, provides statistical
 profiling of Python programs through periodic stack sampling. Tachyon can
 run scripts directly or attach to any running Python process without requiring
 code changes or restarts. Because sampling occurs externally to the target
 process, overhead is virtually zero, making Tachyon suitable for both
 development and production environments.


 What is statistical profiling?
 ==============================

 Statistical profiling builds a picture of program behavior by periodically
 capturing snapshots of the call stack. Rather than instrumenting every function
 call and return as deterministic profilers do, Tachyon reads the call stack at
 regular intervals to record what code is currently running.

 This approach rests on a simple principle: functions that consume significant
 CPU time will appear frequently in the collected samples. By gathering thousands
 of samples over a profiling session, Tachyon constructs an accurate statistical
 estimate of where time is spent. The more samples collected, the
 more precise this estimate becomes.
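
 The counting principle can be illustrated with a minimal sketch. The sample
 data and function names below are invented for illustration, and the code is
 not Tachyon's implementation. It only shows how hotspots emerge from tallying
 which function is executing in each sampled stack:

 .. code-block:: python

    from collections import Counter

    # Hypothetical samples: each is a call stack, outermost frame first.
    samples = [
        ("main", "handle_request", "parse"),
        ("main", "handle_request", "parse"),
        ("main", "handle_request", "render"),
        ("main", "idle"),
    ]

    def share_of_samples(stacks):
        """Return the fraction of samples in which each function was executing."""
        on_top = Counter(stack[-1] for stack in stacks)
        return {name: count / len(stacks) for name, count in on_top.items()}

    shares = share_of_samples(samples)
    print(shares)

 With only four samples the fractions are coarse; a real session gathers
 thousands of stacks, which is what makes the estimate meaningful.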


 How time is estimated
 ---------------------

 The time values shown in Tachyon's output are **estimates derived from sample
 counts**, not direct measurements. Tachyon counts how many times each function
 appears in the collected samples, then multiplies by the sampling interval to
 estimate time.

 For example, with a 10 kHz sampling rate over a 10-second profile,
 Tachyon collects approximately 100,000 samples. If a function appears in 5,000
 samples (5% of total), Tachyon estimates it consumed 5% of the 10-second
 duration, or about 500 milliseconds. This is a statistical estimate, not a
 precise measurement.
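
 The arithmetic in this example can be written out as a short sketch. The
 helper name ``estimate_seconds`` is hypothetical and is not part of the
 module's API:

 .. code-block:: python

    def estimate_seconds(sample_count, sampling_rate_hz):
        """Estimate time as sample count multiplied by the sampling interval."""
        # The interval is 1 / rate, so multiplying by it equals dividing by the rate.
        return sample_count / sampling_rate_hz

    # 5,000 samples collected at 10 kHz, as in the example above.
    estimated = estimate_seconds(5_000, 10_000)
    print(f"{estimated * 1000:.0f} ms")  # prints "500 ms"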

 The accuracy of these estimates depends on sample count. With 100,000 samples,
 a function showing 5% has a margin of error of roughly ±0.5%. With only 1,000
 samples, the same 5% measurement could actually represent anywhere from 3% to
 7% of real time.

 This is why longer profiling durations and shorter sampling intervals produce
 more reliable results---they collect more samples. For most performance
 analysis, the default settings provide sufficient accuracy to identify
 bottlenecks and guide optimization efforts.

 Because sampling is statistical, results will vary slightly between runs. A
 function showing 12% in one run might show 11% or 13% in the next. This is
 normal and expected. Focus on the overall pattern rather than exact percentages,
 and don't worry about small variations between runs.


 When to use a different approach
 --------------------------------

 Statistical sampling is not ideal for every situation.

 For very short scripts that complete in under one second, the profiler may not
 collect enough samples for reliable results. Use :mod:`profiling.tracing`
 instead, or run the script in a loop to extend profiling time.

 When you need exact call counts, sampling cannot provide them. Sampling
 estimates frequency from snapshots, so if you need to know precisely how many
 times a function was called, use :mod:`profiling.tracing`.

 When comparing two implementations where the difference might be only 1-2%,
 sampling noise can obscure real differences. Use :mod:`timeit` for
 micro-benchmarks or :mod:`profiling.tracing` for precise measurements.


 The key difference from :mod:`profiling.tracing` is how measurement happens.
 A tracing profiler instruments your code, recording every function call and
 return. This provides exact call counts and precise timing but adds overhead
 to every function call. A sampling profiler, by contrast, observes the program
 from outside at fixed intervals without modifying its execution. Think of the
 difference like this: tracing is like having someone follow you and write down
 every step you take, while sampling is like taking photographs every second
 and inferring your path from those snapshots.

 This external observation model is what makes sampling profiling practical for
 production use. The profiled program runs at full speed because there is no
 instrumentation code running inside it, and by default the target process is
 not stopped or paused during sampling (see :ref:`blocking-mode` for the opt-in
 alternative). Tachyon reads the call stack directly from the process's memory
 while it continues to run. You can attach to a live server, collect data, and
 detach without the application ever knowing it was observed. The trade-off is
 that very short-lived functions may be missed if they happen to complete
 between samples.

 Statistical profiling excels at answering the question, "Where is my program
 spending time?" It reveals hotspots and bottlenecks in production code where
 deterministic profiling overhead would be unacceptable. For exact call counts
 and complete call graphs, use :mod:`profiling.tracing` instead.


 Quick examples
 ==============

 Profile a script and see the results immediately::

    python -m profiling.sampling run script.py

 Profile a module with arguments::

    python -m profiling.sampling run -m mypackage.module arg1 arg2

 Generate an interactive flame graph::

    python -m profiling.sampling run --flamegraph -o profile.html script.py

 Attach to a running process by PID::

    python -m profiling.sampling attach 12345

 Use live mode for real-time monitoring (press ``q`` to quit)::

    python -m profiling.sampling run --live script.py

 Profile for 60 seconds with a faster sampling rate::

    python -m profiling.sampling run -d 60 -r 20khz script.py

 Generate a line-by-line heatmap::

    python -m profiling.sampling run --heatmap script.py

 Enable opcode-level profiling to see which bytecode instructions are executing::

    python -m profiling.sampling run --opcodes --flamegraph script.py


 Commands
 ========

 Tachyon provides three subcommands. The ``run`` and ``attach`` commands
 determine how to obtain the target process, while ``replay`` converts
 previously captured binary profiles.


 The ``run`` command
 -------------------

 The ``run`` command launches a Python script or module and profiles it from
 startup::

    python -m profiling.sampling run script.py
    python -m profiling.sampling run -m mypackage.module

 When profiling a script, the profiler starts the target in a subprocess, waits
 for it to initialize, then begins collecting samples. The ``-m`` flag
 indicates that the target should be run as a module (equivalent to
 ``python -m``). Arguments after the target are passed through to the
 profiled program::

    python -m profiling.sampling run script.py --config settings.yaml


 The ``attach`` command
 ----------------------

 The ``attach`` command connects to an already-running Python process by its
 process ID::

    python -m profiling.sampling attach 12345

 This command is particularly valuable for investigating performance issues in
 production systems. The target process requires no modification and need not
 be restarted. The profiler attaches, collects samples for the specified
 duration, then detaches and produces output.

 ::

    python -m profiling.sampling attach --live 12345
    python -m profiling.sampling attach --flamegraph -d 30 -o profile.html 12345

 On most systems, attaching to another process requires appropriate permissions.
 See :ref:`profiling-permissions` for platform-specific requirements.


 .. _replay-command:

 The ``replay`` command
 ----------------------

 The ``replay`` command converts binary profile files to other output formats::

    python -m profiling.sampling replay profile.bin
    python -m profiling.sampling replay --flamegraph -o profile.html profile.bin

 This command is useful when you have captured profiling data in binary format
 and want to analyze it later or convert it to a visualization format. Binary
 profiles can be replayed multiple times to different formats without
 re-profiling.

 ::

    # Convert binary to pstats (default, prints to stdout)
    python -m profiling.sampling replay profile.bin

    # Convert binary to flame graph
    python -m profiling.sampling replay --flamegraph -o output.html profile.bin

    # Convert binary to gecko format for Firefox Profiler
    python -m profiling.sampling replay --gecko -o profile.json profile.bin

    # Convert binary to heatmap
    python -m profiling.sampling replay --heatmap -o my_heatmap profile.bin
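
 When several output formats are needed from the same binary profile, the
 conversions can be scripted. The following is a minimal sketch that uses only
 the flags shown above; the helper name ``replay_command`` and the file names
 are illustrative assumptions. It builds the command lines and prints them
 rather than running them:

 .. code-block:: python

    import sys

    # Output flags from the examples above, mapped to illustrative output paths.
    TARGETS = {
        "--flamegraph": "profile.html",
        "--gecko": "profile.json",
        "--heatmap": "my_heatmap",
    }

    def replay_command(flag, output, profile="profile.bin"):
        """Build the argument list for one ``replay`` invocation."""
        return [sys.executable, "-m", "profiling.sampling", "replay",
                flag, "-o", output, profile]

    commands = [replay_command(flag, out) for flag, out in TARGETS.items()]
    for command in commands:
        print(" ".join(command))
        # To execute instead of printing: subprocess.run(command, check=True)

 Executing the commands requires an interpreter that ships
 :mod:`profiling.sampling` and an existing ``profile.bin`` file.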


 Profiling in production
 -----------------------

 The sampling profiler is designed for production use. It imposes no measurable
 overhead on the target process because it reads memory externally rather than
 instrumenting code. The target application continues running at full speed and
 is unaware it is being profiled.

 When profiling production systems, keep these guidelines in mind:

 Start with shorter durations (10-30 seconds) to get quick results, then extend
 if you need more statistical accuracy. The default 10-second duration is usually
 sufficient to identify major hotspots.

 If possible, profile during representative load rather than peak traffic.
 Profiles collected during normal operation are easier to interpret than those
 collected during unusual spikes.

 The profiler itself consumes some CPU on the machine where it runs (not on the
 target process). On the same machine, this is typically negligible. When
 profiling remote processes, network latency does not affect the target.

 Results from production may differ from development due to different data
 sizes, concurrent load, or caching effects. This is expected and is often
 exactly what you want to capture.


 .. _profiling-permissions:

 Platform requirements
 ---------------------

 The profiler reads the target process's memory to capture stack traces. This
 requires elevated permissions on most operating systems.

 **Linux**

 On Linux, the profiler uses ``ptrace`` or ``process_vm_readv`` to read the
 target process's memory. This typically requires one of:

 - Running as root
 - Having the ``CAP_SYS_PTRACE`` capability
 - Adjusting the Yama ptrace scope: ``/proc/sys/kernel/yama/ptrace_scope``

 The default ``ptrace_scope`` of 1 restricts ptrace so that a process can
 attach only to its own descendants. To allow attaching to any process owned
 by the same user, set it to 0::

    echo 0 | sudo tee /proc/sys/kernel/yama/ptrace_scope
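
 Before changing the setting, it can help to check its current value. The
 following is a small sketch that reads the file named above; the helper name
 ``read_ptrace_scope`` is hypothetical, and the function returns ``None`` on
 systems where the file does not exist (for example, kernels without Yama or
 non-Linux platforms):

 .. code-block:: python

    from pathlib import Path

    YAMA_SCOPE = Path("/proc/sys/kernel/yama/ptrace_scope")

    def read_ptrace_scope(path=YAMA_SCOPE):
        """Return the Yama ptrace scope as an int, or None if it is unavailable."""
        try:
            return int(path.read_text().strip())
        except (OSError, ValueError):
            return None

    scope = read_ptrace_scope()
    if scope is None:
        print("Yama ptrace_scope is not available on this system")
    else:
        print(f"ptrace_scope = {scope}")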

 **macOS**

 On macOS, the profiler uses ``task_for_pid()`` to access the target process.
 This requires one of:

 - Running as root
 - The profiler binary having the ``com.apple.security.cs.debugger`` entitlement
 - System Integrity Protection (SIP) being disabled (not recommended)

 **Windows**

 On Windows, the profiler requires administrative privileges or the
 ``SeDebugPrivilege`` privilege to read another process's memory.


 Version compatibility
 ---------------------

 The profiler and target process must run the same Python minor version (for
 example, both Python 3.15). Attaching from Python 3.14 to a Python 3.15 process
 is not supported.

 Additional restrictions apply to pre-release Python versions: if either the
 profiler or target is running a pre-release (alpha, beta, or release candidate),
 both must run the exact same version.
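
 The two version rules above can be expressed as a short check. This is a
 sketch of the stated rules, not the profiler's own validation code; the
 function name ``versions_compatible`` is hypothetical, and both arguments are
 assumed to be tuples laid out like :data:`sys.version_info`:

 .. code-block:: python

    def versions_compatible(profiler, target):
        """Apply the rules above to two ``sys.version_info``-style tuples.

        Each argument is (major, minor, micro, releaselevel, serial).
        """
        if tuple(profiler[:2]) != tuple(target[:2]):
            return False  # different major.minor versions
        either_prerelease = profiler[3] != "final" or target[3] != "final"
        if either_prerelease:
            # Pre-releases must match exactly, including level and serial.
            return tuple(profiler) == tuple(target)
        return True

    print(versions_compatible((3, 15, 0, "final", 0), (3, 15, 2, "final", 0)))
    print(versions_compatible((3, 15, 0, "beta", 1), (3, 15, 0, "beta", 2)))

 The first call reports a compatible pair (same minor version, both final
 releases); the second does not, because the two betas differ.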

 Free-threaded and standard builds cannot be mixed: the profiler cannot attach
 from a free-threaded build to a standard build, or vice versa.


 Sampling configuration
 ======================

 Before exploring the various output formats and visualization options, it is
 important to understand how to configure the sampling process itself. The
 profiler offers several options that control how frequently samples are
 collected, how long profiling runs, which threads are observed, and what
 additional context is captured in each sample.

 The default configuration works well for most use cases:

 .. list-table::
    :header-rows: 1
    :widths: 25 75

    * - Option
      - Default
    * - ``--sampling-rate`` / ``-r``
      - 1 kHz
    * - ``--duration`` / ``-d``
      - 10 seconds
    * - ``--all-threads`` / ``-a``
      - Main thread only
    * - ``--native``
      - No ``<native>`` frames (C code time attributed to caller)
    * - ``--no-gc``
      - ``<GC>`` frames included when garbage collection is active
    * - ``--mode``
      - Wall-clock mode (all samples recorded)
    * - ``--realtime-stats``
      - Disabled
    * - ``--subprocesses``
      - Disabled
    * - ``--blocking``
      - Disabled (non-blocking sampling)


 Sampling rate and duration
 --------------------------

 The two most fundamental parameters are the sampling rate and duration.
 Together, these determine how many samples will be collected during a profiling
 session.

 The :option:`--sampling-rate` option (:option:`-r`) sets how frequently samples
 are collected. The default is 1 kHz (1,000 samples per second)::

    python -m profiling.sampling run -r 20khz script.py

 Higher rates capture more samples and provide finer-grained data at the
 cost of slightly higher profiler CPU usage. Lower rates reduce profiler
 overhead but may miss short-lived functions. For most applications, the
 default rate provides a good balance between accuracy and overhead.

 The :option:`--duration` option (:option:`-d`) sets how long to profile in seconds. The
 default is 10 seconds::

    python -m profiling.sampling run -d 60 script.py

 Longer durations collect more samples and produce more statistically reliable
 results, especially for code paths that execute infrequently. When profiling
 a program that runs for a fixed time, you may want to set the duration to
 match or exceed the expected runtime.


 Thread selection
 ----------------

 Python programs often use multiple threads, whether explicitly through the
 :mod:`threading` module or implicitly through libraries that manage thread
 pools.

 By default, the profiler samples only the main thread. The :option:`--all-threads`
 option (:option:`-a`) enables sampling of all threads in the process::

    python -m profiling.sampling run -a script.py

 Multi-thread profiling reveals how work is distributed across threads and can
 identify threads that are blocked or starved. Each thread's samples are
 combined in the output, with the ability to filter by thread in some formats.
 This option is particularly useful when investigating concurrency issues or
 when work is distributed across a thread pool.


 .. _blocking-mode:

 Blocking mode
 -------------

 By default, Tachyon reads the target process's memory without stopping it.
 This non-blocking approach is ideal for most profiling scenarios because it
 imposes virtually zero overhead on the target application: the profiled
 program runs at full speed and is unaware it is being observed.

 However, non-blocking sampling can occasionally produce incomplete or
 inconsistent stack traces. This happens in applications with many generators
 or coroutines that rapidly switch between yield points, and in programs with
 very fast-changing call stacks where functions enter and exit between the
 start and end of a single stack read. The reconstructed stacks may then mix
 frames from different execution states or describe a call chain that never
 actually existed.

 For these cases, the :option:`--blocking` option stops the target process during
 each sample::

    python -m profiling.sampling run --blocking script.py
    python -m profiling.sampling attach --blocking 12345

 When blocking mode is enabled, the profiler suspends the target process,
 reads its stack, then resumes it. This guarantees that each captured stack
 represents a real, consistent snapshot of what the process was doing at that
 instant. The trade-off is that the target process runs slower because it is
 repeatedly paused.

 .. warning::

    Do not use very high sampling rates with blocking mode. Suspending and
    resuming a process takes time, and if the interval between samples is too
    short, the target will spend more time stopped than running. For blocking
    mode, rates of 1 kHz (one sample per millisecond) or lower are
    recommended. Higher rates, such as 10 kHz (one sample every 100
    microseconds), may cause noticeable slowdown in the target application.

 Use blocking mode only when you observe inconsistent stacks in your profiles,
 particularly with generator-heavy or coroutine-heavy code. For most
 applications, the default non-blocking mode provides accurate results with
 virtually no impact on the target process.


 Special frames
 --------------

 The profiler can inject artificial frames into the captured stacks to provide
 additional context about what the interpreter is doing at the moment each
 sample is taken. These synthetic frames help distinguish different types of
 execution that would otherwise be invisible.

 The :option:`--native` option adds ``<native>`` frames to indicate when Python has
 called into C code (extension modules, built-in functions, or the interpreter
 itself)::

    python -m profiling.sampling run --native script.py

 These frames help distinguish time spent in Python code versus time spent in
 native libraries. Without this option, native code execution appears as time
 in the Python function that made the call. This is useful when optimizing
 code that makes heavy use of C extensions like NumPy or database drivers.

 By default, the profiler includes ``<GC>`` frames when garbage collection is
 active. The :option:`--no-gc` option suppresses these frames::

    python -m profiling.sampling run --no-gc script.py

 GC frames help identify programs where garbage collection consumes significant
 time, which may indicate memory allocation patterns worth optimizing. If you
 see substantial time in ``<GC>`` frames, consider investigating object
 allocation rates or using object pooling.


 Opcode-aware profiling
 ----------------------

 The :option:`--opcodes` option enables instruction-level profiling that captures
 which Python bytecode instructions are executing at each sample::

    python -m profiling.sampling run --opcodes --flamegraph script.py

 This feature provides visibility into Python's bytecode execution, including
 adaptive specialization optimizations. When a generic instruction like
 ``LOAD_ATTR`` is specialized at runtime into a more efficient variant like
 ``LOAD_ATTR_INSTANCE_VALUE``, the profiler shows both the specialized name
 and the base instruction.

 Opcode information appears in several output formats:

 - **Flame graphs**: Hovering over a frame displays a tooltip with a bytecode
   instruction breakdown, showing which opcodes consumed time in that function
 - **Heatmap**: Expandable bytecode panels per source line show instruction
   breakdown with specialization percentages
 - **Live mode**: An opcode panel shows instruction-level statistics for the
   selected function, accessible via keyboard navigation
 - **Gecko format**: Opcode transitions are emitted as interval markers in the
   Firefox Profiler timeline

 This level of detail is particularly useful for:

 - Understanding the performance impact of Python's adaptive specialization
 - Identifying hot bytecode instructions that might benefit from optimization
 - Analyzing the effectiveness of different code patterns at the instruction level
 - Debugging performance issues that occur at the bytecode level

 The :option:`--opcodes` option is compatible with :option:`--live`, :option:`--flamegraph`,
 :option:`--heatmap`, and :option:`--gecko` formats. It requires additional memory to store
 opcode information and may slightly reduce sampling performance, in exchange
 for instruction-level detail that the other options do not capture.


 Real-time statistics
 --------------------

 The :option:`--realtime-stats` option displays sampling rate statistics during
 profiling::

    python -m profiling.sampling run --realtime-stats script.py

 This shows the actual achieved sampling rate, which may be lower than requested
 if the profiler cannot keep up. The statistics help verify that profiling is
 working correctly and that sufficient samples are being collected. See
 :ref:`sampling-efficiency` for details on interpreting these metrics.


 Subprocess profiling
 --------------------

 The :option:`--subprocesses` option enables automatic profiling of subprocesses
 spawned by the target::

    python -m profiling.sampling run --subprocesses script.py
    python -m profiling.sampling attach --subprocesses 12345

 When enabled, the profiler monitors the target process for child process
 creation. When a new Python child process is detected, a separate profiler
 instance is automatically spawned to profile it. This is useful for
 applications that use :mod:`multiprocessing`, :mod:`subprocess`,
 :mod:`concurrent.futures` with :class:`~concurrent.futures.ProcessPoolExecutor`,
 or other process spawning mechanisms.

 .. code-block:: python
    :caption: worker_pool.py

    from concurrent.futures import ProcessPoolExecutor
    import math

    def compute_factorial(n):
        total = 0
        for i in range(50):
            total += math.factorial(n)
        return total

    if __name__ == "__main__":
        numbers = [5000 + i * 100 for i in range(50)]
        with ProcessPoolExecutor(max_workers=4) as executor:
            results = list(executor.map(compute_factorial, numbers))
        print(f"Computed {len(results)} factorials")

 ::

    python -m profiling.sampling run --subprocesses --flamegraph worker_pool.py

 This produces separate flame graphs for the main process and each worker
 process: ``flamegraph_<main_pid>.html``, ``flamegraph_<worker1_pid>.html``,
 and so on.

 Each subprocess receives its own output file. The filename is derived from
 the specified output path (or the default) with the subprocess's process ID
 appended:

 - If you specify ``-o profile.html``, subprocesses produce ``profile_12345.html``,
   ``profile_12346.html``, and so on
 - With default output, subprocesses produce files like ``flamegraph_12345.html``
   or directories like ``heatmap_12345``
 - For pstats format (which defaults to stdout), subprocesses produce files like
   ``profile_12345.pstats``
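
 The naming pattern in the list above can be mimicked with a few lines of
 code. This is an illustrative sketch only; ``subprocess_output_path`` is a
 hypothetical helper, not the function the profiler uses internally:

 .. code-block:: python

    from pathlib import Path

    def subprocess_output_path(output, pid):
        """Append ``_<pid>`` to the file name, keeping any extension."""
        path = Path(output)
        return path.with_name(f"{path.stem}_{pid}{path.suffix}")

    print(subprocess_output_path("profile.html", 12345))  # profile_12345.html
    print(subprocess_output_path("heatmap", 12345))       # heatmap_12345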

 The subprocess profilers inherit most sampling options from the parent (sampling
 rate, duration, thread selection, native frames, GC frames, async-aware mode,
 and output format). All Python descendant processes are profiled recursively,
 including grandchildren and further descendants.

 Subprocess detection works by periodically scanning for new descendants of
 the target process and checking whether each new process is a Python process
 by probing the process memory for Python runtime structures. Non-Python
 subprocesses (such as shell commands or external tools) are ignored.

 There is a limit of 100 concurrent subprocess profilers to prevent resource
 exhaustion in programs that spawn many processes. If this limit is reached,
 additional subprocesses are not profiled and a warning is printed.

 The :option:`--subprocesses` option is incompatible with :option:`--live` mode
 because live mode uses an interactive terminal interface that cannot
 accommodate multiple concurrent profiler displays.


 .. _sampling-efficiency:

 Sampling efficiency
 -------------------

 Sampling efficiency metrics help assess the quality of the collected data.
 These metrics appear in the profiler's terminal output and in the flame graph
 sidebar.

 **Sampling efficiency** is the percentage of sample attempts that succeeded.
 Each sample attempt reads the target process's call stack from memory. An
 attempt can fail if the process is in an inconsistent state at the moment of
 reading, such as during a context switch or while the interpreter is updating
 its internal structures. A low efficiency may indicate that the profiler could
 not keep up with the requested sampling rate, often due to system load or an
 overly aggressive interval setting.

 **Missed samples** is the percentage of expected samples that were not
 collected. Based on the configured interval and duration, the profiler expects
 to collect a certain number of samples. Some samples may be missed if the
 profiler falls behind schedule, for example when the system is under heavy
 load. A small percentage of missed samples is normal and does not significantly
 affect the statistical accuracy of the profile.

 Both metrics are informational. Even with some failed attempts or missed
 samples, the profile remains statistically valid as long as enough samples
 were collected. The profiler reports the actual number of samples captured,
 which you can use to judge whether the data is sufficient for your analysis.
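
 As a rough illustration of how the two percentages relate, the sketch below
 computes them from raw counts. The function name and parameters are
 hypothetical, and it assumes that "missed" means attempts that were never
 made relative to the expected count (rate multiplied by duration); the
 profiler's own bookkeeping may differ:

 .. code-block:: python

    def sampling_metrics(attempts, successes, rate_hz, duration_s):
        """Return (efficiency %, missed %) under the assumptions stated above."""
        expected = rate_hz * duration_s
        efficiency = 100.0 * successes / attempts
        missed = 100.0 * max(expected - attempts, 0) / expected
        return efficiency, missed

    # 1 kHz for 10 seconds: 10,000 samples expected, 9,800 attempted, 9,700 read.
    efficiency, missed = sampling_metrics(9_800, 9_700, 1_000, 10)
    print(f"efficiency {efficiency:.1f}%, missed {missed:.1f}%")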


 Profiling modes
 ===============

 The sampling profiler supports four modes that control which samples are
 recorded. The mode determines what the profile measures: total elapsed time,
 CPU execution time, time spent holding the global interpreter lock, or
 exception handling.


 Wall-clock mode
 ---------------

 Wall-clock mode (:option:`--mode`\ ``=wall``) captures all samples regardless of what the
 thread is doing. This is the default mode and provides a complete picture of
 where time passes during program execution::

    python -m profiling.sampling run --mode=wall script.py

 In wall-clock mode, samples are recorded whether the thread is actively
 executing Python code, waiting for I/O, blocked on a lock, or sleeping.
 This makes wall-clock profiling ideal for understanding the overall time
 distribution in your program, including time spent waiting.

 If your program spends significant time in I/O operations, network calls, or
 sleep, wall-clock mode will show these waits as time attributed to the calling
 function. This is often exactly what you want when optimizing end-to-end
 latency.


 CPU mode
 --------

 CPU mode (:option:`--mode`\ ``=cpu``) records samples only when the thread is actually
 executing on a CPU core::

    python -m profiling.sampling run --mode=cpu script.py

 Samples taken while the thread is sleeping, blocked on I/O, or waiting for
 a lock are discarded. The resulting profile shows where CPU cycles are consumed,
 filtering out idle time.

 CPU mode is useful when you want to focus on computational hotspots without
 being distracted by I/O waits. If your program alternates between computation
 and network calls, CPU mode reveals which computational sections are most
 expensive.


 Comparing wall-clock and CPU profiles
 -------------------------------------

 Running both wall-clock and CPU mode profiles can reveal whether a function's
 time is spent computing or waiting.

 If a function appears prominently in both profiles, it is a true computational
 hotspot---actively using the CPU. Optimization should focus on algorithmic
 improvements or more efficient code.

 If a function is high in wall-clock mode but low or absent in CPU mode, it is
 I/O-bound or waiting. The function spends most of its time waiting for network,
 disk, locks, or sleep. CPU optimization won't help here; consider async I/O,
 connection pooling, or reducing wait time instead.

 .. code-block:: python

    import time

    def do_sleep():
        time.sleep(2)

    def do_compute():
        sum(i**2 for i in range(1000000))

    if __name__ == "__main__":
        do_sleep()
        do_compute()

 ::

    python -m profiling.sampling run --mode=wall script.py  # do_sleep ~98%, do_compute ~1%
    python -m profiling.sampling run --mode=cpu script.py   # do_sleep absent, do_compute dominates


 GIL mode
 --------

 GIL mode (:option:`--mode`\ ``=gil``) records samples only when the thread holds Python's
 global interpreter lock::

    python -m profiling.sampling run --mode=gil script.py

 The GIL is held only while executing Python bytecode. When Python calls into
 C extensions, performs I/O operations, or executes native code, the GIL is
 typically released. This means GIL mode effectively measures time spent
 running Python code specifically, filtering out time in native libraries.

 In multi-threaded programs, GIL mode reveals which code is preventing other
 threads from running Python bytecode. Since only one thread can hold the GIL
 at a time, functions that appear frequently in GIL mode profiles are
 monopolizing the interpreter.

 GIL mode helps answer questions like "which functions are monopolizing the
 GIL?" and "why are my other threads starving?" It can also be useful in
 single-threaded programs to distinguish Python execution time from time spent
 in C extensions or I/O.

 .. code-block:: python

    import hashlib

    def hash_work():
        # C extension - releases GIL during computation
        for _ in range(200):
            hashlib.sha256(b"data" * 250000).hexdigest()

    def python_work():
        # Pure Python - holds GIL during computation
        for _ in range(3):
            sum(i**2 for i in range(1000000))

    if __name__ == "__main__":
        hash_work()
        python_work()

 ::

    python -m profiling.sampling run --mode=cpu script.py  # hash_work ~42%, python_work ~38%
    python -m profiling.sampling run --mode=gil script.py  # hash_work ~5%, python_work ~60%


 Exception mode
 --------------

 Exception mode (:option:`--mode`\ ``=exception``) records samples only when a thread has
 an active exception::

    python -m profiling.sampling run --mode=exception script.py

 Samples are recorded in two situations: when an exception is being propagated
 up the call stack (after ``raise`` but before being caught), or when code is
 executing inside an ``except`` block where exception information is still
 present in the thread state.

 The following example illustrates which code regions are captured:

 .. code-block:: python

    def example():
        try:
            raise ValueError("error")    # Captured: exception being raised
        except ValueError:
            process_error()              # Captured: inside except block
        finally:
            cleanup()                    # NOT captured: exception already handled

    def example_propagating():
        try:
            try:
                raise ValueError("error")
            finally:
                cleanup()                # Captured: exception propagating through
        except ValueError:
            pass

    def example_no_exception():
        try:
            do_work()
        finally:
            cleanup()                    # NOT captured: no exception involved

 Note that ``finally`` blocks are only captured when an exception is actively
 propagating through them. Once an ``except`` block finishes executing, Python
 clears the exception information before running any subsequent ``finally``
 block. Similarly, ``finally`` blocks that run during normal execution (when no
 exception was raised) are not captured because no exception state is present.

 This mode is useful for understanding where your program spends time handling
 errors. Exception handling can be a significant source of overhead in code
 that uses exceptions for flow control (such as ``StopIteration`` in iterators)
 or in applications that process many error conditions (such as network servers
 handling connection failures).

 Exception mode helps answer questions like "how much time is spent handling
 exceptions?" and "which exception handlers are the most expensive?" It can
 reveal hidden performance costs in code that catches and processes many
 exceptions, even when those exceptions are handled gracefully. For example,
 if a parsing library uses exceptions internally to signal format errors, this
 mode will capture time spent in those handlers even if the calling code never
 sees the exceptions.
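
 As a small illustration, the following script spends part of its run time
 inside an ``except`` block, which is the kind of work exception mode is meant
 to surface. The names ``parse_int`` and ``count_valid`` are hypothetical and
 exist only for this sketch:

 .. code-block:: python

    def parse_int(text):
        """Return int(text), or None when the text is not a number."""
        try:
            return int(text)
        except ValueError:
            # Code here runs while the exception is still being handled.
            return None

    def count_valid(items):
        """Count the items that parse as integers."""
        return sum(1 for item in items if parse_int(item) is not None)

    if __name__ == "__main__":
        data = ["1", "x", "2", "y"] * 250_000
        print(count_valid(data))

 Saved under a name such as ``errors.py`` (also hypothetical), the script could
 be profiled with::

    python -m profiling.sampling run --mode=exception errors.py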


 Output formats
 ==============

 The profiler produces output in several formats, each suited to different
 analysis workflows. The format is selected with a command-line flag, and
 output goes to stdout, a file, or a directory depending on the format.


 pstats format
 -------------

 The pstats format (:option:`--pstats`) produces a text table similar to what
 deterministic profilers generate. This is the default output format::

    python -m profiling.sampling run script.py
    python -m profiling.sampling run --pstats script.py

 .. figure:: tachyon-pstats.png
    :alt: Tachyon pstats terminal output
    :align: center
    :width: 100%

    The pstats format displays profiling results in a color-coded table showing
    function hotspots, sample counts, and timing estimates.

 Output appears on stdout by default::

    Profile Stats (Mode: wall):
         nsamples  sample%    tottime (ms)  cumul%   cumtime (ms)  filename:lineno(function)
           234/892    11.7%       234.00     44.6%       892.00    server.py:145(handle_request)
           156/156     7.8%       156.00      7.8%       156.00    <built-in>:0(socket.recv)
            98/421     4.9%        98.00     21.1%       421.00    parser.py:67(parse_message)

 The columns show sampling counts and estimated times:

 - **nsamples**: Displayed as ``direct/cumulative`` (for example, ``10/50``).
   Direct samples are when the function was at the top of the stack, actively
   executing. Cumulative samples are when the function appeared anywhere on the
   stack, including when it was waiting for functions it called. If a function
   shows ``10/50``, it was directly executing in 10 samples and was on the call
   stack in 50 samples total.

 - **sample%** and **cumul%**: Percentages of total samples for direct and
   cumulative counts respectively.

 - **tottime** and **cumtime**: Estimated wall-clock time based on sample counts
   and the profiling duration. Time units are selected automatically based on
   the magnitude: seconds for large values, milliseconds for moderate values,
   or microseconds for small values.
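
 The relationship between sampled stacks and these columns can be sketched in
 a few lines of Python. This is an illustration only, not Tachyon's
 implementation: the helper names are hypothetical, the 1000 µs interval
 assumes the default 1 kHz sampling rate, and counting a function once per
 sample even when it recurses is an assumption of the sketch:

 .. code-block:: python

    from collections import Counter

    def column_counts(stacks):
        """Return (direct, cumulative) sample counters for a list of stacks.

        Each stack lists function names from the outermost caller to the
        function that was executing when the sample was taken.
        """
        direct = Counter()
        cumulative = Counter()
        for stack in stacks:
            if not stack:
                continue
            direct[stack[-1]] += 1          # top of the stack: executing
            cumulative.update(set(stack))   # once per sample (assumption)
        return direct, cumulative

    def estimated_ms(nsamples, interval_us=1000):
        """Estimate time in milliseconds as sample count times interval."""
        return nsamples * interval_us / 1000

    stacks = [
        ["main", "handle_request", "parse_message"],
        ["main", "handle_request"],
        ["main", "handle_request", "recv"],
        ["main", "handle_request"],
    ]
    direct, cumulative = column_counts(stacks)
    total = len(stacks)
    for name in sorted(cumulative):
        print(f"{direct[name]}/{cumulative[name]}",
              f"{100 * direct[name] / total:.1f}%",
              f"{estimated_ms(direct[name]):.2f} ms",
              name)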

 The output includes a legend explaining each column and a summary of
 interesting functions that highlights:

 - **Hot spots**: Functions with high direct/cumulative sample ratio (ratio
   close to 1.0). These functions spend most of their time executing their own
   code rather than waiting for callees. High ratios indicate where CPU time
   is actually consumed.

 - **Indirect calls**: Functions with large differences between cumulative and
   direct samples. These are orchestration functions that delegate work to
   other functions. They appear frequently on the stack but rarely at the top.

 - **Call magnification**: Functions where cumulative samples far exceed direct
   samples (high cumulative/direct multiplier). These are frequently-nested
   functions that appear deep in many call chains.
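
 The three quantities behind these categories follow directly from the
 ``direct/cumulative`` pair. The helpers below are a hypothetical sketch; the
 thresholds Tachyon uses to decide which functions to list are not shown here:

 .. code-block:: python

    def direct_ratio(direct, cumulative):
        """Fraction of a function's samples spent in its own code."""
        return direct / cumulative if cumulative else 0.0

    def indirect_samples(direct, cumulative):
        """Samples where the function was on the stack but not on top."""
        return cumulative - direct

    def magnification(direct, cumulative):
        """Cumulative/direct multiplier, or None without direct samples."""
        return cumulative / direct if direct else None

    # Values taken from the sample table earlier in this section.
    print(direct_ratio(156, 156))       # socket.recv
    print(indirect_samples(234, 892))   # handle_request
    print(magnification(234, 892))      # handle_request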

 Use :option:`--no-summary` to suppress both the legend and summary sections.

 To save pstats output to a file instead of stdout::

    python -m profiling.sampling run -o profile.txt script.py

 The pstats format supports several options for controlling the display.
 The :option:`--sort` option determines the column used for ordering results::

    python -m profiling.sampling run --sort=tottime script.py
    python -m profiling.sampling run --sort=cumtime script.py
    python -m profiling.sampling run --sort=nsamples script.py

 The :option:`--limit` option restricts output to the top N entries::

    python -m profiling.sampling run --limit=30 script.py

 The :option:`--no-summary` option omits the legend and the summary of
 interesting functions described above.


 Collapsed stacks format
 -----------------------

 Collapsed stacks format (:option:`--collapsed`) produces one line per unique call
 stack, with a count of how many times that stack was sampled::

    python -m profiling.sampling run --collapsed script.py

 The output looks like:

 .. code-block:: text

    main;process_data;parse_json;decode_utf8 42
    main;process_data;parse_json 156
    main;handle_request;send_response 89

 Each line contains semicolon-separated function names representing the call
 stack from bottom to top, followed by a space and the sample count. This
 format is designed for compatibility with external flame graph tools,
 particularly Brendan Gregg's ``flamegraph.pl`` script.

 To generate a flame graph from collapsed stacks::

    python -m profiling.sampling run --collapsed script.py > stacks.txt
    flamegraph.pl stacks.txt > profile.svg

 The resulting SVG can be viewed in any web browser and provides an interactive
 visualization where you can click to zoom into specific call paths.
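
 Because the format is plain text, it is also easy to post-process without
 external tools. The following sketch parses collapsed lines and totals the
 samples by the function at the top of each stack. The helper names are
 hypothetical, and splitting at the last space is an assumption made so that
 frame names containing spaces would still parse:

 .. code-block:: python

    from collections import Counter

    def parse_collapsed(lines):
        """Yield (frames, count) pairs from collapsed-stack lines."""
        for line in lines:
            line = line.strip()
            if not line:
                continue
            # The sample count follows the last space on the line.
            stack, _, count = line.rpartition(" ")
            yield stack.split(";"), int(count)

    def leaf_totals(lines):
        """Sum sample counts by the function at the top of each stack."""
        totals = Counter()
        for frames, count in parse_collapsed(lines):
            totals[frames[-1]] += count
        return totals

    sample = [
        "main;process_data;parse_json;decode_utf8 42",
        "main;process_data;parse_json 156",
        "main;handle_request;send_response 89",
    ]
    for name, count in leaf_totals(sample).most_common():
        print(name, count)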


 Flame graph format
 ------------------

 Flame graph format (:option:`--flamegraph`) produces a self-contained HTML file with
 an interactive flame graph visualization::

    python -m profiling.sampling run --flamegraph script.py
    python -m profiling.sampling run --flamegraph -o profile.html script.py

 .. figure:: tachyon-flamegraph.png
    :alt: Tachyon interactive flame graph
    :align: center
    :width: 100%

    The flame graph visualization shows call stacks as nested rectangles, with
    width proportional to time spent. The sidebar displays runtime statistics,
    GIL metrics, and hotspot functions.

 .. only:: html

    `Try the interactive example <../_static/tachyon-example-flamegraph.html>`__!

 If no output file is specified, the profiler generates a filename based on
 the process ID (for example, ``flamegraph_12345.html``).

 The generated HTML file requires no external dependencies and can be opened
 directly in a web browser. The visualization displays call stacks as nested
 rectangles, with width proportional to time spent. Hovering over a rectangle
 shows details about that function including source code context, and clicking
 zooms into that portion of the call tree.

 The flame graph interface includes:

 - A sidebar showing profile summary, thread statistics, sampling efficiency
   metrics (see :ref:`sampling-efficiency`), and top hotspot functions
 - Search functionality supporting both function name matching and
   ``file.py:42`` line patterns
 - Per-thread filtering via dropdown
 - Dark/light theme toggle (preference saved across sessions)
 - SVG export for saving the current view

 The thread statistics section shows runtime behavior metrics:

 - **GIL Held**: percentage of samples where a thread held the global interpreter
   lock (actively running Python code)
 - **GIL Released**: percentage of samples where no thread held the GIL
 - **Waiting GIL**: percentage of samples where a thread was waiting to acquire
   the GIL
 - **GC**: percentage of samples during garbage collection

 These statistics help identify GIL contention and understand how time is
 distributed between Python execution, native code, and waiting.

 Flame graphs are particularly effective for identifying deep call stacks and
 understanding the hierarchical structure of time consumption. Wide rectangles
 at the top indicate functions that consume significant time either directly
 or through their callees.


 Gecko format
 ------------

 Gecko format (:option:`--gecko`) produces JSON output compatible with the Firefox
 Profiler::

    python -m profiling.sampling run --gecko script.py
    python -m profiling.sampling run --gecko -o profile.json script.py

 The `Firefox Profiler <https://profiler.firefox.com>`__ is a sophisticated
 web-based tool originally built for profiling Firefox itself. It provides
 features beyond basic flame graphs, including a timeline view, call tree
 exploration, and marker visualization. See the
 `Firefox Profiler documentation <https://profiler.firefox.com/docs/#/>`__ for
 detailed usage instructions.

 To use the output, open the Firefox Profiler in your browser and load the
 JSON file. The profiler runs entirely client-side, so your profiling data
 never leaves your machine.

 Gecko format automatically collects additional metadata about GIL state and
 CPU activity, enabling analysis features specific to Python's threading model.
 The profiler emits interval markers that appear as colored bands in the
 Firefox Profiler timeline:

 - **GIL markers**: show when threads hold or release the global interpreter lock
 - **CPU markers**: show when threads are executing on CPU versus idle
 - **Code type markers**: distinguish Python code from native (C extension) code
 - **GC markers**: indicate garbage collection activity

 For this reason, the :option:`--mode` option is not available with Gecko format;
 all relevant data is captured automatically.

 .. figure:: tachyon-gecko-calltree.png
    :alt: Firefox Profiler Call Tree view
    :align: center
    :width: 100%

    The Call Tree view shows the complete call hierarchy with sample counts
    and percentages. The sidebar displays detailed statistics for the
    selected function including running time and sample distribution.

 .. figure:: tachyon-gecko-flamegraph.png
    :alt: Firefox Profiler Flame Graph view
    :align: center
    :width: 100%

    The Flame Graph visualization shows call stacks as nested rectangles.
    Function names are visible in the call hierarchy.

 .. figure:: tachyon-gecko-opcodes.png
    :alt: Firefox Profiler Marker Chart with opcodes
    :align: center
    :width: 100%

    The Marker Chart displays interval markers including CPU state, GIL
    status, and opcodes. With ``--opcodes`` enabled, bytecode instructions
    like ``BINARY_OP_ADD_FLOAT``, ``CALL_PY_EXACT_ARGS``, and
    ``CALL_LIST_APPEND`` appear as markers showing execution over time.


 Heatmap format
 --------------

 Heatmap format (:option:`--heatmap`) generates an interactive HTML visualization
 showing sample counts at the source line level::

    python -m profiling.sampling run --heatmap script.py
    python -m profiling.sampling run --heatmap -o my_heatmap script.py

 .. figure:: tachyon-heatmap.png
    :alt: Tachyon heatmap visualization
    :align: center
    :width: 100%

    The heatmap overlays sample counts directly on your source code. Lines are
    color-coded from cool (few samples) to hot (many samples). Navigation
    buttons (▲▼) let you jump between callers and callees.

 Unlike other formats that produce a single file, heatmap output creates a
 directory containing HTML files for each profiled source file. If no output
 path is specified, the directory is named ``heatmap_PID``.

 The heatmap visualization displays your source code with a color gradient
 indicating how many samples were collected at each line. Hot lines (many
 samples) appear in warm colors, while cold lines (few or no samples) appear
 in cool colors. This view helps pinpoint exactly which lines of code are
 responsible for time consumption.

 The heatmap interface provides several interactive features:

 - **Coloring modes**: toggle between "Self Time" (direct execution) and
   "Total Time" (cumulative, including time in called functions)
 - **Cold code filtering**: show all lines or only lines with samples
 - **Call graph navigation**: each line shows navigation buttons (▲ for callers,
   ▼ for callees) that let you trace execution paths through your code. When a
   line has several callers or callees, a menu lists all of them with their
   sample counts.
 - **Scroll minimap**: a vertical overview showing the heat distribution across
   the entire file
 - **Hierarchical index**: files organized by type (stdlib, site-packages,
   project) with aggregate sample counts per folder
 - **Dark/light theme**: toggle with preference saved across sessions
 - **Line linking**: click line numbers to create shareable URLs

 When opcode-level profiling is enabled with :option:`--opcodes`, each hot line
 can be expanded to show which bytecode instructions consumed time:

 .. figure:: tachyon-heatmap-with-opcodes.png
    :alt: Heatmap with expanded bytecode panel
    :align: center
    :width: 100%

    Expanding a hot line reveals the bytecode instructions executed, including
    specialized variants. The panel shows sample counts per instruction and the
    overall specialization percentage for the line.

 .. only:: html

    `Try the interactive example <../_static/tachyon-example-heatmap.html>`__!

 Heatmaps are especially useful when you know which file contains a performance
 issue but need to identify the specific lines. Many developers prefer this
 format because it maps directly to their source code, making it easy to read
 and navigate. For smaller scripts and focused analysis, heatmaps provide an
 intuitive view that shows exactly where time is spent without requiring
 interpretation of hierarchical visualizations.


 Binary format
 -------------

 Binary format (:option:`--binary`) produces a compact binary file for efficient
 storage of profiling data::

    python -m profiling.sampling run --binary -o profile.bin script.py
    python -m profiling.sampling attach --binary -o profile.bin 12345

 The :option:`--compression` option controls data compression:

 - ``auto`` (default): Use zstd compression if available, otherwise no
   compression
 - ``zstd``: Force zstd compression (requires :mod:`compression.zstd` support)
 - ``none``: Disable compression

 ::

    python -m profiling.sampling run --binary --compression=zstd -o profile.bin script.py
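
 The fallback behavior of ``auto`` can be pictured with the following sketch.
 It is not the profiler's own code: ``zstd_available`` and
 ``resolve_compression`` are hypothetical helpers, and testing availability by
 importing :mod:`compression.zstd` is an assumption:

 .. code-block:: python

    def zstd_available():
        """Return True when the compression.zstd module can be imported."""
        try:
            import compression.zstd  # noqa: F401
        except ImportError:
            return False
        return True

    def resolve_compression(choice="auto"):
        """Map a --compression value to the codec that would be used."""
        if choice not in {"auto", "zstd", "none"}:
            raise ValueError(f"unknown compression: {choice!r}")
        if choice == "auto":
            return "zstd" if zstd_available() else "none"
        if choice == "zstd" and not zstd_available():
            raise RuntimeError("zstd compression requested but unavailable")
        return choice

    print(resolve_compression("auto"))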

 To analyze binary profiles, use the :ref:`replay-command` to convert them to
 other formats like flame graphs or pstats output.


 Record and replay workflow
 ==========================

 The binary format combined with the replay command enables a record-and-replay
 workflow that separates data capture from analysis. Rather than generating
 visualizations during profiling, you capture raw data to a compact binary file
 and convert it to different formats later.

 This approach has three main benefits:

 - Sampling runs faster because the work of building data structures for
   visualization is deferred until replay.
 - A single binary capture can be converted to multiple output formats
   without re-profiling: pstats for a quick overview, flame graph for visual
   exploration, heatmap for line-level detail.
 - Binary files are compact and easy to share with colleagues who can convert
   them to their preferred format.

 A typical workflow::

    # Capture profile in production or during tests
    python -m profiling.sampling attach --binary -o profile.bin 12345

    # Later, analyze with different formats
    python -m profiling.sampling replay profile.bin
    python -m profiling.sampling replay --flamegraph -o profile.html profile.bin
    python -m profiling.sampling replay --heatmap -o heatmap profile.bin
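
 When several formats are needed from one capture, the replay step can be
 scripted. The sketch below only builds and runs the command lines shown
 above; ``replay_command`` and ``replay_all`` are hypothetical helpers, not
 part of :mod:`profiling.sampling`:

 .. code-block:: python

    import subprocess
    import sys

    def replay_command(profile, fmt=None, output=None):
        """Build the argument list for one replay invocation."""
        cmd = [sys.executable, "-m", "profiling.sampling", "replay"]
        if fmt is not None:
            cmd.append(f"--{fmt}")      # for example --flamegraph
        if output is not None:
            cmd += ["-o", output]
        cmd.append(profile)
        return cmd

    def replay_all(profile, targets):
        """Run replay once for each (format, output) pair."""
        for fmt, output in targets:
            subprocess.run(replay_command(profile, fmt, output), check=True)

 Calling ``replay_all("profile.bin", [("flamegraph", "profile.html"),
 ("heatmap", "heatmap")])`` would then run the last two commands of the
 workflow above in sequence, stopping at the first one that fails.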


 Live mode
 =========

 Live mode (:option:`--live`) provides a terminal-based real-time view of profiling
 data, similar to the ``top`` command for system processes::

    python -m profiling.sampling run --live script.py
    python -m profiling.sampling attach --live 12345

 .. figure:: tachyon-live-mode-2.gif
    :alt: Tachyon live mode showing all threads
    :align: center
    :width: 100%

    Live mode displays real-time profiling statistics, showing combined
    data from multiple threads in a multi-threaded application.

 The display updates continuously as new samples arrive, showing the current
 hottest functions. This mode requires the :mod:`curses` module, which is
 available on Unix-like systems but not on Windows. The terminal must be at
 least 60 columns wide and 12 lines tall; larger terminals display more columns.

 The header displays the top 3 hottest functions, sampling efficiency metrics,
 and thread status statistics (GIL held percentage, CPU usage, GC time). The
 main table shows function statistics with the currently sorted column indicated
 by an arrow (▼).

 When :option:`--opcodes` is enabled, an additional opcode panel appears below the
 main table, showing instruction-level statistics for the currently selected
 function. This panel displays which bytecode instructions are executing most
 frequently, including specialized variants and their base opcodes.

 .. figure:: tachyon-live-mode-1.gif
    :alt: Tachyon live mode with opcode panel
    :align: center
    :width: 100%

    Live mode with ``--opcodes`` enabled shows an opcode panel with a bytecode
    instruction breakdown for the selected function.


 Keyboard commands
 -----------------

 Within live mode, keyboard commands control the display:

 :kbd:`q`
    Quit the profiler and return to the shell.

 :kbd:`s` / :kbd:`S`
    Cycle through sort orders forward/backward (sample count, percentage,
    total time, cumulative percentage, cumulative time).

 :kbd:`p`
    Pause or resume display updates. Sampling continues in the background
    while the display is paused, so you can freeze the view to examine results
    without stopping data collection.

 :kbd:`r`
    Reset all statistics and start fresh. This is disabled after profiling
    finishes to prevent accidental data loss.

 :kbd:`/`
    Enter filter mode to search for functions by name. The filter uses
    case-insensitive substring matching against the filename and function name.
    Type a pattern and press Enter to apply, or Escape to cancel. Glob patterns
    and regular expressions are not supported.

 :kbd:`c`
    Clear the current filter and show all functions again.

 :kbd:`t`
    Toggle between viewing all threads combined or per-thread statistics.
    In per-thread mode, a thread counter (for example, ``1/4``) appears showing
    your position among the available threads.

 :kbd:`←` :kbd:`→` or :kbd:`↑` :kbd:`↓`
    In per-thread view, navigate between threads. Navigation wraps around
    from the last thread to the first and vice versa.

 :kbd:`+` / :kbd:`-`
    Increase or decrease the display refresh rate. The range is 0.05 seconds
    (20 Hz, very responsive) to 1.0 second (1 Hz, lower overhead). Faster refresh
    rates use more CPU. The default is 0.1 seconds (10 Hz).

 :kbd:`x`
    Toggle trend indicators that show whether functions are becoming hotter
    or cooler over time. When enabled, increasing metrics appear in green and
    decreasing metrics appear in red, comparing each update to the previous one.

 :kbd:`h` or :kbd:`?`
    Show the help screen with all available commands.

 :kbd:`j` / :kbd:`k` (or :kbd:`Up` / :kbd:`Down`)
    Navigate through opcode entries in the opcode panel (when ``--opcodes`` is
    enabled). These keys scroll through the instruction-level statistics for the
    currently selected function.
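
 The matching rule used by the :kbd:`/` filter can be sketched as follows.
 The function name and the sample rows are hypothetical, and checking the
 filename and function name separately is an assumption of this sketch:

 .. code-block:: python

    def matches_filter(pattern, filename, funcname):
        """Case-insensitive substring match on filename or function name."""
        needle = pattern.lower()
        return needle in filename.lower() or needle in funcname.lower()

    rows = [
        ("server.py", "handle_request"),
        ("parser.py", "parse_message"),
        ("<built-in>", "socket.recv"),
    ]
    print([row for row in rows if matches_filter("PARSE", *row)])
    # Glob characters are treated literally, so this matches nothing.
    print([row for row in rows if matches_filter("*.py", *row)])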

 When profiling finishes (duration expires or target process exits), the display
 shows a "PROFILING COMPLETE" banner and freezes the final results. You can
 still navigate, sort, and filter the results before pressing :kbd:`q` to exit.

 Live mode is incompatible with output format options (:option:`--collapsed`,
 :option:`--flamegraph`, and so on) because it uses an interactive terminal
 interface rather than producing file output.


 Async-aware profiling
 =====================

 For programs using :mod:`asyncio`, the profiler offers async-aware mode
 (:option:`--async-aware`) that reconstructs call stacks based on the task structure
 rather than the raw Python frames::

    python -m profiling.sampling run --async-aware async_script.py

 Standard profiling of async code can be confusing because the physical call
 stack often shows event loop internals rather than the logical flow of your
 coroutines. Async-aware mode addresses this by tracking which task is running
 and presenting stacks that reflect the ``await`` chain.

 .. code-block:: python

    import asyncio

    async def fetch(url):
        await asyncio.sleep(0.1)
        return url

    async def main():
        for _ in range(50):
            await asyncio.gather(fetch("a"), fetch("b"), fetch("c"))

    if __name__ == "__main__":
        asyncio.run(main())

 ::

    python -m profiling.sampling run --async-aware --flamegraph -o out.html script.py

 .. note::

    Async-aware profiling requires the target process to have the :mod:`asyncio`
    module loaded. If you profile a script before it imports asyncio, async-aware
    mode will not be able to capture task information.


 Async modes
 -----------

 The :option:`--async-mode` option controls which tasks appear in the profile::

    python -m profiling.sampling run --async-aware --async-mode=running async_script.py
    python -m profiling.sampling run --async-aware --async-mode=all async_script.py

 With :option:`--async-mode`\ ``=running`` (the default), only the task currently executing
 on the CPU is profiled. This shows where your program is actively spending time
 and is the typical choice for performance analysis.

 With :option:`--async-mode`\ ``=all``, tasks that are suspended (awaiting I/O, locks, or
 other tasks) are also included. This mode is useful for understanding what your
 program is waiting on, but produces larger profiles since every suspended task
 appears in each sample.


 Task markers and stack reconstruction
 -------------------------------------

 In async-aware profiles, you will see ``<task>`` frames that mark boundaries
 between asyncio tasks. These are synthetic frames inserted by the profiler to
 show the task structure. The task name appears as the function name in these
 frames.

 When a task awaits another task, the profiler reconstructs the logical call
 chain by following the ``await`` relationships. Only "leaf" tasks (tasks that
 no other task is currently awaiting) generate their own stack entries. Tasks
 being awaited by other tasks appear as part of their awaiter's stack instead.

 If a task has multiple awaiters (a diamond pattern in the task graph), the
 profiler deterministically selects one parent and annotates the task marker
 with the number of parents, for example ``MyTask (2 parents)``. This indicates
 that alternate execution paths exist but are not shown in this particular stack.


 Option restrictions
 -------------------

 Async-aware mode uses a different stack reconstruction mechanism and is
 incompatible with: :option:`--native`, :option:`--no-gc`, :option:`--all-threads`, and
 any :option:`--mode` other than ``wall`` (``cpu``, ``gil``, or ``exception``).


 Command-line interface
 ======================

 .. program:: profiling.sampling

 This section lists the complete command-line interface for reference.


 Global options
 --------------

 .. option:: run

    Run and profile a Python script or module.

 .. option:: attach

    Attach to and profile a running process by PID.

 .. option:: replay

    Convert a binary profile file to another output format.


 Sampling options
 ----------------

 .. option:: -r <rate>, --sampling-rate <rate>

    Sampling rate (for example, ``10000``, ``10khz``, ``10k``). Default: ``1khz``.

 .. option:: -d <seconds>, --duration <seconds>

    Profiling duration in seconds. Default: 10.

 .. option:: -a, --all-threads

    Sample all threads, not just the main thread.

 .. option:: --realtime-stats

    Display sampling statistics during profiling.

 .. option:: --native

    Include ``<native>`` frames for non-Python code.

 .. option:: --no-gc

    Exclude ``<GC>`` frames for garbage collection.

 .. option:: --async-aware

    Enable async-aware profiling for asyncio programs.

 .. option:: --opcodes

    Gather bytecode opcode information for instruction-level profiling. Shows
    which bytecode instructions are executing, including specializations.
    Compatible with ``--live``, ``--flamegraph``, ``--heatmap``, and ``--gecko``
    formats only.

 .. option:: --subprocesses

    Also profile subprocesses. Each subprocess gets its own profiler
    instance and output file. Incompatible with ``--live``.

 .. option:: --blocking

    Pause the target process during each sample. This ensures consistent
    stack traces at the cost of slowing down the target. Use with longer
    intervals (1000 µs or higher) to minimize impact. See :ref:`blocking-mode`
    for details.


 Mode options
 ------------

 .. option:: --mode <mode>

    Sampling mode: ``wall`` (default), ``cpu``, ``gil``, or ``exception``.
    The ``cpu``, ``gil``, and ``exception`` modes are incompatible with
    ``--async-aware``.

 .. option:: --async-mode <mode>

    Async profiling mode: ``running`` (default) or ``all``.
    Requires ``--async-aware``.


 Output options
 --------------

 .. option:: --pstats

    Generate text statistics output. This is the default.

 .. option:: --collapsed

    Generate collapsed stack format for external flame graph tools.

 .. option:: --flamegraph

    Generate self-contained HTML flame graph.

 .. option:: --gecko

    Generate Gecko JSON format for Firefox Profiler.

 .. option:: --heatmap

    Generate HTML heatmap with line-level sample counts.

 .. option:: --binary

    Generate high-performance binary format for later conversion with the
    ``replay`` command.

 .. option:: --compression <type>

    Compression for binary format: ``auto`` (use zstd if available, default),
    ``zstd``, or ``none``.

 .. option:: -o <path>, --output <path>

    Output file or directory path. Default behavior varies by format:
    :option:`--pstats` writes to stdout, while other formats generate a file
    named ``<format>_<PID>.<ext>`` (for example, ``flamegraph_12345.html``).
    :option:`--heatmap` creates a directory named ``heatmap_<PID>``.


 pstats display options
 ----------------------

 These options apply only to pstats format output.

 .. option:: --sort <key>

    Sort order: ``nsamples``, ``tottime``, ``cumtime``, ``sample-pct``,
    ``cumul-pct``, ``nsamples-cumul``, or ``name``. Default: ``nsamples``.

 .. option:: -l <count>, --limit <count>

    Maximum number of entries to display. Default: 15.

 .. option:: --no-summary

    Omit the Legend and Summary of Interesting Functions sections from output.


 Run command options
 -------------------

 .. option:: -m, --module

    Treat the target as a module name rather than a script path.

 .. option:: --live

    Start interactive terminal interface instead of batch profiling.


 .. seealso::

    :mod:`profiling`
       Overview of Python profiling tools and guidance on choosing a profiler.

    :mod:`profiling.tracing`
       Deterministic tracing profiler for exact call counts and timing.

    :mod:`pstats`
       Statistics analysis for profile data.

    `Firefox Profiler <https://profiler.firefox.com>`__
       Web-based profiler that accepts Gecko format output. See the
       `documentation <https://profiler.firefox.com/docs/#/>`__ for usage details.

    `FlameGraph <https://github.com/brendangregg/FlameGraph>`__
       Tools for generating flame graphs from collapsed stack format.