/* ******************************************************************************
* Copyright (c) 2022 Google, Inc. All rights reserved.
* ******************************************************************************/
/*
* Redistribution and use in source and binary forms, with or without
* modification, are permitted provided that the following conditions are met:
*
* * Redistributions of source code must retain the above copyright notice,
* this list of conditions and the following disclaimer.
*
* * Redistributions in binary form must reproduce the above copyright notice,
* this list of conditions and the following disclaimer in the documentation
* and/or other materials provided with the distribution.
*
* * Neither the name of Google, Inc. nor the names of its contributors may be
* used to endorse or promote products derived from this software without
* specific prior written permission.
*
* THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
* AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
* IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
* ARE DISCLAIMED. IN NO EVENT SHALL VMWARE, INC. OR CONTRIBUTORS BE LIABLE
* FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
* DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
* SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER
* CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
* LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
* OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH
* DAMAGE.
*/
/**
****************************************************************************
\page page_multi_trace_window Multi-Window Memtraces
\tableofcontents
# Overview
Memory address traces (or "memtraces") are critical tools for analyzing application
behavior and performance. DynamoRIO's \ref page_drcachesim provides memtrace
gathering and analyzing tools. Today, these memtraces are gathered either during a
full execution of a standalone application, or during a short burst of execution on a
server (using the start/stop interfaces as described at \ref sec_drcachesim_partial).
For long-running benchmarks with phased behavior, however, we would like to instead
gather a series of smaller memtrace samples to ease simulation while still
representing the application in the aggregate.
# Initial Use Case: SPEC2017
An initial use case driving this work is the gathering of memtraces from SPEC2017.
These benchmarks execute tens of trillions of instructions each and include phased
behavior. Our goal is to gather uniformly sampled memtrace sequences across the
whole execution. To support a full warmup for each sample, our target is 10 billion
instructions in each. Something like 50 samples per benchmark should suffice to
capture major phases. With these parameters, a 25-trillion-instruction benchmark
would yield a series of 10-billion-instruction traces, each followed by 490 billion
instructions of untraced execution.
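As a sanity check on those numbers, here is a minimal sketch of the arithmetic (the
figures are the illustrative ones from above, not measurements):

```cpp
#include <cstdint>
#include <cstdio>

int
main()
{
    // Illustrative figures from the text above.
    constexpr uint64_t total_instrs = 25000000000000ULL;   // ~25 trillion.
    constexpr uint64_t num_windows = 50;
    constexpr uint64_t traced_per_window = 10000000000ULL; // 10 billion.
    constexpr uint64_t interval = total_instrs / num_windows;              // 500 billion.
    constexpr uint64_t untraced_per_window = interval - traced_per_window; // 490 billion.
    printf("Each window: trace %llu instrs, then run %llu instrs untraced.\n",
           (unsigned long long)traced_per_window,
           (unsigned long long)untraced_per_window);
    return 0;
}
```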
# Design Point: Separate Traces v. Merged-with-Markers
Focusing on a use case of a series of 50 10-billion-instruction traces for a SPEC
benchmark, there are two main ways to store them. We could create 50 independent
sets of trace files, each with its own metadata and separate set of sharded data
files. A simulator could either simulate all 50 separately and aggregate just the
resulting statistics, or a single instance of a simulator could skip ahead to each
new sample and use the first portion of the sample to warm up caches and other state.
The alternative is to store all the data in one set of data files, with metadata
markers inserted to indicate the division points between the samples. This doesn’t
support the separate simulation model up front, though we could provide an iterator
interface that skips ahead to a target window and stops at the end of that window (or
the simulator could be modified to stop when it sees a window separation marker).
However, this will not be as efficient for parallel simulation with separate
simulator instances for each window, since the skipping ahead will take some time.
This arrangement does more easily support the single-simulator-instance approach, and
more readily fits with online simulation.
In terms of implementation, there are several factors to consider here.
## Separate raw files
If we want separate final traces, at first the simplest approach is to produce a
separate set of raw files for each tracing window. These would be post-processed
separately and independently.
However, implementing this split does not fit well with the current interfaces. To
work with other filesystems, we have separated out the I/O, and in particular
directory creation, and some proprietary uses rework the output completely to write
to a fixed set of destinations that are mapped back to per-thread raw files during
post-processing. Creating new sets of outputs at tracing time would require new
interfaces and significant refactoring for these use cases.
For typical use with files on the local disk, we could add creation of a new
directory (and duplication of the module file) for each window by the tracing thread
that hits the end-of-window trigger. The other threads would each create a new
output raw file each time they transitioned to a new window (see also the Proposal A
discussion below). This is implemented today with the \p -split_windows option.
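A rough, self-contained sketch of that local-disk flow follows (using
std::filesystem rather than the tracer's real output layer; the directory and file
names here are illustrative, not the tracer's actual naming):

```cpp
#include <cstdint>
#include <filesystem>
#include <fstream>
#include <string>

namespace fs = std::filesystem;

// Called by the thread that hits the end-of-window trigger: make a directory
// for the new window and duplicate the module file so that window's raw files
// are self-contained for post-processing.
fs::path
create_window_dir(const fs::path &raw_dir, uint64_t window)
{
    fs::path dir = raw_dir / ("window." + std::to_string(window));
    fs::create_directories(dir);
    fs::copy_file(raw_dir / "modules.log", dir / "modules.log",
                  fs::copy_options::overwrite_existing);
    return dir;
}

// Called by each thread the first time it writes in the new window: start a
// fresh per-thread raw file inside that window's directory.
std::ofstream
open_thread_raw_file(const fs::path &window_dir, uint64_t tid)
{
    return std::ofstream(window_dir / ("thread." + std::to_string(tid) + ".raw"),
                         std::ios::binary);
}
```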
## Splitting during raw2trace
Alternatively, we could keep a single raw file for each thread and split it up into
per-window final trace files during postprocessing by the raw2trace tool. We would
use markers inserted at the window transition points to identify where to separate.
raw2trace would need to create a new output directory and duplicate the trace
headers and module file. As with separate raw files, this goes against the current
I/O separation, where today we pass in a list of all the input and output files up
front and raw2trace never opens a file on its own, to better support proprietary
filesystems with upstream code.
Another concern here is hitting file size limits with a single raw file across many
sample traces. For the example above of 50 10-billion-instruction traces, if we
assume an average of 2 dynamic instructions per raw entry, each window might contain
5GB of data, reaching 250GB for all 50. Furthermore, the final trace is even larger.
The file size problem gets worse if we use a constant sampling interval across
SPEC2017. Some SPEC2017 benchmarks have many more instructions than others. The
bwaves_s benchmark has 382 trillion instructions, so a constant interval might result
in it having 50x more data than other benchmarks, exceeding the file size limit. A
constant number of windows is preferred for this reason.
## Splitting after raw2trace using an analyzer
Given the complexities of splitting in earlier steps, that we may want a single
simulator instance to process all of the sample traces, and that online trace
analysis will likely also use a single simulator instance, perhaps we should not try
to split the samples at all and should instead treat the 50 samples as a single trace
with internal markers indicating the window divisions.
Online and offline analyzers can use the window ID markers to fast-forward, or to
pause and align each thread at the next window. Perhaps the existing serial iterator
could gain built-in support for aligning the windows.
If single-file final traces exist, we would need to update all our existing
analyzers to handle the gaps in the traces: resetting state for function and
callstack trackers, and keeping per-window numbers for statistics gatherers.
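As a sketch of what that window awareness might look like in an analysis tool,
assuming the drmemtrace analysis_tool_t interface and a window-ID marker along the
lines of TRACE_MARKER_TYPE_WINDOW_ID (treat the exact names as assumptions to verify
against the headers):

```cpp
#include <cstdint>
#include <iostream>
#include <unordered_map>

#include "analysis_tool.h" // drmemtrace analyzer framework.
#include "memref.h"

// Minimal window-aware counter: keeps per-window statistics rather than
// aggregating across the gaps between windows.
class window_counts_t : public analysis_tool_t {
public:
    bool
    process_memref(const memref_t &memref) override
    {
        if (memref.marker.type == TRACE_TYPE_MARKER &&
            memref.marker.marker_type == TRACE_MARKER_TYPE_WINDOW_ID) {
            // New window: switch buckets.  A real tool would also reset any
            // function or callstack tracking state here.
            cur_window_ = memref.marker.marker_value;
            return true;
        }
        if (type_is_instr(memref.instr.type))
            ++instrs_per_window_[cur_window_];
        return true;
    }
    bool
    print_results() override
    {
        for (const auto &entry : instrs_per_window_)
            std::cerr << "window " << entry.first << ": " << entry.second << " instrs\n";
        return true;
    }

private:
    uintptr_t cur_window_ = 0;
    std::unordered_map<uintptr_t, uint64_t> instrs_per_window_;
};
```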
We can also create an analyzer that splits a final trace up if we do want separate
traces.
## Decision: Split the final trace with an analyzer
A separate-files setup seems to be the most flexible and useful one for our expected
use cases, in particular parallel simulation. But given that separating early in the
pipeline is complex, we’ll split after raw2trace post-processing, with a new analyzer
tool.
We’ll update some simple inspection and sanity tools (view, basic_counts, and
invariant_checker) to handle and visualize windows, but we’ll assume that trace
windows will be split before being analyzed by any more complex analysis tools. For
online traces we will probably stick with multi-window-at-once.
We’ll create a tool to manually split up multi-window trace files.
Update: We ended up implementing splitting at the raw file output level for the
local-disk use case; we may additionally implement splitting in an analyzer for other
use cases.
# Design Point: Continuous Control v. Re-Attach
One method of obtaining multiple traces is to repeat today’s bursts over and over,
with a full detach from the application after each trace. However, each attach point
is expensive, with long periods of profiling and code cache pre-population. While a
scheme of sharing the profiling and perhaps code cache could be developed while
keeping a full detach, a simpler approach is to remain in control but switch from
tracing to instruction counting in between tracing windows. Instruction counting is
needed to determine where to start the next window in any case.
Instruction counting through instrumentation is not cheap, incurring perhaps a 1.5x
slowdown. Compared to the 50x overhead while tracing, however, it is acceptable. If
lower overhead is desired in the future, a scheme using a full detach and hardware
performance counters to count instructions could be investigated. The decision for
the initial implementation, however, is to use the simpler alternating tracing and
counting instrumentation windows.
# Design Point: Instrumentation Dispatch v. Flushing
As the prior section concluded, we plan to alternate between tracing and instruction
counting. There are two main approaches to varying instrumentation during execution:
inserting all cases up front with a dispatch to the desired current scheme, and
replacing instrumentation by flushing the system’s software code cache when changing
schemes.
Flushing is an expensive process, and can be fragile as the lower-overhead forms of
flushing open up race conditions between threads executing the old and new code cache
contents. Its complexity is one reason we are deciding to instead use a dispatch
approach for our initial implementation.
With dispatch, we insert both tracing and counting instrumentation for each block in
the software code cache. Dispatch code at the top of the block selects which scheme
to use. The current mode, either tracing or counting, is stored in memory and needs
to be synchronized across all threads.
The simplest method of synchronizing the instrumentation mode is to store it in a
global variable, have the dispatch code use a load-acquire to read it, and modify it
with a store-release. There is overhead to a load-acquire at the top of every block,
but experimentation shows that it is reasonable compared to the overhead of the
instrumentation itself even for instruction counting mode, and its simplicity makes
it our choice for the initial implementation.
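A minimal sketch of that synchronization in plain C++ terms (the real tracer stores
the mode where drbbdup's generated dispatch code can load it; this just illustrates
the acquire/release pairing):

```cpp
#include <atomic>
#include <cstdint>

// Instrumentation modes selected by the per-block dispatch.
enum class mode_t : uint32_t { COUNTING, TRACING };

// Global mode variable read at the top of every block.
static std::atomic<uint32_t> g_mode{ static_cast<uint32_t>(mode_t::COUNTING) };

// Dispatch side: a load-acquire ensures that once a thread observes the new
// mode it also observes any state published before the mode switch.
static inline mode_t
read_mode()
{
    return static_cast<mode_t>(g_mode.load(std::memory_order_acquire));
}

// Trigger side: publish per-window state first, then store-release the new
// mode so other threads switch schemes at their next block entry.
static inline void
set_mode(mode_t m)
{
    g_mode.store(static_cast<uint32_t>(m), std::memory_order_release);
}
```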
The mechanisms for creating the dispatch and the separate copies for each mode are
provided for us by the drbbdup library. This library was, however, missing some key
pieces that we had to add.
## AArch64 support for drbbdup
The drbbdup library had no AArch64 support initially. It needed some updates to
generated code sequences, as well as handling of the weak memory model on AArch64.
To support storing the mode variable in global memory despite the short reach of
AArch64 memory addressing modes, we added explicit support for using a scratch
register to hold the address. For the weak memory model, explicit support for using a
load-acquire to read the value was added.
## Function wrapping support for drbbdup
The function wrapping library used when tracing does not easily work with drbbdup out
of the box, due to the different control models. We added an inverted control mode
to the wrapping library to allow enabling wrapping only for the tracing mode and not
for the counting mode.
However, there is a complication: a transition to counting mode while in the middle
of a wrapped function can cause problems if the cleanup on function exit is skipped.
We ended up adding support for cleanup-only execution of the function wrapping
actions, which is invoked while counting, with the full wrapping actions enabled
while tracing.
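A self-contained sketch of the cleanup-only idea (a conceptual illustration with
made-up names, not drwrap's actual interface): in counting mode the wrapper still
maintains its call stack so returns are matched correctly, but it skips the user
callbacks.

```cpp
#include <vector>

// Conceptual illustration only; names and structure do not match drwrap.
enum class wrap_mode_t { FULL, CLEANUP_ONLY };

struct wrap_frame_t {
    void *func;
    void *retaddr;
};

// Hypothetical user callbacks invoked only in full-wrapping (tracing) mode.
static void
invoke_pre_callback(void *func)
{
}
static void
invoke_post_callback(void *func)
{
}

struct wrap_state_t {
    std::vector<wrap_frame_t> stack;
};

static void
on_wrapped_entry(wrap_state_t *state, wrap_mode_t mode, void *func, void *retaddr)
{
    // Bookkeeping always happens so a mode switch mid-function cannot leave
    // an unmatched frame behind.
    state->stack.push_back({ func, retaddr });
    if (mode == wrap_mode_t::FULL)
        invoke_pre_callback(func);
}

static void
on_wrapped_return(wrap_state_t *state, wrap_mode_t mode)
{
    if (mode == wrap_mode_t::FULL)
        invoke_post_callback(state->stack.back().func);
    // Cleanup-only mode still pops, keeping the wrap stack consistent.
    state->stack.pop_back();
}
```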
A final minor change to support DRWRAP_REPLACE_RETADDR was to use the tag rather than
the first instruction’s application PC, since that wrapping scheme creates some
blocks with nothing but synthetic instructions with no corresponding application PC.
## Write-xor-execute support for drbbdup
The drbbdup library uses its own generated code area for out-of-line callouts, but
only for dynamically discovered instrumentation cases. This does not play well with
write-xor-execute security features, where we need to separate writable and
executable code; the special mechanism used for DynamoRIO’s own code cache does not
currently extend to tools using their own caches. To solve this for our use case,
which has only static cases, we simply disable the drbbdup code cache when a new flag
disabling dynamic case support is set.
## Emulation support for drbbdup
The memtrace code uses emulation sequence markers for instrumenting the proper
application instructions in the face of various expansions for repeated string loops
and scatter-gather instructions. However, the drbbdup library presents its own
interface layer that hides these markers, which results in memtrace missing some
application instructions.
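For context, here is a sketch of the emulation-aware pattern the tracing
instrumentation relies on, assuming drmgr's emulation-aware queries
(drmgr_orig_app_instr_for_fetch() and drmgr_orig_app_instr_for_operands(); verify the
exact names against drmgr.h):

```cpp
#include "dr_api.h"
#include "drmgr.h"

// Sketch of an emulation-aware insertion callback: instrument what the
// application logically executes, not the expansion sequence emulating it.
static dr_emit_flags_t
event_app_instruction(void *drcontext, void *tag, instrlist_t *bb, instr_t *where,
                      bool for_trace, bool translating, void *user_data)
{
    // The instruction whose fetch should be recorded: the original app instr
    // when 'where' is inside an emulation sequence (expanded rep-string loop,
    // scatter/gather, etc.), or 'where' itself otherwise.
    instr_t *fetch_instr = drmgr_orig_app_instr_for_fetch(drcontext);
    // The instruction whose memory operands should be recorded.
    instr_t *operand_instr = drmgr_orig_app_instr_for_operands(drcontext);
    if (fetch_instr != NULL) {
        // ... insert instruction-entry recording for fetch_instr ...
    }
    if (operand_instr != NULL) {
        // ... insert memref recording for operand_instr's memory operands ...
    }
    return DR_EMIT_DEFAULT;
}
```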
## Consider partial detach with PMU instruction counting for non-tracing windows?
We could reduce the overhead of the non-tracing windows where we’re counting
instructions by using the PMU to count for us. We would do something like a detach
but keep our code cache (and assume no code changes by the application) and re-attach
when the PMU counting reaches the target.
# Handling Phase Transitions
For a normal memtrace burst, we completely detach from the server at the end of our
desired trace duration. This detach process synchronizes with every application
thread.
For multi-window traces, we are using multi-case dispatched instrumentation where we
change the instrumentation type for each window. There is no detach to wake up all
the threads and have them flush their trace buffers, and we're deliberately trying to
avoid a global synchronization point. Yet we would prefer clean transitions between
windows, whether that means separate raw files or accurately placed markers.
## Key step: Add end-of-block phase change check
We do flush prior to a syscall, so a thread at a kernel wait point should have an
empty buffer and not be a concern.
The main concern is a thread not in a wait state that happens to not be scheduled
consistently for a long time and so does not fill up its buffer until well after the
window ends.
We can augment the current end-of-block flush check, which today only looks for the
buffer being full, with a check for the prior window having ended: we keep a global
window ordinal and store its value per thread at the start of filling a new buffer.
(This is better than simply checking the drbbdup mode value for being in non-tracing
mode, as that will not catch a double mode change.) If the prior window has ended, we
can flush the buffer, or simply add a marker, depending on the scheme (see below).
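A sketch of that augmented check (the field and helper names here are illustrative,
not the tracer's actual ones):

```cpp
#include <atomic>
#include <cstdint>

// Global window ordinal, bumped by the thread that triggers a window change.
static std::atomic<uint64_t> g_window{ 0 };

struct per_thread_t {
    uint64_t buffer_window = 0; // Window ordinal recorded when the buffer was started.
    // ... buffer base, current fill pointer, etc.
};

static void
flush_buffer(per_thread_t *pt)
{
    // Write out and reset the buffer (details omitted).
}

// Conceptually runs at the end of each block, alongside the existing
// buffer-full check.
static void
end_of_block_check(per_thread_t *pt, bool buffer_full)
{
    uint64_t cur_window = g_window.load(std::memory_order_acquire);
    // Comparing ordinals rather than the drbbdup mode catches a double mode
    // change (tracing -> counting -> tracing) that ends in the same mode but
    // a different window.
    if (buffer_full || pt->buffer_window != cur_window) {
        flush_buffer(pt); // Or just add a marker, depending on the scheme.
        pt->buffer_window = cur_window;
    }
}
```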
A thread that receives a signal mid-block (it would have to be a synchronous signal,
as DR waits until the end of the block to deliver asynchronous signals) will skip its
end-of-block checks and redirect to run the app's signal handler, but it would hit
the checks for the first block of the handler.
The worst-case inaccuracy here is a thread that starts writing in window N but ends
up unscheduled until a much later window M. Even then, at most one basic block's
worth of trace entries will be placed into window N despite having happened later.
Thus we have "basic block accuracy", which is pretty good, as a basic block typically
contains only a few instructions.
## Proposal A: Separate raw files split at flush time
If we're splitting raw files (see above), we would use the end-of-block window-change
flush to emit a thread exit and create a new file. In post-processing, we'd add any
missing thread exits to threads that don't have them, to cover waiting threads that
never reached a flush.
As discussed above, the trigger thread would create a new directory for each window.
A just-finished buffer is written to the directory corresponding to the window for
its start point.
A thread that is unscheduled for a long time could have a nearly-full buffer that is
not written out until many windows later, but it would be written to the old
directory for the old window. The next buffer would go to a new file in the new
window, with no files in the in-between window directories.
## Proposal B (winner): Label buffers with owning window
If we add the window ordinal to every buffer header, we can identify which window
each buffer belongs to, and avoid the need for separate raw files. A window-end flush
ensures that a buffer belongs solely to the window identified in its header; the next
buffer will carry the new window value.
This scheme can be used with file splitting during raw2trace, or with splitting
deferred until analysis. Each thread has one raw file that contains all windows of
the execution.
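A sketch of the idea (illustrative layout only; the real raw-file header format
differs):

```cpp
#include <cstdint>

// Each buffer written to the per-thread raw file starts with a header that
// names the window it belongs to.  Because the end-of-block check flushes on
// a window change, a buffer never straddles two windows.
struct buffer_header_t {
    uint64_t tid;       // Owning thread.
    uint64_t timestamp; // When the buffer was started.
    uint64_t window;    // Window ordinal for every entry in this buffer.
};
```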
## Proposal C: Trigger thread identifies buffer transition point of the other threads
For this proposal, the thread triggering the end of the window walks the other
threads' buffers and identifies the phase transition point inside each one, telling
the post-processor where to split them.
I considered having the triggerer also flush the buffers, but that is challenging due
to a race with the owner also flushing. Plus, it still requires post-processing help
to identify the precise point for splitting the buffer (without synchronization the
triggerer can only get close).
To avoid barriers on common-case trace buffer writes, we use a lazy scheme where the
triggerer does not modify the trace buffers themselves, but instead records which
portion has been written using a separate variable that is never accessed on the fast
path.
Implementation (a sketch of the lazy marking follows this list):
+ The tracer maintains a global list of thread buffers using a global mutex on thread
init and exit.
+ Each trace buffer has a corresponding set of externally_written variables, each
holding a distance into the buffer that was written out by another thread.
+ On hitting the trace window endpoint threshold, the triggering thread grabs the
mutex and walks the buffers.
The triggerer doesn't have the current buffer position pointer. Instead it walks
the buffer until it reaches zeroed memory (we zero the buffer after each flush).
We have no synchronization with the owning thread, but observing writes out of
order should be ok since we'll just miss one by stopping too early. We need to fix
things up in post-processing in any case, because we need the phase transition to
be at a clean point (we can't identify that point from the triggerer: if we end at
an instr entry, we don't know whether some memrefs are coming afterward or not). In
post-processing we adjust that position to the end of the block, and we split the
buffer contents around that point into the neighboring traces.
The triggerer does a store-release of the furthest-written point into the
externally_written variable.
+ Whenever a thread writes out its buffer, it does a load-acquire on the
externally_written variable and if non-zero it writes out a marker in the buffer
header. Post-processing reads the marker and uses it to split the buffer at the
nearest block boundary after the marker value.
+ If windows are small enough that the triggerer doesn't complete its buffer walk
before a new window starts, other threads' buffers may go entirely into the new
window. That seems ok: if the windows are that small, then in the absence of
application synchronization the resulting window split is still a possible thread
ordering.
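A sketch of that lazy marking, under the assumptions above (illustrative names; the
buffer scan and marker emission are simplified):

```cpp
#include <atomic>
#include <cstddef>
#include <cstdint>

constexpr size_t BUF_ENTRIES = 4096;

// Illustrative buffer: entries are zeroed after each flush so the triggerer
// can find the end of valid data without the owner's fill pointer.
struct trace_buffer_t {
    uint64_t entries[BUF_ENTRIES];
    std::atomic<uint64_t> externally_written{ 0 }; // 0 = no pending window mark.
};

// Trigger thread (holding the global buffer-list mutex): walk until zeroed
// memory and publish how far the owner had written when the window ended.
static void
mark_window_end(trace_buffer_t *buf)
{
    uint64_t i = 0;
    while (i < BUF_ENTRIES && buf->entries[i] != 0)
        ++i;
    buf->externally_written.store(i, std::memory_order_release);
}

// Owner thread, at its next flush: if the triggerer marked a point, emit a
// header marker that post-processing rounds up to the next block boundary.
static void
flush_buffer(trace_buffer_t *buf)
{
    uint64_t split = buf->externally_written.exchange(0, std::memory_order_acquire);
    if (split != 0) {
        // ... write a split-point marker carrying 'split' into the buffer header ...
    }
    // ... write out the entries, then zero the buffer for the next fill ...
}
```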
This scheme ends up with block-level accuracy since the trigger thread's marked
transition point must be adjusted to essentially a block boundary in post-processing.
Thus, it does not seem any better than the other schemes, and it is more complex.
## Decision: Proposal B
We decided to go with Proposal B as it works best with our file separation decision
and is simple to implement.
# Online Traces
For offline it makes sense to treat each window trace as separate and to simulate
each separately (while aggregating the results to cover the whole application).
But for online, we have the same instance of the simulator or analysis tool
throughout the whole application run. It will get confused if it throws away thread
bookkeeping on a thread exit for a window.
Either we have a window-controller simulator that spins up and tears down a new
instance of the real target simulator/tool for each window, or we introduce new "end
phase/start phase" markers. If we have split offline traces, though, those markers
would exist only for online, which does not sound appealing. Simulators/tools would
need special handling for them: reporting statistics for the phase while continuing
to aggregate into a multi-phase report.
We might want combined files for offline too, as discussed above. That would unify
the two, which is appealing.
****************************************************************************
*/