| /* ****************************************************************************** |
| * Copyright (c) 2022 Google, Inc. All rights reserved. |
| * ******************************************************************************/ |
| |
| /* |
| * Redistribution and use in source and binary forms, with or without |
| * modification, are permitted provided that the following conditions are met: |
| * |
| * * Redistributions of source code must retain the above copyright notice, |
| * this list of conditions and the following disclaimer. |
| * |
| * * Redistributions in binary form must reproduce the above copyright notice, |
| * this list of conditions and the following disclaimer in the documentation |
| * and/or other materials provided with the distribution. |
| * |
| * * Neither the name of Google, Inc. nor the names of its contributors may be |
| * used to endorse or promote products derived from this software without |
| * specific prior written permission. |
| * |
| * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" |
| * AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE |
| * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE |
| * ARE DISCLAIMED. IN NO EVENT SHALL VMWARE, INC. OR CONTRIBUTORS BE LIABLE |
| * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL |
| * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR |
| * SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER |
| * CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT |
| * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY |
| * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH |
| * DAMAGE. |
| */ |
| |
| /** |
| **************************************************************************** |
| \page page_multi_trace_window Multi-Window Memtraces |
| |
| \tableofcontents |
| |
| # Overview |
| |
| Memory address traces (or "memtraces") are critical tools for analyzing application |
| behavior and performance. DynamoRIO's \ref page_drcachesim provides tools for
| gathering and analyzing memtraces. Today, these memtraces are gathered either during a
| full execution of a standalone application, or during a short burst of execution on a |
| server (using the start/stop interfaces as described at \ref sec_drcachesim_partial). |
| For long-running benchmarks with phased behavior, however, we would like to instead |
| gather a series of smaller memtrace samples to ease simulation while still |
| representing the application in the aggregate. |
| |
| # Initial Use Case: SPEC2017 |
| |
| An initial use case driving this work is the gathering of memtraces from SPEC2017. |
| These benchmarks execute tens of trillions of instructions each and include phased |
| behavior. Our goal is to gather uniformly sampled memtrace sequences across the |
| whole execution. To support a full warmup for each sample, our target is 10 billion |
| instructions in each. Something like 50 samples per benchmark should suffice to |
| capture major phases. With these parameters, a 25-trillion-instruction benchmark
| would be covered by a series of 10-billion-instruction traces, each followed by 490
| billion instructions of untraced execution.
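|
| As a quick check on that arithmetic, here is a minimal sketch in C++ (purely
| illustrative; the figures are the ones quoted above):
|
| \code{.cpp}
| #include <cstdint>
| #include <cstdio>
|
| int main() {
|     // Illustrative figures from the discussion above.
|     const uint64_t total_instrs = 25000000000000ULL;   // 25 trillion.
|     const uint64_t num_windows = 50;
|     const uint64_t traced_per_window = 10000000000ULL; // 10 billion.
|     // Each window owns an equal slice of the execution.
|     const uint64_t interval = total_instrs / num_windows;              // 500 billion.
|     const uint64_t untraced_per_window = interval - traced_per_window; // 490 billion.
|     printf("interval=%llu untraced=%llu\n", (unsigned long long)interval,
|            (unsigned long long)untraced_per_window);
|     return 0;
| }
| \endcode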
| |
| # Design Point: Separate Traces v. Merged-with-Markers |
| |
| Focusing on a use case of a series of 50 10-billion-instruction traces for a SPEC |
| benchmark, there are two main ways to store them. We could create 50 independent |
| sets of trace files, each with its own metadata and separate set of sharded data |
| files. A simulator could either simulate all 50 separately and aggregate just the |
| resulting statistics, or a single instance of a simulator could skip ahead to each |
| new sample and use the first portion of the sample to warm up caches and other state. |
| |
| The alternative is to store all the data in one set of data files, with metadata |
| markers inserted to indicate the division points between the samples. This doesn’t |
| support the separate simulation model up front, though we could provide an iterator |
| interface that skips ahead to a target window and stops at the end of that window (or |
| the simulator could be modified to stop when it sees a window separation marker). |
| However, this will not be as efficient for parallel simulation with separate |
| simulator instances for each window, since the skipping ahead will take some time. |
| This arrangement does more easily support the single-simulator-instance approach, and |
| more readily fits with online simulation. |
| |
| In terms of implementation, there are several factors to consider here. |
| |
| ## Separate raw files |
| |
| If we want separate final traces, the simplest approach at first glance is to produce
| a separate set of raw files for each tracing window. These would then be
| post-processed separately and independently.
| |
| However, implementing this split does not fit well with the current interfaces. To
| work with other filesystems, we have separated out the I/O, and in particular
| directory creation; some proprietary uses rework the output completely to write to a
| fixed set of destinations which are mapped back to per-thread raw files during
| post-processing. Creating new sets of outputs at tracing time would require new
| interfaces and significant refactoring for these use cases.
| |
| For typical use with files on the local disk, we could add creation of a new |
| directory (and duplication of the module file) for each window by the tracing thread |
| that hits the end-of-window trigger. The other threads would each create a new |
| output raw file each time they transitioned to a new window (see also the Proposal A |
| discussion below). This is implemented today with the \p -split_windows option. |
| |
| ## Splitting during raw2trace |
| |
| Alternatively, we could keep a single raw file for each thread and split it up into |
| per-window final trace files during postprocessing by the raw2trace tool. We would |
| use markers inserted at the window transition points to identify where to separate. |
| |
| raw2trace would need to create a new output directory and duplicate the trace
| headers and module file. As with separate raw files, this goes against the current
| I/O separation, where today we pass in a list of all the input and output files up
| front and raw2trace never opens a file on its own, in order to better support
| proprietary filesystems with upstream code.
| |
| Another concern here is hitting file size limits with a single raw file across many |
| sample traces. For the example above of 50 10-billion-instruction traces, if we |
| assume an average of 2 dynamic instructions per raw entry, each window might contain |
| 5GB of data, reaching 250GB for all 50. Furthermore, the final trace is even larger. |
| |
| The file size problem gets worse if we use a constant sampling interval across |
| SPEC2017. Some SPEC2017 benchmarks have many more instructions than others. The |
| bwaves_s benchmark has 382 trillion instructions, so a constant interval might result |
| in it having 50x more data than other benchmarks, exceeding the file size limit. A |
| constant number of windows is preferred for this reason. |
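|
| As a rough illustration of that scaling argument (the 8-trillion-instruction figure
| for a "typical" benchmark is a hypothetical stand-in; only bwaves_s's 382 trillion
| comes from the text above):
|
| \code{.cpp}
| #include <cstdint>
| #include <cstdio>
|
| int main() {
|     const uint64_t typical_instrs = 8000000000000ULL;  // Hypothetical "typical" benchmark.
|     const uint64_t bwaves_instrs = 382000000000000ULL; // bwaves_s, from the text above.
|     // Policy 1: a constant interval, tuned to give 50 windows on the typical benchmark.
|     const uint64_t interval = typical_instrs / 50;
|     const uint64_t bwaves_windows = bwaves_instrs / interval; // ~2387 windows.
|     // Policy 2: a constant window count gives every benchmark the same 50 windows
|     // and thus roughly the same amount of trace data.
|     printf("constant interval: bwaves_s gets %llu windows (~%llux the data of 50)\n",
|            (unsigned long long)bwaves_windows,
|            (unsigned long long)(bwaves_windows / 50));
|     return 0;
| }
| \endcode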
| |
| ## Splitting after raw2trace using an analyzer |
| |
| Given the complexities of splitting in earlier steps, given that we may want to use
| a single simulator instance to process all of the sample traces, and given that
| online trace analysis will likely also use a single simulator instance, perhaps we
| should not try to split the samples at all and should instead treat the 50 samples
| as a single trace with internal markers indicating the window divisions.
| |
| Online and offline analyzers can use window ID markers to fast-forward, or to pause
| and align each thread to the next window. The existing serial iterator could perhaps
| gain built-in support for aligning the windows.
| |
| If single-file final traces exist, we would need to update all our existing
| analyzers to handle the gaps in the traces: resetting state for function and
| callstack trackers, and keeping per-window numbers for statistics gatherers.
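|
| As a rough sketch of what that per-analyzer handling could look like (the record
| layout and window field here are simplified stand-ins, not the actual drmemtrace
| analyzer interfaces):
|
| \code{.cpp}
| #include <cstdint>
| #include <unordered_map>
|
| // Simplified stand-in for a trace record; the real analyzer API differs.
| struct record_t {
|     bool is_window_marker; // True for a window-ID metadata marker.
|     uint64_t window_id;    // Valid when is_window_marker is true.
|     uint64_t pc;           // Instruction fetch address otherwise.
| };
|
| class per_window_counter_t {
| public:
|     void process_record(const record_t &rec) {
|         if (rec.is_window_marker) {
|             if (rec.window_id != cur_window_) {
|                 // A gap in the trace: reset any state that must not leak across
|                 // windows (callstacks, pending wrapped-function data, etc.) and
|                 // start a fresh statistics bucket.
|                 cur_window_ = rec.window_id;
|             }
|             return;
|         }
|         ++instrs_per_window_[cur_window_];
|     }
| private:
|     uint64_t cur_window_ = 0;
|     std::unordered_map<uint64_t, uint64_t> instrs_per_window_;
| };
| \endcode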
| |
| We can also create an analyzer that splits a final trace up if we do want separate |
| traces. |
| |
| ## Decision: Split the final trace with an analyzer |
| |
| Separate files seem to be the most flexible and useful setup for our expected use
| cases, in particular parallel simulation. But given that separating early in the
| pipeline is complex, we’ll split after raw2trace post-processing, with a new analyzer
| tool.
| |
| We’ll update some simple inspection and sanity tools (view, basic_counts, and |
| invariant_checker) to handle and visualize windows, but we’ll assume that trace |
| windows will be split before being analyzed by any more complex analysis tools. For |
| online traces we will probably stick with multi-window-at-once. |
| |
| We’ll create a tool to manually split up multi-window trace files. |
| |
| Update: We ended up implementing splitting at the raw file output level for the
| local-disk use case; we may additionally implement splitting in an analyzer for other
| use cases.
| |
| # Design Point: Continuous Control v. Re-Attach |
| |
| One method of obtaining multiple traces is to repeat today’s bursts over and over, |
| with a full detach from the application after each trace. However, each attach point |
| is expensive, with long periods of profiling and code cache pre-population. While a |
| scheme of sharing the profiling and perhaps code cache could be developed while |
| keeping a full detach, a simpler approach is to remain in control but switch from |
| tracing to instruction counting in between tracing windows. Instruction counting is |
| needed to determine where to start the next window in any case. |
| |
| Instruction counting through instrumentation is not cheap, incurring perhaps a 1.5x
| slowdown. Compared to the 50x overhead while tracing, however, it is acceptable. If
| lower overhead is desired in the future, a scheme using a full detach and hardware
| performance counters to count instructions could be investigated. The decision for
| the initial implementation, however, is to use the simpler alternating tracing and
| counting instrumentation windows.
| |
| # Design Point: Instrumentation Dispatch v. Flushing |
| |
| As the prior section concluded, we plan to alternate between tracing and instruction |
| counting. There are two main approaches to varying instrumentation during execution: |
| inserting all cases up front with a dispatch to the desired current scheme, and |
| replacing instrumentation by flushing the system’s software code cache when changing |
| schemes. |
| |
| Flushing is an expensive process, and can be fragile as the lower-overhead forms of |
| flushing open up race conditions between threads executing the old and new code cache |
| contents. Its complexity is one reason we are deciding to instead use a dispatch |
| approach for our initial implementation. |
| |
| With dispatch, we insert both tracing and counting instrumentation for each block in |
| the software code cache. Dispatch code at the top of the block selects which scheme |
| to use. The current mode, either tracing or counting, is stored in memory and needs |
| to be synchronized across all threads. |
| |
| The simplest method of synchronizing the instrumentation mode is to store it in a |
| global variable, have the dispatch code use a load-acquire to read it, and modify it |
| with a store-release. There is overhead to a load-acquire at the top of every block, |
| but experimentation shows that it is reasonable compared to the overhead of the |
| instrumentation itself even for instruction counting mode, and its simplicity makes |
| it our choice for the initial implementation. |
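|
| Conceptually, the synchronization amounts to an acquire load at the top of every
| block paired with a release store at mode-change time. A minimal C++ sketch (the
| real dispatch is machine code generated by drbbdup, and these names are
| illustrative):
|
| \code{.cpp}
| #include <atomic>
|
| enum class tracing_mode_t { COUNTING, TRACING };
|
| // Global mode shared by all threads; illustrative only.
| static std::atomic<tracing_mode_t> g_mode{tracing_mode_t::COUNTING};
|
| // Conceptually executed at the top of every block by the dispatch code.
| inline tracing_mode_t dispatch_read_mode() {
|     return g_mode.load(std::memory_order_acquire);
| }
|
| // Executed by whichever thread hits the window-change trigger.
| inline void switch_mode(tracing_mode_t new_mode) {
|     g_mode.store(new_mode, std::memory_order_release);
| }
| \endcode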
| |
| The mechanisms for creating the dispatch and the separate copies for each mode are
| provided for us by the drbbdup library. This library was, however, missing some key
| pieces that we had to add.
| |
| ## AArch64 support for drbbdup |
| |
| The drbbdup library had no AArch64 support initially. It needed some updates to |
| generated code sequences, as well as handling of the weak memory model on AArch64. |
| To support storing the mode variable in global memory with the short reach of AArch64 |
| memory addressing modes, we added explicit support for using a scratch register to |
| hold the address. For the weak memory model, explicit support for using a
| load-acquire to read the value was added.
| |
| ## Function wrapping support for drbbdup |
| |
| The function wrapping library used when tracing does not easily work with drbbdup out |
| of the box, due to the different control models. We added an inverted control mode |
| to the wrapping library to allow enabling wrapping only for the tracing mode and not |
| for the counting mode. |
| |
| However, there is a complication: a transition to counting mode in the middle of a
| wrapped function can cause problems if cleanup on exiting the function is skipped.
| We ended up adding support for cleanup-only execution of function wrapping actions,
| which is used while counting, with full wrapping actions enabled while tracing.
| |
| A final minor change to support DRWRAP_REPLACE_RETADDR was to use the tag rather than |
| the first instruction’s application PC, since that wrapping scheme creates some |
| blocks with nothing but synthetic instructions with no corresponding application PC. |
| |
| ## Write-xor-execute support for drbbdup |
| |
| The drbbdup library uses its own generated code area for out-of-line callouts, but |
| only for dynamically discovered instrumentation cases. This does not play well with |
| write-xor-execute security features where we need to separate writable and executable |
| code and have a special mechanism for DynamoRIO’s code cache that does not currently |
| extend to tools using their own caches. To solve this for our use case, which has
| only static cases, we simply disabled the drbbdup code cache when a new flag
| disabling dynamic case support is set.
| |
| ## Emulation support for drbbdup |
| |
| The memtrace code uses emulation sequence markers for instrumenting the proper |
| application instructions in the face of various expansions for repeated string loops |
| and scatter-gather instructions. However, the drbbdup library presents its own |
| interface layer which hides the markers and in fact results in memtrace missing some |
| application instructions. |
| |
| ## Consider partial detach with PMU instruction counting for non-tracing windows? |
| |
| We could reduce the overhead of the non-tracing windows where we’re counting |
| instructions by using the PMU to count for us. We would do something like a detach |
| but keep our code cache (and assume no code changes by the application) and re-attach |
| when the PMU counting reaches the target. |
| |
| # Handling Phase Transitions |
| |
| For a normal memtrace burst, we completely detach from the server at the end of our |
| desired trace duration. This detach process synchronizes with every application |
| thread. |
| |
| For multi-window traces, we are using multi-case dispatched instrumentation where we
| change the instrumentation type for each window. There is no detach step to wake up
| all threads and have them flush their trace buffers, and we're deliberately trying to
| avoid a global synchronization point. Yet we would prefer perfect transitions between
| windows, whether that means separate raw files or accurately placed markers.
| |
| ## Key step: Add end-of-block phase change check |
| |
| We do flush prior to a syscall, so a thread at a kernel wait point should have an |
| empty buffer and not be a concern. |
| |
| The main concern is a thread not in a wait state that happens to not be scheduled |
| consistently for a long time and so does not fill up its buffer until well after the |
| window ends. |
| |
| We can augment the current end-of-block flush check, which today only looks for the
| buffer being full, with a check for the prior window having ended: we keep a global
| window ordinal and store its value per thread at the start of filling a new buffer.
| (This is better than simply checking the drbbdup mode value for being in non-tracing
| mode, as that would not catch a double mode change.) If the prior window has ended,
| we can flush the buffer, or simply add a marker, depending on the scheme (see
| below).
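|
| A minimal sketch of that augmented check (the names and the buffer plumbing are
| illustrative, not the tracer's actual code):
|
| \code{.cpp}
| #include <atomic>
| #include <cstdint>
|
| // Global window ordinal, bumped by the thread that triggers a window change.
| static std::atomic<uint64_t> g_window{0};
|
| struct per_thread_t {
|     uint64_t buffer_window; // Window in effect when this buffer started filling.
|     // ... trace buffer base, current pointer, etc.
| };
|
| // Called when a thread starts filling a new buffer.
| void start_buffer(per_thread_t *pt) {
|     pt->buffer_window = g_window.load(std::memory_order_acquire);
| }
|
| // End-of-block check, augmenting the existing buffer-is-full test. Comparing
| // ordinals (rather than the drbbdup mode) catches a double mode change such as
| // tracing -> counting -> tracing.
| bool should_flush(per_thread_t *pt, bool buffer_full) {
|     bool window_changed =
|         g_window.load(std::memory_order_acquire) != pt->buffer_window;
|     return buffer_full || window_changed;
| }
| \endcode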
| |
| A thread that receives a signal mid-block (it would have to be a synchronous signal,
| as DR waits until the end of the block for asynchronous signals) will skip its
| end-of-block checks and redirect to run the app's signal handler, but it will hit the
| checks for the first block of the handler.
| |
| The worst case inaccuracy here is a thread that starts writing in window N but ends
| up unscheduled until a much later window M. But at most one basic block's worth of
| trace entries will be placed into window N even though they happened later. Thus we
| have "basic block accuracy", which is pretty good, as typically a basic block only
| contains a few instructions.
| |
| ## Proposal A: Separate raw files split at flush time |
| |
| If we're splitting raw files (see above), we would use the end-of-block window-change |
| flush to emit a thread exit and create a new file. In post-processing, we'd add any
| missing thread exits to threads that don't have them, to cover waiting threads that
| never reached a flush.
| |
| As discussed above, the trigger thread would create a new directory for each window. |
| A just-finished buffer is written to the directory corresponding to the window for |
| its start point. |
| |
| A thread that is unscheduled for a long time could have a nearly-full buffer that is |
| not written out until many windows later, but it would be written to the old |
| directory for the old window. The next buffer would go to a new file in the new |
| window, with no files in the in-between window directories. |
| |
| ## Proposal B (winner): Label buffers with owning window |
| |
| If we add the window ordinal to every buffer header, we can identify which window
| each buffer belongs to, avoiding the need for separate raw files. A window-end flush
| ensures a buffer belongs solely to the window identified in its header; the next
| buffer will have the new window value.
| |
| This scheme can be used with file splitting during raw2trace, or waiting until |
| analysis. Each thread has one raw file which contains all windows during the |
| execution. |
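|
| A sketch of the labeling (the field names are illustrative; the real raw buffer
| header layout differs):
|
| \code{.cpp}
| #include <cstdint>
|
| // Illustrative raw buffer header; the actual drmemtrace header layout differs.
| struct buffer_header_t {
|     uint64_t thread_id;
|     uint64_t timestamp;
|     // Window that was current when this buffer started filling. The window-change
|     // flush guarantees every entry in the buffer belongs to this window, so
|     // raw2trace (or a later analyzer) can route the whole buffer to the right
|     // output without inspecting its contents.
|     uint64_t window;
| };
| \endcode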
| |
| ## Proposal C: Trigger thread identifies buffer transition point of the other threads |
| |
| For this proposal, the thread triggering the end of the window walks the other |
| threads and identifies the phase transition point inside the buffer, telling the |
| post-processor where to split them. |
| |
| I considered having the triggerer also flush the buffers, but that is challenging
| due to a race with the owner also flushing. Plus, it still requires post-processing
| help to identify the precise point for splitting the buffer (without synchronization
| the triggerer can only get close).
| |
| To avoid barriers on common-case trace buffer writes, we use a lazy scheme where the
| triggerer does not modify the trace buffers themselves, but instead records which
| portion has been written using a separate variable never accessed on the fast path.
| |
| Implementation: |
| |
| + The tracer maintains a global list of thread buffers using a global mutex on thread |
| init and exit. |
| |
| + Each trace buffer has a corresponding set of externally_written variables, each |
| holding a distance into the buffer that was written out by another thread. |
| |
| + On hitting the trace window endpoint threshold, the triggering thread grabs the |
| mutex and walks the buffers. |
| |
|   The triggerer doesn't have the current buffer position pointer. Instead it walks
|   the buffer until it reaches zeroed memory (we zero the buffer after each flush).
|   We have no synchronization with the owning thread, but observing writes out of
|   order should be ok since we'll just miss one by stopping too early. We need to fix
|   things up in post-processing in any case, because we need the phase transition to
|   be at a clean point (we can't identify that point from the triggerer: if we end at
|   an instr entry, we don't know whether some memrefs are coming afterward or not).
|   In post-processing we adjust that position to the end of the block, and we split
|   the buffer contents around that point between the neighboring traces.
| |
|   The triggerer does a store-release of the furthest-written point into the
|   externally_written variable.
| |
| + Whenever a thread writes out its buffer, it does a load-acquire on the
|   externally_written variable and if non-zero it writes out a marker in the buffer
|   header (see the sketch after this list). Post-processing reads the marker and uses
|   it to split the buffer at the nearest block boundary after the marker value.
| |
| + If windows are small enough that the triggerer doesn't complete its buffer walk |
| before a new window starts: other thread buffers may completely go into the new |
| window. That seems ok: if the windows are that small, in the absence of |
| application synchronization the resulting window split should be a possible thread |
| ordering. |
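|
| A minimal sketch of the two sides of that lazy handshake (the names are hypothetical
| and details such as clearing the variable afterward are simplified):
|
| \code{.cpp}
| #include <atomic>
| #include <cstdint>
|
| // Illustrative only: the triggerer never touches the trace buffer itself, just
| // this side variable, which is never accessed on the fast path.
| struct per_buffer_t {
|     std::atomic<uint64_t> externally_written{0}; // Offset published by the triggerer.
|     // ... trace buffer storage ...
| };
|
| // Trigger thread: after walking the buffer to find the furthest-written point,
| // publish that offset with a store-release.
| void triggerer_publish(per_buffer_t *buf, uint64_t furthest_offset) {
|     buf->externally_written.store(furthest_offset, std::memory_order_release);
| }
|
| // Owning thread, at its next buffer write-out: a load-acquire tells it whether the
| // triggerer marked a split in this buffer, so it can emit a header marker for
| // post-processing to adjust to the nearest block boundary.
| bool owner_check_split(per_buffer_t *buf, uint64_t *split_offset_out) {
|     uint64_t off = buf->externally_written.load(std::memory_order_acquire);
|     if (off == 0)
|         return false;
|     *split_offset_out = off;
|     return true;
| }
| \endcode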
| |
| This scheme ends up with block-level accuracy since the trigger thread's marked |
| transition point must be adjusted to essentially a block boundary in post-processing. |
| Thus, it does not seem any better than the other schemes, and it is more complex. |
| |
| ## Decision: Proposal B |
| |
| We decided to go with Proposal B as it works best with our file separation decision |
| and is simple to implement. |
| |
| # Online Traces |
| |
| It makes sense for offline use to treat each window trace as separate and to
| simulate each one separately (though aggregating the results to cover the whole
| application).
| |
| But for online, we have the same instance of the simulator or analysis tool
| throughout the whole application run. It will get confused if it throws away its
| thread bookkeeping on a per-window thread exit.
| |
| Either we have a window-controller simulator that spins up and tears down a new
| instance of the real target simulator/tool for each window, or we introduce new "end
| phase/start phase" markers. If we split the offline traces into separate files,
| those markers would exist only for online use, which does not sound appealing.
| Simulators/tools would need special handling for them: reporting statistics for the
| phase while continuing to aggregate for a multi-phase report.
| |
| We might want combined files for offline too, as discussed above. That would unify |
| the two, which is appealing. |
| |
| |
| **************************************************************************** |
| */ |