i#7860 load bal: Add load balancing to drmemtrace analyzer (#7878)

When running drmemtrace analyzers with dynamic scheduling (i.e., core-sharded with live scheduling instead of replay), we want our worker threads to have relatively even "activity" as seen by a simulator for a reasonable final virtual schedule. Atomic activity counts are periodically updated and examined to accomplish this. If a worker's activity reaches a specified ratio versus the slowest worker, it sleeps until it is under the ratio.

A new CLI flag --max_load_imbalance sets the max ratio; the default is 2.5.

Adds a unit test, though it would likely become flaky if waiting were required. Confirmed manually that there is waiting in the 3 fast workers for the slow worker #0 who ends up with fewer instructions but within the ratio (there are no idles in this test):
```
12: [analyzer] Worker 2 @3000 waiting for slowest 3 @1000
12: [analyzer] Worker 3 @13000 waiting for slowest 0 @1000
12: [analyzer] Worker 1 @14000 waiting for slowest 0 @1000
12: [analyzer] Worker 0 waited 0 times for load balancing
12: [analyzer] Worker 1 waited 71 times for load balancing
12: [analyzer] Worker 3 waited 71 times for load balancing
12: [analyzer] Worker 2 waited 71 times for load balancing
12: shard 0 saw 215000 instructions
12: shard 1 saw 530000 instructions
12: shard 2 saw 535000 instructions
12: shard 3 saw 530000 instructions
```

Tested on larger traces on 80 cores with a live max of 60 with target ratios at 2.0, 2.5, and 3.0, which were precisely hit in two consecutive runs each with the 100K check cadence settled on in the code here (auto-raising to 1M or higher reduces accuracy with some ratios climbing too high). The check_load_balance() function does show up in some profiles as high as 3.3% but seems worth the cost.
When the target is reduced below 1.5, it becomes hard to reach and the overhead goes up: check_load_balance() is 4.6% in a 1.25-target run and 15% in a 1.0 run in these experiments, with the ratio not dropping under 1.5. So 1.5 may be a practical limit for now, as noted in the option description, which should be fine for initial use cases.

Fixes #7860
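
The throttling rule described in this commit message (per-worker atomic activity counts, compared against the slowest worker, with a sleep while over the ratio) can be illustrated with a small standalone sketch. This is not the drmemtrace code: the class `load_balancer_t`, its method names, the 1 ms sleep, and the choice to ignore finished workers are all assumptions made for illustration.

```cpp
#include <atomic>
#include <cassert>
#include <chrono>
#include <cstddef>
#include <cstdint>
#include <thread>
#include <vector>

// Hypothetical illustration only; not the analyzer's implementation.
class load_balancer_t {
public:
    load_balancer_t(size_t num_workers, double max_ratio)
        : activity_(num_workers)
        , finished_(num_workers)
        , max_ratio_(max_ratio)
    {
        for (size_t i = 0; i < num_workers; ++i) {
            activity_[i].store(0, std::memory_order_relaxed);
            finished_[i].store(false, std::memory_order_relaxed);
        }
    }

    // Called by a worker at its check cadence (e.g., every N instructions).
    void add_activity(size_t worker, uint64_t amount)
    {
        activity_[worker].fetch_add(amount, std::memory_order_relaxed);
    }

    // Assumption: a worker that is done should no longer hold others back.
    void mark_finished(size_t worker)
    {
        finished_[worker].store(true, std::memory_order_release);
    }

    // True if this worker's activity exceeds max_ratio times the activity
    // of the slowest worker that is still running.
    bool over_limit(size_t worker) const
    {
        uint64_t mine = activity_[worker].load(std::memory_order_relaxed);
        uint64_t slowest = mine;
        for (size_t i = 0; i < activity_.size(); ++i) {
            if (finished_[i].load(std::memory_order_acquire))
                continue;
            uint64_t other = activity_[i].load(std::memory_order_relaxed);
            if (other < slowest)
                slowest = other;
        }
        return static_cast<double>(mine) >
            max_ratio_ * static_cast<double>(slowest);
    }

    // Sleeps in short intervals until under the ratio; returns the number
    // of sleeps performed.
    int wait_if_ahead(size_t worker) const
    {
        int waits = 0;
        while (over_limit(worker)) {
            std::this_thread::sleep_for(std::chrono::milliseconds(1));
            ++waits;
        }
        return waits;
    }

private:
    std::vector<std::atomic<uint64_t>> activity_;
    std::vector<std::atomic<bool>> finished_;
    double max_ratio_;
};
```

In this sketch each worker thread would call `add_activity()` followed by `wait_if_ahead()` at its check cadence, and `mark_finished()` when it runs out of input. With a ratio of 2.5, a worker at 3000 is over the limit against a slowest worker at 1000 (3000 > 2500), which matches the first log line above. The final shard counts above give 535000 / 215000 ≈ 2.49, just under the 2.5 target.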

DynamoRIO is a runtime code manipulation system that supports code transformations on any part of a program, while it executes. DynamoRIO exports an interface for building dynamic tools for a wide variety of uses: program analysis and understanding, profiling, instrumentation, optimization, translation, etc. Unlike many dynamic tool systems, DynamoRIO is not limited to insertion of callouts/trampolines and allows arbitrary modifications to application instructions via a powerful IA-32/AMD64/ARM/AArch64 instruction manipulation library. DynamoRIO provides efficient, transparent, and comprehensive manipulation of unmodified applications running on stock operating systems (Windows, Linux, or Android) and commodity IA-32, AMD64, ARM, and AArch64 hardware. Mac OSX support is in progress.
DynamoRIO is the basis for some well-known external tools:
Tools built on DynamoRIO and available in the release package include:
DynamoRIO's powerful API abstracts away the details of the underlying infrastructure and allows the tool builder to concentrate on analyzing or modifying the application's runtime code stream. API documentation is included in the release package and can also be browsed online. Slides from our past tutorials are also available.
DynamoRIO is available free of charge as a binary package for both Windows and Linux. DynamoRIO's source code is available primarily under a BSD license.
Use the discussion list to ask questions.
To report a bug, use the issue tracker.
See also the DynamoRIO home page: http://dynamorio.org/