Flaky tests are reported in a separate step on the bots (example build).
Each test log provides a pre-filled command line for triggering an automated flake bisect, like:

```bash
echo '{"bisect_buildername": "V8 Linux64 - verify csa", "bisect_mastername": "client.v8", "build_config": "Release", "extra_args": [], "isolated_name": "bot_default", "swarming_dimensions": ["cpu:x86-64", "gpu:none", "os:Ubuntu-14.04", "pool:Chrome"], "test_name": "inspector/runtime/command-line-api-without-side-effects", "timeout_sec": 60, "to_revision": "7f51fdac5bc8bf28b30904e1601819b356187b43", "total_timeout_sec": 120, "variant": "nooptimization"}' | buildbucket.py put -b luci.v8.try -n v8_flako -p -
```
Before triggering flake bisects for the first time, users must log in with a google.com account:

```bash
depot-tools-auth login https://cr-buildbucket.appspot.com
```
Then execute the provided command line, which returns the URL of a build running the flake bisect (example).
If you're in luck, bisection will point you to a suspect. If not, you might want to read further...
For technical details, see also the implementation tracker bug. The flake bisect approach has the same intentions as findit, but uses a different implementation.
A bisect job has three phases: calibration, backwards bisection, and inwards bisection. During calibration, testing is repeated while doubling the total timeout (or the number of repetitions) until enough flakes are detected in one run. Then, backwards bisection doubles the git range until a revision without flakes is found. Finally, we bisect into the range between the good revision and the oldest bad one. Note that bisection doesn't produce new build products; it is purely based on builds previously created on V8's continuous infrastructure.
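The doubling strategy of the three phases can be illustrated with a simplified sketch. All helper names here are hypothetical, revisions are modeled as plain integers, and `is_bad` stands for one stress run against a prebuilt revision; the real implementation lives in V8's recipe code:

```python
def calibrate(count_flakes, total_timeout, min_flakes=4, max_doublings=10):
    """Double the total timeout until one run shows enough flakes."""
    for _ in range(max_doublings):
        if count_flakes(total_timeout) >= min_flakes:
            return total_timeout
        total_timeout *= 2
    return None  # flake could not be reproduced reliably

def bisect_backwards(is_bad, bad_rev):
    """Double the distance from the bad revision until a good one is found."""
    step = 1
    while is_bad(bad_rev - step):
        step *= 2
    return bad_rev - step  # last known good revision

def bisect_inwards(is_bad, good_rev, bad_rev):
    """Binary search between the good revision and the oldest bad one."""
    while bad_rev - good_rev > 1:
        mid = (good_rev + bad_rev) // 2
        if is_bad(mid):
            bad_rev = mid
        else:
            good_rev = mid
    return bad_rev  # first flaky revision, i.e. the suspect
```

Every `is_bad` check reruns the test on a build produced earlier by the continuous infrastructure, which is why the bisect never needs to compile anything new.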
If a failing run times out while a pass runs very fast, it is useful to tweak the timeout_sec parameter, so that bisection is not delayed waiting for hanging runs to time out. E.g. if a pass is usually reached in less than 1 second, set the timeout to something small, e.g. 5 seconds.
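For instance, a trimmed fragment of the trigger properties from the example above, with only the tweaked timeout shown (the value is illustrative):

```json
{"timeout_sec": 5, "total_timeout_sec": 120}
```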
In some runs, confidence is very low. E.g. calibration is satisfied if four flakes are seen in one run, and during bisection every run with one or more flakes is counted as bad. In such cases it might be useful to restart the bisect job, setting to_revision to the culprit and using a higher number of repetitions or a longer total timeout than the original job, to confirm that the same conclusion is reached.
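Such a re-run would tweak the properties roughly like this, where `<culprit revision hash>` stands for the revision reported by the first job and the doubled timeout is illustrative:

```json
{"to_revision": "<culprit revision hash>", "total_timeout_sec": 240}
```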
Sometimes the overall timeout option doesn't work on Windows. In this case it's best to estimate a fitting number of repetitions and set total_timeout_sec to 0.
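Assuming a repetitions property as hinted at by the calibration description above, the adjusted fragment might look like this (the count is illustrative):

```json
{"repetitions": 5000, "total_timeout_sec": 0}
```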
Rarely, a code path is only triggered with a particular random seed. In this case it might be beneficial to fix it using extra_args, e.g. "extra_args": ["--random-seed=123"]. Otherwise, the stress runner will use different random seeds throughout. Note though that a particular random seed might reproduce a problem in one revision, but not in another.