How to launch a functional bisect and interpret its results

A functional bisect determines the revision at which a particular benchmark or story started failing more often. It does this by binary-searching between a known-good and a known-bad revision, running the test multiple times at each candidate revision until the culprit is narrowed down to a single revision.
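
The sketch below illustrates the idea (it is not the dashboard's actual implementation; the run count, threshold, and function names are hypothetical):

```python
# Hypothetical sketch of a functional bisect: binary search over revisions,
# repeating the test at each candidate revision to smooth over flakiness.
RUNS_PER_REVISION = 10   # how many times to repeat the test at one revision
FLAKE_THRESHOLD = 0.5    # failure rate above this counts as "bad"


def revision_is_bad(revision, run_test):
    """Run the test several times at one revision and classify it."""
    failures = sum(1 for _ in range(RUNS_PER_REVISION) if not run_test(revision))
    return failures / RUNS_PER_REVISION > FLAKE_THRESHOLD


def functional_bisect(revisions, good_index, bad_index, run_test):
    """Narrow the good/bad range until only one suspect revision remains."""
    while bad_index - good_index > 1:
        mid = (good_index + bad_index) // 2
        if revision_is_bad(revisions[mid], run_test):
            bad_index = mid    # the failure is already present at mid
        else:
            good_index = mid   # the failure has not started yet at mid
    return revisions[bad_index]  # the first revision classified as bad
```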

Identifying a good and bad revision

The first step in launching a bisect is to identify a revision at which you're confident the test was passing more often (the good revision) and a revision at which you're confident the test was failing more often (the bad revision).

The easiest way to do this is to use the flakiness dashboard (basic instructions here).

The screenshot below shows the easiest possible case: it's clear when the test went from passing to failing. (Remember, runs are listed from most recent on the left to oldest on the right.)

A simple test failure on the flakiness dashboard

To get the good revision, just click on the good revision box and note the end of its revision range.

Finding the good build revision on the flakiness dashboard

To get the bad revision, do the same for the bad revision box.

Finding the bad build revision on the flakiness dashboard

The flakier the test, the further back into the older “green zone” you generally need to go for your good revision, and the further forward into the newer “red zone” for your bad revision. Remember that it's better to pick a slightly wider, safer bisect range and have the bisect take a little longer than to pick a too-narrow range and have to rerun the entire bisect.
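
A rough back-of-the-envelope calculation shows why widening the range is cheap: each doubling of the range adds only about one extra bisect step (the revision counts below are illustrative, not taken from the dashboard):

```python
import math

# Approximate number of bisect steps needed for N candidate revisions.
for num_revisions in (50, 100, 200, 400):
    steps = math.ceil(math.log2(num_revisions))
    print(f"{num_revisions:4d} revisions -> ~{steps} bisect steps")
# Output: 50 -> ~6, 100 -> ~7, 200 -> ~8, 400 -> ~9
```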

For example, let's look at a little flakier failure:

A flaky test on the flakiness dashboard

The first row here shows a story that went from being a little flaky to very flaky. To choose the “good” revision, we should go a little further back than the last one in the streak of good runs (in this case, labeled 292), because it's possible that that particular run was a lucky pass and the already-failing code had already been submitted, like the run labeled 246 a few runs later. Similarly, when choosing a “bad” revision, we should pick one a little more recent than the first in the streak of failures (in this case, labeled 226), because it's possible that that first failure was just one of the rare flakes that were already happening.

In this case, I might choose the following as safe good and bad revisions:

Finding the good and bad revisions for a flaky test on the flakiness dashboard
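
To make the “lucky pass” intuition concrete, here is a quick calculation with a made-up flake rate (not measured from the dashboard):

```python
# Suppose the culprit has already landed and the story now fails 60% of the
# time. A single green run near the transition is weak evidence; a streak of
# green runs is much stronger.
flake_rate = 0.6                              # hypothetical post-culprit failure rate
p_one_lucky_pass = 1 - flake_rate             # 40% chance a single run still passes
p_three_lucky_passes = (1 - flake_rate) ** 3  # 6.4% chance of three passes in a row

print(f"one lucky pass:      {p_one_lucky_pass:.0%}")
print(f"three lucky passes:  {p_three_lucky_passes:.1%}")
```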

Now you can start the bisect.

Launching the bisect through the perf dashboard

To launch a functional bisect, open up the “Browse Graphs” page on the Chrome Performance Dashboard.

For “Test Suite”, select the name of your benchmark.

For “Bot”, choose the platform on which the benchmark or story is failing. If it's failing on multiple platforms, you can choose any of them.

For the first “Subtest” entry, choose any metric that would have been collected for the particular story that's failing. Do not choose benchmark_duration, which is unsuitable for bisects. You'll know that you picked a suitable metric when, at the end of this series of steps, you're shown a graph with recent data points for your story.

For the next “Subtest” entry, pick the story that's failing. If all stories are failing, you can choose any story. Make sure to select the story name without the “_ref” suffix on it.

Once you've done this, click “Add” to add a graph of this metric to the page. It should look like this.

A graph on the perf dashboard

Click on any data point in that graph to bring up a dialog, then click “Bisect” in that dialog.