Triaging Gasper Alerts

Rota

(In time order.)

  • non-PST: kojih
  • PST: sullivan, ojan, rafaelw

Setup

To receive alerts in a manageable way (and not need to poll the Google Group to find new alerts):

  1. Subscribe to webkit-gasper@google.com
  2. Make a filter based on from:gasper-alerts@google.com (so alerts skip your inbox, but you still get personal emails, notably end-of-shift emails) whose actions are:
    • Skip Inbox
    • (Optional) Apply label “Gasper”
  3. Make a saved search for the previous day's alerts via list:webkit-gasper@google.com newer_than:1d and save it in the Quick Links gadget (available in Labs).
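
Concretely, the filter and saved search from steps 2 and 3 look like this (the label name is just a suggestion):

Filter:       from:gasper-alerts@google.com  →  Skip Inbox, apply label "Gasper"
Saved search: list:webkit-gasper@google.com newer_than:1d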

Triage process

  1. Review all the untriaged alerts at webkit-gasper@google.com
    • File appropriate bugs.
  2. At the end of your shift, send an email to webkit-gasper@google.com with:
    • The last email you've triaged, either as a link to the message on the mailing-list web page or as the alert subject (both, plus the date and time, is a nice touch). This is just a marker so the next Gasper sheriff knows where to start triaging.
    • (Optional) A link to the first email you've triaged, for completeness. Ideally this is where the previous Gasper sheriff left off, in which case just writing "From end of X's report" is fine.
    • (Optional) List of bugs filed (bug number, link, title)
    • (Optional) Brief notes on individual bugs
    • (Optional) Summary/Discussion of alerts – often many alerts are related to a single bug or revision, esp. one noted in a previous report
    • This can be simple or elaborate; since the format is fixed, it should be quick to write and quick to read. A sample report follows this list.
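
For example, a minimal end-of-shift report might look like this (all names, subjects, dates, and bug numbers are hypothetical):

To: webkit-gasper@google.com
Subject: Gasper triage report 2012-06-14

Triaged from end of X's report through "PERF REGRESSION: sunspider/total on chromium-rel-linux" (2012-06-14 15:02).

Bugs filed:
  • Issue 123456 – 5% regression in sunspider/total on Linux
  • Issue 123457 – startup warm/extension_empty regression

Notes: most of today's alerts appear related to the WebKit roll noted in yesterday's report.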

Extended discussion on individual bugs should take place on the bug page; more general discussion should happen at chromium-dev or (if confidential) at chrome-perf. Google+ posts may also be appropriate.

Triaging an individual alert

Outline

There are three steps:

  1. Determine if it's an actual regression. There are 3 main reasons for false alerts, which can be detected as follows:
    • Noisy test – check graph.
    • Improvement – think about correct direction (see below). Gasper doesn't currently distinguish improvements from regressions (tricky because direction varies with test).
    • Change in machine – check reference build (add _ref to the trace).
  2. If so, determine the specific regression and cause (if possible). Don't spend too much time on this. Look at several graphs (if the regression shows up on several) – sometimes one will have a much narrower range than the others, or you can intersect overlapping ranges (see the sketch after this list). Look at the change log and see if anything jumps out, namely an edit to the platform or module in question. You can run a performance bisect job to pinpoint the precise revision, but this takes a while to run, so do it after filing the bug. If you can't figure it out, assign the bug to the Chrome perf sheriff or WebKit gardeners as appropriate and include a link to the regression range if it's clear.
  3. File a bug or bugs, and assign to the responsible party or sheriff.
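
When several alerts give overlapping revision ranges, intersecting them narrows the suspects. A minimal Python sketch of that step (the revision numbers are hypothetical):

def intersect_ranges(ranges):
    # Each alert gives a closed revision range (first_suspect, last_suspect).
    lo = max(first for first, last in ranges)
    hi = min(last for first, last in ranges)
    return (lo, hi) if lo <= hi else None  # None means the ranges don't overlap

# Two alerts narrowing each other down to r120..r150:
print(intersect_ranges([(100, 150), (120, 200)]))  # (120, 150)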

Sometimes regressions come in groups. You can address this in one of two ways. First, you can look at a group of alerts together and see if you notice patterns, such as several traces on the same test suite – this also helps in narrowing down the suspect revision range.

Alternatively, you can go through alerts individually, filing bugs as you go, and updating existing bug reports with later alerts.

Bisecting

Once you file a bug, you can pinpoint the revision, either to determine the cause (if you don't have a suspect) or confirm that the suspected revision is in fact guilty:

  • Bisecting Performance Regressions (the bisect tool) is a key tool for reducing a range to a specific revision. This takes quite a while, so do it after you've filed a bug.

In configuring the bisect tool, you will need to set the command parameter (to the command line; see below), together with the metric parameter. The format for metric is graph/trace, where these are the values in the “graph” and “trace” boxes at the top of the graph page. Alternatively, looking at the test output, the corresponding test result line begins:

RESULT graph: trace=

Note that times/t is common and treated specially.
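
For example (value and units hypothetical), for the metric warm/extension_empty discussed later on this page, the test output would contain a line like:

RESULT warm: extension_empty= 950.0 ms

and the metric parameter for the bisect tool would be warm/extension_empty.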

Assuming you successfully identify the culprit, update the bug accordingly:

  • Copy the full results of bisection (for reference)
  • Look up the SVN revision and refer to it via “Revision 123” in the bug text
  • Assign the bug to the author of the revision, or at least CC them
  • A personal message to the author of the suspect revision is polite and helps the bug assignment not come out of the blue.

Details

In detail:

  1. Open the chromium-perf link.
  2. If it's an improvement or it's clear that the alert was just reporting a noisy test, you can stop here.
  3. IMPORTANT
    • For tests that measure milliseconds or megabytes, going down is an improvement.
    • For tests that measure runs/second or a “score”, going up is an improvement. (A sketch of this check follows this list.)
  4. Check that the reference build didn't also regress. If the reference build regressed at exactly the same run as the regular build, then you know it's not a real regression – something just changed on the machine (e.g. it was updated to new hardware). Do this by adding the appropriate "_ref" trace to the comma separated list of traces. For example, this graph looks like it's a 10% regression on the “t” trace until you add the “t_ref” trace and see that it's not. You can also remove the trace entry entirely to see all the traces for that test suite, but this is usually too much noise to make sense of.
  5. Uncheck the “Detailed view” checkbox to show the trace on all the perf bots. This will let you quickly see if the regression is WebKit-side or Chromium-side (i.e. if all the regressions are on the ChromiumWebkit master, then it's a WebKit-side change). It's good to search for existing regressions, but often we'll be the first alert or it'll be buried in another bug, so this is not essential.
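
The direction check in step 3 amounts to the following rule of thumb (a Python sketch, not code from the alert system; the unit list is illustrative):

def is_improvement(units, percent_change):
    # ms/KB/MB measure cost: lower is better.
    # runs/second and "score" measure throughput: higher is better.
    lower_is_better = units in ("ms", "KB", "MB")
    return percent_change < 0 if lower_is_better else percent_change > 0

print(is_improvement("ms", -10))      # True – the page got 10% faster
print(is_improvement("runs/s", -10))  # False – throughput dropped, a regression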

Bug Filing

When filing a bug report, useful information to include (a sample skeleton follows this list):

  • Summary of regression – this can be the title of the alert, or a human-readable interpretation (esp. for a group of alerts), or both for maximum clarity
  • Link to the Gasper graph – several if applicable
    • A screenshot of the graph is a nice touch – the graph is slow to load, and you can't link to a specific point on it. You can highlight the regression (vertically) by mousing over the relevant portion, and indicate a horizontal level (and % change) by shift-clicking at the pre-regression level, then moving the pointer to indicate the post-regression level, and then taking the screenshot.
    • In Ubuntu 12, Shift-PrintScreen takes a picture of a specified area, so you don't need to screenshot then crop
  • Revision range for regression, with link to webpage for these
  • Specific revision, if it can be determined, with link to changeset and also to relevant bug, and explanation of why this revision looks like the culprit
  • Detailed description – what the issue is, as far as you can tell
  • Assign to the sheriff (if you cannot determine the cause, or it is unclear), or to the patch author if there is a clear suspect revision
  • Test command to run to reproduce (see below for how to find)
  • Link to corresponding Chromium/WebKit bug if double-filed
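
Putting that together, a report skeleton might look like this (all links, numbers, and names are hypothetical placeholders):

Summary: 5% regression in sunspider/total on chromium-rel-linux

Gasper alert: <alert subject / link>
Graph: <link to Gasper graph> (screenshot attached)
Regression range: r105000–r105010 (<link to range>)
Suspect: r105007 (<changeset link>) – the only change in the range touching the module in question
To reproduce: out/Release/performance_ui_tests --gtest_filter=<test>
WebKit bug: <link, if double-filed>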

Caveats

Beware that Chromium and WebKit contributors are in general not familiar with Gasper, and thus may have difficulty identifying the regression (esp. if the graph is noisy) or determining how to reproduce the test.

There are various automatic shortcuts in the Chromium Issue Tracker and WebKit Bugzilla. Most useful are: in Chromium, "Issue 123" or "Bug 123" links to a Chromium issue, and "r123" or "revision 123" links to a Chromium revision; in WebKit, "Bug 123" links to a WebKit bug.
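
For example, a bug comment like the following (numbers hypothetical) gets auto-linked in the Chromium Issue Tracker:

Bisect points at r105007; see also Issue 123456 for an earlier report of the same regression.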

Determine sheriff

You can determine the Chromium Perf sheriff from the Chromium BuildBot Waterfall – at the top left, under the “Sheriff” section, the (Perf) sheriff is the current Chromium Perf sheriff. For WebKit regressions, you can instead assign to the WebKit Gardener.

Determine test command line

To determine the command line for the test that triggered the alert:

  • On the alert, click on “View the graph”; this will take you to the graph page.
  • On the graph page, make a note of the TestSuite (top left of page), click “Go to builder” (top of box, 3rd line of page); this will take you to the waterfall for the perf bot.
  • On the builder page, search for the TestSuite and click “stdio”; this will take you to the stdio page for this run of the test suite.
  • On the stdio page, look through manually for the actual command; it may be the first line of the second block. Common command lines include:

performance_ui_tests (Linux, Windows):

out/Release/performance_ui_tests
C:\b\build\slave\Win7_Perf\build\src\build\Release\performance_ui_tests.exe

run_multipage_benchmarks (Page cyclers, Linux):

python tools/perf/run_multipage_benchmarks

...followed by some parameters.

  • To determine the correct test suite and subtest from a given metric, consider looking in chromium/src/chrome/test/perf/, particularly at TEST_F lines, and searching for matches. For example, the metric warm/extension_empty in TestSuite startup_test lives in the gtest (Google Test) test StartupTest.PerfExtensionEmpty, rather than in StartupTest.PerfWarm as you might expect.
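
One way to run that search (assuming a local checkout; the patterns are just examples):

git grep -n "TEST_F(StartupTest" -- chrome/test/perf/
git grep -n "extension_empty" -- chrome/test/perf/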

Followup

In principle, once you have filed a bug and assigned it to an appropriate party, it's no longer your responsibility, though you may be CCed on comments. However, other contributors may not know how to reproduce the test and may need assistance, you are well-placed to verify fixes, and further alerts will be generated when the regression is fixed, so you likely haven't heard the end of the bug. If you have time, the following followup is useful:

  • Help contributors run tests and verify the regression, or their fix.
  • Once fixed, verify: run another bisect with good set to N − 1 and bad set to N, where N is the revision of the fix (parameters sketched after this list), and compare the resulting times to the times before/at the regression revision; if fixed, mark the bug VERIFIED.
  • Send a note to the mailing list advising that alerts will be triggered by the fix.
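
The verification bisect mirrors the original job; a sketch, where N is the revision of the fix:

good:    N − 1 (the revision just before the fix)
bad:     N (the fix itself)
command: same as the original bisect
metric:  same graph/trace as the original bisect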

Making sense of the graphs

  • Click on the run where the regression happened. In the middle right there are CL (Chromium regression range), Data (data from one of the traces) and Webkit (WebKit regression range).
  • The bottom portion (CL/Data/Webkit) is in a separate frame; to determine its URL (so it can be pasted into a bug report), open the frame in its own tab, e.g. via the browser's right-click frame menu.
  • Shift+click in the graph to fix a vertical level (this draws a horizontal line), and move the mouse to display the percentage change.
  • Traces:
    • To view all traces, remove the “trace” value from the input at the top (or the URL).
    • Especially important is the reference build (the trace ending in _ref); if this is also affected, that tells you that it wasn't a patch that caused the regression, but a change to the machine (i.e. nothing to fix). See the example after this list.
    • Note that different graphs for the same test suite have different traces. So, if you switch which subgraph you are looking at, you may need to modify the traces field as well. If you simply remove the traces field, it will automatically show all traces.
  • In the non-detailed view, click on any of the graphs to open it in a new window.
  • In the detailed view, you can open the graph in a new window by copy-pasting the URL right above the graph.
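
For example, to compare a graph's “t” trace against the reference build (as in step 4 of the Details section above), set the traces field to:

t,t_ref

Clearing the field entirely shows every trace for the test suite.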