We use Stackdriver monitoring to check whether the dashboard is running as expected. When you get an alert email from Stackdriver, you should do the following:

1. Understand what alerted, and find the relevant code. We have two main types of alerts; for custom-metric alerts, the code calls utils.TickMonitoringCustomMetric every time it completes. If that call is not made, it generally means the code failed, and we send an alert. Search the code for the call that ticks the metric with that name (for example, in add_point_queue.py) so you can understand where the likely failure is; a sketch of the ticking pattern follows this list.
2. Analyze the logs for errors. Once you have a basic idea of which code path is failing, look at the logs. There are two main entry points for this. Look for entries marked NEW ERROR and click through to the call stacks and relevant logs.
3. File a bug and follow up. The bug should be labeled P0, Perf Dashboard, and Bug. If it is clear the problem is with bisect, add that label as well. Reply to the alert email with a link to the bug, and update the bug with your findings as you understand the problem better.
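As a reference for what the ticking pattern in step 1 looks like, here is a minimal sketch. The task function, the work it does, and the metric name are made up for illustration, and the exact import path for utils may differ between dashboard versions; only the final TickMonitoringCustomMetric call reflects the mechanism described above.

    # Minimal sketch of the "tick on success" pattern. The task function,
    # work function, and metric name are hypothetical.
    from dashboard import utils  # import path may differ by dashboard version

    def _DoTheActualWork():
        pass  # placeholder for the real task logic

    def _ProcessExampleQueueTask():
        _DoTheActualWork()  # any exception raised here skips the tick below
        # Tick the custom metric only after the work finishes. If the code
        # above raises, the metric is never ticked and Stackdriver alerts.
        utils.TickMonitoringCustomMetric('ProcessExampleQueueTask')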
If it's necessary at some point to have scheduled downtime, announce it ahead of time. At least two days before the downtime (ideally more), announce it in these ways:

- Email the perf sheriffs list (perf-sheriffs@chromium.org).
- Email chrome-perf-dashboard-announce@google.com.

If possible, it's probably best to schedule the downtime for a Saturday, when usage is likely to be relatively low.
There are several routine tasks to do to set up the dashboard for a user. The official process for this is to file bugs on crbug.com with labels:
- Performance-Dashboard-IPWhitelist
- Performance-Dashboard-BotWhitelist
- Performance-Dashboard-MonitoringRequest
You can view, create and edit sheriff rotations at /edit_sheriffs.
It’s fine to add a new sheriff rotation any time a team wants alerts to go to a new email address, and it’s fine to make a temporary sheriff rotation for monitoring new tests before they are stable. Fill out the fields on the /edit_sheriffs form when creating the rotation.
After creating a sheriffing rotation, you need to add the individual tests to monitor. You do this by clicking “Set a sheriff for a group of tests”. It asks for a pattern. Patterns match test paths, which are of the form “Master/Bot/test-suite/graph/trace”. You can replace any part of the test path with a * for a wildcard.
The dashboard will list the matching tests before allowing you to apply the pattern, so you’ll be able to check if the pattern is correct.
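Conceptually, the wildcard matching works like the sketch below. This is an illustration only, not the dashboard’s actual implementation, and the pattern and test paths in the example are made up.

    # Conceptual sketch of wildcard test-path matching (illustration only,
    # not the dashboard's implementation).
    def pattern_matches(pattern, test_path):
        """True if every '/'-separated part matches or the pattern part is '*'."""
        pattern_parts = pattern.split('/')
        path_parts = test_path.split('/')
        if len(pattern_parts) != len(path_parts):
            return False
        return all(pp == '*' or pp == tp
                   for pp, tp in zip(pattern_parts, path_parts))

    # Example (hypothetical names): monitor one graph of a suite on every bot.
    assert pattern_matches('ChromiumPerf/*/dromaeo/Total',
                           'ChromiumPerf/linux-release/dromaeo/Total')
    assert not pattern_matches('ChromiumPerf/*/dromaeo/Total',
                               'ChromiumPerf/linux-release/sunspider/Total')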
To remove a pattern, click “Remove a sheriff from a group of tests”.
If you want to keep alerting on most of the tests in a pattern and just disable alerting on a few noisy ones, you can add the “Disable Alerting” anomaly threshold config to the noisy tests (see “Modify anomaly threshold configs” below).
The default alert thresholds should work reasonably well for most test data, but there are some graphs for which it may not be correct. If there are invalid alerts, or the dashboard is not sending alerts when you expect them, you may want to modify an alert threshold config.
To edit alert threshold configs, go to /edit_anomaly_configs. Add a new config with a descriptive name and a JSON mapping of parameters to values.
Start off by using the anomaly threshold debugging page, /debug_alert. The page shows how the anomaly-finding algorithm segmented the data; based on the documentation of the config parameters, adjust them until the alerts fire where you want them.
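For example, a config might pair a descriptive name such as “noisy_benchmark_config” with a JSON mapping along these lines. The parameter names and values here are only illustrative; check the anomaly-finding documentation for the keys the dashboard actually supports.

    {
      "max_window_size": 50,
      "min_absolute_change": 10,
      "min_relative_change": 0.1,
      "min_segment_size": 6,
      "multiple_of_std_dev": 2.5
    }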
The dashboard can automatically apply labels to bugs filed on alerts, based on which test triggered the alert. This is useful for getting the attention of the relevant teams. For example, the dashboard automatically applies the label “Cr-Blink-JavaScript” to dromaeo regressions, which cuts down on a lot of CC-ing by hand.
To make a label automatically apply to a bug, go to /edit_sheriffs and click “Set a bug label to automatically apply to a group of tests”. Then type in a pattern as described in the “Edit Sheriff Rotations -> Monitoring Tests” section above, and type in the bug label. You’ll see a list of tests the label will be applied to before you confirm.
To remove a label, go to /edit_sheriffs and click “Remove a bug label that automatically applies to a group of tests”.
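As a concrete illustration of the form, the dromaeo example above might be entered roughly like this (the pattern is made up; double-check it against the listed tests before confirming):

    Pattern:    */*/dromaeo/*
    Bug label:  Cr-Blink-JavaScript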
When a test name changes, it is possible to migrate the existing test data to use the new name. You can do this by entering a pattern for the test name at /migrate_test_names.
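For instance, if a test suite were renamed, the migration might look roughly like this (the suite names are made up, and the exact fields expected are shown on the /migrate_test_names page itself):

    Old pattern:  ChromiumPerf/*/old_suite_name/*
    New pattern:  ChromiumPerf/*/new_suite_name/*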
There are two types of whitelisting on the perf dashboard:
The IP whitelist is a list of IP addresses of machines which are allowed to post data to /add_point. This is to prevent /add_point from being spammed. You can add a bot to the IP whitelist at /ip_whitelist. If you’re seeing 403 errors on your buildbots, the IPs to add are likely already in the logs. Note that if you are seeing 500 errors, those are not related to IP whitelisting; they are usually caused by an error in the JSON data sent by the buildbot. If you can’t tell by looking at the JSON data what is going wrong, the easiest thing to do is to add a unit test with the JSON to add_point_test.py and debug it from there.
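A rough sketch of that debugging approach follows. The method is meant to be pasted into the existing test class in add_point_test.py; the test-harness details (the base class and the self.testapp helper) and the method name are assumptions based on the surrounding tests, so adapt them to match that file.

    # Hypothetical test method to add inside the existing test class in
    # add_point_test.py (add "import json" at the top of the file if it is
    # not already there). Harness details such as self.testapp are assumed
    # to follow the other tests in that file.
    def testPost_FailingBuildbotPayload(self):
      failing_payload = [
          # Paste here the JSON rows the buildbot sent (from its logs).
      ]
      # Post to /add_point the same way the buildbot does, then step through
      # the handler in a debugger or add logging to find the bad field.
      self.testapp.post('/add_point', {'data': json.dumps(failing_payload)})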
The bot whitelist is a list of bot names which are publicly visible. If a bot is not on the list, users must be logged into google.com accounts to see the data for that bot. You can add or remove a bot from the whitelist using the dev console by importing dashboard.change_internal_only.
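In the dev console that flow looks roughly like the sketch below. The function name and arguments are placeholders; check dashboard/change_internal_only.py for the real entry point before running anything.

    # Rough sketch for the dev console. The commented-out call is a
    # placeholder -- look in dashboard/change_internal_only.py for the actual
    # function and its arguments before running it.
    from dashboard import change_internal_only
    # e.g., mark a bot's data as publicly visible (hypothetical call):
    # change_internal_only.UpdateBotWhitelist(['my-bot-name'], internal_only=False)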