The perf bot sheriff is responsible for keeping the bots on the chromium.perf waterfall up and running, and triaging performance test failures and flakes.
Sheriff-O-Matic is (as of 2/27/2017) the recommended way to perfbot sheriff. It can be used to track the different issues, associate them with specific bugs, and annotate failures with useful information. It also attempts to group similar failures across different builders, which gives a higher-level view of what is happening on the perf waterfall.
It is an actively staffed project, which should be getting better over time. If you find any bugs with the app, you can file a bug by clicking on the feedback link in the bottom right of the app, or by clicking this link.
Everyone can view the chromium.perf waterfall at https://build.chromium.org/p/chromium.perf/, but for Googlers it is recommended that you use the url https://uberchromegw.corp.google.com/i/chromium.perf/ instead. The reason for this is that in order to make the performance tests as realistic as possible, the chromium.perf waterfall runs release official builds of Chrome. But the logs from release official builds may leak info from our partners that we do not have permission to share outside of Google. So the logs are available to Googlers only. To avoid manually rewriting the URL when switching between the upstream and downstream views of the waterfall and bots, you can install the Chromium Waterfall View Switcher extension, which adds a switching button to Chrome's URL bar.
Note that there are three different views:
There is also milo, which has the same data as buildbot, but mirrored in a different datastore. It is generally faster than buildbot, and links to it will not break, as the data is kept around for much longer.
You can see a list of all previously filed bugs using the Performance-Sheriff-BotHealth label in crbug.
Please also check the recent perf-sheriffs@chromium.org postings for important announcements about bot turndowns and other known issues.
Some build configurations, in particular the perf builders and trybots, have multiple machines attached. If one or more of the machines go down, there are still other machines running, so the console or waterfall view will still show green, but those configs will run at reduced throughput. At least once during your shift, you should check the lists of buildslaves and ensure they're all running.
The machines restart between test runs, so just looking for “Status: Not connected” is not enough to indicate a problem. For each disconnected machine, also check the “Last heard from” column to ensure that it's been gone for at least an hour. To get it running again, file a bug against the current trooper; see go/bug-a-trooper for how to contact troopers.
The chrome infrastructure team also maintains a set of dashboards you can use to view some debugging information about our systems; this is available at vi/chrome_infra. To debug offline buildslaves, you can look at the “Individual machine” dashboard (at vi/chrome_infra/Machines/per_machine under the “Machines” section), which can show some useful information about the machine in question.
When a bot goes purple, it's usually because of an infrastructure failure outside of the tests. But you should first check the logs of a purple bot to try to better understand the problem. Sometimes a telemetry test failure can turn the bot purple, for example.
If the bot goes purple and you believe it's an infrastructure issue, file a bug with this template, which will automatically add the bug to the trooper queue. Be sure to note which step is failing, and paste any relevant info from the logs into the bug. Also be sure to read go/bug-a-trooper for contacting troopers.
There are three types of device failures:
1. A device is blacklisted in the device_status step. Device failures of this type are expected to turn the bot purple. You can look at the buildbot status page to see how many devices were listed as online during this step. You should always see 7 devices online; if you see fewer than 7 devices online, there is a problem in the lab.
2. A device is passing the device_status step but is still in poor health. The symptom of this is that all the tests are failing on it. You can see that on the buildbot status page by looking at the Device Affinity. If all tests with the same device affinity number are failing, it's probably a device failure.
3. A device is missing from the device_status step. You should always see 7 total devices on a bot in one of three statuses: online, missing, or blacklisted. If you see fewer than 7 devices, it means there is a problem with the known devices persistent file and the device is unreachable via adb. This usually means the known devices file was cleared while a device was unreachable. File a bug saying that there is a missing device; going through previous logs will usually yield a device ID for the missing device.
For these types of failures, please file a bug with this template, which will add an issue to the infra labs queue.
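For reference, the device_status step is largely reporting what adb on the host can see. If you want to reproduce the equivalent check on a local workstation, a minimal sketch (not the exact lab tooling) is:

adb devices -l

Devices that show up here as offline or unauthorized, or that do not show up at all, are the local analogue of the blacklisted and missing states described above.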
If you need help triaging, here are the common labels you should use:
Here are the common components you should also use:
If you still need help, ask the speed infra chat, or escalate to sullivan@.
Sometimes when looking at failing android tests you will notice tests failing on multiple devices. Sometimes (but not always) this means that there is a problem on the host machine. One way this problem can occur is if a test uses the wrong version of adb in one of its commands; this causes the adb server on the host to reset, which can cause failures for anything trying to communicate with a device via adb during that time. A good tool for diagnosing this is the Test Trace step on the android runs, which is a trace of which tests are running. If all the tests across all the testing shards are failing, it may be an issue on the host, not with the tests. This will no longer be used when the android bots move to swarming, since each device will be sandboxed from the others and not run from a single point.
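If you suspect an adb version conflict on the host, one quick check is to compare the versions of the adb binaries involved, since a client whose version differs from the running server will kill and restart that server. A sketch (the path to the adb that the test harness ships with is an assumption and may differ in your checkout):

adb version
third_party/android_tools/sdk/platform-tools/adb version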
Sometimes when a compile step is failing, you may be asked to clobber the build (example). Steps to clobber:
You want to keep the waterfall green! So any bot that is red or purple needs to be investigated. When a test fails:
return_code.
Telemetry test runner logs
Useful Content: Best place to start. These logs contain all of the python logging information from the telemetry test runner scripts.
Where to find: These logs can be found from the buildbot build page. Click the “[stdout]” link under any of the telemetry test buildbot steps to view the logs. Do not use the “stdio” link, which shows similar information but expires earlier and is slower to load.
Android Logcat (Android)
Useful Content: This file contains all Android device logs. All Android apps and the Android system will log information to logcat. Good place to look if you believe an issue is device related (Android out-of-memory problem for example). Additionally, often information about native crashes will be logged to here.
Where to find: These logs can be found from the buildbot status page. Click the “logcat dump” link under one of the “gsutil upload” steps.
Test Trace (Android)
Useful Content: These logs graphically depict the start/end times for all telemetry tests on all of the devices. This can help determine if test failures were caused by an environmental issue. (see Cross-Device Failures)
Where to find: These logs can be found from the buildbot status page. Click the “Test Trace” link under one of the “gsutil Upload Test Trace” steps.
Symbolized Stack Traces (Android)
Useful Content: Contains symbolized stack traces of any Chrome or Android crashes.
Where to find: These logs can be found from the buildbot status page. The symbolized stack traces appear under several steps: click the link under the “symbolized breakpad crashes” step to see symbolized Chrome crashes, and the link under the “stack tool with logcat dump” step to see symbolized Android crashes.
As of Q4 2016, all desktop bots have been moved to the swarming pool, with the goal of moving all android bots to swarming in early 2017. There is now one machine on the chromium.perf waterfall for each desktop configuration, which triggers test tasks on 5 corresponding swarming bots. All of our swarming bots exist in the chrome-perf swarming pool.
Change the --isolated-script-test-output=/b/s/w/ioFB73Qz/output.json and --isolated-script-test-chartjson-output=/b/s/w/ioFB73Qz/chartjson-output.json flags to be a local path:

/usr/bin/python ../../testing/scripts/run_telemetry_benchmark_as_googletest.py ../../tools/perf/run_benchmark indexeddb_perf -v --upload-results --output-format=chartjson --browser=release --isolated-script-test-output=tmp/output.json --isolated-script-test-chartjson-output=tmp/chartjson-output.json
python testing/scripts/run_telemetry_benchmark_as_googletest.py tools/perf/run_benchmark sunspider -v --output-format=chartjson --upload-results --browser=reference --output-trace-tag=_ref --isolated-script-test-output=foo/output.json --isolated-script-test-chartjson-output=foo/chart-output.json
python tools/mb/mb.py isolate //out/Release -m chromium.perf -b "Linux Builder" telemetry_perf_tests
python tools/swarming_client/isolate.py archive -I isolateserver.appspot.com -i out/Release/telemetry_perf_tests.isolate -s out/Release/telemetry_perf_tests.isolated
./tools/swarming_client/run_isolated.py -I https://isolateserver.appspot.com -s <insert_hash_here> -- sunspider -v --upload-results --output-format=chartjson --browser=reference --output-trace-tag=_ref --isolated-script-test-output=/usr/local/google/home/eyaich/projects/chromium/src/tmp/output.json
python tools/swarming_client/swarming.py trigger -v --isolate-server isolateserver.appspot.com -S chromium-swarm.appspot.com -d id build150-m1 -d pool Chrome-perf -d os Linux -s <insert_hash_here> -- sunspider -v --upload-results --output-format=chartjson --browser=reference --output-trace-tag=_ref --isolated-script-test-output='${ISOLATED_OUTDIR}/output.json' --isolated-script-test-chartjson-output='${ISOLATED_OUTDIR}/chart-output.json'
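Once the task is triggered you can wait for it and pull back its results with the collect subcommand; a minimal sketch, where the task ID comes from the output of the trigger command above:

python tools/swarming_client/swarming.py collect -S chromium-swarm.appspot.com <insert_task_id_here>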
If the test is a telemetry test, its name will have a ‘.’ in it, such as thread_times.key_mobile_sites or page_cycler.top_10. The part before the first dot will be a python file in tools/perf/benchmarks.
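For example (an illustrative mapping; check your checkout for the exact file names):

thread_times.key_mobile_sites -> tools/perf/benchmarks/thread_times.py
page_cycler.top_10 -> tools/perf/benchmarks/page_cycler.py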
If a telemetry test is failing and there is no clear culprit to revert immediately, disable the test. You can do this with the @benchmark.Disabled decorator. Always add a comment next to your decorator with the bug id which has background on why the test was disabled, and also include a BUG= line in the CL.
Please disable the narrowest set of bots possible; for example, if the benchmark only fails on Windows Vista you can use @benchmark.Disabled('vista'). Supported disabled arguments include:
win
mac
chromeos
linux
android
vista
win7
win8
yosemite
elcapitan
all (please use as a last resort)
If the test fails consistently in a very narrow set of circumstances, you may consider implementing a ShouldDisable method on the benchmark instead. Here is an example of disabling a benchmark which OOMs on svelte.
As a last resort, if you need to disable a benchmark on a particular Android device, you can do so by checking the return value of possible_browser.platform.GetDeviceTypeName() in ShouldDisable. Here are some examples of this. The type name of the failing device can be found by searching for the value of ro.product.model under the provision_devices step of the failing bot.
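Putting these pieces together, here is a minimal sketch of what the disabling patterns look like in a benchmark file. The class, benchmark name, bug numbers, and device model are all hypothetical, and the imports assume the usual layout of benchmarks under tools/perf:

from core import perf_benchmark
from telemetry import benchmark


# crbug.com/123456: hypothetical bug with background on the Vista failure.
@benchmark.Disabled('vista')
class ExampleBenchmark(perf_benchmark.PerfBenchmark):
  """Illustrative benchmark showing the disabling patterns described above."""

  @classmethod
  def Name(cls):
    return 'example.benchmark'

  @classmethod
  def ShouldDisable(cls, possible_browser):
    # crbug.com/654321: hypothetical bug; fails only on this device model.
    return possible_browser.platform.GetDeviceTypeName() == 'Nexus 7'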
Disabling CLs can be TBR-ed to anyone in tools/perf/OWNERS, but please do not submit with NOTRY=true.
Non-telemetry tests are configured in chromium.perf.json, but do not manually edit this file. Update tools/perf/generate_perf_data.py to disable the test and rerun the script to generate the new chromium.perf.json file. You can TBR any of the per-file OWNERS, but please do not submit with NOTRY=true.
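A minimal sketch of that workflow (the location of the generated json is an assumption; check where the script in your checkout writes it):

python tools/perf/generate_perf_data.py
git diff testing/buildbot/chromium.perf.json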
Pri-0 bugs should have an owner or contact on speed infra team and be worked on as top priority. Pri-0 generally implies an entire waterfall is down.
Pri-1 bugs should be pinged daily, and checked to make sure someone is following up. Pri-1 bugs are for a red test (not yet disabled), purple bot, or failing device. Here is the list of Pri-1 bugs that have not been pinged today.
Pri-2 bugs are for disabled tests. These should be pinged weekly, and work towards fixing should be ongoing when the sheriff is not working on a Pri-1 issue. Here is the list of Pri-2 bugs that have not been pinged in a week.
At the end of your shift you should send out a message to the next sheriff. It should detail any ongoing issues you are trying to resolve, including new bugs you have filed and bisects you are waiting to finish. If there have been any significant updates on older issues that the next sheriff should know about, they should also be included. This will greatly decrease the amount of time needed for the next sheriff to come up to speed.
There is also a weekly debrief that you should see on your calendar titled Weekly Speed Sheriff Retrospective. For this meeting you should prepare any highlights or lowlights from your sheriffing shift as well as any other feedback you may have that could improve future sheriffing shifts.