The perf bot sheriff is responsible for keeping the bots on the chromium.perf waterfall up and running, and triaging performance test failures and flakes.
Sheriff-O-Matic is (as of 2/27/2017) the recommended way to perf bot sheriff. It can be used to track the different issues, associate them with specific bugs, and annotate failures with useful information. It also attempts to group similar failures across different builders, giving a higher-level perspective on what is happening on the perf waterfall.
It is an actively staffed project, which should be getting better over time. If you find any bugs with the app, you can file a bug by clicking on the feedback link in the bottom right of the app, or by clicking this link.
Everyone can view the chromium.perf waterfall at https://build.chromium.org/p/chromium.perf/, but for Googlers it is recommended that you use the url https://uberchromegw.corp.google.com/i/chromium.perf/ instead. The reason for this is that in order to make the performance tests as realistic as possible, the chromium.perf waterfall runs release official builds of Chrome. But the logs from release official builds may leak info from our partners that we do not have permission to share outside of Google. So the logs are available to Googlers only. To avoid manually rewriting the URL when switching between the upstream and downstream views of the waterfall and bots, you can install the Chromium Waterfall View Switcher extension, which adds a switching button to Chrome's URL bar.
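The upstream/downstream URL rewrite the extension performs is a simple prefix swap. As a sketch (the helper function is hypothetical; the real extension is a Chrome extension, not this code):

```python
# Hypothetical helper mirroring what the Chromium Waterfall View Switcher
# extension does: swap the public waterfall prefix for the internal one.
UPSTREAM = 'https://build.chromium.org/p/'
DOWNSTREAM = 'https://uberchromegw.corp.google.com/i/'

def to_internal(url):
    """Rewrite a public waterfall URL to the Googler-only mirror."""
    if url.startswith(UPSTREAM):
        return DOWNSTREAM + url[len(UPSTREAM):]
    return url

print(to_internal('https://build.chromium.org/p/chromium.perf/'))
# -> https://uberchromegw.corp.google.com/i/chromium.perf/
```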
Note that there are three different views:
There is also milo, which has the same data as buildbot, but mirrored in a different datastore. It is generally faster than buildbot, and links to it will not break, as the data is kept around for much longer.
You can see a list of all previously filed bugs using the Performance-Sheriff-BotHealth label in crbug.
Please also check the recent perf-sheriffs@chromium.org postings for important announcements about bot turndowns and other known issues.
As of Q2 2017, all desktop and android bots have been moved to swarming. There is now one machine on the chromium.perf waterfall for each desktop configuration, which triggers test tasks on 5 corresponding swarming bots. All of our swarming bots exist in the chrome-perf swarming pool.
Some things have probably changed about sheriffing since we migrated from buildbot. Here's a partial list:
At least once during your shift, you should check the lists of buildslaves and ensure they're all running.
The machines restart between test runs, so just looking for “Status: Not connected” is not enough to indicate a problem. For each disconnected machine, you can also check the “Last heard from” column to ensure that it's been gone for at least an hour. To get it running again, file a bug against the current trooper and read go/bug-a-trooper for contacting troopers.
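The "disconnected for over an hour" check can be sketched as follows (the slave records and field names are illustrative, not the real buildbot schema):

```python
from datetime import datetime, timedelta

def needs_trooper_bug(slave, now):
    """A disconnected machine is only a problem if it has been gone for at
    least an hour, since machines restart between test runs."""
    if slave['connected']:
        return False
    return now - slave['last_heard_from'] > timedelta(hours=1)

# Illustrative records: one machine mid-restart, one genuinely offline.
now = datetime(2017, 6, 1, 12, 0)
rebooting = {'connected': False, 'last_heard_from': now - timedelta(minutes=10)}
gone = {'connected': False, 'last_heard_from': now - timedelta(hours=3)}
print(needs_trooper_bug(rebooting, now), needs_trooper_bug(gone, now))
# -> False True
```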
The chrome infrastructure team also maintains a set of dashboards you can use to view some debugging information about our systems. This is available at vi/chrome_infra. To debug offline buildslaves, you can look at the “Individual machine” dashboard (at vi/chrome_infra/Machines/per_machine under the “Machines” section), which can show some useful information about the machine in question.
When a bot goes purple, it's usually because of an infrastructure failure outside of the tests. But you should first check the logs of a purple bot to try to better understand the problem. Sometimes a telemetry test failure can turn the bot purple, for example.
If the bot goes purple and you believe it's an infrastructure issue, file a bug with this template, which will automatically add the bug to the trooper queue. Be sure to note which step is failing, and paste any relevant info from the logs into the bug. Also be sure to read go/bug-a-trooper for contacting troopers.
Android device failures mainly manifest as a string of steps on a builder turning purple, with all of the failing tests reporting the same device id; failures of this type are expected to be purple. Sheriff-o-matic will try to do this grouping for you, and will make an entry with a title like bot affinity build123-b1 is broken on chromium.perf/Linux Perf, affecting 10 tests.
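The device-affinity grouping sheriff-o-matic performs can be sketched as follows (the failure records and device ids are illustrative):

```python
from collections import defaultdict

# Illustrative (test, device_id) failure records; bucketing by device id
# makes a single bad device stand out from scattered test failures.
failures = [
    ('speedometer', 'build123-b1'),
    ('sunspider', 'build123-b1'),
    ('page_cycler.top_10', 'build123-b1'),
    ('octane', 'build124-b1'),
]

by_device = defaultdict(list)
for test, device in failures:
    by_device[device].append(test)

for device, tests in sorted(by_device.items()):
    print('%s: %d failing tests' % (device, len(tests)))
```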
These issues are usually automatically handled by the labs team. If you are a Googler, you can see a list of tickets here which have been auto-filed by Chrome Infra. These are filed twice a day, and usually get attention from people in the lab, who can fix these bugs quickly. Check this list to see if a device is on there. If it isn't, then follow the instructions below to file a ticket with the labs team.
For these types of failures, please file a bug with this template which will add an issue to the infra labs queue.
If you need help triaging, here are the common labels you should use:
Here are the common components you should also use:
If you still need help, ask the speed infra chat, or escalate to sullivan@.
Sometimes when looking at failing android tests you will notice failures on multiple devices. Sometimes (but not always) this means there is a problem on the host machine. One way this can occur is if a test uses the wrong version of adb in one of its commands; this resets the adb server on the host, which can cause failures in anything trying to communicate with a device via adb during that time. A good tool for diagnosing this is the Test Trace step on the android runs, which is a trace of which tests are running. If all the tests across all the testing shards are failing, it may be an issue on the host, not with the tests. This will no longer apply once the android bots move to swarming, since each device will be sandboxed from the others rather than run from a single point.
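The "all shards failing at once" heuristic can be sketched as follows (the function name and data shapes are hypothetical):

```python
# Map device id -> list of test results (True = passed); data is illustrative.
def looks_like_host_issue(results_by_device):
    """If every device attached to the host has at least one failure, suspect
    the host (e.g. a reset adb server) rather than the individual tests."""
    return all(not all(results) for results in results_by_device.values())

host_problem = {'dev1': [False, False], 'dev2': [False], 'dev3': [True, False]}
one_bad_test = {'dev1': [True, True], 'dev2': [False], 'dev3': [True]}
print(looks_like_host_issue(host_problem))   # -> True
print(looks_like_host_issue(one_bad_test))   # -> False
```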
Sometimes when a compile step is failing, you may be asked to clobber the build (example). Steps to clobber:
You want to keep the waterfall green! So any bot that is red or purple needs to be investigated. When a test fails:
return_code.

Telemetry test runner logs
Useful Content: Best place to start. These logs contain all of the python logging information from the telemetry test runner scripts.
Where to find: These logs can be found from the buildbot build page. Click the “[stdout]” link under any of the telemetry test buildbot steps to view the logs. Do not use the “stdio” link, which shows similar information but expires earlier and is slower to load.
Android Logcat (Android)
Useful Content: This file contains all Android device logs. All Android apps and the Android system will log information to logcat. Good place to look if you believe an issue is device related (Android out-of-memory problem for example). Additionally, often information about native crashes will be logged to here.
Where to find: These logs can be found from the buildbot status page. Click the “logcat dump” link under one of the “gsutil upload” steps.
Test Trace (Android)
Useful Content: These logs graphically depict the start/end times for all telemetry tests on all of the devices. This can help determine if test failures were caused by an environmental issue. (see Cross-Device Failures)
Where to find: These logs can be found from the buildbot status page. Click the “Test Trace” link under one of the “gsutil Upload Test Trace” steps.
Symbolized Stack Traces (Android)
Useful Content: Contains symbolized stack traces of any Chrome or Android crashes.
Where to find: These logs can be found from the buildbot status page. The symbolized stack traces can be found under several steps. Click the link under the “symbolized breakpad crashes” step to see symbolized Chrome crashes. Click the link under the “stack tool with logcat dump” step to see symbolized Android crashes.
If the test is a telemetry test, its name will have a ‘.’ in it, such as thread_times.key_mobile_sites or page_cycler.top_10. The part before the first dot will be a python file in tools/perf/benchmarks.
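Mapping a test name to its benchmark file is mechanical; a sketch (helper name is made up, the path comes from the text above):

```python
# The part of a telemetry test name before the first dot names a module in
# tools/perf/benchmarks.
def benchmark_file(test_name):
    module = test_name.split('.', 1)[0]
    return 'tools/perf/benchmarks/%s.py' % module

print(benchmark_file('thread_times.key_mobile_sites'))
# -> tools/perf/benchmarks/thread_times.py
```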
If a telemetry test is failing and there is no clear culprit to revert immediately, disable the story on the failing platforms.
For example:
You can do this with StoryExpectations.
To determine which stories are failing in a given run, go to the buildbot page for that run and search for Unexpected failures.
Example: On platform P, story foo is failing on benchmark bar. On the same benchmark on platform Q, story baz is failing. To disable these stories, go to where benchmark bar is declared. Using codesearch, you can look for benchmark_baz, which will likely be in bar.py. This is where you can disable the story.
Once there, find the benchmark's GetExpectations() method. Inside it you should see a SetExpectations() method; that is where stories are disabled.
Buildbot output for failing run on platform P:
bar.benchmark_baz
Bot id: 'buildxxx-xx'
...
Unexpected Failures:
* foo
Buildbot output for failing run on platform Q:
bar.benchmark_baz
Bot id: 'buildxxx-xx'
...
Unexpected Failures:
* baz
Code snippet from bar.py benchmark:
class BarBenchmark(perf_benchmark.PerfBenchmark):
  ...
  def Name():
    return 'bar.benchmark_baz'
  ...
  def GetExpectations(self):
    class StoryExpectations(story.expectations.StoryExpectations):
      def SetExpectations(self):
        self.DisableStory(
            'foo', [story.expectations.PLATFORM_P], 'crbug.com/1234')
        self.DisableStory(
            'baz', [story.expectations.PLATFORM_Q], 'crbug.com/5678')
    return StoryExpectations()
If a story is failing on multiple platforms, you can add more platforms to the list in the second argument to DisableStory(). If the story is failing on different platforms for different reasons, you can have multiple DisableStory() declarations for the same story, each with its own reason listed.
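A minimal, self-contained sketch of both patterns (the PLATFORM_* constants, base class, and bug numbers are stand-ins for the real story.expectations API, which lives in telemetry):

```python
# Stub stand-ins for story.expectations; illustrative only.
PLATFORM_P, PLATFORM_Q = 'platform_p', 'platform_q'

class StoryExpectationsBase(object):
    def __init__(self):
        self.disabled = {}
        self.SetExpectations()

    def DisableStory(self, name, platforms, reason):
        # Record each disable as (platform list, bug link).
        self.disabled.setdefault(name, []).append((platforms, reason))

class BarExpectations(StoryExpectationsBase):
    def SetExpectations(self):
        # One story disabled on two platforms for the same reason...
        self.DisableStory('foo', [PLATFORM_P, PLATFORM_Q], 'crbug.com/1234')
        # ...and another disabled per-platform with different reasons.
        self.DisableStory('baz', [PLATFORM_P], 'crbug.com/5678')
        self.DisableStory('baz', [PLATFORM_Q], 'crbug.com/9012')

exp = BarExpectations()
print(sorted(exp.disabled))  # -> ['baz', 'foo']
```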
If a particular story isn't applicable to a given platform, it should be disabled using CanRunStory.
To find the currently supported disabling conditions view the expectations file.
If for some reason you are unable to disable at the granularity you would like, disable the test at the lowest granularity possible and contact rnephew@ to suggest new disabling criteria.
Disabling CLs can be TBR-ed to anyone in tools/perf/OWNERS, but please do not submit with NOTRY=true.
Non-telemetry tests are configured in chromium.perf.json, but do not manually edit this file. Instead, update tools/perf/generate_perf_data.py to disable the test and rerun the script to generate the new chromium.perf.json file. You can TBR any of the per-file OWNERS, but please do not submit with NOTRY=true.
Pri-0 bugs should have an owner or contact on speed infra team and be worked on as top priority. Pri-0 generally implies an entire waterfall is down.
Pri-1 bugs should be pinged daily, and checked to make sure someone is following up. Pri-1 bugs are for a red test (not yet disabled), purple bot, or failing device. Here is the list of Pri-1 bugs that have not been pinged today.
Pri-2 bugs are for disabled tests. These should be pinged weekly, and work towards fixing should be ongoing when the sheriff is not working on a Pri-1 issue. Here is the list of Pri-2 bugs that have not been pinged in a week.
At the end of your shift you should send a message to the next sheriff detailing any ongoing issues you are trying to resolve. This can include new bugs you have filed and bisects you are waiting to finish. If there have been any significant updates on older issues that the next sheriff should know about, include those as well. This will greatly decrease the amount of time the next sheriff needs to come up to speed.
There is also a weekly debrief that you should see on your calendar titled Weekly Speed Sheriff Retrospective. For this meeting you should prepare any highlights or lowlights from your sheriffing shift as well as any other feedback you may have that could improve future sheriffing shifts.
--isolated-script-test-output=/b/s/w/ioFB73Qz/output.json and --isolated-script-test-chartjson-output=/b/s/w/ioFB73Qz/chartjson-output.json flags to be a local path:

/usr/bin/python ../../testing/scripts/run_telemetry_benchmark_as_googletest.py ../../tools/perf/run_benchmark speedometer -v --upload-results --output-format=chartjson --browser=release --isolated-script-test-output=tmp/output.json --isolated-script-test-chartjson-output=tmp/chartjson-output.json
python testing/scripts/run_telemetry_benchmark_as_googletest.py tools/perf/run_benchmark sunspider -v --output-format=chartjson --upload-results --browser=reference --output-trace-tag=_ref --isolated-script-test-output=foo/output.json --isolated-script-test-chartjson-output=foo/chart-output.json
python tools/mb/mb.py isolate //out/Release -m chromium.perf -b "Linux Builder" telemetry_perf_tests
python tools/swarming_client/isolate.py archive -I isolateserver.appspot.com -i out/Release/telemetry_perf_tests.isolate -s out/Release/telemetry_perf_tests.isolated
./tools/swarming_client/run_isolated.py -I https://isolateserver.appspot.com -s <insert_hash_here> -- sunspider -v --upload-results --output-format=chartjson --browser=reference --output-trace-tag=_ref --isolated-script-test-output=/usr/local/google/home/eyaich/projects/chromium/src/tmp/output.json
python tools/swarming_client/swarming.py trigger -v --isolate-server isolateserver.appspot.com -S chromium-swarm.appspot.com -d id build150-m1 -d pool Chrome-perf -d os Linux -s <insert_hash_here> -- sunspider -v --upload-results --output-format=chartjson --browser=reference --output-trace-tag=_ref --isolated-script-test-output='${ISOLATED_OUTDIR}/output.json' --isolated-script-test-chartjson-output='${ISOLATED_OUTDIR}/chart-output.json'