sheriff-o-matic

aka SoM

NOTE: All of the instructions below assume you are working in a single shell window. All shell commands should be run from the sheriff-o-matic directory (where this README lives).

Setting up local development environment

Prerequisites

You will need a chrome infra checkout as described here. That creates a local checkout of the entire infra repo, which includes this application and many of its dependencies.

Warning: If you are starting from scratch, there may be a lot more setup involved than you expected. Please bear with us.

You'll also need some extras that aren't in the default infra checkout.

# sudo where appropriate for your setup.

npm install -g bower

If you don't have npm or node installed yet, install them by running gclient runhooks, which picks up infra's CIPD packages for nodejs and npm (avoid other installation methods, as they won't match what the builders and other infra devs have installed). Then make sure you've run

eval `../../../../env.py`

in that shell window.

Setting up credentials

You will need access to either staging or prod sheriff-o-matic before you can do this, so contact chops-tfs-team@google.com to request access (“Please add me to the relevant AMI roles...”) if you don't already have it.

# in case you already have this pointed at a jwt file downloaded from gcp console:
unset GOOGLE_APPLICATION_CREDENTIALS

# Use your user identity instead of a service account, will require web flow auth:
gcloud auth application-default login

Note that some services (notably, Monorail) will not honor your credentials when authenticated this way. You'll see 401 Unauthorized responses in the console logs. For these, you may need to get service account credentials. We no longer recommend developers download service account credentials to their machines because they are more sensitive (and GCP limits how many we can have out in the wild).

Getting up and running locally

Note: if you prefer, you can test on the staging environment and skip the local setup sections.

After initial checkout, make sure you have all of the bower dependencies installed. Also run this whenever bower.json is updated:

make build

(Note that you should always be able to rm -rf frontend/bower_components and re-run bower install at any time. Occasionally there are changes that, when applied over an existing frontend/bower_components, will b0rk your checkout.)

To run locally from an infra.git checkout:

make devserver_default

(Note that the issue tracker endpoint will return 403 in the localhost frontend. The access token returned by the local server is a fake token, so the call can't be authorized.)

To run tests:

# Default (go and JS):
make test

# For go:
go test infra/appengine/sheriff-o-matic/som/...

# For interactive go, automatically re-runs tests on save:
cd som && goconvey

# For JS:
cd frontend
make wct

# For debugging JS, with a persistent browser instance you can reload:
cd frontend
make wct_debug

To view test coverage report after running tests:

google-chrome ./coverage/lcov-report/index.html

Adding some trees to your local SoM

Once you have a server running locally, you'll want to add at least one tree configuration to the datastore. Make sure you are logged in locally as an admin user (admin checkbox on fake devserver login page).

Navigate to localhost:8080/admin/portal and fill out the tree(s) you wish to test with locally. For consistency, you may just want to copy the settings from prod.

If you don't have access to prod or staging, you can manually enter this for “Trees in SOM” to get started with a reasonable default:

android:Android,chrome_browser_release:Chrome Browser Release,chromeos:Chrome OS,chromium:Chromium,chromium.clang:Chromium Clang,chromium.gpu:Chromium GPU,chromium.perf:Chromium Perf,fuchsia:Fuchsia,ios:iOS,angle:Angle,dawn:Dawn

After this step, you should see the trees appearing in SoM, but without any alerts. To populate the alerts, continue to the next section.
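The “Trees in SOM” value shown above is a comma-separated list of id:Display Name pairs. As a quick way to eyeball such a value before pasting it into the admin portal, here is a minimal sketch that prints one tree per line. The function name list_trees is hypothetical and is not part of SoM.

```shell
# Hypothetical helper (not part of SoM): print one "id<TAB>display name"
# line for each entry of a "Trees in SOM" value.
list_trees() {
  printf '%s\n' "$1" | tr ',' '\n' | while IFS=: read -r id name; do
    # Skip empty entries, e.g. from a trailing comma.
    if [ -n "$id" ]; then
      printf '%s\t%s\n' "$id" "$name"
    fi
  done
}
```

For example, list_trees 'chromium:Chromium,chromeos:Chrome OS' prints two lines, one for each tree.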

Populating alerts from local cron tasks

To populate alerts for the ChromeOS or Fuchsia tree, open http://localhost:8081/_cron/analyze/chromeos or http://localhost:8081/_cron/analyze/fuchsia, respectively.

To populate alerts for other trees, you must first open http://localhost:8081/_cron/bq_query/chrome, which populates memcache. After that, you can open, for example, http://localhost:8081/_cron/analyze/chromium to populate the chromium tree.

The two flows differ because trees other than ChromeOS and Fuchsia read their data from memcache instead of querying BigQuery directly, in order to save cost.

The purpose of the cronjobs is to process data from SoM's BigQuery view and populate Datastore with alerts.
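The ordering described in this section can be sketched as a small shell function that prints the cron URLs to open, in order, for a given tree. This is only an illustration: the function name cron_urls and the SOM_CRON_BASE variable are hypothetical, and the base URL is simply the one used in the examples in this section.

```shell
# Base URL of the local cron endpoints, as used in this section.
SOM_CRON_BASE="http://localhost:8081/_cron"

# Hypothetical helper: print, in order, the cron URLs to open in order to
# populate alerts for one tree.
cron_urls() {
  case "$1" in
    chromeos|fuchsia)
      # These trees query BigQuery directly, so there is no memcache step.
      printf '%s\n' "$SOM_CRON_BASE/analyze/$1"
      ;;
    *)
      # Other trees read from memcache, which bq_query/chrome fills first.
      printf '%s\n' "$SOM_CRON_BASE/bq_query/chrome" "$SOM_CRON_BASE/analyze/$1"
      ;;
  esac
}
```

Running cron_urls chromium prints the bq_query/chrome URL followed by the analyze/chromium URL, while cron_urls fuchsia prints only the analyze/fuchsia URL.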

Deployment

Authenticating for deployment

In order to deploy to App Engine, you will need to be a member of the project (either sheriff-o-matic or sheriff-o-matic-staging). Before your first deployment, you will have to run ./gae.py login to authenticate yourself.

Deploying to staging

Sheriff-o-Matic has a staging server sheriff-o-matic-staging.appspot.com. To deploy to staging:

Note: Because staging and prod have different datastores, information such as alert groupings and attached bugs may not match the data in prod.

Currently, the cron jobs that populate Datastore from BigQuery are scheduled to run every 30 minutes, so if you want the latest data, you may want to run the cron jobs manually.

Deploying to production

If you want to release a new version to prod:

  • Run make deploy_prod
  • Double-check that the version is not named with a -tainted suffix, as deploying such a version will cause alerts to fire (plus, you shouldn't deploy uncommitted code :).
  • Go to the Versions section of the App Engine Console and update the default version of the app services. Important: Remember to update both the “default” and “analyzer” services by clicking the “Migrate traffic” button. Having the default and analyzer services running different versions may cause errors and/or monitoring alerts to fire.
  • Wait for a while, making sure that the graphs look fine and show no abnormalities in https://viceroy.corp.google.com/chrome_infra/Appengine/sheriff_o_matic_prod?duration=1h
  • Verify that the cron jobs are still successful with the new version (currently they do not send alerts when they fail, so you need to check manually).
  • Do a quick validity check by clicking through the trees in the UI.
  • Send a PSA email to cit-sheriffing@ (cc chops-tfs-team@) about the new release, together with the release notes.
  • You can get the release notes by running the following command (you may need to authenticate for deployment first):
make relnotes

You can also use the optional flags -since-date YYYY-MM-DD or -since-hash=<git short hash> if you need to manually specify the range of commits to include, using the command

go run ../../tools/relnotes/relnotes.go -since-hash <commit_hash> -app sheriff-o-matic -extra-paths .,../../monitoring/analyzer,../../monitoring/client,../../monitoring/messages

Tip: You can find the commit hash of a version by looking at the version name in App Engine (go to the pantheon page for your app and click on the Versions section). For example, if your version name is 12345-20d8b52, then the commit hash is 20d8b52.
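The two version-name conventions mentioned in this section (the -tainted suffix and the trailing commit hash) can be checked with plain shell parameter expansion. The sketch below is only an illustration: the function names are hypothetical, and it assumes that a tainted version name is an ordinary version name with -tainted appended, which this README does not spell out.

```shell
# Hypothetical helper: succeed if the version name ends in "-tainted".
is_tainted() {
  case "$1" in
    *-tainted) return 0 ;;
    *) return 1 ;;
  esac
}

# Hypothetical helper: print the commit hash part of a version name such
# as 12345-20d8b52. Runs in a subshell so "name" does not leak out.
version_commit() (
  # Assumption: a tainted name looks like 12345-20d8b52-tainted.
  name="${1%-tainted}"
  # Keep only what follows the last "-".
  printf '%s\n' "${name##*-}"
)
```

For example, version_commit 12345-20d8b52 prints 20d8b52, which can then be passed to the -since-hash flag shown above.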

Deploying changes to BigQuery views

Changes to SoM's BigQuery view schema are deployed separately from the App Engine deployment described above. You need to deploy the views whenever you modify the SQL files for the BigQuery views (the SQL files in the ./bigquery folder). The steps to deploy your changes are as follows:

  • cd ./bigquery
  • Run ./create_views.sh to deploy your change to staging
  • Verify that everything works as expected
  • Create a CL with your changes and get it reviewed
  • Land your change
  • Modify the file create_views.sh to point to prod by setting APP_ID=sheriff-o-matic
  • Run ./create_views.sh again to deploy your change to prod
  • Verify that everything works as expected in prod
  • Revert the change in create_views.sh by setting APP_ID=sheriff-o-matic-staging
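Because the steps above edit APP_ID in create_views.sh by hand, it is easy to leave the script pointing at prod. A minimal sketch of a check follows. The helper name current_app_id is hypothetical, and it assumes the script contains a line of the form APP_ID=<value> at the start of a line, optionally quoted; adjust the pattern if the real script differs.

```shell
# Hypothetical helper: print the APP_ID assigned in a create_views.sh-style
# file. Assumes a line that starts with APP_ID=<value>.
current_app_id() {
  # Print what follows "APP_ID=", strip optional quotes, and keep the
  # last assignment if there are several.
  sed -n 's/^APP_ID=//p' "$1" | tr -d "\"'" | tail -n 1
}

# Example use from the bigquery directory before running the script:
#   [ "$(current_app_id ./create_views.sh)" = sheriff-o-matic-staging ] \
#     || echo "create_views.sh is not pointing at staging"
```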

If you want to revert your deployment, simply check out the main git branch and run ./create_views.sh again for staging and prod.

Note: As there is no record of the deployment of BigQuery views, it is important that you only deploy to prod once your CL has landed, so it will be easier to debug later if something goes wrong.

Dataflow

A (simplified) dataflow in SoM is in this diagram.

Assigning builds to trees

Chromium/Chrome

In the case of the chromium and chrome projects as well as their corresponding branch projects (chromium-m* and chrome-m*), builds are assigned to trees based on the value of the sheriff_rotations property. The sheriff_rotations property is a list containing the names of the trees the build should be included in.

To modify the sheriff_rotations property for a builder's builds, update the definition of the builder by setting the sheriff_rotations argument, which can take a single value or a list of values:

builder(
    name = "my-builder",
    ...
    sheriff_rotations = sheriff_rotations.ANDROID,
    ...
)

The builders are organized in files based on builder groups, which often are all assigned the same tree. In that case, the sheriff_rotations value can be set for the entire file by using module-level defaults:

defaults.set(
    builder_group = "my-builder-group",
    ...
    sheriff_rotations = sheriff_rotations.CHROMIUM,
    ...
)

Any values specified at the builder will be merged with those set in the module-level defaults. If the module-level defaults should be ignored, args.ignore_default can be used to take only what is specified at the builder, so the following example would cause the sheriff_rotations property to not be set regardless of the module-level default value:

builder(
    name = "my-unsheriffed-builder",
    ...
    sheriff_rotations = args.ignore_default(None),
    ...
)

Troubleshooting common problems

SoM is showing incorrect/missing/stale data

First, check whether the cron jobs are running and whether the latest runs were successful. It may also help to look at recent cron logs to verify that there are no unexpected errors.

If the cron jobs are successful, check whether the alerts are configured to be shown in SoM. The conditions for each tree are in bigquery_analyzer.go. The alerts then go through a filter in config.json.

If the alerts are supposed to show but do not appear in SoM, the next step is to look in Datastore (AlertJSONNonGrouping table) to see if the alert is there. If not, go to BigQuery and look for the failure in the sheriffable_failures table. If the failure is in BigQuery but not in Datastore, there may be a bug in the analyzer cron job, which will require further investigation.

If the failure cannot be found in SoM's BigQuery table, check whether the build can be found in the buildbucket table (e.g. cr-buildbucket.raw.completed_builds_prod). If it can be found there, there may be a problem with the create_views.sh script, which will require further investigation.
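The last two checks above form a small decision table, sketched here as a shell function. It is only a memory aid, not a diagnostic tool: the function name suspect is hypothetical, the three yes/no answers come from your own manual lookups, and any combination that the text above does not cover is reported as unknown.

```shell
# Hypothetical helper: suspect IN_DATASTORE IN_SOM_BQ IN_BUILDBUCKET
# Each argument is "yes" or "no": was the failure found in Datastore
# (AlertJSONNonGrouping), in SoM's sheriffable_failures table, and in the
# buildbucket table?
suspect() {
  if [ "$1" = no ] && [ "$2" = yes ]; then
    # In BigQuery but not in Datastore.
    echo "analyzer cron job"
  elif [ "$1" = no ] && [ "$2" = no ] && [ "$3" = yes ]; then
    # In buildbucket but not in SoM's BigQuery table.
    echo "create_views.sh"
  else
    # Not covered by the troubleshooting steps in this README.
    echo "unknown"
  fi
}
```

For example, suspect no yes yes prints analyzer cron job, and suspect no no yes prints create_views.sh.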

Make SoM monitor additional builders

Most of the time, these requests can be fulfilled by modifying the bigquery_analyzer.go file. Also make sure that the builders/steps in question are not filtered out in config.json.

Contributors

We don't currently run the WCT tests on CQ. So please be sure to run them yourself before submitting. Also keep an eye on test coverage as you make changes. It should not decrease with new commits.

If you would like to test your changes on our staging server (this is often necessary in order to test and debug integrations, and some issues will only reliably reproduce in the actual GAE runtime rather than local devserver), please contact chops-tfs-team@google.com to request access. We're happy to grant staging access to contributors!