Our event pipeline collects, stores, and aggregates event data from ChOps services. Event data can be any piece of information we want to collect for analysis or tracking. It is distinct from timeseries data, for which we use tsmon.
For Googlers, see the internal docs.
Lists of tables and their descriptions are available by project (e.g. infra or infra_internal) at doc/bigquery_tables.md. Table owners are responsible for updating those lists.
Tables are commonly identified by `<project-id>.<dataset_id>.<table_id>`.
BigQuery tables belong to datasets. Dataset IDs and table IDs should be underscore-delimited, e.g. `test_results`.
For services that already have a corresponding Google Cloud Project, create tables in that project under a dataset named `events`. For other services, create a new GCP project.
Datasets can be created in the BigQuery console.
Rationale for per-project tables:
Tables are defined by schemas, which are stored in .proto form. This gives us version control and lets us use the protoc tool to generate language-specific bindings. Use bqschemaupdater to create new tables or modify existing tables in BigQuery. For now, this tool must be run manually. Run the Go environment setup script from infra.git:
eval `go/env.py`
and this should install bqschemaupdater in your path. See the docs or run `bqschemaupdater --help` for more information.
Once you have a table, you can send events to it!
The following applies to non-GCP projects. Events sent from GCP projects to tables owned by the same project should just work.
You need to ensure that the machines which will run the event-sending code have the proper credentials. At this point, you may need to enlist the help of a Chrome Operations Googler, as many of the following resources and repos are internal.
Grant the relevant service account access to the chrome-infra-events project. You’ll need the proper privileges to do this. If you don’t have them, ask a Chrome Infrastructure team member for help.

Go: use go.chromium.org/luci/common/bq, example CL.
Python: use infra.libs.bigquery, example CL, docs.
How you instrument your code to add event logging depends on your needs, and there are a couple of options.
If you don’t need transactional integrity and prefer a simpler configuration, use bq.Uploader. This should be your default choice if you’re just starting out.
If you need ~8ms latency on inserts, or transactional integrity with datastore operations, use bqlog [TODO: update this link if/when bqlog moves out of tokenserver into a shared location].
Design trade-offs for using bq instead of bqlog: lower accuracy and precision. Some events may be duplicated in logs (say, if an operation that logs events has to be retried due to datastore contention). Intermittent failures in other supporting infrastructure may also cause events to be lost.
Design trade-offs for using bqlog instead of bq: You will have to enable task queues in your app if you haven’t already, and add a new cron task to your configuration. You will also not be able to use the bqschemaupdater tool (described above) to manage your logging schema code generation.
Package bq takes care of some boilerplate and makes it easy to add monitoring for uploads. It also takes care of adding insert IDs, which BigQuery uses to deduplicate rows. If you are not using bq.Uploader, check out bq.InsertIDGenerator.
With bq, you can construct a synchronous Uploader or an asynchronous BatchUploader, depending on your needs.
kitchen is an example of a tool that uses bq.
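Below is a minimal sketch of a synchronous upload with `bq.Uploader`. The project ID, dataset, and table names are hypothetical placeholders, the event message is assumed to be generated from your schema .proto, and the `NewUploader`/`Put` signatures shown here should be confirmed against the bq package docs before relying on them.

```go
package events

import (
	"context"

	"cloud.google.com/go/bigquery"
	"github.com/golang/protobuf/proto"
	"go.chromium.org/luci/common/bq"
)

// uploadEvent sends a single event row to a hypothetical
// <my-cloud-project>.events.test_results table. The event must be a proto
// message generated (via protoc) from the same .proto used by bqschemaupdater
// to define the table's schema.
func uploadEvent(ctx context.Context, event proto.Message) error {
	// The client uses Application Default Credentials, so the machine's
	// service account needs write access to the target table.
	client, err := bigquery.NewClient(ctx, "my-cloud-project")
	if err != nil {
		return err
	}
	defer client.Close()

	// Uploader targets one dataset/table and fills in insert IDs
	// (see bq.InsertIDGenerator) so BigQuery can deduplicate retried rows.
	up := bq.NewUploader(ctx, client, "events", "test_results")

	// Put performs a synchronous streaming insert of one or more messages.
	return up.Put(ctx, event)
}
```

In a long-running service you would normally create the client and Uploader once and reuse them across uploads, or wrap the Uploader in a BatchUploader for asynchronous, batched inserts.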
You will need the google-cloud-bigquery library in your environment. infra.git/ENV has this dependency already, so you only need to add it if you are working outside that environment.
Check out the [bigquery helper module](../../infra/libs/bigquery/helper.py). Under the hood, it uses the BigQuery Python client. It is recommended that you use the helper rather than the client directly, as it houses common logic for handling edge cases, formatting errors, and handling protobufs. You'll still have to provide an authenticated instance of google.cloud.bigquery.client.Client.
See this change for a simple example. (TODO: replace with a non-internal example that uses BigQueryHelper.) The API docs can also be helpful.
The beam documentation is a great place to get started.
Workflows are in the `packages/dataflow` directory. `packages/dataflow/common` contains abstractions that you will likely want to use. Take a look at what is available there before beginning your workflow.
See the dataflow package README for more information.
You should write tests for your pipeline. Tests can be run using test.py, e.g. `./test.py test packages/dataflow`.
We want Dataflow workflows like the ones that populate our aggregate tables (e.g. cq_attempts) to run at regular intervals. You can accomplish this by configuring a builder to run the remote_execute_dataflow_workflow recipe with the proper properties. See this change for an example.
The builder name should be `dataflow-workflow-[job name]`, where job name is the name of the remotely executed job. This naming scheme sets up automated alerting for builder failures.
Generally you will use the BigQuery console for this. You can also use Google Data Studio, which allows you to create dashboards and graphs from BigQuery data.
Googlers can query existing plx tables from BigQuery. Here's an example query:
SELECT issue, patchset, attempt_start_msec FROM chrome_infra.cq_attempts LIMIT 10;
To make this work on BigQuery, you need to change the table reference. That example query would look like this on BigQuery:
SELECT issue, patchset, attempt_start_msec FROM `plx.google:chrome_infra.cq_attempts.all` LIMIT 10;
Note the `all` suffix on the table. This is only for tables which don't have existing suffixes like `lastNdays` or `today`.
ACLs for these tables depend on the user's Gaia ID. If a service account needs access to a plx table, you need to add it as a READER on the table, which can be done either in PLX or in the materialization script. Ask a Googler to do this for you.
Note that enums appear to lose their string values when you query them through BigQuery. This means that you need to change
SELECT * FROM chrome_infra.cq_attempts WHERE fail_type = 'NOT_LGTM'
into
SELECT * FROM `plx.google:chrome_infra.cq_attempts.all` WHERE fail_type = 4
To execute a query which joins a table from a different cloud project, ensure the querying project's service account has BigQuery read permissions in the other project.
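For programmatic access, here is a hedged Go sketch of such a cross-project query using the cloud.google.com/go/bigquery client. All project, dataset, table, and column names are hypothetical placeholders; the only real requirement it illustrates is that the other project's table must be referenced by its fully qualified name, and that the querying service account needs read access there.

```go
package main

import (
	"context"
	"fmt"
	"log"

	"cloud.google.com/go/bigquery"
	"google.golang.org/api/iterator"
)

func main() {
	ctx := context.Background()

	// The query runs as (and bills to) this project; its service account is
	// the identity that needs BigQuery read permissions in the other project.
	client, err := bigquery.NewClient(ctx, "querying-project") // hypothetical project ID
	if err != nil {
		log.Fatalf("creating BigQuery client: %v", err)
	}
	defer client.Close()

	// Hypothetical join: the foreign table is referenced by its fully
	// qualified `<project>.<dataset>.<table>` name.
	sql := "SELECT r.issue, r.patchset\n" +
		"FROM `querying-project.events.test_results` AS r\n" +
		"JOIN `other-project.events.cq_attempts` AS a ON r.issue = a.issue\n" +
		"LIMIT 10"

	it, err := client.Query(sql).Read(ctx)
	if err != nil {
		log.Fatalf("running query: %v", err)
	}
	for {
		var row []bigquery.Value
		err := it.Next(&row)
		if err == iterator.Done {
			break
		}
		if err != nil {
			log.Fatalf("reading row: %v", err)
		}
		fmt.Println(row)
	}
}
```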
Data Studio: for graphs, reports, and dashboards
BigQuery and Dataflow limits are documented alongside the tools to which they are relevant. Because many components of the pipeline make API requests to BigQuery, the API request limits are documented here.
API request limits are per user. In our case, the user is most often a service account. Check the BigQuery docs for the most up-to-date limits. At the time of writing, there are limits on requests per second and on concurrent API requests. The client is responsible for ensuring it does not exceed these limits.
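Since the client must enforce these limits itself, one simple approach is to put a client-side rate limiter in front of every BigQuery API call. The sketch below uses golang.org/x/time/rate; the 50 requests-per-second figure is a made-up placeholder, not the actual quota, so substitute the limit documented for your project.

```go
package main

import (
	"context"
	"log"
	"time"

	"golang.org/x/time/rate"
)

// bqLimiter throttles outgoing BigQuery API calls. The 50 req/s value is a
// placeholder; substitute the documented per-user limit for your quota.
var bqLimiter = rate.NewLimiter(rate.Limit(50), 1)

// callBigQuery is a stand-in for any function that issues a BigQuery API
// request (e.g. a streaming insert via bq.Uploader.Put).
func callBigQuery(ctx context.Context) error {
	// Wait blocks until the limiter allows another request or the context ends.
	if err := bqLimiter.Wait(ctx); err != nil {
		return err
	}
	// ... issue the actual API request here ...
	return nil
}

func main() {
	ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
	defer cancel()

	for i := 0; i < 100; i++ {
		if err := callBigQuery(ctx); err != nil {
			log.Fatalf("request %d: %v", i, err)
		}
	}
}
```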