Overview

Our event pipeline collects, stores, and aggregates event data from ChOps services. Event data can be any piece of information we want to collect for analysis or tracking. It is distinct from timeseries data, for which we use tsmon.

For Googlers, see the internal docs.

Exploring Tables

Lists of tables and their descriptions are available by project (e.g. infra or infra_internal) at doc/bigquery_tables.md. Table owners are responsible for updating those lists.

infra tables

Step 1: Create a BigQuery table

Table Organization

Tables are commonly identified by <project-id>.<dataset_id>.<table_id>.

BigQuery tables belong to datasets. Dataset IDs and table IDs should be underscore delimited, e.g. test_results.

For services that already have a corresponding Google Cloud project, create tables in that project under the dataset “events.” For other services, create a new GCP project.

Datasets can be created in the BigQuery web console.
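
If you prefer to script it, the sketch below shows one way to create the per-project “events” dataset with the BigQuery Python client; the project ID is a placeholder and this is an illustration, not a required step.

```python
# Sketch only: creating the per-project "events" dataset with the Python
# client. Requires a recent google-cloud-bigquery; the project ID is made up.
from google.cloud import bigquery

client = bigquery.Client(project='my-service-project')

# Dataset and table IDs are underscore delimited; tables are then addressed
# as <project-id>.<dataset_id>.<table_id>, e.g.
# my-service-project.events.test_results.
client.create_dataset('events', exists_ok=True)
```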

Rationale for per-project tables:

  • Each project may ACL its tables as it sees fit, and apply its own quota constraints to stay within budget.
  • Different GCP instances of the same application code (say, staging vs production for a given AppEngine app) may keep separate ACLs and retention policies for their logs so they don’t write over each other.

Creating and updating tables

Tables are defined by schemas, which are stored as .proto files. This gives us version control and lets us use protoc to generate language-specific code. Use bqschemaupdater to create new tables or modify existing tables in BigQuery. For now, this tool must be run manually. Run the Go environment setup script from infra.git:

eval `go/env.py`

This should install bqschemaupdater on your PATH. See the docs or run bqschemaupdater --help for more information.

Step 2: Send events to BigQuery

Once you have a table, you can send events to it!

Credentials

The following applies to non-GCP projects. Events sent from GCP projects to tables owned by the same project should just work.

You need to ensure that the machines running the event-sending code have the proper credentials. At this point you may need to enlist the help of a Chrome Operations Googler, as many of the following resources and repos are internal.

  1. Choose a service account. This may be an account that is already associated with the service, or a new one that you create.
  2. Give that service account the “BigQuery Data Editor” IAM role using the cloud console under “IAM & Admin” >> “IAM” in the chrome-infra-events project. You’ll need the proper privileges to do this. If you don’t have them, ask a Chrome Infrastructure team member for help.
  3. If you have created a new private key for an account, you'll need to add it to puppet. More info.
  4. Make sure the key file is available to your service. For CQ, this means passing the name of the credentials file to the service at startup. See CL.
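
As a rough illustration of steps 3 and 4, a non-GCP service with a key file on disk can build an authenticated BigQuery client as sketched below; the key path and project ID are placeholders, not the actual production wiring.

```python
# Sketch: turning a deployed service-account key file into an authenticated
# BigQuery client. The path and project ID below are placeholders.
from google.cloud import bigquery

client = bigquery.Client.from_service_account_json(
    '/path/to/service-account-key.json',  # wherever puppet places the key
    project='chrome-infra-events')        # the project that owns the table

# Hand this client to whichever upload helper you pick in the next section.
```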

How to Choose a Library

TLDR

Go: use go.chromium.org/luci/common/bq, example CL.

Python: use infra.libs.bqh, example CL, docs.

Options

How you instrument your code to add event logging depends on your needs, and there are a couple of options.

If you don’t need transactional integrity and prefer a simpler configuration, use bq.Uploader. This should be your default choice if you’re just starting out.

If you need ~8ms latency on inserts, or transactional integrity with datastore operations, use bqlog [TODO: update this link if/when bqlog moves out of tokenserver into a shared location].

Design trade-offs for using bq instead of bqlog: lower accuracy and precision. Some events may be duplicated in logs (say, if an operation that logs events has to be retried due to datastore contention). Intermittent failures in other supporting infrastructure may also cause events to be lost.

Design trade-offs for using bqlog instead of bq: you will have to enable task queues in your app if you haven’t already, and add a new cron task to your configuration. You will also not be able to use the bqschemaupdater tool (described above) to manage your logging schema code generation.

From Go: bq package

Package bq takes care of some boilerplate and makes it easy to add monitoring for uploads. It also takes care of adding insert IDs, which BigQuery uses to deduplicate rows. If you are not using bq.Uploader, check out bq.InsertIDGenerator.

With bq, you can construct a synchronous Uploader or asynchronous BatchUploader depending on your needs.

kitchen is an example of a tool that uses bq.

From Python: infra/libs/bqh

You will need the google-cloud-bigquery library in your environment. infra.git/ENV has this dependency already, so you only need to add it if you are working outside that environment.

Check out the [BigQuery helper module](../../infra/libs/bqh.py). Under the hood, it uses the BigQuery Python client. Use it rather than the client directly: it houses common logic for handling edge cases, formatting errors, and handling protobufs. You'll still have to provide an authenticated instance of google.cloud.bigquery.client.Client.

See this change for a simple example. (TODO: replace with a non-internal example that uses BigQueryHelper.) The API docs can also be helpful.
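
A rough sketch of what a bqh-based upload can look like is below; the send_rows call and the event_pb2 module are illustrative stand-ins, so check infra/libs/bqh.py for the helper's actual entry point and signature.

```python
# Illustrative only: the helper call and event proto are hypothetical
# stand-ins; see infra/libs/bqh.py for the real API.
from google.cloud import bigquery

from infra.libs import bqh   # the BigQuery helper module
import event_pb2             # hypothetical generated proto module

client = bigquery.Client(project='chrome-infra-events')  # must be authenticated

event = event_pb2.Event()    # populate fields per your table schema
event.name = 'build_completed'

# Hypothetical call: pass the authenticated client plus protobuf rows and let
# the helper handle edge cases, error formatting, and proto conversion.
bqh.send_rows(client, 'events', 'my_events', [event])
```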

From Python GAE:

  1. Use the message_to_dict function to convert a protobuf message to a dict.
  2. Use the BigQuery REST API to insert rows.
  3. Don't forget to include insert IDs in the request!
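
A minimal sketch of those three steps, assuming a generated event_pb2 module and standard google-auth credentials (the table coordinates are placeholders):

```python
# Sketch of the Python GAE path: protobuf -> dict -> tabledata.insertAll.
# event_pb2, the table coordinates, and the auth setup are assumptions.
import uuid

import google.auth
from google.auth.transport.requests import AuthorizedSession
from google.protobuf import json_format

import event_pb2  # hypothetical generated module for your event schema

INSERT_ALL_URL = (
    'https://bigquery.googleapis.com/bigquery/v2/projects/{project}'
    '/datasets/{dataset}/tables/{table}/insertAll')


def insert_event(event, project='chrome-infra-events',
                 dataset='events', table='my_events'):
  # 1. Convert the protobuf message to a dict BigQuery accepts as row JSON.
  row = json_format.MessageToDict(event, preserving_proto_field_name=True)
  # 2. Build the insertAll request body, and 3. include an insertId so
  # BigQuery can deduplicate rows if the request is retried.
  body = {
      'kind': 'bigquery#tableDataInsertAllRequest',
      'rows': [{'insertId': str(uuid.uuid4()), 'json': row}],
  }
  credentials, _ = google.auth.default(
      scopes=['https://www.googleapis.com/auth/bigquery'])
  session = AuthorizedSession(credentials)
  resp = session.post(
      INSERT_ALL_URL.format(project=project, dataset=dataset, table=table),
      json=body)
  resp.raise_for_status()
  return resp.json()
```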

Step 3: Analyze/Track/Graph Events

Joining tables from other projects

To execute a query that joins a table from a different cloud project, ensure the querying project's service account has BigQuery read permissions in the other project.
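
For example, a cross-project join might look like the sketch below; the project, dataset, table, and column names are made up.

```python
# Sketch of a cross-project join; all project/dataset/table/column names
# here are placeholders.
from google.cloud import bigquery

client = bigquery.Client(project='chrome-infra-events')

QUERY = """
SELECT a.build_id, b.duration
FROM `chrome-infra-events.events.builds` AS a
JOIN `other-project.events.test_results` AS b
  ON a.build_id = b.build_id
"""

# This fails with an access error unless the querying project's service
# account has BigQuery read permissions in other-project.
for row in client.query(QUERY).result():
  print(row)
```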

Other tools

Data Studio: for graphs, reports, and dashboards

Limits

BigQuery and Dataflow limits are documented alongside the tools for which they are relevant. Because many components of the pipeline make API requests to BigQuery, those request limits are documented here.

API request limits are per user; in our case, the user is most often a service account. Check the BigQuery docs for the most up-to-date limits. At the time of writing, there are limits on requests per second and on concurrent API requests. The client is responsible for ensuring it does not exceed these limits.
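
One common way to stay under those limits is to batch rows and throttle requests client-side, as in the sketch below; the numbers are placeholders, not current quotas.

```python
# Illustrative throttling sketch; the limits below are placeholders, not
# BigQuery's current quotas -- check the BigQuery docs for real values.
import time

MAX_REQUESTS_PER_SEC = 10  # placeholder request-rate budget
BATCH_SIZE = 500           # send many rows per request instead of one each


def upload_in_batches(rows, send_batch):
  """Sends `rows` in batches, pausing so requests stay under the rate budget.

  `send_batch` is whatever upload call your service uses (an insertAll
  request, bq.Uploader, the bqh helper, ...).
  """
  for i in range(0, len(rows), BATCH_SIZE):
    send_batch(rows[i:i + BATCH_SIZE])
    time.sleep(1.0 / MAX_REQUESTS_PER_SEC)
```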