# Overview
Our event pipeline collects, stores, and aggregates event data from ChOps
services. Event data can be any piece of information we want to collect for
analysis or tracking. It is distinct from timeseries data, for which we use
[tsmon](https://chrome-internal.googlesource.com/infra/infra_internal/+/master/doc/ts_mon.md).
For Googlers, see the [internal docs](https://chrome-internal.googlesource.com/infra/infra_internal/+/master/doc/event_pipeline.md).
[TOC]
# Exploring Tables
Lists of tables and their descriptions are available by project (e.g. infra or
infra_internal) at doc/bigquery_tables.md. Table owners are responsible for
updating those lists.
[infra tables](../bigquery_tables.md)
# Step 1: Create a BigQuery table
## Table Organization
Tables are commonly identified by `<project-id>.<dataset_id>.<table_id>`.
BigQuery tables belong to datasets. Dataset IDs and table IDs should be
underscore-delimited, e.g. `test_results`.
For services that already have a corresponding Google Cloud Project, tables
should be created in that project, under a dataset named `events`. For other
services, create a new GCP project.
Datasets can be created in the easy-to-use [console](https://bigquery.cloud.google.com).
Rationale for per-project tables:
* Each project may ACL its tables as it sees fit, and apply its own quota
constraints to stay within budget.
* Different GCP instances of the same application code (say, staging vs
production for a given AppEngine app) may keep separate ACLs and retention
policies for their logs so they don’t write over each other.
## Creating and updating tables
Tables are defined by schemas. Schemas are stored as .proto files, which gives
us version control and lets us use the protoc tool to generate language-specific
bindings. Use
[bqschemaupdater](https://chromium.googlesource.com/infra/luci/luci-go/+/master/tools/cmd/bqschemaupdater/README.md)
to create new tables or modify existing tables in BigQuery. Currently, this
tool must be run manually. Run the Go environment setup script from infra.git:

    eval `go/env.py`

This installs `bqschemaupdater` in your path. See the
[docs](https://chromium.googlesource.com/infra/luci/luci-go/+/master/tools/cmd/bqschemaupdater/README.md)
or run `bqschemaupdater --help` for more information.
# Step 2: Send events to BigQuery
Once you have a table, you can send events to it!
## Credentials
The following applies to non-GCP projects. Events sent from GCP projects to
tables owned by the same project should just work.
You need to ensure that the machines running the code that sends events have
the proper credentials. At this point, you may need to enlist the help of a
Chrome Operations Googler, as many of the following resources and repos are
internal.
1. Choose a [service
account](https://cloud.google.com/docs/authentication/#service_accounts).
This account may be a service account that is already associated with the
service, or it may be a new one that you create.
1. Give that service account the "BigQuery Data Editor" IAM role using the
[cloud console](https://console.cloud.google.com) under "IAM & Admin" >>
"IAM" in the `chrome-infra-events` project. You'll need the proper privileges
to do this. If you don't have them, ask a Chrome Infrastructure team member
for help.
1. If you have created a new private key for an account, you'll need to add it
to puppet. [More
info.](https://chrome-internal.googlesource.com/infra/puppet/+/master/README.md)
1. Make sure that file is available to your service. For CQ, this takes the form
of passing the name of the credentials file to the service on start. [See
CL.](https://chrome-internal-review.googlesource.com/c/405268/)
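As a rough illustration, once the key file is deployed, your service can build
an authenticated BigQuery client from it. This is a minimal sketch only: the key
path and project ID below are placeholders, and services running on GCP can
usually skip this and rely on application default credentials.

```python
# Minimal sketch: build an authenticated BigQuery client from a deployed
# service-account key. The key path and project ID are placeholders.
from google.cloud import bigquery
from google.oauth2 import service_account

KEY_PATH = '/creds/service_accounts/my-service-bigquery.json'  # placeholder path

credentials = service_account.Credentials.from_service_account_file(
    KEY_PATH, scopes=['https://www.googleapis.com/auth/bigquery'])
client = bigquery.Client(project='chrome-infra-events', credentials=credentials)
```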
## How to Choose a Library
### TLDR
Go: use
[go.chromium.org/luci/common/bq](https://godoc.org/go.chromium.org/luci/common/bq),
[example CL](https://chromium-review.googlesource.com/c/infra/infra/+/719962).
Python: use
[infra.libs.bqh](https://cs.chromium.org/chromium/infra/infra/libs/bqh.py),
[example CL](https://chrome-internal-review.googlesource.com/c/infra/infra_internal/+/445955),
[docs](https://chromium.googlesource.com/infra/infra/+/master/infra/libs/README.md).
### Options
How you instrument your code to add event logging depends on your needs, and
there are a couple of options.
If you don’t need transactional integrity and prefer a simpler configuration,
use [bq.Uploader](https://godoc.org/go.chromium.org/luci/common/bq#Uploader).
This should be your default choice if you’re just starting out.
If you need ~8ms latency on inserts, or transactional integrity with datastore
operations, use
[bqlog](https://godoc.org/go.chromium.org/luci/tokenserver/appengine/impl/utils/bqlog).
[TODO: update this link if/when bqlog moves out of tokenserver into a shared
location].
Design trade-offs for using bq instead of bqlog: lower
accuracy and precision. Some events may be duplicated in logs (say, if an
operation that logs events has to be retried due to datastore contention).
Intermittent failures in other supporting infrastructure may also cause events
to be lost.
Design trade-offs for using bqlog instead of bq: You will have to
enable task queues in your app if you haven’t already, and add a new cron task
to your configuration. You will also not be able to use the bqschemaupdater
tool (described above) to manage your logging schema code generation.
### From Go: bq package
Package [bq](https://godoc.org/go.chromium.org/luci/common/bq)
takes care of some boilerplate and makes it easy to add monitoring for uploads.
It also takes care of adding insert IDs, which BigQuery uses to deduplicate
rows. If you are not using
[bq.Uploader](https://godoc.org/go.chromium.org/luci/common/bq#Uploader),
check out
[bq.InsertIDGenerator](https://godoc.org/go.chromium.org/luci/common/bq#InsertIDGenerator).
With `bq`, you can construct a synchronous `Uploader` or asynchronous
`BatchUploader` depending on your needs.
[kitchen](../../go/src/infra/tools/kitchen/monitoring.go) is an example of a
tool that uses bq.
### From Python: infra/libs/bqh
You will need the
[google-cloud-bigquery](https://pypi.python.org/pypi/google-cloud-bigquery)
library in your environment. infra.git/ENV has this dependency already, so you
only need to add it if you are working outside that environment.
Check out the [BigQuery helper module](../../infra/libs/bqh.py). Under the hood,
it uses the [BigQuery Python
client](https://cloud.google.com/bigquery/docs/reference/libraries#client-libraries-usage-python).
It is recommended that you use it instead of the client directly, as it houses
common logic for handling edge cases, formatting errors, and handling
protobufs. You'll still have to provide an authenticated instance of
`google.cloud.bigquery.client.Client`.
See
[this change](https://chrome-internal-review.googlesource.com/c/407748/)
for a simple example. (TODO: replace with a non-internal example that uses
BigQueryHelper.) The [API
docs](https://googlecloudplatform.github.io/google-cloud-python/stable/bigquery-usage.html)
can also be helpful.
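To make the flow concrete, here is a minimal sketch. The generated proto module,
dataset, and table names are hypothetical, and `send_rows` is an assumed name
for the helper's entry point; check the module itself for the exact functions
and signatures.

```python
# Minimal sketch; `send_rows` is an assumed entry point, and the proto module,
# dataset, and table names are hypothetical.
from google.cloud import bigquery

from infra.libs import bqh       # import path may differ in your checkout
import my_event_pb2              # hypothetical protoc-generated module

client = bigquery.Client(project='chrome-infra-events')  # must be authenticated

event = my_event_pb2.MyEvent(name='build_started', duration_ms=1234)

# Assumed helper call: serializes the protobuf into a row and streams it to
# <project>.events.my_events, surfacing any insert errors.
bqh.send_rows(client, 'events', 'my_events', [event])
```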
### From Python GAE
1. Use the [message_to_dict](https://chromium.googlesource.com/infra/infra/+/fe875b1417d5d6a73999462b1001a2852ef6efb9/packages/infra_libs/infra_libs/bqh.py#24)
function to convert a protobuf message to a dict.
2. Use the [BigQuery REST API](https://cloud.google.com/bigquery/docs/reference/rest/v2/tabledata/insertAll)
to insert rows.
3. Don't forget to include insert IDs in the request!
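Putting the three steps together, a sketch might look like the following. It
assumes a generated proto module and hypothetical dataset/table names, and uses
google-api-python-client for the REST call.

```python
# Minimal sketch of the three steps above. The proto module, dataset, and table
# names are hypothetical.
import uuid

from googleapiclient import discovery

from infra_libs import bqh       # message_to_dict; import path may differ
import my_event_pb2              # hypothetical protoc-generated module


def insert_event(event):
  # Builds a BigQuery API client using the app's default credentials.
  service = discovery.build('bigquery', 'v2')
  row = {
      'insertId': str(uuid.uuid4()),       # step 3: insert ID for deduplication
      'json': bqh.message_to_dict(event),  # step 1: protobuf -> dict
  }
  # Step 2: stream the row via tabledata.insertAll.
  return service.tabledata().insertAll(
      projectId='chrome-infra-events',
      datasetId='events',
      tableId='my_events',
      body={'rows': [row]}).execute()
```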
# Step 3: Analyze/Track/Graph Events
## Joining tables from other projects
To execute a query which joins a table from a different cloud project, ensure
the querying project's service account has BigQuery read permissions in the
other project.
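For example, once permissions are in place, a cross-project join runs like any
other query. A minimal sketch, with hypothetical table and column names:

```python
# Minimal sketch of a cross-project join; the table and column names are
# hypothetical. The querying project's service account needs read access to
# `other-project`.
from google.cloud import bigquery

client = bigquery.Client(project='chrome-infra-events')

query = """
  SELECT e.build_id, r.result
  FROM `chrome-infra-events.events.my_events` AS e
  JOIN `other-project.events.test_results` AS r
    ON e.build_id = r.build_id
"""
for row in client.query(query).result():
  print(row.build_id, row.result)
```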
## Other tools
[Data Studio](http://datastudio.google.com): for graphs, reports, and dashboards
# Limits
BigQuery and Dataflow limits are documented for tools when they are relevant.
Because many components of the pipeline make API requests to BigQuery, those
limits are documented here.
API request limits are per user. In our case, the user is most often a service
account. Check the [BigQuery
docs](https://cloud.google.com/bigquery/quotas#apirequests) for the most
up-to-date limits. At the time of writing, there are limits on requests per
second and concurrent API requests. The client is responsible for ensuring it
does not exceed these limits.
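If your service emits events at high volume, batching rows into fewer insert
requests is a simple way to stay under the per-user request limits. A minimal
sketch, with illustrative thresholds rather than the documented quota values:

```python
# Minimal sketch of client-side batching/pacing; the thresholds are
# illustrative, not the documented quota values.
import time

MAX_ROWS_PER_REQUEST = 500          # illustrative batch size
MIN_SECONDS_BETWEEN_REQUESTS = 0.2  # illustrative pacing between API calls


def send_in_batches(send_fn, rows):
  """Sends rows in batches, pacing requests to respect per-user API limits."""
  for i in range(0, len(rows), MAX_ROWS_PER_REQUEST):
    send_fn(rows[i:i + MAX_ROWS_PER_REQUEST])
    time.sleep(MIN_SECONDS_BETWEEN_REQUESTS)
```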