This directory contains Apache Beam pipelines for exporting entities from chromeperf's Datastore into BigQuery tables.
This README briefly describes how to run and deploy those pipelines.
Script name | Entity type exported | Type of export |
---|---|---|
export_anomalies.py | Anomaly | Incremental (by day) |
export_rows.py | Row | Incremental (by day) |
export_jobs.py | Job | Incremental (by day) |
export_testmetadata.py | TestMetadata | Full |
Script name | Purpose |
---|---|
calc_stats.py | Calculate signal quality statistics from Rows export. |
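For orientation, here is a minimal, hedged sketch (not the actual pipeline code) of what an "Incremental (by day)" export looks like in Beam: read one day's worth of entities from Datastore and append the converted rows to a BigQuery table. The kind, property, and table names below are illustrative placeholders.

```python
import datetime

import apache_beam as beam
from apache_beam.io.gcp.datastore.v1new.datastoreio import ReadFromDatastore
from apache_beam.io.gcp.datastore.v1new.types import Query
from apache_beam.options.pipeline_options import PipelineOptions


def entity_to_row(entity):
  # Placeholder conversion; the real scripts map entity properties onto the
  # BigQuery schema explicitly.
  return dict(entity.properties)


# Export the single day ending now (the real pipelines take end_date/num_days
# parameters instead of computing the window like this).
end = datetime.datetime.utcnow()
start = end - datetime.timedelta(days=1)

with beam.Pipeline(options=PipelineOptions()) as p:
  _ = (
      p
      | 'ReadFromDatastore' >> ReadFromDatastore(
          Query(kind='Anomaly', project='chromeperf',
                filters=[('timestamp', '>=', start), ('timestamp', '<', end)]))
      | 'EntityToRow' >> beam.Map(entity_to_row)
      | 'WriteToBigQuery' >> beam.io.WriteToBigQuery(
          'chromeperf:chromeperf_dashboard_data.anomalies',
          create_disposition=beam.io.BigQueryDisposition.CREATE_NEVER,
          write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND))
```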
Follow the instructions at https://cloud.google.com/dataflow/docs/quickstarts/quickstart-python#set-up-your-environment to get a virtualenv with the Beam SDK installed.
In the virtualenv with the Beam SDK installed:
```
$ cd dashboard/
$ PYTHONPATH=$PYTHONPATH:"$(pwd)/bq_export" python bq_export/export_rows.py \
    --service_account_email=bigquery-exporter@chromeperf.iam.gserviceaccount.com \
    --runner=DataflowRunner \
    --region=us-central1 \
    --experiments=use_beam_bq_sink \
    --setup_file=bq_export/setup.py \
    --staging_location=gs://chromeperf-dataflow/staging \
    --template_location=gs://chromeperf-dataflow/templates/export_rows \
    --temp_location=gs://chromeperf-dataflow-temp/export-rows-daily
```
```
$ PYTHONPATH=$PYTHONPATH:"$(pwd)/bq_export" python \
    bq_export/export_anomalies.py \
    --service_account_email=bigquery-exporter@chromeperf.iam.gserviceaccount.com \
    --runner=DataflowRunner \
    --region=us-central1 \
    --experiments=use_beam_bq_sink \
    --setup_file=bq_export/setup.py \
    --staging_location=gs://chromeperf-dataflow/staging \
    --template_location=gs://chromeperf-dataflow/templates/export_anomalies \
    --temp_location=gs://chromeperf-dataflow-temp/export-anomalies-daily
```
```
$ PYTHONPATH=$PYTHONPATH:"$(pwd)/bq_export" python \
    bq_export/export_jobs.py \
    --service_account_email=bigquery-exporter@chromeperf.iam.gserviceaccount.com \
    --runner=DataflowRunner \
    --region=us-central1 \
    --experiments=use_beam_bq_sink \
    --setup_file=bq_export/setup.py \
    --staging_location=gs://chromeperf-dataflow/staging \
    --template_location=gs://chromeperf-dataflow/templates/export_jobs \
    --temp_location=gs://chromeperf-dataflow-temp/export-jobs-daily
```
```
$ PYTHONPATH=$PYTHONPATH:"$(pwd)/bq_export" python \
    bq_export/export_testmetadata.py \
    --service_account_email=bigquery-exporter@chromeperf.iam.gserviceaccount.com \
    --runner=DataflowRunner \
    --region=us-central1 \
    --experiments=use_beam_bq_sink \
    --setup_file=bq_export/setup.py \
    --staging_location=gs://chromeperf-dataflow/staging \
    --template_location=gs://chromeperf-dataflow/templates/export_testmetadata \
    --temp_location=gs://chromeperf-dataflow-temp/export-testmetadata-daily
```
```
$ PYTHONPATH=$PYTHONPATH:"$(pwd)/bq_export" python \
    bq_export/delete_upload_tokens.py \
    --service_account_email=bigquery-exporter@chromeperf.iam.gserviceaccount.com \
    --runner=DataflowRunner \
    --region=us-central1 \
    --experiments=use_beam_bq_sink \
    --setup_file=bq_export/setup.py \
    --staging_location=gs://chromeperf-dataflow/staging \
    --template_location=gs://chromeperf-dataflow/templates/delete_upload_tokens \
    --temp_location=gs://chromeperf-dataflow-temp/delete-upload-tokens-tmp
```
There are Cloud Scheduler jobs configured to run gs://chromeperf-dataflow/templates/export_anomalies, gs://chromeperf-dataflow/templates/export_rows, gs://chromeperf-dataflow/templates/export_jobs, and gs://chromeperf-dataflow/templates/export_testmetadata once every day, so updating those job templates is all that is required to update the daily runs. See the Cloud Scheduler jobs at https://console.cloud.google.com/cloudscheduler?project=chromeperf.
Tip: You can also use the “RUN NOW” buttons on the Cloud Scheduler console page to manually re-run a daily job.
You can execute one-off jobs with the gcloud tool. For example:
```
$ gcloud dataflow jobs run export-anomalies-backfill \
    --service-account-email=bigquery-exporter@chromeperf.iam.gserviceaccount.com \
    --gcs-location=gs://chromeperf-dataflow/templates/export_anomalies \
    --disable-public-ips \
    --max-workers=10 \
    --region=us-central1 \
    --staging-location=gs://chromeperf-dataflow-temp/export_anomalies \
    --subnetwork=regions/us-central1/subnetworks/dashboard-batch
```
To execute a manual backfill, specify the end_date and/or num_days parameters. For example, this will regenerate the anomalies for December 2019:
```
$ gcloud dataflow jobs run export-anomalies-backfill \
    --service-account-email=bigquery-exporter@chromeperf.iam.gserviceaccount.com \
    --gcs-location=gs://chromeperf-dataflow/templates/export_anomalies \
    --disable-public-ips \
    --max-workers=10 \
    --region=us-central1 \
    --staging-location=gs://chromeperf-dataflow-temp/export_anomalies \
    --subnetwork=regions/us-central1/subnetworks/dashboard-batch \
    --parameters=end_date=20191231,num_days=31
```
Example for row backfill:
```
$ gcloud dataflow jobs run export-rows-backfill \
    --service-account-email=bigquery-exporter@chromeperf.iam.gserviceaccount.com \
    --gcs-location=gs://chromeperf-dataflow/templates/export_rows \
    --disable-public-ips \
    --max-workers=70 \
    --region=us-central1 \
    --staging-location=gs://chromeperf-dataflow-temp/export-rows-daily \
    --subnetwork=regions/us-central1/subnetworks/dashboard-batch \
    --parameters=end_date=20230710,num_days=1 \
    --worker-machine-type=e2-standard-4
```
Due to the amount of data this job handles, it requires a more powerful worker machine and more concurrent workers.
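A minimal sketch of the assumed end_date/num_days semantics (a window of num_days days ending on end_date, inclusive), matching the December 2019 example above:

```python
import datetime


def export_window(end_date, num_days):
  """Returns the (start, end) dates covered by an end_date/num_days backfill."""
  end = datetime.datetime.strptime(end_date, '%Y%m%d').date()
  start = end - datetime.timedelta(days=num_days - 1)
  return start, end


# end_date=20191231, num_days=31 covers 2019-12-01 through 2019-12-31.
print(export_window('20191231', 31))
```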
Tips:
When testing changes to the pipelines, add table_suffix=_test to the parameters to write to the anomalies_test or rows_test tables rather than the real tables. Alternatively, change the dataset parameter to something like dataset=chromeperf_dashboard_data_test.
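For illustration only, the effect of table_suffix can be thought of as appending a suffix to the destination table name (a hypothetical helper; the actual option handling lives in the export scripts):

```python
def table_spec(project, dataset, table, table_suffix=''):
  # Hypothetical helper: builds a BigQuery destination such as
  # chromeperf:chromeperf_dashboard_data.anomalies_test for table_suffix=_test.
  return '{}:{}.{}{}'.format(project, dataset, table, table_suffix)


print(table_spec('chromeperf', 'chromeperf_dashboard_data', 'anomalies', '_test'))
```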
You can also use the REST API instead of gcloud. See "Using the REST API" in the Cloud Dataflow docs, and the Cloud Scheduler jobs for some examples.
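As a hedged sketch (using google-api-python-client, which is not part of this directory, and application default credentials), launching a template via the projects.locations.templates.launch method looks roughly like this:

```python
from googleapiclient.discovery import build

# Assumes application default credentials with permission to launch jobs.
dataflow = build('dataflow', 'v1b3')
response = dataflow.projects().locations().templates().launch(
    projectId='chromeperf',
    location='us-central1',
    gcsPath='gs://chromeperf-dataflow/templates/export_anomalies',
    body={
        'jobName': 'export-anomalies-backfill',
        'parameters': {'end_date': '20191231', 'num_days': '31'},
        'environment': {
            'serviceAccountEmail':
                'bigquery-exporter@chromeperf.iam.gserviceaccount.com',
            'subnetwork': 'regions/us-central1/subnetworks/dashboard-batch',
            'maxWorkers': 10,
        },
    },
).execute()
print(response['job']['id'])
```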
In the virtualenv with the Beam SDK you can run a job without creating a template by specifying all the job execution parameters and omitting the template parameters (i.e. omit --staging_location and --template_location).
For example:
```
$ PYTHONPATH=$PYTHONPATH:"$(pwd)/bq_export" python bq_export/export_rows.py \
    --service_account_email=bigquery-exporter@chromeperf.iam.gserviceaccount.com \
    --runner=DataflowRunner \
    --region=us-central1 \
    --temp_location=gs://chromeperf-dataflow-temp/export_rows_temp \
    --experiments=use_beam_bq_sink \
    --experiments=shuffle_mode=service \
    --autoscaling_algorithm=NONE \
    --num_workers=70 \
    --setup_file=bq_export/setup.py \
    --no_use_public_ips \
    --subnetwork=regions/us-central1/subnetworks/dashboard-batch \
    --dataset=chromeperf_dashboard_data
```