Use beam.io.Write

Use this class instead of beam.io.iobase.Write.

Both can take a Sink, but beam.io.iobase.Write interferes with the
BigQuerySink write_disposition WRITE_TRUNCATE, which is supposed to
truncate the table before writing to it.

Without this change, we continually append to the table, resulting in
many duplicate rows.
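
For reference, a minimal sketch of the intended write step, assuming a
Python Beam pipeline; the table, dataset, and step label below are
hypothetical placeholders, not this pipeline's real configuration:

    import apache_beam as beam

    def write_rows(rows):
      # Hypothetical table/dataset names used only for illustration.
      sink = beam.io.BigQuerySink(
          table='cq_attempts',
          dataset='events',
          project='chrome-infra-events',
          write_disposition=beam.io.BigQueryDisposition.WRITE_TRUNCATE)
      # beam.io.Write honors WRITE_TRUNCATE; beam.io.iobase.Write kept
      # appending to the table instead, producing duplicate rows.
      return rows | 'write to BigQuery' >> beam.io.Write(sink)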

Bug:615218
Change-Id: I9e7d3f830ed1e2bff7393ef22d608c73fe358a70
Reviewed-on: https://chromium-review.googlesource.com/620888
Commit-Queue: Katie Thomas <katthomas@google.com>
Reviewed-by: Sergiy Byelozyorov <sergiyb@chromium.org>

Cr-Mirrored-From: https://chromium.googlesource.com/infra/infra
Cr-Mirrored-Commit: 271e354cadef4c4bd721059b2383b08f249e5fd2
README.md

Testing

To test that your pipeline will run remotely:

python <path-to-dataflow-job> --job_name <pick-a-job-name> \
--project chrome-infra-events --runner DataflowRunner --setup_file \
<infra-checkout-path>/packages/dataflow/setup.py \
--staging_location gs://dataflow-chrome-infra-events/staging --temp_location \
gs://dataflow-chrome-infra-events/temp --save_main_session

Job names should match the regular expression [a-z]([-a-z0-9]{0,38}[a-z0-9]). Navigate to the Dataflow console in your browser (project: chrome-infra-events) and you should see your job running. Wait until it succeeds.
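
For example, a quick sanity check of a candidate job name against that
pattern (the name below is made up purely for illustration):

    import re

    JOB_NAME_RE = r'[a-z]([-a-z0-9]{0,38}[a-z0-9])$'

    # 'cq-attempts-test-run' is a hypothetical job name.
    assert re.match(JOB_NAME_RE, 'cq-attempts-test-run')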