| commit    | d4dabf34e6d2d3842a681756a01198920d44e2e0 | [log] [tgz] |
|-----------|------------------------------------------|-------------|
| author    | Katie Thomas <katthomas@google.com> | Fri Aug 18 15:08:54 2017 |
| committer | Commit Bot <commit-bot@chromium.org> | Fri Aug 18 21:24:00 2017 |
| tree      | bd80ed1fd0b736759049e5e53bc7f325e32f09db | |
| parent    | c0013ec17f40bfad466ce1699ba050a1ad22061d [diff] | |
Use beam.io.Write

Use this class instead of beam.io.iobase.Write. Both can take a Sink, but for some reason beam.io.iobase.Write interferes with the BigQuerySink write_disposition WRITE_TRUNCATE, which truncates the table before writing to it. Without this change, we continually append to the table, resulting in many duplicate rows.

Bug: 615218
Change-Id: I9e7d3f830ed1e2bff7393ef22d608c73fe358a70
Reviewed-on: https://chromium-review.googlesource.com/620888
Commit-Queue: Katie Thomas <katthomas@google.com>
Reviewed-by: Sergiy Byelozyorov <sergiyb@chromium.org>
Cr-Mirrored-From: https://chromium.googlesource.com/infra/infra
Cr-Mirrored-Commit: 271e354cadef4c4bd721059b2383b08f249e5fd2
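The fix described above amounts to swapping which transform wraps the sink. A minimal sketch of what such a change looks like in a Beam Python pipeline (the pipeline step, table name, and sink arguments here are illustrative assumptions, not taken from this commit's diff):

```diff
-    | beam.io.iobase.Write(
+    | beam.io.Write(
         beam.io.BigQuerySink(
             'chrome-infra-events:dataset.table',  # hypothetical table
             write_disposition=beam.io.BigQueryDisposition.WRITE_TRUNCATE))
```

With beam.io.Write, WRITE_TRUNCATE behaves as documented: the destination table is truncated once before the write, so re-running the job replaces the rows instead of appending duplicates.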
To test that your pipeline will run remotely:
```
python <path-to-dataflow-job> --job_name <pick-a-job-name> \
  --project chrome-infra-events --runner DataflowRunner \
  --setup_file <infra-checkout-path>/packages/dataflow/setup.py \
  --staging_location gs://dataflow-chrome-infra-events/staging \
  --temp_location gs://dataflow-chrome-infra-events/temp \
  --save_main_session
```
Job names should match the regular expression `[a-z]([-a-z0-9]{0,38}[a-z0-9])`. Navigate to the Dataflow console in your browser (project: chrome-infra-events) and you should see your job running. Wait until it succeeds.
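To sanity-check a candidate job name against that pattern before launching, a quick stdlib check can help (the helper name below is ours, not part of the Dataflow tooling):

```python
import re

# Pattern quoted above: a lowercase letter, then up to 39 more characters
# drawn from [-a-z0-9], ending in a lowercase letter or digit.
JOB_NAME_RE = re.compile(r'[a-z]([-a-z0-9]{0,38}[a-z0-9])')


def is_valid_job_name(name):
    """Return True if `name` matches the Dataflow job-name pattern in full."""
    return JOB_NAME_RE.fullmatch(name) is not None
```

For example, `my-test-job-1` passes, while names with uppercase letters, a leading digit, or a trailing hyphen are rejected.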