Fix race condition (COMPLETED/BOT_DIED) in task_runner.

Per b/69462084 and a TODO comment in this file, a race condition
exists when an external process polls the status of a task that
fails with BOT_DIED: there is a short window in which the task is
marked as COMPLETED before being updated to BOT_DIED. This change
sends the task update without the exit code (to prevent the
COMPLETED status from being set), and then sends a task_error
message. The default value of must_signal_internal_failure is also
changed to provide a default error in case run_isolated fails.
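
A minimal sketch of the default-error pattern described above, not the
actual run_isolated code: the result starts out with a pessimistic
internal_failure message, which is cleared only once the command has
actually been run. The helper name run_with_pessimistic_default, the
trimmed-down result dict, and the use of subprocess.call are
assumptions made for illustration.

```python
import subprocess
import sys
import time


def run_with_pessimistic_default(command):
    """Hypothetical helper: runs `command` and returns a result dict."""
    result = {
        'duration': None,
        'exit_code': None,
        # Assume an internal failure until the command has really run.
        'internal_failure': 'run did not complete properly',
    }
    try:
        start = time.time()
        try:
            result['exit_code'] = subprocess.call(command)
        finally:
            result['duration'] = max(time.time() - start, 0)
        # The command ran. Even a non-zero exit code is a task failure,
        # not an internal error, so clear the pessimistic default.
        result['internal_failure'] = None
    except Exception as e:
        # The command could not be run at all: report an internal error.
        result['internal_failure'] = str(e)
    return result


# A command that runs but fails is not an internal failure.
ok = run_with_pessimistic_default(
    [sys.executable, '-c', 'import sys; sys.exit(3)'])

# A command that cannot be started is reported as an internal failure.
bad = run_with_pessimistic_default(['/nonexistent/binary-for-illustration'])
```

With this ordering, any path that leaves the function without reaching
the "clear" line still carries a non-empty internal_failure, so the
caller never mistakes an incomplete run for a clean one.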
TESTED=Ran unit tests in task_runner_test and they pass.
Any other tests?
Change-Id: I1db3e15ccbdac3da9181b273681180211b07841c
Reviewed-on: https://chromium-review.googlesource.com/834397
Commit-Queue: Marc-Antoine Ruel <maruel@chromium.org>
Reviewed-by: Marc-Antoine Ruel <maruel@chromium.org>
Cr-Mirrored-From: https://chromium.googlesource.com/infra/luci/luci-py
Cr-Mirrored-Commit: 0b027452e658080df1f174c403946914443d2aa6
diff --git a/run_isolated.py b/run_isolated.py
index 0cdf3a6..96c2b0e 100755
--- a/run_isolated.py
+++ b/run_isolated.py
@@ -583,7 +583,7 @@
'duration': None,
'exit_code': None,
'had_hard_timeout': False,
- 'internal_failure': None,
+ 'internal_failure': 'run_isolated did not complete properly',
'stats': {
# 'isolated': {
# 'cipd': {
@@ -704,6 +704,10 @@
command, cwd, env, data.hard_timeout, data.grace_period)
finally:
result['duration'] = max(time.time() - start, 0)
+
+ # We successfully ran the command, set internal_failure back to
+ # None (even if the command failed, it's not an internal error).
+ result['internal_failure'] = None
except Exception as e:
# An internal error occurred. Report accordingly so the swarming task will
# be retried automatically.