The LUCI Scheduler Service periodically makes URL fetches and runs Swarming tasks or DM quests. It uses luci-config to fetch per-project lists of cron jobs. It tries to prevent concurrent execution of job invocations (i.e. an invocation will not start if the previous one is still running).
It's built on top of the App Engine Task Queue service.
To reduce confusion:
`cron.yaml`).

A job's state is stored in datastore in a separate entity group. It is updated by a chain of Task Queue tasks. The lifecycle of a job:
- The job starts in the `SCHEDULED` state, and the first `TickLater` task is scheduled to run at some time in the future (based on the job's schedule).
- `TickLater` runs. It transactionally updates the job's state to `QUEUED`, schedules the next `TickLater` task, and enqueues a `StartInvocation` task.
- The `StartInvocation` task launches the invocation (starts a URL fetch, Swarming task, etc.) and moves the job to the `RUNNING` state. Once the invocation is finished, the job moves back to the `SCHEDULED` state.

See `statemachine.go` for a complete description of all states.
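
The lifecycle above can be sketched as a small in-memory model. This is only an illustration, not the code in `statemachine.go`: the names `JobModel`, `OnTickLater`, `OnStartInvocation`, and `OnInvocationFinished` are hypothetical, the `Enqueued` slice stands in for Task Queue, and the choice to skip a tick while an invocation is queued or running is an assumption based on the "no concurrent invocations" goal.

```go
package main

import "fmt"

// JobState names the three states mentioned in this README.
// The real service defines more states in statemachine.go.
type JobState string

const (
	StateScheduled JobState = "SCHEDULED"
	StateQueued    JobState = "QUEUED"
	StateRunning   JobState = "RUNNING"
)

// JobModel is a hypothetical in-memory stand-in for the datastore entity.
type JobModel struct {
	State    JobState
	Enqueued []string // names of tasks that would be sent to Task Queue
}

// OnTickLater models a TickLater task. It always schedules the next tick so
// the chain continues, and enqueues StartInvocation only if the job is idle.
func (j *JobModel) OnTickLater() {
	j.Enqueued = append(j.Enqueued, "TickLater")
	if j.State != StateScheduled {
		// Assumption: a tick that arrives while an invocation is queued or
		// running is skipped, so invocations never overlap.
		return
	}
	j.State = StateQueued
	j.Enqueued = append(j.Enqueued, "StartInvocation")
}

// OnStartInvocation models a StartInvocation task: QUEUED -> RUNNING.
func (j *JobModel) OnStartInvocation() error {
	if j.State != StateQueued {
		return fmt.Errorf("cannot start invocation in state %s", j.State)
	}
	j.State = StateRunning
	return nil
}

// OnInvocationFinished models invocation completion: RUNNING -> SCHEDULED.
func (j *JobModel) OnInvocationFinished() error {
	if j.State != StateRunning {
		return fmt.Errorf("cannot finish invocation in state %s", j.State)
	}
	j.State = StateScheduled
	return nil
}
```

In the real service each transition is a datastore transaction on the job's entity group; the model above only captures the ordering of states.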
The LUCI Scheduler Service relies on two GAE subsystems: Datastore Service and Task Queue Service. There are some associated concerns:
- Datastore may be partially unavailable, so some writes fail.
- Task Queue may skip a task. Because `TickLater` tasks are chained, a single skipped task may stop processing of a job forever.

The datastore partial availability problem is tricky because a naive implementation may retry `StartInvocation` tasks after failed datastore writes and accidentally launch many invocations instead of one. Imagine the service scheduling a storm of Swarming tasks or DM quests and overloading the entire infrastructure, just because its datastore is having a bad day.
To work around datastore partial availability, the service always writes something to datastore before sending external requests. That way, if datastore is having issues, they are detected before an external service is hit.
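
A minimal sketch of this write-first ordering follows. It is not the service's actual code: the `InvocationStore` and `Launcher` interfaces, the `StartInvocation` function signature, and the `FakeStore` and `CountingLauncher` test doubles are all hypothetical names introduced for illustration.

```go
package main

import (
	"errors"
	"fmt"
)

// ErrDatastore is a hypothetical error representing a failed datastore write.
var ErrDatastore = errors.New("datastore unavailable")

// InvocationStore is a hypothetical abstraction over the datastore write.
type InvocationStore interface {
	RecordStart(jobID string) error
}

// Launcher is a hypothetical abstraction over the external request
// (URL fetch, Swarming task, and so on).
type Launcher interface {
	Launch(jobID string) error
}

// StartInvocation writes to the store first and contacts the external
// service only if that write succeeded. A retried task therefore cannot
// launch anything while the datastore is failing.
func StartInvocation(store InvocationStore, launcher Launcher, jobID string) error {
	if err := store.RecordStart(jobID); err != nil {
		return fmt.Errorf("not launching %q: %w", jobID, err)
	}
	return launcher.Launch(jobID)
}

// FakeStore fails every write while Down is true.
type FakeStore struct{ Down bool }

func (s *FakeStore) RecordStart(string) error {
	if s.Down {
		return ErrDatastore
	}
	return nil
}

// CountingLauncher counts how many external launches happened.
type CountingLauncher struct{ Launched int }

func (l *CountingLauncher) Launch(string) error {
	l.Launched++
	return nil
}
```

With `FakeStore{Down: true}`, any number of retries of `StartInvocation` leaves `CountingLauncher.Launched` at zero, which is the property the paragraph above describes.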
To work around the task queue issue, the service uses a “watchdog” GAE cron task:
- When a `TickLater` or `StartInvocation` task is being scheduled, the cron job entity is updated with `WatchdogTimerTs`: a timestamp in the future by which the task should already be finished.
- When `TickLater` or `StartInvocation` tasks run when expected, they move that timestamp further into the future.
- TODO: A separate “watchdog” GAE cron job runs once per minute, fetches from datastore all jobs that have `WatchdogTimerTs` less than `Now()`, and repairs their state (by launching another `TickLater` task) or at least reports them.
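
The watchdog idea can be sketched as follows. Since the periodic check is still a TODO, this is only one possible shape under stated assumptions: `WatchedJob`, `Arm`, `Overdue`, and `DefaultGrace` are hypothetical names, only the `WatchdogTimerTs` field comes from the text above, and a plain slice stands in for the datastore query.

```go
package main

import "time"

// DefaultGrace is a hypothetical allowance for how long a scheduled task
// may take before the job is considered stuck.
const DefaultGrace = 2 * time.Minute

// WatchedJob is a simplified stand-in for the cron job entity.
type WatchedJob struct {
	ID              string
	WatchdogTimerTs time.Time // deadline by which the pending task should be done
}

// Arm sets or moves the watchdog deadline. It would be called when a
// TickLater or StartInvocation task is scheduled, and again when it runs.
func (j *WatchedJob) Arm(now time.Time, grace time.Duration) {
	j.WatchdogTimerTs = now.Add(grace)
}

// Overdue returns the jobs whose WatchdogTimerTs is earlier than now.
// A watchdog cron handler could repair or report these jobs.
func Overdue(jobs []*WatchedJob, now time.Time) []*WatchedJob {
	var stuck []*WatchedJob
	for _, j := range jobs {
		if j.WatchdogTimerTs.Before(now) {
			stuck = append(stuck, j)
		}
	}
	return stuck
}
```

A once-per-minute handler would call `Overdue(jobs, time.Now())` and, for each returned job, launch another `TickLater` task or at least log the job's ID.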