Lightweight distribution server with a legion of dumb bots.
One line description: Running all tests concurrently, without being impacted by network or device flakiness.
The server runs on AppEngine and is the only communication point for any client or bot that wants to use Swarming. Clients send task requests to the server and receive a task ID. It's up to the client to poll the server to know when a task is done.
The server uses its DB to store the tasks, the bots' state and stdout/stderr from the tasks. It exposes the web frontend UI, client JSON REST API and bot JSON API. It uses OAuth2 (or optionally the IP allowlist) to authenticate clients.
Each task request has a set of dimensions associated with it. The server matches the request's dimensions to find a bot that has the same set of dimensions.
The server serializes all its state in the DB. There is no continuously running thread. This makes upgrading the server and rolling back trivial. When the bot code is affected by the server upgrade, the bots seamlessly upgrade after their currently running task completes.
The client API is implemented via Cloud Endpoints over HTTPS.
All the APIs are idempotent: retrying a call in case of error is always safe. This permits transparent retry in case of failure. The exception is /tasks/new, which creates a new task: if the call fails, it may leave an orphan task.
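Because the calls are idempotent, a client can wrap them in a generic retry loop. The following is a minimal sketch, not the actual client code; call_with_retry and its parameters are hypothetical names chosen for illustration.

```python
import time


def call_with_retry(call, attempts=3, base_delay=1.0):
    """Invokes call() until it succeeds or the attempts are exhausted.

    Hypothetical helper: only safe for idempotent calls, so it should not
    wrap /tasks/new, which may leave an orphan task when retried.
    """
    last_error = None
    for attempt in range(attempts):
        try:
            return call()
        except IOError as e:
            last_error = e
            # Exponential backoff between attempts.
            time.sleep(base_delay * (2 ** attempt))
    raise last_error
```

Any callable that raises IOError on a transient failure can be passed in; the last error is re-raised once the attempts run out.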
The bot API is an implementation detail and doesn't provide any compatibility guarantee.
The server has a few major configuration points;
Each task is represented by a TaskRequest and a TaskProperties described in task_request.py. The TaskRequest represents the metadata about a task: who, when, expiration timestamp, etc. The TaskProperties contains the actual details to run a task: commands, environment variables, execution timeout, etc. This separation makes it possible to dedupe the task requests when the exact same .isolated file is run multiple times, so that task deduplication can eventually be implemented.
A task also has a TaskResultSummary to describe its pending state and a tiny TaskToRunShard entity for the actual scheduling. They are respectively defined in task_result.py and task_to_run.py.
The task ID is the milliseconds since epoch plus low order random bits and the last byte set to 0. The last byte is used to differentiate between each try.
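The layout described above can be sketched as bit packing. This is only an illustration: the 16-bit width of the random part is an assumption, and new_task_id and with_try_number are hypothetical names, not functions from the server code.

```python
import random
import time

RANDOM_BITS = 16  # Assumed width of the low order random bits.
TRY_BITS = 8      # The last byte, used to tell tries apart.


def new_task_id(now_ms=None):
    """Packs milliseconds since epoch, random bits and a zero last byte."""
    if now_ms is None:
        now_ms = int(time.time() * 1000)
    rand = random.getrandbits(RANDOM_BITS)
    return (now_ms << (RANDOM_BITS + TRY_BITS)) | (rand << TRY_BITS)


def with_try_number(task_id, try_number):
    """Returns the ID of a specific try by setting the last byte."""
    if not 0 <= try_number < (1 << TRY_BITS):
        raise ValueError("try_number does not fit in the last byte")
    return (task_id & ~0xFF) | try_number
```

With this packing, IDs sort by creation time first, and every try of a task shares all bits except the last byte.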
The server implements a priority queue based on the creation timestamp of the request. The priority is a value from 0 to 255, where a lower value means a higher priority. The priority enforces ordering: higher priority (lower value) tasks are run first. Tasks with the same priority are then run in either FIFO or LIFO order, depending on the server's configuration.
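As a rough sketch of this ordering (not the server's actual data structure), a heap keyed on the priority and the creation timestamp gives the same result. The names sort_key and drain are hypothetical.

```python
import heapq


def sort_key(priority, created_ts, lifo=False):
    """Lower priority value first, then oldest (FIFO) or newest (LIFO)."""
    return (priority, -created_ts if lifo else created_ts)


def drain(tasks, lifo=False):
    """Returns task names in the order they would be run.

    tasks is a list of (name, priority, created_ts) tuples.
    """
    heap = [(sort_key(p, ts, lifo), name) for name, p, ts in tasks]
    heapq.heapify(heap)
    order = []
    while heap:
        order.append(heapq.heappop(heap)[1])
    return order
```

Negating the timestamp is enough to switch between FIFO and LIFO without changing how priorities compare.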
Technically speaking, it's possible to do more elastic priority scheduling, such as having old pending requests slowly gain priority over time, but the code to implement this was not committed since there was no immediate need.
When a bot polls the server for work, the server assigns the first available matching task.
Matching is done via the dimensions of the request vs the dimensions of the bot. The bot must have all the dimensions listed on the request to be given the task. For example, it could be “os=Windows-Vista-SP2; gpu=15ad:0405”.
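The matching rule can be sketched as a subset check. This assumes the request carries one value per dimension key and the bot publishes a list of values per key; bot_matches is a hypothetical name.

```python
def bot_matches(request_dims, bot_dims):
    """True if the bot has every dimension listed on the request.

    request_dims: dict of key -> value required by the task.
    bot_dims: dict of key -> list of values published by the bot.
    """
    return all(
        value in bot_dims.get(key, ())
        for key, value in request_dims.items())
```

A bot may publish extra dimensions that the request does not mention; only the request's dimensions are checked.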
To make the process efficient, the dimensions are MD5 hashed and only the first 32 bits are saved so integer comparison can be used. This greatly reduces the size of the hot TaskToRunShard entities used for task scheduling and the amount of memory necessary on the frontend server.
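A minimal sketch of such a hash follows. The way the dimensions are serialized before hashing and the byte order are assumptions made for illustration; dimensions_hash is a hypothetical name.

```python
import hashlib
import struct


def dimensions_hash(dims):
    """Returns a 32-bit integer derived from the MD5 of the dimensions.

    The "key:value" line serialization is an assumption; any stable,
    order-independent encoding would work the same way.
    """
    data = "\n".join(
        "%s:%s" % (k, v) for k, v in sorted(dims.items())).encode("utf-8")
    digest = hashlib.md5(data).digest()
    # Keep only the first 4 bytes (32 bits) as an unsigned integer.
    return struct.unpack("<I", digest[:4])[0]
```

Since only 32 bits are kept, two different dimension sets can collide, so an integer match presumably still needs to be confirmed against the full dimensions.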
Once a bot is assigned to a task, a TaskRunResult is created. If the task is actually a retry, multiple TaskRunResult can be created for a single TaskRequest.
During execution, the bot streams back the stdout and a heartbeat over HTTPS requests every 10 seconds. This removes the need for stable long-lived network connectivity, as a failing HTTPS POST will simply be retried.
Swarming distributes tasks but it doesn't care much about the task itself. A task is marked as COMPLETED_SUCCESS when the exit code is 0.
If a task stops being updated by its bot for 20 minutes (this can be configured by bot_ping_tolerance_secs in the task request properties), a cron job will abort the task with BOT_DIED.
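The check performed by such a cron job can be sketched as follows. The record layout (plain dicts with a last_update_ts field) and the name find_dead_tasks are assumptions for illustration, not the server's entities.

```python
def find_dead_tasks(running, now, default_tolerance_secs=20 * 60):
    """Returns the IDs of running tasks whose bot stopped sending updates.

    running: list of dicts with "task_id", "last_update_ts" and an
    optional "bot_ping_tolerance_secs" override (hypothetical layout).
    now: current time in seconds, on the same clock as last_update_ts.
    """
    dead = []
    for task in running:
        tolerance = task.get(
            "bot_ping_tolerance_secs", default_tolerance_secs)
        if now - task["last_update_ts"] > tolerance:
            dead.append(task["task_id"])
    return dead
```

Each returned task would then be marked as BOT_DIED.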
If any part of the scheduling, execution or processing of results fails, this is considered an infrastructure failure.
If a task is marked as idempotent, e.g. -idempotent is used, the client certifies that the task does not have side effects. This means that running the task twice shall return the same results (pending flakiness).
The way it works internally is by calculating the SHA256 of TaskProperties when the task is marked as idempotent. When the TaskResultSummary of an idempotent task succeeds, it sets a property to indicate that its results can be reused.
When a new request comes in, the server looks for a TaskResultSummary that has properties_hash set. If it finds one, the results are reused as-is and served to the client immediately, without ever scheduling a task.
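The deduplication flow can be sketched with an in-memory lookup table. This is an illustration under stated assumptions: the JSON serialization used before hashing and the ResultCache class are hypothetical, and the real server stores the hash on its DB entities instead.

```python
import hashlib
import json


def properties_hash(properties):
    """SHA256 of a canonical JSON encoding of the task properties."""
    blob = json.dumps(
        properties, sort_keys=True, separators=(",", ":")).encode("utf-8")
    return hashlib.sha256(blob).hexdigest()


class ResultCache:
    """Maps the hash of idempotent properties to a successful result."""

    def __init__(self):
        self._by_hash = {}

    def record_success(self, properties, result):
        # Only successful idempotent tasks make their results reusable.
        if properties.get("idempotent"):
            self._by_hash[properties_hash(properties)] = result

    def lookup(self, properties):
        """Returns a reusable result, or None if a task must be scheduled."""
        if not properties.get("idempotent"):
            return None
        return self._by_hash.get(properties_hash(properties))
```

A new request would call lookup() first and only schedule a task when it returns None.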
Efficient task deduplication requires a deterministic build and no side effects in the tasks themselves. On the other hand, successful task deduplication can result in large infrastructure savings.
☞ See the user guide about idempotency for more information.
We accepted these caveats because we found that the benefits far outweighed them. The main issue has been coding defensively up-front, which represented a sunk cost in coding time.
Running on AppEngine forced Swarming to make every subsystem support flakiness;
Each Swarming bot is intended to be extremely dumb and replaceable. These workers have a very limited understanding of the system and access the server via a JSON API. Each bot polls the server for a task. If the server hands out a task, the bot runs the associated commands and then pipes the output back to the server. Once done, it starts polling again.
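One iteration of that loop can be sketched as below. FakeServer is a hypothetical in-memory stand-in for the bot JSON API, and run_one_poll and DEMO_COMMAND are illustrative names; none of them come from the bot code.

```python
import subprocess
import sys

# Hypothetical demo command: runs the current Python to print a word.
DEMO_COMMAND = [sys.executable, "-c", "print('hi')"]


class FakeServer:
    """In-memory stand-in for the server side of the bot API."""

    def __init__(self, tasks):
        self.pending = list(tasks)
        self.results = {}

    def poll(self, bot_id):
        return self.pending.pop(0) if self.pending else None

    def report(self, task_id, exit_code, output):
        self.results[task_id] = (exit_code, output)


def run_one_poll(server, bot_id):
    """Polls once; runs the task if one was handed out.

    Returns True if a task was run, False if the bot should poll again.
    """
    task = server.poll(bot_id)
    if task is None:
        return False
    proc = subprocess.run(
        task["command"], stdout=subprocess.PIPE, stderr=subprocess.STDOUT)
    server.report(task["task_id"], proc.returncode, proc.stdout)
    return True
```

A real bot would call this in a loop and sleep between empty polls; the sketch also reports the output once at the end rather than streaming it.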
Only two basic assumptions are:
The bot's code is served directly from the server as a self-packaged swarming_bot.zip. The server generates it on the fly and embeds its own URL in it. The server can also optionally have a custom bootstrap.py to further automate the bot bootstrapping process.
The bot keeps itself up to date with what the server provides.
[...] swarming_bot.zip. If it mismatches what the server expects, it is told to auto-update;
[...] swarming_bot.2.zip or swarming_bot.1.zip, depending on the currently running version and alternates between both names.
swarming_bot.zip is generated by the server and includes 2 generated files:
Since the bot version calculation is done solely by the hash, the bot will also roll back to earlier versions if the server is rolled back. All the bot's code is inside the zip; this greatly reduces issues like a partial update, update failure when there's no free space available, etc.
The bot also keeps a LKGBC copy (Last Known Good Bot Code):
[...] swarming_bot.zip;
[...] swarming_bot.zip is not the same version as the currently running version, if so;
[...] (swarming.1.zip or swarming.2.zip) back to swarming_bot.zip. This permits that at the next VM reboot, the most recent LKGBC version will be used right away.
The bot code has been tested on Linux, Mac and Windows, Chrome OS' crouton and Raspbian.
The bot publishes a dictionary of dimensions, which is a dict(key, list(values)), where each key can have multiple values. For example, a Windows 10 bot would have 'os': ['Windows', 'Windows-10-15063']. This permits broad or scoped selection of bot type.
For desktop OSes, it's about the OS and hardware properties. For devices, it's about the device, not about the host driving it.
These “dimensions” are used to associate tasks with the bots. See below.
Multiple bots can run on a host simultaneously, as long as each bot has its own base directory. So for example, one could be located in /b/s/bot1 and a second in /b/s/bot2.
In the scenario of multiple bots running on a host, make sure to never call Bot.host_reboot().
For bots that represent a device (Android, iOS, ChromeOS, Fuchsia), a bot can “own” all the devices connected to the host (generally via USB) or each bot can be in a docker container to own a single device.
In the case of devices that are reached over IP, it's up to bot_config.py to decide what is “owned” by this bot. In some cases this can be determined by hardware (like when the host has two ethernet cards and devices are connected on the second), vlan proximity or hard coded host names in bot_config.py.
In the USB case, a prototype recipe to create udev rules to fire up the bot upon connection is included. The general idea is to reduce sysadmin overhead to its minimum: configure the host once, then connect devices, with no need to touch the host again afterward. The server tolerates devices going Missing In Action or the host rebooting, which forcibly kills the ongoing tasks. The server will retry these tasks in this case, since it is an infrastructure failure.
The only sysadmin overhead remaining is to look for dead devices once in a while via the client tools or server monitoring functionality.
Clients trigger tasks and request results via a Cloud Endpoints JSON REST API.
It is not possible for the client to access bots directly; by design, no interactivity is provided.
See APIs above to write your own client.
When a client wishes to run something on Swarming, they can use the REST API or use the Go client swarming trigger. It's a simple HTTPS POST with the TaskRequest and TaskProperties serialized as JSON.
The request message is NewTaskRequest as defined in swarming_rpcs.py.
The bot selection process is inverted. It's the bot that polls for tasks. It looks at all the products of all the dimensions it has and looks for the oldest task with the highest priority that has dimensions which are also defined on the bot. If a task uses an OR dimension (e.g. ‘|’ in ‘os:Ubuntu-18|Ubuntu-16’), then bots having one of the OR'd dimensions can poll the task.
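The OR handling can be sketched by expanding a request into its flat alternatives and accepting a bot that satisfies any of them. This is an illustration only; expand_or_dimensions and bot_can_take are hypothetical names, and the sketch assumes one value string per request key.

```python
import itertools


def expand_or_dimensions(request_dims):
    """Expands '|' separated values into every flat combination."""
    keys = sorted(request_dims)
    choices = [request_dims[k].split("|") for k in keys]
    return [
        dict(zip(keys, combo)) for combo in itertools.product(*choices)]


def bot_can_take(request_dims, bot_dims):
    """True if the bot has all dimensions of at least one alternative.

    bot_dims: dict of key -> list of values published by the bot.
    """
    return any(
        all(v in bot_dims.get(k, ()) for k, v in alt.items())
        for alt in expand_or_dimensions(request_dims))
```

For a request with os set to Ubuntu-18|Ubuntu-16, the expansion yields two alternatives, and a bot with either OS value can take the task.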
The access control groups are optionally configured via a single auth_service. This presents a coherent view on all the associated services.
Swarming is tested by python tests in the following ways: