The script here manages the containers on Android swarming bots. By design, it's stateless and event-driven. It has two modes of execution:
launch
: Ensures every locally connected Android device has a running container. On the bots, it's called every 5 minutes via cron.
add_device
: Gives a device's container access to its device. Called via udev every time a device appears or reappears on the system.
It's intended to be run in conjunction with the android_docker container image. More information on the image can be found here.
Every 5 minutes, this script is invoked with the launch argument. This is how the containers get spawned, shut down, and cleaned up. Essentially, it does the following:
On Linux, docker restricts each container to its own set of resources (e.g. memory, network) via cgroups. Since cgroups also cover devices, we can leverage this to add an Android device to a container. This is done by adding the device's descriptor to the container's cgroup and creating a device node in the container's /dev filesystem. All of this happens when the script is invoked with the add_device argument.
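The two pieces of information involved here, the device node path and the cgroup entry, can be sketched as below. This is a minimal illustration rather than the script's actual code: the helper names are hypothetical, and it assumes the cgroup v1 devices controller, whose devices.allow file takes entries of the form "type major:minor access".

```python
import os
import stat


def usb_device_node(bus, dev):
    """Path of the USB device node for a given bus and device number."""
    return '/dev/bus/usb/%03d/%03d' % (bus, dev)


def cgroup_allow_entry(major, minor):
    """Entry in the cgroup v1 devices.allow format (assumed here)."""
    # 'c' marks a character device; 'rwm' grants read, write and mknod.
    return 'c %d:%d rwm' % (major, minor)


def describe_device(path):
    """Return (major, minor) for an existing character device node."""
    st = os.stat(path)
    if not stat.S_ISCHR(st.st_mode):
        raise ValueError('%s is not a character device' % path)
    return os.major(st.st_rdev), os.minor(st.st_rdev)
```

For a device on bus 1 with device number 5, `usb_device_node(1, 5)` gives `/dev/bus/usb/001/005`. Passing the major and minor numbers reported by `describe_device` for that node to `cgroup_allow_entry` yields the line to write into the container's devices.allow file.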
Every time a device reboots or resets, it momentarily disappears from the host. When this happens, many of the things that uniquely identify the device change, including its major and minor numbers and its dev and bus numbers. Consequently, we need to re-add a device to its container every time this happens. udev allows us to do this by running add_device every time an Android device appears on the host.
To perform various forms of maintenance, the script here gracefully shuts down containers and, by association, the swarming bot running inside them. This is done by sending SIGTERM to the swarming bot process, which tells the swarming bot to quit at the next available opportunity (i.e. not during a test). To fetch the pid of the swarming bot, the script runs lsof on the swarming.lck file.
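A rough sketch of this lookup-and-signal step is shown below. The function names are hypothetical, and the sketch assumes that lsof and the bot process are both visible from wherever the code runs; the real script may run lsof inside the container instead. It relies on `lsof -t`, which prints only the PIDs of the processes that have a file open, one per line.

```python
import os
import signal
import subprocess


def parse_lsof_pids(output):
    """Parse the output of `lsof -t`: one PID per line."""
    return [int(tok) for tok in output.split() if tok.isdigit()]


def pids_holding(path):
    """Return the PIDs of processes that have `path` open."""
    try:
        out = subprocess.check_output(['lsof', '-t', path])
    except subprocess.CalledProcessError:
        # lsof exits non-zero when no process has the file open.
        return []
    return parse_lsof_pids(out.decode())


def request_shutdown(lock_path):
    """Send SIGTERM to every process holding the lock file."""
    pids = pids_holding(lock_path)
    for pid in pids:
        os.kill(pid, signal.SIGTERM)
    return pids
```

Because SIGTERM is only a request, the caller still has to wait for the process to exit before cleaning up the container.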
When a new container is launched with an image that is missing from a bot's local cache, the image is pulled from the specified container registry. For all bots, this is the gcloud registry chops-public-images-prod. These images are world-readable, so no authentication with the registry is required. Additionally, when new images are fetched, any old and unused images still present on the machine are deleted.
To keep multiple simultaneous invocations of this service from stepping on each other (there are a few race conditions exposed when a misbehaving device is constantly connecting and disconnecting), parts of the script are wrapped in a mutex via a flock. The logic that's protected includes:
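As an illustration of the locking pattern itself, a flock-based mutex can be sketched as follows. The `file_lock` helper and whatever lock-file path is passed to it are hypothetical; the script's actual lock file and the code it guards are not reproduced here.

```python
import contextlib
import fcntl
import os


@contextlib.contextmanager
def file_lock(path):
    """Hold an exclusive flock on `path` for the duration of the block."""
    fd = os.open(path, os.O_CREAT | os.O_RDWR)
    try:
        # Blocks until no other open file description holds the lock.
        fcntl.flock(fd, fcntl.LOCK_EX)
        yield fd
    finally:
        # Release explicitly, then close the descriptor.
        fcntl.flock(fd, fcntl.LOCK_UN)
        os.close(fd)
```

Any code placed inside `with file_lock(path):` then runs in at most one invocation at a time, and the kernel drops the lock automatically if the process dies while holding it.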
py-libusb, which is part of this service's vpython spec, is used to fetch the list of locally connected devices. This library wraps the libusb system library and provides methods for fetching information about USB devices.
The script and its dependencies are deployed as a CIPD package via puppet. The package (infra/swarm_docker_tools/$platform) is continuously built on this bot. Puppet deploys it to the relevant bots at these revisions. The canary pin affects bots on chromium-swarm-dev, which the Android testers on the chromium.dev waterfall run tests against. If the canary has been updated, the bots look fine, and the tests haven't regressed, you can proceed to update the stable pin.
The call sites of the script are also defined in puppet: