
The script here is used to manage the containers on android swarming bots. By design, it's stateless and event-driven. It has two modes of execution:

  • launch: Ensures every locally connected android device has a running container. On the bots, it's called every 5 minutes via cron.
  • add_device: Gives a device's container access to its device. Called every time a device appears/reappears on the system via udev.

It's intended to be run in conjunction with the android_docker container image. More information on the image can be found here.

Launching the containers

Every 5 minutes, this script is invoked with the launch argument. This is how the containers get spawned, shut down, and cleaned up. Essentially, it does the following:

  • Gracefully shuts down each container if its uptime is too high (default: 4 hours) or the host's uptime is too high (default: 24 hours).
  • Scans the USB bus for any connected android device and creates a container for each one.
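The shutdown decision above boils down to a pair of uptime checks. A minimal sketch of that logic (the function name and argument names are illustrative, not taken from the actual script; only the default thresholds come from the description above):

```python
from datetime import timedelta

# Defaults from the description above.
MAX_CONTAINER_UPTIME = timedelta(hours=4)
MAX_HOST_UPTIME = timedelta(hours=24)

def should_shut_down(container_uptime, host_uptime,
                     max_container_uptime=MAX_CONTAINER_UPTIME,
                     max_host_uptime=MAX_HOST_UPTIME):
    """Return True if a container should be gracefully shut down."""
    return (container_uptime > max_container_uptime or
            host_uptime > max_host_uptime)
```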

Adding a device to a container

On Linux, docker limits containers to their set of resources (e.g. memory, network, etc.) via cgroups. Since cgroups also support device access control, we can leverage this to add an android device to a container. This is done by adding the device's descriptor to the container's cgroup and creating a device node in the container's /dev filesystem. All of this happens when the script is invoked with the add_device argument.
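Concretely, granting access to a character device means writing an entry to the container's devices cgroup and creating a matching node under its /dev. The sketch below only builds the commands involved so it stays side-effect free; the cgroup path, container id, and device paths are assumptions about a typical docker setup, not taken from the script:

```python
def device_cgroup_entry(major, minor):
    """devices.allow entry granting read/write/mknod on a char device."""
    return 'c %d:%d rwm' % (major, minor)

def grant_device(container_id, major, minor, dev_path):
    """Return the commands that would grant a container access to a device.

    These would normally be executed with root privileges; here we only
    assemble them for illustration.
    """
    allow_file = ('/sys/fs/cgroup/devices/docker/%s/devices.allow'
                  % container_id)
    return [
        # Allow the device in the container's devices cgroup.
        ['sh', '-c', "echo '%s' > %s"
            % (device_cgroup_entry(major, minor), allow_file)],
        # Create the device node inside the container's /dev filesystem.
        ['docker', 'exec', container_id,
         'mknod', dev_path, 'c', str(major), str(minor)],
    ]
```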

Every time a device reboots or resets, it momentarily disappears from the host. When this happens, many of the things that uniquely identify the device change, including its major and minor numbers and its dev and bus numbers. Consequently, we need to re-add a device to its container every time this happens. udev allows us to do this by running add_device every time an android device appears on the host.
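A udev rule along these lines is what triggers the re-add. This is an illustrative sketch only; the match keys, script path, and argument passing are assumptions, not the rule actually deployed:

```
# Hypothetical rule: on any USB "add" event, invoke the script's
# add_device mode so the device is re-attached to its container.
ACTION=="add", SUBSYSTEM=="usb", \
    RUN+="/path/to/script add_device %E{DEVNAME}"
```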

Gracefully shutting down a container

To perform various forms of maintenance, the script here gracefully shuts down containers and, by extension, the swarming bot running inside them. It does so by sending SIGTERM to the swarming bot process, which alerts the bot that it should quit at the next available opportunity (i.e. not in the middle of a test). To fetch the pid of the swarming bot, the script runs lsof on the swarming.lck file.
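The shutdown mechanism itself is ordinary POSIX signalling. A self-contained sketch, using a throwaway `sleep` child process as a stand-in for the real swarming bot (whose pid the script instead discovers via lsof):

```python
import os
import signal
import subprocess

def graceful_stop(pid):
    """Ask a process to exit at its next opportunity by sending SIGTERM."""
    os.kill(pid, signal.SIGTERM)

# Demonstrate on a stand-in process rather than a real swarming bot.
proc = subprocess.Popen(['sleep', '60'])
graceful_stop(proc.pid)
proc.wait()
assert proc.returncode == -signal.SIGTERM  # child was terminated by SIGTERM
```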

Fetching the docker image

When a new container is launched with an image that is missing from the bot's local cache, the image is pulled from the specified container registry. By default, this is the gcloud registry chromium-container-registry. The script authenticates with the registry before downloading by using the specified credentials file (default: /creds/service_accounts/service-account-container_registry_puller.json).
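The pull itself amounts to a registry login followed by a docker pull. A sketch of how the commands might be assembled; the helper name, registry host, and the `_json_key` login convention are assumptions, and only the registry name and credentials path come from the text above:

```python
def pull_commands(image,
                  registry='gcr.io/chromium-container-registry',
                  creds=('/creds/service_accounts/'
                         'service-account-container_registry_puller.json')):
    """Build the docker commands to authenticate with and pull from a registry."""
    return [
        # Authenticate with a service-account key file; the key contents
        # would be piped to docker login on stdin.
        ['docker', 'login', '-u', '_json_key', '--password-stdin',
         'https://gcr.io'],
        # Fetch the image itself.
        ['docker', 'pull', '%s/%s' % (registry, image)],
    ]
```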

File locking

To avoid multiple simultaneous invocations of this service from stepping on one another (a few race conditions are exposed when a misbehaving device constantly connects and disconnects), parts of the script are wrapped in a mutex via flock. The protected logic includes:

  • scanning the USB bus (wrapped in a global flock)
  • any container interaction (wrapped in a device-specific flock)
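Both kinds of lock can be implemented with flock(2) on a well-known lock file. A minimal sketch of such a mutex (the lock-file paths are illustrative, not the ones the script uses):

```python
import fcntl
import os
from contextlib import contextmanager

@contextmanager
def flock_mutex(lock_path):
    """Hold an exclusive flock on lock_path for the duration of the block.

    A second invocation of the script blocks here until the first releases
    the lock, serializing e.g. USB-bus scans or per-device container work.
    """
    fd = os.open(lock_path, os.O_CREAT | os.O_RDWR)
    try:
        fcntl.flock(fd, fcntl.LOCK_EX)  # blocks until the lock is free
        yield
    finally:
        fcntl.flock(fd, fcntl.LOCK_UN)
        os.close(fd)

# Global lock for the USB-bus scan; a device-specific lock would use a
# per-device path instead.
with flock_mutex('/tmp/usb_bus_scan.lock'):
    pass  # scan the USB bus here
```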

Getting device list

py-libusb, which is part of infra's virtual env, is used to fetch the list of locally connected devices. This library wraps the libusb system library and provides methods for fetching information about USB devices.
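As a rough stand-in for illustration (this is not how the script does it, and py-libusb may not be installed everywhere), the same vendor/product information is also visible under sysfs on Linux:

```python
import os

def list_usb_devices(sys_root='/sys/bus/usb/devices'):
    """Enumerate USB devices via sysfs, returning (name, vendor, product).

    A stdlib-only sketch; the actual script uses py-libusb, which queries
    libusb directly for the same information.
    """
    devices = []
    if not os.path.isdir(sys_root):
        return devices
    for entry in os.listdir(sys_root):
        vid_path = os.path.join(sys_root, entry, 'idVendor')
        pid_path = os.path.join(sys_root, entry, 'idProduct')
        # Only real devices (not hubs' interface entries) expose both ids.
        if os.path.isfile(vid_path) and os.path.isfile(pid_path):
            with open(vid_path) as v, open(pid_path) as p:
                devices.append((entry, v.read().strip(), p.read().strip()))
    return devices
```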

Deploying the script

The script and its dependencies are deployed as a CIPD package via puppet. The package (infra/docker_devices/$platform) is continuously built on this bot. Puppet deploys it to the relevant bots at these revisions. The canary pin affects bots on chromium-swarm-dev, which the android testers on the chromium.swarm waterfall run tests against. If the canary has been updated, the bots look fine, and the tests haven't regressed, you can proceed to update the stable pin.

The call sites of the script are also defined in puppet: