infra/services/android_docker/README.md

The script here is used to manage the containers on android swarming bots. By design, it's stateless and event-driven. It has two modes of execution:

  • launch: Ensures every locally connected android device has a running container. On the bots, it's called every 5 minutes via cron.
  • add_device: Gives a device's container access to its device. Called every time a device appears/reappears on the system via udev.

It's intended to be run in conjunction with the android_docker container image. More information on the image can be found here.

Launching the containers

Every 5 minutes, this script is invoked with the launch argument. This is how the containers get spawned, shut down, and cleaned up. Essentially, it does the following:

  • Gracefully shuts down each container if the container's uptime is too high (default 4 hours) or the host's uptime is too high (default 24 hours).
  • Scans the USB bus for any connected android device and creates a container for each one.
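
The uptime check in the first step can be sketched as below. This is only an illustration: the function and constant names are hypothetical, and only the two default limits (4 hours and 24 hours) come from the text above.

```python
from datetime import timedelta

# Default limits quoted in this README. The names are hypothetical.
MAX_CONTAINER_UPTIME = timedelta(hours=4)
MAX_HOST_UPTIME = timedelta(hours=24)


def should_shut_down(container_uptime, host_uptime,
                     max_container_uptime=MAX_CONTAINER_UPTIME,
                     max_host_uptime=MAX_HOST_UPTIME):
    """Returns True if either uptime exceeds its limit."""
    return (container_uptime > max_container_uptime
            or host_uptime > max_host_uptime)


if __name__ == "__main__":
    # A container that has been up for 5 hours on a freshly booted host.
    print(should_shut_down(timedelta(hours=5), timedelta(hours=1)))
```

Because the decision depends only on the two uptimes measured at invocation time, it fits the stateless design: each cron run can re-evaluate it from scratch.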

Adding a device to a container

On Linux, docker restricts each container to its own set of resources (e.g. memory, network) via cgroups (v1, v2). However, the device controller implementation changed from interface files in v1 to cgroup BPF in v2, which doesn't have any interface files.

To support this change, we first use the docker container option --device-cgroup-rule to grant the container access to all devices of a given major number. We grant access by major number because every time a device reboots or resets, it momentarily disappears from the host. When this happens, many things that uniquely identify the device change, including its dev and bus numbers and its minor number. The major number, however, likely remains unchanged, since it identifies the driver associated with the device, which doesn't change.
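
As a rough sketch of the rule described above, the snippet below builds a --device-cgroup-rule value that covers every minor number under one major number. The helper names are hypothetical, and the major number 189 (the usual value for USB device nodes on Linux) is only an example; the real script may derive these values differently.

```python
import os


def cgroup_rule_for_major(major, permissions="rwm"):
    """Builds a rule granting access to all character devices of a major.

    The "*" wildcard covers every minor number, so the rule stays valid
    when a device reappears with a new minor number.
    """
    return "c {}:* {}".format(major, permissions)


def docker_device_args(major):
    """Returns the docker arguments carrying the rule (hypothetical helper)."""
    return ["--device-cgroup-rule", cgroup_rule_for_major(major)]


def device_numbers(dev_path):
    """Reads the (major, minor) pair of an existing device node."""
    rdev = os.stat(dev_path).st_rdev
    return os.major(rdev), os.minor(rdev)


if __name__ == "__main__":
    print(docker_device_args(189))
```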

Then a custom udev rule is written so that every time an android device appears on the host, the script is invoked with the add_device argument, which adds the device to the container by creating a device node in the container's /dev filesystem.
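
One plausible way to create such a device node is to run mknod inside the container through docker exec. The sketch below only builds the command line. It is an assumption about the approach, not a copy of the script's logic, and the function name, container name, and device path are hypothetical.

```python
def mknod_command(container_name, dev_path, major, minor):
    """Builds a command that creates a character device node in a container.

    Assumes the container image ships mknod and that root inside the
    container is allowed to create device nodes.
    """
    return [
        "docker", "exec", "-u", "root", container_name,
        "mknod", dev_path, "c", str(major), str(minor),
    ]


if __name__ == "__main__":
    # Hypothetical container name and device node path.
    print(" ".join(mknod_command("android_ABC123", "/dev/bus/usb/001/005", 189, 4)))
```

The node is only usable because the cgroup rule granted at container creation already covers the device's major number. The new minor number needs no further permission change.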

Gracefully shutting down a container

To perform various forms of maintenance, the script here gracefully shuts down containers and, by association, the swarming bot running inside them. This is done by sending SIGTERM to the swarming bot process, which alerts the swarming bot that it should quit at the next available opportunity (i.e. not during a test). To fetch the pid of the swarming bot, the script runs lsof on the swarming.lck file.
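
The lookup-then-signal sequence can be sketched as follows. This version runs lsof directly on the local machine, whereas the real script has to reach the process inside the container. The function names are hypothetical, and the -t flag (which makes lsof print bare pids) is a choice made for this sketch.

```python
import os
import signal
import subprocess


def parse_lsof_pids(output):
    """Parses the output of `lsof -t`, which prints one pid per line."""
    return [int(token) for token in output.split()]


def pids_holding_file(path):
    """Returns the pids of processes that have `path` open, if any."""
    try:
        output = subprocess.check_output(["lsof", "-t", path])
    except (subprocess.CalledProcessError, OSError):
        # lsof exits non-zero when nothing holds the file, and raises
        # OSError when the binary is missing.
        return []
    return parse_lsof_pids(output.decode())


def request_graceful_shutdown(lock_path):
    """Sends SIGTERM to every process holding the lock file."""
    for pid in pids_holding_file(lock_path):
        os.kill(pid, signal.SIGTERM)
```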

Fetching the docker image

When a new container is launched with an image that is missing from a bot's local cache, the image is pulled from the specified container registry. For all bots, this is the gcloud registry chops-public-images-prod. These images are world-readable, so no authentication with the registry is required. Additionally, when new images are fetched, any old and unused images still present on the machine are deleted.

File locking

To prevent simultaneous invocations of this service from stepping on each other (a few race conditions are exposed when a misbehaving device is constantly connecting and disconnecting), parts of the script are wrapped in a mutex via a flock. The logic that's protected includes:

  • scanning the USB bus (wrapped in a global flock)
  • any container interaction (wrapped in a device-specific flock)
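
A minimal sketch of this kind of locking, using Python's fcntl module, is shown below. The context manager, the lock file naming scheme, and the blocking behavior are all assumptions for illustration. The real script may, for example, use timeouts or non-blocking locks instead.

```python
import contextlib
import fcntl
import os


@contextlib.contextmanager
def flock(lock_path):
    """Holds an exclusive flock on lock_path for the duration of the block."""
    fd = os.open(lock_path, os.O_CREAT | os.O_RDWR, 0o644)
    try:
        # Blocks until any other holder releases the lock.
        fcntl.flock(fd, fcntl.LOCK_EX)
        try:
            yield
        finally:
            fcntl.flock(fd, fcntl.LOCK_UN)
    finally:
        os.close(fd)


def device_lock_path(lock_dir, serial):
    """Returns a per-device lock file path (hypothetical naming scheme)."""
    return os.path.join(lock_dir, "android_docker.%s.lock" % serial)
```

With per-device lock files, work on one device's container never has to wait on another device. A single global lock file serializes the bus scan across all invocations.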

Getting device list

py-libusb, which is part of this service's vpython spec, is used to fetch a list of locally connected devices. This library wraps the libusb system lib and provides methods for fetching information about USB devices.

Deploying the script

The script and its dependencies are deployed as a CIPD package via puppet. The package (infra/swarm_docker_tools/$platform) is continuously built on this bot. Puppet deploys it to the relevant bots at these revisions. The canary pin affects bots on chromium-swarm-dev, which the android testers on the chromium.dev waterfall run tests against. If the canary has been updated, the bots look fine, and the tests haven't regressed, you can proceed to update the stable pin.

The call sites of the script are also defined in puppet: