This document outlines the process by which Android runs in a Linux container in Chrome OS.
This document explains how the container for Android master works unless otherwise noted. The container for N may work in a slightly different way.
config.json is used by
run_oci to describe how the container is set up. This file describes the mount structure, namespaces, device nodes that are to be created, cgroups configuration, and capabilities that are inherited.
Android is running using all of the available Linux
namespaces(7) to increase isolation from the rest of the system:
Running all of Android's userspace in namespaces also increases compatibility since we can provide it with an environment that is closer to what it expects to find under normal circumstances.
run_oci starts in the init namespace (which is shared with most of Chrome OS), running as real root with all capabilities. The mount namespace associated with that is referred to as the init mount namespace. Any mounts performed in the init mount namespace span user sessions and are performed before
run_oci starts, so they do not figure in
run_oci creates a mount namespace (while still being associated with init's user namespace) that is known as the intermediate mount namespace. Because it still has all of root's capabilities in the init namespace while running in this namespace, it can perform privileged operations, such as performing remounts (e.g. calling
MS_REMOUNT and without
MS_BIND), and requesting to mount a
tmpfs(5) into Android's
/dev with the
exec flags. This intermediate mount namespace is also used to avoid leaking mounts into the init mount namespace, and will be automatically cleaned up when the last process in the namespace exits. This process is typically Android’s init, but if the container fails to start, it can also be
Still within the intermediate mount namespace, the container process is created by calling the
clone(2) system call with the
CLONE_NEWUSER flags. Given that mount namespaces have an owner user namespace, the only way that we can transition into both is to perform both simultaneously. Since Linux 3.9,
CLONE_FS, so this also has the side effect of making this new process no longer share its root directory (
chroot(2)) with any other process.
Once in the container user namespace, the container process enters the rest of the namespaces using the
unshare(2) system call with the appropriate flag for each namespace. After it performs this with the
CLONE_NEWNS flag, it enters a mount namespace which is referred to as the container mount namespace. This is where the vast majority of the mounts happen. Since this is associated with the container user namespace and the processes here no longer run as root in the init user namespace, some operations are no longer allowed by the kernel, even though the capabilities might be set. Some examples are remounts that modify the
Once run_oci finishes setting up the container process and calls
exit(2) to daemonize the container process tree, there are no longer any processes in the system that have a direct reference to the intermediate mount namespace, so it is no longer accessible from anywhere. This means that there is no way to obtain a file descriptor that can be passed to
setns(2) in order to enter it. The namespace itself is still alive since it is the parent of the container mount namespace.
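To make the idea of a "direct reference" more concrete: on Linux, each process exposes its namespaces as symbolic links under /proc/<pid>/ns, and opening one of those links is the usual way to get a file descriptor for setns(2). The following is a minimal Python sketch, not part of run_oci, and the helper name namespace_ids is hypothetical. It only lists the namespace identifiers that are reachable through a given process; a namespace that no remaining process belongs to will not show up for any pid.

```python
import os


def namespace_ids(pid="self"):
    """Return {namespace name: link target} for one process.

    Link targets have the form 'mnt:[<inode number>]'. An empty dict is
    returned when /proc/<pid>/ns cannot be listed (for example on a
    non-Linux system or for a pid that does not exist).
    """
    ns_dir = f"/proc/{pid}/ns"
    ids = {}
    try:
        names = os.listdir(ns_dir)
    except OSError:
        return ids
    for name in sorted(names):
        try:
            ids[name] = os.readlink(os.path.join(ns_dir, name))
        except OSError:
            # The link may be unreadable without sufficient privileges.
            continue
    return ids


if __name__ == "__main__":
    for name, ident in namespace_ids().items():
        print(name, ident)
```

Two processes are in the same mount namespace exactly when their "mnt" entries are identical, which is a convenient way to compare the host and a container process during debugging.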
The user namespace is assigned 2,000,000 uids distributed in the following way:
| init namespace uid range | container namespace uid range |
|--------------------------|-------------------------------|
| 655360 - 660359          | 0 - 4999                      |
| 600 - 649                | 5000 - 5049                   |
| 660410 - 2655360         | 5050 - 2000000                |
The second range maps Chrome OS daemon uids (600-649) into one of Android's OEM-specific AIDs ranges.
Similarly, gids are assigned in the same way as uids, except that the special gid 20119 is allocated for container gid 1065, which is Android's reserved gid. This exception exists because ext4 resgid only accepts 16-bit gids, so the originally mapped gid 1065 + 655360 = 656425 does not fit.
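The mapping above can be sketched as a small lookup, written here in Python purely for illustration. The triples mirror the uid table; the names UID_MAP, container_to_init_uid, and fits_in_16_bits are hypothetical, and the length of the last range is an assumption chosen so that the three ranges add up to the stated 2,000,000 uids (that is, the table's final upper bound is treated as exclusive).

```python
# (container_start, init_start, length) triples, taken from the uid table.
# The third length is an assumption: it makes the total exactly 2,000,000.
UID_MAP = [
    (0, 655360, 5000),
    (5000, 600, 50),
    (5050, 660410, 1994950),
]


def container_to_init_uid(uid):
    """Translate a container uid to the init namespace uid, or None if unmapped."""
    for container_start, init_start, length in UID_MAP:
        if container_start <= uid < container_start + length:
            return init_start + (uid - container_start)
    return None


def fits_in_16_bits(value):
    """True if the id can be stored in an unsigned 16-bit field."""
    return 0 <= value <= 0xFFFF


if __name__ == "__main__":
    print(container_to_init_uid(0))        # first Android uid
    print(container_to_init_uid(5000))     # first mapped Chrome OS daemon uid
    print(fits_in_16_bits(1065 + 655360))  # the gid that needed the exception
    print(fits_in_16_bits(20119))          # the gid chosen instead
```

The last two calls illustrate the gid exception: the default offset pushes gid 1065 past the 16-bit limit, while 20119 stays within it.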
There are several ways in which resources are mounted inside the container:
system.raw.img, and another one for
mount(2) in the init mount namespace and
MS_SLAVE in the container mount namespace, which causes any mount changes under that mount point to propagate to other shared subtrees.
All mounts are performed in the
/opt/google/container/android/rootfs/root subtree. Given that
run_oci does not modify the init mount namespace, any mounts that span user sessions (such as the
system.raw.img loop mount) should have already been performed before
run_oci starts. This is typically handled by
The flags to the
mounts section are the ones understood by
mount(8). Note that one mount entry might become more than one call to
mount(2), since some flag combinations are ignored by the kernel (e.g. changes to mount propagation flags ignore all other flags).
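As a rough illustration of why one entry can turn into several calls, the following Python sketch splits a list of mount(8)-style options into the flag words for separate mount(2) calls: one for the regular flags and one for each propagation change. This is a hypothetical helper, not run_oci's actual logic; the function name plan_mount_calls and the small option tables are assumptions, and only the numeric flag values come from the Linux headers. It does not model other cases that may also need an extra call.

```python
# Flag values as defined in <sys/mount.h> on Linux.
MS_RDONLY = 1
MS_NOSUID = 2
MS_NODEV = 4
MS_NOEXEC = 8
MS_BIND = 4096
MS_REC = 16384
MS_PRIVATE = 1 << 18
MS_SLAVE = 1 << 19
MS_SHARED = 1 << 20

# A small subset of mount(8) option names, enough for the illustration.
REGULAR = {
    "ro": MS_RDONLY,
    "nosuid": MS_NOSUID,
    "nodev": MS_NODEV,
    "noexec": MS_NOEXEC,
    "bind": MS_BIND,
    "rbind": MS_BIND | MS_REC,
}
PROPAGATION = {
    "private": MS_PRIVATE,
    "rprivate": MS_PRIVATE | MS_REC,
    "slave": MS_SLAVE,
    "rslave": MS_SLAVE | MS_REC,
    "shared": MS_SHARED,
    "rshared": MS_SHARED | MS_REC,
}


def plan_mount_calls(options):
    """Return the list of flag words, one per mount(2) call, for one entry."""
    regular = 0
    propagation = []
    for opt in options:
        if opt in REGULAR:
            regular |= REGULAR[opt]
        elif opt in PROPAGATION:
            propagation.append(PROPAGATION[opt])
        else:
            raise ValueError(f"unsupported option in this sketch: {opt}")
    calls = []
    # The mount itself (skipped only when the entry changes propagation alone).
    if regular or not propagation:
        calls.append(regular)
    # Propagation changes must go in their own calls.
    calls.extend(propagation)
    return calls


if __name__ == "__main__":
    for flags in plan_mount_calls(["nosuid", "nodev", "rslave"]):
        print(hex(flags))
```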
/: This is
/etc/init/arc-system-mount.conf) in the init namespace. This spans container invocations since it is stateless. The
suid flags are added in the intermediate mount namespace, as well as recursively changing its propagation flags to be
/config/sdcardfs: Bind-mount of
/sys/kernel/config/sdcardfs subdirectory of a normal
/dev: This is a
tmpfs mounted in the intermediate mount namespace with
android-root as owner. This is needed to get the
/dev/pts: Pseudo TTY devpts file system with namespace support so that it is in a different namespace than the parent namespace even though the device node ids look identical. Required for bionic CTS tests. The device is mounted with the nosuid and noexec mount options for better security, although stock Android does not use them.
/dev/ptmx: The kernel documentation for devpts indicates that there are two ways to support
/dev/ptmx: creating a symlink that points to
/dev/pts/ptmx, or bind-mounting
/dev/pts/ptmx. The bind-mount was chosen to mark it
/dev/kmsg: This is a bind-mount of the host's
/run/arc/android.kmsg.fifo, which is just a FIFO file. Logs written to the fake device are read by a job called
arc-kmsg-logger and stored in the host's /var/log/android.kmsg.
/dev/socket: This is a normal
tmpfs, used by Android's
init to store socket files.
/dev/usb-ffs/adb: This is a bind-mount of the host's
/run/arc/adbd and is a slave mount, which contains a FIFO that acts as the ADB gadget configured through ConfigFS/FunctionFS. This file is only present in Developer Mode. Once the
/dev/usb-ffs/adb/ep0 file is written to, the bulk-in and bulk-out endpoints will be bind-mounted into this same directory.
config.json bind-mounts one of the host's read-only directories to
/data. This read-only and near-empty
/data is only for the “mini” container for the login screen, and is used until the user signs into Chrome OS. Once the user signs in,
OnBootContinue() function unmounts the read-only
/data, and then bind-mounts
/data/cache, respectively. These source directories are writable and in Chrome OS user’s encrypted directory managed by cryptohome.
tmpfs that holds several mount points from other containers for Chrome <=> Android file system communication, such as
dlfs, OBB, and external storage.
/var/run/arc/sdcard: A FUSE file system provided by
sdcard daemon running outside the container.
/var/run/chrome: Holds the ARC bridge and Wayland UNIX domain sockets.
/var/run/cras: Holds the CRAS UNIX domain socket.
/var/run/inputbridge: Holds a FIFO for doing IPC within the container. surfaceflinger uses the FIFO to propagate input events from the host to the container.
/sys: A normal
/sys/fs/selinux: This is bind-mounted from
/sys/fs/selinux outside the container.
/sys/kernel/debug: Since this directory is owned by real root with very restrictive permissions (so the container would not be able to access any resource in that directory), a
tmpfs is mounted in its place.
/sys/kernel/debug/sync: The permissions of this directory in the host are relaxed so that
android-root can access it, and it is bind-mounted in the container.
/sys/kernel/debug/tracing: This is bind-mounted from the host's /run/arc/debugfs/tracing, only in dev mode. Note that the group id is mapped into the container to allow access from inside by DAC.
/proc: A normal
procfs. This is mounted in the container mount namespace, which is associated with the container user+pid namespaces to display the correct PID mappings.
/proc/cmdline: A regular file with the runtime-generated kernel commandline is bind-mounted instead of the Chrome OS kernel commandline.
/proc/sys/vm/mmap_rnd_bits: Two regular files are bind-mounted since the original files are owned by real root with very restrictive permissions. Android's
init modifies the contents of these files to increase the
mmap(2) entropy, and will crash if this operation is not allowed. Mounting these two files reduces the number of mods to
/proc/sys/kernel/kptr_restrict: Same as with
/oem/etc: This is bind-mounted from host's
/var/run/arc/bugreport: This is bind-mounted from host's
/run/arc/bugreport. The container creates a pipe file in the directory to allow host's
debugd to read it. When it is read, Android's
bugreport output is sent to the host side.
/var/run/arc/apkcache: This is bind-mounted from host's /mnt/stateful_partition/unencrypted/apkcache. The host directory is for storing APK files specified by the device's policy and downloaded on the host side.
/var/run/arc/dalvik-cache: This is bind-mounted from host's
/mnt/stateful_partition/unencrypted/art-data/dalvik-cache. The host directory is for storing boot*.art files compiled on the host side. This allows the container to load the files right away without building them.
/var/run/camera: Holds the arc-camera UNIX domain socket.
/var/run/arc/obb: This is bind-mounted from host's
/run/arc/obb. A daemon running outside the container called
/usr/bin/arc-obb-mounter mounts an OBB image file as a FUSE file system to the directory when requested.
/var/run/arc/media: This is bind-mounted from host's
/run/arc/media. A daemon running outside the container called
/usr/bin/mount-passthrough mounts an external storage as a FUSE file system to the directory when needed.
/vendor: This is loop-mounted from host's
/opt/google/containers/android/vendor.raw.img. The directory may have graphics drivers, Houdini, board-specific APKs, and so on.
Android is running in a user namespace, and the
root user in the namespace has all possible capabilities in that namespace. Nevertheless, there are some operations in the kernel where the capability check is performed against the user in the init namespace. All the capabilities where all the checks are done in this way (such as
CAP_SYS_MODULE) are removed because no user within the container would be able to use them.
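A process can inspect which capabilities it nominally holds by reading the CapEff line of /proc/self/status, which is a hexadecimal bitmask indexed by capability number. The Python sketch below shows one way to test individual bits; the helper names effective_caps and has_capability are hypothetical, and only the capability numbers are taken from the Linux headers. As described above, a set bit inside a user namespace does not guarantee that the kernel will permit the operation, so this check is necessary but not sufficient.

```python
# Capability numbers from <linux/capability.h>.
CAP_SYS_MODULE = 16
CAP_SYS_BOOT = 22
CAP_SYSLOG = 34


def effective_caps(status_text):
    """Extract the effective capability bitmask from /proc/<pid>/status text."""
    for line in status_text.splitlines():
        if line.startswith("CapEff:"):
            return int(line.split(":", 1)[1].strip(), 16)
    raise ValueError("no CapEff line found")


def has_capability(mask, cap):
    """True if bit number `cap` is set in the capability bitmask."""
    return bool((mask >> cap) & 1)


if __name__ == "__main__":
    with open("/proc/self/status") as f:
        mask = effective_caps(f.read())
    for name, cap in [("CAP_SYS_MODULE", CAP_SYS_MODULE),
                      ("CAP_SYS_BOOT", CAP_SYS_BOOT),
                      ("CAP_SYSLOG", CAP_SYSLOG)]:
        print(name, has_capability(mask, cap))
```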
Additionally, the following capabilities were removed (by dropping them from the list of permitted, inheritable, effective, and ambient capability sets) to signal the container that it cannot perform certain operations:
CAP_SYS_BOOT: This signals Android's
init process that it should not use
reboot(2), but instead call
exit(2). It is also used to decide whether or not to block the
SIGTERM signal, which can be used to request the container to terminate itself from the outside.
CAP_SYSLOG: This signals Android that it will not be able to access kernel pointers found in
By default, processes running inside the container are not allowed to access any device files. They can only access the ones that are explicitly allowed in the
The hooks used by
run_oci follow the Open Container Initiative spec for POSIX-platform Hooks, with a Chrome OS-specific extension that allows a hook to be installed after all the mounts have been processed, but prior to calling
In order to avoid paying the price of creating several processes and switching back and forth between namespaces (which added several milliseconds to the boot time when done naïvely), we have consolidated all of the hook execution to two hooks: pre-create and pre-chroot.
The pre-create hook invokes
arc-setup with the
--setup flag via its wrapper script,
/usr/sbin/arc_setup_wrapper.sh and creates host-side files and directories that will be bind-mounted to the container via
The pre-chroot hook invokes
arc-setup with the
--pre-chroot flag and performs several operations:
binfmt_misc to perform ARM binary translation on Intel devices.
run_oci, since these are not handled by either the build system, or the first invocation of
arc-setup that occurs before
/dev/.coldboot_done, which is used by Android as a signal that it has reached a certain point during the boot sequence. This is normally done by Android's
init during its first stage, but we do not use it and boot Android directly into
init's second stage.