| ## Container Specification - v1 |
| |
| This is the standard configuration for version 1 containers. It includes |
| namespaces, standard filesystem setup, a default Linux capability set, and |
| information about resource reservations. It also has information about any |
| populated environment settings for the processes running inside a container. |
| |
| Along with the configuration of how a container is created, the standard also |
| discusses actions that can be performed on a container to manage and inspect |
| information about the processes running inside. |
| |
| The v1 profile is meant to accommodate the majority of applications while |
| providing a strong security configuration. |
| |
| ### System Requirements and Compatibility |
| |
| Minimum requirements: |
| * Kernel version - 3.10 recommended, 2.6.2x minimum (with backported patches) |
| * Mounted cgroups with each subsystem in its own hierarchy |
| |
| |
| ### Namespaces |
| |
| | Flag | Enabled | |
| | ------------ | ------- | |
| | CLONE_NEWPID | 1 | |
| | CLONE_NEWUTS | 1 | |
| | CLONE_NEWIPC | 1 | |
| | CLONE_NEWNET | 1 | |
| | CLONE_NEWNS | 1 | |
| | CLONE_NEWUSER | 1 | |
| |
| Namespaces are created for the container via the `clone` syscall. |
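
As a rough illustration, the table above corresponds to a single flag word passed to `clone`. The sketch below only assembles that word and does not call `clone` itself, since creating these namespaces needs sufficient privileges. The helper name `v1_clone_flags` is hypothetical and not part of any library.

```c
#include <linux/sched.h>   /* CLONE_NEW* constants from the kernel headers */

/* Hypothetical helper: OR together every namespace flag enabled in the v1 table. */
unsigned long v1_clone_flags(void)
{
    return CLONE_NEWPID | CLONE_NEWUTS | CLONE_NEWIPC |
           CLONE_NEWNET | CLONE_NEWNS  | CLONE_NEWUSER;
}
```

A caller would typically OR this value with the child's exit signal (usually `SIGCHLD`) before passing it to `clone`.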
| |
| |
| ### Filesystem |
| |
| A root filesystem must be provided to a container for execution. The container |
| will use this root filesystem (rootfs) to jail and spawn processes inside where |
| the binaries and system libraries are local to that directory. Any binaries |
| to be executed must be contained within this rootfs. |
| |
| Mounts that happen inside the container are automatically cleaned up when the |
| container exits, as the mount namespace is destroyed and the kernel will |
| unmount all the mounts that were set up within that namespace. |
| |
| For a container to execute properly, certain filesystems must be mounted |
| within the rootfs; the runtime will set these up. |
| |
| | Path | Type | Flags | Data | |
| | ----------- | ------ | -------------------------------------- | ---------------------------------------- | |
| | /proc | proc | MS_NOEXEC,MS_NOSUID,MS_NODEV | | |
| | /dev | tmpfs | MS_NOEXEC,MS_STRICTATIME | mode=755 | |
| | /dev/shm | tmpfs | MS_NOEXEC,MS_NOSUID,MS_NODEV | mode=1777,size=65536k | |
| | /dev/mqueue | mqueue | MS_NOEXEC,MS_NOSUID,MS_NODEV | | |
| | /dev/pts | devpts | MS_NOEXEC,MS_NOSUID | newinstance,ptmxmode=0666,mode=620,gid=5 | |
| | /sys | sysfs | MS_NOEXEC,MS_NOSUID,MS_NODEV,MS_RDONLY | | |
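
One way to express the table above is as a static array that a runtime walks, calling `mount(2)` once per row. This is a minimal sketch rather than the runtime's actual code: the names `v1_mount`, `v1_mounts`, and `v1_mount_all` are hypothetical, the filesystem type is reused as the mount source, and the destination directories are assumed to exist already under the rootfs.

```c
#include <stddef.h>
#include <stdio.h>
#include <sys/mount.h>

struct v1_mount {
    const char   *path;   /* destination, relative to the rootfs */
    const char   *type;   /* filesystem type, also used as the source here */
    unsigned long flags;
    const char   *data;   /* NULL when the Data column is empty */
};

static const struct v1_mount v1_mounts[] = {
    { "/proc",       "proc",   MS_NOEXEC | MS_NOSUID | MS_NODEV, NULL },
    { "/dev",        "tmpfs",  MS_NOEXEC | MS_STRICTATIME, "mode=755" },
    { "/dev/shm",    "tmpfs",  MS_NOEXEC | MS_NOSUID | MS_NODEV,
      "mode=1777,size=65536k" },
    { "/dev/mqueue", "mqueue", MS_NOEXEC | MS_NOSUID | MS_NODEV, NULL },
    { "/dev/pts",    "devpts", MS_NOEXEC | MS_NOSUID,
      "newinstance,ptmxmode=0666,mode=620,gid=5" },
    { "/sys",        "sysfs",  MS_NOEXEC | MS_NOSUID | MS_NODEV | MS_RDONLY,
      NULL },
};

const size_t v1_mount_count = sizeof(v1_mounts) / sizeof(v1_mounts[0]);

/* Mount every entry under rootfs; returns 0, or -1 on the first failure. */
int v1_mount_all(const char *rootfs)
{
    char dest[4096];

    for (size_t i = 0; i < v1_mount_count; i++) {
        const struct v1_mount *m = &v1_mounts[i];
        int n = snprintf(dest, sizeof(dest), "%s%s", rootfs, m->path);

        if (n < 0 || (size_t)n >= sizeof(dest))
            return -1;
        if (mount(m->type, dest, m->type, m->flags, m->data) < 0)
            return -1;
    }
    return 0;
}
```

The order of the rows matters: `/dev` has to be mounted before `/dev/shm`, `/dev/mqueue`, and `/dev/pts` can be mounted on top of it.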
| |
| |
| After a container's filesystems are mounted within the newly created |
| mount namespace, `/dev` will need to be populated with a set of device nodes. |
| A rootfs is not expected to provide any device nodes under `/dev`, as the |
| container will set up the devices that are required for executing a |
| container's process. |
| |
| | Path | Mode | Access | |
| | ------------ | ---- | ---------- | |
| | /dev/null | 0666 | rwm | |
| | /dev/zero | 0666 | rwm | |
| | /dev/full | 0666 | rwm | |
| | /dev/tty | 0666 | rwm | |
| | /dev/random | 0666 | rwm | |
| | /dev/urandom | 0666 | rwm | |
| |
| |
| **ptmx** |
| `/dev/ptmx` will need to be a symlink to the host's `/dev/ptmx` within |
| the container. |
| |
| The use of a pseudo TTY is optional within a container, and a container should |
| support running both with and without one. If a pseudo TTY is provided to the |
| container, `/dev/console` will need to be set up by bind mounting the console |
| in `/dev/` after `/dev` has been mounted as tmpfs and populated. |
| |
| | Source | Destination | UID GID | Mode | Type | |
| | --------------- | ------------ | ------- | ---- | ---- | |
| | *pty host path* | /dev/console | 0 0 | 0600 | bind | |
| |
| |
| After `/dev/null` has been set up, we check for any external links between |
| the container's I/O streams: STDIN, STDOUT, and STDERR. If any of them points |
| to `/dev/null` outside the container, we close it and `dup2` the `/dev/null` |
| that is local to the container's rootfs. |
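
The check described above can be sketched by comparing device numbers: a descriptor that refers to the same character device as the container's `/dev/null` is replaced with a duplicate of the local one. This is an illustrative sketch under that assumption, not the runtime's actual code; `repoint_if_null` and `reopen_dev_null` are hypothetical names.

```c
#include <fcntl.h>
#include <sys/stat.h>
#include <unistd.h>

/*
 * If fd refers to the same character device as nullfd, replace it with a
 * duplicate of nullfd. Returns 1 if replaced, 0 if left alone, -1 on error.
 */
int repoint_if_null(int fd, int nullfd)
{
    struct stat st, null_st;

    if (fstat(fd, &st) < 0 || fstat(nullfd, &null_st) < 0)
        return -1;
    if (!S_ISCHR(st.st_mode) || st.st_rdev != null_st.st_rdev)
        return 0;
    return dup2(nullfd, fd) < 0 ? -1 : 1;   /* dup2 closes the old fd first */
}

/* Apply the check to STDIN, STDOUT, and STDERR using the local /dev/null. */
int reopen_dev_null(void)
{
    int rc = 0;
    int nullfd = open("/dev/null", O_RDWR);

    if (nullfd < 0)
        return -1;
    for (int fd = 0; fd <= 2 && rc >= 0; fd++)
        if (fd != nullfd)
            rc = repoint_if_null(fd, nullfd);
    if (nullfd > 2)
        close(nullfd);
    return rc < 0 ? -1 : 0;
}
```

`reopen_dev_null` is meant to run after the root has been switched, so that `"/dev/null"` resolves to the node inside the container.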
| |
| |
| After the container has `/proc` mounted, a few standard symlinks are set up |
| within `/dev/` for the I/O streams. |
| |
| | Source | Destination | |
| | --------------- | ----------- | |
| | /proc/self/fd | /dev/fd | |
| | /proc/self/fd/0 | /dev/stdin | |
| | /proc/self/fd/1 | /dev/stdout | |
| | /proc/self/fd/2 | /dev/stderr | |
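
The symlink table above translates directly into a loop over `symlink(2)`. The following is a small sketch, assuming the caller passes the path of the container's `/dev` directory; the names `dev_links` and `setup_dev_symlinks` are hypothetical.

```c
#include <stdio.h>
#include <unistd.h>

struct dev_link {
    const char *target;   /* what the link points at (Source column) */
    const char *name;     /* link name inside /dev (Destination column) */
};

static const struct dev_link dev_links[] = {
    { "/proc/self/fd",   "fd"     },
    { "/proc/self/fd/0", "stdin"  },
    { "/proc/self/fd/1", "stdout" },
    { "/proc/self/fd/2", "stderr" },
};

/* Create the links inside dev_dir; returns 0, or -1 on the first failure. */
int setup_dev_symlinks(const char *dev_dir)
{
    char path[4096];

    for (size_t i = 0; i < sizeof(dev_links) / sizeof(dev_links[0]); i++) {
        int n = snprintf(path, sizeof(path), "%s/%s", dev_dir,
                         dev_links[i].name);

        if (n < 0 || (size_t)n >= sizeof(path))
            return -1;
        if (symlink(dev_links[i].target, path) < 0)
            return -1;
    }
    return 0;
}
```

Because the targets are absolute paths under `/proc`, the links only resolve correctly once `/proc` is mounted inside the container.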
| |
| A `pivot_root` is used to change the root for the process, effectively |
| jailing the process inside the rootfs. |
| |
| ```c |
| put_old = mkdir(...); |
| pivot_root(rootfs, put_old); |
| chdir("/"); |
| umount2(put_old, MNT_DETACH); |
| rmdir(put_old); |
| ``` |
| |
| For containers running with a rootfs inside `ramfs`, an `MS_MOVE` mount combined |
| with a `chroot` is required, as `pivot_root` is not supported in `ramfs`. |
| |
| ```c |
| mount(rootfs, "/", NULL, MS_MOVE, NULL); |
| chroot("."); |
| chdir("/"); |
| ``` |
| |
| The `umask` is set back to `0022` after the filesystem setup has been completed. |
| |
| ### Resources |
| |
| Cgroups are used to handle resource allocation for containers. This includes |
| system resources like cpu, memory, and device access. |
| |
| | Subsystem | Enabled | |
| | ---------- | ------- | |
| | devices | 1 | |
| | memory | 1 | |
| | cpu | 1 | |
| | cpuacct | 1 | |
| | cpuset | 1 | |
| | blkio | 1 | |
| | perf_event | 1 | |
| | freezer | 1 | |
| | hugetlb | 1 | |
| | pids | 1 | |
| |
| |
| All cgroup subsystems are joined so that statistics can be collected from |
| each of the subsystems. Freezer does not expose any stats but is joined |
| so that containers can be paused and resumed. |
| |
| The parent process of the container's init must place the init pid inside |
| the correct cgroups before the initialization begins. This is done so |
| that no processes or threads escape the cgroups. This synchronization is |
| done via a pipe (specified in the runtime section below): the container's |
| init process blocks on the pipe until the parent has finished setup. |
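
Placing the init pid into its cgroups amounts to writing the pid into a control file in each subsystem's directory. The sketch below assumes the `cgroup.procs` file is used and that the caller already knows the per-subsystem directory paths; `cgroup_add_pid` and `cgroup_join_all` are hypothetical names.

```c
#include <fcntl.h>
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

/* Write pid to <cgroup_dir>/cgroup.procs; returns 0 on success, -1 on error. */
int cgroup_add_pid(const char *cgroup_dir, pid_t pid)
{
    char path[4096], buf[32];
    int n = snprintf(path, sizeof(path), "%s/cgroup.procs", cgroup_dir);
    int len = snprintf(buf, sizeof(buf), "%ld\n", (long)pid);
    int fd, rc;

    if (n < 0 || (size_t)n >= sizeof(path) || len < 0)
        return -1;
    fd = open(path, O_WRONLY);   /* the control file must already exist */
    if (fd < 0)
        return -1;
    rc = write(fd, buf, (size_t)len) == (ssize_t)len ? 0 : -1;
    if (close(fd) < 0)
        rc = -1;
    return rc;
}

/* Join pid to one directory per subsystem hierarchy. */
int cgroup_join_all(const char *const dirs[], size_t count, pid_t pid)
{
    for (size_t i = 0; i < count; i++)
        if (cgroup_add_pid(dirs[i], pid) < 0)
            return -1;
    return 0;
}
```

The parent would call `cgroup_join_all` with the init pid while init is still blocked on the synchronization pipe, so that every process init later spawns starts out inside the cgroups.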
| |
| ### Security |
| |
| The standard set of Linux capabilities that are set in a container |
| provide a good default for security and flexibility for the applications. |
| |
| |
| | Capability | Enabled | |
| | -------------------- | ------- | |
| | CAP_NET_RAW | 1 | |
| | CAP_NET_BIND_SERVICE | 1 | |
| | CAP_AUDIT_READ | 1 | |
| | CAP_AUDIT_WRITE | 1 | |
| | CAP_DAC_OVERRIDE | 1 | |
| | CAP_SETFCAP | 1 | |
| | CAP_SETPCAP | 1 | |
| | CAP_SETGID | 1 | |
| | CAP_SETUID | 1 | |
| | CAP_MKNOD | 1 | |
| | CAP_CHOWN | 1 | |
| | CAP_FOWNER | 1 | |
| | CAP_FSETID | 1 | |
| | CAP_KILL | 1 | |
| | CAP_SYS_CHROOT | 1 | |
| | CAP_NET_BROADCAST | 0 | |
| | CAP_SYS_MODULE | 0 | |
| | CAP_SYS_RAWIO | 0 | |
| | CAP_SYS_PACCT | 0 | |
| | CAP_SYS_ADMIN | 0 | |
| | CAP_SYS_NICE | 0 | |
| | CAP_SYS_RESOURCE | 0 | |
| | CAP_SYS_TIME | 0 | |
| | CAP_SYS_TTY_CONFIG | 0 | |
| | CAP_AUDIT_CONTROL | 0 | |
| | CAP_MAC_OVERRIDE | 0 | |
| | CAP_MAC_ADMIN | 0 | |
| | CAP_NET_ADMIN | 0 | |
| | CAP_SYSLOG | 0 | |
| | CAP_DAC_READ_SEARCH | 0 | |
| | CAP_LINUX_IMMUTABLE | 0 | |
| | CAP_IPC_LOCK | 0 | |
| | CAP_IPC_OWNER | 0 | |
| | CAP_SYS_PTRACE | 0 | |
| | CAP_SYS_BOOT | 0 | |
| | CAP_LEASE | 0 | |
| | CAP_WAKE_ALARM | 0 | |
| | CAP_BLOCK_SUSPEND | 0 | |
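
For illustration, the enabled rows of the table above can be collapsed into a bitmask keyed by the kernel's capability numbers. This sketch only builds and queries the mask; applying it to a process (for example by dropping everything else from the bounding set) is not shown. The names `v1_caps`, `v1_cap_mask`, and `v1_cap_enabled` are hypothetical.

```c
#include <linux/capability.h>   /* CAP_* numbers from the kernel headers */
#include <stddef.h>
#include <stdint.h>

/* Capabilities marked Enabled = 1 in the v1 table. */
static const int v1_caps[] = {
    CAP_NET_RAW, CAP_NET_BIND_SERVICE, CAP_AUDIT_READ, CAP_AUDIT_WRITE,
    CAP_DAC_OVERRIDE, CAP_SETFCAP, CAP_SETPCAP, CAP_SETGID, CAP_SETUID,
    CAP_MKNOD, CAP_CHOWN, CAP_FOWNER, CAP_FSETID, CAP_KILL, CAP_SYS_CHROOT,
};

/* One bit per capability number, set when the capability is enabled. */
uint64_t v1_cap_mask(void)
{
    uint64_t mask = 0;

    for (size_t i = 0; i < sizeof(v1_caps) / sizeof(v1_caps[0]); i++)
        mask |= UINT64_C(1) << v1_caps[i];
    return mask;
}

/* Returns 1 if cap is part of the default set, 0 otherwise. */
int v1_cap_enabled(int cap)
{
    return (int)((v1_cap_mask() >> cap) & 1);
}
```

With this layout, `v1_cap_enabled(CAP_CHOWN)` is 1 and `v1_cap_enabled(CAP_SYS_ADMIN)` is 0, matching the table.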
| |
| |
| Additional security layers like [apparmor](https://wiki.ubuntu.com/AppArmor) |
| and [selinux](http://selinuxproject.org/page/Main_Page) can be used with |
| the containers. A container should support setting an apparmor profile or |
| selinux process and mount labels if provided in the configuration. |
| |
| Standard apparmor profile: |
| ``` |
| #include <tunables/global> |
| profile <profile_name> flags=(attach_disconnected,mediate_deleted) { |
| #include <abstractions/base> |
| network, |
| capability, |
| file, |
| umount, |
| |
| deny @{PROC}/sys/fs/** wklx, |
| deny @{PROC}/sysrq-trigger rwklx, |
| deny @{PROC}/mem rwklx, |
| deny @{PROC}/kmem rwklx, |
| deny @{PROC}/sys/kernel/[^s][^h][^m]* wklx, |
| deny @{PROC}/sys/kernel/*/** wklx, |
| |
| deny mount, |
| |
| deny /sys/[^f]*/** wklx, |
| deny /sys/f[^s]*/** wklx, |
| deny /sys/fs/[^c]*/** wklx, |
| deny /sys/fs/c[^g]*/** wklx, |
| deny /sys/fs/cg[^r]*/** wklx, |
| deny /sys/firmware/efi/efivars/** rwklx, |
| deny /sys/kernel/security/** rwklx, |
| } |
| ``` |
| |
| *TODO: seccomp work is being done to find a good default config* |
| |
| ### Runtime and Init Process |
| |
| During container creation the parent process needs to talk to the container's init |
| process and have a form of synchronization. This is accomplished by creating |
| a pipe that is passed to the container's init. When the init process first spawns |
| it will block on its side of the pipe until the parent closes its side. This |
| allows the parent to have time to set the new process inside a cgroup hierarchy |
| and/or write any uid/gid mappings required for user namespaces. |
| The pipe is passed to the init process via FD 3. |
| |
| The application consuming libcontainer should be compiled statically. libcontainer |
| does not define any init process and the arguments provided are used to `exec` the |
| process inside the application. There should be no long running init within the |
| container spec. |
| |
| If a pseudo tty is provided to a container it will open and `dup2` the console |
| as the container's STDIN, STDOUT, STDERR as well as mounting the console |
| as `/dev/console`. |
| |
| An extra set of mounts is provided to a container and set up for use. A container's |
| rootfs can contain some non-portable files that can cause side effects during |
| execution of a process. These files are usually created and populated with |
| container-specific information by the runtime. |
| |
| **Extra runtime files:** |
| * /etc/hosts |
| * /etc/resolv.conf |
| * /etc/hostname |
| * /etc/localtime |
| |
| |
| #### Defaults |
| |
| There are a few defaults that users can override; when omitted, the following |
| values apply to processes within a container. |
| |
| | Type | Value | |
| | ------------------- | ------------------------------ | |
| | Parent Death Signal | SIGKILL | |
| | UID | 0 | |
| | GID | 0 | |
| | GROUPS | 0, NULL | |
| | CWD | "/" | |
| | $HOME | Current user's home dir or "/" | |
| | Readonly rootfs | false | |
| | Pseudo TTY | false | |
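
Two of the defaults above map onto short system calls, sketched here for illustration. The parent death signal is set with `prctl(PR_SET_PDEATHSIG, ...)`, and the `$HOME` default is assumed to come from the passwd database with `"/"` as the fallback. The helper names `set_parent_death_signal` and `default_home` are hypothetical.

```c
#include <pwd.h>
#include <signal.h>
#include <sys/prctl.h>
#include <sys/types.h>

/* Ask the kernel to deliver sig to this process when its parent dies. */
int set_parent_death_signal(int sig)
{
    return prctl(PR_SET_PDEATHSIG, (unsigned long)sig, 0, 0, 0);
}

/* Home directory of uid from the passwd database, or "/" if unknown. */
const char *default_home(uid_t uid)
{
    struct passwd *pw = getpwuid(uid);

    if (pw && pw->pw_dir && pw->pw_dir[0] != '\0')
        return pw->pw_dir;
    return "/";
}
```

A container process would call `set_parent_death_signal(SIGKILL)` early, before it `exec`s the user's command, so that it does not outlive the process that created it.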
| |
| |
| ## Actions |
| |
| After a container is created there is a standard set of actions that can |
| be done to the container. These actions are part of the public API for |
| a container. |
| |
| | Action | Description | |
| | -------------- | ------------------------------------------------------------------ | |
| Get Processes | Return all the pids for processes running inside a container |
| Get Stats | Return resource statistics for the container as a whole |
| Wait | Wait on the container's init process (pid 1) |
| | Wait Process | Wait on any of the container's processes returning the exit status | |
| | Destroy | Kill the container's init process and remove any filesystem state | |
| | Signal | Send a signal to the container's init process | |
| | Signal Process | Send a signal to any of the container's processes | |
| | Pause | Pause all processes inside the container | |
| | Resume | Resume all processes inside the container if paused | |
| Exec | Execute a new process inside of the container (requires setns) |
| Set | Update the container's configuration after it is created |
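
As an example of how two of these actions might be backed, Pause and Resume can be implemented by writing to the freezer subsystem's `freezer.state` file. This is a minimal sketch that assumes the caller knows the container's freezer cgroup directory; `freezer_set_state`, `container_pause`, and `container_resume` are hypothetical names.

```c
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* Write state to <freezer_dir>/freezer.state; returns 0 or -1. */
static int freezer_set_state(const char *freezer_dir, const char *state)
{
    char path[4096];
    size_t len = strlen(state);
    int n = snprintf(path, sizeof(path), "%s/freezer.state", freezer_dir);
    int fd, rc;

    if (n < 0 || (size_t)n >= sizeof(path))
        return -1;
    fd = open(path, O_WRONLY);   /* the control file must already exist */
    if (fd < 0)
        return -1;
    rc = write(fd, state, len) == (ssize_t)len ? 0 : -1;
    if (close(fd) < 0)
        rc = -1;
    return rc;
}

int container_pause(const char *freezer_dir)
{
    return freezer_set_state(freezer_dir, "FROZEN");
}

int container_resume(const char *freezer_dir)
{
    return freezer_set_state(freezer_dir, "THAWED");
}
```

On a real freezer cgroup the state can briefly read back as `FREEZING`, so a caller that needs certainty would re-read `freezer.state` until it reports `FROZEN`.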
| |
| ### Execute a new process inside of a running container |
| |
| A user can execute a new process inside of a running container. Any binaries to be |
| executed must be accessible within the container's rootfs. |
| |
| The started process will run inside the container's rootfs. Any changes |
| made by the process to the container's filesystem will persist after the |
| process has finished executing. |
| |
| The started process will join all the container's existing namespaces. When the |
| container is paused, the process will also be paused and will resume when |
| the container is unpaused. The started process will only run when the container's |
| primary process (PID 1) is running, and will not be restarted when the container |
| is restarted. |
| |
| #### Planned additions |
| |
| The started process will have its own cgroups nested inside the container's |
| cgroups. This is used for process tracking and optionally resource allocation |
| handling for the new process. The freezer cgroup is required; the rest of the cgroups |
| are optional. The process executor must place its pid inside the correct |
| cgroups before starting the process. This is done so that no child processes or |
| threads can escape the cgroups. |
| |
| When the process is stopped, the process executor will try (in a best-effort way) |
| to stop all its children and remove the sub-cgroups. |