
 ## Container Specification - v1

 This is the standard configuration for version 1 containers.  It includes
 namespaces, standard filesystem setup, a default Linux capability set, and
 information about resource reservations.  It also has information about any
 populated environment settings for the processes running inside a container.

 Along with the configuration used to create a container, the standard also
 describes the actions that can be performed on a container to manage it and
 to inspect the processes running inside.

 The v1 profile is meant to be able to accommodate the majority of applications
 with a strong security configuration.

 ### System Requirements and Compatibility

 Minimum requirements:
 * Kernel version: 3.10 recommended, 2.6.2x minimum (with backported patches)
 * Mounted cgroups with each subsystem in its own hierarchy


 ### Namespaces

 |     Flag      | Enabled |
 | ------------  | ------- |
 | CLONE_NEWPID  |    1    |
 | CLONE_NEWUTS  |    1    |
 | CLONE_NEWIPC  |    1    |
 | CLONE_NEWNET  |    1    |
 | CLONE_NEWNS   |    1    |
 | CLONE_NEWUSER |    1    |

 Namespaces are created for the container via the `clone` syscall.
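
The table above maps directly onto a flag mask for `clone`.  As a minimal
sketch (the names `ns_flag`, `v1_namespaces`, and `container_clone_flags` are
hypothetical and not part of libcontainer), the enabled rows could be combined
like this, using the constants from `<linux/sched.h>`:

```c
#include <assert.h>
#include <stddef.h>
#include <linux/sched.h> /* CLONE_NEW* constants */

/* One row of the namespace table. */
struct ns_flag {
    const char *name;
    unsigned long flag;
    int enabled;
};

const struct ns_flag v1_namespaces[] = {
    { "CLONE_NEWPID",  CLONE_NEWPID,  1 },
    { "CLONE_NEWUTS",  CLONE_NEWUTS,  1 },
    { "CLONE_NEWIPC",  CLONE_NEWIPC,  1 },
    { "CLONE_NEWNET",  CLONE_NEWNET,  1 },
    { "CLONE_NEWNS",   CLONE_NEWNS,   1 },
    { "CLONE_NEWUSER", CLONE_NEWUSER, 1 },
};

/* Hypothetical helper: OR together every enabled namespace flag. */
unsigned long container_clone_flags(void)
{
    unsigned long flags = 0;
    size_t n = sizeof(v1_namespaces) / sizeof(v1_namespaces[0]);

    for (size_t i = 0; i < n; i++)
        if (v1_namespaces[i].enabled)
            flags |= v1_namespaces[i].flag;
    assert(flags != 0); /* the v1 profile always enables namespaces */
    return flags;
}
```

A runtime would typically OR the result with `SIGCHLD` before passing it as
the flags argument of `clone`.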


 ### Filesystem

 A root filesystem must be provided to a container for execution.  The container
 will use this root filesystem (rootfs) to jail and spawn processes inside where
 the binaries and system libraries are local to that directory.  Any binaries
 to be executed must be contained within this rootfs.

 Mounts that happen inside the container are automatically cleaned up when the
 container exits: the mount namespace is destroyed and the kernel
 unmounts everything that was set up within that namespace.

 For a container to execute properly, certain filesystems must be mounted
 within the rootfs; the runtime sets these up.

 |     Path    |  Type  |                  Flags                 |                 Data                     |
 | ----------- | ------ | -------------------------------------- | ---------------------------------------- |
 | /proc       | proc   | MS_NOEXEC,MS_NOSUID,MS_NODEV           |                                          |
 | /dev        | tmpfs  | MS_NOEXEC,MS_STRICTATIME               | mode=755                                 |
 | /dev/shm    | tmpfs  | MS_NOEXEC,MS_NOSUID,MS_NODEV           | mode=1777,size=65536k                    |
 | /dev/mqueue | mqueue | MS_NOEXEC,MS_NOSUID,MS_NODEV           |                                          |
 | /dev/pts    | devpts | MS_NOEXEC,MS_NOSUID                    | newinstance,ptmxmode=0666,mode=620,gid=5 |
 | /sys        | sysfs  | MS_NOEXEC,MS_NOSUID,MS_NODEV,MS_RDONLY |                                          |
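
As a rough illustration, the mount table can be expressed as static data that
a runtime walks when preparing the rootfs.  The struct and helper names
(`required_mount`, `v1_mounts`, `find_required_mount`) are assumptions made
for this sketch; the flag constants come from `<sys/mount.h>`.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <sys/mount.h> /* MS_* flags */

/* One row of the required-mounts table. */
struct required_mount {
    const char *path;    /* destination, relative to the rootfs */
    const char *type;    /* filesystem type */
    unsigned long flags; /* MS_* mount flags */
    const char *data;    /* filesystem-specific options, or NULL */
};

const struct required_mount v1_mounts[] = {
    { "/proc",       "proc",   MS_NOEXEC | MS_NOSUID | MS_NODEV, NULL },
    { "/dev",        "tmpfs",  MS_NOEXEC | MS_STRICTATIME, "mode=755" },
    { "/dev/shm",    "tmpfs",  MS_NOEXEC | MS_NOSUID | MS_NODEV,
      "mode=1777,size=65536k" },
    { "/dev/mqueue", "mqueue", MS_NOEXEC | MS_NOSUID | MS_NODEV, NULL },
    { "/dev/pts",    "devpts", MS_NOEXEC | MS_NOSUID,
      "newinstance,ptmxmode=0666,mode=620,gid=5" },
    { "/sys",        "sysfs",  MS_NOEXEC | MS_NOSUID | MS_NODEV | MS_RDONLY,
      NULL },
};
const size_t v1_mounts_len = sizeof(v1_mounts) / sizeof(v1_mounts[0]);

/* Hypothetical lookup helper: return the table row for path, or NULL. */
const struct required_mount *find_required_mount(const char *path)
{
    assert(path != NULL);
    for (size_t i = 0; i < v1_mounts_len; i++)
        if (strcmp(v1_mounts[i].path, path) == 0)
            return &v1_mounts[i];
    return NULL;
}
```

For each row `m`, a runtime would then call something like
`mount(m->type, dest, m->type, m->flags, m->data)`, where `dest` is the rootfs
path joined with `m->path`.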


 After a container's filesystems are mounted within the newly created
 mount namespace, `/dev` needs to be populated with a set of device nodes.
 A rootfs is not expected to ship any device nodes under `/dev`, because
 the runtime sets up the devices required to execute a container's process.

 |      Path    | Mode |   Access   |
 | ------------ | ---- | ---------- |
 | /dev/null    | 0666 |  rwm       |
 | /dev/zero    | 0666 |  rwm       |
 | /dev/full    | 0666 |  rwm       |
 | /dev/tty     | 0666 |  rwm       |
 | /dev/random  | 0666 |  rwm       |
 | /dev/urandom | 0666 |  rwm       |
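
The table lists only paths and modes.  The sketch below adds the standard
Linux major/minor numbers for these character devices, which are not stated in
this spec, to show what a runtime needs in order to create the nodes.  All
names (`default_device`, `v1_devices`, `device_number`) are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/stat.h>      /* mode_t */
#include <sys/sysmacros.h> /* makedev, major, minor */
#include <sys/types.h>     /* dev_t */

struct default_device {
    const char *path;
    unsigned int dev_major; /* assumption: standard Linux device numbers, */
    unsigned int dev_minor; /* not taken from the spec                   */
    mode_t mode;
};

const struct default_device v1_devices[] = {
    { "/dev/null",    1, 3, 0666 },
    { "/dev/zero",    1, 5, 0666 },
    { "/dev/full",    1, 7, 0666 },
    { "/dev/tty",     5, 0, 0666 },
    { "/dev/random",  1, 8, 0666 },
    { "/dev/urandom", 1, 9, 0666 },
};

/* Build the dev_t that mknod(2) would need for one table row. */
dev_t device_number(const struct default_device *d)
{
    assert(d != NULL);
    return makedev(d->dev_major, d->dev_minor);
}
```

Creating a node would then look like
`mknod(path, S_IFCHR | d->mode, device_number(d))`, which requires `CAP_MKNOD`.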


 **ptmx**
 `/dev/ptmx` will need to be a symlink to the host's `/dev/ptmx` within
 the container.

 The use of a pseudo TTY is optional within a container, and a runtime should
 support running both with and without one.  If a pseudo TTY is provided to the
 container, `/dev/console` needs to be set up by bind mounting the console into
 `/dev/` after `/dev/` has been mounted as tmpfs and populated.

 |      Source     | Destination  | UID GID | Mode | Type |
 | --------------- | ------------ | ------- | ---- | ---- |
 | *pty host path* | /dev/console | 0 0     | 0600 | bind |


 After `/dev/null` has been set up, we check whether any of the container's io
 (STDIN, STDOUT, STDERR) links to something outside the container.  If any of it
 points to `/dev/null` outside the container, we close it and `dup2` the
 `/dev/null` that is local to the container's rootfs.
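
One possible shape for that check is sketched below.  It assumes the null
device can be recognised as character device 1:3, which is an assumption of
this example rather than something the spec states, and `reopen_dev_null` is a
hypothetical name.

```c
#include <assert.h>
#include <fcntl.h>
#include <sys/stat.h>
#include <sys/sysmacros.h>
#include <unistd.h>

/*
 * If fd refers to the null device, replace it with a descriptor opened
 * from local_null (the container's own /dev/null).
 * Returns 1 if fd was replaced, 0 if it was left alone, -1 on error.
 */
int reopen_dev_null(int fd, const char *local_null)
{
    struct stat st;
    int nullfd, rc;

    assert(local_null != NULL);
    if (fstat(fd, &st) < 0)
        return -1;
    if (!S_ISCHR(st.st_mode) ||
        major(st.st_rdev) != 1 || minor(st.st_rdev) != 3)
        return 0; /* not the null device: keep the caller's descriptor */

    nullfd = open(local_null, O_RDWR);
    if (nullfd < 0)
        return -1;
    /* dup2 closes fd and repoints it at the container-local node. */
    rc = dup2(nullfd, fd);
    if (nullfd != fd)
        close(nullfd);
    return rc < 0 ? -1 : 1;
}
```

A runtime would call this once for each of file descriptors 0, 1, and 2.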


 After the container has `/proc` mounted, a few standard symlinks are set up
 within `/dev/` for the io.

 |    Source       | Destination |
 | --------------- | ----------- |
 | /proc/self/fd   | /dev/fd     |
 | /proc/self/fd/0 | /dev/stdin  |
 | /proc/self/fd/1 | /dev/stdout |
 | /proc/self/fd/2 | /dev/stderr |
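
A minimal sketch of creating these links is shown here.  The helper name
`setup_dev_symlinks` and its `devdir` parameter are hypothetical; in a real
runtime `devdir` would be the container's `/dev`.

```c
#include <assert.h>
#include <stdio.h>
#include <unistd.h>

/* The four links from the table; destinations are relative to devdir. */
const char *const dev_links[][2] = {
    { "/proc/self/fd",   "fd" },
    { "/proc/self/fd/0", "stdin" },
    { "/proc/self/fd/1", "stdout" },
    { "/proc/self/fd/2", "stderr" },
};

/* Create every link under devdir.  Returns 0 on success, -1 on failure. */
int setup_dev_symlinks(const char *devdir)
{
    char dst[4096];

    assert(devdir != NULL);
    for (size_t i = 0; i < sizeof(dev_links) / sizeof(dev_links[0]); i++) {
        int n = snprintf(dst, sizeof(dst), "%s/%s", devdir, dev_links[i][1]);
        if (n < 0 || (size_t)n >= sizeof(dst))
            return -1; /* path would be truncated */
        if (symlink(dev_links[i][0], dst) < 0)
            return -1;
    }
    return 0;
}
```

Because `symlink` fails when the destination already exists, this sketch
expects a freshly populated `/dev`.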

 A `pivot_root` is used to change the root for the process, effectively
 jailing the process inside the rootfs.

 ```c
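 put_old = mkdir(...);          /* temporary directory that receives the old root */
 pivot_root(rootfs, put_old);   /* glibc has no wrapper; invoked via syscall(2) */
 chdir("/");
 umount2(put_old, MNT_DETACH);  /* lazily detach the old root */
 rmdir(put_old);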
 ```

 For containers running with a rootfs inside `ramfs`, an `MS_MOVE` mount combined
 with a `chroot` is required, as `pivot_root` is not supported in `ramfs`.

 ```c
 mount(rootfs, "/", NULL, MS_MOVE, NULL);
 chroot(".");
 chdir("/");
 ```

 The `umask` is set back to `0022` after the filesystem setup has been completed.

 ### Resources

 Cgroups are used to handle resource allocation for containers.  This includes
 system resources like cpu, memory, and device access.

 | Subsystem  | Enabled |
 | ---------- | ------- |
 | devices    | 1       |
 | memory     | 1       |
 | cpu        | 1       |
 | cpuacct    | 1       |
 | cpuset     | 1       |
 | blkio      | 1       |
 | perf_event | 1       |
 | freezer    | 1       |
 | hugetlb    | 1       |
 | pids       | 1       |


 All cgroup subsystems are joined so that statistics can be collected from
 each of the subsystems.  Freezer does not expose any stats but is joined
 so that containers can be paused and resumed.

 The parent process of the container's init must place the init pid inside
 the correct cgroups before the initialization begins.  This is done so
 that no processes or threads escape the cgroups.  The two processes synchronize
 over a pipe (specified in the runtime section below): the container's
 init process blocks on it until the parent has finished setup.
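
Placing a pid in a cgroup comes down to writing it into a control file.  The
sketch below assumes the cgroup v1 `cgroup.procs` file and takes the cgroup
directory as a parameter; `cgroup_add_pid` is a hypothetical helper, and the
directory layout under the mounted hierarchy is left to the caller.

```c
#include <assert.h>
#include <stdio.h>
#include <sys/types.h>

/*
 * Add pid to the cgroup rooted at cgroup_dir by writing it to
 * cgroup.procs.  Returns 0 on success, -1 on failure.
 */
int cgroup_add_pid(const char *cgroup_dir, pid_t pid)
{
    char path[4096];
    FILE *f;
    int n, ok;

    assert(cgroup_dir != NULL && pid > 0);
    n = snprintf(path, sizeof(path), "%s/cgroup.procs", cgroup_dir);
    if (n < 0 || (size_t)n >= sizeof(path))
        return -1;
    f = fopen(path, "w");
    if (f == NULL)
        return -1;
    ok = fprintf(f, "%ld\n", (long)pid) > 0;
    /* A rejected pid only surfaces when the buffered write is flushed. */
    if (fclose(f) != 0)
        ok = 0;
    return ok ? 0 : -1;
}
```

The parent would call this once per subsystem hierarchy before releasing the
init process.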

 ### Security

 The standard set of Linux capabilities that is set in a container
 provides a good default balance of security and flexibility for applications.


 |     Capability       | Enabled |
 | -------------------- | ------- |
 | CAP_NET_RAW          | 1       |
 | CAP_NET_BIND_SERVICE | 1       |
 | CAP_AUDIT_READ       | 1       |
 | CAP_AUDIT_WRITE      | 1       |
 | CAP_DAC_OVERRIDE     | 1       |
 | CAP_SETFCAP          | 1       |
 | CAP_SETPCAP          | 1       |
 | CAP_SETGID           | 1       |
 | CAP_SETUID           | 1       |
 | CAP_MKNOD            | 1       |
 | CAP_CHOWN            | 1       |
 | CAP_FOWNER           | 1       |
 | CAP_FSETID           | 1       |
 | CAP_KILL             | 1       |
 | CAP_SYS_CHROOT       | 1       |
 | CAP_NET_BROADCAST    | 0       |
 | CAP_SYS_MODULE       | 0       |
 | CAP_SYS_RAWIO        | 0       |
 | CAP_SYS_PACCT        | 0       |
 | CAP_SYS_ADMIN        | 0       |
 | CAP_SYS_NICE         | 0       |
 | CAP_SYS_RESOURCE     | 0       |
 | CAP_SYS_TIME         | 0       |
 | CAP_SYS_TTY_CONFIG   | 0       |
 | CAP_AUDIT_CONTROL    | 0       |
 | CAP_MAC_OVERRIDE     | 0       |
 | CAP_MAC_ADMIN        | 0       |
 | CAP_NET_ADMIN        | 0       |
 | CAP_SYSLOG           | 0       |
 | CAP_DAC_READ_SEARCH  | 0       |
 | CAP_LINUX_IMMUTABLE  | 0       |
 | CAP_IPC_LOCK         | 0       |
 | CAP_IPC_OWNER        | 0       |
 | CAP_SYS_PTRACE       | 0       |
 | CAP_SYS_BOOT         | 0       |
 | CAP_LEASE            | 0       |
 | CAP_WAKE_ALARM       | 0       |
 | CAP_BLOCK_SUSPEND    | 0       |
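
For illustration, the enabled rows of the table can be folded into a single
bitmask indexed by capability number.  This is only a sketch: the helper names
`v1_default_caps` and `v1_cap_enabled` are hypothetical, and the numbers come
from `<linux/capability.h>`.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <linux/capability.h> /* CAP_* numbers, CAP_LAST_CAP */

/* Return a mask with one bit set per capability enabled in the table. */
uint64_t v1_default_caps(void)
{
    static const int enabled[] = {
        CAP_NET_RAW, CAP_NET_BIND_SERVICE, CAP_AUDIT_READ, CAP_AUDIT_WRITE,
        CAP_DAC_OVERRIDE, CAP_SETFCAP, CAP_SETPCAP, CAP_SETGID, CAP_SETUID,
        CAP_MKNOD, CAP_CHOWN, CAP_FOWNER, CAP_FSETID, CAP_KILL,
        CAP_SYS_CHROOT,
    };
    uint64_t mask = 0;

    for (size_t i = 0; i < sizeof(enabled) / sizeof(enabled[0]); i++) {
        assert(enabled[i] >= 0 && enabled[i] <= CAP_LAST_CAP);
        mask |= UINT64_C(1) << enabled[i];
    }
    return mask;
}

/* Test whether capability number cap is present in mask. */
int v1_cap_enabled(uint64_t mask, int cap)
{
    return (int)((mask >> cap) & 1);
}
```

Every capability absent from the mask corresponds to a `0` row in the table
and would be dropped before the container's process is executed.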


 Additional security layers like [apparmor](https://wiki.ubuntu.com/AppArmor)
 and [selinux](http://selinuxproject.org/page/Main_Page) can be used with
 the containers.  A container should support setting an apparmor profile or
 selinux process and mount labels if provided in the configuration.

 Standard apparmor profile:
 ```c
 #include <tunables/global>
 profile <profile_name> flags=(attach_disconnected,mediate_deleted) {
   #include <abstractions/base>
   network,
   capability,
   file,
   umount,

   deny @{PROC}/sys/fs/** wklx,
   deny @{PROC}/sysrq-trigger rwklx,
   deny @{PROC}/mem rwklx,
   deny @{PROC}/kmem rwklx,
   deny @{PROC}/sys/kernel/[^s][^h][^m]* wklx,
   deny @{PROC}/sys/kernel/*/** wklx,

   deny mount,

   deny /sys/[^f]*/** wklx,
   deny /sys/f[^s]*/** wklx,
   deny /sys/fs/[^c]*/** wklx,
   deny /sys/fs/c[^g]*/** wklx,
   deny /sys/fs/cg[^r]*/** wklx,
   deny /sys/firmware/efi/efivars/** rwklx,
   deny /sys/kernel/security/** rwklx,
 }
 ```

 *TODO: seccomp work is being done to find a good default config*

 ### Runtime and Init Process

 During container creation the parent process needs to talk to the container's init
 process and have a form of synchronization.  This is accomplished by creating
 a pipe that is passed to the container's init.  When the init process first spawns
 it will block on its side of the pipe until the parent closes its side.  This
 allows the parent to have time to set the new process inside a cgroup hierarchy
 and/or write any uid/gid mappings required for user namespaces.
 The pipe is passed to the init process via FD 3.
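
The handshake can be sketched with `pipe` and `fork`.  In this example the
child merely stands in for the container's init, and `spawn_with_sync_pipe`
and `example_parent_setup` are hypothetical names; a real runtime would do its
cgroup placement and uid/gid mapping inside the callback.

```c
#include <assert.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

int setup_calls; /* counts how often the example callback ran */

/* Placeholder for the parent-side work done while init is blocked. */
void example_parent_setup(pid_t child)
{
    assert(child > 0);
    setup_calls++;
}

/* Returns the child's exit status (0 if it saw EOF), or -1 on error. */
int spawn_with_sync_pipe(void (*parent_setup)(pid_t))
{
    int fds[2], status;
    pid_t pid;

    if (pipe(fds) < 0)
        return -1;
    pid = fork();
    if (pid < 0) {
        close(fds[0]);
        close(fds[1]);
        return -1;
    }
    if (pid == 0) {
        char c;
        close(fds[1]); /* drop the inherited write side, or EOF never arrives */
        if (fds[0] != 3) {
            dup2(fds[0], 3); /* the spec passes the pipe as FD 3 */
            close(fds[0]);
        }
        /* read() returns 0 once the parent closes its side of the pipe. */
        _exit(read(3, &c, 1) == 0 ? 0 : 1);
    }
    close(fds[0]);
    if (parent_setup != NULL)
        parent_setup(pid);
    close(fds[1]); /* releases the child */
    if (waitpid(pid, &status, 0) < 0)
        return -1;
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

Calling `spawn_with_sync_pipe(example_parent_setup)` runs the callback while
the child is still blocked, then lets the child continue.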

 The application consuming libcontainer should be compiled statically.  libcontainer
 does not define any init process and the arguments provided are used to `exec` the
 process inside the application.  There should be no long running init within the
 container spec.

 If a pseudo TTY is provided to a container, the console is opened and `dup2`ed
 onto the container's STDIN, STDOUT, and STDERR, and is also mounted
 as `/dev/console`.

 An extra set of mounts is provided to a container and set up for use.  A container's
 rootfs can contain non-portable files that cause side effects during the
 execution of a process.  These files are usually created and populated with
 container-specific information by the runtime.

 **Extra runtime files:**
 * /etc/hosts
 * /etc/resolv.conf
 * /etc/hostname
 * /etc/localtime


 #### Defaults

 There are a few defaults that users can override; when they are omitted,
 these values apply to processes within a container.

 |       Type          |             Value              |
 | ------------------- | ------------------------------ |
 | Parent Death Signal | SIGKILL                        |
 | UID                 | 0                              |
 | GID                 | 0                              |
 | GROUPS              | 0, NULL                        |
 | CWD                 | "/"                            |
 | $HOME               | Current user's home dir or "/" |
 | Readonly rootfs     | false                          |
 | Pseudo TTY          | false                          |


 ## Actions

 After a container is created there is a standard set of actions that can
 be done to the container.  These actions are part of the public API for
 a container.

 |     Action     |                         Description                                |
 | -------------- | ------------------------------------------------------------------ |
 | Get processes  | Return all the pids for processes running inside a container       |
 | Get Stats      | Return resource statistics for the container as a whole            |
 | Wait           | Wait on the container's init process ( pid 1 )                     |
 | Wait Process   | Wait on any of the container's processes returning the exit status |
 | Destroy        | Kill the container's init process and remove any filesystem state  |
 | Signal         | Send a signal to the container's init process                      |
 | Signal Process | Send a signal to any of the container's processes                  |
 | Pause          | Pause all processes inside the container                           |
 | Resume         | Resume all processes inside the container if paused                |
 | Exec           | Execute a new process inside of the container  ( requires setns )  |
 | Set            | Set up configs of the container after it's created                 |
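
As one example of how an action maps onto the kernel interface, Pause and
Resume can be built on the freezer cgroup by writing `FROZEN` or `THAWED` to
its `freezer.state` file.  The helper name `freezer_set_state` and the
directory parameter are assumptions of this sketch.

```c
#include <assert.h>
#include <stdio.h>

/*
 * Write the requested state to freezer.state under freezer_dir.
 * frozen != 0 pauses the cgroup, frozen == 0 resumes it.
 * Returns 0 on success, -1 on failure.
 */
int freezer_set_state(const char *freezer_dir, int frozen)
{
    char path[4096];
    FILE *f;
    int n, ok;

    assert(freezer_dir != NULL);
    n = snprintf(path, sizeof(path), "%s/freezer.state", freezer_dir);
    if (n < 0 || (size_t)n >= sizeof(path))
        return -1;
    f = fopen(path, "w");
    if (f == NULL)
        return -1;
    ok = fputs(frozen ? "FROZEN" : "THAWED", f) >= 0;
    if (fclose(f) != 0)
        ok = 0;
    return ok ? 0 : -1;
}
```

Freezing is asynchronous in the kernel, so a caller that needs certainty would
read `freezer.state` back until it reports `FROZEN`.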

 ### Execute a new process inside of a running container

 A user can execute a new process inside of a running container. Any binaries to be
 executed must be accessible within the container's rootfs.

 The started process will run inside the container's rootfs. Any changes
 made by the process to the container's filesystem will persist after the
 process has finished executing.

 The started process will join all the container's existing namespaces. When the
 container is paused, the process will also be paused and will resume when
 the container is unpaused.  The started process will only run when the container's
 primary process (PID 1) is running, and will not be restarted when the container
 is restarted.

 #### Planned additions

 The started process will have its own cgroups nested inside the container's
 cgroups. This is used for process tracking and optionally resource allocation
 handling for the new process. The freezer cgroup is required; the rest of the cgroups
 are optional. The process executor must place its pid inside the correct
 cgroups before starting the process. This is done so that no child processes or
 threads can escape the cgroups.

 When the process is stopped, the process executor will try (in a best-effort way)
 to stop all its children and remove the sub-cgroups.