Sandboxing ChromeOS system services

In ChromeOS, OS-level functionality (such as configuring network interfaces) is implemented by a collection of system services and provided to Chrome over D-Bus. These system services have greater system and hardware access than the Chrome browser.

Separating functionality like this aims to prevent malicious websites from gaining access to OS-level functionality. If Chrome were able to directly control network interfaces, then a compromise in Chrome would give an attacker almost full control over the system. For example, by having a separate network manager, we can reduce the functionality exposed to an attacker to just querying interfaces and performing pre-determined actions on them.

ChromeOS uses a few different mechanisms to isolate system services from Chrome and from each other. We use a helper program called Minijail (executable minijail0). In most cases, Minijail is used in the service's init script. In other cases, Minijail wrappers are used if a service wants to apply restrictions to the programs that it launches, or to itself.

These different sandboxing mechanisms are described in the ChromeOS sandboxing talk (internal only).

The “Forbidden intersection”

The forbidden intersection is:

  • Running as root, or with CAP_SYS_ADMIN, and
  • Running in the init PID and mount namespaces, and
  • Running with neither a seccomp policy nor an enforcing SELinux domain.

You must avoid the forbidden intersection by having at least one of, preferably more than one of, ideally all of:

  • Running as a non-root user with no CAP_SYS_ADMIN (see User ids)
  • Running in a non-init mount and pid namespace (see Namespaces)
  • A seccomp policy OR running with an SELinux policy in an enforcing domain (see Seccomp filters or SELinux)

You don't normally need both Seccomp and SELinux but for very security-sensitive workloads this can be required.
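The rule above can be written down as a small predicate. This is only a sketch: the SandboxState fields are hypothetical names, the second condition is read literally as "both the init PID and the init mount namespace", and the third is read as "neither seccomp nor enforcing SELinux", which matches the remediation list.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SandboxState:
    """Hypothetical summary of how a service is sandboxed."""
    runs_as_root: bool
    has_cap_sys_admin: bool
    in_init_pid_ns: bool
    in_init_mount_ns: bool
    has_seccomp_policy: bool
    has_enforcing_selinux: bool


def in_forbidden_intersection(s: SandboxState) -> bool:
    """True only when all three conditions of the intersection hold."""
    privileged = s.runs_as_root or s.has_cap_sys_admin
    init_namespaces = s.in_init_pid_ns and s.in_init_mount_ns
    unfiltered = not (s.has_seccomp_policy or s.has_enforcing_selinux)
    return privileged and init_namespaces and unfiltered
```

Breaking any one of the three conditions is enough to leave the intersection, which is why a root service with a seccomp policy passes this check even though it is still far from ideal.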

Best practices for writing secure system services

Just remember that code has bugs, and these bugs can be used to take control of the code. An attacker can then do anything the original code was allowed to do. Therefore, code should only be given the absolute minimum level of privilege needed to perform its function.

Aim to keep your code lean, and your privileges low. Don't run your service as root. If you need to use third-party code that you didn't write, you should definitely not run it as root.

Use the libraries provided by the system/SDK. In ChromeOS, libchrome and libbrillo (née libchromeos) offer a lot of functionality, so you can avoid reinventing the wheel, poorly. Don't reinvent IPC: use D-Bus or Mojo. Don't open listening sockets: connect to the required service instead.

Don't (ab)use shell scripts: shell script logic is harder to reason about, and shell command-injection bugs are easy to miss. If you need functionality separated from your main service, use normal C++ binaries, not shell scripts. Moreover, when you execute them, consider further restricting their privileges (see section Minijail wrappers).

Just tell me what I need to do

  • Create a new user for your service: https://crrev.com/c/225257. Use this README as a guide on choosing and testing the new UID and GID.
  • Optionally, create a new group to control access to a resource and add the new user to that group: https://crrev.com/c/242551
  • Use Minijail to run your service as the user (and group) created in the previous step. In your init script:
    • exec minijail0 -u <user> -g <group> /full/path/to/binary
    • See section User ids.
  • If your service fails, you might need to grant it capabilities. See section Capabilities.
  • Use as many namespaces as possible. See section Namespaces.
  • Consider reducing the kernel attack surface exposed to your service by using seccomp filters. See section Seccomp filters.
  • Add your sandboxed service to the security.SandboxedServices test.

User ids

The first sandboxing mechanism is user ids. We try to run each service as its own user id, different from the root user, which allows us to restrict what files and directories the service can access, and also removes a big chunk of system functionality that's only available to the root user. Using the permission_broker service as an example, here's its Upstart config file (lives in /etc/init):

permission_broker.conf

start on starting system-services
stop on stopping system-services
respawn

# Run as 'devbroker' user.
exec minijail0 -u devbroker -c 'cap_chown,cap_fowner+eip' -- \
    /usr/bin/permission_broker

Minijail's -u argument forces the target program (in this case permission_broker) to be executed as the devbroker user, instead of the root user. This is the equivalent of running sudo -u devbroker.
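If a service or test harness needs to assemble the same command line programmatically, a helper along these lines may be useful. It is a sketch only: build_minijail_argv is a hypothetical name, and it uses just the -u, -g and -c switches shown in this document.

```python
from typing import List, Optional, Sequence


def build_minijail_argv(
    binary: str,
    user: str,
    group: Optional[str] = None,
    caps: Optional[str] = None,
    extra_args: Sequence[str] = (),
) -> List[str]:
    """Build a minijail0 argv that drops root before running `binary`."""
    if not binary.startswith("/"):
        raise ValueError("use the full path to the binary")
    argv = ["minijail0", "-u", user]
    if group is not None:
        argv += ["-g", group]
    if caps is not None:
        # Capability string in cap_from_text(3) format.
        argv += ["-c", caps]
    # "--" separates Minijail's own options from the target command.
    argv += ["--", binary, *extra_args]
    return argv
```

Building the command as a list, rather than as a single shell string, avoids the quoting and injection problems that come with shell scripts.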

The user (devbroker in this example) needs to first be added to the build system database. An example of this (for a different user) can be found in the following CL: https://crrev.com/c/361830

Next, the user needs to be installed on the system. An example of this (again for a different user) can be found in the following CL: https://crrev.com/c/383076

See the ChromeOS user accounts README for more details.

There's a test in the CQ that keeps track of the users present on the system that request additional access (e.g. listing more than one user in a group). If your user does that, the test baseline has to be updated at the same time the accounts are added with another CL (e.g. https://crrev.com/c/894192). If you're unsure whether you need this, the CQ will reject your CL when the test fails, so if the tests pass, you should be good to go!

You can use Cq-Depend to land the CLs together (see How do I specify the dependencies of a change?).

chronos-access membership requires SELinux

The Forbidden intersection notwithstanding, if a service accesses user data by having its user id in the chronos-access group, it must run with an enforcing SELinux domain. These services are accessing data owned by the chronos user, which could be malicious. A compromised Chrome browser could modify or corrupt chronos-owned data and attempt to escalate privileges by exploiting or confusing a service accessing this data.

SELinux is useful because it provides finer-grained control over what the service is allowed to do. Exploitation in these cases doesn't happen via memory corruption. Instead, the attacker will set up user data to confuse the service and trick it into performing valid operations on the wrong filesystem objects. This is usually referred to as a confused deputy attack. For example, a service might be tricked into mounting what it believes is a USB drive. In reality, however, it ends up mounting a virtual image on top of an existing file or directory, bypassing our write-XOR-execute restrictions.

Seccomp is not a great option to prevent this type of attack because it's not granular enough. In the previous example, since the service is allowed to perform mounts, the mount(2) system call has to be allowed. Seccomp does not have the ability to filter path arguments to system calls, so it would be impossible to restrict path arguments to the mount call. SELinux, on the other hand, allows us to restrict a service to only perform operations (like mount calls) on specific paths in the filesystem.

Capabilities

Some programs, however, require some of the system access usually granted only to the root user. We use Linux capabilities for this. Capabilities allow us to grant a specific subset of root's privileges to an otherwise unprivileged process. The capabilities(7) man page has the full list of capabilities that can be granted to a process. Some of them are equivalent to root, so we avoid granting those. In general, processes that need capabilities need them to configure network interfaces, access raw sockets, or perform specific file operations. Capabilities are passed to Minijail using the -c switch. permission_broker, for example, needs capabilities to be able to chown(2) device nodes.

permission_broker.conf

start on starting system-services
stop on stopping system-services
respawn

# Run as <devbroker> user.
# Grant CAP_CHOWN and CAP_FOWNER.
exec minijail0 -u devbroker -c 'cap_chown,cap_fowner+eip' -- \
    /usr/bin/permission_broker

Capabilities are expressed using the format that cap_from_text(3) accepts.
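To check which capabilities a running service actually ended up with, one option is to decode the CapEff line of /proc/<pid>/status. The sketch below assumes that approach; effective_caps is a hypothetical helper, and the table lists only the three capabilities discussed here (bit numbers taken from <linux/capability.h>).

```python
from typing import List

# Bit numbers from <linux/capability.h>; only the ones discussed here.
CAP_BITS = {
    "CAP_CHOWN": 0,
    "CAP_FOWNER": 3,
    "CAP_SYS_ADMIN": 21,
}


def effective_caps(proc_status: str) -> List[str]:
    """Return the known capability names set in the CapEff line."""
    for line in proc_status.splitlines():
        if line.startswith("CapEff:"):
            # The value is a hexadecimal bitmask, one bit per capability.
            mask = int(line.split()[1], 16)
            return [name for name, bit in CAP_BITS.items() if mask & (1 << bit)]
    raise ValueError("no CapEff line found")
```

For a process started with -c 'cap_chown,cap_fowner+eip', the mask would be expected to decode to CAP_CHOWN and CAP_FOWNER only. Seeing CAP_SYS_ADMIN in the result is a sign that the service is still too privileged.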

Namespaces

Many resources in the Linux world can be isolated now such that a process has its own view of things. For example, it has its own list of mount points, and any changes it makes (unmounting, mounting more devices, etc...) are only visible to it. This helps keep a broken process from messing up the settings of other processes.

For more in-depth details, see the namespaces overview.

In ChromiumOS, we like to see every process/daemon run under as many unique namespaces as possible. Many are easy to enable and reason about: if you don't use a particular resource, then isolating it is straightforward. If you do rely on it though, it can take more effort.

Here's a quick overview. Use the command line option if the description below matches your service (or if you don't know what functionality it's talking about; most likely you aren't using it!).

  • --profile=minimalistic-mountns: This is a good first default that enables mount and process namespaces. This only mounts /proc and creates a few basic device nodes in /dev. If you need more things mounted, you can use the -b (bind-mount) and -k (regular mount) flags.
  • --uts: Just always turn this on. It makes changes to the host / domain name not affect the rest of the system.
  • -e: If your process doesn't need network access. This also isolates netlink and UNIX abstract sockets. Note: D-Bus and syslog use named UNIX sockets, so they will still be usable (as long as you bind-mount them).
  • -l: If your process doesn't use SysV shared memory or IPC.
  • -p: If your process doesn't interact with other process PIDs (other than child processes).
  • -N: If your process doesn't need to modify common control groups settings.
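One way to confirm which namespaces a running service actually left behind is to compare its /proc/<pid>/ns links with those of init. The following is a sketch under that assumption; the function names are hypothetical, and reading /proc/1/ns generally requires root.

```python
import os
from typing import Dict, List

NAMESPACES = ("mnt", "pid", "net", "ipc", "uts", "cgroup")


def read_namespaces(pid: str) -> Dict[str, str]:
    """Read namespace identifiers such as 'mnt:[4026531840]' from /proc."""
    return {ns: os.readlink(f"/proc/{pid}/ns/{ns}") for ns in NAMESPACES}


def shared_with_init(init: Dict[str, str], service: Dict[str, str]) -> List[str]:
    """List the namespaces that the service still shares with init."""
    return [
        ns for ns in NAMESPACES
        if ns in init and init[ns] == service.get(ns)
    ]
```

On a device, shared_with_init(read_namespaces("1"), read_namespaces("<pid>")) would ideally return an empty list for a fully isolated service.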

When does a process need to run in the init mount or pid namespace?

Almost no processes need to run in the init namespaces. Please contact chromeos-security@ for a consultation if you believe that your process needs to run in the init mount or pid namespace.

Passing common resources

When using many namespaces to isolate a service, there are some resources that the service still reasonably should be able to access.

If bind-mounting on top of /run, you need to mount a tmpfs /run:

-k 'none,/run,tmpfs,MS_NODEV|MS_NOEXEC|MS_NOSUID,mode=755,size=10M'

If bind-mounting on top of /sys, you need to mount a tmpfs /sys:

-k 'none,/sys,tmpfs,MS_NODEV|MS_NOEXEC|MS_NOSUID,mode=755,size=10M'
  • syslog: If using -d to mount a minimal /dev, you can pass access to the syslog daemon by using -b /dev/log. If your process mounts all of /dev, you need to use -b /run/systemd/journal since /dev/log is a symlink to /run/systemd/journal/dev-log. In either case, you do not need to specify the writable flag to -b for this to work. This will work across all namespaces (including -e network).
  • D-Bus: You can access the system D-Bus by using -b /run/dbus. You do not need to specify the writable flag to -b for this to work. This will work across all namespaces (including -e network and -p pid).
  • Nameservers: If you need to resolve hostnames (DNS), you'll need to pass -b /run/shill as that daemon maintains the /etc/resolv.conf file.

Mount propagation type guidance

When creating a new mount or entering a new mount namespace, an important consideration is the mount propagation mode of the mount. Some background on mount types can be found in the Linux kernel mount documentation, but a brief summary is:

  • Shared (MS_SHARED) - allows mount/unmount events to flow in both directions between namespaces
  • Mounts-flow-in (MS_SLAVE) - only allows mount events to flow in from the parent namespace
  • Private (MS_PRIVATE) - allows no flow of mount events
  • Unbindable (MS_UNBINDABLE) - same as private + you also cannot bind mount to this mount point

For how to change the mount propagation mode when entering a new mount namespace see the section for -K[mode] in minijail0(1).
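To see which propagation mode each mount currently has, you can inspect the optional fields of /proc/<pid>/mountinfo: shared:N marks a shared mount, master:N marks a mounts-flow-in (slave) mount, and a mount with no such tag is private. The parser below is a minimal sketch of that idea; propagation_modes is a hypothetical helper name.

```python
from typing import Dict


def propagation_modes(mountinfo: str) -> Dict[str, str]:
    """Map each mount point in mountinfo text to its propagation mode."""
    modes = {}
    for line in mountinfo.splitlines():
        fields = line.split()
        # Optional fields sit between field 6 (mount options) and a lone "-".
        if "-" not in fields[6:]:
            continue
        sep = fields.index("-", 6)
        tags = [field.split(":")[0] for field in fields[6:sep]]
        if "unbindable" in tags:
            mode = "unbindable"
        elif "shared" in tags and "master" in tags:
            mode = "shared+slave"
        elif "shared" in tags:
            mode = "shared"
        elif "master" in tags:
            mode = "slave"
        else:
            mode = "private"
        modes[fields[4]] = mode
    return modes
```

Running it on the contents of /proc/self/mountinfo from inside the sandbox shows whether the -K mode you picked had the intended effect.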

When does a mount need to be shared?

Mounts need to be shared if and only if mount/unmount events need to flow between namespaces in both directions. If mounts only need to flow from one namespace to the other, then they must be shared on the parent namespace but can be mounts-flow-in on the child namespace. Making mounts shared increases the possibilities for interaction between processes that could undermine security of the system and user namespace separation.

Seccomp filters

Removing access to the filesystem and to root-only functionality is not enough to completely isolate a system service. A service running as its own user id and with no capabilities has access to a big chunk of the kernel API. The kernel therefore exposes a huge attack surface to non-root processes, and we would like to restrict what kernel functionality is available for sandboxed processes.

The mechanism we use is called Seccomp-BPF. Minijail can take a policy file that describes what syscalls will be allowed, what syscalls will be denied, and what syscalls will only be allowed with specific arguments. The full description of the policy file language can be found in the syscall_filter.c source.

Abridged policy for mtpd on amd64 platforms:

# Copyright (c) 2012 The Chromium OS Authors. All rights reserved.
# Use of this source code is governed by a BSD-style license that can be
# found in the LICENSE file.
read: 1
ioctl: 1
write: 1
timerfd_settime: 1
open: 1
poll: 1
close: 1
# Don't allow mmap with both PROT_WRITE and PROT_EXEC.
mmap: arg2 in ~PROT_EXEC || arg2 in ~PROT_WRITE
mremap: 1
munmap: 1
# Don't allow mprotect with PROT_EXEC.
mprotect: arg2 in ~PROT_EXEC
lseek: 1
# Allow socket(domain==PF_LOCAL) or socket(domain==PF_NETLINK)
socket: arg0 == 0x1 || arg0 == 0x10
# Allow PR_SET_NAME from libchrome's base::PlatformThread::SetName()
prctl: arg0 == 0xf

Any syscall not explicitly mentioned, when called, results in the process being killed. The policy file can also tell the kernel to fail the system call (returning -1 and setting errno) without killing the process:

# execve: return EPERM
execve: return 1
NOTE: mmap and mprotect both have argument filters to prevent writable executable memory, since that makes certain classes of attacks much easier. In most cases mprotect does not need PROT_EXEC, but you might have to use arg2 in ~PROT_EXEC || arg2 in ~PROT_WRITE, just like mmap, in cases where child processes are executed and need to dynamically link shared libraries, or where the code implements a JIT compiler.
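The argument filters are easier to follow when evaluated by hand. The model below assumes that the policy language's in operator means "every bit set in the argument is also set in the mask"; it is an illustration of the two filter expressions in the policy, not Minijail's implementation.

```python
PROT_READ, PROT_WRITE, PROT_EXEC = 0x1, 0x2, 0x4
MASK_64 = (1 << 64) - 1  # Syscall arguments are treated as 64-bit values.


def arg_in(arg: int, mask: int) -> bool:
    """Model of `in`: no bit outside of mask may be set in arg."""
    return (arg & ~mask & MASK_64) == 0


def mmap_allowed(prot: int) -> bool:
    """Model of: mmap: arg2 in ~PROT_EXEC || arg2 in ~PROT_WRITE"""
    return arg_in(prot, ~PROT_EXEC) or arg_in(prot, ~PROT_WRITE)


def mprotect_allowed(prot: int) -> bool:
    """Model of: mprotect: arg2 in ~PROT_EXEC"""
    return arg_in(prot, ~PROT_EXEC)
```

Under this model, mmap with PROT_READ|PROT_WRITE or PROT_READ|PROT_EXEC is allowed, while PROT_WRITE|PROT_EXEC is rejected, and mprotect rejects any request that includes PROT_EXEC.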

Generating seccomp policies using audit (on 4.14+ kernels)

On kernels 4.14 and above we can use the new SECCOMP_RET_LOG return value to make policy generation easier. On these kernels, the -L Minijail option will use SECCOMP_RET_LOG as the return value for blocked syscalls: those not listed in the policy or whose arguments don't match the policy. Instead of killing the process on a blocked syscall, the kernel will log the otherwise blocked syscall but will effectively allow it.

The advantage of this mechanism versus what we have available in pre-4.14 kernels is that instead of having to add syscalls to the policy one by one, you can run the process with -L, get a list of all the syscalls not included in the policy, review them, and automatically generate or augment a policy, all in one step.

NOTE: Minijail's -L flag requires minijail to be built with USE=cros-debug. Generally, this means that it will not work out of the box in prebuilt (e.g. Goldeneye, CPFE, etc.) OS images.

Our recommended way of using this functionality is to start with an empty policy which will cause all syscalls to be logged-but-allowed. The resulting audit logs (at /var/log/audit/audit.log*) can then be parsed with the generate_seccomp_policy.py script to automatically generate a policy. There's a bit of extra setup required and some associated caveats. Please see the detailed instructions at generate_seccomp_policy.py's README section on using Linux audit logs to generate policy.

This mechanism can also be combined with the strace-based mechanism below: run the process to be sandboxed under strace, generate a base policy using the policy generation script, and then refine it using -L.

Generating a seccomp policy using strace

This is the old and familiar way of generating policies by inspecting syscalls using strace. It does not have any kernel version dependencies and also does not require a minijail build with USE=cros-debug. Similar to audit logs above, the generate_seccomp_policy.py script can accept strace logs from an unsandboxed process to generate a policy.

NOTE: When creating strace logs for arm64, make sure you're running in an arm64 userland, as most devices that support arm64 kernels run a 32-bit arm userland by default. The image running on the device should be built for a 64-bit arm userland. For example, a kevin device can run an image built for a 32-bit arm userland with the --board=kevin flag, or an image built for a 64-bit arm userland with the --board=kevin64 flag. You can run file -L /bin/sh to check which environment you're running in.

Generate and pre-process the strace log

strace -f -o strace.log <program>

When sandboxing a dynamically-linked executable, Minijail will default to using LD_PRELOAD to install the seccomp filter. This will install the filter after glibc initialization, so remove the syscalls related to glibc initialization to obtain a smaller filter (and a tighter sandbox). Those are normally everything up to and including the following:

rt_sigaction(SIGRTMIN, {<sa_handler>, [], SA_RESTORER|SA_SIGINFO, <sa_restorer>}, NULL, 8) = 0
rt_sigaction(SIGRT_1, {<sa_handler>, [], SA_RESTORER|SA_RESTART|SA_SIGINFO, <sa_restorer>}, NULL, 8) = 0
rt_sigprocmask(SIG_UNBLOCK, [RTMIN RT_1], NULL, 8) = 0
getrlimit(RLIMIT_STACK, {rlim_cur=8192*1024, rlim_max=RLIM64_INFINITY}) = 0
brk(NULL)                               = <addr>
brk(<addr>)                             = <addr>

If you want to collect strace logs for an existing service and you already have a test device set up, the following steps can help especially if you are extending an existing policy:

  1. Mount the rootfs read-write: mount -o remount,rw /
  2. Edit the init config file in /etc/init
    • Temporarily disable seccomp if it is present.
    • Add strace -f -o /tmp/strace.log before Minijail is invoked. Note that this will include all the minijail0 syscalls as well, but you can exclude them later.
  3. Reboot, or reload the config with initctl reload-configuration and restart the service.
  4. Run tests or perform actions that exercise the features of the service.
  5. Collect the resulting file /tmp/strace.log.

Generate policy using strace.log

~/chromiumos/src/aosp/external/minijail/tools/generate_seccomp_policy.py strace.log > $PROGRAM_NAME.policy
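To get a feel for what the script extracts from the log, here is a much-simplified sketch that counts syscall names in strace -f output and emits an allow-everything-seen policy. It is not the real generate_seccomp_policy.py: the function names are hypothetical, and it does no argument filtering at all.

```python
import re
from collections import Counter

# Matches "1234 read(3, ..." (strace -f -o prefixes lines with a PID)
# as well as "read(3, ..." (no PID prefix).
SYSCALL_RE = re.compile(r"^(?:\d+\s+)?([a-z_][a-z0-9_]*)\(")


def count_syscalls(strace_log: str) -> Counter:
    """Count how often each syscall name appears in a strace log."""
    counts = Counter()
    for line in strace_log.splitlines():
        match = SYSCALL_RE.match(line)
        # Signal, exit and "<... resumed>" lines do not match and are skipped.
        if match:
            counts[match.group(1)] += 1
    return counts


def to_policy(counts: Counter) -> str:
    """Emit one 'name: 1' line per syscall, most frequent first."""
    return "\n".join(f"{name}: 1" for name, _ in counts.most_common())
```

Review the output by hand before using it: syscalls such as mmap, mprotect, socket and ioctl usually deserve argument filters rather than a blanket 1.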

Testing and troubleshooting

  • Test the policy:

    • minijail0 -S seccomp.policy -L <cmd>
    • If the policy is incomplete:
      • (Kernel >=4.14) Minijail/audit will log all offending syscalls to /var/log/audit/audit.log (as discussed above).
      • (Kernel <4.14) Minijail will crash and log the first offending syscall in /var/log/messages.
  • You should ensure that your service is executing as many of its code paths as possible when executing strace or auditing, in particular error paths. In some cases it may be easier to add syscalls manually (for instance, abort) rather than forcing execution of those paths.

  • When you're collecting strace logs to create a seccomp policy, make sure you're in the targeted userland. You can check the environment by running file -L /bin/sh.

  • When not using the -n Minijail flag, privilege-dropping syscalls happen after the filter is installed, and they won't show up in normal program execution because they're called by Minijail itself. Because the audit log method uses Minijail, this is only applicable to the strace method. The following syscalls need to be added to the policy:

    • setgroups(2), setresgid(2), and setresuid(2) for dropping root.
    • capget(2), capset(2), and prctl(2) for dropping capabilities.
    • Normally, it's just easier to use the -n flag (no_new_privs), which prevents the sandboxed process from obtaining new privileges and is therefore a good addition for sandboxing.
  • Sometimes Minijail will fail to compile the seccomp filter with an error similar to:

WARNING minijail0[32315]: libminijail[32315]: trailing garbage after constant: 'LOOP_GET_STATUS64'
WARNING minijail0[32315]: libminijail[32315]: compile_atom: /usr/share/policy/e2fsck-seccomp.policy(13): invalid constant 'LOOP_GET_STATUS64'
WARNING minijail0[32315]: libminijail[32315]: could not allocate filter block
WARNING minijail0[32315]: libminijail[32315]: compile_filter: compile_file() failed
ERR minijail0[32315]: libminijail[32315]: failed to compile seccomp filter BPF program in '/usr/share/policy/e2fsck-seccomp.policy'
  • This means that one of the constant parameters provided to a syscall could not be resolved: e.g. ioctl: arg1 == LOOP_GET_STATUS64.

  • To fix this, look up the hex value of the constant and substitute the constant e.g. ioctl: arg1 == 0x4C05.

  • Minijail resolves these constants based on headers that are in gen_constants-inl.h.

  • Some syscalls may be triggered more deterministically on VMs than in tests on real hardware (e.g. clock_gettime and gettimeofday). Developers should be on the lookout for such failures and the seccomp policies should be adjusted accordingly.
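The constant substitution described above can be scripted when a policy contains several unresolved names. This is a sketch under the assumption that you have already looked up the numeric values yourself; substitute_constants is a hypothetical helper, and comment lines are left untouched.

```python
import re
from typing import Dict


def substitute_constants(policy: str, values: Dict[str, int]) -> str:
    """Replace known symbolic constants in policy rules with hex literals."""
    def replace(match):
        name = match.group(0)
        # Names without a known value are left as they are.
        return hex(values[name]) if name in values else name

    lines = []
    for line in policy.splitlines():
        if not line.lstrip().startswith("#"):
            line = re.sub(r"\b[A-Z][A-Z0-9_]+\b", replace, line)
        lines.append(line)
    return "\n".join(lines)
```

For example, with the mapping {"LOOP_GET_STATUS64": 0x4C05}, the rule ioctl: arg1 == LOOP_GET_STATUS64 becomes ioctl: arg1 == 0x4c05.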

Finding failing syscalls

If a process violates its seccomp policy, it'll be terminated with SIGSYS (Bad system call) and you'll see a message like this in /var/log/messages:

WARNING minijail0[1415]: libminijail[1415]: child process 1417 had a policy violation (/usr/share/policy/foo.policy)

There are a couple of ways to find out what syscall caused the violation:

  • If you can deploy a debug build of Minijail (the chromeos-base/minijail package built with USE="cros-debug") to the device, and add -L to the Minijail command, information on failing syscalls will be logged to /var/log/audit/audit.log. (Kernel version 4.14 or above is required for this to work.) Note that -L also stops policy violations from crashing the process, so execution will succeed.

    A lot of other things are logged to audit.log, so to locate the messages related to your binary, look for messages starting with type=SECCOMP. For example, here's a violation message caused by the true command making syscall 157:

    type=SECCOMP msg=audit(1641860367.246:224): auid=0 uid=0 gid=0 ses=2 subj=u:r:minijail:s0 pid=13686 comm="true" exe="/usr/bin/coreutils" sig=0 arch=c000003e syscall=157 compat=0 ip=0x7a28525bde35 code=0x7ffc0000
    

    (Note that longer command names may be truncated in the comm= field.)

  • If you have a core dump or minidump (such as the .core file found in /var/spool/crash/, or the minidump attached to a crash report), you can open it in a debugger and dump the register values (e.g. with info registers in gdb or register read in lldb). Check the syscall calling conventions for your architecture to determine which registers contain the syscall number and arguments. (Note: many devices with 64-bit ARM (arm64, aarch64) kernels run userspace programs in 32-bit mode, so you'll want to follow the arm calling conventions in that case.)
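When audit.log is large, pulling the type=SECCOMP records out with a short script can be quicker than reading it by eye. The sketch below parses the key=value fields of records shaped like the example above; the function names are hypothetical.

```python
import re
from typing import Dict, List, Optional, Tuple

# key=value pairs, where the value is either quoted or runs to the next space.
FIELD_RE = re.compile(r'(\w+)=("[^"]*"|\S+)')


def parse_seccomp_record(line: str) -> Optional[Dict[str, str]]:
    """Return the fields of a type=SECCOMP record, or None for other lines."""
    fields = {key: value.strip('"') for key, value in FIELD_RE.findall(line)}
    if fields.get("type") != "SECCOMP":
        return None
    return fields


def violations(audit_log: str) -> List[Tuple[str, int]]:
    """List (comm, syscall number) for every SECCOMP record in the log."""
    result = []
    for line in audit_log.splitlines():
        record = parse_seccomp_record(line)
        if record is not None and "syscall" in record:
            result.append((record.get("comm", ""), int(record["syscall"])))
    return result
```

The syscall numbers returned are specific to the architecture given in the record's arch= field, so look them up in the table for that architecture.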

Once you have the syscall number, find out the name of the syscall by looking it up in our online syscalls table or in minijail0 -H (run on the same architecture as the program you're debugging). You can then add the syscall to your policy.

If you do not want to allow an entire syscall, you can allow only certain parameters, e.g. ioctl: arg1 == FDGETPRM. You can find the values of these parameters from a core dump or minidump using a debugger, as described above.

Installing and applying the generated policy

The policy file needs to be installed in the system, so we need to add it to the ebuild file. For example:

mtpd-9999.ebuild

# Install seccomp policy file.
insinto /usr/share/policy
use seccomp && newins "mtpd-seccomp-${ARCH}.policy" mtpd-seccomp.policy

And finally, the policy file has to be passed to Minijail, using the -S option. Again, using mtpd as an example:

mtpd.conf

# use minijail (drop root, set no_new_privs, set seccomp filter).
# Mount /proc, /sys, /dev, /run/udev so that USB devices can be
# discovered.  Also mount /run/dbus to communicate with D-Bus.
exec minijail0 -i -I -p -l -r -v -t -u mtp -g mtp -G \
  -P /mnt/empty -b / -b /proc -b /sys -b /dev \
  -k tmpfs,/run,tmpfs,0xe -b /run/dbus -b /run/udev \
  -n -S /usr/share/policy/mtpd-seccomp.policy -- \
  /usr/sbin/mtpd -minloglevel="${MTPD_MINLOGLEVEL}"

Securely mounting cryptohome daemon store folders

Some daemons store user data on the user's cryptohome under /home/.shadow/<user_hash>/mount/root/<daemon_name> or equivalently /home/root/<user_hash>/<daemon_name>. For instance, Session Manager stores user policy under /home/root/<user_hash>/session_manager/policy. This is useful if the data should be protected from other users since the user's cryptohome is only mounted (and therefore decrypted) when the user logs in. If the user is not logged in, it is encrypted with the user's password.

However, if a daemon is already running inside a mount namespace (minijail0 -v ...) when the user's cryptohome is mounted, it does not see the mount since mount events do not propagate into mount namespaces by default. This propagation can be achieved, though, by making the parent mount a shared mount and the corresponding mount inside the namespace a shared or MS_SLAVE mount. See the shared subtrees doc.

To set up a cryptohome daemon store folder that propagates into your daemon's mount namespace, add this code to the src_install section of your daemon's ebuild:

local daemon_store="/etc/daemon-store/<daemon_name>"
dodir "${daemon_store}"
fperms 0700 "${daemon_store}"
fowners <daemon_user>:<daemon_group> "${daemon_store}"

This directory is never used directly. It merely serves as a secure template for the chromeos_startup script, which picks it up and creates /run/daemon-store/<daemon_name> as a shared mount.

Next, move the user/group setup to pkg_setup() since pkg_preinst(), where this is usually done, runs after src_install():

pkg_setup() {
	# Has to be done in pkg_setup() instead of pkg_preinst() since
	# src_install() needs <daemon_user> and <daemon_group>.
	enewuser <daemon_user>
	enewgroup <daemon_group>
	cros-workon_pkg_setup
}

In your daemon's init script, mount the daemon store folder as MS_SLAVE in your mount namespace. Be sure not to mount all of /run. Make sure to mount with the MS_REC flag to propagate any already-mounted cryptohome bind mounts into the mount namespace.

minijail0 -v -Kslave \
          -k 'tmpfs,/run,tmpfs,MS_NOSUID|MS_NODEV|MS_NOEXEC' \
          -k '/run/daemon-store/<daemon_name>,/run/daemon-store/<daemon_name>,none,MS_BIND|MS_REC' \
          ...

During sign-in, when the user's cryptohome is mounted, Cryptohome creates /home/.shadow/<user_hash>/mount/root/<daemon_name>, bind-mounts it to /run/daemon-store/<daemon_name>/<user_hash> and copies ownership and mode from /etc/daemon-store/<daemon_name> to the bind target. Since /run/daemon-store/<daemon_name> is a shared mount outside of the mount namespace and a MS_SLAVE mount inside, the mount event propagates into the daemon.

Your daemon can now use /run/daemon-store/<daemon_name>/<user_hash> to store user data once the user's cryptohome is mounted. Note that even though /run/daemon-store is on a tmpfs, your data is actually stored on disk and not lost on reboot.

Be sure not to write to the folder before the cryptohome is mounted. Consider listening to Session Manager's SessionStateChanged signal or similar to detect mount events. Note that /run/daemon-store/<daemon_name>/<user_hash> might exist even though cryptohome is not mounted, so testing existence is not enough (it only works the first time).

The <user_hash> can be retrieved with Cryptohome's GetSanitizedUsername D-Bus method.
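Because the directory can outlive the mount, a defensive check before writing might look at the mount table rather than at the filesystem. The sketch below is one possible complement to listening for session signals, not a replacement for it; both helper names are hypothetical, and the mountinfo text is expected to come from /proc/self/mountinfo.

```python
def daemon_store_path(daemon_name: str, user_hash: str) -> str:
    """Build the per-user daemon store path described above."""
    return f"/run/daemon-store/{daemon_name}/{user_hash}"


def is_mount_point(path: str, mountinfo: str) -> bool:
    """True if some entry in the mountinfo text is mounted exactly at path."""
    for line in mountinfo.splitlines():
        fields = line.split()
        # Field 5 (index 4) of each mountinfo line is the mount point.
        if len(fields) > 4 and fields[4] == path:
            return True
    return False
```

A daemon could call is_mount_point(daemon_store_path(name, user_hash), mountinfo_text) right before its first write and refuse to proceed if the result is False.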

The following diagram illustrates the mount event propagation:

Mount propagation diagram

Minijail wrappers

TODO(jorgelo)