This is the project for handling all crash related operations on the device. The intention is to be as low-overhead as possible while still maximizing the usefulness of crash reports and minimizing collection of any & all PII data.
For more background details, see the original design doc. This document focuses on the current state of the project.
No crashes get collected without explicit consent from the device owner AND the currently logged in user. This is normally part of the OOBE setup flow (but can also be controlled via OS Settings).
Consent is first set up by the device owner and covers all users of that device. If no consent has been granted, then
crash_reporter will generally exit rather than doing any further processing (e.g. running collectors). If the device owner consents, users will be able to opt in or out as well, post M-103.
metrics_library determines whether the currently logged-in-user consents to crash collection, with one exception: boot crash collection.
When the device kernel panics or otherwise forcibly reboots, we use the
/home/chronos/boot-collect-consent file to determine whether the user that was signed in at the time of the crash consented to crash collection, since no user will be logged in when the boot collector runs. If the last-logged-in user consented to collection, we then fall back to device policy.
The only case where
crash_reporter collects crashes even without consent is for early crashes that occur before stateful partitions are mounted (because we cannot check consent then).
crash_sender still checks consent before it uploads crashes.
Even if consent has been granted, there is one notable scenario where crash reports are still not uploaded -- guest mode. While system crashes are still collected (as they shouldn‘t have any user PII in them), browser and user crashes are not uploaded. This leads to a bit of a chicken & egg where real crashes in guest mode are not caught and uploaded for tracking, but we’re willing to accept that for the user privacy guarantees.
If consent is later revoked, we do not upload any crashes that had been queued. In general,
crash_sender will remove crashes that are present if consent is revoked, but in some cases it will not: most notably, if a crash is stored to a system directory (i.e., not in a cryptohome) and we're using per-user consent,
crash_sender keeps the report around in case it is associated with a consenting user that later logs in again.
Setting up the crash reporter early in the boot process is tricky: before the encrypted stateful mount is setup, crash reporter does not have access to persistent storage to store crashes in. We run crash-reporter-early-init.conf early in the boot process to configure the kernel to pass crashes to us with an “early” flag to use a tmpfs path instead of a persistent location for storing crash reports.
After the encrypted stateful mount setup is complete, we run crash-reporter.conf which sets the crash path back to the persistent storage crash directory and initializes any system state that depends on it.
Then during boot, but not in the critical path, we run crash-boot-collect.conf to gather up any previous crashes in the system. For example, if the system had rebooted due to a kernel panic, or some firmware (like the EC) had crashed. This also includes crashes that occurred during early boot before persistent storage was available: such crashes are stored into
/run/crash_reporter/crash (or sometimes
/mnt/stateful_partition/reboot_vault/crash) and collected into
/var/spool/crash once boot completes.
We run the anomaly_detector in the background to monitor for system “anomalies” for which a notification mechanism is not available. This daemon operates by monitoring syslog messages at a predefined interval. Depending on the anomaly, it generates crashes and D-Bus signals.
At this point, if no crashes occur, nothing ever happens! And of course, since we never have bugs, that's precisely what happens!
If a userland crash occurs that would trigger a coredump, the kernel will execute crash_reporter directly (via
/proc/sys/kernel/core_pattern) and pass in the coredump and some metadata. We process it directly and queue the report in the right crash directory. The exact processing steps depend on the specific collector involved, so see the Collectors section for more details. See Crash Report Storage for more details on queuing.
If a Chrome (browser) process crashes, crash_reporter will be called by the kernel, but we actually ignore that invocation. Instead, Chrome itself will take care of processing that crash report before calling crash_reporter with the final crash report for queuing.
Each collector has a
All unittests are named
xxx_test.cc and contain tests for the corresponding
All init related scripts live in the init/ directory.
The core_collector program is kept separate because it‘s otherwise completely standalone and doesn’t share any code.
That covers pretty much all the files in here.
Here is a brief summary of each init script we provide. More details on each can be found in the sections below.
This section is triggered via the crash-reporter.conf init script.
We want to initialize boot collection as early as possible so as to minimize the time frame for crashes that we'd miss otherwise. Currently we initialize during
boot-services and after
syslog. This is because the crash_reporter code will use syslog for output/errors of its own, even during early init.
For any userland crashes that occur before this point, the crash-reporter-early-init.conf script helps us collect these. It runs after pre-startup, and saves crashes into temporary directories. We then collect these with crash-boot-collect.conf after the stateful partition is available.
The init step itself should be fairly quick as it only initializes internal state paths (e.g. under
/var), and it configures the various
/sys paths that the kernel has setup. Most notably, this is responsible for writing
/proc/sys/kernel/core_pattern which tells the kernel to execute us whenever a crash is detected.
For more details, see the core(5) man page.
We store reports in a couple of different places.
/var/spool/crash/: Non-user (i.e. system) queued reports, when not logged in or half of the time when logged in.
/run/daemon-store/crash/<user_hash>. This avoids having crashes become inaccessible if a test logs the user out.
/home/chronos/<user_hash>/crash/: User-specific queued reports. Used when invoked as the user (e.g. by Chrome as
chronoswhile logged in). Only used half of the time -- otherwise we use daemon-store (see below).
/home/chronos/crash: Crashes from the
chronosuser when not logged in, for instance, if Chrome crashes while not logged in.
/run/daemon-store/crash/<user_hash>. This avoids having crashes become inaccessible if a test logs the user out.
/run/daemon-store/crash/<user_hash>: Some crashes from the
chronosuser are sent here. In addition, half of all crashes that occur when a user is logged in are sent here. In the long term, all crashes should go here when the user is logged in. (We send half of crashes now as part of an experiment to validate usage of daemon-store.)
/mnt/stateful_partition/unencrypted/preserve/crash: Crashes found early in the boot process (before
/var/spool/crashis available) are stored here if we wish to preserve them across clobbers.
/run/crash_reporter/crash: Crashes found early in the boot process (before
/var/spool/crashis available) are stored here.
These directories are only written to by the crash-reporter at collection time. The crash_sender program will also delete reports after it uploads them.
These directories are only read by the crash_sender program for uploading, and by feedback reports. Test frameworks (e.g. autotest, tast) also offload crash reports, and the crash_serializer binary helps them do so.
We enforce a limit of about 32 crashes per spool directory. This is to avoid filling up the underlying storage, especially if a daemon goes into a crash loop and generates a lot of crashes quickly.
Every collector produces a unique name of sorts which is then looked up in the crash_reporter_logs.conf file. For example, user_collector uses the program's base file name, while udev_collector constructs a more complicated format to make it easier to match unique events.
The value of that setting is an arbitrary shell script which is then executed. The output is captured and automatically attached to the corresponding report.
This is most useful to collect data from device or program specific paths that otherwise would not be collected. For example, for kernel wifi driver warnings, we'll dump some of the low level PCI register state for those specific pieces of hardware.
The output of this snippet should be kept small as there are size limits to report uploads, and to minimize PII data leaking in.
Metric collection is currently being evaluated for their usefulness. As it stands currently, metrics are still in the codebase, but ultimately disabled just before being collected.
See https://crbug.com/754850 for more details.
If we re-enable the metrics, then we‘ll fill out this section of the doc. If we delete the code entirely, then we’ll delete this section too.
The crash_sender program is responsible for uploading reports to the server. It tries to be unobtrusive in a number of ways:
While there are multiple local crash queues, these are global limits.
Before we upload any report, we check to see if consent is still granted. This way we stop uploading reports right away if things change.
We also check if the network is online before uploading a crash report and stop uploading reports until the next run of crash_sender if we are offline.
If an upload fails for any reason (flaky network, system going to sleep, etc...), then the report is left in the local queue and retried later.
We attempt to upload the oldest reports first to minimize them going stale.
Network proxies are respected; see the Proxies section for more details.
After a report has been successfully uploaded, we add details to the
/var/log/chrome/Crash Reports/uploads.log file. This is used by Chrome itself in its internal
chrome://crashes page to provide crash information to the user. The exact format of this file is controlled by Chrome. Here are just some of the files involved:
uploads.loglags behind other platforms. See https://crbug.com/275910 for more details.
At runtime, there are a few different ways one can trigger uploading. Since crash_sender has internal locking, you don't have to worry about any of these methods clobbering or racing with any other crash component.
--dev. (This will upload crashes to crash-staging, rather than crash.)
upload_crashescommand to trigger immediately.
chrome://crashespage has a “Start Uploading” link.
Once these crashes are uploaded, where do they go you might wonder? By default, they're sent to the Google Crash Server. For obvious privacy reasons, this server is only accessible to Google employees, and even then not all employees have access, only ones that need it to debug.
This is where the [list_proxies] tool comes in. It's a simple front end to sending queries to Chrome via D-Bus and then printing the results to stdout. This info can then be exported or passed to tools like
For Chromeless devices, this presents a bit of a challenge.
We won't cover (in depth) files covered by these topics:
Otherwise, this should be an exhaustive list of paths on the filesystem that this project uses in any fashion.
These paths are guaranteed to persist across boots.
/var/lib/crash_reporter/: Non-volatile state we need across reboots. Currently we use to cache previous OS details so we can correctly attribute previous kernel crashes to the previous OS version that was running.
/var/lib/crash_reporter/pending_clean_shutdown: Used by the unclean_shutdown_collector.
/var/lib/crash_sender/: Non-volatile state that crash_sender maintains. Used to keep track of how many reports have been uploaded (and when) so we can regulate our limits. Do not add any additional files to this directory or it will break timestamp calculations. Add additional state information to the ‘state’ subdirectory instead. Currently we store a ‘client_id’ in the ‘state’ subdirectory for maintaining a persistent device identifier for coalescing crash reports by device. This ID should never be used for any other purpose.
/var/log/messages: anomaly_detector monitors this (read-only).
These spool dirs are covered in detail in Crash Report Storage.
/var/spool/crash/: System crash reports.
/home/chronos/<user_hash>/crash/: User-specific queued reports.
/home/root/<user_hash>/crash-reporter/: User-specific queued reports.
These paths are guaranteed to be reset at every boot, so we only store active runtime details here.
/run/crash_reporter/: Used by all crash-reporter tools (i.e. both crash_reporter and crash_sender) for runtime state. This should be moved to
/run/crash-reporter/as the project name.
/run/lock/crash_sender: Used by crash_sender to guarantee only one upload instance is active at a time.
These are used to communicate with metrics_daemon.
/run/metrics/external/crash-reporter/kernel-crash-detected: Used by crash_reporter to signal the metrics_daemon that a kernel crash occurred. Also used by integration tests to verify this functionality.
/run/metrics/external/crash-reporter/unclean-shutdown-detected: Used by crash_reporter to signal the metrics_daemon that an unclean shutdown occurred. Also used by integration tests to verify this functionality.
This is used to communicate with powerd.
/run/crash_reporter/boot-collector-done: Used by crash_reporter to signal the powerd that boot collector has successfully completed per-boot crash collection. Also used by integration tests that rely on the boot collector.
/runfiles was done historically because the crash-reporter.conf init and crash-boot-collect.conf were one script that always executed early and before metrics_daemon started. However, now we can make crash-boot-collect.conf wait for metrics_daemon, we should be able to switch to its more standard existing IPC methods.
VMs may ship
/etc/vm_crash_filter.textproto within the VM filesystem. If present, it contains a protobuf message of type VmCrashFilters. It‘s used by UserCollector’s VmSupport to control which processes may have crash reports generated.
/var/lib/crash_sender_paused: Used by integration tests to pause crash_sender. This should be moved to
/var/spool/crash/mock-consent: Used by integration tests to persist mock consent across reboots. This file should be created with a small number greater than 0 (like 2). It will be deleted after that number of reboots.
/var/spool/crash/crash-test-in-progress: Used by integration tests to persist crash test in progress state across reboots. This file should be created with a small number greater than 0 (like 2). It will be deleted after that number of reboots.
/run/crash_reporter/crash-test-in-progress: Used by integration tests to tell tools they are being exercised by an integration test that tests the crash system itself and to adjust their behavior accordingly.
/run/crash_reporter/mock-crash-sending: Used by integration tests to tell crash_sender to mock out the actual upload for testing purposes.
/run/crash_reporter/mock-consent: Used by integration tests to tell the crash system to act as if the user had given consent for crashes to be collected and uploaded.
/run/crash_reporter/filter-in: Used by integration tests to tell the crash_reporter to ignore invocations unless the command line contains the contents of this file as a substring.
/run/crash_reporter/filter-out: Used by integration tests to tell the crash_reporter to ignore invocations if the command line contains the contents of this file as a substring.
/mnt/stateful_partition/etc/collect_chrome_crashes: Used by tast tests to let crash_reporter collect browser crashes directly (normally it ignores them and lets Chrome handle things). This should be moved to
/tmp/ usage by tools would normally be quite concerning due to the ease of predictable file names, but crash_reporter and crash_sender both execute in a minijail with a unique empty
/tmp/ mount (and automatically freed on exit). That mount is visible only to the active process (and its children), so other processes in the system won‘t be able to inject content. This doesn’t mean the
/tmp/ usage shouldn't be improved, but it at least mitigates collisions (accidental or otherwise).
/tmp/crash_reporter/<pid>/: crash_reporter writes intermediate data here for the
<pid>that just crashed.
/tmp/crash_sender.XXXXXX/: crash_sender holds intermediate upload report status here and is cleared after every run.
/tmp/: core_collector writes intermediate
/procfiles directly here.
Here's any other random paths we handle.
/root/.leave_core: Normally crash_reporter will delete the large coredumps after converting them to minidumps. When this file exists, coredumps are saved in the spool dir too. Created on dev images automatically. This should be moved to
/etc/crash-reporter/leave-core, and duplicated in
/run/crash-reporter/so people don't have to modify the rootfs.
Each crash report has a
.meta file with a unique basename. The
.meta file is a simple key-value store read by crash_sender. Each collector will generate this, and any supplemental files will use that same basename with other extensions. e.g. The main crash image will be a minidump but with a
.dmp extension. Supplemental log files will have a
However, crash_sender doesn't care about that. The
.meta file will have keys that point to those files directly.
The uploaded crashes are only minidumps (and related metadata). Actual stack walking and symbolification happens on the server. The symbols themselves are generated by CrOS builders.
For more details, see the Symbols document.
Since these tools often run as root and work with untrusted data, security is a big concern. Check out security.md for more details.
For tips for hacking on this project, check out hacking.md.