2011-05-15
We intend to design a system for generating statistics and diagnostic information on a variety of crashes.
Our goals for devices that are opted into automatically submitting usage feedback and crash reports to Google:
Our non-goals:
Recognizing a wide variety of very bad user space problems such as a Chrome, X, or Window Manager that immediate exits and causes the machine to not be usable.
> Are you talking about processes that exit with a failed assert, or something else? If the former, it seems like we'd be able to report that.
> I‘m not opposed to adding this to the longer-term goals, but I’m not sure what interface would be appropriate here - looking through syslogs for errors, looking through application specific logs?
Our goal is to provide a stable platform. We need to be able to diagnose failures that do not occur in our labs or that are otherwise hard to reproduce automatically.
Google Breakpad is used by most Google applications on Mac, Windows, and Linux for x86 and ARM architectures to send “minidump” crash reports, very small crash reports that contain enough information to provide a stack trace of all threads running at the time of the crash. Google Breakpad does (as of Q1 2010) support ARM but is not yet used in production. Chrome in ChromeOS currently uses Google Breakpad and sends crash reports with product ID “Chrome_ChromeOS”.
The Canonical Ubuntu Linux project uses Apport to handle user space crashes. This is a Python package that intercepts all core file writes, invokes gdb, and collects stack dumps into a directory which it then sends out using an Anacron job. It relies on Python and debug information being present on the target.
The Linux kernel creates core files for processes that encounter unhandled signals. As of 2.6 kernels, the file location and naming can be customized by changing /proc/sys/kernel/core_pattern. Once core files are written they can be manually inspected by a developer or advanced user. Additionally, this kernel parameter can be set to cause a pipe to be opened to a user space process which then can receive the core file that would have been written to its stdin. We will rely on this mechanism to get diagnostic information and signaling for all user space processes.
On Windows, Microsoft has created the WINQUAL service which allows developers to retrieve crash diagnostics for their applications. When a Windows application crashes and does not handle the crash itself, the operating system prompts the user if they would like to send this particular crash report, and uploads it upon receiving consent. The WINQUAL service then aggregates and shows the reports. Crash reports can be sent as full dumps or as minidumps using the same format that Google Breakpad uses.
Linux Kernel Crash Dump (LKCD) is a set of kernel patches and a user-space application that enables a panicked Linux kernel to write crash diagnostics to its swap partition and then diagnose the crash and store it in a simplified form. It provides a command-line utility for diagnosing kernel state, but requires a fairly large file to be uploaded to diagnose the running kernel state remotely. This patch was last updated in 2005 and is an invasive kernel patch that's difficult to maintain and will never go upstream.
kexec based dump - is a method where the kernel “exec”s a new kernel that is stable into a reserved area of memory without performing a system reset first. The new stable kernel writes out the state and data of the old kernel to disk, whilst only operating from the reserved memory area. When any relevant state is written, the rest of memory is reclaimed and initialized and full system boot is completed. The patches for kexec are already upstream.
http://www.kerneloops.org/ - Collects crash dumps and provides a dashboard for all kernel developers to find crashes common across all versions, as well as specific to vendors/distributors. Provided they have enough server-side capacity to handle crash dumps from ChromeOS scale numbers of machines this is an option. kerneloops provides a user space process that runs at startup, prompts the user if they want to upload the kernel crash, and uploads an analyzed result.
Ubuntu uses Apport and LKCD to handle kernel crashes. It invokes lcrash to perform crash analysis on the vmcore image.
Firmware event logs can be stored in non-volatile storage. Traditionally problems during firmware initialization as well as kernel panic/problems can be placed here.
There can be different systems for recording kernel and user space crash diagnostic information. We ideally want stack traces at time of crash for either kind of crash. Some kinds of kernel crashes (i.e. crashes in interrupt handlers) by their nature will not be able to generate/persist any diagnostic information.
For user space crashes, we would need:
We will use rate limiting to avoid flooding our servers with crash diagnostics. We will limit to 8 crash reports per machine per day.
We need to build executables and libraries with debug information and upload these to crash server for those for which we would like stack traces with proper symbolic information.
We would like to have statistics on how often crashes are occurring in the field. For every release of ChromeOS on every device we would like to know how frequent unclean shutdowns, user space process, and kernel crashes are. Ideally we can know information on occurrences per individual user, for instance, knowing that 1% of users experience over 5 kernel panics per week. We will generate frequency data for these kinds of events in the course of a day and per week.
We must err on the side of getting too little information if the alternative is to potentially send sensitive information of a user who has not enabled this. As such, we should be careful, for instance, to not send kernel core files as the kernel core may have information for a variety of users. We also must avoid sending log files that may capture the accumulated activities of multiple users. We will send a unique but anonymous identifier per device to find potentially related crashes by those which happen on the same device and to help eliminate crashes from buggy/broken devices. User space processes which crash and which interact closely with the user, such as Chrome, the window manager, entd, and others are more likely to have sensitive data in memory at the time of crash. For this reason we encrypt the diagnostics generated from all executables which run as Linux user ‘chronos’ (which means they are started when the user logs in and terminated upon log out) when stored to disk. Since the encryption is based on the user‘s password, the only way a user’s crash diagnostics can be sent is when they are currently logged in to the device.
We will separate kernel and user space diagnostic gathering in implementation. Both however, need to adhere to our EULA with the user. During the out of box experience the owner chooses if he/she would like crashes on this device to be uploaded to Google servers. We must never send a crash report if they do not give consent. They may rescind their consent at any time which means that if we have enqueued a crash report to be sent which was created at a time when the user consented, and they rescind their consent before the crash is sent, the crash report must be discarded.
chronos
we enqueue its crash to /home/chronos/user/crash
which is on the cryptohome when a user is logged in and so it will be encrypted. If the crashed process was running as any other user, we enqueue the crash in /var/spool/crash
./home/root/<user_hash>/crash/
. This directory is still part of the cryptohome, but can be accessed without running as chronos. This will allow both creating and uploading crash reports with lower privileges./etc/crash_reporter.conf
contains a list of executables and shell commands for them. If an executable crashes which is listed in this file, the shell commands listed will be executed as root and their output will be sent in the crash report. For instance when the update_engine (auto updater) daemon crashes, this allows us to send the daemon's logs (listing attempts and application-level logs) in the crash report.crash_reporter
emitting a dbus signal that metrics_daemon
receives. That daemon generates and emits user metrics to Chrome./home/root/<user_hash>/crash/
which only requires being a member of the group crash-user-access
./sys/kernel/debug/preserved/kcrash
file./var/spool/crash
, /home/chronos/user/crash
, and /home/root/<user_hash>/crash
for reports, sends those, and removes them if successfully sent.One/some of these alternatives may indeed be what we implement in the longer run.
We considered linking Google Breakpad into every process with extra logic that determines where and how to store crash dumps. This was our first implementation. Unfortunately we cannot affect the linking of every process (since some come to Google as binary format). Also the act of installing a signal handler in every process can be disruptive. This could also be done at the libc level, such as how Android added to Bionic (their libc replacement) the ability to catch unhandled segfaults in process and signal a debugger process in the system. While possible, it seems tricky to be installing this into every process. The timing of when the library is initialized would be tricky, as well as watching for infinite loops (what if the crash sending process crashes).