Chromium Flag Guarding Guidelines
This document describes using base::Feature flags, which can be remotely set via a server. This applies to both A/B experiments (internal link) (disabled by default) and kill switches (internal link) (enabled by default).
Google maintains its own server, which you'll see referenced by its internal name, Finch. Other embedders can and do run their own servers for their products.
Goals
- Prevent large-scale outages of Chromium and Android WebView, and reduce the time it takes to respond to them
- Reduce the need for a binary respin to address problems in the field
- Catch regressions in core product vitals
Non-Goals
- Require a flag per CL/bug without consideration. This is not scalable. See the next section for guidance on when flags should be used.
- Flag-guarding of ChromeOS-specific features.
- Add a lot of long-lived server-configurable flags across the code base. New flags generated by this proposal should be removed 1-2 milestones after launch.
- Mandate that all changes be rolled out via server side configurations.
When is a flag required?
- Every project/feature launch shall use a flag unless it’s not feasible
- Every feature going through Launch Process (Note, you do not need a launch bug to use a flag)
- Every feature using origin trials, per existing guidelines
- Every deprecation/addition of web platform APIs
- Very large structural changes that have very different paths (e.g. navigation rewrite in PlzNavigate, networking rewrite in Network Service, Out-of-process Rasterization etc.).
- Refactorings in code paths that have historically been risky or prone to accidental breakages should also be treated the same as a new feature.
- Regardless of whether it is a new feature, refactoring, or bug fix, there is no minimum size that dictates whether a flag is required (either for isolated CLs or for many CLs that form a project/feature). A one-line change with the potential to impact stability, performance, or usability requires a flag just as much as a multi-thousand-line feature does.
- If, as a CL author, you are uncertain whether a flag can/should be used, talk to the relevant TL/Uber-TL and if still unsure, just use a flag.
When is a flag not required?
- Targeted/micro performance optimizations: projects like V8, Skia, and decoders that have their own large correctness and performance test suites do not have to use server rollouts, since their tests give them high confidence
- Changes to core data structures where it would almost be impossible to maintain both worlds (e.g. V8 pointer compression where adding a branch in each dereference would not be practical).
- Features shipped via component updater: if we ship a bad component we can update to a fixed one
- Chrome A/B binary experiments: we can use Omaha/AppStore/Play to update users from bad builds
- Non-chromium-repo binary drops: e.g. SwiftShader
- Rolling/Updating third party dependencies (e.g. libvpx, libwebp etc.)
- Mechanical/automated refactorings
- Changes to internal API naming
- Simple parameter changes (adding params, changing the type etc.)
- Isolated refactorings with high test coverage / confidence
What type of flag rollout to use?
- If a change has the potential to affect performance or memory (internal link), or you want to analyze the impact of the launch on feature-specific metrics, use a disabled-by-default base::Feature flag and run an A/B experiment. However, this should usually not be done for web platform changes. Non-Googler committers will need to work with owners of the code who work at Google to launch and monitor the experiment.
- Otherwise it should be guarded minimally by an enabled-by-default base::Feature flag, which can be remotely disabled by a server configuration.
- For code in Blink, this can be as simple as using a Runtime Enabled Feature, which has long been common practice for new or changed APIs.
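The difference between the two default states can be illustrated with a minimal, self-contained sketch. This is not Chromium's base::Feature API: the Flag struct, the FlagRegistry class, and both flag names are hypothetical stand-ins, and the registry only models the idea that a server-supplied override takes precedence over the compiled-in default.

```cpp
#include <cassert>
#include <map>
#include <string>

// Hypothetical stand-in for a feature flag; NOT Chromium's base::Feature.
enum class DefaultState { kDisabled, kEnabled };

struct Flag {
  std::string name;
  DefaultState default_state;
};

// Hypothetical registry holding overrides that a remote configuration
// server might deliver to a client.
class FlagRegistry {
 public:
  // Records a server-side decision for the named flag.
  void SetServerOverride(const std::string& name, bool enabled) {
    overrides_[name] = enabled;
  }

  // A server override wins; otherwise the compiled-in default applies.
  bool IsEnabled(const Flag& flag) const {
    auto it = overrides_.find(flag.name);
    if (it != overrides_.end()) {
      return it->second;
    }
    return flag.default_state == DefaultState::kEnabled;
  }

 private:
  std::map<std::string, bool> overrides_;
};

// A/B experiment style: off unless the server turns it on for a group.
const Flag kFasterCacheExperiment{"FasterCacheExperiment",
                                  DefaultState::kDisabled};

// Kill switch style: on unless the server turns it off.
const Flag kNewDialogKillSwitch{"NewDialogKillSwitch",
                                DefaultState::kEnabled};
```

With no server configuration, the experiment flag stays off and the kill-switch flag stays on. Calling SetServerOverride for either name flips the result of IsEnabled without shipping a new binary, which is the property both rollout types rely on.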
Prefer waterfall rollout for platform changes
Web developers expect web platform APIs to behave predictably for all users on a given version of Chrome, and violating that expectation creates a high risk of site breakage. If a site is broken for only some percentage of its users (developer: “It works on my machine!”), it will take longer for the issue to be noticed, reported, and root-caused.
For this reason, A/B testing or gradual Finch rollouts of developer-visible platform changes are discouraged and should not be done by default.
Instead, developer-visible platform changes, such as adding / removing APIs, should usually use a “waterfall” rollout: land the change default-enabled on trunk, and let it bake in canary, dev, and beta before going to stable through the normal release process.
This process ensures that developers have an opportunity to discover and file regressions by testing their sites in dev / beta, and have confidence that their sites work correctly on new Chrome releases before they reach stable users.
The code should still be guarded by a base::Feature which can be used for a Finch kill switch if a serious post-stable regression is discovered.
There are tradeoffs here, and in some cases those tradeoffs may favor a gradual Finch rollout of a platform change, but that should be the exception and not the rule.
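As a rough sketch of what a waterfall change with a kill switch looks like in code, the example below keeps the legacy path alive behind a default-enabled switch. The KillSwitch struct and the function names are hypothetical and do not reproduce Chromium's base::Feature API; the point is only that the new behavior ships on by default for everyone, while a server override can still route all users back to the old path.

```cpp
#include <cassert>
#include <string>

// Hypothetical kill switch; NOT Chromium's base::Feature.
// The change ships enabled by default, and a remote config may override it.
struct KillSwitch {
  bool enabled_by_default = true;
  bool has_server_override = false;  // true once a server config applies
  bool server_value = false;         // only meaningful if has_server_override

  bool IsEnabled() const {
    return has_server_override ? server_value : enabled_by_default;
  }
};

// The old behavior stays in the tree until the flag is removed.
std::string LegacyBehavior() { return "legacy"; }

// The new platform behavior, landed default-enabled on trunk.
std::string NewBehavior() { return "new"; }

// Every user on a given build gets the same path unless the server
// disables the switch after a serious regression is found.
std::string RunPlatformApi(const KillSwitch& kill_switch) {
  return kill_switch.IsEnabled() ? NewBehavior() : LegacyBehavior();
}
```

A default-constructed KillSwitch selects NewBehavior. Setting has_server_override to true with server_value left false makes RunPlatformApi fall back to LegacyBehavior, which models remotely disabling the change.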