Dictation is a ChromeOS accessibility feature that allows users to type and edit text with their voice.
Dictation can be toggled on in two ways: by pressing Search + D or by pressing the microphone icon (which has an accessibility label of “toggle dictation”) in the status tray. Once toggled on, Dictation starts speech recognition and the user can begin speaking. If a command is recognized, Dictation executes it to the best of its ability; otherwise it inputs the recognized text as-is. Dictation can be turned off in the same two ways; it also turns off automatically if no speech is recognized within a short period of time. Lastly, Dictation can only be used when focus is on an editable field (textarea, input, contenteditable, etc.).
It’s also worth noting that Dictation utilizes two DLCs (downloadable content packages): Pumpkin and SODA. Pumpkin is a semantic parser that allows Dictation to extract meaning from recognized text. SODA stands for “Speech On-Device API”; it turns the user’s speech into text on-device, without sending it to a Google server. When Pumpkin isn’t available or fails to download, Dictation falls back to regex-based speech parsing. Similarly, when SODA isn’t available, Dictation falls back to network speech recognition.
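The two fallback paths can be sketched as follows. This is a minimal illustration, not the extension's actual code; the type and function names here are hypothetical.

```typescript
// Hypothetical sketch of Dictation's two independent fallbacks:
// Pumpkin -> regex for parsing, SODA -> network for recognition.
type ParseStrategy = 'pumpkin' | 'regex';
type RecognitionType = 'onDevice' | 'network';

function chooseParseStrategy(pumpkinAvailable: boolean): ParseStrategy {
  // Pumpkin gives richer semantic parsing; regexes are the fallback.
  return pumpkinAvailable ? 'pumpkin' : 'regex';
}

function chooseRecognitionType(sodaAvailable: boolean): RecognitionType {
  // SODA keeps speech on-device; otherwise fall back to the network.
  return sodaAvailable ? 'onDevice' : 'network';
}
```

The two choices are independent: Dictation can, for example, run on-device recognition while still using regex parsing if only the SODA DLC is available.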
The majority of Dictation code lives in the dictation/ extension directory. There's also a small amount of C++ code, most of which is for the Dictation UI.
The dictation/ extension directory is broken into a few subdirectories organized by functionality:
The parse/ directory contains all code related to speech parsing, which is the process of turning recognized text into a command. Dictation currently utilizes two parsing strategies: regex-based, which uses regular expressions to match text to known commands, and Pumpkin-based, which uses a semantic parser developed by Google (see more about Pumpkin below).
The macros/ directory contains all code for macro (also known as command) implementation.
The remaining core classes live directly in the dictation/ directory. Some noteworthy classes are:
Dictation, which is the main object. It handles setup/teardown, interacts with APIs like chrome.speechRecognitionPrivate and chrome.settingsPrivate, and owns many other essential classes.
InputController, which handles all interaction with editable fields. In its current form, it uses various IME APIs to enter text into the editable field and to listen for changes to its value. It's also responsible for calculating data about the focused editable node: the current value, the selection start, and the selection end. Lastly, it implements many of the editing commands supported by Dictation.
UIController, which handles all interaction with the Dictation UI. Since the Dictation UI is implemented as a View in C++, it uses the chrome.accessibilityPrivate API to manipulate the UI. Generally, all changes to the UI should go through the UIController.
FocusHandler, which tracks the currently focused node using the automation API. Dictation can only work on editable nodes, so FocusHandler is used mostly to check this precondition (and also to access the node's accessibility data).
LocaleInfo, which is the source of truth for the Dictation locale and whether or not certain behaviors are supported in the current locale.
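The regex-based strategy in parse/ can be sketched as follows: try each known command pattern against the recognized text and, if none matches, fall through to inputting the text verbatim. The patterns and macro names below are illustrative, not the extension's actual definitions.

```typescript
// Hypothetical sketch of regex-based speech parsing. Each entry maps a
// pattern to a macro; unmatched utterances become plain text input.
interface ParseResult {
  macro: string;
  args?: Record<string, string>;
}

const COMMAND_PATTERNS: Array<[RegExp, (m: RegExpMatchArray) => ParseResult]> = [
  [/^delete$/i, () => ({macro: 'DELETE_PREV_CHAR'})],
  [/^undo$/i, () => ({macro: 'UNDO'})],
  [/^type (.+)$/i, m => ({macro: 'INPUT_TEXT', args: {text: m[1]}})],
];

function parse(text: string): ParseResult {
  for (const [pattern, toMacro] of COMMAND_PATTERNS) {
    const match = text.trim().match(pattern);
    if (match) {
      return toMacro(match);
    }
  }
  // No command matched: input the recognized text as-is.
  return {macro: 'INPUT_TEXT', args: {text}};
}
```

This strategy is simple but brittle (every phrasing must be anticipated by a pattern), which is the motivation for the Pumpkin-based semantic parser described below.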
DictationBubbleController manages the Dictation UI from the C++ side and provides an entry point for updating/changing the UI.
DictationBubbleView is the actual implementation of the Dictation UI.
AccessibilityController exposes several Dictation-related APIs, mostly around updating the UI and showing notifications.
AccessibilityManager contains a fair amount of Dictation logic, specifically around setting up/tearing down the extension, managing DLC downloads, and showing notifications to the user.
Two C++ helper classes worth noting are:
DictationBubbleTestHelper, which allows tests to query the state of the Dictation UI or wait for it to reach a certain state, and
SpeechRecognitionTestHelper, which allows tests to easily interact with speech recognizers.
Dictation utilizes a semantic parser called Pumpkin to extract meaning and intent out of text, allowing us to turn recognized text into commands. To use Pumpkin in Dictation, a few steps had to be taken:
Pumpkin and its associated config files take up roughly 5.9MB of space (estimate generated in December 2022). Adding this much overhead to rootfs was not feasible, so we added a DLC for Pumpkin so that it could be downloaded and used only when needed. We added a script in Google3 that quickly generates the DLC and uploads it to Google Cloud Storage whenever it needs to be updated.
We added logic in Dictation that initiates a download of the Pumpkin DLC. Dictation uses the chrome.accessibilityPrivate.installPumpkinForDictation() API to initiate the download. Once the DLC is downloaded, AccessibilityManager reads the bytes of each Pumpkin file and sends them back to the Dictation extension. Lastly, the extension spins up a new sandboxed context to run Pumpkin in.
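The extension-side end of that flow can be sketched as below. `PumpkinFiles`, `createSandboxedParser`, and `onPumpkinInstalled` are illustrative stand-ins, not the extension's real identifiers; the real code receives the file bytes from AccessibilityManager and hands them to a sandboxed context.

```typescript
// Hypothetical sketch: once the DLC bytes arrive, try to start a
// sandboxed Pumpkin parser; on failure, stay on the regex strategy.
type PumpkinFiles = Map<string, Uint8Array>;

function createSandboxedParser(files: PumpkinFiles): {ready: boolean} {
  // Real code would create an isolated context and pass it the files.
  return {ready: files.size > 0};
}

function onPumpkinInstalled(files: PumpkinFiles): 'pumpkin' | 'regex' {
  const parser = createSandboxedParser(files);
  // If the sandboxed parser fails to come up, keep regex parsing.
  return parser.ready ? 'pumpkin' : 'regex';
}
```

Running the parser in a sandboxed context keeps the downloaded Pumpkin binary isolated from the rest of the extension.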
We also added a pumpkin-&lt;version&gt;.tar.xz file to the Chromium codebase for testing purposes. A copy of the tar file should be placed in your root directory (e.g. ~/pumpkin-3.0.tar.xz) if you followed the documentation above.
Note: It's important that we never remove semantic tags from the Pumpkin DLC, because we want to avoid backwards-compatibility issues (we never want to regress any commands).