Select to Speak is a Chrome OS feature to read text on the screen out loud.
There are millions of users who greatly benefit from some text-to-speech but don’t quite need a full screen reading experience where everything is read aloud each step of the way. For these users, whether they are low vision, dyslexic, neurologically diverse, or simply prefer to listen to text read aloud instead of visually reading it, we have built Select-to-Speak.
Go to Chrome settings, Accessibility settings, “Manage accessibility features”, and enable “Select to Speak”. From the settings page you can adjust the preferred voice and highlight color, and access text-to-speech preferences.
With this feature enabled, you can read text on the screen in one of three ways:
Hold down the Search key, then use the touchpad or external mouse to tap or drag a region to be spoken.
Tap the Select-to-Speak icon in the status tray and use the mouse or touchscreen to select a region to be spoken.
Highlight text and use Search+S to speak only the selected text.
Read more on the Chrome help page under “Listen to part of a page”.
File bugs at bugs.chromium.org under the component UI>Accessibility>SelectToSpeak.
Select to Speak will be abbreviated STS in this section.
STS code lives mainly in the following places:
A component extension to do the bulk of the logic and processing, chrome/browser/resources/chromeos/accessibility/select_to_speak/
An event handler, ash/events/select_to_speak_event_handler.h
The status tray button, ash/system/accessibility/select_to_speak/select_to_speak_tray.h
Floating panel, system/accessibility/select_to_speak_menu_bubble_controller.h
In addition, there are settings for STS in chrome/browser/resources/ash/settings/os_a11y_page/select_to_speak_subpage.*
Tests are in ash_unittests and in browser_tests:
out/Release/ash_unittests --gtest_filter="SelectToSpeak*"
out/Release/browser_tests --gtest_filter="SelectToSpeak*"
Developers can add log lines to any of the C++ files and see output in the console. The easiest way to debug the STS extension is from an external browser. Start Chrome OS on Linux with a command-line flag that enables remote debugging on port 9222.
Now open http://localhost:9222 in a separate instance of the browser, and debug the Select to Speak extension background page from there.
Like ChromeVox, STS is implemented mainly as a component Chrome extension which is always loaded and running in the background when enabled, and unloaded when disabled. The only STS code outside of the extension is an EventRewriter which forwards keyboard and mouse events to the extension as needed, so that the extension can get events systemwide.
The STS extension does the following, at a high level:
Tracks key and mouse events to determine when a user has:
a. Held down “search” and clicked & dragged a rectangle to specify a selection
b. Used “search” + “s” to indicate that selected text should be read
c. Requested speech to be canceled by tapping ‘control’ or ‘search’ alone
Determines the Accessibility nodes that make up the selected region
Sends utterances to the Chrome Text-to-Speech extension to be spoken
Tracks utterance progress and updates the focus ring and highlight as needed.
Most STS logic takes place in select_to_speak.js.
Input to the extension is handled by input_handler.js, which handles user input from mouse, keyboard, and touchscreen events. Most logic here revolves around keeping track of state to see if the user has requested text using one of the three ways to activate the feature: search + mouse, tray button + mouse, or search + s.
Once input_handler determines that the user did request text to be spoken, STS must determine which part of the page to read. To do this it requests information from the Automation API, and then generates a list of AutomationNodes to be read.
When the user selects a region with the mouse or the tray button, select_to_speak.js fires a HitTest to the Automation API at the center of the rect selected by the user. When the API gets a result, it returns via SelectToSpeak.onAutomationHitTest_. This function walks up from the hit test node to the nearest container to find a root, then back down through all the root’s children to find ones that overlap with the selected rect. Walking back down through the children occurs in NodeUtils.findAllMatching, and results in a list of AutomationNodes that can be sent for speech.
If the rect size is below a certain threshold, all nodes within the overlapped block parent are selected.
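The walk back down from the root can be pictured with a small sketch. It is not the code in NodeUtils.findAllMatching; the node shape ({name, location, children}) and the helper names rectsOverlap and findOverlappingLeaves are hypothetical stand-ins for AutomationNodes, and the pruning step assumes children lie within their parent's bounds.

```javascript
// Hypothetical node shape: {name, location: {left, top, width, height}, children}.
// Real AutomationNodes carry far more state; this is only a stand-in.

/** Returns true if the two rects share any area. */
function rectsOverlap(a, b) {
  return a.left < b.left + b.width && b.left < a.left + a.width &&
      a.top < b.top + b.height && b.top < a.top + a.height;
}

/** Collects, in document order, the leaves under `root` that overlap `rect`. */
function findOverlappingLeaves(root, rect, out = []) {
  if (!rectsOverlap(root.location, rect)) {
    return out;  // Prune: assumes children lie inside their parent's bounds.
  }
  if (root.children.length === 0) {
    out.push(root);
    return out;
  }
  for (const child of root.children) {
    findOverlappingLeaves(child, rect, out);
  }
  return out;
}

const page = {
  name: 'root',
  location: {left: 0, top: 0, width: 200, height: 100},
  children: [
    {name: 'first line', location: {left: 0, top: 0, width: 200, height: 20}, children: []},
    {name: 'second line', location: {left: 0, top: 20, width: 200, height: 20}, children: []},
    {name: 'footer', location: {left: 0, top: 80, width: 200, height: 20}, children: []},
  ],
};

// A user-drawn rect covering parts of the first two lines only.
const selectedLeaves =
    findOverlappingLeaves(page, {left: 10, top: 5, width: 50, height: 30});
console.log(selectedLeaves.map((node) => node.name));
```

The result keeps document order, which matters because the list is later spoken in sequence.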
When the user presses search + s, select_to_speak.js requests focus information from the Automation API. The focus result is sent to SelectToSpeak.requestSpeakSelectedText_, which uses Automation selection to determine which nodes are selected. The main complexity here is converting between Automation selection and its deep equivalent, i.e. from parent nodes and offsets to their leaves. This occurs in NodeUtils.getDeepEquivalentForSelection. When the first and last nodes in selection are found, SelectToSpeak.readNodesInSelection_ is used to determine the entire list of AutomationNodes which should be sent for speech.
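To illustrate what "deep equivalent" means, here is a minimal sketch under simplifying assumptions: an offset on a parent node counts children, an offset on a leaf counts characters, and nodes are plain {name, children} objects. The function deepEquivalent is hypothetical and is not NodeUtils.getDeepEquivalentForSelection, which handles more cases.

```javascript
/**
 * Converts a (node, offset) pair into the equivalent (leaf, character offset).
 * Hypothetical simplification: parent offsets index children, leaf offsets
 * index characters of `name`.
 */
function deepEquivalent(node, offset) {
  while (node.children.length > 0) {
    if (offset < node.children.length) {
      // The offset points at the start of one of the children.
      node = node.children[offset];
      offset = 0;
    } else {
      // The offset is past the last child: move to the end of that child.
      node = node.children[node.children.length - 1];
      offset = node.children.length > 0 ? node.children.length : node.name.length;
    }
  }
  return {node, offset};
}

const helloText = {name: 'Hello ', children: []};
const worldText = {name: 'world', children: []};
const paragraph = {name: '', children: [helloText, worldText]};
const container = {name: '', children: [paragraph]};

// A selection anchored on the container, from before its only child to after it.
const selectionStart = deepEquivalent(container, 0);
const selectionEnd = deepEquivalent(container, 1);
console.log(selectionStart.node.name, selectionStart.offset);
console.log(selectionEnd.node.name, selectionEnd.offset);
```

With both ends resolved to leaves, every leaf between them in document order belongs to the selection.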
SelectToSpeak.startSpeechQueue_ takes a list of AutomationNodes, determines their text content, and sends the result to the Text to Speech API for speech. It begins by mapping the text content of the nodes to the nodes themselves, so that STS can speak smoothly across node boundaries (i.e. across line breaks) and follow speech progress with a highlight. The mapping between text and nodes occurs in repeated calls to ParagraphUtils.buildNodeGroup to build lists of nodes that should be spoken smoothly.
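The text-to-node mapping can be sketched as follows. This is an illustration only, not ParagraphUtils.buildNodeGroup: the function buildNodeGroupSketch, the {text, entries} result shape, and the choice of a single space between nodes are all assumptions.

```javascript
/**
 * Concatenates node names into one utterance and records where each node's
 * text starts. Hypothetical simplification of a "node group".
 */
function buildNodeGroupSketch(nodes) {
  let text = '';
  const entries = [];
  for (const node of nodes) {
    if (text.length > 0) {
      text += ' ';  // Join nodes so speech flows across node boundaries.
    }
    entries.push({node, startChar: text.length});
    text += node.name;
  }
  return {text, entries};
}

const sketchGroup = buildNodeGroupSketch(
    [{name: 'Hello'}, {name: 'big'}, {name: 'world'}]);
console.log(sketchGroup.text);
console.log(sketchGroup.entries.map((entry) => entry.startChar));
```

Keeping the start offsets alongside the combined text is what later lets a character index reported by TTS be traced back to a node.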
Each node group is sent to the Text to Speech API, with callbacks to allow for speech progress tracking, enabling the highlight to be dynamically updated with each word.
On each word boundary event, the TTS API sends a callback which is handled by SelectToSpeak.onTtsWordEvent_. This is used to check against the list of nodes being spoken to see which node is currently being spoken, and further check against the words in the node to see which word is spoken.
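A rough sketch of that lookup, assuming the word event carries a character index into the spoken text and that a node group records the start offset of each node. The locateWord helper and the group shape are hypothetical and do not mirror SelectToSpeak.onTtsWordEvent_.

```javascript
/**
 * Given a node group and the character index from a word event, returns the
 * index of the node being spoken and the word that starts at that index.
 */
function locateWord(group, charIndex) {
  let nodeIndex = 0;
  for (let i = 0; i < group.entries.length; i++) {
    if (group.entries[i].startChar > charIndex) {
      break;  // Entries are sorted, so later nodes start even further along.
    }
    nodeIndex = i;
  }
  const match = group.text.slice(charIndex).match(/^\S+/);
  return {nodeIndex, word: match ? match[0] : ''};
}

// Assumed group shape: combined text plus the start offset of each node.
const spokenGroup = {
  text: 'Hello big world',
  entries: [{startChar: 0}, {startChar: 6}, {startChar: 10}],
};

console.log(locateWord(spokenGroup, 6));
console.log(locateWord(spokenGroup, 10));
```

The node index selects which node gets the focus ring, and the word bounds drive the per-word highlight.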
STS must also handle cases where:
Nodes become invalid during speech, e.g. if a page was closed. Speech should continue, but the highlight stops.
Nodes disappear and re-appear during speech (a user may have switched tabs and switched back, or scrolled). Highlight should resume.
This occurs in SelectToSpeak.updateFromNodeState_.
STS runs in the extension process, but needs to communicate its three states (Inactive, Selecting, and Speaking) to the STS button in the status tray. It also needs to listen for users requesting a state change using the SelectToSpeakTray button. The STS extension uses the AccessibilityPrivate method setSelectToSpeakState to inform the SelectToSpeakTray of a status change, and listens to onSelectToSpeakStateChangeRequested to know when a user wants to change state. The STS extension is the source of truth for STS state.
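The "source of truth" arrangement can be sketched with a stand-in for the API. Only setSelectToSpeakState and onSelectToSpeakStateChangeRequested come from the text above; the class StsStateOwner, the fake API object, the state string values, and the transition rule (a tray press starts selecting when inactive and otherwise returns to inactive) are assumptions made for illustration.

```javascript
// Assumed state names; the real values are defined by the accessibilityPrivate API.
const StsState = Object.freeze(
    {INACTIVE: 'inactive', SELECTING: 'selecting', SPEAKING: 'speaking'});

/** Owns the state and pushes every change to the tray through `api`. */
class StsStateOwner {
  constructor(api) {
    this.api_ = api;
    this.state_ = StsState.INACTIVE;
    api.onSelectToSpeakStateChangeRequested.addListener(
        () => this.onStateChangeRequested_());
  }

  setState(state) {
    if (state === this.state_) {
      return;
    }
    this.state_ = state;
    this.api_.setSelectToSpeakState(state);
  }

  // Assumed rule: the tray only asks; the extension decides the new state.
  onStateChangeRequested_() {
    this.setState(
        this.state_ === StsState.INACTIVE ? StsState.SELECTING :
                                            StsState.INACTIVE);
  }
}

/** Fake API for the sketch: records what the tray would have been told. */
function makeFakeApi() {
  const listeners = [];
  return {
    trayStates: [],
    setSelectToSpeakState(state) {
      this.trayStates.push(state);
    },
    onSelectToSpeakStateChangeRequested: {
      addListener: (listener) => listeners.push(listener),
    },
    pressTrayButton() {
      listeners.forEach((listener) => listener());
    },
  };
}

const fakeApi = makeFakeApi();
const stateOwner = new StsStateOwner(fakeApi);
fakeApi.pressTrayButton();               // inactive -> selecting
stateOwner.setState(StsState.SPEAKING);  // a selection was made
fakeApi.pressTrayButton();               // speaking -> inactive
console.log(fakeApi.trayStates);
```

The tray never sets state itself in this sketch; it only requests a change and then reflects whatever the extension reports.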
STS will display a floating control panel when activated. The control panel hosts controls for pause/resume, updating reading speed, navigating by sentence or paragraph, and deactivating STS.
The panel is implemented as a native ASH component, select_to_speak_menu_bubble_controller.h. Similar to focus rings, the STS component extension communicates with the panel via the chrome.accessibilityPrivate API. The chrome.accessibilityPrivate.updateSelectToSpeakPanel API controls the visibility and button states, and panel actions are communicated back to the extension through a listener that the extension registers with the same API.
When the panel is displayed, STS will no longer dismiss itself when TTS playback is complete. The user must quit STS either from the panel or the tray button.
When the panel is displayed, it is initially focused and captures keypresses to implement keyboard shortcuts:
If the panel loses focus, keyboard shortcuts will no longer work. The user can press the Search+S keyboard shortcut (with no text selection) to restore focus to the panel.
The panel is not shown when STS is activated on nodes where navigation features do not add value, such as in system UI or top-level windows. This is decided from the role of the nodes' root.
Because chrome.tts.pause and chrome.tts.resume are not consistently implemented across all TTS engines, STS implements pause/resume functionality using the chrome.tts.stop and chrome.tts.speak APIs. While TTS is playing, STS keeps track of the current word offset, and when TTS is resumed, it calls speak with text trimmed to the start of the last spoken word.
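The stop-then-speak approach can be sketched as below. The `tts` argument stands in for chrome.tts and only needs stop() and speak(text, options) with an onEvent callback that reports word events with a charIndex; the PausableSpeech class and the fake engine are hypothetical and simplify what the extension actually tracks.

```javascript
/** Emulates pause/resume on top of stop() and speak(). */
class PausableSpeech {
  constructor(tts, text) {
    this.tts_ = tts;
    this.text_ = text;
    this.wordOffset_ = 0;  // Start of the last spoken word in the full text.
    this.paused_ = false;
  }

  start() {
    this.speakFrom_(0);
  }

  pause() {
    this.paused_ = true;
    this.tts_.stop();
  }

  resume() {
    if (!this.paused_) {
      return;
    }
    this.paused_ = false;
    this.speakFrom_(this.wordOffset_);
  }

  speakFrom_(baseOffset) {
    this.tts_.speak(this.text_.slice(baseOffset), {
      onEvent: (event) => {
        if (event.type === 'word') {
          // charIndex is relative to the trimmed text, so add the base back.
          this.wordOffset_ = baseOffset + event.charIndex;
        }
      },
    });
  }
}

// Fake engine for the sketch: records calls and lets us fire word events.
const ttsCalls = [];
const fakeTts = {
  onEvent: null,
  speak(text, options) {
    ttsCalls.push(['speak', text]);
    this.onEvent = options.onEvent;
  },
  stop() {
    ttsCalls.push(['stop']);
  },
};

const speech = new PausableSpeech(fakeTts, 'The quick brown fox');
speech.start();
fakeTts.onEvent({type: 'word', charIndex: 4});  // "quick" begins.
speech.pause();
speech.resume();
console.log(ttsCalls);
```

Restarting from the beginning of the last spoken word means a word may be repeated after resuming, which is usually preferable to skipping part of it.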
Resuming TTS behaves differently depending on the context:
Users can navigate to adjacent paragraphs from the current block parent when Select to Speak is active. A ‘paragraph’ is any block element as defined by ParagraphUtils.isBlock, and the navigation occurs in DOM order.
Paragraphs are split into sentences based on the sentenceStarts property of an AutomationNode. Users can skip to the previous and next sentences using a technique similar to pause/resume (speak with trimmed text). See sentence_utils.js for logic on breaking node groups into sentences.
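As a minimal sketch of sentence skipping, assume the sentence starts have already been flattened into one sorted list of character offsets for the text being spoken. The helper names are hypothetical, and the real logic in sentence_utils.js works across node groups rather than a single string.

```javascript
/** Returns the first sentence start after `offset`, or null at the end. */
function nextSentenceStart(sentenceStarts, offset) {
  for (const start of sentenceStarts) {
    if (start > offset) {
      return start;
    }
  }
  return null;
}

/** Returns the start of the sentence before the one containing `offset`. */
function previousSentenceStart(sentenceStarts, offset) {
  let containing = -1;
  for (let i = 0; i < sentenceStarts.length; i++) {
    if (sentenceStarts[i] <= offset) {
      containing = i;
    }
  }
  return containing > 0 ? sentenceStarts[containing - 1] : null;
}

const paragraphText = 'One. Two. Three.';
const paragraphSentenceStarts = [0, 5, 10];  // Assumed flattened offsets.
const currentOffset = 6;                     // Currently inside "Two."

const nextStart = nextSentenceStart(paragraphSentenceStarts, currentOffset);
const prevStart = previousSentenceStart(paragraphSentenceStarts, currentOffset);
// The text that would be handed to speak after skipping in each direction.
console.log(paragraphText.slice(nextStart));
console.log(paragraphText.slice(prevStart));
```

A null result marks the point where navigation would have to move to an adjacent paragraph instead.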
Users can slow down or speed up TTS speaking rate using the floating control panel. The rate the user selects in the panel is multiplied by the system default TTS rate. So if the user selects 1.2x reading speed in the panel and has a system default of 2.0x, the effective TTS rate will be 2.4x.
When users adjust reading speed, chrome.tts.stop is called, and chrome.tts.speak is then called with text trimmed to the current word position, passing in the new effective TTS rate as an option.
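The rate arithmetic and the restart can be sketched together. The `tts` argument stands in for chrome.tts (stop() and speak(text, options) with a rate option); the helpers effectiveRate and restartWithRate and the recording object are hypothetical.

```javascript
/** Panel rate is a multiplier on top of the system default rate. */
function effectiveRate(panelRate, systemDefaultRate) {
  return panelRate * systemDefaultRate;
}

/** Stops speech and restarts it from the current word at the new rate. */
function restartWithRate(tts, text, wordOffset, panelRate, systemDefaultRate) {
  tts.stop();
  tts.speak(
      text.slice(wordOffset),
      {rate: effectiveRate(panelRate, systemDefaultRate)});
}

// Recording stand-in for the TTS engine.
const rateCalls = [];
const recordingTts = {
  stop() {
    rateCalls.push({method: 'stop'});
  },
  speak(text, options) {
    rateCalls.push({method: 'speak', text, rate: options.rate});
  },
};

// 1.2x in the panel with a 2.0x system default, currently at "brown".
restartWithRate(recordingTts, 'The quick brown fox', 10, 1.2, 2.0);
console.log(rateCalls);
```

Because the panel value is a multiplier, the same panel setting produces different absolute rates for users with different system defaults.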
Google Drive apps require a few work-arounds to work correctly with STS.
Any time a Google Drive document is loaded (such as a Doc, Sheet or Slides document), the script select_to_speak_gdocs_script must be executed to remove aria-hidden from the content container.
Using search+s to read highlighted text uses the clipboard to get text data from Google Docs, as selection information may not be available in the Automation API. This happens mostly in input_handler.js.
As of M94, Select-to-speak supports natural, server-generated voices. When enhanced network voices are enabled, Select-to-speak passes the user's selected natural voice name to chrome.tts.speak. The TTS request is handled by the Enhanced Network TTS engine. The TTS engine then passes the request to native code (EnhancedNetworkTts), which in turn sends a network request to the ReadAloud API, which produces synthesized audio.
For instructions on how to add new voices, see go/chromeos-natural-voices.
Googlers can check out the Select to Speak feature design docs for more details on design as well as UMA.
Overall product design, go/select-to-speak-design
On-Screen UI for touch and tablet modes, go/chromeos-sts-on-screen-ui
Reading text at keystroke, go/chromeos-sts-selection-keystroke
Reading text at keystroke in Google Drive apps, go/sts-selection-in-drive
Navigation features, go/enhanced-sts-dd
Enhanced network voices, go/wavenet-chromeos-dd