Select to Speak (for developers)

Select to Speak is a Chrome OS feature to read text on the screen out loud.

There are millions of users who greatly benefit from some text-to-speech but don’t quite need a full screen reading experience where everything is read aloud each step of the way. For these users, whether they are low vision, dyslexic, neurologically diverse, or simply prefer to listen to text read aloud instead of visually reading it, we have built Select-to-Speak.

Using Select to Speak

Go to Chrome settings, Accessibility settings, “Manage accessibility features”, and enable “Select to Speak”. From the settings page you can adjust the preferred voice and highlight color, and access text-to-speech preferences.

With this feature enabled, you can read text on the screen in one of three ways:

  • Hold down the Search key, then use the touchpad or external mouse to tap or drag a region to be spoken

  • Tap the Select-to-Speak icon in the status tray and use the mouse or touchscreen to select a region to be spoken

  • Highlight text and use Search+S to speak only the selected text.

Read more on the Chrome help page under “Listen to part of a page”.

Reporting bugs

Use bugs.chromium.org, filing bugs under the component UI>Accessibility>SelectToSpeak.

Developing

Select to Speak will be abbreviated STS in this section.

Code location

STS code lives mainly in four places:

  • A component extension to do the bulk of the logic and processing, chrome/browser/resources/chromeos/accessibility/select_to_speak/

  • An event handler, ash/events/select_to_speak_event_handler.h

  • The status tray button, ash/system/accessibility/select_to_speak/select_to_speak_tray.h

  • The floating control panel, system/accessibility/select_to_speak_menu_bubble_controller.h

In addition, there are settings for STS in chrome/browser/resources/settings/a11y_page/manage_a11y_page.*

Tests

Tests are in ash_unittests and in browser_tests:

out/Release/ash_unittests --gtest_filter="SelectToSpeak*"
out/Release/browser_tests --gtest_filter="SelectToSpeak*"

Debugging

Developers can add log lines to any of the C++ files and see output in the console. To debug the STS extension, the easiest way is from an external browser. Start Chrome OS on Linux with this command-line flag:

out/Release/chrome --remote-debugging-port=9222

Now open http://localhost:9222 in a separate instance of the browser, and debug the Select to Speak extension background page from there.

How it works

Like ChromeVox, STS is implemented mainly as a component Chrome extension which is always loaded and running in the background when enabled, and unloaded when disabled. The only STS code outside of the extension is an EventRewriter which forwards keyboard and mouse events to the extension as needed, so that the extension can get events systemwide.

The STS extension does the following, at a high level:

  1. Tracks key and mouse events to determine when a user has either:

    a. Held down “search” and clicked & dragged a rectangle to specify a selection

    b. Used “search” + “s” to indicate that selected text should be read

    c. Requested that speech be canceled by tapping ‘control’ or ‘search’ alone

  2. Determines the Accessibility nodes that make up the selected region

  3. Sends utterances to the Chrome Text-to-Speech extension to be spoken

  4. Tracks utterance progress and updates the focus ring and highlight as needed.

Select to Speak extension structure

Most STS logic takes place in select_to_speak.js.

User input

Input to the extension is handled by input_handler.js, which handles user input from mouse, keyboard, and touchscreen events. Most logic here revolves around keeping track of state to see if the user has requested text using one of the three ways to activate the feature: search + mouse, tray button + mouse, or search + s.

Determining selected content

Once input_handler determines that the user did request text to be spoken, STS must determine which part of the page to read. To do this it requests information from the Automation API, and then generates a list of AutomationNodes to be read.

With mouse or touchpad

select_to_speak.js fires a HitTest to the Automation API at the center of the rect selected by the user. When the API gets a result it returns via SelectToSpeak.onAutomationHitTest_. This function walks up from the hit test node to the nearest container to find a root, then back down through all the root’s children to find ones that overlap with the selected rect. Walking back down through the children occurs in NodeUtils.findAllMatching, and results in a list of AutomationNodes that can be sent for speech.

If the rect size is below a certain threshold, all nodes within the overlapped block parent are selected.
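
As a rough illustration of the walk back down, the sketch below collects leaf nodes whose bounds overlap a selection rect. It uses plain objects in place of AutomationNodes. The names overlaps, findAllOverlapping, and the sample tree are hypothetical and are not the real NodeUtils code, and the sketch assumes that children lie within their parent's bounds.

```javascript
// Hypothetical stand-ins for AutomationNodes: {name, location, children}.
// location is {left, top, width, height}. This is not the real NodeUtils code.

// Returns true when two rects share some area.
function overlaps(a, b) {
  return a.left < b.left + b.width && b.left < a.left + a.width &&
         a.top < b.top + b.height && b.top < a.top + a.height;
}

// Depth-first walk that collects leaf nodes overlapping the rect.
// Assumption: a subtree can be skipped when its root does not overlap.
function findAllOverlapping(node, rect, out = []) {
  if (!overlaps(node.location, rect)) {
    return out;
  }
  if (!node.children || node.children.length === 0) {
    out.push(node);
    return out;
  }
  for (const child of node.children) {
    findAllOverlapping(child, rect, out);
  }
  return out;
}

const root = {
  name: 'root',
  location: {left: 0, top: 0, width: 200, height: 100},
  children: [
    {name: 'first line', location: {left: 0, top: 0, width: 200, height: 20}},
    {name: 'second line', location: {left: 0, top: 30, width: 200, height: 20}},
    {name: 'third line', location: {left: 0, top: 60, width: 200, height: 20}},
  ],
};

// A drag rect that covers the second and third lines only.
const dragRect = {left: 10, top: 25, width: 50, height: 40};
const selectedNames = findAllOverlapping(root, dragRect).map(n => n.name);
console.log(selectedNames);
```

The result preserves tree order, which matters because the nodes are later spoken in that order.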

With search + s

select_to_speak.js requests focus information from the Automation API. The focus result is sent to SelectToSpeak.requestSpeakSelectedText_, which uses Automation selection to determine which nodes are selected. The main complexity here is converting between Automation selection and its deep equivalent, i.e. from parent nodes and offsets to their leaves. This occurs in NodeUtils.getDeepEquivalentForSelection. When the first and last nodes in the selection are found, SelectToSpeak.readNodesInSelection_ is used to determine the entire list of AutomationNodes which should be sent for speech.
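
To make the "deep equivalent" idea concrete, here is a minimal sketch that descends from a parent node and offset to a leaf node and character offset. It is not the real NodeUtils.getDeepEquivalentForSelection. The function name toDeepEquivalent and the plain-object nodes are hypothetical, and the sketch assumes that an offset on a node with children is a child index, while an offset on a leaf is a character index.

```javascript
// Hypothetical simplified node: {name, children}.
// Assumption: for a node with children the offset is a child index;
// for a leaf it is a character index into node.name.
function toDeepEquivalent(node, offset) {
  let current = node;
  let currentOffset = offset;
  while (current.children && current.children.length > 0) {
    if (currentOffset >= current.children.length) {
      // Offset is past the last child: move to the end of the last child.
      current = current.children[current.children.length - 1];
      currentOffset = current.children && current.children.length > 0 ?
          current.children.length :
          (current.name || '').length;
    } else {
      // Offset points at a child: move to the start of that child.
      current = current.children[currentOffset];
      currentOffset = 0;
    }
  }
  return {node: current, offset: currentOffset};
}

const helloText = {name: 'Hello '};
const worldText = {name: 'world'};
const link = {name: 'world', children: [worldText]};
const paragraphNode = {name: '', children: [helloText, link]};

// Selection starts at child 1 (the link) and ends after the last child.
const deepStart = toDeepEquivalent(paragraphNode, 1);
const deepEnd = toDeepEquivalent(paragraphNode, 2);
console.log(deepStart.node.name, deepStart.offset);
console.log(deepEnd.node.name, deepEnd.offset);
```

With leaf endpoints like these, the nodes between the first and last leaf can be enumerated in order for speech.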

Speaking selected content

SelectToSpeak.startSpeechQueue_ takes a list of AutomationNodes, determines their text content, and sends the result to the Text to Speech API for speech. It begins by mapping the text content of the nodes to the nodes themselves, so that STS can speak smoothly across node boundaries (e.g. across line breaks) and follow speech progress with a highlight. The mapping between text and nodes occurs in repeated calls to ParagraphUtils.buildNodeGroup to build lists of nodes that should be spoken smoothly.

Each node group is sent to the Text to Speech API, with callbacks to allow for speech progress tracking, enabling the highlight to be dynamically updated with each word.
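
The mapping between text and nodes can be pictured with the simplified sketch below. It joins the text of several nodes into one utterance and records the character offset at which each node starts. The function buildSimpleNodeGroup and its output shape are hypothetical; the real ParagraphUtils.buildNodeGroup handles more cases (for example block boundaries) that are left out here.

```javascript
// Hypothetical sketch: joins node text into one string and remembers
// where each node's text begins, so speech progress can be mapped back.
function buildSimpleNodeGroup(nodes) {
  const group = {text: '', nodes: []};
  for (const node of nodes) {
    const name = (node.name || '').trim();
    if (name.length === 0) {
      continue;  // Skip nodes with nothing to speak.
    }
    if (group.text.length > 0) {
      group.text += ' ';  // Separate adjacent nodes with a single space.
    }
    group.nodes.push({node, startChar: group.text.length});
    group.text += name;
  }
  return group;
}

const simpleGroup = buildSimpleNodeGroup([
  {name: 'Select to'},
  {name: '  '},
  {name: 'Speak reads'},
  {name: 'text aloud.'},
]);
console.log(simpleGroup.text);
console.log(simpleGroup.nodes.map(n => n.startChar));
```

Because the group is spoken as one string, the TTS engine does not pause at node boundaries such as line breaks.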

Highlighting content during speech

On each word boundary event, the TTS API sends a callback which is handled by SelectToSpeak.onTtsWordEvent_. This is used to check against the list of nodes being spoken to see which node is currently being spoken, and further check against the words in the node to see which word is spoken.
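
A minimal sketch of that lookup follows, assuming each word event carries a character index into the spoken text and that the node group stores a start offset per node. The names findNodeIndexForChar and wordAt, and the sample data, are hypothetical and are not taken from SelectToSpeak.onTtsWordEvent_.

```javascript
// Hypothetical node group: spoken text plus the start offset of each node.
const spokenGroup = {
  text: 'Select to Speak reads text aloud.',
  nodes: [
    {name: 'Select to', startChar: 0},
    {name: 'Speak reads', startChar: 10},
    {name: 'text aloud.', startChar: 22},
  ],
};

// Returns the index of the last node that starts at or before charIndex,
// or -1 if there is none. Assumes nodes are sorted by startChar.
function findNodeIndexForChar(group, charIndex) {
  let index = -1;
  for (let i = 0; i < group.nodes.length; i++) {
    if (group.nodes[i].startChar > charIndex) {
      break;
    }
    index = i;
  }
  return index;
}

// Returns the word that begins at charIndex, ending at the next space.
function wordAt(text, charIndex) {
  let end = text.indexOf(' ', charIndex);
  if (end === -1) {
    end = text.length;
  }
  return text.slice(charIndex, end);
}

// Simulate a word boundary event at character 16.
const eventCharIndex = 16;
const spokenNodeIndex = findNodeIndexForChar(spokenGroup, eventCharIndex);
const spokenWord = wordAt(spokenGroup.text, eventCharIndex);
console.log(spokenGroup.nodes[spokenNodeIndex].name, spokenWord);
```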

Edge cases

STS must also handle cases where:

  • Nodes become invalid during speech, e.g. if a page was closed. Speech should continue, but the highlight stops.

  • Nodes disappear and re-appear during speech (a user may have switched tabs and switched back, or scrolled). Highlight should resume.

This occurs in SelectToSpeak.updateFromNodeState_.

Communication with SelectToSpeakTray

STS runs in the extension process, but needs to communicate its three states (Inactive, Selecting, and Speaking) to the STS button in the status tray. It also needs to listen for users requesting a state change using the SelectToSpeakTray button. The STS extension uses the AccessibilityPrivate method setSelectToSpeakState to inform the SelectToSpeakTray of a status change, and listens to onSelectToSpeakStateChangeRequested to know when a user wants to change state. The STS extension is the source of truth for STS state.
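
The "source of truth" arrangement can be sketched as a small state holder that notifies the tray on every change. Everything in this sketch is an assumption made for illustration: the StateTracker class, the state strings, and the toggle behavior on a tray request are hypothetical, and the notifyTray callback merely stands in for a call such as setSelectToSpeakState so that the example runs outside the extension.

```javascript
// Hypothetical state names for illustration only.
const State = Object.freeze({
  INACTIVE: 'inactive',
  SELECTING: 'selecting',
  SPEAKING: 'speaking',
});

class StateTracker {
  // notifyTray stands in for informing the tray button of a change.
  constructor(notifyTray) {
    this.state = State.INACTIVE;
    this.notifyTray = notifyTray;
  }

  // The extension owns the state; the tray is only told about changes.
  setState(newState) {
    if (newState === this.state) {
      return;
    }
    this.state = newState;
    this.notifyTray(newState);
  }

  // Assumed behavior: a tray request starts selection when inactive,
  // and otherwise cancels back to inactive.
  onStateChangeRequested() {
    if (this.state === State.INACTIVE) {
      this.setState(State.SELECTING);
    } else {
      this.setState(State.INACTIVE);
    }
  }
}

const sentToTray = [];
const tracker = new StateTracker(state => sentToTray.push(state));
tracker.onStateChangeRequested();   // User taps the tray button.
tracker.setState(State.SPEAKING);   // User selected a region; speech starts.
tracker.onStateChangeRequested();   // User taps the tray button again.
console.log(sentToTray);
```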

Navigation features

STS will display a floating control panel when activated. The control panel hosts controls for pause/resume, updating reading speed, navigating by sentence or paragraph, and deactivating STS.

Floating control panel

The panel is implemented as a native Ash component, select_to_speak_menu_bubble_controller.h. Similar to focus rings, the STS component extension communicates with the panel via the chrome.accessibilityPrivate API. The chrome.accessibilityPrivate.updateSelectToSpeakPanel API controls the visibility and button states, and panel actions are communicated back to the extension by adding a listener to chrome.accessibilityPrivate.onSelectToSpeakPanelAction.

When the panel is displayed, STS will no longer dismiss itself when TTS playback is complete. The user must quit STS either from the panel or the tray button.

Keyboard shortcuts

When the panel is displayed, it is initially focused and captures keypresses to implement keyboard shortcuts:

  • Space - activates currently focused button, which is ‘Pause/Resume’ initially.
  • Left Arrow - Navigate to previous sentence (for RTL languages, this is Right Arrow)
  • Right Arrow - Navigate to next sentence (for RTL languages, this is Left Arrow)
  • Up Arrow - Navigate to previous paragraph
  • Down Arrow - Navigate to next paragraph

If the panel loses focus, keyboard shortcuts will no longer work. The user can press the Search+S keyboard shortcut (with no text selected) to restore focus to the panel.
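
The arrow-key mapping above, including the RTL swap, can be summarized in a small lookup. This is only an illustrative sketch: the function actionForKey, the key strings, and the action names are hypothetical, and the real panel handles keys in native code.

```javascript
// Maps an arrow key to a navigation action. In RTL languages the
// left and right arrows swap meaning; up and down are unaffected.
// Returns null for keys that are not navigation shortcuts (Space is
// assumed to be handled by the focused button itself).
function actionForKey(key, isRtl) {
  switch (key) {
    case 'ArrowLeft':
      return isRtl ? 'nextSentence' : 'previousSentence';
    case 'ArrowRight':
      return isRtl ? 'previousSentence' : 'nextSentence';
    case 'ArrowUp':
      return 'previousParagraph';
    case 'ArrowDown':
      return 'nextParagraph';
    default:
      return null;
  }
}

console.log(actionForKey('ArrowLeft', false));
console.log(actionForKey('ArrowLeft', true));
```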

Disallowed nodes

The panel is not shown when STS is activated on nodes where navigation features do not add value, such as in system UI or top-level windows.

  • System UI nodes - any nodes that have a root with role desktop
  • Root nodes that are children of the root desktop node
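
Read literally, the two rules above amount to a predicate like the sketch below. The function isPanelDisallowed and the plain-object nodes (with role, parent, and root fields) are hypothetical stand-ins for AutomationNodes, and the actual check in the extension may differ.

```javascript
// Returns true when the panel should not be shown for this node.
// Assumption: each node exposes its tree root as node.root, and a
// tree root's parent is the node that hosts it in the enclosing tree.
function isPanelDisallowed(node) {
  const root = node.root;
  if (root.role === 'desktop') {
    return true;  // System UI: the node lives directly in the desktop tree.
  }
  if (root.parent && root.parent.role === 'desktop') {
    return true;  // The node's root is a child of the desktop node.
  }
  return false;
}

// Build a tiny sample hierarchy.
const desktop = {role: 'desktop', parent: null};
desktop.root = desktop;
const trayButton = {role: 'button', parent: desktop, root: desktop};

const topLevelRoot = {role: 'window', parent: desktop};
topLevelRoot.root = topLevelRoot;
const topLevelText = {role: 'staticText', parent: topLevelRoot, root: topLevelRoot};

const webView = {role: 'webView', parent: topLevelRoot, root: topLevelRoot};
const webRoot = {role: 'rootWebArea', parent: webView};
webRoot.root = webRoot;
const webParagraph = {role: 'paragraph', parent: webRoot, root: webRoot};

console.log(isPanelDisallowed(trayButton));
console.log(isPanelDisallowed(topLevelText));
console.log(isPanelDisallowed(webParagraph));
```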

Pause/Resume

Since chrome.tts.pause and chrome.tts.resume are not consistently implemented across all TTS engines, STS implements pause/resume functionality using the chrome.tts.stop and chrome.tts.speak APIs. While TTS is playing, STS keeps track of the current word offset, and when TTS is resumed, it will call speak with text trimmed to the start of the last spoken word.
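
A minimal sketch of this bookkeeping is shown below. The ResumableSpeech class and its method names are hypothetical; the sketch only tracks offsets and returns the text that would be passed to speak on resume, without calling any TTS API. It assumes that word events after a resume report character indexes relative to the trimmed text, so a base offset is kept to translate them back.

```javascript
// Tracks the start of the last spoken word so speech can be restarted
// from there after a stop. Hypothetical sketch, not the STS code.
class ResumableSpeech {
  constructor(text) {
    this.text = text;
    this.baseOffset = 0;  // Where the currently spoken slice starts.
    this.wordStart = 0;   // Start of the last spoken word in this.text.
  }

  // Called for each word boundary; charIndex is relative to the slice
  // that was last handed to the TTS engine.
  onWordEvent(charIndex) {
    this.wordStart = this.baseOffset + charIndex;
  }

  // Returns the text to speak on resume, trimmed to the last spoken word.
  resumeText() {
    this.baseOffset = this.wordStart;
    return this.text.slice(this.wordStart);
  }
}

const speech = new ResumableSpeech('The quick brown fox jumps');
speech.onWordEvent(4);    // "quick"
speech.onWordEvent(10);   // "brown", then the user pauses.
const firstResume = speech.resumeText();
speech.onWordEvent(6);    // "fox", relative to the resumed text.
const secondResume = speech.resumeText();
console.log(firstResume);
console.log(secondResume);
```

Repeating the last spoken word on resume avoids dropping a word that was cut off mid-utterance by the stop call.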

Resuming TTS behaves differently depending on the context:

  • If TTS was paused within the user-selected text, resuming will play until the end of the selected text.
  • If TTS stopped when it reached the end of the selected text, but before the end of the paragraph, resuming will continue from that point to the end of the paragraph.
  • If TTS stopped when it reached the end of a paragraph, resuming will speak the next paragraph.

Paragraph navigation

Users can navigate to adjacent paragraphs from the current block parent when Select to Speak is active. A ‘paragraph’ is any block element as defined by ParagraphUtils.isBlock, and navigation occurs in DOM order.

Sentence navigation

Paragraphs are split into sentences based on the sentenceStarts property of an AutomationNode. Users can skip to the previous and next sentences using a technique similar to pause/resume (stop, then speak with trimmed text). See sentence_utils.js for the logic that breaks node groups into sentences.
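
As a simplified picture of sentence skipping, the sketch below picks the next or previous sentence start from a sorted list of start offsets. The function names are hypothetical, and the sketch assumes the offsets have already been converted to positions within the whole spoken text, which is a simplification of what sentence_utils.js has to do.

```javascript
// Returns the first sentence start after currentChar, or null if the
// current sentence is the last one. Assumes sentenceStarts is sorted.
function nextSentenceStart(sentenceStarts, currentChar) {
  for (const start of sentenceStarts) {
    if (start > currentChar) {
      return start;
    }
  }
  return null;
}

// Returns the start of the sentence before the current one, or null
// if the current sentence is the first one.
function previousSentenceStart(sentenceStarts, currentChar) {
  let currentIndex = -1;
  for (let i = 0; i < sentenceStarts.length; i++) {
    if (sentenceStarts[i] <= currentChar) {
      currentIndex = i;
    }
  }
  return currentIndex > 0 ? sentenceStarts[currentIndex - 1] : null;
}

const sampleText = 'One. Two two. Three.';
const sampleStarts = [0, 5, 14];

// The current word starts at character 9, inside the second sentence.
const nextStart = nextSentenceStart(sampleStarts, 9);
const prevStart = previousSentenceStart(sampleStarts, 9);
console.log(sampleText.slice(nextStart));
console.log(sampleText.slice(prevStart));
```

The returned offset plays the same role as the word offset in pause/resume: speech is stopped, then restarted with the text trimmed to that position.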

Reading speed

Users can slow down or speed up TTS speaking rate using the floating control panel. The rate the user selects in the panel is multiplied by the system default TTS rate. So if the user selects 1.2x reading speed in the panel and has a system default of 2.0x, the effective TTS rate will be 2.4x.

When users adjust reading speed, chrome.tts.stop is called, and chrome.tts.speak is then called with text trimmed to the current word position, passing in the new effective TTS rate as an option.
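
The rate arithmetic and the restart can be sketched together as follows. The helper restartRequest is hypothetical; it only builds the trimmed text and an options object with a rate field, which the extension would then hand to chrome.tts.speak after calling chrome.tts.stop.

```javascript
// Builds the arguments for restarting speech at a new reading speed.
// panelRate is the multiplier chosen in the panel; systemRate is the
// system default TTS rate. Hypothetical helper for illustration.
function restartRequest(text, currentWordStart, panelRate, systemRate) {
  return {
    text: text.slice(currentWordStart),
    options: {rate: panelRate * systemRate},
  };
}

// The example from the text: 1.2x in the panel, 2.0x system default.
const request = restartRequest('The quick brown fox jumps', 10, 1.2, 2.0);
console.log(request.text);          // "brown fox jumps"
console.log(request.options.rate);  // 2.4
```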

Special case: Google Drive apps

Google Drive apps require a few work-arounds to work correctly with STS.

  • Any time a Google Drive document is loaded (such as a Doc, Sheet or Slides document), the script select_to_speak_gdocs_script must be executed to remove aria-hidden from the content container.

  • Using search+s to read highlighted text uses the clipboard to get text data from Google Docs, as selection information may not be available in the Automation API. This happens mostly in input_handler.js.

Enhanced network voices

As of M94, Select-to-speak supports natural, server-generated voices. When enhanced network voices are enabled, Select-to-speak passes the user's selected natural voice name to chrome.tts.speak. The TTS request is handled by the Enhanced Network TTS engine. The TTS engine then passes the request to native code (EnhancedNetworkTts), which in turn sends a network request to the ReadAloud API, which produces synthesized audio.

For instructions on how to add new voices, see go/chromeos-natural-voices.

For Googlers

Googlers can check out the Select to Speak feature design docs for more details on design as well as UMA.