| # How Chrome Accessibility Works, Part 3 |
| |
| This document explains the technical details behind Chrome accessibility |
| code by starting at a high level and progressively adding more levels of |
| detail. |
| |
| See [Part 1](how_a11y_works.md) and |
| [Part 2](how_a11y_works_2.md) first. |
| |
| [TOC] |
| |
| ## Abstracting platform-specific APIs |
| |
| In [Part 1](how_a11y_works.md) we talked about how each platform has its own |
| accessibility API. Chromium originally had the platform-specific accessibility |
| APIs scattered throughout the code, but today a large fraction of the APIs for |
| Windows (including IAccessible, IAccessible2, and UI Automation), Linux, and |
| macOS have all been isolated and abstracted in one place that makes it |
| relatively easy to write cross-platform accessibility code. |
| |
| These abstractions are all in the |
| [ui/accessibility/platform]( |
| https://source.chromium.org/chromium/chromium/src/+/main:ui/accessibility/platform/) |
| directory. |
| |
| First, gfx::NativeViewAccessible is a typedef used throughout Chromium to |
| represent an instance of the platform-specific accessible object on the current |
| platform. It's defined alongside gfx::NativeView, gfx::NativeEvent, and other |
| similar types that have equivalents on each platform. Note that these are not |
| wrappers or abstractions; they're just typedefs enabling you to write a function |
| that returns an instance of the appropriate type on each platform. For |
| accessibility, gfx::NativeViewAccessible is defined to be IAccessible* on |
| Windows, id on Mac (where 'id' is the type for a generic Objective-C object, |
| which has to implement the informal NSAccessibility protocol), and AtkObject* on |
| Linux. |
| |
| The main class in ui/accessibility/platform is AXPlatformNode. When you call |
| AXPlatformNode::Create, you'll get back an object that implements the |
| correct interfaces for the platform you're running on - currently Windows, |
| macOS, and desktop Linux are supported. |
| |
| For each AXPlatformNode, you need to provide an AXPlatformNodeDelegate - an |
| instance of a class that you implement in order to provide all of the |
| accessibility details about that node, in a cross-platform way. |
| |
| While AXPlatformNodeDelegate is pure virtual, a base class is provided, |
| AXPlatformNodeDelegateBase, with default implementations of nearly all |
| of the virtual functions. You can inherit from AXPlatformNodeDelegateBase |
| and override only a few functions in order to easily get a working object. |
| |
| As a brief sketch, if you had a custom-drawn button and you wanted to |
| make it accessible, you could define a subclass like this: |
| |
| ```C++ |
| class MyButtonAXPlatformNodeDelegate |
| : public AXPlatformNodeDelegateBase { |
| MyButtonAXPlatformNodeDelegate() |
| : AXPlatformNodeDelegateBase() { |
| ... |
| } |
| |
| const AXNodeData& GetData() const override { |
| ... |
| } |
| |
| int GetChildCount() const override { |
| ... |
| } |
| |
| gfx::NativeViewAccessible ChildAtIndex(int index) const override { |
| ... |
| } |
| |
| gfx::NativeViewAccessible GetParent() const override { |
| ... |
| } |
| }; |
| ``` |
| |
| Then to construct the accessible object, you could just write this: |
| ```C++ |
| MyButtonAXPlatformNodeDelegate delegate; |
| AXPlatformNode* accessible = AXPlatformNode::Create(delegate); |
| ``` |
| |
| ## Events |
| |
| In the Chromium codebase, accessibility events are notifications sent from |
| the browser to assistive technology that something has happened. This is |
| the mechanism by which assistive technology can provide real-time feedback |
| as the user is interacting with the browser. Some common events found on |
| nearly all platforms include: |
| |
| * Focus changed |
| * Control value changed |
| * Bounding box changed |
| * Children changed (a node added, removed, or reordered one or more children) |
| * Load complete (a web page finished loading) |
| |
| While many platforms share the same types of events, they're not standardized |
| at all, and platforms have very different names for events and different |
| semantics around which events are fired when, and where. As a few examples: |
| |
| * On macOS, there are separate events for expanding and collapsing a row |
| in a table or tree, vs expanding or collapsing a pop-up menu |
| * On Android there's a separate event for the checked state changing, while |
| on other platforms, there's just a generic state changed event |
| * On Windows there are SHOW and HIDE events that need to be fired when |
| a node or subtree is created or destroyed |
| |
| In many cases, assistive technology is co-developed with initial accessibility |
| support in a platform's native widget toolkit - for example, TalkBack was |
| co-developed with the accessibility support in Android Views, |
| and NSAccessibility was co-developed |
| with AppKit's initial accessibility support. One thing that invariably seems to |
| happen is that event notifications get added to let assistive technology know |
| about changes to state of the app. |
| |
| Then when a new app comes along and needs to do some custom drawing or otherwise |
| implement some custom accessibility code, implementing those events ends up |
| being tricky. If the right events aren't fired in exactly the right order, the |
| assistive technology gets confused, since it was only built and tested with one |
| event sequence. |
| |
| This forms an implicit contract between the server and client, but it's one |
| that's rarely properly documented. |
| |
| For a cross-platform product like Chromium that needs to support the right |
| set of events to fire across so many platforms this gets very tricky. In |
| the early days we tried to have Blink fire the superset of all events needed |
| on any platform, but this often resulted in duplicate events or subtle |
| bugs, and a tendency for an event-related fix for one platform to accidentally |
| break another platform. |
| |
| Chrome's solution to this now is what we call "implicit" events. Blink, and |
| other parts of the codebase that build an accessibility tree simply notify that |
| an accessibility node is dirty, or an entire subtree is dirty. The |
| infrastructure crawls the dirty nodes and creates a tree mutation and propagates |
| it to all client interfaces. |
| |
| At the level of the client interface, we generate implicit events based on |
| changes to the accessibility tree as observed from that client's perspective, |
| using a class called AXEventGenerator. |
| |
| This allows us to keep the code that implements a particular contract in one |
| place and eliminate subtle differences between different types of content. |
| |
| ### AXEventGenerator |
| |
| AXEventGenerator is based on the idea of applying atomic updates to an |
| accessibility tree. As described in |
| [Data structures used by the accessibility cache]( |
| how_a11y_works_2.md#Data-structures-used-by-the-accessibility-cache) |
| in part 2, an AXTree is a "live tree" that's currently being served, |
| and an AXTreeUpdate is a serializable data structure that represents |
| either a snapshot of a tree or an atomic update to apply to an existing |
| tree. When AXTree applies an atomic AXTreeUpdate, it allows listeners to |
| get callbacks for any changes that happened to the tree. In particular, |
| it keeps both the old and new data for each changing node temporarily |
| so that listeners can trigger actions based on changes. |
| |
| AXEventGenerator is thus an AXTree listener. It considers every node |
| that changed in the tree and figures out what events to fire. It builds |
| up the set of events and continues modifying it until the atomic update |
| is finished, enabling it to consolidate and remove duplication. |
| |
| As one example of that, a live region is a portion of a web page that |
| may trigger assistive technology to notify whenever an update occurs. |
| On some platforms, Chromium needs to fire a "live region changed" |
| announcement on the root of the live region whenever it changes. |
| AXEventGenerator keeps track of any changes that happen within a live |
| region and ensures that exactly one "live region changed" event is |
| fired on the live region root. |
| |
| There are a small number of exceptions - events that can't be fired |
| via AXEventGenerator. These are things that can't be inferred just |
| from tree changes. One such example is the "autocorrection occurred" |
| event. When the browser performs an autocorrection while the user is |
| typing, the state change just looks like any other edit. The event |
| ensures assistive technology can announce the autocorrection. |
| |
| ### Focus events |
| |
| Focus events are one of the most important types of events, because |
| changing focus is often one of the most important events for assistive |
| technology to announce, and the focused node is the one that will be |
| the target of any input events. |
| |
| However, one of the challenges with focus events is that there's only |
| one element on the entire desktop that has focus at any one time, but |
| individual windows or iframes might not always be aware of the global |
| state of the entire desktop at the time they experience a focus change |
| within their scope. This can lead to a race condition. |
| |
| As an example, suppose that a user clicks a button in a web page, which after a |
| couple of seconds pops up a dialog and brings focus to an OK button. At the |
| same time, the user clicks on a different window to activate it, moving |
| focus to that window's active element. |
| |
| Because the windows come from different processes, the two focus events |
| (from the first window's dialog, and from the second window's active element) |
| could arrive at the browser process in either order. Here's an illustration |
| of this race condition: |
| |
|  |
| |
| From the standpoint of the browser, there's always only one node that has |
| focus. What's important here is that accessibility is completely consistent |
| with the browser in terms of reporting the correct node that has focus. |
| |
| The solution here is that only the browser is the source of truth when it |
| comes to which window has focus. Once we know which window has focus, each |
| accessibility tree tells us which node has focus within that tree. |
| |
| As a result, when a focus change happens in an AXTree, we can't just fire |
| a platform-specific focus event directly. Instead, we use that as a cue to |
| compute global focus and fire an update if needed. Here's an outline of the |
| algorithm: |
| |
| * Anytime focus changes in any accessibility tree, OR when the focused |
| window or iframe changes, recompute the focus. |
| * To compute focus, start with the focused window (or active window, |
| depending on the platform). If focus is in web content, |
| see what node is focused there. If that node is an iframe, recursively |
| jump into that iframe to see what's focus. |
| * Take the resulting deepest focused node and compare it to the last focused |
| node we computed. If it's different, fire a platform-specific accessibility |
| focus event. |
| |
| This ensures that accessibility focus events are always reliable and in sync. |
| |
| No other accessibility events have the same issue. Events like value changed, |
| selection changed, etc., are safe to fire even if a window is in the |
| background. Some assistive technology may be paying attention to background |
| windows. |
| |
| ## Actions |
| |
| In Chromium accessibility terminology, Actions flow the opposite direction from |
| events. Actions are when assistive technology wants to modify or interact with |
| the app on behalf of the user, such as clicking a button, selecting text, or |
| changing a control value. |
| |
| Note that screen readers rely very heavily on events, and partially on |
| actions. Users often use a combination of accessibility actions along with the |
| keyboard to directly drive an application, or have the screen reader warp the |
| mouse cursor directly to an element and simulate a click on that element. |
| |
| In contrast, assistive technology such as voice control makes heavy use of |
| actions and relies much less on events. Voice control relies heavily on |
| actions that enable directly changing control values, entering text, |
| activating buttons and links, and scrolling the page. |
| |
| Other assistive technology such as magnifiers are in-between - they may |
| follow focus events a lot but make heavy use of scroll actions. |
| |
| For the most part implementing actions is relatively straightforward. |
| The action is received by the part of the code that implements the |
| platform-specific accessibility APIs. It forwards the action to the |
| corresponding accessibility wrapper node in Blink, and that node |
| calls the appropriate internal APIs to directly manipulate the underlying |
| element, such as clicking a button or changing the value of a control. |
| |
| One minor complication is that on many platforms, actions are supposed to |
| return a success/failure code. Since actions are obviously implemented |
| asynchronously, Chromium can't know for sure if an action succeeded, so it |
| has to return success if an action seems valid, even though there's a |
| chance it might not actually succeed. |
| |
| ## Hit testing |
| |
| One specific special case of an action is a hit test. This is an API where |
| the assistive technology gives the x, y coordinates of a location on the |
| screen and asks the application (Chromium) to return which accessible |
| object is at that location. |
| |
| Applications of hit testing include: |
| |
| * Touch exploration on a touch-screen, or features to describe the |
| element as you're hovering over it with the mouse |
| * Using accessibility debuggers where you can click on an element and |
| get its accessibility properties |
| |
| Unfortunately on some platforms a hit test is a synchronous API. This is |
| a challenge because it's difficult to properly compute the correct element |
| at a location given just the accessibility tree, but blocking to wait for |
| a proper hit test in the render process can lead to deadlock and jankiness. |
| So Chromium employs the following approach: |
| |
| * The first time a hit test is received, it does an approximate hit test |
| based on the bounding boxes in the accessibility tree. This often returns |
| the correct result, but could fail in cases of complex layering or |
| non-rectangular objects. |
| * Subsequently, it makes an async call to the render process to do a proper |
| hit test and get the correct resulting element, and also the visible |
| bounding box of that element. |
| * The next hit test that's received, if the coordinates are within the |
| bounding box of the most recent proper hit test result, it returns that |
| result, which is correct. If the coordinates are outside of that bounding |
| box, go back to the first step. |
| |
| This algorithm works very well in practice when the user is moving the mouse |
| or dragging their finger across the screen, because we get dozens of hit |
| tests per second. At the edges of objects, the wrong result may be returned |
| for a few milliseconds, but as soon as the async result comes back, the |
| correct result is then returned. |
| |
| So for interactive use, it's quite seamless and reliable for users, while |
| still providing reasonable behavior in the less common circumstances where |
| a single hit test is called. |
| |
| ## Relative coordinates |
| |
| Up until now we've hinted about the fact that every node in the accessibility |
| tree stores a bounding box, but we haven't gone into much detail as to |
| how that bounding box is stored. |
| |
| If we always stored the bounding box in screen coordinates, then every time |
| a window is dragged or scrolled, or any time any part of the page moves or |
| scrolls, all of the affected bounding boxes would need to be recomputed, |
| which would involve a lot of recomputation and sending information from |
| render processes to the browser process. |
| |
| To minimize that work, in Chromium accessibility nodes store relative |
| coordinates. |
| |
| In particular, every node stores the following fields in a struct |
| called AXRelativeBounds: |
| |
| ```C++ |
| struct AX_BASE_EXPORT AXRelativeBounds final { |
| int offset_container_id; |
| Rect bounds; |
| Optional<Transform> transform; |
| }; |
| ``` |
| |
| The first field is the ID of the node's container, which can be any ancestor of |
| a node. That's the node that the bounds are relative to. |
| |
| The next field is the local bounding rect, relative to that container. |
| |
| The last field is an optional 4x4 transformation matrix, which can be |
| used to encode things like scale factors or even 3-D rotations. If this |
| concept is unfamiliar to you, search for tutorials on 4x4 transformation |
| matrices in the context of 3-D computer graphics. |
| |
| Computing the global bounding rect of a node is meant to be straightforward. |
| Start with the local rect. As long as the node isn't the root, keep walking |
| to the container node, applying the transformation matrix and adding the |
| bounds origin as you go. |
| |
| In addition, there are a couple of other fields relevant to the bounds |
| computation that are stored as sparse attributes in AXNodeData. These also |
| affect the bounds computation. |
| |
| * bool clips_children; |
| * int x_scroll_offset; |
| * int y_scroll_offset; |
| |
| For more information on bounding boxes, clipping, and offscreen, see |
| [Offscreen, Invisible and Size](offscreen.md). |
| |
| ## Text bounding boxes |
| |
| Most platform-specific accessibility APIs have a number of features |
| specifically to deal with text. Some of those APIs allow querying the |
| bounding box of an arbitrary range of text - often the text caret or |
| selection, but not necessarily. Applications include: |
| |
| * Highlighting text as it's read aloud |
| * Scrolling one particular text range into view |
| * Drawing highlights around the caret or selection to make it easier |
| for users to see them |
| |
| Because these APIs are synchronous, they must be served directly out of the |
| accessibility cache. That means that the accessibility cache needs to have |
| enough information to be able to retrieve the bounding box of any arbitrary |
| range of text on-screen. |
| |
| It would require quite a bit of memory to store the bounding box of |
| every individual character. To save memory, the following representation |
| is used: |
| |
| In the accessibility tree, we keep track of text nodes called "inline text |
| boxes". This corresponds to a similar concept in Blink, which is also |
| sometimes called a "text run". The idea is that given a single text node, |
| the text can be broken down into a sequence of text runs that each have |
| the following properties: |
| |
| * Each text run is on a single line |
| * Text goes a single direction (left-to-right, for example) |
| * The characters in that text run are all contiguous |
| |
| In the most common scenario, a single text node contains multiple |
| lines of text (potentially due to automatic wrapping with soft |
| line breaks). In the accessibility tree that node would have multiple |
| inline text box children, one for each line. |
| |
| Imagine we have the following paragraph, that's very narrow so it |
| wraps as follows: |
| |
| ``` |
| The quick brown fox |
| jumps over the |
| lazy dog. |
| ``` |
| |
| In the accessibility tree, it might be represented like this: |
| |
| ``` |
| Paragraph |
| Static Text "The quick brown fox jumps over the lazy dog." |
| Inline text box "The quick brown fox " |
| Inline text box "jumps over the " |
| Inline text box "lazy dog." |
| ``` |
| |
| Each inline text box comes with its own bounding box and text direction. |
| Then, to store the bounding box of every character, all we need to |
| do is store the width of each character. Since we know all of the |
| characters are written continuously in a line going the same direction, |
| we can use the bounds of the inline text box and the width of each |
| character to compute the bounding box of any individual character. |
| |
| The AXPosition class abstracts most of this computation. |
| |
| ## Iframes |
| |
| The last piece of complexity to address is that up until now we've |
| assumed that a single web page corresponds to a single frame, so a |
| web page is a single process. |
| |
| In Chromium, for security reasons iframes can also be running in |
| separate processes. This isn't always the case - for one thing, if |
| system resources are low, Chromium won't keep creating new processes, |
| and also, frames from the same origin (i.e. from the same website) |
| need to be in the same process so they can communicate synchronously |
| via JavaScript. But, frames from different sites can be in different |
| processes so accessibility code needs to deal with that. |
| |
| The essential challenge is that each frame, which may be in its own |
| process, needs to maintain an accessibility tree - but the end result |
| needs to be stitched together into a final resulting accessibility |
| tree in the browser process. Iframes are mostly just an implementation |
| detail; users and assistive technology are rarely concerned with this |
| detail. |
| |
| In order to stitch frames together: |
| |
| * Each accessibility tree gets a globally unique ID, we call it an AXTreeID. |
| For security reasons this is an UnguessableToken. |
| * An iframe element in an accessibility tree contains the AXTreeID of its |
| child frame. |
| * In the browser process, we keep a hash map of all of the trees, and |
| also cache the reverse direction (e.g. the map from the root of a |
| tree to its parent node). |
| |
| In order to reduce complexity, Chromium accessibility is built around the |
| concept that every frame is its own accessibility tree, no matter whether |
| the frame is in a different process or not. The advantage of this approach |
| is that the same codepath is used whether iframes are in the same process |
| or a remote process. If iframes break, they all break - that simplifies |
| testing and reduces the number of cases to consider. |
| |
| The concept of embedding one accessibility tree in another using an |
| AXTreeID is also exploited even more in Chrome OS accessibility, where |
| it's used to embed Android applications and more. |