Annotated Page Content (APC) is a structured and actionable representation of a webpage's content and layout. Its primary function is to enable a deep understanding of page structure, content, and interactive elements by downstream clients, who can receive the information as a protobuf tree.
APC is designed with the following principles in mind:
The foundation of APC is the AnnotatedPageContent protobuf message, which organizes page content into a hierarchical tree.
ContentNodes
The representation is a tree of ContentNodes. These nodes can represent layout containers on the page, grouping related information in a structure derived from the layout tree. This includes:
- <article>, <nav>, <section>
ContentAttributes
Each ContentNode contains attributes that describe the element in detail:
- TextInfo: The text content, along with styling information like size, emphasis, and color.
- ImageInfo: The image's alt text or caption, its URL, and security origin.
- AnchorData: The destination URL and the link's rel attribute.
- FormInfo, FormControlData: Includes the form's name/ID and data for individual controls like field name, value, and type. Password field values are omitted unless the user has made them visible on the page.
- InteractionInfo: Describes the node's interactivity (e.g., clickable, editable, focusable).
The following elements are under consideration for future inclusion but are not currently part of the APC structure:
- <audio>, <video>
- <canvas> and SVG (<svg>)
APC is generated by traversing Blink's layout tree, not the DOM tree. This is a critical distinction because the layout tree only includes content that is actually rendered on the page.
The generation algorithm recursively traverses the layout tree, creating a ContentNode for each rendered object with structured content or a significant semantic role. It extracts relevant data and organizes the nodes into a hierarchy that preserves the visual order of the page.
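The recursive traversal described above can be sketched roughly as follows. This is an illustrative model only, not Blink's implementation: `LayoutObject`, the simplified `ContentNode` dataclass, `SIGNIFICANT_KINDS`, and `build_content_nodes` are all hypothetical names, and the choice to attach the descendants of a skipped object to the nearest significant ancestor is an assumption made for the example.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class LayoutObject:
    """Hypothetical stand-in for a rendered layout-tree object."""
    kind: str  # e.g. "root", "container", "text", "anchor", "block"
    text: str = ""
    children: List["LayoutObject"] = field(default_factory=list)


@dataclass
class ContentNode:
    """Simplified stand-in for an APC ContentNode (not the real proto)."""
    kind: str
    text: str = ""
    children: List["ContentNode"] = field(default_factory=list)


# Assumption: kinds treated as having structured content or a semantic role.
SIGNIFICANT_KINDS = {"root", "container", "text", "image", "anchor", "form"}


def build_content_nodes(obj: LayoutObject) -> List[ContentNode]:
    """Return the ContentNodes produced for `obj`, in traversal order.

    A significant object yields one node that owns its descendants' nodes.
    Any other object yields no node of its own; its descendants' nodes are
    passed up to the nearest significant ancestor.
    """
    child_nodes: List[ContentNode] = []
    for child in obj.children:
        child_nodes.extend(build_content_nodes(child))
    if obj.kind in SIGNIFICANT_KINDS:
        return [ContentNode(obj.kind, obj.text, child_nodes)]
    return child_nodes


# Usage: a wrapper "block" produces no node, so its text is attached to the root.
layout = LayoutObject("root", children=[
    LayoutObject("block", children=[LayoutObject("text", "Hello")]),
    LayoutObject("container", children=[LayoutObject("anchor", "Docs")]),
])
tree = build_content_nodes(layout)[0]
```

Because children are visited in order and results are appended in that same order, the resulting hierarchy keeps the traversal order of the input tree, which stands in here for the visual order of the page.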
On the browser side, the raw APC proto can be converted into various consumable formats, including:
{#ID}) that link back to the original ContentNode.
A key goal of APC is to enable reliable interactions with webpages, even when they change dynamically.
To handle dynamic page changes, an algorithm robustly identifies the target element by matching key properties like its type, interactivity, and location. If needed, it can further verify the element by comparing its text content to ensure the correct action is taken.
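One way to picture this matching step is the sketch below. It is a hedged illustration under stated assumptions, not the actual algorithm: the `Target` record, the `find_target` function, the `max_distance` threshold, and the rule that text is compared whenever the expected element has text are all hypothetical choices made for the example.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Target:
    """Hypothetical snapshot of an element's key properties."""
    kind: str
    clickable: bool
    center: Tuple[float, float]  # (x, y) in page coordinates
    text: str = ""


def find_target(expected: Target, candidates: List[Target],
                max_distance: float = 50.0) -> Optional[Target]:
    """Pick the candidate that best matches `expected`, or None.

    Candidates must agree on type and interactivity and lie within
    `max_distance` of the expected location. Text is compared only when
    the expected element has text.
    """
    def distance(c: Target) -> float:
        dx = c.center[0] - expected.center[0]
        dy = c.center[1] - expected.center[1]
        return (dx * dx + dy * dy) ** 0.5

    matches = [
        c for c in candidates
        if c.kind == expected.kind
        and c.clickable == expected.clickable
        and distance(c) <= max_distance
    ]
    if expected.text:
        # Extra verification step: drop candidates whose text differs.
        matches = [c for c in matches
                   if c.text.strip() == expected.text.strip()]
    # Closest surviving candidate wins; None if nothing matched.
    return min(matches, key=distance, default=None)
```

Returning `None` rather than the nearest mismatching element means that, in this sketch, a caller would skip the action when the page has changed too much to identify the target.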
Using APC requires careful attention to privacy and security. While APC provides data to help mitigate risks, feature owners bear ultimate responsibility.
[isAccessibleForFree=false](https://developers.google.com/search/docs/appearance/structured-data/paywalled-content)) to flag paid content, and APC includes this signal.
To run the unit tests for content extraction, use the following command:
autoninja -C out/Default blink_unittests && out/Default/blink_unittests --gtest_filter=AIPageContentAgentTest.*
The web tests for content extraction are located in third_party/blink/web_tests/content_extraction/.
To run the web tests:
third_party/blink/tools/run_web_tests.py -C out/Default content_extraction
To update the web test expectations:
third_party/blink/tools/run_web_tests.py -C out/Default content_extraction --reset-results