commit	f7993a1383df3a6c5e8249bc55631982d6b280d6
author	Mike Wasserman <msw@chromium.org>	Tue Feb 25 23:12:18 2025
committer	Chromium LUCI CQ <chromium-scoped@luci-project-accounts.iam.gserviceaccount.com>	Tue Feb 25 23:12:18 2025
tree	13c1c445d3a33286f9aa92fc2adf363f3733728f
parent	53d584dd1fd3e4ac379ee74992cf8570df837b6b

Prompt API: Implement an expedient multimodal vision prototype

Bypasses optimization guide to use on-device model service directly.
This short-term implementation expedites a time-sensitive dev trial.

Enables image input to prompt() on an ai.languageModel session, passed
as {type: 'image', content: image}:
  i = document.getElementsByTagName('img');
  s = await window.ai.languageModel.create();
  r = await s.prompt(['describe this image',
                      {type: 'image', content: i[0]}]);
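
The same call can be wrapped for reuse. This is a minimal sketch, not
code from this change: buildImagePrompt and describeImage are
hypothetical helper names, and `session` is assumed to be the object
returned by `await window.ai.languageModel.create()` in a build with
the flags listed below.

```javascript
// Hypothetical helper: builds the input shape shown in this commit
// message, a text string followed by one {type: 'image'} entry.
function buildImagePrompt(text, image) {
  return [text, {type: 'image', content: image}];
}

// Hypothetical helper: sends one text + image prompt to a session.
// `session` is assumed to expose prompt(), as in the snippet above.
async function describeImage(session, image, text = 'describe this image') {
  if (!session || typeof session.prompt !== 'function') {
    throw new TypeError('expected a session with a prompt() method');
  }
  return session.prompt(buildImagePrompt(text, image));
}
```

In a page, this might be called as
`await describeImage(s, document.getElementsByTagName('img')[0])`.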

Passes prompt[Streaming]() input via on_device_model::mojom::Input.
Invokes on_device_model::mojom::Session::Execute directly, with flag.
Adds a basic unit test. TODO: Add basic WPTs.

A more correct approach is in development; see crrev.com/c/6231611.
Builds on crrev.com/c/6086800 and prior+parallel prototyping efforts.
Credit to cduvall@chromium.org for big fixes in crrev.com/c/6253876.

Caution: This VERY ROUGH WIP prototype has quirks and known issues!
- Only a subset of JS ImageBitmapSource input types are supported
- Lacks history integration, token counting, overflow handling, etc.
- Does not support concurrent requests, multiple images, some errors.
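
Since concurrent requests are not supported, a test page may want to
issue prompts one at a time. A minimal sketch of one way to do that on
the page side, assuming only that `session.prompt()` returns a promise;
makeSerialPrompter is a hypothetical helper, not part of this change:

```javascript
// Hypothetical helper: returns a function that queues prompt() calls so
// each one starts only after the previous call has settled.
function makeSerialPrompter(session) {
  // `tail` always resolves, so a failed request never blocks later ones.
  let tail = Promise.resolve();
  return function promptSerially(input) {
    const result = tail.then(() => session.prompt(input));
    // Swallow the rejection on the internal chain only; the caller
    // still observes it through `result`.
    tail = result.catch(() => {});
    return result;
  };
}
```

Callers still receive each request's own result or error; only the
start of each request is delayed until the prior one finishes.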

Test locally on a device compatible with existing Prompt-API usage:
1) Run a Chrome-branded build, with flags and a test user-data-dir
$ chrome --no-sandbox --user-data-dir=/tmp/foo --enable-features=OptimizationGuideOnDeviceModel:on_device_model_image_input/true --enable-blink-features=AIPromptAPIMultimodalInput
2) Trigger model download and init via `ai.languageModel.create()`
3) Verify chrome://components "Optimization Guide..." is up-to-date
4) Quit & replace Chrome's model with a compatible test model:
  $ mv /tmp/foo/OptGuideOnDeviceModel/2024.9.25.2033/weights.bin /tmp/foo/OptGuideOnDeviceModel/2024.9.25.2033/weights.bin.OLD
  $ cp ~/Downloads/vision_model.tflite /tmp/foo/OptGuideOnDeviceModel/2024.9.25.2033/weights.bin
5) Restart Chrome; try multimodal prompt() inputs per Explainer:
   https://github.com/webmachinelearning/prompt-api

Bug: 385173789, 385173368
Change-Id: Ibc75d6777df6c31eed608bcfd9134458da4ce136
Reviewed-on: https://chromium-review.googlesource.com/c/chromium/src/+/6246232
Mega-CQ: Mike Wasserman <msw@chromium.org>
Reviewed-by: Steven Holte <holte@chromium.org>
Commit-Queue: Mike Wasserman <msw@chromium.org>
Reviewed-by: Brad Triebwasser <btriebw@chromium.org>
Cr-Commit-Position: refs/heads/main@{#1424824}

10 files changed
