GPU Synchronization in Chrome

Chrome supports multiple mechanisms for sequencing GPU drawing operations; this document provides a brief overview. The main focus is a high-level explanation of when synchronization is needed and which mechanism is appropriate.

Glossary

GL Sync Object: Generic GL-level synchronization object that can be in a “unsignaled” or “signaled” state. The only current implementation of this is a GL fence.

GL Fence: A GL sync object that is inserted into the GL command stream. It starts out unsignaled and becomes signaled when the GPU reaches this point in the command stream, implying that all previous commands have completed.

Client Wait: Block the client thread until a sync object becomes signaled, or until a timeout occurs.

Server Wait: Tells the GPU to defer execution of commands issued after the wait until the sync object becomes signaled. The client thread continues executing immediately and can keep submitting GL commands.

CHROMIUM fence sync: A command-buffer-specific GL fence that sequences operations among command buffer GL contexts without requiring driver-level execution of previous commands.

Native GL Fence: A GL Fence backed by a platform-specific cross-process synchronization mechanism.

GPU Fence Handle: An IPC-transportable object (typically a file descriptor) that can be used to duplicate a native GL fence into a different process's context.

GPU Fence: A Chrome abstraction that owns a GPU fence handle representing a native GL fence, usable for cross-process synchronization.

Use case overview

The core scenario is synchronizing read and write access to a shared resource, for example drawing an image into an offscreen texture and compositing the result into a final image. The drawing operations need to complete before reading to ensure correct output. A typical symptom of incorrect synchronization is output containing blank or incomplete results instead of the expected rendered sub-images, causing flickering or tearing.

“Completed” in this case means that the end result of using a resource as input will be equivalent to waiting for everything to finish rendering, but it does not necessarily mean that the GPU has fully finished all drawing operations at that time.

Single GL context: no synchronization needed

If all access to the shared resource happens in the same GL context, there is no need for explicit synchronization. GL guarantees that commands are logically processed in the order they are submitted. This is true both for local GL contexts (GL calls via ui/gl/ interfaces) and for a single command buffer GL context.
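
For example, no explicit fence is needed between two operations issued on the same context. A trivial sketch, in the same pseudocode style as the examples further below:

// Both operations are issued on the same GL context; GL's ordering
// guarantee means Render2 observes the results of Render1.
Render1(gl);
Render2(gl);  // Will happen after Render1.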

Multiple driver-level GL contexts in the same share group: use GLFence

A process can create multiple GL contexts that are part of the same share group. These contexts can be created in different threads within this process.

In this case, GL fences must be used for sequencing, for example:

  1. Context A: draw image, create GLFence
  2. Context B: server wait or client wait for GLFence, read image

gl::GLFence and its subclasses provide wrappers for GL/EGL fence handling methods such as eglCreateSyncKHR and eglWaitSyncKHR. These fence objects can be used cross-thread as long as both threads' GL contexts are part of the same share group.
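
A minimal sketch of this pattern, assuming the fence object is handed off between the two threads (DrawImage, ReadImage, and PassFenceToThreadB are hypothetical placeholders):

// Thread A, context A (in share group S):
DrawImage();
std::unique_ptr<gl::GLFence> fence = gl::GLFence::Create();
PassFenceToThreadB(std::move(fence));  // Hypothetical cross-thread handoff.

// Thread B, context B (in share group S):
fence->ServerWait();  // Or fence->ClientWait() to block this thread instead.
ReadImage();          // Observes the completed results of DrawImage.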

For more details, please refer to the underlying extension documentation, for example EGL_KHR_fence_sync, EGL_KHR_wait_sync, and GL_ARB_sync.

Implementation-dependent: same-thread driver-level GL contexts

Many GL driver implementations are based on a per-thread command queue, with the effect that commands are processed in order even if they were issued from different contexts on that thread without explicit synchronization.

This behavior is not part of the GL standard, and some driver implementations use a per-context command queue where this assumption is not true.

See issue 510232 for an example of a problematic sequence:

// In one thread:
MakeCurrent(A);
Render1();
MakeCurrent(B);
Render2();
CreateSync(X);

// And in another thread:
MakeCurrent(C);
WaitSync(X);
Render3();
MakeCurrent(D);
Render4();

The only serialization guarantee is that Render2 will complete before Render3, but Render4 could theoretically complete before Render1.

Chrome assumes that the render steps happen in order Render1, Render2, Render3, and Render4, and requires this behavior to ensure security. If the driver doesn't ensure this sequencing, Chrome has to emulate it using virtual contexts. (Or by using explicit synchronization, but it doesn't do that today.) See also the "CHROMIUM fence sync" section below.

Command buffer GL clients: use CHROMIUM sync tokens

Chrome's command buffer IPC interface uses multiple layers. There are multiple active IPC channels, typically one per client process (e.g. one per renderer process and one for the browser process). Each IPC channel has multiple scheduling groups (also called streams), and each stream can contain multiple command buffers, which in turn contain a sequence of GL commands.

Command buffers in the same client-side share group must be in the same stream. Command scheduling granularity is at the stream level, and a client can choose to create and use multiple streams with different stream priorities. Stream IDs are arbitrary integers assigned by the client at creation time; see for example the ws::ContextProviderCommandBuffer constructor.

The CHROMIUM sync token is intended to order operations among command buffer GL contexts. Generating a sync token inserts an internal fence sync command into the stream, flushes it as appropriate (see below), and produces a sync token, which is a cross-context transportable reference to the underlying fence sync. A WaitSyncTokenCHROMIUM call does not ensure that the underlying GL commands have been executed at the GPU driver level, so this mechanism is not suitable for synchronizing command buffer GL operations with a local driver-level GL context.

See the CHROMIUM_sync_point documentation for details.

Commands issued within a single command buffer don't need to be synchronized explicitly; they will be executed in the same order they were issued.

Multiple command buffers within the same stream can use an ordering barrier to sequence their commands. Sync tokens are not necessary. Example:

// Command buffers gl1 and gl2 are in the same stream.
Render1(gl1);
gl1->OrderingBarrierCHROMIUM();
Render2(gl2);  // will happen after Render1.

Command buffers that are in different streams need to use sync tokens. If both are using the same IPC channel (i.e. same client process), an unverified sync token is sufficient, and commands do not need to be flushed to the server:

// stream A
Render1(glA);
glA->GenUnverifiedSyncTokenCHROMIUM(out_sync_token);

// stream B
glB->WaitSyncTokenCHROMIUM(out_sync_token);
Render2(glB);  // will happen after Render1.

Command buffers that are using different IPC channels must use verified sync tokens. Verification is a check that the underlying fence sync was flushed to the server. Cross-process synchronization always uses verified sync tokens. GenSyncTokenCHROMIUM will force a shallow flush as a side effect if necessary. Example:

// IPC channel in process X
Render1(glX);
glX->GenSyncTokenCHROMIUM(out_sync_token);

// IPC channel in process Y
glY->WaitSyncTokenCHROMIUM(out_sync_token);
Render2(glY);  // will happen after Render1.

Alternatively, unverified sync tokens can be converted to verified ones in bulk by calling VerifySyncTokensCHROMIUM. This will wait for a flush to complete as necessary. Use this to avoid multiple sequential flushes:

gl->GenUnverifiedSyncTokenCHROMIUM(out_sync_tokens[0]);
gl->GenUnverifiedSyncTokenCHROMIUM(out_sync_tokens[1]);
gl->VerifySyncTokensCHROMIUM(out_sync_tokens, 2);

Implementation notes

Correctness of the CHROMIUM fence sync mechanism depends on the assumption that commands issued from the command buffer service side are executed in the order they were issued on that thread. This is ensured in different ways:

  • Issue a glFlush when switching contexts on platforms where glFlush is sufficient to ensure ordering, i.e. macOS. (This approach would not be well suited to the tiling GPU architectures used in many mobile GPUs, where glFlush is an expensive operation that may force content load/store between tile memory and main memory.) See for example gl::GLContextCGL::MakeCurrent:
  // It's likely we're going to switch OpenGL contexts at this point.
  // Before doing so, if there is a current context, flush it. There
  // are many implicit assumptions of flush ordering between contexts
  // at higher levels, and if a flush isn't performed, OpenGL commands
  // may be issued in unexpected orders, causing flickering and other
  // artifacts.
  • Force context virtualization so that all commands are issued into a single driver-level GL context. This is used on Qualcomm/Adreno chipsets, see issue 691102.

  • Assume per-thread command queues without explicit synchronization. GLX effectively ensures this. On Windows, ANGLE uses a single D3D device underneath all contexts, which ensures strong ordering.

GPU control tasks are processed out of band and are only partially ordered with respect to GL commands. A gpu_control task always happens before any following GL commands issued on the same IPC channel. It usually executes before any preceding unflushed GL commands, but this is not guaranteed. A ShallowFlushCHROMIUM ensures that any following gpu_control tasks will execute after the flushed GL commands.

In this example, DoTask will execute after GLCommandA and before GLCommandD, but there is no ordering guarantee relative to GLCommandB and GLCommandC:

  // gles2_implementation.cc

  helper_->GLCommandA();
  ShallowFlushCHROMIUM();

  helper_->GLCommandB();
  helper_->GLCommandC();
  gpu_control_->DoTask();

  helper_->GLCommandD();

  // Execution order is one of:
  //   A | DoTask B C | D
  //   A | B DoTask C | D
  //   A | B C DoTask | D

The shallow flush adds the pending GL commands to the service's task queue; this task queue is also used by incoming gpu_control tasks and is processed in order. The ShallowFlushCHROMIUM command returns as soon as the tasks are queued and does not wait for them to be processed.

Cross-process transport: GpuFence and GpuFenceHandle

Some platforms such as Android (most devices running N or newer) and ChromeOS support synchronizing a native GL context with a command buffer GL context through a GpuFence.

Use the static gl::GLFence::IsGpuFenceSupported() method to check at runtime whether the current platform supports the GpuFence mechanism, including GpuFenceHandle transport.

The GpuFence mechanism supports two use cases:

  • Create a GLFence object in a local context, convert it to a client-side GpuFence, duplicate it into a command buffer service-side gpu fence, and issue a server wait on the command buffer service side. That service-side wait will be unblocked when the client-side GpuFence signals.

  • Create a new command buffer service-side gpu fence, request a GpuFenceHandle from it, use this handle to create a native GL fence object in the local context, then issue a server wait on the local GL fence object. This local server wait will be unblocked when the service-side gpu fence signals.

The CHROMIUM_gpu_fence extension documents the GLES API as used through the command buffer interface. This section contains additional information about the integration with local GL contexts that is needed to work with these objects.

Driver-level wrappers

In general, you should use the static gl::GLFence::CreateForGpuFence() and gl::GLFence::CreateFromGpuFence() factory methods to create a platform-specific local fence object instead of using an implementation class directly.

For Android and ChromeOS, the gl::GLFenceAndroidNativeFenceSync implementation wraps the EGL_ANDROID_native_fence_sync extension, which allows creating a special EGL fence object from which a file descriptor can be extracted, then creating a duplicate fence object from that file descriptor that is synchronized with the original fence.
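
A rough sketch of the underlying EGL call sequence, with error handling and extension-function lookup omitted (|display| is assumed to be the relevant EGLDisplay; ownership of the file descriptor passes to the duplicate sync object):

    // Producing context: create a native fence sync and extract an fd.
    EGLSyncKHR sync =
        eglCreateSyncKHR(display, EGL_SYNC_NATIVE_FENCE_ANDROID, nullptr);
    glFlush();  // The fence must be submitted before the fd can be extracted.
    int fd = eglDupNativeFenceFDANDROID(display, sync);

    // Consuming context (possibly in another process): recreate a fence
    // that is synchronized with the original.
    EGLint attribs[] = {EGL_SYNC_NATIVE_FENCE_FD_ANDROID, fd, EGL_NONE};
    EGLSyncKHR dup_sync =
        eglCreateSyncKHR(display, EGL_SYNC_NATIVE_FENCE_ANDROID, attribs);
    eglWaitSyncKHR(display, dup_sync, 0);  // Server wait.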

GpuFence and GpuFenceHandle

A gfx::GpuFence object owns a GPU fence handle representing a native GL fence. The AsClientGpuFence method casts it to a ClientGpuFence type for use with the CHROMIUM_gpu_fence extension's CreateClientGpuFenceCHROMIUM call.

A gfx::GpuFenceHandle is an IPC-transportable wrapper for a file descriptor or other underlying primitive object, and is used to duplicate a native GL fence into another process. It has value semantics and can be copied multiple times, but must be consumed exactly once. Consumers take ownership of the underlying resource. Current GpuFenceHandle consumers are:

  • The gfx::GpuFence(gpu_fence_handle) constructor takes ownership of the handle's resources without constructing a local fence.

  • The IPC subsystem closes resources after sending. The typical idiom is to call gfx::CloneHandleForIPC(handle) on a GpuFenceHandle retrieved from a scope-lifetime object to create a copied handle that will be owned by the IPC subsystem, as sketched below.
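
A minimal sketch of that idiom, assuming |gpu_fence| is an existing gfx::GpuFence and SendFenceOverIPC is a hypothetical message-send helper:

    // Duplicate the handle; the original remains owned by |gpu_fence|.
    gfx::GpuFenceHandle ipc_handle =
        gfx::CloneHandleForIPC(gpu_fence->GetGpuFenceHandle());

    // The IPC subsystem takes ownership of the copy and closes the
    // underlying resource after the message has been sent.
    SendFenceOverIPC(std::move(ipc_handle));  // Hypothetical.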

Sample Code

A usage example for two-process synchronization is to sequence access to a globally shared drawable such as an AHardwareBuffer on Android, where the writer uses a local GL context and the reader is a command buffer context in the GPU process. The writer process draws into an AHardwareBuffer-backed GLImage in the local GL context, then creates a gpu fence to mark the end of drawing operations:

    // This example assumes that GpuFence is supported. If not, the application
    // should fall back to a different transport or synchronization method.
    DCHECK(gl::GLFence::IsGpuFenceSupported());

    // ... write to the shared drawable in local context, then create
    // a local fence.
    std::unique_ptr<gl::GLFence> local_fence = gl::GLFence::CreateForGpuFence();

    // Convert to a GpuFence.
    std::unique_ptr<gfx::GpuFence> gpu_fence = local_fence->GetGpuFence();
    // It's ok for local_fence to be destroyed now; the GpuFence remains valid.

    // Create a matching gpu fence on the command buffer context, issue
    // server wait, and destroy it.
    GLuint id = gl->CreateClientGpuFenceCHROMIUM(gpu_fence->AsClientGpuFence());
    // It's ok for gpu_fence to be destroyed now.
    gl->WaitGpuFenceCHROMIUM(id);
    gl->DestroyGpuFenceCHROMIUM(id);

    // ... read from the shared drawable via command buffer. These reads
    // will happen after the local_fence has signaled. The local
    // fence and gpu_fence don't need to remain alive for this.

If a process wants to consume a drawable that was produced through a command buffer context in the GPU process, the sequence is as follows:

    // Set up a callback that's invoked once the gpu fence is ready.
    void callback(std::unique_ptr<gfx::GpuFence> gpu_fence) {
        // Create a local context GL fence from the GpuFence.
        std::unique_ptr<gl::GLFence> local_fence =
            gl::GLFence::CreateFromGpuFence(*gpu_fence);
        local_fence->ServerWait();
        // ... read from the shared drawable in the local context.
    }

    // ... write to the shared drawable via command buffer, then
    // create a gpu fence:
    GLuint id = gl->CreateGpuFenceCHROMIUM();
    context_support->GetGpuFence(id, base::BindOnce(callback));
    gl->DestroyGpuFenceCHROMIUM(id);

It is legal to create the GpuFence on a separate command buffer context instead of the one that issued the drawing operations, but in that case gl->WaitSyncTokenCHROMIUM() or an equivalent must be used to sequence the operations between the distinct command buffer contexts as usual.
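
A minimal sketch of that cross-context sequencing, assuming both command buffer contexts use the same IPC channel so an unverified sync token suffices (gl_draw and gl_fence are hypothetical context handles):

    // Draw on one command buffer context and mark the point to wait for.
    Render(gl_draw);
    gl_draw->GenUnverifiedSyncTokenCHROMIUM(out_sync_token);

    // Sequence the other context after the draw, then create the gpu
    // fence there.
    gl_fence->WaitSyncTokenCHROMIUM(out_sync_token);
    GLuint id = gl_fence->CreateGpuFenceCHROMIUM();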