| # GPU Synchronization in Chrome |
| |
| Chrome supports multiple mechanisms for sequencing GPU drawing operations. This |
| document provides a brief overview, focusing on a high-level explanation of |
| when synchronization is needed and which mechanism is appropriate. |
| |
| [TOC] |
| |
| ## Glossary |
| |
| **GL Sync Object**: Generic GL-level synchronization object that can be in an |
| "unsignaled" or "signaled" state. The only current implementation of this is a |
| GL fence. |
| |
| **GL Fence**: A GL sync object that is inserted into the GL command stream. It |
| starts out unsignaled and becomes signaled when the GPU reaches this point in the |
| command stream, implying that all previous commands have completed. |
| |
| **Client Wait**: Block the client thread until a sync object becomes signaled, |
| or until a timeout occurs. |
| |
| **Server Wait**: Tell the GPU to defer executing commands issued after the wait |
| until the fence signals. The client thread continues executing immediately and |
| can continue submitting GL commands. |
| |
| **CHROMIUM fence sync**: A command buffer specific GL fence that sequences |
| operations among command buffer GL contexts without requiring driver-level |
| execution of previous commands. |
| |
| **Native GL Fence**: A GL Fence backed by a platform-specific cross-process |
| synchronization mechanism. |
| |
| **GPU Fence Handle**: An IPC-transportable object (typically a file descriptor) |
| that can be used to duplicate a native GL fence into a different process's |
| context. |
| |
| **GPU Fence**: A Chrome abstraction that owns a GPU fence handle representing a |
| native GL fence, usable for cross-process synchronization. |
| |
| ## Use case overview |
| |
| The core scenario is synchronizing read and write access to a shared resource, |
| for example drawing an image into an offscreen texture and compositing the |
| result into a final image. The drawing operations need to be completed before |
| reading to ensure correct output. A typical symptom of missing or incorrect |
| synchronization is that the output contains blank or incomplete results instead |
| of the expected rendered sub-images, causing flickering or tearing. |
| |
| "Completed" in this case means that the end result of using a resource as input |
| will be equivalent to waiting for everything to finish rendering, but it does |
| not necessarily mean that the GPU has fully finished all drawing operations at |
| that time. |
| |
| ## Single GL context: no synchronization needed |
| |
| If all access to the shared resource happens in the same GL context, there is no |
| need for explicit synchronization. GL guarantees that commands are logically |
| processed in the order they are submitted. This is true both for local GL |
| contexts (GL calls via ui/gl/ interfaces) and for a single command buffer GL |
| context. |
| |
| ## Multiple driver-level GL contexts in the same share group: use GLFence |
| |
| A process can create multiple GL contexts that are part of the same share group. |
| These contexts can be created in different threads within this process. |
| |
| In this case, GL fences must be used for sequencing, for example: |
| |
| 1. Context A: draw image, create GLFence |
| 1. Context B: server wait or client wait for GLFence, read image |
| |
| [gl::GLFence](/ui/gl/gl_fence.h) and its subclasses provide wrappers for |
| GL/EGL fence handling methods such as `eglFenceSyncKHR` and `eglWaitSyncKHR`. |
| These fence objects can be used cross-thread as long as both threads' GL |
| contexts are part of the same share group. |
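
The draw, fence, wait, read sequence above can be illustrated with a CPU-only model. This is a sketch under stated assumptions, not Chrome or GL code: `ToyFence`, `ToyContext`, and `DrawThenRead` are hypothetical names, and each context is modeled as a simple in-order command queue. It shows that a server wait defers later commands in the waiting context without blocking the caller.

```cpp
#include <cassert>
#include <deque>
#include <functional>
#include <memory>
#include <string>
#include <vector>

// Hypothetical model types; none of these are Chrome or GL APIs.
struct ToyFence {
  bool signaled = false;
};

class ToyContext {
 public:
  // Queues a drawing command that appends `name` to `log` when executed.
  void Draw(const std::string& name, std::vector<std::string>* log) {
    queue_.push_back([name, log] {
      log->push_back(name);
      return true;
    });
  }

  // Inserts a fence that signals when execution reaches this point.
  std::shared_ptr<ToyFence> InsertFence() {
    auto fence = std::make_shared<ToyFence>();
    queue_.push_back([fence] {
      fence->signaled = true;
      return true;
    });
    return fence;
  }

  // Server wait: later commands in this context stay queued until `fence`
  // signals. The caller returns immediately.
  void ServerWait(std::shared_ptr<ToyFence> fence) {
    queue_.push_back([fence] { return fence->signaled; });
  }

  // Executes queued commands in order until the queue is empty or the
  // front command is a wait on an unsignaled fence.
  void RunUntilBlocked() {
    while (!queue_.empty() && queue_.front()()) {
      queue_.pop_front();
    }
  }

 private:
  std::deque<std::function<bool()>> queue_;
};

// Context A draws and inserts a fence; context B waits, then reads.
std::vector<std::string> DrawThenRead() {
  std::vector<std::string> log;
  ToyContext a;
  ToyContext b;
  a.Draw("A: draw image", &log);
  std::shared_ptr<ToyFence> fence = a.InsertFence();
  b.ServerWait(fence);
  b.Draw("B: read image", &log);

  // Even if context B is scheduled first, its read stays deferred.
  b.RunUntilBlocked();
  assert(log.empty());

  a.RunUntilBlocked();
  b.RunUntilBlocked();
  return log;
}
```

Calling `DrawThenRead()` returns the log with the draw entry before the read entry, regardless of which queue the model runs first.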
| |
| For more details, please refer to the underlying extension documentation, for example: |
| |
| * https://www.khronos.org/opengl/wiki/Synchronization |
| * https://www.khronos.org/registry/EGL/extensions/KHR/EGL_KHR_fence_sync.txt |
| * https://www.khronos.org/registry/EGL/extensions/KHR/EGL_KHR_wait_sync.txt |
| |
| ## Implementation-dependent: same-thread driver-level GL contexts |
| |
| Many GL driver implementations are based on a per-thread command queue, |
| with the effect that commands are processed in order even if they were issued |
| from different contexts on that thread without explicit synchronization. |
| |
| This behavior is not part of the GL standard, and some driver implementations |
| use a per-context command queue where this assumption is not true. |
| |
| See [issue 510232](http://crbug.com/510243#c23) for an example of a problematic |
| sequence: |
| |
| ``` |
| // In one thread: |
| MakeCurrent(A); |
| Render1(); |
| MakeCurrent(B); |
| Render2(); |
| CreateSync(X); |
| |
| // And in another thread: |
| MakeCurrent(C); |
| WaitSync(X); |
| Render3(); |
| MakeCurrent(D); |
| Render4(); |
| ``` |
| |
| The only serialization guarantee is that Render2 will complete before Render3, |
| but Render4 could theoretically complete before Render1. |
| |
| Chrome assumes that the render steps happen in order Render1, Render2, Render3, |
| and Render4, and requires this behavior to ensure security. If the driver doesn't |
| ensure this sequencing, Chrome has to emulate it using virtual contexts. (Or by |
| using explicit synchronization, but it doesn't do that today.) See also the |
| "CHROMIUM fence sync" section below. |
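
To make the weak guarantee concrete, the sketch below enumerates the completion orders that a per-context-queue driver could produce for the sequence above. It is an illustrative model only: the function names are hypothetical, and the single constraint it encodes (Render2 completes before Render3) is the one stated in this section.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <vector>

// Returns every completion order of Render1..Render4 (encoded as 1..4)
// that respects the only guaranteed edge: Render2 before Render3.
std::vector<std::array<int, 4>> AllowedCompletionOrders() {
  std::array<int, 4> order = {1, 2, 3, 4};
  std::vector<std::array<int, 4>> allowed;
  do {
    auto pos2 = std::find(order.begin(), order.end(), 2);
    auto pos3 = std::find(order.begin(), order.end(), 3);
    assert(pos2 != order.end() && pos3 != order.end());
    if (pos2 < pos3) {
      allowed.push_back(order);
    }
  } while (std::next_permutation(order.begin(), order.end()));
  return allowed;
}

// Returns true if some allowed order completes Render4 before Render1.
bool Render4CanPrecedeRender1() {
  for (const std::array<int, 4>& order : AllowedCompletionOrders()) {
    auto pos4 = std::find(order.begin(), order.end(), 4);
    auto pos1 = std::find(order.begin(), order.end(), 1);
    if (pos4 < pos1) {
      return true;
    }
  }
  return false;
}
```

In this model, 12 of the 24 permutations are allowed, and some of them complete Render4 before Render1, which is the ordering Chrome cannot tolerate.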
| |
| ## Command buffer GL clients: use CHROMIUM sync tokens |
| |
| Chrome's command buffer IPC interface uses multiple layers. There are multiple |
| active IPC channels (typically one per process, i.e. one per Renderer and one |
| for Browser). Each IPC channel has multiple scheduling groups (also called |
| streams), and each stream can contain multiple command buffers, which in turn |
| contain a sequence of GL commands. |
| |
| Command buffers in the same client-side share group must be in the same stream. |
| Command scheduling granularity is at the stream level, and a client can choose |
| to create and use multiple streams with different stream priorities. Stream IDs |
| are arbitrary integers assigned by the client at creation time; see, for example, the |
| [viz::ContextProviderCommandBuffer](/services/viz/public/cpp/gpu/context_provider_command_buffer.h) |
| constructor. |
| |
| The CHROMIUM sync token is intended to order operations among command buffer GL |
| instructions. Generating one inserts an internal fence sync command in the |
| stream, flushes it appropriately (see below), and produces a sync token, which |
| is a cross-context transportable reference to the underlying fence sync. A |
| WaitSyncTokenCHROMIUM call does **not** ensure that the underlying GL commands |
| have been executed at the GPU driver level, so this mechanism is not suitable |
| for synchronizing command buffer GL operations with a local driver-level GL |
| context. |
| |
| See the |
| [CHROMIUM_sync_point](/gpu/GLES2/extensions/CHROMIUM/CHROMIUM_sync_point.txt) |
| documentation for details. |
| |
| Commands issued within a single command buffer don't need to be synchronized |
| explicitly; they will be executed in the same order that they were issued. |
| |
| Multiple command buffers within the same stream can use an ordering barrier to |
| sequence their commands. Sync tokens are not necessary. Example: |
| |
| ```c++ |
| // Command buffers gl1 and gl2 are in the same stream. |
| Render1(gl1); |
| gl1->OrderingBarrierCHROMIUM(); |
| Render2(gl2); // will happen after Render1. |
| ``` |
| |
| Command buffers that are in different streams need to use sync tokens. If both |
| are using the same IPC channel (i.e. same client process), an unverified sync |
| token is sufficient, and commands do not need to be flushed to the server: |
| |
| ```c++ |
| // stream A |
| Render1(glA); |
| glA->GenUnverifiedSyncTokenCHROMIUM(out_sync_token); |
| |
| // stream B |
| glB->WaitSyncTokenCHROMIUM(sync_token); |
| Render2(glB); // will happen after Render1. |
| ``` |
| |
| Command buffers that are using different IPC channels must use verified sync |
| tokens. Verification is a check that the underlying fence sync was flushed to |
| the server. Cross-process synchronization always uses verified sync tokens. |
| `GenSyncTokenCHROMIUM` will force a shallow flush as a side effect if necessary. |
| Example: |
| |
| ```c++ |
| // IPC channel in process X |
| Render1(glX); |
| glX->GenSyncTokenCHROMIUM(out_sync_token); |
| |
| // IPC channel in process Y |
| glY->WaitSyncTokenCHROMIUM(sync_token); |
| Render2(glY); // will happen after Render1. |
| ``` |
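
The cases covered so far can be summarized as a small decision function. This is only an illustrative sketch: `ToyCommandBufferId`, `SyncMechanism`, and `ChooseSyncMechanism` are hypothetical names rather than Chrome APIs, and the integer IDs are assumed purely for illustration.

```cpp
#include <cassert>

// Hypothetical description of where a command buffer lives.
struct ToyCommandBufferId {
  int channel;         // IPC channel (typically one per client process).
  int stream;          // Scheduling group within the channel.
  int command_buffer;  // Command buffer within the stream.
};

enum class SyncMechanism {
  kNone,                 // Same command buffer: issue order is preserved.
  kOrderingBarrier,      // Same stream, different command buffers.
  kUnverifiedSyncToken,  // Same channel, different streams.
  kVerifiedSyncToken,    // Different IPC channels.
};

// Picks the weakest mechanism that still sequences `producer` commands
// before `consumer` commands, following the rules in this section.
SyncMechanism ChooseSyncMechanism(const ToyCommandBufferId& producer,
                                  const ToyCommandBufferId& consumer) {
  // The model assumes non-negative IDs.
  assert(producer.channel >= 0 && consumer.channel >= 0);
  if (producer.channel != consumer.channel) {
    return SyncMechanism::kVerifiedSyncToken;
  }
  if (producer.stream != consumer.stream) {
    return SyncMechanism::kUnverifiedSyncToken;
  }
  if (producer.command_buffer != consumer.command_buffer) {
    return SyncMechanism::kOrderingBarrier;
  }
  return SyncMechanism::kNone;
}
```

For example, two command buffers that share a channel but use different streams map to `kUnverifiedSyncToken`, while any pair on different channels maps to `kVerifiedSyncToken`.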
| |
| Alternatively, unverified sync tokens can be converted to verified ones in bulk |
| by calling `VerifySyncTokensCHROMIUM`. This will wait for a flush to complete as |
| necessary. Use this to avoid multiple sequential flushes: |
| |
| ```c++ |
| gl->GenUnverifiedSyncTokenCHROMIUM(out_sync_tokens[0]); |
| gl->GenUnverifiedSyncTokenCHROMIUM(out_sync_tokens[1]); |
| gl->VerifySyncTokensCHROMIUM(out_sync_tokens, 2); |
| ``` |
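
The benefit of bulk verification can be seen in a toy model that counts flushes. This sketch is a deliberate simplification and not the real implementation: `ToySyncToken` and `ToyCommandBuffer` are hypothetical types, and the model assumes that every newly inserted fence sync is unflushed until the next flush.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical model; not the real sync token or GLES2 interface types.
struct ToySyncToken {
  std::uint64_t release_count = 0;
  bool verified = false;
};

class ToyCommandBuffer {
 public:
  // Inserts a fence sync but leaves it unflushed; the token is unverified.
  ToySyncToken GenUnverifiedSyncToken() {
    ToySyncToken token;
    token.release_count = ++next_release_;
    return token;
  }

  // Inserts a fence sync and flushes it so the token is verified. In this
  // model a fresh fence sync is always unflushed, so this always flushes.
  ToySyncToken GenSyncToken() {
    ToySyncToken token = GenUnverifiedSyncToken();
    Flush();
    token.verified = true;
    return token;
  }

  // Verifies all tokens with at most one flush.
  void VerifySyncTokens(std::vector<ToySyncToken>& tokens) {
    bool needs_flush = false;
    for (const ToySyncToken& token : tokens) {
      if (token.release_count > flushed_release_) {
        needs_flush = true;
      }
    }
    if (needs_flush) {
      Flush();
    }
    for (ToySyncToken& token : tokens) {
      assert(token.release_count <= flushed_release_);
      token.verified = true;
    }
  }

  int flush_count() const { return flush_count_; }

 private:
  void Flush() {
    flushed_release_ = next_release_;
    ++flush_count_;
  }

  std::uint64_t next_release_ = 0;
  std::uint64_t flushed_release_ = 0;
  int flush_count_ = 0;
};

// Flushes needed when each token is generated as verified.
int FlushesWhenGeneratingVerifiedTokens(int count) {
  ToyCommandBuffer buffer;
  for (int i = 0; i < count; ++i) {
    buffer.GenSyncToken();
  }
  return buffer.flush_count();
}

// Flushes needed when tokens are generated unverified, then verified in bulk.
int FlushesWhenVerifyingInBulk(int count) {
  ToyCommandBuffer buffer;
  std::vector<ToySyncToken> tokens;
  for (int i = 0; i < count; ++i) {
    tokens.push_back(buffer.GenUnverifiedSyncToken());
  }
  buffer.VerifySyncTokens(tokens);
  return buffer.flush_count();
}
```

In this model, generating two verified tokens one after the other costs two flushes, while generating two unverified tokens and verifying them together costs one.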
| |
| ### Implementation notes |
| |
| Correctness of the CHROMIUM fence sync mechanism depends on the assumption that |
| commands issued from the command buffer service side happen in the order they |
| were issued in that thread. This is handled in different ways: |
| |
| * Issue a glFlush on switching contexts on platforms where glFlush is sufficient |
| to ensure ordering, i.e. macOS. (This approach would not be well suited to |
| tiling GPUs as used on many mobile devices, where glFlush is an expensive |
| operation that may force content load/store between tile memory and main |
| memory.) See for example |
| [gl::GLContextCGL::MakeCurrent](/ui/gl/gl_context_cgl.cc): |
| ```c++ |
| // It's likely we're going to switch OpenGL contexts at this point. |
| // Before doing so, if there is a current context, flush it. There |
| // are many implicit assumptions of flush ordering between contexts |
| // at higher levels, and if a flush isn't performed, OpenGL commands |
| // may be issued in unexpected orders, causing flickering and other |
| // artifacts. |
| ``` |
| |
| * Force context virtualization so that all commands are issued into a single |
| driver-level GL context. This is used on Qualcomm/Adreno chipsets; see [issue |
| 691102](http://crbug.com/691102). |
| |
| * Assume per-thread command queues without explicit synchronization. GLX |
| effectively ensures this. On Windows, ANGLE uses a single D3D device |
| underneath all contexts which ensures strong ordering. |
| |
| GPU control tasks are processed out of band and are only partially ordered with |
| respect to GL commands. A gpu_control task always happens before any following |
| GL commands issued on the same IPC channel. It usually executes before any |
| preceding unflushed GL commands, but this is not guaranteed. A |
| `ShallowFlushCHROMIUM` ensures that any following gpu_control tasks will execute |
| after the flushed GL commands. |
| |
| In this example, DoTask will execute after GLCommandA and before GLCommandD, but |
| there is no ordering guarantee relative to GLCommandB and GLCommandC: |
| |
| ```c++ |
| // gles2_implementation.cc |
| |
| helper_->GLCommandA(); |
| ShallowFlushCHROMIUM(); |
| |
| helper_->GLCommandB(); |
| helper_->GLCommandC(); |
| gpu_control_->DoTask(); |
| |
| helper_->GLCommandD(); |
| |
| // Execution order is one of: |
| // A | DoTask B C | D |
| // A | B DoTask C | D |
| // A | B C DoTask | D |
| ``` |
| |
| The shallow flush adds the pending GL commands to the service's task queue, and |
| this task queue is also used by incoming gpu control tasks and processed in |
| order. The `ShallowFlushCHROMIUM` command returns as soon as the tasks are |
| queued and does not wait for them to be processed. |
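
The partial ordering described above can be expressed as a small enumeration. This is a hedged sketch of the rule as stated in this section, not service code: `PossibleOrders` and its parameters are hypothetical names, and commands are represented as plain strings.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Returns the possible execution orders for a control task that is issued
// after the `flushed` GL commands were flushed, while the `unflushed`
// commands are still pending, and before the `later` commands are issued.
std::vector<std::vector<std::string>> PossibleOrders(
    const std::vector<std::string>& flushed,
    const std::vector<std::string>& unflushed,
    const std::vector<std::string>& later,
    const std::string& task) {
  std::vector<std::vector<std::string>> orders;
  // The task may land before, between, or after the unflushed commands,
  // which keep their relative order.
  for (std::size_t slot = 0; slot <= unflushed.size(); ++slot) {
    auto split = unflushed.begin() + static_cast<std::ptrdiff_t>(slot);
    std::vector<std::string> order = flushed;
    order.insert(order.end(), unflushed.begin(), split);
    order.push_back(task);
    order.insert(order.end(), split, unflushed.end());
    order.insert(order.end(), later.begin(), later.end());
    orders.push_back(order);
  }
  assert(orders.size() == unflushed.size() + 1);
  return orders;
}
```

Calling `PossibleOrders({"A"}, {"B", "C"}, {"D"}, "DoTask")` yields the three orders listed in the comment of the example above.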
| |
| ## Cross-process transport: GpuFence and GpuFenceHandle |
| |
| Some platforms such as Android (most devices N and above) and ChromeOS support |
| synchronizing a native GL context with a command buffer GL context through a |
| GpuFence. |
| |
| Use the static `gl::GLFence::IsGpuFenceSupported()` method to check at runtime if |
| the current platform has support for the GpuFence mechanism including |
| GpuFenceHandle transport. |
| |
| The GpuFence mechanism supports two use cases: |
| |
| * Create a GLFence object in a local context, convert it to a client-side |
| GpuFence, duplicate it into a command buffer service-side gpu fence, and |
| issue a server wait on the command buffer service side. That service-side |
| wait will be unblocked when the *client-side* GpuFence signals. |
| |
| * Create a new command buffer service-side gpu fence, request a GpuFenceHandle |
| from it, use this handle to create a native GL fence object in the local |
| context, then issue a server wait on the local GL fence object. This local |
| server wait will be unblocked when the *service-side* gpu fence signals. |
| |
| The [CHROMIUM_gpu_fence |
| extension](/gpu/GLES2/extensions/CHROMIUM/CHROMIUM_gpu_fence.txt) documents |
| the GLES API as used through the command buffer interface. This section contains |
| additional information about the integration with local GL contexts that is |
| needed to work with these objects. |
| |
| ### Driver-level wrappers |
| |
| In general, you should use the static `gl::GLFence::CreateForGpuFence()` and |
| `gl::GLFence::CreateFromGpuFence()` factory methods to create a |
| platform-specific local fence object instead of using an implementation class |
| directly. |
| |
| For Android and ChromeOS, the |
| [gl::GLFenceAndroidNativeFenceSync](/ui/gl/gl_fence_android_native_fence_sync.h) |
| implementation wraps the |
| [EGL_ANDROID_native_fence_sync](https://www.khronos.org/registry/EGL/extensions/ANDROID/EGL_ANDROID_native_fence_sync.txt) |
| extension that allows creating a special EGLFence object from which a file |
| descriptor can be extracted, and then creating a duplicate fence object from |
| that file descriptor that is synchronized with the original fence. |
| |
| ### GpuFence and GpuFenceHandle |
| |
| A [gfx::GpuFence](/ui/gfx/gpu_fence.h) object owns a GPU fence handle |
| representing a native GL fence. The `AsClientGpuFence` method casts it to a |
| ClientGpuFence type for use with the [CHROMIUM_gpu_fence |
| extension](/gpu/GLES2/extensions/CHROMIUM/CHROMIUM_gpu_fence.txt)'s |
| `CreateClientGpuFenceCHROMIUM` call. |
| |
| A [gfx::GpuFenceHandle](/ui/gfx/gpu_fence_handle.h) is an IPC-transportable |
| wrapper for a file descriptor or other underlying primitive object, and is used |
| to duplicate a native GL fence into another process. It has value semantics and |
| can be copied multiple times, and then consumed exactly one time. Consumers take |
| ownership of the underlying resource. Current GpuFenceHandle consumers are: |
| |
| * The `gfx::GpuFence(gpu_fence_handle)` constructor takes ownership of the |
| handle's resources without constructing a local fence. |
| |
| * The IPC subsystem closes resources after sending. The typical idiom is to call |
| `gfx::CloneHandleForIPC(handle)` on a GpuFenceHandle retrieved from a |
| scope-lifetime object to create a copied handle that will be owned by the IPC |
| subsystem. |
| |
| ### Sample Code |
| |
| A usage example for two-process synchronization is to sequence access to a |
| globally shared drawable such as an AHardwareBuffer on Android, where the |
| writer uses a local GL context and the reader is a command buffer context in |
| the GPU process. The writer process draws into an AHardwareBuffer-backed |
| SharedImage in the local GL context, then creates a gpu fence to mark the end of |
| drawing operations: |
| |
| ```c++ |
| // This example assumes that GpuFence is supported. If not, the application |
| // should fall back to a different transport or synchronization method. |
| DCHECK(gl::GLFence::IsGpuFenceSupported()); |
| |
| // ... write to the shared drawable in local context, then create |
| // a local fence. |
| std::unique_ptr<gl::GLFence> local_fence = gl::GLFence::CreateForGpuFence(); |
| |
| // Convert to a GpuFence. |
| std::unique_ptr<gfx::GpuFence> gpu_fence = local_fence->GetGpuFence(); |
| // It's ok for local_fence to be destroyed now, the GpuFence remains valid. |
| |
| // Create a matching gpu fence on the command buffer context, issue |
| // server wait, and destroy it. |
| GLuint id = gl->CreateClientGpuFenceCHROMIUM(gpu_fence->AsClientGpuFence()); |
| // It's ok for gpu_fence to be destroyed now. |
| gl->WaitGpuFenceCHROMIUM(id); |
| gl->DestroyGpuFenceCHROMIUM(id); |
| |
| // ... read from the shared drawable via command buffer. These reads |
| // will happen after the local_fence has signaled. The local_fence |
| // and gpu_fence don't need to remain alive for this. |
| ``` |
| |
| If a process wants to consume a drawable that was produced through a command |
| buffer context in the GPU process, the sequence is as follows: |
| |
| ```c++ |
| // Set up callback that's waiting for the drawable to be ready. |
| void callback(std::unique_ptr<gfx::GpuFence> gpu_fence) { |
|   // Create a local context GL fence from the GpuFence. |
|   std::unique_ptr<gl::GLFence> local_fence = |
|       gl::GLFence::CreateFromGpuFence(*gpu_fence); |
|   local_fence->ServerWait(); |
|   // ... read from the shared drawable in the local context. |
| } |
| |
| // ... write to the shared drawable via command buffer, then |
| // create a gpu fence: |
| GLuint id = gl->CreateGpuFenceCHROMIUM(); |
| context_support->GetGpuFenceHandle(id, base::BindOnce(callback)); |
| gl->DestroyGpuFenceCHROMIUM(id); |
| ``` |
| |
| It is legal to create the GpuFence on a separate command buffer context instead |
| of on the command buffer channel that did the drawing operations, but in that |
| case `gl->WaitSyncTokenCHROMIUM()` or equivalent must be used to sequence the |
| operations between the distinct command buffer contexts as usual. |