| GL Dispatch |
| =========== |
| |
| Several factors combine to make efficient dispatch of OpenGL functions |
| fairly complicated. This document attempts to explain some of the issues |
| and introduce the reader to Mesa's implementation. Readers already |
| familiar with the issues around GL dispatch can safely skip ahead to the |
| :ref:`overview of Mesa's implementation <overview>`. |
| |
| 1. Complexity of GL Dispatch |
| ---------------------------- |
| |
| Every GL application has at least one object called a GL *context*. This |
| object, which is an implicit parameter to every GL function, stores all |
| of the GL related state for the application. Every texture, every buffer |
| object, every enable, and much, much more is stored in the context. |
| Since an application can have more than one context, the context to be |
| used is selected by a window-system dependent function such as |
| ``glXMakeContextCurrent``. |
| |
| In environments that implement OpenGL with X-Windows using GLX, every GL |
| function, including the pointers returned by ``glXGetProcAddress``, are |
| *context independent*. This means that no matter what context is |
| currently active, the same ``glVertex3fv`` function is used. |
| |
| This creates the first bit of dispatch complexity. An application can |
| have two GL contexts. One context is a direct rendering context where |
| function calls are routed directly to a driver loaded within the |
| application's address space. The other context is an indirect rendering |
| context where function calls are converted to GLX protocol and sent to a |
| server. The same ``glVertex3fv`` has to do the right thing depending on |
| which context is current. |
| |
| Highly optimized drivers or GLX protocol implementations may want to |
| change the behavior of GL functions depending on current state. For |
| example, ``glFogCoordf`` may operate differently depending on whether or |
| not fog is enabled. |
| |
| In multi-threaded environments, it is possible for each thread to have a |
| different GL context current. This means that poor old ``glVertex3fv`` |
| has to know which GL context is current in the thread where it is being |
| called. |
| |
| .. _overview: |
| |
| 2. Overview of Mesa's Implementation |
| ------------------------------------ |
| |
| Mesa uses two per-thread pointers. The first pointer stores the address |
| of the context current in the thread, and the second pointer stores the |
| address of the *dispatch table* associated with that context. The |
| dispatch table stores pointers to functions that actually implement |
| specific GL functions. Each time a new context is made current in a |
| thread, these pointers are updated. |
| |
| The implementation of functions such as ``glVertex3fv`` becomes |
| conceptually simple: |
| |
| - Fetch the current dispatch table pointer. |
| - Fetch the pointer to the real ``glVertex3fv`` function from the |
| table. |
| - Call the real function. |
| |
| This can be implemented in just a few lines of C code. The file |
| ``src/mesa/glapi/glapitemp.h`` contains code very similar to this. |
| |
| .. code-block:: c |
| :caption: Sample dispatch function |
| |
| void glVertex3f(GLfloat x, GLfloat y, GLfloat z) |
| { |
| const struct _glapi_table * const dispatch = GET_DISPATCH(); |
| |
| (*dispatch->Vertex3f)(x, y, z); |
| } |
| |
| The problem with this simple implementation is the large amount of |
| overhead that it adds to every GL function call. |
| |
| In a multithreaded environment, a naive implementation of |
| ``GET_DISPATCH`` involves a call to ``pthread_getspecific`` or a similar |
| function. Mesa provides a wrapper function called |
| ``_glapi_get_dispatch`` that is used by default. |
| |
| 3. Optimizations |
| ---------------- |
| |
| A number of optimizations have been made over the years to diminish the |
| performance hit imposed by GL dispatch. This section describes these |
| optimizations. The benefits of each optimization and the situations |
| where each can or cannot be used are listed. |
| |
| 3.1. Dual dispatch table pointers |
| ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ |
| |
| The vast majority of OpenGL applications use the API in a single |
| threaded manner. That is, the application has only one thread that makes |
| calls into the GL. In these cases, not only do the calls to |
| ``pthread_getspecific`` hurt performance, but they are completely |
| unnecessary! It is possible to detect this common case and avoid these |
| calls. |
| |
| Each time a new dispatch table is set, Mesa examines and records the ID |
| of the executing thread. If the same thread ID is always seen, Mesa |
| knows that the application is, from OpenGL's point of view, single |
| threaded. |
| |
| As long as an application is single threaded, Mesa stores a pointer to |
| the dispatch table in a global variable called ``_glapi_Dispatch``. The |
| pointer is also stored in a per-thread location via |
| ``pthread_setspecific``. When Mesa detects that an application has |
| become multithreaded, ``NULL`` is stored in ``_glapi_Dispatch``. |
| |
| Using this simple mechanism the dispatch functions can detect the |
| multithreaded case by comparing ``_glapi_Dispatch`` to ``NULL``. The |
| resulting implementation of ``GET_DISPATCH`` is slightly more complex, |
| but it avoids the expensive ``pthread_getspecific`` call in the common |
| case. |
| |
| .. code-block:: c |
| :caption: Improved ``GET_DISPATCH`` Implementation |
| |
| #define GET_DISPATCH() \ |
| (_glapi_Dispatch != NULL) \ |
| ? _glapi_Dispatch : pthread_getspecific(&_glapi_Dispatch_key) |
| |
| 3.2. ELF TLS |
| ~~~~~~~~~~~~ |
| |
| Starting with the 2.4.20 Linux kernel, each thread is allocated an area |
| of per-thread, global storage. Variables can be put in this area using |
| some extensions to GCC. By storing the dispatch table pointer in this |
| area, the expensive call to ``pthread_getspecific`` and the test of |
| ``_glapi_Dispatch`` can be avoided. |
| |
| The dispatch table pointer is stored in a new variable called |
| ``_glapi_tls_Dispatch``. A new variable name is used so that a single |
| libGL can implement both interfaces. This allows the libGL to operate |
| with direct rendering drivers that use either interface. Once the |
| pointer is properly declared, ``GET_DISPACH`` becomes a simple variable |
| reference. |
| |
| .. code-block:: c |
| :caption: TLS ``GET_DISPATCH`` Implementation |
| |
| extern __thread struct _glapi_table *_glapi_tls_Dispatch |
| __attribute__((tls_model("initial-exec"))); |
| |
| #define GET_DISPATCH() _glapi_tls_Dispatch |
| |
| Use of this path is controlled by the preprocessor define |
| ``USE_ELF_TLS``. Any platform capable of using ELF TLS should use this |
| as the default dispatch method. |
| |
| Windows has a similar concept, and beginning with Windows Vista, shared |
| libraries can take advantage of compiler-assisted TLS. This TLS data |
| has no fixed size and does not compete with API-based TLS (``TlsAlloc``) |
| for the limited number of slots available there, and so ``USE_ELF_TLS`` can |
| be used on Windows too, even though it's not truly ELF. |
| |
| 3.3. Assembly Language Dispatch Stubs |
| ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ |
| |
| Many platforms have difficulty properly optimizing the tail-call in the |
| dispatch stubs. Platforms like x86 that pass parameters on the stack |
| seem to have even more difficulty optimizing these routines. All of the |
| dispatch routines are very short, and it is trivial to create optimal |
| assembly language versions. The amount of optimization provided by using |
| assembly stubs varies from platform to platform and application to |
| application. However, by using the assembly stubs, many platforms can |
| use an additional space optimization (see :ref:`below <fixedsize>`). |
| |
| The biggest hurdle to creating assembly stubs is handling the various |
| ways that the dispatch table pointer can be accessed. There are four |
| different methods that can be used: |
| |
| #. Using ``_glapi_Dispatch`` directly in builds for non-multithreaded |
| environments. |
| #. Using ``_glapi_Dispatch`` and ``_glapi_get_dispatch`` in |
| multithreaded environments. |
| #. Using ``_glapi_Dispatch`` and ``pthread_getspecific`` in |
| multithreaded environments. |
| #. Using ``_glapi_tls_Dispatch`` directly in TLS enabled multithreaded |
| environments. |
| |
| People wishing to implement assembly stubs for new platforms should |
| focus on #4 if the new platform supports TLS. Otherwise, implement #2 |
| followed by #3. Environments that do not support multithreading are |
| uncommon and not terribly relevant. |
| |
| Selection of the dispatch table pointer access method is controlled by a |
| few preprocessor defines. |
| |
| - If ``USE_ELF_TLS`` is defined, method #3 is used. |
| - If ``HAVE_PTHREAD`` is defined, method #2 is used. |
| - If none of the preceding are defined, method #1 is used. |
| |
| Two different techniques are used to handle the various different cases. |
| On x86 and SPARC, a macro called ``GL_STUB`` is used. In the preamble of |
| the assembly source file different implementations of the macro are |
| selected based on the defined preprocessor variables. The assembly code |
| then consists of a series of invocations of the macros such as: |
| |
| .. code-block:: c |
| :caption: SPARC Assembly Implementation of ``glColor3fv`` |
| |
| GL_STUB(Color3fv, _gloffset_Color3fv) |
| |
| The benefit of this technique is that changes to the calling pattern |
| (i.e., addition of a new dispatch table pointer access method) require |
| fewer changed lines in the assembly code. |
| |
| However, this technique can only be used on platforms where the function |
| implementation does not change based on the parameters passed to the |
| function. For example, since x86 passes all parameters on the stack, no |
| additional code is needed to save and restore function parameters around |
| a call to ``pthread_getspecific``. Since x86-64 passes parameters in |
| registers, varying amounts of code needs to be inserted around the call |
| to ``pthread_getspecific`` to save and restore the GL function's |
| parameters. |
| |
| The other technique, used by platforms like x86-64 that cannot use the |
| first technique, is to insert ``#ifdef`` within the assembly |
| implementation of each function. This makes the assembly file |
| considerably larger (e.g., 29,332 lines for ``glapi_x86-64.S`` versus |
| 1,155 lines for ``glapi_x86.S``) and causes simple changes to the |
| function implementation to generate many lines of diffs. Since the |
| assembly files are typically generated by scripts, this isn't a |
| significant problem. |
| |
| Once a new assembly file is created, it must be inserted in the build |
| system. There are two steps to this. The file must first be added to |
| ``src/mesa/sources``. That gets the file built and linked. The second |
| step is to add the correct ``#ifdef`` magic to |
| ``src/mesa/glapi/glapi_dispatch.c`` to prevent the C version of the |
| dispatch functions from being built. |
| |
| .. _fixedsize: |
| |
| 3.4. Fixed-Length Dispatch Stubs |
| ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ |
| |
| To implement ``glXGetProcAddress``, Mesa stores a table that associates |
| function names with pointers to those functions. This table is stored in |
| ``src/mesa/glapi/glprocs.h``. For different reasons on different |
| platforms, storing all of those pointers is inefficient. On most |
| platforms, including all known platforms that support TLS, we can avoid |
| this added overhead. |
| |
| If the assembly stubs are all the same size, the pointer need not be |
| stored for every function. The location of the function can instead be |
| calculated by multiplying the size of the dispatch stub by the offset of |
| the function in the table. This value is then added to the address of |
| the first dispatch stub. |
| |
| This path is activated by adding the correct ``#ifdef`` magic to |
| ``src/mesa/glapi/glapi.c`` just before ``glprocs.h`` is included. |