| ============================ |
| PNaCl C/C++ Language Support |
| ============================ |
| |
| .. contents:: |
| :local: |
| :backlinks: none |
| :depth: 3 |
| |
| Source language support |
| ======================= |
| |
| The currently supported languages are C and C++. The PNaCl toolchain is |
| based on recent Clang, which fully supports C++11 and most of C11. A |
| detailed status of the language support is available `here |
| <http://clang.llvm.org/cxx_status.html>`_. |
| |
| For information on using languages other than C/C++, see the :ref:`FAQ |
| section on other languages <other_languages>`. |
| |
| As for the standard libraries, the PNaCl toolchain is currently based on |
| ``libc++``, and the ``newlib`` standard C library. ``libstdc++`` is also |
| supported but its use is discouraged; see :ref:`building_cpp_libraries` |
| for more details. |
| |
| Versions |
| -------- |
| |
| Version information can be obtained as follows: |
| |
| * Clang/LLVM: run ``pnacl-clang -v``. |
| * ``newlib``: use the ``_NEWLIB_VERSION`` macro. |
| * ``libc++``: use the ``_LIBCPP_VERSION`` macro. |
| * ``libstdc++``: use the ``_GLIBCXX_VERSION`` macro. |
| |
| Preprocessor definitions |
| ------------------------ |
| |
| When compiling C/C++ code, the PNaCl toolchain defines the ``__pnacl__`` |
| macro. In addition, ``__native_client__`` is defined for compatibility |
| with other NaCl toolchains. |
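As a minimal sketch of how these two macros might be used, the helper below reports which toolchain family compiled the file. The function name ``toolchain_name`` is hypothetical and only serves the illustration; the macro names are the ones described above.

```cpp
#include <string>

// Hypothetical helper: classifies the compiling toolchain using only the
// __pnacl__ and __native_client__ macros described above.
inline std::string toolchain_name() {
#if defined(__pnacl__)
  // PNaCl defines __pnacl__ (and also __native_client__), so test it first.
  return "PNaCl";
#elif defined(__native_client__)
  return "NaCl";
#else
  return "other";
#endif
}
```

Testing ``__pnacl__`` before ``__native_client__`` matters because the PNaCl toolchain defines both.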
| |
| .. _memory_model_and_atomics: |
| |
| Memory Model and Atomics |
| ======================== |
| |
| Memory Model for Concurrent Operations |
| -------------------------------------- |
| |
| The memory model offered by PNaCl relies on the same coding guidelines |
| as the C11/C++11 one: concurrent accesses must always occur through |
| atomic primitives (offered by :ref:`atomic intrinsics |
| <bitcode_atomicintrinsics>`), and these accesses must always |
| occur with the same size for the same memory location. Visibility of |
| stores is provided on a happens-before basis that relates memory |
| locations to each other as the C11/C++11 standards do. |
| |
| Non-atomic memory accesses may be reordered, separated, elided or fused |
| according to C and C++'s memory model before the pexe is created as well |
| as after its creation. Accessing an atomic memory location through |
| non-atomic primitives is :ref:`Undefined Behavior <undefined_behavior>`. |
| |
| As in C11/C++11 some atomic accesses may be implemented with locks on |
| certain platforms. The ``ATOMIC_*_LOCK_FREE`` macros will always be |
| ``1``, signifying that all types are sometimes lock-free. The |
| ``is_lock_free`` methods and ``atomic_is_lock_free`` will return the |
| current platform's implementation at translation time. These macros, |
| methods and functions are in the C11 header ``<stdatomic.h>`` and the |
| C++11 header ``<atomic>``. |
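The difference between the compile-time macros and the run-time query can be sketched as follows. The helper names are hypothetical; ``ATOMIC_INT_LOCK_FREE`` and ``is_lock_free`` are the standard C++11 facilities from ``<atomic>``.

```cpp
#include <atomic>
#include <string>

// Hypothetical helper: maps an ATOMIC_*_LOCK_FREE value to its standard meaning.
inline std::string lock_free_meaning(int macro_value) {
  switch (macro_value) {
    case 0:  return "never lock-free";
    case 1:  return "sometimes lock-free";  // the value PNaCl reports, per the text above
    case 2:  return "always lock-free";
    default: return "unknown";
  }
}

// Compile-time view: what the macro promises for int.
inline std::string int_lock_free_by_macro() {
  return lock_free_meaning(ATOMIC_INT_LOCK_FREE);
}

// Run-time view: what the platform actually running the code provides.
inline bool int_lock_free_on_this_platform() {
  std::atomic<int> probe(0);
  return probe.is_lock_free();
}
```

Portable code that needs a definite answer should rely on the run-time query rather than on the macro value.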
| |
| The PNaCl toolchain supports concurrent memory accesses through legacy |
| GCC-style ``__sync_*`` builtins, as well as through C11/C++11 atomic |
| primitives and the underlying `GCCMM |
| <http://gcc.gnu.org/wiki/Atomic/GCCMM>`_ ``__atomic_*`` |
| primitives. ``volatile`` memory accesses can also be used, though these |
| are discouraged. See `Volatile Memory Accesses`_. |
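For comparison, here is a sketch of the same atomic increment written with each of the three families mentioned above. The function names are hypothetical; the builtins and ``std::atomic`` members are the standard GCC/Clang and C++11 ones.

```cpp
#include <atomic>

// Legacy GCC-style builtin: atomically adds 1 and returns the previous value.
inline int increment_with_sync(int* counter) {
  return __sync_fetch_and_add(counter, 1);
}

// GCCMM builtin: same operation with an explicit memory ordering argument.
inline int increment_with_gccmm(int* counter) {
  return __atomic_fetch_add(counter, 1, __ATOMIC_SEQ_CST);
}

// C++11 atomic: the type itself records that the location is atomic.
inline int increment_with_std(std::atomic<int>& counter) {
  return counter.fetch_add(1, std::memory_order_seq_cst);
}
```

New code would normally prefer the C++11 form, since the type system then prevents accidental non-atomic access to the same location.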
| |
| PNaCl supports concurrency and parallelism with some restrictions: |
| |
| * Threading is explicitly supported and has no restrictions over what |
| prevalent implementations offer. See `Threading`_. |
| |
| * ``volatile`` and atomic operations are address-free (operations on the |
| same memory location via two different addresses work atomically), as |
| intended by the C11/C++11 standards. This is critical in supporting |
| synchronous "external modifications" such as mapping underlying memory |
| at multiple locations. |
| |
| * Inter-process communication through shared memory is currently not |
| supported. See `Future Directions`_. |
| |
| * Signal handling isn't supported, so PNaCl promotes all |
| primitives to cross-thread (instead of single-thread). This may change |
| at a later date. Note that using atomic operations which aren't |
| lock-free may lead to deadlocks when handling asynchronous |
| signals. See `Future Directions`_. |
| |
| * Direct interaction with device memory isn't supported, and there is no |
| intent to support it. The embedding sandbox's runtime can offer APIs |
| to indirectly access devices. |
| |
| Setting up the above mechanisms requires assistance from the embedding |
| sandbox's runtime (e.g. NaCl's Pepper APIs), but using them once set up |
| can be done through regular C/C++ code. |
| |
| Atomic Memory Ordering Constraints |
| ---------------------------------- |
| |
| Atomics follow the same ordering constraints as in regular C11/C++11, |
| but all accesses are promoted to sequential consistency (the strongest |
| memory ordering) at pexe creation time. We plan to support more of the |
| C11/C++11 memory orderings in the future. |
| |
| Some additional restrictions, following the C11/C++11 standards: |
| |
| - Atomic accesses must at least be naturally aligned. |
| - Some accesses may not actually be atomic on certain platforms, |
| requiring an implementation that uses global locks. |
| - An atomic memory location must always be accessed with atomic |
| primitives, and these primitives must always be of the same bit size |
| for that location. |
| - Not all memory orderings are valid for all atomic operations. |
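To illustrate the ordering rules, the sketch below publishes a value using release/acquire orderings. The ``Mailbox`` type and function names are hypothetical. According to the text above, PNaCl would treat both accesses as sequentially consistent at pexe creation time, which is stronger than requested and therefore still correct.

```cpp
#include <atomic>

// Hypothetical single-slot mailbox: `payload` is plain data that is
// published to other threads through the atomic `ready` flag.
struct Mailbox {
  int payload = 0;
  std::atomic<bool> ready{false};
};

// Writer side: the release store makes the payload write visible to any
// reader that later observes ready == true.
inline void publish(Mailbox& box, int value) {
  box.payload = value;
  box.ready.store(true, std::memory_order_release);
}

// Reader side: returns true and fills `out` once the value is published.
// Note that acquire is valid for loads and release for stores, but not
// the other way around.
inline bool try_consume(Mailbox& box, int& out) {
  if (!box.ready.load(std::memory_order_acquire)) return false;
  out = box.payload;
  return true;
}
```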
| |
| Volatile Memory Accesses |
| ------------------------ |
| |
| The C11/C++11 standards mandate that ``volatile`` accesses execute in |
| program order (but are not fences, so other memory operations can |
| reorder around them), are not necessarily atomic, and can’t be |
| elided. They can be separated into smaller width accesses. |
| |
| Before any optimizations occur, the PNaCl toolchain transforms |
| ``volatile`` loads and stores into sequentially consistent ``volatile`` |
| atomic loads and stores, and applies regular compiler optimizations |
| along the above guidelines. This orders ``volatiles`` according to the |
| atomic rules, and means that fences (including ``__sync_synchronize``) |
| act in a better-defined manner. Regular memory accesses still do not |
| have ordering guarantees with ``volatile`` and atomic accesses, though |
| the internal representation of ``__sync_synchronize`` attempts to |
| prevent reordering of memory accesses to objects which may escape. |
| |
| Relaxed ordering could be used instead, but for the first release it is |
| more conservative to apply sequential consistency. Future releases may |
| change what happens at compile-time, but already-released pexes will |
| continue using sequential consistency. |
| |
| The PNaCl toolchain also requires that ``volatile`` accesses be at least |
| naturally aligned, and tries to guarantee this alignment. |
| |
| The above guarantees ease the support of legacy (i.e. non-C11/C++11) |
| code, and combined with builtin fences these programs can do meaningful |
| cross-thread communication without changing code. They also better |
| reflect the original code's intent and guarantee better portability. |
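The kind of legacy code these guarantees are aimed at might look like the sketch below: ``volatile`` globals combined with a builtin fence. All names are hypothetical. In standard C++ this pattern is not a valid way to synchronize threads; it is shown only because the text above describes how PNaCl gives it better-defined behavior. New code should use atomics instead.

```cpp
// Hypothetical legacy-style shared state (no C11/C++11 atomics).
volatile int g_data = 0;
volatile int g_ready = 0;

// Writer: store the data, issue a full fence, then raise the flag.
inline void legacy_publish(int value) {
  g_data = value;
  __sync_synchronize();  // builtin full memory barrier
  g_ready = 1;
}

// Reader: returns true and copies the data once the flag is raised.
inline bool legacy_poll(int* out) {
  if (g_ready == 0) return false;
  __sync_synchronize();
  *out = g_data;
  return true;
}
```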
| |
| .. _language_support_threading: |
| |
| Threading |
| ========= |
| |
| Threading is explicitly supported through C11/C++11's threading |
| libraries as well as POSIX threads. |
| |
| Communication between threads should use atomic primitives as described |
| in `Memory Model and Atomics`_. |
| |
| ``setjmp`` and ``longjmp`` |
| ========================== |
| |
| PNaCl and NaCl support ``setjmp`` and ``longjmp`` without any |
| restrictions beyond C's. |
| |
| .. _exception_handling: |
| |
| C++ Exception Handling |
| ====================== |
| |
| PNaCl currently supports C++ exception handling through ``setjmp()`` and |
| ``longjmp()``, which can be enabled with the ``--pnacl-exceptions=sjlj`` linker |
| flag (set with ``LDFLAGS`` when using Make). Exceptions are disabled by default |
| so that faster and smaller code is generated, and ``throw`` statements are |
| replaced with calls to ``abort()``. The usual ``-fno-exceptions`` flag is also |
| supported, though the default is ``-fexceptions``. PNaCl will support full |
| zero-cost exception handling in the future. |
| |
| .. note:: When using webports_ or other prebuilt static libraries, you don't |
| need to recompile because the exception handling support is |
| implemented at link time (when all the static libraries are put |
| together with your application). |
| |
| .. _webports: https://chromium.googlesource.com/webports |
| |
| NaCl supports full zero-cost C++ exception handling. |
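As an illustration of what the flag changes, consider the sketch below (function names are hypothetical). With exceptions disabled, which the text above describes as the PNaCl default, the ``throw`` would become a call to ``abort()`` and the ``catch`` block would never run. Linking with ``--pnacl-exceptions=sjlj`` is what lets the handler execute.

```cpp
#include <stdexcept>
#include <string>

// Hypothetical validator that reports bad input by throwing.
inline int require_positive(int value) {
  if (value <= 0) throw std::invalid_argument("value must be positive");
  return value;
}

// Caller that recovers from the exception instead of terminating.
inline std::string describe(int value) {
  try {
    return "ok: " + std::to_string(require_positive(value));
  } catch (const std::invalid_argument& e) {
    return std::string("error: ") + e.what();
  }
}
```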
| |
| Inline Assembly |
| =============== |
| |
| Inline assembly isn't supported by PNaCl because it isn't portable. The |
| one current exception is the common compiler barrier idiom |
| ``asm("":::"memory")``, which gets transformed to a sequentially |
| consistent memory barrier (equivalent to ``__sync_synchronize()``). In |
| PNaCl this barrier is only guaranteed to order ``volatile`` and atomic |
| memory accesses, though in practice the implementation attempts to also |
| prevent reordering of memory accesses to objects which may escape. |
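A minimal sketch of the barrier idiom in use, next to its builtin equivalent (the function names are hypothetical):

```cpp
// The compiler barrier idiom described above. It emits no instructions of
// its own on a conventional toolchain; PNaCl turns it into a sequentially
// consistent memory barrier.
inline void barrier_idiom() {
  asm("" ::: "memory");
}

// The builtin the text describes as equivalent under PNaCl.
inline void barrier_builtin() {
  __sync_synchronize();
}

// Stores a value, then raises a flag, with the barrier between the two
// volatile accesses. Returns the stored value for convenience.
inline int store_then_signal(volatile int* slot, volatile int* flag, int value) {
  *slot = value;
  barrier_idiom();
  *flag = 1;
  return *slot;
}
```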
| |
| PNaCl supports :ref:`Portable SIMD Vectors <portable_simd_vectors>`, |
| which are traditionally expressed through target-specific intrinsics or |
| inline assembly. |
| |
| NaCl supports a fairly wide subset of inline assembly through GCC's |
| inline assembly syntax, with the restriction that the sandboxing model |
| for the target architecture has to be respected. |
| |
| .. _portable_simd_vectors: |
| |
| Portable SIMD Vectors |
| ===================== |
| |
| SIMD vectors aren't part of the C/C++ standards and are traditionally |
| very hardware-specific. Portable Native Client offers a portable version |
| of SIMD vector datatypes and operations which map well to modern |
| architectures and offer performance which matches or approaches |
| hardware-specific uses. |
| |
| SIMD vector support was added to Portable Native Client for version 37 of Chrome |
| and more features, including performance enhancements, have been added in |
| subsequent releases; see the :ref:`Release Notes <sdk-release-notes>` for more |
| details. |
| |
| Hand-Coding Vector Extensions |
| ----------------------------- |
| |
| The initial vector support in Portable Native Client adds `LLVM vectors |
| <http://clang.llvm.org/docs/LanguageExtensions.html#vectors-and-extended-vectors>`_ |
| and `GCC vectors |
| <http://gcc.gnu.org/onlinedocs/gcc/Vector-Extensions.html>`_ since these |
| are well supported by different hardware platforms and don't require any |
| new compiler intrinsics. |
| |
| Vector types can be used through the ``vector_size`` attribute: |
| |
| .. naclcode:: |
| |
| #define VECTOR_BYTES 16 |
| typedef int v4s __attribute__((vector_size(VECTOR_BYTES))); |
| v4s a = {1,2,3,4}; |
| v4s b = {5,6,7,8}; |
| v4s c, d, e; |
| c = a + b; /* c = {6,8,10,12} */ |
| d = b >> a; /* d = {2,1,0,0} */ |
| |
| A vector comparison yields a mask vector: each result element is as wide |
| as the compared elements and holds all ``0`` bits (false) or all ``1`` |
| bits (true): |
| |
| .. naclcode:: |
| |
| typedef int v4s __attribute__((vector_size(16))); |
| v4s snip(v4s in) { |
| v4s limit = {32,64,128,256}; |
| v4s mask = in > limit; |
| v4s ret = in & mask; |
| return ret; |
| } |
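A worked instance of the masking pattern above may help. The sketch below applies the same logic to a concrete input; the function names are hypothetical, and it assumes a GCC- or Clang-compatible compiler with vector extensions.

```cpp
typedef int v4s __attribute__((vector_size(16)));

// Returns the raw comparison mask: -1 (all bits set) in lanes where
// `in` exceeds the limit, 0 elsewhere.
inline v4s mask_above_limit(v4s in) {
  v4s limit = {32, 64, 128, 256};
  v4s mask = in > limit;
  return mask;
}

// Keeps the lanes of `in` that exceed the limit and zeroes the others.
inline v4s keep_above_limit(v4s in) {
  return in & mask_above_limit(in);
}
```

For the input ``{16, 100, 128, 300}`` the mask is ``{0, -1, 0, -1}`` and the result is ``{0, 100, 0, 300}``: the third lane is dropped because ``128 > 128`` is false.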
| |
| Vector datatypes are currently expected to be 128 bits wide with one of the |
| following element types, and they're expected to be aligned to the underlying |
| element's bit width (loads and stores will otherwise be broken up into scalar |
| accesses to prevent faults): |
| |
| ============ ============ ================ ====================== |
| Type Num Elements Vector Bit Width Expected Bit Alignment |
| ============ ============ ================ ====================== |
| ``uint8_t`` 16 128 8 |
| ``int8_t`` 16 128 8 |
| ``uint16_t`` 8 128 16 |
| ``int16_t`` 8 128 16 |
| ``uint32_t`` 4 128 32 |
| ``int32_t`` 4 128 32 |
| ``float`` 4 128 32 |
| ============ ============ ================ ====================== |
| |
| 64-bit integers and double-precision floating point will be supported in |
| a future release, as will 256-bit and 512-bit vectors. |
| |
| Vector element bit width alignment can be stated explicitly (this is assumed by |
| PNaCl, but not necessarily by other compilers), and smaller alignments can also |
| be specified: |
| |
| .. naclcode:: |
| |
| typedef int v4s_element __attribute__((vector_size(16), aligned(4))); |
| typedef int v4s_unaligned __attribute__((vector_size(16), aligned(1))); |
| |
| |
| The following operators are supported on vectors: |
| |
| +----------------------------------------------+ |
| | unary ``+``, ``-`` | |
| +----------------------------------------------+ |
| | ``++``, ``--`` | |
| +----------------------------------------------+ |
| | ``+``, ``-``, ``*``, ``/``, ``%`` | |
| +----------------------------------------------+ |
| | ``&``, ``|``, ``^``, ``~`` | |
| +----------------------------------------------+ |
| | ``>>``, ``<<`` | |
| +----------------------------------------------+ |
| | ``!``, ``&&``, ``||`` | |
| +----------------------------------------------+ |
| | ``==``, ``!=``, ``>``, ``<``, ``>=``, ``<=`` | |
| +----------------------------------------------+ |
| | ``=`` | |
| +----------------------------------------------+ |
| |
| C-style casts can be used to convert one vector type to another without |
| modifying the underlying bits. ``__builtin_convertvector`` can be used |
| to convert from one type to another provided both types have the same |
| number of elements, truncating when converting from floating-point to |
| integer. |
| |
| .. naclcode:: |
| |
| typedef unsigned v4u __attribute__((vector_size(16))); |
| typedef float v4f __attribute__((vector_size(16))); |
| v4u a = {0x3f19999a,0x40000000,0x40490fdb,0x66ff0c30}; |
| v4f b = (v4f) a; /* b = {0.6,2,3.14159,6.02214e+23} */ |
| v4u c = __builtin_convertvector(b, v4u); /* c = {0,2,3,0} */ |
| |
| It is also possible to use array-style indexing into vectors to extract |
| individual elements using ``[]``. |
| |
| .. naclcode:: |
| |
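| #include <cstddef>   // size_t |
| #include <iostream>  // std::cout, std::endl |
| |
| typedef unsigned v4u __attribute__((vector_size(16))); |
| // Prints every element of a vector, using [] to read each lane. |
| template<typename T> |
| void print(const T v) { |
| for (size_t i = 0; i != sizeof(v) / sizeof(v[0]); ++i) |
| std::cout << v[i] << ' '; |
| std::cout << std::endl; |
| } |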
| |
| Vector shuffle operations (often called permutations or swizzles) are |
| supported through ``__builtin_shufflevector``. The builtin has two |
| vector arguments of the same element type, followed by a list of |
| constant integers that specify the element indices of the first two |
| vectors that should be extracted and returned in a new vector. These |
| element indices are numbered sequentially starting with the first |
| vector, continuing into the second vector. Thus, if ``vec1`` is a |
| 4-element vector, index ``5`` would refer to the second element of |
| ``vec2``. An index of ``-1`` can be used to indicate that the |
| corresponding element in the returned vector is a don’t care and can be |
| optimized by the backend. |
| |
| The result of ``__builtin_shufflevector`` is a vector with the same |
| element type as ``vec1`` / ``vec2`` but that has an element count equal |
| to the number of indices specified. |
| |
| .. naclcode:: |
| |
| // identity operation - return 4-element vector v1. |
| __builtin_shufflevector(v1, v1, 0, 1, 2, 3) |
| |
| // "Splat" element 0 of v1 into a 4-element result. |
| __builtin_shufflevector(v1, v1, 0, 0, 0, 0) |
| |
| // Reverse 4-element vector v1. |
| __builtin_shufflevector(v1, v1, 3, 2, 1, 0) |
| |
| // Concatenate every other element of 4-element vectors v1 and v2. |
| __builtin_shufflevector(v1, v2, 0, 2, 4, 6) |
| |
| // Concatenate every other element of 8-element vectors v1 and v2. |
| __builtin_shufflevector(v1, v2, 0, 2, 4, 6, 8, 10, 12, 14) |
| |
| // Shuffle v1 with some elements being undefined |
| __builtin_shufflevector(v1, v1, 3, -1, 1, -1) |
| |
| One common use of ``__builtin_shufflevector`` is to perform |
| vector-scalar operations: |
| |
| .. naclcode:: |
| |
| typedef int v4s __attribute__((vector_size(16))); |
| v4s shift_right_by(v4s shift_me, int shift_amount) { |
| v4s tmp = {shift_amount}; |
| return shift_me >> __builtin_shufflevector(tmp, tmp, 0, 0, 0, 0); |
| } |
| |
| Auto-Vectorization |
| ------------------ |
| |
| Auto-vectorization is currently not enabled for Portable Native Client, |
| but will be in a future release. |
| |
| Undefined Behavior |
| ================== |
| |
| The C and C++ languages expose some undefined behavior which is |
| discussed in :ref:`PNaCl Undefined Behavior <undefined_behavior>`. |
| |
| .. _c_cpp_floating_point: |
| |
| Floating-Point |
| ============== |
| |
| PNaCl exposes 32-bit and 64-bit floating point operations which are |
| mostly IEEE-754 compliant. There are a few caveats: |
| |
| * Some :ref:`floating-point behavior is currently left as undefined |
| <undefined_behavior_fp>`. |
| * The default rounding mode is round-to-nearest and other rounding modes |
| are currently not usable, which isn't IEEE-754 compliant. PNaCl could |
| support switching modes (the 4 modes exposed by C99 ``FLT_ROUNDS`` |
| macros). |
| * Signaling ``NaN`` values never fault. |
| * Fast-math optimizations are currently supported before *pexe* creation |
| time. A *pexe* loses all fast-math information when it is |
| created. Fast-math translation could be enabled at a later date, |
| potentially at a per-function granularity. This wouldn't affect |
| already-existing *pexe* files; it would be an opt-in feature. |
| |
| * Fused multiply-add operations have higher precision and often execute faster; |
| PNaCl currently disallows them in the *pexe* because they aren't |
| supported on all platforms and can't realistically be |
| emulated. PNaCl could (but currently doesn't) only generate them in |
| the backend if fast-math were specified and the hardware supports |
| the operation. |
| * Transcendentals aren't exposed by PNaCl's ABI; they are part of the |
| math library that is included in the *pexe*. PNaCl could, but |
| currently doesn't, use hardware support if fast-math were provided |
| in the *pexe*. |
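The rounding-mode caveat can be observed from regular code. The sketch below uses the standard ``FLT_ROUNDS`` macro and ``std::nearbyint``, which rounds according to the current rounding mode; the helper names are hypothetical.

```cpp
#include <cfloat>
#include <cmath>

// FLT_ROUNDS evaluates to 1 when the rounding mode is round-to-nearest.
inline bool rounds_to_nearest() {
  return FLT_ROUNDS == 1;
}

// Rounds to an integral value using whatever rounding mode is in effect.
// Under round-to-nearest, halfway cases go to the even neighbor.
inline double round_in_current_mode(double x) {
  return std::nearbyint(x);
}
```

Under the default mode, ``round_in_current_mode(1.5)`` and ``round_in_current_mode(2.5)`` both give ``2.0``, because ties go to the even neighbor.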
| |
| Computed ``goto`` |
| ================= |
| |
| PNaCl supports computed ``goto``, a non-standard GCC extension to C used |
| by some interpreters, by lowering it to ``switch`` statements. The |
| resulting use of ``switch`` might not be as fast as the original |
| indirect branches. If you are compiling a program that has a |
| compile-time option for using computed ``goto``, it's possible that the |
| program will run faster with the option turned off (e.g., if the program |
| does extra work to take advantage of computed ``goto``). |
| |
| NaCl supports computed ``goto`` without any transformation. |
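The compile-time option mentioned above might look like the following sketch of a tiny interpreter. Everything here is hypothetical, including the ``USE_COMPUTED_GOTO`` option name and the opcode set. The sketch defaults to plain ``switch`` dispatch under PNaCl, on the assumption that the lowering makes computed ``goto`` no faster there.

```cpp
#include <cstddef>

enum Op { OP_INC, OP_DEC, OP_DOUBLE, OP_HALT };

// Hypothetical build option: prefer switch dispatch when targeting PNaCl.
#ifndef USE_COMPUTED_GOTO
#  if defined(__pnacl__)
#    define USE_COMPUTED_GOTO 0
#  else
#    define USE_COMPUTED_GOTO 1
#  endif
#endif

// Runs `program` until OP_HALT and returns the accumulator.
int run(const Op* program) {
  int acc = 0;
  std::size_t pc = 0;
#if USE_COMPUTED_GOTO
  // GCC extension: a table of label addresses indexed by opcode.
  static void* const targets[] = {&&do_inc, &&do_dec, &&do_double, &&do_halt};
  #define DISPATCH() goto *targets[program[pc++]]
  DISPATCH();
  do_inc:    ++acc;    DISPATCH();
  do_dec:    --acc;    DISPATCH();
  do_double: acc *= 2; DISPATCH();
  do_halt:   return acc;
  #undef DISPATCH
#else
  // Portable fallback: one switch per instruction.
  for (;;) {
    switch (program[pc++]) {
      case OP_INC:    ++acc;    break;
      case OP_DEC:    --acc;    break;
      case OP_DOUBLE: acc *= 2; break;
      case OP_HALT:   return acc;
    }
  }
#endif
}
```

Both paths compute the same result, so measuring each variant on the target is the practical way to choose between them.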
| |
| Future Directions |
| ================= |
| |
| Inter-Process Communication |
| --------------------------- |
| |
| Inter-process communication through shared memory is currently not |
| supported by PNaCl/NaCl. When implemented, it may be limited to |
| operations which are lock-free on the current platform (``is_lock_free`` |
| methods). It will rely on the address-free property discussed in `Memory |
| Model for Concurrent Operations`_. |
| |
| POSIX-style Signal Handling |
| --------------------------- |
| |
| POSIX-style signal handling really consists of two different features: |
| |
| * **Hardware exception handling** (synchronous signals): The ability |
| to catch hardware exceptions (such as memory access faults and |
| division by zero) using a signal handler. |
| |
| PNaCl currently doesn't support hardware exception handling. |
| |
| NaCl supports hardware exception handling via the |
| ``<nacl/nacl_exception.h>`` interface. |
| |
| * **Asynchronous interruption of threads** (asynchronous signals): The |
| ability to asynchronously interrupt the execution of a thread, |
| forcing the thread to run a signal handler. |
| |
| A similar feature is **thread suspension**: The ability to |
| asynchronously suspend and resume a thread and inspect or modify its |
| execution state (such as register state). |
| |
| Neither PNaCl nor NaCl currently supports asynchronous interruption |
| or suspension of threads. |
| |
| If PNaCl were to support either of these, the interaction of |
| ``volatile`` and atomics with same-thread signal handling would need |
| to be carefully detailed. |