Efficient and performance-portable vector software

Highway is a C++ library that provides portable SIMD/vector intrinsics.

Previously licensed under Apache 2, now dual-licensed as Apache 2 / BSD-3.

Why

We are passionate about high-performance software. We see major untapped potential in CPUs (servers, mobile, desktops). Highway is for engineers who want to reliably and economically push the boundaries of what is possible in software.

How

CPUs provide SIMD/vector instructions that apply the same operation to multiple data items. This can reduce energy usage e.g. fivefold because fewer instructions are executed. We also often see 5-10x speedups.

Highway makes SIMD/vector programming practical and workable according to these guiding principles:

Does what you expect: Highway is a C++ library with carefully-chosen functions that map well to CPU instructions without extensive compiler transformations. The resulting code is more predictable and robust to code changes/compiler updates than autovectorization.

Works on widely-used platforms: Highway supports seven architectures; the same application code can target various instruction sets, including those with ‘scalable’ vectors (size unknown at compile time). Highway only requires C++17 (language features, not necessarily the library) and supports four families of compilers. If you want to use Highway on other platforms, please raise an issue.

Flexible to deploy: Applications using Highway can run on heterogeneous clouds or client devices, choosing the best available instruction set at runtime. Alternatively, developers may choose to target a single instruction set without any runtime overhead. In both cases, the application code is the same except for swapping HWY_STATIC_DISPATCH with HWY_DYNAMIC_DISPATCH plus one line of code. See also @kfjahnke's introduction to dispatching.

Suitable for a variety of domains: Highway provides an extensive set of operations, used for image processing (floating-point), compression, video analysis, linear algebra, cryptography, sorting and random generation. We recognise that new use-cases may require additional ops and are happy to add them where it makes sense (e.g. no performance cliffs on some architectures). If you would like to discuss, please file an issue.

Rewards data-parallel design: Highway provides tools such as Gather, MaskedLoad, and FixedTag to enable speedups for legacy data structures. However, the biggest gains are unlocked by designing algorithms and data structures for scalable vectors. Helpful techniques include batching, structure-of-array layouts, and aligned/padded allocations.
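As a rough illustration of the structure-of-array and padded-allocation techniques mentioned above, here is a minimal sketch in plain C++ (it does not use Highway). The names `kLanes`, `PaddedSize` and `PointsSoA` are hypothetical, and the fixed lane count is an assumption; in Highway code the lane count would come from `Lanes(d)` and the storage would typically come from an aligned allocator rather than `std::vector`.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical lane count, fixed here only for illustration.
constexpr size_t kLanes = 8;

// Rounds n up to the next multiple of lanes, so that processing whole
// vectors never reads or writes past the end of the allocation.
constexpr size_t PaddedSize(size_t n, size_t lanes) {
  return (n + lanes - 1) / lanes * lanes;
}

// Structure-of-arrays: each field is stored contiguously, so kLanes
// consecutive x values can be fetched with a single vector load.
struct PointsSoA {
  explicit PointsSoA(size_t num_points)
      : count(num_points),
        x(PaddedSize(num_points, kLanes)),
        y(x.size()),
        z(x.size()) {}

  size_t count;  // number of valid points; the remainder is zero padding
  std::vector<float> x, y, z;
};
```

With ten points and eight lanes, each array holds 16 elements, so a loop can step by `kLanes` without special handling of the last iteration.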

We recommend these resources for getting started:

Examples

Online demos using Compiler Explorer:

multiple targets with dynamic dispatch (more complicated, but flexible and uses best available SIMD)
single target using -m flags (simpler, but requires/only uses the instruction set enabled by compiler flags)

We have prepared several tutorial-like examples in hwy/examples; see the README.md there for more information.

Basics:

sum_array_advanced.cc: 4x unrolling and remainder handling.
dot_product_unroll.cc: similar to sum_array_advanced, plus reduction.
matrix_transpose_scatter_gather: shows/compares Scatter and Gather.

Infrastructure:

benchmark.cc: dot product with remainder-free loops and benchmarking.
profiler_example: shows how to use our built-in profiler for measuring the time cost of annotated zones.
skeleton*: a complete example of a module with runtime dispatch.

Challenges:

masks_and_logic.cc: visualizes masks/comparisons with ASCII art.
ctf_aes.cc: capture the flag, brute-force AES password guessing.

We observe that Highway is referenced in the following open source projects, found via sourcegraph.com. Most are GitHub repositories. If you would like to add your project or link to it directly, feel free to raise an issue or contact us via the below email.

Audio: Zimtohrli perceptual metric
Browsers: Chromium (+Vivaldi), Firefox (+floorp / foxhound / librewolf / Waterfox)
Computational biology: RNA analysis, long-sequence preprocessing
Computer graphics: ghostty-org/ghostty, Sparse voxel renderer, tgfx 2D Graphics library
Cryptography: google/distributed_point_functions, google/shell-encryption
Data structures: bkille/BitLib
Image codecs: eustas/2im, Grok JPEG 2000, JPEG XL, JPEGenc, Jpegli, libaom, OpenHTJ2K
Image processing: awxkee/aire, cloudinary/ssimulacra2, libvips, m-ab-s/media-autobuild_suite
Image viewers: AlienCowEatCake/ImageViewer, diffractor/diffractor, Lux panorama/image viewer, mirillis/jpegxl-wic
Information retrieval: iresearch database index, michaeljclark/zvec, nebula interactive analytics / OLAP, ScaNN Scalable Nearest Neighbors
Machine learning: array2d/deepx, gemma.cpp, Tensorflow, Numpy, zpye/SimpleInfer
Programming languages: AOT-compiled python, oven-sh/bun, V8/V8, yinqiwen/rapidudf
Robotics: MIT Model-Based Design and Verification
Vector search: 1yefuwang1/vectorlite, vespa-engine/vespa

Other

Evaluation of C++ SIMD Libraries: “Highway excelled with a strong performance across multiple SIMD extensions [..]. Thus, Highway may currently be the most suitable SIMD library for many software projects.”
zimt: C++11 template library to process n-dimensional arrays with multi-threaded SIMD code
vectorized Quicksort (paper)

Highway can also be installed via a package manager or from a repository:

Obtained from repology

Current status

Targets

Highway supports 27 targets, listed in alphabetical order of platform:

Any: EMU128, SCALAR;
Armv7+: NEON_WITHOUT_AES, NEON, NEON_BF16, SVE, SVE2, SVE_256, SVE2_128;
IBM Z: Z14, Z15;
LoongArch: LSX, LASX;
POWER: PPC8 (v2.07), PPC9 (v3.0), PPC10 (v3.1B, not yet supported due to compiler bugs, see #1207; also requires QEMU 7.2);
RISC-V: RVV (1.0);
WebAssembly: WASM, WASM_EMU256 (a 2x unrolled version of wasm128, enabled if HWY_WANT_WASM2 is defined. This will remain supported until it is potentially superseded by a future version of WASM.);
x86:
- SSE2
- SSSE3 (~Intel Core)
- SSE4 (~Nehalem, also includes AES + CLMUL)
- AVX2 (~Haswell, also includes BMI2 + F16 + FMA)
- AVX3 (~Skylake, AVX-512F/BW/CD/DQ/VL)
- AVX3_DL (~Icelake, includes BitAlg + CLMUL + GFNI + VAES + VBMI + VBMI2 + VNNI + VPOPCNT)
- AVX3_ZEN4 (AVX3_DL plus BF16, optimized for AMD Zen4; requires opt-in by defining HWY_WANT_AVX3_ZEN4 if compiling for static dispatch, but enabled by default for runtime dispatch)
- AVX3_SPR (~Sapphire Rapids, includes AVX-512FP16)
- AVX10_2 (~Diamond Rapids)

Our policy is that unless otherwise specified, targets will remain supported as long as they can be (cross-)compiled with currently supported Clang or GCC, and tested using QEMU. If the target can be compiled with LLVM trunk and tested using our version of QEMU without extra flags, then it is eligible for inclusion in our continuous testing infrastructure. Otherwise, the target will be manually tested before releases with selected versions/configurations of Clang and GCC.

SVE was initially tested using farm-sve (see acknowledgments).

Versioning

Highway releases aim to follow the semver.org system (MAJOR.MINOR.PATCH), incrementing MINOR after backward-compatible additions and PATCH after backward-compatible fixes. We recommend using releases (rather than the Git tip) because they are tested more extensively, see below.

The current version 1.0 signals an increased focus on backwards compatibility. Applications using documented functionality will remain compatible with future updates that have the same major version number.

Testing

Continuous integration tests build with a recent version of Clang (running on native x86, or QEMU for RISC-V and Arm) and MSVC 2019 (v19.28, running on native x86).

Before releases, we also test on x86 with Clang and GCC, and Armv7/8 via GCC cross-compile. See the testing process for details.

Related modules

The contrib directory contains SIMD-related utilities: an image class with aligned rows, a math library (16 functions already implemented, mostly trigonometry), and functions for computing dot products and sorting.

Other libraries

If you only require x86 support, you may also use Agner Fog's VCL vector class library. It includes many functions including a complete math library.

If you have existing code using x86/NEON intrinsics, you may be interested in SIMDe, which emulates those intrinsics using other platforms' intrinsics or autovectorization.

xSIMD is a header-only C++ library that supports Arm, Power, RISC-V, WebAssembly and x86 targets. It has a high-level interface, but fewer supported operations.

NumKong is a SIMD-accelerated math C library focused on operations such as dot products and mixed-precision matrix multiplications. It can be used from C++, Go, Python, Rust, Swift and WebAssembly. Accelerated operations are available on Arm, LoongArch, Power, RISC-V and x86.

Installation

This project uses CMake to generate and build. In a Debian-based system you can install it via:

sudo apt install cmake

Highway’s unit tests use googletest. By default, Highway’s CMake downloads this dependency at configuration time. You can avoid this by setting the HWY_SYSTEM_GTEST CMake variable to ON and installing gtest separately:

sudo apt install libgtest-dev

Alternatively, you can define HWY_TEST_STANDALONE=1 and remove all occurrences of gtest_main in each BUILD file, then tests avoid the dependency on GUnit.

Running cross-compiled tests requires support from the OS, which on Debian is provided by the qemu-user-binfmt package.

To build Highway as a shared or static library (depending on BUILD_SHARED_LIBS), the standard CMake workflow can be used:

mkdir -p build && cd build
cmake ..
make -j && make test

Or you can run run_tests.sh (run_tests.bat on Windows).

Bazel is also supported for building, but it is not as widely used/tested.

When building for Armv7, a limitation of current compilers requires you to add -DHWY_CMAKE_ARM7:BOOL=ON to the CMake command line; see #834 and #1032. We understand that work is underway to remove this limitation.

To benefit from Armv8/v9 vusdot and vusdotq instructions, you can add “+i8mm” to the -march compiler flag, assuming the target CPU(s) support that.

Building on 32-bit x86 is not officially supported, and AVX2/3 are disabled by default there. Note that johnplatts has successfully built and run the Highway tests on 32-bit x86, including AVX2/3, on GCC 7/8 and Clang 8/11/12. On Ubuntu 22.04, Clang 11 and 12, but not later versions, require extra compiler flags -m32 -isystem /usr/i686-linux-gnu/include. Clang 10 and earlier require the above plus -isystem /usr/i686-linux-gnu/include/c++/12/i686-linux-gnu. See #1279.

Building highway - Using vcpkg

highway is now available in vcpkg

vcpkg install highway

The highway port in vcpkg is kept up to date by Microsoft team members and community contributors. If the version is out of date, please create an issue or pull request on the vcpkg repository.

Quick start

You can use the benchmark inside examples/ as a starting point.

A quick-reference page briefly lists all operations and their parameters, and the instruction_matrix indicates the number of instructions per operation.

The FAQ answers questions about portability, API design and where to find more information.

We recommend using full SIMD vectors whenever possible for maximum performance portability. To obtain them, pass a ScalableTag<float> (or equivalently HWY_FULL(float)) tag to functions such as Zero/Set/Load. There are two alternatives for use-cases requiring an upper bound on the lanes:

For up to N lanes, specify CappedTag<T, N> or the equivalent HWY_CAPPED(T, N). The actual number of lanes will be N rounded down to the nearest power of two, such as 4 if N is 5, or 8 if N is 8. This is useful for data structures such as a narrow matrix. A loop is still required because vectors may actually have fewer than N lanes.
For exactly a power of two N lanes, specify FixedTag<T, N>. The largest supported N depends on the target, but is guaranteed to be at least 16/sizeof(T).
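The lane-count rules above can be modelled with a little arithmetic. The following plain C++ sketch does not use Highway; `RoundDownToPowerOfTwo`, `CappedLanes` and `MinGuaranteedFixedLanes` are hypothetical helpers that only mirror the description in the text, and `hardware_lanes` stands in for whatever the full vector provides on the current target.

```cpp
#include <algorithm>
#include <cstddef>

// Largest power of two that is <= n (n must be nonzero).
constexpr size_t RoundDownToPowerOfTwo(size_t n) {
  size_t p = 1;
  while (p * 2 <= n) p *= 2;
  return p;
}

// Models a capped tag: at most n lanes (rounded down to a power of two),
// and never more than the hardware vector actually has.
constexpr size_t CappedLanes(size_t n, size_t hardware_lanes) {
  return std::min(RoundDownToPowerOfTwo(n), hardware_lanes);
}

// Lower bound on the largest N that a fixed tag is guaranteed to support:
// the number of T that fit in 16 bytes.
template <typename T>
constexpr size_t MinGuaranteedFixedLanes() {
  return 16 / sizeof(T);
}
```

For example, a cap of 5 yields 4 lanes when the hardware has 16, while a cap of 8 on hardware with only 4 lanes yields 4, which is why a loop is still needed.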

Due to ADL restrictions, user code calling Highway ops must either:

Reside inside namespace hwy { namespace HWY_NAMESPACE {; or
prefix each op with an alias such as namespace hn = hwy::HWY_NAMESPACE; hn::Add(); or
add using-declarations for each op used: using hwy::HWY_NAMESPACE::Add;.

Additionally, each function that calls Highway ops (such as Load) must either be prefixed with HWY_ATTR, OR reside between HWY_BEFORE_NAMESPACE() and HWY_AFTER_NAMESPACE(). Lambda functions currently require HWY_ATTR before their opening brace.

Do not use namespace-scope nor static initializers for SIMD vectors because this can cause SIGILL when using runtime dispatch and the compiler chooses an initializer compiled for a target not supported by the current CPU. Instead, constants initialized via Set should generally be local (const) variables.

The entry points into code using Highway differ slightly depending on whether they use static or dynamic dispatch. In both cases, we recommend that the top-level function receives one or more pointers to arrays, rather than target-specific vector types.

For static dispatch, HWY_TARGET will be the best available target among HWY_BASELINE_TARGETS, i.e. those allowed for use by the compiler (see quick-reference). Functions inside HWY_NAMESPACE can be called using HWY_STATIC_DISPATCH(func)(args) within the same module they are defined in. You can call the function from other modules by wrapping it in a regular function and declaring the regular function in a header.
For dynamic dispatch, a table of function pointers is generated via the HWY_EXPORT macro that is used by HWY_DYNAMIC_DISPATCH(func)(args) to call the best function pointer for the current CPU's supported targets. A module is automatically compiled for each target in HWY_TARGETS (see quick-reference) if HWY_TARGET_INCLUDE is defined and foreach_target.h is included. Note that the first invocation of HWY_DYNAMIC_DISPATCH, or each call to the pointer returned by the first invocation of HWY_DYNAMIC_POINTER, involves some CPU detection overhead. You can prevent this by calling the following before any invocation of HWY_DYNAMIC_*: hwy::GetChosenTarget().Update(hwy::SupportedTargets());.
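To make the idea behind dynamic dispatch more concrete, here is a simplified model in plain C++. It is not how Highway implements HWY_EXPORT or HWY_DYNAMIC_DISPATCH; the target bits, `SumBaseline`, `SumWide` and `ChooseSum` are all hypothetical, and the bitfield is passed in by the caller instead of being detected from the CPU.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical target bits; a real library would detect these at runtime.
constexpr uint32_t kTargetWide = 1u << 0;      // preferred when available
constexpr uint32_t kTargetBaseline = 1u << 1;  // assumed always available

using SumFunc = int (*)(const int*, size_t);

// Version that runs on every CPU.
int SumBaseline(const int* p, size_t n) {
  int sum = 0;
  for (size_t i = 0; i < n; ++i) sum += p[i];
  return sum;
}

// Stand-in for the same code compiled for a wider instruction set.
int SumWide(const int* p, size_t n) {
  int sum0 = 0, sum1 = 0;
  size_t i = 0;
  for (; i + 2 <= n; i += 2) {
    sum0 += p[i];
    sum1 += p[i + 1];
  }
  if (i < n) sum0 += p[i];
  return sum0 + sum1;
}

// Returns the best function pointer for the given set of supported targets.
SumFunc ChooseSum(uint32_t supported) {
  return (supported & kTargetWide) ? &SumWide : &SumBaseline;
}
```

A caller would typically choose the pointer once and reuse it, which corresponds to paying the detection cost only on the first invocation.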

See also a separate introduction to dynamic dispatch by @kfjahnke.

When using dynamic dispatch, foreach_target.h is included from translation units (.cc files), not headers. Headers containing vector code shared between several translation units require a special include guard, for example the following taken from examples/skeleton-inl.h:

#if defined(HIGHWAY_HWY_EXAMPLES_SKELETON_INL_H_) == defined(HWY_TARGET_TOGGLE)
#ifdef HIGHWAY_HWY_EXAMPLES_SKELETON_INL_H_
#undef HIGHWAY_HWY_EXAMPLES_SKELETON_INL_H_
#else
#define HIGHWAY_HWY_EXAMPLES_SKELETON_INL_H_
#endif

#include "hwy/highway.h"
// Your vector code
#endif

By convention, we name such headers -inl.h because their contents (often function templates) are usually inlined.

Compiler flags

Applications should be compiled with optimizations enabled. Without inlining, SIMD code may slow down by factors of 10 to 100. For Clang and GCC, -O2 is generally sufficient.

For Clang and GCC, we recommend using dynamic dispatch (see Quick start), because this avoids the need for extra compiler flags which may be incompatible with other libraries and always uses the best available instructions. If you know exactly which CPU you are running on, you can specify that as the baseline target, which avoids generating code for any older instruction sets. Compiler flags must match the predefined macro checks for HWY_BASELINE_* in detect_targets.h. They can also be deduced from the HWY_TARGET_STR in set_macros-inl.h. For x86, suggested flags are currently:

HWY_AVX2: -march=x86-64-v3 -maes -mpclmul, or -march=haswell -maes
HWY_AVX3: -march=x86-64-v4, or -march=skx
HWY_AVX3_DL: -march=icelake-server
HWY_AVX3_ZEN4: -march=znver4
HWY_AVX3_SPR: -march=icelake-server -mavx512fp16 -mavx512bf16
HWY_AVX10_2: -march=novalake (requires GCC 16 or Clang 22)

See also the godbolt examples above mentioning -m targets.

For MSVC, we recommend compiling with /Gv to allow non-inlined functions to pass vector arguments in registers. If intending to use the AVX2 target together with half-width vectors (e.g. for PromoteTo), it is also important to compile with /arch:AVX2. This seems to be the only way to reliably generate VEX-encoded SSE instructions on MSVC. Sometimes MSVC generates VEX-encoded SSE instructions, if they are mixed with AVX, but not always, see DevCom-10618264. Otherwise, mixing VEX-encoded AVX2 instructions and non-VEX SSE may cause severe performance degradation. Unfortunately, with /arch:AVX2 option, the resulting binary will then require AVX2. Note that no such flag is needed for clang and GCC because they support target-specific attributes, which we use to ensure proper VEX code generation for AVX2 targets.

Strip-mining loops

When vectorizing a loop, an important question is whether and how to deal with a number of iterations (‘trip count’, denoted count) that is not a multiple of the vector size N = Lanes(d). For example, it may be necessary to avoid writing past the end of an array.

In this section, let T denote the element type and d = ScalableTag<T>. Assume the loop body is given as a function template<bool partial, class D> void LoopBody(D d, size_t index, size_t max_n).

“Strip-mining” is a technique for vectorizing a loop by transforming it into an outer loop and inner loop, such that the number of iterations in the inner loop matches the vector width. Then, the inner loop is replaced with vector operations.
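The transformation can be sketched in plain C++ without any SIMD. In this illustration, `kN` is a hypothetical fixed vector width and the function names are made up; the inner loop marks the part that a vector operation would replace.

```cpp
#include <algorithm>
#include <cstddef>

constexpr size_t kN = 4;  // hypothetical vector width in lanes

// Original scalar loop.
void AddOneScalar(float* data, size_t count) {
  for (size_t i = 0; i < count; ++i) data[i] += 1.0f;
}

// Strip-mined form: the outer loop advances one strip at a time and the
// inner loop covers the lanes of that strip.
void AddOneStripMined(float* data, size_t count) {
  for (size_t i = 0; i < count; i += kN) {
    // The final strip may be shorter if count is not a multiple of kN.
    const size_t lanes = std::min(kN, count - i);
    for (size_t lane = 0; lane < lanes; ++lane) {
      data[i + lane] += 1.0f;  // with SIMD, one vector op replaces this loop
    }
  }
}
```

Both functions produce the same result; how to handle the shorter final strip is what distinguishes the strategies listed next.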

Highway offers several strategies for loop vectorization:

Ensure all inputs/outputs are padded. Then the (outer) loop is simply
```
for (size_t i = 0; i < count; i += N) LoopBody<false>(d, i, 0);
```
Here, the template parameter and third function argument are not needed.
This is the preferred option, unless N is in the thousands and vector operations are pipelined with long latencies. This was the case for supercomputers in the 90s, but nowadays ALUs are cheap and we see most implementations split vectors into 1, 2 or 4 parts, so there is little cost to processing entire vectors even if we do not need all their lanes. Indeed this avoids the (potentially large) cost of predication or partial loads/stores on older targets, and does not duplicate code.
Process whole vectors and include previously processed elements in the last vector:
```
for (size_t i = 0; i < count; i += N) LoopBody<false>(d, HWY_MIN(i, count - N), 0);
```
This is the second preferred option provided that count >= N and LoopBody is idempotent. Some elements might be processed twice, but a single code path and full vectorization is usually worth it. Even if count < N, it usually makes sense to pad inputs/outputs up to N.
Use the Transform* functions in hwy/contrib/algo/transform-inl.h. This takes care of the loop and remainder handling and you simply define a generic lambda function (C++14) or functor which receives the current vector from the input/output array, plus optionally vectors from up to two extra input arrays, and returns the value to write to the input/output array.
Here is an example implementing the BLAS function SAXPY (alpha * x + y):
```
// alpha is assumed to be a local float, hence the capture by value.
Transform1(d, x, n, y, [alpha](auto d, const auto v, const auto v1) HWY_ATTR {
  return MulAdd(Set(d, alpha), v, v1);
});
```
Process whole vectors as above, followed by a scalar loop:
```
size_t i = 0;
for (; i + N <= count; i += N) LoopBody<false>(d, i, 0);
for (; i < count; ++i) LoopBody<false>(CappedTag<T, 1>(), i, 0);
```
The template parameter and third function argument are again not needed.
This avoids duplicating code, and is reasonable if count is large. If count is small, the second loop may be slower than the next option.
Process whole vectors as above, followed by a single call to a modified LoopBody with masking:
```
size_t i = 0;
for (; i + N <= count; i += N) {
  LoopBody<false>(d, i, 0);
}
if (i < count) {
  LoopBody<true>(d, i, count - i);
}
```
Now the template parameter and third function argument can be used inside LoopBody to non-atomically ‘blend’ the first num_remaining lanes of v with the previous contents of memory at subsequent locations: BlendedStore(v, FirstN(d, num_remaining), d, pointer);. Similarly, MaskedLoad(FirstN(d, num_remaining), d, pointer) loads the first num_remaining elements and returns zero in other lanes.
This is a good default when it is infeasible to ensure vectors are padded, but is only safe #if !HWY_MEM_OPS_MIGHT_FAULT! In contrast to the scalar loop, only a single final iteration is needed. The increased code size from two loop bodies is expected to be worthwhile because it avoids the cost of masking in all but the final iteration.
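To illustrate why the second strategy (re-processing elements in the last vector) requires an idempotent loop body, here is a scalar model in plain C++. `kWidth`, `ClampStrip` and `ClampAll` are hypothetical names, and the fixed width is an assumption standing in for Lanes(d).

```cpp
#include <algorithm>
#include <cstddef>

constexpr size_t kWidth = 4;  // hypothetical vector width in lanes

// Idempotent body: clamping negative values to zero a second time does not
// change the result, so overlapping strips are harmless.
void ClampStrip(float* data, size_t index) {
  for (size_t lane = 0; lane < kWidth; ++lane) {
    data[index + lane] = std::max(data[index + lane], 0.0f);
  }
}

// Requires count >= kWidth. The last strip is shifted back so that it ends
// exactly at count, which may revisit elements of the previous strip.
void ClampAll(float* data, size_t count) {
  for (size_t i = 0; i < count; i += kWidth) {
    ClampStrip(data, std::min(i, count - kWidth));
  }
}
```

With count = 10, the strips start at 0, 4 and 6, so elements 6 and 7 are processed twice. A body such as `data[i] *= 2` would give wrong results there, which is why this strategy is limited to idempotent operations.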

Additional resources

Acknowledgments

We have used farm-sve by Berenger Bramas; it has proved useful for checking the SVE port on an x86 development machine.

This is not an officially supported Google product. Contact: janwas@google.com
