update design_philosophy/FAQ, mention vqsort throttling finding
PiperOrigin-RevId: 540255745
diff --git a/g3doc/design_philosophy.md b/g3doc/design_philosophy.md
index 9884ec3..1032960 100644
--- a/g3doc/design_philosophy.md
+++ b/g3doc/design_philosophy.md
@@ -21,7 +21,7 @@
during initial development. Analysis tools can warn about some potential
inefficiencies, but likely not all. We instead provide [a carefully chosen
set of vector types and operations that are efficient on all target
- platforms](instruction_matrix.pdf) (PPC8, SSE4/AVX2+, Armv8).
+ platforms](instruction_matrix.pdf) (Armv8, PPC8, x86).
* Future SIMD hardware features are difficult to predict. For example, AVX2
came with surprising semantics (almost no interaction between 128-bit
@@ -64,11 +64,11 @@
same file or in `*-inl.h` headers. We generate all code paths from the same
source to reduce implementation- and debugging cost.
-* Not every CPU need be supported. For example, pre-SSSE3 CPUs are
- increasingly rare and the AVX instruction set is limited to floating-point
- operations. To reduce code size and compile time, we provide specializations
- for S-SSE3, SSE4, AVX2 and AVX-512 instruction sets on x86, plus a scalar
- fallback.
+* Not every CPU need be supported. To reduce code size and compile time, we
+ group x86 targets into clusters. In particular, SSE3 instructions are only
+ used when S-SSE3 is also available, and AVX only when AVX2 is also
+ supported. Code generation for AVX3_DL additionally requires opting in by
+ defining `HWY_WANT_AVX3_DL`.
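+
+ A minimal sketch of the opt-in (assuming the macro is visible before any
+ Highway header is included; passing -DHWY_WANT_AVX3_DL to the compiler
+ works equally well):
+
+ ```cpp
+ // Enable AVX3_DL code generation in addition to the default clusters.
+ #define HWY_WANT_AVX3_DL
+ #include "hwy/highway.h"
+ ```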
* Access to platform-specific intrinsics is necessary for acceptance in
performance-critical projects. We provide conversions to and from intrinsics
@@ -109,10 +109,10 @@
a subset of this functionality on other platforms at zero cost.
Masks are returned by comparisons and `TestBit`; they serve as the input to
-`IfThen*`. We provide conversions between masks and vector lanes. For clarity
-and safety, we use FF..FF as the definition of true. To also benefit from
-x86 instructions that only require the sign bit of floating-point inputs to be
-set, we provide a special `ZeroIfNegative` function.
+`IfThen*`. We provide conversions between masks and vector lanes. On targets
+without dedicated mask registers, we use FF..FF as the definition of true. To
+also benefit from x86 instructions that only require the sign bit of
+floating-point inputs to be set, we provide a special `ZeroIfNegative` function.
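+
+A sketch of how these pieces compose (assuming the `Lt`, `Zero` and
+`IfThenZeroElse` ops and the `hn` namespace alias commonly used with Highway;
+`ClampToZero` is a hypothetical helper for illustration):
+
+```cpp
+#include "hwy/highway.h"
+namespace hn = hwy::HWY_NAMESPACE;
+
+// Per lane: return 0 for negative inputs, otherwise the input itself.
+template <class D, class V = hn::Vec<D>>
+V ClampToZero(D d, V v) {
+  const auto is_neg = hn::Lt(v, hn::Zero(d));  // comparison returns a mask
+  return hn::IfThenZeroElse(is_neg, v);
+  // For floating-point inputs, hn::ZeroIfNegative(v) is equivalent and can
+  // be cheaper on x86 because only the sign bit is inspected.
+}
+```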
## Differences vs. [P0214R5](https://goo.gl/zKW4SA) / std::experimental::simd
@@ -120,8 +120,10 @@
functions. By contrast, P0214R5 requires a wrapper class, which does not
work for sizeless vector types currently used by Arm SVE and Risc-V.
-1. Adding widely used and portable operations such as `AndNot`, `AverageRound`,
- bit-shift by immediates and `IfThenElse`.
+1. Supporting many more operations such as 128-bit compare/minmax, AES/CLMUL,
+ `AndNot`, `AverageRound`, bit-shift by immediates, compress/expand,
+ fixed-point mul, `IfThenElse`, interleaved load/store, lzcnt, mask find/set,
+ masked load/store, popcount, reductions, saturated add/sub, scatter/gather.
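+
+ A sketch exercising a few of these (assuming `Set`, `Lt` and a
+ `ScalableTag` named `d`; the values are illustrative):
+
+ ```cpp
+ const hn::ScalableTag<uint8_t> d;
+ const auto a = hn::Set(d, 200);
+ const auto b = hn::Set(d, 100);
+ const auto avg = hn::AverageRound(a, b);  // (a + b + 1) >> 1 per lane
+ const auto sat = hn::SaturatedAdd(a, b);  // saturates to 255, no wraparound
+ const auto sel = hn::IfThenElse(hn::Lt(a, b), a, b);  // per-lane minimum
+ ```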
1. Designing the API to avoid or minimize overhead on AVX2/AVX-512 caused by
crossing 128-bit 'block' boundaries.
@@ -156,10 +158,6 @@
1. Ensuring signed integer overflow has well-defined semantics (wraparound).
-1. Simple header-only implementation and a fraction of the size of the Vc
- library from which P0214 was derived (39K, vs. 92K lines in
- https://github.com/VcDevel/Vc according to the gloc Chrome extension).
-
1. Avoiding hidden performance costs. P0214R5 allows implicit conversions from
integer to float, which costs 3-4 cycles on x86. We make these conversions
explicit to ensure their cost is visible.
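+
+ For instance (a sketch; `ConvertTo` is the explicit conversion, and the
+ tags/values are assumptions for illustration):
+
+ ```cpp
+ const hn::ScalableTag<int32_t> di;
+ const hn::RebindToFloat<decltype(di)> df;  // f32 tag, same lane count
+ const auto vi = hn::Iota(di, 0);           // 0, 1, 2, ...
+ const auto vf = hn::ConvertTo(df, vi);     // conversion cost is visible
+ ```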
diff --git a/g3doc/faq.md b/g3doc/faq.md
index 5c6f6df..938a3d9 100644
--- a/g3doc/faq.md
+++ b/g3doc/faq.md
@@ -374,9 +374,13 @@
the entire system, it is important to measure end-to-end application performance
rather than rely on microbenchmarks. In practice, we find the speedup from
sustained SIMD usage (not just sporadic instructions amid mostly scalar code) is
-much larger than the impact of throttling. For JPEG XL image decompression and
-vectorized Quicksort, we observe a 1.4-1.6x end to end speedup from AVX-512 vs
-AVX2, even on multiple cores of a Xeon Gold. Note that throttling is
+much larger than the impact of throttling. For JPEG XL image decompression, we
+observe a 1.4-1.6x end-to-end speedup from AVX-512 vs. AVX2, even on multiple
+cores of a Xeon Gold. For
+[vectorized Quicksort](https://github.com/google/highway/blob/master/hwy/contrib/sort/README.md#study-of-avx-512-downclocking),
+we find that throttling is not detectable on a single Skylake core, and the
+AVX-512 startup overhead is worthwhile for inputs >= 80 KiB. Note that
+throttling is
[no longer a concern on recent Intel](https://travisdowns.github.io/blog/2020/08/19/icl-avx512-freq.html#summary)
implementations of AVX-512 (Icelake and Rocket Lake client), and AMD CPUs do not
throttle AVX2 or AVX-512.