update design_philosophy/FAQ, mention vqsort throttling finding
PiperOrigin-RevId: 540255745
diff --git a/g3doc/design_philosophy.md b/g3doc/design_philosophy.md
index 9884ec3..1032960 100644
--- a/g3doc/design_philosophy.md
+++ b/g3doc/design_philosophy.md
@@ -21,7 +21,7 @@
during initial development. Analysis tools can warn about some potential
inefficiencies, but likely not all. We instead provide [a carefully chosen
set of vector types and operations that are efficient on all target
- platforms](instruction_matrix.pdf) (PPC8, SSE4/AVX2+, Armv8).
+ platforms](instruction_matrix.pdf) (Armv8, PPC8, x86).
* Future SIMD hardware features are difficult to predict. For example, AVX2
came with surprising semantics (almost no interaction between 128-bit
@@ -64,11 +64,11 @@
same file or in `*-inl.h` headers. We generate all code paths from the same
source to reduce implementation- and debugging cost.
-* Not every CPU need be supported. For example, pre-SSSE3 CPUs are
- increasingly rare and the AVX instruction set is limited to floating-point
- operations. To reduce code size and compile time, we provide specializations
- for S-SSE3, SSE4, AVX2 and AVX-512 instruction sets on x86, plus a scalar
- fallback.
+* Not every CPU need be supported. To reduce code size and compile time, we
+ group x86 targets into clusters. In particular, SSE3 instructions are only
+ used when S-SSE3 is also available, and AVX only when AVX2 is also
+ supported. Code generation for AVX3_DL additionally requires opting in by
+ defining `HWY_WANT_AVX3_DL`.
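+
+ A minimal sketch of the opt-in (assuming the macro is visible before any
+ Highway header is included; passing -DHWY_WANT_AVX3_DL to the compiler
+ works equally well):
+
+ ```cpp
+ // Enable AVX3_DL code generation in addition to the default clusters.
+ #define HWY_WANT_AVX3_DL
+ #include "hwy/highway.h"
+ ```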
* Access to platform-specific intrinsics is necessary for acceptance in
performance-critical projects. We provide conversions to and from intrinsics
@@ -109,10 +109,10 @@
a subset of this functionality on other platforms at zero cost.
Masks are returned by comparisons and `TestBit`; they serve as the input to
-`IfThen*`. We provide conversions between masks and vector lanes. For clarity
-and safety, we use FF..FF as the definition of true. To also benefit from
-x86 instructions that only require the sign bit of floating-point inputs to be
-set, we provide a special `ZeroIfNegative` function.
+`IfThen*`. We provide conversions between masks and vector lanes. On targets
+without dedicated mask registers, we use FF..FF as the definition of true. To
+also benefit from x86 instructions that only require the sign bit of
+floating-point inputs to be set, we provide a special `ZeroIfNegative` function.
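+
+A sketch of how these pieces compose (assuming the `Lt`, `Zero` and
+`IfThenZeroElse` ops and the `hn` namespace alias commonly used with Highway;
+`ClampToZero` is a hypothetical helper for illustration):
+
+```cpp
+#include "hwy/highway.h"
+namespace hn = hwy::HWY_NAMESPACE;
+
+// Per lane: return 0 for negative inputs, otherwise the input itself.
+template <class D, class V = hn::Vec<D>>
+V ClampToZero(D d, V v) {
+  const auto is_neg = hn::Lt(v, hn::Zero(d));  // comparison returns a mask
+  return hn::IfThenZeroElse(is_neg, v);
+  // For floating-point inputs, hn::ZeroIfNegative(v) is equivalent and can
+  // be cheaper on x86 because only the sign bit is inspected.
+}
+```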
## Differences vs. [P0214R5](https://goo.gl/zKW4SA) / std::experimental::simd
@@ -120,8 +120,10 @@
functions. By contrast, P0214R5 requires a wrapper class, which does not
work for sizeless vector types currently used by Arm SVE and Risc-V.
-1. Adding widely used and portable operations such as `AndNot`, `AverageRound`,
- bit-shift by immediates and `IfThenElse`.
+1. Supporting many more operations such as 128-bit compare/minmax, AES/CLMUL,
+ `AndNot`, `AverageRound`, bit-shift by immediates, compress/expand,
+ fixed-point mul, `IfThenElse`, interleaved load/store, lzcnt, mask find/set,
+ masked load/store, popcount, reductions, saturated add/sub, scatter/gather.
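+
+ A sketch exercising a few of these (assuming `Set`, `Lt` and a
+ `ScalableTag` named `d`; the values are illustrative):
+
+ ```cpp
+ const hn::ScalableTag<uint8_t> d;
+ const auto a = hn::Set(d, 200);
+ const auto b = hn::Set(d, 100);
+ const auto avg = hn::AverageRound(a, b);  // (a + b + 1) >> 1 per lane
+ const auto sat = hn::SaturatedAdd(a, b);  // saturates to 255, no wraparound
+ const auto sel = hn::IfThenElse(hn::Lt(a, b), a, b);  // per-lane minimum
+ ```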
1. Designing the API to avoid or minimize overhead on AVX2/AVX-512 caused by
crossing 128-bit 'block' boundaries.
@@ -156,10 +158,6 @@
1. Ensuring signed integer overflow has well-defined semantics (wraparound).
-1. Simple header-only implementation and a fraction of the size of the Vc
- library from which P0214 was derived (39K, vs. 92K lines in
- https://github.com/VcDevel/Vc according to the gloc Chrome extension).
-
1. Avoiding hidden performance costs. P0214R5 allows implicit conversions from
integer to float, which costs 3-4 cycles on x86. We make these conversions
explicit to ensure their cost is visible.
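+
+ For instance (a sketch; `ConvertTo` is the explicit conversion, and the
+ tags/values are assumptions for illustration):
+
+ ```cpp
+ const hn::ScalableTag<int32_t> di;
+ const hn::RebindToFloat<decltype(di)> df;  // f32 tag, same lane count
+ const auto vi = hn::Iota(di, 0);           // 0, 1, 2, ...
+ const auto vf = hn::ConvertTo(df, vi);     // conversion cost is visible
+ ```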
diff --git a/g3doc/faq.md b/g3doc/faq.md
index 5c6f6df..938a3d9 100644
--- a/g3doc/faq.md
+++ b/g3doc/faq.md
@@ -374,9 +374,13 @@
the entire system, it is important to measure end-to-end application performance
rather than rely on microbenchmarks. In practice, we find the speedup from
sustained SIMD usage (not just sporadic instructions amid mostly scalar code) is
-much larger than the impact of throttling. For JPEG XL image decompression and
-vectorized Quicksort, we observe a 1.4-1.6x end to end speedup from AVX-512 vs
-AVX2, even on multiple cores of a Xeon Gold. Note that throttling is
+much larger than the impact of throttling. For JPEG XL image decompression, we
+observe a 1.4-1.6x end-to-end speedup from AVX-512 vs. AVX2, even on multiple
+cores of a Xeon Gold. For
+[vectorized Quicksort](https://github.com/google/highway/blob/master/hwy/contrib/sort/README.md#study-of-avx-512-downclocking),
+we find that throttling is not detectable on a single Skylake core, and the
+AVX-512 startup overhead is worthwhile for inputs >= 80 KiB. Note that
+throttling is
[no longer a concern on recent Intel](https://travisdowns.github.io/blog/2020/08/19/icl-avx512-freq.html#summary)
implementations of AVX-512 (Icelake and Rocket Lake client), and AMD CPUs do not
throttle AVX2 or AVX-512.