[zlib][riscv] Implement generic chunk_copy

Back in 2017, Simon Hosie implemented chunk_copy for Arm using NEON
instructions, which was later ported to x86-64 by Noel Gordon.

The basic idea is to perform wide loads and stores while doing
data decompression (i.e. load a single wide vector instead of a single
byte).

The current chunk_copy can be easily ported to other architectures that use
fixed-length vectors/registers, but doesn't scale so well for architectures
with variable vector lengths (e.g. Arm SVE or RISC-V RVV 1.0).

In any case, it is possible to have a *generic* chunk_copy** relying on the
compiler builtins memcpy/memset, and this patch introduces this functionality
in Chromium zlib.
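
As a rough illustration of the generic approach (a minimal sketch, not the
code in this patch): the chunk type, the 16-byte width and all function names
below are hypothetical. The sketch assumes source and destination do not
overlap and that both buffers have at least CHUNK_SIZE - 1 bytes of slack
past the copied range, since the first store always writes a full chunk.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical chunk width, chosen only for illustration. */
#define CHUNK_SIZE 16

/* A plain byte array stands in for a vector register. */
typedef struct { unsigned char bytes[CHUNK_SIZE]; } chunk_t;

/* With a constant size, compilers typically lower these memcpy calls
 * to wide load/store instructions rather than a library call. */
static inline void load_chunk(const unsigned char *src, chunk_t *c) {
    memcpy(c, src, sizeof(*c));
}

static inline void store_chunk(unsigned char *dst, const chunk_t *c) {
    memcpy(dst, c, sizeof(*c));
}

/* Fill a chunk with one repeated byte (e.g. for a run of a single value). */
static inline void broadcast_byte(unsigned char value, chunk_t *c) {
    memset(c, value, sizeof(*c));
}

/* Copy len bytes (len > 0) one chunk at a time and return dst + len.
 * The first step copies a full chunk but only advances by the remainder,
 * so every later step is exactly CHUNK_SIZE bytes. */
static unsigned char *chunk_copy_sketch(unsigned char *dst,
                                        const unsigned char *src,
                                        size_t len) {
    chunk_t c;
    size_t first = ((len - 1) % CHUNK_SIZE) + 1;

    load_chunk(src, &c);
    store_chunk(dst, &c);
    dst += first;
    src += first;
    len -= first;

    while (len > 0) {
        load_chunk(src, &c);
        store_chunk(dst, &c);
        dst += CHUNK_SIZE;
        src += CHUNK_SIZE;
        len -= CHUNK_SIZE;
    }
    return dst;
}
```

Because nothing here names a specific vector ISA, the width of the actual
loads and stores is left to the compiler and the target flags.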

One important detail is that chunk_copy was coded *before* read64le (an
optimization suggested by Nigel Tao that requires unaligned loads), and
chunk_copy is a requirement for both read64le and the unconditional decoding
of literals (suggested by Dougall Johnson).
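
For readers unfamiliar with the pattern, an unaligned 64-bit little-endian
read can be sketched as follows. This is an illustration only, not zlib's
actual read64le; the function name is hypothetical, and the byte-order check
assumes a GCC- or clang-compatible compiler.

```c
#include <stdint.h>
#include <string.h>

/* Read 8 bytes from a possibly unaligned address as a little-endian
 * value. memcpy avoids the undefined behavior of casting the pointer
 * to uint64_t* and dereferencing it. */
static inline uint64_t read64le_sketch(const unsigned char *in) {
    uint64_t v;
    memcpy(&v, in, sizeof(v));
#if defined(__BYTE_ORDER__) && __BYTE_ORDER__ == __ORDER_BIG_ENDIAN__
    /* Assumption: GCC/clang builtin; only needed on big-endian hosts. */
    v = __builtin_bswap64(v);
#endif
    return v;
}
```

Whether the memcpy becomes a single load or a sequence of byte loads depends
on whether the compiler is allowed to assume fast unaligned access on the
target.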

The penalty of unaligned loads in read64le can actually negate the benefits
of chunk_copy, which is why we rely on clang flags to allow code generation
that deals with the issue.

The current patch yielded an average gain of +9.5% on a K230 board, with higher
gains for some important content like HTML (+16%) and source code (+11.6%).

** Link:

Bug: 329282661
Change-Id: Ia32a4a1fed16169a59cd39775fa68f4e675dac09
Reviewed-on: https://chromium-review.googlesource.com/c/chromium/src/+/5402331
Reviewed-by: Chris Blume <cblume@chromium.org>
Commit-Queue: Adenilson Cavalcanti <cavalcantii@chromium.org>
Cr-Commit-Position: refs/heads/main@{#1283414}
GitOrigin-RevId: cb959c56ec21abb0526f52b5f66a07fba7b6b145
2 files changed