Enable decode input reads in 64-bit chunks

This enables reading bigger chunks of data in the DEFLATE decoder
on aarch64.

Instead of performing 2x 32-bit loads (e.g. ldrb w22, [x9]), with the
second one written to the higher lane of the register (e.g. w23), the
memcpy compiles to a single 64-bit load into the same register
(e.g. ldr x22, [x9]).

This also halves the number of subsequent operations (e.g. adds and
shifts), improving decompression performance.
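
As a rough illustration, the sketch below contrasts a byte-wise refill
of a bit accumulator with a single memcpy-based 64-bit read. The
bit_reader type, the function names and the choice to consume 6 bytes
per wide read are assumptions made for this example, not code taken
from this change. It also assumes a little-endian target and at least
8 readable input bytes.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical bit accumulator; all names here are illustrative only. */
typedef struct {
    uint64_t hold;           /* pending bits, least significant first */
    unsigned bits;           /* number of valid bits in hold */
    const unsigned char *in; /* next unread input byte */
} bit_reader;

/* Byte-wise refill: one load, one shift and one add per input byte. */
static void refill_bytewise(bit_reader *r, unsigned nbytes) {
    for (unsigned i = 0; i < nbytes; i++) {
        r->hold += (uint64_t)(*r->in++) << r->bits;
        r->bits += 8;
    }
}

/* Wide refill: one fixed-size memcpy, which compilers typically lower
   to a single 64-bit load. Only the low 48 bits are kept so that the
   shifted value still fits in hold when at most 16 bits are pending.
   Assumes little-endian byte order and 8 readable bytes at r->in. */
static void refill_wide(bit_reader *r) {
    uint64_t chunk;
    assert(r->bits <= 16);
    memcpy(&chunk, r->in, sizeof(chunk));
    r->hold |= (chunk & 0xFFFFFFFFFFFFULL) << r->bits;
    r->in += 6;
    r->bits += 48;
}
```

Both functions leave the accumulator in the same state after six input
bytes, but the wide version gets there with one load, one shift and one
OR instead of six of each.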

For JavaScript content, the gain was close to 14% on big cores (A72)
and 9% on little cores (A53).

Bug: 812499
Change-Id: I010604ee62e72a769ce2a7912afb7e334adefacf
Reviewed-on: https://chromium-review.googlesource.com/c/1447042
Reviewed-by: Mike Klein <mtklein@chromium.org>
Reviewed-by: Adenilson Cavalcanti <cavalcantii@chromium.org>
Commit-Queue: Adenilson Cavalcanti <cavalcantii@chromium.org>
Cr-Original-Commit-Position: refs/heads/master@{#628091}
Cr-Mirrored-From: https://chromium.googlesource.com/chromium/src
Cr-Mirrored-Commit: e2aef12cf002ca3577b9bfea3f2a89eed5379a4f
1 file changed