Enable decode input reads in 64-bit chunks

This enables reading bigger chunks of data in the DEFLATE decoder
on aarch64.

Basically, instead of performing two 32-bit loads (e.g. ldrb w22, [x9]),
with the second one written into the higher half of the register
(e.g. via w23), the memcpy compiles to a single 64-bit load into the
same register (e.g. ldr x22, [x9]).

This also halves the number of subsequent operations (e.g. adds and
shifts), improving decompression performance.
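
As a rough illustration of the idea (a sketch, not the exact code in
this change), the refill of the bit accumulator could look like the
snippet below. The names read64le, refill_bytes and refill_chunk are
hypothetical, and the sketch assumes a little-endian host, fewer than
16 bits pending in the accumulator, and at least 8 readable input bytes.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: read 8 bytes with one unaligned-safe load.
 * Compilers typically lower this fixed-size memcpy to a single 64-bit
 * load. Assumes a little-endian host. */
static uint64_t read64le(const unsigned char *in) {
    uint64_t chunk;
    memcpy(&chunk, in, sizeof(chunk));
    return chunk;
}

/* Reference refill: one load, one shift and one OR per input byte. */
static const unsigned char *refill_bytes(const unsigned char *in,
                                         uint64_t *hold, unsigned *bits) {
    for (int i = 0; i < 6; i++) {
        *hold |= (uint64_t)(*in++) << *bits;
        *bits += 8;
    }
    return in;
}

/* Chunked refill: one 64-bit load, one shift and one OR for 6 bytes.
 * Requires *bits < 16 so the shift keeps all 48 consumed bits, and
 * 8 readable bytes at `in`. Bits above *bits in *hold are a preview
 * of the next input bytes and get ORed with identical data later. */
static const unsigned char *refill_chunk(const unsigned char *in,
                                         uint64_t *hold, unsigned *bits) {
    *hold |= read64le(in) << *bits;
    in += 6;
    *bits += 48;
    return in;
}
```

Both refills leave the same valid low bits in the accumulator; only the
number of loads, shifts and adds needed to get there differs.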

For JavaScript content, the gain was close to 14% on big cores (A72)
and 9% on little cores (A53).

