Optimize Neon SAD reductions using wider ADDP instruction

Implement AArch64-only paths for each of the Neon SAD reduction
functions, using the 128-bit form of the pairwise addition
instruction (ADDP), which is not available on Armv7.

This change removes the need for shuffling between the high and low
halves of Neon vectors, resulting in a faster reduction that requires
fewer instructions.
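
For illustration only, a minimal sketch of the two reduction styles.
The helper name reduce_sad_u32x4 and its array-based signature are
hypothetical and are not taken from the changed file; the Armv7 and
scalar branches are assumptions added so the sketch builds on any
target.

```c
#include <stdint.h>

#if defined(__aarch64__) || defined(__ARM_NEON)
#include <arm_neon.h>
#endif

// Hypothetical helper: sums four 32-bit partial SAD accumulators.
static uint32_t reduce_sad_u32x4(const uint32_t partial[4]) {
#if defined(__aarch64__)
  // AArch64: ADDP works on full 128-bit vectors, so two pairwise adds
  // finish the reduction with no high/low half extraction.
  const uint32x4_t a = vld1q_u32(partial);
  const uint32x4_t b = vpaddq_u32(a, a);  // {a0+a1, a2+a3, a0+a1, a2+a3}
  const uint32x4_t c = vpaddq_u32(b, b);  // every lane holds the total
  return vgetq_lane_u32(c, 0);
#elif defined(__ARM_NEON)
  // Armv7: pairwise add is 64-bit only, so widen pairwise and then
  // split the vector into halves before the final add.
  const uint32x4_t a = vld1q_u32(partial);
  const uint64x2_t b = vpaddlq_u32(a);
  const uint32x2_t c = vadd_u32(vreinterpret_u32_u64(vget_low_u64(b)),
                                vreinterpret_u32_u64(vget_high_u64(b)));
  return vget_lane_u32(c, 0);
#else
  // Scalar fallback (assumption) for non-Arm builds of this sketch.
  return partial[0] + partial[1] + partial[2] + partial[3];
#endif
}
```

All three branches return the same 32-bit sum, wrapping modulo 2^32
in the same way.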

Bug: b/181236880
Change-Id: I1c48580b4aec27222538eeab44e38ecc1f2009dc
1 file changed