[NEON] Optimize FHT functions, add highbd FHT 4x4

Refactor & optimize FHT functions further, use new butterfly functions
4x4 5% faster, 8x8 & 16x16 10% faster than previous versions.
Highbd 4x4 FHT version 2.27x faster than C version for --rt.

Change-Id: I3ebcd26010f6c5c067026aa9353cde46669c5d94
4 files changed