Merge transpose and permute in Neon SDOT vertical convolution

The original dot-product implementation of vpx_convolve8_vert_neon
performed separate transposes before and after the convolution.
This patch merges the first transpose with the TBL permute (required
before SDOT can compute the convolution), significantly reducing the
amount of data rearrangement. The new approach also allows more
effective data reuse between loop iterations.
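
The idea of folding a transpose into a later table-lookup permute can
be illustrated with a scalar model. This is a minimal sketch, not the
code in this patch: tbl16, make_transpose_idx, compose_idx and
merged_matches_two_step are hypothetical names, the block is assumed
to be 4x4 bytes stored row-major in one 16-byte vector, and the
lookup only mimics the single-register TBL behaviour of returning
zero for out-of-range indices.

```c
#include <stdint.h>
#include <string.h>

/* Scalar model of a one-register byte table lookup (TBL-like):
 * indices >= 16 produce zero. Hypothetical helper. */
static void tbl16(const uint8_t table[16], const uint8_t idx[16],
                  uint8_t out[16]) {
  for (int i = 0; i < 16; ++i) out[i] = idx[i] < 16 ? table[idx[i]] : 0;
}

/* Index table that transposes a 4x4 byte block stored row-major. */
static void make_transpose_idx(uint8_t idx[16]) {
  for (int i = 0; i < 16; ++i) idx[i] = (uint8_t)((i % 4) * 4 + i / 4);
}

/* Compose two index tables so that one lookup with `merged` equals a
 * lookup with `first` followed by a lookup with `second`.
 * Out-of-range entries of `second` stay out of range (0xFF). */
static void compose_idx(const uint8_t first[16], const uint8_t second[16],
                        uint8_t merged[16]) {
  for (int i = 0; i < 16; ++i)
    merged[i] = second[i] < 16 ? first[second[i]] : 0xFF;
}

/* Returns 1 if the merged single lookup matches transpose-then-permute. */
static int merged_matches_two_step(const uint8_t in[16],
                                   const uint8_t second[16]) {
  uint8_t t[16], merged[16], tmp[16], two_step[16], one_step[16];
  make_transpose_idx(t);
  compose_idx(t, second, merged);

  tbl16(in, t, tmp);            /* step 1: transpose */
  tbl16(tmp, second, two_step); /* step 2: permute */
  tbl16(in, merged, one_step);  /* merged: a single lookup */

  return memcmp(one_step, two_step, 16) == 0;
}
```

Because the merged index table depends only on the two permutations,
it can be computed once ahead of time, leaving a single lookup per
block in the inner loop.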

Co-authored-by: James Greenhalgh <james.greenhalgh@arm.com>

Bug: b/181236880
Change-Id: I87fe4dadd312c3ad6216943b71a5410ddf4a1b5b
2 files changed