Optimize remaining mse and sse functions in variance_neon.c

Implement the sum of squared differences calculations in
vpx_mse16x16_neon and vpx_get4x4sse_cs_neon using the ABD and UDOT
instructions instead of widening subtracts followed by a sequence of
MLAs.
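
For illustration only, a minimal sketch of the ABD + UDOT idea, not the
code in variance_neon.c. The helper names sse_row16 and sse_16x16 are
hypothetical. Since |a - b| fits in 8 bits, taking the dot product of the
absolute-difference vector with itself yields the sum of squared
differences directly in 32-bit lanes. The NEON path assumes an AArch64
compiler that defines __ARM_FEATURE_DOTPROD; elsewhere a scalar loop
computes the same value.

```c
#include <stdint.h>

#if defined(__aarch64__) && defined(__ARM_FEATURE_DOTPROD)
#include <arm_neon.h>
#endif

/* Hypothetical helper: sum of squared differences of two 16-byte rows. */
static uint32_t sse_row16(const uint8_t *src, const uint8_t *ref) {
#if defined(__aarch64__) && defined(__ARM_FEATURE_DOTPROD)
  const uint8x16_t s = vld1q_u8(src);
  const uint8x16_t r = vld1q_u8(ref);
  /* ABD: per-byte absolute difference, result still fits in 8 bits. */
  const uint8x16_t abs_diff = vabdq_u8(s, r);
  /* UDOT: each 32-bit lane accumulates four products abs_diff * abs_diff. */
  const uint32x4_t sse = vdotq_u32(vdupq_n_u32(0), abs_diff, abs_diff);
  /* Horizontal add of the four 32-bit lanes. */
  return vaddvq_u32(sse);
#else
  /* Scalar fallback computing the same quantity. */
  uint32_t sse = 0;
  for (int i = 0; i < 16; ++i) {
    const uint32_t abs_diff =
        src[i] > ref[i] ? (uint32_t)(src[i] - ref[i])
                        : (uint32_t)(ref[i] - src[i]);
    sse += abs_diff * abs_diff;
  }
  return sse;
#endif
}

/* Hypothetical helper: SSE over a 16x16 block with independent strides. */
static uint32_t sse_16x16(const uint8_t *src, int src_stride,
                          const uint8_t *ref, int ref_stride) {
  uint32_t sse = 0;
  for (int row = 0; row < 16; ++row) {
    sse += sse_row16(src + row * src_stride, ref + row * ref_stride);
  }
  return sse;
}
```

The worst case for a 16x16 block is 256 * 255 * 255 = 16646400, so a
32-bit accumulator cannot overflow in this sketch.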

The existing implementation is retained for use on CPUs that do not
implement the Armv8.4-A UDOT instruction. This commit also updates
the variable names used in the existing implementations to be more

Bug: b/181236880
Change-Id: Id4ad8ea7c808af1ac9bb5f1b63327ab487e4b1c7
1 file changed