Implement Neon variance functions using UDOT instruction

Accelerate the Neon variance functions by computing the sum of squares
with the Armv8.4-A UDOT instruction instead of four MLA instructions.
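
For illustration only, the following is a portable scalar model of the
idea, not the code in this change. It assumes the squared differences
are formed by taking the absolute difference of 16 source and reference
pixels and feeding that vector to UDOT as both operands; the names
udot_model and sse_udot_model are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Scalar model of one vector-form UDOT: each of the four 32-bit lanes
 * accumulates the dot product of four adjacent unsigned 8-bit pairs. */
static void udot_model(uint32_t acc[4], const uint8_t a[16],
                       const uint8_t b[16]) {
  for (int lane = 0; lane < 4; ++lane) {
    for (int i = 0; i < 4; ++i) {
      acc[lane] += (uint32_t)a[4 * lane + i] * b[4 * lane + i];
    }
  }
}

/* Hypothetical helper: sum of squared differences over n pixels.
 * Assumes n is a multiple of 16 (one UDOT per 16 pixels). */
static uint32_t sse_udot_model(const uint8_t *src, const uint8_t *ref,
                               size_t n) {
  uint32_t acc[4] = { 0, 0, 0, 0 };
  assert(n % 16 == 0);
  for (size_t j = 0; j < n; j += 16) {
    uint8_t abs_diff[16];
    for (int i = 0; i < 16; ++i) {
      const uint8_t s = src[j + i];
      const uint8_t r = ref[j + i];
      abs_diff[i] = (uint8_t)(s > r ? s - r : r - s);
    }
    /* Dotting the absolute difference with itself squares and
     * accumulates 16 values in a single modelled instruction. */
    udot_model(acc, abs_diff, abs_diff);
  }
  /* Final horizontal add across the four accumulator lanes. */
  return acc[0] + acc[1] + acc[2] + acc[3];
}
```

In this model one UDOT covers 16 pixels and accumulates directly into
32-bit lanes, so no intermediate widening of the 8-bit differences is
needed before the multiply-accumulate.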

The previous implementation is retained for use on CPUs that do not
implement the Armv8.4-A dot product instructions.

Bug: b/181236880
Change-Id: I9ab3d52634278b9b6f0011f39390a1195210bc75
2 files changed