This allows the Neon intrinsics code to be built successfully (albeit
likely with reduced run-time performance) with Xcode 5.0-6.2
(iOS/AArch64) and Android NDK < r19 (AArch32). Note that Xcode 5.0-6.2
will not build the Armv8 GAS code without gas-preprocessor.pl, and no
version of Xcode will build the Armv7 GAS code without
gas-preprocessor.pl, so we always use the full Neon intrinsics
implementation by default with macOS and iOS builds.
Auto-detecting the completeness of the compiler's set of Neon intrinsics
also allows us to more intelligently set the default value of
NEON_INTRINSICS, based on the values of HAVE_VLD1*. This is a
reasonable, albeit imperfect, proxy for whether a compiler has a full
and optimal set of Neon intrinsics. Specific notes:
- 64-bit RGB-to-YCbCr color conversion
does not use any of the intrinsics in question, regresses with GCC
- 64-bit accurate integer forward DCT
uses vld1_s16_x3(), regresses with GCC
- 64-bit Huffman encoding
uses vld1q_u8_x4(), regresses with GCC
- 64-bit YCbCr-to-RGB color conversion
does not use any of the intrinsics in question, regresses with GCC
- 64-bit accurate integer inverse DCT
uses vld1_s16_x3(), regresses with GCC
- 64-bit 4x4 inverse DCT
uses vld1_s16_x3(). I did not test this algorithm in isolation, so
it may in fact regress with GCC, but the regression may be hidden by
the speedup from the new SIMD-accelerated upsampling algorithms.
- 32-bit RGB-to-YCbCr color conversion:
uses vld1_u16_x2(), regresses with GCC
- 32-bit accurate integer forward DCT
uses vld1_s16_x3(), regression irrelevant because there was no
previous implementation
- 32-bit accurate integer inverse DCT
uses vld1_s16_x3(), regresses with GCC
- 32-bit fast integer inverse DCT
does not use any of the intrinsics in question, regresses with GCC
- 32-bit 4x4 inverse DCT
uses vld1_s16_x3(). I did not test this algorithm in isolation, so
it may in fact regress with GCC, but the regression may be hidden by
the speedup from the new SIMD-accelerated upsampling algorithms.
Presumably when GCC includes a full and optimal set of Neon intrinsics,
the HAVE_VLD1* tests will pass, and the full Neon intrinsics
implementation will be enabled automatically.