'buffer' is both passed into the inline assembly code and modified by
it. See https://gcc.gnu.org/onlinedocs/gcc/Extended-Asm.html, 6.47.2.3.
With GCC 4, this commit does not change the generated assembly code at
all.
With GCC 8, this commit fixes an assembly error:
/tmp/{foo}.s: Assembler messages:
/tmp/{foo}.s:775: Error: registers may not be the same --
`str r9,[r9],#4'
I'm not sure why that error went unnoticed, since I definitely
benchmarked the previous commit with GCC 8. Anyhow, this commit changes
the generated assembly code slightly but does not alter performance.
With Clang 10, this commit changes the generated assembly code slightly
but does not alter performance.
Refer to #529
Referring to the C standard
(http://www.open-std.org/jtc1/sc22/WG14/www/docs/n1256.pdf,
J.2 Undefined behavior), the behavior of the compiler is undefined if
"conversion between two pointer types produces a result that is
incorrectly aligned." Thus, the behavior of this code
*((uint32_t *)buffer) = BUILTIN_BSWAP32(put_buffer);
in the AArch32 version of the FLUSH() macro is undefined unless 'buffer'
is 32-bit-aligned. Referring to
https://bugs.llvm.org/show_bug.cgi?id=50785, certain versions of Clang,
when generating Thumb (T32) instructions, miscompile that code into an
assembly instruction (stm) that requires the destination to be
32-bit-aligned. Since such alignment cannot be guaranteed within the
Huffman encoder, this reportedly led to crashes (SIGBUS: illegal
alignment) with AArch32/Thumb builds of libjpeg-turbo running on Android
devices, although thus far I have been unable to reproduce those crashes
with a plain Linux/Arm system.
The miscompilation is visible with the Compiler Explorer:
https://godbolt.org/z/rv1ccx1Pb
However, it goes away when removing the return statement from the
function. Thus, it seems that Clang's behavior in this regard is
somewhat variable, which may explain why the crashes are only
reproducible on certain platforms.
The suggested workaround is to use memcpy(), but whereas Clang and
recent GCC releases are smart enough to compile a 4-byte memcpy() call
into a str instruction, GCC < 6 is not. Referring to
https://godbolt.org/z/ae7Wje3P6, the only way to consistently produce
the desired str instruction across all supported compilers is to use
inline assembly. Visual C++ presumably does not miscompile the code in
question, since no issues have been reported with it, but since the code
relies on undefined compiler behavior, prudence dictates that
e4ec23d7ae should be reverted for Visual
C++, which this commit does. The performance impact of
e4ec23d7ae for Visual C++/Arm builds is
unknown (I have no ability to test such builds), but regardless, this
commit reverts the Visual C++/Arm performance to that of libjpeg-turbo
2.1 beta1.
Closes#529
Define compiler-independent byte-swap macros and use them instead of
executing 'rev' via inline assembly code with GCC-compatible compilers
or a slow shift-store sequence with Visual C++.
* This produces identical assembly code with:
- 64-bit GCC 8.4.0 (Linux)
- 64-bit GCC 9.3.0 (Linux)
- 64-bit Clang 10.0.0 (Linux)
- 64-bit Clang 10.0.0 (MinGW)
- 64-bit Clang 12.0.0 (Xcode 12.2, macOS)
- 64-bit Clang 12.0.0 (Xcode 12.2, iOS)
* This produces different assembly code with:
- 64-bit GCC 4.9.1 (Linux)
- 32-bit GCC 4.8.2 (Linux)
- 32-bit GCC 8.4.0 (Linux)
- 32-bit GCC 9.3.0 (Linux)
Since the intrinsics implementation of Huffman encoding is not used
by default with these compilers, this is not a concern.
- 32-bit Clang 10.0.0 (Linux)
Verified performance neutrality
Closes#507
- Use the _M_ARM and _M_ARM64 macros provided by Visual Studio for
compile-time detection of Arm builds, since __arm__ and __aarch64__
are only present in GNU-compatible compilers.
- Neon/intrinsics: Use the _CountLeadingZeros() and
_CountLeadingZeros64() intrinsics provided by Visual Studio, since
__builtin_clz() and __builtin_clzl() are only present in
GNU-compatible compilers.
- Neon/intrinsics: Since Visual Studio does not support static vector
initialization, replace static initialization of Neon vectors with the
appropriate intrinsics. Compared to the static initialization
approach, this produces identical assembly code with both GCC and
Clang.
- Neon/intrinsics: Since Visual Studio does not support inline assembly
code, provide alternative code paths for Visual Studio whenever inline
assembly is used.
- Build: Set FLOATTEST appropriately for AArch64 Visual Studio builds
(Visual Studio does not emit fused multiply-add [FMA] instructions by
default for such builds.)
- Neon/intrinsics: Move temporary buffer allocation outside of nested
loops. Since Visual Studio configures Arm builds with a relatively
small amount of stack memory, attempting to allocate those buffers
within the inner loops caused a stack overflow.
Closes#461Closes#475
The previous AArch64 GAS implementation is retained by default when
using GCC, in order to avoid a performance regression. The intrinsics
implementation can be forced on or off using the new NEON_INTRINSICS
CMake variable. The previous AArch32 GAS implementation has been
removed, since the intrinsics implementation provides the same or better
performance.