"ARM"="Arm", "NEON"="Neon"

Refer to:
https://www.arm.com/company/policies/trademarks/arm-trademark-list/arm-trademark
https://www.arm.com/company/policies/trademarks/arm-trademark-list/neon-trademark

NOTE: These changes are only applied to change log entries for 2.0.x and
later, since the change log is a historical record and Arm's new
trademark policy did not go into effect until late 2017.
DRC
2020-10-15 17:47:31 -05:00
parent b5a1472781
commit 1ed312eab6
15 changed files with 76 additions and 76 deletions


@@ -398,8 +398,8 @@ located (usually **/usr/bin**.) Next, execute the following commands:
Building libjpeg-turbo for iOS Building libjpeg-turbo for iOS
------------------------------ ------------------------------
iOS platforms, such as the iPhone and iPad, use ARM processors, and all iOS platforms, such as the iPhone and iPad, use Arm processors, and all
currently supported models include NEON instructions. Thus, they can take currently supported models include Neon instructions. Thus, they can take
advantage of libjpeg-turbo's SIMD extensions to significantly accelerate JPEG advantage of libjpeg-turbo's SIMD extensions to significantly accelerate JPEG
compression/decompression. This section describes how to build libjpeg-turbo compression/decompression. This section describes how to build libjpeg-turbo
for these platforms. for these platforms.
@@ -412,7 +412,7 @@ for these platforms.
it should be installed in your `PATH`. it should be installed in your `PATH`.
### ARMv7 (32-bit) ### Armv7 (32-bit)
**gas-preprocessor.pl required** **gas-preprocessor.pl required**
@@ -465,7 +465,7 @@ Same as above, but replace the first line with:
make make
### ARMv7s (32-bit) ### Armv7s (32-bit)
**gas-preprocessor.pl required** **gas-preprocessor.pl required**
@@ -493,13 +493,13 @@ iPhone 5/iPad 4th Generation and newer:
#### Xcode 5 and later (Clang) #### Xcode 5 and later (Clang)
Same as the ARMv7 build procedure for Xcode 5 and later, except replace the Same as the Armv7 build procedure for Xcode 5 and later, except replace the
compiler flags as follows: compiler flags as follows:
export CFLAGS="-Wall -mfloat-abi=softfp -arch armv7s -miphoneos-version-min=6.0" export CFLAGS="-Wall -mfloat-abi=softfp -arch armv7s -miphoneos-version-min=6.0"
### ARMv8 (64-bit) ### Armv8 (64-bit)
**gas-preprocessor.pl required if using Xcode < 6** **gas-preprocessor.pl required if using Xcode < 6**
@@ -523,7 +523,7 @@ iPhone 5S/iPad Mini 2/iPad Air and newer.
[additional CMake flags] {source_directory} [additional CMake flags] {source_directory}
make make
Once built, lipo can be used to combine the ARMv7, v7s, and/or v8 variants into Once built, lipo can be used to combine the Armv7, v7s, and/or v8 variants into
a universal library. a universal library.
@@ -534,7 +534,7 @@ Building libjpeg-turbo for Android platforms requires v13b or later of the
[Android NDK](https://developer.android.com/tools/sdk/ndk). [Android NDK](https://developer.android.com/tools/sdk/ndk).
### ARMv7 (32-bit) ### Armv7 (32-bit)
The following is a general recipe script that can be modified for your specific The following is a general recipe script that can be modified for your specific
needs. needs.
@@ -559,7 +559,7 @@ needs.
make make
### ARMv8 (64-bit) ### Armv8 (64-bit)
The following is a general recipe script that can be modified for your specific The following is a general recipe script that can be modified for your specific
needs. needs.
@@ -742,21 +742,21 @@ must be built on OS X 10.6 or later.
make udmg make udmg
This creates a Mac package/disk image that contains universal x86-64/i386/ARM This creates a Mac package/disk image that contains universal x86-64/i386/Arm
binaries. The following CMake variables control which architectures are binaries. The following CMake variables control which architectures are
included in the universal binaries. Setting any of these variables to an empty included in the universal binaries. Setting any of these variables to an empty
string excludes that architecture from the package. string excludes that architecture from the package.
* `OSX_32BIT_BUILD`: Directory containing an i386 (32-bit) Mac build of * `OSX_32BIT_BUILD`: Directory containing an i386 (32-bit) Mac build of
libjpeg-turbo (default: *{source_directory}*/osxx86) libjpeg-turbo (default: *{source_directory}*/osxx86)
* `IOS_ARMV7_BUILD`: Directory containing an ARMv7 (32-bit) iOS build of * `IOS_ARMV7_BUILD`: Directory containing an Armv7 (32-bit) iOS build of
libjpeg-turbo (default: *{source_directory}*/iosarmv7) libjpeg-turbo (default: *{source_directory}*/iosarmv7)
* `IOS_ARMV7S_BUILD`: Directory containing an ARMv7s (32-bit) iOS build of * `IOS_ARMV7S_BUILD`: Directory containing an Armv7s (32-bit) iOS build of
libjpeg-turbo (default: *{source_directory}*/iosarmv7s) libjpeg-turbo (default: *{source_directory}*/iosarmv7s)
* `IOS_ARMV8_BUILD`: Directory containing an ARMv8 (64-bit) iOS build of * `IOS_ARMV8_BUILD`: Directory containing an Armv8 (64-bit) iOS build of
libjpeg-turbo (default: *{source_directory}*/iosarmv8) libjpeg-turbo (default: *{source_directory}*/iosarmv8)
You should first use CMake to configure i386, ARMv7, ARMv7s, and/or ARMv8 You should first use CMake to configure i386, Armv7, Armv7s, and/or Armv8
sub-builds of libjpeg-turbo (see "Build Recipes" and "Building libjpeg-turbo sub-builds of libjpeg-turbo (see "Build Recipes" and "Building libjpeg-turbo
for iOS" above) in build directories that match those specified in the for iOS" above) in build directories that match those specified in the
aforementioned CMake variables. Next, configure the primary build of aforementioned CMake variables. Next, configure the primary build of


@@ -20,8 +20,8 @@ with `jpeg_skip_scanlines()`, and the issues could not readily be fixed.
- Fixed an issue whereby `jpeg_skip_scanlines()` always returned 0 when - Fixed an issue whereby `jpeg_skip_scanlines()` always returned 0 when
skipping past the end of an image. skipping past the end of an image.
3. The ARM 64-bit (ARMv8) NEON SIMD extensions can now be built using MinGW 3. The Arm 64-bit (Armv8) Neon SIMD extensions can now be built using MinGW
toolchains targetting ARM64 (AArch64) Windows binaries. toolchains targetting Arm64 (AArch64) Windows binaries.
4. Fixed unexpected visual artifacts that occurred when using 4. Fixed unexpected visual artifacts that occurred when using
`jpeg_crop_scanline()` and interblock smoothing while decompressing only the DC `jpeg_crop_scanline()` and interblock smoothing while decompressing only the DC
@@ -94,7 +94,7 @@ other user-visible errant behavior, and given that the lossless transformer
(unlike the decompressor) is not generally exposed to arbitrary data exploits, (unlike the decompressor) is not generally exposed to arbitrary data exploits,
this issue did not likely pose a security risk. this issue did not likely pose a security risk.
6. The ARM 64-bit (ARMv8) NEON SIMD assembly code now stores constants in a 6. The Arm 64-bit (Armv8) Neon SIMD assembly code now stores constants in a
separate read-only data section rather than in the text section, to support separate read-only data section rather than in the text section, to support
execute-only memory layouts. execute-only memory layouts.
@@ -380,7 +380,7 @@ algorithm that caused incorrect dithering in the output image. This algorithm
now produces bitwise-identical results to the unmerged algorithms. now produces bitwise-identical results to the unmerged algorithms.
12. The SIMD function symbols for x86[-64]/ELF, MIPS/ELF, macOS/x86[-64] (if 12. The SIMD function symbols for x86[-64]/ELF, MIPS/ELF, macOS/x86[-64] (if
libjpeg-turbo is built with YASM), and iOS/ARM[64] builds are now private. libjpeg-turbo is built with YASM), and iOS/Arm[64] builds are now private.
This prevents those symbols from being exposed in applications or shared This prevents those symbols from being exposed in applications or shared
libraries that link statically with libjpeg-turbo. libraries that link statically with libjpeg-turbo.


@@ -2,7 +2,7 @@ Background
========== ==========
libjpeg-turbo is a JPEG image codec that uses SIMD instructions to accelerate libjpeg-turbo is a JPEG image codec that uses SIMD instructions to accelerate
baseline JPEG compression and decompression on x86, x86-64, ARM, PowerPC, and baseline JPEG compression and decompression on x86, x86-64, Arm, PowerPC, and
MIPS systems, as well as progressive JPEG compression on x86 and x86-64 MIPS systems, as well as progressive JPEG compression on x86 and x86-64
systems. On such systems, libjpeg-turbo is generally 2-6x as fast as libjpeg, systems. On such systems, libjpeg-turbo is generally 2-6x as fast as libjpeg,
all else being equal. On other types of systems, libjpeg-turbo can still all else being equal. On other types of systems, libjpeg-turbo can still


@@ -137,13 +137,13 @@ set(OSX_32BIT_BUILD ${DEFAULT_OSX_32BIT_BUILD} CACHE PATH
"Directory containing 32-bit (i386) Mac build to include in universal binaries (default: ${DEFAULT_OSX_32BIT_BUILD})") "Directory containing 32-bit (i386) Mac build to include in universal binaries (default: ${DEFAULT_OSX_32BIT_BUILD})")
set(DEFAULT_IOS_ARMV7_BUILD ${CMAKE_SOURCE_DIR}/iosarmv7) set(DEFAULT_IOS_ARMV7_BUILD ${CMAKE_SOURCE_DIR}/iosarmv7)
set(IOS_ARMV7_BUILD ${DEFAULT_IOS_ARMV7_BUILD} CACHE PATH set(IOS_ARMV7_BUILD ${DEFAULT_IOS_ARMV7_BUILD} CACHE PATH
"Directory containing ARMv7 iOS build to include in universal binaries (default: ${DEFAULT_IOS_ARMV7_BUILD})") "Directory containing Armv7 iOS build to include in universal binaries (default: ${DEFAULT_IOS_ARMV7_BUILD})")
set(DEFAULT_IOS_ARMV7S_BUILD ${CMAKE_SOURCE_DIR}/iosarmv7s) set(DEFAULT_IOS_ARMV7S_BUILD ${CMAKE_SOURCE_DIR}/iosarmv7s)
set(IOS_ARMV7S_BUILD ${DEFAULT_IOS_ARMV7S_BUILD} CACHE PATH set(IOS_ARMV7S_BUILD ${DEFAULT_IOS_ARMV7S_BUILD} CACHE PATH
"Directory containing ARMv7s iOS build to include in universal binaries (default: ${DEFAULT_IOS_ARMV7S_BUILD})") "Directory containing Armv7s iOS build to include in universal binaries (default: ${DEFAULT_IOS_ARMV7S_BUILD})")
set(DEFAULT_IOS_ARMV8_BUILD ${CMAKE_SOURCE_DIR}/iosarmv8) set(DEFAULT_IOS_ARMV8_BUILD ${CMAKE_SOURCE_DIR}/iosarmv8)
set(IOS_ARMV8_BUILD ${DEFAULT_IOS_ARMV8_BUILD} CACHE PATH set(IOS_ARMV8_BUILD ${DEFAULT_IOS_ARMV8_BUILD} CACHE PATH
"Directory containing ARMv8 iOS build to include in universal binaries (default: ${DEFAULT_IOS_ARMV8_BUILD})") "Directory containing Armv8 iOS build to include in universal binaries (default: ${DEFAULT_IOS_ARMV8_BUILD})")
set(OSX_APP_CERT_NAME "" CACHE STRING set(OSX_APP_CERT_NAME "" CACHE STRING
"Name of the Developer ID Application certificate (in the macOS keychain) that should be used to sign the libjpeg-turbo DMG. Leave this blank to generate an unsigned DMG.") "Name of the Developer ID Application certificate (in the macOS keychain) that should be used to sign the libjpeg-turbo DMG. Leave this blank to generate an unsigned DMG.")


@@ -34,10 +34,10 @@
* memory footprint by 64k, which is important for some mobile applications * memory footprint by 64k, which is important for some mobile applications
* that create many isolated instances of libjpeg-turbo (web browsers, for * that create many isolated instances of libjpeg-turbo (web browsers, for
* instance.) This may improve performance on some mobile platforms as well. * instance.) This may improve performance on some mobile platforms as well.
* This feature is enabled by default only on ARM processors, because some x86 * This feature is enabled by default only on Arm processors, because some x86
* chips have a slow implementation of bsr, and the use of clz/bsr cannot be * chips have a slow implementation of bsr, and the use of clz/bsr cannot be
* shown to have a significant performance impact even on the x86 chips that * shown to have a significant performance impact even on the x86 chips that
* have a fast implementation of it. When building for ARMv6, you can * have a fast implementation of it. When building for Armv6, you can
* explicitly disable the use of clz/bsr by adding -mthumb to the compiler * explicitly disable the use of clz/bsr by adding -mthumb to the compiler
* flags (this defines __thumb__). * flags (this defines __thumb__).
*/ */


@@ -43,10 +43,10 @@
* memory footprint by 64k, which is important for some mobile applications * memory footprint by 64k, which is important for some mobile applications
* that create many isolated instances of libjpeg-turbo (web browsers, for * that create many isolated instances of libjpeg-turbo (web browsers, for
* instance.) This may improve performance on some mobile platforms as well. * instance.) This may improve performance on some mobile platforms as well.
* This feature is enabled by default only on ARM processors, because some x86 * This feature is enabled by default only on Arm processors, because some x86
* chips have a slow implementation of bsr, and the use of clz/bsr cannot be * chips have a slow implementation of bsr, and the use of clz/bsr cannot be
* shown to have a significant performance impact even on the x86 chips that * shown to have a significant performance impact even on the x86 chips that
* have a fast implementation of it. When building for ARMv6, you can * have a fast implementation of it. When building for Armv6, you can
* explicitly disable the use of clz/bsr by adding -mthumb to the compiler * explicitly disable the use of clz/bsr by adding -mthumb to the compiler
* flags (this defines __thumb__). * flags (this defines __thumb__).
*/ */


@@ -1,4 +1,4 @@
libjpeg-turbo is a JPEG image codec that uses SIMD instructions to accelerate baseline JPEG compression and decompression on x86, x86-64, ARM, PowerPC, and MIPS systems, as well as progressive JPEG compression on x86 and x86-64 systems. On such systems, libjpeg-turbo is generally 2-6x as fast as libjpeg, all else being equal. On other types of systems, libjpeg-turbo can still outperform libjpeg by a significant amount, by virtue of its highly-optimized Huffman coding routines. In many cases, the performance of libjpeg-turbo rivals that of proprietary high-speed JPEG codecs. libjpeg-turbo is a JPEG image codec that uses SIMD instructions to accelerate baseline JPEG compression and decompression on x86, x86-64, Arm, PowerPC, and MIPS systems, as well as progressive JPEG compression on x86 and x86-64 systems. On such systems, libjpeg-turbo is generally 2-6x as fast as libjpeg, all else being equal. On other types of systems, libjpeg-turbo can still outperform libjpeg by a significant amount, by virtue of its highly-optimized Huffman coding routines. In many cases, the performance of libjpeg-turbo rivals that of proprietary high-speed JPEG codecs.
libjpeg-turbo implements both the traditional libjpeg API as well as the less powerful but more straightforward TurboJPEG API. libjpeg-turbo also features colorspace extensions that allow it to compress from/decompress to 32-bit and big-endian pixel buffers (RGBX, XBGR, etc.), as well as a full-featured Java interface. libjpeg-turbo implements both the traditional libjpeg API as well as the less powerful but more straightforward TurboJPEG API. libjpeg-turbo also features colorspace extensions that allow it to compress from/decompress to 32-bit and big-endian pixel buffers (RGBX, XBGR, etc.), as well as a full-featured Java interface.


@@ -9,7 +9,7 @@ Homepage: @PKGURL@
Installed-Size: {__SIZE} Installed-Size: {__SIZE}
Description: A SIMD-accelerated JPEG codec that provides both the libjpeg and TurboJPEG APIs Description: A SIMD-accelerated JPEG codec that provides both the libjpeg and TurboJPEG APIs
libjpeg-turbo is a JPEG image codec that uses SIMD instructions to accelerate libjpeg-turbo is a JPEG image codec that uses SIMD instructions to accelerate
baseline JPEG compression and decompression on x86, x86-64, ARM, PowerPC, and baseline JPEG compression and decompression on x86, x86-64, Arm, PowerPC, and
MIPS systems, as well as progressive JPEG compression on x86 and x86-64 MIPS systems, as well as progressive JPEG compression on x86 and x86-64
systems. On such systems, libjpeg-turbo is generally 2-6x as fast as libjpeg, systems. On such systems, libjpeg-turbo is generally 2-6x as fast as libjpeg,
all else being equal. On other types of systems, libjpeg-turbo can still all else being equal. On other types of systems, libjpeg-turbo can still


@@ -223,15 +223,15 @@ install_ios()
} }
if [ $UNIVERSAL = 1 -a "$BUILDDIRARMV7" != "" ]; then if [ $UNIVERSAL = 1 -a "$BUILDDIRARMV7" != "" ]; then
install_ios $BUILDDIRARMV7 ARMv7 armv7 arm install_ios $BUILDDIRARMV7 Armv7 armv7 arm
fi fi
if [ $UNIVERSAL = 1 -a "$BUILDDIRARMV7S" != "" ]; then if [ $UNIVERSAL = 1 -a "$BUILDDIRARMV7S" != "" ]; then
install_ios $BUILDDIRARMV7S ARMv7s armv7s arm install_ios $BUILDDIRARMV7S Armv7s armv7s arm
fi fi
if [ $UNIVERSAL = 1 -a "$BUILDDIRARMV8" != "" ]; then if [ $UNIVERSAL = 1 -a "$BUILDDIRARMV8" != "" ]; then
install_ios $BUILDDIRARMV8 ARMv8 armv8 arm64 install_ios $BUILDDIRARMV8 Armv8 armv8 arm64
fi fi
install_name_tool -id $LIBDIR/$LIBJPEG_DSO_NAME $PKGROOT/$LIBDIR/$LIBJPEG_DSO_NAME install_name_tool -id $LIBDIR/$LIBJPEG_DSO_NAME $PKGROOT/$LIBDIR/$LIBJPEG_DSO_NAME


@@ -52,7 +52,7 @@ Provides: %{name} = %{version}-%{release}, @CMAKE_PROJECT_NAME@ = %{version}-%{r
%description %description
libjpeg-turbo is a JPEG image codec that uses SIMD instructions to accelerate libjpeg-turbo is a JPEG image codec that uses SIMD instructions to accelerate
baseline JPEG compression and decompression on x86, x86-64, ARM, PowerPC, and baseline JPEG compression and decompression on x86, x86-64, Arm, PowerPC, and
MIPS systems, as well as progressive JPEG compression on x86 and x86-64 MIPS systems, as well as progressive JPEG compression on x86 and x86-64
systems. On such systems, libjpeg-turbo is generally 2-6x as fast as libjpeg, systems. On such systems, libjpeg-turbo is generally 2-6x as fast as libjpeg,
all else being equal. On other types of systems, libjpeg-turbo can still all else being equal. On other types of systems, libjpeg-turbo can still


@@ -205,7 +205,7 @@ endif()
############################################################################### ###############################################################################
# ARM (GAS) # Arm (GAS)
############################################################################### ###############################################################################
elseif(CPU_TYPE STREQUAL "arm64" OR CPU_TYPE STREQUAL "arm") elseif(CPU_TYPE STREQUAL "arm64" OR CPU_TYPE STREQUAL "arm")


@@ -13,7 +13,7 @@
* *
* This file contains the interface between the "normal" portions * This file contains the interface between the "normal" portions
* of the library and the SIMD implementations when running on a * of the library and the SIMD implementations when running on a
* 32-bit ARM architecture. * 32-bit Arm architecture.
*/ */
#define JPEG_INTERNALS #define JPEG_INTERNALS
@@ -118,7 +118,7 @@ init_simd(void)
#if defined(__ARM_NEON__) #if defined(__ARM_NEON__)
simd_support |= JSIMD_NEON; simd_support |= JSIMD_NEON;
#elif defined(__linux__) || defined(ANDROID) || defined(__ANDROID__) #elif defined(__linux__) || defined(ANDROID) || defined(__ANDROID__)
/* We still have a chance to use NEON regardless of globally used /* We still have a chance to use Neon regardless of globally used
* -mcpu/-mfpu options passed to gcc by performing runtime detection via * -mcpu/-mfpu options passed to gcc by performing runtime detection via
* /proc/cpuinfo parsing on linux/android */ * /proc/cpuinfo parsing on linux/android */
while (!parse_proc_cpuinfo(bufsize)) { while (!parse_proc_cpuinfo(bufsize)) {


@@ -1,5 +1,5 @@
/* /*
* ARMv7 NEON optimizations for libjpeg-turbo * Armv7 Neon optimizations for libjpeg-turbo
* *
* Copyright (C) 2009-2011, Nokia Corporation and/or its subsidiary(-ies). * Copyright (C) 2009-2011, Nokia Corporation and/or its subsidiary(-ies).
* All Rights Reserved. * All Rights Reserved.
@@ -229,7 +229,7 @@ asm_function jsimd_idct_islow_neon
ROW7L .req d30 ROW7L .req d30
ROW7R .req d31 ROW7R .req d31
/* Load and dequantize coefficients into NEON registers /* Load and dequantize coefficients into Neon registers
* with the following allocation: * with the following allocation:
* 0 1 2 3 | 4 5 6 7 * 0 1 2 3 | 4 5 6 7
* ---------+-------- * ---------+--------
@@ -261,7 +261,7 @@ asm_function jsimd_idct_islow_neon
vld1.16 {d0, d1, d2, d3}, [ip, :128] /* load constants */ vld1.16 {d0, d1, d2, d3}, [ip, :128] /* load constants */
add ip, ip, #16 add ip, ip, #16
vmul.s16 q15, q15, q3 vmul.s16 q15, q15, q3
vpush {d8-d15} /* save NEON registers */ vpush {d8-d15} /* save Neon registers */
/* 1-D IDCT, pass 1, left 4x8 half */ /* 1-D IDCT, pass 1, left 4x8 half */
vadd.s16 d4, ROW7L, ROW3L vadd.s16 d4, ROW7L, ROW3L
vadd.s16 d5, ROW5L, ROW1L vadd.s16 d5, ROW5L, ROW1L
@@ -507,7 +507,7 @@ asm_function jsimd_idct_islow_neon
vqrshrn.s16 d17, q9, #2 vqrshrn.s16 d17, q9, #2
vqrshrn.s16 d18, q10, #2 vqrshrn.s16 d18, q10, #2
vqrshrn.s16 d19, q11, #2 vqrshrn.s16 d19, q11, #2
vpop {d8-d15} /* restore NEON registers */ vpop {d8-d15} /* restore Neon registers */
vqrshrn.s16 d20, q12, #2 vqrshrn.s16 d20, q12, #2
/* Transpose the final 8-bit samples and do signed->unsigned conversion */ /* Transpose the final 8-bit samples and do signed->unsigned conversion */
vtrn.16 q8, q9 vtrn.16 q8, q9
@@ -688,7 +688,7 @@ asm_function jsimd_idct_islow_neon
* function from jidctfst.c * function from jidctfst.c
* *
* Normally 1-D AAN DCT needs 5 multiplications and 29 additions. * Normally 1-D AAN DCT needs 5 multiplications and 29 additions.
* But in ARM NEON case some extra additions are required because VQDMULH * But in Arm Neon case some extra additions are required because VQDMULH
* instruction can't handle the constants larger than 1. So the expressions * instruction can't handle the constants larger than 1. So the expressions
* like "x * 1.082392200" have to be converted to "x * 0.082392200 + x", * like "x * 1.082392200" have to be converted to "x * 0.082392200 + x",
* which introduces an extra addition. Overall, there are 6 extra additions * which introduces an extra addition. Overall, there are 6 extra additions
@@ -718,7 +718,7 @@ asm_function jsimd_idct_ifast_neon
TMP3 .req r2 TMP3 .req r2
TMP4 .req ip TMP4 .req ip
/* Load and dequantize coefficients into NEON registers /* Load and dequantize coefficients into Neon registers
* with the following allocation: * with the following allocation:
* 0 1 2 3 | 4 5 6 7 * 0 1 2 3 | 4 5 6 7
* ---------+-------- * ---------+--------
@@ -749,7 +749,7 @@ asm_function jsimd_idct_ifast_neon
vmul.s16 q13, q13, q1 vmul.s16 q13, q13, q1
vld1.16 {d0}, [ip, :64] /* load constants */ vld1.16 {d0}, [ip, :64] /* load constants */
vmul.s16 q15, q15, q3 vmul.s16 q15, q15, q3
vpush {d8-d13} /* save NEON registers */ vpush {d8-d13} /* save Neon registers */
/* 1-D IDCT, pass 1 */ /* 1-D IDCT, pass 1 */
vsub.s16 q2, q10, q14 vsub.s16 q2, q10, q14
vadd.s16 q14, q10, q14 vadd.s16 q14, q10, q14
@@ -842,7 +842,7 @@ asm_function jsimd_idct_ifast_neon
vadd.s16 q14, q5, q3 vadd.s16 q14, q5, q3
vsub.s16 q9, q5, q3 vsub.s16 q9, q5, q3
vsub.s16 q13, q10, q2 vsub.s16 q13, q10, q2
vpop {d8-d13} /* restore NEON registers */ vpop {d8-d13} /* restore Neon registers */
vadd.s16 q10, q10, q2 vadd.s16 q10, q10, q2
vsub.s16 q11, q12, q1 vsub.s16 q11, q12, q1
vadd.s16 q12, q12, q1 vadd.s16 q12, q12, q1
@@ -913,7 +913,7 @@ asm_function jsimd_idct_ifast_neon
* *
* NOTE: jpeg-8 has an improved implementation of 4x4 inverse-DCT, which * NOTE: jpeg-8 has an improved implementation of 4x4 inverse-DCT, which
* requires much less arithmetic operations and hence should be faster. * requires much less arithmetic operations and hence should be faster.
* The primary purpose of this particular NEON optimized function is * The primary purpose of this particular Neon optimized function is
* bit exact compatibility with jpeg-6b. * bit exact compatibility with jpeg-6b.
* *
* TODO: a bit better instructions scheduling can be achieved by expanding * TODO: a bit better instructions scheduling can be achieved by expanding
@@ -1016,7 +1016,7 @@ asm_function jsimd_idct_4x4_neon
adr TMP4, jsimd_idct_4x4_neon_consts adr TMP4, jsimd_idct_4x4_neon_consts
vld1.16 {d0, d1, d2, d3}, [TMP4, :128] vld1.16 {d0, d1, d2, d3}, [TMP4, :128]
/* Load all COEF_BLOCK into NEON registers with the following allocation: /* Load all COEF_BLOCK into Neon registers with the following allocation:
* 0 1 2 3 | 4 5 6 7 * 0 1 2 3 | 4 5 6 7
* ---------+-------- * ---------+--------
* 0 | d4 | d5 * 0 | d4 | d5
@@ -1126,7 +1126,7 @@ asm_function jsimd_idct_4x4_neon
* *
* NOTE: jpeg-8 has an improved implementation of 2x2 inverse-DCT, which * NOTE: jpeg-8 has an improved implementation of 2x2 inverse-DCT, which
* requires much less arithmetic operations and hence should be faster. * requires much less arithmetic operations and hence should be faster.
* The primary purpose of this particular NEON optimized function is * The primary purpose of this particular Neon optimized function is
* bit exact compatibility with jpeg-6b. * bit exact compatibility with jpeg-6b.
*/ */
@@ -1173,7 +1173,7 @@ asm_function jsimd_idct_2x2_neon
adr TMP2, jsimd_idct_2x2_neon_consts adr TMP2, jsimd_idct_2x2_neon_consts
vld1.16 {d0}, [TMP2, :64] vld1.16 {d0}, [TMP2, :64]
/* Load all COEF_BLOCK into NEON registers with the following allocation: /* Load all COEF_BLOCK into Neon registers with the following allocation:
* 0 1 2 3 | 4 5 6 7 * 0 1 2 3 | 4 5 6 7
* ---------+-------- * ---------+--------
* 0 | d4 | d5 * 0 | d4 | d5
@@ -1499,7 +1499,7 @@ asm_function jsimd_ycc_\colorid\()_convert_neon
adr ip, jsimd_ycc_\colorid\()_neon_consts adr ip, jsimd_ycc_\colorid\()_neon_consts
vld1.16 {d0, d1, d2, d3}, [ip, :128] vld1.16 {d0, d1, d2, d3}, [ip, :128]
/* Save ARM registers and handle input arguments */ /* Save Arm registers and handle input arguments */
push {r4, r5, r6, r7, r8, r9, r10, lr} push {r4, r5, r6, r7, r8, r9, r10, lr}
ldr NUM_ROWS, [sp, #(4 * 8)] ldr NUM_ROWS, [sp, #(4 * 8)]
ldr INPUT_BUF0, [INPUT_BUF] ldr INPUT_BUF0, [INPUT_BUF]
@@ -1507,7 +1507,7 @@ asm_function jsimd_ycc_\colorid\()_convert_neon
ldr INPUT_BUF2, [INPUT_BUF, #8] ldr INPUT_BUF2, [INPUT_BUF, #8]
.unreq INPUT_BUF .unreq INPUT_BUF
/* Save NEON registers */ /* Save Neon registers */
vpush {d8-d15} vpush {d8-d15}
/* Initially set d10, d11, d12, d13 to 0xFF */ /* Initially set d10, d11, d12, d13 to 0xFF */
@@ -1814,7 +1814,7 @@ asm_function jsimd_\colorid\()_ycc_convert_neon
adr ip, jsimd_\colorid\()_ycc_neon_consts adr ip, jsimd_\colorid\()_ycc_neon_consts
vld1.16 {d0, d1, d2, d3}, [ip, :128] vld1.16 {d0, d1, d2, d3}, [ip, :128]
/* Save ARM registers and handle input arguments */ /* Save Arm registers and handle input arguments */
push {r4, r5, r6, r7, r8, r9, r10, lr} push {r4, r5, r6, r7, r8, r9, r10, lr}
ldr NUM_ROWS, [sp, #(4 * 8)] ldr NUM_ROWS, [sp, #(4 * 8)]
ldr OUTPUT_BUF0, [OUTPUT_BUF] ldr OUTPUT_BUF0, [OUTPUT_BUF]
@@ -1822,7 +1822,7 @@ asm_function jsimd_\colorid\()_ycc_convert_neon
ldr OUTPUT_BUF2, [OUTPUT_BUF, #8] ldr OUTPUT_BUF2, [OUTPUT_BUF, #8]
.unreq OUTPUT_BUF .unreq OUTPUT_BUF
/* Save NEON registers */ /* Save Neon registers */
vpush {d8-d15} vpush {d8-d15}
/* Outer loop over scanlines */ /* Outer loop over scanlines */
@@ -2017,7 +2017,7 @@ asm_function jsimd_fdct_ifast_neon
adr TMP, jsimd_fdct_ifast_neon_consts adr TMP, jsimd_fdct_ifast_neon_consts
vld1.16 {d0}, [TMP, :64] vld1.16 {d0}, [TMP, :64]
/* Load all DATA into NEON registers with the following allocation: /* Load all DATA into Neon registers with the following allocation:
* 0 1 2 3 | 4 5 6 7 * 0 1 2 3 | 4 5 6 7
* ---------+-------- * ---------+--------
* 0 | d16 | d17 | q8 * 0 | d16 | d17 | q8
@@ -2112,8 +2112,8 @@ asm_function jsimd_fdct_ifast_neon
* *
* Note: the code uses 2 stage pipelining in order to improve instructions * Note: the code uses 2 stage pipelining in order to improve instructions
* scheduling and eliminate stalls (this provides ~15% better * scheduling and eliminate stalls (this provides ~15% better
* performance for this function on both ARM Cortex-A8 and * performance for this function on both Arm Cortex-A8 and
* ARM Cortex-A9 when compared to the non-pipelined variant). * Arm Cortex-A9 when compared to the non-pipelined variant).
* The instructions which belong to the second stage use different * The instructions which belong to the second stage use different
* indentation for better readiability. * indentation for better readiability.
*/ */


@@ -12,7 +12,7 @@
* *
* This file contains the interface between the "normal" portions * This file contains the interface between the "normal" portions
* of the library and the SIMD implementations when running on a * of the library and the SIMD implementations when running on a
* 64-bit ARM architecture. * 64-bit Arm architecture.
*/ */
#define JPEG_INTERNALS #define JPEG_INTERNALS
@@ -114,8 +114,8 @@ parse_proc_cpuinfo(int bufsize)
*/ */
/* /*
* ARMv8 architectures support NEON extensions by default. * Armv8 architectures support Neon extensions by default.
* It is no longer optional as it was with ARMv7. * It is no longer optional as it was with Armv7.
*/ */


@@ -1,5 +1,5 @@
/* /*
* ARMv8 NEON optimizations for libjpeg-turbo * Armv8 Neon optimizations for libjpeg-turbo
* *
* Copyright (C) 2009-2011, Nokia Corporation and/or its subsidiary(-ies). * Copyright (C) 2009-2011, Nokia Corporation and/or its subsidiary(-ies).
* All Rights Reserved. * All Rights Reserved.
@@ -611,7 +611,7 @@ asm_function jsimd_idct_islow_neon
shrn2 v5.8h, v15.4s, #16 /* wsptr[DCTSIZE*3] = (int)DESCALE(tmp13 + tmp0, CONST_BITS+PASS1_BITS+3) */ shrn2 v5.8h, v15.4s, #16 /* wsptr[DCTSIZE*3] = (int)DESCALE(tmp13 + tmp0, CONST_BITS+PASS1_BITS+3) */
shrn2 v6.8h, v17.4s, #16 /* wsptr[DCTSIZE*4] = (int)DESCALE(tmp13 - tmp0, CONST_BITS+PASS1_BITS+3) */ shrn2 v6.8h, v17.4s, #16 /* wsptr[DCTSIZE*4] = (int)DESCALE(tmp13 - tmp0, CONST_BITS+PASS1_BITS+3) */
movi v0.16b, #(CENTERJSAMPLE) movi v0.16b, #(CENTERJSAMPLE)
/* Prepare pointers (dual-issue with NEON instructions) */ /* Prepare pointers (dual-issue with Neon instructions) */
ldp TMP1, TMP2, [OUTPUT_BUF], 16 ldp TMP1, TMP2, [OUTPUT_BUF], 16
sqrshrn v28.8b, v2.8h, #(CONST_BITS+PASS1_BITS+3-16) sqrshrn v28.8b, v2.8h, #(CONST_BITS+PASS1_BITS+3-16)
ldp TMP3, TMP4, [OUTPUT_BUF], 16 ldp TMP3, TMP4, [OUTPUT_BUF], 16
@@ -992,7 +992,7 @@ asm_function jsimd_idct_islow_neon
* function from jidctfst.c * function from jidctfst.c
* *
* Normally 1-D AAN DCT needs 5 multiplications and 29 additions. * Normally 1-D AAN DCT needs 5 multiplications and 29 additions.
* But in ARM NEON case some extra additions are required because VQDMULH * But in Arm Neon case some extra additions are required because VQDMULH
* instruction can't handle the constants larger than 1. So the expressions * instruction can't handle the constants larger than 1. So the expressions
* like "x * 1.082392200" have to be converted to "x * 0.082392200 + x", * like "x * 1.082392200" have to be converted to "x * 0.082392200 + x",
* which introduces an extra addition. Overall, there are 6 extra additions * which introduces an extra addition. Overall, there are 6 extra additions
@@ -1024,7 +1024,7 @@ asm_function jsimd_idct_ifast_neon
instruction ensures that those bits are set to zero. */ instruction ensures that those bits are set to zero. */
uxtw x3, w3 uxtw x3, w3
/* Load and dequantize coefficients into NEON registers /* Load and dequantize coefficients into Neon registers
* with the following allocation: * with the following allocation:
* 0 1 2 3 | 4 5 6 7 * 0 1 2 3 | 4 5 6 7
* ---------+-------- * ---------+--------
@@ -1037,7 +1037,7 @@ asm_function jsimd_idct_ifast_neon
* 6 | d28 | d29 ( v22.8h ) * 6 | d28 | d29 ( v22.8h )
* 7 | d30 | d31 ( v23.8h ) * 7 | d30 | d31 ( v23.8h )
*/ */
/* Save NEON registers used in fast IDCT */ /* Save Neon registers used in fast IDCT */
get_symbol_loc TMP5, Ljsimd_idct_ifast_neon_consts get_symbol_loc TMP5, Ljsimd_idct_ifast_neon_consts
ld1 {v16.8h, v17.8h}, [COEF_BLOCK], 32 ld1 {v16.8h, v17.8h}, [COEF_BLOCK], 32
ld1 {v0.8h, v1.8h}, [DCT_TABLE], 32 ld1 {v0.8h, v1.8h}, [DCT_TABLE], 32
@@ -1142,7 +1142,7 @@ asm_function jsimd_idct_ifast_neon
add v20.8h, v20.8h, v1.8h add v20.8h, v20.8h, v1.8h
/* Descale to 8-bit and range limit */ /* Descale to 8-bit and range limit */
movi v0.16b, #0x80 movi v0.16b, #0x80
/* Prepare pointers (dual-issue with NEON instructions) */ /* Prepare pointers (dual-issue with Neon instructions) */
ldp TMP1, TMP2, [OUTPUT_BUF], 16 ldp TMP1, TMP2, [OUTPUT_BUF], 16
sqshrn v28.8b, v16.8h, #5 sqshrn v28.8b, v16.8h, #5
ldp TMP3, TMP4, [OUTPUT_BUF], 16 ldp TMP3, TMP4, [OUTPUT_BUF], 16
@@ -1221,7 +1221,7 @@ asm_function jsimd_idct_ifast_neon
* *
* NOTE: jpeg-8 has an improved implementation of 4x4 inverse-DCT, which * NOTE: jpeg-8 has an improved implementation of 4x4 inverse-DCT, which
* requires much less arithmetic operations and hence should be faster. * requires much less arithmetic operations and hence should be faster.
* The primary purpose of this particular NEON optimized function is * The primary purpose of this particular Neon optimized function is
* bit exact compatibility with jpeg-6b. * bit exact compatibility with jpeg-6b.
* *
* TODO: a bit better instructions scheduling can be achieved by expanding * TODO: a bit better instructions scheduling can be achieved by expanding
@@ -1291,7 +1291,7 @@ asm_function jsimd_idct_4x4_neon
instruction ensures that those bits are set to zero. */ instruction ensures that those bits are set to zero. */
uxtw x3, w3 uxtw x3, w3
/* Save all used NEON registers */ /* Save all used Neon registers */
sub sp, sp, 64 sub sp, sp, 64
mov x9, sp mov x9, sp
/* Load constants (v3.4h is just used for padding) */ /* Load constants (v3.4h is just used for padding) */
@@ -1300,7 +1300,7 @@ asm_function jsimd_idct_4x4_neon
st1 {v12.8b, v13.8b, v14.8b, v15.8b}, [x9], 32 st1 {v12.8b, v13.8b, v14.8b, v15.8b}, [x9], 32
ld1 {v0.4h, v1.4h, v2.4h, v3.4h}, [TMP4] ld1 {v0.4h, v1.4h, v2.4h, v3.4h}, [TMP4]
/* Load all COEF_BLOCK into NEON registers with the following allocation: /* Load all COEF_BLOCK into Neon registers with the following allocation:
* 0 1 2 3 | 4 5 6 7 * 0 1 2 3 | 4 5 6 7
* ---------+-------- * ---------+--------
* 0 | v4.4h | v5.4h * 0 | v4.4h | v5.4h
@@ -1434,7 +1434,7 @@ asm_function jsimd_idct_4x4_neon
* *
* NOTE: jpeg-8 has an improved implementation of 2x2 inverse-DCT, which * NOTE: jpeg-8 has an improved implementation of 2x2 inverse-DCT, which
* requires much less arithmetic operations and hence should be faster. * requires much less arithmetic operations and hence should be faster.
* The primary purpose of this particular NEON optimized function is * The primary purpose of this particular Neon optimized function is
* bit exact compatibility with jpeg-6b. * bit exact compatibility with jpeg-6b.
*/ */
@@ -1483,7 +1483,7 @@ asm_function jsimd_idct_2x2_neon
st1 {v12.8b, v13.8b, v14.8b, v15.8b}, [x9], 32 st1 {v12.8b, v13.8b, v14.8b, v15.8b}, [x9], 32
ld1 {v14.4h}, [TMP2] ld1 {v14.4h}, [TMP2]
/* Load all COEF_BLOCK into NEON registers with the following allocation: /* Load all COEF_BLOCK into Neon registers with the following allocation:
* 0 1 2 3 | 4 5 6 7 * 0 1 2 3 | 4 5 6 7
* ---------+-------- * ---------+--------
* 0 | v4.4h | v5.4h * 0 | v4.4h | v5.4h
@@ -1857,7 +1857,7 @@ asm_function jsimd_ycc_\colorid\()_convert_neon_slowst3
/* Load constants to d1, d2, d3 (v0.4h is just used for padding) */ /* Load constants to d1, d2, d3 (v0.4h is just used for padding) */
get_symbol_loc x15, Ljsimd_ycc_rgb_neon_consts get_symbol_loc x15, Ljsimd_ycc_rgb_neon_consts
/* Save NEON registers */ /* Save Neon registers */
st1 {v8.8b, v9.8b, v10.8b, v11.8b}, [x9], 32 st1 {v8.8b, v9.8b, v10.8b, v11.8b}, [x9], 32
st1 {v12.8b, v13.8b, v14.8b, v15.8b}, [x9], 32 st1 {v12.8b, v13.8b, v14.8b, v15.8b}, [x9], 32
ld1 {v0.4h, v1.4h}, [x15], 16 ld1 {v0.4h, v1.4h}, [x15], 16
@@ -2142,7 +2142,7 @@ generate_jsimd_ycc_rgb_convert_neon extbgr, 24, 2, .4h, 1, .4h, 0, .4h, .8b,
.endm .endm
/* TODO: expand macros and interleave instructions if some in-order /* TODO: expand macros and interleave instructions if some in-order
* ARM64 processor actually can dual-issue LOAD/STORE with ALU */ * AArch64 processor actually can dual-issue LOAD/STORE with ALU */
.macro do_rgb_to_yuv_stage2_store_load_stage1 fast_ld3 .macro do_rgb_to_yuv_stage2_store_load_stage1 fast_ld3
do_rgb_to_yuv_stage2 do_rgb_to_yuv_stage2
do_load \bpp, 8, \fast_ld3 do_load \bpp, 8, \fast_ld3
@@ -2182,7 +2182,7 @@ asm_function jsimd_\colorid\()_ycc_convert_neon_slowld3
ldr OUTPUT_BUF2, [OUTPUT_BUF, #16] ldr OUTPUT_BUF2, [OUTPUT_BUF, #16]
.unreq OUTPUT_BUF .unreq OUTPUT_BUF
/* Save NEON registers */ /* Save Neon registers */
sub sp, sp, #64 sub sp, sp, #64
mov x9, sp mov x9, sp
st1 {v8.8b, v9.8b, v10.8b, v11.8b}, [x9], 32 st1 {v8.8b, v9.8b, v10.8b, v11.8b}, [x9], 32
@@ -2396,13 +2396,13 @@ asm_function jsimd_fdct_islow_neon
get_symbol_loc TMP, Ljsimd_fdct_islow_neon_consts get_symbol_loc TMP, Ljsimd_fdct_islow_neon_consts
ld1 {v0.8h, v1.8h}, [TMP] ld1 {v0.8h, v1.8h}, [TMP]
/* Save NEON registers */ /* Save Neon registers */
sub sp, sp, #64 sub sp, sp, #64
mov x10, sp mov x10, sp
st1 {v8.8b, v9.8b, v10.8b, v11.8b}, [x10], 32 st1 {v8.8b, v9.8b, v10.8b, v11.8b}, [x10], 32
st1 {v12.8b, v13.8b, v14.8b, v15.8b}, [x10], 32 st1 {v12.8b, v13.8b, v14.8b, v15.8b}, [x10], 32
/* Load all DATA into NEON registers with the following allocation: /* Load all DATA into Neon registers with the following allocation:
* 0 1 2 3 | 4 5 6 7 * 0 1 2 3 | 4 5 6 7
* ---------+-------- * ---------+--------
* 0 | d16 | d17 | v16.8h * 0 | d16 | d17 | v16.8h
@@ -2629,7 +2629,7 @@ asm_function jsimd_fdct_islow_neon
st1 {v16.8h, v17.8h, v18.8h, v19.8h}, [DATA], 64 st1 {v16.8h, v17.8h, v18.8h, v19.8h}, [DATA], 64
st1 {v20.8h, v21.8h, v22.8h, v23.8h}, [DATA] st1 {v20.8h, v21.8h, v22.8h, v23.8h}, [DATA]
/* Restore NEON registers */ /* Restore Neon registers */
ld1 {v8.8b, v9.8b, v10.8b, v11.8b}, [sp], 32 ld1 {v8.8b, v9.8b, v10.8b, v11.8b}, [sp], 32
ld1 {v12.8b, v13.8b, v14.8b, v15.8b}, [sp], 32 ld1 {v12.8b, v13.8b, v14.8b, v15.8b}, [sp], 32
@@ -2681,7 +2681,7 @@ asm_function jsimd_fdct_ifast_neon
get_symbol_loc TMP, Ljsimd_fdct_ifast_neon_consts get_symbol_loc TMP, Ljsimd_fdct_ifast_neon_consts
ld1 {v0.4h}, [TMP] ld1 {v0.4h}, [TMP]
/* Load all DATA into NEON registers with the following allocation: /* Load all DATA into Neon registers with the following allocation:
* 0 1 2 3 | 4 5 6 7 * 0 1 2 3 | 4 5 6 7
* ---------+-------- * ---------+--------
* 0 | d16 | d17 | v0.8h * 0 | d16 | d17 | v0.8h
@@ -3066,7 +3066,7 @@ asm_function jsimd_huff_encode_one_block_neon_slowtbl
.endif .endif
sub sp, sp, 272 sub sp, sp, 272
sub BUFFER, BUFFER, #0x1 /* BUFFER=buffer-- */ sub BUFFER, BUFFER, #0x1 /* BUFFER=buffer-- */
/* Save ARM registers */ /* Save Arm registers */
stp x19, x20, [sp] stp x19, x20, [sp]
get_symbol_loc x15, Ljsimd_huff_encode_one_block_neon_consts get_symbol_loc x15, Ljsimd_huff_encode_one_block_neon_consts
ldr PUT_BUFFER, [x0, #0x10] ldr PUT_BUFFER, [x0, #0x10]