"ARM"="Arm", "NEON"="Neon"
Refer to:
https://www.arm.com/company/policies/trademarks/arm-trademark-list/arm-trademark
https://www.arm.com/company/policies/trademarks/arm-trademark-list/neon-trademark

NOTE: These changes are only applied to change log entries for 2.0.x and later, since the change log is a historical record and Arm's new trademark policy did not go into effect until late 2017.
28
BUILDING.md
@@ -398,8 +398,8 @@ located (usually **/usr/bin**.) Next, execute the following commands:
 Building libjpeg-turbo for iOS
 ------------------------------

-iOS platforms, such as the iPhone and iPad, use ARM processors, and all
-currently supported models include NEON instructions. Thus, they can take
+iOS platforms, such as the iPhone and iPad, use Arm processors, and all
+currently supported models include Neon instructions. Thus, they can take
 advantage of libjpeg-turbo's SIMD extensions to significantly accelerate JPEG
 compression/decompression. This section describes how to build libjpeg-turbo
 for these platforms.
@@ -412,7 +412,7 @@ for these platforms.
 it should be installed in your `PATH`.


-### ARMv7 (32-bit)
+### Armv7 (32-bit)

 **gas-preprocessor.pl required**

@@ -465,7 +465,7 @@ Same as above, but replace the first line with:
 make


-### ARMv7s (32-bit)
+### Armv7s (32-bit)

 **gas-preprocessor.pl required**

@@ -493,13 +493,13 @@ iPhone 5/iPad 4th Generation and newer:

 #### Xcode 5 and later (Clang)

-Same as the ARMv7 build procedure for Xcode 5 and later, except replace the
+Same as the Armv7 build procedure for Xcode 5 and later, except replace the
 compiler flags as follows:

 export CFLAGS="-Wall -mfloat-abi=softfp -arch armv7s -miphoneos-version-min=6.0"


-### ARMv8 (64-bit)
+### Armv8 (64-bit)

 **gas-preprocessor.pl required if using Xcode < 6**

@@ -523,7 +523,7 @@ iPhone 5S/iPad Mini 2/iPad Air and newer.
 [additional CMake flags] {source_directory}
 make

-Once built, lipo can be used to combine the ARMv7, v7s, and/or v8 variants into
+Once built, lipo can be used to combine the Armv7, v7s, and/or v8 variants into
 a universal library.


@@ -534,7 +534,7 @@ Building libjpeg-turbo for Android platforms requires v13b or later of the
 [Android NDK](https://developer.android.com/tools/sdk/ndk).


-### ARMv7 (32-bit)
+### Armv7 (32-bit)

 The following is a general recipe script that can be modified for your specific
 needs.
@@ -559,7 +559,7 @@ needs.
 make


-### ARMv8 (64-bit)
+### Armv8 (64-bit)

 The following is a general recipe script that can be modified for your specific
 needs.
@@ -742,21 +742,21 @@ must be built on OS X 10.6 or later.

 make udmg

-This creates a Mac package/disk image that contains universal x86-64/i386/ARM
+This creates a Mac package/disk image that contains universal x86-64/i386/Arm
 binaries. The following CMake variables control which architectures are
 included in the universal binaries. Setting any of these variables to an empty
 string excludes that architecture from the package.

 * `OSX_32BIT_BUILD`: Directory containing an i386 (32-bit) Mac build of
 libjpeg-turbo (default: *{source_directory}*/osxx86)
-* `IOS_ARMV7_BUILD`: Directory containing an ARMv7 (32-bit) iOS build of
+* `IOS_ARMV7_BUILD`: Directory containing an Armv7 (32-bit) iOS build of
 libjpeg-turbo (default: *{source_directory}*/iosarmv7)
-* `IOS_ARMV7S_BUILD`: Directory containing an ARMv7s (32-bit) iOS build of
+* `IOS_ARMV7S_BUILD`: Directory containing an Armv7s (32-bit) iOS build of
 libjpeg-turbo (default: *{source_directory}*/iosarmv7s)
-* `IOS_ARMV8_BUILD`: Directory containing an ARMv8 (64-bit) iOS build of
+* `IOS_ARMV8_BUILD`: Directory containing an Armv8 (64-bit) iOS build of
 libjpeg-turbo (default: *{source_directory}*/iosarmv8)

-You should first use CMake to configure i386, ARMv7, ARMv7s, and/or ARMv8
+You should first use CMake to configure i386, Armv7, Armv7s, and/or Armv8
 sub-builds of libjpeg-turbo (see "Build Recipes" and "Building libjpeg-turbo
 for iOS" above) in build directories that match those specified in the
 aforementioned CMake variables. Next, configure the primary build of
@@ -20,8 +20,8 @@ with `jpeg_skip_scanlines()`, and the issues could not readily be fixed.
 - Fixed an issue whereby `jpeg_skip_scanlines()` always returned 0 when
 skipping past the end of an image.

-3. The ARM 64-bit (ARMv8) NEON SIMD extensions can now be built using MinGW
-toolchains targetting ARM64 (AArch64) Windows binaries.
+3. The Arm 64-bit (Armv8) Neon SIMD extensions can now be built using MinGW
+toolchains targetting Arm64 (AArch64) Windows binaries.

 4. Fixed unexpected visual artifacts that occurred when using
 `jpeg_crop_scanline()` and interblock smoothing while decompressing only the DC
@@ -94,7 +94,7 @@ other user-visible errant behavior, and given that the lossless transformer
 (unlike the decompressor) is not generally exposed to arbitrary data exploits,
 this issue did not likely pose a security risk.

-6. The ARM 64-bit (ARMv8) NEON SIMD assembly code now stores constants in a
+6. The Arm 64-bit (Armv8) Neon SIMD assembly code now stores constants in a
 separate read-only data section rather than in the text section, to support
 execute-only memory layouts.

@@ -380,7 +380,7 @@ algorithm that caused incorrect dithering in the output image. This algorithm
 now produces bitwise-identical results to the unmerged algorithms.

 12. The SIMD function symbols for x86[-64]/ELF, MIPS/ELF, macOS/x86[-64] (if
-libjpeg-turbo is built with YASM), and iOS/ARM[64] builds are now private.
+libjpeg-turbo is built with YASM), and iOS/Arm[64] builds are now private.
 This prevents those symbols from being exposed in applications or shared
 libraries that link statically with libjpeg-turbo.

@@ -2,7 +2,7 @@ Background
 ==========

 libjpeg-turbo is a JPEG image codec that uses SIMD instructions to accelerate
-baseline JPEG compression and decompression on x86, x86-64, ARM, PowerPC, and
+baseline JPEG compression and decompression on x86, x86-64, Arm, PowerPC, and
 MIPS systems, as well as progressive JPEG compression on x86 and x86-64
 systems. On such systems, libjpeg-turbo is generally 2-6x as fast as libjpeg,
 all else being equal. On other types of systems, libjpeg-turbo can still
@@ -137,13 +137,13 @@ set(OSX_32BIT_BUILD ${DEFAULT_OSX_32BIT_BUILD} CACHE PATH
 "Directory containing 32-bit (i386) Mac build to include in universal binaries (default: ${DEFAULT_OSX_32BIT_BUILD})")
 set(DEFAULT_IOS_ARMV7_BUILD ${CMAKE_SOURCE_DIR}/iosarmv7)
 set(IOS_ARMV7_BUILD ${DEFAULT_IOS_ARMV7_BUILD} CACHE PATH
-"Directory containing ARMv7 iOS build to include in universal binaries (default: ${DEFAULT_IOS_ARMV7_BUILD})")
+"Directory containing Armv7 iOS build to include in universal binaries (default: ${DEFAULT_IOS_ARMV7_BUILD})")
 set(DEFAULT_IOS_ARMV7S_BUILD ${CMAKE_SOURCE_DIR}/iosarmv7s)
 set(IOS_ARMV7S_BUILD ${DEFAULT_IOS_ARMV7S_BUILD} CACHE PATH
-"Directory containing ARMv7s iOS build to include in universal binaries (default: ${DEFAULT_IOS_ARMV7S_BUILD})")
+"Directory containing Armv7s iOS build to include in universal binaries (default: ${DEFAULT_IOS_ARMV7S_BUILD})")
 set(DEFAULT_IOS_ARMV8_BUILD ${CMAKE_SOURCE_DIR}/iosarmv8)
 set(IOS_ARMV8_BUILD ${DEFAULT_IOS_ARMV8_BUILD} CACHE PATH
-"Directory containing ARMv8 iOS build to include in universal binaries (default: ${DEFAULT_IOS_ARMV8_BUILD})")
+"Directory containing Armv8 iOS build to include in universal binaries (default: ${DEFAULT_IOS_ARMV8_BUILD})")

 set(OSX_APP_CERT_NAME "" CACHE STRING
 "Name of the Developer ID Application certificate (in the macOS keychain) that should be used to sign the libjpeg-turbo DMG. Leave this blank to generate an unsigned DMG.")
4
jchuff.c
@@ -34,10 +34,10 @@
 * memory footprint by 64k, which is important for some mobile applications
 * that create many isolated instances of libjpeg-turbo (web browsers, for
 * instance.) This may improve performance on some mobile platforms as well.
-* This feature is enabled by default only on ARM processors, because some x86
+* This feature is enabled by default only on Arm processors, because some x86
 * chips have a slow implementation of bsr, and the use of clz/bsr cannot be
 * shown to have a significant performance impact even on the x86 chips that
-* have a fast implementation of it. When building for ARMv6, you can
+* have a fast implementation of it. When building for Armv6, you can
 * explicitly disable the use of clz/bsr by adding -mthumb to the compiler
 * flags (this defines __thumb__).
 */
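The trade-off this comment describes (a clz/bsr instruction versus a memory-resident alternative for counting bits) can be illustrated with a small sketch. This is not the code in jchuff.c: `nbits_clz` and `nbits_loop` are hypothetical names, and the sketch assumes a GCC- or Clang-compatible compiler (for `__builtin_clz`) and a 32-bit `unsigned int`.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: number of bits needed to represent a nonzero value,
 * computed with the GCC/Clang count-leading-zeros builtin, which compilers
 * typically lower to clz on Arm or bsr/lzcnt on x86.
 * Assumes unsigned int is 32 bits wide. */
static int nbits_clz(uint32_t x)
{
  assert(x != 0);                 /* __builtin_clz(0) is undefined */
  return 32 - __builtin_clz(x);
}

/* Portable reference: shift right until the value is exhausted. */
static int nbits_loop(uint32_t x)
{
  int n = 0;

  while (x) {
    n++;
    x >>= 1;
  }
  return n;
}
```

Both functions return the same count for any nonzero input; the builtin version needs no table in memory, which is the footprint saving the comment refers to.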
@@ -43,10 +43,10 @@
 * memory footprint by 64k, which is important for some mobile applications
 * that create many isolated instances of libjpeg-turbo (web browsers, for
 * instance.) This may improve performance on some mobile platforms as well.
-* This feature is enabled by default only on ARM processors, because some x86
+* This feature is enabled by default only on Arm processors, because some x86
 * chips have a slow implementation of bsr, and the use of clz/bsr cannot be
 * shown to have a significant performance impact even on the x86 chips that
-* have a fast implementation of it. When building for ARMv6, you can
+* have a fast implementation of it. When building for Armv6, you can
 * explicitly disable the use of clz/bsr by adding -mthumb to the compiler
 * flags (this defines __thumb__).
 */
@@ -1,4 +1,4 @@
-libjpeg-turbo is a JPEG image codec that uses SIMD instructions to accelerate baseline JPEG compression and decompression on x86, x86-64, ARM, PowerPC, and MIPS systems, as well as progressive JPEG compression on x86 and x86-64 systems. On such systems, libjpeg-turbo is generally 2-6x as fast as libjpeg, all else being equal. On other types of systems, libjpeg-turbo can still outperform libjpeg by a significant amount, by virtue of its highly-optimized Huffman coding routines. In many cases, the performance of libjpeg-turbo rivals that of proprietary high-speed JPEG codecs.
+libjpeg-turbo is a JPEG image codec that uses SIMD instructions to accelerate baseline JPEG compression and decompression on x86, x86-64, Arm, PowerPC, and MIPS systems, as well as progressive JPEG compression on x86 and x86-64 systems. On such systems, libjpeg-turbo is generally 2-6x as fast as libjpeg, all else being equal. On other types of systems, libjpeg-turbo can still outperform libjpeg by a significant amount, by virtue of its highly-optimized Huffman coding routines. In many cases, the performance of libjpeg-turbo rivals that of proprietary high-speed JPEG codecs.

 libjpeg-turbo implements both the traditional libjpeg API as well as the less powerful but more straightforward TurboJPEG API. libjpeg-turbo also features colorspace extensions that allow it to compress from/decompress to 32-bit and big-endian pixel buffers (RGBX, XBGR, etc.), as well as a full-featured Java interface.

@@ -9,7 +9,7 @@ Homepage: @PKGURL@
 Installed-Size: {__SIZE}
 Description: A SIMD-accelerated JPEG codec that provides both the libjpeg and TurboJPEG APIs
 libjpeg-turbo is a JPEG image codec that uses SIMD instructions to accelerate
-baseline JPEG compression and decompression on x86, x86-64, ARM, PowerPC, and
+baseline JPEG compression and decompression on x86, x86-64, Arm, PowerPC, and
 MIPS systems, as well as progressive JPEG compression on x86 and x86-64
 systems. On such systems, libjpeg-turbo is generally 2-6x as fast as libjpeg,
 all else being equal. On other types of systems, libjpeg-turbo can still
@@ -223,15 +223,15 @@ install_ios()
 }

 if [ $UNIVERSAL = 1 -a "$BUILDDIRARMV7" != "" ]; then
-install_ios $BUILDDIRARMV7 ARMv7 armv7 arm
+install_ios $BUILDDIRARMV7 Armv7 armv7 arm
 fi

 if [ $UNIVERSAL = 1 -a "$BUILDDIRARMV7S" != "" ]; then
-install_ios $BUILDDIRARMV7S ARMv7s armv7s arm
+install_ios $BUILDDIRARMV7S Armv7s armv7s arm
 fi

 if [ $UNIVERSAL = 1 -a "$BUILDDIRARMV8" != "" ]; then
-install_ios $BUILDDIRARMV8 ARMv8 armv8 arm64
+install_ios $BUILDDIRARMV8 Armv8 armv8 arm64
 fi

 install_name_tool -id $LIBDIR/$LIBJPEG_DSO_NAME $PKGROOT/$LIBDIR/$LIBJPEG_DSO_NAME
@@ -52,7 +52,7 @@ Provides: %{name} = %{version}-%{release}, @CMAKE_PROJECT_NAME@ = %{version}-%{r

 %description
 libjpeg-turbo is a JPEG image codec that uses SIMD instructions to accelerate
-baseline JPEG compression and decompression on x86, x86-64, ARM, PowerPC, and
+baseline JPEG compression and decompression on x86, x86-64, Arm, PowerPC, and
 MIPS systems, as well as progressive JPEG compression on x86 and x86-64
 systems. On such systems, libjpeg-turbo is generally 2-6x as fast as libjpeg,
 all else being equal. On other types of systems, libjpeg-turbo can still
@@ -205,7 +205,7 @@ endif()


 ###############################################################################
-# ARM (GAS)
+# Arm (GAS)
 ###############################################################################

 elseif(CPU_TYPE STREQUAL "arm64" OR CPU_TYPE STREQUAL "arm")
@@ -13,7 +13,7 @@
 *
 * This file contains the interface between the "normal" portions
 * of the library and the SIMD implementations when running on a
-* 32-bit ARM architecture.
+* 32-bit Arm architecture.
 */

 #define JPEG_INTERNALS
@@ -118,7 +118,7 @@ init_simd(void)
 #if defined(__ARM_NEON__)
 simd_support |= JSIMD_NEON;
 #elif defined(__linux__) || defined(ANDROID) || defined(__ANDROID__)
-/* We still have a chance to use NEON regardless of globally used
+/* We still have a chance to use Neon regardless of globally used
 * -mcpu/-mfpu options passed to gcc by performing runtime detection via
 * /proc/cpuinfo parsing on linux/android */
 while (!parse_proc_cpuinfo(bufsize)) {
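The runtime detection mentioned in this comment can be sketched as follows. This is an illustration only, not the library's `parse_proc_cpuinfo()`: the helper name `cpuinfo_text_has_neon` is hypothetical, it works on text already read into memory rather than opening /proc/cpuinfo, and it assumes the 32-bit Arm Linux convention of a `Features` line containing a `neon` token.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper (not the library's parser): return 1 if a "Features"
 * line in /proc/cpuinfo-style text lists the exact token "neon", else 0. */
static int cpuinfo_text_has_neon(const char *text)
{
  const char *line = text;

  assert(text != NULL);
  while (*line) {
    const char *end = strchr(line, '\n');
    size_t len = end ? (size_t)(end - line) : strlen(line);

    if (len >= 8 && strncmp(line, "Features", 8) == 0) {
      const char *p = line + 8, *stop = line + len;

      while (p < stop) {
        const char *tok;

        /* Skip the separators that precede each token. */
        while (p < stop && (*p == ' ' || *p == '\t' || *p == ':'))
          p++;
        tok = p;
        /* Advance to the end of the token. */
        while (p < stop && *p != ' ' && *p != '\t')
          p++;
        if (p - tok == 4 && strncmp(tok, "neon", 4) == 0)
          return 1;
      }
    }
    line = end ? end + 1 : line + len;
  }
  return 0;
}
```

Matching whole tokens rather than substrings avoids false positives from unrelated words that merely contain "neon".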
@@ -1,5 +1,5 @@
 /*
-* ARMv7 NEON optimizations for libjpeg-turbo
+* Armv7 Neon optimizations for libjpeg-turbo
 *
 * Copyright (C) 2009-2011, Nokia Corporation and/or its subsidiary(-ies).
 * All Rights Reserved.
@@ -229,7 +229,7 @@ asm_function jsimd_idct_islow_neon
 ROW7L .req d30
 ROW7R .req d31

-/* Load and dequantize coefficients into NEON registers
+/* Load and dequantize coefficients into Neon registers
 * with the following allocation:
 * 0 1 2 3 | 4 5 6 7
 * ---------+--------
@@ -261,7 +261,7 @@ asm_function jsimd_idct_islow_neon
 vld1.16 {d0, d1, d2, d3}, [ip, :128] /* load constants */
 add ip, ip, #16
 vmul.s16 q15, q15, q3
-vpush {d8-d15} /* save NEON registers */
+vpush {d8-d15} /* save Neon registers */
 /* 1-D IDCT, pass 1, left 4x8 half */
 vadd.s16 d4, ROW7L, ROW3L
 vadd.s16 d5, ROW5L, ROW1L
@@ -507,7 +507,7 @@ asm_function jsimd_idct_islow_neon
 vqrshrn.s16 d17, q9, #2
 vqrshrn.s16 d18, q10, #2
 vqrshrn.s16 d19, q11, #2
-vpop {d8-d15} /* restore NEON registers */
+vpop {d8-d15} /* restore Neon registers */
 vqrshrn.s16 d20, q12, #2
 /* Transpose the final 8-bit samples and do signed->unsigned conversion */
 vtrn.16 q8, q9
@@ -688,7 +688,7 @@ asm_function jsimd_idct_islow_neon
 * function from jidctfst.c
 *
 * Normally 1-D AAN DCT needs 5 multiplications and 29 additions.
-* But in ARM NEON case some extra additions are required because VQDMULH
+* But in Arm Neon case some extra additions are required because VQDMULH
 * instruction can't handle the constants larger than 1. So the expressions
 * like "x * 1.082392200" have to be converted to "x * 0.082392200 + x",
 * which introduces an extra addition. Overall, there are 6 extra additions
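The rewrite of "x * 1.082392200" as "x * 0.082392200 + x" described in this comment can be checked with a scalar model of one 16-bit lane. This is a sketch under stated assumptions, not the assembly's actual arithmetic: `sat_doubling_mulh` and `mul_1_082392200` are hypothetical names, and the Q15 constant 2700 is simply round(0.082392200 * 32768), which may differ from the constants the library uses.

```c
#include <assert.h>
#include <stdint.h>

/* Scalar model of one lane of a signed saturating doubling multiply that
 * returns the high half: (2 * a * b) >> 16, saturated to the int16_t range.
 * The second operand acts as a Q15 fraction in [-1, 1), which is why a
 * constant larger than 1 cannot be encoded directly. */
static int16_t sat_doubling_mulh(int16_t a, int16_t b)
{
  int64_t p = 2 * (int64_t)a * (int64_t)b;
  int64_t hi = p >> 16;   /* assumes arithmetic right shift for negative p */

  if (hi > INT16_MAX)     /* only reachable when a == b == -32768 */
    hi = INT16_MAX;
  return (int16_t)hi;
}

/* Illustrative Q15 constant: round(0.082392200 * 32768). An assumption made
 * for this sketch, not a value taken from the source. */
#define FRAC_0_082392200  2700

/* x * 1.082392200 computed as x * 0.082392200 + x: one multiply by the
 * fractional part plus the extra addition the comment mentions. */
static int16_t mul_1_082392200(int16_t x)
{
  int32_t sum = (int32_t)x + sat_doubling_mulh(x, FRAC_0_082392200);

  assert(sum >= INT16_MIN && sum <= INT16_MAX);  /* caller keeps x in range */
  return (int16_t)sum;
}
```

For x = 1000 the fractional multiply contributes 82, so the model yields 1082, matching 1000 * 1.082392200 truncated to an integer.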
@@ -718,7 +718,7 @@ asm_function jsimd_idct_ifast_neon
 TMP3 .req r2
 TMP4 .req ip

-/* Load and dequantize coefficients into NEON registers
+/* Load and dequantize coefficients into Neon registers
 * with the following allocation:
 * 0 1 2 3 | 4 5 6 7
 * ---------+--------
@@ -749,7 +749,7 @@ asm_function jsimd_idct_ifast_neon
 vmul.s16 q13, q13, q1
 vld1.16 {d0}, [ip, :64] /* load constants */
 vmul.s16 q15, q15, q3
-vpush {d8-d13} /* save NEON registers */
+vpush {d8-d13} /* save Neon registers */
 /* 1-D IDCT, pass 1 */
 vsub.s16 q2, q10, q14
 vadd.s16 q14, q10, q14
@@ -842,7 +842,7 @@ asm_function jsimd_idct_ifast_neon
 vadd.s16 q14, q5, q3
 vsub.s16 q9, q5, q3
 vsub.s16 q13, q10, q2
-vpop {d8-d13} /* restore NEON registers */
+vpop {d8-d13} /* restore Neon registers */
 vadd.s16 q10, q10, q2
 vsub.s16 q11, q12, q1
 vadd.s16 q12, q12, q1
@@ -913,7 +913,7 @@ asm_function jsimd_idct_ifast_neon
 *
 * NOTE: jpeg-8 has an improved implementation of 4x4 inverse-DCT, which
 * requires much less arithmetic operations and hence should be faster.
-* The primary purpose of this particular NEON optimized function is
+* The primary purpose of this particular Neon optimized function is
 * bit exact compatibility with jpeg-6b.
 *
 * TODO: a bit better instructions scheduling can be achieved by expanding
@@ -1016,7 +1016,7 @@ asm_function jsimd_idct_4x4_neon
 adr TMP4, jsimd_idct_4x4_neon_consts
 vld1.16 {d0, d1, d2, d3}, [TMP4, :128]

-/* Load all COEF_BLOCK into NEON registers with the following allocation:
+/* Load all COEF_BLOCK into Neon registers with the following allocation:
 * 0 1 2 3 | 4 5 6 7
 * ---------+--------
 * 0 | d4 | d5
@@ -1126,7 +1126,7 @@ asm_function jsimd_idct_4x4_neon
 *
 * NOTE: jpeg-8 has an improved implementation of 2x2 inverse-DCT, which
 * requires much less arithmetic operations and hence should be faster.
-* The primary purpose of this particular NEON optimized function is
+* The primary purpose of this particular Neon optimized function is
 * bit exact compatibility with jpeg-6b.
 */

@@ -1173,7 +1173,7 @@ asm_function jsimd_idct_2x2_neon
 adr TMP2, jsimd_idct_2x2_neon_consts
 vld1.16 {d0}, [TMP2, :64]

-/* Load all COEF_BLOCK into NEON registers with the following allocation:
+/* Load all COEF_BLOCK into Neon registers with the following allocation:
 * 0 1 2 3 | 4 5 6 7
 * ---------+--------
 * 0 | d4 | d5
@@ -1499,7 +1499,7 @@ asm_function jsimd_ycc_\colorid\()_convert_neon
 adr ip, jsimd_ycc_\colorid\()_neon_consts
 vld1.16 {d0, d1, d2, d3}, [ip, :128]

-/* Save ARM registers and handle input arguments */
+/* Save Arm registers and handle input arguments */
 push {r4, r5, r6, r7, r8, r9, r10, lr}
 ldr NUM_ROWS, [sp, #(4 * 8)]
 ldr INPUT_BUF0, [INPUT_BUF]
@@ -1507,7 +1507,7 @@ asm_function jsimd_ycc_\colorid\()_convert_neon
 ldr INPUT_BUF2, [INPUT_BUF, #8]
 .unreq INPUT_BUF

-/* Save NEON registers */
+/* Save Neon registers */
 vpush {d8-d15}

 /* Initially set d10, d11, d12, d13 to 0xFF */
@@ -1814,7 +1814,7 @@ asm_function jsimd_\colorid\()_ycc_convert_neon
 adr ip, jsimd_\colorid\()_ycc_neon_consts
 vld1.16 {d0, d1, d2, d3}, [ip, :128]

-/* Save ARM registers and handle input arguments */
+/* Save Arm registers and handle input arguments */
 push {r4, r5, r6, r7, r8, r9, r10, lr}
 ldr NUM_ROWS, [sp, #(4 * 8)]
 ldr OUTPUT_BUF0, [OUTPUT_BUF]
@@ -1822,7 +1822,7 @@ asm_function jsimd_\colorid\()_ycc_convert_neon
 ldr OUTPUT_BUF2, [OUTPUT_BUF, #8]
 .unreq OUTPUT_BUF

-/* Save NEON registers */
+/* Save Neon registers */
 vpush {d8-d15}

 /* Outer loop over scanlines */
@@ -2017,7 +2017,7 @@ asm_function jsimd_fdct_ifast_neon
 adr TMP, jsimd_fdct_ifast_neon_consts
 vld1.16 {d0}, [TMP, :64]

-/* Load all DATA into NEON registers with the following allocation:
+/* Load all DATA into Neon registers with the following allocation:
 * 0 1 2 3 | 4 5 6 7
 * ---------+--------
 * 0 | d16 | d17 | q8
@@ -2112,8 +2112,8 @@ asm_function jsimd_fdct_ifast_neon
 *
 * Note: the code uses 2 stage pipelining in order to improve instructions
 * scheduling and eliminate stalls (this provides ~15% better
-* performance for this function on both ARM Cortex-A8 and
-* ARM Cortex-A9 when compared to the non-pipelined variant).
+* performance for this function on both Arm Cortex-A8 and
+* Arm Cortex-A9 when compared to the non-pipelined variant).
 * The instructions which belong to the second stage use different
 * indentation for better readiability.
 */
@@ -12,7 +12,7 @@
 *
 * This file contains the interface between the "normal" portions
 * of the library and the SIMD implementations when running on a
-* 64-bit ARM architecture.
+* 64-bit Arm architecture.
 */

 #define JPEG_INTERNALS
@@ -114,8 +114,8 @@ parse_proc_cpuinfo(int bufsize)
 */

 /*
-* ARMv8 architectures support NEON extensions by default.
-* It is no longer optional as it was with ARMv7.
+* Armv8 architectures support Neon extensions by default.
+* It is no longer optional as it was with Armv7.
 */


@@ -1,5 +1,5 @@
 /*
-* ARMv8 NEON optimizations for libjpeg-turbo
+* Armv8 Neon optimizations for libjpeg-turbo
 *
 * Copyright (C) 2009-2011, Nokia Corporation and/or its subsidiary(-ies).
 * All Rights Reserved.
@@ -611,7 +611,7 @@ asm_function jsimd_idct_islow_neon
 shrn2 v5.8h, v15.4s, #16 /* wsptr[DCTSIZE*3] = (int)DESCALE(tmp13 + tmp0, CONST_BITS+PASS1_BITS+3) */
 shrn2 v6.8h, v17.4s, #16 /* wsptr[DCTSIZE*4] = (int)DESCALE(tmp13 - tmp0, CONST_BITS+PASS1_BITS+3) */
 movi v0.16b, #(CENTERJSAMPLE)
-/* Prepare pointers (dual-issue with NEON instructions) */
+/* Prepare pointers (dual-issue with Neon instructions) */
 ldp TMP1, TMP2, [OUTPUT_BUF], 16
 sqrshrn v28.8b, v2.8h, #(CONST_BITS+PASS1_BITS+3-16)
 ldp TMP3, TMP4, [OUTPUT_BUF], 16
@@ -992,7 +992,7 @@ asm_function jsimd_idct_islow_neon
 * function from jidctfst.c
 *
 * Normally 1-D AAN DCT needs 5 multiplications and 29 additions.
-* But in ARM NEON case some extra additions are required because VQDMULH
+* But in Arm Neon case some extra additions are required because VQDMULH
 * instruction can't handle the constants larger than 1. So the expressions
 * like "x * 1.082392200" have to be converted to "x * 0.082392200 + x",
 * which introduces an extra addition. Overall, there are 6 extra additions
@@ -1024,7 +1024,7 @@ asm_function jsimd_idct_ifast_neon
 instruction ensures that those bits are set to zero. */
 uxtw x3, w3

-/* Load and dequantize coefficients into NEON registers
+/* Load and dequantize coefficients into Neon registers
 * with the following allocation:
 * 0 1 2 3 | 4 5 6 7
 * ---------+--------
@@ -1037,7 +1037,7 @@ asm_function jsimd_idct_ifast_neon
 * 6 | d28 | d29 ( v22.8h )
 * 7 | d30 | d31 ( v23.8h )
 */
-/* Save NEON registers used in fast IDCT */
+/* Save Neon registers used in fast IDCT */
 get_symbol_loc TMP5, Ljsimd_idct_ifast_neon_consts
 ld1 {v16.8h, v17.8h}, [COEF_BLOCK], 32
 ld1 {v0.8h, v1.8h}, [DCT_TABLE], 32
@@ -1142,7 +1142,7 @@ asm_function jsimd_idct_ifast_neon
 add v20.8h, v20.8h, v1.8h
 /* Descale to 8-bit and range limit */
 movi v0.16b, #0x80
-/* Prepare pointers (dual-issue with NEON instructions) */
+/* Prepare pointers (dual-issue with Neon instructions) */
 ldp TMP1, TMP2, [OUTPUT_BUF], 16
 sqshrn v28.8b, v16.8h, #5
 ldp TMP3, TMP4, [OUTPUT_BUF], 16
@@ -1221,7 +1221,7 @@ asm_function jsimd_idct_ifast_neon
 *
 * NOTE: jpeg-8 has an improved implementation of 4x4 inverse-DCT, which
 * requires much less arithmetic operations and hence should be faster.
-* The primary purpose of this particular NEON optimized function is
+* The primary purpose of this particular Neon optimized function is
 * bit exact compatibility with jpeg-6b.
 *
 * TODO: a bit better instructions scheduling can be achieved by expanding
@@ -1291,7 +1291,7 @@ asm_function jsimd_idct_4x4_neon
 instruction ensures that those bits are set to zero. */
 uxtw x3, w3

-/* Save all used NEON registers */
+/* Save all used Neon registers */
 sub sp, sp, 64
 mov x9, sp
 /* Load constants (v3.4h is just used for padding) */
@@ -1300,7 +1300,7 @@ asm_function jsimd_idct_4x4_neon
 st1 {v12.8b, v13.8b, v14.8b, v15.8b}, [x9], 32
 ld1 {v0.4h, v1.4h, v2.4h, v3.4h}, [TMP4]

-/* Load all COEF_BLOCK into NEON registers with the following allocation:
+/* Load all COEF_BLOCK into Neon registers with the following allocation:
 * 0 1 2 3 | 4 5 6 7
 * ---------+--------
 * 0 | v4.4h | v5.4h
@@ -1434,7 +1434,7 @@ asm_function jsimd_idct_4x4_neon
 *
 * NOTE: jpeg-8 has an improved implementation of 2x2 inverse-DCT, which
 * requires much less arithmetic operations and hence should be faster.
-* The primary purpose of this particular NEON optimized function is
+* The primary purpose of this particular Neon optimized function is
 * bit exact compatibility with jpeg-6b.
 */

@@ -1483,7 +1483,7 @@ asm_function jsimd_idct_2x2_neon
 st1 {v12.8b, v13.8b, v14.8b, v15.8b}, [x9], 32
 ld1 {v14.4h}, [TMP2]

-/* Load all COEF_BLOCK into NEON registers with the following allocation:
+/* Load all COEF_BLOCK into Neon registers with the following allocation:
 * 0 1 2 3 | 4 5 6 7
 * ---------+--------
 * 0 | v4.4h | v5.4h
@@ -1857,7 +1857,7 @@ asm_function jsimd_ycc_\colorid\()_convert_neon_slowst3
 /* Load constants to d1, d2, d3 (v0.4h is just used for padding) */
 get_symbol_loc x15, Ljsimd_ycc_rgb_neon_consts

-/* Save NEON registers */
+/* Save Neon registers */
 st1 {v8.8b, v9.8b, v10.8b, v11.8b}, [x9], 32
 st1 {v12.8b, v13.8b, v14.8b, v15.8b}, [x9], 32
 ld1 {v0.4h, v1.4h}, [x15], 16
@@ -2142,7 +2142,7 @@ generate_jsimd_ycc_rgb_convert_neon extbgr, 24, 2, .4h, 1, .4h, 0, .4h, .8b,
 .endm

 /* TODO: expand macros and interleave instructions if some in-order
-* ARM64 processor actually can dual-issue LOAD/STORE with ALU */
+* AArch64 processor actually can dual-issue LOAD/STORE with ALU */
 .macro do_rgb_to_yuv_stage2_store_load_stage1 fast_ld3
 do_rgb_to_yuv_stage2
 do_load \bpp, 8, \fast_ld3
@@ -2182,7 +2182,7 @@ asm_function jsimd_\colorid\()_ycc_convert_neon_slowld3
 ldr OUTPUT_BUF2, [OUTPUT_BUF, #16]
 .unreq OUTPUT_BUF

-/* Save NEON registers */
+/* Save Neon registers */
 sub sp, sp, #64
 mov x9, sp
 st1 {v8.8b, v9.8b, v10.8b, v11.8b}, [x9], 32
@@ -2396,13 +2396,13 @@ asm_function jsimd_fdct_islow_neon
 get_symbol_loc TMP, Ljsimd_fdct_islow_neon_consts
 ld1 {v0.8h, v1.8h}, [TMP]

-/* Save NEON registers */
+/* Save Neon registers */
 sub sp, sp, #64
 mov x10, sp
 st1 {v8.8b, v9.8b, v10.8b, v11.8b}, [x10], 32
 st1 {v12.8b, v13.8b, v14.8b, v15.8b}, [x10], 32

-/* Load all DATA into NEON registers with the following allocation:
+/* Load all DATA into Neon registers with the following allocation:
 * 0 1 2 3 | 4 5 6 7
 * ---------+--------
 * 0 | d16 | d17 | v16.8h
@@ -2629,7 +2629,7 @@ asm_function jsimd_fdct_islow_neon
 st1 {v16.8h, v17.8h, v18.8h, v19.8h}, [DATA], 64
 st1 {v20.8h, v21.8h, v22.8h, v23.8h}, [DATA]

-/* Restore NEON registers */
+/* Restore Neon registers */
 ld1 {v8.8b, v9.8b, v10.8b, v11.8b}, [sp], 32
 ld1 {v12.8b, v13.8b, v14.8b, v15.8b}, [sp], 32

@@ -2681,7 +2681,7 @@ asm_function jsimd_fdct_ifast_neon
 get_symbol_loc TMP, Ljsimd_fdct_ifast_neon_consts
 ld1 {v0.4h}, [TMP]

-/* Load all DATA into NEON registers with the following allocation:
+/* Load all DATA into Neon registers with the following allocation:
 * 0 1 2 3 | 4 5 6 7
 * ---------+--------
 * 0 | d16 | d17 | v0.8h
@@ -3066,7 +3066,7 @@ asm_function jsimd_huff_encode_one_block_neon_slowtbl
 .endif
 sub sp, sp, 272
 sub BUFFER, BUFFER, #0x1 /* BUFFER=buffer-- */
-/* Save ARM registers */
+/* Save Arm registers */
 stp x19, x20, [sp]
 get_symbol_loc x15, Ljsimd_huff_encode_one_block_neon_consts
 ldr PUT_BUFFER, [x0, #0x10]