"ARM"="Arm", "NEON"="Neon"

Refer to:
https://www.arm.com/company/policies/trademarks/arm-trademark-list/arm-trademark
https://www.arm.com/company/policies/trademarks/arm-trademark-list/neon-trademark

NOTE: These changes are only applied to change log entries for 2.0.x and
later, since the change log is a historical record and Arm's new
trademark policy did not go into effect until late 2017.
This commit is contained in:
DRC
2020-10-15 17:47:31 -05:00
parent b5a1472781
commit 1ed312eab6
15 changed files with 76 additions and 76 deletions

View File

@@ -398,8 +398,8 @@ located (usually **/usr/bin**.) Next, execute the following commands:
Building libjpeg-turbo for iOS
------------------------------
iOS platforms, such as the iPhone and iPad, use ARM processors, and all
currently supported models include NEON instructions. Thus, they can take
iOS platforms, such as the iPhone and iPad, use Arm processors, and all
currently supported models include Neon instructions. Thus, they can take
advantage of libjpeg-turbo's SIMD extensions to significantly accelerate JPEG
compression/decompression. This section describes how to build libjpeg-turbo
for these platforms.
@@ -412,7 +412,7 @@ for these platforms.
it should be installed in your `PATH`.
### ARMv7 (32-bit)
### Armv7 (32-bit)
**gas-preprocessor.pl required**
@@ -465,7 +465,7 @@ Same as above, but replace the first line with:
make
### ARMv7s (32-bit)
### Armv7s (32-bit)
**gas-preprocessor.pl required**
@@ -493,13 +493,13 @@ iPhone 5/iPad 4th Generation and newer:
#### Xcode 5 and later (Clang)
Same as the ARMv7 build procedure for Xcode 5 and later, except replace the
Same as the Armv7 build procedure for Xcode 5 and later, except replace the
compiler flags as follows:
export CFLAGS="-Wall -mfloat-abi=softfp -arch armv7s -miphoneos-version-min=6.0"
### ARMv8 (64-bit)
### Armv8 (64-bit)
**gas-preprocessor.pl required if using Xcode < 6**
@@ -523,7 +523,7 @@ iPhone 5S/iPad Mini 2/iPad Air and newer.
[additional CMake flags] {source_directory}
make
Once built, lipo can be used to combine the ARMv7, v7s, and/or v8 variants into
Once built, lipo can be used to combine the Armv7, v7s, and/or v8 variants into
a universal library.
@@ -534,7 +534,7 @@ Building libjpeg-turbo for Android platforms requires v13b or later of the
[Android NDK](https://developer.android.com/tools/sdk/ndk).
### ARMv7 (32-bit)
### Armv7 (32-bit)
The following is a general recipe script that can be modified for your specific
needs.
@@ -559,7 +559,7 @@ needs.
make
### ARMv8 (64-bit)
### Armv8 (64-bit)
The following is a general recipe script that can be modified for your specific
needs.
@@ -742,21 +742,21 @@ must be built on OS X 10.6 or later.
make udmg
This creates a Mac package/disk image that contains universal x86-64/i386/ARM
This creates a Mac package/disk image that contains universal x86-64/i386/Arm
binaries. The following CMake variables control which architectures are
included in the universal binaries. Setting any of these variables to an empty
string excludes that architecture from the package.
* `OSX_32BIT_BUILD`: Directory containing an i386 (32-bit) Mac build of
libjpeg-turbo (default: *{source_directory}*/osxx86)
* `IOS_ARMV7_BUILD`: Directory containing an ARMv7 (32-bit) iOS build of
* `IOS_ARMV7_BUILD`: Directory containing an Armv7 (32-bit) iOS build of
libjpeg-turbo (default: *{source_directory}*/iosarmv7)
* `IOS_ARMV7S_BUILD`: Directory containing an ARMv7s (32-bit) iOS build of
* `IOS_ARMV7S_BUILD`: Directory containing an Armv7s (32-bit) iOS build of
libjpeg-turbo (default: *{source_directory}*/iosarmv7s)
* `IOS_ARMV8_BUILD`: Directory containing an ARMv8 (64-bit) iOS build of
* `IOS_ARMV8_BUILD`: Directory containing an Armv8 (64-bit) iOS build of
libjpeg-turbo (default: *{source_directory}*/iosarmv8)
You should first use CMake to configure i386, ARMv7, ARMv7s, and/or ARMv8
You should first use CMake to configure i386, Armv7, Armv7s, and/or Armv8
sub-builds of libjpeg-turbo (see "Build Recipes" and "Building libjpeg-turbo
for iOS" above) in build directories that match those specified in the
aforementioned CMake variables. Next, configure the primary build of

View File

@@ -20,8 +20,8 @@ with `jpeg_skip_scanlines()`, and the issues could not readily be fixed.
- Fixed an issue whereby `jpeg_skip_scanlines()` always returned 0 when
skipping past the end of an image.
3. The ARM 64-bit (ARMv8) NEON SIMD extensions can now be built using MinGW
toolchains targetting ARM64 (AArch64) Windows binaries.
3. The Arm 64-bit (Armv8) Neon SIMD extensions can now be built using MinGW
toolchains targetting Arm64 (AArch64) Windows binaries.
4. Fixed unexpected visual artifacts that occurred when using
`jpeg_crop_scanline()` and interblock smoothing while decompressing only the DC
@@ -94,7 +94,7 @@ other user-visible errant behavior, and given that the lossless transformer
(unlike the decompressor) is not generally exposed to arbitrary data exploits,
this issue did not likely pose a security risk.
6. The ARM 64-bit (ARMv8) NEON SIMD assembly code now stores constants in a
6. The Arm 64-bit (Armv8) Neon SIMD assembly code now stores constants in a
separate read-only data section rather than in the text section, to support
execute-only memory layouts.
@@ -380,7 +380,7 @@ algorithm that caused incorrect dithering in the output image. This algorithm
now produces bitwise-identical results to the unmerged algorithms.
12. The SIMD function symbols for x86[-64]/ELF, MIPS/ELF, macOS/x86[-64] (if
libjpeg-turbo is built with YASM), and iOS/ARM[64] builds are now private.
libjpeg-turbo is built with YASM), and iOS/Arm[64] builds are now private.
This prevents those symbols from being exposed in applications or shared
libraries that link statically with libjpeg-turbo.

View File

@@ -2,7 +2,7 @@ Background
==========
libjpeg-turbo is a JPEG image codec that uses SIMD instructions to accelerate
baseline JPEG compression and decompression on x86, x86-64, ARM, PowerPC, and
baseline JPEG compression and decompression on x86, x86-64, Arm, PowerPC, and
MIPS systems, as well as progressive JPEG compression on x86 and x86-64
systems. On such systems, libjpeg-turbo is generally 2-6x as fast as libjpeg,
all else being equal. On other types of systems, libjpeg-turbo can still

View File

@@ -137,13 +137,13 @@ set(OSX_32BIT_BUILD ${DEFAULT_OSX_32BIT_BUILD} CACHE PATH
"Directory containing 32-bit (i386) Mac build to include in universal binaries (default: ${DEFAULT_OSX_32BIT_BUILD})")
set(DEFAULT_IOS_ARMV7_BUILD ${CMAKE_SOURCE_DIR}/iosarmv7)
set(IOS_ARMV7_BUILD ${DEFAULT_IOS_ARMV7_BUILD} CACHE PATH
"Directory containing ARMv7 iOS build to include in universal binaries (default: ${DEFAULT_IOS_ARMV7_BUILD})")
"Directory containing Armv7 iOS build to include in universal binaries (default: ${DEFAULT_IOS_ARMV7_BUILD})")
set(DEFAULT_IOS_ARMV7S_BUILD ${CMAKE_SOURCE_DIR}/iosarmv7s)
set(IOS_ARMV7S_BUILD ${DEFAULT_IOS_ARMV7S_BUILD} CACHE PATH
"Directory containing ARMv7s iOS build to include in universal binaries (default: ${DEFAULT_IOS_ARMV7S_BUILD})")
"Directory containing Armv7s iOS build to include in universal binaries (default: ${DEFAULT_IOS_ARMV7S_BUILD})")
set(DEFAULT_IOS_ARMV8_BUILD ${CMAKE_SOURCE_DIR}/iosarmv8)
set(IOS_ARMV8_BUILD ${DEFAULT_IOS_ARMV8_BUILD} CACHE PATH
"Directory containing ARMv8 iOS build to include in universal binaries (default: ${DEFAULT_IOS_ARMV8_BUILD})")
"Directory containing Armv8 iOS build to include in universal binaries (default: ${DEFAULT_IOS_ARMV8_BUILD})")
set(OSX_APP_CERT_NAME "" CACHE STRING
"Name of the Developer ID Application certificate (in the macOS keychain) that should be used to sign the libjpeg-turbo DMG. Leave this blank to generate an unsigned DMG.")

View File

@@ -34,10 +34,10 @@
* memory footprint by 64k, which is important for some mobile applications
* that create many isolated instances of libjpeg-turbo (web browsers, for
* instance.) This may improve performance on some mobile platforms as well.
* This feature is enabled by default only on ARM processors, because some x86
* This feature is enabled by default only on Arm processors, because some x86
* chips have a slow implementation of bsr, and the use of clz/bsr cannot be
* shown to have a significant performance impact even on the x86 chips that
* have a fast implementation of it. When building for ARMv6, you can
* have a fast implementation of it. When building for Armv6, you can
* explicitly disable the use of clz/bsr by adding -mthumb to the compiler
* flags (this defines __thumb__).
*/

View File

@@ -43,10 +43,10 @@
* memory footprint by 64k, which is important for some mobile applications
* that create many isolated instances of libjpeg-turbo (web browsers, for
* instance.) This may improve performance on some mobile platforms as well.
* This feature is enabled by default only on ARM processors, because some x86
* This feature is enabled by default only on Arm processors, because some x86
* chips have a slow implementation of bsr, and the use of clz/bsr cannot be
* shown to have a significant performance impact even on the x86 chips that
* have a fast implementation of it. When building for ARMv6, you can
* have a fast implementation of it. When building for Armv6, you can
* explicitly disable the use of clz/bsr by adding -mthumb to the compiler
* flags (this defines __thumb__).
*/

View File

@@ -1,4 +1,4 @@
libjpeg-turbo is a JPEG image codec that uses SIMD instructions to accelerate baseline JPEG compression and decompression on x86, x86-64, ARM, PowerPC, and MIPS systems, as well as progressive JPEG compression on x86 and x86-64 systems. On such systems, libjpeg-turbo is generally 2-6x as fast as libjpeg, all else being equal. On other types of systems, libjpeg-turbo can still outperform libjpeg by a significant amount, by virtue of its highly-optimized Huffman coding routines. In many cases, the performance of libjpeg-turbo rivals that of proprietary high-speed JPEG codecs.
libjpeg-turbo is a JPEG image codec that uses SIMD instructions to accelerate baseline JPEG compression and decompression on x86, x86-64, Arm, PowerPC, and MIPS systems, as well as progressive JPEG compression on x86 and x86-64 systems. On such systems, libjpeg-turbo is generally 2-6x as fast as libjpeg, all else being equal. On other types of systems, libjpeg-turbo can still outperform libjpeg by a significant amount, by virtue of its highly-optimized Huffman coding routines. In many cases, the performance of libjpeg-turbo rivals that of proprietary high-speed JPEG codecs.
libjpeg-turbo implements both the traditional libjpeg API as well as the less powerful but more straightforward TurboJPEG API. libjpeg-turbo also features colorspace extensions that allow it to compress from/decompress to 32-bit and big-endian pixel buffers (RGBX, XBGR, etc.), as well as a full-featured Java interface.

View File

@@ -9,7 +9,7 @@ Homepage: @PKGURL@
Installed-Size: {__SIZE}
Description: A SIMD-accelerated JPEG codec that provides both the libjpeg and TurboJPEG APIs
libjpeg-turbo is a JPEG image codec that uses SIMD instructions to accelerate
baseline JPEG compression and decompression on x86, x86-64, ARM, PowerPC, and
baseline JPEG compression and decompression on x86, x86-64, Arm, PowerPC, and
MIPS systems, as well as progressive JPEG compression on x86 and x86-64
systems. On such systems, libjpeg-turbo is generally 2-6x as fast as libjpeg,
all else being equal. On other types of systems, libjpeg-turbo can still

View File

@@ -223,15 +223,15 @@ install_ios()
}
if [ $UNIVERSAL = 1 -a "$BUILDDIRARMV7" != "" ]; then
install_ios $BUILDDIRARMV7 ARMv7 armv7 arm
install_ios $BUILDDIRARMV7 Armv7 armv7 arm
fi
if [ $UNIVERSAL = 1 -a "$BUILDDIRARMV7S" != "" ]; then
install_ios $BUILDDIRARMV7S ARMv7s armv7s arm
install_ios $BUILDDIRARMV7S Armv7s armv7s arm
fi
if [ $UNIVERSAL = 1 -a "$BUILDDIRARMV8" != "" ]; then
install_ios $BUILDDIRARMV8 ARMv8 armv8 arm64
install_ios $BUILDDIRARMV8 Armv8 armv8 arm64
fi
install_name_tool -id $LIBDIR/$LIBJPEG_DSO_NAME $PKGROOT/$LIBDIR/$LIBJPEG_DSO_NAME

View File

@@ -52,7 +52,7 @@ Provides: %{name} = %{version}-%{release}, @CMAKE_PROJECT_NAME@ = %{version}-%{r
%description
libjpeg-turbo is a JPEG image codec that uses SIMD instructions to accelerate
baseline JPEG compression and decompression on x86, x86-64, ARM, PowerPC, and
baseline JPEG compression and decompression on x86, x86-64, Arm, PowerPC, and
MIPS systems, as well as progressive JPEG compression on x86 and x86-64
systems. On such systems, libjpeg-turbo is generally 2-6x as fast as libjpeg,
all else being equal. On other types of systems, libjpeg-turbo can still

View File

@@ -205,7 +205,7 @@ endif()
###############################################################################
# ARM (GAS)
# Arm (GAS)
###############################################################################
elseif(CPU_TYPE STREQUAL "arm64" OR CPU_TYPE STREQUAL "arm")

View File

@@ -13,7 +13,7 @@
*
* This file contains the interface between the "normal" portions
* of the library and the SIMD implementations when running on a
* 32-bit ARM architecture.
* 32-bit Arm architecture.
*/
#define JPEG_INTERNALS
@@ -118,7 +118,7 @@ init_simd(void)
#if defined(__ARM_NEON__)
simd_support |= JSIMD_NEON;
#elif defined(__linux__) || defined(ANDROID) || defined(__ANDROID__)
/* We still have a chance to use NEON regardless of globally used
/* We still have a chance to use Neon regardless of globally used
* -mcpu/-mfpu options passed to gcc by performing runtime detection via
* /proc/cpuinfo parsing on linux/android */
while (!parse_proc_cpuinfo(bufsize)) {

View File

@@ -1,5 +1,5 @@
/*
* ARMv7 NEON optimizations for libjpeg-turbo
* Armv7 Neon optimizations for libjpeg-turbo
*
* Copyright (C) 2009-2011, Nokia Corporation and/or its subsidiary(-ies).
* All Rights Reserved.
@@ -229,7 +229,7 @@ asm_function jsimd_idct_islow_neon
ROW7L .req d30
ROW7R .req d31
/* Load and dequantize coefficients into NEON registers
/* Load and dequantize coefficients into Neon registers
* with the following allocation:
* 0 1 2 3 | 4 5 6 7
* ---------+--------
@@ -261,7 +261,7 @@ asm_function jsimd_idct_islow_neon
vld1.16 {d0, d1, d2, d3}, [ip, :128] /* load constants */
add ip, ip, #16
vmul.s16 q15, q15, q3
vpush {d8-d15} /* save NEON registers */
vpush {d8-d15} /* save Neon registers */
/* 1-D IDCT, pass 1, left 4x8 half */
vadd.s16 d4, ROW7L, ROW3L
vadd.s16 d5, ROW5L, ROW1L
@@ -507,7 +507,7 @@ asm_function jsimd_idct_islow_neon
vqrshrn.s16 d17, q9, #2
vqrshrn.s16 d18, q10, #2
vqrshrn.s16 d19, q11, #2
vpop {d8-d15} /* restore NEON registers */
vpop {d8-d15} /* restore Neon registers */
vqrshrn.s16 d20, q12, #2
/* Transpose the final 8-bit samples and do signed->unsigned conversion */
vtrn.16 q8, q9
@@ -688,7 +688,7 @@ asm_function jsimd_idct_islow_neon
* function from jidctfst.c
*
* Normally 1-D AAN DCT needs 5 multiplications and 29 additions.
* But in ARM NEON case some extra additions are required because VQDMULH
* But in Arm Neon case some extra additions are required because VQDMULH
* instruction can't handle the constants larger than 1. So the expressions
* like "x * 1.082392200" have to be converted to "x * 0.082392200 + x",
* which introduces an extra addition. Overall, there are 6 extra additions
@@ -718,7 +718,7 @@ asm_function jsimd_idct_ifast_neon
TMP3 .req r2
TMP4 .req ip
/* Load and dequantize coefficients into NEON registers
/* Load and dequantize coefficients into Neon registers
* with the following allocation:
* 0 1 2 3 | 4 5 6 7
* ---------+--------
@@ -749,7 +749,7 @@ asm_function jsimd_idct_ifast_neon
vmul.s16 q13, q13, q1
vld1.16 {d0}, [ip, :64] /* load constants */
vmul.s16 q15, q15, q3
vpush {d8-d13} /* save NEON registers */
vpush {d8-d13} /* save Neon registers */
/* 1-D IDCT, pass 1 */
vsub.s16 q2, q10, q14
vadd.s16 q14, q10, q14
@@ -842,7 +842,7 @@ asm_function jsimd_idct_ifast_neon
vadd.s16 q14, q5, q3
vsub.s16 q9, q5, q3
vsub.s16 q13, q10, q2
vpop {d8-d13} /* restore NEON registers */
vpop {d8-d13} /* restore Neon registers */
vadd.s16 q10, q10, q2
vsub.s16 q11, q12, q1
vadd.s16 q12, q12, q1
@@ -913,7 +913,7 @@ asm_function jsimd_idct_ifast_neon
*
* NOTE: jpeg-8 has an improved implementation of 4x4 inverse-DCT, which
* requires much less arithmetic operations and hence should be faster.
* The primary purpose of this particular NEON optimized function is
* The primary purpose of this particular Neon optimized function is
* bit exact compatibility with jpeg-6b.
*
* TODO: a bit better instructions scheduling can be achieved by expanding
@@ -1016,7 +1016,7 @@ asm_function jsimd_idct_4x4_neon
adr TMP4, jsimd_idct_4x4_neon_consts
vld1.16 {d0, d1, d2, d3}, [TMP4, :128]
/* Load all COEF_BLOCK into NEON registers with the following allocation:
/* Load all COEF_BLOCK into Neon registers with the following allocation:
* 0 1 2 3 | 4 5 6 7
* ---------+--------
* 0 | d4 | d5
@@ -1126,7 +1126,7 @@ asm_function jsimd_idct_4x4_neon
*
* NOTE: jpeg-8 has an improved implementation of 2x2 inverse-DCT, which
* requires much less arithmetic operations and hence should be faster.
* The primary purpose of this particular NEON optimized function is
* The primary purpose of this particular Neon optimized function is
* bit exact compatibility with jpeg-6b.
*/
@@ -1173,7 +1173,7 @@ asm_function jsimd_idct_2x2_neon
adr TMP2, jsimd_idct_2x2_neon_consts
vld1.16 {d0}, [TMP2, :64]
/* Load all COEF_BLOCK into NEON registers with the following allocation:
/* Load all COEF_BLOCK into Neon registers with the following allocation:
* 0 1 2 3 | 4 5 6 7
* ---------+--------
* 0 | d4 | d5
@@ -1499,7 +1499,7 @@ asm_function jsimd_ycc_\colorid\()_convert_neon
adr ip, jsimd_ycc_\colorid\()_neon_consts
vld1.16 {d0, d1, d2, d3}, [ip, :128]
/* Save ARM registers and handle input arguments */
/* Save Arm registers and handle input arguments */
push {r4, r5, r6, r7, r8, r9, r10, lr}
ldr NUM_ROWS, [sp, #(4 * 8)]
ldr INPUT_BUF0, [INPUT_BUF]
@@ -1507,7 +1507,7 @@ asm_function jsimd_ycc_\colorid\()_convert_neon
ldr INPUT_BUF2, [INPUT_BUF, #8]
.unreq INPUT_BUF
/* Save NEON registers */
/* Save Neon registers */
vpush {d8-d15}
/* Initially set d10, d11, d12, d13 to 0xFF */
@@ -1814,7 +1814,7 @@ asm_function jsimd_\colorid\()_ycc_convert_neon
adr ip, jsimd_\colorid\()_ycc_neon_consts
vld1.16 {d0, d1, d2, d3}, [ip, :128]
/* Save ARM registers and handle input arguments */
/* Save Arm registers and handle input arguments */
push {r4, r5, r6, r7, r8, r9, r10, lr}
ldr NUM_ROWS, [sp, #(4 * 8)]
ldr OUTPUT_BUF0, [OUTPUT_BUF]
@@ -1822,7 +1822,7 @@ asm_function jsimd_\colorid\()_ycc_convert_neon
ldr OUTPUT_BUF2, [OUTPUT_BUF, #8]
.unreq OUTPUT_BUF
/* Save NEON registers */
/* Save Neon registers */
vpush {d8-d15}
/* Outer loop over scanlines */
@@ -2017,7 +2017,7 @@ asm_function jsimd_fdct_ifast_neon
adr TMP, jsimd_fdct_ifast_neon_consts
vld1.16 {d0}, [TMP, :64]
/* Load all DATA into NEON registers with the following allocation:
/* Load all DATA into Neon registers with the following allocation:
* 0 1 2 3 | 4 5 6 7
* ---------+--------
* 0 | d16 | d17 | q8
@@ -2112,8 +2112,8 @@ asm_function jsimd_fdct_ifast_neon
*
* Note: the code uses 2 stage pipelining in order to improve instructions
* scheduling and eliminate stalls (this provides ~15% better
* performance for this function on both ARM Cortex-A8 and
* ARM Cortex-A9 when compared to the non-pipelined variant).
* performance for this function on both Arm Cortex-A8 and
* Arm Cortex-A9 when compared to the non-pipelined variant).
* The instructions which belong to the second stage use different
* indentation for better readiability.
*/

View File

@@ -12,7 +12,7 @@
*
* This file contains the interface between the "normal" portions
* of the library and the SIMD implementations when running on a
* 64-bit ARM architecture.
* 64-bit Arm architecture.
*/
#define JPEG_INTERNALS
@@ -114,8 +114,8 @@ parse_proc_cpuinfo(int bufsize)
*/
/*
* ARMv8 architectures support NEON extensions by default.
* It is no longer optional as it was with ARMv7.
* Armv8 architectures support Neon extensions by default.
* It is no longer optional as it was with Armv7.
*/

View File

@@ -1,5 +1,5 @@
/*
* ARMv8 NEON optimizations for libjpeg-turbo
* Armv8 Neon optimizations for libjpeg-turbo
*
* Copyright (C) 2009-2011, Nokia Corporation and/or its subsidiary(-ies).
* All Rights Reserved.
@@ -611,7 +611,7 @@ asm_function jsimd_idct_islow_neon
shrn2 v5.8h, v15.4s, #16 /* wsptr[DCTSIZE*3] = (int)DESCALE(tmp13 + tmp0, CONST_BITS+PASS1_BITS+3) */
shrn2 v6.8h, v17.4s, #16 /* wsptr[DCTSIZE*4] = (int)DESCALE(tmp13 - tmp0, CONST_BITS+PASS1_BITS+3) */
movi v0.16b, #(CENTERJSAMPLE)
/* Prepare pointers (dual-issue with NEON instructions) */
/* Prepare pointers (dual-issue with Neon instructions) */
ldp TMP1, TMP2, [OUTPUT_BUF], 16
sqrshrn v28.8b, v2.8h, #(CONST_BITS+PASS1_BITS+3-16)
ldp TMP3, TMP4, [OUTPUT_BUF], 16
@@ -992,7 +992,7 @@ asm_function jsimd_idct_islow_neon
* function from jidctfst.c
*
* Normally 1-D AAN DCT needs 5 multiplications and 29 additions.
* But in ARM NEON case some extra additions are required because VQDMULH
* But in Arm Neon case some extra additions are required because VQDMULH
* instruction can't handle the constants larger than 1. So the expressions
* like "x * 1.082392200" have to be converted to "x * 0.082392200 + x",
* which introduces an extra addition. Overall, there are 6 extra additions
@@ -1024,7 +1024,7 @@ asm_function jsimd_idct_ifast_neon
instruction ensures that those bits are set to zero. */
uxtw x3, w3
/* Load and dequantize coefficients into NEON registers
/* Load and dequantize coefficients into Neon registers
* with the following allocation:
* 0 1 2 3 | 4 5 6 7
* ---------+--------
@@ -1037,7 +1037,7 @@ asm_function jsimd_idct_ifast_neon
* 6 | d28 | d29 ( v22.8h )
* 7 | d30 | d31 ( v23.8h )
*/
/* Save NEON registers used in fast IDCT */
/* Save Neon registers used in fast IDCT */
get_symbol_loc TMP5, Ljsimd_idct_ifast_neon_consts
ld1 {v16.8h, v17.8h}, [COEF_BLOCK], 32
ld1 {v0.8h, v1.8h}, [DCT_TABLE], 32
@@ -1142,7 +1142,7 @@ asm_function jsimd_idct_ifast_neon
add v20.8h, v20.8h, v1.8h
/* Descale to 8-bit and range limit */
movi v0.16b, #0x80
/* Prepare pointers (dual-issue with NEON instructions) */
/* Prepare pointers (dual-issue with Neon instructions) */
ldp TMP1, TMP2, [OUTPUT_BUF], 16
sqshrn v28.8b, v16.8h, #5
ldp TMP3, TMP4, [OUTPUT_BUF], 16
@@ -1221,7 +1221,7 @@ asm_function jsimd_idct_ifast_neon
*
* NOTE: jpeg-8 has an improved implementation of 4x4 inverse-DCT, which
* requires much less arithmetic operations and hence should be faster.
* The primary purpose of this particular NEON optimized function is
* The primary purpose of this particular Neon optimized function is
* bit exact compatibility with jpeg-6b.
*
* TODO: a bit better instructions scheduling can be achieved by expanding
@@ -1291,7 +1291,7 @@ asm_function jsimd_idct_4x4_neon
instruction ensures that those bits are set to zero. */
uxtw x3, w3
/* Save all used NEON registers */
/* Save all used Neon registers */
sub sp, sp, 64
mov x9, sp
/* Load constants (v3.4h is just used for padding) */
@@ -1300,7 +1300,7 @@ asm_function jsimd_idct_4x4_neon
st1 {v12.8b, v13.8b, v14.8b, v15.8b}, [x9], 32
ld1 {v0.4h, v1.4h, v2.4h, v3.4h}, [TMP4]
/* Load all COEF_BLOCK into NEON registers with the following allocation:
/* Load all COEF_BLOCK into Neon registers with the following allocation:
* 0 1 2 3 | 4 5 6 7
* ---------+--------
* 0 | v4.4h | v5.4h
@@ -1434,7 +1434,7 @@ asm_function jsimd_idct_4x4_neon
*
* NOTE: jpeg-8 has an improved implementation of 2x2 inverse-DCT, which
* requires much less arithmetic operations and hence should be faster.
* The primary purpose of this particular NEON optimized function is
* The primary purpose of this particular Neon optimized function is
* bit exact compatibility with jpeg-6b.
*/
@@ -1483,7 +1483,7 @@ asm_function jsimd_idct_2x2_neon
st1 {v12.8b, v13.8b, v14.8b, v15.8b}, [x9], 32
ld1 {v14.4h}, [TMP2]
/* Load all COEF_BLOCK into NEON registers with the following allocation:
/* Load all COEF_BLOCK into Neon registers with the following allocation:
* 0 1 2 3 | 4 5 6 7
* ---------+--------
* 0 | v4.4h | v5.4h
@@ -1857,7 +1857,7 @@ asm_function jsimd_ycc_\colorid\()_convert_neon_slowst3
/* Load constants to d1, d2, d3 (v0.4h is just used for padding) */
get_symbol_loc x15, Ljsimd_ycc_rgb_neon_consts
/* Save NEON registers */
/* Save Neon registers */
st1 {v8.8b, v9.8b, v10.8b, v11.8b}, [x9], 32
st1 {v12.8b, v13.8b, v14.8b, v15.8b}, [x9], 32
ld1 {v0.4h, v1.4h}, [x15], 16
@@ -2142,7 +2142,7 @@ generate_jsimd_ycc_rgb_convert_neon extbgr, 24, 2, .4h, 1, .4h, 0, .4h, .8b,
.endm
/* TODO: expand macros and interleave instructions if some in-order
* ARM64 processor actually can dual-issue LOAD/STORE with ALU */
* AArch64 processor actually can dual-issue LOAD/STORE with ALU */
.macro do_rgb_to_yuv_stage2_store_load_stage1 fast_ld3
do_rgb_to_yuv_stage2
do_load \bpp, 8, \fast_ld3
@@ -2182,7 +2182,7 @@ asm_function jsimd_\colorid\()_ycc_convert_neon_slowld3
ldr OUTPUT_BUF2, [OUTPUT_BUF, #16]
.unreq OUTPUT_BUF
/* Save NEON registers */
/* Save Neon registers */
sub sp, sp, #64
mov x9, sp
st1 {v8.8b, v9.8b, v10.8b, v11.8b}, [x9], 32
@@ -2396,13 +2396,13 @@ asm_function jsimd_fdct_islow_neon
get_symbol_loc TMP, Ljsimd_fdct_islow_neon_consts
ld1 {v0.8h, v1.8h}, [TMP]
/* Save NEON registers */
/* Save Neon registers */
sub sp, sp, #64
mov x10, sp
st1 {v8.8b, v9.8b, v10.8b, v11.8b}, [x10], 32
st1 {v12.8b, v13.8b, v14.8b, v15.8b}, [x10], 32
/* Load all DATA into NEON registers with the following allocation:
/* Load all DATA into Neon registers with the following allocation:
* 0 1 2 3 | 4 5 6 7
* ---------+--------
* 0 | d16 | d17 | v16.8h
@@ -2629,7 +2629,7 @@ asm_function jsimd_fdct_islow_neon
st1 {v16.8h, v17.8h, v18.8h, v19.8h}, [DATA], 64
st1 {v20.8h, v21.8h, v22.8h, v23.8h}, [DATA]
/* Restore NEON registers */
/* Restore Neon registers */
ld1 {v8.8b, v9.8b, v10.8b, v11.8b}, [sp], 32
ld1 {v12.8b, v13.8b, v14.8b, v15.8b}, [sp], 32
@@ -2681,7 +2681,7 @@ asm_function jsimd_fdct_ifast_neon
get_symbol_loc TMP, Ljsimd_fdct_ifast_neon_consts
ld1 {v0.4h}, [TMP]
/* Load all DATA into NEON registers with the following allocation:
/* Load all DATA into Neon registers with the following allocation:
* 0 1 2 3 | 4 5 6 7
* ---------+--------
* 0 | d16 | d17 | v0.8h
@@ -3066,7 +3066,7 @@ asm_function jsimd_huff_encode_one_block_neon_slowtbl
.endif
sub sp, sp, 272
sub BUFFER, BUFFER, #0x1 /* BUFFER=buffer-- */
/* Save ARM registers */
/* Save Arm registers */
stp x19, x20, [sp]
get_symbol_loc x15, Ljsimd_huff_encode_one_block_neon_consts
ldr PUT_BUFFER, [x0, #0x10]