Merge branch 'master' into dev

DRC
2020-10-19 21:17:46 -05:00
17 changed files with 97 additions and 82 deletions

View File

@@ -393,8 +393,8 @@ located (usually **/usr/bin**.) Next, execute the following commands:
Building libjpeg-turbo for iOS
------------------------------

-iOS platforms, such as the iPhone and iPad, use ARM processors, and all
-currently supported models include NEON instructions. Thus, they can take
+iOS platforms, such as the iPhone and iPad, use Arm processors, and all
+currently supported models include Neon instructions. Thus, they can take
advantage of libjpeg-turbo's SIMD extensions to significantly accelerate JPEG
compression/decompression. This section describes how to build libjpeg-turbo
for these platforms.
@@ -407,7 +407,7 @@ for these platforms.
it should be installed in your `PATH`.

-### ARMv8 (64-bit)
+### Armv8 (64-bit)

**gas-preprocessor.pl required if using Xcode < 6**
@@ -439,7 +439,7 @@ Building libjpeg-turbo for Android platforms requires v13b or later of the
[Android NDK](https://developer.android.com/tools/sdk/ndk).

-### ARMv7 (32-bit)
+### Armv7 (32-bit)

The following is a general recipe script that can be modified for your specific
needs.
@@ -464,7 +464,7 @@ needs.
make

-### ARMv8 (64-bit)
+### Armv8 (64-bit)

The following is a general recipe script that can be modified for your specific
needs.
@@ -643,12 +643,12 @@ Create Mac package/disk image. This requires pkgbuild and productbuild, which
are installed by default on OS X 10.7 and later.

In order to create a Mac package/disk image that contains universal
-x86-64/ARM binaries, set the following CMake variable:
+x86-64/Arm binaries, set the following CMake variable:

-* `IOS_ARMV8_BUILD`: Directory containing an ARMv8 (64-bit) iOS build of
+* `IOS_ARMV8_BUILD`: Directory containing an Armv8 (64-bit) iOS build of
  libjpeg-turbo to include in the universal binaries

-You should first use CMake to configure an ARMv8 sub-build of libjpeg-turbo
+You should first use CMake to configure an Armv8 sub-build of libjpeg-turbo
(see "Building libjpeg-turbo for iOS" above) in a build directory that matches
the one specified in the aforementioned CMake variable. Next, configure the
primary (x86-64) build of libjpeg-turbo as an out-of-tree build, specifying the

View File

@@ -55,10 +55,12 @@ if(CMAKE_SYSTEM_PROCESSOR_LC MATCHES "x86_64" OR
    set(CMAKE_SYSTEM_PROCESSOR ${CPU_TYPE})
  endif()
elseif(CMAKE_SYSTEM_PROCESSOR_LC STREQUAL "aarch64" OR
-  CMAKE_SYSTEM_PROCESSOR_LC MATCHES "arm*64*")
-  set(CPU_TYPE arm64)
-elseif(CMAKE_SYSTEM_PROCESSOR_LC MATCHES "arm*")
-  set(CPU_TYPE arm)
+  CMAKE_SYSTEM_PROCESSOR_LC MATCHES "arm*")
+  if(BITS EQUAL 64)
+    set(CPU_TYPE arm64)
+  else()
+    set(CPU_TYPE arm)
+  endif()
elseif(CMAKE_SYSTEM_PROCESSOR_LC MATCHES "ppc*" OR
  CMAKE_SYSTEM_PROCESSOR_LC MATCHES "powerpc*")
  set(CPU_TYPE powerpc)

View File

@@ -33,7 +33,7 @@ approximately 2x when using the fast integer IDCT
The overall decompression speedup for RGB images is now approximately
2.3-3.7x (compared to 2-3.5x with libjpeg-turbo 2.0.x.)

-3. 32-bit (ARMv7 or ARMv7s) iOS builds of libjpeg-turbo are no longer
+3. 32-bit (Armv7 or Armv7s) iOS builds of libjpeg-turbo are no longer
supported, and the libjpeg-turbo build system can no longer be used to package
such builds. 32-bit iOS apps cannot run in iOS 11 and later, and the App Store
no longer allows them.
@@ -61,10 +61,10 @@ higher-frequency scan. libjpeg-turbo now applies block smoothing parameters to
each iMCU row based on which scan generated the pixels in that row, rather than
always using the block smoothing parameters for the most recent scan.

-7. Added SIMD acceleration for progressive Huffman encoding on ARM 64-bit
-(ARMv8) platforms. This speeds up the compression of full-color progressive
+7. Added SIMD acceleration for progressive Huffman encoding on Arm 64-bit
+(Armv8) platforms. This speeds up the compression of full-color progressive
JPEGs by about 30-40% on average (relative to libjpeg-turbo 2.0.x) when using
-modern ARMv8 CPUs.
+modern Armv8 CPUs.

8. Added configure-time and run-time auto-detection of Loongson MMI SIMD
instructions, so that the Loongson MMI SIMD extensions can be included in any
@@ -124,8 +124,8 @@ with `jpeg_skip_scanlines()`, and the issues could not readily be fixed.
- Fixed an issue whereby `jpeg_skip_scanlines()` always returned 0 when
skipping past the end of an image.

-3. The ARM 64-bit (ARMv8) NEON SIMD extensions can now be built using MinGW
-toolchains targetting ARM64 (AArch64) Windows binaries.
+3. The Arm 64-bit (Armv8) Neon SIMD extensions can now be built using MinGW
+toolchains targetting Arm64 (AArch64) Windows binaries.

4. Fixed unexpected visual artifacts that occurred when using
`jpeg_crop_scanline()` and interblock smoothing while decompressing only the DC
@@ -198,7 +198,7 @@ other user-visible errant behavior, and given that the lossless transformer
(unlike the decompressor) is not generally exposed to arbitrary data exploits,
this issue did not likely pose a security risk.

-6. The ARM 64-bit (ARMv8) NEON SIMD assembly code now stores constants in a
+6. The Arm 64-bit (Armv8) Neon SIMD assembly code now stores constants in a
separate read-only data section rather than in the text section, to support
execute-only memory layouts.
@@ -484,7 +484,7 @@ algorithm that caused incorrect dithering in the output image. This algorithm
now produces bitwise-identical results to the unmerged algorithms.

12. The SIMD function symbols for x86[-64]/ELF, MIPS/ELF, macOS/x86[-64] (if
-libjpeg-turbo is built with YASM), and iOS/ARM[64] builds are now private.
+libjpeg-turbo is built with YASM), and iOS/Arm[64] builds are now private.
This prevents those symbols from being exposed in applications or shared
libraries that link statically with libjpeg-turbo.

View File

@@ -2,8 +2,8 @@ Background
==========

libjpeg-turbo is a JPEG image codec that uses SIMD instructions to accelerate
-baseline JPEG compression and decompression on x86, x86-64, ARM, PowerPC, and
-MIPS systems, as well as progressive JPEG compression on x86, x86-64, and ARMv8
+baseline JPEG compression and decompression on x86, x86-64, Arm, PowerPC, and
+MIPS systems, as well as progressive JPEG compression on x86, x86-64, and Armv8
systems. On such systems, libjpeg-turbo is generally 2-6x as fast as libjpeg,
all else being equal. On other types of systems, libjpeg-turbo can still
outperform libjpeg by a significant amount, by virtue of its highly-optimized

View File

@@ -23,11 +23,18 @@ set(RPMARCH ${CMAKE_SYSTEM_PROCESSOR})
if(CPU_TYPE STREQUAL "x86_64")
  set(DEBARCH amd64)
elseif(CMAKE_SYSTEM_PROCESSOR MATCHES "armv7*")
+  set(RPMARCH armv7hl)
  set(DEBARCH armhf)
elseif(CPU_TYPE STREQUAL "arm64")
  set(DEBARCH ${CPU_TYPE})
elseif(CPU_TYPE STREQUAL "arm")
-  set(DEBARCH armel)
+  if(CMAKE_C_COMPILER MATCHES "gnueabihf")
+    set(RPMARCH armv7hl)
+    set(DEBARCH armhf)
+  else()
+    set(RPMARCH armel)
+    set(DEBARCH armel)
+  endif()
elseif(CMAKE_SYSTEM_PROCESSOR_LC STREQUAL "ppc64le")
  set(DEBARCH ppc64el)
elseif(CPU_TYPE STREQUAL "powerpc" AND BITS EQUAL 32)
@@ -128,7 +135,7 @@ endif() # WIN32
if(APPLE)
  set(IOS_ARMV8_BUILD "" CACHE PATH
-    "Directory containing ARMv8 iOS build to include in universal binaries")
+    "Directory containing Armv8 iOS build to include in universal binaries")
  set(OSX_APP_CERT_NAME "" CACHE STRING
    "Name of the Developer ID Application certificate (in the macOS keychain) that should be used to sign the libjpeg-turbo DMG. Leave this blank to generate an unsigned DMG.")

View File

@@ -35,10 +35,10 @@
 * memory footprint by 64k, which is important for some mobile applications
 * that create many isolated instances of libjpeg-turbo (web browsers, for
 * instance.) This may improve performance on some mobile platforms as well.
- * This feature is enabled by default only on ARM processors, because some x86
+ * This feature is enabled by default only on Arm processors, because some x86
 * chips have a slow implementation of bsr, and the use of clz/bsr cannot be
 * shown to have a significant performance impact even on the x86 chips that
- * have a fast implementation of it. When building for ARMv6, you can
+ * have a fast implementation of it. When building for Armv6, you can
 * explicitly disable the use of clz/bsr by adding -mthumb to the compiler
 * flags (this defines __thumb__).
 */
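The trade-off described in this comment can be made concrete with a small sketch. The function below is illustrative only (it is not the actual jchuff code, and the function name is hypothetical): it computes the bit length of a nonzero coefficient magnitude either with GCC/Clang's `__builtin_clz` or with a portable loop that stands in for the 64k lookup table.

```c
#include <stdint.h>

/* Hypothetical sketch, not the actual jchuff code: bit length of a nonzero
 * coefficient magnitude.  On Arm (when __thumb__ is not defined), a single
 * clz instruction does the job; the portable loop below stands in for the
 * 64k lookup table mentioned above. */
static inline int jpeg_nbits_nonzero(uint32_t x)
{
#if defined(__GNUC__) && (defined(__arm__) || defined(__aarch64__)) && \
    !defined(__thumb__)
  return 32 - __builtin_clz(x);  /* compiles to a clz instruction on Arm */
#else
  int nbits = 0;
  while (x) {                    /* stand-in for the table lookup / bsr path */
    nbits++;
    x >>= 1;
  }
  return nbits;
#endif
}
```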
@@ -73,7 +73,7 @@ typedef size_t bit_buf_type;
#endif

/* NOTE: The more optimal Huffman encoding algorithm has not yet been
- * implemented in the ARM NEON SIMD extensions, which is why we retain the old
+ * implemented in the Arm Neon SIMD extensions, which is why we retain the old
 * Huffman encoder behavior for that platform.
 */
#if defined(WITH_SIMD) && !(defined(__arm__) || defined(__aarch64__))
@@ -98,7 +98,7 @@ typedef struct {
    simd_bit_buf_type simd;
  } put_buffer;  /* current bit accumulation buffer */
  int free_bits;  /* # of bits available in it */
-  /* (ARM SIMD: # of bits now in it) */
+  /* (Arm SIMD: # of bits now in it) */
  int last_dc_val[MAX_COMPS_IN_SCAN];  /* last DC coef for each component */
} savable_state;

View File

@@ -43,10 +43,10 @@
 * memory footprint by 64k, which is important for some mobile applications
 * that create many isolated instances of libjpeg-turbo (web browsers, for
 * instance.) This may improve performance on some mobile platforms as well.
- * This feature is enabled by default only on ARM processors, because some x86
+ * This feature is enabled by default only on Arm processors, because some x86
 * chips have a slow implementation of bsr, and the use of clz/bsr cannot be
 * shown to have a significant performance impact even on the x86 chips that
- * have a fast implementation of it. When building for ARMv6, you can
+ * have a fast implementation of it. When building for Armv6, you can
 * explicitly disable the use of clz/bsr by adding -mthumb to the compiler
 * flags (this defines __thumb__).
 */

View File

@@ -1,4 +1,4 @@
-libjpeg-turbo is a JPEG image codec that uses SIMD instructions to accelerate baseline JPEG compression and decompression on x86, x86-64, ARM, PowerPC, and MIPS systems, as well as progressive JPEG compression on x86, x86-64, and ARMv8 systems. On such systems, libjpeg-turbo is generally 2-6x as fast as libjpeg, all else being equal. On other types of systems, libjpeg-turbo can still outperform libjpeg by a significant amount, by virtue of its highly-optimized Huffman coding routines. In many cases, the performance of libjpeg-turbo rivals that of proprietary high-speed JPEG codecs.
+libjpeg-turbo is a JPEG image codec that uses SIMD instructions to accelerate baseline JPEG compression and decompression on x86, x86-64, Arm, PowerPC, and MIPS systems, as well as progressive JPEG compression on x86, x86-64, and Armv8 systems. On such systems, libjpeg-turbo is generally 2-6x as fast as libjpeg, all else being equal. On other types of systems, libjpeg-turbo can still outperform libjpeg by a significant amount, by virtue of its highly-optimized Huffman coding routines. In many cases, the performance of libjpeg-turbo rivals that of proprietary high-speed JPEG codecs.

libjpeg-turbo implements both the traditional libjpeg API as well as the less powerful but more straightforward TurboJPEG API. libjpeg-turbo also features colorspace extensions that allow it to compress from/decompress to 32-bit and big-endian pixel buffers (RGBX, XBGR, etc.), as well as a full-featured Java interface.

View File

@@ -9,9 +9,9 @@ Homepage: @PKGURL@
Installed-Size: {__SIZE}
Description: A SIMD-accelerated JPEG codec that provides both the libjpeg and TurboJPEG APIs
libjpeg-turbo is a JPEG image codec that uses SIMD instructions to accelerate
-baseline JPEG compression and decompression on x86, x86-64, ARM, PowerPC, and
+baseline JPEG compression and decompression on x86, x86-64, Arm, PowerPC, and
MIPS systems, as well as progressive JPEG compression on x86, x86-64, and
-ARMv8 systems. On such systems, libjpeg-turbo is generally 2-6x as fast as
+Armv8 systems. On such systems, libjpeg-turbo is generally 2-6x as fast as
libjpeg, all else being equal. On other types of systems, libjpeg-turbo can
still outperform libjpeg by a significant amount, by virtue of its
highly-optimized Huffman coding routines. In many cases, the performance of

View File

@@ -54,7 +54,11 @@ makedeb()
  if [ $SUPPLEMENT = 1 ]; then
    PKGNAME=$PKGNAME\32
-    DEBARCH=amd64
+    if [ "$DEBARCH" = "i386" ]; then
+      DEBARCH=amd64
+    else
+      DEBARCH=arm64
+    fi
  fi

  umask 022
@@ -110,6 +114,8 @@ if [ ! `uid` -eq 0 ]; then
fi

makedeb 0
-if [ "$DEBARCH" = "i386" ]; then makedeb 1; fi
+if [ "$DEBARCH" = "i386" -o "$DEBARCH" = "armel" -o "$DEBARCH" = "armhf" ]; then
+  makedeb 1
+fi

exit

View File

@@ -160,7 +160,7 @@ install_ios()
}

if [ "$BUILDDIRARMV8" != "" ]; then
-  install_ios $BUILDDIRARMV8 ARMv8 armv8 arm64
+  install_ios $BUILDDIRARMV8 Armv8 armv8 arm64
fi

install_name_tool -id $LIBDIR/$LIBJPEG_DSO_NAME $PKGROOT/$LIBDIR/$LIBJPEG_DSO_NAME

View File

@@ -52,8 +52,8 @@ Provides: %{name} = %{version}-%{release}, @CMAKE_PROJECT_NAME@ = %{version}-%{r
%description
libjpeg-turbo is a JPEG image codec that uses SIMD instructions to accelerate
-baseline JPEG compression and decompression on x86, x86-64, ARM, PowerPC, and
-MIPS systems, as well as progressive JPEG compression on x86, x86-64, and ARMv8
+baseline JPEG compression and decompression on x86, x86-64, Arm, PowerPC, and
+MIPS systems, as well as progressive JPEG compression on x86, x86-64, and Armv8
systems. On such systems, libjpeg-turbo is generally 2-6x as fast as libjpeg,
all else being equal. On other types of systems, libjpeg-turbo can still
outperform libjpeg by a significant amount, by virtue of its highly-optimized

View File

@@ -208,7 +208,7 @@ endif()
###############################################################################
-# ARM (GAS)
+# Arm (GAS)
###############################################################################

elseif(CPU_TYPE STREQUAL "arm64" OR CPU_TYPE STREQUAL "arm")

View File

@@ -13,7 +13,7 @@
 *
 * This file contains the interface between the "normal" portions
 * of the library and the SIMD implementations when running on a
- * 32-bit ARM architecture.
+ * 32-bit Arm architecture.
 */

#define JPEG_INTERNALS
@@ -118,7 +118,7 @@ init_simd(void)
#if defined(__ARM_NEON__)
  simd_support |= JSIMD_NEON;
#elif defined(__linux__) || defined(ANDROID) || defined(__ANDROID__)
-  /* We still have a chance to use NEON regardless of globally used
+  /* We still have a chance to use Neon regardless of globally used
   * -mcpu/-mfpu options passed to gcc by performing runtime detection via
   * /proc/cpuinfo parsing on linux/android */
  while (!parse_proc_cpuinfo(bufsize)) {
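For context, the runtime detection mentioned in this comment amounts to scanning `/proc/cpuinfo` for a Neon feature flag. The following is a minimal standalone sketch, not the library's `parse_proc_cpuinfo()` implementation, just an illustration of the idea.

```c
#include <stdio.h>
#include <string.h>

/* Minimal sketch, not the library's parse_proc_cpuinfo(): returns 1 if the
 * "Features" line of /proc/cpuinfo advertises Neon support, 0 otherwise. */
static int cpu_has_neon(void)
{
  char line[1024];
  int found = 0;
  FILE *fp = fopen("/proc/cpuinfo", "r");

  if (fp == NULL)
    return 0;
  while (fgets(line, sizeof(line), fp)) {
    /* The feature flags appear on a line such as
     * "Features : half thumb fastmult vfp edsp neon vfpv3" */
    if (!strncmp(line, "Features", 8) && strstr(line, " neon")) {
      found = 1;
      break;
    }
  }
  fclose(fp);
  return found;
}
```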

View File

@@ -1,5 +1,5 @@
/*
- * ARMv7 NEON optimizations for libjpeg-turbo
+ * Armv7 Neon optimizations for libjpeg-turbo
 *
 * Copyright (C) 2009-2011, Nokia Corporation and/or its subsidiary(-ies).
 * All Rights Reserved.
@@ -229,7 +229,7 @@ asm_function jsimd_idct_islow_neon
ROW7L .req d30
ROW7R .req d31

-/* Load and dequantize coefficients into NEON registers
+/* Load and dequantize coefficients into Neon registers
 * with the following allocation:
 * 0 1 2 3 | 4 5 6 7
 * ---------+--------
@@ -261,7 +261,7 @@ asm_function jsimd_idct_islow_neon
vld1.16 {d0, d1, d2, d3}, [ip, :128]  /* load constants */
add ip, ip, #16
vmul.s16 q15, q15, q3
-vpush {d8-d15}  /* save NEON registers */
+vpush {d8-d15}  /* save Neon registers */
/* 1-D IDCT, pass 1, left 4x8 half */
vadd.s16 d4, ROW7L, ROW3L
vadd.s16 d5, ROW5L, ROW1L
@@ -507,7 +507,7 @@ asm_function jsimd_idct_islow_neon
vqrshrn.s16 d17, q9, #2
vqrshrn.s16 d18, q10, #2
vqrshrn.s16 d19, q11, #2
-vpop {d8-d15}  /* restore NEON registers */
+vpop {d8-d15}  /* restore Neon registers */
vqrshrn.s16 d20, q12, #2
/* Transpose the final 8-bit samples and do signed->unsigned conversion */
vtrn.16 q8, q9
@@ -688,7 +688,7 @@ asm_function jsimd_idct_islow_neon
 * function from jidctfst.c
 *
 * Normally 1-D AAN DCT needs 5 multiplications and 29 additions.
- * But in ARM NEON case some extra additions are required because VQDMULH
+ * But in Arm Neon case some extra additions are required because VQDMULH
 * instruction can't handle the constants larger than 1. So the expressions
 * like "x * 1.082392200" have to be converted to "x * 0.082392200 + x",
 * which introduces an extra addition. Overall, there are 6 extra additions
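To make the constant-handling trick concrete: VQDMULH effectively multiplies by a Q15 fixed-point value, which cannot represent constants of 1.0 or greater, so a factor such as 1.082392200 is applied as x + x * 0.082392200. Below is a scalar C sketch of the same idea; the constant is an illustrative rounding, not the value used in this assembly.

```c
#include <stdint.h>

/* Scalar sketch of the Q15 trick described above: approximate
 * x * 1.082392200 as x + x * 0.082392200, because a Q15 multiplier (as used
 * by VQDMULH) must be strictly less than 1.0.
 * 0.082392200 * 32768 ~= 2700 (illustrative rounding; no saturation here). */
static inline int16_t mul_1_082392200(int16_t x)
{
  int16_t frac = (int16_t)(((int32_t)x * 2700) >> 15);  /* ~ x * 0.0824 */
  return (int16_t)(x + frac);                           /* the extra addition */
}
```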
@@ -718,7 +718,7 @@ asm_function jsimd_idct_ifast_neon
TMP3 .req r2
TMP4 .req ip

-/* Load and dequantize coefficients into NEON registers
+/* Load and dequantize coefficients into Neon registers
 * with the following allocation:
 * 0 1 2 3 | 4 5 6 7
 * ---------+--------
@@ -749,7 +749,7 @@ asm_function jsimd_idct_ifast_neon
vmul.s16 q13, q13, q1
vld1.16 {d0}, [ip, :64]  /* load constants */
vmul.s16 q15, q15, q3
-vpush {d8-d13}  /* save NEON registers */
+vpush {d8-d13}  /* save Neon registers */
/* 1-D IDCT, pass 1 */
vsub.s16 q2, q10, q14
vadd.s16 q14, q10, q14
@@ -842,7 +842,7 @@ asm_function jsimd_idct_ifast_neon
vadd.s16 q14, q5, q3
vsub.s16 q9, q5, q3
vsub.s16 q13, q10, q2
-vpop {d8-d13}  /* restore NEON registers */
+vpop {d8-d13}  /* restore Neon registers */
vadd.s16 q10, q10, q2
vsub.s16 q11, q12, q1
vadd.s16 q12, q12, q1
@@ -913,7 +913,7 @@ asm_function jsimd_idct_ifast_neon
 *
 * NOTE: jpeg-8 has an improved implementation of 4x4 inverse-DCT, which
 * requires much less arithmetic operations and hence should be faster.
- * The primary purpose of this particular NEON optimized function is
+ * The primary purpose of this particular Neon optimized function is
 * bit exact compatibility with jpeg-6b.
 *
 * TODO: a bit better instructions scheduling can be achieved by expanding
@@ -1016,7 +1016,7 @@ asm_function jsimd_idct_4x4_neon
adr TMP4, jsimd_idct_4x4_neon_consts
vld1.16 {d0, d1, d2, d3}, [TMP4, :128]

-/* Load all COEF_BLOCK into NEON registers with the following allocation:
+/* Load all COEF_BLOCK into Neon registers with the following allocation:
 * 0 1 2 3 | 4 5 6 7
 * ---------+--------
 * 0 | d4 | d5
@@ -1126,7 +1126,7 @@ asm_function jsimd_idct_4x4_neon
 *
 * NOTE: jpeg-8 has an improved implementation of 2x2 inverse-DCT, which
 * requires much less arithmetic operations and hence should be faster.
- * The primary purpose of this particular NEON optimized function is
+ * The primary purpose of this particular Neon optimized function is
 * bit exact compatibility with jpeg-6b.
 */
@@ -1173,7 +1173,7 @@ asm_function jsimd_idct_2x2_neon
adr TMP2, jsimd_idct_2x2_neon_consts
vld1.16 {d0}, [TMP2, :64]

-/* Load all COEF_BLOCK into NEON registers with the following allocation:
+/* Load all COEF_BLOCK into Neon registers with the following allocation:
 * 0 1 2 3 | 4 5 6 7
 * ---------+--------
 * 0 | d4 | d5
@@ -1499,7 +1499,7 @@ asm_function jsimd_ycc_\colorid\()_convert_neon
adr ip, jsimd_ycc_\colorid\()_neon_consts
vld1.16 {d0, d1, d2, d3}, [ip, :128]

-/* Save ARM registers and handle input arguments */
+/* Save Arm registers and handle input arguments */
push {r4, r5, r6, r7, r8, r9, r10, lr}
ldr NUM_ROWS, [sp, #(4 * 8)]
ldr INPUT_BUF0, [INPUT_BUF]
@@ -1507,7 +1507,7 @@ asm_function jsimd_ycc_\colorid\()_convert_neon
ldr INPUT_BUF2, [INPUT_BUF, #8]
.unreq INPUT_BUF

-/* Save NEON registers */
+/* Save Neon registers */
vpush {d8-d15}

/* Initially set d10, d11, d12, d13 to 0xFF */
@@ -1814,7 +1814,7 @@ asm_function jsimd_\colorid\()_ycc_convert_neon
adr ip, jsimd_\colorid\()_ycc_neon_consts
vld1.16 {d0, d1, d2, d3}, [ip, :128]

-/* Save ARM registers and handle input arguments */
+/* Save Arm registers and handle input arguments */
push {r4, r5, r6, r7, r8, r9, r10, lr}
ldr NUM_ROWS, [sp, #(4 * 8)]
ldr OUTPUT_BUF0, [OUTPUT_BUF]
@@ -1822,7 +1822,7 @@ asm_function jsimd_\colorid\()_ycc_convert_neon
ldr OUTPUT_BUF2, [OUTPUT_BUF, #8]
.unreq OUTPUT_BUF

-/* Save NEON registers */
+/* Save Neon registers */
vpush {d8-d15}

/* Outer loop over scanlines */
@@ -2017,7 +2017,7 @@ asm_function jsimd_fdct_ifast_neon
adr TMP, jsimd_fdct_ifast_neon_consts
vld1.16 {d0}, [TMP, :64]

-/* Load all DATA into NEON registers with the following allocation:
+/* Load all DATA into Neon registers with the following allocation:
 * 0 1 2 3 | 4 5 6 7
 * ---------+--------
 * 0 | d16 | d17 | q8
@@ -2112,8 +2112,8 @@ asm_function jsimd_fdct_ifast_neon
 *
 * Note: the code uses 2 stage pipelining in order to improve instructions
 * scheduling and eliminate stalls (this provides ~15% better
- * performance for this function on both ARM Cortex-A8 and
- * ARM Cortex-A9 when compared to the non-pipelined variant).
+ * performance for this function on both Arm Cortex-A8 and
+ * Arm Cortex-A9 when compared to the non-pipelined variant).
 * The instructions which belong to the second stage use different
 * indentation for better readiability.
 */

View File

@@ -12,7 +12,7 @@
 *
 * This file contains the interface between the "normal" portions
 * of the library and the SIMD implementations when running on a
- * 64-bit ARM architecture.
+ * 64-bit Arm architecture.
 */

#define JPEG_INTERNALS
@@ -115,8 +115,8 @@ parse_proc_cpuinfo(int bufsize)
 */

/*
- * ARMv8 architectures support NEON extensions by default.
- * It is no longer optional as it was with ARMv7.
+ * Armv8 architectures support Neon extensions by default.
+ * It is no longer optional as it was with Armv7.
 */
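Because Neon (Advanced SIMD) is a mandatory part of Armv8-A, the 64-bit code path can assume it rather than probing `/proc/cpuinfo` as the 32-bit code does. The following is a simplified sketch of that idea, not this file's actual `init_simd()`, and the flag value is illustrative.

```c
#define MY_SIMD_NEON  0x08  /* illustrative flag value, not libjpeg-turbo's */

/* Simplified sketch, not this file's actual init_simd(): on AArch64, Neon
 * (Advanced SIMD) is a mandatory architecture feature, so support can be
 * reported unconditionally instead of parsing /proc/cpuinfo as the 32-bit
 * Arm code must. */
static unsigned int query_simd_support(void)
{
#if defined(__aarch64__)
  return MY_SIMD_NEON;  /* Neon is always present on Armv8-A */
#else
  return 0;             /* runtime detection would be needed here */
#endif
}
```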

View File

@@ -1,5 +1,5 @@
/*
- * ARMv8 NEON optimizations for libjpeg-turbo
+ * Armv8 Neon optimizations for libjpeg-turbo
 *
 * Copyright (C) 2009-2011, Nokia Corporation and/or its subsidiary(-ies).
 * All Rights Reserved.
@@ -625,7 +625,7 @@ asm_function jsimd_idct_islow_neon
shrn2 v5.8h, v15.4s, #16  /* wsptr[DCTSIZE*3] = (int)DESCALE(tmp13 + tmp0, CONST_BITS+PASS1_BITS+3) */
shrn2 v6.8h, v17.4s, #16  /* wsptr[DCTSIZE*4] = (int)DESCALE(tmp13 - tmp0, CONST_BITS+PASS1_BITS+3) */
movi v0.16b, #(CENTERJSAMPLE)
-/* Prepare pointers (dual-issue with NEON instructions) */
+/* Prepare pointers (dual-issue with Neon instructions) */
ldp TMP1, TMP2, [OUTPUT_BUF], 16
sqrshrn v28.8b, v2.8h, #(CONST_BITS+PASS1_BITS+3-16)
ldp TMP3, TMP4, [OUTPUT_BUF], 16
@@ -1006,7 +1006,7 @@ asm_function jsimd_idct_islow_neon
 * function from jidctfst.c
 *
 * Normally 1-D AAN DCT needs 5 multiplications and 29 additions.
- * But in ARM NEON case some extra additions are required because VQDMULH
+ * But in Arm Neon case some extra additions are required because VQDMULH
 * instruction can't handle the constants larger than 1. So the expressions
 * like "x * 1.082392200" have to be converted to "x * 0.082392200 + x",
 * which introduces an extra addition. Overall, there are 6 extra additions
@@ -1038,7 +1038,7 @@ asm_function jsimd_idct_ifast_neon
instruction ensures that those bits are set to zero. */
uxtw x3, w3

-/* Load and dequantize coefficients into NEON registers
+/* Load and dequantize coefficients into Neon registers
 * with the following allocation:
 * 0 1 2 3 | 4 5 6 7
 * ---------+--------
@@ -1051,7 +1051,7 @@ asm_function jsimd_idct_ifast_neon
 * 6 | d28 | d29 ( v22.8h )
 * 7 | d30 | d31 ( v23.8h )
 */
-/* Save NEON registers used in fast IDCT */
+/* Save Neon registers used in fast IDCT */
get_symbol_loc TMP5, Ljsimd_idct_ifast_neon_consts
ld1 {v16.8h, v17.8h}, [COEF_BLOCK], 32
ld1 {v0.8h, v1.8h}, [DCT_TABLE], 32
@@ -1156,7 +1156,7 @@ asm_function jsimd_idct_ifast_neon
add v20.8h, v20.8h, v1.8h

/* Descale to 8-bit and range limit */
movi v0.16b, #0x80
-/* Prepare pointers (dual-issue with NEON instructions) */
+/* Prepare pointers (dual-issue with Neon instructions) */
ldp TMP1, TMP2, [OUTPUT_BUF], 16
sqshrn v28.8b, v16.8h, #5
ldp TMP3, TMP4, [OUTPUT_BUF], 16
@@ -1235,7 +1235,7 @@ asm_function jsimd_idct_ifast_neon
 *
 * NOTE: jpeg-8 has an improved implementation of 4x4 inverse-DCT, which
 * requires much less arithmetic operations and hence should be faster.
- * The primary purpose of this particular NEON optimized function is
+ * The primary purpose of this particular Neon optimized function is
 * bit exact compatibility with jpeg-6b.
 *
 * TODO: a bit better instructions scheduling can be achieved by expanding
@@ -1305,7 +1305,7 @@ asm_function jsimd_idct_4x4_neon
instruction ensures that those bits are set to zero. */
uxtw x3, w3

-/* Save all used NEON registers */
+/* Save all used Neon registers */
sub sp, sp, 64
mov x9, sp

/* Load constants (v3.4h is just used for padding) */
@@ -1314,7 +1314,7 @@ asm_function jsimd_idct_4x4_neon
st1 {v12.8b, v13.8b, v14.8b, v15.8b}, [x9], 32
ld1 {v0.4h, v1.4h, v2.4h, v3.4h}, [TMP4]

-/* Load all COEF_BLOCK into NEON registers with the following allocation:
+/* Load all COEF_BLOCK into Neon registers with the following allocation:
 * 0 1 2 3 | 4 5 6 7
 * ---------+--------
 * 0 | v4.4h | v5.4h
@@ -1448,7 +1448,7 @@ asm_function jsimd_idct_4x4_neon
 *
 * NOTE: jpeg-8 has an improved implementation of 2x2 inverse-DCT, which
 * requires much less arithmetic operations and hence should be faster.
- * The primary purpose of this particular NEON optimized function is
+ * The primary purpose of this particular Neon optimized function is
 * bit exact compatibility with jpeg-6b.
 */
@@ -1497,7 +1497,7 @@ asm_function jsimd_idct_2x2_neon
st1 {v12.8b, v13.8b, v14.8b, v15.8b}, [x9], 32
ld1 {v14.4h}, [TMP2]

-/* Load all COEF_BLOCK into NEON registers with the following allocation:
+/* Load all COEF_BLOCK into Neon registers with the following allocation:
 * 0 1 2 3 | 4 5 6 7
 * ---------+--------
 * 0 | v4.4h | v5.4h
@@ -1871,7 +1871,7 @@ asm_function jsimd_ycc_\colorid\()_convert_neon_slowst3
/* Load constants to d1, d2, d3 (v0.4h is just used for padding) */
get_symbol_loc x15, Ljsimd_ycc_rgb_neon_consts

-/* Save NEON registers */
+/* Save Neon registers */
st1 {v8.8b, v9.8b, v10.8b, v11.8b}, [x9], 32
st1 {v12.8b, v13.8b, v14.8b, v15.8b}, [x9], 32
ld1 {v0.4h, v1.4h}, [x15], 16
@@ -2156,7 +2156,7 @@ generate_jsimd_ycc_rgb_convert_neon extbgr, 24, 2, .4h, 1, .4h, 0, .4h, .8b,
.endm

/* TODO: expand macros and interleave instructions if some in-order
- * ARM64 processor actually can dual-issue LOAD/STORE with ALU */
+ * AArch64 processor actually can dual-issue LOAD/STORE with ALU */
.macro do_rgb_to_yuv_stage2_store_load_stage1 fast_ld3
do_rgb_to_yuv_stage2
do_load \bpp, 8, \fast_ld3
@@ -2196,7 +2196,7 @@ asm_function jsimd_\colorid\()_ycc_convert_neon_slowld3
ldr OUTPUT_BUF2, [OUTPUT_BUF, #16]
.unreq OUTPUT_BUF

-/* Save NEON registers */
+/* Save Neon registers */
sub sp, sp, #64
mov x9, sp
st1 {v8.8b, v9.8b, v10.8b, v11.8b}, [x9], 32
@@ -2410,13 +2410,13 @@ asm_function jsimd_fdct_islow_neon
get_symbol_loc TMP, Ljsimd_fdct_islow_neon_consts
ld1 {v0.8h, v1.8h}, [TMP]

-/* Save NEON registers */
+/* Save Neon registers */
sub sp, sp, #64
mov x10, sp
st1 {v8.8b, v9.8b, v10.8b, v11.8b}, [x10], 32
st1 {v12.8b, v13.8b, v14.8b, v15.8b}, [x10], 32

-/* Load all DATA into NEON registers with the following allocation:
+/* Load all DATA into Neon registers with the following allocation:
 * 0 1 2 3 | 4 5 6 7
 * ---------+--------
 * 0 | d16 | d17 | v16.8h
@@ -2643,7 +2643,7 @@ asm_function jsimd_fdct_islow_neon
st1 {v16.8h, v17.8h, v18.8h, v19.8h}, [DATA], 64
st1 {v20.8h, v21.8h, v22.8h, v23.8h}, [DATA]

-/* Restore NEON registers */
+/* Restore Neon registers */
ld1 {v8.8b, v9.8b, v10.8b, v11.8b}, [sp], 32
ld1 {v12.8b, v13.8b, v14.8b, v15.8b}, [sp], 32
@@ -2695,7 +2695,7 @@ asm_function jsimd_fdct_ifast_neon
get_symbol_loc TMP, Ljsimd_fdct_ifast_neon_consts
ld1 {v0.4h}, [TMP]

-/* Load all DATA into NEON registers with the following allocation:
+/* Load all DATA into Neon registers with the following allocation:
 * 0 1 2 3 | 4 5 6 7
 * ---------+--------
 * 0 | d16 | d17 | v0.8h
@@ -3080,7 +3080,7 @@ asm_function jsimd_huff_encode_one_block_neon_slowtbl
.endif
sub sp, sp, 272
sub BUFFER, BUFFER, #0x1  /* BUFFER=buffer-- */
-/* Save ARM registers */
+/* Save Arm registers */
stp x19, x20, [sp]
get_symbol_loc x15, Ljsimd_huff_encode_one_block_neon_consts
ldr PUT_BUFFER, [x0, #0x10]