Merge branch 'master' into dev
16
BUILDING.md
@@ -393,8 +393,8 @@ located (usually **/usr/bin**.) Next, execute the following commands:
Building libjpeg-turbo for iOS
------------------------------

iOS platforms, such as the iPhone and iPad, use ARM processors, and all
currently supported models include NEON instructions. Thus, they can take
iOS platforms, such as the iPhone and iPad, use Arm processors, and all
currently supported models include Neon instructions. Thus, they can take
advantage of libjpeg-turbo's SIMD extensions to significantly accelerate JPEG
compression/decompression. This section describes how to build libjpeg-turbo
for these platforms.
@@ -407,7 +407,7 @@ for these platforms.
it should be installed in your `PATH`.


### ARMv8 (64-bit)
### Armv8 (64-bit)

**gas-preprocessor.pl required if using Xcode < 6**

@@ -439,7 +439,7 @@ Building libjpeg-turbo for Android platforms requires v13b or later of the
[Android NDK](https://developer.android.com/tools/sdk/ndk).


### ARMv7 (32-bit)
### Armv7 (32-bit)

The following is a general recipe script that can be modified for your specific
needs.
@@ -464,7 +464,7 @@ needs.
make


### ARMv8 (64-bit)
### Armv8 (64-bit)

The following is a general recipe script that can be modified for your specific
needs.
@@ -643,12 +643,12 @@ Create Mac package/disk image. This requires pkgbuild and productbuild, which
are installed by default on OS X 10.7 and later.

In order to create a Mac package/disk image that contains universal
x86-64/ARM binaries, set the following CMake variable:
x86-64/Arm binaries, set the following CMake variable:

* `IOS_ARMV8_BUILD`: Directory containing an ARMv8 (64-bit) iOS build of
* `IOS_ARMV8_BUILD`: Directory containing an Armv8 (64-bit) iOS build of
libjpeg-turbo to include in the universal binaries

You should first use CMake to configure an ARMv8 sub-build of libjpeg-turbo
You should first use CMake to configure an Armv8 sub-build of libjpeg-turbo
(see "Building libjpeg-turbo for iOS" above) in a build directory that matches
the one specified in the aforementioned CMake variable. Next, configure the
primary (x86-64) build of libjpeg-turbo as an out-of-tree build, specifying the

@@ -55,10 +55,12 @@ if(CMAKE_SYSTEM_PROCESSOR_LC MATCHES "x86_64" OR
set(CMAKE_SYSTEM_PROCESSOR ${CPU_TYPE})
endif()
elseif(CMAKE_SYSTEM_PROCESSOR_LC STREQUAL "aarch64" OR
CMAKE_SYSTEM_PROCESSOR_LC MATCHES "arm*64*")
CMAKE_SYSTEM_PROCESSOR_LC MATCHES "arm*")
if(BITS EQUAL 64)
set(CPU_TYPE arm64)
elseif(CMAKE_SYSTEM_PROCESSOR_LC MATCHES "arm*")
else()
set(CPU_TYPE arm)
endif()
elseif(CMAKE_SYSTEM_PROCESSOR_LC MATCHES "ppc*" OR
CMAKE_SYSTEM_PROCESSOR_LC MATCHES "powerpc*")
set(CPU_TYPE powerpc)

16
ChangeLog.md
@@ -33,7 +33,7 @@ approximately 2x when using the fast integer IDCT
The overall decompression speedup for RGB images is now approximately
2.3-3.7x (compared to 2-3.5x with libjpeg-turbo 2.0.x.)

3. 32-bit (ARMv7 or ARMv7s) iOS builds of libjpeg-turbo are no longer
3. 32-bit (Armv7 or Armv7s) iOS builds of libjpeg-turbo are no longer
supported, and the libjpeg-turbo build system can no longer be used to package
such builds. 32-bit iOS apps cannot run in iOS 11 and later, and the App Store
no longer allows them.
@@ -61,10 +61,10 @@ higher-frequency scan. libjpeg-turbo now applies block smoothing parameters to
each iMCU row based on which scan generated the pixels in that row, rather than
always using the block smoothing parameters for the most recent scan.

7. Added SIMD acceleration for progressive Huffman encoding on ARM 64-bit
(ARMv8) platforms. This speeds up the compression of full-color progressive
7. Added SIMD acceleration for progressive Huffman encoding on Arm 64-bit
(Armv8) platforms. This speeds up the compression of full-color progressive
JPEGs by about 30-40% on average (relative to libjpeg-turbo 2.0.x) when using
modern ARMv8 CPUs.
modern Armv8 CPUs.

8. Added configure-time and run-time auto-detection of Loongson MMI SIMD
instructions, so that the Loongson MMI SIMD extensions can be included in any
@@ -124,8 +124,8 @@ with `jpeg_skip_scanlines()`, and the issues could not readily be fixed.
- Fixed an issue whereby `jpeg_skip_scanlines()` always returned 0 when
skipping past the end of an image.

3. The ARM 64-bit (ARMv8) NEON SIMD extensions can now be built using MinGW
toolchains targetting ARM64 (AArch64) Windows binaries.
3. The Arm 64-bit (Armv8) Neon SIMD extensions can now be built using MinGW
toolchains targetting Arm64 (AArch64) Windows binaries.

4. Fixed unexpected visual artifacts that occurred when using
`jpeg_crop_scanline()` and interblock smoothing while decompressing only the DC
@@ -198,7 +198,7 @@ other user-visible errant behavior, and given that the lossless transformer
(unlike the decompressor) is not generally exposed to arbitrary data exploits,
this issue did not likely pose a security risk.

6. The ARM 64-bit (ARMv8) NEON SIMD assembly code now stores constants in a
6. The Arm 64-bit (Armv8) Neon SIMD assembly code now stores constants in a
separate read-only data section rather than in the text section, to support
execute-only memory layouts.

@@ -484,7 +484,7 @@ algorithm that caused incorrect dithering in the output image. This algorithm
now produces bitwise-identical results to the unmerged algorithms.

12. The SIMD function symbols for x86[-64]/ELF, MIPS/ELF, macOS/x86[-64] (if
libjpeg-turbo is built with YASM), and iOS/ARM[64] builds are now private.
libjpeg-turbo is built with YASM), and iOS/Arm[64] builds are now private.
This prevents those symbols from being exposed in applications or shared
libraries that link statically with libjpeg-turbo.


@@ -2,8 +2,8 @@ Background
==========

libjpeg-turbo is a JPEG image codec that uses SIMD instructions to accelerate
baseline JPEG compression and decompression on x86, x86-64, ARM, PowerPC, and
MIPS systems, as well as progressive JPEG compression on x86, x86-64, and ARMv8
baseline JPEG compression and decompression on x86, x86-64, Arm, PowerPC, and
MIPS systems, as well as progressive JPEG compression on x86, x86-64, and Armv8
systems. On such systems, libjpeg-turbo is generally 2-6x as fast as libjpeg,
all else being equal. On other types of systems, libjpeg-turbo can still
outperform libjpeg by a significant amount, by virtue of its highly-optimized

@@ -23,11 +23,18 @@ set(RPMARCH ${CMAKE_SYSTEM_PROCESSOR})
if(CPU_TYPE STREQUAL "x86_64")
set(DEBARCH amd64)
elseif(CMAKE_SYSTEM_PROCESSOR MATCHES "armv7*")
set(RPMARCH armv7hl)
set(DEBARCH armhf)
elseif(CPU_TYPE STREQUAL "arm64")
set(DEBARCH ${CPU_TYPE})
elseif(CPU_TYPE STREQUAL "arm")
if(CMAKE_C_COMPILER MATCHES "gnueabihf")
set(RPMARCH armv7hl)
set(DEBARCH armhf)
else()
set(RPMARCH armel)
set(DEBARCH armel)
endif()
elseif(CMAKE_SYSTEM_PROCESSOR_LC STREQUAL "ppc64le")
set(DEBARCH ppc64el)
elseif(CPU_TYPE STREQUAL "powerpc" AND BITS EQUAL 32)
@@ -128,7 +135,7 @@ endif() # WIN32
if(APPLE)

set(IOS_ARMV8_BUILD "" CACHE PATH
"Directory containing ARMv8 iOS build to include in universal binaries")
"Directory containing Armv8 iOS build to include in universal binaries")

set(OSX_APP_CERT_NAME "" CACHE STRING
"Name of the Developer ID Application certificate (in the macOS keychain) that should be used to sign the libjpeg-turbo DMG. Leave this blank to generate an unsigned DMG.")

8
jchuff.c
@@ -35,10 +35,10 @@
* memory footprint by 64k, which is important for some mobile applications
* that create many isolated instances of libjpeg-turbo (web browsers, for
* instance.) This may improve performance on some mobile platforms as well.
* This feature is enabled by default only on ARM processors, because some x86
* This feature is enabled by default only on Arm processors, because some x86
* chips have a slow implementation of bsr, and the use of clz/bsr cannot be
* shown to have a significant performance impact even on the x86 chips that
* have a fast implementation of it. When building for ARMv6, you can
* have a fast implementation of it. When building for Armv6, you can
* explicitly disable the use of clz/bsr by adding -mthumb to the compiler
* flags (this defines __thumb__).
*/
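The comment in the hunk above refers to using a count-leading-zeros (clz) or bit-scan-reverse (bsr) instruction instead of a lookup table to find how many bits a coefficient magnitude needs. The following is a minimal sketch of that idea, not the code from jchuff.c. It assumes a GCC- or Clang-compatible compiler that provides `__builtin_clz`, and the names `nbits_clz` and `nbits_loop` are hypothetical.

```c
#include <assert.h>
#include <limits.h>

/* Number of bits needed to represent a nonzero magnitude, computed with the
 * compiler's count-leading-zeros builtin (GCC/Clang).  Hypothetical helper,
 * not taken from libjpeg-turbo. */
static int nbits_clz(unsigned int x)
{
  assert(x != 0);  /* __builtin_clz(0) is undefined */
  return (int)(sizeof(unsigned int) * CHAR_BIT) - __builtin_clz(x);
}

/* Portable reference: shift right until the value is exhausted. */
static int nbits_loop(unsigned int x)
{
  int n = 0;

  while (x) {
    n++;
    x >>= 1;
  }
  return n;
}
```

Both functions return the same result for every nonzero input. The builtin variant needs no table in memory, which is the footprint saving the comment describes.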
@@ -73,7 +73,7 @@ typedef size_t bit_buf_type;
#endif

/* NOTE: The more optimal Huffman encoding algorithm has not yet been
* implemented in the ARM NEON SIMD extensions, which is why we retain the old
* implemented in the Arm Neon SIMD extensions, which is why we retain the old
* Huffman encoder behavior for that platform.
*/
#if defined(WITH_SIMD) && !(defined(__arm__) || defined(__aarch64__))
@@ -98,7 +98,7 @@ typedef struct {
simd_bit_buf_type simd;
} put_buffer; /* current bit accumulation buffer */
int free_bits; /* # of bits available in it */
/* (ARM SIMD: # of bits now in it) */
/* (Arm SIMD: # of bits now in it) */
int last_dc_val[MAX_COMPS_IN_SCAN]; /* last DC coef for each component */
} savable_state;


@@ -43,10 +43,10 @@
* memory footprint by 64k, which is important for some mobile applications
* that create many isolated instances of libjpeg-turbo (web browsers, for
* instance.) This may improve performance on some mobile platforms as well.
* This feature is enabled by default only on ARM processors, because some x86
* This feature is enabled by default only on Arm processors, because some x86
* chips have a slow implementation of bsr, and the use of clz/bsr cannot be
* shown to have a significant performance impact even on the x86 chips that
* have a fast implementation of it. When building for ARMv6, you can
* have a fast implementation of it. When building for Armv6, you can
* explicitly disable the use of clz/bsr by adding -mthumb to the compiler
* flags (this defines __thumb__).
*/

@@ -1,4 +1,4 @@
libjpeg-turbo is a JPEG image codec that uses SIMD instructions to accelerate baseline JPEG compression and decompression on x86, x86-64, ARM, PowerPC, and MIPS systems, as well as progressive JPEG compression on x86, x86-64, and ARMv8 systems. On such systems, libjpeg-turbo is generally 2-6x as fast as libjpeg, all else being equal. On other types of systems, libjpeg-turbo can still outperform libjpeg by a significant amount, by virtue of its highly-optimized Huffman coding routines. In many cases, the performance of libjpeg-turbo rivals that of proprietary high-speed JPEG codecs.
libjpeg-turbo is a JPEG image codec that uses SIMD instructions to accelerate baseline JPEG compression and decompression on x86, x86-64, Arm, PowerPC, and MIPS systems, as well as progressive JPEG compression on x86, x86-64, and Armv8 systems. On such systems, libjpeg-turbo is generally 2-6x as fast as libjpeg, all else being equal. On other types of systems, libjpeg-turbo can still outperform libjpeg by a significant amount, by virtue of its highly-optimized Huffman coding routines. In many cases, the performance of libjpeg-turbo rivals that of proprietary high-speed JPEG codecs.

libjpeg-turbo implements both the traditional libjpeg API as well as the less powerful but more straightforward TurboJPEG API. libjpeg-turbo also features colorspace extensions that allow it to compress from/decompress to 32-bit and big-endian pixel buffers (RGBX, XBGR, etc.), as well as a full-featured Java interface.


@@ -9,9 +9,9 @@ Homepage: @PKGURL@
Installed-Size: {__SIZE}
Description: A SIMD-accelerated JPEG codec that provides both the libjpeg and TurboJPEG APIs
libjpeg-turbo is a JPEG image codec that uses SIMD instructions to accelerate
baseline JPEG compression and decompression on x86, x86-64, ARM, PowerPC, and
baseline JPEG compression and decompression on x86, x86-64, Arm, PowerPC, and
MIPS systems, as well as progressive JPEG compression on x86, x86-64, and
ARMv8 systems. On such systems, libjpeg-turbo is generally 2-6x as fast as
Armv8 systems. On such systems, libjpeg-turbo is generally 2-6x as fast as
libjpeg, all else being equal. On other types of systems, libjpeg-turbo can
still outperform libjpeg by a significant amount, by virtue of its
highly-optimized Huffman coding routines. In many cases, the performance of

@@ -54,7 +54,11 @@ makedeb()

if [ $SUPPLEMENT = 1 ]; then
PKGNAME=$PKGNAME\32
if [ "$DEBARCH" = "i386" ]; then
DEBARCH=amd64
else
DEBARCH=arm64
fi
fi

umask 022
@@ -110,6 +114,8 @@ if [ ! `uid` -eq 0 ]; then
fi

makedeb 0
if [ "$DEBARCH" = "i386" ]; then makedeb 1; fi
if [ "$DEBARCH" = "i386" -o "$DEBARCH" = "armel" -o "$DEBARCH" = "armhf" ]; then
makedeb 1
fi

exit

@@ -160,7 +160,7 @@ install_ios()
}

if [ "$BUILDDIRARMV8" != "" ]; then
install_ios $BUILDDIRARMV8 ARMv8 armv8 arm64
install_ios $BUILDDIRARMV8 Armv8 armv8 arm64
fi

install_name_tool -id $LIBDIR/$LIBJPEG_DSO_NAME $PKGROOT/$LIBDIR/$LIBJPEG_DSO_NAME

@@ -52,8 +52,8 @@ Provides: %{name} = %{version}-%{release}, @CMAKE_PROJECT_NAME@ = %{version}-%{r

%description
libjpeg-turbo is a JPEG image codec that uses SIMD instructions to accelerate
baseline JPEG compression and decompression on x86, x86-64, ARM, PowerPC, and
MIPS systems, as well as progressive JPEG compression on x86, x86-64, and ARMv8
baseline JPEG compression and decompression on x86, x86-64, Arm, PowerPC, and
MIPS systems, as well as progressive JPEG compression on x86, x86-64, and Armv8
systems. On such systems, libjpeg-turbo is generally 2-6x as fast as libjpeg,
all else being equal. On other types of systems, libjpeg-turbo can still
outperform libjpeg by a significant amount, by virtue of its highly-optimized

@@ -208,7 +208,7 @@ endif()


###############################################################################
# ARM (GAS)
# Arm (GAS)
###############################################################################

elseif(CPU_TYPE STREQUAL "arm64" OR CPU_TYPE STREQUAL "arm")

@@ -13,7 +13,7 @@
*
* This file contains the interface between the "normal" portions
* of the library and the SIMD implementations when running on a
* 32-bit ARM architecture.
* 32-bit Arm architecture.
*/

#define JPEG_INTERNALS
@@ -118,7 +118,7 @@ init_simd(void)
#if defined(__ARM_NEON__)
simd_support |= JSIMD_NEON;
#elif defined(__linux__) || defined(ANDROID) || defined(__ANDROID__)
/* We still have a chance to use NEON regardless of globally used
/* We still have a chance to use Neon regardless of globally used
* -mcpu/-mfpu options passed to gcc by performing runtime detection via
* /proc/cpuinfo parsing on linux/android */
while (!parse_proc_cpuinfo(bufsize)) {

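The hunk above mentions run-time Neon detection by parsing /proc/cpuinfo. The body of `parse_proc_cpuinfo()` is not part of this diff, so the following is only a sketch of the general technique under stated assumptions: on 32-bit Arm Linux the `Features` line lists `neon` as a whitespace-separated word, and the helper name `features_has_flag` is hypothetical.

```c
#include <assert.h>
#include <string.h>

/* Return 1 if `line` is a /proc/cpuinfo "Features" line that lists `flag`
 * as a whole word, 0 otherwise.  Hypothetical helper, not the parser used
 * by libjpeg-turbo. */
static int features_has_flag(const char *line, const char *flag)
{
  size_t len = strlen(flag);
  const char *p;

  assert(len > 0);
  if (strncmp(line, "Features", 8) != 0)
    return 0;
  p = strchr(line, ':');
  if (p == NULL)
    return 0;
  p++;  /* skip the colon; p[-1] below is therefore always valid */

  while ((p = strstr(p, flag)) != NULL) {
    char before = p[-1];
    char after = p[len];

    /* Accept only whole-word matches so that "neonx" does not count. */
    if ((before == ' ' || before == '\t' || before == ':') &&
        (after == ' ' || after == '\t' || after == '\n' || after == '\0'))
      return 1;
    p += len;
  }
  return 0;
}
```

A caller would read /proc/cpuinfo line by line and pass each line to the helper. The whole-word check matters because feature names can share prefixes.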
@@ -1,5 +1,5 @@
/*
* ARMv7 NEON optimizations for libjpeg-turbo
* Armv7 Neon optimizations for libjpeg-turbo
*
* Copyright (C) 2009-2011, Nokia Corporation and/or its subsidiary(-ies).
* All Rights Reserved.
@@ -229,7 +229,7 @@ asm_function jsimd_idct_islow_neon
ROW7L .req d30
ROW7R .req d31

/* Load and dequantize coefficients into NEON registers
/* Load and dequantize coefficients into Neon registers
* with the following allocation:
* 0 1 2 3 | 4 5 6 7
* ---------+--------
@@ -261,7 +261,7 @@ asm_function jsimd_idct_islow_neon
vld1.16 {d0, d1, d2, d3}, [ip, :128] /* load constants */
add ip, ip, #16
vmul.s16 q15, q15, q3
vpush {d8-d15} /* save NEON registers */
vpush {d8-d15} /* save Neon registers */
/* 1-D IDCT, pass 1, left 4x8 half */
vadd.s16 d4, ROW7L, ROW3L
vadd.s16 d5, ROW5L, ROW1L
@@ -507,7 +507,7 @@ asm_function jsimd_idct_islow_neon
vqrshrn.s16 d17, q9, #2
vqrshrn.s16 d18, q10, #2
vqrshrn.s16 d19, q11, #2
vpop {d8-d15} /* restore NEON registers */
vpop {d8-d15} /* restore Neon registers */
vqrshrn.s16 d20, q12, #2
/* Transpose the final 8-bit samples and do signed->unsigned conversion */
vtrn.16 q8, q9
@@ -688,7 +688,7 @@ asm_function jsimd_idct_islow_neon
* function from jidctfst.c
*
* Normally 1-D AAN DCT needs 5 multiplications and 29 additions.
* But in ARM NEON case some extra additions are required because VQDMULH
* But in Arm Neon case some extra additions are required because VQDMULH
* instruction can't handle the constants larger than 1. So the expressions
* like "x * 1.082392200" have to be converted to "x * 0.082392200 + x",
* which introduces an extra addition. Overall, there are 6 extra additions
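The comment above explains why a multiplication by 1.082392200 is split into a fractional multiply plus an addition: VQDMULH treats its operands as Q15 fixed-point values, whose range is [-1, 1). A scalar model of one 16-bit lane makes the rewrite concrete. This is a sketch under stated assumptions, not the assembly's actual arithmetic: the Q15 constant 2700 is computed here as round(0.082392200 * 32768) for illustration and may differ from the constant table in the source, and the function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Scalar model of one lane of VQDMULH.S16: saturating doubling multiply
 * returning the high half, i.e. floor((2 * a * b) / 65536). */
static int16_t vqdmulh_s16(int16_t a, int16_t b)
{
  int32_t doubled, q;

  if (a == INT16_MIN && b == INT16_MIN)
    return INT16_MAX;  /* the only input pair whose result saturates */

  doubled = 2 * (int32_t)a * (int32_t)b;
  /* Floor division, so negative products round like an arithmetic shift. */
  q = doubled / 65536;
  if (doubled % 65536 < 0)
    q--;
  return (int16_t)q;
}

/* Illustrative Q15 encoding of 0.082392200 (an assumption, see lead-in). */
#define FRAC_0_082392200_Q15  2700

/* x * 1.082392200 rewritten as x * 0.082392200 + x, as the comment says. */
static int16_t mul_1_082392200(int16_t x)
{
  return (int16_t)(x + vqdmulh_s16(x, FRAC_0_082392200_Q15));
}
```

The extra `x +` term in `mul_1_082392200` is the additional addition that the comment counts.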
@@ -718,7 +718,7 @@ asm_function jsimd_idct_ifast_neon
TMP3 .req r2
TMP4 .req ip

/* Load and dequantize coefficients into NEON registers
/* Load and dequantize coefficients into Neon registers
* with the following allocation:
* 0 1 2 3 | 4 5 6 7
* ---------+--------
@@ -749,7 +749,7 @@ asm_function jsimd_idct_ifast_neon
vmul.s16 q13, q13, q1
vld1.16 {d0}, [ip, :64] /* load constants */
vmul.s16 q15, q15, q3
vpush {d8-d13} /* save NEON registers */
vpush {d8-d13} /* save Neon registers */
/* 1-D IDCT, pass 1 */
vsub.s16 q2, q10, q14
vadd.s16 q14, q10, q14
@@ -842,7 +842,7 @@ asm_function jsimd_idct_ifast_neon
vadd.s16 q14, q5, q3
vsub.s16 q9, q5, q3
vsub.s16 q13, q10, q2
vpop {d8-d13} /* restore NEON registers */
vpop {d8-d13} /* restore Neon registers */
vadd.s16 q10, q10, q2
vsub.s16 q11, q12, q1
vadd.s16 q12, q12, q1
@@ -913,7 +913,7 @@ asm_function jsimd_idct_ifast_neon
*
* NOTE: jpeg-8 has an improved implementation of 4x4 inverse-DCT, which
* requires much less arithmetic operations and hence should be faster.
* The primary purpose of this particular NEON optimized function is
* The primary purpose of this particular Neon optimized function is
* bit exact compatibility with jpeg-6b.
*
* TODO: a bit better instructions scheduling can be achieved by expanding
@@ -1016,7 +1016,7 @@ asm_function jsimd_idct_4x4_neon
adr TMP4, jsimd_idct_4x4_neon_consts
vld1.16 {d0, d1, d2, d3}, [TMP4, :128]

/* Load all COEF_BLOCK into NEON registers with the following allocation:
/* Load all COEF_BLOCK into Neon registers with the following allocation:
* 0 1 2 3 | 4 5 6 7
* ---------+--------
* 0 | d4 | d5
@@ -1126,7 +1126,7 @@ asm_function jsimd_idct_4x4_neon
*
* NOTE: jpeg-8 has an improved implementation of 2x2 inverse-DCT, which
* requires much less arithmetic operations and hence should be faster.
* The primary purpose of this particular NEON optimized function is
* The primary purpose of this particular Neon optimized function is
* bit exact compatibility with jpeg-6b.
*/

@@ -1173,7 +1173,7 @@ asm_function jsimd_idct_2x2_neon
adr TMP2, jsimd_idct_2x2_neon_consts
vld1.16 {d0}, [TMP2, :64]

/* Load all COEF_BLOCK into NEON registers with the following allocation:
/* Load all COEF_BLOCK into Neon registers with the following allocation:
* 0 1 2 3 | 4 5 6 7
* ---------+--------
* 0 | d4 | d5
@@ -1499,7 +1499,7 @@ asm_function jsimd_ycc_\colorid\()_convert_neon
adr ip, jsimd_ycc_\colorid\()_neon_consts
vld1.16 {d0, d1, d2, d3}, [ip, :128]

/* Save ARM registers and handle input arguments */
/* Save Arm registers and handle input arguments */
push {r4, r5, r6, r7, r8, r9, r10, lr}
ldr NUM_ROWS, [sp, #(4 * 8)]
ldr INPUT_BUF0, [INPUT_BUF]
@@ -1507,7 +1507,7 @@ asm_function jsimd_ycc_\colorid\()_convert_neon
ldr INPUT_BUF2, [INPUT_BUF, #8]
.unreq INPUT_BUF

/* Save NEON registers */
/* Save Neon registers */
vpush {d8-d15}

/* Initially set d10, d11, d12, d13 to 0xFF */
@@ -1814,7 +1814,7 @@ asm_function jsimd_\colorid\()_ycc_convert_neon
adr ip, jsimd_\colorid\()_ycc_neon_consts
vld1.16 {d0, d1, d2, d3}, [ip, :128]

/* Save ARM registers and handle input arguments */
/* Save Arm registers and handle input arguments */
push {r4, r5, r6, r7, r8, r9, r10, lr}
ldr NUM_ROWS, [sp, #(4 * 8)]
ldr OUTPUT_BUF0, [OUTPUT_BUF]
@@ -1822,7 +1822,7 @@ asm_function jsimd_\colorid\()_ycc_convert_neon
ldr OUTPUT_BUF2, [OUTPUT_BUF, #8]
.unreq OUTPUT_BUF

/* Save NEON registers */
/* Save Neon registers */
vpush {d8-d15}

/* Outer loop over scanlines */
@@ -2017,7 +2017,7 @@ asm_function jsimd_fdct_ifast_neon
adr TMP, jsimd_fdct_ifast_neon_consts
vld1.16 {d0}, [TMP, :64]

/* Load all DATA into NEON registers with the following allocation:
/* Load all DATA into Neon registers with the following allocation:
* 0 1 2 3 | 4 5 6 7
* ---------+--------
* 0 | d16 | d17 | q8
@@ -2112,8 +2112,8 @@ asm_function jsimd_fdct_ifast_neon
*
* Note: the code uses 2 stage pipelining in order to improve instructions
* scheduling and eliminate stalls (this provides ~15% better
* performance for this function on both ARM Cortex-A8 and
* ARM Cortex-A9 when compared to the non-pipelined variant).
* performance for this function on both Arm Cortex-A8 and
* Arm Cortex-A9 when compared to the non-pipelined variant).
* The instructions which belong to the second stage use different
* indentation for better readiability.
*/

@@ -12,7 +12,7 @@
*
* This file contains the interface between the "normal" portions
* of the library and the SIMD implementations when running on a
* 64-bit ARM architecture.
* 64-bit Arm architecture.
*/

#define JPEG_INTERNALS
@@ -115,8 +115,8 @@ parse_proc_cpuinfo(int bufsize)
*/

/*
* ARMv8 architectures support NEON extensions by default.
* It is no longer optional as it was with ARMv7.
* Armv8 architectures support Neon extensions by default.
* It is no longer optional as it was with Armv7.
*/



@@ -1,5 +1,5 @@
/*
* ARMv8 NEON optimizations for libjpeg-turbo
* Armv8 Neon optimizations for libjpeg-turbo
*
* Copyright (C) 2009-2011, Nokia Corporation and/or its subsidiary(-ies).
* All Rights Reserved.
@@ -625,7 +625,7 @@ asm_function jsimd_idct_islow_neon
shrn2 v5.8h, v15.4s, #16 /* wsptr[DCTSIZE*3] = (int)DESCALE(tmp13 + tmp0, CONST_BITS+PASS1_BITS+3) */
shrn2 v6.8h, v17.4s, #16 /* wsptr[DCTSIZE*4] = (int)DESCALE(tmp13 - tmp0, CONST_BITS+PASS1_BITS+3) */
movi v0.16b, #(CENTERJSAMPLE)
/* Prepare pointers (dual-issue with Neon instructions) */
/* Prepare pointers (dual-issue with NEON instructions) */
/* Prepare pointers (dual-issue with Neon instructions) */
ldp TMP1, TMP2, [OUTPUT_BUF], 16
sqrshrn v28.8b, v2.8h, #(CONST_BITS+PASS1_BITS+3-16)
ldp TMP3, TMP4, [OUTPUT_BUF], 16
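The comments in the hunk above refer to `DESCALE(x, CONST_BITS+PASS1_BITS+3)`, which in the libjpeg C code is a rounding right shift: add half of the divisor, then shift. The sketch below models that operation in portable C. It is an illustration, not the library's macro. `CONST_BITS` = 13 and `PASS1_BITS` = 2 are the values the C slow integer IDCT uses for 8-bit samples and are restated here as assumptions, and the function name `descale` is hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define CONST_BITS  13  /* assumed, as in the C slow integer IDCT */
#define PASS1_BITS  2   /* assumed, as in the C slow integer IDCT */

/* Rounding right shift: floor((x + 2^(n-1)) / 2^n).  Written with division
 * so that the result for negative inputs does not depend on how the
 * compiler implements >> on signed values. */
static int32_t descale(int32_t x, int n)
{
  int64_t rounded, divisor, q;

  assert(n > 0 && n < 31);
  rounded = (int64_t)x + ((int64_t)1 << (n - 1));
  divisor = (int64_t)1 << n;
  q = rounded / divisor;
  if (rounded % divisor < 0)
    q--;  /* turn truncation toward zero into floor */
  return (int32_t)q;
}
```

With these values the total shift is 18 bits. The assembly in the hunk appears to split it into a plain 16-bit narrowing shift (`shrn2 ... #16`) followed by a rounding, saturating shift by the remaining `CONST_BITS+PASS1_BITS+3-16` bits.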
@@ -1006,7 +1006,7 @@ asm_function jsimd_idct_islow_neon
* function from jidctfst.c
*
* Normally 1-D AAN DCT needs 5 multiplications and 29 additions.
* But in ARM NEON case some extra additions are required because VQDMULH
* But in Arm Neon case some extra additions are required because VQDMULH
* instruction can't handle the constants larger than 1. So the expressions
* like "x * 1.082392200" have to be converted to "x * 0.082392200 + x",
* which introduces an extra addition. Overall, there are 6 extra additions
@@ -1038,7 +1038,7 @@ asm_function jsimd_idct_ifast_neon
instruction ensures that those bits are set to zero. */
uxtw x3, w3

/* Load and dequantize coefficients into NEON registers
/* Load and dequantize coefficients into Neon registers
* with the following allocation:
* 0 1 2 3 | 4 5 6 7
* ---------+--------
@@ -1051,7 +1051,7 @@ asm_function jsimd_idct_ifast_neon
* 6 | d28 | d29 ( v22.8h )
* 7 | d30 | d31 ( v23.8h )
*/
/* Save NEON registers used in fast IDCT */
/* Save Neon registers used in fast IDCT */
get_symbol_loc TMP5, Ljsimd_idct_ifast_neon_consts
ld1 {v16.8h, v17.8h}, [COEF_BLOCK], 32
ld1 {v0.8h, v1.8h}, [DCT_TABLE], 32
@@ -1156,7 +1156,7 @@ asm_function jsimd_idct_ifast_neon
add v20.8h, v20.8h, v1.8h
/* Descale to 8-bit and range limit */
movi v0.16b, #0x80
/* Prepare pointers (dual-issue with NEON instructions) */
/* Prepare pointers (dual-issue with Neon instructions) */
ldp TMP1, TMP2, [OUTPUT_BUF], 16
sqshrn v28.8b, v16.8h, #5
ldp TMP3, TMP4, [OUTPUT_BUF], 16
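The "Descale to 8-bit and range limit" step in the hunk above can be read as a saturating narrowing shift (`sqshrn ... #5`) followed by a re-centering with the 0x80 constant. The scalar sketch below models that reading for one sample. It rests on an assumption that the visible lines do not confirm, namely that the 0x80 value is later added to the narrowed bytes. The function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define CENTERJSAMPLE  128  /* the 0x80 constant loaded by movi */

/* Scalar model of "sqshrn #shift" on one 16-bit lane: shift right
 * arithmetically, then saturate to the signed 8-bit range. */
static int8_t sqshrn_s16(int16_t x, int shift)
{
  int32_t divisor = (int32_t)1 << shift;
  int32_t v = x / divisor;

  if (x % divisor < 0)
    v--;  /* floor, to match an arithmetic right shift */
  if (v > INT8_MAX)
    v = INT8_MAX;
  if (v < INT8_MIN)
    v = INT8_MIN;
  return (int8_t)v;
}

/* Narrow with saturation, then shift the signed range [-128, 127] up to the
 * unsigned sample range [0, 255]. */
static uint8_t descale_and_range_limit(int16_t x)
{
  return (uint8_t)(sqshrn_s16(x, 5) + CENTERJSAMPLE);
}
```

Because the saturation happens in the signed domain, the range limit comes for free: no input can produce a sample outside 0 to 255.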
@@ -1235,7 +1235,7 @@ asm_function jsimd_idct_ifast_neon
*
* NOTE: jpeg-8 has an improved implementation of 4x4 inverse-DCT, which
* requires much less arithmetic operations and hence should be faster.
* The primary purpose of this particular NEON optimized function is
* The primary purpose of this particular Neon optimized function is
* bit exact compatibility with jpeg-6b.
*
* TODO: a bit better instructions scheduling can be achieved by expanding
@@ -1305,7 +1305,7 @@ asm_function jsimd_idct_4x4_neon
instruction ensures that those bits are set to zero. */
uxtw x3, w3

/* Save all used NEON registers */
/* Save all used Neon registers */
sub sp, sp, 64
mov x9, sp
/* Load constants (v3.4h is just used for padding) */
@@ -1314,7 +1314,7 @@ asm_function jsimd_idct_4x4_neon
st1 {v12.8b, v13.8b, v14.8b, v15.8b}, [x9], 32
ld1 {v0.4h, v1.4h, v2.4h, v3.4h}, [TMP4]

/* Load all COEF_BLOCK into NEON registers with the following allocation:
/* Load all COEF_BLOCK into Neon registers with the following allocation:
* 0 1 2 3 | 4 5 6 7
* ---------+--------
* 0 | v4.4h | v5.4h
@@ -1448,7 +1448,7 @@ asm_function jsimd_idct_4x4_neon
*
* NOTE: jpeg-8 has an improved implementation of 2x2 inverse-DCT, which
* requires much less arithmetic operations and hence should be faster.
* The primary purpose of this particular NEON optimized function is
* The primary purpose of this particular Neon optimized function is
* bit exact compatibility with jpeg-6b.
*/

@@ -1497,7 +1497,7 @@ asm_function jsimd_idct_2x2_neon
st1 {v12.8b, v13.8b, v14.8b, v15.8b}, [x9], 32
ld1 {v14.4h}, [TMP2]

/* Load all COEF_BLOCK into NEON registers with the following allocation:
/* Load all COEF_BLOCK into Neon registers with the following allocation:
* 0 1 2 3 | 4 5 6 7
* ---------+--------
* 0 | v4.4h | v5.4h
@@ -1871,7 +1871,7 @@ asm_function jsimd_ycc_\colorid\()_convert_neon_slowst3
/* Load constants to d1, d2, d3 (v0.4h is just used for padding) */
get_symbol_loc x15, Ljsimd_ycc_rgb_neon_consts

/* Save NEON registers */
/* Save Neon registers */
st1 {v8.8b, v9.8b, v10.8b, v11.8b}, [x9], 32
st1 {v12.8b, v13.8b, v14.8b, v15.8b}, [x9], 32
ld1 {v0.4h, v1.4h}, [x15], 16
@@ -2156,7 +2156,7 @@ generate_jsimd_ycc_rgb_convert_neon extbgr, 24, 2, .4h, 1, .4h, 0, .4h, .8b,
.endm

/* TODO: expand macros and interleave instructions if some in-order
* ARM64 processor actually can dual-issue LOAD/STORE with ALU */
* AArch64 processor actually can dual-issue LOAD/STORE with ALU */
.macro do_rgb_to_yuv_stage2_store_load_stage1 fast_ld3
do_rgb_to_yuv_stage2
do_load \bpp, 8, \fast_ld3
@@ -2196,7 +2196,7 @@ asm_function jsimd_\colorid\()_ycc_convert_neon_slowld3
ldr OUTPUT_BUF2, [OUTPUT_BUF, #16]
.unreq OUTPUT_BUF

/* Save NEON registers */
/* Save Neon registers */
sub sp, sp, #64
mov x9, sp
st1 {v8.8b, v9.8b, v10.8b, v11.8b}, [x9], 32
@@ -2410,13 +2410,13 @@ asm_function jsimd_fdct_islow_neon
get_symbol_loc TMP, Ljsimd_fdct_islow_neon_consts
ld1 {v0.8h, v1.8h}, [TMP]

/* Save NEON registers */
/* Save Neon registers */
sub sp, sp, #64
mov x10, sp
st1 {v8.8b, v9.8b, v10.8b, v11.8b}, [x10], 32
st1 {v12.8b, v13.8b, v14.8b, v15.8b}, [x10], 32

/* Load all DATA into NEON registers with the following allocation:
/* Load all DATA into Neon registers with the following allocation:
* 0 1 2 3 | 4 5 6 7
* ---------+--------
* 0 | d16 | d17 | v16.8h
@@ -2643,7 +2643,7 @@ asm_function jsimd_fdct_islow_neon
st1 {v16.8h, v17.8h, v18.8h, v19.8h}, [DATA], 64
st1 {v20.8h, v21.8h, v22.8h, v23.8h}, [DATA]

/* Restore NEON registers */
/* Restore Neon registers */
ld1 {v8.8b, v9.8b, v10.8b, v11.8b}, [sp], 32
ld1 {v12.8b, v13.8b, v14.8b, v15.8b}, [sp], 32

@@ -2695,7 +2695,7 @@ asm_function jsimd_fdct_ifast_neon
get_symbol_loc TMP, Ljsimd_fdct_ifast_neon_consts
ld1 {v0.4h}, [TMP]

/* Load all DATA into NEON registers with the following allocation:
/* Load all DATA into Neon registers with the following allocation:
* 0 1 2 3 | 4 5 6 7
* ---------+--------
* 0 | d16 | d17 | v0.8h
@@ -3080,7 +3080,7 @@ asm_function jsimd_huff_encode_one_block_neon_slowtbl
.endif
sub sp, sp, 272
sub BUFFER, BUFFER, #0x1 /* BUFFER=buffer-- */
/* Save ARM registers */
/* Save Arm registers */
stp x19, x20, [sp]
get_symbol_loc x15, Ljsimd_huff_encode_one_block_neon_consts
ldr PUT_BUFFER, [x0, #0x10]