2337 Commits

Author SHA1 Message Date
Jonathan Wright
eb14189caa Fix Neon SIMD build issues with Visual Studio
- Use the _M_ARM and _M_ARM64 macros provided by Visual Studio for
  compile-time detection of Arm builds, since __arm__ and __aarch64__
  are only present in GNU-compatible compilers.
- Neon/intrinsics: Use the _CountLeadingZeros() and
  _CountLeadingZeros64() intrinsics provided by Visual Studio, since
  __builtin_clz() and __builtin_clzl() are only present in
  GNU-compatible compilers.
- Neon/intrinsics: Since Visual Studio does not support static vector
  initialization, replace static initialization of Neon vectors with the
  appropriate intrinsics.  Compared to the static initialization
  approach, this produces identical assembly code with both GCC and
  Clang.
- Neon/intrinsics: Since Visual Studio does not support inline assembly
  code, provide alternative code paths for Visual Studio whenever inline
  assembly is used.
- Build: Set FLOATTEST appropriately for AArch64 Visual Studio builds
  (Visual Studio does not emit fused multiply-add [FMA] instructions by
  default for such builds.)
- Neon/intrinsics: Move temporary buffer allocation outside of nested
  loops.  Since Visual Studio configures Arm builds with a relatively
  small amount of stack memory, attempting to allocate those buffers
  within the inner loops caused a stack overflow.

Closes #461
Closes #475
2020-11-24 21:13:16 -06:00
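
The first two items of eb14189caa (compile-time Arm detection and the count-leading-zeros intrinsics) can be sketched roughly as follows. This is an illustrative sketch, not the code from the commit: EXAMPLE_ARM_BUILD and example_clz32() are hypothetical names, and the bit-scanning fallback for other compilers is an assumption added so the example compiles everywhere.

```c
#include <assert.h>
#include <stdint.h>

/* Compile-time Arm detection.  Visual Studio defines _M_ARM / _M_ARM64,
 * whereas GNU-compatible compilers define __arm__ / __aarch64__.
 * EXAMPLE_ARM_BUILD is a hypothetical macro name. */
#if defined(__arm__) || defined(__aarch64__) || \
    defined(_M_ARM) || defined(_M_ARM64)
#define EXAMPLE_ARM_BUILD  1
#else
#define EXAMPLE_ARM_BUILD  0
#endif

#if defined(_MSC_VER) && (defined(_M_ARM) || defined(_M_ARM64))
#include <intrin.h>
#endif

/* Count the leading zero bits of a non-zero 32-bit value. */
static unsigned int example_clz32(uint32_t x)
{
  assert(x != 0);  /* the result of __builtin_clz(0) is undefined */
#if defined(_MSC_VER) && (defined(_M_ARM) || defined(_M_ARM64))
  return _CountLeadingZeros(x);            /* Visual Studio, Arm targets */
#elif defined(__GNUC__) || defined(__clang__)
  return (unsigned int)__builtin_clz(x);   /* GNU-compatible compilers */
#else
  /* Assumed portable fallback: shift until the top bit is set. */
  unsigned int n = 0;

  while (!(x & 0x80000000u)) {
    x <<= 1;
    n++;
  }
  return n;
#endif
}
```

Keeping the compiler check in one small wrapper means the SIMD code itself does not need per-compiler branches.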
DRC
91dd3b23ad ChangeLog: macOS Armv8/x86-64 univ. binary support 2020-11-24 20:31:59 -06:00
DRC
7e0d94d3a7 Merge branch 'master' into dev 2020-11-24 20:31:51 -06:00
DRC
1c839761cf Force Git to treat testorig.ppm as a binary file
Otherwise, because the file begins with an ASCII header, Git will
erroneously treat it as an ASCII file, and if Git for Windows is
configured with default options (specifically, "Checkout windows-style,
commit Unix-style line endings"), it will add carriage return characters
to all of the "linefeed" characters in the PPM file, thus corrupting it
and causing libjpeg-turbo's regression tests to fail.
2020-11-24 18:54:44 -06:00
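
To illustrate why the line-ending conversion described in 1c839761cf corrupts a binary PPM file, the sketch below simulates an LF-to-CRLF conversion over a byte buffer. It is only a model of the effect, not Git's implementation; example_lf_to_crlf() is a hypothetical helper. Any pixel byte that happens to equal 0x0A gains a preceding 0x0D, so the pixel data no longer matches the dimensions in the header.

```c
#include <assert.h>
#include <stddef.h>

/* Simulate a text-mode checkout: insert '\r' before every '\n' byte.
 * dst must have room for 2 * len bytes.  Returns the number of bytes
 * written to dst. */
static size_t example_lf_to_crlf(const unsigned char *src, size_t len,
                                 unsigned char *dst)
{
  size_t i, n = 0;

  assert(src != NULL && dst != NULL);
  for (i = 0; i < len; i++) {
    if (src[i] == '\n')
      dst[n++] = '\r';
    dst[n++] = src[i];
  }
  return n;
}
```

For a 1x1 binary PPM ("P6\n1 1\n255\n" followed by the pixel bytes 0x0A 0x10 0x0A), the 14-byte file grows to 19 bytes, and two of the extra bytes land inside the pixel data.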
DRC
6d91e950c8 Use 5x5 win & 9 AC coeffs when smoothing DC scans
... of progressive images.

Based on:
be8d36d13b
9d528f278e
85f36f0765
63a4d39e38
51336a6ad5

Closes #459
Closes #474
2020-11-24 18:21:42 -06:00
DRC
d523435e18 Travis: Use Xcode 12.2 for all iOS & macOS builds
There doesn't seem to be any performance or compatibility downside to
this, and it has the advantages of simplicity and consistency between
the PR and official builds.
2020-11-19 19:30:51 -06:00
DRC
1ac83cd636 Travis: The Mac build log is now log-macos.txt
(oversight from f7a10a61e3)
2020-11-18 18:16:12 -06:00
DRC
0ba70b6a13 Build: Support macOS Armv8/x86-64 univ. binaries
- Rename IOS_ARMV8_BUILD to ARMV8_BUILD.
- Rename install_ios() to install_subbuild() in makemacpkg.
- Wordsmith the build instructions accordingly.
- Use xcode12.2 image in Travis CI.
2020-11-18 17:40:44 -06:00
DRC
e417033d84 Merge branch 'master' into dev 2020-11-18 14:13:54 -06:00
DRC
6d2e8837b4 jpeg_skip_scanlines(): Avoid NULL + 0 UBSan error
This error occurs at the call to (*cinfo->cconvert->color_convert)() in
sep_upsample() whenever cinfo->upsample->need_context_rows == TRUE
(i.e. whenever h2v2 or h1v2 fancy upsampling is used.)  The error is
innocuous, since (*cinfo->cconvert->color_convert)() points to a dummy
function (noop_convert()) in that case.

Fixes #470
2020-11-18 13:33:47 -06:00
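
The class of error named in 6d2e8837b4 arises because, in C, adding any offset (even 0) to a NULL pointer is undefined behavior, which UBSan reports. A minimal sketch of one way to guard against it is shown below. This is not the fix from the commit; example_advance() is a hypothetical helper.

```c
#include <assert.h>
#include <stddef.h>

/* Return base + offset without ever performing arithmetic on a NULL
 * pointer.  A NULL base is only meaningful with a zero offset. */
static unsigned char *example_advance(unsigned char *base, size_t offset)
{
  if (base == NULL) {
    assert(offset == 0);
    return NULL;
  }
  return base + offset;
}
```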
DRC
f7c5489244 Travis: Add /opt/local/bin to PATH for Mac build
(oversight from previous commit)

macports-ci does this, and it's necessary in order for the build script
to find md5sum.
2020-11-18 10:11:21 -06:00
DRC
f7a10a61e3 Build: "OS X"/"OSX" = "macOS"/"MACOS"
There are no supported versions of "OS X" anymore.  The operating system
has been named "macOS" since 10.12 Sierra, which was released four years
ago.
2020-11-17 13:53:33 -06:00
DRC
d111d9ff7a Merge branch 'master' into dev 2020-11-17 11:54:20 -06:00
DRC
10ba6ed336 Travis: Install MacPorts without using macports-ci 2020-11-16 17:38:06 -06:00
DRC
292d78e786 Merge branch 'master' into dev 2020-11-16 15:28:02 -06:00
DRC
88bf1d1678 Build: Set FLOATTEST more intelligently
The "32bit" vs. "64bit" floating point test results actually have
nothing to do with the FPU.  That was a fallacious assumption based on
the observation that, with multiple CPU types, 32-bit and 64-bit builds
produce different floating point test results.  It seems that this is,
in fact, due to differing compiler behavior-- more specifically, whether
fused multiply-add (FMA) instructions are used to combine multiple
floating point operations into a single instruction ("floating point
expression contraction".)  GCC does this by default if the target
supports FMA instructions, which PowerPC and AArch64 targets both do.

Fixes #468
2020-11-16 15:19:42 -06:00
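
The effect of floating point expression contraction described in 88bf1d1678 can be demonstrated with a small sketch. The function names are hypothetical. The "fused" variant emulates the single rounding of an FMA instruction by computing in double precision, which is exact for the inputs used here but is not a general-purpose FMA replacement.

```c
#include <assert.h>

/* Multiply, round the product to float, then add (two roundings).  The
 * volatile temporary keeps the compiler from contracting the expression
 * or carrying excess precision. */
static float example_mul_then_add(float a, float b, float c)
{
  volatile float product = a * b;

  return product + c;
}

/* Emulate a fused multiply-add (one rounding).  The product of two floats
 * is exactly representable in a double. */
static float example_fused(float a, float b, float c)
{
  volatile double exact = (double)a * (double)b + (double)c;

  return (float)exact;
}

static void example_contraction_demo(void)
{
  const float a = 1.0f + 1.0f / 8192.0f;     /* 1 + 2^-13 */
  const float c = -(1.0f + 1.0f / 4096.0f);  /* -(1 + 2^-12) */

  /* a * a = 1 + 2^-12 + 2^-26; rounding to float drops the 2^-26 term. */
  assert(example_mul_then_add(a, a, c) == 0.0f);
  /* With a single rounding, the 2^-26 term survives. */
  assert(example_fused(a, a, c) == 1.0f / 67108864.0f);
}
```

The same source expression thus yields different results depending on whether the compiler contracts it, which is consistent with the differing test results noted in the commit message.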
DRC
8f8305981b Merge branch 'master' into dev 2020-11-13 15:21:26 -06:00
DRC
42f7c78fe3 BUILDING.md: Use min. iOS v8 in iOS Armv8 example
This is necessary in order to enable thread-local storage.
2020-11-13 15:18:35 -06:00
DRC
33859880e9 Neon: Auto-detect compiler intrinsics completeness
This allows the Neon intrinsics code to be built successfully (albeit
likely with reduced run-time performance) with Xcode 5.0-6.2
(iOS/AArch64) and Android NDK < r19 (AArch32).  Note that Xcode 5.0-6.2
will not build the Armv8 GAS code without gas-preprocessor.pl, and no
version of Xcode will build the Armv7 GAS code without
gas-preprocessor.pl, so we always use the full Neon intrinsics
implementation by default with macOS and iOS builds.

Auto-detecting the completeness of the compiler's set of Neon intrinsics
also allows us to more intelligently set the default value of
NEON_INTRINSICS, based on the values of HAVE_VLD1*.  This is a
reasonable, albeit imperfect, proxy for whether a compiler has a full
and optimal set of Neon intrinsics.  Specific notes:

  - 64-bit RGB-to-YCbCr color conversion
    does not use any of the intrinsics in question, regresses with GCC
  - 64-bit accurate integer forward DCT
    uses vld1_s16_x3(), regresses with GCC
  - 64-bit Huffman encoding
    uses vld1q_u8_x4(), regresses with GCC
  - 64-bit YCbCr-to-RGB color conversion
    does not use any of the intrinsics in question, regresses with GCC
  - 64-bit accurate integer inverse DCT
    uses vld1_s16_x3(), regresses with GCC
  - 64-bit 4x4 inverse DCT
    uses vld1_s16_x3().  I did not test this algorithm in isolation, so
    it may in fact regress with GCC, but the regression may be hidden by
    the speedup from the new SIMD-accelerated upsampling algorithms.

  - 32-bit RGB-to-YCbCr color conversion
    uses vld1_u16_x2(), regresses with GCC
  - 32-bit accurate integer forward DCT
    uses vld1_s16_x3(), regression irrelevant because there was no
    previous implementation
  - 32-bit accurate integer inverse DCT
    uses vld1_s16_x3(), regresses with GCC
  - 32-bit fast integer inverse DCT
    does not use any of the intrinsics in question, regresses with GCC
  - 32-bit 4x4 inverse DCT
    uses vld1_s16_x3().  I did not test this algorithm in isolation, so
    it may in fact regress with GCC, but the regression may be hidden by
    the speedup from the new SIMD-accelerated upsampling algorithms.

Presumably when GCC includes a full and optimal set of Neon intrinsics,
the HAVE_VLD1* tests will pass, and the full Neon intrinsics
implementation will be enabled automatically.
2020-11-13 15:16:34 -06:00
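
As a rough illustration of how a HAVE_VLD1* result from 33859880e9 might be consumed, the sketch below selects between vld1_s16_x3() and three separate vld1_s16() loads. The macro name HAVE_VLD1_S16_X3 is assumed from the HAVE_VLD1* pattern in the commit message, example_sum12() and EXAMPLE_USE_NEON are hypothetical, and the scalar path exists only so that the example also builds on non-Arm targets.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#define EXAMPLE_USE_NEON  1
#endif

/* Sum twelve 16-bit samples. */
static int32_t example_sum12(const int16_t *src)
{
  assert(src != NULL);
#if defined(EXAMPLE_USE_NEON)
#if defined(HAVE_VLD1_S16_X3)
  /* The compiler provides the multi-vector load intrinsic. */
  const int16x4x3_t rows = vld1_s16_x3(src);
#else
  /* Fallback: three single-vector loads fill the same structure. */
  int16x4x3_t rows;

  rows.val[0] = vld1_s16(src);
  rows.val[1] = vld1_s16(src + 4);
  rows.val[2] = vld1_s16(src + 8);
#endif
  /* Widen to 32 bits while accumulating, then add the four lanes. */
  int32x4_t acc = vaddl_s16(rows.val[0], rows.val[1]);

  acc = vaddw_s16(acc, rows.val[2]);
  return vgetq_lane_s32(acc, 0) + vgetq_lane_s32(acc, 1) +
         vgetq_lane_s32(acc, 2) + vgetq_lane_s32(acc, 3);
#else
  /* Scalar path for targets without Neon. */
  int32_t sum = 0;
  size_t i;

  for (i = 0; i < 12; i++)
    sum += src[i];
  return sum;
#endif
}
```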
DRC
3e9e7c7055 Fix build if WITH_12BIT==1 && WITH_JPEG(7|8)==1
Fixes #466
2020-11-11 17:54:06 -06:00
DRC
bbd8089297 Neon: Finalize intrinsics implementation
- Remove gas-preprocessor.pl.  None of the compilers that can build the
  new intrinsics implementation require gas-preprocessor.pl (tested
  with Xcode and with Clang 3.9+ for Linux.)
- Document that Xcode 6.3.x or later is now required for iOS builds
  (older versions of Xcode do not have a full set of Neon intrinsics.)
- Add a change log entry.
- Do not enable the ASM CMake language unless NEON_INTRINSICS is false.
- Add a Clang/Arm64 test to .travis.yml in order to test the new
  intrinsics implementation.

Closes #455
2020-11-10 19:58:28 -06:00
Martyn Jacques
141f26ff6d Neon: Intrinsics impl. of 2x2 and 4x4 scaled IDCTs
The previous AArch32 and AArch64 GAS implementations have been removed,
since the intrinsics implementations provide the same or better
performance.
2020-11-10 19:09:09 -06:00
Jonathan Wright
4574f01f43 Neon: Intrinsics impl. of h2v1 & h2v2 plain upsamp
There was no previous GAS implementation.

NOTE: This doesn't produce much of a speedup when using -O3, because -O3
already enables Neon autovectorization, which works well for the scalar
C implementation of plain upsampling.  However, the Neon SIMD
implementation will benefit other optimization levels.
2020-11-10 19:09:09 -06:00
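
For context on the autovectorization note in 4574f01f43: h2v1 plain upsampling replicates each input sample twice horizontally, which is a simple loop that compilers tend to vectorize well. A minimal scalar sketch follows. It is not the libjpeg-turbo implementation, and example_h2v1_plain_upsample() is a hypothetical name.

```c
#include <assert.h>
#include <stddef.h>

/* Plain (non-fancy) h2v1 upsampling of one row: each input sample is
 * written to two adjacent output samples.  out must hold 2 * in_width
 * samples. */
static void example_h2v1_plain_upsample(const unsigned char *in,
                                        size_t in_width, unsigned char *out)
{
  size_t i;

  assert(in != NULL && out != NULL);
  for (i = 0; i < in_width; i++) {
    out[2 * i] = in[i];
    out[2 * i + 1] = in[i];
  }
}
```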
Jonathan Wright
ba52a3de32 Neon: Intrinsics impl of h2v1 & h2v2 merged upsamp
There was no previous GAS implementation.

This commit also reverts 40557b2301 and 7723d7f7d0.  7723d7f7d0 was only
necessary because there was no Neon implementation of merged
upsampling/color conversion, and 40557b2301 was only necessary because
of 7723d7f7d0.
2020-11-10 19:09:09 -06:00
Jonathan Wright
240ba417aa Neon: Intrinsics impl. of prog. Huffman encoding
The previous AArch64 GAS implementation has been removed, since the
intrinsics implementation provides the same or better performance.
There was no previous AArch32 GAS implementation.
2020-11-10 19:09:09 -06:00
Jonathan Wright
ed581cd935 Neon: Intrinsics impl. of accurate int inverse DCT
The previous AArch32 and AArch64 GAS implementations are retained by
default when using GCC, in order to avoid a performance regression.  The
intrinsics implementation can be forced on or off using the new
NEON_INTRINSICS CMake variable.
2020-11-10 19:09:09 -06:00
Jonathan Wright
2c6b68e283 Neon: Intrinsics impl. of fast integer Inverse DCT
The previous AArch32 GAS implementation is retained by default when
using GCC, in order to avoid a performance regression.  The intrinsics
implementation can be forced on or off using the new NEON_INTRINSICS
CMake variable.  The previous AArch64 GAS implementation has been
removed, since the intrinsics implementation provides the same or better
performance.
2020-11-10 19:09:09 -06:00
Jonathan Wright
2acfb93c94 Neon: Intrinsics impl. of h1v2 fancy upsampling
There was no previous GAS implementation.
2020-11-10 19:09:09 -06:00
Jonathan Wright
975307775c Neon: Intrinsics impl. of h2v1 & h2v2 fancy upsamp
The previous AArch32 GAS implementation of h2v1 fancy upsampling has
been removed, since the intrinsics implementation provides the same or
better performance.  There was no previous GAS implementation of h2v2
fancy upsampling, and there was no previous AArch64 GAS implementation
of h2v1 fancy upsampling.
2020-11-10 19:09:09 -06:00
Jonathan Wright
5dbd39323c Neon: Intrinsics implementation of YCbCr->RGB565
The previous AArch64 GAS implementation is retained by default when
using GCC, in order to avoid a performance regression.  The intrinsics
implementation can be forced on or off using the new NEON_INTRINSICS
CMake variable.  The previous AArch32 GAS implementation has been
removed, since the intrinsics implementation provides the same or better
performance.
2020-11-10 19:09:09 -06:00
Jonathan Wright
0f35cd68f2 Neon: Intrinsics implementation of YCbCr->RGB
The previous AArch64 GAS implementation is retained by default when
using GCC, in order to avoid a performance regression.  The intrinsics
implementation can be forced on or off using the new NEON_INTRINSICS
CMake variable.  The previous AArch32 GAS implementation has been
removed, since the intrinsics implementation provides the same or better
performance.
2020-11-10 19:09:09 -06:00
Jonathan Wright
f3c3f01d23 Neon: Intrinsics impl. of Huffman encoding
The previous AArch64 GAS implementation is retained by default when
using GCC, in order to avoid a performance regression.  The intrinsics
implementation can be forced on or off using the new NEON_INTRINSICS
CMake variable.  The previous AArch32 GAS implementation has been
removed, since the intrinsics implementation provides the same or better
performance.
2020-11-10 19:09:09 -06:00
Jonathan Wright
d0004de5dd Neon: Intrinsics impl. of accurate int forward DCT
The previous AArch64 GAS implementation is retained by default when
using GCC, in order to avoid a performance regression.  The intrinsics
implementation can be forced on or off using the new NEON_INTRINSICS
CMake variable.  There was no previous AArch32 GAS implementation.
2020-11-10 19:09:09 -06:00
Jonathan Wright
3d84668d42 Neon: Intrinsics impl. of fast integer forward DCT
The previous AArch32 and AArch64 GAS implementations have been removed,
since the intrinsics implementation provides the same or better
performance.
2020-11-10 19:09:09 -06:00
Jonathan Wright
951d3677eb Neon: Intrinsics impl. of int sample conv./quant.
The previous AArch32 and AArch64 GAS implementations have been removed,
since the intrinsics implementation provides the same or better
performance.
2020-11-10 19:09:09 -06:00
Jonathan Wright
366168aa7d Neon: Intrinsics impl. of h2v1 & h2v2 downsampling
The previous AArch64 GAS implementation has been removed, since the
intrinsics implementation provides the same or better performance.
There was no previous AArch32 GAS implementation.
2020-11-10 19:09:09 -06:00
Jonathan Wright
f73b1dbc60 Neon: Intrinsics implementation of RGB->Grayscale
There was no previous GAS implementation.
2020-11-10 19:09:09 -06:00
Jonathan Wright
4f2216b435 Neon: Intrinsics implementation of RGB->YCbCr
The previous AArch32 and AArch64 GAS implementations are retained by
default when using GCC, in order to avoid a performance regression.  The
intrinsics implementation can be forced on or off using a new
NEON_INTRINSICS CMake variable.
2020-11-10 19:09:05 -06:00
DRC
0efc4858d4 Merge branch 'master' into dev 2020-11-09 19:02:28 -06:00
DRC
02227e48a9 Travis: Combine PPC/Arm tests with jpeg-7/8 tests
There is no reason not to, since the jpeg-7 and jpeg-8 API/ABI tests do
not exercise the SIMD extensions any differently than the other tests.
2020-11-09 18:20:41 -06:00
DRC
c7dd191271 Merge branch 'master' into dev 2020-11-08 15:15:02 -06:00
DRC
40557b2301 Build: Fix test failures w/ Arm Neon SIMD exts
Regression caused by a46c111d9f

Because of 7723d7f7d0, which was introduced in libjpeg-turbo 1.5.1 in
response to #81, merged upsampling/color conversion is disabled on
platforms that have SIMD-accelerated YCbCr -> RGB color conversion but
not SIMD-accelerated merged upsampling/color conversion.  This was
intended to improve performance with the Neon SIMD extensions, since
those are the only SIMD extensions for which those circumstances apply.
Under normal circumstances, the separate "plain" (non-fancy) upsampling
and color conversion routines will produce bitwise-identical output to
the merged upsampling/color conversion routines, but that is not the
case when skipping scanlines starting at an odd-numbered scanline.  The
modified test introduced in a46c111d9f does precisely that in order to
validate the fixes introduced in 9120a24743 and a46c111d9f.

Because of 7723d7f7d0, the segfault fixed in 9120a24743 and a46c111d9f
didn't affect the Neon SIMD extensions, so this commit effectively
reverts the test modifications in a46c111d9f when using those SIMD
extensions.  We can get rid of this hack, as well as 7723d7f7d0, once a
Neon implementation of merged upsampling/color conversion is available.
2020-11-08 14:57:01 -06:00
DRC
a524b9b06b Travis: Regression-test Armv8 and PPC SIMD exts
Currently this only tests the 64-bit code paths, but it's better than
nothing.
2020-11-06 17:24:16 -06:00
DRC
7c1a1789d2 Merge branch 'master' into dev 2020-11-05 16:04:55 -06:00
DRC
6e632af9f6 Demote "fast" [I]DCT algorithms to legacy status
- Refer to the "slow" [I]DCT algorithms as "accurate" instead, since
  they are not slow under libjpeg-turbo.
- Adjust documentation claims to reflect the fact that the "slow" and
  "fast" algorithms produce about the same performance on AVX2-equipped
  CPUs (because of the dual-lane nature of AVX2, it was not possible to
  accelerate the "fast" algorithm beyond what was achievable with SSE2.)
  Also adjust the claims to reflect the fact that the "fast" algorithm
  tends to be ~5-15% faster than the "slow" algorithm on
  non-AVX2-equipped CPUs, regardless of the use of the libjpeg-turbo
  SIMD extensions.
- Indicate the legacy status of the "fast" and float algorithms in the
  documentation and cjpeg/djpeg usage info.
- Remove obsolete paragraph in the djpeg man page that suggested that
  the float algorithm could be faster than the "fast" algorithm on some
  CPUs.
2020-11-05 15:59:31 -06:00
DRC
cd342acf7f Merge branch 'master' into dev 2020-10-27 16:45:23 -05:00
DRC
c3bfbde21d jpegtran.c: "subarea" = "region"
It is our convention to use the term "region" when referring to crop
specs, since this is more consistent with the terminology used by the
rest of the image processing community.
2020-10-27 15:45:24 -05:00
DRC
a8656d703d jpegtran.1: Minor formatting tweak 2020-10-27 15:45:24 -05:00
DRC
9ecb67c219 transupp.c: Code formatting tweaks 2020-10-27 15:45:24 -05:00
DRC
53c685b7f4 cdjpeg.h: Remove unused function stub
enable_signal_catcher() was only needed with libjpeg's temp. file memory
manager (jmemname.c), which libjpeg-turbo has never supported.
2020-10-27 15:45:24 -05:00