Merge commit '8a2cad020171184a49fa8696df0b9e267f1cf2f6'

* commit '8a2cad020171184a49fa8696df0b9e267f1cf2f6': (99 commits)
  Build: Handle CMAKE_OSX_ARCHITECTURES=(i386|ppc)
  Add Sponsor button for GitHub repository
  Build: Support CMAKE_OSX_ARCHITECTURES
  cjpeg: Fix FPE when compressing 0-width GIF
  Fix build with Visual C++ and /std:c11 or /std:c17
  Neon: Fix Huffman enc. error w/Visual Studio+Clang
  Use CLZ compiler intrinsic for Windows/Arm builds
  Build: Use correct SIMD exts w/VStudio IDE + Arm64
  jcphuff.c: Fix compiler warning with clang-cl
  Migrate from Travis CI to GitHub Actions
  tjexample.c: Fix mem leak if tjTransform() fails
  Build: Officially support Ninja
  decompress_smooth_data(): Fix another uninit. read
  LICENSE.md: Remove trailing whitespace
  Build: Test for correct AArch32 RPM/DEBARCH value
  LICENSE.md: Formatting tweak
  Fix uninitialized read in decompress_smooth_data()
  Fix buffer overrun with certain narrow prog JPEGs
  Bump revision to 2.0.91 for post-beta fixes
  Travis: Use Docker tag that matches Git branch
  ...
Kornel
2021-02-26 21:30:09 +00:00
166 changed files with 17693 additions and 12607 deletions

.gitattributes

@@ -2,3 +2,4 @@
 /appveyor.yml export-ignore
 /ci export-ignore
 /.gitattributes export-ignore
+*.ppm binary

BUILDING.md

@@ -12,10 +12,7 @@ Build Requirements
 - [NASM](http://www.nasm.us) or [YASM](http://yasm.tortall.net)
   (if building x86 or x86-64 SIMD extensions)
-  * If using NASM, 2.10 or later is required.
-  * If using NASM, 2.10 or later (except 2.11.08) is required for an x86-64 Mac
-    build (2.11.08 does not work properly with libjpeg-turbo's x86-64 SIMD code
-    when building macho64 objects.)
+  * If using NASM, 2.13 or later is required.
   * If using YASM, 1.2.0 or later is required.
   * If building on macOS, NASM or YASM can be obtained from
     [MacPorts](http://www.macports.org/) or [Homebrew](http://brew.sh/).
@@ -49,10 +46,8 @@ Build Requirements
 - If building the TurboJPEG Java wrapper, JDK or OpenJDK 1.5 or later is
   required. Most modern Linux distributions, as well as Solaris 10 and later,
-  include JDK or OpenJDK. On OS X 10.5 and 10.6, it will be necessary to
-  install the Java Developer Package, which can be downloaded from
-  <http://developer.apple.com/downloads> (Apple ID required.) For other
-  systems, you can obtain the Oracle Java Development Kit from
+  include JDK or OpenJDK. For other systems, you can obtain the Oracle Java
+  Development Kit from
   <http://www.oracle.com/technetwork/java/javase/downloads>.
   * If using JDK 11 or later, CMake 3.10.x or later must also be used.
@@ -62,22 +57,22 @@ Build Requirements
 - Microsoft Visual C++ 2005 or later
   If you don't already have Visual C++, then the easiest way to get it is by
-  installing the
-  [Windows SDK](http://msdn.microsoft.com/en-us/windows/bb980924.aspx).
-  The Windows SDK includes both 32-bit and 64-bit Visual C++ compilers and
-  everything necessary to build libjpeg-turbo.
-  * You can also use Microsoft Visual Studio Express/Community Edition, which
-    is a free download. (NOTE: versions prior to 2012 can only be used to
-    build 32-bit code.)
+  installing
+  [Visual Studio Community Edition](https://visualstudio.microsoft.com),
+  which includes everything necessary to build libjpeg-turbo.
+  * You can also download and install the standalone Windows SDK (for Windows 7
+    or later), which includes command-line versions of the 32-bit and 64-bit
+    Visual C++ compilers.
   * If you intend to build libjpeg-turbo from the command line, then add the
     appropriate compiler and SDK directories to the `INCLUDE`, `LIB`, and
     `PATH` environment variables. This is generally accomplished by
-    executing `vcvars32.bat` or `vcvars64.bat` and `SetEnv.cmd`.
-    `vcvars32.bat` and `vcvars64.bat` are part of Visual C++ and are located in
-    the same directory as the compiler. `SetEnv.cmd` is part of the Windows
-    SDK. You can pass optional arguments to `SetEnv.cmd` to specify a 32-bit
-    or 64-bit build environment.
+    executing `vcvars32.bat` or `vcvars64.bat`, which are located in the same
+    directory as the compiler.
+  * If built with Visual C++ 2015 or later, the libjpeg-turbo static libraries
+    cannot be used with earlier versions of Visual C++, and vice versa.
+  * The libjpeg API DLL (**jpeg{version}.dll**) will depend on the C run-time
+    DLLs corresponding to the version of Visual C++ that was used to build it.
 ... OR ...
@@ -120,6 +115,13 @@ directory, whereas *{source_directory}* refers to the libjpeg-turbo source
 directory. For in-tree builds, these directories are the same.
+Ninja
+-----
+In all of the procedures and recipes below, replace `make` with `ninja` and
+`Unix Makefiles` with `Ninja` if using Ninja.
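As a rough illustration of that substitution (a sketch, not text from the file itself), the basic out-of-tree build would become:

    cd {build_directory}
    cmake -G"Ninja" [additional CMake flags] {source_directory}
    ninja

The same substitution applies to the other targets (`ninja test`, `ninja install`, and so on).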
 Build Procedure
 ---------------
@@ -345,7 +347,7 @@ Build Recipes
 -------------
-### 32-bit Build on 64-bit Linux/Unix/Mac
+### 32-bit Build on 64-bit Linux/Unix
 Use export/setenv to set the following environment variables before running
 CMake:
@@ -417,103 +419,9 @@ compression/decompression. This section describes how to build libjpeg-turbo
 for these platforms.
-### Additional build requirements
-- For configurations that require [gas-preprocessor.pl]
-(https://raw.githubusercontent.com/libjpeg-turbo/gas-preprocessor/master/gas-preprocessor.pl),
-it should be installed in your `PATH`.
-### Armv7 (32-bit)
-**gas-preprocessor.pl required**
-The following scripts demonstrate how to build libjpeg-turbo to run on the
-iPhone 3GS-4S/iPad 1st-3rd Generation and newer:
-#### Xcode 4.2 and earlier (LLVM-GCC)
-IOS_PLATFORMDIR=/Developer/Platforms/iPhoneOS.platform
-IOS_SYSROOT=($IOS_PLATFORMDIR/Developer/SDKs/iPhoneOS*.sdk)
-export CFLAGS="-mfloat-abi=softfp -march=armv7 -mcpu=cortex-a8 -mtune=cortex-a8 -mfpu=neon -miphoneos-version-min=3.0"
-cd {build_directory}
-cat <<EOF >toolchain.cmake
-set(CMAKE_SYSTEM_NAME Darwin)
-set(CMAKE_SYSTEM_PROCESSOR arm)
-set(CMAKE_C_COMPILER ${IOS_PLATFORMDIR}/Developer/usr/bin/arm-apple-darwin10-llvm-gcc-4.2)
-EOF
-cmake -G"Unix Makefiles" -DCMAKE_TOOLCHAIN_FILE=toolchain.cmake \
-  -DCMAKE_OSX_SYSROOT=${IOS_SYSROOT[0]} \
-  [additional CMake flags] {source_directory}
-make
-#### Xcode 4.3-4.6 (LLVM-GCC)
-Same as above, but replace the first line with:
-IOS_PLATFORMDIR=/Applications/Xcode.app/Contents/Developer/Platforms/iPhoneOS.platform
-#### Xcode 5 and later (Clang)
-IOS_PLATFORMDIR=/Applications/Xcode.app/Contents/Developer/Platforms/iPhoneOS.platform
-IOS_SYSROOT=($IOS_PLATFORMDIR/Developer/SDKs/iPhoneOS*.sdk)
-export CFLAGS="-mfloat-abi=softfp -arch armv7 -miphoneos-version-min=3.0"
-export ASMFLAGS="-no-integrated-as"
-cd {build_directory}
-cat <<EOF >toolchain.cmake
-set(CMAKE_SYSTEM_NAME Darwin)
-set(CMAKE_SYSTEM_PROCESSOR arm)
-set(CMAKE_C_COMPILER /Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin/clang)
-EOF
-cmake -G"Unix Makefiles" -DCMAKE_TOOLCHAIN_FILE=toolchain.cmake \
-  -DCMAKE_OSX_SYSROOT=${IOS_SYSROOT[0]} \
-  [additional CMake flags] {source_directory}
-make
-### Armv7s (32-bit)
-**gas-preprocessor.pl required**
-The following scripts demonstrate how to build libjpeg-turbo to run on the
-iPhone 5/iPad 4th Generation and newer:
-#### Xcode 4.5-4.6 (LLVM-GCC)
-IOS_PLATFORMDIR=/Applications/Xcode.app/Contents/Developer/Platforms/iPhoneOS.platform
-IOS_SYSROOT=($IOS_PLATFORMDIR/Developer/SDKs/iPhoneOS*.sdk)
-export CFLAGS="-Wall -mfloat-abi=softfp -march=armv7s -mcpu=swift -mtune=swift -mfpu=neon -miphoneos-version-min=6.0"
-cd {build_directory}
-cat <<EOF >toolchain.cmake
-set(CMAKE_SYSTEM_NAME Darwin)
-set(CMAKE_SYSTEM_PROCESSOR arm)
-set(CMAKE_C_COMPILER ${IOS_PLATFORMDIR}/Developer/usr/bin/arm-apple-darwin10-llvm-gcc-4.2)
-EOF
-cmake -G"Unix Makefiles" -DCMAKE_TOOLCHAIN_FILE=toolchain.cmake \
-  -DCMAKE_OSX_SYSROOT=${IOS_SYSROOT[0]} \
-  [additional CMake flags] {source_directory}
-make
-#### Xcode 5 and later (Clang)
-Same as the Armv7 build procedure for Xcode 5 and later, except replace the
-compiler flags as follows:
-export CFLAGS="-Wall -mfloat-abi=softfp -arch armv7s -miphoneos-version-min=6.0"
 ### Armv8 (64-bit)
-**gas-preprocessor.pl required if using Xcode < 6**
+**Xcode 5 or later required, Xcode 6.3.x or later recommended**
 The following script demonstrates how to build libjpeg-turbo to run on the
 iPhone 5S/iPad Mini 2/iPad Air and newer.
@@ -535,9 +443,6 @@ iPhone 5S/iPad Mini 2/iPad Air and newer.
   [additional CMake flags] {source_directory}
 make
-Once built, lipo can be used to combine the Armv7, v7s, and/or v8 variants into
-a universal library.
 Building libjpeg-turbo for Android
 ----------------------------------
@@ -548,6 +453,8 @@ Building libjpeg-turbo for Android platforms requires v13b or later of the
 ### Armv7 (32-bit)
+**NDK r19 or later with Clang recommended**
 The following is a general recipe script that can be modified for your specific
 needs.
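The recipe itself is not part of this hunk; a condensed sketch of an NDK r19+/Clang configuration (assuming the NDK's bundled `android.toolchain.cmake` and placeholder values for the NDK path and minimum API level) might look like:

    NDK_PATH={full path to the NDK directory}
    cd {build_directory}
    cmake -G"Unix Makefiles" \
      -DANDROID_ABI=armeabi-v7a \
      -DANDROID_ARM_MODE=arm \
      -DANDROID_PLATFORM=android-{API level} \
      -DCMAKE_TOOLCHAIN_FILE=${NDK_PATH}/build/cmake/android.toolchain.cmake \
      [additional CMake flags] {source_directory}
    make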
@@ -573,6 +480,8 @@ needs.
 ### Armv8 (64-bit)
+**Clang recommended**
 The following is a general recipe script that can be modified for your specific
 needs.
@@ -747,44 +656,23 @@ Mac
     make dmg
 Create Mac package/disk image. This requires pkgbuild and productbuild, which
-are installed by default on OS X 10.7 and later and which can be obtained by
-installing Xcode 3.2.6 (with the "Unix Development" option) on OS X 10.6.
-Packages built in this manner can be installed on OS X 10.5 and later, but they
-must be built on OS X 10.6 or later.
+are installed by default on OS X/macOS 10.7 and later.
-    make udmg
-This creates a Mac package/disk image that contains universal x86-64/i386/Arm
-binaries. The following CMake variables control which architectures are
-included in the universal binaries. Setting any of these variables to an empty
-string excludes that architecture from the package.
+In order to create a Mac package/disk image that contains universal
+x86-64/Arm binaries, set the following CMake variable:
-* `OSX_32BIT_BUILD`: Directory containing an i386 (32-bit) Mac build of
-  libjpeg-turbo (default: *{source_directory}*/osxx86)
-* `IOS_ARMV7_BUILD`: Directory containing an Armv7 (32-bit) iOS build of
-  libjpeg-turbo (default: *{source_directory}*/iosarmv7)
-* `IOS_ARMV7S_BUILD`: Directory containing an Armv7s (32-bit) iOS build of
-  libjpeg-turbo (default: *{source_directory}*/iosarmv7s)
-* `IOS_ARMV8_BUILD`: Directory containing an Armv8 (64-bit) iOS build of
-  libjpeg-turbo (default: *{source_directory}*/iosarmv8)
+* `ARMV8_BUILD`: Directory containing an Armv8 (64-bit) iOS or macOS build of
+  libjpeg-turbo to include in the universal binaries
-You should first use CMake to configure i386, Armv7, Armv7s, and/or Armv8
-sub-builds of libjpeg-turbo (see "Build Recipes" and "Building libjpeg-turbo
-for iOS" above) in build directories that match those specified in the
-aforementioned CMake variables. Next, configure the primary build of
-libjpeg-turbo as an out-of-tree build, and build it. Once the primary build
-has been built, run `make udmg` from the build directory. The packaging system
-will build the sub-builds, use lipo to combine them into a single set of
-universal binaries, then package the universal binaries in the same manner as
-`make dmg`.
+You should first use CMake to configure an Armv8 sub-build of libjpeg-turbo
+(see "Building libjpeg-turbo for iOS" above, if applicable) in a build
+directory that matches the one specified in the aforementioned CMake variable.
+Next, configure the primary (x86-64) build of libjpeg-turbo as an out-of-tree
+build, specifying the aforementioned CMake variable, and build it. Once the
+primary build has been built, run `make dmg` from the build directory. The
+packaging system will build the sub-build, use lipo to combine it with the
+primary build into a single set of universal binaries, then package the
+universal binaries.
-Cygwin
-------
-    make cygwinpkg
-Build a Cygwin binary package.
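A worked sketch of the new Mac packaging flow described above (directory names are arbitrary, and the Armv8 sub-build could equally be configured with the iOS recipe):

    # Configure (but do not build) the Armv8 sub-build
    mkdir -p {build_directory}/arm64 && cd {build_directory}/arm64
    cmake -G"Unix Makefiles" -DCMAKE_OSX_ARCHITECTURES=arm64 \
      [additional CMake flags] {source_directory}

    # Configure and build the primary x86-64 build, then package
    cd {build_directory}
    cmake -G"Unix Makefiles" -DARMV8_BUILD={build_directory}/arm64 \
      [additional CMake flags] {source_directory}
    make
    make dmg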
 Windows

CMakeLists.txt

@@ -5,7 +5,7 @@ if(CMAKE_EXECUTABLE_SUFFIX)
 endif()
 project(mozjpeg C)
-set(VERSION 4.0.3)
+set(VERSION 4.0.4)
 string(REPLACE "." ";" VERSION_TRIPLET ${VERSION})
 list(GET VERSION_TRIPLET 0 VERSION_MAJOR)
 list(GET VERSION_TRIPLET 1 VERSION_MINOR)
@@ -41,12 +41,19 @@ message(STATUS "VERSION = ${VERSION}, BUILD = ${BUILD}")
 # Detect CPU type and whether we're building 64-bit or 32-bit code
 math(EXPR BITS "${CMAKE_SIZEOF_VOID_P} * 8")
 string(TOLOWER ${CMAKE_SYSTEM_PROCESSOR} CMAKE_SYSTEM_PROCESSOR_LC)
+set(COUNT 1)
+foreach(ARCH ${CMAKE_OSX_ARCHITECTURES})
+  if(COUNT GREATER 1)
+    message(FATAL_ERROR "The libjpeg-turbo build system does not support multiple values in CMAKE_OSX_ARCHITECTURES.")
+  endif()
+  math(EXPR COUNT "${COUNT}+1")
+endforeach()
 if(CMAKE_SYSTEM_PROCESSOR_LC MATCHES "x86_64" OR
   CMAKE_SYSTEM_PROCESSOR_LC MATCHES "amd64" OR
   CMAKE_SYSTEM_PROCESSOR_LC MATCHES "i[0-9]86" OR
   CMAKE_SYSTEM_PROCESSOR_LC MATCHES "x86" OR
   CMAKE_SYSTEM_PROCESSOR_LC MATCHES "ia32")
-  if(BITS EQUAL 64)
+  if(BITS EQUAL 64 OR CMAKE_C_COMPILER_ABI MATCHES "ELF X32")
     set(CPU_TYPE x86_64)
   else()
     set(CPU_TYPE i386)
@@ -57,9 +64,9 @@ if(CMAKE_SYSTEM_PROCESSOR_LC MATCHES "x86_64" OR
 elseif(CMAKE_SYSTEM_PROCESSOR_LC STREQUAL "aarch64" OR
   CMAKE_SYSTEM_PROCESSOR_LC MATCHES "arm*")
   if(BITS EQUAL 64)
     set(CPU_TYPE arm64)
   else()
     set(CPU_TYPE arm)
   endif()
 elseif(CMAKE_SYSTEM_PROCESSOR_LC MATCHES "ppc*" OR
   CMAKE_SYSTEM_PROCESSOR_LC MATCHES "powerpc*")
@@ -67,6 +74,18 @@ elseif(CMAKE_SYSTEM_PROCESSOR_LC MATCHES "ppc*" OR
 else()
   set(CPU_TYPE ${CMAKE_SYSTEM_PROCESSOR_LC})
 endif()
+if(CMAKE_OSX_ARCHITECTURES MATCHES "x86_64" OR
+  CMAKE_OSX_ARCHITECTURES MATCHES "arm64" OR
+  CMAKE_OSX_ARCHITECTURES MATCHES "i386")
+  set(CPU_TYPE ${CMAKE_OSX_ARCHITECTURES})
+endif()
+if(CMAKE_OSX_ARCHITECTURES MATCHES "ppc")
+  set(CPU_TYPE powerpc)
+endif()
+if(MSVC_IDE AND CMAKE_GENERATOR_PLATFORM MATCHES "arm64")
+  set(CPU_TYPE arm64)
+endif()
 message(STATUS "${BITS}-bit build (${CPU_TYPE})")
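For example, the new `CMAKE_OSX_ARCHITECTURES` handling above makes a single-architecture macOS cross-configure look like this (a sketch using the placeholder directories from BUILDING.md):

    cd {build_directory}
    cmake -G"Unix Makefiles" -DCMAKE_OSX_ARCHITECTURES=arm64 \
      [additional CMake flags] {source_directory}
    make

Passing more than one architecture (e.g. `-DCMAKE_OSX_ARCHITECTURES="arm64;x86_64"`) triggers the `FATAL_ERROR` in the loop above.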
@@ -84,7 +103,9 @@ if(WIN32)
     set(CMAKE_INSTALL_DEFAULT_PREFIX "${CMAKE_INSTALL_DEFAULT_PREFIX}64")
   endif()
 else()
+  if(NOT CMAKE_INSTALL_DEFAULT_PREFIX)
     set(CMAKE_INSTALL_DEFAULT_PREFIX /opt/${CMAKE_PROJECT_NAME})
+  endif()
 endif()
 if(CMAKE_INSTALL_PREFIX_INITIALIZED_TO_DEFAULT)
   set(CMAKE_INSTALL_PREFIX "${CMAKE_INSTALL_DEFAULT_PREFIX}" CACHE PATH
@@ -103,6 +124,8 @@ if(CMAKE_INSTALL_PREFIX STREQUAL "${CMAKE_INSTALL_DEFAULT_PREFIX}")
 if(UNIX AND NOT APPLE)
   if(BITS EQUAL 64)
     set(CMAKE_INSTALL_DEFAULT_LIBDIR "lib64")
+  elseif(CMAKE_C_COMPILER_ABI MATCHES "ELF X32")
+    set(CMAKE_INSTALL_DEFAULT_LIBDIR "libx32")
   else()
     set(CMAKE_INSTALL_DEFAULT_LIBDIR "lib32")
   endif()
@@ -135,9 +158,9 @@ endforeach()
 macro(boolean_number var)
   if(${var})
-    set(${var} 1)
+    set(${var} 1 ${ARGN})
   else()
-    set(${var} 0)
+    set(${var} 0 ${ARGN})
   endif()
 endmacro()
@@ -155,8 +178,12 @@ option(WITH_ARITH_DEC "Include arithmetic decoding support when emulating the li
 boolean_number(WITH_ARITH_DEC)
 option(WITH_ARITH_ENC "Include arithmetic encoding support when emulating the libjpeg v6b API/ABI" FALSE)
 boolean_number(WITH_ARITH_ENC)
-option(WITH_JAVA "Build Java wrapper for the TurboJPEG API library (implies ENABLE_SHARED=1)" FALSE)
-boolean_number(WITH_JAVA)
+if(CMAKE_C_COMPILER_ABI MATCHES "ELF X32")
+  set(WITH_JAVA 0)
+else()
+  option(WITH_JAVA "Build Java wrapper for the TurboJPEG API library (implies ENABLE_SHARED=1)" FALSE)
+  boolean_number(WITH_JAVA)
+endif()
 option(WITH_JPEG7 "Emulate libjpeg v7 API/ABI (this makes ${CMAKE_PROJECT_NAME} backward-incompatible with libjpeg v6b)" FALSE)
 boolean_number(WITH_JPEG7)
 option(WITH_JPEG8 "Emulate libjpeg v8 API/ABI (this makes ${CMAKE_PROJECT_NAME} backward-incompatible with libjpeg v6b)" FALSE)
@@ -418,13 +445,6 @@ if(UNIX)
   exit(is_shifting_signed(-0x7F7E80B1L));
   }" RIGHT_SHIFT_IS_UNSIGNED)
   endif()
-  if(CMAKE_CROSSCOMPILING)
-    set(__CHAR_UNSIGNED__ 0)
-  else()
-    check_c_source_runs("int main(void) { return ((char) -1 < 0); }"
-      __CHAR_UNSIGNED__)
-  endif()
 endif()
 if(MSVC)
@@ -550,6 +570,9 @@ endif()
 if(WITH_SIMD)
   add_subdirectory(simd)
+  if(NEON_INTRINSICS)
+    add_definitions(-DNEON_INTRINSICS)
+  endif()
 elseif(NOT WITH_12BIT)
   message(STATUS "SIMD extensions: None (WITH_SIMD = ${WITH_SIMD})")
 endif()
@@ -746,6 +769,8 @@ if(WITH_12BIT)
 set(MD5_PPM_RGB_ISLOW f3301d2219783b8b3d942b7239fa50c0)
 set(MD5_JPEG_422_IFAST_OPT 7322e3bd2f127f7de4b40d4480ce60e4)
 set(MD5_PPM_422_IFAST 79807fa552899e66a04708f533e16950)
+set(MD5_JPEG_440_ISLOW e25c1912e38367be505a89c410c1c2d2)
+set(MD5_PPM_440_ISLOW e7d2e26288870cfcb30f3114ad01e380)
 set(MD5_PPM_422M_IFAST 07737bfe8a7c1c87aaa393a0098d16b0)
 set(MD5_JPEG_420_IFAST_Q100_PROG 008ab68d6ddbba04a8f01deee4e0f9f8)
 set(MD5_PPM_420_Q100_IFAST 1b3730122709f53d007255e8dfd3305e)
@@ -795,6 +820,8 @@ else()
 set(MD5_BMP_RGB_ISLOW_565D 4cfa0928ef3e6bb626d7728c924cfda4)
 set(MD5_JPEG_422_IFAST_OPT 2540287b79d913f91665e660303ab2c8)
 set(MD5_PPM_422_IFAST 35bd6b3f833bad23de82acea847129fa)
+set(MD5_JPEG_440_ISLOW 538bc02bd4b4658fd85de6ece6cbeda6)
+set(MD5_PPM_440_ISLOW 11e7eab7ef7ef3276934bb7e7b6bb377)
 set(MD5_PPM_422M_IFAST 8dbc65323d62cca7c91ba02dd1cfa81d)
 set(MD5_BMP_422M_IFAST_565 3294bd4d9a1f2b3d08ea6020d0db7065)
 set(MD5_BMP_422M_IFAST_565D da98c9c7b6039511be4a79a878a9abc1)
@@ -824,29 +851,7 @@ else()
 set(MD5_PPM_3x2_IFAST fd283664b3b49127984af0a7f118fccd)
 set(MD5_JPEG_420_ISLOW_ARI e986fb0a637a8d833d96e8a6d6d84ea1)
 set(MD5_JPEG_444_ISLOW_PROGARI 0a8f1c8f66e113c3cf635df0a475a617)
-# Since v1.5.1, libjpeg-turbo uses the separate non-fancy upsampling and
-# YCbCr -> RGB color conversion routines rather than merged upsampling/color
-# conversion when fancy upsampling is disabled on platforms that have a SIMD
-# implementation of YCbCr -> RGB color conversion but no SIMD implementation
-# of merged upsampling/color conversion. This was intended to improve the
-# performance of the Arm Neon SIMD extensions, the only SIMD extensions for
-# which those circumstances currently apply. The separate non-fancy
-# upsampling and color conversion routines usually produce bitwise-identical
-# output to the merged upsampling/color conversion routines, but that is not
-# the case when skipping scanlines starting at an odd-numbered scanline. In
-# libjpeg-turbo 2.0.5 and prior, doing that while using merged h2v2
-# upsampling caused a segfault, so this test validates the fix for that
-# segfault. Unfortunately, however, the test also produces different bitwise
-# output when using the Neon SIMD extensions, because of the aforementioned
-# optimization. The easiest workaround is to use the old test from
-# libjpeg-turbo 2.0.5 and prior when using the Neon SIMD extensions. The
-# aforementioned segfault never would have occurred with the Neon SIMD
-# extensions anyhow, since merged upsampling is disabled when using them.
-if((CPU_TYPE STREQUAL "arm64" OR CPU_TYPE STREQUAL "arm") AND WITH_SIMD)
-  set(MD5_PPM_420M_IFAST_ARI 72b59a99bcf1de24c5b27d151bde2437)
-else()
 set(MD5_PPM_420M_IFAST_ARI 57251da28a35b46eecb7177d82d10e0e)
-endif()
 set(MD5_JPEG_420_ISLOW 9a68f56bc76e466aa7e52f415d0f4a5f)
 set(MD5_PPM_420M_ISLOW_2_1 9f9de8c0612f8d06869b960b05abf9c9)
 set(MD5_PPM_420M_ISLOW_15_8 b6875bc070720b899566cc06459b63b7)
@@ -954,7 +959,7 @@ if(CPU_TYPE STREQUAL "x86_64" OR CPU_TYPE STREQUAL "i386")
 endif()
 else()
 if((CPU_TYPE STREQUAL "powerpc" OR CPU_TYPE STREQUAL "arm64") AND
-  NOT CMAKE_C_COMPILER_ID STREQUAL "Clang")
+  NOT CMAKE_C_COMPILER_ID STREQUAL "Clang" AND NOT MSVC)
   set(DEFAULT_FLOATTEST fp-contract)
 else()
   set(DEFAULT_FLOATTEST no-fp-contract)
@@ -1101,6 +1106,16 @@ foreach(libtype ${TEST_LIBTYPES})
   testout_422_ifast.ppm testout_422_ifast_opt.jpg
   ${MD5_PPM_422_IFAST} cjpeg-${libtype}-422-ifast-opt)
+# CC: RGB->YCC SAMP: fullsize/h1v2 FDCT: islow ENT: huff
+add_bittest(cjpeg 440-islow "-sample;1x2;-dct;int"
+  testout_440_islow.jpg ${TESTIMAGES}/testorig.ppm
+  ${MD5_JPEG_440_ISLOW})
+# CC: YCC->RGB SAMP: fullsize/h1v2 fancy IDCT: islow ENT: huff
+add_bittest(djpeg 440-islow "-dct;int"
+  testout_440_islow.ppm testout_440_islow.jpg
+  ${MD5_PPM_440_ISLOW} cjpeg-${libtype}-440-islow)
 # CC: YCC->RGB SAMP: h2v1 merged IDCT: ifast ENT: huff
 add_bittest(djpeg 422m-ifast "-dct;fast;-nosmooth"
   testout_422m_ifast.ppm testout_422_ifast_opt.jpg
@@ -1209,17 +1224,9 @@ foreach(libtype ${TEST_LIBTYPES})
 if(WITH_ARITH_DEC)
 # CC: RGB->YCC SAMP: h2v2 merged IDCT: ifast ENT: arith
-if((CPU_TYPE STREQUAL "arm64" OR CPU_TYPE STREQUAL "arm") AND WITH_SIMD)
-  # Refer to the comment above the definition of MD5_PPM_420M_IFAST_ARI for
-  # an explanation of why this is necessary.
-  add_bittest(djpeg 420m-ifast-ari "-fast;-ppm"
-    testout_420m_ifast_ari.ppm ${TESTIMAGES}/testimgari.jpg
-    ${MD5_PPM_420M_IFAST_ARI})
-else()
 add_bittest(djpeg 420m-ifast-ari "-fast;-skip;1,20;-ppm"
   testout_420m_ifast_ari.ppm ${TESTIMAGES}/testimgari.jpg
   ${MD5_PPM_420M_IFAST_ARI})
-endif()
 add_bittest(jpegtran 420-islow "-revert"
   testout_420_islow.jpg ${TESTIMAGES}/testimgari.jpg
@@ -1425,10 +1432,13 @@ set(EXE ${CMAKE_EXECUTABLE_SUFFIX})
 if(WITH_TURBOJPEG)
   if(ENABLE_SHARED)
-    install(TARGETS turbojpeg tjbench
+    install(TARGETS turbojpeg EXPORT ${CMAKE_PROJECT_NAME}Targets
+      INCLUDES DESTINATION ${CMAKE_INSTALL_INCLUDEDIR}
       ARCHIVE DESTINATION ${CMAKE_INSTALL_LIBDIR}
       LIBRARY DESTINATION ${CMAKE_INSTALL_LIBDIR}
       RUNTIME DESTINATION ${CMAKE_INSTALL_BINDIR})
+    install(TARGETS tjbench
+      RUNTIME DESTINATION ${CMAKE_INSTALL_BINDIR})
     if(NOT CMAKE_VERSION VERSION_LESS "3.1" AND MSVC AND
       CMAKE_C_LINKER_SUPPORTS_PDB)
       install(FILES "$<TARGET_PDB_FILE:turbojpeg>"
@@ -1436,8 +1446,9 @@ if(WITH_TURBOJPEG)
     endif()
   endif()
   if(ENABLE_STATIC)
-    install(TARGETS turbojpeg-static ARCHIVE
-      DESTINATION ${CMAKE_INSTALL_LIBDIR})
+    install(TARGETS turbojpeg-static EXPORT ${CMAKE_PROJECT_NAME}Targets
+      INCLUDES DESTINATION ${CMAKE_INSTALL_INCLUDEDIR}
+      ARCHIVE DESTINATION ${CMAKE_INSTALL_LIBDIR})
     if(NOT ENABLE_SHARED)
       if(MSVC_IDE OR XCODE)
         set(DIR "${CMAKE_CURRENT_BINARY_DIR}/\${CMAKE_INSTALL_CONFIG_NAME}")
@@ -1453,7 +1464,9 @@ if(WITH_TURBOJPEG)
 endif()
 if(ENABLE_STATIC)
-  install(TARGETS jpeg-static ARCHIVE DESTINATION ${CMAKE_INSTALL_LIBDIR})
+  install(TARGETS jpeg-static EXPORT ${CMAKE_PROJECT_NAME}Targets
+    INCLUDES DESTINATION ${CMAKE_INSTALL_INCLUDEDIR}
+    ARCHIVE DESTINATION ${CMAKE_INSTALL_LIBDIR})
   if(NOT ENABLE_SHARED)
     if(MSVC_IDE OR XCODE)
       set(DIR "${CMAKE_CURRENT_BINARY_DIR}/\${CMAKE_INSTALL_CONFIG_NAME}")
@@ -1493,6 +1506,13 @@ endif()
 install(FILES ${CMAKE_CURRENT_BINARY_DIR}/pkgscripts/libjpeg.pc
   ${CMAKE_CURRENT_BINARY_DIR}/pkgscripts/libturbojpeg.pc
   DESTINATION ${CMAKE_INSTALL_LIBDIR}/pkgconfig)
+install(FILES
+  ${CMAKE_CURRENT_BINARY_DIR}/pkgscripts/${CMAKE_PROJECT_NAME}Config.cmake
+  ${CMAKE_CURRENT_BINARY_DIR}/pkgscripts/${CMAKE_PROJECT_NAME}ConfigVersion.cmake
+  DESTINATION ${CMAKE_INSTALL_LIBDIR}/cmake/${CMAKE_PROJECT_NAME})
+install(EXPORT ${CMAKE_PROJECT_NAME}Targets
+  NAMESPACE ${CMAKE_PROJECT_NAME}::
+  DESTINATION ${CMAKE_INSTALL_LIBDIR}/cmake/${CMAKE_PROJECT_NAME})
 install(FILES ${CMAKE_CURRENT_BINARY_DIR}/jconfig.h
   ${CMAKE_CURRENT_SOURCE_DIR}/jerror.h ${CMAKE_CURRENT_SOURCE_DIR}/jmorecfg.h

ChangeLog.md

@@ -1,3 +1,168 @@
2.1 post-beta
=============
### Significant changes relative to 2.1 beta1
1. Fixed a regression introduced by 2.1 beta1[6(b)] whereby attempting to
decompress certain progressive JPEG images with one or more component planes of
width 8 or less caused a buffer overrun.
2. Fixed a regression introduced by 2.1 beta1[6(b)] whereby attempting to
decompress a specially-crafted malformed progressive JPEG image caused the
block smoothing algorithm to read from uninitialized memory.
3. Fixed an issue in the Arm Neon SIMD Huffman encoders that caused the
encoders to generate incorrect results when using the Clang compiler with
Visual Studio.
4. Fixed a floating point exception that occurred when attempting to compress a
specially-crafted malformed GIF image with a specified image width of 0 using
cjpeg.
2.0.90 (2.1 beta1)
==================
### Significant changes relative to 2.0.6:
1. The build system, x86-64 SIMD extensions, and accelerated Huffman codec now
support the x32 ABI on Linux, which allows for using x86-64 instructions with
32-bit pointers. The x32 ABI is generally enabled by adding `-mx32` to the
compiler flags.
Caveats:
- CMake 3.9.0 or later is required in order for the build system to
automatically detect an x32 build.
- Java does not support the x32 ABI, and thus the TurboJPEG Java API will
automatically be disabled with x32 builds.
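For example, a hypothetical x32 build on a Linux/x86-64 host might be configured as follows (flag placement is illustrative; CMake 3.9.0 or later is assumed, per the caveat above):

    cd {build_directory}
    CFLAGS=-mx32 LDFLAGS=-mx32 \
      cmake -G"Unix Makefiles" [additional CMake flags] {source_directory}
    make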
2. Added Loongson MMI SIMD implementations of the RGB-to-grayscale, 4:2:2 fancy
chroma upsampling, 4:2:2 and 4:2:0 merged chroma upsampling/color conversion,
and fast integer DCT/IDCT algorithms. Relative to libjpeg-turbo 2.0.x, this
speeds up:
- the compression of RGB source images into grayscale JPEG images by
approximately 20%
- the decompression of 4:2:2 JPEG images by approximately 40-60% when
using fancy upsampling
- the decompression of 4:2:2 and 4:2:0 JPEG images by approximately
15-20% when using merged upsampling
- the compression of RGB source images by approximately 30-45% when using
the fast integer DCT
- the decompression of JPEG images into RGB destination images by
approximately 2x when using the fast integer IDCT
The overall decompression speedup for RGB images is now approximately
2.3-3.7x (compared to 2-3.5x with libjpeg-turbo 2.0.x.)
3. 32-bit (Armv7 or Armv7s) iOS builds of libjpeg-turbo are no longer
supported, and the libjpeg-turbo build system can no longer be used to package
such builds. 32-bit iOS apps cannot run in iOS 11 and later, and the App Store
no longer allows them.
4. 32-bit (i386) OS X/macOS builds of libjpeg-turbo are no longer supported,
and the libjpeg-turbo build system can no longer be used to package such
builds. 32-bit Mac applications cannot run in macOS 10.15 "Catalina" and
later, and the App Store no longer allows them.
5. The SSE2 (x86 SIMD) and C Huffman encoding algorithms have been
significantly optimized, resulting in a measured average overall compression
speedup of 12-28% for 64-bit code and 22-52% for 32-bit code on various Intel
and AMD CPUs, as well as a measured average overall compression speedup of
0-23% on platforms that do not have a SIMD-accelerated Huffman encoding
implementation.
6. The block smoothing algorithm that is applied by default when decompressing
progressive Huffman-encoded JPEG images has been improved in the following
ways:
- The algorithm is now more fault-tolerant. Previously, if a particular
scan was incomplete, then the smoothing parameters for the incomplete scan
would be applied to the entire output image, including the parts of the image
that were generated by the prior (complete) scan. Visually, this had the
effect of removing block smoothing from lower-frequency scans if they were
followed by an incomplete higher-frequency scan. libjpeg-turbo now applies
block smoothing parameters to each iMCU row based on which scan generated the
pixels in that row, rather than always using the block smoothing parameters for
the most recent scan.
- When applying block smoothing to DC scans, a Gaussian-like kernel with a
5x5 window is used to reduce the "blocky" appearance.
7. Added SIMD acceleration for progressive Huffman encoding on Arm platforms.
This speeds up the compression of full-color progressive JPEGs by about 30-40%
on average (relative to libjpeg-turbo 2.0.x) when using modern Arm CPUs.
8. Added configure-time and run-time auto-detection of Loongson MMI SIMD
instructions, so that the Loongson MMI SIMD extensions can be included in any
MIPS64 libjpeg-turbo build.
9. Added fault tolerance features to djpeg and jpegtran, mainly to demonstrate
methods by which applications can guard against the exploits of the JPEG format
described in the report
["Two Issues with the JPEG Standard"](https://libjpeg-turbo.org/pmwiki/uploads/About/TwoIssueswiththeJPEGStandard.pdf).
- Both programs now accept a `-maxscans` argument, which can be used to
limit the number of allowable scans in the input file.
- Both programs now accept a `-strict` argument, which can be used to
treat all warnings as fatal.
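For example (hypothetical invocations; the file names are placeholders):

    djpeg -maxscans 100 -strict -outfile output.ppm suspicious.jpg
    jpegtran -maxscans 100 -strict -outfile output.jpg suspicious.jpg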
10. CMake package config files are now included for both the libjpeg and
TurboJPEG API libraries. This facilitates using libjpeg-turbo with CMake's
`find_package()` function. For example:
find_package(libjpeg-turbo CONFIG REQUIRED)
add_executable(libjpeg_program libjpeg_program.c)
target_link_libraries(libjpeg_program PUBLIC libjpeg-turbo::jpeg)
add_executable(libjpeg_program_static libjpeg_program.c)
target_link_libraries(libjpeg_program_static PUBLIC
libjpeg-turbo::jpeg-static)
add_executable(turbojpeg_program turbojpeg_program.c)
target_link_libraries(turbojpeg_program PUBLIC
libjpeg-turbo::turbojpeg)
add_executable(turbojpeg_program_static turbojpeg_program.c)
target_link_libraries(turbojpeg_program_static PUBLIC
libjpeg-turbo::turbojpeg-static)
11. Since the Unisys LZW patent has long expired, cjpeg and djpeg can now
read/write both LZW-compressed and uncompressed GIF files (feature ported from
jpeg-6a and jpeg-9d.)
12. jpegtran now includes the `-wipe` and `-drop` options from jpeg-9a and
jpeg-9d, as well as the ability to expand the image size using the `-crop`
option. Refer to jpegtran.1 or usage.txt for more details.
13. Added a complete intrinsics implementation of the Arm Neon SIMD extensions,
thus providing SIMD acceleration on Arm platforms for all of the algorithms
that are SIMD-accelerated on x86 platforms. This new implementation is
significantly faster in some cases than the old GAS implementation--
depending on the algorithms used, the type of CPU core, and the compiler. GCC,
as of this writing, does not provide a full or optimal set of Neon intrinsics,
so for performance reasons, the default when building libjpeg-turbo with GCC is
to continue using the GAS implementation of the following algorithms:
- 32-bit RGB-to-YCbCr color conversion
- 32-bit fast and accurate inverse DCT
- 64-bit RGB-to-YCbCr and YCbCr-to-RGB color conversion
- 64-bit accurate forward and inverse DCT
- 64-bit Huffman encoding
A new CMake variable (`NEON_INTRINSICS`) can be used to override this
default.
Since the new intrinsics implementation includes SIMD acceleration
for merged upsampling/color conversion, 1.5.1[5] is no longer necessary and has
been reverted.
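For example, to force the full Neon intrinsics implementation when building with GCC (a minimal sketch using the placeholder directories from BUILDING.md):

    cd {build_directory}
    cmake -G"Unix Makefiles" -DNEON_INTRINSICS=1 \
      [additional CMake flags] {source_directory}
    make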
14. The Arm Neon SIMD extensions can now be built using Visual Studio.
15. The build system can now be used to generate a universal x86-64 + Armv8
libjpeg-turbo SDK package for both iOS and macOS.
2.0.6
=====

LICENSE.md

@@ -91,7 +91,7 @@ best of our understanding.
 The Modified (3-clause) BSD License
 ===================================
-Copyright (C)2009-2020 D. R. Commander. All Rights Reserved.
+Copyright (C)2009-2021 D. R. Commander. All Rights Reserved.<br>
 Copyright (C)2015 Viktor Szathmáry. All Rights Reserved.
 Redistribution and use in source and binary forms, with or without

README.ijg

@@ -128,7 +128,7 @@ with respect to this software, its quality, accuracy, merchantability, or
 fitness for a particular purpose. This software is provided "AS IS", and you,
 its user, assume the entire risk as to its quality and accuracy.
-This software is copyright (C) 1991-2016, Thomas G. Lane, Guido Vollbeding.
+This software is copyright (C) 1991-2020, Thomas G. Lane, Guido Vollbeding.
 All Rights Reserved except as specified below.
 Permission is hereby granted to use, copy, modify, and distribute this
@@ -159,19 +159,6 @@ commercial products, provided that all warranty or liability claims are
 assumed by the product vendor.
-The IJG distribution formerly included code to read and write GIF files.
-To avoid entanglement with the Unisys LZW patent (now expired), GIF reading
-support has been removed altogether, and the GIF writer has been simplified
-to produce "uncompressed GIFs". This technique does not use the LZW
-algorithm; the resulting GIF files are larger than usual, but are readable
-by all standard GIF decoders.
-We are required to state that
-"The Graphics Interchange Format(c) is the Copyright property of
-CompuServe Incorporated. GIF(sm) is a Service Mark property of
-CompuServe Incorporated."
 REFERENCES
 ==========

cderror.h

@@ -1,9 +1,11 @@
 /*
 * cderror.h
 *
+* This file was part of the Independent JPEG Group's software:
 * Copyright (C) 1994-1997, Thomas G. Lane.
 * Modified 2009-2017 by Guido Vollbeding.
-* This file is part of the Independent JPEG Group's software.
+* libjpeg-turbo Modifications:
+* Copyright (C) 2021, D. R. Commander.
 * For conditions of distribution and use, see the accompanying README.ijg
 * file.
 *
@@ -42,7 +44,7 @@ JMESSAGE(JMSG_FIRSTADDONCODE = 1000, NULL) /* Must be first entry! */
 #ifdef BMP_SUPPORTED
 JMESSAGE(JERR_BMP_BADCMAP, "Unsupported BMP colormap format")
-JMESSAGE(JERR_BMP_BADDEPTH, "Only 8- and 24-bit BMP files are supported")
+JMESSAGE(JERR_BMP_BADDEPTH, "Only 8-, 24-, and 32-bit BMP files are supported")
 JMESSAGE(JERR_BMP_BADHEADER, "Invalid BMP file: bad header length")
 JMESSAGE(JERR_BMP_BADPLANES, "Invalid BMP file: biPlanes not equal to 1")
 JMESSAGE(JERR_BMP_COLORSPACE, "BMP output must be grayscale or RGB")
@@ -50,9 +52,9 @@ JMESSAGE(JERR_BMP_COMPRESSED, "Sorry, compressed BMPs not yet supported")
 JMESSAGE(JERR_BMP_EMPTY, "Empty BMP image")
 JMESSAGE(JERR_BMP_NOT, "Not a BMP file - does not start with BM")
 JMESSAGE(JERR_BMP_OUTOFRANGE, "Numeric value out of range in BMP file")
-JMESSAGE(JTRC_BMP, "%ux%u 24-bit BMP image")
+JMESSAGE(JTRC_BMP, "%ux%u %d-bit BMP image")
 JMESSAGE(JTRC_BMP_MAPPED, "%ux%u 8-bit colormapped BMP image")
-JMESSAGE(JTRC_BMP_OS2, "%ux%u 24-bit OS2 BMP image")
+JMESSAGE(JTRC_BMP_OS2, "%ux%u %d-bit OS2 BMP image")
 JMESSAGE(JTRC_BMP_OS2_MAPPED, "%ux%u 8-bit colormapped OS2 BMP image")
 #endif /* BMP_SUPPORTED */
@@ -60,6 +62,7 @@ JMESSAGE(JTRC_BMP_OS2_MAPPED, "%ux%u 8-bit colormapped OS2 BMP image")
 JMESSAGE(JERR_GIF_BUG, "GIF output got confused")
 JMESSAGE(JERR_GIF_CODESIZE, "Bogus GIF codesize %d")
 JMESSAGE(JERR_GIF_COLORSPACE, "GIF output must be grayscale or RGB")
+JMESSAGE(JERR_GIF_EMPTY, "Empty GIF image")
 JMESSAGE(JERR_GIF_IMAGENOTFOUND, "Too few images in GIF file")
 JMESSAGE(JERR_GIF_NOT, "Not a GIF file")
 JMESSAGE(JTRC_GIF, "%ux%ux%d GIF image")
@@ -84,23 +87,6 @@ JMESSAGE(JTRC_PPM, "%ux%u PPM image")
 JMESSAGE(JTRC_PPM_TEXT, "%ux%u text PPM image")
 #endif /* PPM_SUPPORTED */
-#ifdef RLE_SUPPORTED
-JMESSAGE(JERR_RLE_BADERROR, "Bogus error code from RLE library")
-JMESSAGE(JERR_RLE_COLORSPACE, "RLE output must be grayscale or RGB")
-JMESSAGE(JERR_RLE_DIMENSIONS, "Image dimensions (%ux%u) too large for RLE")
-JMESSAGE(JERR_RLE_EMPTY, "Empty RLE file")
-JMESSAGE(JERR_RLE_EOF, "Premature EOF in RLE header")
-JMESSAGE(JERR_RLE_MEM, "Insufficient memory for RLE header")
-JMESSAGE(JERR_RLE_NOT, "Not an RLE file")
-JMESSAGE(JERR_RLE_TOOMANYCHANNELS, "Cannot handle %d output channels for RLE")
-JMESSAGE(JERR_RLE_UNSUPPORTED, "Cannot handle this RLE setup")
-JMESSAGE(JTRC_RLE, "%ux%u full-color RLE file")
-JMESSAGE(JTRC_RLE_FULLMAP, "%ux%u full-color RLE file with map of length %d")
-JMESSAGE(JTRC_RLE_GRAY, "%ux%u grayscale RLE file")
-JMESSAGE(JTRC_RLE_MAPGRAY, "%ux%u grayscale RLE file with map of length %d")
-JMESSAGE(JTRC_RLE_MAPPED, "%ux%u colormapped RLE file with map of length %d")
-#endif /* RLE_SUPPORTED */
 #ifdef TARGA_SUPPORTED
 JMESSAGE(JERR_TGA_BADCMAP, "Unsupported Targa colormap format")
 JMESSAGE(JERR_TGA_BADPARMS, "Invalid or unsupported Targa file")

cdjpeg.c

@@ -3,8 +3,8 @@
 *
 * This file was part of the Independent JPEG Group's software:
 * Copyright (C) 1991-1997, Thomas G. Lane.
-* It was modified by The libjpeg-turbo Project to include only code relevant
-* to libjpeg-turbo.
+* libjpeg-turbo Modifications:
+* Copyright (C) 2019, D. R. Commander.
 * For conditions of distribution and use, see the accompanying README.ijg
 * file.
 *
@@ -25,26 +25,37 @@
 * Optional progress monitor: display a percent-done figure on stderr.
 */
-#ifdef PROGRESS_REPORT
 METHODDEF(void)
 progress_monitor(j_common_ptr cinfo)
 {
   cd_progress_ptr prog = (cd_progress_ptr)cinfo->progress;
-  int total_passes = prog->pub.total_passes + prog->total_extra_passes;
-  int percent_done =
-    (int)(prog->pub.pass_counter * 100L / prog->pub.pass_limit);
-  if (percent_done != prog->percent_done) {
-    prog->percent_done = percent_done;
-    if (total_passes > 1) {
-      fprintf(stderr, "\rPass %d/%d: %3d%% ",
-              prog->pub.completed_passes + prog->completed_extra_passes + 1,
-              total_passes, percent_done);
-    } else {
-      fprintf(stderr, "\r %3d%% ", percent_done);
-    }
-    fflush(stderr);
-  }
+  if (prog->max_scans != 0 && cinfo->is_decompressor) {
+    int scan_no = ((j_decompress_ptr)cinfo)->input_scan_number;
+    if (scan_no > (int)prog->max_scans) {
+      fprintf(stderr, "Scan number %d exceeds maximum scans (%d)\n", scan_no,
+              prog->max_scans);
+      exit(EXIT_FAILURE);
+    }
+  }
+  if (prog->report) {
+    int total_passes = prog->pub.total_passes + prog->total_extra_passes;
+    int percent_done =
+      (int)(prog->pub.pass_counter * 100L / prog->pub.pass_limit);
+    if (percent_done != prog->percent_done) {
+      prog->percent_done = percent_done;
+      if (total_passes > 1) {
+        fprintf(stderr, "\rPass %d/%d: %3d%% ",
+                prog->pub.completed_passes + prog->completed_extra_passes + 1,
+                total_passes, percent_done);
+      } else {
+        fprintf(stderr, "\r %3d%% ", percent_done);
+      }
+      fflush(stderr);
+    }
+  }
 }
@@ -57,6 +68,8 @@ start_progress_monitor(j_common_ptr cinfo, cd_progress_ptr progress)
 progress->pub.progress_monitor = progress_monitor;
 progress->completed_extra_passes = 0;
 progress->total_extra_passes = 0;
+progress->max_scans = 0;
+progress->report = FALSE;
 progress->percent_done = -1;
 cinfo->progress = &progress->pub;
 }
@@ -73,8 +86,6 @@ end_progress_monitor(j_common_ptr cinfo)
 }
 }
-#endif
 /*
 * Case-insensitive matching of possibly-abbreviated keyword switches.

cdjpeg.h

@@ -3,8 +3,9 @@
 *
 * This file was part of the Independent JPEG Group's software:
 * Copyright (C) 1994-1997, Thomas G. Lane.
+* Modified 2019 by Guido Vollbeding.
 * libjpeg-turbo Modifications:
-* Copyright (C) 2017, D. R. Commander.
+* Copyright (C) 2017, 2019, D. R. Commander.
 * mozjpeg Modifications:
 * Copyright (C) 2014, Mozilla Corporation.
 * For conditions of distribution and use, see the accompanying README.ijg file.
@@ -65,9 +66,9 @@ struct djpeg_dest_struct {
 void (*finish_output) (j_decompress_ptr cinfo, djpeg_dest_ptr dinfo);
 /* Re-calculate buffer dimensions based on output dimensions (for use with
 partial image decompression.) If this is NULL, then the output format
-does not support partial image decompression (BMP and RLE, in particular,
-cannot support partial decompression because they use an inversion buffer
-to write the image in bottom-up order.) */
+does not support partial image decompression (BMP, in particular, cannot
+support partial decompression because it uses an inversion buffer to write
+the image in bottom-up order.) */
 void (*calc_buffer_dimensions) (j_decompress_ptr cinfo,
   djpeg_dest_ptr dinfo);
@@ -96,6 +97,9 @@ struct cdjpeg_progress_mgr {
 struct jpeg_progress_mgr pub; /* fields known to JPEG library */
 int completed_extra_passes; /* extra passes completed */
 int total_extra_passes; /* total extra */
+JDIMENSION max_scans; /* abort if the number of scans exceeds this
+  value and the value is non-zero */
+boolean report; /* whether or not to report progress */
 /* last printed percentage stored here to avoid multiple printouts */
 int percent_done;
 };
@@ -112,21 +116,19 @@ EXTERN(cjpeg_source_ptr) jinit_read_bmp(j_compress_ptr cinfo,
 EXTERN(djpeg_dest_ptr) jinit_write_bmp(j_decompress_ptr cinfo, boolean is_os2,
   boolean use_inversion_array);
 EXTERN(cjpeg_source_ptr) jinit_read_gif(j_compress_ptr cinfo);
-EXTERN(djpeg_dest_ptr) jinit_write_gif(j_decompress_ptr cinfo);
+EXTERN(djpeg_dest_ptr) jinit_write_gif(j_decompress_ptr cinfo, boolean is_lzw);
 EXTERN(cjpeg_source_ptr) jinit_read_ppm(j_compress_ptr cinfo);
 EXTERN(djpeg_dest_ptr) jinit_write_ppm(j_decompress_ptr cinfo);
-EXTERN(cjpeg_source_ptr) jinit_read_rle(j_compress_ptr cinfo);
-EXTERN(djpeg_dest_ptr) jinit_write_rle(j_decompress_ptr cinfo);
 EXTERN(cjpeg_source_ptr) jinit_read_targa(j_compress_ptr cinfo);
 EXTERN(djpeg_dest_ptr) jinit_write_targa(j_decompress_ptr cinfo);
 /* cjpeg support routines (in rdswitch.c) */
 EXTERN(boolean) read_quant_tables(j_compress_ptr cinfo, char *filename,
   boolean force_baseline);
 EXTERN(boolean) read_scan_script(j_compress_ptr cinfo, char *filename);
 EXTERN(boolean) set_quality_ratings(j_compress_ptr cinfo, char *arg,
   boolean force_baseline);
 EXTERN(boolean) set_quant_slots(j_compress_ptr cinfo, char *arg);
 EXTERN(boolean) set_sample_factors(j_compress_ptr cinfo, char *arg);
@@ -137,7 +139,7 @@ EXTERN(void) read_color_map(j_decompress_ptr cinfo, FILE *infile);
 /* common support routines (in cdjpeg.c) */
 EXTERN(void) start_progress_monitor(j_common_ptr cinfo,
   cd_progress_ptr progress);
 EXTERN(void) end_progress_monitor(j_common_ptr cinfo);
 EXTERN(boolean) keymatch(char *arg, const char *keyword, int minchars);
 EXTERN(FILE *) read_stdin(void);

change.log

@@ -6,6 +6,25 @@ reference. Please see ChangeLog.md for information specific to libjpeg-turbo.
 CHANGE LOG for Independent JPEG Group's JPEG software
+Version 9d 12-Jan-2020
+-----------------------
+Restore GIF read and write support from libjpeg version 6a.
+Thank to Wolfgang Werner (W.W.) Heinz for suggestion.
+Add jpegtran -drop option; add options to the crop extension and wipe
+to fill the extra area with content from the source image region,
+instead of gray out.
+Version 9c 14-Jan-2018
+-----------------------
+jpegtran: add an option to the -wipe switch to fill the region
+with the average of adjacent blocks, instead of gray out.
+Thank to Caitlyn Feddock and Maddie Ziegler for inspiration.
 Version 9b 17-Jan-2016
 -----------------------
@@ -13,6 +32,13 @@ Document 'f' specifier for jpegtran -crop specification.
 Thank to Michele Martone for suggestion.
+Version 9a 19-Jan-2014
+-----------------------
+Add jpegtran -wipe option and extension for -crop.
+Thank to Andrew Senior, David Clunie, and Josef Schmid for suggestion.
 Version 9 13-Jan-2013
 ----------------------
@@ -138,11 +164,6 @@ Huffman tables being used.
 Huffman tables are checked for validity much more carefully than before.
-To avoid the Unisys LZW patent, djpeg's GIF output capability has been
-changed to produce "uncompressed GIFs", and cjpeg's GIF input capability
-has been removed altogether. We're not happy about it either, but there
-seems to be no good alternative.
 The configure script now supports building libjpeg as a shared library
 on many flavors of Unix (all the ones that GNU libtool knows how to
 build shared libraries for). Use "./configure --enable-shared" to

cjpeg.1

@@ -16,8 +16,7 @@ cjpeg \- compress an image file to a JPEG file
 compresses the named image file, or the standard input if no file is
 named, and produces a JPEG/JFIF file on the standard output.
 The currently supported input file formats are: PPM (PBMPLUS color
-format), PGM (PBMPLUS grayscale format), BMP, Targa, and RLE (Utah Raster
-Toolkit format). (RLE is supported only if the URT library is available.)
+format), PGM (PBMPLUS grayscale format), BMP, GIF, and Targa.
 .SH OPTIONS
 All switch names may be abbreviated; for example,
 .B \-grayscale
@@ -42,10 +41,10 @@ Scale quantization tables to adjust image quality. Quality is 0 (worst) to
 .TP
 .B \-grayscale
 Create monochrome JPEG file from color input. Be sure to use this switch when
-compressing a grayscale BMP file, because
+compressing a grayscale BMP or GIF file, because
 .B cjpeg
-isn't bright enough to notice whether a BMP file uses only shades of gray.
-By saying
+isn't bright enough to notice whether a BMP or GIF file uses only shades of
+gray. By saying
 .BR \-grayscale,
 you'll get a smaller JPEG file that takes less time to process.
 .TP
@@ -224,6 +223,9 @@ Compress to memory instead of a file. This feature was implemented mainly as a
 way of testing the in-memory destination manager (jpeg_mem_dest()), but it is
 also useful for benchmarking, since it reduces the I/O overhead.
 .TP
+.BI \-report
+Report compression progress.
+.TP
 .B \-verbose
 Enable debug printout. More
 .BR \-v 's
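A hypothetical invocation of the new `-report` switch (file names are placeholders):

    cjpeg -report -outfile output.jpg input.ppm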
@@ -350,11 +352,6 @@ This file was modified by The libjpeg-turbo Project to include only information
 relevant to libjpeg-turbo, to wordsmith certain sections, and to describe
 features not present in libjpeg.
 .SH ISSUES
-Support for GIF input files was removed in cjpeg v6b due to concerns over
-the Unisys LZW patent. Although this patent expired in 2006, cjpeg still
-lacks GIF support, for these historical reasons. (Conversion of GIF files to
-JPEG is usually a bad idea anyway, since GIF is a 256-color format.)
-.PP
 Not all variants of BMP and Targa file formats are supported.
 .PP
 The

cjpeg.c

@@ -5,7 +5,7 @@
 * Copyright (C) 1991-1998, Thomas G. Lane.
 * Modified 2003-2011 by Guido Vollbeding.
 * libjpeg-turbo Modifications:
-* Copyright (C) 2010, 2013-2014, 2017, 2020, D. R. Commander.
+* Copyright (C) 2010, 2013-2014, 2017, 2019-2020, D. R. Commander.
 * mozjpeg Modifications:
 * Copyright (C) 2014, Mozilla Corporation.
 * For conditions of distribution and use, see the accompanying README file.
@@ -70,9 +70,9 @@ static const char * const cdjpeg_message_table[] = {
* 2) assume we can push back more than one character (works in * 2) assume we can push back more than one character (works in
* some C implementations, but unportable); * some C implementations, but unportable);
* 3) provide our own buffering (breaks input readers that want to use * 3) provide our own buffering (breaks input readers that want to use
* stdio directly, such as the RLE library); * stdio directly);
* or 4) don't put back the data, and modify the input_init methods to assume * or 4) don't put back the data, and modify the input_init methods to assume
* they start reading after the start of file (also breaks RLE library). * they start reading after the start of file.
* #1 is attractive for MS-DOS but is untenable on Unix. * #1 is attractive for MS-DOS but is untenable on Unix.
* *
* The most portable solution for file types that can't be identified by their * The most portable solution for file types that can't be identified by their
@@ -124,10 +124,6 @@ select_file_type(j_compress_ptr cinfo, FILE *infile)
copy_markers = TRUE; copy_markers = TRUE;
return jinit_read_png(cinfo); return jinit_read_png(cinfo);
#endif #endif
#ifdef RLE_SUPPORTED
case 'R':
return jinit_read_rle(cinfo);
#endif
#ifdef TARGA_SUPPORTED #ifdef TARGA_SUPPORTED
case 0x00: case 0x00:
return jinit_read_targa(cinfo); return jinit_read_targa(cinfo);
@@ -158,6 +154,7 @@ static const char *progname; /* program name for error messages */
static char *icc_filename; /* for -icc switch */ static char *icc_filename; /* for -icc switch */
static char *outfilename; /* for -outfile switch */ static char *outfilename; /* for -outfile switch */
boolean memdst; /* for -memdst switch */ boolean memdst; /* for -memdst switch */
boolean report; /* for -report switch */
LOCAL(void) LOCAL(void)
@@ -236,6 +233,7 @@ usage(void)
#if JPEG_LIB_VERSION >= 80 || defined(MEM_SRCDST_SUPPORTED) #if JPEG_LIB_VERSION >= 80 || defined(MEM_SRCDST_SUPPORTED)
fprintf(stderr, " -memdst Compress to memory instead of file (useful for benchmarking)\n"); fprintf(stderr, " -memdst Compress to memory instead of file (useful for benchmarking)\n");
#endif #endif
fprintf(stderr, " -report Report compression progress\n");
fprintf(stderr, " -verbose or -debug Emit debug output\n"); fprintf(stderr, " -verbose or -debug Emit debug output\n");
fprintf(stderr, " -version Print version information and exit\n"); fprintf(stderr, " -version Print version information and exit\n");
fprintf(stderr, "Switches for wizards:\n"); fprintf(stderr, "Switches for wizards:\n");
@@ -283,6 +281,7 @@ parse_switches(j_compress_ptr cinfo, int argc, char **argv,
icc_filename = NULL; icc_filename = NULL;
outfilename = NULL; outfilename = NULL;
memdst = FALSE; memdst = FALSE;
report = FALSE;
cinfo->err->trace_level = 0; cinfo->err->trace_level = 0;
/* Scan command line options, adjust parameters */ /* Scan command line options, adjust parameters */
@@ -470,6 +469,8 @@ parse_switches(j_compress_ptr cinfo, int argc, char **argv,
qtablefile = argv[argn]; qtablefile = argv[argn];
/* We postpone actually reading the file in case -quality comes later. */ /* We postpone actually reading the file in case -quality comes later. */
} else if (keymatch(arg, "report", 3)) {
report = TRUE;
} else if (keymatch(arg, "quant-table", 7)) { } else if (keymatch(arg, "quant-table", 7)) {
int val; int val;
if (++argn >= argc) /* advance to next argument */ if (++argn >= argc) /* advance to next argument */
@@ -485,7 +486,7 @@ parse_switches(j_compress_ptr cinfo, int argc, char **argv,
} else if (keymatch(arg, "quant-baseline", 7)) { } else if (keymatch(arg, "quant-baseline", 7)) {
/* Force quantization table to meet baseline requirements */ /* Force quantization table to meet baseline requirements */
force_baseline = TRUE; force_baseline = TRUE;
} else if (keymatch(arg, "restart", 1)) { } else if (keymatch(arg, "restart", 1)) {
/* Restart interval in MCU rows (or in MCUs with 'b'). */ /* Restart interval in MCU rows (or in MCUs with 'b'). */
long lval; long lval;
@@ -662,9 +663,7 @@ main(int argc, char **argv)
{ {
struct jpeg_compress_struct cinfo; struct jpeg_compress_struct cinfo;
struct jpeg_error_mgr jerr; struct jpeg_error_mgr jerr;
#ifdef PROGRESS_REPORT
struct cdjpeg_progress_mgr progress; struct cdjpeg_progress_mgr progress;
#endif
int file_index; int file_index;
cjpeg_source_ptr src_mgr; cjpeg_source_ptr src_mgr;
FILE *input_file; FILE *input_file;
@@ -785,9 +784,10 @@ main(int argc, char **argv)
fclose(icc_file); fclose(icc_file);
} }
#ifdef PROGRESS_REPORT if (report) {
start_progress_monitor((j_common_ptr)&cinfo, &progress); start_progress_monitor((j_common_ptr)&cinfo, &progress);
#endif progress.report = report;
}
/* Figure out the input file format, and set up to read it. */ /* Figure out the input file format, and set up to read it. */
src_mgr = select_file_type(&cinfo, input_file); src_mgr = select_file_type(&cinfo, input_file);
@@ -873,9 +873,8 @@ main(int argc, char **argv)
if (output_file != stdout && output_file != NULL) if (output_file != stdout && output_file != NULL)
fclose(output_file); fclose(output_file);
#ifdef PROGRESS_REPORT if (report)
end_progress_monitor((j_common_ptr)&cinfo); end_progress_monitor((j_common_ptr)&cinfo);
#endif
if (memdst) { if (memdst) {
fprintf(stderr, "Compressed size: %lu bytes\n", outsize); fprintf(stderr, "Compressed size: %lu bytes\n", outsize);
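
The \-memdst path exercised above relies on the in-memory destination manager that libjpeg-turbo (and libjpeg 8+) exposes as jpeg_mem_dest(). A bare-bones sketch of compressing to a growable memory buffer, assuming the compress object is already configured (error handling omitted; buffer names are illustrative):

    #include <stdio.h>
    #include <stdlib.h>
    #include "jpeglib.h"

    unsigned char *jpegbuf = NULL;   /* allocated/resized by the library */
    unsigned long jpegsize = 0;

    void
    compress_to_memory(j_compress_ptr cinfo, JSAMPARRAY rows)
    {
      jpeg_mem_dest(cinfo, &jpegbuf, &jpegsize);
      jpeg_start_compress(cinfo, TRUE);
      while (cinfo->next_scanline < cinfo->image_height)
        jpeg_write_scanlines(cinfo, &rows[cinfo->next_scanline], 1);
      jpeg_finish_compress(cinfo);
      /* jpegbuf now holds jpegsize bytes of JPEG data; free() it when done. */
    }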

View File

@@ -22,13 +22,15 @@ if(CMAKE_SYSTEM_NAME STREQUAL "Linux")
 set(RPMARCH ${CMAKE_SYSTEM_PROCESSOR})
 if(CPU_TYPE STREQUAL "x86_64")
 set(DEBARCH amd64)
-elseif(CMAKE_SYSTEM_PROCESSOR MATCHES "armv7*")
-set(RPMARCH armv7hl)
-set(DEBARCH armhf)
 elseif(CPU_TYPE STREQUAL "arm64")
 set(DEBARCH ${CPU_TYPE})
 elseif(CPU_TYPE STREQUAL "arm")
-if(CMAKE_C_COMPILER MATCHES "gnueabihf")
+check_c_source_compiles("
+#if __ARM_PCS_VFP != 1
+#error \"float ABI = softfp\"
+#endif
+int main(void) { return 0; }" HAVE_HARD_FLOAT)
+if(HAVE_HARD_FLOAT)
 set(RPMARCH armv7hl)
 set(DEBARCH armhf)
 else()
@@ -78,12 +80,14 @@ if(WIN32)
 if(MSVC)
 set(INST_PLATFORM "Visual C++")
-set(INST_NAME ${CMAKE_PROJECT_NAME}-${VERSION}-vc)
+set(INST_ID vc)
+set(INST_NAME ${CMAKE_PROJECT_NAME}-${VERSION}-${INST_ID})
 set(INST_REG_NAME ${CMAKE_PROJECT_NAME})
 elseif(MINGW)
 set(INST_PLATFORM GCC)
-set(INST_NAME ${CMAKE_PROJECT_NAME}-${VERSION}-gcc)
-set(INST_REG_NAME ${CMAKE_PROJECT_NAME}-gcc)
+set(INST_ID gcc)
+set(INST_NAME ${CMAKE_PROJECT_NAME}-${VERSION}-${INST_ID})
+set(INST_REG_NAME ${CMAKE_PROJECT_NAME}-${INST_ID})
 set(INST_DEFS -DGCC)
 endif()
@@ -107,6 +111,12 @@ endif()
 string(REGEX REPLACE "/" "\\\\" INST_DIR ${CMAKE_INSTALL_PREFIX})
 
 configure_file(release/installer.nsi.in installer.nsi @ONLY)
+# TODO: It would be nice to eventually switch to CPack and eliminate this mess,
+# but not today.
+configure_file(win/projectTargets.cmake.in
+win/${CMAKE_PROJECT_NAME}Targets.cmake @ONLY)
+configure_file(win/${INST_ID}/projectTargets-release.cmake.in
+win/${CMAKE_PROJECT_NAME}Targets-release.cmake @ONLY)
 
 if(WITH_JAVA)
 set(JAVA_DEPEND turbojpeg-java)
@@ -120,53 +130,28 @@ add_custom_target(installer
 endif() # WIN32
 
-###############################################################################
-# Cygwin Package
-###############################################################################
-if(CYGWIN)
-configure_file(release/makecygwinpkg.in pkgscripts/makecygwinpkg)
-add_custom_target(cygwinpkg pkgscripts/makecygwinpkg)
-endif() # CYGWIN
 
 ###############################################################################
 # Mac DMG
 ###############################################################################
 
 if(APPLE)
 
-set(DEFAULT_OSX_32BIT_BUILD ${CMAKE_SOURCE_DIR}/osxx86)
-set(OSX_32BIT_BUILD ${DEFAULT_OSX_32BIT_BUILD} CACHE PATH
-"Directory containing 32-bit (i386) Mac build to include in universal binaries (default: ${DEFAULT_OSX_32BIT_BUILD})")
-set(DEFAULT_IOS_ARMV7_BUILD ${CMAKE_SOURCE_DIR}/iosarmv7)
-set(IOS_ARMV7_BUILD ${DEFAULT_IOS_ARMV7_BUILD} CACHE PATH
-"Directory containing Armv7 iOS build to include in universal binaries (default: ${DEFAULT_IOS_ARMV7_BUILD})")
-set(DEFAULT_IOS_ARMV7S_BUILD ${CMAKE_SOURCE_DIR}/iosarmv7s)
-set(IOS_ARMV7S_BUILD ${DEFAULT_IOS_ARMV7S_BUILD} CACHE PATH
-"Directory containing Armv7s iOS build to include in universal binaries (default: ${DEFAULT_IOS_ARMV7S_BUILD})")
-set(DEFAULT_IOS_ARMV8_BUILD ${CMAKE_SOURCE_DIR}/iosarmv8)
-set(IOS_ARMV8_BUILD ${DEFAULT_IOS_ARMV8_BUILD} CACHE PATH
-"Directory containing Armv8 iOS build to include in universal binaries (default: ${DEFAULT_IOS_ARMV8_BUILD})")
-set(OSX_APP_CERT_NAME "" CACHE STRING
+set(ARMV8_BUILD "" CACHE PATH
+"Directory containing Armv8 iOS or macOS build to include in universal binaries")
+set(MACOS_APP_CERT_NAME "" CACHE STRING
 "Name of the Developer ID Application certificate (in the macOS keychain) that should be used to sign the libjpeg-turbo DMG. Leave this blank to generate an unsigned DMG.")
-set(OSX_INST_CERT_NAME "" CACHE STRING
+set(MACOS_INST_CERT_NAME "" CACHE STRING
 "Name of the Developer ID Installer certificate (in the macOS keychain) that should be used to sign the libjpeg-turbo installer package. Leave this blank to generate an unsigned package.")
 
 configure_file(release/makemacpkg.in pkgscripts/makemacpkg)
 configure_file(release/Distribution.xml.in pkgscripts/Distribution.xml)
-configure_file(release/Welcome.rtf.in pkgscripts/Welcome.rtf)
 configure_file(release/uninstall.in pkgscripts/uninstall)
 
 add_custom_target(dmg pkgscripts/makemacpkg
 SOURCES pkgscripts/makemacpkg)
 
-add_custom_target(udmg pkgscripts/makemacpkg universal
-SOURCES pkgscripts/makemacpkg)
-
 endif() # APPLE
@@ -187,3 +172,12 @@ add_custom_target(tarball pkgscripts/maketarball
 configure_file(release/libjpeg.pc.in pkgscripts/libjpeg.pc @ONLY)
 configure_file(release/libturbojpeg.pc.in pkgscripts/libturbojpeg.pc @ONLY)
+
+include(CMakePackageConfigHelpers)
+write_basic_package_version_file(
+pkgscripts/${CMAKE_PROJECT_NAME}ConfigVersion.cmake
+VERSION ${VERSION} COMPATIBILITY AnyNewerVersion)
+configure_package_config_file(release/Config.cmake.in
+pkgscripts/${CMAKE_PROJECT_NAME}Config.cmake
+INSTALL_DESTINATION ${CMAKE_INSTALL_LIBDIR}/cmake/${CMAKE_PROJECT_NAME})
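
The armhf-versus-softfp decision above no longer keys off the compiler triplet; instead, check_c_source_compiles() feeds the compiler a tiny probe and lets the preprocessor decide. As a standalone C file, the probe amounts to:

    /* Compiles only when the compiler targets the hard-float Arm ABI;
     * GCC and Clang predefine __ARM_PCS_VFP=1 for -mfloat-abi=hard. */
    #if __ARM_PCS_VFP != 1
    #error "float ABI = softfp"
    #endif
    int main(void) { return 0; }

If the probe fails to compile, the packaging logic takes the else() branch instead.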

View File

@@ -118,6 +118,7 @@
 # absolute paths where necessary, using the same logic.
 #=============================================================================
+# Copyright 2018 Matthias Räncker
 # Copyright 2016, 2019 D. R. Commander
 # Copyright 2016 Dmitry Marakasov
 # Copyright 2016 Roger Leigh
@@ -259,6 +260,8 @@ if(NOT DEFINED CMAKE_INSTALL_DEFAULT_LIBDIR)
 else()
 if("${CMAKE_SIZEOF_VOID_P}" EQUAL "8")
 set(CMAKE_INSTALL_DEFAULT_LIBDIR "lib64")
+elseif(CMAKE_C_COMPILER_ABI MATCHES "ELF X32")
+set(CMAKE_INSTALL_DEFAULT_LIBDIR "libx32")
 endif()
 endif()
 endif()

djpeg.1

@@ -15,8 +15,7 @@ djpeg \- decompress a JPEG file to an image file
 .B djpeg
 decompresses the named JPEG file, or the standard input if no file is named,
 and produces an image file on the standard output. PBMPLUS (PPM/PGM), BMP,
-GIF, Targa, or RLE (Utah Raster Toolkit) output format can be selected.
-(RLE is supported only if the URT library is available.)
+GIF, or Targa output format can be selected.
 .SH OPTIONS
 All switch names may be abbreviated; for example,
 .B \-grayscale
@@ -81,9 +80,20 @@ is specified, or if the JPEG file is grayscale; otherwise, 24-bit full-color
 format is emitted.
 .TP
 .B \-gif
-Select GIF output format. Since GIF does not support more than 256 colors,
+Select GIF output format (LZW-compressed). Since GIF does not support more
+than 256 colors,
 .B \-colors 256
-is assumed (unless you specify a smaller number of colors).
+is assumed (unless you specify a smaller number of colors). If you specify
+.BR \-fast,
+the default number of colors is 216.
+.TP
+.B \-gif0
+Select GIF output format (uncompressed). Since GIF does not support more than
+256 colors,
+.B \-colors 256
+is assumed (unless you specify a smaller number of colors). If you specify
+.BR \-fast,
+the default number of colors is 216.
 .TP
 .B \-os2
 Select BMP output format (OS/2 1.x flavor). 8-bit colormapped format is
@@ -100,9 +110,6 @@ PGM is emitted if the JPEG file is grayscale or if
 .B \-grayscale
 is specified; otherwise PPM is emitted.
 .TP
-.B \-rle
-Select RLE output format. (Requires URT library.)
-.TP
 .B \-targa
 Select Targa output format. Grayscale format is emitted if the JPEG file is
 grayscale or if
@@ -198,6 +205,19 @@ number. For example,
 .B \-max 4m
 selects 4000000 bytes. If more space is needed, an error will occur.
 .TP
+.BI \-maxscans " N"
+Abort if the JPEG image contains more than
+.I N
+scans. This feature demonstrates a method by which applications can guard
+against denial-of-service attacks instigated by specially-crafted malformed
+JPEG images containing numerous scans with missing image data or image data
+consisting only of "EOB runs" (a feature of progressive JPEG images that allows
+potentially hundreds of thousands of adjoining zero-value pixels to be
+represented using only a few bytes.) Attempting to decompress such malformed
+JPEG images can cause excessive CPU activity, since the decompressor must fully
+process each scan (even if the scan is corrupt) before it can proceed to the
+next scan.
+.TP
 .BI \-outfile " name"
 Send output image to the named file, not to standard output.
 .TP
@@ -205,6 +225,9 @@ Send output image to the named file, not to standard output.
 Load input file into memory before decompressing. This feature was implemented
 mainly as a way of testing the in-memory source manager (jpeg_mem_src().)
 .TP
+.BI \-report
+Report decompression progress.
+.TP
 .BI \-skip " Y0,Y1"
 Decompress all rows of the JPEG image except those between Y0 and Y1
 (inclusive.) Note that if decompression scaling is being used, then Y0 and Y1
@@ -218,6 +241,12 @@ decompression scaling is being used, then X, Y, W, and H are relative to the
 scaled image dimensions. Currently this option only works with the
 PBMPLUS (PPM/PGM), GIF, and Targa output formats.
 .TP
+.BI \-strict
+Treat all warnings as fatal. This feature also demonstrates a method by which
+applications can guard against attacks instigated by specially-crafted
+malformed JPEG images. Enabling this option will cause the decompressor to
+abort if the JPEG image contains incomplete or corrupt image data.
+.TP
 .B \-verbose
 Enable debug printout. More
 .BR \-v 's
@@ -289,10 +318,3 @@ Independent JPEG Group
 This file was modified by The libjpeg-turbo Project to include only information
 relevant to libjpeg-turbo, to wordsmith certain sections, and to describe
 features not present in libjpeg.
-.SH ISSUES
-Support for compressed GIF output files was removed in djpeg v6b due to
-concerns over the Unisys LZW patent. Although this patent expired in 2006,
-djpeg still lacks compressed GIF support, for these historical reasons.
-(Conversion of JPEG files to GIF is usually a bad idea anyway, since GIF is a
-256-color format.) The uncompressed GIF files that djpeg generates are larger
-than they should be, but they are readable by standard GIF decoders.
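
\-maxscans works because the library keeps a running scan counter that the application can inspect while it feeds input. A hedged sketch of the same guard for programs that drive the decompressor themselves (the limit of 100 and the abort strategy are illustrative):

    #include <stdio.h>
    #include "jpeglib.h"

    /* Nonzero once the decompressor has started more than max_scans scans;
     * input_scan_number is maintained by the library. */
    static int
    too_many_scans(j_decompress_ptr cinfo, int max_scans)
    {
      return max_scans > 0 && cinfo->input_scan_number > max_scans;
    }

    /* Typical use while buffering a progressive image:
     *   while (jpeg_consume_input(&cinfo) != JPEG_REACHED_EOI)
     *     if (too_many_scans(&cinfo, 100))
     *       break;   // or longjmp() out and destroy the decompress object
     */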

djpeg.c

@@ -3,9 +3,9 @@
  *
  * This file was part of the Independent JPEG Group's software:
  * Copyright (C) 1991-1997, Thomas G. Lane.
- * Modified 2013 by Guido Vollbeding.
+ * Modified 2013-2019 by Guido Vollbeding.
  * libjpeg-turbo Modifications:
- * Copyright (C) 2010-2011, 2013-2017, 2020, D. R. Commander.
+ * Copyright (C) 2010-2011, 2013-2017, 2019-2020, D. R. Commander.
  * Copyright (C) 2015, Google, Inc.
  * For conditions of distribution and use, see the accompanying README.ijg
  * file.
@@ -68,10 +68,10 @@ static const char * const cdjpeg_message_table[] = {
 typedef enum {
 FMT_BMP, /* BMP format (Windows flavor) */
-FMT_GIF, /* GIF format */
+FMT_GIF, /* GIF format (LZW-compressed) */
+FMT_GIF0, /* GIF format (uncompressed) */
 FMT_OS2, /* BMP format (OS/2 flavor) */
 FMT_PPM, /* PPM/PGM (PBMPLUS formats) */
-FMT_RLE, /* RLE format */
 FMT_TARGA, /* Targa format */
 FMT_TIFF /* TIFF format */
 } IMAGE_FORMATS;
@@ -94,11 +94,14 @@ static IMAGE_FORMATS requested_fmt;
 static const char *progname; /* program name for error messages */
 static char *icc_filename; /* for -icc switch */
+JDIMENSION max_scans; /* for -maxscans switch */
 static char *outfilename; /* for -outfile switch */
 boolean memsrc; /* for -memsrc switch */
+boolean report; /* for -report switch */
 boolean skip, crop;
 JDIMENSION skip_start, skip_end;
 JDIMENSION crop_x, crop_y, crop_width, crop_height;
+boolean strict; /* for -strict switch */
 
 #define INPUT_BUF_SIZE 4096
@@ -127,8 +130,10 @@ usage(void)
 (DEFAULT_FMT == FMT_BMP ? " (default)" : ""));
 #endif
 #ifdef GIF_SUPPORTED
-fprintf(stderr, " -gif Select GIF output format%s\n",
+fprintf(stderr, " -gif Select GIF output format (LZW-compressed)%s\n",
 (DEFAULT_FMT == FMT_GIF ? " (default)" : ""));
+fprintf(stderr, " -gif0 Select GIF output format (uncompressed)%s\n",
+(DEFAULT_FMT == FMT_GIF0 ? " (default)" : ""));
 #endif
 #ifdef BMP_SUPPORTED
 fprintf(stderr, " -os2 Select BMP output format (OS/2 style)%s\n",
@@ -138,10 +143,6 @@ usage(void)
 fprintf(stderr, " -pnm Select PBMPLUS (PPM/PGM) output format%s\n",
 (DEFAULT_FMT == FMT_PPM ? " (default)" : ""));
 #endif
-#ifdef RLE_SUPPORTED
-fprintf(stderr, " -rle Select Utah RLE output format%s\n",
-(DEFAULT_FMT == FMT_RLE ? " (default)" : ""));
-#endif
 #ifdef TARGA_SUPPORTED
 fprintf(stderr, " -targa Select Targa output format%s\n",
 (DEFAULT_FMT == FMT_TARGA ? " (default)" : ""));
@@ -171,14 +172,16 @@ usage(void)
 fprintf(stderr, " -onepass Use 1-pass quantization (fast, low quality)\n");
 #endif
 fprintf(stderr, " -maxmemory N Maximum memory to use (in kbytes)\n");
+fprintf(stderr, " -maxscans N Maximum number of scans to allow in input file\n");
 fprintf(stderr, " -outfile name Specify name for output file\n");
 #if JPEG_LIB_VERSION >= 80 || defined(MEM_SRCDST_SUPPORTED)
 fprintf(stderr, " -memsrc Load input file into memory before decompressing\n");
 #endif
+fprintf(stderr, " -report Report decompression progress\n");
 fprintf(stderr, " -skip Y0,Y1 Decompress all rows except those between Y0 and Y1 (inclusive)\n");
 fprintf(stderr, " -crop WxH+X+Y Decompress only a rectangular subregion of the image\n");
 fprintf(stderr, " [requires PBMPLUS (PPM/PGM), GIF, or Targa output format]\n");
+fprintf(stderr, " -strict Treat all warnings as fatal\n");
 fprintf(stderr, " -verbose or -debug Emit debug output\n");
 fprintf(stderr, " -version Print version information and exit\n");
 exit(EXIT_FAILURE);
@@ -203,10 +206,13 @@ parse_switches(j_decompress_ptr cinfo, int argc, char **argv,
 /* Set up default JPEG parameters. */
 requested_fmt = DEFAULT_FMT; /* set default output file format */
 icc_filename = NULL;
+max_scans = 0;
 outfilename = NULL;
 memsrc = FALSE;
+report = FALSE;
 skip = FALSE;
 crop = FALSE;
+strict = FALSE;
 cinfo->err->trace_level = 0;
 
 /* Scan command line options, adjust parameters */
@@ -224,7 +230,7 @@ parse_switches(j_decompress_ptr cinfo, int argc, char **argv,
 arg++; /* advance past switch marker character */
 if (keymatch(arg, "bmp", 1)) {
-/* BMP output format. */
+/* BMP output format (Windows flavor). */
 requested_fmt = FMT_BMP;
 } else if (keymatch(arg, "colors", 1) || keymatch(arg, "colours", 1) ||
@@ -295,9 +301,13 @@ parse_switches(j_decompress_ptr cinfo, int argc, char **argv,
 cinfo->do_fancy_upsampling = FALSE;
 } else if (keymatch(arg, "gif", 1)) {
-/* GIF output format. */
+/* GIF output format (LZW-compressed). */
 requested_fmt = FMT_GIF;
+} else if (keymatch(arg, "gif0", 4)) {
+/* GIF output format (uncompressed). */
+requested_fmt = FMT_GIF0;
 } else if (keymatch(arg, "grayscale", 2) ||
 keymatch(arg, "greyscale", 2)) {
 /* Force monochrome output. */
@@ -351,6 +361,12 @@ parse_switches(j_decompress_ptr cinfo, int argc, char **argv,
 lval *= 1000L;
 cinfo->mem->max_memory_to_use = lval * 1000L;
+} else if (keymatch(arg, "maxscans", 4)) {
+if (++argn >= argc) /* advance to next argument */
+usage();
+if (sscanf(argv[argn], "%u", &max_scans) != 1)
+usage();
 } else if (keymatch(arg, "nosmooth", 3)) {
 /* Suppress fancy upsampling */
 cinfo->do_fancy_upsampling = FALSE;
@@ -383,9 +399,8 @@ parse_switches(j_decompress_ptr cinfo, int argc, char **argv,
 /* PPM/PGM output format. */
 requested_fmt = FMT_PPM;
-} else if (keymatch(arg, "rle", 1)) {
-/* RLE output format. */
-requested_fmt = FMT_RLE;
+} else if (keymatch(arg, "report", 2)) {
+report = TRUE;
 } else if (keymatch(arg, "scale", 2)) {
 /* Scale the output image by a fraction M/N. */
@@ -413,6 +428,9 @@ parse_switches(j_decompress_ptr cinfo, int argc, char **argv,
 usage();
 crop = TRUE;
+} else if (keymatch(arg, "strict", 2)) {
+strict = TRUE;
 } else if (keymatch(arg, "targa", 1)) {
 /* Targa output format. */
 requested_fmt = FMT_TARGA;
@@ -444,7 +462,7 @@ jpeg_getc(j_decompress_ptr cinfo)
 ERREXIT(cinfo, JERR_CANT_SUSPEND);
 }
 datasrc->bytes_in_buffer--;
-return GETJOCTET(*datasrc->next_input_byte++);
+return *datasrc->next_input_byte++;
 }
@@ -499,6 +517,19 @@ print_text_marker(j_decompress_ptr cinfo)
 }
 
+METHODDEF(void)
+my_emit_message(j_common_ptr cinfo, int msg_level)
+{
+if (msg_level < 0) {
+/* Treat warning as fatal */
+cinfo->err->error_exit(cinfo);
+} else {
+if (cinfo->err->trace_level >= msg_level)
+cinfo->err->output_message(cinfo);
+}
+}
+
 /*
  * The main program.
  */
@@ -508,9 +539,7 @@ main(int argc, char **argv)
 {
 struct jpeg_decompress_struct cinfo;
 struct jpeg_error_mgr jerr;
-#ifdef PROGRESS_REPORT
 struct cdjpeg_progress_mgr progress;
-#endif
 int file_index;
 djpeg_dest_ptr dest_mgr = NULL;
 FILE *input_file;
@@ -557,6 +586,9 @@ main(int argc, char **argv)
 file_index = parse_switches(&cinfo, argc, argv, 0, FALSE);
 
+if (strict)
+jerr.emit_message = my_emit_message;
+
 #ifdef TWO_FILE_COMMANDLINE
 /* Must have either -outfile switch or explicit output file name */
 if (outfilename == NULL) {
@@ -603,9 +635,11 @@ main(int argc, char **argv)
 output_file = write_stdout();
 }
 
-#ifdef PROGRESS_REPORT
+if (report || max_scans != 0) {
 start_progress_monitor((j_common_ptr)&cinfo, &progress);
-#endif
+progress.report = report;
+progress.max_scans = max_scans;
+}
 
 /* Specify data source for decompression */
 #if JPEG_LIB_VERSION >= 80 || defined(MEM_SRCDST_SUPPORTED)
@@ -653,7 +687,10 @@ main(int argc, char **argv)
 #endif
 #ifdef GIF_SUPPORTED
 case FMT_GIF:
-dest_mgr = jinit_write_gif(&cinfo);
+dest_mgr = jinit_write_gif(&cinfo, TRUE);
+break;
+case FMT_GIF0:
+dest_mgr = jinit_write_gif(&cinfo, FALSE);
 break;
 #endif
 #ifdef PPM_SUPPORTED
@@ -661,11 +698,6 @@ main(int argc, char **argv)
 dest_mgr = jinit_write_ppm(&cinfo);
 break;
 #endif
-#ifdef RLE_SUPPORTED
-case FMT_RLE:
-dest_mgr = jinit_write_rle(&cinfo);
-break;
-#endif
 #ifdef TARGA_SUPPORTED
 case FMT_TARGA:
 dest_mgr = jinit_write_targa(&cinfo);
@@ -781,12 +813,11 @@ main(int argc, char **argv)
 }
 }
 
-#ifdef PROGRESS_REPORT
 /* Hack: count final pass as done in case finish_output does an extra pass.
  * The library won't have updated completed_passes.
  */
-progress.pub.completed_passes = progress.pub.total_passes;
-#endif
+if (report || max_scans != 0)
+progress.pub.completed_passes = progress.pub.total_passes;
 
 if (icc_filename != NULL) {
 FILE *icc_file;
@@ -825,9 +856,8 @@ main(int argc, char **argv)
 if (output_file != stdout)
 fclose(output_file);
 
-#ifdef PROGRESS_REPORT
+if (report || max_scans != 0)
 end_progress_monitor((j_common_ptr)&cinfo);
-#endif
 
 if (memsrc)
 free(inbuffer);
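
my_emit_message() above converts warnings into calls to error_exit(), and the stock error_exit() terminates the process. Library users who want \-strict-style behavior without exiting usually pair it with the setjmp-based error manager from the libjpeg example code; a sketch (struct layout as in example.c, other names illustrative):

    #include <setjmp.h>
    #include <stdio.h>
    #include "jpeglib.h"

    struct my_error_mgr {
      struct jpeg_error_mgr pub;    /* "public" fields; must come first */
      jmp_buf setjmp_buffer;        /* return point on fatal error */
    };

    static void
    my_error_exit(j_common_ptr cinfo)
    {
      struct my_error_mgr *myerr = (struct my_error_mgr *)cinfo->err;
      (*cinfo->err->output_message)(cinfo);   /* report, then recover */
      longjmp(myerr->setjmp_buffer, 1);
    }

    /* Setup:
     *   struct my_error_mgr jerr;
     *   cinfo.err = jpeg_std_error(&jerr.pub);
     *   jerr.pub.error_exit = my_error_exit;
     *   if (setjmp(jerr.setjmp_buffer)) { jpeg_destroy_decompress(&cinfo); ... }
     */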

View File

@@ -38,7 +38,7 @@ Installation Directory
 ----------------------
 
 The TurboJPEG Java Wrapper will look for the TurboJPEG JNI library
-(libturbojpeg.so, libturbojpeg.jnilib, or turbojpeg.dll) in the system library
+(libturbojpeg.so, libturbojpeg.dylib, or turbojpeg.dll) in the system library
 paths or in any paths specified in LD_LIBRARY_PATH (Un*x), DYLD_LIBRARY_PATH
 (Mac), or PATH (Windows.) Failing this, on Un*x and Mac systems, the wrapper
 will look for the JNI library under the library directory configured when

View File

@@ -1,5 +1,5 @@
 /*
- * Copyright (C)2011-2013, 2016 D. R. Commander. All Rights Reserved.
+ * Copyright (C)2011-2013, 2016, 2020 D. R. Commander. All Rights Reserved.
  *
  * Redistribution and use in source and binary forms, with or without
  * modification, are permitted provided that the following conditions are met:
@@ -36,9 +36,9 @@ final class TJLoader {
 String os = System.getProperty("os.name").toLowerCase();
 if (os.indexOf("mac") >= 0) {
 try {
-System.load("@CMAKE_INSTALL_FULL_LIBDIR@/libturbojpeg.jnilib");
+System.load("@CMAKE_INSTALL_FULL_LIBDIR@/libturbojpeg.dylib");
 } catch (java.lang.UnsatisfiedLinkError e2) {
-System.load("/usr/lib/libturbojpeg.jnilib");
+System.load("/usr/lib/libturbojpeg.dylib");
 }
 } else {
 try {

View File

@@ -48,9 +48,9 @@ rgb_ycc_convert_internal(j_compress_ptr cinfo, JSAMPARRAY input_buf,
 outptr2 = output_buf[2][output_row];
 output_row++;
 for (col = 0; col < num_cols; col++) {
-r = GETJSAMPLE(inptr[RGB_RED]);
-g = GETJSAMPLE(inptr[RGB_GREEN]);
-b = GETJSAMPLE(inptr[RGB_BLUE]);
+r = inptr[RGB_RED];
+g = inptr[RGB_GREEN];
+b = inptr[RGB_BLUE];
 inptr += RGB_PIXELSIZE;
 /* If the inputs are 0..MAXJSAMPLE, the outputs of these equations
  * must be too; we do not need an explicit range-limiting operation.
@@ -100,9 +100,9 @@ rgb_gray_convert_internal(j_compress_ptr cinfo, JSAMPARRAY input_buf,
 outptr = output_buf[0][output_row];
 output_row++;
 for (col = 0; col < num_cols; col++) {
-r = GETJSAMPLE(inptr[RGB_RED]);
-g = GETJSAMPLE(inptr[RGB_GREEN]);
-b = GETJSAMPLE(inptr[RGB_BLUE]);
+r = inptr[RGB_RED];
+g = inptr[RGB_GREEN];
+b = inptr[RGB_BLUE];
 inptr += RGB_PIXELSIZE;
 /* Y */
 outptr[col] = (JSAMPLE)((ctab[r + R_Y_OFF] + ctab[g + G_Y_OFF] +
@@ -135,9 +135,9 @@ rgb_rgb_convert_internal(j_compress_ptr cinfo, JSAMPARRAY input_buf,
 outptr2 = output_buf[2][output_row];
 output_row++;
 for (col = 0; col < num_cols; col++) {
-outptr0[col] = GETJSAMPLE(inptr[RGB_RED]);
-outptr1[col] = GETJSAMPLE(inptr[RGB_GREEN]);
-outptr2[col] = GETJSAMPLE(inptr[RGB_BLUE]);
+outptr0[col] = inptr[RGB_RED];
+outptr1[col] = inptr[RGB_GREEN];
+outptr2[col] = inptr[RGB_BLUE];
 inptr += RGB_PIXELSIZE;
 }
 }
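
For context, the ctab[] lookups in these loops implement the usual fixed-point BT.601 equations; GETJSAMPLE() could be dropped because libjpeg-turbo now treats JSAMPLE as a plain unsigned 8-bit type. A table-free sketch of just the Y term, using the same constants and scale factor as jccolor.c:

    #define SCALEBITS  16
    #define ONE_HALF   ((long)1 << (SCALEBITS - 1))
    #define FIX(x)     ((long)((x) * (1L << SCALEBITS) + 0.5))

    /* Y = 0.29900 R + 0.58700 G + 0.11400 B, in integer arithmetic */
    static unsigned char
    rgb_to_y(int r, int g, int b)
    {
      return (unsigned char)((FIX(0.29900) * r + FIX(0.58700) * g +
                              FIX(0.11400) * b + ONE_HALF) >> SCALEBITS);
    }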

View File

@@ -392,11 +392,11 @@ cmyk_ycck_convert(j_compress_ptr cinfo, JSAMPARRAY input_buf,
 outptr3 = output_buf[3][output_row];
 output_row++;
 for (col = 0; col < num_cols; col++) {
-r = MAXJSAMPLE - GETJSAMPLE(inptr[0]);
-g = MAXJSAMPLE - GETJSAMPLE(inptr[1]);
-b = MAXJSAMPLE - GETJSAMPLE(inptr[2]);
+r = MAXJSAMPLE - inptr[0];
+g = MAXJSAMPLE - inptr[1];
+b = MAXJSAMPLE - inptr[2];
 /* K passes through as-is */
-outptr3[col] = inptr[3]; /* don't need GETJSAMPLE here */
+outptr3[col] = inptr[3];
 inptr += 4;
 /* If the inputs are 0..MAXJSAMPLE, the outputs of these equations
  * must be too; we do not need an explicit range-limiting operation.
@@ -438,7 +438,7 @@ grayscale_convert(j_compress_ptr cinfo, JSAMPARRAY input_buf,
 outptr = output_buf[0][output_row];
 output_row++;
 for (col = 0; col < num_cols; col++) {
-outptr[col] = inptr[0]; /* don't need GETJSAMPLE() here */
+outptr[col] = inptr[0];
 inptr += instride;
 }
 }
@@ -497,7 +497,7 @@ null_convert(j_compress_ptr cinfo, JSAMPARRAY input_buf, JSAMPIMAGE output_buf,
 inptr = *input_buf;
 outptr = output_buf[ci][output_row];
 for (col = 0; col < num_cols; col++) {
-outptr[col] = inptr[ci]; /* don't need GETJSAMPLE() here */
+outptr[col] = inptr[ci];
 inptr += nc;
 }
 }
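
The CMYK-to-YCCK path above is just the RGB conversion applied to inverted CMY values, with K copied through. A compact sketch of one pixel, assuming rgb_to_y() from the previous example plus analogous (hypothetical) rgb_to_cb()/rgb_to_cr() helpers:

    /* One 8-bit CMYK pixel -> one YCCK pixel (MAXJSAMPLE == 255) */
    static void
    cmyk_to_ycck(int c, int m, int y, int k, unsigned char out[4])
    {
      int r = 255 - c, g = 255 - m, b = 255 - y;   /* invert CMY to RGB */
      out[0] = rgb_to_y(r, g, b);
      out[1] = rgb_to_cb(r, g, b);   /* hypothetical helper */
      out[2] = rgb_to_cr(r, g, b);   /* hypothetical helper */
      out[3] = (unsigned char)k;     /* K passes through as-is */
    }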

View File

@@ -574,19 +574,19 @@ convsamp (JSAMPARRAY sample_data, JDIMENSION start_col, DCTELEM *workspace)
 elemptr = sample_data[elemr] + start_col;
 #if DCTSIZE == 8 /* unroll the inner loop */
-*workspaceptr++ = GETJSAMPLE(*elemptr++) - CENTERJSAMPLE;
-*workspaceptr++ = GETJSAMPLE(*elemptr++) - CENTERJSAMPLE;
-*workspaceptr++ = GETJSAMPLE(*elemptr++) - CENTERJSAMPLE;
-*workspaceptr++ = GETJSAMPLE(*elemptr++) - CENTERJSAMPLE;
-*workspaceptr++ = GETJSAMPLE(*elemptr++) - CENTERJSAMPLE;
-*workspaceptr++ = GETJSAMPLE(*elemptr++) - CENTERJSAMPLE;
-*workspaceptr++ = GETJSAMPLE(*elemptr++) - CENTERJSAMPLE;
-*workspaceptr++ = GETJSAMPLE(*elemptr++) - CENTERJSAMPLE;
+*workspaceptr++ = (*elemptr++) - CENTERJSAMPLE;
+*workspaceptr++ = (*elemptr++) - CENTERJSAMPLE;
+*workspaceptr++ = (*elemptr++) - CENTERJSAMPLE;
+*workspaceptr++ = (*elemptr++) - CENTERJSAMPLE;
+*workspaceptr++ = (*elemptr++) - CENTERJSAMPLE;
+*workspaceptr++ = (*elemptr++) - CENTERJSAMPLE;
+*workspaceptr++ = (*elemptr++) - CENTERJSAMPLE;
+*workspaceptr++ = (*elemptr++) - CENTERJSAMPLE;
 #else
 {
 register int elemc;
 for (elemc = DCTSIZE; elemc > 0; elemc--)
-*workspaceptr++ = GETJSAMPLE(*elemptr++) - CENTERJSAMPLE;
+*workspaceptr++ = (*elemptr++) - CENTERJSAMPLE;
 }
 #endif
 }
@@ -774,20 +774,19 @@ convsamp_float(JSAMPARRAY sample_data, JDIMENSION start_col,
 for (elemr = 0; elemr < DCTSIZE; elemr++) {
 elemptr = sample_data[elemr] + start_col;
 #if DCTSIZE == 8 /* unroll the inner loop */
-*workspaceptr++ = (FAST_FLOAT)(GETJSAMPLE(*elemptr++) - CENTERJSAMPLE);
-*workspaceptr++ = (FAST_FLOAT)(GETJSAMPLE(*elemptr++) - CENTERJSAMPLE);
-*workspaceptr++ = (FAST_FLOAT)(GETJSAMPLE(*elemptr++) - CENTERJSAMPLE);
-*workspaceptr++ = (FAST_FLOAT)(GETJSAMPLE(*elemptr++) - CENTERJSAMPLE);
-*workspaceptr++ = (FAST_FLOAT)(GETJSAMPLE(*elemptr++) - CENTERJSAMPLE);
-*workspaceptr++ = (FAST_FLOAT)(GETJSAMPLE(*elemptr++) - CENTERJSAMPLE);
-*workspaceptr++ = (FAST_FLOAT)(GETJSAMPLE(*elemptr++) - CENTERJSAMPLE);
-*workspaceptr++ = (FAST_FLOAT)(GETJSAMPLE(*elemptr++) - CENTERJSAMPLE);
+*workspaceptr++ = (FAST_FLOAT)((*elemptr++) - CENTERJSAMPLE);
+*workspaceptr++ = (FAST_FLOAT)((*elemptr++) - CENTERJSAMPLE);
+*workspaceptr++ = (FAST_FLOAT)((*elemptr++) - CENTERJSAMPLE);
+*workspaceptr++ = (FAST_FLOAT)((*elemptr++) - CENTERJSAMPLE);
+*workspaceptr++ = (FAST_FLOAT)((*elemptr++) - CENTERJSAMPLE);
+*workspaceptr++ = (FAST_FLOAT)((*elemptr++) - CENTERJSAMPLE);
+*workspaceptr++ = (FAST_FLOAT)((*elemptr++) - CENTERJSAMPLE);
+*workspaceptr++ = (FAST_FLOAT)((*elemptr++) - CENTERJSAMPLE);
 #else
 {
 register int elemc;
 for (elemc = DCTSIZE; elemc > 0; elemc--)
-*workspaceptr++ = (FAST_FLOAT)
-(GETJSAMPLE(*elemptr++) - CENTERJSAMPLE);
+*workspaceptr++ = (FAST_FLOAT)((*elemptr++) - CENTERJSAMPLE);
 }
 #endif
 }
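
convsamp() only copies one 8x8 block into the DCT workspace while subtracting CENTERJSAMPLE (128 for 8-bit samples), so the forward DCT operates on values centered on zero; the unrolling above is purely a speed tweak. The plain form of the loop, assuming the usual libjpeg types (DCTELEM comes from the library's internal jdct.h header):

    /* Level-shift one 8x8 block of samples into a DCT workspace. */
    static void
    convsamp_plain(JSAMPARRAY sample_data, JDIMENSION start_col,
                   DCTELEM *workspace)
    {
      int row, col;

      for (row = 0; row < DCTSIZE; row++) {
        JSAMPROW elemptr = sample_data[row] + start_col;
        for (col = 0; col < DCTSIZE; col++)
          *workspace++ = (DCTELEM)(*elemptr++) - CENTERJSAMPLE;
      }
    }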

jchuff.c

@@ -4,8 +4,10 @@
  * This file was part of the Independent JPEG Group's software:
  * Copyright (C) 1991-1997, Thomas G. Lane.
  * libjpeg-turbo Modifications:
- * Copyright (C) 2009-2011, 2014-2016, 2018-2019, D. R. Commander.
+ * Copyright (C) 2009-2011, 2014-2016, 2018-2020, D. R. Commander.
  * Copyright (C) 2015, Matthieu Darbois.
+ * Copyright (C) 2018, Matthias Räncker.
+ * Copyright (C) 2020, Arm Limited.
  * For conditions of distribution and use, see the accompanying README.ijg
  * file.
  *
@@ -42,15 +44,19 @@
  * flags (this defines __thumb__).
  */
 
-/* NOTE: Both GCC and Clang define __GNUC__ */
-#if defined(__GNUC__) && (defined(__arm__) || defined(__aarch64__))
+#if defined(__arm__) || defined(__aarch64__) || defined(_M_ARM) || \
+    defined(_M_ARM64)
 #if !defined(__thumb__) || defined(__thumb2__)
 #define USE_CLZ_INTRINSIC
 #endif
 #endif
 
 #ifdef USE_CLZ_INTRINSIC
+#if defined(_MSC_VER) && !defined(__clang__)
+#define JPEG_NBITS_NONZERO(x) (32 - _CountLeadingZeros(x))
+#else
 #define JPEG_NBITS_NONZERO(x) (32 - __builtin_clz(x))
+#endif
 #define JPEG_NBITS(x) (x ? JPEG_NBITS_NONZERO(x) : 0)
 #else
 #include "jpeg_nbits_table.h"
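
JPEG_NBITS() computes the magnitude category of a coefficient, i.e. how many significant bits it has, and with a count-leading-zeros instruction that is essentially one subtraction. A function-shaped sketch of what the macros above select between (the portable fallback stands in for the table lookup used when no intrinsic is available):

    /* Number of significant bits in x: smallest n such that x < 2^n. */
    static int
    jpeg_nbits(unsigned int x)
    {
      if (x == 0)
        return 0;
    #if defined(_MSC_VER) && !defined(__clang__) && \
        (defined(_M_ARM) || defined(_M_ARM64))
      return 32 - _CountLeadingZeros(x);   /* MSVC Arm intrinsic, <intrin.h> */
    #elif defined(__GNUC__) || defined(__clang__)
      return 32 - __builtin_clz(x);        /* GCC/Clang builtin */
    #else
      {
        int n = 0;
        while (x) { n++; x >>= 1; }        /* portable fallback */
        return n;
      }
    #endif
    }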
@@ -65,32 +71,43 @@
* but must not be updated permanently until we complete the MCU. * but must not be updated permanently until we complete the MCU.
*/ */
#if defined(__x86_64__) && defined(__ILP32__)
typedef unsigned long long bit_buf_type;
#else
typedef size_t bit_buf_type;
#endif
/* NOTE: The more optimal Huffman encoding algorithm is only used by the
* intrinsics implementation of the Arm Neon SIMD extensions, which is why we
* retain the old Huffman encoder behavior when using the GAS implementation.
*/
#if defined(WITH_SIMD) && !(defined(__arm__) || defined(__aarch64__) || \
defined(_M_ARM) || defined(_M_ARM64))
typedef unsigned long long simd_bit_buf_type;
#else
typedef bit_buf_type simd_bit_buf_type;
#endif
#if (defined(SIZEOF_SIZE_T) && SIZEOF_SIZE_T == 8) || defined(_WIN64) || \
(defined(__x86_64__) && defined(__ILP32__))
#define BIT_BUF_SIZE 64
#elif (defined(SIZEOF_SIZE_T) && SIZEOF_SIZE_T == 4) || defined(_WIN32)
#define BIT_BUF_SIZE 32
#else
#error Cannot determine word size
#endif
#define SIMD_BIT_BUF_SIZE (sizeof(simd_bit_buf_type) * 8)
typedef struct { typedef struct {
size_t put_buffer; /* current bit-accumulation buffer */ union {
int put_bits; /* # of bits now in it */ bit_buf_type c;
simd_bit_buf_type simd;
} put_buffer; /* current bit accumulation buffer */
int free_bits; /* # of bits available in it */
/* (Neon GAS: # of bits now in it) */
int last_dc_val[MAX_COMPS_IN_SCAN]; /* last DC coef for each component */ int last_dc_val[MAX_COMPS_IN_SCAN]; /* last DC coef for each component */
} savable_state; } savable_state;
/* This macro is to work around compilers with missing or broken
* structure assignment. You'll need to fix this code if you have
* such a compiler and you change MAX_COMPS_IN_SCAN.
*/
#ifndef NO_STRUCT_ASSIGN
#define ASSIGN_STATE(dest, src) ((dest) = (src))
#else
#if MAX_COMPS_IN_SCAN == 4
#define ASSIGN_STATE(dest, src) \
((dest).put_buffer = (src).put_buffer, \
(dest).put_bits = (src).put_bits, \
(dest).last_dc_val[0] = (src).last_dc_val[0], \
(dest).last_dc_val[1] = (src).last_dc_val[1], \
(dest).last_dc_val[2] = (src).last_dc_val[2], \
(dest).last_dc_val[3] = (src).last_dc_val[3])
#endif
#endif
typedef struct { typedef struct {
struct jpeg_entropy_encoder pub; /* public fields */ struct jpeg_entropy_encoder pub; /* public fields */
@@ -123,6 +140,7 @@ typedef struct {
size_t free_in_buffer; /* # of byte spaces remaining in buffer */ size_t free_in_buffer; /* # of byte spaces remaining in buffer */
savable_state cur; /* Current bit buffer & DC state */ savable_state cur; /* Current bit buffer & DC state */
j_compress_ptr cinfo; /* dump_buffer needs access to this */ j_compress_ptr cinfo; /* dump_buffer needs access to this */
int simd;
} working_state; } working_state;
@@ -201,8 +219,17 @@ start_pass_huff(j_compress_ptr cinfo, boolean gather_statistics)
} }
/* Initialize bit buffer to empty */ /* Initialize bit buffer to empty */
entropy->saved.put_buffer = 0; if (entropy->simd) {
entropy->saved.put_bits = 0; entropy->saved.put_buffer.simd = 0;
#if defined(__aarch64__) && !defined(NEON_INTRINSICS)
entropy->saved.free_bits = 0;
#else
entropy->saved.free_bits = SIMD_BIT_BUF_SIZE;
#endif
} else {
entropy->saved.put_buffer.c = 0;
entropy->saved.free_bits = BIT_BUF_SIZE;
}
/* Initialize restart stuff */ /* Initialize restart stuff */
entropy->restarts_to_go = cinfo->restart_interval; entropy->restarts_to_go = cinfo->restart_interval;
@@ -334,94 +361,94 @@ dump_buffer(working_state *state)
/* Outputting bits to the file */ /* Outputting bits to the file */
/* These macros perform the same task as the emit_bits() function in the /* Output byte b and, speculatively, an additional 0 byte. 0xFF must be
* original libjpeg code. In addition to reducing overhead by explicitly * encoded as 0xFF 0x00, so the output buffer pointer is advanced by 2 if the
* inlining the code, additional performance is achieved by taking into * byte is 0xFF. Otherwise, the output buffer pointer is advanced by 1, and
* account the size of the bit buffer and waiting until it is almost full * the speculative 0 byte will be overwritten by the next byte.
* before emptying it. This mostly benefits 64-bit platforms, since 6
* bytes can be stored in a 64-bit bit buffer before it has to be emptied.
*/ */
#define EMIT_BYTE(b) { \
#define EMIT_BYTE() { \ buffer[0] = (JOCTET)(b); \
JOCTET c; \ buffer[1] = 0; \
put_bits -= 8; \ buffer -= -2 + ((JOCTET)(b) < 0xFF); \
c = (JOCTET)GETJOCTET(put_buffer >> put_bits); \
*buffer++ = c; \
if (c == 0xFF) /* need to stuff a zero byte? */ \
*buffer++ = 0; \
} }
#define PUT_BITS(code, size) { \ /* Output the entire bit buffer. If there are no 0xFF bytes in it, then write
put_bits += size; \ * directly to the output buffer. Otherwise, use the EMIT_BYTE() macro to
put_buffer = (put_buffer << size) | code; \ * encode 0xFF as 0xFF 0x00.
} */
#if BIT_BUF_SIZE == 64
#if SIZEOF_SIZE_T != 8 && !defined(_WIN64) #define FLUSH() { \
if (put_buffer & 0x8080808080808080 & ~(put_buffer + 0x0101010101010101)) { \
#define CHECKBUF15() { \ EMIT_BYTE(put_buffer >> 56) \
if (put_bits > 15) { \ EMIT_BYTE(put_buffer >> 48) \
EMIT_BYTE() \ EMIT_BYTE(put_buffer >> 40) \
EMIT_BYTE() \ EMIT_BYTE(put_buffer >> 32) \
EMIT_BYTE(put_buffer >> 24) \
EMIT_BYTE(put_buffer >> 16) \
EMIT_BYTE(put_buffer >> 8) \
EMIT_BYTE(put_buffer ) \
} else { \
buffer[0] = (JOCTET)(put_buffer >> 56); \
buffer[1] = (JOCTET)(put_buffer >> 48); \
buffer[2] = (JOCTET)(put_buffer >> 40); \
buffer[3] = (JOCTET)(put_buffer >> 32); \
buffer[4] = (JOCTET)(put_buffer >> 24); \
buffer[5] = (JOCTET)(put_buffer >> 16); \
buffer[6] = (JOCTET)(put_buffer >> 8); \
buffer[7] = (JOCTET)(put_buffer); \
buffer += 8; \
} \ } \
} }
#endif
#define CHECKBUF31() { \
if (put_bits > 31) { \
EMIT_BYTE() \
EMIT_BYTE() \
EMIT_BYTE() \
EMIT_BYTE() \
} \
}
#define CHECKBUF47() { \
if (put_bits > 47) { \
EMIT_BYTE() \
EMIT_BYTE() \
EMIT_BYTE() \
EMIT_BYTE() \
EMIT_BYTE() \
EMIT_BYTE() \
} \
}
#if !defined(_WIN32) && !defined(SIZEOF_SIZE_T)
#error Cannot determine word size
#endif
#if SIZEOF_SIZE_T == 8 || defined(_WIN64)
#define EMIT_BITS(code, size) { \
CHECKBUF47() \
PUT_BITS(code, size) \
}
#define EMIT_CODE(code, size) { \
temp2 &= (((JLONG)1) << nbits) - 1; \
CHECKBUF31() \
PUT_BITS(code, size) \
PUT_BITS(temp2, nbits) \
}
#else #else
#define EMIT_BITS(code, size) { \ #define FLUSH() { \
PUT_BITS(code, size) \ if (put_buffer & 0x80808080 & ~(put_buffer + 0x01010101)) { \
CHECKBUF15() \ EMIT_BYTE(put_buffer >> 24) \
} EMIT_BYTE(put_buffer >> 16) \
EMIT_BYTE(put_buffer >> 8) \
#define EMIT_CODE(code, size) { \ EMIT_BYTE(put_buffer ) \
temp2 &= (((JLONG)1) << nbits) - 1; \ } else { \
PUT_BITS(code, size) \ buffer[0] = (JOCTET)(put_buffer >> 24); \
CHECKBUF15() \ buffer[1] = (JOCTET)(put_buffer >> 16); \
PUT_BITS(temp2, nbits) \ buffer[2] = (JOCTET)(put_buffer >> 8); \
CHECKBUF15() \ buffer[3] = (JOCTET)(put_buffer); \
buffer += 4; \
} \
} }
#endif #endif
/* Fill the bit buffer to capacity with the leading bits from code, then output
* the bit buffer and put the remaining bits from code into the bit buffer.
*/
#define PUT_AND_FLUSH(code, size) { \
put_buffer = (put_buffer << (size + free_bits)) | (code >> -free_bits); \
FLUSH() \
free_bits += BIT_BUF_SIZE; \
put_buffer = code; \
}
/* Insert code into the bit buffer and output the bit buffer if needed.
* NOTE: We can't flush with free_bits == 0, since the left shift in
* PUT_AND_FLUSH() would have undefined behavior.
*/
#define PUT_BITS(code, size) { \
free_bits -= size; \
if (free_bits < 0) \
PUT_AND_FLUSH(code, size) \
else \
put_buffer = (put_buffer << size) | code; \
}
#define PUT_CODE(code, size) { \
temp &= (((JLONG)1) << nbits) - 1; \
temp |= code << nbits; \
nbits += size; \
PUT_BITS(temp, nbits) \
}
/* Although it is exceedingly rare, it is possible for a Huffman-encoded /* Although it is exceedingly rare, it is possible for a Huffman-encoded
* coefficient block to be larger than the 128-byte unencoded block. For each * coefficient block to be larger than the 128-byte unencoded block. For each
@@ -444,6 +471,7 @@ dump_buffer(working_state *state)
#define STORE_BUFFER() { \ #define STORE_BUFFER() { \
if (localbuf) { \ if (localbuf) { \
size_t bytes, bytestocopy; \
bytes = buffer - _buffer; \ bytes = buffer - _buffer; \
buffer = _buffer; \ buffer = _buffer; \
while (bytes > 0) { \ while (bytes > 0) { \
@@ -466,20 +494,46 @@ dump_buffer(working_state *state)
LOCAL(boolean) LOCAL(boolean)
flush_bits(working_state *state) flush_bits(working_state *state)
{ {
JOCTET _buffer[BUFSIZE], *buffer; JOCTET _buffer[BUFSIZE], *buffer, temp;
size_t put_buffer; int put_bits; simd_bit_buf_type put_buffer; int put_bits;
size_t bytes, bytestocopy; int localbuf = 0; int localbuf = 0;
if (state->simd) {
#if defined(__aarch64__) && !defined(NEON_INTRINSICS)
put_bits = state->cur.free_bits;
#else
put_bits = SIMD_BIT_BUF_SIZE - state->cur.free_bits;
#endif
put_buffer = state->cur.put_buffer.simd;
} else {
put_bits = BIT_BUF_SIZE - state->cur.free_bits;
put_buffer = state->cur.put_buffer.c;
}
put_buffer = state->cur.put_buffer;
put_bits = state->cur.put_bits;
LOAD_BUFFER() LOAD_BUFFER()
/* fill any partial byte with ones */ while (put_bits >= 8) {
PUT_BITS(0x7F, 7) put_bits -= 8;
while (put_bits >= 8) EMIT_BYTE() temp = (JOCTET)(put_buffer >> put_bits);
EMIT_BYTE(temp)
}
if (put_bits) {
/* fill partial byte with ones */
temp = (JOCTET)((put_buffer << (8 - put_bits)) | (0xFF >> put_bits));
EMIT_BYTE(temp)
}
state->cur.put_buffer = 0; /* and reset bit-buffer to empty */ if (state->simd) { /* and reset bit buffer to empty */
state->cur.put_bits = 0; state->cur.put_buffer.simd = 0;
#if defined(__aarch64__) && !defined(NEON_INTRINSICS)
state->cur.free_bits = 0;
#else
state->cur.free_bits = SIMD_BIT_BUF_SIZE;
#endif
} else {
state->cur.put_buffer.c = 0;
state->cur.free_bits = BIT_BUF_SIZE;
}
STORE_BUFFER() STORE_BUFFER()
return TRUE; return TRUE;
@@ -493,7 +547,7 @@ encode_one_block_simd(working_state *state, JCOEFPTR block, int last_dc_val,
c_derived_tbl *dctbl, c_derived_tbl *actbl) c_derived_tbl *dctbl, c_derived_tbl *actbl)
{ {
JOCTET _buffer[BUFSIZE], *buffer; JOCTET _buffer[BUFSIZE], *buffer;
size_t bytes, bytestocopy; int localbuf = 0; int localbuf = 0;
LOAD_BUFFER() LOAD_BUFFER()
@@ -509,53 +563,41 @@ LOCAL(boolean)
encode_one_block(working_state *state, JCOEFPTR block, int last_dc_val, encode_one_block(working_state *state, JCOEFPTR block, int last_dc_val,
c_derived_tbl *dctbl, c_derived_tbl *actbl) c_derived_tbl *dctbl, c_derived_tbl *actbl)
{ {
int temp, temp2, temp3; int temp, nbits, free_bits;
int nbits; bit_buf_type put_buffer;
int r, code, size;
JOCTET _buffer[BUFSIZE], *buffer; JOCTET _buffer[BUFSIZE], *buffer;
size_t put_buffer; int put_bits; int localbuf = 0;
int code_0xf0 = actbl->ehufco[0xf0], size_0xf0 = actbl->ehufsi[0xf0];
size_t bytes, bytestocopy; int localbuf = 0;
put_buffer = state->cur.put_buffer; free_bits = state->cur.free_bits;
put_bits = state->cur.put_bits; put_buffer = state->cur.put_buffer.c;
LOAD_BUFFER() LOAD_BUFFER()
/* Encode the DC coefficient difference per section F.1.2.1 */ /* Encode the DC coefficient difference per section F.1.2.1 */
temp = temp2 = block[0] - last_dc_val; temp = block[0] - last_dc_val;
/* This is a well-known technique for obtaining the absolute value without a /* This is a well-known technique for obtaining the absolute value without a
* branch. It is derived from an assembly language technique presented in * branch. It is derived from an assembly language technique presented in
* "How to Optimize for the Pentium Processors", Copyright (c) 1996, 1997 by * "How to Optimize for the Pentium Processors", Copyright (c) 1996, 1997 by
* Agner Fog. * Agner Fog. This code assumes we are on a two's complement machine.
*/ */
temp3 = temp >> (CHAR_BIT * sizeof(int) - 1); nbits = temp >> (CHAR_BIT * sizeof(int) - 1);
temp ^= temp3; temp += nbits;
temp -= temp3; nbits ^= temp;
/* For a negative input, want temp2 = bitwise complement of abs(input) */
/* This code assumes we are on a two's complement machine */
temp2 += temp3;
/* Find the number of bits needed for the magnitude of the coefficient */ /* Find the number of bits needed for the magnitude of the coefficient */
nbits = JPEG_NBITS(temp); nbits = JPEG_NBITS(nbits);
/* Emit the Huffman-coded symbol for the number of bits */ /* Emit the Huffman-coded symbol for the number of bits.
code = dctbl->ehufco[nbits]; * Emit that number of bits of the value, if positive,
size = dctbl->ehufsi[nbits]; * or the complement of its magnitude, if negative.
EMIT_BITS(code, size) */
PUT_CODE(dctbl->ehufco[nbits], dctbl->ehufsi[nbits])
/* Mask off any extra bits in code */
temp2 &= (((JLONG)1) << nbits) - 1;
/* Emit that number of bits of the value, if positive, */
/* or the complement of its magnitude, if negative. */
EMIT_BITS(temp2, nbits)
/* Encode the AC coefficients per section F.1.2.2 */ /* Encode the AC coefficients per section F.1.2.2 */
r = 0; /* r = run length of zeros */ {
int r = 0; /* r = run length of zeros */
/* Manually unroll the k loop to eliminate the counter variable. This /* Manually unroll the k loop to eliminate the counter variable. This
* improves performance greatly on systems with a limited number of * improves performance greatly on systems with a limited number of
@@ -563,51 +605,46 @@ encode_one_block(working_state *state, JCOEFPTR block, int last_dc_val,
*/ */
#define kloop(jpeg_natural_order_of_k) { \ #define kloop(jpeg_natural_order_of_k) { \
if ((temp = block[jpeg_natural_order_of_k]) == 0) { \ if ((temp = block[jpeg_natural_order_of_k]) == 0) { \
r++; \ r += 16; \
} else { \ } else { \
temp2 = temp; \
/* Branch-less absolute value, bitwise complement, etc., same as above */ \ /* Branch-less absolute value, bitwise complement, etc., same as above */ \
temp3 = temp >> (CHAR_BIT * sizeof(int) - 1); \ nbits = temp >> (CHAR_BIT * sizeof(int) - 1); \
temp ^= temp3; \ temp += nbits; \
temp -= temp3; \ nbits ^= temp; \
temp2 += temp3; \ nbits = JPEG_NBITS_NONZERO(nbits); \
nbits = JPEG_NBITS_NONZERO(temp); \
/* if run length > 15, must emit special run-length-16 codes (0xF0) */ \ /* if run length > 15, must emit special run-length-16 codes (0xF0) */ \
while (r > 15) { \ while (r >= 16 * 16) { \
EMIT_BITS(code_0xf0, size_0xf0) \ r -= 16 * 16; \
r -= 16; \ PUT_BITS(actbl->ehufco[0xf0], actbl->ehufsi[0xf0]) \
} \ } \
/* Emit Huffman symbol for run length / number of bits */ \ /* Emit Huffman symbol for run length / number of bits */ \
temp3 = (r << 4) + nbits; \ r += nbits; \
code = actbl->ehufco[temp3]; \ PUT_CODE(actbl->ehufco[r], actbl->ehufsi[r]) \
size = actbl->ehufsi[temp3]; \
EMIT_CODE(code, size) \
r = 0; \ r = 0; \
} \ } \
} }
/* One iteration for each value in jpeg_natural_order[] */ /* One iteration for each value in jpeg_natural_order[] */
kloop(1); kloop(8); kloop(16); kloop(9); kloop(2); kloop(3); kloop(1); kloop(8); kloop(16); kloop(9); kloop(2); kloop(3);
kloop(10); kloop(17); kloop(24); kloop(32); kloop(25); kloop(18); kloop(10); kloop(17); kloop(24); kloop(32); kloop(25); kloop(18);
kloop(11); kloop(4); kloop(5); kloop(12); kloop(19); kloop(26); kloop(11); kloop(4); kloop(5); kloop(12); kloop(19); kloop(26);
kloop(33); kloop(40); kloop(48); kloop(41); kloop(34); kloop(27); kloop(33); kloop(40); kloop(48); kloop(41); kloop(34); kloop(27);
kloop(20); kloop(13); kloop(6); kloop(7); kloop(14); kloop(21); kloop(20); kloop(13); kloop(6); kloop(7); kloop(14); kloop(21);
kloop(28); kloop(35); kloop(42); kloop(49); kloop(56); kloop(57); kloop(28); kloop(35); kloop(42); kloop(49); kloop(56); kloop(57);
kloop(50); kloop(43); kloop(36); kloop(29); kloop(22); kloop(15); kloop(50); kloop(43); kloop(36); kloop(29); kloop(22); kloop(15);
kloop(23); kloop(30); kloop(37); kloop(44); kloop(51); kloop(58); kloop(23); kloop(30); kloop(37); kloop(44); kloop(51); kloop(58);
kloop(59); kloop(52); kloop(45); kloop(38); kloop(31); kloop(39); kloop(59); kloop(52); kloop(45); kloop(38); kloop(31); kloop(39);
kloop(46); kloop(53); kloop(60); kloop(61); kloop(54); kloop(47); kloop(46); kloop(53); kloop(60); kloop(61); kloop(54); kloop(47);
kloop(55); kloop(62); kloop(63); kloop(55); kloop(62); kloop(63);
/* If the last coef(s) were zero, emit an end-of-block code */ /* If the last coef(s) were zero, emit an end-of-block code */
if (r > 0) { if (r > 0) {
code = actbl->ehufco[0]; PUT_BITS(actbl->ehufco[0], actbl->ehufsi[0])
size = actbl->ehufsi[0]; }
EMIT_BITS(code, size)
} }
state->cur.put_buffer = put_buffer; state->cur.put_buffer.c = put_buffer;
state->cur.put_bits = put_bits; state->cur.free_bits = free_bits;
STORE_BUFFER() STORE_BUFFER()
return TRUE; return TRUE;
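Note on the hunk above: the DC difference is coded as a Huffman symbol for its magnitude category (`nbits`) followed by `nbits` appended bits, and each nonzero AC coefficient as a run/size symbol followed by its appended bits. A minimal sketch of how a single coefficient value maps to that (category, appended bits) pair, assuming two's-complement int; `code_coefficient` and `nbits_of` are illustrative names, not library functions:

/* Illustrative only: how one coefficient value maps to a magnitude
 * category ("nbits") and the bits appended after the Huffman symbol.
 */
static int nbits_of(int magnitude)
{
  int n = 0;
  while (magnitude) { n++; magnitude >>= 1; }
  return n;
}

static void code_coefficient(int value, int *nbits, unsigned int *appended)
{
  int magnitude = value < 0 ? -value : value;
  *nbits = nbits_of(magnitude);
  /* Positive values are appended as-is; negative values as the one's
   * complement of their magnitude, truncated to nbits bits.
   */
  *appended = (unsigned int)(value < 0 ? value - 1 : value) &
              ((1u << *nbits) - 1);
}

For example, a value of -3 gives nbits == 2 and appended bits 00 (the one's complement of 3 in two bits), which is what the masking of temp2 in the code above produces.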
@@ -654,8 +691,9 @@ encode_mcu_huff(j_compress_ptr cinfo, JBLOCKROW *MCU_data)
/* Load up working state */ /* Load up working state */
state.next_output_byte = cinfo->dest->next_output_byte; state.next_output_byte = cinfo->dest->next_output_byte;
state.free_in_buffer = cinfo->dest->free_in_buffer; state.free_in_buffer = cinfo->dest->free_in_buffer;
ASSIGN_STATE(state.cur, entropy->saved); state.cur = entropy->saved;
state.cinfo = cinfo; state.cinfo = cinfo;
state.simd = entropy->simd;
/* Emit restart marker if needed */ /* Emit restart marker if needed */
if (cinfo->restart_interval) { if (cinfo->restart_interval) {
@@ -694,7 +732,7 @@ encode_mcu_huff(j_compress_ptr cinfo, JBLOCKROW *MCU_data)
/* Completed MCU, so update state */ /* Completed MCU, so update state */
cinfo->dest->next_output_byte = state.next_output_byte; cinfo->dest->next_output_byte = state.next_output_byte;
cinfo->dest->free_in_buffer = state.free_in_buffer; cinfo->dest->free_in_buffer = state.free_in_buffer;
ASSIGN_STATE(entropy->saved, state.cur); entropy->saved = state.cur;
/* Update restart-interval state too */ /* Update restart-interval state too */
if (cinfo->restart_interval) { if (cinfo->restart_interval) {
@@ -723,8 +761,9 @@ finish_pass_huff(j_compress_ptr cinfo)
/* Load up working state ... flush_bits needs it */ /* Load up working state ... flush_bits needs it */
state.next_output_byte = cinfo->dest->next_output_byte; state.next_output_byte = cinfo->dest->next_output_byte;
state.free_in_buffer = cinfo->dest->free_in_buffer; state.free_in_buffer = cinfo->dest->free_in_buffer;
ASSIGN_STATE(state.cur, entropy->saved); state.cur = entropy->saved;
state.cinfo = cinfo; state.cinfo = cinfo;
state.simd = entropy->simd;
/* Flush out the last data */ /* Flush out the last data */
if (!flush_bits(&state)) if (!flush_bits(&state))
@@ -733,7 +772,7 @@ finish_pass_huff(j_compress_ptr cinfo)
/* Update state */ /* Update state */
cinfo->dest->next_output_byte = state.next_output_byte; cinfo->dest->next_output_byte = state.next_output_byte;
cinfo->dest->free_in_buffer = state.free_in_buffer; cinfo->dest->free_in_buffer = state.free_in_buffer;
ASSIGN_STATE(entropy->saved, state.cur); entropy->saved = state.cur;
} }


@@ -61,11 +61,6 @@
unsigned. */ unsigned. */
#cmakedefine RIGHT_SHIFT_IS_UNSIGNED 1 #cmakedefine RIGHT_SHIFT_IS_UNSIGNED 1
/* Define to 1 if type `char' is unsigned and you are not using gcc. */
#ifndef __CHAR_UNSIGNED__
#cmakedefine __CHAR_UNSIGNED__ 1
#endif
/* Define to empty if `const' does not conform to ANSI C. */ /* Define to empty if `const' does not conform to ANSI C. */
/* #undef const */ /* #undef const */


@@ -42,12 +42,6 @@
*/ */
/* #define const */ /* #define const */
/* Define this if an ordinary "char" type is unsigned.
* If you're not sure, leaving it undefined will work at some cost in speed.
* If you defined HAVE_UNSIGNED_CHAR then the speed difference is minimal.
*/
#undef __CHAR_UNSIGNED__
/* Define this if your system has an ANSI-conforming <stddef.h> file. /* Define this if your system has an ANSI-conforming <stddef.h> file.
*/ */
#define HAVE_STDDEF_H #define HAVE_STDDEF_H
@@ -119,7 +113,6 @@ typedef unsigned char boolean;
#define BMP_SUPPORTED /* BMP image file format */ #define BMP_SUPPORTED /* BMP image file format */
#define GIF_SUPPORTED /* GIF image file format */ #define GIF_SUPPORTED /* GIF image file format */
#define PPM_SUPPORTED /* PBMPLUS PPM/PGM image file format */ #define PPM_SUPPORTED /* PBMPLUS PPM/PGM image file format */
#undef RLE_SUPPORTED /* Utah RLE image file format */
#define TARGA_SUPPORTED /* Targa image file format */ #define TARGA_SUPPORTED /* Targa image file format */
/* Define this if you want to name both input and output files on the command /* Define this if you want to name both input and output files on the command


@@ -4,8 +4,9 @@
* This file was part of the Independent JPEG Group's software: * This file was part of the Independent JPEG Group's software:
* Copyright (C) 1995-1997, Thomas G. Lane. * Copyright (C) 1995-1997, Thomas G. Lane.
* libjpeg-turbo Modifications: * libjpeg-turbo Modifications:
* Copyright (C) 2011, 2015, 2018, D. R. Commander. * Copyright (C) 2011, 2015, 2018, 2021, D. R. Commander.
* Copyright (C) 2016, 2018, Matthieu Darbois. * Copyright (C) 2016, 2018, Matthieu Darbois.
* Copyright (C) 2020, Arm Limited.
* Copyright (C) 2014, Mozilla Corporation. * Copyright (C) 2014, Mozilla Corporation.
* For conditions of distribution and use, see the accompanying README.ijg * For conditions of distribution and use, see the accompanying README.ijg
* file. * file.
@@ -52,15 +53,19 @@
* flags (this defines __thumb__). * flags (this defines __thumb__).
*/ */
/* NOTE: Both GCC and Clang define __GNUC__ */ #if defined(__arm__) || defined(__aarch64__) || defined(_M_ARM) || \
#if defined(__GNUC__) && (defined(__arm__) || defined(__aarch64__)) defined(_M_ARM64)
#if !defined(__thumb__) || defined(__thumb2__) #if !defined(__thumb__) || defined(__thumb2__)
#define USE_CLZ_INTRINSIC #define USE_CLZ_INTRINSIC
#endif #endif
#endif #endif
#ifdef USE_CLZ_INTRINSIC #ifdef USE_CLZ_INTRINSIC
#if defined(_MSC_VER) && !defined(__clang__)
#define JPEG_NBITS_NONZERO(x) (32 - _CountLeadingZeros(x))
#else
#define JPEG_NBITS_NONZERO(x) (32 - __builtin_clz(x)) #define JPEG_NBITS_NONZERO(x) (32 - __builtin_clz(x))
#endif
#define JPEG_NBITS(x) (x ? JPEG_NBITS_NONZERO(x) : 0) #define JPEG_NBITS(x) (x ? JPEG_NBITS_NONZERO(x) : 0)
#else #else
#include "jpeg_nbits_table.h" #include "jpeg_nbits_table.h"
@@ -136,9 +141,9 @@ typedef phuff_entropy_encoder *phuff_entropy_ptr;
#ifdef RIGHT_SHIFT_IS_UNSIGNED #ifdef RIGHT_SHIFT_IS_UNSIGNED
#define ISHIFT_TEMPS int ishift_temp; #define ISHIFT_TEMPS int ishift_temp;
#define IRIGHT_SHIFT(x,shft) \ #define IRIGHT_SHIFT(x,shft) \
((ishift_temp = (x)) < 0 ? \ ((ishift_temp = (x)) < 0 ? \
(ishift_temp >> (shft)) | ((~0) << (16-(shft))) : \ (ishift_temp >> (shft)) | ((~0) << (16-(shft))) : \
(ishift_temp >> (shft))) (ishift_temp >> (shft)))
#else #else
#define ISHIFT_TEMPS #define ISHIFT_TEMPS
#define IRIGHT_SHIFT(x,shft) ((x) >> (shft)) #define IRIGHT_SHIFT(x,shft) ((x) >> (shft))
@@ -148,19 +153,19 @@ typedef phuff_entropy_encoder *phuff_entropy_ptr;
/* Forward declarations */ /* Forward declarations */
METHODDEF(boolean) encode_mcu_DC_first (j_compress_ptr cinfo, METHODDEF(boolean) encode_mcu_DC_first (j_compress_ptr cinfo,
JBLOCKROW *MCU_data); JBLOCKROW *MCU_data);
METHODDEF(void) encode_mcu_AC_first_prepare METHODDEF(void) encode_mcu_AC_first_prepare
(const JCOEF *block, const int *jpeg_natural_order_start, int Sl, int Al, (const JCOEF *block, const int *jpeg_natural_order_start, int Sl, int Al,
JCOEF *values, size_t *zerobits); JCOEF *values, size_t *zerobits);
METHODDEF(boolean) encode_mcu_AC_first (j_compress_ptr cinfo, METHODDEF(boolean) encode_mcu_AC_first (j_compress_ptr cinfo,
JBLOCKROW *MCU_data); JBLOCKROW *MCU_data);
METHODDEF(boolean) encode_mcu_DC_refine (j_compress_ptr cinfo, METHODDEF(boolean) encode_mcu_DC_refine (j_compress_ptr cinfo,
JBLOCKROW *MCU_data); JBLOCKROW *MCU_data);
METHODDEF(int) encode_mcu_AC_refine_prepare METHODDEF(int) encode_mcu_AC_refine_prepare
(const JCOEF *block, const int *jpeg_natural_order_start, int Sl, int Al, (const JCOEF *block, const int *jpeg_natural_order_start, int Sl, int Al,
JCOEF *absvalues, size_t *bits); JCOEF *absvalues, size_t *bits);
METHODDEF(boolean) encode_mcu_AC_refine (j_compress_ptr cinfo, METHODDEF(boolean) encode_mcu_AC_refine (j_compress_ptr cinfo,
JBLOCKROW *MCU_data); JBLOCKROW *MCU_data);
METHODDEF(void) finish_pass_phuff (j_compress_ptr cinfo); METHODDEF(void) finish_pass_phuff (j_compress_ptr cinfo);
METHODDEF(void) finish_pass_gather_phuff (j_compress_ptr cinfo); METHODDEF(void) finish_pass_gather_phuff (j_compress_ptr cinfo);
@@ -170,24 +175,26 @@ INLINE
METHODDEF(int) METHODDEF(int)
count_zeroes(size_t *x) count_zeroes(size_t *x)
{ {
int result;
#if defined(HAVE_BUILTIN_CTZL) #if defined(HAVE_BUILTIN_CTZL)
int result;
result = __builtin_ctzl(*x); result = __builtin_ctzl(*x);
*x >>= result; *x >>= result;
#elif defined(HAVE_BITSCANFORWARD64) #elif defined(HAVE_BITSCANFORWARD64)
unsigned long result;
_BitScanForward64(&result, *x); _BitScanForward64(&result, *x);
*x >>= result; *x >>= result;
#elif defined(HAVE_BITSCANFORWARD) #elif defined(HAVE_BITSCANFORWARD)
unsigned long result;
_BitScanForward(&result, *x); _BitScanForward(&result, *x);
*x >>= result; *x >>= result;
#else #else
result = 0; int result = 0;
while ((*x & 1) == 0) { while ((*x & 1) == 0) {
++result; ++result;
*x >>= 1; *x >>= 1;
} }
#endif #endif
return result; return (int)result;
} }
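count_zeroes() returns the number of trailing zero bits in *x and shifts them out, so its callers can jump straight from one set bit of a coefficient bitmap to the next. A hypothetical usage sketch (not the encoder's actual loop), reusing the portable fallback shown above:

/* Hypothetical caller: walk the set bits of a bitmap, much as the
 * progressive encoder consumes its "zerobits" maps.  count_zeroes() must
 * not be called with *x == 0, hence the while (bits) guard.
 */
static void visit_set_bits(size_t bits)
{
  int index = 0;

  while (bits) {
    index += count_zeroes(&bits); /* index of the next set bit */
    /* ... process the coefficient at 'index' here ... */
    bits >>= 1;                   /* consume that set bit */
    index++;
  }
}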
@@ -306,7 +313,7 @@ start_pass_phuff (j_compress_ptr cinfo, boolean gather_statistics)
/* Emit a byte */ /* Emit a byte */
#define emit_byte(entropy, val) { \ #define emit_byte(entropy, val) { \
*(entropy)->next_output_byte++ = (JOCTET)(val); \ *(entropy)->next_output_byte++ = (JOCTET)(val); \
if (--(entropy)->free_in_buffer == 0) \ if (--(entropy)->free_in_buffer == 0) \
dump_buffer(entropy); \ dump_buffer(entropy); \
} }
@@ -403,7 +410,7 @@ emit_symbol (phuff_entropy_ptr entropy, int tbl_no, int symbol)
LOCAL(void) LOCAL(void)
emit_buffered_bits (phuff_entropy_ptr entropy, char *bufstart, emit_buffered_bits (phuff_entropy_ptr entropy, char *bufstart,
unsigned int nbits) unsigned int nbits)
{ {
if (entropy->gather_statistics) if (entropy->gather_statistics)
return; /* no real work */ return; /* no real work */
@@ -524,7 +531,7 @@ encode_mcu_DC_first (j_compress_ptr cinfo, JBLOCKROW *MCU_data)
temp3 = temp >> (CHAR_BIT * sizeof(int) - 1); temp3 = temp >> (CHAR_BIT * sizeof(int) - 1);
temp ^= temp3; temp ^= temp3;
temp -= temp3; /* temp is abs value of input */ temp -= temp3; /* temp is abs value of input */
/* For a negative input, want temp2 = bitwise complement of abs(input) */ /* For a negative input, want temp2 = bitwise complement of abs(input) */
temp2 = temp ^ temp3; temp2 = temp ^ temp3;
/* Find the number of bits needed for the magnitude of the coefficient */ /* Find the number of bits needed for the magnitude of the coefficient */
@@ -696,9 +703,9 @@ encode_mcu_AC_first (j_compress_ptr cinfo, JBLOCKROW *MCU_data)
zerobits |= bits[1]; zerobits |= bits[1];
#endif #endif
/* Emit any pending EOBRUN */ /* Emit any pending EOBRUN */
if (zerobits && (entropy->EOBRUN > 0)) if (zerobits && (entropy->EOBRUN > 0))
emit_eobrun(entropy); emit_eobrun(entropy);
#if SIZEOF_SIZE_T == 4 #if SIZEOF_SIZE_T == 4
zerobits = bits[0]; zerobits = bits[0];
@@ -983,7 +990,7 @@ encode_mcu_AC_refine (j_compress_ptr cinfo, JBLOCKROW *MCU_data)
r += idx; r += idx;
cabsvalue += idx; cabsvalue += idx;
goto first_iter_ac_refine; goto first_iter_ac_refine;
} }
ENCODE_COEFS_AC_REFINE(first_iter_ac_refine:); ENCODE_COEFS_AC_REFINE(first_iter_ac_refine:);
#endif #endif


@@ -6,7 +6,7 @@
* libjpeg-turbo Modifications: * libjpeg-turbo Modifications:
* Copyright 2009 Pierre Ossman <ossman@cendio.se> for Cendio AB * Copyright 2009 Pierre Ossman <ossman@cendio.se> for Cendio AB
* Copyright (C) 2014, MIPS Technologies, Inc., California. * Copyright (C) 2014, MIPS Technologies, Inc., California.
* Copyright (C) 2015, D. R. Commander. * Copyright (C) 2015, 2019, D. R. Commander.
* For conditions of distribution and use, see the accompanying README.ijg * For conditions of distribution and use, see the accompanying README.ijg
* file. * file.
* *
@@ -103,7 +103,7 @@ expand_right_edge(JSAMPARRAY image_data, int num_rows, JDIMENSION input_cols,
if (numcols > 0) { if (numcols > 0) {
for (row = 0; row < num_rows; row++) { for (row = 0; row < num_rows; row++) {
ptr = image_data[row] + input_cols; ptr = image_data[row] + input_cols;
pixval = ptr[-1]; /* don't need GETJSAMPLE() here */ pixval = ptr[-1];
for (count = numcols; count > 0; count--) for (count = numcols; count > 0; count--)
*ptr++ = pixval; *ptr++ = pixval;
} }
@@ -174,7 +174,7 @@ int_downsample(j_compress_ptr cinfo, jpeg_component_info *compptr,
for (v = 0; v < v_expand; v++) { for (v = 0; v < v_expand; v++) {
inptr = input_data[inrow + v] + outcol_h; inptr = input_data[inrow + v] + outcol_h;
for (h = 0; h < h_expand; h++) { for (h = 0; h < h_expand; h++) {
outvalue += (JLONG)GETJSAMPLE(*inptr++); outvalue += (JLONG)(*inptr++);
} }
} }
*outptr++ = (JSAMPLE)((outvalue + numpix2) / numpix); *outptr++ = (JSAMPLE)((outvalue + numpix2) / numpix);
@@ -237,8 +237,7 @@ h2v1_downsample(j_compress_ptr cinfo, jpeg_component_info *compptr,
inptr = input_data[outrow]; inptr = input_data[outrow];
bias = 0; /* bias = 0,1,0,1,... for successive samples */ bias = 0; /* bias = 0,1,0,1,... for successive samples */
for (outcol = 0; outcol < output_cols; outcol++) { for (outcol = 0; outcol < output_cols; outcol++) {
*outptr++ = *outptr++ = (JSAMPLE)((inptr[0] + inptr[1] + bias) >> 1);
(JSAMPLE)((GETJSAMPLE(*inptr) + GETJSAMPLE(inptr[1]) + bias) >> 1);
bias ^= 1; /* 0=>1, 1=>0 */ bias ^= 1; /* 0=>1, 1=>0 */
inptr += 2; inptr += 2;
} }
@@ -277,8 +276,7 @@ h2v2_downsample(j_compress_ptr cinfo, jpeg_component_info *compptr,
bias = 1; /* bias = 1,2,1,2,... for successive samples */ bias = 1; /* bias = 1,2,1,2,... for successive samples */
for (outcol = 0; outcol < output_cols; outcol++) { for (outcol = 0; outcol < output_cols; outcol++) {
*outptr++ = *outptr++ =
(JSAMPLE)((GETJSAMPLE(*inptr0) + GETJSAMPLE(inptr0[1]) + (JSAMPLE)((inptr0[0] + inptr0[1] + inptr1[0] + inptr1[1] + bias) >> 2);
GETJSAMPLE(*inptr1) + GETJSAMPLE(inptr1[1]) + bias) >> 2);
bias ^= 3; /* 1=>2, 2=>1 */ bias ^= 3; /* 1=>2, 2=>1 */
inptr0 += 2; inptr1 += 2; inptr0 += 2; inptr1 += 2;
} }
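The h2v2 downsampler above averages each 2x2 input neighborhood into one output sample, with a bias that alternates between 1 and 2 on successive columns so exact ties do not always round the same way. A small sketch of that arithmetic (illustrative helper, not library code):

static int h2v2_average(int p00, int p01, int p10, int p11, int bias)
{
  /* bias alternates 1,2,1,2,... across output columns */
  return (p00 + p01 + p10 + p11 + bias) >> 2;
}
/* e.g. h2v2_average(10, 11, 10, 11, 1) == 10
 *      h2v2_average(10, 11, 10, 11, 2) == 11
 */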
@@ -337,33 +335,25 @@ h2v2_smooth_downsample(j_compress_ptr cinfo, jpeg_component_info *compptr,
below_ptr = input_data[inrow + 2]; below_ptr = input_data[inrow + 2];
/* Special case for first column: pretend column -1 is same as column 0 */ /* Special case for first column: pretend column -1 is same as column 0 */
membersum = GETJSAMPLE(*inptr0) + GETJSAMPLE(inptr0[1]) + membersum = inptr0[0] + inptr0[1] + inptr1[0] + inptr1[1];
GETJSAMPLE(*inptr1) + GETJSAMPLE(inptr1[1]); neighsum = above_ptr[0] + above_ptr[1] + below_ptr[0] + below_ptr[1] +
neighsum = GETJSAMPLE(*above_ptr) + GETJSAMPLE(above_ptr[1]) + inptr0[0] + inptr0[2] + inptr1[0] + inptr1[2];
GETJSAMPLE(*below_ptr) + GETJSAMPLE(below_ptr[1]) +
GETJSAMPLE(*inptr0) + GETJSAMPLE(inptr0[2]) +
GETJSAMPLE(*inptr1) + GETJSAMPLE(inptr1[2]);
neighsum += neighsum; neighsum += neighsum;
neighsum += GETJSAMPLE(*above_ptr) + GETJSAMPLE(above_ptr[2]) + neighsum += above_ptr[0] + above_ptr[2] + below_ptr[0] + below_ptr[2];
GETJSAMPLE(*below_ptr) + GETJSAMPLE(below_ptr[2]);
membersum = membersum * memberscale + neighsum * neighscale; membersum = membersum * memberscale + neighsum * neighscale;
*outptr++ = (JSAMPLE)((membersum + 32768) >> 16); *outptr++ = (JSAMPLE)((membersum + 32768) >> 16);
inptr0 += 2; inptr1 += 2; above_ptr += 2; below_ptr += 2; inptr0 += 2; inptr1 += 2; above_ptr += 2; below_ptr += 2;
for (colctr = output_cols - 2; colctr > 0; colctr--) { for (colctr = output_cols - 2; colctr > 0; colctr--) {
/* sum of pixels directly mapped to this output element */ /* sum of pixels directly mapped to this output element */
membersum = GETJSAMPLE(*inptr0) + GETJSAMPLE(inptr0[1]) + membersum = inptr0[0] + inptr0[1] + inptr1[0] + inptr1[1];
GETJSAMPLE(*inptr1) + GETJSAMPLE(inptr1[1]);
/* sum of edge-neighbor pixels */ /* sum of edge-neighbor pixels */
neighsum = GETJSAMPLE(*above_ptr) + GETJSAMPLE(above_ptr[1]) + neighsum = above_ptr[0] + above_ptr[1] + below_ptr[0] + below_ptr[1] +
GETJSAMPLE(*below_ptr) + GETJSAMPLE(below_ptr[1]) + inptr0[-1] + inptr0[2] + inptr1[-1] + inptr1[2];
GETJSAMPLE(inptr0[-1]) + GETJSAMPLE(inptr0[2]) +
GETJSAMPLE(inptr1[-1]) + GETJSAMPLE(inptr1[2]);
/* The edge-neighbors count twice as much as corner-neighbors */ /* The edge-neighbors count twice as much as corner-neighbors */
neighsum += neighsum; neighsum += neighsum;
/* Add in the corner-neighbors */ /* Add in the corner-neighbors */
neighsum += GETJSAMPLE(above_ptr[-1]) + GETJSAMPLE(above_ptr[2]) + neighsum += above_ptr[-1] + above_ptr[2] + below_ptr[-1] + below_ptr[2];
GETJSAMPLE(below_ptr[-1]) + GETJSAMPLE(below_ptr[2]);
/* form final output scaled up by 2^16 */ /* form final output scaled up by 2^16 */
membersum = membersum * memberscale + neighsum * neighscale; membersum = membersum * memberscale + neighsum * neighscale;
/* round, descale and output it */ /* round, descale and output it */
@@ -372,15 +362,11 @@ h2v2_smooth_downsample(j_compress_ptr cinfo, jpeg_component_info *compptr,
} }
/* Special case for last column */ /* Special case for last column */
membersum = GETJSAMPLE(*inptr0) + GETJSAMPLE(inptr0[1]) + membersum = inptr0[0] + inptr0[1] + inptr1[0] + inptr1[1];
GETJSAMPLE(*inptr1) + GETJSAMPLE(inptr1[1]); neighsum = above_ptr[0] + above_ptr[1] + below_ptr[0] + below_ptr[1] +
neighsum = GETJSAMPLE(*above_ptr) + GETJSAMPLE(above_ptr[1]) + inptr0[-1] + inptr0[1] + inptr1[-1] + inptr1[1];
GETJSAMPLE(*below_ptr) + GETJSAMPLE(below_ptr[1]) +
GETJSAMPLE(inptr0[-1]) + GETJSAMPLE(inptr0[1]) +
GETJSAMPLE(inptr1[-1]) + GETJSAMPLE(inptr1[1]);
neighsum += neighsum; neighsum += neighsum;
neighsum += GETJSAMPLE(above_ptr[-1]) + GETJSAMPLE(above_ptr[1]) + neighsum += above_ptr[-1] + above_ptr[1] + below_ptr[-1] + below_ptr[1];
GETJSAMPLE(below_ptr[-1]) + GETJSAMPLE(below_ptr[1]);
membersum = membersum * memberscale + neighsum * neighscale; membersum = membersum * memberscale + neighsum * neighscale;
*outptr = (JSAMPLE)((membersum + 32768) >> 16); *outptr = (JSAMPLE)((membersum + 32768) >> 16);
@@ -429,21 +415,18 @@ fullsize_smooth_downsample(j_compress_ptr cinfo, jpeg_component_info *compptr,
below_ptr = input_data[outrow + 1]; below_ptr = input_data[outrow + 1];
/* Special case for first column */ /* Special case for first column */
colsum = GETJSAMPLE(*above_ptr++) + GETJSAMPLE(*below_ptr++) + colsum = (*above_ptr++) + (*below_ptr++) + inptr[0];
GETJSAMPLE(*inptr); membersum = *inptr++;
membersum = GETJSAMPLE(*inptr++); nextcolsum = above_ptr[0] + below_ptr[0] + inptr[0];
nextcolsum = GETJSAMPLE(*above_ptr) + GETJSAMPLE(*below_ptr) +
GETJSAMPLE(*inptr);
neighsum = colsum + (colsum - membersum) + nextcolsum; neighsum = colsum + (colsum - membersum) + nextcolsum;
membersum = membersum * memberscale + neighsum * neighscale; membersum = membersum * memberscale + neighsum * neighscale;
*outptr++ = (JSAMPLE)((membersum + 32768) >> 16); *outptr++ = (JSAMPLE)((membersum + 32768) >> 16);
lastcolsum = colsum; colsum = nextcolsum; lastcolsum = colsum; colsum = nextcolsum;
for (colctr = output_cols - 2; colctr > 0; colctr--) { for (colctr = output_cols - 2; colctr > 0; colctr--) {
membersum = GETJSAMPLE(*inptr++); membersum = *inptr++;
above_ptr++; below_ptr++; above_ptr++; below_ptr++;
nextcolsum = GETJSAMPLE(*above_ptr) + GETJSAMPLE(*below_ptr) + nextcolsum = above_ptr[0] + below_ptr[0] + inptr[0];
GETJSAMPLE(*inptr);
neighsum = lastcolsum + (colsum - membersum) + nextcolsum; neighsum = lastcolsum + (colsum - membersum) + nextcolsum;
membersum = membersum * memberscale + neighsum * neighscale; membersum = membersum * memberscale + neighsum * neighscale;
*outptr++ = (JSAMPLE)((membersum + 32768) >> 16); *outptr++ = (JSAMPLE)((membersum + 32768) >> 16);
@@ -451,7 +434,7 @@ fullsize_smooth_downsample(j_compress_ptr cinfo, jpeg_component_info *compptr,
} }
/* Special case for last column */ /* Special case for last column */
membersum = GETJSAMPLE(*inptr); membersum = *inptr;
neighsum = lastcolsum + (colsum - membersum) + colsum; neighsum = lastcolsum + (colsum - membersum) + colsum;
membersum = membersum * memberscale + neighsum * neighscale; membersum = membersum * memberscale + neighsum * neighscale;
*outptr = (JSAMPLE)((membersum + 32768) >> 16); *outptr = (JSAMPLE)((membersum + 32768) >> 16);
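The smoothing downsamplers above blend each sample group with its neighbors using integer weights arranged so that the total weight is 2^16; adding 32768 and shifting right by 16 then rounds to nearest and descales in one step. A sketch of that fixed-point pattern (hypothetical helper, assuming the weights total 2^16 as in the code above):

static int blend_q16(JLONG membersum, JLONG memberscale,
                     JLONG neighsum, JLONG neighscale)
{
  /* member and neighbor weights are assumed to total 2^16 */
  JLONG sum = membersum * memberscale + neighsum * neighscale;
  return (int)((sum + 32768) >> 16);
}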


@@ -4,7 +4,7 @@
* This file was part of the Independent JPEG Group's software: * This file was part of the Independent JPEG Group's software:
* Copyright (C) 1994-1996, Thomas G. Lane. * Copyright (C) 1994-1996, Thomas G. Lane.
* libjpeg-turbo Modifications: * libjpeg-turbo Modifications:
* Copyright (C) 2010, 2015-2018, 2020, D. R. Commander. * Copyright (C) 2010, 2015-2020, D. R. Commander.
* Copyright (C) 2015, Google, Inc. * Copyright (C) 2015, Google, Inc.
* For conditions of distribution and use, see the accompanying README.ijg * For conditions of distribution and use, see the accompanying README.ijg
* file. * file.
@@ -319,6 +319,8 @@ read_and_discard_scanlines(j_decompress_ptr cinfo, JDIMENSION num_lines)
{ {
JDIMENSION n; JDIMENSION n;
my_master_ptr master = (my_master_ptr)cinfo->master; my_master_ptr master = (my_master_ptr)cinfo->master;
JSAMPLE dummy_sample[1] = { 0 };
JSAMPROW dummy_row = dummy_sample;
JSAMPARRAY scanlines = NULL; JSAMPARRAY scanlines = NULL;
void (*color_convert) (j_decompress_ptr cinfo, JSAMPIMAGE input_buf, void (*color_convert) (j_decompress_ptr cinfo, JSAMPIMAGE input_buf,
JDIMENSION input_row, JSAMPARRAY output_buf, JDIMENSION input_row, JSAMPARRAY output_buf,
@@ -329,6 +331,10 @@ read_and_discard_scanlines(j_decompress_ptr cinfo, JDIMENSION num_lines)
if (cinfo->cconvert && cinfo->cconvert->color_convert) { if (cinfo->cconvert && cinfo->cconvert->color_convert) {
color_convert = cinfo->cconvert->color_convert; color_convert = cinfo->cconvert->color_convert;
cinfo->cconvert->color_convert = noop_convert; cinfo->cconvert->color_convert = noop_convert;
/* This just prevents UBSan from complaining about adding 0 to a NULL
* pointer. The pointer isn't actually used.
*/
scanlines = &dummy_row;
} }
if (cinfo->cquantize && cinfo->cquantize->color_quantize) { if (cinfo->cquantize && cinfo->cquantize->color_quantize) {
@@ -532,6 +538,8 @@ jpeg_skip_scanlines(j_decompress_ptr cinfo, JDIMENSION num_lines)
* decoded coefficients. This is ~5% faster for large subsets, but * decoded coefficients. This is ~5% faster for large subsets, but
* it's tough to tell a difference for smaller images. * it's tough to tell a difference for smaller images.
*/ */
if (!cinfo->entropy->insufficient_data)
cinfo->master->last_good_iMCU_row = cinfo->input_iMCU_row;
(*cinfo->entropy->decode_mcu) (cinfo, NULL); (*cinfo->entropy->decode_mcu) (cinfo, NULL);
} }
} }


@@ -4,7 +4,7 @@
* This file was part of the Independent JPEG Group's software: * This file was part of the Independent JPEG Group's software:
* Developed 1997-2015 by Guido Vollbeding. * Developed 1997-2015 by Guido Vollbeding.
* libjpeg-turbo Modifications: * libjpeg-turbo Modifications:
* Copyright (C) 2015-2018, D. R. Commander. * Copyright (C) 2015-2020, D. R. Commander.
* For conditions of distribution and use, see the accompanying README.ijg * For conditions of distribution and use, see the accompanying README.ijg
* file. * file.
* *
@@ -80,7 +80,7 @@ get_byte(j_decompress_ptr cinfo)
if (!(*src->fill_input_buffer) (cinfo)) if (!(*src->fill_input_buffer) (cinfo))
ERREXIT(cinfo, JERR_CANT_SUSPEND); ERREXIT(cinfo, JERR_CANT_SUSPEND);
src->bytes_in_buffer--; src->bytes_in_buffer--;
return GETJOCTET(*src->next_input_byte++); return *src->next_input_byte++;
} }
@@ -665,8 +665,16 @@ bad:
for (ci = 0; ci < cinfo->comps_in_scan; ci++) { for (ci = 0; ci < cinfo->comps_in_scan; ci++) {
int coefi, cindex = cinfo->cur_comp_info[ci]->component_index; int coefi, cindex = cinfo->cur_comp_info[ci]->component_index;
int *coef_bit_ptr = &cinfo->coef_bits[cindex][0]; int *coef_bit_ptr = &cinfo->coef_bits[cindex][0];
int *prev_coef_bit_ptr =
&cinfo->coef_bits[cindex + cinfo->num_components][0];
if (cinfo->Ss && coef_bit_ptr[0] < 0) /* AC without prior DC scan */ if (cinfo->Ss && coef_bit_ptr[0] < 0) /* AC without prior DC scan */
WARNMS2(cinfo, JWRN_BOGUS_PROGRESSION, cindex, 0); WARNMS2(cinfo, JWRN_BOGUS_PROGRESSION, cindex, 0);
for (coefi = MIN(cinfo->Ss, 1); coefi <= MAX(cinfo->Se, 9); coefi++) {
if (cinfo->input_scan_number > 1)
prev_coef_bit_ptr[coefi] = coef_bit_ptr[coefi];
else
prev_coef_bit_ptr[coefi] = 0;
}
for (coefi = cinfo->Ss; coefi <= cinfo->Se; coefi++) { for (coefi = cinfo->Ss; coefi <= cinfo->Se; coefi++) {
int expected = (coef_bit_ptr[coefi] < 0) ? 0 : coef_bit_ptr[coefi]; int expected = (coef_bit_ptr[coefi] < 0) ? 0 : coef_bit_ptr[coefi];
if (cinfo->Ah != expected) if (cinfo->Ah != expected)
@@ -727,6 +735,7 @@ bad:
entropy->c = 0; entropy->c = 0;
entropy->a = 0; entropy->a = 0;
entropy->ct = -16; /* force reading 2 initial bytes to fill C */ entropy->ct = -16; /* force reading 2 initial bytes to fill C */
entropy->pub.insufficient_data = FALSE;
/* Initialize restart counter */ /* Initialize restart counter */
entropy->restarts_to_go = cinfo->restart_interval; entropy->restarts_to_go = cinfo->restart_interval;
@@ -763,7 +772,7 @@ jinit_arith_decoder(j_decompress_ptr cinfo)
int *coef_bit_ptr, ci; int *coef_bit_ptr, ci;
cinfo->coef_bits = (int (*)[DCTSIZE2]) cinfo->coef_bits = (int (*)[DCTSIZE2])
(*cinfo->mem->alloc_small) ((j_common_ptr)cinfo, JPOOL_IMAGE, (*cinfo->mem->alloc_small) ((j_common_ptr)cinfo, JPOOL_IMAGE,
cinfo->num_components * DCTSIZE2 * cinfo->num_components * 2 * DCTSIZE2 *
sizeof(int)); sizeof(int));
coef_bit_ptr = &cinfo->coef_bits[0][0]; coef_bit_ptr = &cinfo->coef_bits[0][0];
for (ci = 0; ci < cinfo->num_components; ci++) for (ci = 0; ci < cinfo->num_components; ci++)


@@ -5,7 +5,7 @@
* Copyright (C) 1994-1997, Thomas G. Lane. * Copyright (C) 1994-1997, Thomas G. Lane.
* libjpeg-turbo Modifications: * libjpeg-turbo Modifications:
* Copyright 2009 Pierre Ossman <ossman@cendio.se> for Cendio AB * Copyright 2009 Pierre Ossman <ossman@cendio.se> for Cendio AB
* Copyright (C) 2010, 2015-2016, D. R. Commander. * Copyright (C) 2010, 2015-2016, 2019-2020, D. R. Commander.
* Copyright (C) 2015, 2020, Google, Inc. * Copyright (C) 2015, 2020, Google, Inc.
* For conditions of distribution and use, see the accompanying README.ijg * For conditions of distribution and use, see the accompanying README.ijg
* file. * file.
@@ -102,6 +102,8 @@ decompress_onepass(j_decompress_ptr cinfo, JSAMPIMAGE output_buf)
/* Try to fetch an MCU. Entropy decoder expects buffer to be zeroed. */ /* Try to fetch an MCU. Entropy decoder expects buffer to be zeroed. */
jzero_far((void *)coef->MCU_buffer[0], jzero_far((void *)coef->MCU_buffer[0],
(size_t)(cinfo->blocks_in_MCU * sizeof(JBLOCK))); (size_t)(cinfo->blocks_in_MCU * sizeof(JBLOCK)));
if (!cinfo->entropy->insufficient_data)
cinfo->master->last_good_iMCU_row = cinfo->input_iMCU_row;
if (!(*cinfo->entropy->decode_mcu) (cinfo, coef->MCU_buffer)) { if (!(*cinfo->entropy->decode_mcu) (cinfo, coef->MCU_buffer)) {
/* Suspension forced; update state counters and exit */ /* Suspension forced; update state counters and exit */
coef->MCU_vert_offset = yoffset; coef->MCU_vert_offset = yoffset;
@@ -227,6 +229,8 @@ consume_data(j_decompress_ptr cinfo)
} }
} }
} }
if (!cinfo->entropy->insufficient_data)
cinfo->master->last_good_iMCU_row = cinfo->input_iMCU_row;
/* Try to fetch the MCU. */ /* Try to fetch the MCU. */
if (!(*cinfo->entropy->decode_mcu) (cinfo, coef->MCU_buffer)) { if (!(*cinfo->entropy->decode_mcu) (cinfo, coef->MCU_buffer)) {
/* Suspension forced; update state counters and exit */ /* Suspension forced; update state counters and exit */
@@ -326,19 +330,22 @@ decompress_data(j_decompress_ptr cinfo, JSAMPIMAGE output_buf)
#ifdef BLOCK_SMOOTHING_SUPPORTED #ifdef BLOCK_SMOOTHING_SUPPORTED
/* /*
* This code applies interblock smoothing as described by section K.8 * This code applies interblock smoothing; the first 9 AC coefficients are
* of the JPEG standard: the first 5 AC coefficients are estimated from * estimated from the DC values of a DCT block and its 24 neighboring blocks.
* the DC values of a DCT block and its 8 neighboring blocks.
* We apply smoothing only for progressive JPEG decoding, and only if * We apply smoothing only for progressive JPEG decoding, and only if
* the coefficients it can estimate are not yet known to full precision. * the coefficients it can estimate are not yet known to full precision.
*/ */
/* Natural-order array positions of the first 5 zigzag-order coefficients */ /* Natural-order array positions of the first 9 zigzag-order coefficients */
#define Q01_POS 1 #define Q01_POS 1
#define Q10_POS 8 #define Q10_POS 8
#define Q20_POS 16 #define Q20_POS 16
#define Q11_POS 9 #define Q11_POS 9
#define Q02_POS 2 #define Q02_POS 2
#define Q03_POS 3
#define Q12_POS 10
#define Q21_POS 17
#define Q30_POS 24
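Every AC estimate in the hunks below follows one pattern: num is Q00 times a weighted sum of cached DC values, and the prediction is num divided by 256 times that coefficient's quantizer, rounded to nearest, clamped to the (1 << Al) - 1 range of bits not yet decoded, and given the sign of num. A sketch of that shared step (hypothetical helper; the real code inlines it separately for each coefficient):

static JCOEF estimate_coef(JLONG num, JLONG q, int Al)
{
  int pred;

  if (num >= 0) {
    pred = (int)(((q << 7) + num) / (q << 8));  /* round num / (256 * q) */
    if (Al > 0 && pred >= (1 << Al))
      pred = (1 << Al) - 1;                     /* clamp to undecoded bits */
  } else {
    pred = (int)(((q << 7) - num) / (q << 8));
    if (Al > 0 && pred >= (1 << Al))
      pred = (1 << Al) - 1;
    pred = -pred;
  }
  return (JCOEF)pred;
}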
/* /*
* Determine whether block smoothing is applicable and safe. * Determine whether block smoothing is applicable and safe.
@@ -356,8 +363,8 @@ smoothing_ok(j_decompress_ptr cinfo)
int ci, coefi; int ci, coefi;
jpeg_component_info *compptr; jpeg_component_info *compptr;
JQUANT_TBL *qtable; JQUANT_TBL *qtable;
int *coef_bits; int *coef_bits, *prev_coef_bits;
int *coef_bits_latch; int *coef_bits_latch, *prev_coef_bits_latch;
if (!cinfo->progressive_mode || cinfo->coef_bits == NULL) if (!cinfo->progressive_mode || cinfo->coef_bits == NULL)
return FALSE; return FALSE;
@@ -366,34 +373,47 @@ smoothing_ok(j_decompress_ptr cinfo)
if (coef->coef_bits_latch == NULL) if (coef->coef_bits_latch == NULL)
coef->coef_bits_latch = (int *) coef->coef_bits_latch = (int *)
(*cinfo->mem->alloc_small) ((j_common_ptr)cinfo, JPOOL_IMAGE, (*cinfo->mem->alloc_small) ((j_common_ptr)cinfo, JPOOL_IMAGE,
cinfo->num_components * cinfo->num_components * 2 *
(SAVED_COEFS * sizeof(int))); (SAVED_COEFS * sizeof(int)));
coef_bits_latch = coef->coef_bits_latch; coef_bits_latch = coef->coef_bits_latch;
prev_coef_bits_latch =
&coef->coef_bits_latch[cinfo->num_components * SAVED_COEFS];
for (ci = 0, compptr = cinfo->comp_info; ci < cinfo->num_components; for (ci = 0, compptr = cinfo->comp_info; ci < cinfo->num_components;
ci++, compptr++) { ci++, compptr++) {
/* All components' quantization values must already be latched. */ /* All components' quantization values must already be latched. */
if ((qtable = compptr->quant_table) == NULL) if ((qtable = compptr->quant_table) == NULL)
return FALSE; return FALSE;
/* Verify DC & first 5 AC quantizers are nonzero to avoid zero-divide. */ /* Verify DC & first 9 AC quantizers are nonzero to avoid zero-divide. */
if (qtable->quantval[0] == 0 || if (qtable->quantval[0] == 0 ||
qtable->quantval[Q01_POS] == 0 || qtable->quantval[Q01_POS] == 0 ||
qtable->quantval[Q10_POS] == 0 || qtable->quantval[Q10_POS] == 0 ||
qtable->quantval[Q20_POS] == 0 || qtable->quantval[Q20_POS] == 0 ||
qtable->quantval[Q11_POS] == 0 || qtable->quantval[Q11_POS] == 0 ||
qtable->quantval[Q02_POS] == 0) qtable->quantval[Q02_POS] == 0 ||
qtable->quantval[Q03_POS] == 0 ||
qtable->quantval[Q12_POS] == 0 ||
qtable->quantval[Q21_POS] == 0 ||
qtable->quantval[Q30_POS] == 0)
return FALSE; return FALSE;
/* DC values must be at least partly known for all components. */ /* DC values must be at least partly known for all components. */
coef_bits = cinfo->coef_bits[ci]; coef_bits = cinfo->coef_bits[ci];
prev_coef_bits = cinfo->coef_bits[ci + cinfo->num_components];
if (coef_bits[0] < 0) if (coef_bits[0] < 0)
return FALSE; return FALSE;
coef_bits_latch[0] = coef_bits[0];
/* Block smoothing is helpful if some AC coefficients remain inaccurate. */ /* Block smoothing is helpful if some AC coefficients remain inaccurate. */
for (coefi = 1; coefi <= 5; coefi++) { for (coefi = 1; coefi < SAVED_COEFS; coefi++) {
if (cinfo->input_scan_number > 1)
prev_coef_bits_latch[coefi] = prev_coef_bits[coefi];
else
prev_coef_bits_latch[coefi] = -1;
coef_bits_latch[coefi] = coef_bits[coefi]; coef_bits_latch[coefi] = coef_bits[coefi];
if (coef_bits[coefi] != 0) if (coef_bits[coefi] != 0)
smoothing_useful = TRUE; smoothing_useful = TRUE;
} }
coef_bits_latch += SAVED_COEFS; coef_bits_latch += SAVED_COEFS;
prev_coef_bits_latch += SAVED_COEFS;
} }
return smoothing_useful; return smoothing_useful;
@@ -412,17 +432,20 @@ decompress_smooth_data(j_decompress_ptr cinfo, JSAMPIMAGE output_buf)
JDIMENSION block_num, last_block_column; JDIMENSION block_num, last_block_column;
int ci, block_row, block_rows, access_rows; int ci, block_row, block_rows, access_rows;
JBLOCKARRAY buffer; JBLOCKARRAY buffer;
JBLOCKROW buffer_ptr, prev_block_row, next_block_row; JBLOCKROW buffer_ptr, prev_prev_block_row, prev_block_row;
JBLOCKROW next_block_row, next_next_block_row;
JSAMPARRAY output_ptr; JSAMPARRAY output_ptr;
JDIMENSION output_col; JDIMENSION output_col;
jpeg_component_info *compptr; jpeg_component_info *compptr;
inverse_DCT_method_ptr inverse_DCT; inverse_DCT_method_ptr inverse_DCT;
boolean first_row, last_row; boolean change_dc;
JCOEF *workspace; JCOEF *workspace;
int *coef_bits; int *coef_bits;
JQUANT_TBL *quanttbl; JQUANT_TBL *quanttbl;
JLONG Q00, Q01, Q02, Q10, Q11, Q20, num; JLONG Q00, Q01, Q02, Q03 = 0, Q10, Q11, Q12 = 0, Q20, Q21 = 0, Q30 = 0, num;
int DC1, DC2, DC3, DC4, DC5, DC6, DC7, DC8, DC9; int DC01, DC02, DC03, DC04, DC05, DC06, DC07, DC08, DC09, DC10, DC11, DC12,
DC13, DC14, DC15, DC16, DC17, DC18, DC19, DC20, DC21, DC22, DC23, DC24,
DC25;
int Al, pred; int Al, pred;
/* Keep a local variable to avoid looking it up more than once */ /* Keep a local variable to avoid looking it up more than once */
@@ -434,10 +457,10 @@ decompress_smooth_data(j_decompress_ptr cinfo, JSAMPIMAGE output_buf)
if (cinfo->input_scan_number == cinfo->output_scan_number) { if (cinfo->input_scan_number == cinfo->output_scan_number) {
/* If input is working on current scan, we ordinarily want it to /* If input is working on current scan, we ordinarily want it to
* have completed the current row. But if input scan is DC, * have completed the current row. But if input scan is DC,
* we want it to keep one row ahead so that next block row's DC * we want it to keep two rows ahead so that next two block rows' DC
* values are up to date. * values are up to date.
*/ */
JDIMENSION delta = (cinfo->Ss == 0) ? 1 : 0; JDIMENSION delta = (cinfo->Ss == 0) ? 2 : 0;
if (cinfo->input_iMCU_row > cinfo->output_iMCU_row + delta) if (cinfo->input_iMCU_row > cinfo->output_iMCU_row + delta)
break; break;
} }
@@ -452,34 +475,53 @@ decompress_smooth_data(j_decompress_ptr cinfo, JSAMPIMAGE output_buf)
if (!compptr->component_needed) if (!compptr->component_needed)
continue; continue;
/* Count non-dummy DCT block rows in this iMCU row. */ /* Count non-dummy DCT block rows in this iMCU row. */
if (cinfo->output_iMCU_row < last_iMCU_row) { if (cinfo->output_iMCU_row < last_iMCU_row - 1) {
block_rows = compptr->v_samp_factor;
access_rows = block_rows * 3; /* this and next two iMCU rows */
} else if (cinfo->output_iMCU_row < last_iMCU_row) {
block_rows = compptr->v_samp_factor; block_rows = compptr->v_samp_factor;
access_rows = block_rows * 2; /* this and next iMCU row */ access_rows = block_rows * 2; /* this and next iMCU row */
last_row = FALSE;
} else { } else {
/* NB: can't use last_row_height here; it is input-side-dependent! */ /* NB: can't use last_row_height here; it is input-side-dependent! */
block_rows = (int)(compptr->height_in_blocks % compptr->v_samp_factor); block_rows = (int)(compptr->height_in_blocks % compptr->v_samp_factor);
if (block_rows == 0) block_rows = compptr->v_samp_factor; if (block_rows == 0) block_rows = compptr->v_samp_factor;
access_rows = block_rows; /* this iMCU row only */ access_rows = block_rows; /* this iMCU row only */
last_row = TRUE;
} }
/* Align the virtual buffer for this component. */ /* Align the virtual buffer for this component. */
if (cinfo->output_iMCU_row > 0) { if (cinfo->output_iMCU_row > 1) {
access_rows += compptr->v_samp_factor; /* prior iMCU row too */ access_rows += 2 * compptr->v_samp_factor; /* prior two iMCU rows too */
buffer = (*cinfo->mem->access_virt_barray)
((j_common_ptr)cinfo, coef->whole_image[ci],
(cinfo->output_iMCU_row - 2) * compptr->v_samp_factor,
(JDIMENSION)access_rows, FALSE);
buffer += 2 * compptr->v_samp_factor; /* point to current iMCU row */
} else if (cinfo->output_iMCU_row > 0) {
buffer = (*cinfo->mem->access_virt_barray) buffer = (*cinfo->mem->access_virt_barray)
((j_common_ptr)cinfo, coef->whole_image[ci], ((j_common_ptr)cinfo, coef->whole_image[ci],
(cinfo->output_iMCU_row - 1) * compptr->v_samp_factor, (cinfo->output_iMCU_row - 1) * compptr->v_samp_factor,
(JDIMENSION)access_rows, FALSE); (JDIMENSION)access_rows, FALSE);
buffer += compptr->v_samp_factor; /* point to current iMCU row */ buffer += compptr->v_samp_factor; /* point to current iMCU row */
first_row = FALSE;
} else { } else {
buffer = (*cinfo->mem->access_virt_barray) buffer = (*cinfo->mem->access_virt_barray)
((j_common_ptr)cinfo, coef->whole_image[ci], ((j_common_ptr)cinfo, coef->whole_image[ci],
(JDIMENSION)0, (JDIMENSION)access_rows, FALSE); (JDIMENSION)0, (JDIMENSION)access_rows, FALSE);
first_row = TRUE;
} }
/* Fetch component-dependent info */ /* Fetch component-dependent info.
coef_bits = coef->coef_bits_latch + (ci * SAVED_COEFS); * If the current scan is incomplete, then we use the component-dependent
* info from the previous scan.
*/
if (cinfo->output_iMCU_row > cinfo->master->last_good_iMCU_row)
coef_bits =
coef->coef_bits_latch + ((ci + cinfo->num_components) * SAVED_COEFS);
else
coef_bits = coef->coef_bits_latch + (ci * SAVED_COEFS);
/* We only do DC interpolation if no AC coefficient data is available. */
change_dc =
coef_bits[1] == -1 && coef_bits[2] == -1 && coef_bits[3] == -1 &&
coef_bits[4] == -1 && coef_bits[5] == -1 && coef_bits[6] == -1 &&
coef_bits[7] == -1 && coef_bits[8] == -1 && coef_bits[9] == -1;
quanttbl = compptr->quant_table; quanttbl = compptr->quant_table;
Q00 = quanttbl->quantval[0]; Q00 = quanttbl->quantval[0];
Q01 = quanttbl->quantval[Q01_POS]; Q01 = quanttbl->quantval[Q01_POS];
@@ -487,27 +529,51 @@ decompress_smooth_data(j_decompress_ptr cinfo, JSAMPIMAGE output_buf)
Q20 = quanttbl->quantval[Q20_POS]; Q20 = quanttbl->quantval[Q20_POS];
Q11 = quanttbl->quantval[Q11_POS]; Q11 = quanttbl->quantval[Q11_POS];
Q02 = quanttbl->quantval[Q02_POS]; Q02 = quanttbl->quantval[Q02_POS];
if (change_dc) {
Q03 = quanttbl->quantval[Q03_POS];
Q12 = quanttbl->quantval[Q12_POS];
Q21 = quanttbl->quantval[Q21_POS];
Q30 = quanttbl->quantval[Q30_POS];
}
inverse_DCT = cinfo->idct->inverse_DCT[ci]; inverse_DCT = cinfo->idct->inverse_DCT[ci];
output_ptr = output_buf[ci]; output_ptr = output_buf[ci];
/* Loop over all DCT blocks to be processed. */ /* Loop over all DCT blocks to be processed. */
for (block_row = 0; block_row < block_rows; block_row++) { for (block_row = 0; block_row < block_rows; block_row++) {
buffer_ptr = buffer[block_row] + cinfo->master->first_MCU_col[ci]; buffer_ptr = buffer[block_row] + cinfo->master->first_MCU_col[ci];
if (first_row && block_row == 0)
if (block_row > 0 || cinfo->output_iMCU_row > 0)
prev_block_row =
buffer[block_row - 1] + cinfo->master->first_MCU_col[ci];
else
prev_block_row = buffer_ptr; prev_block_row = buffer_ptr;
if (block_row > 1 || cinfo->output_iMCU_row > 1)
prev_prev_block_row =
buffer[block_row - 2] + cinfo->master->first_MCU_col[ci];
else
prev_prev_block_row = prev_block_row;
if (block_row < block_rows - 1 || cinfo->output_iMCU_row < last_iMCU_row)
next_block_row =
buffer[block_row + 1] + cinfo->master->first_MCU_col[ci];
else else
prev_block_row = buffer[block_row - 1] +
cinfo->master->first_MCU_col[ci];
if (last_row && block_row == block_rows - 1)
next_block_row = buffer_ptr; next_block_row = buffer_ptr;
if (block_row < block_rows - 2 ||
cinfo->output_iMCU_row < last_iMCU_row - 1)
next_next_block_row =
buffer[block_row + 2] + cinfo->master->first_MCU_col[ci];
else else
next_block_row = buffer[block_row + 1] + next_next_block_row = next_block_row;
cinfo->master->first_MCU_col[ci];
/* We fetch the surrounding DC values using a sliding-register approach. /* We fetch the surrounding DC values using a sliding-register approach.
* Initialize all nine here so as to do the right thing on narrow pics. * Initialize all 25 here so as to do the right thing on narrow pics.
*/ */
DC1 = DC2 = DC3 = (int)prev_block_row[0][0]; DC01 = DC02 = DC03 = DC04 = DC05 = (int)prev_prev_block_row[0][0];
DC4 = DC5 = DC6 = (int)buffer_ptr[0][0]; DC06 = DC07 = DC08 = DC09 = DC10 = (int)prev_block_row[0][0];
DC7 = DC8 = DC9 = (int)next_block_row[0][0]; DC11 = DC12 = DC13 = DC14 = DC15 = (int)buffer_ptr[0][0];
DC16 = DC17 = DC18 = DC19 = DC20 = (int)next_block_row[0][0];
DC21 = DC22 = DC23 = DC24 = DC25 = (int)next_next_block_row[0][0];
output_col = 0; output_col = 0;
last_block_column = compptr->width_in_blocks - 1; last_block_column = compptr->width_in_blocks - 1;
for (block_num = cinfo->master->first_MCU_col[ci]; for (block_num = cinfo->master->first_MCU_col[ci];
@@ -515,18 +581,39 @@ decompress_smooth_data(j_decompress_ptr cinfo, JSAMPIMAGE output_buf)
/* Fetch current DCT block into workspace so we can modify it. */ /* Fetch current DCT block into workspace so we can modify it. */
jcopy_block_row(buffer_ptr, (JBLOCKROW)workspace, (JDIMENSION)1); jcopy_block_row(buffer_ptr, (JBLOCKROW)workspace, (JDIMENSION)1);
/* Update DC values */ /* Update DC values */
if (block_num < last_block_column) { if (block_num == cinfo->master->first_MCU_col[ci] &&
DC3 = (int)prev_block_row[1][0]; block_num < last_block_column) {
DC6 = (int)buffer_ptr[1][0]; DC04 = (int)prev_prev_block_row[1][0];
DC9 = (int)next_block_row[1][0]; DC09 = (int)prev_block_row[1][0];
DC14 = (int)buffer_ptr[1][0];
DC19 = (int)next_block_row[1][0];
DC24 = (int)next_next_block_row[1][0];
} }
/* Compute coefficient estimates per K.8. if (block_num + 1 < last_block_column) {
* An estimate is applied only if coefficient is still zero, DC05 = (int)prev_prev_block_row[2][0];
* and is not known to be fully accurate. DC10 = (int)prev_block_row[2][0];
DC15 = (int)buffer_ptr[2][0];
DC20 = (int)next_block_row[2][0];
DC25 = (int)next_next_block_row[2][0];
}
/* If DC interpolation is enabled, compute coefficient estimates using
* a Gaussian-like kernel, keeping the averages of the DC values.
*
* If DC interpolation is disabled, compute coefficient estimates using
* an algorithm similar to the one described in Section K.8 of the JPEG
* standard, except applied to a 5x5 window rather than a 3x3 window.
*
* An estimate is applied only if the coefficient is still zero and is
* not known to be fully accurate.
*/ */
/* AC01 */ /* AC01 */
if ((Al = coef_bits[1]) != 0 && workspace[1] == 0) { if ((Al = coef_bits[1]) != 0 && workspace[1] == 0) {
num = 36 * Q00 * (DC4 - DC6); num = Q00 * (change_dc ?
(-DC01 - DC02 + DC04 + DC05 - 3 * DC06 + 13 * DC07 -
13 * DC09 + 3 * DC10 - 3 * DC11 + 38 * DC12 - 38 * DC14 +
3 * DC15 - 3 * DC16 + 13 * DC17 - 13 * DC19 + 3 * DC20 -
DC21 - DC22 + DC24 + DC25) :
(-7 * DC11 + 50 * DC12 - 50 * DC14 + 7 * DC15));
if (num >= 0) { if (num >= 0) {
pred = (int)(((Q01 << 7) + num) / (Q01 << 8)); pred = (int)(((Q01 << 7) + num) / (Q01 << 8));
if (Al > 0 && pred >= (1 << Al)) if (Al > 0 && pred >= (1 << Al))
@@ -541,7 +628,12 @@ decompress_smooth_data(j_decompress_ptr cinfo, JSAMPIMAGE output_buf)
} }
/* AC10 */ /* AC10 */
if ((Al = coef_bits[2]) != 0 && workspace[8] == 0) { if ((Al = coef_bits[2]) != 0 && workspace[8] == 0) {
num = 36 * Q00 * (DC2 - DC8); num = Q00 * (change_dc ?
(-DC01 - 3 * DC02 - 3 * DC03 - 3 * DC04 - DC05 - DC06 +
13 * DC07 + 38 * DC08 + 13 * DC09 - DC10 + DC16 -
13 * DC17 - 38 * DC18 - 13 * DC19 + DC20 + DC21 +
3 * DC22 + 3 * DC23 + 3 * DC24 + DC25) :
(-7 * DC03 + 50 * DC08 - 50 * DC18 + 7 * DC23));
if (num >= 0) { if (num >= 0) {
pred = (int)(((Q10 << 7) + num) / (Q10 << 8)); pred = (int)(((Q10 << 7) + num) / (Q10 << 8));
if (Al > 0 && pred >= (1 << Al)) if (Al > 0 && pred >= (1 << Al))
@@ -556,7 +648,10 @@ decompress_smooth_data(j_decompress_ptr cinfo, JSAMPIMAGE output_buf)
} }
/* AC20 */ /* AC20 */
if ((Al = coef_bits[3]) != 0 && workspace[16] == 0) { if ((Al = coef_bits[3]) != 0 && workspace[16] == 0) {
num = 9 * Q00 * (DC2 + DC8 - 2 * DC5); num = Q00 * (change_dc ?
(DC03 + 2 * DC07 + 7 * DC08 + 2 * DC09 - 5 * DC12 - 14 * DC13 -
5 * DC14 + 2 * DC17 + 7 * DC18 + 2 * DC19 + DC23) :
(-DC03 + 13 * DC08 - 24 * DC13 + 13 * DC18 - DC23));
if (num >= 0) { if (num >= 0) {
pred = (int)(((Q20 << 7) + num) / (Q20 << 8)); pred = (int)(((Q20 << 7) + num) / (Q20 << 8));
if (Al > 0 && pred >= (1 << Al)) if (Al > 0 && pred >= (1 << Al))
@@ -571,7 +666,11 @@ decompress_smooth_data(j_decompress_ptr cinfo, JSAMPIMAGE output_buf)
} }
/* AC11 */ /* AC11 */
if ((Al = coef_bits[4]) != 0 && workspace[9] == 0) { if ((Al = coef_bits[4]) != 0 && workspace[9] == 0) {
num = 5 * Q00 * (DC1 - DC3 - DC7 + DC9); num = Q00 * (change_dc ?
(-DC01 + DC05 + 9 * DC07 - 9 * DC09 - 9 * DC17 +
9 * DC19 + DC21 - DC25) :
(DC10 + DC16 - 10 * DC17 + 10 * DC19 - DC02 - DC20 + DC22 -
DC24 + DC04 - DC06 + 10 * DC07 - 10 * DC09));
if (num >= 0) { if (num >= 0) {
pred = (int)(((Q11 << 7) + num) / (Q11 << 8)); pred = (int)(((Q11 << 7) + num) / (Q11 << 8));
if (Al > 0 && pred >= (1 << Al)) if (Al > 0 && pred >= (1 << Al))
@@ -586,7 +685,10 @@ decompress_smooth_data(j_decompress_ptr cinfo, JSAMPIMAGE output_buf)
} }
/* AC02 */ /* AC02 */
if ((Al = coef_bits[5]) != 0 && workspace[2] == 0) { if ((Al = coef_bits[5]) != 0 && workspace[2] == 0) {
num = 9 * Q00 * (DC4 + DC6 - 2 * DC5); num = Q00 * (change_dc ?
(2 * DC07 - 5 * DC08 + 2 * DC09 + DC11 + 7 * DC12 - 14 * DC13 +
7 * DC14 + DC15 + 2 * DC17 - 5 * DC18 + 2 * DC19) :
(-DC11 + 13 * DC12 - 24 * DC13 + 13 * DC14 - DC15));
if (num >= 0) { if (num >= 0) {
pred = (int)(((Q02 << 7) + num) / (Q02 << 8)); pred = (int)(((Q02 << 7) + num) / (Q02 << 8));
if (Al > 0 && pred >= (1 << Al)) if (Al > 0 && pred >= (1 << Al))
@@ -599,14 +701,96 @@ decompress_smooth_data(j_decompress_ptr cinfo, JSAMPIMAGE output_buf)
} }
workspace[2] = (JCOEF)pred; workspace[2] = (JCOEF)pred;
} }
if (change_dc) {
/* AC03 */
if ((Al = coef_bits[6]) != 0 && workspace[3] == 0) {
num = Q00 * (DC07 - DC09 + 2 * DC12 - 2 * DC14 + DC17 - DC19);
if (num >= 0) {
pred = (int)(((Q03 << 7) + num) / (Q03 << 8));
if (Al > 0 && pred >= (1 << Al))
pred = (1 << Al) - 1;
} else {
pred = (int)(((Q03 << 7) - num) / (Q03 << 8));
if (Al > 0 && pred >= (1 << Al))
pred = (1 << Al) - 1;
pred = -pred;
}
workspace[3] = (JCOEF)pred;
}
/* AC12 */
if ((Al = coef_bits[7]) != 0 && workspace[10] == 0) {
num = Q00 * (DC07 - 3 * DC08 + DC09 - DC17 + 3 * DC18 - DC19);
if (num >= 0) {
pred = (int)(((Q12 << 7) + num) / (Q12 << 8));
if (Al > 0 && pred >= (1 << Al))
pred = (1 << Al) - 1;
} else {
pred = (int)(((Q12 << 7) - num) / (Q12 << 8));
if (Al > 0 && pred >= (1 << Al))
pred = (1 << Al) - 1;
pred = -pred;
}
workspace[10] = (JCOEF)pred;
}
/* AC21 */
if ((Al = coef_bits[8]) != 0 && workspace[17] == 0) {
num = Q00 * (DC07 - DC09 - 3 * DC12 + 3 * DC14 + DC17 - DC19);
if (num >= 0) {
pred = (int)(((Q21 << 7) + num) / (Q21 << 8));
if (Al > 0 && pred >= (1 << Al))
pred = (1 << Al) - 1;
} else {
pred = (int)(((Q21 << 7) - num) / (Q21 << 8));
if (Al > 0 && pred >= (1 << Al))
pred = (1 << Al) - 1;
pred = -pred;
}
workspace[17] = (JCOEF)pred;
}
/* AC30 */
if ((Al = coef_bits[9]) != 0 && workspace[24] == 0) {
num = Q00 * (DC07 + 2 * DC08 + DC09 - DC17 - 2 * DC18 - DC19);
if (num >= 0) {
pred = (int)(((Q30 << 7) + num) / (Q30 << 8));
if (Al > 0 && pred >= (1 << Al))
pred = (1 << Al) - 1;
} else {
pred = (int)(((Q30 << 7) - num) / (Q30 << 8));
if (Al > 0 && pred >= (1 << Al))
pred = (1 << Al) - 1;
pred = -pred;
}
workspace[24] = (JCOEF)pred;
}
/* coef_bits[0] is non-negative. Otherwise this function would not
* be called.
*/
num = Q00 *
(-2 * DC01 - 6 * DC02 - 8 * DC03 - 6 * DC04 - 2 * DC05 -
6 * DC06 + 6 * DC07 + 42 * DC08 + 6 * DC09 - 6 * DC10 -
8 * DC11 + 42 * DC12 + 152 * DC13 + 42 * DC14 - 8 * DC15 -
6 * DC16 + 6 * DC17 + 42 * DC18 + 6 * DC19 - 6 * DC20 -
2 * DC21 - 6 * DC22 - 8 * DC23 - 6 * DC24 - 2 * DC25);
if (num >= 0) {
pred = (int)(((Q00 << 7) + num) / (Q00 << 8));
} else {
pred = (int)(((Q00 << 7) - num) / (Q00 << 8));
pred = -pred;
}
workspace[0] = (JCOEF)pred;
} /* change_dc */
/* OK, do the IDCT */ /* OK, do the IDCT */
(*inverse_DCT) (cinfo, compptr, (JCOEFPTR)workspace, output_ptr, (*inverse_DCT) (cinfo, compptr, (JCOEFPTR)workspace, output_ptr,
output_col); output_col);
/* Advance for next column */ /* Advance for next column */
DC1 = DC2; DC2 = DC3; DC01 = DC02; DC02 = DC03; DC03 = DC04; DC04 = DC05;
DC4 = DC5; DC5 = DC6; DC06 = DC07; DC07 = DC08; DC08 = DC09; DC09 = DC10;
DC7 = DC8; DC8 = DC9; DC11 = DC12; DC12 = DC13; DC13 = DC14; DC14 = DC15;
buffer_ptr++, prev_block_row++, next_block_row++; DC16 = DC17; DC17 = DC18; DC18 = DC19; DC19 = DC20;
DC21 = DC22; DC22 = DC23; DC23 = DC24; DC24 = DC25;
buffer_ptr++, prev_block_row++, next_block_row++,
prev_prev_block_row++, next_next_block_row++;
output_col += compptr->_DCT_scaled_size; output_col += compptr->_DCT_scaled_size;
} }
output_ptr += compptr->_DCT_scaled_size; output_ptr += compptr->_DCT_scaled_size;
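The per-block loop above keeps the DC values of the surrounding 5x5 block neighborhood in the registers DC01..DC25 and, after each block, shifts every row of that window left by one column before refilling the rightmost entries (duplicating edge values for narrow images). An array-form sketch of that sliding update (illustrative, not the unrolled code above):

static void slide_dc_window(int dc[5][5], const int next_col[5])
{
  int row, col;

  for (row = 0; row < 5; row++) {
    for (col = 0; col < 4; col++)
      dc[row][col] = dc[row][col + 1];  /* shift the row left by one */
    dc[row][4] = next_col[row];         /* pull in the next block column */
  }
}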
@@ -655,7 +839,7 @@ jinit_d_coef_controller(j_decompress_ptr cinfo, boolean need_full_buffer)
#ifdef BLOCK_SMOOTHING_SUPPORTED #ifdef BLOCK_SMOOTHING_SUPPORTED
/* If block smoothing could be used, need a bigger window */ /* If block smoothing could be used, need a bigger window */
if (cinfo->progressive_mode) if (cinfo->progressive_mode)
access_rows *= 3; access_rows *= 5;
#endif #endif
coef->whole_image[ci] = (*cinfo->mem->request_virt_barray) coef->whole_image[ci] = (*cinfo->mem->request_virt_barray)
((j_common_ptr)cinfo, JPOOL_IMAGE, TRUE, ((j_common_ptr)cinfo, JPOOL_IMAGE, TRUE,


@@ -5,6 +5,7 @@
* Copyright (C) 1994-1997, Thomas G. Lane. * Copyright (C) 1994-1997, Thomas G. Lane.
* libjpeg-turbo Modifications: * libjpeg-turbo Modifications:
* Copyright 2009 Pierre Ossman <ossman@cendio.se> for Cendio AB * Copyright 2009 Pierre Ossman <ossman@cendio.se> for Cendio AB
* Copyright (C) 2020, Google, Inc.
* For conditions of distribution and use, see the accompanying README.ijg * For conditions of distribution and use, see the accompanying README.ijg
* file. * file.
*/ */
@@ -51,7 +52,7 @@ typedef struct {
#ifdef BLOCK_SMOOTHING_SUPPORTED #ifdef BLOCK_SMOOTHING_SUPPORTED
/* When doing block smoothing, we latch coefficient Al values here */ /* When doing block smoothing, we latch coefficient Al values here */
int *coef_bits_latch; int *coef_bits_latch;
#define SAVED_COEFS 6 /* we save coef_bits[0..5] */ #define SAVED_COEFS 10 /* we save coef_bits[0..9] */
#endif #endif
} my_coef_controller; } my_coef_controller;


@@ -45,9 +45,9 @@ ycc_rgb565_convert_internal(j_decompress_ptr cinfo, JSAMPIMAGE input_buf,
outptr = *output_buf++; outptr = *output_buf++;
if (PACK_NEED_ALIGNMENT(outptr)) { if (PACK_NEED_ALIGNMENT(outptr)) {
y = GETJSAMPLE(*inptr0++); y = *inptr0++;
cb = GETJSAMPLE(*inptr1++); cb = *inptr1++;
cr = GETJSAMPLE(*inptr2++); cr = *inptr2++;
r = range_limit[y + Crrtab[cr]]; r = range_limit[y + Crrtab[cr]];
g = range_limit[y + ((int)RIGHT_SHIFT(Cbgtab[cb] + Crgtab[cr], g = range_limit[y + ((int)RIGHT_SHIFT(Cbgtab[cb] + Crgtab[cr],
SCALEBITS))]; SCALEBITS))];
@@ -58,18 +58,18 @@ ycc_rgb565_convert_internal(j_decompress_ptr cinfo, JSAMPIMAGE input_buf,
num_cols--; num_cols--;
} }
for (col = 0; col < (num_cols >> 1); col++) { for (col = 0; col < (num_cols >> 1); col++) {
y = GETJSAMPLE(*inptr0++); y = *inptr0++;
cb = GETJSAMPLE(*inptr1++); cb = *inptr1++;
cr = GETJSAMPLE(*inptr2++); cr = *inptr2++;
r = range_limit[y + Crrtab[cr]]; r = range_limit[y + Crrtab[cr]];
g = range_limit[y + ((int)RIGHT_SHIFT(Cbgtab[cb] + Crgtab[cr], g = range_limit[y + ((int)RIGHT_SHIFT(Cbgtab[cb] + Crgtab[cr],
SCALEBITS))]; SCALEBITS))];
b = range_limit[y + Cbbtab[cb]]; b = range_limit[y + Cbbtab[cb]];
rgb = PACK_SHORT_565(r, g, b); rgb = PACK_SHORT_565(r, g, b);
y = GETJSAMPLE(*inptr0++); y = *inptr0++;
cb = GETJSAMPLE(*inptr1++); cb = *inptr1++;
cr = GETJSAMPLE(*inptr2++); cr = *inptr2++;
r = range_limit[y + Crrtab[cr]]; r = range_limit[y + Crrtab[cr]];
g = range_limit[y + ((int)RIGHT_SHIFT(Cbgtab[cb] + Crgtab[cr], g = range_limit[y + ((int)RIGHT_SHIFT(Cbgtab[cb] + Crgtab[cr],
SCALEBITS))]; SCALEBITS))];
@@ -80,9 +80,9 @@ ycc_rgb565_convert_internal(j_decompress_ptr cinfo, JSAMPIMAGE input_buf,
outptr += 4; outptr += 4;
} }
if (num_cols & 1) { if (num_cols & 1) {
y = GETJSAMPLE(*inptr0); y = *inptr0;
cb = GETJSAMPLE(*inptr1); cb = *inptr1;
cr = GETJSAMPLE(*inptr2); cr = *inptr2;
r = range_limit[y + Crrtab[cr]]; r = range_limit[y + Crrtab[cr]];
g = range_limit[y + ((int)RIGHT_SHIFT(Cbgtab[cb] + Crgtab[cr], g = range_limit[y + ((int)RIGHT_SHIFT(Cbgtab[cb] + Crgtab[cr],
SCALEBITS))]; SCALEBITS))];
@@ -125,9 +125,9 @@ ycc_rgb565D_convert_internal(j_decompress_ptr cinfo, JSAMPIMAGE input_buf,
input_row++; input_row++;
outptr = *output_buf++; outptr = *output_buf++;
if (PACK_NEED_ALIGNMENT(outptr)) { if (PACK_NEED_ALIGNMENT(outptr)) {
y = GETJSAMPLE(*inptr0++); y = *inptr0++;
cb = GETJSAMPLE(*inptr1++); cb = *inptr1++;
cr = GETJSAMPLE(*inptr2++); cr = *inptr2++;
r = range_limit[DITHER_565_R(y + Crrtab[cr], d0)]; r = range_limit[DITHER_565_R(y + Crrtab[cr], d0)];
g = range_limit[DITHER_565_G(y + g = range_limit[DITHER_565_G(y +
((int)RIGHT_SHIFT(Cbgtab[cb] + Crgtab[cr], ((int)RIGHT_SHIFT(Cbgtab[cb] + Crgtab[cr],
@@ -139,9 +139,9 @@ ycc_rgb565D_convert_internal(j_decompress_ptr cinfo, JSAMPIMAGE input_buf,
num_cols--; num_cols--;
} }
for (col = 0; col < (num_cols >> 1); col++) { for (col = 0; col < (num_cols >> 1); col++) {
y = GETJSAMPLE(*inptr0++); y = *inptr0++;
cb = GETJSAMPLE(*inptr1++); cb = *inptr1++;
cr = GETJSAMPLE(*inptr2++); cr = *inptr2++;
r = range_limit[DITHER_565_R(y + Crrtab[cr], d0)]; r = range_limit[DITHER_565_R(y + Crrtab[cr], d0)];
g = range_limit[DITHER_565_G(y + g = range_limit[DITHER_565_G(y +
((int)RIGHT_SHIFT(Cbgtab[cb] + Crgtab[cr], ((int)RIGHT_SHIFT(Cbgtab[cb] + Crgtab[cr],
@@ -150,9 +150,9 @@ ycc_rgb565D_convert_internal(j_decompress_ptr cinfo, JSAMPIMAGE input_buf,
d0 = DITHER_ROTATE(d0); d0 = DITHER_ROTATE(d0);
rgb = PACK_SHORT_565(r, g, b); rgb = PACK_SHORT_565(r, g, b);
y = GETJSAMPLE(*inptr0++); y = *inptr0++;
cb = GETJSAMPLE(*inptr1++); cb = *inptr1++;
cr = GETJSAMPLE(*inptr2++); cr = *inptr2++;
r = range_limit[DITHER_565_R(y + Crrtab[cr], d0)]; r = range_limit[DITHER_565_R(y + Crrtab[cr], d0)];
g = range_limit[DITHER_565_G(y + g = range_limit[DITHER_565_G(y +
((int)RIGHT_SHIFT(Cbgtab[cb] + Crgtab[cr], ((int)RIGHT_SHIFT(Cbgtab[cb] + Crgtab[cr],
@@ -165,9 +165,9 @@ ycc_rgb565D_convert_internal(j_decompress_ptr cinfo, JSAMPIMAGE input_buf,
outptr += 4; outptr += 4;
} }
if (num_cols & 1) { if (num_cols & 1) {
y = GETJSAMPLE(*inptr0); y = *inptr0;
cb = GETJSAMPLE(*inptr1); cb = *inptr1;
cr = GETJSAMPLE(*inptr2); cr = *inptr2;
r = range_limit[DITHER_565_R(y + Crrtab[cr], d0)]; r = range_limit[DITHER_565_R(y + Crrtab[cr], d0)];
g = range_limit[DITHER_565_G(y + g = range_limit[DITHER_565_G(y +
((int)RIGHT_SHIFT(Cbgtab[cb] + Crgtab[cr], ((int)RIGHT_SHIFT(Cbgtab[cb] + Crgtab[cr],
@@ -202,32 +202,32 @@ rgb_rgb565_convert_internal(j_decompress_ptr cinfo, JSAMPIMAGE input_buf,
input_row++; input_row++;
outptr = *output_buf++; outptr = *output_buf++;
if (PACK_NEED_ALIGNMENT(outptr)) { if (PACK_NEED_ALIGNMENT(outptr)) {
r = GETJSAMPLE(*inptr0++); r = *inptr0++;
g = GETJSAMPLE(*inptr1++); g = *inptr1++;
b = GETJSAMPLE(*inptr2++); b = *inptr2++;
rgb = PACK_SHORT_565(r, g, b); rgb = PACK_SHORT_565(r, g, b);
*(INT16 *)outptr = (INT16)rgb; *(INT16 *)outptr = (INT16)rgb;
outptr += 2; outptr += 2;
num_cols--; num_cols--;
} }
for (col = 0; col < (num_cols >> 1); col++) { for (col = 0; col < (num_cols >> 1); col++) {
r = GETJSAMPLE(*inptr0++); r = *inptr0++;
g = GETJSAMPLE(*inptr1++); g = *inptr1++;
b = GETJSAMPLE(*inptr2++); b = *inptr2++;
rgb = PACK_SHORT_565(r, g, b); rgb = PACK_SHORT_565(r, g, b);
r = GETJSAMPLE(*inptr0++); r = *inptr0++;
g = GETJSAMPLE(*inptr1++); g = *inptr1++;
b = GETJSAMPLE(*inptr2++); b = *inptr2++;
rgb = PACK_TWO_PIXELS(rgb, PACK_SHORT_565(r, g, b)); rgb = PACK_TWO_PIXELS(rgb, PACK_SHORT_565(r, g, b));
WRITE_TWO_ALIGNED_PIXELS(outptr, rgb); WRITE_TWO_ALIGNED_PIXELS(outptr, rgb);
outptr += 4; outptr += 4;
} }
if (num_cols & 1) { if (num_cols & 1) {
r = GETJSAMPLE(*inptr0); r = *inptr0;
g = GETJSAMPLE(*inptr1); g = *inptr1;
b = GETJSAMPLE(*inptr2); b = *inptr2;
rgb = PACK_SHORT_565(r, g, b); rgb = PACK_SHORT_565(r, g, b);
*(INT16 *)outptr = (INT16)rgb; *(INT16 *)outptr = (INT16)rgb;
} }
@@ -259,24 +259,24 @@ rgb_rgb565D_convert_internal(j_decompress_ptr cinfo, JSAMPIMAGE input_buf,
input_row++; input_row++;
outptr = *output_buf++; outptr = *output_buf++;
if (PACK_NEED_ALIGNMENT(outptr)) { if (PACK_NEED_ALIGNMENT(outptr)) {
r = range_limit[DITHER_565_R(GETJSAMPLE(*inptr0++), d0)]; r = range_limit[DITHER_565_R(*inptr0++, d0)];
g = range_limit[DITHER_565_G(GETJSAMPLE(*inptr1++), d0)]; g = range_limit[DITHER_565_G(*inptr1++, d0)];
b = range_limit[DITHER_565_B(GETJSAMPLE(*inptr2++), d0)]; b = range_limit[DITHER_565_B(*inptr2++, d0)];
rgb = PACK_SHORT_565(r, g, b); rgb = PACK_SHORT_565(r, g, b);
*(INT16 *)outptr = (INT16)rgb; *(INT16 *)outptr = (INT16)rgb;
outptr += 2; outptr += 2;
num_cols--; num_cols--;
} }
for (col = 0; col < (num_cols >> 1); col++) { for (col = 0; col < (num_cols >> 1); col++) {
r = range_limit[DITHER_565_R(GETJSAMPLE(*inptr0++), d0)]; r = range_limit[DITHER_565_R(*inptr0++, d0)];
g = range_limit[DITHER_565_G(GETJSAMPLE(*inptr1++), d0)]; g = range_limit[DITHER_565_G(*inptr1++, d0)];
b = range_limit[DITHER_565_B(GETJSAMPLE(*inptr2++), d0)]; b = range_limit[DITHER_565_B(*inptr2++, d0)];
d0 = DITHER_ROTATE(d0); d0 = DITHER_ROTATE(d0);
rgb = PACK_SHORT_565(r, g, b); rgb = PACK_SHORT_565(r, g, b);
r = range_limit[DITHER_565_R(GETJSAMPLE(*inptr0++), d0)]; r = range_limit[DITHER_565_R(*inptr0++, d0)];
g = range_limit[DITHER_565_G(GETJSAMPLE(*inptr1++), d0)]; g = range_limit[DITHER_565_G(*inptr1++, d0)];
b = range_limit[DITHER_565_B(GETJSAMPLE(*inptr2++), d0)]; b = range_limit[DITHER_565_B(*inptr2++, d0)];
d0 = DITHER_ROTATE(d0); d0 = DITHER_ROTATE(d0);
rgb = PACK_TWO_PIXELS(rgb, PACK_SHORT_565(r, g, b)); rgb = PACK_TWO_PIXELS(rgb, PACK_SHORT_565(r, g, b));
@@ -284,9 +284,9 @@ rgb_rgb565D_convert_internal(j_decompress_ptr cinfo, JSAMPIMAGE input_buf,
outptr += 4; outptr += 4;
} }
if (num_cols & 1) { if (num_cols & 1) {
r = range_limit[DITHER_565_R(GETJSAMPLE(*inptr0), d0)]; r = range_limit[DITHER_565_R(*inptr0, d0)];
g = range_limit[DITHER_565_G(GETJSAMPLE(*inptr1), d0)]; g = range_limit[DITHER_565_G(*inptr1, d0)];
b = range_limit[DITHER_565_B(GETJSAMPLE(*inptr2), d0)]; b = range_limit[DITHER_565_B(*inptr2, d0)];
rgb = PACK_SHORT_565(r, g, b); rgb = PACK_SHORT_565(r, g, b);
*(INT16 *)outptr = (INT16)rgb; *(INT16 *)outptr = (INT16)rgb;
} }


@@ -53,9 +53,9 @@ ycc_rgb_convert_internal(j_decompress_ptr cinfo, JSAMPIMAGE input_buf,
input_row++; input_row++;
outptr = *output_buf++; outptr = *output_buf++;
for (col = 0; col < num_cols; col++) { for (col = 0; col < num_cols; col++) {
y = GETJSAMPLE(inptr0[col]); y = inptr0[col];
cb = GETJSAMPLE(inptr1[col]); cb = inptr1[col];
cr = GETJSAMPLE(inptr2[col]); cr = inptr2[col];
/* Range-limiting is essential due to noise introduced by DCT losses. */ /* Range-limiting is essential due to noise introduced by DCT losses. */
outptr[RGB_RED] = range_limit[y + Crrtab[cr]]; outptr[RGB_RED] = range_limit[y + Crrtab[cr]];
outptr[RGB_GREEN] = range_limit[y + outptr[RGB_GREEN] = range_limit[y +
@@ -93,7 +93,6 @@ gray_rgb_convert_internal(j_decompress_ptr cinfo, JSAMPIMAGE input_buf,
inptr = input_buf[0][input_row++]; inptr = input_buf[0][input_row++];
outptr = *output_buf++; outptr = *output_buf++;
for (col = 0; col < num_cols; col++) { for (col = 0; col < num_cols; col++) {
/* We can dispense with GETJSAMPLE() here */
outptr[RGB_RED] = outptr[RGB_GREEN] = outptr[RGB_BLUE] = inptr[col]; outptr[RGB_RED] = outptr[RGB_GREEN] = outptr[RGB_BLUE] = inptr[col];
/* Set unused byte to 0xFF so it can be interpreted as an opaque */ /* Set unused byte to 0xFF so it can be interpreted as an opaque */
/* alpha channel value */ /* alpha channel value */
@@ -128,7 +127,6 @@ rgb_rgb_convert_internal(j_decompress_ptr cinfo, JSAMPIMAGE input_buf,
input_row++; input_row++;
outptr = *output_buf++; outptr = *output_buf++;
for (col = 0; col < num_cols; col++) { for (col = 0; col < num_cols; col++) {
/* We can dispense with GETJSAMPLE() here */
outptr[RGB_RED] = inptr0[col]; outptr[RGB_RED] = inptr0[col];
outptr[RGB_GREEN] = inptr1[col]; outptr[RGB_GREEN] = inptr1[col];
outptr[RGB_BLUE] = inptr2[col]; outptr[RGB_BLUE] = inptr2[col];
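The color-conversion hunks above again only remove GETJSAMPLE(); the table-driven YCbCr-to-RGB math is untouched. Below is a floating-point sketch of that math, assuming the standard JFIF coefficients — the library's actual Crrtab/Cbgtab/Crgtab/Cbbtab tables are SCALEBITS fixed-point and the clamp is a range_limit[] lookup, so the names and float arithmetic here are simplifications.

```c
#include <stdio.h>

/* Floating-point sketch of the table-driven YCbCr->RGB conversion performed
 * by ycc_rgb_convert_internal() above.  The chroma contributions are
 * precomputed per 0..255 chroma value, so each pixel costs a few lookups,
 * adds, and a clamp. */
static double crr[256], cbb[256], crg[256], cbg[256];

static void build_tables(void)
{
  for (int i = 0; i < 256; i++) {
    double x = i - 128.0;          /* chroma samples are centered on 128 */
    crr[i] = 1.40200 * x;          /* Cr contribution to R */
    cbb[i] = 1.77200 * x;          /* Cb contribution to B */
    crg[i] = -0.71414 * x;         /* Cr contribution to G */
    cbg[i] = -0.34414 * x;         /* Cb contribution to G */
  }
}

static int clamp(double v) { return v < 0 ? 0 : (v > 255 ? 255 : (int)(v + 0.5)); }

static void ycc_to_rgb(int y, int cb, int cr, int rgb[3])
{
  rgb[0] = clamp(y + crr[cr]);
  rgb[1] = clamp(y + cbg[cb] + crg[cr]);
  rgb[2] = clamp(y + cbb[cb]);
}

int main(void)
{
  int rgb[3];
  build_tables();
  ycc_to_rgb(128, 128, 200, rgb);  /* mid gray pushed toward red */
  printf("R=%d G=%d B=%d\n", rgb[0], rgb[1], rgb[2]);
  return 0;
}
```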


@@ -341,9 +341,9 @@ rgb_gray_convert(j_decompress_ptr cinfo, JSAMPIMAGE input_buf,
input_row++; input_row++;
outptr = *output_buf++; outptr = *output_buf++;
for (col = 0; col < num_cols; col++) { for (col = 0; col < num_cols; col++) {
r = GETJSAMPLE(inptr0[col]); r = inptr0[col];
g = GETJSAMPLE(inptr1[col]); g = inptr1[col];
b = GETJSAMPLE(inptr2[col]); b = inptr2[col];
/* Y */ /* Y */
outptr[col] = (JSAMPLE)((ctab[r + R_Y_OFF] + ctab[g + G_Y_OFF] + outptr[col] = (JSAMPLE)((ctab[r + R_Y_OFF] + ctab[g + G_Y_OFF] +
ctab[b + B_Y_OFF]) >> SCALEBITS); ctab[b + B_Y_OFF]) >> SCALEBITS);
@@ -550,9 +550,9 @@ ycck_cmyk_convert(j_decompress_ptr cinfo, JSAMPIMAGE input_buf,
input_row++; input_row++;
outptr = *output_buf++; outptr = *output_buf++;
for (col = 0; col < num_cols; col++) { for (col = 0; col < num_cols; col++) {
y = GETJSAMPLE(inptr0[col]); y = inptr0[col];
cb = GETJSAMPLE(inptr1[col]); cb = inptr1[col];
cr = GETJSAMPLE(inptr2[col]); cr = inptr2[col];
/* Range-limiting is essential due to noise introduced by DCT losses. */ /* Range-limiting is essential due to noise introduced by DCT losses. */
outptr[0] = range_limit[MAXJSAMPLE - (y + Crrtab[cr])]; /* red */ outptr[0] = range_limit[MAXJSAMPLE - (y + Crrtab[cr])]; /* red */
outptr[1] = range_limit[MAXJSAMPLE - (y + /* green */ outptr[1] = range_limit[MAXJSAMPLE - (y + /* green */
@@ -560,7 +560,7 @@ ycck_cmyk_convert(j_decompress_ptr cinfo, JSAMPIMAGE input_buf,
SCALEBITS)))]; SCALEBITS)))];
outptr[2] = range_limit[MAXJSAMPLE - (y + Cbbtab[cb])]; /* blue */ outptr[2] = range_limit[MAXJSAMPLE - (y + Cbbtab[cb])]; /* blue */
/* K passes through unchanged */ /* K passes through unchanged */
outptr[3] = inptr3[col]; /* don't need GETJSAMPLE here */ outptr[3] = inptr3[col];
outptr += 4; outptr += 4;
} }
} }
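rgb_gray_convert() above computes luma as three table lookups plus a shift. Here is a sketch of the underlying fixed-point arithmetic, assuming the JFIF weights; the single ctab[] layout indexed via R_Y_OFF/G_Y_OFF/B_Y_OFF in the real code is collapsed into three multiplies.

```c
#include <stdio.h>

/* Sketch of the fixed-point luma computation used by rgb_gray_convert():
 * Y = 0.29900 R + 0.58700 G + 0.11400 B, with the weights prescaled by
 * 2^SCALEBITS so the per-pixel work is table lookups (here, multiplies)
 * and one shift. */
#define SCALEBITS 16
#define FIX(x)    ((long)((x) * (1L << SCALEBITS) + 0.5))

static int rgb_to_gray(int r, int g, int b)
{
  long y = FIX(0.29900) * r + FIX(0.58700) * g + FIX(0.11400) * b;
  return (int)((y + (1L << (SCALEBITS - 1))) >> SCALEBITS);   /* round */
}

int main(void)
{
  printf("%d\n", rgb_to_gray(255, 255, 255));   /* 255: the weights sum to 1 */
  printf("%d\n", rgb_to_gray(255, 0, 0));       /* 76 */
  return 0;
}
```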


@@ -5,6 +5,7 @@
* Copyright (C) 1991-1997, Thomas G. Lane. * Copyright (C) 1991-1997, Thomas G. Lane.
* libjpeg-turbo Modifications: * libjpeg-turbo Modifications:
* Copyright (C) 2009-2011, 2016, 2018-2019, D. R. Commander. * Copyright (C) 2009-2011, 2016, 2018-2019, D. R. Commander.
* Copyright (C) 2018, Matthias Räncker.
* For conditions of distribution and use, see the accompanying README.ijg * For conditions of distribution and use, see the accompanying README.ijg
* file. * file.
* *
@@ -39,24 +40,6 @@ typedef struct {
int last_dc_val[MAX_COMPS_IN_SCAN]; /* last DC coef for each component */ int last_dc_val[MAX_COMPS_IN_SCAN]; /* last DC coef for each component */
} savable_state; } savable_state;
/* This macro is to work around compilers with missing or broken
* structure assignment. You'll need to fix this code if you have
* such a compiler and you change MAX_COMPS_IN_SCAN.
*/
#ifndef NO_STRUCT_ASSIGN
#define ASSIGN_STATE(dest, src) ((dest) = (src))
#else
#if MAX_COMPS_IN_SCAN == 4
#define ASSIGN_STATE(dest, src) \
((dest).last_dc_val[0] = (src).last_dc_val[0], \
(dest).last_dc_val[1] = (src).last_dc_val[1], \
(dest).last_dc_val[2] = (src).last_dc_val[2], \
(dest).last_dc_val[3] = (src).last_dc_val[3])
#endif
#endif
typedef struct { typedef struct {
struct jpeg_entropy_decoder pub; /* public fields */ struct jpeg_entropy_decoder pub; /* public fields */
@@ -325,7 +308,7 @@ jpeg_fill_bit_buffer(bitread_working_state *state,
bytes_in_buffer = cinfo->src->bytes_in_buffer; bytes_in_buffer = cinfo->src->bytes_in_buffer;
} }
bytes_in_buffer--; bytes_in_buffer--;
c = GETJOCTET(*next_input_byte++); c = *next_input_byte++;
/* If it's 0xFF, check and discard stuffed zero byte */ /* If it's 0xFF, check and discard stuffed zero byte */
if (c == 0xFF) { if (c == 0xFF) {
@@ -342,7 +325,7 @@ jpeg_fill_bit_buffer(bitread_working_state *state,
bytes_in_buffer = cinfo->src->bytes_in_buffer; bytes_in_buffer = cinfo->src->bytes_in_buffer;
} }
bytes_in_buffer--; bytes_in_buffer--;
c = GETJOCTET(*next_input_byte++); c = *next_input_byte++;
} while (c == 0xFF); } while (c == 0xFF);
if (c == 0) { if (c == 0) {
@@ -405,8 +388,8 @@ no_more_bytes:
#define GET_BYTE { \ #define GET_BYTE { \
register int c0, c1; \ register int c0, c1; \
c0 = GETJOCTET(*buffer++); \ c0 = *buffer++; \
c1 = GETJOCTET(*buffer); \ c1 = *buffer; \
/* Pre-execute most common case */ \ /* Pre-execute most common case */ \
get_buffer = (get_buffer << 8) | c0; \ get_buffer = (get_buffer << 8) | c0; \
bits_left += 8; \ bits_left += 8; \
@@ -423,7 +406,7 @@ no_more_bytes:
} \ } \
} }
#if SIZEOF_SIZE_T == 8 || defined(_WIN64) #if SIZEOF_SIZE_T == 8 || defined(_WIN64) || (defined(__x86_64__) && defined(__ILP32__))
/* Pre-fetch 48 bytes, because the holding register is 64-bit */ /* Pre-fetch 48 bytes, because the holding register is 64-bit */
#define FILL_BIT_BUFFER_FAST \ #define FILL_BIT_BUFFER_FAST \
@@ -568,7 +551,7 @@ decode_mcu_slow(j_decompress_ptr cinfo, JBLOCKROW *MCU_data)
/* Load up working state */ /* Load up working state */
BITREAD_LOAD_STATE(cinfo, entropy->bitstate); BITREAD_LOAD_STATE(cinfo, entropy->bitstate);
ASSIGN_STATE(state, entropy->saved); state = entropy->saved;
for (blkn = 0; blkn < cinfo->blocks_in_MCU; blkn++) { for (blkn = 0; blkn < cinfo->blocks_in_MCU; blkn++) {
JBLOCKROW block = MCU_data ? MCU_data[blkn] : NULL; JBLOCKROW block = MCU_data ? MCU_data[blkn] : NULL;
@@ -653,7 +636,7 @@ decode_mcu_slow(j_decompress_ptr cinfo, JBLOCKROW *MCU_data)
/* Completed MCU, so update state */ /* Completed MCU, so update state */
BITREAD_SAVE_STATE(cinfo, entropy->bitstate); BITREAD_SAVE_STATE(cinfo, entropy->bitstate);
ASSIGN_STATE(entropy->saved, state); entropy->saved = state;
return TRUE; return TRUE;
} }
@@ -671,7 +654,7 @@ decode_mcu_fast(j_decompress_ptr cinfo, JBLOCKROW *MCU_data)
/* Load up working state */ /* Load up working state */
BITREAD_LOAD_STATE(cinfo, entropy->bitstate); BITREAD_LOAD_STATE(cinfo, entropy->bitstate);
buffer = (JOCTET *)br_state.next_input_byte; buffer = (JOCTET *)br_state.next_input_byte;
ASSIGN_STATE(state, entropy->saved); state = entropy->saved;
for (blkn = 0; blkn < cinfo->blocks_in_MCU; blkn++) { for (blkn = 0; blkn < cinfo->blocks_in_MCU; blkn++) {
JBLOCKROW block = MCU_data ? MCU_data[blkn] : NULL; JBLOCKROW block = MCU_data ? MCU_data[blkn] : NULL;
@@ -740,7 +723,7 @@ decode_mcu_fast(j_decompress_ptr cinfo, JBLOCKROW *MCU_data)
br_state.bytes_in_buffer -= (buffer - br_state.next_input_byte); br_state.bytes_in_buffer -= (buffer - br_state.next_input_byte);
br_state.next_input_byte = buffer; br_state.next_input_byte = buffer;
BITREAD_SAVE_STATE(cinfo, entropy->bitstate); BITREAD_SAVE_STATE(cinfo, entropy->bitstate);
ASSIGN_STATE(entropy->saved, state); entropy->saved = state;
return TRUE; return TRUE;
} }
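The deleted ASSIGN_STATE() macro existed only for ancient compilers that could not assign structs; the hunks above replace it with plain struct assignment. A tiny demonstration, with MAX_COMPS standing in for MAX_COMPS_IN_SCAN:

```c
#include <stdio.h>

/* Plain assignment copies the whole savable_state, exactly what the removed
 * ASSIGN_STATE() macro spelled out field by field for broken compilers. */
#define MAX_COMPS 4

typedef struct {
  unsigned int EOBRUN;          /* progressive decoder only */
  int last_dc_val[MAX_COMPS];   /* last DC coefficient per component */
} savable_state;

int main(void)
{
  savable_state saved = { 0, { 1, 2, 3, 4 } };
  savable_state working = saved;        /* whole-struct copy, no macro */
  working.last_dc_val[0] = 42;          /* does not affect 'saved' */
  printf("%d %d\n", saved.last_dc_val[0], working.last_dc_val[0]);
  return 0;
}
```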


@@ -5,6 +5,7 @@
* Copyright (C) 1991-1997, Thomas G. Lane. * Copyright (C) 1991-1997, Thomas G. Lane.
* libjpeg-turbo Modifications: * libjpeg-turbo Modifications:
* Copyright (C) 2010-2011, 2015-2016, D. R. Commander. * Copyright (C) 2010-2011, 2015-2016, D. R. Commander.
* Copyright (C) 2018, Matthias Räncker.
* For conditions of distribution and use, see the accompanying README.ijg * For conditions of distribution and use, see the accompanying README.ijg
* file. * file.
* *
@@ -78,6 +79,11 @@ EXTERN(void) jpeg_make_d_derived_tbl(j_decompress_ptr cinfo, boolean isDC,
typedef size_t bit_buf_type; /* type of bit-extraction buffer */ typedef size_t bit_buf_type; /* type of bit-extraction buffer */
#define BIT_BUF_SIZE 64 /* size of buffer in bits */ #define BIT_BUF_SIZE 64 /* size of buffer in bits */
#elif defined(__x86_64__) && defined(__ILP32__)
typedef unsigned long long bit_buf_type; /* type of bit-extraction buffer */
#define BIT_BUF_SIZE 64 /* size of buffer in bits */
#else #else
typedef unsigned long bit_buf_type; /* type of bit-extraction buffer */ typedef unsigned long bit_buf_type; /* type of bit-extraction buffer */
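The new branch above keeps a 64-bit Huffman bit buffer on the x32 ABI, where unsigned long and size_t are only 32 bits wide despite the 64-bit registers. A sketch of the same selection follows; the first condition is a simplification of the library's SIZEOF_SIZE_T configure check.

```c
#include <stdio.h>

/* Sketch of the bit_buf_type selection: prefer a 64-bit holding register
 * whenever the target can provide one, including x32 (__x86_64__ with
 * __ILP32__), where the "natural" word types are only 32 bits. */
#if defined(_WIN64) || (defined(__SIZEOF_SIZE_T__) && __SIZEOF_SIZE_T__ == 8)
typedef size_t bit_buf_type;              /* naturally 64-bit */
#elif defined(__x86_64__) && defined(__ILP32__)
typedef unsigned long long bit_buf_type;  /* x32: force 64 bits */
#else
typedef unsigned long bit_buf_type;       /* fall back to the word size */
#endif

int main(void)
{
  printf("bit buffer holds %u bits\n", (unsigned)(sizeof(bit_buf_type) * 8));
  return 0;
}
```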

jdicc.c

@@ -38,18 +38,18 @@ marker_is_icc(jpeg_saved_marker_ptr marker)
marker->marker == ICC_MARKER && marker->marker == ICC_MARKER &&
marker->data_length >= ICC_OVERHEAD_LEN && marker->data_length >= ICC_OVERHEAD_LEN &&
/* verify the identifying string */ /* verify the identifying string */
GETJOCTET(marker->data[0]) == 0x49 && marker->data[0] == 0x49 &&
GETJOCTET(marker->data[1]) == 0x43 && marker->data[1] == 0x43 &&
GETJOCTET(marker->data[2]) == 0x43 && marker->data[2] == 0x43 &&
GETJOCTET(marker->data[3]) == 0x5F && marker->data[3] == 0x5F &&
GETJOCTET(marker->data[4]) == 0x50 && marker->data[4] == 0x50 &&
GETJOCTET(marker->data[5]) == 0x52 && marker->data[5] == 0x52 &&
GETJOCTET(marker->data[6]) == 0x4F && marker->data[6] == 0x4F &&
GETJOCTET(marker->data[7]) == 0x46 && marker->data[7] == 0x46 &&
GETJOCTET(marker->data[8]) == 0x49 && marker->data[8] == 0x49 &&
GETJOCTET(marker->data[9]) == 0x4C && marker->data[9] == 0x4C &&
GETJOCTET(marker->data[10]) == 0x45 && marker->data[10] == 0x45 &&
GETJOCTET(marker->data[11]) == 0x0; marker->data[11] == 0x0;
} }
@@ -102,12 +102,12 @@ jpeg_read_icc_profile(j_decompress_ptr cinfo, JOCTET **icc_data_ptr,
for (marker = cinfo->marker_list; marker != NULL; marker = marker->next) { for (marker = cinfo->marker_list; marker != NULL; marker = marker->next) {
if (marker_is_icc(marker)) { if (marker_is_icc(marker)) {
if (num_markers == 0) if (num_markers == 0)
num_markers = GETJOCTET(marker->data[13]); num_markers = marker->data[13];
else if (num_markers != GETJOCTET(marker->data[13])) { else if (num_markers != marker->data[13]) {
WARNMS(cinfo, JWRN_BOGUS_ICC); /* inconsistent num_markers fields */ WARNMS(cinfo, JWRN_BOGUS_ICC); /* inconsistent num_markers fields */
return FALSE; return FALSE;
} }
seq_no = GETJOCTET(marker->data[12]); seq_no = marker->data[12];
if (seq_no <= 0 || seq_no > num_markers) { if (seq_no <= 0 || seq_no > num_markers) {
WARNMS(cinfo, JWRN_BOGUS_ICC); /* bogus sequence number */ WARNMS(cinfo, JWRN_BOGUS_ICC); /* bogus sequence number */
return FALSE; return FALSE;
@@ -154,7 +154,7 @@ jpeg_read_icc_profile(j_decompress_ptr cinfo, JOCTET **icc_data_ptr,
JOCTET FAR *src_ptr; JOCTET FAR *src_ptr;
JOCTET *dst_ptr; JOCTET *dst_ptr;
unsigned int length; unsigned int length;
seq_no = GETJOCTET(marker->data[12]); seq_no = marker->data[12];
dst_ptr = icc_data + data_offset[seq_no]; dst_ptr = icc_data + data_offset[seq_no];
src_ptr = marker->data + ICC_OVERHEAD_LEN; src_ptr = marker->data + ICC_OVERHEAD_LEN;
length = data_length[seq_no]; length = data_length[seq_no];
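marker_is_icc() above checks the fixed ICC APP2 header byte by byte; those twelve bytes are simply "ICC_PROFILE" plus a NUL, followed by a 1-based chunk sequence number and the total chunk count. A standalone sketch of the same check (not the library's own API):

```c
#include <stdio.h>
#include <string.h>

/* Illustrative check mirroring marker_is_icc(): an ICC APP2 payload starts
 * with the 12-byte identifier "ICC_PROFILE\0", then seq_no and num_markers.
 * Offsets follow the ICC spec; the buffer is hand-made. */
#define ICC_OVERHEAD_LEN 14

static int looks_like_icc(const unsigned char *data, size_t len)
{
  return len >= ICC_OVERHEAD_LEN &&
         memcmp(data, "ICC_PROFILE\0", 12) == 0;
}

int main(void)
{
  unsigned char payload[ICC_OVERHEAD_LEN] = "ICC_PROFILE";
  payload[12] = 1;   /* seq_no: this is chunk 1 ... */
  payload[13] = 1;   /* num_markers: ... of 1 total */
  printf("ICC? %d, chunk %d of %d\n",
         looks_like_icc(payload, sizeof(payload)), payload[12], payload[13]);
  return 0;
}
```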


@@ -151,7 +151,7 @@ typedef my_marker_reader *my_marker_ptr;
#define INPUT_BYTE(cinfo, V, action) \ #define INPUT_BYTE(cinfo, V, action) \
MAKESTMT( MAKE_BYTE_AVAIL(cinfo, action); \ MAKESTMT( MAKE_BYTE_AVAIL(cinfo, action); \
bytes_in_buffer--; \ bytes_in_buffer--; \
V = GETJOCTET(*next_input_byte++); ) V = *next_input_byte++; )
/* As above, but read two bytes interpreted as an unsigned 16-bit integer. /* As above, but read two bytes interpreted as an unsigned 16-bit integer.
* V should be declared unsigned int or perhaps JLONG. * V should be declared unsigned int or perhaps JLONG.
@@ -159,10 +159,10 @@ typedef my_marker_reader *my_marker_ptr;
#define INPUT_2BYTES(cinfo, V, action) \ #define INPUT_2BYTES(cinfo, V, action) \
MAKESTMT( MAKE_BYTE_AVAIL(cinfo, action); \ MAKESTMT( MAKE_BYTE_AVAIL(cinfo, action); \
bytes_in_buffer--; \ bytes_in_buffer--; \
V = ((unsigned int)GETJOCTET(*next_input_byte++)) << 8; \ V = ((unsigned int)(*next_input_byte++)) << 8; \
MAKE_BYTE_AVAIL(cinfo, action); \ MAKE_BYTE_AVAIL(cinfo, action); \
bytes_in_buffer--; \ bytes_in_buffer--; \
V += GETJOCTET(*next_input_byte++); ) V += *next_input_byte++; )
/* /*
@@ -608,18 +608,18 @@ examine_app0(j_decompress_ptr cinfo, JOCTET *data, unsigned int datalen,
JLONG totallen = (JLONG)datalen + remaining; JLONG totallen = (JLONG)datalen + remaining;
if (datalen >= APP0_DATA_LEN && if (datalen >= APP0_DATA_LEN &&
GETJOCTET(data[0]) == 0x4A && data[0] == 0x4A &&
GETJOCTET(data[1]) == 0x46 && data[1] == 0x46 &&
GETJOCTET(data[2]) == 0x49 && data[2] == 0x49 &&
GETJOCTET(data[3]) == 0x46 && data[3] == 0x46 &&
GETJOCTET(data[4]) == 0) { data[4] == 0) {
/* Found JFIF APP0 marker: save info */ /* Found JFIF APP0 marker: save info */
cinfo->saw_JFIF_marker = TRUE; cinfo->saw_JFIF_marker = TRUE;
cinfo->JFIF_major_version = GETJOCTET(data[5]); cinfo->JFIF_major_version = data[5];
cinfo->JFIF_minor_version = GETJOCTET(data[6]); cinfo->JFIF_minor_version = data[6];
cinfo->density_unit = GETJOCTET(data[7]); cinfo->density_unit = data[7];
cinfo->X_density = (GETJOCTET(data[8]) << 8) + GETJOCTET(data[9]); cinfo->X_density = (data[8] << 8) + data[9];
cinfo->Y_density = (GETJOCTET(data[10]) << 8) + GETJOCTET(data[11]); cinfo->Y_density = (data[10] << 8) + data[11];
/* Check version. /* Check version.
* Major version must be 1, anything else signals an incompatible change. * Major version must be 1, anything else signals an incompatible change.
* (We used to treat this as an error, but now it's a nonfatal warning, * (We used to treat this as an error, but now it's a nonfatal warning,
@@ -634,24 +634,22 @@ examine_app0(j_decompress_ptr cinfo, JOCTET *data, unsigned int datalen,
cinfo->JFIF_major_version, cinfo->JFIF_minor_version, cinfo->JFIF_major_version, cinfo->JFIF_minor_version,
cinfo->X_density, cinfo->Y_density, cinfo->density_unit); cinfo->X_density, cinfo->Y_density, cinfo->density_unit);
/* Validate thumbnail dimensions and issue appropriate messages */ /* Validate thumbnail dimensions and issue appropriate messages */
if (GETJOCTET(data[12]) | GETJOCTET(data[13])) if (data[12] | data[13])
TRACEMS2(cinfo, 1, JTRC_JFIF_THUMBNAIL, TRACEMS2(cinfo, 1, JTRC_JFIF_THUMBNAIL, data[12], data[13]);
GETJOCTET(data[12]), GETJOCTET(data[13]));
totallen -= APP0_DATA_LEN; totallen -= APP0_DATA_LEN;
if (totallen != if (totallen != ((JLONG)data[12] * (JLONG)data[13] * (JLONG)3))
((JLONG)GETJOCTET(data[12]) * (JLONG)GETJOCTET(data[13]) * (JLONG)3))
TRACEMS1(cinfo, 1, JTRC_JFIF_BADTHUMBNAILSIZE, (int)totallen); TRACEMS1(cinfo, 1, JTRC_JFIF_BADTHUMBNAILSIZE, (int)totallen);
} else if (datalen >= 6 && } else if (datalen >= 6 &&
GETJOCTET(data[0]) == 0x4A && data[0] == 0x4A &&
GETJOCTET(data[1]) == 0x46 && data[1] == 0x46 &&
GETJOCTET(data[2]) == 0x58 && data[2] == 0x58 &&
GETJOCTET(data[3]) == 0x58 && data[3] == 0x58 &&
GETJOCTET(data[4]) == 0) { data[4] == 0) {
/* Found JFIF "JFXX" extension APP0 marker */ /* Found JFIF "JFXX" extension APP0 marker */
/* The library doesn't actually do anything with these, /* The library doesn't actually do anything with these,
* but we try to produce a helpful trace message. * but we try to produce a helpful trace message.
*/ */
switch (GETJOCTET(data[5])) { switch (data[5]) {
case 0x10: case 0x10:
TRACEMS1(cinfo, 1, JTRC_THUMB_JPEG, (int)totallen); TRACEMS1(cinfo, 1, JTRC_THUMB_JPEG, (int)totallen);
break; break;
@@ -662,8 +660,7 @@ examine_app0(j_decompress_ptr cinfo, JOCTET *data, unsigned int datalen,
TRACEMS1(cinfo, 1, JTRC_THUMB_RGB, (int)totallen); TRACEMS1(cinfo, 1, JTRC_THUMB_RGB, (int)totallen);
break; break;
default: default:
TRACEMS2(cinfo, 1, JTRC_JFIF_EXTENSION, TRACEMS2(cinfo, 1, JTRC_JFIF_EXTENSION, data[5], (int)totallen);
GETJOCTET(data[5]), (int)totallen);
break; break;
} }
} else { } else {
@@ -684,16 +681,16 @@ examine_app14(j_decompress_ptr cinfo, JOCTET *data, unsigned int datalen,
unsigned int version, flags0, flags1, transform; unsigned int version, flags0, flags1, transform;
if (datalen >= APP14_DATA_LEN && if (datalen >= APP14_DATA_LEN &&
GETJOCTET(data[0]) == 0x41 && data[0] == 0x41 &&
GETJOCTET(data[1]) == 0x64 && data[1] == 0x64 &&
GETJOCTET(data[2]) == 0x6F && data[2] == 0x6F &&
GETJOCTET(data[3]) == 0x62 && data[3] == 0x62 &&
GETJOCTET(data[4]) == 0x65) { data[4] == 0x65) {
/* Found Adobe APP14 marker */ /* Found Adobe APP14 marker */
version = (GETJOCTET(data[5]) << 8) + GETJOCTET(data[6]); version = (data[5] << 8) + data[6];
flags0 = (GETJOCTET(data[7]) << 8) + GETJOCTET(data[8]); flags0 = (data[7] << 8) + data[8];
flags1 = (GETJOCTET(data[9]) << 8) + GETJOCTET(data[10]); flags1 = (data[9] << 8) + data[10];
transform = GETJOCTET(data[11]); transform = data[11];
TRACEMS4(cinfo, 1, JTRC_ADOBE, version, flags0, flags1, transform); TRACEMS4(cinfo, 1, JTRC_ADOBE, version, flags0, flags1, transform);
cinfo->saw_Adobe_marker = TRUE; cinfo->saw_Adobe_marker = TRUE;
cinfo->Adobe_transform = (UINT8)transform; cinfo->Adobe_transform = (UINT8)transform;
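examine_app0() and examine_app14() above read multi-byte fields big-endian, one octet at a time. A small sketch of that extraction on a hand-made JFIF APP0 payload; the offsets follow the JFIF spec, and the buffer is illustrative rather than parsed from a real file.

```c
#include <stdio.h>

/* Big-endian field extraction as done in examine_app0(): X_density lives at
 * payload bytes 8-9 and Y_density at bytes 10-11, counted from the start of
 * the "JFIF\0" identifier. */
static unsigned int be16(const unsigned char *p)
{
  return ((unsigned int)p[0] << 8) | p[1];
}

int main(void)
{
  /* "JFIF\0", version 1.02, density unit 1 (dpi), 300 x 300, no thumbnail */
  unsigned char app0[14] = {
    0x4A, 0x46, 0x49, 0x46, 0x00,  /* J F I F \0 */
    1, 2,                          /* major, minor version */
    1,                             /* density unit */
    0x01, 0x2C, 0x01, 0x2C,        /* X_density, Y_density = 300 */
    0, 0                           /* thumbnail width, height */
  };
  printf("JFIF %d.%02d, %u x %u\n", app0[5], app0[6],
         be16(app0 + 8), be16(app0 + 10));
  return 0;
}
```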


@@ -5,7 +5,7 @@
* Copyright (C) 1991-1997, Thomas G. Lane. * Copyright (C) 1991-1997, Thomas G. Lane.
* Modified 2002-2009 by Guido Vollbeding. * Modified 2002-2009 by Guido Vollbeding.
* libjpeg-turbo Modifications: * libjpeg-turbo Modifications:
* Copyright (C) 2009-2011, 2016, D. R. Commander. * Copyright (C) 2009-2011, 2016, 2019, D. R. Commander.
* Copyright (C) 2013, Linaro Limited. * Copyright (C) 2013, Linaro Limited.
* Copyright (C) 2015, Google, Inc. * Copyright (C) 2015, Google, Inc.
* For conditions of distribution and use, see the accompanying README.ijg * For conditions of distribution and use, see the accompanying README.ijg
@@ -22,7 +22,6 @@
#include "jpeglib.h" #include "jpeglib.h"
#include "jpegcomp.h" #include "jpegcomp.h"
#include "jdmaster.h" #include "jdmaster.h"
#include "jsimd.h"
/* /*
@@ -70,17 +69,6 @@ use_merged_upsample(j_decompress_ptr cinfo)
cinfo->comp_info[1]._DCT_scaled_size != cinfo->_min_DCT_scaled_size || cinfo->comp_info[1]._DCT_scaled_size != cinfo->_min_DCT_scaled_size ||
cinfo->comp_info[2]._DCT_scaled_size != cinfo->_min_DCT_scaled_size) cinfo->comp_info[2]._DCT_scaled_size != cinfo->_min_DCT_scaled_size)
return FALSE; return FALSE;
#ifdef WITH_SIMD
/* If YCbCr-to-RGB color conversion is SIMD-accelerated but merged upsampling
isn't, then disabling merged upsampling is likely to be faster when
decompressing YCbCr JPEG images. */
if (!jsimd_can_h2v2_merged_upsample() && !jsimd_can_h2v1_merged_upsample() &&
jsimd_can_ycc_rgb() && cinfo->jpeg_color_space == JCS_YCbCr &&
(cinfo->out_color_space == JCS_RGB ||
(cinfo->out_color_space >= JCS_EXT_RGB &&
cinfo->out_color_space <= JCS_EXT_ARGB)))
return FALSE;
#endif
/* ??? also need to test for upsample-time rescaling, when & if supported */ /* ??? also need to test for upsample-time rescaling, when & if supported */
return TRUE; /* by golly, it'll work... */ return TRUE; /* by golly, it'll work... */
#else #else
@@ -580,6 +568,7 @@ master_selection(j_decompress_ptr cinfo)
*/ */
cinfo->master->first_iMCU_col = 0; cinfo->master->first_iMCU_col = 0;
cinfo->master->last_iMCU_col = cinfo->MCUs_per_row - 1; cinfo->master->last_iMCU_col = cinfo->MCUs_per_row - 1;
cinfo->master->last_good_iMCU_row = 0;
#ifdef D_MULTISCAN_FILES_SUPPORTED #ifdef D_MULTISCAN_FILES_SUPPORTED
/* If jpeg_start_decompress will read the whole file, initialize /* If jpeg_start_decompress will read the whole file, initialize


@@ -43,20 +43,20 @@ h2v1_merged_upsample_565_internal(j_decompress_ptr cinfo, JSAMPIMAGE input_buf,
/* Loop for each pair of output pixels */ /* Loop for each pair of output pixels */
for (col = cinfo->output_width >> 1; col > 0; col--) { for (col = cinfo->output_width >> 1; col > 0; col--) {
/* Do the chroma part of the calculation */ /* Do the chroma part of the calculation */
cb = GETJSAMPLE(*inptr1++); cb = *inptr1++;
cr = GETJSAMPLE(*inptr2++); cr = *inptr2++;
cred = Crrtab[cr]; cred = Crrtab[cr];
cgreen = (int)RIGHT_SHIFT(Cbgtab[cb] + Crgtab[cr], SCALEBITS); cgreen = (int)RIGHT_SHIFT(Cbgtab[cb] + Crgtab[cr], SCALEBITS);
cblue = Cbbtab[cb]; cblue = Cbbtab[cb];
/* Fetch 2 Y values and emit 2 pixels */ /* Fetch 2 Y values and emit 2 pixels */
y = GETJSAMPLE(*inptr0++); y = *inptr0++;
r = range_limit[y + cred]; r = range_limit[y + cred];
g = range_limit[y + cgreen]; g = range_limit[y + cgreen];
b = range_limit[y + cblue]; b = range_limit[y + cblue];
rgb = PACK_SHORT_565(r, g, b); rgb = PACK_SHORT_565(r, g, b);
y = GETJSAMPLE(*inptr0++); y = *inptr0++;
r = range_limit[y + cred]; r = range_limit[y + cred];
g = range_limit[y + cgreen]; g = range_limit[y + cgreen];
b = range_limit[y + cblue]; b = range_limit[y + cblue];
@@ -68,12 +68,12 @@ h2v1_merged_upsample_565_internal(j_decompress_ptr cinfo, JSAMPIMAGE input_buf,
/* If image width is odd, do the last output column separately */ /* If image width is odd, do the last output column separately */
if (cinfo->output_width & 1) { if (cinfo->output_width & 1) {
cb = GETJSAMPLE(*inptr1); cb = *inptr1;
cr = GETJSAMPLE(*inptr2); cr = *inptr2;
cred = Crrtab[cr]; cred = Crrtab[cr];
cgreen = (int)RIGHT_SHIFT(Cbgtab[cb] + Crgtab[cr], SCALEBITS); cgreen = (int)RIGHT_SHIFT(Cbgtab[cb] + Crgtab[cr], SCALEBITS);
cblue = Cbbtab[cb]; cblue = Cbbtab[cb];
y = GETJSAMPLE(*inptr0); y = *inptr0;
r = range_limit[y + cred]; r = range_limit[y + cred];
g = range_limit[y + cgreen]; g = range_limit[y + cgreen];
b = range_limit[y + cblue]; b = range_limit[y + cblue];
@@ -115,21 +115,21 @@ h2v1_merged_upsample_565D_internal(j_decompress_ptr cinfo,
/* Loop for each pair of output pixels */ /* Loop for each pair of output pixels */
for (col = cinfo->output_width >> 1; col > 0; col--) { for (col = cinfo->output_width >> 1; col > 0; col--) {
/* Do the chroma part of the calculation */ /* Do the chroma part of the calculation */
cb = GETJSAMPLE(*inptr1++); cb = *inptr1++;
cr = GETJSAMPLE(*inptr2++); cr = *inptr2++;
cred = Crrtab[cr]; cred = Crrtab[cr];
cgreen = (int)RIGHT_SHIFT(Cbgtab[cb] + Crgtab[cr], SCALEBITS); cgreen = (int)RIGHT_SHIFT(Cbgtab[cb] + Crgtab[cr], SCALEBITS);
cblue = Cbbtab[cb]; cblue = Cbbtab[cb];
/* Fetch 2 Y values and emit 2 pixels */ /* Fetch 2 Y values and emit 2 pixels */
y = GETJSAMPLE(*inptr0++); y = *inptr0++;
r = range_limit[DITHER_565_R(y + cred, d0)]; r = range_limit[DITHER_565_R(y + cred, d0)];
g = range_limit[DITHER_565_G(y + cgreen, d0)]; g = range_limit[DITHER_565_G(y + cgreen, d0)];
b = range_limit[DITHER_565_B(y + cblue, d0)]; b = range_limit[DITHER_565_B(y + cblue, d0)];
d0 = DITHER_ROTATE(d0); d0 = DITHER_ROTATE(d0);
rgb = PACK_SHORT_565(r, g, b); rgb = PACK_SHORT_565(r, g, b);
y = GETJSAMPLE(*inptr0++); y = *inptr0++;
r = range_limit[DITHER_565_R(y + cred, d0)]; r = range_limit[DITHER_565_R(y + cred, d0)];
g = range_limit[DITHER_565_G(y + cgreen, d0)]; g = range_limit[DITHER_565_G(y + cgreen, d0)];
b = range_limit[DITHER_565_B(y + cblue, d0)]; b = range_limit[DITHER_565_B(y + cblue, d0)];
@@ -142,12 +142,12 @@ h2v1_merged_upsample_565D_internal(j_decompress_ptr cinfo,
/* If image width is odd, do the last output column separately */ /* If image width is odd, do the last output column separately */
if (cinfo->output_width & 1) { if (cinfo->output_width & 1) {
cb = GETJSAMPLE(*inptr1); cb = *inptr1;
cr = GETJSAMPLE(*inptr2); cr = *inptr2;
cred = Crrtab[cr]; cred = Crrtab[cr];
cgreen = (int)RIGHT_SHIFT(Cbgtab[cb] + Crgtab[cr], SCALEBITS); cgreen = (int)RIGHT_SHIFT(Cbgtab[cb] + Crgtab[cr], SCALEBITS);
cblue = Cbbtab[cb]; cblue = Cbbtab[cb];
y = GETJSAMPLE(*inptr0); y = *inptr0;
r = range_limit[DITHER_565_R(y + cred, d0)]; r = range_limit[DITHER_565_R(y + cred, d0)];
g = range_limit[DITHER_565_G(y + cgreen, d0)]; g = range_limit[DITHER_565_G(y + cgreen, d0)];
b = range_limit[DITHER_565_B(y + cblue, d0)]; b = range_limit[DITHER_565_B(y + cblue, d0)];
@@ -189,20 +189,20 @@ h2v2_merged_upsample_565_internal(j_decompress_ptr cinfo, JSAMPIMAGE input_buf,
/* Loop for each group of output pixels */ /* Loop for each group of output pixels */
for (col = cinfo->output_width >> 1; col > 0; col--) { for (col = cinfo->output_width >> 1; col > 0; col--) {
/* Do the chroma part of the calculation */ /* Do the chroma part of the calculation */
cb = GETJSAMPLE(*inptr1++); cb = *inptr1++;
cr = GETJSAMPLE(*inptr2++); cr = *inptr2++;
cred = Crrtab[cr]; cred = Crrtab[cr];
cgreen = (int)RIGHT_SHIFT(Cbgtab[cb] + Crgtab[cr], SCALEBITS); cgreen = (int)RIGHT_SHIFT(Cbgtab[cb] + Crgtab[cr], SCALEBITS);
cblue = Cbbtab[cb]; cblue = Cbbtab[cb];
/* Fetch 4 Y values and emit 4 pixels */ /* Fetch 4 Y values and emit 4 pixels */
y = GETJSAMPLE(*inptr00++); y = *inptr00++;
r = range_limit[y + cred]; r = range_limit[y + cred];
g = range_limit[y + cgreen]; g = range_limit[y + cgreen];
b = range_limit[y + cblue]; b = range_limit[y + cblue];
rgb = PACK_SHORT_565(r, g, b); rgb = PACK_SHORT_565(r, g, b);
y = GETJSAMPLE(*inptr00++); y = *inptr00++;
r = range_limit[y + cred]; r = range_limit[y + cred];
g = range_limit[y + cgreen]; g = range_limit[y + cgreen];
b = range_limit[y + cblue]; b = range_limit[y + cblue];
@@ -211,13 +211,13 @@ h2v2_merged_upsample_565_internal(j_decompress_ptr cinfo, JSAMPIMAGE input_buf,
WRITE_TWO_PIXELS(outptr0, rgb); WRITE_TWO_PIXELS(outptr0, rgb);
outptr0 += 4; outptr0 += 4;
y = GETJSAMPLE(*inptr01++); y = *inptr01++;
r = range_limit[y + cred]; r = range_limit[y + cred];
g = range_limit[y + cgreen]; g = range_limit[y + cgreen];
b = range_limit[y + cblue]; b = range_limit[y + cblue];
rgb = PACK_SHORT_565(r, g, b); rgb = PACK_SHORT_565(r, g, b);
y = GETJSAMPLE(*inptr01++); y = *inptr01++;
r = range_limit[y + cred]; r = range_limit[y + cred];
g = range_limit[y + cgreen]; g = range_limit[y + cgreen];
b = range_limit[y + cblue]; b = range_limit[y + cblue];
@@ -229,20 +229,20 @@ h2v2_merged_upsample_565_internal(j_decompress_ptr cinfo, JSAMPIMAGE input_buf,
/* If image width is odd, do the last output column separately */ /* If image width is odd, do the last output column separately */
if (cinfo->output_width & 1) { if (cinfo->output_width & 1) {
cb = GETJSAMPLE(*inptr1); cb = *inptr1;
cr = GETJSAMPLE(*inptr2); cr = *inptr2;
cred = Crrtab[cr]; cred = Crrtab[cr];
cgreen = (int)RIGHT_SHIFT(Cbgtab[cb] + Crgtab[cr], SCALEBITS); cgreen = (int)RIGHT_SHIFT(Cbgtab[cb] + Crgtab[cr], SCALEBITS);
cblue = Cbbtab[cb]; cblue = Cbbtab[cb];
y = GETJSAMPLE(*inptr00); y = *inptr00;
r = range_limit[y + cred]; r = range_limit[y + cred];
g = range_limit[y + cgreen]; g = range_limit[y + cgreen];
b = range_limit[y + cblue]; b = range_limit[y + cblue];
rgb = PACK_SHORT_565(r, g, b); rgb = PACK_SHORT_565(r, g, b);
*(INT16 *)outptr0 = (INT16)rgb; *(INT16 *)outptr0 = (INT16)rgb;
y = GETJSAMPLE(*inptr01); y = *inptr01;
r = range_limit[y + cred]; r = range_limit[y + cred];
g = range_limit[y + cgreen]; g = range_limit[y + cgreen];
b = range_limit[y + cblue]; b = range_limit[y + cblue];
@@ -287,21 +287,21 @@ h2v2_merged_upsample_565D_internal(j_decompress_ptr cinfo,
/* Loop for each group of output pixels */ /* Loop for each group of output pixels */
for (col = cinfo->output_width >> 1; col > 0; col--) { for (col = cinfo->output_width >> 1; col > 0; col--) {
/* Do the chroma part of the calculation */ /* Do the chroma part of the calculation */
cb = GETJSAMPLE(*inptr1++); cb = *inptr1++;
cr = GETJSAMPLE(*inptr2++); cr = *inptr2++;
cred = Crrtab[cr]; cred = Crrtab[cr];
cgreen = (int)RIGHT_SHIFT(Cbgtab[cb] + Crgtab[cr], SCALEBITS); cgreen = (int)RIGHT_SHIFT(Cbgtab[cb] + Crgtab[cr], SCALEBITS);
cblue = Cbbtab[cb]; cblue = Cbbtab[cb];
/* Fetch 4 Y values and emit 4 pixels */ /* Fetch 4 Y values and emit 4 pixels */
y = GETJSAMPLE(*inptr00++); y = *inptr00++;
r = range_limit[DITHER_565_R(y + cred, d0)]; r = range_limit[DITHER_565_R(y + cred, d0)];
g = range_limit[DITHER_565_G(y + cgreen, d0)]; g = range_limit[DITHER_565_G(y + cgreen, d0)];
b = range_limit[DITHER_565_B(y + cblue, d0)]; b = range_limit[DITHER_565_B(y + cblue, d0)];
d0 = DITHER_ROTATE(d0); d0 = DITHER_ROTATE(d0);
rgb = PACK_SHORT_565(r, g, b); rgb = PACK_SHORT_565(r, g, b);
y = GETJSAMPLE(*inptr00++); y = *inptr00++;
r = range_limit[DITHER_565_R(y + cred, d0)]; r = range_limit[DITHER_565_R(y + cred, d0)];
g = range_limit[DITHER_565_G(y + cgreen, d0)]; g = range_limit[DITHER_565_G(y + cgreen, d0)];
b = range_limit[DITHER_565_B(y + cblue, d0)]; b = range_limit[DITHER_565_B(y + cblue, d0)];
@@ -311,14 +311,14 @@ h2v2_merged_upsample_565D_internal(j_decompress_ptr cinfo,
WRITE_TWO_PIXELS(outptr0, rgb); WRITE_TWO_PIXELS(outptr0, rgb);
outptr0 += 4; outptr0 += 4;
y = GETJSAMPLE(*inptr01++); y = *inptr01++;
r = range_limit[DITHER_565_R(y + cred, d1)]; r = range_limit[DITHER_565_R(y + cred, d1)];
g = range_limit[DITHER_565_G(y + cgreen, d1)]; g = range_limit[DITHER_565_G(y + cgreen, d1)];
b = range_limit[DITHER_565_B(y + cblue, d1)]; b = range_limit[DITHER_565_B(y + cblue, d1)];
d1 = DITHER_ROTATE(d1); d1 = DITHER_ROTATE(d1);
rgb = PACK_SHORT_565(r, g, b); rgb = PACK_SHORT_565(r, g, b);
y = GETJSAMPLE(*inptr01++); y = *inptr01++;
r = range_limit[DITHER_565_R(y + cred, d1)]; r = range_limit[DITHER_565_R(y + cred, d1)];
g = range_limit[DITHER_565_G(y + cgreen, d1)]; g = range_limit[DITHER_565_G(y + cgreen, d1)];
b = range_limit[DITHER_565_B(y + cblue, d1)]; b = range_limit[DITHER_565_B(y + cblue, d1)];
@@ -331,20 +331,20 @@ h2v2_merged_upsample_565D_internal(j_decompress_ptr cinfo,
/* If image width is odd, do the last output column separately */ /* If image width is odd, do the last output column separately */
if (cinfo->output_width & 1) { if (cinfo->output_width & 1) {
cb = GETJSAMPLE(*inptr1); cb = *inptr1;
cr = GETJSAMPLE(*inptr2); cr = *inptr2;
cred = Crrtab[cr]; cred = Crrtab[cr];
cgreen = (int)RIGHT_SHIFT(Cbgtab[cb] + Crgtab[cr], SCALEBITS); cgreen = (int)RIGHT_SHIFT(Cbgtab[cb] + Crgtab[cr], SCALEBITS);
cblue = Cbbtab[cb]; cblue = Cbbtab[cb];
y = GETJSAMPLE(*inptr00); y = *inptr00;
r = range_limit[DITHER_565_R(y + cred, d0)]; r = range_limit[DITHER_565_R(y + cred, d0)];
g = range_limit[DITHER_565_G(y + cgreen, d0)]; g = range_limit[DITHER_565_G(y + cgreen, d0)];
b = range_limit[DITHER_565_B(y + cblue, d0)]; b = range_limit[DITHER_565_B(y + cblue, d0)];
rgb = PACK_SHORT_565(r, g, b); rgb = PACK_SHORT_565(r, g, b);
*(INT16 *)outptr0 = (INT16)rgb; *(INT16 *)outptr0 = (INT16)rgb;
y = GETJSAMPLE(*inptr01); y = *inptr01;
r = range_limit[DITHER_565_R(y + cred, d1)]; r = range_limit[DITHER_565_R(y + cred, d1)];
g = range_limit[DITHER_565_G(y + cgreen, d1)]; g = range_limit[DITHER_565_G(y + cgreen, d1)];
b = range_limit[DITHER_565_B(y + cblue, d1)]; b = range_limit[DITHER_565_B(y + cblue, d1)];
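The DITHER_565_R/G/B steps above add a position-dependent offset before each channel is truncated to 5 or 6 bits, so the truncation error is spread across neighbouring pixels. A generic ordered-dither sketch of that idea — the 2x2 matrix and the to5bits() helper are mine; the library uses its own 4x4 tables and rotates them per row via DITHER_ROTATE.

```c
#include <stdio.h>

/* Ordered dithering: add a small Bayer-matrix offset before truncating an
 * 8-bit channel to 5 bits, so a flat area becomes a fine pattern of the two
 * nearest 5-bit codes instead of a single banded value. */
static const int dither2x2[2][2] = { { 0, 4 }, { 6, 2 } };  /* 2x2 Bayer * 2 */

static int clamp255(int v) { return v < 0 ? 0 : (v > 255 ? 255 : v); }

static unsigned char to5bits(int value, int x, int y)
{
  return (unsigned char)(clamp255(value + dither2x2[y & 1][x & 1]) >> 3);
}

int main(void)
{
  /* A flat gray of 133 maps to a checkerboard of the 5-bit codes 16 and 17 */
  for (int y = 0; y < 2; y++) {
    for (int x = 0; x < 4; x++)
      printf("%d ", to5bits(133, x, y));
    printf("\n");
  }
  return 0;
}
```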


@@ -46,13 +46,13 @@ h2v1_merged_upsample_internal(j_decompress_ptr cinfo, JSAMPIMAGE input_buf,
/* Loop for each pair of output pixels */ /* Loop for each pair of output pixels */
for (col = cinfo->output_width >> 1; col > 0; col--) { for (col = cinfo->output_width >> 1; col > 0; col--) {
/* Do the chroma part of the calculation */ /* Do the chroma part of the calculation */
cb = GETJSAMPLE(*inptr1++); cb = *inptr1++;
cr = GETJSAMPLE(*inptr2++); cr = *inptr2++;
cred = Crrtab[cr]; cred = Crrtab[cr];
cgreen = (int)RIGHT_SHIFT(Cbgtab[cb] + Crgtab[cr], SCALEBITS); cgreen = (int)RIGHT_SHIFT(Cbgtab[cb] + Crgtab[cr], SCALEBITS);
cblue = Cbbtab[cb]; cblue = Cbbtab[cb];
/* Fetch 2 Y values and emit 2 pixels */ /* Fetch 2 Y values and emit 2 pixels */
y = GETJSAMPLE(*inptr0++); y = *inptr0++;
outptr[RGB_RED] = range_limit[y + cred]; outptr[RGB_RED] = range_limit[y + cred];
outptr[RGB_GREEN] = range_limit[y + cgreen]; outptr[RGB_GREEN] = range_limit[y + cgreen];
outptr[RGB_BLUE] = range_limit[y + cblue]; outptr[RGB_BLUE] = range_limit[y + cblue];
@@ -60,7 +60,7 @@ h2v1_merged_upsample_internal(j_decompress_ptr cinfo, JSAMPIMAGE input_buf,
outptr[RGB_ALPHA] = 0xFF; outptr[RGB_ALPHA] = 0xFF;
#endif #endif
outptr += RGB_PIXELSIZE; outptr += RGB_PIXELSIZE;
y = GETJSAMPLE(*inptr0++); y = *inptr0++;
outptr[RGB_RED] = range_limit[y + cred]; outptr[RGB_RED] = range_limit[y + cred];
outptr[RGB_GREEN] = range_limit[y + cgreen]; outptr[RGB_GREEN] = range_limit[y + cgreen];
outptr[RGB_BLUE] = range_limit[y + cblue]; outptr[RGB_BLUE] = range_limit[y + cblue];
@@ -71,12 +71,12 @@ h2v1_merged_upsample_internal(j_decompress_ptr cinfo, JSAMPIMAGE input_buf,
} }
/* If image width is odd, do the last output column separately */ /* If image width is odd, do the last output column separately */
if (cinfo->output_width & 1) { if (cinfo->output_width & 1) {
cb = GETJSAMPLE(*inptr1); cb = *inptr1;
cr = GETJSAMPLE(*inptr2); cr = *inptr2;
cred = Crrtab[cr]; cred = Crrtab[cr];
cgreen = (int)RIGHT_SHIFT(Cbgtab[cb] + Crgtab[cr], SCALEBITS); cgreen = (int)RIGHT_SHIFT(Cbgtab[cb] + Crgtab[cr], SCALEBITS);
cblue = Cbbtab[cb]; cblue = Cbbtab[cb];
y = GETJSAMPLE(*inptr0); y = *inptr0;
outptr[RGB_RED] = range_limit[y + cred]; outptr[RGB_RED] = range_limit[y + cred];
outptr[RGB_GREEN] = range_limit[y + cgreen]; outptr[RGB_GREEN] = range_limit[y + cgreen];
outptr[RGB_BLUE] = range_limit[y + cblue]; outptr[RGB_BLUE] = range_limit[y + cblue];
@@ -120,13 +120,13 @@ h2v2_merged_upsample_internal(j_decompress_ptr cinfo, JSAMPIMAGE input_buf,
/* Loop for each group of output pixels */ /* Loop for each group of output pixels */
for (col = cinfo->output_width >> 1; col > 0; col--) { for (col = cinfo->output_width >> 1; col > 0; col--) {
/* Do the chroma part of the calculation */ /* Do the chroma part of the calculation */
cb = GETJSAMPLE(*inptr1++); cb = *inptr1++;
cr = GETJSAMPLE(*inptr2++); cr = *inptr2++;
cred = Crrtab[cr]; cred = Crrtab[cr];
cgreen = (int)RIGHT_SHIFT(Cbgtab[cb] + Crgtab[cr], SCALEBITS); cgreen = (int)RIGHT_SHIFT(Cbgtab[cb] + Crgtab[cr], SCALEBITS);
cblue = Cbbtab[cb]; cblue = Cbbtab[cb];
/* Fetch 4 Y values and emit 4 pixels */ /* Fetch 4 Y values and emit 4 pixels */
y = GETJSAMPLE(*inptr00++); y = *inptr00++;
outptr0[RGB_RED] = range_limit[y + cred]; outptr0[RGB_RED] = range_limit[y + cred];
outptr0[RGB_GREEN] = range_limit[y + cgreen]; outptr0[RGB_GREEN] = range_limit[y + cgreen];
outptr0[RGB_BLUE] = range_limit[y + cblue]; outptr0[RGB_BLUE] = range_limit[y + cblue];
@@ -134,7 +134,7 @@ h2v2_merged_upsample_internal(j_decompress_ptr cinfo, JSAMPIMAGE input_buf,
outptr0[RGB_ALPHA] = 0xFF; outptr0[RGB_ALPHA] = 0xFF;
#endif #endif
outptr0 += RGB_PIXELSIZE; outptr0 += RGB_PIXELSIZE;
y = GETJSAMPLE(*inptr00++); y = *inptr00++;
outptr0[RGB_RED] = range_limit[y + cred]; outptr0[RGB_RED] = range_limit[y + cred];
outptr0[RGB_GREEN] = range_limit[y + cgreen]; outptr0[RGB_GREEN] = range_limit[y + cgreen];
outptr0[RGB_BLUE] = range_limit[y + cblue]; outptr0[RGB_BLUE] = range_limit[y + cblue];
@@ -142,7 +142,7 @@ h2v2_merged_upsample_internal(j_decompress_ptr cinfo, JSAMPIMAGE input_buf,
outptr0[RGB_ALPHA] = 0xFF; outptr0[RGB_ALPHA] = 0xFF;
#endif #endif
outptr0 += RGB_PIXELSIZE; outptr0 += RGB_PIXELSIZE;
y = GETJSAMPLE(*inptr01++); y = *inptr01++;
outptr1[RGB_RED] = range_limit[y + cred]; outptr1[RGB_RED] = range_limit[y + cred];
outptr1[RGB_GREEN] = range_limit[y + cgreen]; outptr1[RGB_GREEN] = range_limit[y + cgreen];
outptr1[RGB_BLUE] = range_limit[y + cblue]; outptr1[RGB_BLUE] = range_limit[y + cblue];
@@ -150,7 +150,7 @@ h2v2_merged_upsample_internal(j_decompress_ptr cinfo, JSAMPIMAGE input_buf,
outptr1[RGB_ALPHA] = 0xFF; outptr1[RGB_ALPHA] = 0xFF;
#endif #endif
outptr1 += RGB_PIXELSIZE; outptr1 += RGB_PIXELSIZE;
y = GETJSAMPLE(*inptr01++); y = *inptr01++;
outptr1[RGB_RED] = range_limit[y + cred]; outptr1[RGB_RED] = range_limit[y + cred];
outptr1[RGB_GREEN] = range_limit[y + cgreen]; outptr1[RGB_GREEN] = range_limit[y + cgreen];
outptr1[RGB_BLUE] = range_limit[y + cblue]; outptr1[RGB_BLUE] = range_limit[y + cblue];
@@ -161,19 +161,19 @@ h2v2_merged_upsample_internal(j_decompress_ptr cinfo, JSAMPIMAGE input_buf,
} }
/* If image width is odd, do the last output column separately */ /* If image width is odd, do the last output column separately */
if (cinfo->output_width & 1) { if (cinfo->output_width & 1) {
cb = GETJSAMPLE(*inptr1); cb = *inptr1;
cr = GETJSAMPLE(*inptr2); cr = *inptr2;
cred = Crrtab[cr]; cred = Crrtab[cr];
cgreen = (int)RIGHT_SHIFT(Cbgtab[cb] + Crgtab[cr], SCALEBITS); cgreen = (int)RIGHT_SHIFT(Cbgtab[cb] + Crgtab[cr], SCALEBITS);
cblue = Cbbtab[cb]; cblue = Cbbtab[cb];
y = GETJSAMPLE(*inptr00); y = *inptr00;
outptr0[RGB_RED] = range_limit[y + cred]; outptr0[RGB_RED] = range_limit[y + cred];
outptr0[RGB_GREEN] = range_limit[y + cgreen]; outptr0[RGB_GREEN] = range_limit[y + cgreen];
outptr0[RGB_BLUE] = range_limit[y + cblue]; outptr0[RGB_BLUE] = range_limit[y + cblue];
#ifdef RGB_ALPHA #ifdef RGB_ALPHA
outptr0[RGB_ALPHA] = 0xFF; outptr0[RGB_ALPHA] = 0xFF;
#endif #endif
y = GETJSAMPLE(*inptr01); y = *inptr01;
outptr1[RGB_RED] = range_limit[y + cred]; outptr1[RGB_RED] = range_limit[y + cred];
outptr1[RGB_GREEN] = range_limit[y + cgreen]; outptr1[RGB_GREEN] = range_limit[y + cgreen];
outptr1[RGB_BLUE] = range_limit[y + cblue]; outptr1[RGB_BLUE] = range_limit[y + cblue];
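Merged upsampling, as in h2v1_merged_upsample_internal() above, computes the chroma half of the YCbCr-to-RGB math once per Cb/Cr pair and reuses it for the two (or, in the h2v2 case, four) Y samples that pair covers. A floating-point sketch of the h2v1 case; the library uses fixed-point tables plus range_limit[] instead of the clamp below.

```c
#include <stdio.h>

/* Merged upsampling sketch (h2v1): for 4:2:2 data, one Cb/Cr pair covers
 * two Y samples, so cred/cgreen/cblue are computed once per pair. */
static int clamp(double v) { return v < 0 ? 0 : (v > 255 ? 255 : (int)(v + 0.5)); }

static void h2v1_pair(const int y[2], int cb, int cr, int rgb[2][3])
{
  double cred   =  1.40200 * (cr - 128);                        /* once per pair */
  double cgreen = -0.34414 * (cb - 128) - 0.71414 * (cr - 128);
  double cblue  =  1.77200 * (cb - 128);

  for (int i = 0; i < 2; i++) {                                 /* per pixel */
    rgb[i][0] = clamp(y[i] + cred);
    rgb[i][1] = clamp(y[i] + cgreen);
    rgb[i][2] = clamp(y[i] + cblue);
  }
}

int main(void)
{
  int y[2] = { 90, 200 }, rgb[2][3];
  h2v1_pair(y, 100, 180, rgb);
  printf("%d,%d,%d  %d,%d,%d\n", rgb[0][0], rgb[0][1], rgb[0][2],
         rgb[1][0], rgb[1][1], rgb[1][2]);
  return 0;
}
```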


@@ -4,7 +4,7 @@
* This file was part of the Independent JPEG Group's software: * This file was part of the Independent JPEG Group's software:
* Copyright (C) 1995-1997, Thomas G. Lane. * Copyright (C) 1995-1997, Thomas G. Lane.
* libjpeg-turbo Modifications: * libjpeg-turbo Modifications:
* Copyright (C) 2015-2016, 2018, D. R. Commander. * Copyright (C) 2015-2016, 2018-2020, D. R. Commander.
* For conditions of distribution and use, see the accompanying README.ijg * For conditions of distribution and use, see the accompanying README.ijg
* file. * file.
* *
@@ -41,25 +41,6 @@ typedef struct {
int last_dc_val[MAX_COMPS_IN_SCAN]; /* last DC coef for each component */ int last_dc_val[MAX_COMPS_IN_SCAN]; /* last DC coef for each component */
} savable_state; } savable_state;
/* This macro is to work around compilers with missing or broken
* structure assignment. You'll need to fix this code if you have
* such a compiler and you change MAX_COMPS_IN_SCAN.
*/
#ifndef NO_STRUCT_ASSIGN
#define ASSIGN_STATE(dest, src) ((dest) = (src))
#else
#if MAX_COMPS_IN_SCAN == 4
#define ASSIGN_STATE(dest, src) \
((dest).EOBRUN = (src).EOBRUN, \
(dest).last_dc_val[0] = (src).last_dc_val[0], \
(dest).last_dc_val[1] = (src).last_dc_val[1], \
(dest).last_dc_val[2] = (src).last_dc_val[2], \
(dest).last_dc_val[3] = (src).last_dc_val[3])
#endif
#endif
typedef struct { typedef struct {
struct jpeg_entropy_decoder pub; /* public fields */ struct jpeg_entropy_decoder pub; /* public fields */
@@ -102,7 +83,7 @@ start_pass_phuff_decoder(j_decompress_ptr cinfo)
boolean is_DC_band, bad; boolean is_DC_band, bad;
int ci, coefi, tbl; int ci, coefi, tbl;
d_derived_tbl **pdtbl; d_derived_tbl **pdtbl;
int *coef_bit_ptr; int *coef_bit_ptr, *prev_coef_bit_ptr;
jpeg_component_info *compptr; jpeg_component_info *compptr;
is_DC_band = (cinfo->Ss == 0); is_DC_band = (cinfo->Ss == 0);
@@ -143,8 +124,15 @@ start_pass_phuff_decoder(j_decompress_ptr cinfo)
for (ci = 0; ci < cinfo->comps_in_scan; ci++) { for (ci = 0; ci < cinfo->comps_in_scan; ci++) {
int cindex = cinfo->cur_comp_info[ci]->component_index; int cindex = cinfo->cur_comp_info[ci]->component_index;
coef_bit_ptr = &cinfo->coef_bits[cindex][0]; coef_bit_ptr = &cinfo->coef_bits[cindex][0];
prev_coef_bit_ptr = &cinfo->coef_bits[cindex + cinfo->num_components][0];
if (!is_DC_band && coef_bit_ptr[0] < 0) /* AC without prior DC scan */ if (!is_DC_band && coef_bit_ptr[0] < 0) /* AC without prior DC scan */
WARNMS2(cinfo, JWRN_BOGUS_PROGRESSION, cindex, 0); WARNMS2(cinfo, JWRN_BOGUS_PROGRESSION, cindex, 0);
for (coefi = MIN(cinfo->Ss, 1); coefi <= MAX(cinfo->Se, 9); coefi++) {
if (cinfo->input_scan_number > 1)
prev_coef_bit_ptr[coefi] = coef_bit_ptr[coefi];
else
prev_coef_bit_ptr[coefi] = 0;
}
for (coefi = cinfo->Ss; coefi <= cinfo->Se; coefi++) { for (coefi = cinfo->Ss; coefi <= cinfo->Se; coefi++) {
int expected = (coef_bit_ptr[coefi] < 0) ? 0 : coef_bit_ptr[coefi]; int expected = (coef_bit_ptr[coefi] < 0) ? 0 : coef_bit_ptr[coefi];
if (cinfo->Ah != expected) if (cinfo->Ah != expected)
@@ -323,7 +311,7 @@ decode_mcu_DC_first(j_decompress_ptr cinfo, JBLOCKROW *MCU_data)
/* Load up working state */ /* Load up working state */
BITREAD_LOAD_STATE(cinfo, entropy->bitstate); BITREAD_LOAD_STATE(cinfo, entropy->bitstate);
ASSIGN_STATE(state, entropy->saved); state = entropy->saved;
/* Outer loop handles each block in the MCU */ /* Outer loop handles each block in the MCU */
@@ -356,7 +344,7 @@ decode_mcu_DC_first(j_decompress_ptr cinfo, JBLOCKROW *MCU_data)
/* Completed MCU, so update state */ /* Completed MCU, so update state */
BITREAD_SAVE_STATE(cinfo, entropy->bitstate); BITREAD_SAVE_STATE(cinfo, entropy->bitstate);
ASSIGN_STATE(entropy->saved, state); entropy->saved = state;
} }
/* Account for restart interval (no-op if not using restarts) */ /* Account for restart interval (no-op if not using restarts) */
@@ -676,7 +664,7 @@ jinit_phuff_decoder(j_decompress_ptr cinfo)
/* Create progression status table */ /* Create progression status table */
cinfo->coef_bits = (int (*)[DCTSIZE2]) cinfo->coef_bits = (int (*)[DCTSIZE2])
(*cinfo->mem->alloc_small) ((j_common_ptr)cinfo, JPOOL_IMAGE, (*cinfo->mem->alloc_small) ((j_common_ptr)cinfo, JPOOL_IMAGE,
cinfo->num_components * DCTSIZE2 * cinfo->num_components * 2 * DCTSIZE2 *
sizeof(int)); sizeof(int));
coef_bit_ptr = &cinfo->coef_bits[0][0]; coef_bit_ptr = &cinfo->coef_bits[0][0];
for (ci = 0; ci < cinfo->num_components; ci++) for (ci = 0; ci < cinfo->num_components; ci++)
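The allocation change above doubles coef_bits: as I read the hunks, rows [0, num_components) hold the current scan's Al values while the new rows [num_components, 2*num_components) keep the previous scan's values, consulted via prev_coef_bit_ptr. A sketch of that two-halves indexing only, not the real decoder state:

```c
#include <stdlib.h>
#include <stdio.h>

/* One allocation, two tables of DCTSIZE2 ints per component: the second
 * half is addressed by offsetting the component index by num_components. */
#define DCTSIZE2 64

int main(void)
{
  int num_components = 3;
  int (*coef_bits)[DCTSIZE2] =
      malloc(sizeof(int) * num_components * 2 * DCTSIZE2);
  if (!coef_bits)
    return 1;

  int ci = 1, coefi = 5;
  coef_bits[ci][coefi] = 2;                        /* current scan */
  coef_bits[ci + num_components][coefi] = 3;       /* previous scan */
  printf("cur=%d prev=%d\n", coef_bits[ci][coefi],
         coef_bits[ci + num_components][coefi]);
  free(coef_bits);
  return 0;
}
```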


@@ -8,7 +8,7 @@
* Copyright (C) 2010, 2015-2016, D. R. Commander. * Copyright (C) 2010, 2015-2016, D. R. Commander.
* Copyright (C) 2014, MIPS Technologies, Inc., California. * Copyright (C) 2014, MIPS Technologies, Inc., California.
* Copyright (C) 2015, Google, Inc. * Copyright (C) 2015, Google, Inc.
* Copyright (C) 2019, Arm Limited. * Copyright (C) 2019-2020, Arm Limited.
* For conditions of distribution and use, see the accompanying README.ijg * For conditions of distribution and use, see the accompanying README.ijg
* file. * file.
* *
@@ -177,7 +177,7 @@ int_upsample(j_decompress_ptr cinfo, jpeg_component_info *compptr,
outptr = output_data[outrow]; outptr = output_data[outrow];
outend = outptr + cinfo->output_width; outend = outptr + cinfo->output_width;
while (outptr < outend) { while (outptr < outend) {
invalue = *inptr++; /* don't need GETJSAMPLE() here */ invalue = *inptr++;
for (h = h_expand; h > 0; h--) { for (h = h_expand; h > 0; h--) {
*outptr++ = invalue; *outptr++ = invalue;
} }
@@ -213,7 +213,7 @@ h2v1_upsample(j_decompress_ptr cinfo, jpeg_component_info *compptr,
outptr = output_data[inrow]; outptr = output_data[inrow];
outend = outptr + cinfo->output_width; outend = outptr + cinfo->output_width;
while (outptr < outend) { while (outptr < outend) {
invalue = *inptr++; /* don't need GETJSAMPLE() here */ invalue = *inptr++;
*outptr++ = invalue; *outptr++ = invalue;
*outptr++ = invalue; *outptr++ = invalue;
} }
@@ -242,7 +242,7 @@ h2v2_upsample(j_decompress_ptr cinfo, jpeg_component_info *compptr,
outptr = output_data[outrow]; outptr = output_data[outrow];
outend = outptr + cinfo->output_width; outend = outptr + cinfo->output_width;
while (outptr < outend) { while (outptr < outend) {
invalue = *inptr++; /* don't need GETJSAMPLE() here */ invalue = *inptr++;
*outptr++ = invalue; *outptr++ = invalue;
*outptr++ = invalue; *outptr++ = invalue;
} }
@@ -283,20 +283,20 @@ h2v1_fancy_upsample(j_decompress_ptr cinfo, jpeg_component_info *compptr,
inptr = input_data[inrow]; inptr = input_data[inrow];
outptr = output_data[inrow]; outptr = output_data[inrow];
/* Special case for first column */ /* Special case for first column */
invalue = GETJSAMPLE(*inptr++); invalue = *inptr++;
*outptr++ = (JSAMPLE)invalue; *outptr++ = (JSAMPLE)invalue;
*outptr++ = (JSAMPLE)((invalue * 3 + GETJSAMPLE(*inptr) + 2) >> 2); *outptr++ = (JSAMPLE)((invalue * 3 + inptr[0] + 2) >> 2);
for (colctr = compptr->downsampled_width - 2; colctr > 0; colctr--) { for (colctr = compptr->downsampled_width - 2; colctr > 0; colctr--) {
/* General case: 3/4 * nearer pixel + 1/4 * further pixel */ /* General case: 3/4 * nearer pixel + 1/4 * further pixel */
invalue = GETJSAMPLE(*inptr++) * 3; invalue = (*inptr++) * 3;
*outptr++ = (JSAMPLE)((invalue + GETJSAMPLE(inptr[-2]) + 1) >> 2); *outptr++ = (JSAMPLE)((invalue + inptr[-2] + 1) >> 2);
*outptr++ = (JSAMPLE)((invalue + GETJSAMPLE(*inptr) + 2) >> 2); *outptr++ = (JSAMPLE)((invalue + inptr[0] + 2) >> 2);
} }
/* Special case for last column */ /* Special case for last column */
invalue = GETJSAMPLE(*inptr); invalue = *inptr;
*outptr++ = (JSAMPLE)((invalue * 3 + GETJSAMPLE(inptr[-1]) + 1) >> 2); *outptr++ = (JSAMPLE)((invalue * 3 + inptr[-1] + 1) >> 2);
*outptr++ = (JSAMPLE)invalue; *outptr++ = (JSAMPLE)invalue;
} }
} }
@@ -338,7 +338,7 @@ h1v2_fancy_upsample(j_decompress_ptr cinfo, jpeg_component_info *compptr,
outptr = output_data[outrow++]; outptr = output_data[outrow++];
for (colctr = 0; colctr < compptr->downsampled_width; colctr++) { for (colctr = 0; colctr < compptr->downsampled_width; colctr++) {
thiscolsum = GETJSAMPLE(*inptr0++) * 3 + GETJSAMPLE(*inptr1++); thiscolsum = (*inptr0++) * 3 + (*inptr1++);
*outptr++ = (JSAMPLE)((thiscolsum + bias) >> 2); *outptr++ = (JSAMPLE)((thiscolsum + bias) >> 2);
} }
} }
@@ -381,8 +381,8 @@ h2v2_fancy_upsample(j_decompress_ptr cinfo, jpeg_component_info *compptr,
outptr = output_data[outrow++]; outptr = output_data[outrow++];
/* Special case for first column */ /* Special case for first column */
thiscolsum = GETJSAMPLE(*inptr0++) * 3 + GETJSAMPLE(*inptr1++); thiscolsum = (*inptr0++) * 3 + (*inptr1++);
nextcolsum = GETJSAMPLE(*inptr0++) * 3 + GETJSAMPLE(*inptr1++); nextcolsum = (*inptr0++) * 3 + (*inptr1++);
*outptr++ = (JSAMPLE)((thiscolsum * 4 + 8) >> 4); *outptr++ = (JSAMPLE)((thiscolsum * 4 + 8) >> 4);
*outptr++ = (JSAMPLE)((thiscolsum * 3 + nextcolsum + 7) >> 4); *outptr++ = (JSAMPLE)((thiscolsum * 3 + nextcolsum + 7) >> 4);
lastcolsum = thiscolsum; thiscolsum = nextcolsum; lastcolsum = thiscolsum; thiscolsum = nextcolsum;
@@ -390,7 +390,7 @@ h2v2_fancy_upsample(j_decompress_ptr cinfo, jpeg_component_info *compptr,
for (colctr = compptr->downsampled_width - 2; colctr > 0; colctr--) { for (colctr = compptr->downsampled_width - 2; colctr > 0; colctr--) {
/* General case: 3/4 * nearer pixel + 1/4 * further pixel in each */ /* General case: 3/4 * nearer pixel + 1/4 * further pixel in each */
/* dimension, thus 9/16, 3/16, 3/16, 1/16 overall */ /* dimension, thus 9/16, 3/16, 3/16, 1/16 overall */
nextcolsum = GETJSAMPLE(*inptr0++) * 3 + GETJSAMPLE(*inptr1++); nextcolsum = (*inptr0++) * 3 + (*inptr1++);
*outptr++ = (JSAMPLE)((thiscolsum * 3 + lastcolsum + 8) >> 4); *outptr++ = (JSAMPLE)((thiscolsum * 3 + lastcolsum + 8) >> 4);
*outptr++ = (JSAMPLE)((thiscolsum * 3 + nextcolsum + 7) >> 4); *outptr++ = (JSAMPLE)((thiscolsum * 3 + nextcolsum + 7) >> 4);
lastcolsum = thiscolsum; thiscolsum = nextcolsum; lastcolsum = thiscolsum; thiscolsum = nextcolsum;
@@ -477,7 +477,13 @@ jinit_upsampler(j_decompress_ptr cinfo)
} else if (h_in_group == h_out_group && } else if (h_in_group == h_out_group &&
v_in_group * 2 == v_out_group && do_fancy) { v_in_group * 2 == v_out_group && do_fancy) {
/* Non-fancy upsampling is handled by the generic method */ /* Non-fancy upsampling is handled by the generic method */
upsample->methods[ci] = h1v2_fancy_upsample; #if defined(__arm__) || defined(__aarch64__) || \
defined(_M_ARM) || defined(_M_ARM64)
if (jsimd_can_h1v2_fancy_upsample())
upsample->methods[ci] = jsimd_h1v2_fancy_upsample;
else
#endif
upsample->methods[ci] = h1v2_fancy_upsample;
upsample->pub.need_context_rows = TRUE; upsample->pub.need_context_rows = TRUE;
} else if (h_in_group * 2 == h_out_group && } else if (h_in_group * 2 == h_out_group &&
v_in_group * 2 == v_out_group) { v_in_group * 2 == v_out_group) {
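h2v1_fancy_upsample() above doubles each row horizontally with a triangle filter: 3/4 of the nearer input sample plus 1/4 of the further one, with the edge columns simply replicated. The sketch below reuses the same arithmetic as the hunks, minus the JSAMPLE typedefs and row plumbing; it assumes at least two input samples.

```c
#include <stdio.h>

/* h2v1 "fancy" upsampling of one row: each input sample yields two outputs,
 * weighted 3/4 toward the nearer input and 1/4 toward the further one.
 * Assumes w >= 2. */
static void h2v1_fancy(const unsigned char *in, unsigned char *out, int w)
{
  int invalue;

  /* first column */
  invalue = *in++;
  *out++ = (unsigned char)invalue;
  *out++ = (unsigned char)((invalue * 3 + in[0] + 2) >> 2);

  for (int col = w - 2; col > 0; col--) {
    invalue = (*in++) * 3;
    *out++ = (unsigned char)((invalue + in[-2] + 1) >> 2);
    *out++ = (unsigned char)((invalue + in[0] + 2) >> 2);
  }

  /* last column */
  invalue = *in;
  *out++ = (unsigned char)((invalue * 3 + in[-1] + 1) >> 2);
  *out++ = (unsigned char)invalue;
}

int main(void)
{
  unsigned char in[4] = { 0, 100, 200, 40 }, out[8];
  h2v1_fancy(in, out, 4);
  for (int i = 0; i < 8; i++)
    printf("%d ", out[i]);
  printf("\n");
  return 0;
}
```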


@@ -211,6 +211,10 @@ JMESSAGE(JERR_BAD_PARAM_VALUE, "Bogus parameter value")
JMESSAGE(JERR_UNSUPPORTED_SUSPEND, "I/O suspension not supported in scan optimization") JMESSAGE(JERR_UNSUPPORTED_SUSPEND, "I/O suspension not supported in scan optimization")
JMESSAGE(JWRN_BOGUS_ICC, "Corrupt JPEG data: bad ICC marker") JMESSAGE(JWRN_BOGUS_ICC, "Corrupt JPEG data: bad ICC marker")
#if JPEG_LIB_VERSION < 70
JMESSAGE(JERR_BAD_DROP_SAMPLING,
"Component index %d: mismatching sampling ratio %d:%d, %d:%d, %c")
#endif
#ifdef JMAKE_ENUM_LIST #ifdef JMAKE_ENUM_LIST
@@ -255,8 +259,17 @@ JMESSAGE(JWRN_BOGUS_ICC, "Corrupt JPEG data: bad ICC marker")
(cinfo)->err->msg_parm.i[1] = (p2), \ (cinfo)->err->msg_parm.i[1] = (p2), \
(cinfo)->err->msg_parm.i[2] = (p3), \ (cinfo)->err->msg_parm.i[2] = (p3), \
(cinfo)->err->msg_parm.i[3] = (p4), \ (cinfo)->err->msg_parm.i[3] = (p4), \
(*(cinfo)->err->error_exit) ((j_common_ptr) (cinfo))) (*(cinfo)->err->error_exit) ((j_common_ptr)(cinfo)))
#define ERREXITS(cinfo,code,str) \ #define ERREXIT6(cinfo, code, p1, p2, p3, p4, p5, p6) \
((cinfo)->err->msg_code = (code), \
(cinfo)->err->msg_parm.i[0] = (p1), \
(cinfo)->err->msg_parm.i[1] = (p2), \
(cinfo)->err->msg_parm.i[2] = (p3), \
(cinfo)->err->msg_parm.i[3] = (p4), \
(cinfo)->err->msg_parm.i[4] = (p5), \
(cinfo)->err->msg_parm.i[5] = (p6), \
(*(cinfo)->err->error_exit) ((j_common_ptr)(cinfo)))
#define ERREXITS(cinfo, code, str) \
((cinfo)->err->msg_code = (code), \ ((cinfo)->err->msg_code = (code), \
strncpy((cinfo)->err->msg_parm.s, (str), JMSG_STR_PARM_MAX), \ strncpy((cinfo)->err->msg_parm.s, (str), JMSG_STR_PARM_MAX), \
(*(cinfo)->err->error_exit) ((j_common_ptr) (cinfo))) (*(cinfo)->err->error_exit) ((j_common_ptr) (cinfo)))
@@ -292,24 +305,24 @@ JMESSAGE(JWRN_BOGUS_ICC, "Corrupt JPEG data: bad ICC marker")
(*(cinfo)->err->emit_message) ((j_common_ptr) (cinfo), (lvl))) (*(cinfo)->err->emit_message) ((j_common_ptr) (cinfo), (lvl)))
#define TRACEMS3(cinfo,lvl,code,p1,p2,p3) \ #define TRACEMS3(cinfo,lvl,code,p1,p2,p3) \
MAKESTMT(int * _mp = (cinfo)->err->msg_parm.i; \ MAKESTMT(int * _mp = (cinfo)->err->msg_parm.i; \
_mp[0] = (p1); _mp[1] = (p2); _mp[2] = (p3); \ _mp[0] = (p1); _mp[1] = (p2); _mp[2] = (p3); \
(cinfo)->err->msg_code = (code); \ (cinfo)->err->msg_code = (code); \
(*(cinfo)->err->emit_message) ((j_common_ptr) (cinfo), (lvl)); ) (*(cinfo)->err->emit_message) ((j_common_ptr) (cinfo), (lvl)); )
#define TRACEMS4(cinfo,lvl,code,p1,p2,p3,p4) \ #define TRACEMS4(cinfo,lvl,code,p1,p2,p3,p4) \
MAKESTMT(int * _mp = (cinfo)->err->msg_parm.i; \ MAKESTMT(int * _mp = (cinfo)->err->msg_parm.i; \
_mp[0] = (p1); _mp[1] = (p2); _mp[2] = (p3); _mp[3] = (p4); \ _mp[0] = (p1); _mp[1] = (p2); _mp[2] = (p3); _mp[3] = (p4); \
(cinfo)->err->msg_code = (code); \ (cinfo)->err->msg_code = (code); \
(*(cinfo)->err->emit_message) ((j_common_ptr) (cinfo), (lvl)); ) (*(cinfo)->err->emit_message) ((j_common_ptr) (cinfo), (lvl)); )
#define TRACEMS5(cinfo,lvl,code,p1,p2,p3,p4,p5) \ #define TRACEMS5(cinfo,lvl,code,p1,p2,p3,p4,p5) \
MAKESTMT(int * _mp = (cinfo)->err->msg_parm.i; \ MAKESTMT(int * _mp = (cinfo)->err->msg_parm.i; \
_mp[0] = (p1); _mp[1] = (p2); _mp[2] = (p3); _mp[3] = (p4); \ _mp[0] = (p1); _mp[1] = (p2); _mp[2] = (p3); _mp[3] = (p4); \
_mp[4] = (p5); \ _mp[4] = (p5); \
(cinfo)->err->msg_code = (code); \ (cinfo)->err->msg_code = (code); \
(*(cinfo)->err->emit_message) ((j_common_ptr) (cinfo), (lvl)); ) (*(cinfo)->err->emit_message) ((j_common_ptr) (cinfo), (lvl)); )
#define TRACEMS8(cinfo,lvl,code,p1,p2,p3,p4,p5,p6,p7,p8) \ #define TRACEMS8(cinfo,lvl,code,p1,p2,p3,p4,p5,p6,p7,p8) \
MAKESTMT(int * _mp = (cinfo)->err->msg_parm.i; \ MAKESTMT(int * _mp = (cinfo)->err->msg_parm.i; \
_mp[0] = (p1); _mp[1] = (p2); _mp[2] = (p3); _mp[3] = (p4); \ _mp[0] = (p1); _mp[1] = (p2); _mp[2] = (p3); _mp[3] = (p4); \
_mp[4] = (p5); _mp[5] = (p6); _mp[6] = (p7); _mp[7] = (p8); \ _mp[4] = (p5); _mp[5] = (p6); _mp[6] = (p7); _mp[7] = (p8); \
(cinfo)->err->msg_code = (code); \ (cinfo)->err->msg_code = (code); \
(*(cinfo)->err->emit_message) ((j_common_ptr) (cinfo), (lvl)); ) (*(cinfo)->err->emit_message) ((j_common_ptr) (cinfo), (lvl)); )
#define TRACEMSS(cinfo,lvl,code,str) \ #define TRACEMSS(cinfo,lvl,code,str) \
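ERREXIT6() is added above because the new JERR_BAD_DROP_SAMPLING message consumes six parameters, one more than ERREXIT5 can supply. A standalone illustration of formatting that message from a six-entry parameter block; the array stands in for msg_parm.i[], and the format string is the one from the diff.

```c
#include <stdio.h>

/* Six parameters feed the new message template, which is why a sixth slot
 * (and hence ERREXIT6) is needed. */
int main(void)
{
  int msg_parm[6] = { 2, 2, 1, 1, 2, 'h' };  /* p1..p6 as ERREXIT6 stores them */

  printf("Component index %d: mismatching sampling ratio %d:%d, %d:%d, %c\n",
         msg_parm[0], msg_parm[1], msg_parm[2], msg_parm[3], msg_parm[4],
         msg_parm[5]);
  return 0;
}
```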


@@ -3,7 +3,7 @@
* *
* This file was part of the Independent JPEG Group's software: * This file was part of the Independent JPEG Group's software:
* Copyright (C) 1991-1998, Thomas G. Lane. * Copyright (C) 1991-1998, Thomas G. Lane.
* Modification developed 2002-2009 by Guido Vollbeding. * Modification developed 2002-2018 by Guido Vollbeding.
* libjpeg-turbo Modifications: * libjpeg-turbo Modifications:
* Copyright (C) 2015, 2020, D. R. Commander. * Copyright (C) 2015, 2020, D. R. Commander.
* For conditions of distribution and use, see the accompanying README.ijg * For conditions of distribution and use, see the accompanying README.ijg
@@ -417,7 +417,7 @@ jpeg_idct_islow(j_decompress_ptr cinfo, jpeg_component_info *compptr,
/* /*
* Perform dequantization and inverse DCT on one block of coefficients, * Perform dequantization and inverse DCT on one block of coefficients,
* producing a 7x7 output block. * producing a reduced-size 7x7 output block.
* *
* Optimized algorithm with 12 multiplications in the 1-D kernel. * Optimized algorithm with 12 multiplications in the 1-D kernel.
* cK represents sqrt(2) * cos(K*pi/14). * cK represents sqrt(2) * cos(K*pi/14).
@@ -1258,7 +1258,7 @@ jpeg_idct_10x10(j_decompress_ptr cinfo, jpeg_component_info *compptr,
/* /*
* Perform dequantization and inverse DCT on one block of coefficients, * Perform dequantization and inverse DCT on one block of coefficients,
* producing a 11x11 output block. * producing an 11x11 output block.
* *
* Optimized algorithm with 24 multiplications in the 1-D kernel. * Optimized algorithm with 24 multiplications in the 1-D kernel.
* cK represents sqrt(2) * cos(K*pi/22). * cK represents sqrt(2) * cos(K*pi/22).
@@ -2398,7 +2398,7 @@ jpeg_idct_16x16(j_decompress_ptr cinfo, jpeg_component_info *compptr,
tmp0 = DEQUANTIZE(inptr[DCTSIZE * 0], quantptr[DCTSIZE * 0]); tmp0 = DEQUANTIZE(inptr[DCTSIZE * 0], quantptr[DCTSIZE * 0]);
tmp0 = LEFT_SHIFT(tmp0, CONST_BITS); tmp0 = LEFT_SHIFT(tmp0, CONST_BITS);
/* Add fudge factor here for final descale. */ /* Add fudge factor here for final descale. */
tmp0 += 1 << (CONST_BITS - PASS1_BITS - 1); tmp0 += ONE << (CONST_BITS - PASS1_BITS - 1);
z1 = DEQUANTIZE(inptr[DCTSIZE * 4], quantptr[DCTSIZE * 4]); z1 = DEQUANTIZE(inptr[DCTSIZE * 4], quantptr[DCTSIZE * 4]);
tmp1 = MULTIPLY(z1, FIX(1.306562965)); /* c4[16] = c2[8] */ tmp1 = MULTIPLY(z1, FIX(1.306562965)); /* c4[16] = c2[8] */
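These reduced-size IDCT kernels are selected when the application requests scaled decompression: with 8x8 DCT blocks, a 7/8 scale factor engages the 7x7 routine, 11/8 the 11x11 routine, and so on. A minimal sketch of requesting such a scale through the standard libjpeg API (illustrative only; error handling omitted):

#include <stdio.h>
#include "jpeglib.h"

/* Decode the header and compute output dimensions at 7/8 scale, so that
 * each 8x8 DCT block would be reduced to a 7x7 output block. */
void calc_7_8_scale_dimensions(FILE *infile)
{
  struct jpeg_decompress_struct cinfo;
  struct jpeg_error_mgr jerr;

  cinfo.err = jpeg_std_error(&jerr);
  jpeg_create_decompress(&cinfo);
  jpeg_stdio_src(&cinfo, infile);
  (void)jpeg_read_header(&cinfo, TRUE);

  cinfo.scale_num = 7;                  /* request 7/8 of the nominal size */
  cinfo.scale_denom = 8;
  jpeg_calc_output_dimensions(&cinfo);  /* output_width/output_height now reflect it */

  jpeg_destroy_decompress(&cinfo);
}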


@@ -43,25 +43,11 @@
#if BITS_IN_JSAMPLE == 8 #if BITS_IN_JSAMPLE == 8
/* JSAMPLE should be the smallest type that will hold the values 0..255. /* JSAMPLE should be the smallest type that will hold the values 0..255.
* You can use a signed char by having GETJSAMPLE mask it with 0xFF.
*/ */
#ifdef HAVE_UNSIGNED_CHAR
typedef unsigned char JSAMPLE; typedef unsigned char JSAMPLE;
#define GETJSAMPLE(value) ((int)(value)) #define GETJSAMPLE(value) ((int)(value))
#else /* not HAVE_UNSIGNED_CHAR */
typedef char JSAMPLE;
#ifdef __CHAR_UNSIGNED__
#define GETJSAMPLE(value) ((int)(value))
#else
#define GETJSAMPLE(value) ((int)(value) & 0xFF)
#endif /* __CHAR_UNSIGNED__ */
#endif /* HAVE_UNSIGNED_CHAR */
#define MAXJSAMPLE 255 #define MAXJSAMPLE 255
#define CENTERJSAMPLE 128 #define CENTERJSAMPLE 128
@@ -97,22 +83,9 @@ typedef short JCOEF;
* managers, this is also the data type passed to fread/fwrite. * managers, this is also the data type passed to fread/fwrite.
*/ */
#ifdef HAVE_UNSIGNED_CHAR
typedef unsigned char JOCTET; typedef unsigned char JOCTET;
#define GETJOCTET(value) (value) #define GETJOCTET(value) (value)
#else /* not HAVE_UNSIGNED_CHAR */
typedef char JOCTET;
#ifdef __CHAR_UNSIGNED__
#define GETJOCTET(value) (value)
#else
#define GETJOCTET(value) ((value) & 0xFF)
#endif /* __CHAR_UNSIGNED__ */
#endif /* HAVE_UNSIGNED_CHAR */
/* These typedefs are used for various table entries and so forth. /* These typedefs are used for various table entries and so forth.
* They must be at least as wide as specified; but making them too big * They must be at least as wide as specified; but making them too big
@@ -123,15 +96,7 @@ typedef char JOCTET;
/* UINT8 must hold at least the values 0..255. */ /* UINT8 must hold at least the values 0..255. */
#ifdef HAVE_UNSIGNED_CHAR
typedef unsigned char UINT8; typedef unsigned char UINT8;
#else /* not HAVE_UNSIGNED_CHAR */
#ifdef __CHAR_UNSIGNED__
typedef char UINT8;
#else /* not __CHAR_UNSIGNED__ */
typedef short UINT8;
#endif /* __CHAR_UNSIGNED__ */
#endif /* HAVE_UNSIGNED_CHAR */
/* UINT16 must hold at least the values 0..65535. */ /* UINT16 must hold at least the values 0..65535. */
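With the HAVE_UNSIGNED_CHAR and __CHAR_UNSIGNED__ fallbacks removed, JSAMPLE is always unsigned char and GETJSAMPLE() is nothing more than a cast to int, which is why the jquant1.c, jquant2.c, and rdbmp.c hunks later in this commit drop the macro altogether. A small sketch of the equivalence (illustrative only):

#include <stdio.h>
#include "jpeglib.h"

/* With JSAMPLE == unsigned char, the macro and ordinary integer promotion
 * give the same value, so this always returns 1. */
static int getjsample_is_just_a_cast(JSAMPROW row)
{
  int via_macro = GETJSAMPLE(row[0]);
  int via_promotion = row[0];
  return via_macro == via_promotion;
}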


@@ -5,7 +5,7 @@
* Copyright (C) 1991-1997, Thomas G. Lane. * Copyright (C) 1991-1997, Thomas G. Lane.
* Modified 1997-2009 by Guido Vollbeding. * Modified 1997-2009 by Guido Vollbeding.
* libjpeg-turbo Modifications: * libjpeg-turbo Modifications:
* Copyright (C) 2015-2016, D. R. Commander. * Copyright (C) 2015-2016, 2019, D. R. Commander.
* Copyright (C) 2015, Google, Inc. * Copyright (C) 2015, Google, Inc.
* mozjpeg Modifications: * mozjpeg Modifications:
* Copyright (C) 2014, Mozilla Corporation. * Copyright (C) 2014, Mozilla Corporation.
@@ -220,6 +220,9 @@ struct jpeg_decomp_master {
JDIMENSION first_MCU_col[MAX_COMPONENTS]; JDIMENSION first_MCU_col[MAX_COMPONENTS];
JDIMENSION last_MCU_col[MAX_COMPONENTS]; JDIMENSION last_MCU_col[MAX_COMPONENTS];
boolean jinit_upsampler_no_alloc; boolean jinit_upsampler_no_alloc;
/* Last iMCU row that was successfully decoded */
JDIMENSION last_good_iMCU_row;
}; };
/* Input control module */ /* Input control module */


@@ -1,4 +1,4 @@
.TH JPEGTRAN 1 "18 March 2017" .TH JPEGTRAN 1 "26 October 2020"
.SH NAME .SH NAME
jpegtran \- lossless transformation of JPEG files jpegtran \- lossless transformation of JPEG files
.SH SYNOPSIS .SH SYNOPSIS
@@ -180,6 +180,47 @@ left corner of the selected region must fall on an iMCU boundary. If it
doesn't, then it is silently moved up and/or left to the nearest iMCU boundary doesn't, then it is silently moved up and/or left to the nearest iMCU boundary
(the lower right corner is unchanged.) (the lower right corner is unchanged.)
.PP .PP
If W or H is larger than the width/height of the input image, then the output
image is expanded in size, and the expanded region is filled in with zeros
(neutral gray). Attaching an 'f' character ("flatten") to the width number
will cause each block in the expanded region to be filled in with the DC
coefficient of the nearest block in the input image rather than grayed out.
Attaching an 'r' character ("reflect") to the width number will cause the
expanded region to be filled in with repeated reflections of the input image
rather than grayed out.
.PP
A complementary lossless wipe option is provided to discard (gray out) data
inside a given image region while losslessly preserving what is outside:
.TP
.B \-wipe WxH+X+Y
Wipe (gray out) a rectangular region of width W and height H from the input
image, starting at point X,Y.
.PP
Attaching an 'f' character ("flatten") to the width number will cause the
region to be filled with the average of adjacent blocks rather than grayed out.
If the wipe region and the region outside the wipe region, when adjusted to the
nearest iMCU boundary, form two horizontally adjacent rectangles, then
attaching an 'r' character ("reflect") to the width number will cause the wipe
region to be filled with repeated reflections of the outside region rather than
grayed out.
.PP
A lossless drop option is also provided, which allows another JPEG image to be
inserted ("dropped") into the input image data at a given position, replacing
the existing image data at that position:
.TP
.B \-drop +X+Y filename
Drop (insert) another image at point X,Y
.PP
Both the input image and the drop image must have the same subsampling level.
It is best if they also have the same quantization (quality.) Otherwise, the
quantization of the output image will be adapted to accommodate the higher of
the input image quality and the drop image quality. The trim option can be
used with the drop option to requantize the drop image to match the input
image. Note that a grayscale image can be dropped into a full-color image or
vice versa, as long as the full-color image has no vertical subsampling. If
the input image is grayscale and the drop image is full-color, then the
chrominance channels from the drop image will be discarded.
.PP
Other not-strictly-lossless transformation switches are: Other not-strictly-lossless transformation switches are:
.TP .TP
.B \-grayscale .B \-grayscale
@@ -229,9 +270,31 @@ number. For example,
.B \-max 4m .B \-max 4m
selects 4000000 bytes. If more space is needed, an error will occur. selects 4000000 bytes. If more space is needed, an error will occur.
.TP .TP
.BI \-maxscans " N"
Abort if the input image contains more than
.I N
scans. This feature demonstrates a method by which applications can guard
against denial-of-service attacks instigated by specially-crafted malformed
JPEG images containing numerous scans with missing image data or image data
consisting only of "EOB runs" (a feature of progressive JPEG images that allows
potentially hundreds of thousands of adjoining zero-value pixels to be
represented using only a few bytes.) Attempting to transform such malformed
JPEG images can cause excessive CPU activity, since the decompressor must fully
process each scan (even if the scan is corrupt) before it can proceed to the
next scan.
.TP
.BI \-outfile " name" .BI \-outfile " name"
Send output image to the named file, not to standard output. Send output image to the named file, not to standard output.
.TP .TP
.BI \-report
Report transformation progress.
.TP
.BI \-strict
Treat all warnings as fatal. This feature also demonstrates a method by which
applications can guard against attacks instigated by specially-crafted
malformed JPEG images. Enabling this option will cause the decompressor to
abort if the input image contains incomplete or corrupt image data.
.TP
.B \-verbose .B \-verbose
Enable debug printout. More Enable debug printout. More
.BR \-v 's .BR \-v 's
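The new drop and wipe switches are thin wrappers around the lossless-transformation interface in transupp.h, which applications can also drive directly. A hedged sketch of wiping a 64x64 region at offset +16+16 programmatically, following the same call sequence that jpegtran's main() uses below (function names from transupp.h; marker copying and error handling omitted):

#include <stdio.h>
#include <string.h>
#include "jpeglib.h"
#include "transupp.h"

/* Illustrative sketch: lossless wipe of a 64x64 region at +16+16. */
void wipe_region(FILE *infile, FILE *outfile)
{
  struct jpeg_decompress_struct srcinfo;
  struct jpeg_compress_struct dstinfo;
  struct jpeg_error_mgr jsrcerr, jdsterr;
  jpeg_transform_info info;
  jvirt_barray_ptr *src_coef_arrays, *dst_coef_arrays;

  memset(&info, 0, sizeof(info));
  info.transform = JXFORM_WIPE;
  (void)jtransform_parse_crop_spec(&info, "64x64+16+16");

  srcinfo.err = jpeg_std_error(&jsrcerr);
  jpeg_create_decompress(&srcinfo);
  dstinfo.err = jpeg_std_error(&jdsterr);
  jpeg_create_compress(&dstinfo);

  jpeg_stdio_src(&srcinfo, infile);
  (void)jpeg_read_header(&srcinfo, TRUE);
  jtransform_request_workspace(&srcinfo, &info);

  src_coef_arrays = jpeg_read_coefficients(&srcinfo);
  jpeg_copy_critical_parameters(&srcinfo, &dstinfo);
  dst_coef_arrays = jtransform_adjust_parameters(&srcinfo, &dstinfo,
                                                 src_coef_arrays, &info);

  jpeg_stdio_dest(&dstinfo, outfile);
  jpeg_write_coefficients(&dstinfo, dst_coef_arrays);
  jtransform_execute_transformation(&srcinfo, &dstinfo, src_coef_arrays,
                                    &info);

  jpeg_finish_compress(&dstinfo);
  jpeg_destroy_compress(&dstinfo);
  (void)jpeg_finish_decompress(&srcinfo);
  jpeg_destroy_decompress(&srcinfo);
}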


@@ -2,9 +2,9 @@
* jpegtran.c * jpegtran.c
* *
* This file was part of the Independent JPEG Group's software: * This file was part of the Independent JPEG Group's software:
* Copyright (C) 1995-2010, Thomas G. Lane, Guido Vollbeding. * Copyright (C) 1995-2019, Thomas G. Lane, Guido Vollbeding.
* libjpeg-turbo Modifications: * libjpeg-turbo Modifications:
* Copyright (C) 2010, 2014, 2017, 2020, D. R. Commander. * Copyright (C) 2010, 2014, 2017, 2019-2020, D. R. Commander.
* mozjpeg Modifications: * mozjpeg Modifications:
* Copyright (C) 2014, Mozilla Corporation. * Copyright (C) 2014, Mozilla Corporation.
* For conditions of distribution and use, see the accompanying README file. * For conditions of distribution and use, see the accompanying README file.
@@ -42,7 +42,11 @@
static const char *progname; /* program name for error messages */ static const char *progname; /* program name for error messages */
static char *icc_filename; /* for -icc switch */ static char *icc_filename; /* for -icc switch */
JDIMENSION max_scans; /* for -maxscans switch */
static char *outfilename; /* for -outfile switch */ static char *outfilename; /* for -outfile switch */
static char *dropfilename; /* for -drop switch */
boolean report; /* for -report switch */
boolean strict; /* for -strict switch */
static boolean prefer_smallest; /* use smallest of input or result file (if no image-changing options supplied) */ static boolean prefer_smallest; /* use smallest of input or result file (if no image-changing options supplied) */
static JCOPY_OPTION copyoption; /* -copy switch */ static JCOPY_OPTION copyoption; /* -copy switch */
static jpeg_transform_info transformoption; /* image transformation options */ static jpeg_transform_info transformoption; /* image transformation options */
@@ -76,8 +80,9 @@ usage(void)
fprintf(stderr, "Switches for modifying the image:\n"); fprintf(stderr, "Switches for modifying the image:\n");
#if TRANSFORMS_SUPPORTED #if TRANSFORMS_SUPPORTED
fprintf(stderr, " -crop WxH+X+Y Crop to a rectangular region\n"); fprintf(stderr, " -crop WxH+X+Y Crop to a rectangular region\n");
fprintf(stderr, " -grayscale Reduce to grayscale (omit color data)\n"); fprintf(stderr, " -drop +X+Y filename Drop (insert) another image\n");
fprintf(stderr, " -flip [horizontal|vertical] Mirror image (left-right or top-bottom)\n"); fprintf(stderr, " -flip [horizontal|vertical] Mirror image (left-right or top-bottom)\n");
fprintf(stderr, " -grayscale Reduce to grayscale (omit color data)\n");
fprintf(stderr, " -perfect Fail if there is non-transformable edge blocks\n"); fprintf(stderr, " -perfect Fail if there is non-transformable edge blocks\n");
fprintf(stderr, " -rotate [90|180|270] Rotate image (degrees clockwise)\n"); fprintf(stderr, " -rotate [90|180|270] Rotate image (degrees clockwise)\n");
#endif #endif
@@ -85,6 +90,8 @@ usage(void)
fprintf(stderr, " -transpose Transpose image\n"); fprintf(stderr, " -transpose Transpose image\n");
fprintf(stderr, " -transverse Transverse transpose image\n"); fprintf(stderr, " -transverse Transverse transpose image\n");
fprintf(stderr, " -trim Drop non-transformable edge blocks\n"); fprintf(stderr, " -trim Drop non-transformable edge blocks\n");
fprintf(stderr, " with -drop: Requantize drop file to match source file\n");
fprintf(stderr, " -wipe WxH+X+Y Wipe (gray out) a rectangular region\n");
#endif #endif
fprintf(stderr, "Switches for advanced users:\n"); fprintf(stderr, "Switches for advanced users:\n");
#ifdef C_ARITH_CODING_SUPPORTED #ifdef C_ARITH_CODING_SUPPORTED
@@ -93,7 +100,10 @@ usage(void)
fprintf(stderr, " -icc FILE Embed ICC profile contained in FILE\n"); fprintf(stderr, " -icc FILE Embed ICC profile contained in FILE\n");
fprintf(stderr, " -restart N Set restart interval in rows, or in blocks with B\n"); fprintf(stderr, " -restart N Set restart interval in rows, or in blocks with B\n");
fprintf(stderr, " -maxmemory N Maximum memory to use (in kbytes)\n"); fprintf(stderr, " -maxmemory N Maximum memory to use (in kbytes)\n");
fprintf(stderr, " -maxscans N Maximum number of scans to allow in input file\n");
fprintf(stderr, " -outfile name Specify name for output file\n"); fprintf(stderr, " -outfile name Specify name for output file\n");
fprintf(stderr, " -report Report transformation progress\n");
fprintf(stderr, " -strict Treat all warnings as fatal\n");
fprintf(stderr, " -verbose or -debug Emit debug output\n"); fprintf(stderr, " -verbose or -debug Emit debug output\n");
fprintf(stderr, " -version Print version information and exit\n"); fprintf(stderr, " -version Print version information and exit\n");
fprintf(stderr, "Switches for wizards:\n"); fprintf(stderr, "Switches for wizards:\n");
@@ -151,7 +161,10 @@ parse_switches(j_compress_ptr cinfo, int argc, char **argv,
simple_progressive = FALSE; simple_progressive = FALSE;
#endif #endif
icc_filename = NULL; icc_filename = NULL;
max_scans = 0;
outfilename = NULL; outfilename = NULL;
report = FALSE;
strict = FALSE;
copyoption = JCOPYOPT_DEFAULT; copyoption = JCOPYOPT_DEFAULT;
transformoption.transform = JXFORM_NONE; transformoption.transform = JXFORM_NONE;
transformoption.perfect = FALSE; transformoption.perfect = FALSE;
@@ -208,7 +221,8 @@ parse_switches(j_compress_ptr cinfo, int argc, char **argv,
#if TRANSFORMS_SUPPORTED #if TRANSFORMS_SUPPORTED
if (++argn >= argc) /* advance to next argument */ if (++argn >= argc) /* advance to next argument */
usage(); usage();
if (!jtransform_parse_crop_spec(&transformoption, argv[argn])) { if (transformoption.crop /* reject multiple crop/drop/wipe requests */ ||
!jtransform_parse_crop_spec(&transformoption, argv[argn])) {
fprintf(stderr, "%s: bogus -crop argument '%s'\n", fprintf(stderr, "%s: bogus -crop argument '%s'\n",
progname, argv[argn]); progname, argv[argn]);
exit(EXIT_FAILURE); exit(EXIT_FAILURE);
@@ -218,6 +232,26 @@ parse_switches(j_compress_ptr cinfo, int argc, char **argv,
select_transform(JXFORM_NONE); /* force an error */ select_transform(JXFORM_NONE); /* force an error */
#endif #endif
} else if (keymatch(arg, "drop", 2)) {
#if TRANSFORMS_SUPPORTED
if (++argn >= argc) /* advance to next argument */
usage();
if (transformoption.crop /* reject multiple crop/drop/wipe requests */ ||
!jtransform_parse_crop_spec(&transformoption, argv[argn]) ||
transformoption.crop_width_set != JCROP_UNSET ||
transformoption.crop_height_set != JCROP_UNSET) {
fprintf(stderr, "%s: bogus -drop argument '%s'\n",
progname, argv[argn]);
exit(EXIT_FAILURE);
}
if (++argn >= argc) /* advance to next argument */
usage();
dropfilename = argv[argn];
select_transform(JXFORM_DROP);
#else
select_transform(JXFORM_NONE); /* force an error */
#endif
} else if (keymatch(arg, "debug", 1) || keymatch(arg, "verbose", 1)) { } else if (keymatch(arg, "debug", 1) || keymatch(arg, "verbose", 1)) {
/* Enable debug printouts. */ /* Enable debug printouts. */
/* On first -d, print version identification */ /* On first -d, print version identification */
@@ -282,6 +316,12 @@ parse_switches(j_compress_ptr cinfo, int argc, char **argv,
lval *= 1000L; lval *= 1000L;
cinfo->mem->max_memory_to_use = lval * 1000L; cinfo->mem->max_memory_to_use = lval * 1000L;
} else if (keymatch(arg, "maxscans", 4)) {
if (++argn >= argc) /* advance to next argument */
usage();
if (sscanf(argv[argn], "%u", &max_scans) != 1)
usage();
} else if (keymatch(arg, "optimize", 1) || keymatch(arg, "optimise", 1)) { } else if (keymatch(arg, "optimize", 1) || keymatch(arg, "optimise", 1)) {
/* Enable entropy parm optimization. */ /* Enable entropy parm optimization. */
#ifdef ENTROPY_OPT_SUPPORTED #ifdef ENTROPY_OPT_SUPPORTED
@@ -315,6 +355,9 @@ parse_switches(j_compress_ptr cinfo, int argc, char **argv,
exit(EXIT_FAILURE); exit(EXIT_FAILURE);
#endif #endif
} else if (keymatch(arg, "report", 3)) {
report = TRUE;
} else if (keymatch(arg, "restart", 1)) { } else if (keymatch(arg, "restart", 1)) {
/* Restart interval in MCU rows (or in MCUs with 'b'). */ /* Restart interval in MCU rows (or in MCUs with 'b'). */
long lval; long lval;
@@ -368,6 +411,9 @@ parse_switches(j_compress_ptr cinfo, int argc, char **argv,
exit(EXIT_FAILURE); exit(EXIT_FAILURE);
#endif #endif
} else if (keymatch(arg, "strict", 2)) {
strict = TRUE;
} else if (keymatch(arg, "transpose", 1)) { } else if (keymatch(arg, "transpose", 1)) {
/* Transpose (across UL-to-LR axis). */ /* Transpose (across UL-to-LR axis). */
select_transform(JXFORM_TRANSPOSE); select_transform(JXFORM_TRANSPOSE);
@@ -383,6 +429,21 @@ parse_switches(j_compress_ptr cinfo, int argc, char **argv,
transformoption.trim = TRUE; transformoption.trim = TRUE;
prefer_smallest = FALSE; prefer_smallest = FALSE;
} else if (keymatch(arg, "wipe", 1)) {
#if TRANSFORMS_SUPPORTED
if (++argn >= argc) /* advance to next argument */
usage();
if (transformoption.crop /* reject multiple crop/drop/wipe requests */ ||
!jtransform_parse_crop_spec(&transformoption, argv[argn])) {
fprintf(stderr, "%s: bogus -wipe argument '%s'\n",
progname, argv[argn]);
exit(EXIT_FAILURE);
}
select_transform(JXFORM_WIPE);
#else
select_transform(JXFORM_NONE); /* force an error */
#endif
} else { } else {
usage(); /* bogus switch */ usage(); /* bogus switch */
} }
@@ -409,6 +470,19 @@ parse_switches(j_compress_ptr cinfo, int argc, char **argv,
} }
METHODDEF(void)
my_emit_message(j_common_ptr cinfo, int msg_level)
{
if (msg_level < 0) {
/* Treat warning as fatal */
cinfo->err->error_exit(cinfo);
} else {
if (cinfo->err->trace_level >= msg_level)
cinfo->err->output_message(cinfo);
}
}
/* /*
* The main program. * The main program.
*/ */
@@ -417,11 +491,14 @@ int
main(int argc, char **argv) main(int argc, char **argv)
{ {
struct jpeg_decompress_struct srcinfo; struct jpeg_decompress_struct srcinfo;
#if TRANSFORMS_SUPPORTED
struct jpeg_decompress_struct dropinfo;
struct jpeg_error_mgr jdroperr;
FILE *drop_file;
#endif
struct jpeg_compress_struct dstinfo; struct jpeg_compress_struct dstinfo;
struct jpeg_error_mgr jsrcerr, jdsterr; struct jpeg_error_mgr jsrcerr, jdsterr;
#ifdef PROGRESS_REPORT struct cdjpeg_progress_mgr src_progress, dst_progress;
struct cdjpeg_progress_mgr progress;
#endif
jvirt_barray_ptr *src_coef_arrays; jvirt_barray_ptr *src_coef_arrays;
jvirt_barray_ptr *dst_coef_arrays; jvirt_barray_ptr *dst_coef_arrays;
int file_index; int file_index;
@@ -458,13 +535,16 @@ main(int argc, char **argv)
* values read here are mostly ignored; we will rescan the switches after * values read here are mostly ignored; we will rescan the switches after
* opening the input file. Also note that most of the switches affect the * opening the input file. Also note that most of the switches affect the
* destination JPEG object, so we parse into that and then copy over what * destination JPEG object, so we parse into that and then copy over what
* needs to affects the source too. * needs to affect the source too.
*/ */
file_index = parse_switches(&dstinfo, argc, argv, 0, FALSE); file_index = parse_switches(&dstinfo, argc, argv, 0, FALSE);
jsrcerr.trace_level = jdsterr.trace_level; jsrcerr.trace_level = jdsterr.trace_level;
srcinfo.mem->max_memory_to_use = dstinfo.mem->max_memory_to_use; srcinfo.mem->max_memory_to_use = dstinfo.mem->max_memory_to_use;
if (strict)
jsrcerr.emit_message = my_emit_message;
#ifdef TWO_FILE_COMMANDLINE #ifdef TWO_FILE_COMMANDLINE
/* Must have either -outfile switch or explicit output file name */ /* Must have either -outfile switch or explicit output file name */
if (outfilename == NULL) { if (outfilename == NULL) {
@@ -530,8 +610,29 @@ main(int argc, char **argv)
copyoption = JCOPYOPT_ALL_EXCEPT_ICC; copyoption = JCOPYOPT_ALL_EXCEPT_ICC;
} }
#ifdef PROGRESS_REPORT if (report) {
start_progress_monitor((j_common_ptr)&dstinfo, &progress); start_progress_monitor((j_common_ptr)&dstinfo, &dst_progress);
dst_progress.report = report;
}
if (report || max_scans != 0) {
start_progress_monitor((j_common_ptr)&srcinfo, &src_progress);
src_progress.report = report;
src_progress.max_scans = max_scans;
}
#if TRANSFORMS_SUPPORTED
/* Open the drop file. */
if (dropfilename != NULL) {
if ((drop_file = fopen(dropfilename, READ_BINARY)) == NULL) {
fprintf(stderr, "%s: can't open %s for reading\n", progname,
dropfilename);
exit(EXIT_FAILURE);
}
dropinfo.err = jpeg_std_error(&jdroperr);
jpeg_create_decompress(&dropinfo);
jpeg_stdio_src(&dropinfo, drop_file);
} else {
drop_file = NULL;
}
#endif #endif
/* Specify data source for decompression */ /* Specify data source for decompression */
@@ -569,6 +670,17 @@ main(int argc, char **argv)
/* Read file header */ /* Read file header */
(void)jpeg_read_header(&srcinfo, TRUE); (void)jpeg_read_header(&srcinfo, TRUE);
#if TRANSFORMS_SUPPORTED
if (dropfilename != NULL) {
(void)jpeg_read_header(&dropinfo, TRUE);
transformoption.crop_width = dropinfo.image_width;
transformoption.crop_width_set = JCROP_POS;
transformoption.crop_height = dropinfo.image_height;
transformoption.crop_height_set = JCROP_POS;
transformoption.drop_ptr = &dropinfo;
}
#endif
/* Any space needed by a transform option must be requested before /* Any space needed by a transform option must be requested before
* jpeg_read_coefficients so that memory allocation will be done right. * jpeg_read_coefficients so that memory allocation will be done right.
*/ */
@@ -584,6 +696,12 @@ main(int argc, char **argv)
/* Read source file as DCT coefficients */ /* Read source file as DCT coefficients */
src_coef_arrays = jpeg_read_coefficients(&srcinfo); src_coef_arrays = jpeg_read_coefficients(&srcinfo);
#if TRANSFORMS_SUPPORTED
if (dropfilename != NULL) {
transformoption.drop_coef_arrays = jpeg_read_coefficients(&dropinfo);
}
#endif
/* Initialize destination compression parameters from source values */ /* Initialize destination compression parameters from source values */
jpeg_copy_critical_parameters(&srcinfo, &dstinfo); jpeg_copy_critical_parameters(&srcinfo, &dstinfo);
@@ -672,27 +790,40 @@ main(int argc, char **argv)
else else
fprintf(stderr, "%s: can't write to stdout\n", progname); fprintf(stderr, "%s: can't write to stdout\n", progname);
} }
jpeg_destroy_compress(&dstinfo);
#if TRANSFORMS_SUPPORTED
if (dropfilename != NULL) {
(void)jpeg_finish_decompress(&dropinfo);
jpeg_destroy_decompress(&dropinfo);
} }
#endif #endif
jpeg_destroy_compress(&dstinfo);
(void)jpeg_finish_decompress(&srcinfo); (void)jpeg_finish_decompress(&srcinfo);
jpeg_destroy_decompress(&srcinfo); jpeg_destroy_decompress(&srcinfo);
/* Close output file, if we opened it */ /* Close output file, if we opened it */
if (fp != stdout) if (fp != stdout)
fclose(fp); fclose(fp);
#if TRANSFORMS_SUPPORTED
#ifdef PROGRESS_REPORT if (drop_file != NULL)
end_progress_monitor((j_common_ptr)&dstinfo); fclose(drop_file);
#endif #endif
if (report)
end_progress_monitor((j_common_ptr)&dstinfo);
if (report || max_scans != 0)
end_progress_monitor((j_common_ptr)&srcinfo);
free(inbuffer); free(inbuffer);
free(outbuffer); free(outbuffer);
free(icc_profile); free(icc_profile);
/* All done. */ /* All done. */
#if TRANSFORMS_SUPPORTED
if (dropfilename != NULL)
exit(jsrcerr.num_warnings + jdroperr.num_warnings +
jdsterr.num_warnings ? EXIT_WARNING : EXIT_SUCCESS);
#endif
exit(jsrcerr.num_warnings + jdsterr.num_warnings ? exit(jsrcerr.num_warnings + jdsterr.num_warnings ?
EXIT_WARNING : EXIT_SUCCESS); EXIT_WARNING : EXIT_SUCCESS);
return 0; /* suppress no-return-value warnings */ return 0; /* suppress no-return-value warnings */


@@ -479,7 +479,7 @@ color_quantize(j_decompress_ptr cinfo, JSAMPARRAY input_buf,
for (col = width; col > 0; col--) { for (col = width; col > 0; col--) {
pixcode = 0; pixcode = 0;
for (ci = 0; ci < nc; ci++) { for (ci = 0; ci < nc; ci++) {
pixcode += GETJSAMPLE(colorindex[ci][GETJSAMPLE(*ptrin++)]); pixcode += colorindex[ci][*ptrin++];
} }
*ptrout++ = (JSAMPLE)pixcode; *ptrout++ = (JSAMPLE)pixcode;
} }
@@ -506,9 +506,9 @@ color_quantize3(j_decompress_ptr cinfo, JSAMPARRAY input_buf,
ptrin = input_buf[row]; ptrin = input_buf[row];
ptrout = output_buf[row]; ptrout = output_buf[row];
for (col = width; col > 0; col--) { for (col = width; col > 0; col--) {
pixcode = GETJSAMPLE(colorindex0[GETJSAMPLE(*ptrin++)]); pixcode = colorindex0[*ptrin++];
pixcode += GETJSAMPLE(colorindex1[GETJSAMPLE(*ptrin++)]); pixcode += colorindex1[*ptrin++];
pixcode += GETJSAMPLE(colorindex2[GETJSAMPLE(*ptrin++)]); pixcode += colorindex2[*ptrin++];
*ptrout++ = (JSAMPLE)pixcode; *ptrout++ = (JSAMPLE)pixcode;
} }
} }
@@ -552,7 +552,7 @@ quantize_ord_dither(j_decompress_ptr cinfo, JSAMPARRAY input_buf,
* required amount of padding. * required amount of padding.
*/ */
*output_ptr += *output_ptr +=
colorindex_ci[GETJSAMPLE(*input_ptr) + dither[col_index]]; colorindex_ci[*input_ptr + dither[col_index]];
input_ptr += nc; input_ptr += nc;
output_ptr++; output_ptr++;
col_index = (col_index + 1) & ODITHER_MASK; col_index = (col_index + 1) & ODITHER_MASK;
@@ -595,12 +595,9 @@ quantize3_ord_dither(j_decompress_ptr cinfo, JSAMPARRAY input_buf,
col_index = 0; col_index = 0;
for (col = width; col > 0; col--) { for (col = width; col > 0; col--) {
pixcode = pixcode = colorindex0[(*input_ptr++) + dither0[col_index]];
GETJSAMPLE(colorindex0[GETJSAMPLE(*input_ptr++) + dither0[col_index]]); pixcode += colorindex1[(*input_ptr++) + dither1[col_index]];
pixcode += pixcode += colorindex2[(*input_ptr++) + dither2[col_index]];
GETJSAMPLE(colorindex1[GETJSAMPLE(*input_ptr++) + dither1[col_index]]);
pixcode +=
GETJSAMPLE(colorindex2[GETJSAMPLE(*input_ptr++) + dither2[col_index]]);
*output_ptr++ = (JSAMPLE)pixcode; *output_ptr++ = (JSAMPLE)pixcode;
col_index = (col_index + 1) & ODITHER_MASK; col_index = (col_index + 1) & ODITHER_MASK;
} }
@@ -677,15 +674,15 @@ quantize_fs_dither(j_decompress_ptr cinfo, JSAMPARRAY input_buf,
* The maximum error is +- MAXJSAMPLE; this sets the required size * The maximum error is +- MAXJSAMPLE; this sets the required size
* of the range_limit array. * of the range_limit array.
*/ */
cur += GETJSAMPLE(*input_ptr); cur += *input_ptr;
cur = GETJSAMPLE(range_limit[cur]); cur = range_limit[cur];
/* Select output value, accumulate into output code for this pixel */ /* Select output value, accumulate into output code for this pixel */
pixcode = GETJSAMPLE(colorindex_ci[cur]); pixcode = colorindex_ci[cur];
*output_ptr += (JSAMPLE)pixcode; *output_ptr += (JSAMPLE)pixcode;
/* Compute actual representation error at this pixel */ /* Compute actual representation error at this pixel */
/* Note: we can do this even though we don't have the final */ /* Note: we can do this even though we don't have the final */
/* pixel code, because the colormap is orthogonal. */ /* pixel code, because the colormap is orthogonal. */
cur -= GETJSAMPLE(colormap_ci[pixcode]); cur -= colormap_ci[pixcode];
/* Compute error fractions to be propagated to adjacent pixels. /* Compute error fractions to be propagated to adjacent pixels.
* Add these into the running sums, and simultaneously shift the * Add these into the running sums, and simultaneously shift the
* next-line error sums left by 1 column. * next-line error sums left by 1 column.


@@ -215,9 +215,9 @@ prescan_quantize(j_decompress_ptr cinfo, JSAMPARRAY input_buf,
ptr = input_buf[row]; ptr = input_buf[row];
for (col = width; col > 0; col--) { for (col = width; col > 0; col--) {
/* get pixel value and index into the histogram */ /* get pixel value and index into the histogram */
histp = &histogram[GETJSAMPLE(ptr[0]) >> C0_SHIFT] histp = &histogram[ptr[0] >> C0_SHIFT]
[GETJSAMPLE(ptr[1]) >> C1_SHIFT] [ptr[1] >> C1_SHIFT]
[GETJSAMPLE(ptr[2]) >> C2_SHIFT]; [ptr[2] >> C2_SHIFT];
/* increment, check for overflow and undo increment if so. */ /* increment, check for overflow and undo increment if so. */
if (++(*histp) <= 0) if (++(*histp) <= 0)
(*histp)--; (*histp)--;
@@ -665,7 +665,7 @@ find_nearby_colors(j_decompress_ptr cinfo, int minc0, int minc1, int minc2,
for (i = 0; i < numcolors; i++) { for (i = 0; i < numcolors; i++) {
/* We compute the squared-c0-distance term, then add in the other two. */ /* We compute the squared-c0-distance term, then add in the other two. */
x = GETJSAMPLE(cinfo->colormap[0][i]); x = cinfo->colormap[0][i];
if (x < minc0) { if (x < minc0) {
tdist = (x - minc0) * C0_SCALE; tdist = (x - minc0) * C0_SCALE;
min_dist = tdist * tdist; min_dist = tdist * tdist;
@@ -688,7 +688,7 @@ find_nearby_colors(j_decompress_ptr cinfo, int minc0, int minc1, int minc2,
} }
} }
x = GETJSAMPLE(cinfo->colormap[1][i]); x = cinfo->colormap[1][i];
if (x < minc1) { if (x < minc1) {
tdist = (x - minc1) * C1_SCALE; tdist = (x - minc1) * C1_SCALE;
min_dist += tdist * tdist; min_dist += tdist * tdist;
@@ -710,7 +710,7 @@ find_nearby_colors(j_decompress_ptr cinfo, int minc0, int minc1, int minc2,
} }
} }
x = GETJSAMPLE(cinfo->colormap[2][i]); x = cinfo->colormap[2][i];
if (x < minc2) { if (x < minc2) {
tdist = (x - minc2) * C2_SCALE; tdist = (x - minc2) * C2_SCALE;
min_dist += tdist * tdist; min_dist += tdist * tdist;
@@ -788,13 +788,13 @@ find_best_colors(j_decompress_ptr cinfo, int minc0, int minc1, int minc2,
#define STEP_C2 ((1 << C2_SHIFT) * C2_SCALE) #define STEP_C2 ((1 << C2_SHIFT) * C2_SCALE)
for (i = 0; i < numcolors; i++) { for (i = 0; i < numcolors; i++) {
icolor = GETJSAMPLE(colorlist[i]); icolor = colorlist[i];
/* Compute (square of) distance from minc0/c1/c2 to this color */ /* Compute (square of) distance from minc0/c1/c2 to this color */
inc0 = (minc0 - GETJSAMPLE(cinfo->colormap[0][icolor])) * C0_SCALE; inc0 = (minc0 - cinfo->colormap[0][icolor]) * C0_SCALE;
dist0 = inc0 * inc0; dist0 = inc0 * inc0;
inc1 = (minc1 - GETJSAMPLE(cinfo->colormap[1][icolor])) * C1_SCALE; inc1 = (minc1 - cinfo->colormap[1][icolor]) * C1_SCALE;
dist0 += inc1 * inc1; dist0 += inc1 * inc1;
inc2 = (minc2 - GETJSAMPLE(cinfo->colormap[2][icolor])) * C2_SCALE; inc2 = (minc2 - cinfo->colormap[2][icolor]) * C2_SCALE;
dist0 += inc2 * inc2; dist0 += inc2 * inc2;
/* Form the initial difference increments */ /* Form the initial difference increments */
inc0 = inc0 * (2 * STEP_C0) + STEP_C0 * STEP_C0; inc0 = inc0 * (2 * STEP_C0) + STEP_C0 * STEP_C0;
@@ -879,7 +879,7 @@ fill_inverse_cmap(j_decompress_ptr cinfo, int c0, int c1, int c2)
for (ic1 = 0; ic1 < BOX_C1_ELEMS; ic1++) { for (ic1 = 0; ic1 < BOX_C1_ELEMS; ic1++) {
cachep = &histogram[c0 + ic0][c1 + ic1][c2]; cachep = &histogram[c0 + ic0][c1 + ic1][c2];
for (ic2 = 0; ic2 < BOX_C2_ELEMS; ic2++) { for (ic2 = 0; ic2 < BOX_C2_ELEMS; ic2++) {
*cachep++ = (histcell)(GETJSAMPLE(*cptr++) + 1); *cachep++ = (histcell)((*cptr++) + 1);
} }
} }
} }
@@ -909,9 +909,9 @@ pass2_no_dither(j_decompress_ptr cinfo, JSAMPARRAY input_buf,
outptr = output_buf[row]; outptr = output_buf[row];
for (col = width; col > 0; col--) { for (col = width; col > 0; col--) {
/* get pixel value and index into the cache */ /* get pixel value and index into the cache */
c0 = GETJSAMPLE(*inptr++) >> C0_SHIFT; c0 = (*inptr++) >> C0_SHIFT;
c1 = GETJSAMPLE(*inptr++) >> C1_SHIFT; c1 = (*inptr++) >> C1_SHIFT;
c2 = GETJSAMPLE(*inptr++) >> C2_SHIFT; c2 = (*inptr++) >> C2_SHIFT;
cachep = &histogram[c0][c1][c2]; cachep = &histogram[c0][c1][c2];
/* If we have not seen this color before, find nearest colormap entry */ /* If we have not seen this color before, find nearest colormap entry */
/* and update the cache */ /* and update the cache */
@@ -996,12 +996,12 @@ pass2_fs_dither(j_decompress_ptr cinfo, JSAMPARRAY input_buf,
* The maximum error is +- MAXJSAMPLE (or less with error limiting); * The maximum error is +- MAXJSAMPLE (or less with error limiting);
* this sets the required size of the range_limit array. * this sets the required size of the range_limit array.
*/ */
cur0 += GETJSAMPLE(inptr[0]); cur0 += inptr[0];
cur1 += GETJSAMPLE(inptr[1]); cur1 += inptr[1];
cur2 += GETJSAMPLE(inptr[2]); cur2 += inptr[2];
cur0 = GETJSAMPLE(range_limit[cur0]); cur0 = range_limit[cur0];
cur1 = GETJSAMPLE(range_limit[cur1]); cur1 = range_limit[cur1];
cur2 = GETJSAMPLE(range_limit[cur2]); cur2 = range_limit[cur2];
/* Index into the cache with adjusted pixel value */ /* Index into the cache with adjusted pixel value */
cachep = cachep =
&histogram[cur0 >> C0_SHIFT][cur1 >> C1_SHIFT][cur2 >> C2_SHIFT]; &histogram[cur0 >> C0_SHIFT][cur1 >> C1_SHIFT][cur2 >> C2_SHIFT];
@@ -1015,9 +1015,9 @@ pass2_fs_dither(j_decompress_ptr cinfo, JSAMPARRAY input_buf,
register int pixcode = *cachep - 1; register int pixcode = *cachep - 1;
*outptr = (JSAMPLE)pixcode; *outptr = (JSAMPLE)pixcode;
/* Compute representation error for this pixel */ /* Compute representation error for this pixel */
cur0 -= GETJSAMPLE(colormap0[pixcode]); cur0 -= colormap0[pixcode];
cur1 -= GETJSAMPLE(colormap1[pixcode]); cur1 -= colormap1[pixcode];
cur2 -= GETJSAMPLE(colormap2[pixcode]); cur2 -= colormap2[pixcode];
} }
/* Compute error fractions to be propagated to adjacent pixels. /* Compute error fractions to be propagated to adjacent pixels.
* Add these into the running sums, and simultaneously shift the * Add these into the running sums, and simultaneously shift the


@@ -4,6 +4,7 @@
* Copyright 2009 Pierre Ossman <ossman@cendio.se> for Cendio AB * Copyright 2009 Pierre Ossman <ossman@cendio.se> for Cendio AB
* Copyright (C) 2011, 2014, D. R. Commander. * Copyright (C) 2011, 2014, D. R. Commander.
* Copyright (C) 2015-2016, 2018, Matthieu Darbois. * Copyright (C) 2015-2016, 2018, Matthieu Darbois.
* Copyright (C) 2020, Arm Limited.
* *
* Based on the x86 SIMD extension for IJG JPEG library, * Based on the x86 SIMD extension for IJG JPEG library,
* Copyright (C) 1999-2006, MIYASAKA Masaru. * Copyright (C) 1999-2006, MIYASAKA Masaru.
@@ -75,6 +76,7 @@ EXTERN(void) jsimd_int_upsample(j_decompress_ptr cinfo,
EXTERN(int) jsimd_can_h2v2_fancy_upsample(void); EXTERN(int) jsimd_can_h2v2_fancy_upsample(void);
EXTERN(int) jsimd_can_h2v1_fancy_upsample(void); EXTERN(int) jsimd_can_h2v1_fancy_upsample(void);
EXTERN(int) jsimd_can_h1v2_fancy_upsample(void);
EXTERN(void) jsimd_h2v2_fancy_upsample(j_decompress_ptr cinfo, EXTERN(void) jsimd_h2v2_fancy_upsample(j_decompress_ptr cinfo,
jpeg_component_info *compptr, jpeg_component_info *compptr,
@@ -84,6 +86,10 @@ EXTERN(void) jsimd_h2v1_fancy_upsample(j_decompress_ptr cinfo,
jpeg_component_info *compptr, jpeg_component_info *compptr,
JSAMPARRAY input_data, JSAMPARRAY input_data,
JSAMPARRAY *output_data_ptr); JSAMPARRAY *output_data_ptr);
EXTERN(void) jsimd_h1v2_fancy_upsample(j_decompress_ptr cinfo,
jpeg_component_info *compptr,
JSAMPARRAY input_data,
JSAMPARRAY *output_data_ptr);
EXTERN(int) jsimd_can_h2v2_merged_upsample(void); EXTERN(int) jsimd_can_h2v2_merged_upsample(void);
EXTERN(int) jsimd_can_h2v1_merged_upsample(void); EXTERN(int) jsimd_can_h2v1_merged_upsample(void);


@@ -4,6 +4,7 @@
* Copyright 2009 Pierre Ossman <ossman@cendio.se> for Cendio AB * Copyright 2009 Pierre Ossman <ossman@cendio.se> for Cendio AB
* Copyright (C) 2009-2011, 2014, D. R. Commander. * Copyright (C) 2009-2011, 2014, D. R. Commander.
* Copyright (C) 2015-2016, 2018, Matthieu Darbois. * Copyright (C) 2015-2016, 2018, Matthieu Darbois.
* Copyright (C) 2020, Arm Limited.
* *
* Based on the x86 SIMD extension for IJG JPEG library, * Based on the x86 SIMD extension for IJG JPEG library,
* Copyright (C) 1999-2006, MIYASAKA Masaru. * Copyright (C) 1999-2006, MIYASAKA Masaru.
@@ -169,6 +170,12 @@ jsimd_can_h2v1_fancy_upsample(void)
return 0; return 0;
} }
GLOBAL(int)
jsimd_can_h1v2_fancy_upsample(void)
{
return 0;
}
GLOBAL(void) GLOBAL(void)
jsimd_h2v2_fancy_upsample(j_decompress_ptr cinfo, jpeg_component_info *compptr, jsimd_h2v2_fancy_upsample(j_decompress_ptr cinfo, jpeg_component_info *compptr,
JSAMPARRAY input_data, JSAMPARRAY *output_data_ptr) JSAMPARRAY input_data, JSAMPARRAY *output_data_ptr)
@@ -181,6 +188,12 @@ jsimd_h2v1_fancy_upsample(j_decompress_ptr cinfo, jpeg_component_info *compptr,
{ {
} }
GLOBAL(void)
jsimd_h1v2_fancy_upsample(j_decompress_ptr cinfo, jpeg_component_info *compptr,
JSAMPARRAY input_data, JSAMPARRAY *output_data_ptr)
{
}
GLOBAL(int) GLOBAL(int)
jsimd_can_h2v2_merged_upsample(void) jsimd_can_h2v2_merged_upsample(void)
{ {


@@ -2,9 +2,9 @@
* jversion.h * jversion.h
* *
* This file was part of the Independent JPEG Group's software: * This file was part of the Independent JPEG Group's software:
* Copyright (C) 1991-2012, Thomas G. Lane, Guido Vollbeding. * Copyright (C) 1991-2020, Thomas G. Lane, Guido Vollbeding.
* libjpeg-turbo Modifications: * libjpeg-turbo Modifications:
* Copyright (C) 2010, 2012-2020, D. R. Commander. * Copyright (C) 2010, 2012-2021, D. R. Commander.
* mozjpeg Modifications: * mozjpeg Modifications:
* Copyright (C) 2014, Mozilla Corporation. * Copyright (C) 2014, Mozilla Corporation.
* For conditions of distribution and use, see the accompanying README file. * For conditions of distribution and use, see the accompanying README file.
@@ -38,9 +38,9 @@
*/ */
#define JCOPYRIGHT \ #define JCOPYRIGHT \
"Copyright (C) 2009-2020 D. R. Commander\n" \ "Copyright (C) 2009-2021 D. R. Commander\n" \
"Copyright (C) 2015, 2020 Google, Inc.\n" \ "Copyright (C) 2015, 2020 Google, Inc.\n" \
"Copyright (C) 2019 Arm Limited\n" \ "Copyright (C) 2019-2020 Arm Limited\n" \
"Copyright (C) 2015-2016, 2018 Matthieu Darbois\n" \ "Copyright (C) 2015-2016, 2018 Matthieu Darbois\n" \
"Copyright (C) 2011-2016 Siarhei Siamashka\n" \ "Copyright (C) 2011-2016 Siarhei Siamashka\n" \
"Copyright (C) 2015 Intel Corporation\n" \ "Copyright (C) 2015 Intel Corporation\n" \
@@ -49,7 +49,7 @@
"Copyright (C) 2009, 2012 Pierre Ossman for Cendio AB\n" \ "Copyright (C) 2009, 2012 Pierre Ossman for Cendio AB\n" \
"Copyright (C) 2009-2011 Nokia Corporation and/or its subsidiary(-ies)\n" \ "Copyright (C) 2009-2011 Nokia Corporation and/or its subsidiary(-ies)\n" \
"Copyright (C) 1999-2006 MIYASAKA Masaru\n" \ "Copyright (C) 1999-2006 MIYASAKA Masaru\n" \
"Copyright (C) 1991-2017 Thomas G. Lane, Guido Vollbeding" "Copyright (C) 1991-2020 Thomas G. Lane, Guido Vollbeding"
#define JCOPYRIGHT_SHORT \ #define JCOPYRIGHT_SHORT \
"Copyright (C) 1991-2020 The libjpeg-turbo Project and many others" "Copyright (C) 1991-2021 The libjpeg-turbo Project and many others"

rdbmp.c

@@ -12,7 +12,7 @@
* *
* This file contains routines to read input images in Microsoft "BMP" * This file contains routines to read input images in Microsoft "BMP"
* format (MS Windows 3.x, OS/2 1.x, and OS/2 2.x flavors). * format (MS Windows 3.x, OS/2 1.x, and OS/2 2.x flavors).
* Currently, only 8-bit and 24-bit images are supported, not 1-bit or * Currently, only 8-, 24-, and 32-bit images are supported, not 1-bit or
* 4-bit (feeding such low-depth images into JPEG would be silly anyway). * 4-bit (feeding such low-depth images into JPEG would be silly anyway).
* Also, we don't support RLE-compressed files. * Also, we don't support RLE-compressed files.
* *
@@ -34,18 +34,8 @@
/* Macros to deal with unsigned chars as efficiently as compiler allows */ /* Macros to deal with unsigned chars as efficiently as compiler allows */
#ifdef HAVE_UNSIGNED_CHAR
typedef unsigned char U_CHAR; typedef unsigned char U_CHAR;
#define UCH(x) ((int)(x)) #define UCH(x) ((int)(x))
#else /* !HAVE_UNSIGNED_CHAR */
#ifdef __CHAR_UNSIGNED__
typedef char U_CHAR;
#define UCH(x) ((int)(x))
#else
typedef char U_CHAR;
#define UCH(x) ((int)(x) & 0xFF)
#endif
#endif /* HAVE_UNSIGNED_CHAR */
#define ReadOK(file, buffer, len) \ #define ReadOK(file, buffer, len) \
@@ -71,7 +61,7 @@ typedef struct _bmp_source_struct {
JDIMENSION source_row; /* Current source row number */ JDIMENSION source_row; /* Current source row number */
JDIMENSION row_width; /* Physical width of scanlines in file */ JDIMENSION row_width; /* Physical width of scanlines in file */
int bits_per_pixel; /* remembers 8- or 24-bit format */ int bits_per_pixel; /* remembers 8-, 24-, or 32-bit format */
int cmap_length; /* colormap length */ int cmap_length; /* colormap length */
boolean use_inversion_array; /* TRUE = preload the whole image, which is boolean use_inversion_array; /* TRUE = preload the whole image, which is
@@ -179,14 +169,14 @@ get_8bit_row(j_compress_ptr cinfo, cjpeg_source_ptr sinfo)
outptr = source->pub.buffer[0]; outptr = source->pub.buffer[0];
if (cinfo->in_color_space == JCS_GRAYSCALE) { if (cinfo->in_color_space == JCS_GRAYSCALE) {
for (col = cinfo->image_width; col > 0; col--) { for (col = cinfo->image_width; col > 0; col--) {
t = GETJSAMPLE(*inptr++); t = *inptr++;
if (t >= cmaplen) if (t >= cmaplen)
ERREXIT(cinfo, JERR_BMP_OUTOFRANGE); ERREXIT(cinfo, JERR_BMP_OUTOFRANGE);
*outptr++ = colormap[0][t]; *outptr++ = colormap[0][t];
} }
} else if (cinfo->in_color_space == JCS_CMYK) { } else if (cinfo->in_color_space == JCS_CMYK) {
for (col = cinfo->image_width; col > 0; col--) { for (col = cinfo->image_width; col > 0; col--) {
t = GETJSAMPLE(*inptr++); t = *inptr++;
if (t >= cmaplen) if (t >= cmaplen)
ERREXIT(cinfo, JERR_BMP_OUTOFRANGE); ERREXIT(cinfo, JERR_BMP_OUTOFRANGE);
rgb_to_cmyk(colormap[0][t], colormap[1][t], colormap[2][t], outptr, rgb_to_cmyk(colormap[0][t], colormap[1][t], colormap[2][t], outptr,
@@ -202,7 +192,7 @@ get_8bit_row(j_compress_ptr cinfo, cjpeg_source_ptr sinfo)
if (aindex >= 0) { if (aindex >= 0) {
for (col = cinfo->image_width; col > 0; col--) { for (col = cinfo->image_width; col > 0; col--) {
t = GETJSAMPLE(*inptr++); t = *inptr++;
if (t >= cmaplen) if (t >= cmaplen)
ERREXIT(cinfo, JERR_BMP_OUTOFRANGE); ERREXIT(cinfo, JERR_BMP_OUTOFRANGE);
outptr[rindex] = colormap[0][t]; outptr[rindex] = colormap[0][t];
@@ -213,7 +203,7 @@ get_8bit_row(j_compress_ptr cinfo, cjpeg_source_ptr sinfo)
} }
} else { } else {
for (col = cinfo->image_width; col > 0; col--) { for (col = cinfo->image_width; col > 0; col--) {
t = GETJSAMPLE(*inptr++); t = *inptr++;
if (t >= cmaplen) if (t >= cmaplen)
ERREXIT(cinfo, JERR_BMP_OUTOFRANGE); ERREXIT(cinfo, JERR_BMP_OUTOFRANGE);
outptr[rindex] = colormap[0][t]; outptr[rindex] = colormap[0][t];
@@ -258,7 +248,6 @@ get_24bit_row(j_compress_ptr cinfo, cjpeg_source_ptr sinfo)
MEMCOPY(outptr, inptr, source->row_width); MEMCOPY(outptr, inptr, source->row_width);
} else if (cinfo->in_color_space == JCS_CMYK) { } else if (cinfo->in_color_space == JCS_CMYK) {
for (col = cinfo->image_width; col > 0; col--) { for (col = cinfo->image_width; col > 0; col--) {
/* can omit GETJSAMPLE() safely */
JSAMPLE b = *inptr++, g = *inptr++, r = *inptr++; JSAMPLE b = *inptr++, g = *inptr++, r = *inptr++;
rgb_to_cmyk(r, g, b, outptr, outptr + 1, outptr + 2, outptr + 3); rgb_to_cmyk(r, g, b, outptr, outptr + 1, outptr + 2, outptr + 3);
outptr += 4; outptr += 4;
@@ -272,7 +261,7 @@ get_24bit_row(j_compress_ptr cinfo, cjpeg_source_ptr sinfo)
if (aindex >= 0) { if (aindex >= 0) {
for (col = cinfo->image_width; col > 0; col--) { for (col = cinfo->image_width; col > 0; col--) {
outptr[bindex] = *inptr++; /* can omit GETJSAMPLE() safely */ outptr[bindex] = *inptr++;
outptr[gindex] = *inptr++; outptr[gindex] = *inptr++;
outptr[rindex] = *inptr++; outptr[rindex] = *inptr++;
outptr[aindex] = 0xFF; outptr[aindex] = 0xFF;
@@ -280,7 +269,7 @@ get_24bit_row(j_compress_ptr cinfo, cjpeg_source_ptr sinfo)
} }
} else { } else {
for (col = cinfo->image_width; col > 0; col--) { for (col = cinfo->image_width; col > 0; col--) {
outptr[bindex] = *inptr++; /* can omit GETJSAMPLE() safely */ outptr[bindex] = *inptr++;
outptr[gindex] = *inptr++; outptr[gindex] = *inptr++;
outptr[rindex] = *inptr++; outptr[rindex] = *inptr++;
outptr += ps; outptr += ps;
@@ -323,7 +312,6 @@ get_32bit_row(j_compress_ptr cinfo, cjpeg_source_ptr sinfo)
MEMCOPY(outptr, inptr, source->row_width); MEMCOPY(outptr, inptr, source->row_width);
} else if (cinfo->in_color_space == JCS_CMYK) { } else if (cinfo->in_color_space == JCS_CMYK) {
for (col = cinfo->image_width; col > 0; col--) { for (col = cinfo->image_width; col > 0; col--) {
/* can omit GETJSAMPLE() safely */
JSAMPLE b = *inptr++, g = *inptr++, r = *inptr++; JSAMPLE b = *inptr++, g = *inptr++, r = *inptr++;
rgb_to_cmyk(r, g, b, outptr, outptr + 1, outptr + 2, outptr + 3); rgb_to_cmyk(r, g, b, outptr, outptr + 1, outptr + 2, outptr + 3);
inptr++; /* skip the 4th byte (Alpha channel) */ inptr++; /* skip the 4th byte (Alpha channel) */
@@ -338,7 +326,7 @@ get_32bit_row(j_compress_ptr cinfo, cjpeg_source_ptr sinfo)
if (aindex >= 0) { if (aindex >= 0) {
for (col = cinfo->image_width; col > 0; col--) { for (col = cinfo->image_width; col > 0; col--) {
outptr[bindex] = *inptr++; /* can omit GETJSAMPLE() safely */ outptr[bindex] = *inptr++;
outptr[gindex] = *inptr++; outptr[gindex] = *inptr++;
outptr[rindex] = *inptr++; outptr[rindex] = *inptr++;
outptr[aindex] = *inptr++; outptr[aindex] = *inptr++;
@@ -346,7 +334,7 @@ get_32bit_row(j_compress_ptr cinfo, cjpeg_source_ptr sinfo)
} }
} else { } else {
for (col = cinfo->image_width; col > 0; col--) { for (col = cinfo->image_width; col > 0; col--) {
outptr[bindex] = *inptr++; /* can omit GETJSAMPLE() safely */ outptr[bindex] = *inptr++;
outptr[gindex] = *inptr++; outptr[gindex] = *inptr++;
outptr[rindex] = *inptr++; outptr[rindex] = *inptr++;
inptr++; /* skip the 4th byte (Alpha channel) */ inptr++; /* skip the 4th byte (Alpha channel) */
@@ -481,7 +469,9 @@ start_input_bmp(j_compress_ptr cinfo, cjpeg_source_ptr sinfo)
TRACEMS2(cinfo, 1, JTRC_BMP_OS2_MAPPED, biWidth, biHeight); TRACEMS2(cinfo, 1, JTRC_BMP_OS2_MAPPED, biWidth, biHeight);
break; break;
case 24: /* RGB image */ case 24: /* RGB image */
TRACEMS2(cinfo, 1, JTRC_BMP_OS2, biWidth, biHeight); case 32: /* RGB image + Alpha channel */
TRACEMS3(cinfo, 1, JTRC_BMP_OS2, biWidth, biHeight,
source->bits_per_pixel);
break; break;
default: default:
ERREXIT(cinfo, JERR_BMP_BADDEPTH); ERREXIT(cinfo, JERR_BMP_BADDEPTH);
@@ -508,10 +498,8 @@ start_input_bmp(j_compress_ptr cinfo, cjpeg_source_ptr sinfo)
TRACEMS2(cinfo, 1, JTRC_BMP_MAPPED, biWidth, biHeight); TRACEMS2(cinfo, 1, JTRC_BMP_MAPPED, biWidth, biHeight);
break; break;
case 24: /* RGB image */ case 24: /* RGB image */
TRACEMS2(cinfo, 1, JTRC_BMP, biWidth, biHeight);
break;
case 32: /* RGB image + Alpha channel */ case 32: /* RGB image + Alpha channel */
TRACEMS2(cinfo, 1, JTRC_BMP, biWidth, biHeight); TRACEMS3(cinfo, 1, JTRC_BMP, biWidth, biHeight, source->bits_per_pixel);
break; break;
default: default:
ERREXIT(cinfo, JERR_BMP_BADDEPTH); ERREXIT(cinfo, JERR_BMP_BADDEPTH);


@@ -54,9 +54,8 @@ add_map_entry(j_decompress_ptr cinfo, int R, int G, int B)
/* Check for duplicate color. */ /* Check for duplicate color. */
for (index = 0; index < ncolors; index++) { for (index = 0; index < ncolors; index++) {
if (GETJSAMPLE(colormap0[index]) == R && if (colormap0[index] == R && colormap1[index] == G &&
GETJSAMPLE(colormap1[index]) == G && colormap2[index] == B)
GETJSAMPLE(colormap2[index]) == B)
return; /* color is already in map */ return; /* color is already in map */
} }

rdgif.c

@@ -1,29 +1,663 @@
/* /*
* rdgif.c * rdgif.c
* *
* This file was part of the Independent JPEG Group's software:
* Copyright (C) 1991-1997, Thomas G. Lane. * Copyright (C) 1991-1997, Thomas G. Lane.
* This file is part of the Independent JPEG Group's software. * Modified 2019 by Guido Vollbeding.
* libjpeg-turbo Modifications:
* Copyright (C) 2021, D. R. Commander.
* For conditions of distribution and use, see the accompanying README.ijg * For conditions of distribution and use, see the accompanying README.ijg
* file. * file.
* *
* This file contains routines to read input images in GIF format. * This file contains routines to read input images in GIF format.
* *
***************************************************************************** * These routines may need modification for non-Unix environments or
* NOTE: to avoid entanglements with Unisys' patent on LZW compression, * * specialized applications. As they stand, they assume input from
* the ability to read GIF files has been removed from the IJG distribution. * * an ordinary stdio stream. They further assume that reading begins
* Sorry about that. * * at the start of the file; start_input may need work if the
***************************************************************************** * user interface has already read some data (e.g., to determine that
* * the file is indeed GIF format).
* We are required to state that */
* "The Graphics Interchange Format(c) is the Copyright property of
* CompuServe Incorporated. GIF(sm) is a Service Mark property of /*
* CompuServe Incorporated." * This code is loosely based on giftoppm from the PBMPLUS distribution
* of Feb. 1991. That file contains the following copyright notice:
* +-------------------------------------------------------------------+
* | Copyright 1990, David Koblas. |
* | Permission to use, copy, modify, and distribute this software |
* | and its documentation for any purpose and without fee is hereby |
* | granted, provided that the above copyright notice appear in all |
* | copies and that both that copyright notice and this permission |
* | notice appear in supporting documentation. This software is |
* | provided "as is" without express or implied warranty. |
* +-------------------------------------------------------------------+
*/ */
#include "cdjpeg.h" /* Common decls for cjpeg/djpeg applications */ #include "cdjpeg.h" /* Common decls for cjpeg/djpeg applications */
#ifdef GIF_SUPPORTED #ifdef GIF_SUPPORTED
/* Macros to deal with unsigned chars as efficiently as compiler allows */
typedef unsigned char U_CHAR;
#define UCH(x) ((int)(x))
#define ReadOK(file, buffer, len) \
(JFREAD(file, buffer, len) == ((size_t)(len)))
#define MAXCOLORMAPSIZE 256 /* max # of colors in a GIF colormap */
#define NUMCOLORS 3 /* # of colors */
#define CM_RED 0 /* color component numbers */
#define CM_GREEN 1
#define CM_BLUE 2
#define MAX_LZW_BITS 12 /* maximum LZW code size */
#define LZW_TABLE_SIZE (1 << MAX_LZW_BITS) /* # of possible LZW symbols */
/* Macros for extracting header data --- note we assume chars may be signed */
#define LM_to_uint(array, offset) \
((unsigned int)UCH(array[offset]) + \
(((unsigned int)UCH(array[offset + 1])) << 8))
#define BitSet(byte, bit) ((byte) & (bit))
#define INTERLACE 0x40 /* mask for bit signifying interlaced image */
#define COLORMAPFLAG 0x80 /* mask for bit signifying colormap presence */
/*
* LZW decompression tables look like this:
* symbol_head[K] = prefix symbol of any LZW symbol K (0..LZW_TABLE_SIZE-1)
* symbol_tail[K] = suffix byte of any LZW symbol K (0..LZW_TABLE_SIZE-1)
* Note that entries 0..end_code of the above tables are not used,
* since those symbols represent raw bytes or special codes.
*
* The stack represents the not-yet-used expansion of the last LZW symbol.
* In the worst case, a symbol could expand to as many bytes as there are
* LZW symbols, so we allocate LZW_TABLE_SIZE bytes for the stack.
* (This is conservative since that number includes the raw-byte symbols.)
*/
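/*
 * Worked example (illustration only, assumed values): with
 * input_code_size = 2 we get clear_code = 4, end_code = 5, and the first
 * free table slot 6.  If the stream has defined symbol 6 with
 * symbol_head[6] = 1 and symbol_tail[6] = 3, then reading symbol 6
 * pushes the tail byte 3 onto symbol_stack, follows symbol_head[6] back
 * to the raw byte 1, and LZWReadByte() returns 1 immediately and 3 on
 * the next call (popped from the stack).
 */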
/* Private version of data source object */
typedef struct {
struct cjpeg_source_struct pub; /* public fields */
j_compress_ptr cinfo; /* back link saves passing separate parm */
JSAMPARRAY colormap; /* GIF colormap (converted to my format) */
/* State for GetCode and LZWReadByte */
U_CHAR code_buf[256 + 4]; /* current input data block */
int last_byte; /* # of bytes in code_buf */
int last_bit; /* # of bits in code_buf */
int cur_bit; /* next bit index to read */
boolean first_time; /* flags first call to GetCode */
boolean out_of_blocks; /* TRUE if hit terminator data block */
int input_code_size; /* codesize given in GIF file */
int clear_code, end_code; /* values for Clear and End codes */
int code_size; /* current actual code size */
int limit_code; /* 2^code_size */
int max_code; /* first unused code value */
/* Private state for LZWReadByte */
int oldcode; /* previous LZW symbol */
int firstcode; /* first byte of oldcode's expansion */
/* LZW symbol table and expansion stack */
UINT16 *symbol_head; /* => table of prefix symbols */
UINT8 *symbol_tail; /* => table of suffix bytes */
UINT8 *symbol_stack; /* => stack for symbol expansions */
UINT8 *sp; /* stack pointer */
/* State for interlaced image processing */
boolean is_interlaced; /* TRUE if have interlaced image */
jvirt_sarray_ptr interlaced_image; /* full image in interlaced order */
JDIMENSION cur_row_number; /* need to know actual row number */
JDIMENSION pass2_offset; /* # of pixel rows in pass 1 */
JDIMENSION pass3_offset; /* # of pixel rows in passes 1&2 */
JDIMENSION pass4_offset; /* # of pixel rows in passes 1,2,3 */
} gif_source_struct;
typedef gif_source_struct *gif_source_ptr;
/* Forward declarations */
METHODDEF(JDIMENSION) get_pixel_rows(j_compress_ptr cinfo,
cjpeg_source_ptr sinfo);
METHODDEF(JDIMENSION) load_interlaced_image(j_compress_ptr cinfo,
cjpeg_source_ptr sinfo);
METHODDEF(JDIMENSION) get_interlaced_row(j_compress_ptr cinfo,
cjpeg_source_ptr sinfo);
LOCAL(int)
ReadByte(gif_source_ptr sinfo)
/* Read next byte from GIF file */
{
register FILE *infile = sinfo->pub.input_file;
register int c;
if ((c = getc(infile)) == EOF)
ERREXIT(sinfo->cinfo, JERR_INPUT_EOF);
return c;
}
LOCAL(int)
GetDataBlock(gif_source_ptr sinfo, U_CHAR *buf)
/* Read a GIF data block, which has a leading count byte */
/* A zero-length block marks the end of a data block sequence */
{
int count;
count = ReadByte(sinfo);
if (count > 0) {
if (!ReadOK(sinfo->pub.input_file, buf, count))
ERREXIT(sinfo->cinfo, JERR_INPUT_EOF);
}
return count;
}
LOCAL(void)
SkipDataBlocks(gif_source_ptr sinfo)
/* Skip a series of data blocks, until a block terminator is found */
{
U_CHAR buf[256];
while (GetDataBlock(sinfo, buf) > 0)
/* skip */;
}
LOCAL(void)
ReInitLZW(gif_source_ptr sinfo)
/* (Re)initialize LZW state; shared code for startup and Clear processing */
{
sinfo->code_size = sinfo->input_code_size + 1;
sinfo->limit_code = sinfo->clear_code << 1; /* 2^code_size */
sinfo->max_code = sinfo->clear_code + 2; /* first unused code value */
sinfo->sp = sinfo->symbol_stack; /* init stack to empty */
}
LOCAL(void)
InitLZWCode(gif_source_ptr sinfo)
/* Initialize for a series of LZWReadByte (and hence GetCode) calls */
{
/* GetCode initialization */
sinfo->last_byte = 2; /* make safe to "recopy last two bytes" */
sinfo->code_buf[0] = 0;
sinfo->code_buf[1] = 0;
sinfo->last_bit = 0; /* nothing in the buffer */
sinfo->cur_bit = 0; /* force buffer load on first call */
sinfo->first_time = TRUE;
sinfo->out_of_blocks = FALSE;
/* LZWReadByte initialization: */
/* compute special code values (note that these do not change later) */
sinfo->clear_code = 1 << sinfo->input_code_size;
sinfo->end_code = sinfo->clear_code + 1;
ReInitLZW(sinfo);
}
LOCAL(int)
GetCode(gif_source_ptr sinfo)
/* Fetch the next code_size bits from the GIF data */
/* We assume code_size is less than 16 */
{
register int accum;
int offs, count;
while (sinfo->cur_bit + sinfo->code_size > sinfo->last_bit) {
/* Time to reload the buffer */
/* First time, share code with Clear case */
if (sinfo->first_time) {
sinfo->first_time = FALSE;
return sinfo->clear_code;
}
if (sinfo->out_of_blocks) {
WARNMS(sinfo->cinfo, JWRN_GIF_NOMOREDATA);
return sinfo->end_code; /* fake something useful */
}
/* preserve last two bytes of what we have -- assume code_size <= 16 */
sinfo->code_buf[0] = sinfo->code_buf[sinfo->last_byte-2];
sinfo->code_buf[1] = sinfo->code_buf[sinfo->last_byte-1];
/* Load more bytes; set flag if we reach the terminator block */
if ((count = GetDataBlock(sinfo, &sinfo->code_buf[2])) == 0) {
sinfo->out_of_blocks = TRUE;
WARNMS(sinfo->cinfo, JWRN_GIF_NOMOREDATA);
return sinfo->end_code; /* fake something useful */
}
/* Reset counters */
sinfo->cur_bit = (sinfo->cur_bit - sinfo->last_bit) + 16;
sinfo->last_byte = 2 + count;
sinfo->last_bit = sinfo->last_byte * 8;
}
/* Form up next 24 bits in accum */
offs = sinfo->cur_bit >> 3; /* byte containing cur_bit */
accum = UCH(sinfo->code_buf[offs + 2]);
accum <<= 8;
accum |= UCH(sinfo->code_buf[offs + 1]);
accum <<= 8;
accum |= UCH(sinfo->code_buf[offs]);
/* Right-align cur_bit in accum, then mask off desired number of bits */
accum >>= (sinfo->cur_bit & 7);
sinfo->cur_bit += sinfo->code_size;
return accum & ((1 << sinfo->code_size) - 1);
}
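/* Worked example (illustrative): GIF packs codes least-significant-bit first.
 * Suppose code_size = 9, cur_bit = 22, and code_buf[2..4] = 0xA4 0x5F 0x83
 * (so offs = cur_bit >> 3 = 2).  Then
 *
 *   accum = (0x83 << 16) | (0x5F << 8) | 0xA4  = 0x835FA4
 *   accum >>= (22 & 7) = 6                    -> 0x20D7E
 *   accum & ((1 << 9) - 1)                    -> 0x17E
 *
 * i.e. the 9 bits starting at absolute bit 22 of the buffer, taken LSB-first.
 */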
LOCAL(int)
LZWReadByte(gif_source_ptr sinfo)
/* Read an LZW-compressed byte */
{
register int code; /* current working code */
int incode; /* saves actual input code */
/* If any codes are stacked from a previously read symbol, return them */
if (sinfo->sp > sinfo->symbol_stack)
return (int)(*(--sinfo->sp));
/* Time to read a new symbol */
code = GetCode(sinfo);
if (code == sinfo->clear_code) {
/* Reinit state, swallow any extra Clear codes, and */
/* return next code, which is expected to be a raw byte. */
ReInitLZW(sinfo);
do {
code = GetCode(sinfo);
} while (code == sinfo->clear_code);
if (code > sinfo->clear_code) { /* make sure it is a raw byte */
WARNMS(sinfo->cinfo, JWRN_GIF_BADDATA);
code = 0; /* use something valid */
}
/* make firstcode, oldcode valid! */
sinfo->firstcode = sinfo->oldcode = code;
return code;
}
if (code == sinfo->end_code) {
/* Skip the rest of the image, unless GetCode already read terminator */
if (!sinfo->out_of_blocks) {
SkipDataBlocks(sinfo);
sinfo->out_of_blocks = TRUE;
}
/* Complain that there's not enough data */
WARNMS(sinfo->cinfo, JWRN_GIF_ENDCODE);
/* Pad data with 0's */
return 0; /* fake something usable */
}
/* Got normal raw byte or LZW symbol */
incode = code; /* save for a moment */
if (code >= sinfo->max_code) { /* special case for not-yet-defined symbol */
/* code == max_code is OK; anything bigger is bad data */
if (code > sinfo->max_code) {
WARNMS(sinfo->cinfo, JWRN_GIF_BADDATA);
incode = 0; /* prevent creation of loops in symbol table */
}
/* this symbol will be defined as oldcode/firstcode */
*(sinfo->sp++) = (UINT8)sinfo->firstcode;
code = sinfo->oldcode;
}
/* If it's a symbol, expand it into the stack */
while (code >= sinfo->clear_code) {
*(sinfo->sp++) = sinfo->symbol_tail[code]; /* tail is a byte value */
code = sinfo->symbol_head[code]; /* head is another LZW symbol */
}
/* At this point code just represents a raw byte */
sinfo->firstcode = code; /* save for possible future use */
/* If there's room in table... */
if ((code = sinfo->max_code) < LZW_TABLE_SIZE) {
/* Define a new symbol = prev sym + head of this sym's expansion */
sinfo->symbol_head[code] = (UINT16)sinfo->oldcode;
sinfo->symbol_tail[code] = (UINT8)sinfo->firstcode;
sinfo->max_code++;
/* Is it time to increase code_size? */
if (sinfo->max_code >= sinfo->limit_code &&
sinfo->code_size < MAX_LZW_BITS) {
sinfo->code_size++;
sinfo->limit_code <<= 1; /* keep equal to 2^code_size */
}
}
sinfo->oldcode = incode; /* save last input symbol for future use */
return sinfo->firstcode; /* return first byte of symbol's expansion */
}
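/* Worked example (illustrative): if the previous symbol (oldcode) expands to
 * "AB" and the symbol just read expands to a string beginning with 'A', the
 * table entry created above has head = oldcode and tail = 'A', i.e. it
 * expands to "ABA".  The code == max_code branch is the classic LZW "KwKwK"
 * case: the stream may legally reference that entry one step before it is
 * defined, so the decoder synthesizes it from oldcode and firstcode.
 */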
LOCAL(void)
ReadColorMap(gif_source_ptr sinfo, int cmaplen, JSAMPARRAY cmap)
/* Read a GIF colormap */
{
int i;
for (i = 0; i < cmaplen; i++) {
#if BITS_IN_JSAMPLE == 8
#define UPSCALE(x) (x)
#else
#define UPSCALE(x) ((x) << (BITS_IN_JSAMPLE - 8))
#endif
cmap[CM_RED][i] = (JSAMPLE)UPSCALE(ReadByte(sinfo));
cmap[CM_GREEN][i] = (JSAMPLE)UPSCALE(ReadByte(sinfo));
cmap[CM_BLUE][i] = (JSAMPLE)UPSCALE(ReadByte(sinfo));
}
}
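/* Illustrative note: GIF colormap entries are always 8-bit, so when libjpeg
 * is built with BITS_IN_JSAMPLE == 12, UPSCALE() shifts each component left
 * by 4 bits (e.g. 255 becomes 4080) so that the full 12-bit sample range is
 * used.  With the usual 8-bit build, UPSCALE() is a no-op.
 */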
LOCAL(void)
DoExtension(gif_source_ptr sinfo)
/* Process an extension block */
/* Currently we ignore 'em all */
{
int extlabel;
/* Read extension label byte */
extlabel = ReadByte(sinfo);
TRACEMS1(sinfo->cinfo, 1, JTRC_GIF_EXTENSION, extlabel);
/* Skip the data block(s) associated with the extension */
SkipDataBlocks(sinfo);
}
/*
* Read the file header; return image size and component count.
*/
METHODDEF(void)
start_input_gif(j_compress_ptr cinfo, cjpeg_source_ptr sinfo)
{
gif_source_ptr source = (gif_source_ptr)sinfo;
U_CHAR hdrbuf[10]; /* workspace for reading control blocks */
unsigned int width, height; /* image dimensions */
int colormaplen, aspectRatio;
int c;
/* Read and verify GIF Header */
if (!ReadOK(source->pub.input_file, hdrbuf, 6))
ERREXIT(cinfo, JERR_GIF_NOT);
if (hdrbuf[0] != 'G' || hdrbuf[1] != 'I' || hdrbuf[2] != 'F')
ERREXIT(cinfo, JERR_GIF_NOT);
/* Check for expected version numbers.
* If unknown version, give warning and try to process anyway;
* this is per recommendation in GIF89a standard.
*/
if ((hdrbuf[3] != '8' || hdrbuf[4] != '7' || hdrbuf[5] != 'a') &&
(hdrbuf[3] != '8' || hdrbuf[4] != '9' || hdrbuf[5] != 'a'))
TRACEMS3(cinfo, 1, JTRC_GIF_BADVERSION, hdrbuf[3], hdrbuf[4], hdrbuf[5]);
/* Read and decipher Logical Screen Descriptor */
if (!ReadOK(source->pub.input_file, hdrbuf, 7))
ERREXIT(cinfo, JERR_INPUT_EOF);
width = LM_to_uint(hdrbuf, 0);
height = LM_to_uint(hdrbuf, 2);
if (width == 0 || height == 0)
ERREXIT(cinfo, JERR_GIF_EMPTY);
/* we ignore the color resolution, sort flag, and background color index */
aspectRatio = UCH(hdrbuf[6]);
if (aspectRatio != 0 && aspectRatio != 49)
TRACEMS(cinfo, 1, JTRC_GIF_NONSQUARE);
/* Allocate space to store the colormap */
source->colormap = (*cinfo->mem->alloc_sarray)
((j_common_ptr)cinfo, JPOOL_IMAGE, (JDIMENSION)MAXCOLORMAPSIZE,
(JDIMENSION)NUMCOLORS);
colormaplen = 0; /* indicate initialization */
/* Read global colormap if header indicates it is present */
if (BitSet(hdrbuf[4], COLORMAPFLAG)) {
colormaplen = 2 << (hdrbuf[4] & 0x07);
ReadColorMap(source, colormaplen, source->colormap);
}
/* Scan until we reach start of desired image.
* We don't currently support skipping images, but could add it easily.
*/
for (;;) {
c = ReadByte(source);
if (c == ';') /* GIF terminator?? */
ERREXIT(cinfo, JERR_GIF_IMAGENOTFOUND);
if (c == '!') { /* Extension */
DoExtension(source);
continue;
}
if (c != ',') { /* Not an image separator? */
WARNMS1(cinfo, JWRN_GIF_CHAR, c);
continue;
}
/* Read and decipher Local Image Descriptor */
if (!ReadOK(source->pub.input_file, hdrbuf, 9))
ERREXIT(cinfo, JERR_INPUT_EOF);
/* we ignore top/left position info, also sort flag */
width = LM_to_uint(hdrbuf, 4);
height = LM_to_uint(hdrbuf, 6);
if (width == 0 || height == 0)
ERREXIT(cinfo, JERR_GIF_EMPTY);
source->is_interlaced = (BitSet(hdrbuf[8], INTERLACE) != 0);
/* Read local colormap if header indicates it is present */
/* Note: if we wanted to support skipping images, */
/* we'd need to skip rather than read colormap for ignored images */
if (BitSet(hdrbuf[8], COLORMAPFLAG)) {
colormaplen = 2 << (hdrbuf[8] & 0x07);
ReadColorMap(source, colormaplen, source->colormap);
}
source->input_code_size = ReadByte(source); /* get min-code-size byte */
if (source->input_code_size < 2 || source->input_code_size > 8)
ERREXIT1(cinfo, JERR_GIF_CODESIZE, source->input_code_size);
/* Reached desired image, so break out of loop */
/* If we wanted to skip this image, */
/* we'd call SkipDataBlocks and then continue the loop */
break;
}
/* Prepare to read selected image: first initialize LZW decompressor */
source->symbol_head = (UINT16 *)
(*cinfo->mem->alloc_large) ((j_common_ptr)cinfo, JPOOL_IMAGE,
LZW_TABLE_SIZE * sizeof(UINT16));
source->symbol_tail = (UINT8 *)
(*cinfo->mem->alloc_large) ((j_common_ptr)cinfo, JPOOL_IMAGE,
LZW_TABLE_SIZE * sizeof(UINT8));
source->symbol_stack = (UINT8 *)
(*cinfo->mem->alloc_large) ((j_common_ptr)cinfo, JPOOL_IMAGE,
LZW_TABLE_SIZE * sizeof(UINT8));
InitLZWCode(source);
/*
* If image is interlaced, we read it into a full-size sample array,
* decompressing as we go; then get_interlaced_row selects rows from the
* sample array in the proper order.
*/
if (source->is_interlaced) {
/* We request the virtual array now, but can't access it until virtual
* arrays have been allocated. Hence, the actual work of reading the
* image is postponed until the first call to get_pixel_rows.
*/
source->interlaced_image = (*cinfo->mem->request_virt_sarray)
((j_common_ptr)cinfo, JPOOL_IMAGE, FALSE,
(JDIMENSION)width, (JDIMENSION)height, (JDIMENSION)1);
if (cinfo->progress != NULL) {
cd_progress_ptr progress = (cd_progress_ptr)cinfo->progress;
progress->total_extra_passes++; /* count file input as separate pass */
}
source->pub.get_pixel_rows = load_interlaced_image;
} else {
source->pub.get_pixel_rows = get_pixel_rows;
}
/* Create compressor input buffer. */
source->pub.buffer = (*cinfo->mem->alloc_sarray)
((j_common_ptr)cinfo, JPOOL_IMAGE, (JDIMENSION)width * NUMCOLORS,
(JDIMENSION)1);
source->pub.buffer_height = 1;
/* Pad colormap for safety. */
for (c = colormaplen; c < source->clear_code; c++) {
source->colormap[CM_RED][c] =
source->colormap[CM_GREEN][c] =
source->colormap[CM_BLUE][c] = CENTERJSAMPLE;
}
/* Return info about the image. */
cinfo->in_color_space = JCS_RGB;
cinfo->input_components = NUMCOLORS;
cinfo->data_precision = BITS_IN_JSAMPLE; /* we always rescale data to this */
cinfo->image_width = width;
cinfo->image_height = height;
TRACEMS3(cinfo, 1, JTRC_GIF, width, height, colormaplen);
}
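/* Illustrative note: the Logical Screen Descriptor and Local Image Descriptor
 * store dimensions as little-endian 16-bit values, so LM_to_uint(hdrbuf, 0)
 * (defined earlier in this file) is equivalent to
 *
 *   (unsigned int)UCH(hdrbuf[0]) | ((unsigned int)UCH(hdrbuf[1]) << 8)
 *
 * e.g. the bytes 0x80 0x02 decode to a width of 640.  The colormap length is
 * likewise derived from the low 3 bits of the flags byte:
 * 2 << (flags & 0x07) gives 2, 4, 8, ..., 256 entries.
 */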
/*
* Read one row of pixels.
* This version is used for noninterlaced GIF images:
* we read directly from the GIF file.
*/
METHODDEF(JDIMENSION)
get_pixel_rows(j_compress_ptr cinfo, cjpeg_source_ptr sinfo)
{
gif_source_ptr source = (gif_source_ptr)sinfo;
register int c;
register JSAMPROW ptr;
register JDIMENSION col;
register JSAMPARRAY colormap = source->colormap;
ptr = source->pub.buffer[0];
for (col = cinfo->image_width; col > 0; col--) {
c = LZWReadByte(source);
*ptr++ = colormap[CM_RED][c];
*ptr++ = colormap[CM_GREEN][c];
*ptr++ = colormap[CM_BLUE][c];
}
return 1;
}
/*
* Read one row of pixels.
* This version is used for the first call on get_pixel_rows when
* reading an interlaced GIF file: we read the whole image into memory.
*/
METHODDEF(JDIMENSION)
load_interlaced_image(j_compress_ptr cinfo, cjpeg_source_ptr sinfo)
{
gif_source_ptr source = (gif_source_ptr)sinfo;
register JSAMPROW sptr;
register JDIMENSION col;
JDIMENSION row;
cd_progress_ptr progress = (cd_progress_ptr)cinfo->progress;
/* Read the interlaced image into the virtual array we've created. */
for (row = 0; row < cinfo->image_height; row++) {
if (progress != NULL) {
progress->pub.pass_counter = (long)row;
progress->pub.pass_limit = (long)cinfo->image_height;
(*progress->pub.progress_monitor) ((j_common_ptr)cinfo);
}
sptr = *(*cinfo->mem->access_virt_sarray)
((j_common_ptr)cinfo, source->interlaced_image, row, (JDIMENSION)1,
TRUE);
for (col = cinfo->image_width; col > 0; col--) {
*sptr++ = (JSAMPLE)LZWReadByte(source);
}
}
if (progress != NULL)
progress->completed_extra_passes++;
/* Replace method pointer so subsequent calls don't come here. */
source->pub.get_pixel_rows = get_interlaced_row;
/* Initialize for get_interlaced_row, and perform first call on it. */
source->cur_row_number = 0;
source->pass2_offset = (cinfo->image_height + 7) / 8;
source->pass3_offset = source->pass2_offset + (cinfo->image_height + 3) / 8;
source->pass4_offset = source->pass3_offset + (cinfo->image_height + 1) / 4;
return get_interlaced_row(cinfo, sinfo);
}
/*
* Read one row of pixels.
* This version is used for interlaced GIF images:
* we read from the virtual array.
*/
METHODDEF(JDIMENSION)
get_interlaced_row(j_compress_ptr cinfo, cjpeg_source_ptr sinfo)
{
gif_source_ptr source = (gif_source_ptr)sinfo;
register int c;
register JSAMPROW sptr, ptr;
register JDIMENSION col;
register JSAMPARRAY colormap = source->colormap;
JDIMENSION irow;
/* Figure out which row of interlaced image is needed, and access it. */
switch ((int)(source->cur_row_number & 7)) {
case 0: /* first-pass row */
irow = source->cur_row_number >> 3;
break;
case 4: /* second-pass row */
irow = (source->cur_row_number >> 3) + source->pass2_offset;
break;
case 2: /* third-pass row */
case 6:
irow = (source->cur_row_number >> 2) + source->pass3_offset;
break;
default: /* fourth-pass row */
irow = (source->cur_row_number >> 1) + source->pass4_offset;
}
sptr = *(*cinfo->mem->access_virt_sarray)
((j_common_ptr)cinfo, source->interlaced_image, irow, (JDIMENSION)1,
FALSE);
/* Scan the row, expand colormap, and output */
ptr = source->pub.buffer[0];
for (col = cinfo->image_width; col > 0; col--) {
c = *sptr++;
*ptr++ = colormap[CM_RED][c];
*ptr++ = colormap[CM_GREEN][c];
*ptr++ = colormap[CM_BLUE][c];
}
source->cur_row_number++; /* for next time */
return 1;
}
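/* Worked example (illustrative): GIF interlacing stores rows in four passes
 * (every 8th row from 0, every 8th from 4, every 4th from 2, every 2nd from
 * 1).  For image_height = 16, load_interlaced_image computes
 *
 *   pass2_offset = (16 + 7) / 8 = 2
 *   pass3_offset = 2 + (16 + 3) / 8 = 4
 *   pass4_offset = 4 + (16 + 1) / 4 = 8
 *
 * so display row 4 (cur_row_number & 7 == 4, a second-pass row) maps to
 * virtual-array row (4 >> 3) + 2 = 2, and display row 3 (a fourth-pass row)
 * maps to (3 >> 1) + 8 = 9.
 */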
/*
* Finish up at the end of the file.
*/
METHODDEF(void)
finish_input_gif(j_compress_ptr cinfo, cjpeg_source_ptr sinfo)
{
/* no work */
}
/*
 * The module selection routine for GIF format input.
 */

@@ -31,9 +665,18 @@

 GLOBAL(cjpeg_source_ptr)
 jinit_read_gif(j_compress_ptr cinfo)
 {
-fprintf(stderr, "GIF input is unsupported for legal reasons. Sorry.\n");
-exit(EXIT_FAILURE);
-return NULL; /* keep compiler happy */
+gif_source_ptr source;
+
+/* Create module interface object */
+source = (gif_source_ptr)
+(*cinfo->mem->alloc_small) ((j_common_ptr)cinfo, JPOOL_IMAGE,
+sizeof(gif_source_struct));
+source->cinfo = cinfo; /* make back link for subroutines */
+
+/* Fill in method ptrs, except get_pixel_rows which start_input sets */
+source->pub.start_input = start_input_gif;
+source->pub.finish_input = finish_input_gif;
+
+return (cjpeg_source_ptr)source;
 }

 #endif /* GIF_SUPPORTED */

rdppm.c

@@ -43,18 +43,8 @@
 /* Macros to deal with unsigned chars as efficiently as compiler allows */
-#ifdef HAVE_UNSIGNED_CHAR
 typedef unsigned char U_CHAR;
 #define UCH(x) ((int)(x))
-#else /* !HAVE_UNSIGNED_CHAR */
-#ifdef __CHAR_UNSIGNED__
-typedef char U_CHAR;
-#define UCH(x) ((int)(x))
-#else
-typedef char U_CHAR;
-#define UCH(x) ((int)(x) & 0xFF)
-#endif
-#endif /* HAVE_UNSIGNED_CHAR */
 #define ReadOK(file, buffer, len) \

rdrle.c

@@ -1,389 +0,0 @@
/*
* rdrle.c
*
* This file was part of the Independent JPEG Group's software:
* Copyright (C) 1991-1996, Thomas G. Lane.
* It was modified by The libjpeg-turbo Project to include only code and
* information relevant to libjpeg-turbo.
* For conditions of distribution and use, see the accompanying README.ijg
* file.
*
* This file contains routines to read input images in Utah RLE format.
* The Utah Raster Toolkit library is required (version 3.1 or later).
*
* These routines may need modification for non-Unix environments or
* specialized applications. As they stand, they assume input from
* an ordinary stdio stream. They further assume that reading begins
* at the start of the file; start_input may need work if the
* user interface has already read some data (e.g., to determine that
* the file is indeed RLE format).
*
* Based on code contributed by Mike Lijewski,
* with updates from Robert Hutchinson.
*/
#include "cdjpeg.h" /* Common decls for cjpeg/djpeg applications */
#ifdef RLE_SUPPORTED
/* rle.h is provided by the Utah Raster Toolkit. */
#include <rle.h>
/*
* We assume that JSAMPLE has the same representation as rle_pixel,
* to wit, "unsigned char". Hence we can't cope with 12- or 16-bit samples.
*/
#if BITS_IN_JSAMPLE != 8
Sorry, this code only copes with 8-bit JSAMPLEs. /* deliberate syntax err */
#endif
/*
* We support the following types of RLE files:
*
* GRAYSCALE - 8 bits, no colormap
* MAPPEDGRAY - 8 bits, 1 channel colomap
* PSEUDOCOLOR - 8 bits, 3 channel colormap
* TRUECOLOR - 24 bits, 3 channel colormap
* DIRECTCOLOR - 24 bits, no colormap
*
* For now, we ignore any alpha channel in the image.
*/
typedef enum
{ GRAYSCALE, MAPPEDGRAY, PSEUDOCOLOR, TRUECOLOR, DIRECTCOLOR } rle_kind;
/*
* Since RLE stores scanlines bottom-to-top, we have to invert the image
* to conform to JPEG's top-to-bottom order. To do this, we read the
* incoming image into a virtual array on the first get_pixel_rows call,
* then fetch the required row from the virtual array on subsequent calls.
*/
typedef struct _rle_source_struct *rle_source_ptr;
typedef struct _rle_source_struct {
struct cjpeg_source_struct pub; /* public fields */
rle_kind visual; /* actual type of input file */
jvirt_sarray_ptr image; /* virtual array to hold the image */
JDIMENSION row; /* current row # in the virtual array */
rle_hdr header; /* Input file information */
rle_pixel **rle_row; /* holds a row returned by rle_getrow() */
} rle_source_struct;
/*
* Read the file header; return image size and component count.
*/
METHODDEF(void)
start_input_rle(j_compress_ptr cinfo, cjpeg_source_ptr sinfo)
{
rle_source_ptr source = (rle_source_ptr)sinfo;
JDIMENSION width, height;
#ifdef PROGRESS_REPORT
cd_progress_ptr progress = (cd_progress_ptr)cinfo->progress;
#endif
/* Use RLE library routine to get the header info */
source->header = *rle_hdr_init(NULL);
source->header.rle_file = source->pub.input_file;
switch (rle_get_setup(&(source->header))) {
case RLE_SUCCESS:
/* A-OK */
break;
case RLE_NOT_RLE:
ERREXIT(cinfo, JERR_RLE_NOT);
break;
case RLE_NO_SPACE:
ERREXIT(cinfo, JERR_RLE_MEM);
break;
case RLE_EMPTY:
ERREXIT(cinfo, JERR_RLE_EMPTY);
break;
case RLE_EOF:
ERREXIT(cinfo, JERR_RLE_EOF);
break;
default:
ERREXIT(cinfo, JERR_RLE_BADERROR);
break;
}
/* Figure out what we have, set private vars and return values accordingly */
width = source->header.xmax - source->header.xmin + 1;
height = source->header.ymax - source->header.ymin + 1;
source->header.xmin = 0; /* realign horizontally */
source->header.xmax = width - 1;
cinfo->image_width = width;
cinfo->image_height = height;
cinfo->data_precision = 8; /* we can only handle 8 bit data */
if (source->header.ncolors == 1 && source->header.ncmap == 0) {
source->visual = GRAYSCALE;
TRACEMS2(cinfo, 1, JTRC_RLE_GRAY, width, height);
} else if (source->header.ncolors == 1 && source->header.ncmap == 1) {
source->visual = MAPPEDGRAY;
TRACEMS3(cinfo, 1, JTRC_RLE_MAPGRAY, width, height,
1 << source->header.cmaplen);
} else if (source->header.ncolors == 1 && source->header.ncmap == 3) {
source->visual = PSEUDOCOLOR;
TRACEMS3(cinfo, 1, JTRC_RLE_MAPPED, width, height,
1 << source->header.cmaplen);
} else if (source->header.ncolors == 3 && source->header.ncmap == 3) {
source->visual = TRUECOLOR;
TRACEMS3(cinfo, 1, JTRC_RLE_FULLMAP, width, height,
1 << source->header.cmaplen);
} else if (source->header.ncolors == 3 && source->header.ncmap == 0) {
source->visual = DIRECTCOLOR;
TRACEMS2(cinfo, 1, JTRC_RLE, width, height);
} else
ERREXIT(cinfo, JERR_RLE_UNSUPPORTED);
if (source->visual == GRAYSCALE || source->visual == MAPPEDGRAY) {
cinfo->in_color_space = JCS_GRAYSCALE;
cinfo->input_components = 1;
} else {
cinfo->in_color_space = JCS_RGB;
cinfo->input_components = 3;
}
/*
* A place to hold each scanline while it's converted.
* (GRAYSCALE scanlines don't need converting)
*/
if (source->visual != GRAYSCALE) {
source->rle_row = (rle_pixel **)(*cinfo->mem->alloc_sarray)
((j_common_ptr)cinfo, JPOOL_IMAGE,
(JDIMENSION)width, (JDIMENSION)cinfo->input_components);
}
/* request a virtual array to hold the image */
source->image = (*cinfo->mem->request_virt_sarray)
((j_common_ptr)cinfo, JPOOL_IMAGE, FALSE,
(JDIMENSION)(width * source->header.ncolors),
(JDIMENSION)height, (JDIMENSION)1);
#ifdef PROGRESS_REPORT
if (progress != NULL) {
/* count file input as separate pass */
progress->total_extra_passes++;
}
#endif
source->pub.buffer_height = 1;
}
/*
* Read one row of pixels.
* Called only after load_image has read the image into the virtual array.
* Used for GRAYSCALE, MAPPEDGRAY, TRUECOLOR, and DIRECTCOLOR images.
*/
METHODDEF(JDIMENSION)
get_rle_row(j_compress_ptr cinfo, cjpeg_source_ptr sinfo)
{
rle_source_ptr source = (rle_source_ptr)sinfo;
source->row--;
source->pub.buffer = (*cinfo->mem->access_virt_sarray)
((j_common_ptr)cinfo, source->image, source->row, (JDIMENSION)1, FALSE);
return 1;
}
/*
* Read one row of pixels.
* Called only after load_image has read the image into the virtual array.
* Used for PSEUDOCOLOR images.
*/
METHODDEF(JDIMENSION)
get_pseudocolor_row(j_compress_ptr cinfo, cjpeg_source_ptr sinfo)
{
rle_source_ptr source = (rle_source_ptr)sinfo;
JSAMPROW src_row, dest_row;
JDIMENSION col;
rle_map *colormap;
int val;
colormap = source->header.cmap;
dest_row = source->pub.buffer[0];
source->row--;
src_row = *(*cinfo->mem->access_virt_sarray)
((j_common_ptr)cinfo, source->image, source->row, (JDIMENSION)1, FALSE);
for (col = cinfo->image_width; col > 0; col--) {
val = GETJSAMPLE(*src_row++);
*dest_row++ = (JSAMPLE)(colormap[val ] >> 8);
*dest_row++ = (JSAMPLE)(colormap[val + 256] >> 8);
*dest_row++ = (JSAMPLE)(colormap[val + 512] >> 8);
}
return 1;
}
/*
* Load the image into a virtual array. We have to do this because RLE
* files start at the lower left while the JPEG standard has them starting
* in the upper left. This is called the first time we want to get a row
* of input. What we do is load the RLE data into the array and then call
* the appropriate routine to read one row from the array. Before returning,
* we set source->pub.get_pixel_rows so that subsequent calls go straight to
* the appropriate row-reading routine.
*/
METHODDEF(JDIMENSION)
load_image(j_compress_ptr cinfo, cjpeg_source_ptr sinfo)
{
rle_source_ptr source = (rle_source_ptr)sinfo;
JDIMENSION row, col;
JSAMPROW scanline, red_ptr, green_ptr, blue_ptr;
rle_pixel **rle_row;
rle_map *colormap;
char channel;
#ifdef PROGRESS_REPORT
cd_progress_ptr progress = (cd_progress_ptr)cinfo->progress;
#endif
colormap = source->header.cmap;
rle_row = source->rle_row;
/* Read the RLE data into our virtual array.
* We assume here that rle_pixel is represented the same as JSAMPLE.
*/
RLE_CLR_BIT(source->header, RLE_ALPHA); /* don't read the alpha channel */
#ifdef PROGRESS_REPORT
if (progress != NULL) {
progress->pub.pass_limit = cinfo->image_height;
progress->pub.pass_counter = 0;
(*progress->pub.progress_monitor) ((j_common_ptr)cinfo);
}
#endif
switch (source->visual) {
case GRAYSCALE:
case PSEUDOCOLOR:
for (row = 0; row < cinfo->image_height; row++) {
rle_row = (rle_pixel **)(*cinfo->mem->access_virt_sarray)
((j_common_ptr)cinfo, source->image, row, (JDIMENSION)1, TRUE);
rle_getrow(&source->header, rle_row);
#ifdef PROGRESS_REPORT
if (progress != NULL) {
progress->pub.pass_counter++;
(*progress->pub.progress_monitor) ((j_common_ptr)cinfo);
}
#endif
}
break;
case MAPPEDGRAY:
case TRUECOLOR:
for (row = 0; row < cinfo->image_height; row++) {
scanline = *(*cinfo->mem->access_virt_sarray)
((j_common_ptr)cinfo, source->image, row, (JDIMENSION)1, TRUE);
rle_row = source->rle_row;
rle_getrow(&source->header, rle_row);
for (col = 0; col < cinfo->image_width; col++) {
for (channel = 0; channel < source->header.ncolors; channel++) {
*scanline++ = (JSAMPLE)
(colormap[GETJSAMPLE(rle_row[channel][col]) + 256 * channel] >> 8);
}
}
#ifdef PROGRESS_REPORT
if (progress != NULL) {
progress->pub.pass_counter++;
(*progress->pub.progress_monitor) ((j_common_ptr)cinfo);
}
#endif
}
break;
case DIRECTCOLOR:
for (row = 0; row < cinfo->image_height; row++) {
scanline = *(*cinfo->mem->access_virt_sarray)
((j_common_ptr)cinfo, source->image, row, (JDIMENSION)1, TRUE);
rle_getrow(&source->header, rle_row);
red_ptr = rle_row[0];
green_ptr = rle_row[1];
blue_ptr = rle_row[2];
for (col = cinfo->image_width; col > 0; col--) {
*scanline++ = *red_ptr++;
*scanline++ = *green_ptr++;
*scanline++ = *blue_ptr++;
}
#ifdef PROGRESS_REPORT
if (progress != NULL) {
progress->pub.pass_counter++;
(*progress->pub.progress_monitor) ((j_common_ptr)cinfo);
}
#endif
}
}
#ifdef PROGRESS_REPORT
if (progress != NULL)
progress->completed_extra_passes++;
#endif
/* Set up to call proper row-extraction routine in future */
if (source->visual == PSEUDOCOLOR) {
source->pub.buffer = source->rle_row;
source->pub.get_pixel_rows = get_pseudocolor_row;
} else {
source->pub.get_pixel_rows = get_rle_row;
}
source->row = cinfo->image_height;
/* And fetch the topmost (bottommost) row */
return (*source->pub.get_pixel_rows) (cinfo, sinfo);
}
/*
* Finish up at the end of the file.
*/
METHODDEF(void)
finish_input_rle(j_compress_ptr cinfo, cjpeg_source_ptr sinfo)
{
/* no work */
}
/*
* The module selection routine for RLE format input.
*/
GLOBAL(cjpeg_source_ptr)
jinit_read_rle(j_compress_ptr cinfo)
{
rle_source_ptr source;
/* Create module interface object */
source = (rle_source_ptr)
(*cinfo->mem->alloc_small) ((j_common_ptr)cinfo, JPOOL_IMAGE,
sizeof(rle_source_struct));
/* Fill in method ptrs */
source->pub.start_input = start_input_rle;
source->pub.finish_input = finish_input_rle;
source->pub.get_pixel_rows = load_image;
return (cjpeg_source_ptr)source;
}
#endif /* RLE_SUPPORTED */

View File

@@ -28,18 +28,8 @@
 /* Macros to deal with unsigned chars as efficiently as compiler allows */
-#ifdef HAVE_UNSIGNED_CHAR
 typedef unsigned char U_CHAR;
 #define UCH(x) ((int)(x))
-#else /* !HAVE_UNSIGNED_CHAR */
-#ifdef __CHAR_UNSIGNED__
-typedef char U_CHAR;
-#define UCH(x) ((int)(x))
-#else
-typedef char U_CHAR;
-#define UCH(x) ((int)(x) & 0xFF)
-#endif
-#endif /* HAVE_UNSIGNED_CHAR */
 #define ReadOK(file, buffer, len) \

release/Config.cmake.in (new file)

@@ -0,0 +1,4 @@
@PACKAGE_INIT@
include("${CMAKE_CURRENT_LIST_DIR}/@CMAKE_PROJECT_NAME@Targets.cmake")
check_required_components("@CMAKE_PROJECT_NAME@")

View File

@@ -1,4 +1,4 @@
-libjpeg-turbo is a JPEG image codec that uses SIMD instructions to accelerate baseline JPEG compression and decompression on x86, x86-64, Arm, PowerPC, and MIPS systems, as well as progressive JPEG compression on x86 and x86-64 systems. On such systems, libjpeg-turbo is generally 2-6x as fast as libjpeg, all else being equal. On other types of systems, libjpeg-turbo can still outperform libjpeg by a significant amount, by virtue of its highly-optimized Huffman coding routines. In many cases, the performance of libjpeg-turbo rivals that of proprietary high-speed JPEG codecs.
+libjpeg-turbo is a JPEG image codec that uses SIMD instructions to accelerate baseline JPEG compression and decompression on x86, x86-64, Arm, PowerPC, and MIPS systems, as well as progressive JPEG compression on x86, x86-64, and Arm systems. On such systems, libjpeg-turbo is generally 2-6x as fast as libjpeg, all else being equal. On other types of systems, libjpeg-turbo can still outperform libjpeg by a significant amount, by virtue of its highly-optimized Huffman coding routines. In many cases, the performance of libjpeg-turbo rivals that of proprietary high-speed JPEG codecs.

 libjpeg-turbo implements both the traditional libjpeg API as well as the less powerful but more straightforward TurboJPEG API. libjpeg-turbo also features colorspace extensions that allow it to compress from/decompress to 32-bit and big-endian pixel buffers (RGBX, XBGR, etc.), as well as a full-featured Java interface.

View File

@@ -1,18 +0,0 @@
{\rtf1\ansi\ansicpg1252\cocoartf1404\cocoasubrtf460
{\fonttbl\f0\fnil\fcharset0 Menlo-Regular;}
{\colortbl;\red255\green255\blue255;\red0\green0\blue0;\red203\green233\blue242;}
\deftab720
\pard\tx529\tx1059\tx1589\tx2119\tx2649\tx3178\tx3708\tx4238\tx4768\tx5298\tx5827\tx6357\tx6887\tx7417\tx7947\tx8476\tx9006\tx9536\tx10066\tx10596\tx11125\tx11655\tx12185\tx12715\tx13245\tx13774\tx14304\tx14834\tx15364\tx15894\tx16423\tx16953\tx17483\tx18013\tx18543\tx19072\tx19602\tx20132\tx20662\tx21192\tx21722\tx22251\tx22781\tx23311\tx23841\tx24371\tx24900\tx25430\tx25960\tx26490\tx27020\tx27549\tx28079\tx28609\tx29139\tx29669\tx30198\tx30728\tx31258\tx31788\tx32318\tx32847\tx33377\tx33907\tx34437\tx34967\tx35496\tx36026\tx36556\tx37086\tx37616\tx38145\tx38675\tx39205\tx39735\tx40265\tx40794\tx41324\tx41854\tx42384\tx42914\tx43443\tx43973\tx44503\tx45033\tx45563\tx46093\tx46622\tx47152\tx47682\tx48212\tx48742\tx49271\tx49801\tx50331\tx50861\tx51391\tx51920\tx52450\tx52980\pardeftab720\li529\fi-530\partightenfactor0
\f0\fs22 \cf2 \CocoaLigature0 TThi /opt/mozjpeg/bin/uninstall\
\pard\tx529\tx1059\tx1589\tx2119\tx2649\tx3178\tx3708\tx4238\tx4768\tx5298\tx5827\tx6357\tx6887\tx7417\tx7947\tx8476\tx9006\tx9536\tx10066\tx10596\tx11125\tx11655\tx12185\tx12715\tx13245\tx13774\tx14304\tx14834\tx15364\tx15894\tx16423\tx16953\tx17483\tx18013\tx18543\tx19072\tx19602\tx20132\tx20662\tx21192\tx21722\tx22251\tx22781\tx23311\tx23841\tx24371\tx24900\tx25430\tx25960\tx26490\tx27020\tx27549\tx28079\tx28609\tx29139\tx29669\tx30198\tx30728\tx31258\tx31788\tx32318\tx32847\tx33377\tx33907\tx34437\tx34967\tx35496\tx36026\tx36556\tx37086\tx37616\tx38145\tx38675\tx39205\tx39735\tx40265\tx40794\tx41324\tx41854\tx42384\tx42914\tx43443\tx43973\tx44503\tx45033\tx45563\tx46093\tx46622\tx47152\tx47682\tx48212\tx48742\tx49271\tx49801\tx50331\tx50861\tx51391\tx51920\tx52450\tx52980\pardeftab720\li662\fi-663\partightenfactor0
\cf2 installer will install the mozjpeg SDK and run-time libraries onto your computer so that you can use mozjpeg to build new applications. To remove the mozjpeg package, run\
\pard\tx529\tx1059\tx1589\tx2119\tx2649\tx3178\tx3708\tx4238\tx4768\tx5298\tx5827\tx6357\tx6887\tx7417\tx7947\tx8476\tx9006\tx9536\tx10066\tx10596\tx11125\tx11655\tx12185\tx12715\tx13245\tx13774\tx14304\tx14834\tx15364\tx15894\tx16423\tx16953\tx17483\tx18013\tx18543\tx19072\tx19602\tx20132\tx20662\tx21192\tx21722\tx22251\tx22781\tx23311\tx23841\tx24371\tx24900\tx25430\tx25960\tx26490\tx27020\tx27549\tx28079\tx28609\tx29139\tx29669\tx30198\tx30728\tx31258\tx31788\tx32318\tx32847\tx33377\tx33907\tx34437\tx34967\tx35496\tx36026\tx36556\tx37086\tx37616\tx38145\tx38675\tx39205\tx39735\tx40265\tx40794\tx41324\tx41854\tx42384\tx42914\tx43443\tx43973\tx44503\tx45033\tx45563\tx46093\tx46622\tx47152\tx47682\tx48212\tx48742\tx49271\tx49801\tx50331\tx50861\tx51391\tx51920\tx52450\tx52980\pardeftab720\li529\fi-530\partightenfactor0
\cf2 is installer will install the \cb3 mozjpeg\cb1 SDK and run-time libraries onto your computer so that you can use \cb3 mozjpeg\cb1 to build new applications. To remove the \cb3 mozjpeg\cb1 package, run\
\
\pard\tx529\tx1059\tx1589\tx2119\tx2649\tx3178\tx3708\tx4238\tx4768\tx5298\tx5827\tx6357\tx6887\tx7417\tx7947\tx8476\tx9006\tx9536\tx10066\tx10596\tx11125\tx11655\tx12185\tx12715\tx13245\tx13774\tx14304\tx14834\tx15364\tx15894\tx16423\tx16953\tx17483\tx18013\tx18543\tx19072\tx19602\tx20132\tx20662\tx21192\tx21722\tx22251\tx22781\tx23311\tx23841\tx24371\tx24900\tx25430\tx25960\tx26490\tx27020\tx27549\tx28079\tx28609\tx29139\tx29669\tx30198\tx30728\tx31258\tx31788\tx32318\tx32847\tx33377\tx33907\tx34437\tx34967\tx35496\tx36026\tx36556\tx37086\tx37616\tx38145\tx38675\tx39205\tx39735\tx40265\tx40794\tx41324\tx41854\tx42384\tx42914\tx43443\tx43973\tx44503\tx45033\tx45563\tx46093\tx46622\tx47152\tx47682\tx48212\tx48742\tx49271\tx49801\tx50331\tx50861\tx51391\tx51920\tx52450\tx52980\pardeftab720\li794\fi-795\partightenfactor0
\cf2 /opt/\cb3 mozjpeg\cb1 /bin/uninstall\
\pard\tx529\tx1059\tx1589\tx2119\tx2649\tx3178\tx3708\tx4238\tx4768\tx5298\tx5827\tx6357\tx6887\tx7417\tx7947\tx8476\tx9006\tx9536\tx10066\tx10596\tx11125\tx11655\tx12185\tx12715\tx13245\tx13774\tx14304\tx14834\tx15364\tx15894\tx16423\tx16953\tx17483\tx18013\tx18543\tx19072\tx19602\tx20132\tx20662\tx21192\tx21722\tx22251\tx22781\tx23311\tx23841\tx24371\tx24900\tx25430\tx25960\tx26490\tx27020\tx27549\tx28079\tx28609\tx29139\tx29669\tx30198\tx30728\tx31258\tx31788\tx32318\tx32847\tx33377\tx33907\tx34437\tx34967\tx35496\tx36026\tx36556\tx37086\tx37616\tx38145\tx38675\tx39205\tx39735\tx40265\tx40794\tx41324\tx41854\tx42384\tx42914\tx43443\tx43973\tx44503\tx45033\tx45563\tx46093\tx46622\tx47152\tx47682\tx48212\tx48742\tx49271\tx49801\tx50331\tx50861\tx51391\tx51920\tx52450\tx52980\pardeftab720\li560\fi-561\partightenfactor0
\cf2 \
from the command line.\
}

release/Welcome.rtf.in (new file)

@@ -0,0 +1,17 @@
{\rtf1\ansi\ansicpg1252\cocoartf1038\cocoasubrtf360
{\fonttbl\f0\fswiss\fcharset0 Helvetica;\f1\fmodern\fcharset0 CourierNewPSMT;}
{\colortbl;\red255\green255\blue255;}
\margl1440\margr1440\vieww9000\viewh8400\viewkind0
\deftab720
\pard\pardeftab720\ql\qnatural
\f0\fs24 \cf0 This installer will install the libjpeg-turbo SDK and run-time libraries onto your computer so that you can use libjpeg-turbo to build new applications or accelerate existing ones. To remove the libjpeg-turbo package, run\
\
\pard\pardeftab720\ql\qnatural
\f1 \cf0 @CMAKE_INSTALL_FULL_BINDIR@/uninstall\
\pard\pardeftab720\ql\qnatural
\f0 \cf0 \
from the command line.\
}

View File

@@ -71,6 +71,11 @@ Section "@CMAKE_PROJECT_NAME@ SDK for @INST_PLATFORM@ (required)"
 SetOutPath $INSTDIR\lib\pkgconfig
 File "@CMAKE_CURRENT_BINARY_DIR@\pkgscripts\libjpeg.pc"
 File "@CMAKE_CURRENT_BINARY_DIR@\pkgscripts\libturbojpeg.pc"
+SetOutPath $INSTDIR\lib\cmake\@CMAKE_PROJECT_NAME@
+File "@CMAKE_CURRENT_BINARY_DIR@\pkgscripts\@CMAKE_PROJECT_NAME@Config.cmake"
+File "@CMAKE_CURRENT_BINARY_DIR@\pkgscripts\@CMAKE_PROJECT_NAME@ConfigVersion.cmake"
+File "@CMAKE_CURRENT_BINARY_DIR@\win\@CMAKE_PROJECT_NAME@Targets.cmake"
+File "@CMAKE_CURRENT_BINARY_DIR@\win\@CMAKE_PROJECT_NAME@Targets-release.cmake"
 !ifdef JAVA
 SetOutPath $INSTDIR\classes
 File "@CMAKE_CURRENT_BINARY_DIR@\java\turbojpeg.jar"

@@ -141,6 +146,10 @@ Section "Uninstall"

 !endif
 Delete $INSTDIR\lib\pkgconfig\libjpeg.pc
 Delete $INSTDIR\lib\pkgconfig\libturbojpeg.pc
+Delete $INSTDIR\lib\cmake\@CMAKE_PROJECT_NAME@\@CMAKE_PROJECT_NAME@Config.cmake
+Delete $INSTDIR\lib\cmake\@CMAKE_PROJECT_NAME@\@CMAKE_PROJECT_NAME@ConfigVersion.cmake
+Delete $INSTDIR\lib\cmake\@CMAKE_PROJECT_NAME@\@CMAKE_PROJECT_NAME@Targets.cmake
+Delete $INSTDIR\lib\cmake\@CMAKE_PROJECT_NAME@\@CMAKE_PROJECT_NAME@Targets-release.cmake
 !ifdef JAVA
 Delete $INSTDIR\classes\turbojpeg.jar
 !endif

@@ -176,6 +185,8 @@ Section "Uninstall"

 RMDir "$INSTDIR\include"
 RMDir "$INSTDIR\lib\pkgconfig"
+RMDir "$INSTDIR\lib\cmake\@CMAKE_PROJECT_NAME@"
+RMDir "$INSTDIR\lib\cmake"
 RMDir "$INSTDIR\lib"
 RMDir "$INSTDIR\doc"
 !ifdef GCC

View File

@@ -1,66 +0,0 @@
#!/bin/sh
set -u
set -e
trap onexit INT
trap onexit TERM
trap onexit EXIT
TMPDIR=
onexit()
{
if [ ! "$TMPDIR" = "" ]; then
rm -rf $TMPDIR
fi
}
safedirmove ()
{
if [ "$1" = "$2" ]; then
return 0
fi
if [ "$1" = "" -o ! -d "$1" ]; then
echo safedirmove: source dir $1 is not valid
return 1
fi
if [ "$2" = "" -o -e "$2" ]; then
echo safedirmove: dest dir $2 is not valid
return 1
fi
if [ "$3" = "" -o -e "$3" ]; then
echo safedirmove: tmp dir $3 is not valid
return 1
fi
mkdir -p $3
mv $1/* $3/
rmdir $1
mkdir -p $2
mv $3/* $2/
rmdir $3
return 0
}
PKGNAME=@PKGNAME@
VERSION=@VERSION@
BUILD=@BUILD@
PREFIX=@CMAKE_INSTALL_PREFIX@
DOCDIR=@CMAKE_INSTALL_FULL_DOCDIR@
LIBDIR=@CMAKE_INSTALL_FULL_LIBDIR@
umask 022
rm -f $PKGNAME-$VERSION-$BUILD.tar.bz2
TMPDIR=`mktemp -d /tmp/ljtbuild.XXXXXX`
__PWD=`pwd`
make install DESTDIR=$TMPDIR/pkg
if [ "$PREFIX" = "@CMAKE_INSTALL_DEFAULT_PREFIX@" -a "$DOCDIR" = "@CMAKE_INSTALL_DEFAULT_PREFIX@/doc" ]; then
safedirmove $TMPDIR/pkg$DOCDIR $TMPDIR/pkg/usr/share/doc/$PKGNAME-$VERSION $TMPDIR/__tmpdoc
ln -fs /usr/share/doc/$PKGNAME-$VERSION $TMPDIR/pkg$DOCDIR
fi
cd $TMPDIR/pkg
tar cfj ../$PKGNAME-$VERSION-$BUILD.tar.bz2 *
cd $__PWD
mv $TMPDIR/*.tar.bz2 .
exit 0

View File

@@ -67,7 +67,7 @@ makedeb()
 mkdir $TMPDIR/DEBIAN
 if [ $SUPPLEMENT = 1 ]; then
-make install DESTDIR=$TMPDIR
+DESTDIR=$TMPDIR @CMAKE_MAKE_PROGRAM@ install
 rm -rf $TMPDIR$BINDIR
 if [ "$DATAROOTDIR" != "$PREFIX" ]; then
 rm -rf $TMPDIR$DATAROOTDIR

@@ -79,7 +79,7 @@ makedeb()

 rm -rf $TMPDIR$INCLUDEDIR
 rm -rf $TMPDIR$MANDIR
 else
-make install DESTDIR=$TMPDIR
+DESTDIR=$TMPDIR @CMAKE_MAKE_PROGRAM@ install
 if [ "$PREFIX" = "@CMAKE_INSTALL_DEFAULT_PREFIX@" -a "$DOCDIR" = "@CMAKE_INSTALL_DEFAULT_PREFIX@/doc" ]; then
 safedirmove $TMPDIR/$DOCDIR $TMPDIR/usr/share/doc/$PKGNAME-$VERSION $TMPDIR/__tmpdoc
 ln -fs /usr/share/doc/$DIRNAME-$VERSION $TMPDIR$DOCDIR

View File

@@ -43,23 +43,18 @@ safedirmove ()
 usage()
 {
-echo "$0 [universal] [-lipo [path to lipo]]"
+echo "$0 [-lipo [path to lipo]]"
 exit 1
 }
-UNIVERSAL=0
 PKGNAME=@PKGNAME@
 VERSION=@VERSION@
 BUILD=@BUILD@
 SRCDIR=@CMAKE_CURRENT_SOURCE_DIR@
-BUILDDIR32=@OSX_32BIT_BUILD@
-BUILDDIRARMV7=@IOS_ARMV7_BUILD@
-BUILDDIRARMV7S=@IOS_ARMV7S_BUILD@
-BUILDDIRARMV8=@IOS_ARMV8_BUILD@
+BUILDDIRARMV8=@ARMV8_BUILD@
 WITH_JAVA=@WITH_JAVA@
-OSX_APP_CERT_NAME="@OSX_APP_CERT_NAME@"
-OSX_INST_CERT_NAME="@OSX_INST_CERT_NAME@"
+MACOS_APP_CERT_NAME="@MACOS_APP_CERT_NAME@"
+MACOS_INST_CERT_NAME="@MACOS_INST_CERT_NAME@"
 LIPO=lipo
 PREFIX=@CMAKE_INSTALL_PREFIX@
@@ -82,9 +77,6 @@ while [ $# -gt 0 ]; do
 fi
 fi
 ;;
-universal)
-UNIVERSAL=1
-;;
 esac
 shift
 done
@@ -98,7 +90,7 @@ TMPDIR=`mktemp -d /tmp/$PKGNAME-build.XXXXXX`
 PKGROOT=$TMPDIR/pkg/Package_Root
 mkdir -p $PKGROOT
-make install DESTDIR=$PKGROOT
+DESTDIR=$PKGROOT @CMAKE_MAKE_PROGRAM@ install
 if [ "$PREFIX" = "@CMAKE_INSTALL_DEFAULT_PREFIX@" -a "$DOCDIR" = "@CMAKE_INSTALL_DEFAULT_PREFIX@/doc" ]; then
 mkdir -p $PKGROOT/Library/Documentation

@@ -106,62 +98,7 @@ if [ "$PREFIX" = "@CMAKE_INSTALL_DEFAULT_PREFIX@" -a "$DOCDIR" = "@CMAKE_INSTALL

 ln -fs /Library/Documentation/$PKGNAME $PKGROOT$DOCDIR
 fi
-if [ $UNIVERSAL = 1 -a "$BUILDDIR32" != "" ]; then
if [ ! -d $BUILDDIR32 ]; then
echo ERROR: 32-bit build directory $BUILDDIR32 does not exist
exit 1
fi
if [ ! -f $BUILDDIR32/Makefile ]; then
echo ERROR: 32-bit build directory $BUILDDIR32 is not configured
exit 1
fi
mkdir -p $TMPDIR/dist.x86
pushd $BUILDDIR32
make install DESTDIR=$TMPDIR/dist.x86
popd
$LIPO -create \
-arch i386 $TMPDIR/dist.x86/$LIBDIR/$LIBJPEG_DSO_NAME \
-arch x86_64 $PKGROOT/$LIBDIR/$LIBJPEG_DSO_NAME \
-output $PKGROOT/$LIBDIR/$LIBJPEG_DSO_NAME
$LIPO -create \
-arch i386 $TMPDIR/dist.x86/$LIBDIR/libjpeg.a \
-arch x86_64 $PKGROOT/$LIBDIR/libjpeg.a \
-output $PKGROOT/$LIBDIR/libjpeg.a
$LIPO -create \
-arch i386 $TMPDIR/dist.x86/$LIBDIR/$TURBOJPEG_DSO_NAME \
-arch x86_64 $PKGROOT/$LIBDIR/$TURBOJPEG_DSO_NAME \
-output $PKGROOT/$LIBDIR/$TURBOJPEG_DSO_NAME
$LIPO -create \
-arch i386 $TMPDIR/dist.x86/$LIBDIR/libturbojpeg.a \
-arch x86_64 $PKGROOT/$LIBDIR/libturbojpeg.a \
-output $PKGROOT/$LIBDIR/libturbojpeg.a
$LIPO -create \
-arch i386 $TMPDIR/dist.x86/$BINDIR/cjpeg \
-arch x86_64 $PKGROOT/$BINDIR/cjpeg \
-output $PKGROOT/$BINDIR/cjpeg
$LIPO -create \
-arch i386 $TMPDIR/dist.x86/$BINDIR/djpeg \
-arch x86_64 $PKGROOT/$BINDIR/djpeg \
-output $PKGROOT/$BINDIR/djpeg
$LIPO -create \
-arch i386 $TMPDIR/dist.x86/$BINDIR/jpegtran \
-arch x86_64 $PKGROOT/$BINDIR/jpegtran \
-output $PKGROOT/$BINDIR/jpegtran
$LIPO -create \
-arch i386 $TMPDIR/dist.x86/$BINDIR/tjbench \
-arch x86_64 $PKGROOT/$BINDIR/tjbench \
-output $PKGROOT/$BINDIR/tjbench
$LIPO -create \
-arch i386 $TMPDIR/dist.x86/$BINDIR/rdjpgcom \
-arch x86_64 $PKGROOT/$BINDIR/rdjpgcom \
-output $PKGROOT/$BINDIR/rdjpgcom
$LIPO -create \
-arch i386 $TMPDIR/dist.x86/$BINDIR/wrjpgcom \
-arch x86_64 $PKGROOT/$BINDIR/wrjpgcom \
-output $PKGROOT/$BINDIR/wrjpgcom
fi
-install_ios()
+install_subbuild()
 {
 BUILDDIR=$1
 ARCHNAME=$2
@@ -172,13 +109,13 @@ install_ios()
 echo ERROR: $ARCHNAME build directory $BUILDDIR does not exist
 exit 1
 fi
-if [ ! -f $BUILDDIR/Makefile ]; then
+if [ ! -f $BUILDDIR/Makefile -a ! -f $BUILDDIR/build.ninja ]; then
 echo ERROR: $ARCHNAME build directory $BUILDDIR is not configured
 exit 1
 fi
 mkdir -p $TMPDIR/dist.$DIRNAME
 pushd $BUILDDIR
-make install DESTDIR=$TMPDIR/dist.$DIRNAME
+DESTDIR=$TMPDIR/dist.$DIRNAME @CMAKE_MAKE_PROGRAM@ install
 popd
 $LIPO -create \
 $PKGROOT/$LIBDIR/$LIBJPEG_DSO_NAME \
@@ -222,28 +159,14 @@ install_ios()
 -output $PKGROOT/$BINDIR/wrjpgcom
 }
-if [ $UNIVERSAL = 1 -a "$BUILDDIRARMV7" != "" ]; then
-install_ios $BUILDDIRARMV7 Armv7 armv7 arm
-fi
-if [ $UNIVERSAL = 1 -a "$BUILDDIRARMV7S" != "" ]; then
-install_ios $BUILDDIRARMV7S Armv7s armv7s arm
-fi
-if [ $UNIVERSAL = 1 -a "$BUILDDIRARMV8" != "" ]; then
-install_ios $BUILDDIRARMV8 Armv8 armv8 arm64
+if [ "$BUILDDIRARMV8" != "" ]; then
+install_subbuild $BUILDDIRARMV8 Armv8 armv8 arm64
 fi
 install_name_tool -id $LIBDIR/$LIBJPEG_DSO_NAME $PKGROOT/$LIBDIR/$LIBJPEG_DSO_NAME
 install_name_tool -id $LIBDIR/$TURBOJPEG_DSO_NAME $PKGROOT/$LIBDIR/$TURBOJPEG_DSO_NAME
-if [ $WITH_JAVA = 1 ]; then
-ln -fs $TURBOJPEG_DSO_NAME $PKGROOT/$LIBDIR/libturbojpeg.jnilib
-fi
 if [ "$PREFIX" = "@CMAKE_INSTALL_DEFAULT_PREFIX@" -a "$LIBDIR" = "@CMAKE_INSTALL_DEFAULT_PREFIX@/lib" ]; then
-if [ ! -h $PKGROOT/$PREFIX/lib32 ]; then
-ln -fs lib $PKGROOT/$PREFIX/lib32
-fi
 if [ ! -h $PKGROOT/$PREFIX/lib64 ]; then
 ln -fs lib $PKGROOT/$PREFIX/lib64
 fi
@@ -255,28 +178,28 @@ install -m 755 pkgscripts/uninstall $PKGROOT/$BINDIR/
 find $PKGROOT -type f | while read file; do xattr -c $file; done
-cp $SRCDIR/release/License.rtf $SRCDIR/release/Welcome.rtf $SRCDIR/release/ReadMe.txt $TMPDIR/pkg/
+cp $SRCDIR/release/License.rtf pkgscripts/Welcome.rtf $SRCDIR/release/ReadMe.txt $TMPDIR/pkg/
 mkdir $TMPDIR/dmg
 pkgbuild --root $PKGROOT --version $VERSION.$BUILD --identifier @PKGID@ \
 $TMPDIR/pkg/$PKGNAME.pkg
 SUFFIX=
-if [ "$OSX_INST_CERT_NAME" != "" ]; then
+if [ "$MACOS_INST_CERT_NAME" != "" ]; then
 SUFFIX=-unsigned
 fi
 productbuild --distribution pkgscripts/Distribution.xml \
 --package-path $TMPDIR/pkg/ --resources $TMPDIR/pkg/ \
 $TMPDIR/dmg/$PKGNAME$SUFFIX.pkg
-if [ "$OSX_INST_CERT_NAME" != "" ]; then
-productsign --sign "$OSX_INST_CERT_NAME" --timestamp \
+if [ "$MACOS_INST_CERT_NAME" != "" ]; then
+productsign --sign "$MACOS_INST_CERT_NAME" --timestamp \
 $TMPDIR/dmg/$PKGNAME$SUFFIX.pkg $TMPDIR/dmg/$PKGNAME.pkg
 rm -r $TMPDIR/dmg/$PKGNAME$SUFFIX.pkg
 pkgutil --check-signature $TMPDIR/dmg/$PKGNAME.pkg
 fi
 hdiutil create -fs HFS+ -volname $PKGNAME-$VERSION \
 -srcfolder "$TMPDIR/dmg" $TMPDIR/$PKGNAME-$VERSION.dmg
-if [ "$OSX_APP_CERT_NAME" != "" ]; then
-codesign -s "$OSX_APP_CERT_NAME" --timestamp $TMPDIR/$PKGNAME-$VERSION.dmg
+if [ "$MACOS_APP_CERT_NAME" != "" ]; then
+codesign -s "$MACOS_APP_CERT_NAME" --timestamp $TMPDIR/$PKGNAME-$VERSION.dmg
 codesign -vv $TMPDIR/$PKGNAME-$VERSION.dmg
 fi
 cp $TMPDIR/$PKGNAME-$VERSION.dmg .

View File

@@ -32,7 +32,7 @@ rm -f $PKGNAME-$VERSION-$OS-$ARCH.tar.bz2
 TMPDIR=`mktemp -d /tmp/$PKGNAME-build.XXXXXX`
 mkdir -p $TMPDIR/install
-make install DESTDIR=$TMPDIR/install
+DESTDIR=$TMPDIR/install @CMAKE_MAKE_PROGRAM@ install
 echo tartest >$TMPDIR/tartest
 GNUTAR=0
 BSDTAR=0

View File

@@ -53,7 +53,7 @@ Provides: %{name} = %{version}-%{release}, @CMAKE_PROJECT_NAME@ = %{version}-%{r
 %description
 libjpeg-turbo is a JPEG image codec that uses SIMD instructions to accelerate
 baseline JPEG compression and decompression on x86, x86-64, Arm, PowerPC, and
-MIPS systems, as well as progressive JPEG compression on x86 and x86-64
+MIPS systems, as well as progressive JPEG compression on x86, x86-64, and Arm
 systems. On such systems, libjpeg-turbo is generally 2-6x as fast as libjpeg,
 all else being equal. On other types of systems, libjpeg-turbo can still
 outperform libjpeg by a significant amount, by virtue of its highly-optimized

@@ -102,7 +102,7 @@ broader range of users and developers.

 %install
 rm -rf $RPM_BUILD_ROOT
-make install DESTDIR=$RPM_BUILD_ROOT
+DESTDIR=$RPM_BUILD_ROOT @CMAKE_MAKE_PROGRAM@ install
 /sbin/ldconfig -n $RPM_BUILD_ROOT%{_libdir}
 #-->%if 0

@@ -184,6 +184,9 @@ rm -rf $RPM_BUILD_ROOT

 %endif
 %dir %{_libdir}/pkgconfig
 %{_libdir}/pkgconfig/libjpeg.pc
+%dir %{_libdir}/cmake
+%dir %{_libdir}/cmake/@CMAKE_PROJECT_NAME@
+%{_libdir}/cmake/@CMAKE_PROJECT_NAME@
 %if "%{_with_turbojpeg}" == "1"
 %if "%{_enable_shared}" == "1" || "%{_with_java}" == "1"
 %{_libdir}/libturbojpeg.so.@TURBOJPEG_SO_VERSION@

View File

@@ -1,4 +1,5 @@
-# Copyright (C)2009-2011, 2013, 2016 D. R. Commander. All Rights Reserved.
+# Copyright (C)2009-2011, 2013, 2016, 2020 D. R. Commander.
+# All Rights Reserved.
 #
 # Redistribution and use in source and binary forms, with or without
 # modification, are permitted provided that the following conditions are met:

@@ -70,6 +71,12 @@ fi

 if [ -d $LIBDIR/pkgconfig ]; then
 rmdir $LIBDIR/pkgconfig 2>&1 || EXITSTATUS=-1
 fi
+if [ -d $LIBDIR/cmake/@CMAKE_PROJECT_NAME@ ]; then
+rmdir $LIBDIR/cmake/@CMAKE_PROJECT_NAME@ || EXITSTATUS=-1
+fi
+if [ -d $LIBDIR/cmake ]; then
+rmdir $LIBDIR/cmake || EXITSTATUS=-1
+fi
 if [ -d $LIBDIR ]; then
 rmdir $LIBDIR 2>&1 || EXITSTATUS=-1
 fi

@@ -90,7 +97,7 @@ fi

 if [ -d $MANDIR ]; then
 rmdir $MANDIR 2>&1 || EXITSTATUS=-1
 fi
-if [ -d $JAVADIR ]; then
+if [ -d "$JAVADIR" ]; then
 rmdir $JAVADIR 2>&1 || EXITSTATUS=-1
 fi
 if [ -d $DATAROOTDIR -a "$DATAROOTDIR" != "$PREFIX" ]; then

View File

@@ -112,10 +112,13 @@ set_property(TARGET jpegtran PROPERTY COMPILE_FLAGS "${USE_SETMODE}")
 add_executable(jcstest ../jcstest.c)
 target_link_libraries(jcstest jpeg)

-install(TARGETS jpeg cjpeg djpeg jpegtran
+install(TARGETS jpeg EXPORT ${CMAKE_PROJECT_NAME}Targets
+INCLUDES DESTINATION ${CMAKE_INSTALL_INCLUDEDIR}
 ARCHIVE DESTINATION ${CMAKE_INSTALL_LIBDIR}
 LIBRARY DESTINATION ${CMAKE_INSTALL_LIBDIR}
 RUNTIME DESTINATION ${CMAKE_INSTALL_BINDIR})
+install(TARGETS cjpeg djpeg jpegtran
+RUNTIME DESTINATION ${CMAKE_INSTALL_BINDIR})

 if(NOT CMAKE_VERSION VERSION_LESS "3.1" AND MSVC AND
 CMAKE_C_LINKER_SUPPORTS_PDB)
 install(FILES "$<TARGET_PDB_FILE:jpeg>"

View File

@@ -30,6 +30,9 @@ if(CPU_TYPE STREQUAL "x86_64")
 if(CYGWIN)
 set(CMAKE_ASM_NASM_OBJECT_FORMAT win64)
 endif()
+if(CMAKE_C_COMPILER_ABI MATCHES "ELF X32")
+set(CMAKE_ASM_NASM_OBJECT_FORMAT elfx32)
+endif()
 elseif(CPU_TYPE STREQUAL "i386")
 if(BORLAND)
 set(CMAKE_ASM_NASM_OBJECT_FORMAT obj)
@@ -205,64 +208,76 @@ endif()
 ###############################################################################
-# Arm (GAS)
+# Arm (Intrinsics or GAS)
 ###############################################################################

 elseif(CPU_TYPE STREQUAL "arm64" OR CPU_TYPE STREQUAL "arm")

-enable_language(ASM)
-set(CMAKE_ASM_FLAGS "${CMAKE_C_FLAGS} ${CMAKE_ASM_FLAGS}")
-string(TOUPPER ${CMAKE_BUILD_TYPE} CMAKE_BUILD_TYPE_UC)
-set(EFFECTIVE_ASM_FLAGS "${CMAKE_ASM_FLAGS} ${CMAKE_ASM_FLAGS_${CMAKE_BUILD_TYPE_UC}}")
-message(STATUS "CMAKE_ASM_FLAGS = ${EFFECTIVE_ASM_FLAGS}")
-# Test whether we need gas-preprocessor.pl
-if(CPU_TYPE STREQUAL "arm")
-file(WRITE ${CMAKE_CURRENT_BINARY_DIR}/gastest.S "
-.text
-.fpu neon
-.arch armv7a
-.object_arch armv4
-.arm
-pld [r0]
-vmovn.u16 d0, q0")
-else()
-file(WRITE ${CMAKE_CURRENT_BINARY_DIR}/gastest.S "
-.text
-MYVAR .req x0
-movi v0.16b, #100
-mov MYVAR, #100
-.unreq MYVAR")
-endif()
-separate_arguments(CMAKE_ASM_FLAGS_SEP UNIX_COMMAND "${CMAKE_ASM_FLAGS}")
-execute_process(COMMAND ${CMAKE_ASM_COMPILER} ${CMAKE_ASM_FLAGS_SEP}
--x assembler-with-cpp -c ${CMAKE_CURRENT_BINARY_DIR}/gastest.S
-RESULT_VARIABLE RESULT OUTPUT_VARIABLE OUTPUT ERROR_VARIABLE ERROR)
-if(NOT RESULT EQUAL 0)
-message(STATUS "GAS appears to be broken. Trying gas-preprocessor.pl ...")
-execute_process(COMMAND gas-preprocessor.pl ${CMAKE_ASM_COMPILER}
-${CMAKE_ASM_FLAGS_SEP} -x assembler-with-cpp -c
-${CMAKE_CURRENT_BINARY_DIR}/gastest.S
-RESULT_VARIABLE RESULT OUTPUT_VARIABLE OUTPUT ERROR_VARIABLE ERROR)
-if(NOT RESULT EQUAL 0)
-simd_fail("SIMD extensions disabled: GAS is not working properly")
-return()
-else()
-message(STATUS "Using gas-preprocessor.pl")
-configure_file(gas-preprocessor.in gas-preprocessor @ONLY)
-set(CMAKE_ASM_COMPILER ${CMAKE_CURRENT_BINARY_DIR}/gas-preprocessor)
-endif()
-else()
-message(STATUS "GAS is working properly")
-endif()
-file(REMOVE ${CMAKE_CURRENT_BINARY_DIR}/gastest.S)
-add_library(simd OBJECT ${CPU_TYPE}/jsimd_neon.S ${CPU_TYPE}/jsimd.c)
+include(CheckSymbolExists)
+if(BITS EQUAL 32)
+set(CMAKE_REQUIRED_FLAGS -mfpu=neon)
+endif()
+check_symbol_exists(vld1_s16_x3 arm_neon.h HAVE_VLD1_S16_X3)
+check_symbol_exists(vld1_u16_x2 arm_neon.h HAVE_VLD1_U16_X2)
+check_symbol_exists(vld1q_u8_x4 arm_neon.h HAVE_VLD1Q_U8_X4)
+if(BITS EQUAL 32)
+unset(CMAKE_REQUIRED_FLAGS)
+endif()
+configure_file(arm/neon-compat.h.in arm/neon-compat.h @ONLY)
+include_directories(${CMAKE_CURRENT_BINARY_DIR}/arm)
+# GCC (as of this writing) and some older versions of Clang do not have a full
+# or optimal set of Neon intrinsics, so for performance reasons, when using
+# those compilers, we default to using the older GAS implementation of the Neon
+# SIMD extensions for certain algorithms. The presence or absence of the three
+# intrinsics we tested above is a reasonable proxy for this. We always default
+# to using the full Neon intrinsics implementation when building for macOS or
+# iOS, to avoid the need for gas-preprocessor.
+if((HAVE_VLD1_S16_X3 AND HAVE_VLD1_U16_X2 AND HAVE_VLD1Q_U8_X4) OR APPLE)
+set(DEFAULT_NEON_INTRINSICS 1)
+else()
+set(DEFAULT_NEON_INTRINSICS 0)
+endif()
+option(NEON_INTRINSICS
+"Because GCC (as of this writing) and some older versions of Clang do not have a full or optimal set of Neon intrinsics, for performance reasons, the default when building libjpeg-turbo with those compilers is to continue using the older GAS implementation of the Neon SIMD extensions for certain algorithms. Setting this option forces the full Neon intrinsics implementation to be used with all compilers. Unsetting this option forces the hybrid GAS/intrinsics implementation to be used with all compilers."
+${DEFAULT_NEON_INTRINSICS})
+boolean_number(NEON_INTRINSICS PARENT_SCOPE)
+if(NEON_INTRINSICS)
+add_definitions(-DNEON_INTRINSICS)
+message(STATUS "Use full Neon SIMD intrinsics implementation (NEON_INTRINSICS = ${NEON_INTRINSICS})")
+else()
+message(STATUS "Use partial Neon SIMD intrinsics implementation (NEON_INTRINSICS = ${NEON_INTRINSICS})")
+endif()
+set(SIMD_SOURCES arm/jcgray-neon.c arm/jcphuff-neon.c arm/jcsample-neon.c
+arm/jdmerge-neon.c arm/jdsample-neon.c arm/jfdctfst-neon.c
+arm/jidctred-neon.c arm/jquanti-neon.c)
+if(NEON_INTRINSICS)
+set(SIMD_SOURCES ${SIMD_SOURCES} arm/jccolor-neon.c arm/jidctint-neon.c)
+endif()
+if(NEON_INTRINSICS OR BITS EQUAL 64)
+set(SIMD_SOURCES ${SIMD_SOURCES} arm/jidctfst-neon.c)
+endif()
+if(NEON_INTRINSICS OR BITS EQUAL 32)
+set(SIMD_SOURCES ${SIMD_SOURCES} arm/aarch${BITS}/jchuff-neon.c
+arm/jdcolor-neon.c arm/jfdctint-neon.c)
+endif()
+if(BITS EQUAL 32)
+set_source_files_properties(${SIMD_SOURCES} COMPILE_FLAGS -mfpu=neon)
+endif()
+if(NOT NEON_INTRINSICS)
+enable_language(ASM)
+set(CMAKE_ASM_FLAGS "${CMAKE_C_FLAGS} ${CMAKE_ASM_FLAGS}")
+string(TOUPPER ${CMAKE_BUILD_TYPE} CMAKE_BUILD_TYPE_UC)
+set(EFFECTIVE_ASM_FLAGS "${CMAKE_ASM_FLAGS} ${CMAKE_ASM_FLAGS_${CMAKE_BUILD_TYPE_UC}}")
+message(STATUS "CMAKE_ASM_FLAGS = ${EFFECTIVE_ASM_FLAGS}")
+set(SIMD_SOURCES ${SIMD_SOURCES} arm/aarch${BITS}/jsimd_neon.S)
+endif()
+add_library(simd OBJECT ${SIMD_SOURCES} arm/aarch${BITS}/jsimd.c)

 if(CMAKE_POSITION_INDEPENDENT_CODE OR ENABLE_SHARED)
 set_target_properties(simd PROPERTIES POSITION_INDEPENDENT_CODE 1)
@@ -311,14 +326,35 @@ if(CMAKE_POSITION_INDEPENDENT_CODE OR ENABLE_SHARED)
 endif()

 ###############################################################################
-# Loongson (Intrinsics)
+# MIPS64 (Intrinsics)
 ###############################################################################

-elseif(CPU_TYPE STREQUAL "loongson")
+elseif(CPU_TYPE STREQUAL "loongson" OR CPU_TYPE MATCHES "mips64*")

-set(SIMD_SOURCES loongson/jccolor-mmi.c loongson/jcsample-mmi.c
-loongson/jdcolor-mmi.c loongson/jdsample-mmi.c loongson/jfdctint-mmi.c
-loongson/jidctint-mmi.c loongson/jquanti-mmi.c)
+set(CMAKE_REQUIRED_FLAGS -Wa,-mloongson-mmi,-mloongson-ext)
+check_c_source_compiles("
+int main(void) {
+int c = 0, a = 0, b = 0;
+asm (
+\"paddb %0, %1, %2\"
+: \"=f\" (c)
+: \"f\" (a), \"f\" (b)
+);
+return c;
+}" HAVE_MMI)
+unset(CMAKE_REQUIRED_FLAGS)
+if(NOT HAVE_MMI)
+simd_fail("SIMD extensions not available for this CPU")
+return()
+endif()
+set(SIMD_SOURCES mips64/jccolor-mmi.c mips64/jcgray-mmi.c mips64/jcsample-mmi.c
+mips64/jdcolor-mmi.c mips64/jdmerge-mmi.c mips64/jdsample-mmi.c
+mips64/jfdctfst-mmi.c mips64/jfdctint-mmi.c mips64/jidctfst-mmi.c
+mips64/jidctint-mmi.c mips64/jquanti-mmi.c)

 if(CMAKE_COMPILER_IS_GNUCC)
 foreach(file ${SIMD_SOURCES})
@@ -326,8 +362,12 @@ if(CMAKE_COMPILER_IS_GNUCC)
" -fno-strict-aliasing") " -fno-strict-aliasing")
endforeach() endforeach()
endif() endif()
foreach(file ${SIMD_SOURCES})
set_property(SOURCE ${file} APPEND_STRING PROPERTY COMPILE_FLAGS
" -Wa,-mloongson-mmi,-mloongson-ext")
endforeach()
add_library(simd OBJECT ${SIMD_SOURCES} loongson/jsimd.c) add_library(simd OBJECT ${SIMD_SOURCES} mips64/jsimd.c)
if(CMAKE_POSITION_INDEPENDENT_CODE OR ENABLE_SHARED) if(CMAKE_POSITION_INDEPENDENT_CODE OR ENABLE_SHARED)
set_target_properties(simd PROPERTIES POSITION_INDEPENDENT_CODE 1) set_target_properties(simd PROPERTIES POSITION_INDEPENDENT_CODE 1)
@@ -0,0 +1,148 @@
/*
* jccolext-neon.c - colorspace conversion (32-bit Arm Neon)
*
* Copyright (C) 2020, Arm Limited. All Rights Reserved.
* Copyright (C) 2020, D. R. Commander. All Rights Reserved.
*
* This software is provided 'as-is', without any express or implied
* warranty. In no event will the authors be held liable for any damages
* arising from the use of this software.
*
* Permission is granted to anyone to use this software for any purpose,
* including commercial applications, and to alter it and redistribute it
* freely, subject to the following restrictions:
*
* 1. The origin of this software must not be misrepresented; you must not
* claim that you wrote the original software. If you use this software
* in a product, an acknowledgment in the product documentation would be
* appreciated but is not required.
* 2. Altered source versions must be plainly marked as such, and must not be
* misrepresented as being the original software.
* 3. This notice may not be removed or altered from any source distribution.
*/
/* This file is included by jccolor-neon.c */
/* RGB -> YCbCr conversion is defined by the following equations:
* Y = 0.29900 * R + 0.58700 * G + 0.11400 * B
* Cb = -0.16874 * R - 0.33126 * G + 0.50000 * B + 128
* Cr = 0.50000 * R - 0.41869 * G - 0.08131 * B + 128
*
* Avoid floating point arithmetic by using shifted integer constants:
* 0.29899597 = 19595 * 2^-16
* 0.58700561 = 38470 * 2^-16
* 0.11399841 = 7471 * 2^-16
* 0.16874695 = 11059 * 2^-16
* 0.33125305 = 21709 * 2^-16
* 0.50000000 = 32768 * 2^-16
* 0.41868592 = 27439 * 2^-16
* 0.08131409 = 5329 * 2^-16
* These constants are defined in jccolor-neon.c
*
* We add the fixed-point equivalent of 0.5 to Cb and Cr, which effectively
* rounds up or down the result via integer truncation.
*/
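/* As a plain-C illustration, the fixed-point arithmetic described above is
 * equivalent to the following scalar routine (a minimal sketch only; the
 * constant and function names are hypothetical, and the real constants are
 * the jsimd_rgb_ycc_neon_consts table defined in jccolor-neon.c):
 */
#include <stdint.h>
static void rgb_ycc_pixel_sketch(uint8_t r, uint8_t g, uint8_t b,
                                 uint8_t *y, uint8_t *cb, uint8_t *cr)
{
  const uint32_t F_0_299 = 19595, F_0_587 = 38470, F_0_114 = 7471;
  const uint32_t F_0_168 = 11059, F_0_331 = 21709, F_0_500 = 32768;
  const uint32_t F_0_418 = 27439, F_0_081 = 5329;
  /* 128 in 16.16 fixed point plus approximately 0.5 (32767/65536), so that
   * the plain right shift below rounds Cb and Cr to the nearest integer. */
  const uint32_t scaled_128_5 = (128 << 16) + 32767;
  /* Y uses a rounding right shift (the vector code uses vrshrn_n_u32). */
  *y  = (uint8_t)((F_0_299 * r + F_0_587 * g + F_0_114 * b + 32768) >> 16);
  /* Cb and Cr fold the rounding constant into scaled_128_5 and truncate. */
  *cb = (uint8_t)((scaled_128_5 - F_0_168 * r - F_0_331 * g + F_0_500 * b) >> 16);
  *cr = (uint8_t)((scaled_128_5 + F_0_500 * r - F_0_418 * g - F_0_081 * b) >> 16);
}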
void jsimd_rgb_ycc_convert_neon(JDIMENSION image_width, JSAMPARRAY input_buf,
JSAMPIMAGE output_buf, JDIMENSION output_row,
int num_rows)
{
/* Pointer to RGB(X/A) input data */
JSAMPROW inptr;
/* Pointers to Y, Cb, and Cr output data */
JSAMPROW outptr0, outptr1, outptr2;
/* Allocate temporary buffer for final (image_width % 8) pixels in row. */
ALIGN(16) uint8_t tmp_buf[8 * RGB_PIXELSIZE];
/* Set up conversion constants. */
#ifdef HAVE_VLD1_U16_X2
const uint16x4x2_t consts = vld1_u16_x2(jsimd_rgb_ycc_neon_consts);
#else
/* GCC does not currently support the intrinsic vld1_<type>_x2(). */
const uint16x4_t consts1 = vld1_u16(jsimd_rgb_ycc_neon_consts);
const uint16x4_t consts2 = vld1_u16(jsimd_rgb_ycc_neon_consts + 4);
const uint16x4x2_t consts = { { consts1, consts2 } };
#endif
const uint32x4_t scaled_128_5 = vdupq_n_u32((128 << 16) + 32767);
while (--num_rows >= 0) {
inptr = *input_buf++;
outptr0 = output_buf[0][output_row];
outptr1 = output_buf[1][output_row];
outptr2 = output_buf[2][output_row];
output_row++;
int cols_remaining = image_width;
for (; cols_remaining > 0; cols_remaining -= 8) {
/* To prevent buffer overread by the vector load instructions, the last
* (image_width % 8) columns of data are first memcopied to a temporary
* buffer large enough to accommodate the vector load.
*/
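/* Illustrative example: for a 13-pixel-wide extRGB row, the final loop
 * iteration has cols_remaining = 5, so 5 * 3 = 15 bytes are copied into
 * tmp_buf and the 8-pixel vld3_u8() below reads from the padded buffer
 * instead of running past the end of the input row. */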
if (cols_remaining < 8) {
memcpy(tmp_buf, inptr, cols_remaining * RGB_PIXELSIZE);
inptr = tmp_buf;
}
#if RGB_PIXELSIZE == 4
uint8x8x4_t input_pixels = vld4_u8(inptr);
#else
uint8x8x3_t input_pixels = vld3_u8(inptr);
#endif
uint16x8_t r = vmovl_u8(input_pixels.val[RGB_RED]);
uint16x8_t g = vmovl_u8(input_pixels.val[RGB_GREEN]);
uint16x8_t b = vmovl_u8(input_pixels.val[RGB_BLUE]);
/* Compute Y = 0.29900 * R + 0.58700 * G + 0.11400 * B */
uint32x4_t y_low = vmull_lane_u16(vget_low_u16(r), consts.val[0], 0);
y_low = vmlal_lane_u16(y_low, vget_low_u16(g), consts.val[0], 1);
y_low = vmlal_lane_u16(y_low, vget_low_u16(b), consts.val[0], 2);
uint32x4_t y_high = vmull_lane_u16(vget_high_u16(r), consts.val[0], 0);
y_high = vmlal_lane_u16(y_high, vget_high_u16(g), consts.val[0], 1);
y_high = vmlal_lane_u16(y_high, vget_high_u16(b), consts.val[0], 2);
/* Compute Cb = -0.16874 * R - 0.33126 * G + 0.50000 * B + 128 */
uint32x4_t cb_low = scaled_128_5;
cb_low = vmlsl_lane_u16(cb_low, vget_low_u16(r), consts.val[0], 3);
cb_low = vmlsl_lane_u16(cb_low, vget_low_u16(g), consts.val[1], 0);
cb_low = vmlal_lane_u16(cb_low, vget_low_u16(b), consts.val[1], 1);
uint32x4_t cb_high = scaled_128_5;
cb_high = vmlsl_lane_u16(cb_high, vget_high_u16(r), consts.val[0], 3);
cb_high = vmlsl_lane_u16(cb_high, vget_high_u16(g), consts.val[1], 0);
cb_high = vmlal_lane_u16(cb_high, vget_high_u16(b), consts.val[1], 1);
/* Compute Cr = 0.50000 * R - 0.41869 * G - 0.08131 * B + 128 */
uint32x4_t cr_low = scaled_128_5;
cr_low = vmlal_lane_u16(cr_low, vget_low_u16(r), consts.val[1], 1);
cr_low = vmlsl_lane_u16(cr_low, vget_low_u16(g), consts.val[1], 2);
cr_low = vmlsl_lane_u16(cr_low, vget_low_u16(b), consts.val[1], 3);
uint32x4_t cr_high = scaled_128_5;
cr_high = vmlal_lane_u16(cr_high, vget_high_u16(r), consts.val[1], 1);
cr_high = vmlsl_lane_u16(cr_high, vget_high_u16(g), consts.val[1], 2);
cr_high = vmlsl_lane_u16(cr_high, vget_high_u16(b), consts.val[1], 3);
/* Descale Y values (rounding right shift) and narrow to 16-bit. */
uint16x8_t y_u16 = vcombine_u16(vrshrn_n_u32(y_low, 16),
vrshrn_n_u32(y_high, 16));
/* Descale Cb values (right shift) and narrow to 16-bit. */
uint16x8_t cb_u16 = vcombine_u16(vshrn_n_u32(cb_low, 16),
vshrn_n_u32(cb_high, 16));
/* Descale Cr values (right shift) and narrow to 16-bit. */
uint16x8_t cr_u16 = vcombine_u16(vshrn_n_u32(cr_low, 16),
vshrn_n_u32(cr_high, 16));
/* Narrow Y, Cb, and Cr values to 8-bit and store to memory. Buffer
* overwrite is permitted up to the next multiple of ALIGN_SIZE bytes.
*/
vst1_u8(outptr0, vmovn_u16(y_u16));
vst1_u8(outptr1, vmovn_u16(cb_u16));
vst1_u8(outptr2, vmovn_u16(cr_u16));
/* Increment pointers. */
inptr += (8 * RGB_PIXELSIZE);
outptr0 += 8;
outptr1 += 8;
outptr2 += 8;
}
}
}
@@ -0,0 +1,334 @@
/*
* jchuff-neon.c - Huffman entropy encoding (32-bit Arm Neon)
*
* Copyright (C) 2020, Arm Limited. All Rights Reserved.
*
* This software is provided 'as-is', without any express or implied
* warranty. In no event will the authors be held liable for any damages
* arising from the use of this software.
*
* Permission is granted to anyone to use this software for any purpose,
* including commercial applications, and to alter it and redistribute it
* freely, subject to the following restrictions:
*
* 1. The origin of this software must not be misrepresented; you must not
* claim that you wrote the original software. If you use this software
* in a product, an acknowledgment in the product documentation would be
* appreciated but is not required.
* 2. Altered source versions must be plainly marked as such, and must not be
* misrepresented as being the original software.
* 3. This notice may not be removed or altered from any source distribution.
*
* NOTE: All referenced figures are from
* Recommendation ITU-T T.81 (1992) | ISO/IEC 10918-1:1994.
*/
#define JPEG_INTERNALS
#include "../../../jinclude.h"
#include "../../../jpeglib.h"
#include "../../../jsimd.h"
#include "../../../jdct.h"
#include "../../../jsimddct.h"
#include "../../jsimd.h"
#include "../jchuff.h"
#include "neon-compat.h"
#include <limits.h>
#include <arm_neon.h>
JOCTET *jsimd_huff_encode_one_block_neon(void *state, JOCTET *buffer,
JCOEFPTR block, int last_dc_val,
c_derived_tbl *dctbl,
c_derived_tbl *actbl)
{
uint8_t block_nbits[DCTSIZE2];
uint16_t block_diff[DCTSIZE2];
/* Load rows of coefficients from DCT block in zig-zag order. */
/* Compute DC coefficient difference value. (F.1.1.5.1) */
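/* (Illustrative: if the previous block's quantized DC value was 10 and this
 * block's is 12, the value encoded here is the difference, 2.) */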
int16x8_t row0 = vdupq_n_s16(block[0] - last_dc_val);
row0 = vld1q_lane_s16(block + 1, row0, 1);
row0 = vld1q_lane_s16(block + 8, row0, 2);
row0 = vld1q_lane_s16(block + 16, row0, 3);
row0 = vld1q_lane_s16(block + 9, row0, 4);
row0 = vld1q_lane_s16(block + 2, row0, 5);
row0 = vld1q_lane_s16(block + 3, row0, 6);
row0 = vld1q_lane_s16(block + 10, row0, 7);
int16x8_t row1 = vld1q_dup_s16(block + 17);
row1 = vld1q_lane_s16(block + 24, row1, 1);
row1 = vld1q_lane_s16(block + 32, row1, 2);
row1 = vld1q_lane_s16(block + 25, row1, 3);
row1 = vld1q_lane_s16(block + 18, row1, 4);
row1 = vld1q_lane_s16(block + 11, row1, 5);
row1 = vld1q_lane_s16(block + 4, row1, 6);
row1 = vld1q_lane_s16(block + 5, row1, 7);
int16x8_t row2 = vld1q_dup_s16(block + 12);
row2 = vld1q_lane_s16(block + 19, row2, 1);
row2 = vld1q_lane_s16(block + 26, row2, 2);
row2 = vld1q_lane_s16(block + 33, row2, 3);
row2 = vld1q_lane_s16(block + 40, row2, 4);
row2 = vld1q_lane_s16(block + 48, row2, 5);
row2 = vld1q_lane_s16(block + 41, row2, 6);
row2 = vld1q_lane_s16(block + 34, row2, 7);
int16x8_t row3 = vld1q_dup_s16(block + 27);
row3 = vld1q_lane_s16(block + 20, row3, 1);
row3 = vld1q_lane_s16(block + 13, row3, 2);
row3 = vld1q_lane_s16(block + 6, row3, 3);
row3 = vld1q_lane_s16(block + 7, row3, 4);
row3 = vld1q_lane_s16(block + 14, row3, 5);
row3 = vld1q_lane_s16(block + 21, row3, 6);
row3 = vld1q_lane_s16(block + 28, row3, 7);
int16x8_t abs_row0 = vabsq_s16(row0);
int16x8_t abs_row1 = vabsq_s16(row1);
int16x8_t abs_row2 = vabsq_s16(row2);
int16x8_t abs_row3 = vabsq_s16(row3);
int16x8_t row0_lz = vclzq_s16(abs_row0);
int16x8_t row1_lz = vclzq_s16(abs_row1);
int16x8_t row2_lz = vclzq_s16(abs_row2);
int16x8_t row3_lz = vclzq_s16(abs_row3);
/* Compute number of bits required to represent each coefficient. */
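/* For example (illustrative): an absolute value of 5 = 0b101 occupies a
 * 16-bit lane with 13 leading zeros, so nbits = 16 - 13 = 3. */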
uint8x8_t row0_nbits = vsub_u8(vdup_n_u8(16),
vmovn_u16(vreinterpretq_u16_s16(row0_lz)));
uint8x8_t row1_nbits = vsub_u8(vdup_n_u8(16),
vmovn_u16(vreinterpretq_u16_s16(row1_lz)));
uint8x8_t row2_nbits = vsub_u8(vdup_n_u8(16),
vmovn_u16(vreinterpretq_u16_s16(row2_lz)));
uint8x8_t row3_nbits = vsub_u8(vdup_n_u8(16),
vmovn_u16(vreinterpretq_u16_s16(row3_lz)));
vst1_u8(block_nbits + 0 * DCTSIZE, row0_nbits);
vst1_u8(block_nbits + 1 * DCTSIZE, row1_nbits);
vst1_u8(block_nbits + 2 * DCTSIZE, row2_nbits);
vst1_u8(block_nbits + 3 * DCTSIZE, row3_nbits);
uint16x8_t row0_mask =
vshlq_u16(vreinterpretq_u16_s16(vshrq_n_s16(row0, 15)),
vnegq_s16(row0_lz));
uint16x8_t row1_mask =
vshlq_u16(vreinterpretq_u16_s16(vshrq_n_s16(row1, 15)),
vnegq_s16(row1_lz));
uint16x8_t row2_mask =
vshlq_u16(vreinterpretq_u16_s16(vshrq_n_s16(row2, 15)),
vnegq_s16(row2_lz));
uint16x8_t row3_mask =
vshlq_u16(vreinterpretq_u16_s16(vshrq_n_s16(row3, 15)),
vnegq_s16(row3_lz));
uint16x8_t row0_diff = veorq_u16(vreinterpretq_u16_s16(abs_row0), row0_mask);
uint16x8_t row1_diff = veorq_u16(vreinterpretq_u16_s16(abs_row1), row1_mask);
uint16x8_t row2_diff = veorq_u16(vreinterpretq_u16_s16(abs_row2), row2_mask);
uint16x8_t row3_diff = veorq_u16(vreinterpretq_u16_s16(abs_row3), row3_mask);
/* Store diff values for rows 0, 1, 2, and 3. */
vst1q_u16(block_diff + 0 * DCTSIZE, row0_diff);
vst1q_u16(block_diff + 1 * DCTSIZE, row1_diff);
vst1q_u16(block_diff + 2 * DCTSIZE, row2_diff);
vst1q_u16(block_diff + 3 * DCTSIZE, row3_diff);
/* Load last four rows of coefficients from DCT block in zig-zag order. */
int16x8_t row4 = vld1q_dup_s16(block + 35);
row4 = vld1q_lane_s16(block + 42, row4, 1);
row4 = vld1q_lane_s16(block + 49, row4, 2);
row4 = vld1q_lane_s16(block + 56, row4, 3);
row4 = vld1q_lane_s16(block + 57, row4, 4);
row4 = vld1q_lane_s16(block + 50, row4, 5);
row4 = vld1q_lane_s16(block + 43, row4, 6);
row4 = vld1q_lane_s16(block + 36, row4, 7);
int16x8_t row5 = vld1q_dup_s16(block + 29);
row5 = vld1q_lane_s16(block + 22, row5, 1);
row5 = vld1q_lane_s16(block + 15, row5, 2);
row5 = vld1q_lane_s16(block + 23, row5, 3);
row5 = vld1q_lane_s16(block + 30, row5, 4);
row5 = vld1q_lane_s16(block + 37, row5, 5);
row5 = vld1q_lane_s16(block + 44, row5, 6);
row5 = vld1q_lane_s16(block + 51, row5, 7);
int16x8_t row6 = vld1q_dup_s16(block + 58);
row6 = vld1q_lane_s16(block + 59, row6, 1);
row6 = vld1q_lane_s16(block + 52, row6, 2);
row6 = vld1q_lane_s16(block + 45, row6, 3);
row6 = vld1q_lane_s16(block + 38, row6, 4);
row6 = vld1q_lane_s16(block + 31, row6, 5);
row6 = vld1q_lane_s16(block + 39, row6, 6);
row6 = vld1q_lane_s16(block + 46, row6, 7);
int16x8_t row7 = vld1q_dup_s16(block + 53);
row7 = vld1q_lane_s16(block + 60, row7, 1);
row7 = vld1q_lane_s16(block + 61, row7, 2);
row7 = vld1q_lane_s16(block + 54, row7, 3);
row7 = vld1q_lane_s16(block + 47, row7, 4);
row7 = vld1q_lane_s16(block + 55, row7, 5);
row7 = vld1q_lane_s16(block + 62, row7, 6);
row7 = vld1q_lane_s16(block + 63, row7, 7);
int16x8_t abs_row4 = vabsq_s16(row4);
int16x8_t abs_row5 = vabsq_s16(row5);
int16x8_t abs_row6 = vabsq_s16(row6);
int16x8_t abs_row7 = vabsq_s16(row7);
int16x8_t row4_lz = vclzq_s16(abs_row4);
int16x8_t row5_lz = vclzq_s16(abs_row5);
int16x8_t row6_lz = vclzq_s16(abs_row6);
int16x8_t row7_lz = vclzq_s16(abs_row7);
/* Compute number of bits required to represent each coefficient. */
uint8x8_t row4_nbits = vsub_u8(vdup_n_u8(16),
vmovn_u16(vreinterpretq_u16_s16(row4_lz)));
uint8x8_t row5_nbits = vsub_u8(vdup_n_u8(16),
vmovn_u16(vreinterpretq_u16_s16(row5_lz)));
uint8x8_t row6_nbits = vsub_u8(vdup_n_u8(16),
vmovn_u16(vreinterpretq_u16_s16(row6_lz)));
uint8x8_t row7_nbits = vsub_u8(vdup_n_u8(16),
vmovn_u16(vreinterpretq_u16_s16(row7_lz)));
vst1_u8(block_nbits + 4 * DCTSIZE, row4_nbits);
vst1_u8(block_nbits + 5 * DCTSIZE, row5_nbits);
vst1_u8(block_nbits + 6 * DCTSIZE, row6_nbits);
vst1_u8(block_nbits + 7 * DCTSIZE, row7_nbits);
uint16x8_t row4_mask =
vshlq_u16(vreinterpretq_u16_s16(vshrq_n_s16(row4, 15)),
vnegq_s16(row4_lz));
uint16x8_t row5_mask =
vshlq_u16(vreinterpretq_u16_s16(vshrq_n_s16(row5, 15)),
vnegq_s16(row5_lz));
uint16x8_t row6_mask =
vshlq_u16(vreinterpretq_u16_s16(vshrq_n_s16(row6, 15)),
vnegq_s16(row6_lz));
uint16x8_t row7_mask =
vshlq_u16(vreinterpretq_u16_s16(vshrq_n_s16(row7, 15)),
vnegq_s16(row7_lz));
uint16x8_t row4_diff = veorq_u16(vreinterpretq_u16_s16(abs_row4), row4_mask);
uint16x8_t row5_diff = veorq_u16(vreinterpretq_u16_s16(abs_row5), row5_mask);
uint16x8_t row6_diff = veorq_u16(vreinterpretq_u16_s16(abs_row6), row6_mask);
uint16x8_t row7_diff = veorq_u16(vreinterpretq_u16_s16(abs_row7), row7_mask);
/* Store diff values for rows 4, 5, 6, and 7. */
vst1q_u16(block_diff + 4 * DCTSIZE, row4_diff);
vst1q_u16(block_diff + 5 * DCTSIZE, row5_diff);
vst1q_u16(block_diff + 6 * DCTSIZE, row6_diff);
vst1q_u16(block_diff + 7 * DCTSIZE, row7_diff);
/* Construct bitmap to accelerate encoding of AC coefficients. A set bit
* means that the corresponding coefficient != 0.
*/
uint8x8_t row0_nbits_gt0 = vcgt_u8(row0_nbits, vdup_n_u8(0));
uint8x8_t row1_nbits_gt0 = vcgt_u8(row1_nbits, vdup_n_u8(0));
uint8x8_t row2_nbits_gt0 = vcgt_u8(row2_nbits, vdup_n_u8(0));
uint8x8_t row3_nbits_gt0 = vcgt_u8(row3_nbits, vdup_n_u8(0));
uint8x8_t row4_nbits_gt0 = vcgt_u8(row4_nbits, vdup_n_u8(0));
uint8x8_t row5_nbits_gt0 = vcgt_u8(row5_nbits, vdup_n_u8(0));
uint8x8_t row6_nbits_gt0 = vcgt_u8(row6_nbits, vdup_n_u8(0));
uint8x8_t row7_nbits_gt0 = vcgt_u8(row7_nbits, vdup_n_u8(0));
/* { 0x80, 0x40, 0x20, 0x10, 0x08, 0x04, 0x02, 0x01 } */
const uint8x8_t bitmap_mask =
vreinterpret_u8_u64(vmov_n_u64(0x0102040810204080));
row0_nbits_gt0 = vand_u8(row0_nbits_gt0, bitmap_mask);
row1_nbits_gt0 = vand_u8(row1_nbits_gt0, bitmap_mask);
row2_nbits_gt0 = vand_u8(row2_nbits_gt0, bitmap_mask);
row3_nbits_gt0 = vand_u8(row3_nbits_gt0, bitmap_mask);
row4_nbits_gt0 = vand_u8(row4_nbits_gt0, bitmap_mask);
row5_nbits_gt0 = vand_u8(row5_nbits_gt0, bitmap_mask);
row6_nbits_gt0 = vand_u8(row6_nbits_gt0, bitmap_mask);
row7_nbits_gt0 = vand_u8(row7_nbits_gt0, bitmap_mask);
uint8x8_t bitmap_rows_10 = vpadd_u8(row1_nbits_gt0, row0_nbits_gt0);
uint8x8_t bitmap_rows_32 = vpadd_u8(row3_nbits_gt0, row2_nbits_gt0);
uint8x8_t bitmap_rows_54 = vpadd_u8(row5_nbits_gt0, row4_nbits_gt0);
uint8x8_t bitmap_rows_76 = vpadd_u8(row7_nbits_gt0, row6_nbits_gt0);
uint8x8_t bitmap_rows_3210 = vpadd_u8(bitmap_rows_32, bitmap_rows_10);
uint8x8_t bitmap_rows_7654 = vpadd_u8(bitmap_rows_76, bitmap_rows_54);
uint8x8_t bitmap = vpadd_u8(bitmap_rows_7654, bitmap_rows_3210);
/* Shift left to remove DC bit. */
bitmap = vreinterpret_u8_u64(vshl_n_u64(vreinterpret_u64_u8(bitmap), 1));
/* Move bitmap to 32-bit scalar registers. */
uint32_t bitmap_1_32 = vget_lane_u32(vreinterpret_u32_u8(bitmap), 1);
uint32_t bitmap_33_63 = vget_lane_u32(vreinterpret_u32_u8(bitmap), 0);
/* Set up state and bit buffer for output bitstream. */
working_state *state_ptr = (working_state *)state;
int free_bits = state_ptr->cur.free_bits;
size_t put_buffer = state_ptr->cur.put_buffer;
/* Encode DC coefficient. */
unsigned int nbits = block_nbits[0];
/* Emit Huffman-coded symbol and additional diff bits. */
unsigned int diff = block_diff[0];
PUT_CODE(dctbl->ehufco[nbits], dctbl->ehufsi[nbits], diff)
/* Encode AC coefficients. */
unsigned int r = 0; /* r = run length of zeros */
unsigned int i = 1; /* i = number of coefficients encoded */
/* Code and size information for a run length of 16 zero coefficients */
const unsigned int code_0xf0 = actbl->ehufco[0xf0];
const unsigned int size_0xf0 = actbl->ehufsi[0xf0];
while (bitmap_1_32 != 0) {
r = BUILTIN_CLZ(bitmap_1_32);
i += r;
bitmap_1_32 <<= r;
nbits = block_nbits[i];
diff = block_diff[i];
while (r > 15) {
/* If run length > 15, emit special run-length-16 codes. */
PUT_BITS(code_0xf0, size_0xf0)
r -= 16;
}
/* Emit Huffman symbol for run length / number of bits. (F.1.2.2.1) */
unsigned int rs = (r << 4) + nbits;
PUT_CODE(actbl->ehufco[rs], actbl->ehufsi[rs], diff)
i++;
bitmap_1_32 <<= 1;
}
r = 33 - i;
i = 33;
while (bitmap_33_63 != 0) {
unsigned int leading_zeros = BUILTIN_CLZ(bitmap_33_63);
r += leading_zeros;
i += leading_zeros;
bitmap_33_63 <<= leading_zeros;
nbits = block_nbits[i];
diff = block_diff[i];
while (r > 15) {
/* If run length > 15, emit special run-length-16 codes. */
PUT_BITS(code_0xf0, size_0xf0)
r -= 16;
}
/* Emit Huffman symbol for run length / number of bits. (F.1.2.2.1) */
unsigned int rs = (r << 4) + nbits;
PUT_CODE(actbl->ehufco[rs], actbl->ehufsi[rs], diff)
r = 0;
i++;
bitmap_33_63 <<= 1;
}
/* If the last coefficient(s) were zero, emit an end-of-block (EOB) code.
* The value of RS for the EOB code is 0.
*/
if (i != 64) {
PUT_BITS(actbl->ehufco[0], actbl->ehufsi[0])
}
state_ptr->cur.put_buffer = put_buffer;
state_ptr->cur.free_bits = free_bits;
return buffer;
}
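In scalar terms, the bitmap-driven loops above implement the run-length coding of Annex F (F.1.2.2): count leading zeros to skip directly to the next non-zero coefficient, emit ZRL codes for runs longer than 15, then emit the run/size symbol followed by the amplitude bits. A condensed sketch, assuming a single 64-bit bitmap as in the AArch64 variant, a hypothetical emit_bits() helper in place of the PUT_BITS()/PUT_CODE() macros, and the GCC/Clang __builtin_clzll() builtin in place of BUILTIN_CLZ():

#include <stdint.h>

/* Hypothetical bit emitter standing in for PUT_BITS()/PUT_CODE(). */
extern void emit_bits(unsigned int code, unsigned int size);

/* Scalar sketch of the AC-coefficient loop.  bitmap has bit 63 set when
 * coefficient 1 is non-zero (the DC bit has already been shifted out);
 * nbits[]/diff[] and ehufco[]/ehufsi[] mirror block_nbits/block_diff and the
 * fields of c_derived_tbl. */
static void encode_ac_sketch(uint64_t bitmap, const uint8_t nbits[64],
                             const uint16_t diff[64],
                             const unsigned int ehufco[256],
                             const unsigned int ehufsi[256])
{
  unsigned int i = 1;                          /* index of next coefficient */
  while (bitmap != 0) {
    unsigned int r = __builtin_clzll(bitmap);  /* run of zero coefficients */
    i += r;
    bitmap <<= r;
    while (r > 15) {                           /* ZRL: run-length-16 code */
      emit_bits(ehufco[0xf0], ehufsi[0xf0]);
      r -= 16;
    }
    unsigned int rs = (r << 4) + nbits[i];     /* run/size symbol (F.1.2.2.1) */
    emit_bits(ehufco[rs], ehufsi[rs]);
    emit_bits(diff[i], nbits[i]);              /* appended amplitude bits */
    i++;
    bitmap <<= 1;
  }
  if (i != 64)                                 /* trailing zeros: emit EOB */
    emit_bits(ehufco[0], ehufsi[0]);
}

The point of the bitmap is that the encoder jumps straight from one non-zero coefficient to the next instead of testing all 63 AC positions individually.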
@@ -6,6 +6,7 @@
* Copyright (C) 2009-2011, 2013-2014, 2016, 2018, D. R. Commander. * Copyright (C) 2009-2011, 2013-2014, 2016, 2018, D. R. Commander.
* Copyright (C) 2015-2016, 2018, Matthieu Darbois. * Copyright (C) 2015-2016, 2018, Matthieu Darbois.
* Copyright (C) 2019, Google LLC. * Copyright (C) 2019, Google LLC.
* Copyright (C) 2020, Arm Limited.
* *
* Based on the x86 SIMD extension for IJG JPEG library, * Based on the x86 SIMD extension for IJG JPEG library,
* Copyright (C) 1999-2006, MIYASAKA Masaru. * Copyright (C) 1999-2006, MIYASAKA Masaru.
@@ -17,12 +18,12 @@
*/ */
#define JPEG_INTERNALS #define JPEG_INTERNALS
#include "../../jinclude.h" #include "../../../jinclude.h"
#include "../../jpeglib.h" #include "../../../jpeglib.h"
#include "../../../jsimd.h"
#include "../../../jdct.h"
#include "../../../jsimddct.h"
#include "../../jsimd.h" #include "../../jsimd.h"
#include "../../jdct.h"
#include "../../jsimddct.h"
#include "../jsimd.h"
#include <stdio.h> #include <stdio.h>
#include <string.h> #include <string.h>
@@ -164,6 +165,19 @@ jsimd_can_rgb_ycc(void)
GLOBAL(int) GLOBAL(int)
jsimd_can_rgb_gray(void) jsimd_can_rgb_gray(void)
{ {
init_simd();
/* The code is optimised for these values only */
if (BITS_IN_JSAMPLE != 8)
return 0;
if (sizeof(JDIMENSION) != 4)
return 0;
if ((RGB_PIXELSIZE != 3) && (RGB_PIXELSIZE != 4))
return 0;
if (simd_support & JSIMD_NEON)
return 1;
return 0; return 0;
} }
@@ -246,6 +260,37 @@ jsimd_rgb_gray_convert(j_compress_ptr cinfo, JSAMPARRAY input_buf,
JSAMPIMAGE output_buf, JDIMENSION output_row, JSAMPIMAGE output_buf, JDIMENSION output_row,
int num_rows) int num_rows)
{ {
void (*neonfct) (JDIMENSION, JSAMPARRAY, JSAMPIMAGE, JDIMENSION, int);
switch (cinfo->in_color_space) {
case JCS_EXT_RGB:
neonfct = jsimd_extrgb_gray_convert_neon;
break;
case JCS_EXT_RGBX:
case JCS_EXT_RGBA:
neonfct = jsimd_extrgbx_gray_convert_neon;
break;
case JCS_EXT_BGR:
neonfct = jsimd_extbgr_gray_convert_neon;
break;
case JCS_EXT_BGRX:
case JCS_EXT_BGRA:
neonfct = jsimd_extbgrx_gray_convert_neon;
break;
case JCS_EXT_XBGR:
case JCS_EXT_ABGR:
neonfct = jsimd_extxbgr_gray_convert_neon;
break;
case JCS_EXT_XRGB:
case JCS_EXT_ARGB:
neonfct = jsimd_extxrgb_gray_convert_neon;
break;
default:
neonfct = jsimd_extrgb_gray_convert_neon;
break;
}
neonfct(cinfo->image_width, input_buf, output_buf, output_row, num_rows);
} }
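These jsimd_can_*() predicates gate the dispatch wrappers: the core library only installs a SIMD method when the matching predicate returns nonzero. A minimal sketch of that selection, assuming the internal headers are on the include path; select_rgb_gray_converter() is a hypothetical helper and rgb_gray_convert_c is a stand-in for the library's portable C routine:

#define JPEG_INTERNALS
#include "jinclude.h"
#include "jpeglib.h"
#include "jsimd.h"

/* Hypothetical stand-in for the library's portable C converter. */
extern void rgb_gray_convert_c(j_compress_ptr cinfo, JSAMPARRAY input_buf,
                               JSAMPIMAGE output_buf, JDIMENSION output_row,
                               int num_rows);

typedef void (*color_convert_fn) (j_compress_ptr cinfo, JSAMPARRAY input_buf,
                                  JSAMPIMAGE output_buf, JDIMENSION output_row,
                                  int num_rows);

/* Illustrative sketch: prefer the Neon wrapper when the compile-time and
 * runtime checks in jsimd_can_rgb_gray() pass; otherwise fall back to C. */
static color_convert_fn select_rgb_gray_converter(void)
{
  return jsimd_can_rgb_gray() ? jsimd_rgb_gray_convert : rgb_gray_convert_c;
}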
GLOBAL(void) GLOBAL(void)
@@ -298,12 +343,38 @@ jsimd_ycc_rgb565_convert(j_decompress_ptr cinfo, JSAMPIMAGE input_buf,
GLOBAL(int) GLOBAL(int)
jsimd_can_h2v2_downsample(void) jsimd_can_h2v2_downsample(void)
{ {
init_simd();
/* The code is optimised for these values only */
if (BITS_IN_JSAMPLE != 8)
return 0;
if (DCTSIZE != 8)
return 0;
if (sizeof(JDIMENSION) != 4)
return 0;
if (simd_support & JSIMD_NEON)
return 1;
return 0; return 0;
} }
GLOBAL(int) GLOBAL(int)
jsimd_can_h2v1_downsample(void) jsimd_can_h2v1_downsample(void)
{ {
init_simd();
/* The code is optimised for these values only */
if (BITS_IN_JSAMPLE != 8)
return 0;
if (DCTSIZE != 8)
return 0;
if (sizeof(JDIMENSION) != 4)
return 0;
if (simd_support & JSIMD_NEON)
return 1;
return 0; return 0;
} }
@@ -311,23 +382,50 @@ GLOBAL(void)
jsimd_h2v2_downsample(j_compress_ptr cinfo, jpeg_component_info *compptr, jsimd_h2v2_downsample(j_compress_ptr cinfo, jpeg_component_info *compptr,
JSAMPARRAY input_data, JSAMPARRAY output_data) JSAMPARRAY input_data, JSAMPARRAY output_data)
{ {
jsimd_h2v2_downsample_neon(cinfo->image_width, cinfo->max_v_samp_factor,
compptr->v_samp_factor, compptr->width_in_blocks,
input_data, output_data);
} }
GLOBAL(void) GLOBAL(void)
jsimd_h2v1_downsample(j_compress_ptr cinfo, jpeg_component_info *compptr, jsimd_h2v1_downsample(j_compress_ptr cinfo, jpeg_component_info *compptr,
JSAMPARRAY input_data, JSAMPARRAY output_data) JSAMPARRAY input_data, JSAMPARRAY output_data)
{ {
jsimd_h2v1_downsample_neon(cinfo->image_width, cinfo->max_v_samp_factor,
compptr->v_samp_factor, compptr->width_in_blocks,
input_data, output_data);
} }
GLOBAL(int) GLOBAL(int)
jsimd_can_h2v2_upsample(void) jsimd_can_h2v2_upsample(void)
{ {
init_simd();
/* The code is optimised for these values only */
if (BITS_IN_JSAMPLE != 8)
return 0;
if (sizeof(JDIMENSION) != 4)
return 0;
if (simd_support & JSIMD_NEON)
return 1;
return 0; return 0;
} }
GLOBAL(int) GLOBAL(int)
jsimd_can_h2v1_upsample(void) jsimd_can_h2v1_upsample(void)
{ {
init_simd();
/* The code is optimised for these values only */
if (BITS_IN_JSAMPLE != 8)
return 0;
if (sizeof(JDIMENSION) != 4)
return 0;
if (simd_support & JSIMD_NEON)
return 1;
return 0; return 0;
} }
@@ -335,17 +433,32 @@ GLOBAL(void)
jsimd_h2v2_upsample(j_decompress_ptr cinfo, jpeg_component_info *compptr, jsimd_h2v2_upsample(j_decompress_ptr cinfo, jpeg_component_info *compptr,
JSAMPARRAY input_data, JSAMPARRAY *output_data_ptr) JSAMPARRAY input_data, JSAMPARRAY *output_data_ptr)
{ {
jsimd_h2v2_upsample_neon(cinfo->max_v_samp_factor, cinfo->output_width,
input_data, output_data_ptr);
} }
GLOBAL(void) GLOBAL(void)
jsimd_h2v1_upsample(j_decompress_ptr cinfo, jpeg_component_info *compptr, jsimd_h2v1_upsample(j_decompress_ptr cinfo, jpeg_component_info *compptr,
JSAMPARRAY input_data, JSAMPARRAY *output_data_ptr) JSAMPARRAY input_data, JSAMPARRAY *output_data_ptr)
{ {
jsimd_h2v1_upsample_neon(cinfo->max_v_samp_factor, cinfo->output_width,
input_data, output_data_ptr);
} }
GLOBAL(int) GLOBAL(int)
jsimd_can_h2v2_fancy_upsample(void) jsimd_can_h2v2_fancy_upsample(void)
{ {
init_simd();
/* The code is optimised for these values only */
if (BITS_IN_JSAMPLE != 8)
return 0;
if (sizeof(JDIMENSION) != 4)
return 0;
if (simd_support & JSIMD_NEON)
return 1;
return 0; return 0;
} }
@@ -366,10 +479,30 @@ jsimd_can_h2v1_fancy_upsample(void)
return 0; return 0;
} }
GLOBAL(int)
jsimd_can_h1v2_fancy_upsample(void)
{
init_simd();
/* The code is optimised for these values only */
if (BITS_IN_JSAMPLE != 8)
return 0;
if (sizeof(JDIMENSION) != 4)
return 0;
if (simd_support & JSIMD_NEON)
return 1;
return 0;
}
GLOBAL(void) GLOBAL(void)
jsimd_h2v2_fancy_upsample(j_decompress_ptr cinfo, jpeg_component_info *compptr, jsimd_h2v2_fancy_upsample(j_decompress_ptr cinfo, jpeg_component_info *compptr,
JSAMPARRAY input_data, JSAMPARRAY *output_data_ptr) JSAMPARRAY input_data, JSAMPARRAY *output_data_ptr)
{ {
jsimd_h2v2_fancy_upsample_neon(cinfo->max_v_samp_factor,
compptr->downsampled_width, input_data,
output_data_ptr);
} }
GLOBAL(void) GLOBAL(void)
@@ -381,15 +514,46 @@ jsimd_h2v1_fancy_upsample(j_decompress_ptr cinfo, jpeg_component_info *compptr,
output_data_ptr); output_data_ptr);
} }
GLOBAL(void)
jsimd_h1v2_fancy_upsample(j_decompress_ptr cinfo, jpeg_component_info *compptr,
JSAMPARRAY input_data, JSAMPARRAY *output_data_ptr)
{
jsimd_h1v2_fancy_upsample_neon(cinfo->max_v_samp_factor,
compptr->downsampled_width, input_data,
output_data_ptr);
}
GLOBAL(int) GLOBAL(int)
jsimd_can_h2v2_merged_upsample(void) jsimd_can_h2v2_merged_upsample(void)
{ {
init_simd();
/* The code is optimised for these values only */
if (BITS_IN_JSAMPLE != 8)
return 0;
if (sizeof(JDIMENSION) != 4)
return 0;
if (simd_support & JSIMD_NEON)
return 1;
return 0; return 0;
} }
GLOBAL(int) GLOBAL(int)
jsimd_can_h2v1_merged_upsample(void) jsimd_can_h2v1_merged_upsample(void)
{ {
init_simd();
/* The code is optimised for these values only */
if (BITS_IN_JSAMPLE != 8)
return 0;
if (sizeof(JDIMENSION) != 4)
return 0;
if (simd_support & JSIMD_NEON)
return 1;
return 0; return 0;
} }
@@ -397,12 +561,74 @@ GLOBAL(void)
jsimd_h2v2_merged_upsample(j_decompress_ptr cinfo, JSAMPIMAGE input_buf, jsimd_h2v2_merged_upsample(j_decompress_ptr cinfo, JSAMPIMAGE input_buf,
JDIMENSION in_row_group_ctr, JSAMPARRAY output_buf) JDIMENSION in_row_group_ctr, JSAMPARRAY output_buf)
{ {
void (*neonfct) (JDIMENSION, JSAMPIMAGE, JDIMENSION, JSAMPARRAY);
switch (cinfo->out_color_space) {
case JCS_EXT_RGB:
neonfct = jsimd_h2v2_extrgb_merged_upsample_neon;
break;
case JCS_EXT_RGBX:
case JCS_EXT_RGBA:
neonfct = jsimd_h2v2_extrgbx_merged_upsample_neon;
break;
case JCS_EXT_BGR:
neonfct = jsimd_h2v2_extbgr_merged_upsample_neon;
break;
case JCS_EXT_BGRX:
case JCS_EXT_BGRA:
neonfct = jsimd_h2v2_extbgrx_merged_upsample_neon;
break;
case JCS_EXT_XBGR:
case JCS_EXT_ABGR:
neonfct = jsimd_h2v2_extxbgr_merged_upsample_neon;
break;
case JCS_EXT_XRGB:
case JCS_EXT_ARGB:
neonfct = jsimd_h2v2_extxrgb_merged_upsample_neon;
break;
default:
neonfct = jsimd_h2v2_extrgb_merged_upsample_neon;
break;
}
neonfct(cinfo->output_width, input_buf, in_row_group_ctr, output_buf);
} }
GLOBAL(void) GLOBAL(void)
jsimd_h2v1_merged_upsample(j_decompress_ptr cinfo, JSAMPIMAGE input_buf, jsimd_h2v1_merged_upsample(j_decompress_ptr cinfo, JSAMPIMAGE input_buf,
JDIMENSION in_row_group_ctr, JSAMPARRAY output_buf) JDIMENSION in_row_group_ctr, JSAMPARRAY output_buf)
{ {
void (*neonfct) (JDIMENSION, JSAMPIMAGE, JDIMENSION, JSAMPARRAY);
switch (cinfo->out_color_space) {
case JCS_EXT_RGB:
neonfct = jsimd_h2v1_extrgb_merged_upsample_neon;
break;
case JCS_EXT_RGBX:
case JCS_EXT_RGBA:
neonfct = jsimd_h2v1_extrgbx_merged_upsample_neon;
break;
case JCS_EXT_BGR:
neonfct = jsimd_h2v1_extbgr_merged_upsample_neon;
break;
case JCS_EXT_BGRX:
case JCS_EXT_BGRA:
neonfct = jsimd_h2v1_extbgrx_merged_upsample_neon;
break;
case JCS_EXT_XBGR:
case JCS_EXT_ABGR:
neonfct = jsimd_h2v1_extxbgr_merged_upsample_neon;
break;
case JCS_EXT_XRGB:
case JCS_EXT_ARGB:
neonfct = jsimd_h2v1_extxrgb_merged_upsample_neon;
break;
default:
neonfct = jsimd_h2v1_extrgb_merged_upsample_neon;
break;
}
neonfct(cinfo->output_width, input_buf, in_row_group_ctr, output_buf);
} }
GLOBAL(int) GLOBAL(int)
@@ -448,6 +674,17 @@ jsimd_convsamp_float(JSAMPARRAY sample_data, JDIMENSION start_col,
GLOBAL(int) GLOBAL(int)
jsimd_can_fdct_islow(void) jsimd_can_fdct_islow(void)
{ {
init_simd();
/* The code is optimised for these values only */
if (DCTSIZE != 8)
return 0;
if (sizeof(DCTELEM) != 2)
return 0;
if (simd_support & JSIMD_NEON)
return 1;
return 0; return 0;
} }
@@ -477,6 +714,7 @@ jsimd_can_fdct_float(void)
GLOBAL(void) GLOBAL(void)
jsimd_fdct_islow(DCTELEM *data) jsimd_fdct_islow(DCTELEM *data)
{ {
jsimd_fdct_islow_neon(data);
} }
GLOBAL(void) GLOBAL(void)
@@ -696,6 +934,16 @@ jsimd_huff_encode_one_block(void *state, JOCTET *buffer, JCOEFPTR block,
GLOBAL(int) GLOBAL(int)
jsimd_can_encode_mcu_AC_first_prepare(void) jsimd_can_encode_mcu_AC_first_prepare(void)
{ {
init_simd();
if (DCTSIZE != 8)
return 0;
if (sizeof(JCOEF) != 2)
return 0;
if (simd_support & JSIMD_NEON)
return 1;
return 0; return 0;
} }
@@ -704,11 +952,23 @@ jsimd_encode_mcu_AC_first_prepare(const JCOEF *block,
const int *jpeg_natural_order_start, int Sl, const int *jpeg_natural_order_start, int Sl,
int Al, JCOEF *values, size_t *zerobits) int Al, JCOEF *values, size_t *zerobits)
{ {
jsimd_encode_mcu_AC_first_prepare_neon(block, jpeg_natural_order_start,
Sl, Al, values, zerobits);
} }
GLOBAL(int) GLOBAL(int)
jsimd_can_encode_mcu_AC_refine_prepare(void) jsimd_can_encode_mcu_AC_refine_prepare(void)
{ {
init_simd();
if (DCTSIZE != 8)
return 0;
if (sizeof(JCOEF) != 2)
return 0;
if (simd_support & JSIMD_NEON)
return 1;
return 0; return 0;
} }
@@ -717,5 +977,7 @@ jsimd_encode_mcu_AC_refine_prepare(const JCOEF *block,
const int *jpeg_natural_order_start, int Sl, const int *jpeg_natural_order_start, int Sl,
int Al, JCOEF *absvalues, size_t *bits) int Al, JCOEF *absvalues, size_t *bits)
{ {
return 0; return jsimd_encode_mcu_AC_refine_prepare_neon(block,
jpeg_natural_order_start, Sl,
Al, absvalues, bits);
} }
@@ -0,0 +1,316 @@
/*
* jccolext-neon.c - colorspace conversion (64-bit Arm Neon)
*
* Copyright (C) 2020, Arm Limited. All Rights Reserved.
*
* This software is provided 'as-is', without any express or implied
* warranty. In no event will the authors be held liable for any damages
* arising from the use of this software.
*
* Permission is granted to anyone to use this software for any purpose,
* including commercial applications, and to alter it and redistribute it
* freely, subject to the following restrictions:
*
* 1. The origin of this software must not be misrepresented; you must not
* claim that you wrote the original software. If you use this software
* in a product, an acknowledgment in the product documentation would be
* appreciated but is not required.
* 2. Altered source versions must be plainly marked as such, and must not be
* misrepresented as being the original software.
* 3. This notice may not be removed or altered from any source distribution.
*/
/* This file is included by jccolor-neon.c */
/* RGB -> YCbCr conversion is defined by the following equations:
* Y = 0.29900 * R + 0.58700 * G + 0.11400 * B
* Cb = -0.16874 * R - 0.33126 * G + 0.50000 * B + 128
* Cr = 0.50000 * R - 0.41869 * G - 0.08131 * B + 128
*
* Avoid floating point arithmetic by using shifted integer constants:
* 0.29899597 = 19595 * 2^-16
* 0.58700561 = 38470 * 2^-16
* 0.11399841 = 7471 * 2^-16
* 0.16874695 = 11059 * 2^-16
* 0.33125305 = 21709 * 2^-16
* 0.50000000 = 32768 * 2^-16
* 0.41868592 = 27439 * 2^-16
* 0.08131409 = 5329 * 2^-16
* These constants are defined in jccolor-neon.c
*
* We add the fixed-point equivalent of 0.5 to Cb and Cr, which effectively
* rounds up or down the result via integer truncation.
*/
void jsimd_rgb_ycc_convert_neon(JDIMENSION image_width, JSAMPARRAY input_buf,
JSAMPIMAGE output_buf, JDIMENSION output_row,
int num_rows)
{
/* Pointer to RGB(X/A) input data */
JSAMPROW inptr;
/* Pointers to Y, Cb, and Cr output data */
JSAMPROW outptr0, outptr1, outptr2;
/* Allocate temporary buffer for final (image_width % 16) pixels in row. */
ALIGN(16) uint8_t tmp_buf[16 * RGB_PIXELSIZE];
/* Set up conversion constants. */
const uint16x8_t consts = vld1q_u16(jsimd_rgb_ycc_neon_consts);
const uint32x4_t scaled_128_5 = vdupq_n_u32((128 << 16) + 32767);
while (--num_rows >= 0) {
inptr = *input_buf++;
outptr0 = output_buf[0][output_row];
outptr1 = output_buf[1][output_row];
outptr2 = output_buf[2][output_row];
output_row++;
int cols_remaining = image_width;
for (; cols_remaining >= 16; cols_remaining -= 16) {
#if RGB_PIXELSIZE == 4
uint8x16x4_t input_pixels = vld4q_u8(inptr);
#else
uint8x16x3_t input_pixels = vld3q_u8(inptr);
#endif
uint16x8_t r_l = vmovl_u8(vget_low_u8(input_pixels.val[RGB_RED]));
uint16x8_t g_l = vmovl_u8(vget_low_u8(input_pixels.val[RGB_GREEN]));
uint16x8_t b_l = vmovl_u8(vget_low_u8(input_pixels.val[RGB_BLUE]));
uint16x8_t r_h = vmovl_u8(vget_high_u8(input_pixels.val[RGB_RED]));
uint16x8_t g_h = vmovl_u8(vget_high_u8(input_pixels.val[RGB_GREEN]));
uint16x8_t b_h = vmovl_u8(vget_high_u8(input_pixels.val[RGB_BLUE]));
/* Compute Y = 0.29900 * R + 0.58700 * G + 0.11400 * B */
uint32x4_t y_ll = vmull_laneq_u16(vget_low_u16(r_l), consts, 0);
y_ll = vmlal_laneq_u16(y_ll, vget_low_u16(g_l), consts, 1);
y_ll = vmlal_laneq_u16(y_ll, vget_low_u16(b_l), consts, 2);
uint32x4_t y_lh = vmull_laneq_u16(vget_high_u16(r_l), consts, 0);
y_lh = vmlal_laneq_u16(y_lh, vget_high_u16(g_l), consts, 1);
y_lh = vmlal_laneq_u16(y_lh, vget_high_u16(b_l), consts, 2);
uint32x4_t y_hl = vmull_laneq_u16(vget_low_u16(r_h), consts, 0);
y_hl = vmlal_laneq_u16(y_hl, vget_low_u16(g_h), consts, 1);
y_hl = vmlal_laneq_u16(y_hl, vget_low_u16(b_h), consts, 2);
uint32x4_t y_hh = vmull_laneq_u16(vget_high_u16(r_h), consts, 0);
y_hh = vmlal_laneq_u16(y_hh, vget_high_u16(g_h), consts, 1);
y_hh = vmlal_laneq_u16(y_hh, vget_high_u16(b_h), consts, 2);
/* Compute Cb = -0.16874 * R - 0.33126 * G + 0.50000 * B + 128 */
uint32x4_t cb_ll = scaled_128_5;
cb_ll = vmlsl_laneq_u16(cb_ll, vget_low_u16(r_l), consts, 3);
cb_ll = vmlsl_laneq_u16(cb_ll, vget_low_u16(g_l), consts, 4);
cb_ll = vmlal_laneq_u16(cb_ll, vget_low_u16(b_l), consts, 5);
uint32x4_t cb_lh = scaled_128_5;
cb_lh = vmlsl_laneq_u16(cb_lh, vget_high_u16(r_l), consts, 3);
cb_lh = vmlsl_laneq_u16(cb_lh, vget_high_u16(g_l), consts, 4);
cb_lh = vmlal_laneq_u16(cb_lh, vget_high_u16(b_l), consts, 5);
uint32x4_t cb_hl = scaled_128_5;
cb_hl = vmlsl_laneq_u16(cb_hl, vget_low_u16(r_h), consts, 3);
cb_hl = vmlsl_laneq_u16(cb_hl, vget_low_u16(g_h), consts, 4);
cb_hl = vmlal_laneq_u16(cb_hl, vget_low_u16(b_h), consts, 5);
uint32x4_t cb_hh = scaled_128_5;
cb_hh = vmlsl_laneq_u16(cb_hh, vget_high_u16(r_h), consts, 3);
cb_hh = vmlsl_laneq_u16(cb_hh, vget_high_u16(g_h), consts, 4);
cb_hh = vmlal_laneq_u16(cb_hh, vget_high_u16(b_h), consts, 5);
/* Compute Cr = 0.50000 * R - 0.41869 * G - 0.08131 * B + 128 */
uint32x4_t cr_ll = scaled_128_5;
cr_ll = vmlal_laneq_u16(cr_ll, vget_low_u16(r_l), consts, 5);
cr_ll = vmlsl_laneq_u16(cr_ll, vget_low_u16(g_l), consts, 6);
cr_ll = vmlsl_laneq_u16(cr_ll, vget_low_u16(b_l), consts, 7);
uint32x4_t cr_lh = scaled_128_5;
cr_lh = vmlal_laneq_u16(cr_lh, vget_high_u16(r_l), consts, 5);
cr_lh = vmlsl_laneq_u16(cr_lh, vget_high_u16(g_l), consts, 6);
cr_lh = vmlsl_laneq_u16(cr_lh, vget_high_u16(b_l), consts, 7);
uint32x4_t cr_hl = scaled_128_5;
cr_hl = vmlal_laneq_u16(cr_hl, vget_low_u16(r_h), consts, 5);
cr_hl = vmlsl_laneq_u16(cr_hl, vget_low_u16(g_h), consts, 6);
cr_hl = vmlsl_laneq_u16(cr_hl, vget_low_u16(b_h), consts, 7);
uint32x4_t cr_hh = scaled_128_5;
cr_hh = vmlal_laneq_u16(cr_hh, vget_high_u16(r_h), consts, 5);
cr_hh = vmlsl_laneq_u16(cr_hh, vget_high_u16(g_h), consts, 6);
cr_hh = vmlsl_laneq_u16(cr_hh, vget_high_u16(b_h), consts, 7);
/* Descale Y values (rounding right shift) and narrow to 16-bit. */
uint16x8_t y_l = vcombine_u16(vrshrn_n_u32(y_ll, 16),
vrshrn_n_u32(y_lh, 16));
uint16x8_t y_h = vcombine_u16(vrshrn_n_u32(y_hl, 16),
vrshrn_n_u32(y_hh, 16));
/* Descale Cb values (right shift) and narrow to 16-bit. */
uint16x8_t cb_l = vcombine_u16(vshrn_n_u32(cb_ll, 16),
vshrn_n_u32(cb_lh, 16));
uint16x8_t cb_h = vcombine_u16(vshrn_n_u32(cb_hl, 16),
vshrn_n_u32(cb_hh, 16));
/* Descale Cr values (right shift) and narrow to 16-bit. */
uint16x8_t cr_l = vcombine_u16(vshrn_n_u32(cr_ll, 16),
vshrn_n_u32(cr_lh, 16));
uint16x8_t cr_h = vcombine_u16(vshrn_n_u32(cr_hl, 16),
vshrn_n_u32(cr_hh, 16));
/* Narrow Y, Cb, and Cr values to 8-bit and store to memory. Buffer
* overwrite is permitted up to the next multiple of ALIGN_SIZE bytes.
*/
vst1q_u8(outptr0, vcombine_u8(vmovn_u16(y_l), vmovn_u16(y_h)));
vst1q_u8(outptr1, vcombine_u8(vmovn_u16(cb_l), vmovn_u16(cb_h)));
vst1q_u8(outptr2, vcombine_u8(vmovn_u16(cr_l), vmovn_u16(cr_h)));
/* Increment pointers. */
inptr += (16 * RGB_PIXELSIZE);
outptr0 += 16;
outptr1 += 16;
outptr2 += 16;
}
if (cols_remaining > 8) {
/* To prevent buffer overread by the vector load instructions, the last
* (image_width % 16) columns of data are first memcopied to a temporary
* buffer large enough to accommodate the vector load.
*/
memcpy(tmp_buf, inptr, cols_remaining * RGB_PIXELSIZE);
inptr = tmp_buf;
#if RGB_PIXELSIZE == 4
uint8x16x4_t input_pixels = vld4q_u8(inptr);
#else
uint8x16x3_t input_pixels = vld3q_u8(inptr);
#endif
uint16x8_t r_l = vmovl_u8(vget_low_u8(input_pixels.val[RGB_RED]));
uint16x8_t g_l = vmovl_u8(vget_low_u8(input_pixels.val[RGB_GREEN]));
uint16x8_t b_l = vmovl_u8(vget_low_u8(input_pixels.val[RGB_BLUE]));
uint16x8_t r_h = vmovl_u8(vget_high_u8(input_pixels.val[RGB_RED]));
uint16x8_t g_h = vmovl_u8(vget_high_u8(input_pixels.val[RGB_GREEN]));
uint16x8_t b_h = vmovl_u8(vget_high_u8(input_pixels.val[RGB_BLUE]));
/* Compute Y = 0.29900 * R + 0.58700 * G + 0.11400 * B */
uint32x4_t y_ll = vmull_laneq_u16(vget_low_u16(r_l), consts, 0);
y_ll = vmlal_laneq_u16(y_ll, vget_low_u16(g_l), consts, 1);
y_ll = vmlal_laneq_u16(y_ll, vget_low_u16(b_l), consts, 2);
uint32x4_t y_lh = vmull_laneq_u16(vget_high_u16(r_l), consts, 0);
y_lh = vmlal_laneq_u16(y_lh, vget_high_u16(g_l), consts, 1);
y_lh = vmlal_laneq_u16(y_lh, vget_high_u16(b_l), consts, 2);
uint32x4_t y_hl = vmull_laneq_u16(vget_low_u16(r_h), consts, 0);
y_hl = vmlal_laneq_u16(y_hl, vget_low_u16(g_h), consts, 1);
y_hl = vmlal_laneq_u16(y_hl, vget_low_u16(b_h), consts, 2);
uint32x4_t y_hh = vmull_laneq_u16(vget_high_u16(r_h), consts, 0);
y_hh = vmlal_laneq_u16(y_hh, vget_high_u16(g_h), consts, 1);
y_hh = vmlal_laneq_u16(y_hh, vget_high_u16(b_h), consts, 2);
/* Compute Cb = -0.16874 * R - 0.33126 * G + 0.50000 * B + 128 */
uint32x4_t cb_ll = scaled_128_5;
cb_ll = vmlsl_laneq_u16(cb_ll, vget_low_u16(r_l), consts, 3);
cb_ll = vmlsl_laneq_u16(cb_ll, vget_low_u16(g_l), consts, 4);
cb_ll = vmlal_laneq_u16(cb_ll, vget_low_u16(b_l), consts, 5);
uint32x4_t cb_lh = scaled_128_5;
cb_lh = vmlsl_laneq_u16(cb_lh, vget_high_u16(r_l), consts, 3);
cb_lh = vmlsl_laneq_u16(cb_lh, vget_high_u16(g_l), consts, 4);
cb_lh = vmlal_laneq_u16(cb_lh, vget_high_u16(b_l), consts, 5);
uint32x4_t cb_hl = scaled_128_5;
cb_hl = vmlsl_laneq_u16(cb_hl, vget_low_u16(r_h), consts, 3);
cb_hl = vmlsl_laneq_u16(cb_hl, vget_low_u16(g_h), consts, 4);
cb_hl = vmlal_laneq_u16(cb_hl, vget_low_u16(b_h), consts, 5);
uint32x4_t cb_hh = scaled_128_5;
cb_hh = vmlsl_laneq_u16(cb_hh, vget_high_u16(r_h), consts, 3);
cb_hh = vmlsl_laneq_u16(cb_hh, vget_high_u16(g_h), consts, 4);
cb_hh = vmlal_laneq_u16(cb_hh, vget_high_u16(b_h), consts, 5);
/* Compute Cr = 0.50000 * R - 0.41869 * G - 0.08131 * B + 128 */
uint32x4_t cr_ll = scaled_128_5;
cr_ll = vmlal_laneq_u16(cr_ll, vget_low_u16(r_l), consts, 5);
cr_ll = vmlsl_laneq_u16(cr_ll, vget_low_u16(g_l), consts, 6);
cr_ll = vmlsl_laneq_u16(cr_ll, vget_low_u16(b_l), consts, 7);
uint32x4_t cr_lh = scaled_128_5;
cr_lh = vmlal_laneq_u16(cr_lh, vget_high_u16(r_l), consts, 5);
cr_lh = vmlsl_laneq_u16(cr_lh, vget_high_u16(g_l), consts, 6);
cr_lh = vmlsl_laneq_u16(cr_lh, vget_high_u16(b_l), consts, 7);
uint32x4_t cr_hl = scaled_128_5;
cr_hl = vmlal_laneq_u16(cr_hl, vget_low_u16(r_h), consts, 5);
cr_hl = vmlsl_laneq_u16(cr_hl, vget_low_u16(g_h), consts, 6);
cr_hl = vmlsl_laneq_u16(cr_hl, vget_low_u16(b_h), consts, 7);
uint32x4_t cr_hh = scaled_128_5;
cr_hh = vmlal_laneq_u16(cr_hh, vget_high_u16(r_h), consts, 5);
cr_hh = vmlsl_laneq_u16(cr_hh, vget_high_u16(g_h), consts, 6);
cr_hh = vmlsl_laneq_u16(cr_hh, vget_high_u16(b_h), consts, 7);
/* Descale Y values (rounding right shift) and narrow to 16-bit. */
uint16x8_t y_l = vcombine_u16(vrshrn_n_u32(y_ll, 16),
vrshrn_n_u32(y_lh, 16));
uint16x8_t y_h = vcombine_u16(vrshrn_n_u32(y_hl, 16),
vrshrn_n_u32(y_hh, 16));
/* Descale Cb values (right shift) and narrow to 16-bit. */
uint16x8_t cb_l = vcombine_u16(vshrn_n_u32(cb_ll, 16),
vshrn_n_u32(cb_lh, 16));
uint16x8_t cb_h = vcombine_u16(vshrn_n_u32(cb_hl, 16),
vshrn_n_u32(cb_hh, 16));
/* Descale Cr values (right shift) and narrow to 16-bit. */
uint16x8_t cr_l = vcombine_u16(vshrn_n_u32(cr_ll, 16),
vshrn_n_u32(cr_lh, 16));
uint16x8_t cr_h = vcombine_u16(vshrn_n_u32(cr_hl, 16),
vshrn_n_u32(cr_hh, 16));
/* Narrow Y, Cb, and Cr values to 8-bit and store to memory. Buffer
* overwrite is permitted up to the next multiple of ALIGN_SIZE bytes.
*/
vst1q_u8(outptr0, vcombine_u8(vmovn_u16(y_l), vmovn_u16(y_h)));
vst1q_u8(outptr1, vcombine_u8(vmovn_u16(cb_l), vmovn_u16(cb_h)));
vst1q_u8(outptr2, vcombine_u8(vmovn_u16(cr_l), vmovn_u16(cr_h)));
} else if (cols_remaining > 0) {
/* To prevent buffer overread by the vector load instructions, the last
* (image_width % 8) columns of data are first memcopied to a temporary
* buffer large enough to accommodate the vector load.
*/
memcpy(tmp_buf, inptr, cols_remaining * RGB_PIXELSIZE);
inptr = tmp_buf;
#if RGB_PIXELSIZE == 4
uint8x8x4_t input_pixels = vld4_u8(inptr);
#else
uint8x8x3_t input_pixels = vld3_u8(inptr);
#endif
uint16x8_t r = vmovl_u8(input_pixels.val[RGB_RED]);
uint16x8_t g = vmovl_u8(input_pixels.val[RGB_GREEN]);
uint16x8_t b = vmovl_u8(input_pixels.val[RGB_BLUE]);
/* Compute Y = 0.29900 * R + 0.58700 * G + 0.11400 * B */
uint32x4_t y_l = vmull_laneq_u16(vget_low_u16(r), consts, 0);
y_l = vmlal_laneq_u16(y_l, vget_low_u16(g), consts, 1);
y_l = vmlal_laneq_u16(y_l, vget_low_u16(b), consts, 2);
uint32x4_t y_h = vmull_laneq_u16(vget_high_u16(r), consts, 0);
y_h = vmlal_laneq_u16(y_h, vget_high_u16(g), consts, 1);
y_h = vmlal_laneq_u16(y_h, vget_high_u16(b), consts, 2);
/* Compute Cb = -0.16874 * R - 0.33126 * G + 0.50000 * B + 128 */
uint32x4_t cb_l = scaled_128_5;
cb_l = vmlsl_laneq_u16(cb_l, vget_low_u16(r), consts, 3);
cb_l = vmlsl_laneq_u16(cb_l, vget_low_u16(g), consts, 4);
cb_l = vmlal_laneq_u16(cb_l, vget_low_u16(b), consts, 5);
uint32x4_t cb_h = scaled_128_5;
cb_h = vmlsl_laneq_u16(cb_h, vget_high_u16(r), consts, 3);
cb_h = vmlsl_laneq_u16(cb_h, vget_high_u16(g), consts, 4);
cb_h = vmlal_laneq_u16(cb_h, vget_high_u16(b), consts, 5);
/* Compute Cr = 0.50000 * R - 0.41869 * G - 0.08131 * B + 128 */
uint32x4_t cr_l = scaled_128_5;
cr_l = vmlal_laneq_u16(cr_l, vget_low_u16(r), consts, 5);
cr_l = vmlsl_laneq_u16(cr_l, vget_low_u16(g), consts, 6);
cr_l = vmlsl_laneq_u16(cr_l, vget_low_u16(b), consts, 7);
uint32x4_t cr_h = scaled_128_5;
cr_h = vmlal_laneq_u16(cr_h, vget_high_u16(r), consts, 5);
cr_h = vmlsl_laneq_u16(cr_h, vget_high_u16(g), consts, 6);
cr_h = vmlsl_laneq_u16(cr_h, vget_high_u16(b), consts, 7);
/* Descale Y values (rounding right shift) and narrow to 16-bit. */
uint16x8_t y_u16 = vcombine_u16(vrshrn_n_u32(y_l, 16),
vrshrn_n_u32(y_h, 16));
/* Descale Cb values (right shift) and narrow to 16-bit. */
uint16x8_t cb_u16 = vcombine_u16(vshrn_n_u32(cb_l, 16),
vshrn_n_u32(cb_h, 16));
/* Descale Cr values (right shift) and narrow to 16-bit. */
uint16x8_t cr_u16 = vcombine_u16(vshrn_n_u32(cr_l, 16),
vshrn_n_u32(cr_h, 16));
/* Narrow Y, Cb, and Cr values to 8-bit and store to memory. Buffer
* overwrite is permitted up to the next multiple of ALIGN_SIZE bytes.
*/
vst1_u8(outptr0, vmovn_u16(y_u16));
vst1_u8(outptr1, vmovn_u16(cb_u16));
vst1_u8(outptr2, vmovn_u16(cr_u16));
}
}
}
@@ -0,0 +1,403 @@
/*
* jchuff-neon.c - Huffman entropy encoding (64-bit Arm Neon)
*
* Copyright (C) 2020-2021, Arm Limited. All Rights Reserved.
* Copyright (C) 2020, D. R. Commander. All Rights Reserved.
*
* This software is provided 'as-is', without any express or implied
* warranty. In no event will the authors be held liable for any damages
* arising from the use of this software.
*
* Permission is granted to anyone to use this software for any purpose,
* including commercial applications, and to alter it and redistribute it
* freely, subject to the following restrictions:
*
* 1. The origin of this software must not be misrepresented; you must not
* claim that you wrote the original software. If you use this software
* in a product, an acknowledgment in the product documentation would be
* appreciated but is not required.
* 2. Altered source versions must be plainly marked as such, and must not be
* misrepresented as being the original software.
* 3. This notice may not be removed or altered from any source distribution.
*
* NOTE: All referenced figures are from
* Recommendation ITU-T T.81 (1992) | ISO/IEC 10918-1:1994.
*/
#define JPEG_INTERNALS
#include "../../../jinclude.h"
#include "../../../jpeglib.h"
#include "../../../jsimd.h"
#include "../../../jdct.h"
#include "../../../jsimddct.h"
#include "../../jsimd.h"
#include "../align.h"
#include "../jchuff.h"
#include "neon-compat.h"
#include <limits.h>
#include <arm_neon.h>
ALIGN(16) static const uint8_t jsimd_huff_encode_one_block_consts[] = {
0, 1, 2, 3, 16, 17, 32, 33,
18, 19, 4, 5, 6, 7, 20, 21,
34, 35, 48, 49, 255, 255, 50, 51,
36, 37, 22, 23, 8, 9, 10, 11,
255, 255, 6, 7, 20, 21, 34, 35,
48, 49, 255, 255, 50, 51, 36, 37,
54, 55, 40, 41, 26, 27, 12, 13,
14, 15, 28, 29, 42, 43, 56, 57,
6, 7, 20, 21, 34, 35, 48, 49,
50, 51, 36, 37, 22, 23, 8, 9,
26, 27, 12, 13, 255, 255, 14, 15,
28, 29, 42, 43, 56, 57, 255, 255,
52, 53, 54, 55, 40, 41, 26, 27,
12, 13, 255, 255, 14, 15, 28, 29,
26, 27, 40, 41, 42, 43, 28, 29,
14, 15, 30, 31, 44, 45, 46, 47
};
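/* Each byte above is an index into a 64-byte group of table registers holding
 * one half of the DCT block; consecutive index pairs pick out the two bytes of
 * an int16_t coefficient in zig-zag order.  The value 255 is deliberately out
 * of range: vqtbl4q_s8()/vqtbl3q_s8() write zero to such lanes, and the few
 * coefficients that cannot be reached through a single lookup are patched in
 * afterward with vsetq_lane_s16() (see "Initialize AC coefficient lanes not
 * reachable by lookup tables" below). */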
JOCTET *jsimd_huff_encode_one_block_neon(void *state, JOCTET *buffer,
JCOEFPTR block, int last_dc_val,
c_derived_tbl *dctbl,
c_derived_tbl *actbl)
{
uint16_t block_diff[DCTSIZE2];
/* Load lookup table indices for rows of zig-zag ordering. */
#ifdef HAVE_VLD1Q_U8_X4
const uint8x16x4_t idx_rows_0123 =
vld1q_u8_x4(jsimd_huff_encode_one_block_consts + 0 * DCTSIZE);
const uint8x16x4_t idx_rows_4567 =
vld1q_u8_x4(jsimd_huff_encode_one_block_consts + 8 * DCTSIZE);
#else
/* GCC does not currently support the vld1q_<type>_x4() intrinsics. */
const uint8x16x4_t idx_rows_0123 = { {
vld1q_u8(jsimd_huff_encode_one_block_consts + 0 * DCTSIZE),
vld1q_u8(jsimd_huff_encode_one_block_consts + 2 * DCTSIZE),
vld1q_u8(jsimd_huff_encode_one_block_consts + 4 * DCTSIZE),
vld1q_u8(jsimd_huff_encode_one_block_consts + 6 * DCTSIZE)
} };
const uint8x16x4_t idx_rows_4567 = { {
vld1q_u8(jsimd_huff_encode_one_block_consts + 8 * DCTSIZE),
vld1q_u8(jsimd_huff_encode_one_block_consts + 10 * DCTSIZE),
vld1q_u8(jsimd_huff_encode_one_block_consts + 12 * DCTSIZE),
vld1q_u8(jsimd_huff_encode_one_block_consts + 14 * DCTSIZE)
} };
#endif
/* Load 8x8 block of DCT coefficients. */
#ifdef HAVE_VLD1Q_U8_X4
const int8x16x4_t tbl_rows_0123 =
vld1q_s8_x4((int8_t *)(block + 0 * DCTSIZE));
const int8x16x4_t tbl_rows_4567 =
vld1q_s8_x4((int8_t *)(block + 4 * DCTSIZE));
#else
const int8x16x4_t tbl_rows_0123 = { {
vld1q_s8((int8_t *)(block + 0 * DCTSIZE)),
vld1q_s8((int8_t *)(block + 1 * DCTSIZE)),
vld1q_s8((int8_t *)(block + 2 * DCTSIZE)),
vld1q_s8((int8_t *)(block + 3 * DCTSIZE))
} };
const int8x16x4_t tbl_rows_4567 = { {
vld1q_s8((int8_t *)(block + 4 * DCTSIZE)),
vld1q_s8((int8_t *)(block + 5 * DCTSIZE)),
vld1q_s8((int8_t *)(block + 6 * DCTSIZE)),
vld1q_s8((int8_t *)(block + 7 * DCTSIZE))
} };
#endif
/* Initialise extra lookup tables. */
const int8x16x4_t tbl_rows_2345 = { {
tbl_rows_0123.val[2], tbl_rows_0123.val[3],
tbl_rows_4567.val[0], tbl_rows_4567.val[1]
} };
const int8x16x3_t tbl_rows_567 =
{ { tbl_rows_4567.val[1], tbl_rows_4567.val[2], tbl_rows_4567.val[3] } };
/* Shuffle coefficients into zig-zag order. */
int16x8_t row0 =
vreinterpretq_s16_s8(vqtbl4q_s8(tbl_rows_0123, idx_rows_0123.val[0]));
int16x8_t row1 =
vreinterpretq_s16_s8(vqtbl4q_s8(tbl_rows_0123, idx_rows_0123.val[1]));
int16x8_t row2 =
vreinterpretq_s16_s8(vqtbl4q_s8(tbl_rows_2345, idx_rows_0123.val[2]));
int16x8_t row3 =
vreinterpretq_s16_s8(vqtbl4q_s8(tbl_rows_0123, idx_rows_0123.val[3]));
int16x8_t row4 =
vreinterpretq_s16_s8(vqtbl4q_s8(tbl_rows_4567, idx_rows_4567.val[0]));
int16x8_t row5 =
vreinterpretq_s16_s8(vqtbl4q_s8(tbl_rows_2345, idx_rows_4567.val[1]));
int16x8_t row6 =
vreinterpretq_s16_s8(vqtbl4q_s8(tbl_rows_4567, idx_rows_4567.val[2]));
int16x8_t row7 =
vreinterpretq_s16_s8(vqtbl3q_s8(tbl_rows_567, idx_rows_4567.val[3]));
/* Compute DC coefficient difference value (F.1.1.5.1). */
row0 = vsetq_lane_s16(block[0] - last_dc_val, row0, 0);
/* Initialize AC coefficient lanes not reachable by lookup tables. */
row1 =
vsetq_lane_s16(vgetq_lane_s16(vreinterpretq_s16_s8(tbl_rows_4567.val[0]),
0), row1, 2);
row2 =
vsetq_lane_s16(vgetq_lane_s16(vreinterpretq_s16_s8(tbl_rows_0123.val[1]),
4), row2, 0);
row2 =
vsetq_lane_s16(vgetq_lane_s16(vreinterpretq_s16_s8(tbl_rows_4567.val[2]),
0), row2, 5);
row5 =
vsetq_lane_s16(vgetq_lane_s16(vreinterpretq_s16_s8(tbl_rows_0123.val[1]),
7), row5, 2);
row5 =
vsetq_lane_s16(vgetq_lane_s16(vreinterpretq_s16_s8(tbl_rows_4567.val[2]),
3), row5, 7);
row6 =
vsetq_lane_s16(vgetq_lane_s16(vreinterpretq_s16_s8(tbl_rows_0123.val[3]),
7), row6, 5);
/* DCT block is now in zig-zag order; start Huffman encoding process. */
int16x8_t abs_row0 = vabsq_s16(row0);
int16x8_t abs_row1 = vabsq_s16(row1);
int16x8_t abs_row2 = vabsq_s16(row2);
int16x8_t abs_row3 = vabsq_s16(row3);
int16x8_t abs_row4 = vabsq_s16(row4);
int16x8_t abs_row5 = vabsq_s16(row5);
int16x8_t abs_row6 = vabsq_s16(row6);
int16x8_t abs_row7 = vabsq_s16(row7);
/* For negative coeffs, the encoded diff is coeff - 1, which in two's
 * complement equals ~abs(coeff); non-negative coeffs pass through unchanged. */
uint16x8_t row0_diff =
vreinterpretq_u16_s16(veorq_s16(abs_row0, vshrq_n_s16(row0, 15)));
uint16x8_t row1_diff =
vreinterpretq_u16_s16(veorq_s16(abs_row1, vshrq_n_s16(row1, 15)));
uint16x8_t row2_diff =
vreinterpretq_u16_s16(veorq_s16(abs_row2, vshrq_n_s16(row2, 15)));
uint16x8_t row3_diff =
vreinterpretq_u16_s16(veorq_s16(abs_row3, vshrq_n_s16(row3, 15)));
uint16x8_t row4_diff =
vreinterpretq_u16_s16(veorq_s16(abs_row4, vshrq_n_s16(row4, 15)));
uint16x8_t row5_diff =
vreinterpretq_u16_s16(veorq_s16(abs_row5, vshrq_n_s16(row5, 15)));
uint16x8_t row6_diff =
vreinterpretq_u16_s16(veorq_s16(abs_row6, vshrq_n_s16(row6, 15)));
uint16x8_t row7_diff =
vreinterpretq_u16_s16(veorq_s16(abs_row7, vshrq_n_s16(row7, 15)));
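/* Worked example (illustrative): for a coefficient of -3, abs_row holds 3 and
 * the sign mask is 0xFFFF, so diff = 3 ^ 0xFFFF = 0xFFFC.  The magnitude 3
 * needs nbits = 2, and the low 2 bits of 0xFFFC are 00 -- exactly the appended
 * bits that Annex F prescribes for -3 (the low SSSS bits of V - 1). */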
/* Construct bitmap to accelerate encoding of AC coefficients. A set bit
* means that the corresponding coefficient != 0.
*/
uint8x8_t abs_row0_gt0 = vmovn_u16(vcgtq_u16(vreinterpretq_u16_s16(abs_row0),
vdupq_n_u16(0)));
uint8x8_t abs_row1_gt0 = vmovn_u16(vcgtq_u16(vreinterpretq_u16_s16(abs_row1),
vdupq_n_u16(0)));
uint8x8_t abs_row2_gt0 = vmovn_u16(vcgtq_u16(vreinterpretq_u16_s16(abs_row2),
vdupq_n_u16(0)));
uint8x8_t abs_row3_gt0 = vmovn_u16(vcgtq_u16(vreinterpretq_u16_s16(abs_row3),
vdupq_n_u16(0)));
uint8x8_t abs_row4_gt0 = vmovn_u16(vcgtq_u16(vreinterpretq_u16_s16(abs_row4),
vdupq_n_u16(0)));
uint8x8_t abs_row5_gt0 = vmovn_u16(vcgtq_u16(vreinterpretq_u16_s16(abs_row5),
vdupq_n_u16(0)));
uint8x8_t abs_row6_gt0 = vmovn_u16(vcgtq_u16(vreinterpretq_u16_s16(abs_row6),
vdupq_n_u16(0)));
uint8x8_t abs_row7_gt0 = vmovn_u16(vcgtq_u16(vreinterpretq_u16_s16(abs_row7),
vdupq_n_u16(0)));
/* { 0x80, 0x40, 0x20, 0x10, 0x08, 0x04, 0x02, 0x01 } */
const uint8x8_t bitmap_mask =
vreinterpret_u8_u64(vmov_n_u64(0x0102040810204080));
abs_row0_gt0 = vand_u8(abs_row0_gt0, bitmap_mask);
abs_row1_gt0 = vand_u8(abs_row1_gt0, bitmap_mask);
abs_row2_gt0 = vand_u8(abs_row2_gt0, bitmap_mask);
abs_row3_gt0 = vand_u8(abs_row3_gt0, bitmap_mask);
abs_row4_gt0 = vand_u8(abs_row4_gt0, bitmap_mask);
abs_row5_gt0 = vand_u8(abs_row5_gt0, bitmap_mask);
abs_row6_gt0 = vand_u8(abs_row6_gt0, bitmap_mask);
abs_row7_gt0 = vand_u8(abs_row7_gt0, bitmap_mask);
uint8x8_t bitmap_rows_10 = vpadd_u8(abs_row1_gt0, abs_row0_gt0);
uint8x8_t bitmap_rows_32 = vpadd_u8(abs_row3_gt0, abs_row2_gt0);
uint8x8_t bitmap_rows_54 = vpadd_u8(abs_row5_gt0, abs_row4_gt0);
uint8x8_t bitmap_rows_76 = vpadd_u8(abs_row7_gt0, abs_row6_gt0);
uint8x8_t bitmap_rows_3210 = vpadd_u8(bitmap_rows_32, bitmap_rows_10);
uint8x8_t bitmap_rows_7654 = vpadd_u8(bitmap_rows_76, bitmap_rows_54);
uint8x8_t bitmap_all = vpadd_u8(bitmap_rows_7654, bitmap_rows_3210);
/* Shift left to remove DC bit. */
bitmap_all =
vreinterpret_u8_u64(vshl_n_u64(vreinterpret_u64_u8(bitmap_all), 1));
/* Count bits set (number of non-zero coefficients) in bitmap. */
unsigned int non_zero_coefficients = vaddv_u8(vcnt_u8(bitmap_all));
/* Move bitmap to 64-bit scalar register. */
uint64_t bitmap = vget_lane_u64(vreinterpret_u64_u8(bitmap_all), 0);
/* Set up state and bit buffer for output bitstream. */
working_state *state_ptr = (working_state *)state;
int free_bits = state_ptr->cur.free_bits;
size_t put_buffer = state_ptr->cur.put_buffer;
/* Encode DC coefficient. */
/* Find nbits required to specify sign and amplitude of coefficient. */
#if defined(_MSC_VER) && !defined(__clang__)
unsigned int lz = BUILTIN_CLZ(vgetq_lane_s16(abs_row0, 0));
#else
unsigned int lz;
__asm__("clz %w0, %w1" : "=r"(lz) : "r"(vgetq_lane_s16(abs_row0, 0)));
#endif
unsigned int nbits = 32 - lz;
/* Emit Huffman-coded symbol and additional diff bits. */
unsigned int diff = (unsigned int)(vgetq_lane_u16(row0_diff, 0) << lz) >> lz;
PUT_CODE(dctbl->ehufco[nbits], dctbl->ehufsi[nbits], diff)
/* Encode AC coefficients. */
unsigned int r = 0; /* r = run length of zeros */
unsigned int i = 1; /* i = number of coefficients encoded */
/* Code and size information for a run length of 16 zero coefficients */
const unsigned int code_0xf0 = actbl->ehufco[0xf0];
const unsigned int size_0xf0 = actbl->ehufsi[0xf0];
/* The most efficient method of computing nbits and diff depends on the
* number of non-zero coefficients. If the bitmap is not too sparse (> 8
* non-zero AC coefficients), it is beneficial to use Neon; else we compute
* nbits and diff on demand using scalar code.
*/
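/* (A scalar equivalent of this density test, for illustration, would be
 * __builtin_popcountll(bitmap) > 8.) */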
if (non_zero_coefficients > 8) {
uint8_t block_nbits[DCTSIZE2];
int16x8_t row0_lz = vclzq_s16(abs_row0);
int16x8_t row1_lz = vclzq_s16(abs_row1);
int16x8_t row2_lz = vclzq_s16(abs_row2);
int16x8_t row3_lz = vclzq_s16(abs_row3);
int16x8_t row4_lz = vclzq_s16(abs_row4);
int16x8_t row5_lz = vclzq_s16(abs_row5);
int16x8_t row6_lz = vclzq_s16(abs_row6);
int16x8_t row7_lz = vclzq_s16(abs_row7);
/* Compute nbits needed to specify magnitude of each coefficient. */
uint8x8_t row0_nbits = vsub_u8(vdup_n_u8(16),
vmovn_u16(vreinterpretq_u16_s16(row0_lz)));
uint8x8_t row1_nbits = vsub_u8(vdup_n_u8(16),
vmovn_u16(vreinterpretq_u16_s16(row1_lz)));
uint8x8_t row2_nbits = vsub_u8(vdup_n_u8(16),
vmovn_u16(vreinterpretq_u16_s16(row2_lz)));
uint8x8_t row3_nbits = vsub_u8(vdup_n_u8(16),
vmovn_u16(vreinterpretq_u16_s16(row3_lz)));
uint8x8_t row4_nbits = vsub_u8(vdup_n_u8(16),
vmovn_u16(vreinterpretq_u16_s16(row4_lz)));
uint8x8_t row5_nbits = vsub_u8(vdup_n_u8(16),
vmovn_u16(vreinterpretq_u16_s16(row5_lz)));
uint8x8_t row6_nbits = vsub_u8(vdup_n_u8(16),
vmovn_u16(vreinterpretq_u16_s16(row6_lz)));
uint8x8_t row7_nbits = vsub_u8(vdup_n_u8(16),
vmovn_u16(vreinterpretq_u16_s16(row7_lz)));
/* Store nbits. */
vst1_u8(block_nbits + 0 * DCTSIZE, row0_nbits);
vst1_u8(block_nbits + 1 * DCTSIZE, row1_nbits);
vst1_u8(block_nbits + 2 * DCTSIZE, row2_nbits);
vst1_u8(block_nbits + 3 * DCTSIZE, row3_nbits);
vst1_u8(block_nbits + 4 * DCTSIZE, row4_nbits);
vst1_u8(block_nbits + 5 * DCTSIZE, row5_nbits);
vst1_u8(block_nbits + 6 * DCTSIZE, row6_nbits);
vst1_u8(block_nbits + 7 * DCTSIZE, row7_nbits);
/* Mask bits not required to specify sign and amplitude of diff. */
row0_diff = vshlq_u16(row0_diff, row0_lz);
row1_diff = vshlq_u16(row1_diff, row1_lz);
row2_diff = vshlq_u16(row2_diff, row2_lz);
row3_diff = vshlq_u16(row3_diff, row3_lz);
row4_diff = vshlq_u16(row4_diff, row4_lz);
row5_diff = vshlq_u16(row5_diff, row5_lz);
row6_diff = vshlq_u16(row6_diff, row6_lz);
row7_diff = vshlq_u16(row7_diff, row7_lz);
row0_diff = vshlq_u16(row0_diff, vnegq_s16(row0_lz));
row1_diff = vshlq_u16(row1_diff, vnegq_s16(row1_lz));
row2_diff = vshlq_u16(row2_diff, vnegq_s16(row2_lz));
row3_diff = vshlq_u16(row3_diff, vnegq_s16(row3_lz));
row4_diff = vshlq_u16(row4_diff, vnegq_s16(row4_lz));
row5_diff = vshlq_u16(row5_diff, vnegq_s16(row5_lz));
row6_diff = vshlq_u16(row6_diff, vnegq_s16(row6_lz));
row7_diff = vshlq_u16(row7_diff, vnegq_s16(row7_lz));
/* Store diff bits. */
vst1q_u16(block_diff + 0 * DCTSIZE, row0_diff);
vst1q_u16(block_diff + 1 * DCTSIZE, row1_diff);
vst1q_u16(block_diff + 2 * DCTSIZE, row2_diff);
vst1q_u16(block_diff + 3 * DCTSIZE, row3_diff);
vst1q_u16(block_diff + 4 * DCTSIZE, row4_diff);
vst1q_u16(block_diff + 5 * DCTSIZE, row5_diff);
vst1q_u16(block_diff + 6 * DCTSIZE, row6_diff);
vst1q_u16(block_diff + 7 * DCTSIZE, row7_diff);
while (bitmap != 0) {
r = BUILTIN_CLZLL(bitmap);
i += r;
bitmap <<= r;
nbits = block_nbits[i];
diff = block_diff[i];
while (r > 15) {
/* If run length > 15, emit special run-length-16 codes. */
PUT_BITS(code_0xf0, size_0xf0)
r -= 16;
}
/* Emit Huffman symbol for run length / number of bits. (F.1.2.2.1) */
unsigned int rs = (r << 4) + nbits;
PUT_CODE(actbl->ehufco[rs], actbl->ehufsi[rs], diff)
i++;
bitmap <<= 1;
}
} else if (bitmap != 0) {
uint16_t block_abs[DCTSIZE2];
/* Store absolute value of coefficients. */
vst1q_u16(block_abs + 0 * DCTSIZE, vreinterpretq_u16_s16(abs_row0));
vst1q_u16(block_abs + 1 * DCTSIZE, vreinterpretq_u16_s16(abs_row1));
vst1q_u16(block_abs + 2 * DCTSIZE, vreinterpretq_u16_s16(abs_row2));
vst1q_u16(block_abs + 3 * DCTSIZE, vreinterpretq_u16_s16(abs_row3));
vst1q_u16(block_abs + 4 * DCTSIZE, vreinterpretq_u16_s16(abs_row4));
vst1q_u16(block_abs + 5 * DCTSIZE, vreinterpretq_u16_s16(abs_row5));
vst1q_u16(block_abs + 6 * DCTSIZE, vreinterpretq_u16_s16(abs_row6));
vst1q_u16(block_abs + 7 * DCTSIZE, vreinterpretq_u16_s16(abs_row7));
/* Store diff bits. */
vst1q_u16(block_diff + 0 * DCTSIZE, row0_diff);
vst1q_u16(block_diff + 1 * DCTSIZE, row1_diff);
vst1q_u16(block_diff + 2 * DCTSIZE, row2_diff);
vst1q_u16(block_diff + 3 * DCTSIZE, row3_diff);
vst1q_u16(block_diff + 4 * DCTSIZE, row4_diff);
vst1q_u16(block_diff + 5 * DCTSIZE, row5_diff);
vst1q_u16(block_diff + 6 * DCTSIZE, row6_diff);
vst1q_u16(block_diff + 7 * DCTSIZE, row7_diff);
/* Same as above but must mask diff bits and compute nbits on demand. */
while (bitmap != 0) {
r = BUILTIN_CLZLL(bitmap);
i += r;
bitmap <<= r;
lz = BUILTIN_CLZ(block_abs[i]);
nbits = 32 - lz;
diff = (unsigned int)(block_diff[i] << lz) >> lz;
while (r > 15) {
/* If run length > 15, emit special run-length-16 codes. */
PUT_BITS(code_0xf0, size_0xf0)
r -= 16;
}
/* Emit Huffman symbol for run length / number of bits. (F.1.2.2.1) */
unsigned int rs = (r << 4) + nbits;
PUT_CODE(actbl->ehufco[rs], actbl->ehufsi[rs], diff)
i++;
bitmap <<= 1;
}
}
/* If the last coefficient(s) were zero, emit an end-of-block (EOB) code.
* The value of RS for the EOB code is 0.
*/
if (i != 64) {
PUT_BITS(actbl->ehufco[0], actbl->ehufsi[0])
}
state_ptr->cur.put_buffer = put_buffer;
state_ptr->cur.free_bits = free_bits;
return buffer;
}


@@ -3,8 +3,9 @@
  *
  * Copyright 2009 Pierre Ossman <ossman@cendio.se> for Cendio AB
  * Copyright (C) 2011, Nokia Corporation and/or its subsidiary(-ies).
- * Copyright (C) 2009-2011, 2013-2014, 2016, 2018, D. R. Commander.
+ * Copyright (C) 2009-2011, 2013-2014, 2016, 2018, 2020, D. R. Commander.
  * Copyright (C) 2015-2016, 2018, Matthieu Darbois.
+ * Copyright (C) 2020, Arm Limited.
  *
  * Based on the x86 SIMD extension for IJG JPEG library,
  * Copyright (C) 1999-2006, MIYASAKA Masaru.
@@ -16,12 +17,13 @@
  */
 #define JPEG_INTERNALS
-#include "../../jinclude.h"
-#include "../../jpeglib.h"
+#include "../../../jinclude.h"
+#include "../../../jpeglib.h"
+#include "../../../jsimd.h"
+#include "../../../jdct.h"
+#include "../../../jsimddct.h"
 #include "../../jsimd.h"
-#include "../../jdct.h"
-#include "../../jsimddct.h"
-#include "../jsimd.h"
+#include "jconfigint.h"
 #include <stdio.h>
 #include <string.h>
@@ -189,6 +191,19 @@ jsimd_can_rgb_ycc(void)
 GLOBAL(int)
 jsimd_can_rgb_gray(void)
 {
+  init_simd();
+  /* The code is optimised for these values only */
+  if (BITS_IN_JSAMPLE != 8)
+    return 0;
+  if (sizeof(JDIMENSION) != 4)
+    return 0;
+  if ((RGB_PIXELSIZE != 3) && (RGB_PIXELSIZE != 4))
+    return 0;
+  if (simd_support & JSIMD_NEON)
+    return 1;
   return 0;
 }
@@ -237,20 +252,28 @@ jsimd_rgb_ycc_convert(j_compress_ptr cinfo, JSAMPARRAY input_buf,
   switch (cinfo->in_color_space) {
   case JCS_EXT_RGB:
+#ifndef NEON_INTRINSICS
     if (simd_features & JSIMD_FASTLD3)
+#endif
       neonfct = jsimd_extrgb_ycc_convert_neon;
+#ifndef NEON_INTRINSICS
     else
       neonfct = jsimd_extrgb_ycc_convert_neon_slowld3;
+#endif
     break;
   case JCS_EXT_RGBX:
   case JCS_EXT_RGBA:
     neonfct = jsimd_extrgbx_ycc_convert_neon;
     break;
   case JCS_EXT_BGR:
+#ifndef NEON_INTRINSICS
     if (simd_features & JSIMD_FASTLD3)
+#endif
       neonfct = jsimd_extbgr_ycc_convert_neon;
+#ifndef NEON_INTRINSICS
     else
       neonfct = jsimd_extbgr_ycc_convert_neon_slowld3;
+#endif
     break;
   case JCS_EXT_BGRX:
   case JCS_EXT_BGRA:
@@ -265,10 +288,14 @@ jsimd_rgb_ycc_convert(j_compress_ptr cinfo, JSAMPARRAY input_buf,
     neonfct = jsimd_extxrgb_ycc_convert_neon;
     break;
   default:
+#ifndef NEON_INTRINSICS
     if (simd_features & JSIMD_FASTLD3)
+#endif
       neonfct = jsimd_extrgb_ycc_convert_neon;
+#ifndef NEON_INTRINSICS
     else
       neonfct = jsimd_extrgb_ycc_convert_neon_slowld3;
+#endif
     break;
   }
@@ -280,6 +307,37 @@ jsimd_rgb_gray_convert(j_compress_ptr cinfo, JSAMPARRAY input_buf,
                        JSAMPIMAGE output_buf, JDIMENSION output_row,
                        int num_rows)
 {
+  void (*neonfct) (JDIMENSION, JSAMPARRAY, JSAMPIMAGE, JDIMENSION, int);
+  switch (cinfo->in_color_space) {
+  case JCS_EXT_RGB:
+    neonfct = jsimd_extrgb_gray_convert_neon;
+    break;
+  case JCS_EXT_RGBX:
+  case JCS_EXT_RGBA:
+    neonfct = jsimd_extrgbx_gray_convert_neon;
+    break;
+  case JCS_EXT_BGR:
+    neonfct = jsimd_extbgr_gray_convert_neon;
+    break;
+  case JCS_EXT_BGRX:
+  case JCS_EXT_BGRA:
+    neonfct = jsimd_extbgrx_gray_convert_neon;
+    break;
+  case JCS_EXT_XBGR:
+  case JCS_EXT_ABGR:
+    neonfct = jsimd_extxbgr_gray_convert_neon;
+    break;
+  case JCS_EXT_XRGB:
+  case JCS_EXT_ARGB:
+    neonfct = jsimd_extxrgb_gray_convert_neon;
+    break;
+  default:
+    neonfct = jsimd_extrgb_gray_convert_neon;
+    break;
+  }
+  neonfct(cinfo->image_width, input_buf, output_buf, output_row, num_rows);
 }
 GLOBAL(void)
@@ -291,20 +349,28 @@ jsimd_ycc_rgb_convert(j_decompress_ptr cinfo, JSAMPIMAGE input_buf,
   switch (cinfo->out_color_space) {
   case JCS_EXT_RGB:
+#ifndef NEON_INTRINSICS
     if (simd_features & JSIMD_FASTST3)
+#endif
       neonfct = jsimd_ycc_extrgb_convert_neon;
+#ifndef NEON_INTRINSICS
     else
       neonfct = jsimd_ycc_extrgb_convert_neon_slowst3;
+#endif
     break;
   case JCS_EXT_RGBX:
   case JCS_EXT_RGBA:
     neonfct = jsimd_ycc_extrgbx_convert_neon;
     break;
   case JCS_EXT_BGR:
+#ifndef NEON_INTRINSICS
     if (simd_features & JSIMD_FASTST3)
+#endif
       neonfct = jsimd_ycc_extbgr_convert_neon;
+#ifndef NEON_INTRINSICS
     else
       neonfct = jsimd_ycc_extbgr_convert_neon_slowst3;
+#endif
     break;
   case JCS_EXT_BGRX:
   case JCS_EXT_BGRA:
@@ -319,10 +385,14 @@ jsimd_ycc_rgb_convert(j_decompress_ptr cinfo, JSAMPIMAGE input_buf,
     neonfct = jsimd_ycc_extxrgb_convert_neon;
     break;
   default:
+#ifndef NEON_INTRINSICS
     if (simd_features & JSIMD_FASTST3)
+#endif
       neonfct = jsimd_ycc_extrgb_convert_neon;
+#ifndef NEON_INTRINSICS
     else
       neonfct = jsimd_ycc_extrgb_convert_neon_slowst3;
+#endif
     break;
   }
@@ -397,12 +467,33 @@ jsimd_h2v1_downsample(j_compress_ptr cinfo, jpeg_component_info *compptr,
 GLOBAL(int)
 jsimd_can_h2v2_upsample(void)
 {
+  init_simd();
+  /* The code is optimised for these values only */
+  if (BITS_IN_JSAMPLE != 8)
+    return 0;
+  if (sizeof(JDIMENSION) != 4)
+    return 0;
+  if (simd_support & JSIMD_NEON)
+    return 1;
   return 0;
 }
 GLOBAL(int)
 jsimd_can_h2v1_upsample(void)
 {
+  init_simd();
+  /* The code is optimised for these values only */
+  if (BITS_IN_JSAMPLE != 8)
+    return 0;
+  if (sizeof(JDIMENSION) != 4)
+    return 0;
+  if (simd_support & JSIMD_NEON)
+    return 1;
   return 0;
 }
@@ -410,23 +501,66 @@ GLOBAL(void)
 jsimd_h2v2_upsample(j_decompress_ptr cinfo, jpeg_component_info *compptr,
                     JSAMPARRAY input_data, JSAMPARRAY *output_data_ptr)
 {
+  jsimd_h2v2_upsample_neon(cinfo->max_v_samp_factor, cinfo->output_width,
+                           input_data, output_data_ptr);
 }
 GLOBAL(void)
 jsimd_h2v1_upsample(j_decompress_ptr cinfo, jpeg_component_info *compptr,
                     JSAMPARRAY input_data, JSAMPARRAY *output_data_ptr)
 {
+  jsimd_h2v1_upsample_neon(cinfo->max_v_samp_factor, cinfo->output_width,
+                           input_data, output_data_ptr);
 }
 GLOBAL(int)
 jsimd_can_h2v2_fancy_upsample(void)
 {
+  init_simd();
+  /* The code is optimised for these values only */
+  if (BITS_IN_JSAMPLE != 8)
+    return 0;
+  if (sizeof(JDIMENSION) != 4)
+    return 0;
+  if (simd_support & JSIMD_NEON)
+    return 1;
   return 0;
 }
 GLOBAL(int)
 jsimd_can_h2v1_fancy_upsample(void)
 {
+  init_simd();
+  /* The code is optimised for these values only */
+  if (BITS_IN_JSAMPLE != 8)
+    return 0;
+  if (sizeof(JDIMENSION) != 4)
+    return 0;
+  if (simd_support & JSIMD_NEON)
+    return 1;
+  return 0;
+}
+GLOBAL(int)
+jsimd_can_h1v2_fancy_upsample(void)
+{
+  init_simd();
+  /* The code is optimised for these values only */
+  if (BITS_IN_JSAMPLE != 8)
+    return 0;
+  if (sizeof(JDIMENSION) != 4)
+    return 0;
+  if (simd_support & JSIMD_NEON)
+    return 1;
   return 0;
 }
@@ -434,23 +568,60 @@ GLOBAL(void)
 jsimd_h2v2_fancy_upsample(j_decompress_ptr cinfo, jpeg_component_info *compptr,
                           JSAMPARRAY input_data, JSAMPARRAY *output_data_ptr)
 {
+  jsimd_h2v2_fancy_upsample_neon(cinfo->max_v_samp_factor,
+                                 compptr->downsampled_width, input_data,
+                                 output_data_ptr);
 }
 GLOBAL(void)
 jsimd_h2v1_fancy_upsample(j_decompress_ptr cinfo, jpeg_component_info *compptr,
                           JSAMPARRAY input_data, JSAMPARRAY *output_data_ptr)
 {
+  jsimd_h2v1_fancy_upsample_neon(cinfo->max_v_samp_factor,
+                                 compptr->downsampled_width, input_data,
+                                 output_data_ptr);
+}
+GLOBAL(void)
+jsimd_h1v2_fancy_upsample(j_decompress_ptr cinfo, jpeg_component_info *compptr,
+                          JSAMPARRAY input_data, JSAMPARRAY *output_data_ptr)
+{
+  jsimd_h1v2_fancy_upsample_neon(cinfo->max_v_samp_factor,
+                                 compptr->downsampled_width, input_data,
+                                 output_data_ptr);
 }
 GLOBAL(int)
 jsimd_can_h2v2_merged_upsample(void)
 {
+  init_simd();
+  /* The code is optimised for these values only */
+  if (BITS_IN_JSAMPLE != 8)
+    return 0;
+  if (sizeof(JDIMENSION) != 4)
+    return 0;
+  if (simd_support & JSIMD_NEON)
+    return 1;
   return 0;
 }
 GLOBAL(int)
 jsimd_can_h2v1_merged_upsample(void)
 {
+  init_simd();
+  /* The code is optimised for these values only */
+  if (BITS_IN_JSAMPLE != 8)
+    return 0;
+  if (sizeof(JDIMENSION) != 4)
+    return 0;
+  if (simd_support & JSIMD_NEON)
+    return 1;
   return 0;
 }
@@ -458,12 +629,74 @@ GLOBAL(void)
 jsimd_h2v2_merged_upsample(j_decompress_ptr cinfo, JSAMPIMAGE input_buf,
                            JDIMENSION in_row_group_ctr, JSAMPARRAY output_buf)
 {
+  void (*neonfct) (JDIMENSION, JSAMPIMAGE, JDIMENSION, JSAMPARRAY);
+  switch (cinfo->out_color_space) {
+  case JCS_EXT_RGB:
+    neonfct = jsimd_h2v2_extrgb_merged_upsample_neon;
+    break;
+  case JCS_EXT_RGBX:
+  case JCS_EXT_RGBA:
+    neonfct = jsimd_h2v2_extrgbx_merged_upsample_neon;
+    break;
+  case JCS_EXT_BGR:
+    neonfct = jsimd_h2v2_extbgr_merged_upsample_neon;
+    break;
+  case JCS_EXT_BGRX:
+  case JCS_EXT_BGRA:
+    neonfct = jsimd_h2v2_extbgrx_merged_upsample_neon;
+    break;
+  case JCS_EXT_XBGR:
+  case JCS_EXT_ABGR:
+    neonfct = jsimd_h2v2_extxbgr_merged_upsample_neon;
+    break;
+  case JCS_EXT_XRGB:
+  case JCS_EXT_ARGB:
+    neonfct = jsimd_h2v2_extxrgb_merged_upsample_neon;
+    break;
+  default:
+    neonfct = jsimd_h2v2_extrgb_merged_upsample_neon;
+    break;
+  }
+  neonfct(cinfo->output_width, input_buf, in_row_group_ctr, output_buf);
 }
 GLOBAL(void)
 jsimd_h2v1_merged_upsample(j_decompress_ptr cinfo, JSAMPIMAGE input_buf,
                            JDIMENSION in_row_group_ctr, JSAMPARRAY output_buf)
 {
+  void (*neonfct) (JDIMENSION, JSAMPIMAGE, JDIMENSION, JSAMPARRAY);
+  switch (cinfo->out_color_space) {
+  case JCS_EXT_RGB:
+    neonfct = jsimd_h2v1_extrgb_merged_upsample_neon;
+    break;
+  case JCS_EXT_RGBX:
+  case JCS_EXT_RGBA:
+    neonfct = jsimd_h2v1_extrgbx_merged_upsample_neon;
+    break;
+  case JCS_EXT_BGR:
+    neonfct = jsimd_h2v1_extbgr_merged_upsample_neon;
+    break;
+  case JCS_EXT_BGRX:
+  case JCS_EXT_BGRA:
+    neonfct = jsimd_h2v1_extbgrx_merged_upsample_neon;
+    break;
+  case JCS_EXT_XBGR:
+  case JCS_EXT_ABGR:
+    neonfct = jsimd_h2v1_extxbgr_merged_upsample_neon;
+    break;
+  case JCS_EXT_XRGB:
+  case JCS_EXT_ARGB:
+    neonfct = jsimd_h2v1_extxrgb_merged_upsample_neon;
+    break;
+  default:
+    neonfct = jsimd_h2v1_extrgb_merged_upsample_neon;
+    break;
+  }
+  neonfct(cinfo->output_width, input_buf, in_row_group_ctr, output_buf);
 }
 GLOBAL(int)
@@ -762,17 +995,33 @@ jsimd_huff_encode_one_block(void *state, JOCTET *buffer, JCOEFPTR block,
                             int last_dc_val, c_derived_tbl *dctbl,
                             c_derived_tbl *actbl)
 {
+#ifndef NEON_INTRINSICS
   if (simd_features & JSIMD_FASTTBL)
+#endif
     return jsimd_huff_encode_one_block_neon(state, buffer, block, last_dc_val,
                                             dctbl, actbl);
+#ifndef NEON_INTRINSICS
   else
     return jsimd_huff_encode_one_block_neon_slowtbl(state, buffer, block,
                                                     last_dc_val, dctbl, actbl);
+#endif
 }
 GLOBAL(int)
 jsimd_can_encode_mcu_AC_first_prepare(void)
 {
+  init_simd();
+  if (DCTSIZE != 8)
+    return 0;
+  if (sizeof(JCOEF) != 2)
+    return 0;
+  if (SIZEOF_SIZE_T != 8)
+    return 0;
+  if (simd_support & JSIMD_NEON)
+    return 1;
   return 0;
 }
@@ -781,11 +1030,25 @@ jsimd_encode_mcu_AC_first_prepare(const JCOEF *block,
                                   const int *jpeg_natural_order_start, int Sl,
                                   int Al, JCOEF *values, size_t *zerobits)
 {
+  jsimd_encode_mcu_AC_first_prepare_neon(block, jpeg_natural_order_start,
+                                         Sl, Al, values, zerobits);
 }
 GLOBAL(int)
 jsimd_can_encode_mcu_AC_refine_prepare(void)
 {
+  init_simd();
+  if (DCTSIZE != 8)
+    return 0;
+  if (sizeof(JCOEF) != 2)
+    return 0;
+  if (SIZEOF_SIZE_T != 8)
+    return 0;
+  if (simd_support & JSIMD_NEON)
+    return 1;
   return 0;
 }
@@ -794,5 +1057,7 @@ jsimd_encode_mcu_AC_refine_prepare(const JCOEF *block,
                                    const int *jpeg_natural_order_start, int Sl,
                                    int Al, JCOEF *absvalues, size_t *bits)
 {
-  return 0;
+  return jsimd_encode_mcu_AC_refine_prepare_neon(block,
+                                                 jpeg_natural_order_start,
+                                                 Sl, Al, absvalues, bits);
 }

File diff suppressed because it is too large.

simd/arm/align.h (new file, 28 lines)

@@ -0,0 +1,28 @@
/*
* Copyright (C) 2020, Arm Limited. All Rights Reserved.
*
* This software is provided 'as-is', without any express or implied
* warranty. In no event will the authors be held liable for any damages
* arising from the use of this software.
*
* Permission is granted to anyone to use this software for any purpose,
* including commercial applications, and to alter it and redistribute it
* freely, subject to the following restrictions:
*
* 1. The origin of this software must not be misrepresented; you must not
* claim that you wrote the original software. If you use this software
* in a product, an acknowledgment in the product documentation would be
* appreciated but is not required.
* 2. Altered source versions must be plainly marked as such, and must not be
* misrepresented as being the original software.
* 3. This notice may not be removed or altered from any source distribution.
*/
/* How to obtain memory alignment for structures and variables */
#if defined(_MSC_VER)
#define ALIGN(alignment) __declspec(align(alignment))
#elif defined(__clang__) || defined(__GNUC__)
#define ALIGN(alignment) __attribute__((aligned(alignment)))
#else
#error "Unknown compiler"
#endif
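/* Added usage sketch (illustrative, not part of the original header): place
 * the macro before a definition to request the alignment needed by 128-bit
 * Neon loads, e.g.
 *
 *   ALIGN(16) static const uint16_t consts[8] = { 1, 2, 3, 4, 5, 6, 7, 8 };
 */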

simd/arm/jccolor-neon.c (new file, 160 lines)

@@ -0,0 +1,160 @@
/*
* jccolor-neon.c - colorspace conversion (Arm Neon)
*
* Copyright (C) 2020, Arm Limited. All Rights Reserved.
* Copyright (C) 2020, D. R. Commander. All Rights Reserved.
*
* This software is provided 'as-is', without any express or implied
* warranty. In no event will the authors be held liable for any damages
* arising from the use of this software.
*
* Permission is granted to anyone to use this software for any purpose,
* including commercial applications, and to alter it and redistribute it
* freely, subject to the following restrictions:
*
* 1. The origin of this software must not be misrepresented; you must not
* claim that you wrote the original software. If you use this software
* in a product, an acknowledgment in the product documentation would be
* appreciated but is not required.
* 2. Altered source versions must be plainly marked as such, and must not be
* misrepresented as being the original software.
* 3. This notice may not be removed or altered from any source distribution.
*/
#define JPEG_INTERNALS
#include "../../jinclude.h"
#include "../../jpeglib.h"
#include "../../jsimd.h"
#include "../../jdct.h"
#include "../../jsimddct.h"
#include "../jsimd.h"
#include "align.h"
#include "neon-compat.h"
#include <arm_neon.h>
/* RGB -> YCbCr conversion constants */
#define F_0_298 19595
#define F_0_587 38470
#define F_0_113 7471
#define F_0_168 11059
#define F_0_331 21709
#define F_0_500 32768
#define F_0_418 27439
#define F_0_081 5329
ALIGN(16) static const uint16_t jsimd_rgb_ycc_neon_consts[] = {
F_0_298, F_0_587, F_0_113, F_0_168,
F_0_331, F_0_500, F_0_418, F_0_081
};
/* Include inline routines for colorspace extensions. */
#if defined(__aarch64__) || defined(_M_ARM64)
#include "aarch64/jccolext-neon.c"
#else
#include "aarch32/jccolext-neon.c"
#endif
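/* Added note (illustrative): jccolext-neon.c acts as a template that is
 * re-included once per pixel format; each block below redefines RGB_RED,
 * RGB_GREEN, RGB_BLUE, RGB_PIXELSIZE and the function name so that the same
 * inline code generates the whole jsimd_ext*_ycc_convert_neon family.
 */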
#undef RGB_RED
#undef RGB_GREEN
#undef RGB_BLUE
#undef RGB_PIXELSIZE
#define RGB_RED EXT_RGB_RED
#define RGB_GREEN EXT_RGB_GREEN
#define RGB_BLUE EXT_RGB_BLUE
#define RGB_PIXELSIZE EXT_RGB_PIXELSIZE
#define jsimd_rgb_ycc_convert_neon jsimd_extrgb_ycc_convert_neon
#if defined(__aarch64__) || defined(_M_ARM64)
#include "aarch64/jccolext-neon.c"
#else
#include "aarch32/jccolext-neon.c"
#endif
#undef RGB_RED
#undef RGB_GREEN
#undef RGB_BLUE
#undef RGB_PIXELSIZE
#undef jsimd_rgb_ycc_convert_neon
#define RGB_RED EXT_RGBX_RED
#define RGB_GREEN EXT_RGBX_GREEN
#define RGB_BLUE EXT_RGBX_BLUE
#define RGB_PIXELSIZE EXT_RGBX_PIXELSIZE
#define jsimd_rgb_ycc_convert_neon jsimd_extrgbx_ycc_convert_neon
#if defined(__aarch64__) || defined(_M_ARM64)
#include "aarch64/jccolext-neon.c"
#else
#include "aarch32/jccolext-neon.c"
#endif
#undef RGB_RED
#undef RGB_GREEN
#undef RGB_BLUE
#undef RGB_PIXELSIZE
#undef jsimd_rgb_ycc_convert_neon
#define RGB_RED EXT_BGR_RED
#define RGB_GREEN EXT_BGR_GREEN
#define RGB_BLUE EXT_BGR_BLUE
#define RGB_PIXELSIZE EXT_BGR_PIXELSIZE
#define jsimd_rgb_ycc_convert_neon jsimd_extbgr_ycc_convert_neon
#if defined(__aarch64__) || defined(_M_ARM64)
#include "aarch64/jccolext-neon.c"
#else
#include "aarch32/jccolext-neon.c"
#endif
#undef RGB_RED
#undef RGB_GREEN
#undef RGB_BLUE
#undef RGB_PIXELSIZE
#undef jsimd_rgb_ycc_convert_neon
#define RGB_RED EXT_BGRX_RED
#define RGB_GREEN EXT_BGRX_GREEN
#define RGB_BLUE EXT_BGRX_BLUE
#define RGB_PIXELSIZE EXT_BGRX_PIXELSIZE
#define jsimd_rgb_ycc_convert_neon jsimd_extbgrx_ycc_convert_neon
#if defined(__aarch64__) || defined(_M_ARM64)
#include "aarch64/jccolext-neon.c"
#else
#include "aarch32/jccolext-neon.c"
#endif
#undef RGB_RED
#undef RGB_GREEN
#undef RGB_BLUE
#undef RGB_PIXELSIZE
#undef jsimd_rgb_ycc_convert_neon
#define RGB_RED EXT_XBGR_RED
#define RGB_GREEN EXT_XBGR_GREEN
#define RGB_BLUE EXT_XBGR_BLUE
#define RGB_PIXELSIZE EXT_XBGR_PIXELSIZE
#define jsimd_rgb_ycc_convert_neon jsimd_extxbgr_ycc_convert_neon
#if defined(__aarch64__) || defined(_M_ARM64)
#include "aarch64/jccolext-neon.c"
#else
#include "aarch32/jccolext-neon.c"
#endif
#undef RGB_RED
#undef RGB_GREEN
#undef RGB_BLUE
#undef RGB_PIXELSIZE
#undef jsimd_rgb_ycc_convert_neon
#define RGB_RED EXT_XRGB_RED
#define RGB_GREEN EXT_XRGB_GREEN
#define RGB_BLUE EXT_XRGB_BLUE
#define RGB_PIXELSIZE EXT_XRGB_PIXELSIZE
#define jsimd_rgb_ycc_convert_neon jsimd_extxrgb_ycc_convert_neon
#if defined(__aarch64__) || defined(_M_ARM64)
#include "aarch64/jccolext-neon.c"
#else
#include "aarch32/jccolext-neon.c"
#endif
#undef RGB_RED
#undef RGB_GREEN
#undef RGB_BLUE
#undef RGB_PIXELSIZE
#undef jsimd_rgb_ycc_convert_neon

simd/arm/jcgray-neon.c (new file, 120 lines)

@@ -0,0 +1,120 @@
/*
* jcgray-neon.c - grayscale colorspace conversion (Arm Neon)
*
* Copyright (C) 2020, Arm Limited. All Rights Reserved.
*
* This software is provided 'as-is', without any express or implied
* warranty. In no event will the authors be held liable for any damages
* arising from the use of this software.
*
* Permission is granted to anyone to use this software for any purpose,
* including commercial applications, and to alter it and redistribute it
* freely, subject to the following restrictions:
*
* 1. The origin of this software must not be misrepresented; you must not
* claim that you wrote the original software. If you use this software
* in a product, an acknowledgment in the product documentation would be
* appreciated but is not required.
* 2. Altered source versions must be plainly marked as such, and must not be
* misrepresented as being the original software.
* 3. This notice may not be removed or altered from any source distribution.
*/
#define JPEG_INTERNALS
#include "../../jinclude.h"
#include "../../jpeglib.h"
#include "../../jsimd.h"
#include "../../jdct.h"
#include "../../jsimddct.h"
#include "../jsimd.h"
#include "align.h"
#include <arm_neon.h>
/* RGB -> Grayscale conversion constants */
#define F_0_298 19595
#define F_0_587 38470
#define F_0_113 7471
/* Include inline routines for colorspace extensions. */
#include "jcgryext-neon.c"
#undef RGB_RED
#undef RGB_GREEN
#undef RGB_BLUE
#undef RGB_PIXELSIZE
#define RGB_RED EXT_RGB_RED
#define RGB_GREEN EXT_RGB_GREEN
#define RGB_BLUE EXT_RGB_BLUE
#define RGB_PIXELSIZE EXT_RGB_PIXELSIZE
#define jsimd_rgb_gray_convert_neon jsimd_extrgb_gray_convert_neon
#include "jcgryext-neon.c"
#undef RGB_RED
#undef RGB_GREEN
#undef RGB_BLUE
#undef RGB_PIXELSIZE
#undef jsimd_rgb_gray_convert_neon
#define RGB_RED EXT_RGBX_RED
#define RGB_GREEN EXT_RGBX_GREEN
#define RGB_BLUE EXT_RGBX_BLUE
#define RGB_PIXELSIZE EXT_RGBX_PIXELSIZE
#define jsimd_rgb_gray_convert_neon jsimd_extrgbx_gray_convert_neon
#include "jcgryext-neon.c"
#undef RGB_RED
#undef RGB_GREEN
#undef RGB_BLUE
#undef RGB_PIXELSIZE
#undef jsimd_rgb_gray_convert_neon
#define RGB_RED EXT_BGR_RED
#define RGB_GREEN EXT_BGR_GREEN
#define RGB_BLUE EXT_BGR_BLUE
#define RGB_PIXELSIZE EXT_BGR_PIXELSIZE
#define jsimd_rgb_gray_convert_neon jsimd_extbgr_gray_convert_neon
#include "jcgryext-neon.c"
#undef RGB_RED
#undef RGB_GREEN
#undef RGB_BLUE
#undef RGB_PIXELSIZE
#undef jsimd_rgb_gray_convert_neon
#define RGB_RED EXT_BGRX_RED
#define RGB_GREEN EXT_BGRX_GREEN
#define RGB_BLUE EXT_BGRX_BLUE
#define RGB_PIXELSIZE EXT_BGRX_PIXELSIZE
#define jsimd_rgb_gray_convert_neon jsimd_extbgrx_gray_convert_neon
#include "jcgryext-neon.c"
#undef RGB_RED
#undef RGB_GREEN
#undef RGB_BLUE
#undef RGB_PIXELSIZE
#undef jsimd_rgb_gray_convert_neon
#define RGB_RED EXT_XBGR_RED
#define RGB_GREEN EXT_XBGR_GREEN
#define RGB_BLUE EXT_XBGR_BLUE
#define RGB_PIXELSIZE EXT_XBGR_PIXELSIZE
#define jsimd_rgb_gray_convert_neon jsimd_extxbgr_gray_convert_neon
#include "jcgryext-neon.c"
#undef RGB_RED
#undef RGB_GREEN
#undef RGB_BLUE
#undef RGB_PIXELSIZE
#undef jsimd_rgb_gray_convert_neon
#define RGB_RED EXT_XRGB_RED
#define RGB_GREEN EXT_XRGB_GREEN
#define RGB_BLUE EXT_XRGB_BLUE
#define RGB_PIXELSIZE EXT_XRGB_PIXELSIZE
#define jsimd_rgb_gray_convert_neon jsimd_extxrgb_gray_convert_neon
#include "jcgryext-neon.c"
#undef RGB_RED
#undef RGB_GREEN
#undef RGB_BLUE
#undef RGB_PIXELSIZE
#undef jsimd_rgb_gray_convert_neon

simd/arm/jcgryext-neon.c (new file, 106 lines)

@@ -0,0 +1,106 @@
/*
* jcgryext-neon.c - grayscale colorspace conversion (Arm Neon)
*
* Copyright (C) 2020, Arm Limited. All Rights Reserved.
*
* This software is provided 'as-is', without any express or implied
* warranty. In no event will the authors be held liable for any damages
* arising from the use of this software.
*
* Permission is granted to anyone to use this software for any purpose,
* including commercial applications, and to alter it and redistribute it
* freely, subject to the following restrictions:
*
* 1. The origin of this software must not be misrepresented; you must not
* claim that you wrote the original software. If you use this software
* in a product, an acknowledgment in the product documentation would be
* appreciated but is not required.
* 2. Altered source versions must be plainly marked as such, and must not be
* misrepresented as being the original software.
* 3. This notice may not be removed or altered from any source distribution.
*/
/* This file is included by jcgray-neon.c */
/* RGB -> Grayscale conversion is defined by the following equation:
* Y = 0.29900 * R + 0.58700 * G + 0.11400 * B
*
* Avoid floating point arithmetic by using shifted integer constants:
* 0.29899597 = 19595 * 2^-16
* 0.58700561 = 38470 * 2^-16
* 0.11399841 = 7471 * 2^-16
* These constants are defined in jcgray-neon.c
*
* This is the same computation as the RGB -> Y portion of RGB -> YCbCr.
*/
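/* Added note (illustrative): 19595 + 38470 + 7471 = 65536, so the three
 * weights sum to exactly 1.0 in Q16 fixed point; for a neutral input with
 * R = G = B = v the rounded right shift below yields
 * Y = (65536 * v + 32768) >> 16 = v, i.e. grays are preserved exactly.
 */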
void jsimd_rgb_gray_convert_neon(JDIMENSION image_width, JSAMPARRAY input_buf,
JSAMPIMAGE output_buf, JDIMENSION output_row,
int num_rows)
{
JSAMPROW inptr;
JSAMPROW outptr;
/* Allocate temporary buffer for final (image_width % 16) pixels in row. */
ALIGN(16) uint8_t tmp_buf[16 * RGB_PIXELSIZE];
while (--num_rows >= 0) {
inptr = *input_buf++;
outptr = output_buf[0][output_row];
output_row++;
int cols_remaining = image_width;
for (; cols_remaining > 0; cols_remaining -= 16) {
/* To prevent buffer overread by the vector load instructions, the last
* (image_width % 16) columns of data are first memcopied to a temporary
* buffer large enough to accommodate the vector load.
*/
if (cols_remaining < 16) {
memcpy(tmp_buf, inptr, cols_remaining * RGB_PIXELSIZE);
inptr = tmp_buf;
}
#if RGB_PIXELSIZE == 4
uint8x16x4_t input_pixels = vld4q_u8(inptr);
#else
uint8x16x3_t input_pixels = vld3q_u8(inptr);
#endif
uint16x8_t r_l = vmovl_u8(vget_low_u8(input_pixels.val[RGB_RED]));
uint16x8_t r_h = vmovl_u8(vget_high_u8(input_pixels.val[RGB_RED]));
uint16x8_t g_l = vmovl_u8(vget_low_u8(input_pixels.val[RGB_GREEN]));
uint16x8_t g_h = vmovl_u8(vget_high_u8(input_pixels.val[RGB_GREEN]));
uint16x8_t b_l = vmovl_u8(vget_low_u8(input_pixels.val[RGB_BLUE]));
uint16x8_t b_h = vmovl_u8(vget_high_u8(input_pixels.val[RGB_BLUE]));
/* Compute Y = 0.29900 * R + 0.58700 * G + 0.11400 * B */
uint32x4_t y_ll = vmull_n_u16(vget_low_u16(r_l), F_0_298);
uint32x4_t y_lh = vmull_n_u16(vget_high_u16(r_l), F_0_298);
uint32x4_t y_hl = vmull_n_u16(vget_low_u16(r_h), F_0_298);
uint32x4_t y_hh = vmull_n_u16(vget_high_u16(r_h), F_0_298);
y_ll = vmlal_n_u16(y_ll, vget_low_u16(g_l), F_0_587);
y_lh = vmlal_n_u16(y_lh, vget_high_u16(g_l), F_0_587);
y_hl = vmlal_n_u16(y_hl, vget_low_u16(g_h), F_0_587);
y_hh = vmlal_n_u16(y_hh, vget_high_u16(g_h), F_0_587);
y_ll = vmlal_n_u16(y_ll, vget_low_u16(b_l), F_0_113);
y_lh = vmlal_n_u16(y_lh, vget_high_u16(b_l), F_0_113);
y_hl = vmlal_n_u16(y_hl, vget_low_u16(b_h), F_0_113);
y_hh = vmlal_n_u16(y_hh, vget_high_u16(b_h), F_0_113);
/* Descale Y values (rounding right shift) and narrow to 16-bit. */
uint16x8_t y_l = vcombine_u16(vrshrn_n_u32(y_ll, 16),
vrshrn_n_u32(y_lh, 16));
uint16x8_t y_h = vcombine_u16(vrshrn_n_u32(y_hl, 16),
vrshrn_n_u32(y_hh, 16));
/* Narrow Y values to 8-bit and store to memory. Buffer overwrite is
* permitted up to the next multiple of ALIGN_SIZE bytes.
*/
vst1q_u8(outptr, vcombine_u8(vmovn_u16(y_l), vmovn_u16(y_h)));
/* Increment pointers. */
inptr += (16 * RGB_PIXELSIZE);
outptr += 16;
}
}
}

simd/arm/jchuff.h (new file, 149 lines)

@@ -0,0 +1,149 @@
/*
* jchuff.h
*
* This file was part of the Independent JPEG Group's software:
* Copyright (C) 1991-1997, Thomas G. Lane.
* libjpeg-turbo Modifications:
* Copyright (C) 2009, 2018, D. R. Commander.
* Copyright (C) 2018, Matthias Räncker.
* Copyright (C) 2020, Arm Limited.
* For conditions of distribution and use, see the accompanying README.ijg
* file.
*/
/* Expanded entropy encoder object for Huffman encoding.
*
* The savable_state subrecord contains fields that change within an MCU,
* but must not be updated permanently until we complete the MCU.
*/
#if defined(__aarch64__) || defined(_M_ARM64)
#define BIT_BUF_SIZE 64
#else
#define BIT_BUF_SIZE 32
#endif
typedef struct {
size_t put_buffer; /* current bit accumulation buffer */
int free_bits; /* # of bits available in it */
int last_dc_val[MAX_COMPS_IN_SCAN]; /* last DC coef for each component */
} savable_state;
typedef struct {
JOCTET *next_output_byte; /* => next byte to write in buffer */
size_t free_in_buffer; /* # of byte spaces remaining in buffer */
savable_state cur; /* Current bit buffer & DC state */
j_compress_ptr cinfo; /* dump_buffer needs access to this */
int simd;
} working_state;
/* Outputting bits to the file */
/* Output byte b and, speculatively, an additional 0 byte. 0xFF must be encoded
* as 0xFF 0x00, so the output buffer pointer is advanced by 2 if the byte is
* 0xFF. Otherwise, the output buffer pointer is advanced by 1, and the
* speculative 0 byte will be overwritten by the next byte.
*/
#define EMIT_BYTE(b) { \
buffer[0] = (JOCTET)(b); \
buffer[1] = 0; \
buffer -= -2 + ((JOCTET)(b) < 0xFF); \
}
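/* Added note (illustrative): buffer -= -2 + ((JOCTET)(b) < 0xFF) is equivalent
 * to buffer += 2 - (b < 0xFF), i.e. the pointer advances by 2 when b == 0xFF
 * (keeping the stuffed 0x00) and by 1 otherwise, so the speculative 0x00 is
 * overwritten by the next byte emitted.
 */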
/* Output the entire bit buffer. If there are no 0xFF bytes in it, then write
* directly to the output buffer. Otherwise, use the EMIT_BYTE() macro to
* encode 0xFF as 0xFF 0x00.
*/
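/* Added note (illustrative): the test below,
 * put_buffer & 0x80..80 & ~(put_buffer + 0x01..01), is a SWAR check that is
 * nonzero when at least one byte of put_buffer equals 0xFF (the only byte
 * value whose top bit is set and whose increment clears it), so the
 * byte-stuffing path is taken only when stuffing may actually be needed.
 */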
#if defined(__aarch64__) || defined(_M_ARM64)
#if defined(_MSC_VER) && !defined(__clang__)
#define SPLAT() { \
buffer[0] = (JOCTET)(put_buffer >> 56); \
buffer[1] = (JOCTET)(put_buffer >> 48); \
buffer[2] = (JOCTET)(put_buffer >> 40); \
buffer[3] = (JOCTET)(put_buffer >> 32); \
buffer[4] = (JOCTET)(put_buffer >> 24); \
buffer[5] = (JOCTET)(put_buffer >> 16); \
buffer[6] = (JOCTET)(put_buffer >> 8); \
buffer[7] = (JOCTET)(put_buffer ); \
}
#else
#define SPLAT() { \
__asm__("rev %x0, %x1" : "=r"(put_buffer) : "r"(put_buffer)); \
*((uint64_t *)buffer) = put_buffer; \
}
#endif
#define FLUSH() { \
if (put_buffer & 0x8080808080808080 & ~(put_buffer + 0x0101010101010101)) { \
EMIT_BYTE(put_buffer >> 56) \
EMIT_BYTE(put_buffer >> 48) \
EMIT_BYTE(put_buffer >> 40) \
EMIT_BYTE(put_buffer >> 32) \
EMIT_BYTE(put_buffer >> 24) \
EMIT_BYTE(put_buffer >> 16) \
EMIT_BYTE(put_buffer >> 8) \
EMIT_BYTE(put_buffer ) \
} else { \
SPLAT() \
buffer += 8; \
} \
}
#else
#if defined(_MSC_VER) && !defined(__clang__)
#define SPLAT() { \
buffer[0] = (JOCTET)(put_buffer >> 24); \
buffer[1] = (JOCTET)(put_buffer >> 16); \
buffer[2] = (JOCTET)(put_buffer >> 8); \
buffer[3] = (JOCTET)(put_buffer ); \
}
#else
#define SPLAT() { \
__asm__("rev %0, %1" : "=r"(put_buffer) : "r"(put_buffer)); \
*((uint32_t *)buffer) = put_buffer; \
}
#endif
#define FLUSH() { \
if (put_buffer & 0x80808080 & ~(put_buffer + 0x01010101)) { \
EMIT_BYTE(put_buffer >> 24) \
EMIT_BYTE(put_buffer >> 16) \
EMIT_BYTE(put_buffer >> 8) \
EMIT_BYTE(put_buffer ) \
} else { \
SPLAT() \
buffer += 4; \
} \
}
#endif
/* Fill the bit buffer to capacity with the leading bits from code, then output
* the bit buffer and put the remaining bits from code into the bit buffer.
*/
#define PUT_AND_FLUSH(code, size) { \
put_buffer = (put_buffer << (size + free_bits)) | (code >> -free_bits); \
FLUSH() \
free_bits += BIT_BUF_SIZE; \
put_buffer = code; \
}
/* Insert code into the bit buffer and output the bit buffer if needed.
* NOTE: We can't flush with free_bits == 0, since the left shift in
* PUT_AND_FLUSH() would have undefined behavior.
*/
#define PUT_BITS(code, size) { \
free_bits -= size; \
if (free_bits < 0) \
PUT_AND_FLUSH(code, size) \
else \
put_buffer = (put_buffer << size) | code; \
}
#define PUT_CODE(code, size, diff) { \
diff |= code << nbits; \
nbits += size; \
PUT_BITS(diff, nbits) \
}
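/* Added usage note (illustrative): with free_bits = BIT_BUF_SIZE and, say, an
 * 8-bit Huffman code followed by 3 diff bits, PUT_CODE() first places the code
 * above the diff bits (nbits becomes 11), then PUT_BITS() shifts that 11-bit
 * group into put_buffer and decrements free_bits; PUT_AND_FLUSH() runs only
 * once a group no longer fits in the bit buffer.
 */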

simd/arm/jcphuff-neon.c (new file, 591 lines)

@@ -0,0 +1,591 @@
/*
* jcphuff-neon.c - prepare data for progressive Huffman encoding (Arm Neon)
*
* Copyright (C) 2020-2021, Arm Limited. All Rights Reserved.
*
* This software is provided 'as-is', without any express or implied
* warranty. In no event will the authors be held liable for any damages
* arising from the use of this software.
*
* Permission is granted to anyone to use this software for any purpose,
* including commercial applications, and to alter it and redistribute it
* freely, subject to the following restrictions:
*
* 1. The origin of this software must not be misrepresented; you must not
* claim that you wrote the original software. If you use this software
* in a product, an acknowledgment in the product documentation would be
* appreciated but is not required.
* 2. Altered source versions must be plainly marked as such, and must not be
* misrepresented as being the original software.
* 3. This notice may not be removed or altered from any source distribution.
*/
#define JPEG_INTERNALS
#include "../../jinclude.h"
#include "../../jpeglib.h"
#include "../../jsimd.h"
#include "../../jdct.h"
#include "../../jsimddct.h"
#include "../jsimd.h"
#include "neon-compat.h"
#include <arm_neon.h>
/* Data preparation for encode_mcu_AC_first().
*
* The equivalent scalar C function (encode_mcu_AC_first_prepare()) can be
* found in jcphuff.c.
*/
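/* Added summary (illustrative): the loop below walks Sl coefficients in
 * natural order, writes their point-transformed magnitudes to values[0..63]
 * and the corresponding diff bit patterns (the magnitude, one's-complemented
 * for negative coefficients) to values[64..127], then assembles a zerobits
 * bitmap with one bit set per nonzero coefficient.
 */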
void jsimd_encode_mcu_AC_first_prepare_neon
(const JCOEF *block, const int *jpeg_natural_order_start, int Sl, int Al,
JCOEF *values, size_t *zerobits)
{
JCOEF *values_ptr = values;
JCOEF *diff_values_ptr = values + DCTSIZE2;
/* Rows of coefficients to zero (since they haven't been processed) */
int i, rows_to_zero = 8;
for (i = 0; i < Sl / 16; i++) {
int16x8_t coefs1 = vld1q_dup_s16(block + jpeg_natural_order_start[0]);
coefs1 = vld1q_lane_s16(block + jpeg_natural_order_start[1], coefs1, 1);
coefs1 = vld1q_lane_s16(block + jpeg_natural_order_start[2], coefs1, 2);
coefs1 = vld1q_lane_s16(block + jpeg_natural_order_start[3], coefs1, 3);
coefs1 = vld1q_lane_s16(block + jpeg_natural_order_start[4], coefs1, 4);
coefs1 = vld1q_lane_s16(block + jpeg_natural_order_start[5], coefs1, 5);
coefs1 = vld1q_lane_s16(block + jpeg_natural_order_start[6], coefs1, 6);
coefs1 = vld1q_lane_s16(block + jpeg_natural_order_start[7], coefs1, 7);
int16x8_t coefs2 = vld1q_dup_s16(block + jpeg_natural_order_start[8]);
coefs2 = vld1q_lane_s16(block + jpeg_natural_order_start[9], coefs2, 1);
coefs2 = vld1q_lane_s16(block + jpeg_natural_order_start[10], coefs2, 2);
coefs2 = vld1q_lane_s16(block + jpeg_natural_order_start[11], coefs2, 3);
coefs2 = vld1q_lane_s16(block + jpeg_natural_order_start[12], coefs2, 4);
coefs2 = vld1q_lane_s16(block + jpeg_natural_order_start[13], coefs2, 5);
coefs2 = vld1q_lane_s16(block + jpeg_natural_order_start[14], coefs2, 6);
coefs2 = vld1q_lane_s16(block + jpeg_natural_order_start[15], coefs2, 7);
/* Isolate sign of coefficients. */
int16x8_t sign_coefs1 = vshrq_n_s16(coefs1, 15);
int16x8_t sign_coefs2 = vshrq_n_s16(coefs2, 15);
/* Compute absolute value of coefficients and apply point transform Al. */
int16x8_t abs_coefs1 = vabsq_s16(coefs1);
int16x8_t abs_coefs2 = vabsq_s16(coefs2);
coefs1 = vshlq_s16(abs_coefs1, vdupq_n_s16(-Al));
coefs2 = vshlq_s16(abs_coefs2, vdupq_n_s16(-Al));
/* Compute diff values. */
int16x8_t diff1 = veorq_s16(coefs1, sign_coefs1);
int16x8_t diff2 = veorq_s16(coefs2, sign_coefs2);
/* Store transformed coefficients and diff values. */
vst1q_s16(values_ptr, coefs1);
vst1q_s16(values_ptr + DCTSIZE, coefs2);
vst1q_s16(diff_values_ptr, diff1);
vst1q_s16(diff_values_ptr + DCTSIZE, diff2);
values_ptr += 16;
diff_values_ptr += 16;
jpeg_natural_order_start += 16;
rows_to_zero -= 2;
}
/* Same operation but for remaining partial vector */
int remaining_coefs = Sl % 16;
if (remaining_coefs > 8) {
int16x8_t coefs1 = vld1q_dup_s16(block + jpeg_natural_order_start[0]);
coefs1 = vld1q_lane_s16(block + jpeg_natural_order_start[1], coefs1, 1);
coefs1 = vld1q_lane_s16(block + jpeg_natural_order_start[2], coefs1, 2);
coefs1 = vld1q_lane_s16(block + jpeg_natural_order_start[3], coefs1, 3);
coefs1 = vld1q_lane_s16(block + jpeg_natural_order_start[4], coefs1, 4);
coefs1 = vld1q_lane_s16(block + jpeg_natural_order_start[5], coefs1, 5);
coefs1 = vld1q_lane_s16(block + jpeg_natural_order_start[6], coefs1, 6);
coefs1 = vld1q_lane_s16(block + jpeg_natural_order_start[7], coefs1, 7);
int16x8_t coefs2 = vdupq_n_s16(0);
switch (remaining_coefs) {
case 15:
coefs2 = vld1q_lane_s16(block + jpeg_natural_order_start[14], coefs2, 6);
case 14:
coefs2 = vld1q_lane_s16(block + jpeg_natural_order_start[13], coefs2, 5);
case 13:
coefs2 = vld1q_lane_s16(block + jpeg_natural_order_start[12], coefs2, 4);
case 12:
coefs2 = vld1q_lane_s16(block + jpeg_natural_order_start[11], coefs2, 3);
case 11:
coefs2 = vld1q_lane_s16(block + jpeg_natural_order_start[10], coefs2, 2);
case 10:
coefs2 = vld1q_lane_s16(block + jpeg_natural_order_start[9], coefs2, 1);
case 9:
coefs2 = vld1q_lane_s16(block + jpeg_natural_order_start[8], coefs2, 0);
default:
break;
}
/* Isolate sign of coefficients. */
int16x8_t sign_coefs1 = vshrq_n_s16(coefs1, 15);
int16x8_t sign_coefs2 = vshrq_n_s16(coefs2, 15);
/* Compute absolute value of coefficients and apply point transform Al. */
int16x8_t abs_coefs1 = vabsq_s16(coefs1);
int16x8_t abs_coefs2 = vabsq_s16(coefs2);
coefs1 = vshlq_s16(abs_coefs1, vdupq_n_s16(-Al));
coefs2 = vshlq_s16(abs_coefs2, vdupq_n_s16(-Al));
/* Compute diff values. */
int16x8_t diff1 = veorq_s16(coefs1, sign_coefs1);
int16x8_t diff2 = veorq_s16(coefs2, sign_coefs2);
/* Store transformed coefficients and diff values. */
vst1q_s16(values_ptr, coefs1);
vst1q_s16(values_ptr + DCTSIZE, coefs2);
vst1q_s16(diff_values_ptr, diff1);
vst1q_s16(diff_values_ptr + DCTSIZE, diff2);
values_ptr += 16;
diff_values_ptr += 16;
rows_to_zero -= 2;
} else if (remaining_coefs > 0) {
int16x8_t coefs = vdupq_n_s16(0);
switch (remaining_coefs) {
case 8:
coefs = vld1q_lane_s16(block + jpeg_natural_order_start[7], coefs, 7);
case 7:
coefs = vld1q_lane_s16(block + jpeg_natural_order_start[6], coefs, 6);
case 6:
coefs = vld1q_lane_s16(block + jpeg_natural_order_start[5], coefs, 5);
case 5:
coefs = vld1q_lane_s16(block + jpeg_natural_order_start[4], coefs, 4);
case 4:
coefs = vld1q_lane_s16(block + jpeg_natural_order_start[3], coefs, 3);
case 3:
coefs = vld1q_lane_s16(block + jpeg_natural_order_start[2], coefs, 2);
case 2:
coefs = vld1q_lane_s16(block + jpeg_natural_order_start[1], coefs, 1);
case 1:
coefs = vld1q_lane_s16(block + jpeg_natural_order_start[0], coefs, 0);
default:
break;
}
/* Isolate sign of coefficients. */
int16x8_t sign_coefs = vshrq_n_s16(coefs, 15);
/* Compute absolute value of coefficients and apply point transform Al. */
int16x8_t abs_coefs = vabsq_s16(coefs);
coefs = vshlq_s16(abs_coefs, vdupq_n_s16(-Al));
/* Compute diff values. */
int16x8_t diff = veorq_s16(coefs, sign_coefs);
/* Store transformed coefficients and diff values. */
vst1q_s16(values_ptr, coefs);
vst1q_s16(diff_values_ptr, diff);
values_ptr += 8;
diff_values_ptr += 8;
rows_to_zero--;
}
/* Zero remaining memory in the values and diff_values blocks. */
for (i = 0; i < rows_to_zero; i++) {
vst1q_s16(values_ptr, vdupq_n_s16(0));
vst1q_s16(diff_values_ptr, vdupq_n_s16(0));
values_ptr += 8;
diff_values_ptr += 8;
}
/* Construct zerobits bitmap. A set bit means that the corresponding
* coefficient != 0.
*/
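/* Added note (illustrative): each row is compared with zero, masked with
 * { 0x01, 0x02, ..., 0x80 } and reduced with pairwise adds, so every row
 * collapses into one byte whose bit n is set when coefficient n of that row
 * is zero; for a row whose only nonzero entries are at positions 0 and 5 the
 * byte is 0xDE, and the assembled bitmap is inverted below to give the
 * "coefficient != 0" bits.
 */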
int16x8_t row0 = vld1q_s16(values + 0 * DCTSIZE);
int16x8_t row1 = vld1q_s16(values + 1 * DCTSIZE);
int16x8_t row2 = vld1q_s16(values + 2 * DCTSIZE);
int16x8_t row3 = vld1q_s16(values + 3 * DCTSIZE);
int16x8_t row4 = vld1q_s16(values + 4 * DCTSIZE);
int16x8_t row5 = vld1q_s16(values + 5 * DCTSIZE);
int16x8_t row6 = vld1q_s16(values + 6 * DCTSIZE);
int16x8_t row7 = vld1q_s16(values + 7 * DCTSIZE);
uint8x8_t row0_eq0 = vmovn_u16(vceqq_s16(row0, vdupq_n_s16(0)));
uint8x8_t row1_eq0 = vmovn_u16(vceqq_s16(row1, vdupq_n_s16(0)));
uint8x8_t row2_eq0 = vmovn_u16(vceqq_s16(row2, vdupq_n_s16(0)));
uint8x8_t row3_eq0 = vmovn_u16(vceqq_s16(row3, vdupq_n_s16(0)));
uint8x8_t row4_eq0 = vmovn_u16(vceqq_s16(row4, vdupq_n_s16(0)));
uint8x8_t row5_eq0 = vmovn_u16(vceqq_s16(row5, vdupq_n_s16(0)));
uint8x8_t row6_eq0 = vmovn_u16(vceqq_s16(row6, vdupq_n_s16(0)));
uint8x8_t row7_eq0 = vmovn_u16(vceqq_s16(row7, vdupq_n_s16(0)));
/* { 0x01, 0x02, 0x04, 0x08, 0x10, 0x20, 0x40, 0x80 } */
const uint8x8_t bitmap_mask =
vreinterpret_u8_u64(vmov_n_u64(0x8040201008040201));
row0_eq0 = vand_u8(row0_eq0, bitmap_mask);
row1_eq0 = vand_u8(row1_eq0, bitmap_mask);
row2_eq0 = vand_u8(row2_eq0, bitmap_mask);
row3_eq0 = vand_u8(row3_eq0, bitmap_mask);
row4_eq0 = vand_u8(row4_eq0, bitmap_mask);
row5_eq0 = vand_u8(row5_eq0, bitmap_mask);
row6_eq0 = vand_u8(row6_eq0, bitmap_mask);
row7_eq0 = vand_u8(row7_eq0, bitmap_mask);
uint8x8_t bitmap_rows_01 = vpadd_u8(row0_eq0, row1_eq0);
uint8x8_t bitmap_rows_23 = vpadd_u8(row2_eq0, row3_eq0);
uint8x8_t bitmap_rows_45 = vpadd_u8(row4_eq0, row5_eq0);
uint8x8_t bitmap_rows_67 = vpadd_u8(row6_eq0, row7_eq0);
uint8x8_t bitmap_rows_0123 = vpadd_u8(bitmap_rows_01, bitmap_rows_23);
uint8x8_t bitmap_rows_4567 = vpadd_u8(bitmap_rows_45, bitmap_rows_67);
uint8x8_t bitmap_all = vpadd_u8(bitmap_rows_0123, bitmap_rows_4567);
#if defined(__aarch64__) || defined(_M_ARM64)
/* Move bitmap to a 64-bit scalar register. */
uint64_t bitmap = vget_lane_u64(vreinterpret_u64_u8(bitmap_all), 0);
/* Store zerobits bitmap. */
*zerobits = ~bitmap;
#else
/* Move bitmap to two 32-bit scalar registers. */
uint32_t bitmap0 = vget_lane_u32(vreinterpret_u32_u8(bitmap_all), 0);
uint32_t bitmap1 = vget_lane_u32(vreinterpret_u32_u8(bitmap_all), 1);
/* Store zerobits bitmap. */
zerobits[0] = ~bitmap0;
zerobits[1] = ~bitmap1;
#endif
}
/* Data preparation for encode_mcu_AC_refine().
*
* The equivalent scalar C function (encode_mcu_AC_refine_prepare()) can be
* found in jcphuff.c.
*/
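/* Added summary (illustrative): in addition to the absolute values and the
 * zerobits bitmap, this routine collects a signbits bitmap derived from the
 * coefficient signs and a "coefficient == 1" bitmap, from which the return
 * value, the EOB position, is derived at the end.
 */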
int jsimd_encode_mcu_AC_refine_prepare_neon
(const JCOEF *block, const int *jpeg_natural_order_start, int Sl, int Al,
JCOEF *absvalues, size_t *bits)
{
/* Temporary storage buffers for data used to compute the signbits bitmap and
* the end-of-block (EOB) position
*/
uint8_t coef_sign_bits[64];
uint8_t coef_eq1_bits[64];
JCOEF *absvalues_ptr = absvalues;
uint8_t *coef_sign_bits_ptr = coef_sign_bits;
uint8_t *eq1_bits_ptr = coef_eq1_bits;
/* Rows of coefficients to zero (since they haven't been processed) */
int i, rows_to_zero = 8;
for (i = 0; i < Sl / 16; i++) {
int16x8_t coefs1 = vld1q_dup_s16(block + jpeg_natural_order_start[0]);
coefs1 = vld1q_lane_s16(block + jpeg_natural_order_start[1], coefs1, 1);
coefs1 = vld1q_lane_s16(block + jpeg_natural_order_start[2], coefs1, 2);
coefs1 = vld1q_lane_s16(block + jpeg_natural_order_start[3], coefs1, 3);
coefs1 = vld1q_lane_s16(block + jpeg_natural_order_start[4], coefs1, 4);
coefs1 = vld1q_lane_s16(block + jpeg_natural_order_start[5], coefs1, 5);
coefs1 = vld1q_lane_s16(block + jpeg_natural_order_start[6], coefs1, 6);
coefs1 = vld1q_lane_s16(block + jpeg_natural_order_start[7], coefs1, 7);
int16x8_t coefs2 = vld1q_dup_s16(block + jpeg_natural_order_start[8]);
coefs2 = vld1q_lane_s16(block + jpeg_natural_order_start[9], coefs2, 1);
coefs2 = vld1q_lane_s16(block + jpeg_natural_order_start[10], coefs2, 2);
coefs2 = vld1q_lane_s16(block + jpeg_natural_order_start[11], coefs2, 3);
coefs2 = vld1q_lane_s16(block + jpeg_natural_order_start[12], coefs2, 4);
coefs2 = vld1q_lane_s16(block + jpeg_natural_order_start[13], coefs2, 5);
coefs2 = vld1q_lane_s16(block + jpeg_natural_order_start[14], coefs2, 6);
coefs2 = vld1q_lane_s16(block + jpeg_natural_order_start[15], coefs2, 7);
/* Compute and store data for signbits bitmap. */
uint8x8_t sign_coefs1 =
vmovn_u16(vreinterpretq_u16_s16(vshrq_n_s16(coefs1, 15)));
uint8x8_t sign_coefs2 =
vmovn_u16(vreinterpretq_u16_s16(vshrq_n_s16(coefs2, 15)));
vst1_u8(coef_sign_bits_ptr, sign_coefs1);
vst1_u8(coef_sign_bits_ptr + DCTSIZE, sign_coefs2);
/* Compute absolute value of coefficients and apply point transform Al. */
int16x8_t abs_coefs1 = vabsq_s16(coefs1);
int16x8_t abs_coefs2 = vabsq_s16(coefs2);
coefs1 = vshlq_s16(abs_coefs1, vdupq_n_s16(-Al));
coefs2 = vshlq_s16(abs_coefs2, vdupq_n_s16(-Al));
vst1q_s16(absvalues_ptr, coefs1);
vst1q_s16(absvalues_ptr + DCTSIZE, coefs2);
/* Test whether transformed coefficient values == 1 (used to find EOB
* position.)
*/
uint8x8_t coefs_eq11 = vmovn_u16(vceqq_s16(coefs1, vdupq_n_s16(1)));
uint8x8_t coefs_eq12 = vmovn_u16(vceqq_s16(coefs2, vdupq_n_s16(1)));
vst1_u8(eq1_bits_ptr, coefs_eq11);
vst1_u8(eq1_bits_ptr + DCTSIZE, coefs_eq12);
absvalues_ptr += 16;
coef_sign_bits_ptr += 16;
eq1_bits_ptr += 16;
jpeg_natural_order_start += 16;
rows_to_zero -= 2;
}
/* Same operation but for remaining partial vector */
int remaining_coefs = Sl % 16;
if (remaining_coefs > 8) {
int16x8_t coefs1 = vld1q_dup_s16(block + jpeg_natural_order_start[0]);
coefs1 = vld1q_lane_s16(block + jpeg_natural_order_start[1], coefs1, 1);
coefs1 = vld1q_lane_s16(block + jpeg_natural_order_start[2], coefs1, 2);
coefs1 = vld1q_lane_s16(block + jpeg_natural_order_start[3], coefs1, 3);
coefs1 = vld1q_lane_s16(block + jpeg_natural_order_start[4], coefs1, 4);
coefs1 = vld1q_lane_s16(block + jpeg_natural_order_start[5], coefs1, 5);
coefs1 = vld1q_lane_s16(block + jpeg_natural_order_start[6], coefs1, 6);
coefs1 = vld1q_lane_s16(block + jpeg_natural_order_start[7], coefs1, 7);
int16x8_t coefs2 = vdupq_n_s16(0);
switch (remaining_coefs) {
case 15:
coefs2 = vld1q_lane_s16(block + jpeg_natural_order_start[14], coefs2, 6);
case 14:
coefs2 = vld1q_lane_s16(block + jpeg_natural_order_start[13], coefs2, 5);
case 13:
coefs2 = vld1q_lane_s16(block + jpeg_natural_order_start[12], coefs2, 4);
case 12:
coefs2 = vld1q_lane_s16(block + jpeg_natural_order_start[11], coefs2, 3);
case 11:
coefs2 = vld1q_lane_s16(block + jpeg_natural_order_start[10], coefs2, 2);
case 10:
coefs2 = vld1q_lane_s16(block + jpeg_natural_order_start[9], coefs2, 1);
case 9:
coefs2 = vld1q_lane_s16(block + jpeg_natural_order_start[8], coefs2, 0);
default:
break;
}
/* Compute and store data for signbits bitmap. */
uint8x8_t sign_coefs1 =
vmovn_u16(vreinterpretq_u16_s16(vshrq_n_s16(coefs1, 15)));
uint8x8_t sign_coefs2 =
vmovn_u16(vreinterpretq_u16_s16(vshrq_n_s16(coefs2, 15)));
vst1_u8(coef_sign_bits_ptr, sign_coefs1);
vst1_u8(coef_sign_bits_ptr + DCTSIZE, sign_coefs2);
/* Compute absolute value of coefficients and apply point transform Al. */
int16x8_t abs_coefs1 = vabsq_s16(coefs1);
int16x8_t abs_coefs2 = vabsq_s16(coefs2);
coefs1 = vshlq_s16(abs_coefs1, vdupq_n_s16(-Al));
coefs2 = vshlq_s16(abs_coefs2, vdupq_n_s16(-Al));
vst1q_s16(absvalues_ptr, coefs1);
vst1q_s16(absvalues_ptr + DCTSIZE, coefs2);
/* Test whether transformed coefficient values == 1 (used to find EOB
* position.)
*/
uint8x8_t coefs_eq11 = vmovn_u16(vceqq_s16(coefs1, vdupq_n_s16(1)));
uint8x8_t coefs_eq12 = vmovn_u16(vceqq_s16(coefs2, vdupq_n_s16(1)));
vst1_u8(eq1_bits_ptr, coefs_eq11);
vst1_u8(eq1_bits_ptr + DCTSIZE, coefs_eq12);
absvalues_ptr += 16;
coef_sign_bits_ptr += 16;
eq1_bits_ptr += 16;
jpeg_natural_order_start += 16;
rows_to_zero -= 2;
} else if (remaining_coefs > 0) {
int16x8_t coefs = vdupq_n_s16(0);
switch (remaining_coefs) {
case 8:
coefs = vld1q_lane_s16(block + jpeg_natural_order_start[7], coefs, 7);
case 7:
coefs = vld1q_lane_s16(block + jpeg_natural_order_start[6], coefs, 6);
case 6:
coefs = vld1q_lane_s16(block + jpeg_natural_order_start[5], coefs, 5);
case 5:
coefs = vld1q_lane_s16(block + jpeg_natural_order_start[4], coefs, 4);
case 4:
coefs = vld1q_lane_s16(block + jpeg_natural_order_start[3], coefs, 3);
case 3:
coefs = vld1q_lane_s16(block + jpeg_natural_order_start[2], coefs, 2);
case 2:
coefs = vld1q_lane_s16(block + jpeg_natural_order_start[1], coefs, 1);
case 1:
coefs = vld1q_lane_s16(block + jpeg_natural_order_start[0], coefs, 0);
default:
break;
}
/* Compute and store data for signbits bitmap. */
uint8x8_t sign_coefs =
vmovn_u16(vreinterpretq_u16_s16(vshrq_n_s16(coefs, 15)));
vst1_u8(coef_sign_bits_ptr, sign_coefs);
/* Compute absolute value of coefficients and apply point transform Al. */
int16x8_t abs_coefs = vabsq_s16(coefs);
coefs = vshlq_s16(abs_coefs, vdupq_n_s16(-Al));
vst1q_s16(absvalues_ptr, coefs);
/* Test whether transformed coefficient values == 1 (used to find EOB
* position.)
*/
uint8x8_t coefs_eq1 = vmovn_u16(vceqq_s16(coefs, vdupq_n_s16(1)));
vst1_u8(eq1_bits_ptr, coefs_eq1);
absvalues_ptr += 8;
coef_sign_bits_ptr += 8;
eq1_bits_ptr += 8;
rows_to_zero--;
}
/* Zero remaining memory in blocks. */
for (i = 0; i < rows_to_zero; i++) {
vst1q_s16(absvalues_ptr, vdupq_n_s16(0));
vst1_u8(coef_sign_bits_ptr, vdup_n_u8(0));
vst1_u8(eq1_bits_ptr, vdup_n_u8(0));
absvalues_ptr += 8;
coef_sign_bits_ptr += 8;
eq1_bits_ptr += 8;
}
/* Construct zerobits bitmap. */
int16x8_t abs_row0 = vld1q_s16(absvalues + 0 * DCTSIZE);
int16x8_t abs_row1 = vld1q_s16(absvalues + 1 * DCTSIZE);
int16x8_t abs_row2 = vld1q_s16(absvalues + 2 * DCTSIZE);
int16x8_t abs_row3 = vld1q_s16(absvalues + 3 * DCTSIZE);
int16x8_t abs_row4 = vld1q_s16(absvalues + 4 * DCTSIZE);
int16x8_t abs_row5 = vld1q_s16(absvalues + 5 * DCTSIZE);
int16x8_t abs_row6 = vld1q_s16(absvalues + 6 * DCTSIZE);
int16x8_t abs_row7 = vld1q_s16(absvalues + 7 * DCTSIZE);
uint8x8_t abs_row0_eq0 = vmovn_u16(vceqq_s16(abs_row0, vdupq_n_s16(0)));
uint8x8_t abs_row1_eq0 = vmovn_u16(vceqq_s16(abs_row1, vdupq_n_s16(0)));
uint8x8_t abs_row2_eq0 = vmovn_u16(vceqq_s16(abs_row2, vdupq_n_s16(0)));
uint8x8_t abs_row3_eq0 = vmovn_u16(vceqq_s16(abs_row3, vdupq_n_s16(0)));
uint8x8_t abs_row4_eq0 = vmovn_u16(vceqq_s16(abs_row4, vdupq_n_s16(0)));
uint8x8_t abs_row5_eq0 = vmovn_u16(vceqq_s16(abs_row5, vdupq_n_s16(0)));
uint8x8_t abs_row6_eq0 = vmovn_u16(vceqq_s16(abs_row6, vdupq_n_s16(0)));
uint8x8_t abs_row7_eq0 = vmovn_u16(vceqq_s16(abs_row7, vdupq_n_s16(0)));
/* { 0x01, 0x02, 0x04, 0x08, 0x10, 0x20, 0x40, 0x80 } */
const uint8x8_t bitmap_mask =
vreinterpret_u8_u64(vmov_n_u64(0x8040201008040201));
abs_row0_eq0 = vand_u8(abs_row0_eq0, bitmap_mask);
abs_row1_eq0 = vand_u8(abs_row1_eq0, bitmap_mask);
abs_row2_eq0 = vand_u8(abs_row2_eq0, bitmap_mask);
abs_row3_eq0 = vand_u8(abs_row3_eq0, bitmap_mask);
abs_row4_eq0 = vand_u8(abs_row4_eq0, bitmap_mask);
abs_row5_eq0 = vand_u8(abs_row5_eq0, bitmap_mask);
abs_row6_eq0 = vand_u8(abs_row6_eq0, bitmap_mask);
abs_row7_eq0 = vand_u8(abs_row7_eq0, bitmap_mask);
uint8x8_t bitmap_rows_01 = vpadd_u8(abs_row0_eq0, abs_row1_eq0);
uint8x8_t bitmap_rows_23 = vpadd_u8(abs_row2_eq0, abs_row3_eq0);
uint8x8_t bitmap_rows_45 = vpadd_u8(abs_row4_eq0, abs_row5_eq0);
uint8x8_t bitmap_rows_67 = vpadd_u8(abs_row6_eq0, abs_row7_eq0);
uint8x8_t bitmap_rows_0123 = vpadd_u8(bitmap_rows_01, bitmap_rows_23);
uint8x8_t bitmap_rows_4567 = vpadd_u8(bitmap_rows_45, bitmap_rows_67);
uint8x8_t bitmap_all = vpadd_u8(bitmap_rows_0123, bitmap_rows_4567);
#if defined(__aarch64__) || defined(_M_ARM64)
/* Move bitmap to a 64-bit scalar register. */
uint64_t bitmap = vget_lane_u64(vreinterpret_u64_u8(bitmap_all), 0);
/* Store zerobits bitmap. */
bits[0] = ~bitmap;
#else
/* Move bitmap to two 32-bit scalar registers. */
uint32_t bitmap0 = vget_lane_u32(vreinterpret_u32_u8(bitmap_all), 0);
uint32_t bitmap1 = vget_lane_u32(vreinterpret_u32_u8(bitmap_all), 1);
/* Store zerobits bitmap. */
bits[0] = ~bitmap0;
bits[1] = ~bitmap1;
#endif
/* Construct signbits bitmap. */
uint8x8_t signbits_row0 = vld1_u8(coef_sign_bits + 0 * DCTSIZE);
uint8x8_t signbits_row1 = vld1_u8(coef_sign_bits + 1 * DCTSIZE);
uint8x8_t signbits_row2 = vld1_u8(coef_sign_bits + 2 * DCTSIZE);
uint8x8_t signbits_row3 = vld1_u8(coef_sign_bits + 3 * DCTSIZE);
uint8x8_t signbits_row4 = vld1_u8(coef_sign_bits + 4 * DCTSIZE);
uint8x8_t signbits_row5 = vld1_u8(coef_sign_bits + 5 * DCTSIZE);
uint8x8_t signbits_row6 = vld1_u8(coef_sign_bits + 6 * DCTSIZE);
uint8x8_t signbits_row7 = vld1_u8(coef_sign_bits + 7 * DCTSIZE);
signbits_row0 = vand_u8(signbits_row0, bitmap_mask);
signbits_row1 = vand_u8(signbits_row1, bitmap_mask);
signbits_row2 = vand_u8(signbits_row2, bitmap_mask);
signbits_row3 = vand_u8(signbits_row3, bitmap_mask);
signbits_row4 = vand_u8(signbits_row4, bitmap_mask);
signbits_row5 = vand_u8(signbits_row5, bitmap_mask);
signbits_row6 = vand_u8(signbits_row6, bitmap_mask);
signbits_row7 = vand_u8(signbits_row7, bitmap_mask);
bitmap_rows_01 = vpadd_u8(signbits_row0, signbits_row1);
bitmap_rows_23 = vpadd_u8(signbits_row2, signbits_row3);
bitmap_rows_45 = vpadd_u8(signbits_row4, signbits_row5);
bitmap_rows_67 = vpadd_u8(signbits_row6, signbits_row7);
bitmap_rows_0123 = vpadd_u8(bitmap_rows_01, bitmap_rows_23);
bitmap_rows_4567 = vpadd_u8(bitmap_rows_45, bitmap_rows_67);
bitmap_all = vpadd_u8(bitmap_rows_0123, bitmap_rows_4567);
#if defined(__aarch64__) || defined(_M_ARM64)
/* Move bitmap to a 64-bit scalar register. */
bitmap = vget_lane_u64(vreinterpret_u64_u8(bitmap_all), 0);
/* Store signbits bitmap. */
bits[1] = ~bitmap;
#else
/* Move bitmap to two 32-bit scalar registers. */
bitmap0 = vget_lane_u32(vreinterpret_u32_u8(bitmap_all), 0);
bitmap1 = vget_lane_u32(vreinterpret_u32_u8(bitmap_all), 1);
/* Store signbits bitmap. */
bits[2] = ~bitmap0;
bits[3] = ~bitmap1;
#endif
/* Construct bitmap to find EOB position (the index of the last coefficient
* equal to 1.)
*/
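/* Added note (illustrative): bit n of the assembled bitmap corresponds to
 * coefficient n, so the index of the last coefficient equal to 1 is the
 * position of the highest set bit, computed as 63 - CLZ(bitmap) (or from the
 * upper/lower 32-bit halves on AArch32); a zero bitmap means no coefficient
 * equals 1 and the EOB position is reported as 0.
 */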
uint8x8_t row0_eq1 = vld1_u8(coef_eq1_bits + 0 * DCTSIZE);
uint8x8_t row1_eq1 = vld1_u8(coef_eq1_bits + 1 * DCTSIZE);
uint8x8_t row2_eq1 = vld1_u8(coef_eq1_bits + 2 * DCTSIZE);
uint8x8_t row3_eq1 = vld1_u8(coef_eq1_bits + 3 * DCTSIZE);
uint8x8_t row4_eq1 = vld1_u8(coef_eq1_bits + 4 * DCTSIZE);
uint8x8_t row5_eq1 = vld1_u8(coef_eq1_bits + 5 * DCTSIZE);
uint8x8_t row6_eq1 = vld1_u8(coef_eq1_bits + 6 * DCTSIZE);
uint8x8_t row7_eq1 = vld1_u8(coef_eq1_bits + 7 * DCTSIZE);
row0_eq1 = vand_u8(row0_eq1, bitmap_mask);
row1_eq1 = vand_u8(row1_eq1, bitmap_mask);
row2_eq1 = vand_u8(row2_eq1, bitmap_mask);
row3_eq1 = vand_u8(row3_eq1, bitmap_mask);
row4_eq1 = vand_u8(row4_eq1, bitmap_mask);
row5_eq1 = vand_u8(row5_eq1, bitmap_mask);
row6_eq1 = vand_u8(row6_eq1, bitmap_mask);
row7_eq1 = vand_u8(row7_eq1, bitmap_mask);
bitmap_rows_01 = vpadd_u8(row0_eq1, row1_eq1);
bitmap_rows_23 = vpadd_u8(row2_eq1, row3_eq1);
bitmap_rows_45 = vpadd_u8(row4_eq1, row5_eq1);
bitmap_rows_67 = vpadd_u8(row6_eq1, row7_eq1);
bitmap_rows_0123 = vpadd_u8(bitmap_rows_01, bitmap_rows_23);
bitmap_rows_4567 = vpadd_u8(bitmap_rows_45, bitmap_rows_67);
bitmap_all = vpadd_u8(bitmap_rows_0123, bitmap_rows_4567);
#if defined(__aarch64__) || defined(_M_ARM64)
/* Move bitmap to a 64-bit scalar register. */
bitmap = vget_lane_u64(vreinterpret_u64_u8(bitmap_all), 0);
/* Return EOB position. */
if (bitmap == 0) {
/* EOB position is defined to be 0 if all coefficients != 1. */
return 0;
} else {
return 63 - BUILTIN_CLZLL(bitmap);
}
#else
/* Move bitmap to two 32-bit scalar registers. */
bitmap0 = vget_lane_u32(vreinterpret_u32_u8(bitmap_all), 0);
bitmap1 = vget_lane_u32(vreinterpret_u32_u8(bitmap_all), 1);
/* Return EOB position. */
if (bitmap0 == 0 && bitmap1 == 0) {
return 0;
} else if (bitmap1 != 0) {
return 63 - BUILTIN_CLZ(bitmap1);
} else {
return 31 - BUILTIN_CLZ(bitmap0);
}
#endif
}

192
simd/arm/jcsample-neon.c Normal file

@@ -0,0 +1,192 @@
/*
* jcsample-neon.c - downsampling (Arm Neon)
*
* Copyright (C) 2020, Arm Limited. All Rights Reserved.
*
* This software is provided 'as-is', without any express or implied
* warranty. In no event will the authors be held liable for any damages
* arising from the use of this software.
*
* Permission is granted to anyone to use this software for any purpose,
* including commercial applications, and to alter it and redistribute it
* freely, subject to the following restrictions:
*
* 1. The origin of this software must not be misrepresented; you must not
* claim that you wrote the original software. If you use this software
* in a product, an acknowledgment in the product documentation would be
* appreciated but is not required.
* 2. Altered source versions must be plainly marked as such, and must not be
* misrepresented as being the original software.
* 3. This notice may not be removed or altered from any source distribution.
*/
#define JPEG_INTERNALS
#include "../../jinclude.h"
#include "../../jpeglib.h"
#include "../../jsimd.h"
#include "../../jdct.h"
#include "../../jsimddct.h"
#include "../jsimd.h"
#include "align.h"
#include <arm_neon.h>
ALIGN(16) static const uint8_t jsimd_h2_downsample_consts[] = {
0x00, 0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07, /* Pad 0 */
0x08, 0x09, 0x0A, 0x0B, 0x0C, 0x0D, 0x0E, 0x0F,
0x00, 0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07, /* Pad 1 */
0x08, 0x09, 0x0A, 0x0B, 0x0C, 0x0D, 0x0E, 0x0E,
0x00, 0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07, /* Pad 2 */
0x08, 0x09, 0x0A, 0x0B, 0x0C, 0x0D, 0x0D, 0x0D,
0x00, 0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07, /* Pad 3 */
0x08, 0x09, 0x0A, 0x0B, 0x0C, 0x0C, 0x0C, 0x0C,
0x00, 0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07, /* Pad 4 */
0x08, 0x09, 0x0A, 0x0B, 0x0B, 0x0B, 0x0B, 0x0B,
0x00, 0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07, /* Pad 5 */
0x08, 0x09, 0x0A, 0x0A, 0x0A, 0x0A, 0x0A, 0x0A,
0x00, 0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07, /* Pad 6 */
0x08, 0x09, 0x09, 0x09, 0x09, 0x09, 0x09, 0x09,
0x00, 0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07, /* Pad 7 */
0x08, 0x08, 0x08, 0x08, 0x08, 0x08, 0x08, 0x08,
0x00, 0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07, /* Pad 8 */
0x07, 0x07, 0x07, 0x07, 0x07, 0x07, 0x07, 0x07,
0x00, 0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x06, /* Pad 9 */
0x06, 0x06, 0x06, 0x06, 0x06, 0x06, 0x06, 0x06,
0x00, 0x01, 0x02, 0x03, 0x04, 0x05, 0x05, 0x05, /* Pad 10 */
0x05, 0x05, 0x05, 0x05, 0x05, 0x05, 0x05, 0x05,
0x00, 0x01, 0x02, 0x03, 0x04, 0x04, 0x04, 0x04, /* Pad 11 */
0x04, 0x04, 0x04, 0x04, 0x04, 0x04, 0x04, 0x04,
0x00, 0x01, 0x02, 0x03, 0x03, 0x03, 0x03, 0x03, /* Pad 12 */
0x03, 0x03, 0x03, 0x03, 0x03, 0x03, 0x03, 0x03,
0x00, 0x01, 0x02, 0x02, 0x02, 0x02, 0x02, 0x02, /* Pad 13 */
0x02, 0x02, 0x02, 0x02, 0x02, 0x02, 0x02, 0x02,
0x00, 0x01, 0x01, 0x01, 0x01, 0x01, 0x01, 0x01, /* Pad 14 */
0x01, 0x01, 0x01, 0x01, 0x01, 0x01, 0x01, 0x01,
0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, /* Pad 15 */
0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00
};
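/* In each "Pad N" row above, the last N byte indices are clamped to the index
 * of the last valid pixel, so the table lookups below (vqtbl1q_u8() or
 * vtbl2_u8()) replicate that pixel into the N padding positions of the final,
 * partially filled DCT block.
 */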
/* Downsample pixel values of a single component.
* This version handles the common case of 2:1 horizontal and 1:1 vertical,
* without smoothing.
*/
void jsimd_h2v1_downsample_neon(JDIMENSION image_width, int max_v_samp_factor,
JDIMENSION v_samp_factor,
JDIMENSION width_in_blocks,
JSAMPARRAY input_data, JSAMPARRAY output_data)
{
JSAMPROW inptr, outptr;
/* Load expansion mask to pad remaining elements of last DCT block. */
const int mask_offset = 16 * ((width_in_blocks * 2 * DCTSIZE) - image_width);
const uint8x16_t expand_mask =
vld1q_u8(&jsimd_h2_downsample_consts[mask_offset]);
/* Load bias pattern (alternating every pixel.) */
/* { 0, 1, 0, 1, 0, 1, 0, 1 } */
const uint16x8_t bias = vreinterpretq_u16_u32(vdupq_n_u32(0x00010000));
unsigned i, outrow;
for (outrow = 0; outrow < v_samp_factor; outrow++) {
outptr = output_data[outrow];
inptr = input_data[outrow];
/* Downsample all but the last DCT block of pixels. */
for (i = 0; i < width_in_blocks - 1; i++) {
uint8x16_t pixels = vld1q_u8(inptr + i * 2 * DCTSIZE);
/* Add adjacent pixel values, widen to 16-bit, and add bias. */
uint16x8_t samples_u16 = vpadalq_u8(bias, pixels);
/* Divide total by 2 and narrow to 8-bit. */
uint8x8_t samples_u8 = vshrn_n_u16(samples_u16, 1);
/* Store samples to memory. */
vst1_u8(outptr + i * DCTSIZE, samples_u8);
}
/* Load pixels in last DCT block into a table. */
uint8x16_t pixels = vld1q_u8(inptr + (width_in_blocks - 1) * 2 * DCTSIZE);
#if defined(__aarch64__) || defined(_M_ARM64)
/* Pad the empty elements with the value of the last pixel. */
pixels = vqtbl1q_u8(pixels, expand_mask);
#else
uint8x8x2_t table = { { vget_low_u8(pixels), vget_high_u8(pixels) } };
pixels = vcombine_u8(vtbl2_u8(table, vget_low_u8(expand_mask)),
vtbl2_u8(table, vget_high_u8(expand_mask)));
#endif
/* Add adjacent pixel values, widen to 16-bit, and add bias. */
uint16x8_t samples_u16 = vpadalq_u8(bias, pixels);
/* Divide total by 2, narrow to 8-bit, and store. */
uint8x8_t samples_u8 = vshrn_n_u16(samples_u16, 1);
vst1_u8(outptr + (width_in_blocks - 1) * DCTSIZE, samples_u8);
}
}
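/* Editor's sketch (not part of libjpeg-turbo; the function name is
 * hypothetical): a minimal scalar equivalent of the vectorized h2v1 arithmetic
 * above, assuming the input row is padded to a whole number of output samples.
 * The (j & 1) term reproduces the { 0, 1, 0, 1, ... } ordered-dither bias
 * added via vpadalq_u8() before the halving shift.
 */
static void h2v1_downsample_scalar_sketch(const uint8_t *in, uint8_t *out,
                                          unsigned num_output_samples)
{
  unsigned j;

  for (j = 0; j < num_output_samples; j++)
    out[j] = (uint8_t)((in[2 * j] + in[2 * j + 1] + (j & 1)) >> 1);
}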
/* Downsample pixel values of a single component.
* This version handles the standard case of 2:1 horizontal and 2:1 vertical,
* without smoothing.
*/
void jsimd_h2v2_downsample_neon(JDIMENSION image_width, int max_v_samp_factor,
JDIMENSION v_samp_factor,
JDIMENSION width_in_blocks,
JSAMPARRAY input_data, JSAMPARRAY output_data)
{
JSAMPROW inptr0, inptr1, outptr;
/* Load expansion mask to pad remaining elements of last DCT block. */
const int mask_offset = 16 * ((width_in_blocks * 2 * DCTSIZE) - image_width);
const uint8x16_t expand_mask =
vld1q_u8(&jsimd_h2_downsample_consts[mask_offset]);
/* Load bias pattern (alternating every pixel.) */
/* { 1, 2, 1, 2, 1, 2, 1, 2 } */
const uint16x8_t bias = vreinterpretq_u16_u32(vdupq_n_u32(0x00020001));
unsigned i, outrow;
for (outrow = 0; outrow < v_samp_factor; outrow++) {
outptr = output_data[outrow];
inptr0 = input_data[outrow];
inptr1 = input_data[outrow + 1];
/* Downsample all but the last DCT block of pixels. */
for (i = 0; i < width_in_blocks - 1; i++) {
uint8x16_t pixels_r0 = vld1q_u8(inptr0 + i * 2 * DCTSIZE);
uint8x16_t pixels_r1 = vld1q_u8(inptr1 + i * 2 * DCTSIZE);
/* Add adjacent pixel values in row 0, widen to 16-bit, and add bias. */
uint16x8_t samples_u16 = vpadalq_u8(bias, pixels_r0);
/* Add adjacent pixel values in row 1, widen to 16-bit, and accumulate.
*/
samples_u16 = vpadalq_u8(samples_u16, pixels_r1);
/* Divide total by 4 and narrow to 8-bit. */
uint8x8_t samples_u8 = vshrn_n_u16(samples_u16, 2);
/* Store samples to memory. */
vst1_u8(outptr + i * DCTSIZE, samples_u8);
}
/* Load pixels in last DCT block into a table. */
uint8x16_t pixels_r0 =
vld1q_u8(inptr0 + (width_in_blocks - 1) * 2 * DCTSIZE);
uint8x16_t pixels_r1 =
vld1q_u8(inptr1 + (width_in_blocks - 1) * 2 * DCTSIZE);
#if defined(__aarch64__) || defined(_M_ARM64)
/* Pad the empty elements with the value of the last pixel. */
pixels_r0 = vqtbl1q_u8(pixels_r0, expand_mask);
pixels_r1 = vqtbl1q_u8(pixels_r1, expand_mask);
#else
uint8x8x2_t table_r0 =
{ { vget_low_u8(pixels_r0), vget_high_u8(pixels_r0) } };
uint8x8x2_t table_r1 =
{ { vget_low_u8(pixels_r1), vget_high_u8(pixels_r1) } };
pixels_r0 = vcombine_u8(vtbl2_u8(table_r0, vget_low_u8(expand_mask)),
vtbl2_u8(table_r0, vget_high_u8(expand_mask)));
pixels_r1 = vcombine_u8(vtbl2_u8(table_r1, vget_low_u8(expand_mask)),
vtbl2_u8(table_r1, vget_high_u8(expand_mask)));
#endif
/* Add adjacent pixel values in row 0, widen to 16-bit, and add bias. */
uint16x8_t samples_u16 = vpadalq_u8(bias, pixels_r0);
/* Add adjacent pixel values in row 1, widen to 16-bit, and accumulate. */
samples_u16 = vpadalq_u8(samples_u16, pixels_r1);
/* Divide total by 4, narrow to 8-bit, and store. */
uint8x8_t samples_u8 = vshrn_n_u16(samples_u16, 2);
vst1_u8(outptr + (width_in_blocks - 1) * DCTSIZE, samples_u8);
}
}
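/* Editor's note: in scalar form, the h2v2 loop above computes
 *   out[j] = (r0[2*j] + r0[2*j+1] + r1[2*j] + r1[2*j+1] + 1 + (j & 1)) >> 2
 * where r0 and r1 are the two input rows and 1 + (j & 1) is the
 * { 1, 2, 1, 2, ... } ordered-dither bias loaded above.
 */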

353
simd/arm/jdcolext-neon.c Normal file

@@ -0,0 +1,353 @@
/*
* jdcolext-neon.c - colorspace conversion (Arm Neon)
*
* Copyright (C) 2020, Arm Limited. All Rights Reserved.
* Copyright (C) 2020, D. R. Commander. All Rights Reserved.
*
* This software is provided 'as-is', without any express or implied
* warranty. In no event will the authors be held liable for any damages
* arising from the use of this software.
*
* Permission is granted to anyone to use this software for any purpose,
* including commercial applications, and to alter it and redistribute it
* freely, subject to the following restrictions:
*
* 1. The origin of this software must not be misrepresented; you must not
* claim that you wrote the original software. If you use this software
* in a product, an acknowledgment in the product documentation would be
* appreciated but is not required.
* 2. Altered source versions must be plainly marked as such, and must not be
* misrepresented as being the original software.
* 3. This notice may not be removed or altered from any source distribution.
*/
/* This file is included by jdcolor-neon.c. */
/* YCbCr -> RGB conversion is defined by the following equations:
* R = Y + 1.40200 * (Cr - 128)
* G = Y - 0.34414 * (Cb - 128) - 0.71414 * (Cr - 128)
* B = Y + 1.77200 * (Cb - 128)
*
* Scaled integer constants are used to avoid floating-point arithmetic:
* 0.3441467 = 11277 * 2^-15
* 0.7141418 = 23401 * 2^-15
* 1.4020386 = 22971 * 2^-14
* 1.7720337 = 29033 * 2^-14
* These constants are defined in jdcolor-neon.c.
*
* To ensure correct results, rounding is used when descaling.
*/
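/* (Worked scalar form of the fixed-point math used below:
 *   R = Y + ((22971 * (Cr - 128) + (1 << 13)) >> 14)
 *   G = Y + ((-11277 * (Cb - 128) - 23401 * (Cr - 128) + (1 << 14)) >> 15)
 *   B = Y + ((29033 * (Cb - 128) + (1 << 13)) >> 14)
 * Adding half of the divisor before the arithmetic right shift is the rounding
 * performed by vrshrn_n_s32() and vqrdmulhq_lane_s16().)
 */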
/* Notes on safe memory access for YCbCr -> RGB conversion routines:
*
* Input memory buffers can be safely overread up to the next multiple of
* ALIGN_SIZE bytes, since they are always allocated by alloc_sarray() in
* jmemmgr.c.
*
* The output buffer cannot safely be written beyond output_width, since
* output_buf points to a possibly unpadded row in the decompressed image
* buffer allocated by the calling program.
*/
void jsimd_ycc_rgb_convert_neon(JDIMENSION output_width, JSAMPIMAGE input_buf,
JDIMENSION input_row, JSAMPARRAY output_buf,
int num_rows)
{
JSAMPROW outptr;
/* Pointers to Y, Cb, and Cr data */
JSAMPROW inptr0, inptr1, inptr2;
const int16x4_t consts = vld1_s16(jsimd_ycc_rgb_convert_neon_consts);
const int16x8_t neg_128 = vdupq_n_s16(-128);
while (--num_rows >= 0) {
inptr0 = input_buf[0][input_row];
inptr1 = input_buf[1][input_row];
inptr2 = input_buf[2][input_row];
input_row++;
outptr = *output_buf++;
int cols_remaining = output_width;
for (; cols_remaining >= 16; cols_remaining -= 16) {
uint8x16_t y = vld1q_u8(inptr0);
uint8x16_t cb = vld1q_u8(inptr1);
uint8x16_t cr = vld1q_u8(inptr2);
/* Subtract 128 from Cb and Cr. */
int16x8_t cr_128_l =
vreinterpretq_s16_u16(vaddw_u8(vreinterpretq_u16_s16(neg_128),
vget_low_u8(cr)));
int16x8_t cr_128_h =
vreinterpretq_s16_u16(vaddw_u8(vreinterpretq_u16_s16(neg_128),
vget_high_u8(cr)));
int16x8_t cb_128_l =
vreinterpretq_s16_u16(vaddw_u8(vreinterpretq_u16_s16(neg_128),
vget_low_u8(cb)));
int16x8_t cb_128_h =
vreinterpretq_s16_u16(vaddw_u8(vreinterpretq_u16_s16(neg_128),
vget_high_u8(cb)));
/* Compute G-Y: - 0.34414 * (Cb - 128) - 0.71414 * (Cr - 128) */
int32x4_t g_sub_y_ll = vmull_lane_s16(vget_low_s16(cb_128_l), consts, 0);
int32x4_t g_sub_y_lh = vmull_lane_s16(vget_high_s16(cb_128_l),
consts, 0);
int32x4_t g_sub_y_hl = vmull_lane_s16(vget_low_s16(cb_128_h), consts, 0);
int32x4_t g_sub_y_hh = vmull_lane_s16(vget_high_s16(cb_128_h),
consts, 0);
g_sub_y_ll = vmlsl_lane_s16(g_sub_y_ll, vget_low_s16(cr_128_l),
consts, 1);
g_sub_y_lh = vmlsl_lane_s16(g_sub_y_lh, vget_high_s16(cr_128_l),
consts, 1);
g_sub_y_hl = vmlsl_lane_s16(g_sub_y_hl, vget_low_s16(cr_128_h),
consts, 1);
g_sub_y_hh = vmlsl_lane_s16(g_sub_y_hh, vget_high_s16(cr_128_h),
consts, 1);
/* Descale G components: shift right 15, round, and narrow to 16-bit. */
int16x8_t g_sub_y_l = vcombine_s16(vrshrn_n_s32(g_sub_y_ll, 15),
vrshrn_n_s32(g_sub_y_lh, 15));
int16x8_t g_sub_y_h = vcombine_s16(vrshrn_n_s32(g_sub_y_hl, 15),
vrshrn_n_s32(g_sub_y_hh, 15));
/* Compute R-Y: 1.40200 * (Cr - 128) */
int16x8_t r_sub_y_l = vqrdmulhq_lane_s16(vshlq_n_s16(cr_128_l, 1),
consts, 2);
int16x8_t r_sub_y_h = vqrdmulhq_lane_s16(vshlq_n_s16(cr_128_h, 1),
consts, 2);
/* Compute B-Y: 1.77200 * (Cb - 128) */
int16x8_t b_sub_y_l = vqrdmulhq_lane_s16(vshlq_n_s16(cb_128_l, 1),
consts, 3);
int16x8_t b_sub_y_h = vqrdmulhq_lane_s16(vshlq_n_s16(cb_128_h, 1),
consts, 3);
/* Add Y. */
int16x8_t r_l =
vreinterpretq_s16_u16(vaddw_u8(vreinterpretq_u16_s16(r_sub_y_l),
vget_low_u8(y)));
int16x8_t r_h =
vreinterpretq_s16_u16(vaddw_u8(vreinterpretq_u16_s16(r_sub_y_h),
vget_high_u8(y)));
int16x8_t b_l =
vreinterpretq_s16_u16(vaddw_u8(vreinterpretq_u16_s16(b_sub_y_l),
vget_low_u8(y)));
int16x8_t b_h =
vreinterpretq_s16_u16(vaddw_u8(vreinterpretq_u16_s16(b_sub_y_h),
vget_high_u8(y)));
int16x8_t g_l =
vreinterpretq_s16_u16(vaddw_u8(vreinterpretq_u16_s16(g_sub_y_l),
vget_low_u8(y)));
int16x8_t g_h =
vreinterpretq_s16_u16(vaddw_u8(vreinterpretq_u16_s16(g_sub_y_h),
vget_high_u8(y)));
#if RGB_PIXELSIZE == 4
uint8x16x4_t rgba;
/* Convert each component to unsigned and narrow, clamping to [0-255]. */
rgba.val[RGB_RED] = vcombine_u8(vqmovun_s16(r_l), vqmovun_s16(r_h));
rgba.val[RGB_GREEN] = vcombine_u8(vqmovun_s16(g_l), vqmovun_s16(g_h));
rgba.val[RGB_BLUE] = vcombine_u8(vqmovun_s16(b_l), vqmovun_s16(b_h));
/* Set alpha channel to opaque (0xFF). */
rgba.val[RGB_ALPHA] = vdupq_n_u8(0xFF);
/* Store RGBA pixel data to memory. */
vst4q_u8(outptr, rgba);
#elif RGB_PIXELSIZE == 3
uint8x16x3_t rgb;
/* Convert each component to unsigned and narrow, clamping to [0-255]. */
rgb.val[RGB_RED] = vcombine_u8(vqmovun_s16(r_l), vqmovun_s16(r_h));
rgb.val[RGB_GREEN] = vcombine_u8(vqmovun_s16(g_l), vqmovun_s16(g_h));
rgb.val[RGB_BLUE] = vcombine_u8(vqmovun_s16(b_l), vqmovun_s16(b_h));
/* Store RGB pixel data to memory. */
vst3q_u8(outptr, rgb);
#else
/* Pack R, G, and B values in ratio 5:6:5. */
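/* (Equivalently, per pixel: rgb565 = ((R >> 3) << 11) | ((G >> 2) << 5) |
 * (B >> 3), with each component first saturated to [0-255] by
 * vqshluq_n_s16().)
 */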
uint16x8_t rgb565_l = vqshluq_n_s16(r_l, 8);
rgb565_l = vsriq_n_u16(rgb565_l, vqshluq_n_s16(g_l, 8), 5);
rgb565_l = vsriq_n_u16(rgb565_l, vqshluq_n_s16(b_l, 8), 11);
uint16x8_t rgb565_h = vqshluq_n_s16(r_h, 8);
rgb565_h = vsriq_n_u16(rgb565_h, vqshluq_n_s16(g_h, 8), 5);
rgb565_h = vsriq_n_u16(rgb565_h, vqshluq_n_s16(b_h, 8), 11);
/* Store RGB pixel data to memory. */
vst1q_u16((uint16_t *)outptr, rgb565_l);
vst1q_u16(((uint16_t *)outptr) + 8, rgb565_h);
#endif
/* Increment pointers. */
inptr0 += 16;
inptr1 += 16;
inptr2 += 16;
outptr += (RGB_PIXELSIZE * 16);
}
if (cols_remaining >= 8) {
uint8x8_t y = vld1_u8(inptr0);
uint8x8_t cb = vld1_u8(inptr1);
uint8x8_t cr = vld1_u8(inptr2);
/* Subtract 128 from Cb and Cr. */
int16x8_t cr_128 =
vreinterpretq_s16_u16(vaddw_u8(vreinterpretq_u16_s16(neg_128), cr));
int16x8_t cb_128 =
vreinterpretq_s16_u16(vaddw_u8(vreinterpretq_u16_s16(neg_128), cb));
/* Compute G-Y: - 0.34414 * (Cb - 128) - 0.71414 * (Cr - 128) */
int32x4_t g_sub_y_l = vmull_lane_s16(vget_low_s16(cb_128), consts, 0);
int32x4_t g_sub_y_h = vmull_lane_s16(vget_high_s16(cb_128), consts, 0);
g_sub_y_l = vmlsl_lane_s16(g_sub_y_l, vget_low_s16(cr_128), consts, 1);
g_sub_y_h = vmlsl_lane_s16(g_sub_y_h, vget_high_s16(cr_128), consts, 1);
/* Descale G components: shift right 15, round, and narrow to 16-bit. */
int16x8_t g_sub_y = vcombine_s16(vrshrn_n_s32(g_sub_y_l, 15),
vrshrn_n_s32(g_sub_y_h, 15));
/* Compute R-Y: 1.40200 * (Cr - 128) */
int16x8_t r_sub_y = vqrdmulhq_lane_s16(vshlq_n_s16(cr_128, 1),
consts, 2);
/* Compute B-Y: 1.77200 * (Cb - 128) */
int16x8_t b_sub_y = vqrdmulhq_lane_s16(vshlq_n_s16(cb_128, 1),
consts, 3);
/* Add Y. */
int16x8_t r =
vreinterpretq_s16_u16(vaddw_u8(vreinterpretq_u16_s16(r_sub_y), y));
int16x8_t b =
vreinterpretq_s16_u16(vaddw_u8(vreinterpretq_u16_s16(b_sub_y), y));
int16x8_t g =
vreinterpretq_s16_u16(vaddw_u8(vreinterpretq_u16_s16(g_sub_y), y));
#if RGB_PIXELSIZE == 4
uint8x8x4_t rgba;
/* Convert each component to unsigned and narrow, clamping to [0-255]. */
rgba.val[RGB_RED] = vqmovun_s16(r);
rgba.val[RGB_GREEN] = vqmovun_s16(g);
rgba.val[RGB_BLUE] = vqmovun_s16(b);
/* Set alpha channel to opaque (0xFF). */
rgba.val[RGB_ALPHA] = vdup_n_u8(0xFF);
/* Store RGBA pixel data to memory. */
vst4_u8(outptr, rgba);
#elif RGB_PIXELSIZE == 3
uint8x8x3_t rgb;
/* Convert each component to unsigned and narrow, clamping to [0-255]. */
rgb.val[RGB_RED] = vqmovun_s16(r);
rgb.val[RGB_GREEN] = vqmovun_s16(g);
rgb.val[RGB_BLUE] = vqmovun_s16(b);
/* Store RGB pixel data to memory. */
vst3_u8(outptr, rgb);
#else
/* Pack R, G, and B values in ratio 5:6:5. */
uint16x8_t rgb565 = vqshluq_n_s16(r, 8);
rgb565 = vsriq_n_u16(rgb565, vqshluq_n_s16(g, 8), 5);
rgb565 = vsriq_n_u16(rgb565, vqshluq_n_s16(b, 8), 11);
/* Store RGB pixel data to memory. */
vst1q_u16((uint16_t *)outptr, rgb565);
#endif
/* Increment pointers. */
inptr0 += 8;
inptr1 += 8;
inptr2 += 8;
outptr += (RGB_PIXELSIZE * 8);
cols_remaining -= 8;
}
/* Handle the tail elements. */
if (cols_remaining > 0) {
uint8x8_t y = vld1_u8(inptr0);
uint8x8_t cb = vld1_u8(inptr1);
uint8x8_t cr = vld1_u8(inptr2);
/* Subtract 128 from Cb and Cr. */
int16x8_t cr_128 =
vreinterpretq_s16_u16(vaddw_u8(vreinterpretq_u16_s16(neg_128), cr));
int16x8_t cb_128 =
vreinterpretq_s16_u16(vaddw_u8(vreinterpretq_u16_s16(neg_128), cb));
/* Compute G-Y: - 0.34414 * (Cb - 128) - 0.71414 * (Cr - 128) */
int32x4_t g_sub_y_l = vmull_lane_s16(vget_low_s16(cb_128), consts, 0);
int32x4_t g_sub_y_h = vmull_lane_s16(vget_high_s16(cb_128), consts, 0);
g_sub_y_l = vmlsl_lane_s16(g_sub_y_l, vget_low_s16(cr_128), consts, 1);
g_sub_y_h = vmlsl_lane_s16(g_sub_y_h, vget_high_s16(cr_128), consts, 1);
/* Descale G components: shift right 15, round, and narrow to 16-bit. */
int16x8_t g_sub_y = vcombine_s16(vrshrn_n_s32(g_sub_y_l, 15),
vrshrn_n_s32(g_sub_y_h, 15));
/* Compute R-Y: 1.40200 * (Cr - 128) */
int16x8_t r_sub_y = vqrdmulhq_lane_s16(vshlq_n_s16(cr_128, 1),
consts, 2);
/* Compute B-Y: 1.77200 * (Cb - 128) */
int16x8_t b_sub_y = vqrdmulhq_lane_s16(vshlq_n_s16(cb_128, 1),
consts, 3);
/* Add Y. */
int16x8_t r =
vreinterpretq_s16_u16(vaddw_u8(vreinterpretq_u16_s16(r_sub_y), y));
int16x8_t b =
vreinterpretq_s16_u16(vaddw_u8(vreinterpretq_u16_s16(b_sub_y), y));
int16x8_t g =
vreinterpretq_s16_u16(vaddw_u8(vreinterpretq_u16_s16(g_sub_y), y));
#if RGB_PIXELSIZE == 4
uint8x8x4_t rgba;
/* Convert each component to unsigned and narrow, clamping to [0-255]. */
rgba.val[RGB_RED] = vqmovun_s16(r);
rgba.val[RGB_GREEN] = vqmovun_s16(g);
rgba.val[RGB_BLUE] = vqmovun_s16(b);
/* Set alpha channel to opaque (0xFF). */
rgba.val[RGB_ALPHA] = vdup_n_u8(0xFF);
/* Store RGBA pixel data to memory. */
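/* (Each case below intentionally falls through, so that every remaining
 * lower-indexed pixel is also stored.)
 */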
switch (cols_remaining) {
case 7:
vst4_lane_u8(outptr + 6 * RGB_PIXELSIZE, rgba, 6);
case 6:
vst4_lane_u8(outptr + 5 * RGB_PIXELSIZE, rgba, 5);
case 5:
vst4_lane_u8(outptr + 4 * RGB_PIXELSIZE, rgba, 4);
case 4:
vst4_lane_u8(outptr + 3 * RGB_PIXELSIZE, rgba, 3);
case 3:
vst4_lane_u8(outptr + 2 * RGB_PIXELSIZE, rgba, 2);
case 2:
vst4_lane_u8(outptr + RGB_PIXELSIZE, rgba, 1);
case 1:
vst4_lane_u8(outptr, rgba, 0);
default:
break;
}
#elif RGB_PIXELSIZE == 3
uint8x8x3_t rgb;
/* Convert each component to unsigned and narrow, clamping to [0-255]. */
rgb.val[RGB_RED] = vqmovun_s16(r);
rgb.val[RGB_GREEN] = vqmovun_s16(g);
rgb.val[RGB_BLUE] = vqmovun_s16(b);
/* Store RGB pixel data to memory. */
switch (cols_remaining) {
case 7:
vst3_lane_u8(outptr + 6 * RGB_PIXELSIZE, rgb, 6);
case 6:
vst3_lane_u8(outptr + 5 * RGB_PIXELSIZE, rgb, 5);
case 5:
vst3_lane_u8(outptr + 4 * RGB_PIXELSIZE, rgb, 4);
case 4:
vst3_lane_u8(outptr + 3 * RGB_PIXELSIZE, rgb, 3);
case 3:
vst3_lane_u8(outptr + 2 * RGB_PIXELSIZE, rgb, 2);
case 2:
vst3_lane_u8(outptr + RGB_PIXELSIZE, rgb, 1);
case 1:
vst3_lane_u8(outptr, rgb, 0);
default:
break;
}
#else
/* Pack R, G, and B values in ratio 5:6:5. */
uint16x8_t rgb565 = vqshluq_n_s16(r, 8);
rgb565 = vsriq_n_u16(rgb565, vqshluq_n_s16(g, 8), 5);
rgb565 = vsriq_n_u16(rgb565, vqshluq_n_s16(b, 8), 11);
/* Store RGB565 pixel data to memory. */
switch (cols_remaining) {
case 7:
vst1q_lane_u16((uint16_t *)(outptr + 6 * RGB_PIXELSIZE), rgb565, 6);
case 6:
vst1q_lane_u16((uint16_t *)(outptr + 5 * RGB_PIXELSIZE), rgb565, 5);
case 5:
vst1q_lane_u16((uint16_t *)(outptr + 4 * RGB_PIXELSIZE), rgb565, 4);
case 4:
vst1q_lane_u16((uint16_t *)(outptr + 3 * RGB_PIXELSIZE), rgb565, 3);
case 3:
vst1q_lane_u16((uint16_t *)(outptr + 2 * RGB_PIXELSIZE), rgb565, 2);
case 2:
vst1q_lane_u16((uint16_t *)(outptr + RGB_PIXELSIZE), rgb565, 1);
case 1:
vst1q_lane_u16((uint16_t *)outptr, rgb565, 0);
default:
break;
}
#endif
}
}
}

141
simd/arm/jdcolor-neon.c Normal file

@@ -0,0 +1,141 @@
/*
* jdcolor-neon.c - colorspace conversion (Arm Neon)
*
* Copyright (C) 2020, Arm Limited. All Rights Reserved.
*
* This software is provided 'as-is', without any express or implied
* warranty. In no event will the authors be held liable for any damages
* arising from the use of this software.
*
* Permission is granted to anyone to use this software for any purpose,
* including commercial applications, and to alter it and redistribute it
* freely, subject to the following restrictions:
*
* 1. The origin of this software must not be misrepresented; you must not
* claim that you wrote the original software. If you use this software
* in a product, an acknowledgment in the product documentation would be
* appreciated but is not required.
* 2. Altered source versions must be plainly marked as such, and must not be
* misrepresented as being the original software.
* 3. This notice may not be removed or altered from any source distribution.
*/
#define JPEG_INTERNALS
#include "../../jinclude.h"
#include "../../jpeglib.h"
#include "../../jsimd.h"
#include "../../jdct.h"
#include "../../jsimddct.h"
#include "../jsimd.h"
#include "align.h"
#include <arm_neon.h>
/* YCbCr -> RGB conversion constants */
#define F_0_344 11277 /* 0.3441467 = 11277 * 2^-15 */
#define F_0_714 23401 /* 0.7141418 = 23401 * 2^-15 */
#define F_1_402 22971 /* 1.4020386 = 22971 * 2^-14 */
#define F_1_772 29033 /* 1.7720337 = 29033 * 2^-14 */
ALIGN(16) static const int16_t jsimd_ycc_rgb_convert_neon_consts[] = {
-F_0_344, F_0_714, F_1_402, F_1_772
};
/* Include inline routines for colorspace extensions. */
#include "jdcolext-neon.c"
#undef RGB_RED
#undef RGB_GREEN
#undef RGB_BLUE
#undef RGB_PIXELSIZE
#define RGB_RED EXT_RGB_RED
#define RGB_GREEN EXT_RGB_GREEN
#define RGB_BLUE EXT_RGB_BLUE
#define RGB_PIXELSIZE EXT_RGB_PIXELSIZE
#define jsimd_ycc_rgb_convert_neon jsimd_ycc_extrgb_convert_neon
#include "jdcolext-neon.c"
#undef RGB_RED
#undef RGB_GREEN
#undef RGB_BLUE
#undef RGB_PIXELSIZE
#undef jsimd_ycc_rgb_convert_neon
#define RGB_RED EXT_RGBX_RED
#define RGB_GREEN EXT_RGBX_GREEN
#define RGB_BLUE EXT_RGBX_BLUE
#define RGB_ALPHA 3
#define RGB_PIXELSIZE EXT_RGBX_PIXELSIZE
#define jsimd_ycc_rgb_convert_neon jsimd_ycc_extrgbx_convert_neon
#include "jdcolext-neon.c"
#undef RGB_RED
#undef RGB_GREEN
#undef RGB_BLUE
#undef RGB_ALPHA
#undef RGB_PIXELSIZE
#undef jsimd_ycc_rgb_convert_neon
#define RGB_RED EXT_BGR_RED
#define RGB_GREEN EXT_BGR_GREEN
#define RGB_BLUE EXT_BGR_BLUE
#define RGB_PIXELSIZE EXT_BGR_PIXELSIZE
#define jsimd_ycc_rgb_convert_neon jsimd_ycc_extbgr_convert_neon
#include "jdcolext-neon.c"
#undef RGB_RED
#undef RGB_GREEN
#undef RGB_BLUE
#undef RGB_PIXELSIZE
#undef jsimd_ycc_rgb_convert_neon
#define RGB_RED EXT_BGRX_RED
#define RGB_GREEN EXT_BGRX_GREEN
#define RGB_BLUE EXT_BGRX_BLUE
#define RGB_ALPHA 3
#define RGB_PIXELSIZE EXT_BGRX_PIXELSIZE
#define jsimd_ycc_rgb_convert_neon jsimd_ycc_extbgrx_convert_neon
#include "jdcolext-neon.c"
#undef RGB_RED
#undef RGB_GREEN
#undef RGB_BLUE
#undef RGB_ALPHA
#undef RGB_PIXELSIZE
#undef jsimd_ycc_rgb_convert_neon
#define RGB_RED EXT_XBGR_RED
#define RGB_GREEN EXT_XBGR_GREEN
#define RGB_BLUE EXT_XBGR_BLUE
#define RGB_ALPHA 0
#define RGB_PIXELSIZE EXT_XBGR_PIXELSIZE
#define jsimd_ycc_rgb_convert_neon jsimd_ycc_extxbgr_convert_neon
#include "jdcolext-neon.c"
#undef RGB_RED
#undef RGB_GREEN
#undef RGB_BLUE
#undef RGB_ALPHA
#undef RGB_PIXELSIZE
#undef jsimd_ycc_rgb_convert_neon
#define RGB_RED EXT_XRGB_RED
#define RGB_GREEN EXT_XRGB_GREEN
#define RGB_BLUE EXT_XRGB_BLUE
#define RGB_ALPHA 0
#define RGB_PIXELSIZE EXT_XRGB_PIXELSIZE
#define jsimd_ycc_rgb_convert_neon jsimd_ycc_extxrgb_convert_neon
#include "jdcolext-neon.c"
#undef RGB_RED
#undef RGB_GREEN
#undef RGB_BLUE
#undef RGB_ALPHA
#undef RGB_PIXELSIZE
#undef jsimd_ycc_rgb_convert_neon
/* YCbCr -> RGB565 Conversion */
#define RGB_PIXELSIZE 2
#define jsimd_ycc_rgb_convert_neon jsimd_ycc_rgb565_convert_neon
#include "jdcolext-neon.c"
#undef RGB_PIXELSIZE
#undef jsimd_ycc_rgb_convert_neon

144
simd/arm/jdmerge-neon.c Normal file

@@ -0,0 +1,144 @@
/*
* jdmerge-neon.c - merged upsampling/color conversion (Arm Neon)
*
* Copyright (C) 2020, Arm Limited. All Rights Reserved.
*
* This software is provided 'as-is', without any express or implied
* warranty. In no event will the authors be held liable for any damages
* arising from the use of this software.
*
* Permission is granted to anyone to use this software for any purpose,
* including commercial applications, and to alter it and redistribute it
* freely, subject to the following restrictions:
*
* 1. The origin of this software must not be misrepresented; you must not
* claim that you wrote the original software. If you use this software
* in a product, an acknowledgment in the product documentation would be
* appreciated but is not required.
* 2. Altered source versions must be plainly marked as such, and must not be
* misrepresented as being the original software.
* 3. This notice may not be removed or altered from any source distribution.
*/
#define JPEG_INTERNALS
#include "../../jinclude.h"
#include "../../jpeglib.h"
#include "../../jsimd.h"
#include "../../jdct.h"
#include "../../jsimddct.h"
#include "../jsimd.h"
#include "align.h"
#include <arm_neon.h>
/* YCbCr -> RGB conversion constants */
#define F_0_344 11277 /* 0.3441467 = 11277 * 2^-15 */
#define F_0_714 23401 /* 0.7141418 = 23401 * 2^-15 */
#define F_1_402 22971 /* 1.4020386 = 22971 * 2^-14 */
#define F_1_772 29033 /* 1.7720337 = 29033 * 2^-14 */
ALIGN(16) static const int16_t jsimd_ycc_rgb_convert_neon_consts[] = {
-F_0_344, F_0_714, F_1_402, F_1_772
};
/* Include inline routines for colorspace extensions. */
#include "jdmrgext-neon.c"
#undef RGB_RED
#undef RGB_GREEN
#undef RGB_BLUE
#undef RGB_PIXELSIZE
#define RGB_RED EXT_RGB_RED
#define RGB_GREEN EXT_RGB_GREEN
#define RGB_BLUE EXT_RGB_BLUE
#define RGB_PIXELSIZE EXT_RGB_PIXELSIZE
#define jsimd_h2v1_merged_upsample_neon jsimd_h2v1_extrgb_merged_upsample_neon
#define jsimd_h2v2_merged_upsample_neon jsimd_h2v2_extrgb_merged_upsample_neon
#include "jdmrgext-neon.c"
#undef RGB_RED
#undef RGB_GREEN
#undef RGB_BLUE
#undef RGB_PIXELSIZE
#undef jsimd_h2v1_merged_upsample_neon
#undef jsimd_h2v2_merged_upsample_neon
#define RGB_RED EXT_RGBX_RED
#define RGB_GREEN EXT_RGBX_GREEN
#define RGB_BLUE EXT_RGBX_BLUE
#define RGB_ALPHA 3
#define RGB_PIXELSIZE EXT_RGBX_PIXELSIZE
#define jsimd_h2v1_merged_upsample_neon jsimd_h2v1_extrgbx_merged_upsample_neon
#define jsimd_h2v2_merged_upsample_neon jsimd_h2v2_extrgbx_merged_upsample_neon
#include "jdmrgext-neon.c"
#undef RGB_RED
#undef RGB_GREEN
#undef RGB_BLUE
#undef RGB_ALPHA
#undef RGB_PIXELSIZE
#undef jsimd_h2v1_merged_upsample_neon
#undef jsimd_h2v2_merged_upsample_neon
#define RGB_RED EXT_BGR_RED
#define RGB_GREEN EXT_BGR_GREEN
#define RGB_BLUE EXT_BGR_BLUE
#define RGB_PIXELSIZE EXT_BGR_PIXELSIZE
#define jsimd_h2v1_merged_upsample_neon jsimd_h2v1_extbgr_merged_upsample_neon
#define jsimd_h2v2_merged_upsample_neon jsimd_h2v2_extbgr_merged_upsample_neon
#include "jdmrgext-neon.c"
#undef RGB_RED
#undef RGB_GREEN
#undef RGB_BLUE
#undef RGB_PIXELSIZE
#undef jsimd_h2v1_merged_upsample_neon
#undef jsimd_h2v2_merged_upsample_neon
#define RGB_RED EXT_BGRX_RED
#define RGB_GREEN EXT_BGRX_GREEN
#define RGB_BLUE EXT_BGRX_BLUE
#define RGB_ALPHA 3
#define RGB_PIXELSIZE EXT_BGRX_PIXELSIZE
#define jsimd_h2v1_merged_upsample_neon jsimd_h2v1_extbgrx_merged_upsample_neon
#define jsimd_h2v2_merged_upsample_neon jsimd_h2v2_extbgrx_merged_upsample_neon
#include "jdmrgext-neon.c"
#undef RGB_RED
#undef RGB_GREEN
#undef RGB_BLUE
#undef RGB_ALPHA
#undef RGB_PIXELSIZE
#undef jsimd_h2v1_merged_upsample_neon
#undef jsimd_h2v2_merged_upsample_neon
#define RGB_RED EXT_XBGR_RED
#define RGB_GREEN EXT_XBGR_GREEN
#define RGB_BLUE EXT_XBGR_BLUE
#define RGB_ALPHA 0
#define RGB_PIXELSIZE EXT_XBGR_PIXELSIZE
#define jsimd_h2v1_merged_upsample_neon jsimd_h2v1_extxbgr_merged_upsample_neon
#define jsimd_h2v2_merged_upsample_neon jsimd_h2v2_extxbgr_merged_upsample_neon
#include "jdmrgext-neon.c"
#undef RGB_RED
#undef RGB_GREEN
#undef RGB_BLUE
#undef RGB_ALPHA
#undef RGB_PIXELSIZE
#undef jsimd_h2v1_merged_upsample_neon
#undef jsimd_h2v2_merged_upsample_neon
#define RGB_RED EXT_XRGB_RED
#define RGB_GREEN EXT_XRGB_GREEN
#define RGB_BLUE EXT_XRGB_BLUE
#define RGB_ALPHA 0
#define RGB_PIXELSIZE EXT_XRGB_PIXELSIZE
#define jsimd_h2v1_merged_upsample_neon jsimd_h2v1_extxrgb_merged_upsample_neon
#define jsimd_h2v2_merged_upsample_neon jsimd_h2v2_extxrgb_merged_upsample_neon
#include "jdmrgext-neon.c"
#undef RGB_RED
#undef RGB_GREEN
#undef RGB_BLUE
#undef RGB_ALPHA
#undef RGB_PIXELSIZE
#undef jsimd_h2v1_merged_upsample_neon

667
simd/arm/jdmrgext-neon.c Normal file

@@ -0,0 +1,667 @@
/*
* jdmrgext-neon.c - merged upsampling/color conversion (Arm Neon)
*
* Copyright (C) 2020, Arm Limited. All Rights Reserved.
* Copyright (C) 2020, D. R. Commander. All Rights Reserved.
*
* This software is provided 'as-is', without any express or implied
* warranty. In no event will the authors be held liable for any damages
* arising from the use of this software.
*
* Permission is granted to anyone to use this software for any purpose,
* including commercial applications, and to alter it and redistribute it
* freely, subject to the following restrictions:
*
* 1. The origin of this software must not be misrepresented; you must not
* claim that you wrote the original software. If you use this software
* in a product, an acknowledgment in the product documentation would be
* appreciated but is not required.
* 2. Altered source versions must be plainly marked as such, and must not be
* misrepresented as being the original software.
* 3. This notice may not be removed or altered from any source distribution.
*/
/* This file is included by jdmerge-neon.c. */
/* These routines combine simple (non-fancy, i.e. non-smooth) h2v1 or h2v2
* chroma upsampling and YCbCr -> RGB color conversion into a single function.
*
* As with the standalone functions, YCbCr -> RGB conversion is defined by the
* following equations:
* R = Y + 1.40200 * (Cr - 128)
* G = Y - 0.34414 * (Cb - 128) - 0.71414 * (Cr - 128)
* B = Y + 1.77200 * (Cb - 128)
*
* Scaled integer constants are used to avoid floating-point arithmetic:
* 0.3441467 = 11277 * 2^-15
* 0.7141418 = 23401 * 2^-15
* 1.4020386 = 22971 * 2^-14
* 1.7720337 = 29033 * 2^-14
* These constants are defined in jdmerge-neon.c.
*
* To ensure correct results, rounding is used when descaling.
*/
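/* (For example, with simple h2v1 upsampling, output pixels 2i and 2i + 1 share
 * chroma sample i, so
 *   R[2i]     = Y[2i]     + 1.40200 * (Cr[i] - 128)
 *   R[2i + 1] = Y[2i + 1] + 1.40200 * (Cr[i] - 128)
 * and likewise for G and B.  With h2v2 upsampling, each chroma sample is
 * additionally shared between two output rows.)
 */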
/* Notes on safe memory access for merged upsampling/YCbCr -> RGB conversion
* routines:
*
* Input memory buffers can be safely overread up to the next multiple of
* ALIGN_SIZE bytes, since they are always allocated by alloc_sarray() in
* jmemmgr.c.
*
* The output buffer cannot safely be written beyond output_width, since
* output_buf points to a possibly unpadded row in the decompressed image
* buffer allocated by the calling program.
*/
/* Upsample and color convert for the case of 2:1 horizontal and 1:1 vertical.
*/
void jsimd_h2v1_merged_upsample_neon(JDIMENSION output_width,
JSAMPIMAGE input_buf,
JDIMENSION in_row_group_ctr,
JSAMPARRAY output_buf)
{
JSAMPROW outptr;
/* Pointers to Y, Cb, and Cr data */
JSAMPROW inptr0, inptr1, inptr2;
const int16x4_t consts = vld1_s16(jsimd_ycc_rgb_convert_neon_consts);
const int16x8_t neg_128 = vdupq_n_s16(-128);
inptr0 = input_buf[0][in_row_group_ctr];
inptr1 = input_buf[1][in_row_group_ctr];
inptr2 = input_buf[2][in_row_group_ctr];
outptr = output_buf[0];
int cols_remaining = output_width;
for (; cols_remaining >= 16; cols_remaining -= 16) {
/* De-interleave Y component values into two separate vectors, one
* containing the component values with even-numbered indices and one
* containing the component values with odd-numbered indices.
*/
uint8x8x2_t y = vld2_u8(inptr0);
uint8x8_t cb = vld1_u8(inptr1);
uint8x8_t cr = vld1_u8(inptr2);
/* Subtract 128 from Cb and Cr. */
int16x8_t cr_128 =
vreinterpretq_s16_u16(vaddw_u8(vreinterpretq_u16_s16(neg_128), cr));
int16x8_t cb_128 =
vreinterpretq_s16_u16(vaddw_u8(vreinterpretq_u16_s16(neg_128), cb));
/* Compute G-Y: - 0.34414 * (Cb - 128) - 0.71414 * (Cr - 128) */
int32x4_t g_sub_y_l = vmull_lane_s16(vget_low_s16(cb_128), consts, 0);
int32x4_t g_sub_y_h = vmull_lane_s16(vget_high_s16(cb_128), consts, 0);
g_sub_y_l = vmlsl_lane_s16(g_sub_y_l, vget_low_s16(cr_128), consts, 1);
g_sub_y_h = vmlsl_lane_s16(g_sub_y_h, vget_high_s16(cr_128), consts, 1);
/* Descale G components: shift right 15, round, and narrow to 16-bit. */
int16x8_t g_sub_y = vcombine_s16(vrshrn_n_s32(g_sub_y_l, 15),
vrshrn_n_s32(g_sub_y_h, 15));
/* Compute R-Y: 1.40200 * (Cr - 128) */
int16x8_t r_sub_y = vqrdmulhq_lane_s16(vshlq_n_s16(cr_128, 1), consts, 2);
/* Compute B-Y: 1.77200 * (Cb - 128) */
int16x8_t b_sub_y = vqrdmulhq_lane_s16(vshlq_n_s16(cb_128, 1), consts, 3);
/* Add the chroma-derived values (G-Y, R-Y, and B-Y) to both the "even" and
* "odd" Y component values. This effectively upsamples the chroma
* components horizontally.
*/
int16x8_t g_even =
vreinterpretq_s16_u16(vaddw_u8(vreinterpretq_u16_s16(g_sub_y),
y.val[0]));
int16x8_t r_even =
vreinterpretq_s16_u16(vaddw_u8(vreinterpretq_u16_s16(r_sub_y),
y.val[0]));
int16x8_t b_even =
vreinterpretq_s16_u16(vaddw_u8(vreinterpretq_u16_s16(b_sub_y),
y.val[0]));
int16x8_t g_odd =
vreinterpretq_s16_u16(vaddw_u8(vreinterpretq_u16_s16(g_sub_y),
y.val[1]));
int16x8_t r_odd =
vreinterpretq_s16_u16(vaddw_u8(vreinterpretq_u16_s16(r_sub_y),
y.val[1]));
int16x8_t b_odd =
vreinterpretq_s16_u16(vaddw_u8(vreinterpretq_u16_s16(b_sub_y),
y.val[1]));
/* Convert each component to unsigned and narrow, clamping to [0-255].
* Re-interleave the "even" and "odd" component values.
*/
uint8x8x2_t r = vzip_u8(vqmovun_s16(r_even), vqmovun_s16(r_odd));
uint8x8x2_t g = vzip_u8(vqmovun_s16(g_even), vqmovun_s16(g_odd));
uint8x8x2_t b = vzip_u8(vqmovun_s16(b_even), vqmovun_s16(b_odd));
#ifdef RGB_ALPHA
uint8x16x4_t rgba;
rgba.val[RGB_RED] = vcombine_u8(r.val[0], r.val[1]);
rgba.val[RGB_GREEN] = vcombine_u8(g.val[0], g.val[1]);
rgba.val[RGB_BLUE] = vcombine_u8(b.val[0], b.val[1]);
/* Set alpha channel to opaque (0xFF). */
rgba.val[RGB_ALPHA] = vdupq_n_u8(0xFF);
/* Store RGBA pixel data to memory. */
vst4q_u8(outptr, rgba);
#else
uint8x16x3_t rgb;
rgb.val[RGB_RED] = vcombine_u8(r.val[0], r.val[1]);
rgb.val[RGB_GREEN] = vcombine_u8(g.val[0], g.val[1]);
rgb.val[RGB_BLUE] = vcombine_u8(b.val[0], b.val[1]);
/* Store RGB pixel data to memory. */
vst3q_u8(outptr, rgb);
#endif
/* Increment pointers. */
inptr0 += 16;
inptr1 += 8;
inptr2 += 8;
outptr += (RGB_PIXELSIZE * 16);
}
if (cols_remaining > 0) {
/* De-interleave Y component values into two separate vectors, one
* containing the component values with even-numbered indices and one
* containing the component values with odd-numbered indices.
*/
uint8x8x2_t y = vld2_u8(inptr0);
uint8x8_t cb = vld1_u8(inptr1);
uint8x8_t cr = vld1_u8(inptr2);
/* Subtract 128 from Cb and Cr. */
int16x8_t cr_128 =
vreinterpretq_s16_u16(vaddw_u8(vreinterpretq_u16_s16(neg_128), cr));
int16x8_t cb_128 =
vreinterpretq_s16_u16(vaddw_u8(vreinterpretq_u16_s16(neg_128), cb));
/* Compute G-Y: - 0.34414 * (Cb - 128) - 0.71414 * (Cr - 128) */
int32x4_t g_sub_y_l = vmull_lane_s16(vget_low_s16(cb_128), consts, 0);
int32x4_t g_sub_y_h = vmull_lane_s16(vget_high_s16(cb_128), consts, 0);
g_sub_y_l = vmlsl_lane_s16(g_sub_y_l, vget_low_s16(cr_128), consts, 1);
g_sub_y_h = vmlsl_lane_s16(g_sub_y_h, vget_high_s16(cr_128), consts, 1);
/* Descale G components: shift right 15, round, and narrow to 16-bit. */
int16x8_t g_sub_y = vcombine_s16(vrshrn_n_s32(g_sub_y_l, 15),
vrshrn_n_s32(g_sub_y_h, 15));
/* Compute R-Y: 1.40200 * (Cr - 128) */
int16x8_t r_sub_y = vqrdmulhq_lane_s16(vshlq_n_s16(cr_128, 1), consts, 2);
/* Compute B-Y: 1.77200 * (Cb - 128) */
int16x8_t b_sub_y = vqrdmulhq_lane_s16(vshlq_n_s16(cb_128, 1), consts, 3);
/* Add the chroma-derived values (G-Y, R-Y, and B-Y) to both the "even" and
* "odd" Y component values. This effectively upsamples the chroma
* components horizontally.
*/
int16x8_t g_even =
vreinterpretq_s16_u16(vaddw_u8(vreinterpretq_u16_s16(g_sub_y),
y.val[0]));
int16x8_t r_even =
vreinterpretq_s16_u16(vaddw_u8(vreinterpretq_u16_s16(r_sub_y),
y.val[0]));
int16x8_t b_even =
vreinterpretq_s16_u16(vaddw_u8(vreinterpretq_u16_s16(b_sub_y),
y.val[0]));
int16x8_t g_odd =
vreinterpretq_s16_u16(vaddw_u8(vreinterpretq_u16_s16(g_sub_y),
y.val[1]));
int16x8_t r_odd =
vreinterpretq_s16_u16(vaddw_u8(vreinterpretq_u16_s16(r_sub_y),
y.val[1]));
int16x8_t b_odd =
vreinterpretq_s16_u16(vaddw_u8(vreinterpretq_u16_s16(b_sub_y),
y.val[1]));
/* Convert each component to unsigned and narrow, clamping to [0-255].
* Re-interleave the "even" and "odd" component values.
*/
uint8x8x2_t r = vzip_u8(vqmovun_s16(r_even), vqmovun_s16(r_odd));
uint8x8x2_t g = vzip_u8(vqmovun_s16(g_even), vqmovun_s16(g_odd));
uint8x8x2_t b = vzip_u8(vqmovun_s16(b_even), vqmovun_s16(b_odd));
#ifdef RGB_ALPHA
uint8x8x4_t rgba_h;
rgba_h.val[RGB_RED] = r.val[1];
rgba_h.val[RGB_GREEN] = g.val[1];
rgba_h.val[RGB_BLUE] = b.val[1];
/* Set alpha channel to opaque (0xFF). */
rgba_h.val[RGB_ALPHA] = vdup_n_u8(0xFF);
uint8x8x4_t rgba_l;
rgba_l.val[RGB_RED] = r.val[0];
rgba_l.val[RGB_GREEN] = g.val[0];
rgba_l.val[RGB_BLUE] = b.val[0];
/* Set alpha channel to opaque (0xFF). */
rgba_l.val[RGB_ALPHA] = vdup_n_u8(0xFF);
/* Store RGBA pixel data to memory. */
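/* (Cases 15-9 below intentionally fall through, storing pixels 14 down to 8
 * lane by lane from the upper-half vectors; case 8 then stores pixels 0-7 with
 * a single vst4_u8() and breaks.  Cases 7-1 likewise fall through, storing
 * pixels 6 down to 0 from the lower-half vectors.)
 */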
switch (cols_remaining) {
case 15:
vst4_lane_u8(outptr + 14 * RGB_PIXELSIZE, rgba_h, 6);
case 14:
vst4_lane_u8(outptr + 13 * RGB_PIXELSIZE, rgba_h, 5);
case 13:
vst4_lane_u8(outptr + 12 * RGB_PIXELSIZE, rgba_h, 4);
case 12:
vst4_lane_u8(outptr + 11 * RGB_PIXELSIZE, rgba_h, 3);
case 11:
vst4_lane_u8(outptr + 10 * RGB_PIXELSIZE, rgba_h, 2);
case 10:
vst4_lane_u8(outptr + 9 * RGB_PIXELSIZE, rgba_h, 1);
case 9:
vst4_lane_u8(outptr + 8 * RGB_PIXELSIZE, rgba_h, 0);
case 8:
vst4_u8(outptr, rgba_l);
break;
case 7:
vst4_lane_u8(outptr + 6 * RGB_PIXELSIZE, rgba_l, 6);
case 6:
vst4_lane_u8(outptr + 5 * RGB_PIXELSIZE, rgba_l, 5);
case 5:
vst4_lane_u8(outptr + 4 * RGB_PIXELSIZE, rgba_l, 4);
case 4:
vst4_lane_u8(outptr + 3 * RGB_PIXELSIZE, rgba_l, 3);
case 3:
vst4_lane_u8(outptr + 2 * RGB_PIXELSIZE, rgba_l, 2);
case 2:
vst4_lane_u8(outptr + RGB_PIXELSIZE, rgba_l, 1);
case 1:
vst4_lane_u8(outptr, rgba_l, 0);
default:
break;
}
#else
uint8x8x3_t rgb_h;
rgb_h.val[RGB_RED] = r.val[1];
rgb_h.val[RGB_GREEN] = g.val[1];
rgb_h.val[RGB_BLUE] = b.val[1];
uint8x8x3_t rgb_l;
rgb_l.val[RGB_RED] = r.val[0];
rgb_l.val[RGB_GREEN] = g.val[0];
rgb_l.val[RGB_BLUE] = b.val[0];
/* Store RGB pixel data to memory. */
switch (cols_remaining) {
case 15:
vst3_lane_u8(outptr + 14 * RGB_PIXELSIZE, rgb_h, 6);
case 14:
vst3_lane_u8(outptr + 13 * RGB_PIXELSIZE, rgb_h, 5);
case 13:
vst3_lane_u8(outptr + 12 * RGB_PIXELSIZE, rgb_h, 4);
case 12:
vst3_lane_u8(outptr + 11 * RGB_PIXELSIZE, rgb_h, 3);
case 11:
vst3_lane_u8(outptr + 10 * RGB_PIXELSIZE, rgb_h, 2);
case 10:
vst3_lane_u8(outptr + 9 * RGB_PIXELSIZE, rgb_h, 1);
case 9:
vst3_lane_u8(outptr + 8 * RGB_PIXELSIZE, rgb_h, 0);
case 8:
vst3_u8(outptr, rgb_l);
break;
case 7:
vst3_lane_u8(outptr + 6 * RGB_PIXELSIZE, rgb_l, 6);
case 6:
vst3_lane_u8(outptr + 5 * RGB_PIXELSIZE, rgb_l, 5);
case 5:
vst3_lane_u8(outptr + 4 * RGB_PIXELSIZE, rgb_l, 4);
case 4:
vst3_lane_u8(outptr + 3 * RGB_PIXELSIZE, rgb_l, 3);
case 3:
vst3_lane_u8(outptr + 2 * RGB_PIXELSIZE, rgb_l, 2);
case 2:
vst3_lane_u8(outptr + RGB_PIXELSIZE, rgb_l, 1);
case 1:
vst3_lane_u8(outptr, rgb_l, 0);
default:
break;
}
#endif
}
}
/* Upsample and color convert for the case of 2:1 horizontal and 2:1 vertical.
*
* See comments above for details regarding color conversion and safe memory
* access.
*/
void jsimd_h2v2_merged_upsample_neon(JDIMENSION output_width,
JSAMPIMAGE input_buf,
JDIMENSION in_row_group_ctr,
JSAMPARRAY output_buf)
{
JSAMPROW outptr0, outptr1;
/* Pointers to Y (both rows), Cb, and Cr data */
JSAMPROW inptr0_0, inptr0_1, inptr1, inptr2;
const int16x4_t consts = vld1_s16(jsimd_ycc_rgb_convert_neon_consts);
const int16x8_t neg_128 = vdupq_n_s16(-128);
inptr0_0 = input_buf[0][in_row_group_ctr * 2];
inptr0_1 = input_buf[0][in_row_group_ctr * 2 + 1];
inptr1 = input_buf[1][in_row_group_ctr];
inptr2 = input_buf[2][in_row_group_ctr];
outptr0 = output_buf[0];
outptr1 = output_buf[1];
int cols_remaining = output_width;
for (; cols_remaining >= 16; cols_remaining -= 16) {
/* For each row, de-interleave Y component values into two separate
* vectors, one containing the component values with even-numbered indices
* and one containing the component values with odd-numbered indices.
*/
uint8x8x2_t y0 = vld2_u8(inptr0_0);
uint8x8x2_t y1 = vld2_u8(inptr0_1);
uint8x8_t cb = vld1_u8(inptr1);
uint8x8_t cr = vld1_u8(inptr2);
/* Subtract 128 from Cb and Cr. */
int16x8_t cr_128 =
vreinterpretq_s16_u16(vaddw_u8(vreinterpretq_u16_s16(neg_128), cr));
int16x8_t cb_128 =
vreinterpretq_s16_u16(vaddw_u8(vreinterpretq_u16_s16(neg_128), cb));
/* Compute G-Y: - 0.34414 * (Cb - 128) - 0.71414 * (Cr - 128) */
int32x4_t g_sub_y_l = vmull_lane_s16(vget_low_s16(cb_128), consts, 0);
int32x4_t g_sub_y_h = vmull_lane_s16(vget_high_s16(cb_128), consts, 0);
g_sub_y_l = vmlsl_lane_s16(g_sub_y_l, vget_low_s16(cr_128), consts, 1);
g_sub_y_h = vmlsl_lane_s16(g_sub_y_h, vget_high_s16(cr_128), consts, 1);
/* Descale G components: shift right 15, round, and narrow to 16-bit. */
int16x8_t g_sub_y = vcombine_s16(vrshrn_n_s32(g_sub_y_l, 15),
vrshrn_n_s32(g_sub_y_h, 15));
/* Compute R-Y: 1.40200 * (Cr - 128) */
int16x8_t r_sub_y = vqrdmulhq_lane_s16(vshlq_n_s16(cr_128, 1), consts, 2);
/* Compute B-Y: 1.77200 * (Cb - 128) */
int16x8_t b_sub_y = vqrdmulhq_lane_s16(vshlq_n_s16(cb_128, 1), consts, 3);
/* For each row, add the chroma-derived values (G-Y, R-Y, and B-Y) to both
* the "even" and "odd" Y component values. This effectively upsamples the
* chroma components both horizontally and vertically.
*/
int16x8_t g0_even =
vreinterpretq_s16_u16(vaddw_u8(vreinterpretq_u16_s16(g_sub_y),
y0.val[0]));
int16x8_t r0_even =
vreinterpretq_s16_u16(vaddw_u8(vreinterpretq_u16_s16(r_sub_y),
y0.val[0]));
int16x8_t b0_even =
vreinterpretq_s16_u16(vaddw_u8(vreinterpretq_u16_s16(b_sub_y),
y0.val[0]));
int16x8_t g0_odd =
vreinterpretq_s16_u16(vaddw_u8(vreinterpretq_u16_s16(g_sub_y),
y0.val[1]));
int16x8_t r0_odd =
vreinterpretq_s16_u16(vaddw_u8(vreinterpretq_u16_s16(r_sub_y),
y0.val[1]));
int16x8_t b0_odd =
vreinterpretq_s16_u16(vaddw_u8(vreinterpretq_u16_s16(b_sub_y),
y0.val[1]));
int16x8_t g1_even =
vreinterpretq_s16_u16(vaddw_u8(vreinterpretq_u16_s16(g_sub_y),
y1.val[0]));
int16x8_t r1_even =
vreinterpretq_s16_u16(vaddw_u8(vreinterpretq_u16_s16(r_sub_y),
y1.val[0]));
int16x8_t b1_even =
vreinterpretq_s16_u16(vaddw_u8(vreinterpretq_u16_s16(b_sub_y),
y1.val[0]));
int16x8_t g1_odd =
vreinterpretq_s16_u16(vaddw_u8(vreinterpretq_u16_s16(g_sub_y),
y1.val[1]));
int16x8_t r1_odd =
vreinterpretq_s16_u16(vaddw_u8(vreinterpretq_u16_s16(r_sub_y),
y1.val[1]));
int16x8_t b1_odd =
vreinterpretq_s16_u16(vaddw_u8(vreinterpretq_u16_s16(b_sub_y),
y1.val[1]));
/* Convert each component to unsigned and narrow, clamping to [0-255].
* Re-interleave the "even" and "odd" component values.
*/
uint8x8x2_t r0 = vzip_u8(vqmovun_s16(r0_even), vqmovun_s16(r0_odd));
uint8x8x2_t r1 = vzip_u8(vqmovun_s16(r1_even), vqmovun_s16(r1_odd));
uint8x8x2_t g0 = vzip_u8(vqmovun_s16(g0_even), vqmovun_s16(g0_odd));
uint8x8x2_t g1 = vzip_u8(vqmovun_s16(g1_even), vqmovun_s16(g1_odd));
uint8x8x2_t b0 = vzip_u8(vqmovun_s16(b0_even), vqmovun_s16(b0_odd));
uint8x8x2_t b1 = vzip_u8(vqmovun_s16(b1_even), vqmovun_s16(b1_odd));
#ifdef RGB_ALPHA
uint8x16x4_t rgba0, rgba1;
rgba0.val[RGB_RED] = vcombine_u8(r0.val[0], r0.val[1]);
rgba1.val[RGB_RED] = vcombine_u8(r1.val[0], r1.val[1]);
rgba0.val[RGB_GREEN] = vcombine_u8(g0.val[0], g0.val[1]);
rgba1.val[RGB_GREEN] = vcombine_u8(g1.val[0], g1.val[1]);
rgba0.val[RGB_BLUE] = vcombine_u8(b0.val[0], b0.val[1]);
rgba1.val[RGB_BLUE] = vcombine_u8(b1.val[0], b1.val[1]);
/* Set alpha channel to opaque (0xFF). */
rgba0.val[RGB_ALPHA] = vdupq_n_u8(0xFF);
rgba1.val[RGB_ALPHA] = vdupq_n_u8(0xFF);
/* Store RGBA pixel data to memory. */
vst4q_u8(outptr0, rgba0);
vst4q_u8(outptr1, rgba1);
#else
uint8x16x3_t rgb0, rgb1;
rgb0.val[RGB_RED] = vcombine_u8(r0.val[0], r0.val[1]);
rgb1.val[RGB_RED] = vcombine_u8(r1.val[0], r1.val[1]);
rgb0.val[RGB_GREEN] = vcombine_u8(g0.val[0], g0.val[1]);
rgb1.val[RGB_GREEN] = vcombine_u8(g1.val[0], g1.val[1]);
rgb0.val[RGB_BLUE] = vcombine_u8(b0.val[0], b0.val[1]);
rgb1.val[RGB_BLUE] = vcombine_u8(b1.val[0], b1.val[1]);
/* Store RGB pixel data to memory. */
vst3q_u8(outptr0, rgb0);
vst3q_u8(outptr1, rgb1);
#endif
/* Increment pointers. */
inptr0_0 += 16;
inptr0_1 += 16;
inptr1 += 8;
inptr2 += 8;
outptr0 += (RGB_PIXELSIZE * 16);
outptr1 += (RGB_PIXELSIZE * 16);
}
if (cols_remaining > 0) {
/* For each row, de-interleave Y component values into two separate
* vectors, one containing the component values with even-numbered indices
* and one containing the component values with odd-numbered indices.
*/
uint8x8x2_t y0 = vld2_u8(inptr0_0);
uint8x8x2_t y1 = vld2_u8(inptr0_1);
uint8x8_t cb = vld1_u8(inptr1);
uint8x8_t cr = vld1_u8(inptr2);
/* Subtract 128 from Cb and Cr. */
int16x8_t cr_128 =
vreinterpretq_s16_u16(vaddw_u8(vreinterpretq_u16_s16(neg_128), cr));
int16x8_t cb_128 =
vreinterpretq_s16_u16(vaddw_u8(vreinterpretq_u16_s16(neg_128), cb));
/* Compute G-Y: - 0.34414 * (Cb - 128) - 0.71414 * (Cr - 128) */
int32x4_t g_sub_y_l = vmull_lane_s16(vget_low_s16(cb_128), consts, 0);
int32x4_t g_sub_y_h = vmull_lane_s16(vget_high_s16(cb_128), consts, 0);
g_sub_y_l = vmlsl_lane_s16(g_sub_y_l, vget_low_s16(cr_128), consts, 1);
g_sub_y_h = vmlsl_lane_s16(g_sub_y_h, vget_high_s16(cr_128), consts, 1);
/* Descale G components: shift right 15, round, and narrow to 16-bit. */
int16x8_t g_sub_y = vcombine_s16(vrshrn_n_s32(g_sub_y_l, 15),
vrshrn_n_s32(g_sub_y_h, 15));
/* Compute R-Y: 1.40200 * (Cr - 128) */
int16x8_t r_sub_y = vqrdmulhq_lane_s16(vshlq_n_s16(cr_128, 1), consts, 2);
/* Compute B-Y: 1.77200 * (Cb - 128) */
int16x8_t b_sub_y = vqrdmulhq_lane_s16(vshlq_n_s16(cb_128, 1), consts, 3);
/* For each row, add the chroma-derived values (G-Y, R-Y, and B-Y) to both
* the "even" and "odd" Y component values. This effectively upsamples the
* chroma components both horizontally and vertically.
*/
int16x8_t g0_even =
vreinterpretq_s16_u16(vaddw_u8(vreinterpretq_u16_s16(g_sub_y),
y0.val[0]));
int16x8_t r0_even =
vreinterpretq_s16_u16(vaddw_u8(vreinterpretq_u16_s16(r_sub_y),
y0.val[0]));
int16x8_t b0_even =
vreinterpretq_s16_u16(vaddw_u8(vreinterpretq_u16_s16(b_sub_y),
y0.val[0]));
int16x8_t g0_odd =
vreinterpretq_s16_u16(vaddw_u8(vreinterpretq_u16_s16(g_sub_y),
y0.val[1]));
int16x8_t r0_odd =
vreinterpretq_s16_u16(vaddw_u8(vreinterpretq_u16_s16(r_sub_y),
y0.val[1]));
int16x8_t b0_odd =
vreinterpretq_s16_u16(vaddw_u8(vreinterpretq_u16_s16(b_sub_y),
y0.val[1]));
int16x8_t g1_even =
vreinterpretq_s16_u16(vaddw_u8(vreinterpretq_u16_s16(g_sub_y),
y1.val[0]));
int16x8_t r1_even =
vreinterpretq_s16_u16(vaddw_u8(vreinterpretq_u16_s16(r_sub_y),
y1.val[0]));
int16x8_t b1_even =
vreinterpretq_s16_u16(vaddw_u8(vreinterpretq_u16_s16(b_sub_y),
y1.val[0]));
int16x8_t g1_odd =
vreinterpretq_s16_u16(vaddw_u8(vreinterpretq_u16_s16(g_sub_y),
y1.val[1]));
int16x8_t r1_odd =
vreinterpretq_s16_u16(vaddw_u8(vreinterpretq_u16_s16(r_sub_y),
y1.val[1]));
int16x8_t b1_odd =
vreinterpretq_s16_u16(vaddw_u8(vreinterpretq_u16_s16(b_sub_y),
y1.val[1]));
/* Convert each component to unsigned and narrow, clamping to [0-255].
* Re-interleave the "even" and "odd" component values.
*/
uint8x8x2_t r0 = vzip_u8(vqmovun_s16(r0_even), vqmovun_s16(r0_odd));
uint8x8x2_t r1 = vzip_u8(vqmovun_s16(r1_even), vqmovun_s16(r1_odd));
uint8x8x2_t g0 = vzip_u8(vqmovun_s16(g0_even), vqmovun_s16(g0_odd));
uint8x8x2_t g1 = vzip_u8(vqmovun_s16(g1_even), vqmovun_s16(g1_odd));
uint8x8x2_t b0 = vzip_u8(vqmovun_s16(b0_even), vqmovun_s16(b0_odd));
uint8x8x2_t b1 = vzip_u8(vqmovun_s16(b1_even), vqmovun_s16(b1_odd));
#ifdef RGB_ALPHA
uint8x8x4_t rgba0_h, rgba1_h;
rgba0_h.val[RGB_RED] = r0.val[1];
rgba1_h.val[RGB_RED] = r1.val[1];
rgba0_h.val[RGB_GREEN] = g0.val[1];
rgba1_h.val[RGB_GREEN] = g1.val[1];
rgba0_h.val[RGB_BLUE] = b0.val[1];
rgba1_h.val[RGB_BLUE] = b1.val[1];
/* Set alpha channel to opaque (0xFF). */
rgba0_h.val[RGB_ALPHA] = vdup_n_u8(0xFF);
rgba1_h.val[RGB_ALPHA] = vdup_n_u8(0xFF);
uint8x8x4_t rgba0_l, rgba1_l;
rgba0_l.val[RGB_RED] = r0.val[0];
rgba1_l.val[RGB_RED] = r1.val[0];
rgba0_l.val[RGB_GREEN] = g0.val[0];
rgba1_l.val[RGB_GREEN] = g1.val[0];
rgba0_l.val[RGB_BLUE] = b0.val[0];
rgba1_l.val[RGB_BLUE] = b1.val[0];
/* Set alpha channel to opaque (0xFF). */
rgba0_l.val[RGB_ALPHA] = vdup_n_u8(0xFF);
rgba1_l.val[RGB_ALPHA] = vdup_n_u8(0xFF);
/* Store RGBA pixel data to memory. */
switch (cols_remaining) {
case 15:
vst4_lane_u8(outptr0 + 14 * RGB_PIXELSIZE, rgba0_h, 6);
vst4_lane_u8(outptr1 + 14 * RGB_PIXELSIZE, rgba1_h, 6);
case 14:
vst4_lane_u8(outptr0 + 13 * RGB_PIXELSIZE, rgba0_h, 5);
vst4_lane_u8(outptr1 + 13 * RGB_PIXELSIZE, rgba1_h, 5);
case 13:
vst4_lane_u8(outptr0 + 12 * RGB_PIXELSIZE, rgba0_h, 4);
vst4_lane_u8(outptr1 + 12 * RGB_PIXELSIZE, rgba1_h, 4);
case 12:
vst4_lane_u8(outptr0 + 11 * RGB_PIXELSIZE, rgba0_h, 3);
vst4_lane_u8(outptr1 + 11 * RGB_PIXELSIZE, rgba1_h, 3);
case 11:
vst4_lane_u8(outptr0 + 10 * RGB_PIXELSIZE, rgba0_h, 2);
vst4_lane_u8(outptr1 + 10 * RGB_PIXELSIZE, rgba1_h, 2);
case 10:
vst4_lane_u8(outptr0 + 9 * RGB_PIXELSIZE, rgba0_h, 1);
vst4_lane_u8(outptr1 + 9 * RGB_PIXELSIZE, rgba1_h, 1);
case 9:
vst4_lane_u8(outptr0 + 8 * RGB_PIXELSIZE, rgba0_h, 0);
vst4_lane_u8(outptr1 + 8 * RGB_PIXELSIZE, rgba1_h, 0);
case 8:
vst4_u8(outptr0, rgba0_l);
vst4_u8(outptr1, rgba1_l);
break;
case 7:
vst4_lane_u8(outptr0 + 6 * RGB_PIXELSIZE, rgba0_l, 6);
vst4_lane_u8(outptr1 + 6 * RGB_PIXELSIZE, rgba1_l, 6);
case 6:
vst4_lane_u8(outptr0 + 5 * RGB_PIXELSIZE, rgba0_l, 5);
vst4_lane_u8(outptr1 + 5 * RGB_PIXELSIZE, rgba1_l, 5);
case 5:
vst4_lane_u8(outptr0 + 4 * RGB_PIXELSIZE, rgba0_l, 4);
vst4_lane_u8(outptr1 + 4 * RGB_PIXELSIZE, rgba1_l, 4);
case 4:
vst4_lane_u8(outptr0 + 3 * RGB_PIXELSIZE, rgba0_l, 3);
vst4_lane_u8(outptr1 + 3 * RGB_PIXELSIZE, rgba1_l, 3);
case 3:
vst4_lane_u8(outptr0 + 2 * RGB_PIXELSIZE, rgba0_l, 2);
vst4_lane_u8(outptr1 + 2 * RGB_PIXELSIZE, rgba1_l, 2);
case 2:
vst4_lane_u8(outptr0 + 1 * RGB_PIXELSIZE, rgba0_l, 1);
vst4_lane_u8(outptr1 + 1 * RGB_PIXELSIZE, rgba1_l, 1);
case 1:
vst4_lane_u8(outptr0, rgba0_l, 0);
vst4_lane_u8(outptr1, rgba1_l, 0);
default:
break;
}
#else
uint8x8x3_t rgb0_h, rgb1_h;
rgb0_h.val[RGB_RED] = r0.val[1];
rgb1_h.val[RGB_RED] = r1.val[1];
rgb0_h.val[RGB_GREEN] = g0.val[1];
rgb1_h.val[RGB_GREEN] = g1.val[1];
rgb0_h.val[RGB_BLUE] = b0.val[1];
rgb1_h.val[RGB_BLUE] = b1.val[1];
uint8x8x3_t rgb0_l, rgb1_l;
rgb0_l.val[RGB_RED] = r0.val[0];
rgb1_l.val[RGB_RED] = r1.val[0];
rgb0_l.val[RGB_GREEN] = g0.val[0];
rgb1_l.val[RGB_GREEN] = g1.val[0];
rgb0_l.val[RGB_BLUE] = b0.val[0];
rgb1_l.val[RGB_BLUE] = b1.val[0];
/* Store RGB pixel data to memory. */
switch (cols_remaining) {
case 15:
vst3_lane_u8(outptr0 + 14 * RGB_PIXELSIZE, rgb0_h, 6);
vst3_lane_u8(outptr1 + 14 * RGB_PIXELSIZE, rgb1_h, 6);
case 14:
vst3_lane_u8(outptr0 + 13 * RGB_PIXELSIZE, rgb0_h, 5);
vst3_lane_u8(outptr1 + 13 * RGB_PIXELSIZE, rgb1_h, 5);
case 13:
vst3_lane_u8(outptr0 + 12 * RGB_PIXELSIZE, rgb0_h, 4);
vst3_lane_u8(outptr1 + 12 * RGB_PIXELSIZE, rgb1_h, 4);
case 12:
vst3_lane_u8(outptr0 + 11 * RGB_PIXELSIZE, rgb0_h, 3);
vst3_lane_u8(outptr1 + 11 * RGB_PIXELSIZE, rgb1_h, 3);
case 11:
vst3_lane_u8(outptr0 + 10 * RGB_PIXELSIZE, rgb0_h, 2);
vst3_lane_u8(outptr1 + 10 * RGB_PIXELSIZE, rgb1_h, 2);
case 10:
vst3_lane_u8(outptr0 + 9 * RGB_PIXELSIZE, rgb0_h, 1);
vst3_lane_u8(outptr1 + 9 * RGB_PIXELSIZE, rgb1_h, 1);
case 9:
vst3_lane_u8(outptr0 + 8 * RGB_PIXELSIZE, rgb0_h, 0);
vst3_lane_u8(outptr1 + 8 * RGB_PIXELSIZE, rgb1_h, 0);
case 8:
vst3_u8(outptr0, rgb0_l);
vst3_u8(outptr1, rgb1_l);
break;
case 7:
vst3_lane_u8(outptr0 + 6 * RGB_PIXELSIZE, rgb0_l, 6);
vst3_lane_u8(outptr1 + 6 * RGB_PIXELSIZE, rgb1_l, 6);
case 6:
vst3_lane_u8(outptr0 + 5 * RGB_PIXELSIZE, rgb0_l, 5);
vst3_lane_u8(outptr1 + 5 * RGB_PIXELSIZE, rgb1_l, 5);
case 5:
vst3_lane_u8(outptr0 + 4 * RGB_PIXELSIZE, rgb0_l, 4);
vst3_lane_u8(outptr1 + 4 * RGB_PIXELSIZE, rgb1_l, 4);
case 4:
vst3_lane_u8(outptr0 + 3 * RGB_PIXELSIZE, rgb0_l, 3);
vst3_lane_u8(outptr1 + 3 * RGB_PIXELSIZE, rgb1_l, 3);
case 3:
vst3_lane_u8(outptr0 + 2 * RGB_PIXELSIZE, rgb0_l, 2);
vst3_lane_u8(outptr1 + 2 * RGB_PIXELSIZE, rgb1_l, 2);
case 2:
vst3_lane_u8(outptr0 + 1 * RGB_PIXELSIZE, rgb0_l, 1);
vst3_lane_u8(outptr1 + 1 * RGB_PIXELSIZE, rgb1_l, 1);
case 1:
vst3_lane_u8(outptr0, rgb0_l, 0);
vst3_lane_u8(outptr1, rgb1_l, 0);
default:
break;
}
#endif
}
}

569
simd/arm/jdsample-neon.c Normal file

@@ -0,0 +1,569 @@
/*
* jdsample-neon.c - upsampling (Arm Neon)
*
* Copyright (C) 2020, Arm Limited. All Rights Reserved.
* Copyright (C) 2020, D. R. Commander. All Rights Reserved.
*
* This software is provided 'as-is', without any express or implied
* warranty. In no event will the authors be held liable for any damages
* arising from the use of this software.
*
* Permission is granted to anyone to use this software for any purpose,
* including commercial applications, and to alter it and redistribute it
* freely, subject to the following restrictions:
*
* 1. The origin of this software must not be misrepresented; you must not
* claim that you wrote the original software. If you use this software
* in a product, an acknowledgment in the product documentation would be
* appreciated but is not required.
* 2. Altered source versions must be plainly marked as such, and must not be
* misrepresented as being the original software.
* 3. This notice may not be removed or altered from any source distribution.
*/
#define JPEG_INTERNALS
#include "../../jinclude.h"
#include "../../jpeglib.h"
#include "../../jsimd.h"
#include "../../jdct.h"
#include "../../jsimddct.h"
#include "../jsimd.h"
#include <arm_neon.h>
/* The diagram below shows a row of samples produced by h2v1 downsampling.
*
* s0 s1 s2
* +---------+---------+---------+
* | | | |
* | p0 p1 | p2 p3 | p4 p5 |
* | | | |
* +---------+---------+---------+
*
* Samples s0-s2 were created by averaging the original pixel component values
* centered at positions p0-p5 above. To approximate those original pixel
* component values, we proportionally blend the adjacent samples in each row.
*
* An upsampled pixel component value is computed by blending the sample
* containing the pixel center with the nearest neighboring sample, in the
* ratio 3:1. For example:
* p1(upsampled) = 3/4 * s0 + 1/4 * s1
* p2(upsampled) = 3/4 * s1 + 1/4 * s0
* When computing the first and last pixel component values in the row, there
* is no adjacent sample to blend, so:
* p0(upsampled) = s0
* p5(upsampled) = s2
*/
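/* (In integer form, the 3:1 blends are computed below as
 *   p1 = (3 * s0 + s1 + 2) >> 2
 *   p2 = (3 * s1 + s0 + 1) >> 2
 * where the alternating +2/+1 bias provides the ordered-dither rounding used
 * by the scalar implementation in jdsample.c.)
 */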
void jsimd_h2v1_fancy_upsample_neon(int max_v_samp_factor,
JDIMENSION downsampled_width,
JSAMPARRAY input_data,
JSAMPARRAY *output_data_ptr)
{
JSAMPARRAY output_data = *output_data_ptr;
JSAMPROW inptr, outptr;
int inrow;
unsigned colctr;
/* Set up constants. */
const uint16x8_t one_u16 = vdupq_n_u16(1);
const uint8x8_t three_u8 = vdup_n_u8(3);
for (inrow = 0; inrow < max_v_samp_factor; inrow++) {
inptr = input_data[inrow];
outptr = output_data[inrow];
/* First pixel component value in this row of the original image */
*outptr = (JSAMPLE)GETJSAMPLE(*inptr);
/* 3/4 * containing sample + 1/4 * nearest neighboring sample
* For p1: containing sample = s0, nearest neighboring sample = s1
* For p2: containing sample = s1, nearest neighboring sample = s0
*/
uint8x16_t s0 = vld1q_u8(inptr);
uint8x16_t s1 = vld1q_u8(inptr + 1);
/* Multiplication makes vectors twice as wide. '_l' and '_h' suffixes
* denote low half and high half respectively.
*/
uint16x8_t s1_add_3s0_l =
vmlal_u8(vmovl_u8(vget_low_u8(s1)), vget_low_u8(s0), three_u8);
uint16x8_t s1_add_3s0_h =
vmlal_u8(vmovl_u8(vget_high_u8(s1)), vget_high_u8(s0), three_u8);
uint16x8_t s0_add_3s1_l =
vmlal_u8(vmovl_u8(vget_low_u8(s0)), vget_low_u8(s1), three_u8);
uint16x8_t s0_add_3s1_h =
vmlal_u8(vmovl_u8(vget_high_u8(s0)), vget_high_u8(s1), three_u8);
/* Add ordered dithering bias to odd pixel values. */
s0_add_3s1_l = vaddq_u16(s0_add_3s1_l, one_u16);
s0_add_3s1_h = vaddq_u16(s0_add_3s1_h, one_u16);
/* The offset is initially 1, because the first pixel component has already
* been stored. However, in subsequent iterations of the SIMD loop, this
* offset is (2 * colctr - 1) to stay within the bounds of the sample
* buffers without having to resort to a slow scalar tail case for the last
* (downsampled_width % 16) samples. See "Creation of 2-D sample arrays"
* in jmemmgr.c for more details.
*/
unsigned outptr_offset = 1;
uint8x16x2_t output_pixels;
/* We use software pipelining to maximise performance. The code indented
* an extra two spaces begins the next iteration of the loop.
*/
for (colctr = 16; colctr < downsampled_width; colctr += 16) {
s0 = vld1q_u8(inptr + colctr - 1);
s1 = vld1q_u8(inptr + colctr);
/* Right-shift by 2 (divide by 4), narrow to 8-bit, and combine. */
output_pixels.val[0] = vcombine_u8(vrshrn_n_u16(s1_add_3s0_l, 2),
vrshrn_n_u16(s1_add_3s0_h, 2));
output_pixels.val[1] = vcombine_u8(vshrn_n_u16(s0_add_3s1_l, 2),
vshrn_n_u16(s0_add_3s1_h, 2));
/* Multiplication makes vectors twice as wide. '_l' and '_h' suffixes
* denote low half and high half respectively.
*/
s1_add_3s0_l =
vmlal_u8(vmovl_u8(vget_low_u8(s1)), vget_low_u8(s0), three_u8);
s1_add_3s0_h =
vmlal_u8(vmovl_u8(vget_high_u8(s1)), vget_high_u8(s0), three_u8);
s0_add_3s1_l =
vmlal_u8(vmovl_u8(vget_low_u8(s0)), vget_low_u8(s1), three_u8);
s0_add_3s1_h =
vmlal_u8(vmovl_u8(vget_high_u8(s0)), vget_high_u8(s1), three_u8);
/* Add ordered dithering bias to odd pixel values. */
s0_add_3s1_l = vaddq_u16(s0_add_3s1_l, one_u16);
s0_add_3s1_h = vaddq_u16(s0_add_3s1_h, one_u16);
/* Store pixel component values to memory. */
vst2q_u8(outptr + outptr_offset, output_pixels);
outptr_offset = 2 * colctr - 1;
}
/* Complete the last iteration of the loop. */
/* Right-shift by 2 (divide by 4), narrow to 8-bit, and combine. */
output_pixels.val[0] = vcombine_u8(vrshrn_n_u16(s1_add_3s0_l, 2),
vrshrn_n_u16(s1_add_3s0_h, 2));
output_pixels.val[1] = vcombine_u8(vshrn_n_u16(s0_add_3s1_l, 2),
vshrn_n_u16(s0_add_3s1_h, 2));
/* Store pixel component values to memory. */
vst2q_u8(outptr + outptr_offset, output_pixels);
/* Last pixel component value in this row of the original image */
outptr[2 * downsampled_width - 1] =
GETJSAMPLE(inptr[downsampled_width - 1]);
}
}
/* The diagram below shows an array of samples produced by h2v2 downsampling.
*
* s0 s1 s2
* +---------+---------+---------+
* | p0 p1 | p2 p3 | p4 p5 |
* sA | | | |
* | p6 p7 | p8 p9 | p10 p11|
* +---------+---------+---------+
* | p12 p13| p14 p15| p16 p17|
* sB | | | |
* | p18 p19| p20 p21| p22 p23|
* +---------+---------+---------+
* | p24 p25| p26 p27| p28 p29|
* sC | | | |
* | p30 p31| p32 p33| p34 p35|
* +---------+---------+---------+
*
* Samples s0A-s2C were created by averaging the original pixel component
* values centered at positions p0-p35 above. To approximate one of those
* original pixel component values, we proportionally blend the sample
* containing the pixel center with the nearest neighboring samples in each
* row, column, and diagonal.
*
* An upsampled pixel component value is computed by first blending the sample
* containing the pixel center with the nearest neighboring samples in the
* same column, in the ratio 3:1, and then blending each column sum with the
* nearest neighboring column sum, in the ratio 3:1. For example:
* p14(upsampled) = 3/4 * (3/4 * s1B + 1/4 * s1A) +
* 1/4 * (3/4 * s0B + 1/4 * s0A)
* = 9/16 * s1B + 3/16 * s1A + 3/16 * s0B + 1/16 * s0A
* When computing the first and last pixel component values in the row, there
* is no horizontally adjacent sample to blend, so:
* p12(upsampled) = 3/4 * s0B + 1/4 * s0A
* p23(upsampled) = 3/4 * s2B + 1/4 * s2C
* When computing the first and last pixel component values in the column,
* there is no vertically adjacent sample to blend, so:
* p2(upsampled) = 3/4 * s1A + 1/4 * s0A
* p33(upsampled) = 3/4 * s1C + 1/4 * s2C
* When computing the corner pixel component values, there is no adjacent
* sample to blend, so:
* p0(upsampled) = s0A
* p35(upsampled) = s2C
*/
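/* Illustrative, non-normative worked example (hypothetical sample values
 * s1B = 100, s1A = 80, s0B = 60, s0A = 40): the fixed-point form used below is
 *   p14(upsampled) = (3 * (3 * s1B + s1A) + (3 * s0B + s0A) + 8) >> 4
 *                  = (3 * 380 + 220 + 8) >> 4 = 1368 >> 4 = 85
 * which matches 9/16 * 100 + 3/16 * 80 + 3/16 * 60 + 1/16 * 40 = 85.  The
 * inner sums (3 * sB + sA) are the column sums computed in Step 1 below, and
 * the +8 (or +7) term is the rounding/dithering bias applied before the
 * final right-shift by 4.
 */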
void jsimd_h2v2_fancy_upsample_neon(int max_v_samp_factor,
JDIMENSION downsampled_width,
JSAMPARRAY input_data,
JSAMPARRAY *output_data_ptr)
{
JSAMPARRAY output_data = *output_data_ptr;
JSAMPROW inptr0, inptr1, inptr2, outptr0, outptr1;
int inrow, outrow;
unsigned colctr;
/* Set up constants. */
const uint16x8_t seven_u16 = vdupq_n_u16(7);
const uint8x8_t three_u8 = vdup_n_u8(3);
const uint16x8_t three_u16 = vdupq_n_u16(3);
inrow = outrow = 0;
while (outrow < max_v_samp_factor) {
inptr0 = input_data[inrow - 1];
inptr1 = input_data[inrow];
inptr2 = input_data[inrow + 1];
/* Suffixes 0 and 1 denote the upper and lower rows of output pixels,
* respectively.
*/
outptr0 = output_data[outrow++];
outptr1 = output_data[outrow++];
/* First pixel component value in this row of the original image */
int s0colsum0 = GETJSAMPLE(*inptr1) * 3 + GETJSAMPLE(*inptr0);
*outptr0 = (JSAMPLE)((s0colsum0 * 4 + 8) >> 4);
int s0colsum1 = GETJSAMPLE(*inptr1) * 3 + GETJSAMPLE(*inptr2);
*outptr1 = (JSAMPLE)((s0colsum1 * 4 + 8) >> 4);
/* Step 1: Blend samples vertically in columns s0 and s1.
* Leave the divide by 4 until the end, when it can be done for both
* dimensions at once, right-shifting by 4.
*/
/* Load and compute s0colsum0 and s0colsum1. */
uint8x16_t s0A = vld1q_u8(inptr0);
uint8x16_t s0B = vld1q_u8(inptr1);
uint8x16_t s0C = vld1q_u8(inptr2);
/* Multiplication makes vectors twice as wide. '_l' and '_h' suffixes
* denote low half and high half respectively.
*/
uint16x8_t s0colsum0_l = vmlal_u8(vmovl_u8(vget_low_u8(s0A)),
vget_low_u8(s0B), three_u8);
uint16x8_t s0colsum0_h = vmlal_u8(vmovl_u8(vget_high_u8(s0A)),
vget_high_u8(s0B), three_u8);
uint16x8_t s0colsum1_l = vmlal_u8(vmovl_u8(vget_low_u8(s0C)),
vget_low_u8(s0B), three_u8);
uint16x8_t s0colsum1_h = vmlal_u8(vmovl_u8(vget_high_u8(s0C)),
vget_high_u8(s0B), three_u8);
/* Load and compute s1colsum0 and s1colsum1. */
uint8x16_t s1A = vld1q_u8(inptr0 + 1);
uint8x16_t s1B = vld1q_u8(inptr1 + 1);
uint8x16_t s1C = vld1q_u8(inptr2 + 1);
uint16x8_t s1colsum0_l = vmlal_u8(vmovl_u8(vget_low_u8(s1A)),
vget_low_u8(s1B), three_u8);
uint16x8_t s1colsum0_h = vmlal_u8(vmovl_u8(vget_high_u8(s1A)),
vget_high_u8(s1B), three_u8);
uint16x8_t s1colsum1_l = vmlal_u8(vmovl_u8(vget_low_u8(s1C)),
vget_low_u8(s1B), three_u8);
uint16x8_t s1colsum1_h = vmlal_u8(vmovl_u8(vget_high_u8(s1C)),
vget_high_u8(s1B), three_u8);
/* Step 2: Blend the already-blended columns. */
uint16x8_t output0_p1_l = vmlaq_u16(s1colsum0_l, s0colsum0_l, three_u16);
uint16x8_t output0_p1_h = vmlaq_u16(s1colsum0_h, s0colsum0_h, three_u16);
uint16x8_t output0_p2_l = vmlaq_u16(s0colsum0_l, s1colsum0_l, three_u16);
uint16x8_t output0_p2_h = vmlaq_u16(s0colsum0_h, s1colsum0_h, three_u16);
uint16x8_t output1_p1_l = vmlaq_u16(s1colsum1_l, s0colsum1_l, three_u16);
uint16x8_t output1_p1_h = vmlaq_u16(s1colsum1_h, s0colsum1_h, three_u16);
uint16x8_t output1_p2_l = vmlaq_u16(s0colsum1_l, s1colsum1_l, three_u16);
uint16x8_t output1_p2_h = vmlaq_u16(s0colsum1_h, s1colsum1_h, three_u16);
/* Add ordered dithering bias to odd pixel values. */
output0_p1_l = vaddq_u16(output0_p1_l, seven_u16);
output0_p1_h = vaddq_u16(output0_p1_h, seven_u16);
output1_p1_l = vaddq_u16(output1_p1_l, seven_u16);
output1_p1_h = vaddq_u16(output1_p1_h, seven_u16);
/* Right-shift by 4 (divide by 16), narrow to 8-bit, and combine. */
uint8x16x2_t output_pixels0 = { {
vcombine_u8(vshrn_n_u16(output0_p1_l, 4), vshrn_n_u16(output0_p1_h, 4)),
vcombine_u8(vrshrn_n_u16(output0_p2_l, 4), vrshrn_n_u16(output0_p2_h, 4))
} };
uint8x16x2_t output_pixels1 = { {
vcombine_u8(vshrn_n_u16(output1_p1_l, 4), vshrn_n_u16(output1_p1_h, 4)),
vcombine_u8(vrshrn_n_u16(output1_p2_l, 4), vrshrn_n_u16(output1_p2_h, 4))
} };
/* Store pixel component values to memory.
* The minimum size of the output buffer for each row is 64 bytes => no
* need to worry about buffer overflow here. See "Creation of 2-D sample
* arrays" in jmemmgr.c for more details.
*/
vst2q_u8(outptr0 + 1, output_pixels0);
vst2q_u8(outptr1 + 1, output_pixels1);
/* The first pixel of the image shifted our loads and stores by one byte.
* We have to re-align on a 32-byte boundary at some point before the end
* of the row (we do it now on the 32/33 pixel boundary) to stay within the
* bounds of the sample buffers without having to resort to a slow scalar
* tail case for the last (downsampled_width % 16) samples. See "Creation
* of 2-D sample arrays" in jmemmgr.c for more details.
*/
for (colctr = 16; colctr < downsampled_width; colctr += 16) {
/* Step 1: Blend samples vertically in columns s0 and s1. */
/* Load and compute s0colsum0 and s0colsum1. */
s0A = vld1q_u8(inptr0 + colctr - 1);
s0B = vld1q_u8(inptr1 + colctr - 1);
s0C = vld1q_u8(inptr2 + colctr - 1);
s0colsum0_l = vmlal_u8(vmovl_u8(vget_low_u8(s0A)), vget_low_u8(s0B),
three_u8);
s0colsum0_h = vmlal_u8(vmovl_u8(vget_high_u8(s0A)), vget_high_u8(s0B),
three_u8);
s0colsum1_l = vmlal_u8(vmovl_u8(vget_low_u8(s0C)), vget_low_u8(s0B),
three_u8);
s0colsum1_h = vmlal_u8(vmovl_u8(vget_high_u8(s0C)), vget_high_u8(s0B),
three_u8);
/* Load and compute s1colsum0 and s1colsum1. */
s1A = vld1q_u8(inptr0 + colctr);
s1B = vld1q_u8(inptr1 + colctr);
s1C = vld1q_u8(inptr2 + colctr);
s1colsum0_l = vmlal_u8(vmovl_u8(vget_low_u8(s1A)), vget_low_u8(s1B),
three_u8);
s1colsum0_h = vmlal_u8(vmovl_u8(vget_high_u8(s1A)), vget_high_u8(s1B),
three_u8);
s1colsum1_l = vmlal_u8(vmovl_u8(vget_low_u8(s1C)), vget_low_u8(s1B),
three_u8);
s1colsum1_h = vmlal_u8(vmovl_u8(vget_high_u8(s1C)), vget_high_u8(s1B),
three_u8);
/* Step 2: Blend the already-blended columns. */
output0_p1_l = vmlaq_u16(s1colsum0_l, s0colsum0_l, three_u16);
output0_p1_h = vmlaq_u16(s1colsum0_h, s0colsum0_h, three_u16);
output0_p2_l = vmlaq_u16(s0colsum0_l, s1colsum0_l, three_u16);
output0_p2_h = vmlaq_u16(s0colsum0_h, s1colsum0_h, three_u16);
output1_p1_l = vmlaq_u16(s1colsum1_l, s0colsum1_l, three_u16);
output1_p1_h = vmlaq_u16(s1colsum1_h, s0colsum1_h, three_u16);
output1_p2_l = vmlaq_u16(s0colsum1_l, s1colsum1_l, three_u16);
output1_p2_h = vmlaq_u16(s0colsum1_h, s1colsum1_h, three_u16);
/* Add ordered dithering bias to odd pixel values. */
output0_p1_l = vaddq_u16(output0_p1_l, seven_u16);
output0_p1_h = vaddq_u16(output0_p1_h, seven_u16);
output1_p1_l = vaddq_u16(output1_p1_l, seven_u16);
output1_p1_h = vaddq_u16(output1_p1_h, seven_u16);
/* Right-shift by 4 (divide by 16), narrow to 8-bit, and combine. */
output_pixels0.val[0] = vcombine_u8(vshrn_n_u16(output0_p1_l, 4),
vshrn_n_u16(output0_p1_h, 4));
output_pixels0.val[1] = vcombine_u8(vrshrn_n_u16(output0_p2_l, 4),
vrshrn_n_u16(output0_p2_h, 4));
output_pixels1.val[0] = vcombine_u8(vshrn_n_u16(output1_p1_l, 4),
vshrn_n_u16(output1_p1_h, 4));
output_pixels1.val[1] = vcombine_u8(vrshrn_n_u16(output1_p2_l, 4),
vrshrn_n_u16(output1_p2_h, 4));
/* Store pixel component values to memory. */
vst2q_u8(outptr0 + 2 * colctr - 1, output_pixels0);
vst2q_u8(outptr1 + 2 * colctr - 1, output_pixels1);
}
/* Last pixel component value in this row of the original image */
int s1colsum0 = GETJSAMPLE(inptr1[downsampled_width - 1]) * 3 +
GETJSAMPLE(inptr0[downsampled_width - 1]);
outptr0[2 * downsampled_width - 1] = (JSAMPLE)((s1colsum0 * 4 + 7) >> 4);
int s1colsum1 = GETJSAMPLE(inptr1[downsampled_width - 1]) * 3 +
GETJSAMPLE(inptr2[downsampled_width - 1]);
outptr1[2 * downsampled_width - 1] = (JSAMPLE)((s1colsum1 * 4 + 7) >> 4);
inrow++;
}
}
/* The diagram below shows a column of samples produced by h1v2 downsampling
* (or by losslessly rotating or transposing an h2v1-downsampled image.)
*
* +---------+
* | p0 |
* sA | |
* | p1 |
* +---------+
* | p2 |
* sB | |
* | p3 |
* +---------+
* | p4 |
* sC | |
* | p5 |
* +---------+
*
* Samples sA-sC were created by averaging the original pixel component values
* centered at positions p0-p5 above. To approximate those original pixel
* component values, we proportionally blend the adjacent samples in each
* column.
*
* An upsampled pixel component value is computed by blending the sample
* containing the pixel center with the nearest neighboring sample, in the
* ratio 3:1. For example:
* p1(upsampled) = 3/4 * sA + 1/4 * sB
* p2(upsampled) = 3/4 * sB + 1/4 * sA
* When computing the first and last pixel component values in the column,
* there is no adjacent sample to blend, so:
* p0(upsampled) = sA
* p5(upsampled) = sC
*/
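/* Illustrative, non-normative worked example (hypothetical sample values
 * sA = 100, sB = 120):
 *   p1(upsampled) = (3 * sA + sB + 2) >> 2 = 105   (= 3/4 * 100 + 1/4 * 120)
 *   p2(upsampled) = (3 * sB + sA + 1) >> 2 = 115   (= 3/4 * 120 + 1/4 * 100)
 * The +2 corresponds to the rounding narrowing shift (vrshrn) and the +1 to
 * the dithering bias followed by a truncating shift (vshrn) in the loop
 * below.
 */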
void jsimd_h1v2_fancy_upsample_neon(int max_v_samp_factor,
JDIMENSION downsampled_width,
JSAMPARRAY input_data,
JSAMPARRAY *output_data_ptr)
{
JSAMPARRAY output_data = *output_data_ptr;
JSAMPROW inptr0, inptr1, inptr2, outptr0, outptr1;
int inrow, outrow;
unsigned colctr;
/* Set up constants. */
const uint16x8_t one_u16 = vdupq_n_u16(1);
const uint8x8_t three_u8 = vdup_n_u8(3);
inrow = outrow = 0;
while (outrow < max_v_samp_factor) {
inptr0 = input_data[inrow - 1];
inptr1 = input_data[inrow];
inptr2 = input_data[inrow + 1];
/* Suffixes 0 and 1 denote the upper and lower rows of output pixels,
* respectively.
*/
outptr0 = output_data[outrow++];
outptr1 = output_data[outrow++];
inrow++;
/* The size of the input and output buffers is always a multiple of 32
* bytes => no need to worry about buffer overflow when reading/writing
* memory. See "Creation of 2-D sample arrays" in jmemmgr.c for more
* details.
*/
for (colctr = 0; colctr < downsampled_width; colctr += 16) {
/* Load samples. */
uint8x16_t sA = vld1q_u8(inptr0 + colctr);
uint8x16_t sB = vld1q_u8(inptr1 + colctr);
uint8x16_t sC = vld1q_u8(inptr2 + colctr);
/* Blend samples vertically. */
uint16x8_t colsum0_l = vmlal_u8(vmovl_u8(vget_low_u8(sA)),
vget_low_u8(sB), three_u8);
uint16x8_t colsum0_h = vmlal_u8(vmovl_u8(vget_high_u8(sA)),
vget_high_u8(sB), three_u8);
uint16x8_t colsum1_l = vmlal_u8(vmovl_u8(vget_low_u8(sC)),
vget_low_u8(sB), three_u8);
uint16x8_t colsum1_h = vmlal_u8(vmovl_u8(vget_high_u8(sC)),
vget_high_u8(sB), three_u8);
/* Add ordered dithering bias to pixel values in even output rows. */
colsum0_l = vaddq_u16(colsum0_l, one_u16);
colsum0_h = vaddq_u16(colsum0_h, one_u16);
/* Right-shift by 2 (divide by 4), narrow to 8-bit, and combine. */
uint8x16_t output_pixels0 = vcombine_u8(vshrn_n_u16(colsum0_l, 2),
vshrn_n_u16(colsum0_h, 2));
uint8x16_t output_pixels1 = vcombine_u8(vrshrn_n_u16(colsum1_l, 2),
vrshrn_n_u16(colsum1_h, 2));
/* Store pixel component values to memory. */
vst1q_u8(outptr0 + colctr, output_pixels0);
vst1q_u8(outptr1 + colctr, output_pixels1);
}
}
}
/* The diagram below shows a row of samples produced by h2v1 downsampling.
*
* s0 s1
* +---------+---------+
* | | |
* | p0 p1 | p2 p3 |
* | | |
* +---------+---------+
*
* Samples s0 and s1 were created by averaging the original pixel component
* values centered at positions p0-p3 above. To approximate those original
* pixel component values, we duplicate the samples horizontally:
* p0(upsampled) = p1(upsampled) = s0
* p2(upsampled) = p3(upsampled) = s1
*/
void jsimd_h2v1_upsample_neon(int max_v_samp_factor, JDIMENSION output_width,
JSAMPARRAY input_data,
JSAMPARRAY *output_data_ptr)
{
JSAMPARRAY output_data = *output_data_ptr;
JSAMPROW inptr, outptr;
int inrow;
unsigned colctr;
for (inrow = 0; inrow < max_v_samp_factor; inrow++) {
inptr = input_data[inrow];
outptr = output_data[inrow];
for (colctr = 0; 2 * colctr < output_width; colctr += 16) {
uint8x16_t samples = vld1q_u8(inptr + colctr);
/* Duplicate the samples. The store operation below interleaves them so
* that adjacent pixel component values take on the same sample value,
* per above.
*/
uint8x16x2_t output_pixels = { { samples, samples } };
/* Store pixel component values to memory.
* Due to the way sample buffers are allocated, we don't need to worry
* about tail cases when output_width is not a multiple of 32. See
* "Creation of 2-D sample arrays" in jmemmgr.c for details.
*/
vst2q_u8(outptr + 2 * colctr, output_pixels);
}
}
}
/* The diagram below shows an array of samples produced by h2v2 downsampling.
*
* s0 s1
* +---------+---------+
* | p0 p1 | p2 p3 |
* sA | | |
* | p4 p5 | p6 p7 |
* +---------+---------+
* | p8 p9 | p10 p11|
* sB | | |
* | p12 p13| p14 p15|
* +---------+---------+
*
* Samples s0A-s1B were created by averaging the original pixel component
* values centered at positions p0-p15 above. To approximate those original
* pixel component values, we duplicate the samples both horizontally and
* vertically:
* p0(upsampled) = p1(upsampled) = p4(upsampled) = p5(upsampled) = s0A
* p2(upsampled) = p3(upsampled) = p6(upsampled) = p7(upsampled) = s1A
* p8(upsampled) = p9(upsampled) = p12(upsampled) = p13(upsampled) = s0B
* p10(upsampled) = p11(upsampled) = p14(upsampled) = p15(upsampled) = s1B
*/
void jsimd_h2v2_upsample_neon(int max_v_samp_factor, JDIMENSION output_width,
JSAMPARRAY input_data,
JSAMPARRAY *output_data_ptr)
{
JSAMPARRAY output_data = *output_data_ptr;
JSAMPROW inptr, outptr0, outptr1;
int inrow, outrow;
unsigned colctr;
for (inrow = 0, outrow = 0; outrow < max_v_samp_factor; inrow++) {
inptr = input_data[inrow];
outptr0 = output_data[outrow++];
outptr1 = output_data[outrow++];
for (colctr = 0; 2 * colctr < output_width; colctr += 16) {
uint8x16_t samples = vld1q_u8(inptr + colctr);
/* Duplicate the samples. The store operation below interleaves them so
* that adjacent pixel component values take on the same sample value,
* per above.
*/
uint8x16x2_t output_pixels = { { samples, samples } };
/* Store pixel component values for both output rows to memory.
* Due to the way sample buffers are allocated, we don't need to worry
* about tail cases when output_width is not a multiple of 32. See
* "Creation of 2-D sample arrays" in jmemmgr.c for details.
*/
vst2q_u8(outptr0 + 2 * colctr, output_pixels);
vst2q_u8(outptr1 + 2 * colctr, output_pixels);
}
}
}

214
simd/arm/jfdctfst-neon.c Normal file
View File

@@ -0,0 +1,214 @@
/*
* jfdctfst-neon.c - fast integer FDCT (Arm Neon)
*
* Copyright (C) 2020, Arm Limited. All Rights Reserved.
*
* This software is provided 'as-is', without any express or implied
* warranty. In no event will the authors be held liable for any damages
* arising from the use of this software.
*
* Permission is granted to anyone to use this software for any purpose,
* including commercial applications, and to alter it and redistribute it
* freely, subject to the following restrictions:
*
* 1. The origin of this software must not be misrepresented; you must not
* claim that you wrote the original software. If you use this software
* in a product, an acknowledgment in the product documentation would be
* appreciated but is not required.
* 2. Altered source versions must be plainly marked as such, and must not be
* misrepresented as being the original software.
* 3. This notice may not be removed or altered from any source distribution.
*/
#define JPEG_INTERNALS
#include "../../jinclude.h"
#include "../../jpeglib.h"
#include "../../jsimd.h"
#include "../../jdct.h"
#include "../../jsimddct.h"
#include "../jsimd.h"
#include "align.h"
#include <arm_neon.h>
/* jsimd_fdct_ifast_neon() performs a fast, not so accurate forward DCT
* (Discrete Cosine Transform) on one block of samples. It uses the same
* calculations and produces exactly the same output as IJG's original
* jpeg_fdct_ifast() function, which can be found in jfdctfst.c.
*
* Scaled integer constants are used to avoid floating-point arithmetic:
* 0.382683433 = 12544 * 2^-15
 *    0.541196100 = 17792 * 2^-15
* 0.707106781 = 23168 * 2^-15
* 0.306562965 = 9984 * 2^-15
*
* See jfdctfst.c for further details of the DCT algorithm. Where possible,
* the variable names and comments here in jsimd_fdct_ifast_neon() match up
* with those in jpeg_fdct_ifast().
*/
#define F_0_382 12544
#define F_0_541 17792
#define F_0_707 23168
#define F_0_306 9984
ALIGN(16) static const int16_t jsimd_fdct_ifast_neon_consts[] = {
F_0_382, F_0_541, F_0_707, F_0_306
};
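/* Explanatory aside (added; not part of the algorithm description above):
 * vqdmulhq_lane_s16(x, consts, n) returns the high half of the doubled
 * product, i.e. approximately (x * consts[n]) >> 15.  Because each constant
 * above is the desired factor scaled by 2^15, the intrinsic computes
 * x * factor while staying in 16-bit arithmetic; for example,
 * vqdmulhq_lane_s16(x, consts, 2) ~= x * 0.707106781.
 */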
void jsimd_fdct_ifast_neon(DCTELEM *data)
{
/* Load an 8x8 block of samples into Neon registers. De-interleaving loads
* are used, followed by vuzp to transpose the block such that we have a
* column of samples per vector - allowing all rows to be processed at once.
*/
int16x8x4_t data1 = vld4q_s16(data);
int16x8x4_t data2 = vld4q_s16(data + 4 * DCTSIZE);
int16x8x2_t cols_04 = vuzpq_s16(data1.val[0], data2.val[0]);
int16x8x2_t cols_15 = vuzpq_s16(data1.val[1], data2.val[1]);
int16x8x2_t cols_26 = vuzpq_s16(data1.val[2], data2.val[2]);
int16x8x2_t cols_37 = vuzpq_s16(data1.val[3], data2.val[3]);
int16x8_t col0 = cols_04.val[0];
int16x8_t col1 = cols_15.val[0];
int16x8_t col2 = cols_26.val[0];
int16x8_t col3 = cols_37.val[0];
int16x8_t col4 = cols_04.val[1];
int16x8_t col5 = cols_15.val[1];
int16x8_t col6 = cols_26.val[1];
int16x8_t col7 = cols_37.val[1];
/* Pass 1: process rows. */
/* Load DCT conversion constants. */
const int16x4_t consts = vld1_s16(jsimd_fdct_ifast_neon_consts);
int16x8_t tmp0 = vaddq_s16(col0, col7);
int16x8_t tmp7 = vsubq_s16(col0, col7);
int16x8_t tmp1 = vaddq_s16(col1, col6);
int16x8_t tmp6 = vsubq_s16(col1, col6);
int16x8_t tmp2 = vaddq_s16(col2, col5);
int16x8_t tmp5 = vsubq_s16(col2, col5);
int16x8_t tmp3 = vaddq_s16(col3, col4);
int16x8_t tmp4 = vsubq_s16(col3, col4);
/* Even part */
int16x8_t tmp10 = vaddq_s16(tmp0, tmp3); /* phase 2 */
int16x8_t tmp13 = vsubq_s16(tmp0, tmp3);
int16x8_t tmp11 = vaddq_s16(tmp1, tmp2);
int16x8_t tmp12 = vsubq_s16(tmp1, tmp2);
col0 = vaddq_s16(tmp10, tmp11); /* phase 3 */
col4 = vsubq_s16(tmp10, tmp11);
int16x8_t z1 = vqdmulhq_lane_s16(vaddq_s16(tmp12, tmp13), consts, 2);
col2 = vaddq_s16(tmp13, z1); /* phase 5 */
col6 = vsubq_s16(tmp13, z1);
/* Odd part */
tmp10 = vaddq_s16(tmp4, tmp5); /* phase 2 */
tmp11 = vaddq_s16(tmp5, tmp6);
tmp12 = vaddq_s16(tmp6, tmp7);
int16x8_t z5 = vqdmulhq_lane_s16(vsubq_s16(tmp10, tmp12), consts, 0);
int16x8_t z2 = vqdmulhq_lane_s16(tmp10, consts, 1);
z2 = vaddq_s16(z2, z5);
int16x8_t z4 = vqdmulhq_lane_s16(tmp12, consts, 3);
z5 = vaddq_s16(tmp12, z5);
z4 = vaddq_s16(z4, z5);
int16x8_t z3 = vqdmulhq_lane_s16(tmp11, consts, 2);
int16x8_t z11 = vaddq_s16(tmp7, z3); /* phase 5 */
int16x8_t z13 = vsubq_s16(tmp7, z3);
col5 = vaddq_s16(z13, z2); /* phase 6 */
col3 = vsubq_s16(z13, z2);
col1 = vaddq_s16(z11, z4);
col7 = vsubq_s16(z11, z4);
/* Transpose to work on columns in pass 2. */
int16x8x2_t cols_01 = vtrnq_s16(col0, col1);
int16x8x2_t cols_23 = vtrnq_s16(col2, col3);
int16x8x2_t cols_45 = vtrnq_s16(col4, col5);
int16x8x2_t cols_67 = vtrnq_s16(col6, col7);
int32x4x2_t cols_0145_l = vtrnq_s32(vreinterpretq_s32_s16(cols_01.val[0]),
vreinterpretq_s32_s16(cols_45.val[0]));
int32x4x2_t cols_0145_h = vtrnq_s32(vreinterpretq_s32_s16(cols_01.val[1]),
vreinterpretq_s32_s16(cols_45.val[1]));
int32x4x2_t cols_2367_l = vtrnq_s32(vreinterpretq_s32_s16(cols_23.val[0]),
vreinterpretq_s32_s16(cols_67.val[0]));
int32x4x2_t cols_2367_h = vtrnq_s32(vreinterpretq_s32_s16(cols_23.val[1]),
vreinterpretq_s32_s16(cols_67.val[1]));
int32x4x2_t rows_04 = vzipq_s32(cols_0145_l.val[0], cols_2367_l.val[0]);
int32x4x2_t rows_15 = vzipq_s32(cols_0145_h.val[0], cols_2367_h.val[0]);
int32x4x2_t rows_26 = vzipq_s32(cols_0145_l.val[1], cols_2367_l.val[1]);
int32x4x2_t rows_37 = vzipq_s32(cols_0145_h.val[1], cols_2367_h.val[1]);
int16x8_t row0 = vreinterpretq_s16_s32(rows_04.val[0]);
int16x8_t row1 = vreinterpretq_s16_s32(rows_15.val[0]);
int16x8_t row2 = vreinterpretq_s16_s32(rows_26.val[0]);
int16x8_t row3 = vreinterpretq_s16_s32(rows_37.val[0]);
int16x8_t row4 = vreinterpretq_s16_s32(rows_04.val[1]);
int16x8_t row5 = vreinterpretq_s16_s32(rows_15.val[1]);
int16x8_t row6 = vreinterpretq_s16_s32(rows_26.val[1]);
int16x8_t row7 = vreinterpretq_s16_s32(rows_37.val[1]);
/* Pass 2: process columns. */
tmp0 = vaddq_s16(row0, row7);
tmp7 = vsubq_s16(row0, row7);
tmp1 = vaddq_s16(row1, row6);
tmp6 = vsubq_s16(row1, row6);
tmp2 = vaddq_s16(row2, row5);
tmp5 = vsubq_s16(row2, row5);
tmp3 = vaddq_s16(row3, row4);
tmp4 = vsubq_s16(row3, row4);
/* Even part */
tmp10 = vaddq_s16(tmp0, tmp3); /* phase 2 */
tmp13 = vsubq_s16(tmp0, tmp3);
tmp11 = vaddq_s16(tmp1, tmp2);
tmp12 = vsubq_s16(tmp1, tmp2);
row0 = vaddq_s16(tmp10, tmp11); /* phase 3 */
row4 = vsubq_s16(tmp10, tmp11);
z1 = vqdmulhq_lane_s16(vaddq_s16(tmp12, tmp13), consts, 2);
row2 = vaddq_s16(tmp13, z1); /* phase 5 */
row6 = vsubq_s16(tmp13, z1);
/* Odd part */
tmp10 = vaddq_s16(tmp4, tmp5); /* phase 2 */
tmp11 = vaddq_s16(tmp5, tmp6);
tmp12 = vaddq_s16(tmp6, tmp7);
z5 = vqdmulhq_lane_s16(vsubq_s16(tmp10, tmp12), consts, 0);
z2 = vqdmulhq_lane_s16(tmp10, consts, 1);
z2 = vaddq_s16(z2, z5);
z4 = vqdmulhq_lane_s16(tmp12, consts, 3);
z5 = vaddq_s16(tmp12, z5);
z4 = vaddq_s16(z4, z5);
z3 = vqdmulhq_lane_s16(tmp11, consts, 2);
z11 = vaddq_s16(tmp7, z3); /* phase 5 */
z13 = vsubq_s16(tmp7, z3);
row5 = vaddq_s16(z13, z2); /* phase 6 */
row3 = vsubq_s16(z13, z2);
row1 = vaddq_s16(z11, z4);
row7 = vsubq_s16(z11, z4);
vst1q_s16(data + 0 * DCTSIZE, row0);
vst1q_s16(data + 1 * DCTSIZE, row1);
vst1q_s16(data + 2 * DCTSIZE, row2);
vst1q_s16(data + 3 * DCTSIZE, row3);
vst1q_s16(data + 4 * DCTSIZE, row4);
vst1q_s16(data + 5 * DCTSIZE, row5);
vst1q_s16(data + 6 * DCTSIZE, row6);
vst1q_s16(data + 7 * DCTSIZE, row7);
}

376
simd/arm/jfdctint-neon.c Normal file
View File

@@ -0,0 +1,376 @@
/*
* jfdctint-neon.c - accurate integer FDCT (Arm Neon)
*
* Copyright (C) 2020, Arm Limited. All Rights Reserved.
* Copyright (C) 2020, D. R. Commander. All Rights Reserved.
*
* This software is provided 'as-is', without any express or implied
* warranty. In no event will the authors be held liable for any damages
* arising from the use of this software.
*
* Permission is granted to anyone to use this software for any purpose,
* including commercial applications, and to alter it and redistribute it
* freely, subject to the following restrictions:
*
* 1. The origin of this software must not be misrepresented; you must not
* claim that you wrote the original software. If you use this software
* in a product, an acknowledgment in the product documentation would be
* appreciated but is not required.
* 2. Altered source versions must be plainly marked as such, and must not be
* misrepresented as being the original software.
* 3. This notice may not be removed or altered from any source distribution.
*/
#define JPEG_INTERNALS
#include "../../jinclude.h"
#include "../../jpeglib.h"
#include "../../jsimd.h"
#include "../../jdct.h"
#include "../../jsimddct.h"
#include "../jsimd.h"
#include "align.h"
#include "neon-compat.h"
#include <arm_neon.h>
/* jsimd_fdct_islow_neon() performs a slower but more accurate forward DCT
* (Discrete Cosine Transform) on one block of samples. It uses the same
* calculations and produces exactly the same output as IJG's original
* jpeg_fdct_islow() function, which can be found in jfdctint.c.
*
* Scaled integer constants are used to avoid floating-point arithmetic:
* 0.298631336 = 2446 * 2^-13
* 0.390180644 = 3196 * 2^-13
* 0.541196100 = 4433 * 2^-13
* 0.765366865 = 6270 * 2^-13
* 0.899976223 = 7373 * 2^-13
* 1.175875602 = 9633 * 2^-13
* 1.501321110 = 12299 * 2^-13
* 1.847759065 = 15137 * 2^-13
* 1.961570560 = 16069 * 2^-13
* 2.053119869 = 16819 * 2^-13
* 2.562915447 = 20995 * 2^-13
* 3.072711026 = 25172 * 2^-13
*
* See jfdctint.c for further details of the DCT algorithm. Where possible,
* the variable names and comments here in jsimd_fdct_islow_neon() match up
* with those in jpeg_fdct_islow().
*/
#define CONST_BITS 13
#define PASS1_BITS 2
#define DESCALE_P1 (CONST_BITS - PASS1_BITS)
#define DESCALE_P2 (CONST_BITS + PASS1_BITS)
#define F_0_298 2446
#define F_0_390 3196
#define F_0_541 4433
#define F_0_765 6270
#define F_0_899 7373
#define F_1_175 9633
#define F_1_501 12299
#define F_1_847 15137
#define F_1_961 16069
#define F_2_053 16819
#define F_2_562 20995
#define F_3_072 25172
ALIGN(16) static const int16_t jsimd_fdct_islow_neon_consts[] = {
F_0_298, -F_0_390, F_0_541, F_0_765,
-F_0_899, F_1_175, F_1_501, -F_1_847,
-F_1_961, F_2_053, -F_2_562, F_3_072
};
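/* Explanatory aside (added for clarity): the three 4-lane vectors loaded
 * from this table are indexed below with vmull_lane_s16()/vmlal_lane_s16(),
 * which widen to 32-bit products.  The subsequent vrshrn_n_s32(x, DESCALE_P1)
 * and vrshrn_n_s32(x, DESCALE_P2) calls perform the rounding right-shifts
 * that correspond to DESCALE(x, CONST_BITS - PASS1_BITS) and
 * DESCALE(x, CONST_BITS + PASS1_BITS) in jfdctint.c.
 */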
void jsimd_fdct_islow_neon(DCTELEM *data)
{
/* Load DCT constants. */
#ifdef HAVE_VLD1_S16_X3
const int16x4x3_t consts = vld1_s16_x3(jsimd_fdct_islow_neon_consts);
#else
/* GCC does not currently support the intrinsic vld1_<type>_x3(). */
const int16x4_t consts1 = vld1_s16(jsimd_fdct_islow_neon_consts);
const int16x4_t consts2 = vld1_s16(jsimd_fdct_islow_neon_consts + 4);
const int16x4_t consts3 = vld1_s16(jsimd_fdct_islow_neon_consts + 8);
const int16x4x3_t consts = { { consts1, consts2, consts3 } };
#endif
/* Load an 8x8 block of samples into Neon registers. De-interleaving loads
* are used, followed by vuzp to transpose the block such that we have a
* column of samples per vector - allowing all rows to be processed at once.
*/
int16x8x4_t s_rows_0123 = vld4q_s16(data);
int16x8x4_t s_rows_4567 = vld4q_s16(data + 4 * DCTSIZE);
int16x8x2_t cols_04 = vuzpq_s16(s_rows_0123.val[0], s_rows_4567.val[0]);
int16x8x2_t cols_15 = vuzpq_s16(s_rows_0123.val[1], s_rows_4567.val[1]);
int16x8x2_t cols_26 = vuzpq_s16(s_rows_0123.val[2], s_rows_4567.val[2]);
int16x8x2_t cols_37 = vuzpq_s16(s_rows_0123.val[3], s_rows_4567.val[3]);
int16x8_t col0 = cols_04.val[0];
int16x8_t col1 = cols_15.val[0];
int16x8_t col2 = cols_26.val[0];
int16x8_t col3 = cols_37.val[0];
int16x8_t col4 = cols_04.val[1];
int16x8_t col5 = cols_15.val[1];
int16x8_t col6 = cols_26.val[1];
int16x8_t col7 = cols_37.val[1];
/* Pass 1: process rows. */
int16x8_t tmp0 = vaddq_s16(col0, col7);
int16x8_t tmp7 = vsubq_s16(col0, col7);
int16x8_t tmp1 = vaddq_s16(col1, col6);
int16x8_t tmp6 = vsubq_s16(col1, col6);
int16x8_t tmp2 = vaddq_s16(col2, col5);
int16x8_t tmp5 = vsubq_s16(col2, col5);
int16x8_t tmp3 = vaddq_s16(col3, col4);
int16x8_t tmp4 = vsubq_s16(col3, col4);
/* Even part */
int16x8_t tmp10 = vaddq_s16(tmp0, tmp3);
int16x8_t tmp13 = vsubq_s16(tmp0, tmp3);
int16x8_t tmp11 = vaddq_s16(tmp1, tmp2);
int16x8_t tmp12 = vsubq_s16(tmp1, tmp2);
col0 = vshlq_n_s16(vaddq_s16(tmp10, tmp11), PASS1_BITS);
col4 = vshlq_n_s16(vsubq_s16(tmp10, tmp11), PASS1_BITS);
int16x8_t tmp12_add_tmp13 = vaddq_s16(tmp12, tmp13);
int32x4_t z1_l =
vmull_lane_s16(vget_low_s16(tmp12_add_tmp13), consts.val[0], 2);
int32x4_t z1_h =
vmull_lane_s16(vget_high_s16(tmp12_add_tmp13), consts.val[0], 2);
int32x4_t col2_scaled_l =
vmlal_lane_s16(z1_l, vget_low_s16(tmp13), consts.val[0], 3);
int32x4_t col2_scaled_h =
vmlal_lane_s16(z1_h, vget_high_s16(tmp13), consts.val[0], 3);
col2 = vcombine_s16(vrshrn_n_s32(col2_scaled_l, DESCALE_P1),
vrshrn_n_s32(col2_scaled_h, DESCALE_P1));
int32x4_t col6_scaled_l =
vmlal_lane_s16(z1_l, vget_low_s16(tmp12), consts.val[1], 3);
int32x4_t col6_scaled_h =
vmlal_lane_s16(z1_h, vget_high_s16(tmp12), consts.val[1], 3);
col6 = vcombine_s16(vrshrn_n_s32(col6_scaled_l, DESCALE_P1),
vrshrn_n_s32(col6_scaled_h, DESCALE_P1));
/* Odd part */
int16x8_t z1 = vaddq_s16(tmp4, tmp7);
int16x8_t z2 = vaddq_s16(tmp5, tmp6);
int16x8_t z3 = vaddq_s16(tmp4, tmp6);
int16x8_t z4 = vaddq_s16(tmp5, tmp7);
/* sqrt(2) * c3 */
int32x4_t z5_l = vmull_lane_s16(vget_low_s16(z3), consts.val[1], 1);
int32x4_t z5_h = vmull_lane_s16(vget_high_s16(z3), consts.val[1], 1);
z5_l = vmlal_lane_s16(z5_l, vget_low_s16(z4), consts.val[1], 1);
z5_h = vmlal_lane_s16(z5_h, vget_high_s16(z4), consts.val[1], 1);
/* sqrt(2) * (-c1+c3+c5-c7) */
int32x4_t tmp4_l = vmull_lane_s16(vget_low_s16(tmp4), consts.val[0], 0);
int32x4_t tmp4_h = vmull_lane_s16(vget_high_s16(tmp4), consts.val[0], 0);
/* sqrt(2) * ( c1+c3-c5+c7) */
int32x4_t tmp5_l = vmull_lane_s16(vget_low_s16(tmp5), consts.val[2], 1);
int32x4_t tmp5_h = vmull_lane_s16(vget_high_s16(tmp5), consts.val[2], 1);
/* sqrt(2) * ( c1+c3+c5-c7) */
int32x4_t tmp6_l = vmull_lane_s16(vget_low_s16(tmp6), consts.val[2], 3);
int32x4_t tmp6_h = vmull_lane_s16(vget_high_s16(tmp6), consts.val[2], 3);
/* sqrt(2) * ( c1+c3-c5-c7) */
int32x4_t tmp7_l = vmull_lane_s16(vget_low_s16(tmp7), consts.val[1], 2);
int32x4_t tmp7_h = vmull_lane_s16(vget_high_s16(tmp7), consts.val[1], 2);
/* sqrt(2) * (c7-c3) */
z1_l = vmull_lane_s16(vget_low_s16(z1), consts.val[1], 0);
z1_h = vmull_lane_s16(vget_high_s16(z1), consts.val[1], 0);
/* sqrt(2) * (-c1-c3) */
int32x4_t z2_l = vmull_lane_s16(vget_low_s16(z2), consts.val[2], 2);
int32x4_t z2_h = vmull_lane_s16(vget_high_s16(z2), consts.val[2], 2);
/* sqrt(2) * (-c3-c5) */
int32x4_t z3_l = vmull_lane_s16(vget_low_s16(z3), consts.val[2], 0);
int32x4_t z3_h = vmull_lane_s16(vget_high_s16(z3), consts.val[2], 0);
/* sqrt(2) * (c5-c3) */
int32x4_t z4_l = vmull_lane_s16(vget_low_s16(z4), consts.val[0], 1);
int32x4_t z4_h = vmull_lane_s16(vget_high_s16(z4), consts.val[0], 1);
z3_l = vaddq_s32(z3_l, z5_l);
z3_h = vaddq_s32(z3_h, z5_h);
z4_l = vaddq_s32(z4_l, z5_l);
z4_h = vaddq_s32(z4_h, z5_h);
tmp4_l = vaddq_s32(tmp4_l, z1_l);
tmp4_h = vaddq_s32(tmp4_h, z1_h);
tmp4_l = vaddq_s32(tmp4_l, z3_l);
tmp4_h = vaddq_s32(tmp4_h, z3_h);
col7 = vcombine_s16(vrshrn_n_s32(tmp4_l, DESCALE_P1),
vrshrn_n_s32(tmp4_h, DESCALE_P1));
tmp5_l = vaddq_s32(tmp5_l, z2_l);
tmp5_h = vaddq_s32(tmp5_h, z2_h);
tmp5_l = vaddq_s32(tmp5_l, z4_l);
tmp5_h = vaddq_s32(tmp5_h, z4_h);
col5 = vcombine_s16(vrshrn_n_s32(tmp5_l, DESCALE_P1),
vrshrn_n_s32(tmp5_h, DESCALE_P1));
tmp6_l = vaddq_s32(tmp6_l, z2_l);
tmp6_h = vaddq_s32(tmp6_h, z2_h);
tmp6_l = vaddq_s32(tmp6_l, z3_l);
tmp6_h = vaddq_s32(tmp6_h, z3_h);
col3 = vcombine_s16(vrshrn_n_s32(tmp6_l, DESCALE_P1),
vrshrn_n_s32(tmp6_h, DESCALE_P1));
tmp7_l = vaddq_s32(tmp7_l, z1_l);
tmp7_h = vaddq_s32(tmp7_h, z1_h);
tmp7_l = vaddq_s32(tmp7_l, z4_l);
tmp7_h = vaddq_s32(tmp7_h, z4_h);
col1 = vcombine_s16(vrshrn_n_s32(tmp7_l, DESCALE_P1),
vrshrn_n_s32(tmp7_h, DESCALE_P1));
/* Transpose to work on columns in pass 2. */
int16x8x2_t cols_01 = vtrnq_s16(col0, col1);
int16x8x2_t cols_23 = vtrnq_s16(col2, col3);
int16x8x2_t cols_45 = vtrnq_s16(col4, col5);
int16x8x2_t cols_67 = vtrnq_s16(col6, col7);
int32x4x2_t cols_0145_l = vtrnq_s32(vreinterpretq_s32_s16(cols_01.val[0]),
vreinterpretq_s32_s16(cols_45.val[0]));
int32x4x2_t cols_0145_h = vtrnq_s32(vreinterpretq_s32_s16(cols_01.val[1]),
vreinterpretq_s32_s16(cols_45.val[1]));
int32x4x2_t cols_2367_l = vtrnq_s32(vreinterpretq_s32_s16(cols_23.val[0]),
vreinterpretq_s32_s16(cols_67.val[0]));
int32x4x2_t cols_2367_h = vtrnq_s32(vreinterpretq_s32_s16(cols_23.val[1]),
vreinterpretq_s32_s16(cols_67.val[1]));
int32x4x2_t rows_04 = vzipq_s32(cols_0145_l.val[0], cols_2367_l.val[0]);
int32x4x2_t rows_15 = vzipq_s32(cols_0145_h.val[0], cols_2367_h.val[0]);
int32x4x2_t rows_26 = vzipq_s32(cols_0145_l.val[1], cols_2367_l.val[1]);
int32x4x2_t rows_37 = vzipq_s32(cols_0145_h.val[1], cols_2367_h.val[1]);
int16x8_t row0 = vreinterpretq_s16_s32(rows_04.val[0]);
int16x8_t row1 = vreinterpretq_s16_s32(rows_15.val[0]);
int16x8_t row2 = vreinterpretq_s16_s32(rows_26.val[0]);
int16x8_t row3 = vreinterpretq_s16_s32(rows_37.val[0]);
int16x8_t row4 = vreinterpretq_s16_s32(rows_04.val[1]);
int16x8_t row5 = vreinterpretq_s16_s32(rows_15.val[1]);
int16x8_t row6 = vreinterpretq_s16_s32(rows_26.val[1]);
int16x8_t row7 = vreinterpretq_s16_s32(rows_37.val[1]);
/* Pass 2: process columns. */
tmp0 = vaddq_s16(row0, row7);
tmp7 = vsubq_s16(row0, row7);
tmp1 = vaddq_s16(row1, row6);
tmp6 = vsubq_s16(row1, row6);
tmp2 = vaddq_s16(row2, row5);
tmp5 = vsubq_s16(row2, row5);
tmp3 = vaddq_s16(row3, row4);
tmp4 = vsubq_s16(row3, row4);
/* Even part */
tmp10 = vaddq_s16(tmp0, tmp3);
tmp13 = vsubq_s16(tmp0, tmp3);
tmp11 = vaddq_s16(tmp1, tmp2);
tmp12 = vsubq_s16(tmp1, tmp2);
row0 = vrshrq_n_s16(vaddq_s16(tmp10, tmp11), PASS1_BITS);
row4 = vrshrq_n_s16(vsubq_s16(tmp10, tmp11), PASS1_BITS);
tmp12_add_tmp13 = vaddq_s16(tmp12, tmp13);
z1_l = vmull_lane_s16(vget_low_s16(tmp12_add_tmp13), consts.val[0], 2);
z1_h = vmull_lane_s16(vget_high_s16(tmp12_add_tmp13), consts.val[0], 2);
int32x4_t row2_scaled_l =
vmlal_lane_s16(z1_l, vget_low_s16(tmp13), consts.val[0], 3);
int32x4_t row2_scaled_h =
vmlal_lane_s16(z1_h, vget_high_s16(tmp13), consts.val[0], 3);
row2 = vcombine_s16(vrshrn_n_s32(row2_scaled_l, DESCALE_P2),
vrshrn_n_s32(row2_scaled_h, DESCALE_P2));
int32x4_t row6_scaled_l =
vmlal_lane_s16(z1_l, vget_low_s16(tmp12), consts.val[1], 3);
int32x4_t row6_scaled_h =
vmlal_lane_s16(z1_h, vget_high_s16(tmp12), consts.val[1], 3);
row6 = vcombine_s16(vrshrn_n_s32(row6_scaled_l, DESCALE_P2),
vrshrn_n_s32(row6_scaled_h, DESCALE_P2));
/* Odd part */
z1 = vaddq_s16(tmp4, tmp7);
z2 = vaddq_s16(tmp5, tmp6);
z3 = vaddq_s16(tmp4, tmp6);
z4 = vaddq_s16(tmp5, tmp7);
/* sqrt(2) * c3 */
z5_l = vmull_lane_s16(vget_low_s16(z3), consts.val[1], 1);
z5_h = vmull_lane_s16(vget_high_s16(z3), consts.val[1], 1);
z5_l = vmlal_lane_s16(z5_l, vget_low_s16(z4), consts.val[1], 1);
z5_h = vmlal_lane_s16(z5_h, vget_high_s16(z4), consts.val[1], 1);
/* sqrt(2) * (-c1+c3+c5-c7) */
tmp4_l = vmull_lane_s16(vget_low_s16(tmp4), consts.val[0], 0);
tmp4_h = vmull_lane_s16(vget_high_s16(tmp4), consts.val[0], 0);
/* sqrt(2) * ( c1+c3-c5+c7) */
tmp5_l = vmull_lane_s16(vget_low_s16(tmp5), consts.val[2], 1);
tmp5_h = vmull_lane_s16(vget_high_s16(tmp5), consts.val[2], 1);
/* sqrt(2) * ( c1+c3+c5-c7) */
tmp6_l = vmull_lane_s16(vget_low_s16(tmp6), consts.val[2], 3);
tmp6_h = vmull_lane_s16(vget_high_s16(tmp6), consts.val[2], 3);
/* sqrt(2) * ( c1+c3-c5-c7) */
tmp7_l = vmull_lane_s16(vget_low_s16(tmp7), consts.val[1], 2);
tmp7_h = vmull_lane_s16(vget_high_s16(tmp7), consts.val[1], 2);
/* sqrt(2) * (c7-c3) */
z1_l = vmull_lane_s16(vget_low_s16(z1), consts.val[1], 0);
z1_h = vmull_lane_s16(vget_high_s16(z1), consts.val[1], 0);
/* sqrt(2) * (-c1-c3) */
z2_l = vmull_lane_s16(vget_low_s16(z2), consts.val[2], 2);
z2_h = vmull_lane_s16(vget_high_s16(z2), consts.val[2], 2);
/* sqrt(2) * (-c3-c5) */
z3_l = vmull_lane_s16(vget_low_s16(z3), consts.val[2], 0);
z3_h = vmull_lane_s16(vget_high_s16(z3), consts.val[2], 0);
/* sqrt(2) * (c5-c3) */
z4_l = vmull_lane_s16(vget_low_s16(z4), consts.val[0], 1);
z4_h = vmull_lane_s16(vget_high_s16(z4), consts.val[0], 1);
z3_l = vaddq_s32(z3_l, z5_l);
z3_h = vaddq_s32(z3_h, z5_h);
z4_l = vaddq_s32(z4_l, z5_l);
z4_h = vaddq_s32(z4_h, z5_h);
tmp4_l = vaddq_s32(tmp4_l, z1_l);
tmp4_h = vaddq_s32(tmp4_h, z1_h);
tmp4_l = vaddq_s32(tmp4_l, z3_l);
tmp4_h = vaddq_s32(tmp4_h, z3_h);
row7 = vcombine_s16(vrshrn_n_s32(tmp4_l, DESCALE_P2),
vrshrn_n_s32(tmp4_h, DESCALE_P2));
tmp5_l = vaddq_s32(tmp5_l, z2_l);
tmp5_h = vaddq_s32(tmp5_h, z2_h);
tmp5_l = vaddq_s32(tmp5_l, z4_l);
tmp5_h = vaddq_s32(tmp5_h, z4_h);
row5 = vcombine_s16(vrshrn_n_s32(tmp5_l, DESCALE_P2),
vrshrn_n_s32(tmp5_h, DESCALE_P2));
tmp6_l = vaddq_s32(tmp6_l, z2_l);
tmp6_h = vaddq_s32(tmp6_h, z2_h);
tmp6_l = vaddq_s32(tmp6_l, z3_l);
tmp6_h = vaddq_s32(tmp6_h, z3_h);
row3 = vcombine_s16(vrshrn_n_s32(tmp6_l, DESCALE_P2),
vrshrn_n_s32(tmp6_h, DESCALE_P2));
tmp7_l = vaddq_s32(tmp7_l, z1_l);
tmp7_h = vaddq_s32(tmp7_h, z1_h);
tmp7_l = vaddq_s32(tmp7_l, z4_l);
tmp7_h = vaddq_s32(tmp7_h, z4_h);
row1 = vcombine_s16(vrshrn_n_s32(tmp7_l, DESCALE_P2),
vrshrn_n_s32(tmp7_h, DESCALE_P2));
vst1q_s16(data + 0 * DCTSIZE, row0);
vst1q_s16(data + 1 * DCTSIZE, row1);
vst1q_s16(data + 2 * DCTSIZE, row2);
vst1q_s16(data + 3 * DCTSIZE, row3);
vst1q_s16(data + 4 * DCTSIZE, row4);
vst1q_s16(data + 5 * DCTSIZE, row5);
vst1q_s16(data + 6 * DCTSIZE, row6);
vst1q_s16(data + 7 * DCTSIZE, row7);
}

472
simd/arm/jidctfst-neon.c Normal file
View File

@@ -0,0 +1,472 @@
/*
* jidctfst-neon.c - fast integer IDCT (Arm Neon)
*
* Copyright (C) 2020, Arm Limited. All Rights Reserved.
*
* This software is provided 'as-is', without any express or implied
* warranty. In no event will the authors be held liable for any damages
* arising from the use of this software.
*
* Permission is granted to anyone to use this software for any purpose,
* including commercial applications, and to alter it and redistribute it
* freely, subject to the following restrictions:
*
* 1. The origin of this software must not be misrepresented; you must not
* claim that you wrote the original software. If you use this software
* in a product, an acknowledgment in the product documentation would be
* appreciated but is not required.
* 2. Altered source versions must be plainly marked as such, and must not be
* misrepresented as being the original software.
* 3. This notice may not be removed or altered from any source distribution.
*/
#define JPEG_INTERNALS
#include "../../jinclude.h"
#include "../../jpeglib.h"
#include "../../jsimd.h"
#include "../../jdct.h"
#include "../../jsimddct.h"
#include "../jsimd.h"
#include "align.h"
#include <arm_neon.h>
/* jsimd_idct_ifast_neon() performs dequantization and a fast, not so accurate
* inverse DCT (Discrete Cosine Transform) on one block of coefficients. It
* uses the same calculations and produces exactly the same output as IJG's
* original jpeg_idct_ifast() function, which can be found in jidctfst.c.
*
* Scaled integer constants are used to avoid floating-point arithmetic:
* 0.082392200 = 2688 * 2^-15
* 0.414213562 = 13568 * 2^-15
* 0.847759065 = 27776 * 2^-15
* 0.613125930 = 20096 * 2^-15
*
* See jidctfst.c for further details of the IDCT algorithm. Where possible,
* the variable names and comments here in jsimd_idct_ifast_neon() match up
* with those in jpeg_idct_ifast().
*/
#define PASS1_BITS 2
#define F_0_082 2688
#define F_0_414 13568
#define F_0_847 27776
#define F_0_613 20096
ALIGN(16) static const int16_t jsimd_idct_ifast_neon_consts[] = {
F_0_082, F_0_414, F_0_847, F_0_613
};
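/* Explanatory aside (added for clarity): constants >= 1.0 are handled by
 * multiplying by the fractional part with vqdmulh[q]_lane_s16(), which
 * computes approximately (x * const) >> 15, and then adding x back in:
 *   x * 1.414213562 ~= x + vqdmulh_lane_s16(x, consts, 1)
 *   x * 2.613125930 ~= 2 * x + vqdmulh_lane_s16(x, consts, 3)
 * This is how the MULTIPLY() calls from jpeg_idct_ifast() appear below.
 */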
void jsimd_idct_ifast_neon(void *dct_table, JCOEFPTR coef_block,
JSAMPARRAY output_buf, JDIMENSION output_col)
{
IFAST_MULT_TYPE *quantptr = dct_table;
/* Load DCT coefficients. */
int16x8_t row0 = vld1q_s16(coef_block + 0 * DCTSIZE);
int16x8_t row1 = vld1q_s16(coef_block + 1 * DCTSIZE);
int16x8_t row2 = vld1q_s16(coef_block + 2 * DCTSIZE);
int16x8_t row3 = vld1q_s16(coef_block + 3 * DCTSIZE);
int16x8_t row4 = vld1q_s16(coef_block + 4 * DCTSIZE);
int16x8_t row5 = vld1q_s16(coef_block + 5 * DCTSIZE);
int16x8_t row6 = vld1q_s16(coef_block + 6 * DCTSIZE);
int16x8_t row7 = vld1q_s16(coef_block + 7 * DCTSIZE);
/* Load quantization table values for DC coefficients. */
int16x8_t quant_row0 = vld1q_s16(quantptr + 0 * DCTSIZE);
/* Dequantize DC coefficients. */
row0 = vmulq_s16(row0, quant_row0);
/* Construct bitmap to test if all AC coefficients are 0. */
int16x8_t bitmap = vorrq_s16(row1, row2);
bitmap = vorrq_s16(bitmap, row3);
bitmap = vorrq_s16(bitmap, row4);
bitmap = vorrq_s16(bitmap, row5);
bitmap = vorrq_s16(bitmap, row6);
bitmap = vorrq_s16(bitmap, row7);
int64_t left_ac_bitmap = vgetq_lane_s64(vreinterpretq_s64_s16(bitmap), 0);
int64_t right_ac_bitmap = vgetq_lane_s64(vreinterpretq_s64_s16(bitmap), 1);
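  /* Note added for clarity: reinterpreting the 16-bit bitmap as two 64-bit
   * lanes means that left_ac_bitmap covers columns 0-3 and right_ac_bitmap
   * covers columns 4-7, so each half of the block can take the DC-only
   * shortcut below independently.
   */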
/* Load IDCT conversion constants. */
const int16x4_t consts = vld1_s16(jsimd_idct_ifast_neon_consts);
if (left_ac_bitmap == 0 && right_ac_bitmap == 0) {
/* All AC coefficients are zero.
* Compute DC values and duplicate into vectors.
*/
int16x8_t dcval = row0;
row1 = dcval;
row2 = dcval;
row3 = dcval;
row4 = dcval;
row5 = dcval;
row6 = dcval;
row7 = dcval;
} else if (left_ac_bitmap == 0) {
/* AC coefficients are zero for columns 0, 1, 2, and 3.
* Use DC values for these columns.
*/
int16x4_t dcval = vget_low_s16(row0);
/* Commence regular fast IDCT computation for columns 4, 5, 6, and 7. */
/* Load quantization table. */
int16x4_t quant_row1 = vld1_s16(quantptr + 1 * DCTSIZE + 4);
int16x4_t quant_row2 = vld1_s16(quantptr + 2 * DCTSIZE + 4);
int16x4_t quant_row3 = vld1_s16(quantptr + 3 * DCTSIZE + 4);
int16x4_t quant_row4 = vld1_s16(quantptr + 4 * DCTSIZE + 4);
int16x4_t quant_row5 = vld1_s16(quantptr + 5 * DCTSIZE + 4);
int16x4_t quant_row6 = vld1_s16(quantptr + 6 * DCTSIZE + 4);
int16x4_t quant_row7 = vld1_s16(quantptr + 7 * DCTSIZE + 4);
/* Even part: dequantize DCT coefficients. */
int16x4_t tmp0 = vget_high_s16(row0);
int16x4_t tmp1 = vmul_s16(vget_high_s16(row2), quant_row2);
int16x4_t tmp2 = vmul_s16(vget_high_s16(row4), quant_row4);
int16x4_t tmp3 = vmul_s16(vget_high_s16(row6), quant_row6);
int16x4_t tmp10 = vadd_s16(tmp0, tmp2); /* phase 3 */
int16x4_t tmp11 = vsub_s16(tmp0, tmp2);
int16x4_t tmp13 = vadd_s16(tmp1, tmp3); /* phases 5-3 */
int16x4_t tmp1_sub_tmp3 = vsub_s16(tmp1, tmp3);
int16x4_t tmp12 = vqdmulh_lane_s16(tmp1_sub_tmp3, consts, 1);
tmp12 = vadd_s16(tmp12, tmp1_sub_tmp3);
tmp12 = vsub_s16(tmp12, tmp13);
tmp0 = vadd_s16(tmp10, tmp13); /* phase 2 */
tmp3 = vsub_s16(tmp10, tmp13);
tmp1 = vadd_s16(tmp11, tmp12);
tmp2 = vsub_s16(tmp11, tmp12);
/* Odd part: dequantize DCT coefficients. */
int16x4_t tmp4 = vmul_s16(vget_high_s16(row1), quant_row1);
int16x4_t tmp5 = vmul_s16(vget_high_s16(row3), quant_row3);
int16x4_t tmp6 = vmul_s16(vget_high_s16(row5), quant_row5);
int16x4_t tmp7 = vmul_s16(vget_high_s16(row7), quant_row7);
int16x4_t z13 = vadd_s16(tmp6, tmp5); /* phase 6 */
int16x4_t neg_z10 = vsub_s16(tmp5, tmp6);
int16x4_t z11 = vadd_s16(tmp4, tmp7);
int16x4_t z12 = vsub_s16(tmp4, tmp7);
tmp7 = vadd_s16(z11, z13); /* phase 5 */
int16x4_t z11_sub_z13 = vsub_s16(z11, z13);
tmp11 = vqdmulh_lane_s16(z11_sub_z13, consts, 1);
tmp11 = vadd_s16(tmp11, z11_sub_z13);
int16x4_t z10_add_z12 = vsub_s16(z12, neg_z10);
int16x4_t z5 = vqdmulh_lane_s16(z10_add_z12, consts, 2);
z5 = vadd_s16(z5, z10_add_z12);
tmp10 = vqdmulh_lane_s16(z12, consts, 0);
tmp10 = vadd_s16(tmp10, z12);
tmp10 = vsub_s16(tmp10, z5);
tmp12 = vqdmulh_lane_s16(neg_z10, consts, 3);
tmp12 = vadd_s16(tmp12, vadd_s16(neg_z10, neg_z10));
tmp12 = vadd_s16(tmp12, z5);
tmp6 = vsub_s16(tmp12, tmp7); /* phase 2 */
tmp5 = vsub_s16(tmp11, tmp6);
tmp4 = vadd_s16(tmp10, tmp5);
row0 = vcombine_s16(dcval, vadd_s16(tmp0, tmp7));
row7 = vcombine_s16(dcval, vsub_s16(tmp0, tmp7));
row1 = vcombine_s16(dcval, vadd_s16(tmp1, tmp6));
row6 = vcombine_s16(dcval, vsub_s16(tmp1, tmp6));
row2 = vcombine_s16(dcval, vadd_s16(tmp2, tmp5));
row5 = vcombine_s16(dcval, vsub_s16(tmp2, tmp5));
row4 = vcombine_s16(dcval, vadd_s16(tmp3, tmp4));
row3 = vcombine_s16(dcval, vsub_s16(tmp3, tmp4));
} else if (right_ac_bitmap == 0) {
/* AC coefficients are zero for columns 4, 5, 6, and 7.
* Use DC values for these columns.
*/
int16x4_t dcval = vget_high_s16(row0);
/* Commence regular fast IDCT computation for columns 0, 1, 2, and 3. */
/* Load quantization table. */
int16x4_t quant_row1 = vld1_s16(quantptr + 1 * DCTSIZE);
int16x4_t quant_row2 = vld1_s16(quantptr + 2 * DCTSIZE);
int16x4_t quant_row3 = vld1_s16(quantptr + 3 * DCTSIZE);
int16x4_t quant_row4 = vld1_s16(quantptr + 4 * DCTSIZE);
int16x4_t quant_row5 = vld1_s16(quantptr + 5 * DCTSIZE);
int16x4_t quant_row6 = vld1_s16(quantptr + 6 * DCTSIZE);
int16x4_t quant_row7 = vld1_s16(quantptr + 7 * DCTSIZE);
/* Even part: dequantize DCT coefficients. */
int16x4_t tmp0 = vget_low_s16(row0);
int16x4_t tmp1 = vmul_s16(vget_low_s16(row2), quant_row2);
int16x4_t tmp2 = vmul_s16(vget_low_s16(row4), quant_row4);
int16x4_t tmp3 = vmul_s16(vget_low_s16(row6), quant_row6);
int16x4_t tmp10 = vadd_s16(tmp0, tmp2); /* phase 3 */
int16x4_t tmp11 = vsub_s16(tmp0, tmp2);
int16x4_t tmp13 = vadd_s16(tmp1, tmp3); /* phases 5-3 */
int16x4_t tmp1_sub_tmp3 = vsub_s16(tmp1, tmp3);
int16x4_t tmp12 = vqdmulh_lane_s16(tmp1_sub_tmp3, consts, 1);
tmp12 = vadd_s16(tmp12, tmp1_sub_tmp3);
tmp12 = vsub_s16(tmp12, tmp13);
tmp0 = vadd_s16(tmp10, tmp13); /* phase 2 */
tmp3 = vsub_s16(tmp10, tmp13);
tmp1 = vadd_s16(tmp11, tmp12);
tmp2 = vsub_s16(tmp11, tmp12);
/* Odd part: dequantize DCT coefficients. */
int16x4_t tmp4 = vmul_s16(vget_low_s16(row1), quant_row1);
int16x4_t tmp5 = vmul_s16(vget_low_s16(row3), quant_row3);
int16x4_t tmp6 = vmul_s16(vget_low_s16(row5), quant_row5);
int16x4_t tmp7 = vmul_s16(vget_low_s16(row7), quant_row7);
int16x4_t z13 = vadd_s16(tmp6, tmp5); /* phase 6 */
int16x4_t neg_z10 = vsub_s16(tmp5, tmp6);
int16x4_t z11 = vadd_s16(tmp4, tmp7);
int16x4_t z12 = vsub_s16(tmp4, tmp7);
tmp7 = vadd_s16(z11, z13); /* phase 5 */
int16x4_t z11_sub_z13 = vsub_s16(z11, z13);
tmp11 = vqdmulh_lane_s16(z11_sub_z13, consts, 1);
tmp11 = vadd_s16(tmp11, z11_sub_z13);
int16x4_t z10_add_z12 = vsub_s16(z12, neg_z10);
int16x4_t z5 = vqdmulh_lane_s16(z10_add_z12, consts, 2);
z5 = vadd_s16(z5, z10_add_z12);
tmp10 = vqdmulh_lane_s16(z12, consts, 0);
tmp10 = vadd_s16(tmp10, z12);
tmp10 = vsub_s16(tmp10, z5);
tmp12 = vqdmulh_lane_s16(neg_z10, consts, 3);
tmp12 = vadd_s16(tmp12, vadd_s16(neg_z10, neg_z10));
tmp12 = vadd_s16(tmp12, z5);
tmp6 = vsub_s16(tmp12, tmp7); /* phase 2 */
tmp5 = vsub_s16(tmp11, tmp6);
tmp4 = vadd_s16(tmp10, tmp5);
row0 = vcombine_s16(vadd_s16(tmp0, tmp7), dcval);
row7 = vcombine_s16(vsub_s16(tmp0, tmp7), dcval);
row1 = vcombine_s16(vadd_s16(tmp1, tmp6), dcval);
row6 = vcombine_s16(vsub_s16(tmp1, tmp6), dcval);
row2 = vcombine_s16(vadd_s16(tmp2, tmp5), dcval);
row5 = vcombine_s16(vsub_s16(tmp2, tmp5), dcval);
row4 = vcombine_s16(vadd_s16(tmp3, tmp4), dcval);
row3 = vcombine_s16(vsub_s16(tmp3, tmp4), dcval);
} else {
/* Some AC coefficients are non-zero; full IDCT calculation required. */
/* Load quantization table. */
int16x8_t quant_row1 = vld1q_s16(quantptr + 1 * DCTSIZE);
int16x8_t quant_row2 = vld1q_s16(quantptr + 2 * DCTSIZE);
int16x8_t quant_row3 = vld1q_s16(quantptr + 3 * DCTSIZE);
int16x8_t quant_row4 = vld1q_s16(quantptr + 4 * DCTSIZE);
int16x8_t quant_row5 = vld1q_s16(quantptr + 5 * DCTSIZE);
int16x8_t quant_row6 = vld1q_s16(quantptr + 6 * DCTSIZE);
int16x8_t quant_row7 = vld1q_s16(quantptr + 7 * DCTSIZE);
/* Even part: dequantize DCT coefficients. */
int16x8_t tmp0 = row0;
int16x8_t tmp1 = vmulq_s16(row2, quant_row2);
int16x8_t tmp2 = vmulq_s16(row4, quant_row4);
int16x8_t tmp3 = vmulq_s16(row6, quant_row6);
int16x8_t tmp10 = vaddq_s16(tmp0, tmp2); /* phase 3 */
int16x8_t tmp11 = vsubq_s16(tmp0, tmp2);
int16x8_t tmp13 = vaddq_s16(tmp1, tmp3); /* phases 5-3 */
int16x8_t tmp1_sub_tmp3 = vsubq_s16(tmp1, tmp3);
int16x8_t tmp12 = vqdmulhq_lane_s16(tmp1_sub_tmp3, consts, 1);
tmp12 = vaddq_s16(tmp12, tmp1_sub_tmp3);
tmp12 = vsubq_s16(tmp12, tmp13);
tmp0 = vaddq_s16(tmp10, tmp13); /* phase 2 */
tmp3 = vsubq_s16(tmp10, tmp13);
tmp1 = vaddq_s16(tmp11, tmp12);
tmp2 = vsubq_s16(tmp11, tmp12);
/* Odd part: dequantize DCT coefficients. */
int16x8_t tmp4 = vmulq_s16(row1, quant_row1);
int16x8_t tmp5 = vmulq_s16(row3, quant_row3);
int16x8_t tmp6 = vmulq_s16(row5, quant_row5);
int16x8_t tmp7 = vmulq_s16(row7, quant_row7);
int16x8_t z13 = vaddq_s16(tmp6, tmp5); /* phase 6 */
int16x8_t neg_z10 = vsubq_s16(tmp5, tmp6);
int16x8_t z11 = vaddq_s16(tmp4, tmp7);
int16x8_t z12 = vsubq_s16(tmp4, tmp7);
tmp7 = vaddq_s16(z11, z13); /* phase 5 */
int16x8_t z11_sub_z13 = vsubq_s16(z11, z13);
tmp11 = vqdmulhq_lane_s16(z11_sub_z13, consts, 1);
tmp11 = vaddq_s16(tmp11, z11_sub_z13);
int16x8_t z10_add_z12 = vsubq_s16(z12, neg_z10);
int16x8_t z5 = vqdmulhq_lane_s16(z10_add_z12, consts, 2);
z5 = vaddq_s16(z5, z10_add_z12);
tmp10 = vqdmulhq_lane_s16(z12, consts, 0);
tmp10 = vaddq_s16(tmp10, z12);
tmp10 = vsubq_s16(tmp10, z5);
tmp12 = vqdmulhq_lane_s16(neg_z10, consts, 3);
tmp12 = vaddq_s16(tmp12, vaddq_s16(neg_z10, neg_z10));
tmp12 = vaddq_s16(tmp12, z5);
tmp6 = vsubq_s16(tmp12, tmp7); /* phase 2 */
tmp5 = vsubq_s16(tmp11, tmp6);
tmp4 = vaddq_s16(tmp10, tmp5);
row0 = vaddq_s16(tmp0, tmp7);
row7 = vsubq_s16(tmp0, tmp7);
row1 = vaddq_s16(tmp1, tmp6);
row6 = vsubq_s16(tmp1, tmp6);
row2 = vaddq_s16(tmp2, tmp5);
row5 = vsubq_s16(tmp2, tmp5);
row4 = vaddq_s16(tmp3, tmp4);
row3 = vsubq_s16(tmp3, tmp4);
}
/* Transpose rows to work on columns in pass 2. */
int16x8x2_t rows_01 = vtrnq_s16(row0, row1);
int16x8x2_t rows_23 = vtrnq_s16(row2, row3);
int16x8x2_t rows_45 = vtrnq_s16(row4, row5);
int16x8x2_t rows_67 = vtrnq_s16(row6, row7);
int32x4x2_t rows_0145_l = vtrnq_s32(vreinterpretq_s32_s16(rows_01.val[0]),
vreinterpretq_s32_s16(rows_45.val[0]));
int32x4x2_t rows_0145_h = vtrnq_s32(vreinterpretq_s32_s16(rows_01.val[1]),
vreinterpretq_s32_s16(rows_45.val[1]));
int32x4x2_t rows_2367_l = vtrnq_s32(vreinterpretq_s32_s16(rows_23.val[0]),
vreinterpretq_s32_s16(rows_67.val[0]));
int32x4x2_t rows_2367_h = vtrnq_s32(vreinterpretq_s32_s16(rows_23.val[1]),
vreinterpretq_s32_s16(rows_67.val[1]));
int32x4x2_t cols_04 = vzipq_s32(rows_0145_l.val[0], rows_2367_l.val[0]);
int32x4x2_t cols_15 = vzipq_s32(rows_0145_h.val[0], rows_2367_h.val[0]);
int32x4x2_t cols_26 = vzipq_s32(rows_0145_l.val[1], rows_2367_l.val[1]);
int32x4x2_t cols_37 = vzipq_s32(rows_0145_h.val[1], rows_2367_h.val[1]);
int16x8_t col0 = vreinterpretq_s16_s32(cols_04.val[0]);
int16x8_t col1 = vreinterpretq_s16_s32(cols_15.val[0]);
int16x8_t col2 = vreinterpretq_s16_s32(cols_26.val[0]);
int16x8_t col3 = vreinterpretq_s16_s32(cols_37.val[0]);
int16x8_t col4 = vreinterpretq_s16_s32(cols_04.val[1]);
int16x8_t col5 = vreinterpretq_s16_s32(cols_15.val[1]);
int16x8_t col6 = vreinterpretq_s16_s32(cols_26.val[1]);
int16x8_t col7 = vreinterpretq_s16_s32(cols_37.val[1]);
/* 1-D IDCT, pass 2 */
/* Even part */
int16x8_t tmp10 = vaddq_s16(col0, col4);
int16x8_t tmp11 = vsubq_s16(col0, col4);
int16x8_t tmp13 = vaddq_s16(col2, col6);
int16x8_t col2_sub_col6 = vsubq_s16(col2, col6);
int16x8_t tmp12 = vqdmulhq_lane_s16(col2_sub_col6, consts, 1);
tmp12 = vaddq_s16(tmp12, col2_sub_col6);
tmp12 = vsubq_s16(tmp12, tmp13);
int16x8_t tmp0 = vaddq_s16(tmp10, tmp13);
int16x8_t tmp3 = vsubq_s16(tmp10, tmp13);
int16x8_t tmp1 = vaddq_s16(tmp11, tmp12);
int16x8_t tmp2 = vsubq_s16(tmp11, tmp12);
/* Odd part */
int16x8_t z13 = vaddq_s16(col5, col3);
int16x8_t neg_z10 = vsubq_s16(col3, col5);
int16x8_t z11 = vaddq_s16(col1, col7);
int16x8_t z12 = vsubq_s16(col1, col7);
int16x8_t tmp7 = vaddq_s16(z11, z13); /* phase 5 */
int16x8_t z11_sub_z13 = vsubq_s16(z11, z13);
tmp11 = vqdmulhq_lane_s16(z11_sub_z13, consts, 1);
tmp11 = vaddq_s16(tmp11, z11_sub_z13);
int16x8_t z10_add_z12 = vsubq_s16(z12, neg_z10);
int16x8_t z5 = vqdmulhq_lane_s16(z10_add_z12, consts, 2);
z5 = vaddq_s16(z5, z10_add_z12);
tmp10 = vqdmulhq_lane_s16(z12, consts, 0);
tmp10 = vaddq_s16(tmp10, z12);
tmp10 = vsubq_s16(tmp10, z5);
tmp12 = vqdmulhq_lane_s16(neg_z10, consts, 3);
tmp12 = vaddq_s16(tmp12, vaddq_s16(neg_z10, neg_z10));
tmp12 = vaddq_s16(tmp12, z5);
int16x8_t tmp6 = vsubq_s16(tmp12, tmp7); /* phase 2 */
int16x8_t tmp5 = vsubq_s16(tmp11, tmp6);
int16x8_t tmp4 = vaddq_s16(tmp10, tmp5);
col0 = vaddq_s16(tmp0, tmp7);
col7 = vsubq_s16(tmp0, tmp7);
col1 = vaddq_s16(tmp1, tmp6);
col6 = vsubq_s16(tmp1, tmp6);
col2 = vaddq_s16(tmp2, tmp5);
col5 = vsubq_s16(tmp2, tmp5);
col4 = vaddq_s16(tmp3, tmp4);
col3 = vsubq_s16(tmp3, tmp4);
/* Scale down by a factor of 8, narrowing to 8-bit. */
int8x16_t cols_01_s8 = vcombine_s8(vqshrn_n_s16(col0, PASS1_BITS + 3),
vqshrn_n_s16(col1, PASS1_BITS + 3));
int8x16_t cols_45_s8 = vcombine_s8(vqshrn_n_s16(col4, PASS1_BITS + 3),
vqshrn_n_s16(col5, PASS1_BITS + 3));
int8x16_t cols_23_s8 = vcombine_s8(vqshrn_n_s16(col2, PASS1_BITS + 3),
vqshrn_n_s16(col3, PASS1_BITS + 3));
int8x16_t cols_67_s8 = vcombine_s8(vqshrn_n_s16(col6, PASS1_BITS + 3),
vqshrn_n_s16(col7, PASS1_BITS + 3));
/* Clamp to range [0-255]. */
uint8x16_t cols_01 =
vreinterpretq_u8_s8
(vaddq_s8(cols_01_s8, vreinterpretq_s8_u8(vdupq_n_u8(CENTERJSAMPLE))));
uint8x16_t cols_45 =
vreinterpretq_u8_s8
(vaddq_s8(cols_45_s8, vreinterpretq_s8_u8(vdupq_n_u8(CENTERJSAMPLE))));
uint8x16_t cols_23 =
vreinterpretq_u8_s8
(vaddq_s8(cols_23_s8, vreinterpretq_s8_u8(vdupq_n_u8(CENTERJSAMPLE))));
uint8x16_t cols_67 =
vreinterpretq_u8_s8
(vaddq_s8(cols_67_s8, vreinterpretq_s8_u8(vdupq_n_u8(CENTERJSAMPLE))));
/* Transpose block to prepare for store. */
uint32x4x2_t cols_0415 = vzipq_u32(vreinterpretq_u32_u8(cols_01),
vreinterpretq_u32_u8(cols_45));
uint32x4x2_t cols_2637 = vzipq_u32(vreinterpretq_u32_u8(cols_23),
vreinterpretq_u32_u8(cols_67));
uint8x16x2_t cols_0145 = vtrnq_u8(vreinterpretq_u8_u32(cols_0415.val[0]),
vreinterpretq_u8_u32(cols_0415.val[1]));
uint8x16x2_t cols_2367 = vtrnq_u8(vreinterpretq_u8_u32(cols_2637.val[0]),
vreinterpretq_u8_u32(cols_2637.val[1]));
uint16x8x2_t rows_0426 = vtrnq_u16(vreinterpretq_u16_u8(cols_0145.val[0]),
vreinterpretq_u16_u8(cols_2367.val[0]));
uint16x8x2_t rows_1537 = vtrnq_u16(vreinterpretq_u16_u8(cols_0145.val[1]),
vreinterpretq_u16_u8(cols_2367.val[1]));
uint8x16_t rows_04 = vreinterpretq_u8_u16(rows_0426.val[0]);
uint8x16_t rows_15 = vreinterpretq_u8_u16(rows_1537.val[0]);
uint8x16_t rows_26 = vreinterpretq_u8_u16(rows_0426.val[1]);
uint8x16_t rows_37 = vreinterpretq_u8_u16(rows_1537.val[1]);
JSAMPROW outptr0 = output_buf[0] + output_col;
JSAMPROW outptr1 = output_buf[1] + output_col;
JSAMPROW outptr2 = output_buf[2] + output_col;
JSAMPROW outptr3 = output_buf[3] + output_col;
JSAMPROW outptr4 = output_buf[4] + output_col;
JSAMPROW outptr5 = output_buf[5] + output_col;
JSAMPROW outptr6 = output_buf[6] + output_col;
JSAMPROW outptr7 = output_buf[7] + output_col;
/* Store DCT block to memory. */
vst1q_lane_u64((uint64_t *)outptr0, vreinterpretq_u64_u8(rows_04), 0);
vst1q_lane_u64((uint64_t *)outptr1, vreinterpretq_u64_u8(rows_15), 0);
vst1q_lane_u64((uint64_t *)outptr2, vreinterpretq_u64_u8(rows_26), 0);
vst1q_lane_u64((uint64_t *)outptr3, vreinterpretq_u64_u8(rows_37), 0);
vst1q_lane_u64((uint64_t *)outptr4, vreinterpretq_u64_u8(rows_04), 1);
vst1q_lane_u64((uint64_t *)outptr5, vreinterpretq_u64_u8(rows_15), 1);
vst1q_lane_u64((uint64_t *)outptr6, vreinterpretq_u64_u8(rows_26), 1);
vst1q_lane_u64((uint64_t *)outptr7, vreinterpretq_u64_u8(rows_37), 1);
}
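/* Scalar sketch of the saturate-narrow-and-recenter step above.  This is
 * illustrative only: descale_and_recenter_sketch() is a hypothetical helper,
 * not part of the library.  It assumes the same descale amount
 * (PASS1_BITS + 3) as the VQSHRN narrowing and applies the CENTERJSAMPLE
 * offset that the vector code adds via unsigned wraparound.
 */
static JSAMPLE descale_and_recenter_sketch(int col)
{
  col >>= PASS1_BITS + 3;                          /* scale down */
  if (col < -CENTERJSAMPLE) col = -CENTERJSAMPLE;  /* saturate like VQSHRN */
  if (col > CENTERJSAMPLE - 1) col = CENTERJSAMPLE - 1;
  return (JSAMPLE)(col + CENTERJSAMPLE);           /* re-center into [0, 255] */
}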

802
simd/arm/jidctint-neon.c Normal file
View File

@@ -0,0 +1,802 @@
/*
* jidctint-neon.c - accurate integer IDCT (Arm Neon)
*
* Copyright (C) 2020, Arm Limited. All Rights Reserved.
* Copyright (C) 2020, D. R. Commander. All Rights Reserved.
*
* This software is provided 'as-is', without any express or implied
* warranty. In no event will the authors be held liable for any damages
* arising from the use of this software.
*
* Permission is granted to anyone to use this software for any purpose,
* including commercial applications, and to alter it and redistribute it
* freely, subject to the following restrictions:
*
* 1. The origin of this software must not be misrepresented; you must not
* claim that you wrote the original software. If you use this software
* in a product, an acknowledgment in the product documentation would be
* appreciated but is not required.
* 2. Altered source versions must be plainly marked as such, and must not be
* misrepresented as being the original software.
* 3. This notice may not be removed or altered from any source distribution.
*/
#define JPEG_INTERNALS
#include "jconfigint.h"
#include "../../jinclude.h"
#include "../../jpeglib.h"
#include "../../jsimd.h"
#include "../../jdct.h"
#include "../../jsimddct.h"
#include "../jsimd.h"
#include "align.h"
#include "neon-compat.h"
#include <arm_neon.h>
#define CONST_BITS 13
#define PASS1_BITS 2
#define DESCALE_P1 (CONST_BITS - PASS1_BITS)
#define DESCALE_P2 (CONST_BITS + PASS1_BITS + 3)
/* The computation of the inverse DCT requires the use of constants known at
* compile time. Scaled integer constants are used to avoid floating-point
* arithmetic:
* 0.298631336 = 2446 * 2^-13
* 0.390180644 = 3196 * 2^-13
* 0.541196100 = 4433 * 2^-13
* 0.765366865 = 6270 * 2^-13
* 0.899976223 = 7373 * 2^-13
* 1.175875602 = 9633 * 2^-13
* 1.501321110 = 12299 * 2^-13
* 1.847759065 = 15137 * 2^-13
* 1.961570560 = 16069 * 2^-13
* 2.053119869 = 16819 * 2^-13
* 2.562915447 = 20995 * 2^-13
* 3.072711026 = 25172 * 2^-13
*/
#define F_0_298 2446
#define F_0_390 3196
#define F_0_541 4433
#define F_0_765 6270
#define F_0_899 7373
#define F_1_175 9633
#define F_1_501 12299
#define F_1_847 15137
#define F_1_961 16069
#define F_2_053 16819
#define F_2_562 20995
#define F_3_072 25172
#define F_1_175_MINUS_1_961 (F_1_175 - F_1_961)
#define F_1_175_MINUS_0_390 (F_1_175 - F_0_390)
#define F_0_541_MINUS_1_847 (F_0_541 - F_1_847)
#define F_3_072_MINUS_2_562 (F_3_072 - F_2_562)
#define F_0_298_MINUS_0_899 (F_0_298 - F_0_899)
#define F_1_501_MINUS_0_899 (F_1_501 - F_0_899)
#define F_2_053_MINUS_2_562 (F_2_053 - F_2_562)
#define F_0_541_PLUS_0_765 (F_0_541 + F_0_765)
ALIGN(16) static const int16_t jsimd_idct_islow_neon_consts[] = {
F_0_899, F_0_541,
F_2_562, F_0_298_MINUS_0_899,
F_1_501_MINUS_0_899, F_2_053_MINUS_2_562,
F_0_541_PLUS_0_765, F_1_175,
F_1_175_MINUS_0_390, F_0_541_MINUS_1_847,
F_3_072_MINUS_2_562, F_1_175_MINUS_1_961,
0, 0, 0, 0
};
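/* For reference, the scaled constants above can be reproduced (assuming
 * round-to-nearest) with a helper like the one below.  FIX_CONST_SKETCH() is
 * purely illustrative and is not used elsewhere in this file; e.g.
 * FIX_CONST_SKETCH(0.541196100) == 4433 and
 * FIX_CONST_SKETCH(1.175875602) == 9633 with CONST_BITS == 13.
 */
#define FIX_CONST_SKETCH(x)  ((int16_t)((x) * (1 << CONST_BITS) + 0.5))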
/* Forward declaration of regular and sparse IDCT helper functions */
static INLINE void jsimd_idct_islow_pass1_regular(int16x4_t row0,
int16x4_t row1,
int16x4_t row2,
int16x4_t row3,
int16x4_t row4,
int16x4_t row5,
int16x4_t row6,
int16x4_t row7,
int16x4_t quant_row0,
int16x4_t quant_row1,
int16x4_t quant_row2,
int16x4_t quant_row3,
int16x4_t quant_row4,
int16x4_t quant_row5,
int16x4_t quant_row6,
int16x4_t quant_row7,
int16_t *workspace_1,
int16_t *workspace_2);
static INLINE void jsimd_idct_islow_pass1_sparse(int16x4_t row0,
int16x4_t row1,
int16x4_t row2,
int16x4_t row3,
int16x4_t quant_row0,
int16x4_t quant_row1,
int16x4_t quant_row2,
int16x4_t quant_row3,
int16_t *workspace_1,
int16_t *workspace_2);
static INLINE void jsimd_idct_islow_pass2_regular(int16_t *workspace,
JSAMPARRAY output_buf,
JDIMENSION output_col,
unsigned buf_offset);
static INLINE void jsimd_idct_islow_pass2_sparse(int16_t *workspace,
JSAMPARRAY output_buf,
JDIMENSION output_col,
unsigned buf_offset);
/* Perform dequantization and inverse DCT on one block of coefficients. For
* reference, the C implementation (jpeg_idct_slow()) can be found in
* jidctint.c.
*
* Optimization techniques used for fast data access:
*
* In each pass, the inverse DCT is computed for the left and right 4x8 halves
* of the DCT block. This avoids spilling due to register pressure, and the
* increased granularity allows for an optimized calculation depending on the
* values of the DCT coefficients. Between passes, intermediate data is stored
* in 4x8 workspace buffers.
*
* Transposing the 8x8 DCT block after each pass can be achieved by transposing
* each of the four 4x4 quadrants and swapping quadrants 1 and 2 (refer to the
* diagram below.) Swapping quadrants is cheap, since the second pass can just
* swap the workspace buffer pointers.
*
* +-------+-------+ +-------+-------+
* | | | | | |
* | 0 | 1 | | 0 | 2 |
* | | | transpose | | |
* +-------+-------+ ------> +-------+-------+
* | | | | | |
* | 2 | 3 | | 1 | 3 |
* | | | | | |
* +-------+-------+ +-------+-------+
*
* Optimization techniques used to accelerate the inverse DCT calculation:
*
* In a DCT coefficient block, the coefficients are increasingly likely to be 0
* as you move diagonally from top left to bottom right. If whole rows of
* coefficients are 0, then the inverse DCT calculation can be simplified. On
* the first pass of the inverse DCT, we test for three special cases before
* defaulting to a full "regular" inverse DCT:
*
* 1) Coefficients in rows 4-7 are all zero. In this case, we perform a
* "sparse" simplified inverse DCT on rows 0-3.
* 2) AC coefficients (rows 1-7) are all zero. In this case, the inverse DCT
* result is equal to the dequantized DC coefficients.
* 3) AC and DC coefficients are all zero. In this case, the inverse DCT
* result is all zero. For the left 4x8 half, this is handled identically
* to Case 2 above. For the right 4x8 half, we do no work and signal that
* the "sparse" algorithm is required for the second pass.
*
* In the second pass, only a single special case is tested: whether the AC and
* DC coefficients were all zero in the right 4x8 block during the first pass
* (refer to Case 3 above.) If this is the case, then a "sparse" variant of
* the second pass is performed for both the left and right halves of the DCT
* block. (The transposition after the first pass means that the right 4x8
* block during the first pass becomes rows 4-7 during the second pass.)
*/
void jsimd_idct_islow_neon(void *dct_table, JCOEFPTR coef_block,
JSAMPARRAY output_buf, JDIMENSION output_col)
{
ISLOW_MULT_TYPE *quantptr = dct_table;
int16_t workspace_l[8 * DCTSIZE / 2];
int16_t workspace_r[8 * DCTSIZE / 2];
/* Compute IDCT first pass on left 4x8 coefficient block. */
/* Load DCT coefficients in left 4x8 block. */
int16x4_t row0 = vld1_s16(coef_block + 0 * DCTSIZE);
int16x4_t row1 = vld1_s16(coef_block + 1 * DCTSIZE);
int16x4_t row2 = vld1_s16(coef_block + 2 * DCTSIZE);
int16x4_t row3 = vld1_s16(coef_block + 3 * DCTSIZE);
int16x4_t row4 = vld1_s16(coef_block + 4 * DCTSIZE);
int16x4_t row5 = vld1_s16(coef_block + 5 * DCTSIZE);
int16x4_t row6 = vld1_s16(coef_block + 6 * DCTSIZE);
int16x4_t row7 = vld1_s16(coef_block + 7 * DCTSIZE);
/* Load quantization table for left 4x8 block. */
int16x4_t quant_row0 = vld1_s16(quantptr + 0 * DCTSIZE);
int16x4_t quant_row1 = vld1_s16(quantptr + 1 * DCTSIZE);
int16x4_t quant_row2 = vld1_s16(quantptr + 2 * DCTSIZE);
int16x4_t quant_row3 = vld1_s16(quantptr + 3 * DCTSIZE);
int16x4_t quant_row4 = vld1_s16(quantptr + 4 * DCTSIZE);
int16x4_t quant_row5 = vld1_s16(quantptr + 5 * DCTSIZE);
int16x4_t quant_row6 = vld1_s16(quantptr + 6 * DCTSIZE);
int16x4_t quant_row7 = vld1_s16(quantptr + 7 * DCTSIZE);
/* Construct bitmap to test if DCT coefficients in left 4x8 block are 0. */
int16x4_t bitmap = vorr_s16(row7, row6);
bitmap = vorr_s16(bitmap, row5);
bitmap = vorr_s16(bitmap, row4);
int64_t bitmap_rows_4567 = vget_lane_s64(vreinterpret_s64_s16(bitmap), 0);
if (bitmap_rows_4567 == 0) {
bitmap = vorr_s16(bitmap, row3);
bitmap = vorr_s16(bitmap, row2);
bitmap = vorr_s16(bitmap, row1);
int64_t left_ac_bitmap = vget_lane_s64(vreinterpret_s64_s16(bitmap), 0);
if (left_ac_bitmap == 0) {
int16x4_t dcval = vshl_n_s16(vmul_s16(row0, quant_row0), PASS1_BITS);
int16x4x4_t quadrant = { { dcval, dcval, dcval, dcval } };
/* Store 4x4 blocks to workspace, transposing in the process. */
vst4_s16(workspace_l, quadrant);
vst4_s16(workspace_r, quadrant);
} else {
jsimd_idct_islow_pass1_sparse(row0, row1, row2, row3, quant_row0,
quant_row1, quant_row2, quant_row3,
workspace_l, workspace_r);
}
} else {
jsimd_idct_islow_pass1_regular(row0, row1, row2, row3, row4, row5,
row6, row7, quant_row0, quant_row1,
quant_row2, quant_row3, quant_row4,
quant_row5, quant_row6, quant_row7,
workspace_l, workspace_r);
}
/* Compute IDCT first pass on right 4x8 coefficient block. */
/* Load DCT coefficients in right 4x8 block. */
row0 = vld1_s16(coef_block + 0 * DCTSIZE + 4);
row1 = vld1_s16(coef_block + 1 * DCTSIZE + 4);
row2 = vld1_s16(coef_block + 2 * DCTSIZE + 4);
row3 = vld1_s16(coef_block + 3 * DCTSIZE + 4);
row4 = vld1_s16(coef_block + 4 * DCTSIZE + 4);
row5 = vld1_s16(coef_block + 5 * DCTSIZE + 4);
row6 = vld1_s16(coef_block + 6 * DCTSIZE + 4);
row7 = vld1_s16(coef_block + 7 * DCTSIZE + 4);
/* Load quantization table for right 4x8 block. */
quant_row0 = vld1_s16(quantptr + 0 * DCTSIZE + 4);
quant_row1 = vld1_s16(quantptr + 1 * DCTSIZE + 4);
quant_row2 = vld1_s16(quantptr + 2 * DCTSIZE + 4);
quant_row3 = vld1_s16(quantptr + 3 * DCTSIZE + 4);
quant_row4 = vld1_s16(quantptr + 4 * DCTSIZE + 4);
quant_row5 = vld1_s16(quantptr + 5 * DCTSIZE + 4);
quant_row6 = vld1_s16(quantptr + 6 * DCTSIZE + 4);
quant_row7 = vld1_s16(quantptr + 7 * DCTSIZE + 4);
/* Construct bitmap to test if DCT coefficients in right 4x8 block are 0. */
bitmap = vorr_s16(row7, row6);
bitmap = vorr_s16(bitmap, row5);
bitmap = vorr_s16(bitmap, row4);
bitmap_rows_4567 = vget_lane_s64(vreinterpret_s64_s16(bitmap), 0);
bitmap = vorr_s16(bitmap, row3);
bitmap = vorr_s16(bitmap, row2);
bitmap = vorr_s16(bitmap, row1);
int64_t right_ac_bitmap = vget_lane_s64(vreinterpret_s64_s16(bitmap), 0);
/* If this remains non-zero, a "regular" second pass will be performed. */
int64_t right_ac_dc_bitmap = 1;
if (right_ac_bitmap == 0) {
bitmap = vorr_s16(bitmap, row0);
right_ac_dc_bitmap = vget_lane_s64(vreinterpret_s64_s16(bitmap), 0);
if (right_ac_dc_bitmap != 0) {
int16x4_t dcval = vshl_n_s16(vmul_s16(row0, quant_row0), PASS1_BITS);
int16x4x4_t quadrant = { { dcval, dcval, dcval, dcval } };
/* Store 4x4 blocks to workspace, transposing in the process. */
vst4_s16(workspace_l + 4 * DCTSIZE / 2, quadrant);
vst4_s16(workspace_r + 4 * DCTSIZE / 2, quadrant);
}
} else {
if (bitmap_rows_4567 == 0) {
jsimd_idct_islow_pass1_sparse(row0, row1, row2, row3, quant_row0,
quant_row1, quant_row2, quant_row3,
workspace_l + 4 * DCTSIZE / 2,
workspace_r + 4 * DCTSIZE / 2);
} else {
jsimd_idct_islow_pass1_regular(row0, row1, row2, row3, row4, row5,
row6, row7, quant_row0, quant_row1,
quant_row2, quant_row3, quant_row4,
quant_row5, quant_row6, quant_row7,
workspace_l + 4 * DCTSIZE / 2,
workspace_r + 4 * DCTSIZE / 2);
}
}
/* Second pass: compute IDCT on rows in workspace. */
/* If all coefficients in right 4x8 block are 0, use "sparse" second pass. */
if (right_ac_dc_bitmap == 0) {
jsimd_idct_islow_pass2_sparse(workspace_l, output_buf, output_col, 0);
jsimd_idct_islow_pass2_sparse(workspace_r, output_buf, output_col, 4);
} else {
jsimd_idct_islow_pass2_regular(workspace_l, output_buf, output_col, 0);
jsimd_idct_islow_pass2_regular(workspace_r, output_buf, output_col, 4);
}
}
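/* Scalar sketch of the transposition strategy described above: transposing
 * each 4x4 quadrant in place and then swapping quadrants 1 and 2 yields a
 * full 8x8 transpose.  transpose8x8_quadrants_sketch() is a hypothetical
 * illustration only; the vector code achieves the same effect with VST4
 * stores and by swapping the workspace pointers between passes.
 */
static void transpose8x8_quadrants_sketch(int16_t m[8][8])
{
  int qr, qc, i, j;
  /* Transpose each 4x4 quadrant about its own diagonal. */
  for (qr = 0; qr < 8; qr += 4)
    for (qc = 0; qc < 8; qc += 4)
      for (i = 0; i < 4; i++)
        for (j = i + 1; j < 4; j++) {
          int16_t t = m[qr + i][qc + j];
          m[qr + i][qc + j] = m[qr + j][qc + i];
          m[qr + j][qc + i] = t;
        }
  /* Swap quadrant 1 (top right) with quadrant 2 (bottom left). */
  for (i = 0; i < 4; i++)
    for (j = 0; j < 4; j++) {
      int16_t t = m[i][4 + j];
      m[i][4 + j] = m[4 + i][j];
      m[4 + i][j] = t;
    }
}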
/* Perform dequantization and the first pass of the accurate inverse DCT on a
* 4x8 block of coefficients. (To process the full 8x8 DCT block, this
* function -- or some other optimized variant -- needs to be called for both the
* left and right 4x8 blocks.)
*
* This "regular" version assumes that no optimization can be made to the IDCT
* calculation, since no useful set of AC coefficients is all 0.
*
* The original C implementation of the accurate IDCT (jpeg_idct_slow()) can be
* found in jidctint.c. Algorithmic changes made here are documented inline.
*/
static INLINE void jsimd_idct_islow_pass1_regular(int16x4_t row0,
int16x4_t row1,
int16x4_t row2,
int16x4_t row3,
int16x4_t row4,
int16x4_t row5,
int16x4_t row6,
int16x4_t row7,
int16x4_t quant_row0,
int16x4_t quant_row1,
int16x4_t quant_row2,
int16x4_t quant_row3,
int16x4_t quant_row4,
int16x4_t quant_row5,
int16x4_t quant_row6,
int16x4_t quant_row7,
int16_t *workspace_1,
int16_t *workspace_2)
{
/* Load constants for IDCT computation. */
#ifdef HAVE_VLD1_S16_X3
const int16x4x3_t consts = vld1_s16_x3(jsimd_idct_islow_neon_consts);
#else
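/* GCC does not currently support the intrinsic vld1_<type>_x3(). */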
const int16x4_t consts1 = vld1_s16(jsimd_idct_islow_neon_consts);
const int16x4_t consts2 = vld1_s16(jsimd_idct_islow_neon_consts + 4);
const int16x4_t consts3 = vld1_s16(jsimd_idct_islow_neon_consts + 8);
const int16x4x3_t consts = { { consts1, consts2, consts3 } };
#endif
/* Even part */
int16x4_t z2_s16 = vmul_s16(row2, quant_row2);
int16x4_t z3_s16 = vmul_s16(row6, quant_row6);
int32x4_t tmp2 = vmull_lane_s16(z2_s16, consts.val[0], 1);
int32x4_t tmp3 = vmull_lane_s16(z2_s16, consts.val[1], 2);
tmp2 = vmlal_lane_s16(tmp2, z3_s16, consts.val[2], 1);
tmp3 = vmlal_lane_s16(tmp3, z3_s16, consts.val[0], 1);
z2_s16 = vmul_s16(row0, quant_row0);
z3_s16 = vmul_s16(row4, quant_row4);
int32x4_t tmp0 = vshll_n_s16(vadd_s16(z2_s16, z3_s16), CONST_BITS);
int32x4_t tmp1 = vshll_n_s16(vsub_s16(z2_s16, z3_s16), CONST_BITS);
int32x4_t tmp10 = vaddq_s32(tmp0, tmp3);
int32x4_t tmp13 = vsubq_s32(tmp0, tmp3);
int32x4_t tmp11 = vaddq_s32(tmp1, tmp2);
int32x4_t tmp12 = vsubq_s32(tmp1, tmp2);
/* Odd part */
int16x4_t tmp0_s16 = vmul_s16(row7, quant_row7);
int16x4_t tmp1_s16 = vmul_s16(row5, quant_row5);
int16x4_t tmp2_s16 = vmul_s16(row3, quant_row3);
int16x4_t tmp3_s16 = vmul_s16(row1, quant_row1);
z3_s16 = vadd_s16(tmp0_s16, tmp2_s16);
int16x4_t z4_s16 = vadd_s16(tmp1_s16, tmp3_s16);
/* Implementation as per jpeg_idct_islow() in jidctint.c:
* z5 = (z3 + z4) * 1.175875602;
* z3 = z3 * -1.961570560; z4 = z4 * -0.390180644;
* z3 += z5; z4 += z5;
*
* This implementation:
* z3 = z3 * (1.175875602 - 1.961570560) + z4 * 1.175875602;
* z4 = z3 * 1.175875602 + z4 * (1.175875602 - 0.390180644);
*/
int32x4_t z3 = vmull_lane_s16(z3_s16, consts.val[2], 3);
int32x4_t z4 = vmull_lane_s16(z3_s16, consts.val[1], 3);
z3 = vmlal_lane_s16(z3, z4_s16, consts.val[1], 3);
z4 = vmlal_lane_s16(z4, z4_s16, consts.val[2], 0);
/* Implementation as per jpeg_idct_islow() in jidctint.c:
* z1 = tmp0 + tmp3; z2 = tmp1 + tmp2;
* tmp0 = tmp0 * 0.298631336; tmp1 = tmp1 * 2.053119869;
* tmp2 = tmp2 * 3.072711026; tmp3 = tmp3 * 1.501321110;
* z1 = z1 * -0.899976223; z2 = z2 * -2.562915447;
* tmp0 += z1 + z3; tmp1 += z2 + z4;
* tmp2 += z2 + z3; tmp3 += z1 + z4;
*
* This implementation:
* tmp0 = tmp0 * (0.298631336 - 0.899976223) + tmp3 * -0.899976223;
* tmp1 = tmp1 * (2.053119869 - 2.562915447) + tmp2 * -2.562915447;
* tmp2 = tmp1 * -2.562915447 + tmp2 * (3.072711026 - 2.562915447);
* tmp3 = tmp0 * -0.899976223 + tmp3 * (1.501321110 - 0.899976223);
* tmp0 += z3; tmp1 += z4;
* tmp2 += z3; tmp3 += z4;
*/
tmp0 = vmull_lane_s16(tmp0_s16, consts.val[0], 3);
tmp1 = vmull_lane_s16(tmp1_s16, consts.val[1], 1);
tmp2 = vmull_lane_s16(tmp2_s16, consts.val[2], 2);
tmp3 = vmull_lane_s16(tmp3_s16, consts.val[1], 0);
tmp0 = vmlsl_lane_s16(tmp0, tmp3_s16, consts.val[0], 0);
tmp1 = vmlsl_lane_s16(tmp1, tmp2_s16, consts.val[0], 2);
tmp2 = vmlsl_lane_s16(tmp2, tmp1_s16, consts.val[0], 2);
tmp3 = vmlsl_lane_s16(tmp3, tmp0_s16, consts.val[0], 0);
tmp0 = vaddq_s32(tmp0, z3);
tmp1 = vaddq_s32(tmp1, z4);
tmp2 = vaddq_s32(tmp2, z3);
tmp3 = vaddq_s32(tmp3, z4);
/* Final output stage: descale and narrow to 16-bit. */
int16x4x4_t rows_0123 = { {
vrshrn_n_s32(vaddq_s32(tmp10, tmp3), DESCALE_P1),
vrshrn_n_s32(vaddq_s32(tmp11, tmp2), DESCALE_P1),
vrshrn_n_s32(vaddq_s32(tmp12, tmp1), DESCALE_P1),
vrshrn_n_s32(vaddq_s32(tmp13, tmp0), DESCALE_P1)
} };
int16x4x4_t rows_4567 = { {
vrshrn_n_s32(vsubq_s32(tmp13, tmp0), DESCALE_P1),
vrshrn_n_s32(vsubq_s32(tmp12, tmp1), DESCALE_P1),
vrshrn_n_s32(vsubq_s32(tmp11, tmp2), DESCALE_P1),
vrshrn_n_s32(vsubq_s32(tmp10, tmp3), DESCALE_P1)
} };
/* Store 4x4 blocks to the intermediate workspace, ready for the second pass.
* (VST4 transposes the blocks. We need to operate on rows in the next
* pass.)
*/
vst4_s16(workspace_1, rows_0123);
vst4_s16(workspace_2, rows_4567);
}
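/* Scalar sketch of the refactored odd-part rotation used above.  The original
 * jpeg_idct_islow() forms z5 = (z3 + z4) * 1.175875602 and adds it to
 * z3 * -1.961570560 and z4 * -0.390180644; distributing z5 gives the
 * two-multiply-accumulate form that maps onto VMULL/VMLAL by lane.
 * odd_rotation_sketch() is illustrative only and reuses the fixed-point
 * constants defined above.
 */
static void odd_rotation_sketch(int32_t z3_in, int32_t z4_in,
                                int32_t *z3_out, int32_t *z4_out)
{
  *z3_out = z3_in * F_1_175_MINUS_1_961 + z4_in * F_1_175;
  *z4_out = z3_in * F_1_175 + z4_in * F_1_175_MINUS_0_390;
}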
/* Perform dequantization and the first pass of the accurate inverse DCT on a
* 4x8 block of coefficients.
*
* This "sparse" version assumes that the AC coefficients in rows 4-7 are all
* 0. This simplifies the IDCT calculation, accelerating overall performance.
*/
static INLINE void jsimd_idct_islow_pass1_sparse(int16x4_t row0,
int16x4_t row1,
int16x4_t row2,
int16x4_t row3,
int16x4_t quant_row0,
int16x4_t quant_row1,
int16x4_t quant_row2,
int16x4_t quant_row3,
int16_t *workspace_1,
int16_t *workspace_2)
{
/* Load constants for IDCT computation. */
#ifdef HAVE_VLD1_S16_X3
const int16x4x3_t consts = vld1_s16_x3(jsimd_idct_islow_neon_consts);
#else
const int16x4_t consts1 = vld1_s16(jsimd_idct_islow_neon_consts);
const int16x4_t consts2 = vld1_s16(jsimd_idct_islow_neon_consts + 4);
const int16x4_t consts3 = vld1_s16(jsimd_idct_islow_neon_consts + 8);
const int16x4x3_t consts = { { consts1, consts2, consts3 } };
#endif
/* Even part (z3 is all 0) */
int16x4_t z2_s16 = vmul_s16(row2, quant_row2);
int32x4_t tmp2 = vmull_lane_s16(z2_s16, consts.val[0], 1);
int32x4_t tmp3 = vmull_lane_s16(z2_s16, consts.val[1], 2);
z2_s16 = vmul_s16(row0, quant_row0);
int32x4_t tmp0 = vshll_n_s16(z2_s16, CONST_BITS);
int32x4_t tmp1 = vshll_n_s16(z2_s16, CONST_BITS);
int32x4_t tmp10 = vaddq_s32(tmp0, tmp3);
int32x4_t tmp13 = vsubq_s32(tmp0, tmp3);
int32x4_t tmp11 = vaddq_s32(tmp1, tmp2);
int32x4_t tmp12 = vsubq_s32(tmp1, tmp2);
/* Odd part (tmp0 and tmp1 are both all 0) */
int16x4_t tmp2_s16 = vmul_s16(row3, quant_row3);
int16x4_t tmp3_s16 = vmul_s16(row1, quant_row1);
int16x4_t z3_s16 = tmp2_s16;
int16x4_t z4_s16 = tmp3_s16;
int32x4_t z3 = vmull_lane_s16(z3_s16, consts.val[2], 3);
int32x4_t z4 = vmull_lane_s16(z3_s16, consts.val[1], 3);
z3 = vmlal_lane_s16(z3, z4_s16, consts.val[1], 3);
z4 = vmlal_lane_s16(z4, z4_s16, consts.val[2], 0);
tmp0 = vmlsl_lane_s16(z3, tmp3_s16, consts.val[0], 0);
tmp1 = vmlsl_lane_s16(z4, tmp2_s16, consts.val[0], 2);
tmp2 = vmlal_lane_s16(z3, tmp2_s16, consts.val[2], 2);
tmp3 = vmlal_lane_s16(z4, tmp3_s16, consts.val[1], 0);
/* Final output stage: descale and narrow to 16-bit. */
int16x4x4_t rows_0123 = { {
vrshrn_n_s32(vaddq_s32(tmp10, tmp3), DESCALE_P1),
vrshrn_n_s32(vaddq_s32(tmp11, tmp2), DESCALE_P1),
vrshrn_n_s32(vaddq_s32(tmp12, tmp1), DESCALE_P1),
vrshrn_n_s32(vaddq_s32(tmp13, tmp0), DESCALE_P1)
} };
int16x4x4_t rows_4567 = { {
vrshrn_n_s32(vsubq_s32(tmp13, tmp0), DESCALE_P1),
vrshrn_n_s32(vsubq_s32(tmp12, tmp1), DESCALE_P1),
vrshrn_n_s32(vsubq_s32(tmp11, tmp2), DESCALE_P1),
vrshrn_n_s32(vsubq_s32(tmp10, tmp3), DESCALE_P1)
} };
/* Store 4x4 blocks to the intermediate workspace, ready for the second pass.
* (VST4 transposes the blocks. We need to operate on rows in the next
* pass.)
*/
vst4_s16(workspace_1, rows_0123);
vst4_s16(workspace_2, rows_4567);
}
/* Perform the second pass of the accurate inverse DCT on a 4x8 block of
* coefficients. (To process the full 8x8 DCT block, this function -- or some
* other optimized variant -- needs to be called for both the right and left 4x8
* blocks.)
*
* This "regular" version assumes that no optimization can be made to the IDCT
* calculation, since no useful set of coefficient values is all 0 after the
* first pass.
*
* Again, the original C implementation of the accurate IDCT (jpeg_idct_slow())
* can be found in jidctint.c. Algorithmic changes made here are documented
* inline.
*/
static INLINE void jsimd_idct_islow_pass2_regular(int16_t *workspace,
JSAMPARRAY output_buf,
JDIMENSION output_col,
unsigned buf_offset)
{
/* Load constants for IDCT computation. */
#ifdef HAVE_VLD1_S16_X3
const int16x4x3_t consts = vld1_s16_x3(jsimd_idct_islow_neon_consts);
#else
const int16x4_t consts1 = vld1_s16(jsimd_idct_islow_neon_consts);
const int16x4_t consts2 = vld1_s16(jsimd_idct_islow_neon_consts + 4);
const int16x4_t consts3 = vld1_s16(jsimd_idct_islow_neon_consts + 8);
const int16x4x3_t consts = { { consts1, consts2, consts3 } };
#endif
/* Even part */
int16x4_t z2_s16 = vld1_s16(workspace + 2 * DCTSIZE / 2);
int16x4_t z3_s16 = vld1_s16(workspace + 6 * DCTSIZE / 2);
int32x4_t tmp2 = vmull_lane_s16(z2_s16, consts.val[0], 1);
int32x4_t tmp3 = vmull_lane_s16(z2_s16, consts.val[1], 2);
tmp2 = vmlal_lane_s16(tmp2, z3_s16, consts.val[2], 1);
tmp3 = vmlal_lane_s16(tmp3, z3_s16, consts.val[0], 1);
z2_s16 = vld1_s16(workspace + 0 * DCTSIZE / 2);
z3_s16 = vld1_s16(workspace + 4 * DCTSIZE / 2);
int32x4_t tmp0 = vshll_n_s16(vadd_s16(z2_s16, z3_s16), CONST_BITS);
int32x4_t tmp1 = vshll_n_s16(vsub_s16(z2_s16, z3_s16), CONST_BITS);
int32x4_t tmp10 = vaddq_s32(tmp0, tmp3);
int32x4_t tmp13 = vsubq_s32(tmp0, tmp3);
int32x4_t tmp11 = vaddq_s32(tmp1, tmp2);
int32x4_t tmp12 = vsubq_s32(tmp1, tmp2);
/* Odd part */
int16x4_t tmp0_s16 = vld1_s16(workspace + 7 * DCTSIZE / 2);
int16x4_t tmp1_s16 = vld1_s16(workspace + 5 * DCTSIZE / 2);
int16x4_t tmp2_s16 = vld1_s16(workspace + 3 * DCTSIZE / 2);
int16x4_t tmp3_s16 = vld1_s16(workspace + 1 * DCTSIZE / 2);
z3_s16 = vadd_s16(tmp0_s16, tmp2_s16);
int16x4_t z4_s16 = vadd_s16(tmp1_s16, tmp3_s16);
/* Implementation as per jpeg_idct_islow() in jidctint.c:
* z5 = (z3 + z4) * 1.175875602;
* z3 = z3 * -1.961570560; z4 = z4 * -0.390180644;
* z3 += z5; z4 += z5;
*
* This implementation:
* z3 = z3 * (1.175875602 - 1.961570560) + z4 * 1.175875602;
* z4 = z3 * 1.175875602 + z4 * (1.175875602 - 0.390180644);
*/
int32x4_t z3 = vmull_lane_s16(z3_s16, consts.val[2], 3);
int32x4_t z4 = vmull_lane_s16(z3_s16, consts.val[1], 3);
z3 = vmlal_lane_s16(z3, z4_s16, consts.val[1], 3);
z4 = vmlal_lane_s16(z4, z4_s16, consts.val[2], 0);
/* Implementation as per jpeg_idct_islow() in jidctint.c:
* z1 = tmp0 + tmp3; z2 = tmp1 + tmp2;
* tmp0 = tmp0 * 0.298631336; tmp1 = tmp1 * 2.053119869;
* tmp2 = tmp2 * 3.072711026; tmp3 = tmp3 * 1.501321110;
* z1 = z1 * -0.899976223; z2 = z2 * -2.562915447;
* tmp0 += z1 + z3; tmp1 += z2 + z4;
* tmp2 += z2 + z3; tmp3 += z1 + z4;
*
* This implementation:
* tmp0 = tmp0 * (0.298631336 - 0.899976223) + tmp3 * -0.899976223;
* tmp1 = tmp1 * (2.053119869 - 2.562915447) + tmp2 * -2.562915447;
* tmp2 = tmp1 * -2.562915447 + tmp2 * (3.072711026 - 2.562915447);
* tmp3 = tmp0 * -0.899976223 + tmp3 * (1.501321110 - 0.899976223);
* tmp0 += z3; tmp1 += z4;
* tmp2 += z3; tmp3 += z4;
*/
tmp0 = vmull_lane_s16(tmp0_s16, consts.val[0], 3);
tmp1 = vmull_lane_s16(tmp1_s16, consts.val[1], 1);
tmp2 = vmull_lane_s16(tmp2_s16, consts.val[2], 2);
tmp3 = vmull_lane_s16(tmp3_s16, consts.val[1], 0);
tmp0 = vmlsl_lane_s16(tmp0, tmp3_s16, consts.val[0], 0);
tmp1 = vmlsl_lane_s16(tmp1, tmp2_s16, consts.val[0], 2);
tmp2 = vmlsl_lane_s16(tmp2, tmp1_s16, consts.val[0], 2);
tmp3 = vmlsl_lane_s16(tmp3, tmp0_s16, consts.val[0], 0);
tmp0 = vaddq_s32(tmp0, z3);
tmp1 = vaddq_s32(tmp1, z4);
tmp2 = vaddq_s32(tmp2, z3);
tmp3 = vaddq_s32(tmp3, z4);
/* Final output stage: descale and narrow to 16-bit. */
int16x8_t cols_02_s16 = vcombine_s16(vaddhn_s32(tmp10, tmp3),
vaddhn_s32(tmp12, tmp1));
int16x8_t cols_13_s16 = vcombine_s16(vaddhn_s32(tmp11, tmp2),
vaddhn_s32(tmp13, tmp0));
int16x8_t cols_46_s16 = vcombine_s16(vsubhn_s32(tmp13, tmp0),
vsubhn_s32(tmp11, tmp2));
int16x8_t cols_57_s16 = vcombine_s16(vsubhn_s32(tmp12, tmp1),
vsubhn_s32(tmp10, tmp3));
/* Descale and narrow to 8-bit. */
int8x8_t cols_02_s8 = vqrshrn_n_s16(cols_02_s16, DESCALE_P2 - 16);
int8x8_t cols_13_s8 = vqrshrn_n_s16(cols_13_s16, DESCALE_P2 - 16);
int8x8_t cols_46_s8 = vqrshrn_n_s16(cols_46_s16, DESCALE_P2 - 16);
int8x8_t cols_57_s8 = vqrshrn_n_s16(cols_57_s16, DESCALE_P2 - 16);
/* Clamp to range [0-255]. */
uint8x8_t cols_02_u8 = vadd_u8(vreinterpret_u8_s8(cols_02_s8),
vdup_n_u8(CENTERJSAMPLE));
uint8x8_t cols_13_u8 = vadd_u8(vreinterpret_u8_s8(cols_13_s8),
vdup_n_u8(CENTERJSAMPLE));
uint8x8_t cols_46_u8 = vadd_u8(vreinterpret_u8_s8(cols_46_s8),
vdup_n_u8(CENTERJSAMPLE));
uint8x8_t cols_57_u8 = vadd_u8(vreinterpret_u8_s8(cols_57_s8),
vdup_n_u8(CENTERJSAMPLE));
/* Transpose 4x8 block and store to memory. (Zipping adjacent columns
* together allows us to store 16-bit elements.)
*/
uint8x8x2_t cols_01_23 = vzip_u8(cols_02_u8, cols_13_u8);
uint8x8x2_t cols_45_67 = vzip_u8(cols_46_u8, cols_57_u8);
uint16x4x4_t cols_01_23_45_67 = { {
vreinterpret_u16_u8(cols_01_23.val[0]),
vreinterpret_u16_u8(cols_01_23.val[1]),
vreinterpret_u16_u8(cols_45_67.val[0]),
vreinterpret_u16_u8(cols_45_67.val[1])
} };
JSAMPROW outptr0 = output_buf[buf_offset + 0] + output_col;
JSAMPROW outptr1 = output_buf[buf_offset + 1] + output_col;
JSAMPROW outptr2 = output_buf[buf_offset + 2] + output_col;
JSAMPROW outptr3 = output_buf[buf_offset + 3] + output_col;
/* VST4 of 16-bit elements completes the transpose. */
vst4_lane_u16((uint16_t *)outptr0, cols_01_23_45_67, 0);
vst4_lane_u16((uint16_t *)outptr1, cols_01_23_45_67, 1);
vst4_lane_u16((uint16_t *)outptr2, cols_01_23_45_67, 2);
vst4_lane_u16((uint16_t *)outptr3, cols_01_23_45_67, 3);
}
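/* Scalar sketch of the two-step descale used above.  VADDHN/VSUBHN drop the
 * low 16 bits of each 32-bit sum, and VQRSHRN then shifts by the remaining
 * DESCALE_P2 - 16 = (13 + 2 + 3) - 16 = 2 bits with rounding and saturation.
 * descale_p2_sketch() is a hypothetical scalar equivalent of the combined
 * operation (ignoring the truncating rounding of the first step).
 */
static int8_t descale_p2_sketch(int32_t sum)
{
  int32_t x = sum >> 16;                    /* VADDHN/VSUBHN: keep high half */
  x = (x + (1 << (DESCALE_P2 - 16 - 1))) >> (DESCALE_P2 - 16);  /* VQRSHRN */
  if (x < -128) x = -128;                   /* saturate to 8 bits */
  if (x > 127) x = 127;
  return (int8_t)x;
}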
/* Perform the second pass of the accurate inverse DCT on a 4x8 block
* of coefficients.
*
* This "sparse" version assumes that the coefficient values (after the first
* pass) in rows 4-7 are all 0. This simplifies the IDCT calculation,
* accelerating overall performance.
*/
static INLINE void jsimd_idct_islow_pass2_sparse(int16_t *workspace,
JSAMPARRAY output_buf,
JDIMENSION output_col,
unsigned buf_offset)
{
/* Load constants for IDCT computation. */
#ifdef HAVE_VLD1_S16_X3
const int16x4x3_t consts = vld1_s16_x3(jsimd_idct_islow_neon_consts);
#else
const int16x4_t consts1 = vld1_s16(jsimd_idct_islow_neon_consts);
const int16x4_t consts2 = vld1_s16(jsimd_idct_islow_neon_consts + 4);
const int16x4_t consts3 = vld1_s16(jsimd_idct_islow_neon_consts + 8);
const int16x4x3_t consts = { { consts1, consts2, consts3 } };
#endif
/* Even part (z3 is all 0) */
int16x4_t z2_s16 = vld1_s16(workspace + 2 * DCTSIZE / 2);
int32x4_t tmp2 = vmull_lane_s16(z2_s16, consts.val[0], 1);
int32x4_t tmp3 = vmull_lane_s16(z2_s16, consts.val[1], 2);
z2_s16 = vld1_s16(workspace + 0 * DCTSIZE / 2);
int32x4_t tmp0 = vshll_n_s16(z2_s16, CONST_BITS);
int32x4_t tmp1 = vshll_n_s16(z2_s16, CONST_BITS);
int32x4_t tmp10 = vaddq_s32(tmp0, tmp3);
int32x4_t tmp13 = vsubq_s32(tmp0, tmp3);
int32x4_t tmp11 = vaddq_s32(tmp1, tmp2);
int32x4_t tmp12 = vsubq_s32(tmp1, tmp2);
/* Odd part (tmp0 and tmp1 are both all 0) */
int16x4_t tmp2_s16 = vld1_s16(workspace + 3 * DCTSIZE / 2);
int16x4_t tmp3_s16 = vld1_s16(workspace + 1 * DCTSIZE / 2);
int16x4_t z3_s16 = tmp2_s16;
int16x4_t z4_s16 = tmp3_s16;
int32x4_t z3 = vmull_lane_s16(z3_s16, consts.val[2], 3);
z3 = vmlal_lane_s16(z3, z4_s16, consts.val[1], 3);
int32x4_t z4 = vmull_lane_s16(z3_s16, consts.val[1], 3);
z4 = vmlal_lane_s16(z4, z4_s16, consts.val[2], 0);
tmp0 = vmlsl_lane_s16(z3, tmp3_s16, consts.val[0], 0);
tmp1 = vmlsl_lane_s16(z4, tmp2_s16, consts.val[0], 2);
tmp2 = vmlal_lane_s16(z3, tmp2_s16, consts.val[2], 2);
tmp3 = vmlal_lane_s16(z4, tmp3_s16, consts.val[1], 0);
/* Final output stage: descale and narrow to 16-bit. */
int16x8_t cols_02_s16 = vcombine_s16(vaddhn_s32(tmp10, tmp3),
vaddhn_s32(tmp12, tmp1));
int16x8_t cols_13_s16 = vcombine_s16(vaddhn_s32(tmp11, tmp2),
vaddhn_s32(tmp13, tmp0));
int16x8_t cols_46_s16 = vcombine_s16(vsubhn_s32(tmp13, tmp0),
vsubhn_s32(tmp11, tmp2));
int16x8_t cols_57_s16 = vcombine_s16(vsubhn_s32(tmp12, tmp1),
vsubhn_s32(tmp10, tmp3));
/* Descale and narrow to 8-bit. */
int8x8_t cols_02_s8 = vqrshrn_n_s16(cols_02_s16, DESCALE_P2 - 16);
int8x8_t cols_13_s8 = vqrshrn_n_s16(cols_13_s16, DESCALE_P2 - 16);
int8x8_t cols_46_s8 = vqrshrn_n_s16(cols_46_s16, DESCALE_P2 - 16);
int8x8_t cols_57_s8 = vqrshrn_n_s16(cols_57_s16, DESCALE_P2 - 16);
/* Clamp to range [0-255]. */
uint8x8_t cols_02_u8 = vadd_u8(vreinterpret_u8_s8(cols_02_s8),
vdup_n_u8(CENTERJSAMPLE));
uint8x8_t cols_13_u8 = vadd_u8(vreinterpret_u8_s8(cols_13_s8),
vdup_n_u8(CENTERJSAMPLE));
uint8x8_t cols_46_u8 = vadd_u8(vreinterpret_u8_s8(cols_46_s8),
vdup_n_u8(CENTERJSAMPLE));
uint8x8_t cols_57_u8 = vadd_u8(vreinterpret_u8_s8(cols_57_s8),
vdup_n_u8(CENTERJSAMPLE));
/* Transpose 4x8 block and store to memory. (Zipping adjacent columns
* together allows us to store 16-bit elements.)
*/
uint8x8x2_t cols_01_23 = vzip_u8(cols_02_u8, cols_13_u8);
uint8x8x2_t cols_45_67 = vzip_u8(cols_46_u8, cols_57_u8);
uint16x4x4_t cols_01_23_45_67 = { {
vreinterpret_u16_u8(cols_01_23.val[0]),
vreinterpret_u16_u8(cols_01_23.val[1]),
vreinterpret_u16_u8(cols_45_67.val[0]),
vreinterpret_u16_u8(cols_45_67.val[1])
} };
JSAMPROW outptr0 = output_buf[buf_offset + 0] + output_col;
JSAMPROW outptr1 = output_buf[buf_offset + 1] + output_col;
JSAMPROW outptr2 = output_buf[buf_offset + 2] + output_col;
JSAMPROW outptr3 = output_buf[buf_offset + 3] + output_col;
/* VST4 of 16-bit elements completes the transpose. */
vst4_lane_u16((uint16_t *)outptr0, cols_01_23_45_67, 0);
vst4_lane_u16((uint16_t *)outptr1, cols_01_23_45_67, 1);
vst4_lane_u16((uint16_t *)outptr2, cols_01_23_45_67, 2);
vst4_lane_u16((uint16_t *)outptr3, cols_01_23_45_67, 3);
}

486
simd/arm/jidctred-neon.c Normal file
View File

@@ -0,0 +1,486 @@
/*
* jidctred-neon.c - reduced-size IDCT (Arm Neon)
*
* Copyright (C) 2020, Arm Limited. All Rights Reserved.
* Copyright (C) 2020, D. R. Commander. All Rights Reserved.
*
* This software is provided 'as-is', without any express or implied
* warranty. In no event will the authors be held liable for any damages
* arising from the use of this software.
*
* Permission is granted to anyone to use this software for any purpose,
* including commercial applications, and to alter it and redistribute it
* freely, subject to the following restrictions:
*
* 1. The origin of this software must not be misrepresented; you must not
* claim that you wrote the original software. If you use this software
* in a product, an acknowledgment in the product documentation would be
* appreciated but is not required.
* 2. Altered source versions must be plainly marked as such, and must not be
* misrepresented as being the original software.
* 3. This notice may not be removed or altered from any source distribution.
*/
#define JPEG_INTERNALS
#include "../../jinclude.h"
#include "../../jpeglib.h"
#include "../../jsimd.h"
#include "../../jdct.h"
#include "../../jsimddct.h"
#include "../jsimd.h"
#include "align.h"
#include "neon-compat.h"
#include <arm_neon.h>
#define CONST_BITS 13
#define PASS1_BITS 2
#define F_0_211 1730
#define F_0_509 4176
#define F_0_601 4926
#define F_0_720 5906
#define F_0_765 6270
#define F_0_850 6967
#define F_0_899 7373
#define F_1_061 8697
#define F_1_272 10426
#define F_1_451 11893
#define F_1_847 15137
#define F_2_172 17799
#define F_2_562 20995
#define F_3_624 29692
/* jsimd_idct_2x2_neon() is an inverse DCT function that produces reduced-size
* 2x2 output from an 8x8 DCT block. It uses the same calculations and
* produces exactly the same output as IJG's original jpeg_idct_2x2() function
* from jpeg-6b, which can be found in jidctred.c.
*
* Scaled integer constants are used to avoid floating-point arithmetic:
* 0.720959822 = 5906 * 2^-13
* 0.850430095 = 6967 * 2^-13
* 1.272758580 = 10426 * 2^-13
* 3.624509785 = 29692 * 2^-13
*
* See jidctred.c for further details of the 2x2 IDCT algorithm. Where
* possible, the variable names and comments here in jsimd_idct_2x2_neon()
* match up with those in jpeg_idct_2x2().
*/
ALIGN(16) static const int16_t jsimd_idct_2x2_neon_consts[] = {
-F_0_720, F_0_850, -F_1_272, F_3_624
};
void jsimd_idct_2x2_neon(void *dct_table, JCOEFPTR coef_block,
JSAMPARRAY output_buf, JDIMENSION output_col)
{
ISLOW_MULT_TYPE *quantptr = dct_table;
/* Load DCT coefficients. */
int16x8_t row0 = vld1q_s16(coef_block + 0 * DCTSIZE);
int16x8_t row1 = vld1q_s16(coef_block + 1 * DCTSIZE);
int16x8_t row3 = vld1q_s16(coef_block + 3 * DCTSIZE);
int16x8_t row5 = vld1q_s16(coef_block + 5 * DCTSIZE);
int16x8_t row7 = vld1q_s16(coef_block + 7 * DCTSIZE);
/* Load quantization table values. */
int16x8_t quant_row0 = vld1q_s16(quantptr + 0 * DCTSIZE);
int16x8_t quant_row1 = vld1q_s16(quantptr + 1 * DCTSIZE);
int16x8_t quant_row3 = vld1q_s16(quantptr + 3 * DCTSIZE);
int16x8_t quant_row5 = vld1q_s16(quantptr + 5 * DCTSIZE);
int16x8_t quant_row7 = vld1q_s16(quantptr + 7 * DCTSIZE);
/* Dequantize DCT coefficients. */
row0 = vmulq_s16(row0, quant_row0);
row1 = vmulq_s16(row1, quant_row1);
row3 = vmulq_s16(row3, quant_row3);
row5 = vmulq_s16(row5, quant_row5);
row7 = vmulq_s16(row7, quant_row7);
/* Load IDCT conversion constants. */
const int16x4_t consts = vld1_s16(jsimd_idct_2x2_neon_consts);
/* Pass 1: process columns from input, put results in vectors row0 and
* row1.
*/
/* Even part */
int32x4_t tmp10_l = vshll_n_s16(vget_low_s16(row0), CONST_BITS + 2);
int32x4_t tmp10_h = vshll_n_s16(vget_high_s16(row0), CONST_BITS + 2);
/* Odd part */
int32x4_t tmp0_l = vmull_lane_s16(vget_low_s16(row1), consts, 3);
tmp0_l = vmlal_lane_s16(tmp0_l, vget_low_s16(row3), consts, 2);
tmp0_l = vmlal_lane_s16(tmp0_l, vget_low_s16(row5), consts, 1);
tmp0_l = vmlal_lane_s16(tmp0_l, vget_low_s16(row7), consts, 0);
int32x4_t tmp0_h = vmull_lane_s16(vget_high_s16(row1), consts, 3);
tmp0_h = vmlal_lane_s16(tmp0_h, vget_high_s16(row3), consts, 2);
tmp0_h = vmlal_lane_s16(tmp0_h, vget_high_s16(row5), consts, 1);
tmp0_h = vmlal_lane_s16(tmp0_h, vget_high_s16(row7), consts, 0);
/* Final output stage: descale and narrow to 16-bit. */
row0 = vcombine_s16(vrshrn_n_s32(vaddq_s32(tmp10_l, tmp0_l), CONST_BITS),
vrshrn_n_s32(vaddq_s32(tmp10_h, tmp0_h), CONST_BITS));
row1 = vcombine_s16(vrshrn_n_s32(vsubq_s32(tmp10_l, tmp0_l), CONST_BITS),
vrshrn_n_s32(vsubq_s32(tmp10_h, tmp0_h), CONST_BITS));
/* Transpose two rows, ready for second pass. */
int16x8x2_t cols_0246_1357 = vtrnq_s16(row0, row1);
int16x8_t cols_0246 = cols_0246_1357.val[0];
int16x8_t cols_1357 = cols_0246_1357.val[1];
/* Duplicate columns such that each is accessible in its own vector. */
int32x4x2_t cols_1155_3377 = vtrnq_s32(vreinterpretq_s32_s16(cols_1357),
vreinterpretq_s32_s16(cols_1357));
int16x8_t cols_1155 = vreinterpretq_s16_s32(cols_1155_3377.val[0]);
int16x8_t cols_3377 = vreinterpretq_s16_s32(cols_1155_3377.val[1]);
/* Pass 2: process two rows, store to output array. */
/* Even part: we're only interested in col0; the top half of tmp10 is "don't
* care."
*/
int32x4_t tmp10 = vshll_n_s16(vget_low_s16(cols_0246), CONST_BITS + 2);
/* Odd part: we're only interested in the bottom half of tmp0. */
int32x4_t tmp0 = vmull_lane_s16(vget_low_s16(cols_1155), consts, 3);
tmp0 = vmlal_lane_s16(tmp0, vget_low_s16(cols_3377), consts, 2);
tmp0 = vmlal_lane_s16(tmp0, vget_high_s16(cols_1155), consts, 1);
tmp0 = vmlal_lane_s16(tmp0, vget_high_s16(cols_3377), consts, 0);
/* Final output stage: descale and clamp to range [0-255]. */
int16x8_t output_s16 = vcombine_s16(vaddhn_s32(tmp10, tmp0),
vsubhn_s32(tmp10, tmp0));
output_s16 = vrsraq_n_s16(vdupq_n_s16(CENTERJSAMPLE), output_s16,
CONST_BITS + PASS1_BITS + 3 + 2 - 16);
/* Narrow to 8-bit and convert to unsigned. */
uint8x8_t output_u8 = vqmovun_s16(output_s16);
/* Store 2x2 block to memory. */
vst1_lane_u8(output_buf[0] + output_col, output_u8, 0);
vst1_lane_u8(output_buf[1] + output_col, output_u8, 1);
vst1_lane_u8(output_buf[0] + output_col + 1, output_u8, 4);
vst1_lane_u8(output_buf[1] + output_col + 1, output_u8, 5);
}
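/* Scalar sketch of the per-column 2x2 butterfly implemented above, following
 * jpeg_idct_2x2() in jidctred.c.  idct_2x2_column_sketch() is illustrative
 * only; it assumes already-dequantized coefficients c1, c3, c5, and c7 plus
 * the DC value, and returns the two column outputs before the final descale.
 */
static void idct_2x2_column_sketch(int32_t dc, int32_t c1, int32_t c3,
                                   int32_t c5, int32_t c7,
                                   int32_t *out0, int32_t *out1)
{
  int32_t tmp10 = dc << (CONST_BITS + 2);               /* even part */
  int32_t tmp0 = c1 * F_3_624 - c3 * F_1_272 +           /* odd part */
                 c5 * F_0_850 - c7 * F_0_720;
  *out0 = tmp10 + tmp0;
  *out1 = tmp10 - tmp0;
}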
/* jsimd_idct_4x4_neon() is an inverse DCT function that produces reduced-size
* 4x4 output from an 8x8 DCT block. It uses the same calculations and
* produces exactly the same output as IJG's original jpeg_idct_4x4() function
* from jpeg-6b, which can be found in jidctred.c.
*
* Scaled integer constants are used to avoid floating-point arithmetic:
* 0.211164243 = 1730 * 2^-13
* 0.509795579 = 4176 * 2^-13
* 0.601344887 = 4926 * 2^-13
* 0.765366865 = 6270 * 2^-13
* 0.899976223 = 7373 * 2^-13
* 1.061594337 = 8697 * 2^-13
* 1.451774981 = 11893 * 2^-13
* 1.847759065 = 15137 * 2^-13
* 2.172734803 = 17799 * 2^-13
* 2.562915447 = 20995 * 2^-13
*
* See jidctred.c for further details of the 4x4 IDCT algorithm. Where
* possible, the variable names and comments here in jsimd_idct_4x4_neon()
* match up with those in jpeg_idct_4x4().
*/
ALIGN(16) static const int16_t jsimd_idct_4x4_neon_consts[] = {
F_1_847, -F_0_765, -F_0_211, F_1_451,
-F_2_172, F_1_061, -F_0_509, -F_0_601,
F_0_899, F_2_562, 0, 0
};
void jsimd_idct_4x4_neon(void *dct_table, JCOEFPTR coef_block,
JSAMPARRAY output_buf, JDIMENSION output_col)
{
ISLOW_MULT_TYPE *quantptr = dct_table;
/* Load DCT coefficients. */
int16x8_t row0 = vld1q_s16(coef_block + 0 * DCTSIZE);
int16x8_t row1 = vld1q_s16(coef_block + 1 * DCTSIZE);
int16x8_t row2 = vld1q_s16(coef_block + 2 * DCTSIZE);
int16x8_t row3 = vld1q_s16(coef_block + 3 * DCTSIZE);
int16x8_t row5 = vld1q_s16(coef_block + 5 * DCTSIZE);
int16x8_t row6 = vld1q_s16(coef_block + 6 * DCTSIZE);
int16x8_t row7 = vld1q_s16(coef_block + 7 * DCTSIZE);
/* Load quantization table values for DC coefficients. */
int16x8_t quant_row0 = vld1q_s16(quantptr + 0 * DCTSIZE);
/* Dequantize DC coefficients. */
row0 = vmulq_s16(row0, quant_row0);
/* Construct bitmap to test if all AC coefficients are 0. */
int16x8_t bitmap = vorrq_s16(row1, row2);
bitmap = vorrq_s16(bitmap, row3);
bitmap = vorrq_s16(bitmap, row5);
bitmap = vorrq_s16(bitmap, row6);
bitmap = vorrq_s16(bitmap, row7);
int64_t left_ac_bitmap = vgetq_lane_s64(vreinterpretq_s64_s16(bitmap), 0);
int64_t right_ac_bitmap = vgetq_lane_s64(vreinterpretq_s64_s16(bitmap), 1);
/* Load constants for IDCT computation. */
#ifdef HAVE_VLD1_S16_X3
const int16x4x3_t consts = vld1_s16_x3(jsimd_idct_4x4_neon_consts);
#else
/* GCC does not currently support the intrinsic vld1_<type>_x3(). */
const int16x4_t consts1 = vld1_s16(jsimd_idct_4x4_neon_consts);
const int16x4_t consts2 = vld1_s16(jsimd_idct_4x4_neon_consts + 4);
const int16x4_t consts3 = vld1_s16(jsimd_idct_4x4_neon_consts + 8);
const int16x4x3_t consts = { { consts1, consts2, consts3 } };
#endif
if (left_ac_bitmap == 0 && right_ac_bitmap == 0) {
/* All AC coefficients are zero.
* Compute DC values and duplicate into row vectors 0, 1, 2, and 3.
*/
int16x8_t dcval = vshlq_n_s16(row0, PASS1_BITS);
row0 = dcval;
row1 = dcval;
row2 = dcval;
row3 = dcval;
} else if (left_ac_bitmap == 0) {
/* AC coefficients are zero for columns 0, 1, 2, and 3.
* Compute DC values for these columns.
*/
int16x4_t dcval = vshl_n_s16(vget_low_s16(row0), PASS1_BITS);
/* Commence regular IDCT computation for columns 4, 5, 6, and 7. */
/* Load quantization table. */
int16x4_t quant_row1 = vld1_s16(quantptr + 1 * DCTSIZE + 4);
int16x4_t quant_row2 = vld1_s16(quantptr + 2 * DCTSIZE + 4);
int16x4_t quant_row3 = vld1_s16(quantptr + 3 * DCTSIZE + 4);
int16x4_t quant_row5 = vld1_s16(quantptr + 5 * DCTSIZE + 4);
int16x4_t quant_row6 = vld1_s16(quantptr + 6 * DCTSIZE + 4);
int16x4_t quant_row7 = vld1_s16(quantptr + 7 * DCTSIZE + 4);
/* Even part */
int32x4_t tmp0 = vshll_n_s16(vget_high_s16(row0), CONST_BITS + 1);
int16x4_t z2 = vmul_s16(vget_high_s16(row2), quant_row2);
int16x4_t z3 = vmul_s16(vget_high_s16(row6), quant_row6);
int32x4_t tmp2 = vmull_lane_s16(z2, consts.val[0], 0);
tmp2 = vmlal_lane_s16(tmp2, z3, consts.val[0], 1);
int32x4_t tmp10 = vaddq_s32(tmp0, tmp2);
int32x4_t tmp12 = vsubq_s32(tmp0, tmp2);
/* Odd part */
int16x4_t z1 = vmul_s16(vget_high_s16(row7), quant_row7);
z2 = vmul_s16(vget_high_s16(row5), quant_row5);
z3 = vmul_s16(vget_high_s16(row3), quant_row3);
int16x4_t z4 = vmul_s16(vget_high_s16(row1), quant_row1);
tmp0 = vmull_lane_s16(z1, consts.val[0], 2);
tmp0 = vmlal_lane_s16(tmp0, z2, consts.val[0], 3);
tmp0 = vmlal_lane_s16(tmp0, z3, consts.val[1], 0);
tmp0 = vmlal_lane_s16(tmp0, z4, consts.val[1], 1);
tmp2 = vmull_lane_s16(z1, consts.val[1], 2);
tmp2 = vmlal_lane_s16(tmp2, z2, consts.val[1], 3);
tmp2 = vmlal_lane_s16(tmp2, z3, consts.val[2], 0);
tmp2 = vmlal_lane_s16(tmp2, z4, consts.val[2], 1);
/* Final output stage: descale and narrow to 16-bit. */
row0 = vcombine_s16(dcval, vrshrn_n_s32(vaddq_s32(tmp10, tmp2),
CONST_BITS - PASS1_BITS + 1));
row3 = vcombine_s16(dcval, vrshrn_n_s32(vsubq_s32(tmp10, tmp2),
CONST_BITS - PASS1_BITS + 1));
row1 = vcombine_s16(dcval, vrshrn_n_s32(vaddq_s32(tmp12, tmp0),
CONST_BITS - PASS1_BITS + 1));
row2 = vcombine_s16(dcval, vrshrn_n_s32(vsubq_s32(tmp12, tmp0),
CONST_BITS - PASS1_BITS + 1));
} else if (right_ac_bitmap == 0) {
/* AC coefficients are zero for columns 4, 5, 6, and 7.
* Compute DC values for these columns.
*/
int16x4_t dcval = vshl_n_s16(vget_high_s16(row0), PASS1_BITS);
/* Commence regular IDCT computation for columns 0, 1, 2, and 3. */
/* Load quantization table. */
int16x4_t quant_row1 = vld1_s16(quantptr + 1 * DCTSIZE);
int16x4_t quant_row2 = vld1_s16(quantptr + 2 * DCTSIZE);
int16x4_t quant_row3 = vld1_s16(quantptr + 3 * DCTSIZE);
int16x4_t quant_row5 = vld1_s16(quantptr + 5 * DCTSIZE);
int16x4_t quant_row6 = vld1_s16(quantptr + 6 * DCTSIZE);
int16x4_t quant_row7 = vld1_s16(quantptr + 7 * DCTSIZE);
/* Even part */
int32x4_t tmp0 = vshll_n_s16(vget_low_s16(row0), CONST_BITS + 1);
int16x4_t z2 = vmul_s16(vget_low_s16(row2), quant_row2);
int16x4_t z3 = vmul_s16(vget_low_s16(row6), quant_row6);
int32x4_t tmp2 = vmull_lane_s16(z2, consts.val[0], 0);
tmp2 = vmlal_lane_s16(tmp2, z3, consts.val[0], 1);
int32x4_t tmp10 = vaddq_s32(tmp0, tmp2);
int32x4_t tmp12 = vsubq_s32(tmp0, tmp2);
/* Odd part */
int16x4_t z1 = vmul_s16(vget_low_s16(row7), quant_row7);
z2 = vmul_s16(vget_low_s16(row5), quant_row5);
z3 = vmul_s16(vget_low_s16(row3), quant_row3);
int16x4_t z4 = vmul_s16(vget_low_s16(row1), quant_row1);
tmp0 = vmull_lane_s16(z1, consts.val[0], 2);
tmp0 = vmlal_lane_s16(tmp0, z2, consts.val[0], 3);
tmp0 = vmlal_lane_s16(tmp0, z3, consts.val[1], 0);
tmp0 = vmlal_lane_s16(tmp0, z4, consts.val[1], 1);
tmp2 = vmull_lane_s16(z1, consts.val[1], 2);
tmp2 = vmlal_lane_s16(tmp2, z2, consts.val[1], 3);
tmp2 = vmlal_lane_s16(tmp2, z3, consts.val[2], 0);
tmp2 = vmlal_lane_s16(tmp2, z4, consts.val[2], 1);
/* Final output stage: descale and narrow to 16-bit. */
row0 = vcombine_s16(vrshrn_n_s32(vaddq_s32(tmp10, tmp2),
CONST_BITS - PASS1_BITS + 1), dcval);
row3 = vcombine_s16(vrshrn_n_s32(vsubq_s32(tmp10, tmp2),
CONST_BITS - PASS1_BITS + 1), dcval);
row1 = vcombine_s16(vrshrn_n_s32(vaddq_s32(tmp12, tmp0),
CONST_BITS - PASS1_BITS + 1), dcval);
row2 = vcombine_s16(vrshrn_n_s32(vsubq_s32(tmp12, tmp0),
CONST_BITS - PASS1_BITS + 1), dcval);
} else {
/* All AC coefficients are non-zero; full IDCT calculation required. */
int16x8_t quant_row1 = vld1q_s16(quantptr + 1 * DCTSIZE);
int16x8_t quant_row2 = vld1q_s16(quantptr + 2 * DCTSIZE);
int16x8_t quant_row3 = vld1q_s16(quantptr + 3 * DCTSIZE);
int16x8_t quant_row5 = vld1q_s16(quantptr + 5 * DCTSIZE);
int16x8_t quant_row6 = vld1q_s16(quantptr + 6 * DCTSIZE);
int16x8_t quant_row7 = vld1q_s16(quantptr + 7 * DCTSIZE);
/* Even part */
int32x4_t tmp0_l = vshll_n_s16(vget_low_s16(row0), CONST_BITS + 1);
int32x4_t tmp0_h = vshll_n_s16(vget_high_s16(row0), CONST_BITS + 1);
int16x8_t z2 = vmulq_s16(row2, quant_row2);
int16x8_t z3 = vmulq_s16(row6, quant_row6);
int32x4_t tmp2_l = vmull_lane_s16(vget_low_s16(z2), consts.val[0], 0);
int32x4_t tmp2_h = vmull_lane_s16(vget_high_s16(z2), consts.val[0], 0);
tmp2_l = vmlal_lane_s16(tmp2_l, vget_low_s16(z3), consts.val[0], 1);
tmp2_h = vmlal_lane_s16(tmp2_h, vget_high_s16(z3), consts.val[0], 1);
int32x4_t tmp10_l = vaddq_s32(tmp0_l, tmp2_l);
int32x4_t tmp10_h = vaddq_s32(tmp0_h, tmp2_h);
int32x4_t tmp12_l = vsubq_s32(tmp0_l, tmp2_l);
int32x4_t tmp12_h = vsubq_s32(tmp0_h, tmp2_h);
/* Odd part */
int16x8_t z1 = vmulq_s16(row7, quant_row7);
z2 = vmulq_s16(row5, quant_row5);
z3 = vmulq_s16(row3, quant_row3);
int16x8_t z4 = vmulq_s16(row1, quant_row1);
tmp0_l = vmull_lane_s16(vget_low_s16(z1), consts.val[0], 2);
tmp0_l = vmlal_lane_s16(tmp0_l, vget_low_s16(z2), consts.val[0], 3);
tmp0_l = vmlal_lane_s16(tmp0_l, vget_low_s16(z3), consts.val[1], 0);
tmp0_l = vmlal_lane_s16(tmp0_l, vget_low_s16(z4), consts.val[1], 1);
tmp0_h = vmull_lane_s16(vget_high_s16(z1), consts.val[0], 2);
tmp0_h = vmlal_lane_s16(tmp0_h, vget_high_s16(z2), consts.val[0], 3);
tmp0_h = vmlal_lane_s16(tmp0_h, vget_high_s16(z3), consts.val[1], 0);
tmp0_h = vmlal_lane_s16(tmp0_h, vget_high_s16(z4), consts.val[1], 1);
tmp2_l = vmull_lane_s16(vget_low_s16(z1), consts.val[1], 2);
tmp2_l = vmlal_lane_s16(tmp2_l, vget_low_s16(z2), consts.val[1], 3);
tmp2_l = vmlal_lane_s16(tmp2_l, vget_low_s16(z3), consts.val[2], 0);
tmp2_l = vmlal_lane_s16(tmp2_l, vget_low_s16(z4), consts.val[2], 1);
tmp2_h = vmull_lane_s16(vget_high_s16(z1), consts.val[1], 2);
tmp2_h = vmlal_lane_s16(tmp2_h, vget_high_s16(z2), consts.val[1], 3);
tmp2_h = vmlal_lane_s16(tmp2_h, vget_high_s16(z3), consts.val[2], 0);
tmp2_h = vmlal_lane_s16(tmp2_h, vget_high_s16(z4), consts.val[2], 1);
/* Final output stage: descale and narrow to 16-bit. */
row0 = vcombine_s16(vrshrn_n_s32(vaddq_s32(tmp10_l, tmp2_l),
CONST_BITS - PASS1_BITS + 1),
vrshrn_n_s32(vaddq_s32(tmp10_h, tmp2_h),
CONST_BITS - PASS1_BITS + 1));
row3 = vcombine_s16(vrshrn_n_s32(vsubq_s32(tmp10_l, tmp2_l),
CONST_BITS - PASS1_BITS + 1),
vrshrn_n_s32(vsubq_s32(tmp10_h, tmp2_h),
CONST_BITS - PASS1_BITS + 1));
row1 = vcombine_s16(vrshrn_n_s32(vaddq_s32(tmp12_l, tmp0_l),
CONST_BITS - PASS1_BITS + 1),
vrshrn_n_s32(vaddq_s32(tmp12_h, tmp0_h),
CONST_BITS - PASS1_BITS + 1));
row2 = vcombine_s16(vrshrn_n_s32(vsubq_s32(tmp12_l, tmp0_l),
CONST_BITS - PASS1_BITS + 1),
vrshrn_n_s32(vsubq_s32(tmp12_h, tmp0_h),
CONST_BITS - PASS1_BITS + 1));
}
/* Transpose 8x4 block to perform IDCT on rows in second pass. */
int16x8x2_t row_01 = vtrnq_s16(row0, row1);
int16x8x2_t row_23 = vtrnq_s16(row2, row3);
int32x4x2_t cols_0426 = vtrnq_s32(vreinterpretq_s32_s16(row_01.val[0]),
vreinterpretq_s32_s16(row_23.val[0]));
int32x4x2_t cols_1537 = vtrnq_s32(vreinterpretq_s32_s16(row_01.val[1]),
vreinterpretq_s32_s16(row_23.val[1]));
int16x4_t col0 = vreinterpret_s16_s32(vget_low_s32(cols_0426.val[0]));
int16x4_t col1 = vreinterpret_s16_s32(vget_low_s32(cols_1537.val[0]));
int16x4_t col2 = vreinterpret_s16_s32(vget_low_s32(cols_0426.val[1]));
int16x4_t col3 = vreinterpret_s16_s32(vget_low_s32(cols_1537.val[1]));
int16x4_t col5 = vreinterpret_s16_s32(vget_high_s32(cols_1537.val[0]));
int16x4_t col6 = vreinterpret_s16_s32(vget_high_s32(cols_0426.val[1]));
int16x4_t col7 = vreinterpret_s16_s32(vget_high_s32(cols_1537.val[1]));
/* Commence second pass of IDCT. */
/* Even part */
int32x4_t tmp0 = vshll_n_s16(col0, CONST_BITS + 1);
int32x4_t tmp2 = vmull_lane_s16(col2, consts.val[0], 0);
tmp2 = vmlal_lane_s16(tmp2, col6, consts.val[0], 1);
int32x4_t tmp10 = vaddq_s32(tmp0, tmp2);
int32x4_t tmp12 = vsubq_s32(tmp0, tmp2);
/* Odd part */
tmp0 = vmull_lane_s16(col7, consts.val[0], 2);
tmp0 = vmlal_lane_s16(tmp0, col5, consts.val[0], 3);
tmp0 = vmlal_lane_s16(tmp0, col3, consts.val[1], 0);
tmp0 = vmlal_lane_s16(tmp0, col1, consts.val[1], 1);
tmp2 = vmull_lane_s16(col7, consts.val[1], 2);
tmp2 = vmlal_lane_s16(tmp2, col5, consts.val[1], 3);
tmp2 = vmlal_lane_s16(tmp2, col3, consts.val[2], 0);
tmp2 = vmlal_lane_s16(tmp2, col1, consts.val[2], 1);
/* Final output stage: descale and clamp to range [0-255]. */
int16x8_t output_cols_02 = vcombine_s16(vaddhn_s32(tmp10, tmp2),
vsubhn_s32(tmp12, tmp0));
int16x8_t output_cols_13 = vcombine_s16(vaddhn_s32(tmp12, tmp0),
vsubhn_s32(tmp10, tmp2));
output_cols_02 = vrsraq_n_s16(vdupq_n_s16(CENTERJSAMPLE), output_cols_02,
CONST_BITS + PASS1_BITS + 3 + 1 - 16);
output_cols_13 = vrsraq_n_s16(vdupq_n_s16(CENTERJSAMPLE), output_cols_13,
CONST_BITS + PASS1_BITS + 3 + 1 - 16);
/* Narrow to 8-bit and convert to unsigned while zipping 8-bit elements.
* An interleaving store completes the transpose.
*/
uint8x8x2_t output_0123 = vzip_u8(vqmovun_s16(output_cols_02),
vqmovun_s16(output_cols_13));
uint16x4x2_t output_01_23 = { {
vreinterpret_u16_u8(output_0123.val[0]),
vreinterpret_u16_u8(output_0123.val[1])
} };
/* Store 4x4 block to memory. */
JSAMPROW outptr0 = output_buf[0] + output_col;
JSAMPROW outptr1 = output_buf[1] + output_col;
JSAMPROW outptr2 = output_buf[2] + output_col;
JSAMPROW outptr3 = output_buf[3] + output_col;
vst2_lane_u16((uint16_t *)outptr0, output_01_23, 0);
vst2_lane_u16((uint16_t *)outptr1, output_01_23, 1);
vst2_lane_u16((uint16_t *)outptr2, output_01_23, 2);
vst2_lane_u16((uint16_t *)outptr3, output_01_23, 3);
}
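/* Scalar sketch of the per-column 4x4 butterfly implemented above, following
 * jpeg_idct_4x4() in jidctred.c.  Coefficient row 4 is never loaded because it
 * does not contribute to the 4x4 output.  idct_4x4_column_sketch() is
 * illustrative only; it assumes already-dequantized inputs and omits the
 * final descale.
 */
static void idct_4x4_column_sketch(const int32_t c[8], int32_t out[4])
{
  /* Even part */
  int32_t tmp0 = c[0] << (CONST_BITS + 1);
  int32_t tmp2 = c[2] * F_1_847 - c[6] * F_0_765;
  int32_t tmp10 = tmp0 + tmp2;
  int32_t tmp12 = tmp0 - tmp2;
  /* Odd part */
  tmp0 = c[1] * F_1_061 - c[3] * F_2_172 + c[5] * F_1_451 - c[7] * F_0_211;
  tmp2 = c[1] * F_2_562 + c[3] * F_0_899 - c[5] * F_0_601 - c[7] * F_0_509;
  out[0] = tmp10 + tmp2;
  out[1] = tmp12 + tmp0;
  out[2] = tmp12 - tmp0;
  out[3] = tmp10 - tmp2;
}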

190
simd/arm/jquanti-neon.c Normal file
View File

@@ -0,0 +1,190 @@
/*
* jquanti-neon.c - sample data conversion and quantization (Arm Neon)
*
* Copyright (C) 2020, Arm Limited. All Rights Reserved.
*
* This software is provided 'as-is', without any express or implied
* warranty. In no event will the authors be held liable for any damages
* arising from the use of this software.
*
* Permission is granted to anyone to use this software for any purpose,
* including commercial applications, and to alter it and redistribute it
* freely, subject to the following restrictions:
*
* 1. The origin of this software must not be misrepresented; you must not
* claim that you wrote the original software. If you use this software
* in a product, an acknowledgment in the product documentation would be
* appreciated but is not required.
* 2. Altered source versions must be plainly marked as such, and must not be
* misrepresented as being the original software.
* 3. This notice may not be removed or altered from any source distribution.
*/
#define JPEG_INTERNALS
#include "../../jinclude.h"
#include "../../jpeglib.h"
#include "../../jsimd.h"
#include "../../jdct.h"
#include "../../jsimddct.h"
#include "../jsimd.h"
#include <arm_neon.h>
/* After downsampling, the resulting sample values are in the range [0, 255],
* but the Discrete Cosine Transform (DCT) operates on values centered around
* 0.
*
* To prepare sample values for the DCT, load samples into a DCT workspace,
* subtracting CENTERJSAMPLE (128). The samples, now in the range [-128, 127],
* are also widened from 8- to 16-bit.
*
* The equivalent scalar C function convsamp() can be found in jcdctmgr.c.
*/
void jsimd_convsamp_neon(JSAMPARRAY sample_data, JDIMENSION start_col,
DCTELEM *workspace)
{
uint8x8_t samp_row0 = vld1_u8(sample_data[0] + start_col);
uint8x8_t samp_row1 = vld1_u8(sample_data[1] + start_col);
uint8x8_t samp_row2 = vld1_u8(sample_data[2] + start_col);
uint8x8_t samp_row3 = vld1_u8(sample_data[3] + start_col);
uint8x8_t samp_row4 = vld1_u8(sample_data[4] + start_col);
uint8x8_t samp_row5 = vld1_u8(sample_data[5] + start_col);
uint8x8_t samp_row6 = vld1_u8(sample_data[6] + start_col);
uint8x8_t samp_row7 = vld1_u8(sample_data[7] + start_col);
int16x8_t row0 =
vreinterpretq_s16_u16(vsubl_u8(samp_row0, vdup_n_u8(CENTERJSAMPLE)));
int16x8_t row1 =
vreinterpretq_s16_u16(vsubl_u8(samp_row1, vdup_n_u8(CENTERJSAMPLE)));
int16x8_t row2 =
vreinterpretq_s16_u16(vsubl_u8(samp_row2, vdup_n_u8(CENTERJSAMPLE)));
int16x8_t row3 =
vreinterpretq_s16_u16(vsubl_u8(samp_row3, vdup_n_u8(CENTERJSAMPLE)));
int16x8_t row4 =
vreinterpretq_s16_u16(vsubl_u8(samp_row4, vdup_n_u8(CENTERJSAMPLE)));
int16x8_t row5 =
vreinterpretq_s16_u16(vsubl_u8(samp_row5, vdup_n_u8(CENTERJSAMPLE)));
int16x8_t row6 =
vreinterpretq_s16_u16(vsubl_u8(samp_row6, vdup_n_u8(CENTERJSAMPLE)));
int16x8_t row7 =
vreinterpretq_s16_u16(vsubl_u8(samp_row7, vdup_n_u8(CENTERJSAMPLE)));
vst1q_s16(workspace + 0 * DCTSIZE, row0);
vst1q_s16(workspace + 1 * DCTSIZE, row1);
vst1q_s16(workspace + 2 * DCTSIZE, row2);
vst1q_s16(workspace + 3 * DCTSIZE, row3);
vst1q_s16(workspace + 4 * DCTSIZE, row4);
vst1q_s16(workspace + 5 * DCTSIZE, row5);
vst1q_s16(workspace + 6 * DCTSIZE, row6);
vst1q_s16(workspace + 7 * DCTSIZE, row7);
}
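/* Minimal scalar sketch of the conversion performed above (the library's own
 * scalar version is convsamp() in jcdctmgr.c): widen each sample to DCTELEM
 * and subtract CENTERJSAMPLE.  convsamp_sketch() is illustrative only.
 */
static void convsamp_sketch(JSAMPARRAY sample_data, JDIMENSION start_col,
                            DCTELEM *workspace)
{
  int row, col;
  for (row = 0; row < DCTSIZE; row++)
    for (col = 0; col < DCTSIZE; col++)
      workspace[row * DCTSIZE + col] =
        (DCTELEM)sample_data[row][start_col + col] - CENTERJSAMPLE;
}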
/* After the DCT, the resulting array of coefficient values needs to be divided
* by an array of quantization values.
*
* To avoid a slow division operation, the DCT coefficients are multiplied by
* the (scaled) reciprocals of the quantization values and then right-shifted.
*
* The equivalent scalar C function quantize() can be found in jcdctmgr.c.
*/
void jsimd_quantize_neon(JCOEFPTR coef_block, DCTELEM *divisors,
DCTELEM *workspace)
{
JCOEFPTR out_ptr = coef_block;
UDCTELEM *recip_ptr = (UDCTELEM *)divisors;
UDCTELEM *corr_ptr = (UDCTELEM *)divisors + DCTSIZE2;
DCTELEM *shift_ptr = divisors + 3 * DCTSIZE2;
int i;
for (i = 0; i < DCTSIZE; i += DCTSIZE / 2) {
/* Load DCT coefficients. */
int16x8_t row0 = vld1q_s16(workspace + (i + 0) * DCTSIZE);
int16x8_t row1 = vld1q_s16(workspace + (i + 1) * DCTSIZE);
int16x8_t row2 = vld1q_s16(workspace + (i + 2) * DCTSIZE);
int16x8_t row3 = vld1q_s16(workspace + (i + 3) * DCTSIZE);
/* Load reciprocals of quantization values. */
uint16x8_t recip0 = vld1q_u16(recip_ptr + (i + 0) * DCTSIZE);
uint16x8_t recip1 = vld1q_u16(recip_ptr + (i + 1) * DCTSIZE);
uint16x8_t recip2 = vld1q_u16(recip_ptr + (i + 2) * DCTSIZE);
uint16x8_t recip3 = vld1q_u16(recip_ptr + (i + 3) * DCTSIZE);
uint16x8_t corr0 = vld1q_u16(corr_ptr + (i + 0) * DCTSIZE);
uint16x8_t corr1 = vld1q_u16(corr_ptr + (i + 1) * DCTSIZE);
uint16x8_t corr2 = vld1q_u16(corr_ptr + (i + 2) * DCTSIZE);
uint16x8_t corr3 = vld1q_u16(corr_ptr + (i + 3) * DCTSIZE);
int16x8_t shift0 = vld1q_s16(shift_ptr + (i + 0) * DCTSIZE);
int16x8_t shift1 = vld1q_s16(shift_ptr + (i + 1) * DCTSIZE);
int16x8_t shift2 = vld1q_s16(shift_ptr + (i + 2) * DCTSIZE);
int16x8_t shift3 = vld1q_s16(shift_ptr + (i + 3) * DCTSIZE);
/* Extract sign from coefficients. */
int16x8_t sign_row0 = vshrq_n_s16(row0, 15);
int16x8_t sign_row1 = vshrq_n_s16(row1, 15);
int16x8_t sign_row2 = vshrq_n_s16(row2, 15);
int16x8_t sign_row3 = vshrq_n_s16(row3, 15);
/* Get absolute value of DCT coefficients. */
uint16x8_t abs_row0 = vreinterpretq_u16_s16(vabsq_s16(row0));
uint16x8_t abs_row1 = vreinterpretq_u16_s16(vabsq_s16(row1));
uint16x8_t abs_row2 = vreinterpretq_u16_s16(vabsq_s16(row2));
uint16x8_t abs_row3 = vreinterpretq_u16_s16(vabsq_s16(row3));
/* Add correction. */
abs_row0 = vaddq_u16(abs_row0, corr0);
abs_row1 = vaddq_u16(abs_row1, corr1);
abs_row2 = vaddq_u16(abs_row2, corr2);
abs_row3 = vaddq_u16(abs_row3, corr3);
/* Multiply DCT coefficients by quantization reciprocals. */
int32x4_t row0_l = vreinterpretq_s32_u32(vmull_u16(vget_low_u16(abs_row0),
vget_low_u16(recip0)));
int32x4_t row0_h = vreinterpretq_s32_u32(vmull_u16(vget_high_u16(abs_row0),
vget_high_u16(recip0)));
int32x4_t row1_l = vreinterpretq_s32_u32(vmull_u16(vget_low_u16(abs_row1),
vget_low_u16(recip1)));
int32x4_t row1_h = vreinterpretq_s32_u32(vmull_u16(vget_high_u16(abs_row1),
vget_high_u16(recip1)));
int32x4_t row2_l = vreinterpretq_s32_u32(vmull_u16(vget_low_u16(abs_row2),
vget_low_u16(recip2)));
int32x4_t row2_h = vreinterpretq_s32_u32(vmull_u16(vget_high_u16(abs_row2),
vget_high_u16(recip2)));
int32x4_t row3_l = vreinterpretq_s32_u32(vmull_u16(vget_low_u16(abs_row3),
vget_low_u16(recip3)));
int32x4_t row3_h = vreinterpretq_s32_u32(vmull_u16(vget_high_u16(abs_row3),
vget_high_u16(recip3)));
/* Narrow back to 16-bit. */
row0 = vcombine_s16(vshrn_n_s32(row0_l, 16), vshrn_n_s32(row0_h, 16));
row1 = vcombine_s16(vshrn_n_s32(row1_l, 16), vshrn_n_s32(row1_h, 16));
row2 = vcombine_s16(vshrn_n_s32(row2_l, 16), vshrn_n_s32(row2_h, 16));
row3 = vcombine_s16(vshrn_n_s32(row3_l, 16), vshrn_n_s32(row3_h, 16));
/* Since VSHR only supports an immediate as its second argument, negate the
* shift value and shift left.
*/
row0 = vreinterpretq_s16_u16(vshlq_u16(vreinterpretq_u16_s16(row0),
vnegq_s16(shift0)));
row1 = vreinterpretq_s16_u16(vshlq_u16(vreinterpretq_u16_s16(row1),
vnegq_s16(shift1)));
row2 = vreinterpretq_s16_u16(vshlq_u16(vreinterpretq_u16_s16(row2),
vnegq_s16(shift2)));
row3 = vreinterpretq_s16_u16(vshlq_u16(vreinterpretq_u16_s16(row3),
vnegq_s16(shift3)));
/* Restore sign to original product. */
row0 = veorq_s16(row0, sign_row0);
row0 = vsubq_s16(row0, sign_row0);
row1 = veorq_s16(row1, sign_row1);
row1 = vsubq_s16(row1, sign_row1);
row2 = veorq_s16(row2, sign_row2);
row2 = vsubq_s16(row2, sign_row2);
row3 = veorq_s16(row3, sign_row3);
row3 = vsubq_s16(row3, sign_row3);
/* Store quantized coefficients to memory. */
vst1q_s16(out_ptr + (i + 0) * DCTSIZE, row0);
vst1q_s16(out_ptr + (i + 1) * DCTSIZE, row1);
vst1q_s16(out_ptr + (i + 2) * DCTSIZE, row2);
vst1q_s16(out_ptr + (i + 3) * DCTSIZE, row3);
}
}
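/* Scalar sketch of the reciprocal-based quantization above, mirroring the
 * divisors[] layout used by the vector code: unsigned reciprocals at offset 0,
 * pre-rounding corrections at DCTSIZE2, and per-coefficient shift amounts at
 * 3 * DCTSIZE2.  quantize_sketch() is illustrative only; the library's own
 * scalar version is quantize() in jcdctmgr.c.
 */
static void quantize_sketch(JCOEFPTR coef_block, DCTELEM *divisors,
                            DCTELEM *workspace)
{
  UDCTELEM *recip = (UDCTELEM *)divisors;
  UDCTELEM *corr = (UDCTELEM *)divisors + DCTSIZE2;
  DCTELEM *shift = divisors + 3 * DCTSIZE2;
  int i;

  for (i = 0; i < DCTSIZE2; i++) {
    DCTELEM x = workspace[i];
    DCTELEM sign = (x < 0) ? -1 : 0;
    UDCTELEM mag = (UDCTELEM)(((x < 0) ? -x : x) + corr[i]);
    uint32_t prod = (uint32_t)mag * recip[i];
    DCTELEM q = (DCTELEM)((prod >> 16) >> shift[i]);
    coef_block[i] = (JCOEF)((q ^ sign) - sign);   /* reapply the sign */
  }
}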

File diff suppressed because it is too large

35
simd/arm/neon-compat.h.in Normal file
View File

@@ -0,0 +1,35 @@
/*
* Copyright (C) 2020, D. R. Commander. All Rights Reserved.
* Copyright (C) 2020-2021, Arm Limited. All Rights Reserved.
*
* This software is provided 'as-is', without any express or implied
* warranty. In no event will the authors be held liable for any damages
* arising from the use of this software.
*
* Permission is granted to anyone to use this software for any purpose,
* including commercial applications, and to alter it and redistribute it
* freely, subject to the following restrictions:
*
* 1. The origin of this software must not be misrepresented; you must not
* claim that you wrote the original software. If you use this software
* in a product, an acknowledgment in the product documentation would be
* appreciated but is not required.
* 2. Altered source versions must be plainly marked as such, and must not be
* misrepresented as being the original software.
* 3. This notice may not be removed or altered from any source distribution.
*/
#cmakedefine HAVE_VLD1_S16_X3
#cmakedefine HAVE_VLD1_U16_X2
#cmakedefine HAVE_VLD1Q_U8_X4
/* Define compiler-independent count-leading-zeros macros */
#if defined(_MSC_VER) && !defined(__clang__)
#define BUILTIN_CLZ(x) _CountLeadingZeros(x)
#define BUILTIN_CLZLL(x) _CountLeadingZeros64(x)
#elif defined(__clang__) || defined(__GNUC__)
#define BUILTIN_CLZ(x) __builtin_clz(x)
#define BUILTIN_CLZLL(x) __builtin_clzll(x)
#else
#error "Unknown compiler"
#endif
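/* Illustrative only (not used by the library): the bit length of a nonzero
 * 32-bit value, as needed when computing Huffman magnitude categories, can be
 * derived from the macros above.
 */
#define BIT_LENGTH_SKETCH(x)  (32 - BUILTIN_CLZ(x))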

Some files were not shown because too many files have changed in this diff