Merge branch 'master' into dev

2020-10-19 21:17:46 -05:00
parent b8200c6601 f7ca3c5a3d
commit 59352195b2
17 changed files with 97 additions and 82 deletions
--- a/BUILDING.md
+++ b/BUILDING.md
@@ -393,8 +393,8 @@ located (usually **/usr/bin**.)  Next, execute the following commands:
 Building libjpeg-turbo for iOS
 ------------------------------

-iOS platforms, such as the iPhone and iPad, use ARM processors, and all
-currently supported models include NEON instructions.  Thus, they can take
+iOS platforms, such as the iPhone and iPad, use Arm processors, and all
+currently supported models include Neon instructions.  Thus, they can take
 advantage of libjpeg-turbo's SIMD extensions to significantly accelerate JPEG
 compression/decompression.  This section describes how to build libjpeg-turbo
 for these platforms.
@@ -407,7 +407,7 @@ for these platforms.
  it should be installed in your `PATH`.


-### ARMv8 (64-bit)
+### Armv8 (64-bit)

 **gas-preprocessor.pl required if using Xcode < 6**

@@ -439,7 +439,7 @@ Building libjpeg-turbo for Android platforms requires v13b or later of the
 [Android NDK](https://developer.android.com/tools/sdk/ndk).


-### ARMv7 (32-bit)
+### Armv7 (32-bit)

 The following is a general recipe script that can be modified for your specific
 needs.
@@ -464,7 +464,7 @@ needs.
    make


-### ARMv8 (64-bit)
+### Armv8 (64-bit)

 The following is a general recipe script that can be modified for your specific
 needs.
@@ -643,12 +643,12 @@ Create Mac package/disk image.  This requires pkgbuild and productbuild, which
 are installed by default on OS X 10.7 and later.

 In order to create a Mac package/disk image that contains universal
-x86-64/ARM binaries, set the following CMake variable:
+x86-64/Arm binaries, set the following CMake variable:

-* `IOS_ARMV8_BUILD`: Directory containing an ARMv8 (64-bit) iOS build of
+* `IOS_ARMV8_BUILD`: Directory containing an Armv8 (64-bit) iOS build of
  libjpeg-turbo to include in the universal binaries

-You should first use CMake to configure an ARMv8 sub-build of libjpeg-turbo
+You should first use CMake to configure an Armv8 sub-build of libjpeg-turbo
 (see "Building libjpeg-turbo for iOS" above) in a build directory that matches
 the one specified in the aforementioned CMake variable.  Next, configure the
 primary (x86-64) build of libjpeg-turbo as an out-of-tree build, specifying the
--- a/CMakeLists.txt
+++ b/CMakeLists.txt
@@ -55,10 +55,12 @@ if(CMAKE_SYSTEM_PROCESSOR_LC MATCHES "x86_64" OR
    set(CMAKE_SYSTEM_PROCESSOR ${CPU_TYPE})
  endif()
 elseif(CMAKE_SYSTEM_PROCESSOR_LC STREQUAL "aarch64" OR
-  CMAKE_SYSTEM_PROCESSOR_LC MATCHES "arm*64*")
+  CMAKE_SYSTEM_PROCESSOR_LC MATCHES "arm*")
+  if(BITS EQUAL 64)
    set(CPU_TYPE arm64)
-elseif(CMAKE_SYSTEM_PROCESSOR_LC MATCHES "arm*")
+  else()
    set(CPU_TYPE arm)
+  endif()
 elseif(CMAKE_SYSTEM_PROCESSOR_LC MATCHES "ppc*" OR
  CMAKE_SYSTEM_PROCESSOR_LC MATCHES "powerpc*")
  set(CPU_TYPE powerpc)
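
The Arm branch of the `CPU_TYPE` selection in the hunk above can be sketched in Python as follows. This is only an illustration of the post-merge logic, not part of the build system. It assumes that `CMAKE_SYSTEM_PROCESSOR_LC` is a lowercased copy of `CMAKE_SYSTEM_PROCESSOR` and that `BITS` is the pointer width in bits; neither definition appears in the hunk. The function name `classify_arm_cpu` is hypothetical.

```python
import re
from typing import Optional


def classify_arm_cpu(system_processor: str, bits: int) -> Optional[str]:
    """Sketch of the Arm branch only; other branches (x86, powerpc, ...) are omitted."""
    # Assumption: the _LC variable is simply the lowercased processor name.
    proc_lc = system_processor.lower()
    # CMake's MATCHES performs an unanchored regular-expression search, so
    # re.search with the same pattern is used here instead of a glob match.
    if proc_lc == "aarch64" or re.search(r"arm*", proc_lc):
        # After the merge, the 32/64-bit decision depends on BITS rather than
        # on whether the processor name contains "64".
        return "arm64" if bits == 64 else "arm"
    return None
```

Under these assumptions, a processor string that names a 64-bit architecture but is paired with a 32-bit pointer width is classified as `arm`, which the earlier `"arm*64*"` pattern would not have allowed.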
--- a/ChangeLog.md
+++ b/ChangeLog.md
@@ -33,7 +33,7 @@ approximately 2x when using the fast integer IDCT
    The overall decompression speedup for RGB images is now approximately
 2.3-3.7x (compared to 2-3.5x with libjpeg-turbo 2.0.x.)

-3. 32-bit (ARMv7 or ARMv7s) iOS builds of libjpeg-turbo are no longer
+3. 32-bit (Armv7 or Armv7s) iOS builds of libjpeg-turbo are no longer
 supported, and the libjpeg-turbo build system can no longer be used to package
 such builds.  32-bit iOS apps cannot run in iOS 11 and later, and the App Store
 no longer allows them.
@@ -61,10 +61,10 @@ higher-frequency scan.  libjpeg-turbo now applies block smoothing parameters to
 each iMCU row based on which scan generated the pixels in that row, rather than
 always using the block smoothing parameters for the most recent scan.

-7. Added SIMD acceleration for progressive Huffman encoding on ARM 64-bit
-(ARMv8) platforms.  This speeds up the compression of full-color progressive
+7. Added SIMD acceleration for progressive Huffman encoding on Arm 64-bit
+(Armv8) platforms.  This speeds up the compression of full-color progressive
 JPEGs by about 30-40% on average (relative to libjpeg-turbo 2.0.x) when using
-modern ARMv8 CPUs.
+modern Armv8 CPUs.

 8. Added configure-time and run-time auto-detection of Loongson MMI SIMD
 instructions, so that the Loongson MMI SIMD extensions can be included in any
@@ -124,8 +124,8 @@ with `jpeg_skip_scanlines()`, and the issues could not readily be fixed.
     - Fixed an issue whereby `jpeg_skip_scanlines()` always returned 0 when
 skipping past the end of an image.

-3. The ARM 64-bit (ARMv8) NEON SIMD extensions can now be built using MinGW
-toolchains targetting ARM64 (AArch64) Windows binaries.
+3. The Arm 64-bit (Armv8) Neon SIMD extensions can now be built using MinGW
+toolchains targetting Arm64 (AArch64) Windows binaries.

 4. Fixed unexpected visual artifacts that occurred when using
 `jpeg_crop_scanline()` and interblock smoothing while decompressing only the DC
@@ -198,7 +198,7 @@ other user-visible errant behavior, and given that the lossless transformer
 (unlike the decompressor) is not generally exposed to arbitrary data exploits,
 this issue did not likely pose a security risk.

-6. The ARM 64-bit (ARMv8) NEON SIMD assembly code now stores constants in a
+6. The Arm 64-bit (Armv8) Neon SIMD assembly code now stores constants in a
 separate read-only data section rather than in the text section, to support
 execute-only memory layouts.

@@ -484,7 +484,7 @@ algorithm that caused incorrect dithering in the output image.  This algorithm
 now produces bitwise-identical results to the unmerged algorithms.

 12. The SIMD function symbols for x86[-64]/ELF, MIPS/ELF, macOS/x86[-64] (if
-libjpeg-turbo is built with YASM), and iOS/ARM[64] builds are now private.
+libjpeg-turbo is built with YASM), and iOS/Arm[64] builds are now private.
 This prevents those symbols from being exposed in applications or shared
 libraries that link statically with libjpeg-turbo.

--- a/README.md
+++ b/README.md
@@ -2,8 +2,8 @@ Background
 ==========

 libjpeg-turbo is a JPEG image codec that uses SIMD instructions to accelerate
-baseline JPEG compression and decompression on x86, x86-64, ARM, PowerPC, and
-MIPS systems, as well as progressive JPEG compression on x86, x86-64, and ARMv8
+baseline JPEG compression and decompression on x86, x86-64, Arm, PowerPC, and
+MIPS systems, as well as progressive JPEG compression on x86, x86-64, and Armv8
 systems.  On such systems, libjpeg-turbo is generally 2-6x as fast as libjpeg,
 all else being equal.  On other types of systems, libjpeg-turbo can still
 outperform libjpeg by a significant amount, by virtue of its highly-optimized
--- a/cmakescripts/BuildPackages.cmake
+++ b/cmakescripts/BuildPackages.cmake
@@ -23,11 +23,18 @@ set(RPMARCH ${CMAKE_SYSTEM_PROCESSOR})
 if(CPU_TYPE STREQUAL "x86_64")
  set(DEBARCH amd64)
 elseif(CMAKE_SYSTEM_PROCESSOR MATCHES "armv7*")
+  set(RPMARCH armv7hl)
  set(DEBARCH armhf)
 elseif(CPU_TYPE STREQUAL "arm64")
  set(DEBARCH ${CPU_TYPE})
 elseif(CPU_TYPE STREQUAL "arm")
+  if(CMAKE_C_COMPILER MATCHES "gnueabihf")
+    set(RPMARCH armv7hl)
+    set(DEBARCH armhf)
+  else()
+    set(RPMARCH armel)
    set(DEBARCH armel)
+  endif()
 elseif(CMAKE_SYSTEM_PROCESSOR_LC STREQUAL "ppc64le")
  set(DEBARCH ppc64el)
 elseif(CPU_TYPE STREQUAL "powerpc" AND BITS EQUAL 32)
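
The Arm-related branches of the package architecture selection in the hunk above can be sketched in Python as follows. This is a rough illustration only. It assumes that `RPMARCH` still holds its default of `CMAKE_SYSTEM_PROCESSOR` (as suggested by the hunk header) when a branch does not set it, and it ignores the x86-64 and PowerPC branches. The function name `arm_package_arch` is hypothetical.

```python
import re
from typing import Optional, Tuple


def arm_package_arch(cpu_type: str, system_processor: str,
                     c_compiler: str) -> Optional[Tuple[str, str]]:
    """Return (RPMARCH, DEBARCH) for the Arm branches, or None for other CPUs."""
    # Assumed default, taken from the hunk header line.
    rpmarch = system_processor
    # MATCHES is an unanchored regex search in CMake; re.search mirrors that.
    if re.search(r"armv7*", system_processor):
        return ("armv7hl", "armhf")
    if cpu_type == "arm64":
        return (rpmarch, "arm64")
    if cpu_type == "arm":
        # A hard-float toolchain is inferred from the compiler path.
        if "gnueabihf" in c_compiler:
            return ("armv7hl", "armhf")
        return ("armel", "armel")
    return None
```

For example, `arm_package_arch("arm", "arm", "/usr/bin/arm-linux-gnueabihf-gcc")` (a made-up compiler path) takes the new hard-float branch, whereas before the merge every generic `arm` build was packaged as `armel`.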
@@ -128,7 +135,7 @@ endif() # WIN32
 if(APPLE)

 set(IOS_ARMV8_BUILD "" CACHE PATH
-  "Directory containing ARMv8 iOS build to include in universal binaries")
+  "Directory containing Armv8 iOS build to include in universal binaries")

 set(OSX_APP_CERT_NAME "" CACHE STRING
  "Name of the Developer ID Application certificate (in the macOS keychain) that should be used to sign the libjpeg-turbo DMG.  Leave this blank to generate an unsigned DMG.")
--- a/jchuff.c
+++ b/jchuff.c
@@ -35,10 +35,10 @@
 * memory footprint by 64k, which is important for some mobile applications
 * that create many isolated instances of libjpeg-turbo (web browsers, for
 * instance.)  This may improve performance on some mobile platforms as well.
- * This feature is enabled by default only on ARM processors, because some x86
+ * This feature is enabled by default only on Arm processors, because some x86
 * chips have a slow implementation of bsr, and the use of clz/bsr cannot be
 * shown to have a significant performance impact even on the x86 chips that
- * have a fast implementation of it.  When building for ARMv6, you can
+ * have a fast implementation of it.  When building for Armv6, you can
 * explicitly disable the use of clz/bsr by adding -mthumb to the compiler
 * flags (this defines __thumb__).
 */
@@ -73,7 +73,7 @@ typedef size_t bit_buf_type;
 #endif

 /* NOTE: The more optimal Huffman encoding algorithm has not yet been
- * implemented in the ARM NEON SIMD extensions, which is why we retain the old
+ * implemented in the Arm Neon SIMD extensions, which is why we retain the old
 * Huffman encoder behavior for that platform.
 */
 #if defined(WITH_SIMD) && !(defined(__arm__) || defined(__aarch64__))
@@ -98,7 +98,7 @@ typedef struct {
    simd_bit_buf_type simd;
  } put_buffer;                         /* current bit accumulation buffer */
  int free_bits;                        /* # of bits available in it */
-                                        /* (ARM SIMD: # of bits now in it) */
+                                        /* (Arm SIMD: # of bits now in it) */
  int last_dc_val[MAX_COMPS_IN_SCAN];   /* last DC coef for each component */
 } savable_state;

--- a/jcphuff.c
+++ b/jcphuff.c
@@ -43,10 +43,10 @@
 * memory footprint by 64k, which is important for some mobile applications
 * that create many isolated instances of libjpeg-turbo (web browsers, for
 * instance.)  This may improve performance on some mobile platforms as well.
- * This feature is enabled by default only on ARM processors, because some x86
+ * This feature is enabled by default only on Arm processors, because some x86
 * chips have a slow implementation of bsr, and the use of clz/bsr cannot be
 * shown to have a significant performance impact even on the x86 chips that
- * have a fast implementation of it.  When building for ARMv6, you can
+ * have a fast implementation of it.  When building for Armv6, you can
 * explicitly disable the use of clz/bsr by adding -mthumb to the compiler
 * flags (this defines __thumb__).
 */
--- a/release/ReadMe.txt
+++ b/release/ReadMe.txt
@@ -1,4 +1,4 @@
-libjpeg-turbo is a JPEG image codec that uses SIMD instructions to accelerate baseline JPEG compression and decompression on x86, x86-64, ARM, PowerPC, and MIPS systems, as well as progressive JPEG compression on x86, x86-64, and ARMv8 systems.  On such systems, libjpeg-turbo is generally 2-6x as fast as libjpeg, all else being equal.  On other types of systems, libjpeg-turbo can still outperform libjpeg by a significant amount, by virtue of its highly-optimized Huffman coding routines.  In many cases, the performance of libjpeg-turbo rivals that of proprietary high-speed JPEG codecs.
+libjpeg-turbo is a JPEG image codec that uses SIMD instructions to accelerate baseline JPEG compression and decompression on x86, x86-64, Arm, PowerPC, and MIPS systems, as well as progressive JPEG compression on x86, x86-64, and Armv8 systems.  On such systems, libjpeg-turbo is generally 2-6x as fast as libjpeg, all else being equal.  On other types of systems, libjpeg-turbo can still outperform libjpeg by a significant amount, by virtue of its highly-optimized Huffman coding routines.  In many cases, the performance of libjpeg-turbo rivals that of proprietary high-speed JPEG codecs.

 libjpeg-turbo implements both the traditional libjpeg API as well as the less powerful but more straightforward TurboJPEG API.  libjpeg-turbo also features colorspace extensions that allow it to compress from/decompress to 32-bit and big-endian pixel buffers (RGBX, XBGR, etc.), as well as a full-featured Java interface.

--- a/release/deb-control.in
+++ b/release/deb-control.in
@@ -9,9 +9,9 @@ Homepage: @PKGURL@
 Installed-Size: {__SIZE}
 Description: A SIMD-accelerated JPEG codec that provides both the libjpeg and TurboJPEG APIs
 libjpeg-turbo is a JPEG image codec that uses SIMD instructions to accelerate
- baseline JPEG compression and decompression on x86, x86-64, ARM, PowerPC, and
+ baseline JPEG compression and decompression on x86, x86-64, Arm, PowerPC, and
 MIPS systems, as well as progressive JPEG compression on x86, x86-64, and
- ARMv8 systems.  On such systems, libjpeg-turbo is generally 2-6x as fast as
+ Armv8 systems.  On such systems, libjpeg-turbo is generally 2-6x as fast as
 libjpeg, all else being equal.  On other types of systems, libjpeg-turbo can
 still outperform libjpeg by a significant amount, by virtue of its
 highly-optimized Huffman coding routines.  In many cases, the performance of
--- a/release/makedpkg.in
+++ b/release/makedpkg.in
@@ -54,7 +54,11 @@ makedeb()

 	if [ $SUPPLEMENT = 1 ]; then
 		PKGNAME=$PKGNAME\32
+		if [ "$DEBARCH" = "i386" ]; then
 			DEBARCH=amd64
+		else
+			DEBARCH=arm64
+		fi
 	fi

 	umask 022
@@ -110,6 +114,8 @@ if [ ! `uid` -eq 0 ]; then
 fi

 makedeb 0
-if [ "$DEBARCH" = "i386" ]; then makedeb 1; fi
+if [ "$DEBARCH" = "i386" -o "$DEBARCH" = "armel" -o "$DEBARCH" = "armhf" ]; then
+	makedeb 1
+fi

 exit
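
The effect of the two `makedpkg.in` hunks above can be sketched in Python as follows. This is only an illustration of which packages the script would attempt to build for a given `DEBARCH`, not a replacement for the shell script. The function name `planned_packages` and the package name used in the example are hypothetical.

```python
from typing import List, Tuple


def planned_packages(pkgname: str, debarch: str) -> List[Tuple[str, str]]:
    """List (package name, Debian architecture) pairs in build order."""
    # `makedeb 0` always builds the primary package.
    plans = [(pkgname, debarch)]
    # `makedeb 1` builds a supplementary package for 32-bit x86 and,
    # after this merge, for 32-bit Arm as well.
    if debarch in ("i386", "armel", "armhf"):
        # The supplementary package gets a "32" suffix and is labeled with
        # the corresponding 64-bit architecture.
        supplement_arch = "amd64" if debarch == "i386" else "arm64"
        plans.append((pkgname + "32", supplement_arch))
    return plans
```

With a hypothetical package name, `planned_packages("libjpeg-turbo", "armhf")` yields the primary `armhf` package plus a `libjpeg-turbo32` package labeled `arm64`, mirroring the existing i386/amd64 behavior.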
--- a/release/makemacpkg.in
+++ b/release/makemacpkg.in
@@ -160,7 +160,7 @@ install_ios()
 }

 if [ "$BUILDDIRARMV8" != "" ]; then
-	install_ios $BUILDDIRARMV8 ARMv8 armv8 arm64
+	install_ios $BUILDDIRARMV8 Armv8 armv8 arm64
 fi

 install_name_tool -id $LIBDIR/$LIBJPEG_DSO_NAME $PKGROOT/$LIBDIR/$LIBJPEG_DSO_NAME
--- a/release/rpm.spec.in
+++ b/release/rpm.spec.in
@@ -52,8 +52,8 @@ Provides: %{name} = %{version}-%{release}, @CMAKE_PROJECT_NAME@ = %{version}-%{r

 %description
 libjpeg-turbo is a JPEG image codec that uses SIMD instructions to accelerate
-baseline JPEG compression and decompression on x86, x86-64, ARM, PowerPC, and
-MIPS systems, as well as progressive JPEG compression on x86, x86-64, and ARMv8
+baseline JPEG compression and decompression on x86, x86-64, Arm, PowerPC, and
+MIPS systems, as well as progressive JPEG compression on x86, x86-64, and Armv8
 systems.  On such systems, libjpeg-turbo is generally 2-6x as fast as libjpeg,
 all else being equal.  On other types of systems, libjpeg-turbo can still
 outperform libjpeg by a significant amount, by virtue of its highly-optimized
--- a/simd/CMakeLists.txt
+++ b/simd/CMakeLists.txt
@@ -208,7 +208,7 @@ endif()


 ###############################################################################
-# ARM (GAS)
+# Arm (GAS)
 ###############################################################################

 elseif(CPU_TYPE STREQUAL "arm64" OR CPU_TYPE STREQUAL "arm")
--- a/simd/arm/jsimd.c
+++ b/simd/arm/jsimd.c
@@ -13,7 +13,7 @@
 *
 * This file contains the interface between the "normal" portions
 * of the library and the SIMD implementations when running on a
- * 32-bit ARM architecture.
+ * 32-bit Arm architecture.
 */

 #define JPEG_INTERNALS
@@ -118,7 +118,7 @@ init_simd(void)
 #if defined(__ARM_NEON__)
  simd_support |= JSIMD_NEON;
 #elif defined(__linux__) || defined(ANDROID) || defined(__ANDROID__)
-  /* We still have a chance to use NEON regardless of globally used
+  /* We still have a chance to use Neon regardless of globally used
   * -mcpu/-mfpu options passed to gcc by performing runtime detection via
   * /proc/cpuinfo parsing on linux/android */
  while (!parse_proc_cpuinfo(bufsize)) {
--- a/simd/arm/jsimd_neon.S
+++ b/simd/arm/jsimd_neon.S
@@ -1,5 +1,5 @@
 /*
- * ARMv7 NEON optimizations for libjpeg-turbo
+ * Armv7 Neon optimizations for libjpeg-turbo
 *
 * Copyright (C) 2009-2011, Nokia Corporation and/or its subsidiary(-ies).
 *                          All Rights Reserved.
@@ -229,7 +229,7 @@ asm_function jsimd_idct_islow_neon
    ROW7L           .req d30
    ROW7R           .req d31

-    /* Load and dequantize coefficients into NEON registers
+    /* Load and dequantize coefficients into Neon registers
     * with the following allocation:
     *       0 1 2 3 | 4 5 6 7
     *      ---------+--------
@@ -261,7 +261,7 @@ asm_function jsimd_idct_islow_neon
    vld1.16         {d0, d1, d2, d3}, [ip, :128]  /* load constants */
    add             ip, ip, #16
    vmul.s16        q15, q15, q3
-    vpush           {d8-d15}                      /* save NEON registers */
+    vpush           {d8-d15}                      /* save Neon registers */
    /* 1-D IDCT, pass 1, left 4x8 half */
    vadd.s16        d4, ROW7L, ROW3L
    vadd.s16        d5, ROW5L, ROW1L
@@ -507,7 +507,7 @@ asm_function jsimd_idct_islow_neon
    vqrshrn.s16     d17, q9, #2
    vqrshrn.s16     d18, q10, #2
    vqrshrn.s16     d19, q11, #2
-    vpop            {d8-d15}                      /* restore NEON registers */
+    vpop            {d8-d15}                      /* restore Neon registers */
    vqrshrn.s16     d20, q12, #2
      /* Transpose the final 8-bit samples and do signed->unsigned conversion */
      vtrn.16         q8, q9
@@ -688,7 +688,7 @@ asm_function jsimd_idct_islow_neon
 * function from jidctfst.c
 *
 * Normally 1-D AAN DCT needs 5 multiplications and 29 additions.
- * But in ARM NEON case some extra additions are required because VQDMULH
+ * But in Arm Neon case some extra additions are required because VQDMULH
 * instruction can't handle the constants larger than 1. So the expressions
 * like "x * 1.082392200" have to be converted to "x * 0.082392200 + x",
 * which introduces an extra addition. Overall, there are 6 extra additions
@@ -718,7 +718,7 @@ asm_function jsimd_idct_ifast_neon
    TMP3            .req r2
    TMP4            .req ip

-    /* Load and dequantize coefficients into NEON registers
+    /* Load and dequantize coefficients into Neon registers
     * with the following allocation:
     *       0 1 2 3 | 4 5 6 7
     *      ---------+--------
@@ -749,7 +749,7 @@ asm_function jsimd_idct_ifast_neon
    vmul.s16        q13, q13, q1
    vld1.16         {d0}, [ip, :64]  /* load constants */
    vmul.s16        q15, q15, q3
-    vpush           {d8-d13}         /* save NEON registers */
+    vpush           {d8-d13}         /* save Neon registers */
    /* 1-D IDCT, pass 1 */
    vsub.s16        q2, q10, q14
    vadd.s16        q14, q10, q14
@@ -842,7 +842,7 @@ asm_function jsimd_idct_ifast_neon
    vadd.s16        q14, q5, q3
    vsub.s16        q9, q5, q3
    vsub.s16        q13, q10, q2
-    vpop            {d8-d13}      /* restore NEON registers */
+    vpop            {d8-d13}      /* restore Neon registers */
    vadd.s16        q10, q10, q2
    vsub.s16        q11, q12, q1
    vadd.s16        q12, q12, q1
@@ -913,7 +913,7 @@ asm_function jsimd_idct_ifast_neon
 *
 * NOTE: jpeg-8 has an improved implementation of 4x4 inverse-DCT, which
 *       requires much less arithmetic operations and hence should be faster.
- *       The primary purpose of this particular NEON optimized function is
+ *       The primary purpose of this particular Neon optimized function is
 *       bit exact compatibility with jpeg-6b.
 *
 * TODO: a bit better instructions scheduling can be achieved by expanding
@@ -1016,7 +1016,7 @@ asm_function jsimd_idct_4x4_neon
    adr             TMP4, jsimd_idct_4x4_neon_consts
    vld1.16         {d0, d1, d2, d3}, [TMP4, :128]

-    /* Load all COEF_BLOCK into NEON registers with the following allocation:
+    /* Load all COEF_BLOCK into Neon registers with the following allocation:
     *       0 1 2 3 | 4 5 6 7
     *      ---------+--------
     *   0 | d4      | d5
@@ -1126,7 +1126,7 @@ asm_function jsimd_idct_4x4_neon
 *
 * NOTE: jpeg-8 has an improved implementation of 2x2 inverse-DCT, which
 *       requires much less arithmetic operations and hence should be faster.
- *       The primary purpose of this particular NEON optimized function is
+ *       The primary purpose of this particular Neon optimized function is
 *       bit exact compatibility with jpeg-6b.
 */

@@ -1173,7 +1173,7 @@ asm_function jsimd_idct_2x2_neon
    adr             TMP2, jsimd_idct_2x2_neon_consts
    vld1.16         {d0}, [TMP2, :64]

-    /* Load all COEF_BLOCK into NEON registers with the following allocation:
+    /* Load all COEF_BLOCK into Neon registers with the following allocation:
     *       0 1 2 3 | 4 5 6 7
     *      ---------+--------
     *   0 | d4      | d5
@@ -1499,7 +1499,7 @@ asm_function jsimd_ycc_\colorid\()_convert_neon
    adr             ip, jsimd_ycc_\colorid\()_neon_consts
    vld1.16         {d0, d1, d2, d3}, [ip, :128]

-    /* Save ARM registers and handle input arguments */
+    /* Save Arm registers and handle input arguments */
    push            {r4, r5, r6, r7, r8, r9, r10, lr}
    ldr             NUM_ROWS, [sp, #(4 * 8)]
    ldr             INPUT_BUF0, [INPUT_BUF]
@@ -1507,7 +1507,7 @@ asm_function jsimd_ycc_\colorid\()_convert_neon
    ldr             INPUT_BUF2, [INPUT_BUF, #8]
    .unreq          INPUT_BUF

-    /* Save NEON registers */
+    /* Save Neon registers */
    vpush           {d8-d15}

    /* Initially set d10, d11, d12, d13 to 0xFF */
@@ -1814,7 +1814,7 @@ asm_function jsimd_\colorid\()_ycc_convert_neon
    adr             ip, jsimd_\colorid\()_ycc_neon_consts
    vld1.16         {d0, d1, d2, d3}, [ip, :128]

-    /* Save ARM registers and handle input arguments */
+    /* Save Arm registers and handle input arguments */
    push            {r4, r5, r6, r7, r8, r9, r10, lr}
    ldr             NUM_ROWS, [sp, #(4 * 8)]
    ldr             OUTPUT_BUF0, [OUTPUT_BUF]
@@ -1822,7 +1822,7 @@ asm_function jsimd_\colorid\()_ycc_convert_neon
    ldr             OUTPUT_BUF2, [OUTPUT_BUF, #8]
    .unreq          OUTPUT_BUF

-    /* Save NEON registers */
+    /* Save Neon registers */
    vpush           {d8-d15}

    /* Outer loop over scanlines */
@@ -2017,7 +2017,7 @@ asm_function jsimd_fdct_ifast_neon
    adr             TMP, jsimd_fdct_ifast_neon_consts
    vld1.16         {d0}, [TMP, :64]

-    /* Load all DATA into NEON registers with the following allocation:
+    /* Load all DATA into Neon registers with the following allocation:
     *       0 1 2 3 | 4 5 6 7
     *      ---------+--------
     *   0 | d16     | d17    | q8
@@ -2112,8 +2112,8 @@ asm_function jsimd_fdct_ifast_neon
 *
 * Note: the code uses 2 stage pipelining in order to improve instructions
 *       scheduling and eliminate stalls (this provides ~15% better
- *       performance for this function on both ARM Cortex-A8 and
- *       ARM Cortex-A9 when compared to the non-pipelined variant).
+ *       performance for this function on both Arm Cortex-A8 and
+ *       Arm Cortex-A9 when compared to the non-pipelined variant).
 *       The instructions which belong to the second stage use different
 *       indentation for better readiability.
 */
--- a/simd/arm64/jsimd.c
+++ b/simd/arm64/jsimd.c
@@ -12,7 +12,7 @@
 *
 * This file contains the interface between the "normal" portions
 * of the library and the SIMD implementations when running on a
- * 64-bit ARM architecture.
+ * 64-bit Arm architecture.
 */

 #define JPEG_INTERNALS
@@ -115,8 +115,8 @@ parse_proc_cpuinfo(int bufsize)
 */

 /*
- * ARMv8 architectures support NEON extensions by default.
- * It is no longer optional as it was with ARMv7.
+ * Armv8 architectures support Neon extensions by default.
+ * It is no longer optional as it was with Armv7.
 */


--- a/simd/arm64/jsimd_neon.S
+++ b/simd/arm64/jsimd_neon.S
@@ -1,5 +1,5 @@
 /*
- * ARMv8 NEON optimizations for libjpeg-turbo
+ * Armv8 Neon optimizations for libjpeg-turbo
 *
 * Copyright (C) 2009-2011, Nokia Corporation and/or its subsidiary(-ies).
 *                          All Rights Reserved.
@@ -625,7 +625,7 @@ asm_function jsimd_idct_islow_neon
    shrn2           v5.8h, v15.4s, #16  /* wsptr[DCTSIZE*3] = (int)DESCALE(tmp13 + tmp0, CONST_BITS+PASS1_BITS+3) */
    shrn2           v6.8h, v17.4s, #16  /* wsptr[DCTSIZE*4] = (int)DESCALE(tmp13 - tmp0, CONST_BITS+PASS1_BITS+3) */
    movi            v0.16b, #(CENTERJSAMPLE)
-    /* Prepare pointers (dual-issue with NEON instructions) */
+    /* Prepare pointers (dual-issue with Neon instructions) */
      ldp             TMP1, TMP2, [OUTPUT_BUF], 16
    sqrshrn         v28.8b, v2.8h, #(CONST_BITS+PASS1_BITS+3-16)
      ldp             TMP3, TMP4, [OUTPUT_BUF], 16
@@ -1006,7 +1006,7 @@ asm_function jsimd_idct_islow_neon
 * function from jidctfst.c
 *
 * Normally 1-D AAN DCT needs 5 multiplications and 29 additions.
- * But in ARM NEON case some extra additions are required because VQDMULH
+ * But in Arm Neon case some extra additions are required because VQDMULH
 * instruction can't handle the constants larger than 1. So the expressions
 * like "x * 1.082392200" have to be converted to "x * 0.082392200 + x",
 * which introduces an extra addition. Overall, there are 6 extra additions
@@ -1038,7 +1038,7 @@ asm_function jsimd_idct_ifast_neon
       instruction ensures that those bits are set to zero. */
    uxtw x3, w3

-    /* Load and dequantize coefficients into NEON registers
+    /* Load and dequantize coefficients into Neon registers
     * with the following allocation:
     *       0 1 2 3 | 4 5 6 7
     *      ---------+--------
@@ -1051,7 +1051,7 @@ asm_function jsimd_idct_ifast_neon
     *   6 | d28     | d29     ( v22.8h )
     *   7 | d30     | d31     ( v23.8h )
     */
-    /* Save NEON registers used in fast IDCT */
+    /* Save Neon registers used in fast IDCT */
    get_symbol_loc  TMP5, Ljsimd_idct_ifast_neon_consts
    ld1             {v16.8h, v17.8h}, [COEF_BLOCK], 32
    ld1             {v0.8h, v1.8h}, [DCT_TABLE], 32
@@ -1156,7 +1156,7 @@ asm_function jsimd_idct_ifast_neon
    add             v20.8h, v20.8h, v1.8h
    /* Descale to 8-bit and range limit */
    movi            v0.16b, #0x80
-      /* Prepare pointers (dual-issue with NEON instructions) */
+      /* Prepare pointers (dual-issue with Neon instructions) */
      ldp             TMP1, TMP2, [OUTPUT_BUF], 16
    sqshrn          v28.8b, v16.8h, #5
      ldp             TMP3, TMP4, [OUTPUT_BUF], 16
@@ -1235,7 +1235,7 @@ asm_function jsimd_idct_ifast_neon
 *
 * NOTE: jpeg-8 has an improved implementation of 4x4 inverse-DCT, which
 *       requires much less arithmetic operations and hence should be faster.
- *       The primary purpose of this particular NEON optimized function is
+ *       The primary purpose of this particular Neon optimized function is
 *       bit exact compatibility with jpeg-6b.
 *
 * TODO: a bit better instructions scheduling can be achieved by expanding
@@ -1305,7 +1305,7 @@ asm_function jsimd_idct_4x4_neon
       instruction ensures that those bits are set to zero. */
    uxtw x3, w3

-    /* Save all used NEON registers */
+    /* Save all used Neon registers */
    sub             sp, sp, 64
    mov             x9, sp
    /* Load constants (v3.4h is just used for padding) */
@@ -1314,7 +1314,7 @@ asm_function jsimd_idct_4x4_neon
    st1             {v12.8b, v13.8b, v14.8b, v15.8b}, [x9], 32
    ld1             {v0.4h, v1.4h, v2.4h, v3.4h}, [TMP4]

-    /* Load all COEF_BLOCK into NEON registers with the following allocation:
+    /* Load all COEF_BLOCK into Neon registers with the following allocation:
     *       0 1 2 3 | 4 5 6 7
     *      ---------+--------
     *   0 | v4.4h   | v5.4h
@@ -1448,7 +1448,7 @@ asm_function jsimd_idct_4x4_neon
 *
 * NOTE: jpeg-8 has an improved implementation of 2x2 inverse-DCT, which
 *       requires much less arithmetic operations and hence should be faster.
- *       The primary purpose of this particular NEON optimized function is
+ *       The primary purpose of this particular Neon optimized function is
 *       bit exact compatibility with jpeg-6b.
 */

@@ -1497,7 +1497,7 @@ asm_function jsimd_idct_2x2_neon
    st1             {v12.8b, v13.8b, v14.8b, v15.8b}, [x9], 32
    ld1             {v14.4h}, [TMP2]

-    /* Load all COEF_BLOCK into NEON registers with the following allocation:
+    /* Load all COEF_BLOCK into Neon registers with the following allocation:
     *       0 1 2 3 | 4 5 6 7
     *      ---------+--------
     *   0 | v4.4h   | v5.4h
@@ -1871,7 +1871,7 @@ asm_function jsimd_ycc_\colorid\()_convert_neon_slowst3
    /* Load constants to d1, d2, d3 (v0.4h is just used for padding) */
    get_symbol_loc  x15, Ljsimd_ycc_rgb_neon_consts

-    /* Save NEON registers */
+    /* Save Neon registers */
    st1             {v8.8b, v9.8b, v10.8b, v11.8b}, [x9], 32
    st1             {v12.8b, v13.8b, v14.8b, v15.8b}, [x9], 32
    ld1             {v0.4h, v1.4h}, [x15], 16
@@ -2156,7 +2156,7 @@ generate_jsimd_ycc_rgb_convert_neon extbgr,  24, 2, .4h,  1, .4h,  0, .4h,  .8b,
 .endm

 /* TODO: expand macros and interleave instructions if some in-order
- *       ARM64 processor actually can dual-issue LOAD/STORE with ALU */
+ *       AArch64 processor actually can dual-issue LOAD/STORE with ALU */
 .macro do_rgb_to_yuv_stage2_store_load_stage1 fast_ld3
    do_rgb_to_yuv_stage2
    do_load         \bpp, 8, \fast_ld3
@@ -2196,7 +2196,7 @@ asm_function jsimd_\colorid\()_ycc_convert_neon_slowld3
    ldr             OUTPUT_BUF2, [OUTPUT_BUF, #16]
    .unreq          OUTPUT_BUF

-    /* Save NEON registers */
+    /* Save Neon registers */
    sub             sp, sp, #64
    mov             x9, sp
    st1             {v8.8b, v9.8b, v10.8b, v11.8b}, [x9], 32
@@ -2410,13 +2410,13 @@ asm_function jsimd_fdct_islow_neon
    get_symbol_loc  TMP, Ljsimd_fdct_islow_neon_consts
    ld1             {v0.8h, v1.8h}, [TMP]

-    /* Save NEON registers */
+    /* Save Neon registers */
    sub             sp, sp, #64
    mov             x10, sp
    st1             {v8.8b, v9.8b, v10.8b, v11.8b}, [x10], 32
    st1             {v12.8b, v13.8b, v14.8b, v15.8b}, [x10], 32

-    /* Load all DATA into NEON registers with the following allocation:
+    /* Load all DATA into Neon registers with the following allocation:
     *       0 1 2 3 | 4 5 6 7
     *      ---------+--------
     *   0 | d16     | d17    | v16.8h
@@ -2643,7 +2643,7 @@ asm_function jsimd_fdct_islow_neon
    st1             {v16.8h, v17.8h, v18.8h, v19.8h}, [DATA], 64
    st1             {v20.8h, v21.8h, v22.8h, v23.8h}, [DATA]

-    /* Restore NEON registers */
+    /* Restore Neon registers */
    ld1             {v8.8b, v9.8b, v10.8b, v11.8b}, [sp], 32
    ld1             {v12.8b, v13.8b, v14.8b, v15.8b}, [sp], 32

@@ -2695,7 +2695,7 @@ asm_function jsimd_fdct_ifast_neon
    get_symbol_loc  TMP, Ljsimd_fdct_ifast_neon_consts
    ld1             {v0.4h}, [TMP]

-    /* Load all DATA into NEON registers with the following allocation:
+    /* Load all DATA into Neon registers with the following allocation:
     *       0 1 2 3 | 4 5 6 7
     *      ---------+--------
     *   0 | d16     | d17    | v0.8h
@@ -3080,7 +3080,7 @@ asm_function jsimd_huff_encode_one_block_neon_slowtbl
 .endif
    sub             sp, sp, 272
    sub             BUFFER, BUFFER, #0x1    /* BUFFER=buffer-- */
-    /* Save ARM registers */
+    /* Save Arm registers */
    stp             x19, x20, [sp]
    get_symbol_loc  x15, Ljsimd_huff_encode_one_block_neon_consts
    ldr             PUT_BUFFER, [x0, #0x10]