
14 Commits

Author SHA1 Message Date
DRC
9055fb408d ARM/MIPS: Change the behavior of JSIMD_FORCE*
The JSIMD_FORCE* environment variables previously meant "force the use
of this instruction set if it is available, even if others are available
as well", but that did nothing on ARM platforms, since only one
instruction set is ever available there.  Since the ARM and MIPS CPU feature
detection code is less than bulletproof, and since there is only one
SIMD instruction set (currently) supported on those platforms, it makes
sense for the JSIMD_FORCE* environment variables on those platforms to
actually force the use of the SIMD instruction set, thus bypassing the
CPU feature detection code.

This addresses a concern raised in #88 whereby parsing /proc/cpuinfo
didn't work within a QEMU environment.  This at least provides a
workaround, allowing users to force-enable or force-disable SIMD
instructions for ARM and MIPS builds of libjpeg-turbo.
2016-07-07 13:38:48 -05:00
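The override behavior described in commit 9055fb408d can be sketched roughly as follows. This is a minimal illustration, not the library's actual code: `sketch_apply_overrides()`, `env_is_one()`, and `SKETCH_SIMD_NEON` are hypothetical names, and `JSIMD_FORCENEON` / `JSIMD_FORCENONE` are assumed instances of the `JSIMD_FORCE*` pattern that the commit message mentions.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical feature flag, for illustration only. */
#define SKETCH_SIMD_NEON 0x01

/* Returns 1 if the named environment variable is set to exactly "1". */
static int env_is_one(const char *name)
{
  const char *value = getenv(name);

  return value != NULL && strcmp(value, "1") == 0;
}

/*
 * detected: the result of the (possibly unreliable) CPU feature detection.
 * Returns the feature mask that should actually be used.
 */
unsigned int sketch_apply_overrides(unsigned int detected)
{
  unsigned int features = detected;

  /* Force-enable: use NEON regardless of what detection reported.
     (Variable name is an assumption based on the JSIMD_FORCE* pattern.) */
  if (env_is_one("JSIMD_FORCENEON"))
    features = SKETCH_SIMD_NEON;

  /* Force-disable: checked last, so it wins if both variables are set.
     (Variable name is likewise an assumption.) */
  if (env_is_one("JSIMD_FORCENONE"))
    features = 0;

  return features;
}
```

Under these assumptions, a user inside QEMU, where parsing /proc/cpuinfo may fail, could export the force-enable variable before running the program, and the detection result would simply be ignored.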
DRC
123f7258a8 Format copyright headers more consistently
The IJG convention is to format copyright notices as:

Copyright (C) YYYY, Owner.

We try to maintain this convention for any code that is part of the
libjpeg API library (with the exception of preserving the copyright
notices from Cendio's code verbatim, since those predate
libjpeg-turbo.)

Note that the phrase "All Rights Reserved" is no longer necessary, since
all Buenos Aires Convention signatories had signed onto the Berne
Convention by 2000.  However, our convention is to retain this phrase for any files
that have a self-contained copyright header but to leave it off of any
files that refer to another file for conditions of distribution and use.
For instance, all of the non-SIMD files in the libjpeg API library refer
to README.ijg, and the copyright message in that file contains "All
Rights Reserved", so it is unnecessary to add it to the individual
files.

The TurboJPEG code retains my preferred formatting convention for
copyright notices, which is based on that of VirtualGL (where the
TurboJPEG API originated.)
2016-05-28 19:16:58 -05:00
DRC
bd49803f92 Use consistent/modern code formatting for pointers
The convention used by libjpeg:

    type * variable;

is not very common anymore, because it looks too much like
multiplication.  Some (particularly C++ programmers) prefer to tuck the
pointer symbol against the type:

    type* variable;

to emphasize that a pointer to a type is effectively a new type.
However, this can also be confusing, since defining multiple variables
on the same line would not work properly:

    type* variable1, variable2;  /* Only variable1 is actually a
                                    pointer. */

This commit reformats the entirety of the libjpeg-turbo code base so
that it uses the same code formatting convention for pointers that the
TurboJPEG API code uses:

    type *variable1, *variable2;

This seems to be the most common convention among C programmers, and
it is the convention used by other codec libraries, such as libpng and
libtiff.
2016-02-19 09:10:07 -06:00
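The multiple-declaration pitfall described in commit bd49803f92 can be checked with a small standalone sketch. The struct and function names below are hypothetical and exist only for this illustration; the behavior itself follows from standard C declaration syntax, in which `*` binds to the declarator rather than to the type.

```c
#include <stddef.h>

struct sketch_sizes {
  size_t first;   /* sizeof the first declared variable */
  size_t second;  /* sizeof the second declared variable */
};

/* "char* a, b;" declares a as a pointer to char but b as a plain char. */
struct sketch_sizes sketch_tucked_star(void)
{
  char* a, b;
  struct sketch_sizes s = { sizeof(a), sizeof(b) };

  return s;
}

/* "char *a, *b;" declares two pointers to char. */
struct sketch_sizes sketch_star_per_variable(void)
{
  char *a, *b;
  struct sketch_sizes s = { sizeof(a), sizeof(b) };

  return s;
}
```

In the first function, `second` is 1 (the size of a `char`), whereas in the second function both fields equal `sizeof(char *)`.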
DRC
8632f1b262 ARM64: Avoid tbl instruction on Cortex-A53/A57
Full-color compression speedups relative to previous commits:
Cortex-A53 (Nexus 5X), Android, 64-bit: 0.91-3.0% (avg. 1.8%)
Cortex-A57 (Nexus 5X), Android, 64-bit: -0.35-1.5% (avg. 0.65%)
2016-02-09 00:38:58 -06:00
DRC
46ecffa324 ARM64: Avoid LD3/ST3 at run time, not compile time
... and only if ThunderX is detected.  This can be easily expanded later
on to include other CPUs that are known to suffer from slow LD3/ST3, but
it doesn't make sense to disable LD3/ST3 for all non-Android Linux
platforms just because ThunderX is slow.
2016-02-07 22:05:56 -06:00
DRC
219470d6ac ARM64 NEON SIMD implementation of Huffman encoding
Full-color compression speedups relative to previous commits:
Cortex-A53 (Nexus 5X), Android, 64-bit: 1.1-13% (avg. 6.0%)
Cortex-A57 (Nexus 5X), Android, 64-bit: 0.0-22% (avg. 6.3%)

Refer to #47 and #50 for discussion

Closes #50

Note that this commit introduces a /proc/cpuinfo parser similar to that
of the ARM32 implementation.  It is used specifically to check whether
the code is running on Cavium ThunderX and, if so, to disable the ARM64
SIMD Huffman routines (which slow performance by an average of 8% on
that CPU.)

Based on:
a8c282e5e5
2016-02-07 21:51:11 -06:00
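A run-time /proc/cpuinfo check of the kind described in commits 219470d6ac and 46ecffa324 might look roughly like the sketch below. It is not the parser from the commit: the function names are hypothetical, the text is passed in as a string so that the logic can be exercised without reading the real file, and the `CPU implementer` value `0x43` and `CPU part` value `0x0a1` used to identify ThunderX are assumptions that should be verified against the CPU vendor's documentation.

```c
#include <string.h>

/*
 * Returns 1 if some line of `text` begins with `field` and its value
 * (the text after ':' with leading blanks skipped) equals `value`.
 */
int sketch_cpuinfo_has(const char *text, const char *field, const char *value)
{
  size_t field_len = strlen(field);
  size_t value_len = strlen(value);
  const char *line = text;

  while (line != NULL && *line != '\0') {
    const char *end = strchr(line, '\n');
    size_t line_len = end ? (size_t)(end - line) : strlen(line);

    if (line_len > field_len && strncmp(line, field, field_len) == 0) {
      const char *colon = memchr(line, ':', line_len);

      if (colon != NULL) {
        const char *line_end = line + line_len;
        const char *p = colon + 1;

        /* Skip blanks between the ':' separator and the value. */
        while (p < line_end && (*p == ' ' || *p == '\t'))
          p++;
        if ((size_t)(line_end - p) == value_len &&
            strncmp(p, value, value_len) == 0)
          return 1;
      }
    }
    line = end ? end + 1 : NULL;
  }
  return 0;
}

/* Assumed identification values for Cavium ThunderX (see lead-in). */
int sketch_is_thunderx(const char *cpuinfo_text)
{
  return sketch_cpuinfo_has(cpuinfo_text, "CPU implementer", "0x43") &&
         sketch_cpuinfo_has(cpuinfo_text, "CPU part", "0x0a1");
}
```

A caller would read /proc/cpuinfo into a NUL-terminated buffer once at initialization, pass it to `sketch_is_thunderx()`, and clear the relevant SIMD feature flags if the function returns 1.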
DRC
ec6941f7bc Complete the ARM64 NEON SIMD implementation
This adds 64-bit NEON coverage for all of the algorithms that are
covered by the 32-bit NEON implementation, except for h2v1 (4:2:2) fancy
upsampling (used when decompressing 4:2:2 JPEG images.)  It also adds
64-bit NEON SIMD coverage for:

* slow integer forward DCT (compressor)
* h2v2 (4:2:0) downsampling (compressor)
* h2v1 (4:2:2) downsampling (compressor)

which are not covered in the 32-bit implementation.

Compression speedups relative to libjpeg-turbo 1.4.2:
Apple A7 (iPhone 5S), iOS, 64-bit: 113-150% (reported)
48-core ThunderX (RunAbove ARM Cloud), Linux, 64-bit: 2.1-33% (avg. 15%)

Refer to #44 and #49 for discussion

This commit also removes the unnecessary

    if (simd_support & JSIMD_ARM_NEON)

statements from the jsimd* algorithm functions.  Since the jsimd_can*()
functions check for the existence of NEON, the corresponding algorithm
functions will never be called if NEON isn't available.

Based on:
dcd9d84f10
b0d87b811f
70cd5c8a49
3e58d9a064
837b19542f
73dc43ccc8
a82b71a261
c1b1188c21
305c89284e
7f443f9995
4c2b53b77d

Unified version with fixes:
1004a3cd05
2016-01-15 11:21:48 -06:00
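The reasoning in commit ec6941f7bc for dropping the per-function NEON checks can be illustrated with a small sketch. All names here are hypothetical stand-ins for the jsimd_can*() and jsimd* function pairs; the point is only that the capability query is the single place where the feature mask is tested.

```c
/* Hypothetical feature flag and dispatch result, for illustration only. */
#define SKETCH_NEON 0x01

enum sketch_path { SKETCH_PATH_C, SKETCH_PATH_SIMD };

static unsigned int sketch_simd_support;

/* Stand-in for whatever initialization sets the detected feature mask. */
void sketch_set_simd_support(unsigned int mask)
{
  sketch_simd_support = mask;
}

/* Capability query: the only place that tests for NEON. */
int sketch_can_convert(void)
{
  return (sketch_simd_support & SKETCH_NEON) != 0;
}

/* SIMD algorithm function: no NEON test here, because callers are
   expected to reach it only after sketch_can_convert() returned 1. */
enum sketch_path sketch_convert_simd(void)
{
  return SKETCH_PATH_SIMD;
}

/* Plain C fallback. */
enum sketch_path sketch_convert_c(void)
{
  return SKETCH_PATH_C;
}

/* Caller: chooses the implementation based on the capability query. */
enum sketch_path sketch_convert(void)
{
  return sketch_can_convert() ? sketch_convert_simd() : sketch_convert_c();
}
```

With this structure, repeating the mask test inside `sketch_convert_simd()` would be redundant, which is the argument the commit message makes for removing it.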
DRC
f3a8684cd1 SSE2 SIMD implementation of Huffman encoding
Full-color compression speedups relative to libjpeg-turbo 1.4.2:

2.8 GHz Intel Xeon W3530, Linux, 64-bit:  2.2-18% (avg. 9.5%)
2.8 GHz Intel Xeon W3530, Linux, 32-bit:  10-25% (avg. 17%)

2.3 GHz AMD A10-4600M APU, Linux, 64-bit:  4.9-17% (avg. 11%)
2.3 GHz AMD A10-4600M APU, Linux, 32-bit:  8.8-19% (avg. 15%)

3.0 GHz Intel Core i7, OS X, 64-bit:  3.5-16% (avg. 10%)
3.0 GHz Intel Core i7, OS X, 32-bit:  4.8-14% (avg. 11%)

2.6 GHz AMD Athlon 64 X2 5050e:
Performance-neutral (give or take a few percent)

Full-color compression speedups relative to IPP:

2.8 GHz Intel Xeon W3530, Linux, 64-bit:  4.8-34% (avg. 19%)
2.8 GHz Intel Xeon W3530, Linux, 32-bit:  -19-7.0% (avg. -7.0%)

Refer to #42 for discussion.  Numerous other approaches were attempted,
but this one proved to be the most performant across all platforms.

This commit also fixes #3 (works around, really-- the clang-compiled version
of jchuff.c still performs 20% worse than its GCC-compiled counterpart, but
that code is now bypassed by the new SSE2 Huffman algorithm.)

Based on:
2cb4d41330
36c94e050d
2016-01-12 03:03:49 -06:00
DRC
bdc7650b9e ARM64 NEON SIMD support for YCC-to-RGB565 conversion
git-svn-id: svn+ssh://svn.code.sf.net/p/libjpeg-turbo/code/trunk@1386 632fc199-4ca6-4c93-a231-07263d6284db
2014-08-23 15:57:38 +00:00
DRC
d729f4da9c ARM NEON SIMD support for YCC-to-RGB565 conversion, and optimizations to the existing YCC-to-RGB conversion code:
-----

aee36252be.patch

From aee36252be20054afce371a92406fc66ba6627b5 Mon Sep 17 00:00:00 2001
From: Siarhei Siamashka <siarhei.siamashka@gmail.com>
Date: Wed, 13 Aug 2014 03:50:22 +0300
Subject: [PATCH] ARM: Faster NEON yuv->rgb conversion for Krait and Cortex-A15

The older code was developed and tested only on ARM Cortex-A8 and ARM Cortex-A9.
Tuning it for newer ARM processors can introduce some speed-up (up to 20%).

The performance of the inner loop (conversion of 8 pixels) improves from
~27 cycles down to ~22 cycles on Qualcomm Krait 300, and from ~20 cycles
down to ~18 cycles on ARM Cortex-A15.

The performance remains exactly the same on ARM Cortex-A7 (~58 cycles),
ARM Cortex-A8 (~25 cycles) and ARM Cortex-A9 (~30 cycles) processors.

Also use larger indentation in the source code for separating two independent
instruction streams.

-----

a5efdbf22c.patch

From a5efdbf22ce9c1acd4b14a353cec863c2c57557e Mon Sep 17 00:00:00 2001
From: Siarhei Siamashka <siarhei.siamashka@gmail.com>
Date: Wed, 13 Aug 2014 07:23:09 +0300
Subject: [PATCH] ARM: NEON optimized yuv->rgb565 conversion

The performance of the inner loop (conversion of 8 pixels):
* ARM Cortex-A7:  ~55 cycles
* ARM Cortex-A8:  ~28 cycles
* ARM Cortex-A9:  ~32 cycles
* ARM Cortex-A15: ~20 cycles
* Qualcomm Krait: ~24 cycles

Based on the Linaro rgb565 patch from
    https://sourceforge.net/p/libjpeg-turbo/patches/24/
but implements better instruction scheduling.


git-svn-id: svn+ssh://svn.code.sf.net/p/libjpeg-turbo/code/trunk@1385 632fc199-4ca6-4c93-a231-07263d6284db
2014-08-23 15:47:51 +00:00
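For readers unfamiliar with the operation that commit d729f4da9c accelerates, the following scalar sketch shows what a YCC-to-RGB565 conversion of one pixel computes. It is not the NEON code from the patches: the function names are hypothetical, the coefficients are the standard JFIF YCbCr-to-RGB constants, and floating-point arithmetic is used for clarity where an optimized implementation would use fixed-point math.

```c
#include <stdint.h>

/* Rounds to the nearest integer and clamps to the 8-bit range. */
static uint8_t sketch_round_clamp(double v)
{
  int i = (int)(v + (v >= 0.0 ? 0.5 : -0.5));

  return (uint8_t)(i < 0 ? 0 : (i > 255 ? 255 : i));
}

/* Packs 8-bit R, G, B into 16 bits: 5 bits red, 6 green, 5 blue. */
uint16_t sketch_pack_rgb565(uint8_t r, uint8_t g, uint8_t b)
{
  return (uint16_t)(((r & 0xF8) << 8) | ((g & 0xFC) << 3) | (b >> 3));
}

/* Converts one full-range YCbCr pixel (JFIF convention) to RGB565. */
uint16_t sketch_ycc_to_rgb565(uint8_t y, uint8_t cb, uint8_t cr)
{
  double cbc = cb - 128.0, crc = cr - 128.0;
  uint8_t r = sketch_round_clamp(y + 1.402 * crc);
  uint8_t g = sketch_round_clamp(y - 0.344136 * cbc - 0.714136 * crc);
  uint8_t b = sketch_round_clamp(y + 1.772 * cbc);

  return sketch_pack_rgb565(r, g, b);
}
```

The SIMD versions referenced in the patches process 8 pixels per inner-loop iteration; this sketch handles a single pixel and is meant only to show the arithmetic and the bit layout.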
DRC
3728aa01d8 Fix performance and other issues uncovered in testing with actual ARM64 hardware; formatting tweaks; remove NEON platform check (NEON is always available with ARMv8)
git-svn-id: svn+ssh://svn.code.sf.net/p/libjpeg-turbo/code/trunk@1333 632fc199-4ca6-4c93-a231-07263d6284db
2014-07-23 14:14:14 +00:00
DRC
1419852c42 Clean up code formatting in the SIMD interface functions
git-svn-id: svn+ssh://svn.code.sf.net/p/libjpeg-turbo/code/trunk@1305 632fc199-4ca6-4c93-a231-07263d6284db
2014-05-15 19:45:11 +00:00
DRC
1a45b81fa2 Remove trailing spaces (+ one additional tab in TJUnitTest.java that was missed in the previous commit)
git-svn-id: svn+ssh://svn.code.sf.net/p/libjpeg-turbo/code/trunk@1279 632fc199-4ca6-4c93-a231-07263d6284db
2014-05-09 18:06:58 +00:00
DRC
2d07ee519d Create a separate stub file for 64-bit ARM, since it currently implements only the decompression-related functions.
git-svn-id: svn+ssh://svn.code.sf.net/p/libjpeg-turbo/code/trunk@1109 632fc199-4ca6-4c93-a231-07263d6284db
2014-02-05 19:03:41 +00:00