mozjpeg

Author	SHA1	Message	Date
Jonathan Wright	2acfb93c94	Neon: Intrinsics impl. of h1v2 fancy upsampling There was no previous GAS implementation.	2020-11-10 19:09:09 -06:00
Jonathan Wright	975307775c	Neon: Intrinsics impl. of h2v1 & h2v2 fancy upsamp The previous AArch32 GAS implementation of h2v1 fancy upsampling has been removed, since the intrinsics implementation provides the same or better performance. There was no previous GAS implementation of h2v2 fancy upsampling, and there was no previous AArch64 GAS implementation of h2v1 fancy upsampling.	2020-11-10 19:09:09 -06:00
Jonathan Wright	5dbd39323c	Neon: Intrinsics implementation of YCbCr->RGB565 The previous AArch64 GAS implementation is retained by default when using GCC, in order to avoid a performance regression. The intrinsics implementation can be forced on or off using the new NEON_INTRINSICS CMake variable. The previous AArch32 GAS implementation has been removed, since the intrinsics implementation provides the same or better performance.	2020-11-10 19:09:09 -06:00
Jonathan Wright	0f35cd68f2	Neon: Intrinsics implementation of YCbCr->RGB The previous AArch64 GAS implementation is retained by default when using GCC, in order to avoid a performance regression. The intrinsics implementation can be forced on or off using the new NEON_INTRINSICS CMake variable. The previous AArch32 GAS implementation has been removed, since the intrinsics implementation provides the same or better performance.	2020-11-10 19:09:09 -06:00
Jonathan Wright	f3c3f01d23	Neon: Intrinsics impl. of Huffman encoding The previous AArch64 GAS implementation is retained by default when using GCC, in order to avoid a performance regression. The intrinsics implementation can be forced on or off using the new NEON_INTRINSICS CMake variable. The previous AArch32 GAS implementation has been removed, since the intrinsics implementation provides the same or better performance.	2020-11-10 19:09:09 -06:00
Jonathan Wright	d0004de5dd	Neon: Intrinsics impl. of accurate int forward DCT The previous AArch64 GAS implementation is retained by default when using GCC, in order to avoid a performance regression. The intrinsics implementation can be forced on or off using the new NEON_INTRINSICS CMake variable. There was no previous AArch32 GAS implementation.	2020-11-10 19:09:09 -06:00
Jonathan Wright	3d84668d42	Neon: Intrinsics impl. of fast integer forward DCT The previous AArch32 and AArch64 GAS implementations have been removed, since the intrinsics implementation provides the same or better performance.	2020-11-10 19:09:09 -06:00
Jonathan Wright	951d3677eb	Neon: Intrinsics impl. of int sample conv./quant. The previous AArch32 and AArch64 GAS implementations have been removed, since the intrinsics implementation provides the same or better performance.	2020-11-10 19:09:09 -06:00
Jonathan Wright	366168aa7d	Neon: Intrinsics impl. of h2v1 & h2v2 downsampling The previous AArch64 GAS implementation has been removed, since the intrinsics implementation provides the same or better performance. There was no previous AArch32 GAS implementation.	2020-11-10 19:09:09 -06:00
Jonathan Wright	f73b1dbc60	Neon: Intrinsics implementation of RGB->Grayscale There was no previous GAS implementation.	2020-11-10 19:09:09 -06:00
Jonathan Wright	4f2216b435	Neon: Intrinsics implementation of RGB->YCbCr The previous AArch32 and AArch64 GAS implementations are retained by default when using GCC, in order to avoid a performance regression. The intrinsics implementation can be forced on or off using a new NEON_INTRINSICS CMake variable.	2020-11-10 19:09:05 -06:00
DRC	7c1a1789d2	Merge branch 'master' into dev	2020-11-05 16:04:55 -06:00
DRC	6e632af9f6	Demote "fast" [I]DCT algorithms to legacy status - Refer to the "slow" [I]DCT algorithms as "accurate" instead, since they are not slow under libjpeg-turbo. - Adjust documentation claims to reflect the fact that the "slow" and "fast" algorithms produce about the same performance on AVX2-equipped CPUs (because of the dual-lane nature of AVX2, it was not possible to accelerate the "fast" algorithm beyond what was achievable with SSE2.) Also adjust the claims to reflect the fact that the "fast" algorithm tends to be ~5-15% faster than the "slow" algorithm on non-AVX2-equipped CPUs, regardless of the use of the libjpeg-turbo SIMD extensions. - Indicate the legacy status of the "fast" and float algorithms in the documentation and cjpeg/djpeg usage info. - Remove obsolete paragraph in the djpeg man page that suggested that the float algorithm could be faster than the "fast" algorithm on some CPUs.	2020-11-05 15:59:31 -06:00
DRC	cd342acf7f	Merge branch 'master' into dev	2020-10-27 16:45:23 -05:00
DRC	d27b935a88	Consistify formatting to simplify checkstyle The checkstyle script was hastily developed prior to libjpeg-turbo 2.0 beta1, so it has a lot of exceptions and is thus prone to false negatives. This commit eliminates some of those exceptions.	2020-10-27 15:45:09 -05:00
DRC	59352195b2	Merge branch 'master' into dev	2020-10-19 21:17:46 -05:00
DRC	1ed312eab6	"ARM"="Arm", "NEON"="Neon" Refer to: https://www.arm.com/company/policies/trademarks/arm-trademark-list/arm-trademark https://www.arm.com/company/policies/trademarks/arm-trademark-list/neon-trademark NOTE: These changes are only applied to change log entries for 2.0.x and later, since the change log is a historical record and Arm's new trademark policy did not go into effect until late 2017.	2020-10-15 17:47:31 -05:00
DRC	ae08115d4d	Merge branch 'master' into dev	2020-10-15 10:25:46 -05:00
DRC	b5a1472781	Build: Fix permissions	2020-10-15 10:22:51 -05:00
DRC	6ab61fa1d1	Merge branch 'master' into dev	2020-09-13 17:02:27 -05:00
DRC	6ee5d5f568	ARMv8 NEON: Support Windows builds w/AArch64 MinGW Based on: `c5ef665928` Closes #438	2020-07-28 18:24:41 -05:00
DRC	00d48d7e8c	Merge branch 'master' into dev	2020-02-17 18:14:10 -06:00
DRC	035262a18d	MIPS DSPr2: Work around various 'make test' errors Referring to #408, this commit #ifdefs DSPr2 SIMD functions that only work on little endian processors, and it completely excludes jsimd_h2v1_downsample_dspr2() and jsimd_h2v2_downsample_dspr2(). The latter two functions fail with the TJBench tiling regression tests, most likely because the implementation of the functions predates those tests.	2020-02-17 18:13:31 -06:00
DRC	ed7cab47d9	MIPS DSPr2: Fix compiler warning with -mdspr2 If -mdspr2 is passed to the compiler, __mips_dsp will be defined, and __mips_dsp_rev will be >= 2, so parse_proc_cpuinfo() will not be used.	2020-02-17 16:35:00 -06:00
DRC	42d679b9fc	MIPS SIMD: Always honor JSIMD_FORCE* env vars Previously, these environment variables were not honored unless a 74K CPU was detected, but this detection doesn't work properly with QEMU's user mode emulation. With all other CPU types, libjpeg-turbo honors JSIMD_FORCE* regardless of CPU detection.	2020-02-17 15:19:32 -06:00
DRC	b34c85ea4a	Merge branch 'master' into dev	2019-12-31 01:20:12 -06:00
DRC	166e34213e	simd/arm64/jsimd_neon.S: Fix checkstyle issue	2019-12-31 01:10:30 -06:00
DRC	c4675d62e8	Merge branch 'master' into dev	2019-12-31 00:58:42 -06:00
DRC	b542e4c8e9	ARMv8 SIMD: Support execute-only memory (XOM) Move constants out of the .text section in simd/arm64/jsimd_neon.S and into a .rodata section. This ensures that the ARMv8 NEON SIMD extensions are compatible with memory layouts that are marked execute-only (and thus unreadable.) Based on: `88f3ca7664` Closes #318	2019-12-20 14:24:10 -06:00
DRC	81b8c0eed5	Loongson MMI: Merge with MIPS64/add auto-detection Modern Loongson processors are MIPS64-compatible, and MMI instructions are now supported in the mainline of GCC. Thus, this commit adds compile-time and run-time auto-detection of MMI instructions and moves the MMI SIMD extensions for libjpeg-turbo from simd/loongson/ to simd/mips64/. That will allow MMI and MSA instructions to co-exist in the same build once #377 has been integrated. Based on: `82953ddd61` Closes #383	2019-12-17 14:35:49 -06:00
mayeut	e821464f79	ARM64 NEON SIMD impl. of prog. Huffman encoding This commit adds ARM64 NEON optimizations for the encode_mcu_AC_first() and encode_mcu_AC_refine() functions used in progressive Huffman encoding. Compression speedups for the typical set of five libjpeg-turbo test images (https://libjpeg-turbo.org/About/Performance): Cortex-A53: 23.8-39.2% (avg. 32.2%) Cortex-A72: 26.8-41.1% (avg. 33.5%) Apple A7: 29.7-45.9% (avg. 39.6%) Closes #229	2019-12-10 00:21:57 -06:00
DRC	9c6f79e919	Fix formatting issues detected by checkstyle	2019-11-14 12:16:38 -06:00
DRC	f60b6dd36f	Remove vestigial jpeg_nbits_table.inc Not needed since `087c29e07f`	2019-11-12 17:42:39 -06:00
DRC	713c451f58	Enable SSE2 progressive Huffman encoder for x32 Referring to #289, I'm not sure where I arrived at the conclusion that the SSE2 progressive Huffman encoder doesn't provide any speedup for x32. Upon re-testing, I discovered it to be about 50% faster than the C encoder. This commit also re-purposes one of the CI tests (specifically, the jpeg-7 API/ABI test) so that it tests x32 as well.	2019-11-08 16:03:38 -06:00
DRC	cbf0fcc8b7	i386 SSE2 Huffman: Fix pointer arithmetic issue Splitting the pointer arithmetic in GET_SYM() into a separate add and sub instruction was an attempt to work around an error ("invalid operand type") that occurred when assembling the file with NASM. However, this created a link error on macOS ("ld: illegal text-relocation to '_jconst_huff_encode_one_block' in simd/CMakeFiles/simd.dir/i386/jchuff-sse2.asm.o from '_jsimd_huff_encode_one_block_sse2' in simd/CMakeFiles/simd.dir/i386/jchuff-sse2.asm.o for architecture i386") and also changed the alignment of the code in ways that might have affected the previous benchmark results (which took a great deal of time to obtain.) Ultimately, the path of least resistance is just to require NASM 2.13 or later.	2019-11-05 15:56:28 -06:00
DRC	bbedb4b564	Merge branch 'master' into dev	2019-11-05 15:43:21 -06:00
DRC	cf54623b08	Mac: Support hiding SIMD fct symbols w/ NASM 2.14+ (NASM 2.14+ now supports the private_extern section directive, which was previously only available with YASM.)	2019-11-05 15:41:59 -06:00
DRC	087c29e07f	Optimize Huffman encoding This commit improves the C and SSE2 Huffman encoding implementations in the following ways: - Avoid using xmm8-xmm15 in the x86-64 SSE2 implementation. There is no actual need to use those registers, and avoiding them produces a cleaner WIN64 function entry/exit-- as well as shorter code, since REX prefixes can be avoided (this is helpful on certain CPUs, such as Intel Atom, for which instruction fetch and decoding can be a bottleneck.) - Optimize register usage so that fewer REX prefixes and register-register moves are needed. - Use the bit counter to store the number of free bits in the bit buffer rather than the number of bits in the bit buffer. This changes the method for inserting a code into the bit buffer to: (put_buffer |= code << (free_bits -= code_size)); As a result: * Only one bit counter needs to stay in a register (we just keep it in cl.) * The bit buffer contents are already properly aligned to be written out (after a byte swap.) * Adjusting the free bits counter and checking if the bit buffer is full can be combined into a single operation. * We can wait to flush the bit buffer until the buffer is actually full and not just in danger of becoming full. Thus, eight bytes can be flushed at a time. - Speed is quite sensitive to the alignment of branch target labels, so insert some padding and remove branches from the flush code. (Flushing this way isn't actually faster when compared to using branches, but the branchless code doesn't need extra alignment and is thus smaller.) - Speculatively write out the bit buffer as a single 8-byte write, falling back to a byte-by-byte write only if there are any 0xFF bytes in the bit buffer that need to be encoded as 0xFF 0x00. - Use MMX registers for the 32-bit implementation (so the bit buffer can be 64 bits wide.) - Slightly reduce overall function code size. - Eliminate or combine a few SSE instructions. - Make some minor improvements to instruction scheduling. - Adjust flush_bits() in jchuff.c to handle cases in which the bit buffer has less than 7 free bits (apparently that couldn't happen before.) Based on: `947a09defa` `262ebb6b81` `6e9a091221` See change log for performance claims. Closes #292	2019-11-04 19:04:05 -06:00
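The free-bits bookkeeping described in commit `087c29e07f` can be illustrated with a small portable C sketch. This is not the code from jchuff.c or the SSE2 assembly. The `bit_writer` type and the `bw_*` functions are hypothetical names introduced here, the output buffer is assumed to be large enough, and the speculative 8-byte write is left out for brevity. It only shows how `put_buffer |= code << (free_bits -= code_size)` lets a single counter both position the code and signal (by going negative) that the 64-bit buffer must be flushed.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical bit writer; names are illustrative, not libjpeg-turbo's. */
typedef struct {
  uint64_t put_buffer; /* pending bits, filled from the most significant end */
  int free_bits;       /* unused bits left in put_buffer (64 when empty) */
  uint8_t *out;        /* caller-provided output (assumed large enough) */
  size_t out_len;      /* bytes written so far */
} bit_writer;

static void bw_init(bit_writer *w, uint8_t *out) {
  w->put_buffer = 0;
  w->free_bits = 64;
  w->out = out;
  w->out_len = 0;
}

/* Write the top nbytes of put_buffer, stuffing 0x00 after each 0xFF byte
 * as JPEG entropy-coded data requires, then reset the buffer. */
static void bw_emit_bytes(bit_writer *w, int nbytes) {
  for (int i = 0; i < nbytes; i++) {
    uint8_t b = (uint8_t)(w->put_buffer >> (56 - 8 * i));
    w->out[w->out_len++] = b;
    if (b == 0xFF)
      w->out[w->out_len++] = 0x00;
  }
  w->put_buffer = 0;
  w->free_bits = 64;
}

/* Append the low code_size bits of code (1..16 bits in this sketch). */
static void bw_put_bits(bit_writer *w, uint32_t code, int code_size) {
  assert(code_size >= 1 && code_size <= 16 && (code >> code_size) == 0);
  w->free_bits -= code_size;
  if (w->free_bits >= 0) {
    /* The code fits: the updated counter is also the shift amount. */
    w->put_buffer |= (uint64_t)code << w->free_bits;
  } else {
    /* The code straddles the end: store the bits that fit, flush all
     * eight bytes, then start a new buffer with the spilled low bits. */
    int spill = -w->free_bits;
    w->put_buffer |= (uint64_t)code >> spill;
    bw_emit_bytes(w, 8);
    w->free_bits = 64 - spill;
    w->put_buffer = (uint64_t)code << w->free_bits;
  }
}

/* Pad the unused bits with ones and write out the partial buffer. */
static void bw_finish(bit_writer *w) {
  int used = 64 - w->free_bits;
  if (used == 0)
    return;
  w->put_buffer |= (UINT64_C(1) << w->free_bits) - 1;
  bw_emit_bytes(w, (used + 7) / 8);
}
```

Because the buffer is only flushed once `free_bits` goes negative, a buffer that is exactly full stays pending until the next code arrives, which matches the commit's note that flushing can wait until the buffer is actually full.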
DRC	d92ae5df0c	Merge branch 'master' into dev	2019-11-04 18:50:45 -06:00
DRC	6902cdb177	Build: Don't require ASM_NASM if !REQUIRE_SIMD The build system is supposed to fall back to a non-SIMD build if WITH_SIMD==1 but REQUIRE_SIMD==0. Based on: `972df912d0` Closes #384	2019-10-29 12:08:40 -05:00
DRC	95f4d6ef8b	Merge branch 'master' into dev	2019-10-24 02:13:23 -05:00
DRC	3a32d199df	x86 SIMD: Consistify capitalization of NASM types byte, word, dword, qword, oword, and yword are all assembler keywords, so it makes sense to use lowercase for these so as not to mistake them for macros or constants.	2019-10-17 20:02:20 -05:00
DRC	9a51a87af3	x86 SIMD: Remove obsolete [TAB8] comments With apologies to Richard Hendricks, our assembly code no longer uses tabs.	2019-10-17 14:11:35 -05:00
DRC	8ef53b102f	Merge branch 'master' into dev	2019-08-14 22:08:59 -05:00
DRC	a81a8c137b	SSE2 SIMD: Fix prog Huffman enc. error if Sl%16==0 (regression introduced by `5b177b3cab`) The SSE2 implementation of progressive Huffman encoding performed extraneous iterations when the scan length was a multiple of 16. Based on: `bb7f1ef983` Fixes #335 Closes #367	2019-08-14 22:01:30 -05:00
DRC	7fbfe29c65	Merge branch 'master' into dev	2019-07-18 15:18:27 -05:00
DRC	f37b7c1f96	Build: Fix build/install with Xcode IDE Closes #355	2019-07-02 11:28:26 -05:00
DRC	f36d531553	Merge branch 'master' into dev	2019-04-23 14:54:23 -05:00
Chris Blume	aa9db61677	x86 SIMD: Check for CPUID leaf 07H before using According to Intel's manual [1], "If a value entered for CPUID.EAX is higher than the maximum input value for basic or extended function for that processor then the data for the highest basic information leaf is returned." Right now, libjpeg-turbo doesn't first check that leaf 07H is supported before attempting to use it, so the ostensible AVX2 bit (Bit 05) of the CPUID result might actually be Bit 05 from a lower leaf. That bit might be set, even if the CPU doesn't support AVX2. This commit modifies the x86 and x86-64 SIMD feature detection code so that it first checks whether CPUID leaf 07H is supported before attempting to use it to check for AVX2 instruction support. DRC: This commit should fix https://bugzilla.mozilla.org/show_bug.cgi?id=1520760 However, I have not personally been able to reproduce that issue, despite using a Nehalem (pre-AVX2) CPU on which the maximum CPUID leaf has been limited via a BIOS setting. Closes #348 [1] "Intel® 64 and IA-32 Architectures Software Developer's Manual, Volume 2 (2A, 2B, 2C & 2D): Instruction Set Reference, A-Z", https://software.intel.com/sites/default/files/managed/a4/60/325383-sdm-vol-2abcd.pdf, page 3-192.	2019-04-16 17:07:28 -05:00
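The ordering that commit `aa9db61677` describes (query the maximum basic CPUID leaf first, and only then read leaf 07H) can be sketched in portable C as follows. This is a simplified illustration, not libjpeg-turbo's detection code, which is written in assembly. The `cpuid_fn` callback, `reports_avx2()`, and the two simulated CPUs are hypothetical names used so the logic can be exercised without executing the real instruction. A production check would also need to confirm operating-system support for the AVX register state, which this sketch omits.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum { EAX = 0, EBX = 1, ECX = 2, EDX = 3 };
#define AVX2_BIT (UINT32_C(1) << 5) /* CPUID.(EAX=07H,ECX=0):EBX bit 5 */

/* Hypothetical abstraction over the CPUID instruction. */
typedef void (*cpuid_fn)(uint32_t leaf, uint32_t subleaf, uint32_t regs[4]);

/* Return 1 only if leaf 07H exists and its EBX reports AVX2. */
static int reports_avx2(cpuid_fn cpuid) {
  uint32_t regs[4] = { 0, 0, 0, 0 };
  assert(cpuid != NULL);
  cpuid(0, 0, regs); /* leaf 0: EAX holds the maximum basic leaf */
  if (regs[EAX] < 7)
    return 0; /* leaf 07H is not supported, so do not trust its bits */
  cpuid(7, 0, regs);
  return (regs[EBX] & AVX2_BIT) != 0;
}

/* Simulated pre-AVX2 CPU whose maximum basic leaf is 4. As the Intel manual
 * quoted in the commit describes, a request above the maximum returns the
 * data of the highest basic leaf, which here happens to have EBX bit 5 set. */
static void fake_old_cpu(uint32_t leaf, uint32_t subleaf, uint32_t regs[4]) {
  (void)subleaf;
  regs[EAX] = regs[EBX] = regs[ECX] = regs[EDX] = 0;
  if (leaf == 0)
    regs[EAX] = 4;
  else if (leaf >= 4)
    regs[EBX] = AVX2_BIT; /* unrelated bit that merely looks like AVX2 */
}

/* Simulated AVX2-capable CPU with leaf 07H present. */
static void fake_new_cpu(uint32_t leaf, uint32_t subleaf, uint32_t regs[4]) {
  (void)subleaf;
  regs[EAX] = regs[EBX] = regs[ECX] = regs[EDX] = 0;
  if (leaf == 0)
    regs[EAX] = 13;
  else if (leaf == 7)
    regs[EBX] = AVX2_BIT;
}
```

With the guard in place, `reports_avx2(fake_old_cpu)` returns 0 even though the aliased leaf has bit 5 set, while `reports_avx2(fake_new_cpu)` returns 1.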
DRC	afbe48c290	MMI: Support 32-bit Loongson architectures	2019-02-27 13:36:48 -06:00

381 Commits