Improve floating point DCT

From libjpeg-turbo r1288
Port the more accurate (and slightly faster) floating point IDCT
implementation from jpeg-8a and later.  New research revealed that the
SSE/SSE2 floating point IDCT implementation was actually more accurate
than the jpeg-6b implementation, not less, which is why its
mathematical results have always differed from those of the jpeg-6b
implementation.  This patch brings the accuracy of the C code in line
with that of the SSE/SSE2 code.
This commit is contained in:
Frank Bossen
2014-07-23 10:26:46 -04:00
parent ccb1d12f53
commit 3adc64a4cb
3 changed files with 56 additions and 46 deletions

View File

@@ -23,6 +23,14 @@ and is out of scope for a codec library.
[5] The TurboJPEG API can now be used to compress JPEG images from YUV planar [5] The TurboJPEG API can now be used to compress JPEG images from YUV planar
source images. source images.
[7] Improved the accuracy and performance of the non-SIMD implementation of the
floating point inverse DCT (using code borrowed from libjpeg v8a and later.)
The accuracy of this implementation now matches the accuracy of the SSE/SSE2
implementation. Note, however, that the floating point DCT/IDCT algorithms are
mainly a legacy feature. They generally do not produce significantly better
accuracy than the slow integer DCT/IDCT algorithms, and they are quite a bit
slower.
1.3.1 1.3.1
===== =====

View File

@@ -426,10 +426,19 @@ following reasons:
slightly more accurate than the implementation in libjpeg v6b, but not by slightly more accurate than the implementation in libjpeg v6b, but not by
any amount perceptible to human vision (generally in the range of 0.01 to any amount perceptible to human vision (generally in the range of 0.01 to
0.08 dB gain in PNSR.) 0.08 dB gain in PNSR.)
-- When not using the SIMD extensions, libjpeg-turbo uses the more accurate
(and slightly faster) floating point IDCT algorithm introduced in libjpeg
v8a as opposed to the algorithm used in libjpeg v6b. It should be noted,
however, that this algorithm basically brings the accuracy of the floating
point IDCT in line with the accuracy of the slow integer IDCT. The floating
point DCT/IDCT algorithms are mainly a legacy feature, and they do not
produce significantly more accuracy than the slow integer algorithms (to put
numbers on this, the typical difference in PNSR between the two algorithms
is less than 0.10 dB, whereas changing the quality level by 1 in the upper
range of the quality scale is typically more like a 1.0 dB difference.)
-- When not using the SIMD extensions, then the accuracy of the floating point -- When not using the SIMD extensions, then the accuracy of the floating point
DCT/IDCT can depend on the compiler and compiler settings. DCT/IDCT can depend on the compiler and compiler settings.
While libjpeg-turbo does emulate the libjpeg v8 API/ABI, under the hood, it is While libjpeg-turbo does emulate the libjpeg v8 API/ABI, under the hood, it is
still using the same algorithms as libjpeg v6b, so there are several specific still using the same algorithms as libjpeg v6b, so there are several specific
cases in which libjpeg-turbo cannot be expected to produce the same output as cases in which libjpeg-turbo cannot be expected to produce the same output as
@@ -445,10 +454,6 @@ libjpeg v8:
output of libjpeg v8 is less accurate than that of libjpeg v6b for this output of libjpeg v8 is less accurate than that of libjpeg v6b for this
reason. reason.
-- When using the floating point IDCT, for the reasons stated above and also
because the floating point IDCT algorithm was modified in libjpeg v8a to
improve accuracy.
-- When decompressing using a scaling factor > 1 and merged (AKA "non-fancy" or -- When decompressing using a scaling factor > 1 and merged (AKA "non-fancy" or
"non-smooth") chrominance upsampling, because libjpeg v8 does not support "non-smooth") chrominance upsampling, because libjpeg v8 does not support
merged upsampling with scaling factors > 1. merged upsampling with scaling factors > 1.

View File

@@ -1,9 +1,12 @@
/* /*
* jidctflt.c * jidctflt.c
* *
* This file was part of the Independent JPEG Group's software:
* Copyright (C) 1994-1998, Thomas G. Lane. * Copyright (C) 1994-1998, Thomas G. Lane.
* This file is part of the Independent JPEG Group's software. * Modified 2010 by Guido Vollbeding.
* For conditions of distribution and use, see the accompanying README file. * libjpeg-turbo Modifications:
* Copyright (C) 2014, D. R. Commander.
* For conditions of distribution and use, see the accompanying README file.
* *
* This file contains a floating-point implementation of the * This file contains a floating-point implementation of the
* inverse DCT (Discrete Cosine Transform). In the IJG code, this routine * inverse DCT (Discrete Cosine Transform). In the IJG code, this routine
@@ -76,10 +79,10 @@ jpeg_idct_float (j_decompress_ptr cinfo, jpeg_component_info * compptr,
FLOAT_MULT_TYPE * quantptr; FLOAT_MULT_TYPE * quantptr;
FAST_FLOAT * wsptr; FAST_FLOAT * wsptr;
JSAMPROW outptr; JSAMPROW outptr;
JSAMPLE *range_limit = IDCT_range_limit(cinfo); JSAMPLE *range_limit = cinfo->sample_range_limit;
int ctr; int ctr;
FAST_FLOAT workspace[DCTSIZE2]; /* buffers data between passes */ FAST_FLOAT workspace[DCTSIZE2]; /* buffers data between passes */
SHIFT_TEMPS #define _0_125 ((FLOAT_MULT_TYPE)0.125)
/* Pass 1: process columns from input, store into work array. */ /* Pass 1: process columns from input, store into work array. */
@@ -101,7 +104,8 @@ jpeg_idct_float (j_decompress_ptr cinfo, jpeg_component_info * compptr,
inptr[DCTSIZE*5] == 0 && inptr[DCTSIZE*6] == 0 && inptr[DCTSIZE*5] == 0 && inptr[DCTSIZE*6] == 0 &&
inptr[DCTSIZE*7] == 0) { inptr[DCTSIZE*7] == 0) {
/* AC terms all zero */ /* AC terms all zero */
FAST_FLOAT dcval = DEQUANTIZE(inptr[DCTSIZE*0], quantptr[DCTSIZE*0]); FAST_FLOAT dcval = DEQUANTIZE(inptr[DCTSIZE*0],
quantptr[DCTSIZE*0] * _0_125);
wsptr[DCTSIZE*0] = dcval; wsptr[DCTSIZE*0] = dcval;
wsptr[DCTSIZE*1] = dcval; wsptr[DCTSIZE*1] = dcval;
@@ -120,10 +124,10 @@ jpeg_idct_float (j_decompress_ptr cinfo, jpeg_component_info * compptr,
/* Even part */ /* Even part */
tmp0 = DEQUANTIZE(inptr[DCTSIZE*0], quantptr[DCTSIZE*0]); tmp0 = DEQUANTIZE(inptr[DCTSIZE*0], quantptr[DCTSIZE*0] * _0_125);
tmp1 = DEQUANTIZE(inptr[DCTSIZE*2], quantptr[DCTSIZE*2]); tmp1 = DEQUANTIZE(inptr[DCTSIZE*2], quantptr[DCTSIZE*2] * _0_125);
tmp2 = DEQUANTIZE(inptr[DCTSIZE*4], quantptr[DCTSIZE*4]); tmp2 = DEQUANTIZE(inptr[DCTSIZE*4], quantptr[DCTSIZE*4] * _0_125);
tmp3 = DEQUANTIZE(inptr[DCTSIZE*6], quantptr[DCTSIZE*6]); tmp3 = DEQUANTIZE(inptr[DCTSIZE*6], quantptr[DCTSIZE*6] * _0_125);
tmp10 = tmp0 + tmp2; /* phase 3 */ tmp10 = tmp0 + tmp2; /* phase 3 */
tmp11 = tmp0 - tmp2; tmp11 = tmp0 - tmp2;
@@ -138,10 +142,10 @@ jpeg_idct_float (j_decompress_ptr cinfo, jpeg_component_info * compptr,
/* Odd part */ /* Odd part */
tmp4 = DEQUANTIZE(inptr[DCTSIZE*1], quantptr[DCTSIZE*1]); tmp4 = DEQUANTIZE(inptr[DCTSIZE*1], quantptr[DCTSIZE*1] * _0_125);
tmp5 = DEQUANTIZE(inptr[DCTSIZE*3], quantptr[DCTSIZE*3]); tmp5 = DEQUANTIZE(inptr[DCTSIZE*3], quantptr[DCTSIZE*3] * _0_125);
tmp6 = DEQUANTIZE(inptr[DCTSIZE*5], quantptr[DCTSIZE*5]); tmp6 = DEQUANTIZE(inptr[DCTSIZE*5], quantptr[DCTSIZE*5] * _0_125);
tmp7 = DEQUANTIZE(inptr[DCTSIZE*7], quantptr[DCTSIZE*7]); tmp7 = DEQUANTIZE(inptr[DCTSIZE*7], quantptr[DCTSIZE*7] * _0_125);
z13 = tmp6 + tmp5; /* phase 6 */ z13 = tmp6 + tmp5; /* phase 6 */
z10 = tmp6 - tmp5; z10 = tmp6 - tmp5;
@@ -152,12 +156,12 @@ jpeg_idct_float (j_decompress_ptr cinfo, jpeg_component_info * compptr,
tmp11 = (z11 - z13) * ((FAST_FLOAT) 1.414213562); /* 2*c4 */ tmp11 = (z11 - z13) * ((FAST_FLOAT) 1.414213562); /* 2*c4 */
z5 = (z10 + z12) * ((FAST_FLOAT) 1.847759065); /* 2*c2 */ z5 = (z10 + z12) * ((FAST_FLOAT) 1.847759065); /* 2*c2 */
tmp10 = ((FAST_FLOAT) 1.082392200) * z12 - z5; /* 2*(c2-c6) */ tmp10 = z5 - z12 * ((FAST_FLOAT) 1.082392200); /* 2*(c2-c6) */
tmp12 = ((FAST_FLOAT) -2.613125930) * z10 + z5; /* -2*(c2+c6) */ tmp12 = z5 - z10 * ((FAST_FLOAT) 2.613125930); /* 2*(c2+c6) */
tmp6 = tmp12 - tmp7; /* phase 2 */ tmp6 = tmp12 - tmp7; /* phase 2 */
tmp5 = tmp11 - tmp6; tmp5 = tmp11 - tmp6;
tmp4 = tmp10 + tmp5; tmp4 = tmp10 - tmp5;
wsptr[DCTSIZE*0] = tmp0 + tmp7; wsptr[DCTSIZE*0] = tmp0 + tmp7;
wsptr[DCTSIZE*7] = tmp0 - tmp7; wsptr[DCTSIZE*7] = tmp0 - tmp7;
@@ -165,8 +169,8 @@ jpeg_idct_float (j_decompress_ptr cinfo, jpeg_component_info * compptr,
wsptr[DCTSIZE*6] = tmp1 - tmp6; wsptr[DCTSIZE*6] = tmp1 - tmp6;
wsptr[DCTSIZE*2] = tmp2 + tmp5; wsptr[DCTSIZE*2] = tmp2 + tmp5;
wsptr[DCTSIZE*5] = tmp2 - tmp5; wsptr[DCTSIZE*5] = tmp2 - tmp5;
wsptr[DCTSIZE*4] = tmp3 + tmp4; wsptr[DCTSIZE*3] = tmp3 + tmp4;
wsptr[DCTSIZE*3] = tmp3 - tmp4; wsptr[DCTSIZE*4] = tmp3 - tmp4;
inptr++; /* advance pointers to next column */ inptr++; /* advance pointers to next column */
quantptr++; quantptr++;
@@ -174,7 +178,6 @@ jpeg_idct_float (j_decompress_ptr cinfo, jpeg_component_info * compptr,
} }
/* Pass 2: process rows from work array, store into output array. */ /* Pass 2: process rows from work array, store into output array. */
/* Note that we must descale the results by a factor of 8 == 2**3. */
wsptr = workspace; wsptr = workspace;
for (ctr = 0; ctr < DCTSIZE; ctr++) { for (ctr = 0; ctr < DCTSIZE; ctr++) {
@@ -187,8 +190,10 @@ jpeg_idct_float (j_decompress_ptr cinfo, jpeg_component_info * compptr,
/* Even part */ /* Even part */
tmp10 = wsptr[0] + wsptr[4]; /* Apply signed->unsigned and prepare float->int conversion */
tmp11 = wsptr[0] - wsptr[4]; z5 = wsptr[0] + ((FAST_FLOAT) CENTERJSAMPLE + (FAST_FLOAT) 0.5);
tmp10 = z5 + wsptr[4];
tmp11 = z5 - wsptr[4];
tmp13 = wsptr[2] + wsptr[6]; tmp13 = wsptr[2] + wsptr[6];
tmp12 = (wsptr[2] - wsptr[6]) * ((FAST_FLOAT) 1.414213562) - tmp13; tmp12 = (wsptr[2] - wsptr[6]) * ((FAST_FLOAT) 1.414213562) - tmp13;
@@ -209,31 +214,23 @@ jpeg_idct_float (j_decompress_ptr cinfo, jpeg_component_info * compptr,
tmp11 = (z11 - z13) * ((FAST_FLOAT) 1.414213562); tmp11 = (z11 - z13) * ((FAST_FLOAT) 1.414213562);
z5 = (z10 + z12) * ((FAST_FLOAT) 1.847759065); /* 2*c2 */ z5 = (z10 + z12) * ((FAST_FLOAT) 1.847759065); /* 2*c2 */
tmp10 = ((FAST_FLOAT) 1.082392200) * z12 - z5; /* 2*(c2-c6) */ tmp10 = z5 - z12 * ((FAST_FLOAT) 1.082392200); /* 2*(c2-c6) */
tmp12 = ((FAST_FLOAT) -2.613125930) * z10 + z5; /* -2*(c2+c6) */ tmp12 = z5 - z10 * ((FAST_FLOAT) 2.613125930); /* 2*(c2+c6) */
tmp6 = tmp12 - tmp7; tmp6 = tmp12 - tmp7;
tmp5 = tmp11 - tmp6; tmp5 = tmp11 - tmp6;
tmp4 = tmp10 + tmp5; tmp4 = tmp10 - tmp5;
/* Final output stage: scale down by a factor of 8 and range-limit */ /* Final output stage: float->int conversion and range-limit */
outptr[0] = range_limit[(int) DESCALE((INT32) (tmp0 + tmp7), 3) outptr[0] = range_limit[((int) (tmp0 + tmp7)) & RANGE_MASK];
& RANGE_MASK]; outptr[7] = range_limit[((int) (tmp0 - tmp7)) & RANGE_MASK];
outptr[7] = range_limit[(int) DESCALE((INT32) (tmp0 - tmp7), 3) outptr[1] = range_limit[((int) (tmp1 + tmp6)) & RANGE_MASK];
& RANGE_MASK]; outptr[6] = range_limit[((int) (tmp1 - tmp6)) & RANGE_MASK];
outptr[1] = range_limit[(int) DESCALE((INT32) (tmp1 + tmp6), 3) outptr[2] = range_limit[((int) (tmp2 + tmp5)) & RANGE_MASK];
& RANGE_MASK]; outptr[5] = range_limit[((int) (tmp2 - tmp5)) & RANGE_MASK];
outptr[6] = range_limit[(int) DESCALE((INT32) (tmp1 - tmp6), 3) outptr[3] = range_limit[((int) (tmp3 + tmp4)) & RANGE_MASK];
& RANGE_MASK]; outptr[4] = range_limit[((int) (tmp3 - tmp4)) & RANGE_MASK];
outptr[2] = range_limit[(int) DESCALE((INT32) (tmp2 + tmp5), 3)
& RANGE_MASK];
outptr[5] = range_limit[(int) DESCALE((INT32) (tmp2 - tmp5), 3)
& RANGE_MASK];
outptr[4] = range_limit[(int) DESCALE((INT32) (tmp3 + tmp4), 3)
& RANGE_MASK];
outptr[3] = range_limit[(int) DESCALE((INT32) (tmp3 - tmp4), 3)
& RANGE_MASK];
wsptr += DCTSIZE; /* advance pointer to next row */ wsptr += DCTSIZE; /* advance pointer to next row */
} }