Port the more accurate (and slightly faster) floating point IDCT implementation from jpeg-8a and later. New research revealed that the SSE/SSE2 floating point IDCT implementation was actually more accurate than the jpeg-6b implementation, not less, which is why its mathematical results have always differed from those of the jpeg-6b implementation. This patch brings the accuracy of the C code in line with that of the SSE/SSE2 code.

git-svn-id: svn+ssh://svn.code.sf.net/p/libjpeg-turbo/code/trunk@1288 632fc199-4ca6-4c93-a231-07263d6284db
This commit is contained in:
DRC
2014-05-11 10:09:07 +00:00
5 changed files with 59 additions and 51 deletions

View File

@@ -300,7 +300,7 @@ set(MD5_JPEG_3x2_FLOAT_PROG 343e3f8caf8af5986ebaf0bdc13b5c71)
set(MD5_PPM_3x2_FLOAT 1a75f36e5904d6fc3a85a43da9ad89bb)
else()
set(MD5_JPEG_3x2_FLOAT_PROG 9bca803d2042bd1eb03819e2bf92b3e5)
set(MD5_PPM_3x2_FLOAT ef6a420e369440edd6b918a0cf54ba3f)
set(MD5_PPM_3x2_FLOAT f6bfab038438ed8f5522fbd33595dcdc)
endif()
set(MD5_JPEG_420_ISLOW_ARI e986fb0a637a8d833d96e8a6d6d84ea1)
set(MD5_JPEG_444_ISLOW_PROGARI 0a8f1c8f66e113c3cf635df0a475a617)

View File

@@ -54,6 +54,14 @@ if compiler optimization was enabled when libjpeg-turbo was built. This caused
the regression tests to fail when doing a release build under Visual C++ 2010
and later.
[7] Improved the accuracy and performance of the non-SIMD implementation of the
floating point inverse DCT (using code borrowed from libjpeg v8a and later.)
The accuracy of this implementation now matches the accuracy of the SSE/SSE2
implementation. Note, however, that the floating point DCT/IDCT algorithms are
mainly a legacy feature. They generally do not produce significantly better
accuracy than the slow integer DCT/IDCT algorithms, and they are quite a bit
slower.
1.3.1
=====

View File

@@ -175,15 +175,13 @@ MD5_JPEG_GRAY_ISLOW = 72b51f894b8f4a10b3ee3066770aa38d
MD5_PPM_GRAY_ISLOW = 8d3596c56eace32f205deccc229aa5ed
MD5_PPM_GRAY_RGB_ISLOW = 116424ac07b79e5e801f00508eab48ec
MD5_JPEG_420S_IFAST_OPT = 388708217ac46273ca33086b22827ed8
# The SSE/SSE2 DCT/IDCT implementation has always produced a slight round-off
# error relative to the C code, so in this case, we just test for regression
# rather than verifying that the output matches libjpeg.
# See README-turbo.txt for more details on why this next bit is necessary.
if WITH_SSE_FLOAT_DCT
MD5_JPEG_3x2_FLOAT_PROG = 343e3f8caf8af5986ebaf0bdc13b5c71
MD5_PPM_3x2_FLOAT = 1a75f36e5904d6fc3a85a43da9ad89bb
else
MD5_JPEG_3x2_FLOAT_PROG = 9bca803d2042bd1eb03819e2bf92b3e5
MD5_PPM_3x2_FLOAT = ef6a420e369440edd6b918a0cf54ba3f
MD5_PPM_3x2_FLOAT = f6bfab038438ed8f5522fbd33595dcdc
endif
MD5_JPEG_420_ISLOW_ARI = e986fb0a637a8d833d96e8a6d6d84ea1
MD5_JPEG_444_ISLOW_PROGARI = 0a8f1c8f66e113c3cf635df0a475a617

View File

@@ -301,10 +301,19 @@ following reasons:
slightly more accurate than the implementation in libjpeg v6b, but not by
any amount perceptible to human vision (generally in the range of 0.01 to
0.08 dB gain in PNSR.)
-- When not using the SIMD extensions, libjpeg-turbo uses the more accurate
(and slightly faster) floating point IDCT algorithm introduced in libjpeg
v8a as opposed to the algorithm used in libjpeg v6b. It should be noted,
however, that this algorithm basically brings the accuracy of the floating
point IDCT in line with the accuracy of the slow integer IDCT. The floating
point DCT/IDCT algorithms are mainly a legacy feature, and they do not
produce significantly more accuracy than the slow integer algorithms (to put
numbers on this, the typical difference in PNSR between the two algorithms
is less than 0.10 dB, whereas changing the quality level by 1 in the upper
range of the quality scale is typically more like a 1.0 dB difference.)
-- When not using the SIMD extensions, then the accuracy of the floating point
DCT/IDCT can depend on the compiler and compiler settings.
While libjpeg-turbo does emulate the libjpeg v8 API/ABI, under the hood, it is
still using the same algorithms as libjpeg v6b, so there are several specific
cases in which libjpeg-turbo cannot be expected to produce the same output as
@@ -320,10 +329,6 @@ libjpeg v8:
output of libjpeg v8 is less accurate than that of libjpeg v6b for this
reason.
-- When using the floating point IDCT, for the reasons stated above and also
because the floating point IDCT algorithm was modified in libjpeg v8a to
improve accuracy.
-- When decompressing using a scaling factor > 1 and merged (AKA "non-fancy" or
"non-smooth") chrominance upsampling, because libjpeg v8 does not support
merged upsampling with scaling factors > 1.

View File

@@ -1,9 +1,12 @@
/*
* jidctflt.c
*
* This file was part of the Independent JPEG Group's software:
* Copyright (C) 1994-1998, Thomas G. Lane.
* This file is part of the Independent JPEG Group's software.
* For conditions of distribution and use, see the accompanying README file.
* Modified 2010 by Guido Vollbeding.
* libjpeg-turbo Modifications:
* Copyright (C) 2014, D. R. Commander.
* For conditions of distribution and use, see the accompanying README file.
*
* This file contains a floating-point implementation of the
* inverse DCT (Discrete Cosine Transform). In the IJG code, this routine
@@ -76,10 +79,10 @@ jpeg_idct_float (j_decompress_ptr cinfo, jpeg_component_info * compptr,
FLOAT_MULT_TYPE * quantptr;
FAST_FLOAT * wsptr;
JSAMPROW outptr;
JSAMPLE *range_limit = IDCT_range_limit(cinfo);
JSAMPLE *range_limit = cinfo->sample_range_limit;
int ctr;
FAST_FLOAT workspace[DCTSIZE2]; /* buffers data between passes */
SHIFT_TEMPS
#define _0_125 ((FLOAT_MULT_TYPE)0.125)
/* Pass 1: process columns from input, store into work array. */
@@ -101,7 +104,8 @@ jpeg_idct_float (j_decompress_ptr cinfo, jpeg_component_info * compptr,
inptr[DCTSIZE*5] == 0 && inptr[DCTSIZE*6] == 0 &&
inptr[DCTSIZE*7] == 0) {
/* AC terms all zero */
FAST_FLOAT dcval = DEQUANTIZE(inptr[DCTSIZE*0], quantptr[DCTSIZE*0]);
FAST_FLOAT dcval = DEQUANTIZE(inptr[DCTSIZE*0],
quantptr[DCTSIZE*0] * _0_125);
wsptr[DCTSIZE*0] = dcval;
wsptr[DCTSIZE*1] = dcval;
@@ -120,10 +124,10 @@ jpeg_idct_float (j_decompress_ptr cinfo, jpeg_component_info * compptr,
/* Even part */
tmp0 = DEQUANTIZE(inptr[DCTSIZE*0], quantptr[DCTSIZE*0]);
tmp1 = DEQUANTIZE(inptr[DCTSIZE*2], quantptr[DCTSIZE*2]);
tmp2 = DEQUANTIZE(inptr[DCTSIZE*4], quantptr[DCTSIZE*4]);
tmp3 = DEQUANTIZE(inptr[DCTSIZE*6], quantptr[DCTSIZE*6]);
tmp0 = DEQUANTIZE(inptr[DCTSIZE*0], quantptr[DCTSIZE*0] * _0_125);
tmp1 = DEQUANTIZE(inptr[DCTSIZE*2], quantptr[DCTSIZE*2] * _0_125);
tmp2 = DEQUANTIZE(inptr[DCTSIZE*4], quantptr[DCTSIZE*4] * _0_125);
tmp3 = DEQUANTIZE(inptr[DCTSIZE*6], quantptr[DCTSIZE*6] * _0_125);
tmp10 = tmp0 + tmp2; /* phase 3 */
tmp11 = tmp0 - tmp2;
@@ -138,10 +142,10 @@ jpeg_idct_float (j_decompress_ptr cinfo, jpeg_component_info * compptr,
/* Odd part */
tmp4 = DEQUANTIZE(inptr[DCTSIZE*1], quantptr[DCTSIZE*1]);
tmp5 = DEQUANTIZE(inptr[DCTSIZE*3], quantptr[DCTSIZE*3]);
tmp6 = DEQUANTIZE(inptr[DCTSIZE*5], quantptr[DCTSIZE*5]);
tmp7 = DEQUANTIZE(inptr[DCTSIZE*7], quantptr[DCTSIZE*7]);
tmp4 = DEQUANTIZE(inptr[DCTSIZE*1], quantptr[DCTSIZE*1] * _0_125);
tmp5 = DEQUANTIZE(inptr[DCTSIZE*3], quantptr[DCTSIZE*3] * _0_125);
tmp6 = DEQUANTIZE(inptr[DCTSIZE*5], quantptr[DCTSIZE*5] * _0_125);
tmp7 = DEQUANTIZE(inptr[DCTSIZE*7], quantptr[DCTSIZE*7] * _0_125);
z13 = tmp6 + tmp5; /* phase 6 */
z10 = tmp6 - tmp5;
@@ -152,12 +156,12 @@ jpeg_idct_float (j_decompress_ptr cinfo, jpeg_component_info * compptr,
tmp11 = (z11 - z13) * ((FAST_FLOAT) 1.414213562); /* 2*c4 */
z5 = (z10 + z12) * ((FAST_FLOAT) 1.847759065); /* 2*c2 */
tmp10 = ((FAST_FLOAT) 1.082392200) * z12 - z5; /* 2*(c2-c6) */
tmp12 = ((FAST_FLOAT) -2.613125930) * z10 + z5; /* -2*(c2+c6) */
tmp10 = z5 - z12 * ((FAST_FLOAT) 1.082392200); /* 2*(c2-c6) */
tmp12 = z5 - z10 * ((FAST_FLOAT) 2.613125930); /* 2*(c2+c6) */
tmp6 = tmp12 - tmp7; /* phase 2 */
tmp5 = tmp11 - tmp6;
tmp4 = tmp10 + tmp5;
tmp4 = tmp10 - tmp5;
wsptr[DCTSIZE*0] = tmp0 + tmp7;
wsptr[DCTSIZE*7] = tmp0 - tmp7;
@@ -165,8 +169,8 @@ jpeg_idct_float (j_decompress_ptr cinfo, jpeg_component_info * compptr,
wsptr[DCTSIZE*6] = tmp1 - tmp6;
wsptr[DCTSIZE*2] = tmp2 + tmp5;
wsptr[DCTSIZE*5] = tmp2 - tmp5;
wsptr[DCTSIZE*4] = tmp3 + tmp4;
wsptr[DCTSIZE*3] = tmp3 - tmp4;
wsptr[DCTSIZE*3] = tmp3 + tmp4;
wsptr[DCTSIZE*4] = tmp3 - tmp4;
inptr++; /* advance pointers to next column */
quantptr++;
@@ -174,7 +178,6 @@ jpeg_idct_float (j_decompress_ptr cinfo, jpeg_component_info * compptr,
}
/* Pass 2: process rows from work array, store into output array. */
/* Note that we must descale the results by a factor of 8 == 2**3. */
wsptr = workspace;
for (ctr = 0; ctr < DCTSIZE; ctr++) {
@@ -187,8 +190,10 @@ jpeg_idct_float (j_decompress_ptr cinfo, jpeg_component_info * compptr,
/* Even part */
tmp10 = wsptr[0] + wsptr[4];
tmp11 = wsptr[0] - wsptr[4];
/* Apply signed->unsigned and prepare float->int conversion */
z5 = wsptr[0] + ((FAST_FLOAT) CENTERJSAMPLE + (FAST_FLOAT) 0.5);
tmp10 = z5 + wsptr[4];
tmp11 = z5 - wsptr[4];
tmp13 = wsptr[2] + wsptr[6];
tmp12 = (wsptr[2] - wsptr[6]) * ((FAST_FLOAT) 1.414213562) - tmp13;
@@ -209,31 +214,23 @@ jpeg_idct_float (j_decompress_ptr cinfo, jpeg_component_info * compptr,
tmp11 = (z11 - z13) * ((FAST_FLOAT) 1.414213562);
z5 = (z10 + z12) * ((FAST_FLOAT) 1.847759065); /* 2*c2 */
tmp10 = ((FAST_FLOAT) 1.082392200) * z12 - z5; /* 2*(c2-c6) */
tmp12 = ((FAST_FLOAT) -2.613125930) * z10 + z5; /* -2*(c2+c6) */
tmp10 = z5 - z12 * ((FAST_FLOAT) 1.082392200); /* 2*(c2-c6) */
tmp12 = z5 - z10 * ((FAST_FLOAT) 2.613125930); /* 2*(c2+c6) */
tmp6 = tmp12 - tmp7;
tmp5 = tmp11 - tmp6;
tmp4 = tmp10 + tmp5;
tmp4 = tmp10 - tmp5;
/* Final output stage: scale down by a factor of 8 and range-limit */
/* Final output stage: float->int conversion and range-limit */
outptr[0] = range_limit[(int) DESCALE((INT32) (tmp0 + tmp7), 3)
& RANGE_MASK];
outptr[7] = range_limit[(int) DESCALE((INT32) (tmp0 - tmp7), 3)
& RANGE_MASK];
outptr[1] = range_limit[(int) DESCALE((INT32) (tmp1 + tmp6), 3)
& RANGE_MASK];
outptr[6] = range_limit[(int) DESCALE((INT32) (tmp1 - tmp6), 3)
& RANGE_MASK];
outptr[2] = range_limit[(int) DESCALE((INT32) (tmp2 + tmp5), 3)
& RANGE_MASK];
outptr[5] = range_limit[(int) DESCALE((INT32) (tmp2 - tmp5), 3)
& RANGE_MASK];
outptr[4] = range_limit[(int) DESCALE((INT32) (tmp3 + tmp4), 3)
& RANGE_MASK];
outptr[3] = range_limit[(int) DESCALE((INT32) (tmp3 - tmp4), 3)
& RANGE_MASK];
outptr[0] = range_limit[((int) (tmp0 + tmp7)) & RANGE_MASK];
outptr[7] = range_limit[((int) (tmp0 - tmp7)) & RANGE_MASK];
outptr[1] = range_limit[((int) (tmp1 + tmp6)) & RANGE_MASK];
outptr[6] = range_limit[((int) (tmp1 - tmp6)) & RANGE_MASK];
outptr[2] = range_limit[((int) (tmp2 + tmp5)) & RANGE_MASK];
outptr[5] = range_limit[((int) (tmp2 - tmp5)) & RANGE_MASK];
outptr[3] = range_limit[((int) (tmp3 + tmp4)) & RANGE_MASK];
outptr[4] = range_limit[((int) (tmp3 - tmp4)) & RANGE_MASK];
wsptr += DCTSIZE; /* advance pointer to next row */
}