Port the more accurate (and slightly faster) floating point IDCT implementation from jpeg-8a and later. New research revealed that the SSE/SSE2 floating point IDCT implementation was actually more accurate than the jpeg-6b implementation, not less, which is why its mathematical results have always differed from those of the jpeg-6b implementation. This patch brings the accuracy of the C code in line with that of the SSE/SSE2 code.

git-svn-id: svn+ssh://svn.code.sf.net/p/libjpeg-turbo/code/trunk@1288 632fc199-4ca6-4c93-a231-07263d6284db
2014-05-11 10:09:07 +00:00
parent a758460175 5829cb2398
commit 715bb41ead
5 changed files with 59 additions and 51 deletions
--- a/CMakeLists.txt
+++ b/CMakeLists.txt
@@ -300,7 +300,7 @@ set(MD5_JPEG_3x2_FLOAT_PROG 343e3f8caf8af5986ebaf0bdc13b5c71)
 set(MD5_PPM_3x2_FLOAT 1a75f36e5904d6fc3a85a43da9ad89bb)
 else()
 set(MD5_JPEG_3x2_FLOAT_PROG 9bca803d2042bd1eb03819e2bf92b3e5)
-set(MD5_PPM_3x2_FLOAT ef6a420e369440edd6b918a0cf54ba3f)
+set(MD5_PPM_3x2_FLOAT f6bfab038438ed8f5522fbd33595dcdc)
 endif()
 set(MD5_JPEG_420_ISLOW_ARI e986fb0a637a8d833d96e8a6d6d84ea1)
 set(MD5_JPEG_444_ISLOW_PROGARI 0a8f1c8f66e113c3cf635df0a475a617)
--- a/ChangeLog.txt
+++ b/ChangeLog.txt
@@ -54,6 +54,14 @@ if compiler optimization was enabled when libjpeg-turbo was built.  This caused
 the regression tests to fail when doing a release build under Visual C++ 2010
 and later.
 [7] Improved the accuracy and performance of the non-SIMD implementation of the
 floating point inverse DCT (using code borrowed from libjpeg v8a and later.)
 The accuracy of this implementation now matches the accuracy of the SSE/SSE2
 implementation.  Note, however, that the floating point DCT/IDCT algorithms are
 mainly a legacy feature.  They generally do not produce significantly better
 accuracy than the slow integer DCT/IDCT algorithms, and they are quite a bit
 slower.
 1.3.1
 =====
--- a/Makefile.am
+++ b/Makefile.am
@@ -175,15 +175,13 @@ MD5_JPEG_GRAY_ISLOW = 72b51f894b8f4a10b3ee3066770aa38d
 MD5_PPM_GRAY_ISLOW = 8d3596c56eace32f205deccc229aa5ed
 MD5_PPM_GRAY_RGB_ISLOW = 116424ac07b79e5e801f00508eab48ec
 MD5_JPEG_420S_IFAST_OPT = 388708217ac46273ca33086b22827ed8
-# The SSE/SSE2 DCT/IDCT implementation has always produced a slight round-off
+# See README-turbo.txt for more details on why this next bit is necessary.
 # error relative to the C code, so in this case, we just test for regression
 # rather than verifying that the output matches libjpeg.
 if WITH_SSE_FLOAT_DCT
 MD5_JPEG_3x2_FLOAT_PROG = 343e3f8caf8af5986ebaf0bdc13b5c71
 MD5_PPM_3x2_FLOAT = 1a75f36e5904d6fc3a85a43da9ad89bb
 else
 MD5_JPEG_3x2_FLOAT_PROG = 9bca803d2042bd1eb03819e2bf92b3e5
-MD5_PPM_3x2_FLOAT = ef6a420e369440edd6b918a0cf54ba3f
+MD5_PPM_3x2_FLOAT = f6bfab038438ed8f5522fbd33595dcdc
 endif
 MD5_JPEG_420_ISLOW_ARI = e986fb0a637a8d833d96e8a6d6d84ea1
 MD5_JPEG_444_ISLOW_PROGARI = 0a8f1c8f66e113c3cf635df0a475a617
--- a/README-turbo.txt
+++ b/README-turbo.txt
@@ -301,10 +301,19 @@ following reasons:
   slightly more accurate than the implementation in libjpeg v6b, but not by
   any amount perceptible to human vision (generally in the range of 0.01 to
   0.08 dB gain in PNSR.)
 -- When not using the SIMD extensions, libjpeg-turbo uses the more accurate
   (and slightly faster) floating point IDCT algorithm introduced in libjpeg
   v8a as opposed to the algorithm used in libjpeg v6b.  It should be noted,
   however, that this algorithm basically brings the accuracy of the floating
   point IDCT in line with the accuracy of the slow integer IDCT.  The floating
   point DCT/IDCT algorithms are mainly a legacy feature, and they do not
   produce significantly more accuracy than the slow integer algorithms (to put
   numbers on this, the typical difference in PNSR between the two algorithms
   is less than 0.10 dB, whereas changing the quality level by 1 in the upper
   range of the quality scale is typically more like a 1.0 dB difference.)
 -- When not using the SIMD extensions, then the accuracy of the floating point
   DCT/IDCT can depend on the compiler and compiler settings.
 While libjpeg-turbo does emulate the libjpeg v8 API/ABI, under the hood, it is
 still using the same algorithms as libjpeg v6b, so there are several specific
 cases in which libjpeg-turbo cannot be expected to produce the same output as
@@ -320,10 +329,6 @@ libjpeg v8:
   output of libjpeg v8 is less accurate than that of libjpeg v6b for this
   reason.
 -- When using the floating point IDCT, for the reasons stated above and also
   because the floating point IDCT algorithm was modified in libjpeg v8a to
   improve accuracy.
 -- When decompressing using a scaling factor > 1 and merged (AKA "non-fancy" or
   "non-smooth") chrominance upsampling, because libjpeg v8 does not support
   merged upsampling with scaling factors > 1.
--- a/jidctflt.c
+++ b/jidctflt.c
@@ -1,8 +1,11 @@
 /*
 * jidctflt.c
 *
 * This file was part of the Independent JPEG Group's software:
 * Copyright (C) 1994-1998, Thomas G. Lane.
- * This file is part of the Independent JPEG Group's software.
+ * Modified 2010 by Guido Vollbeding.
 * libjpeg-turbo Modifications:
 * Copyright (C) 2014, D. R. Commander.
  * For conditions of distribution and use, see the accompanying README file.
 *
 * This file contains a floating-point implementation of the
@@ -76,10 +79,10 @@ jpeg_idct_float (j_decompress_ptr cinfo, jpeg_component_info * compptr,
  FLOAT_MULT_TYPE * quantptr;
  FAST_FLOAT * wsptr;
  JSAMPROW outptr;
-  JSAMPLE *range_limit = IDCT_range_limit(cinfo);
+  JSAMPLE *range_limit = cinfo->sample_range_limit;
  int ctr;
  FAST_FLOAT workspace[DCTSIZE2]; /* buffers data between passes */
-  SHIFT_TEMPS
+  #define _0_125 ((FLOAT_MULT_TYPE)0.125)
  /* Pass 1: process columns from input, store into work array. */
@@ -101,7 +104,8 @@ jpeg_idct_float (j_decompress_ptr cinfo, jpeg_component_info * compptr,
        inptr[DCTSIZE*5] == 0 && inptr[DCTSIZE*6] == 0 &&
        inptr[DCTSIZE*7] == 0) {
      /* AC terms all zero */
-      FAST_FLOAT dcval = DEQUANTIZE(inptr[DCTSIZE*0], quantptr[DCTSIZE*0]);
+      FAST_FLOAT dcval = DEQUANTIZE(inptr[DCTSIZE*0],
                                    quantptr[DCTSIZE*0] * _0_125);
      wsptr[DCTSIZE*0] = dcval;
      wsptr[DCTSIZE*1] = dcval;
@@ -120,10 +124,10 @@ jpeg_idct_float (j_decompress_ptr cinfo, jpeg_component_info * compptr,
    /* Even part */
-    tmp0 = DEQUANTIZE(inptr[DCTSIZE*0], quantptr[DCTSIZE*0]);
+    tmp0 = DEQUANTIZE(inptr[DCTSIZE*0], quantptr[DCTSIZE*0] * _0_125);
-    tmp1 = DEQUANTIZE(inptr[DCTSIZE*2], quantptr[DCTSIZE*2]);
+    tmp1 = DEQUANTIZE(inptr[DCTSIZE*2], quantptr[DCTSIZE*2] * _0_125);
-    tmp2 = DEQUANTIZE(inptr[DCTSIZE*4], quantptr[DCTSIZE*4]);
+    tmp2 = DEQUANTIZE(inptr[DCTSIZE*4], quantptr[DCTSIZE*4] * _0_125);
-    tmp3 = DEQUANTIZE(inptr[DCTSIZE*6], quantptr[DCTSIZE*6]);
+    tmp3 = DEQUANTIZE(inptr[DCTSIZE*6], quantptr[DCTSIZE*6] * _0_125);
    tmp10 = tmp0 + tmp2;        /* phase 3 */
    tmp11 = tmp0 - tmp2;
@@ -138,10 +142,10 @@ jpeg_idct_float (j_decompress_ptr cinfo, jpeg_component_info * compptr,
    /* Odd part */
-    tmp4 = DEQUANTIZE(inptr[DCTSIZE*1], quantptr[DCTSIZE*1]);
+    tmp4 = DEQUANTIZE(inptr[DCTSIZE*1], quantptr[DCTSIZE*1] * _0_125);
-    tmp5 = DEQUANTIZE(inptr[DCTSIZE*3], quantptr[DCTSIZE*3]);
+    tmp5 = DEQUANTIZE(inptr[DCTSIZE*3], quantptr[DCTSIZE*3] * _0_125);
-    tmp6 = DEQUANTIZE(inptr[DCTSIZE*5], quantptr[DCTSIZE*5]);
+    tmp6 = DEQUANTIZE(inptr[DCTSIZE*5], quantptr[DCTSIZE*5] * _0_125);
-    tmp7 = DEQUANTIZE(inptr[DCTSIZE*7], quantptr[DCTSIZE*7]);
+    tmp7 = DEQUANTIZE(inptr[DCTSIZE*7], quantptr[DCTSIZE*7] * _0_125);
    z13 = tmp6 + tmp5;          /* phase 6 */
    z10 = tmp6 - tmp5;
@@ -152,12 +156,12 @@ jpeg_idct_float (j_decompress_ptr cinfo, jpeg_component_info * compptr,
    tmp11 = (z11 - z13) * ((FAST_FLOAT) 1.414213562); /* 2*c4 */
    z5 = (z10 + z12) * ((FAST_FLOAT) 1.847759065); /* 2*c2 */
-    tmp10 = ((FAST_FLOAT) 1.082392200) * z12 - z5; /* 2*(c2-c6) */
+    tmp10 = z5 - z12 * ((FAST_FLOAT) 1.082392200); /* 2*(c2-c6) */
-    tmp12 = ((FAST_FLOAT) -2.613125930) * z10 + z5; /* -2*(c2+c6) */
+    tmp12 = z5 - z10 * ((FAST_FLOAT) 2.613125930); /* 2*(c2+c6) */
    tmp6 = tmp12 - tmp7;        /* phase 2 */
    tmp5 = tmp11 - tmp6;
-    tmp4 = tmp10 + tmp5;
+    tmp4 = tmp10 - tmp5;
    wsptr[DCTSIZE*0] = tmp0 + tmp7;
    wsptr[DCTSIZE*7] = tmp0 - tmp7;
@@ -165,8 +169,8 @@ jpeg_idct_float (j_decompress_ptr cinfo, jpeg_component_info * compptr,
    wsptr[DCTSIZE*6] = tmp1 - tmp6;
    wsptr[DCTSIZE*2] = tmp2 + tmp5;
    wsptr[DCTSIZE*5] = tmp2 - tmp5;
-    wsptr[DCTSIZE*4] = tmp3 + tmp4;
+    wsptr[DCTSIZE*3] = tmp3 + tmp4;
-    wsptr[DCTSIZE*3] = tmp3 - tmp4;
+    wsptr[DCTSIZE*4] = tmp3 - tmp4;
    inptr++;                    /* advance pointers to next column */
    quantptr++;
@@ -174,7 +178,6 @@ jpeg_idct_float (j_decompress_ptr cinfo, jpeg_component_info * compptr,
  }
  /* Pass 2: process rows from work array, store into output array. */
  /* Note that we must descale the results by a factor of 8 == 2**3. */
  wsptr = workspace;
  for (ctr = 0; ctr < DCTSIZE; ctr++) {
@@ -187,8 +190,10 @@ jpeg_idct_float (j_decompress_ptr cinfo, jpeg_component_info * compptr,
    /* Even part */
-    tmp10 = wsptr[0] + wsptr[4];
+    /* Apply signed->unsigned and prepare float->int conversion */
-    tmp11 = wsptr[0] - wsptr[4];
+    z5 = wsptr[0] + ((FAST_FLOAT) CENTERJSAMPLE + (FAST_FLOAT) 0.5);
    tmp10 = z5 + wsptr[4];
    tmp11 = z5 - wsptr[4];
    tmp13 = wsptr[2] + wsptr[6];
    tmp12 = (wsptr[2] - wsptr[6]) * ((FAST_FLOAT) 1.414213562) - tmp13;
@@ -209,31 +214,23 @@ jpeg_idct_float (j_decompress_ptr cinfo, jpeg_component_info * compptr,
    tmp11 = (z11 - z13) * ((FAST_FLOAT) 1.414213562);
    z5 = (z10 + z12) * ((FAST_FLOAT) 1.847759065); /* 2*c2 */
-    tmp10 = ((FAST_FLOAT) 1.082392200) * z12 - z5; /* 2*(c2-c6) */
+    tmp10 = z5 - z12 * ((FAST_FLOAT) 1.082392200); /* 2*(c2-c6) */
-    tmp12 = ((FAST_FLOAT) -2.613125930) * z10 + z5; /* -2*(c2+c6) */
+    tmp12 = z5 - z10 * ((FAST_FLOAT) 2.613125930); /* 2*(c2+c6) */
    tmp6 = tmp12 - tmp7;
    tmp5 = tmp11 - tmp6;
-    tmp4 = tmp10 + tmp5;
+    tmp4 = tmp10 - tmp5;
-    /* Final output stage: scale down by a factor of 8 and range-limit */
+    /* Final output stage: float->int conversion and range-limit */
-    outptr[0] = range_limit[(int) DESCALE((INT32) (tmp0 + tmp7), 3)
+    outptr[0] = range_limit[((int) (tmp0 + tmp7)) & RANGE_MASK];
-                            & RANGE_MASK];
+    outptr[7] = range_limit[((int) (tmp0 - tmp7)) & RANGE_MASK];
-    outptr[7] = range_limit[(int) DESCALE((INT32) (tmp0 - tmp7), 3)
+    outptr[1] = range_limit[((int) (tmp1 + tmp6)) & RANGE_MASK];
-                            & RANGE_MASK];
+    outptr[6] = range_limit[((int) (tmp1 - tmp6)) & RANGE_MASK];
-    outptr[1] = range_limit[(int) DESCALE((INT32) (tmp1 + tmp6), 3)
+    outptr[2] = range_limit[((int) (tmp2 + tmp5)) & RANGE_MASK];
-                            & RANGE_MASK];
+    outptr[5] = range_limit[((int) (tmp2 - tmp5)) & RANGE_MASK];
-    outptr[6] = range_limit[(int) DESCALE((INT32) (tmp1 - tmp6), 3)
+    outptr[3] = range_limit[((int) (tmp3 + tmp4)) & RANGE_MASK];
-                            & RANGE_MASK];
+    outptr[4] = range_limit[((int) (tmp3 - tmp4)) & RANGE_MASK];
    outptr[2] = range_limit[(int) DESCALE((INT32) (tmp2 + tmp5), 3)
                            & RANGE_MASK];
    outptr[5] = range_limit[(int) DESCALE((INT32) (tmp2 - tmp5), 3)
                            & RANGE_MASK];
    outptr[4] = range_limit[(int) DESCALE((INT32) (tmp3 + tmp4), 3)
                            & RANGE_MASK];
    outptr[3] = range_limit[(int) DESCALE((INT32) (tmp3 - tmp4), 3)
                            & RANGE_MASK];
    wsptr += DCTSIZE;           /* advance pointer to next row */
  }