Bump version to 2.1.

Merge pull request #79 from arjun024/devel
more logical flow of control if trellis_quant enabled
2014-07-29 09:12:48 -05:00 · 2014-07-25 09:02:45 -04:00 · 2014-07-25 17:33:35 +05:30 · 2014-07-24 16:39:26 -05:00 · 2014-07-24 17:09:27 -04:00 · 2014-07-24 10:50:59 -04:00
16 changed files with 4352 additions and 151 deletions
--- a/CMakeLists.txt
+++ b/CMakeLists.txt
@@ -5,7 +5,7 @@
 cmake_minimum_required(VERSION 2.6)

 project(libmozjpeg C)
-set(VERSION 2.0pre)
+set(VERSION 2.1)

 if(MINGW OR CYGWIN)
  execute_process(COMMAND "date" "+%Y%m%d" OUTPUT_VARIABLE BUILD)
@@ -264,7 +264,7 @@ set_property(TARGET jpegtran-static PROPERTY COMPILE_FLAGS "-DUSE_SETMODE")

 add_executable(rdjpgcom rdjpgcom.c)

-add_executable(wrjpgcom rdjpgcom.c)
+add_executable(wrjpgcom wrjpgcom.c)


 #
--- a/ChangeLog.txt
+++ b/ChangeLog.txt
@@ -23,6 +23,14 @@ and is out of scope for a codec library.
 [5] The TurboJPEG API can now be used to compress JPEG images from YUV planar
 source images.

+[7] Improved the accuracy and performance of the non-SIMD implementation of the
+floating point inverse DCT (using code borrowed from libjpeg v8a and later.)
+The accuracy of this implementation now matches the accuracy of the SSE/SSE2
+implementation.  Note, however, that the floating point DCT/IDCT algorithms are
+mainly a legacy feature.  They generally do not produce significantly better
+accuracy than the slow integer DCT/IDCT algorithms, and they are quite a bit
+slower.
+

 1.3.1
 =====
--- a/README-turbo.txt
+++ b/README-turbo.txt
@@ -419,10 +419,25 @@ details.

 For the most part, libjpeg-turbo should produce identical output to libjpeg
 v6b.  The one exception to this is when using the floating point DCT/IDCT, in
-which case the outputs of libjpeg v6b and libjpeg-turbo are not guaranteed to
-be identical (the accuracy of the floating point DCT/IDCT is constant when
-using libjpeg-turbo's SIMD extensions, but otherwise, it can depend heavily on
-the compiler and compiler settings.)
+which case the outputs of libjpeg v6b and libjpeg-turbo can differ for the
+following reasons:
+
+-- The SSE/SSE2 floating point DCT implementation in libjpeg-turbo is ever so
+   slightly more accurate than the implementation in libjpeg v6b, but not by
+   any amount perceptible to human vision (generally in the range of 0.01 to
+   0.08 dB gain in PSNR.)
+-- When not using the SIMD extensions, libjpeg-turbo uses the more accurate
+   (and slightly faster) floating point IDCT algorithm introduced in libjpeg
+   v8a as opposed to the algorithm used in libjpeg v6b.  It should be noted,
+   however, that this algorithm basically brings the accuracy of the floating
+   point IDCT in line with the accuracy of the slow integer IDCT.  The floating
+   point DCT/IDCT algorithms are mainly a legacy feature, and they do not
+   produce significantly more accuracy than the slow integer algorithms (to put
+   numbers on this, the typical difference in PSNR between the two algorithms
+   is less than 0.10 dB, whereas changing the quality level by 1 in the upper
+   range of the quality scale is typically more like a 1.0 dB difference.)
+-- When not using the SIMD extensions, the accuracy of the floating point
+   DCT/IDCT can depend on the compiler and compiler settings.

 While libjpeg-turbo does emulate the libjpeg v8 API/ABI, under the hood, it is
 still using the same algorithms as libjpeg v6b, so there are several specific
@@ -430,16 +445,14 @@ cases in which libjpeg-turbo cannot be expected to produce the same output as
 libjpeg v8:

 -- When decompressing using scaling factors of 1/2 and 1/4, because libjpeg v8
-   implements those scaling algorithms a bit differently than libjpeg v6b does,
-   and libjpeg-turbo's SIMD extensions are based on the libjpeg v6b behavior.
+   implements those scaling algorithms differently than libjpeg v6b does, and
+   libjpeg-turbo's SIMD extensions are based on the libjpeg v6b behavior.

 -- When using chrominance subsampling, because libjpeg v8 implements this
   with its DCT/IDCT scaling algorithms rather than with a separate
-   downsampling/upsampling algorithm.
-
-- When using the floating point IDCT, for the reasons stated above and also
-   because the floating point IDCT algorithm was modified in libjpeg v8a to
-   improve accuracy.
+   downsampling/upsampling algorithm.  In our testing, the subsampled/upsampled
+   output of libjpeg v8 is less accurate than that of libjpeg v6b for this
+   reason.

 -- When decompressing using a scaling factor > 1 and merged (AKA "non-fancy" or
   "non-smooth") chrominance upsampling, because libjpeg v8 does not support
--- a/README.md
+++ b/README.md
@@ -7,6 +7,8 @@ The idea is to reduce transfer times for JPEGs on the Web, thus reducing page lo

 'mozjpeg' is not intended to be a general JPEG library replacement. It makes tradeoffs that are intended to benefit Web use cases and focuses solely on improving encoding. It is best used as part of a Web encoding workflow. For a general JPEG library (e.g. your system libjpeg), especially if you care about decoding, we recommend libjpeg-turbo.

-For more information, see the project announcement:
+More information:

-https://blog.mozilla.org/research/2014/03/05/introducing-the-mozjpeg-project/
+* [Version 1.0 Announcement](https://blog.mozilla.org/research/2014/03/05/introducing-the-mozjpeg-project/)
+* [Version 2.0 Announcement](https://blog.mozilla.org/research/2014/07/15/mozilla-advances-jpeg-encoding-with-mozjpeg-2-0/)
+* [Mailing List](https://lists.mozilla.org/listinfo/dev-mozjpeg)
--- a/cjpeg.1
+++ b/cjpeg.1
@@ -1,4 +1,4 @@
-.TH CJPEG 1 "18 January 2013"
+.TH CJPEG 1 "11 May 2014"
 .SH NAME
 cjpeg \- compress an image file to a JPEG file
 .SH SYNOPSIS
@@ -166,14 +166,25 @@ Use integer DCT method (default).
 .TP
 .B \-dct fast
 Use fast integer DCT (less accurate).
+In libjpeg-turbo, the fast method is generally about 5-15% faster than the int
+method when using the x86/x86-64 SIMD extensions (results may vary with other
+SIMD implementations, or when using libjpeg-turbo without SIMD extensions.)
+For quality levels of 90 and below, there should be little or no perceptible
+difference between the two algorithms.  For quality levels above 90, however,
+the difference between the fast and the int methods becomes more pronounced.
+With quality=97, for instance, the fast method incurs generally about a 1-3 dB
+loss (in PSNR) relative to the int method, but this can be larger for some
+images.  Do not use the fast method with quality levels above 97.  The
+algorithm often degenerates at quality=98 and above and can actually produce a
+more lossy image than if lower quality levels had been used.
 .TP
 .B \-dct float
 Use floating-point DCT method.
-The float method is very slightly more accurate than the int method, but is
-much slower unless your machine has very fast floating-point hardware.  Also
-note that results of the floating-point method may vary slightly across
-machines, while the integer methods should give the same results everywhere.
-The fast integer method is much less accurate than the other two.
+The float method is mostly a legacy feature.  It does not produce significantly
+more accurate results than the int method, and it is much slower.  The float
+method may also give different results on different machines due to varying
+roundoff behavior, whereas the integer methods should give the same results on
+all machines.
 .TP
 .BI \-restart " N"
 Emit a JPEG restart marker every N MCU rows, or every N MCU blocks if "B" is
--- a/cjpeg.c
+++ b/cjpeg.c
@@ -168,6 +168,7 @@ usage (void)
 #ifdef C_PROGRESSIVE_SUPPORTED
  fprintf(stderr, "  -progressive   Create progressive JPEG file (enabled by default)\n");
 #endif
+  fprintf(stderr, "  -baseline      Create baseline JPEG file (disable progressive coding)\n");
 #ifdef TARGA_SUPPORTED
  fprintf(stderr, "  -targa         Input file is Targa format (usually not needed)\n");
 #endif
@@ -206,7 +207,6 @@ usage (void)
 #endif
  fprintf(stderr, "  -verbose  or  -debug   Emit debug output\n");
  fprintf(stderr, "Switches for wizards:\n");
-  fprintf(stderr, "  -baseline      Force baseline quantization tables\n");
  fprintf(stderr, "  -qtables file  Use quantization tables given in file\n");
  fprintf(stderr, "  -qslots N[,...]    Set component quantization tables\n");
  fprintf(stderr, "  -sample HxV[,...]  Set component sampling factors\n");
@@ -279,11 +279,17 @@ parse_switches (j_compress_ptr cinfo, int argc, char **argv,
    } else if (keymatch(arg, "baseline", 1)) {
      /* Force baseline-compatible output (8-bit quantizer values). */
      force_baseline = TRUE;
+      /* Disable multiple scans */
+      simple_progressive = FALSE;
+      cinfo->num_scans = 0;
+      cinfo->scan_info = NULL;

    } else if (keymatch(arg, "dct", 2)) {
      /* Select DCT algorithm. */
-      if (++argn >= argc)	/* advance to next argument */
+      if (++argn >= argc) { /* advance to next argument */
+        fprintf(stderr, "%s: missing argument for dct\n", progname);
 	usage();
+      }
      if (keymatch(argv[argn], "int", 1)) {
 	cinfo->dct_method = JDCT_ISLOW;
      } else if (keymatch(argv[argn], "fast", 2)) {
@@ -291,6 +297,7 @@ parse_switches (j_compress_ptr cinfo, int argc, char **argv,
      } else if (keymatch(argv[argn], "float", 2)) {
 	cinfo->dct_method = JDCT_FLOAT;
      } else
+        fprintf(stderr, "%s: invalid argument for dct\n", progname);
 	usage();

    } else if (keymatch(arg, "debug", 1) || keymatch(arg, "verbose", 1)) {
@@ -314,7 +321,7 @@ parse_switches (j_compress_ptr cinfo, int argc, char **argv,
    } else if (keymatch(arg, "flat", 4)) {
      cinfo->use_flat_quant_tbl = TRUE;
      jpeg_set_quality(cinfo, 75, TRUE);
-      
+
    } else if (keymatch(arg, "grayscale", 2) || keymatch(arg, "greyscale",2)) {
      /* Force a monochrome JPEG file to be generated. */
      jpeg_set_colorspace(cinfo, JCS_GRAYSCALE);
@@ -327,12 +334,12 @@ parse_switches (j_compress_ptr cinfo, int argc, char **argv,
      if (++argn >= argc)	/* advance to next argument */
 	usage();
      cinfo->lambda_log_scale1 = atof(argv[argn]);
-      
+
    } else if (keymatch(arg, "lambda2", 7)) {
      if (++argn >= argc)	/* advance to next argument */
 	usage();
      cinfo->lambda_log_scale2 = atof(argv[argn]);
-      
+
    } else if (keymatch(arg, "maxmemory", 3)) {
      /* Maximum memory in Kb (or Mb with 'm'). */
      long lval;
@@ -348,7 +355,7 @@ parse_switches (j_compress_ptr cinfo, int argc, char **argv,

    } else if (keymatch(arg, "multidcscan", 3)) {
      cinfo->one_dc_scan = FALSE;
-      
+
    } else if (keymatch(arg, "optimize", 1) || keymatch(arg, "optimise", 1)) {
      /* Enable entropy parm optimization. */
 #ifdef ENTROPY_OPT_SUPPORTED
@@ -361,8 +368,10 @@ parse_switches (j_compress_ptr cinfo, int argc, char **argv,

    } else if (keymatch(arg, "outfile", 4)) {
      /* Set output file name. */
-      if (++argn >= argc)	/* advance to next argument */
+      if (++argn >= argc)	{ /* advance to next argument */
+        fprintf(stderr, "%s: missing argument for outfile\n", progname);
 	usage();
+      }
      outfilename = argv[argn];	/* save it away for later use */

    } else if (keymatch(arg, "progressive", 1)) {
@@ -388,8 +397,10 @@ parse_switches (j_compress_ptr cinfo, int argc, char **argv,

    } else if (keymatch(arg, "quality", 1)) {
      /* Quality ratings (quantization table scaling factors). */
-      if (++argn >= argc)	/* advance to next argument */
+      if (++argn >= argc)	{ /* advance to next argument */
+        fprintf(stderr, "%s: missing argument for quality\n", progname);
 	usage();
+      }
      qualityarg = argv[argn];

    } else if (keymatch(arg, "qslots", 2)) {
@@ -505,6 +516,7 @@ parse_switches (j_compress_ptr cinfo, int argc, char **argv,
      jpeg_set_quality(cinfo, 75, TRUE);
      
    } else {
+      fprintf(stderr, "%s: unknown option '%s'\n", progname, arg);
      usage();			/* bogus switch */
    }
  }
@@ -516,20 +528,26 @@ parse_switches (j_compress_ptr cinfo, int argc, char **argv,
    /* Set quantization tables for selected quality. */
    /* Some or all may be overridden if -qtables is present. */
    if (qualityarg != NULL)	/* process -quality if it was present */
-      if (! set_quality_ratings(cinfo, qualityarg, force_baseline))
+      if (! set_quality_ratings(cinfo, qualityarg, force_baseline)) {
+        fprintf(stderr, "%s: can't set quality ratings\n", progname);
 	usage();
+      }

    if (qtablefile != NULL)	/* process -qtables if it was present */
-      if (! read_quant_tables(cinfo, qtablefile, force_baseline))
+      if (! read_quant_tables(cinfo, qtablefile, force_baseline)) {
+        fprintf(stderr, "%s: can't read qtable file\n", progname);
 	usage();
+      }

    if (qslotsarg != NULL)	/* process -qslots if it was present */
      if (! set_quant_slots(cinfo, qslotsarg))
 	usage();

    if (samplearg != NULL)	/* process -sample if it was present */
-      if (! set_sample_factors(cinfo, samplearg))
+      if (! set_sample_factors(cinfo, samplearg)) {
+        fprintf(stderr, "%s: can't set sample factors\n", progname);
 	usage();
+      }

 #ifdef C_PROGRESSIVE_SUPPORTED
    if (simple_progressive)	/* process -progressive; -scans can override */
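
Review note on the `-dct` hunk above: the added `fprintf` sits directly under an unbraced `else`, so in C only the `fprintf` belongs to the `else`, and the following `usage()` call runs after every `-dct` value, valid or not. The other argument checks touched in this file wrap both statements in braces. A minimal sketch of the braced form follows. `parse_dct`, `DctMethod`, and the return codes are hypothetical stand-ins for illustration (the real code sets `cinfo->dct_method` and calls `usage()`), and exact `strcmp` matching replaces cjpeg's abbreviation-aware `keymatch`.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical stand-in for the J_DCT_METHOD values used by cjpeg. */
typedef enum { DCT_INT, DCT_FAST, DCT_FLOAT } DctMethod;

/*
 * Parse the value following -dct.
 * Returns 0 and stores the method on success, -1 on an invalid value.
 * Exact string matching is an assumption; cjpeg uses keymatch() instead.
 */
static int parse_dct(const char *progname, const char *value, DctMethod *out)
{
  assert(progname != NULL && value != NULL && out != NULL);

  if (strcmp(value, "int") == 0) {
    *out = DCT_INT;
  } else if (strcmp(value, "fast") == 0) {
    *out = DCT_FAST;
  } else if (strcmp(value, "float") == 0) {
    *out = DCT_FLOAT;
  } else {
    /* Braces keep both the message and the failure path inside the else. */
    fprintf(stderr, "%s: invalid argument for dct\n", progname);
    return -1;
  }
  return 0;
}
```

With the braces in place, a valid value such as `fast` never reaches the failure path.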
--- a/configure.ac
+++ b/configure.ac
@@ -2,7 +2,7 @@
 # Process this file with autoconf to produce a configure script.

 AC_PREREQ([2.56])
-AC_INIT([libmozjpeg], [2.0pre])
+AC_INIT([libmozjpeg], [2.1])
 BUILD=`date +%Y%m%d`

 AM_INIT_AUTOMAKE([-Wall foreign dist-bzip2])
@@ -20,7 +20,7 @@ AC_PROG_CC
 AM_PROG_CC_C_O
 AM_PROG_AS
 AC_PROG_INSTALL
-AM_PROG_AR
+m4_ifdef([AM_PROG_AR], [AM_PROG_AR])
 AC_PROG_LIBTOOL
 AC_PROG_LN_S

--- a/djpeg.1
+++ b/djpeg.1
@@ -1,4 +1,4 @@
-.TH DJPEG 1 "18 January 2013"
+.TH DJPEG 1 "11 May 2014"
 .SH NAME
 djpeg \- decompress a JPEG file to an image file
 .SH SYNOPSIS
@@ -115,14 +115,28 @@ Use integer DCT method (default).
 .TP
 .B \-dct fast
 Use fast integer DCT (less accurate).
+In libjpeg-turbo, the fast method is generally about 5-15% faster than the int
+method when using the x86/x86-64 SIMD extensions (results may vary with other
+SIMD implementations, or when using libjpeg-turbo without SIMD extensions.)  If
+the JPEG image was compressed using a quality level of 85 or below, then there
+should be little or no perceptible difference between the two algorithms.  When
+decompressing images that were compressed using quality levels above 85,
+however, the difference between the fast and int methods becomes more
+pronounced.  With images compressed using quality=97, for instance, the fast
+method incurs generally about a 4-6 dB loss (in PSNR) relative to the int
+method, but this can be larger for some images.  If you can avoid it, do not
+use the fast method when decompressing images that were compressed using
+quality levels above 97.  The algorithm often degenerates for such images and
+can actually produce a more lossy output image than if the JPEG image had been
+compressed using lower quality levels.
 .TP
 .B \-dct float
 Use floating-point DCT method.
-The float method is very slightly more accurate than the int method, but is
-much slower unless your machine has very fast floating-point hardware.  Also
-note that results of the floating-point method may vary slightly across
-machines, while the integer methods should give the same results everywhere.
-The fast integer method is much less accurate than the other two.
+The float method is mostly a legacy feature.  It does not produce significantly
+more accurate results than the int method, and it is much slower.  The float
+method may also give different results on different machines due to varying
+roundoff behavior, whereas the integer methods should give the same results on
+all machines.
 .TP
 .B \-dither fs
 Use Floyd-Steinberg dithering in color quantization.
--- a/jcapistd.c
+++ b/jcapistd.c
@@ -46,7 +46,7 @@ jpeg_start_compress (j_compress_ptr cinfo, boolean write_all_tables)
    jpeg_suppress_tables(cinfo, FALSE);	/* mark all tables to be written */

  /* setting up scan optimisation pattern failed, disable scan optimisation */
-  if (cinfo->num_scans_luma == 0)
+  if (cinfo->num_scans_luma == 0 || cinfo->scan_info == NULL || cinfo->num_scans == 0)
    cinfo->optimize_scans = FALSE;
  
  /* (Re)initialize error mgr and destination modules */
--- a/jcdctmgr.c
+++ b/jcdctmgr.c
@@ -543,6 +543,8 @@ forward_DCT_float (j_compress_ptr cinfo, jpeg_component_info * compptr,
  FAST_FLOAT * divisors = fdct->float_divisors[compptr->quant_tbl_no];
  FAST_FLOAT * workspace;
  JDIMENSION bi;
+  float v;
+  int x;


  /* Make sure the compiler doesn't look up these every pass */
@@ -572,10 +574,10 @@ forward_DCT_float (j_compress_ptr cinfo, jpeg_component_info * compptr,
      };

      for (i = 0; i < DCTSIZE2; i++) {
-        float v = workspace[i];
+        v = workspace[i];
        v /= aanscalefactor[i%8];
        v /= aanscalefactor[i/8];
-        int x = (v >= 0.0) ? (int)(v + 0.5) : (int)(v - 0.5);
+        x = (v >= 0.0) ? (int)(v + 0.5) : (int)(v - 0.5);
        dst[bi][i] = x;
      }
    }
@@ -588,9 +590,7 @@ forward_DCT_float (j_compress_ptr cinfo, jpeg_component_info * compptr,
 #endif /* DCT_FLOAT_SUPPORTED */

 #include "jchuff.h"
-
-static unsigned char jpeg_nbits_table[65536];
-static int jpeg_nbits_table_init = 0;
+#include "jpeg_nbits_table.h"

 static const float jpeg_lambda_weights_flat[64] = {
  1.0f, 1.0f, 1.0f, 1.0f, 1.0f, 1.0f, 1.0f, 1.0f,
@@ -637,6 +637,10 @@ quantize_trellis(j_compress_ptr cinfo, c_derived_tbl *actbl, JBLOCKROW coef_bloc
  int has_eob;
  float cost_all_zeros;
  float best_cost_skip;
+  float cost;
+  int zero_run;
+  int run_bits;
+  int rate;

  Ss = cinfo->Ss;
  Se = cinfo->Se;
@@ -654,15 +658,6 @@ quantize_trellis(j_compress_ptr cinfo, c_derived_tbl *actbl, JBLOCKROW coef_bloc
    requires_eob[0] = 0;
  }
  
-  if(!jpeg_nbits_table_init) {
-    for(i = 0; i < 65536; i++) {
-      int nbits = 0, temp = i;
-      while (temp) {temp >>= 1;  nbits++;}
-      jpeg_nbits_table[i] = nbits;
-    }
-    jpeg_nbits_table_init = 1;
-  }
-
  norm = 0.0;
  for (i = 1; i < DCTSIZE2; i++) {
    norm += qtbl->quantval[i] * qtbl->quantval[i];
@@ -725,11 +720,11 @@ quantize_trellis(j_compress_ptr cinfo, c_derived_tbl *actbl, JBLOCKROW coef_bloc
        if (j != Ss-1 && coef_blocks[bi][zz] == 0)
          continue;
        
-        int zero_run = i - 1 - j;
+        zero_run = i - 1 - j;
        if ((zero_run >> 4) && actbl->ehufsi[0xf0] == 0)
          continue;
        
-        int run_bits = (zero_run >> 4) * actbl->ehufsi[0xf0];
+        run_bits = (zero_run >> 4) * actbl->ehufsi[0xf0];
        zero_run &= 15;

        for (k = 0; k < num_candidates; k++) {
@@ -737,8 +732,8 @@ quantize_trellis(j_compress_ptr cinfo, c_derived_tbl *actbl, JBLOCKROW coef_bloc
          if (coef_bits == 0)
            continue;
          
-          int rate = coef_bits + candidate_bits[k] + run_bits;
-          float cost = rate + candidate_dist[k];
+          rate = coef_bits + candidate_bits[k] + run_bits;
+          cost = rate + candidate_dist[k];
          cost += accumulated_zero_dist[i-1] - accumulated_zero_dist[j] + accumulated_cost[j];
          
          if (cost < accumulated_cost[i]) {
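
The `forward_DCT_float` and `quantize_trellis` hunks above move declarations such as `v`, `x`, `rate`, and `cost` to the top of their functions, presumably so that the file builds with compilers that reject declarations after statements (C89 rules). That motive is an inference; the commit does not state it. The rounding expression itself is unchanged: it rounds to the nearest integer, with halves going away from zero. A minimal sketch, where `round_half_away` is a hypothetical helper name and the range check is an added assumption:

```c
#include <assert.h>

/*
 * Round to nearest, halves away from zero, using the same expression
 * as the loop in forward_DCT_float. Declarations come first, C89 style.
 * round_half_away is an illustrative name, not a mozjpeg function.
 */
static int round_half_away(float v)
{
  int x;

  /* Converting a float outside the int range is undefined in C. */
  assert(v > -2.0e9f && v < 2.0e9f);

  x = (v >= 0.0) ? (int)(v + 0.5) : (int)(v - 0.5);
  return x;
}
```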
--- a/jchuff.c
+++ b/jchuff.c
@@ -22,8 +22,7 @@
 #include "jchuff.h"		/* Declarations shared with jcphuff.c */
 #include <limits.h>

-static unsigned char jpeg_nbits_table[65536];
-static int jpeg_nbits_table_init = 0;
+#include "jpeg_nbits_table.h"

 #ifndef min
 #define min(a,b) ((a)<(b)?(a):(b))
@@ -271,15 +270,6 @@ jpeg_make_c_derived_tbl (j_compress_ptr cinfo, boolean isDC, int tblno,
    dtbl->ehufco[i] = huffcode[p];
    dtbl->ehufsi[i] = huffsize[p];
  }
-
-  if(!jpeg_nbits_table_init) {
-    for(i = 0; i < 65536; i++) {
-      int nbits = 0, temp = i;
-      while (temp) {temp >>= 1;  nbits++;}
-      jpeg_nbits_table[i] = nbits;
-    }
-    jpeg_nbits_table_init = 1;
-  }
 }


--- a/jcmaster.c
+++ b/jcmaster.c
@@ -933,11 +933,10 @@ jinit_c_master_control (j_compress_ptr cinfo, boolean transcode_only)
  else
    master->total_passes = cinfo->num_scans;
  
+  master->pass_number_scan_opt_base = 0;
  if (cinfo->trellis_quant) {
-    if (cinfo->progressive_mode)
-      master->total_passes += ((cinfo->use_scans_in_trellis) ? 4 : 2) * cinfo->num_components * cinfo->trellis_num_loops;
-    else
-      master->total_passes += 1;
+    master->pass_number_scan_opt_base = ((cinfo->use_scans_in_trellis) ? 4 : 2) * cinfo->num_components * cinfo->trellis_num_loops;
+    master->total_passes += master->pass_number_scan_opt_base;
  }
  
  if (cinfo->optimize_scans) {
@@ -947,9 +946,4 @@ jinit_c_master_control (j_compress_ptr cinfo, boolean transcode_only)
    for (i = 0; i < cinfo->num_scans; i++)
      master->scan_buffer[i] = NULL;
  }
-  
-  if (cinfo->trellis_quant)
-    master->pass_number_scan_opt_base = ((cinfo->use_scans_in_trellis) ? 4 : 2) * cinfo->num_components * cinfo->trellis_num_loops;
-  else
-    master->pass_number_scan_opt_base = 0;
 }
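
After this change, jcmaster.c computes the trellis pass count once, stores it in `pass_number_scan_opt_base`, and adds the same value to `total_passes`. Note the behavioral difference visible in the hunk: the old code added only 1 pass when `progressive_mode` was off, whereas the new code uses the same formula in both modes. A sketch of the arithmetic, with `trellis_pass_count` as a hypothetical helper that takes the relevant `cinfo` fields as plain arguments:

```c
#include <assert.h>

/*
 * Extra passes contributed by trellis quantization, following the
 * expression in jinit_c_master_control. Returns 0 when trellis
 * quantization is off. Hypothetical helper, for illustration only.
 */
static int trellis_pass_count(int trellis_quant, int use_scans_in_trellis,
                              int num_components, int trellis_num_loops)
{
  assert(num_components >= 0 && trellis_num_loops >= 0);

  if (!trellis_quant)
    return 0;
  return (use_scans_in_trellis ? 4 : 2) * num_components * trellis_num_loops;
}
```

For a three-component image with one trellis loop, this gives 6 extra passes, or 12 when `use_scans_in_trellis` is set.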
--- a/jidctflt.c
+++ b/jidctflt.c
@@ -1,9 +1,12 @@
 /*
 * jidctflt.c
 *
+ * This file was part of the Independent JPEG Group's software:
 * Copyright (C) 1994-1998, Thomas G. Lane.
- * This file is part of the Independent JPEG Group's software.
- * For conditions of distribution and use, see the accompanying README file.
+ * Modified 2010 by Guido Vollbeding.
+ * libjpeg-turbo Modifications:
+ * Copyright (C) 2014, D. R. Commander.
+ * For conditions of distribution and use, see the accompanying README file.
 *
 * This file contains a floating-point implementation of the
 * inverse DCT (Discrete Cosine Transform).  In the IJG code, this routine
@@ -76,10 +79,10 @@ jpeg_idct_float (j_decompress_ptr cinfo, jpeg_component_info * compptr,
  FLOAT_MULT_TYPE * quantptr;
  FAST_FLOAT * wsptr;
  JSAMPROW outptr;
-  JSAMPLE *range_limit = IDCT_range_limit(cinfo);
+  JSAMPLE *range_limit = cinfo->sample_range_limit;
  int ctr;
  FAST_FLOAT workspace[DCTSIZE2]; /* buffers data between passes */
-  SHIFT_TEMPS
+  #define _0_125 ((FLOAT_MULT_TYPE)0.125)

  /* Pass 1: process columns from input, store into work array. */

@@ -101,7 +104,8 @@ jpeg_idct_float (j_decompress_ptr cinfo, jpeg_component_info * compptr,
 	inptr[DCTSIZE*5] == 0 && inptr[DCTSIZE*6] == 0 &&
 	inptr[DCTSIZE*7] == 0) {
      /* AC terms all zero */
-      FAST_FLOAT dcval = DEQUANTIZE(inptr[DCTSIZE*0], quantptr[DCTSIZE*0]);
+      FAST_FLOAT dcval = DEQUANTIZE(inptr[DCTSIZE*0],
+                                    quantptr[DCTSIZE*0] * _0_125);
      
      wsptr[DCTSIZE*0] = dcval;
      wsptr[DCTSIZE*1] = dcval;
@@ -120,10 +124,10 @@ jpeg_idct_float (j_decompress_ptr cinfo, jpeg_component_info * compptr,
    
    /* Even part */

-    tmp0 = DEQUANTIZE(inptr[DCTSIZE*0], quantptr[DCTSIZE*0]);
-    tmp1 = DEQUANTIZE(inptr[DCTSIZE*2], quantptr[DCTSIZE*2]);
-    tmp2 = DEQUANTIZE(inptr[DCTSIZE*4], quantptr[DCTSIZE*4]);
-    tmp3 = DEQUANTIZE(inptr[DCTSIZE*6], quantptr[DCTSIZE*6]);
+    tmp0 = DEQUANTIZE(inptr[DCTSIZE*0], quantptr[DCTSIZE*0] * _0_125);
+    tmp1 = DEQUANTIZE(inptr[DCTSIZE*2], quantptr[DCTSIZE*2] * _0_125);
+    tmp2 = DEQUANTIZE(inptr[DCTSIZE*4], quantptr[DCTSIZE*4] * _0_125);
+    tmp3 = DEQUANTIZE(inptr[DCTSIZE*6], quantptr[DCTSIZE*6] * _0_125);

    tmp10 = tmp0 + tmp2;	/* phase 3 */
    tmp11 = tmp0 - tmp2;
@@ -138,10 +142,10 @@ jpeg_idct_float (j_decompress_ptr cinfo, jpeg_component_info * compptr,
    
    /* Odd part */

-    tmp4 = DEQUANTIZE(inptr[DCTSIZE*1], quantptr[DCTSIZE*1]);
-    tmp5 = DEQUANTIZE(inptr[DCTSIZE*3], quantptr[DCTSIZE*3]);
-    tmp6 = DEQUANTIZE(inptr[DCTSIZE*5], quantptr[DCTSIZE*5]);
-    tmp7 = DEQUANTIZE(inptr[DCTSIZE*7], quantptr[DCTSIZE*7]);
+    tmp4 = DEQUANTIZE(inptr[DCTSIZE*1], quantptr[DCTSIZE*1] * _0_125);
+    tmp5 = DEQUANTIZE(inptr[DCTSIZE*3], quantptr[DCTSIZE*3] * _0_125);
+    tmp6 = DEQUANTIZE(inptr[DCTSIZE*5], quantptr[DCTSIZE*5] * _0_125);
+    tmp7 = DEQUANTIZE(inptr[DCTSIZE*7], quantptr[DCTSIZE*7] * _0_125);

    z13 = tmp6 + tmp5;		/* phase 6 */
    z10 = tmp6 - tmp5;
@@ -152,12 +156,12 @@ jpeg_idct_float (j_decompress_ptr cinfo, jpeg_component_info * compptr,
    tmp11 = (z11 - z13) * ((FAST_FLOAT) 1.414213562); /* 2*c4 */

    z5 = (z10 + z12) * ((FAST_FLOAT) 1.847759065); /* 2*c2 */
-    tmp10 = ((FAST_FLOAT) 1.082392200) * z12 - z5; /* 2*(c2-c6) */
-    tmp12 = ((FAST_FLOAT) -2.613125930) * z10 + z5; /* -2*(c2+c6) */
+    tmp10 = z5 - z12 * ((FAST_FLOAT) 1.082392200); /* 2*(c2-c6) */
+    tmp12 = z5 - z10 * ((FAST_FLOAT) 2.613125930); /* 2*(c2+c6) */

    tmp6 = tmp12 - tmp7;	/* phase 2 */
    tmp5 = tmp11 - tmp6;
-    tmp4 = tmp10 + tmp5;
+    tmp4 = tmp10 - tmp5;

    wsptr[DCTSIZE*0] = tmp0 + tmp7;
    wsptr[DCTSIZE*7] = tmp0 - tmp7;
@@ -165,8 +169,8 @@ jpeg_idct_float (j_decompress_ptr cinfo, jpeg_component_info * compptr,
    wsptr[DCTSIZE*6] = tmp1 - tmp6;
    wsptr[DCTSIZE*2] = tmp2 + tmp5;
    wsptr[DCTSIZE*5] = tmp2 - tmp5;
-    wsptr[DCTSIZE*4] = tmp3 + tmp4;
-    wsptr[DCTSIZE*3] = tmp3 - tmp4;
+    wsptr[DCTSIZE*3] = tmp3 + tmp4;
+    wsptr[DCTSIZE*4] = tmp3 - tmp4;

    inptr++;			/* advance pointers to next column */
    quantptr++;
@@ -174,7 +178,6 @@ jpeg_idct_float (j_decompress_ptr cinfo, jpeg_component_info * compptr,
  }
  
  /* Pass 2: process rows from work array, store into output array. */
-  /* Note that we must descale the results by a factor of 8 == 2**3. */

  wsptr = workspace;
  for (ctr = 0; ctr < DCTSIZE; ctr++) {
@@ -187,8 +190,10 @@ jpeg_idct_float (j_decompress_ptr cinfo, jpeg_component_info * compptr,
    
    /* Even part */

-    tmp10 = wsptr[0] + wsptr[4];
-    tmp11 = wsptr[0] - wsptr[4];
+    /* Apply signed->unsigned and prepare float->int conversion */
+    z5 = wsptr[0] + ((FAST_FLOAT) CENTERJSAMPLE + (FAST_FLOAT) 0.5);
+    tmp10 = z5 + wsptr[4];
+    tmp11 = z5 - wsptr[4];

    tmp13 = wsptr[2] + wsptr[6];
    tmp12 = (wsptr[2] - wsptr[6]) * ((FAST_FLOAT) 1.414213562) - tmp13;
@@ -209,31 +214,23 @@ jpeg_idct_float (j_decompress_ptr cinfo, jpeg_component_info * compptr,
    tmp11 = (z11 - z13) * ((FAST_FLOAT) 1.414213562);

    z5 = (z10 + z12) * ((FAST_FLOAT) 1.847759065); /* 2*c2 */
-    tmp10 = ((FAST_FLOAT) 1.082392200) * z12 - z5; /* 2*(c2-c6) */
-    tmp12 = ((FAST_FLOAT) -2.613125930) * z10 + z5; /* -2*(c2+c6) */
+    tmp10 = z5 - z12 * ((FAST_FLOAT) 1.082392200); /* 2*(c2-c6) */
+    tmp12 = z5 - z10 * ((FAST_FLOAT) 2.613125930); /* 2*(c2+c6) */

    tmp6 = tmp12 - tmp7;
    tmp5 = tmp11 - tmp6;
-    tmp4 = tmp10 + tmp5;
+    tmp4 = tmp10 - tmp5;

-    /* Final output stage: scale down by a factor of 8 and range-limit */
+    /* Final output stage: float->int conversion and range-limit */

-    outptr[0] = range_limit[(int) DESCALE((INT32) (tmp0 + tmp7), 3)
-			    & RANGE_MASK];
-    outptr[7] = range_limit[(int) DESCALE((INT32) (tmp0 - tmp7), 3)
-			    & RANGE_MASK];
-    outptr[1] = range_limit[(int) DESCALE((INT32) (tmp1 + tmp6), 3)
-			    & RANGE_MASK];
-    outptr[6] = range_limit[(int) DESCALE((INT32) (tmp1 - tmp6), 3)
-			    & RANGE_MASK];
-    outptr[2] = range_limit[(int) DESCALE((INT32) (tmp2 + tmp5), 3)
-			    & RANGE_MASK];
-    outptr[5] = range_limit[(int) DESCALE((INT32) (tmp2 - tmp5), 3)
-			    & RANGE_MASK];
-    outptr[4] = range_limit[(int) DESCALE((INT32) (tmp3 + tmp4), 3)
-			    & RANGE_MASK];
-    outptr[3] = range_limit[(int) DESCALE((INT32) (tmp3 - tmp4), 3)
-			    & RANGE_MASK];
+    outptr[0] = range_limit[((int) (tmp0 + tmp7)) & RANGE_MASK];
+    outptr[7] = range_limit[((int) (tmp0 - tmp7)) & RANGE_MASK];
+    outptr[1] = range_limit[((int) (tmp1 + tmp6)) & RANGE_MASK];
+    outptr[6] = range_limit[((int) (tmp1 - tmp6)) & RANGE_MASK];
+    outptr[2] = range_limit[((int) (tmp2 + tmp5)) & RANGE_MASK];
+    outptr[5] = range_limit[((int) (tmp2 - tmp5)) & RANGE_MASK];
+    outptr[3] = range_limit[((int) (tmp3 + tmp4)) & RANGE_MASK];
+    outptr[4] = range_limit[((int) (tmp3 - tmp4)) & RANGE_MASK];
    
    wsptr += DCTSIZE;		/* advance pointer to next row */
  }
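
The jidctflt.c hunks above fold the final divide-by-8 into dequantization (the `_0_125` factor) and add `CENTERJSAMPLE + 0.5` to the `wsptr[0]` term of each row, so the output stage needs only a float-to-int cast. Because the transform is linear and that term enters every output with a positive sign, the result equals "scale, round, then level-shift" whenever the biased value is non-negative. The sketch below checks this on a toy two-point butterfly, not the real 8x8 IDCT. `butterfly_biased`, `butterfly_reference`, `floor_div8`, and `CENTER` (standing in for `CENTERJSAMPLE` with 8-bit samples) are assumptions made for illustration.

```c
#include <assert.h>

#define CENTER 128  /* stand-in for CENTERJSAMPLE with 8-bit samples */

/* New style: prescale by 0.125, bias the term shared by both outputs,
 * then truncate. Truncation acts as rounding only while the biased
 * sum is non-negative, which the assert below guarantees. */
static void butterfly_biased(int a, int b, int out[2])
{
  float z = (float)a * 0.125f + ((float)CENTER + 0.5f);
  float w = (float)b * 0.125f;

  assert(a + b >= -8 * CENTER && a - b >= -8 * CENTER);
  out[0] = (int)(z + w);
  out[1] = (int)(z - w);
}

/* Floor division by 8 that also handles negative numerators. */
static int floor_div8(int n)
{
  return (n >= 0) ? n / 8 : -((-n + 7) / 8);
}

/* Reference: round (a +/- b) / 8 to nearest (halves up), then level-shift. */
static void butterfly_reference(int a, int b, int out[2])
{
  out[0] = floor_div8(a + b + 4) + CENTER;
  out[1] = floor_div8(a - b + 4) + CENTER;
}
```

In the real routine, values outside the sample range are still handled by the `range_limit` lookup with `RANGE_MASK`, which this toy omits.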
--- a/jpeg_nbits_table.h
+++ b/jpeg_nbits_table.h
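
The body of this new header is not shown in this diff. jcdctmgr.c and jchuff.c each previously filled a private 65536-entry `jpeg_nbits_table` on first use, and both now include this header instead. The following is only a sketch of the value each entry holds, taken from the removed initialization loops: the number of bits needed to represent the index. `nbits_of` is a hypothetical name, and nothing here reflects the header's actual layout.

```c
#include <assert.h>

/*
 * Number of significant bits in a 16-bit value, computed the same way
 * as the removed initialization loops.
 */
static unsigned char nbits_of(unsigned int value)
{
  unsigned char nbits = 0;

  assert(value <= 65535u);  /* the removed tables had 65536 entries */
  while (value) {
    value >>= 1;
    nbits++;
  }
  return nbits;
}
```

A precomputed table also avoids the unsynchronized first-use initialization of the old code, which may matter when several threads compress at once, though the commit message does not give that as a reason.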
--- a/libjpeg.txt
+++ b/libjpeg.txt
@@ -3,7 +3,7 @@ USING THE IJG JPEG LIBRARY
 This file was part of the Independent JPEG Group's software:
 Copyright (C) 1994-2011, Thomas G. Lane, Guido Vollbeding.
 Modifications:
-Copyright (C) 2010, D. R. Commander.
+Copyright (C) 2010, 2014, D. R. Commander.
 For conditions of distribution and use, see the accompanying README file.


@@ -886,14 +886,23 @@ J_DCT_METHOD dct_method
 		JDCT_FLOAT: floating-point method
 		JDCT_DEFAULT: default method (normally JDCT_ISLOW)
 		JDCT_FASTEST: fastest method (normally JDCT_IFAST)
-	The FLOAT method is very slightly more accurate than the ISLOW method,
-	but may give different results on different machines due to varying
-	roundoff behavior.  The integer methods should give the same results
-	on all machines.  On machines with sufficiently fast FP hardware, the
-	floating-point method may also be the fastest.  The IFAST method is
-	considerably less accurate than the other two; its use is not
-	recommended if high quality is a concern.  JDCT_DEFAULT and
-	JDCT_FASTEST are macros configurable by each installation.
+        In libjpeg-turbo, JDCT_IFAST is generally about 5-15% faster than
+        JDCT_ISLOW when using the x86/x86-64 SIMD extensions (results may vary
+        with other SIMD implementations, or when using libjpeg-turbo without
+        SIMD extensions.)  For quality levels of 90 and below, there should be
+        little or no perceptible difference between the two algorithms.  For
+        quality levels above 90, however, the difference between JDCT_IFAST and
+        JDCT_ISLOW becomes more pronounced.  With quality=97, for instance,
+        JDCT_IFAST incurs generally about a 1-3 dB loss (in PSNR) relative to
+        JDCT_ISLOW, but this can be larger for some images.  Do not use
+        JDCT_IFAST with quality levels above 97.  The algorithm often
+        degenerates at quality=98 and above and can actually produce a more
+        lossy image than if lower quality levels had been used.  JDCT_FLOAT is
+        mostly a legacy feature.  It does not produce significantly more
+        accurate results than the ISLOW method, and it is much slower.  The
+        FLOAT method may also give different results on different machines due
+        to varying roundoff behavior, whereas the integer methods should give
+        the same results on all machines.

 J_COLOR_SPACE jpeg_color_space
 int num_components
@@ -1170,8 +1179,32 @@ int actual_number_of_colors
 Additional decompression parameters that the application may set include:

 J_DCT_METHOD dct_method
-	Selects the algorithm used for the DCT step.  Choices are the same
-	as described above for compression.
+        Selects the algorithm used for the DCT step.  Choices are:
+                JDCT_ISLOW: slow but accurate integer algorithm
+                JDCT_IFAST: faster, less accurate integer method
+                JDCT_FLOAT: floating-point method
+                JDCT_DEFAULT: default method (normally JDCT_ISLOW)
+                JDCT_FASTEST: fastest method (normally JDCT_IFAST)
+        In libjpeg-turbo, JDCT_IFAST is generally about 5-15% faster than
+        JDCT_ISLOW when using the x86/x86-64 SIMD extensions (results may vary
+        with other SIMD implementations, or when using libjpeg-turbo without
+        SIMD extensions.)  If the JPEG image was compressed using a quality
+        level of 85 or below, then there should be little or no perceptible
+        difference between the two algorithms.  When decompressing images that
+        were compressed using quality levels above 85, however, the difference
+        between JDCT_IFAST and JDCT_ISLOW becomes more pronounced.  With images
+        compressed using quality=97, for instance, JDCT_IFAST incurs generally
+        about a 4-6 dB loss (in PSNR) relative to JDCT_ISLOW, but this can be
+        larger for some images.  If you can avoid it, do not use JDCT_IFAST
+        when decompressing images that were compressed using quality levels
+        above 97.  The algorithm often degenerates for such images and can
+        actually produce a more lossy output image than if the JPEG image had
+        been compressed using lower quality levels.  JDCT_FLOAT is mostly a
+        legacy feature.  It does not produce significantly more accurate
+        results than the ISLOW method, and it is much slower.  The FLOAT method
+        may also give different results on different machines due to varying
+        roundoff behavior, whereas the integer methods should give the same
+        results on all machines.

 boolean do_fancy_upsampling
 	If TRUE, do careful upsampling of chroma components.  If FALSE,
--- a/usage.txt
+++ b/usage.txt
@@ -172,13 +172,28 @@ Switches for advanced users:
 	-dct int	Use integer DCT method (default).
 	-dct fast	Use fast integer DCT (less accurate).
 	-dct float	Use floating-point DCT method.
-			The float method is very slightly more accurate than
-			the int method, but is much slower unless your machine
-			has very fast floating-point hardware.  Also note that
-			results of the floating-point method may vary slightly
-			across machines, while the integer methods should give
-			the same results everywhere.  The fast integer method
-			is much less accurate than the other two.
+                        In libjpeg-turbo, the fast method is generally about
+                        5-15% faster than the int method when using the
+                        x86/x86-64 SIMD extensions (results may vary with other
+                        SIMD implementations, or when using libjpeg-turbo
+                        without SIMD extensions.)  For quality levels of 90 and
+                        below, there should be little or no perceptible
+                        difference between the two algorithms.  For quality
+                        levels above 90, however, the difference between
+                        the fast and the int methods becomes more pronounced.
+                        With quality=97, for instance, the fast method incurs
+                        generally about a 1-3 dB loss (in PSNR) relative to
+                        the int method, but this can be larger for some images.
+                        Do not use the fast method with quality levels above
+                        97.  The algorithm often degenerates at quality=98 and
+                        above and can actually produce a more lossy image than
+                        if lower quality levels had been used.  The float
+                        method is mostly a legacy feature.  It does not produce
+                        significantly more accurate results than the int
+                        method, and it is much slower.  The float method may
+                        also give different results on different machines due
+                        to varying roundoff behavior, whereas the integer
+                        methods should give the same results on all machines.

 	-restart N	Emit a JPEG restart marker every N MCU rows, or every
 			N MCU blocks if "B" is attached to the number.
@@ -296,13 +311,32 @@ Switches for advanced users:
 	-dct int	Use integer DCT method (default).
 	-dct fast	Use fast integer DCT (less accurate).
 	-dct float	Use floating-point DCT method.
-			The float method is very slightly more accurate than
-			the int method, but is much slower unless your machine
-			has very fast floating-point hardware.  Also note that
-			results of the floating-point method may vary slightly
-			across machines, while the integer methods should give
-			the same results everywhere.  The fast integer method
-			is much less accurate than the other two.
+                        In libjpeg-turbo, the fast method is generally about
+                        5-15% faster than the int method when using the
+                        x86/x86-64 SIMD extensions (results may vary with other
+                        SIMD implementations, or when using libjpeg-turbo
+                        without SIMD extensions.)  If the JPEG image was
+                        compressed using a quality level of 85 or below, then
+                        there should be little or no perceptible difference
+                        between the two algorithms.  When decompressing images
+                        that were compressed using quality levels above 85,
+                        however, the difference between the fast and int
+                        methods becomes more pronounced.  With images
+                        compressed using quality=97, for instance, the fast
+                        method incurs generally about a 4-6 dB loss (in PSNR)
+                        relative to the int method, but this can be larger for
+                        some images.  If you can avoid it, do not use the fast
+                        method when decompressing images that were compressed
+                        using quality levels above 97.  The algorithm often
+                        degenerates for such images and can actually produce
+                        a more lossy output image than if the JPEG image had
+                        been compressed using lower quality levels.  The float
+                        method is mostly a legacy feature.  It does not produce
+                        significantly more accurate results than the int
+                        method, and it is much slower.  The float method may
+                        also give different results on different machines due
+                        to varying roundoff behavior, whereas the integer
+                        methods should give the same results on all machines.

 	-dither fs	Use Floyd-Steinberg dithering in color quantization.
 	-dither ordered	Use ordered dithering in color quantization.
@@ -381,12 +415,6 @@ When producing a color-quantized image, "-onepass -dither ordered" is fast but
 much lower quality than the default behavior.  "-dither none" may give
 acceptable results in two-pass mode, but is seldom tolerable in one-pass mode.

-If you are fortunate enough to have very fast floating point hardware,
-"-dct float" may be even faster than "-dct fast".  But on most machines
-"-dct float" is slower than "-dct int"; in this case it is not worth using,
-because its theoretical accuracy advantage is too small to be significant
-in practice.
-
 Two-pass color quantization requires a good deal of memory; on MS-DOS machines
 it may run out of memory even with -maxmemory 0.  In that case you can still
 decompress, with some loss of image quality, by specifying -onepass for
Author	SHA1	Message	Date
Josh Aas	594b7258cc	Bump version to 2.1.	2014-07-29 09:12:48 -05:00
fbossen	0533b31891	Merge pull request #79 from arjun024/devel more logical flow of control if trellis_quant enabled	2014-07-25 09:02:45 -04:00
Arjun Sreedharan	cca53c920d	more logical flow of control if trellis_quant enabled if trellis_quant is enabled, increment total number of passes by optimization beginning pass number. Signed-off-by: Arjun Sreedharan <arjun024@gmail.com>	2014-07-25 17:33:35 +05:30
Josh Aas	5901802871	Update README.md with link to 2.0 announcement and mailing list.	2014-07-24 16:39:26 -05:00
Frank Bossen	fbef31f76d	Add option to disable progressive coding in cjpeg Redefine baseline option in cjpeg to actually create a baseline JPEG file by disabling progressive coding	2014-07-24 17:09:27 -04:00
Frank Bossen	1aa50b71d9	Use precomputed table From jpeglib-turbo r1221: Integrate a slightly modified version of Mozilla's patch for precomputing the bit-counting LUT. This is useful if the table needs to be shared among multiple processes, although the primary reason for doing that is reduced footprint on mobile devices, which are probably already covered by the clz intrinsic code.	2014-07-24 10:50:59 -04:00
Frank Bossen	3adc64a4cb	Improve floating point DCT From libjpeg-turbo r1288 Port the more accurate (and slightly faster) floating point IDCT implementation from jpeg-8a and later. New research revealed that the SSE/SSE2 floating point IDCT implementation was actually more accurate than the jpeg-6b implementation, not less, which is why its mathematical results have always differed from those of the jpeg-6b implementation. This patch brings the accuracy of the C code in line with that of the SSE/SSE2 code.	2014-07-23 10:26:46 -04:00
Frank Bossen	ccb1d12f53	Update doc re: various DCT implementations From libjpeg-turbo r1287	2014-07-23 10:23:24 -04:00
Frank Bossen	c73a82c6aa	Fix build for wrjpgcom See r1325 in libjpeg-turbo	2014-07-23 09:22:55 -04:00
Frank Bossen	b9f25333f6	Disable scan optimization if no scan given Addresses segmentation fault issue in #69	2014-07-21 20:11:46 +02:00
Josh Aas	0f6b96c68e	Merge pull request #73 from pornel/master Added few error messages in cjpeg	2014-07-21 11:23:20 -05:00
Josh Aas	eeea9ed397	Merge pull request #72 from jlongman/master Bugfix: AM_PROG_AR is not recognized by older automake, so only use it w...	2014-07-21 11:21:30 -05:00
Josh Aas	b95528727f	Merge pull request #75 from pmed/master Fixed mozjpeg build with Visual C++ 2010	2014-07-21 11:18:33 -05:00
Pavel Medvedev	d9605b0560	Fixed mozjpeg build with Visual C++ 2010 Moved several variable declarations out of inner scopes to the function scope to compile C code with Visual C++ 2010	2014-07-21 15:32:54 +04:00
Kornel Lesiński	40e6e8b2a2	Added few error messages in cjpeg	2014-07-20 16:11:59 +01:00
Frank Bossen	514307e9e6	Fix trellis for nonprogressive mode (#69 ) Correct number of trellis passes when in nonprogressive mode	2014-07-18 18:10:56 +02:00
jlongman	7388a54647	Bugfix: AM_PROG_AR is not recognized by older automake, so only use it when defined	2014-07-17 14:16:24 -04:00
Josh Aas	e7a135b930	Bump version number for 2.0, make this version 2.0.1.	2014-07-15 13:55:55 -05:00
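
The documentation changes above give rough quality thresholds for choosing between the slow and fast integer DCT methods. The following is a minimal sketch of how an application might encode that guidance, not code from this change set. The enum and both function names are hypothetical and exist only for illustration. The sketch also assumes the caller knows the quality level an image was (or will be) compressed with, which a decompressor cannot generally determine from the file alone. The float method is never selected, since the updated text describes it as a legacy feature.

```c
#include <assert.h>

/* Hypothetical stand-in for the two integer DCT methods discussed in the
 * documentation.  In a real program the result would be mapped onto
 * JDCT_ISLOW / JDCT_IFAST and stored in the dct_method field. */
typedef enum {
  SKETCH_DCT_ISLOW,   /* slow but accurate integer algorithm */
  SKETCH_DCT_IFAST    /* faster, less accurate integer method */
} sketch_dct_choice;

/* Compression side: the updated usage.txt says there should be little or no
 * perceptible difference at quality 90 and below, so the fast method is only
 * chosen there, and only when the caller asks for speed.
 * Assumes quality is in the range 0..100. */
sketch_dct_choice sketch_pick_compress_dct(int quality, int prefer_speed)
{
  assert(quality >= 0 && quality <= 100);
  if (prefer_speed && quality <= 90)
    return SKETCH_DCT_IFAST;
  return SKETCH_DCT_ISLOW;
}

/* Decompression side: the documented threshold is lower (85), because the
 * reported PSNR loss of the fast method is larger when decoding.
 * source_quality is the quality the image was compressed with, which is
 * an assumption the caller must be able to supply. */
sketch_dct_choice sketch_pick_decompress_dct(int source_quality,
                                             int prefer_speed)
{
  assert(source_quality >= 0 && source_quality <= 100);
  if (prefer_speed && source_quality <= 85)
    return SKETCH_DCT_IFAST;
  return SKETCH_DCT_ISLOW;
}
```

Both helpers fall back to the slow integer method whenever speed is not requested or the quality level is above the documented threshold, which also keeps the fast method away from the quality=98 and above range where the text warns that it degenerates.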