IJG R6b with x86SIMD V1.02

Independent JPEG Group's JPEG software release 6b with x86 SIMD extension for IJG JPEG library version 1.02
The Independent JPEG Group's JPEG software v6b
2015-07-29 16:36:25 -05:00 · 2015-07-27 13:43:00 -05:00
196 changed files with 57909 additions and 2352 deletions
--- a/134
+++ b/134
@@ -1,8 +1,8 @@
 The Independent JPEG Group's JPEG software
 ==========================================
-README for release 6a of 7-Feb-96
+README for release 6b of 27-Mar-1998
-=================================
+====================================
 This distribution contains the sixth public release of the Independent JPEG
 Group's free JPEG software.  You are welcome to redistribute this software and
@@ -13,9 +13,10 @@ larger programs) should contact IJG at jpeg-info@uunet.uu.net to be added to
 our electronic mailing list.  Mailing list members are notified of updates
 and have a chance to participate in technical discussions, etc.
-This software is the work of Tom Lane, Philip Gladstone, Luis Ortiz, Jim
+This software is the work of Tom Lane, Philip Gladstone, Jim Boucher,
-Boucher, Lee Crocker, Julian Minguillon, George Phillips, Davide Rossi,
+Lee Crocker, Julian Minguillon, Luis Ortiz, George Phillips, Davide Rossi,
-Ge' Weijers, and other members of the Independent JPEG Group.
+Guido Vollbeding, Ge' Weijers, and other members of the Independent JPEG
+Group.
 IJG is not affiliated with the official ISO JPEG standards committee.
@@ -126,7 +127,7 @@ with respect to this software, its quality, accuracy, merchantability, or
 fitness for a particular purpose.  This software is provided "AS IS", and you,
 its user, assume the entire risk as to its quality and accuracy.
-This software is copyright (C) 1991-1996, Thomas G. Lane.
+This software is copyright (C) 1991-1998, Thomas G. Lane.
 All Rights Reserved except as specified below.
 Permission is hereby granted to use, copy, modify, and distribute this
@@ -166,8 +167,11 @@ ansi2knr.c for full details.)  However, since ansi2knr.c is not needed as part
 of any program generated from the IJG code, this does not limit you more than
 the foregoing paragraphs do.
-The configuration script "configure" was produced with GNU Autoconf.  It
+The Unix configuration script "configure" was produced with GNU Autoconf.
-is copyright by the Free Software Foundation but is freely distributable.
+It is copyright by the Free Software Foundation but is freely distributable.
 The same holds for its supporting scripts (config.guess, config.sub,
 ltconfig, ltmain.sh).  Another support script, install-sh, is copyright
 by M.I.T. but is also freely distributable.
 It appears that the arithmetic coding option of the JPEG spec is covered by
 patents owned by IBM, AT&T, and Mitsubishi.  Hence arithmetic coding cannot
@@ -178,13 +182,12 @@ Huffman mode, it is unlikely that very many implementations will support it.)
 So far as we are aware, there are no patent restrictions on the remaining
 code.
-WARNING: Unisys has begun to enforce their patent on LZW compression against
+The IJG distribution formerly included code to read and write GIF files.
-GIF encoders and decoders.  You will need a license from Unisys to use the
+To avoid entanglement with the Unisys LZW patent, GIF reading support has
-included rdgif.c or wrgif.c files in a commercial or shareware application.
+been removed altogether, and the GIF writer has been simplified to produce
-At this time, Unisys is not enforcing their patent against freeware, so
+"uncompressed GIFs".  This technique does not use the LZW algorithm; the
-distribution of this package remains legal.  However, we intend to remove
+resulting GIF files are larger than usual, but are readable by all standard
-GIF support from the IJG package as soon as a suitable replacement format
+GIF decoders.
-becomes reasonably popular.
 We are required to state that
    "The Graphics Interchange Format(c) is the Copyright property of
@@ -203,21 +206,21 @@ The best short technical introduction to the JPEG compression algorithm is
 	Communications of the ACM, April 1991 (vol. 34 no. 4), pp. 30-44.
 (Adjacent articles in that issue discuss MPEG motion picture compression,
 applications of JPEG, and related topics.)  If you don't have the CACM issue
-handy, a PostScript file containing a revised version of Wallace's article
+handy, a PostScript file containing a revised version of Wallace's article is
-is available at ftp.uu.net, graphics/jpeg/wallace.ps.gz.  The file (actually
+available at ftp://ftp.uu.net/graphics/jpeg/wallace.ps.gz.  The file (actually
 a preprint for an article that appeared in IEEE Trans. Consumer Electronics)
 omits the sample images that appeared in CACM, but it includes corrections
-and some added material.  Note: the Wallace article is copyright ACM and
+and some added material.  Note: the Wallace article is copyright ACM and IEEE,
-IEEE, and it may not be used for commercial purposes.
+and it may not be used for commercial purposes.
 A somewhat less technical, more leisurely introduction to JPEG can be found in
-"The Data Compression Book" by Mark Nelson, published by M&T Books (Redwood
+"The Data Compression Book" by Mark Nelson and Jean-loup Gailly, published by
-City, CA), 1991, ISBN 1-55851-216-0.  This book provides good explanations and
+M&T Books (New York), 2nd ed. 1996, ISBN 1-55851-434-1.  This book provides
-example C code for a multitude of compression methods including JPEG.  It is
+good explanations and example C code for a multitude of compression methods
-an excellent source if you are comfortable reading C code but don't know much
+including JPEG.  It is an excellent source if you are comfortable reading C
-about data compression in general.  The book's JPEG sample code is far from
+code but don't know much about data compression in general.  The book's JPEG
-industrial-strength, but when you are ready to look at a full implementation,
+sample code is far from industrial-strength, but when you are ready to look
-you've got one here...
+at a full implementation, you've got one here...
 The best full description of JPEG is the textbook "JPEG Still Image Data
 Compression Standard" by William B. Pennebaker and Joan L. Mitchell, published
@@ -242,10 +245,9 @@ Part 1: Requirements and guidelines" and has document numbers ISO/IEC IS
 Continuous-tone Still Images, Part 2: Compliance testing" and has document
 numbers ISO/IEC IS 10918-2, ITU-T T.83.
-Extensions to the original JPEG standard are defined in JPEG Part 3, a new ISO
+Some extensions to the original JPEG standard are defined in JPEG Part 3,
-document.  Part 3 is undergoing ISO balloting and is expected to be approved
+a newer ISO standard numbered ISO/IEC IS 10918-3 and ITU-T T.84.  IJG
-by the end of 1995; it will have document numbers ISO/IEC IS 10918-3, ITU-T
+currently does not support any Part 3 extensions.
-T.84.  IJG currently does not support any Part 3 extensions.
 The JPEG standard does not specify all details of an interchangeable file
 format.  For the omitted details we follow the "JFIF" conventions, revision
@@ -255,24 +257,22 @@ format.  For the omitted details we follow the "JFIF" conventions, revision
 	1778 McCarthy Blvd.
 	Milpitas, CA 95035
 	phone (408) 944-6300,  fax (408) 944-6314
-A PostScript version of this document is available at ftp.uu.net, file
+A PostScript version of this document is available by FTP at
-graphics/jpeg/jfif.ps.gz.  It can also be obtained by e-mail from the C-Cube
+ftp://ftp.uu.net/graphics/jpeg/jfif.ps.gz.  There is also a plain text
-mail server, netlib@c3.pla.ca.us.  Send the message "send jfif_ps from jpeg"
+version at ftp://ftp.uu.net/graphics/jpeg/jfif.txt.gz, but it is missing
-to the server to obtain the JFIF document; send the message "help" if you have
+the figures.
-trouble.
-The TIFF 6.0 file format specification can be obtained by FTP from sgi.com
+The TIFF 6.0 file format specification can be obtained by FTP from
-(192.48.153.1), file graphics/tiff/TIFF6.ps.Z; or you can order a printed
+ftp://ftp.sgi.com/graphics/tiff/TIFF6.ps.gz.  The JPEG incorporation scheme
-copy from Aldus Corp. at (206) 628-6593.  The JPEG incorporation scheme
 found in the TIFF 6.0 spec of 3-June-92 has a number of serious problems.
 IJG does not recommend use of the TIFF 6.0 design (TIFF Compression tag 6).
 Instead, we recommend the JPEG design proposed by TIFF Technical Note #2
-(Compression tag 7).  Copies of this Note can be obtained from sgi.com or
+(Compression tag 7).  Copies of this Note can be obtained from ftp.sgi.com or
-from ftp.uu.net:/graphics/jpeg/.  It is expected that the next revision of
+from ftp://ftp.uu.net/graphics/jpeg/.  It is expected that the next revision
-the TIFF spec will replace the 6.0 JPEG design with the Note's design.
+of the TIFF spec will replace the 6.0 JPEG design with the Note's design.
 Although IJG's own code does not support TIFF/JPEG, the free libtiff library
 uses our library to implement TIFF/JPEG per the Note.  libtiff is available
-from sgi.com:/graphics/tiff/.
+from ftp://ftp.sgi.com/graphics/tiff/.
 ARCHIVE LOCATIONS
@@ -281,26 +281,27 @@ ARCHIVE LOCATIONS
 The "official" archive site for this software is ftp.uu.net (Internet
 address 192.48.96.9).  The most recent released version can always be found
 there in directory graphics/jpeg.  This particular version will be archived
-as graphics/jpeg/jpegsrc.v6a.tar.gz.  If you are on the Internet, you
+as ftp://ftp.uu.net/graphics/jpeg/jpegsrc.v6b.tar.gz.  If you don't have
-can retrieve files from ftp.uu.net by standard anonymous FTP.  If you don't
+direct Internet access, UUNET's archives are also available via UUCP; contact
-have FTP access, UUNET's archives are also available via UUCP; contact
 help@uunet.uu.net for information on retrieving files that way.
 Numerous Internet sites maintain copies of the UUNET files.  However, only
 ftp.uu.net is guaranteed to have the latest official version.
 You can also obtain this software in DOS-compatible "zip" archive format from
-the SimTel archives (ftp.coast.net:/SimTel/msdos/graphics/), or on CompuServe
+the SimTel archives (ftp://ftp.simtel.net/pub/simtelnet/msdos/graphics/), or
-in the Graphics Support forum (GO CIS:GRAPHSUP), library 12 "JPEG Tools".
+on CompuServe in the Graphics Support forum (GO CIS:GRAPHSUP), library 12
-Again, these versions may sometimes lag behind the ftp.uu.net release.
+"JPEG Tools".  Again, these versions may sometimes lag behind the ftp.uu.net
+release.
 The JPEG FAQ (Frequently Asked Questions) article is a useful source of
 general information about JPEG.  It is updated constantly and therefore is
 not included in this distribution.  The FAQ is posted every two weeks to
 Usenet newsgroups comp.graphics.misc, news.answers, and other groups.
-You can always obtain the latest version from the news.answers archive at
+It is available on the World Wide Web at http://www.faqs.org/faqs/jpeg-faq/
-rtfm.mit.edu.  By FTP, fetch /pub/usenet/news.answers/jpeg-faq/part1 and
+and other news.answers archive sites, including the official news.answers
-.../part2.  If you don't have FTP, send e-mail to mail-server@rtfm.mit.edu
+archive at rtfm.mit.edu: ftp://rtfm.mit.edu/pub/usenet/news.answers/jpeg-faq/.
+If you don't have Web or FTP access, send e-mail to mail-server@rtfm.mit.edu
 with body
 	send usenet/news.answers/jpeg-faq/part1
 	send usenet/news.answers/jpeg-faq/part2
@@ -315,21 +316,20 @@ some of the more popular free and shareware viewers, and tells where to
 obtain them on Internet.
 If you are on a Unix machine, we highly recommend Jef Poskanzer's free
-PBMPLUS image software, which provides many useful operations on PPM-format
+PBMPLUS software, which provides many useful operations on PPM-format image
-image files.  In particular, it can convert PPM images to and from a wide
+files.  In particular, it can convert PPM images to and from a wide range of
-range of other formats.  You can obtain this package by FTP from ftp.x.org
+other formats, thus making cjpeg/djpeg considerably more useful.  The latest
-(contrib/pbmplus*.tar.Z) or ftp.ee.lbl.gov (pbmplus*.tar.Z).  There is also
+version is distributed by the NetPBM group, and is available from numerous
-a newer update of this package called NETPBM, available from
+sites, notably ftp://wuarchive.wustl.edu/graphics/graphics/packages/NetPBM/.
-wuarchive.wustl.edu under directory /graphics/graphics/packages/NetPBM/.
+Unfortunately PBMPLUS/NETPBM is not nearly as portable as the IJG software is;
-Unfortunately PBMPLUS/NETPBM is not nearly as portable as the IJG software
+you are likely to have difficulty making it work on any non-Unix machine.
-is; you are likely to have difficulty making it work on any non-Unix machine.
 A different free JPEG implementation, written by the PVRG group at Stanford,
-is available from havefun.stanford.edu in directory pub/jpeg.  This program
+is available from ftp://havefun.stanford.edu/pub/jpeg/.  This program
 is designed for research and experimentation rather than production use;
 it is slower, harder to use, and less portable than the IJG code, but it
 is easier to read and modify.  Also, the PVRG code supports lossless JPEG,
-which we do not.
+which we do not.  (On the other hand, it doesn't do progressive JPEG.)
 FILE FORMAT WARS
@@ -370,14 +370,16 @@ use a proprietary file format!
 TO DO
 =====
+The major thrust for v7 will probably be improvement of visual quality.
+The current method for scaling the quantization tables is known not to be
+very good at low Q values.  We also intend to investigate block boundary
+smoothing, "poor man's variable quantization", and other means of improving
+quality-vs-file-size performance without sacrificing compatibility.
 In future versions, we are considering supporting some of the upcoming JPEG
 Part 3 extensions --- principally, variable quantization and the SPIFF file
 format.
-Tuning the software for better behavior at low quality/high compression
+As always, speeding things up is of great interest.
-settings is also of interest.  The current method for scaling the
-quantization tables is known not to be very good at low Q values.
-As always, speeding things up is high on our priority list.
 Please send bug reports, offers of help, etc. to jpeg-info@uunet.uu.net.
--- a/aclocal.m4
+++ b/aclocal.m4
--- a/altui/README.alt
+++ b/altui/README.alt
@@ -0,0 +1,71 @@
 Here is an alternate command-line user interface for the IJG JPEG software.
 It is designed for use under MS-DOS, and may also be useful on other non-Unix
 operating systems.  (For that matter, this code works fine on Unix, but the
 standard command-line syntax is better on Unix because it is pipe-friendly.)
 With this user interface, cjpeg and djpeg accept multiple input file names
 on the command line; output file names are generated by substituting
 appropriate extensions.  The user is prompted before any already-existing
 file will be overwritten.  See usage.alt for details.
 Expansion of wild-card file specifications is useful but is not directly
 provided by this code.  Most DOS C compilers have the ability to do wild-card
 expansion "behind the scenes", and we rely on that feature.  On other systems,
 the shell may do it for you, as is done on Unix.
 Also, a DOS-specific routine is provided to determine available memory;
 this makes the -maxmemory switch unnecessary except in unusual cases.
 If you know how to determine available memory on a different system,
 you can easily add the necessary code.  (And please send it along to
 jpeg-info@uunet.uu.net so we can include it in future releases!)
 INSTALLATION
 ============
 You need to have the main IJG JPEG distribution, release 6 or later.
 Replace the standard cjpeg.c and djpeg.c files with the ones provided here.
 Then build the software as described in the main distribution's install.doc
 file, with these exceptions:
 * Define PROGRESS_REPORT in jconfig.h if you want the percent-done display.
 * Define NO_OVERWRITE_CHECK if you *don't* want overwrite confirmation.
 * You may ignore the USE_SETMODE and TWO_FILE_COMMANDLINE symbols discussed
  in install.doc; these files do not use them.
 * As given, djpeg.c defaults to GIF output (not PPM output as in the standard
  djpeg.c).  If you want something different, modify DEFAULT_FMT.
 You may also need to do something special to enable filename wild-card
 expansion, assuming your compiler has that capability at all.
 Modify the standard usage.doc file as described in usage.alt.  (If you want
 to use the Unix-style manual pages cjpeg.1 and djpeg.1, better fix them too.)
 Here are some specific notes for popular MS-DOS compilers:
 Borland C:
  Add "-DMSDOS" to CFLAGS to enable use of the DOS memory determination code.
  Link with the standard library file WILDARGS.OBJ to get wild-card expansion.
 Microsoft C:
  Add "-DMSDOS" to CFLAGS to enable use of the DOS memory determination code.
  Link with the standard library file SETARGV.OBJ to get wild-card expansion.
  In the versions I've used, you must also add /NOE to the linker switches to
  avoid a duplicate-symbol error from including SETARGV.
 DJGPP (we recommend version 2.0 or later):
  Add "-DFREE_MEM_ESTIMATE=0" to CFLAGS.  Wild-card expansion is automatic.
 LEGAL ISSUES
 ============
 This software is copyright (C) 1991-1998, Thomas G. Lane.
 Terms of distribution and use are the same as for the free IJG JPEG software;
 see its README file for details.
 The authors make NO WARRANTY or representation, either express or implied,
 with respect to this software, its quality, accuracy, merchantability, or
 fitness for a particular purpose.  This software is provided "AS IS", and you,
 its user, assume the entire risk as to its quality and accuracy.
--- a/altui/cjpeg.c
+++ b/altui/cjpeg.c
@@ -0,0 +1,813 @@
 /*
 * alternate cjpeg.c
 *
 * Copyright (C) 1991-1998, Thomas G. Lane.
 * This file is part of the Independent JPEG Group's software.
 * For conditions of distribution and use, see the accompanying README file.
 *
 * ---------------------------------------------------------------------
 * x86 SIMD extension for IJG JPEG library
 * Copyright (C) 1999-2006, MIYASAKA Masaru.
 * This file has been modified for SIMD extension.
 * Last Modified : January 6, 2006
 * ---------------------------------------------------------------------
 *
 * This file contains an alternate user interface for the JPEG compressor.
 * One or more input files are named on the command line, and output file
 * names are created by substituting ".jpg" for the input file's extension.
 */
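The header comment above says output names are created by substituting ".jpg" for the input file's extension. A minimal sketch of that substitution follows. The helper name make_output_name and its signature are hypothetical and are not taken from this file; it assumes '/' and '\' are the only path separators.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical helper (not part of cjpeg.c): build an output name by
 * replacing the extension of `in` with `ext`, or appending `ext` if
 * `in` has no extension.  Returns 0 on success, -1 if `out` is too small.
 */
static int
make_output_name (char *out, size_t outsize, const char *in, const char *ext)
{
  const char *dot = strrchr(in, '.');
  const char *sep = strrchr(in, '/');
  const char *bsep = strrchr(in, '\\');
  size_t stem;

  /* Use whichever path separator occurs last. */
  if (bsep != NULL && (sep == NULL || bsep > sep))
    sep = bsep;
  /* A dot before the last separator belongs to a directory name. */
  if (dot != NULL && sep != NULL && dot < sep)
    dot = NULL;

  stem = (dot != NULL) ? (size_t) (dot - in) : strlen(in);
  if (stem + strlen(ext) + 1 > outsize)
    return -1;
  memcpy(out, in, stem);
  strcpy(out + stem, ext);
  return 0;
}
```

Called with "photo.ppm" and ".jpg", the sketch writes "photo.jpg" into the output buffer; a name without an extension simply gets ".jpg" appended.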
 #include "cdjpeg.h"		/* Common decls for cjpeg/djpeg applications */
 #include "jversion.h"		/* for version message */
 #ifdef USE_CCOMMAND		/* command-line reader for Macintosh */
 #ifdef __MWERKS__
 #include <SIOUX.h>              /* Metrowerks needs this */
 #include <console.h>		/* ... and this */
 #endif
 #ifdef THINK_C
 #include <console.h>		/* Think declares it here */
 #endif
 #endif
 #ifndef PATH_MAX		/* ANSI maximum-pathname-length constant */
 #define PATH_MAX 256
 #endif
 /* Create the add-on message string table. */
 #define JMESSAGE(code,string)	string ,
 static const char * const cdjpeg_message_table[] = {
 #include "cderror.h"
  NULL
 };
 /*
 * SIMD Ext: compiler-specific hacks to enable filename wild-card expansion
 */
 #ifdef _MSC_VER		/* Microsoft Visual C++ */
 /* from setargv.c (setargv.obj) */
 /* Tested under Visual C++ V6.0, Toolkit 2003, and 2005 Express Edition */
 int __cdecl _setargv(void) { int __cdecl __setargv(void); return __setargv(); }
 #endif
 #ifdef __BORLANDC__	/* Borland C++ */
 /* from wildargs.c (wildargs.obj) */
 /* Tested under Borland C++ Compiler 5.5 (win32) */
 #include <wildargs.h>
 typedef void _RTLENTRY (* _RTLENTRY _argv_expand_fnc)(char *, _PFN_ADDARG);
 _argv_expand_fnc _argv_expand_ptr = _expand_wild;
 #endif
 /*
 * Automatic determination of available memory.
 */
 static long default_maxmem;	/* saves value determined at startup, or 0 */
 #ifndef FREE_MEM_ESTIMATE	/* may be defined from command line */
 #ifdef MSDOS			/* For MS-DOS (unless flat-memory model) */
 #include <dos.h>		/* for access to intdos() call */
 LOCAL(long)
 unused_dos_memory (void)
 /* Obtain total amount of unallocated DOS memory */
 {
  union REGS regs;
  long nparas;
  regs.h.ah = 0x48;		/* DOS function Allocate Memory Block */
  regs.x.bx = 0xFFFF;		/* Ask for more memory than DOS can have */
  (void) intdos(&regs, &regs);
  /* DOS will fail and return # of paragraphs actually available in BX. */
  nparas = (unsigned int) regs.x.bx;
  /* Times 16 to convert to bytes. */
  return nparas << 4;
 }
 /* The default memory setting is 95% of the available space. */
 #define FREE_MEM_ESTIMATE  ((unused_dos_memory() * 95L) / 100L)
 #endif /* MSDOS */
 #ifdef ATARI			/* For Atari ST/STE/TT, Pure C or Turbo C */
 #include <ext.h>
 /* The default memory setting is 90% of the available space. */
 #define FREE_MEM_ESTIMATE  (((long) coreleft() * 90L) / 100L)
 #endif /* ATARI */
 /* Add memory-estimation procedures for other operating systems here,
 * with appropriate #ifdef's around them.
 */
 #endif /* !FREE_MEM_ESTIMATE */
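The arithmetic behind the MS-DOS estimate above can be checked without DOS. The sketch below assumes only what the code states: DOS reports free memory in 16-byte paragraphs, and the default is 95% of that. The helper names paragraphs_to_bytes and percent_of are hypothetical.

```c
/* Size of one DOS paragraph in bytes (the unit returned in BX). */
#define BYTES_PER_PARAGRAPH  16L

/* Hypothetical helper: convert a paragraph count to bytes.
 * Equivalent to the `nparas << 4` in unused_dos_memory().
 */
static long
paragraphs_to_bytes (unsigned int nparas)
{
  return (long) nparas * BYTES_PER_PARAGRAPH;
}

/* Hypothetical helper: integer percentage, rounding down, as the
 * FREE_MEM_ESTIMATE macros do with (x * 95L) / 100L.
 */
static long
percent_of (long bytes, long percent)
{
  return (bytes * percent) / 100L;
}
```

For example, 0xA000 free paragraphs correspond to 655360 bytes (640 KB), and 95% of that is 622592 bytes.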
 /*
 * This routine determines what format the input file is,
 * and selects the appropriate input-reading module.
 *
 * To determine which family of input formats the file belongs to,
 * we may look only at the first byte of the file, since C does not
 * guarantee that more than one character can be pushed back with ungetc.
 * Looking at additional bytes would require one of these approaches:
 *     1) assume we can fseek() the input file (fails for piped input);
 *     2) assume we can push back more than one character (works in
 *        some C implementations, but unportable);
 *     3) provide our own buffering (breaks input readers that want to use
 *        stdio directly, such as the RLE library);
 * or  4) don't put back the data, and modify the input_init methods to assume
 *        they start reading after the start of file (also breaks RLE library).
 * #1 is attractive for MS-DOS but is untenable on Unix.
 *
 * The most portable solution for file types that can't be identified by their
 * first byte is to make the user tell us what they are.  This is also the
 * only approach for "raw" file types that contain only arbitrary values.
 * We presently apply this method for Targa files.  Most of the time Targa
 * files start with 0x00, so we recognize that case.  Potentially, however,
 * a Targa file could start with any byte value (byte 0 is the length of the
 * seldom-used ID field), so we provide a switch to force Targa input mode.
 */
 static boolean is_targa;	/* records user -targa switch */
 LOCAL(cjpeg_source_ptr)
 select_file_type (j_compress_ptr cinfo, FILE * infile)
 {
  int c;
  if (is_targa) {
 #ifdef TARGA_SUPPORTED
    return jinit_read_targa(cinfo);
 #else
    ERREXIT(cinfo, JERR_TGA_NOTCOMP);
 #endif
  }
  if ((c = getc(infile)) == EOF)
    ERREXIT(cinfo, JERR_INPUT_EMPTY);
  if (ungetc(c, infile) == EOF)
    ERREXIT(cinfo, JERR_UNGETC_FAILED);
  switch (c) {
 #ifdef BMP_SUPPORTED
  case 'B':
    return jinit_read_bmp(cinfo);
 #endif
 #ifdef GIF_SUPPORTED
  case 'G':
    return jinit_read_gif(cinfo);
 #endif
 #ifdef PPM_SUPPORTED
  case 'P':
    return jinit_read_ppm(cinfo);
 #endif
 #ifdef RLE_SUPPORTED
  case 'R':
    return jinit_read_rle(cinfo);
 #endif
 #ifdef TARGA_SUPPORTED
  case 0x00:
    return jinit_read_targa(cinfo);
 #endif
  default:
    ERREXIT(cinfo, JERR_UNKNOWN_FORMAT);
    break;
  }
  return NULL;			/* suppress compiler warnings */
 }
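The one-byte lookahead used by select_file_type() can be tried in isolation. This is a minimal sketch with standard stdio only; peek_first_byte and guess_family are hypothetical names, and the family strings are for illustration rather than anything the library returns.

```c
#include <stdio.h>

/* Hypothetical helper: return the first byte of `fp` without consuming
 * it, or EOF if the stream is empty or the pushback fails.  Standard C
 * guarantees only one character of pushback, hence a single byte.
 */
static int
peek_first_byte (FILE *fp)
{
  int c = getc(fp);

  if (c == EOF)
    return EOF;
  if (ungetc(c, fp) == EOF)
    return EOF;
  return c;
}

/* Hypothetical helper: map the first byte to a format family, using
 * the same byte values as the switch in select_file_type().
 */
static const char *
guess_family (int c)
{
  switch (c) {
  case 'B':  return "BMP";
  case 'G':  return "GIF";
  case 'P':  return "PPM";
  case 'R':  return "RLE";
  case 0x00: return "Targa";
  default:   return "unknown";
  }
}
```

After peek_first_byte() returns, the next getc() on the same stream yields the same byte again, which is why the input readers can still start at the beginning of the file.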
 /*
 * Argument-parsing code.
 * The switch parser is designed to be useful with DOS-style command line
 * syntax, ie, intermixed switches and file names, where only the switches
 * to the left of a given file name affect processing of that file.
 */
 static const char * progname;	/* program name for error messages */
 static char * outfilename;	/* for -outfile switch */
 LOCAL(void)
 usage (void)
 /* complain about bad command line */
 {
  fprintf(stderr, "usage: %s [switches] inputfile(s)\n", progname);
  fprintf(stderr, "List of input files may use wildcards (* and ?)\n");
  fprintf(stderr, "Output filename is same as input filename, but extension .jpg\n");
  fprintf(stderr, "Switches (names may be abbreviated):\n");
  fprintf(stderr, "  -quality N     Compression quality (0..100; 5-95 is useful range)\n");
  fprintf(stderr, "  -grayscale     Create monochrome JPEG file\n");
 #ifdef ENTROPY_OPT_SUPPORTED
  fprintf(stderr, "  -optimize      Optimize Huffman table (smaller file, but slow compression)\n");
 #endif
 #ifdef C_PROGRESSIVE_SUPPORTED
  fprintf(stderr, "  -progressive   Create progressive JPEG file\n");
 #endif
 #ifdef TARGA_SUPPORTED
  fprintf(stderr, "  -targa         Input file is Targa format (usually not needed)\n");
 #endif
  fprintf(stderr, "Switches for advanced users:\n");
 #ifdef DCT_ISLOW_SUPPORTED
  fprintf(stderr, "  -dct int       Use integer DCT method%s\n",
 	  (JDCT_DEFAULT == JDCT_ISLOW ? " (default)" : ""));
 #endif
 #ifdef DCT_IFAST_SUPPORTED
  fprintf(stderr, "  -dct fast      Use fast integer DCT (less accurate)%s\n",
 	  (JDCT_DEFAULT == JDCT_IFAST ? " (default)" : ""));
 #endif
 #ifdef DCT_FLOAT_SUPPORTED
  fprintf(stderr, "  -dct float     Use floating-point DCT method%s\n",
 	  (JDCT_DEFAULT == JDCT_FLOAT ? " (default)" : ""));
 #endif
  fprintf(stderr, "  -restart N     Set restart interval in rows, or in blocks with B\n");
 #ifdef INPUT_SMOOTHING_SUPPORTED
  fprintf(stderr, "  -smooth N      Smooth dithered input (N=1..100 is strength)\n");
 #endif
 #ifndef FREE_MEM_ESTIMATE
  fprintf(stderr, "  -maxmemory N   Maximum memory to use (in kbytes)\n");
 #endif
  fprintf(stderr, "  -outfile name  Specify name for output file\n");
  fprintf(stderr, "  -verbose  or  -debug   Emit debug output\n");
  fprintf(stderr, "Switches for wizards:\n");
 #ifdef C_ARITH_CODING_SUPPORTED
  fprintf(stderr, "  -arithmetic    Use arithmetic coding\n");
 #endif
  fprintf(stderr, "  -baseline      Force baseline quantization tables\n");
  fprintf(stderr, "  -qtables file  Use quantization tables given in file\n");
  fprintf(stderr, "  -qslots N[,...]    Set component quantization tables\n");
  fprintf(stderr, "  -sample HxV[,...]  Set component sampling factors\n");
 #ifdef C_MULTISCAN_FILES_SUPPORTED
  fprintf(stderr, "  -scans file    Create multi-scan JPEG per script file\n");
 #endif
  exit(EXIT_FAILURE);
 }
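The usage text says switch names may be abbreviated, and the parser below calls keymatch() with a minimum length for each switch. keymatch() itself is not part of this excerpt, so the following is only a sketch of how such a matcher might work, under the assumption that the keyword is lower case; abbrev_match is a hypothetical name.

```c
#include <ctype.h>

/* Hypothetical sketch of abbreviated-keyword matching: `arg` matches if
 * it is a case-insensitive prefix of `keyword` and is at least
 * `minchars` characters long.  Returns 1 on match, 0 otherwise.
 */
static int
abbrev_match (const char *arg, const char *keyword, int minchars)
{
  int nmatched = 0;

  while (*arg != '\0') {
    if (*keyword == '\0')
      return 0;			/* arg is longer than the keyword */
    if (tolower((unsigned char) *arg) != *keyword)
      return 0;			/* mismatch */
    arg++;
    keyword++;
    nmatched++;
  }
  /* End of arg reached; reject abbreviations that are too short. */
  return nmatched >= minchars;
}
```

With this sketch, "q" and "QUAL" both match "quality" at a minimum of 1, while "no" does not match "nosimd" at a minimum of 4. The minimum length is what keeps short prefixes from selecting the wrong switch.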
 #ifndef JSIMD_MODEINFO_NOT_SUPPORTED
 LOCAL(void)
 print_simd_info (FILE * file, char * labelstr, unsigned int simd)
 {
  fprintf(file, "%s%s%s%s%s%s\n", labelstr,
 	  simd & JSIMD_MMX   ? " MMX"    : "",
 	  simd & JSIMD_3DNOW ? " 3DNow!" : "",
 	  simd & JSIMD_SSE   ? " SSE"    : "",
 	  simd & JSIMD_SSE2  ? " SSE2"   : "",
 	  simd == JSIMD_NONE ? " NONE"   : "");
 }
 #endif /* !JSIMD_MODEINFO_NOT_SUPPORTED */
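The bitmask-to-text formatting in print_simd_info() can be sketched without the library headers. The EX_SIMD_* values below are hypothetical placeholders chosen for illustration; the real JSIMD_* constants come from the SIMD extension's headers and may differ. format_simd_info is likewise a hypothetical name, and it writes to a buffer instead of a FILE so the result is easy to inspect.

```c
#include <stdio.h>

/* Hypothetical flag values, for illustration only. */
#define EX_SIMD_NONE   0x00u
#define EX_SIMD_MMX    0x01u
#define EX_SIMD_3DNOW  0x02u
#define EX_SIMD_SSE    0x04u
#define EX_SIMD_SSE2   0x08u

/* Hypothetical helper: format `label` followed by one token per flag
 * set in `simd`, or " NONE" when no flag is set.  Returns the value of
 * snprintf (the length the full string needs).
 */
static int
format_simd_info (char *buf, size_t bufsize, const char *label,
		  unsigned int simd)
{
  return snprintf(buf, bufsize, "%s%s%s%s%s%s", label,
		  (simd & EX_SIMD_MMX)   ? " MMX"    : "",
		  (simd & EX_SIMD_3DNOW) ? " 3DNow!" : "",
		  (simd & EX_SIMD_SSE)   ? " SSE"    : "",
		  (simd & EX_SIMD_SSE2)  ? " SSE2"   : "",
		  (simd == EX_SIMD_NONE) ? " NONE"   : "");
}
```

Because each flag contributes either its token or an empty string, the tokens always appear in the fixed order MMX, 3DNow!, SSE, SSE2, regardless of how the mask was built.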
 LOCAL(int)
 parse_switches (j_compress_ptr cinfo, int argc, char **argv,
 		int last_file_arg_seen, boolean for_real)
 /* Parse optional switches.
 * Returns argv[] index of first file-name argument (== argc if none).
 * Any file names with indexes <= last_file_arg_seen are ignored;
 * they have presumably been processed in a previous iteration.
 * (Pass 0 for last_file_arg_seen on the first or only iteration.)
 * for_real is FALSE on the first (dummy) pass; we may skip any expensive
 * processing.
 */
 {
  int argn;
  char * arg;
  int quality;			/* -quality parameter */
  int q_scale_factor;		/* scaling percentage for -qtables */
  boolean force_baseline;
  boolean simple_progressive;
  char * qtablefile = NULL;	/* saves -qtables filename if any */
  char * qslotsarg = NULL;	/* saves -qslots parm if any */
  char * samplearg = NULL;	/* saves -sample parm if any */
  char * scansarg = NULL;	/* saves -scans parm if any */
  /* Set up default JPEG parameters. */
  /* Note that default -quality level need not, and does not,
   * match the default scaling for an explicit -qtables argument.
   */
  quality = 75;			/* default -quality value */
  q_scale_factor = 100;		/* default to no scaling for -qtables */
  force_baseline = FALSE;	/* by default, allow 16-bit quantizers */
  simple_progressive = FALSE;
  is_targa = FALSE;
  outfilename = NULL;
  cinfo->err->trace_level = 0;
  if (default_maxmem > 0)	/* override library's default value */
    cinfo->mem->max_memory_to_use = default_maxmem;
  /* Scan command line options, adjust parameters */
  for (argn = 1; argn < argc; argn++) {
    arg = argv[argn];
    if (*arg != '-') {
      /* Not a switch, must be a file name argument */
      if (argn <= last_file_arg_seen) {
 	outfilename = NULL;	/* -outfile applies to just one input file */
 	continue;		/* ignore this name if previously processed */
      }
      break;			/* else done parsing switches */
    }
    arg++;			/* advance past switch marker character */
    if (keymatch(arg, "arithmetic", 1)) {
      /* Use arithmetic coding. */
 #ifdef C_ARITH_CODING_SUPPORTED
      cinfo->arith_code = TRUE;
 #else
      fprintf(stderr, "%s: sorry, arithmetic coding not supported\n",
 	      progname);
      exit(EXIT_FAILURE);
 #endif
    } else if (keymatch(arg, "baseline", 1)) {
      /* Force baseline-compatible output (8-bit quantizer values). */
      force_baseline = TRUE;
 #ifndef JSIMD_MASKFUNC_NOT_SUPPORTED
    } else if (keymatch(arg, "nosimd" , 4)) {
      jpeg_simd_mask((j_common_ptr) cinfo, JSIMD_NONE, JSIMD_ALL);
    } else if (keymatch(arg, "nommx"  , 3)) {
      jpeg_simd_mask((j_common_ptr) cinfo, JSIMD_NONE, JSIMD_MMX);
    } else if (keymatch(arg, "no3dnow", 3)) {
      jpeg_simd_mask((j_common_ptr) cinfo, JSIMD_NONE, JSIMD_3DNOW);
    } else if (keymatch(arg, "nosse"  , 4)) {
      jpeg_simd_mask((j_common_ptr) cinfo, JSIMD_NONE, JSIMD_SSE);
    } else if (keymatch(arg, "nosse2" , 6)) {
      jpeg_simd_mask((j_common_ptr) cinfo, JSIMD_NONE, JSIMD_SSE2);
 #endif /* !JSIMD_MASKFUNC_NOT_SUPPORTED */
    } else if (keymatch(arg, "dct", 2)) {
      /* Select DCT algorithm. */
      if (++argn >= argc)	/* advance to next argument */
 	usage();
      if (keymatch(argv[argn], "int", 1)) {
 	cinfo->dct_method = JDCT_ISLOW;
      } else if (keymatch(argv[argn], "fast", 2)) {
 	cinfo->dct_method = JDCT_IFAST;
      } else if (keymatch(argv[argn], "float", 2)) {
 	cinfo->dct_method = JDCT_FLOAT;
      } else
 	usage();
    } else if (keymatch(arg, "debug", 1) || keymatch(arg, "verbose", 1)) {
      /* Enable debug printouts. */
      /* On first -d, print version identification */
      static boolean printed_version = FALSE;
      if (! printed_version) {
 	fprintf(stderr, "Independent JPEG Group's CJPEG, version %s\n%s\n",
 		JVERSION, JCOPYRIGHT);
 	fprintf(stderr,
 		"\nx86 SIMD extension for IJG JPEG library, version %s\n\n",
 		JPEG_SIMDEXT_VER_STR);
 #ifndef JSIMD_MODEINFO_NOT_SUPPORTED
 	print_simd_info(stderr, "SIMD instructions supported by the system :",
 			jpeg_simd_support(NULL));
 	fprintf(stderr, "\n      === SIMD Operation Modes ===\n");
 #ifdef DCT_ISLOW_SUPPORTED
 	print_simd_info(stderr, "Accurate integer DCT  (-dct int)   :",
 			jpeg_simd_forward_dct(cinfo, JDCT_ISLOW));
 #endif
 #ifdef DCT_IFAST_SUPPORTED
 	print_simd_info(stderr, "Fast integer DCT      (-dct fast)  :",
 			jpeg_simd_forward_dct(cinfo, JDCT_IFAST));
 #endif
 #ifdef DCT_FLOAT_SUPPORTED
 	print_simd_info(stderr, "Floating-point DCT    (-dct float) :",
 			jpeg_simd_forward_dct(cinfo, JDCT_FLOAT));
 #endif
 	print_simd_info(stderr, "Downsampling (-sample 2x2 or 2x1)  :",
 			jpeg_simd_downsampler(cinfo));
 	print_simd_info(stderr, "Colorspace conversion (RGB->YCbCr) :",
 			jpeg_simd_color_converter(cinfo));
 	fprintf(stderr, "\n");
 #endif /* !JSIMD_MODEINFO_NOT_SUPPORTED */
 	printed_version = TRUE;
      }
      cinfo->err->trace_level++;
    } else if (keymatch(arg, "grayscale", 2) || keymatch(arg, "greyscale",2)) {
      /* Force a monochrome JPEG file to be generated. */
      jpeg_set_colorspace(cinfo, JCS_GRAYSCALE);
    } else if (keymatch(arg, "maxmemory", 3)) {
      /* Maximum memory in Kb (or Mb with 'm'). */
      long lval;
      char ch = 'x';
      if (++argn >= argc)	/* advance to next argument */
 	usage();
      if (sscanf(argv[argn], "%ld%c", &lval, &ch) < 1)
 	usage();
      if (ch == 'm' || ch == 'M')
 	lval *= 1000L;
      cinfo->mem->max_memory_to_use = lval * 1000L;
    } else if (keymatch(arg, "optimize", 1) || keymatch(arg, "optimise", 1)) {
      /* Enable entropy parm optimization. */
 #ifdef ENTROPY_OPT_SUPPORTED
      cinfo->optimize_coding = TRUE;
 #else
      fprintf(stderr, "%s: sorry, entropy optimization was not compiled\n",
 	      progname);
      exit(EXIT_FAILURE);
 #endif
    } else if (keymatch(arg, "outfile", 4)) {
      /* Set output file name. */
      if (++argn >= argc)	/* advance to next argument */
 	usage();
      outfilename = argv[argn];	/* save it away for later use */
    } else if (keymatch(arg, "progressive", 1)) {
      /* Select simple progressive mode. */
 #ifdef C_PROGRESSIVE_SUPPORTED
      simple_progressive = TRUE;
      /* We must postpone execution until num_components is known. */
 #else
      fprintf(stderr, "%s: sorry, progressive output was not compiled\n",
 	      progname);
      exit(EXIT_FAILURE);
 #endif
    } else if (keymatch(arg, "quality", 1)) {
      /* Quality factor (quantization table scaling factor). */
      if (++argn >= argc)	/* advance to next argument */
 	usage();
      if (sscanf(argv[argn], "%d", &quality) != 1)
 	usage();
      /* Change scale factor in case -qtables is present. */
      q_scale_factor = jpeg_quality_scaling(quality);
    } else if (keymatch(arg, "qslots", 2)) {
      /* Quantization table slot numbers. */
      if (++argn >= argc)	/* advance to next argument */
 	usage();
      qslotsarg = argv[argn];
      /* Must delay setting qslots until after we have processed any
       * colorspace-determining switches, since jpeg_set_colorspace sets
       * default quant table numbers.
       */
    } else if (keymatch(arg, "qtables", 2)) {
      /* Quantization tables fetched from file. */
      if (++argn >= argc)	/* advance to next argument */
 	usage();
      qtablefile = argv[argn];
      /* We postpone actually reading the file in case -quality comes later. */
    } else if (keymatch(arg, "restart", 1)) {
      /* Restart interval in MCU rows (or in MCUs with 'b'). */
      long lval;
      char ch = 'x';
      if (++argn >= argc)	/* advance to next argument */
 	usage();
      if (sscanf(argv[argn], "%ld%c", &lval, &ch) < 1)
 	usage();
      if (lval < 0 || lval > 65535L)
 	usage();
      if (ch == 'b' || ch == 'B') {
 	cinfo->restart_interval = (unsigned int) lval;
 	cinfo->restart_in_rows = 0; /* else prior '-restart n' overrides me */
      } else {
 	cinfo->restart_in_rows = (int) lval;
 	/* restart_interval will be computed during startup */
      }
    } else if (keymatch(arg, "sample", 2)) {
      /* Set sampling factors. */
      if (++argn >= argc)	/* advance to next argument */
 	usage();
      samplearg = argv[argn];
      /* Must delay setting sample factors until after we have processed any
       * colorspace-determining switches, since jpeg_set_colorspace sets
       * default sampling factors.
       */
    } else if (keymatch(arg, "scans", 2)) {
      /* Set scan script. */
 #ifdef C_MULTISCAN_FILES_SUPPORTED
      if (++argn >= argc)	/* advance to next argument */
 	usage();
      scansarg = argv[argn];
      /* We must postpone reading the file in case -progressive appears. */
 #else
      fprintf(stderr, "%s: sorry, multi-scan output was not compiled\n",
 	      progname);
      exit(EXIT_FAILURE);
 #endif
    } else if (keymatch(arg, "smooth", 2)) {
      /* Set input smoothing factor. */
      int val;
      if (++argn >= argc)	/* advance to next argument */
 	usage();
      if (sscanf(argv[argn], "%d", &val) != 1)
 	usage();
      if (val < 0 || val > 100)
 	usage();
      cinfo->smoothing_factor = val;
    } else if (keymatch(arg, "targa", 1)) {
      /* Input file is Targa format. */
      is_targa = TRUE;
    } else {
      usage();			/* bogus switch */
    }
  }
  /* Post-switch-scanning cleanup */
  if (for_real) {
    /* Set quantization tables for selected quality. */
    /* Some or all may be overridden if -qtables is present. */
    jpeg_set_quality(cinfo, quality, force_baseline);
    if (qtablefile != NULL)	/* process -qtables if it was present */
      if (! read_quant_tables(cinfo, qtablefile,
 			      q_scale_factor, force_baseline))
 	usage();
    if (qslotsarg != NULL)	/* process -qslots if it was present */
      if (! set_quant_slots(cinfo, qslotsarg))
 	usage();
    if (samplearg != NULL)	/* process -sample if it was present */
      if (! set_sample_factors(cinfo, samplearg))
 	usage();
 #ifdef C_PROGRESSIVE_SUPPORTED
    if (simple_progressive)	/* process -progressive; -scans can override */
      jpeg_simple_progression(cinfo);
 #endif
 #ifdef C_MULTISCAN_FILES_SUPPORTED
    if (scansarg != NULL)	/* process -scans if it was present */
      if (! read_scan_script(cinfo, scansarg))
 	usage();
 #endif
  }
  return argn;			/* return index of next arg (file name) */
 }
 /*
 * Check for overwrite of an existing file; clear it with user
 */
 #ifndef NO_OVERWRITE_CHECK
 LOCAL(boolean)
 is_write_ok (char * outfname)
 {
  FILE * ofile;
  int ch;
  ofile = fopen(outfname, READ_BINARY);
  if (ofile == NULL)
    return TRUE;		/* not present */
  fclose(ofile);		/* oops, it is present */
  for (;;) {
    fprintf(stderr, "%s already exists, overwrite it? [y/n] ",
 	    outfname);
    fflush(stderr);
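    ch = getc(stdin);
    if (ch == EOF)		/* no more input: refuse rather than loop forever */
      return FALSE;
    if (ch != '\n') {		/* flush rest of line */
      int c;
      while ((c = getc(stdin)) != '\n' && c != EOF)
 	/* nothing */;
    }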
    switch (ch) {
    case 'Y':
    case 'y':
      return TRUE;
    case 'N':
    case 'n':
      return FALSE;
    /* otherwise, ask again */
    }
  }
 }
 #endif
 /*
 * Process a single input file name, and return its index in argv[].
 * File names at or to left of old_file_index have been processed already.
 */
 LOCAL(int)
 process_one_file (int argc, char **argv, int old_file_index)
 {
  struct jpeg_compress_struct cinfo;
  struct jpeg_error_mgr jerr;
  char *infilename;
  char workfilename[PATH_MAX];
 #ifdef PROGRESS_REPORT
  struct cdjpeg_progress_mgr progress;
 #endif
  int file_index;
  cjpeg_source_ptr src_mgr;
  FILE * input_file = NULL;
  FILE * output_file = NULL;
  JDIMENSION num_scanlines;
  /* Initialize the JPEG compression object with default error handling. */
  cinfo.err = jpeg_std_error(&jerr);
  jpeg_create_compress(&cinfo);
  /* Add some application-specific error messages (from cderror.h) */
  jerr.addon_message_table = cdjpeg_message_table;
  jerr.first_addon_message = JMSG_FIRSTADDONCODE;
  jerr.last_addon_message = JMSG_LASTADDONCODE;
  /* Now safe to enable signal catcher. */
 #ifdef NEED_SIGNAL_CATCHER
  enable_signal_catcher((j_common_ptr) &cinfo);
 #endif
  /* Initialize JPEG parameters.
   * Much of this may be overridden later.
   * In particular, we don't yet know the input file's color space,
   * but we need to provide some value for jpeg_set_defaults() to work.
   */
  cinfo.in_color_space = JCS_RGB; /* arbitrary guess */
  jpeg_set_defaults(&cinfo);
  /* Scan command line to find next file name.
   * It is convenient to use just one switch-parsing routine, but the switch
   * values read here are ignored; we will rescan the switches after opening
   * the input file.
   */
  file_index = parse_switches(&cinfo, argc, argv, old_file_index, FALSE);
  if (file_index >= argc) {
    fprintf(stderr, "%s: missing input file name\n", progname);
    usage();
  }
  /* Open the input file. */
  infilename = argv[file_index];
  if ((input_file = fopen(infilename, READ_BINARY)) == NULL) {
    fprintf(stderr, "%s: can't open %s\n", progname, infilename);
    goto fail;
  }
 #ifdef PROGRESS_REPORT
  start_progress_monitor((j_common_ptr) &cinfo, &progress);
 #endif
  /* Figure out the input file format, and set up to read it. */
  src_mgr = select_file_type(&cinfo, input_file);
  src_mgr->input_file = input_file;
  /* Read the input file header to obtain file size & colorspace. */
  (*src_mgr->start_input) (&cinfo, src_mgr);
  /* Now that we know input colorspace, fix colorspace-dependent defaults */
  jpeg_default_colorspace(&cinfo);
  /* Adjust default compression parameters by re-parsing the options */
  file_index = parse_switches(&cinfo, argc, argv, old_file_index, TRUE);
  /* If user didn't supply -outfile switch, select output file name. */
  if (outfilename == NULL) {
    int i;
    outfilename = workfilename;
    /* Make outfilename be infilename with .jpg substituted for extension */
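    /* workfilename must hold the input name plus the new extension */
    if (strlen(infilename) + strlen(".jpg") >= sizeof(workfilename)) {
      fprintf(stderr, "%s: file name %s is too long\n", progname, infilename);
      goto fail;
    }
    strcpy(outfilename, infilename);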
    for (i = strlen(outfilename)-1; i >= 0; i--) {
      switch (outfilename[i]) {
      case ':':
      case '/':
      case '\\':
 	i = 0;			/* stop scanning */
 	break;
      case '.':
 	outfilename[i] = '\0';	/* lop off existing extension */
 	i = 0;			/* stop scanning */
 	break;
      default:
 	break;			/* keep scanning */
      }
    }
    strcat(outfilename, ".jpg");
  }
  fprintf(stderr, "Compressing %s => %s\n", infilename, outfilename);
 #ifndef NO_OVERWRITE_CHECK
  if (! is_write_ok(outfilename))
    goto fail;
 #endif
  /* Open the output file. */
  if ((output_file = fopen(outfilename, WRITE_BINARY)) == NULL) {
    fprintf(stderr, "%s: can't create %s\n", progname, outfilename);
    goto fail;
  }
  /* Specify data destination for compression */
  jpeg_stdio_dest(&cinfo, output_file);
  /* Start compressor */
  jpeg_start_compress(&cinfo, TRUE);
  /* Process data */
  while (cinfo.next_scanline < cinfo.image_height) {
    num_scanlines = (*src_mgr->get_pixel_rows) (&cinfo, src_mgr);
    (void) jpeg_write_scanlines(&cinfo, src_mgr->buffer, num_scanlines);
  }
  /* Finish compression and release memory */
  (*src_mgr->finish_input) (&cinfo, src_mgr);
  jpeg_finish_compress(&cinfo);
  /* Clean up and exit */
 fail:
  jpeg_destroy_compress(&cinfo);
  if (input_file != NULL) fclose(input_file);
  if (output_file != NULL) fclose(output_file);
 #ifdef PROGRESS_REPORT
  end_progress_monitor((j_common_ptr) &cinfo);
 #endif
  /* Disable signal catcher. */
 #ifdef NEED_SIGNAL_CATCHER
  enable_signal_catcher((j_common_ptr) NULL);
 #endif
  return file_index;
 }
 /*
 * The main program.
 */
 int
 main (int argc, char **argv)
 {
  int file_index;
  /* On Mac, fetch a command line. */
 #ifdef USE_CCOMMAND
  argc = ccommand(&argv);
 #endif
 #ifdef MSDOS
  progname = "cjpeg";		/* DOS tends to be too verbose about argv[0] */
 #else
  progname = argv[0];
  if (progname == NULL || progname[0] == 0)
    progname = "cjpeg";		/* in case C library doesn't provide it */
 #endif
  /* The default maxmem must be computed only once at program startup,
   * since releasing memory with free() won't give it back to the OS.
   */
 #ifdef FREE_MEM_ESTIMATE
  default_maxmem = FREE_MEM_ESTIMATE;
 #else
  default_maxmem = 0;
 #endif
  /* Scan command line, parse switches and locate input file names */
  if (argc < 2)
    usage();			/* nothing on the command line?? */
  file_index = 0;
  while (file_index < argc-1)
    file_index = process_one_file(argc, argv, file_index);
  /* All done. */
  exit(EXIT_SUCCESS);
  return 0;			/* suppress no-return-value warnings */
 }
--- a/altui/djpeg.c
+++ b/altui/djpeg.c
@@ -0,0 +1,836 @@
 /*
 * alternate djpeg.c
 *
 * Copyright (C) 1991-1997, Thomas G. Lane.
 * This file is part of the Independent JPEG Group's software.
 * For conditions of distribution and use, see the accompanying README file.
 *
 * ---------------------------------------------------------------------
 * x86 SIMD extension for IJG JPEG library
 * Copyright (C) 1999-2006, MIYASAKA Masaru.
 * This file has been modified for SIMD extension.
 * Last Modified : January 6, 2006
 * ---------------------------------------------------------------------
 *
 * This file contains an alternate user interface for the JPEG decompressor.
 * One or more input files are named on the command line, and output file
 * names are created by substituting an appropriate extension.
 */
 #include "cdjpeg.h"		/* Common decls for cjpeg/djpeg applications */
 #include "jversion.h"		/* for version message */
 #include <ctype.h>		/* to declare isprint() */
 #ifdef USE_CCOMMAND		/* command-line reader for Macintosh */
 #ifdef __MWERKS__
 #include <SIOUX.h>              /* Metrowerks needs this */
 #include <console.h>		/* ... and this */
 #endif
 #ifdef THINK_C
 #include <console.h>		/* Think declares it here */
 #endif
 #endif
 #ifndef PATH_MAX		/* ANSI maximum-pathname-length constant */
 #define PATH_MAX 256
 #endif
 /* Create the add-on message string table. */
 #define JMESSAGE(code,string)	string ,
 static const char * const cdjpeg_message_table[] = {
 #include "cderror.h"
  NULL
 };
 /*
 * SIMD Ext: compiler-specific hacks to enable filename wild-card expansion
 */
 #ifdef _MSC_VER		/* Microsoft Visual C++ */
 /* from setargv.c (setargv.obj) */
 /* Tested under Visual C++ V6.0, Toolkit 2003, and 2005 Express Edition */
 int __cdecl _setargv(void) { int __cdecl __setargv(void); return __setargv(); }
 #endif
 #ifdef __BORLANDC__	/* Borland C++ */
 /* from wildargs.c (wildargs.obj) */
 /* Tested under Borland C++ Compiler 5.5 (win32) */
 #include <wildargs.h>
 typedef void _RTLENTRY (* _RTLENTRY _argv_expand_fnc)(char *, _PFN_ADDARG);
 _argv_expand_fnc _argv_expand_ptr = _expand_wild;
 #endif
 /*
 * Automatic determination of available memory.
 */
 static long default_maxmem;	/* saves value determined at startup, or 0 */
 #ifndef FREE_MEM_ESTIMATE	/* may be defined from command line */
 #ifdef MSDOS			/* For MS-DOS (unless flat-memory model) */
 #include <dos.h>		/* for access to intdos() call */
 LOCAL(long)
 unused_dos_memory (void)
 /* Obtain total amount of unallocated DOS memory */
 {
  union REGS regs;
  long nparas;
  regs.h.ah = 0x48;		/* DOS function Allocate Memory Block */
  regs.x.bx = 0xFFFF;		/* Ask for more memory than DOS can have */
  (void) intdos(&regs, &regs);
  /* DOS will fail and return # of paragraphs actually available in BX. */
  nparas = (unsigned int) regs.x.bx;
  /* Times 16 to convert to bytes. */
  return nparas << 4;
 }
 /* The default memory setting is 95% of the available space. */
 #define FREE_MEM_ESTIMATE  ((unused_dos_memory() * 95L) / 100L)
 #endif /* MSDOS */
 #ifdef ATARI			/* For Atari ST/STE/TT, Pure C or Turbo C */
 #include <ext.h>
 /* The default memory setting is 90% of the available space. */
 #define FREE_MEM_ESTIMATE  (((long) coreleft() * 90L) / 100L)
 #endif /* ATARI */
 /* Add memory-estimation procedures for other operating systems here,
 * with appropriate #ifdef's around them.
 */
 #endif /* !FREE_MEM_ESTIMATE */
 /*
 * This list defines the known output image formats
 * (not all of which need be supported by a given version).
 * You can change the default output format by defining DEFAULT_FMT;
 * indeed, you had better do so if you undefine GIF_SUPPORTED.
 */
 typedef enum {
 	FMT_BMP,		/* BMP format (Windows flavor) */
 	FMT_GIF,		/* GIF format */
 	FMT_OS2,		/* BMP format (OS/2 flavor) */
 	FMT_PPM,		/* PPM/PGM (PBMPLUS formats) */
 	FMT_RLE,		/* RLE format */
 	FMT_TARGA,		/* Targa format */
 	FMT_TIFF		/* TIFF format */
 } IMAGE_FORMATS;
 #ifndef DEFAULT_FMT		/* so can override from CFLAGS in Makefile */
 #define DEFAULT_FMT	FMT_GIF
 #endif
 static IMAGE_FORMATS requested_fmt;
 /*
 * Argument-parsing code.
 * The switch parser is designed to be useful with DOS-style command line
 * syntax, ie, intermixed switches and file names, where only the switches
 * to the left of a given file name affect processing of that file.
 */
 static const char * progname;	/* program name for error messages */
 static char * outfilename;	/* for -outfile switch */
 LOCAL(void)
 usage (void)
 /* complain about bad command line */
 {
  fprintf(stderr, "usage: %s [switches] inputfile(s)\n", progname);
  fprintf(stderr, "List of input files may use wildcards (* and ?)\n");
  fprintf(stderr, "Output filename is same as input filename except for extension\n");
  fprintf(stderr, "Switches (names may be abbreviated):\n");
  fprintf(stderr, "  -colors N      Reduce image to no more than N colors\n");
  fprintf(stderr, "  -fast          Fast, low-quality processing\n");
  fprintf(stderr, "  -grayscale     Force grayscale output\n");
 #ifdef IDCT_SCALING_SUPPORTED
  fprintf(stderr, "  -scale M/N     Scale output image by fraction M/N, eg, 1/8\n");
 #endif
 #ifdef BMP_SUPPORTED
  fprintf(stderr, "  -bmp           Select BMP output format (Windows style)%s\n",
 	  (DEFAULT_FMT == FMT_BMP ? " (default)" : ""));
 #endif
 #ifdef GIF_SUPPORTED
  fprintf(stderr, "  -gif           Select GIF output format%s\n",
 	  (DEFAULT_FMT == FMT_GIF ? " (default)" : ""));
 #endif
 #ifdef BMP_SUPPORTED
  fprintf(stderr, "  -os2           Select BMP output format (OS/2 style)%s\n",
 	  (DEFAULT_FMT == FMT_OS2 ? " (default)" : ""));
 #endif
 #ifdef PPM_SUPPORTED
  fprintf(stderr, "  -pnm           Select PBMPLUS (PPM/PGM) output format%s\n",
 	  (DEFAULT_FMT == FMT_PPM ? " (default)" : ""));
 #endif
 #ifdef RLE_SUPPORTED
  fprintf(stderr, "  -rle           Select Utah RLE output format%s\n",
 	  (DEFAULT_FMT == FMT_RLE ? " (default)" : ""));
 #endif
 #ifdef TARGA_SUPPORTED
  fprintf(stderr, "  -targa         Select Targa output format%s\n",
 	  (DEFAULT_FMT == FMT_TARGA ? " (default)" : ""));
 #endif
  fprintf(stderr, "Switches for advanced users:\n");
 #ifdef DCT_ISLOW_SUPPORTED
  fprintf(stderr, "  -dct int       Use integer DCT method%s\n",
 	  (JDCT_DEFAULT == JDCT_ISLOW ? " (default)" : ""));
 #endif
 #ifdef DCT_IFAST_SUPPORTED
  fprintf(stderr, "  -dct fast      Use fast integer DCT (less accurate)%s\n",
 	  (JDCT_DEFAULT == JDCT_IFAST ? " (default)" : ""));
 #endif
 #ifdef DCT_FLOAT_SUPPORTED
  fprintf(stderr, "  -dct float     Use floating-point DCT method%s\n",
 	  (JDCT_DEFAULT == JDCT_FLOAT ? " (default)" : ""));
 #endif
  fprintf(stderr, "  -dither fs     Use F-S dithering (default)\n");
  fprintf(stderr, "  -dither none   Don't use dithering in quantization\n");
  fprintf(stderr, "  -dither ordered  Use ordered dither (medium speed, quality)\n");
 #ifdef QUANT_2PASS_SUPPORTED
  fprintf(stderr, "  -map FILE      Map to colors used in named image file\n");
 #endif
  fprintf(stderr, "  -nosmooth      Don't use high-quality upsampling\n");
 #ifdef QUANT_1PASS_SUPPORTED
  fprintf(stderr, "  -onepass       Use 1-pass quantization (fast, low quality)\n");
 #endif
 #ifndef FREE_MEM_ESTIMATE
  fprintf(stderr, "  -maxmemory N   Maximum memory to use (in kbytes)\n");
 #endif
  fprintf(stderr, "  -outfile name  Specify name for output file\n");
  fprintf(stderr, "  -verbose  or  -debug   Emit debug output\n");
  exit(EXIT_FAILURE);
 }
 #ifndef JSIMD_MODEINFO_NOT_SUPPORTED
 LOCAL(void)
 print_simd_info (FILE * file, const char * labelstr, unsigned int simd)
 {
  fprintf(file, "%s%s%s%s%s%s\n", labelstr,
 	  simd & JSIMD_MMX   ? " MMX"    : "",
 	  simd & JSIMD_3DNOW ? " 3DNow!" : "",
 	  simd & JSIMD_SSE   ? " SSE"    : "",
 	  simd & JSIMD_SSE2  ? " SSE2"   : "",
 	  simd == JSIMD_NONE ? " NONE"   : "");
 }
 #endif /* !JSIMD_MODEINFO_NOT_SUPPORTED */
 LOCAL(int)
 parse_switches (j_decompress_ptr cinfo, int argc, char **argv,
 		int last_file_arg_seen, boolean for_real)
 /* Parse optional switches.
 * Returns argv[] index of first file-name argument (== argc if none).
 * Any file names with indexes <= last_file_arg_seen are ignored;
 * they have presumably been processed in a previous iteration.
 * (Pass 0 for last_file_arg_seen on the first or only iteration.)
 * for_real is FALSE on the first (dummy) pass; we may skip any expensive
 * processing.
 */
 {
  int argn;
  char * arg;
  /* Set up default JPEG parameters. */
  requested_fmt = DEFAULT_FMT;	/* set default output file format */
  outfilename = NULL;
  cinfo->err->trace_level = 0;
  if (default_maxmem > 0)	/* override library's default value */
    cinfo->mem->max_memory_to_use = default_maxmem;
  /* Scan command line options, adjust parameters */
  for (argn = 1; argn < argc; argn++) {
    arg = argv[argn];
    if (*arg != '-') {
      /* Not a switch, must be a file name argument */
      if (argn <= last_file_arg_seen) {
 	outfilename = NULL;	/* -outfile applies to just one input file */
 	continue;		/* ignore this name if previously processed */
      }
      break;			/* else done parsing switches */
    }
    arg++;			/* advance past switch marker character */
    if (keymatch(arg, "bmp", 1)) {
      /* BMP output format. */
      requested_fmt = FMT_BMP;
    } else if (keymatch(arg, "colors", 1) || keymatch(arg, "colours", 1) ||
 	       keymatch(arg, "quantize", 1) || keymatch(arg, "quantise", 1)) {
      /* Do color quantization. */
      int val;
      if (++argn >= argc)	/* advance to next argument */
 	usage();
      if (sscanf(argv[argn], "%d", &val) != 1)
 	usage();
      cinfo->desired_number_of_colors = val;
      cinfo->quantize_colors = TRUE;
 #ifndef JSIMD_MASKFUNC_NOT_SUPPORTED
    } else if (keymatch(arg, "nosimd" , 4)) {
      jpeg_simd_mask((j_common_ptr) cinfo, JSIMD_NONE, JSIMD_ALL);
    } else if (keymatch(arg, "nommx"  , 3)) {
      jpeg_simd_mask((j_common_ptr) cinfo, JSIMD_NONE, JSIMD_MMX);
    } else if (keymatch(arg, "no3dnow", 3)) {
      jpeg_simd_mask((j_common_ptr) cinfo, JSIMD_NONE, JSIMD_3DNOW);
    } else if (keymatch(arg, "nosse"  , 4)) {
      jpeg_simd_mask((j_common_ptr) cinfo, JSIMD_NONE, JSIMD_SSE);
    } else if (keymatch(arg, "nosse2" , 6)) {
      jpeg_simd_mask((j_common_ptr) cinfo, JSIMD_NONE, JSIMD_SSE2);
 #endif /* !JSIMD_MASKFUNC_NOT_SUPPORTED */
    } else if (keymatch(arg, "dct", 2)) {
      /* Select IDCT algorithm. */
      if (++argn >= argc)	/* advance to next argument */
 	usage();
      if (keymatch(argv[argn], "int", 1)) {
 	cinfo->dct_method = JDCT_ISLOW;
      } else if (keymatch(argv[argn], "fast", 2)) {
 	cinfo->dct_method = JDCT_IFAST;
      } else if (keymatch(argv[argn], "float", 2)) {
 	cinfo->dct_method = JDCT_FLOAT;
      } else
 	usage();
    } else if (keymatch(arg, "dither", 2)) {
      /* Select dithering algorithm. */
      if (++argn >= argc)	/* advance to next argument */
 	usage();
      if (keymatch(argv[argn], "fs", 2)) {
 	cinfo->dither_mode = JDITHER_FS;
      } else if (keymatch(argv[argn], "none", 2)) {
 	cinfo->dither_mode = JDITHER_NONE;
      } else if (keymatch(argv[argn], "ordered", 2)) {
 	cinfo->dither_mode = JDITHER_ORDERED;
      } else
 	usage();
    } else if (keymatch(arg, "debug", 1) || keymatch(arg, "verbose", 1)) {
      /* Enable debug printouts. */
      /* On first -d, print version identification */
      static boolean printed_version = FALSE;
      if (! printed_version) {
 	fprintf(stderr, "Independent JPEG Group's DJPEG, version %s\n%s\n",
 		JVERSION, JCOPYRIGHT);
 	fprintf(stderr,
 		"\nx86 SIMD extension for IJG JPEG library, version %s\n\n",
 		JPEG_SIMDEXT_VER_STR);
 #ifndef JSIMD_MODEINFO_NOT_SUPPORTED
 	print_simd_info(stderr, "SIMD instructions supported by the system :",
 			jpeg_simd_support(NULL));
 	fprintf(stderr, "\n      === SIMD Operation Modes ===\n");
 #ifdef DCT_ISLOW_SUPPORTED
 	print_simd_info(stderr, "Accurate integer DCT  (-dct int)   :",
 			jpeg_simd_inverse_dct(cinfo, JDCT_ISLOW));
 #endif
 #ifdef DCT_IFAST_SUPPORTED
 	print_simd_info(stderr, "Fast integer DCT      (-dct fast)  :",
 			jpeg_simd_inverse_dct(cinfo, JDCT_IFAST));
 #endif
 #ifdef DCT_FLOAT_SUPPORTED
 	print_simd_info(stderr, "Floating-point DCT    (-dct float) :",
 			jpeg_simd_inverse_dct(cinfo, JDCT_FLOAT));
 #endif
 #ifdef IDCT_SCALING_SUPPORTED
 	print_simd_info(stderr, "Reduced-size DCT      (-scale M/N) :",
 			jpeg_simd_inverse_dct(cinfo, JDCT_FLOAT+1));
 #endif
 	print_simd_info(stderr, "High-quality upsampling (default)  :",
 			jpeg_simd_upsampler(cinfo, TRUE));
 	print_simd_info(stderr, "Low-quality upsampling (-nosmooth) :",
 			jpeg_simd_upsampler(cinfo, FALSE));
 	print_simd_info(stderr, "Colorspace conversion (YCbCr->RGB) :",
 			jpeg_simd_color_deconverter(cinfo));
 	fprintf(stderr, "\n");
 #endif /* !JSIMD_MODEINFO_NOT_SUPPORTED */
 	printed_version = TRUE;
      }
      cinfo->err->trace_level++;
    } else if (keymatch(arg, "fast", 1)) {
      /* Select recommended processing options for quick-and-dirty output. */
      cinfo->two_pass_quantize = FALSE;
      cinfo->dither_mode = JDITHER_ORDERED;
      if (! cinfo->quantize_colors) /* don't override an earlier -colors */
 	cinfo->desired_number_of_colors = 216;
      cinfo->dct_method = JDCT_FASTEST;
      cinfo->do_fancy_upsampling = FALSE;
    } else if (keymatch(arg, "gif", 1)) {
      /* GIF output format. */
      requested_fmt = FMT_GIF;
    } else if (keymatch(arg, "grayscale", 2) || keymatch(arg, "greyscale",2)) {
      /* Force monochrome output. */
      cinfo->out_color_space = JCS_GRAYSCALE;
    } else if (keymatch(arg, "map", 3)) {
      /* Quantize to a color map taken from an input file. */
      if (++argn >= argc)	/* advance to next argument */
 	usage();
      if (for_real) {		/* too expensive to do twice! */
 #ifdef QUANT_2PASS_SUPPORTED	/* otherwise can't quantize to supplied map */
 	FILE * mapfile;
 	if ((mapfile = fopen(argv[argn], READ_BINARY)) == NULL) {
 	  fprintf(stderr, "%s: can't open %s\n", progname, argv[argn]);
 	  exit(EXIT_FAILURE);
 	}
 	read_color_map(cinfo, mapfile);
 	fclose(mapfile);
 	cinfo->quantize_colors = TRUE;
 #else
 	ERREXIT(cinfo, JERR_NOT_COMPILED);
 #endif
      }
    } else if (keymatch(arg, "maxmemory", 3)) {
      /* Maximum memory in Kb (or Mb with 'm'). */
      long lval;
      char ch = 'x';
      if (++argn >= argc)	/* advance to next argument */
 	usage();
      if (sscanf(argv[argn], "%ld%c", &lval, &ch) < 1)
 	usage();
      if (ch == 'm' || ch == 'M')
 	lval *= 1000L;
      cinfo->mem->max_memory_to_use = lval * 1000L;
    } else if (keymatch(arg, "nosmooth", 3)) {
      /* Suppress fancy upsampling */
      cinfo->do_fancy_upsampling = FALSE;
    } else if (keymatch(arg, "onepass", 3)) {
      /* Use fast one-pass quantization. */
      cinfo->two_pass_quantize = FALSE;
    } else if (keymatch(arg, "os2", 3)) {
      /* BMP output format (OS/2 flavor). */
      requested_fmt = FMT_OS2;
    } else if (keymatch(arg, "outfile", 4)) {
      /* Set output file name. */
      if (++argn >= argc)	/* advance to next argument */
 	usage();
      outfilename = argv[argn];	/* save it away for later use */
    } else if (keymatch(arg, "pnm", 1) || keymatch(arg, "ppm", 1)) {
      /* PPM/PGM output format. */
      requested_fmt = FMT_PPM;
    } else if (keymatch(arg, "rle", 1)) {
      /* RLE output format. */
      requested_fmt = FMT_RLE;
    } else if (keymatch(arg, "scale", 1)) {
      /* Scale the output image by a fraction M/N. */
      if (++argn >= argc)	/* advance to next argument */
 	usage();
      /* scale_num and scale_denom are unsigned int, so read them with %u */
      if (sscanf(argv[argn], "%u/%u",
 		 &cinfo->scale_num, &cinfo->scale_denom) != 2)
 	usage();
    } else if (keymatch(arg, "targa", 1)) {
      /* Targa output format. */
      requested_fmt = FMT_TARGA;
    } else {
      usage();			/* bogus switch */
    }
  }
  return argn;			/* return index of next arg (file name) */
 }
 /*
 * Marker processor for COM and interesting APPn markers.
 * This replaces the library's built-in processor, which just skips the marker.
 * We want to print out the marker as text, to the extent possible.
 * Note this code relies on a non-suspending data source.
 */
 LOCAL(unsigned int)
 jpeg_getc (j_decompress_ptr cinfo)
 /* Read next byte */
 {
  struct jpeg_source_mgr * datasrc = cinfo->src;
  if (datasrc->bytes_in_buffer == 0) {
    if (! (*datasrc->fill_input_buffer) (cinfo))
      ERREXIT(cinfo, JERR_CANT_SUSPEND);
  }
  datasrc->bytes_in_buffer--;
  return GETJOCTET(*datasrc->next_input_byte++);
 }
 METHODDEF(boolean)
 print_text_marker (j_decompress_ptr cinfo)
 {
  boolean traceit = (cinfo->err->trace_level >= 1);
  INT32 length;
  unsigned int ch;
  unsigned int lastch = 0;
  length = jpeg_getc(cinfo) << 8;
  length += jpeg_getc(cinfo);
  length -= 2;			/* discount the length word itself */
  if (traceit) {
    if (cinfo->unread_marker == JPEG_COM)
      fprintf(stderr, "Comment, length %ld:\n", (long) length);
    else			/* assume it is an APPn otherwise */
      fprintf(stderr, "APP%d, length %ld:\n",
 	      cinfo->unread_marker - JPEG_APP0, (long) length);
  }
  while (--length >= 0) {
    ch = jpeg_getc(cinfo);
    if (traceit) {
      /* Emit the character in a readable form.
       * Nonprintables are converted to \nnn form,
       * while \ is converted to \\.
       * Newlines in CR, CR/LF, or LF form will be printed as one newline.
       */
      if (ch == '\r') {
 	fprintf(stderr, "\n");
      } else if (ch == '\n') {
 	if (lastch != '\r')
 	  fprintf(stderr, "\n");
      } else if (ch == '\\') {
 	fprintf(stderr, "\\\\");
      } else if (isprint(ch)) {
 	putc(ch, stderr);
      } else {
 	fprintf(stderr, "\\%03o", ch);
      }
      lastch = ch;
    }
  }
  if (traceit)
    fprintf(stderr, "\n");
  return TRUE;
 }
 /*
 * Check for overwrite of an existing file; clear it with user
 */
 #ifndef NO_OVERWRITE_CHECK
 LOCAL(boolean)
 is_write_ok (char * outfname)
 {
  FILE * ofile;
  int ch;
  ofile = fopen(outfname, READ_BINARY);
  if (ofile == NULL)
    return TRUE;		/* not present */
  fclose(ofile);		/* oops, it is present */
  for (;;) {
    fprintf(stderr, "%s already exists, overwrite it? [y/n] ",
 	    outfname);
    fflush(stderr);
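    ch = getc(stdin);
    if (ch == EOF)		/* no more input: refuse rather than loop forever */
      return FALSE;
    if (ch != '\n') {		/* flush rest of line */
      int c;
      while ((c = getc(stdin)) != '\n' && c != EOF)
 	/* nothing */;
    }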
    switch (ch) {
    case 'Y':
    case 'y':
      return TRUE;
    case 'N':
    case 'n':
      return FALSE;
    /* otherwise, ask again */
    }
  }
 }
 #endif
 /*
 * Process a single input file name, and return its index in argv[].
 * File names at or to left of old_file_index have been processed already.
 */
 LOCAL(int)
 process_one_file (int argc, char **argv, int old_file_index)
 {
  struct jpeg_decompress_struct cinfo;
  struct jpeg_error_mgr jerr;
  char *infilename;
  char workfilename[PATH_MAX];
  const char *default_extension = NULL;
 #ifdef PROGRESS_REPORT
  struct cdjpeg_progress_mgr progress;
 #endif
  int file_index;
  djpeg_dest_ptr dest_mgr = NULL;
  FILE * input_file = NULL;
  FILE * output_file = NULL;
  JDIMENSION num_scanlines;
  /* Initialize the JPEG decompression object with default error handling. */
  cinfo.err = jpeg_std_error(&jerr);
  jpeg_create_decompress(&cinfo);
  /* Add some application-specific error messages (from cderror.h) */
  jerr.addon_message_table = cdjpeg_message_table;
  jerr.first_addon_message = JMSG_FIRSTADDONCODE;
  jerr.last_addon_message = JMSG_LASTADDONCODE;
  /* Insert custom marker processor for COM and APP12.
   * APP12 is used by some digital camera makers for textual info,
   * so we provide the ability to display it as text.
   * If you like, additional APPn marker types can be selected for display,
   * but don't try to override APP0 or APP14 this way (see libjpeg.doc).
   */
  jpeg_set_marker_processor(&cinfo, JPEG_COM, print_text_marker);
  jpeg_set_marker_processor(&cinfo, JPEG_APP0+12, print_text_marker);
  /* Now safe to enable signal catcher. */
 #ifdef NEED_SIGNAL_CATCHER
  enable_signal_catcher((j_common_ptr) &cinfo);
 #endif
  /* Scan command line to find next file name.
   * It is convenient to use just one switch-parsing routine, but the switch
   * values read here are ignored; we will rescan the switches after opening
   * the input file.
   * (Exception: tracing level set here controls verbosity for COM markers
   * found during jpeg_read_header...)
   */
  file_index = parse_switches(&cinfo, argc, argv, old_file_index, FALSE);
  if (file_index >= argc) {
    fprintf(stderr, "%s: missing input file name\n", progname);
    usage();
  }
  /* Open the input file. */
  infilename = argv[file_index];
  if ((input_file = fopen(infilename, READ_BINARY)) == NULL) {
    fprintf(stderr, "%s: can't open %s\n", progname, infilename);
    goto fail;
  }
 #ifdef PROGRESS_REPORT
  start_progress_monitor((j_common_ptr) &cinfo, &progress);
 #endif
  /* Specify data source for decompression */
  jpeg_stdio_src(&cinfo, input_file);
  /* Read file header, set default decompression parameters */
  (void) jpeg_read_header(&cinfo, TRUE);
  /* Adjust default decompression parameters by re-parsing the options */
  file_index = parse_switches(&cinfo, argc, argv, old_file_index, TRUE);
  /* Initialize the output module now to let it override any crucial
   * option settings (for instance, GIF wants to force color quantization).
   */
  switch (requested_fmt) {
 #ifdef BMP_SUPPORTED
  case FMT_BMP:
    dest_mgr = jinit_write_bmp(&cinfo, FALSE);
    default_extension = ".bmp";
    break;
  case FMT_OS2:
    dest_mgr = jinit_write_bmp(&cinfo, TRUE);
    default_extension = ".bmp";
    break;
 #endif
 #ifdef GIF_SUPPORTED
  case FMT_GIF:
    dest_mgr = jinit_write_gif(&cinfo);
    default_extension = ".gif";
    break;
 #endif
 #ifdef PPM_SUPPORTED
  case FMT_PPM:
    dest_mgr = jinit_write_ppm(&cinfo);
    default_extension = ".ppm";
    break;
 #endif
 #ifdef RLE_SUPPORTED
  case FMT_RLE:
    dest_mgr = jinit_write_rle(&cinfo);
    default_extension = ".rle";
    break;
 #endif
 #ifdef TARGA_SUPPORTED
  case FMT_TARGA:
    dest_mgr = jinit_write_targa(&cinfo);
    default_extension = ".tga";
    break;
 #endif
  default:
    ERREXIT(&cinfo, JERR_UNSUPPORTED_FORMAT);
    break;
  }
  /* If user didn't supply -outfile switch, select output file name. */
  if (outfilename == NULL) {
    int i;
    outfilename = workfilename;
    /* Make outfilename be infilename with appropriate extension */
    strcpy(outfilename, infilename);
    for (i = strlen(outfilename)-1; i >= 0; i--) {
      switch (outfilename[i]) {
      case ':':
      case '/':
      case '\\':
 	i = 0;			/* stop scanning */
 	break;
      case '.':
 	outfilename[i] = '\0';	/* lop off existing extension */
 	i = 0;			/* stop scanning */
 	break;
      default:
 	break;			/* keep scanning */
      }
    }
    strcat(outfilename, default_extension);
  }
  fprintf(stderr, "Decompressing %s => %s\n", infilename, outfilename);
 #ifndef NO_OVERWRITE_CHECK
  if (! is_write_ok(outfilename))
    goto fail;
 #endif
  /* Open the output file. */
  if ((output_file = fopen(outfilename, WRITE_BINARY)) == NULL) {
    fprintf(stderr, "%s: can't create %s\n", progname, outfilename);
    goto fail;
  }
  dest_mgr->output_file = output_file;
  /* Start decompressor */
  (void) jpeg_start_decompress(&cinfo);
  /* Write output file header */
  (*dest_mgr->start_output) (&cinfo, dest_mgr);
  /* Process data */
  while (cinfo.output_scanline < cinfo.output_height) {
    num_scanlines = jpeg_read_scanlines(&cinfo, dest_mgr->buffer,
 					dest_mgr->buffer_height);
    (*dest_mgr->put_pixel_rows) (&cinfo, dest_mgr, num_scanlines);
  }
 #ifdef PROGRESS_REPORT
  /* Hack: count final pass as done in case finish_output does an extra pass.
   * The library won't have updated completed_passes.
   */
  progress.pub.completed_passes = progress.pub.total_passes;
 #endif
  /* Finish decompression and release memory.
   * I must do it in this order because output module has allocated memory
   * of lifespan JPOOL_IMAGE; it needs to finish before releasing memory.
   */
  (*dest_mgr->finish_output) (&cinfo, dest_mgr);
  (void) jpeg_finish_decompress(&cinfo);
  /* Clean up and exit */
 fail:
  jpeg_destroy_decompress(&cinfo);
  if (input_file != NULL) fclose(input_file);
  if (output_file != NULL) fclose(output_file);
 #ifdef PROGRESS_REPORT
  end_progress_monitor((j_common_ptr) &cinfo);
 #endif
  /* Disable signal catcher. */
 #ifdef NEED_SIGNAL_CATCHER
  enable_signal_catcher((j_common_ptr) NULL);
 #endif
  return file_index;
 }
 /*
 * The main program.
 */
 int
 main (int argc, char **argv)
 {
  int file_index;
  /* On Mac, fetch a command line. */
 #ifdef USE_CCOMMAND
  argc = ccommand(&argv);
 #endif
 #ifdef MSDOS
  progname = "djpeg";		/* DOS tends to be too verbose about argv[0] */
 #else
  progname = argv[0];
  if (progname == NULL || progname[0] == 0)
    progname = "djpeg";		/* in case C library doesn't provide it */
 #endif
  /* The default maxmem must be computed only once at program startup,
   * since releasing memory with free() won't give it back to the OS.
   */
 #ifdef FREE_MEM_ESTIMATE
  default_maxmem = FREE_MEM_ESTIMATE;
 #else
  default_maxmem = 0;
 #endif
  /* Scan command line, parse switches and locate input file names */
  if (argc < 2)
    usage();			/* nothing on the command line?? */
  file_index = 0;
  while (file_index < argc-1)
    file_index = process_one_file(argc, argv, file_index);
  /* All done. */
  exit(EXIT_SUCCESS);
  return 0;			/* suppress no-return-value warnings */
 }
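
Note on the output-name logic in process_one_file above: it copies infilename into a PATH_MAX-sized buffer with strcpy and appends the extension with strcat, with no length check. Below is a minimal sketch, not part of this change, of the same "replace the extension" rule with an explicit bound. The name make_output_name and its return convention are assumptions made for illustration only.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper (not in the IJG sources): build "infile with its
 * extension replaced by ext" in dst.  Scans backward from the end of the
 * name and stops at the first '.', ':', '/' or '\\', like the loop in
 * process_one_file.  Returns 0 on success, -1 if dst is too small.
 */
int
make_output_name (char *dst, size_t dstsize, const char *infile,
                  const char *ext)
{
  size_t len = strlen(infile);
  size_t extlen = strlen(ext);
  size_t stem = len;            /* number of leading chars to keep */
  size_t i = len;

  while (i > 0) {
    char c = infile[--i];
    if (c == ':' || c == '/' || c == '\\')
      break;                    /* hit a directory separator: no extension */
    if (c == '.') {
      stem = i;                 /* lop off existing extension */
      break;
    }
  }
  if (stem + extlen + 1 > dstsize)
    return -1;                  /* result (plus NUL) would not fit */
  memcpy(dst, infile, stem);
  memcpy(dst + stem, ext, extlen + 1);  /* copies the terminating NUL too */
  return 0;
}
```

A caller would pass workfilename, sizeof(workfilename), infilename and default_extension, and take the failure path when -1 is returned.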
--- a/altui/usage.alt
+++ b/altui/usage.alt
@@ -0,0 +1,62 @@
 (Most of the standard usage.doc file also applies to this alternate version,
 but replace its "GENERAL USAGE" section with the text below.  Edit the text
 as necessary if you don't support wildcards or overwrite checking.  Be sure
 to fix the djpeg switch descriptions if you are not defaulting to PPM output.
 Also, if you've provided an accurate memory-estimation procedure, you can
 probably eliminate the HINTS related to the -maxmemory switch.)
 GENERAL USAGE
 We provide two programs, cjpeg to compress an image file into JPEG format,
 and djpeg to decompress a JPEG file back into a conventional image format.
 The basic command line is:
 	cjpeg [switches] list of image files
 or
 	djpeg [switches] list of jpeg files
 Each file named is compressed or decompressed.  The input file(s) are not
 modified; the output data is written to files which have the same names
 except for extension.  cjpeg always uses ".jpg" for the output file name's
 extension; djpeg uses one of ".bmp", ".gif", ".ppm", ".rle", or ".tga",
 depending on what output format is selected by the switches.
 For example, to convert xxx.bmp to xxx.jpg and yyy.ppm to yyy.jpg, say:
 	cjpeg xxx.bmp yyy.ppm
 On most systems you can use standard wildcards to specify the list of input
 files; for example, on DOS "djpeg *.jpg" decompresses all the JPEG files in
 the current directory.
 If an intended output file already exists, you'll be asked whether or not to
 overwrite it.  If you say no, the program skips that input file and goes on
 to the next one.
 You can intermix switches and file names; for example
 	djpeg -gif file1.jpg -targa file2.jpg
 decompresses file1.jpg into GIF format (file1.gif) and file2.jpg into Targa
 format (file2.tga).  Only switches to the left of a given file name affect
 processing of that file; when there are conflicting switches, the rightmost
 one takes precedence.
 You can override the program's choice of output file name by using the
 -outfile switch, as in
 	cjpeg -outfile output.jpg input.ppm
 -outfile only affects the first input file name to its right.
 The currently supported image file formats are: PPM (PBMPLUS color format),
 PGM (PBMPLUS gray-scale format), BMP, GIF, Targa, and RLE (Utah Raster
 Toolkit format).  (RLE is supported only if the URT library is available,
 which it isn't on most non-Unix systems.)  cjpeg recognizes the input image
 format automatically, with the exception of some Targa-format files.  You
 have to tell djpeg which format to generate.
 JPEG files are in the de facto standard JFIF file format.  There are other,
 less widely used JPEG-based file formats, but we don't support them.
 All switch names may be abbreviated; for example, -grayscale may be written
 -gray or -gr.  Most of the "basic" switches can be abbreviated to as little as
 one letter.  Upper and lower case are equivalent (-BMP is the same as -bmp).
 British spellings are also accepted (e.g., -greyscale), though for brevity
 these are not mentioned below.
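
The abbreviation rule described above (any prefix of a switch name is accepted if it is long enough to be unambiguous, and case is ignored) can be sketched as follows. This is an illustration modeled on the keymatch() calls visible in the cjpeg.c changes; the name abbrev_match is hypothetical, and the sketch assumes the leading '-' has already been stripped and that the keyword is given in lower case.

```c
#include <ctype.h>

/* Hypothetical sketch: return 1 if arg is an acceptable abbreviation of
 * keyword, else 0.  arg may be any prefix of keyword that is at least
 * minchars long; upper-case letters in arg are folded to lower case.
 */
int
abbrev_match (const char *arg, const char *keyword, int minchars)
{
  int nmatched = 0;

  while (*arg != '\0') {
    if (*keyword == '\0')
      return 0;                 /* arg is longer than the keyword */
    if (tolower((unsigned char) *arg) != (unsigned char) *keyword)
      return 0;                 /* mismatch */
    arg++;
    keyword++;
    nmatched++;                 /* count matched characters */
  }
  /* end of arg reached: reject if too short to be a unique abbreviation */
  return nmatched >= minchars;
}
```

With a minimum of 2 for "grayscale", both "gray" and "gr" are accepted and "g" is rejected. A British spelling such as "greyscale" does not match the keyword "grayscale" and would need its own comparison.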
--- a/cderror.h
+++ b/cderror.h
@@ -1,7 +1,7 @@
 /*
 * cderror.h
 *
- * Copyright (C) 1994, Thomas G. Lane.
+ * Copyright (C) 1994-1997, Thomas G. Lane.
 * This file is part of the Independent JPEG Group's software.
 * For conditions of distribution and use, see the accompanying README file.
 *
@@ -72,7 +72,7 @@ JMESSAGE(JWRN_GIF_NOMOREDATA, "Ran out of GIF bits")
 #ifdef PPM_SUPPORTED
 JMESSAGE(JERR_PPM_COLORSPACE, "PPM output must be grayscale or RGB")
 JMESSAGE(JERR_PPM_NONNUMERIC, "Nonnumeric data in PPM file")
-JMESSAGE(JERR_PPM_NOT, "Not a PPM file")
+JMESSAGE(JERR_PPM_NOT, "Not a PPM/PGM file")
 JMESSAGE(JTRC_PGM, "%ux%u PGM image")
 JMESSAGE(JTRC_PGM_TEXT, "%ux%u text PGM image")
 JMESSAGE(JTRC_PPM, "%ux%u PPM image")
--- a/cdjpeg.c
+++ b/cdjpeg.c
@@ -1,7 +1,7 @@
 /*
 * cdjpeg.c
 *
- * Copyright (C) 1991-1996, Thomas G. Lane.
+ * Copyright (C) 1991-1997, Thomas G. Lane.
 * This file is part of the Independent JPEG Group's software.
 * For conditions of distribution and use, see the accompanying README file.
 *
@@ -47,7 +47,9 @@ GLOBAL(void)
 enable_signal_catcher (j_common_ptr cinfo)
 {
  sig_cinfo = cinfo;
 #ifdef SIGINT			/* not all systems have SIGINT */
  signal(SIGINT, signal_catcher);
 #endif
 #ifdef SIGTERM			/* not all systems have SIGTERM */
  signal(SIGTERM, signal_catcher);
 #endif
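
The hunk above installs the same handler for SIGINT and SIGTERM, each guarded by #ifdef so the file still compiles where a signal is not defined. A minimal standalone sketch of that pattern follows. The body of signal_catcher is not shown in this diff, so the sketch substitutes a handler that only records the event in a flag; the names note_signal, install_catcher and interrupted are assumptions for illustration.

```c
#include <signal.h>

/* Set to 1 by the handler; sig_atomic_t is the only object type that is
 * safe to write from a signal handler in portable C.
 */
volatile sig_atomic_t interrupted = 0;

/* Hypothetical stand-in for a real cleanup handler: just record the signal. */
void
note_signal (int signum)
{
  (void) signum;                /* unused */
  interrupted = 1;
}

/* Install the handler for each termination signal the platform defines. */
void
install_catcher (void)
{
#ifdef SIGINT                   /* not all systems have SIGINT */
  signal(SIGINT, note_signal);
#endif
#ifdef SIGTERM                  /* not all systems have SIGTERM */
  signal(SIGTERM, note_signal);
#endif
}
```

A program using this sketch would call install_catcher() once at startup and poll interrupted at convenient points, for example between scanlines.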
--- a/cdjpeg.h
+++ b/cdjpeg.h
@@ -1,7 +1,7 @@
 /*
 * cdjpeg.h
 *
- * Copyright (C) 1994-1996, Thomas G. Lane.
+ * Copyright (C) 1994-1997, Thomas G. Lane.
 * This file is part of the Independent JPEG Group's software.
 * For conditions of distribution and use, see the accompanying README file.
 *
@@ -156,9 +156,14 @@ EXTERN(FILE *) write_stdout JPP((void));
 #define READ_BINARY	"r"
 #define WRITE_BINARY	"w"
 #else
 #ifdef VMS			/* VMS is very nonstandard */
 #define READ_BINARY	"rb", "ctx=stm"
 #define WRITE_BINARY	"wb", "ctx=stm"
 #else				/* standard ANSI-compliant case */
 #define READ_BINARY	"rb"
 #define WRITE_BINARY	"wb"
 #endif
 #endif
 #ifndef EXIT_FAILURE		/* define exit() codes if not provided */
 #define EXIT_FAILURE  1
--- a/change.log
+++ b/change.log
@@ -1,6 +1,71 @@
 CHANGE LOG for Independent JPEG Group's JPEG software
 Version 6b  27-Mar-1998
 -----------------------
 jpegtran has new features for lossless image transformations (rotation
 and flipping) as well as "lossless" reduction to grayscale.
 jpegtran now copies comments by default; it has a -copy switch to enable
 copying all APPn blocks as well, or to suppress comments.  (Formerly it
 always suppressed comments and APPn blocks.)  jpegtran now also preserves
 JFIF version and resolution information.
 New decompressor library feature: COM and APPn markers found in the input
 file can be saved in memory for later use by the application.  (Before,
 you had to code this up yourself with a custom marker processor.)
 There is an unused field "void * client_data" now in compress and decompress
 parameter structs; this may be useful in some applications.
 JFIF version number information is now saved by the decoder and accepted by
 the encoder.  jpegtran uses this to copy the source file's version number,
 to ensure "jpegtran -copy all" won't create bogus files that contain JFXX
 extensions but claim to be version 1.01.  Applications that generate their
 own JFXX extension markers also (finally) have a supported way to cause the
 encoder to emit JFIF version number 1.02.
 djpeg's trace mode reports JFIF 1.02 thumbnail images as such, rather
 than as unknown APP0 markers.
 In -verbose mode, djpeg and rdjpgcom will try to print the contents of
 APP12 markers as text.  Some digital cameras store useful text information
 in APP12 markers.
 Handling of truncated data streams is more robust: blocks beyond the one in
 which the error occurs will be output as uniform gray, or left unchanged
 if decoding a progressive JPEG.  The appearance no longer depends on the
 Huffman tables being used.
 Huffman tables are checked for validity much more carefully than before.
 To avoid the Unisys LZW patent, djpeg's GIF output capability has been
 changed to produce "uncompressed GIFs", and cjpeg's GIF input capability
 has been removed altogether.  We're not happy about it either, but there
 seems to be no good alternative.
 The configure script now supports building libjpeg as a shared library
 on many flavors of Unix (all the ones that GNU libtool knows how to
 build shared libraries for).  Use "./configure --enable-shared" to
 try this out.
 New jconfig file and makefiles for Microsoft Visual C++ and Developer Studio.
 Also, a jconfig file and a build script for Metrowerks CodeWarrior
 on Apple Macintosh.  makefile.dj has been updated for DJGPP v2, and there
 are miscellaneous other minor improvements in the makefiles.
 jmemmac.c now knows how to create temporary files following Mac System 7
 conventions.
 djpeg's -map switch is now able to read raw-format PPM files reliably.
 cjpeg -progressive -restart no longer generates any unnecessary DRI markers.
 Multiple calls to jpeg_simple_progression for a single JPEG object
 no longer leak memory.
 Version 6a  7-Feb-96
 --------------------
--- a/cjpeg.1
+++ b/cjpeg.1
@@ -1,4 +1,4 @@
-.TH CJPEG 1 "15 June 1995"
+.TH CJPEG 1 "20 March 1998"
 .SH NAME
 cjpeg \- compress an image file to a JPEG file
 .SH SYNOPSIS
@@ -16,7 +16,7 @@ cjpeg \- compress an image file to a JPEG file
 compresses the named image file, or the standard input if no file is
 named, and produces a JPEG/JFIF file on the standard output.
 The currently supported input file formats are: PPM (PBMPLUS color
-format), PGM (PBMPLUS gray-scale format), BMP, GIF, Targa, and RLE (Utah Raster
+format), PGM (PBMPLUS gray-scale format), BMP, Targa, and RLE (Utah Raster
 Toolkit format).  (RLE is supported only if the URT library is available.)
 .SH OPTIONS
 All switch names may be abbreviated; for example,
@@ -27,9 +27,9 @@ or
 .BR \-gr .
 Most of the "basic" switches can be abbreviated to as little as one letter.
 Upper and lower case are equivalent (thus
-.B \-GIF
+.B \-BMP
 is the same as
-.BR \-gif ).
+.BR \-bmp ).
 British spellings are also accepted (e.g.,
 .BR \-greyscale ),
 though for brevity these are not mentioned below.
@@ -42,9 +42,9 @@ Scale quantization tables to adjust image quality.  Quality is 0 (worst) to
 .TP
 .B \-grayscale
 Create monochrome JPEG file from color input.  Be sure to use this switch when
-compressing a grayscale GIF file, because
+compressing a grayscale BMP file, because
 .B cjpeg
-isn't bright enough to notice whether a GIF file uses only shades of gray.
+isn't bright enough to notice whether a BMP file uses only shades of gray.
 By saying
 .BR \-grayscale ,
 you'll get a smaller JPEG file that takes less time to process.
@@ -180,16 +180,22 @@ for images that will be transmitted across unreliable networks such as Usenet.
 The
 .B \-smooth
 option filters the input to eliminate fine-scale noise.  This is often useful
-when converting GIF files to JPEG: a moderate smoothing factor of 10 to 50
+when converting dithered images to JPEG: a moderate smoothing factor of 10 to
-gets rid of dithering patterns in the input file, resulting in a smaller JPEG
+50 gets rid of dithering patterns in the input file, resulting in a smaller
-file and a better-looking image.  Too large a smoothing factor will visibly
+JPEG file and a better-looking image.  Too large a smoothing factor will
-blur the image, however.
+visibly blur the image, however.
 .PP
 Switches for wizards:
 .TP
 .B \-baseline
-Force a baseline JPEG file to be generated.  This clamps quantization values
+Force baseline-compatible quantization tables to be generated.  This clamps
-to 8 bits even at low quality settings.
+quantization values to 8 bits even at low quality settings.  (This switch is
 poorly named, since it does not ensure that the output is actually baseline
 JPEG.  For example, you can use
 .B \-baseline
 and
 .B \-progressive
 together.)
 .TP
 .BI \-qtables " file"
 Use the quantization tables given in the specified text file.
@@ -272,6 +278,10 @@ Independent JPEG Group
 .SH BUGS
 Arithmetic coding is not supported for legal reasons.
 .PP
 GIF input files are no longer supported, to avoid the Unisys LZW patent.
 Use a Unisys-licensed program if you need to read a GIF file.  (Conversion
 of GIF files to JPEG is usually a bad idea anyway.)
 .PP
 Not all variants of BMP and Targa file formats are supported.
 .PP
 The
--- a/cjpeg.c
+++ b/cjpeg.c
@@ -1,10 +1,17 @@
 /*
 * cjpeg.c
 *
- * Copyright (C) 1991-1996, Thomas G. Lane.
+ * Copyright (C) 1991-1998, Thomas G. Lane.
 * This file is part of the Independent JPEG Group's software.
 * For conditions of distribution and use, see the accompanying README file.
 *
 * ---------------------------------------------------------------------
 * x86 SIMD extension for IJG JPEG library
 * Copyright (C) 1999-2006, MIYASAKA Masaru.
 * This file has been modified for SIMD extension.
 * Last Modified : August 23, 2005
 * ---------------------------------------------------------------------
 *
 * This file contains a command-line user interface for the JPEG compressor.
 * It should work on any system with Unix- or MS-DOS-style command lines.
 *
@@ -184,7 +191,7 @@ usage (void)
 #ifdef C_ARITH_CODING_SUPPORTED
  fprintf(stderr, "  -arithmetic    Use arithmetic coding\n");
 #endif
-  fprintf(stderr, "  -baseline      Force baseline output\n");
+  fprintf(stderr, "  -baseline      Force baseline quantization tables\n");
  fprintf(stderr, "  -qtables file  Use quantization tables given in file\n");
  fprintf(stderr, "  -qslots N[,...]    Set component quantization tables\n");
  fprintf(stderr, "  -sample HxV[,...]  Set component sampling factors\n");
@@ -195,6 +202,22 @@ usage (void)
 }
 #ifndef JSIMD_MODEINFO_NOT_SUPPORTED
 LOCAL(void)
 print_simd_info (FILE * file, char * labelstr, unsigned int simd)
 {
  fprintf(file, "%s%s%s%s%s%s\n", labelstr,
 	  simd & JSIMD_MMX   ? " MMX"    : "",
 	  simd & JSIMD_3DNOW ? " 3DNow!" : "",
 	  simd & JSIMD_SSE   ? " SSE"    : "",
 	  simd & JSIMD_SSE2  ? " SSE2"   : "",
 	  simd == JSIMD_NONE ? " NONE"   : "");
 }
 #endif /* !JSIMD_MODEINFO_NOT_SUPPORTED */
 LOCAL(int)
 parse_switches (j_compress_ptr cinfo, int argc, char **argv,
 		int last_file_arg_seen, boolean for_real)
@@ -255,9 +278,22 @@ parse_switches (j_compress_ptr cinfo, int argc, char **argv,
 #endif
    } else if (keymatch(arg, "baseline", 1)) {
-      /* Force baseline output (8-bit quantizer values). */
+      /* Force baseline-compatible output (8-bit quantizer values). */
      force_baseline = TRUE;
 #ifndef JSIMD_MASKFUNC_NOT_SUPPORTED
    } else if (keymatch(arg, "nosimd" , 4)) {
      jpeg_simd_mask((j_common_ptr) cinfo, JSIMD_NONE, JSIMD_ALL);
    } else if (keymatch(arg, "nommx"  , 3)) {
      jpeg_simd_mask((j_common_ptr) cinfo, JSIMD_NONE, JSIMD_MMX);
    } else if (keymatch(arg, "no3dnow", 3)) {
      jpeg_simd_mask((j_common_ptr) cinfo, JSIMD_NONE, JSIMD_3DNOW);
    } else if (keymatch(arg, "nosse"  , 4)) {
      jpeg_simd_mask((j_common_ptr) cinfo, JSIMD_NONE, JSIMD_SSE);
    } else if (keymatch(arg, "nosse2" , 6)) {
      jpeg_simd_mask((j_common_ptr) cinfo, JSIMD_NONE, JSIMD_SSE2);
 #endif /* !JSIMD_MASKFUNC_NOT_SUPPORTED */
    } else if (keymatch(arg, "dct", 2)) {
      /* Select DCT algorithm. */
      if (++argn >= argc)	/* advance to next argument */
@@ -279,6 +315,32 @@ parse_switches (j_compress_ptr cinfo, int argc, char **argv,
      if (! printed_version) {
 	fprintf(stderr, "Independent JPEG Group's CJPEG, version %s\n%s\n",
 		JVERSION, JCOPYRIGHT);
 	fprintf(stderr,
 		"\nx86 SIMD extension for IJG JPEG library, version %s\n\n",
 		JPEG_SIMDEXT_VER_STR);
 #ifndef JSIMD_MODEINFO_NOT_SUPPORTED
 	print_simd_info(stderr, "SIMD instructions supported by the system :",
 			jpeg_simd_support(NULL));
 	fprintf(stderr, "\n      === SIMD Operation Modes ===\n");
 #ifdef DCT_ISLOW_SUPPORTED
 	print_simd_info(stderr, "Accurate integer DCT  (-dct int)   :",
 			jpeg_simd_forward_dct(cinfo, JDCT_ISLOW));
 #endif
 #ifdef DCT_IFAST_SUPPORTED
 	print_simd_info(stderr, "Fast integer DCT      (-dct fast)  :",
 			jpeg_simd_forward_dct(cinfo, JDCT_IFAST));
 #endif
 #ifdef DCT_FLOAT_SUPPORTED
 	print_simd_info(stderr, "Floating-point DCT    (-dct float) :",
 			jpeg_simd_forward_dct(cinfo, JDCT_FLOAT));
 #endif
 	print_simd_info(stderr, "Downsampling (-sample 2x2 or 2x1)  :",
 			jpeg_simd_downsampler(cinfo));
 	print_simd_info(stderr, "Colorspace conversion (RGB->YCbCr) :",
 			jpeg_simd_color_converter(cinfo));
 	fprintf(stderr, "\n");
 #endif /* !JSIMD_MODEINFO_NOT_SUPPORTED */
 	printed_version = TRUE;
      }
      cinfo->err->trace_level++;
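
print_simd_info above turns a bit mask into a one-line list of instruction-set names, printing " NONE" only when no bit is set. The sketch below shows the same formatting technique writing into a caller-supplied buffer. The SK_* flag values are hypothetical placeholders: the real JSIMD_* constants are defined elsewhere in the SIMD extension and are not shown in this diff.

```c
#include <stdio.h>

/* Hypothetical flag values, for illustration only. */
enum { SK_NONE = 0, SK_MMX = 1, SK_3DNOW = 2, SK_SSE = 4, SK_SSE2 = 8 };

/* Format "<label> <names of set bits>" into buf.  Each conditional
 * contributes either a name or an empty string, so a single format
 * string covers every combination.  Returns snprintf's result.
 */
int
format_simd_flags (char *buf, size_t bufsize, const char *label,
                   unsigned int simd)
{
  return snprintf(buf, bufsize, "%s%s%s%s%s%s", label,
                  (simd & SK_MMX)   ? " MMX"    : "",
                  (simd & SK_3DNOW) ? " 3DNow!" : "",
                  (simd & SK_SSE)   ? " SSE"    : "",
                  (simd & SK_SSE2)  ? " SSE2"   : "",
                  (simd == SK_NONE) ? " NONE"   : "");
}
```

For example, a mask of SK_MMX | SK_SSE2 with the label "modes:" yields "modes: MMX SSE2", and a mask of SK_NONE yields "modes: NONE".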
--- a/ckconfig.c
+++ b/ckconfig.c
@@ -4,6 +4,13 @@
 * Copyright (C) 1991-1994, Thomas G. Lane.
 * This file is part of the Independent JPEG Group's software.
 * For conditions of distribution and use, see the accompanying README file.
 *
 * ---------------------------------------------------------------------
 * x86 SIMD extension for IJG JPEG library
 * Copyright (C) 1999-2006, MIYASAKA Masaru.
 * This file has been modified for SIMD extension.
 * Last Modified : March 28, 2005
 * ---------------------------------------------------------------------
 */
 /*
@@ -361,6 +368,10 @@ int main (argc, argv)
  fprintf(outfile, "#define INCOMPLETE_TYPES_BROKEN\n");
 #else
  fprintf(outfile, "#undef INCOMPLETE_TYPES_BROKEN\n");
 #endif
 #ifdef _WIN32
  fprintf(outfile, "\n/* Define \"boolean\" as unsigned char, not int, per Windows custom */\n");
  fprintf(outfile, "#define TYPEDEF_UCHAR_BOOLEAN\n");
 #endif
  fprintf(outfile, "\n#ifdef JPEG_INTERNALS\n\n");
  if (is_shifting_signed(-0x7F7E80B1L))
@@ -368,6 +379,14 @@ int main (argc, argv)
  else
    fprintf(outfile, "#define RIGHT_SHIFT_IS_UNSIGNED\n");
  fprintf(outfile, "\n#endif /* JPEG_INTERNALS */\n");
  fprintf(outfile, "\n#if defined(JPEG_INTERNALS) || defined(JPEG_INTERNAL_OPTIONS)\n");
  fprintf(outfile, "#undef JSIMD_MMX_NOT_SUPPORTED\n");
  fprintf(outfile, "#undef JSIMD_3DNOW_NOT_SUPPORTED\n");
  fprintf(outfile, "#undef JSIMD_SSE_NOT_SUPPORTED\n");
  fprintf(outfile, "#undef JSIMD_SSE2_NOT_SUPPORTED\n");
  fprintf(outfile, "#endif\n");
  fprintf(outfile, "\n#ifdef JPEG_CJPEG_DJPEG\n\n");
  fprintf(outfile, "#define BMP_SUPPORTED		/* BMP image file format */\n");
  fprintf(outfile, "#define GIF_SUPPORTED		/* GIF image file format */\n");
@@ -375,6 +394,9 @@ int main (argc, argv)
  fprintf(outfile, "#undef RLE_SUPPORTED		/* Utah RLE image file format */\n");
  fprintf(outfile, "#define TARGA_SUPPORTED		/* Targa image file format */\n\n");
  fprintf(outfile, "#undef TWO_FILE_COMMANDLINE	/* You may need this on non-Unix systems */\n");
 #ifdef _WIN32
  fprintf(outfile, "#define USE_SETMODE		/* Needed to make one-file style work */\n");
 #endif
  fprintf(outfile, "#undef NEED_SIGNAL_CATCHER	/* Define this if you use jmemname.c */\n");
  fprintf(outfile, "#undef DONT_USE_B_MODE\n");
  fprintf(outfile, "/* #define PROGRESS_REPORT */	/* optional */\n");
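
ckconfig.c decides between signed and unsigned right shift by calling is_shifting_signed(-0x7F7E80B1L), as seen above. The probe can be sketched as a standalone function, as below. This is an illustration adapted from the test program that appears in configure.in, with two stated differences: the fix-up step is done in unsigned arithmetic to avoid shifting a negative value left, and an unexpected result is reported as -1 instead of a printed warning.

```c
/* Sketch of the right-shift probe.  Returns 1 if >> on a negative long
 * sign-extends, 0 if it zero-fills and OR-ing in the top bits repairs the
 * result, and -1 if neither expectation holds.
 */
int
is_shifting_signed (long arg)
{
  long res = arg >> 4;
  unsigned long fixed;

  if (res == -0x7F7E80CL)
    return 1;                   /* right shift is signed (arithmetic) */

  /* See if the unsigned-shift hack fixes it: set every bit from bit 28
   * upward.  The exact value of res cannot be tested directly because it
   * depends on the width of long.
   */
  fixed = (unsigned long) res | (~0UL << (32 - 4));
  if (fixed == (unsigned long) (-0x7F7E80CL))
    return 0;                   /* right shift is unsigned (logical) */

  return -1;                    /* shift is not behaving as expected */
}
```

A result of 0 corresponds to the case in which ckconfig.c emits "#define RIGHT_SHIFT_IS_UNSIGNED".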
--- a/config.guess
+++ b/config.guess
--- a/config.sub
+++ b/config.sub
--- a/config.ver
+++ b/config.ver
@@ -0,0 +1,44 @@
 JPEG_VER_MAJOR=62
 JPEG_VER_MINOR=1
 JPEG_REVISION=0
 case $host_os in
  cygwin*)
    # The shared library built from this source code is *not* binary
    # compatible with the cygwin's official binary release (cygjpeg-62.dll).
    # This is because the official binary has been built with
    # the lossless jpeg patch which is available as ljpeg-6b.tar.gz .
    # Therefore we decided to give the shared library the version number
    # other than 62.
    #
    JPEG_VER_MAJOR=162
    JPEG_VER_MINOR=0
    ;;
  freebsd*)
    # This follows the official binary release in the ports collection.
    JPEG_VER_MAJOR=9
    ;;
 esac
 # convert absolute version numbers to libtool ages
 case $version_type in
  freebsd-aout|freebsd-elf|sunos)
    JPEG_LT_CURRENT=$JPEG_VER_MAJOR
    JPEG_LT_REVISION=$JPEG_VER_MINOR
    JPEG_LT_AGE=0
    ;;
  irix|nonstopux)
    JPEG_LT_CURRENT=`expr $JPEG_VER_MAJOR + $JPEG_VER_MINOR - 1`
    JPEG_LT_AGE=$JPEG_VER_MINOR
    JPEG_LT_REVISION=$JPEG_VER_MINOR
    ;;
  *)
    JPEG_LT_CURRENT=`expr $JPEG_VER_MAJOR + $JPEG_VER_MINOR`
    JPEG_LT_AGE=$JPEG_VER_MINOR
    JPEG_LT_REVISION=$JPEG_REVISION
    ;;
 esac
 JPEG_LIB_VERSION=$JPEG_LT_CURRENT:$JPEG_LT_REVISION:$JPEG_LT_AGE
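
The three branches of the version_type case above map (major, minor, revision) to a libtool current:revision:age triple in different ways. The arithmetic can be restated in C as a cross-check; this sketch is not part of the build, and the enum and function names are hypothetical.

```c
#include <stdio.h>

/* Hypothetical labels for the three branches of the case statement. */
typedef enum {
  LT_STYLE_DEFAULT,             /* the "*)" branch */
  LT_STYLE_FREEBSD_SUNOS,       /* freebsd-aout | freebsd-elf | sunos */
  LT_STYLE_IRIX                 /* irix | nonstopux */
} lt_style;

/* Write "CURRENT:REVISION:AGE" into buf, following config.ver. */
int
libtool_version_info (char *buf, size_t bufsize, lt_style style,
                      int major, int minor, int revision)
{
  int current, rev, age;

  switch (style) {
  case LT_STYLE_FREEBSD_SUNOS:
    current = major;             rev = minor;    age = 0;
    break;
  case LT_STYLE_IRIX:
    current = major + minor - 1; rev = minor;    age = minor;
    break;
  default:
    current = major + minor;     rev = revision; age = minor;
    break;
  }
  return snprintf(buf, bufsize, "%d:%d:%d", current, rev, age);
}
```

With the defaults at the top of the file (62, 1, 0), the default branch gives 63:0:1, so current minus age is 62. The cygwin override (162, 0, 0) gives 162:0:0.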
--- a/5809
+++ b/5809
--- a/configure.in
+++ b/configure.in
@@ -0,0 +1,634 @@
 dnl Process this file with autoconf to produce a configure script.
 AC_INIT([jcmaster.c])
 AC_CONFIG_HEADER([jconfig.h:jconfig.cfg])
 dnl --------------------------------------------------------------------
 AC_PROG_CC
 AC_PROG_CPP
 dnl --------------------------------------------------------------------
 AC_MSG_CHECKING([for function prototypes])
 AC_CACHE_VAL([ijg_cv_have_prototypes],[AC_TRY_COMPILE([
 int testfunction (int arg1, int * arg2); /* check prototypes */
 struct methods_struct {		/* check method-pointer declarations */
  int (*error_exit) (char *msgtext);
  int (*trace_message) (char *msgtext);
  int (*another_method) (void);
 };
 int testfunction (int arg1, int * arg2) /* check definitions */
 { return arg2[arg1]; }
 int test2function (void)	/* check void arg list */
 { return 0; }
 ],[ ],[ijg_cv_have_prototypes=yes],[ijg_cv_have_prototypes=no])])
 AC_MSG_RESULT([$ijg_cv_have_prototypes])
 if test $ijg_cv_have_prototypes = yes; then
  AC_DEFINE([HAVE_PROTOTYPES],)
 else
  echo [Your compiler does not seem to know about function prototypes.]
  echo [Perhaps it needs a special switch to enable ANSI C mode.]
  echo [If so, we recommend running configure like this:]
  echo ["   ./configure  CC='cc -switch'"]
  echo [where -switch is the proper switch.]
 fi
 dnl --------------------------------------------------------------------
 AC_CHECK_HEADER([stddef.h],[AC_DEFINE([HAVE_STDDEF_H],)])
 AC_CHECK_HEADER([stdlib.h],[AC_DEFINE([HAVE_STDLIB_H],)])
 AC_CHECK_HEADER([string.h],[:],[AC_DEFINE([NEED_BSD_STRINGS],)])
 dnl --------------------------------------------------------------------
 AC_MSG_CHECKING([for size_t])
 AC_TRY_COMPILE([
 #ifdef HAVE_STDDEF_H
 #include <stddef.h>
 #endif
 #ifdef HAVE_STDLIB_H
 #include <stdlib.h>
 #endif
 #include <stdio.h>
 #ifdef NEED_BSD_STRINGS
 #include <strings.h>
 #else
 #include <string.h>
 #endif
 typedef size_t my_size_t;
 ],[ my_size_t foovar; ],
 [ijg_size_t_ok=yes],
 [ijg_size_t_ok="not ANSI, perhaps it is in sys/types.h"])
 AC_MSG_RESULT([$ijg_size_t_ok])
 if test "$ijg_size_t_ok" != yes; then
 AC_CHECK_HEADER([sys/types.h],[AC_DEFINE([NEED_SYS_TYPES_H],)
 AC_EGREP_HEADER([size_t],[sys/types.h],
 [ijg_size_t_ok="size_t is in sys/types.h"],[ijg_size_t_ok=no])],
 [ijg_size_t_ok=no])
 AC_MSG_RESULT([$ijg_size_t_ok])
 if test "$ijg_size_t_ok" = no; then
  echo [Type size_t is not defined in any of the usual places.]
  echo [Try putting '"typedef unsigned int size_t;"' in jconfig.h.]
 fi
 fi
 dnl --------------------------------------------------------------------
 AC_MSG_CHECKING([for type unsigned char])
 AC_TRY_COMPILE(,[ unsigned char un_char; ],[AC_MSG_RESULT(yes)
 AC_DEFINE([HAVE_UNSIGNED_CHAR],)],[AC_MSG_RESULT(no)])
 dnl --------------------------------------------------------------------
 AC_MSG_CHECKING([for type unsigned short])
 AC_TRY_COMPILE(,[ unsigned short un_short; ],[AC_MSG_RESULT(yes)
 AC_DEFINE([HAVE_UNSIGNED_SHORT],)],[AC_MSG_RESULT(no)])
 dnl --------------------------------------------------------------------
 AC_MSG_CHECKING([for type void])
 AC_TRY_COMPILE([
 /* Caution: a C++ compiler will insist on valid prototypes */
 typedef void * void_ptr;	/* check void * */
 #ifdef HAVE_PROTOTYPES		/* check ptr to function returning void */
 typedef void (*void_func) (int a, int b);
 #else
 typedef void (*void_func) ();
 #endif
 #ifdef HAVE_PROTOTYPES		/* check void function result */
 void test3function (void_ptr arg1, void_func arg2)
 #else
 void test3function (arg1, arg2)
     void_ptr arg1;
     void_func arg2;
 #endif
 {
  char * locptr = (char *) arg1; /* check casting to and from void * */
  arg1 = (void *) locptr;
  (*arg2) (1, 2);		/* check call of fcn returning void */
 }
 ],[ ],[AC_MSG_RESULT(yes)],[AC_MSG_RESULT(no)
 AC_DEFINE([void],[char])])
 dnl --------------------------------------------------------------------
 AC_MSG_CHECKING([for working const])
 AC_CACHE_VAL([ac_cv_c_const],[AC_TRY_COMPILE(,[
 /* Ultrix mips cc rejects this.  */
 typedef int charset[2]; const charset x;
 /* SunOS 4.1.1 cc rejects this.  */
 char const *const *ccp;
 char **p;
 /* NEC SVR4.0.2 mips cc rejects this.  */
 struct point {int x, y;};
 static struct point const zero = {0,0};
 /* AIX XL C 1.02.0.0 rejects this.
   It does not let you subtract one const X* pointer from another in an arm
   of an if-expression whose if-part is not a constant expression */
 const char *g = "string";
 ccp = &g + (g ? g-g : 0);
 /* HPUX 7.0 cc rejects these. */
 ++ccp;
 p = (char**) ccp;
 ccp = (char const *const *) p;
 { /* SCO 3.2v4 cc rejects this.  */
  char *t;
  char const *s = 0 ? (char *) 0 : (char const *) 0;
  *t++ = 0;
 }
 { /* Someone thinks the Sun supposedly-ANSI compiler will reject this.  */
  int x[] = {25, 17};
  const int *foo = &x[0];
  ++foo;
 }
 { /* Sun SC1.0 ANSI compiler rejects this -- but not the above. */
  typedef const int *iptr;
  iptr p = 0;
  ++p;
 }
 { /* AIX XL C 1.02.0.0 rejects this saying
     "k.c", line 2.27: 1506-025 (S) Operand must be a modifiable lvalue. */
  struct s { int j; const int *ap[3]; };
  struct s *b; b->j = 5;
 }
 { /* ULTRIX-32 V3.1 (Rev 9) vcc rejects this */
  const int foo = 10;
 }
 ],[ac_cv_c_const=yes],[ac_cv_c_const=no])])
 AC_MSG_RESULT([$ac_cv_c_const])
 if test $ac_cv_c_const = no; then
  AC_DEFINE([const],)
 fi
 dnl --------------------------------------------------------------------
 AC_MSG_CHECKING([for inline])
 ijg_cv_inline=""
 AC_TRY_COMPILE(,[} __inline__ int foo() { return 0; }
 int bar() { return foo();],[ijg_cv_inline="__inline__"],
 [AC_TRY_COMPILE(,[} __inline int foo() { return 0; }
 int bar() { return foo();],[ijg_cv_inline="__inline"],
 [AC_TRY_COMPILE(,[} inline int foo() { return 0; }
 int bar() { return foo();],[ijg_cv_inline="inline"],)])])
 AC_MSG_RESULT([$ijg_cv_inline])
 AC_DEFINE_UNQUOTED([INLINE],[$ijg_cv_inline])
 dnl --------------------------------------------------------------------
 AC_MSG_CHECKING([for broken incomplete types])
 AC_TRY_COMPILE([ typedef struct undefined_structure * undef_struct_ptr; ],
 ,[AC_MSG_RESULT(ok)],[AC_MSG_RESULT(broken)
 AC_DEFINE([INCOMPLETE_TYPES_BROKEN],)])
 dnl --------------------------------------------------------------------
 AC_MSG_CHECKING([for short external names])
 AC_TRY_LINK([
 int possibly_duplicate_function () { return 0; }
 int possibly_dupli_function () { return 1; }
 ],[ ],[AC_MSG_RESULT(ok)],[AC_MSG_RESULT(short)
 AC_DEFINE([NEED_SHORT_EXTERNAL_NAMES],)])
 dnl --------------------------------------------------------------------
 AC_MSG_CHECKING([to see if char is signed])
 AC_TRY_RUN([
 #ifdef HAVE_PROTOTYPES
 int is_char_signed (int arg)
 #else
 int is_char_signed (arg)
     int arg;
 #endif
 {
  if (arg == 189) {		/* expected result for unsigned char */
    return 0;			/* type char is unsigned */
  }
  else if (arg != -67) {	/* expected result for signed char */
    printf("Hmm, it seems 'char' is not eight bits wide on your machine.\n");
    printf("I fear the JPEG software will not work at all.\n\n");
  }
  return 1;			/* assume char is signed otherwise */
 }
 char signed_char_check = (char) (-67);
 main() {
  exit(is_char_signed((int) signed_char_check));
 }],[AC_MSG_RESULT(no)
 AC_DEFINE([CHAR_IS_UNSIGNED],)],[AC_MSG_RESULT(yes)],
 [echo Assuming that char is signed on target machine.
 echo If it is unsigned, this will be a little bit inefficient.
 ])
 dnl --------------------------------------------------------------------
 AC_MSG_CHECKING([to see if right shift is signed])
 AC_TRY_RUN([
 #ifdef HAVE_PROTOTYPES
 int is_shifting_signed (long arg)
 #else
 int is_shifting_signed (arg)
     long arg;
 #endif
 /* See whether right-shift on a long is signed or not. */
 {
  long res = arg >> 4;
  if (res == -0x7F7E80CL) {	/* expected result for signed shift */
    return 1;			/* right shift is signed */
  }
  /* see if unsigned-shift hack will fix it. */
  /* we can't just test exact value since it depends on width of long... */
  res |= (~0L) << (32-4);
  if (res == -0x7F7E80CL) {	/* expected result now? */
    return 0;			/* right shift is unsigned */
  }
  printf("Right shift isn't acting as I expect it to.\n");
  printf("I fear the JPEG software will not work at all.\n\n");
  return 0;			/* try it with unsigned anyway */
 }
 main() {
  exit(is_shifting_signed(-0x7F7E80B1L));
 }],[AC_MSG_RESULT(no)
 AC_DEFINE([RIGHT_SHIFT_IS_UNSIGNED],)],[AC_MSG_RESULT(yes)],
 [AC_MSG_RESULT([Assuming that right shift is signed on target machine.])])
 dnl --------------------------------------------------------------------
 AC_MSG_CHECKING([to see if fopen accepts b spec])
 AC_TRY_RUN([
 #include <stdio.h>
 main() {
  if (fopen("conftestdata", "wb") != NULL)
    exit(0);
  exit(1);
 }],[AC_MSG_RESULT(yes)],[AC_MSG_RESULT(no)
 AC_DEFINE([DONT_USE_B_MODE],)],[AC_MSG_RESULT([Assuming that it does.])])
 dnl --------------------------------------------------------------------
 AC_PROG_INSTALL
 AC_PROG_RANLIB
 dnl --------------------------------------------------------------------
 AC_CANONICAL_HOST
 AC_EXEEXT
 # Decide whether to use libtool,
 # and if so whether to build shared, static, or both flavors of library.
 AC_DISABLE_SHARED
 AC_DISABLE_STATIC
 if test "x$enable_shared" != xno  -o  "x$enable_static" != xno; then
  USELIBTOOL="yes"
 # LIBTOOL="./libtool"
  O="lo"
  A="la"
  LN='$(LIBTOOL) --mode=link $(CC)'
  INSTALL_LIB='$(LIBTOOL) --mode=install ${INSTALL}'
  INSTALL_PROGRAM="\$(LIBTOOL) --mode=install $INSTALL_PROGRAM"
  UNINSTALL='$(LIBTOOL) --mode=uninstall $(RM)'
 else
  USELIBTOOL="no"
  LIBTOOL=""
  O="o"
  A="a"
  LN='$(CC)'
  INSTALL_LIB="$INSTALL_DATA"
  UNINSTALL='$(RM)'
 fi
 AC_SUBST([LIBTOOL])
 AC_SUBST([O])
 AC_SUBST([A])
 AC_SUBST([LN])
 AC_SUBST([INSTALL_LIB])
 AC_SUBST([UNINSTALL])
 # Configure libtool if needed.
 if test $USELIBTOOL = yes; then
  AC_LIBTOOL_DLOPEN
  AC_LIBTOOL_WIN32_DLL
  AC_PROG_LIBTOOL
 fi
 # if libtool >= 1.5
 TAGCC=ifdef([AC_LIBTOOL_GCJ],[--tag=CC])
 AC_SUBST([TAGCC])
 dnl --------------------------------------------------------------------
 # Select memory manager depending on user input.
 # If no "--enable-maxmem", use jmemnobs
 MEMORYMGR='jmemnobs.$(O)'
 MAXMEM="no"
 AC_ARG_ENABLE([maxmem],
 [  --enable-maxmem[=N]     enable use of temp files, set max mem usage to N MB],
 [MAXMEM="$enableval"])
 # support --with-maxmem for backwards compatibility with IJG V5.
 AC_ARG_WITH([maxmem],,[MAXMEM="$withval"])
 if test "x$MAXMEM" = xyes; then
  MAXMEM=1
 fi
 if test "x$MAXMEM" != xno; then
  if test -n "`echo $MAXMEM | sed 's/[[0-9]]//g'`"; then
    AC_MSG_ERROR([non-numeric argument to --enable-maxmem])
  fi
  DEFAULTMAXMEM=`expr $MAXMEM \* 1048576`
 AC_DEFINE_UNQUOTED([DEFAULT_MAX_MEM],[${DEFAULTMAXMEM}])
 AC_MSG_CHECKING([for 'tmpfile()'])
 AC_TRY_LINK([#include <stdio.h>],[ FILE * tfile = tmpfile(); ],
 [AC_MSG_RESULT(yes)
 MEMORYMGR='jmemansi.$(O)'],
 [AC_MSG_RESULT(no)
 MEMORYMGR='jmemname.$(O)'
 AC_DEFINE([NEED_SIGNAL_CATCHER],)
 AC_MSG_CHECKING([for 'mktemp()'])
 AC_TRY_LINK(,[ char fname[80]; mktemp(fname); ],
 [AC_MSG_RESULT(yes)],[AC_MSG_RESULT(no)
 AC_DEFINE([NO_MKTEMP],)])])
 fi
 AC_SUBST([MEMORYMGR])
 dnl ====================================================================
 AC_MSG_CHECKING([to see if the host cpu type is i386 or compatible])
 case "$host_cpu" in
  i*86 | x86 | ia32)
    AC_MSG_RESULT(yes)
  ;;
  x86_64 | amd64 | aa64)
    AC_MSG_RESULT([no (x86_64)])
    AC_MSG_ERROR([Currently, this version of the JPEG library cannot be compiled as 64-bit code.])
  ;;
  *)
    AC_MSG_RESULT([no ("$host_cpu")])
    AC_MSG_ERROR([This version of JPEG library is for i386 or compatible processors only.])
  ;;
 esac
 if test -z "$NAFLAGS" ; then
  AC_MSG_CHECKING([for object file format of host system])
  case "$host_os" in
    cygwin* | mingw* | pw32* | interix*)
      objfmt='Win32-COFF'
    ;;
    msdosdjgpp* | go32*)
      objfmt='COFF'
    ;;
    os2-emx*)			# not tested
      objfmt='MSOMF'		# obj
    ;;
    linux*coff* | linux*oldld*)
      objfmt='COFF'		# ???
    ;;
    linux*aout*)
      objfmt='a.out'
    ;;
    linux*)
      objfmt='ELF'
    ;;
    freebsd* | netbsd* | openbsd*)
      if echo __ELF__ | $CC -E - | grep __ELF__ > /dev/null; then
        objfmt='BSD-a.out'
      else
        objfmt='ELF'
      fi
    ;;
    solaris* | sunos* | sysv* | sco*)
      objfmt='ELF'
    ;;
    darwin* | rhapsody* | nextstep* | openstep* | macos*)
      objfmt='Mach-O'
    ;;
    *)
      objfmt='ELF ?'
    ;;
  esac
  AC_MSG_RESULT([$objfmt])
  if test "$objfmt" = 'ELF ?'; then
    objfmt='ELF'
    AC_MSG_WARN([unexpected host system; assuming that the object format is $objfmt.])
  fi
 else
  objfmt=''
 fi
 AC_MSG_CHECKING([for object file format specifier (NAFLAGS) ])
 case "$objfmt" in
  MSOMF)      NAFLAGS='-fobj -DOBJ32';;
  Win32-COFF) NAFLAGS='-fwin32 -DWIN32';;
  COFF)       NAFLAGS='-fcoff -DCOFF';;
  a.out)      NAFLAGS='-faout -DAOUT';;
  BSD-a.out)  NAFLAGS='-faoutb -DAOUT';;
  ELF)        NAFLAGS='-felf -DELF';;
  RDF)        NAFLAGS='-frdf -DRDF';;
  Mach-O)     NAFLAGS='-fmacho -DMACHO';;
 esac
 AC_MSG_RESULT([$NAFLAGS])
 AC_SUBST([NAFLAGS])
 dnl --------------------------------------------------------------------
 AC_CHECK_PROGS(NASM, [nasm nasmw])
 test -z "$NASM" && AC_MSG_ERROR([no nasm (Netwide Assembler) found in \$PATH])
 if echo "$NASM" | grep yasm > /dev/null; then
  AC_MSG_WARN([DON'T USE YASM! CURRENT VERSION (R0.4.0) IS BUGGY!])
 fi
 AC_MSG_CHECKING([whether the assembler ($NASM $NAFLAGS) works])
 cat > conftest.asm <<EOF
 [%line __oline__ "configure"
        section .text
        bits    32
        global  _main,main
 _main:
 main:   xor     eax,eax
        ret
 ]EOF
 try_nasm='$NASM $NAFLAGS -o conftest.o conftest.asm'
 if AC_TRY_EVAL(try_nasm) && test -s conftest.o; then
  AC_MSG_RESULT(yes)
 else
  echo "configure: failed program was:" >&AC_FD_CC
  cat conftest.asm >&AC_FD_CC
  rm -rf conftest*
  AC_MSG_RESULT(no)
  AC_MSG_ERROR([installation or configuration problem: assembler cannot create object files.])
 fi
 AC_MSG_CHECKING([whether the linker accepts assembler output])
 try_nasm='${CC-cc} -o conftest${ac_exeext} $LDFLAGS conftest.o $LIBS 1>&AC_FD_CC'
 if AC_TRY_EVAL(try_nasm) && test -s conftest${ac_exeext}; then
  rm -rf conftest*
  AC_MSG_RESULT(yes)
 else
  rm -rf conftest*
  AC_MSG_RESULT(no)
  AC_MSG_ERROR([configuration problem: maybe object file format mismatch.])
 fi
 AC_MSG_CHECKING([whether the assembler supports line continuation character])
 cat > conftest.asm <<\EOF
 [%line __oline__ "configure"
 ; The line continuation character '\'
 ; was introduced in nasm 0.98.25.
        section .text
        bits    32
        global  _zero
 _zero:  xor     \
                eax,eax
        ret
 ]EOF
 try_nasm='$NASM $NAFLAGS -o conftest.o conftest.asm'
 if AC_TRY_EVAL(try_nasm) && test -s conftest.o; then
  rm -rf conftest*
  AC_MSG_RESULT(yes)
 else
  echo "configure: failed program was:" >&AC_FD_CC
  cat conftest.asm >&AC_FD_CC
  rm -rf conftest*
  AC_MSG_RESULT(no)
  AC_MSG_ERROR([you have to use a more recent version of the assembler.])
 fi
 dnl --------------------------------------------------------------------
 AC_MSG_CHECKING([SIMD instruction sets requested to use])
 simd_to_use=""
 AC_ARG_ENABLE(mmx,
 [  --disable-mmx           do not use MMX instruction set],
 [if test "x$enableval" = xno; then
  AC_DEFINE([JSIMD_MMX_NOT_SUPPORTED],)
 else
  simd_to_use="$simd_to_use MMX"
 fi], [simd_to_use="$simd_to_use MMX"])
 AC_ARG_ENABLE(3dnow,
 [  --disable-3dnow         do not use 3DNow! instruction set],
 [if test "x$enableval" = xno; then
  AC_DEFINE([JSIMD_3DNOW_NOT_SUPPORTED],)
 else
  simd_to_use="$simd_to_use 3DNow!"
 fi], [simd_to_use="$simd_to_use 3DNow!"])
 AC_ARG_ENABLE(sse,
 [  --disable-sse           do not use SSE instruction set],
 [if test "x$enableval" = xno; then
  AC_DEFINE([JSIMD_SSE_NOT_SUPPORTED],)
 else
  simd_to_use="$simd_to_use SSE"
 fi], [simd_to_use="$simd_to_use SSE"])
 AC_ARG_ENABLE(sse2,
 [  --disable-sse2          do not use SSE2 instruction set],
 [if test "x$enableval" = xno; then
  AC_DEFINE([JSIMD_SSE2_NOT_SUPPORTED],)
 else
  simd_to_use="$simd_to_use SSE2"
 fi], [simd_to_use="$simd_to_use SSE2"])
 test -z "$simd_to_use" && simd_to_use="NONE"
 AC_MSG_RESULT([$simd_to_use])
 for simd_name in $simd_to_use; do
 case "$simd_name" in
  MMX)    simd_instruction='psubw mm0,mm0';;
  3DNow!) simd_instruction='pfsub mm0,mm0';;
  SSE)    simd_instruction='subps xmm0,xmm0';;
  SSE2)   simd_instruction='subpd xmm0,xmm0';;
  *)      continue;;
 esac
 AC_MSG_CHECKING([whether the assembler supports $simd_name instructions])
 cat > conftest.asm <<EOF
 [%line __oline__ "configure"
        section .text
        bits    32
        global  _simd
 _simd:  $simd_instruction
        ret
 ]EOF
 try_nasm='$NASM $NAFLAGS -o conftest.o conftest.asm'
 if AC_TRY_EVAL(try_nasm) && test -s conftest.o; then
  rm -rf conftest*
  AC_MSG_RESULT(yes)
 else
  echo "configure: failed program was:" >&AC_FD_CC
  cat conftest.asm >&AC_FD_CC
  rm -rf conftest*
  AC_MSG_RESULT(no)
  AC_MSG_ERROR([you have to use a more recent version of the assembler.])
 fi
 done
 dnl --------------------------------------------------------------------
 # Select OS-dependent SIMD instruction support checker.
 # jsimdw32.$(O) (Win32) / jsimddjg.$(O) (DJGPP V.2) / jsimdgcc.$(O) (Unix/gcc)
 if test "x$SIMDCHECKER" = x ; then
  case "$host_os" in
    cygwin* | mingw* | pw32* | interix*)
      SIMDCHECKER='jsimdw32.$(O)'
    ;;
    msdosdjgpp* | go32*)
      SIMDCHECKER='jsimddjg.$(O)'
    ;;
    os2-emx*)			# not tested
      SIMDCHECKER='jsimdgcc.$(O)'
    ;;
    *)
      SIMDCHECKER='jsimdgcc.$(O)'
    ;;
  esac
 fi
 AC_SUBST([SIMDCHECKER])
 case "$host_os" in
  cygwin* | mingw* | pw32* | os2-emx* | msdosdjgpp* | go32*)
    AC_DEFINE([USE_SETMODE],)
  ;;
 # _host_name_*)
 #   AC_DEFINE([USE_FDOPEN],)
 # ;;
 esac
 # This is for UNIX-like environments on Windows platform.
 AC_ARG_ENABLE(uchar-boolean,
 [  --enable-uchar-boolean  define type \"boolean\" as unsigned char (for Windows)],
 [if test "x$enableval" != xno; then
  AC_DEFINE([TYPEDEF_UCHAR_BOOLEAN],)
 fi])
 dnl --------------------------------------------------------------------
 JPEG_LIB_VERSION="63:0:1"
 confv_dirs="$srcdir $srcdir/.. $srcdir/../.."
 config_ver=
 for ac_dir in $confv_dirs; do
  if test -r $ac_dir/config.ver; then
    config_ver=$ac_dir/config.ver
    break
  fi
 done
 if test -z "$config_ver"; then
  AC_MSG_WARN([cannot find config.ver in $confv_dirs])
  AC_MSG_WARN([default version number $JPEG_LIB_VERSION is used])
  AC_MSG_CHECKING([libjpeg version number for libtool])
  AC_MSG_RESULT([$JPEG_LIB_VERSION])
 else
  AC_MSG_CHECKING([libjpeg version number for libtool])
  . $config_ver
  AC_MSG_RESULT([$JPEG_LIB_VERSION])
  echo "configure: if you want to change the version number, modify $config_ver" 1>&2
 fi
 AC_SUBST([JPEG_LIB_VERSION])
 dnl --------------------------------------------------------------------
 # Prepare to massage makefile.cfg correctly.
 if test $ijg_cv_have_prototypes = yes; then
  A2K_DEPS=""
  COM_A2K="# "
 else
  A2K_DEPS="ansi2knr"
  COM_A2K=""
 fi
 AC_SUBST([A2K_DEPS])
 AC_SUBST([COM_A2K])
 # ansi2knr needs -DBSD if string.h is missing
 if test $ac_cv_header_string_h = no; then
  ANSI2KNRFLAGS="-DBSD"
 else
  ANSI2KNRFLAGS=""
 fi
 AC_SUBST([ANSI2KNRFLAGS])
 # Substitutions to enable or disable libtool-related stuff
 if test $USELIBTOOL = yes -a $ijg_cv_have_prototypes = yes; then
  COM_LT=""
 else
  COM_LT="# "
 fi
 AC_SUBST([COM_LT])
 if test "x$enable_shared" != xno; then
  FORCE_INSTALL_LIB="install-lib"
  UNINSTALL_LIB="uninstall-lib"
 else
  FORCE_INSTALL_LIB=""
  UNINSTALL_LIB=""
 fi
 AC_SUBST([FORCE_INSTALL_LIB])
 AC_SUBST([UNINSTALL_LIB])
 # Set up -I directives
 if test "x$srcdir" = x.; then
  INCLUDEFLAGS='-I$(srcdir)'
 else
  INCLUDEFLAGS='-I. -I$(srcdir)'
 fi
 AC_SUBST([INCLUDEFLAGS])
 dnl --------------------------------------------------------------------
 AC_OUTPUT([Makefile:makefile.cfg])
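The two run-time probes in this configure.in (the char signedness test and the right-shift test) compile and run small K&R-style C programs. The following is only a rough modern-C sketch of the same idea, assuming a hosted C99 compiler. The function names char_is_signed and right_shift_is_signed are illustrative and do not appear in the IJG sources.

```c
#include <limits.h>

/* Illustrative helper (not part of the IJG sources).
 * Mirrors the configure probe: store -67 in a plain char and read it back.
 * A signed char yields -67; an unsigned 8-bit char yields 189. */
static int char_is_signed(void)
{
    char probe = (char) (-67);
    return (int) probe == -67;
}

/* Illustrative helper (not part of the IJG sources).
 * Right-shifting a negative long is implementation-defined in C:
 * an arithmetic shift keeps the sign, a logical shift does not.
 * The constants are the ones used by the configure probe. */
static int right_shift_is_signed(void)
{
    long value = -0x7F7E80B1L;
    return (value >> 4) == -0x7F7E80CL;
}
```

In the script, a result of 0 from the first probe leads to CHAR_IS_UNSIGNED being defined, and a result of 0 from the second leads to RIGHT_SHIFT_IS_UNSIGNED being defined. When cross-compiling, neither probe can run, so the script falls back to assuming the signed behavior.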
--- a/djpeg.1
+++ b/djpeg.1
@@ -1,4 +1,4 @@
-.TH DJPEG 1 "15 June 1995"
+.TH DJPEG 1 "22 August 1997"
 .SH NAME
 djpeg \- decompress a JPEG file to an image file
 .SH SYNOPSIS
@@ -26,9 +26,9 @@ or
 .BR \-gr .
 Most of the "basic" switches can be abbreviated to as little as one letter.
 Upper and lower case are equivalent (thus
-.B \-GIF
+.B \-BMP
 is the same as
-.BR \-gif ).
+.BR \-bmp ).
 British spellings are also accepted (e.g.,
 .BR \-greyscale ),
 though for brevity these are not mentioned below.
@@ -182,13 +182,13 @@ Same as
 .BR \-verbose .
 .SH EXAMPLES
 .LP
-This example decompresses the JPEG file foo.jpg, automatically quantizes to
+This example decompresses the JPEG file foo.jpg, quantizes it to
-256 colors, and saves the output in GIF format in foo.gif:
+256 colors, and saves the output in 8-bit BMP format in foo.bmp:
 .IP
-.B djpeg \-gif
+.B djpeg \-colors 256 \-bmp
 .I foo.jpg
 .B >
-.I foo.gif
+.I foo.bmp
 .SH HINTS
 To get a quick preview of an image, use the
 .B \-grayscale
@@ -245,4 +245,9 @@ Independent JPEG Group
 .SH BUGS
 Arithmetic coding is not supported for legal reasons.
 .PP
 To avoid the Unisys LZW patent,
 .B djpeg
 produces uncompressed GIF files.  These are larger than they should be, but
 are readable by standard GIF decoders.
 .PP
 Still not as fast as we'd like.
--- a/djpeg.c
+++ b/djpeg.c
@@ -1,10 +1,17 @@
 /*
 * djpeg.c
 *
- * Copyright (C) 1991-1996, Thomas G. Lane.
+ * Copyright (C) 1991-1997, Thomas G. Lane.
 * This file is part of the Independent JPEG Group's software.
 * For conditions of distribution and use, see the accompanying README file.
 *
 * ---------------------------------------------------------------------
 * x86 SIMD extension for IJG JPEG library
 * Copyright (C) 1999-2006, MIYASAKA Masaru.
 * This file has been modified for SIMD extension.
 * Last Modified : August 23, 2005
 * ---------------------------------------------------------------------
 *
 * This file contains a command-line user interface for the JPEG decompressor.
 * It should work on any system with Unix- or MS-DOS-style command lines.
 *
@@ -158,6 +165,22 @@ usage (void)
 }
 #ifndef JSIMD_MODEINFO_NOT_SUPPORTED
 LOCAL(void)
 print_simd_info (FILE * file, char * labelstr, unsigned int simd)
 {
  fprintf(file, "%s%s%s%s%s%s\n", labelstr,
 	  simd & JSIMD_MMX   ? " MMX"    : "",
 	  simd & JSIMD_3DNOW ? " 3DNow!" : "",
 	  simd & JSIMD_SSE   ? " SSE"    : "",
 	  simd & JSIMD_SSE2  ? " SSE2"   : "",
 	  simd == JSIMD_NONE ? " NONE"   : "");
 }
 #endif /* !JSIMD_MODEINFO_NOT_SUPPORTED */
 LOCAL(int)
 parse_switches (j_decompress_ptr cinfo, int argc, char **argv,
 		int last_file_arg_seen, boolean for_real)
@@ -208,6 +231,19 @@ parse_switches (j_decompress_ptr cinfo, int argc, char **argv,
      cinfo->desired_number_of_colors = val;
      cinfo->quantize_colors = TRUE;
 #ifndef JSIMD_MASKFUNC_NOT_SUPPORTED
    } else if (keymatch(arg, "nosimd" , 4)) {
      jpeg_simd_mask((j_common_ptr) cinfo, JSIMD_NONE, JSIMD_ALL);
    } else if (keymatch(arg, "nommx"  , 3)) {
      jpeg_simd_mask((j_common_ptr) cinfo, JSIMD_NONE, JSIMD_MMX);
    } else if (keymatch(arg, "no3dnow", 3)) {
      jpeg_simd_mask((j_common_ptr) cinfo, JSIMD_NONE, JSIMD_3DNOW);
    } else if (keymatch(arg, "nosse"  , 4)) {
      jpeg_simd_mask((j_common_ptr) cinfo, JSIMD_NONE, JSIMD_SSE);
    } else if (keymatch(arg, "nosse2" , 6)) {
      jpeg_simd_mask((j_common_ptr) cinfo, JSIMD_NONE, JSIMD_SSE2);
 #endif /* !JSIMD_MASKFUNC_NOT_SUPPORTED */
    } else if (keymatch(arg, "dct", 2)) {
      /* Select IDCT algorithm. */
      if (++argn >= argc)	/* advance to next argument */
@@ -242,6 +278,38 @@ parse_switches (j_decompress_ptr cinfo, int argc, char **argv,
      if (! printed_version) {
 	fprintf(stderr, "Independent JPEG Group's DJPEG, version %s\n%s\n",
 		JVERSION, JCOPYRIGHT);
 	fprintf(stderr,
 		"\nx86 SIMD extension for IJG JPEG library, version %s\n\n",
 		JPEG_SIMDEXT_VER_STR);
 #ifndef JSIMD_MODEINFO_NOT_SUPPORTED
 	print_simd_info(stderr, "SIMD instructions supported by the system :",
 			jpeg_simd_support(NULL));
 	fprintf(stderr, "\n      === SIMD Operation Modes ===\n");
 #ifdef DCT_ISLOW_SUPPORTED
 	print_simd_info(stderr, "Accurate integer DCT  (-dct int)   :",
 			jpeg_simd_inverse_dct(cinfo, JDCT_ISLOW));
 #endif
 #ifdef DCT_IFAST_SUPPORTED
 	print_simd_info(stderr, "Fast integer DCT      (-dct fast)  :",
 			jpeg_simd_inverse_dct(cinfo, JDCT_IFAST));
 #endif
 #ifdef DCT_FLOAT_SUPPORTED
 	print_simd_info(stderr, "Floating-point DCT    (-dct float) :",
 			jpeg_simd_inverse_dct(cinfo, JDCT_FLOAT));
 #endif
 #ifdef IDCT_SCALING_SUPPORTED
 	print_simd_info(stderr, "Reduced-size DCT      (-scale M/N) :",
 			jpeg_simd_inverse_dct(cinfo, JDCT_FLOAT+1));
 #endif
 	print_simd_info(stderr, "High-quality upsampling (default)  :",
 			jpeg_simd_upsampler(cinfo, TRUE));
 	print_simd_info(stderr, "Low-quality upsampling (-nosmooth) :",
 			jpeg_simd_upsampler(cinfo, FALSE));
 	print_simd_info(stderr, "Colorspace conversion (YCbCr->RGB) :",
 			jpeg_simd_color_deconverter(cinfo));
 	fprintf(stderr, "\n");
 #endif /* !JSIMD_MODEINFO_NOT_SUPPORTED */
 	printed_version = TRUE;
      }
      cinfo->err->trace_level++;
@@ -344,9 +412,9 @@ parse_switches (j_decompress_ptr cinfo, int argc, char **argv,
 /*
- * Marker processor for COM markers.
+ * Marker processor for COM and interesting APPn markers.
 * This replaces the library's built-in processor, which just skips the marker.
- * We want to print out the marker as text, if possible.
+ * We want to print out the marker as text, to the extent possible.
 * Note this code relies on a non-suspending data source.
 */
@@ -366,7 +434,7 @@ jpeg_getc (j_decompress_ptr cinfo)
 METHODDEF(boolean)
-COM_handler (j_decompress_ptr cinfo)
+print_text_marker (j_decompress_ptr cinfo)
 {
  boolean traceit = (cinfo->err->trace_level >= 1);
  INT32 length;
@@ -377,8 +445,13 @@ COM_handler (j_decompress_ptr cinfo)
  length += jpeg_getc(cinfo);
  length -= 2;			/* discount the length word itself */
-  if (traceit)
+  if (traceit) {
    if (cinfo->unread_marker == JPEG_COM)
      fprintf(stderr, "Comment, length %ld:\n", (long) length);
    else			/* assume it is an APPn otherwise */
      fprintf(stderr, "APP%d, length %ld:\n",
 	      cinfo->unread_marker - JPEG_APP0, (long) length);
  }
  while (--length >= 0) {
    ch = jpeg_getc(cinfo);
@@ -445,8 +518,15 @@ main (int argc, char **argv)
  jerr.addon_message_table = cdjpeg_message_table;
  jerr.first_addon_message = JMSG_FIRSTADDONCODE;
  jerr.last_addon_message = JMSG_LASTADDONCODE;
-  /* Insert custom COM marker processor. */
+
-  jpeg_set_marker_processor(&cinfo, JPEG_COM, COM_handler);
+  /* Insert custom marker processor for COM and APP12.
   * APP12 is used by some digital camera makers for textual info,
   * so we provide the ability to display it as text.
   * If you like, additional APPn marker types can be selected for display,
   * but don't try to override APP0 or APP14 this way (see libjpeg.doc).
   */
  jpeg_set_marker_processor(&cinfo, JPEG_COM, print_text_marker);
  jpeg_set_marker_processor(&cinfo, JPEG_APP0+12, print_text_marker);
  /* Now safe to enable signal catcher. */
 #ifdef NEED_SIGNAL_CATCHER
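The print_simd_info helper added earlier in this file's diff turns a SIMD capability bitmask into a list of names. As a minimal sketch of that formatting, assuming hypothetical bit values (the real JSIMD_* constants live in the SIMD extension's headers, which this diff does not show), the same logic can be written against a buffer so that it is easy to check:

```c
#include <stdio.h>

/* Hypothetical bit values chosen for this sketch only; they are
 * assumptions, not the values defined by the SIMD extension. */
enum {
    DEMO_SIMD_NONE  = 0x00,
    DEMO_SIMD_MMX   = 0x01,
    DEMO_SIMD_3DNOW = 0x02,
    DEMO_SIMD_SSE   = 0x04,
    DEMO_SIMD_SSE2  = 0x08
};

/* Format a capability mask in the style of print_simd_info, but into a
 * caller-supplied buffer instead of a FILE stream.  Returns buf. */
static char *demo_format_simd(char *buf, size_t size, unsigned int simd)
{
    snprintf(buf, size, "%s%s%s%s%s",
             (simd & DEMO_SIMD_MMX)   ? " MMX"    : "",
             (simd & DEMO_SIMD_3DNOW) ? " 3DNow!" : "",
             (simd & DEMO_SIMD_SSE)   ? " SSE"    : "",
             (simd & DEMO_SIMD_SSE2)  ? " SSE2"   : "",
             (simd == DEMO_SIMD_NONE) ? " NONE"   : "");
    return buf;
}
```

Each set bit contributes one name with a leading space, and " NONE" appears only when the whole mask is zero, which matches the order of the conditional expressions in print_simd_info.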
--- a/filelist.doc
+++ b/filelist.doc
@@ -1,6 +1,6 @@
 IJG JPEG LIBRARY:  FILE LIST
-Copyright (C) 1994-1996, Thomas G. Lane.
+Copyright (C) 1994-1998, Thomas G. Lane.
 This file is part of the Independent JPEG Group's software.
 For conditions of distribution and use, see the accompanying README file.
@@ -113,8 +113,8 @@ module:
 jmemnobs.c	"No backing store": assumes adequate virtual memory exists.
 jmemansi.c	Makes temporary files with ANSI-standard routine tmpfile().
 jmemname.c	Makes temporary files with program-generated file names.
-jmemdos.c	Custom implementation for MS-DOS: knows about extended and
+jmemdos.c	Custom implementation for MS-DOS (16-bit environment only):
-		expanded memory as well as temporary files.
+		can use extended and expanded memory as well as temp files.
 jmemmac.c	Custom implementation for Apple Macintosh.
 Exactly one of the system-dependent modules should be configured into an
@@ -134,8 +134,9 @@ CJPEG/DJPEG/JPEGTRAN
 Include files:
-cdjpeg.h	Declarations shared by cjpeg/djpeg modules.
+cdjpeg.h	Declarations shared by cjpeg/djpeg/jpegtran modules.
-cderror.h	Additional error and trace message codes for cjpeg/djpeg.
+cderror.h	Additional error and trace message codes for cjpeg et al.
 transupp.h	Declarations for jpegtran support routines in transupp.c.
 C source code files:
@@ -146,11 +147,12 @@ cdjpeg.c	Utility routines used by all three programs.
 rdcolmap.c	Code to read a colormap file for djpeg's "-map" switch.
 rdswitch.c	Code to process some of cjpeg's more complex switches.
 		Also used by jpegtran.
 transupp.c	Support code for jpegtran: lossless image manipulations.
 Image file reader modules for cjpeg:
 rdbmp.c		BMP file input.
-rdgif.c		GIF file input.
+rdgif.c		GIF file input (now just a stub).
 rdppm.c		PPM/PGM file input.
 rdrle.c		Utah RLE file input.
 rdtarga.c	Targa file input.
@@ -158,7 +160,7 @@ rdtarga.c	Targa file input.
 Image file writer modules for djpeg:
 wrbmp.c		BMP file output.
-wrgif.c		GIF file output.
+wrgif.c		GIF file output (a mere shadow of its former self).
 wrppm.c		PPM/PGM file output.
 wrrle.c		Utah RLE file output.
 wrtarga.c	Targa file output.
@@ -190,6 +192,11 @@ example.c	Sample code for calling JPEG library.
 Configuration/installation files and programs (see install.doc for more info):
 configure	Unix shell script to perform automatic configuration.
 ltconfig	Support scripts for configure (from GNU libtool).
 ltmain.sh
 config.guess
 config.sub
 install-sh	Install shell script for those Unix systems lacking one.
 ckconfig.c	Program to generate jconfig.h on non-Unix systems.
 jconfig.doc	Template for making jconfig.h by hand.
 makefile.*	Sample makefiles for particular systems.
--- a/323
+++ b/323
@@ -0,0 +1,323 @@
 #!/bin/sh
 # install - install a program, script, or datafile
 scriptversion=2005-05-14.22
 # This originates from X11R5 (mit/util/scripts/install.sh), which was
 # later released in X11R6 (xc/config/util/install.sh) with the
 # following copyright and license.
 #
 # Copyright (C) 1994 X Consortium
 #
 # Permission is hereby granted, free of charge, to any person obtaining a copy
 # of this software and associated documentation files (the "Software"), to
 # deal in the Software without restriction, including without limitation the
 # rights to use, copy, modify, merge, publish, distribute, sublicense, and/or
 # sell copies of the Software, and to permit persons to whom the Software is
 # furnished to do so, subject to the following conditions:
 #
 # The above copyright notice and this permission notice shall be included in
 # all copies or substantial portions of the Software.
 #
 # THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
 # IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
 # FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT.  IN NO EVENT SHALL THE
 # X CONSORTIUM BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN
 # AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNEC-
 # TION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
 #
 # Except as contained in this notice, the name of the X Consortium shall not
 # be used in advertising or otherwise to promote the sale, use or other deal-
 # ings in this Software without prior written authorization from the X Consor-
 # tium.
 #
 #
 # FSF changes to this file are in the public domain.
 #
 # Calling this script install-sh is preferred over install.sh, to prevent
 # `make' implicit rules from creating a file called install from it
 # when there is no Makefile.
 #
 # This script is compatible with the BSD install script, but was written
 # from scratch.  It can only install one file at a time, a restriction
 # shared with many OS's install programs.
 # set DOITPROG to echo to test this script
 # Don't use :- since 4.3BSD and earlier shells don't like it.
 doit="${DOITPROG-}"
 # put in absolute paths if you don't have them in your path; or use env. vars.
 mvprog="${MVPROG-mv}"
 cpprog="${CPPROG-cp}"
 chmodprog="${CHMODPROG-chmod}"
 chownprog="${CHOWNPROG-chown}"
 chgrpprog="${CHGRPPROG-chgrp}"
 stripprog="${STRIPPROG-strip}"
 rmprog="${RMPROG-rm}"
 mkdirprog="${MKDIRPROG-mkdir}"
 chmodcmd="$chmodprog 0755"
 chowncmd=
 chgrpcmd=
 stripcmd=
 rmcmd="$rmprog -f"
 mvcmd="$mvprog"
 src=
 dst=
 dir_arg=
 dstarg=
 no_target_directory=
 usage="Usage: $0 [OPTION]... [-T] SRCFILE DSTFILE
   or: $0 [OPTION]... SRCFILES... DIRECTORY
   or: $0 [OPTION]... -t DIRECTORY SRCFILES...
   or: $0 [OPTION]... -d DIRECTORIES...
 In the 1st form, copy SRCFILE to DSTFILE.
 In the 2nd and 3rd, copy all SRCFILES to DIRECTORY.
 In the 4th, create DIRECTORIES.
 Options:
 -c         (ignored)
 -d         create directories instead of installing files.
 -g GROUP   $chgrpprog installed files to GROUP.
 -m MODE    $chmodprog installed files to MODE.
 -o USER    $chownprog installed files to USER.
 -s         $stripprog installed files.
 -t DIRECTORY  install into DIRECTORY.
 -T         report an error if DSTFILE is a directory.
 --help     display this help and exit.
 --version  display version info and exit.
 Environment variables override the default commands:
  CHGRPPROG CHMODPROG CHOWNPROG CPPROG MKDIRPROG MVPROG RMPROG STRIPPROG
 "
 while test -n "$1"; do
  case $1 in
    -c) shift
        continue;;
    -d) dir_arg=true
        shift
        continue;;
    -g) chgrpcmd="$chgrpprog $2"
        shift
        shift
        continue;;
    --help) echo "$usage"; exit $?;;
    -m) chmodcmd="$chmodprog $2"
        shift
        shift
        continue;;
    -o) chowncmd="$chownprog $2"
        shift
        shift
        continue;;
    -s) stripcmd=$stripprog
        shift
        continue;;
    -t) dstarg=$2
 	shift
 	shift
 	continue;;
    -T) no_target_directory=true
 	shift
 	continue;;
    --version) echo "$0 $scriptversion"; exit $?;;
    *)  # When -d is used, all remaining arguments are directories to create.
 	# When -t is used, the destination is already specified.
 	test -n "$dir_arg$dstarg" && break
        # Otherwise, the last argument is the destination.  Remove it from $@.
 	for arg
 	do
          if test -n "$dstarg"; then
 	    # $@ is not empty: it contains at least $arg.
 	    set fnord "$@" "$dstarg"
 	    shift # fnord
 	  fi
 	  shift # arg
 	  dstarg=$arg
 	done
 	break;;
  esac
 done
 if test -z "$1"; then
  if test -z "$dir_arg"; then
    echo "$0: no input file specified." >&2
    exit 1
  fi
  # It's OK to call `install-sh -d' without argument.
  # This can happen when creating conditional directories.
  exit 0
 fi
 for src
 do
  # Protect names starting with `-'.
  case $src in
    -*) src=./$src ;;
  esac
  if test -n "$dir_arg"; then
    dst=$src
    src=
    if test -d "$dst"; then
      mkdircmd=:
      chmodcmd=
    else
      mkdircmd=$mkdirprog
    fi
  else
    # Waiting for this to be detected by the "$cpprog $src $dsttmp" command
    # might cause directories to be created, which would be especially bad
    # if $src (and thus $dsttmp) contains '*'.
    if test ! -f "$src" && test ! -d "$src"; then
      echo "$0: $src does not exist." >&2
      exit 1
    fi
    if test -z "$dstarg"; then
      echo "$0: no destination specified." >&2
      exit 1
    fi
    dst=$dstarg
    # Protect names starting with `-'.
    case $dst in
      -*) dst=./$dst ;;
    esac
    # If destination is a directory, append the input filename; won't work
    # if double slashes aren't ignored.
    if test -d "$dst"; then
      if test -n "$no_target_directory"; then
 	echo "$0: $dstarg: Is a directory" >&2
 	exit 1
      fi
      dst=$dst/`basename "$src"`
    fi
  fi
  # This sed command emulates the dirname command.
  dstdir=`echo "$dst" | sed -e 's,/*$,,;s,[^/]*$,,;s,/*$,,;s,^$,.,'`
  # Make sure that the destination directory exists.
  # Skip lots of stat calls in the usual case.
  if test ! -d "$dstdir"; then
    defaultIFS='
 	 '
    IFS="${IFS-$defaultIFS}"
    oIFS=$IFS
    # Some sh's can't handle IFS=/ for some reason.
    IFS='%'
    set x `echo "$dstdir" | sed -e 's@/@%@g' -e 's@^%@/@'`
    shift
    IFS=$oIFS
    pathcomp=
    while test $# -ne 0 ; do
      pathcomp=$pathcomp$1
      shift
      if test ! -d "$pathcomp"; then
        $mkdirprog "$pathcomp"
 	# mkdir can fail with a `File exist' error in case several
 	# install-sh are creating the directory concurrently.  This
 	# is OK.
 	test -d "$pathcomp" || exit
      fi
      pathcomp=$pathcomp/
    done
  fi
  if test -n "$dir_arg"; then
    $doit $mkdircmd "$dst" \
      && { test -z "$chowncmd" || $doit $chowncmd "$dst"; } \
      && { test -z "$chgrpcmd" || $doit $chgrpcmd "$dst"; } \
      && { test -z "$stripcmd" || $doit $stripcmd "$dst"; } \
      && { test -z "$chmodcmd" || $doit $chmodcmd "$dst"; }
  else
    dstfile=`basename "$dst"`
    # Make a couple of temp file names in the proper directory.
    dsttmp=$dstdir/_inst.$$_
    rmtmp=$dstdir/_rm.$$_
    # Trap to clean up those temp files at exit.
    trap 'ret=$?; rm -f "$dsttmp" "$rmtmp" && exit $ret' 0
    trap '(exit $?); exit' 1 2 13 15
    # Copy the file name to the temp name.
    $doit $cpprog "$src" "$dsttmp" &&
    # and set any options; do chmod last to preserve setuid bits.
    #
    # If any of these fail, we abort the whole thing.  If we want to
    # ignore errors from any of these, just make sure not to ignore
    # errors from the above "$doit $cpprog $src $dsttmp" command.
    #
    { test -z "$chowncmd" || $doit $chowncmd "$dsttmp"; } \
      && { test -z "$chgrpcmd" || $doit $chgrpcmd "$dsttmp"; } \
      && { test -z "$stripcmd" || $doit $stripcmd "$dsttmp"; } \
      && { test -z "$chmodcmd" || $doit $chmodcmd "$dsttmp"; } &&
    # Now rename the file to the real destination.
    { $doit $mvcmd -f "$dsttmp" "$dstdir/$dstfile" 2>/dev/null \
      || {
 	   # The rename failed, perhaps because mv can't rename something else
 	   # to itself, or perhaps because mv is so ancient that it does not
 	   # support -f.
 	   # Now remove or move aside any old file at destination location.
 	   # We try this two ways since rm can't unlink itself on some
 	   # systems and the destination file might be busy for other
 	   # reasons.  In this case, the final cleanup might fail but the new
 	   # file should still install successfully.
 	   {
 	     if test -f "$dstdir/$dstfile"; then
 	       $doit $rmcmd -f "$dstdir/$dstfile" 2>/dev/null \
 	       || $doit $mvcmd -f "$dstdir/$dstfile" "$rmtmp" 2>/dev/null \
 	       || {
 		 echo "$0: cannot unlink or rename $dstdir/$dstfile" >&2
 		 (exit 1); exit 1
 	       }
 	     else
 	       :
 	     fi
 	   } &&
 	   # Now rename the file to the real destination.
 	   $doit $mvcmd "$dsttmp" "$dstdir/$dstfile"
 	 }
    }
  fi || { (exit 1); exit 1; }
 done
 # The final little trick to "correctly" pass the exit status to the exit trap.
 {
  (exit 0); exit 0
 }
 # Local variables:
 # eval: (add-hook 'write-file-hooks 'time-stamp)
 # time-stamp-start: "scriptversion="
 # time-stamp-format: "%:y-%02m-%02d.%02H"
 # time-stamp-end: "$"
 # End:
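The script above emulates dirname with a four-step sed expression: strip trailing slashes, strip the last path component, strip trailing slashes again, and replace an empty result with ".". The following C function is a sketch of those same four steps, not code from the script. The name demo_sed_dirname is made up for illustration.

```c
#include <stdio.h>
#include <string.h>

/* Illustrative re-implementation of the sed-based dirname emulation.
 * Writes the result into out (size bytes, size > 0) and returns out. */
static char *demo_sed_dirname(const char *path, char *out, size_t size)
{
    size_t n = strlen(path);

    while (n > 0 && path[n - 1] == '/')   /* step 1: drop trailing slashes */
        n--;
    while (n > 0 && path[n - 1] != '/')   /* step 2: drop the last component */
        n--;
    while (n > 0 && path[n - 1] == '/')   /* step 3: drop trailing slashes */
        n--;

    if (n == 0)                           /* step 4: empty result becomes "." */
        snprintf(out, size, ".");
    else
        snprintf(out, size, "%.*s", (int) n, path);
    return out;
}
```

One consequence of following the sed steps literally is that a path directly under the root, such as "/foo", maps to "." rather than "/", because step 3 removes the only remaining slash.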
--- a/install.doc
+++ b/install.doc
@@ -1,6 +1,6 @@
 INSTALLATION INSTRUCTIONS for the Independent JPEG Group's JPEG software
-Copyright (C) 1991-1996, Thomas G. Lane.
+Copyright (C) 1991-1998, Thomas G. Lane.
 This file is part of the Independent JPEG Group's software.
 For conditions of distribution and use, see the accompanying README file.
@@ -94,6 +94,19 @@ Configure was created with GNU Autoconf and it follows the usual conventions
 for GNU configure scripts.  It makes a few assumptions that you may want to
 override.  You can do this by providing optional switches to configure:
 * If you want to build libjpeg as a shared library, say
 	./configure --enable-shared
 To get both shared and static libraries, say
 	./configure --enable-shared --enable-static
 Note that these switches invoke GNU libtool to take care of system-dependent
 shared library building methods.  If things don't work this way, please try
 running configure without either switch; that should build a static library
 without using libtool.  If that works, your problem is probably with libtool
 not with the IJG code.  libtool is fairly new and doesn't support all flavors
 of Unix yet.  (You might be able to find a newer version of libtool than the
 one included with libjpeg; see ftp.gnu.org.  Report libtool problems to
 bug-libtool@gnu.org.)
 * Configure will use gcc (GNU C compiler) if it's available, otherwise cc.
 To force a particular compiler to be selected, use the CC option, for example
 	./configure CC='cc'
@@ -102,8 +115,10 @@ For example, on HP-UX you probably want to say
 	./configure CC='cc -Aa'
 to get HP's compiler to run in ANSI mode.
-* The default CFLAGS setting is "-O".  You can override this by saying,
+* The default CFLAGS setting is "-O" for non-gcc compilers, "-O2" for gcc.
-for example, ./configure CFLAGS='-O2'.
+You can override this by saying, for example,
 	./configure CFLAGS='-g'
 if you want to compile with debugging support.
 * Configure will set up the makefile so that "make install" will install files
 into /usr/local/bin, /usr/local/man, etc.  You can specify an installation
@@ -131,17 +146,20 @@ Makefile	jconfig file	System and/or compiler
 makefile.manx	jconfig.manx	Amiga, Manx Aztec C
 makefile.sas	jconfig.sas	Amiga, SAS C
 makeproj.mac	jconfig.mac	Apple Macintosh, Metrowerks CodeWarrior
 mak*jpeg.st	jconfig.st	Atari ST/STE/TT, Pure C or Turbo C
 makefile.bcc	jconfig.bcc	MS-DOS or OS/2, Borland C
 makefile.dj	jconfig.dj	MS-DOS, DJGPP (Delorie's port of GNU C)
-makefile.mc6	jconfig.mc6	MS-DOS, Microsoft C version 6.x and up
+makefile.mc6	jconfig.mc6	MS-DOS, Microsoft C (16-bit only)
 makefile.wat	jconfig.wat	MS-DOS, OS/2, or Windows NT, Watcom C
 makefile.vc	jconfig.vc	Windows NT/95, MS Visual C++
 make*.ds	jconfig.vc	Windows NT/95, MS Developer Studio
 makefile.mms	jconfig.vms	Digital VMS, with MMS software
 makefile.vms	jconfig.vms	Digital VMS, without MMS software
-Copy the proper jconfig file to jconfig.h and the makefile to Makefile
+Copy the proper jconfig file to jconfig.h and the makefile to Makefile (or
-(or whatever your system uses as the standard makefile name).  For the
+whatever your system uses as the standard makefile name).  For more info see
-Atari, we provide four project files; see the Atari hints below.
+the appropriate system-specific hints section near the end of this file.
 Configuring the software by hand
@@ -303,7 +321,7 @@ As a quick test of functionality we've included a small sample image in
 several forms:
 	testorig.jpg	Starting point for the djpeg tests.
 	testimg.ppm	The output of djpeg testorig.jpg
-	testimg.gif	The output of djpeg -gif testorig.jpg
+	testimg.bmp	The output of djpeg -bmp -colors 256 testorig.jpg
 	testimg.jpg	The output of cjpeg testimg.ppm
 	testprog.jpg	Progressive-mode equivalent of testorig.jpg.
 	testimgp.jpg	The output of cjpeg -progressive -optimize testimg.ppm
@@ -339,10 +357,10 @@ check fails, try recompiling with USE_SETMODE or USE_FDOPEN defined.
 If it still doesn't work, better use two-file style.
 If you chose a memory manager other than jmemnobs.c, you should test that
-temporary-file usage works.  Try "djpeg -gif -max 0 testorig.jpg" and make
+temporary-file usage works.  Try "djpeg -bmp -colors 256 -max 0 testorig.jpg"
-sure its output matches testimg.gif.  If you have any really large images
+and make sure its output matches testimg.bmp.  If you have any really large
-handy, try compressing them with -optimize and/or decompressing with -gif to
+images handy, try compressing them with -optimize and/or decompressing with
-make sure your DEFAULT_MAX_MEM setting is not too large.
+-colors 256 to make sure your DEFAULT_MAX_MEM setting is not too large.
 NOTE: this is far from an exhaustive test of the JPEG software; some modules,
 such as 1-pass color quantization, are not exercised at all.  It's just a
@@ -357,7 +375,7 @@ Once you're done with the above steps, you can install the software by
 copying the executable files (cjpeg, djpeg, jpegtran, rdjpgcom, and wrjpgcom)
 to wherever you normally install programs.  On Unix systems, you'll also want
 to put the man pages (cjpeg.1, djpeg.1, jpegtran.1, rdjpgcom.1, wrjpgcom.1)
-in the man-page directory.  The canned makefiles don't support this step
+in the man-page directory.  The pre-fab makefiles don't support this step
 since there's such a wide variety of installation procedures on different
 systems.
@@ -370,8 +388,13 @@ to see where configure thought the files should go.  You may need to edit
 the Makefile, particularly if your system's conventions for man page
 filenames don't match what configure expects.
-If you want to install the library file libjpeg.a and the include files j*.h
+If you want to install the IJG library itself, for use in compiling other
-(for use in compiling other programs besides the IJG ones), then say
+programs besides ours, then you need to put the four include files
 	jpeglib.h jerror.h jconfig.h jmorecfg.h
 into your include-file directory, and put the library file libjpeg.a
 (extension may vary depending on system) wherever library files go.
 If you generated a Makefile with "configure", it will do what it thinks
 is the right thing if you say
 	make install-lib
@@ -426,8 +449,8 @@ The PPM reader (rdppm.c) can read 12-bit data from either text-format or
 binary-format PPM and PGM files.  Binary-format PPM/PGM files which have a
 maxval greater than 255 are assumed to use 2 bytes per sample, LSB first
 (little-endian order).  As of early 1995, 2-byte binary format is not
-officially supported by the PBMPLUS library, but it is expected that the
+officially supported by the PBMPLUS library, but it is expected that a
-next release of PBMPLUS will support it.  Note that the PPM reader will
+future release of PBMPLUS will support it.  Note that the PPM reader will
 read files of any maxval regardless of the BITS_IN_JSAMPLE setting; incoming
 data is automatically rescaled to either maxval=255 or maxval=4095 as
 appropriate for the cjpeg bit depth.
@@ -568,19 +591,19 @@ Atari ST/STE/TT:
 Copy the project files makcjpeg.st, makdjpeg.st, maktjpeg.st, and makljpeg.st
 to cjpeg.prj, djpeg.prj, jpegtran.prj, and libjpeg.prj respectively.  The
 project files should work as-is with Pure C.  For Turbo C, change library
-filenames "PC..." to "TC..." in each project file.  Note that libjpeg.prj
+filenames "pc..." to "tc..." in each project file.  Note that libjpeg.prj
 selects jmemansi.c as the recommended memory manager.  You'll probably want to
 adjust the DEFAULT_MAX_MEM setting --- you want it to be a couple hundred K
 less than your normal free memory.  Put "#define DEFAULT_MAX_MEM nnnn" into
 jconfig.h to do this.
 To use the 68881/68882 coprocessor for the floating point DCT, add the
-compiler option "-8" to the project files and replace PCFLTLIB.LIB with
+compiler option "-8" to the project files and replace pcfltlib.lib with
-PC881LIB.LIB in cjpeg.prj and djpeg.prj.  Or if you don't have a
+pc881lib.lib in cjpeg.prj and djpeg.prj.  Or if you don't have a
 coprocessor, you may prefer to remove the float DCT code by undefining
 DCT_FLOAT_SUPPORTED in jmorecfg.h (since without a coprocessor, the float
 code will be too slow to be useful).  In that case, you can delete
-PCFLTLIB.LIB from the project files.
+pcfltlib.lib from the project files.
 Note that you must make libjpeg.lib before making cjpeg.ttp, djpeg.ttp,
 or jpegtran.ttp.  You'll have to perform the self-test by hand.
@@ -637,49 +660,62 @@ provide a Unix-style command line interface.  You can use this interface on
 the Mac by means of the ccommand() library routine provided by Metrowerks
 CodeWarrior or Think C.  This is only appropriate for testing the library,
 however; to make a user-friendly equivalent of cjpeg/djpeg you'd really want
-to develop a Mac-style user interface.  Such an interface exists for pre-v5
+to develop a Mac-style user interface.  There isn't a complete example
-IJG libraries (see the Think C entry, below) but at this writing it has not
+available at the moment, but there are some helpful starting points:
-been updated to work with the current release.
+1. Sam Bushell's free "To JPEG" applet provides drag-and-drop conversion to
 JPEG under System 7 and later.  This only illustrates how to use the
 compression half of the library, but it does a very nice job of that part.
 The CodeWarrior source code is available from http://www.pobox.com/~jsam.
 2. Jim Brunner prepared a Mac-style user interface for both compression and
 decompression.  Unfortunately, it hasn't been updated since IJG v4, and
 the library's API has changed considerably since then.  Still it may be of
 some help, particularly as a guide to compiling the IJG code under Think C.
 Jim's code is available from the Info-Mac archives, at sumex-aim.stanford.edu
 or mirrors thereof; see file /info-mac/dev/src/jpeg-convert-c.hqx.
-We recommend replacing "malloc" and "free" by "NewPtr" and "DisposePtr" in
+jmemmac.c is the recommended memory manager back end for Macintosh.  It uses
-whichever memory manager back end you use, because Mac C libraries often
+NewPtr/DisposePtr instead of malloc/free, and has a Mac-specific
-have inferior implementations of malloc/free.  jmemmac.c is recommended;
+implementation of jpeg_mem_available().  It also creates temporary files that
-it is a customized version of jmemansi.c with this change and a Mac-specific
+follow Mac conventions.  (That part of the code relies on System-7-or-later OS
-implementation of jpeg_mem_available().  You can also use jmemnobs.c if you
+functions.  See the comments in jmemmac.c if you need to run it on System 6.)
-don't care about handling images larger than available memory.
+NOTE that USE_MAC_MEMMGR must be defined in jconfig.h to use jmemmac.c.
-
+You can also use jmemnobs.c, if you don't care about handling images larger
-Macintosh, MPW:
+than available memory.  If you use any memory manager back end other than
-
+jmemmac.c, we recommend replacing "malloc" and "free" by "NewPtr" and
-We don't directly support MPW in the current release, but Larry Rosenstein
+"DisposePtr", because Mac C libraries often have peculiar implementations of
-ported an earlier version of the IJG code without very much trouble.  There's
+malloc/free.  (For instance, free() may not return the freed space to the
-useful notes and conversion scripts in his kit for porting PBMPLUS to MPW.
+Mac Memory Manager.  This is undesirable for the IJG code because jmemmgr.c
-You can obtain the kit by FTP to ftp.apple.com, files /pub/lsr/pbmplus-port*.
+already clumps space requests.)
 Macintosh, Metrowerks CodeWarrior:
 Metrowerks release DR2 has problems with the IJG code; don't use it.  Release
 DR3.5 or later should be OK.
 The Unix-command-line-style interface can be used by defining USE_CCOMMAND.
-You'll also need to define either TWO_FILE_COMMANDLINE (to avoid stdin/stdout)
+You'll also need to define TWO_FILE_COMMANDLINE to avoid stdin/stdout.
-or USE_FDOPEN (to make stdin/stdout work in binary mode).  See the Think C
+This means that when using the cjpeg/djpeg programs, you'll have to type the
-entry for more details.
+input and output file names in the "Arguments" text-edit box, rather than
 using the file radio buttons.  (Perhaps USE_FDOPEN or USE_SETMODE would
 eliminate the problem, but I haven't heard from anyone who's tried it.)
 On 680x0 Macs, Metrowerks defines type "double" as a 10-byte IEEE extended
 float.  jmemmgr.c won't like this: it wants sizeof(ALIGN_TYPE) to be a power
 of 2.  Add "#define ALIGN_TYPE long" to jconfig.h to eliminate the complaint.
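The power-of-2 requirement on sizeof(ALIGN_TYPE) can be illustrated with a short sketch. This is not the code in jmemmgr.c; is_power_of_two and round_up are hypothetical helpers, and the rounding trick shown is just one common reason an allocator might insist on a power-of-2 alignment unit.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: nonzero if n is a power of 2 (1, 2, 4, 8, ...). */
int is_power_of_two(size_t n)
{
  return n != 0 && (n & (n - 1)) == 0;
}

/* Hypothetical helper: round size up to the next multiple of align.
 * The mask arithmetic is only valid when align is a power of 2,
 * which is why a 10-byte or 12-byte alignment unit would not work here.
 */
size_t round_up(size_t size, size_t align)
{
  assert(is_power_of_two(align));
  return (size + align - 1) & ~(align - 1);
}
```

With a 10-byte or 12-byte "double", is_power_of_two(sizeof(double)) would be false, whereas sizeof(long) passes the check on the usual platforms.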
 The supplied configuration file jconfig.mac can be used for your jconfig.h;
 it includes all the recommended symbol definitions.  If you have AppleScript
 installed, you can run the supplied script makeproj.mac to create CodeWarrior
 project files for the library and the testbed applications, then build the
 library and applications.  (Thanks to Dan Sears and Don Agro for this nifty
 hack, which saves us from trying to maintain CodeWarrior project files as part
 of the IJG distribution...)
 Macintosh, Think C:
-Jim Brunner has prepared a Mac-style user interface for the IJG library.
+The documentation in Jim Brunner's "JPEG Convert" source code (see above)
-Unfortunately, the released version of it only works with pre-v5 libraries;
+includes detailed build instructions for Think C; it's probably somewhat
-still, it may be a useful starting point.  You can obtain Jim's additional
+out of date for the current release, but may be helpful.
 source code from the Info-Mac archives, at sumex-aim.stanford.edu or mirrors
 thereof; see file /info-mac/dev/src/jpeg-convert-c.hqx.  Jim's documentation
 also includes more detailed build instructions for Think C.
 If you want to build the minimal command line version, proceed as follows.
 You'll have to prepare project files for the programs; we don't include any
@@ -695,6 +731,9 @@ On 680x0 Macs, Think C defines type "double" as a 12-byte IEEE extended float.
 jmemmgr.c won't like this: it wants sizeof(ALIGN_TYPE) to be a power of 2.
 Add "#define ALIGN_TYPE long" to jconfig.h to eliminate the complaint.
 jconfig.mac should work as a jconfig.h configuration file for Think C,
 but the makeproj.mac AppleScript script is specific to CodeWarrior.  Sorry.
 MIPS R3000:
@@ -705,7 +744,7 @@ Note that the R3000 chip is found in workstations from DEC and others.
 MS-DOS, generic comments for 16-bit compilers:
-The IJG code is designed to be compiled in 80x86 "small" or "medium" memory
+The IJG code is designed to work well in 80x86 "small" or "medium" memory
 models (i.e., data pointers are 16 bits unless explicitly declared "far";
 code pointers can be either size).  You may be able to use small model to
 compile cjpeg or djpeg by itself, but you will probably have to use medium
@@ -721,7 +760,7 @@ The DOS-specific memory manager, jmemdos.c, should be used if possible.
 It needs some assembly-code routines which are in jmemdosa.asm; make sure
 your makefile assembles that file and includes it in the library.  If you
 don't have a suitable assembler, you can get pre-assembled object files for
-jmemdosa by FTP from ftp.uu.net: graphics/jpeg/jdosaobj.zip.  (DOS-oriented
+jmemdosa by FTP from ftp.uu.net:/graphics/jpeg/jdosaobj.zip.  (DOS-oriented
 distributions of the IJG source code often include these object files.)
 When using jmemdos.c, jconfig.h must define USE_MSDOS_MEMMGR and must set
@@ -778,31 +817,22 @@ jconfig.bcc already includes #define USE_SETMODE to make this work.
 (fdopen does not work correctly.)
 MS-DOS, DJGPP:
 Use a recent version of DJGPP (1.11 or better).  If you prefer two-file
 command line style, change the supplied jconfig.dj to define
 TWO_FILE_COMMANDLINE.  makefile.dj is set up to generate only COFF files
 (cjpeg, djpeg, etc) when you say make.  After testing, say "make exe" to
 make executables with stub.exe, or "make standalone" if you want executables
 that include go32.  You will probably need to tweak the makefile's pointer to
 go32.exe to do "make standalone".
 MS-DOS, Microsoft C:
-makefile.mc6 works with Microsoft C, Visual C++, etc.  Note that this
+makefile.mc6 works with Microsoft C, DOS Visual C++, etc.  It should only
-makefile assumes that the working copy of itself is called "makefile".
+be used if you want to build a 16-bit (small or medium memory model) program.
 If you want to call it something else, say "makefile.mak", be sure to adjust
 the dependency line that reads "$(RFILE) : makefile".  Otherwise the make
 will fail because it doesn't know how to create "makefile".  Worse, some
 releases of Microsoft's make utilities give an incorrect error message in
 this situation.
 If you want one-file command line style, just undefine TWO_FILE_COMMANDLINE.
 jconfig.mc6 already includes #define USE_SETMODE to make this work.
 (fdopen does not work correctly.)
 Note that this makefile assumes that the working copy of itself is called
 "makefile".  If you want to call it something else, say "makefile.mak",
 be sure to adjust the dependency line that reads "$(RFILE) : makefile".
 Otherwise the make will fail because it doesn't know how to create "makefile".
 Worse, some releases of Microsoft's make utilities give an incorrect error
 message in this situation.
 Old versions of MS C fail with an "out of macro expansion space" error
 because they can't cope with the macro TRACEMS8 (defined in jerror.h).
 If this happens to you, the easiest solution is to change TRACEMS8 to
@@ -813,11 +843,12 @@ Original MS C 6.0 is very buggy; it compiles incorrect code unless you turn
 off optimization entirely (remove -O from CFLAGS).  6.00A is better, but it
 still generates bad code if you enable loop optimizations (-Ol or -Ox).
-MS C 8.0 reportedly fails to compile jquant1.c if optimization is turned off
+MS C 8.0 crashes when compiling jquant1.c with optimization switch /Oo ...
-(yes, off).
+which is on by default.  To work around this bug, compile that one file
 with /Oo-.
-Microsoft Windows (all versions):
+Microsoft Windows (all versions), generic comments:
 Some Windows system include files define typedef boolean as "unsigned char".
 The IJG code also defines typedef boolean, but we make it "int" by default.
@@ -825,45 +856,86 @@ This doesn't affect the IJG programs because we don't import those Windows
 include files.  But if you use the JPEG library in your own program, and some
 of your program's files import one definition of boolean while some import the
 other, you can get all sorts of mysterious problems.  A good preventive step
-is to change jmorecfg.h to define boolean as unsigned char.  We recommend
+is to make the IJG library use "unsigned char" for boolean.  To do that,
-making that part of jmorecfg.h read like this:
+add something like this to your jconfig.h file:
 	/* Define "boolean" as unsigned char, not int, per Windows custom */
 	#ifndef __RPCNDR_H__	/* don't conflict if rpcndr.h already read */
 	typedef unsigned char boolean;
 	#endif
-In v6a and later, using incompatible definitions of boolean will usually lead
+	#define HAVE_BOOLEAN	/* prevent jmorecfg.h from redefining it */
-to the failure message "JPEG parameter struct mismatch", rather than the
+(This is already in jconfig.vc, by the way.)
-difficult-to-diagnose bugs it caused with earlier versions.
+
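To see why two definitions of boolean cannot be mixed, consider this minimal sketch. The struct and function names are hypothetical and much smaller than the real JPEG parameter structs; the point is only that the same field list has a different size and layout depending on which typedef is in effect.

```c
#include <stddef.h>

/* The same hypothetical field list, compiled under each definition. */
struct demo_params_int {            /* boolean as int (IJG default) */
  int flags[4];
  int width;
};

struct demo_params_uchar {          /* boolean as unsigned char (Windows) */
  unsigned char flags[4];
  int width;
};

/* Returns nonzero if both views agree on size and on where width lives. */
int demo_layouts_match(void)
{
  return sizeof(struct demo_params_int) == sizeof(struct demo_params_uchar)
      && offsetof(struct demo_params_int, width)
         == offsetof(struct demo_params_uchar, width);
}
```

On any platform where int is wider than one byte, demo_layouts_match() returns 0, so code built with one definition would read fields at the wrong offsets in a struct built with the other.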
 windef.h contains the declarations
 	#define far
 	#define FAR far
 Since jmorecfg.h tries to define FAR as empty, you may get a compiler
 warning if you include both jpeglib.h and windef.h (which windows.h
 includes).  To suppress the warning, you can put "#ifndef FAR"/"#endif"
 around the line "#define FAR" in jmorecfg.h.
 When using the library in a Windows application, you will almost certainly
 want to modify or replace the error handler module jerror.c, since our
 default error handler does a couple of inappropriate things:
  1. it tries to write error and warning messages on stderr;
  2. in event of a fatal error, it exits by calling exit().
 A simple stopgap solution for problem 1 is to replace the line
 	fprintf(stderr, "%s\n", buffer);
-(in output_message in jerror.c) with something like
+(in output_message in jerror.c) with
-	MessageBox(GetActiveWindow(),buffer,"JPEG Error",MB_OK);
+	MessageBox(GetActiveWindow(),buffer,"JPEG Error",MB_OK|MB_ICONERROR);
 It's highly recommended that you at least do that much, since otherwise
-error messages will disappear into nowhere.
+error messages will disappear into nowhere.  (Beginning with IJG v6b, this
 code is already present in jerror.c; just define USE_WINDOWS_MESSAGEBOX in
 jconfig.h to enable it.)
 The proper solution for problem 2 is to return control to your calling
 application after a library error.  This can be done with the setjmp/longjmp
-technique discussed in libjpeg.doc and illustrated in example.c.
+technique discussed in libjpeg.doc and illustrated in example.c.  (NOTE:
 some older Windows C compilers provide versions of setjmp/longjmp that
 don't actually work under Windows.  You may need to use the Windows system
 functions Catch and Throw instead.)
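The setjmp/longjmp technique can be sketched without the library as follows. Everything named demo_* is a hypothetical stand-in, not the IJG API; example.c remains the reference for the real error-manager fields.

```c
#include <setjmp.h>

/* Hypothetical stand-in for an error manager extended with a jmp_buf. */
struct demo_error_mgr {
  void (*error_exit) (struct demo_error_mgr *err);
  jmp_buf setjmp_buffer;
};

/* Replacement for the default handler: jump back instead of calling exit(). */
static void demo_error_exit(struct demo_error_mgr *err)
{
  longjmp(err->setjmp_buffer, 1);
}

/* Hypothetical library routine that hits a fatal error on request. */
static void demo_library_call(struct demo_error_mgr *err, int should_fail)
{
  if (should_fail)
    (*err->error_exit) (err);
}

/* Returns 0 on success, -1 if the "library" reported a fatal error. */
int demo_run(int should_fail)
{
  struct demo_error_mgr err;

  err.error_exit = demo_error_exit;
  if (setjmp(err.setjmp_buffer) != 0) {
    /* Control arrives here after longjmp; clean up and report failure. */
    return -1;
  }
  demo_library_call(&err, should_fail);
  return 0;
}
```

The application regains control at the setjmp point and can release its own resources there, rather than having the process terminated.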
 The recommended memory manager under Windows is jmemnobs.c; in other words,
 let Windows do any virtual memory management needed.  You should NOT use
 jmemdos.c nor jmemdosa.asm under Windows.
 For Windows 3.1, we recommend compiling in medium or large memory model;
 for newer Windows versions, use a 32-bit flat memory model.  (See the MS-DOS
 sections above for more info about memory models.)  In the 16-bit memory
 models only, you'll need to put
 	#define MAX_ALLOC_CHUNK 65520L	/* Maximum request to malloc() */
 into jconfig.h to limit allocation chunks to 64Kb.  (Without that, you'd
 have to use huge memory model, which slows things down unnecessarily.)
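As a rough illustration of what a 65520-byte limit means in practice, the sketch below works out how many image rows fit in one chunk. The helper names are hypothetical, and the library's own bookkeeping overhead per chunk is ignored, so treat the numbers as approximate.

```c
#define DEMO_MAX_ALLOC_CHUNK 65520L   /* same value as suggested above */

/* Hypothetical helper: rows of row_bytes bytes that fit in one chunk.
 * Returns 0 if even a single row is too large (or row_bytes is invalid).
 */
long demo_rows_per_chunk(long row_bytes)
{
  if (row_bytes <= 0)
    return 0;
  return DEMO_MAX_ALLOC_CHUNK / row_bytes;
}

/* Hypothetical helper: chunks needed to hold num_rows rows, or 0 on error. */
long demo_chunks_needed(long num_rows, long row_bytes)
{
  long per_chunk = demo_rows_per_chunk(row_bytes);

  if (per_chunk == 0 || num_rows <= 0)
    return 0;
  return (num_rows + per_chunk - 1) / per_chunk;   /* round up */
}
```

For a 640-pixel-wide RGB image (1920 bytes per row), 34 rows fit in one chunk, so 480 rows need 15 chunks.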
 jmemnobs.c works without modification in large or flat memory models, but to
 use medium model, you need to modify its jpeg_get_large and jpeg_free_large
 routines to allocate far memory.  In any case, you might like to replace
 its calls to malloc and free with direct calls on Windows memory allocation
 functions.
 You may also want to modify jdatasrc.c and jdatadst.c to use Windows file
 operations rather than fread/fwrite.  This is only necessary if your C
 compiler doesn't provide a competent implementation of C stdio functions.
 You might want to tweak the RGB_xxx macros in jmorecfg.h so that the library
 will accept or deliver color pixels in BGR sample order, not RGB; BGR order
 is usually more convenient under Windows.  Note that this change will break
 the sample applications cjpeg/djpeg, but the library itself works fine.
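A minimal sketch of the BGR tweak, assuming the macro names RGB_RED, RGB_GREEN, RGB_BLUE and RGB_PIXELSIZE used in jmorecfg.h. The values shown are the BGR variant, not the shipped defaults, and demo_pack_pixel is a hypothetical helper for inspecting the byte order.

```c
/* BGR sample order: blue is stored first in memory, red last. */
#define RGB_RED        2
#define RGB_GREEN      1
#define RGB_BLUE       0
#define RGB_PIXELSIZE  3

/* Store one pixel using the macros, then return its three bytes in
 * memory order packed as 0xB0B1B2 so the layout is easy to check.
 */
unsigned long demo_pack_pixel(unsigned char r, unsigned char g,
                              unsigned char b)
{
  unsigned char pixel[RGB_PIXELSIZE];

  pixel[RGB_RED] = r;
  pixel[RGB_GREEN] = g;
  pixel[RGB_BLUE] = b;
  return ((unsigned long) pixel[0] << 16) |
         ((unsigned long) pixel[1] << 8) |
         (unsigned long) pixel[2];
}
```

Code that indexes samples through the macros keeps working after the change; code that assumes red is at offset 0 (as the sample applications do) does not.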
 Many people want to convert the IJG library into a DLL.  This is reasonably
 straightforward, but watch out for the following:
  1. Don't try to compile as a DLL in small or medium memory model; use
 large model, or even better, 32-bit flat model.  Many places in the IJG code
 assume the address of a local variable is an ordinary (not FAR) pointer;
 that isn't true in a medium-model DLL.
  2. Microsoft C cannot pass file pointers between applications and DLLs.
 (See Microsoft Knowledge Base, PSS ID Number Q50336.)  So jdatasrc.c and
 jdatadst.c don't work if you open a file in your application and then pass
 the pointer to the DLL.  One workaround is to make jdatasrc.c/jdatadst.c
 part of your main application rather than part of the DLL.
  3. You'll probably need to modify the macros GLOBAL() and EXTERN() to
 attach suitable linkage keywords to the exported routine names.  Similarly,
 you'll want to modify METHODDEF() and JMETHOD() to ensure function pointers
@@ -871,10 +943,13 @@ are declared in a way that lets application routines be called back through
 the function pointers.  These macros are in jmorecfg.h.  Typical definitions
 for a 16-bit DLL are:
 	#define GLOBAL(type)		type _far _pascal _loadds _export
-	#define EXTERN(type)		extern type _far _pascal
+	#define EXTERN(type)		extern type _far _pascal _loadds
 	#define METHODDEF(type)		static type _far _pascal
 	#define JMETHOD(type,methodname,arglist)  \
 		type (_far _pascal *methodname) arglist
 For a 32-bit DLL you may want something like
 	#define GLOBAL(type)		__declspec(dllexport) type
 	#define EXTERN(type)		extern __declspec(dllexport) type
 Although not all the GLOBAL routines are actually intended to be called by
 the application, the performance cost of making them all DLL entry points is
 negligible.
@@ -888,6 +963,12 @@ but hasn't been very high priority --- any volunteers out there?
 Microsoft Windows, Borland C:
 The provided jconfig.bcc should work OK in a 32-bit Windows environment,
 but you'll need to tweak it in a 16-bit environment (you'd need to define
 NEED_FAR_POINTERS and MAX_ALLOC_CHUNK).  Beware that makefile.bcc will need
 alteration if you want to use it for Windows --- in particular, you should
 use jmemnobs.c not jmemdos.c under Windows.
 Borland C++ 4.5 fails with an internal compiler error when trying to compile
 jdmerge.c in 32-bit mode.  If enough people complain, perhaps Borland will fix
 it.  In the meantime, the simplest known workaround is to add a redundant
@@ -902,6 +983,57 @@ doesn't trigger the bug.
 Recent reports suggest that this bug does not occur with "bcc32a" (the
 Pentium-optimized version of the compiler).
 Another report from a user of Borland C 4.5 was that incorrect code (leading
 to a color shift in processed images) was produced if any of the following
 optimization switch combinations were used: 
 	-Ot -Og
 	-Ot -Op
 	-Ot -Om
 So try backing off on optimization if you see such a problem.  (Are there
 several different releases all numbered "4.5"??)
 Microsoft Windows, Microsoft Visual C++:
 jconfig.vc should work OK with any Microsoft compiler for a 32-bit memory
 model.  makefile.vc is intended for command-line use.  (If you are using
 the Developer Studio environment, you may prefer the DevStudio project
 files; see below.)
 Some users feel that it's easier to call the library from C++ code if you
 force VC++ to treat the library as C++ code, which you can do by renaming
 all the *.c files to *.cpp (and adjusting the makefile to match).  This
 avoids the need to put extern "C" { ... } around #include "jpeglib.h" in
 your C++ application.
 Microsoft Windows, Microsoft Developer Studio:
 We include makefiles that should work as project files in DevStudio 4.2 or
 later.  There is a library makefile that builds the IJG library as a static
 Win32 library, and an application makefile that builds the sample applications
 as Win32 console applications.  (Even if you only want the library, we
 recommend building the applications so that you can run the self-test.)
 To use:
 1. Copy jconfig.vc to jconfig.h, makelib.ds to jpeg.mak, and
   makeapps.ds to apps.mak.  (Note that the renaming is critical!)
 2. Click on the .mak files to construct project workspaces.
   (If you are using DevStudio more recent than 4.2, you'll probably
   get a message saying that the makefiles are being updated.)
 3. Build the library project, then the applications project.
 4. Move the application .exe files from `app`\Release to an
   appropriate location on your path.
 5. To perform the self-test, execute the command line
 	NMAKE /f makefile.vc  test
 OS/2, Borland C++:
 Watch out for optimization bugs in older Borland compilers; you may need
 to back off the optimization switch settings.  See the comments in
 makefile.bcc.
 SGI:
--- a/jcapimin.c
+++ b/jcapimin.c
@@ -1,7 +1,7 @@
 /*
 * jcapimin.c
 *
- * Copyright (C) 1994-1996, Thomas G. Lane.
+ * Copyright (C) 1994-1998, Thomas G. Lane.
 * This file is part of the Independent JPEG Group's software.
 * For conditions of distribution and use, see the accompanying README file.
 *
@@ -39,13 +39,18 @@ jpeg_CreateCompress (j_compress_ptr cinfo, int version, size_t structsize)
    ERREXIT2(cinfo, JERR_BAD_STRUCT_SIZE, 
 	     (int) SIZEOF(struct jpeg_compress_struct), (int) structsize);
-  /* For debugging purposes, zero the whole master structure.
+  /* For debugging purposes, we zero the whole master structure.
-   * But error manager pointer is already there, so save and restore it.
+   * But the application has already set the err pointer, and may have set
   * client_data, so we have to save and restore those fields.
   * Note: if application hasn't set client_data, tools like Purify may
   * complain here.
   */
  {
    struct jpeg_error_mgr * err = cinfo->err;
    void * client_data = cinfo->client_data; /* ignore Purify complaint here */
    MEMZERO(cinfo, SIZEOF(struct jpeg_compress_struct));
    cinfo->err = err;
    cinfo->client_data = client_data;
  }
  cinfo->is_decompressor = FALSE;
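The save-and-restore step in this hunk can be tried in isolation with the sketch below. The demo_* types are hypothetical miniatures of the real structs, and memset stands in for the library's MEMZERO macro.

```c
#include <string.h>

struct demo_error_mgr { int unused; };

/* Hypothetical miniature of the master structure. */
struct demo_object {
  struct demo_error_mgr *err;   /* set by the application beforehand */
  void *client_data;            /* optionally set by the application */
  int is_decompressor;
  int global_state;
};

/* Zero the whole structure, but keep the two application-owned pointers. */
void demo_reset(struct demo_object *obj)
{
  struct demo_error_mgr *err = obj->err;
  void *client_data = obj->client_data;

  memset(obj, 0, sizeof(*obj));
  obj->err = err;
  obj->client_data = client_data;
}

/* Returns nonzero if the pointers survive and everything else is cleared. */
int demo_reset_preserves_pointers(void)
{
  struct demo_error_mgr mgr;
  int tag = 7;
  struct demo_object obj;

  obj.err = &mgr;
  obj.client_data = &tag;
  obj.is_decompressor = 1;
  obj.global_state = 99;
  demo_reset(&obj);
  return obj.err == &mgr && obj.client_data == &tag &&
         obj.is_decompressor == 0 && obj.global_state == 0;
}
```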
@@ -66,6 +71,8 @@ jpeg_CreateCompress (j_compress_ptr cinfo, int version, size_t structsize)
    cinfo->ac_huff_tbl_ptrs[i] = NULL;
  }
  cinfo->script_space = NULL;
  cinfo->input_gamma = 1.0;	/* in case application forgets */
  /* OK, I'm ready */
@@ -185,13 +192,40 @@ GLOBAL(void)
 jpeg_write_marker (j_compress_ptr cinfo, int marker,
 		   const JOCTET *dataptr, unsigned int datalen)
 {
  JMETHOD(void, write_marker_byte, (j_compress_ptr info, int val));
  if (cinfo->next_scanline != 0 ||
      (cinfo->global_state != CSTATE_SCANNING &&
       cinfo->global_state != CSTATE_RAW_OK &&
       cinfo->global_state != CSTATE_WRCOEFS))
    ERREXIT1(cinfo, JERR_BAD_STATE, cinfo->global_state);
-  (*cinfo->marker->write_any_marker) (cinfo, marker, dataptr, datalen);
+  (*cinfo->marker->write_marker_header) (cinfo, marker, datalen);
  write_marker_byte = cinfo->marker->write_marker_byte;	/* copy for speed */
  while (datalen--) {
    (*write_marker_byte) (cinfo, *dataptr);
    dataptr++;
  }
 }
 /* Same, but piecemeal. */
 GLOBAL(void)
 jpeg_write_m_header (j_compress_ptr cinfo, int marker, unsigned int datalen)
 {
  if (cinfo->next_scanline != 0 ||
      (cinfo->global_state != CSTATE_SCANNING &&
       cinfo->global_state != CSTATE_RAW_OK &&
       cinfo->global_state != CSTATE_WRCOEFS))
    ERREXIT1(cinfo, JERR_BAD_STATE, cinfo->global_state);
  (*cinfo->marker->write_marker_header) (cinfo, marker, datalen);
 }
 GLOBAL(void)
 jpeg_write_m_byte (j_compress_ptr cinfo, int val)
 {
  (*cinfo->marker->write_marker_byte) (cinfo, val);
 }
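The header-then-bytes pattern introduced here can be sketched against an in-memory buffer. The demo_* names are hypothetical and do not call the library. The byte layout follows the JPEG marker segment format: 0xFF, the marker code, then a two-byte big-endian length that counts itself plus the data.

```c
#include <stddef.h>

/* Hypothetical output sink standing in for the destination manager. */
struct demo_sink {
  unsigned char buf[64];
  size_t len;
};

static void demo_emit_byte(struct demo_sink *s, int val)
{
  if (s->len < sizeof(s->buf))
    s->buf[s->len++] = (unsigned char) val;
}

/* Analogue of the header call: marker code plus length field. */
void demo_write_m_header(struct demo_sink *s, int marker,
                         unsigned int datalen)
{
  unsigned int seglen = datalen + 2;   /* length field includes itself */

  demo_emit_byte(s, 0xFF);
  demo_emit_byte(s, marker);
  demo_emit_byte(s, (int) ((seglen >> 8) & 0xFF));
  demo_emit_byte(s, (int) (seglen & 0xFF));
}

/* Write a COM (0xFE) segment piecemeal; returns total bytes emitted. */
size_t demo_write_comment(struct demo_sink *s, const char *text,
                          unsigned int n)
{
  unsigned int i;

  s->len = 0;
  demo_write_m_header(s, 0xFE, n);
  for (i = 0; i < n; i++)
    demo_emit_byte(s, (unsigned char) text[i]);
  return s->len;
}
```

The piecemeal form is convenient when the marker contents are generated on the fly and never exist as one contiguous buffer.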
@@ -231,6 +265,16 @@ jpeg_write_tables (j_compress_ptr cinfo)
  (*cinfo->marker->write_tables_only) (cinfo);
  /* And clean up. */
  (*cinfo->dest->term_destination) (cinfo);
-  /* We can use jpeg_abort to release memory. */
+  /*
-  jpeg_abort((j_common_ptr) cinfo);
+   * In library releases up through v6a, we called jpeg_abort() here to free
   * any working memory allocated by the destination manager and marker
   * writer.  Some applications had a problem with that: they allocated space
   * of their own from the library memory manager, and didn't want it to go
   * away during write_tables.  So now we do nothing.  This will cause a
   * memory leak if an app calls write_tables repeatedly without doing a full
   * compression cycle or otherwise resetting the JPEG object.  However, that
   * seems less bad than unexpectedly freeing memory in the normal case.
   * An app that prefers the old behavior can call jpeg_abort for itself after
   * each call to jpeg_write_tables().
   */
 }
--- a/jccoefct.c
+++ b/jccoefct.c
@@ -1,7 +1,7 @@
 /*
 * jccoefct.c
 *
- * Copyright (C) 1994-1996, Thomas G. Lane.
+ * Copyright (C) 1994-1997, Thomas G. Lane.
 * This file is part of the Independent JPEG Group's software.
 * For conditions of distribution and use, see the accompanying README file.
 *
@@ -135,8 +135,8 @@ start_pass_coef (j_compress_ptr cinfo, J_BUF_MODE pass_mode)
 * per call, ie, v_samp_factor block rows for each component in the image.
 * Returns TRUE if the iMCU row is completed, FALSE if suspended.
 *
- * NB: input_buf contains a plane for each component in image.
+ * NB: input_buf contains a plane for each component in image,
- * For single pass, this is the same as the components in the scan.
+ * which we index according to the component's SOF position.
 */
 METHODDEF(boolean)
@@ -175,7 +175,8 @@ compress_data (j_compress_ptr cinfo, JSAMPIMAGE input_buf)
 	  if (coef->iMCU_row_num < last_iMCU_row ||
 	      yoffset+yindex < compptr->last_row_height) {
 	    (*cinfo->fdct->forward_DCT) (cinfo, compptr,
-					 input_buf[ci], coef->MCU_buffer[blkn],
+					 input_buf[compptr->component_index],
 					 coef->MCU_buffer[blkn],
 					 ypos, xpos, (JDIMENSION) blockcnt);
 	    if (blockcnt < compptr->MCU_width) {
 	      /* Create some dummy blocks at the right edge of the image. */
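The jccoefct.c change above swaps the scan-relative loop index `ci` for `compptr->component_index` when picking the input plane. A minimal sketch of why that matters, assuming a scan that carries only a subset of the frame's components: `sk_component` and `plane_for_scan_component` are hypothetical stand-ins for illustration, not library names.

```c
#include <stddef.h>

/* Hypothetical, pared-down stand-in for the library's per-component record. */
typedef struct {
  int component_index;   /* position of the component in the frame (SOF) */
} sk_component;

/*
 * Return the plane that the ci-th component of the current scan reads from.
 * planes[] holds one plane per component in the image, in SOF order;
 * scan_comps[] lists only the components that take part in this scan.
 */
static const unsigned char *
plane_for_scan_component(const unsigned char *const planes[],
                         const sk_component *const scan_comps[], int ci)
{
  /* Indexing planes[ci] directly is only right when the scan holds every
   * component in SOF order; going through component_index is always right. */
  return planes[scan_comps[ci]->component_index];
}
```

With a three-component image and a scan that contains only the third component, `ci` is 0 but the plane wanted is `planes[2]`, which is what the indirection returns.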
--- a/jccolmmx.asm
+++ b/jccolmmx.asm
@@ -0,0 +1,513 @@
 ;
 ; jccolmmx.asm - colorspace conversion (MMX)
 ;
 ; x86 SIMD extension for IJG JPEG library
 ; Copyright (C) 1999-2006, MIYASAKA Masaru.
 ; For conditions of distribution and use, see copyright notice in jsimdext.inc
 ;
 ; This file should be assembled with NASM (Netwide Assembler); it
 ; can *not* be assembled with Microsoft's MASM or any compatible
 ; assembler (including Borland's Turbo Assembler).
 ; NASM is available from http://nasm.sourceforge.net/ or
 ; http://sourceforge.net/project/showfiles.php?group_id=6208
 ;
 ; Last Modified : February 4, 2006
 ;
 ; [TAB8]
 %include "jsimdext.inc"
 %include "jcolsamp.inc"
 %if RGB_PIXELSIZE == 3 || RGB_PIXELSIZE == 4
 %ifdef JCCOLOR_RGBYCC_MMX_SUPPORTED
 ; --------------------------------------------------------------------------
 %define SCALEBITS	16
 F_0_081	equ	 5329			; FIX(0.08131)
 F_0_114	equ	 7471			; FIX(0.11400)
 F_0_168	equ	11059			; FIX(0.16874)
 F_0_250	equ	16384			; FIX(0.25000)
 F_0_299	equ	19595			; FIX(0.29900)
 F_0_331	equ	21709			; FIX(0.33126)
 F_0_418	equ	27439			; FIX(0.41869)
 F_0_587	equ	38470			; FIX(0.58700)
 F_0_337	equ	(F_0_587 - F_0_250)	; FIX(0.58700) - FIX(0.25000)
 ; --------------------------------------------------------------------------
 	SECTION	SEG_CONST
 	alignz	16
 	global	EXTN(jconst_rgb_ycc_convert_mmx)
 EXTN(jconst_rgb_ycc_convert_mmx):
 PW_F0299_F0337	times 2 dw  F_0_299, F_0_337
 PW_F0114_F0250	times 2 dw  F_0_114, F_0_250
 PW_MF016_MF033	times 2 dw -F_0_168,-F_0_331
 PW_MF008_MF041	times 2 dw -F_0_081,-F_0_418
 PD_ONEHALFM1_CJ	times 2 dd  (1 << (SCALEBITS-1)) - 1 + (CENTERJSAMPLE << SCALEBITS)
 PD_ONEHALF	times 2 dd  (1 << (SCALEBITS-1))
 	alignz	16
 ; --------------------------------------------------------------------------
 	SECTION	SEG_TEXT
 	BITS	32
 ;
 ; Convert some rows of samples to the output colorspace.
 ;
 ; GLOBAL(void)
 ; jpeg_rgb_ycc_convert_mmx (j_compress_ptr cinfo,
 ;                           JSAMPARRAY input_buf, JSAMPIMAGE output_buf,
 ;                           JDIMENSION output_row, int num_rows);
 ;
 %define cinfo(b)	(b)+8		; j_compress_ptr cinfo
 %define input_buf(b)	(b)+12		; JSAMPARRAY input_buf
 %define output_buf(b)	(b)+16		; JSAMPIMAGE output_buf
 %define output_row(b)	(b)+20		; JDIMENSION output_row
 %define num_rows(b)	(b)+24		; int num_rows
 %define original_ebp	ebp+0
 %define wk(i)		ebp-(WK_NUM-(i))*SIZEOF_MMWORD	; mmword wk[WK_NUM]
 %define WK_NUM		8
 %define gotptr		wk(0)-SIZEOF_POINTER	; void * gotptr
 	align	16
 	global	EXTN(jpeg_rgb_ycc_convert_mmx)
 EXTN(jpeg_rgb_ycc_convert_mmx):
 	push	ebp
 	mov	eax,esp				; eax = original ebp
 	sub	esp, byte 4
 	and	esp, byte (-SIZEOF_MMWORD)	; align to 64 bits
 	mov	[esp],eax
 	mov	ebp,esp				; ebp = aligned ebp
 	lea	esp, [wk(0)]
 	pushpic	eax		; make room for the GOT address
 	push	ebx
 ;	push	ecx		; need not be preserved
 ;	push	edx		; need not be preserved
 	push	esi
 	push	edi
 	get_GOT	ebx			; get GOT address
 	movpic	POINTER [gotptr], ebx	; save GOT address
 	mov	ecx, POINTER [cinfo(eax)]
 	mov	ecx, JDIMENSION [jcstruct_image_width(ecx)]	; num_cols
 	test	ecx,ecx
 	jz	near .return
 	push	ecx
 	mov	esi, JSAMPIMAGE [output_buf(eax)]
 	mov	ecx, JDIMENSION [output_row(eax)]
 	mov	edi, JSAMPARRAY [esi+0*SIZEOF_JSAMPARRAY]
 	mov	ebx, JSAMPARRAY [esi+1*SIZEOF_JSAMPARRAY]
 	mov	edx, JSAMPARRAY [esi+2*SIZEOF_JSAMPARRAY]
 	lea	edi, [edi+ecx*SIZEOF_JSAMPROW]
 	lea	ebx, [ebx+ecx*SIZEOF_JSAMPROW]
 	lea	edx, [edx+ecx*SIZEOF_JSAMPROW]
 	pop	ecx
 	mov	esi, JSAMPARRAY [input_buf(eax)]
 	mov	eax, INT [num_rows(eax)]
 	test	eax,eax
 	jle	near .return
 	alignx	16,7
 .rowloop:
 	pushpic	eax
 	push	edx
 	push	ebx
 	push	edi
 	push	esi
 	push	ecx			; col
 	mov	esi, JSAMPROW [esi]	; inptr
 	mov	edi, JSAMPROW [edi]	; outptr0
 	mov	ebx, JSAMPROW [ebx]	; outptr1
 	mov	edx, JSAMPROW [edx]	; outptr2
 	movpic	eax, POINTER [gotptr]	; load GOT address (eax)
 	cmp	ecx, byte SIZEOF_MMWORD
 	jae	short .columnloop
 	alignx	16,7
 %if RGB_PIXELSIZE == 3 ; ---------------
 .column_ld1:
 	push	eax
 	push	edx
 	lea	ecx,[ecx+ecx*2]		; imul ecx,RGB_PIXELSIZE
 	test	cl, SIZEOF_BYTE
 	jz	short .column_ld2
 	sub	ecx, byte SIZEOF_BYTE
 	xor	eax,eax
 	mov	al, BYTE [esi+ecx]
 .column_ld2:
 	test	cl, SIZEOF_WORD
 	jz	short .column_ld4
 	sub	ecx, byte SIZEOF_WORD
 	xor	edx,edx
 	mov	dx, WORD [esi+ecx]
 	shl	eax, WORD_BIT
 	or	eax,edx
 .column_ld4:
 	movd	mmA,eax
 	pop	edx
 	pop	eax
 	test	cl, SIZEOF_DWORD
 	jz	short .column_ld8
 	sub	ecx, byte SIZEOF_DWORD
 	movd	mmG, DWORD [esi+ecx]
 	psllq	mmA, DWORD_BIT
 	por	mmA,mmG
 .column_ld8:
 	test	cl, SIZEOF_MMWORD
 	jz	short .column_ld16
 	movq	mmG,mmA
 	movq	mmA, MMWORD [esi+0*SIZEOF_MMWORD]
 	mov	ecx, SIZEOF_MMWORD
 	jmp	short .rgb_ycc_cnv
 .column_ld16:
 	test	cl, 2*SIZEOF_MMWORD
 	mov	ecx, SIZEOF_MMWORD
 	jz	short .rgb_ycc_cnv
 	movq	mmF,mmA
 	movq	mmA, MMWORD [esi+0*SIZEOF_MMWORD]
 	movq	mmG, MMWORD [esi+1*SIZEOF_MMWORD]
 	jmp	short .rgb_ycc_cnv
 	alignx	16,7
 .columnloop:
 	movq	mmA, MMWORD [esi+0*SIZEOF_MMWORD]
 	movq	mmG, MMWORD [esi+1*SIZEOF_MMWORD]
 	movq	mmF, MMWORD [esi+2*SIZEOF_MMWORD]
 .rgb_ycc_cnv:
 	; mmA=(00 10 20 01 11 21 02 12)
 	; mmG=(22 03 13 23 04 14 24 05)
 	; mmF=(15 25 06 16 26 07 17 27)
 	movq      mmD,mmA
 	psllq     mmA,4*BYTE_BIT	; mmA=(-- -- -- -- 00 10 20 01)
 	psrlq     mmD,4*BYTE_BIT	; mmD=(11 21 02 12 -- -- -- --)
 	punpckhbw mmA,mmG		; mmA=(00 04 10 14 20 24 01 05)
 	psllq     mmG,4*BYTE_BIT	; mmG=(-- -- -- -- 22 03 13 23)
 	punpcklbw mmD,mmF		; mmD=(11 15 21 25 02 06 12 16)
 	punpckhbw mmG,mmF		; mmG=(22 26 03 07 13 17 23 27)
 	movq      mmE,mmA
 	psllq     mmA,4*BYTE_BIT	; mmA=(-- -- -- -- 00 04 10 14)
 	psrlq     mmE,4*BYTE_BIT	; mmE=(20 24 01 05 -- -- -- --)
 	punpckhbw mmA,mmD		; mmA=(00 02 04 06 10 12 14 16)
 	psllq     mmD,4*BYTE_BIT	; mmD=(-- -- -- -- 11 15 21 25)
 	punpcklbw mmE,mmG		; mmE=(20 22 24 26 01 03 05 07)
 	punpckhbw mmD,mmG		; mmD=(11 13 15 17 21 23 25 27)
 	pxor      mmH,mmH
 	movq      mmC,mmA
 	punpcklbw mmA,mmH		; mmA=(00 02 04 06)
 	punpckhbw mmC,mmH		; mmC=(10 12 14 16)
 	movq      mmB,mmE
 	punpcklbw mmE,mmH		; mmE=(20 22 24 26)
 	punpckhbw mmB,mmH		; mmB=(01 03 05 07)
 	movq      mmF,mmD
 	punpcklbw mmD,mmH		; mmD=(11 13 15 17)
 	punpckhbw mmF,mmH		; mmF=(21 23 25 27)
 %else ; RGB_PIXELSIZE == 4 ; -----------
 .column_ld1:
 	test	cl, SIZEOF_MMWORD/8
 	jz	short .column_ld2
 	sub	ecx, byte SIZEOF_MMWORD/8
 	movd	mmA, DWORD [esi+ecx*RGB_PIXELSIZE]
 .column_ld2:
 	test	cl, SIZEOF_MMWORD/4
 	jz	short .column_ld4
 	sub	ecx, byte SIZEOF_MMWORD/4
 	movq	mmF,mmA
 	movq	mmA, MMWORD [esi+ecx*RGB_PIXELSIZE]
 .column_ld4:
 	test	cl, SIZEOF_MMWORD/2
 	mov	ecx, SIZEOF_MMWORD
 	jz	short .rgb_ycc_cnv
 	movq	mmD,mmA
 	movq	mmC,mmF
 	movq	mmA, MMWORD [esi+0*SIZEOF_MMWORD]
 	movq	mmF, MMWORD [esi+1*SIZEOF_MMWORD]
 	jmp	short .rgb_ycc_cnv
 	alignx	16,7
 .columnloop:
 	movq	mmA, MMWORD [esi+0*SIZEOF_MMWORD]
 	movq	mmF, MMWORD [esi+1*SIZEOF_MMWORD]
 	movq	mmD, MMWORD [esi+2*SIZEOF_MMWORD]
 	movq	mmC, MMWORD [esi+3*SIZEOF_MMWORD]
 .rgb_ycc_cnv:
 	; mmA=(00 10 20 30 01 11 21 31)
 	; mmF=(02 12 22 32 03 13 23 33)
 	; mmD=(04 14 24 34 05 15 25 35)
 	; mmC=(06 16 26 36 07 17 27 37)
 	movq      mmB,mmA
 	punpcklbw mmA,mmF		; mmA=(00 02 10 12 20 22 30 32)
 	punpckhbw mmB,mmF		; mmB=(01 03 11 13 21 23 31 33)
 	movq      mmG,mmD
 	punpcklbw mmD,mmC		; mmD=(04 06 14 16 24 26 34 36)
 	punpckhbw mmG,mmC		; mmG=(05 07 15 17 25 27 35 37)
 	movq      mmE,mmA
 	punpcklwd mmA,mmD		; mmA=(00 02 04 06 10 12 14 16)
 	punpckhwd mmE,mmD		; mmE=(20 22 24 26 30 32 34 36)
 	movq      mmH,mmB
 	punpcklwd mmB,mmG		; mmB=(01 03 05 07 11 13 15 17)
 	punpckhwd mmH,mmG		; mmH=(21 23 25 27 31 33 35 37)
 	pxor      mmF,mmF
 	movq      mmC,mmA
 	punpcklbw mmA,mmF		; mmA=(00 02 04 06)
 	punpckhbw mmC,mmF		; mmC=(10 12 14 16)
 	movq      mmD,mmB
 	punpcklbw mmB,mmF		; mmB=(01 03 05 07)
 	punpckhbw mmD,mmF		; mmD=(11 13 15 17)
 	movq      mmG,mmE
 	punpcklbw mmE,mmF		; mmE=(20 22 24 26)
 	punpckhbw mmG,mmF		; mmG=(30 32 34 36)
 	punpcklbw mmF,mmH
 	punpckhbw mmH,mmH
 	psrlw     mmF,BYTE_BIT		; mmF=(21 23 25 27)
 	psrlw     mmH,BYTE_BIT		; mmH=(31 33 35 37)
 %endif ; RGB_PIXELSIZE ; ---------------
 	; mm0=(R0 R2 R4 R6)=RE, mm2=(G0 G2 G4 G6)=GE, mm4=(B0 B2 B4 B6)=BE
 	; mm1=(R1 R3 R5 R7)=RO, mm3=(G1 G3 G5 G7)=GO, mm5=(B1 B3 B5 B7)=BO
 	; (Original)
 	; Y  =  0.29900 * R + 0.58700 * G + 0.11400 * B
 	; Cb = -0.16874 * R - 0.33126 * G + 0.50000 * B + CENTERJSAMPLE
 	; Cr =  0.50000 * R - 0.41869 * G - 0.08131 * B + CENTERJSAMPLE
 	;
 	; (This implementation)
 	; Y  =  0.29900 * R + 0.33700 * G + 0.11400 * B + 0.25000 * G
 	; Cb = -0.16874 * R - 0.33126 * G + 0.50000 * B + CENTERJSAMPLE
 	; Cr =  0.50000 * R - 0.41869 * G - 0.08131 * B + CENTERJSAMPLE
 	movq      MMWORD [wk(0)], mm0	; wk(0)=RE
 	movq      MMWORD [wk(1)], mm1	; wk(1)=RO
 	movq      MMWORD [wk(2)], mm4	; wk(2)=BE
 	movq      MMWORD [wk(3)], mm5	; wk(3)=BO
 	movq      mm6,mm1
 	punpcklwd mm1,mm3
 	punpckhwd mm6,mm3
 	movq      mm7,mm1
 	movq      mm4,mm6
 	pmaddwd   mm1,[GOTOFF(eax,PW_F0299_F0337)] ; mm1=ROL*FIX(0.299)+GOL*FIX(0.337)
 	pmaddwd   mm6,[GOTOFF(eax,PW_F0299_F0337)] ; mm6=ROH*FIX(0.299)+GOH*FIX(0.337)
 	pmaddwd   mm7,[GOTOFF(eax,PW_MF016_MF033)] ; mm7=ROL*-FIX(0.168)+GOL*-FIX(0.331)
 	pmaddwd   mm4,[GOTOFF(eax,PW_MF016_MF033)] ; mm4=ROH*-FIX(0.168)+GOH*-FIX(0.331)
 	movq      MMWORD [wk(4)], mm1	; wk(4)=ROL*FIX(0.299)+GOL*FIX(0.337)
 	movq      MMWORD [wk(5)], mm6	; wk(5)=ROH*FIX(0.299)+GOH*FIX(0.337)
 	pxor      mm1,mm1
 	pxor      mm6,mm6
 	punpcklwd mm1,mm5		; mm1=BOL
 	punpckhwd mm6,mm5		; mm6=BOH
 	psrld     mm1,1			; mm1=BOL*FIX(0.500)
 	psrld     mm6,1			; mm6=BOH*FIX(0.500)
 	movq      mm5,[GOTOFF(eax,PD_ONEHALFM1_CJ)] ; mm5=[PD_ONEHALFM1_CJ]
 	paddd     mm7,mm1
 	paddd     mm4,mm6
 	paddd     mm7,mm5
 	paddd     mm4,mm5
 	psrld     mm7,SCALEBITS		; mm7=CbOL
 	psrld     mm4,SCALEBITS		; mm4=CbOH
 	packssdw  mm7,mm4		; mm7=CbO
 	movq      mm1, MMWORD [wk(2)]	; mm1=BE
 	movq      mm6,mm0
 	punpcklwd mm0,mm2
 	punpckhwd mm6,mm2
 	movq      mm5,mm0
 	movq      mm4,mm6
 	pmaddwd   mm0,[GOTOFF(eax,PW_F0299_F0337)] ; mm0=REL*FIX(0.299)+GEL*FIX(0.337)
 	pmaddwd   mm6,[GOTOFF(eax,PW_F0299_F0337)] ; mm6=REH*FIX(0.299)+GEH*FIX(0.337)
 	pmaddwd   mm5,[GOTOFF(eax,PW_MF016_MF033)] ; mm5=REL*-FIX(0.168)+GEL*-FIX(0.331)
 	pmaddwd   mm4,[GOTOFF(eax,PW_MF016_MF033)] ; mm4=REH*-FIX(0.168)+GEH*-FIX(0.331)
 	movq      MMWORD [wk(6)], mm0	; wk(6)=REL*FIX(0.299)+GEL*FIX(0.337)
 	movq      MMWORD [wk(7)], mm6	; wk(7)=REH*FIX(0.299)+GEH*FIX(0.337)
 	pxor      mm0,mm0
 	pxor      mm6,mm6
 	punpcklwd mm0,mm1		; mm0=BEL
 	punpckhwd mm6,mm1		; mm6=BEH
 	psrld     mm0,1			; mm0=BEL*FIX(0.500)
 	psrld     mm6,1			; mm6=BEH*FIX(0.500)
 	movq      mm1,[GOTOFF(eax,PD_ONEHALFM1_CJ)] ; mm1=[PD_ONEHALFM1_CJ]
 	paddd     mm5,mm0
 	paddd     mm4,mm6
 	paddd     mm5,mm1
 	paddd     mm4,mm1
 	psrld     mm5,SCALEBITS		; mm5=CbEL
 	psrld     mm4,SCALEBITS		; mm4=CbEH
 	packssdw  mm5,mm4		; mm5=CbE
 	psllw     mm7,BYTE_BIT
 	por       mm5,mm7		; mm5=Cb
 	movq      MMWORD [ebx], mm5	; Save Cb
 	movq      mm0, MMWORD [wk(3)]	; mm0=BO
 	movq      mm6, MMWORD [wk(2)]	; mm6=BE
 	movq      mm1, MMWORD [wk(1)]	; mm1=RO
 	movq      mm4,mm0
 	punpcklwd mm0,mm3
 	punpckhwd mm4,mm3
 	movq      mm7,mm0
 	movq      mm5,mm4
 	pmaddwd   mm0,[GOTOFF(eax,PW_F0114_F0250)] ; mm0=BOL*FIX(0.114)+GOL*FIX(0.250)
 	pmaddwd   mm4,[GOTOFF(eax,PW_F0114_F0250)] ; mm4=BOH*FIX(0.114)+GOH*FIX(0.250)
 	pmaddwd   mm7,[GOTOFF(eax,PW_MF008_MF041)] ; mm7=BOL*-FIX(0.081)+GOL*-FIX(0.418)
 	pmaddwd   mm5,[GOTOFF(eax,PW_MF008_MF041)] ; mm5=BOH*-FIX(0.081)+GOH*-FIX(0.418)
 	movq      mm3,[GOTOFF(eax,PD_ONEHALF)]	; mm3=[PD_ONEHALF]
 	paddd     mm0, MMWORD [wk(4)]
 	paddd     mm4, MMWORD [wk(5)]
 	paddd     mm0,mm3
 	paddd     mm4,mm3
 	psrld     mm0,SCALEBITS		; mm0=YOL
 	psrld     mm4,SCALEBITS		; mm4=YOH
 	packssdw  mm0,mm4		; mm0=YO
 	pxor      mm3,mm3
 	pxor      mm4,mm4
 	punpcklwd mm3,mm1		; mm3=ROL
 	punpckhwd mm4,mm1		; mm4=ROH
 	psrld     mm3,1			; mm3=ROL*FIX(0.500)
 	psrld     mm4,1			; mm4=ROH*FIX(0.500)
 	movq      mm1,[GOTOFF(eax,PD_ONEHALFM1_CJ)] ; mm1=[PD_ONEHALFM1_CJ]
 	paddd     mm7,mm3
 	paddd     mm5,mm4
 	paddd     mm7,mm1
 	paddd     mm5,mm1
 	psrld     mm7,SCALEBITS		; mm7=CrOL
 	psrld     mm5,SCALEBITS		; mm5=CrOH
 	packssdw  mm7,mm5		; mm7=CrO
 	movq      mm3, MMWORD [wk(0)]	; mm3=RE
 	movq      mm4,mm6
 	punpcklwd mm6,mm2
 	punpckhwd mm4,mm2
 	movq      mm1,mm6
 	movq      mm5,mm4
 	pmaddwd   mm6,[GOTOFF(eax,PW_F0114_F0250)] ; mm6=BEL*FIX(0.114)+GEL*FIX(0.250)
 	pmaddwd   mm4,[GOTOFF(eax,PW_F0114_F0250)] ; mm4=BEH*FIX(0.114)+GEH*FIX(0.250)
 	pmaddwd   mm1,[GOTOFF(eax,PW_MF008_MF041)] ; mm1=BEL*-FIX(0.081)+GEL*-FIX(0.418)
 	pmaddwd   mm5,[GOTOFF(eax,PW_MF008_MF041)] ; mm5=BEH*-FIX(0.081)+GEH*-FIX(0.418)
 	movq      mm2,[GOTOFF(eax,PD_ONEHALF)]	; mm2=[PD_ONEHALF]
 	paddd     mm6, MMWORD [wk(6)]
 	paddd     mm4, MMWORD [wk(7)]
 	paddd     mm6,mm2
 	paddd     mm4,mm2
 	psrld     mm6,SCALEBITS		; mm6=YEL
 	psrld     mm4,SCALEBITS		; mm4=YEH
 	packssdw  mm6,mm4		; mm6=YE
 	psllw     mm0,BYTE_BIT
 	por       mm6,mm0		; mm6=Y
 	movq      MMWORD [edi], mm6	; Save Y
 	pxor      mm2,mm2
 	pxor      mm4,mm4
 	punpcklwd mm2,mm3		; mm2=REL
 	punpckhwd mm4,mm3		; mm4=REH
 	psrld     mm2,1			; mm2=REL*FIX(0.500)
 	psrld     mm4,1			; mm4=REH*FIX(0.500)
 	movq      mm0,[GOTOFF(eax,PD_ONEHALFM1_CJ)] ; mm0=[PD_ONEHALFM1_CJ]
 	paddd     mm1,mm2
 	paddd     mm5,mm4
 	paddd     mm1,mm0
 	paddd     mm5,mm0
 	psrld     mm1,SCALEBITS		; mm1=CrEL
 	psrld     mm5,SCALEBITS		; mm5=CrEH
 	packssdw  mm1,mm5		; mm1=CrE
 	psllw     mm7,BYTE_BIT
 	por       mm1,mm7		; mm1=Cr
 	movq      MMWORD [edx], mm1	; Save Cr
 	sub	ecx, byte SIZEOF_MMWORD
 	add	esi, byte RGB_PIXELSIZE*SIZEOF_MMWORD	; inptr
 	add	edi, byte SIZEOF_MMWORD			; outptr0
 	add	ebx, byte SIZEOF_MMWORD			; outptr1
 	add	edx, byte SIZEOF_MMWORD			; outptr2
 	cmp	ecx, byte SIZEOF_MMWORD
 	jae	near .columnloop
 	test	ecx,ecx
 	jnz	near .column_ld1
 	pop	ecx			; col
 	pop	esi
 	pop	edi
 	pop	ebx
 	pop	edx
 	poppic	eax
 	add	esi, byte SIZEOF_JSAMPROW	; input_buf
 	add	edi, byte SIZEOF_JSAMPROW
 	add	ebx, byte SIZEOF_JSAMPROW
 	add	edx, byte SIZEOF_JSAMPROW
 	dec	eax				; num_rows
 	jg	near .rowloop
 	emms		; empty MMX state
 .return:
 	pop	edi
 	pop	esi
 ;	pop	edx		; need not be preserved
 ;	pop	ecx		; need not be preserved
 	pop	ebx
 	mov	esp,ebp		; esp <- aligned ebp
 	pop	esp		; esp <- original ebp
 	pop	ebp
 	ret
 %endif ; JCCOLOR_RGBYCC_MMX_SUPPORTED
 %endif ; RGB_PIXELSIZE == 3 || RGB_PIXELSIZE == 4
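The fixed-point arithmetic in jccolmmx.asm can be followed one pixel at a time in C. The sketch below is an illustration under stated assumptions, not the library's code: `madd16` and `rgb_to_ycc_pixel` are hypothetical names, `CENTERJSAMPLE` is assumed to be 128 (8-bit samples), and `madd16` models a single 32-bit lane of `pmaddwd`. Because `pmaddwd` multiplies signed 16-bit words, FIX(0.587) = 38470 does not fit in one operand, which is presumably why the routine splits the green weight into 0.337 + 0.250.

```c
#include <assert.h>
#include <stdint.h>

#define SCALEBITS     16
#define CENTERJSAMPLE 128   /* assumption: 8-bit samples */

/* Same values as the F_0_xxx equates in the assembly source. */
enum {
  F_0_081 = 5329,  F_0_114 = 7471,  F_0_168 = 11059, F_0_250 = 16384,
  F_0_299 = 19595, F_0_331 = 21709, F_0_418 = 27439, F_0_587 = 38470,
  F_0_337 = F_0_587 - F_0_250
};

/* Hypothetical helper: one 32-bit lane of pmaddwd (two signed 16x16 products). */
static int32_t madd16(int16_t a, int16_t ca, int16_t b, int16_t cb)
{
  return (int32_t)a * ca + (int32_t)b * cb;
}

/* Convert one RGB pixel; out[] receives Y, Cb, Cr in that order. */
static void rgb_to_ycc_pixel(uint8_t r, uint8_t g, uint8_t b, uint8_t out[3])
{
  const uint32_t one_half = 1u << (SCALEBITS - 1);            /* PD_ONEHALF */
  const uint32_t half_m1_cj =                                  /* PD_ONEHALFM1_CJ */
      one_half - 1 + ((uint32_t)CENTERJSAMPLE << SCALEBITS);

  /* Both halves of the split green weight fit in a signed word. */
  assert(F_0_337 <= INT16_MAX && F_0_250 <= INT16_MAX);

  /* Y = 0.299 R + 0.337 G + 0.114 B + 0.250 G, rounded to nearest. */
  uint32_t y = (uint32_t)madd16(r, F_0_299, g, F_0_337)
             + (uint32_t)madd16(b, F_0_114, g, F_0_250) + one_half;

  /* The 0.5 terms are formed as (sample << 16) >> 1, which is what
   * punpcklwd against a zeroed register followed by psrld 1 produces. */
  uint32_t cb = (uint32_t)madd16(r, -F_0_168, g, -F_0_331)
              + (((uint32_t)b << 16) >> 1) + half_m1_cj;
  uint32_t cr = (uint32_t)madd16(b, -F_0_081, g, -F_0_418)
              + (((uint32_t)r << 16) >> 1) + half_m1_cj;

  out[0] = (uint8_t)(y  >> SCALEBITS);
  out[1] = (uint8_t)(cb >> SCALEBITS);
  out[2] = (uint8_t)(cr >> SCALEBITS);
}
```

The negative products are added in unsigned 32-bit arithmetic, mirroring `paddd`; the final sums stay within 0..2^24-1 for 8-bit inputs, so the logical right shift by SCALEBITS yields a value in 0..255. The assembly computes the same sums for even and odd pixels separately and merges them with `psllw`/`por`.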
--- a/jccolor.c
+++ b/jccolor.c
@@ -5,12 +5,20 @@
 * This file is part of the Independent JPEG Group's software.
 * For conditions of distribution and use, see the accompanying README file.
 *
 * ---------------------------------------------------------------------
 * x86 SIMD extension for IJG JPEG library
 * Copyright (C) 1999-2006, MIYASAKA Masaru.
 * This file has been modified for SIMD extension.
 * Last Modified : January 5, 2006
 * ---------------------------------------------------------------------
 *
 * This file contains input colorspace conversion routines.
 */
 #define JPEG_INTERNALS
 #include "jinclude.h"
 #include "jpeglib.h"
 #include "jcolsamp.h"		/* Private declarations */
 /* Private subobject */
@@ -352,6 +360,7 @@ GLOBAL(void)
 jinit_color_converter (j_compress_ptr cinfo)
 {
  my_cconvert_ptr cconvert;
  unsigned int simd = jpeg_simd_support((j_common_ptr) cinfo);
  cconvert = (my_cconvert_ptr)
    (*cinfo->mem->alloc_small) ((j_common_ptr) cinfo, JPOOL_IMAGE,
@@ -420,8 +429,23 @@ jinit_color_converter (j_compress_ptr cinfo)
    if (cinfo->num_components != 3)
      ERREXIT(cinfo, JERR_BAD_J_COLORSPACE);
    if (cinfo->in_color_space == JCS_RGB) {
 #if RGB_PIXELSIZE == 3 || RGB_PIXELSIZE == 4
 #ifdef JCCOLOR_RGBYCC_SSE2_SUPPORTED
      if (simd & JSIMD_SSE2 &&
          IS_CONST_ALIGNED_16(jconst_rgb_ycc_convert_sse2)) {
        cconvert->pub.color_convert = jpeg_rgb_ycc_convert_sse2;
      } else
 #endif
 #ifdef JCCOLOR_RGBYCC_MMX_SUPPORTED
      if (simd & JSIMD_MMX) {
        cconvert->pub.color_convert = jpeg_rgb_ycc_convert_mmx;
      } else
 #endif
 #endif /* RGB_PIXELSIZE == 3 || RGB_PIXELSIZE == 4 */
      {
        cconvert->pub.start_pass = rgb_ycc_start;
        cconvert->pub.color_convert = rgb_ycc_convert;
      }
    } else if (cinfo->in_color_space == JCS_YCbCr)
      cconvert->pub.color_convert = null_convert;
    else
@@ -457,3 +481,28 @@ jinit_color_converter (j_compress_ptr cinfo)
    break;
  }
 }
 #ifndef JSIMD_MODEINFO_NOT_SUPPORTED
 GLOBAL(unsigned int)
 jpeg_simd_color_converter (j_compress_ptr cinfo)
 {
  unsigned int simd = jpeg_simd_support((j_common_ptr) cinfo);
 #if RGB_PIXELSIZE == 3 || RGB_PIXELSIZE == 4
 #ifdef JCCOLOR_RGBYCC_SSE2_SUPPORTED
  if (simd & JSIMD_SSE2 &&
      IS_CONST_ALIGNED_16(jconst_rgb_ycc_convert_sse2))
    return JSIMD_SSE2;
 #endif
 #ifdef JCCOLOR_RGBYCC_MMX_SUPPORTED
  if (simd & JSIMD_MMX)
    return JSIMD_MMX;
 #endif
 #endif /* RGB_PIXELSIZE == 3 || RGB_PIXELSIZE == 4 */
  return JSIMD_NONE;
 }
 #endif /* !JSIMD_MODEINFO_NOT_SUPPORTED */
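The selection order used in jinit_color_converter and jpeg_simd_color_converter above (SSE2 if available and its constant table is 16-byte aligned, else MMX, else plain C) can be sketched in isolation. Everything named here is hypothetical: the `SK_SIMD_*` bit values, the `SK_IS_ALIGNED_16` macro, and the three stub converters are assumptions for illustration and do not reproduce the real `JSIMD_*` or `IS_CONST_ALIGNED_16` definitions.

```c
#include <stdint.h>

/* Hypothetical feature bits; the real JSIMD_* values are defined elsewhere. */
enum { SK_SIMD_NONE = 0x00, SK_SIMD_MMX = 0x01, SK_SIMD_SSE2 = 0x08 };

/* Assumed alignment test: true when the address is a multiple of 16. */
#define SK_IS_ALIGNED_16(p) ((((uintptr_t)(p)) & 15u) == 0)

typedef const char *(*convert_fn)(void);

/* Stub converters that only report which path was chosen. */
static const char *convert_c(void)    { return "c"; }
static const char *convert_mmx(void)  { return "mmx"; }
static const char *convert_sse2(void) { return "sse2"; }

/*
 * Pick the widest usable implementation.  The SSE2 path is skipped when its
 * constant table is not 16-byte aligned, since aligned 128-bit loads from a
 * misaligned table would fault; the MMX path has no such requirement.
 */
static convert_fn select_converter(unsigned int simd, const void *sse2_consts)
{
  if ((simd & SK_SIMD_SSE2) && SK_IS_ALIGNED_16(sse2_consts))
    return convert_sse2;
  if (simd & SK_SIMD_MMX)
    return convert_mmx;
  return convert_c;
}
```

Falling through to the next tier instead of failing keeps the library usable when a linker places the constant section at an unexpected alignment, at the cost of silently running the slower path.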
--- a/jccolss2.asm
+++ b/jccolss2.asm
@@ -0,0 +1,541 @@
 ;
 ; jccolss2.asm - colorspace conversion (SSE2)
 ;
 ; x86 SIMD extension for IJG JPEG library
 ; Copyright (C) 1999-2006, MIYASAKA Masaru.
 ; For conditions of distribution and use, see copyright notice in jsimdext.inc
 ;
 ; This file should be assembled with NASM (Netwide Assembler); it
 ; can *not* be assembled with Microsoft's MASM or any compatible
 ; assembler (including Borland's Turbo Assembler).
 ; NASM is available from http://nasm.sourceforge.net/ or
 ; http://sourceforge.net/project/showfiles.php?group_id=6208
 ;
 ; Last Modified : February 4, 2006
 ;
 ; [TAB8]
 %include "jsimdext.inc"
 %include "jcolsamp.inc"
 %if RGB_PIXELSIZE == 3 || RGB_PIXELSIZE == 4
 %ifdef JCCOLOR_RGBYCC_SSE2_SUPPORTED
 ; --------------------------------------------------------------------------
 %define SCALEBITS	16
 F_0_081	equ	 5329			; FIX(0.08131)
 F_0_114	equ	 7471			; FIX(0.11400)
 F_0_168	equ	11059			; FIX(0.16874)
 F_0_250	equ	16384			; FIX(0.25000)
 F_0_299	equ	19595			; FIX(0.29900)
 F_0_331	equ	21709			; FIX(0.33126)
 F_0_418	equ	27439			; FIX(0.41869)
 F_0_587	equ	38470			; FIX(0.58700)
 F_0_337	equ	(F_0_587 - F_0_250)	; FIX(0.58700) - FIX(0.25000)
 ; --------------------------------------------------------------------------
 	SECTION	SEG_CONST
 	alignz	16
 	global	EXTN(jconst_rgb_ycc_convert_sse2)
 EXTN(jconst_rgb_ycc_convert_sse2):
 PW_F0299_F0337	times 4 dw  F_0_299, F_0_337
 PW_F0114_F0250	times 4 dw  F_0_114, F_0_250
 PW_MF016_MF033	times 4 dw -F_0_168,-F_0_331
 PW_MF008_MF041	times 4 dw -F_0_081,-F_0_418
 PD_ONEHALFM1_CJ	times 4 dd  (1 << (SCALEBITS-1)) - 1 + (CENTERJSAMPLE << SCALEBITS)
 PD_ONEHALF	times 4 dd  (1 << (SCALEBITS-1))
 	alignz	16
 ; --------------------------------------------------------------------------
 	SECTION	SEG_TEXT
 	BITS	32
 ;
 ; Convert some rows of samples to the output colorspace.
 ;
 ; GLOBAL(void)
 ; jpeg_rgb_ycc_convert_sse2 (j_compress_ptr cinfo,
 ;                            JSAMPARRAY input_buf, JSAMPIMAGE output_buf,
 ;                            JDIMENSION output_row, int num_rows);
 ;
 %define cinfo(b)	(b)+8		; j_compress_ptr cinfo
 %define input_buf(b)	(b)+12		; JSAMPARRAY input_buf
 %define output_buf(b)	(b)+16		; JSAMPIMAGE output_buf
 %define output_row(b)	(b)+20		; JDIMENSION output_row
 %define num_rows(b)	(b)+24		; int num_rows
 %define original_ebp	ebp+0
 %define wk(i)		ebp-(WK_NUM-(i))*SIZEOF_XMMWORD	; xmmword wk[WK_NUM]
 %define WK_NUM		8
 %define gotptr		wk(0)-SIZEOF_POINTER	; void * gotptr
 	align	16
 	global	EXTN(jpeg_rgb_ycc_convert_sse2)
 EXTN(jpeg_rgb_ycc_convert_sse2):
 	push	ebp
 	mov	eax,esp				; eax = original ebp
 	sub	esp, byte 4
 	and	esp, byte (-SIZEOF_XMMWORD)	; align to 128 bits
 	mov	[esp],eax
 	mov	ebp,esp				; ebp = aligned ebp
 	lea	esp, [wk(0)]
 	pushpic	eax		; make room for the GOT address
 	push	ebx
 ;	push	ecx		; need not be preserved
 ;	push	edx		; need not be preserved
 	push	esi
 	push	edi
 	get_GOT	ebx			; get GOT address
 	movpic	POINTER [gotptr], ebx	; save GOT address
 	mov	ecx, POINTER [cinfo(eax)]
 	mov	ecx, JDIMENSION [jcstruct_image_width(ecx)]	; num_cols
 	test	ecx,ecx
 	jz	near .return
 	push	ecx
 	mov	esi, JSAMPIMAGE [output_buf(eax)]
 	mov	ecx, JDIMENSION [output_row(eax)]
 	mov	edi, JSAMPARRAY [esi+0*SIZEOF_JSAMPARRAY]
 	mov	ebx, JSAMPARRAY [esi+1*SIZEOF_JSAMPARRAY]
 	mov	edx, JSAMPARRAY [esi+2*SIZEOF_JSAMPARRAY]
 	lea	edi, [edi+ecx*SIZEOF_JSAMPROW]
 	lea	ebx, [ebx+ecx*SIZEOF_JSAMPROW]
 	lea	edx, [edx+ecx*SIZEOF_JSAMPROW]
 	pop	ecx
 	mov	esi, JSAMPARRAY [input_buf(eax)]
 	mov	eax, INT [num_rows(eax)]
 	test	eax,eax
 	jle	near .return
 	alignx	16,7
 .rowloop:
 	pushpic	eax
 	push	edx
 	push	ebx
 	push	edi
 	push	esi
 	push	ecx			; col
 	mov	esi, JSAMPROW [esi]	; inptr
 	mov	edi, JSAMPROW [edi]	; outptr0
 	mov	ebx, JSAMPROW [ebx]	; outptr1
 	mov	edx, JSAMPROW [edx]	; outptr2
 	movpic	eax, POINTER [gotptr]	; load GOT address (eax)
 	cmp	ecx, byte SIZEOF_XMMWORD
 	jae	near .columnloop
 	alignx	16,7
 %if RGB_PIXELSIZE == 3 ; ---------------
 .column_ld1:
 	push	eax
 	push	edx
 	lea	ecx,[ecx+ecx*2]		; imul ecx,RGB_PIXELSIZE
 	test	cl, SIZEOF_BYTE
 	jz	short .column_ld2
 	sub	ecx, byte SIZEOF_BYTE
 	movzx	eax, BYTE [esi+ecx]
 .column_ld2:
 	test	cl, SIZEOF_WORD
 	jz	short .column_ld4
 	sub	ecx, byte SIZEOF_WORD
 	movzx	edx, WORD [esi+ecx]
 	shl	eax, WORD_BIT
 	or	eax,edx
 .column_ld4:
 	movd	xmmA,eax
 	pop	edx
 	pop	eax
 	test	cl, SIZEOF_DWORD
 	jz	short .column_ld8
 	sub	ecx, byte SIZEOF_DWORD
 	movd	xmmF, _DWORD [esi+ecx]
 	pslldq	xmmA, SIZEOF_DWORD
 	por	xmmA,xmmF
 .column_ld8:
 	test	cl, SIZEOF_MMWORD
 	jz	short .column_ld16
 	sub	ecx, byte SIZEOF_MMWORD
 	movq	xmmB, _MMWORD [esi+ecx]
 	pslldq	xmmA, SIZEOF_MMWORD
 	por	xmmA,xmmB
 .column_ld16:
 	test	cl, SIZEOF_XMMWORD
 	jz	short .column_ld32
 	movdqa	xmmF,xmmA
 	movdqu	xmmA, XMMWORD [esi+0*SIZEOF_XMMWORD]
 	mov	ecx, SIZEOF_XMMWORD
 	jmp	short .rgb_ycc_cnv
 .column_ld32:
 	test	cl, 2*SIZEOF_XMMWORD
 	mov	ecx, SIZEOF_XMMWORD
 	jz	short .rgb_ycc_cnv
 	movdqa	xmmB,xmmA
 	movdqu	xmmA, XMMWORD [esi+0*SIZEOF_XMMWORD]
 	movdqu	xmmF, XMMWORD [esi+1*SIZEOF_XMMWORD]
 	jmp	short .rgb_ycc_cnv
 	alignx	16,7
 .columnloop:
 	movdqu	xmmA, XMMWORD [esi+0*SIZEOF_XMMWORD]
 	movdqu	xmmF, XMMWORD [esi+1*SIZEOF_XMMWORD]
 	movdqu	xmmB, XMMWORD [esi+2*SIZEOF_XMMWORD]
 .rgb_ycc_cnv:
 	; xmmA=(00 10 20 01 11 21 02 12 22 03 13 23 04 14 24 05)
 	; xmmF=(15 25 06 16 26 07 17 27 08 18 28 09 19 29 0A 1A)
 	; xmmB=(2A 0B 1B 2B 0C 1C 2C 0D 1D 2D 0E 1E 2E 0F 1F 2F)
 	movdqa    xmmG,xmmA
 	pslldq    xmmA,8	; xmmA=(-- -- -- -- -- -- -- -- 00 10 20 01 11 21 02 12)
 	psrldq    xmmG,8	; xmmG=(22 03 13 23 04 14 24 05 -- -- -- -- -- -- -- --)
 	punpckhbw xmmA,xmmF	; xmmA=(00 08 10 18 20 28 01 09 11 19 21 29 02 0A 12 1A)
 	pslldq    xmmF,8	; xmmF=(-- -- -- -- -- -- -- -- 15 25 06 16 26 07 17 27)
 	punpcklbw xmmG,xmmB	; xmmG=(22 2A 03 0B 13 1B 23 2B 04 0C 14 1C 24 2C 05 0D)
 	punpckhbw xmmF,xmmB	; xmmF=(15 1D 25 2D 06 0E 16 1E 26 2E 07 0F 17 1F 27 2F)
 	movdqa    xmmD,xmmA
 	pslldq    xmmA,8	; xmmA=(-- -- -- -- -- -- -- -- 00 08 10 18 20 28 01 09)
 	psrldq    xmmD,8	; xmmD=(11 19 21 29 02 0A 12 1A -- -- -- -- -- -- -- --)
 	punpckhbw xmmA,xmmG	; xmmA=(00 04 08 0C 10 14 18 1C 20 24 28 2C 01 05 09 0D)
 	pslldq    xmmG,8	; xmmG=(-- -- -- -- -- -- -- -- 22 2A 03 0B 13 1B 23 2B)
 	punpcklbw xmmD,xmmF	; xmmD=(11 15 19 1D 21 25 29 2D 02 06 0A 0E 12 16 1A 1E)
 	punpckhbw xmmG,xmmF	; xmmG=(22 26 2A 2E 03 07 0B 0F 13 17 1B 1F 23 27 2B 2F)
 	movdqa    xmmE,xmmA
 	pslldq    xmmA,8	; xmmA=(-- -- -- -- -- -- -- -- 00 04 08 0C 10 14 18 1C)
 	psrldq    xmmE,8	; xmmE=(20 24 28 2C 01 05 09 0D -- -- -- -- -- -- -- --)
 	punpckhbw xmmA,xmmD	; xmmA=(00 02 04 06 08 0A 0C 0E 10 12 14 16 18 1A 1C 1E)
 	pslldq    xmmD,8	; xmmD=(-- -- -- -- -- -- -- -- 11 15 19 1D 21 25 29 2D)
 	punpcklbw xmmE,xmmG	; xmmE=(20 22 24 26 28 2A 2C 2E 01 03 05 07 09 0B 0D 0F)
 	punpckhbw xmmD,xmmG	; xmmD=(11 13 15 17 19 1B 1D 1F 21 23 25 27 29 2B 2D 2F)
 	pxor      xmmH,xmmH
 	movdqa    xmmC,xmmA
 	punpcklbw xmmA,xmmH	; xmmA=(00 02 04 06 08 0A 0C 0E)
 	punpckhbw xmmC,xmmH	; xmmC=(10 12 14 16 18 1A 1C 1E)
 	movdqa    xmmB,xmmE
 	punpcklbw xmmE,xmmH	; xmmE=(20 22 24 26 28 2A 2C 2E)
 	punpckhbw xmmB,xmmH	; xmmB=(01 03 05 07 09 0B 0D 0F)
 	movdqa    xmmF,xmmD
 	punpcklbw xmmD,xmmH	; xmmD=(11 13 15 17 19 1B 1D 1F)
 	punpckhbw xmmF,xmmH	; xmmF=(21 23 25 27 29 2B 2D 2F)
 %else ; RGB_PIXELSIZE == 4 ; -----------
 .column_ld1:
 	test	cl, SIZEOF_XMMWORD/16
 	jz	short .column_ld2
 	sub	ecx, byte SIZEOF_XMMWORD/16
 	movd	xmmA, _DWORD [esi+ecx*RGB_PIXELSIZE]
 .column_ld2:
 	test	cl, SIZEOF_XMMWORD/8
 	jz	short .column_ld4
 	sub	ecx, byte SIZEOF_XMMWORD/8
 	movq	xmmE, _MMWORD [esi+ecx*RGB_PIXELSIZE]
 	pslldq	xmmA, SIZEOF_MMWORD
 	por	xmmA,xmmE
 .column_ld4:
 	test	cl, SIZEOF_XMMWORD/4
 	jz	short .column_ld8
 	sub	ecx, byte SIZEOF_XMMWORD/4
 	movdqa	xmmE,xmmA
 	movdqu	xmmA, XMMWORD [esi+ecx*RGB_PIXELSIZE]
 .column_ld8:
 	test	cl, SIZEOF_XMMWORD/2
 	mov	ecx, SIZEOF_XMMWORD
 	jz	short .rgb_ycc_cnv
 	movdqa	xmmF,xmmA
 	movdqa	xmmH,xmmE
 	movdqu	xmmA, XMMWORD [esi+0*SIZEOF_XMMWORD]
 	movdqu	xmmE, XMMWORD [esi+1*SIZEOF_XMMWORD]
 	jmp	short .rgb_ycc_cnv
 	alignx	16,7
 .columnloop:
 	movdqu	xmmA, XMMWORD [esi+0*SIZEOF_XMMWORD]
 	movdqu	xmmE, XMMWORD [esi+1*SIZEOF_XMMWORD]
 	movdqu	xmmF, XMMWORD [esi+2*SIZEOF_XMMWORD]
 	movdqu	xmmH, XMMWORD [esi+3*SIZEOF_XMMWORD]
 .rgb_ycc_cnv:
 	; xmmA=(00 10 20 30 01 11 21 31 02 12 22 32 03 13 23 33)
 	; xmmE=(04 14 24 34 05 15 25 35 06 16 26 36 07 17 27 37)
 	; xmmF=(08 18 28 38 09 19 29 39 0A 1A 2A 3A 0B 1B 2B 3B)
 	; xmmH=(0C 1C 2C 3C 0D 1D 2D 3D 0E 1E 2E 3E 0F 1F 2F 3F)
 	movdqa    xmmD,xmmA
 	punpcklbw xmmA,xmmE	; xmmA=(00 04 10 14 20 24 30 34 01 05 11 15 21 25 31 35)
 	punpckhbw xmmD,xmmE	; xmmD=(02 06 12 16 22 26 32 36 03 07 13 17 23 27 33 37)
 	movdqa    xmmC,xmmF
 	punpcklbw xmmF,xmmH	; xmmF=(08 0C 18 1C 28 2C 38 3C 09 0D 19 1D 29 2D 39 3D)
 	punpckhbw xmmC,xmmH	; xmmC=(0A 0E 1A 1E 2A 2E 3A 3E 0B 0F 1B 1F 2B 2F 3B 3F)
 	movdqa    xmmB,xmmA
 	punpcklwd xmmA,xmmF	; xmmA=(00 04 08 0C 10 14 18 1C 20 24 28 2C 30 34 38 3C)
 	punpckhwd xmmB,xmmF	; xmmB=(01 05 09 0D 11 15 19 1D 21 25 29 2D 31 35 39 3D)
 	movdqa    xmmG,xmmD
 	punpcklwd xmmD,xmmC	; xmmD=(02 06 0A 0E 12 16 1A 1E 22 26 2A 2E 32 36 3A 3E)
 	punpckhwd xmmG,xmmC	; xmmG=(03 07 0B 0F 13 17 1B 1F 23 27 2B 2F 33 37 3B 3F)
 	movdqa    xmmE,xmmA
 	punpcklbw xmmA,xmmD	; xmmA=(00 02 04 06 08 0A 0C 0E 10 12 14 16 18 1A 1C 1E)
 	punpckhbw xmmE,xmmD	; xmmE=(20 22 24 26 28 2A 2C 2E 30 32 34 36 38 3A 3C 3E)
 	movdqa    xmmH,xmmB
 	punpcklbw xmmB,xmmG	; xmmB=(01 03 05 07 09 0B 0D 0F 11 13 15 17 19 1B 1D 1F)
 	punpckhbw xmmH,xmmG	; xmmH=(21 23 25 27 29 2B 2D 2F 31 33 35 37 39 3B 3D 3F)
 	pxor      xmmF,xmmF
 	movdqa    xmmC,xmmA
 	punpcklbw xmmA,xmmF	; xmmA=(00 02 04 06 08 0A 0C 0E)
 	punpckhbw xmmC,xmmF	; xmmC=(10 12 14 16 18 1A 1C 1E)
 	movdqa    xmmD,xmmB
 	punpcklbw xmmB,xmmF	; xmmB=(01 03 05 07 09 0B 0D 0F)
 	punpckhbw xmmD,xmmF	; xmmD=(11 13 15 17 19 1B 1D 1F)
 	movdqa    xmmG,xmmE
 	punpcklbw xmmE,xmmF	; xmmE=(20 22 24 26 28 2A 2C 2E)
 	punpckhbw xmmG,xmmF	; xmmG=(30 32 34 36 38 3A 3C 3E)
 	punpcklbw xmmF,xmmH
 	punpckhbw xmmH,xmmH
 	psrlw     xmmF,BYTE_BIT	; xmmF=(21 23 25 27 29 2B 2D 2F)
 	psrlw     xmmH,BYTE_BIT	; xmmH=(31 33 35 37 39 3B 3D 3F)
 %endif ; RGB_PIXELSIZE ; ---------------
 	; xmm0=R(02468ACE)=RE, xmm2=G(02468ACE)=GE, xmm4=B(02468ACE)=BE
 	; xmm1=R(13579BDF)=RO, xmm3=G(13579BDF)=GO, xmm5=B(13579BDF)=BO
 	; (Original)
 	; Y  =  0.29900 * R + 0.58700 * G + 0.11400 * B
 	; Cb = -0.16874 * R - 0.33126 * G + 0.50000 * B + CENTERJSAMPLE
 	; Cr =  0.50000 * R - 0.41869 * G - 0.08131 * B + CENTERJSAMPLE
 	;
 	; (This implementation)
 	; Y  =  0.29900 * R + 0.33700 * G + 0.11400 * B + 0.25000 * G
 	; Cb = -0.16874 * R - 0.33126 * G + 0.50000 * B + CENTERJSAMPLE
 	; Cr =  0.50000 * R - 0.41869 * G - 0.08131 * B + CENTERJSAMPLE
 	movdqa    XMMWORD [wk(0)], xmm0	; wk(0)=RE
 	movdqa    XMMWORD [wk(1)], xmm1	; wk(1)=RO
 	movdqa    XMMWORD [wk(2)], xmm4	; wk(2)=BE
 	movdqa    XMMWORD [wk(3)], xmm5	; wk(3)=BO
 	movdqa    xmm6,xmm1
 	punpcklwd xmm1,xmm3
 	punpckhwd xmm6,xmm3
 	movdqa    xmm7,xmm1
 	movdqa    xmm4,xmm6
 	pmaddwd   xmm1,[GOTOFF(eax,PW_F0299_F0337)] ; xmm1=ROL*FIX(0.299)+GOL*FIX(0.337)
 	pmaddwd   xmm6,[GOTOFF(eax,PW_F0299_F0337)] ; xmm6=ROH*FIX(0.299)+GOH*FIX(0.337)
 	pmaddwd   xmm7,[GOTOFF(eax,PW_MF016_MF033)] ; xmm7=ROL*-FIX(0.168)+GOL*-FIX(0.331)
 	pmaddwd   xmm4,[GOTOFF(eax,PW_MF016_MF033)] ; xmm4=ROH*-FIX(0.168)+GOH*-FIX(0.331)
 	movdqa    XMMWORD [wk(4)], xmm1	; wk(4)=ROL*FIX(0.299)+GOL*FIX(0.337)
 	movdqa    XMMWORD [wk(5)], xmm6	; wk(5)=ROH*FIX(0.299)+GOH*FIX(0.337)
 	pxor      xmm1,xmm1
 	pxor      xmm6,xmm6
 	punpcklwd xmm1,xmm5		; xmm1=BOL
 	punpckhwd xmm6,xmm5		; xmm6=BOH
 	psrld     xmm1,1		; xmm1=BOL*FIX(0.500)
 	psrld     xmm6,1		; xmm6=BOH*FIX(0.500)
 	movdqa    xmm5,[GOTOFF(eax,PD_ONEHALFM1_CJ)] ; xmm5=[PD_ONEHALFM1_CJ]
 	paddd     xmm7,xmm1
 	paddd     xmm4,xmm6
 	paddd     xmm7,xmm5
 	paddd     xmm4,xmm5
 	psrld     xmm7,SCALEBITS	; xmm7=CbOL
 	psrld     xmm4,SCALEBITS	; xmm4=CbOH
 	packssdw  xmm7,xmm4		; xmm7=CbO
 	movdqa    xmm1, XMMWORD [wk(2)]	; xmm1=BE
 	movdqa    xmm6,xmm0
 	punpcklwd xmm0,xmm2
 	punpckhwd xmm6,xmm2
 	movdqa    xmm5,xmm0
 	movdqa    xmm4,xmm6
 	pmaddwd   xmm0,[GOTOFF(eax,PW_F0299_F0337)] ; xmm0=REL*FIX(0.299)+GEL*FIX(0.337)
 	pmaddwd   xmm6,[GOTOFF(eax,PW_F0299_F0337)] ; xmm6=REH*FIX(0.299)+GEH*FIX(0.337)
 	pmaddwd   xmm5,[GOTOFF(eax,PW_MF016_MF033)] ; xmm5=REL*-FIX(0.168)+GEL*-FIX(0.331)
 	pmaddwd   xmm4,[GOTOFF(eax,PW_MF016_MF033)] ; xmm4=REH*-FIX(0.168)+GEH*-FIX(0.331)
 	movdqa    XMMWORD [wk(6)], xmm0	; wk(6)=REL*FIX(0.299)+GEL*FIX(0.337)
 	movdqa    XMMWORD [wk(7)], xmm6	; wk(7)=REH*FIX(0.299)+GEH*FIX(0.337)
 	pxor      xmm0,xmm0
 	pxor      xmm6,xmm6
 	punpcklwd xmm0,xmm1		; xmm0=BEL
 	punpckhwd xmm6,xmm1		; xmm6=BEH
 	psrld     xmm0,1		; xmm0=BEL*FIX(0.500)
 	psrld     xmm6,1		; xmm6=BEH*FIX(0.500)
 	movdqa    xmm1,[GOTOFF(eax,PD_ONEHALFM1_CJ)] ; xmm1=[PD_ONEHALFM1_CJ]
 	paddd     xmm5,xmm0
 	paddd     xmm4,xmm6
 	paddd     xmm5,xmm1
 	paddd     xmm4,xmm1
 	psrld     xmm5,SCALEBITS	; xmm5=CbEL
 	psrld     xmm4,SCALEBITS	; xmm4=CbEH
 	packssdw  xmm5,xmm4		; xmm5=CbE
 	psllw     xmm7,BYTE_BIT
 	por       xmm5,xmm7		; xmm5=Cb
 	movdqa    XMMWORD [ebx], xmm5	; Save Cb
 	movdqa    xmm0, XMMWORD [wk(3)]	; xmm0=BO
 	movdqa    xmm6, XMMWORD [wk(2)]	; xmm6=BE
 	movdqa    xmm1, XMMWORD [wk(1)]	; xmm1=RO
 	movdqa    xmm4,xmm0
 	punpcklwd xmm0,xmm3
 	punpckhwd xmm4,xmm3
 	movdqa    xmm7,xmm0
 	movdqa    xmm5,xmm4
 	pmaddwd   xmm0,[GOTOFF(eax,PW_F0114_F0250)] ; xmm0=BOL*FIX(0.114)+GOL*FIX(0.250)
 	pmaddwd   xmm4,[GOTOFF(eax,PW_F0114_F0250)] ; xmm4=BOH*FIX(0.114)+GOH*FIX(0.250)
 	pmaddwd   xmm7,[GOTOFF(eax,PW_MF008_MF041)] ; xmm7=BOL*-FIX(0.081)+GOL*-FIX(0.418)
 	pmaddwd   xmm5,[GOTOFF(eax,PW_MF008_MF041)] ; xmm5=BOH*-FIX(0.081)+GOH*-FIX(0.418)
 	movdqa    xmm3,[GOTOFF(eax,PD_ONEHALF)]	; xmm3=[PD_ONEHALF]
 	paddd     xmm0, XMMWORD [wk(4)]
 	paddd     xmm4, XMMWORD [wk(5)]
 	paddd     xmm0,xmm3
 	paddd     xmm4,xmm3
 	psrld     xmm0,SCALEBITS	; xmm0=YOL
 	psrld     xmm4,SCALEBITS	; xmm4=YOH
 	packssdw  xmm0,xmm4		; xmm0=YO
 	pxor      xmm3,xmm3
 	pxor      xmm4,xmm4
 	punpcklwd xmm3,xmm1		; xmm3=ROL
 	punpckhwd xmm4,xmm1		; xmm4=ROH
 	psrld     xmm3,1		; xmm3=ROL*FIX(0.500)
 	psrld     xmm4,1		; xmm4=ROH*FIX(0.500)
 	movdqa    xmm1,[GOTOFF(eax,PD_ONEHALFM1_CJ)] ; xmm1=[PD_ONEHALFM1_CJ]
 	paddd     xmm7,xmm3
 	paddd     xmm5,xmm4
 	paddd     xmm7,xmm1
 	paddd     xmm5,xmm1
 	psrld     xmm7,SCALEBITS	; xmm7=CrOL
 	psrld     xmm5,SCALEBITS	; xmm5=CrOH
 	packssdw  xmm7,xmm5		; xmm7=CrO
 	movdqa    xmm3, XMMWORD [wk(0)]	; xmm3=RE
 	movdqa    xmm4,xmm6
 	punpcklwd xmm6,xmm2
 	punpckhwd xmm4,xmm2
 	movdqa    xmm1,xmm6
 	movdqa    xmm5,xmm4
 	pmaddwd   xmm6,[GOTOFF(eax,PW_F0114_F0250)] ; xmm6=BEL*FIX(0.114)+GEL*FIX(0.250)
 	pmaddwd   xmm4,[GOTOFF(eax,PW_F0114_F0250)] ; xmm4=BEH*FIX(0.114)+GEH*FIX(0.250)
 	pmaddwd   xmm1,[GOTOFF(eax,PW_MF008_MF041)] ; xmm1=BEL*-FIX(0.081)+GEL*-FIX(0.418)
 	pmaddwd   xmm5,[GOTOFF(eax,PW_MF008_MF041)] ; xmm5=BEH*-FIX(0.081)+GEH*-FIX(0.418)
 	movdqa    xmm2,[GOTOFF(eax,PD_ONEHALF)]	; xmm2=[PD_ONEHALF]
 	paddd     xmm6, XMMWORD [wk(6)]
 	paddd     xmm4, XMMWORD [wk(7)]
 	paddd     xmm6,xmm2
 	paddd     xmm4,xmm2
 	psrld     xmm6,SCALEBITS	; xmm6=YEL
 	psrld     xmm4,SCALEBITS	; xmm4=YEH
 	packssdw  xmm6,xmm4		; xmm6=YE
 	psllw     xmm0,BYTE_BIT
 	por       xmm6,xmm0		; xmm6=Y
 	movdqa    XMMWORD [edi], xmm6	; Save Y
 	pxor      xmm2,xmm2
 	pxor      xmm4,xmm4
 	punpcklwd xmm2,xmm3		; xmm2=REL
 	punpckhwd xmm4,xmm3		; xmm4=REH
 	psrld     xmm2,1		; xmm2=REL*FIX(0.500)
 	psrld     xmm4,1		; xmm4=REH*FIX(0.500)
 	movdqa    xmm0,[GOTOFF(eax,PD_ONEHALFM1_CJ)] ; xmm0=[PD_ONEHALFM1_CJ]
 	paddd     xmm1,xmm2
 	paddd     xmm5,xmm4
 	paddd     xmm1,xmm0
 	paddd     xmm5,xmm0
 	psrld     xmm1,SCALEBITS	; xmm1=CrEL
 	psrld     xmm5,SCALEBITS	; xmm5=CrEH
 	packssdw  xmm1,xmm5		; xmm1=CrE
 	psllw     xmm7,BYTE_BIT
 	por       xmm1,xmm7		; xmm1=Cr
 	movdqa    XMMWORD [edx], xmm1	; Save Cr
 	sub	ecx, byte SIZEOF_XMMWORD
 	add	esi, byte RGB_PIXELSIZE*SIZEOF_XMMWORD	; inptr
 	add	edi, byte SIZEOF_XMMWORD		; outptr0
 	add	ebx, byte SIZEOF_XMMWORD		; outptr1
 	add	edx, byte SIZEOF_XMMWORD		; outptr2
 	cmp	ecx, byte SIZEOF_XMMWORD
 	jae	near .columnloop
 	test	ecx,ecx
 	jnz	near .column_ld1
 	pop	ecx			; col
 	pop	esi
 	pop	edi
 	pop	ebx
 	pop	edx
 	poppic	eax
 	add	esi, byte SIZEOF_JSAMPROW	; input_buf
 	add	edi, byte SIZEOF_JSAMPROW
 	add	ebx, byte SIZEOF_JSAMPROW
 	add	edx, byte SIZEOF_JSAMPROW
 	dec	eax				; num_rows
 	jg	near .rowloop
 .return:
 	pop	edi
 	pop	esi
 ;	pop	edx		; need not be preserved
 ;	pop	ecx		; need not be preserved
 	pop	ebx
 	mov	esp,ebp		; esp <- aligned ebp
 	pop	esp		; esp <- original ebp
 	pop	ebp
 	ret
 %endif ; JCCOLOR_RGBYCC_SSE2_SUPPORTED
 %endif ; RGB_PIXELSIZE == 3 || RGB_PIXELSIZE == 4
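For readers who do not follow the SSE2 register comments above, the per-pixel arithmetic can be sketched in scalar C. This is a minimal sketch, not the code in this patch: the helper name `rgb_to_ycc_pixel` is hypothetical, 8-bit samples and SCALEBITS = 16 are assumed, and the constants are the usual JFIF RGB-to-YCbCr weights. The chroma rounding term corresponds to what the assembly calls PD_ONEHALFM1_CJ (one half, minus one, plus the centering offset).

```c
#include <assert.h>
#include <stdint.h>

#define SCALEBITS   16
#define ONE_HALF    ((int32_t)1 << (SCALEBITS - 1))
#define CBCR_OFFSET ((int32_t)128 << SCALEBITS)  /* centre value for 8-bit samples */
#define FIX(x)      ((int32_t)((x) * (1L << SCALEBITS) + 0.5))

/* Hypothetical helper: convert one 8-bit RGB pixel to YCbCr in fixed point. */
static void rgb_to_ycc_pixel(int r, int g, int b,
                             uint8_t *y, uint8_t *cb, uint8_t *cr)
{
    assert(r >= 0 && r <= 255 && g >= 0 && g <= 255 && b >= 0 && b <= 255);

    /* Luma: plain weighted sum, rounded to nearest. */
    *y = (uint8_t)((FIX(0.29900) * r + FIX(0.58700) * g + FIX(0.11400) * b
                    + ONE_HALF) >> SCALEBITS);

    /* Chroma: the B (resp. R) weight is exactly 0.5, which is why the
     * assembly can use a one-bit shift instead of a multiply.  The sums
     * stay non-negative because of CBCR_OFFSET, so the shift is safe. */
    *cb = (uint8_t)((-FIX(0.16874) * r - FIX(0.33126) * g + FIX(0.50000) * b
                     + CBCR_OFFSET + ONE_HALF - 1) >> SCALEBITS);
    *cr = (uint8_t)((FIX(0.50000) * r - FIX(0.41869) * g - FIX(0.08131) * b
                     + CBCR_OFFSET + ONE_HALF - 1) >> SCALEBITS);
}
```

The SIMD routine processes sixteen pixels per iteration and splits the 0.587 green weight across two multiply-add constants, but the rounding and offset terms play the same role as here.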
--- a/jcdctmgr.c
+++ b/jcdctmgr.c
@@ -5,6 +5,13 @@
 * This file is part of the Independent JPEG Group's software.
 * For conditions of distribution and use, see the accompanying README file.
 *
 * ---------------------------------------------------------------------
 * x86 SIMD extension for IJG JPEG library
 * Copyright (C) 1999-2006, MIYASAKA Masaru.
 * This file has been modified for SIMD extension.
 * Last Modified : December 24, 2005
 * ---------------------------------------------------------------------
 *
 * This file contains the forward-DCT management logic.
 * This code selects a particular DCT implementation to be used,
 * and it performs related housekeeping chores including coefficient
@@ -24,6 +31,8 @@ typedef struct {
  /* Pointer to the DCT routine actually in use */
  forward_DCT_method_ptr do_dct;
  convsamp_int_method_ptr convsamp;
  quantize_int_method_ptr quantize;
  /* The actual post-DCT divisors --- not identical to the quant table
   * entries, because of scaling (especially for an unnormalized DCT).
@@ -34,12 +43,75 @@ typedef struct {
 #ifdef DCT_FLOAT_SUPPORTED
  /* Same as above for the floating-point case. */
  float_DCT_method_ptr do_float_dct;
  convsamp_float_method_ptr float_convsamp;
  quantize_float_method_ptr float_quantize;
  FAST_FLOAT * float_divisors[NUM_QUANT_TBLS];
 #endif
 } my_fdct_controller;
 typedef my_fdct_controller * my_fdct_ptr;
 /*
 * SIMD Ext: Most of SSE/SSE2 instructions require that the memory address
 * is aligned to a 16-byte boundary; if not, a general-protection exception
 * (#GP) is generated.
 */
 #define ALIGN_SIZE	16		/* sizeof SSE/SSE2 register */
 #define ALIGN_MEM(p,a)	((void *) (((size_t) (p) + (a) - 1) & -(a)))
 #ifdef JFDCT_INT_QUANTIZE_WITH_DIVISION
 #undef jpeg_quantize_int
 #undef jpeg_quantize_int_mmx
 #undef jpeg_quantize_int_sse2
 #define jpeg_quantize_int       jpeg_quantize_idiv
 #define jpeg_quantize_int_mmx   jpeg_quantize_idiv
 #define jpeg_quantize_int_sse2  jpeg_quantize_idiv
 #endif
 #ifndef JFDCT_INT_QUANTIZE_WITH_DIVISION
 /*
 * SIMD Ext: compute the reciprocal of the divisor
 *
 * This implementation is based on an algorithm described in
 *   "How to optimize for the Pentium family of microprocessors"
 *   (http://www.agner.org/assem/).
 */
 LOCAL(void)
 compute_reciprocal (DCTELEM divisor, DCTELEM * dtbl)
 {
  unsigned long d = ((unsigned long) divisor) & 0x0000FFFF;
  unsigned long fq, fr;
  int b, r, c;
  for (b = 0; (1UL << b) <= d; b++) ;
  r  = 16 + (--b);
  fq = (1UL << r) / d;
  fr = (1UL << r) % d;
  r -= 16;
  c  = 0;
  if (fr == 0) {
    fq >>= 1;
    r--;
  } else if (fr <= (d / 2)) {
    c++;
  } else {
    fq++;
  }
  dtbl[DCTSIZE2 * 0] = (DCTELEM) fq;		/* reciprocal */
  dtbl[DCTSIZE2 * 1] = (DCTELEM) (c + (d / 2));	/* correction + roundfactor */
  dtbl[DCTSIZE2 * 2] = (DCTELEM) (1 << (16 - (r + 1 + 1)));	/* scale */
  dtbl[DCTSIZE2 * 3] = (DCTELEM) (r + 1);			/* shift */
 }
 #endif /* JFDCT_INT_QUANTIZE_WITH_DIVISION */
 /*
 * Initialize for a processing pass.
@@ -75,6 +147,18 @@ start_pass_fdctmgr (j_compress_ptr cinfo)
      /* For LL&M IDCT method, divisors are equal to raw quantization
       * coefficients multiplied by 8 (to counteract scaling).
       */
 #ifndef JFDCT_INT_QUANTIZE_WITH_DIVISION
      if (fdct->divisors[qtblno] == NULL) {
 	fdct->divisors[qtblno] = (DCTELEM *)
 	  (*cinfo->mem->alloc_small) ((j_common_ptr) cinfo, JPOOL_IMAGE,
 				      (DCTSIZE2 * 4) * SIZEOF(DCTELEM));
      }
      dtbl = fdct->divisors[qtblno];
      for (i = 0; i < DCTSIZE2; i++) {
 	compute_reciprocal ((DCTELEM) (qtbl->quantval[i] << 3), &dtbl[i]);
      }
      break;
 #else  /* JFDCT_INT_QUANTIZE_WITH_DIVISION */
      if (fdct->divisors[qtblno] == NULL) {
 	fdct->divisors[qtblno] = (DCTELEM *)
 	  (*cinfo->mem->alloc_small) ((j_common_ptr) cinfo, JPOOL_IMAGE,
@@ -85,7 +169,8 @@ start_pass_fdctmgr (j_compress_ptr cinfo)
 	dtbl[i] = ((DCTELEM) qtbl->quantval[i]) << 3;
      }
      break;
-#endif
+#endif /* JFDCT_INT_QUANTIZE_WITH_DIVISION */
 #endif /* DCT_ISLOW_SUPPORTED */
 #ifdef DCT_IFAST_SUPPORTED
    case JDCT_IFAST:
      {
@@ -109,6 +194,21 @@ start_pass_fdctmgr (j_compress_ptr cinfo)
 	};
 	SHIFT_TEMPS
 #ifndef JFDCT_INT_QUANTIZE_WITH_DIVISION
 	if (fdct->divisors[qtblno] == NULL) {
 	  fdct->divisors[qtblno] = (DCTELEM *)
 	    (*cinfo->mem->alloc_small) ((j_common_ptr) cinfo, JPOOL_IMAGE,
 					(DCTSIZE2 * 4) * SIZEOF(DCTELEM));
 	}
 	dtbl = fdct->divisors[qtblno];
 	for (i = 0; i < DCTSIZE2; i++) {
 	  compute_reciprocal ((DCTELEM)
 			       DESCALE(MULTIPLY16V16((INT32) qtbl->quantval[i],
 						     (INT32) aanscales[i]),
 				       CONST_BITS-3),
 			      &dtbl[i]);
 	}
 #else  /* JFDCT_INT_QUANTIZE_WITH_DIVISION */
 	if (fdct->divisors[qtblno] == NULL) {
 	  fdct->divisors[qtblno] = (DCTELEM *)
 	    (*cinfo->mem->alloc_small) ((j_common_ptr) cinfo, JPOOL_IMAGE,
@@ -121,9 +221,10 @@ start_pass_fdctmgr (j_compress_ptr cinfo)
 				  (INT32) aanscales[i]),
 		    CONST_BITS-3);
 	}
 #endif /* JFDCT_INT_QUANTIZE_WITH_DIVISION */
      }
      break;
-#endif
+#endif /* DCT_IFAST_SUPPORTED */
 #ifdef DCT_FLOAT_SUPPORTED
    case JDCT_FLOAT:
      {
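The reciprocal tables built by compute_reciprocal in the hunks above replace a per-coefficient division with a multiply and a shift. The following is a minimal sketch of that idea only: `recip_t`, `make_reciprocal` and `div_round` are hypothetical names, and because the quantize routines that consume the scale and shift entries are not part of this diff, the sketch folds everything into one combined shift rather than mirroring the four-entry table layout.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical container for the values derived from one divisor. */
typedef struct {
    uint32_t recip;  /* scaled reciprocal of the divisor */
    uint32_t corr;   /* rounding term plus correction, added to the dividend */
    int      shift;  /* right shift applied to the product */
} recip_t;

static recip_t make_reciprocal(uint32_t d)
{
    recip_t t;
    uint32_t fq, fr;
    int b, r;

    assert(d >= 1 && d <= 0xFFFF);
    for (b = 0; ((uint32_t)1 << b) <= d; b++) ;  /* b = bit length of d */
    r = 16 + (--b);                               /* 2^b <= d < 2^(b+1) */
    fq = ((uint32_t)1 << r) / d;
    fr = ((uint32_t)1 << r) % d;

    t.corr = d / 2;              /* rounds the quotient to nearest */
    if (fr == 0) {               /* d is a power of two */
        fq >>= 1;
        r--;
    } else if (fr <= d / 2) {    /* fractional part below one half */
        t.corr++;
    } else {                     /* fractional part above one half */
        fq++;
    }
    t.recip = fq;
    t.shift = r;
    return t;
}

/* Computes (x + d/2) / d without dividing; valid while x + d/2 < 65536. */
static uint32_t div_round(uint32_t x, recip_t t)
{
    return (uint32_t)(((uint64_t)(x + t.corr) * t.recip) >> t.shift);
}
```

The three branches match the three cases in compute_reciprocal: an exact power of two, a reciprocal that was rounded down (compensated by adding one to the dividend), and a reciprocal that is rounded up.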
@@ -183,83 +284,23 @@ forward_DCT (j_compress_ptr cinfo, jpeg_component_info * compptr,
 	     JDIMENSION num_blocks)
 /* This version is used for integer DCT implementations. */
 {
  /* This routine is heavily used, so it's worth coding it tightly. */
  my_fdct_ptr fdct = (my_fdct_ptr) cinfo->fdct;
  forward_DCT_method_ptr do_dct = fdct->do_dct;
  DCTELEM * divisors = fdct->divisors[compptr->quant_tbl_no];
-  DCTELEM workspace[DCTSIZE2];	/* work area for FDCT subroutine */
+  DCTELEM workspace[DCTSIZE2 + ALIGN_SIZE/sizeof(DCTELEM)];
+  DCTELEM * wkptr = (DCTELEM *) ALIGN_MEM(workspace, ALIGN_SIZE);
  JDIMENSION bi;
  sample_data += start_row;	/* fold in the vertical offset once */
  for (bi = 0; bi < num_blocks; bi++, start_col += DCTSIZE) {
    /* Load data into workspace, applying unsigned->signed conversion */
-    { register DCTELEM *workspaceptr;
-      register JSAMPROW elemptr;
-      register int elemr;
-      workspaceptr = workspace;
-      for (elemr = 0; elemr < DCTSIZE; elemr++) {
-	elemptr = sample_data[elemr] + start_col;
-#if DCTSIZE == 8		/* unroll the inner loop */
-	*workspaceptr++ = GETJSAMPLE(*elemptr++) - CENTERJSAMPLE;
-	*workspaceptr++ = GETJSAMPLE(*elemptr++) - CENTERJSAMPLE;
-	*workspaceptr++ = GETJSAMPLE(*elemptr++) - CENTERJSAMPLE;
-	*workspaceptr++ = GETJSAMPLE(*elemptr++) - CENTERJSAMPLE;
-	*workspaceptr++ = GETJSAMPLE(*elemptr++) - CENTERJSAMPLE;
-	*workspaceptr++ = GETJSAMPLE(*elemptr++) - CENTERJSAMPLE;
-	*workspaceptr++ = GETJSAMPLE(*elemptr++) - CENTERJSAMPLE;
-	*workspaceptr++ = GETJSAMPLE(*elemptr++) - CENTERJSAMPLE;
-#else
-	{ register int elemc;
-	  for (elemc = DCTSIZE; elemc > 0; elemc--) {
-	    *workspaceptr++ = GETJSAMPLE(*elemptr++) - CENTERJSAMPLE;
-	  }
-	}
-#endif
-      }
-    }
+    (*fdct->convsamp) (sample_data, start_col, wkptr);
    /* Perform the DCT */
-    (*do_dct) (workspace);
+    (*fdct->do_dct) (wkptr);
    /* Quantize/descale the coefficients, and store into coef_blocks[] */
-    { register DCTELEM temp, qval;
-      register int i;
-      register JCOEFPTR output_ptr = coef_blocks[bi];
-      for (i = 0; i < DCTSIZE2; i++) {
-	qval = divisors[i];
-	temp = workspace[i];
-	/* Divide the coefficient value by qval, ensuring proper rounding.
-	 * Since C does not specify the direction of rounding for negative
-	 * quotients, we have to force the dividend positive for portability.
-	 *
-	 * In most files, at least half of the output values will be zero
-	 * (at default quantization settings, more like three-quarters...)
-	 * so we should ensure that this case is fast.  On many machines,
-	 * a comparison is enough cheaper than a divide to make a special test
-	 * a win.  Since both inputs will be nonnegative, we need only test
-	 * for a < b to discover whether a/b is 0.
-	 * If your machine's division is fast enough, define FAST_DIVIDE.
-	 */
-#ifdef FAST_DIVIDE
-#define DIVIDE_BY(a,b)	a /= b
-#else
-#define DIVIDE_BY(a,b)	if (a >= b) a /= b; else a = 0
-#endif
-	if (temp < 0) {
-	  temp = -temp;
-	  temp += qval>>1;	/* for rounding */
-	  DIVIDE_BY(temp, qval);
-	  temp = -temp;
-	} else {
-	  temp += qval>>1;	/* for rounding */
-	  DIVIDE_BY(temp, qval);
-	}
-	output_ptr[i] = (JCOEF) temp;
-      }
-    }
+    (*fdct->quantize) (coef_blocks[bi], divisors, wkptr);
  }
 }
@@ -273,64 +314,23 @@ forward_DCT_float (j_compress_ptr cinfo, jpeg_component_info * compptr,
 		   JDIMENSION num_blocks)
 /* This version is used for floating-point DCT implementations. */
 {
  /* This routine is heavily used, so it's worth coding it tightly. */
  my_fdct_ptr fdct = (my_fdct_ptr) cinfo->fdct;
  float_DCT_method_ptr do_dct = fdct->do_float_dct;
  FAST_FLOAT * divisors = fdct->float_divisors[compptr->quant_tbl_no];
-  FAST_FLOAT workspace[DCTSIZE2]; /* work area for FDCT subroutine */
+  FAST_FLOAT workspace[DCTSIZE2 + ALIGN_SIZE/sizeof(FAST_FLOAT)];
+  FAST_FLOAT * wkptr = (FAST_FLOAT *) ALIGN_MEM(workspace, ALIGN_SIZE);
  JDIMENSION bi;
  sample_data += start_row;	/* fold in the vertical offset once */
  for (bi = 0; bi < num_blocks; bi++, start_col += DCTSIZE) {
    /* Load data into workspace, applying unsigned->signed conversion */
-    { register FAST_FLOAT *workspaceptr;
-      register JSAMPROW elemptr;
-      register int elemr;
-      workspaceptr = workspace;
-      for (elemr = 0; elemr < DCTSIZE; elemr++) {
-	elemptr = sample_data[elemr] + start_col;
-#if DCTSIZE == 8		/* unroll the inner loop */
-	*workspaceptr++ = (FAST_FLOAT)(GETJSAMPLE(*elemptr++) - CENTERJSAMPLE);
-	*workspaceptr++ = (FAST_FLOAT)(GETJSAMPLE(*elemptr++) - CENTERJSAMPLE);
-	*workspaceptr++ = (FAST_FLOAT)(GETJSAMPLE(*elemptr++) - CENTERJSAMPLE);
-	*workspaceptr++ = (FAST_FLOAT)(GETJSAMPLE(*elemptr++) - CENTERJSAMPLE);
-	*workspaceptr++ = (FAST_FLOAT)(GETJSAMPLE(*elemptr++) - CENTERJSAMPLE);
-	*workspaceptr++ = (FAST_FLOAT)(GETJSAMPLE(*elemptr++) - CENTERJSAMPLE);
-	*workspaceptr++ = (FAST_FLOAT)(GETJSAMPLE(*elemptr++) - CENTERJSAMPLE);
-	*workspaceptr++ = (FAST_FLOAT)(GETJSAMPLE(*elemptr++) - CENTERJSAMPLE);
-#else
-	{ register int elemc;
-	  for (elemc = DCTSIZE; elemc > 0; elemc--) {
-	    *workspaceptr++ = (FAST_FLOAT)
-	      (GETJSAMPLE(*elemptr++) - CENTERJSAMPLE);
-	  }
-	}
-#endif
-      }
-    }
+    (*fdct->float_convsamp) (sample_data, start_col, wkptr);
    /* Perform the DCT */
-    (*do_dct) (workspace);
+    (*fdct->do_float_dct) (wkptr);
    /* Quantize/descale the coefficients, and store into coef_blocks[] */
-    { register FAST_FLOAT temp;
-      register int i;
-      register JCOEFPTR output_ptr = coef_blocks[bi];
-      for (i = 0; i < DCTSIZE2; i++) {
-	/* Apply the quantization and scaling factor */
-	temp = workspace[i] * divisors[i];
-	/* Round to nearest integer.
-	 * Since C does not specify the direction of rounding for negative
-	 * quotients, we have to force the dividend positive for portability.
-	 * The maximum coefficient size is +-16K (for 12-bit data), so this
-	 * code should work for either 16-bit or 32-bit ints.
-	 */
-	output_ptr[i] = (JCOEF) ((int) (temp + (FAST_FLOAT) 16384.5) - 16384);
-      }
-    }
+    (*fdct->float_quantize) (coef_blocks[bi], divisors, wkptr);
  }
 }
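Both forward_DCT variants above over-allocate the stack workspace by ALIGN_SIZE bytes and then round the pointer up to a 16-byte boundary. A minimal sketch of that idiom follows; `align_up` and `aligned_workspace_ok` are hypothetical names, `DCTELEM` is assumed to be a 16-bit type, and `uintptr_t` is used in place of the patch's `size_t` cast.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define ALIGN_SIZE 16
#define DCTSIZE2   64
typedef short DCTELEM;              /* assumption: 16-bit DCT element */

/* Round p up to the next multiple of a; a must be a power of two. */
static void *align_up(void *p, size_t a)
{
    assert(a != 0 && (a & (a - 1)) == 0);
    return (void *)(((uintptr_t)p + (a - 1)) & ~(uintptr_t)(a - 1));
}

/* Returns 1 if the padded buffer always leaves room for one aligned block. */
static int aligned_workspace_ok(void)
{
    /* ALIGN_SIZE extra bytes of slack, expressed in elements. */
    DCTELEM workspace[DCTSIZE2 + ALIGN_SIZE / sizeof(DCTELEM)];
    DCTELEM *wkptr = (DCTELEM *)align_up(workspace, ALIGN_SIZE);
    size_t total = sizeof(workspace) / sizeof(workspace[0]);

    int aligned = ((uintptr_t)wkptr % ALIGN_SIZE) == 0;
    int fits = (wkptr >= workspace) && (wkptr + DCTSIZE2 <= workspace + total);
    return aligned && fits;
}
```

Rounding up skips at most ALIGN_SIZE - 1 bytes, so padding the array by ALIGN_SIZE bytes is always enough for the aligned block to fit.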
@@ -346,6 +346,7 @@ jinit_forward_dct (j_compress_ptr cinfo)
 {
  my_fdct_ptr fdct;
  int i;
  unsigned int simd = jpeg_simd_support((j_common_ptr) cinfo);
  fdct = (my_fdct_ptr)
    (*cinfo->mem->alloc_small) ((j_common_ptr) cinfo, JPOOL_IMAGE,
@@ -357,21 +358,86 @@ jinit_forward_dct (j_compress_ptr cinfo)
 #ifdef DCT_ISLOW_SUPPORTED
  case JDCT_ISLOW:
    fdct->pub.forward_DCT = forward_DCT;
-    fdct->do_dct = jpeg_fdct_islow;
-    break;
+#ifdef JFDCT_INT_SSE2_SUPPORTED
+    if (simd & JSIMD_SSE2 &&
        IS_CONST_ALIGNED_16(jconst_fdct_islow_sse2)) {
      fdct->do_dct = jpeg_fdct_islow_sse2;
      fdct->convsamp = jpeg_convsamp_int_sse2;
      fdct->quantize = jpeg_quantize_int_sse2;
    } else
 #endif
 #ifdef JFDCT_INT_MMX_SUPPORTED
    if (simd & JSIMD_MMX) {
      fdct->do_dct = jpeg_fdct_islow_mmx;
      fdct->convsamp = jpeg_convsamp_int_mmx;
      fdct->quantize = jpeg_quantize_int_mmx;
    } else
 #endif
    {
      fdct->do_dct = jpeg_fdct_islow;
      fdct->convsamp = jpeg_convsamp_int;
      fdct->quantize = jpeg_quantize_int;
    }
    break;
 #endif /* DCT_ISLOW_SUPPORTED */
 #ifdef DCT_IFAST_SUPPORTED
  case JDCT_IFAST:
    fdct->pub.forward_DCT = forward_DCT;
-    fdct->do_dct = jpeg_fdct_ifast;
-    break;
+#ifdef JFDCT_INT_SSE2_SUPPORTED
+    if (simd & JSIMD_SSE2 &&
        IS_CONST_ALIGNED_16(jconst_fdct_ifast_sse2)) {
      fdct->do_dct = jpeg_fdct_ifast_sse2;
      fdct->convsamp = jpeg_convsamp_int_sse2;
      fdct->quantize = jpeg_quantize_int_sse2;
    } else
 #endif
 #ifdef JFDCT_INT_MMX_SUPPORTED
    if (simd & JSIMD_MMX) {
      fdct->do_dct = jpeg_fdct_ifast_mmx;
      fdct->convsamp = jpeg_convsamp_int_mmx;
      fdct->quantize = jpeg_quantize_int_mmx;
    } else
 #endif
    {
      fdct->do_dct = jpeg_fdct_ifast;
      fdct->convsamp = jpeg_convsamp_int;
      fdct->quantize = jpeg_quantize_int;
    }
    break;
 #endif /* DCT_IFAST_SUPPORTED */
 #ifdef DCT_FLOAT_SUPPORTED
  case JDCT_FLOAT:
    fdct->pub.forward_DCT = forward_DCT_float;
-    fdct->do_float_dct = jpeg_fdct_float;
-    break;
+#ifdef JFDCT_FLT_SSE_SSE2_SUPPORTED
+    if (simd & JSIMD_SSE && simd & JSIMD_SSE2 &&
        IS_CONST_ALIGNED_16(jconst_fdct_float_sse)) {
      fdct->do_float_dct = jpeg_fdct_float_sse;
      fdct->float_convsamp = jpeg_convsamp_flt_sse2;
      fdct->float_quantize = jpeg_quantize_flt_sse2;
    } else
 #endif
 #ifdef JFDCT_FLT_SSE_MMX_SUPPORTED
    if (simd & JSIMD_SSE &&
        IS_CONST_ALIGNED_16(jconst_fdct_float_sse)) {
      fdct->do_float_dct = jpeg_fdct_float_sse;
      fdct->float_convsamp = jpeg_convsamp_flt_sse;
      fdct->float_quantize = jpeg_quantize_flt_sse;
    } else
 #endif
 #ifdef JFDCT_FLT_3DNOW_MMX_SUPPORTED
    if (simd & JSIMD_3DNOW) {
      fdct->do_float_dct = jpeg_fdct_float_3dnow;
      fdct->float_convsamp = jpeg_convsamp_flt_3dnow;
      fdct->float_quantize = jpeg_quantize_flt_3dnow;
    } else
 #endif
    {
      fdct->do_float_dct = jpeg_fdct_float;
      fdct->float_convsamp = jpeg_convsamp_float;
      fdct->float_quantize = jpeg_quantize_float;
    }
    break;
 #endif /* DCT_FLOAT_SUPPORTED */
  default:
    ERREXIT(cinfo, JERR_NOT_COMPILED);
    break;
@@ -385,3 +451,65 @@ jinit_forward_dct (j_compress_ptr cinfo)
 #endif
  }
 }
 #ifndef JSIMD_MODEINFO_NOT_SUPPORTED
 GLOBAL(unsigned int)
 jpeg_simd_forward_dct (j_compress_ptr cinfo, int method)
 {
  unsigned int simd = jpeg_simd_support((j_common_ptr) cinfo);
  switch (method) {
 #ifdef DCT_ISLOW_SUPPORTED
  case JDCT_ISLOW:
 #ifdef JFDCT_INT_SSE2_SUPPORTED
    if (simd & JSIMD_SSE2 &&
        IS_CONST_ALIGNED_16(jconst_fdct_islow_sse2))
      return JSIMD_SSE2;
 #endif
 #ifdef JFDCT_INT_MMX_SUPPORTED
    if (simd & JSIMD_MMX)
      return JSIMD_MMX;
 #endif
    return JSIMD_NONE;
 #endif /* DCT_ISLOW_SUPPORTED */
 #ifdef DCT_IFAST_SUPPORTED
  case JDCT_IFAST:
 #ifdef JFDCT_INT_SSE2_SUPPORTED
    if (simd & JSIMD_SSE2 &&
        IS_CONST_ALIGNED_16(jconst_fdct_ifast_sse2))
      return JSIMD_SSE2;
 #endif
 #ifdef JFDCT_INT_MMX_SUPPORTED
    if (simd & JSIMD_MMX)
      return JSIMD_MMX;
 #endif
    return JSIMD_NONE;
 #endif /* DCT_IFAST_SUPPORTED */
 #ifdef DCT_FLOAT_SUPPORTED
  case JDCT_FLOAT:
 #ifdef JFDCT_FLT_SSE_SSE2_SUPPORTED
    if (simd & JSIMD_SSE && simd & JSIMD_SSE2 &&
        IS_CONST_ALIGNED_16(jconst_fdct_float_sse))
      return JSIMD_SSE;		/* (JSIMD_SSE | JSIMD_SSE2); */
 #endif
 #ifdef JFDCT_FLT_SSE_MMX_SUPPORTED
    if (simd & JSIMD_SSE &&
        IS_CONST_ALIGNED_16(jconst_fdct_float_sse))
      return JSIMD_SSE;		/* (JSIMD_SSE | JSIMD_MMX); */
 #endif
 #ifdef JFDCT_FLT_3DNOW_MMX_SUPPORTED
    if (simd & JSIMD_3DNOW)
      return JSIMD_3DNOW;	/* (JSIMD_3DNOW | JSIMD_MMX); */
 #endif
    return JSIMD_NONE;
 #endif /* DCT_FLOAT_SUPPORTED */
  default:
    ;
  }
  return JSIMD_NONE;	/* not compiled */
 }
 #endif /* !JSIMD_MODEINFO_NOT_SUPPORTED */
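The selection logic in jinit_forward_dct and jpeg_simd_forward_dct is a capability-ordered fallback chain: try the most capable implementation first and fall through to plain C. A minimal table-driven sketch of the same pattern is shown below. All names here (`CPU_*`, `dct_*`, `select_dct`) are hypothetical and the stand-in functions only report which variant was chosen; the patch itself uses nested #ifdef blocks instead of a table.

```c
#include <stddef.h>

/* Hypothetical capability bits, standing in for the JSIMD_* flags. */
enum { CPU_NONE = 0, CPU_MMX = 1 << 0, CPU_SSE2 = 1 << 1 };

typedef int (*dct_fn)(void);

/* Stand-ins for the real routines; each returns the flag it represents. */
static int dct_c(void)    { return CPU_NONE; }
static int dct_mmx(void)  { return CPU_MMX; }
static int dct_sse2(void) { return CPU_SSE2; }

typedef struct {
    unsigned int required;   /* every bit here must be set in the CPU mask */
    dct_fn fn;
} dct_candidate;

/* Pick the first candidate whose requirements are met; order = preference. */
static dct_fn select_dct(unsigned int simd)
{
    static const dct_candidate table[] = {
        { CPU_SSE2, dct_sse2 },
        { CPU_MMX,  dct_mmx  },
        { CPU_NONE, dct_c    }   /* always matches */
    };
    size_t i;

    for (i = 0; i < sizeof(table) / sizeof(table[0]); i++) {
        if ((simd & table[i].required) == table[i].required)
            return table[i].fn;
    }
    return dct_c;
}
```

A caller would store the returned pointer once at initialization, as the patch does with fdct->do_dct, and invoke it per block afterwards.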
--- a/jchuff.c
+++ b/jchuff.c
@@ -1,7 +1,7 @@
 /*
 * jchuff.c
 *
- * Copyright (C) 1991-1996, Thomas G. Lane.
+ * Copyright (C) 1991-1997, Thomas G. Lane.
 * This file is part of the Independent JPEG Group's software.
 * For conditions of distribution and use, see the accompanying README file.
 *
@@ -125,16 +125,14 @@ start_pass_huff (j_compress_ptr cinfo, boolean gather_statistics)
    compptr = cinfo->cur_comp_info[ci];
    dctbl = compptr->dc_tbl_no;
    actbl = compptr->ac_tbl_no;
-    /* Make sure requested tables are present */
-    /* (In gather mode, tables need not be allocated yet) */
-    if (dctbl < 0 || dctbl >= NUM_HUFF_TBLS ||
-	(cinfo->dc_huff_tbl_ptrs[dctbl] == NULL && !gather_statistics))
-      ERREXIT1(cinfo, JERR_NO_HUFF_TABLE, dctbl);
-    if (actbl < 0 || actbl >= NUM_HUFF_TBLS ||
-	(cinfo->ac_huff_tbl_ptrs[actbl] == NULL && !gather_statistics))
-      ERREXIT1(cinfo, JERR_NO_HUFF_TABLE, actbl);
    if (gather_statistics) {
 #ifdef ENTROPY_OPT_SUPPORTED
+      /* Check for invalid table indexes */
+      /* (make_c_derived_tbl does this in the other path) */
+      if (dctbl < 0 || dctbl >= NUM_HUFF_TBLS)
+	ERREXIT1(cinfo, JERR_NO_HUFF_TABLE, dctbl);
+      if (actbl < 0 || actbl >= NUM_HUFF_TBLS)
+	ERREXIT1(cinfo, JERR_NO_HUFF_TABLE, actbl);
      /* Allocate and zero the statistics tables */
      /* Note that jpeg_gen_optimal_table expects 257 entries in each table! */
      if (entropy->dc_count_ptrs[dctbl] == NULL)
@@ -151,9 +149,9 @@ start_pass_huff (j_compress_ptr cinfo, boolean gather_statistics)
    } else {
      /* Compute derived values for Huffman tables */
      /* We may do this more than once for a table, but it's not expensive */
-      jpeg_make_c_derived_tbl(cinfo, cinfo->dc_huff_tbl_ptrs[dctbl],
+      jpeg_make_c_derived_tbl(cinfo, TRUE, dctbl,
 			      & entropy->dc_derived_tbls[dctbl]);
-      jpeg_make_c_derived_tbl(cinfo, cinfo->ac_huff_tbl_ptrs[actbl],
+      jpeg_make_c_derived_tbl(cinfo, FALSE, actbl,
 			      & entropy->ac_derived_tbls[actbl]);
    }
    /* Initialize DC predictions to 0 */
@@ -172,19 +170,34 @@ start_pass_huff (j_compress_ptr cinfo, boolean gather_statistics)
 /*
 * Compute the derived values for a Huffman table.
 * This routine also performs some validation checks on the table.
 *
 * Note this is also used by jcphuff.c.
 */
 GLOBAL(void)
-jpeg_make_c_derived_tbl (j_compress_ptr cinfo, JHUFF_TBL * htbl,
+jpeg_make_c_derived_tbl (j_compress_ptr cinfo, boolean isDC, int tblno,
 			 c_derived_tbl ** pdtbl)
 {
  JHUFF_TBL *htbl;
  c_derived_tbl *dtbl;
-  int p, i, l, lastp, si;
+  int p, i, l, lastp, si, maxsymbol;
  char huffsize[257];
  unsigned int huffcode[257];
  unsigned int code;
  /* Note that huffsize[] and huffcode[] are filled in code-length order,
   * paralleling the order of the symbols themselves in htbl->huffval[].
   */
  /* Find the input Huffman table */
  if (tblno < 0 || tblno >= NUM_HUFF_TBLS)
    ERREXIT1(cinfo, JERR_NO_HUFF_TABLE, tblno);
  htbl =
    isDC ? cinfo->dc_huff_tbl_ptrs[tblno] : cinfo->ac_huff_tbl_ptrs[tblno];
  if (htbl == NULL)
    ERREXIT1(cinfo, JERR_NO_HUFF_TABLE, tblno);
  /* Allocate a workspace if we haven't already done so. */
  if (*pdtbl == NULL)
    *pdtbl = (c_derived_tbl *)
@@ -193,18 +206,20 @@ jpeg_make_c_derived_tbl (j_compress_ptr cinfo, JHUFF_TBL * htbl,
  dtbl = *pdtbl;
  /* Figure C.1: make table of Huffman code length for each symbol */
  /* Note that this is in code-length order. */
  p = 0;
  for (l = 1; l <= 16; l++) {
-    for (i = 1; i <= (int) htbl->bits[l]; i++)
+    i = (int) htbl->bits[l];
+    if (i < 0 || p + i > 256)	/* protect against table overrun */
+      ERREXIT(cinfo, JERR_BAD_HUFF_TABLE);
+    while (i--)
      huffsize[p++] = (char) l;
  }
  huffsize[p] = 0;
  lastp = p;
  /* Figure C.2: generate the codes themselves */
-  /* Note that this is in code-length order. */
+  /* We also validate that the counts represent a legal Huffman code tree. */
  code = 0;
  si = huffsize[0];
@@ -214,6 +229,11 @@ jpeg_make_c_derived_tbl (j_compress_ptr cinfo, JHUFF_TBL * htbl,
      huffcode[p++] = code;
      code++;
    }
    /* code is now 1 more than the last code used for codelength si; but
     * it must still fit in si bits, since no code is allowed to be all ones.
     */
    if (((INT32) code) >= (((INT32) 1) << si))
      ERREXIT(cinfo, JERR_BAD_HUFF_TABLE);
    code <<= 1;
    si++;
  }
@@ -221,14 +241,25 @@ jpeg_make_c_derived_tbl (j_compress_ptr cinfo, JHUFF_TBL * htbl,
  /* Figure C.3: generate encoding tables */
  /* These are code and size indexed by symbol value */
-  /* Set any codeless symbols to have code length 0;
-   * this allows emit_bits to detect any attempt to emit such symbols.
+  /* Set all codeless symbols to have code length 0;
+   * this lets us detect duplicate VAL entries here, and later
+   * allows emit_bits to detect any attempt to emit such symbols.
   */
  MEMZERO(dtbl->ehufsi, SIZEOF(dtbl->ehufsi));
  /* This is also a convenient place to check for out-of-range
   * and duplicated VAL entries.  We allow 0..255 for AC symbols
   * but only 0..15 for DC.  (We could constrain them further
   * based on data depth and mode, but this seems enough.)
   */
  maxsymbol = isDC ? 15 : 255;
  for (p = 0; p < lastp; p++) {
-    dtbl->ehufco[htbl->huffval[p]] = huffcode[p];
-    dtbl->ehufsi[htbl->huffval[p]] = huffsize[p];
+    i = htbl->huffval[p];
+    if (i < 0 || i > maxsymbol || dtbl->ehufsi[i])
+      ERREXIT(cinfo, JERR_BAD_HUFF_TABLE);
+    dtbl->ehufco[i] = huffcode[p];
+    dtbl->ehufsi[i] = huffsize[p];
  }
 }
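The validation added to jpeg_make_c_derived_tbl follows the three steps of JPEG Annex C (Figures C.1 to C.3) and rejects tables that overrun, use an all-ones code, or repeat a symbol. The sketch below restates those steps as a standalone function. `build_huff_codes` is a hypothetical name, errors are reported by return value rather than ERREXIT, and the table is passed in directly instead of being looked up in `cinfo`.

```c
#include <string.h>

/* Expand a (bits, huffval) Huffman table into per-symbol code and length.
 * bits[l] is the number of codes of length l (index 0 unused).
 * Returns 0 on success, -1 if the table is malformed. */
static int build_huff_codes(const unsigned char bits[17],
                            const unsigned char *huffval, int max_symbol,
                            unsigned int ehufco[256], char ehufsi[256])
{
    char huffsize[257];
    unsigned int huffcode[257];
    unsigned int code;
    int p, i, l, lastp, si;

    /* Figure C.1: list of code lengths, in code-length order. */
    p = 0;
    for (l = 1; l <= 16; l++) {
        i = bits[l];
        if (p + i > 256)                /* protect against table overrun */
            return -1;
        while (i--)
            huffsize[p++] = (char)l;
    }
    huffsize[p] = 0;
    lastp = p;

    /* Figure C.2: assign consecutive codes within each length. */
    code = 0;
    si = huffsize[0];
    p = 0;
    while (huffsize[p]) {
        while (huffsize[p] == si) {
            huffcode[p++] = code;
            code++;
        }
        if (code >= (1u << si))         /* an all-ones code was used */
            return -1;
        code <<= 1;
        si++;
    }

    /* Figure C.3: index by symbol; length 0 marks "no code assigned". */
    memset(ehufsi, 0, 256);
    for (p = 0; p < lastp; p++) {
        i = huffval[p];
        if (i > max_symbol || ehufsi[i])  /* out of range or duplicate */
            return -1;
        ehufco[i] = huffcode[p];
        ehufsi[i] = huffsize[p];
    }
    return 0;
}
```

Passing 15 as `max_symbol` for DC tables and 255 for AC tables corresponds to the `maxsymbol` choice in the patch.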
@@ -343,6 +374,11 @@ encode_one_block (working_state * state, JCOEFPTR block, int last_dc_val,
    nbits++;
    temp >>= 1;
  }
  /* Check for out-of-range coefficient values.
   * Since we're encoding a difference, the range limit is twice as much.
   */
  if (nbits > MAX_COEF_BITS+1)
    ERREXIT(state->cinfo, JERR_BAD_DCT_COEF);
  /* Emit the Huffman-coded symbol for the number of bits */
  if (! emit_bits(state, dctbl->ehufco[nbits], dctbl->ehufsi[nbits]))
@@ -380,6 +416,9 @@ encode_one_block (working_state * state, JCOEFPTR block, int last_dc_val,
      nbits = 1;		/* there must be at least one 1 bit */
      while ((temp >>= 1))
 	nbits++;
      /* Check for out-of-range coefficient values */
      if (nbits > MAX_COEF_BITS)
 	ERREXIT(state->cinfo, JERR_BAD_DCT_COEF);
      /* Emit Huffman symbol for run length / number of bits */
      i = (r << 4) + nbits;
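The new range checks in encode_one_block compare the bit count of a coefficient's magnitude against MAX_COEF_BITS (or MAX_COEF_BITS+1 for a DC difference). A minimal sketch of that test follows, assuming 8-bit samples so that MAX_COEF_BITS is 10 as defined in jchuff.h by this patch; the helper names are hypothetical.

```c
#define MAX_COEF_BITS 10   /* 8-bit sample case */

/* Number of bits needed to hold |v|; 0 when v == 0. */
static int coef_nbits(int v)
{
    int nbits = 0;

    if (v < 0)
        v = -v;
    while (v) {
        nbits++;
        v >>= 1;
    }
    return nbits;
}

/* AC coefficients are coded directly. */
static int ac_coef_in_range(int v)
{
    return coef_nbits(v) <= MAX_COEF_BITS;
}

/* DC is coded as a difference of two coefficients, so one more bit is allowed. */
static int dc_diff_in_range(int diff)
{
    return coef_nbits(diff) <= MAX_COEF_BITS + 1;
}
```

In the encoder the same bit count doubles as the Huffman symbol (or its low nibble), which is why an out-of-range value has to be rejected before the table lookup.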
@@ -516,19 +555,12 @@ finish_pass_huff (j_compress_ptr cinfo)
 /*
 * Huffman coding optimization.
 *
- * This actually is optimization, in the sense that we find the best possible
- * Huffman table(s) for the given data.  We first scan the supplied data and
- * count the number of uses of each symbol that is to be Huffman-coded.
- * (This process must agree with the code above.)  Then we build an
- * optimal Huffman coding tree for the observed counts.
- *
- * The JPEG standard requires Huffman codes to be no more than 16 bits long.
- * If some symbols have a very small but nonzero probability, the Huffman tree
- * must be adjusted to meet the code length restriction.  We currently use
- * the adjustment method suggested in the JPEG spec.  This method is *not*
- * optimal; it may not choose the best possible limited-length code.  But
- * since the symbols involved are infrequently used, it's not clear that
- * going to extra trouble is worthwhile.
+ * We first scan the supplied data and count the number of uses of each symbol
+ * that is to be Huffman-coded. (This process MUST agree with the code above.)
+ * Then we build a Huffman coding tree for the observed counts.
+ * Symbols which are not needed at all for the particular image are not
+ * assigned any code, which saves space in the DHT marker as well as in
+ * the compressed data.
 */
 #ifdef ENTROPY_OPT_SUPPORTED
@@ -537,7 +569,7 @@ finish_pass_huff (j_compress_ptr cinfo)
 /* Process a single block's worth of coefficients */
 LOCAL(void)
-htest_one_block (JCOEFPTR block, int last_dc_val,
+htest_one_block (j_compress_ptr cinfo, JCOEFPTR block, int last_dc_val,
 		 long dc_counts[], long ac_counts[])
 {
  register int temp;
@@ -556,6 +588,11 @@ htest_one_block (JCOEFPTR block, int last_dc_val,
    nbits++;
    temp >>= 1;
  }
  /* Check for out-of-range coefficient values.
   * Since we're encoding a difference, the range limit is twice as much.
   */
  if (nbits > MAX_COEF_BITS+1)
    ERREXIT(cinfo, JERR_BAD_DCT_COEF);
  /* Count the Huffman symbol for the number of bits */
  dc_counts[nbits]++;
@@ -582,6 +619,9 @@ htest_one_block (JCOEFPTR block, int last_dc_val,
      nbits = 1;		/* there must be at least one 1 bit */
      while ((temp >>= 1))
 	nbits++;
      /* Check for out-of-range coefficient values */
      if (nbits > MAX_COEF_BITS)
 	ERREXIT(cinfo, JERR_BAD_DCT_COEF);
      /* Count Huffman symbol for run length / number of bits */
      ac_counts[(r << 4) + nbits]++;
@@ -623,7 +663,7 @@ encode_mcu_gather (j_compress_ptr cinfo, JBLOCKROW *MCU_data)
  for (blkn = 0; blkn < cinfo->blocks_in_MCU; blkn++) {
    ci = cinfo->MCU_membership[blkn];
    compptr = cinfo->cur_comp_info[ci];
-    htest_one_block(MCU_data[blkn][0], entropy->saved.last_dc_val[ci],
+    htest_one_block(cinfo, MCU_data[blkn][0], entropy->saved.last_dc_val[ci],
 		    entropy->dc_count_ptrs[compptr->dc_tbl_no],
 		    entropy->ac_count_ptrs[compptr->ac_tbl_no]);
    entropy->saved.last_dc_val[ci] = MCU_data[blkn][0][0];
@@ -634,8 +674,31 @@ encode_mcu_gather (j_compress_ptr cinfo, JBLOCKROW *MCU_data)
 /*
- * Generate the optimal coding for the given counts, fill htbl.
+ * Generate the best Huffman code table for the given counts, fill htbl.
 * Note this is also used by jcphuff.c.
 *
 * The JPEG standard requires that no symbol be assigned a codeword of all
 * one bits (so that padding bits added at the end of a compressed segment
 * can't look like a valid code).  Because of the canonical ordering of
 * codewords, this just means that there must be an unused slot in the
 * longest codeword length category.  Section K.2 of the JPEG spec suggests
 * reserving such a slot by pretending that symbol 256 is a valid symbol
 * with count 1.  In theory that's not optimal; giving it count zero but
 * including it in the symbol set anyway should give a better Huffman code.
 * But the theoretically better code actually seems to come out worse in
 * practice, because it produces more all-ones bytes (which incur stuffed
 * zero bytes in the final file).  In any case the difference is tiny.
 *
 * The JPEG standard requires Huffman codes to be no more than 16 bits long.
 * If some symbols have a very small but nonzero probability, the Huffman tree
 * must be adjusted to meet the code length restriction.  We currently use
 * the adjustment method suggested in JPEG section K.2.  This method is *not*
 * optimal; it may not choose the best possible limited-length code.  But
 * typically only very-low-frequency symbols will be given less-than-optimal
 * lengths, so the code is almost optimal.  Experimental comparisons against
 * an optimal limited-length-code algorithm indicate that the difference is
 * microscopic --- usually less than a hundredth of a percent of total size.
 * So the extra complexity of an optimal algorithm doesn't seem worthwhile.
 */
 GLOBAL(void)
@@ -656,10 +719,10 @@ jpeg_gen_optimal_table (j_compress_ptr cinfo, JHUFF_TBL * htbl, long freq[])
  for (i = 0; i < 257; i++)
    others[i] = -1;		/* init links to empty */
-  freq[256] = 1;		/* make sure there is a nonzero count */
+  freq[256] = 1;		/* make sure 256 has a nonzero count */
  /* Including the pseudo-symbol 256 in the Huffman procedure guarantees
   * that no real symbol is given code-value of all ones, because 256
-   * will be placed in the largest codeword category.
+   * will be placed last in the largest codeword category.
   */
  /* Huffman's basic algorithm to assign optimal code lengths to symbols */
--- a/jchuff.h
+++ b/jchuff.h
@@ -1,7 +1,7 @@
 /*
 * jchuff.h
 *
- * Copyright (C) 1991-1996, Thomas G. Lane.
+ * Copyright (C) 1991-1997, Thomas G. Lane.
 * This file is part of the Independent JPEG Group's software.
 * For conditions of distribution and use, see the accompanying README file.
 *
@@ -10,6 +10,18 @@
 * progressive encoder (jcphuff.c).  No other modules need to see these.
 */
 /* The legal range of a DCT coefficient is
 *  -1024 .. +1023  for 8-bit data;
 * -16384 .. +16383 for 12-bit data.
 * Hence the magnitude should always fit in 10 or 14 bits respectively.
 */
 #if BITS_IN_JSAMPLE == 8
 #define MAX_COEF_BITS 10
 #else
 #define MAX_COEF_BITS 14
 #endif
 /* Derived data constructed for each Huffman table */
 typedef struct {
@@ -27,7 +39,8 @@ typedef struct {
 /* Expand a Huffman table definition into the derived format */
 EXTERN(void) jpeg_make_c_derived_tbl
-	JPP((j_compress_ptr cinfo, JHUFF_TBL * htbl, c_derived_tbl ** pdtbl));
+	JPP((j_compress_ptr cinfo, boolean isDC, int tblno,
 	     c_derived_tbl ** pdtbl));
 /* Generate an optimal table definition given the specified counts */
 EXTERN(void) jpeg_gen_optimal_table
--- a/jcinit.c
+++ b/jcinit.c
@@ -1,7 +1,7 @@
 /*
 * jcinit.c
 *
- * Copyright (C) 1991-1996, Thomas G. Lane.
+ * Copyright (C) 1991-1997, Thomas G. Lane.
 * This file is part of the Independent JPEG Group's software.
 * For conditions of distribution and use, see the accompanying README file.
 *
@@ -56,7 +56,7 @@ jinit_compress_master (j_compress_ptr cinfo)
  /* Need a full-image coefficient buffer in any multi-pass mode. */
  jinit_c_coef_controller(cinfo,
-			  (cinfo->num_scans > 1 || cinfo->optimize_coding));
+		(boolean) (cinfo->num_scans > 1 || cinfo->optimize_coding));
  jinit_c_main_controller(cinfo, FALSE /* never need full buffer here */);
  jinit_marker_writer(cinfo);
--- a/jcmarker.c
+++ b/jcmarker.c
@@ -1,7 +1,7 @@
 /*
 * jcmarker.c
 *
- * Copyright (C) 1991-1996, Thomas G. Lane.
+ * Copyright (C) 1991-1998, Thomas G. Lane.
 * This file is part of the Independent JPEG Group's software.
 * For conditions of distribution and use, see the accompanying README file.
 *
@@ -81,6 +81,17 @@ typedef enum {			/* JPEG marker codes */
 } JPEG_MARKER;
 /* Private state */
 typedef struct {
  struct jpeg_marker_writer pub; /* public fields */
  unsigned int last_restart_interval; /* last DRI value emitted; 0 after SOI */
 } my_marker_writer;
 typedef my_marker_writer * my_marker_ptr;
 /*
 * Basic output routines.
 *
@@ -158,8 +169,8 @@ emit_dqt (j_compress_ptr cinfo, int index)
      /* The table entries must be emitted in zigzag order. */
      unsigned int qval = qtbl->quantval[jpeg_natural_order[i]];
      if (prec)
-	emit_byte(cinfo, qval >> 8);
+	emit_byte(cinfo, (int) (qval >> 8));
-      emit_byte(cinfo, qval & 0xFF);
+      emit_byte(cinfo, (int) (qval & 0xFF));
    }
    qtbl->sent_table = TRUE;
@@ -342,7 +353,7 @@ emit_jfif_app0 (j_compress_ptr cinfo)
   * Length of APP0 block	(2 bytes)
   * Block ID			(4 bytes - ASCII "JFIF")
   * Zero byte			(1 byte to terminate the ID string)
-   * Version Major, Minor	(2 bytes - 0x01, 0x01)
+   * Version Major, Minor	(2 bytes - major first)
   * Units			(1 byte - 0x00 = none, 0x01 = inch, 0x02 = cm)
   * Xdpu			(2 bytes - dots per unit horizontal)
   * Ydpu			(2 bytes - dots per unit vertical)
@@ -359,11 +370,8 @@ emit_jfif_app0 (j_compress_ptr cinfo)
  emit_byte(cinfo, 0x49);
  emit_byte(cinfo, 0x46);
  emit_byte(cinfo, 0);
-  /* We currently emit version code 1.01 since we use no 1.02 features.
-   * This may avoid complaints from some older decoders.
-   */
-  emit_byte(cinfo, 1);		/* Major version */
-  emit_byte(cinfo, 1);		/* Minor version */
+  emit_byte(cinfo, cinfo->JFIF_major_version); /* Version fields */
+  emit_byte(cinfo, cinfo->JFIF_minor_version);
  emit_byte(cinfo, cinfo->density_unit); /* Pixel size information */
  emit_2bytes(cinfo, (int) cinfo->X_density);
  emit_2bytes(cinfo, (int) cinfo->Y_density);
@@ -419,28 +427,30 @@ emit_adobe_app14 (j_compress_ptr cinfo)
 /*
- * This routine is exported for possible use by applications.
+ * These routines allow writing an arbitrary marker with parameters.
- * The intended use is to emit COM or APPn markers after calling
+ * The only intended use is to emit COM or APPn markers after calling
- * jpeg_start_compress() and before the first jpeg_write_scanlines() call
+ * write_file_header and before calling write_frame_header.
- * (hence, after write_file_header but before write_frame_header).
 * Other uses are not guaranteed to produce desirable results.
 * Counting the parameter bytes properly is the caller's responsibility.
 */
 METHODDEF(void)
-write_any_marker (j_compress_ptr cinfo, int marker,
+write_marker_header (j_compress_ptr cinfo, int marker, unsigned int datalen)
-		  const JOCTET *dataptr, unsigned int datalen)
+/* Emit an arbitrary marker header */
-/* Emit an arbitrary marker with parameters */
 {
-  if (datalen <= (unsigned int) 65533) { /* safety check */
+  if (datalen > (unsigned int) 65533)		/* safety check */
    ERREXIT(cinfo, JERR_BAD_LENGTH);
  emit_marker(cinfo, (JPEG_MARKER) marker);
  emit_2bytes(cinfo, (int) (datalen + 2));	/* total length */
 }
-    while (datalen--) {
+METHODDEF(void)
-      emit_byte(cinfo, *dataptr);
+write_marker_byte (j_compress_ptr cinfo, int val)
-      dataptr++;
+/* Emit one byte of marker parameters following write_marker_header */
-    }
+{
-  }
+  emit_byte(cinfo, val);
 }
@@ -458,8 +468,13 @@ write_any_marker (j_compress_ptr cinfo, int marker,
 METHODDEF(void)
 write_file_header (j_compress_ptr cinfo)
 {
  my_marker_ptr marker = (my_marker_ptr) cinfo->marker;
  emit_marker(cinfo, M_SOI);	/* first the SOI */
  /* SOI is defined to reset restart interval to 0 */
  marker->last_restart_interval = 0;
  if (cinfo->write_JFIF_header)	/* next an optional JFIF APP0 */
    emit_jfif_app0(cinfo);
  if (cinfo->write_Adobe_marker) /* next an optional Adobe APP14 */
@@ -535,6 +550,7 @@ write_frame_header (j_compress_ptr cinfo)
 METHODDEF(void)
 write_scan_header (j_compress_ptr cinfo)
 {
  my_marker_ptr marker = (my_marker_ptr) cinfo->marker;
  int i;
  jpeg_component_info *compptr;
@@ -567,11 +583,12 @@ write_scan_header (j_compress_ptr cinfo)
  }
  /* Emit DRI if required --- note that DRI value could change for each scan.
-   * If it doesn't, a tiny amount of space is wasted in multiple-scan files.
+   * We avoid wasting space with unnecessary DRIs, however.
   * We assume DRI will never be nonzero for one scan and zero for a later one.
   */
-  if (cinfo->restart_interval)
+  if (cinfo->restart_interval != marker->last_restart_interval) {
    emit_dri(cinfo);
    marker->last_restart_interval = cinfo->restart_interval;
  }
  emit_sos(cinfo);
 }
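
The DRI suppression in write_scan_header can be sketched in isolation. The demo_writer type and both functions below are hypothetical stand-ins; the sketch counts DRI markers instead of writing them.

```c
/* Hypothetical stand-in for the private marker-writer state. */
typedef struct {
  unsigned int last_restart_interval;	/* last DRI value emitted; 0 after SOI */
  int dri_count;			/* how many DRI markers were "emitted" */
} demo_writer;

/* SOI is defined to reset the restart interval to 0. */
static void
demo_file_header (demo_writer *w)
{
  w->last_restart_interval = 0;
}

/* Emit a DRI only when the interval differs from the last one emitted. */
static void
demo_scan_header (demo_writer *w, unsigned int restart_interval)
{
  if (restart_interval != w->last_restart_interval) {
    w->dri_count++;
    w->last_restart_interval = restart_interval;
  }
}
```

With this rule, a multiple-scan file whose interval never changes carries at most one DRI marker.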
@@ -627,15 +644,21 @@ write_tables_only (j_compress_ptr cinfo)
 GLOBAL(void)
 jinit_marker_writer (j_compress_ptr cinfo)
 {
  my_marker_ptr marker;
  /* Create the subobject */
-  cinfo->marker = (struct jpeg_marker_writer *)
+  marker = (my_marker_ptr)
    (*cinfo->mem->alloc_small) ((j_common_ptr) cinfo, JPOOL_IMAGE,
-				SIZEOF(struct jpeg_marker_writer));
+				SIZEOF(my_marker_writer));
  cinfo->marker = (struct jpeg_marker_writer *) marker;
  /* Initialize method pointers */
-  cinfo->marker->write_any_marker = write_any_marker;
+  marker->pub.write_file_header = write_file_header;
-  cinfo->marker->write_file_header = write_file_header;
+  marker->pub.write_frame_header = write_frame_header;
-  cinfo->marker->write_frame_header = write_frame_header;
+  marker->pub.write_scan_header = write_scan_header;
-  cinfo->marker->write_scan_header = write_scan_header;
+  marker->pub.write_file_trailer = write_file_trailer;
-  cinfo->marker->write_file_trailer = write_file_trailer;
+  marker->pub.write_tables_only = write_tables_only;
-  cinfo->marker->write_tables_only = write_tables_only;
+  marker->pub.write_marker_header = write_marker_header;
  marker->pub.write_marker_byte = write_marker_byte;
  /* Initialize private state */
  marker->last_restart_interval = 0;
 }
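
The split into write_marker_header and write_marker_byte can be sketched against an in-memory buffer. Everything named here (byte_sink, sink_put, put_marker_header, put_comment) is hypothetical and only mirrors the shape of the routines in the diff; 0xFE is the COM marker code.

```c
#include <stddef.h>

/* Hypothetical in-memory sink standing in for the destination manager. */
typedef struct {
  unsigned char buf[64];
  size_t len;
} byte_sink;

/* Append one byte; bytes past the end of this small demo buffer are dropped. */
static void
sink_put (byte_sink *s, int val)
{
  if (s->len < sizeof(s->buf))
    s->buf[s->len++] = (unsigned char) (val & 0xFF);
}

/* Emit 0xFF, the marker code, then a 2-byte big-endian length that counts
 * itself (datalen + 2).  Returns -1 if datalen exceeds 65533.
 */
static int
put_marker_header (byte_sink *s, int marker, unsigned int datalen)
{
  if (datalen > 65533u)		/* safety check, as in the diff */
    return -1;
  sink_put(s, 0xFF);
  sink_put(s, marker);
  sink_put(s, (int) ((datalen + 2) >> 8));
  sink_put(s, (int) ((datalen + 2) & 0xFF));
  return 0;
}

/* Write a COM marker: the header first, then the payload byte by byte. */
static int
put_comment (byte_sink *s, const char *text, unsigned int n)
{
  unsigned int i;

  if (put_marker_header(s, 0xFE, n) != 0)
    return -1;
  for (i = 0; i < n; i++)
    sink_put(s, text[i]);
  return 0;
}
```

Writing the payload one byte at a time means the caller never has to assemble the whole marker in one contiguous buffer, but it must count the bytes correctly itself.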
--- a/jcmaster.c
+++ b/jcmaster.c
@@ -1,7 +1,7 @@
 /*
 * jcmaster.c
 *
- * Copyright (C) 1991-1996, Thomas G. Lane.
+ * Copyright (C) 1991-1997, Thomas G. Lane.
 * This file is part of the Independent JPEG Group's software.
 * For conditions of distribution and use, see the accompanying README file.
 *
@@ -185,8 +185,20 @@ validate_script (j_compress_ptr cinfo)
    Al = scanptr->Al;
    if (cinfo->progressive_mode) {
 #ifdef C_PROGRESSIVE_SUPPORTED
      /* The JPEG spec simply gives the ranges 0..13 for Ah and Al, but that
       * seems wrong: the upper bound ought to depend on data precision.
       * Perhaps they really meant 0..N+1 for N-bit precision.
       * Here we allow 0..10 for 8-bit data; Al larger than 10 results in
       * out-of-range reconstructed DC values during the first DC scan,
       * which might cause problems for some decoders.
       */
 #if BITS_IN_JSAMPLE == 8
 #define MAX_AH_AL 10
 #else
 #define MAX_AH_AL 13
 #endif
      if (Ss < 0 || Ss >= DCTSIZE2 || Se < Ss || Se >= DCTSIZE2 ||
-	  Ah < 0 || Ah > 13 || Al < 0 || Al > 13)
+	  Ah < 0 || Ah > MAX_AH_AL || Al < 0 || Al > MAX_AH_AL)
 	ERREXIT1(cinfo, JERR_BAD_PROG_SCRIPT, scanno);
      if (Ss == 0) {
 	if (Se != 0)		/* DC and AC together not OK */
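
The range checks in this hunk can be tried standalone. This sketch assumes 8-bit data (MAX_AH_AL of 10) and covers only the tests visible above; scan_params_ok is a hypothetical name, and the real validate_script performs further checks.

```c
#define DCTSIZE2  64	/* coefficients per 8x8 block */
#define MAX_AH_AL 10	/* 8-bit data, as in the hunk above */

/* Hypothetical helper: return 1 if the progressive scan parameters pass
 * the checks shown in the diff, else 0.
 */
static int
scan_params_ok (int Ss, int Se, int Ah, int Al)
{
  if (Ss < 0 || Ss >= DCTSIZE2 || Se < Ss || Se >= DCTSIZE2 ||
      Ah < 0 || Ah > MAX_AH_AL || Al < 0 || Al > MAX_AH_AL)
    return 0;
  if (Ss == 0 && Se != 0)	/* DC and AC together not OK */
    return 0;
  return 1;
}
```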
--- a/jcolsamp.h
+++ b/jcolsamp.h
@@ -0,0 +1,143 @@
 /*
 * jcolsamp.h - private declarations for color conversion & up/downsampling
 *
 * x86 SIMD extension for IJG JPEG library
 * Copyright (C) 1999-2006, MIYASAKA Masaru.
 * For conditions of distribution and use, see copyright notice in jsimdext.inc
 *
 * Last Modified : February 4, 2006
 *
 * [TAB8]
 */
 /* configuration check: BITS_IN_JSAMPLE==8 (8-bit sample values) is the only
 * valid setting on this SIMD extension.
 */
 #if BITS_IN_JSAMPLE != 8
 #error "Sorry, this SIMD code only copes with 8-bit sample values."
 #endif
 /* Short forms of external names for systems with brain-damaged linkers. */
 #ifdef NEED_SHORT_EXTERNAL_NAMES
 #define jpeg_rgb_ycc_convert_mmx	jMRgbYccCnv	/* jccolmmx.asm */
 #define jpeg_rgb_ycc_convert_sse2	jSRgbYccCnv	/* jccolss2.asm */
 #define jpeg_h2v1_downsample_mmx	jM21Downsample	/* jcsammmx.asm */
 #define jpeg_h2v2_downsample_mmx	jM22Downsample	/* jcsammmx.asm */
 #define jpeg_h2v1_downsample_sse2	jS21Downsample	/* jcsamss2.asm */
 #define jpeg_h2v2_downsample_sse2	jS22Downsample	/* jcsamss2.asm */
 #define jpeg_ycc_rgb_convert_mmx	jMYccRgbCnv	/* jdcolmmx.asm */
 #define jpeg_ycc_rgb_convert_sse2	jSYccRgbCnv	/* jdcolss2.asm */
 #define jpeg_h2v1_merged_upsample_mmx	jM21MerUpsample	/* jdmermmx.asm */
 #define jpeg_h2v2_merged_upsample_mmx	jM22MerUpsample	/* jdmermmx.asm */
 #define jpeg_h2v1_merged_upsample_sse2	jS21MerUpsample	/* jdmerss2.asm */
 #define jpeg_h2v2_merged_upsample_sse2	jS22MerUpsample	/* jdmerss2.asm */
 #define jpeg_h2v1_fancy_upsample_mmx	jM21FanUpsample	/* jdsammmx.asm */
 #define jpeg_h2v2_fancy_upsample_mmx	jM22FanUpsample	/* jdsammmx.asm */
 #define jpeg_h1v2_fancy_upsample_mmx	jM12FanUpsample	/* jdsammmx.asm */
 #define jpeg_h2v1_upsample_mmx		jM21Upsample	/* jdsammmx.asm */
 #define jpeg_h2v2_upsample_mmx		jM22Upsample	/* jdsammmx.asm */
 #define jpeg_h2v1_fancy_upsample_sse2	jS21FanUpsample	/* jdsamss2.asm */
 #define jpeg_h2v2_fancy_upsample_sse2	jS22FanUpsample	/* jdsamss2.asm */
 #define jpeg_h1v2_fancy_upsample_sse2	jS12FanUpsample	/* jdsamss2.asm */
 #define jpeg_h2v1_upsample_sse2		jS21Upsample	/* jdsamss2.asm */
 #define jpeg_h2v2_upsample_sse2		jS22Upsample	/* jdsamss2.asm */
 #define jconst_rgb_ycc_convert_mmx	jMCRgbYccCnv	/* jccolmmx.asm */
 #define jconst_rgb_ycc_convert_sse2	jSCRgbYccCnv	/* jccolss2.asm */
 #define jconst_ycc_rgb_convert_mmx	jMCYccRgbCnv	/* jdcolmmx.asm */
 #define jconst_ycc_rgb_convert_sse2	jSCYccRgbCnv	/* jdcolss2.asm */
 #define jconst_merged_upsample_mmx	jMCMerUpsample	/* jdmermmx.asm */
 #define jconst_merged_upsample_sse2	jSCMerUpsample	/* jdmerss2.asm */
 #define jconst_fancy_upsample_mmx	jMCFanUpsample	/* jdsammmx.asm */
 #define jconst_fancy_upsample_sse2	jSCFanUpsample	/* jdsamss2.asm */
 #ifndef JSIMD_MODEINFO_NOT_SUPPORTED
 #define jpeg_simd_merged_upsampler	jSiMUpsampler	/* jdmerge.c    */
 #endif
 #endif /* NEED_SHORT_EXTERNAL_NAMES */
 /* Extern declarations for color conversion & up/downsampling routines. */
 EXTERN(void) jpeg_rgb_ycc_convert_mmx
    JPP((j_compress_ptr cinfo, JSAMPARRAY input_buf, JSAMPIMAGE output_buf,
 	 JDIMENSION output_row, int num_rows));
 EXTERN(void) jpeg_rgb_ycc_convert_sse2
    JPP((j_compress_ptr cinfo, JSAMPARRAY input_buf, JSAMPIMAGE output_buf,
 	 JDIMENSION output_row, int num_rows));
 EXTERN(void) jpeg_h2v1_downsample_mmx
    JPP((j_compress_ptr cinfo, jpeg_component_info * compptr,
 	 JSAMPARRAY input_data, JSAMPARRAY output_data));
 EXTERN(void) jpeg_h2v2_downsample_mmx
    JPP((j_compress_ptr cinfo, jpeg_component_info * compptr,
 	 JSAMPARRAY input_data, JSAMPARRAY output_data));
 EXTERN(void) jpeg_h2v1_downsample_sse2
    JPP((j_compress_ptr cinfo, jpeg_component_info * compptr,
 	 JSAMPARRAY input_data, JSAMPARRAY output_data));
 EXTERN(void) jpeg_h2v2_downsample_sse2
    JPP((j_compress_ptr cinfo, jpeg_component_info * compptr,
 	 JSAMPARRAY input_data, JSAMPARRAY output_data));
 EXTERN(void) jpeg_ycc_rgb_convert_mmx
    JPP((j_decompress_ptr cinfo, JSAMPIMAGE input_buf, JDIMENSION input_row,
 	 JSAMPARRAY output_buf, int num_rows));
 EXTERN(void) jpeg_ycc_rgb_convert_sse2
    JPP((j_decompress_ptr cinfo, JSAMPIMAGE input_buf, JDIMENSION input_row,
 	 JSAMPARRAY output_buf, int num_rows));
 EXTERN(void) jpeg_h2v1_merged_upsample_mmx
    JPP((j_decompress_ptr cinfo, JSAMPIMAGE input_buf,
 	 JDIMENSION in_row_group_ctr, JSAMPARRAY output_buf));
 EXTERN(void) jpeg_h2v2_merged_upsample_mmx
    JPP((j_decompress_ptr cinfo, JSAMPIMAGE input_buf,
 	 JDIMENSION in_row_group_ctr, JSAMPARRAY output_buf));
 EXTERN(void) jpeg_h2v1_merged_upsample_sse2
    JPP((j_decompress_ptr cinfo, JSAMPIMAGE input_buf,
 	 JDIMENSION in_row_group_ctr, JSAMPARRAY output_buf));
 EXTERN(void) jpeg_h2v2_merged_upsample_sse2
    JPP((j_decompress_ptr cinfo, JSAMPIMAGE input_buf,
 	 JDIMENSION in_row_group_ctr, JSAMPARRAY output_buf));
 EXTERN(void) jpeg_h2v1_fancy_upsample_mmx
    JPP((j_decompress_ptr cinfo, jpeg_component_info * compptr,
 	 JSAMPARRAY input_data, JSAMPARRAY * output_data_ptr));
 EXTERN(void) jpeg_h2v2_fancy_upsample_mmx
    JPP((j_decompress_ptr cinfo, jpeg_component_info * compptr,
 	 JSAMPARRAY input_data, JSAMPARRAY * output_data_ptr));
 EXTERN(void) jpeg_h1v2_fancy_upsample_mmx
    JPP((j_decompress_ptr cinfo, jpeg_component_info * compptr,
 	 JSAMPARRAY input_data, JSAMPARRAY * output_data_ptr));
 EXTERN(void) jpeg_h2v1_upsample_mmx
    JPP((j_decompress_ptr cinfo, jpeg_component_info * compptr,
 	 JSAMPARRAY input_data, JSAMPARRAY * output_data_ptr));
 EXTERN(void) jpeg_h2v2_upsample_mmx
    JPP((j_decompress_ptr cinfo, jpeg_component_info * compptr,
 	 JSAMPARRAY input_data, JSAMPARRAY * output_data_ptr));
 EXTERN(void) jpeg_h2v1_fancy_upsample_sse2
    JPP((j_decompress_ptr cinfo, jpeg_component_info * compptr,
 	 JSAMPARRAY input_data, JSAMPARRAY * output_data_ptr));
 EXTERN(void) jpeg_h2v2_fancy_upsample_sse2
    JPP((j_decompress_ptr cinfo, jpeg_component_info * compptr,
 	 JSAMPARRAY input_data, JSAMPARRAY * output_data_ptr));
 EXTERN(void) jpeg_h1v2_fancy_upsample_sse2
    JPP((j_decompress_ptr cinfo, jpeg_component_info * compptr,
 	 JSAMPARRAY input_data, JSAMPARRAY * output_data_ptr));
 EXTERN(void) jpeg_h2v1_upsample_sse2
    JPP((j_decompress_ptr cinfo, jpeg_component_info * compptr,
 	 JSAMPARRAY input_data, JSAMPARRAY * output_data_ptr));
 EXTERN(void) jpeg_h2v2_upsample_sse2
    JPP((j_decompress_ptr cinfo, jpeg_component_info * compptr,
 	 JSAMPARRAY input_data, JSAMPARRAY * output_data_ptr));
 extern const int jconst_rgb_ycc_convert_mmx[];
 extern const int jconst_rgb_ycc_convert_sse2[];
 extern const int jconst_ycc_rgb_convert_mmx[];
 extern const int jconst_ycc_rgb_convert_sse2[];
 extern const int jconst_merged_upsample_mmx[];
 extern const int jconst_merged_upsample_sse2[];
 extern const int jconst_fancy_upsample_mmx[];
 extern const int jconst_fancy_upsample_sse2[];
 #ifndef JSIMD_MODEINFO_NOT_SUPPORTED
 EXTERN(unsigned int) jpeg_simd_merged_upsampler JPP((j_decompress_ptr cinfo));
 #endif
--- a/jcolsamp.inc
+++ b/jcolsamp.inc
@@ -0,0 +1,156 @@
 ;
 ; jcolsamp.inc - private declarations for color conversion & up/downsampling
 ;
 ; x86 SIMD extension for IJG JPEG library
 ; Copyright (C) 1999-2006, MIYASAKA Masaru.
 ; For conditions of distribution and use, see copyright notice in jsimdext.inc
 ;
 ; Last Modified : January 5, 2006
 ;
 ; [TAB8]
 ; --------------------------------------------------------------------------
 ;
 ; configuration check: BITS_IN_JSAMPLE==8 (8-bit sample values) is the only
 ; valid setting on this SIMD extension.
 ;
 %if BITS_IN_JSAMPLE != 8
 %error "Sorry, this SIMD code only copes with 8-bit sample values."
 %endif
 ; Short forms of external names for systems with brain-damaged linkers.
 ;
 %ifdef NEED_SHORT_EXTERNAL_NAMES
 %define jpeg_rgb_ycc_convert_mmx	jMRgbYccCnv	; jccolmmx.asm
 %define jpeg_rgb_ycc_convert_sse2	jSRgbYccCnv	; jccolss2.asm
 %define jpeg_h2v1_downsample_mmx	jM21Downsample	; jcsammmx.asm
 %define jpeg_h2v2_downsample_mmx	jM22Downsample	; jcsammmx.asm
 %define jpeg_h2v1_downsample_sse2	jS21Downsample	; jcsamss2.asm
 %define jpeg_h2v2_downsample_sse2	jS22Downsample	; jcsamss2.asm
 %define jpeg_ycc_rgb_convert_mmx	jMYccRgbCnv	; jdcolmmx.asm
 %define jpeg_ycc_rgb_convert_sse2	jSYccRgbCnv	; jdcolss2.asm
 %define jpeg_h2v1_merged_upsample_mmx	jM21MerUpsample	; jdmermmx.asm
 %define jpeg_h2v2_merged_upsample_mmx	jM22MerUpsample	; jdmermmx.asm
 %define jpeg_h2v1_merged_upsample_sse2	jS21MerUpsample	; jdmerss2.asm
 %define jpeg_h2v2_merged_upsample_sse2	jS22MerUpsample	; jdmerss2.asm
 %define jpeg_h2v1_fancy_upsample_mmx	jM21FanUpsample	; jdsammmx.asm
 %define jpeg_h2v2_fancy_upsample_mmx	jM22FanUpsample	; jdsammmx.asm
 %define jpeg_h1v2_fancy_upsample_mmx	jM12FanUpsample	; jdsammmx.asm
 %define jpeg_h2v1_upsample_mmx		jM21Upsample	; jdsammmx.asm
 %define jpeg_h2v2_upsample_mmx		jM22Upsample	; jdsammmx.asm
 %define jpeg_h2v1_fancy_upsample_sse2	jS21FanUpsample	; jdsamss2.asm
 %define jpeg_h2v2_fancy_upsample_sse2	jS22FanUpsample	; jdsamss2.asm
 %define jpeg_h1v2_fancy_upsample_sse2	jS12FanUpsample	; jdsamss2.asm
 %define jpeg_h2v1_upsample_sse2		jS21Upsample	; jdsamss2.asm
 %define jpeg_h2v2_upsample_sse2		jS22Upsample	; jdsamss2.asm
 %define jconst_rgb_ycc_convert_mmx	jMCRgbYccCnv	; jccolmmx.asm
 %define jconst_rgb_ycc_convert_sse2	jSCRgbYccCnv	; jccolss2.asm
 %define jconst_ycc_rgb_convert_mmx	jMCYccRgbCnv	; jdcolmmx.asm
 %define jconst_ycc_rgb_convert_sse2	jSCYccRgbCnv	; jdcolss2.asm
 %define jconst_merged_upsample_mmx	jMCMerUpsample	; jdmermmx.asm
 %define jconst_merged_upsample_sse2	jSCMerUpsample	; jdmerss2.asm
 %define jconst_fancy_upsample_mmx	jMCFanUpsample	; jdsammmx.asm
 %define jconst_fancy_upsample_sse2	jSCFanUpsample	; jdsamss2.asm
 %endif ; NEED_SHORT_EXTERNAL_NAMES
 ; --------------------------------------------------------------------------
 ; pseudo-registers to make ordering of RGB configurable
 ;
 %if RGB_PIXELSIZE == 3 || RGB_PIXELSIZE == 4
 %if RGB_RED < 0 || RGB_RED >= RGB_PIXELSIZE || RGB_GREEN < 0 || \
   RGB_GREEN >= RGB_PIXELSIZE || RGB_BLUE < 0 || RGB_BLUE >= RGB_PIXELSIZE || \
   RGB_RED == RGB_GREEN || RGB_GREEN == RGB_BLUE || RGB_RED == RGB_BLUE
 %error "Incorrect RGB pixel offset."
 %endif
 %if RGB_RED == 0
 %define  mmA  mm0
 %define  mmB  mm1
 %define xmmA xmm0
 %define xmmB xmm1
 %elif RGB_GREEN == 0
 %define  mmA  mm2
 %define  mmB  mm3
 %define xmmA xmm2
 %define xmmB xmm3
 %elif RGB_BLUE == 0
 %define  mmA  mm4
 %define  mmB  mm5
 %define xmmA xmm4
 %define xmmB xmm5
 %else
 %define  mmA  mm6
 %define  mmB  mm7
 %define xmmA xmm6
 %define xmmB xmm7
 %endif
 %if RGB_RED == 1
 %define  mmC  mm0
 %define  mmD  mm1
 %define xmmC xmm0
 %define xmmD xmm1
 %elif RGB_GREEN == 1
 %define  mmC  mm2
 %define  mmD  mm3
 %define xmmC xmm2
 %define xmmD xmm3
 %elif RGB_BLUE == 1
 %define  mmC  mm4
 %define  mmD  mm5
 %define xmmC xmm4
 %define xmmD xmm5
 %else
 %define  mmC  mm6
 %define  mmD  mm7
 %define xmmC xmm6
 %define xmmD xmm7
 %endif
 %if RGB_RED == 2
 %define  mmE  mm0
 %define  mmF  mm1
 %define xmmE xmm0
 %define xmmF xmm1
 %elif RGB_GREEN == 2
 %define  mmE  mm2
 %define  mmF  mm3
 %define xmmE xmm2
 %define xmmF xmm3
 %elif RGB_BLUE == 2
 %define  mmE  mm4
 %define  mmF  mm5
 %define xmmE xmm4
 %define xmmF xmm5
 %else
 %define  mmE  mm6
 %define  mmF  mm7
 %define xmmE xmm6
 %define xmmF xmm7
 %endif
 %if RGB_RED == 3
 %define  mmG  mm0
 %define  mmH  mm1
 %define xmmG xmm0
 %define xmmH xmm1
 %elif RGB_GREEN == 3
 %define  mmG  mm2
 %define  mmH  mm3
 %define xmmG xmm2
 %define xmmH xmm3
 %elif RGB_BLUE == 3
 %define  mmG  mm4
 %define  mmH  mm5
 %define xmmG xmm4
 %define xmmH xmm5
 %else
 %define  mmG  mm6
 %define  mmH  mm7
 %define xmmG xmm6
 %define xmmH xmm7
 %endif
 %endif ; RGB_PIXELSIZE == 3 || RGB_PIXELSIZE == 4
 ; --------------------------------------------------------------------------
--- a/jcomapi.c
+++ b/jcomapi.c
@@ -1,10 +1,17 @@
 /*
 * jcomapi.c
 *
- * Copyright (C) 1994-1996, Thomas G. Lane.
+ * Copyright (C) 1994-1997, Thomas G. Lane.
 * This file is part of the Independent JPEG Group's software.
 * For conditions of distribution and use, see the accompanying README file.
 *
 * ---------------------------------------------------------------------
 * x86 SIMD extension for IJG JPEG library
 * Copyright (C) 1999-2006, MIYASAKA Masaru.
 * This file has been modified for SIMD extension.
 * Last Modified : March 11, 2005
 * ---------------------------------------------------------------------
 *
 * This file contains application interface routines that are used for both
 * compression and decompression.
 */
@@ -30,6 +37,10 @@ jpeg_abort (j_common_ptr cinfo)
 {
  int pool;
  /* Do nothing if called on a not-initialized or destroyed JPEG object. */
  if (cinfo->mem == NULL)
    return;
  /* Releasing pools in reverse order might help avoid fragmentation
   * with some (brain-damaged) malloc libraries.
   */
@@ -38,7 +49,15 @@ jpeg_abort (j_common_ptr cinfo)
  }
  /* Reset overall state for possible reuse of object */
-  cinfo->global_state = (cinfo->is_decompressor ? DSTATE_START : CSTATE_START);
+  if (cinfo->is_decompressor) {
    cinfo->global_state = DSTATE_START;
    /* Try to keep application from accessing now-deleted marker list.
     * A bit kludgy to do it here, but this is the most central place.
     */
    ((j_decompress_ptr) cinfo)->marker_list = NULL;
  } else {
    cinfo->global_state = CSTATE_START;
  }
 }
@@ -92,3 +111,54 @@ jpeg_alloc_huff_table (j_common_ptr cinfo)
  tbl->sent_table = FALSE;	/* make sure this is false in any new table */
  return tbl;
 }
 /*
 * SIMD Ext: Checking for support of SIMD instruction set.
 */
 GLOBAL(unsigned int)
 jpeg_simd_support (j_common_ptr cinfo)
 {
  enum { JSIMD_INVALID = ~0 };
  static volatile unsigned int simd_supported = JSIMD_INVALID;
  if (simd_supported == JSIMD_INVALID)
    simd_supported = jpeg_simd_os_support(jpeg_simd_cpu_support());
 #ifndef JSIMD_MASKFUNC_NOT_SUPPORTED
  if (cinfo != NULL)	/* Turn off the masked flags */
    return simd_supported & ~jpeg_simd_mask(cinfo, JSIMD_NONE, JSIMD_NONE);
 #endif
  return simd_supported;
 }
 #ifndef JSIMD_MASKFUNC_NOT_SUPPORTED
 /*
 * SIMD Ext: modify/retrieve SIMD instruction mask
 */
 GLOBAL(unsigned int)
 jpeg_simd_mask (j_common_ptr cinfo, unsigned int remove, unsigned int add)
 {
  unsigned long *gp;
  unsigned int oldmask;
  if (cinfo->is_decompressor)
    gp = (unsigned long *) &((j_decompress_ptr) cinfo)->output_gamma;
  else	/* compressor */
    gp = (unsigned long *) &((j_compress_ptr) cinfo)->input_gamma;
  if ((gp[1] == 0x3FF00000 || gp[1] == 0x00000000) &&	/* +1.0 or +0.0 */
      (gp[0] & ~JSIMD_ALL) == 0) {
    oldmask = gp[0];
    if (((remove | add) & ~JSIMD_ALL) == 0)
      gp[0] = (oldmask & ~remove) | add;
  } else {
    oldmask = 0;	/* error */
  }
  return oldmask;
 }
 #endif /* !JSIMD_MASKFUNC_NOT_SUPPORTED */
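
The remove/add mask arithmetic in jpeg_simd_mask, and the way jpeg_simd_support applies it, can be sketched with plain integers. The DEMO_ flag names and values are assumptions for illustration, not the library's JSIMD_ constants, and the sketch keeps the mask in an ordinary variable rather than in the gamma field's storage.

```c
/* Hypothetical flag bits; the real JSIMD_* values are defined elsewhere. */
enum {
  DEMO_MMX   = 0x01,
  DEMO_3DNOW = 0x02,
  DEMO_SSE   = 0x04,
  DEMO_SSE2  = 0x08,
  DEMO_ALL   = 0x0F
};

/* Clear the bits in 'remove', set the bits in 'add', and return the old
 * mask.  A request containing bits outside DEMO_ALL leaves the mask as is.
 */
static unsigned int
demo_mask (unsigned int *mask, unsigned int remove, unsigned int add)
{
  unsigned int oldmask = *mask;

  if (((remove | add) & ~(unsigned int) DEMO_ALL) == 0)
    *mask = (oldmask & ~remove) | add;
  return oldmask;
}

/* Detected features with the masked flags turned off. */
static unsigned int
demo_support (unsigned int detected, unsigned int mask)
{
  return detected & ~mask;
}
```

Calling demo_mask with both arguments zero reads the mask without changing it, which matches how jpeg_simd_support calls jpeg_simd_mask with JSIMD_NONE twice.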
--- a/jconfig.bc5
+++ b/jconfig.bc5
@@ -0,0 +1,48 @@
 /* jconfig.bc5 --- jconfig.h for Borland C++ Compiler 5.5 (win32) */
 /* see jconfig.doc for explanations */
 #define HAVE_PROTOTYPES
 #define HAVE_UNSIGNED_CHAR
 #define HAVE_UNSIGNED_SHORT
 /* #define void char */
 /* #define const */
 #undef CHAR_IS_UNSIGNED
 #define HAVE_STDDEF_H
 #define HAVE_STDLIB_H
 #undef NEED_BSD_STRINGS
 #undef NEED_SYS_TYPES_H
 #undef NEED_FAR_POINTERS	/* we presume a 32-bit flat memory model */
 #undef NEED_SHORT_EXTERNAL_NAMES
 #undef INCOMPLETE_TYPES_BROKEN	/* this assumes you have -w-stu in CFLAGS */
 /* Define "boolean" as unsigned char, not int, per Windows custom */
 #define TYPEDEF_UCHAR_BOOLEAN
 #ifdef JPEG_INTERNALS
 #undef RIGHT_SHIFT_IS_UNSIGNED
 #endif /* JPEG_INTERNALS */
 #if defined(JPEG_INTERNALS) || defined(JPEG_INTERNAL_OPTIONS)
 #undef JSIMD_MMX_NOT_SUPPORTED
 #undef JSIMD_3DNOW_NOT_SUPPORTED
 #undef JSIMD_SSE_NOT_SUPPORTED
 #undef JSIMD_SSE2_NOT_SUPPORTED
 #endif
 #ifdef JPEG_CJPEG_DJPEG
 #define BMP_SUPPORTED		/* BMP image file format */
 #define GIF_SUPPORTED		/* GIF image file format */
 #define PPM_SUPPORTED		/* PBMPLUS PPM/PGM image file format */
 #undef RLE_SUPPORTED		/* Utah RLE image file format */
 #define TARGA_SUPPORTED		/* Targa image file format */
 #define TWO_FILE_COMMANDLINE
 #define USE_SETMODE		/* Borland has setmode() */
 #undef NEED_SIGNAL_CATCHER	/* Define this if you use jmemname.c */
 #undef DONT_USE_B_MODE
 #undef PROGRESS_REPORT		/* optional */
 #endif /* JPEG_CJPEG_DJPEG */
--- a/jconfig.cfg
+++ b/jconfig.cfg
@@ -16,6 +16,9 @@
 /* Define this if you get warnings about undefined structures. */
 #undef INCOMPLETE_TYPES_BROKEN
 /* Define "boolean" as unsigned char, not int, per Windows custom */
 #undef TYPEDEF_UCHAR_BOOLEAN
 #ifdef JPEG_INTERNALS
 #undef RIGHT_SHIFT_IS_UNSIGNED
@@ -26,6 +29,13 @@
 #endif /* JPEG_INTERNALS */
 #if defined(JPEG_INTERNALS) || defined(JPEG_INTERNAL_OPTIONS)
 #undef JSIMD_MMX_NOT_SUPPORTED
 #undef JSIMD_3DNOW_NOT_SUPPORTED
 #undef JSIMD_SSE_NOT_SUPPORTED
 #undef JSIMD_SSE2_NOT_SUPPORTED
 #endif
 #ifdef JPEG_CJPEG_DJPEG
 #define BMP_SUPPORTED		/* BMP image file format */
@@ -35,6 +45,8 @@
 #define TARGA_SUPPORTED		/* Targa image file format */
 #undef TWO_FILE_COMMANDLINE
 #undef USE_SETMODE
 #undef USE_FDOPEN
 #undef NEED_SIGNAL_CATCHER
 #undef DONT_USE_B_MODE
--- a/jconfig.dj
+++ b/jconfig.dj
@@ -21,6 +21,13 @@
 #endif /* JPEG_INTERNALS */
 #if defined(JPEG_INTERNALS) || defined(JPEG_INTERNAL_OPTIONS)
 #undef JSIMD_MMX_NOT_SUPPORTED
 #undef JSIMD_3DNOW_NOT_SUPPORTED
 #undef JSIMD_SSE_NOT_SUPPORTED
 #undef JSIMD_SSE2_NOT_SUPPORTED
 #endif
 #ifdef JPEG_CJPEG_DJPEG
 #define BMP_SUPPORTED		/* BMP image file format */
@@ -35,4 +42,6 @@
 #undef DONT_USE_B_MODE
 #undef PROGRESS_REPORT		/* optional */
 #define FREE_MEM_ESTIMATE	0	/* for alternate cjpeg/djpeg */
 #endif /* JPEG_CJPEG_DJPEG */
--- a/jconfig.linux
+++ b/jconfig.linux
@@ -0,0 +1,44 @@
 /* jconfig.linux --- jconfig.h for Linux ELF with gcc */
 /* see jconfig.doc for explanations */
 #define HAVE_PROTOTYPES
 #define HAVE_UNSIGNED_CHAR
 #define HAVE_UNSIGNED_SHORT
 /* #define void char */
 /* #define const */
 #undef CHAR_IS_UNSIGNED
 #define HAVE_STDDEF_H
 #define HAVE_STDLIB_H
 #undef NEED_BSD_STRINGS
 #undef NEED_SYS_TYPES_H
 #undef NEED_FAR_POINTERS
 #undef NEED_SHORT_EXTERNAL_NAMES
 #undef INCOMPLETE_TYPES_BROKEN
 #ifdef JPEG_INTERNALS
 #undef RIGHT_SHIFT_IS_UNSIGNED
 #endif /* JPEG_INTERNALS */
 #if defined(JPEG_INTERNALS) || defined(JPEG_INTERNAL_OPTIONS)
 #undef JSIMD_MMX_NOT_SUPPORTED
 #undef JSIMD_3DNOW_NOT_SUPPORTED
 #undef JSIMD_SSE_NOT_SUPPORTED
 #undef JSIMD_SSE2_NOT_SUPPORTED
 #endif
 #ifdef JPEG_CJPEG_DJPEG
 #define BMP_SUPPORTED		/* BMP image file format */
 #define GIF_SUPPORTED		/* GIF image file format */
 #define PPM_SUPPORTED		/* PBMPLUS PPM/PGM image file format */
 #undef RLE_SUPPORTED		/* Utah RLE image file format */
 #define TARGA_SUPPORTED		/* Targa image file format */
 #undef TWO_FILE_COMMANDLINE
 #undef NEED_SIGNAL_CATCHER	/* Define this if you use jmemname.c */
 #undef DONT_USE_B_MODE
 #undef PROGRESS_REPORT		/* optional */
 #endif /* JPEG_CJPEG_DJPEG */
--- a/jconfig.mgw
+++ b/jconfig.mgw
@@ -0,0 +1,48 @@
 /* jconfig.mgw --- jconfig.h for MinGW */
 /* see jconfig.doc for explanations */
 #define HAVE_PROTOTYPES
 #define HAVE_UNSIGNED_CHAR
 #define HAVE_UNSIGNED_SHORT
 /* #define void char */
 /* #define const */
 #undef CHAR_IS_UNSIGNED
 #define HAVE_STDDEF_H
 #define HAVE_STDLIB_H
 #undef NEED_BSD_STRINGS
 #undef NEED_SYS_TYPES_H
 #undef NEED_FAR_POINTERS
 #undef NEED_SHORT_EXTERNAL_NAMES
 #undef INCOMPLETE_TYPES_BROKEN
 /* Define "boolean" as unsigned char, not int, per Windows custom */
 #define TYPEDEF_UCHAR_BOOLEAN
 #ifdef JPEG_INTERNALS
 #undef RIGHT_SHIFT_IS_UNSIGNED
 #endif /* JPEG_INTERNALS */
 #if defined(JPEG_INTERNALS) || defined(JPEG_INTERNAL_OPTIONS)
 #undef JSIMD_MMX_NOT_SUPPORTED
 #undef JSIMD_3DNOW_NOT_SUPPORTED
 #undef JSIMD_SSE_NOT_SUPPORTED
 #undef JSIMD_SSE2_NOT_SUPPORTED
 #endif
 #ifdef JPEG_CJPEG_DJPEG
 #define BMP_SUPPORTED		/* BMP image file format */
 #define GIF_SUPPORTED		/* GIF image file format */
 #define PPM_SUPPORTED		/* PBMPLUS PPM/PGM image file format */
 #undef RLE_SUPPORTED		/* Utah RLE image file format */
 #define TARGA_SUPPORTED		/* Targa image file format */
 #define TWO_FILE_COMMANDLINE	/* optional */
 #define USE_SETMODE		/* MinGW has setmode() */
 #undef NEED_SIGNAL_CATCHER	/* Define this if you use jmemname.c */
 #undef DONT_USE_B_MODE
 #undef PROGRESS_REPORT		/* optional */
 #endif /* JPEG_CJPEG_DJPEG */
--- a/jconfig.vc
+++ b/jconfig.vc
@@ -0,0 +1,48 @@
 /* jconfig.vc --- jconfig.h for Microsoft Visual C++ on Windows 95 or NT. */
 /* see jconfig.doc for explanations */
 #define HAVE_PROTOTYPES
 #define HAVE_UNSIGNED_CHAR
 #define HAVE_UNSIGNED_SHORT
 /* #define void char */
 /* #define const */
 #undef CHAR_IS_UNSIGNED
 #define HAVE_STDDEF_H
 #define HAVE_STDLIB_H
 #undef NEED_BSD_STRINGS
 #undef NEED_SYS_TYPES_H
 #undef NEED_FAR_POINTERS	/* we presume a 32-bit flat memory model */
 #undef NEED_SHORT_EXTERNAL_NAMES
 #undef INCOMPLETE_TYPES_BROKEN
 /* Define "boolean" as unsigned char, not int, per Windows custom */
 #define TYPEDEF_UCHAR_BOOLEAN
 #ifdef JPEG_INTERNALS
 #undef RIGHT_SHIFT_IS_UNSIGNED
 #endif /* JPEG_INTERNALS */
 #if defined(JPEG_INTERNALS) || defined(JPEG_INTERNAL_OPTIONS)
 #undef JSIMD_MMX_NOT_SUPPORTED
 #undef JSIMD_3DNOW_NOT_SUPPORTED
 #undef JSIMD_SSE_NOT_SUPPORTED
 #undef JSIMD_SSE2_NOT_SUPPORTED
 #endif
 #ifdef JPEG_CJPEG_DJPEG
 #define BMP_SUPPORTED		/* BMP image file format */
 #define GIF_SUPPORTED		/* GIF image file format */
 #define PPM_SUPPORTED		/* PBMPLUS PPM/PGM image file format */
 #undef RLE_SUPPORTED		/* Utah RLE image file format */
 #define TARGA_SUPPORTED		/* Targa image file format */
 #define TWO_FILE_COMMANDLINE	/* optional */
 #define USE_SETMODE		/* Microsoft has setmode() */
 #undef NEED_SIGNAL_CATCHER
 #undef DONT_USE_B_MODE
 #undef PROGRESS_REPORT		/* optional */
 #endif /* JPEG_CJPEG_DJPEG */
--- a/jcparam.c
+++ b/jcparam.c
@@ -1,7 +1,7 @@
 /*
 * jcparam.c
 *
- * Copyright (C) 1991-1996, Thomas G. Lane.
+ * Copyright (C) 1991-1998, Thomas G. Lane.
 * This file is part of the Independent JPEG Group's software.
 * For conditions of distribution and use, see the accompanying README file.
 *
@@ -29,7 +29,7 @@ jpeg_add_quant_table (j_compress_ptr cinfo, int which_tbl,
 * are limited to 1..255 for JPEG baseline compatibility.
 */
 {
-  JQUANT_TBL ** qtblptr = & cinfo->quant_tbl_ptrs[which_tbl];
+  JQUANT_TBL ** qtblptr;
  int i;
  long temp;
@@ -37,6 +37,11 @@ jpeg_add_quant_table (j_compress_ptr cinfo, int which_tbl,
  if (cinfo->global_state != CSTATE_START)
    ERREXIT1(cinfo, JERR_BAD_STATE, cinfo->global_state);
  if (which_tbl < 0 || which_tbl >= NUM_QUANT_TBLS)
    ERREXIT1(cinfo, JERR_DQT_INDEX, which_tbl);
  qtblptr = & cinfo->quant_tbl_ptrs[which_tbl];
  if (*qtblptr == NULL)
    *qtblptr = jpeg_alloc_quant_table((j_common_ptr) cinfo);
@@ -148,11 +153,25 @@ add_huff_table (j_compress_ptr cinfo,
 		JHUFF_TBL **htblptr, const UINT8 *bits, const UINT8 *val)
 /* Define a Huffman table */
 {
  int nsymbols, len;
  if (*htblptr == NULL)
    *htblptr = jpeg_alloc_huff_table((j_common_ptr) cinfo);
  /* Copy the number-of-symbols-of-each-code-length counts */
  MEMCOPY((*htblptr)->bits, bits, SIZEOF((*htblptr)->bits));
-  MEMCOPY((*htblptr)->huffval, val, SIZEOF((*htblptr)->huffval));
+
  /* Validate the counts.  We do this here mainly so we can copy the right
   * number of symbols from the val[] array, without risking marching off
   * the end of memory.  jchuff.c will do a more thorough test later.
   */
  nsymbols = 0;
  for (len = 1; len <= 16; len++)
    nsymbols += bits[len];
  if (nsymbols < 1 || nsymbols > 256)
    ERREXIT(cinfo, JERR_BAD_HUFF_TABLE);
  MEMCOPY((*htblptr)->huffval, val, nsymbols * SIZEOF(UINT8));
  /* Initialize sent_table FALSE so table will be written to JPEG file. */
  (*htblptr)->sent_table = FALSE;
@@ -313,7 +332,15 @@ jpeg_set_defaults (j_compress_ptr cinfo)
  /* Fill in default JFIF marker parameters.  Note that whether the marker
   * will actually be written is determined by jpeg_set_colorspace.
   *
   * By default, the library emits JFIF version code 1.01.
   * An application that wants to emit JFIF 1.02 extension markers should set
   * JFIF_minor_version to 2.  We could probably get away with just defaulting
   * to 1.02, but there may still be some decoders in use that will complain
   * about that; saying 1.01 should minimize compatibility problems.
   */
  cinfo->JFIF_major_version = 1; /* Default JFIF version = 1.01 */
  cinfo->JFIF_minor_version = 1;
  cinfo->density_unit = 0;	/* Pixel size is unknown by default */
  cinfo->X_density = 1;		/* Pixel aspect ratio is square by default */
  cinfo->Y_density = 1;
@@ -529,11 +556,20 @@ jpeg_simple_progression (j_compress_ptr cinfo)
      nscans = 2 + 4 * ncomps;	/* 2 DC scans; 4 AC scans per component */
  }
-  /* Allocate space for script. */
+  /* Allocate space for script.
-  /* We use permanent pool just in case application re-uses script. */
+   * We need to put it in the permanent pool in case the application performs
-  scanptr = (jpeg_scan_info *)
+   * multiple compressions without changing the settings.  To avoid a memory
   * leak if jpeg_simple_progression is called repeatedly for the same JPEG
   * object, we try to re-use previously allocated space, and we allocate
   * enough space to handle YCbCr even if initially asked for grayscale.
   */
  if (cinfo->script_space == NULL || cinfo->script_space_size < nscans) {
    cinfo->script_space_size = MAX(nscans, 10);
    cinfo->script_space = (jpeg_scan_info *)
      (*cinfo->mem->alloc_small) ((j_common_ptr) cinfo, JPOOL_PERMANENT,
-				nscans * SIZEOF(jpeg_scan_info));
+			cinfo->script_space_size * SIZEOF(jpeg_scan_info));
  }
  scanptr = cinfo->script_space;
  cinfo->scan_info = scanptr;
  cinfo->num_scans = nscans;
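Note on the add_huff_table hunk above: the new code sums the sixteen per-length counts before copying, so that only as many symbols as the table actually declares are read from val[]. A minimal sketch of that check in standalone C follows. The type huff_table_sketch and the function copy_huff_table_sketch are hypothetical stand-ins, not library names, and the sketch returns -1 where the library calls ERREXIT.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for JHUFF_TBL (illustration only). */
typedef struct {
  unsigned char bits[17];      /* bits[k] = number of codes of length k; bits[0] unused */
  unsigned char huffval[256];  /* symbols, in order of increasing code length */
} huff_table_sketch;

/* Copy a Huffman table definition, reading only as many symbols from val[]
 * as the counts in bits[1..16] declare.
 * Returns the symbol count, or -1 if the counts are out of range.
 */
int copy_huff_table_sketch(huff_table_sketch *dst,
                           const unsigned char bits[17],
                           const unsigned char *val)
{
  int nsymbols = 0;
  int len;

  assert(dst != NULL && bits != NULL && val != NULL);

  /* Validate the counts before touching val[] */
  for (len = 1; len <= 16; len++)
    nsymbols += bits[len];
  if (nsymbols < 1 || nsymbols > 256)
    return -1;

  memcpy(dst->bits, bits, sizeof(dst->bits));
  memcpy(dst->huffval, val, (size_t) nsymbols);
  return nsymbols;
}
```

With this shape a caller can pass a val[] array that is shorter than 256 entries, which the old fixed-size copy would have read past.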
--- a/jcphuff.c
+++ b/jcphuff.c
@@ -1,7 +1,7 @@
 /*
 * jcphuff.c
 *
- * Copyright (C) 1995-1996, Thomas G. Lane.
+ * Copyright (C) 1995-1997, Thomas G. Lane.
 * This file is part of the Independent JPEG Group's software.
 * For conditions of distribution and use, see the accompanying README file.
 *
@@ -147,22 +147,19 @@ start_pass_phuff (j_compress_ptr cinfo, boolean gather_statistics)
    compptr = cinfo->cur_comp_info[ci];
    /* Initialize DC predictions to 0 */
    entropy->last_dc_val[ci] = 0;
-    /* Make sure requested tables are present */
+    /* Get table index */
-    /* (In gather mode, tables need not be allocated yet) */
    if (is_DC_band) {
      if (cinfo->Ah != 0)	/* DC refinement needs no table */
 	continue;
      tbl = compptr->dc_tbl_no;
-      if (tbl < 0 || tbl >= NUM_HUFF_TBLS ||
-	  (cinfo->dc_huff_tbl_ptrs[tbl] == NULL && !gather_statistics))
-	ERREXIT1(cinfo,JERR_NO_HUFF_TABLE, tbl);
    } else {
      entropy->ac_tbl_no = tbl = compptr->ac_tbl_no;
-      if (tbl < 0 || tbl >= NUM_HUFF_TBLS ||
-          (cinfo->ac_huff_tbl_ptrs[tbl] == NULL && !gather_statistics))
-        ERREXIT1(cinfo,JERR_NO_HUFF_TABLE, tbl);
    }
    if (gather_statistics) {
+      /* Check for invalid table index */
+      /* (make_c_derived_tbl does this in the other path) */
+      if (tbl < 0 || tbl >= NUM_HUFF_TBLS)
+        ERREXIT1(cinfo, JERR_NO_HUFF_TABLE, tbl);
      /* Allocate and zero the statistics tables */
      /* Note that jpeg_gen_optimal_table expects 257 entries in each table! */
      if (entropy->count_ptrs[tbl] == NULL)
@@ -171,13 +168,9 @@ start_pass_phuff (j_compress_ptr cinfo, boolean gather_statistics)
 				      257 * SIZEOF(long));
      MEMZERO(entropy->count_ptrs[tbl], 257 * SIZEOF(long));
    } else {
-      /* Compute derived values for Huffman tables */
+      /* Compute derived values for Huffman table */
      /* We may do this more than once for a table, but it's not expensive */
-      if (is_DC_band)
-        jpeg_make_c_derived_tbl(cinfo, cinfo->dc_huff_tbl_ptrs[tbl],
-				& entropy->derived_tbls[tbl]);
-      else
-        jpeg_make_c_derived_tbl(cinfo, cinfo->ac_huff_tbl_ptrs[tbl],
-			      & entropy->derived_tbls[tbl]);
+      jpeg_make_c_derived_tbl(cinfo, is_DC_band, tbl,
+			      & entropy->derived_tbls[tbl]);
    }
  }
@@ -329,6 +322,9 @@ emit_eobrun (phuff_entropy_ptr entropy)
    nbits = 0;
    while ((temp >>= 1))
      nbits++;
    /* safety check: shouldn't happen given limited correction-bit buffer */
    if (nbits > 14)
      ERREXIT(entropy->cinfo, JERR_HUFF_MISSING_CODE);
    emit_symbol(entropy, entropy->ac_tbl_no, nbits << 4);
    if (nbits)
@@ -427,6 +423,11 @@ encode_mcu_DC_first (j_compress_ptr cinfo, JBLOCKROW *MCU_data)
      nbits++;
      temp >>= 1;
    }
    /* Check for out-of-range coefficient values.
     * Since we're encoding a difference, the range limit is twice as much.
     */
    if (nbits > MAX_COEF_BITS+1)
      ERREXIT(cinfo, JERR_BAD_DCT_COEF);
    /* Count/emit the Huffman-coded symbol for the number of bits */
    emit_symbol(entropy, compptr->dc_tbl_no, nbits);
@@ -523,6 +524,9 @@ encode_mcu_AC_first (j_compress_ptr cinfo, JBLOCKROW *MCU_data)
    nbits = 1;			/* there must be at least one 1 bit */
    while ((temp >>= 1))
      nbits++;
    /* Check for out-of-range coefficient values */
    if (nbits > MAX_COEF_BITS)
      ERREXIT(cinfo, JERR_BAD_DCT_COEF);
    /* Count/emit Huffman symbol for run length / number of bits */
    emit_symbol(entropy, entropy->ac_tbl_no, (r << 4) + nbits);
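Note on the jcphuff.c range checks above: both new checks count the bits needed for a coefficient magnitude and reject values that exceed the limit, with one extra bit allowed for DC because a difference of two coefficients is being coded. The sketch below shows the idea in standalone C. The names are hypothetical, and MAX_COEF_BITS_SKETCH is set to 10 on the assumption of 8-bit samples; the real constant is defined outside this diff.

```c
#include <assert.h>

#define MAX_COEF_BITS_SKETCH 10   /* assumption: 8-bit sample precision */

/* Number of bits needed to represent |v|; 0 when v == 0. */
int coef_nbits_sketch(int v)
{
  int temp = (v < 0) ? -v : v;
  int nbits = 0;

  while (temp) {
    nbits++;
    temp >>= 1;
  }
  return nbits;
}

/* AC coefficients are limited to MAX_COEF_BITS bits. */
int ac_coef_in_range_sketch(int coef)
{
  return coef_nbits_sketch(coef) <= MAX_COEF_BITS_SKETCH;
}

/* A DC difference spans twice the range, hence one more bit. */
int dc_diff_in_range_sketch(int diff)
{
  return coef_nbits_sketch(diff) <= MAX_COEF_BITS_SKETCH + 1;
}
```

In the library the failing case raises JERR_BAD_DCT_COEF instead of returning a flag.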
--- a/jcqnt3dn.asm
+++ b/jcqnt3dn.asm
@@ -0,0 +1,240 @@
 ;
 ; jcqnt3dn.asm - sample data conversion and quantization (3DNow! & MMX)
 ;
 ; x86 SIMD extension for IJG JPEG library
 ; Copyright (C) 1999-2006, MIYASAKA Masaru.
 ; For conditions of distribution and use, see copyright notice in jsimdext.inc
 ;
 ; This file should be assembled with NASM (Netwide Assembler),
 ; can *not* be assembled with Microsoft's MASM or any compatible
 ; assembler (including Borland's Turbo Assembler).
 ; NASM is available from http://nasm.sourceforge.net/ or
 ; http://sourceforge.net/project/showfiles.php?group_id=6208
 ;
 ; Last Modified : January 23, 2006
 ;
 ; [TAB8]
 %include "jsimdext.inc"
 %include "jdct.inc"
 %ifdef DCT_FLOAT_SUPPORTED
 %ifdef JFDCT_FLT_3DNOW_MMX_SUPPORTED
 ; This module is specialized to the case DCTSIZE = 8.
 ;
 %if DCTSIZE != 8
 %error "Sorry, this code only copes with 8x8 DCTs."
 %endif
 ; --------------------------------------------------------------------------
 	SECTION	SEG_TEXT
 	BITS	32
 ;
 ; Load data into workspace, applying unsigned->signed conversion
 ;
 ; GLOBAL(void)
 ; jpeg_convsamp_flt_3dnow (JSAMPARRAY sample_data, JDIMENSION start_col,
 ;                          FAST_FLOAT * workspace);
 ;
 %define sample_data	ebp+8		; JSAMPARRAY sample_data
 %define start_col	ebp+12		; JDIMENSION start_col
 %define workspace	ebp+16		; FAST_FLOAT * workspace
 	align	16
 	global	EXTN(jpeg_convsamp_flt_3dnow)
 EXTN(jpeg_convsamp_flt_3dnow):
 	push	ebp
 	mov	ebp,esp
 	push	ebx
 ;	push	ecx		; need not be preserved
 ;	push	edx		; need not be preserved
 	push	esi
 	push	edi
 	pcmpeqw  mm7,mm7
 	psllw    mm7,7
 	packsswb mm7,mm7		; mm7 = PB_CENTERJSAMPLE (0x808080..)
 	mov	esi, JSAMPARRAY [sample_data]	; (JSAMPROW *)
 	mov	eax, JDIMENSION [start_col]
 	mov	edi, POINTER [workspace]	; (FAST_FLOAT *)
 	mov	ecx, DCTSIZE/2
 	alignx	16,7
 .convloop:
 	mov	ebx, JSAMPROW [esi+0*SIZEOF_JSAMPROW]	; (JSAMPLE *)
 	mov	edx, JSAMPROW [esi+1*SIZEOF_JSAMPROW]	; (JSAMPLE *)
 	movq	mm0, MMWORD [ebx+eax*SIZEOF_JSAMPLE]
 	movq	mm1, MMWORD [edx+eax*SIZEOF_JSAMPLE]
 	psubb	mm0,mm7				; mm0=(01234567)
 	psubb	mm1,mm7				; mm1=(89ABCDEF)
 	punpcklbw mm2,mm0			; mm2=(*0*1*2*3)
 	punpckhbw mm0,mm0			; mm0=(*4*5*6*7)
 	punpcklbw mm3,mm1			; mm3=(*8*9*A*B)
 	punpckhbw mm1,mm1			; mm1=(*C*D*E*F)
 	punpcklwd mm4,mm2			; mm4=(***0***1)
 	punpckhwd mm2,mm2			; mm2=(***2***3)
 	punpcklwd mm5,mm0			; mm5=(***4***5)
 	punpckhwd mm0,mm0			; mm0=(***6***7)
 	psrad	mm4,(DWORD_BIT-BYTE_BIT)	; mm4=(01)
 	psrad	mm2,(DWORD_BIT-BYTE_BIT)	; mm2=(23)
 	pi2fd	mm4,mm4
 	pi2fd	mm2,mm2
 	psrad	mm5,(DWORD_BIT-BYTE_BIT)	; mm5=(45)
 	psrad	mm0,(DWORD_BIT-BYTE_BIT)	; mm0=(67)
 	pi2fd	mm5,mm5
 	pi2fd	mm0,mm0
 	movq	MMWORD [MMBLOCK(0,0,edi,SIZEOF_FAST_FLOAT)], mm4
 	movq	MMWORD [MMBLOCK(0,1,edi,SIZEOF_FAST_FLOAT)], mm2
 	movq	MMWORD [MMBLOCK(0,2,edi,SIZEOF_FAST_FLOAT)], mm5
 	movq	MMWORD [MMBLOCK(0,3,edi,SIZEOF_FAST_FLOAT)], mm0
 	punpcklwd mm6,mm3			; mm6=(***8***9)
 	punpckhwd mm3,mm3			; mm3=(***A***B)
 	punpcklwd mm4,mm1			; mm4=(***C***D)
 	punpckhwd mm1,mm1			; mm1=(***E***F)
 	psrad	mm6,(DWORD_BIT-BYTE_BIT)	; mm6=(89)
 	psrad	mm3,(DWORD_BIT-BYTE_BIT)	; mm3=(AB)
 	pi2fd	mm6,mm6
 	pi2fd	mm3,mm3
 	psrad	mm4,(DWORD_BIT-BYTE_BIT)	; mm4=(CD)
 	psrad	mm1,(DWORD_BIT-BYTE_BIT)	; mm1=(EF)
 	pi2fd	mm4,mm4
 	pi2fd	mm1,mm1
 	movq	MMWORD [MMBLOCK(1,0,edi,SIZEOF_FAST_FLOAT)], mm6
 	movq	MMWORD [MMBLOCK(1,1,edi,SIZEOF_FAST_FLOAT)], mm3
 	movq	MMWORD [MMBLOCK(1,2,edi,SIZEOF_FAST_FLOAT)], mm4
 	movq	MMWORD [MMBLOCK(1,3,edi,SIZEOF_FAST_FLOAT)], mm1
 	add	esi, byte 2*SIZEOF_JSAMPROW
 	add	edi, byte 2*DCTSIZE*SIZEOF_FAST_FLOAT
 	dec	ecx
 	jnz	near .convloop
 	femms		; empty MMX/3DNow! state
 	pop	edi
 	pop	esi
 ;	pop	edx		; need not be preserved
 ;	pop	ecx		; need not be preserved
 	pop	ebx
 	pop	ebp
 	ret
 ; --------------------------------------------------------------------------
 ;
 ; Quantize/descale the coefficients, and store into coef_block
 ;
 ; GLOBAL(void)
 ; jpeg_quantize_flt_3dnow (JCOEFPTR coef_block, FAST_FLOAT * divisors,
 ;                          FAST_FLOAT * workspace);
 ;
 %define coef_block	ebp+8		; JCOEFPTR coef_block
 %define divisors	ebp+12		; FAST_FLOAT * divisors
 %define workspace	ebp+16		; FAST_FLOAT * workspace
 	align	16
 	global	EXTN(jpeg_quantize_flt_3dnow)
 EXTN(jpeg_quantize_flt_3dnow):
 	push	ebp
 	mov	ebp,esp
 ;	push	ebx		; unused
 ;	push	ecx		; unused
 ;	push	edx		; need not be preserved
 	push	esi
 	push	edi
 	mov       eax, 0x4B400000	; (float)0x00C00000 (rndint_magic)
 	movd      mm7,eax
 	punpckldq mm7,mm7		; mm7={12582912.0F 12582912.0F}
 	mov	esi, POINTER [workspace]
 	mov	edx, POINTER [divisors]
 	mov	edi, JCOEFPTR [coef_block]
 	mov	eax, DCTSIZE2/16
 	alignx	16,7
 .quantloop:
 	movq	mm0, MMWORD [MMBLOCK(0,0,esi,SIZEOF_FAST_FLOAT)]
 	movq	mm1, MMWORD [MMBLOCK(0,1,esi,SIZEOF_FAST_FLOAT)]
 	pfmul	mm0, MMWORD [MMBLOCK(0,0,edx,SIZEOF_FAST_FLOAT)]
 	pfmul	mm1, MMWORD [MMBLOCK(0,1,edx,SIZEOF_FAST_FLOAT)]
 	movq	mm2, MMWORD [MMBLOCK(0,2,esi,SIZEOF_FAST_FLOAT)]
 	movq	mm3, MMWORD [MMBLOCK(0,3,esi,SIZEOF_FAST_FLOAT)]
 	pfmul	mm2, MMWORD [MMBLOCK(0,2,edx,SIZEOF_FAST_FLOAT)]
 	pfmul	mm3, MMWORD [MMBLOCK(0,3,edx,SIZEOF_FAST_FLOAT)]
 	pfadd	mm0,mm7			; mm0=(00 ** 01 **)
 	pfadd	mm1,mm7			; mm1=(02 ** 03 **)
 	pfadd	mm2,mm7			; mm2=(04 ** 05 **)
 	pfadd	mm3,mm7			; mm3=(06 ** 07 **)
 	movq      mm4,mm0
 	punpcklwd mm0,mm1		; mm0=(00 02 ** **)
 	punpckhwd mm4,mm1		; mm4=(01 03 ** **)
 	movq      mm5,mm2
 	punpcklwd mm2,mm3		; mm2=(04 06 ** **)
 	punpckhwd mm5,mm3		; mm5=(05 07 ** **)
 	punpcklwd mm0,mm4		; mm0=(00 01 02 03)
 	punpcklwd mm2,mm5		; mm2=(04 05 06 07)
 	movq	mm6, MMWORD [MMBLOCK(1,0,esi,SIZEOF_FAST_FLOAT)]
 	movq	mm1, MMWORD [MMBLOCK(1,1,esi,SIZEOF_FAST_FLOAT)]
 	pfmul	mm6, MMWORD [MMBLOCK(1,0,edx,SIZEOF_FAST_FLOAT)]
 	pfmul	mm1, MMWORD [MMBLOCK(1,1,edx,SIZEOF_FAST_FLOAT)]
 	movq	mm3, MMWORD [MMBLOCK(1,2,esi,SIZEOF_FAST_FLOAT)]
 	movq	mm4, MMWORD [MMBLOCK(1,3,esi,SIZEOF_FAST_FLOAT)]
 	pfmul	mm3, MMWORD [MMBLOCK(1,2,edx,SIZEOF_FAST_FLOAT)]
 	pfmul	mm4, MMWORD [MMBLOCK(1,3,edx,SIZEOF_FAST_FLOAT)]
 	pfadd	mm6,mm7			; mm6=(10 ** 11 **)
 	pfadd	mm1,mm7			; mm1=(12 ** 13 **)
 	pfadd	mm3,mm7			; mm3=(14 ** 15 **)
 	pfadd	mm4,mm7			; mm4=(16 ** 17 **)
 	movq      mm5,mm6
 	punpcklwd mm6,mm1		; mm6=(10 12 ** **)
 	punpckhwd mm5,mm1		; mm5=(11 13 ** **)
 	movq      mm1,mm3
 	punpcklwd mm3,mm4		; mm3=(14 16 ** **)
 	punpckhwd mm1,mm4		; mm1=(15 17 ** **)
 	punpcklwd mm6,mm5		; mm6=(10 11 12 13)
 	punpcklwd mm3,mm1		; mm3=(14 15 16 17)
 	movq	MMWORD [MMBLOCK(0,0,edi,SIZEOF_JCOEF)], mm0
 	movq	MMWORD [MMBLOCK(0,1,edi,SIZEOF_JCOEF)], mm2
 	movq	MMWORD [MMBLOCK(1,0,edi,SIZEOF_JCOEF)], mm6
 	movq	MMWORD [MMBLOCK(1,1,edi,SIZEOF_JCOEF)], mm3
 	add	esi, byte 16*SIZEOF_FAST_FLOAT
 	add	edx, byte 16*SIZEOF_FAST_FLOAT
 	add	edi, byte 16*SIZEOF_JCOEF
 	dec	eax
 	jnz	near .quantloop
 	femms		; empty MMX/3DNow! state
 	pop	edi
 	pop	esi
 ;	pop	edx		; need not be preserved
 ;	pop	ecx		; unused
 ;	pop	ebx		; unused
 	pop	ebp
 	ret
 %endif ; JFDCT_FLT_3DNOW_MMX_SUPPORTED
 %endif ; DCT_FLOAT_SUPPORTED
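Note on jpeg_quantize_flt_3dnow above: the routine adds the constant 0x4B400000 (12582912.0F, labelled rndint_magic) to each product and then keeps only the low 16-bit word of each result. The sketch below illustrates why that works, assuming IEEE 754 single precision and the default round-to-nearest mode: after the addition the float has a unit in the last place of exactly 1, so the rounded integer sits in the low mantissa bits in two's-complement form. The function name is hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Round x to the nearest integer and return it as a 16-bit value by adding
 * the magic constant 1.5 * 2^23 and reading back the low 16 bits of the sum.
 * Assumes IEEE 754 floats, round-to-nearest, and a result within int16_t.
 */
int16_t round_via_magic_sketch(float x)
{
  const float magic = 12582912.0f;   /* bit pattern 0x4B400000 */
  volatile float sum;                /* volatile: force storage as a 32-bit float */
  float stored;
  uint32_t bits;
  uint16_t low;
  int16_t result;

  assert(x > -32768.0f && x < 32767.0f);

  sum = x + magic;                   /* the addition itself performs the rounding */
  stored = sum;
  memcpy(&bits, &stored, sizeof(bits));
  low = (uint16_t) (bits & 0xFFFFu); /* low word, as the punpck*wd steps extract */
  memcpy(&result, &low, sizeof(result));
  return result;
}
```

The same trick avoids a separate float-to-integer conversion instruction per element; the cost is that it depends on the rounding mode in effect.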
--- a/jcqntflt.asm
+++ b/jcqntflt.asm
@@ -0,0 +1,202 @@
 ;
 ; jcqntflt.asm - sample data conversion and quantization (non-SIMD, FP)
 ;
 ; x86 SIMD extension for IJG JPEG library
 ; Copyright (C) 1999-2006, MIYASAKA Masaru.
 ; For conditions of distribution and use, see copyright notice in jsimdext.inc
 ;
 ; This file should be assembled with NASM (Netwide Assembler),
 ; can *not* be assembled with Microsoft's MASM or any compatible
 ; assembler (including Borland's Turbo Assembler).
 ; NASM is available from http://nasm.sourceforge.net/ or
 ; http://sourceforge.net/project/showfiles.php?group_id=6208
 ;
 ; Last Modified : March 21, 2004
 ;
 ; [TAB8]
 %include "jsimdext.inc"
 %include "jdct.inc"
 %ifdef DCT_FLOAT_SUPPORTED
 ; This module is specialized to the case DCTSIZE = 8.
 ;
 %if DCTSIZE != 8
 %error "Sorry, this code only copes with 8x8 DCTs."
 %endif
 ; --------------------------------------------------------------------------
 	SECTION	SEG_TEXT
 	BITS	32
 ;
 ; Load data into workspace, applying unsigned->signed conversion
 ;
 ; GLOBAL(void)
 ; jpeg_convsamp_float (JSAMPARRAY sample_data, JDIMENSION start_col,
 ;                      FAST_FLOAT * workspace);
 ;
 %define sample_data	ebp+8		; JSAMPARRAY sample_data
 %define start_col	ebp+12		; JDIMENSION start_col
 %define workspace	ebp+16		; FAST_FLOAT * workspace
 	align	16
 	global	EXTN(jpeg_convsamp_float)
 EXTN(jpeg_convsamp_float):
 	push	ebp
 	mov	ebp,esp
 	push	ebx
 ;	push	ecx		; need not be preserved
 ;	push	edx		; need not be preserved
 	push	esi
 	push	edi
 	mov	esi, JSAMPARRAY [sample_data]	; (JSAMPROW *)
 	mov	edi, POINTER [workspace]	; (FAST_FLOAT *)
 	mov	ecx, DCTSIZE
 	alignx	16,7
 .convloop:
 	mov	ebx, JSAMPROW [esi]		; (JSAMPLE *)
 	add	ebx, JDIMENSION [start_col]
 %assign i 0	; i=0
 %rep 4	; -- repeat 4 times ---
 	xor	eax,eax
 	xor	edx,edx
 	mov	al, JSAMPLE [ebx+(i+0)*SIZEOF_JSAMPLE]
 	mov	dl, JSAMPLE [ebx+(i+1)*SIZEOF_JSAMPLE]
 	add	eax, byte -CENTERJSAMPLE
 	add	edx, byte -CENTERJSAMPLE
 	push	eax
 	push	edx
 %assign i i+2	; i+=2
 %endrep	; -- repeat end ---
 	fild	INT32 [esp+0*SIZEOF_INT32]
 	fild	INT32 [esp+1*SIZEOF_INT32]
 	fild	INT32 [esp+2*SIZEOF_INT32]
 	fild	INT32 [esp+3*SIZEOF_INT32]
 	fild	INT32 [esp+4*SIZEOF_INT32]
 	fild	INT32 [esp+5*SIZEOF_INT32]
 	fild	INT32 [esp+6*SIZEOF_INT32]
 	fild	INT32 [esp+7*SIZEOF_INT32]
 	add	esp, byte DCTSIZE*SIZEOF_INT32
 	fstp	FAST_FLOAT [edi+0*SIZEOF_FAST_FLOAT]
 	fstp	FAST_FLOAT [edi+1*SIZEOF_FAST_FLOAT]
 	fstp	FAST_FLOAT [edi+2*SIZEOF_FAST_FLOAT]
 	fstp	FAST_FLOAT [edi+3*SIZEOF_FAST_FLOAT]
 	fstp	FAST_FLOAT [edi+4*SIZEOF_FAST_FLOAT]
 	fstp	FAST_FLOAT [edi+5*SIZEOF_FAST_FLOAT]
 	fstp	FAST_FLOAT [edi+6*SIZEOF_FAST_FLOAT]
 	fstp	FAST_FLOAT [edi+7*SIZEOF_FAST_FLOAT]
 	add	esi, byte SIZEOF_JSAMPROW
 	add	edi, byte DCTSIZE*SIZEOF_FAST_FLOAT
 	dec	ecx
 	jnz	near .convloop
 	pop	edi
 	pop	esi
 ;	pop	edx		; need not be preserved
 ;	pop	ecx		; need not be preserved
 	pop	ebx
 	pop	ebp
 	ret
 ; --------------------------------------------------------------------------
 ;
 ; Quantize/descale the coefficients, and store into coef_block
 ;
 ; GLOBAL(void)
 ; jpeg_quantize_float (JCOEFPTR coef_block, FAST_FLOAT * divisors,
 ;                      FAST_FLOAT * workspace);
 ;
 %define coef_block	ebp+8		; JCOEFPTR coef_block
 %define divisors	ebp+12		; FAST_FLOAT * divisors
 %define workspace	ebp+16		; FAST_FLOAT * workspace
 %define FLT_ROUNDS	1		; from <float.h>
 	align	16
 	global	EXTN(jpeg_quantize_float)
 EXTN(jpeg_quantize_float):
 	push	ebp
 	mov	ebp,esp
 	push	ebx
 ;	push	ecx		; unused
 ;	push	edx		; unused
 	push	esi
 	push	edi
 %if (FLT_ROUNDS != 1)
 	push	eax
 	fnstcw	word [esp]
 	mov	eax, [esp]
 	and	eax, (~0x0C00)		; round to nearest integer
 	push	eax
 	fldcw	word [esp]
 	pop	eax
 %endif
 	mov	esi, POINTER [workspace]
 	mov	ebx, POINTER [divisors]
 	mov	edi, JCOEFPTR [coef_block]
 	mov	eax, DCTSIZE2/8
 	alignx	16,7
 .quantloop:
 	fld	FAST_FLOAT [esi+0*SIZEOF_FAST_FLOAT]
 	fmul	FAST_FLOAT [ebx+0*SIZEOF_FAST_FLOAT]
 	fld	FAST_FLOAT [esi+1*SIZEOF_FAST_FLOAT]
 	fmul	FAST_FLOAT [ebx+1*SIZEOF_FAST_FLOAT]
 	fld	FAST_FLOAT [esi+2*SIZEOF_FAST_FLOAT]
 	fmul	FAST_FLOAT [ebx+2*SIZEOF_FAST_FLOAT]
 	fld	FAST_FLOAT [esi+3*SIZEOF_FAST_FLOAT]
 	fmul	FAST_FLOAT [ebx+3*SIZEOF_FAST_FLOAT]
 	fld	FAST_FLOAT [esi+4*SIZEOF_FAST_FLOAT]
 	fmul	FAST_FLOAT [ebx+4*SIZEOF_FAST_FLOAT]
 	fxch	st0,st1
 	fld	FAST_FLOAT [esi+5*SIZEOF_FAST_FLOAT]
 	fmul	FAST_FLOAT [ebx+5*SIZEOF_FAST_FLOAT]
 	fxch	st0,st3
 	fld	FAST_FLOAT [esi+6*SIZEOF_FAST_FLOAT]
 	fmul	FAST_FLOAT [ebx+6*SIZEOF_FAST_FLOAT]
 	fxch	st0,st5
 	fld	FAST_FLOAT [esi+7*SIZEOF_FAST_FLOAT]
 	fmul	FAST_FLOAT [ebx+7*SIZEOF_FAST_FLOAT]
 	fxch	st0,st7
 	fistp	JCOEF [edi+0*SIZEOF_JCOEF]
 	fistp	JCOEF [edi+1*SIZEOF_JCOEF]
 	fistp	JCOEF [edi+2*SIZEOF_JCOEF]
 	fistp	JCOEF [edi+3*SIZEOF_JCOEF]
 	fistp	JCOEF [edi+4*SIZEOF_JCOEF]
 	fistp	JCOEF [edi+5*SIZEOF_JCOEF]
 	fistp	JCOEF [edi+6*SIZEOF_JCOEF]
 	fistp	JCOEF [edi+7*SIZEOF_JCOEF]
 	add	esi, byte 8*SIZEOF_FAST_FLOAT
 	add	ebx, byte 8*SIZEOF_FAST_FLOAT
 	add	edi, byte 8*SIZEOF_JCOEF
 	dec	eax
 	jnz	short .quantloop
 %if (FLT_ROUNDS != 1)
 	fldcw	word [esp]
 	pop	eax		; pop old control word
 %endif
 	pop	edi
 	pop	esi
 ;	pop	edx		; unused
 ;	pop	ecx		; unused
 	pop	ebx
 	pop	ebp
 	ret
 %endif ; DCT_FLOAT_SUPPORTED
--- a/jcqntint.asm
+++ b/jcqntint.asm
@@ -0,0 +1,243 @@
 ;
 ; jcqntint.asm - sample data conversion and quantization (non-SIMD, integer)
 ;
 ; x86 SIMD extension for IJG JPEG library
 ; Copyright (C) 1999-2006, MIYASAKA Masaru.
 ; For conditions of distribution and use, see copyright notice in jsimdext.inc
 ;
 ; This file should be assembled with NASM (Netwide Assembler),
 ; can *not* be assembled with Microsoft's MASM or any compatible
 ; assembler (including Borland's Turbo Assembler).
 ; NASM is available from http://nasm.sourceforge.net/ or
 ; http://sourceforge.net/project/showfiles.php?group_id=6208
 ;
 ; Last Modified : January 27, 2005
 ;
 ; [TAB8]
 %include "jsimdext.inc"
 %include "jdct.inc"
 ; This module is specialized to the case DCTSIZE = 8.
 ;
 %if DCTSIZE != 8
 %error "Sorry, this code only copes with 8x8 DCTs."
 %endif
 ; --------------------------------------------------------------------------
 	SECTION	SEG_TEXT
 	BITS	32
 ;
 ; Load data into workspace, applying unsigned->signed conversion
 ;
 ; GLOBAL(void)
 ; jpeg_convsamp_int (JSAMPARRAY sample_data, JDIMENSION start_col,
 ;                    DCTELEM * workspace);
 ;
 %define sample_data	ebp+8		; JSAMPARRAY sample_data
 %define start_col	ebp+12		; JDIMENSION start_col
 %define workspace	ebp+16		; DCTELEM * workspace
 	align	16
 	global	EXTN(jpeg_convsamp_int)
 EXTN(jpeg_convsamp_int):
 	push	ebp
 	mov	ebp,esp
 	push	ebx
 ;	push	ecx		; need not be preserved
 ;	push	edx		; need not be preserved
 	push	esi
 	push	edi
 	mov	esi, JSAMPARRAY [sample_data]	; (JSAMPROW *)
 	mov	edi, POINTER [workspace]	; (DCTELEM *)
 	mov	ecx, DCTSIZE
 	alignx	16,7
 .convloop:
 	mov	ebx, JSAMPROW [esi]		; (JSAMPLE *)
 	add	ebx, JDIMENSION [start_col]
 %assign i 0	; i=0
 %rep 4	; -- repeat 4 times ---
 	xor	eax,eax
 	xor	edx,edx
 	mov	al, JSAMPLE [ebx+(i+0)*SIZEOF_JSAMPLE]
 	mov	dl, JSAMPLE [ebx+(i+1)*SIZEOF_JSAMPLE]
 	add	eax, byte -CENTERJSAMPLE
 	add	edx, byte -CENTERJSAMPLE
 	mov	DCTELEM [edi+(i+0)*SIZEOF_DCTELEM], ax
 	mov	DCTELEM [edi+(i+1)*SIZEOF_DCTELEM], dx
 %assign i i+2	; i+=2
 %endrep	; -- repeat end ---
 	add	esi, byte SIZEOF_JSAMPROW
 	add	edi, byte DCTSIZE*SIZEOF_DCTELEM
 	dec	ecx
 	jnz	short .convloop
 	pop	edi
 	pop	esi
 ;	pop	edx		; need not be preserved
 ;	pop	ecx		; need not be preserved
 	pop	ebx
 	pop	ebp
 	ret
 %ifndef JFDCT_INT_QUANTIZE_WITH_DIVISION
 ; --------------------------------------------------------------------------
 ;
 ; Quantize/descale the coefficients, and store into coef_block
 ;
 ; This implementation is based on an algorithm described in
 ;   "How to optimize for the Pentium family of microprocessors"
 ;   (http://www.agner.org/assem/).
 ;
 ; GLOBAL(void)
 ; jpeg_quantize_int (JCOEFPTR coef_block, DCTELEM * divisors,
 ;                    DCTELEM * workspace);
 ;
 %define RECIPROCAL(i,b)	((b)+((i)+DCTSIZE2*0)*SIZEOF_DCTELEM)
 %define CORRECTION(i,b)	((b)+((i)+DCTSIZE2*1)*SIZEOF_DCTELEM)
 %define SHIFT(i,b)	((b)+((i)+DCTSIZE2*3)*SIZEOF_DCTELEM)
 %define coef_block	ebp+8		; JCOEFPTR coef_block
 %define divisors	ebp+12		; DCTELEM * divisors
 %define workspace	ebp+16		; DCTELEM * workspace
 %define UNROLL	2
 	align	16
 	global	EXTN(jpeg_quantize_int)
 EXTN(jpeg_quantize_int):
 	push	ebp
 	mov	ebp,esp
 	push	ebx
 ;	push	ecx		; need not be preserved
 ;	push	edx		; need not be preserved
 	push	esi
 	push	edi
 	mov	esi, POINTER [workspace]
 	mov	ebx, POINTER [divisors]
 	mov	edi, JCOEFPTR [coef_block]
 	mov	ecx, DCTSIZE2/UNROLL
 	alignx	16,7
 .quantloop:
 	push	ecx
 %assign i 0	; i=0;
 %rep UNROLL	; ---- repeat (UNROLL) times ----
 	mov	cx, DCTELEM [esi+(i)*SIZEOF_DCTELEM]
 	mov	ax,cx
 	sar	cx,(WORD_BIT-1)
 	xor	ax,cx		; if (ax < 0) ax = -ax;
 	sub	ax,cx
 	add	ax, DCTELEM [CORRECTION(i,ebx)]	; correction + roundfactor
 	shl	ax,1
 	mul	DCTELEM [RECIPROCAL(i,ebx)]	; reciprocal
 	mov	ax,cx
 	mov	cx, DCTELEM [SHIFT(i,ebx)]	; shift
 	shr	dx,cl
 	xor	dx,ax
 	sub	dx,ax
 	mov	JCOEF [edi+(i)*SIZEOF_JCOEF], dx
 %assign i i+1	; i++;
 %endrep		; ---- repeat end ----
 	pop	ecx
 	add	esi, byte UNROLL*SIZEOF_DCTELEM
 	add	ebx, byte UNROLL*SIZEOF_DCTELEM
 	add	edi, byte UNROLL*SIZEOF_JCOEF
 	dec	ecx
 	jnz	.quantloop
 	pop	edi
 	pop	esi
 ;	pop	edx		; need not be preserved
 ;	pop	ecx		; need not be preserved
 	pop	ebx
 	pop	ebp
 	ret
 %else ; JFDCT_INT_QUANTIZE_WITH_DIVISION
 ; --------------------------------------------------------------------------
 ;
 ; Quantize/descale the coefficients, and store into coef_block
 ;
 ; GLOBAL(void)
 ; jpeg_quantize_idiv (JCOEFPTR coef_block, DCTELEM * divisors,
 ;                     DCTELEM * workspace);
 ;
 %define coef_block	ebp+8		; JCOEFPTR coef_block
 %define divisors	ebp+12		; DCTELEM * divisors
 %define workspace	ebp+16		; DCTELEM * workspace
 	align	16
 	global	EXTN(jpeg_quantize_idiv)
 EXTN(jpeg_quantize_idiv):
 	push	ebp
 	mov	ebp,esp
 	push	ebx
 ;	push	ecx		; need not be preserved
 ;	push	edx		; need not be preserved
 	push	esi
 	push	edi
 	mov	esi, POINTER [workspace]
 	mov	ebx, POINTER [divisors]
 	mov	edi, JCOEFPTR [coef_block]
 	mov	ecx, DCTSIZE2
 	alignx	16,7
 .quantloop:
 	push	ecx
 	movsx	ecx, DCTELEM [esi]	; temp
 	mov	eax,ecx
 	sar	ecx,(DWORD_BIT-1)
 	xor	edx,edx
 	mov	dx, DCTELEM [ebx]	; qval
 	xor	eax,ecx			; if (eax < 0) eax = -eax;
 	shr	edx,1
 	sub	eax,ecx
 	cmp	eax,edx			; if (temp + qval/2 >= qval)
 	jge	short .quant
 	; ---- if the quantized coefficient is zero
 	xor	eax,eax
 	jmp	short .output
 	alignx	16,7
 .quant:	; ---- do quantization
 	add	eax,edx
 	xor	edx,edx
 	div	DCTELEM [ebx]		; Q:ax,R:dx
 	xor	ax,cx
 	sub	ax,cx
 	alignx	16,7
 .output:
 	mov	JCOEF [edi], ax
 	pop	ecx
 	add	esi, byte SIZEOF_DCTELEM
 	add	ebx, byte SIZEOF_DCTELEM
 	add	edi, byte SIZEOF_JCOEF
 	dec	ecx
 	jnz	short .quantloop
 	pop	edi
 	pop	esi
 ;	pop	edx		; need not be preserved
 ;	pop	ecx		; need not be preserved
 	pop	ebx
 	pop	ebp
 	ret
 %endif ; !JFDCT_INT_QUANTIZE_WITH_DIVISION
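Note on jpeg_quantize_idiv above: the division-based path takes the absolute value of the coefficient, adds half the quantizer step for rounding, divides only when the result can be nonzero, and then restores the sign. A minimal C sketch of one coefficient follows; the function name is hypothetical and the code is an illustration of the comments in the assembly, not a copy of the library's C fallback.

```c
#include <assert.h>

/* Quantize one coefficient by qval with rounding to nearest,
 * working on the magnitude so that rounding is symmetric around zero.
 */
int quantize_idiv_sketch(int coef, int qval)
{
  int negative = (coef < 0);
  int mag = negative ? -coef : coef;

  assert(qval > 0);

  mag += qval >> 1;                        /* add qval/2 for rounding */
  mag = (mag >= qval) ? mag / qval : 0;    /* skip the divide when the result is zero */
  return negative ? -mag : mag;
}
```

Skipping the divide for small magnitudes matters because most high-frequency coefficients quantize to zero, and integer division is slow on the processors this code targets.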
--- a/jcqntmmx.asm
+++ b/jcqntmmx.asm
@@ -0,0 +1,254 @@
 ;
 ; jcqntmmx.asm - sample data conversion and quantization (MMX)
 ;
 ; x86 SIMD extension for IJG JPEG library
 ; Copyright (C) 1999-2006, MIYASAKA Masaru.
 ; For conditions of distribution and use, see copyright notice in jsimdext.inc
 ;
 ; This file should be assembled with NASM (Netwide Assembler),
 ; can *not* be assembled with Microsoft's MASM or any compatible
 ; assembler (including Borland's Turbo Assembler).
 ; NASM is available from http://nasm.sourceforge.net/ or
 ; http://sourceforge.net/project/showfiles.php?group_id=6208
 ;
 ; Last Modified : January 27, 2005
 ;
 ; [TAB8]
 %include "jsimdext.inc"
 %include "jdct.inc"
 %ifdef JFDCT_INT_MMX_SUPPORTED
 ; This module is specialized to the case DCTSIZE = 8.
 ;
 %if DCTSIZE != 8
 %error "Sorry, this code only copes with 8x8 DCTs."
 %endif
 ; --------------------------------------------------------------------------
 	SECTION	SEG_TEXT
 	BITS	32
 ;
 ; Load data into workspace, applying unsigned->signed conversion
 ;
 ; GLOBAL(void)
 ; jpeg_convsamp_int_mmx (JSAMPARRAY sample_data, JDIMENSION start_col,
 ;                        DCTELEM * workspace);
 ;
 %define sample_data	ebp+8		; JSAMPARRAY sample_data
 %define start_col	ebp+12		; JDIMENSION start_col
 %define workspace	ebp+16		; DCTELEM * workspace
 	align	16
 	global	EXTN(jpeg_convsamp_int_mmx)
 EXTN(jpeg_convsamp_int_mmx):
 	push	ebp
 	mov	ebp,esp
 	push	ebx
 ;	push	ecx		; need not be preserved
 ;	push	edx		; need not be preserved
 	push	esi
 	push	edi
 	pxor	mm6,mm6			; mm6=(all 0's)
 	pcmpeqw	mm7,mm7
 	psllw	mm7,7			; mm7={0xFF80 0xFF80 0xFF80 0xFF80}
 	mov	esi, JSAMPARRAY [sample_data]	; (JSAMPROW *)
 	mov	eax, JDIMENSION [start_col]
 	mov	edi, POINTER [workspace]	; (DCTELEM *)
 	mov	ecx, DCTSIZE/4
 	alignx	16,7
 .convloop:
 	mov	ebx, JSAMPROW [esi+0*SIZEOF_JSAMPROW]	; (JSAMPLE *)
 	mov	edx, JSAMPROW [esi+1*SIZEOF_JSAMPROW]	; (JSAMPLE *)
 	movq	mm0, MMWORD [ebx+eax*SIZEOF_JSAMPLE]	; mm0=(01234567)
 	movq	mm1, MMWORD [edx+eax*SIZEOF_JSAMPLE]	; mm1=(89ABCDEF)
 	mov	ebx, JSAMPROW [esi+2*SIZEOF_JSAMPROW]	; (JSAMPLE *)
 	mov	edx, JSAMPROW [esi+3*SIZEOF_JSAMPROW]	; (JSAMPLE *)
 	movq	mm2, MMWORD [ebx+eax*SIZEOF_JSAMPLE]	; mm2=(GHIJKLMN)
 	movq	mm3, MMWORD [edx+eax*SIZEOF_JSAMPLE]	; mm3=(OPQRSTUV)
 	movq      mm4,mm0
 	punpcklbw mm0,mm6		; mm0=(0123)
 	punpckhbw mm4,mm6		; mm4=(4567)
 	movq      mm5,mm1
 	punpcklbw mm1,mm6		; mm1=(89AB)
 	punpckhbw mm5,mm6		; mm5=(CDEF)
 	paddw	mm0,mm7
 	paddw	mm4,mm7
 	paddw	mm1,mm7
 	paddw	mm5,mm7
 	movq	MMWORD [MMBLOCK(0,0,edi,SIZEOF_DCTELEM)], mm0
 	movq	MMWORD [MMBLOCK(0,1,edi,SIZEOF_DCTELEM)], mm4
 	movq	MMWORD [MMBLOCK(1,0,edi,SIZEOF_DCTELEM)], mm1
 	movq	MMWORD [MMBLOCK(1,1,edi,SIZEOF_DCTELEM)], mm5
 	movq      mm0,mm2
 	punpcklbw mm2,mm6		; mm2=(GHIJ)
 	punpckhbw mm0,mm6		; mm0=(KLMN)
 	movq      mm4,mm3
 	punpcklbw mm3,mm6		; mm3=(OPQR)
 	punpckhbw mm4,mm6		; mm4=(STUV)
 	paddw	mm2,mm7
 	paddw	mm0,mm7
 	paddw	mm3,mm7
 	paddw	mm4,mm7
 	movq	MMWORD [MMBLOCK(2,0,edi,SIZEOF_DCTELEM)], mm2
 	movq	MMWORD [MMBLOCK(2,1,edi,SIZEOF_DCTELEM)], mm0
 	movq	MMWORD [MMBLOCK(3,0,edi,SIZEOF_DCTELEM)], mm3
 	movq	MMWORD [MMBLOCK(3,1,edi,SIZEOF_DCTELEM)], mm4
 	add	esi, byte 4*SIZEOF_JSAMPROW
 	add	edi, byte 4*DCTSIZE*SIZEOF_DCTELEM
 	dec	ecx
 	jnz	short .convloop
 	emms		; empty MMX state
 	pop	edi
 	pop	esi
 ;	pop	edx		; need not be preserved
 ;	pop	ecx		; need not be preserved
 	pop	ebx
 	pop	ebp
 	ret
 %ifndef JFDCT_INT_QUANTIZE_WITH_DIVISION
 ; --------------------------------------------------------------------------
 ;
 ; Quantize/descale the coefficients, and store into coef_block
 ;
 ; This implementation is based on an algorithm described in
 ;   "How to optimize for the Pentium family of microprocessors"
 ;   (http://www.agner.org/assem/).
 ;
 ; GLOBAL(void)
 ; jpeg_quantize_int_mmx (JCOEFPTR coef_block, DCTELEM * divisors,
 ;                        DCTELEM * workspace);
 ;
 %define RECIPROCAL(m,n,b) MMBLOCK(DCTSIZE*0+(m),(n),(b),SIZEOF_DCTELEM)
 %define CORRECTION(m,n,b) MMBLOCK(DCTSIZE*1+(m),(n),(b),SIZEOF_DCTELEM)
 %define SCALE(m,n,b)      MMBLOCK(DCTSIZE*2+(m),(n),(b),SIZEOF_DCTELEM)
 %define coef_block	ebp+8		; JCOEFPTR coef_block
 %define divisors	ebp+12		; DCTELEM * divisors
 %define workspace	ebp+16		; DCTELEM * workspace
 	align	16
 	global	EXTN(jpeg_quantize_int_mmx)
 EXTN(jpeg_quantize_int_mmx):
 	push	ebp
 	mov	ebp,esp
 ;	push	ebx		; unused
 ;	push	ecx		; unused
 ;	push	edx		; need not be preserved
 	push	esi
 	push	edi
 	mov	esi, POINTER [workspace]
 	mov	edx, POINTER [divisors]
 	mov	edi, JCOEFPTR [coef_block]
 	mov	ah, 2
 	alignx	16,7
 .quantloop1:
 	mov	al, DCTSIZE2/8/2
 	alignx	16,7
 .quantloop2:
 	movq	mm2, MMWORD [MMBLOCK(0,0,esi,SIZEOF_DCTELEM)]
 	movq	mm3, MMWORD [MMBLOCK(0,1,esi,SIZEOF_DCTELEM)]
 	movq	mm0,mm2
 	movq	mm1,mm3
 	psraw	mm2,(WORD_BIT-1)
 	psraw	mm3,(WORD_BIT-1)
 	pxor	mm0,mm2
 	pxor	mm1,mm3
 	psubw	mm0,mm2		; if (mm0 < 0) mm0 = -mm0;
 	psubw	mm1,mm3		; if (mm1 < 0) mm1 = -mm1;
 	; unsigned long unsigned_multiply(unsigned short x, unsigned short y)
 	; {
 	;   enum { SHORT_BIT = 16 };
 	;   signed short sx = (signed short) x;
 	;   signed short sy = (signed short) y;
 	;   signed long sz;
 	; 
 	;   sz = (long) sx * (long) sy;     /* signed multiply */
 	; 
 	;   if (sx < 0) sz += (long) sy << SHORT_BIT;
 	;   if (sy < 0) sz += (long) sx << SHORT_BIT;
 	; 
 	;   return (unsigned long) sz;
 	; }
 	paddw	mm0, MMWORD [CORRECTION(0,0,edx)]   ; correction + roundfactor
 	paddw	mm1, MMWORD [CORRECTION(0,1,edx)]
 	psllw	mm0,1
 	psllw	mm1,1
 	movq	mm4,mm0
 	movq	mm5,mm1
 	pmulhw	mm0, MMWORD [RECIPROCAL(0,0,edx)]   ; reciprocal
 	pmulhw	mm1, MMWORD [RECIPROCAL(0,1,edx)]
 	movq	mm6, MMWORD [SCALE(0,0,edx)]	; scale
 	movq	mm7, MMWORD [SCALE(0,1,edx)]
 	paddw	mm0,mm4		; reciprocal is always negative (MSB=1)
 	paddw	mm1,mm5
 	psllw	mm0,1
 	psllw	mm1,1
 	movq	mm4,mm0
 	movq	mm5,mm1
 	pmulhw	mm0,mm6
 	pmulhw	mm1,mm7
 	psraw	mm6,(WORD_BIT-1)
 	psraw	mm7,(WORD_BIT-1)
 	pand	mm6,mm4
 	pand	mm7,mm5
 	paddw	mm0,mm6
 	paddw	mm1,mm7
 	psraw	mm4,(WORD_BIT-1)
 	psraw	mm5,(WORD_BIT-1)
 	pand	mm4, MMWORD [SCALE(0,0,edx)]	; scale
 	pand	mm5, MMWORD [SCALE(0,1,edx)]
 	paddw	mm0,mm4
 	paddw	mm1,mm5
 	pxor	mm0,mm2
 	pxor	mm1,mm3
 	psubw	mm0,mm2
 	psubw	mm1,mm3
 	movq	MMWORD [MMBLOCK(0,0,edi,SIZEOF_DCTELEM)], mm0
 	movq	MMWORD [MMBLOCK(0,1,edi,SIZEOF_DCTELEM)], mm1
 	add	esi, byte 8*SIZEOF_DCTELEM
 	add	edx, byte 8*SIZEOF_DCTELEM
 	add	edi, byte 8*SIZEOF_JCOEF
 	dec	al
 	jnz	near .quantloop2
 	dec	ah
 	jnz	near .quantloop1	; to avoid branch misprediction
 	emms		; empty MMX state
 	pop	edi
 	pop	esi
 ;	pop	edx		; need not be preserved
 ;	pop	ecx		; unused
 ;	pop	ebx		; unused
 	pop	ebp
 	ret
 %endif ; !JFDCT_INT_QUANTIZE_WITH_DIVISION
 %endif ; JFDCT_INT_MMX_SUPPORTED
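Note on jpeg_quantize_int_mmx above: MMX offers only a signed high-word multiply (pmulhw), so the routine corrects the result to obtain an unsigned product, following the unsigned_multiply pseudocode in its comments. The sketch below restates that correction for the high 16 bits only, which is the part the assembly keeps. The function name is hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* High 16 bits of the unsigned product x*y, computed from a signed
 * 16x16 multiply plus corrections, as pmulhw followed by paddw would do.
 */
uint16_t mulhi_u16_sketch(uint16_t x, uint16_t y)
{
  /* Reinterpret the operands as signed 16-bit values */
  int32_t sx = (x & 0x8000u) ? (int32_t) x - 65536 : (int32_t) x;
  int32_t sy = (y & 0x8000u) ? (int32_t) y - 65536 : (int32_t) y;
  int32_t prod = sx * sy;                       /* signed multiply */
  uint16_t hi = (uint16_t) ((uint32_t) prod >> 16);

  /* A negative sx stands for x - 65536, so the signed product is short
   * by y << 16; likewise for sy.  Add the missing terms modulo 2^16.
   */
  if (sx < 0) hi = (uint16_t) (hi + y);
  if (sy < 0) hi = (uint16_t) (hi + x);
  return hi;
}
```

This is why the assembly adds a saved copy of the operand after each pmulhw: the reciprocal always has its top bit set, so one of the two corrections is unconditional there.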
--- a/jcqnts2f.asm
+++ b/jcqnts2f.asm
@@ -0,0 +1,178 @@
 ;
 ; jcqnts2f.asm - sample data conversion and quantization (SSE & SSE2)
 ;
 ; x86 SIMD extension for IJG JPEG library
 ; Copyright (C) 1999-2006, MIYASAKA Masaru.
 ; For conditions of distribution and use, see copyright notice in jsimdext.inc
 ;
 ; This file should be assembled with NASM (Netwide Assembler),
 ; can *not* be assembled with Microsoft's MASM or any compatible
 ; assembler (including Borland's Turbo Assembler).
 ; NASM is available from http://nasm.sourceforge.net/ or
 ; http://sourceforge.net/project/showfiles.php?group_id=6208
 ;
 ; Last Modified : January 18, 2005
 ;
 ; [TAB8]
 %include "jsimdext.inc"
 %include "jdct.inc"
 %ifdef DCT_FLOAT_SUPPORTED
 %ifdef JFDCT_FLT_SSE_SSE2_SUPPORTED
 ; This module is specialized to the case DCTSIZE = 8.
 ;
 %if DCTSIZE != 8
 %error "Sorry, this code only copes with 8x8 DCTs."
 %endif
 ; --------------------------------------------------------------------------
 	SECTION	SEG_TEXT
 	BITS	32
 ;
 ; Load data into workspace, applying unsigned->signed conversion
 ;
 ; GLOBAL(void)
 ; jpeg_convsamp_flt_sse2 (JSAMPARRAY sample_data, JDIMENSION start_col,
 ;                         FAST_FLOAT * workspace);
 ;
 %define sample_data	ebp+8		; JSAMPARRAY sample_data
 %define start_col	ebp+12		; JDIMENSION start_col
 %define workspace	ebp+16		; FAST_FLOAT * workspace
 	align	16
 	global	EXTN(jpeg_convsamp_flt_sse2)
 EXTN(jpeg_convsamp_flt_sse2):
 	push	ebp
 	mov	ebp,esp
 	push	ebx
 ;	push	ecx		; need not be preserved
 ;	push	edx		; need not be preserved
 	push	esi
 	push	edi
 	pcmpeqw  xmm7,xmm7
 	psllw    xmm7,7
 	packsswb xmm7,xmm7		; xmm7 = PB_CENTERJSAMPLE (0x808080..)
 	mov	esi, JSAMPARRAY [sample_data]	; (JSAMPROW *)
 	mov	eax, JDIMENSION [start_col]
 	mov	edi, POINTER [workspace]	; (FAST_FLOAT *)
 	mov	ecx, DCTSIZE/2
 	alignx	16,7
 .convloop:
 	mov	ebx, JSAMPROW [esi+0*SIZEOF_JSAMPROW]	; (JSAMPLE *)
 	mov	edx, JSAMPROW [esi+1*SIZEOF_JSAMPROW]	; (JSAMPLE *)
 	movq	xmm0, _MMWORD [ebx+eax*SIZEOF_JSAMPLE]
 	movq	xmm1, _MMWORD [edx+eax*SIZEOF_JSAMPLE]
 	psubb	xmm0,xmm7			; xmm0=(01234567)
 	psubb	xmm1,xmm7			; xmm1=(89ABCDEF)
 	punpcklbw xmm0,xmm0			; xmm0=(*0*1*2*3*4*5*6*7)
 	punpcklbw xmm1,xmm1			; xmm1=(*8*9*A*B*C*D*E*F)
 	punpcklwd xmm2,xmm0			; xmm2=(***0***1***2***3)
 	punpckhwd xmm0,xmm0			; xmm0=(***4***5***6***7)
 	punpcklwd xmm3,xmm1			; xmm3=(***8***9***A***B)
 	punpckhwd xmm1,xmm1			; xmm1=(***C***D***E***F)
 	psrad     xmm2,(DWORD_BIT-BYTE_BIT)	; xmm2=(0123)
 	psrad     xmm0,(DWORD_BIT-BYTE_BIT)	; xmm0=(4567)
 	cvtdq2ps  xmm2,xmm2			; xmm2=(0123)
 	cvtdq2ps  xmm0,xmm0			; xmm0=(4567)
 	psrad     xmm3,(DWORD_BIT-BYTE_BIT)	; xmm3=(89AB)
 	psrad     xmm1,(DWORD_BIT-BYTE_BIT)	; xmm1=(CDEF)
 	cvtdq2ps  xmm3,xmm3			; xmm3=(89AB)
 	cvtdq2ps  xmm1,xmm1			; xmm1=(CDEF)
 	movaps	XMMWORD [XMMBLOCK(0,0,edi,SIZEOF_FAST_FLOAT)], xmm2
 	movaps	XMMWORD [XMMBLOCK(0,1,edi,SIZEOF_FAST_FLOAT)], xmm0
 	movaps	XMMWORD [XMMBLOCK(1,0,edi,SIZEOF_FAST_FLOAT)], xmm3
 	movaps	XMMWORD [XMMBLOCK(1,1,edi,SIZEOF_FAST_FLOAT)], xmm1
 	add	esi, byte 2*SIZEOF_JSAMPROW
 	add	edi, byte 2*DCTSIZE*SIZEOF_FAST_FLOAT
 	dec	ecx
 	jnz	short .convloop
 	pop	edi
 	pop	esi
 ;	pop	edx		; need not be preserved
 ;	pop	ecx		; need not be preserved
 	pop	ebx
 	pop	ebp
 	ret
 ; --------------------------------------------------------------------------
 ;
 ; Quantize/descale the coefficients, and store into coef_block
 ;
 ; GLOBAL(void)
 ; jpeg_quantize_flt_sse2 (JCOEFPTR coef_block, FAST_FLOAT * divisors,
 ;                         FAST_FLOAT * workspace);
 ;
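 ; A scalar sketch of the loop below (illustrative only). It assumes that
 ; MXCSR is in its default round-to-nearest mode, which cvtps2dq obeys,
 ; and that divisors[] already holds reciprocals, as the use of mulps
 ; suggests; the index i is a hypothetical name:
 ;
 ;   for (i = 0; i < DCTSIZE2; i++)
 ;     coef_block[i] = (JCOEF) lrintf(workspace[i] * divisors[i]);
 ;
 ; In addition, packssdw saturates each result to the signed 16-bit range.
 ;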
 %define coef_block	ebp+8		; JCOEFPTR coef_block
 %define divisors	ebp+12		; FAST_FLOAT * divisors
 %define workspace	ebp+16		; FAST_FLOAT * workspace
 	align	16
 	global	EXTN(jpeg_quantize_flt_sse2)
 EXTN(jpeg_quantize_flt_sse2):
 	push	ebp
 	mov	ebp,esp
 ;	push	ebx		; unused
 ;	push	ecx		; unused
 ;	push	edx		; need not be preserved
 	push	esi
 	push	edi
 	mov	esi, POINTER [workspace]
 	mov	edx, POINTER [divisors]
 	mov	edi, JCOEFPTR [coef_block]
 	mov	eax, DCTSIZE2/16
 	alignx	16,7
 .quantloop:
 	movaps	xmm0, XMMWORD [XMMBLOCK(0,0,esi,SIZEOF_FAST_FLOAT)]
 	movaps	xmm1, XMMWORD [XMMBLOCK(0,1,esi,SIZEOF_FAST_FLOAT)]
 	mulps	xmm0, XMMWORD [XMMBLOCK(0,0,edx,SIZEOF_FAST_FLOAT)]
 	mulps	xmm1, XMMWORD [XMMBLOCK(0,1,edx,SIZEOF_FAST_FLOAT)]
 	movaps	xmm2, XMMWORD [XMMBLOCK(1,0,esi,SIZEOF_FAST_FLOAT)]
 	movaps	xmm3, XMMWORD [XMMBLOCK(1,1,esi,SIZEOF_FAST_FLOAT)]
 	mulps	xmm2, XMMWORD [XMMBLOCK(1,0,edx,SIZEOF_FAST_FLOAT)]
 	mulps	xmm3, XMMWORD [XMMBLOCK(1,1,edx,SIZEOF_FAST_FLOAT)]
 	cvtps2dq xmm0,xmm0
 	cvtps2dq xmm1,xmm1
 	cvtps2dq xmm2,xmm2
 	cvtps2dq xmm3,xmm3
 	packssdw xmm0,xmm1
 	packssdw xmm2,xmm3
 	movdqa	XMMWORD [XMMBLOCK(0,0,edi,SIZEOF_JCOEF)], xmm0
 	movdqa	XMMWORD [XMMBLOCK(1,0,edi,SIZEOF_JCOEF)], xmm2
 	add	esi, byte 16*SIZEOF_FAST_FLOAT
 	add	edx, byte 16*SIZEOF_FAST_FLOAT
 	add	edi, byte 16*SIZEOF_JCOEF
 	dec	eax
 	jnz	short .quantloop
 	pop	edi
 	pop	esi
 ;	pop	edx		; need not be preserved
 ;	pop	ecx		; unused
 ;	pop	ebx		; unused
 	pop	ebp
 	ret
 %endif ; JFDCT_FLT_SSE_SSE2_SUPPORTED
 %endif ; DCT_FLOAT_SUPPORTED
--- a/jcqnts2i.asm
+++ b/jcqnts2i.asm
@@ -0,0 +1,216 @@
 ;
 ; jcqnts2i.asm - sample data conversion and quantization (SSE2)
 ;
 ; x86 SIMD extension for IJG JPEG library
 ; Copyright (C) 1999-2006, MIYASAKA Masaru.
 ; For conditions of distribution and use, see copyright notice in jsimdext.inc
 ;
 ; This file should be assembled with NASM (Netwide Assembler); it
 ; can *not* be assembled with Microsoft's MASM or any compatible
 ; assembler (including Borland's Turbo Assembler).
 ; NASM is available from http://nasm.sourceforge.net/ or
 ; http://sourceforge.net/project/showfiles.php?group_id=6208
 ;
 ; Last Modified : January 27, 2005
 ;
 ; [TAB8]
 %include "jsimdext.inc"
 %include "jdct.inc"
 %ifdef JFDCT_INT_SSE2_SUPPORTED
 ; This module is specialized to the case DCTSIZE = 8.
 ;
 %if DCTSIZE != 8
 %error "Sorry, this code only copes with 8x8 DCTs."
 %endif
 ; --------------------------------------------------------------------------
 	SECTION	SEG_TEXT
 	BITS	32
 ;
 ; Load data into workspace, applying unsigned->signed conversion
 ;
 ; GLOBAL(void)
 ; jpeg_convsamp_int_sse2 (JSAMPARRAY sample_data, JDIMENSION start_col,
 ;                         DCTELEM * workspace);
 ;
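 ; A scalar sketch of the conversion (illustrative only; assumes 8-bit
 ; samples, and r and c are hypothetical loop indices).  Adding the word
 ; constant 0xFF80 in xmm7 is the same as subtracting 128:
 ;
 ;   for (r = 0; r < DCTSIZE; r++)
 ;     for (c = 0; c < DCTSIZE; c++)
 ;       workspace[r*DCTSIZE + c] = (DCTELEM)
 ;         ((int) sample_data[r][start_col + c] - 128);
 ;
 ; The SIMD loop below converts four rows per iteration.
 ;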
 %define sample_data	ebp+8		; JSAMPARRAY sample_data
 %define start_col	ebp+12		; JDIMENSION start_col
 %define workspace	ebp+16		; DCTELEM * workspace
 	align	16
 	global	EXTN(jpeg_convsamp_int_sse2)
 EXTN(jpeg_convsamp_int_sse2):
 	push	ebp
 	mov	ebp,esp
 	push	ebx
 ;	push	ecx		; need not be preserved
 ;	push	edx		; need not be preserved
 	push	esi
 	push	edi
 	pxor	xmm6,xmm6		; xmm6=(all 0's)
 	pcmpeqw	xmm7,xmm7
 	psllw	xmm7,7			; xmm7={0xFF80 0xFF80 0xFF80 0xFF80 ..}
 	mov	esi, JSAMPARRAY [sample_data]	; (JSAMPROW *)
 	mov	eax, JDIMENSION [start_col]
 	mov	edi, POINTER [workspace]	; (DCTELEM *)
 	mov	ecx, DCTSIZE/4
 	alignx	16,7
 .convloop:
 	mov	ebx, JSAMPROW [esi+0*SIZEOF_JSAMPROW]	; (JSAMPLE *)
 	mov	edx, JSAMPROW [esi+1*SIZEOF_JSAMPROW]	; (JSAMPLE *)
 	movq	xmm0, _MMWORD [ebx+eax*SIZEOF_JSAMPLE]	; xmm0=(01234567)
 	movq	xmm1, _MMWORD [edx+eax*SIZEOF_JSAMPLE]	; xmm1=(89ABCDEF)
 	mov	ebx, JSAMPROW [esi+2*SIZEOF_JSAMPROW]	; (JSAMPLE *)
 	mov	edx, JSAMPROW [esi+3*SIZEOF_JSAMPROW]	; (JSAMPLE *)
 	movq	xmm2, _MMWORD [ebx+eax*SIZEOF_JSAMPLE]	; xmm2=(GHIJKLMN)
 	movq	xmm3, _MMWORD [edx+eax*SIZEOF_JSAMPLE]	; xmm3=(OPQRSTUV)
 	punpcklbw xmm0,xmm6		; xmm0=(01234567)
 	punpcklbw xmm1,xmm6		; xmm1=(89ABCDEF)
 	paddw     xmm0,xmm7
 	paddw     xmm1,xmm7
 	punpcklbw xmm2,xmm6		; xmm2=(GHIJKLMN)
 	punpcklbw xmm3,xmm6		; xmm3=(OPQRSTUV)
 	paddw     xmm2,xmm7
 	paddw     xmm3,xmm7
 	movdqa	XMMWORD [XMMBLOCK(0,0,edi,SIZEOF_DCTELEM)], xmm0
 	movdqa	XMMWORD [XMMBLOCK(1,0,edi,SIZEOF_DCTELEM)], xmm1
 	movdqa	XMMWORD [XMMBLOCK(2,0,edi,SIZEOF_DCTELEM)], xmm2
 	movdqa	XMMWORD [XMMBLOCK(3,0,edi,SIZEOF_DCTELEM)], xmm3
 	add	esi, byte 4*SIZEOF_JSAMPROW
 	add	edi, byte 4*DCTSIZE*SIZEOF_DCTELEM
 	dec	ecx
 	jnz	short .convloop
 	pop	edi
 	pop	esi
 ;	pop	edx		; need not be preserved
 ;	pop	ecx		; need not be preserved
 	pop	ebx
 	pop	ebp
 	ret
 %ifndef JFDCT_INT_QUANTIZE_WITH_DIVISION
 ; --------------------------------------------------------------------------
 ;
 ; Quantize/descale the coefficients, and store into coef_block
 ;
 ; This implementation is based on an algorithm described in
 ;   "How to optimize for the Pentium family of microprocessors"
 ;   (http://www.agner.org/assem/).
 ;
 ; GLOBAL(void)
 ; jpeg_quantize_int_sse2 (JCOEFPTR coef_block, DCTELEM * divisors,
 ;                         DCTELEM * workspace);
 ;
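 ; Per coefficient, the loop below performs the following steps.  This is
 ; a scalar sketch read off the instruction sequence, not a statement of
 ; the referenced algorithm; x, s, t, i and the array names reciprocal[],
 ; correction[] and scale[] are hypothetical, and s and t are 16-bit words:
 ;
 ;   s = x >> 15;                  /* sign mask: 0 or -1 (psraw)          */
 ;   t = (x ^ s) - s;              /* t = |x|                             */
 ;   t = t + correction[i];        /* correction + roundfactor            */
 ;   t = ((UINT32) (UINT16) (t << 1) * reciprocal[i]) >> 16;  /* pmulhuw */
 ;   t = ((UINT32) (UINT16) (t << 1) * scale[i]) >> 16;       /* pmulhuw */
 ;   coef_block[i] = (t ^ s) - s;  /* restore the sign of x               */
 ;
 ; The macros below locate the three tables inside the divisors array.
 ;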
 %define RECIPROCAL(m,n,b) XMMBLOCK(DCTSIZE*0+(m),(n),(b),SIZEOF_DCTELEM)
 %define CORRECTION(m,n,b) XMMBLOCK(DCTSIZE*1+(m),(n),(b),SIZEOF_DCTELEM)
 %define SCALE(m,n,b)      XMMBLOCK(DCTSIZE*2+(m),(n),(b),SIZEOF_DCTELEM)
 %define coef_block	ebp+8		; JCOEFPTR coef_block
 %define divisors	ebp+12		; DCTELEM * divisors
 %define workspace	ebp+16		; DCTELEM * workspace
 	align	16
 	global	EXTN(jpeg_quantize_int_sse2)
 EXTN(jpeg_quantize_int_sse2):
 	push	ebp
 	mov	ebp,esp
 ;	push	ebx		; unused
 ;	push	ecx		; unused
 ;	push	edx		; need not be preserved
 	push	esi
 	push	edi
 	mov	esi, POINTER [workspace]
 	mov	edx, POINTER [divisors]
 	mov	edi, JCOEFPTR [coef_block]
 	mov	eax, DCTSIZE2/32
 	alignx	16,7
 .quantloop:
 	movdqa	xmm4, XMMWORD [XMMBLOCK(0,0,esi,SIZEOF_DCTELEM)]
 	movdqa	xmm5, XMMWORD [XMMBLOCK(1,0,esi,SIZEOF_DCTELEM)]
 	movdqa	xmm6, XMMWORD [XMMBLOCK(2,0,esi,SIZEOF_DCTELEM)]
 	movdqa	xmm7, XMMWORD [XMMBLOCK(3,0,esi,SIZEOF_DCTELEM)]
 	movdqa	xmm0,xmm4
 	movdqa	xmm1,xmm5
 	movdqa	xmm2,xmm6
 	movdqa	xmm3,xmm7
 	psraw	xmm4,(WORD_BIT-1)
 	psraw	xmm5,(WORD_BIT-1)
 	psraw	xmm6,(WORD_BIT-1)
 	psraw	xmm7,(WORD_BIT-1)
 	pxor	xmm0,xmm4
 	pxor	xmm1,xmm5
 	pxor	xmm2,xmm6
 	pxor	xmm3,xmm7
 	psubw	xmm0,xmm4		; if (xmm0 < 0) xmm0 = -xmm0;
 	psubw	xmm1,xmm5		; if (xmm1 < 0) xmm1 = -xmm1;
 	psubw	xmm2,xmm6		; if (xmm2 < 0) xmm2 = -xmm2;
 	psubw	xmm3,xmm7		; if (xmm3 < 0) xmm3 = -xmm3;
 	paddw	xmm0, XMMWORD [CORRECTION(0,0,edx)]  ; correction + roundfactor
 	paddw	xmm1, XMMWORD [CORRECTION(1,0,edx)]
 	paddw	xmm2, XMMWORD [CORRECTION(2,0,edx)]
 	paddw	xmm3, XMMWORD [CORRECTION(3,0,edx)]
 	psllw	xmm0,1
 	psllw	xmm1,1
 	psllw	xmm2,1
 	psllw	xmm3,1
 	pmulhuw	xmm0, XMMWORD [RECIPROCAL(0,0,edx)]  ; reciprocal
 	pmulhuw	xmm1, XMMWORD [RECIPROCAL(1,0,edx)]
 	pmulhuw	xmm2, XMMWORD [RECIPROCAL(2,0,edx)]
 	pmulhuw	xmm3, XMMWORD [RECIPROCAL(3,0,edx)]
 	psllw	xmm0,1
 	psllw	xmm1,1
 	psllw	xmm2,1
 	psllw	xmm3,1
 	pmulhuw	xmm0, XMMWORD [SCALE(0,0,edx)]	; scale
 	pmulhuw	xmm1, XMMWORD [SCALE(1,0,edx)]
 	pmulhuw	xmm2, XMMWORD [SCALE(2,0,edx)]
 	pmulhuw	xmm3, XMMWORD [SCALE(3,0,edx)]
 	pxor	xmm0,xmm4
 	pxor	xmm1,xmm5
 	pxor	xmm2,xmm6
 	pxor	xmm3,xmm7
 	psubw	xmm0,xmm4
 	psubw	xmm1,xmm5
 	psubw	xmm2,xmm6
 	psubw	xmm3,xmm7
 	movdqa	XMMWORD [XMMBLOCK(0,0,edi,SIZEOF_DCTELEM)], xmm0
 	movdqa	XMMWORD [XMMBLOCK(1,0,edi,SIZEOF_DCTELEM)], xmm1
 	movdqa	XMMWORD [XMMBLOCK(2,0,edi,SIZEOF_DCTELEM)], xmm2
 	movdqa	XMMWORD [XMMBLOCK(3,0,edi,SIZEOF_DCTELEM)], xmm3
 	add	esi, byte 32*SIZEOF_DCTELEM
 	add	edx, byte 32*SIZEOF_DCTELEM
 	add	edi, byte 32*SIZEOF_JCOEF
 	dec	eax
 	jnz	near .quantloop
 	pop	edi
 	pop	esi
 ;	pop	edx		; need not be preserved
 ;	pop	ecx		; unused
 ;	pop	ebx		; unused
 	pop	ebp
 	ret
 %endif ; !JFDCT_INT_QUANTIZE_WITH_DIVISION
 %endif ; JFDCT_INT_SSE2_SUPPORTED
--- a/jcqntsse.asm
+++ b/jcqntsse.asm
@@ -0,0 +1,218 @@
 ;
 ; jcqntsse.asm - sample data conversion and quantization (SSE & MMX)
 ;
 ; x86 SIMD extension for IJG JPEG library
 ; Copyright (C) 1999-2006, MIYASAKA Masaru.
 ; For conditions of distribution and use, see copyright notice in jsimdext.inc
 ;
 ; This file should be assembled with NASM (Netwide Assembler); it
 ; can *not* be assembled with Microsoft's MASM or any compatible
 ; assembler (including Borland's Turbo Assembler).
 ; NASM is available from http://nasm.sourceforge.net/ or
 ; http://sourceforge.net/project/showfiles.php?group_id=6208
 ;
 ; Last Modified : January 12, 2005
 ;
 ; [TAB8]
 %include "jsimdext.inc"
 %include "jdct.inc"
 %ifdef DCT_FLOAT_SUPPORTED
 %ifdef JFDCT_FLT_SSE_MMX_SUPPORTED
 ; This module is specialized to the case DCTSIZE = 8.
 ;
 %if DCTSIZE != 8
 %error "Sorry, this code only copes with 8x8 DCTs."
 %endif
 ; --------------------------------------------------------------------------
 	SECTION	SEG_TEXT
 	BITS	32
 ;
 ; Load data into workspace, applying unsigned->signed conversion
 ;
 ; GLOBAL(void)
 ; jpeg_convsamp_flt_sse (JSAMPARRAY sample_data, JDIMENSION start_col,
 ;                        FAST_FLOAT * workspace);
 ;
 %define sample_data	ebp+8		; JSAMPARRAY sample_data
 %define start_col	ebp+12		; JDIMENSION start_col
 %define workspace	ebp+16		; FAST_FLOAT * workspace
 	align	16
 	global	EXTN(jpeg_convsamp_flt_sse)
 EXTN(jpeg_convsamp_flt_sse):
 	push	ebp
 	mov	ebp,esp
 	push	ebx
 ;	push	ecx		; need not be preserved
 ;	push	edx		; need not be preserved
 	push	esi
 	push	edi
 	pcmpeqw  mm7,mm7
 	psllw    mm7,7
 	packsswb mm7,mm7		; mm7 = PB_CENTERJSAMPLE (0x808080..)
 	mov	esi, JSAMPARRAY [sample_data]	; (JSAMPROW *)
 	mov	eax, JDIMENSION [start_col]
 	mov	edi, POINTER [workspace]	; (FAST_FLOAT *)
 	mov	ecx, DCTSIZE/2
 	alignx	16,7
 .convloop:
 	mov	ebx, JSAMPROW [esi+0*SIZEOF_JSAMPROW]	; (JSAMPLE *)
 	mov	edx, JSAMPROW [esi+1*SIZEOF_JSAMPROW]	; (JSAMPLE *)
 	movq	mm0, MMWORD [ebx+eax*SIZEOF_JSAMPLE]
 	movq	mm1, MMWORD [edx+eax*SIZEOF_JSAMPLE]
 	psubb	mm0,mm7				; mm0=(01234567)
 	psubb	mm1,mm7				; mm1=(89ABCDEF)
 	punpcklbw mm2,mm0			; mm2=(*0*1*2*3)
 	punpckhbw mm0,mm0			; mm0=(*4*5*6*7)
 	punpcklbw mm3,mm1			; mm3=(*8*9*A*B)
 	punpckhbw mm1,mm1			; mm1=(*C*D*E*F)
 	punpcklwd mm4,mm2			; mm4=(***0***1)
 	punpckhwd mm2,mm2			; mm2=(***2***3)
 	punpcklwd mm5,mm0			; mm5=(***4***5)
 	punpckhwd mm0,mm0			; mm0=(***6***7)
 	psrad     mm4,(DWORD_BIT-BYTE_BIT)	; mm4=(01)
 	psrad     mm2,(DWORD_BIT-BYTE_BIT)	; mm2=(23)
 	cvtpi2ps  xmm0,mm4			; xmm0=(01**)
 	cvtpi2ps  xmm1,mm2			; xmm1=(23**)
 	psrad     mm5,(DWORD_BIT-BYTE_BIT)	; mm5=(45)
 	psrad     mm0,(DWORD_BIT-BYTE_BIT)	; mm0=(67)
 	cvtpi2ps  xmm2,mm5			; xmm2=(45**)
 	cvtpi2ps  xmm3,mm0			; xmm3=(67**)
 	punpcklwd mm6,mm3			; mm6=(***8***9)
 	punpckhwd mm3,mm3			; mm3=(***A***B)
 	punpcklwd mm4,mm1			; mm4=(***C***D)
 	punpckhwd mm1,mm1			; mm1=(***E***F)
 	psrad     mm6,(DWORD_BIT-BYTE_BIT)	; mm6=(89)
 	psrad     mm3,(DWORD_BIT-BYTE_BIT)	; mm3=(AB)
 	cvtpi2ps  xmm4,mm6			; xmm4=(89**)
 	cvtpi2ps  xmm5,mm3			; xmm5=(AB**)
 	psrad     mm4,(DWORD_BIT-BYTE_BIT)	; mm4=(CD)
 	psrad     mm1,(DWORD_BIT-BYTE_BIT)	; mm1=(EF)
 	cvtpi2ps  xmm6,mm4			; xmm6=(CD**)
 	cvtpi2ps  xmm7,mm1			; xmm7=(EF**)
 	movlhps   xmm0,xmm1			; xmm0=(0123)
 	movlhps   xmm2,xmm3			; xmm2=(4567)
 	movlhps   xmm4,xmm5			; xmm4=(89AB)
 	movlhps   xmm6,xmm7			; xmm6=(CDEF)
 	movaps	XMMWORD [XMMBLOCK(0,0,edi,SIZEOF_FAST_FLOAT)], xmm0
 	movaps	XMMWORD [XMMBLOCK(0,1,edi,SIZEOF_FAST_FLOAT)], xmm2
 	movaps	XMMWORD [XMMBLOCK(1,0,edi,SIZEOF_FAST_FLOAT)], xmm4
 	movaps	XMMWORD [XMMBLOCK(1,1,edi,SIZEOF_FAST_FLOAT)], xmm6
 	add	esi, byte 2*SIZEOF_JSAMPROW
 	add	edi, byte 2*DCTSIZE*SIZEOF_FAST_FLOAT
 	dec	ecx
 	jnz	near .convloop
 	emms		; empty MMX state
 	pop	edi
 	pop	esi
 ;	pop	edx		; need not be preserved
 ;	pop	ecx		; need not be preserved
 	pop	ebx
 	pop	ebp
 	ret
 ; --------------------------------------------------------------------------
 ;
 ; Quantize/descale the coefficients, and store into coef_block
 ;
 ; GLOBAL(void)
 ; jpeg_quantize_flt_sse (JCOEFPTR coef_block, FAST_FLOAT * divisors,
 ;                        FAST_FLOAT * workspace);
 ;
 %define coef_block	ebp+8		; JCOEFPTR coef_block
 %define divisors	ebp+12		; FAST_FLOAT * divisors
 %define workspace	ebp+16		; FAST_FLOAT * workspace
 	align	16
 	global	EXTN(jpeg_quantize_flt_sse)
 EXTN(jpeg_quantize_flt_sse):
 	push	ebp
 	mov	ebp,esp
 ;	push	ebx		; unused
 ;	push	ecx		; unused
 ;	push	edx		; need not be preserved
 	push	esi
 	push	edi
 	mov	esi, POINTER [workspace]
 	mov	edx, POINTER [divisors]
 	mov	edi, JCOEFPTR [coef_block]
 	mov	eax, DCTSIZE2/16
 	alignx	16,7
 .quantloop:
 	movaps	xmm0, XMMWORD [XMMBLOCK(0,0,esi,SIZEOF_FAST_FLOAT)]
 	movaps	xmm1, XMMWORD [XMMBLOCK(0,1,esi,SIZEOF_FAST_FLOAT)]
 	mulps	xmm0, XMMWORD [XMMBLOCK(0,0,edx,SIZEOF_FAST_FLOAT)]
 	mulps	xmm1, XMMWORD [XMMBLOCK(0,1,edx,SIZEOF_FAST_FLOAT)]
 	movaps	xmm2, XMMWORD [XMMBLOCK(1,0,esi,SIZEOF_FAST_FLOAT)]
 	movaps	xmm3, XMMWORD [XMMBLOCK(1,1,esi,SIZEOF_FAST_FLOAT)]
 	mulps	xmm2, XMMWORD [XMMBLOCK(1,0,edx,SIZEOF_FAST_FLOAT)]
 	mulps	xmm3, XMMWORD [XMMBLOCK(1,1,edx,SIZEOF_FAST_FLOAT)]
 	movhlps  xmm4,xmm0
 	movhlps  xmm5,xmm1
 	cvtps2pi mm0,xmm0
 	cvtps2pi mm1,xmm1
 	cvtps2pi mm4,xmm4
 	cvtps2pi mm5,xmm5
 	movhlps  xmm6,xmm2
 	movhlps  xmm7,xmm3
 	cvtps2pi mm2,xmm2
 	cvtps2pi mm3,xmm3
 	cvtps2pi mm6,xmm6
 	cvtps2pi mm7,xmm7
 	packssdw mm0,mm4
 	packssdw mm1,mm5
 	packssdw mm2,mm6
 	packssdw mm3,mm7
 	movq	MMWORD [MMBLOCK(0,0,edi,SIZEOF_JCOEF)], mm0
 	movq	MMWORD [MMBLOCK(0,1,edi,SIZEOF_JCOEF)], mm1
 	movq	MMWORD [MMBLOCK(1,0,edi,SIZEOF_JCOEF)], mm2
 	movq	MMWORD [MMBLOCK(1,1,edi,SIZEOF_JCOEF)], mm3
 	add	esi, byte 16*SIZEOF_FAST_FLOAT
 	add	edx, byte 16*SIZEOF_FAST_FLOAT
 	add	edi, byte 16*SIZEOF_JCOEF
 	dec	eax
 	jnz	short .quantloop
 	emms		; empty MMX state
 	pop	edi
 	pop	esi
 ;	pop	edx		; need not be preserved
 ;	pop	ecx		; unused
 ;	pop	ebx		; unused
 	pop	ebp
 	ret
 %endif ; JFDCT_FLT_SSE_MMX_SUPPORTED
 %endif ; DCT_FLOAT_SUPPORTED
--- a/jcsammmx.asm
+++ b/jcsammmx.asm
@@ -0,0 +1,328 @@
 ;
 ; jcsammmx.asm - downsampling (MMX)
 ;
 ; x86 SIMD extension for IJG JPEG library
 ; Copyright (C) 1999-2006, MIYASAKA Masaru.
 ; For conditions of distribution and use, see copyright notice in jsimdext.inc
 ;
 ; This file should be assembled with NASM (Netwide Assembler); it
 ; can *not* be assembled with Microsoft's MASM or any compatible
 ; assembler (including Borland's Turbo Assembler).
 ; NASM is available from http://nasm.sourceforge.net/ or
 ; http://sourceforge.net/project/showfiles.php?group_id=6208
 ;
 ; Last Modified : January 23, 2006
 ;
 ; [TAB8]
 %include "jsimdext.inc"
 %include "jcolsamp.inc"
 %ifdef JCSAMPLE_MMX_SUPPORTED
 ; --------------------------------------------------------------------------
 	SECTION	SEG_TEXT
 	BITS	32
 ;
 ; Downsample pixel values of a single component.
 ; This version handles the common case of 2:1 horizontal and 1:1 vertical,
 ; without smoothing.
 ;
 ; GLOBAL(void)
 ; jpeg_h2v1_downsample_mmx (j_compress_ptr cinfo,
 ;                           jpeg_component_info * compptr,
 ;                           JSAMPARRAY input_data, JSAMPARRAY output_data);
 ;
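 ; A scalar sketch of one output row (illustrative only; col and bias are
 ; hypothetical names).  The alternating bias corresponds to the pattern
 ; {0, 1, 0, 1} held in mm7:
 ;
 ;   bias = 0;
 ;   for (col = 0; col < output_cols; col++) {
 ;     outptr[col] = (JSAMPLE)
 ;       ((inptr[2*col] + inptr[2*col+1] + bias) >> 1);
 ;     bias ^= 1;                  /* 0, 1, 0, 1, ... */
 ;   }
 ;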
 %define cinfo(b)	(b)+8		; j_compress_ptr cinfo
 %define compptr(b)	(b)+12		; jpeg_component_info * compptr
 %define input_data(b)	(b)+16		; JSAMPARRAY input_data
 %define output_data(b)	(b)+20		; JSAMPARRAY output_data
 	align	16
 	global	EXTN(jpeg_h2v1_downsample_mmx)
 EXTN(jpeg_h2v1_downsample_mmx):
 	push	ebp
 	mov	ebp,esp
 ;	push	ebx		; unused
 ;	push	ecx		; need not be preserved
 ;	push	edx		; need not be preserved
 	push	esi
 	push	edi
 	mov	ecx, POINTER [compptr(ebp)]
 	mov	ecx, JDIMENSION [jcompinfo_width_in_blocks(ecx)]
 	shl	ecx,3			; imul ecx,DCTSIZE (ecx = output_cols)
 	jz	near .return
 	mov	edx, POINTER [cinfo(ebp)]
 	mov	edx, JDIMENSION [jcstruct_image_width(edx)]
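 	; -- expand_right_edge
 	; (pad each input row on the right by repeating its last sample,
 	;  so that the loops below can always read whole groups of columns)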
 	push	ecx
 	shl	ecx,1				; output_cols * 2
 	sub	ecx,edx
 	jle	short .expand_end
 	mov	eax, POINTER [cinfo(ebp)]
 	mov	eax, INT [jcstruct_max_v_samp_factor(eax)]
 	test	eax,eax
 	jle	short .expand_end
 	cld
 	mov	esi, JSAMPARRAY [input_data(ebp)]	; input_data
 	alignx	16,7
 .expandloop:
 	push	eax
 	push	ecx
 	mov	edi, JSAMPROW [esi]
 	add	edi,edx
 	mov	al, JSAMPLE [edi-1]
 	rep stosb
 	pop	ecx
 	pop	eax
 	add	esi, byte SIZEOF_JSAMPROW
 	dec	eax
 	jg	short .expandloop
 .expand_end:
 	pop	ecx				; output_cols
 	; -- h2v1_downsample
 	mov	eax, POINTER [compptr(ebp)]
 	mov	eax, JDIMENSION [jcompinfo_v_samp_factor(eax)]	; rowctr
 	test	eax,eax
 	jle	short .return
 	mov       edx, 0x00010000	; bias pattern
 	movd      mm7,edx
 	pcmpeqw   mm6,mm6
 	punpckldq mm7,mm7		; mm7={0, 1, 0, 1}
 	psrlw     mm6,BYTE_BIT		; mm6={0xFF 0x00 0xFF 0x00 ..}
 	mov	esi, JSAMPARRAY [input_data(ebp)]	; input_data
 	mov	edi, JSAMPARRAY [output_data(ebp)]	; output_data
 	alignx	16,7
 .rowloop:
 	push	ecx
 	push	edi
 	push	esi
 	mov	esi, JSAMPROW [esi]		; inptr
 	mov	edi, JSAMPROW [edi]		; outptr
 	alignx	16,7
 .columnloop:
 	movq	mm0, MMWORD [esi+0*SIZEOF_MMWORD]
 	movq	mm1, MMWORD [esi+1*SIZEOF_MMWORD]
 	movq	mm2,mm0
 	movq	mm3,mm1
 	pand	mm0,mm6
 	psrlw	mm2,BYTE_BIT
 	pand	mm1,mm6
 	psrlw	mm3,BYTE_BIT
 	paddw	mm0,mm2
 	paddw	mm1,mm3
 	paddw	mm0,mm7
 	paddw	mm1,mm7
 	psrlw	mm0,1
 	psrlw	mm1,1
 	packuswb mm0,mm1
 	movq	MMWORD [edi+0*SIZEOF_MMWORD], mm0
 	add	esi, byte 2*SIZEOF_MMWORD	; inptr
 	add	edi, byte 1*SIZEOF_MMWORD	; outptr
 	sub	ecx, byte SIZEOF_MMWORD		; outcol
 	jnz	short .columnloop
 	pop	esi
 	pop	edi
 	pop	ecx
 	add	esi, byte SIZEOF_JSAMPROW	; input_data
 	add	edi, byte SIZEOF_JSAMPROW	; output_data
 	dec	eax				; rowctr
 	jg	short .rowloop
 	emms		; empty MMX state
 .return:
 	pop	edi
 	pop	esi
 ;	pop	edx		; need not be preserved
 ;	pop	ecx		; need not be preserved
 ;	pop	ebx		; unused
 	pop	ebp
 	ret
 ; --------------------------------------------------------------------------
 ;
 ; Downsample pixel values of a single component.
 ; This version handles the standard case of 2:1 horizontal and 2:1 vertical,
 ; without smoothing.
 ;
 ; GLOBAL(void)
 ; jpeg_h2v2_downsample_mmx (j_compress_ptr cinfo,
 ;                           jpeg_component_info * compptr,
 ;                           JSAMPARRAY input_data, JSAMPARRAY output_data);
 ;
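 ; A scalar sketch of one output row (illustrative only; col and bias are
 ; hypothetical names, inptr0/inptr1 are the two input rows).  The
 ; alternating bias corresponds to the pattern {1, 2, 1, 2} held in mm7:
 ;
 ;   bias = 1;
 ;   for (col = 0; col < output_cols; col++) {
 ;     outptr[col] = (JSAMPLE)
 ;       ((inptr0[2*col] + inptr0[2*col+1] +
 ;         inptr1[2*col] + inptr1[2*col+1] + bias) >> 2);
 ;     bias ^= 3;                  /* 1, 2, 1, 2, ... */
 ;   }
 ;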
 %define cinfo(b)	(b)+8		; j_compress_ptr cinfo
 %define compptr(b)	(b)+12		; jpeg_component_info * compptr
 %define input_data(b)	(b)+16		; JSAMPARRAY input_data
 %define output_data(b)	(b)+20		; JSAMPARRAY output_data
 	align	16
 	global	EXTN(jpeg_h2v2_downsample_mmx)
 EXTN(jpeg_h2v2_downsample_mmx):
 	push	ebp
 	mov	ebp,esp
 ;	push	ebx		; unused
 ;	push	ecx		; need not be preserved
 ;	push	edx		; need not be preserved
 	push	esi
 	push	edi
 	mov	ecx, POINTER [compptr(ebp)]
 	mov	ecx, JDIMENSION [jcompinfo_width_in_blocks(ecx)]
 	shl	ecx,3			; imul ecx,DCTSIZE (ecx = output_cols)
 	jz	near .return
 	mov	edx, POINTER [cinfo(ebp)]
 	mov	edx, JDIMENSION [jcstruct_image_width(edx)]
 	; -- expand_right_edge
 	push	ecx
 	shl	ecx,1				; output_cols * 2
 	sub	ecx,edx
 	jle	short .expand_end
 	mov	eax, POINTER [cinfo(ebp)]
 	mov	eax, INT [jcstruct_max_v_samp_factor(eax)]
 	test	eax,eax
 	jle	short .expand_end
 	cld
 	mov	esi, JSAMPARRAY [input_data(ebp)]	; input_data
 	alignx	16,7
 .expandloop:
 	push	eax
 	push	ecx
 	mov	edi, JSAMPROW [esi]
 	add	edi,edx
 	mov	al, JSAMPLE [edi-1]
 	rep stosb
 	pop	ecx
 	pop	eax
 	add	esi, byte SIZEOF_JSAMPROW
 	dec	eax
 	jg	short .expandloop
 .expand_end:
 	pop	ecx				; output_cols
 	; -- h2v2_downsample
 	mov	eax, POINTER [compptr(ebp)]
 	mov	eax, JDIMENSION [jcompinfo_v_samp_factor(eax)]	; rowctr
 	test	eax,eax
 	jle	near .return
 	mov       edx, 0x00020001	; bias pattern
 	movd      mm7,edx
 	pcmpeqw   mm6,mm6
 	punpckldq mm7,mm7		; mm7={1, 2, 1, 2}
 	psrlw     mm6,BYTE_BIT		; mm6={0xFF 0x00 0xFF 0x00 ..}
 	mov	esi, JSAMPARRAY [input_data(ebp)]	; input_data
 	mov	edi, JSAMPARRAY [output_data(ebp)]	; output_data
 	alignx	16,7
 .rowloop:
 	push	ecx
 	push	edi
 	push	esi
 	mov	edx, JSAMPROW [esi+0*SIZEOF_JSAMPROW]	; inptr0
 	mov	esi, JSAMPROW [esi+1*SIZEOF_JSAMPROW]	; inptr1
 	mov	edi, JSAMPROW [edi]			; outptr
 	alignx	16,7
 .columnloop:
 	movq	mm0, MMWORD [edx+0*SIZEOF_MMWORD]
 	movq	mm1, MMWORD [esi+0*SIZEOF_MMWORD]
 	movq	mm2, MMWORD [edx+1*SIZEOF_MMWORD]
 	movq	mm3, MMWORD [esi+1*SIZEOF_MMWORD]
 	movq	mm4,mm0
 	movq	mm5,mm1
 	pand	mm0,mm6
 	psrlw	mm4,BYTE_BIT
 	pand	mm1,mm6
 	psrlw	mm5,BYTE_BIT
 	paddw	mm0,mm4
 	paddw	mm1,mm5
 	movq	mm4,mm2
 	movq	mm5,mm3
 	pand	mm2,mm6
 	psrlw	mm4,BYTE_BIT
 	pand	mm3,mm6
 	psrlw	mm5,BYTE_BIT
 	paddw	mm2,mm4
 	paddw	mm3,mm5
 	paddw	mm0,mm1
 	paddw	mm2,mm3
 	paddw	mm0,mm7
 	paddw	mm2,mm7
 	psrlw	mm0,2
 	psrlw	mm2,2
 	packuswb mm0,mm2
 	movq	MMWORD [edi+0*SIZEOF_MMWORD], mm0
 	add	edx, byte 2*SIZEOF_MMWORD	; inptr0
 	add	esi, byte 2*SIZEOF_MMWORD	; inptr1
 	add	edi, byte 1*SIZEOF_MMWORD	; outptr
 	sub	ecx, byte SIZEOF_MMWORD		; outcol
 	jnz	near .columnloop
 	pop	esi
 	pop	edi
 	pop	ecx
 	add	esi, byte 2*SIZEOF_JSAMPROW	; input_data
 	add	edi, byte 1*SIZEOF_JSAMPROW	; output_data
 	dec	eax				; rowctr
 	jg	near .rowloop
 	emms		; empty MMX state
 .return:
 	pop	edi
 	pop	esi
 ;	pop	edx		; need not be preserved
 ;	pop	ecx		; need not be preserved
 ;	pop	ebx		; unused
 	pop	ebp
 	ret
 %endif ; JCSAMPLE_MMX_SUPPORTED
--- a/jcsample.c
+++ b/jcsample.c
@@ -5,6 +5,13 @@
 * This file is part of the Independent JPEG Group's software.
 * For conditions of distribution and use, see the accompanying README file.
 *
 * ---------------------------------------------------------------------
 * x86 SIMD extension for IJG JPEG library
 * Copyright (C) 1999-2006, MIYASAKA Masaru.
 * This file has been modified for SIMD extension.
 * Last Modified : January 5, 2006
 * ---------------------------------------------------------------------
 *
 * This file contains downsampling routines.
 *
 * Downsampling input data is counted in "row groups".  A row group
@@ -48,6 +55,7 @@
 #define JPEG_INTERNALS
 #include "jinclude.h"
 #include "jpeglib.h"
 #include "jcolsamp.h"		/* Private declarations */
 /* Pointer to routine to downsample a single component */
@@ -467,6 +475,7 @@ jinit_downsampler (j_compress_ptr cinfo)
  int ci;
  jpeg_component_info * compptr;
  boolean smoothok = TRUE;
  unsigned int simd = jpeg_simd_support((j_common_ptr) cinfo);
  downsample = (my_downsample_ptr)
    (*cinfo->mem->alloc_small) ((j_common_ptr) cinfo, JPOOL_IMAGE,
@@ -494,6 +503,16 @@ jinit_downsampler (j_compress_ptr cinfo)
    } else if (compptr->h_samp_factor * 2 == cinfo->max_h_samp_factor &&
 	       compptr->v_samp_factor == cinfo->max_v_samp_factor) {
      smoothok = FALSE;
 #ifdef JCSAMPLE_SSE2_SUPPORTED
      if (simd & JSIMD_SSE2)
 	downsample->methods[ci] = jpeg_h2v1_downsample_sse2;
      else
 #endif
 #ifdef JCSAMPLE_MMX_SUPPORTED
      if (simd & JSIMD_MMX)
 	downsample->methods[ci] = jpeg_h2v1_downsample_mmx;
      else
 #endif
 	downsample->methods[ci] = h2v1_downsample;
    } else if (compptr->h_samp_factor * 2 == cinfo->max_h_samp_factor &&
 	       compptr->v_samp_factor * 2 == cinfo->max_v_samp_factor) {
@@ -502,6 +521,16 @@ jinit_downsampler (j_compress_ptr cinfo)
 	downsample->methods[ci] = h2v2_smooth_downsample;
 	downsample->pub.need_context_rows = TRUE;
      } else
 #endif
 #ifdef JCSAMPLE_SSE2_SUPPORTED
      if (simd & JSIMD_SSE2)
 	downsample->methods[ci] = jpeg_h2v2_downsample_sse2;
      else
 #endif
 #ifdef JCSAMPLE_MMX_SUPPORTED
      if (simd & JSIMD_MMX)
 	downsample->methods[ci] = jpeg_h2v2_downsample_mmx;
      else
 #endif
 	downsample->methods[ci] = h2v2_downsample;
    } else if ((cinfo->max_h_samp_factor % compptr->h_samp_factor) == 0 &&
@@ -517,3 +546,25 @@ jinit_downsampler (j_compress_ptr cinfo)
    TRACEMS(cinfo, 0, JTRC_SMOOTH_NOTIMPL);
 #endif
 }
 #ifndef JSIMD_MODEINFO_NOT_SUPPORTED
 GLOBAL(unsigned int)
 jpeg_simd_downsampler (j_compress_ptr cinfo)
 {
  unsigned int simd = jpeg_simd_support((j_common_ptr) cinfo);
 #ifdef JCSAMPLE_SSE2_SUPPORTED
  if (simd & JSIMD_SSE2)
    return JSIMD_SSE2;
 #endif
 #ifdef JCSAMPLE_MMX_SUPPORTED
  if (simd & JSIMD_MMX)
    return JSIMD_MMX;
 #endif
  return JSIMD_NONE;
 }
 #endif /* !JSIMD_MODEINFO_NOT_SUPPORTED */
--- a/jcsamss2.asm
+++ b/jcsamss2.asm
@@ -0,0 +1,355 @@
 ;
 ; jcsamss2.asm - downsampling (SSE2)
 ;
 ; x86 SIMD extension for IJG JPEG library
 ; Copyright (C) 1999-2006, MIYASAKA Masaru.
 ; For conditions of distribution and use, see copyright notice in jsimdext.inc
 ;
 ; This file should be assembled with NASM (Netwide Assembler); it
 ; can *not* be assembled with Microsoft's MASM or any compatible
 ; assembler (including Borland's Turbo Assembler).
 ; NASM is available from http://nasm.sourceforge.net/ or
 ; http://sourceforge.net/project/showfiles.php?group_id=6208
 ;
 ; Last Modified : January 23, 2006
 ;
 ; [TAB8]
 %include "jsimdext.inc"
 %include "jcolsamp.inc"
 %ifdef JCSAMPLE_SSE2_SUPPORTED
 ; --------------------------------------------------------------------------
 	SECTION	SEG_TEXT
 	BITS	32
 ;
 ; Downsample pixel values of a single component.
 ; This version handles the common case of 2:1 horizontal and 1:1 vertical,
 ; without smoothing.
 ;
 ; GLOBAL(void)
 ; jpeg_h2v1_downsample_sse2 (j_compress_ptr cinfo,
 ;                            jpeg_component_info * compptr,
 ;                            JSAMPARRAY input_data, JSAMPARRAY output_data);
 ;
 %define cinfo(b)	(b)+8		; j_compress_ptr cinfo
 %define compptr(b)	(b)+12		; jpeg_component_info * compptr
 %define input_data(b)	(b)+16		; JSAMPARRAY input_data
 %define output_data(b)	(b)+20		; JSAMPARRAY output_data
 	align	16
 	global	EXTN(jpeg_h2v1_downsample_sse2)
 EXTN(jpeg_h2v1_downsample_sse2):
 	push	ebp
 	mov	ebp,esp
 ;	push	ebx		; unused
 ;	push	ecx		; need not be preserved
 ;	push	edx		; need not be preserved
 	push	esi
 	push	edi
 	mov	ecx, POINTER [compptr(ebp)]
 	mov	ecx, JDIMENSION [jcompinfo_width_in_blocks(ecx)]
 	shl	ecx,3			; imul ecx,DCTSIZE (ecx = output_cols)
 	jz	near .return
 	mov	edx, POINTER [cinfo(ebp)]
 	mov	edx, JDIMENSION [jcstruct_image_width(edx)]
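 	; -- expand_right_edge
 	; (pad each input row on the right by repeating its last sample,
 	;  so that the loops below can always read whole groups of columns)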
 	push	ecx
 	shl	ecx,1				; output_cols * 2
 	sub	ecx,edx
 	jle	short .expand_end
 	mov	eax, POINTER [cinfo(ebp)]
 	mov	eax, INT [jcstruct_max_v_samp_factor(eax)]
 	test	eax,eax
 	jle	short .expand_end
 	cld
 	mov	esi, JSAMPARRAY [input_data(ebp)]	; input_data
 	alignx	16,7
 .expandloop:
 	push	eax
 	push	ecx
 	mov	edi, JSAMPROW [esi]
 	add	edi,edx
 	mov	al, JSAMPLE [edi-1]
 	rep stosb
 	pop	ecx
 	pop	eax
 	add	esi, byte SIZEOF_JSAMPROW
 	dec	eax
 	jg	short .expandloop
 .expand_end:
 	pop	ecx				; output_cols
 	; -- h2v1_downsample
 	mov	eax, POINTER [compptr(ebp)]
 	mov	eax, JDIMENSION [jcompinfo_v_samp_factor(eax)]	; rowctr
 	test	eax,eax
 	jle	near .return
 	mov	edx, 0x00010000		; bias pattern
 	movd	xmm7,edx
 	pcmpeqw	xmm6,xmm6
 	pshufd	xmm7,xmm7,0x00		; xmm7={0, 1, 0, 1, 0, 1, 0, 1}
 	psrlw	xmm6,BYTE_BIT		; xmm6={0xFF 0x00 0xFF 0x00 ..}
 	mov	esi, JSAMPARRAY [input_data(ebp)]	; input_data
 	mov	edi, JSAMPARRAY [output_data(ebp)]	; output_data
 	alignx	16,7
 .rowloop:
 	push	ecx
 	push	edi
 	push	esi
 	mov	esi, JSAMPROW [esi]		; inptr
 	mov	edi, JSAMPROW [edi]		; outptr
 	cmp	ecx, byte SIZEOF_XMMWORD
 	jae	short .columnloop
 	alignx	16,7
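 	; Tail case: fewer than SIZEOF_XMMWORD output columns remain, so load
 	; one XMMWORD of input and treat the missing second one as zero.
 .columnloop_r8: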
 	movdqa	xmm0, XMMWORD [esi+0*SIZEOF_XMMWORD]
 	pxor	xmm1,xmm1
 	mov	ecx, SIZEOF_XMMWORD
 	jmp	short .downsample
 	alignx	16,7
 .columnloop:
 	movdqa	xmm0, XMMWORD [esi+0*SIZEOF_XMMWORD]
 	movdqa	xmm1, XMMWORD [esi+1*SIZEOF_XMMWORD]
 .downsample:
 	movdqa	xmm2,xmm0
 	movdqa	xmm3,xmm1
 	pand	xmm0,xmm6
 	psrlw	xmm2,BYTE_BIT
 	pand	xmm1,xmm6
 	psrlw	xmm3,BYTE_BIT
 	paddw	xmm0,xmm2
 	paddw	xmm1,xmm3
 	paddw	xmm0,xmm7
 	paddw	xmm1,xmm7
 	psrlw	xmm0,1
 	psrlw	xmm1,1
 	packuswb xmm0,xmm1
 	movdqa	XMMWORD [edi+0*SIZEOF_XMMWORD], xmm0
 	sub	ecx, byte SIZEOF_XMMWORD	; outcol
 	add	esi, byte 2*SIZEOF_XMMWORD	; inptr
 	add	edi, byte 1*SIZEOF_XMMWORD	; outptr
 	cmp	ecx, byte SIZEOF_XMMWORD
 	jae	short .columnloop
 	test	ecx,ecx
 	jnz	short .columnloop_r8
 	pop	esi
 	pop	edi
 	pop	ecx
 	add	esi, byte SIZEOF_JSAMPROW	; input_data
 	add	edi, byte SIZEOF_JSAMPROW	; output_data
 	dec	eax				; rowctr
 	jg	near .rowloop
 .return:
 	pop	edi
 	pop	esi
 ;	pop	edx		; need not be preserved
 ;	pop	ecx		; need not be preserved
 ;	pop	ebx		; unused
 	pop	ebp
 	ret
 ; --------------------------------------------------------------------------
 ;
 ; Downsample pixel values of a single component.
 ; This version handles the standard case of 2:1 horizontal and 2:1 vertical,
 ; without smoothing.
 ;
 ; GLOBAL(void)
 ; jpeg_h2v2_downsample_sse2 (j_compress_ptr cinfo,
 ;                            jpeg_component_info * compptr,
 ;                            JSAMPARRAY input_data, JSAMPARRAY output_data);
 ;
 %define cinfo(b)	(b)+8		; j_compress_ptr cinfo
 %define compptr(b)	(b)+12		; jpeg_component_info * compptr
 %define input_data(b)	(b)+16		; JSAMPARRAY input_data
 %define output_data(b)	(b)+20		; JSAMPARRAY output_data
 	align	16
 	global	EXTN(jpeg_h2v2_downsample_sse2)
 EXTN(jpeg_h2v2_downsample_sse2):
 	push	ebp
 	mov	ebp,esp
 ;	push	ebx		; unused
 ;	push	ecx		; need not be preserved
 ;	push	edx		; need not be preserved
 	push	esi
 	push	edi
 	mov	ecx, POINTER [compptr(ebp)]
 	mov	ecx, JDIMENSION [jcompinfo_width_in_blocks(ecx)]
 	shl	ecx,3			; imul ecx,DCTSIZE (ecx = output_cols)
 	jz	near .return
 	mov	edx, POINTER [cinfo(ebp)]
 	mov	edx, JDIMENSION [jcstruct_image_width(edx)]
 	; -- expand_right_edge
 	push	ecx
 	shl	ecx,1				; output_cols * 2
 	sub	ecx,edx
 	jle	short .expand_end
 	mov	eax, POINTER [cinfo(ebp)]
 	mov	eax, INT [jcstruct_max_v_samp_factor(eax)]
 	test	eax,eax
 	jle	short .expand_end
 	cld
 	mov	esi, JSAMPARRAY [input_data(ebp)]	; input_data
 	alignx	16,7
 .expandloop:
 	push	eax
 	push	ecx
 	mov	edi, JSAMPROW [esi]
 	add	edi,edx
 	mov	al, JSAMPLE [edi-1]
 	rep stosb
 	pop	ecx
 	pop	eax
 	add	esi, byte SIZEOF_JSAMPROW
 	dec	eax
 	jg	short .expandloop
 .expand_end:
 	pop	ecx				; output_cols
 	; -- h2v2_downsample
 	mov	eax, POINTER [compptr(ebp)]
 	mov	eax, JDIMENSION [jcompinfo_v_samp_factor(eax)]	; rowctr
 	test	eax,eax
 	jle	near .return
 	mov	edx, 0x00020001		; bias pattern
 	movd	xmm7,edx
 	pcmpeqw	xmm6,xmm6
 	pshufd	xmm7,xmm7,0x00		; xmm7={1, 2, 1, 2, 1, 2, 1, 2}
 	psrlw	xmm6,BYTE_BIT		; xmm6={0xFF 0x00 0xFF 0x00 ..}
 	mov	esi, JSAMPARRAY [input_data(ebp)]	; input_data
 	mov	edi, JSAMPARRAY [output_data(ebp)]	; output_data
 	alignx	16,7
 .rowloop:
 	push	ecx
 	push	edi
 	push	esi
 	mov	edx, JSAMPROW [esi+0*SIZEOF_JSAMPROW]	; inptr0
 	mov	esi, JSAMPROW [esi+1*SIZEOF_JSAMPROW]	; inptr1
 	mov	edi, JSAMPROW [edi]			; outptr
 	cmp	ecx, byte SIZEOF_XMMWORD
 	jae	short .columnloop
 	alignx	16,7
 .columnloop_r8:
 	movdqa	xmm0, XMMWORD [edx+0*SIZEOF_XMMWORD]
 	movdqa	xmm1, XMMWORD [esi+0*SIZEOF_XMMWORD]
 	pxor	xmm2,xmm2
 	pxor	xmm3,xmm3
 	mov	ecx, SIZEOF_XMMWORD
 	jmp	short .downsample
 	alignx	16,7
 .columnloop:
 	movdqa	xmm0, XMMWORD [edx+0*SIZEOF_XMMWORD]
 	movdqa	xmm1, XMMWORD [esi+0*SIZEOF_XMMWORD]
 	movdqa	xmm2, XMMWORD [edx+1*SIZEOF_XMMWORD]
 	movdqa	xmm3, XMMWORD [esi+1*SIZEOF_XMMWORD]
 .downsample:
 	movdqa	xmm4,xmm0
 	movdqa	xmm5,xmm1
 	pand	xmm0,xmm6
 	psrlw	xmm4,BYTE_BIT
 	pand	xmm1,xmm6
 	psrlw	xmm5,BYTE_BIT
 	paddw	xmm0,xmm4
 	paddw	xmm1,xmm5
 	movdqa	xmm4,xmm2
 	movdqa	xmm5,xmm3
 	pand	xmm2,xmm6
 	psrlw	xmm4,BYTE_BIT
 	pand	xmm3,xmm6
 	psrlw	xmm5,BYTE_BIT
 	paddw	xmm2,xmm4
 	paddw	xmm3,xmm5
 	paddw	xmm0,xmm1
 	paddw	xmm2,xmm3
 	paddw	xmm0,xmm7
 	paddw	xmm2,xmm7
 	psrlw	xmm0,2
 	psrlw	xmm2,2
 	packuswb xmm0,xmm2
 	movdqa	XMMWORD [edi+0*SIZEOF_XMMWORD], xmm0
 	sub	ecx, byte SIZEOF_XMMWORD	; outcol
 	add	edx, byte 2*SIZEOF_XMMWORD	; inptr0
 	add	esi, byte 2*SIZEOF_XMMWORD	; inptr1
 	add	edi, byte 1*SIZEOF_XMMWORD	; outptr
 	cmp	ecx, byte SIZEOF_XMMWORD
 	jae	near .columnloop
 	test	ecx,ecx
 	jnz	near .columnloop_r8
 	pop	esi
 	pop	edi
 	pop	ecx
 	add	esi, byte 2*SIZEOF_JSAMPROW	; input_data
 	add	edi, byte 1*SIZEOF_JSAMPROW	; output_data
 	dec	eax				; rowctr
 	jg	near .rowloop
 .return:
 	pop	edi
 	pop	esi
 ;	pop	edx		; need not be preserved
 ;	pop	ecx		; need not be preserved
 ;	pop	ebx		; unused
 	pop	ebp
 	ret
 %endif ; JCSAMPLE_SSE2_SUPPORTED
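Note on jpeg_h2v2_downsample_sse2: the following is a scalar C sketch of what each output sample works out to, for readers comparing against the assembly. The name h2v2_downsample_row and its signature are hypothetical and chosen only for illustration. The arithmetic itself is taken from the routine above: each output sample is the sum of a 2x2 input block plus a bias that alternates 1, 2, 1, 2 (the xmm7 pattern), shifted right by 2. Right-edge expansion and the 16-byte alignment requirements of the SIMD version are deliberately left out.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper, not part of the library: downsample one pair of
 * input rows (in0, in1) into output_cols output samples.  Each input row
 * must hold at least 2*output_cols samples.
 */
void
h2v2_downsample_row (const unsigned char *in0, const unsigned char *in1,
                     unsigned char *out, size_t output_cols)
{
  size_t col;
  int bias = 1;                 /* alternates 1, 2, 1, 2, ... */

  assert(in0 != NULL && in1 != NULL && out != NULL);
  for (col = 0; col < output_cols; col++) {
    int sum = in0[2 * col] + in0[2 * col + 1] +
              in1[2 * col] + in1[2 * col + 1];
    out[col] = (unsigned char) ((sum + bias) >> 2);
    bias ^= 3;                  /* 1 => 2 => 1 => ... */
  }
}
```

The alternating bias keeps the rounding from drifting consistently upward or downward across a row, which a fixed bias of 2 would do.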
--- a/jctrans.c
+++ b/jctrans.c
@@ -1,7 +1,7 @@
 /*
 * jctrans.c
 *
- * Copyright (C) 1995-1996, Thomas G. Lane.
+ * Copyright (C) 1995-1998, Thomas G. Lane.
 * This file is part of the Independent JPEG Group's software.
 * For conditions of distribution and use, see the accompanying README file.
 *
@@ -129,6 +129,23 @@ jpeg_copy_critical_parameters (j_decompress_ptr srcinfo,
     * instead we rely on jpeg_set_colorspace to have made a suitable choice.
     */
  }
  /* Also copy JFIF version and resolution information, if available.
   * Strictly speaking this isn't "critical" info, but it's nearly
   * always appropriate to copy it if available.  In particular,
   * if the application chooses to copy JFIF 1.02 extension markers from
   * the source file, we need to copy the version to make sure we don't
   * emit a file that has 1.02 extensions but a claimed version of 1.01.
   * We will *not*, however, copy version info from mislabeled "2.01" files.
   */
  if (srcinfo->saw_JFIF_marker) {
    if (srcinfo->JFIF_major_version == 1) {
      dstinfo->JFIF_major_version = srcinfo->JFIF_major_version;
      dstinfo->JFIF_minor_version = srcinfo->JFIF_minor_version;
    }
    dstinfo->density_unit = srcinfo->density_unit;
    dstinfo->X_density = srcinfo->X_density;
    dstinfo->Y_density = srcinfo->Y_density;
  }
 }
@@ -170,7 +187,7 @@ transencode_master_selection (j_compress_ptr cinfo,
  /* We can now tell the memory manager to allocate virtual arrays. */
  (*cinfo->mem->realize_virt_arrays) ((j_common_ptr) cinfo);
-  /* Write the datastream header (SOI) immediately.
+  /* Write the datastream header (SOI, JFIF) immediately.
   * Frame and scan headers are postponed till later.
   * This lets application insert special markers after the SOI.
   */
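Note on the JFIF copy added to jpeg_copy_critical_parameters: the rule is that density information is always carried over when a JFIF marker was seen, while the version number is carried over only when the major version is 1. A minimal sketch of that rule follows. The struct jfif_info and the function copy_jfif_info are hypothetical stand-ins for the handful of fields involved; they are not the library's types.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the JFIF-related fields of the source and
 * destination objects.
 */
struct jfif_info {
  int saw_JFIF_marker;          /* only meaningful on the source side */
  unsigned char JFIF_major_version;
  unsigned char JFIF_minor_version;
  unsigned char density_unit;
  unsigned short X_density;
  unsigned short Y_density;
};

void
copy_jfif_info (const struct jfif_info *src, struct jfif_info *dst)
{
  assert(src != NULL && dst != NULL);
  if (! src->saw_JFIF_marker)
    return;                     /* nothing to copy */
  /* Copy the version only for 1.xx files; a mislabeled "2.01" file
   * leaves the destination's default version untouched.
   */
  if (src->JFIF_major_version == 1) {
    dst->JFIF_major_version = src->JFIF_major_version;
    dst->JFIF_minor_version = src->JFIF_minor_version;
  }
  dst->density_unit = src->density_unit;
  dst->X_density = src->X_density;
  dst->Y_density = src->Y_density;
}
```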
--- a/jdapimin.c
+++ b/jdapimin.c
@@ -1,7 +1,7 @@
 /*
 * jdapimin.c
 *
- * Copyright (C) 1994-1996, Thomas G. Lane.
+ * Copyright (C) 1994-1998, Thomas G. Lane.
 * This file is part of the Independent JPEG Group's software.
 * For conditions of distribution and use, see the accompanying README file.
 *
@@ -39,13 +39,18 @@ jpeg_CreateDecompress (j_decompress_ptr cinfo, int version, size_t structsize)
    ERREXIT2(cinfo, JERR_BAD_STRUCT_SIZE, 
 	     (int) SIZEOF(struct jpeg_decompress_struct), (int) structsize);
-  /* For debugging purposes, zero the whole master structure.
+  /* For debugging purposes, we zero the whole master structure.
-   * But error manager pointer is already there, so save and restore it.
+   * But the application has already set the err pointer, and may have set
   * client_data, so we have to save and restore those fields.
   * Note: if application hasn't set client_data, tools like Purify may
   * complain here.
   */
  {
    struct jpeg_error_mgr * err = cinfo->err;
    void * client_data = cinfo->client_data; /* ignore Purify complaint here */
    MEMZERO(cinfo, SIZEOF(struct jpeg_decompress_struct));
    cinfo->err = err;
    cinfo->client_data = client_data;
  }
  cinfo->is_decompressor = TRUE;
@@ -67,6 +72,7 @@ jpeg_CreateDecompress (j_decompress_ptr cinfo, int version, size_t structsize)
  /* Initialize marker processor so application can override methods
   * for COM, APPn markers before calling jpeg_read_header.
   */
  cinfo->marker_list = NULL;
  jinit_marker_reader(cinfo);
  /* And initialize the overall input controller. */
@@ -100,23 +106,6 @@ jpeg_abort_decompress (j_decompress_ptr cinfo)
 }
 /*
 * Install a special processing method for COM or APPn markers.
 */
 GLOBAL(void)
 jpeg_set_marker_processor (j_decompress_ptr cinfo, int marker_code,
 			   jpeg_marker_parser_method routine)
 {
  if (marker_code == JPEG_COM)
    cinfo->marker->process_COM = routine;
  else if (marker_code >= JPEG_APP0 && marker_code <= JPEG_APP0+15)
    cinfo->marker->process_APPn[marker_code-JPEG_APP0] = routine;
  else
    ERREXIT1(cinfo, JERR_UNKNOWN_MARKER, marker_code);
 }
 /*
 * Set default decompression parameters.
 */
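Note on the save/restore in jpeg_CreateDecompress: the object is zeroed as a whole, so any field the application filled in beforehand has to be read out first and written back afterwards. The sketch below shows the pattern on a hypothetical struct demo_object; the field names mirror the ones in the hunk above, but the struct and demo_create are illustrative only.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical object with two application-owned fields (err,
 * client_data) and some library-owned state.
 */
struct demo_object {
  void *err;
  void *client_data;
  int is_decompressor;
  int global_state;
};

void
demo_create (struct demo_object *obj)
{
  void *err;
  void *client_data;

  assert(obj != NULL);
  /* Save the fields the application may already have set ... */
  err = obj->err;
  client_data = obj->client_data;
  /* ... clear everything ... */
  memset(obj, 0, sizeof(*obj));
  /* ... and put the saved fields back. */
  obj->err = err;
  obj->client_data = client_data;
  obj->is_decompressor = 1;
}
```

If the application never assigned client_data, the read before the memset touches uninitialized memory; that is the access the comment above expects memory checkers to flag.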
--- a/jdcoefct.c
+++ b/jdcoefct.c
@@ -1,10 +1,17 @@
 /*
 * jdcoefct.c
 *
- * Copyright (C) 1994-1996, Thomas G. Lane.
+ * Copyright (C) 1994-1997, Thomas G. Lane.
 * This file is part of the Independent JPEG Group's software.
 * For conditions of distribution and use, see the accompanying README file.
 *
 * ---------------------------------------------------------------------
 * x86 SIMD extension for IJG JPEG library
 * Copyright (C) 1999-2006, MIYASAKA Masaru.
 * This file has been modified to improve performance.
 * Last Modified : December 18, 2005
 * ---------------------------------------------------------------------
 *
 * This file contains the coefficient buffer controller for decompression.
 * This controller is the top level of the JPEG decompressor proper.
 * The coefficient buffer lies between entropy decoding and inverse-DCT steps.
@@ -133,14 +140,19 @@ start_output_pass (j_decompress_ptr cinfo)
 }
 #ifndef NEED_FAR_POINTERS
 #undef jzero_far
 #define jzero_far(target, bytestozero)  MEMZERO(target, bytestozero)
 #endif
 /*
 * Decompress and return some data in the single-pass case.
 * Always attempts to emit one fully interleaved MCU row ("iMCU" row).
 * Input and output must run in lockstep since we have only a one-MCU buffer.
 * Return value is JPEG_ROW_COMPLETED, JPEG_SCAN_COMPLETED, or JPEG_SUSPENDED.
 *
- * NB: output_buf contains a plane for each component in image.
+ * NB: output_buf contains a plane for each component in image,
- * For single pass, this is the same as the components in the scan.
+ * which we index according to the component's SOF position.
 */
 METHODDEF(int)
@@ -150,15 +162,61 @@ decompress_onepass (j_decompress_ptr cinfo, JSAMPIMAGE output_buf)
  JDIMENSION MCU_col_num;	/* index of current MCU within row */
  JDIMENSION last_MCU_col = cinfo->MCUs_per_row - 1;
  JDIMENSION last_iMCU_row = cinfo->total_iMCU_rows - 1;
-  int blkn, ci, xindex, yindex, yoffset, useful_width;
+  int blkn, ci, ctr, xindex, yindex, yoffset;
  JSAMPARRAY output_ptr;
-  JDIMENSION start_col, output_col;
+  JDIMENSION output_col;
  jpeg_component_info *compptr;
  inverse_DCT_method_ptr inverse_DCT;
  JSAMPARRAY output_ptr_blk[D_MAX_BLOCKS_IN_MCU];
  JDIMENSION output_col_off[D_MAX_BLOCKS_IN_MCU];
  jpeg_component_info *compptr_blk[D_MAX_BLOCKS_IN_MCU];
  inverse_DCT_method_ptr inverse_DCT_blk_1[D_MAX_BLOCKS_IN_MCU];
  inverse_DCT_method_ptr inverse_DCT_blk_2[D_MAX_BLOCKS_IN_MCU];
  inverse_DCT_method_ptr *inverse_DCT_blk;
  /* Loop to process as much as one whole iMCU row */
  for (yoffset = coef->MCU_vert_offset; yoffset < coef->MCU_rows_per_iMCU_row;
       yoffset++) {
    /* Determine where data should go in output_buf and do the IDCT thing.
     * We skip dummy blocks at the right and bottom edges (but blkn gets
     * incremented past them!).  Note the inner loop relies on having
     * allocated the MCU_buffer[] blocks sequentially.
     */
    blkn = 0;			/* index of current DCT block within MCU */
    for (ci = 0; ci < cinfo->comps_in_scan; ci++) {
      compptr = cinfo->cur_comp_info[ci];
      /* Don't bother to IDCT an uninteresting component. */
      if (! compptr->component_needed) {
 	for (ctr = compptr->MCU_blocks; ctr > 0; ctr--) {
 	  inverse_DCT_blk_1[blkn] = inverse_DCT_blk_2[blkn] = NULL;
 	  blkn++;
 	}
 	continue;
      }
      inverse_DCT = cinfo->idct->inverse_DCT[compptr->component_index];
      output_ptr = output_buf[compptr->component_index] +
 	yoffset * compptr->DCT_scaled_size;
      for (yindex = 0; yindex < compptr->MCU_height; yindex++) {
 	if (cinfo->input_iMCU_row < last_iMCU_row ||
 	    yoffset+yindex < compptr->last_row_height) {
 	  for (xindex = 0; xindex < compptr->MCU_width; xindex++) {
 	    compptr_blk[blkn] = compptr;
 	    output_ptr_blk[blkn] = output_ptr;
 	    output_col_off[blkn] = xindex * compptr->DCT_scaled_size;
 	    inverse_DCT_blk_1[blkn] = inverse_DCT;
 	    inverse_DCT_blk_2[blkn] = (xindex < compptr->last_col_width) ?
 				      inverse_DCT : NULL;
 	    blkn++;
 	  }
 	} else {
 	  for (ctr = compptr->MCU_width; ctr > 0; ctr--) {
 	    inverse_DCT_blk_1[blkn] = inverse_DCT_blk_2[blkn] = NULL;
 	    blkn++;
 	  }
 	}
 	output_ptr += compptr->DCT_scaled_size;
      }
    }
    for (MCU_col_num = coef->MCU_ctr; MCU_col_num <= last_MCU_col;
 	 MCU_col_num++) {
      /* Try to fetch an MCU.  Entropy decoder expects buffer to be zeroed. */
@@ -170,38 +228,17 @@ decompress_onepass (j_decompress_ptr cinfo, JSAMPIMAGE output_buf)
 	coef->MCU_ctr = MCU_col_num;
 	return JPEG_SUSPENDED;
      }
-      /* Determine where data should go in output_buf and do the IDCT thing.
+      inverse_DCT_blk = (MCU_col_num < last_MCU_col) ? inverse_DCT_blk_1
-       * We skip dummy blocks at the right and bottom edges (but blkn gets
+						     : inverse_DCT_blk_2;
-       * incremented past them!).  Note the inner loop relies on having
+      for (blkn = 0; blkn < cinfo->blocks_in_MCU; blkn++) {
-       * allocated the MCU_buffer[] blocks sequentially.
+	inverse_DCT = inverse_DCT_blk[blkn];
-       */
+	if (inverse_DCT == NULL)
+	  continue;
      blkn = 0;			/* index of current DCT block within MCU */
      for (ci = 0; ci < cinfo->comps_in_scan; ci++) {
 	compptr = cinfo->cur_comp_info[ci];
 	/* Don't bother to IDCT an uninteresting component. */
 	if (! compptr->component_needed) {
 	  blkn += compptr->MCU_blocks;
 	  continue;
-	}
+	compptr = compptr_blk[blkn];
-	inverse_DCT = cinfo->idct->inverse_DCT[compptr->component_index];
+	output_col = MCU_col_num * compptr->MCU_sample_width +
-	useful_width = (MCU_col_num < last_MCU_col) ? compptr->MCU_width
+		     output_col_off[blkn];
-						    : compptr->last_col_width;
+	(*inverse_DCT) (cinfo, compptr, (JCOEFPTR) coef->MCU_buffer[blkn],
-	output_ptr = output_buf[ci] + yoffset * compptr->DCT_scaled_size;
+			output_ptr_blk[blkn], output_col);
 	start_col = MCU_col_num * compptr->MCU_sample_width;
 	for (yindex = 0; yindex < compptr->MCU_height; yindex++) {
 	  if (cinfo->input_iMCU_row < last_iMCU_row ||
 	      yoffset+yindex < compptr->last_row_height) {
 	    output_col = start_col;
 	    for (xindex = 0; xindex < useful_width; xindex++) {
 	      (*inverse_DCT) (cinfo, compptr,
 			      (JCOEFPTR) coef->MCU_buffer[blkn+xindex],
 			      output_ptr, output_col);
 	      output_col += compptr->DCT_scaled_size;
 	    }
 	  }
 	  blkn += compptr->MCU_width;
 	  output_ptr += compptr->DCT_scaled_size;
 	}
      }
    }
    /* Completed an MCU row, but perhaps not an iMCU row */
@@ -249,6 +286,8 @@ consume_data (j_decompress_ptr cinfo)
  JBLOCKARRAY buffer[MAX_COMPS_IN_SCAN];
  JBLOCKROW buffer_ptr;
  jpeg_component_info *compptr;
  int MCU_width[D_MAX_BLOCKS_IN_MCU];
  JBLOCKROW MCU_buffer_base[D_MAX_BLOCKS_IN_MCU];
  /* Align the virtual buffers for the components used in this scan. */
  for (ci = 0; ci < cinfo->comps_in_scan; ci++) {
@@ -266,20 +305,25 @@ consume_data (j_decompress_ptr cinfo)
  /* Loop to process one whole iMCU row */
  for (yoffset = coef->MCU_vert_offset; yoffset < coef->MCU_rows_per_iMCU_row;
       yoffset++) {
    for (MCU_col_num = coef->MCU_ctr; MCU_col_num < cinfo->MCUs_per_row;
 	 MCU_col_num++) {
    /* Construct list of pointers to DCT blocks belonging to this MCU */
    blkn = 0;			/* index of current DCT block within MCU */
    for (ci = 0; ci < cinfo->comps_in_scan; ci++) {
      compptr = cinfo->cur_comp_info[ci];
 	start_col = MCU_col_num * compptr->MCU_width;
      for (yindex = 0; yindex < compptr->MCU_height; yindex++) {
-	  buffer_ptr = buffer[ci][yindex+yoffset] + start_col;
+	buffer_ptr = buffer[ci][yindex+yoffset];
 	for (xindex = 0; xindex < compptr->MCU_width; xindex++) {
-	    coef->MCU_buffer[blkn++] = buffer_ptr++;
+	  MCU_width[blkn] = compptr->MCU_width;
 	  MCU_buffer_base[blkn] = buffer_ptr++;
 	  blkn++;
 	}
      }
    }
    for (MCU_col_num = coef->MCU_ctr; MCU_col_num < cinfo->MCUs_per_row;
 	 MCU_col_num++) {
      for (blkn = 0; blkn < cinfo->blocks_in_MCU; blkn++) {
 	start_col = MCU_col_num * MCU_width[blkn];
 	coef->MCU_buffer[blkn] = MCU_buffer_base[blkn] + start_col;
      }
      /* Try to fetch the MCU. */
      if (! (*cinfo->entropy->decode_mcu) (cinfo, coef->MCU_buffer)) {
 	/* Suspension forced; update state counters and exit */
@@ -452,6 +496,15 @@ smoothing_ok (j_decompress_ptr cinfo)
 }
 /*
 * SIMD Ext: Most SSE/SSE2 instructions require that a memory operand
 * be aligned to a 16-byte boundary; if it is not, a general-protection
 * exception (#GP) is generated.
 */
 #define ALIGN_SIZE	16		/* sizeof SSE/SSE2 register */
 #define ALIGN_MEM(p,a)	((void *) (((size_t) (p) + (a) - 1) & -(a)))
 /*
 * Variant of decompress_data for use when doing block smoothing.
 */
@@ -470,7 +523,8 @@ decompress_smooth_data (j_decompress_ptr cinfo, JSAMPIMAGE output_buf)
  jpeg_component_info *compptr;
  inverse_DCT_method_ptr inverse_DCT;
  boolean first_row, last_row;
-  JBLOCK workspace;
+  JCOEF workspace[DCTSIZE2 + ALIGN_SIZE/sizeof(JCOEF)];
  JCOEF * workptr = (JCOEF *) ALIGN_MEM(workspace, ALIGN_SIZE);
  int *coef_bits;
  JQUANT_TBL *quanttbl;
  INT32 Q00,Q01,Q02,Q10,Q11,Q20, num;
@@ -559,7 +613,7 @@ decompress_smooth_data (j_decompress_ptr cinfo, JSAMPIMAGE output_buf)
      last_block_column = compptr->width_in_blocks - 1;
      for (block_num = 0; block_num <= last_block_column; block_num++) {
 	/* Fetch current DCT block into workspace so we can modify it. */
-	jcopy_block_row(buffer_ptr, (JBLOCKROW) workspace, (JDIMENSION) 1);
+	jcopy_block_row(buffer_ptr, (JBLOCKROW) workptr, (JDIMENSION) 1);
 	/* Update DC values */
 	if (block_num < last_block_column) {
 	  DC3 = (int) prev_block_row[1][0];
@@ -571,7 +625,7 @@ decompress_smooth_data (j_decompress_ptr cinfo, JSAMPIMAGE output_buf)
 	 * and is not known to be fully accurate.
 	 */
 	/* AC01 */
-	if ((Al=coef_bits[1]) != 0 && workspace[1] == 0) {
+	if ((Al=coef_bits[1]) != 0 && workptr[1] == 0) {
 	  num = 36 * Q00 * (DC4 - DC6);
 	  if (num >= 0) {
 	    pred = (int) (((Q01<<7) + num) / (Q01<<8));
@@ -583,10 +637,10 @@ decompress_smooth_data (j_decompress_ptr cinfo, JSAMPIMAGE output_buf)
 	      pred = (1<<Al)-1;
 	    pred = -pred;
 	  }
-	  workspace[1] = (JCOEF) pred;
+	  workptr[1] = (JCOEF) pred;
 	}
 	/* AC10 */
-	if ((Al=coef_bits[2]) != 0 && workspace[8] == 0) {
+	if ((Al=coef_bits[2]) != 0 && workptr[8] == 0) {
 	  num = 36 * Q00 * (DC2 - DC8);
 	  if (num >= 0) {
 	    pred = (int) (((Q10<<7) + num) / (Q10<<8));
@@ -598,10 +652,10 @@ decompress_smooth_data (j_decompress_ptr cinfo, JSAMPIMAGE output_buf)
 	      pred = (1<<Al)-1;
 	    pred = -pred;
 	  }
-	  workspace[8] = (JCOEF) pred;
+	  workptr[8] = (JCOEF) pred;
 	}
 	/* AC20 */
-	if ((Al=coef_bits[3]) != 0 && workspace[16] == 0) {
+	if ((Al=coef_bits[3]) != 0 && workptr[16] == 0) {
 	  num = 9 * Q00 * (DC2 + DC8 - 2*DC5);
 	  if (num >= 0) {
 	    pred = (int) (((Q20<<7) + num) / (Q20<<8));
@@ -613,10 +667,10 @@ decompress_smooth_data (j_decompress_ptr cinfo, JSAMPIMAGE output_buf)
 	      pred = (1<<Al)-1;
 	    pred = -pred;
 	  }
-	  workspace[16] = (JCOEF) pred;
+	  workptr[16] = (JCOEF) pred;
 	}
 	/* AC11 */
-	if ((Al=coef_bits[4]) != 0 && workspace[9] == 0) {
+	if ((Al=coef_bits[4]) != 0 && workptr[9] == 0) {
 	  num = 5 * Q00 * (DC1 - DC3 - DC7 + DC9);
 	  if (num >= 0) {
 	    pred = (int) (((Q11<<7) + num) / (Q11<<8));
@@ -628,10 +682,10 @@ decompress_smooth_data (j_decompress_ptr cinfo, JSAMPIMAGE output_buf)
 	      pred = (1<<Al)-1;
 	    pred = -pred;
 	  }
-	  workspace[9] = (JCOEF) pred;
+	  workptr[9] = (JCOEF) pred;
 	}
 	/* AC02 */
-	if ((Al=coef_bits[5]) != 0 && workspace[2] == 0) {
+	if ((Al=coef_bits[5]) != 0 && workptr[2] == 0) {
 	  num = 9 * Q00 * (DC4 + DC6 - 2*DC5);
 	  if (num >= 0) {
 	    pred = (int) (((Q02<<7) + num) / (Q02<<8));
@@ -643,10 +697,10 @@ decompress_smooth_data (j_decompress_ptr cinfo, JSAMPIMAGE output_buf)
 	      pred = (1<<Al)-1;
 	    pred = -pred;
 	  }
-	  workspace[2] = (JCOEF) pred;
+	  workptr[2] = (JCOEF) pred;
 	}
 	/* OK, do the IDCT */
-	(*inverse_DCT) (cinfo, compptr, (JCOEFPTR) workspace,
+	(*inverse_DCT) (cinfo, compptr, (JCOEFPTR) workptr,
 			output_ptr, output_col);
 	/* Advance for next column */
 	DC1 = DC2; DC2 = DC3;
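Note on the aligned workspace in decompress_smooth_data: the patch over-allocates the local array by ALIGN_SIZE bytes and then rounds the start address up to the next 16-byte boundary. The sketch below shows the same idea with uintptr_t arithmetic. The function align_workspace and the constant DEMO_DCTSIZE2 are hypothetical names used for illustration, and short stands in for JCOEF on the assumption that coefficients are 16-bit.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define ALIGN_SIZE     16       /* bytes; size of an XMM register */
#define DEMO_DCTSIZE2  64       /* hypothetical stand-in for DCTSIZE2 */

/* Round a pointer up to the next multiple of align, which must be a
 * power of two.  The caller must have over-allocated by align bytes.
 */
short *
align_workspace (short *raw, size_t align)
{
  uintptr_t p = (uintptr_t) raw;

  assert(align != 0 && (align & (align - 1)) == 0);
  p = (p + align - 1) & ~(uintptr_t) (align - 1);
  return (short *) p;
}
```

A caller would declare short workspace[DEMO_DCTSIZE2 + ALIGN_SIZE / sizeof(short)] and use the returned pointer as the 64-coefficient block; the extra elements cover the worst-case shift needed to reach the boundary.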
--- a/jdcolmmx.asm
+++ b/jdcolmmx.asm
@@ -0,0 +1,438 @@
 ;
 ; jdcolmmx.asm - colorspace conversion (MMX)
 ;
 ; x86 SIMD extension for IJG JPEG library
 ; Copyright (C) 1999-2006, MIYASAKA Masaru.
 ; For conditions of distribution and use, see copyright notice in jsimdext.inc
 ;
 ; This file should be assembled with NASM (Netwide Assembler); it
 ; can *not* be assembled with Microsoft's MASM or any compatible
 ; assembler (including Borland's Turbo Assembler).
 ; NASM is available from http://nasm.sourceforge.net/ or
 ; http://sourceforge.net/project/showfiles.php?group_id=6208
 ;
 ; Last Modified : February 4, 2006
 ;
 ; [TAB8]
 %include "jsimdext.inc"
 %include "jcolsamp.inc"
 %if RGB_PIXELSIZE == 3 || RGB_PIXELSIZE == 4
 %ifdef JDCOLOR_YCCRGB_MMX_SUPPORTED
 ; --------------------------------------------------------------------------
 %define SCALEBITS	16
 F_0_344	equ	 22554			; FIX(0.34414)
 F_0_714	equ	 46802			; FIX(0.71414)
 F_1_402	equ	 91881			; FIX(1.40200)
 F_1_772	equ	116130			; FIX(1.77200)
 F_0_402	equ	(F_1_402 - 65536)	; FIX(1.40200) - FIX(1)
 F_0_285	equ	( 65536 - F_0_714)	; FIX(1) - FIX(0.71414)
 F_0_228	equ	(131072 - F_1_772)	; FIX(2) - FIX(1.77200)
 ; --------------------------------------------------------------------------
 	SECTION	SEG_CONST
 	alignz	16
 	global	EXTN(jconst_ycc_rgb_convert_mmx)
 EXTN(jconst_ycc_rgb_convert_mmx):
 PW_F0402	times 4 dw  F_0_402
 PW_MF0228	times 4 dw -F_0_228
 PW_MF0344_F0285	times 2 dw -F_0_344, F_0_285
 PW_ONE		times 4 dw  1
 PD_ONEHALF	times 2 dd  1 << (SCALEBITS-1)
 	alignz	16
 ; --------------------------------------------------------------------------
 	SECTION	SEG_TEXT
 	BITS	32
 ;
 ; Convert some rows of samples to the output colorspace.
 ;
 ; GLOBAL(void)
 ; jpeg_ycc_rgb_convert_mmx (j_decompress_ptr cinfo,
 ;                           JSAMPIMAGE input_buf, JDIMENSION input_row,
 ;                           JSAMPARRAY output_buf, int num_rows)
 ;
 %define cinfo(b)	(b)+8		; j_decompress_ptr cinfo
 %define input_buf(b)	(b)+12		; JSAMPIMAGE input_buf
 %define input_row(b)	(b)+16		; JDIMENSION input_row
 %define output_buf(b)	(b)+20		; JSAMPARRAY output_buf
 %define num_rows(b)	(b)+24		; int num_rows
 %define original_ebp	ebp+0
 %define wk(i)		ebp-(WK_NUM-(i))*SIZEOF_MMWORD	; mmword wk[WK_NUM]
 %define WK_NUM		2
 %define gotptr		wk(0)-SIZEOF_POINTER	; void * gotptr
 	align	16
 	global	EXTN(jpeg_ycc_rgb_convert_mmx)
 EXTN(jpeg_ycc_rgb_convert_mmx):
 	push	ebp
 	mov	eax,esp				; eax = original ebp
 	sub	esp, byte 4
 	and	esp, byte (-SIZEOF_MMWORD)	; align to 64 bits
 	mov	[esp],eax
 	mov	ebp,esp				; ebp = aligned ebp
 	lea	esp, [wk(0)]
 	pushpic	eax		; make room for the GOT address
 	push	ebx
 ;	push	ecx		; need not be preserved
 ;	push	edx		; need not be preserved
 	push	esi
 	push	edi
 	get_GOT	ebx			; get GOT address
 	movpic	POINTER [gotptr], ebx	; save GOT address
 	mov	ecx, POINTER [cinfo(eax)]
 	mov	ecx, JDIMENSION [jdstruct_output_width(ecx)]	; num_cols
 	test	ecx,ecx
 	jz	near .return
 	push	ecx
 	mov	edi, JSAMPIMAGE [input_buf(eax)]
 	mov	ecx, JDIMENSION [input_row(eax)]
 	mov	esi, JSAMPARRAY [edi+0*SIZEOF_JSAMPARRAY]
 	mov	ebx, JSAMPARRAY [edi+1*SIZEOF_JSAMPARRAY]
 	mov	edx, JSAMPARRAY [edi+2*SIZEOF_JSAMPARRAY]
 	lea	esi, [esi+ecx*SIZEOF_JSAMPROW]
 	lea	ebx, [ebx+ecx*SIZEOF_JSAMPROW]
 	lea	edx, [edx+ecx*SIZEOF_JSAMPROW]
 	pop	ecx
 	mov	edi, JSAMPARRAY [output_buf(eax)]
 	mov	eax, INT [num_rows(eax)]
 	test	eax,eax
 	jle	near .return
 	alignx	16,7
 .rowloop:
 	push	eax
 	push	edi
 	push	edx
 	push	ebx
 	push	esi
 	push	ecx			; col
 	mov	esi, JSAMPROW [esi]	; inptr0
 	mov	ebx, JSAMPROW [ebx]	; inptr1
 	mov	edx, JSAMPROW [edx]	; inptr2
 	mov	edi, JSAMPROW [edi]	; outptr
 	movpic	eax, POINTER [gotptr]	; load GOT address (eax)
 	alignx	16,7
 .columnloop:
 	movq	mm5, MMWORD [ebx]	; mm5=Cb(01234567)
 	movq	mm1, MMWORD [edx]	; mm1=Cr(01234567)
 	pcmpeqw	mm4,mm4
 	pcmpeqw	mm7,mm7
 	psrlw	mm4,BYTE_BIT
 	psllw	mm7,7			; mm7={0xFF80 0xFF80 0xFF80 0xFF80}
 	movq	mm0,mm4			; mm0=mm4={0xFF 0x00 0xFF 0x00 ..}
 	pand	mm4,mm5			; mm4=Cb(0246)=CbE
 	psrlw	mm5,BYTE_BIT		; mm5=Cb(1357)=CbO
 	pand	mm0,mm1			; mm0=Cr(0246)=CrE
 	psrlw	mm1,BYTE_BIT		; mm1=Cr(1357)=CrO
 	paddw	mm4,mm7
 	paddw	mm5,mm7
 	paddw	mm0,mm7
 	paddw	mm1,mm7
 	; (Original)
 	; R = Y                + 1.40200 * Cr
 	; G = Y - 0.34414 * Cb - 0.71414 * Cr
 	; B = Y + 1.77200 * Cb
 	;
 	; (This implementation)
 	; R = Y                + 0.40200 * Cr + Cr
 	; G = Y - 0.34414 * Cb + 0.28586 * Cr - Cr
 	; B = Y - 0.22800 * Cb + Cb + Cb
 	movq	mm2,mm4			; mm2=CbE
 	movq	mm3,mm5			; mm3=CbO
 	paddw	mm4,mm4			; mm4=2*CbE
 	paddw	mm5,mm5			; mm5=2*CbO
 	movq	mm6,mm0			; mm6=CrE
 	movq	mm7,mm1			; mm7=CrO
 	paddw	mm0,mm0			; mm0=2*CrE
 	paddw	mm1,mm1			; mm1=2*CrO
 	pmulhw	mm4,[GOTOFF(eax,PW_MF0228)]	; mm4=(2*CbE * -FIX(0.22800))
 	pmulhw	mm5,[GOTOFF(eax,PW_MF0228)]	; mm5=(2*CbO * -FIX(0.22800))
 	pmulhw	mm0,[GOTOFF(eax,PW_F0402)]	; mm0=(2*CrE * FIX(0.40200))
 	pmulhw	mm1,[GOTOFF(eax,PW_F0402)]	; mm1=(2*CrO * FIX(0.40200))
 	paddw	mm4,[GOTOFF(eax,PW_ONE)]
 	paddw	mm5,[GOTOFF(eax,PW_ONE)]
 	psraw	mm4,1			; mm4=(CbE * -FIX(0.22800))
 	psraw	mm5,1			; mm5=(CbO * -FIX(0.22800))
 	paddw	mm0,[GOTOFF(eax,PW_ONE)]
 	paddw	mm1,[GOTOFF(eax,PW_ONE)]
 	psraw	mm0,1			; mm0=(CrE * FIX(0.40200))
 	psraw	mm1,1			; mm1=(CrO * FIX(0.40200))
 	paddw	mm4,mm2
 	paddw	mm5,mm3
 	paddw	mm4,mm2			; mm4=(CbE * FIX(1.77200))=(B-Y)E
 	paddw	mm5,mm3			; mm5=(CbO * FIX(1.77200))=(B-Y)O
 	paddw	mm0,mm6			; mm0=(CrE * FIX(1.40200))=(R-Y)E
 	paddw	mm1,mm7			; mm1=(CrO * FIX(1.40200))=(R-Y)O
 	movq	MMWORD [wk(0)], mm4	; wk(0)=(B-Y)E
 	movq	MMWORD [wk(1)], mm5	; wk(1)=(B-Y)O
 	movq      mm4,mm2
 	movq      mm5,mm3
 	punpcklwd mm2,mm6
 	punpckhwd mm4,mm6
 	pmaddwd   mm2,[GOTOFF(eax,PW_MF0344_F0285)]
 	pmaddwd   mm4,[GOTOFF(eax,PW_MF0344_F0285)]
 	punpcklwd mm3,mm7
 	punpckhwd mm5,mm7
 	pmaddwd   mm3,[GOTOFF(eax,PW_MF0344_F0285)]
 	pmaddwd   mm5,[GOTOFF(eax,PW_MF0344_F0285)]
 	paddd     mm2,[GOTOFF(eax,PD_ONEHALF)]
 	paddd     mm4,[GOTOFF(eax,PD_ONEHALF)]
 	psrad     mm2,SCALEBITS
 	psrad     mm4,SCALEBITS
 	paddd     mm3,[GOTOFF(eax,PD_ONEHALF)]
 	paddd     mm5,[GOTOFF(eax,PD_ONEHALF)]
 	psrad     mm3,SCALEBITS
 	psrad     mm5,SCALEBITS
 	packssdw  mm2,mm4	; mm2=CbE*-FIX(0.344)+CrE*FIX(0.285)
 	packssdw  mm3,mm5	; mm3=CbO*-FIX(0.344)+CrO*FIX(0.285)
 	psubw     mm2,mm6	; mm2=CbE*-FIX(0.344)+CrE*-FIX(0.714)=(G-Y)E
 	psubw     mm3,mm7	; mm3=CbO*-FIX(0.344)+CrO*-FIX(0.714)=(G-Y)O
 	movq      mm5, MMWORD [esi]	; mm5=Y(01234567)
 	pcmpeqw   mm4,mm4
 	psrlw     mm4,BYTE_BIT		; mm4={0xFF 0x00 0xFF 0x00 ..}
 	pand      mm4,mm5		; mm4=Y(0246)=YE
 	psrlw     mm5,BYTE_BIT		; mm5=Y(1357)=YO
 	paddw     mm0,mm4		; mm0=((R-Y)E+YE)=RE=(R0 R2 R4 R6)
 	paddw     mm1,mm5		; mm1=((R-Y)O+YO)=RO=(R1 R3 R5 R7)
 	packuswb  mm0,mm0		; mm0=(R0 R2 R4 R6 ** ** ** **)
 	packuswb  mm1,mm1		; mm1=(R1 R3 R5 R7 ** ** ** **)
 	paddw     mm2,mm4		; mm2=((G-Y)E+YE)=GE=(G0 G2 G4 G6)
 	paddw     mm3,mm5		; mm3=((G-Y)O+YO)=GO=(G1 G3 G5 G7)
 	packuswb  mm2,mm2		; mm2=(G0 G2 G4 G6 ** ** ** **)
 	packuswb  mm3,mm3		; mm3=(G1 G3 G5 G7 ** ** ** **)
 	paddw     mm4, MMWORD [wk(0)]	; mm4=(YE+(B-Y)E)=BE=(B0 B2 B4 B6)
 	paddw     mm5, MMWORD [wk(1)]	; mm5=(YO+(B-Y)O)=BO=(B1 B3 B5 B7)
 	packuswb  mm4,mm4		; mm4=(B0 B2 B4 B6 ** ** ** **)
 	packuswb  mm5,mm5		; mm5=(B1 B3 B5 B7 ** ** ** **)
 %if RGB_PIXELSIZE == 3 ; ---------------
 	; mmA=(00 02 04 06 ** ** ** **), mmB=(01 03 05 07 ** ** ** **)
 	; mmC=(10 12 14 16 ** ** ** **), mmD=(11 13 15 17 ** ** ** **)
 	; mmE=(20 22 24 26 ** ** ** **), mmF=(21 23 25 27 ** ** ** **)
 	; mmG=(** ** ** ** ** ** ** **), mmH=(** ** ** ** ** ** ** **)
 	punpcklbw mmA,mmC		; mmA=(00 10 02 12 04 14 06 16)
 	punpcklbw mmE,mmB		; mmE=(20 01 22 03 24 05 26 07)
 	punpcklbw mmD,mmF		; mmD=(11 21 13 23 15 25 17 27)
 	movq      mmG,mmA
 	movq      mmH,mmA
 	punpcklwd mmA,mmE		; mmA=(00 10 20 01 02 12 22 03)
 	punpckhwd mmG,mmE		; mmG=(04 14 24 05 06 16 26 07)
 	psrlq     mmH,2*BYTE_BIT	; mmH=(02 12 04 14 06 16 -- --)
 	psrlq     mmE,2*BYTE_BIT	; mmE=(22 03 24 05 26 07 -- --)
 	movq      mmC,mmD
 	movq      mmB,mmD
 	punpcklwd mmD,mmH		; mmD=(11 21 02 12 13 23 04 14)
 	punpckhwd mmC,mmH		; mmC=(15 25 06 16 17 27 -- --)
 	psrlq     mmB,2*BYTE_BIT	; mmB=(13 23 15 25 17 27 -- --)
 	movq      mmF,mmE
 	punpcklwd mmE,mmB		; mmE=(22 03 13 23 24 05 15 25)
 	punpckhwd mmF,mmB		; mmF=(26 07 17 27 -- -- -- --)
 	punpckldq mmA,mmD		; mmA=(00 10 20 01 11 21 02 12)
 	punpckldq mmE,mmG		; mmE=(22 03 13 23 04 14 24 05)
 	punpckldq mmC,mmF		; mmC=(15 25 06 16 26 07 17 27)
 	cmp	ecx, byte SIZEOF_MMWORD
 	jb	short .column_st16
 	movq	MMWORD [edi+0*SIZEOF_MMWORD], mmA
 	movq	MMWORD [edi+1*SIZEOF_MMWORD], mmE
 	movq	MMWORD [edi+2*SIZEOF_MMWORD], mmC
 	sub	ecx, byte SIZEOF_MMWORD
 	jz	short .nextrow
 	add	esi, byte SIZEOF_MMWORD			; inptr0
 	add	ebx, byte SIZEOF_MMWORD			; inptr1
 	add	edx, byte SIZEOF_MMWORD			; inptr2
 	add	edi, byte RGB_PIXELSIZE*SIZEOF_MMWORD	; outptr
 	jmp	near .columnloop
 	alignx	16,7
 .column_st16:
 	lea	ecx, [ecx+ecx*2]	; imul ecx, RGB_PIXELSIZE
 	cmp	ecx, byte 2*SIZEOF_MMWORD
 	jb	short .column_st8
 	movq	MMWORD [edi+0*SIZEOF_MMWORD], mmA
 	movq	MMWORD [edi+1*SIZEOF_MMWORD], mmE
 	movq	mmA,mmC
 	sub	ecx, byte 2*SIZEOF_MMWORD
 	add	edi, byte 2*SIZEOF_MMWORD
 	jmp	short .column_st4
 .column_st8:
 	cmp	ecx, byte SIZEOF_MMWORD
 	jb	short .column_st4
 	movq	MMWORD [edi+0*SIZEOF_MMWORD], mmA
 	movq	mmA,mmE
 	sub	ecx, byte SIZEOF_MMWORD
 	add	edi, byte SIZEOF_MMWORD
 .column_st4:
 	movd	eax,mmA
 	cmp	ecx, byte SIZEOF_DWORD
 	jb	short .column_st2
 	mov	DWORD [edi+0*SIZEOF_DWORD], eax
 	psrlq	mmA,DWORD_BIT
 	movd	eax,mmA
 	sub	ecx, byte SIZEOF_DWORD
 	add	edi, byte SIZEOF_DWORD
 .column_st2:
 	cmp	ecx, byte SIZEOF_WORD
 	jb	short .column_st1
 	mov	WORD [edi+0*SIZEOF_WORD], ax
 	shr	eax,WORD_BIT
 	sub	ecx, byte SIZEOF_WORD
 	add	edi, byte SIZEOF_WORD
 .column_st1:
 	cmp	ecx, byte SIZEOF_BYTE
 	jb	short .nextrow
 	mov	BYTE [edi+0*SIZEOF_BYTE], al
 %else ; RGB_PIXELSIZE == 4 ; -----------
 %ifdef RGBX_FILLER_0XFF
 	pcmpeqb   mm6,mm6		; mm6=(X0 X2 X4 X6 ** ** ** **)
 	pcmpeqb   mm7,mm7		; mm7=(X1 X3 X5 X7 ** ** ** **)
 %else
 	pxor      mm6,mm6		; mm6=(X0 X2 X4 X6 ** ** ** **)
 	pxor      mm7,mm7		; mm7=(X1 X3 X5 X7 ** ** ** **)
 %endif
 	; mmA=(00 02 04 06 ** ** ** **), mmB=(01 03 05 07 ** ** ** **)
 	; mmC=(10 12 14 16 ** ** ** **), mmD=(11 13 15 17 ** ** ** **)
 	; mmE=(20 22 24 26 ** ** ** **), mmF=(21 23 25 27 ** ** ** **)
 	; mmG=(30 32 34 36 ** ** ** **), mmH=(31 33 35 37 ** ** ** **)
 	punpcklbw mmA,mmC		; mmA=(00 10 02 12 04 14 06 16)
 	punpcklbw mmE,mmG		; mmE=(20 30 22 32 24 34 26 36)
 	punpcklbw mmB,mmD		; mmB=(01 11 03 13 05 15 07 17)
 	punpcklbw mmF,mmH		; mmF=(21 31 23 33 25 35 27 37)
 	movq      mmC,mmA
 	punpcklwd mmA,mmE		; mmA=(00 10 20 30 02 12 22 32)
 	punpckhwd mmC,mmE		; mmC=(04 14 24 34 06 16 26 36)
 	movq      mmG,mmB
 	punpcklwd mmB,mmF		; mmB=(01 11 21 31 03 13 23 33)
 	punpckhwd mmG,mmF		; mmG=(05 15 25 35 07 17 27 37)
 	movq      mmD,mmA
 	punpckldq mmA,mmB		; mmA=(00 10 20 30 01 11 21 31)
 	punpckhdq mmD,mmB		; mmD=(02 12 22 32 03 13 23 33)
 	movq      mmH,mmC
 	punpckldq mmC,mmG		; mmC=(04 14 24 34 05 15 25 35)
 	punpckhdq mmH,mmG		; mmH=(06 16 26 36 07 17 27 37)
 	cmp	ecx, byte SIZEOF_MMWORD
 	jb	short .column_st16
 	movq	MMWORD [edi+0*SIZEOF_MMWORD], mmA
 	movq	MMWORD [edi+1*SIZEOF_MMWORD], mmD
 	movq	MMWORD [edi+2*SIZEOF_MMWORD], mmC
 	movq	MMWORD [edi+3*SIZEOF_MMWORD], mmH
 	sub	ecx, byte SIZEOF_MMWORD
 	jz	short .nextrow
 	add	esi, byte SIZEOF_MMWORD			; inptr0
 	add	ebx, byte SIZEOF_MMWORD			; inptr1
 	add	edx, byte SIZEOF_MMWORD			; inptr2
 	add	edi, byte RGB_PIXELSIZE*SIZEOF_MMWORD	; outptr
 	jmp	near .columnloop
 	alignx	16,7
 .column_st16:
 	cmp	ecx, byte SIZEOF_MMWORD/2
 	jb	short .column_st8
 	movq	MMWORD [edi+0*SIZEOF_MMWORD], mmA
 	movq	MMWORD [edi+1*SIZEOF_MMWORD], mmD
 	movq	mmA,mmC
 	movq	mmD,mmH
 	sub	ecx, byte SIZEOF_MMWORD/2
 	add	edi, byte 2*SIZEOF_MMWORD
 .column_st8:
 	cmp	ecx, byte SIZEOF_MMWORD/4
 	jb	short .column_st4
 	movq	MMWORD [edi+0*SIZEOF_MMWORD], mmA
 	movq	mmA,mmD
 	sub	ecx, byte SIZEOF_MMWORD/4
 	add	edi, byte 1*SIZEOF_MMWORD
 .column_st4:
 	cmp	ecx, byte SIZEOF_MMWORD/8
 	jb	short .nextrow
 	movd	DWORD [edi+0*SIZEOF_DWORD], mmA
 %endif ; RGB_PIXELSIZE ; ---------------
 	alignx	16,7
 .nextrow:
 	pop	ecx
 	pop	esi
 	pop	ebx
 	pop	edx
 	pop	edi
 	pop	eax
 	add	esi, byte SIZEOF_JSAMPROW
 	add	ebx, byte SIZEOF_JSAMPROW
 	add	edx, byte SIZEOF_JSAMPROW
 	add	edi, byte SIZEOF_JSAMPROW	; output_buf
 	dec	eax				; num_rows
 	jg	near .rowloop
 	emms		; empty MMX state
 .return:
 	pop	edi
 	pop	esi
 ;	pop	edx		; need not be preserved
 ;	pop	ecx		; need not be preserved
 	pop	ebx
 	mov	esp,ebp		; esp <- aligned ebp
 	pop	esp		; esp <- original ebp
 	pop	ebp
 	ret
 %endif ; JDCOLOR_YCCRGB_MMX_SUPPORTED
 %endif ; RGB_PIXELSIZE == 3 || RGB_PIXELSIZE == 4
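Note on the constants in jdcolmmx.asm: 1.402, 0.71414 and 1.772 scaled by 2^16 do not fit in a signed 16-bit word, so the routine splits each product into a small fractional multiply plus whole copies of Cb or Cr. The scalar C sketch below applies the same split to one pixel. The function names are hypothetical, and the rounding here (add one half, then floor) is a simplification: the MMX code rounds the R and B terms slightly differently (pmulhw on the doubled value, add 1, shift right by 1), so results may differ from it in the last bit.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SCALEBITS  16
#define ONE_HALF   ((int32_t) 1 << (SCALEBITS - 1))
#define F_0_344    22554                /* FIX(0.34414) */
#define F_0_714    46802                /* FIX(0.71414) */
#define F_1_402    91881                /* FIX(1.40200) */
#define F_1_772    116130               /* FIX(1.77200) */
#define F_0_402    (F_1_402 - 65536)    /* FIX(1.40200) - FIX(1) */
#define F_0_285    (65536 - F_0_714)    /* FIX(1) - FIX(0.71414) */
#define F_0_228    (131072 - F_1_772)   /* FIX(2) - FIX(1.77200) */

/* Floor division by 2^SCALEBITS without shifting a negative value. */
static int32_t
descale (int32_t x)
{
  if (x >= 0)
    return x >> SCALEBITS;
  return -(((-x) + ((int32_t) 1 << SCALEBITS) - 1) >> SCALEBITS);
}

static uint8_t
clamp_u8 (int32_t v)
{
  return (uint8_t) (v < 0 ? 0 : (v > 255 ? 255 : v));
}

/* Hypothetical one-pixel converter; writes R, G, B to rgb[0..2]. */
void
ycc_to_rgb_pixel (uint8_t y, uint8_t cb_in, uint8_t cr_in, uint8_t rgb[3])
{
  int32_t cb = (int32_t) cb_in - 128;
  int32_t cr = (int32_t) cr_in - 128;

  assert(rgb != NULL);
  /* R = Y + 0.40200 * Cr + Cr */
  rgb[0] = clamp_u8(y + descale(F_0_402 * cr + ONE_HALF) + cr);
  /* G = Y - 0.34414 * Cb + 0.28586 * Cr - Cr */
  rgb[1] = clamp_u8(y + descale(-F_0_344 * cb + F_0_285 * cr + ONE_HALF) - cr);
  /* B = Y - 0.22800 * Cb + Cb + Cb */
  rgb[2] = clamp_u8(y + descale(-F_0_228 * cb + ONE_HALF) + cb + cb);
}
```

After the split, every multiplier is below 32768, which is what allows the assembly to keep them in signed 16-bit lanes for pmulhw and pmaddwd.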
--- a/jdcolor.c
+++ b/jdcolor.c
@@ -1,16 +1,24 @@
 /*
 * jdcolor.c
 *
- * Copyright (C) 1991-1996, Thomas G. Lane.
+ * Copyright (C) 1991-1997, Thomas G. Lane.
 * This file is part of the Independent JPEG Group's software.
 * For conditions of distribution and use, see the accompanying README file.
 *
 * ---------------------------------------------------------------------
 * x86 SIMD extension for IJG JPEG library
 * Copyright (C) 1999-2006, MIYASAKA Masaru.
 * This file has been modified for SIMD extension.
 * Last Modified : January 5, 2006
 * ---------------------------------------------------------------------
 *
 * This file contains output colorspace conversion routines.
 */
 #define JPEG_INTERNALS
 #include "jinclude.h"
 #include "jpeglib.h"
 #include "jcolsamp.h"		/* Private declarations */
 /* Private subobject */
@@ -105,6 +113,17 @@ build_ycc_rgb_table (j_decompress_ptr cinfo)
 }
 #if RGB_PIXELSIZE == 4
 /* offset of filler byte */
 #define RGB_FILLER  (6 - (RGB_RED) - (RGB_GREEN) - (RGB_BLUE))
 /* byte pattern to fill with */
 #ifdef RGBX_FILLER_0XFF
 #define RGB_FILLER_BYTE 0xFF
 #else
 #define RGB_FILLER_BYTE 0x00
 #endif
 #endif /* RGB_PIXELSIZE == 4 */
 /*
 * Convert some rows of samples to the output colorspace.
 *
@@ -151,6 +170,9 @@ ycc_rgb_convert (j_decompress_ptr cinfo,
 			      ((int) RIGHT_SHIFT(Cbgtab[cb] + Crgtab[cr],
 						 SCALEBITS))];
      outptr[RGB_BLUE] =  range_limit[y + Cbbtab[cb]];
 #if RGB_PIXELSIZE == 4
      outptr[RGB_FILLER] = RGB_FILLER_BYTE;
 #endif
      outptr += RGB_PIXELSIZE;
    }
  }
@@ -207,6 +229,36 @@ grayscale_convert (j_decompress_ptr cinfo,
 }
 /*
 * Convert grayscale to RGB: just duplicate the graylevel three times.
 * This is provided to support applications that don't want to cope
 * with grayscale as a separate case.
 */
 METHODDEF(void)
 gray_rgb_convert (j_decompress_ptr cinfo,
 		  JSAMPIMAGE input_buf, JDIMENSION input_row,
 		  JSAMPARRAY output_buf, int num_rows)
 {
  register JSAMPROW inptr, outptr;
  register JDIMENSION col;
  JDIMENSION num_cols = cinfo->output_width;
  while (--num_rows >= 0) {
    inptr = input_buf[0][input_row++];
    outptr = *output_buf++;
    for (col = 0; col < num_cols; col++) {
      /* We can dispense with GETJSAMPLE() here */
      outptr[RGB_RED] = outptr[RGB_GREEN] = outptr[RGB_BLUE] = inptr[col];
 #if RGB_PIXELSIZE == 4
      outptr[RGB_FILLER] = RGB_FILLER_BYTE;
 #endif
      outptr += RGB_PIXELSIZE;
    }
  }
 }
 /*
 * Adobe-style YCCK->CMYK conversion.
 * We convert YCbCr to R=1-C, G=1-M, and B=1-Y using the same
@@ -278,6 +330,7 @@ jinit_color_deconverter (j_decompress_ptr cinfo)
 {
  my_cconvert_ptr cconvert;
  int ci;
  unsigned int simd = jpeg_simd_support((j_common_ptr) cinfo);
  cconvert = (my_cconvert_ptr)
    (*cinfo->mem->alloc_small) ((j_common_ptr) cinfo, JPOOL_IMAGE,
@@ -331,8 +384,25 @@ jinit_color_deconverter (j_decompress_ptr cinfo)
  case JCS_RGB:
    cinfo->out_color_components = RGB_PIXELSIZE;
    if (cinfo->jpeg_color_space == JCS_YCbCr) {
 #if RGB_PIXELSIZE == 3 || RGB_PIXELSIZE == 4
 #ifdef JDCOLOR_YCCRGB_SSE2_SUPPORTED
      if (simd & JSIMD_SSE2 &&
          IS_CONST_ALIGNED_16(jconst_ycc_rgb_convert_sse2)) {
        cconvert->pub.color_convert = jpeg_ycc_rgb_convert_sse2;
      } else
 #endif
 #ifdef JDCOLOR_YCCRGB_MMX_SUPPORTED
      if (simd & JSIMD_MMX) {
        cconvert->pub.color_convert = jpeg_ycc_rgb_convert_mmx;
      } else
 #endif
 #endif /* RGB_PIXELSIZE == 3 || RGB_PIXELSIZE == 4 */
      {
        cconvert->pub.color_convert = ycc_rgb_convert;
        build_ycc_rgb_table(cinfo);
      }
    } else if (cinfo->jpeg_color_space == JCS_GRAYSCALE) {
      cconvert->pub.color_convert = gray_rgb_convert;
    } else if (cinfo->jpeg_color_space == JCS_RGB && RGB_PIXELSIZE == 3) {
      cconvert->pub.color_convert = null_convert;
    } else
@@ -365,3 +435,28 @@ jinit_color_deconverter (j_decompress_ptr cinfo)
  else
    cinfo->output_components = cinfo->out_color_components;
 }
 #ifndef JSIMD_MODEINFO_NOT_SUPPORTED
 GLOBAL(unsigned int)
 jpeg_simd_color_deconverter (j_decompress_ptr cinfo)
 {
  unsigned int simd = jpeg_simd_support((j_common_ptr) cinfo);
 #if RGB_PIXELSIZE == 3 || RGB_PIXELSIZE == 4
 #ifdef JDCOLOR_YCCRGB_SSE2_SUPPORTED
  if (simd & JSIMD_SSE2 &&
      IS_CONST_ALIGNED_16(jconst_ycc_rgb_convert_sse2))
    return JSIMD_SSE2;
 #endif
 #ifdef JDCOLOR_YCCRGB_MMX_SUPPORTED
  if (simd & JSIMD_MMX)
    return JSIMD_MMX;
 #endif
 #endif /* RGB_PIXELSIZE == 3 || RGB_PIXELSIZE == 4 */
  return JSIMD_NONE;
 }
 #endif /* !JSIMD_MODEINFO_NOT_SUPPORTED */
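
Note on the RGB_FILLER macro added to jdcolor.c above: with RGB_PIXELSIZE == 4 the byte offsets inside one pixel are 0, 1, 2 and 3, which sum to 6, so subtracting the three colour offsets leaves the single unused slot. The sketch below only illustrates that arithmetic; the function name filler_offset is hypothetical and is not part of the patch.

```c
/* Hypothetical helper mirroring the RGB_FILLER macro from the patch.
 * The offsets 0..3 sum to 6, so 6 - (r + g + b) is the offset left over.
 * Assumption: r, g and b are three distinct values in the range 0..3. */
static int filler_offset(int r, int g, int b)
{
  return 6 - r - g - b;
}
```

For example, an RGBX layout (red 0, green 1, blue 2) leaves offset 3 for the filler byte, while an XBGR layout (red 3, green 2, blue 1) leaves offset 0.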
--- a/jdcolss2.asm
+++ b/jdcolss2.asm
@@ -0,0 +1,536 @@
 ;
 ; jdcolss2.asm - colorspace conversion (SSE2)
 ;
 ; x86 SIMD extension for IJG JPEG library
 ; Copyright (C) 1999-2006, MIYASAKA Masaru.
 ; For conditions of distribution and use, see copyright notice in jsimdext.inc
 ;
 ; This file should be assembled with NASM (Netwide Assembler),
 ; and can *not* be assembled with Microsoft's MASM or any compatible
 ; assembler (including Borland's Turbo Assembler).
 ; NASM is available from http://nasm.sourceforge.net/ or
 ; http://sourceforge.net/project/showfiles.php?group_id=6208
 ;
 ; Last Modified : February 4, 2006
 ;
 ; [TAB8]
 %include "jsimdext.inc"
 %include "jcolsamp.inc"
 %if RGB_PIXELSIZE == 3 || RGB_PIXELSIZE == 4
 %ifdef JDCOLOR_YCCRGB_SSE2_SUPPORTED
 ; --------------------------------------------------------------------------
 %define SCALEBITS	16
 F_0_344	equ	 22554			; FIX(0.34414)
 F_0_714	equ	 46802			; FIX(0.71414)
 F_1_402	equ	 91881			; FIX(1.40200)
 F_1_772	equ	116130			; FIX(1.77200)
 F_0_402	equ	(F_1_402 - 65536)	; FIX(1.40200) - FIX(1)
 F_0_285	equ	( 65536 - F_0_714)	; FIX(1) - FIX(0.71414)
 F_0_228	equ	(131072 - F_1_772)	; FIX(2) - FIX(1.77200)
 ; --------------------------------------------------------------------------
 	SECTION	SEG_CONST
 	alignz	16
 	global	EXTN(jconst_ycc_rgb_convert_sse2)
 EXTN(jconst_ycc_rgb_convert_sse2):
 PW_F0402	times 8 dw  F_0_402
 PW_MF0228	times 8 dw -F_0_228
 PW_MF0344_F0285	times 4 dw -F_0_344, F_0_285
 PW_ONE		times 8 dw  1
 PD_ONEHALF	times 4 dd  1 << (SCALEBITS-1)
 	alignz	16
 ; --------------------------------------------------------------------------
 	SECTION	SEG_TEXT
 	BITS	32
 ;
 ; Convert some rows of samples to the output colorspace.
 ;
 ; GLOBAL(void)
 ; jpeg_ycc_rgb_convert_sse2 (j_decompress_ptr cinfo,
 ;                            JSAMPIMAGE input_buf, JDIMENSION input_row,
 ;                            JSAMPARRAY output_buf, int num_rows)
 ;
 %define cinfo(b)	(b)+8		; j_decompress_ptr cinfo
 %define input_buf(b)	(b)+12		; JSAMPIMAGE input_buf
 %define input_row(b)	(b)+16		; JDIMENSION input_row
 %define output_buf(b)	(b)+20		; JSAMPARRAY output_buf
 %define num_rows(b)	(b)+24		; int num_rows
 %define original_ebp	ebp+0
 %define wk(i)		ebp-(WK_NUM-(i))*SIZEOF_XMMWORD	; xmmword wk[WK_NUM]
 %define WK_NUM		2
 %define gotptr		wk(0)-SIZEOF_POINTER	; void * gotptr
 	align	16
 	global	EXTN(jpeg_ycc_rgb_convert_sse2)
 EXTN(jpeg_ycc_rgb_convert_sse2):
 	push	ebp
 	mov	eax,esp				; eax = original ebp
 	sub	esp, byte 4
 	and	esp, byte (-SIZEOF_XMMWORD)	; align to 128 bits
 	mov	[esp],eax
 	mov	ebp,esp				; ebp = aligned ebp
 	lea	esp, [wk(0)]
 	pushpic	eax		; make room for the GOT address
 	push	ebx
 ;	push	ecx		; need not be preserved
 ;	push	edx		; need not be preserved
 	push	esi
 	push	edi
 	get_GOT	ebx			; get GOT address
 	movpic	POINTER [gotptr], ebx	; save GOT address
 	mov	ecx, POINTER [cinfo(eax)]
 	mov	ecx, JDIMENSION [jdstruct_output_width(ecx)]	; num_cols
 	test	ecx,ecx
 	jz	near .return
 	push	ecx
 	mov	edi, JSAMPIMAGE [input_buf(eax)]
 	mov	ecx, JDIMENSION [input_row(eax)]
 	mov	esi, JSAMPARRAY [edi+0*SIZEOF_JSAMPARRAY]
 	mov	ebx, JSAMPARRAY [edi+1*SIZEOF_JSAMPARRAY]
 	mov	edx, JSAMPARRAY [edi+2*SIZEOF_JSAMPARRAY]
 	lea	esi, [esi+ecx*SIZEOF_JSAMPROW]
 	lea	ebx, [ebx+ecx*SIZEOF_JSAMPROW]
 	lea	edx, [edx+ecx*SIZEOF_JSAMPROW]
 	pop	ecx
 	mov	edi, JSAMPARRAY [output_buf(eax)]
 	mov	eax, INT [num_rows(eax)]
 	test	eax,eax
 	jle	near .return
 	alignx	16,7
 .rowloop:
 	push	eax
 	push	edi
 	push	edx
 	push	ebx
 	push	esi
 	push	ecx			; col
 	mov	esi, JSAMPROW [esi]	; inptr0
 	mov	ebx, JSAMPROW [ebx]	; inptr1
 	mov	edx, JSAMPROW [edx]	; inptr2
 	mov	edi, JSAMPROW [edi]	; outptr
 	movpic	eax, POINTER [gotptr]	; load GOT address (eax)
 	alignx	16,7
 .columnloop:
 	movdqa	xmm5, XMMWORD [ebx]	; xmm5=Cb(0123456789ABCDEF)
 	movdqa	xmm1, XMMWORD [edx]	; xmm1=Cr(0123456789ABCDEF)
 	pcmpeqw	xmm4,xmm4
 	pcmpeqw	xmm7,xmm7
 	psrlw	xmm4,BYTE_BIT
 	psllw	xmm7,7			; xmm7={0xFF80 0xFF80 0xFF80 0xFF80 ..}
 	movdqa	xmm0,xmm4		; xmm0=xmm4={0xFF 0x00 0xFF 0x00 ..}
 	pand	xmm4,xmm5		; xmm4=Cb(02468ACE)=CbE
 	psrlw	xmm5,BYTE_BIT		; xmm5=Cb(13579BDF)=CbO
 	pand	xmm0,xmm1		; xmm0=Cr(02468ACE)=CrE
 	psrlw	xmm1,BYTE_BIT		; xmm1=Cr(13579BDF)=CrO
 	paddw	xmm4,xmm7
 	paddw	xmm5,xmm7
 	paddw	xmm0,xmm7
 	paddw	xmm1,xmm7
 	; (Original)
 	; R = Y                + 1.40200 * Cr
 	; G = Y - 0.34414 * Cb - 0.71414 * Cr
 	; B = Y + 1.77200 * Cb
 	;
 	; (This implementation)
 	; R = Y                + 0.40200 * Cr + Cr
 	; G = Y - 0.34414 * Cb + 0.28586 * Cr - Cr
 	; B = Y - 0.22800 * Cb + Cb + Cb
 	movdqa	xmm2,xmm4		; xmm2=CbE
 	movdqa	xmm3,xmm5		; xmm3=CbO
 	paddw	xmm4,xmm4		; xmm4=2*CbE
 	paddw	xmm5,xmm5		; xmm5=2*CbO
 	movdqa	xmm6,xmm0		; xmm6=CrE
 	movdqa	xmm7,xmm1		; xmm7=CrO
 	paddw	xmm0,xmm0		; xmm0=2*CrE
 	paddw	xmm1,xmm1		; xmm1=2*CrO
 	pmulhw	xmm4,[GOTOFF(eax,PW_MF0228)]	; xmm4=(2*CbE * -FIX(0.22800))
 	pmulhw	xmm5,[GOTOFF(eax,PW_MF0228)]	; xmm5=(2*CbO * -FIX(0.22800))
 	pmulhw	xmm0,[GOTOFF(eax,PW_F0402)]	; xmm0=(2*CrE * FIX(0.40200))
 	pmulhw	xmm1,[GOTOFF(eax,PW_F0402)]	; xmm1=(2*CrO * FIX(0.40200))
 	paddw	xmm4,[GOTOFF(eax,PW_ONE)]
 	paddw	xmm5,[GOTOFF(eax,PW_ONE)]
 	psraw	xmm4,1			; xmm4=(CbE * -FIX(0.22800))
 	psraw	xmm5,1			; xmm5=(CbO * -FIX(0.22800))
 	paddw	xmm0,[GOTOFF(eax,PW_ONE)]
 	paddw	xmm1,[GOTOFF(eax,PW_ONE)]
 	psraw	xmm0,1			; xmm0=(CrE * FIX(0.40200))
 	psraw	xmm1,1			; xmm1=(CrO * FIX(0.40200))
 	paddw	xmm4,xmm2
 	paddw	xmm5,xmm3
 	paddw	xmm4,xmm2		; xmm4=(CbE * FIX(1.77200))=(B-Y)E
 	paddw	xmm5,xmm3		; xmm5=(CbO * FIX(1.77200))=(B-Y)O
 	paddw	xmm0,xmm6		; xmm0=(CrE * FIX(1.40200))=(R-Y)E
 	paddw	xmm1,xmm7		; xmm1=(CrO * FIX(1.40200))=(R-Y)O
 	movdqa	XMMWORD [wk(0)], xmm4	; wk(0)=(B-Y)E
 	movdqa	XMMWORD [wk(1)], xmm5	; wk(1)=(B-Y)O
 	movdqa    xmm4,xmm2
 	movdqa    xmm5,xmm3
 	punpcklwd xmm2,xmm6
 	punpckhwd xmm4,xmm6
 	pmaddwd   xmm2,[GOTOFF(eax,PW_MF0344_F0285)]
 	pmaddwd   xmm4,[GOTOFF(eax,PW_MF0344_F0285)]
 	punpcklwd xmm3,xmm7
 	punpckhwd xmm5,xmm7
 	pmaddwd   xmm3,[GOTOFF(eax,PW_MF0344_F0285)]
 	pmaddwd   xmm5,[GOTOFF(eax,PW_MF0344_F0285)]
 	paddd     xmm2,[GOTOFF(eax,PD_ONEHALF)]
 	paddd     xmm4,[GOTOFF(eax,PD_ONEHALF)]
 	psrad     xmm2,SCALEBITS
 	psrad     xmm4,SCALEBITS
 	paddd     xmm3,[GOTOFF(eax,PD_ONEHALF)]
 	paddd     xmm5,[GOTOFF(eax,PD_ONEHALF)]
 	psrad     xmm3,SCALEBITS
 	psrad     xmm5,SCALEBITS
 	packssdw  xmm2,xmm4	; xmm2=CbE*-FIX(0.344)+CrE*FIX(0.285)
 	packssdw  xmm3,xmm5	; xmm3=CbO*-FIX(0.344)+CrO*FIX(0.285)
 	psubw     xmm2,xmm6	; xmm2=CbE*-FIX(0.344)+CrE*-FIX(0.714)=(G-Y)E
 	psubw     xmm3,xmm7	; xmm3=CbO*-FIX(0.344)+CrO*-FIX(0.714)=(G-Y)O
 	movdqa    xmm5, XMMWORD [esi]	; xmm5=Y(0123456789ABCDEF)
 	pcmpeqw   xmm4,xmm4
 	psrlw     xmm4,BYTE_BIT		; xmm4={0xFF 0x00 0xFF 0x00 ..}
 	pand      xmm4,xmm5		; xmm4=Y(02468ACE)=YE
 	psrlw     xmm5,BYTE_BIT		; xmm5=Y(13579BDF)=YO
 	paddw     xmm0,xmm4		; xmm0=((R-Y)E+YE)=RE=R(02468ACE)
 	paddw     xmm1,xmm5		; xmm1=((R-Y)O+YO)=RO=R(13579BDF)
 	packuswb  xmm0,xmm0		; xmm0=R(02468ACE********)
 	packuswb  xmm1,xmm1		; xmm1=R(13579BDF********)
 	paddw     xmm2,xmm4		; xmm2=((G-Y)E+YE)=GE=G(02468ACE)
 	paddw     xmm3,xmm5		; xmm3=((G-Y)O+YO)=GO=G(13579BDF)
 	packuswb  xmm2,xmm2		; xmm2=G(02468ACE********)
 	packuswb  xmm3,xmm3		; xmm3=G(13579BDF********)
 	paddw     xmm4, XMMWORD [wk(0)]	; xmm4=(YE+(B-Y)E)=BE=B(02468ACE)
 	paddw     xmm5, XMMWORD [wk(1)]	; xmm5=(YO+(B-Y)O)=BO=B(13579BDF)
 	packuswb  xmm4,xmm4		; xmm4=B(02468ACE********)
 	packuswb  xmm5,xmm5		; xmm5=B(13579BDF********)
 %if RGB_PIXELSIZE == 3 ; ---------------
 	; xmmA=(00 02 04 06 08 0A 0C 0E **), xmmB=(01 03 05 07 09 0B 0D 0F **)
 	; xmmC=(10 12 14 16 18 1A 1C 1E **), xmmD=(11 13 15 17 19 1B 1D 1F **)
 	; xmmE=(20 22 24 26 28 2A 2C 2E **), xmmF=(21 23 25 27 29 2B 2D 2F **)
 	; xmmG=(** ** ** ** ** ** ** ** **), xmmH=(** ** ** ** ** ** ** ** **)
 	punpcklbw xmmA,xmmC	; xmmA=(00 10 02 12 04 14 06 16 08 18 0A 1A 0C 1C 0E 1E)
 	punpcklbw xmmE,xmmB	; xmmE=(20 01 22 03 24 05 26 07 28 09 2A 0B 2C 0D 2E 0F)
 	punpcklbw xmmD,xmmF	; xmmD=(11 21 13 23 15 25 17 27 19 29 1B 2B 1D 2D 1F 2F)
 	movdqa    xmmG,xmmA
 	movdqa    xmmH,xmmA
 	punpcklwd xmmA,xmmE	; xmmA=(00 10 20 01 02 12 22 03 04 14 24 05 06 16 26 07)
 	punpckhwd xmmG,xmmE	; xmmG=(08 18 28 09 0A 1A 2A 0B 0C 1C 2C 0D 0E 1E 2E 0F)
 	psrldq    xmmH,2	; xmmH=(02 12 04 14 06 16 08 18 0A 1A 0C 1C 0E 1E -- --)
 	psrldq    xmmE,2	; xmmE=(22 03 24 05 26 07 28 09 2A 0B 2C 0D 2E 0F -- --)
 	movdqa    xmmC,xmmD
 	movdqa    xmmB,xmmD
 	punpcklwd xmmD,xmmH	; xmmD=(11 21 02 12 13 23 04 14 15 25 06 16 17 27 08 18)
 	punpckhwd xmmC,xmmH	; xmmC=(19 29 0A 1A 1B 2B 0C 1C 1D 2D 0E 1E 1F 2F -- --)
 	psrldq    xmmB,2	; xmmB=(13 23 15 25 17 27 19 29 1B 2B 1D 2D 1F 2F -- --)
 	movdqa    xmmF,xmmE
 	punpcklwd xmmE,xmmB	; xmmE=(22 03 13 23 24 05 15 25 26 07 17 27 28 09 19 29)
 	punpckhwd xmmF,xmmB	; xmmF=(2A 0B 1B 2B 2C 0D 1D 2D 2E 0F 1F 2F -- -- -- --)
 	pshufd    xmmH,xmmA,0x4E; xmmH=(04 14 24 05 06 16 26 07 00 10 20 01 02 12 22 03)
 	movdqa    xmmB,xmmE
 	punpckldq xmmA,xmmD	; xmmA=(00 10 20 01 11 21 02 12 02 12 22 03 13 23 04 14)
 	punpckldq xmmE,xmmH	; xmmE=(22 03 13 23 04 14 24 05 24 05 15 25 06 16 26 07)
 	punpckhdq xmmD,xmmB	; xmmD=(15 25 06 16 26 07 17 27 17 27 08 18 28 09 19 29)
 	pshufd    xmmH,xmmG,0x4E; xmmH=(0C 1C 2C 0D 0E 1E 2E 0F 08 18 28 09 0A 1A 2A 0B)
 	movdqa    xmmB,xmmF
 	punpckldq xmmG,xmmC	; xmmG=(08 18 28 09 19 29 0A 1A 0A 1A 2A 0B 1B 2B 0C 1C)
 	punpckldq xmmF,xmmH	; xmmF=(2A 0B 1B 2B 0C 1C 2C 0D 2C 0D 1D 2D 0E 1E 2E 0F)
 	punpckhdq xmmC,xmmB	; xmmC=(1D 2D 0E 1E 2E 0F 1F 2F 1F 2F -- -- -- -- -- --)
 	punpcklqdq xmmA,xmmE	; xmmA=(00 10 20 01 11 21 02 12 22 03 13 23 04 14 24 05)
 	punpcklqdq xmmD,xmmG	; xmmD=(15 25 06 16 26 07 17 27 08 18 28 09 19 29 0A 1A)
 	punpcklqdq xmmF,xmmC	; xmmF=(2A 0B 1B 2B 0C 1C 2C 0D 1D 2D 0E 1E 2E 0F 1F 2F)
 	cmp	ecx, byte SIZEOF_XMMWORD
 	jb	short .column_st32
 	test	edi, SIZEOF_XMMWORD-1
 	jnz	short .out1
 	; --(aligned)-------------------
 	movntdq	XMMWORD [edi+0*SIZEOF_XMMWORD], xmmA
 	movntdq	XMMWORD [edi+1*SIZEOF_XMMWORD], xmmD
 	movntdq	XMMWORD [edi+2*SIZEOF_XMMWORD], xmmF
 	add	edi, byte RGB_PIXELSIZE*SIZEOF_XMMWORD	; outptr
 	jmp	short .out0
 .out1:	; --(unaligned)-----------------
 	pcmpeqb    xmmH,xmmH			; xmmH=(all 1's)
 	maskmovdqu xmmA,xmmH			; movntdqu XMMWORD [edi], xmmA
 	add	edi, byte SIZEOF_XMMWORD	; outptr
 	maskmovdqu xmmD,xmmH			; movntdqu XMMWORD [edi], xmmD
 	add	edi, byte SIZEOF_XMMWORD	; outptr
 	maskmovdqu xmmF,xmmH			; movntdqu XMMWORD [edi], xmmF
 	add	edi, byte SIZEOF_XMMWORD	; outptr
 .out0:
 	sub	ecx, byte SIZEOF_XMMWORD
 	jz	near .nextrow
 	add	esi, byte SIZEOF_XMMWORD	; inptr0
 	add	ebx, byte SIZEOF_XMMWORD	; inptr1
 	add	edx, byte SIZEOF_XMMWORD	; inptr2
 	jmp	near .columnloop
 	alignx	16,7
 .column_st32:
 	pcmpeqb	xmmH,xmmH			; xmmH=(all 1's)
 	lea	ecx, [ecx+ecx*2]		; imul ecx, RGB_PIXELSIZE
 	cmp	ecx, byte 2*SIZEOF_XMMWORD
 	jb	short .column_st16
 	maskmovdqu xmmA,xmmH			; movntdqu XMMWORD [edi], xmmA
 	add	edi, byte SIZEOF_XMMWORD	; outptr
 	maskmovdqu xmmD,xmmH			; movntdqu XMMWORD [edi], xmmD
 	add	edi, byte SIZEOF_XMMWORD	; outptr
 	movdqa	xmmA,xmmF
 	sub	ecx, byte 2*SIZEOF_XMMWORD
 	jmp	short .column_st15
 .column_st16:
 	cmp	ecx, byte SIZEOF_XMMWORD
 	jb	short .column_st15
 	maskmovdqu xmmA,xmmH			; movntdqu XMMWORD [edi], xmmA
 	add	edi, byte SIZEOF_XMMWORD	; outptr
 	movdqa	xmmA,xmmD
 	sub	ecx, byte SIZEOF_XMMWORD
 .column_st15:
 	mov	eax,ecx
 	xor	ecx, byte 0x0F
 	shl	ecx, 2
 	movd	xmmB,ecx
 	psrlq	xmmH,4
 	pcmpeqb	xmmE,xmmE
 	psrlq	xmmH,xmmB
 	psrlq	xmmE,xmmB
 	punpcklbw xmmE,xmmH
 	; ----------------
 	mov	ecx,edi
 	and	ecx, byte SIZEOF_XMMWORD-1
 	jz	short .adj0
 	add	eax,ecx
 	cmp	eax, byte SIZEOF_XMMWORD
 	ja	short .adj0
 	and	edi, byte (-SIZEOF_XMMWORD)	; align to 16-byte boundary
 	shl	ecx, 3			; pslldq xmmA,ecx & pslldq xmmE,ecx
 	movdqa	xmmG,xmmA
 	movdqa	xmmC,xmmE
 	pslldq	xmmA, SIZEOF_XMMWORD/2
 	pslldq	xmmE, SIZEOF_XMMWORD/2
 	movd	xmmD,ecx
 	sub	ecx, byte (SIZEOF_XMMWORD/2)*BYTE_BIT
 	jb	short .adj1
 	movd	xmmF,ecx
 	psllq	xmmA,xmmF
 	psllq	xmmE,xmmF
 	jmp	short .adj0
 .adj1:	neg	ecx
 	movd	xmmF,ecx
 	psrlq	xmmA,xmmF
 	psrlq	xmmE,xmmF
 	psllq	xmmG,xmmD
 	psllq	xmmC,xmmD
 	por	xmmA,xmmG
 	por	xmmE,xmmC
 .adj0:	; ----------------
 	maskmovdqu xmmA,xmmE			; movntdqu XMMWORD [edi], xmmA
 %else ; RGB_PIXELSIZE == 4 ; -----------
 %ifdef RGBX_FILLER_0XFF
 	pcmpeqb   xmm6,xmm6		; xmm6=XE=X(02468ACE********)
 	pcmpeqb   xmm7,xmm7		; xmm7=XO=X(13579BDF********)
 %else
 	pxor      xmm6,xmm6		; xmm6=XE=X(02468ACE********)
 	pxor      xmm7,xmm7		; xmm7=XO=X(13579BDF********)
 %endif
 	; xmmA=(00 02 04 06 08 0A 0C 0E **), xmmB=(01 03 05 07 09 0B 0D 0F **)
 	; xmmC=(10 12 14 16 18 1A 1C 1E **), xmmD=(11 13 15 17 19 1B 1D 1F **)
 	; xmmE=(20 22 24 26 28 2A 2C 2E **), xmmF=(21 23 25 27 29 2B 2D 2F **)
 	; xmmG=(30 32 34 36 38 3A 3C 3E **), xmmH=(31 33 35 37 39 3B 3D 3F **)
 	punpcklbw xmmA,xmmC	; xmmA=(00 10 02 12 04 14 06 16 08 18 0A 1A 0C 1C 0E 1E)
 	punpcklbw xmmE,xmmG	; xmmE=(20 30 22 32 24 34 26 36 28 38 2A 3A 2C 3C 2E 3E)
 	punpcklbw xmmB,xmmD	; xmmB=(01 11 03 13 05 15 07 17 09 19 0B 1B 0D 1D 0F 1F)
 	punpcklbw xmmF,xmmH	; xmmF=(21 31 23 33 25 35 27 37 29 39 2B 3B 2D 3D 2F 3F)
 	movdqa    xmmC,xmmA
 	punpcklwd xmmA,xmmE	; xmmA=(00 10 20 30 02 12 22 32 04 14 24 34 06 16 26 36)
 	punpckhwd xmmC,xmmE	; xmmC=(08 18 28 38 0A 1A 2A 3A 0C 1C 2C 3C 0E 1E 2E 3E)
 	movdqa    xmmG,xmmB
 	punpcklwd xmmB,xmmF	; xmmB=(01 11 21 31 03 13 23 33 05 15 25 35 07 17 27 37)
 	punpckhwd xmmG,xmmF	; xmmG=(09 19 29 39 0B 1B 2B 3B 0D 1D 2D 3D 0F 1F 2F 3F)
 	movdqa    xmmD,xmmA
 	punpckldq xmmA,xmmB	; xmmA=(00 10 20 30 01 11 21 31 02 12 22 32 03 13 23 33)
 	punpckhdq xmmD,xmmB	; xmmD=(04 14 24 34 05 15 25 35 06 16 26 36 07 17 27 37)
 	movdqa    xmmH,xmmC
 	punpckldq xmmC,xmmG	; xmmC=(08 18 28 38 09 19 29 39 0A 1A 2A 3A 0B 1B 2B 3B)
 	punpckhdq xmmH,xmmG	; xmmH=(0C 1C 2C 3C 0D 1D 2D 3D 0E 1E 2E 3E 0F 1F 2F 3F)
 	cmp	ecx, byte SIZEOF_XMMWORD
 	jb	short .column_st32
 	test	edi, SIZEOF_XMMWORD-1
 	jnz	short .out1
 	; --(aligned)-------------------
 	movntdq	XMMWORD [edi+0*SIZEOF_XMMWORD], xmmA
 	movntdq	XMMWORD [edi+1*SIZEOF_XMMWORD], xmmD
 	movntdq	XMMWORD [edi+2*SIZEOF_XMMWORD], xmmC
 	movntdq	XMMWORD [edi+3*SIZEOF_XMMWORD], xmmH
 	add	edi, byte RGB_PIXELSIZE*SIZEOF_XMMWORD	; outptr
 	jmp	short .out0
 .out1:	; --(unaligned)-----------------
 	pcmpeqb    xmmE,xmmE			; xmmE=(all 1's)
 	maskmovdqu xmmA,xmmE			; movntdqu XMMWORD [edi], xmmA
 	add	edi, byte SIZEOF_XMMWORD	; outptr
 	maskmovdqu xmmD,xmmE			; movntdqu XMMWORD [edi], xmmD
 	add	edi, byte SIZEOF_XMMWORD	; outptr
 	maskmovdqu xmmC,xmmE			; movntdqu XMMWORD [edi], xmmC
 	add	edi, byte SIZEOF_XMMWORD	; outptr
 	maskmovdqu xmmH,xmmE			; movntdqu XMMWORD [edi], xmmH
 	add	edi, byte SIZEOF_XMMWORD	; outptr
 .out0:
 	sub	ecx, byte SIZEOF_XMMWORD
 	jz	near .nextrow
 	add	esi, byte SIZEOF_XMMWORD	; inptr0
 	add	ebx, byte SIZEOF_XMMWORD	; inptr1
 	add	edx, byte SIZEOF_XMMWORD	; inptr2
 	jmp	near .columnloop
 	alignx	16,7
 .column_st32:
 	pcmpeqb	xmmE,xmmE			; xmmE=(all 1's)
 	cmp	ecx, byte SIZEOF_XMMWORD/2
 	jb	short .column_st16
 	maskmovdqu xmmA,xmmE			; movntdqu XMMWORD [edi], xmmA
 	add	edi, byte SIZEOF_XMMWORD	; outptr
 	maskmovdqu xmmD,xmmE			; movntdqu XMMWORD [edi], xmmD
 	add	edi, byte SIZEOF_XMMWORD	; outptr
 	movdqa	xmmA,xmmC
 	movdqa	xmmD,xmmH
 	sub	ecx, byte SIZEOF_XMMWORD/2
 .column_st16:
 	cmp	ecx, byte SIZEOF_XMMWORD/4
 	jb	short .column_st15
 	maskmovdqu xmmA,xmmE			; movntdqu XMMWORD [edi], xmmA
 	add	edi, byte SIZEOF_XMMWORD	; outptr
 	movdqa	xmmA,xmmD
 	sub	ecx, byte SIZEOF_XMMWORD/4
 .column_st15:
 	cmp	ecx, byte SIZEOF_XMMWORD/16
 	jb	short .nextrow
 	mov	eax,ecx
 	xor	ecx, byte 0x03
 	inc	ecx
 	shl	ecx, 4
 	movd	xmmF,ecx
 	psrlq	xmmE,xmmF
 	punpcklbw xmmE,xmmE
 	; ----------------
 	mov	ecx,edi
 	and	ecx, byte SIZEOF_XMMWORD-1
 	jz	short .adj0
 	lea	eax, [ecx+eax*4]	; RGB_PIXELSIZE
 	cmp	eax, byte SIZEOF_XMMWORD
 	ja	short .adj0
 	and	edi, byte (-SIZEOF_XMMWORD)	; align to 16-byte boundary
 	shl	ecx, 3			; pslldq xmmA,ecx & pslldq xmmE,ecx
 	movdqa	xmmB,xmmA
 	movdqa	xmmG,xmmE
 	pslldq	xmmA, SIZEOF_XMMWORD/2
 	pslldq	xmmE, SIZEOF_XMMWORD/2
 	movd	xmmC,ecx
 	sub	ecx, byte (SIZEOF_XMMWORD/2)*BYTE_BIT
 	jb	short .adj1
 	movd	xmmH,ecx
 	psllq	xmmA,xmmH
 	psllq	xmmE,xmmH
 	jmp	short .adj0
 .adj1:	neg	ecx
 	movd	xmmH,ecx
 	psrlq	xmmA,xmmH
 	psrlq	xmmE,xmmH
 	psllq	xmmB,xmmC
 	psllq	xmmG,xmmC
 	por	xmmA,xmmB
 	por	xmmE,xmmG
 .adj0:	; ----------------
 	maskmovdqu xmmA,xmmE			; movntdqu XMMWORD [edi], xmmA
 %endif ; RGB_PIXELSIZE ; ---------------
 	alignx	16,7
 .nextrow:
 	pop	ecx
 	pop	esi
 	pop	ebx
 	pop	edx
 	pop	edi
 	pop	eax
 	add	esi, byte SIZEOF_JSAMPROW
 	add	ebx, byte SIZEOF_JSAMPROW
 	add	edx, byte SIZEOF_JSAMPROW
 	add	edi, byte SIZEOF_JSAMPROW	; output_buf
 	dec	eax				; num_rows
 	jg	near .rowloop
 	sfence		; flush the write buffer
 .return:
 	pop	edi
 	pop	esi
 ;	pop	edx		; need not be preserved
 ;	pop	ecx		; need not be preserved
 	pop	ebx
 	mov	esp,ebp		; esp <- aligned ebp
 	pop	esp		; esp <- original ebp
 	pop	ebp
 	ret
 %endif ; JDCOLOR_YCCRGB_SSE2_SUPPORTED
 %endif ; RGB_PIXELSIZE == 3 || RGB_PIXELSIZE == 4
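
Note on the fixed-point arithmetic in jdcolss2.asm above: the comments in the column loop rewrite each coefficient so that it fits a signed 16-bit multiplier (1.402 becomes 0.402 plus one extra Cr, and so on). The following is a scalar C sketch of that arithmetic for a single pixel, written from the constants and comments in the assembly. The function names are hypothetical, int is assumed to be at least 32 bits wide, and the code models only the arithmetic, not the even/odd lane split or the store logic.

```c
/* Constants copied from jdcolss2.asm (SCALEBITS = 16). */
#define F_0_344  22554               /* FIX(0.34414) */
#define F_0_714  46802               /* FIX(0.71414) */
#define F_1_402  91881               /* FIX(1.40200) */
#define F_1_772  116130              /* FIX(1.77200) */
#define F_0_402  (F_1_402 - 65536)   /* FIX(1.40200) - FIX(1) */
#define F_0_285  (65536 - F_0_714)   /* FIX(1) - FIX(0.71414) */
#define F_0_228  (131072 - F_1_772)  /* FIX(2) - FIX(1.77200) */

/* Arithmetic (flooring) right shift, written portably because ">>" on a
 * negative int is implementation-defined in C. */
static int sra(int v, int n)
{
  return v >= 0 ? v >> n : ~((~v) >> n);
}

/* Saturate to 0..255, as packuswb does for each word. */
static int clamp_u8(int v)
{
  return v < 0 ? 0 : (v > 255 ? 255 : v);
}

/* Hypothetical scalar model of one pixel of the SSE2 routine. */
static void ycc_to_rgb_model(int y, int cb_in, int cr_in,
                             unsigned char rgb[3])
{
  int cb = cb_in - 128;   /* paddw with 0xFF80 */
  int cr = cr_in - 128;

  /* pmulhw keeps the high 16 bits of (2*x * coefficient); adding 1 and
   * shifting right by one rounds the result to the nearest integer. */
  int r_y = sra(sra(2 * cr * F_0_402, 16) + 1, 1) + cr;
  int b_y = sra(sra(2 * cb * -F_0_228, 16) + 1, 1) + cb + cb;

  /* pmaddwd forms the 32-bit sum of both products; PD_ONEHALF rounds it. */
  int g_y = sra(cb * -F_0_344 + cr * F_0_285 + (1 << 15), 16) - cr;

  rgb[0] = (unsigned char) clamp_u8(y + r_y);
  rgb[1] = (unsigned char) clamp_u8(y + g_y);
  rgb[2] = (unsigned char) clamp_u8(y + b_y);
}
```

Calling ycc_to_rgb_model(128, 128, 255, px) gives R = 255 (saturated), G = 37 and B = 128, which agrees with the floating-point formulas in the "(Original)" comment after rounding.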
--- a/jdct.h
+++ b/jdct.h
@@ -5,6 +5,13 @@
 * This file is part of the Independent JPEG Group's software.
 * For conditions of distribution and use, see the accompanying README file.
 *
 * ---------------------------------------------------------------------
 * x86 SIMD extension for IJG JPEG library
 * Copyright (C) 1999-2006, MIYASAKA Masaru.
 * This file has been modified for SIMD extension.
 * Last Modified : January 5, 2006
 * ---------------------------------------------------------------------
 *
 * This include file contains common declarations for the forward and
 * inverse DCT modules.  These declarations are private to the DCT managers
 * (jcdctmgr.c, jddctmgr.c) and the individual DCT algorithms.
@@ -13,6 +20,13 @@
 */
 /* SIMD Ext: configuration check */
 #if BITS_IN_JSAMPLE != 8
 #error "Sorry, this SIMD code only copes with 8-bit sample values."
 #endif
 /*
 * A forward DCT routine is given a pointer to a work area of type DCTELEM[];
 * the DCT is to be performed in-place in that buffer.  Type DCTELEM is int
@@ -26,14 +40,25 @@
 * Quantization of the output coefficients is done by jcdctmgr.c.
 */
-#if BITS_IN_JSAMPLE == 8
-typedef int DCTELEM;		/* 16 or 32 bits is fine */
-#else
-typedef INT32 DCTELEM;		/* must have 32 bits */
-#endif
+/* SIMD Ext: To maximize parallelism, Type DCTELEM is changed to short
+ * (originally, int).
+ */
+typedef short DCTELEM;		/* SIMD Ext: must be short */
 typedef JMETHOD(void, forward_DCT_method_ptr, (DCTELEM * data));
 typedef JMETHOD(void, float_DCT_method_ptr, (FAST_FLOAT * data));
 typedef JMETHOD(void, convsamp_int_method_ptr,
 		(JSAMPARRAY sample_data, JDIMENSION start_col,
 		 DCTELEM * workspace));
 typedef JMETHOD(void, convsamp_float_method_ptr,
 		(JSAMPARRAY sample_data, JDIMENSION start_col,
 		 FAST_FLOAT *workspace));
 typedef JMETHOD(void, quantize_int_method_ptr,
 		(JCOEFPTR coef_block, DCTELEM * divisors,
 		 DCTELEM * workspace));
 typedef JMETHOD(void, quantize_float_method_ptr,
 		(JCOEFPTR coef_block, FAST_FLOAT * divisors,
 		 FAST_FLOAT * workspace));
 /*
@@ -49,19 +74,22 @@ typedef JMETHOD(void, float_DCT_method_ptr, (FAST_FLOAT * data));
 /* typedef inverse_DCT_method_ptr is declared in jpegint.h */
 /* SIMD Ext: To maximize parallelism, Type MULTIPLIER is changed to short.
 * Macro definitions of MULTIPLIER and FAST_FLOAT in jmorecfg.h are ignored.
 */
 #undef MULTIPLIER
 #define MULTIPLIER  short	/* SIMD Ext: must be short */
 #undef FAST_FLOAT
 #define FAST_FLOAT  float	/* SIMD Ext: must be float */
 /*
 * Each IDCT routine has its own ideas about the best dct_table element type.
 */
-typedef MULTIPLIER ISLOW_MULT_TYPE; /* short or int, whichever is faster */
-#if BITS_IN_JSAMPLE == 8
-typedef MULTIPLIER IFAST_MULT_TYPE; /* 16 bits is OK, use short if faster */
+typedef MULTIPLIER ISLOW_MULT_TYPE;	/* SIMD Ext: must be short */
+typedef MULTIPLIER IFAST_MULT_TYPE;	/* SIMD Ext: must be short */
 #define IFAST_SCALE_BITS  2	/* fractional bits in scale factors */
-#else
-typedef INT32 IFAST_MULT_TYPE;	/* need 32 bits for scaled quantizers */
-#define IFAST_SCALE_BITS  13	/* fractional bits in scale factors */
-#endif
-typedef FAST_FLOAT FLOAT_MULT_TYPE; /* preferred floating type */
+typedef FAST_FLOAT FLOAT_MULT_TYPE;	/* SIMD Ext: must be float */
 /*
@@ -81,15 +109,64 @@ typedef FAST_FLOAT FLOAT_MULT_TYPE; /* preferred floating type */
 /* Short forms of external names for systems with brain-damaged linkers. */
 #ifdef NEED_SHORT_EXTERNAL_NAMES
-#define jpeg_fdct_islow		jFDislow
-#define jpeg_fdct_ifast		jFDifast
-#define jpeg_fdct_float		jFDfloat
-#define jpeg_idct_islow		jRDislow
-#define jpeg_idct_ifast		jRDifast
-#define jpeg_idct_float		jRDfloat
-#define jpeg_idct_4x4		jRD4x4
-#define jpeg_idct_2x2		jRD2x2
-#define jpeg_idct_1x1		jRD1x1
+#define jpeg_fdct_islow		jFDislow		/* jfdctint.asm */
+#define jpeg_fdct_ifast		jFDifast		/* jfdctfst.asm */
+#define jpeg_fdct_float		jFDfloat		/* jfdctflt.asm */
+#define jpeg_fdct_islow_mmx	jFDMislow		/* jfmmxint.asm */
+#define jpeg_fdct_ifast_mmx	jFDMifast		/* jfmmxfst.asm */
+#define jpeg_fdct_float_3dnow	jFD3float		/* jf3dnflt.asm */
+#define jpeg_fdct_islow_sse2	jFDSislow		/* jfss2int.asm */
+#define jpeg_fdct_ifast_sse2	jFDSifast		/* jfss2fst.asm */
+#define jpeg_fdct_float_sse	jFDSfloat		/* jfsseflt.asm */
 #define jpeg_convsamp_int	jCnvInt			/* jcqntint.asm */
 #define jpeg_quantize_int	jQntInt			/* jcqntint.asm */
 #define jpeg_quantize_idiv	jQntIDiv		/* jcqntint.asm */
 #define jpeg_convsamp_float	jCnvFloat		/* jcqntflt.asm */
 #define jpeg_quantize_float	jQntFloat		/* jcqntflt.asm */
 #define jpeg_convsamp_int_mmx	jCnvMmx			/* jcqntmmx.asm */
 #define jpeg_quantize_int_mmx	jQntMmx			/* jcqntmmx.asm */
 #define jpeg_convsamp_flt_3dnow	jCnv3dnow		/* jcqnt3dn.asm */
 #define jpeg_quantize_flt_3dnow	jQnt3dnow		/* jcqnt3dn.asm */
 #define jpeg_convsamp_int_sse2	jCnvISse2		/* jcqnts2i.asm */
 #define jpeg_quantize_int_sse2	jQntISse2		/* jcqnts2i.asm */
 #define jpeg_convsamp_flt_sse	jCnvSse			/* jcqntsse.asm */
 #define jpeg_quantize_flt_sse	jQntSse			/* jcqntsse.asm */
 #define jpeg_convsamp_flt_sse2	jCnvFSse2		/* jcqnts2f.asm */
 #define jpeg_quantize_flt_sse2	jQntFSse2		/* jcqnts2f.asm */
 #define jpeg_idct_islow		jRDislow		/* jidctint.asm */
 #define jpeg_idct_ifast		jRDifast		/* jidctfst.asm */
 #define jpeg_idct_float		jRDfloat		/* jidctflt.asm */
 #define jpeg_idct_4x4		jRD4x4			/* jidctred.asm */
 #define jpeg_idct_2x2		jRD2x2			/* jidctred.asm */
 #define jpeg_idct_1x1		jRD1x1			/* jidctred.asm */
 #define jpeg_idct_islow_mmx	jRDMislow		/* jimmxint.asm */
 #define jpeg_idct_ifast_mmx	jRDMifast		/* jimmxfst.asm */
 #define jpeg_idct_float_3dnow	jRD3float		/* ji3dnflt.asm */
 #define jpeg_idct_4x4_mmx	jRDM4x4			/* jimmxred.asm */
 #define jpeg_idct_2x2_mmx	jRDM2x2			/* jimmxred.asm */
 #define jpeg_idct_islow_sse2	jRDSislow		/* jiss2int.asm */
 #define jpeg_idct_ifast_sse2	jRDSifast		/* jiss2fst.asm */
 #define jpeg_idct_float_sse	jRDSfloat		/* jisseflt.asm */
 #define jpeg_idct_float_sse2	jRD2float		/* jiss2flt.asm */
 #define jpeg_idct_4x4_sse2	jRDS4x4			/* jiss2red.asm */
 #define jpeg_idct_2x2_sse2	jRDS2x2			/* jiss2red.asm */
 #define jconst_fdct_float	jFCfloat		/* jfdctflt.asm */
 #define jconst_fdct_islow_mmx	jFCMislow		/* jfmmxint.asm */
 #define jconst_fdct_ifast_mmx	jFCMifast		/* jfmmxfst.asm */
 #define jconst_fdct_float_3dnow	jFC3float		/* jf3dnflt.asm */
 #define jconst_fdct_islow_sse2	jFCSislow		/* jfss2int.asm */
 #define jconst_fdct_ifast_sse2	jFCSifast		/* jfss2fst.asm */
 #define jconst_fdct_float_sse	jFCSfloat		/* jfsseflt.asm */
 #define jconst_idct_float	jRCfloat		/* jidctflt.asm */
 #define jconst_idct_islow_mmx	jRCMislow		/* jimmxint.asm */
 #define jconst_idct_ifast_mmx	jRCMifast		/* jimmxfst.asm */
 #define jconst_idct_float_3dnow	jRC3float		/* ji3dnflt.asm */
 #define jconst_idct_red_mmx	jRCMred			/* jimmxred.asm */
 #define jconst_idct_islow_sse2	jRCSislow		/* jiss2int.asm */
 #define jconst_idct_ifast_sse2	jRCSifast		/* jiss2fst.asm */
 #define jconst_idct_float_sse	jRCSfloat		/* jisseflt.asm */
 #define jconst_idct_float_sse2	jRC2float		/* jiss2flt.asm */
 #define jconst_idct_red_sse2	jRCSred			/* jiss2red.asm */
 #endif /* NEED_SHORT_EXTERNAL_NAMES */
 /* Extern declarations for the forward and inverse DCT routines. */
@@ -98,6 +175,47 @@ EXTERN(void) jpeg_fdct_islow JPP((DCTELEM * data));
 EXTERN(void) jpeg_fdct_ifast JPP((DCTELEM * data));
 EXTERN(void) jpeg_fdct_float JPP((FAST_FLOAT * data));
 EXTERN(void) jpeg_fdct_islow_mmx JPP((DCTELEM * data));
 EXTERN(void) jpeg_fdct_ifast_mmx JPP((DCTELEM * data));
 EXTERN(void) jpeg_fdct_float_3dnow JPP((FAST_FLOAT * data));
 EXTERN(void) jpeg_fdct_islow_sse2 JPP((DCTELEM * data));
 EXTERN(void) jpeg_fdct_ifast_sse2 JPP((DCTELEM * data));
 EXTERN(void) jpeg_fdct_float_sse JPP((FAST_FLOAT * data));
 EXTERN(void) jpeg_convsamp_int
    JPP((JSAMPARRAY sample_data, JDIMENSION start_col, DCTELEM * workspace));
 EXTERN(void) jpeg_quantize_int
    JPP((JCOEFPTR coef_block, DCTELEM * divisors, DCTELEM * workspace));
 EXTERN(void) jpeg_quantize_idiv
    JPP((JCOEFPTR coef_block, DCTELEM * divisors, DCTELEM * workspace));
 EXTERN(void) jpeg_convsamp_float
    JPP((JSAMPARRAY sample_data, JDIMENSION start_col, FAST_FLOAT *workspace));
 EXTERN(void) jpeg_quantize_float
    JPP((JCOEFPTR coef_block, FAST_FLOAT * divisors, FAST_FLOAT * workspace));
 EXTERN(void) jpeg_convsamp_int_mmx
    JPP((JSAMPARRAY sample_data, JDIMENSION start_col, DCTELEM * workspace));
 EXTERN(void) jpeg_quantize_int_mmx
    JPP((JCOEFPTR coef_block, DCTELEM * divisors, DCTELEM * workspace));
 EXTERN(void) jpeg_convsamp_flt_3dnow
    JPP((JSAMPARRAY sample_data, JDIMENSION start_col, FAST_FLOAT *workspace));
 EXTERN(void) jpeg_quantize_flt_3dnow
    JPP((JCOEFPTR coef_block, FAST_FLOAT * divisors, FAST_FLOAT * workspace));
 EXTERN(void) jpeg_convsamp_int_sse2
    JPP((JSAMPARRAY sample_data, JDIMENSION start_col, DCTELEM * workspace));
 EXTERN(void) jpeg_quantize_int_sse2
    JPP((JCOEFPTR coef_block, DCTELEM * divisors, DCTELEM * workspace));
 EXTERN(void) jpeg_convsamp_flt_sse
    JPP((JSAMPARRAY sample_data, JDIMENSION start_col, FAST_FLOAT *workspace));
 EXTERN(void) jpeg_quantize_flt_sse
    JPP((JCOEFPTR coef_block, FAST_FLOAT * divisors, FAST_FLOAT * workspace));
 EXTERN(void) jpeg_convsamp_flt_sse2
    JPP((JSAMPARRAY sample_data, JDIMENSION start_col, FAST_FLOAT *workspace));
 EXTERN(void) jpeg_quantize_flt_sse2
    JPP((JCOEFPTR coef_block, FAST_FLOAT * divisors, FAST_FLOAT * workspace));
 EXTERN(void) jpeg_idct_islow
    JPP((j_decompress_ptr cinfo, jpeg_component_info * compptr,
 	 JCOEFPTR coef_block, JSAMPARRAY output_buf, JDIMENSION output_col));
@@ -117,6 +235,60 @@ EXTERN(void) jpeg_idct_1x1
    JPP((j_decompress_ptr cinfo, jpeg_component_info * compptr,
 	 JCOEFPTR coef_block, JSAMPARRAY output_buf, JDIMENSION output_col));
 EXTERN(void) jpeg_idct_islow_mmx
    JPP((j_decompress_ptr cinfo, jpeg_component_info * compptr,
 	 JCOEFPTR coef_block, JSAMPARRAY output_buf, JDIMENSION output_col));
 EXTERN(void) jpeg_idct_ifast_mmx
    JPP((j_decompress_ptr cinfo, jpeg_component_info * compptr,
 	 JCOEFPTR coef_block, JSAMPARRAY output_buf, JDIMENSION output_col));
 EXTERN(void) jpeg_idct_4x4_mmx
    JPP((j_decompress_ptr cinfo, jpeg_component_info * compptr,
 	 JCOEFPTR coef_block, JSAMPARRAY output_buf, JDIMENSION output_col));
 EXTERN(void) jpeg_idct_2x2_mmx
    JPP((j_decompress_ptr cinfo, jpeg_component_info * compptr,
 	 JCOEFPTR coef_block, JSAMPARRAY output_buf, JDIMENSION output_col));
 EXTERN(void) jpeg_idct_float_3dnow
    JPP((j_decompress_ptr cinfo, jpeg_component_info * compptr,
 	 JCOEFPTR coef_block, JSAMPARRAY output_buf, JDIMENSION output_col));
 EXTERN(void) jpeg_idct_float_sse
    JPP((j_decompress_ptr cinfo, jpeg_component_info * compptr,
 	 JCOEFPTR coef_block, JSAMPARRAY output_buf, JDIMENSION output_col));
 EXTERN(void) jpeg_idct_float_sse2
    JPP((j_decompress_ptr cinfo, jpeg_component_info * compptr,
 	 JCOEFPTR coef_block, JSAMPARRAY output_buf, JDIMENSION output_col));
 EXTERN(void) jpeg_idct_islow_sse2
    JPP((j_decompress_ptr cinfo, jpeg_component_info * compptr,
 	 JCOEFPTR coef_block, JSAMPARRAY output_buf, JDIMENSION output_col));
 EXTERN(void) jpeg_idct_ifast_sse2
    JPP((j_decompress_ptr cinfo, jpeg_component_info * compptr,
 	 JCOEFPTR coef_block, JSAMPARRAY output_buf, JDIMENSION output_col));
 EXTERN(void) jpeg_idct_4x4_sse2
    JPP((j_decompress_ptr cinfo, jpeg_component_info * compptr,
 	 JCOEFPTR coef_block, JSAMPARRAY output_buf, JDIMENSION output_col));
 EXTERN(void) jpeg_idct_2x2_sse2
    JPP((j_decompress_ptr cinfo, jpeg_component_info * compptr,
 	 JCOEFPTR coef_block, JSAMPARRAY output_buf, JDIMENSION output_col));
 extern const int jconst_fdct_float[];
 extern const int jconst_fdct_islow_mmx[];
 extern const int jconst_fdct_ifast_mmx[];
 extern const int jconst_fdct_float_3dnow[];
 extern const int jconst_fdct_islow_sse2[];
 extern const int jconst_fdct_ifast_sse2[];
 extern const int jconst_fdct_float_sse[];
 extern const int jconst_idct_float[];
 extern const int jconst_idct_islow_mmx[];
 extern const int jconst_idct_ifast_mmx[];
 extern const int jconst_idct_float_3dnow[];
 extern const int jconst_idct_red_mmx[];
 extern const int jconst_idct_islow_sse2[];
 extern const int jconst_idct_ifast_sse2[];
 extern const int jconst_idct_float_sse[];
 extern const int jconst_idct_float_sse2[];
 extern const int jconst_idct_red_sse2[];
 /*
 * Macros for handling fixed-point arithmetic; these are used by many
--- a/jdct.inc
+++ b/jdct.inc
@@ -0,0 +1,125 @@
 ;
 ; jdct.inc - private declarations for forward & reverse DCT subsystems
 ;
 ; x86 SIMD extension for IJG JPEG library
 ; Copyright (C) 1999-2006, MIYASAKA Masaru.
 ; For conditions of distribution and use, see copyright notice in jsimdext.inc
 ;
 ; Last Modified : January 5, 2006
 ;
 ; [TAB8]
 ; ---- jdct.h --------------------------------------------------------------
 ;
 ; configuration check: BITS_IN_JSAMPLE==8 (8-bit sample values) is the only
 ; valid setting on this SIMD extension.
 ;
 %if BITS_IN_JSAMPLE != 8
 %error "Sorry, this SIMD code only copes with 8-bit sample values."
 %endif
 ; A forward DCT routine is given a pointer to a work area of type DCTELEM[];
 ; the DCT is to be performed in-place in that buffer.
 ; To maximize parallelism, type DCTELEM is changed to short (originally int).
 ;
 %define DCTELEM			word		; short
 %define SIZEOF_DCTELEM		SIZEOF_WORD	; sizeof(DCTELEM)
 ; To maximize parallelism, type MULTIPLIER is changed to short.
 ;
 %define MULTIPLIER		word		; short
 %define SIZEOF_MULTIPLIER	SIZEOF_WORD	; sizeof(MULTIPLIER)
 %define FAST_FLOAT		FP32		; float
 %define SIZEOF_FAST_FLOAT	SIZEOF_FP32	; sizeof(FAST_FLOAT)
 ; Each IDCT routine has its own ideas about the best dct_table element type.
 ;
 %define ISLOW_MULT_TYPE 	MULTIPLIER          ; must be short
 %define SIZEOF_ISLOW_MULT_TYPE	SIZEOF_MULTIPLIER   ; sizeof(ISLOW_MULT_TYPE)
 %define IFAST_MULT_TYPE 	MULTIPLIER          ; must be short
 %define SIZEOF_IFAST_MULT_TYPE	SIZEOF_MULTIPLIER   ; sizeof(IFAST_MULT_TYPE)
 %define IFAST_SCALE_BITS	2	; fractional bits in scale factors
 %define FLOAT_MULT_TYPE 	FAST_FLOAT          ; must be float
 %define SIZEOF_FLOAT_MULT_TYPE	SIZEOF_FAST_FLOAT   ; sizeof(FLOAT_MULT_TYPE)
 ; Each IDCT routine is responsible for range-limiting its results and
 ; converting them to unsigned form (0..MAXJSAMPLE).  The raw outputs could
 ; be quite far out of range if the input data is corrupt, so a bulletproof
 ; range-limiting step is required.  We use a mask-and-table-lookup method
 ; to do the combined operations quickly.
 ;
 %define RANGE_MASK  (MAXJSAMPLE * 4 + 3)  ; 2 bits wider than legal samples
 ; Short forms of external names for systems with brain-damaged linkers.
 ;
 %ifdef NEED_SHORT_EXTERNAL_NAMES
 %define jpeg_fdct_islow		jFDislow	; jfdctint.asm
 %define jpeg_fdct_ifast		jFDifast	; jfdctfst.asm
 %define jpeg_fdct_float		jFDfloat	; jfdctflt.asm
 %define jpeg_fdct_islow_mmx	jFDMislow	; jfmmxint.asm
 %define jpeg_fdct_ifast_mmx	jFDMifast	; jfmmxfst.asm
 %define jpeg_fdct_float_3dnow	jFD3float	; jf3dnflt.asm
 %define jpeg_fdct_islow_sse2	jFDSislow	; jfss2int.asm
 %define jpeg_fdct_ifast_sse2	jFDSifast	; jfss2fst.asm
 %define jpeg_fdct_float_sse	jFDSfloat	; jfsseflt.asm
 %define jpeg_convsamp_int	jCnvInt		; jcqntint.asm
 %define jpeg_quantize_int	jQntInt		; jcqntint.asm
 %define jpeg_quantize_idiv	jQntIDiv	; jcqntint.asm
 %define jpeg_convsamp_float	jCnvFloat	; jcqntflt.asm
 %define jpeg_quantize_float	jQntFloat	; jcqntflt.asm
 %define jpeg_convsamp_int_mmx	jCnvMmx		; jcqntmmx.asm
 %define jpeg_quantize_int_mmx	jQntMmx		; jcqntmmx.asm
 %define jpeg_convsamp_flt_3dnow	jCnv3dnow	; jcqnt3dn.asm
 %define jpeg_quantize_flt_3dnow	jQnt3dnow	; jcqnt3dn.asm
 %define jpeg_convsamp_int_sse2	jCnvISse2	; jcqnts2i.asm
 %define jpeg_quantize_int_sse2	jQntISse2	; jcqnts2i.asm
 %define jpeg_convsamp_flt_sse	jCnvSse		; jcqntsse.asm
 %define jpeg_quantize_flt_sse	jQntSse		; jcqntsse.asm
 %define jpeg_convsamp_flt_sse2	jCnvFSse2	; jcqnts2f.asm
 %define jpeg_quantize_flt_sse2	jQntFSse2	; jcqnts2f.asm
 %define jpeg_idct_islow		jRDislow	; jidctint.asm
 %define jpeg_idct_ifast		jRDifast	; jidctfst.asm
 %define jpeg_idct_float		jRDfloat	; jidctflt.asm
 %define jpeg_idct_4x4		jRD4x4		; jidctred.asm
 %define jpeg_idct_2x2		jRD2x2		; jidctred.asm
 %define jpeg_idct_1x1		jRD1x1		; jidctred.asm
 %define jpeg_idct_islow_mmx	jRDMislow	; jimmxint.asm
 %define jpeg_idct_ifast_mmx	jRDMifast	; jimmxfst.asm
 %define jpeg_idct_float_3dnow	jRD3float	; ji3dnflt.asm
 %define jpeg_idct_4x4_mmx	jRDM4x4		; jimmxred.asm
 %define jpeg_idct_2x2_mmx	jRDM2x2		; jimmxred.asm
 %define jpeg_idct_islow_sse2	jRDSislow	; jiss2int.asm
 %define jpeg_idct_ifast_sse2	jRDSifast	; jiss2fst.asm
 %define jpeg_idct_float_sse	jRDSfloat	; jisseflt.asm
 %define jpeg_idct_float_sse2	jRD2float	; jiss2flt.asm
 %define jpeg_idct_4x4_sse2	jRDS4x4		; jiss2red.asm
 %define jpeg_idct_2x2_sse2	jRDS2x2		; jiss2red.asm
 %define jconst_fdct_float	jFCfloat	; jfdctflt.asm
 %define jconst_fdct_islow_mmx	jFCMislow	; jfmmxint.asm
 %define jconst_fdct_ifast_mmx	jFCMifast	; jfmmxfst.asm
 %define jconst_fdct_float_3dnow	jFC3float	; jf3dnflt.asm
 %define jconst_fdct_islow_sse2	jFCSislow	; jfss2int.asm
 %define jconst_fdct_ifast_sse2	jFCSifast	; jfss2fst.asm
 %define jconst_fdct_float_sse	jFCSfloat	; jfsseflt.asm
 %define jconst_idct_float	jRCfloat	; jidctflt.asm
 %define jconst_idct_islow_mmx	jRCMislow	; jimmxint.asm
 %define jconst_idct_ifast_mmx	jRCMifast	; jimmxfst.asm
 %define jconst_idct_float_3dnow	jRC3float	; ji3dnflt.asm
 %define jconst_idct_red_mmx	jRCMred		; jimmxred.asm
 %define jconst_idct_islow_sse2	jRCSislow	; jiss2int.asm
 %define jconst_idct_ifast_sse2	jRCSifast	; jiss2fst.asm
 %define jconst_idct_float_sse	jRCSfloat	; jisseflt.asm
 %define jconst_idct_float_sse2	jRC2float	; jiss2flt.asm
 %define jconst_idct_red_sse2	jRCSred		; jiss2red.asm
 %endif ; NEED_SHORT_EXTERNAL_NAMES
 ; --------------------------------------------------------------------------
 %define ROW(n,b,s)		((b)+(n)*(s))
 %define COL(n,b,s)		((b)+(n)*(s)*DCTSIZE)
 %define DWBLOCK(m,n,b,s)	((b)+(m)*DCTSIZE*(s)+(n)*SIZEOF_DWORD)
 %define MMBLOCK(m,n,b,s)	((b)+(m)*DCTSIZE*(s)+(n)*SIZEOF_MMWORD)
 %define XMMBLOCK(m,n,b,s)	((b)+(m)*DCTSIZE*(s)+(n)*SIZEOF_XMMWORD)
 ; --------------------------------------------------------------------------
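The RANGE_MASK comment in jdct.inc above describes a mask-and-table-lookup range limiter. The following is a minimal C sketch of that idea, not the library's own table, which is prepared elsewhere in the IJG sources and laid out differently. The names build_range_limit, range_limit_tbl and range_limit are hypothetical, and the sketch assumes the +128 level shift is folded into the table.

```c
#define MAXJSAMPLE     255
#define CENTERJSAMPLE  128
#define RANGE_MASK     (MAXJSAMPLE * 4 + 3)   /* 1023: 2 bits wider than legal samples */

/* Hypothetical lookup table, one entry per possible masked index. */
static unsigned char range_limit_tbl[RANGE_MASK + 1];

/* Fill the table: treat each index as a signed 10-bit value, apply the
 * level shift, and clamp the result into 0..MAXJSAMPLE. */
static void build_range_limit(void)
{
  int i;

  for (i = 0; i <= RANGE_MASK; i++) {
    int v = (i <= RANGE_MASK / 2) ? i : i - (RANGE_MASK + 1);
    v += CENTERJSAMPLE;
    if (v < 0)
      v = 0;
    if (v > MAXJSAMPLE)
      v = MAXJSAMPLE;
    range_limit_tbl[i] = (unsigned char) v;
  }
}

/* Masking keeps the index inside the table even for corrupt input, so the
 * lookup can never read out of bounds.  Values outside the signed 10-bit
 * range wrap around, but still produce a legal sample.
 * (Assumes two's-complement int, as on x86.) */
static int range_limit(int x)
{
  return range_limit_tbl[x & RANGE_MASK];
}
```

Call build_range_limit() once, then range_limit(x) for each raw IDCT output. The mask replaces two compare-and-branch steps per sample with a single AND and a load.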
--- a/jddctmgr.c
+++ b/jddctmgr.c
@@ -5,6 +5,13 @@
 * This file is part of the Independent JPEG Group's software.
 * For conditions of distribution and use, see the accompanying README file.
 *
 * ---------------------------------------------------------------------
 * x86 SIMD extension for IJG JPEG library
 * Copyright (C) 1999-2006, MIYASAKA Masaru.
 * This file has been modified for SIMD extension.
 * Last Modified : December 24, 2005
 * ---------------------------------------------------------------------
 *
 * This file contains the inverse-DCT management logic.
 * This code selects a particular IDCT implementation to be used,
 * and it performs related housekeeping chores.  No code in this file
@@ -94,6 +101,7 @@ start_pass (j_decompress_ptr cinfo)
  int method = 0;
  inverse_DCT_method_ptr method_ptr = NULL;
  JQUANT_TBL * qtbl;
  unsigned int simd = jpeg_simd_support((j_common_ptr) cinfo);
  for (ci = 0, compptr = cinfo->comp_info; ci < cinfo->num_components;
       ci++, compptr++) {
@@ -105,34 +113,95 @@ start_pass (j_decompress_ptr cinfo)
      method = JDCT_ISLOW;	/* jidctred uses islow-style table */
      break;
    case 2:
 #ifdef JIDCT_INT_SSE2_SUPPORTED
      if (simd & JSIMD_SSE2 &&
          IS_CONST_ALIGNED_16(jconst_idct_red_sse2))
 	method_ptr = jpeg_idct_2x2_sse2;
      else
 #endif
 #ifdef JIDCT_INT_MMX_SUPPORTED
      if (simd & JSIMD_MMX)
 	method_ptr = jpeg_idct_2x2_mmx;
      else
 #endif
 	method_ptr = jpeg_idct_2x2;
      method = JDCT_ISLOW;	/* jidctred uses islow-style table */
      break;
    case 4:
 #ifdef JIDCT_INT_SSE2_SUPPORTED
      if (simd & JSIMD_SSE2 &&
          IS_CONST_ALIGNED_16(jconst_idct_red_sse2))
 	method_ptr = jpeg_idct_4x4_sse2;
      else
 #endif
 #ifdef JIDCT_INT_MMX_SUPPORTED
      if (simd & JSIMD_MMX)
 	method_ptr = jpeg_idct_4x4_mmx;
      else
 #endif
 	method_ptr = jpeg_idct_4x4;
      method = JDCT_ISLOW;	/* jidctred uses islow-style table */
      break;
-#endif
+#endif /* IDCT_SCALING_SUPPORTED */
    case DCTSIZE:
      switch (cinfo->dct_method) {
 #ifdef DCT_ISLOW_SUPPORTED
      case JDCT_ISLOW:
 #ifdef JIDCT_INT_SSE2_SUPPORTED
 	if (simd & JSIMD_SSE2 &&
 	    IS_CONST_ALIGNED_16(jconst_idct_islow_sse2))
 	  method_ptr = jpeg_idct_islow_sse2;
 	else
 #endif
 #ifdef JIDCT_INT_MMX_SUPPORTED
 	if (simd & JSIMD_MMX)
 	  method_ptr = jpeg_idct_islow_mmx;
 	else
 #endif
 	  method_ptr = jpeg_idct_islow;
 	method = JDCT_ISLOW;
 	break;
-#endif
+#endif /* DCT_ISLOW_SUPPORTED */
 #ifdef DCT_IFAST_SUPPORTED
      case JDCT_IFAST:
 #ifdef JIDCT_INT_SSE2_SUPPORTED
 	if (simd & JSIMD_SSE2 &&
 	    IS_CONST_ALIGNED_16(jconst_idct_ifast_sse2))
 	  method_ptr = jpeg_idct_ifast_sse2;
 	else
 #endif
 #ifdef JIDCT_INT_MMX_SUPPORTED
 	if (simd & JSIMD_MMX)
 	  method_ptr = jpeg_idct_ifast_mmx;
 	else
 #endif
 	  method_ptr = jpeg_idct_ifast;
 	method = JDCT_IFAST;
 	break;
-#endif
+#endif /* DCT_IFAST_SUPPORTED */
 #ifdef DCT_FLOAT_SUPPORTED
      case JDCT_FLOAT:
 #ifdef JIDCT_FLT_SSE_SSE2_SUPPORTED
 	if (simd & JSIMD_SSE && simd & JSIMD_SSE2 &&
 	    IS_CONST_ALIGNED_16(jconst_idct_float_sse2))
 	  method_ptr = jpeg_idct_float_sse2;
 	else
 #endif
 #ifdef JIDCT_FLT_SSE_MMX_SUPPORTED
 	if (simd & JSIMD_SSE &&
 	    IS_CONST_ALIGNED_16(jconst_idct_float_sse))
 	  method_ptr = jpeg_idct_float_sse;
 	else
 #endif
 #ifdef JIDCT_FLT_3DNOW_MMX_SUPPORTED
 	if (simd & JSIMD_3DNOW)
 	  method_ptr = jpeg_idct_float_3dnow;
 	else
 #endif
 	  method_ptr = jpeg_idct_float;
 	method = JDCT_FLOAT;
 	break;
-#endif
+#endif /* DCT_FLOAT_SUPPORTED */
      default:
 	ERREXIT(cinfo, JERR_NOT_COMPILED);
 	break;
@@ -267,3 +336,78 @@ jinit_inverse_dct (j_decompress_ptr cinfo)
    idct->cur_method[ci] = -1;
  }
 }
 #ifndef JSIMD_MODEINFO_NOT_SUPPORTED
 GLOBAL(unsigned int)
 jpeg_simd_inverse_dct (j_decompress_ptr cinfo, int method)
 {
  unsigned int simd = jpeg_simd_support((j_common_ptr) cinfo);
  switch (method) {
 #ifdef DCT_ISLOW_SUPPORTED
  case JDCT_ISLOW:
 #ifdef JIDCT_INT_SSE2_SUPPORTED
    if (simd & JSIMD_SSE2 &&
        IS_CONST_ALIGNED_16(jconst_idct_islow_sse2))
      return JSIMD_SSE2;
 #endif
 #ifdef JIDCT_INT_MMX_SUPPORTED
    if (simd & JSIMD_MMX)
      return JSIMD_MMX;
 #endif
    return JSIMD_NONE;
 #endif /* DCT_ISLOW_SUPPORTED */
 #ifdef DCT_IFAST_SUPPORTED
  case JDCT_IFAST:
 #ifdef JIDCT_INT_SSE2_SUPPORTED
    if (simd & JSIMD_SSE2 &&
        IS_CONST_ALIGNED_16(jconst_idct_ifast_sse2))
      return JSIMD_SSE2;
 #endif
 #ifdef JIDCT_INT_MMX_SUPPORTED
    if (simd & JSIMD_MMX)
      return JSIMD_MMX;
 #endif
    return JSIMD_NONE;
 #endif /* DCT_IFAST_SUPPORTED */
 #ifdef DCT_FLOAT_SUPPORTED
  case JDCT_FLOAT:
 #ifdef JIDCT_FLT_SSE_SSE2_SUPPORTED
    if (simd & JSIMD_SSE && simd & JSIMD_SSE2 &&
        IS_CONST_ALIGNED_16(jconst_idct_float_sse2))
      return JSIMD_SSE;		/* (JSIMD_SSE | JSIMD_SSE2); */
 #endif
 #ifdef JIDCT_FLT_SSE_MMX_SUPPORTED
    if (simd & JSIMD_SSE &&
        IS_CONST_ALIGNED_16(jconst_idct_float_sse))
      return JSIMD_SSE;		/* (JSIMD_SSE | JSIMD_MMX); */
 #endif
 #ifdef JIDCT_FLT_3DNOW_MMX_SUPPORTED
    if (simd & JSIMD_3DNOW)
      return JSIMD_3DNOW;	/* (JSIMD_3DNOW | JSIMD_MMX); */
 #endif
    return JSIMD_NONE;
 #endif /* DCT_FLOAT_SUPPORTED */
 #ifdef IDCT_SCALING_SUPPORTED
  case JDCT_FLOAT + 1:
 #ifdef JIDCT_INT_SSE2_SUPPORTED
    if (simd & JSIMD_SSE2 &&
        IS_CONST_ALIGNED_16(jconst_idct_red_sse2))
      return JSIMD_SSE2;
 #endif
 #ifdef JIDCT_INT_MMX_SUPPORTED
    if (simd & JSIMD_MMX)
      return JSIMD_MMX;
 #endif
    return JSIMD_NONE;
 #endif /* IDCT_SCALING_SUPPORTED */
  default:
    ;
  }
  return JSIMD_NONE;	/* not compiled */
 }
 #endif /* !JSIMD_MODEINFO_NOT_SUPPORTED */
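The selection logic in jddctmgr.c above follows one pattern throughout: try the widest SIMD variant whose CPU flag is set and whose constant table is 16-byte aligned, then fall back to MMX, then to plain C. A minimal sketch of that pattern follows. CAP_MMX, CAP_SSE2, ALIGNED_16, the kernel_* functions and pick_kernel are all hypothetical stand-ins; in particular, the definition of IS_CONST_ALIGNED_16 is not shown in this diff, so the alignment test here is an assumption.

```c
#include <stdint.h>

/* Hypothetical capability bits, standing in for JSIMD_MMX / JSIMD_SSE2. */
#define CAP_MMX   0x01u
#define CAP_SSE2  0x08u

/* Assumed alignment test: true when the low four address bits are zero. */
#define ALIGNED_16(p)  ((((uintptr_t) (p)) & 15u) == 0)

typedef int (*kernel_fn) (void);

/* Dummy kernels; each returns a tag so the caller can see which was chosen. */
static int kernel_c (void)    { return 0; }
static int kernel_mmx (void)  { return 1; }
static int kernel_sse2 (void) { return 2; }

/* Pick the best available kernel.  The SSE2 variant is used only when its
 * constant table is 16-byte aligned; otherwise selection falls through to
 * the next-best variant. */
static kernel_fn pick_kernel (unsigned int caps, const void *sse2_consts)
{
  if ((caps & CAP_SSE2) && ALIGNED_16(sse2_consts))
    return kernel_sse2;
  if (caps & CAP_MMX)
    return kernel_mmx;
  return kernel_c;
}
```

Because the choice is made once per pass and stored in a function pointer, the per-block cost is one indirect call whichever variant wins.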
--- a/jdhuff.c
+++ b/jdhuff.c
@@ -1,10 +1,17 @@
 /*
 * jdhuff.c
 *
- * Copyright (C) 1991-1996, Thomas G. Lane.
+ * Copyright (C) 1991-1997, Thomas G. Lane.
 * This file is part of the Independent JPEG Group's software.
 * For conditions of distribution and use, see the accompanying README file.
 *
 * ---------------------------------------------------------------------
 * x86 SIMD extension for IJG JPEG library
 * Copyright (C) 1999-2006, MIYASAKA Masaru.
 * This file has been modified to improve performance.
 * Last Modified : October 31, 2004
 * ---------------------------------------------------------------------
 *
 * This file contains Huffman entropy decoding routines.
 *
 * Much of the complexity here has to do with supporting input suspension.
@@ -64,6 +71,15 @@ typedef struct {
  /* Pointers to derived tables (these workspaces have image lifespan) */
  d_derived_tbl * dc_derived_tbls[NUM_HUFF_TBLS];
  d_derived_tbl * ac_derived_tbls[NUM_HUFF_TBLS];
  /* Precalculated info set up by start_pass for use in decode_mcu: */
  /* Pointers to derived tables to be used for each block within an MCU */
  d_derived_tbl * dc_cur_tbls[D_MAX_BLOCKS_IN_MCU];
  d_derived_tbl * ac_cur_tbls[D_MAX_BLOCKS_IN_MCU];
  /* Whether we care about the DC and AC coefficient values for each block */
  boolean dc_needed[D_MAX_BLOCKS_IN_MCU];
  boolean ac_needed[D_MAX_BLOCKS_IN_MCU];
 } huff_entropy_decoder;
 typedef huff_entropy_decoder * huff_entropy_ptr;
@@ -77,7 +93,7 @@ METHODDEF(void)
 start_pass_huff_decoder (j_decompress_ptr cinfo)
 {
  huff_entropy_ptr entropy = (huff_entropy_ptr) cinfo->entropy;
-  int ci, dctbl, actbl;
+  int ci, blkn, dctbl, actbl;
  jpeg_component_info * compptr;
  /* Check that the scan parameters Ss, Se, Ah/Al are OK for sequential JPEG.
@@ -92,27 +108,37 @@ start_pass_huff_decoder (j_decompress_ptr cinfo)
    compptr = cinfo->cur_comp_info[ci];
    dctbl = compptr->dc_tbl_no;
    actbl = compptr->ac_tbl_no;
    /* Make sure requested tables are present */
    if (dctbl < 0 || dctbl >= NUM_HUFF_TBLS ||
 	cinfo->dc_huff_tbl_ptrs[dctbl] == NULL)
      ERREXIT1(cinfo, JERR_NO_HUFF_TABLE, dctbl);
    if (actbl < 0 || actbl >= NUM_HUFF_TBLS ||
 	cinfo->ac_huff_tbl_ptrs[actbl] == NULL)
      ERREXIT1(cinfo, JERR_NO_HUFF_TABLE, actbl);
    /* Compute derived values for Huffman tables */
    /* We may do this more than once for a table, but it's not expensive */
-    jpeg_make_d_derived_tbl(cinfo, cinfo->dc_huff_tbl_ptrs[dctbl],
+    jpeg_make_d_derived_tbl(cinfo, TRUE, dctbl,
 			    & entropy->dc_derived_tbls[dctbl]);
-    jpeg_make_d_derived_tbl(cinfo, cinfo->ac_huff_tbl_ptrs[actbl],
+    jpeg_make_d_derived_tbl(cinfo, FALSE, actbl,
 			    & entropy->ac_derived_tbls[actbl]);
    /* Initialize DC predictions to 0 */
    entropy->saved.last_dc_val[ci] = 0;
  }
  /* Precalculate decoding info for each block in an MCU of this scan */
  for (blkn = 0; blkn < cinfo->blocks_in_MCU; blkn++) {
    ci = cinfo->MCU_membership[blkn];
    compptr = cinfo->cur_comp_info[ci];
    /* Precalculate which table to use for each block */
    entropy->dc_cur_tbls[blkn] = entropy->dc_derived_tbls[compptr->dc_tbl_no];
    entropy->ac_cur_tbls[blkn] = entropy->ac_derived_tbls[compptr->ac_tbl_no];
    /* Decide whether we really care about the coefficient values */
    if (compptr->component_needed) {
      entropy->dc_needed[blkn] = TRUE;
      /* we don't need the ACs if producing a 1/8th-size image */
      entropy->ac_needed[blkn] = (compptr->DCT_scaled_size > 1);
    } else {
      entropy->dc_needed[blkn] = entropy->ac_needed[blkn] = FALSE;
    }
  }
  /* Initialize bitread state variables */
  entropy->bitstate.bits_left = 0;
  entropy->bitstate.get_buffer = 0; /* unnecessary, but keeps Purify quiet */
-  entropy->bitstate.printed_eod = FALSE;
+  entropy->pub.insufficient_data = FALSE;
  /* Initialize restart counter */
  entropy->restarts_to_go = cinfo->restart_interval;
@@ -121,20 +147,35 @@ start_pass_huff_decoder (j_decompress_ptr cinfo)
 /*
 * Compute the derived values for a Huffman table.
 * This routine also performs some validation checks on the table.
 *
 * Note this is also used by jdphuff.c.
 */
 GLOBAL(void)
-jpeg_make_d_derived_tbl (j_decompress_ptr cinfo, JHUFF_TBL * htbl,
+jpeg_make_d_derived_tbl (j_decompress_ptr cinfo, boolean isDC, int tblno,
 			 d_derived_tbl ** pdtbl)
 {
  JHUFF_TBL *htbl;
  d_derived_tbl *dtbl;
-  int p, i, l, si;
+  int p, i, l, la, lx, si, numsymbols;
-  int lookbits, ctr;
+  int lookbits, look_end, sym, val, ctr;
  char huffsize[257];
  unsigned int huffcode[257];
  unsigned int code;
  /* Note that huffsize[] and huffcode[] are filled in code-length order,
   * paralleling the order of the symbols themselves in htbl->huffval[].
   */
  /* Find the input Huffman table */
  if (tblno < 0 || tblno >= NUM_HUFF_TBLS)
    ERREXIT1(cinfo, JERR_NO_HUFF_TABLE, tblno);
  htbl =
    isDC ? cinfo->dc_huff_tbl_ptrs[tblno] : cinfo->ac_huff_tbl_ptrs[tblno];
  if (htbl == NULL)
    ERREXIT1(cinfo, JERR_NO_HUFF_TABLE, tblno);
  /* Allocate a workspace if we haven't already done so. */
  if (*pdtbl == NULL)
    *pdtbl = (d_derived_tbl *)
@@ -144,17 +185,20 @@ jpeg_make_d_derived_tbl (j_decompress_ptr cinfo, JHUFF_TBL * htbl,
  dtbl->pub = htbl;		/* fill in back link */
  /* Figure C.1: make table of Huffman code length for each symbol */
  /* Note that this is in code-length order. */
  p = 0;
  for (l = 1; l <= 16; l++) {
-    for (i = 1; i <= (int) htbl->bits[l]; i++)
+    i = (int) htbl->bits[l];
    if (i < 0 || p + i > 256)	/* protect against table overrun */
      ERREXIT(cinfo, JERR_BAD_HUFF_TABLE);
    while (i--)
      huffsize[p++] = (char) l;
  }
  huffsize[p] = 0;
  numsymbols = p;
  /* Figure C.2: generate the codes themselves */
-  /* Note that this is in code-length order. */
+  /* We also validate that the counts represent a legal Huffman code tree. */
  code = 0;
  si = huffsize[0];
@@ -164,6 +208,11 @@ jpeg_make_d_derived_tbl (j_decompress_ptr cinfo, JHUFF_TBL * htbl,
      huffcode[p++] = code;
      code++;
    }
    /* code is now 1 more than the last code used for codelength si; but
     * it must still fit in si bits, since no code is allowed to be all ones.
     */
    if (((INT32) code) >= (((INT32) 1) << si))
      ERREXIT(cinfo, JERR_BAD_HUFF_TABLE);
    code <<= 1;
    si++;
  }
@@ -173,8 +222,10 @@ jpeg_make_d_derived_tbl (j_decompress_ptr cinfo, JHUFF_TBL * htbl,
  p = 0;
  for (l = 1; l <= 16; l++) {
    if (htbl->bits[l]) {
-      dtbl->valptr[l] = p; /* huffval[] index of 1st symbol of code length l */
-      dtbl->mincode[l] = huffcode[p]; /* minimum code of length l */
+      /* valoffset[l] = huffval[] index of 1st symbol of code length l,
+       * minus the minimum code of length l
+       */
+      dtbl->valoffset[l] = (INT32) p - (INT32) huffcode[p];
      p += htbl->bits[l];
      dtbl->maxcode[l] = huffcode[p-1]; /* maximum code of length l */
    } else {
@@ -190,21 +241,51 @@ jpeg_make_d_derived_tbl (j_decompress_ptr cinfo, JHUFF_TBL * htbl,
   * with that code.
   */
-  MEMZERO(dtbl->look_nbits, SIZEOF(dtbl->look_nbits));
+  MEMZERO(dtbl->lookx_nbits, SIZEOF(dtbl->lookx_nbits));
  p = 0;
-  for (l = 1; l <= HUFF_LOOKAHEAD; l++) {
+  for (l = 1; l <= HUFFX_LOOKAHEAD-1; l++) {
    for (i = 1; i <= (int) htbl->bits[l]; i++, p++) {
      /* l = current code's length, p = its index in huffcode[] & huffval[]. */
      /* Generate left-justified code followed by all possible bit sequences */
-      lookbits = huffcode[p] << (HUFF_LOOKAHEAD-l);
-      for (ctr = 1 << (HUFF_LOOKAHEAD-l); ctr > 0; ctr--) {
-	dtbl->look_nbits[lookbits] = l;
-	dtbl->look_sym[lookbits] = htbl->huffval[p];
+      sym = htbl->huffval[p];		/* current symbol */
+      la = sym & 15;			/* length of additional bits field */
+      lx = HUFFX_LOOKAHEAD - l;
+      lookbits = huffcode[p] << lx;
      look_end = lookbits + (1 << lx);
      lx -= la;
      while (lookbits < look_end) {
 	if (lx >= 0) {
 	  val = (lookbits >>  lx) & ((1 << la) - 1);
 	  ctr = 1 << lx;
 	} else {
 	  val = (lookbits << -lx) & ((1 << la) - 1);
 	  ctr = 1;
 	}
 	val = HUFF_EXTEND(val, la);
 	for (; ctr > 0; ctr--) {
 	  dtbl->lookx_nbits[lookbits] = l + la;
 	  dtbl->lookx_val[lookbits] = val;
 	  dtbl->lookx_sym[lookbits] = sym;
 	  lookbits++;
 	}
      }
    }
  }
  /* Validate symbols as being reasonable.
   * For AC tables, we make no check, but accept all byte values 0..255.
   * For DC tables, we require the symbols to be in range 0..15.
   * (Tighter bounds could be applied depending on the data depth and mode,
   * but this is sufficient to ensure safe decoding.)
   */
  if (isDC) {
    for (i = 0; i < numsymbols; i++) {
      int sym = htbl->huffval[i];
      if (sym < 0 || sym > 15)
 	ERREXIT(cinfo, JERR_BAD_HUFF_TABLE);
    }
  }
 }
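Two pieces of arithmetic in jpeg_make_d_derived_tbl above can be sketched in isolation: the Figure F.12 sign extension performed by HUFF_EXTEND, and the folded valoffset lookup that replaces the separate valptr and mincode arrays. This is a sketch under stated assumptions: huff_extend and symbol_index are hypothetical helper names, and huff_extend avoids left-shifting a negative value, so it is not a copy of the macro.

```c
/* Figure F.12: an s-bit magnitude field x whose top bit is 0 encodes a
 * negative value, namely x - (2^s - 1).  Otherwise x is used as is.
 * s == 0 means no additional bits, so the value is 0. */
static int huff_extend (int x, int s)
{
  if (s == 0)
    return 0;
  return (x < (1 << (s - 1))) ? x - (1 << s) + 1 : x;
}

/* valoffset folding: for codes of one length, first_index is the huffval[]
 * index of the first symbol and min_code is the smallest code.  Storing
 * (first_index - min_code) once lets the decoder find the symbol index with
 * a single addition per lookup. */
static int symbol_index (int code, int first_index, int min_code)
{
  int valoffset = first_index - min_code;	/* precomputed per code length */
  return code + valoffset;
}
```

For example, a 3-bit field holding 0 decodes to -7 and one holding 7 decodes to 7. With first_index 5 and min_code 6, code 7 maps to huffval[] index 6.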
@@ -213,23 +294,8 @@ jpeg_make_d_derived_tbl (j_decompress_ptr cinfo, JHUFF_TBL * htbl,
 * See jdhuff.h for info about usage.
 * Note: current values of get_buffer and bits_left are passed as parameters,
 * but are returned in the corresponding fields of the state struct.
- *
- * On most machines MIN_GET_BITS should be 25 to allow the full 32-bit width
- * of get_buffer to be used.  (On machines with wider words, an even larger
- * buffer could be used.)  However, on some machines 32-bit shifts are
- * quite slow and take time proportional to the number of places shifted.
- * (This is true with most PC compilers, for instance.)  In this case it may
- * be a win to set MIN_GET_BITS to the minimum value of 15.  This reduces the
- * average shift distance at the cost of more calls to jpeg_fill_bit_buffer.
  */
-#ifdef SLOW_SHIFT_32
-#define MIN_GET_BITS  15	/* minimum allowable value */
-#else
-#define MIN_GET_BITS  (BIT_BUF_SIZE-7)
-#endif
 GLOBAL(boolean)
 jpeg_fill_bit_buffer (bitread_working_state * state,
 		      register bit_buf_type get_buffer, register int bits_left,
@@ -239,33 +305,39 @@ jpeg_fill_bit_buffer (bitread_working_state * state,
  /* Copy heavily used state fields into locals (hopefully registers) */
  register const JOCTET * next_input_byte = state->next_input_byte;
  register size_t bytes_in_buffer = state->bytes_in_buffer;
-  register int c;
+  j_decompress_ptr cinfo = state->cinfo;
  /* Attempt to load at least MIN_GET_BITS bits into get_buffer. */
  /* (It is assumed that no request will be for more than that many bits.) */
  /* We fail to do so only if we hit a marker or are forced to suspend. */
+  if (cinfo->unread_marker == 0) {	/* cannot advance past a marker */
     while (bits_left < MIN_GET_BITS) {
-    /* Attempt to read a byte */
-    if (state->unread_marker != 0)
-      goto no_more_data;	/* can't advance past a marker */
+      register int c;
+      /* Attempt to read a byte */
      if (bytes_in_buffer == 0) {
-      if (! (*state->cinfo->src->fill_input_buffer) (state->cinfo))
+	if (! (*cinfo->src->fill_input_buffer) (cinfo))
 	  return FALSE;
-      next_input_byte = state->cinfo->src->next_input_byte;
+	next_input_byte = cinfo->src->next_input_byte;
-      bytes_in_buffer = state->cinfo->src->bytes_in_buffer;
+	bytes_in_buffer = cinfo->src->bytes_in_buffer;
      }
      bytes_in_buffer--;
      c = GETJOCTET(*next_input_byte++);
      /* If it's 0xFF, check and discard stuffed zero byte */
      if (c == 0xFF) {
 	/* Loop here to discard any padding FF's on terminating marker,
 	 * so that we can save a valid unread_marker value.  NOTE: we will
 	 * accept multiple FF's followed by a 0 as meaning a single FF data
 	 * byte.  This data pattern is not valid according to the standard.
 	 */
 	do {
 	  if (bytes_in_buffer == 0) {
-	  if (! (*state->cinfo->src->fill_input_buffer) (state->cinfo))
+	    if (! (*cinfo->src->fill_input_buffer) (cinfo))
 	      return FALSE;
-	  next_input_byte = state->cinfo->src->next_input_byte;
+	    next_input_byte = cinfo->src->next_input_byte;
-	  bytes_in_buffer = state->cinfo->src->bytes_in_buffer;
+	    bytes_in_buffer = cinfo->src->bytes_in_buffer;
 	  }
 	  bytes_in_buffer--;
 	  c = GETJOCTET(*next_input_byte++);
@@ -275,32 +347,44 @@ jpeg_fill_bit_buffer (bitread_working_state * state,
 	  /* Found FF/00, which represents an FF data byte */
 	  c = 0xFF;
 	} else {
-	/* Oops, it's actually a marker indicating end of compressed data. */
-	/* Better put it back for use later */
-	state->unread_marker = c;
-
-      no_more_data:
-	/* There should be enough bits still left in the data segment; */
-	/* if so, just break out of the outer while loop. */
-	if (bits_left >= nbits)
-	  break;
-	/* Uh-oh.  Report corrupted data to user and stuff zeroes into
-	 * the data stream, so that we can produce some kind of image.
-	 * Note that this code will be repeated for each byte demanded
-	 * for the rest of the segment.  We use a nonvolatile flag to ensure
-	 * that only one warning message appears.
-	 */
-	if (! *(state->printed_eod_ptr)) {
-	  WARNMS(state->cinfo, JWRN_HIT_MARKER);
-	  *(state->printed_eod_ptr) = TRUE;
-	}
-	c = 0;			/* insert a zero byte into bit buffer */
+	  /* Oops, it's actually a marker indicating end of compressed data.
+	   * Save the marker code for later use.
+	   * Fine point: it might appear that we should save the marker into
+	   * bitread working state, not straight into permanent state.  But
+	   * once we have hit a marker, we cannot need to suspend within the
+	   * current MCU, because we will read no more bytes from the data
+	   * source.  So it is OK to update permanent state right away.
+	   */
+	  cinfo->unread_marker = c;
+	  /* See if we need to insert some fake zero bits. */
+	  goto no_more_bytes;
 	}
      }
      /* OK, load c into get_buffer */
      get_buffer = (get_buffer << 8) | c;
      bits_left += 8;
    } /* end while */
  } else {
  no_more_bytes:
    /* We get here if we've read the marker that terminates the compressed
     * data segment.  There should be enough bits in the buffer register
     * to satisfy the request; if so, no problem.
     */
    if (nbits > bits_left) {
      /* Uh-oh.  Report corrupted data to user and stuff zeroes into
       * the data stream, so that we can produce some kind of image.
       * We use a nonvolatile flag to ensure that only one warning message
       * appears per data segment.
       */
      if (! cinfo->entropy->insufficient_data) {
 	WARNMS(cinfo, JWRN_HIT_MARKER);
 	cinfo->entropy->insufficient_data = TRUE;
      }
      /* Fill the buffer with zero bits */
      get_buffer <<= MIN_GET_BITS - bits_left;
      bits_left = MIN_GET_BITS;
    }
  }
  /* Unload the local registers */
@@ -353,37 +437,10 @@ jpeg_huff_decode (bitread_working_state * state,
    return 0;			/* fake a zero as the safest result */
  }
-  return htbl->pub->huffval[ htbl->valptr[l] +
-			    ((int) (code - htbl->mincode[l])) ];
+  return htbl->pub->huffval[ (int) (code + htbl->valoffset[l]) ];
 }
-/*
- * Figure F.12: extend sign bit.
- * On some machines, a shift and add will be faster than a table lookup.
- */
-#ifdef AVOID_TABLES
-#define HUFF_EXTEND(x,s)  ((x) < (1<<((s)-1)) ? (x) + (((-1)<<(s)) + 1) : (x))
-#else
-#define HUFF_EXTEND(x,s)  ((x) < extend_test[s] ? (x) + extend_offset[s] : (x))
-static const int extend_test[16] =   /* entry n is 2**(n-1) */
-  { 0, 0x0001, 0x0002, 0x0004, 0x0008, 0x0010, 0x0020, 0x0040, 0x0080,
-    0x0100, 0x0200, 0x0400, 0x0800, 0x1000, 0x2000, 0x4000 };
-static const int extend_offset[16] = /* entry n is (-1 << n) + 1 */
-  { 0, ((-1)<<1) + 1, ((-1)<<2) + 1, ((-1)<<3) + 1, ((-1)<<4) + 1,
-    ((-1)<<5) + 1, ((-1)<<6) + 1, ((-1)<<7) + 1, ((-1)<<8) + 1,
-    ((-1)<<9) + 1, ((-1)<<10) + 1, ((-1)<<11) + 1, ((-1)<<12) + 1,
-    ((-1)<<13) + 1, ((-1)<<14) + 1, ((-1)<<15) + 1 };
-#endif /* AVOID_TABLES */
 /*
 * Check for a restart marker & resynchronize decoder.
 * Returns FALSE if must suspend.
@@ -411,8 +468,13 @@ process_restart (j_decompress_ptr cinfo)
  /* Reset restart counter */
  entropy->restarts_to_go = cinfo->restart_interval;
-  /* Next segment can get another out-of-data warning */
+  /* Reset out-of-data flag, unless read_restart_marker left us smack up
-  entropy->bitstate.printed_eod = FALSE;
+   * against a marker.  In that case we will end up treating the next data
   * segment as empty, and we can avoid producing bogus output pixels by
   * leaving the flag set.
   */
  if (cinfo->unread_marker == 0)
    entropy->pub.insufficient_data = FALSE;
  return TRUE;
 }
@@ -437,14 +499,9 @@ METHODDEF(boolean)
 decode_mcu (j_decompress_ptr cinfo, JBLOCKROW *MCU_data)
 {
  huff_entropy_ptr entropy = (huff_entropy_ptr) cinfo->entropy;
-  register int s, k, r;
-  int blkn, ci;
-  JBLOCKROW block;
+  int blkn;
   BITREAD_STATE_VARS;
   savable_state state;
-  d_derived_tbl * dctbl;
-  d_derived_tbl * actbl;
-  jpeg_component_info * compptr;
  /* Process restart marker if needed; may have to suspend */
  if (cinfo->restart_interval) {
@@ -453,6 +510,11 @@ decode_mcu (j_decompress_ptr cinfo, JBLOCKROW *MCU_data)
 	return FALSE;
  }
  /* If we've run out of data, just leave the MCU set to zeroes.
   * This way, we return uniform gray for the remainder of the segment.
   */
  if (! entropy->pub.insufficient_data) {
    /* Load up working state */
    BITREAD_LOAD_STATE(cinfo,entropy->bitstate);
    ASSIGN_STATE(state, entropy->saved);
@@ -460,48 +522,140 @@ decode_mcu (j_decompress_ptr cinfo, JBLOCKROW *MCU_data)
    /* Outer loop handles each block in the MCU */
    for (blkn = 0; blkn < cinfo->blocks_in_MCU; blkn++) {
-    block = MCU_data[blkn];
-    ci = cinfo->MCU_membership[blkn];
-    compptr = cinfo->cur_comp_info[ci];
-    dctbl = entropy->dc_derived_tbls[compptr->dc_tbl_no];
-    actbl = entropy->ac_derived_tbls[compptr->ac_tbl_no];
+      JBLOCKROW block = MCU_data[blkn];
+      d_derived_tbl * dctbl = entropy->dc_cur_tbls[blkn];
+      d_derived_tbl * actbl = entropy->ac_cur_tbls[blkn];
+      register int s, k, r;
      /* Decode a single block's worth of coefficients */
      /* Section F.2.2.1: decode the DC coefficient difference */
-    HUFF_DECODE(s, br_state, dctbl, return FALSE, label1);
+      {		/* HUFFX_DECODE */
 	register int nb, look, t;
 	if (bits_left < HUFFX_LOOKAHEAD) {
 	  register const JOCTET * next_input_byte = br_state.next_input_byte;
 	  register size_t         bytes_in_buffer = br_state.bytes_in_buffer;
 	  if (cinfo->unread_marker == 0) {
 	    while (bits_left < MIN_GET_BITS) {
 	      register int c;
 	      if (bytes_in_buffer == 0 ||
 		  (c = GETJOCTET(*next_input_byte)) == 0xFF) {
 		goto label11; }
 	      bytes_in_buffer--; next_input_byte++;
 	      get_buffer = (get_buffer << 8) | c;
 	      bits_left += 8;
 	    }
 	    br_state.next_input_byte = next_input_byte;
 	    br_state.bytes_in_buffer = bytes_in_buffer;
 	  } else {
 	label11:
 	    br_state.next_input_byte = next_input_byte;
 	    br_state.bytes_in_buffer = bytes_in_buffer;
 	    if (! jpeg_fill_bit_buffer(&br_state,get_buffer,bits_left, 0)) {
 	      return FALSE; }
 	    get_buffer = br_state.get_buffer; bits_left = br_state.bits_left;
 	    if (bits_left < HUFFX_LOOKAHEAD) {
 	      nb = 1; goto label1;
 	    }
 	  }
 	}
 	look = PEEK_BITS(HUFFX_LOOKAHEAD);
 	if ((nb = dctbl->lookx_nbits[look]) != 0) {
 	  s = dctbl->lookx_val[look];
 	  if (nb <= HUFFX_LOOKAHEAD) {
 	    DROP_BITS(nb);
 	  } else {
 	    DROP_BITS(HUFFX_LOOKAHEAD);
 	    nb -= HUFFX_LOOKAHEAD;
 	    CHECK_BIT_BUFFER(br_state, nb, return FALSE);
 	    s += GET_BITS(nb);
 	  }
 	} else {
 	  nb = HUFFX_LOOKAHEAD;
      label1:
 	  if ((s=jpeg_huff_decode(&br_state,get_buffer,bits_left,dctbl,nb))
 	       < 0) { return FALSE; }
 	  get_buffer = br_state.get_buffer; bits_left = br_state.bits_left;
 	  if (s) {
 	    CHECK_BIT_BUFFER(br_state, s, return FALSE);
-      r = GET_BITS(s);
+	    t = GET_BITS(s);
-      s = HUFF_EXTEND(r, s);
+	    s = HUFF_EXTEND(t, s);
 	  }
-
+	}
-    /* Shortcut if component's values are not interesting */
+      }
-    if (! compptr->component_needed)
+      if (entropy->dc_needed[blkn]) {
-      goto skip_ACs;
 	/* Convert DC difference to actual value, update last_dc_val */
 	int ci = cinfo->MCU_membership[blkn];
 	s += state.last_dc_val[ci];
 	state.last_dc_val[ci] = s;
 	/* Output the DC coefficient (assumes jpeg_natural_order[0] = 0) */
 	(*block)[0] = (JCOEF) s;
      }
-    /* Do we need to decode the AC coefficients for this component? */
+      if (entropy->ac_needed[blkn]) {
-    if (compptr->DCT_scaled_size > 1) {
 	/* Section F.2.2.2: decode the AC coefficients */
 	/* Since zeroes are skipped, output area must be cleared beforehand */
 	for (k = 1; k < DCTSIZE2; k++) {
-	HUFF_DECODE(s, br_state, actbl, return FALSE, label2);
+	  {	/* HUFFX_DECODE */
-      
+	    register int nb, look, t;
-	r = s >> 4;
+	    if (bits_left < HUFFX_LOOKAHEAD) {
-	s &= 15;
+	      register const JOCTET * next_input_byte
-      
+					      = br_state.next_input_byte;
 	      register size_t bytes_in_buffer = br_state.bytes_in_buffer;
 	      if (cinfo->unread_marker == 0) {
 		while (bits_left < MIN_GET_BITS) {
 		  register int c;
 		  if (bytes_in_buffer == 0 ||
 		      (c = GETJOCTET(*next_input_byte)) == 0xFF) {
 		    goto label21; }
 		  bytes_in_buffer--; next_input_byte++;
 		  get_buffer = (get_buffer << 8) | c;
 		  bits_left += 8;
 		}
 		br_state.next_input_byte = next_input_byte;
 		br_state.bytes_in_buffer = bytes_in_buffer;
 	      } else {
 	    label21:
 		br_state.next_input_byte = next_input_byte;
 		br_state.bytes_in_buffer = bytes_in_buffer;
 		if (! jpeg_fill_bit_buffer(&br_state,get_buffer,bits_left,0)) {
 		  return FALSE; }
 		get_buffer = br_state.get_buffer;
 		bits_left  = br_state.bits_left;
 		if (bits_left < HUFFX_LOOKAHEAD) {
 		  nb = 1; goto label2;
 		}
 	      }
 	    }
 	    look = PEEK_BITS(HUFFX_LOOKAHEAD);
 	    if ((nb = actbl->lookx_nbits[look]) != 0) {
 	      s = actbl->lookx_val[look];
 	      r = actbl->lookx_sym[look] >> 4;
 	      if (nb <= HUFFX_LOOKAHEAD) {
 		DROP_BITS(nb);
 	      } else {
 		DROP_BITS(HUFFX_LOOKAHEAD);
 		nb -= HUFFX_LOOKAHEAD;
 		CHECK_BIT_BUFFER(br_state, nb, return FALSE);
 		s += GET_BITS(nb);
 	      }
 	    } else {
 	      nb = HUFFX_LOOKAHEAD;
 	  label2:
 	      if ((s=jpeg_huff_decode(&br_state,get_buffer,bits_left,actbl,nb))
 		   < 0) { return FALSE; }
 	      get_buffer = br_state.get_buffer; bits_left = br_state.bits_left;
 	      r = s >> 4; s &= 15;
 	      if (s) {
 		CHECK_BIT_BUFFER(br_state, s, return FALSE);
 		t = GET_BITS(s);
 		s = HUFF_EXTEND(t, s);
 	      }
 	    }
 	  }
 	  if (s) {
 	    k += r;
-	  CHECK_BIT_BUFFER(br_state, s, return FALSE);
-	  r = GET_BITS(s);
-	  s = HUFF_EXTEND(r, s);
 	    /* Output coefficient in natural (dezigzagged) order.
 	     * Note: the extra entries in jpeg_natural_order[] will save us
 	     * if k >= DCTSIZE2, which could happen if the data is corrupted.
@@ -515,20 +669,68 @@ decode_mcu (j_decompress_ptr cinfo, JBLOCKROW *MCU_data)
 	}
      } else {
 skip_ACs:
 	/* Section F.2.2.2: decode the AC coefficients */
 	/* In this path we just discard the values */
 	for (k = 1; k < DCTSIZE2; k++) {
-	HUFF_DECODE(s, br_state, actbl, return FALSE, label3);
+	  {	/* HUFFX_DECODE */
-      
+	    register int nb, look;
-	r = s >> 4;
+	    if (bits_left < HUFFX_LOOKAHEAD) {
-	s &= 15;
+	      register const JOCTET * next_input_byte
-      
+					      = br_state.next_input_byte;
 	      register size_t bytes_in_buffer = br_state.bytes_in_buffer;
 	      if (cinfo->unread_marker == 0) {
 		while (bits_left < MIN_GET_BITS) {
 		  register int c;
 		  if (bytes_in_buffer == 0 ||
 		      (c = GETJOCTET(*next_input_byte)) == 0xFF) {
 		    goto label31; }
 		  bytes_in_buffer--; next_input_byte++;
 		  get_buffer = (get_buffer << 8) | c;
 		  bits_left += 8;
 		}
 		br_state.next_input_byte = next_input_byte;
 		br_state.bytes_in_buffer = bytes_in_buffer;
 	      } else {
 	    label31:
 		br_state.next_input_byte = next_input_byte;
 		br_state.bytes_in_buffer = bytes_in_buffer;
 		if (! jpeg_fill_bit_buffer(&br_state,get_buffer,bits_left,0)) {
 		  return FALSE; }
 		get_buffer = br_state.get_buffer;
 		bits_left  = br_state.bits_left;
 		if (bits_left < HUFFX_LOOKAHEAD) {
 		  nb = 1; goto label3;
 		}
 	      }
 	    }
 	    look = PEEK_BITS(HUFFX_LOOKAHEAD);
 	    if ((nb = actbl->lookx_nbits[look]) != 0) {
 	      s = actbl->lookx_sym[look];
 	      r = s >> 4; s &= 15;
 	      if (nb <= HUFFX_LOOKAHEAD) {
 		DROP_BITS(nb);
 	      } else {
 		DROP_BITS(HUFFX_LOOKAHEAD);
 		nb -= HUFFX_LOOKAHEAD;
 		CHECK_BIT_BUFFER(br_state, nb, return FALSE);
 		DROP_BITS(nb);
 	      }
 	    } else {
 	      nb = HUFFX_LOOKAHEAD;
 	  label3:
 	      if ((s=jpeg_huff_decode(&br_state,get_buffer,bits_left,actbl,nb))
 		   < 0) { return FALSE; }
 	      get_buffer = br_state.get_buffer; bits_left = br_state.bits_left;
 	      r = s >> 4; s &= 15;
 	      if (s) {
-	  k += r;
 		CHECK_BIT_BUFFER(br_state, s, return FALSE);
 		DROP_BITS(s);
 	      }
 	    }
 	  }
 	  if (s) {
 	    k += r;
 	  } else {
 	    if (r != 15)
 	      break;
@@ -542,6 +744,7 @@ skip_ACs:
    /* Completed MCU, so update state */
    BITREAD_SAVE_STATE(cinfo,entropy->bitstate);
    ASSIGN_STATE(entropy->saved, state);
  }
  /* Account for restart interval (no-op if not using restarts) */
  entropy->restarts_to_go--;
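The AC loops in decode_mcu above split each decoded symbol into a zero run (r = s >> 4) and a size nibble (s &= 15). The following is a minimal sketch of that run/size bookkeeping only, assuming the symbols and their sign-extended values have already been decoded. The names ac_symbol and place_ac_coefficients are hypothetical and not part of the library. It writes coefficients in zigzag order, and an explicit bounds check stands in for the padded jpeg_natural_order[] table.

```c
#include <assert.h>
#include <string.h>

#define DCTSIZE2 64

/* Hypothetical stand-in for one already-decoded AC symbol: rs is the
 * run/size byte, value the sign-extended coefficient (ignored when the
 * size nibble is 0).
 */
typedef struct {
  unsigned char rs;
  int value;
} ac_symbol;

/* Fill coef[1..63] (zigzag order) from a list of symbols.
 * coef[0] (the DC term) is left untouched.
 * Returns the number of symbols consumed.
 */
static int
place_ac_coefficients (const ac_symbol *syms, int nsyms, int coef[DCTSIZE2])
{
  int k, used = 0;

  assert(nsyms >= 0);
  /* Zeroes are skipped, so the AC area must be cleared beforehand */
  memset(coef + 1, 0, (DCTSIZE2 - 1) * sizeof(int));

  for (k = 1; k < DCTSIZE2 && used < nsyms; k++) {
    int r = syms[used].rs >> 4;	/* run of zero coefficients */
    int s = syms[used].rs & 15;	/* size category of the value */
    int value = syms[used].value;
    used++;

    if (s) {
      k += r;			/* skip the zero run */
      if (k < DCTSIZE2)		/* guard against corrupted data */
	coef[k] = value;
    } else {
      if (r != 15)
	break;			/* EOB: rest of block is zero */
      k += 15;			/* ZRL: 16 zeroes, counting the loop's k++ */
    }
  }
  return used;
}
```

A caller would pass one block's worth of symbols. The function stops at the first end-of-block symbol, so trailing entries are left unconsumed.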
--- a/jdhuff.h
+++ b/jdhuff.h
@@ -1,10 +1,17 @@
 /*
 * jdhuff.h
 *
- * Copyright (C) 1991-1996, Thomas G. Lane.
+ * Copyright (C) 1991-1997, Thomas G. Lane.
 * This file is part of the Independent JPEG Group's software.
 * For conditions of distribution and use, see the accompanying README file.
 *
 * ---------------------------------------------------------------------
 * x86 SIMD extension for IJG JPEG library
 * Copyright (C) 1999-2006, MIYASAKA Masaru.
 * This file has been modified to improve performance.
 * Last Modified : October 31, 2004
 * ---------------------------------------------------------------------
 *
 * This file contains declarations for Huffman entropy decoding routines
 * that are shared between the sequential decoder (jdhuff.c) and the
 * progressive decoder (jdphuff.c).  No other modules need to see these.
@@ -21,30 +28,36 @@
 /* Derived data constructed for each Huffman table */
-#define HUFF_LOOKAHEAD	8	/* # of bits of lookahead */
+#define HUFFX_LOOKAHEAD	9	/* # of bits of lookahead */
 typedef struct {
  /* Basic tables: (element [0] of each array is unused) */
  INT32 mincode[17];		/* smallest code of length k */
  INT32 maxcode[18];		/* largest code of length k (-1 if none) */
  /* (maxcode[17] is a sentinel to ensure jpeg_huff_decode terminates) */
-  int valptr[17];		/* huffval[] index of 1st symbol of length k */
+  INT32 valoffset[17];		/* huffval[] offset for codes of length k */
+  /* valoffset[k] = huffval[] index of 1st symbol of code length k, less
+   * the smallest code of length k; so given a code of length k, the
+   * corresponding symbol is huffval[code + valoffset[k]]
+   */
  /* Link to public Huffman table (needed only in jpeg_huff_decode) */
  JHUFF_TBL *pub;
-  /* Lookahead tables: indexed by the next HUFF_LOOKAHEAD bits of
+  /* Lookahead tables: indexed by the next HUFFX_LOOKAHEAD bits of
   * the input data stream.  If the next Huffman code is no more
-   * than HUFF_LOOKAHEAD bits long, we can obtain its length and
+   * than HUFFX_LOOKAHEAD-1 bits long, we can obtain its length,
-   * the corresponding symbol directly from these tables.
+   * the corresponding symbol, and the encoded coefficient value
+   * directly from these tables.
   */
-  int look_nbits[1<<HUFF_LOOKAHEAD]; /* # bits, or 0 if too long */
+  UINT8 lookx_nbits[1<<HUFFX_LOOKAHEAD];  /* # bits, or 0 if too long */
-  UINT8 look_sym[1<<HUFF_LOOKAHEAD]; /* symbol, or unused */
+  INT16 lookx_val[1<<HUFFX_LOOKAHEAD];  /* coefficient value, or unused */
+  UINT8 lookx_sym[1<<HUFFX_LOOKAHEAD];  /* symbol, or unused */
 } d_derived_tbl;
 /* Expand a Huffman table definition into the derived format */
-EXTERN(void) jpeg_make_d_derived_tbl JPP((j_decompress_ptr cinfo,
+EXTERN(void) jpeg_make_d_derived_tbl
-				JHUFF_TBL * htbl, d_derived_tbl ** pdtbl));
+	JPP((j_decompress_ptr cinfo, boolean isDC, int tblno,
+	     d_derived_tbl ** pdtbl));
 /*
@@ -70,30 +83,43 @@ typedef INT32 bit_buf_type;	/* type of bit-extraction buffer */
 /* If long is > 32 bits on your machine, and shifting/masking longs is
 * reasonably fast, making bit_buf_type be long and setting BIT_BUF_SIZE
- * appropriately should be a win.  Unfortunately we can't do this with
+ * appropriately should be a win.  Unfortunately we can't define the size
- * something like  #define BIT_BUF_SIZE (sizeof(bit_buf_type)*8)
+ * with something like  #define BIT_BUF_SIZE (sizeof(bit_buf_type)*8)
 * because not all machines measure sizeof in 8-bit bytes.
 */
 #ifdef SLOW_SHIFT_32
 #define MIN_GET_BITS  15	/* minimum allowable value */
 #else
 #define MIN_GET_BITS  (BIT_BUF_SIZE-7)
 #endif
 /* On most machines MIN_GET_BITS should be 25 to allow the full 32-bit width
 * of get_buffer to be used.  (On machines with wider words, an even larger
 * buffer could be used.)  However, on some machines 32-bit shifts are
 * quite slow and take time proportional to the number of places shifted.
 * (This is true with most PC compilers, for instance.)  In this case it may
 * be a win to set MIN_GET_BITS to the minimum value of 15.  This reduces the
 * average shift distance at the cost of more calls to jpeg_fill_bit_buffer.
 */
 typedef struct {		/* Bitreading state saved across MCUs */
  bit_buf_type get_buffer;	/* current bit-extraction buffer */
  int bits_left;		/* # of unused bits in it */
  boolean printed_eod;		/* flag to suppress multiple warning msgs */
 } bitread_perm_state;
 typedef struct {		/* Bitreading working state within an MCU */
-  /* current data source state */
+  /* Current data source location */
  /* We need a copy, rather than munging the original, in case of suspension */
  const JOCTET * next_input_byte; /* => next byte to read from source */
  size_t bytes_in_buffer;	/* # of bytes remaining in source buffer */
-  int unread_marker;		/* nonzero if we have hit a marker */
+  /* Bit input buffer --- note these values are kept in register variables,
-  /* bit input buffer --- note these values are kept in register variables,
   * not in this struct, inside the inner loops.
   */
  bit_buf_type get_buffer;	/* current bit-extraction buffer */
  int bits_left;		/* # of unused bits in it */
-  /* pointers needed by jpeg_fill_bit_buffer */
+  /* Pointer needed by jpeg_fill_bit_buffer. */
  j_decompress_ptr cinfo;	/* back link to decompress master record */
  boolean * printed_eod_ptr;	/* => flag in permanent state */
 } bitread_working_state;
 /* Macros to declare and load/save bitread local variables. */
@@ -106,15 +132,12 @@ typedef struct {		/* Bitreading working state within an MCU */
 	br_state.cinfo = cinfop; \
 	br_state.next_input_byte = cinfop->src->next_input_byte; \
 	br_state.bytes_in_buffer = cinfop->src->bytes_in_buffer; \
-	br_state.unread_marker = cinfop->unread_marker; \
 	get_buffer = permstate.get_buffer; \
-	bits_left = permstate.bits_left; \
-	br_state.printed_eod_ptr = & permstate.printed_eod
+	bits_left = permstate.bits_left
 #define BITREAD_SAVE_STATE(cinfop,permstate)  \
 	cinfop->src->next_input_byte = br_state.next_input_byte; \
 	cinfop->src->bytes_in_buffer = br_state.bytes_in_buffer; \
-	cinfop->unread_marker = br_state.unread_marker; \
 	permstate.get_buffer = get_buffer; \
 	permstate.bits_left = bits_left
@@ -156,47 +179,14 @@ EXTERN(boolean) jpeg_fill_bit_buffer
 	JPP((bitread_working_state * state, register bit_buf_type get_buffer,
 	     register int bits_left, int nbits));
 /*
 * Code for extracting next Huffman-coded symbol from input bit stream.
 * Again, this is time-critical and we make the main paths be macros.
 *
 * We use a lookahead table to process codes of up to HUFF_LOOKAHEAD bits
 * without looping.  Usually, more than 95% of the Huffman codes will be 8
 * or fewer bits long.  The few overlength codes are handled with a loop,
 * which need not be inline code.
 *
 * Notes about the HUFF_DECODE macro:
 * 1. Near the end of the data segment, we may fail to get enough bits
 *    for a lookahead.  In that case, we do it the hard way.
 * 2. If the lookahead table contains no entry, the next code must be
 *    more than HUFF_LOOKAHEAD bits long.
 * 3. jpeg_huff_decode returns -1 if forced to suspend.
 */
 #define HUFF_DECODE(result,state,htbl,failaction,slowlabel) \
 { register int nb, look; \
  if (bits_left < HUFF_LOOKAHEAD) { \
    if (! jpeg_fill_bit_buffer(&state,get_buffer,bits_left, 0)) {failaction;} \
    get_buffer = state.get_buffer; bits_left = state.bits_left; \
    if (bits_left < HUFF_LOOKAHEAD) { \
      nb = 1; goto slowlabel; \
    } \
  } \
  look = PEEK_BITS(HUFF_LOOKAHEAD); \
  if ((nb = htbl->look_nbits[look]) != 0) { \
    DROP_BITS(nb); \
    result = htbl->look_sym[look]; \
  } else { \
    nb = HUFF_LOOKAHEAD+1; \
 slowlabel: \
    if ((result=jpeg_huff_decode(&state,get_buffer,bits_left,htbl,nb)) < 0) \
 	{ failaction; } \
    get_buffer = state.get_buffer; bits_left = state.bits_left; \
  } \
 }
 /* Out-of-line case for Huffman code fetching */
 EXTERN(int) jpeg_huff_decode
 	JPP((bitread_working_state * state, register bit_buf_type get_buffer,
 	     register int bits_left, d_derived_tbl * htbl, int min_bits));
 /*
 * Figure F.12: extend sign bit.
 */
 #define HUFF_EXTEND(x,s)  ((x) < (1<<((s)-1)) ? (x) + (((-1)<<(s)) + 1) : (x))
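The HUFF_EXTEND macro above implements the sign extension of Figure F.12: an s-bit field whose top bit is clear stands for a negative value. The following is a minimal function sketch of the same mapping. The name huff_extend is hypothetical. It avoids the macro's left shift of -1, which newer C standards leave undefined, and it assumes 1 <= s <= 15.

```c
#include <assert.h>

/* Sign-extend an s-bit magnitude field x as in JPEG Figure F.12.
 * Gives the same result as HUFF_EXTEND(x,s) for 1 <= s <= 15, but
 * never shifts a negative number.
 */
static int
huff_extend (int x, int s)
{
  assert(s >= 1 && s <= 15);
  assert(x >= 0 && x < (1 << s));

  if (x < (1 << (s - 1)))
    return x - ((1 << s) - 1);	/* top bit clear: negative value */
  return x;			/* top bit set: positive value */
}
```

For s = 3, the raw fields 0..3 map to -7..-4 and the fields 4..7 map to themselves.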
--- a/jdinput.c
+++ b/jdinput.c
@@ -1,7 +1,7 @@
 /*
 * jdinput.c
 *
- * Copyright (C) 1991-1996, Thomas G. Lane.
+ * Copyright (C) 1991-1997, Thomas G. Lane.
 * This file is part of the Independent JPEG Group's software.
 * For conditions of distribution and use, see the accompanying README file.
 *
@@ -301,7 +301,7 @@ consume_markers (j_decompress_ptr cinfo)
      initial_setup(cinfo);
      inputctl->inheaders = FALSE;
      /* Note: start_input_pass must be called by jdmaster.c
-       * before any more input can be consumed.  jdapi.c is
+       * before any more input can be consumed.  jdapimin.c is
       * responsible for enforcing this sequencing.
       */
    } else {			/* 2nd or later SOS marker */
--- a/jdmarker.c
+++ b/jdmarker.c
@@ -1,7 +1,7 @@
 /*
 * jdmarker.c
 *
- * Copyright (C) 1991-1996, Thomas G. Lane.
+ * Copyright (C) 1991-1998, Thomas G. Lane.
 * This file is part of the Independent JPEG Group's software.
 * For conditions of distribution and use, see the accompanying README file.
 *
@@ -85,6 +85,28 @@ typedef enum {			/* JPEG marker codes */
 } JPEG_MARKER;
+/* Private state */
+typedef struct {
+  struct jpeg_marker_reader pub; /* public fields */
+  /* Application-overridable marker processing methods */
+  jpeg_marker_parser_method process_COM;
+  jpeg_marker_parser_method process_APPn[16];
+  /* Limit on marker data length to save for each marker type */
+  unsigned int length_limit_COM;
+  unsigned int length_limit_APPn[16];
+  /* Status of COM/APPn marker saving */
+  jpeg_saved_marker_ptr cur_marker;	/* NULL if not processing a marker */
+  unsigned int bytes_read;		/* data bytes read so far in marker */
+  /* Note: cur_marker is not linked into marker_list until it's all read. */
+} my_marker_reader;
+typedef my_marker_reader * my_marker_ptr;
 /*
 * Macros for fetching data from the data source module.
 *
@@ -104,7 +126,7 @@ typedef enum {			/* JPEG marker codes */
 	( datasrc->next_input_byte = next_input_byte,  \
 	  datasrc->bytes_in_buffer = bytes_in_buffer )
-/* Reload the local copies --- seldom used except in MAKE_BYTE_AVAIL */
+/* Reload the local copies --- used only in MAKE_BYTE_AVAIL */
 #define INPUT_RELOAD(cinfo)  \
 	( next_input_byte = datasrc->next_input_byte,  \
 	  bytes_in_buffer = datasrc->bytes_in_buffer )
@@ -118,14 +140,14 @@ typedef enum {			/* JPEG marker codes */
 	  if (! (*datasrc->fill_input_buffer) (cinfo))  \
 	    { action; }  \
 	  INPUT_RELOAD(cinfo);  \
-	}  \
+	}
-	bytes_in_buffer--
 /* Read a byte into variable V.
 * If must suspend, take the specified action (typically "return FALSE").
 */
 #define INPUT_BYTE(cinfo,V,action)  \
 	MAKESTMT( MAKE_BYTE_AVAIL(cinfo,action); \
+		  bytes_in_buffer--; \
 		  V = GETJOCTET(*next_input_byte++); )
 /* As above, but read two bytes interpreted as an unsigned 16-bit integer.
@@ -133,8 +155,10 @@ typedef enum {			/* JPEG marker codes */
 */
 #define INPUT_2BYTES(cinfo,V,action)  \
 	MAKESTMT( MAKE_BYTE_AVAIL(cinfo,action); \
+		  bytes_in_buffer--; \
 		  V = ((unsigned int) GETJOCTET(*next_input_byte++)) << 8; \
 		  MAKE_BYTE_AVAIL(cinfo,action); \
+		  bytes_in_buffer--; \
 		  V += GETJOCTET(*next_input_byte++); )
@@ -150,11 +174,18 @@ typedef enum {			/* JPEG marker codes */
 *   marker parameters; restart point has not been moved.  Same routine
 *   will be called again after application supplies more input data.
 *
- * This approach to suspension assumes that all of a marker's parameters can
+ * This approach to suspension assumes that all of a marker's parameters
- * fit into a single input bufferload.  This should hold for "normal"
+ * can fit into a single input bufferload.  This should hold for "normal"
- * markers.  Some COM/APPn markers might have large parameter segments,
+ * markers.  Some COM/APPn markers might have large parameter segments
- * but we use skip_input_data to get past those, and thereby put the problem
+ * that might not fit.  If we are simply dropping such a marker, we use
- * on the source manager's shoulders.
+ * skip_input_data to get past it, and thereby put the problem on the
+ * source manager's shoulders.  If we are saving the marker's contents
+ * into memory, we use a slightly different convention: when forced to
+ * suspend, the marker processor updates the restart point to the end of
+ * what it's consumed (ie, the end of the buffer) before returning FALSE.
+ * On resumption, cinfo->unread_marker still contains the marker code,
+ * but the data source will point to the next chunk of marker data.
+ * The marker processor must retain internal state to deal with this.
 *
 * Note that we don't bother to avoid duplicate trace messages if a
 * suspension occurs within marker parameters.  Other side effects
@@ -188,7 +219,9 @@ get_soi (j_decompress_ptr cinfo)
  cinfo->CCIR601_sampling = FALSE; /* Assume non-CCIR sampling??? */
  cinfo->saw_JFIF_marker = FALSE;
-  cinfo->density_unit = 0;	/* set default JFIF APP0 values */
+  cinfo->JFIF_major_version = 1; /* set default JFIF APP0 values */
+  cinfo->JFIF_minor_version = 1;
+  cinfo->density_unit = 0;
  cinfo->X_density = 1;
  cinfo->Y_density = 1;
  cinfo->saw_Adobe_marker = FALSE;
@@ -280,11 +313,11 @@ get_sos (j_decompress_ptr cinfo)
  INPUT_BYTE(cinfo, n, return FALSE); /* Number of components */
+  TRACEMS1(cinfo, 1, JTRC_SOS, n);
   if (length != (n * 2 + 6) || n < 1 || n > MAX_COMPS_IN_SCAN)
     ERREXIT(cinfo, JERR_BAD_LENGTH);
-  TRACEMS1(cinfo, 1, JTRC_SOS, n);
  cinfo->comps_in_scan = n;
  /* Collect the component-spec parameters */
@@ -334,111 +367,7 @@ get_sos (j_decompress_ptr cinfo)
 }
-METHODDEF(boolean)
-get_app0 (j_decompress_ptr cinfo)
-/* Process an APP0 marker */
-{
-#define JFIF_LEN 14
-  INT32 length;
-  UINT8 b[JFIF_LEN];
-  int buffp;
-  INPUT_VARS(cinfo);
-  INPUT_2BYTES(cinfo, length, return FALSE);
-  length -= 2;
-  /* See if a JFIF APP0 marker is present */
-  if (length >= JFIF_LEN) {
-    for (buffp = 0; buffp < JFIF_LEN; buffp++)
-      INPUT_BYTE(cinfo, b[buffp], return FALSE);
-    length -= JFIF_LEN;
-    if (b[0]==0x4A && b[1]==0x46 && b[2]==0x49 && b[3]==0x46 && b[4]==0) {
-      /* Found JFIF APP0 marker: check version */
-      /* Major version must be 1, anything else signals an incompatible change.
-       * We used to treat this as an error, but now it's a nonfatal warning,
-       * because some bozo at Hijaak couldn't read the spec.
-       * Minor version should be 0..2, but process anyway if newer.
-       */
-      if (b[5] != 1)
-	WARNMS2(cinfo, JWRN_JFIF_MAJOR, b[5], b[6]);
-      else if (b[6] > 2)
-	TRACEMS2(cinfo, 1, JTRC_JFIF_MINOR, b[5], b[6]);
-      /* Save info */
-      cinfo->saw_JFIF_marker = TRUE;
-      cinfo->density_unit = b[7];
-      cinfo->X_density = (b[8] << 8) + b[9];
-      cinfo->Y_density = (b[10] << 8) + b[11];
-      TRACEMS3(cinfo, 1, JTRC_JFIF,
-	       cinfo->X_density, cinfo->Y_density, cinfo->density_unit);
-      if (b[12] | b[13])
-	TRACEMS2(cinfo, 1, JTRC_JFIF_THUMBNAIL, b[12], b[13]);
-      if (length != ((INT32) b[12] * (INT32) b[13] * (INT32) 3))
-	TRACEMS1(cinfo, 1, JTRC_JFIF_BADTHUMBNAILSIZE, (int) length);
-    } else {
-      /* Start of APP0 does not match "JFIF" */
-      TRACEMS1(cinfo, 1, JTRC_APP0, (int) length + JFIF_LEN);
-    }
-  } else {
-    /* Too short to be JFIF marker */
-    TRACEMS1(cinfo, 1, JTRC_APP0, (int) length);
-  }
-  INPUT_SYNC(cinfo);
-  if (length > 0)		/* skip any remaining data -- could be lots */
-    (*cinfo->src->skip_input_data) (cinfo, (long) length);
-  return TRUE;
-}
-METHODDEF(boolean)
-get_app14 (j_decompress_ptr cinfo)
-/* Process an APP14 marker */
-{
-#define ADOBE_LEN 12
-  INT32 length;
-  UINT8 b[ADOBE_LEN];
-  int buffp;
-  unsigned int version, flags0, flags1, transform;
-  INPUT_VARS(cinfo);
-  INPUT_2BYTES(cinfo, length, return FALSE);
-  length -= 2;
-  /* See if an Adobe APP14 marker is present */
-  if (length >= ADOBE_LEN) {
-    for (buffp = 0; buffp < ADOBE_LEN; buffp++)
-      INPUT_BYTE(cinfo, b[buffp], return FALSE);
-    length -= ADOBE_LEN;
-    if (b[0]==0x41 && b[1]==0x64 && b[2]==0x6F && b[3]==0x62 && b[4]==0x65) {
-      /* Found Adobe APP14 marker */
-      version = (b[5] << 8) + b[6];
-      flags0 = (b[7] << 8) + b[8];
-      flags1 = (b[9] << 8) + b[10];
-      transform = b[11];
-      TRACEMS4(cinfo, 1, JTRC_ADOBE, version, flags0, flags1, transform);
-      cinfo->saw_Adobe_marker = TRUE;
-      cinfo->Adobe_transform = (UINT8) transform;
-    } else {
-      /* Start of APP14 does not match "Adobe" */
-      TRACEMS1(cinfo, 1, JTRC_APP14, (int) length + ADOBE_LEN);
-    }
-  } else {
-    /* Too short to be Adobe marker */
-    TRACEMS1(cinfo, 1, JTRC_APP14, (int) length);
-  }
-  INPUT_SYNC(cinfo);
-  if (length > 0)		/* skip any remaining data -- could be lots */
-    (*cinfo->src->skip_input_data) (cinfo, (long) length);
-  return TRUE;
-}
+#ifdef D_ARITH_CODING_SUPPORTED
 LOCAL(boolean)
 get_dac (j_decompress_ptr cinfo)
@@ -472,10 +401,19 @@ get_dac (j_decompress_ptr cinfo)
    }
  }
+  if (length != 0)
+    ERREXIT(cinfo, JERR_BAD_LENGTH);
   INPUT_SYNC(cinfo);
   return TRUE;
 }
+#else /* ! D_ARITH_CODING_SUPPORTED */
+#define get_dac(cinfo)  skip_variable(cinfo)
+#endif /* D_ARITH_CODING_SUPPORTED */
 LOCAL(boolean)
 get_dht (j_decompress_ptr cinfo)
@@ -491,7 +429,7 @@ get_dht (j_decompress_ptr cinfo)
  INPUT_2BYTES(cinfo, length, return FALSE);
  length -= 2;
-  while (length > 0) {
+  while (length > 16) {
    INPUT_BYTE(cinfo, index, return FALSE);
    TRACEMS1(cinfo, 1, JTRC_DHT, index);
@@ -512,8 +450,11 @@ get_dht (j_decompress_ptr cinfo)
 	     bits[9], bits[10], bits[11], bits[12],
 	     bits[13], bits[14], bits[15], bits[16]);
+    /* Here we just do minimal validation of the counts to avoid walking
+     * off the end of our table space.  jdhuff.c will check more carefully.
+     */
    if (count > 256 || ((INT32) count) > length)
-      ERREXIT(cinfo, JERR_DHT_COUNTS);
+      ERREXIT(cinfo, JERR_BAD_HUFF_TABLE);
    for (i = 0; i < count; i++)
      INPUT_BYTE(cinfo, huffval[i], return FALSE);
@@ -537,6 +478,9 @@ get_dht (j_decompress_ptr cinfo)
    MEMCOPY((*htblptr)->huffval, huffval, SIZEOF((*htblptr)->huffval));
  }
+  if (length != 0)
+    ERREXIT(cinfo, JERR_BAD_LENGTH);
  INPUT_SYNC(cinfo);
  return TRUE;
 }
@@ -592,6 +536,9 @@ get_dqt (j_decompress_ptr cinfo)
    if (prec) length -= DCTSIZE2;
  }
+  if (length != 0)
+    ERREXIT(cinfo, JERR_BAD_LENGTH);
  INPUT_SYNC(cinfo);
  return TRUE;
 }
@@ -621,6 +568,279 @@ get_dri (j_decompress_ptr cinfo)
 }
 /*
 * Routines for processing APPn and COM markers.
 * These are either saved in memory or discarded, per application request.
 * APP0 and APP14 are specially checked to see if they are
 * JFIF and Adobe markers, respectively.
 */
 #define APP0_DATA_LEN	14	/* Length of interesting data in APP0 */
 #define APP14_DATA_LEN	12	/* Length of interesting data in APP14 */
 #define APPN_DATA_LEN	14	/* Must be the largest of the above!! */
 LOCAL(void)
 examine_app0 (j_decompress_ptr cinfo, JOCTET FAR * data,
 	      unsigned int datalen, INT32 remaining)
 /* Examine first few bytes from an APP0.
 * Take appropriate action if it is a JFIF marker.
 * datalen is # of bytes at data[], remaining is length of rest of marker data.
 */
 {
  INT32 totallen = (INT32) datalen + remaining;
  if (datalen >= APP0_DATA_LEN &&
      GETJOCTET(data[0]) == 0x4A &&
      GETJOCTET(data[1]) == 0x46 &&
      GETJOCTET(data[2]) == 0x49 &&
      GETJOCTET(data[3]) == 0x46 &&
      GETJOCTET(data[4]) == 0) {
    /* Found JFIF APP0 marker: save info */
    cinfo->saw_JFIF_marker = TRUE;
    cinfo->JFIF_major_version = GETJOCTET(data[5]);
    cinfo->JFIF_minor_version = GETJOCTET(data[6]);
    cinfo->density_unit = GETJOCTET(data[7]);
    cinfo->X_density = (GETJOCTET(data[8]) << 8) + GETJOCTET(data[9]);
    cinfo->Y_density = (GETJOCTET(data[10]) << 8) + GETJOCTET(data[11]);
    /* Check version.
     * Major version must be 1, anything else signals an incompatible change.
     * (We used to treat this as an error, but now it's a nonfatal warning,
     * because some bozo at Hijaak couldn't read the spec.)
     * Minor version should be 0..2, but process anyway if newer.
     */
    if (cinfo->JFIF_major_version != 1)
      WARNMS2(cinfo, JWRN_JFIF_MAJOR,
 	      cinfo->JFIF_major_version, cinfo->JFIF_minor_version);
    /* Generate trace messages */
    TRACEMS5(cinfo, 1, JTRC_JFIF,
 	     cinfo->JFIF_major_version, cinfo->JFIF_minor_version,
 	     cinfo->X_density, cinfo->Y_density, cinfo->density_unit);
    /* Validate thumbnail dimensions and issue appropriate messages */
    if (GETJOCTET(data[12]) | GETJOCTET(data[13]))
      TRACEMS2(cinfo, 1, JTRC_JFIF_THUMBNAIL,
 	       GETJOCTET(data[12]), GETJOCTET(data[13]));
    totallen -= APP0_DATA_LEN;
    if (totallen !=
 	((INT32)GETJOCTET(data[12]) * (INT32)GETJOCTET(data[13]) * (INT32) 3))
      TRACEMS1(cinfo, 1, JTRC_JFIF_BADTHUMBNAILSIZE, (int) totallen);
  } else if (datalen >= 6 &&
      GETJOCTET(data[0]) == 0x4A &&
      GETJOCTET(data[1]) == 0x46 &&
      GETJOCTET(data[2]) == 0x58 &&
      GETJOCTET(data[3]) == 0x58 &&
      GETJOCTET(data[4]) == 0) {
    /* Found JFIF "JFXX" extension APP0 marker */
    /* The library doesn't actually do anything with these,
     * but we try to produce a helpful trace message.
     */
    switch (GETJOCTET(data[5])) {
    case 0x10:
      TRACEMS1(cinfo, 1, JTRC_THUMB_JPEG, (int) totallen);
      break;
    case 0x11:
      TRACEMS1(cinfo, 1, JTRC_THUMB_PALETTE, (int) totallen);
      break;
    case 0x13:
      TRACEMS1(cinfo, 1, JTRC_THUMB_RGB, (int) totallen);
      break;
    default:
      TRACEMS2(cinfo, 1, JTRC_JFIF_EXTENSION,
 	       GETJOCTET(data[5]), (int) totallen);
      break;
    }
  } else {
    /* Start of APP0 does not match "JFIF" or "JFXX", or too short */
    TRACEMS1(cinfo, 1, JTRC_APP0, (int) totallen);
  }
 }
 LOCAL(void)
 examine_app14 (j_decompress_ptr cinfo, JOCTET FAR * data,
 	       unsigned int datalen, INT32 remaining)
 /* Examine first few bytes from an APP14.
 * Take appropriate action if it is an Adobe marker.
 * datalen is # of bytes at data[], remaining is length of rest of marker data.
 */
 {
  unsigned int version, flags0, flags1, transform;
  if (datalen >= APP14_DATA_LEN &&
      GETJOCTET(data[0]) == 0x41 &&
      GETJOCTET(data[1]) == 0x64 &&
      GETJOCTET(data[2]) == 0x6F &&
      GETJOCTET(data[3]) == 0x62 &&
      GETJOCTET(data[4]) == 0x65) {
    /* Found Adobe APP14 marker */
    version = (GETJOCTET(data[5]) << 8) + GETJOCTET(data[6]);
    flags0 = (GETJOCTET(data[7]) << 8) + GETJOCTET(data[8]);
    flags1 = (GETJOCTET(data[9]) << 8) + GETJOCTET(data[10]);
    transform = GETJOCTET(data[11]);
    TRACEMS4(cinfo, 1, JTRC_ADOBE, version, flags0, flags1, transform);
    cinfo->saw_Adobe_marker = TRUE;
    cinfo->Adobe_transform = (UINT8) transform;
  } else {
    /* Start of APP14 does not match "Adobe", or too short */
    TRACEMS1(cinfo, 1, JTRC_APP14, (int) (datalen + remaining));
  }
 }
 METHODDEF(boolean)
 get_interesting_appn (j_decompress_ptr cinfo)
 /* Process an APP0 or APP14 marker without saving it */
 {
  INT32 length;
  JOCTET b[APPN_DATA_LEN];
  unsigned int i, numtoread;
  INPUT_VARS(cinfo);
  INPUT_2BYTES(cinfo, length, return FALSE);
  length -= 2;
  /* get the interesting part of the marker data */
  if (length >= APPN_DATA_LEN)
    numtoread = APPN_DATA_LEN;
  else if (length > 0)
    numtoread = (unsigned int) length;
  else
    numtoread = 0;
  for (i = 0; i < numtoread; i++)
    INPUT_BYTE(cinfo, b[i], return FALSE);
  length -= numtoread;
  /* process it */
  switch (cinfo->unread_marker) {
  case M_APP0:
    examine_app0(cinfo, (JOCTET FAR *) b, numtoread, length);
    break;
  case M_APP14:
    examine_app14(cinfo, (JOCTET FAR *) b, numtoread, length);
    break;
  default:
    /* can't get here unless jpeg_save_markers chooses wrong processor */
    ERREXIT1(cinfo, JERR_UNKNOWN_MARKER, cinfo->unread_marker);
    break;
  }
  /* skip any remaining data -- could be lots */
  INPUT_SYNC(cinfo);
  if (length > 0)
    (*cinfo->src->skip_input_data) (cinfo, (long) length);
  return TRUE;
 }
 #ifdef SAVE_MARKERS_SUPPORTED
 METHODDEF(boolean)
 save_marker (j_decompress_ptr cinfo)
 /* Save an APPn or COM marker into the marker list */
 {
  my_marker_ptr marker = (my_marker_ptr) cinfo->marker;
  jpeg_saved_marker_ptr cur_marker = marker->cur_marker;
  unsigned int bytes_read, data_length;
  JOCTET FAR * data;
  INT32 length = 0;
  INPUT_VARS(cinfo);
  if (cur_marker == NULL) {
    /* begin reading a marker */
    INPUT_2BYTES(cinfo, length, return FALSE);
    length -= 2;
    if (length >= 0) {		/* watch out for bogus length word */
      /* figure out how much we want to save */
      unsigned int limit;
      if (cinfo->unread_marker == (int) M_COM)
 	limit = marker->length_limit_COM;
      else
 	limit = marker->length_limit_APPn[cinfo->unread_marker - (int) M_APP0];
      if ((unsigned int) length < limit)
 	limit = (unsigned int) length;
      /* allocate and initialize the marker item */
      cur_marker = (jpeg_saved_marker_ptr)
 	(*cinfo->mem->alloc_large) ((j_common_ptr) cinfo, JPOOL_IMAGE,
 				    SIZEOF(struct jpeg_marker_struct) + limit);
      cur_marker->next = NULL;
      cur_marker->marker = (UINT8) cinfo->unread_marker;
      cur_marker->original_length = (unsigned int) length;
      cur_marker->data_length = limit;
      /* data area is just beyond the jpeg_marker_struct */
      data = cur_marker->data = (JOCTET FAR *) (cur_marker + 1);
      marker->cur_marker = cur_marker;
      marker->bytes_read = 0;
      bytes_read = 0;
      data_length = limit;
    } else {
      /* deal with bogus length word */
      bytes_read = data_length = 0;
      data = NULL;
    }
  } else {
    /* resume reading a marker */
    bytes_read = marker->bytes_read;
    data_length = cur_marker->data_length;
    data = cur_marker->data + bytes_read;
  }
  while (bytes_read < data_length) {
    INPUT_SYNC(cinfo);		/* move the restart point to here */
    marker->bytes_read = bytes_read;
    /* If there's not at least one byte in buffer, suspend */
    MAKE_BYTE_AVAIL(cinfo, return FALSE);
    /* Copy bytes with reasonable rapidity */
    while (bytes_read < data_length && bytes_in_buffer > 0) {
      *data++ = *next_input_byte++;
      bytes_in_buffer--;
      bytes_read++;
    }
  }
  /* Done reading what we want to read */
  if (cur_marker != NULL) {	/* will be NULL if bogus length word */
    /* Add new marker to end of list */
    if (cinfo->marker_list == NULL) {
      cinfo->marker_list = cur_marker;
    } else {
      jpeg_saved_marker_ptr prev = cinfo->marker_list;
      while (prev->next != NULL)
 	prev = prev->next;
      prev->next = cur_marker;
    }
    /* Reset pointer & calc remaining data length */
    data = cur_marker->data;
    length = cur_marker->original_length - data_length;
  }
  /* Reset to initial state for next marker */
  marker->cur_marker = NULL;
  /* Process the marker if interesting; else just make a generic trace msg */
  switch (cinfo->unread_marker) {
  case M_APP0:
    examine_app0(cinfo, data, data_length, length);
    break;
  case M_APP14:
    examine_app14(cinfo, data, data_length, length);
    break;
  default:
    TRACEMS2(cinfo, 1, JTRC_MISC_MARKER, cinfo->unread_marker,
 	     (int) (data_length + length));
    break;
  }
  /* skip any remaining data -- could be lots */
  INPUT_SYNC(cinfo);		/* do before skip_input_data */
  if (length > 0)
    (*cinfo->src->skip_input_data) (cinfo, (long) length);
  return TRUE;
 }
 #endif /* SAVE_MARKERS_SUPPORTED */
 METHODDEF(boolean)
 skip_variable (j_decompress_ptr cinfo)
 /* Skip over an unknown or uninteresting variable-length marker */
@@ -629,11 +849,13 @@ skip_variable (j_decompress_ptr cinfo)
  INPUT_VARS(cinfo);
  INPUT_2BYTES(cinfo, length, return FALSE);
  length -= 2;
  TRACEMS2(cinfo, 1, JTRC_MISC_MARKER, cinfo->unread_marker, (int) length);
  INPUT_SYNC(cinfo);		/* do before skip_input_data */
-  (*cinfo->src->skip_input_data) (cinfo, (long) length - 2L);
+  if (length > 0)
    (*cinfo->src->skip_input_data) (cinfo, (long) length);
  return TRUE;
 }
@@ -833,12 +1055,13 @@ read_markers (j_decompress_ptr cinfo)
    case M_APP13:
    case M_APP14:
    case M_APP15:
-      if (! (*cinfo->marker->process_APPn[cinfo->unread_marker - (int) M_APP0]) (cinfo))
+      if (! (*((my_marker_ptr) cinfo->marker)->process_APPn[
 		cinfo->unread_marker - (int) M_APP0]) (cinfo))
 	return JPEG_SUSPENDED;
      break;
    case M_COM:
-      if (! (*cinfo->marker->process_COM) (cinfo))
+      if (! (*((my_marker_ptr) cinfo->marker)->process_COM) (cinfo))
 	return JPEG_SUSPENDED;
      break;
@@ -1018,12 +1241,15 @@ jpeg_resync_to_restart (j_decompress_ptr cinfo, int desired)
 METHODDEF(void)
 reset_marker_reader (j_decompress_ptr cinfo)
 {
  my_marker_ptr marker = (my_marker_ptr) cinfo->marker;
  cinfo->comp_info = NULL;		/* until allocated by get_sof */
  cinfo->input_scan_number = 0;		/* no SOS seen yet */
  cinfo->unread_marker = 0;		/* no pending marker */
-  cinfo->marker->saw_SOI = FALSE;	/* set internal state too */
+  marker->pub.saw_SOI = FALSE;		/* set internal state too */
-  cinfo->marker->saw_SOF = FALSE;
+  marker->pub.saw_SOF = FALSE;
-  cinfo->marker->discarded_bytes = 0;
+  marker->pub.discarded_bytes = 0;
  marker->cur_marker = NULL;
 }
@@ -1035,21 +1261,100 @@ reset_marker_reader (j_decompress_ptr cinfo)
 GLOBAL(void)
 jinit_marker_reader (j_decompress_ptr cinfo)
 {
  my_marker_ptr marker;
  int i;
  /* Create subobject in permanent pool */
-  cinfo->marker = (struct jpeg_marker_reader *)
+  marker = (my_marker_ptr)
    (*cinfo->mem->alloc_small) ((j_common_ptr) cinfo, JPOOL_PERMANENT,
-				SIZEOF(struct jpeg_marker_reader));
+				SIZEOF(my_marker_reader));
-  /* Initialize method pointers */
+  cinfo->marker = (struct jpeg_marker_reader *) marker;
-  cinfo->marker->reset_marker_reader = reset_marker_reader;
+  /* Initialize public method pointers */
-  cinfo->marker->read_markers = read_markers;
+  marker->pub.reset_marker_reader = reset_marker_reader;
-  cinfo->marker->read_restart_marker = read_restart_marker;
+  marker->pub.read_markers = read_markers;
-  cinfo->marker->process_COM = skip_variable;
+  marker->pub.read_restart_marker = read_restart_marker;
-  for (i = 0; i < 16; i++)
+  /* Initialize COM/APPn processing.
-    cinfo->marker->process_APPn[i] = skip_variable;
+   * By default, we examine and then discard APP0 and APP14,
-  cinfo->marker->process_APPn[0] = get_app0;
+   * but simply discard COM and all other APPn.
-  cinfo->marker->process_APPn[14] = get_app14;
+   */
  marker->process_COM = skip_variable;
  marker->length_limit_COM = 0;
  for (i = 0; i < 16; i++) {
    marker->process_APPn[i] = skip_variable;
    marker->length_limit_APPn[i] = 0;
  }
  marker->process_APPn[0] = get_interesting_appn;
  marker->process_APPn[14] = get_interesting_appn;
  /* Reset marker processing state */
  reset_marker_reader(cinfo);
 }
 /*
 * Control saving of COM and APPn markers into marker_list.
 */
 #ifdef SAVE_MARKERS_SUPPORTED
 GLOBAL(void)
 jpeg_save_markers (j_decompress_ptr cinfo, int marker_code,
 		   unsigned int length_limit)
 {
  my_marker_ptr marker = (my_marker_ptr) cinfo->marker;
  long maxlength;
  jpeg_marker_parser_method processor;
  /* Length limit mustn't be larger than what we can allocate
   * (should only be a concern in a 16-bit environment).
   */
  maxlength = cinfo->mem->max_alloc_chunk - SIZEOF(struct jpeg_marker_struct);
  if (((long) length_limit) > maxlength)
    length_limit = (unsigned int) maxlength;
  /* Choose processor routine to use.
   * APP0/APP14 have special requirements.
   */
  if (length_limit) {
    processor = save_marker;
    /* If saving APP0/APP14, save at least enough for our internal use. */
    if (marker_code == (int) M_APP0 && length_limit < APP0_DATA_LEN)
      length_limit = APP0_DATA_LEN;
    else if (marker_code == (int) M_APP14 && length_limit < APP14_DATA_LEN)
      length_limit = APP14_DATA_LEN;
  } else {
    processor = skip_variable;
    /* If discarding APP0/APP14, use our regular on-the-fly processor. */
    if (marker_code == (int) M_APP0 || marker_code == (int) M_APP14)
      processor = get_interesting_appn;
  }
  if (marker_code == (int) M_COM) {
    marker->process_COM = processor;
    marker->length_limit_COM = length_limit;
  } else if (marker_code >= (int) M_APP0 && marker_code <= (int) M_APP15) {
    marker->process_APPn[marker_code - (int) M_APP0] = processor;
    marker->length_limit_APPn[marker_code - (int) M_APP0] = length_limit;
  } else
    ERREXIT1(cinfo, JERR_UNKNOWN_MARKER, marker_code);
 }
 #endif /* SAVE_MARKERS_SUPPORTED */
 /*
 * Install a special processing method for COM or APPn markers.
 */
 GLOBAL(void)
 jpeg_set_marker_processor (j_decompress_ptr cinfo, int marker_code,
 			   jpeg_marker_parser_method routine)
 {
  my_marker_ptr marker = (my_marker_ptr) cinfo->marker;
  if (marker_code == (int) M_COM)
    marker->process_COM = routine;
  else if (marker_code >= (int) M_APP0 && marker_code <= (int) M_APP15)
    marker->process_APPn[marker_code - (int) M_APP0] = routine;
  else
    ERREXIT1(cinfo, JERR_UNKNOWN_MARKER, marker_code);
 }
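
Review note on the save_marker hunk above: the length handling can be followed in isolation. The sketch below is a hypothetical standalone helper (split_marker_length and marker_split are made-up names, not part of the IJG API) that reproduces only the arithmetic: drop the 2-byte length word, clamp the saved portion to the limit configured through jpeg_save_markers, and report how many bytes are left for skip_input_data. It assumes a bogus length word saves and skips nothing, which matches the data_length = 0 and "if (length > 0)" paths in the patch.

```c
#include <assert.h>

/* Hypothetical illustration only; not part of the IJG library. */
typedef struct {
  unsigned int saved;   /* bytes copied into the saved marker (data_length) */
  long skipped;         /* bytes left over for skip_input_data */
} marker_split;

/* length_word: raw 16-bit length read from the stream (counts itself).
 * limit:       per-marker length limit, as set via jpeg_save_markers. */
static marker_split
split_marker_length (long length_word, unsigned int limit)
{
  marker_split r;
  long length;

  assert(length_word >= 0 && length_word <= 0xFFFF);
  length = length_word - 2;            /* payload length */

  r.saved = 0;
  r.skipped = 0;
  if (length >= 0) {                   /* otherwise: bogus length word */
    if ((unsigned long) length < limit)
      limit = (unsigned int) length;   /* never save more than is present */
    r.saved = limit;
    r.skipped = length - (long) limit; /* remainder is skipped, not stored */
  }
  return r;
}
```

For example, a length word of 100 with a limit of 14 saves 14 bytes and skips the remaining 84, while a length word of 10 saves all 8 payload bytes and skips none.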
--- a/jdmaster.c
+++ b/jdmaster.c
@@ -1,7 +1,7 @@
 /*
 * jdmaster.c
 *
- * Copyright (C) 1991-1996, Thomas G. Lane.
+ * Copyright (C) 1991-1997, Thomas G. Lane.
 * This file is part of the Independent JPEG Group's software.
 * For conditions of distribution and use, see the accompanying README file.
 *
@@ -84,8 +84,10 @@ GLOBAL(void)
 jpeg_calc_output_dimensions (j_decompress_ptr cinfo)
 /* Do computations that are needed before master selection phase */
 {
 #ifdef IDCT_SCALING_SUPPORTED
  int ci;
  jpeg_component_info *compptr;
 #endif
  /* Prevent application from calling me at wrong times */
  if (cinfo->global_state != DSTATE_READY)
@@ -429,7 +431,7 @@ master_selection (j_decompress_ptr cinfo)
 * modules will be active during this pass and give them appropriate
 * start_pass calls.  We also set is_dummy_pass to indicate whether this
 * is a "real" output pass or a dummy pass for color quantization.
- * (In the latter case, jdapi.c will crank the pass to completion.)
+ * (In the latter case, jdapistd.c will crank the pass to completion.)
 */
 METHODDEF(void)
--- a/jdmerge.c
+++ b/jdmerge.c
@@ -5,6 +5,13 @@
 * This file is part of the Independent JPEG Group's software.
 * For conditions of distribution and use, see the accompanying README file.
 *
 * ---------------------------------------------------------------------
 * x86 SIMD extension for IJG JPEG library
 * Copyright (C) 1999-2006, MIYASAKA Masaru.
 * This file has been modified for SIMD extension.
 * Last Modified : January 5, 2006
 * ---------------------------------------------------------------------
 *
 * This file contains code for merged upsampling/color conversion.
 *
 * This file combines functions from jdsample.c and jdcolor.c;
@@ -35,6 +42,7 @@
 #define JPEG_INTERNALS
 #include "jinclude.h"
 #include "jpeglib.h"
 #include "jcolsamp.h"		/* Private declarations */
 #ifdef UPSAMPLE_MERGING_SUPPORTED
@@ -218,6 +226,17 @@ merged_1v_upsample (j_decompress_ptr cinfo,
 */
 #if RGB_PIXELSIZE == 4
 /* offset of filler byte */
 #define RGB_FILLER  (6 - (RGB_RED) - (RGB_GREEN) - (RGB_BLUE))
 /* byte pattern to fill with */
 #ifdef RGBX_FILLER_0XFF
 #define RGB_FILLER_BYTE 0xFF
 #else
 #define RGB_FILLER_BYTE 0x00
 #endif
 #endif /* RGB_PIXELSIZE == 4 */
 /*
 * Upsample and color convert for the case of 2:1 horizontal and 1:1 vertical.
 */
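
Review note on the RGB_FILLER definition in the hunk above: it works because the four byte offsets inside a 4-byte pixel are 0, 1, 2 and 3, which sum to 6, so subtracting the three colour offsets leaves the unused one. A minimal sketch that checks this for every layout (filler_offset and filler_offset_is_consistent are hypothetical names used only here):

```c
#include <assert.h>

/* Hypothetical helper mirroring the RGB_FILLER macro:
 * given three distinct offsets in 0..3, return the remaining one. */
static int
filler_offset (int red, int green, int blue)
{
  assert(red != green && green != blue && red != blue);
  return 6 - red - green - blue;
}

/* Exhaustively check every valid channel layout.
 * Returns 1 when the result is always the one unused offset. */
static int
filler_offset_is_consistent (void)
{
  int r, g, b, x;

  for (r = 0; r < 4; r++)
    for (g = 0; g < 4; g++)
      for (b = 0; b < 4; b++) {
        if (r == g || g == b || r == b)
          continue;                    /* not a valid layout */
        x = filler_offset(r, g, b);
        if (x < 0 || x > 3 || x == r || x == g || x == b)
          return 0;
      }
  return 1;
}
```

With the usual RGB order (0, 1, 2) the filler lands at offset 3. With offsets (1, 2, 3) it lands at offset 0.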
@@ -258,11 +277,17 @@ h2v1_merged_upsample (j_decompress_ptr cinfo,
    outptr[RGB_RED] =   range_limit[y + cred];
    outptr[RGB_GREEN] = range_limit[y + cgreen];
    outptr[RGB_BLUE] =  range_limit[y + cblue];
 #if RGB_PIXELSIZE == 4
    outptr[RGB_FILLER] = RGB_FILLER_BYTE;
 #endif
    outptr += RGB_PIXELSIZE;
    y  = GETJSAMPLE(*inptr0++);
    outptr[RGB_RED] =   range_limit[y + cred];
    outptr[RGB_GREEN] = range_limit[y + cgreen];
    outptr[RGB_BLUE] =  range_limit[y + cblue];
 #if RGB_PIXELSIZE == 4
    outptr[RGB_FILLER] = RGB_FILLER_BYTE;
 #endif
    outptr += RGB_PIXELSIZE;
  }
  /* If image width is odd, do the last output column separately */
@@ -276,6 +301,9 @@ h2v1_merged_upsample (j_decompress_ptr cinfo,
    outptr[RGB_RED] =   range_limit[y + cred];
    outptr[RGB_GREEN] = range_limit[y + cgreen];
    outptr[RGB_BLUE] =  range_limit[y + cblue];
 #if RGB_PIXELSIZE == 4
    outptr[RGB_FILLER] = RGB_FILLER_BYTE;
 #endif
  }
 }
@@ -322,21 +350,33 @@ h2v2_merged_upsample (j_decompress_ptr cinfo,
    outptr0[RGB_RED] =   range_limit[y + cred];
    outptr0[RGB_GREEN] = range_limit[y + cgreen];
    outptr0[RGB_BLUE] =  range_limit[y + cblue];
 #if RGB_PIXELSIZE == 4
    outptr0[RGB_FILLER] = RGB_FILLER_BYTE;
 #endif
    outptr0 += RGB_PIXELSIZE;
    y  = GETJSAMPLE(*inptr00++);
    outptr0[RGB_RED] =   range_limit[y + cred];
    outptr0[RGB_GREEN] = range_limit[y + cgreen];
    outptr0[RGB_BLUE] =  range_limit[y + cblue];
 #if RGB_PIXELSIZE == 4
    outptr0[RGB_FILLER] = RGB_FILLER_BYTE;
 #endif
    outptr0 += RGB_PIXELSIZE;
    y  = GETJSAMPLE(*inptr01++);
    outptr1[RGB_RED] =   range_limit[y + cred];
    outptr1[RGB_GREEN] = range_limit[y + cgreen];
    outptr1[RGB_BLUE] =  range_limit[y + cblue];
 #if RGB_PIXELSIZE == 4
    outptr1[RGB_FILLER] = RGB_FILLER_BYTE;
 #endif
    outptr1 += RGB_PIXELSIZE;
    y  = GETJSAMPLE(*inptr01++);
    outptr1[RGB_RED] =   range_limit[y + cred];
    outptr1[RGB_GREEN] = range_limit[y + cgreen];
    outptr1[RGB_BLUE] =  range_limit[y + cblue];
 #if RGB_PIXELSIZE == 4
    outptr1[RGB_FILLER] = RGB_FILLER_BYTE;
 #endif
    outptr1 += RGB_PIXELSIZE;
  }
  /* If image width is odd, do the last output column separately */
@@ -350,10 +390,16 @@ h2v2_merged_upsample (j_decompress_ptr cinfo,
    outptr0[RGB_RED] =   range_limit[y + cred];
    outptr0[RGB_GREEN] = range_limit[y + cgreen];
    outptr0[RGB_BLUE] =  range_limit[y + cblue];
 #if RGB_PIXELSIZE == 4
    outptr0[RGB_FILLER] = RGB_FILLER_BYTE;
 #endif
    y  = GETJSAMPLE(*inptr01);
    outptr1[RGB_RED] =   range_limit[y + cred];
    outptr1[RGB_GREEN] = range_limit[y + cgreen];
    outptr1[RGB_BLUE] =  range_limit[y + cblue];
 #if RGB_PIXELSIZE == 4
    outptr1[RGB_FILLER] = RGB_FILLER_BYTE;
 #endif
  }
 }
@@ -370,6 +416,7 @@ GLOBAL(void)
 jinit_merged_upsampler (j_decompress_ptr cinfo)
 {
  my_upsample_ptr upsample;
  unsigned int simd = jpeg_simd_support((j_common_ptr) cinfo);
  upsample = (my_upsample_ptr)
    (*cinfo->mem->alloc_small) ((j_common_ptr) cinfo, JPOOL_IMAGE,
@@ -382,19 +429,73 @@ jinit_merged_upsampler (j_decompress_ptr cinfo)
  if (cinfo->max_v_samp_factor == 2) {
    upsample->pub.upsample = merged_2v_upsample;
 #if RGB_PIXELSIZE == 3 || RGB_PIXELSIZE == 4
 #ifdef JDMERGE_SSE2_SUPPORTED
    if (simd & JSIMD_SSE2 &&
        IS_CONST_ALIGNED_16(jconst_merged_upsample_sse2)) {
      upsample->upmethod = jpeg_h2v2_merged_upsample_sse2;
    } else
 #endif
 #ifdef JDMERGE_MMX_SUPPORTED
    if (simd & JSIMD_MMX) {
      upsample->upmethod = jpeg_h2v2_merged_upsample_mmx;
    } else
 #endif
 #endif /* RGB_PIXELSIZE == 3 || RGB_PIXELSIZE == 4 */
    {
      upsample->upmethod = h2v2_merged_upsample;
      build_ycc_rgb_table(cinfo);
    }
    /* Allocate a spare row buffer */
    upsample->spare_row = (JSAMPROW)
      (*cinfo->mem->alloc_large) ((j_common_ptr) cinfo, JPOOL_IMAGE,
 		(size_t) (upsample->out_row_width * SIZEOF(JSAMPLE)));
  } else {
    upsample->pub.upsample = merged_1v_upsample;
 #if RGB_PIXELSIZE == 3 || RGB_PIXELSIZE == 4
 #ifdef JDMERGE_SSE2_SUPPORTED
    if (simd & JSIMD_SSE2 &&
        IS_CONST_ALIGNED_16(jconst_merged_upsample_sse2)) {
      upsample->upmethod = jpeg_h2v1_merged_upsample_sse2;
    } else
 #endif
 #ifdef JDMERGE_MMX_SUPPORTED
    if (simd & JSIMD_MMX) {
      upsample->upmethod = jpeg_h2v1_merged_upsample_mmx;
    } else
 #endif
 #endif /* RGB_PIXELSIZE == 3 || RGB_PIXELSIZE == 4 */
    {
      upsample->upmethod = h2v1_merged_upsample;
      build_ycc_rgb_table(cinfo);
    }
    /* No spare row needed */
    upsample->spare_row = NULL;
  }
  build_ycc_rgb_table(cinfo);
 }
 #ifndef JSIMD_MODEINFO_NOT_SUPPORTED
 GLOBAL(unsigned int)
 jpeg_simd_merged_upsampler (j_decompress_ptr cinfo)
 {
  unsigned int simd = jpeg_simd_support((j_common_ptr) cinfo);
 #if RGB_PIXELSIZE == 3 || RGB_PIXELSIZE == 4
 #ifdef JDMERGE_SSE2_SUPPORTED
  if (simd & JSIMD_SSE2 &&
      IS_CONST_ALIGNED_16(jconst_merged_upsample_sse2))
    return JSIMD_SSE2;
 #endif
 #ifdef JDMERGE_MMX_SUPPORTED
  if (simd & JSIMD_MMX)
    return JSIMD_MMX;
 #endif
 #endif /* RGB_PIXELSIZE == 3 || RGB_PIXELSIZE == 4 */
  return JSIMD_NONE;
 }
 #endif /* !JSIMD_MODEINFO_NOT_SUPPORTED */
 #endif /* UPSAMPLE_MERGING_SUPPORTED */
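
Review note on the dispatch logic in jinit_merged_upsampler and jpeg_simd_merged_upsampler above: both follow the same priority order, SSE2 first (only when its constant table is 16-byte aligned), then MMX, then the plain C routine. The sketch below restates that order with stand-in names. The DEMO_* flag values and demo_is_aligned_16 are assumptions made for illustration; the real JSIMD_* constants and IS_CONST_ALIGNED_16 are defined elsewhere in the SIMD extension and are not shown in this diff.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical flag values standing in for JSIMD_NONE/JSIMD_MMX/JSIMD_SSE2. */
enum { DEMO_SIMD_NONE = 0, DEMO_SIMD_MMX = 1, DEMO_SIMD_SSE2 = 2 };

/* Assumed meaning of the alignment check: address is a multiple of 16. */
static int
demo_is_aligned_16 (const void *p)
{
  return ((uintptr_t) p & 15) == 0;
}

/* Select the best usable implementation, in the same order as the patch:
 * SSE2 needs both CPU support and an aligned constant table;
 * otherwise fall back to MMX, and finally to the plain C code. */
static unsigned int
demo_select_simd (unsigned int supported, const void *sse2_consts)
{
  assert(sse2_consts != NULL);
  if ((supported & DEMO_SIMD_SSE2) && demo_is_aligned_16(sse2_consts))
    return DEMO_SIMD_SSE2;
  if (supported & DEMO_SIMD_MMX)
    return DEMO_SIMD_MMX;
  return DEMO_SIMD_NONE;
}
```

Note that a misaligned SSE2 table does not disable SIMD altogether; the selection simply drops to MMX if the CPU offers it.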
--- a/jdmermmx.asm
+++ b/jdmermmx.asm
@@ -0,0 +1,981 @@
 ;
 ; jdmermmx.asm - merged upsampling/color conversion (MMX)
 ;
 ; x86 SIMD extension for IJG JPEG library
 ; Copyright (C) 1999-2006, MIYASAKA Masaru.
 ; For conditions of distribution and use, see copyright notice in jsimdext.inc
 ;
 ; This file should be assembled with NASM (Netwide Assembler); it
 ; can *not* be assembled with Microsoft's MASM or any compatible
 ; assembler (including Borland's Turbo Assembler).
 ; NASM is available from http://nasm.sourceforge.net/ or
 ; http://sourceforge.net/project/showfiles.php?group_id=6208
 ;
 ; Last Modified : February 4, 2006
 ;
 ; [TAB8]
 %include "jsimdext.inc"
 %include "jcolsamp.inc"
 %if RGB_PIXELSIZE == 3 || RGB_PIXELSIZE == 4
 %ifdef UPSAMPLE_MERGING_SUPPORTED
 %ifdef JDMERGE_MMX_SUPPORTED
 ; --------------------------------------------------------------------------
 %define SCALEBITS	16
 F_0_344	equ	 22554			; FIX(0.34414)
 F_0_714	equ	 46802			; FIX(0.71414)
 F_1_402	equ	 91881			; FIX(1.40200)
 F_1_772	equ	116130			; FIX(1.77200)
 F_0_402	equ	(F_1_402 - 65536)	; FIX(1.40200) - FIX(1)
 F_0_285	equ	( 65536 - F_0_714)	; FIX(1) - FIX(0.71414)
 F_0_228	equ	(131072 - F_1_772)	; FIX(2) - FIX(1.77200)
 ; --------------------------------------------------------------------------
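
Review note on the constants above: they are 16-bit fixed-point values (SCALEBITS = 16). The sketch below recomputes them in C, assuming FIX(x) rounds x * 65536 to the nearest integer as the FIX macro in the C colour-conversion code does. It also emulates one lane of the pmulhw / paddw PW_ONE / psraw sequence that the routine uses for the Cr term of R. The likely reason for subtracting 1.0 (or 2.0) from the larger factors is that the remainder then fits in a signed 16-bit word; that is an inference from the values, not something the source states. All demo_* names are hypothetical.

```c
#include <assert.h>

#define DEMO_SCALEBITS 16

/* Assumed definition of FIX(): scale by 2^16 and round to nearest. */
static long
demo_fix (double x)
{
  return (long) (x * (1L << DEMO_SCALEBITS) + 0.5);
}

/* One lane of the R-Y computation, restricted to non-negative input
 * so that the right shifts are fully defined in portable C.
 * cr is the chroma sample after the 128 offset has been removed. */
static int
demo_r_minus_y (int cr)
{
  int f_0_402, hi;

  assert(cr >= 0 && cr <= 127);
  f_0_402 = (int) (demo_fix(1.40200) - 65536L); /* F_0_402 = 26345 */
  hi = (2 * cr * f_0_402) >> 16;  /* pmulhw: high word of (2*Cr * F_0_402) */
  hi = (hi + 1) >> 1;             /* paddw PW_ONE; psraw 1: rounded 0.402*Cr */
  return hi + cr;                 /* paddw: add back the 1.0 * Cr part */
}
```

For Cr = 127 this yields 51 + 127 = 178, matching 1.402 * 127 rounded to the nearest integer.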
 	SECTION	SEG_CONST
 	alignz	16
 	global	EXTN(jconst_merged_upsample_mmx)
 EXTN(jconst_merged_upsample_mmx):
 PW_F0402	times 4 dw  F_0_402
 PW_MF0228	times 4 dw -F_0_228
 PW_MF0344_F0285	times 2 dw -F_0_344, F_0_285
 PW_ONE		times 4 dw  1
 PD_ONEHALF	times 2 dd  1 << (SCALEBITS-1)
 	alignz	16
 ; --------------------------------------------------------------------------
 	SECTION	SEG_TEXT
 	BITS	32
 ;
 ; Upsample and color convert for the case of 2:1 horizontal and 1:1 vertical.
 ;
 ; GLOBAL(void)
 ; jpeg_h2v1_merged_upsample_mmx (j_decompress_ptr cinfo, JSAMPIMAGE input_buf,
 ;                                JDIMENSION in_row_group_ctr,
 ;                                JSAMPARRAY output_buf);
 ;
 %define cinfo(b)		(b)+8		; j_decompress_ptr cinfo
 %define input_buf(b)		(b)+12		; JSAMPIMAGE input_buf
 %define in_row_group_ctr(b)	(b)+16		; JDIMENSION in_row_group_ctr
 %define output_buf(b)		(b)+20		; JSAMPARRAY output_buf
 %define original_ebp	ebp+0
 %define wk(i)		ebp-(WK_NUM-(i))*SIZEOF_MMWORD	; mmword wk[WK_NUM]
 %define WK_NUM		3
 %define gotptr		wk(0)-SIZEOF_POINTER	; void * gotptr
 	align	16
 	global	EXTN(jpeg_h2v1_merged_upsample_mmx)
 EXTN(jpeg_h2v1_merged_upsample_mmx):
 	push	ebp
 	mov	eax,esp				; eax = original ebp
 	sub	esp, byte 4
 	and	esp, byte (-SIZEOF_MMWORD)	; align to 64 bits
 	mov	[esp],eax
 	mov	ebp,esp				; ebp = aligned ebp
 	lea	esp, [wk(0)]
 	pushpic	eax		; make room for GOT address
 	push	ebx
 ;	push	ecx		; need not be preserved
 ;	push	edx		; need not be preserved
 	push	esi
 	push	edi
 	get_GOT	ebx			; get GOT address
 	movpic	POINTER [gotptr], ebx	; save GOT address
 	mov	ecx, POINTER [cinfo(eax)]
 	mov	ecx, JDIMENSION [jdstruct_output_width(ecx)]	; col
 	test	ecx,ecx
 	jz	near .return
 	push	ecx
 	mov	edi, JSAMPIMAGE [input_buf(eax)]
 	mov	ecx, JDIMENSION [in_row_group_ctr(eax)]
 	mov	esi, JSAMPARRAY [edi+0*SIZEOF_JSAMPARRAY]
 	mov	ebx, JSAMPARRAY [edi+1*SIZEOF_JSAMPARRAY]
 	mov	edx, JSAMPARRAY [edi+2*SIZEOF_JSAMPARRAY]
 	mov	edi, JSAMPARRAY [output_buf(eax)]
 	mov	esi, JSAMPROW [esi+ecx*SIZEOF_JSAMPROW]		; inptr0
 	mov	ebx, JSAMPROW [ebx+ecx*SIZEOF_JSAMPROW]		; inptr1
 	mov	edx, JSAMPROW [edx+ecx*SIZEOF_JSAMPROW]		; inptr2
 	mov	edi, JSAMPROW [edi]				; outptr
 	pop	ecx			; col
 	alignx	16,7
 .columnloop:
 	movpic	eax, POINTER [gotptr]	; load GOT address (eax)
 	movq      mm6, MMWORD [ebx]	; mm6=Cb(01234567)
 	movq      mm7, MMWORD [edx]	; mm7=Cr(01234567)
 	pxor      mm1,mm1		; mm1=(all 0's)
 	pcmpeqw   mm3,mm3
 	psllw     mm3,7			; mm3={0xFF80 0xFF80 0xFF80 0xFF80}
 	movq      mm4,mm6
 	punpckhbw mm6,mm1		; mm6=Cb(4567)=CbH
 	punpcklbw mm4,mm1		; mm4=Cb(0123)=CbL
 	movq      mm0,mm7
 	punpckhbw mm7,mm1		; mm7=Cr(4567)=CrH
 	punpcklbw mm0,mm1		; mm0=Cr(0123)=CrL
 	paddw     mm6,mm3
 	paddw     mm4,mm3
 	paddw     mm7,mm3
 	paddw     mm0,mm3
 	; (Original)
 	; R = Y                + 1.40200 * Cr
 	; G = Y - 0.34414 * Cb - 0.71414 * Cr
 	; B = Y + 1.77200 * Cb
 	;
 	; (This implementation)
 	; R = Y                + 0.40200 * Cr + Cr
 	; G = Y - 0.34414 * Cb + 0.28586 * Cr - Cr
 	; B = Y - 0.22800 * Cb + Cb + Cb
 	movq	mm5,mm6			; mm5=CbH
 	movq	mm2,mm4			; mm2=CbL
 	paddw	mm6,mm6			; mm6=2*CbH
 	paddw	mm4,mm4			; mm4=2*CbL
 	movq	mm1,mm7			; mm1=CrH
 	movq	mm3,mm0			; mm3=CrL
 	paddw	mm7,mm7			; mm7=2*CrH
 	paddw	mm0,mm0			; mm0=2*CrL
 	pmulhw	mm6,[GOTOFF(eax,PW_MF0228)]	; mm6=(2*CbH * -FIX(0.22800))
 	pmulhw	mm4,[GOTOFF(eax,PW_MF0228)]	; mm4=(2*CbL * -FIX(0.22800))
 	pmulhw	mm7,[GOTOFF(eax,PW_F0402)]	; mm7=(2*CrH * FIX(0.40200))
 	pmulhw	mm0,[GOTOFF(eax,PW_F0402)]	; mm0=(2*CrL * FIX(0.40200))
 	paddw	mm6,[GOTOFF(eax,PW_ONE)]
 	paddw	mm4,[GOTOFF(eax,PW_ONE)]
 	psraw	mm6,1			; mm6=(CbH * -FIX(0.22800))
 	psraw	mm4,1			; mm4=(CbL * -FIX(0.22800))
 	paddw	mm7,[GOTOFF(eax,PW_ONE)]
 	paddw	mm0,[GOTOFF(eax,PW_ONE)]
 	psraw	mm7,1			; mm7=(CrH * FIX(0.40200))
 	psraw	mm0,1			; mm0=(CrL * FIX(0.40200))
 	paddw	mm6,mm5
 	paddw	mm4,mm2
 	paddw	mm6,mm5			; mm6=(CbH * FIX(1.77200))=(B-Y)H
 	paddw	mm4,mm2			; mm4=(CbL * FIX(1.77200))=(B-Y)L
 	paddw	mm7,mm1			; mm7=(CrH * FIX(1.40200))=(R-Y)H
 	paddw	mm0,mm3			; mm0=(CrL * FIX(1.40200))=(R-Y)L
 	movq	MMWORD [wk(0)], mm6	; wk(0)=(B-Y)H
 	movq	MMWORD [wk(1)], mm7	; wk(1)=(R-Y)H
 	movq      mm6,mm5
 	movq      mm7,mm2
 	punpcklwd mm5,mm1
 	punpckhwd mm6,mm1
 	pmaddwd   mm5,[GOTOFF(eax,PW_MF0344_F0285)]
 	pmaddwd   mm6,[GOTOFF(eax,PW_MF0344_F0285)]
 	punpcklwd mm2,mm3
 	punpckhwd mm7,mm3
 	pmaddwd   mm2,[GOTOFF(eax,PW_MF0344_F0285)]
 	pmaddwd   mm7,[GOTOFF(eax,PW_MF0344_F0285)]
 	paddd     mm5,[GOTOFF(eax,PD_ONEHALF)]
 	paddd     mm6,[GOTOFF(eax,PD_ONEHALF)]
 	psrad     mm5,SCALEBITS
 	psrad     mm6,SCALEBITS
 	paddd     mm2,[GOTOFF(eax,PD_ONEHALF)]
 	paddd     mm7,[GOTOFF(eax,PD_ONEHALF)]
 	psrad     mm2,SCALEBITS
 	psrad     mm7,SCALEBITS
 	packssdw  mm5,mm6	; mm5=CbH*-FIX(0.344)+CrH*FIX(0.285)
 	packssdw  mm2,mm7	; mm2=CbL*-FIX(0.344)+CrL*FIX(0.285)
 	psubw     mm5,mm1	; mm5=CbH*-FIX(0.344)+CrH*-FIX(0.714)=(G-Y)H
 	psubw     mm2,mm3	; mm2=CbL*-FIX(0.344)+CrL*-FIX(0.714)=(G-Y)L
 	movq	MMWORD [wk(2)], mm5	; wk(2)=(G-Y)H
 	mov	al,2			; Yctr
 	jmp	short .Yloop_1st
 	alignx	16,7
 .Yloop_2nd:
 	movq	mm0, MMWORD [wk(1)]	; mm0=(R-Y)H
 	movq	mm2, MMWORD [wk(2)]	; mm2=(G-Y)H
 	movq	mm4, MMWORD [wk(0)]	; mm4=(B-Y)H
 	alignx	16,7
 .Yloop_1st:
 	movq	mm7, MMWORD [esi]	; mm7=Y(01234567)
 	pcmpeqw	mm6,mm6
 	psrlw	mm6,BYTE_BIT		; mm6={0xFF 0x00 0xFF 0x00 ..}
 	pand	mm6,mm7			; mm6=Y(0246)=YE
 	psrlw	mm7,BYTE_BIT		; mm7=Y(1357)=YO
 	movq	mm1,mm0			; mm1=mm0=(R-Y)(L/H)
 	movq	mm3,mm2			; mm3=mm2=(G-Y)(L/H)
 	movq	mm5,mm4			; mm5=mm4=(B-Y)(L/H)
 	paddw     mm0,mm6		; mm0=((R-Y)+YE)=RE=(R0 R2 R4 R6)
 	paddw     mm1,mm7		; mm1=((R-Y)+YO)=RO=(R1 R3 R5 R7)
 	packuswb  mm0,mm0		; mm0=(R0 R2 R4 R6 ** ** ** **)
 	packuswb  mm1,mm1		; mm1=(R1 R3 R5 R7 ** ** ** **)
 	paddw     mm2,mm6		; mm2=((G-Y)+YE)=GE=(G0 G2 G4 G6)
 	paddw     mm3,mm7		; mm3=((G-Y)+YO)=GO=(G1 G3 G5 G7)
 	packuswb  mm2,mm2		; mm2=(G0 G2 G4 G6 ** ** ** **)
 	packuswb  mm3,mm3		; mm3=(G1 G3 G5 G7 ** ** ** **)
 	paddw     mm4,mm6		; mm4=((B-Y)+YE)=BE=(B0 B2 B4 B6)
 	paddw     mm5,mm7		; mm5=((B-Y)+YO)=BO=(B1 B3 B5 B7)
 	packuswb  mm4,mm4		; mm4=(B0 B2 B4 B6 ** ** ** **)
 	packuswb  mm5,mm5		; mm5=(B1 B3 B5 B7 ** ** ** **)
 %if RGB_PIXELSIZE == 3 ; ---------------
 	; mmA=(00 02 04 06 ** ** ** **), mmB=(01 03 05 07 ** ** ** **)
 	; mmC=(10 12 14 16 ** ** ** **), mmD=(11 13 15 17 ** ** ** **)
 	; mmE=(20 22 24 26 ** ** ** **), mmF=(21 23 25 27 ** ** ** **)
 	; mmG=(** ** ** ** ** ** ** **), mmH=(** ** ** ** ** ** ** **)
 	punpcklbw mmA,mmC		; mmA=(00 10 02 12 04 14 06 16)
 	punpcklbw mmE,mmB		; mmE=(20 01 22 03 24 05 26 07)
 	punpcklbw mmD,mmF		; mmD=(11 21 13 23 15 25 17 27)
 	movq      mmG,mmA
 	movq      mmH,mmA
 	punpcklwd mmA,mmE		; mmA=(00 10 20 01 02 12 22 03)
 	punpckhwd mmG,mmE		; mmG=(04 14 24 05 06 16 26 07)
 	psrlq     mmH,2*BYTE_BIT	; mmH=(02 12 04 14 06 16 -- --)
 	psrlq     mmE,2*BYTE_BIT	; mmE=(22 03 24 05 26 07 -- --)
 	movq      mmC,mmD
 	movq      mmB,mmD
 	punpcklwd mmD,mmH		; mmD=(11 21 02 12 13 23 04 14)
 	punpckhwd mmC,mmH		; mmC=(15 25 06 16 17 27 -- --)
 	psrlq     mmB,2*BYTE_BIT	; mmB=(13 23 15 25 17 27 -- --)
 	movq      mmF,mmE
 	punpcklwd mmE,mmB		; mmE=(22 03 13 23 24 05 15 25)
 	punpckhwd mmF,mmB		; mmF=(26 07 17 27 -- -- -- --)
 	punpckldq mmA,mmD		; mmA=(00 10 20 01 11 21 02 12)
 	punpckldq mmE,mmG		; mmE=(22 03 13 23 04 14 24 05)
 	punpckldq mmC,mmF		; mmC=(15 25 06 16 26 07 17 27)
 	cmp	ecx, byte SIZEOF_MMWORD
 	jb	short .column_st16
 	movq	MMWORD [edi+0*SIZEOF_MMWORD], mmA
 	movq	MMWORD [edi+1*SIZEOF_MMWORD], mmE
 	movq	MMWORD [edi+2*SIZEOF_MMWORD], mmC
 	sub	ecx, byte SIZEOF_MMWORD
 	jz	short .endcolumn
 	add	edi, byte RGB_PIXELSIZE*SIZEOF_MMWORD	; outptr
 	add	esi, byte SIZEOF_MMWORD			; inptr0
 	dec	al			; Yctr
 	jnz	near .Yloop_2nd
 	add	ebx, byte SIZEOF_MMWORD			; inptr1
 	add	edx, byte SIZEOF_MMWORD			; inptr2
 	jmp	near .columnloop
 	alignx	16,7
 .column_st16:
 	lea	ecx, [ecx+ecx*2]	; imul ecx, RGB_PIXELSIZE
 	cmp	ecx, byte 2*SIZEOF_MMWORD
 	jb	short .column_st8
 	movq	MMWORD [edi+0*SIZEOF_MMWORD], mmA
 	movq	MMWORD [edi+1*SIZEOF_MMWORD], mmE
 	movq	mmA,mmC
 	sub	ecx, byte 2*SIZEOF_MMWORD
 	add	edi, byte 2*SIZEOF_MMWORD
 	jmp	short .column_st4
 .column_st8:
 	cmp	ecx, byte SIZEOF_MMWORD
 	jb	short .column_st4
 	movq	MMWORD [edi+0*SIZEOF_MMWORD], mmA
 	movq	mmA,mmE
 	sub	ecx, byte SIZEOF_MMWORD
 	add	edi, byte SIZEOF_MMWORD
 .column_st4:
 	movd	eax,mmA
 	cmp	ecx, byte SIZEOF_DWORD
 	jb	short .column_st2
 	mov	DWORD [edi+0*SIZEOF_DWORD], eax
 	psrlq	mmA,DWORD_BIT
 	movd	eax,mmA
 	sub	ecx, byte SIZEOF_DWORD
 	add	edi, byte SIZEOF_DWORD
 .column_st2:
 	cmp	ecx, byte SIZEOF_WORD
 	jb	short .column_st1
 	mov	WORD [edi+0*SIZEOF_WORD], ax
 	shr	eax,WORD_BIT
 	sub	ecx, byte SIZEOF_WORD
 	add	edi, byte SIZEOF_WORD
 .column_st1:
 	cmp	ecx, byte SIZEOF_BYTE
 	jb	short .endcolumn
 	mov	BYTE [edi+0*SIZEOF_BYTE], al
 %else ; RGB_PIXELSIZE == 4 ; -----------
 %ifdef RGBX_FILLER_0XFF
 	pcmpeqb   mm6,mm6		; mm6=(X0 X2 X4 X6 ** ** ** **)
 	pcmpeqb   mm7,mm7		; mm7=(X1 X3 X5 X7 ** ** ** **)
 %else
 	pxor      mm6,mm6		; mm6=(X0 X2 X4 X6 ** ** ** **)
 	pxor      mm7,mm7		; mm7=(X1 X3 X5 X7 ** ** ** **)
 %endif
 	; mmA=(00 02 04 06 ** ** ** **), mmB=(01 03 05 07 ** ** ** **)
 	; mmC=(10 12 14 16 ** ** ** **), mmD=(11 13 15 17 ** ** ** **)
 	; mmE=(20 22 24 26 ** ** ** **), mmF=(21 23 25 27 ** ** ** **)
 	; mmG=(30 32 34 36 ** ** ** **), mmH=(31 33 35 37 ** ** ** **)
 	punpcklbw mmA,mmC		; mmA=(00 10 02 12 04 14 06 16)
 	punpcklbw mmE,mmG		; mmE=(20 30 22 32 24 34 26 36)
 	punpcklbw mmB,mmD		; mmB=(01 11 03 13 05 15 07 17)
 	punpcklbw mmF,mmH		; mmF=(21 31 23 33 25 35 27 37)
 	movq      mmC,mmA
 	punpcklwd mmA,mmE		; mmA=(00 10 20 30 02 12 22 32)
 	punpckhwd mmC,mmE		; mmC=(04 14 24 34 06 16 26 36)
 	movq      mmG,mmB
 	punpcklwd mmB,mmF		; mmB=(01 11 21 31 03 13 23 33)
 	punpckhwd mmG,mmF		; mmG=(05 15 25 35 07 17 27 37)
 	movq      mmD,mmA
 	punpckldq mmA,mmB		; mmA=(00 10 20 30 01 11 21 31)
 	punpckhdq mmD,mmB		; mmD=(02 12 22 32 03 13 23 33)
 	movq      mmH,mmC
 	punpckldq mmC,mmG		; mmC=(04 14 24 34 05 15 25 35)
 	punpckhdq mmH,mmG		; mmH=(06 16 26 36 07 17 27 37)
 	cmp	ecx, byte SIZEOF_MMWORD
 	jb	short .column_st16
 	movq	MMWORD [edi+0*SIZEOF_MMWORD], mmA
 	movq	MMWORD [edi+1*SIZEOF_MMWORD], mmD
 	movq	MMWORD [edi+2*SIZEOF_MMWORD], mmC
 	movq	MMWORD [edi+3*SIZEOF_MMWORD], mmH
 	sub	ecx, byte SIZEOF_MMWORD
 	jz	short .endcolumn
 	add	edi, byte RGB_PIXELSIZE*SIZEOF_MMWORD	; outptr
 	add	esi, byte SIZEOF_MMWORD			; inptr0
 	dec	al			; Yctr
 	jnz	near .Yloop_2nd
 	add	ebx, byte SIZEOF_MMWORD			; inptr1
 	add	edx, byte SIZEOF_MMWORD			; inptr2
 	jmp	near .columnloop
 	alignx	16,7
 .column_st16:
 	cmp	ecx, byte SIZEOF_MMWORD/2
 	jb	short .column_st8
 	movq	MMWORD [edi+0*SIZEOF_MMWORD], mmA
 	movq	MMWORD [edi+1*SIZEOF_MMWORD], mmD
 	movq	mmA,mmC
 	movq	mmD,mmH
 	sub	ecx, byte SIZEOF_MMWORD/2
 	add	edi, byte 2*SIZEOF_MMWORD
 .column_st8:
 	cmp	ecx, byte SIZEOF_MMWORD/4
 	jb	short .column_st4
 	movq	MMWORD [edi+0*SIZEOF_MMWORD], mmA
 	movq	mmA,mmD
 	sub	ecx, byte SIZEOF_MMWORD/4
 	add	edi, byte 1*SIZEOF_MMWORD
 .column_st4:
 	cmp	ecx, byte SIZEOF_MMWORD/8
 	jb	short .endcolumn
 	movd	DWORD [edi+0*SIZEOF_DWORD], mmA
 %endif ; RGB_PIXELSIZE ; ---------------
 .endcolumn:
 	emms		; empty MMX state
 .return:
 	pop	edi
 	pop	esi
 ;	pop	edx		; need not be preserved
 ;	pop	ecx		; need not be preserved
 	pop	ebx
 	mov	esp,ebp		; esp <- aligned ebp
 	pop	esp		; esp <- original ebp
 	pop	ebp
 	ret
 %ifndef USE_DEDICATED_H2V2_MERGED_UPSAMPLE_MMX
 ; --------------------------------------------------------------------------
 ;
 ; Upsample and color convert for the case of 2:1 horizontal and 2:1 vertical.
 ;
 ; GLOBAL(void)
 ; jpeg_h2v2_merged_upsample_mmx (j_decompress_ptr cinfo, JSAMPIMAGE input_buf,
 ;                                JDIMENSION in_row_group_ctr,
 ;                                JSAMPARRAY output_buf);
 ;
 %define cinfo(b)		(b)+8		; j_decompress_ptr cinfo
 %define input_buf(b)		(b)+12		; JSAMPIMAGE input_buf
 %define in_row_group_ctr(b)	(b)+16		; JDIMENSION in_row_group_ctr
 %define output_buf(b)		(b)+20		; JSAMPARRAY output_buf
 	align	16
 	global	EXTN(jpeg_h2v2_merged_upsample_mmx)
 EXTN(jpeg_h2v2_merged_upsample_mmx):
 	push	ebp
 	mov	ebp,esp
 	push	ebx
 ;	push	ecx		; need not be preserved
 ;	push	edx		; need not be preserved
 	push	esi
 	push	edi
 	mov	eax, POINTER [cinfo(ebp)]
 	mov	edi, JSAMPIMAGE [input_buf(ebp)]
 	mov	ecx, JDIMENSION [in_row_group_ctr(ebp)]
 	mov	esi, JSAMPARRAY [edi+0*SIZEOF_JSAMPARRAY]
 	mov	ebx, JSAMPARRAY [edi+1*SIZEOF_JSAMPARRAY]
 	mov	edx, JSAMPARRAY [edi+2*SIZEOF_JSAMPARRAY]
 	mov	edi, JSAMPARRAY [output_buf(ebp)]
 	lea	esi, [esi+ecx*SIZEOF_JSAMPROW]
 	push	edx			; inptr2
 	push	ebx			; inptr1
 	push	esi			; inptr00
 	mov	ebx,esp
 	push	edi			; output_buf (outptr0)
 	push	ecx			; in_row_group_ctr
 	push	ebx			; input_buf
 	push	eax			; cinfo
 	call	near EXTN(jpeg_h2v1_merged_upsample_mmx)
 	add	esi, byte SIZEOF_JSAMPROW	; inptr01
 	add	edi, byte SIZEOF_JSAMPROW	; outptr1
 	mov	POINTER [ebx+0*SIZEOF_POINTER], esi
 	mov	POINTER [ebx-1*SIZEOF_POINTER], edi
 	call	near EXTN(jpeg_h2v1_merged_upsample_mmx)
 	add	esp, byte 7*SIZEOF_DWORD
 	pop	edi
 	pop	esi
 ;	pop	edx		; need not be preserved
 ;	pop	ecx		; need not be preserved
 	pop	ebx
 	pop	ebp
 	ret
 %else  ; USE_DEDICATED_H2V2_MERGED_UPSAMPLE_MMX
 ; --------------------------------------------------------------------------
 ;
 ; Upsample and color convert for the case of 2:1 horizontal and 2:1 vertical.
 ;
 ; GLOBAL(void)
 ; jpeg_h2v2_merged_upsample_mmx (j_decompress_ptr cinfo, JSAMPIMAGE input_buf,
 ;                                JDIMENSION in_row_group_ctr,
 ;                                JSAMPARRAY output_buf);
 ;
 %define cinfo(b)		(b)+8		; j_decompress_ptr cinfo
 %define input_buf(b)		(b)+12		; JSAMPIMAGE input_buf
 %define in_row_group_ctr(b)	(b)+16		; JDIMENSION in_row_group_ctr
 %define output_buf(b)		(b)+20		; JSAMPARRAY output_buf
 %define original_ebp	ebp+0
 %define wk(i)		ebp-(WK_NUM-(i))*SIZEOF_MMWORD	; mmword wk[WK_NUM]
 %define WK_NUM		10
 %define inptr1		wk(0)-SIZEOF_JSAMPROW	; JSAMPROW inptr1
 %define inptr2		inptr1-SIZEOF_JSAMPROW	; JSAMPROW inptr2
 %define gotptr		inptr2-SIZEOF_POINTER	; void * gotptr
 	align	16
 	global	EXTN(jpeg_h2v2_merged_upsample_mmx)
 EXTN(jpeg_h2v2_merged_upsample_mmx):
 	push	ebp
 	mov	eax,esp				; eax = original ebp
 	sub	esp, byte 4
 	and	esp, byte (-SIZEOF_MMWORD)	; align to 64 bits
 	mov	[esp],eax
 	mov	ebp,esp				; ebp = aligned ebp
 	lea	esp, [inptr2]
 	pushpic	eax		; make room for GOT address
 	push	ebx
 ;	push	ecx		; need not be preserved
 ;	push	edx		; need not be preserved
 	push	esi
 	push	edi
 	get_GOT	ebx			; get GOT address
 	movpic	POINTER [gotptr], ebx	; save GOT address
 	mov	ecx, POINTER [cinfo(eax)]
 	mov	ecx, JDIMENSION [jdstruct_output_width(ecx)]	; col
 	test	ecx,ecx
 	jz	near .return
 	push	ecx
 	mov	edi, JSAMPIMAGE [input_buf(eax)]
 	mov	ecx, JDIMENSION [in_row_group_ctr(eax)]
 	mov	esi, JSAMPARRAY [edi+0*SIZEOF_JSAMPARRAY]
 	mov	ebx, JSAMPARRAY [edi+1*SIZEOF_JSAMPARRAY]
 	mov	edx, JSAMPARRAY [edi+2*SIZEOF_JSAMPARRAY]
 	mov	edi, JSAMPARRAY [output_buf(eax)]
 	mov	eax, JSAMPROW [esi+(ecx*2+0)*SIZEOF_JSAMPROW]	; inptr00
 	mov	esi, JSAMPROW [esi+(ecx*2+1)*SIZEOF_JSAMPROW]	; inptr01
 	mov	ebx, JSAMPROW [ebx+ecx*SIZEOF_JSAMPROW]		; inptr1
 	mov	edx, JSAMPROW [edx+ecx*SIZEOF_JSAMPROW]		; inptr2
 	pop	ecx		; col
 	push	eax		; inptr00
 	push	esi		; inptr01
 	mov	esi, JSAMPROW [edi+0*SIZEOF_JSAMPROW]		; outptr0
 	mov	edi, JSAMPROW [edi+1*SIZEOF_JSAMPROW]		; outptr1
 	alignx	16,7
 .columnloop:
 	movpic	eax, POINTER [gotptr]	; load GOT address (eax)
 	movq	mm6, MMWORD [ebx]	; mm6=Cb(01234567)
 	movq	mm7, MMWORD [edx]	; mm7=Cr(01234567)
 	mov	JSAMPROW [inptr1], ebx	; inptr1
 	mov	JSAMPROW [inptr2], edx	; inptr2
 	pop	edx			; edx=inptr01
 	pop	ebx			; ebx=inptr00
 	pxor      mm1,mm1		; mm1=(all 0's)
 	pcmpeqw   mm3,mm3
 	psllw     mm3,7			; mm3={0xFF80 0xFF80 0xFF80 0xFF80}
 	movq      mm4,mm6
 	punpckhbw mm6,mm1		; mm6=Cb(4567)=CbH
 	punpcklbw mm4,mm1		; mm4=Cb(0123)=CbL
 	movq      mm0,mm7
 	punpckhbw mm7,mm1		; mm7=Cr(4567)=CrH
 	punpcklbw mm0,mm1		; mm0=Cr(0123)=CrL
 	paddw     mm6,mm3
 	paddw     mm4,mm3
 	paddw     mm7,mm3
 	paddw     mm0,mm3
 	; (Original)
 	; R = Y                + 1.40200 * Cr
 	; G = Y - 0.34414 * Cb - 0.71414 * Cr
 	; B = Y + 1.77200 * Cb
 	;
 	; (This implementation)
 	; R = Y                + 0.40200 * Cr + Cr
 	; G = Y - 0.34414 * Cb + 0.28586 * Cr - Cr
 	; B = Y - 0.22800 * Cb + Cb + Cb
 	movq	mm5,mm6			; mm5=CbH
 	movq	mm2,mm4			; mm2=CbL
 	paddw	mm6,mm6			; mm6=2*CbH
 	paddw	mm4,mm4			; mm4=2*CbL
 	movq	mm1,mm7			; mm1=CrH
 	movq	mm3,mm0			; mm3=CrL
 	paddw	mm7,mm7			; mm7=2*CrH
 	paddw	mm0,mm0			; mm0=2*CrL
 	pmulhw	mm6,[GOTOFF(eax,PW_MF0228)]	; mm6=(2*CbH * -FIX(0.22800))
 	pmulhw	mm4,[GOTOFF(eax,PW_MF0228)]	; mm4=(2*CbL * -FIX(0.22800))
 	pmulhw	mm7,[GOTOFF(eax,PW_F0402)]	; mm7=(2*CrH * FIX(0.40200))
 	pmulhw	mm0,[GOTOFF(eax,PW_F0402)]	; mm0=(2*CrL * FIX(0.40200))
 	paddw	mm6,[GOTOFF(eax,PW_ONE)]
 	paddw	mm4,[GOTOFF(eax,PW_ONE)]
 	psraw	mm6,1			; mm6=(CbH * -FIX(0.22800))
 	psraw	mm4,1			; mm4=(CbL * -FIX(0.22800))
 	paddw	mm7,[GOTOFF(eax,PW_ONE)]
 	paddw	mm0,[GOTOFF(eax,PW_ONE)]
 	psraw	mm7,1			; mm7=(CrH * FIX(0.40200))
 	psraw	mm0,1			; mm0=(CrL * FIX(0.40200))
 	paddw	mm6,mm5
 	paddw	mm4,mm2
 	paddw	mm6,mm5			; mm6=(CbH * FIX(1.77200))=(B-Y)H
 	paddw	mm4,mm2			; mm4=(CbL * FIX(1.77200))=(B-Y)L
 	paddw	mm7,mm1			; mm7=(CrH * FIX(1.40200))=(R-Y)H
 	paddw	mm0,mm3			; mm0=(CrL * FIX(1.40200))=(R-Y)L
 	movq	MMWORD [wk(0)], mm6	; wk(0)=(B-Y)H
 	movq	MMWORD [wk(1)], mm7	; wk(1)=(R-Y)H
 	movq      mm6,mm5
 	movq      mm7,mm2
 	punpcklwd mm5,mm1
 	punpckhwd mm6,mm1
 	pmaddwd   mm5,[GOTOFF(eax,PW_MF0344_F0285)]
 	pmaddwd   mm6,[GOTOFF(eax,PW_MF0344_F0285)]
 	punpcklwd mm2,mm3
 	punpckhwd mm7,mm3
 	pmaddwd   mm2,[GOTOFF(eax,PW_MF0344_F0285)]
 	pmaddwd   mm7,[GOTOFF(eax,PW_MF0344_F0285)]
 	paddd     mm5,[GOTOFF(eax,PD_ONEHALF)]
 	paddd     mm6,[GOTOFF(eax,PD_ONEHALF)]
 	psrad     mm5,SCALEBITS
 	psrad     mm6,SCALEBITS
 	paddd     mm2,[GOTOFF(eax,PD_ONEHALF)]
 	paddd     mm7,[GOTOFF(eax,PD_ONEHALF)]
 	psrad     mm2,SCALEBITS
 	psrad     mm7,SCALEBITS
 	packssdw  mm5,mm6	; mm5=CbH*-FIX(0.344)+CrH*FIX(0.285)
 	packssdw  mm2,mm7	; mm2=CbL*-FIX(0.344)+CrL*FIX(0.285)
 	psubw     mm5,mm1	; mm5=CbH*-FIX(0.344)+CrH*-FIX(0.714)=(G-Y)H
 	psubw     mm2,mm3	; mm2=CbL*-FIX(0.344)+CrL*-FIX(0.714)=(G-Y)L
 	movq	MMWORD [wk(2)], mm5	; wk(2)=(G-Y)H
 	mov	ah,2			; YHctr
 	jmp	short .YHloop_1st
 	alignx	16,7
 .YHloop_2nd:
 	movq	mm0, MMWORD [wk(1)]	; mm0=(R-Y)H
 	movq	mm2, MMWORD [wk(2)]	; mm2=(G-Y)H
 	movq	mm4, MMWORD [wk(0)]	; mm4=(B-Y)H
 	alignx	16,7
 .YHloop_1st:
 	movq	MMWORD [wk(3)], mm0	; wk(3)=(R-Y)(L/H)
 	movq	MMWORD [wk(4)], mm2	; wk(4)=(G-Y)(L/H)
 	movq	MMWORD [wk(5)], mm4	; wk(5)=(B-Y)(L/H)
 	movq	mm7, MMWORD [ebx]	; mm7=Y(01234567)
 	mov	al,2			; YVctr
 	jmp	short .YVloop_1st
 	alignx	16,7
 .YVloop_2nd:
 	movq	mm0, MMWORD [wk(3)]	; mm0=(R-Y)(L/H)
 	movq	mm2, MMWORD [wk(4)]	; mm2=(G-Y)(L/H)
 	movq	mm4, MMWORD [wk(5)]	; mm4=(B-Y)(L/H)
 	movq	mm7, MMWORD [edx]	; mm7=Y(01234567)
 	alignx	16,7
 .YVloop_1st:
 	pcmpeqw	mm6,mm6
 	psrlw	mm6,BYTE_BIT		; mm6={0xFF 0x00 0xFF 0x00 ..}
 	pand	mm6,mm7			; mm6=Y(0246)=YE
 	psrlw	mm7,BYTE_BIT		; mm7=Y(1357)=YO
 	movq	mm1,mm0			; mm1=mm0=(R-Y)(L/H)
 	movq	mm3,mm2			; mm3=mm2=(G-Y)(L/H)
 	movq	mm5,mm4			; mm5=mm4=(B-Y)(L/H)
 	paddw     mm0,mm6		; mm0=((R-Y)+YE)=RE=(R0 R2 R4 R6)
 	paddw     mm1,mm7		; mm1=((R-Y)+YO)=RO=(R1 R3 R5 R7)
 	packuswb  mm0,mm0		; mm0=(R0 R2 R4 R6 ** ** ** **)
 	packuswb  mm1,mm1		; mm1=(R1 R3 R5 R7 ** ** ** **)
 	paddw     mm2,mm6		; mm2=((G-Y)+YE)=GE=(G0 G2 G4 G6)
 	paddw     mm3,mm7		; mm3=((G-Y)+YO)=GO=(G1 G3 G5 G7)
 	packuswb  mm2,mm2		; mm2=(G0 G2 G4 G6 ** ** ** **)
 	packuswb  mm3,mm3		; mm3=(G1 G3 G5 G7 ** ** ** **)
 	paddw     mm4,mm6		; mm4=((B-Y)+YE)=BE=(B0 B2 B4 B6)
 	paddw     mm5,mm7		; mm5=((B-Y)+YO)=BO=(B1 B3 B5 B7)
 	packuswb  mm4,mm4		; mm4=(B0 B2 B4 B6 ** ** ** **)
 	packuswb  mm5,mm5		; mm5=(B1 B3 B5 B7 ** ** ** **)
 %if RGB_PIXELSIZE == 3 ; ---------------
 	; mmA=(00 02 04 06 ** ** ** **), mmB=(01 03 05 07 ** ** ** **)
 	; mmC=(10 12 14 16 ** ** ** **), mmD=(11 13 15 17 ** ** ** **)
 	; mmE=(20 22 24 26 ** ** ** **), mmF=(21 23 25 27 ** ** ** **)
 	; mmG=(** ** ** ** ** ** ** **), mmH=(** ** ** ** ** ** ** **)
 	punpcklbw mmA,mmC		; mmA=(00 10 02 12 04 14 06 16)
 	punpcklbw mmE,mmB		; mmE=(20 01 22 03 24 05 26 07)
 	punpcklbw mmD,mmF		; mmD=(11 21 13 23 15 25 17 27)
 	movq      mmG,mmA
 	movq      mmH,mmA
 	punpcklwd mmA,mmE		; mmA=(00 10 20 01 02 12 22 03)
 	punpckhwd mmG,mmE		; mmG=(04 14 24 05 06 16 26 07)
 	psrlq     mmH,2*BYTE_BIT	; mmH=(02 12 04 14 06 16 -- --)
 	psrlq     mmE,2*BYTE_BIT	; mmE=(22 03 24 05 26 07 -- --)
 	movq      mmC,mmD
 	movq      mmB,mmD
 	punpcklwd mmD,mmH		; mmD=(11 21 02 12 13 23 04 14)
 	punpckhwd mmC,mmH		; mmC=(15 25 06 16 17 27 -- --)
 	psrlq     mmB,2*BYTE_BIT	; mmB=(13 23 15 25 17 27 -- --)
 	movq      mmF,mmE
 	punpcklwd mmE,mmB		; mmE=(22 03 13 23 24 05 15 25)
 	punpckhwd mmF,mmB		; mmF=(26 07 17 27 -- -- -- --)
 	punpckldq mmA,mmD		; mmA=(00 10 20 01 11 21 02 12)
 	punpckldq mmE,mmG		; mmE=(22 03 13 23 04 14 24 05)
 	punpckldq mmC,mmF		; mmC=(15 25 06 16 26 07 17 27)
 	dec	al			; YVctr
 	jz	short .YVloop_break
 	movq	MMWORD [wk(6)], mmA
 	movq	MMWORD [wk(7)], mmE
 	movq	MMWORD [wk(8)], mmC
 	jmp	near .YVloop_2nd
 	alignx	16,7
 .YVloop_break:
 	movq	mmH, MMWORD [wk(6)]
 	movq	mmB, MMWORD [wk(7)]
 	movq	mmD, MMWORD [wk(8)]
 	cmp	ecx, byte SIZEOF_MMWORD
 	jb	short .column_st16
 	movq	MMWORD [esi+0*SIZEOF_MMWORD], mmH
 	movq	MMWORD [esi+1*SIZEOF_MMWORD], mmB
 	movq	MMWORD [esi+2*SIZEOF_MMWORD], mmD
 	movq	MMWORD [edi+0*SIZEOF_MMWORD], mmA
 	movq	MMWORD [edi+1*SIZEOF_MMWORD], mmE
 	movq	MMWORD [edi+2*SIZEOF_MMWORD], mmC
 	sub	ecx, byte SIZEOF_MMWORD
 	jz	near .endcolumn
 	add	esi, byte RGB_PIXELSIZE*SIZEOF_MMWORD	; outptr0
 	add	edi, byte RGB_PIXELSIZE*SIZEOF_MMWORD	; outptr1
 	add	ebx, byte SIZEOF_MMWORD			; inptr00
 	add	edx, byte SIZEOF_MMWORD			; inptr01
 	dec	ah			; YHctr
 	jnz	near .YHloop_2nd
 	push	ebx			; inptr00
 	push	edx			; inptr01
 	mov	ebx, JSAMPROW [inptr1]	; ebx=inptr1
 	mov	edx, JSAMPROW [inptr2]	; edx=inptr2
 	add	ebx, byte SIZEOF_MMWORD	; inptr1
 	add	edx, byte SIZEOF_MMWORD	; inptr2
 	jmp	near .columnloop
 	alignx	16,7
 .column_st16:
 	lea	ecx, [ecx+ecx*2]	; imul ecx, RGB_PIXELSIZE
 	cmp	ecx, byte 2*SIZEOF_MMWORD
 	jb	short .column_st8
 	movq	MMWORD [esi+0*SIZEOF_MMWORD], mmH
 	movq	MMWORD [esi+1*SIZEOF_MMWORD], mmB
 	movq	MMWORD [edi+0*SIZEOF_MMWORD], mmA
 	movq	MMWORD [edi+1*SIZEOF_MMWORD], mmE
 	movq	mmH,mmD
 	movq	mmA,mmC
 	sub	ecx, byte 2*SIZEOF_MMWORD
 	add	esi, byte 2*SIZEOF_MMWORD
 	add	edi, byte 2*SIZEOF_MMWORD
 	jmp	short .column_st4
 .column_st8:
 	cmp	ecx, byte SIZEOF_MMWORD
 	jb	short .column_st4
 	movq	MMWORD [esi+0*SIZEOF_MMWORD], mmH
 	movq	MMWORD [edi+0*SIZEOF_MMWORD], mmA
 	movq	mmH,mmB
 	movq	mmA,mmE
 	sub	ecx, byte SIZEOF_MMWORD
 	add	esi, byte SIZEOF_MMWORD
 	add	edi, byte SIZEOF_MMWORD
 .column_st4:
 	movd	eax,mmH
 	movd	edx,mmA
 	cmp	ecx, byte SIZEOF_DWORD
 	jb	short .column_st2
 	mov	DWORD [esi+0*SIZEOF_DWORD], eax
 	mov	DWORD [edi+0*SIZEOF_DWORD], edx
 	psrlq	mmH,DWORD_BIT
 	psrlq	mmA,DWORD_BIT
 	movd	eax,mmH
 	movd	edx,mmA
 	sub	ecx, byte SIZEOF_DWORD
 	add	esi, byte SIZEOF_DWORD
 	add	edi, byte SIZEOF_DWORD
 .column_st2:
 	cmp	ecx, byte SIZEOF_WORD
 	jb	short .column_st1
 	mov	WORD [esi+0*SIZEOF_WORD], ax
 	mov	WORD [edi+0*SIZEOF_WORD], dx
 	shr	eax,WORD_BIT
 	shr	edx,WORD_BIT
 	sub	ecx, byte SIZEOF_WORD
 	add	esi, byte SIZEOF_WORD
 	add	edi, byte SIZEOF_WORD
 .column_st1:
 	cmp	ecx, byte SIZEOF_BYTE
 	jb	short .endcolumn
 	mov	BYTE [esi+0*SIZEOF_BYTE], al
 	mov	BYTE [edi+0*SIZEOF_BYTE], dl
 %else ; RGB_PIXELSIZE == 4 ; -----------
 %ifdef RGBX_FILLER_0XFF
 	pcmpeqb   mm6,mm6		; mm6=(X0 X2 X4 X6 ** ** ** **)
 	pcmpeqb   mm7,mm7		; mm7=(X1 X3 X5 X7 ** ** ** **)
 %else
 	pxor      mm6,mm6		; mm6=(X0 X2 X4 X6 ** ** ** **)
 	pxor      mm7,mm7		; mm7=(X1 X3 X5 X7 ** ** ** **)
 %endif
 	; mmA=(00 02 04 06 ** ** ** **), mmB=(01 03 05 07 ** ** ** **)
 	; mmC=(10 12 14 16 ** ** ** **), mmD=(11 13 15 17 ** ** ** **)
 	; mmE=(20 22 24 26 ** ** ** **), mmF=(21 23 25 27 ** ** ** **)
 	; mmG=(30 32 34 36 ** ** ** **), mmH=(31 33 35 37 ** ** ** **)
 	punpcklbw mmA,mmC		; mmA=(00 10 02 12 04 14 06 16)
 	punpcklbw mmE,mmG		; mmE=(20 30 22 32 24 34 26 36)
 	punpcklbw mmB,mmD		; mmB=(01 11 03 13 05 15 07 17)
 	punpcklbw mmF,mmH		; mmF=(21 31 23 33 25 35 27 37)
 	movq      mmC,mmA
 	punpcklwd mmA,mmE		; mmA=(00 10 20 30 02 12 22 32)
 	punpckhwd mmC,mmE		; mmC=(04 14 24 34 06 16 26 36)
 	movq      mmG,mmB
 	punpcklwd mmB,mmF		; mmB=(01 11 21 31 03 13 23 33)
 	punpckhwd mmG,mmF		; mmG=(05 15 25 35 07 17 27 37)
 	movq      mmD,mmA
 	punpckldq mmA,mmB		; mmA=(00 10 20 30 01 11 21 31)
 	punpckhdq mmD,mmB		; mmD=(02 12 22 32 03 13 23 33)
 	movq      mmH,mmC
 	punpckldq mmC,mmG		; mmC=(04 14 24 34 05 15 25 35)
 	punpckhdq mmH,mmG		; mmH=(06 16 26 36 07 17 27 37)
 	dec	al			; YVctr
 	jz	short .YVloop_break
 	movq	MMWORD [wk(6)], mmA
 	movq	MMWORD [wk(7)], mmD
 	movq	MMWORD [wk(8)], mmC
 	movq	MMWORD [wk(9)], mmH
 	jmp	near .YVloop_2nd
 	alignx	16,7
 .YVloop_break:
 	movq	mmE, MMWORD [wk(6)]
 	movq	mmF, MMWORD [wk(7)]
 	movq	mmB, MMWORD [wk(8)]
 	movq	mmG, MMWORD [wk(9)]
 	cmp	ecx, byte SIZEOF_MMWORD
 	jb	short .column_st16
 	movq	MMWORD [esi+0*SIZEOF_MMWORD], mmE
 	movq	MMWORD [esi+1*SIZEOF_MMWORD], mmF
 	movq	MMWORD [esi+2*SIZEOF_MMWORD], mmB
 	movq	MMWORD [esi+3*SIZEOF_MMWORD], mmG
 	movq	MMWORD [edi+0*SIZEOF_MMWORD], mmA
 	movq	MMWORD [edi+1*SIZEOF_MMWORD], mmD
 	movq	MMWORD [edi+2*SIZEOF_MMWORD], mmC
 	movq	MMWORD [edi+3*SIZEOF_MMWORD], mmH
 	sub	ecx, byte SIZEOF_MMWORD
 	jz	short .endcolumn
 	add	esi, byte RGB_PIXELSIZE*SIZEOF_MMWORD	; outptr0
 	add	edi, byte RGB_PIXELSIZE*SIZEOF_MMWORD	; outptr1
 	add	ebx, byte SIZEOF_MMWORD			; inptr00
 	add	edx, byte SIZEOF_MMWORD			; inptr01
 	dec	ah			; YHctr
 	jnz	near .YHloop_2nd
 	push	ebx			; inptr00
 	push	edx			; inptr01
 	mov	ebx, JSAMPROW [inptr1]	; ebx=inptr1
 	mov	edx, JSAMPROW [inptr2]	; edx=inptr2
 	add	ebx, byte SIZEOF_MMWORD	; inptr1
 	add	edx, byte SIZEOF_MMWORD	; inptr2
 	jmp	near .columnloop
 	alignx	16,7
 .column_st16:
 	cmp	ecx, byte SIZEOF_MMWORD/2
 	jb	short .column_st8
 	movq	MMWORD [esi+0*SIZEOF_MMWORD], mmE
 	movq	MMWORD [esi+1*SIZEOF_MMWORD], mmF
 	movq	MMWORD [edi+0*SIZEOF_MMWORD], mmA
 	movq	MMWORD [edi+1*SIZEOF_MMWORD], mmD
 	movq	mmE,mmB
 	movq	mmF,mmG
 	movq	mmA,mmC
 	movq	mmD,mmH
 	sub	ecx, byte SIZEOF_MMWORD/2
 	add	esi, byte 2*SIZEOF_MMWORD
 	add	edi, byte 2*SIZEOF_MMWORD
 .column_st8:
 	cmp	ecx, byte SIZEOF_MMWORD/4
 	jb	short .column_st4
 	movq	MMWORD [esi+0*SIZEOF_MMWORD], mmE
 	movq	MMWORD [edi+0*SIZEOF_MMWORD], mmA
 	movq	mmE,mmF
 	movq	mmA,mmD
 	sub	ecx, byte SIZEOF_MMWORD/4
 	add	esi, byte 1*SIZEOF_MMWORD
 	add	edi, byte 1*SIZEOF_MMWORD
 .column_st4:
 	cmp	ecx, byte SIZEOF_MMWORD/8
 	jb	short .endcolumn
 	movd	DWORD [esi+0*SIZEOF_DWORD], mmE
 	movd	DWORD [edi+0*SIZEOF_DWORD], mmA
 %endif ; RGB_PIXELSIZE ; ---------------
 .endcolumn:
 	emms		; empty MMX state
 .return:
 	pop	edi
 	pop	esi
 ;	pop	edx		; need not be preserved
 ;	pop	ecx		; need not be preserved
 	pop	ebx
 	mov	esp,ebp		; esp <- aligned ebp
 	pop	esp		; esp <- original ebp
 	pop	ebp
 	ret
 %endif ; !USE_DEDICATED_H2V2_MERGED_UPSAMPLE_MMX
 %endif ; JDMERGE_MMX_SUPPORTED
 %endif ; UPSAMPLE_MERGING_SUPPORTED
 %endif ; RGB_PIXELSIZE == 3 || RGB_PIXELSIZE == 4
--- a/jdmerss2.asm
+++ b/jdmerss2.asm
--- a/jdphuff.c
+++ b/jdphuff.c
@@ -1,10 +1,17 @@
 /*
 * jdphuff.c
 *
- * Copyright (C) 1995-1996, Thomas G. Lane.
+ * Copyright (C) 1995-1997, Thomas G. Lane.
 * This file is part of the Independent JPEG Group's software.
 * For conditions of distribution and use, see the accompanying README file.
 *
 * ---------------------------------------------------------------------
 * x86 SIMD extension for IJG JPEG library
 * Copyright (C) 1999-2006, MIYASAKA Masaru.
 * This file has been modified to improve performance.
 * Last Modified : October 31, 2004
 * ---------------------------------------------------------------------
 *
 * This file contains Huffman entropy decoding routines for progressive JPEG.
 *
 * Much of the complexity here has to do with supporting input suspension.
@@ -69,6 +76,7 @@ typedef struct {
  d_derived_tbl * derived_tbls[NUM_HUFF_TBLS];
  d_derived_tbl * ac_derived_tbl; /* active table during an AC scan */
  d_derived_tbl * dc_derived_tbls[MAX_COMPS_IN_SCAN];
 } phuff_entropy_decoder;
 typedef phuff_entropy_decoder * phuff_entropy_ptr;
@@ -119,6 +127,12 @@ start_pass_phuff_decoder (j_decompress_ptr cinfo)
  }
  if (cinfo->Al > 13)		/* need not check for < 0 */
    bad = TRUE;
  /* Arguably the maximum Al value should be less than 13 for 8-bit precision,
   * but the spec doesn't say so, and we try to be liberal about what we
   * accept.  Note: large Al values could result in out-of-range DC
   * coefficients during early scans, leading to bizarre displays due to
   * overflows in the IDCT math.  But we won't crash.
   */
  if (bad)
    ERREXIT4(cinfo, JERR_BAD_PROGRESSION,
 	     cinfo->Ss, cinfo->Se, cinfo->Ah, cinfo->Al);
@@ -160,18 +174,13 @@ start_pass_phuff_decoder (j_decompress_ptr cinfo)
    if (is_DC_band) {
      if (cinfo->Ah == 0) {	/* DC refinement needs no table */
 	tbl = compptr->dc_tbl_no;
-	if (tbl < 0 || tbl >= NUM_HUFF_TBLS ||
+	jpeg_make_d_derived_tbl(cinfo, TRUE, tbl,
 	    cinfo->dc_huff_tbl_ptrs[tbl] == NULL)
 	  ERREXIT1(cinfo, JERR_NO_HUFF_TABLE, tbl);
 	jpeg_make_d_derived_tbl(cinfo, cinfo->dc_huff_tbl_ptrs[tbl],
 				& entropy->derived_tbls[tbl]);
 	entropy->dc_derived_tbls[ci] = entropy->derived_tbls[tbl];
      }
    } else {
      tbl = compptr->ac_tbl_no;
-      if (tbl < 0 || tbl >= NUM_HUFF_TBLS ||
+      jpeg_make_d_derived_tbl(cinfo, FALSE, tbl,
          cinfo->ac_huff_tbl_ptrs[tbl] == NULL)
        ERREXIT1(cinfo, JERR_NO_HUFF_TABLE, tbl);
      jpeg_make_d_derived_tbl(cinfo, cinfo->ac_huff_tbl_ptrs[tbl],
 			      & entropy->derived_tbls[tbl]);
      /* remember the single active table */
      entropy->ac_derived_tbl = entropy->derived_tbls[tbl];
@@ -183,7 +192,7 @@ start_pass_phuff_decoder (j_decompress_ptr cinfo)
  /* Initialize bitread state variables */
  entropy->bitstate.bits_left = 0;
  entropy->bitstate.get_buffer = 0; /* unnecessary, but keeps Purify quiet */
-  entropy->bitstate.printed_eod = FALSE;
+  entropy->pub.insufficient_data = FALSE;
  /* Initialize private state variables */
  entropy->saved.EOBRUN = 0;
@@ -193,32 +202,6 @@ start_pass_phuff_decoder (j_decompress_ptr cinfo)
 }
 /*
 * Figure F.12: extend sign bit.
 * On some machines, a shift and add will be faster than a table lookup.
 */
 #ifdef AVOID_TABLES
 #define HUFF_EXTEND(x,s)  ((x) < (1<<((s)-1)) ? (x) + (((-1)<<(s)) + 1) : (x))
 #else
 #define HUFF_EXTEND(x,s)  ((x) < extend_test[s] ? (x) + extend_offset[s] : (x))
 static const int extend_test[16] =   /* entry n is 2**(n-1) */
  { 0, 0x0001, 0x0002, 0x0004, 0x0008, 0x0010, 0x0020, 0x0040, 0x0080,
    0x0100, 0x0200, 0x0400, 0x0800, 0x1000, 0x2000, 0x4000 };
 static const int extend_offset[16] = /* entry n is (-1 << n) + 1 */
  { 0, ((-1)<<1) + 1, ((-1)<<2) + 1, ((-1)<<3) + 1, ((-1)<<4) + 1,
    ((-1)<<5) + 1, ((-1)<<6) + 1, ((-1)<<7) + 1, ((-1)<<8) + 1,
    ((-1)<<9) + 1, ((-1)<<10) + 1, ((-1)<<11) + 1, ((-1)<<12) + 1,
    ((-1)<<13) + 1, ((-1)<<14) + 1, ((-1)<<15) + 1 };
 #endif /* AVOID_TABLES */
 /*
 * Check for a restart marker & resynchronize decoder.
 * Returns FALSE if must suspend.
@@ -248,8 +231,13 @@ process_restart (j_decompress_ptr cinfo)
  /* Reset restart counter */
  entropy->restarts_to_go = cinfo->restart_interval;
-  /* Next segment can get another out-of-data warning */
+  /* Reset out-of-data flag, unless read_restart_marker left us smack up
-  entropy->bitstate.printed_eod = FALSE;
+   * against a marker.  In that case we will end up treating the next data
   * segment as empty, and we can avoid producing bogus output pixels by
   * leaving the flag set.
   */
  if (cinfo->unread_marker == 0)
    entropy->pub.insufficient_data = FALSE;
  return TRUE;
 }
@@ -282,13 +270,9 @@ decode_mcu_DC_first (j_decompress_ptr cinfo, JBLOCKROW *MCU_data)
 {
  phuff_entropy_ptr entropy = (phuff_entropy_ptr) cinfo->entropy;
  int Al = cinfo->Al;
-  register int s, r;
+  int blkn;
  int blkn, ci;
  JBLOCKROW block;
  BITREAD_STATE_VARS;
  savable_state state;
  d_derived_tbl * tbl;
  jpeg_component_info * compptr;
  /* Process restart marker if needed; may have to suspend */
  if (cinfo->restart_interval) {
@@ -297,6 +281,11 @@ decode_mcu_DC_first (j_decompress_ptr cinfo, JBLOCKROW *MCU_data)
 	return FALSE;
  }
  /* If we've run out of data, just leave the MCU set to zeroes.
   * This way, we return uniform gray for the remainder of the segment.
   */
  if (! entropy->pub.insufficient_data) {
    /* Load up working state */
    BITREAD_LOAD_STATE(cinfo,entropy->bitstate);
    ASSIGN_STATE(state, entropy->saved);
@@ -304,31 +293,78 @@ decode_mcu_DC_first (j_decompress_ptr cinfo, JBLOCKROW *MCU_data)
    /* Outer loop handles each block in the MCU */
    for (blkn = 0; blkn < cinfo->blocks_in_MCU; blkn++) {
-    block = MCU_data[blkn];
+      JBLOCKROW block = MCU_data[blkn];
-    ci = cinfo->MCU_membership[blkn];
+      int ci = cinfo->MCU_membership[blkn];
-    compptr = cinfo->cur_comp_info[ci];
+      d_derived_tbl * tbl = entropy->dc_derived_tbls[ci];
-    tbl = entropy->derived_tbls[compptr->dc_tbl_no];
+      register int s;
      /* Decode a single block's worth of coefficients */
      /* Section F.2.2.1: decode the DC coefficient difference */
-    HUFF_DECODE(s, br_state, tbl, return FALSE, label1);
+      {		/* HUFFX_DECODE */
 	register int nb, look, t;
 	if (bits_left < HUFFX_LOOKAHEAD) {
 	  register const JOCTET * next_input_byte = br_state.next_input_byte;
 	  register size_t         bytes_in_buffer = br_state.bytes_in_buffer;
 	  if (cinfo->unread_marker == 0) {
 	    while (bits_left < MIN_GET_BITS) {
 	      register int c;
 	      if (bytes_in_buffer == 0 ||
 		  (c = GETJOCTET(*next_input_byte)) == 0xFF) {
 		goto label11; }
 	      bytes_in_buffer--; next_input_byte++;
 	      get_buffer = (get_buffer << 8) | c;
 	      bits_left += 8;
 	    }
 	    br_state.next_input_byte = next_input_byte;
 	    br_state.bytes_in_buffer = bytes_in_buffer;
 	  } else {
 	label11:
 	    br_state.next_input_byte = next_input_byte;
 	    br_state.bytes_in_buffer = bytes_in_buffer;
 	    if (! jpeg_fill_bit_buffer(&br_state,get_buffer,bits_left, 0)) {
 	      return FALSE; }
 	    get_buffer = br_state.get_buffer; bits_left = br_state.bits_left;
 	    if (bits_left < HUFFX_LOOKAHEAD) {
 	      nb = 1; goto label1;
 	    }
 	  }
 	}
 	look = PEEK_BITS(HUFFX_LOOKAHEAD);
 	if ((nb = tbl->lookx_nbits[look]) != 0) {
 	  s = tbl->lookx_val[look];
 	  if (nb <= HUFFX_LOOKAHEAD) {
 	    DROP_BITS(nb);
 	  } else {
 	    DROP_BITS(HUFFX_LOOKAHEAD);
 	    nb -= HUFFX_LOOKAHEAD;
 	    CHECK_BIT_BUFFER(br_state, nb, return FALSE);
 	    s += GET_BITS(nb);
 	  }
 	} else {
 	  nb = HUFFX_LOOKAHEAD;
      label1:
 	  if ((s=jpeg_huff_decode(&br_state,get_buffer,bits_left,tbl,nb))
 	       < 0) { return FALSE; }
 	  get_buffer = br_state.get_buffer; bits_left = br_state.bits_left;
 	  if (s) {
 	    CHECK_BIT_BUFFER(br_state, s, return FALSE);
-      r = GET_BITS(s);
+	    t = GET_BITS(s);
-      s = HUFF_EXTEND(r, s);
+	    s = HUFF_EXTEND(t, s);
 	  }
 	}
      }
      /* Convert DC difference to actual value, update last_dc_val */
      s += state.last_dc_val[ci];
      state.last_dc_val[ci] = s;
-    /* Scale and output the DC coefficient (assumes jpeg_natural_order[0]=0) */
+      /* Scale and output the coefficient (assumes jpeg_natural_order[0]=0) */
      (*block)[0] = (JCOEF) (s << Al);
    }
    /* Completed MCU, so update state */
    BITREAD_SAVE_STATE(cinfo,entropy->bitstate);
    ASSIGN_STATE(entropy->saved, state);
  }
  /* Account for restart interval (no-op if not using restarts) */
  entropy->restarts_to_go--;
@@ -348,11 +384,8 @@ decode_mcu_AC_first (j_decompress_ptr cinfo, JBLOCKROW *MCU_data)
  phuff_entropy_ptr entropy = (phuff_entropy_ptr) cinfo->entropy;
  int Se = cinfo->Se;
  int Al = cinfo->Al;
  register int s, k, r;
  unsigned int EOBRUN;
  JBLOCKROW block;
  BITREAD_STATE_VARS;
  d_derived_tbl * tbl;
  /* Process restart marker if needed; may have to suspend */
  if (cinfo->restart_interval) {
@@ -361,29 +394,86 @@ decode_mcu_AC_first (j_decompress_ptr cinfo, JBLOCKROW *MCU_data)
 	return FALSE;
  }
  /* If we've run out of data, just leave the MCU set to zeroes.
   * This way, we return uniform gray for the remainder of the segment.
   */
  if (! entropy->pub.insufficient_data) {
    /* Load up working state.
     * We can avoid loading/saving bitread state if in an EOB run.
     */
-  EOBRUN = entropy->saved.EOBRUN; /* only part of saved state we care about */
+    EOBRUN = entropy->saved.EOBRUN;	/* only part of saved state we need */
    /* There is always only one block per MCU */
-  if (EOBRUN > 0)		/* if it's a band of zeroes... */
+    if (EOBRUN > 0) {		/* if it's a band of zeroes... */
      EOBRUN--;			/* ...process it now (we do nothing) */
-  else {
+    } else {
      JBLOCKROW block = MCU_data[0];
      d_derived_tbl * tbl = entropy->ac_derived_tbl;
      register int s, k, r;
      /* Load up working state */
      BITREAD_LOAD_STATE(cinfo,entropy->bitstate);
    block = MCU_data[0];
    tbl = entropy->ac_derived_tbl;
      for (k = cinfo->Ss; k <= Se; k++) {
-      HUFF_DECODE(s, br_state, tbl, return FALSE, label2);
+	{	/* HUFFX_DECODE */
-      r = s >> 4;
+	  register int nb, look, t;
-      s &= 15;
+	  if (bits_left < HUFFX_LOOKAHEAD) {
 	    register const JOCTET * next_input_byte = br_state.next_input_byte;
 	    register size_t         bytes_in_buffer = br_state.bytes_in_buffer;
 	    if (cinfo->unread_marker == 0) {
 	      while (bits_left < MIN_GET_BITS) {
 		register int c;
 		if (bytes_in_buffer == 0 ||
 		    (c = GETJOCTET(*next_input_byte)) == 0xFF) {
 		  goto label21; }
 		bytes_in_buffer--; next_input_byte++;
 		get_buffer = (get_buffer << 8) | c;
 		bits_left += 8;
 	      }
 	      br_state.next_input_byte = next_input_byte;
 	      br_state.bytes_in_buffer = bytes_in_buffer;
 	    } else {
 	  label21:
 	      br_state.next_input_byte = next_input_byte;
 	      br_state.bytes_in_buffer = bytes_in_buffer;
 	      if (! jpeg_fill_bit_buffer(&br_state,get_buffer,bits_left, 0)) {
 		return FALSE; }
 	      get_buffer = br_state.get_buffer; bits_left = br_state.bits_left;
 	      if (bits_left < HUFFX_LOOKAHEAD) {
 		nb = 1; goto label2;
 	      }
 	    }
 	  }
 	  look = PEEK_BITS(HUFFX_LOOKAHEAD);
 	  if ((nb = tbl->lookx_nbits[look]) != 0) {
 	    s = tbl->lookx_val[look];
 	    r = tbl->lookx_sym[look] >> 4;
 	    if (nb <= HUFFX_LOOKAHEAD) {
 	      DROP_BITS(nb);
 	    } else {
 	      DROP_BITS(HUFFX_LOOKAHEAD);
 	      nb -= HUFFX_LOOKAHEAD;
 	      CHECK_BIT_BUFFER(br_state, nb, return FALSE);
 	      s += GET_BITS(nb);
 	    }
 	  } else {
 	    nb = HUFFX_LOOKAHEAD;
 	label2:
 	    if ((s=jpeg_huff_decode(&br_state,get_buffer,bits_left,tbl,nb))
 		 < 0) { return FALSE; }
 	    get_buffer = br_state.get_buffer; bits_left = br_state.bits_left;
 	    r = s >> 4; s &= 15;
 	    if (s) {
 	      CHECK_BIT_BUFFER(br_state, s, return FALSE);
 	      t = GET_BITS(s);
 	      s = HUFF_EXTEND(t, s);
 	    }
 	  }
 	}
 	if (s) {
 	  k += r;
        CHECK_BIT_BUFFER(br_state, s, return FALSE);
        r = GET_BITS(s);
        s = HUFF_EXTEND(r, s);
 	  /* Scale and output coefficient in natural (dezigzagged) order */
 	  (*block)[jpeg_natural_order[k]] = (JCOEF) (s << Al);
 	} else {
@@ -406,7 +496,8 @@ decode_mcu_AC_first (j_decompress_ptr cinfo, JBLOCKROW *MCU_data)
    }
    /* Completed MCU, so update state */
-  entropy->saved.EOBRUN = EOBRUN; /* only part of saved state we care about */
+    entropy->saved.EOBRUN = EOBRUN;	/* only part of saved state we need */
  }
  /* Account for restart interval (no-op if not using restarts) */
  entropy->restarts_to_go--;
@@ -427,7 +518,6 @@ decode_mcu_DC_refine (j_decompress_ptr cinfo, JBLOCKROW *MCU_data)
  phuff_entropy_ptr entropy = (phuff_entropy_ptr) cinfo->entropy;
  int p1 = 1 << cinfo->Al;	/* 1 in the bit position being coded */
  int blkn;
  JBLOCKROW block;
  BITREAD_STATE_VARS;
  /* Process restart marker if needed; may have to suspend */
@@ -437,13 +527,17 @@ decode_mcu_DC_refine (j_decompress_ptr cinfo, JBLOCKROW *MCU_data)
 	return FALSE;
  }
  /* Not worth the cycles to check insufficient_data here,
   * since we will not change the data anyway if we read zeroes.
   */
  /* Load up working state */
  BITREAD_LOAD_STATE(cinfo,entropy->bitstate);
  /* Outer loop handles each block in the MCU */
  for (blkn = 0; blkn < cinfo->blocks_in_MCU; blkn++) {
-    block = MCU_data[blkn];
+    JBLOCKROW block = MCU_data[blkn];
    /* Encoded data is simply the next bit of the two's-complement DC value */
    CHECK_BIT_BUFFER(br_state, 1, return FALSE);
@@ -471,14 +565,14 @@ decode_mcu_AC_refine (j_decompress_ptr cinfo, JBLOCKROW *MCU_data)
 {
  phuff_entropy_ptr entropy = (phuff_entropy_ptr) cinfo->entropy;
  int Se = cinfo->Se;
-  int p1 = 1 << cinfo->Al;	/* 1 in the bit position being coded */
+  int Al = cinfo->Al;
  int m1 = (-1) << cinfo->Al;	/* -1 in the bit position being coded */
  register int s, k, r;
  unsigned int EOBRUN;
  JBLOCKROW block;
  JCOEFPTR thiscoef;
  BITREAD_STATE_VARS;
  d_derived_tbl * tbl;
  int pm1[2];
  int num_newnz;
  int newnz_pos[DCTSIZE2];
@@ -489,19 +583,30 @@ decode_mcu_AC_refine (j_decompress_ptr cinfo, JBLOCKROW *MCU_data)
 	return FALSE;
  }
  /* If we've run out of data, don't modify the MCU.
   */
  if (! entropy->pub.insufficient_data) {
    /* Load up working state */
    BITREAD_LOAD_STATE(cinfo,entropy->bitstate);
-  EOBRUN = entropy->saved.EOBRUN; /* only part of saved state we care about */
+    EOBRUN = entropy->saved.EOBRUN; /* only part of saved state we need */
    /* There is always only one block per MCU */
    block = MCU_data[0];
    tbl = entropy->ac_derived_tbl;
    /* The pm1[] array is indexed by the result of a relational operator.
     * This method eliminates conditional branches depending on random data,
     * which result in lower performance on recent processors.
     */
    pm1[0] =   1  << cinfo->Al;	/* +1 in the bit position being coded */
    pm1[1] = (-1) << cinfo->Al;	/* -1 in the bit position being coded */
    /* If we are forced to suspend, we must undo the assignments to any newly
     * nonzero coefficients in the block, because otherwise we'd get confused
     * next time about which coefficients were already nonzero.
     * But we need not undo addition of bits to already-nonzero coefficients;
-   * instead, we can test the current bit position to see if we already did it.
+     * instead, we can test the current bit to see if we already did it.
     */
    num_newnz = 0;
@@ -510,18 +615,63 @@ decode_mcu_AC_refine (j_decompress_ptr cinfo, JBLOCKROW *MCU_data)
    if (EOBRUN == 0) {
      for (; k <= Se; k++) {
-      HUFF_DECODE(s, br_state, tbl, goto undoit, label3);
+	{	/* HUFFX_DECODE */
-      r = s >> 4;
+	  register int nb, look, t;
-      s &= 15;
+	  if (bits_left < HUFFX_LOOKAHEAD) {
 	    register const JOCTET * next_input_byte = br_state.next_input_byte;
 	    register size_t         bytes_in_buffer = br_state.bytes_in_buffer;
 	    if (cinfo->unread_marker == 0) {
 	      while (bits_left < MIN_GET_BITS) {
 		register int c;
 		if (bytes_in_buffer == 0 ||
 		    (c = GETJOCTET(*next_input_byte)) == 0xFF) {
 		  goto label31; }
 		bytes_in_buffer--; next_input_byte++;
 		get_buffer = (get_buffer << 8) | c;
 		bits_left += 8;
 	      }
 	      br_state.next_input_byte = next_input_byte;
 	      br_state.bytes_in_buffer = bytes_in_buffer;
 	    } else {
 	  label31:
 	      br_state.next_input_byte = next_input_byte;
 	      br_state.bytes_in_buffer = bytes_in_buffer;
 	      if (! jpeg_fill_bit_buffer(&br_state,get_buffer,bits_left, 0)) {
 		goto undoit; }
 	      get_buffer = br_state.get_buffer; bits_left = br_state.bits_left;
 	      if (bits_left < HUFFX_LOOKAHEAD) {
 		nb = 1; goto label3;
 	      }
 	    }
 	  }
 	  look = PEEK_BITS(HUFFX_LOOKAHEAD);
 	  if ((nb = tbl->lookx_nbits[look]) != 0) {
 	    t = tbl->lookx_sym[look];
 	    s = tbl->lookx_val[look];
 	    r = t >> 4; t &= 15;
 	    if (t <= 1) {
 	      DROP_BITS(nb);
 	    } else {		  /* size of new coef should always be 1 */
 	      WARNMS(cinfo, JWRN_HUFF_BAD_CODE);
 	      DROP_BITS(nb - (t - 1));
 	      s = (s >= 0) ? 1 : -1;
 	    }
 	  } else {
 	    nb = HUFFX_LOOKAHEAD;
 	label3:
 	    if ((s=jpeg_huff_decode(&br_state,get_buffer,bits_left,tbl,nb))
 		 < 0) { goto undoit; }
 	    get_buffer = br_state.get_buffer; bits_left = br_state.bits_left;
 	    r = s >> 4; s &= 15;
 	    if (s) {
 	      if (s != 1)	    /* size of new coef should always be 1 */
 		WARNMS(cinfo, JWRN_HUFF_BAD_CODE);
 	      CHECK_BIT_BUFFER(br_state, 1, goto undoit);
-        if (GET_BITS(1))
+	      s = GET_BITS(1) ? 1 : -1;
-	  s = p1;		/* newly nonzero coef is positive */
+	    }
-	else
+	  }
-	  s = m1;		/* newly nonzero coef is negative */
+	}
-      } else {
+	if (s == 0) {
 	  if (r != 15) {
 	    EOBRUN = 1 << r;	/* EOBr, run length is 2^r + appended bits */
 	    if (r) {
@@ -542,12 +692,8 @@ decode_mcu_AC_refine (j_decompress_ptr cinfo, JBLOCKROW *MCU_data)
 	  if (*thiscoef != 0) {
 	    CHECK_BIT_BUFFER(br_state, 1, goto undoit);
 	    if (GET_BITS(1)) {
-	    if ((*thiscoef & p1) == 0) { /* do nothing if already changed it */
+	      if ((*thiscoef & pm1[0]) == 0) /* do nothing if already set it */
-	      if (*thiscoef >= 0)
+		*thiscoef += pm1[(*thiscoef < 0)];
 		*thiscoef += p1;
 	      else
 		*thiscoef += m1;
 	    }
 	    }
 	  } else {
 	    if (--r < 0)
@@ -558,7 +704,7 @@ decode_mcu_AC_refine (j_decompress_ptr cinfo, JBLOCKROW *MCU_data)
 	if (s) {
 	  int pos = jpeg_natural_order[k];
 	  /* Output newly nonzero coefficient */
-	(*block)[pos] = (JCOEF) s;
+	  (*block)[pos] = (JCOEF) (s << Al);
 	  /* Remember its position in case we have to suspend */
 	  newnz_pos[num_newnz++] = pos;
 	}
@@ -576,12 +722,8 @@ decode_mcu_AC_refine (j_decompress_ptr cinfo, JBLOCKROW *MCU_data)
 	if (*thiscoef != 0) {
 	  CHECK_BIT_BUFFER(br_state, 1, goto undoit);
 	  if (GET_BITS(1)) {
-	  if ((*thiscoef & p1) == 0) { /* do nothing if already changed it */
+	    if ((*thiscoef & pm1[0]) == 0)  /* do nothing if already set it */
-	    if (*thiscoef >= 0)
+	      *thiscoef += pm1[(*thiscoef < 0)];
 	      *thiscoef += p1;
 	    else
 	      *thiscoef += m1;
 	  }
 	  }
 	}
      }
@@ -591,7 +733,8 @@ decode_mcu_AC_refine (j_decompress_ptr cinfo, JBLOCKROW *MCU_data)
    /* Completed MCU, so update state */
    BITREAD_SAVE_STATE(cinfo,entropy->bitstate);
-  entropy->saved.EOBRUN = EOBRUN; /* only part of saved state we care about */
+    entropy->saved.EOBRUN = EOBRUN; /* only part of saved state we need */
  }
  /* Account for restart interval (no-op if not using restarts) */
  entropy->restarts_to_go--;
--- a/jdsammmx.asm
+++ b/jdsammmx.asm
@@ -0,0 +1,893 @@
 ;
 ; jdsammmx.asm - upsampling (MMX)
 ;
 ; x86 SIMD extension for IJG JPEG library
 ; Copyright (C) 1999-2006, MIYASAKA Masaru.
 ; For conditions of distribution and use, see copyright notice in jsimdext.inc
 ;
 ; This file should be assembled with NASM (Netwide Assembler),
 ; and can *not* be assembled with Microsoft's MASM or any compatible
 ; assembler (including Borland's Turbo Assembler).
 ; NASM is available from http://nasm.sourceforge.net/ or
 ; http://sourceforge.net/project/showfiles.php?group_id=6208
 ;
 ; Last Modified : February 4, 2006
 ;
 ; [TAB8]
 %include "jsimdext.inc"
 %include "jcolsamp.inc"
 %ifdef JDSAMPLE_FANCY_MMX_SUPPORTED
 ; --------------------------------------------------------------------------
 	SECTION	SEG_CONST
 	alignz	16
 	global	EXTN(jconst_fancy_upsample_mmx)
 EXTN(jconst_fancy_upsample_mmx):
 PW_ONE		times 4 dw  1
 PW_TWO		times 4 dw  2
 PW_THREE	times 4 dw  3
 PW_SEVEN	times 4 dw  7
 PW_EIGHT	times 4 dw  8
 	alignz	16
 ; --------------------------------------------------------------------------
 	SECTION	SEG_TEXT
 	BITS	32
 ;
 ; Fancy processing for the common case of 2:1 horizontal and 1:1 vertical.
 ;
 ; The upsampling algorithm is linear interpolation between pixel centers,
 ; also known as a "triangle filter".  This is a good compromise between
 ; speed and visual quality.  The centers of the output pixels are 1/4 and 3/4
 ; of the way between input pixel centers.
 ;
 ; GLOBAL(void)
 ; jpeg_h2v1_fancy_upsample_mmx (j_decompress_ptr cinfo,
 ;                               jpeg_component_info * compptr,
 ;                               JSAMPARRAY input_data,
 ;                               JSAMPARRAY * output_data_ptr);
 ;
 %define cinfo(b)		(b)+8		; j_decompress_ptr cinfo
 %define compptr(b)		(b)+12		; jpeg_component_info * compptr
 %define input_data(b)		(b)+16		; JSAMPARRAY input_data
 %define output_data_ptr(b)	(b)+20		; JSAMPARRAY * output_data_ptr
 	align	16
 	global	EXTN(jpeg_h2v1_fancy_upsample_mmx)
 EXTN(jpeg_h2v1_fancy_upsample_mmx):
 	push	ebp
 	mov	ebp,esp
 	pushpic	ebx
 ;	push	ecx		; need not be preserved
 ;	push	edx		; need not be preserved
 	push	esi
 	push	edi
 	get_GOT	ebx		; get GOT address
 	mov	eax, POINTER [compptr(ebp)]
 	mov	eax, JDIMENSION [jcompinfo_downsampled_width(eax)]  ; colctr
 	test	eax,eax
 	jz	near .return
 	mov	ecx, POINTER [cinfo(ebp)]
 	mov	ecx, INT [jdstruct_max_v_samp_factor(ecx)]	; rowctr
 	test	ecx,ecx
 	jz	near .return
 	mov	esi, JSAMPARRAY [input_data(ebp)]	; input_data
 	mov	edi, POINTER [output_data_ptr(ebp)]
 	mov	edi, JSAMPARRAY [edi]			; output_data
 	alignx	16,7
 .rowloop:
 	push	eax			; colctr
 	push	edi
 	push	esi
 	mov	esi, JSAMPROW [esi]	; inptr
 	mov	edi, JSAMPROW [edi]	; outptr
 	test	eax, SIZEOF_MMWORD-1
 	jz	short .skip
 	mov	dl, JSAMPLE [esi+(eax-1)*SIZEOF_JSAMPLE]
 	mov	JSAMPLE [esi+eax*SIZEOF_JSAMPLE], dl	; insert a dummy sample
 .skip:
 	pxor	mm0,mm0			; mm0=(all 0's)
 	pcmpeqb	mm7,mm7
 	psrlq	mm7,(SIZEOF_MMWORD-1)*BYTE_BIT
 	pand	mm7, MMWORD [esi+0*SIZEOF_MMWORD]
 	add	eax, byte SIZEOF_MMWORD-1
 	and	eax, byte -SIZEOF_MMWORD
 	cmp	eax, byte SIZEOF_MMWORD
 	ja	short .columnloop
 	alignx	16,7
 .columnloop_last:
 	pcmpeqb	mm6,mm6
 	psllq	mm6,(SIZEOF_MMWORD-1)*BYTE_BIT
 	pand	mm6, MMWORD [esi+0*SIZEOF_MMWORD]
 	jmp	short .upsample
 	alignx	16,7
 .columnloop:
 	movq	mm6, MMWORD [esi+1*SIZEOF_MMWORD]
 	psllq	mm6,(SIZEOF_MMWORD-1)*BYTE_BIT
 .upsample:
 	movq	mm1, MMWORD [esi+0*SIZEOF_MMWORD]
 	movq	mm2,mm1
 	movq	mm3,mm1			; mm1=( 0 1 2 3 4 5 6 7)
 	psllq	mm2,BYTE_BIT		; mm2=( - 0 1 2 3 4 5 6)
 	psrlq	mm3,BYTE_BIT		; mm3=( 1 2 3 4 5 6 7 -)
 	por	mm2,mm7			; mm2=(-1 0 1 2 3 4 5 6)
 	por	mm3,mm6			; mm3=( 1 2 3 4 5 6 7 8)
 	movq	mm7,mm1
 	psrlq	mm7,(SIZEOF_MMWORD-1)*BYTE_BIT	; mm7=( 7 - - - - - - -)
 	movq      mm4,mm1
 	punpcklbw mm1,mm0		; mm1=( 0 1 2 3)
 	punpckhbw mm4,mm0		; mm4=( 4 5 6 7)
 	movq      mm5,mm2
 	punpcklbw mm2,mm0		; mm2=(-1 0 1 2)
 	punpckhbw mm5,mm0		; mm5=( 3 4 5 6)
 	movq      mm6,mm3
 	punpcklbw mm3,mm0		; mm3=( 1 2 3 4)
 	punpckhbw mm6,mm0		; mm6=( 5 6 7 8)
 	pmullw	mm1,[GOTOFF(ebx,PW_THREE)]
 	pmullw	mm4,[GOTOFF(ebx,PW_THREE)]
 	paddw	mm2,[GOTOFF(ebx,PW_ONE)]
 	paddw	mm5,[GOTOFF(ebx,PW_ONE)]
 	paddw	mm3,[GOTOFF(ebx,PW_TWO)]
 	paddw	mm6,[GOTOFF(ebx,PW_TWO)]
 	paddw	mm2,mm1
 	paddw	mm5,mm4
 	psrlw	mm2,2			; mm2=OutLE=( 0  2  4  6)
 	psrlw	mm5,2			; mm5=OutHE=( 8 10 12 14)
 	paddw	mm3,mm1
 	paddw	mm6,mm4
 	psrlw	mm3,2			; mm3=OutLO=( 1  3  5  7)
 	psrlw	mm6,2			; mm6=OutHO=( 9 11 13 15)
 	psllw	mm3,BYTE_BIT
 	psllw	mm6,BYTE_BIT
 	por	mm2,mm3			; mm2=OutL=( 0  1  2  3  4  5  6  7)
 	por	mm5,mm6			; mm5=OutH=( 8  9 10 11 12 13 14 15)
 	movq	MMWORD [edi+0*SIZEOF_MMWORD], mm2
 	movq	MMWORD [edi+1*SIZEOF_MMWORD], mm5
 	sub	eax, byte SIZEOF_MMWORD
 	add	esi, byte 1*SIZEOF_MMWORD	; inptr
 	add	edi, byte 2*SIZEOF_MMWORD	; outptr
 	cmp	eax, byte SIZEOF_MMWORD
 	ja	near .columnloop
 	test	eax,eax
 	jnz	near .columnloop_last
 	pop	esi
 	pop	edi
 	pop	eax
 	add	esi, byte SIZEOF_JSAMPROW	; input_data
 	add	edi, byte SIZEOF_JSAMPROW	; output_data
 	dec	ecx				; rowctr
 	jg	near .rowloop
 	emms		; empty MMX state
 .return:
 	pop	edi
 	pop	esi
 ;	pop	edx		; need not be preserved
 ;	pop	ecx		; need not be preserved
 	poppic	ebx
 	pop	ebp
 	ret
 ; --------------------------------------------------------------------------
 ;
 ; Fancy processing for the common case of 2:1 horizontal and 2:1 vertical.
 ; Again a triangle filter; see comments for h2v1 case, above.
 ;
 ; GLOBAL(void)
 ; jpeg_h2v2_fancy_upsample_mmx (j_decompress_ptr cinfo,
 ;                               jpeg_component_info * compptr,
 ;                               JSAMPARRAY input_data,
 ;                               JSAMPARRAY * output_data_ptr);
 ;
 %define cinfo(b)		(b)+8		; j_decompress_ptr cinfo
 %define compptr(b)		(b)+12		; jpeg_component_info * compptr
 %define input_data(b)		(b)+16		; JSAMPARRAY input_data
 %define output_data_ptr(b)	(b)+20		; JSAMPARRAY * output_data_ptr
 %define original_ebp	ebp+0
 %define wk(i)		ebp-(WK_NUM-(i))*SIZEOF_MMWORD	; mmword wk[WK_NUM]
 %define WK_NUM		4
 %define gotptr		wk(0)-SIZEOF_POINTER	; void * gotptr
 	align	16
 	global	EXTN(jpeg_h2v2_fancy_upsample_mmx)
 EXTN(jpeg_h2v2_fancy_upsample_mmx):
 	push	ebp
 	mov	eax,esp				; eax = original ebp
 	sub	esp, byte 4
 	and	esp, byte (-SIZEOF_MMWORD)	; align to 64 bits
 	mov	[esp],eax
 	mov	ebp,esp				; ebp = aligned ebp
 	lea	esp, [wk(0)]
 	pushpic	eax		; make room for the GOT address
 	push	ebx
 ;	push	ecx		; need not be preserved
 ;	push	edx		; need not be preserved
 	push	esi
 	push	edi
 	get_GOT	ebx			; get GOT address
 	movpic	POINTER [gotptr], ebx	; save GOT address
 	mov	edx,eax				; edx = original ebp
 	mov	eax, POINTER [compptr(edx)]
 	mov	eax, JDIMENSION [jcompinfo_downsampled_width(eax)]  ; colctr
 	test	eax,eax
 	jz	near .return
 	mov	ecx, POINTER [cinfo(edx)]
 	mov	ecx, INT [jdstruct_max_v_samp_factor(ecx)]	; rowctr
 	test	ecx,ecx
 	jz	near .return
 	mov	esi, JSAMPARRAY [input_data(edx)]	; input_data
 	mov	edi, POINTER [output_data_ptr(edx)]
 	mov	edi, JSAMPARRAY [edi]			; output_data
 	alignx	16,7
 .rowloop:
 	push	eax					; colctr
 	push	ecx
 	push	edi
 	push	esi
 	mov	ecx, JSAMPROW [esi-1*SIZEOF_JSAMPROW]	; inptr1(above)
 	mov	ebx, JSAMPROW [esi+0*SIZEOF_JSAMPROW]	; inptr0
 	mov	esi, JSAMPROW [esi+1*SIZEOF_JSAMPROW]	; inptr1(below)
 	mov	edx, JSAMPROW [edi+0*SIZEOF_JSAMPROW]	; outptr0
 	mov	edi, JSAMPROW [edi+1*SIZEOF_JSAMPROW]	; outptr1
 	test	eax, SIZEOF_MMWORD-1
 	jz	short .skip
 	push	edx
 	mov	dl, JSAMPLE [ecx+(eax-1)*SIZEOF_JSAMPLE]
 	mov	JSAMPLE [ecx+eax*SIZEOF_JSAMPLE], dl
 	mov	dl, JSAMPLE [ebx+(eax-1)*SIZEOF_JSAMPLE]
 	mov	JSAMPLE [ebx+eax*SIZEOF_JSAMPLE], dl
 	mov	dl, JSAMPLE [esi+(eax-1)*SIZEOF_JSAMPLE]
 	mov	JSAMPLE [esi+eax*SIZEOF_JSAMPLE], dl	; insert a dummy sample
 	pop	edx
 .skip:
 	; -- process the first column block
 	movq	mm0, MMWORD [ebx+0*SIZEOF_MMWORD]	; mm0=row[ 0][0]
 	movq	mm1, MMWORD [ecx+0*SIZEOF_MMWORD]	; mm1=row[-1][0]
 	movq	mm2, MMWORD [esi+0*SIZEOF_MMWORD]	; mm2=row[+1][0]
 	pushpic	ebx
 	movpic	ebx, POINTER [gotptr]	; load GOT address
 	pxor      mm3,mm3		; mm3=(all 0's)
 	movq      mm4,mm0
 	punpcklbw mm0,mm3		; mm0=row[ 0][0]( 0 1 2 3)
 	punpckhbw mm4,mm3		; mm4=row[ 0][0]( 4 5 6 7)
 	movq      mm5,mm1
 	punpcklbw mm1,mm3		; mm1=row[-1][0]( 0 1 2 3)
 	punpckhbw mm5,mm3		; mm5=row[-1][0]( 4 5 6 7)
 	movq      mm6,mm2
 	punpcklbw mm2,mm3		; mm2=row[+1][0]( 0 1 2 3)
 	punpckhbw mm6,mm3		; mm6=row[+1][0]( 4 5 6 7)
 	pmullw	mm0,[GOTOFF(ebx,PW_THREE)]
 	pmullw	mm4,[GOTOFF(ebx,PW_THREE)]
 	pcmpeqb	mm7,mm7
 	psrlq	mm7,(SIZEOF_MMWORD-2)*BYTE_BIT
 	paddw	mm1,mm0			; mm1=Int0L=( 0 1 2 3)
 	paddw	mm5,mm4			; mm5=Int0H=( 4 5 6 7)
 	paddw	mm2,mm0			; mm2=Int1L=( 0 1 2 3)
 	paddw	mm6,mm4			; mm6=Int1H=( 4 5 6 7)
 	movq	MMWORD [edx+0*SIZEOF_MMWORD], mm1	; temporarily save
 	movq	MMWORD [edx+1*SIZEOF_MMWORD], mm5	; the intermediate data
 	movq	MMWORD [edi+0*SIZEOF_MMWORD], mm2
 	movq	MMWORD [edi+1*SIZEOF_MMWORD], mm6
 	pand	mm1,mm7			; mm1=( 0 - - -)
 	pand	mm2,mm7			; mm2=( 0 - - -)
 	movq	MMWORD [wk(0)], mm1
 	movq	MMWORD [wk(1)], mm2
 	poppic	ebx
 	add	eax, byte SIZEOF_MMWORD-1
 	and	eax, byte -SIZEOF_MMWORD
 	cmp	eax, byte SIZEOF_MMWORD
 	ja	short .columnloop
 	alignx	16,7
 .columnloop_last:
 	; -- process the last column block
 	pushpic	ebx
 	movpic	ebx, POINTER [gotptr]	; load GOT address
 	pcmpeqb	mm1,mm1
 	psllq	mm1,(SIZEOF_MMWORD-2)*BYTE_BIT
 	movq	mm2,mm1
 	pand	mm1, MMWORD [edx+1*SIZEOF_MMWORD]	; mm1=( - - - 7)
 	pand	mm2, MMWORD [edi+1*SIZEOF_MMWORD]	; mm2=( - - - 7)
 	movq	MMWORD [wk(2)], mm1
 	movq	MMWORD [wk(3)], mm2
 	jmp	short .upsample
 	alignx	16,7
 .columnloop:
 	; -- process the next column block
 	movq	mm0, MMWORD [ebx+1*SIZEOF_MMWORD]	; mm0=row[ 0][1]
 	movq	mm1, MMWORD [ecx+1*SIZEOF_MMWORD]	; mm1=row[-1][1]
 	movq	mm2, MMWORD [esi+1*SIZEOF_MMWORD]	; mm2=row[+1][1]
 	pushpic	ebx
 	movpic	ebx, POINTER [gotptr]	; load GOT address
 	pxor      mm3,mm3		; mm3=(all 0's)
 	movq      mm4,mm0
 	punpcklbw mm0,mm3		; mm0=row[ 0][1]( 0 1 2 3)
 	punpckhbw mm4,mm3		; mm4=row[ 0][1]( 4 5 6 7)
 	movq      mm5,mm1
 	punpcklbw mm1,mm3		; mm1=row[-1][1]( 0 1 2 3)
 	punpckhbw mm5,mm3		; mm5=row[-1][1]( 4 5 6 7)
 	movq      mm6,mm2
 	punpcklbw mm2,mm3		; mm2=row[+1][1]( 0 1 2 3)
 	punpckhbw mm6,mm3		; mm6=row[+1][1]( 4 5 6 7)
 	pmullw	mm0,[GOTOFF(ebx,PW_THREE)]
 	pmullw	mm4,[GOTOFF(ebx,PW_THREE)]
 	paddw	mm1,mm0			; mm1=Int0L=( 0 1 2 3)
 	paddw	mm5,mm4			; mm5=Int0H=( 4 5 6 7)
 	paddw	mm2,mm0			; mm2=Int1L=( 0 1 2 3)
 	paddw	mm6,mm4			; mm6=Int1H=( 4 5 6 7)
 	movq	MMWORD [edx+2*SIZEOF_MMWORD], mm1	; temporarily save
 	movq	MMWORD [edx+3*SIZEOF_MMWORD], mm5	; the intermediate data
 	movq	MMWORD [edi+2*SIZEOF_MMWORD], mm2
 	movq	MMWORD [edi+3*SIZEOF_MMWORD], mm6
 	psllq	mm1,(SIZEOF_MMWORD-2)*BYTE_BIT	; mm1=( - - - 0)
 	psllq	mm2,(SIZEOF_MMWORD-2)*BYTE_BIT	; mm2=( - - - 0)
 	movq	MMWORD [wk(2)], mm1
 	movq	MMWORD [wk(3)], mm2
 .upsample:
 	; -- process the upper row
 	movq	mm7, MMWORD [edx+0*SIZEOF_MMWORD]	; mm7=Int0L=( 0 1 2 3)
 	movq	mm3, MMWORD [edx+1*SIZEOF_MMWORD]	; mm3=Int0H=( 4 5 6 7)
 	movq	mm0,mm7
 	movq	mm4,mm3
 	psrlq	mm0,2*BYTE_BIT			; mm0=( 1 2 3 -)
 	psllq	mm4,(SIZEOF_MMWORD-2)*BYTE_BIT	; mm4=( - - - 4)
 	movq	mm5,mm7
 	movq	mm6,mm3
 	psrlq	mm5,(SIZEOF_MMWORD-2)*BYTE_BIT	; mm5=( 3 - - -)
 	psllq	mm6,2*BYTE_BIT			; mm6=( - 4 5 6)
 	por	mm0,mm4				; mm0=( 1 2 3 4)
 	por	mm5,mm6				; mm5=( 3 4 5 6)
 	movq	mm1,mm7
 	movq	mm2,mm3
 	psllq	mm1,2*BYTE_BIT			; mm1=( - 0 1 2)
 	psrlq	mm2,2*BYTE_BIT			; mm2=( 5 6 7 -)
 	movq	mm4,mm3
 	psrlq	mm4,(SIZEOF_MMWORD-2)*BYTE_BIT	; mm4=( 7 - - -)
 	por	mm1, MMWORD [wk(0)]		; mm1=(-1 0 1 2)
 	por	mm2, MMWORD [wk(2)]		; mm2=( 5 6 7 8)
 	movq	MMWORD [wk(0)], mm4
 	pmullw	mm7,[GOTOFF(ebx,PW_THREE)]
 	pmullw	mm3,[GOTOFF(ebx,PW_THREE)]
 	paddw	mm1,[GOTOFF(ebx,PW_EIGHT)]
 	paddw	mm5,[GOTOFF(ebx,PW_EIGHT)]
 	paddw	mm0,[GOTOFF(ebx,PW_SEVEN)]
 	paddw	mm2,[GOTOFF(ebx,PW_SEVEN)]
 	paddw	mm1,mm7
 	paddw	mm5,mm3
 	psrlw	mm1,4			; mm1=Out0LE=( 0  2  4  6)
 	psrlw	mm5,4			; mm5=Out0HE=( 8 10 12 14)
 	paddw	mm0,mm7
 	paddw	mm2,mm3
 	psrlw	mm0,4			; mm0=Out0LO=( 1  3  5  7)
 	psrlw	mm2,4			; mm2=Out0HO=( 9 11 13 15)
 	psllw	mm0,BYTE_BIT
 	psllw	mm2,BYTE_BIT
 	por	mm1,mm0			; mm1=Out0L=( 0  1  2  3  4  5  6  7)
 	por	mm5,mm2			; mm5=Out0H=( 8  9 10 11 12 13 14 15)
 	movq	MMWORD [edx+0*SIZEOF_MMWORD], mm1
 	movq	MMWORD [edx+1*SIZEOF_MMWORD], mm5
 	; -- process the lower row
 	movq	mm6, MMWORD [edi+0*SIZEOF_MMWORD]	; mm6=Int1L=( 0 1 2 3)
 	movq	mm4, MMWORD [edi+1*SIZEOF_MMWORD]	; mm4=Int1H=( 4 5 6 7)
 	movq	mm7,mm6
 	movq	mm3,mm4
 	psrlq	mm7,2*BYTE_BIT			; mm7=( 1 2 3 -)
 	psllq	mm3,(SIZEOF_MMWORD-2)*BYTE_BIT	; mm3=( - - - 4)
 	movq	mm0,mm6
 	movq	mm2,mm4
 	psrlq	mm0,(SIZEOF_MMWORD-2)*BYTE_BIT	; mm0=( 3 - - -)
 	psllq	mm2,2*BYTE_BIT			; mm2=( - 4 5 6)
 	por	mm7,mm3				; mm7=( 1 2 3 4)
 	por	mm0,mm2				; mm0=( 3 4 5 6)
 	movq	mm1,mm6
 	movq	mm5,mm4
 	psllq	mm1,2*BYTE_BIT			; mm1=( - 0 1 2)
 	psrlq	mm5,2*BYTE_BIT			; mm5=( 5 6 7 -)
 	movq	mm3,mm4
 	psrlq	mm3,(SIZEOF_MMWORD-2)*BYTE_BIT	; mm3=( 7 - - -)
 	por	mm1, MMWORD [wk(1)]		; mm1=(-1 0 1 2)
 	por	mm5, MMWORD [wk(3)]		; mm5=( 5 6 7 8)
 	movq	MMWORD [wk(1)], mm3
 	pmullw	mm6,[GOTOFF(ebx,PW_THREE)]
 	pmullw	mm4,[GOTOFF(ebx,PW_THREE)]
 	paddw	mm1,[GOTOFF(ebx,PW_EIGHT)]
 	paddw	mm0,[GOTOFF(ebx,PW_EIGHT)]
 	paddw	mm7,[GOTOFF(ebx,PW_SEVEN)]
 	paddw	mm5,[GOTOFF(ebx,PW_SEVEN)]
 	paddw	mm1,mm6
 	paddw	mm0,mm4
 	psrlw	mm1,4			; mm1=Out1LE=( 0  2  4  6)
 	psrlw	mm0,4			; mm0=Out1HE=( 8 10 12 14)
 	paddw	mm7,mm6
 	paddw	mm5,mm4
 	psrlw	mm7,4			; mm7=Out1LO=( 1  3  5  7)
 	psrlw	mm5,4			; mm5=Out1HO=( 9 11 13 15)
 	psllw	mm7,BYTE_BIT
 	psllw	mm5,BYTE_BIT
 	por	mm1,mm7			; mm1=Out1L=( 0  1  2  3  4  5  6  7)
 	por	mm0,mm5			; mm0=Out1H=( 8  9 10 11 12 13 14 15)
 	movq	MMWORD [edi+0*SIZEOF_MMWORD], mm1
 	movq	MMWORD [edi+1*SIZEOF_MMWORD], mm0
 	poppic	ebx
 	sub	eax, byte SIZEOF_MMWORD
 	add	ecx, byte 1*SIZEOF_MMWORD	; inptr1(above)
 	add	ebx, byte 1*SIZEOF_MMWORD	; inptr0
 	add	esi, byte 1*SIZEOF_MMWORD	; inptr1(below)
 	add	edx, byte 2*SIZEOF_MMWORD	; outptr0
 	add	edi, byte 2*SIZEOF_MMWORD	; outptr1
 	cmp	eax, byte SIZEOF_MMWORD
 	ja	near .columnloop
 	test	eax,eax
 	jnz	near .columnloop_last
 	pop	esi
 	pop	edi
 	pop	ecx
 	pop	eax
 	add	esi, byte 1*SIZEOF_JSAMPROW	; input_data
 	add	edi, byte 2*SIZEOF_JSAMPROW	; output_data
 	sub	ecx, byte 2			; rowctr
 	jg	near .rowloop
 	emms		; empty MMX state
 .return:
 	pop	edi
 	pop	esi
 ;	pop	edx		; need not be preserved
 ;	pop	ecx		; need not be preserved
 	pop	ebx
 	mov	esp,ebp		; esp <- aligned ebp
 	pop	esp		; esp <- original ebp
 	pop	ebp
 	ret
 %ifdef UPSAMPLE_H1V2_SUPPORTED
 ; --------------------------------------------------------------------------
 ;
 ; Fancy processing for the common case of 1:1 horizontal and 2:1 vertical.
 ; Again a triangle filter; see comments for h2v1 case, above.
 ;
 ; GLOBAL(void)
 ; jpeg_h1v2_fancy_upsample_mmx (j_decompress_ptr cinfo,
 ;                               jpeg_component_info * compptr,
 ;                               JSAMPARRAY input_data,
 ;                               JSAMPARRAY * output_data_ptr);
 ;
 %define cinfo(b)		(b)+8		; j_decompress_ptr cinfo
 %define compptr(b)		(b)+12		; jpeg_component_info * compptr
 %define input_data(b)		(b)+16		; JSAMPARRAY input_data
 %define output_data_ptr(b)	(b)+20		; JSAMPARRAY * output_data_ptr
 %define gotptr		ebp-SIZEOF_POINTER	; void * gotptr
 	align	16
 	global	EXTN(jpeg_h1v2_fancy_upsample_mmx)
 EXTN(jpeg_h1v2_fancy_upsample_mmx):
 	push	ebp
 	mov	ebp,esp
 	pushpic	eax		; make room for the GOT address
 	push	ebx
 ;	push	ecx		; need not be preserved
 ;	push	edx		; need not be preserved
 	push	esi
 	push	edi
 	get_GOT	ebx			; get GOT address
 	movpic	POINTER [gotptr], ebx	; save GOT address
 	mov	eax, POINTER [compptr(ebp)]
 	mov	eax, JDIMENSION [jcompinfo_downsampled_width(eax)]  ; colctr
 	add	eax, byte SIZEOF_MMWORD-1
 	and	eax, byte -SIZEOF_MMWORD
 	jz	near .return
 	mov	ecx, POINTER [cinfo(ebp)]
 	mov	ecx, INT [jdstruct_max_v_samp_factor(ecx)]	; rowctr
 	test	ecx,ecx
 	jz	near .return
 	mov	esi, JSAMPARRAY [input_data(ebp)]	; input_data
 	mov	edi, POINTER [output_data_ptr(ebp)]
 	mov	edi, JSAMPARRAY [edi]			; output_data
 	alignx	16,7
 .rowloop:
 	push	eax					; colctr
 	push	ecx
 	push	edi
 	push	esi
 	mov	ecx, JSAMPROW [esi-1*SIZEOF_JSAMPROW]	; inptr1(above)
 	mov	ebx, JSAMPROW [esi+0*SIZEOF_JSAMPROW]	; inptr0
 	mov	esi, JSAMPROW [esi+1*SIZEOF_JSAMPROW]	; inptr1(below)
 	mov	edx, JSAMPROW [edi+0*SIZEOF_JSAMPROW]	; outptr0
 	mov	edi, JSAMPROW [edi+1*SIZEOF_JSAMPROW]	; outptr1
 	pxor	mm0,mm0			; mm0=(all 0's)
 	alignx	16,7
 .columnloop:
 	movq	mm1, MMWORD [ebx]	; mm1=row[ 0]( 0 1 2 3 4 5 6 7)
 	movq	mm2, MMWORD [ecx]	; mm2=row[-1]( 0 1 2 3 4 5 6 7)
 	movq	mm3, MMWORD [esi]	; mm3=row[+1]( 0 1 2 3 4 5 6 7)
 	pushpic	ebx
 	movpic	ebx, POINTER [gotptr]	; load GOT address
 	movq      mm4,mm1
 	punpcklbw mm1,mm0		; mm1=row[ 0]( 0 1 2 3)
 	punpckhbw mm4,mm0		; mm4=row[ 0]( 4 5 6 7)
 	movq      mm5,mm2
 	punpcklbw mm2,mm0		; mm2=row[-1]( 0 1 2 3)
 	punpckhbw mm5,mm0		; mm5=row[-1]( 4 5 6 7)
 	movq      mm6,mm3
 	punpcklbw mm3,mm0		; mm3=row[+1]( 0 1 2 3)
 	punpckhbw mm6,mm0		; mm6=row[+1]( 4 5 6 7)
 	pmullw	mm1,[GOTOFF(ebx,PW_THREE)]
 	pmullw	mm4,[GOTOFF(ebx,PW_THREE)]
 	paddw	mm2,[GOTOFF(ebx,PW_ONE)]
 	paddw	mm5,[GOTOFF(ebx,PW_ONE)]
 	paddw	mm3,[GOTOFF(ebx,PW_TWO)]
 	paddw	mm6,[GOTOFF(ebx,PW_TWO)]
 	paddw	mm2,mm1
 	paddw	mm5,mm4
 	psrlw	mm2,2			; mm2=Out0L=( 0 1 2 3)
 	psrlw	mm5,2			; mm5=Out0H=( 4 5 6 7)
 	paddw	mm3,mm1
 	paddw	mm6,mm4
 	psrlw	mm3,2			; mm3=Out1L=( 0 1 2 3)
 	psrlw	mm6,2			; mm6=Out1H=( 4 5 6 7)
 	packuswb  mm2,mm5		; mm2=Out0=( 0 1 2 3 4 5 6 7)
 	packuswb  mm3,mm6		; mm3=Out1=( 0 1 2 3 4 5 6 7)
 	movq	MMWORD [edx], mm2
 	movq	MMWORD [edi], mm3
 	poppic	ebx
 	add	ecx, byte 1*SIZEOF_MMWORD	; inptr1(above)
 	add	ebx, byte 1*SIZEOF_MMWORD	; inptr0
 	add	esi, byte 1*SIZEOF_MMWORD	; inptr1(below)
 	add	edx, byte 1*SIZEOF_MMWORD	; outptr0
 	add	edi, byte 1*SIZEOF_MMWORD	; outptr1
 	sub	eax, byte SIZEOF_MMWORD
 	jnz	near .columnloop
 	pop	esi
 	pop	edi
 	pop	ecx
 	pop	eax
 	add	esi, byte 1*SIZEOF_JSAMPROW	; input_data
 	add	edi, byte 2*SIZEOF_JSAMPROW	; output_data
 	sub	ecx, byte 2			; rowctr
 	jg	near .rowloop
 	emms		; empty MMX state
 .return:
 	pop	edi
 	pop	esi
 ;	pop	edx		; need not be preserved
 ;	pop	ecx		; need not be preserved
 	pop	ebx
 	poppic	eax		; remove gotptr
 	pop	ebp
 	ret
 %endif ; UPSAMPLE_H1V2_SUPPORTED
 %endif ; JDSAMPLE_FANCY_MMX_SUPPORTED
 %ifdef JDSAMPLE_SIMPLE_MMX_SUPPORTED
 %ifndef JDSAMPLE_FANCY_MMX_SUPPORTED
 ; --------------------------------------------------------------------------
 	SECTION	SEG_TEXT
 	BITS	32
 %endif
 ;
 ; Fast processing for the common case of 2:1 horizontal and 1:1 vertical.
 ; It's still a box filter.
 ;
 ; GLOBAL(void)
 ; jpeg_h2v1_upsample_mmx (j_decompress_ptr cinfo,
 ;                         jpeg_component_info * compptr,
 ;                         JSAMPARRAY input_data,
 ;                         JSAMPARRAY * output_data_ptr);
 ;
 %define cinfo(b)		(b)+8		; j_decompress_ptr cinfo
 %define compptr(b)		(b)+12		; jpeg_component_info * compptr
 %define input_data(b)		(b)+16		; JSAMPARRAY input_data
 %define output_data_ptr(b)	(b)+20		; JSAMPARRAY * output_data_ptr
 	align	16
 	global	EXTN(jpeg_h2v1_upsample_mmx)
 EXTN(jpeg_h2v1_upsample_mmx):
 	push	ebp
 	mov	ebp,esp
 ;	push	ebx		; unused
 ;	push	ecx		; need not be preserved
 ;	push	edx		; need not be preserved
 	push	esi
 	push	edi
 	mov	edx, POINTER [cinfo(ebp)]
 	mov	edx, JDIMENSION [jdstruct_output_width(edx)]
 	add	edx, byte (2*SIZEOF_MMWORD)-1
 	and	edx, byte -(2*SIZEOF_MMWORD)
 	jz	short .return
 	mov	ecx, POINTER [cinfo(ebp)]
 	mov	ecx, INT [jdstruct_max_v_samp_factor(ecx)]	; rowctr
 	test	ecx,ecx
 	jz	short .return
 	mov	esi, JSAMPARRAY [input_data(ebp)]	; input_data
 	mov	edi, POINTER [output_data_ptr(ebp)]
 	mov	edi, JSAMPARRAY [edi]			; output_data
 	alignx	16,7
 .rowloop:
 	push	edi
 	push	esi
 	mov	esi, JSAMPROW [esi]		; inptr
 	mov	edi, JSAMPROW [edi]		; outptr
 	mov	eax,edx				; colctr
 	alignx	16,7
 .columnloop:
 	movq	mm0, MMWORD [esi+0*SIZEOF_MMWORD]
 	movq      mm1,mm0
 	punpcklbw mm0,mm0
 	punpckhbw mm1,mm1
 	movq	MMWORD [edi+0*SIZEOF_MMWORD], mm0
 	movq	MMWORD [edi+1*SIZEOF_MMWORD], mm1
 	sub	eax, byte 2*SIZEOF_MMWORD
 	jz	short .nextrow
 	movq	mm2, MMWORD [esi+1*SIZEOF_MMWORD]
 	movq      mm3,mm2
 	punpcklbw mm2,mm2
 	punpckhbw mm3,mm3
 	movq	MMWORD [edi+2*SIZEOF_MMWORD], mm2
 	movq	MMWORD [edi+3*SIZEOF_MMWORD], mm3
 	sub	eax, byte 2*SIZEOF_MMWORD
 	jz	short .nextrow
 	add	esi, byte 2*SIZEOF_MMWORD	; inptr
 	add	edi, byte 4*SIZEOF_MMWORD	; outptr
 	jmp	short .columnloop
 	alignx	16,7
 .nextrow:
 	pop	esi
 	pop	edi
 	add	esi, byte SIZEOF_JSAMPROW	; input_data
 	add	edi, byte SIZEOF_JSAMPROW	; output_data
 	dec	ecx				; rowctr
 	jg	short .rowloop
 	emms		; empty MMX state
 .return:
 	pop	edi
 	pop	esi
 ;	pop	edx		; need not be preserved
 ;	pop	ecx		; need not be preserved
 ;	pop	ebx		; unused
 	pop	ebp
 	ret
 ; --------------------------------------------------------------------------
 ;
 ; Fast processing for the common case of 2:1 horizontal and 2:1 vertical.
 ; It's still a box filter.
 ;
 ; GLOBAL(void)
 ; jpeg_h2v2_upsample_mmx (j_decompress_ptr cinfo,
 ;                         jpeg_component_info * compptr,
 ;                         JSAMPARRAY input_data,
 ;                         JSAMPARRAY * output_data_ptr);
 ;
 %define cinfo(b)		(b)+8		; j_decompress_ptr cinfo
 %define compptr(b)		(b)+12		; jpeg_component_info * compptr
 %define input_data(b)		(b)+16		; JSAMPARRAY input_data
 %define output_data_ptr(b)	(b)+20		; JSAMPARRAY * output_data_ptr
 	align	16
 	global	EXTN(jpeg_h2v2_upsample_mmx)
 EXTN(jpeg_h2v2_upsample_mmx):
 	push	ebp
 	mov	ebp,esp
 	push	ebx
 ;	push	ecx		; need not be preserved
 ;	push	edx		; need not be preserved
 	push	esi
 	push	edi
 	mov	edx, POINTER [cinfo(ebp)]
 	mov	edx, JDIMENSION [jdstruct_output_width(edx)]
 	add	edx, byte (2*SIZEOF_MMWORD)-1
 	and	edx, byte -(2*SIZEOF_MMWORD)
 	jz	near .return
 	mov	ecx, POINTER [cinfo(ebp)]
 	mov	ecx, INT [jdstruct_max_v_samp_factor(ecx)]	; rowctr
 	test	ecx,ecx
 	jz	short .return
 	mov	esi, JSAMPARRAY [input_data(ebp)]	; input_data
 	mov	edi, POINTER [output_data_ptr(ebp)]
 	mov	edi, JSAMPARRAY [edi]			; output_data
 	alignx	16,7
 .rowloop:
 	push	edi
 	push	esi
 	mov	esi, JSAMPROW [esi]			; inptr
 	mov	ebx, JSAMPROW [edi+0*SIZEOF_JSAMPROW]	; outptr0
 	mov	edi, JSAMPROW [edi+1*SIZEOF_JSAMPROW]	; outptr1
 	mov	eax,edx					; colctr
 	alignx	16,7
 .columnloop:
 	movq	mm0, MMWORD [esi+0*SIZEOF_MMWORD]
 	movq      mm1,mm0
 	punpcklbw mm0,mm0
 	punpckhbw mm1,mm1
 	movq	MMWORD [ebx+0*SIZEOF_MMWORD], mm0
 	movq	MMWORD [ebx+1*SIZEOF_MMWORD], mm1
 	movq	MMWORD [edi+0*SIZEOF_MMWORD], mm0
 	movq	MMWORD [edi+1*SIZEOF_MMWORD], mm1
 	sub	eax, byte 2*SIZEOF_MMWORD
 	jz	short .nextrow
 	movq	mm2, MMWORD [esi+1*SIZEOF_MMWORD]
 	movq      mm3,mm2
 	punpcklbw mm2,mm2
 	punpckhbw mm3,mm3
 	movq	MMWORD [ebx+2*SIZEOF_MMWORD], mm2
 	movq	MMWORD [ebx+3*SIZEOF_MMWORD], mm3
 	movq	MMWORD [edi+2*SIZEOF_MMWORD], mm2
 	movq	MMWORD [edi+3*SIZEOF_MMWORD], mm3
 	sub	eax, byte 2*SIZEOF_MMWORD
 	jz	short .nextrow
 	add	esi, byte 2*SIZEOF_MMWORD	; inptr
 	add	ebx, byte 4*SIZEOF_MMWORD	; outptr0
 	add	edi, byte 4*SIZEOF_MMWORD	; outptr1
 	jmp	short .columnloop
 	alignx	16,7
 .nextrow:
 	pop	esi
 	pop	edi
 	add	esi, byte 1*SIZEOF_JSAMPROW	; input_data
 	add	edi, byte 2*SIZEOF_JSAMPROW	; output_data
 	sub	ecx, byte 2			; rowctr
 	jg	short .rowloop
 	emms		; empty MMX state
 .return:
 	pop	edi
 	pop	esi
 ;	pop	edx		; need not be preserved
 ;	pop	ecx		; need not be preserved
 	pop	ebx
 	pop	ebp
 	ret
 %endif ; JDSAMPLE_SIMPLE_MMX_SUPPORTED
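
Note on jdsammmx.asm: the h2v1 "fancy" routine computes, for each input sample, one output at 1/4 and one at 3/4 of the way to its neighbours. Reading the constants in the listing (`PW_THREE`, `PW_ONE`, `PW_TWO`, shift by 2), a scalar equivalent might look like the sketch below. The function name and the edge handling by replication are assumptions made for illustration. The assembly reaches the same edge behaviour by masking the first sample and writing a dummy sample past the end of the row.

```c
#include <stddef.h>

typedef unsigned char JSAMPLE;   /* stand-in for the library's sample type */

/* Scalar sketch of the 2:1 horizontal triangle filter.
 * in:  width samples;  out: 2*width samples.
 * Even outputs lean toward the left neighbour, odd ones toward the right. */
static void h2v1_fancy_upsample_ref(const JSAMPLE *in, size_t width,
                                    JSAMPLE *out)
{
  size_t i;

  for (i = 0; i < width; i++) {
    int cur  = in[i];
    int prev = (i > 0) ? in[i - 1] : cur;          /* replicate at left edge */
    int next = (i + 1 < width) ? in[i + 1] : cur;  /* replicate at right edge */

    out[2 * i]     = (JSAMPLE) ((3 * cur + prev + 1) >> 2);
    out[2 * i + 1] = (JSAMPLE) ((3 * cur + next + 2) >> 2);
  }
}
```

The different rounding terms (1 for even outputs, 2 for odd ones) match the `PW_ONE` and `PW_TWO` additions in the listing. A reference like this can be handy for comparing SIMD output against plain C on small test rows.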
--- a/jdsample.c
+++ b/jdsample.c
@@ -5,6 +5,13 @@
 * This file is part of the Independent JPEG Group's software.
 * For conditions of distribution and use, see the accompanying README file.
 *
 * ---------------------------------------------------------------------
 * x86 SIMD extension for IJG JPEG library
 * Copyright (C) 1999-2006, MIYASAKA Masaru.
 * This file has been modified for SIMD extension.
 * Last Modified : January 5, 2006
 * ---------------------------------------------------------------------
 *
 * This file contains upsampling routines.
 *
 * Upsampling input data is counted in "row groups".  A row group
@@ -21,6 +28,7 @@
 #define JPEG_INTERNALS
 #include "jinclude.h"
 #include "jpeglib.h"
 #include "jcolsamp.h"		/* Private declarations */
 /* Pointer to routine to upsample a single component */
@@ -285,6 +293,37 @@ h2v2_upsample (j_decompress_ptr cinfo, jpeg_component_info * compptr,
 }
 #ifdef UPSAMPLE_H1V2_SUPPORTED
 /*
 * Fast processing for the common case of 1:1 horizontal and 2:1 vertical.
 * It's still a box filter.
 *
 * SIMD Ext: This routine is for files that are rotated or transposed
 *           by jpegtran.
 */
 METHODDEF(void)
 h1v2_upsample (j_decompress_ptr cinfo, jpeg_component_info * compptr,
 	       JSAMPARRAY input_data, JSAMPARRAY * output_data_ptr)
 {
  JSAMPARRAY output_data = *output_data_ptr;
  int inrow, outrow;
  inrow = outrow = 0;
  while (outrow < cinfo->max_v_samp_factor) {
    jcopy_sample_rows(input_data, inrow, output_data, outrow,
 		      1, cinfo->output_width);
    jcopy_sample_rows(input_data, inrow, output_data, outrow+1,
 		      1, cinfo->output_width);
    inrow++;
    outrow += 2;
  }
 }
 #endif /* UPSAMPLE_H1V2_SUPPORTED */
 /*
 * Fancy processing for the common case of 2:1 horizontal and 1:1 vertical.
 *
@@ -391,6 +430,52 @@ h2v2_fancy_upsample (j_decompress_ptr cinfo, jpeg_component_info * compptr,
 }
 #ifdef UPSAMPLE_H1V2_SUPPORTED
 /*
 * Fancy processing for the common case of 1:1 horizontal and 2:1 vertical.
 * Again a triangle filter; see comments for h2v1 case, above.
 *
 * It is OK for us to reference the adjacent input rows because we demanded
 * context from the main buffer controller (see initialization code).
 *
 * SIMD Ext: This routine is for files that are rotated or transposed
 *           by jpegtran.
 */
 METHODDEF(void)
 h1v2_fancy_upsample (j_decompress_ptr cinfo, jpeg_component_info * compptr,
 		     JSAMPARRAY input_data, JSAMPARRAY * output_data_ptr)
 {
  JSAMPARRAY output_data = *output_data_ptr;
  register JSAMPROW inptr0, inptr1, outptr;
  register int colsum;
  register JDIMENSION colctr;
  int inrow, outrow, v;
  inrow = outrow = 0;
  while (outrow < cinfo->max_v_samp_factor) {
    for (v = 0; v < 2; v++) {
      /* inptr0 points to nearest input row, inptr1 points to next nearest */
      inptr0 = input_data[inrow];
      if (v == 0)		/* next nearest is row above */
 	inptr1 = input_data[inrow-1];
      else			/* next nearest is row below */
 	inptr1 = input_data[inrow+1];
      outptr = output_data[outrow++];
      for (colctr = compptr->downsampled_width; colctr > 0; colctr--) {
 	colsum = GETJSAMPLE(*inptr0++) * 3 + GETJSAMPLE(*inptr1++);
 	*outptr++ = (JSAMPLE) ((colsum + v + 1) >> 2);
      }
    }
    inrow++;
  }
 }
 #endif /* UPSAMPLE_H1V2_SUPPORTED */
 /*
 * Module initialization routine for upsampling.
 */
@@ -403,6 +488,7 @@ jinit_upsampler (j_decompress_ptr cinfo)
  jpeg_component_info * compptr;
  boolean need_buffer, do_fancy;
  int h_in_group, v_in_group, h_out_group, v_out_group;
  unsigned int simd = jpeg_simd_support((j_common_ptr) cinfo);
  upsample = (my_upsample_ptr)
    (*cinfo->mem->alloc_small) ((j_common_ptr) cinfo, JPOOL_IMAGE,
@@ -447,18 +533,83 @@ jinit_upsampler (j_decompress_ptr cinfo)
    } else if (h_in_group * 2 == h_out_group &&
 	       v_in_group == v_out_group) {
      /* Special cases for 2h1v upsampling */
-      if (do_fancy && compptr->downsampled_width > 2)
+      if (do_fancy && compptr->downsampled_width > 2) {
-	upsample->methods[ci] = h2v1_fancy_upsample;
+#ifdef JDSAMPLE_FANCY_SSE2_SUPPORTED
 	if (simd & JSIMD_SSE2 &&
 	    IS_CONST_ALIGNED_16(jconst_fancy_upsample_sse2))
 	  upsample->methods[ci] = jpeg_h2v1_fancy_upsample_sse2;
 	else
 #endif
 #ifdef JDSAMPLE_FANCY_MMX_SUPPORTED
 	if (simd & JSIMD_MMX)
 	  upsample->methods[ci] = jpeg_h2v1_fancy_upsample_mmx;
 	else
 #endif
 	  upsample->methods[ci] = h2v1_fancy_upsample;
      } else {
 #ifdef JDSAMPLE_SIMPLE_SSE2_SUPPORTED
 	if (simd & JSIMD_SSE2)
 	  upsample->methods[ci] = jpeg_h2v1_upsample_sse2;
 	else
 #endif
 #ifdef JDSAMPLE_SIMPLE_MMX_SUPPORTED
 	if (simd & JSIMD_MMX)
 	  upsample->methods[ci] = jpeg_h2v1_upsample_mmx;
 	else
 #endif
 	  upsample->methods[ci] = h2v1_upsample;
      }
    } else if (h_in_group * 2 == h_out_group &&
 	       v_in_group * 2 == v_out_group) {
      /* Special cases for 2h2v upsampling */
      if (do_fancy && compptr->downsampled_width > 2) {
 #ifdef JDSAMPLE_FANCY_SSE2_SUPPORTED
 	if (simd & JSIMD_SSE2 &&
 	    IS_CONST_ALIGNED_16(jconst_fancy_upsample_sse2))
 	  upsample->methods[ci] = jpeg_h2v2_fancy_upsample_sse2;
 	else
 #endif
 #ifdef JDSAMPLE_FANCY_MMX_SUPPORTED
 	if (simd & JSIMD_MMX)
 	  upsample->methods[ci] = jpeg_h2v2_fancy_upsample_mmx;
 	else
 #endif
 	  upsample->methods[ci] = h2v2_fancy_upsample;
 	upsample->pub.need_context_rows = TRUE;
-      } else
+      } else {
 #ifdef JDSAMPLE_SIMPLE_SSE2_SUPPORTED
 	if (simd & JSIMD_SSE2)
 	  upsample->methods[ci] = jpeg_h2v2_upsample_sse2;
 	else
 #endif
 #ifdef JDSAMPLE_SIMPLE_MMX_SUPPORTED
 	if (simd & JSIMD_MMX)
 	  upsample->methods[ci] = jpeg_h2v2_upsample_mmx;
 	else
 #endif
 	  upsample->methods[ci] = h2v2_upsample;
      }
 #ifdef UPSAMPLE_H1V2_SUPPORTED
    } else if (h_in_group == h_out_group &&
 	       v_in_group * 2 == v_out_group) {
      /* Special cases for 1h2v upsampling */
      if (do_fancy) {
 #ifdef JDSAMPLE_FANCY_SSE2_SUPPORTED
 	if (simd & JSIMD_SSE2 &&
 	    IS_CONST_ALIGNED_16(jconst_fancy_upsample_sse2))
 	  upsample->methods[ci] = jpeg_h1v2_fancy_upsample_sse2;
 	else
 #endif
 #ifdef JDSAMPLE_FANCY_MMX_SUPPORTED
 	if (simd & JSIMD_MMX)
 	  upsample->methods[ci] = jpeg_h1v2_fancy_upsample_mmx;
 	else
 #endif
 	  upsample->methods[ci] = h1v2_fancy_upsample;
 	upsample->pub.need_context_rows = TRUE;
      } else
 	upsample->methods[ci] = h1v2_upsample;
 #endif /* UPSAMPLE_H1V2_SUPPORTED */
    } else if ((h_out_group % h_in_group) == 0 &&
 	       (v_out_group % v_in_group) == 0) {
      /* Generic integral-factors upsampling method */
@@ -468,11 +619,52 @@ jinit_upsampler (j_decompress_ptr cinfo)
    } else
      ERREXIT(cinfo, JERR_FRACT_SAMPLE_NOTIMPL);
    if (need_buffer) {
      enum { SIZEOF_XMMWORD = 16 };	/* from jsimdext.inc */
      upsample->color_buf[ci] = (*cinfo->mem->alloc_sarray)
 	((j_common_ptr) cinfo, JPOOL_IMAGE,
-	 (JDIMENSION) jround_up((long) cinfo->output_width,
+	 (JDIMENSION) jround_up(jround_up((long) cinfo->output_width,
 					  (long) cinfo->max_h_samp_factor),
 				(long) (2 * SIZEOF_XMMWORD)),
 	 (JDIMENSION) cinfo->max_v_samp_factor);
    }
  }
 }
 #ifndef JSIMD_MODEINFO_NOT_SUPPORTED
 GLOBAL(unsigned int)
 jpeg_simd_upsampler (j_decompress_ptr cinfo, int do_fancy)
 {
  unsigned int simd = jpeg_simd_support((j_common_ptr) cinfo);
 #ifdef UPSAMPLE_MERGING_SUPPORTED
  if (!do_fancy)
    return jpeg_simd_merged_upsampler(cinfo);
 #endif
  if (do_fancy) {
 #ifdef JDSAMPLE_FANCY_SSE2_SUPPORTED
    if (simd & JSIMD_SSE2 &&
        IS_CONST_ALIGNED_16(jconst_fancy_upsample_sse2))
      return JSIMD_SSE2;
 #endif
 #ifdef JDSAMPLE_FANCY_MMX_SUPPORTED
    if (simd & JSIMD_MMX)
      return JSIMD_MMX;
 #endif
  } else {
 #ifdef JDSAMPLE_SIMPLE_SSE2_SUPPORTED
    if (simd & JSIMD_SSE2)
      return JSIMD_SSE2;
 #endif
 #ifdef JDSAMPLE_SIMPLE_MMX_SUPPORTED
    if (simd & JSIMD_MMX)
      return JSIMD_MMX;
 #endif
  }
  return JSIMD_NONE;
 }
 #endif /* !JSIMD_MODEINFO_NOT_SUPPORTED */
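
Note on jdsample.c: two patterns recur in the hunks above. First, each upsampling case tries SSE2, then MMX, then falls back to the plain C routine, and the fancy SSE2 path is additionally gated on the constant table being 16-byte aligned. Second, the colour buffer width is rounded up twice, once to `max_h_samp_factor` and once to `2 * SIZEOF_XMMWORD` (32 bytes). The sketch below restates both in isolation. All `sk_`/`SK_` names and the capability bit values are hypothetical; the real `JSIMD_*` values are not shown in this excerpt.

```c
/* Hypothetical capability bits, for illustration only. */
enum { SK_SIMD_NONE = 0x00, SK_SIMD_MMX = 0x01, SK_SIMD_SSE2 = 0x02 };

typedef enum { SK_IMPL_C, SK_IMPL_MMX, SK_IMPL_SSE2 } sk_impl;

/* Mirrors the SSE2 -> MMX -> C fallback chain used for the fancy cases.
 * consts_aligned16 plays the role of the IS_CONST_ALIGNED_16() check. */
static sk_impl sk_pick_fancy(unsigned int simd, int consts_aligned16)
{
  if ((simd & SK_SIMD_SSE2) && consts_aligned16)
    return SK_IMPL_SSE2;
  if (simd & SK_SIMD_MMX)
    return SK_IMPL_MMX;
  return SK_IMPL_C;
}

/* Round a up to the next multiple of b (a >= 0, b > 0). */
static long sk_round_up(long a, long b)
{
  a += b - 1L;
  return a - (a % b);
}

/* Buffer width after both roundings; 16 stands in for SIZEOF_XMMWORD. */
static long sk_color_buf_width(long output_width, long max_h_samp_factor)
{
  return sk_round_up(sk_round_up(output_width, max_h_samp_factor),
                     2L * 16L);
}
```

For example, an output width of 100 with a horizontal factor of 2 yields a 128-sample buffer row, which leaves the SIMD loops free to process whole 32-byte blocks without running past the allocation.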
--- a/jdsamss2.asm
+++ b/jdsamss2.asm
@@ -0,0 +1,883 @@
 ;
 ; jdsamss2.asm - upsampling (SSE2)
 ;
 ; x86 SIMD extension for IJG JPEG library
 ; Copyright (C) 1999-2006, MIYASAKA Masaru.
 ; For conditions of distribution and use, see copyright notice in jsimdext.inc
 ;
 ; This file should be assembled with NASM (Netwide Assembler); it
 ; can *not* be assembled with Microsoft's MASM or any compatible
 ; assembler (including Borland's Turbo Assembler).
 ; NASM is available from http://nasm.sourceforge.net/ or
 ; http://sourceforge.net/project/showfiles.php?group_id=6208
 ;
 ; Last Modified : February 4, 2006
 ;
 ; [TAB8]
 %include "jsimdext.inc"
 %include "jcolsamp.inc"
 %ifdef JDSAMPLE_FANCY_SSE2_SUPPORTED
 ; --------------------------------------------------------------------------
 	SECTION	SEG_CONST
 	alignz	16
 	global	EXTN(jconst_fancy_upsample_sse2)
 EXTN(jconst_fancy_upsample_sse2):
 PW_ONE		times 8 dw  1
 PW_TWO		times 8 dw  2
 PW_THREE	times 8 dw  3
 PW_SEVEN	times 8 dw  7
 PW_EIGHT	times 8 dw  8
 	alignz	16
 ; --------------------------------------------------------------------------
 	SECTION	SEG_TEXT
 	BITS	32
 ;
 ; Fancy processing for the common case of 2:1 horizontal and 1:1 vertical.
 ;
 ; The upsampling algorithm is linear interpolation between pixel centers,
 ; also known as a "triangle filter".  This is a good compromise between
 ; speed and visual quality.  The centers of the output pixels are 1/4 and 3/4
 ; of the way between input pixel centers.
 ;
 ; GLOBAL(void)
 ; jpeg_h2v1_fancy_upsample_sse2 (j_decompress_ptr cinfo,
 ;                                jpeg_component_info * compptr,
 ;                                JSAMPARRAY input_data,
 ;                                JSAMPARRAY * output_data_ptr);
 ;
 %define cinfo(b)		(b)+8		; j_decompress_ptr cinfo
 %define compptr(b)		(b)+12		; jpeg_component_info * compptr
 %define input_data(b)		(b)+16		; JSAMPARRAY input_data
 %define output_data_ptr(b)	(b)+20		; JSAMPARRAY * output_data_ptr
 	align	16
 	global	EXTN(jpeg_h2v1_fancy_upsample_sse2)
 EXTN(jpeg_h2v1_fancy_upsample_sse2):
 	push	ebp
 	mov	ebp,esp
 	pushpic	ebx
 ;	push	ecx		; need not be preserved
 ;	push	edx		; need not be preserved
 	push	esi
 	push	edi
 	get_GOT	ebx		; get GOT address
 	mov	eax, POINTER [compptr(ebp)]
 	mov	eax, JDIMENSION [jcompinfo_downsampled_width(eax)]  ; colctr
 	test	eax,eax
 	jz	near .return
 	mov	ecx, POINTER [cinfo(ebp)]
 	mov	ecx, INT [jdstruct_max_v_samp_factor(ecx)]	; rowctr
 	test	ecx,ecx
 	jz	near .return
 	mov	esi, JSAMPARRAY [input_data(ebp)]	; input_data
 	mov	edi, POINTER [output_data_ptr(ebp)]
 	mov	edi, JSAMPARRAY [edi]			; output_data
 	alignx	16,7
 .rowloop:
 	push	eax			; colctr
 	push	edi
 	push	esi
 	mov	esi, JSAMPROW [esi]	; inptr
 	mov	edi, JSAMPROW [edi]	; outptr
 	test	eax, SIZEOF_XMMWORD-1
 	jz	short .skip
 	mov	dl, JSAMPLE [esi+(eax-1)*SIZEOF_JSAMPLE]
 	mov	JSAMPLE [esi+eax*SIZEOF_JSAMPLE], dl	; insert a dummy sample
 .skip:
 	pxor	xmm0,xmm0		; xmm0=(all 0's)
 	pcmpeqb	xmm7,xmm7		; xmm7=(all 1's)
 	psrldq	xmm7,(SIZEOF_XMMWORD-1)	; keep only the lowest byte set
 	pand	xmm7, XMMWORD [esi+0*SIZEOF_XMMWORD]	; xmm7=( 0 -- -- ... -- -- --)
 	add	eax, byte SIZEOF_XMMWORD-1
 	and	eax, byte -SIZEOF_XMMWORD
 	cmp	eax, byte SIZEOF_XMMWORD
 	ja	short .columnloop
 	alignx	16,7
 .columnloop_last:
 	pcmpeqb	xmm6,xmm6		; xmm6=(all 1's)
 	pslldq	xmm6,(SIZEOF_XMMWORD-1)	; keep only the highest byte set
 	pand	xmm6, XMMWORD [esi+0*SIZEOF_XMMWORD]	; xmm6=(-- -- -- ... -- -- 15)
 	jmp	short .upsample
 	alignx	16,7
 .columnloop:
 	movdqa	xmm6, XMMWORD [esi+1*SIZEOF_XMMWORD]	; next column block
 	pslldq	xmm6,(SIZEOF_XMMWORD-1)	; xmm6=(-- -- -- ... -- -- 16)
 .upsample:
 	movdqa	xmm1, XMMWORD [esi+0*SIZEOF_XMMWORD]
 	movdqa	xmm2,xmm1
 	movdqa	xmm3,xmm1		; xmm1=( 0  1  2 ... 13 14 15)
 	pslldq	xmm2,1			; xmm2=(--  0  1 ... 12 13 14)
 	psrldq	xmm3,1			; xmm3=( 1  2  3 ... 14 15 --)
 	por	xmm2,xmm7		; xmm2=(-1  0  1 ... 12 13 14)
 	por	xmm3,xmm6		; xmm3=( 1  2  3 ... 14 15 16)
 	movdqa	xmm7,xmm1
 	psrldq	xmm7,(SIZEOF_XMMWORD-1)	; xmm7=(15 -- -- ... -- -- --)
 	movdqa    xmm4,xmm1
 	punpcklbw xmm1,xmm0		; xmm1=( 0  1  2  3  4  5  6  7)
 	punpckhbw xmm4,xmm0		; xmm4=( 8  9 10 11 12 13 14 15)
 	movdqa    xmm5,xmm2
 	punpcklbw xmm2,xmm0		; xmm2=(-1  0  1  2  3  4  5  6)
 	punpckhbw xmm5,xmm0		; xmm5=( 7  8  9 10 11 12 13 14)
 	movdqa    xmm6,xmm3
 	punpcklbw xmm3,xmm0		; xmm3=( 1  2  3  4  5  6  7  8)
 	punpckhbw xmm6,xmm0		; xmm6=( 9 10 11 12 13 14 15 16)
 	pmullw	xmm1,[GOTOFF(ebx,PW_THREE)]
 	pmullw	xmm4,[GOTOFF(ebx,PW_THREE)]
 	paddw	xmm2,[GOTOFF(ebx,PW_ONE)]
 	paddw	xmm5,[GOTOFF(ebx,PW_ONE)]
 	paddw	xmm3,[GOTOFF(ebx,PW_TWO)]
 	paddw	xmm6,[GOTOFF(ebx,PW_TWO)]
 	paddw	xmm2,xmm1
 	paddw	xmm5,xmm4
 	psrlw	xmm2,2			; xmm2=OutLE=( 0  2  4  6  8 10 12 14)
 	psrlw	xmm5,2			; xmm5=OutHE=(16 18 20 22 24 26 28 30)
 	paddw	xmm3,xmm1
 	paddw	xmm6,xmm4
 	psrlw	xmm3,2			; xmm3=OutLO=( 1  3  5  7  9 11 13 15)
 	psrlw	xmm6,2			; xmm6=OutHO=(17 19 21 23 25 27 29 31)
 	psllw	xmm3,BYTE_BIT
 	psllw	xmm6,BYTE_BIT
 	por	xmm2,xmm3		; xmm2=OutL=( 0  1  2 ... 13 14 15)
 	por	xmm5,xmm6		; xmm5=OutH=(16 17 18 ... 29 30 31)
 	movdqa	XMMWORD [edi+0*SIZEOF_XMMWORD], xmm2
 	movdqa	XMMWORD [edi+1*SIZEOF_XMMWORD], xmm5
 	sub	eax, byte SIZEOF_XMMWORD
 	add	esi, byte 1*SIZEOF_XMMWORD	; inptr
 	add	edi, byte 2*SIZEOF_XMMWORD	; outptr
 	cmp	eax, byte SIZEOF_XMMWORD
 	ja	near .columnloop
 	test	eax,eax
 	jnz	near .columnloop_last
 	pop	esi
 	pop	edi
 	pop	eax
 	add	esi, byte SIZEOF_JSAMPROW	; input_data
 	add	edi, byte SIZEOF_JSAMPROW	; output_data
 	dec	ecx				; rowctr
 	jg	near .rowloop
 .return:
 	pop	edi
 	pop	esi
 ;	pop	edx		; need not be preserved
 ;	pop	ecx		; need not be preserved
 	poppic	ebx
 	pop	ebp
 	ret
 ; --------------------------------------------------------------------------
 ;
 ; Fancy processing for the common case of 2:1 horizontal and 2:1 vertical.
 ; Again a triangle filter; see comments for h2v1 case, above.
 ;
 ; GLOBAL(void)
 ; jpeg_h2v2_fancy_upsample_sse2 (j_decompress_ptr cinfo,
 ;                                jpeg_component_info * compptr,
 ;                                JSAMPARRAY input_data,
 ;                                JSAMPARRAY * output_data_ptr);
 ;
 %define cinfo(b)		(b)+8		; j_decompress_ptr cinfo
 %define compptr(b)		(b)+12		; jpeg_component_info * compptr
 %define input_data(b)		(b)+16		; JSAMPARRAY input_data
 %define output_data_ptr(b)	(b)+20		; JSAMPARRAY * output_data_ptr
 %define original_ebp	ebp+0
 %define wk(i)		ebp-(WK_NUM-(i))*SIZEOF_XMMWORD	; xmmword wk[WK_NUM]
 %define WK_NUM		4
 %define gotptr		wk(0)-SIZEOF_POINTER	; void * gotptr
 	align	16
 	global	EXTN(jpeg_h2v2_fancy_upsample_sse2)
 EXTN(jpeg_h2v2_fancy_upsample_sse2):
 	push	ebp
 	mov	eax,esp				; eax = original ebp
 	sub	esp, byte 4
 	and	esp, byte (-SIZEOF_XMMWORD)	; align to 128 bits
 	mov	[esp],eax
 	mov	ebp,esp				; ebp = aligned ebp
 	lea	esp, [wk(0)]
 	pushpic	eax		; make room for GOT address
 	push	ebx
 ;	push	ecx		; need not be preserved
 ;	push	edx		; need not be preserved
 	push	esi
 	push	edi
 	get_GOT	ebx			; get GOT address
 	movpic	POINTER [gotptr], ebx	; save GOT address
 	mov	edx,eax				; edx = original ebp
 	mov	eax, POINTER [compptr(edx)]
 	mov	eax, JDIMENSION [jcompinfo_downsampled_width(eax)]  ; colctr
 	test	eax,eax
 	jz	near .return
 	mov	ecx, POINTER [cinfo(edx)]
 	mov	ecx, INT [jdstruct_max_v_samp_factor(ecx)]	; rowctr
 	test	ecx,ecx
 	jz	near .return
 	mov	esi, JSAMPARRAY [input_data(edx)]	; input_data
 	mov	edi, POINTER [output_data_ptr(edx)]
 	mov	edi, JSAMPARRAY [edi]			; output_data
 	alignx	16,7
 .rowloop:
 	push	eax					; colctr
 	push	ecx
 	push	edi
 	push	esi
 	mov	ecx, JSAMPROW [esi-1*SIZEOF_JSAMPROW]	; inptr1(above)
 	mov	ebx, JSAMPROW [esi+0*SIZEOF_JSAMPROW]	; inptr0
 	mov	esi, JSAMPROW [esi+1*SIZEOF_JSAMPROW]	; inptr1(below)
 	mov	edx, JSAMPROW [edi+0*SIZEOF_JSAMPROW]	; outptr0
 	mov	edi, JSAMPROW [edi+1*SIZEOF_JSAMPROW]	; outptr1
 	test	eax, SIZEOF_XMMWORD-1
 	jz	short .skip
 	push	edx
 	mov	dl, JSAMPLE [ecx+(eax-1)*SIZEOF_JSAMPLE]
 	mov	JSAMPLE [ecx+eax*SIZEOF_JSAMPLE], dl
 	mov	dl, JSAMPLE [ebx+(eax-1)*SIZEOF_JSAMPLE]
 	mov	JSAMPLE [ebx+eax*SIZEOF_JSAMPLE], dl
 	mov	dl, JSAMPLE [esi+(eax-1)*SIZEOF_JSAMPLE]
 	mov	JSAMPLE [esi+eax*SIZEOF_JSAMPLE], dl	; insert a dummy sample
 	pop	edx
 .skip:
 	; -- process the first column block
 	movdqa	xmm0, XMMWORD [ebx+0*SIZEOF_XMMWORD]	; xmm0=row[ 0][0]
 	movdqa	xmm1, XMMWORD [ecx+0*SIZEOF_XMMWORD]	; xmm1=row[-1][0]
 	movdqa	xmm2, XMMWORD [esi+0*SIZEOF_XMMWORD]	; xmm2=row[+1][0]
 	pushpic	ebx
 	movpic	ebx, POINTER [gotptr]	; load GOT address
 	pxor      xmm3,xmm3		; xmm3=(all 0's)
 	movdqa    xmm4,xmm0
 	punpcklbw xmm0,xmm3		; xmm0=row[ 0]( 0  1  2  3  4  5  6  7)
 	punpckhbw xmm4,xmm3		; xmm4=row[ 0]( 8  9 10 11 12 13 14 15)
 	movdqa    xmm5,xmm1
 	punpcklbw xmm1,xmm3		; xmm1=row[-1]( 0  1  2  3  4  5  6  7)
 	punpckhbw xmm5,xmm3		; xmm5=row[-1]( 8  9 10 11 12 13 14 15)
 	movdqa    xmm6,xmm2
 	punpcklbw xmm2,xmm3		; xmm2=row[+1]( 0  1  2  3  4  5  6  7)
 	punpckhbw xmm6,xmm3		; xmm6=row[+1]( 8  9 10 11 12 13 14 15)
 	pmullw	xmm0,[GOTOFF(ebx,PW_THREE)]
 	pmullw	xmm4,[GOTOFF(ebx,PW_THREE)]
 	pcmpeqb	xmm7,xmm7		; xmm7=(all 1's)
 	psrldq	xmm7,(SIZEOF_XMMWORD-2)	; keep only the lowest word set
 	paddw	xmm1,xmm0		; xmm1=Int0L=( 0  1  2  3  4  5  6  7)
 	paddw	xmm5,xmm4		; xmm5=Int0H=( 8  9 10 11 12 13 14 15)
 	paddw	xmm2,xmm0		; xmm2=Int1L=( 0  1  2  3  4  5  6  7)
 	paddw	xmm6,xmm4		; xmm6=Int1H=( 8  9 10 11 12 13 14 15)
 	movdqa	XMMWORD [edx+0*SIZEOF_XMMWORD], xmm1	; temporarily save
 	movdqa	XMMWORD [edx+1*SIZEOF_XMMWORD], xmm5	; the intermediate data
 	movdqa	XMMWORD [edi+0*SIZEOF_XMMWORD], xmm2
 	movdqa	XMMWORD [edi+1*SIZEOF_XMMWORD], xmm6
 	pand	xmm1,xmm7		; xmm1=( 0 -- -- -- -- -- -- --)
 	pand	xmm2,xmm7		; xmm2=( 0 -- -- -- -- -- -- --)
 	movdqa	XMMWORD [wk(0)], xmm1
 	movdqa	XMMWORD [wk(1)], xmm2
 	poppic	ebx
 	add	eax, byte SIZEOF_XMMWORD-1
 	and	eax, byte -SIZEOF_XMMWORD
 	cmp	eax, byte SIZEOF_XMMWORD
 	ja	short .columnloop
 	alignx	16,7
 .columnloop_last:
 	; -- process the last column block
 	pushpic	ebx
 	movpic	ebx, POINTER [gotptr]	; load GOT address
 	pcmpeqb	xmm1,xmm1		; xmm1=(all 1's)
 	pslldq	xmm1,(SIZEOF_XMMWORD-2)	; keep only the highest word set
 	movdqa	xmm2,xmm1
 	pand	xmm1, XMMWORD [edx+1*SIZEOF_XMMWORD]
 	pand	xmm2, XMMWORD [edi+1*SIZEOF_XMMWORD]
 	movdqa	XMMWORD [wk(2)], xmm1	; xmm1=(-- -- -- -- -- -- -- 15)
 	movdqa	XMMWORD [wk(3)], xmm2	; xmm2=(-- -- -- -- -- -- -- 15)
 	jmp	near .upsample
 	alignx	16,7
 .columnloop:
 	; -- process the next column block
 	movdqa	xmm0, XMMWORD [ebx+1*SIZEOF_XMMWORD]	; xmm0=row[ 0][1]
 	movdqa	xmm1, XMMWORD [ecx+1*SIZEOF_XMMWORD]	; xmm1=row[-1][1]
 	movdqa	xmm2, XMMWORD [esi+1*SIZEOF_XMMWORD]	; xmm2=row[+1][1]
 	pushpic	ebx
 	movpic	ebx, POINTER [gotptr]	; load GOT address
 	pxor      xmm3,xmm3		; xmm3=(all 0's)
 	movdqa    xmm4,xmm0
 	punpcklbw xmm0,xmm3		; xmm0=row[ 0]( 0  1  2  3  4  5  6  7)
 	punpckhbw xmm4,xmm3		; xmm4=row[ 0]( 8  9 10 11 12 13 14 15)
 	movdqa    xmm5,xmm1
 	punpcklbw xmm1,xmm3		; xmm1=row[-1]( 0  1  2  3  4  5  6  7)
 	punpckhbw xmm5,xmm3		; xmm5=row[-1]( 8  9 10 11 12 13 14 15)
 	movdqa    xmm6,xmm2
 	punpcklbw xmm2,xmm3		; xmm2=row[+1]( 0  1  2  3  4  5  6  7)
 	punpckhbw xmm6,xmm3		; xmm6=row[+1]( 8  9 10 11 12 13 14 15)
 	pmullw	xmm0,[GOTOFF(ebx,PW_THREE)]
 	pmullw	xmm4,[GOTOFF(ebx,PW_THREE)]
 	paddw	xmm1,xmm0		; xmm1=Int0L=( 0  1  2  3  4  5  6  7)
 	paddw	xmm5,xmm4		; xmm5=Int0H=( 8  9 10 11 12 13 14 15)
 	paddw	xmm2,xmm0		; xmm2=Int1L=( 0  1  2  3  4  5  6  7)
 	paddw	xmm6,xmm4		; xmm6=Int1H=( 8  9 10 11 12 13 14 15)
 	movdqa	XMMWORD [edx+2*SIZEOF_XMMWORD], xmm1	; temporarily save
 	movdqa	XMMWORD [edx+3*SIZEOF_XMMWORD], xmm5	; the intermediate data
 	movdqa	XMMWORD [edi+2*SIZEOF_XMMWORD], xmm2
 	movdqa	XMMWORD [edi+3*SIZEOF_XMMWORD], xmm6
 	pslldq	xmm1,(SIZEOF_XMMWORD-2)	; xmm1=(-- -- -- -- -- -- --  0)
 	pslldq	xmm2,(SIZEOF_XMMWORD-2)	; xmm2=(-- -- -- -- -- -- --  0)
 	movdqa	XMMWORD [wk(2)], xmm1
 	movdqa	XMMWORD [wk(3)], xmm2
 .upsample:
 	; -- process the upper row
 	movdqa	xmm7, XMMWORD [edx+0*SIZEOF_XMMWORD]
 	movdqa	xmm3, XMMWORD [edx+1*SIZEOF_XMMWORD]
 	movdqa	xmm0,xmm7		; xmm7=Int0L=( 0  1  2  3  4  5  6  7)
 	movdqa	xmm4,xmm3		; xmm3=Int0H=( 8  9 10 11 12 13 14 15)
 	psrldq	xmm0,2			; xmm0=( 1  2  3  4  5  6  7 --)
 	pslldq	xmm4,(SIZEOF_XMMWORD-2)	; xmm4=(-- -- -- -- -- -- --  8)
 	movdqa	xmm5,xmm7
 	movdqa	xmm6,xmm3
 	psrldq	xmm5,(SIZEOF_XMMWORD-2)	; xmm5=( 7 -- -- -- -- -- -- --)
 	pslldq	xmm6,2			; xmm6=(--  8  9 10 11 12 13 14)
 	por	xmm0,xmm4		; xmm0=( 1  2  3  4  5  6  7  8)
 	por	xmm5,xmm6		; xmm5=( 7  8  9 10 11 12 13 14)
 	movdqa	xmm1,xmm7
 	movdqa	xmm2,xmm3
 	pslldq	xmm1,2			; xmm1=(--  0  1  2  3  4  5  6)
 	psrldq	xmm2,2			; xmm2=( 9 10 11 12 13 14 15 --)
 	movdqa	xmm4,xmm3
 	psrldq	xmm4,(SIZEOF_XMMWORD-2)	; xmm4=(15 -- -- -- -- -- -- --)
 	por	xmm1, XMMWORD [wk(0)]	; xmm1=(-1  0  1  2  3  4  5  6)
 	por	xmm2, XMMWORD [wk(2)]	; xmm2=( 9 10 11 12 13 14 15 16)
 	movdqa	XMMWORD [wk(0)], xmm4
 	pmullw	xmm7,[GOTOFF(ebx,PW_THREE)]
 	pmullw	xmm3,[GOTOFF(ebx,PW_THREE)]
 	paddw	xmm1,[GOTOFF(ebx,PW_EIGHT)]
 	paddw	xmm5,[GOTOFF(ebx,PW_EIGHT)]
 	paddw	xmm0,[GOTOFF(ebx,PW_SEVEN)]
 	paddw	xmm2,[GOTOFF(ebx,PW_SEVEN)]
 	paddw	xmm1,xmm7
 	paddw	xmm5,xmm3
 	psrlw	xmm1,4			; xmm1=Out0LE=( 0  2  4  6  8 10 12 14)
 	psrlw	xmm5,4			; xmm5=Out0HE=(16 18 20 22 24 26 28 30)
 	paddw	xmm0,xmm7
 	paddw	xmm2,xmm3
 	psrlw	xmm0,4			; xmm0=Out0LO=( 1  3  5  7  9 11 13 15)
 	psrlw	xmm2,4			; xmm2=Out0HO=(17 19 21 23 25 27 29 31)
 	psllw	xmm0,BYTE_BIT
 	psllw	xmm2,BYTE_BIT
 	por	xmm1,xmm0		; xmm1=Out0L=( 0  1  2 ... 13 14 15)
 	por	xmm5,xmm2		; xmm5=Out0H=(16 17 18 ... 29 30 31)
 	movdqa	XMMWORD [edx+0*SIZEOF_XMMWORD], xmm1
 	movdqa	XMMWORD [edx+1*SIZEOF_XMMWORD], xmm5
 	; -- process the lower row
 	movdqa	xmm6, XMMWORD [edi+0*SIZEOF_XMMWORD]
 	movdqa	xmm4, XMMWORD [edi+1*SIZEOF_XMMWORD]
 	movdqa	xmm7,xmm6		; xmm6=Int1L=( 0  1  2  3  4  5  6  7)
 	movdqa	xmm3,xmm4		; xmm4=Int1H=( 8  9 10 11 12 13 14 15)
 	psrldq	xmm7,2			; xmm7=( 1  2  3  4  5  6  7 --)
 	pslldq	xmm3,(SIZEOF_XMMWORD-2)	; xmm3=(-- -- -- -- -- -- --  8)
 	movdqa	xmm0,xmm6
 	movdqa	xmm2,xmm4
 	psrldq	xmm0,(SIZEOF_XMMWORD-2)	; xmm0=( 7 -- -- -- -- -- -- --)
 	pslldq	xmm2,2			; xmm2=(--  8  9 10 11 12 13 14)
 	por	xmm7,xmm3		; xmm7=( 1  2  3  4  5  6  7  8)
 	por	xmm0,xmm2		; xmm0=( 7  8  9 10 11 12 13 14)
 	movdqa	xmm1,xmm6
 	movdqa	xmm5,xmm4
 	pslldq	xmm1,2			; xmm1=(--  0  1  2  3  4  5  6)
 	psrldq	xmm5,2			; xmm5=( 9 10 11 12 13 14 15 --)
 	movdqa	xmm3,xmm4
 	psrldq	xmm3,(SIZEOF_XMMWORD-2)	; xmm3=(15 -- -- -- -- -- -- --)
 	por	xmm1, XMMWORD [wk(1)]	; xmm1=(-1  0  1  2  3  4  5  6)
 	por	xmm5, XMMWORD [wk(3)]	; xmm5=( 9 10 11 12 13 14 15 16)
 	movdqa	XMMWORD [wk(1)], xmm3
 	pmullw	xmm6,[GOTOFF(ebx,PW_THREE)]
 	pmullw	xmm4,[GOTOFF(ebx,PW_THREE)]
 	paddw	xmm1,[GOTOFF(ebx,PW_EIGHT)]
 	paddw	xmm0,[GOTOFF(ebx,PW_EIGHT)]
 	paddw	xmm7,[GOTOFF(ebx,PW_SEVEN)]
 	paddw	xmm5,[GOTOFF(ebx,PW_SEVEN)]
 	paddw	xmm1,xmm6
 	paddw	xmm0,xmm4
 	psrlw	xmm1,4			; xmm1=Out1LE=( 0  2  4  6  8 10 12 14)
 	psrlw	xmm0,4			; xmm0=Out1HE=(16 18 20 22 24 26 28 30)
 	paddw	xmm7,xmm6
 	paddw	xmm5,xmm4
 	psrlw	xmm7,4			; xmm7=Out1LO=( 1  3  5  7  9 11 13 15)
 	psrlw	xmm5,4			; xmm5=Out1HO=(17 19 21 23 25 27 29 31)
 	psllw	xmm7,BYTE_BIT
 	psllw	xmm5,BYTE_BIT
 	por	xmm1,xmm7		; xmm1=Out1L=( 0  1  2 ... 13 14 15)
 	por	xmm0,xmm5		; xmm0=Out1H=(16 17 18 ... 29 30 31)
 	movdqa	XMMWORD [edi+0*SIZEOF_XMMWORD], xmm1
 	movdqa	XMMWORD [edi+1*SIZEOF_XMMWORD], xmm0
 	poppic	ebx
 	sub	eax, byte SIZEOF_XMMWORD
 	add	ecx, byte 1*SIZEOF_XMMWORD	; inptr1(above)
 	add	ebx, byte 1*SIZEOF_XMMWORD	; inptr0
 	add	esi, byte 1*SIZEOF_XMMWORD	; inptr1(below)
 	add	edx, byte 2*SIZEOF_XMMWORD	; outptr0
 	add	edi, byte 2*SIZEOF_XMMWORD	; outptr1
 	cmp	eax, byte SIZEOF_XMMWORD
 	ja	near .columnloop
 	test	eax,eax
 	jnz	near .columnloop_last
 	pop	esi
 	pop	edi
 	pop	ecx
 	pop	eax
 	add	esi, byte 1*SIZEOF_JSAMPROW	; input_data
 	add	edi, byte 2*SIZEOF_JSAMPROW	; output_data
 	sub	ecx, byte 2			; rowctr
 	jg	near .rowloop
 .return:
 	pop	edi
 	pop	esi
 ;	pop	edx		; need not be preserved
 ;	pop	ecx		; need not be preserved
 	pop	ebx
 	mov	esp,ebp		; esp <- aligned ebp
 	pop	esp		; esp <- original ebp
 	pop	ebp
 	ret
 %ifdef UPSAMPLE_H1V2_SUPPORTED
 ; --------------------------------------------------------------------------
 ;
 ; Fancy processing for the common case of 1:1 horizontal and 2:1 vertical.
 ; Again a triangle filter; see comments for h2v1 case, above.
 ;
 ; GLOBAL(void)
 ; jpeg_h1v2_fancy_upsample_sse2 (j_decompress_ptr cinfo,
 ;                                jpeg_component_info * compptr,
 ;                                JSAMPARRAY input_data,
 ;                                JSAMPARRAY * output_data_ptr);
 ;
 %define cinfo(b)		(b)+8		; j_decompress_ptr cinfo
 %define compptr(b)		(b)+12		; jpeg_component_info * compptr
 %define input_data(b)		(b)+16		; JSAMPARRAY input_data
 %define output_data_ptr(b)	(b)+20		; JSAMPARRAY * output_data_ptr
 %define gotptr		ebp-SIZEOF_POINTER	; void * gotptr
 	align	16
 	global	EXTN(jpeg_h1v2_fancy_upsample_sse2)
 EXTN(jpeg_h1v2_fancy_upsample_sse2):
 	push	ebp
 	mov	ebp,esp
 	pushpic	eax		; make room for GOT address
 	push	ebx
 ;	push	ecx		; need not be preserved
 ;	push	edx		; need not be preserved
 	push	esi
 	push	edi
 	get_GOT	ebx			; get GOT address
 	movpic	POINTER [gotptr], ebx	; save GOT address
 	mov	eax, POINTER [compptr(ebp)]
 	mov	eax, JDIMENSION [jcompinfo_downsampled_width(eax)]  ; colctr
 	add	eax, byte SIZEOF_XMMWORD-1
 	and	eax, byte -SIZEOF_XMMWORD
 	jz	near .return
 	mov	ecx, POINTER [cinfo(ebp)]
 	mov	ecx, INT [jdstruct_max_v_samp_factor(ecx)]	; rowctr
 	test	ecx,ecx
 	jz	near .return
 	mov	esi, JSAMPARRAY [input_data(ebp)]	; input_data
 	mov	edi, POINTER [output_data_ptr(ebp)]
 	mov	edi, JSAMPARRAY [edi]			; output_data
 	alignx	16,7
 .rowloop:
 	push	eax					; colctr
 	push	ecx
 	push	edi
 	push	esi
 	mov	ecx, JSAMPROW [esi-1*SIZEOF_JSAMPROW]	; inptr1(above)
 	mov	ebx, JSAMPROW [esi+0*SIZEOF_JSAMPROW]	; inptr0
 	mov	esi, JSAMPROW [esi+1*SIZEOF_JSAMPROW]	; inptr1(below)
 	mov	edx, JSAMPROW [edi+0*SIZEOF_JSAMPROW]	; outptr0
 	mov	edi, JSAMPROW [edi+1*SIZEOF_JSAMPROW]	; outptr1
 	pxor	xmm0,xmm0		; xmm0=(all 0's)
 	alignx	16,7
 .columnloop:
 	movdqa	xmm1, XMMWORD [ebx]	; xmm1=row[ 0]( 0  1  2 ... 13 14 15)
 	movdqa	xmm2, XMMWORD [ecx]	; xmm2=row[-1]( 0  1  2 ... 13 14 15)
 	movdqa	xmm3, XMMWORD [esi]	; xmm3=row[+1]( 0  1  2 ... 13 14 15)
 	pushpic	ebx
 	movpic	ebx, POINTER [gotptr]	; load GOT address
 	movdqa    xmm4,xmm1
 	punpcklbw xmm1,xmm0		; xmm1=row[ 0]( 0  1  2  3  4  5  6  7)
 	punpckhbw xmm4,xmm0		; xmm4=row[ 0]( 8  9 10 11 12 13 14 15)
 	movdqa    xmm5,xmm2
 	punpcklbw xmm2,xmm0		; xmm2=row[-1]( 0  1  2  3  4  5  6  7)
 	punpckhbw xmm5,xmm0		; xmm5=row[-1]( 8  9 10 11 12 13 14 15)
 	movdqa    xmm6,xmm3
 	punpcklbw xmm3,xmm0		; xmm3=row[+1]( 0  1  2  3  4  5  6  7)
 	punpckhbw xmm6,xmm0		; xmm6=row[+1]( 8  9 10 11 12 13 14 15)
 	pmullw	xmm1,[GOTOFF(ebx,PW_THREE)]
 	pmullw	xmm4,[GOTOFF(ebx,PW_THREE)]
 	paddw	xmm2,[GOTOFF(ebx,PW_ONE)]
 	paddw	xmm5,[GOTOFF(ebx,PW_ONE)]
 	paddw	xmm3,[GOTOFF(ebx,PW_TWO)]
 	paddw	xmm6,[GOTOFF(ebx,PW_TWO)]
 	paddw	xmm2,xmm1
 	paddw	xmm5,xmm4
 	psrlw	xmm2,2			; xmm2=Out0L=( 0  1  2  3  4  5  6  7)
 	psrlw	xmm5,2			; xmm5=Out0H=( 8  9 10 11 12 13 14 15)
 	paddw	xmm3,xmm1
 	paddw	xmm6,xmm4
 	psrlw	xmm3,2			; xmm3=Out1L=( 0  1  2  3  4  5  6  7)
 	psrlw	xmm6,2			; xmm6=Out1H=( 8  9 10 11 12 13 14 15)
 	packuswb  xmm2,xmm5		; xmm2=Out0=( 0  1  2 ... 13 14 15)
 	packuswb  xmm3,xmm6		; xmm3=Out1=( 0  1  2 ... 13 14 15)
 	movdqa	XMMWORD [edx], xmm2
 	movdqa	XMMWORD [edi], xmm3
 	poppic	ebx
 	add	ecx, byte 1*SIZEOF_XMMWORD	; inptr1(above)
 	add	ebx, byte 1*SIZEOF_XMMWORD	; inptr0
 	add	esi, byte 1*SIZEOF_XMMWORD	; inptr1(below)
 	add	edx, byte 1*SIZEOF_XMMWORD	; outptr0
 	add	edi, byte 1*SIZEOF_XMMWORD	; outptr1
 	sub	eax, byte SIZEOF_XMMWORD
 	jnz	near .columnloop
 	pop	esi
 	pop	edi
 	pop	ecx
 	pop	eax
 	add	esi, byte 1*SIZEOF_JSAMPROW	; input_data
 	add	edi, byte 2*SIZEOF_JSAMPROW	; output_data
 	sub	ecx, byte 2			; rowctr
 	jg	near .rowloop
 .return:
 	pop	edi
 	pop	esi
 ;	pop	edx		; need not be preserved
 ;	pop	ecx		; need not be preserved
 	pop	ebx
 	poppic	eax		; remove gotptr
 	pop	ebp
 	ret
 %endif ; UPSAMPLE_H1V2_SUPPORTED
 %endif ; JDSAMPLE_FANCY_SSE2_SUPPORTED
 %ifdef JDSAMPLE_SIMPLE_SSE2_SUPPORTED
 %ifndef JDSAMPLE_FANCY_SSE2_SUPPORTED
 ; --------------------------------------------------------------------------
 	SECTION	SEG_TEXT
 	BITS	32
 %endif
 ;
 ; Fast processing for the common case of 2:1 horizontal and 1:1 vertical.
 ; It's still a box filter.
 ;
 ; GLOBAL(void)
 ; jpeg_h2v1_upsample_sse2 (j_decompress_ptr cinfo,
 ;                          jpeg_component_info * compptr,
 ;                          JSAMPARRAY input_data,
 ;                          JSAMPARRAY * output_data_ptr);
 ;
 %define cinfo(b)		(b)+8		; j_decompress_ptr cinfo
 %define compptr(b)		(b)+12		; jpeg_component_info * compptr
 %define input_data(b)		(b)+16		; JSAMPARRAY input_data
 %define output_data_ptr(b)	(b)+20		; JSAMPARRAY * output_data_ptr
 	align	16
 	global	EXTN(jpeg_h2v1_upsample_sse2)
 EXTN(jpeg_h2v1_upsample_sse2):
 	push	ebp
 	mov	ebp,esp
 ;	push	ebx		; unused
 ;	push	ecx		; need not be preserved
 ;	push	edx		; need not be preserved
 	push	esi
 	push	edi
 	mov	edx, POINTER [cinfo(ebp)]
 	mov	edx, JDIMENSION [jdstruct_output_width(edx)]
 	add	edx, byte (2*SIZEOF_XMMWORD)-1
 	and	edx, byte -(2*SIZEOF_XMMWORD)
 	jz	short .return
 	mov	ecx, POINTER [cinfo(ebp)]
 	mov	ecx, INT [jdstruct_max_v_samp_factor(ecx)]	; rowctr
 	test	ecx,ecx
 	jz	short .return
 	mov	esi, JSAMPARRAY [input_data(ebp)]	; input_data
 	mov	edi, POINTER [output_data_ptr(ebp)]
 	mov	edi, JSAMPARRAY [edi]			; output_data
 	alignx	16,7
 .rowloop:
 	push	edi
 	push	esi
 	mov	esi, JSAMPROW [esi]		; inptr
 	mov	edi, JSAMPROW [edi]		; outptr
 	mov	eax,edx				; colctr
 	alignx	16,7
 .columnloop:
 	movdqa	xmm0, XMMWORD [esi+0*SIZEOF_XMMWORD]
 	movdqa    xmm1,xmm0
 	punpcklbw xmm0,xmm0
 	punpckhbw xmm1,xmm1
 	movdqa	XMMWORD [edi+0*SIZEOF_XMMWORD], xmm0
 	movdqa	XMMWORD [edi+1*SIZEOF_XMMWORD], xmm1
 	sub	eax, byte 2*SIZEOF_XMMWORD
 	jz	short .nextrow
 	movdqa	xmm2, XMMWORD [esi+1*SIZEOF_XMMWORD]
 	movdqa    xmm3,xmm2
 	punpcklbw xmm2,xmm2
 	punpckhbw xmm3,xmm3
 	movdqa	XMMWORD [edi+2*SIZEOF_XMMWORD], xmm2
 	movdqa	XMMWORD [edi+3*SIZEOF_XMMWORD], xmm3
 	sub	eax, byte 2*SIZEOF_XMMWORD
 	jz	short .nextrow
 	add	esi, byte 2*SIZEOF_XMMWORD	; inptr
 	add	edi, byte 4*SIZEOF_XMMWORD	; outptr
 	jmp	short .columnloop
 	alignx	16,7
 .nextrow:
 	pop	esi
 	pop	edi
 	add	esi, byte SIZEOF_JSAMPROW	; input_data
 	add	edi, byte SIZEOF_JSAMPROW	; output_data
 	dec	ecx				; rowctr
 	jg	short .rowloop
 .return:
 	pop	edi
 	pop	esi
 ;	pop	edx		; need not be preserved
 ;	pop	ecx		; need not be preserved
 ;	pop	ebx		; unused
 	pop	ebp
 	ret
 ; --------------------------------------------------------------------------
 ;
 ; Fast processing for the common case of 2:1 horizontal and 2:1 vertical.
 ; It's still a box filter.
 ;
 ; GLOBAL(void)
 ; jpeg_h2v2_upsample_sse2 (j_decompress_ptr cinfo,
 ;                          jpeg_component_info * compptr,
 ;                          JSAMPARRAY input_data,
 ;                          JSAMPARRAY * output_data_ptr);
 ;
 %define cinfo(b)		(b)+8		; j_decompress_ptr cinfo
 %define compptr(b)		(b)+12		; jpeg_component_info * compptr
 %define input_data(b)		(b)+16		; JSAMPARRAY input_data
 %define output_data_ptr(b)	(b)+20		; JSAMPARRAY * output_data_ptr
 	align	16
 	global	EXTN(jpeg_h2v2_upsample_sse2)
 EXTN(jpeg_h2v2_upsample_sse2):
 	push	ebp
 	mov	ebp,esp
 	push	ebx
 ;	push	ecx		; need not be preserved
 ;	push	edx		; need not be preserved
 	push	esi
 	push	edi
 	mov	edx, POINTER [cinfo(ebp)]
 	mov	edx, JDIMENSION [jdstruct_output_width(edx)]
 	add	edx, byte (2*SIZEOF_XMMWORD)-1
 	and	edx, byte -(2*SIZEOF_XMMWORD)
 	jz	near .return
 	mov	ecx, POINTER [cinfo(ebp)]
 	mov	ecx, INT [jdstruct_max_v_samp_factor(ecx)]	; rowctr
 	test	ecx,ecx
 	jz	near .return
 	mov	esi, JSAMPARRAY [input_data(ebp)]	; input_data
 	mov	edi, POINTER [output_data_ptr(ebp)]
 	mov	edi, JSAMPARRAY [edi]			; output_data
 	alignx	16,7
 .rowloop:
 	push	edi
 	push	esi
 	mov	esi, JSAMPROW [esi]			; inptr
 	mov	ebx, JSAMPROW [edi+0*SIZEOF_JSAMPROW]	; outptr0
 	mov	edi, JSAMPROW [edi+1*SIZEOF_JSAMPROW]	; outptr1
 	mov	eax,edx					; colctr
 	alignx	16,7
 .columnloop:
 	movdqa	xmm0, XMMWORD [esi+0*SIZEOF_XMMWORD]
 	movdqa    xmm1,xmm0
 	punpcklbw xmm0,xmm0
 	punpckhbw xmm1,xmm1
 	movdqa	XMMWORD [ebx+0*SIZEOF_XMMWORD], xmm0
 	movdqa	XMMWORD [ebx+1*SIZEOF_XMMWORD], xmm1
 	movdqa	XMMWORD [edi+0*SIZEOF_XMMWORD], xmm0
 	movdqa	XMMWORD [edi+1*SIZEOF_XMMWORD], xmm1
 	sub	eax, byte 2*SIZEOF_XMMWORD
 	jz	short .nextrow
 	movdqa	xmm2, XMMWORD [esi+1*SIZEOF_XMMWORD]
 	movdqa    xmm3,xmm2
 	punpcklbw xmm2,xmm2
 	punpckhbw xmm3,xmm3
 	movdqa	XMMWORD [ebx+2*SIZEOF_XMMWORD], xmm2
 	movdqa	XMMWORD [ebx+3*SIZEOF_XMMWORD], xmm3
 	movdqa	XMMWORD [edi+2*SIZEOF_XMMWORD], xmm2
 	movdqa	XMMWORD [edi+3*SIZEOF_XMMWORD], xmm3
 	sub	eax, byte 2*SIZEOF_XMMWORD
 	jz	short .nextrow
 	add	esi, byte 2*SIZEOF_XMMWORD	; inptr
 	add	ebx, byte 4*SIZEOF_XMMWORD	; outptr0
 	add	edi, byte 4*SIZEOF_XMMWORD	; outptr1
 	jmp	short .columnloop
 	alignx	16,7
 .nextrow:
 	pop	esi
 	pop	edi
 	add	esi, byte 1*SIZEOF_JSAMPROW	; input_data
 	add	edi, byte 2*SIZEOF_JSAMPROW	; output_data
 	sub	ecx, byte 2			; rowctr
 	jg	short .rowloop
 .return:
 	pop	edi
 	pop	esi
 ;	pop	edx		; need not be preserved
 ;	pop	ecx		; need not be preserved
 	pop	ebx
 	pop	ebp
 	ret
 %endif ; JDSAMPLE_SIMPLE_SSE2_SUPPORTED
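Note on jdsamss2.asm: the h2v1 "fancy" routine is a triangle filter, so each output sample is a 3:1 blend of the nearest input sample and its left or right neighbor. Reading the constants off the assembly (PW_THREE, PW_ONE, PW_TWO and the shift by 2), one row can be written in scalar C roughly as below. This is a sketch for checking the arithmetic, not the library's C implementation: sample_t and h2v1_fancy_row are hypothetical names, 8-bit samples are assumed, and the edges are handled by replicating the first and last sample, which is what the masked xmm7/xmm6 values appear to do in the SSE2 code.

```c
#include <stddef.h>

typedef unsigned char sample_t;	/* stand-in for JSAMPLE (8-bit assumed) */

/* Scalar h2v1 triangle-filter upsampling of one row:
 * n input samples produce 2*n output samples. */
static void h2v1_fancy_row(const sample_t *in, size_t n, sample_t *out)
{
  size_t i;

  for (i = 0; i < n; i++) {
    int cur  = in[i];
    int prev = in[i > 0 ? i - 1 : 0];		/* replicate left edge */
    int next = in[i + 1 < n ? i + 1 : n - 1];	/* replicate right edge */

    /* Even output leans toward the left neighbor, rounding constant 1. */
    out[2 * i]     = (sample_t) ((3 * cur + prev + 1) >> 2);
    /* Odd output leans toward the right neighbor, rounding constant 2. */
    out[2 * i + 1] = (sample_t) ((3 * cur + next + 2) >> 2);
  }
}
```

For the input row {0, 100, 200} this yields {0, 25, 75, 125, 175, 200}. The h2v2 routine applies the same 3:1 weighting vertically first and then horizontally, which is why it uses the rounding constants 8 and 7 with a shift by 4 instead.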
--- a/jdtrans.c
+++ b/jdtrans.c
@@ -1,7 +1,7 @@
 /*
 * jdtrans.c
 *
- * Copyright (C) 1995-1996, Thomas G. Lane.
+ * Copyright (C) 1995-1997, Thomas G. Lane.
 * This file is part of the Independent JPEG Group's software.
 * For conditions of distribution and use, see the accompanying README file.
 *
@@ -30,6 +30,13 @@ LOCAL(void) transdecode_master_selection JPP((j_decompress_ptr cinfo));
 * To release the memory occupied by the virtual arrays, call
 * jpeg_finish_decompress() when done with the data.
 *
 * An alternative usage is to simply obtain access to the coefficient arrays
 * during a buffered-image-mode decompression operation.  This is allowed
 * after any jpeg_finish_output() call.  The arrays can be accessed until
 * jpeg_finish_decompress() is called.  (Note that any call to the library
 * may reposition the arrays, so don't rely on access_virt_barray() results
 * to stay valid across library calls.)
 *
 * Returns NULL if suspended.  This case need be checked only if
 * a suspending data source is used.
 */
@@ -41,8 +48,8 @@ jpeg_read_coefficients (j_decompress_ptr cinfo)
    /* First call: initialize active modules */
    transdecode_master_selection(cinfo);
    cinfo->global_state = DSTATE_RDCOEFS;
-  } else if (cinfo->global_state != DSTATE_RDCOEFS)
+  }
-    ERREXIT1(cinfo, JERR_BAD_STATE, cinfo->global_state);
+  if (cinfo->global_state == DSTATE_RDCOEFS) {
    /* Absorb whole file into the coef buffer */
    for (;;) {
      int retcode;
@@ -66,7 +73,18 @@ jpeg_read_coefficients (j_decompress_ptr cinfo)
    }
    /* Set state so that jpeg_finish_decompress does the right thing */
    cinfo->global_state = DSTATE_STOPPING;
  }
  /* At this point we should be in state DSTATE_STOPPING if being used
   * standalone, or in state DSTATE_BUFIMAGE if being invoked to get access
   * to the coefficients during a full buffered-image-mode decompression.
   */
  if ((cinfo->global_state == DSTATE_STOPPING ||
       cinfo->global_state == DSTATE_BUFIMAGE) && cinfo->buffered_image) {
    return cinfo->coef->coef_arrays;
  }
  /* Oops, improper usage */
  ERREXIT1(cinfo, JERR_BAD_STATE, cinfo->global_state);
  return NULL;			/* keep compiler happy */
 }
@@ -78,6 +96,9 @@ jpeg_read_coefficients (j_decompress_ptr cinfo)
 LOCAL(void)
 transdecode_master_selection (j_decompress_ptr cinfo)
 {
  /* This is effectively a buffered-image operation. */
  cinfo->buffered_image = TRUE;
  /* Entropy decoding: either Huffman or arithmetic coding. */
  if (cinfo->arith_code) {
    ERREXIT(cinfo, JERR_ARITH_NOTIMPL);
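Note on the jdtrans.c change: jpeg_read_coefficients() no longer rejects every state other than DSTATE_RDCOEFS up front. It now falls through to a final check that accepts either DSTATE_STOPPING (standalone use) or DSTATE_BUFIMAGE (buffered-image mode), provided buffered_image is set. The mock below sketches only that control flow. All names are hypothetical stand-ins, the state codes are illustrative rather than the library's real values, the condition guarding the first-call branch is an assumption (it is outside the hunk shown), and an error flag replaces ERREXIT1.

```c
#include <stddef.h>

/* Illustrative state codes only; the real ones are defined by the library. */
enum { ST_READY, ST_BUFIMAGE, ST_RDCOEFS, ST_STOPPING };

typedef struct {
  int global_state;
  int buffered_image;	/* boolean */
  int *coef_arrays;	/* stand-in for the virtual coefficient arrays */
  int error;		/* set where the real code would call ERREXIT1 */
} mock_decoder;

static int *mock_read_coefficients(mock_decoder *d)
{
  if (d->global_state == ST_READY) {	/* assumed first-call condition */
    /* transdecode_master_selection() marks this as buffered-image use. */
    d->buffered_image = 1;
    d->global_state = ST_RDCOEFS;
  }
  if (d->global_state == ST_RDCOEFS) {
    /* The real code absorbs the whole file here (and may suspend). */
    d->global_state = ST_STOPPING;
  }
  if ((d->global_state == ST_STOPPING ||
       d->global_state == ST_BUFIMAGE) && d->buffered_image)
    return d->coef_arrays;
  d->error = 1;				/* improper usage */
  return NULL;
}
```

In this model a decoder already in the buffered-image state gets the arrays back without any state change, while the same state with buffered_image clear is reported as improper usage.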
--- a/jerror.c
+++ b/jerror.c
@@ -1,7 +1,7 @@
 /*
 * jerror.c
 *
- * Copyright (C) 1991-1996, Thomas G. Lane.
+ * Copyright (C) 1991-1998, Thomas G. Lane.
 * This file is part of the Independent JPEG Group's software.
 * For conditions of distribution and use, see the accompanying README file.
 *
@@ -10,6 +10,11 @@
 * stderr is the right thing to do.  Many applications will want to replace
 * some or all of these routines.
 *
 * If you define USE_WINDOWS_MESSAGEBOX in jconfig.h or in the makefile,
 * you get a Windows-specific hack to display error messages in a dialog box.
 * It ain't much, but it beats dropping error messages into the bit bucket,
 * which is what happens to output to stderr under most Windows C compilers.
 *
 * These routines are used by both the compression and decompression code.
 */
@@ -19,6 +24,10 @@
 #include "jversion.h"
 #include "jerror.h"
 #ifdef USE_WINDOWS_MESSAGEBOX
 #include <windows.h>
 #endif
 #ifndef EXIT_FAILURE		/* define exit() codes if not provided */
 #define EXIT_FAILURE  1
 #endif
@@ -74,6 +83,15 @@ error_exit (j_common_ptr cinfo)
 * Actual output of an error or trace message.
 * Applications may override this method to send JPEG messages somewhere
 * other than stderr.
 *
 * On Windows, printing to stderr is generally completely useless,
 * so we provide optional code to produce an error-dialog popup.
 * Most Windows applications will still prefer to override this routine,
 * but if they don't, it'll do something at least marginally useful.
 *
 * NOTE: to use the library in an environment that doesn't support the
 * C stdio library, you may have to delete the call to fprintf() entirely,
 * not just not use this routine.
 */
 METHODDEF(void)
@@ -84,8 +102,14 @@ output_message (j_common_ptr cinfo)
  /* Create the message */
  (*cinfo->err->format_message) (cinfo, buffer);
 #ifdef USE_WINDOWS_MESSAGEBOX
  /* Display it in a message dialog box */
  MessageBox(GetActiveWindow(), buffer, "JPEG Library Error",
 	     MB_OK | MB_ICONERROR);
 #else
  /* Send it to stderr, adding a newline */
  fprintf(stderr, "%s\n", buffer);
 #endif
 }
--- a/jerror.h
+++ b/jerror.h
@@ -1,7 +1,7 @@
 /*
 * jerror.h
 *
- * Copyright (C) 1994-1995, Thomas G. Lane.
+ * Copyright (C) 1994-1997, Thomas G. Lane.
 * This file is part of the Independent JPEG Group's software.
 * For conditions of distribution and use, see the accompanying README file.
 *
@@ -45,7 +45,9 @@ JMESSAGE(JERR_BAD_ALIGN_TYPE, "ALIGN_TYPE is wrong, please fix")
 JMESSAGE(JERR_BAD_ALLOC_CHUNK, "MAX_ALLOC_CHUNK is wrong, please fix")
 JMESSAGE(JERR_BAD_BUFFER_MODE, "Bogus buffer control mode")
 JMESSAGE(JERR_BAD_COMPONENT_ID, "Invalid component ID %d in SOS")
 JMESSAGE(JERR_BAD_DCT_COEF, "DCT coefficient out of range")
 JMESSAGE(JERR_BAD_DCTSIZE, "IDCT output block size %d not supported")
 JMESSAGE(JERR_BAD_HUFF_TABLE, "Bogus Huffman table definition")
 JMESSAGE(JERR_BAD_IN_COLORSPACE, "Bogus input colorspace")
 JMESSAGE(JERR_BAD_J_COLORSPACE, "Bogus JPEG colorspace")
 JMESSAGE(JERR_BAD_LENGTH, "Bogus marker length")
@@ -71,7 +73,6 @@ JMESSAGE(JERR_COMPONENT_COUNT, "Too many color components: %d, max %d")
 JMESSAGE(JERR_CONVERSION_NOTIMPL, "Unsupported color conversion request")
 JMESSAGE(JERR_DAC_INDEX, "Bogus DAC index %d")
 JMESSAGE(JERR_DAC_VALUE, "Bogus DAC value 0x%x")
 JMESSAGE(JERR_DHT_COUNTS, "Bogus DHT counts")
 JMESSAGE(JERR_DHT_INDEX, "Bogus DHT index %d")
 JMESSAGE(JERR_DQT_INDEX, "Bogus DQT index %d")
 JMESSAGE(JERR_EMPTY_IMAGE, "Empty JPEG image (DNL not supported)")
@@ -134,12 +135,13 @@ JMESSAGE(JTRC_EMS_CLOSE, "Freed EMS handle %u")
 JMESSAGE(JTRC_EMS_OPEN, "Obtained EMS handle %u")
 JMESSAGE(JTRC_EOI, "End Of Image")
 JMESSAGE(JTRC_HUFFBITS, "        %3d %3d %3d %3d %3d %3d %3d %3d")
-JMESSAGE(JTRC_JFIF, "JFIF APP0 marker, density %dx%d  %d")
+JMESSAGE(JTRC_JFIF, "JFIF APP0 marker: version %d.%02d, density %dx%d  %d")
 JMESSAGE(JTRC_JFIF_BADTHUMBNAILSIZE,
 	 "Warning: thumbnail image size does not match data length %u")
-JMESSAGE(JTRC_JFIF_MINOR, "Unknown JFIF minor revision number %d.%02d")
+JMESSAGE(JTRC_JFIF_EXTENSION,
 	 "JFIF extension marker: type 0x%02x, length %u")
 JMESSAGE(JTRC_JFIF_THUMBNAIL, "    with %d x %d thumbnail image")
-JMESSAGE(JTRC_MISC_MARKER, "Skipping marker 0x%02x, length %u")
+JMESSAGE(JTRC_MISC_MARKER, "Miscellaneous marker 0x%02x, length %u")
 JMESSAGE(JTRC_PARMLESS_MARKER, "Unexpected marker 0x%02x")
 JMESSAGE(JTRC_QUANTVALS, "        %4u %4u %4u %4u %4u %4u %4u %4u")
 JMESSAGE(JTRC_QUANT_3_NCOLORS, "Quantizing to %d = %d*%d*%d colors")
@@ -157,6 +159,12 @@ JMESSAGE(JTRC_SOS_COMPONENT, "    Component %d: dc=%d ac=%d")
 JMESSAGE(JTRC_SOS_PARAMS, "  Ss=%d, Se=%d, Ah=%d, Al=%d")
 JMESSAGE(JTRC_TFILE_CLOSE, "Closed temporary file %s")
 JMESSAGE(JTRC_TFILE_OPEN, "Opened temporary file %s")
+JMESSAGE(JTRC_THUMB_JPEG,
+	 "JFIF extension marker: JPEG-compressed thumbnail image, length %u")
+JMESSAGE(JTRC_THUMB_PALETTE,
+	 "JFIF extension marker: palette thumbnail image, length %u")
+JMESSAGE(JTRC_THUMB_RGB,
+	 "JFIF extension marker: RGB thumbnail image, length %u")
 JMESSAGE(JTRC_UNKNOWN_IDS,
 	 "Unrecognized component IDs %d %d %d, assuming YCbCr")
 JMESSAGE(JTRC_XMS_CLOSE, "Freed XMS handle %u")
@@ -263,6 +271,12 @@ JMESSAGE(JWRN_TOO_MUCH_DATA, "Application transferred too many scanlines")
 	   _mp[0] = (p1); _mp[1] = (p2); _mp[2] = (p3); _mp[3] = (p4); \
 	   (cinfo)->err->msg_code = (code); \
 	   (*(cinfo)->err->emit_message) ((j_common_ptr) (cinfo), (lvl)); )
+#define TRACEMS5(cinfo,lvl,code,p1,p2,p3,p4,p5)  \
+ MAKESTMT(int * _mp = (cinfo)->err->msg_parm.i; \
+	   _mp[0] = (p1); _mp[1] = (p2); _mp[2] = (p3); _mp[3] = (p4); \
+	   _mp[4] = (p5); \
+	   (cinfo)->err->msg_code = (code); \
+	   (*(cinfo)->err->emit_message) ((j_common_ptr) (cinfo), (lvl)); )
 #define TRACEMS8(cinfo,lvl,code,p1,p2,p3,p4,p5,p6,p7,p8)  \
  MAKESTMT(int * _mp = (cinfo)->err->msg_parm.i; \
 	   _mp[0] = (p1); _mp[1] = (p2); _mp[2] = (p3); _mp[3] = (p4); \
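The jerror.h hunks above rely on two mechanisms: a single JMESSAGE list that is expanded into both an enum and a string table, and TRACEMSn macros that store integer parameters before the message is formatted. The following is a minimal sketch of that pattern, not the library's code. The names MESSAGE_LIST, mini_err, MINI_TRACEMS5 and format_message are hypothetical, the error manager is reduced to two fields, and only three of the messages from the diff are included.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical miniature of the jerror.h pattern: one message list,
 * expanded twice with different definitions of JMESSAGE. */
#define MESSAGE_LIST \
  JMESSAGE(JTRC_EOI, "End Of Image") \
  JMESSAGE(JTRC_JFIF, "JFIF APP0 marker: version %d.%02d, density %dx%d  %d") \
  JMESSAGE(JTRC_MISC_MARKER, "Miscellaneous marker 0x%02x, length %u")

/* First expansion: message codes. */
#define JMESSAGE(code, text) code,
enum message_code { MESSAGE_LIST MESSAGE_COUNT };
#undef JMESSAGE

/* Second expansion: the matching format strings, in the same order. */
#define JMESSAGE(code, text) text,
static const char *const message_table[] = { MESSAGE_LIST };
#undef JMESSAGE

/* Simplified stand-in for the library's error manager (assumption). */
struct mini_err {
  int msg_code;
  int msg_parm[8];
};

/* Same idea as TRACEMS5: store five ints, then record the code. */
#define MINI_TRACEMS5(err, code, p1, p2, p3, p4, p5) \
  do { int *_mp = (err)->msg_parm; \
       _mp[0] = (p1); _mp[1] = (p2); _mp[2] = (p3); _mp[3] = (p4); \
       _mp[4] = (p5); \
       (err)->msg_code = (code); } while (0)

/* Format the stored message; unused trailing arguments are ignored. */
static void format_message(const struct mini_err *err, char *buf, size_t n)
{
  const int *p = err->msg_parm;
  assert(err->msg_code >= 0 && err->msg_code < MESSAGE_COUNT);
  snprintf(buf, n, message_table[err->msg_code], p[0], p[1], p[2], p[3], p[4]);
}
```

With a zero-initialized mini_err, calling MINI_TRACEMS5(&e, JTRC_JFIF, 1, 2, 72, 72, 1) and then format_message fills the buffer with the new five-parameter JFIF trace text, which is why the reworded JTRC_JFIF message needs a five-argument trace macro.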
--- a/jf3dnflt.asm
+++ b/jf3dnflt.asm
@@ -0,0 +1,327 @@
 ;
 ; jf3dnflt.asm - floating-point FDCT (3DNow!)
 ;
 ; x86 SIMD extension for IJG JPEG library
 ; Copyright (C) 1999-2006, MIYASAKA Masaru.
 ; For conditions of distribution and use, see copyright notice in jsimdext.inc
 ;
 ; This file should be assembled with NASM (Netwide Assembler); it
 ; can *not* be assembled with Microsoft's MASM or any compatible
 ; assembler (including Borland's Turbo Assembler).
 ; NASM is available from http://nasm.sourceforge.net/ or
 ; http://sourceforge.net/project/showfiles.php?group_id=6208
 ;
 ; This file contains a floating-point implementation of the forward DCT
 ; (Discrete Cosine Transform). The following code is based directly on
 ; the IJG's original jfdctflt.c; see jfdctflt.c for more details.
 ;
 ; Last Modified : February 4, 2006
 ;
 ; [TAB8]
 %include "jsimdext.inc"
 %include "jdct.inc"
 %ifdef DCT_FLOAT_SUPPORTED
 %ifdef JFDCT_FLT_3DNOW_MMX_SUPPORTED
 ; This module is specialized to the case DCTSIZE = 8.
 ;
 %if DCTSIZE != 8
 %error "Sorry, this code only copes with 8x8 DCTs."
 %endif
 ; --------------------------------------------------------------------------
 	SECTION	SEG_CONST
 	alignz	16
 	global	EXTN(jconst_fdct_float_3dnow)
 EXTN(jconst_fdct_float_3dnow):
 PD_0_382	times 2 dd  0.382683432365089771728460
 PD_0_707	times 2 dd  0.707106781186547524400844
 PD_0_541	times 2 dd  0.541196100146196984399723
 PD_1_306	times 2 dd  1.306562964876376527856643
 	alignz	16
 ; --------------------------------------------------------------------------
 	SECTION	SEG_TEXT
 	BITS	32
 ;
 ; Perform the forward DCT on one block of samples.
 ;
 ; GLOBAL(void)
 ; jpeg_fdct_float_3dnow (FAST_FLOAT * data)
 ;
 %define data(b)		(b)+8		; FAST_FLOAT * data
 %define original_ebp	ebp+0
 %define wk(i)		ebp-(WK_NUM-(i))*SIZEOF_MMWORD	; mmword wk[WK_NUM]
 %define WK_NUM		2
 	align	16
 	global	EXTN(jpeg_fdct_float_3dnow)
 EXTN(jpeg_fdct_float_3dnow):
 	push	ebp
 	mov	eax,esp				; eax = original ebp
 	sub	esp, byte 4
 	and	esp, byte (-SIZEOF_MMWORD)	; align to 64 bits
 	mov	[esp],eax
 	mov	ebp,esp				; ebp = aligned ebp
 	lea	esp, [wk(0)]
 	pushpic	ebx
 ;	push	ecx		; need not be preserved
 ;	push	edx		; need not be preserved
 ;	push	esi		; unused
 ;	push	edi		; unused
 	get_GOT	ebx		; get GOT address
 	; ---- Pass 1: process rows.
 	mov	edx, POINTER [data(eax)]	; (FAST_FLOAT *)
 	mov	ecx, DCTSIZE/2
 	alignx	16,7
 .rowloop:
 	movq	mm0, MMWORD [MMBLOCK(0,0,edx,SIZEOF_FAST_FLOAT)]
 	movq	mm1, MMWORD [MMBLOCK(1,0,edx,SIZEOF_FAST_FLOAT)]
 	movq	mm2, MMWORD [MMBLOCK(0,3,edx,SIZEOF_FAST_FLOAT)]
 	movq	mm3, MMWORD [MMBLOCK(1,3,edx,SIZEOF_FAST_FLOAT)]
 	; mm0=(00 01), mm1=(10 11), mm2=(06 07), mm3=(16 17)
 	movq      mm4,mm0		; transpose coefficients
 	punpckldq mm0,mm1		; mm0=(00 10)=data0
 	punpckhdq mm4,mm1		; mm4=(01 11)=data1
 	movq      mm5,mm2		; transpose coefficients
 	punpckldq mm2,mm3		; mm2=(06 16)=data6
 	punpckhdq mm5,mm3		; mm5=(07 17)=data7
 	movq	mm6,mm4
 	movq	mm7,mm0
 	pfsub	mm4,mm2			; mm4=data1-data6=tmp6
 	pfsub	mm0,mm5			; mm0=data0-data7=tmp7
 	pfadd	mm6,mm2			; mm6=data1+data6=tmp1
 	pfadd	mm7,mm5			; mm7=data0+data7=tmp0
 	movq	mm1, MMWORD [MMBLOCK(0,1,edx,SIZEOF_FAST_FLOAT)]
 	movq	mm3, MMWORD [MMBLOCK(1,1,edx,SIZEOF_FAST_FLOAT)]
 	movq	mm2, MMWORD [MMBLOCK(0,2,edx,SIZEOF_FAST_FLOAT)]
 	movq	mm5, MMWORD [MMBLOCK(1,2,edx,SIZEOF_FAST_FLOAT)]
 	; mm1=(02 03), mm3=(12 13), mm2=(04 05), mm5=(14 15)
 	movq	MMWORD [wk(0)], mm4	; wk(0)=tmp6
 	movq	MMWORD [wk(1)], mm0	; wk(1)=tmp7
 	movq      mm4,mm1		; transpose coefficients
 	punpckldq mm1,mm3		; mm1=(02 12)=data2
 	punpckhdq mm4,mm3		; mm4=(03 13)=data3
 	movq      mm0,mm2		; transpose coefficients
 	punpckldq mm2,mm5		; mm2=(04 14)=data4
 	punpckhdq mm0,mm5		; mm0=(05 15)=data5
 	movq	mm3,mm4
 	movq	mm5,mm1
 	pfadd	mm4,mm2			; mm4=data3+data4=tmp3
 	pfadd	mm1,mm0			; mm1=data2+data5=tmp2
 	pfsub	mm3,mm2			; mm3=data3-data4=tmp4
 	pfsub	mm5,mm0			; mm5=data2-data5=tmp5
 	; -- Even part
 	movq	mm2,mm7
 	movq	mm0,mm6
 	pfsub	mm7,mm4			; mm7=tmp13
 	pfsub	mm6,mm1			; mm6=tmp12
 	pfadd	mm2,mm4			; mm2=tmp10
 	pfadd	mm0,mm1			; mm0=tmp11
 	pfadd	mm6,mm7
 	pfmul	mm6,[GOTOFF(ebx,PD_0_707)] ; mm6=z1
 	movq	mm4,mm2
 	movq	mm1,mm7
 	pfsub	mm2,mm0			; mm2=data4
 	pfsub	mm7,mm6			; mm7=data6
 	pfadd	mm4,mm0			; mm4=data0
 	pfadd	mm1,mm6			; mm1=data2
 	movq	MMWORD [MMBLOCK(0,2,edx,SIZEOF_FAST_FLOAT)], mm2
 	movq	MMWORD [MMBLOCK(0,3,edx,SIZEOF_FAST_FLOAT)], mm7
 	movq	MMWORD [MMBLOCK(0,0,edx,SIZEOF_FAST_FLOAT)], mm4
 	movq	MMWORD [MMBLOCK(0,1,edx,SIZEOF_FAST_FLOAT)], mm1
 	; -- Odd part
 	movq	mm0, MMWORD [wk(0)]	; mm0=tmp6
 	movq	mm6, MMWORD [wk(1)]	; mm6=tmp7
 	pfadd	mm3,mm5			; mm3=tmp10
 	pfadd	mm5,mm0			; mm5=tmp11
 	pfadd	mm0,mm6			; mm0=tmp12, mm6=tmp7
 	pfmul	mm5,[GOTOFF(ebx,PD_0_707)] ; mm5=z3
 	movq	mm2,mm3			; mm2=tmp10
 	pfsub	mm3,mm0
 	pfmul	mm3,[GOTOFF(ebx,PD_0_382)] ; mm3=z5
 	pfmul	mm2,[GOTOFF(ebx,PD_0_541)] ; mm2=MULTIPLY(tmp10,FIX_0_54119610)
 	pfmul	mm0,[GOTOFF(ebx,PD_1_306)] ; mm0=MULTIPLY(tmp12,FIX_1_30656296)
 	pfadd	mm2,mm3			; mm2=z2
 	pfadd	mm0,mm3			; mm0=z4
 	movq	mm7,mm6
 	pfsub	mm6,mm5			; mm6=z13
 	pfadd	mm7,mm5			; mm7=z11
 	movq	mm4,mm6
 	movq	mm1,mm7
 	pfsub	mm6,mm2			; mm6=data3
 	pfsub	mm7,mm0			; mm7=data7
 	pfadd	mm4,mm2			; mm4=data5
 	pfadd	mm1,mm0			; mm1=data1
 	movq	MMWORD [MMBLOCK(1,1,edx,SIZEOF_FAST_FLOAT)], mm6
 	movq	MMWORD [MMBLOCK(1,3,edx,SIZEOF_FAST_FLOAT)], mm7
 	movq	MMWORD [MMBLOCK(1,2,edx,SIZEOF_FAST_FLOAT)], mm4
 	movq	MMWORD [MMBLOCK(1,0,edx,SIZEOF_FAST_FLOAT)], mm1
 	add	edx, byte 2*DCTSIZE*SIZEOF_FAST_FLOAT
 	dec	ecx
 	jnz	near .rowloop
 	; ---- Pass 2: process columns.
 	mov	edx, POINTER [data(eax)]	; (FAST_FLOAT *)
 	mov	ecx, DCTSIZE/2
 	alignx	16,7
 .columnloop:
 	movq	mm0, MMWORD [MMBLOCK(0,0,edx,SIZEOF_FAST_FLOAT)]
 	movq	mm1, MMWORD [MMBLOCK(1,0,edx,SIZEOF_FAST_FLOAT)]
 	movq	mm2, MMWORD [MMBLOCK(6,0,edx,SIZEOF_FAST_FLOAT)]
 	movq	mm3, MMWORD [MMBLOCK(7,0,edx,SIZEOF_FAST_FLOAT)]
 	; mm0=(00 10), mm1=(01 11), mm2=(60 70), mm3=(61 71)
 	movq      mm4,mm0		; transpose coefficients
 	punpckldq mm0,mm1		; mm0=(00 01)=data0
 	punpckhdq mm4,mm1		; mm4=(10 11)=data1
 	movq      mm5,mm2		; transpose coefficients
 	punpckldq mm2,mm3		; mm2=(60 61)=data6
 	punpckhdq mm5,mm3		; mm5=(70 71)=data7
 	movq	mm6,mm4
 	movq	mm7,mm0
 	pfsub	mm4,mm2			; mm4=data1-data6=tmp6
 	pfsub	mm0,mm5			; mm0=data0-data7=tmp7
 	pfadd	mm6,mm2			; mm6=data1+data6=tmp1
 	pfadd	mm7,mm5			; mm7=data0+data7=tmp0
 	movq	mm1, MMWORD [MMBLOCK(2,0,edx,SIZEOF_FAST_FLOAT)]
 	movq	mm3, MMWORD [MMBLOCK(3,0,edx,SIZEOF_FAST_FLOAT)]
 	movq	mm2, MMWORD [MMBLOCK(4,0,edx,SIZEOF_FAST_FLOAT)]
 	movq	mm5, MMWORD [MMBLOCK(5,0,edx,SIZEOF_FAST_FLOAT)]
 	; mm1=(20 30), mm3=(21 31), mm2=(40 50), mm5=(41 51)
 	movq	MMWORD [wk(0)], mm4	; wk(0)=tmp6
 	movq	MMWORD [wk(1)], mm0	; wk(1)=tmp7
 	movq      mm4,mm1		; transpose coefficients
 	punpckldq mm1,mm3		; mm1=(20 21)=data2
 	punpckhdq mm4,mm3		; mm4=(30 31)=data3
 	movq      mm0,mm2		; transpose coefficients
 	punpckldq mm2,mm5		; mm2=(40 41)=data4
 	punpckhdq mm0,mm5		; mm0=(50 51)=data5
 	movq	mm3,mm4
 	movq	mm5,mm1
 	pfadd	mm4,mm2			; mm4=data3+data4=tmp3
 	pfadd	mm1,mm0			; mm1=data2+data5=tmp2
 	pfsub	mm3,mm2			; mm3=data3-data4=tmp4
 	pfsub	mm5,mm0			; mm5=data2-data5=tmp5
 	; -- Even part
 	movq	mm2,mm7
 	movq	mm0,mm6
 	pfsub	mm7,mm4			; mm7=tmp13
 	pfsub	mm6,mm1			; mm6=tmp12
 	pfadd	mm2,mm4			; mm2=tmp10
 	pfadd	mm0,mm1			; mm0=tmp11
 	pfadd	mm6,mm7
 	pfmul	mm6,[GOTOFF(ebx,PD_0_707)] ; mm6=z1
 	movq	mm4,mm2
 	movq	mm1,mm7
 	pfsub	mm2,mm0			; mm2=data4
 	pfsub	mm7,mm6			; mm7=data6
 	pfadd	mm4,mm0			; mm4=data0
 	pfadd	mm1,mm6			; mm1=data2
 	movq	MMWORD [MMBLOCK(4,0,edx,SIZEOF_FAST_FLOAT)], mm2
 	movq	MMWORD [MMBLOCK(6,0,edx,SIZEOF_FAST_FLOAT)], mm7
 	movq	MMWORD [MMBLOCK(0,0,edx,SIZEOF_FAST_FLOAT)], mm4
 	movq	MMWORD [MMBLOCK(2,0,edx,SIZEOF_FAST_FLOAT)], mm1
 	; -- Odd part
 	movq	mm0, MMWORD [wk(0)]	; mm0=tmp6
 	movq	mm6, MMWORD [wk(1)]	; mm6=tmp7
 	pfadd	mm3,mm5			; mm3=tmp10
 	pfadd	mm5,mm0			; mm5=tmp11
 	pfadd	mm0,mm6			; mm0=tmp12, mm6=tmp7
 	pfmul	mm5,[GOTOFF(ebx,PD_0_707)] ; mm5=z3
 	movq	mm2,mm3			; mm2=tmp10
 	pfsub	mm3,mm0
 	pfmul	mm3,[GOTOFF(ebx,PD_0_382)] ; mm3=z5
 	pfmul	mm2,[GOTOFF(ebx,PD_0_541)] ; mm2=MULTIPLY(tmp10,FIX_0_54119610)
 	pfmul	mm0,[GOTOFF(ebx,PD_1_306)] ; mm0=MULTIPLY(tmp12,FIX_1_30656296)
 	pfadd	mm2,mm3			; mm2=z2
 	pfadd	mm0,mm3			; mm0=z4
 	movq	mm7,mm6
 	pfsub	mm6,mm5			; mm6=z13
 	pfadd	mm7,mm5			; mm7=z11
 	movq	mm4,mm6
 	movq	mm1,mm7
 	pfsub	mm6,mm2			; mm6=data3
 	pfsub	mm7,mm0			; mm7=data7
 	pfadd	mm4,mm2			; mm4=data5
 	pfadd	mm1,mm0			; mm1=data1
 	movq	MMWORD [MMBLOCK(3,0,edx,SIZEOF_FAST_FLOAT)], mm6
 	movq	MMWORD [MMBLOCK(7,0,edx,SIZEOF_FAST_FLOAT)], mm7
 	movq	MMWORD [MMBLOCK(5,0,edx,SIZEOF_FAST_FLOAT)], mm4
 	movq	MMWORD [MMBLOCK(1,0,edx,SIZEOF_FAST_FLOAT)], mm1
 	add	edx, byte 2*SIZEOF_FAST_FLOAT
 	dec	ecx
 	jnz	near .columnloop
 	femms		; empty MMX/3DNow! state
 ;	pop	edi		; unused
 ;	pop	esi		; unused
 ;	pop	edx		; need not be preserved
 ;	pop	ecx		; need not be preserved
 	poppic	ebx
 	mov	esp,ebp		; esp <- aligned ebp
 	pop	esp		; esp <- original ebp
 	pop	ebp
 	ret
 %endif ; JFDCT_FLT_3DNOW_MMX_SUPPORTED
 %endif ; DCT_FLOAT_SUPPORTED
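The "transpose coefficients" steps in jf3dnflt.asm pair each punpckldq with a punpckhdq so that one 64-bit register ends up holding the same data column for two adjacent rows. The C sketch below models only that data movement, under stated assumptions: pair2f is a hypothetical stand-in for an MMX register holding two floats, and unpack_lo, unpack_hi and load_transposed are illustrative names that do not appear in the source.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical model of one 64-bit register holding two floats. */
typedef struct { float lo, hi; } pair2f;

/* Models punpckldq dst,src: result = (dst.lo, src.lo). */
static pair2f unpack_lo(pair2f dst, pair2f src)
{
  pair2f r = { dst.lo, src.lo };
  return r;
}

/* Models punpckhdq dst,src: result = (dst.hi, src.hi). */
static pair2f unpack_hi(pair2f dst, pair2f src)
{
  pair2f r = { dst.hi, src.hi };
  return r;
}

/* Load elements (row, col) and (row, col+1) from two adjacent rows of an
 * 8x8 block, then transpose the 2x2 tile so that each output holds one
 * column for both rows. */
static void load_transposed(const float *block, int row, int col,
                            pair2f *first, pair2f *second)
{
  assert(block != NULL);
  pair2f r0 = { block[row * 8 + col],       block[row * 8 + col + 1] };
  pair2f r1 = { block[(row + 1) * 8 + col], block[(row + 1) * 8 + col + 1] };
  *first  = unpack_lo(r0, r1);  /* (row,col)   (row+1,col)   */
  *second = unpack_hi(r0, r1);  /* (row,col+1) (row+1,col+1) */
}
```

Called with row 0 and column 0, this reproduces the register comments in the row loop: (00 01) and (10 11) become (00 10)=data0 and (01 11)=data1, so each later pfadd or pfsub advances two rows at once.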
--- a/jfdctflt.asm
+++ b/jfdctflt.asm
@@ -0,0 +1,288 @@
 ;
 ; jfdctflt.asm - floating-point FDCT (non-SIMD)
 ;
 ; x86 SIMD extension for IJG JPEG library
 ; Copyright (C) 1999-2006, MIYASAKA Masaru.
 ; For conditions of distribution and use, see copyright notice in jsimdext.inc
 ;
 ; This file should be assembled with NASM (Netwide Assembler); it
 ; can *not* be assembled with Microsoft's MASM or any compatible
 ; assembler (including Borland's Turbo Assembler).
 ; NASM is available from http://nasm.sourceforge.net/ or
 ; http://sourceforge.net/project/showfiles.php?group_id=6208
 ;
 ; This file contains a floating-point implementation of the forward DCT
 ; (Discrete Cosine Transform). The following code is based directly on
 ; the IJG's original jfdctflt.c; see jfdctflt.c for more details.
 ;
 ; Last Modified : October 17, 2004
 ;
 ; [TAB8]
 %include "jsimdext.inc"
 %include "jdct.inc"
 %ifdef DCT_FLOAT_SUPPORTED
 ; This module is specialized to the case DCTSIZE = 8.
 ;
 %if DCTSIZE != 8
 %error "Sorry, this code only copes with 8x8 DCTs."
 %endif
 ; --------------------------------------------------------------------------
 	SECTION	SEG_CONST
 %define ROTATOR_TYPE	FP32	; float
 	alignz	16
 	global	EXTN(jconst_fdct_float)
 EXTN(jconst_fdct_float):
 F_0_382	dd	0.382683432365089771728460	; cos(PI*3/8)
 F_0_707	dd	0.707106781186547524400844	; cos(PI*1/4)
 F_0_541	dd	0.541196100146196984399723	; cos(PI*1/8)-cos(PI*3/8)
 F_1_306	dd	1.306562964876376527856643	; cos(PI*1/8)+cos(PI*3/8)
 	alignz	16
 ; --------------------------------------------------------------------------
 	SECTION	SEG_TEXT
 	BITS	32
 ;
 ; Perform the forward DCT on one block of samples.
 ;
 ; GLOBAL(void)
 ; jpeg_fdct_float (FAST_FLOAT * data)
 ;
 %define data(b)	(b)+8		; FAST_FLOAT * data
 	align	16
 	global	EXTN(jpeg_fdct_float)
 EXTN(jpeg_fdct_float):
 	push	ebp
 	mov	ebp,esp
 	pushpic	ebx
 ;	push	ecx		; need not be preserved
 ;	push	edx		; need not be preserved
 ;	push	esi		; unused
 ;	push	edi		; unused
 	get_GOT	ebx		; get GOT address
 	; ---- Pass 1: process rows.
 	mov	edx, POINTER [data(ebp)]	; (FAST_FLOAT *)
 	mov	ecx, DCTSIZE
 	alignx	16,7
 .rowloop:
 	fld	FAST_FLOAT [ROW(1,edx,SIZEOF_FAST_FLOAT)]
 	fadd	FAST_FLOAT [ROW(6,edx,SIZEOF_FAST_FLOAT)]
 	fld	FAST_FLOAT [ROW(0,edx,SIZEOF_FAST_FLOAT)]
 	fadd	FAST_FLOAT [ROW(7,edx,SIZEOF_FAST_FLOAT)]
 	fld	FAST_FLOAT [ROW(3,edx,SIZEOF_FAST_FLOAT)]
 	fadd	FAST_FLOAT [ROW(4,edx,SIZEOF_FAST_FLOAT)]
 	fld	FAST_FLOAT [ROW(2,edx,SIZEOF_FAST_FLOAT)]
 	fadd	FAST_FLOAT [ROW(5,edx,SIZEOF_FAST_FLOAT)]
 	; -- Even part
 	fld	st2	; st2 = st2 + st1, st1 = st2 - st1
 	fsub	st0,st2
 	fxch	st0,st2
 	faddp	st3,st0
 	fld	st3	; st3 = st3 + st0, st0 = st3 - st0
 	fsub	st0,st1
 	fxch	st0,st1
 	faddp	st4,st0
 	fadd	st0,st1
 	fmul	ROTATOR_TYPE [GOTOFF(ebx,F_0_707)]
 	fld	st2	; st3 = st2 + st3, st2 = st2 - st3
 	fsub	st0,st4
 	fxch	st0,st3
 	faddp	st4,st0
 	fld	st1	; st0 = st1 + st0, st1 = st1 - st0
 	fsub	st0,st1
 	fxch	st0,st2
 	faddp	st1,st0
 	fld	FAST_FLOAT [ROW(0,edx,SIZEOF_FAST_FLOAT)]
 	fsub	FAST_FLOAT [ROW(7,edx,SIZEOF_FAST_FLOAT)]
 	fxch	st0,st4
 	fld	FAST_FLOAT [ROW(3,edx,SIZEOF_FAST_FLOAT)]
 	fsub	FAST_FLOAT [ROW(4,edx,SIZEOF_FAST_FLOAT)]
 	fxch	st0,st4
 	fld	FAST_FLOAT [ROW(1,edx,SIZEOF_FAST_FLOAT)]
 	fsub	FAST_FLOAT [ROW(6,edx,SIZEOF_FAST_FLOAT)]
 	fxch	st0,st4
 	fld	FAST_FLOAT [ROW(2,edx,SIZEOF_FAST_FLOAT)]
 	fsub	FAST_FLOAT [ROW(5,edx,SIZEOF_FAST_FLOAT)]
 	fxch	st0,st4
 	fstp	FAST_FLOAT [ROW(2,edx,SIZEOF_FAST_FLOAT)]
 	fstp	FAST_FLOAT [ROW(6,edx,SIZEOF_FAST_FLOAT)]
 	fstp	FAST_FLOAT [ROW(4,edx,SIZEOF_FAST_FLOAT)]
 	fstp	FAST_FLOAT [ROW(0,edx,SIZEOF_FAST_FLOAT)]
 	; -- Odd part
 	fadd	st2,st0
 	fadd	st0,st1
 	fxch	st0,st3
 	fadd	st1,st0
 	fxch	st0,st3
 	fld	st2
 	fxch	st0,st1
 	fmul	ROTATOR_TYPE [GOTOFF(ebx,F_0_707)]
 	fxch	st0,st1
 	fsub	st0,st2
 	fxch	st0,st3
 	fmul	ROTATOR_TYPE [GOTOFF(ebx,F_0_541)]
 	fxch	st0,st3
 	fmul	ROTATOR_TYPE [GOTOFF(ebx,F_0_382)]
 	fxch	st0,st2
 	fmul	ROTATOR_TYPE [GOTOFF(ebx,F_1_306)]
 	fxch	st0,st2
 	fadd	st3,st0
 	faddp	st2,st0
 	fld	st3	; st3 = st3 + st0, st0 = st3 - st0
 	fsub	st0,st1
 	fxch	st0,st1
 	faddp	st4,st0
 	fld	st2	; st0 = st0 + st2, st2 = st0 - st2
 	fsubr	st0,st1
 	fxch	st0,st3
 	faddp	st1,st0
 	fld	st1	; st3 = st3 + st1, st1 = st3 - st1
 	fsubr	st0,st4
 	fxch	st0,st2
 	faddp	st4,st0
 	fstp	FAST_FLOAT [ROW(5,edx,SIZEOF_FAST_FLOAT)]
 	fstp	FAST_FLOAT [ROW(7,edx,SIZEOF_FAST_FLOAT)]
 	fstp	FAST_FLOAT [ROW(3,edx,SIZEOF_FAST_FLOAT)]
 	fstp	FAST_FLOAT [ROW(1,edx,SIZEOF_FAST_FLOAT)]
 	add	edx, byte DCTSIZE*SIZEOF_FAST_FLOAT	; advance pointer to next row
 	dec	ecx
 	jnz	near .rowloop
 	; ---- Pass 2: process columns.
 	mov	edx, POINTER [data(ebp)]	; (FAST_FLOAT *)
 	mov	ecx, DCTSIZE
 	alignx	16,7
 .columnloop:
 	fld	FAST_FLOAT [COL(1,edx,SIZEOF_FAST_FLOAT)]
 	fadd	FAST_FLOAT [COL(6,edx,SIZEOF_FAST_FLOAT)]
 	fld	FAST_FLOAT [COL(0,edx,SIZEOF_FAST_FLOAT)]
 	fadd	FAST_FLOAT [COL(7,edx,SIZEOF_FAST_FLOAT)]
 	fld	FAST_FLOAT [COL(3,edx,SIZEOF_FAST_FLOAT)]
 	fadd	FAST_FLOAT [COL(4,edx,SIZEOF_FAST_FLOAT)]
 	fld	FAST_FLOAT [COL(2,edx,SIZEOF_FAST_FLOAT)]
 	fadd	FAST_FLOAT [COL(5,edx,SIZEOF_FAST_FLOAT)]
 	; -- Even part
 	fld	st2	; st2 = st2 + st1, st1 = st2 - st1
 	fsub	st0,st2
 	fxch	st0,st2
 	faddp	st3,st0
 	fld	st3	; st3 = st3 + st0, st0 = st3 - st0
 	fsub	st0,st1
 	fxch	st0,st1
 	faddp	st4,st0
 	fadd	st0,st1
 	fmul	ROTATOR_TYPE [GOTOFF(ebx,F_0_707)]
 	fld	st2	; st3 = st2 + st3, st2 = st2 - st3
 	fsub	st0,st4
 	fxch	st0,st3
 	faddp	st4,st0
 	fld	st1	; st0 = st1 + st0, st1 = st1 - st0
 	fsub	st0,st1
 	fxch	st0,st2
 	faddp	st1,st0
 	fld	FAST_FLOAT [COL(0,edx,SIZEOF_FAST_FLOAT)]
 	fsub	FAST_FLOAT [COL(7,edx,SIZEOF_FAST_FLOAT)]
 	fxch	st0,st4
 	fld	FAST_FLOAT [COL(3,edx,SIZEOF_FAST_FLOAT)]
 	fsub	FAST_FLOAT [COL(4,edx,SIZEOF_FAST_FLOAT)]
 	fxch	st0,st4
 	fld	FAST_FLOAT [COL(1,edx,SIZEOF_FAST_FLOAT)]
 	fsub	FAST_FLOAT [COL(6,edx,SIZEOF_FAST_FLOAT)]
 	fxch	st0,st4
 	fld	FAST_FLOAT [COL(2,edx,SIZEOF_FAST_FLOAT)]
 	fsub	FAST_FLOAT [COL(5,edx,SIZEOF_FAST_FLOAT)]
 	fxch	st0,st4
 	fstp	FAST_FLOAT [COL(2,edx,SIZEOF_FAST_FLOAT)]
 	fstp	FAST_FLOAT [COL(6,edx,SIZEOF_FAST_FLOAT)]
 	fstp	FAST_FLOAT [COL(4,edx,SIZEOF_FAST_FLOAT)]
 	fstp	FAST_FLOAT [COL(0,edx,SIZEOF_FAST_FLOAT)]
 	; -- Odd part
 	fadd	st2,st0
 	fadd	st0,st1
 	fxch	st0,st3
 	fadd	st1,st0
 	fxch	st0,st3
 	fld	st2
 	fxch	st0,st1
 	fmul	ROTATOR_TYPE [GOTOFF(ebx,F_0_707)]
 	fxch	st0,st1
 	fsub	st0,st2
 	fxch	st0,st3
 	fmul	ROTATOR_TYPE [GOTOFF(ebx,F_0_541)]
 	fxch	st0,st3
 	fmul	ROTATOR_TYPE [GOTOFF(ebx,F_0_382)]
 	fxch	st0,st2
 	fmul	ROTATOR_TYPE [GOTOFF(ebx,F_1_306)]
 	fxch	st0,st2
 	fadd	st3,st0
 	faddp	st2,st0
 	fld	st3	; st3 = st3 + st0, st0 = st3 - st0
 	fsub	st0,st1
 	fxch	st0,st1
 	faddp	st4,st0
 	fld	st2	; st0 = st0 + st2, st2 = st0 - st2
 	fsubr	st0,st1
 	fxch	st0,st3
 	faddp	st1,st0
 	fld	st1	; st3 = st3 + st1, st1 = st3 - st1
 	fsubr	st0,st4
 	fxch	st0,st2
 	faddp	st4,st0
 	fstp	FAST_FLOAT [COL(5,edx,SIZEOF_FAST_FLOAT)]
 	fstp	FAST_FLOAT [COL(7,edx,SIZEOF_FAST_FLOAT)]
 	fstp	FAST_FLOAT [COL(3,edx,SIZEOF_FAST_FLOAT)]
 	fstp	FAST_FLOAT [COL(1,edx,SIZEOF_FAST_FLOAT)]
 	add	edx, byte SIZEOF_FAST_FLOAT ; advance pointer to next column
 	dec	ecx
 	jnz	near .columnloop
 ;	pop	edi		; unused
 ;	pop	esi		; unused
 ;	pop	edx		; need not be preserved
 ;	pop	ecx		; need not be preserved
 	poppic	ebx
 	pop	ebp
 	ret
 %endif ; DCT_FLOAT_SUPPORTED
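The x87 stack shuffling in jfdctflt.asm is hard to follow without the arithmetic it implements. The sketch below restates one 1-D pass in C, following the tmp/z naming used in the register comments of the assembly files and in jfdctflt.c, and applies it first to rows and then to columns. It is an illustrative reference rather than the library's code: fdct_float_1d, fdct_float_2d and the stride parameter are names chosen here, and plain float stands in for FAST_FLOAT.

```c
#include <assert.h>
#include <stddef.h>

#define DCTSIZE 8

/* One 1-D pass over 8 samples spaced `stride` floats apart. */
static void fdct_float_1d(float *d, size_t stride)
{
  float tmp0 = d[0 * stride] + d[7 * stride];
  float tmp7 = d[0 * stride] - d[7 * stride];
  float tmp1 = d[1 * stride] + d[6 * stride];
  float tmp6 = d[1 * stride] - d[6 * stride];
  float tmp2 = d[2 * stride] + d[5 * stride];
  float tmp5 = d[2 * stride] - d[5 * stride];
  float tmp3 = d[3 * stride] + d[4 * stride];
  float tmp4 = d[3 * stride] - d[4 * stride];

  /* Even part */
  float tmp10 = tmp0 + tmp3, tmp13 = tmp0 - tmp3;
  float tmp11 = tmp1 + tmp2, tmp12 = tmp1 - tmp2;
  float z1 = (tmp12 + tmp13) * 0.707106781f;
  d[0 * stride] = tmp10 + tmp11;
  d[4 * stride] = tmp10 - tmp11;
  d[2 * stride] = tmp13 + z1;
  d[6 * stride] = tmp13 - z1;

  /* Odd part */
  tmp10 = tmp4 + tmp5;
  tmp11 = tmp5 + tmp6;
  tmp12 = tmp6 + tmp7;
  float z5 = (tmp10 - tmp12) * 0.382683433f;
  float z2 = 0.541196100f * tmp10 + z5;
  float z4 = 1.306562965f * tmp12 + z5;
  float z3 = tmp11 * 0.707106781f;
  float z11 = tmp7 + z3, z13 = tmp7 - z3;
  d[5 * stride] = z13 + z2;
  d[3 * stride] = z13 - z2;
  d[1 * stride] = z11 + z4;
  d[7 * stride] = z11 - z4;
}

/* Pass 1 walks rows (stride 1); pass 2 walks columns (stride DCTSIZE). */
static void fdct_float_2d(float *block)
{
  int i;
  assert(block != NULL);
  for (i = 0; i < DCTSIZE; i++)
    fdct_float_1d(block + i * DCTSIZE, 1);
  for (i = 0; i < DCTSIZE; i++)
    fdct_float_1d(block + i, DCTSIZE);
}
```

A block filled with 1.0 comes out with 64.0 in the DC position and zeros elsewhere, which shows that the outputs are scaled relative to a true DCT; the remaining scale factors are expected to be absorbed by the quantization step, as jfdctflt.c describes.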
--- a/jfdctfst.asm
+++ b/jfdctfst.asm
@@ -0,0 +1,303 @@
 ;
 ; jfdctfst.asm - fast integer FDCT (non-SIMD)
 ;
 ; x86 SIMD extension for IJG JPEG library
 ; Copyright (C) 1999-2006, MIYASAKA Masaru.
 ; For conditions of distribution and use, see copyright notice in jsimdext.inc
 ;
 ; This file should be assembled with NASM (Netwide Assembler); it
 ; can *not* be assembled with Microsoft's MASM or any compatible
 ; assembler (including Borland's Turbo Assembler).
 ; NASM is available from http://nasm.sourceforge.net/ or
 ; http://sourceforge.net/project/showfiles.php?group_id=6208
 ;
 ; This file contains a fast, not so accurate integer implementation of
 ; the forward DCT (Discrete Cosine Transform). The following code is based
 ; directly on the IJG's original jfdctfst.c; see jfdctfst.c for
 ; more details.
 ;
 ; Last Modified : October 17, 2004
 ;
 ; [TAB8]
 %include "jsimdext.inc"
 %include "jdct.inc"
 %ifdef DCT_IFAST_SUPPORTED
 ; This module is specialized to the case DCTSIZE = 8.
 ;
 %if DCTSIZE != 8
 %error "Sorry, this code only copes with 8x8 DCTs."
 %endif
 ; --------------------------------------------------------------------------
 ; We can gain a little more speed, with a further compromise in accuracy,
 ; by omitting the addition in a descaling shift.  This yields an
 ; incorrectly rounded result half the time...
 ;
 %macro	descale 2
 %ifdef USE_ACCURATE_ROUNDING
 %if (%2)<=7
 	add	%1, byte (1<<((%2)-1))	; add reg32,imm8
 %else
 	add	%1, (1<<((%2)-1))	; add reg32,imm32
 %endif
 %endif
 	sar	%1,%2
 %endmacro
 ; --------------------------------------------------------------------------
 %define CONST_BITS	8
 %if CONST_BITS == 8
 F_0_382	equ	 98		; FIX(0.382683433)
 F_0_541	equ	139		; FIX(0.541196100)
 F_0_707	equ	181		; FIX(0.707106781)
 F_1_306	equ	334		; FIX(1.306562965)
 %else
 ; NASM cannot do compile-time arithmetic on floating-point constants.
 %define DESCALE(x,n)  (((x)+(1<<((n)-1)))>>(n))
 F_0_382	equ	DESCALE( 410903207,30-CONST_BITS)	; FIX(0.382683433)
 F_0_541	equ	DESCALE( 581104887,30-CONST_BITS)	; FIX(0.541196100)
 F_0_707	equ	DESCALE( 759250124,30-CONST_BITS)	; FIX(0.707106781)
 F_1_306	equ	DESCALE(1402911301,30-CONST_BITS)	; FIX(1.306562965)
 %endif
 ; --------------------------------------------------------------------------
 	SECTION	SEG_TEXT
 	BITS	32
 ;
 ; Perform the forward DCT on one block of samples.
 ;
 ; GLOBAL(void)
 ; jpeg_fdct_ifast (DCTELEM * data)
 ;
 %define data(b)	(b)+8		; DCTELEM * data
 	align	16
 	global	EXTN(jpeg_fdct_ifast)
 EXTN(jpeg_fdct_ifast):
 	push	ebp
 	mov	ebp,esp
 	push	ebx
 ;	push	ecx		; need not be preserved
 ;	push	edx		; need not be preserved
 	push	esi
 	push	edi
 	; ---- Pass 1: process rows.
 	mov	ecx, DCTSIZE
 	mov	edx, POINTER [data(ebp)]	; (DCTELEM *)
 	alignx	16,7
 .rowloop:
 	push	ecx		; ctr
 	push	edx		; dataptr
 	movsx	eax, DCTELEM [ROW(0,edx,SIZEOF_DCTELEM)]
 	movsx	edi, DCTELEM [ROW(7,edx,SIZEOF_DCTELEM)]
 	lea	esi,[eax+edi]	; esi=tmp0
 	sub	eax,edi		; eax=tmp7
 	push	eax
 	movsx	ebx, DCTELEM [ROW(1,edx,SIZEOF_DCTELEM)]
 	movsx	ecx, DCTELEM [ROW(6,edx,SIZEOF_DCTELEM)]
 	lea	edi,[ebx+ecx]	; edi=tmp1
 	sub	ebx,ecx		; ebx=tmp6
 	push	ebx
 	movsx	eax, DCTELEM [ROW(2,edx,SIZEOF_DCTELEM)]
 	movsx	ecx, DCTELEM [ROW(5,edx,SIZEOF_DCTELEM)]
 	lea	ebx,[eax+ecx]	; ebx=tmp2
 	sub	eax,ecx		; eax=tmp5
 	push	eax
 	movsx	ecx, DCTELEM [ROW(3,edx,SIZEOF_DCTELEM)]
 	movsx	eax, DCTELEM [ROW(4,edx,SIZEOF_DCTELEM)]
 	lea	edx,[ecx+eax]	; edx=tmp3
 	sub	ecx,eax		; ecx=tmp4
 	push	ecx
 	; -- Even part
 	lea	eax,[esi+edx]	; eax=tmp10
 	lea	ecx,[edi+ebx]	; ecx=tmp11
 	sub	esi,edx		; esi=tmp13
 	sub	edi,ebx		; edi=tmp12
 	mov	edx, POINTER [esp+16]	; dataptr
 	add	edi,esi
 	imul	edi,(F_0_707)	; edi=z1
 	descale	edi,CONST_BITS
 	lea	ebx,[eax+ecx]	; ebx=data0
 	sub	eax,ecx		; eax=data4
 	mov	DCTELEM [ROW(0,edx,SIZEOF_DCTELEM)], bx
 	mov	DCTELEM [ROW(4,edx,SIZEOF_DCTELEM)], ax
 	lea	ecx,[esi+edi]	; ecx=data2
 	sub	esi,edi		; esi=data6
 	mov	DCTELEM [ROW(2,edx,SIZEOF_DCTELEM)], cx
 	mov	DCTELEM [ROW(6,edx,SIZEOF_DCTELEM)], si
 	; -- Odd part
 	pop	eax	; eax=tmp4
 	pop	edx	; edx=tmp5
 	pop	ebx	; ebx=tmp6
 	pop	edi	; edi=tmp7
 	add	eax,edx		; eax=tmp10
 	add	edx,ebx		; edx=tmp11
 	add	ebx,edi		; ebx=tmp12, edi=tmp7
 	imul	edx,(F_0_707)	; edx=z3
 	descale	edx,CONST_BITS
 	lea	esi,[edi+edx]	; esi=z11
 	sub	edi,edx		; edi=z13
 	mov	ecx,eax		; ecx=tmp10
 	sub	eax,ebx
 	imul	eax,(F_0_382)	; eax=z5
 	imul	ecx,(F_0_541)	; ecx=MULTIPLY(tmp10,FIX_0_541196100)
 	imul	ebx,(F_1_306)	; ebx=MULTIPLY(tmp12,FIX_1_306562965)
 	descale	eax,CONST_BITS
 	descale	ecx,CONST_BITS
 	descale	ebx,CONST_BITS
 	add	ecx,eax		; ecx=z2
 	add	ebx,eax		; ebx=z4
 	pop	edx		; dataptr
 	lea	eax,[edi+ecx]	; eax=data5
 	sub	edi,ecx		; edi=data3
 	mov	DCTELEM [ROW(5,edx,SIZEOF_DCTELEM)], ax
 	mov	DCTELEM [ROW(3,edx,SIZEOF_DCTELEM)], di
 	lea	ecx,[esi+ebx]	; ecx=data1
 	sub	esi,ebx		; esi=data7
 	mov	DCTELEM [ROW(1,edx,SIZEOF_DCTELEM)], cx
 	mov	DCTELEM [ROW(7,edx,SIZEOF_DCTELEM)], si
 	pop	ecx		; ctr
 	add	edx, byte DCTSIZE*SIZEOF_DCTELEM	; advance pointer to next row
 	dec	ecx
 	jnz	near .rowloop
 	; ---- Pass 2: process columns.
 	mov	ecx, DCTSIZE
 	mov	edx, POINTER [data(ebp)]	; (DCTELEM *)
 	alignx	16,7
 .columnloop:
 	push	ecx		; ctr
 	push	edx		; dataptr
 	movsx	eax, DCTELEM [COL(0,edx,SIZEOF_DCTELEM)]
 	movsx	edi, DCTELEM [COL(7,edx,SIZEOF_DCTELEM)]
 	lea	esi,[eax+edi]	; esi=tmp0
 	sub	eax,edi		; eax=tmp7
 	push	eax
 	movsx	ebx, DCTELEM [COL(1,edx,SIZEOF_DCTELEM)]
 	movsx	ecx, DCTELEM [COL(6,edx,SIZEOF_DCTELEM)]
 	lea	edi,[ebx+ecx]	; edi=tmp1
 	sub	ebx,ecx		; ebx=tmp6
 	push	ebx
 	movsx	eax, DCTELEM [COL(2,edx,SIZEOF_DCTELEM)]
 	movsx	ecx, DCTELEM [COL(5,edx,SIZEOF_DCTELEM)]
 	lea	ebx,[eax+ecx]	; ebx=tmp2
 	sub	eax,ecx		; eax=tmp5
 	push	eax
 	movsx	ecx, DCTELEM [COL(3,edx,SIZEOF_DCTELEM)]
 	movsx	eax, DCTELEM [COL(4,edx,SIZEOF_DCTELEM)]
 	lea	edx,[ecx+eax]	; edx=tmp3
 	sub	ecx,eax		; ecx=tmp4
 	push	ecx
 	; -- Even part
 	lea	eax,[esi+edx]	; eax=tmp10
 	lea	ecx,[edi+ebx]	; ecx=tmp11
 	sub	esi,edx		; esi=tmp13
 	sub	edi,ebx		; edi=tmp12
 	mov	edx, POINTER [esp+16]	; dataptr
 	add	edi,esi
 	imul	edi,(F_0_707)	; edi=z1
 	descale	edi,CONST_BITS
 	lea	ebx,[eax+ecx]	; ebx=data0
 	sub	eax,ecx		; eax=data4
 	mov	DCTELEM [COL(0,edx,SIZEOF_DCTELEM)], bx
 	mov	DCTELEM [COL(4,edx,SIZEOF_DCTELEM)], ax
 	lea	ecx,[esi+edi]	; ecx=data2
 	sub	esi,edi		; esi=data6
 	mov	DCTELEM [COL(2,edx,SIZEOF_DCTELEM)], cx
 	mov	DCTELEM [COL(6,edx,SIZEOF_DCTELEM)], si
 	; -- Odd part
 	pop	eax	; eax=tmp4
 	pop	edx	; edx=tmp5
 	pop	ebx	; ebx=tmp6
 	pop	edi	; edi=tmp7
 	add	eax,edx		; eax=tmp10
 	add	edx,ebx		; edx=tmp11
 	add	ebx,edi		; ebx=tmp12, edi=tmp7
 	imul	edx,(F_0_707)	; edx=z3
 	descale	edx,CONST_BITS
 	lea	esi,[edi+edx]	; esi=z11
 	sub	edi,edx		; edi=z13
 	mov	ecx,eax		; ecx=tmp10
 	sub	eax,ebx
 	imul	eax,(F_0_382)	; eax=z5
 	imul	ecx,(F_0_541)	; ecx=MULTIPLY(tmp10,FIX_0_541196100)
 	imul	ebx,(F_1_306)	; ebx=MULTIPLY(tmp12,FIX_1_306562965)
 	descale	eax,CONST_BITS
 	descale	ecx,CONST_BITS
 	descale	ebx,CONST_BITS
 	add	ecx,eax		; ecx=z2
 	add	ebx,eax		; ebx=z4
 	pop	edx		; dataptr
 	lea	eax,[edi+ecx]	; eax=data5
 	sub	edi,ecx		; edi=data3
 	mov	DCTELEM [COL(5,edx,SIZEOF_DCTELEM)], ax
 	mov	DCTELEM [COL(3,edx,SIZEOF_DCTELEM)], di
 	lea	ecx,[esi+ebx]	; ecx=data1
 	sub	esi,ebx		; esi=data7
 	mov	DCTELEM [COL(1,edx,SIZEOF_DCTELEM)], cx
 	mov	DCTELEM [COL(7,edx,SIZEOF_DCTELEM)], si
 	pop	ecx		; ctr
 	add	edx, byte SIZEOF_DCTELEM    ; advance pointer to next column
 	dec	ecx
 	jnz	near .columnloop
 	pop	edi
 	pop	esi
 ;	pop	edx		; need not be preserved
 ;	pop	ecx		; need not be preserved
 	pop	ebx
 	pop	ebp
 	ret
 %endif ; DCT_IFAST_SUPPORTED
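jfdctfst.asm hard-codes its constants for CONST_BITS == 8, derives them from 30-bit values otherwise, and only rounds in the descale macro when USE_ACCURATE_ROUNDING is defined. The C sketch below checks that arithmetic. It is a hedged illustration: FIX and DESCALE mirror the names in the assembly comments, while descale_rounded, descale_truncated and multiply are hypothetical helpers. It also assumes that right-shifting a negative int is an arithmetic shift, as sar is, which C leaves implementation-defined.

```c
#include <assert.h>

#define CONST_BITS 8

/* FIX(x): x scaled by 2^CONST_BITS, rounded to nearest (x >= 0). */
#define FIX(x) ((int) ((x) * (1 << CONST_BITS) + 0.5))

/* Drop n fractional bits from a fixed-point constant, with rounding;
 * same formula as the DESCALE macro in the %else branch. */
#define DESCALE(x, n) (((x) + (1L << ((n) - 1))) >> (n))

/* descale with USE_ACCURATE_ROUNDING: add half, then shift. */
static int descale_rounded(int value, int bits)
{
  assert(bits > 0 && bits < 31);
  return (value + (1 << (bits - 1))) >> bits;
}

/* descale without it: shift only, so the result rounds downward. */
static int descale_truncated(int value, int bits)
{
  assert(bits > 0 && bits < 31);
  return value >> bits;
}

/* imul followed by descale, as in the row and column loops. */
static int multiply(int value, int constant, int accurate)
{
  int product = value * constant;
  return accurate ? descale_rounded(product, CONST_BITS)
                  : descale_truncated(product, CONST_BITS);
}
```

For example, multiply(100, FIX(0.707106781), 1) yields 71 while the truncating variant yields 70; the exact product is about 70.7, so this is an input for which omitting the addition gives the incorrectly rounded result that the macro's comment warns about.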
--- a/jfdctint.asm
+++ b/jfdctint.asm
@@ -0,0 +1,342 @@
 ;
 ; jfdctint.asm - accurate integer FDCT (non-SIMD)
 ;
 ; x86 SIMD extension for IJG JPEG library
 ; Copyright (C) 1999-2006, MIYASAKA Masaru.
 ; For conditions of distribution and use, see copyright notice in jsimdext.inc
 ;
 ; This file should be assembled with NASM (Netwide Assembler); it
 ; can *not* be assembled with Microsoft's MASM or any compatible
 ; assembler (including Borland's Turbo Assembler).
 ; NASM is available from http://nasm.sourceforge.net/ or
 ; http://sourceforge.net/project/showfiles.php?group_id=6208
 ;
 ; This file contains a slow-but-accurate integer implementation of the
 ; forward DCT (Discrete Cosine Transform). The following code is based
 ; directly on the IJG's original jfdctint.c; see jfdctint.c for
 ; more details.
 ;
 ; Last Modified : October 17, 2004
 ;
 ; [TAB8]
 %include "jsimdext.inc"
 %include "jdct.inc"
 %ifdef DCT_ISLOW_SUPPORTED
 ; This module is specialized to the case DCTSIZE = 8.
 ;
 %if DCTSIZE != 8
 %error "Sorry, this code only copes with 8x8 DCTs."
 %endif
 ; --------------------------------------------------------------------------
 ; Descale and correctly round a DWORD value that's scaled by N bits.
 ;
 %macro	descale 2
 %if (%2)<=7
 	add	%1, byte (1<<((%2)-1))	; add reg32,imm8
 %else
 	add	%1, (1<<((%2)-1))	; add reg32,imm32
 %endif
 	sar	%1,%2
 %endmacro
 ; --------------------------------------------------------------------------
 %define CONST_BITS	13
 %define PASS1_BITS	2
 %if CONST_BITS == 13
 F_0_298	equ	 2446		; FIX(0.298631336)
 F_0_390	equ	 3196		; FIX(0.390180644)
 F_0_541	equ	 4433		; FIX(0.541196100)
 F_0_765	equ	 6270		; FIX(0.765366865)
 F_0_899	equ	 7373		; FIX(0.899976223)
 F_1_175	equ	 9633		; FIX(1.175875602)
 F_1_501	equ	12299		; FIX(1.501321110)
 F_1_847	equ	15137		; FIX(1.847759065)
 F_1_961	equ	16069		; FIX(1.961570560)
 F_2_053	equ	16819		; FIX(2.053119869)
 F_2_562	equ	20995		; FIX(2.562915447)
 F_3_072	equ	25172		; FIX(3.072711026)
 %else
 ; NASM cannot do compile-time arithmetic on floating-point constants.
 %define DESCALE(x,n)  (((x)+(1<<((n)-1)))>>(n))
 F_0_298	equ	DESCALE( 320652955,30-CONST_BITS)	; FIX(0.298631336)
 F_0_390	equ	DESCALE( 418953276,30-CONST_BITS)	; FIX(0.390180644)
 F_0_541	equ	DESCALE( 581104887,30-CONST_BITS)	; FIX(0.541196100)
 F_0_765	equ	DESCALE( 821806413,30-CONST_BITS)	; FIX(0.765366865)
 F_0_899	equ	DESCALE( 966342111,30-CONST_BITS)	; FIX(0.899976223)
 F_1_175	equ	DESCALE(1262586813,30-CONST_BITS)	; FIX(1.175875602)
 F_1_501	equ	DESCALE(1612031267,30-CONST_BITS)	; FIX(1.501321110)
 F_1_847	equ	DESCALE(1984016188,30-CONST_BITS)	; FIX(1.847759065)
 F_1_961	equ	DESCALE(2106220350,30-CONST_BITS)	; FIX(1.961570560)
 F_2_053	equ	DESCALE(2204520673,30-CONST_BITS)	; FIX(2.053119869)
 F_2_562	equ	DESCALE(2751909506,30-CONST_BITS)	; FIX(2.562915447)
 F_3_072	equ	DESCALE(3299298341,30-CONST_BITS)	; FIX(3.072711026)
 %endif
 ; --------------------------------------------------------------------------
 	SECTION	SEG_TEXT
 	BITS	32
 ;
 ; Perform the forward DCT on one block of samples.
 ;
 ; GLOBAL(void)
 ; jpeg_fdct_islow (DCTELEM * data)
 ;
 %define data(b)	(b)+8		; DCTELEM * data
 	align	16
 	global	EXTN(jpeg_fdct_islow)
 EXTN(jpeg_fdct_islow):
 	push	ebp
 	mov	ebp,esp
 	push	ebx
 ;	push	ecx		; need not be preserved
 ;	push	edx		; need not be preserved
 	push	esi
 	push	edi
 	; ---- Pass 1: process rows.
 	mov	edx, POINTER [data(ebp)]	; (DCTELEM *)
 	mov	ecx, DCTSIZE
 	alignx	16,7
 .rowloop:
 	movsx	eax, DCTELEM [ROW(0,edx,SIZEOF_DCTELEM)]
 	movsx	edi, DCTELEM [ROW(7,edx,SIZEOF_DCTELEM)]
 	lea	esi,[eax+edi]	; esi=tmp0
 	sub	eax,edi		; eax=tmp7
 	push	ecx		; ctr
 	push	eax
 	movsx	ebx, DCTELEM [ROW(1,edx,SIZEOF_DCTELEM)]
 	movsx	ecx, DCTELEM [ROW(6,edx,SIZEOF_DCTELEM)]
 	lea	edi,[ebx+ecx]	; edi=tmp1
 	sub	ebx,ecx		; ebx=tmp6
 	push	ebx
 	movsx	eax, DCTELEM [ROW(2,edx,SIZEOF_DCTELEM)]
 	movsx	ecx, DCTELEM [ROW(5,edx,SIZEOF_DCTELEM)]
 	lea	ebx,[eax+ecx]	; ebx=tmp2
 	sub	eax,ecx		; eax=tmp5
 	push	edx		; dataptr
 	push	eax
 	movsx	ecx, DCTELEM [ROW(3,edx,SIZEOF_DCTELEM)]
 	movsx	eax, DCTELEM [ROW(4,edx,SIZEOF_DCTELEM)]
 	lea	edx,[ecx+eax]	; edx=tmp3
 	sub	ecx,eax		; ecx=tmp4
 	push	ecx
 	; -- Even part
 	lea	eax,[esi+edx]	; eax=tmp10
 	lea	ecx,[edi+ebx]	; ecx=tmp11
 	sub	esi,edx		; esi=tmp13
 	sub	edi,ebx		; edi=tmp12
 	lea	ebx,[eax+ecx]	; ebx=data0
 	sub	eax,ecx		; eax=data4
 	mov	edx, POINTER [esp+8]	; dataptr
 	sal	ebx, PASS1_BITS
 	sal	eax, PASS1_BITS
 	mov	DCTELEM [ROW(0,edx,SIZEOF_DCTELEM)], bx
 	mov	DCTELEM [ROW(4,edx,SIZEOF_DCTELEM)], ax
 	lea	ecx,[edi+esi]
 	imul	ecx,(F_0_541)	; ecx=z1
 	imul	esi,(F_0_765)	; esi=MULTIPLY(tmp13,FIX_0_765366865)
 	imul	edi,(-F_1_847)	; edi=MULTIPLY(tmp12,-FIX_1_847759065)
 	add	esi,ecx		; esi=data2
 	add	edi,ecx		; edi=data6
 	descale	esi,(CONST_BITS-PASS1_BITS)
 	descale	edi,(CONST_BITS-PASS1_BITS)
 	mov	DCTELEM [ROW(2,edx,SIZEOF_DCTELEM)], si
 	mov	DCTELEM [ROW(6,edx,SIZEOF_DCTELEM)], di
 	; -- Odd part
 	mov	eax, INT32 [esp]	; eax=tmp4
 	mov	ebx, INT32 [esp+4]	; ebx=tmp5
 	mov	ecx, INT32 [esp+12]	; ecx=tmp6
 	mov	esi, INT32 [esp+16]	; esi=tmp7
 	lea	edx,[eax+ecx]	; edx=z3
 	lea	edi,[ebx+esi]	; edi=z4
 	add	eax,esi		; eax=z1
 	add	ebx,ecx		; ebx=z2
 	lea	esi,[edx+edi]
 	imul	esi,(F_1_175)	; esi=z5
 	imul	edx,(-F_1_961)	; edx=z3(=MULTIPLY(z3,-FIX_1_961570560))
 	imul	edi,(-F_0_390)	; edi=z4(=MULTIPLY(z4,-FIX_0_390180644))
 	imul	eax,(-F_0_899)	; eax=z1(=MULTIPLY(z1,-FIX_0_899976223))
 	imul	ebx,(-F_2_562)	; ebx=z2(=MULTIPLY(z2,-FIX_2_562915447))
 	add	edx,esi		; edx=z3(=z3+z5)
 	add	edi,esi		; edi=z4(=z4+z5)
 	lea	ecx,[eax+edx]	; ecx=z1+z3
 	lea	esi,[ebx+edi]	; esi=z2+z4
 	add	eax,edi		; eax=z1+z4
 	add	ebx,edx		; ebx=z2+z3
 	pop	edx		; edx=tmp4
 	pop	edi		; edi=tmp5
 	imul	edx,(F_0_298)	; edx=tmp4(=MULTIPLY(tmp4,FIX_0_298631336))
 	imul	edi,(F_2_053)	; edi=tmp5(=MULTIPLY(tmp5,FIX_2_053119869))
 	add	ecx,edx		; ecx=data7(=tmp4+z1+z3)
 	add	esi,edi		; esi=data5(=tmp5+z2+z4)
 	pop	edx		; dataptr
 	descale	ecx,(CONST_BITS-PASS1_BITS)
 	descale	esi,(CONST_BITS-PASS1_BITS)
 	mov	DCTELEM [ROW(7,edx,SIZEOF_DCTELEM)], cx
 	mov	DCTELEM [ROW(5,edx,SIZEOF_DCTELEM)], si
 	pop	edi		; edi=tmp6
 	pop	ecx		; ecx=tmp7
 	imul	edi,(F_3_072)	; edi=tmp6(=MULTIPLY(tmp6,FIX_3_072711026))
 	imul	ecx,(F_1_501)	; ecx=tmp7(=MULTIPLY(tmp7,FIX_1_501321110))
 	add	ebx,edi		; ebx=data3(=tmp6+z2+z3)
 	add	eax,ecx		; eax=data1(=tmp7+z1+z4)
 	pop	ecx		; ctr
 	descale	ebx,(CONST_BITS-PASS1_BITS)
 	descale	eax,(CONST_BITS-PASS1_BITS)
 	mov	DCTELEM [ROW(3,edx,SIZEOF_DCTELEM)], bx
 	mov	DCTELEM [ROW(1,edx,SIZEOF_DCTELEM)], ax
 	add	edx, byte DCTSIZE*SIZEOF_DCTELEM	; advance pointer to next row
 	dec	ecx
 	jnz	near .rowloop
 	; ---- Pass 2: process columns.
 	mov	edx, POINTER [data(ebp)]	; (DCTELEM *)
 	mov	ecx, DCTSIZE
 	alignx	16,7
 .columnloop:
 	movsx	eax, DCTELEM [COL(0,edx,SIZEOF_DCTELEM)]
 	movsx	edi, DCTELEM [COL(7,edx,SIZEOF_DCTELEM)]
 	lea	esi,[eax+edi]	; esi=tmp0
 	sub	eax,edi		; eax=tmp7
 	push	ecx		; ctr
 	push	eax
 	movsx	ebx, DCTELEM [COL(1,edx,SIZEOF_DCTELEM)]
 	movsx	ecx, DCTELEM [COL(6,edx,SIZEOF_DCTELEM)]
 	lea	edi,[ebx+ecx]	; edi=tmp1
 	sub	ebx,ecx		; ebx=tmp6
 	push	ebx
 	movsx	eax, DCTELEM [COL(2,edx,SIZEOF_DCTELEM)]
 	movsx	ecx, DCTELEM [COL(5,edx,SIZEOF_DCTELEM)]
 	lea	ebx,[eax+ecx]	; ebx=tmp2
 	sub	eax,ecx		; eax=tmp5
 	push	edx		; dataptr
 	push	eax
 	movsx	ecx, DCTELEM [COL(3,edx,SIZEOF_DCTELEM)]
 	movsx	eax, DCTELEM [COL(4,edx,SIZEOF_DCTELEM)]
 	lea	edx,[ecx+eax]	; edx=tmp3
 	sub	ecx,eax		; ecx=tmp4
 	push	ecx
 	; -- Even part
 	lea	eax,[esi+edx]	; eax=tmp10
 	lea	ecx,[edi+ebx]	; ecx=tmp11
 	sub	esi,edx		; esi=tmp13
 	sub	edi,ebx		; edi=tmp12
 	lea	ebx,[eax+ecx]	; ebx=data0
 	sub	eax,ecx		; eax=data4
 	mov	edx, POINTER [esp+8]	; dataptr
 	descale	ebx, PASS1_BITS
 	descale	eax, PASS1_BITS
 	mov	DCTELEM [COL(0,edx,SIZEOF_DCTELEM)], bx
 	mov	DCTELEM [COL(4,edx,SIZEOF_DCTELEM)], ax
 	lea	ecx,[edi+esi]
 	imul	ecx,(F_0_541)	; ecx=z1
 	imul	esi,(F_0_765)	; esi=MULTIPLY(tmp13,FIX_0_765366865)
 	imul	edi,(-F_1_847)	; edi=MULTIPLY(tmp12,-FIX_1_847759065)
 	add	esi,ecx		; esi=data2
 	add	edi,ecx		; edi=data6
 	descale	esi,(CONST_BITS+PASS1_BITS)
 	descale	edi,(CONST_BITS+PASS1_BITS)
 	mov	DCTELEM [COL(2,edx,SIZEOF_DCTELEM)], si
 	mov	DCTELEM [COL(6,edx,SIZEOF_DCTELEM)], di
 	; -- Odd part
 	mov	eax, INT32 [esp]	; eax=tmp4
 	mov	ebx, INT32 [esp+4]	; ebx=tmp5
 	mov	ecx, INT32 [esp+12]	; ecx=tmp6
 	mov	esi, INT32 [esp+16]	; esi=tmp7
 	lea	edx,[eax+ecx]	; edx=z3
 	lea	edi,[ebx+esi]	; edi=z4
 	add	eax,esi		; eax=z1
 	add	ebx,ecx		; ebx=z2
 	lea	esi,[edx+edi]
 	imul	esi,(F_1_175)	; esi=z5
 	imul	edx,(-F_1_961)	; edx=z3(=MULTIPLY(z3,-FIX_1_961570560))
 	imul	edi,(-F_0_390)	; edi=z4(=MULTIPLY(z4,-FIX_0_390180644))
 	imul	eax,(-F_0_899)	; eax=z1(=MULTIPLY(z1,-FIX_0_899976223))
 	imul	ebx,(-F_2_562)	; ebx=z2(=MULTIPLY(z2,-FIX_2_562915447))
 	add	edx,esi		; edx=z3(=z3+z5)
 	add	edi,esi		; edi=z4(=z4+z5)
 	lea	ecx,[eax+edx]	; ecx=z1+z3
 	lea	esi,[ebx+edi]	; esi=z2+z4
 	add	eax,edi		; eax=z1+z4
 	add	ebx,edx		; ebx=z2+z3
 	pop	edx		; edx=tmp4
 	pop	edi		; edi=tmp5
 	imul	edx,(F_0_298)	; edx=tmp4(=MULTIPLY(tmp4,FIX_0_298631336))
 	imul	edi,(F_2_053)	; edi=tmp5(=MULTIPLY(tmp5,FIX_2_053119869))
 	add	ecx,edx		; ecx=data7(=tmp4+z1+z3)
 	add	esi,edi		; esi=data5(=tmp5+z2+z4)
 	pop	edx		; dataptr
 	descale	ecx,(CONST_BITS+PASS1_BITS)
 	descale	esi,(CONST_BITS+PASS1_BITS)
 	mov	DCTELEM [COL(7,edx,SIZEOF_DCTELEM)], cx
 	mov	DCTELEM [COL(5,edx,SIZEOF_DCTELEM)], si
 	pop	edi		; edi=tmp6
 	pop	ecx		; ecx=tmp7
 	imul	edi,(F_3_072)	; edi=tmp6(=MULTIPLY(tmp6,FIX_3_072711026))
 	imul	ecx,(F_1_501)	; ecx=tmp7(=MULTIPLY(tmp7,FIX_1_501321110))
 	add	ebx,edi		; ebx=data3(=tmp6+z2+z3)
 	add	eax,ecx		; eax=data1(=tmp7+z1+z4)
 	pop	ecx		; ctr
 	descale	ebx,(CONST_BITS+PASS1_BITS)
 	descale	eax,(CONST_BITS+PASS1_BITS)
 	mov	DCTELEM [COL(3,edx,SIZEOF_DCTELEM)], bx
 	mov	DCTELEM [COL(1,edx,SIZEOF_DCTELEM)], ax
 	add	edx, byte SIZEOF_DCTELEM    ; advance pointer to next column
 	dec	ecx
 	jnz	near .columnloop
 	pop	edi
 	pop	esi
 ;	pop	edx		; need not be preserved
 ;	pop	ecx		; need not be preserved
 	pop	ebx
 	pop	ebp
 	ret
 %endif ; DCT_ISLOW_SUPPORTED
--- a/jfmmxfst.asm
+++ b/jfmmxfst.asm
@@ -0,0 +1,404 @@
 ;
 ; jfmmxfst.asm - fast integer FDCT (MMX)
 ;
 ; x86 SIMD extension for IJG JPEG library
 ; Copyright (C) 1999-2006, MIYASAKA Masaru.
 ; For conditions of distribution and use, see copyright notice in jsimdext.inc
 ;
 ; This file should be assembled with NASM (Netwide Assembler); it
 ; can *not* be assembled with Microsoft's MASM or any compatible
 ; assembler (including Borland's Turbo Assembler).
 ; NASM is available from http://nasm.sourceforge.net/ or
 ; http://sourceforge.net/project/showfiles.php?group_id=6208
 ;
 ; This file contains a fast, not so accurate integer implementation of
 ; the forward DCT (Discrete Cosine Transform). The following code is
 ; based directly on the IJG's original jfdctfst.c; see jfdctfst.c
 ; for more details.
 ;
 ; Last Modified : February 4, 2006
 ;
 ; [TAB8]
 %include "jsimdext.inc"
 %include "jdct.inc"
 %ifdef DCT_IFAST_SUPPORTED
 %ifdef JFDCT_INT_MMX_SUPPORTED
 ; This module is specialized to the case DCTSIZE = 8.
 ;
 %if DCTSIZE != 8
 %error "Sorry, this code only copes with 8x8 DCTs."
 %endif
 ; --------------------------------------------------------------------------
 %define CONST_BITS	8	; 14 is also OK.
 %if CONST_BITS == 8
 F_0_382	equ	 98		; FIX(0.382683433)
 F_0_541	equ	139		; FIX(0.541196100)
 F_0_707	equ	181		; FIX(0.707106781)
 F_1_306	equ	334		; FIX(1.306562965)
 %else
 ; NASM cannot do compile-time arithmetic on floating-point constants.
 %define DESCALE(x,n)  (((x)+(1<<((n)-1)))>>(n))
 F_0_382	equ	DESCALE( 410903207,30-CONST_BITS)	; FIX(0.382683433)
 F_0_541	equ	DESCALE( 581104887,30-CONST_BITS)	; FIX(0.541196100)
 F_0_707	equ	DESCALE( 759250124,30-CONST_BITS)	; FIX(0.707106781)
 F_1_306	equ	DESCALE(1402911301,30-CONST_BITS)	; FIX(1.306562965)
 %endif
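 ;
 ; Worked example (editor's sketch, not part of the original source):
 ; the large literals in the %else branch appear to be the constants
 ; scaled by 2^30, so DESCALE(x,30-CONST_BITS) rounds them down to
 ; CONST_BITS fractional bits.  Assuming CONST_BITS = 8:
 ;   DESCALE(759250124,22) = (759250124 + (1<<21)) >> 22
 ;                         = 761347276 >> 22
 ;                         = 181
 ; which agrees with the hard-coded F_0_707 = 181 above.
 ;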
 ; --------------------------------------------------------------------------
 	SECTION	SEG_CONST
 ; PRE_MULTIPLY_SCALE_BITS <= 2 (to avoid overflow)
 ; CONST_BITS + CONST_SHIFT + PRE_MULTIPLY_SCALE_BITS == 16 (for pmulhw)
 %define PRE_MULTIPLY_SCALE_BITS   2
 %define CONST_SHIFT     (16 - PRE_MULTIPLY_SCALE_BITS - CONST_BITS)
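 ;
 ; Worked example (editor's sketch, not part of the original source):
 ; with CONST_BITS = 8 and PRE_MULTIPLY_SCALE_BITS = 2,
 ;   CONST_SHIFT = 16 - 2 - 8 = 6
 ;   PW_F0707    = 181 << 6 = 11584	(fits in a signed 16-bit word)
 ; For a 16-bit sample x, "psllw x,PRE_MULTIPLY_SCALE_BITS" followed by
 ; pmulhw keeps the high word of the 32-bit product:
 ;   ((x << 2) * (181 << 6)) >> 16  ==  (x * 181) >> 8
 ; provided (x << 2) does not overflow 16 bits.  For x = 100:
 ;   (400 * 11584) >> 16 = 4633600 >> 16 = 70
 ;   (100 * 181) >> 8    = 18100 >> 8    = 70
 ;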
 	alignz	16
 	global	EXTN(jconst_fdct_ifast_mmx)
 EXTN(jconst_fdct_ifast_mmx):
 PW_F0707	times 4 dw  F_0_707 << CONST_SHIFT
 PW_F0382	times 4 dw  F_0_382 << CONST_SHIFT
 PW_F0541	times 4 dw  F_0_541 << CONST_SHIFT
 PW_F1306	times 4 dw  F_1_306 << CONST_SHIFT
 	alignz	16
 ; --------------------------------------------------------------------------
 	SECTION	SEG_TEXT
 	BITS	32
 ;
 ; Perform the forward DCT on one block of samples.
 ;
 ; GLOBAL(void)
 ; jpeg_fdct_ifast_mmx (DCTELEM * data)
 ;
 %define data(b)		(b)+8		; DCTELEM * data
 %define original_ebp	ebp+0
 %define wk(i)		ebp-(WK_NUM-(i))*SIZEOF_MMWORD	; mmword wk[WK_NUM]
 %define WK_NUM		2
 	align	16
 	global	EXTN(jpeg_fdct_ifast_mmx)
 EXTN(jpeg_fdct_ifast_mmx):
 	push	ebp
 	mov	eax,esp				; eax = original ebp
 	sub	esp, byte 4
 	and	esp, byte (-SIZEOF_MMWORD)	; align to 64 bits
 	mov	[esp],eax
 	mov	ebp,esp				; ebp = aligned ebp
 	lea	esp, [wk(0)]
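 	;
 	; Resulting frame (editor's sketch, not part of the original source;
 	; assumes SIZEOF_MMWORD = 8, as the "align to 64 bits" note implies,
 	; and WK_NUM = 2):
 	;   [eax+8]  = data argument (eax still addresses the caller's frame)
 	;   [eax+4]  = return address
 	;   [eax+0]  = caller's ebp (pushed above)
 	;   [ebp+0]  = saved eax, reloaded by "pop esp" in the epilogue
 	;   [ebp-8]  = wk(1)
 	;   [ebp-16] = wk(0)	<- esp after "lea esp,[wk(0)]", 8-byte aligned
 	;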
 	pushpic	ebx
 ;	push	ecx		; need not be preserved
 ;	push	edx		; need not be preserved
 ;	push	esi		; unused
 ;	push	edi		; unused
 	get_GOT	ebx		; get GOT address
 	; ---- Pass 1: process rows.
 	mov	edx, POINTER [data(eax)]	; (DCTELEM *)
 	mov	ecx, DCTSIZE/4
 	alignx	16,7
 .rowloop:
 	movq	mm0, MMWORD [MMBLOCK(2,0,edx,SIZEOF_DCTELEM)]
 	movq	mm1, MMWORD [MMBLOCK(3,0,edx,SIZEOF_DCTELEM)]
 	movq	mm2, MMWORD [MMBLOCK(2,1,edx,SIZEOF_DCTELEM)]
 	movq	mm3, MMWORD [MMBLOCK(3,1,edx,SIZEOF_DCTELEM)]
 	; mm0=(20 21 22 23), mm2=(24 25 26 27)
 	; mm1=(30 31 32 33), mm3=(34 35 36 37)
 	movq      mm4,mm0		; transpose coefficients(phase 1)
 	punpcklwd mm0,mm1		; mm0=(20 30 21 31)
 	punpckhwd mm4,mm1		; mm4=(22 32 23 33)
 	movq      mm5,mm2		; transpose coefficients(phase 1)
 	punpcklwd mm2,mm3		; mm2=(24 34 25 35)
 	punpckhwd mm5,mm3		; mm5=(26 36 27 37)
 	movq	mm6, MMWORD [MMBLOCK(0,0,edx,SIZEOF_DCTELEM)]
 	movq	mm7, MMWORD [MMBLOCK(1,0,edx,SIZEOF_DCTELEM)]
 	movq	mm1, MMWORD [MMBLOCK(0,1,edx,SIZEOF_DCTELEM)]
 	movq	mm3, MMWORD [MMBLOCK(1,1,edx,SIZEOF_DCTELEM)]
 	; mm6=(00 01 02 03), mm1=(04 05 06 07)
 	; mm7=(10 11 12 13), mm3=(14 15 16 17)
 	movq	MMWORD [wk(0)], mm4	; wk(0)=(22 32 23 33)
 	movq	MMWORD [wk(1)], mm2	; wk(1)=(24 34 25 35)
 	movq      mm4,mm6		; transpose coefficients(phase 1)
 	punpcklwd mm6,mm7		; mm6=(00 10 01 11)
 	punpckhwd mm4,mm7		; mm4=(02 12 03 13)
 	movq      mm2,mm1		; transpose coefficients(phase 1)
 	punpcklwd mm1,mm3		; mm1=(04 14 05 15)
 	punpckhwd mm2,mm3		; mm2=(06 16 07 17)
 	movq      mm7,mm6		; transpose coefficients(phase 2)
 	punpckldq mm6,mm0		; mm6=(00 10 20 30)=data0
 	punpckhdq mm7,mm0		; mm7=(01 11 21 31)=data1
 	movq      mm3,mm2		; transpose coefficients(phase 2)
 	punpckldq mm2,mm5		; mm2=(06 16 26 36)=data6
 	punpckhdq mm3,mm5		; mm3=(07 17 27 37)=data7
 	movq	mm0,mm7
 	movq	mm5,mm6
 	psubw	mm7,mm2			; mm7=data1-data6=tmp6
 	psubw	mm6,mm3			; mm6=data0-data7=tmp7
 	paddw	mm0,mm2			; mm0=data1+data6=tmp1
 	paddw	mm5,mm3			; mm5=data0+data7=tmp0
 	movq	mm2, MMWORD [wk(0)]	; mm2=(22 32 23 33)
 	movq	mm3, MMWORD [wk(1)]	; mm3=(24 34 25 35)
 	movq	MMWORD [wk(0)], mm7	; wk(0)=tmp6
 	movq	MMWORD [wk(1)], mm6	; wk(1)=tmp7
 	movq      mm7,mm4		; transpose coefficients(phase 2)
 	punpckldq mm4,mm2		; mm4=(02 12 22 32)=data2
 	punpckhdq mm7,mm2		; mm7=(03 13 23 33)=data3
 	movq      mm6,mm1		; transpose coefficients(phase 2)
 	punpckldq mm1,mm3		; mm1=(04 14 24 34)=data4
 	punpckhdq mm6,mm3		; mm6=(05 15 25 35)=data5
 	movq	mm2,mm7
 	movq	mm3,mm4
 	paddw	mm7,mm1			; mm7=data3+data4=tmp3
 	paddw	mm4,mm6			; mm4=data2+data5=tmp2
 	psubw	mm2,mm1			; mm2=data3-data4=tmp4
 	psubw	mm3,mm6			; mm3=data2-data5=tmp5
 	; -- Even part
 	movq	mm1,mm5
 	movq	mm6,mm0
 	psubw	mm5,mm7			; mm5=tmp13
 	psubw	mm0,mm4			; mm0=tmp12
 	paddw	mm1,mm7			; mm1=tmp10
 	paddw	mm6,mm4			; mm6=tmp11
 	paddw	mm0,mm5
 	psllw	mm0,PRE_MULTIPLY_SCALE_BITS
 	pmulhw	mm0,[GOTOFF(ebx,PW_F0707)] ; mm0=z1
 	movq	mm7,mm1
 	movq	mm4,mm5
 	psubw	mm1,mm6			; mm1=data4
 	psubw	mm5,mm0			; mm5=data6
 	paddw	mm7,mm6			; mm7=data0
 	paddw	mm4,mm0			; mm4=data2
 	movq	MMWORD [MMBLOCK(0,1,edx,SIZEOF_DCTELEM)], mm1
 	movq	MMWORD [MMBLOCK(2,1,edx,SIZEOF_DCTELEM)], mm5
 	movq	MMWORD [MMBLOCK(0,0,edx,SIZEOF_DCTELEM)], mm7
 	movq	MMWORD [MMBLOCK(2,0,edx,SIZEOF_DCTELEM)], mm4
 	; -- Odd part
 	movq	mm6, MMWORD [wk(0)]	; mm6=tmp6
 	movq	mm0, MMWORD [wk(1)]	; mm0=tmp7
 	paddw	mm2,mm3			; mm2=tmp10
 	paddw	mm3,mm6			; mm3=tmp11
 	paddw	mm6,mm0			; mm6=tmp12, mm0=tmp7
 	psllw	mm2,PRE_MULTIPLY_SCALE_BITS
 	psllw	mm6,PRE_MULTIPLY_SCALE_BITS
 	psllw	mm3,PRE_MULTIPLY_SCALE_BITS
 	pmulhw	mm3,[GOTOFF(ebx,PW_F0707)] ; mm3=z3
 	movq	mm1,mm2			; mm1=tmp10
 	psubw	mm2,mm6
 	pmulhw	mm2,[GOTOFF(ebx,PW_F0382)] ; mm2=z5
 	pmulhw	mm1,[GOTOFF(ebx,PW_F0541)] ; mm1=MULTIPLY(tmp10,FIX_0_54119610)
 	pmulhw	mm6,[GOTOFF(ebx,PW_F1306)] ; mm6=MULTIPLY(tmp12,FIX_1_30656296)
 	paddw	mm1,mm2			; mm1=z2
 	paddw	mm6,mm2			; mm6=z4
 	movq	mm5,mm0
 	psubw	mm0,mm3			; mm0=z13
 	paddw	mm5,mm3			; mm5=z11
 	movq	mm7,mm0
 	movq	mm4,mm5
 	psubw	mm0,mm1			; mm0=data3
 	psubw	mm5,mm6			; mm5=data7
 	paddw	mm7,mm1			; mm7=data5
 	paddw	mm4,mm6			; mm4=data1
 	movq	MMWORD [MMBLOCK(3,0,edx,SIZEOF_DCTELEM)], mm0
 	movq	MMWORD [MMBLOCK(3,1,edx,SIZEOF_DCTELEM)], mm5
 	movq	MMWORD [MMBLOCK(1,1,edx,SIZEOF_DCTELEM)], mm7
 	movq	MMWORD [MMBLOCK(1,0,edx,SIZEOF_DCTELEM)], mm4
 	add	edx, byte 4*DCTSIZE*SIZEOF_DCTELEM
 	dec	ecx
 	jnz	near .rowloop
 	; ---- Pass 2: process columns.
 	mov	edx, POINTER [data(eax)]	; (DCTELEM *)
 	mov	ecx, DCTSIZE/4
 	alignx	16,7
 .columnloop:
 	movq	mm0, MMWORD [MMBLOCK(2,0,edx,SIZEOF_DCTELEM)]
 	movq	mm1, MMWORD [MMBLOCK(3,0,edx,SIZEOF_DCTELEM)]
 	movq	mm2, MMWORD [MMBLOCK(6,0,edx,SIZEOF_DCTELEM)]
 	movq	mm3, MMWORD [MMBLOCK(7,0,edx,SIZEOF_DCTELEM)]
 	; mm0=(02 12 22 32), mm2=(42 52 62 72)
 	; mm1=(03 13 23 33), mm3=(43 53 63 73)
 	movq      mm4,mm0		; transpose coefficients(phase 1)
 	punpcklwd mm0,mm1		; mm0=(02 03 12 13)
 	punpckhwd mm4,mm1		; mm4=(22 23 32 33)
 	movq      mm5,mm2		; transpose coefficients(phase 1)
 	punpcklwd mm2,mm3		; mm2=(42 43 52 53)
 	punpckhwd mm5,mm3		; mm5=(62 63 72 73)
 	movq	mm6, MMWORD [MMBLOCK(0,0,edx,SIZEOF_DCTELEM)]
 	movq	mm7, MMWORD [MMBLOCK(1,0,edx,SIZEOF_DCTELEM)]
 	movq	mm1, MMWORD [MMBLOCK(4,0,edx,SIZEOF_DCTELEM)]
 	movq	mm3, MMWORD [MMBLOCK(5,0,edx,SIZEOF_DCTELEM)]
 	; mm6=(00 10 20 30), mm1=(40 50 60 70)
 	; mm7=(01 11 21 31), mm3=(41 51 61 71)
 	movq	MMWORD [wk(0)], mm4	; wk(0)=(22 23 32 33)
 	movq	MMWORD [wk(1)], mm2	; wk(1)=(42 43 52 53)
 	movq      mm4,mm6		; transpose coefficients(phase 1)
 	punpcklwd mm6,mm7		; mm6=(00 01 10 11)
 	punpckhwd mm4,mm7		; mm4=(20 21 30 31)
 	movq      mm2,mm1		; transpose coefficients(phase 1)
 	punpcklwd mm1,mm3		; mm1=(40 41 50 51)
 	punpckhwd mm2,mm3		; mm2=(60 61 70 71)
 	movq      mm7,mm6		; transpose coefficients(phase 2)
 	punpckldq mm6,mm0		; mm6=(00 01 02 03)=data0
 	punpckhdq mm7,mm0		; mm7=(10 11 12 13)=data1
 	movq      mm3,mm2		; transpose coefficients(phase 2)
 	punpckldq mm2,mm5		; mm2=(60 61 62 63)=data6
 	punpckhdq mm3,mm5		; mm3=(70 71 72 73)=data7
 	movq	mm0,mm7
 	movq	mm5,mm6
 	psubw	mm7,mm2			; mm7=data1-data6=tmp6
 	psubw	mm6,mm3			; mm6=data0-data7=tmp7
 	paddw	mm0,mm2			; mm0=data1+data6=tmp1
 	paddw	mm5,mm3			; mm5=data0+data7=tmp0
 	movq	mm2, MMWORD [wk(0)]	; mm2=(22 23 32 33)
 	movq	mm3, MMWORD [wk(1)]	; mm3=(42 43 52 53)
 	movq	MMWORD [wk(0)], mm7	; wk(0)=tmp6
 	movq	MMWORD [wk(1)], mm6	; wk(1)=tmp7
 	movq      mm7,mm4		; transpose coefficients(phase 2)
 	punpckldq mm4,mm2		; mm4=(20 21 22 23)=data2
 	punpckhdq mm7,mm2		; mm7=(30 31 32 33)=data3
 	movq      mm6,mm1		; transpose coefficients(phase 2)
 	punpckldq mm1,mm3		; mm1=(40 41 42 43)=data4
 	punpckhdq mm6,mm3		; mm6=(50 51 52 53)=data5
 	movq	mm2,mm7
 	movq	mm3,mm4
 	paddw	mm7,mm1			; mm7=data3+data4=tmp3
 	paddw	mm4,mm6			; mm4=data2+data5=tmp2
 	psubw	mm2,mm1			; mm2=data3-data4=tmp4
 	psubw	mm3,mm6			; mm3=data2-data5=tmp5
 	; -- Even part
 	movq	mm1,mm5
 	movq	mm6,mm0
 	psubw	mm5,mm7			; mm5=tmp13
 	psubw	mm0,mm4			; mm0=tmp12
 	paddw	mm1,mm7			; mm1=tmp10
 	paddw	mm6,mm4			; mm6=tmp11
 	paddw	mm0,mm5
 	psllw	mm0,PRE_MULTIPLY_SCALE_BITS
 	pmulhw	mm0,[GOTOFF(ebx,PW_F0707)] ; mm0=z1
 	movq	mm7,mm1
 	movq	mm4,mm5
 	psubw	mm1,mm6			; mm1=data4
 	psubw	mm5,mm0			; mm5=data6
 	paddw	mm7,mm6			; mm7=data0
 	paddw	mm4,mm0			; mm4=data2
 	movq	MMWORD [MMBLOCK(4,0,edx,SIZEOF_DCTELEM)], mm1
 	movq	MMWORD [MMBLOCK(6,0,edx,SIZEOF_DCTELEM)], mm5
 	movq	MMWORD [MMBLOCK(0,0,edx,SIZEOF_DCTELEM)], mm7
 	movq	MMWORD [MMBLOCK(2,0,edx,SIZEOF_DCTELEM)], mm4
 	; -- Odd part
 	movq	mm6, MMWORD [wk(0)]	; mm6=tmp6
 	movq	mm0, MMWORD [wk(1)]	; mm0=tmp7
 	paddw	mm2,mm3			; mm2=tmp10
 	paddw	mm3,mm6			; mm3=tmp11
 	paddw	mm6,mm0			; mm6=tmp12, mm0=tmp7
 	psllw	mm2,PRE_MULTIPLY_SCALE_BITS
 	psllw	mm6,PRE_MULTIPLY_SCALE_BITS
 	psllw	mm3,PRE_MULTIPLY_SCALE_BITS
 	pmulhw	mm3,[GOTOFF(ebx,PW_F0707)] ; mm3=z3
 	movq	mm1,mm2			; mm1=tmp10
 	psubw	mm2,mm6
 	pmulhw	mm2,[GOTOFF(ebx,PW_F0382)] ; mm2=z5
 	pmulhw	mm1,[GOTOFF(ebx,PW_F0541)] ; mm1=MULTIPLY(tmp10,FIX_0_54119610)
 	pmulhw	mm6,[GOTOFF(ebx,PW_F1306)] ; mm6=MULTIPLY(tmp12,FIX_1_30656296)
 	paddw	mm1,mm2			; mm1=z2
 	paddw	mm6,mm2			; mm6=z4
 	movq	mm5,mm0
 	psubw	mm0,mm3			; mm0=z13
 	paddw	mm5,mm3			; mm5=z11
 	movq	mm7,mm0
 	movq	mm4,mm5
 	psubw	mm0,mm1			; mm0=data3
 	psubw	mm5,mm6			; mm5=data7
 	paddw	mm7,mm1			; mm7=data5
 	paddw	mm4,mm6			; mm4=data1
 	movq	MMWORD [MMBLOCK(3,0,edx,SIZEOF_DCTELEM)], mm0
 	movq	MMWORD [MMBLOCK(7,0,edx,SIZEOF_DCTELEM)], mm5
 	movq	MMWORD [MMBLOCK(5,0,edx,SIZEOF_DCTELEM)], mm7
 	movq	MMWORD [MMBLOCK(1,0,edx,SIZEOF_DCTELEM)], mm4
 	add	edx, byte 4*SIZEOF_DCTELEM
 	dec	ecx
 	jnz	near .columnloop
 	emms		; empty MMX state
 ;	pop	edi		; unused
 ;	pop	esi		; unused
 ;	pop	edx		; need not be preserved
 ;	pop	ecx		; need not be preserved
 	poppic	ebx
 	mov	esp,ebp		; esp <- aligned ebp
 	pop	esp		; esp <- original ebp
 	pop	ebp
 	ret
 %endif ; JFDCT_INT_MMX_SUPPORTED
 %endif ; DCT_IFAST_SUPPORTED
--- a/jfmmxint.asm
+++ b/jfmmxint.asm
@@ -0,0 +1,629 @@
 ;
 ; jfmmxint.asm - accurate integer FDCT (MMX)
 ;
 ; x86 SIMD extension for IJG JPEG library
 ; Copyright (C) 1999-2006, MIYASAKA Masaru.
 ; For conditions of distribution and use, see copyright notice in jsimdext.inc
 ;
 ; This file should be assembled with NASM (Netwide Assembler); it
 ; can *not* be assembled with Microsoft's MASM or any compatible
 ; assembler (including Borland's Turbo Assembler).
 ; NASM is available from http://nasm.sourceforge.net/ or
 ; http://sourceforge.net/project/showfiles.php?group_id=6208
 ;
 ; This file contains a slow-but-accurate integer implementation of the
 ; forward DCT (Discrete Cosine Transform). The following code is based
 ; directly on the IJG's original jfdctint.c; see jfdctint.c for
 ; more details.
 ;
 ; Last Modified : February 4, 2006
 ;
 ; [TAB8]
 %include "jsimdext.inc"
 %include "jdct.inc"
 %ifdef DCT_ISLOW_SUPPORTED
 %ifdef JFDCT_INT_MMX_SUPPORTED
 ; This module is specialized to the case DCTSIZE = 8.
 ;
 %if DCTSIZE != 8
 %error "Sorry, this code only copes with 8x8 DCTs."
 %endif
 ; --------------------------------------------------------------------------
 %define CONST_BITS	13
 %define PASS1_BITS	2
 %define DESCALE_P1	(CONST_BITS-PASS1_BITS)
 %define DESCALE_P2	(CONST_BITS+PASS1_BITS)
 %if CONST_BITS == 13
 F_0_298	equ	 2446		; FIX(0.298631336)
 F_0_390	equ	 3196		; FIX(0.390180644)
 F_0_541	equ	 4433		; FIX(0.541196100)
 F_0_765	equ	 6270		; FIX(0.765366865)
 F_0_899	equ	 7373		; FIX(0.899976223)
 F_1_175	equ	 9633		; FIX(1.175875602)
 F_1_501	equ	12299		; FIX(1.501321110)
 F_1_847	equ	15137		; FIX(1.847759065)
 F_1_961	equ	16069		; FIX(1.961570560)
 F_2_053	equ	16819		; FIX(2.053119869)
 F_2_562	equ	20995		; FIX(2.562915447)
 F_3_072	equ	25172		; FIX(3.072711026)
 %else
 ; NASM cannot do compile-time arithmetic on floating-point constants.
 %define DESCALE(x,n)  (((x)+(1<<((n)-1)))>>(n))
 F_0_298	equ	DESCALE( 320652955,30-CONST_BITS)	; FIX(0.298631336)
 F_0_390	equ	DESCALE( 418953276,30-CONST_BITS)	; FIX(0.390180644)
 F_0_541	equ	DESCALE( 581104887,30-CONST_BITS)	; FIX(0.541196100)
 F_0_765	equ	DESCALE( 821806413,30-CONST_BITS)	; FIX(0.765366865)
 F_0_899	equ	DESCALE( 966342111,30-CONST_BITS)	; FIX(0.899976223)
 F_1_175	equ	DESCALE(1262586813,30-CONST_BITS)	; FIX(1.175875602)
 F_1_501	equ	DESCALE(1612031267,30-CONST_BITS)	; FIX(1.501321110)
 F_1_847	equ	DESCALE(1984016188,30-CONST_BITS)	; FIX(1.847759065)
 F_1_961	equ	DESCALE(2106220350,30-CONST_BITS)	; FIX(1.961570560)
 F_2_053	equ	DESCALE(2204520673,30-CONST_BITS)	; FIX(2.053119869)
 F_2_562	equ	DESCALE(2751909506,30-CONST_BITS)	; FIX(2.562915447)
 F_3_072	equ	DESCALE(3299298341,30-CONST_BITS)	; FIX(3.072711026)
 %endif
 ; --------------------------------------------------------------------------
 	SECTION	SEG_CONST
 	alignz	16
 	global	EXTN(jconst_fdct_islow_mmx)
 EXTN(jconst_fdct_islow_mmx):
 PW_F130_F054	times 2 dw  (F_0_541+F_0_765), F_0_541
 PW_F054_MF130	times 2 dw  F_0_541, (F_0_541-F_1_847)
 PW_MF078_F117	times 2 dw  (F_1_175-F_1_961), F_1_175
 PW_F117_F078	times 2 dw  F_1_175, (F_1_175-F_0_390)
 PW_MF060_MF089	times 2 dw  (F_0_298-F_0_899),-F_0_899
 PW_MF089_F060	times 2 dw -F_0_899, (F_1_501-F_0_899)
 PW_MF050_MF256	times 2 dw  (F_2_053-F_2_562),-F_2_562
 PW_MF256_F050	times 2 dw -F_2_562, (F_3_072-F_2_562)
 PD_DESCALE_P1	times 2 dd  1 << (DESCALE_P1-1)
 PD_DESCALE_P2	times 2 dd  1 << (DESCALE_P2-1)
 PW_DESCALE_P2X	times 4 dw  1 << (PASS1_BITS-1)
 	alignz	16
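 ;
 ; For reference (editor's sketch, not part of the original source):
 ; with CONST_BITS = 13 and PASS1_BITS = 2 the packed constants above
 ; evaluate to
 ;   PW_F130_F054   =  10703,   4433	PW_F054_MF130 =   4433, -10704
 ;   PW_MF078_F117  =  -6436,   9633	PW_F117_F078  =   9633,   6437
 ;   PW_MF060_MF089 =  -4927,  -7373	PW_MF089_F060 =  -7373,   4926
 ;   PW_MF050_MF256 =  -4176, -20995	PW_MF256_F050 = -20995,   4177
 ;   PD_DESCALE_P1  = 1 << 10 = 1024	PD_DESCALE_P2 = 1 << 14 = 16384
 ;   PW_DESCALE_P2X = 1 << 1  = 2
 ; Every word fits in a signed 16-bit lane, which pmaddwd requires.
 ;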
 ; --------------------------------------------------------------------------
 	SECTION	SEG_TEXT
 	BITS	32
 ;
 ; Perform the forward DCT on one block of samples.
 ;
 ; GLOBAL(void)
 ; jpeg_fdct_islow_mmx (DCTELEM * data)
 ;
 %define data(b)		(b)+8		; DCTELEM * data
 %define original_ebp	ebp+0
 %define wk(i)		ebp-(WK_NUM-(i))*SIZEOF_MMWORD	; mmword wk[WK_NUM]
 %define WK_NUM		2
 	align	16
 	global	EXTN(jpeg_fdct_islow_mmx)
 EXTN(jpeg_fdct_islow_mmx):
 	push	ebp
 	mov	eax,esp				; eax = original ebp
 	sub	esp, byte 4
 	and	esp, byte (-SIZEOF_MMWORD)	; align to 64 bits
 	mov	[esp],eax
 	mov	ebp,esp				; ebp = aligned ebp
 	lea	esp, [wk(0)]
 	pushpic	ebx
 ;	push	ecx		; need not be preserved
 ;	push	edx		; need not be preserved
 ;	push	esi		; unused
 ;	push	edi		; unused
 	get_GOT	ebx		; get GOT address
 	; ---- Pass 1: process rows.
 	mov	edx, POINTER [data(eax)]	; (DCTELEM *)
 	mov	ecx, DCTSIZE/4
 	alignx	16,7
 .rowloop:
 	movq	mm0, MMWORD [MMBLOCK(2,0,edx,SIZEOF_DCTELEM)]
 	movq	mm1, MMWORD [MMBLOCK(3,0,edx,SIZEOF_DCTELEM)]
 	movq	mm2, MMWORD [MMBLOCK(2,1,edx,SIZEOF_DCTELEM)]
 	movq	mm3, MMWORD [MMBLOCK(3,1,edx,SIZEOF_DCTELEM)]
 	; mm0=(20 21 22 23), mm2=(24 25 26 27)
 	; mm1=(30 31 32 33), mm3=(34 35 36 37)
 	movq      mm4,mm0		; transpose coefficients(phase 1)
 	punpcklwd mm0,mm1		; mm0=(20 30 21 31)
 	punpckhwd mm4,mm1		; mm4=(22 32 23 33)
 	movq      mm5,mm2		; transpose coefficients(phase 1)
 	punpcklwd mm2,mm3		; mm2=(24 34 25 35)
 	punpckhwd mm5,mm3		; mm5=(26 36 27 37)
 	movq	mm6, MMWORD [MMBLOCK(0,0,edx,SIZEOF_DCTELEM)]
 	movq	mm7, MMWORD [MMBLOCK(1,0,edx,SIZEOF_DCTELEM)]
 	movq	mm1, MMWORD [MMBLOCK(0,1,edx,SIZEOF_DCTELEM)]
 	movq	mm3, MMWORD [MMBLOCK(1,1,edx,SIZEOF_DCTELEM)]
 	; mm6=(00 01 02 03), mm1=(04 05 06 07)
 	; mm7=(10 11 12 13), mm3=(14 15 16 17)
 	movq	MMWORD [wk(0)], mm4	; wk(0)=(22 32 23 33)
 	movq	MMWORD [wk(1)], mm2	; wk(1)=(24 34 25 35)
 	movq      mm4,mm6		; transpose coefficients(phase 1)
 	punpcklwd mm6,mm7		; mm6=(00 10 01 11)
 	punpckhwd mm4,mm7		; mm4=(02 12 03 13)
 	movq      mm2,mm1		; transpose coefficients(phase 1)
 	punpcklwd mm1,mm3		; mm1=(04 14 05 15)
 	punpckhwd mm2,mm3		; mm2=(06 16 07 17)
 	movq      mm7,mm6		; transpose coefficients(phase 2)
 	punpckldq mm6,mm0		; mm6=(00 10 20 30)=data0
 	punpckhdq mm7,mm0		; mm7=(01 11 21 31)=data1
 	movq      mm3,mm2		; transpose coefficients(phase 2)
 	punpckldq mm2,mm5		; mm2=(06 16 26 36)=data6
 	punpckhdq mm3,mm5		; mm3=(07 17 27 37)=data7
 	movq	mm0,mm7
 	movq	mm5,mm6
 	psubw	mm7,mm2			; mm7=data1-data6=tmp6
 	psubw	mm6,mm3			; mm6=data0-data7=tmp7
 	paddw	mm0,mm2			; mm0=data1+data6=tmp1
 	paddw	mm5,mm3			; mm5=data0+data7=tmp0
 	movq	mm2, MMWORD [wk(0)]	; mm2=(22 32 23 33)
 	movq	mm3, MMWORD [wk(1)]	; mm3=(24 34 25 35)
 	movq	MMWORD [wk(0)], mm7	; wk(0)=tmp6
 	movq	MMWORD [wk(1)], mm6	; wk(1)=tmp7
 	movq      mm7,mm4		; transpose coefficients(phase 2)
 	punpckldq mm4,mm2		; mm4=(02 12 22 32)=data2
 	punpckhdq mm7,mm2		; mm7=(03 13 23 33)=data3
 	movq      mm6,mm1		; transpose coefficients(phase 2)
 	punpckldq mm1,mm3		; mm1=(04 14 24 34)=data4
 	punpckhdq mm6,mm3		; mm6=(05 15 25 35)=data5
 	movq	mm2,mm7
 	movq	mm3,mm4
 	paddw	mm7,mm1			; mm7=data3+data4=tmp3
 	paddw	mm4,mm6			; mm4=data2+data5=tmp2
 	psubw	mm2,mm1			; mm2=data3-data4=tmp4
 	psubw	mm3,mm6			; mm3=data2-data5=tmp5
 	; -- Even part
 	movq	mm1,mm5
 	movq	mm6,mm0
 	paddw	mm5,mm7			; mm5=tmp10
 	paddw	mm0,mm4			; mm0=tmp11
 	psubw	mm1,mm7			; mm1=tmp13
 	psubw	mm6,mm4			; mm6=tmp12
 	movq	mm7,mm5
 	paddw	mm5,mm0			; mm5=tmp10+tmp11
 	psubw	mm7,mm0			; mm7=tmp10-tmp11
 	psllw	mm5,PASS1_BITS		; mm5=data0
 	psllw	mm7,PASS1_BITS		; mm7=data4
 	movq	MMWORD [MMBLOCK(0,0,edx,SIZEOF_DCTELEM)], mm5
 	movq	MMWORD [MMBLOCK(0,1,edx,SIZEOF_DCTELEM)], mm7
 	; (Original)
 	; z1 = (tmp12 + tmp13) * 0.541196100;
 	; data2 = z1 + tmp13 * 0.765366865;
 	; data6 = z1 + tmp12 * -1.847759065;
 	;
 	; (This implementation)
 	; data2 = tmp13 * (0.541196100 + 0.765366865) + tmp12 * 0.541196100;
 	; data6 = tmp13 * 0.541196100 + tmp12 * (0.541196100 - 1.847759065);
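 	;
 	; (Editor's sketch of the lane arithmetic, not part of the original)
 	; punpcklwd/punpckhwd interleave the words as (tmp13 tmp12 tmp13 tmp12);
 	; pmaddwd then multiplies each word pair by a constant pair and adds
 	; the two products into one dword:
 	;   data2 = tmp13 * (F_0_541+F_0_765) + tmp12 * F_0_541
 	;   data6 = tmp13 * F_0_541 + tmp12 * (F_0_541-F_1_847)
 	; The 32-bit sums are rounded with PD_DESCALE_P1, shifted right by
 	; DESCALE_P1, and narrowed back to words by packssdw.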
 	movq      mm4,mm1		; mm1=tmp13
 	movq      mm0,mm1
 	punpcklwd mm4,mm6		; mm6=tmp12
 	punpckhwd mm0,mm6
 	movq      mm1,mm4
 	movq      mm6,mm0
 	pmaddwd   mm4,[GOTOFF(ebx,PW_F130_F054)]	; mm4=data2L
 	pmaddwd   mm0,[GOTOFF(ebx,PW_F130_F054)]	; mm0=data2H
 	pmaddwd   mm1,[GOTOFF(ebx,PW_F054_MF130)]	; mm1=data6L
 	pmaddwd   mm6,[GOTOFF(ebx,PW_F054_MF130)]	; mm6=data6H
 	paddd	mm4,[GOTOFF(ebx,PD_DESCALE_P1)]
 	paddd	mm0,[GOTOFF(ebx,PD_DESCALE_P1)]
 	psrad	mm4,DESCALE_P1
 	psrad	mm0,DESCALE_P1
 	paddd	mm1,[GOTOFF(ebx,PD_DESCALE_P1)]
 	paddd	mm6,[GOTOFF(ebx,PD_DESCALE_P1)]
 	psrad	mm1,DESCALE_P1
 	psrad	mm6,DESCALE_P1
 	packssdw  mm4,mm0		; mm4=data2
 	packssdw  mm1,mm6		; mm1=data6
 	movq	MMWORD [MMBLOCK(2,0,edx,SIZEOF_DCTELEM)], mm4
 	movq	MMWORD [MMBLOCK(2,1,edx,SIZEOF_DCTELEM)], mm1
 	; -- Odd part
 	movq	mm5, MMWORD [wk(0)]	; mm5=tmp6
 	movq	mm7, MMWORD [wk(1)]	; mm7=tmp7
 	movq	mm0,mm2			; mm2=tmp4
 	movq	mm6,mm3			; mm3=tmp5
 	paddw	mm0,mm5			; mm0=z3
 	paddw	mm6,mm7			; mm6=z4
 	; (Original)
 	; z5 = (z3 + z4) * 1.175875602;
 	; z3 = z3 * -1.961570560;  z4 = z4 * -0.390180644;
 	; z3 += z5;  z4 += z5;
 	;
 	; (This implementation)
 	; z3 = z3 * (1.175875602 - 1.961570560) + z4 * 1.175875602;
 	; z4 = z3 * 1.175875602 + z4 * (1.175875602 - 0.390180644);
 	movq      mm4,mm0
 	movq      mm1,mm0
 	punpcklwd mm4,mm6
 	punpckhwd mm1,mm6
 	movq      mm0,mm4
 	movq      mm6,mm1
 	pmaddwd   mm4,[GOTOFF(ebx,PW_MF078_F117)]	; mm4=z3L
 	pmaddwd   mm1,[GOTOFF(ebx,PW_MF078_F117)]	; mm1=z3H
 	pmaddwd   mm0,[GOTOFF(ebx,PW_F117_F078)]	; mm0=z4L
 	pmaddwd   mm6,[GOTOFF(ebx,PW_F117_F078)]	; mm6=z4H
 	movq	MMWORD [wk(0)], mm4	; wk(0)=z3L
 	movq	MMWORD [wk(1)], mm1	; wk(1)=z3H
 	; (Original)
 	; z1 = tmp4 + tmp7;  z2 = tmp5 + tmp6;
 	; tmp4 = tmp4 * 0.298631336;  tmp5 = tmp5 * 2.053119869;
 	; tmp6 = tmp6 * 3.072711026;  tmp7 = tmp7 * 1.501321110;
 	; z1 = z1 * -0.899976223;  z2 = z2 * -2.562915447;
 	; data7 = tmp4 + z1 + z3;  data5 = tmp5 + z2 + z4;
 	; data3 = tmp6 + z2 + z3;  data1 = tmp7 + z1 + z4;
 	;
 	; (This implementation)
 	; tmp4 = tmp4 * (0.298631336 - 0.899976223) + tmp7 * -0.899976223;
 	; tmp5 = tmp5 * (2.053119869 - 2.562915447) + tmp6 * -2.562915447;
 	; tmp6 = tmp5 * -2.562915447 + tmp6 * (3.072711026 - 2.562915447);
 	; tmp7 = tmp4 * -0.899976223 + tmp7 * (1.501321110 - 0.899976223);
 	; data7 = tmp4 + z3;  data5 = tmp5 + z4;
 	; data3 = tmp6 + z3;  data1 = tmp7 + z4;
 	movq      mm4,mm2
 	movq      mm1,mm2
 	punpcklwd mm4,mm7
 	punpckhwd mm1,mm7
 	movq      mm2,mm4
 	movq      mm7,mm1
 	pmaddwd   mm4,[GOTOFF(ebx,PW_MF060_MF089)]	; mm4=tmp4L
 	pmaddwd   mm1,[GOTOFF(ebx,PW_MF060_MF089)]	; mm1=tmp4H
 	pmaddwd   mm2,[GOTOFF(ebx,PW_MF089_F060)]	; mm2=tmp7L
 	pmaddwd   mm7,[GOTOFF(ebx,PW_MF089_F060)]	; mm7=tmp7H
 	paddd	mm4, MMWORD [wk(0)]	; mm4=data7L
 	paddd	mm1, MMWORD [wk(1)]	; mm1=data7H
 	paddd	mm2,mm0			; mm2=data1L
 	paddd	mm7,mm6			; mm7=data1H
 	paddd	mm4,[GOTOFF(ebx,PD_DESCALE_P1)]
 	paddd	mm1,[GOTOFF(ebx,PD_DESCALE_P1)]
 	psrad	mm4,DESCALE_P1
 	psrad	mm1,DESCALE_P1
 	paddd	mm2,[GOTOFF(ebx,PD_DESCALE_P1)]
 	paddd	mm7,[GOTOFF(ebx,PD_DESCALE_P1)]
 	psrad	mm2,DESCALE_P1
 	psrad	mm7,DESCALE_P1
 	packssdw  mm4,mm1		; mm4=data7
 	packssdw  mm2,mm7		; mm2=data1
 	movq	MMWORD [MMBLOCK(3,1,edx,SIZEOF_DCTELEM)], mm4
 	movq	MMWORD [MMBLOCK(1,0,edx,SIZEOF_DCTELEM)], mm2
 	movq      mm1,mm3
 	movq      mm7,mm3
 	punpcklwd mm1,mm5
 	punpckhwd mm7,mm5
 	movq      mm3,mm1
 	movq      mm5,mm7
 	pmaddwd   mm1,[GOTOFF(ebx,PW_MF050_MF256)]	; mm1=tmp5L
 	pmaddwd   mm7,[GOTOFF(ebx,PW_MF050_MF256)]	; mm7=tmp5H
 	pmaddwd   mm3,[GOTOFF(ebx,PW_MF256_F050)]	; mm3=tmp6L
 	pmaddwd   mm5,[GOTOFF(ebx,PW_MF256_F050)]	; mm5=tmp6H
 	paddd	mm1,mm0			; mm1=data5L
 	paddd	mm7,mm6			; mm7=data5H
 	paddd	mm3, MMWORD [wk(0)]	; mm3=data3L
 	paddd	mm5, MMWORD [wk(1)]	; mm5=data3H
 	paddd	mm1,[GOTOFF(ebx,PD_DESCALE_P1)]
 	paddd	mm7,[GOTOFF(ebx,PD_DESCALE_P1)]
 	psrad	mm1,DESCALE_P1
 	psrad	mm7,DESCALE_P1
 	paddd	mm3,[GOTOFF(ebx,PD_DESCALE_P1)]
 	paddd	mm5,[GOTOFF(ebx,PD_DESCALE_P1)]
 	psrad	mm3,DESCALE_P1
 	psrad	mm5,DESCALE_P1
 	packssdw  mm1,mm7		; mm1=data5
 	packssdw  mm3,mm5		; mm3=data3
 	movq	MMWORD [MMBLOCK(1,1,edx,SIZEOF_DCTELEM)], mm1
 	movq	MMWORD [MMBLOCK(3,0,edx,SIZEOF_DCTELEM)], mm3
 	add	edx, byte 4*DCTSIZE*SIZEOF_DCTELEM
 	dec	ecx
 	jnz	near .rowloop
 	; ---- Pass 2: process columns.
 	mov	edx, POINTER [data(eax)]	; (DCTELEM *)
 	mov	ecx, DCTSIZE/4
 	alignx	16,7
 .columnloop:
 	movq	mm0, MMWORD [MMBLOCK(2,0,edx,SIZEOF_DCTELEM)]
 	movq	mm1, MMWORD [MMBLOCK(3,0,edx,SIZEOF_DCTELEM)]
 	movq	mm2, MMWORD [MMBLOCK(6,0,edx,SIZEOF_DCTELEM)]
 	movq	mm3, MMWORD [MMBLOCK(7,0,edx,SIZEOF_DCTELEM)]
 	; mm0=(02 12 22 32), mm2=(42 52 62 72)
 	; mm1=(03 13 23 33), mm3=(43 53 63 73)
 	movq      mm4,mm0		; transpose coefficients(phase 1)
 	punpcklwd mm0,mm1		; mm0=(02 03 12 13)
 	punpckhwd mm4,mm1		; mm4=(22 23 32 33)
 	movq      mm5,mm2		; transpose coefficients(phase 1)
 	punpcklwd mm2,mm3		; mm2=(42 43 52 53)
 	punpckhwd mm5,mm3		; mm5=(62 63 72 73)
 	movq	mm6, MMWORD [MMBLOCK(0,0,edx,SIZEOF_DCTELEM)]
 	movq	mm7, MMWORD [MMBLOCK(1,0,edx,SIZEOF_DCTELEM)]
 	movq	mm1, MMWORD [MMBLOCK(4,0,edx,SIZEOF_DCTELEM)]
 	movq	mm3, MMWORD [MMBLOCK(5,0,edx,SIZEOF_DCTELEM)]
 	; mm6=(00 10 20 30), mm1=(40 50 60 70)
 	; mm7=(01 11 21 31), mm3=(41 51 61 71)
 	movq	MMWORD [wk(0)], mm4	; wk(0)=(22 23 32 33)
 	movq	MMWORD [wk(1)], mm2	; wk(1)=(42 43 52 53)
 	movq      mm4,mm6		; transpose coefficients(phase 1)
 	punpcklwd mm6,mm7		; mm6=(00 01 10 11)
 	punpckhwd mm4,mm7		; mm4=(20 21 30 31)
 	movq      mm2,mm1		; transpose coefficients(phase 1)
 	punpcklwd mm1,mm3		; mm1=(40 41 50 51)
 	punpckhwd mm2,mm3		; mm2=(60 61 70 71)
 	movq      mm7,mm6		; transpose coefficients(phase 2)
 	punpckldq mm6,mm0		; mm6=(00 01 02 03)=data0
 	punpckhdq mm7,mm0		; mm7=(10 11 12 13)=data1
 	movq      mm3,mm2		; transpose coefficients(phase 2)
 	punpckldq mm2,mm5		; mm2=(60 61 62 63)=data6
 	punpckhdq mm3,mm5		; mm3=(70 71 72 73)=data7
 	movq	mm0,mm7
 	movq	mm5,mm6
 	psubw	mm7,mm2			; mm7=data1-data6=tmp6
 	psubw	mm6,mm3			; mm6=data0-data7=tmp7
 	paddw	mm0,mm2			; mm0=data1+data6=tmp1
 	paddw	mm5,mm3			; mm5=data0+data7=tmp0
 	movq	mm2, MMWORD [wk(0)]	; mm2=(22 23 32 33)
 	movq	mm3, MMWORD [wk(1)]	; mm3=(42 43 52 53)
 	movq	MMWORD [wk(0)], mm7	; wk(0)=tmp6
 	movq	MMWORD [wk(1)], mm6	; wk(1)=tmp7
 	movq      mm7,mm4		; transpose coefficients(phase 2)
 	punpckldq mm4,mm2		; mm4=(20 21 22 23)=data2
 	punpckhdq mm7,mm2		; mm7=(30 31 32 33)=data3
 	movq      mm6,mm1		; transpose coefficients(phase 2)
 	punpckldq mm1,mm3		; mm1=(40 41 42 43)=data4
 	punpckhdq mm6,mm3		; mm6=(50 51 52 53)=data5
 	movq	mm2,mm7
 	movq	mm3,mm4
 	paddw	mm7,mm1			; mm7=data3+data4=tmp3
 	paddw	mm4,mm6			; mm4=data2+data5=tmp2
 	psubw	mm2,mm1			; mm2=data3-data4=tmp4
 	psubw	mm3,mm6			; mm3=data2-data5=tmp5
 	; -- Even part
 	movq	mm1,mm5
 	movq	mm6,mm0
 	paddw	mm5,mm7			; mm5=tmp10
 	paddw	mm0,mm4			; mm0=tmp11
 	psubw	mm1,mm7			; mm1=tmp13
 	psubw	mm6,mm4			; mm6=tmp12
 	movq	mm7,mm5
 	paddw	mm5,mm0			; mm5=tmp10+tmp11
 	psubw	mm7,mm0			; mm7=tmp10-tmp11
 	paddw	mm5,[GOTOFF(ebx,PW_DESCALE_P2X)]
 	paddw	mm7,[GOTOFF(ebx,PW_DESCALE_P2X)]
 	psraw	mm5,PASS1_BITS		; mm5=data0
 	psraw	mm7,PASS1_BITS		; mm7=data4
 	movq	MMWORD [MMBLOCK(0,0,edx,SIZEOF_DCTELEM)], mm5
 	movq	MMWORD [MMBLOCK(4,0,edx,SIZEOF_DCTELEM)], mm7
 	; (Original)
 	; z1 = (tmp12 + tmp13) * 0.541196100;
 	; data2 = z1 + tmp13 * 0.765366865;
 	; data6 = z1 + tmp12 * -1.847759065;
 	;
 	; (This implementation)
 	; data2 = tmp13 * (0.541196100 + 0.765366865) + tmp12 * 0.541196100;
 	; data6 = tmp13 * 0.541196100 + tmp12 * (0.541196100 - 1.847759065);
 	movq      mm4,mm1		; mm1=tmp13
 	movq      mm0,mm1
 	punpcklwd mm4,mm6		; mm6=tmp12
 	punpckhwd mm0,mm6
 	movq      mm1,mm4
 	movq      mm6,mm0
 	pmaddwd   mm4,[GOTOFF(ebx,PW_F130_F054)]	; mm4=data2L
 	pmaddwd   mm0,[GOTOFF(ebx,PW_F130_F054)]	; mm0=data2H
 	pmaddwd   mm1,[GOTOFF(ebx,PW_F054_MF130)]	; mm1=data6L
 	pmaddwd   mm6,[GOTOFF(ebx,PW_F054_MF130)]	; mm6=data6H
 	paddd	mm4,[GOTOFF(ebx,PD_DESCALE_P2)]
 	paddd	mm0,[GOTOFF(ebx,PD_DESCALE_P2)]
 	psrad	mm4,DESCALE_P2
 	psrad	mm0,DESCALE_P2
 	paddd	mm1,[GOTOFF(ebx,PD_DESCALE_P2)]
 	paddd	mm6,[GOTOFF(ebx,PD_DESCALE_P2)]
 	psrad	mm1,DESCALE_P2
 	psrad	mm6,DESCALE_P2
 	packssdw  mm4,mm0		; mm4=data2
 	packssdw  mm1,mm6		; mm1=data6
 	movq	MMWORD [MMBLOCK(2,0,edx,SIZEOF_DCTELEM)], mm4
 	movq	MMWORD [MMBLOCK(6,0,edx,SIZEOF_DCTELEM)], mm1
 	; -- Odd part
 	movq	mm5, MMWORD [wk(0)]	; mm5=tmp6
 	movq	mm7, MMWORD [wk(1)]	; mm7=tmp7
 	movq	mm0,mm2			; mm2=tmp4
 	movq	mm6,mm3			; mm3=tmp5
 	paddw	mm0,mm5			; mm0=z3
 	paddw	mm6,mm7			; mm6=z4
 	; (Original)
 	; z5 = (z3 + z4) * 1.175875602;
 	; z3 = z3 * -1.961570560;  z4 = z4 * -0.390180644;
 	; z3 += z5;  z4 += z5;
 	;
 	; (This implementation)
 	; z3 = z3 * (1.175875602 - 1.961570560) + z4 * 1.175875602;
 	; z4 = z3 * 1.175875602 + z4 * (1.175875602 - 0.390180644);
 	movq      mm4,mm0
 	movq      mm1,mm0
 	punpcklwd mm4,mm6
 	punpckhwd mm1,mm6
 	movq      mm0,mm4
 	movq      mm6,mm1
 	pmaddwd   mm4,[GOTOFF(ebx,PW_MF078_F117)]	; mm4=z3L
 	pmaddwd   mm1,[GOTOFF(ebx,PW_MF078_F117)]	; mm1=z3H
 	pmaddwd   mm0,[GOTOFF(ebx,PW_F117_F078)]	; mm0=z4L
 	pmaddwd   mm6,[GOTOFF(ebx,PW_F117_F078)]	; mm6=z4H
 	movq	MMWORD [wk(0)], mm4	; wk(0)=z3L
 	movq	MMWORD [wk(1)], mm1	; wk(1)=z3H
 	; (Original)
 	; z1 = tmp4 + tmp7;  z2 = tmp5 + tmp6;
 	; tmp4 = tmp4 * 0.298631336;  tmp5 = tmp5 * 2.053119869;
 	; tmp6 = tmp6 * 3.072711026;  tmp7 = tmp7 * 1.501321110;
 	; z1 = z1 * -0.899976223;  z2 = z2 * -2.562915447;
 	; data7 = tmp4 + z1 + z3;  data5 = tmp5 + z2 + z4;
 	; data3 = tmp6 + z2 + z3;  data1 = tmp7 + z1 + z4;
 	;
 	; (This implementation)
 	; tmp4 = tmp4 * (0.298631336 - 0.899976223) + tmp7 * -0.899976223;
 	; tmp5 = tmp5 * (2.053119869 - 2.562915447) + tmp6 * -2.562915447;
 	; tmp6 = tmp5 * -2.562915447 + tmp6 * (3.072711026 - 2.562915447);
 	; tmp7 = tmp4 * -0.899976223 + tmp7 * (1.501321110 - 0.899976223);
 	; data7 = tmp4 + z3;  data5 = tmp5 + z4;
 	; data3 = tmp6 + z3;  data1 = tmp7 + z4;
 	movq      mm4,mm2
 	movq      mm1,mm2
 	punpcklwd mm4,mm7
 	punpckhwd mm1,mm7
 	movq      mm2,mm4
 	movq      mm7,mm1
 	pmaddwd   mm4,[GOTOFF(ebx,PW_MF060_MF089)]	; mm4=tmp4L
 	pmaddwd   mm1,[GOTOFF(ebx,PW_MF060_MF089)]	; mm1=tmp4H
 	pmaddwd   mm2,[GOTOFF(ebx,PW_MF089_F060)]	; mm2=tmp7L
 	pmaddwd   mm7,[GOTOFF(ebx,PW_MF089_F060)]	; mm7=tmp7H
 	paddd	mm4, MMWORD [wk(0)]	; mm4=data7L
 	paddd	mm1, MMWORD [wk(1)]	; mm1=data7H
 	paddd	mm2,mm0			; mm2=data1L
 	paddd	mm7,mm6			; mm7=data1H
 	paddd	mm4,[GOTOFF(ebx,PD_DESCALE_P2)]
 	paddd	mm1,[GOTOFF(ebx,PD_DESCALE_P2)]
 	psrad	mm4,DESCALE_P2
 	psrad	mm1,DESCALE_P2
 	paddd	mm2,[GOTOFF(ebx,PD_DESCALE_P2)]
 	paddd	mm7,[GOTOFF(ebx,PD_DESCALE_P2)]
 	psrad	mm2,DESCALE_P2
 	psrad	mm7,DESCALE_P2
 	packssdw  mm4,mm1		; mm4=data7
 	packssdw  mm2,mm7		; mm2=data1
 	movq	MMWORD [MMBLOCK(7,0,edx,SIZEOF_DCTELEM)], mm4
 	movq	MMWORD [MMBLOCK(1,0,edx,SIZEOF_DCTELEM)], mm2
 	movq      mm1,mm3
 	movq      mm7,mm3
 	punpcklwd mm1,mm5
 	punpckhwd mm7,mm5
 	movq      mm3,mm1
 	movq      mm5,mm7
 	pmaddwd   mm1,[GOTOFF(ebx,PW_MF050_MF256)]	; mm1=tmp5L
 	pmaddwd   mm7,[GOTOFF(ebx,PW_MF050_MF256)]	; mm7=tmp5H
 	pmaddwd   mm3,[GOTOFF(ebx,PW_MF256_F050)]	; mm3=tmp6L
 	pmaddwd   mm5,[GOTOFF(ebx,PW_MF256_F050)]	; mm5=tmp6H
 	paddd	mm1,mm0			; mm1=data5L
 	paddd	mm7,mm6			; mm7=data5H
 	paddd	mm3, MMWORD [wk(0)]	; mm3=data3L
 	paddd	mm5, MMWORD [wk(1)]	; mm5=data3H
 	paddd	mm1,[GOTOFF(ebx,PD_DESCALE_P2)]
 	paddd	mm7,[GOTOFF(ebx,PD_DESCALE_P2)]
 	psrad	mm1,DESCALE_P2
 	psrad	mm7,DESCALE_P2
 	paddd	mm3,[GOTOFF(ebx,PD_DESCALE_P2)]
 	paddd	mm5,[GOTOFF(ebx,PD_DESCALE_P2)]
 	psrad	mm3,DESCALE_P2
 	psrad	mm5,DESCALE_P2
 	packssdw  mm1,mm7		; mm1=data5
 	packssdw  mm3,mm5		; mm3=data3
 	movq	MMWORD [MMBLOCK(5,0,edx,SIZEOF_DCTELEM)], mm1
 	movq	MMWORD [MMBLOCK(3,0,edx,SIZEOF_DCTELEM)], mm3
 	add	edx, byte 4*SIZEOF_DCTELEM
 	dec	ecx
 	jnz	near .columnloop
 	emms		; empty MMX state
 ;	pop	edi		; unused
 ;	pop	esi		; unused
 ;	pop	edx		; need not be preserved
 ;	pop	ecx		; need not be preserved
 	poppic	ebx
 	mov	esp,ebp		; esp <- aligned ebp
 	pop	esp		; esp <- original ebp
 	pop	ebp
 	ret
 %endif ; JFDCT_INT_MMX_SUPPORTED
 %endif ; DCT_ISLOW_SUPPORTED
--- a/jfss2fst.asm
+++ b/jfss2fst.asm
@@ -0,0 +1,411 @@
 ;
 ; jfss2fst.asm - fast integer FDCT (SSE2)
 ;
 ; x86 SIMD extension for IJG JPEG library
 ; Copyright (C) 1999-2006, MIYASAKA Masaru.
 ; For conditions of distribution and use, see copyright notice in jsimdext.inc
 ;
 ; This file should be assembled with NASM (Netwide Assembler); it
 ; can *not* be assembled with Microsoft's MASM or any compatible
 ; assembler (including Borland's Turbo Assembler).
 ; NASM is available from http://nasm.sourceforge.net/ or
 ; http://sourceforge.net/project/showfiles.php?group_id=6208
 ;
 ; This file contains a fast, not so accurate integer implementation of
 ; the forward DCT (Discrete Cosine Transform). The following code is
 ; based directly on the IJG's original jfdctfst.c; see jfdctfst.c for
 ; more details.
 ;
 ; Last Modified : February 4, 2006
 ;
 ; [TAB8]
 %include "jsimdext.inc"
 %include "jdct.inc"
 %ifdef DCT_IFAST_SUPPORTED
 %ifdef JFDCT_INT_SSE2_SUPPORTED
 ; This module is specialized to the case DCTSIZE = 8.
 ;
 %if DCTSIZE != 8
 %error "Sorry, this code only copes with 8x8 DCTs."
 %endif
 ; --------------------------------------------------------------------------
 %define CONST_BITS	8	; 14 is also OK.
 %if CONST_BITS == 8
 F_0_382	equ	 98		; FIX(0.382683433)
 F_0_541	equ	139		; FIX(0.541196100)
 F_0_707	equ	181		; FIX(0.707106781)
 F_1_306	equ	334		; FIX(1.306562965)
 %else
 ; NASM cannot do compile-time arithmetic on floating-point constants.
 %define DESCALE(x,n)  (((x)+(1<<((n)-1)))>>(n))
 F_0_382	equ	DESCALE( 410903207,30-CONST_BITS)	; FIX(0.382683433)
 F_0_541	equ	DESCALE( 581104887,30-CONST_BITS)	; FIX(0.541196100)
 F_0_707	equ	DESCALE( 759250124,30-CONST_BITS)	; FIX(0.707106781)
 F_1_306	equ	DESCALE(1402911301,30-CONST_BITS)	; FIX(1.306562965)
 %endif
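 ; Worked check of DESCALE (illustrative arithmetic only): the literals in
 ; the %else branch are the constants scaled by 2^30, for example
 ; 759250124 is approximately 0.707106781 * 2^30. For CONST_BITS == 8:
 ;   DESCALE(759250124, 22) = (759250124 + (1 << 21)) >> 22 = 181
 ; which matches the hard-coded F_0_707 above. For CONST_BITS == 14:
 ;   DESCALE(759250124, 16) = (759250124 + (1 << 15)) >> 16 = 11585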
 ; --------------------------------------------------------------------------
 	SECTION	SEG_CONST
 ; PRE_MULTIPLY_SCALE_BITS <= 2 (to avoid overflow)
 ; CONST_BITS + CONST_SHIFT + PRE_MULTIPLY_SCALE_BITS == 16 (for pmulhw)
 %define PRE_MULTIPLY_SCALE_BITS   2
 %define CONST_SHIFT     (16 - PRE_MULTIPLY_SCALE_BITS - CONST_BITS)
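 ; Worked check of the scaling (values for CONST_BITS == 8):
 ;   CONST_SHIFT = 16 - 2 - 8 = 6, so PW_F0707 holds 181 << 6 = 11584.
 ;   pmulhw keeps the high 16 bits of the signed 16x16-bit product, so
 ;   for an input x that was pre-shifted with psllw:
 ;     ((x << 2) * (181 << 6)) >> 16  =  (x * 181) >> 8
 ;   provided x << 2 does not overflow 16 bits. The result is roughly
 ;   x * 0.707106781, truncated rather than rounded. The largest
 ;   constant, F_1_306 << 6 = 21376, still fits in a signed 16-bit word.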
 	alignz	16
 	global	EXTN(jconst_fdct_ifast_sse2)
 EXTN(jconst_fdct_ifast_sse2):
 PW_F0707	times 8 dw  F_0_707 << CONST_SHIFT
 PW_F0382	times 8 dw  F_0_382 << CONST_SHIFT
 PW_F0541	times 8 dw  F_0_541 << CONST_SHIFT
 PW_F1306	times 8 dw  F_1_306 << CONST_SHIFT
 	alignz	16
 ; --------------------------------------------------------------------------
 	SECTION	SEG_TEXT
 	BITS	32
 ;
 ; Perform the forward DCT on one block of samples.
 ;
 ; GLOBAL(void)
 ; jpeg_fdct_ifast_sse2 (DCTELEM * data)
 ;
 %define data(b)		(b)+8		; DCTELEM * data
 %define original_ebp	ebp+0
 %define wk(i)		ebp-(WK_NUM-(i))*SIZEOF_XMMWORD	; xmmword wk[WK_NUM]
 %define WK_NUM		2
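 ; Worked layout (assuming SIZEOF_XMMWORD is 16, as defined outside this
 ; file): with WK_NUM = 2, wk(0) = ebp-32 and wk(1) = ebp-16. The prologue
 ; below rounds esp down to a 16-byte boundary before copying it to ebp,
 ; so both slots are 16-byte aligned, as movdqa requires for memory
 ; operands.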
 	align	16
 	global	EXTN(jpeg_fdct_ifast_sse2)
 EXTN(jpeg_fdct_ifast_sse2):
 	push	ebp
 	mov	eax,esp				; eax = original ebp
 	sub	esp, byte 4
 	and	esp, byte (-SIZEOF_XMMWORD)	; align to 128 bits
 	mov	[esp],eax
 	mov	ebp,esp				; ebp = aligned ebp
 	lea	esp, [wk(0)]
 	pushpic	ebx
 ;	push	ecx		; unused
 ;	push	edx		; need not be preserved
 ;	push	esi		; unused
 ;	push	edi		; unused
 	get_GOT	ebx		; get GOT address
 	; ---- Pass 1: process rows.
 	mov	edx, POINTER [data(eax)]	; (DCTELEM *)
 	movdqa	xmm0, XMMWORD [XMMBLOCK(0,0,edx,SIZEOF_DCTELEM)]
 	movdqa	xmm1, XMMWORD [XMMBLOCK(1,0,edx,SIZEOF_DCTELEM)]
 	movdqa	xmm2, XMMWORD [XMMBLOCK(2,0,edx,SIZEOF_DCTELEM)]
 	movdqa	xmm3, XMMWORD [XMMBLOCK(3,0,edx,SIZEOF_DCTELEM)]
 	; xmm0=(00 01 02 03 04 05 06 07), xmm2=(20 21 22 23 24 25 26 27)
 	; xmm1=(10 11 12 13 14 15 16 17), xmm3=(30 31 32 33 34 35 36 37)
 	movdqa    xmm4,xmm0		; transpose coefficients(phase 1)
 	punpcklwd xmm0,xmm1		; xmm0=(00 10 01 11 02 12 03 13)
 	punpckhwd xmm4,xmm1		; xmm4=(04 14 05 15 06 16 07 17)
 	movdqa    xmm5,xmm2		; transpose coefficients(phase 1)
 	punpcklwd xmm2,xmm3		; xmm2=(20 30 21 31 22 32 23 33)
 	punpckhwd xmm5,xmm3		; xmm5=(24 34 25 35 26 36 27 37)
 	movdqa	xmm6, XMMWORD [XMMBLOCK(4,0,edx,SIZEOF_DCTELEM)]
 	movdqa	xmm7, XMMWORD [XMMBLOCK(5,0,edx,SIZEOF_DCTELEM)]
 	movdqa	xmm1, XMMWORD [XMMBLOCK(6,0,edx,SIZEOF_DCTELEM)]
 	movdqa	xmm3, XMMWORD [XMMBLOCK(7,0,edx,SIZEOF_DCTELEM)]
 	; xmm6=(40 41 42 43 44 45 46 47), xmm1=(60 61 62 63 64 65 66 67)
 	; xmm7=(50 51 52 53 54 55 56 57), xmm3=(70 71 72 73 74 75 76 77)
 	movdqa	XMMWORD [wk(0)], xmm2	; wk(0)=(20 30 21 31 22 32 23 33)
 	movdqa	XMMWORD [wk(1)], xmm5	; wk(1)=(24 34 25 35 26 36 27 37)
 	movdqa    xmm2,xmm6		; transpose coefficients(phase 1)
 	punpcklwd xmm6,xmm7		; xmm6=(40 50 41 51 42 52 43 53)
 	punpckhwd xmm2,xmm7		; xmm2=(44 54 45 55 46 56 47 57)
 	movdqa    xmm5,xmm1		; transpose coefficients(phase 1)
 	punpcklwd xmm1,xmm3		; xmm1=(60 70 61 71 62 72 63 73)
 	punpckhwd xmm5,xmm3		; xmm5=(64 74 65 75 66 76 67 77)
 	movdqa    xmm7,xmm6		; transpose coefficients(phase 2)
 	punpckldq xmm6,xmm1		; xmm6=(40 50 60 70 41 51 61 71)
 	punpckhdq xmm7,xmm1		; xmm7=(42 52 62 72 43 53 63 73)
 	movdqa    xmm3,xmm2		; transpose coefficients(phase 2)
 	punpckldq xmm2,xmm5		; xmm2=(44 54 64 74 45 55 65 75)
 	punpckhdq xmm3,xmm5		; xmm3=(46 56 66 76 47 57 67 77)
 	movdqa	xmm1, XMMWORD [wk(0)]	; xmm1=(20 30 21 31 22 32 23 33)
 	movdqa	xmm5, XMMWORD [wk(1)]	; xmm5=(24 34 25 35 26 36 27 37)
 	movdqa	XMMWORD [wk(0)], xmm7	; wk(0)=(42 52 62 72 43 53 63 73)
 	movdqa	XMMWORD [wk(1)], xmm2	; wk(1)=(44 54 64 74 45 55 65 75)
 	movdqa    xmm7,xmm0		; transpose coefficients(phase 2)
 	punpckldq xmm0,xmm1		; xmm0=(00 10 20 30 01 11 21 31)
 	punpckhdq xmm7,xmm1		; xmm7=(02 12 22 32 03 13 23 33)
 	movdqa    xmm2,xmm4		; transpose coefficients(phase 2)
 	punpckldq xmm4,xmm5		; xmm4=(04 14 24 34 05 15 25 35)
 	punpckhdq xmm2,xmm5		; xmm2=(06 16 26 36 07 17 27 37)
 	movdqa     xmm1,xmm0		; transpose coefficients(phase 3)
 	punpcklqdq xmm0,xmm6		; xmm0=(00 10 20 30 40 50 60 70)=data0
 	punpckhqdq xmm1,xmm6		; xmm1=(01 11 21 31 41 51 61 71)=data1
 	movdqa     xmm5,xmm2		; transpose coefficients(phase 3)
 	punpcklqdq xmm2,xmm3		; xmm2=(06 16 26 36 46 56 66 76)=data6
 	punpckhqdq xmm5,xmm3		; xmm5=(07 17 27 37 47 57 67 77)=data7
 	movdqa	xmm6,xmm1
 	movdqa	xmm3,xmm0
 	psubw	xmm1,xmm2		; xmm1=data1-data6=tmp6
 	psubw	xmm0,xmm5		; xmm0=data0-data7=tmp7
 	paddw	xmm6,xmm2		; xmm6=data1+data6=tmp1
 	paddw	xmm3,xmm5		; xmm3=data0+data7=tmp0
 	movdqa	xmm2, XMMWORD [wk(0)]	; xmm2=(42 52 62 72 43 53 63 73)
 	movdqa	xmm5, XMMWORD [wk(1)]	; xmm5=(44 54 64 74 45 55 65 75)
 	movdqa	XMMWORD [wk(0)], xmm1	; wk(0)=tmp6
 	movdqa	XMMWORD [wk(1)], xmm0	; wk(1)=tmp7
 	movdqa     xmm1,xmm7		; transpose coefficients(phase 3)
 	punpcklqdq xmm7,xmm2		; xmm7=(02 12 22 32 42 52 62 72)=data2
 	punpckhqdq xmm1,xmm2		; xmm1=(03 13 23 33 43 53 63 73)=data3
 	movdqa     xmm0,xmm4		; transpose coefficients(phase 3)
 	punpcklqdq xmm4,xmm5		; xmm4=(04 14 24 34 44 54 64 74)=data4
 	punpckhqdq xmm0,xmm5		; xmm0=(05 15 25 35 45 55 65 75)=data5
 	movdqa	xmm2,xmm1
 	movdqa	xmm5,xmm7
 	paddw	xmm1,xmm4		; xmm1=data3+data4=tmp3
 	paddw	xmm7,xmm0		; xmm7=data2+data5=tmp2
 	psubw	xmm2,xmm4		; xmm2=data3-data4=tmp4
 	psubw	xmm5,xmm0		; xmm5=data2-data5=tmp5
 	; -- Even part
 	movdqa	xmm4,xmm3
 	movdqa	xmm0,xmm6
 	psubw	xmm3,xmm1		; xmm3=tmp13
 	psubw	xmm6,xmm7		; xmm6=tmp12
 	paddw	xmm4,xmm1		; xmm4=tmp10
 	paddw	xmm0,xmm7		; xmm0=tmp11
 	paddw	xmm6,xmm3
 	psllw	xmm6,PRE_MULTIPLY_SCALE_BITS
 	pmulhw	xmm6,[GOTOFF(ebx,PW_F0707)] ; xmm6=z1
 	movdqa	xmm1,xmm4
 	movdqa	xmm7,xmm3
 	psubw	xmm4,xmm0		; xmm4=data4
 	psubw	xmm3,xmm6		; xmm3=data6
 	paddw	xmm1,xmm0		; xmm1=data0
 	paddw	xmm7,xmm6		; xmm7=data2
 	movdqa	xmm0, XMMWORD [wk(0)]	; xmm0=tmp6
 	movdqa	xmm6, XMMWORD [wk(1)]	; xmm6=tmp7
 	movdqa	XMMWORD [wk(0)], xmm4	; wk(0)=data4
 	movdqa	XMMWORD [wk(1)], xmm3	; wk(1)=data6
 	; -- Odd part
 	paddw	xmm2,xmm5		; xmm2=tmp10
 	paddw	xmm5,xmm0		; xmm5=tmp11
 	paddw	xmm0,xmm6		; xmm0=tmp12, xmm6=tmp7
 	psllw	xmm2,PRE_MULTIPLY_SCALE_BITS
 	psllw	xmm0,PRE_MULTIPLY_SCALE_BITS
 	psllw	xmm5,PRE_MULTIPLY_SCALE_BITS
 	pmulhw	xmm5,[GOTOFF(ebx,PW_F0707)] ; xmm5=z3
 	movdqa	xmm4,xmm2		; xmm4=tmp10
 	psubw	xmm2,xmm0
 	pmulhw	xmm2,[GOTOFF(ebx,PW_F0382)] ; xmm2=z5
 	pmulhw	xmm4,[GOTOFF(ebx,PW_F0541)] ; xmm4=MULTIPLY(tmp10,FIX_0_541196)
 	pmulhw	xmm0,[GOTOFF(ebx,PW_F1306)] ; xmm0=MULTIPLY(tmp12,FIX_1_306562)
 	paddw	xmm4,xmm2		; xmm4=z2
 	paddw	xmm0,xmm2		; xmm0=z4
 	movdqa	xmm3,xmm6
 	psubw	xmm6,xmm5		; xmm6=z13
 	paddw	xmm3,xmm5		; xmm3=z11
 	movdqa	xmm2,xmm6
 	movdqa	xmm5,xmm3
 	psubw	xmm6,xmm4		; xmm6=data3
 	psubw	xmm3,xmm0		; xmm3=data7
 	paddw	xmm2,xmm4		; xmm2=data5
 	paddw	xmm5,xmm0		; xmm5=data1
 	; ---- Pass 2: process columns.
 ;	mov	edx, POINTER [data(eax)]	; (DCTELEM *)
 	; xmm1=(00 10 20 30 40 50 60 70), xmm7=(02 12 22 32 42 52 62 72)
 	; xmm5=(01 11 21 31 41 51 61 71), xmm6=(03 13 23 33 43 53 63 73)
 	movdqa    xmm4,xmm1		; transpose coefficients(phase 1)
 	punpcklwd xmm1,xmm5		; xmm1=(00 01 10 11 20 21 30 31)
 	punpckhwd xmm4,xmm5		; xmm4=(40 41 50 51 60 61 70 71)
 	movdqa    xmm0,xmm7		; transpose coefficients(phase 1)
 	punpcklwd xmm7,xmm6		; xmm7=(02 03 12 13 22 23 32 33)
 	punpckhwd xmm0,xmm6		; xmm0=(42 43 52 53 62 63 72 73)
 	movdqa	xmm5, XMMWORD [wk(0)]	; xmm5=col4
 	movdqa	xmm6, XMMWORD [wk(1)]	; xmm6=col6
 	; xmm5=(04 14 24 34 44 54 64 74), xmm6=(06 16 26 36 46 56 66 76)
 	; xmm2=(05 15 25 35 45 55 65 75), xmm3=(07 17 27 37 47 57 67 77)
 	movdqa	XMMWORD [wk(0)], xmm7	; wk(0)=(02 03 12 13 22 23 32 33)
 	movdqa	XMMWORD [wk(1)], xmm0	; wk(1)=(42 43 52 53 62 63 72 73)
 	movdqa    xmm7,xmm5		; transpose coefficients(phase 1)
 	punpcklwd xmm5,xmm2		; xmm5=(04 05 14 15 24 25 34 35)
 	punpckhwd xmm7,xmm2		; xmm7=(44 45 54 55 64 65 74 75)
 	movdqa    xmm0,xmm6		; transpose coefficients(phase 1)
 	punpcklwd xmm6,xmm3		; xmm6=(06 07 16 17 26 27 36 37)
 	punpckhwd xmm0,xmm3		; xmm0=(46 47 56 57 66 67 76 77)
 	movdqa    xmm2,xmm5		; transpose coefficients(phase 2)
 	punpckldq xmm5,xmm6		; xmm5=(04 05 06 07 14 15 16 17)
 	punpckhdq xmm2,xmm6		; xmm2=(24 25 26 27 34 35 36 37)
 	movdqa    xmm3,xmm7		; transpose coefficients(phase 2)
 	punpckldq xmm7,xmm0		; xmm7=(44 45 46 47 54 55 56 57)
 	punpckhdq xmm3,xmm0		; xmm3=(64 65 66 67 74 75 76 77)
 	movdqa	xmm6, XMMWORD [wk(0)]	; xmm6=(02 03 12 13 22 23 32 33)
 	movdqa	xmm0, XMMWORD [wk(1)]	; xmm0=(42 43 52 53 62 63 72 73)
 	movdqa	XMMWORD [wk(0)], xmm2	; wk(0)=(24 25 26 27 34 35 36 37)
 	movdqa	XMMWORD [wk(1)], xmm7	; wk(1)=(44 45 46 47 54 55 56 57)
 	movdqa    xmm2,xmm1		; transpose coefficients(phase 2)
 	punpckldq xmm1,xmm6		; xmm1=(00 01 02 03 10 11 12 13)
 	punpckhdq xmm2,xmm6		; xmm2=(20 21 22 23 30 31 32 33)
 	movdqa    xmm7,xmm4		; transpose coefficients(phase 2)
 	punpckldq xmm4,xmm0		; xmm4=(40 41 42 43 50 51 52 53)
 	punpckhdq xmm7,xmm0		; xmm7=(60 61 62 63 70 71 72 73)
 	movdqa     xmm6,xmm1		; transpose coefficients(phase 3)
 	punpcklqdq xmm1,xmm5		; xmm1=(00 01 02 03 04 05 06 07)=data0
 	punpckhqdq xmm6,xmm5		; xmm6=(10 11 12 13 14 15 16 17)=data1
 	movdqa     xmm0,xmm7		; transpose coefficients(phase 3)
 	punpcklqdq xmm7,xmm3		; xmm7=(60 61 62 63 64 65 66 67)=data6
 	punpckhqdq xmm0,xmm3		; xmm0=(70 71 72 73 74 75 76 77)=data7
 	movdqa	xmm5,xmm6
 	movdqa	xmm3,xmm1
 	psubw	xmm6,xmm7		; xmm6=data1-data6=tmp6
 	psubw	xmm1,xmm0		; xmm1=data0-data7=tmp7
 	paddw	xmm5,xmm7		; xmm5=data1+data6=tmp1
 	paddw	xmm3,xmm0		; xmm3=data0+data7=tmp0
 	movdqa	xmm7, XMMWORD [wk(0)]	; xmm7=(24 25 26 27 34 35 36 37)
 	movdqa	xmm0, XMMWORD [wk(1)]	; xmm0=(44 45 46 47 54 55 56 57)
 	movdqa	XMMWORD [wk(0)], xmm6	; wk(0)=tmp6
 	movdqa	XMMWORD [wk(1)], xmm1	; wk(1)=tmp7
 	movdqa     xmm6,xmm2		; transpose coefficients(phase 3)
 	punpcklqdq xmm2,xmm7		; xmm2=(20 21 22 23 24 25 26 27)=data2
 	punpckhqdq xmm6,xmm7		; xmm6=(30 31 32 33 34 35 36 37)=data3
 	movdqa     xmm1,xmm4		; transpose coefficients(phase 3)
 	punpcklqdq xmm4,xmm0		; xmm4=(40 41 42 43 44 45 46 47)=data4
 	punpckhqdq xmm1,xmm0		; xmm1=(50 51 52 53 54 55 56 57)=data5
 	movdqa	xmm7,xmm6
 	movdqa	xmm0,xmm2
 	paddw	xmm6,xmm4		; xmm6=data3+data4=tmp3
 	paddw	xmm2,xmm1		; xmm2=data2+data5=tmp2
 	psubw	xmm7,xmm4		; xmm7=data3-data4=tmp4
 	psubw	xmm0,xmm1		; xmm0=data2-data5=tmp5
 	; -- Even part
 	movdqa	xmm4,xmm3
 	movdqa	xmm1,xmm5
 	psubw	xmm3,xmm6		; xmm3=tmp13
 	psubw	xmm5,xmm2		; xmm5=tmp12
 	paddw	xmm4,xmm6		; xmm4=tmp10
 	paddw	xmm1,xmm2		; xmm1=tmp11
 	paddw	xmm5,xmm3
 	psllw	xmm5,PRE_MULTIPLY_SCALE_BITS
 	pmulhw	xmm5,[GOTOFF(ebx,PW_F0707)] ; xmm5=z1
 	movdqa	xmm6,xmm4
 	movdqa	xmm2,xmm3
 	psubw	xmm4,xmm1		; xmm4=data4
 	psubw	xmm3,xmm5		; xmm3=data6
 	paddw	xmm6,xmm1		; xmm6=data0
 	paddw	xmm2,xmm5		; xmm2=data2
 	movdqa	XMMWORD [XMMBLOCK(4,0,edx,SIZEOF_DCTELEM)], xmm4
 	movdqa	XMMWORD [XMMBLOCK(6,0,edx,SIZEOF_DCTELEM)], xmm3
 	movdqa	XMMWORD [XMMBLOCK(0,0,edx,SIZEOF_DCTELEM)], xmm6
 	movdqa	XMMWORD [XMMBLOCK(2,0,edx,SIZEOF_DCTELEM)], xmm2
 	; -- Odd part
 	movdqa	xmm1, XMMWORD [wk(0)]	; xmm1=tmp6
 	movdqa	xmm5, XMMWORD [wk(1)]	; xmm5=tmp7
 	paddw	xmm7,xmm0		; xmm7=tmp10
 	paddw	xmm0,xmm1		; xmm0=tmp11
 	paddw	xmm1,xmm5		; xmm1=tmp12, xmm5=tmp7
 	psllw	xmm7,PRE_MULTIPLY_SCALE_BITS
 	psllw	xmm1,PRE_MULTIPLY_SCALE_BITS
 	psllw	xmm0,PRE_MULTIPLY_SCALE_BITS
 	pmulhw	xmm0,[GOTOFF(ebx,PW_F0707)] ; xmm0=z3
 	movdqa	xmm4,xmm7		; xmm4=tmp10
 	psubw	xmm7,xmm1
 	pmulhw	xmm7,[GOTOFF(ebx,PW_F0382)] ; xmm7=z5
 	pmulhw	xmm4,[GOTOFF(ebx,PW_F0541)] ; xmm4=MULTIPLY(tmp10,FIX_0_541196)
 	pmulhw	xmm1,[GOTOFF(ebx,PW_F1306)] ; xmm1=MULTIPLY(tmp12,FIX_1_306562)
 	paddw	xmm4,xmm7		; xmm4=z2
 	paddw	xmm1,xmm7		; xmm1=z4
 	movdqa	xmm3,xmm5
 	psubw	xmm5,xmm0		; xmm5=z13
 	paddw	xmm3,xmm0		; xmm3=z11
 	movdqa	xmm6,xmm5
 	movdqa	xmm2,xmm3
 	psubw	xmm5,xmm4		; xmm5=data3
 	psubw	xmm3,xmm1		; xmm3=data7
 	paddw	xmm6,xmm4		; xmm6=data5
 	paddw	xmm2,xmm1		; xmm2=data1
 	movdqa	XMMWORD [XMMBLOCK(3,0,edx,SIZEOF_DCTELEM)], xmm5
 	movdqa	XMMWORD [XMMBLOCK(7,0,edx,SIZEOF_DCTELEM)], xmm3
 	movdqa	XMMWORD [XMMBLOCK(5,0,edx,SIZEOF_DCTELEM)], xmm6
 	movdqa	XMMWORD [XMMBLOCK(1,0,edx,SIZEOF_DCTELEM)], xmm2
 ;	pop	edi		; unused
 ;	pop	esi		; unused
 ;	pop	edx		; need not be preserved
 ;	pop	ecx		; unused
 	poppic	ebx
 	mov	esp,ebp		; esp <- aligned ebp
 	pop	esp		; esp <- original ebp
 	pop	ebp
 	ret
 %endif ; JFDCT_INT_SSE2_SUPPORTED
 %endif ; DCT_IFAST_SUPPORTED
--- a/jfss2int.asm
+++ b/jfss2int.asm
@@ -0,0 +1,641 @@
 ;
 ; jfss2int.asm - accurate integer FDCT (SSE2)
 ;
 ; x86 SIMD extension for IJG JPEG library
 ; Copyright (C) 1999-2006, MIYASAKA Masaru.
 ; For conditions of distribution and use, see copyright notice in jsimdext.inc
 ;
 ; This file should be assembled with NASM (Netwide Assembler); it
 ; can *not* be assembled with Microsoft's MASM or any compatible
 ; assembler (including Borland's Turbo Assembler).
 ; NASM is available from http://nasm.sourceforge.net/ or
 ; http://sourceforge.net/project/showfiles.php?group_id=6208
 ;
 ; This file contains a slow-but-accurate integer implementation of the
 ; forward DCT (Discrete Cosine Transform). The following code is based
 ; directly on the IJG's original jfdctint.c; see jfdctint.c for more
 ; details.
 ;
 ; Last Modified : February 4, 2006
 ;
 ; [TAB8]
 %include "jsimdext.inc"
 %include "jdct.inc"
 %ifdef DCT_ISLOW_SUPPORTED
 %ifdef JFDCT_INT_SSE2_SUPPORTED
 ; This module is specialized to the case DCTSIZE = 8.
 ;
 %if DCTSIZE != 8
 %error "Sorry, this code only copes with 8x8 DCTs."
 %endif
 ; --------------------------------------------------------------------------
 %define CONST_BITS	13
 %define PASS1_BITS	2
 %define DESCALE_P1	(CONST_BITS-PASS1_BITS)
 %define DESCALE_P2	(CONST_BITS+PASS1_BITS)
 %if CONST_BITS == 13
 F_0_298	equ	 2446		; FIX(0.298631336)
 F_0_390	equ	 3196		; FIX(0.390180644)
 F_0_541	equ	 4433		; FIX(0.541196100)
 F_0_765	equ	 6270		; FIX(0.765366865)
 F_0_899	equ	 7373		; FIX(0.899976223)
 F_1_175	equ	 9633		; FIX(1.175875602)
 F_1_501	equ	12299		; FIX(1.501321110)
 F_1_847	equ	15137		; FIX(1.847759065)
 F_1_961	equ	16069		; FIX(1.961570560)
 F_2_053	equ	16819		; FIX(2.053119869)
 F_2_562	equ	20995		; FIX(2.562915447)
 F_3_072	equ	25172		; FIX(3.072711026)
 %else
 ; NASM cannot do compile-time arithmetic on floating-point constants.
 %define DESCALE(x,n)  (((x)+(1<<((n)-1)))>>(n))
 F_0_298	equ	DESCALE( 320652955,30-CONST_BITS)	; FIX(0.298631336)
 F_0_390	equ	DESCALE( 418953276,30-CONST_BITS)	; FIX(0.390180644)
 F_0_541	equ	DESCALE( 581104887,30-CONST_BITS)	; FIX(0.541196100)
 F_0_765	equ	DESCALE( 821806413,30-CONST_BITS)	; FIX(0.765366865)
 F_0_899	equ	DESCALE( 966342111,30-CONST_BITS)	; FIX(0.899976223)
 F_1_175	equ	DESCALE(1262586813,30-CONST_BITS)	; FIX(1.175875602)
 F_1_501	equ	DESCALE(1612031267,30-CONST_BITS)	; FIX(1.501321110)
 F_1_847	equ	DESCALE(1984016188,30-CONST_BITS)	; FIX(1.847759065)
 F_1_961	equ	DESCALE(2106220350,30-CONST_BITS)	; FIX(1.961570560)
 F_2_053	equ	DESCALE(2204520673,30-CONST_BITS)	; FIX(2.053119869)
 F_2_562	equ	DESCALE(2751909506,30-CONST_BITS)	; FIX(2.562915447)
 F_3_072	equ	DESCALE(3299298341,30-CONST_BITS)	; FIX(3.072711026)
 %endif
 ; --------------------------------------------------------------------------
 	SECTION	SEG_CONST
 	alignz	16
 	global	EXTN(jconst_fdct_islow_sse2)
 EXTN(jconst_fdct_islow_sse2):
 PW_F130_F054	times 4 dw  (F_0_541+F_0_765), F_0_541
 PW_F054_MF130	times 4 dw  F_0_541, (F_0_541-F_1_847)
 PW_MF078_F117	times 4 dw  (F_1_175-F_1_961), F_1_175
 PW_F117_F078	times 4 dw  F_1_175, (F_1_175-F_0_390)
 PW_MF060_MF089	times 4 dw  (F_0_298-F_0_899),-F_0_899
 PW_MF089_F060	times 4 dw -F_0_899, (F_1_501-F_0_899)
 PW_MF050_MF256	times 4 dw  (F_2_053-F_2_562),-F_2_562
 PW_MF256_F050	times 4 dw -F_2_562, (F_3_072-F_2_562)
 PD_DESCALE_P1	times 4 dd  1 << (DESCALE_P1-1)
 PD_DESCALE_P2	times 4 dd  1 << (DESCALE_P2-1)
 PW_DESCALE_P2X	times 8 dw  1 << (PASS1_BITS-1)
 	alignz	16
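 ; Worked values of the table above for CONST_BITS == 13 and
 ; PASS1_BITS == 2 (illustrative arithmetic only):
 ;   PW_F130_F054   = ( 10703,   4433)   PW_F054_MF130 = (  4433, -10704)
 ;   PW_MF078_F117  = ( -6436,   9633)   PW_F117_F078  = (  9633,   6437)
 ;   PW_MF060_MF089 = ( -4927,  -7373)   PW_MF089_F060 = ( -7373,   4926)
 ;   PW_MF050_MF256 = ( -4176, -20995)   PW_MF256_F050 = (-20995,   4177)
 ;   PD_DESCALE_P1  = 1 << 10 = 1024     PD_DESCALE_P2 = 1 << 14 = 16384
 ;   PW_DESCALE_P2X = 1 << 1  = 2
 ; Every folded word fits in a signed 16-bit value, as dw and pmaddwd
 ; require. The labels appear to encode the folded value divided by
 ; 2^13, e.g. 10703/8192 is about 1.30 (F130) and -6436/8192 is about
 ; -0.78 (MF078).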
 ; --------------------------------------------------------------------------
 	SECTION	SEG_TEXT
 	BITS	32
 ;
 ; Perform the forward DCT on one block of samples.
 ;
 ; GLOBAL(void)
 ; jpeg_fdct_islow_sse2 (DCTELEM * data)
 ;
 %define data(b)		(b)+8		; DCTELEM * data
 %define original_ebp	ebp+0
 %define wk(i)		ebp-(WK_NUM-(i))*SIZEOF_XMMWORD	; xmmword wk[WK_NUM]
 %define WK_NUM		6
 	align	16
 	global	EXTN(jpeg_fdct_islow_sse2)
 EXTN(jpeg_fdct_islow_sse2):
 	push	ebp
 	mov	eax,esp				; eax = original ebp
 	sub	esp, byte 4
 	and	esp, byte (-SIZEOF_XMMWORD)	; align to 128 bits
 	mov	[esp],eax
 	mov	ebp,esp				; ebp = aligned ebp
 	lea	esp, [wk(0)]
 	pushpic	ebx
 ;	push	ecx		; unused
 ;	push	edx		; need not be preserved
 ;	push	esi		; unused
 ;	push	edi		; unused
 	get_GOT	ebx		; get GOT address
 	; ---- Pass 1: process rows.
 	mov	edx, POINTER [data(eax)]	; (DCTELEM *)
 	movdqa	xmm0, XMMWORD [XMMBLOCK(0,0,edx,SIZEOF_DCTELEM)]
 	movdqa	xmm1, XMMWORD [XMMBLOCK(1,0,edx,SIZEOF_DCTELEM)]
 	movdqa	xmm2, XMMWORD [XMMBLOCK(2,0,edx,SIZEOF_DCTELEM)]
 	movdqa	xmm3, XMMWORD [XMMBLOCK(3,0,edx,SIZEOF_DCTELEM)]
 	; xmm0=(00 01 02 03 04 05 06 07), xmm2=(20 21 22 23 24 25 26 27)
 	; xmm1=(10 11 12 13 14 15 16 17), xmm3=(30 31 32 33 34 35 36 37)
 	movdqa    xmm4,xmm0		; transpose coefficients(phase 1)
 	punpcklwd xmm0,xmm1		; xmm0=(00 10 01 11 02 12 03 13)
 	punpckhwd xmm4,xmm1		; xmm4=(04 14 05 15 06 16 07 17)
 	movdqa    xmm5,xmm2		; transpose coefficients(phase 1)
 	punpcklwd xmm2,xmm3		; xmm2=(20 30 21 31 22 32 23 33)
 	punpckhwd xmm5,xmm3		; xmm5=(24 34 25 35 26 36 27 37)
 	movdqa	xmm6, XMMWORD [XMMBLOCK(4,0,edx,SIZEOF_DCTELEM)]
 	movdqa	xmm7, XMMWORD [XMMBLOCK(5,0,edx,SIZEOF_DCTELEM)]
 	movdqa	xmm1, XMMWORD [XMMBLOCK(6,0,edx,SIZEOF_DCTELEM)]
 	movdqa	xmm3, XMMWORD [XMMBLOCK(7,0,edx,SIZEOF_DCTELEM)]
 	; xmm6=(40 41 42 43 44 45 46 47), xmm1=(60 61 62 63 64 65 66 67)
 	; xmm7=(50 51 52 53 54 55 56 57), xmm3=(70 71 72 73 74 75 76 77)
 	movdqa	XMMWORD [wk(0)], xmm2	; wk(0)=(20 30 21 31 22 32 23 33)
 	movdqa	XMMWORD [wk(1)], xmm5	; wk(1)=(24 34 25 35 26 36 27 37)
 	movdqa    xmm2,xmm6		; transpose coefficients(phase 1)
 	punpcklwd xmm6,xmm7		; xmm6=(40 50 41 51 42 52 43 53)
 	punpckhwd xmm2,xmm7		; xmm2=(44 54 45 55 46 56 47 57)
 	movdqa    xmm5,xmm1		; transpose coefficients(phase 1)
 	punpcklwd xmm1,xmm3		; xmm1=(60 70 61 71 62 72 63 73)
 	punpckhwd xmm5,xmm3		; xmm5=(64 74 65 75 66 76 67 77)
 	movdqa    xmm7,xmm6		; transpose coefficients(phase 2)
 	punpckldq xmm6,xmm1		; xmm6=(40 50 60 70 41 51 61 71)
 	punpckhdq xmm7,xmm1		; xmm7=(42 52 62 72 43 53 63 73)
 	movdqa    xmm3,xmm2		; transpose coefficients(phase 2)
 	punpckldq xmm2,xmm5		; xmm2=(44 54 64 74 45 55 65 75)
 	punpckhdq xmm3,xmm5		; xmm3=(46 56 66 76 47 57 67 77)
 	movdqa	xmm1, XMMWORD [wk(0)]	; xmm1=(20 30 21 31 22 32 23 33)
 	movdqa	xmm5, XMMWORD [wk(1)]	; xmm5=(24 34 25 35 26 36 27 37)
 	movdqa	XMMWORD [wk(2)], xmm7	; wk(2)=(42 52 62 72 43 53 63 73)
 	movdqa	XMMWORD [wk(3)], xmm2	; wk(3)=(44 54 64 74 45 55 65 75)
 	movdqa    xmm7,xmm0		; transpose coefficients(phase 2)
 	punpckldq xmm0,xmm1		; xmm0=(00 10 20 30 01 11 21 31)
 	punpckhdq xmm7,xmm1		; xmm7=(02 12 22 32 03 13 23 33)
 	movdqa    xmm2,xmm4		; transpose coefficients(phase 2)
 	punpckldq xmm4,xmm5		; xmm4=(04 14 24 34 05 15 25 35)
 	punpckhdq xmm2,xmm5		; xmm2=(06 16 26 36 07 17 27 37)
 	movdqa     xmm1,xmm0		; transpose coefficients(phase 3)
 	punpcklqdq xmm0,xmm6		; xmm0=(00 10 20 30 40 50 60 70)=data0
 	punpckhqdq xmm1,xmm6		; xmm1=(01 11 21 31 41 51 61 71)=data1
 	movdqa     xmm5,xmm2		; transpose coefficients(phase 3)
 	punpcklqdq xmm2,xmm3		; xmm2=(06 16 26 36 46 56 66 76)=data6
 	punpckhqdq xmm5,xmm3		; xmm5=(07 17 27 37 47 57 67 77)=data7
 	movdqa	xmm6,xmm1
 	movdqa	xmm3,xmm0
 	psubw	xmm1,xmm2		; xmm1=data1-data6=tmp6
 	psubw	xmm0,xmm5		; xmm0=data0-data7=tmp7
 	paddw	xmm6,xmm2		; xmm6=data1+data6=tmp1
 	paddw	xmm3,xmm5		; xmm3=data0+data7=tmp0
 	movdqa	xmm2, XMMWORD [wk(2)]	; xmm2=(42 52 62 72 43 53 63 73)
 	movdqa	xmm5, XMMWORD [wk(3)]	; xmm5=(44 54 64 74 45 55 65 75)
 	movdqa	XMMWORD [wk(0)], xmm1	; wk(0)=tmp6
 	movdqa	XMMWORD [wk(1)], xmm0	; wk(1)=tmp7
 	movdqa     xmm1,xmm7		; transpose coefficients(phase 3)
 	punpcklqdq xmm7,xmm2		; xmm7=(02 12 22 32 42 52 62 72)=data2
 	punpckhqdq xmm1,xmm2		; xmm1=(03 13 23 33 43 53 63 73)=data3
 	movdqa     xmm0,xmm4		; transpose coefficients(phase 3)
 	punpcklqdq xmm4,xmm5		; xmm4=(04 14 24 34 44 54 64 74)=data4
 	punpckhqdq xmm0,xmm5		; xmm0=(05 15 25 35 45 55 65 75)=data5
 	movdqa	xmm2,xmm1
 	movdqa	xmm5,xmm7
 	paddw	xmm1,xmm4		; xmm1=data3+data4=tmp3
 	paddw	xmm7,xmm0		; xmm7=data2+data5=tmp2
 	psubw	xmm2,xmm4		; xmm2=data3-data4=tmp4
 	psubw	xmm5,xmm0		; xmm5=data2-data5=tmp5
 	; -- Even part
 	movdqa	xmm4,xmm3
 	movdqa	xmm0,xmm6
 	paddw	xmm3,xmm1		; xmm3=tmp10
 	paddw	xmm6,xmm7		; xmm6=tmp11
 	psubw	xmm4,xmm1		; xmm4=tmp13
 	psubw	xmm0,xmm7		; xmm0=tmp12
 	movdqa	xmm1,xmm3
 	paddw	xmm3,xmm6		; xmm3=tmp10+tmp11
 	psubw	xmm1,xmm6		; xmm1=tmp10-tmp11
 	psllw	xmm3,PASS1_BITS		; xmm3=data0
 	psllw	xmm1,PASS1_BITS		; xmm1=data4
 	movdqa	XMMWORD [wk(2)], xmm3	; wk(2)=data0
 	movdqa	XMMWORD [wk(3)], xmm1	; wk(3)=data4
 	; (Original)
 	; z1 = (tmp12 + tmp13) * 0.541196100;
 	; data2 = z1 + tmp13 * 0.765366865;
 	; data6 = z1 + tmp12 * -1.847759065;
 	;
 	; (This implementation)
 	; data2 = tmp13 * (0.541196100 + 0.765366865) + tmp12 * 0.541196100;
 	; data6 = tmp13 * 0.541196100 + tmp12 * (0.541196100 - 1.847759065);
 	movdqa    xmm7,xmm4		; xmm4=tmp13
 	movdqa    xmm6,xmm4
 	punpcklwd xmm7,xmm0		; xmm0=tmp12
 	punpckhwd xmm6,xmm0
 	movdqa    xmm4,xmm7
 	movdqa    xmm0,xmm6
 	pmaddwd   xmm7,[GOTOFF(ebx,PW_F130_F054)]	; xmm7=data2L
 	pmaddwd   xmm6,[GOTOFF(ebx,PW_F130_F054)]	; xmm6=data2H
 	pmaddwd   xmm4,[GOTOFF(ebx,PW_F054_MF130)]	; xmm4=data6L
 	pmaddwd   xmm0,[GOTOFF(ebx,PW_F054_MF130)]	; xmm0=data6H
 	paddd	xmm7,[GOTOFF(ebx,PD_DESCALE_P1)]
 	paddd	xmm6,[GOTOFF(ebx,PD_DESCALE_P1)]
 	psrad	xmm7,DESCALE_P1
 	psrad	xmm6,DESCALE_P1
 	paddd	xmm4,[GOTOFF(ebx,PD_DESCALE_P1)]
 	paddd	xmm0,[GOTOFF(ebx,PD_DESCALE_P1)]
 	psrad	xmm4,DESCALE_P1
 	psrad	xmm0,DESCALE_P1
 	packssdw  xmm7,xmm6		; xmm7=data2
 	packssdw  xmm4,xmm0		; xmm4=data6
 	movdqa	XMMWORD [wk(4)], xmm7	; wk(4)=data2
 	movdqa	XMMWORD [wk(5)], xmm4	; wk(5)=data6
 	; -- Odd part
 	movdqa	xmm3, XMMWORD [wk(0)]	; xmm3=tmp6
 	movdqa	xmm1, XMMWORD [wk(1)]	; xmm1=tmp7
 	movdqa	xmm6,xmm2		; xmm2=tmp4
 	movdqa	xmm0,xmm5		; xmm5=tmp5
 	paddw	xmm6,xmm3		; xmm6=z3
 	paddw	xmm0,xmm1		; xmm0=z4
 	; (Original)
 	; z5 = (z3 + z4) * 1.175875602;
 	; z3 = z3 * -1.961570560;  z4 = z4 * -0.390180644;
 	; z3 += z5;  z4 += z5;
 	;
 	; (This implementation)
 	; z3 = z3 * (1.175875602 - 1.961570560) + z4 * 1.175875602;
 	; z4 = z3 * 1.175875602 + z4 * (1.175875602 - 0.390180644);
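 	;
 	; Reading note: both right-hand sides above use the z3 and z4
 	; produced by the two paddw instructions, i.e. the values before
 	; either one is updated. Worked check with CONST_BITS == 13 and,
 	; purely for illustration, z3 = 100, z4 = 50:
 	;   original:  z5 = 150 * 9633          = 1444950
 	;              z3 = 100 * -16069 + z5   = -161950
 	;              z4 =  50 * -3196  + z5   = 1285150
 	;   this impl: z3 = 100 * (9633-16069) + 50 * 9633        = -161950
 	;              z4 = 100 * 9633         + 50 * (9633-3196) = 1285150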
 	movdqa    xmm7,xmm6
 	movdqa    xmm4,xmm6
 	punpcklwd xmm7,xmm0
 	punpckhwd xmm4,xmm0
 	movdqa    xmm6,xmm7
 	movdqa    xmm0,xmm4
 	pmaddwd   xmm7,[GOTOFF(ebx,PW_MF078_F117)]	; xmm7=z3L
 	pmaddwd   xmm4,[GOTOFF(ebx,PW_MF078_F117)]	; xmm4=z3H
 	pmaddwd   xmm6,[GOTOFF(ebx,PW_F117_F078)]	; xmm6=z4L
 	pmaddwd   xmm0,[GOTOFF(ebx,PW_F117_F078)]	; xmm0=z4H
 	movdqa	XMMWORD [wk(0)], xmm7	; wk(0)=z3L
 	movdqa	XMMWORD [wk(1)], xmm4	; wk(1)=z3H
 	; (Original)
 	; z1 = tmp4 + tmp7;  z2 = tmp5 + tmp6;
 	; tmp4 = tmp4 * 0.298631336;  tmp5 = tmp5 * 2.053119869;
 	; tmp6 = tmp6 * 3.072711026;  tmp7 = tmp7 * 1.501321110;
 	; z1 = z1 * -0.899976223;  z2 = z2 * -2.562915447;
 	; data7 = tmp4 + z1 + z3;  data5 = tmp5 + z2 + z4;
 	; data3 = tmp6 + z2 + z3;  data1 = tmp7 + z1 + z4;
 	;
 	; (This implementation)
 	; tmp4 = tmp4 * (0.298631336 - 0.899976223) + tmp7 * -0.899976223;
 	; tmp5 = tmp5 * (2.053119869 - 2.562915447) + tmp6 * -2.562915447;
 	; tmp6 = tmp5 * -2.562915447 + tmp6 * (3.072711026 - 2.562915447);
 	; tmp7 = tmp4 * -0.899976223 + tmp7 * (1.501321110 - 0.899976223);
 	; data7 = tmp4 + z3;  data5 = tmp5 + z4;
 	; data3 = tmp6 + z3;  data1 = tmp7 + z4;
 	movdqa    xmm7,xmm2
 	movdqa    xmm4,xmm2
 	punpcklwd xmm7,xmm1
 	punpckhwd xmm4,xmm1
 	movdqa    xmm2,xmm7
 	movdqa    xmm1,xmm4
 	pmaddwd   xmm7,[GOTOFF(ebx,PW_MF060_MF089)]	; xmm7=tmp4L
 	pmaddwd   xmm4,[GOTOFF(ebx,PW_MF060_MF089)]	; xmm4=tmp4H
 	pmaddwd   xmm2,[GOTOFF(ebx,PW_MF089_F060)]	; xmm2=tmp7L
 	pmaddwd   xmm1,[GOTOFF(ebx,PW_MF089_F060)]	; xmm1=tmp7H
 	paddd	xmm7, XMMWORD [wk(0)]	; xmm7=data7L
 	paddd	xmm4, XMMWORD [wk(1)]	; xmm4=data7H
 	paddd	xmm2,xmm6		; xmm2=data1L
 	paddd	xmm1,xmm0		; xmm1=data1H
 	paddd	xmm7,[GOTOFF(ebx,PD_DESCALE_P1)]
 	paddd	xmm4,[GOTOFF(ebx,PD_DESCALE_P1)]
 	psrad	xmm7,DESCALE_P1
 	psrad	xmm4,DESCALE_P1
 	paddd	xmm2,[GOTOFF(ebx,PD_DESCALE_P1)]
 	paddd	xmm1,[GOTOFF(ebx,PD_DESCALE_P1)]
 	psrad	xmm2,DESCALE_P1
 	psrad	xmm1,DESCALE_P1
 	packssdw  xmm7,xmm4		; xmm7=data7
 	packssdw  xmm2,xmm1		; xmm2=data1
 	movdqa    xmm4,xmm5
 	movdqa    xmm1,xmm5
 	punpcklwd xmm4,xmm3
 	punpckhwd xmm1,xmm3
 	movdqa    xmm5,xmm4
 	movdqa    xmm3,xmm1
 	pmaddwd   xmm4,[GOTOFF(ebx,PW_MF050_MF256)]	; xmm4=tmp5L
 	pmaddwd   xmm1,[GOTOFF(ebx,PW_MF050_MF256)]	; xmm1=tmp5H
 	pmaddwd   xmm5,[GOTOFF(ebx,PW_MF256_F050)]	; xmm5=tmp6L
 	pmaddwd   xmm3,[GOTOFF(ebx,PW_MF256_F050)]	; xmm3=tmp6H
 	paddd	xmm4,xmm6		; xmm4=data5L
 	paddd	xmm1,xmm0		; xmm1=data5H
 	paddd	xmm5, XMMWORD [wk(0)]	; xmm5=data3L
 	paddd	xmm3, XMMWORD [wk(1)]	; xmm3=data3H
 	paddd	xmm4,[GOTOFF(ebx,PD_DESCALE_P1)]
 	paddd	xmm1,[GOTOFF(ebx,PD_DESCALE_P1)]
 	psrad	xmm4,DESCALE_P1
 	psrad	xmm1,DESCALE_P1
 	paddd	xmm5,[GOTOFF(ebx,PD_DESCALE_P1)]
 	paddd	xmm3,[GOTOFF(ebx,PD_DESCALE_P1)]
 	psrad	xmm5,DESCALE_P1
 	psrad	xmm3,DESCALE_P1
 	packssdw  xmm4,xmm1		; xmm4=data5
 	packssdw  xmm5,xmm3		; xmm5=data3
 	; ---- Pass 2: process columns.
 ;	mov	edx, POINTER [data(eax)]	; (DCTELEM *)
 	movdqa	xmm6, XMMWORD [wk(2)]	; xmm6=col0
 	movdqa	xmm0, XMMWORD [wk(4)]	; xmm0=col2
 	; xmm6=(00 10 20 30 40 50 60 70), xmm0=(02 12 22 32 42 52 62 72)
 	; xmm2=(01 11 21 31 41 51 61 71), xmm5=(03 13 23 33 43 53 63 73)
 	movdqa    xmm1,xmm6		; transpose coefficients(phase 1)
 	punpcklwd xmm6,xmm2		; xmm6=(00 01 10 11 20 21 30 31)
 	punpckhwd xmm1,xmm2		; xmm1=(40 41 50 51 60 61 70 71)
 	movdqa    xmm3,xmm0		; transpose coefficients(phase 1)
 	punpcklwd xmm0,xmm5		; xmm0=(02 03 12 13 22 23 32 33)
 	punpckhwd xmm3,xmm5		; xmm3=(42 43 52 53 62 63 72 73)
 	movdqa	xmm2, XMMWORD [wk(3)]	; xmm2=col4
 	movdqa	xmm5, XMMWORD [wk(5)]	; xmm5=col6
 	; xmm2=(04 14 24 34 44 54 64 74), xmm5=(06 16 26 36 46 56 66 76)
 	; xmm4=(05 15 25 35 45 55 65 75), xmm7=(07 17 27 37 47 57 67 77)
 	movdqa	XMMWORD [wk(0)], xmm0	; wk(0)=(02 03 12 13 22 23 32 33)
 	movdqa	XMMWORD [wk(1)], xmm3	; wk(1)=(42 43 52 53 62 63 72 73)
 	movdqa    xmm0,xmm2		; transpose coefficients(phase 1)
 	punpcklwd xmm2,xmm4		; xmm2=(04 05 14 15 24 25 34 35)
 	punpckhwd xmm0,xmm4		; xmm0=(44 45 54 55 64 65 74 75)
 	movdqa    xmm3,xmm5		; transpose coefficients(phase 1)
 	punpcklwd xmm5,xmm7		; xmm5=(06 07 16 17 26 27 36 37)
 	punpckhwd xmm3,xmm7		; xmm3=(46 47 56 57 66 67 76 77)
 	movdqa    xmm4,xmm2		; transpose coefficients(phase 2)
 	punpckldq xmm2,xmm5		; xmm2=(04 05 06 07 14 15 16 17)
 	punpckhdq xmm4,xmm5		; xmm4=(24 25 26 27 34 35 36 37)
 	movdqa    xmm7,xmm0		; transpose coefficients(phase 2)
 	punpckldq xmm0,xmm3		; xmm0=(44 45 46 47 54 55 56 57)
 	punpckhdq xmm7,xmm3		; xmm7=(64 65 66 67 74 75 76 77)
 	movdqa	xmm5, XMMWORD [wk(0)]	; xmm5=(02 03 12 13 22 23 32 33)
 	movdqa	xmm3, XMMWORD [wk(1)]	; xmm3=(42 43 52 53 62 63 72 73)
 	movdqa	XMMWORD [wk(2)], xmm4	; wk(2)=(24 25 26 27 34 35 36 37)
 	movdqa	XMMWORD [wk(3)], xmm0	; wk(3)=(44 45 46 47 54 55 56 57)
 	movdqa    xmm4,xmm6		; transpose coefficients(phase 2)
 	punpckldq xmm6,xmm5		; xmm6=(00 01 02 03 10 11 12 13)
 	punpckhdq xmm4,xmm5		; xmm4=(20 21 22 23 30 31 32 33)
 	movdqa    xmm0,xmm1		; transpose coefficients(phase 2)
 	punpckldq xmm1,xmm3		; xmm1=(40 41 42 43 50 51 52 53)
 	punpckhdq xmm0,xmm3		; xmm0=(60 61 62 63 70 71 72 73)
 	movdqa     xmm5,xmm6		; transpose coefficients(phase 3)
 	punpcklqdq xmm6,xmm2		; xmm6=(00 01 02 03 04 05 06 07)=data0
 	punpckhqdq xmm5,xmm2		; xmm5=(10 11 12 13 14 15 16 17)=data1
 	movdqa     xmm3,xmm0		; transpose coefficients(phase 3)
 	punpcklqdq xmm0,xmm7		; xmm0=(60 61 62 63 64 65 66 67)=data6
 	punpckhqdq xmm3,xmm7		; xmm3=(70 71 72 73 74 75 76 77)=data7
 	movdqa	xmm2,xmm5
 	movdqa	xmm7,xmm6
 	psubw	xmm5,xmm0		; xmm5=data1-data6=tmp6
 	psubw	xmm6,xmm3		; xmm6=data0-data7=tmp7
 	paddw	xmm2,xmm0		; xmm2=data1+data6=tmp1
 	paddw	xmm7,xmm3		; xmm7=data0+data7=tmp0
 	movdqa	xmm0, XMMWORD [wk(2)]	; xmm0=(24 25 26 27 34 35 36 37)
 	movdqa	xmm3, XMMWORD [wk(3)]	; xmm3=(44 45 46 47 54 55 56 57)
 	movdqa	XMMWORD [wk(0)], xmm5	; wk(0)=tmp6
 	movdqa	XMMWORD [wk(1)], xmm6	; wk(1)=tmp7
 	movdqa     xmm5,xmm4		; transpose coefficients(phase 3)
 	punpcklqdq xmm4,xmm0		; xmm4=(20 21 22 23 24 25 26 27)=data2
 	punpckhqdq xmm5,xmm0		; xmm5=(30 31 32 33 34 35 36 37)=data3
 	movdqa     xmm6,xmm1		; transpose coefficients(phase 3)
 	punpcklqdq xmm1,xmm3		; xmm1=(40 41 42 43 44 45 46 47)=data4
 	punpckhqdq xmm6,xmm3		; xmm6=(50 51 52 53 54 55 56 57)=data5
 	movdqa	xmm0,xmm5
 	movdqa	xmm3,xmm4
 	paddw	xmm5,xmm1		; xmm5=data3+data4=tmp3
 	paddw	xmm4,xmm6		; xmm4=data2+data5=tmp2
 	psubw	xmm0,xmm1		; xmm0=data3-data4=tmp4
 	psubw	xmm3,xmm6		; xmm3=data2-data5=tmp5
 	; -- Even part
 	movdqa	xmm1,xmm7
 	movdqa	xmm6,xmm2
 	paddw	xmm7,xmm5		; xmm7=tmp10
 	paddw	xmm2,xmm4		; xmm2=tmp11
 	psubw	xmm1,xmm5		; xmm1=tmp13
 	psubw	xmm6,xmm4		; xmm6=tmp12
 	movdqa	xmm5,xmm7
 	paddw	xmm7,xmm2		; xmm7=tmp10+tmp11
 	psubw	xmm5,xmm2		; xmm5=tmp10-tmp11
 	paddw	xmm7,[GOTOFF(ebx,PW_DESCALE_P2X)]
 	paddw	xmm5,[GOTOFF(ebx,PW_DESCALE_P2X)]
 	psraw	xmm7,PASS1_BITS		; xmm7=data0
 	psraw	xmm5,PASS1_BITS		; xmm5=data4
 	movdqa	XMMWORD [XMMBLOCK(0,0,edx,SIZEOF_DCTELEM)], xmm7
 	movdqa	XMMWORD [XMMBLOCK(4,0,edx,SIZEOF_DCTELEM)], xmm5
 	; (Original)
 	; z1 = (tmp12 + tmp13) * 0.541196100;
 	; data2 = z1 + tmp13 * 0.765366865;
 	; data6 = z1 + tmp12 * -1.847759065;
 	;
 	; (This implementation)
 	; data2 = tmp13 * (0.541196100 + 0.765366865) + tmp12 * 0.541196100;
 	; data6 = tmp13 * 0.541196100 + tmp12 * (0.541196100 - 1.847759065);
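 	;
 	; (How the code below evaluates this; a reading of the instruction
 	; sequence, assuming PW_F130_F054 and PW_F054_MF130, which are defined
 	; outside this excerpt, hold the constant pairs their names suggest.)
 	; punpcklwd/punpckhwd interleave tmp13 and tmp12 into (tmp13,tmp12)
 	; word pairs.  pmaddwd multiplies each pair by a pair of 16-bit
 	; constants and adds the two 32-bit products, so one instruction gives
 	;   dword = tmp13 * c0 + tmp12 * c1
 	; in four lanes at once.  Adding PD_DESCALE_P2 before the arithmetic
 	; shift by DESCALE_P2 presumably rounds to nearest instead of truncating.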
 	movdqa    xmm4,xmm1		; xmm1=tmp13
 	movdqa    xmm2,xmm1
 	punpcklwd xmm4,xmm6		; xmm6=tmp12
 	punpckhwd xmm2,xmm6
 	movdqa    xmm1,xmm4
 	movdqa    xmm6,xmm2
 	pmaddwd   xmm4,[GOTOFF(ebx,PW_F130_F054)]	; xmm4=data2L
 	pmaddwd   xmm2,[GOTOFF(ebx,PW_F130_F054)]	; xmm2=data2H
 	pmaddwd   xmm1,[GOTOFF(ebx,PW_F054_MF130)]	; xmm1=data6L
 	pmaddwd   xmm6,[GOTOFF(ebx,PW_F054_MF130)]	; xmm6=data6H
 	paddd	xmm4,[GOTOFF(ebx,PD_DESCALE_P2)]
 	paddd	xmm2,[GOTOFF(ebx,PD_DESCALE_P2)]
 	psrad	xmm4,DESCALE_P2
 	psrad	xmm2,DESCALE_P2
 	paddd	xmm1,[GOTOFF(ebx,PD_DESCALE_P2)]
 	paddd	xmm6,[GOTOFF(ebx,PD_DESCALE_P2)]
 	psrad	xmm1,DESCALE_P2
 	psrad	xmm6,DESCALE_P2
 	packssdw  xmm4,xmm2		; xmm4=data2
 	packssdw  xmm1,xmm6		; xmm1=data6
 	movdqa	XMMWORD [XMMBLOCK(2,0,edx,SIZEOF_DCTELEM)], xmm4
 	movdqa	XMMWORD [XMMBLOCK(6,0,edx,SIZEOF_DCTELEM)], xmm1
 	; -- Odd part
 	movdqa	xmm7, XMMWORD [wk(0)]	; xmm7=tmp6
 	movdqa	xmm5, XMMWORD [wk(1)]	; xmm5=tmp7
 	movdqa	xmm2,xmm0		; xmm0=tmp4
 	movdqa	xmm6,xmm3		; xmm3=tmp5
 	paddw	xmm2,xmm7		; xmm2=z3
 	paddw	xmm6,xmm5		; xmm6=z4
 	; (Original)
 	; z5 = (z3 + z4) * 1.175875602;
 	; z3 = z3 * -1.961570560;  z4 = z4 * -0.390180644;
 	; z3 += z5;  z4 += z5;
 	;
 	; (This implementation)
 	; z3 = z3 * (1.175875602 - 1.961570560) + z4 * 1.175875602;
 	; z4 = z3 * 1.175875602 + z4 * (1.175875602 - 0.390180644);
 	movdqa    xmm4,xmm2
 	movdqa    xmm1,xmm2
 	punpcklwd xmm4,xmm6
 	punpckhwd xmm1,xmm6
 	movdqa    xmm2,xmm4
 	movdqa    xmm6,xmm1
 	pmaddwd   xmm4,[GOTOFF(ebx,PW_MF078_F117)]	; xmm4=z3L
 	pmaddwd   xmm1,[GOTOFF(ebx,PW_MF078_F117)]	; xmm1=z3H
 	pmaddwd   xmm2,[GOTOFF(ebx,PW_F117_F078)]	; xmm2=z4L
 	pmaddwd   xmm6,[GOTOFF(ebx,PW_F117_F078)]	; xmm6=z4H
 	movdqa	XMMWORD [wk(0)], xmm4	; wk(0)=z3L
 	movdqa	XMMWORD [wk(1)], xmm1	; wk(1)=z3H
 	; (Original)
 	; z1 = tmp4 + tmp7;  z2 = tmp5 + tmp6;
 	; tmp4 = tmp4 * 0.298631336;  tmp5 = tmp5 * 2.053119869;
 	; tmp6 = tmp6 * 3.072711026;  tmp7 = tmp7 * 1.501321110;
 	; z1 = z1 * -0.899976223;  z2 = z2 * -2.562915447;
 	; data7 = tmp4 + z1 + z3;  data5 = tmp5 + z2 + z4;
 	; data3 = tmp6 + z2 + z3;  data1 = tmp7 + z1 + z4;
 	;
 	; (This implementation)
 	; tmp4 = tmp4 * (0.298631336 - 0.899976223) + tmp7 * -0.899976223;
 	; tmp5 = tmp5 * (2.053119869 - 2.562915447) + tmp6 * -2.562915447;
 	; tmp6 = tmp5 * -2.562915447 + tmp6 * (3.072711026 - 2.562915447);
 	; tmp7 = tmp4 * -0.899976223 + tmp7 * (1.501321110 - 0.899976223);
 	; data7 = tmp4 + z3;  data5 = tmp5 + z4;
 	; data3 = tmp6 + z3;  data1 = tmp7 + z4;
 	movdqa    xmm4,xmm0
 	movdqa    xmm1,xmm0
 	punpcklwd xmm4,xmm5
 	punpckhwd xmm1,xmm5
 	movdqa    xmm0,xmm4
 	movdqa    xmm5,xmm1
 	pmaddwd   xmm4,[GOTOFF(ebx,PW_MF060_MF089)]	; xmm4=tmp4L
 	pmaddwd   xmm1,[GOTOFF(ebx,PW_MF060_MF089)]	; xmm1=tmp4H
 	pmaddwd   xmm0,[GOTOFF(ebx,PW_MF089_F060)]	; xmm0=tmp7L
 	pmaddwd   xmm5,[GOTOFF(ebx,PW_MF089_F060)]	; xmm5=tmp7H
 	paddd	xmm4, XMMWORD [wk(0)]	; xmm4=data7L
 	paddd	xmm1, XMMWORD [wk(1)]	; xmm1=data7H
 	paddd	xmm0,xmm2		; xmm0=data1L
 	paddd	xmm5,xmm6		; xmm5=data1H
 	paddd	xmm4,[GOTOFF(ebx,PD_DESCALE_P2)]
 	paddd	xmm1,[GOTOFF(ebx,PD_DESCALE_P2)]
 	psrad	xmm4,DESCALE_P2
 	psrad	xmm1,DESCALE_P2
 	paddd	xmm0,[GOTOFF(ebx,PD_DESCALE_P2)]
 	paddd	xmm5,[GOTOFF(ebx,PD_DESCALE_P2)]
 	psrad	xmm0,DESCALE_P2
 	psrad	xmm5,DESCALE_P2
 	packssdw  xmm4,xmm1		; xmm4=data7
 	packssdw  xmm0,xmm5		; xmm0=data1
 	movdqa	XMMWORD [XMMBLOCK(7,0,edx,SIZEOF_DCTELEM)], xmm4
 	movdqa	XMMWORD [XMMBLOCK(1,0,edx,SIZEOF_DCTELEM)], xmm0
 	movdqa    xmm1,xmm3
 	movdqa    xmm5,xmm3
 	punpcklwd xmm1,xmm7
 	punpckhwd xmm5,xmm7
 	movdqa    xmm3,xmm1
 	movdqa    xmm7,xmm5
 	pmaddwd   xmm1,[GOTOFF(ebx,PW_MF050_MF256)]	; xmm1=tmp5L
 	pmaddwd   xmm5,[GOTOFF(ebx,PW_MF050_MF256)]	; xmm5=tmp5H
 	pmaddwd   xmm3,[GOTOFF(ebx,PW_MF256_F050)]	; xmm3=tmp6L
 	pmaddwd   xmm7,[GOTOFF(ebx,PW_MF256_F050)]	; xmm7=tmp6H
 	paddd	xmm1,xmm2		; xmm1=data5L
 	paddd	xmm5,xmm6		; xmm5=data5H
 	paddd	xmm3, XMMWORD [wk(0)]	; xmm3=data3L
 	paddd	xmm7, XMMWORD [wk(1)]	; xmm7=data3H
 	paddd	xmm1,[GOTOFF(ebx,PD_DESCALE_P2)]
 	paddd	xmm5,[GOTOFF(ebx,PD_DESCALE_P2)]
 	psrad	xmm1,DESCALE_P2
 	psrad	xmm5,DESCALE_P2
 	paddd	xmm3,[GOTOFF(ebx,PD_DESCALE_P2)]
 	paddd	xmm7,[GOTOFF(ebx,PD_DESCALE_P2)]
 	psrad	xmm3,DESCALE_P2
 	psrad	xmm7,DESCALE_P2
 	packssdw  xmm1,xmm5		; xmm1=data5
 	packssdw  xmm3,xmm7		; xmm3=data3
 	movdqa	XMMWORD [XMMBLOCK(5,0,edx,SIZEOF_DCTELEM)], xmm1
 	movdqa	XMMWORD [XMMBLOCK(3,0,edx,SIZEOF_DCTELEM)], xmm3
 ;	pop	edi		; unused
 ;	pop	esi		; unused
 ;	pop	edx		; need not be preserved
 ;	pop	ecx		; unused
 	poppic	ebx
 	mov	esp,ebp		; esp <- aligned ebp
 	pop	esp		; esp <- original ebp
 	pop	ebp
 	ret
 %endif ; JFDCT_INT_SSE2_SUPPORTED
 %endif ; DCT_ISLOW_SUPPORTED
--- a/jfsseflt.asm
+++ b/jfsseflt.asm
@@ -0,0 +1,383 @@
 ;
 ; jfsseflt.asm - floating-point FDCT (SSE)
 ;
 ; x86 SIMD extension for IJG JPEG library
 ; Copyright (C) 1999-2006, MIYASAKA Masaru.
 ; For conditions of distribution and use, see copyright notice in jsimdext.inc
 ;
 ; This file should be assembled with NASM (Netwide Assembler); it
 ; can *not* be assembled with Microsoft's MASM or any compatible
 ; assembler (including Borland's Turbo Assembler).
 ; NASM is available from http://nasm.sourceforge.net/ or
 ; http://sourceforge.net/project/showfiles.php?group_id=6208
 ;
 ; This file contains a floating-point implementation of the forward DCT
 ; (Discrete Cosine Transform). The following code is based directly on
 ; the IJG's original jfdctflt.c; see jfdctflt.c for more details.
 ;
 ; Last Modified : February 4, 2006
 ;
 ; [TAB8]
 %include "jsimdext.inc"
 %include "jdct.inc"
 %ifdef DCT_FLOAT_SUPPORTED
 %ifdef JFDCT_FLT_SSE_MMX_SUPPORTED
 %define JFDCT_FLT_SSE_SUPPORTED
 %endif
 %ifdef JFDCT_FLT_SSE_SSE2_SUPPORTED
 %define JFDCT_FLT_SSE_SUPPORTED
 %endif
 %ifdef JFDCT_FLT_SSE_SUPPORTED
 ; This module is specialized to the case DCTSIZE = 8.
 ;
 %if DCTSIZE != 8
 %error "Sorry, this code only copes with 8x8 DCTs."
 %endif
 ; --------------------------------------------------------------------------
 %macro	unpcklps2 2	; %1=(0 1 2 3) / %2=(4 5 6 7) => %1=(0 1 4 5)
 	shufps	%1,%2,0x44
 %endmacro
 %macro	unpckhps2 2	; %1=(0 1 2 3) / %2=(4 5 6 7) => %1=(2 3 6 7)
 	shufps	%1,%2,0xEE
 %endmacro
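 ; Reading the shufps immediates above: shufps takes result elements 0-1 from
 ; the destination and elements 2-3 from the source, with two selector bits
 ; per element (lowest bits first).  0x44 = 01 00 01 00b selects
 ; (dst0 dst1 src0 src1); 0xEE = 11 10 11 10b selects (dst2 dst3 src2 src3).
 ; The two macros therefore play the role that punpcklqdq/punpckhqdq play
 ; in the SSE2 integer code: they combine the low (or high) 64-bit halves
 ; of both operands.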
 ; --------------------------------------------------------------------------
 	SECTION	SEG_CONST
 	alignz	16
 	global	EXTN(jconst_fdct_float_sse)
 EXTN(jconst_fdct_float_sse):
 PD_0_382	times 4 dd  0.382683432365089771728460
 PD_0_707	times 4 dd  0.707106781186547524400844
 PD_0_541	times 4 dd  0.541196100146196984399723
 PD_1_306	times 4 dd  1.306562964876376527856643
 	alignz	16
 ; --------------------------------------------------------------------------
 	SECTION	SEG_TEXT
 	BITS	32
 ;
 ; Perform the forward DCT on one block of samples.
 ;
 ; GLOBAL(void)
 ; jpeg_fdct_float_sse (FAST_FLOAT * data)
 ;
 %define data(b)		(b)+8		; FAST_FLOAT * data
 %define original_ebp	ebp+0
 %define wk(i)		ebp-(WK_NUM-(i))*SIZEOF_XMMWORD	; xmmword wk[WK_NUM]
 %define WK_NUM		2
 	align	16
 	global	EXTN(jpeg_fdct_float_sse)
 EXTN(jpeg_fdct_float_sse):
 	push	ebp
 	mov	eax,esp				; eax = original ebp
 	sub	esp, byte 4
 	and	esp, byte (-SIZEOF_XMMWORD)	; align to 128 bits
 	mov	[esp],eax
 	mov	ebp,esp				; ebp = aligned ebp
 	lea	esp, [wk(0)]
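 	; (Stack frame at this point, as a reading of the prologue above.)
 	; eax holds the stack address where the caller's ebp was pushed, so
 	; [eax+4] is the return address and data(eax) = [eax+8] is the first
 	; argument.  ebp is (eax-4) rounded down to a 16-byte boundary, and
 	; [ebp] keeps eax so that the epilogue can undo the alignment with
 	; "mov esp,ebp / pop esp / pop ebp".  The wk[] scratch slots sit
 	; directly below ebp and are therefore 16-byte aligned, as movaps
 	; requires.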
 	pushpic	ebx
 ;	push	ecx		; need not be preserved
 ;	push	edx		; need not be preserved
 ;	push	esi		; unused
 ;	push	edi		; unused
 	get_GOT	ebx		; get GOT address
 	; ---- Pass 1: process rows.
 	mov	edx, POINTER [data(eax)]	; (FAST_FLOAT *)
 	mov	ecx, DCTSIZE/4
 	alignx	16,7
 .rowloop:
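 	; Data flow, as inferred from the register comments below:
 	; each iteration handles four rows.  The 4x8 strip is loaded as 4x4
 	; tiles and transposed in two phases (unpcklps/unpckhps, then the
 	; shufps-based unpcklps2/unpckhps2 macros) so that every register
 	; holds one column position for four rows, e.g. xmm6=(00 10 20 30)=data0.
 	; The 1-D DCT then runs on four rows in parallel, one row per lane.
 	; Results are stored without transposing back, so after pass 1 each
 	; 4x4 tile is held transposed; pass 2 loads the block in that layout
 	; and its own transpose restores the natural order.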
 	movaps	xmm0, XMMWORD [XMMBLOCK(2,0,edx,SIZEOF_FAST_FLOAT)]
 	movaps	xmm1, XMMWORD [XMMBLOCK(3,0,edx,SIZEOF_FAST_FLOAT)]
 	movaps	xmm2, XMMWORD [XMMBLOCK(2,1,edx,SIZEOF_FAST_FLOAT)]
 	movaps	xmm3, XMMWORD [XMMBLOCK(3,1,edx,SIZEOF_FAST_FLOAT)]
 	; xmm0=(20 21 22 23), xmm2=(24 25 26 27)
 	; xmm1=(30 31 32 33), xmm3=(34 35 36 37)
 	movaps   xmm4,xmm0		; transpose coefficients(phase 1)
 	unpcklps xmm0,xmm1		; xmm0=(20 30 21 31)
 	unpckhps xmm4,xmm1		; xmm4=(22 32 23 33)
 	movaps   xmm5,xmm2		; transpose coefficients(phase 1)
 	unpcklps xmm2,xmm3		; xmm2=(24 34 25 35)
 	unpckhps xmm5,xmm3		; xmm5=(26 36 27 37)
 	movaps	xmm6, XMMWORD [XMMBLOCK(0,0,edx,SIZEOF_FAST_FLOAT)]
 	movaps	xmm7, XMMWORD [XMMBLOCK(1,0,edx,SIZEOF_FAST_FLOAT)]
 	movaps	xmm1, XMMWORD [XMMBLOCK(0,1,edx,SIZEOF_FAST_FLOAT)]
 	movaps	xmm3, XMMWORD [XMMBLOCK(1,1,edx,SIZEOF_FAST_FLOAT)]
 	; xmm6=(00 01 02 03), xmm1=(04 05 06 07)
 	; xmm7=(10 11 12 13), xmm3=(14 15 16 17)
 	movaps	XMMWORD [wk(0)], xmm4	; wk(0)=(22 32 23 33)
 	movaps	XMMWORD [wk(1)], xmm2	; wk(1)=(24 34 25 35)
 	movaps   xmm4,xmm6		; transpose coefficients(phase 1)
 	unpcklps xmm6,xmm7		; xmm6=(00 10 01 11)
 	unpckhps xmm4,xmm7		; xmm4=(02 12 03 13)
 	movaps   xmm2,xmm1		; transpose coefficients(phase 1)
 	unpcklps xmm1,xmm3		; xmm1=(04 14 05 15)
 	unpckhps xmm2,xmm3		; xmm2=(06 16 07 17)
 	movaps    xmm7,xmm6		; transpose coefficients(phase 2)
 	unpcklps2 xmm6,xmm0		; xmm6=(00 10 20 30)=data0
 	unpckhps2 xmm7,xmm0		; xmm7=(01 11 21 31)=data1
 	movaps    xmm3,xmm2		; transpose coefficients(phase 2)
 	unpcklps2 xmm2,xmm5		; xmm2=(06 16 26 36)=data6
 	unpckhps2 xmm3,xmm5		; xmm3=(07 17 27 37)=data7
 	movaps	xmm0,xmm7
 	movaps	xmm5,xmm6
 	subps	xmm7,xmm2		; xmm7=data1-data6=tmp6
 	subps	xmm6,xmm3		; xmm6=data0-data7=tmp7
 	addps	xmm0,xmm2		; xmm0=data1+data6=tmp1
 	addps	xmm5,xmm3		; xmm5=data0+data7=tmp0
 	movaps	xmm2, XMMWORD [wk(0)]	; xmm2=(22 32 23 33)
 	movaps	xmm3, XMMWORD [wk(1)]	; xmm3=(24 34 25 35)
 	movaps	XMMWORD [wk(0)], xmm7	; wk(0)=tmp6
 	movaps	XMMWORD [wk(1)], xmm6	; wk(1)=tmp7
 	movaps    xmm7,xmm4		; transpose coefficients(phase 2)
 	unpcklps2 xmm4,xmm2		; xmm4=(02 12 22 32)=data2
 	unpckhps2 xmm7,xmm2		; xmm7=(03 13 23 33)=data3
 	movaps    xmm6,xmm1		; transpose coefficients(phase 2)
 	unpcklps2 xmm1,xmm3		; xmm1=(04 14 24 34)=data4
 	unpckhps2 xmm6,xmm3		; xmm6=(05 15 25 35)=data5
 	movaps	xmm2,xmm7
 	movaps	xmm3,xmm4
 	addps	xmm7,xmm1		; xmm7=data3+data4=tmp3
 	addps	xmm4,xmm6		; xmm4=data2+data5=tmp2
 	subps	xmm2,xmm1		; xmm2=data3-data4=tmp4
 	subps	xmm3,xmm6		; xmm3=data2-data5=tmp5
 	; -- Even part
 	movaps	xmm1,xmm5
 	movaps	xmm6,xmm0
 	subps	xmm5,xmm7		; xmm5=tmp13
 	subps	xmm0,xmm4		; xmm0=tmp12
 	addps	xmm1,xmm7		; xmm1=tmp10
 	addps	xmm6,xmm4		; xmm6=tmp11
 	addps	xmm0,xmm5
 	mulps	xmm0,[GOTOFF(ebx,PD_0_707)] ; xmm0=z1
 	movaps	xmm7,xmm1
 	movaps	xmm4,xmm5
 	subps	xmm1,xmm6		; xmm1=data4
 	subps	xmm5,xmm0		; xmm5=data6
 	addps	xmm7,xmm6		; xmm7=data0
 	addps	xmm4,xmm0		; xmm4=data2
 	movaps	XMMWORD [XMMBLOCK(0,1,edx,SIZEOF_FAST_FLOAT)], xmm1
 	movaps	XMMWORD [XMMBLOCK(2,1,edx,SIZEOF_FAST_FLOAT)], xmm5
 	movaps	XMMWORD [XMMBLOCK(0,0,edx,SIZEOF_FAST_FLOAT)], xmm7
 	movaps	XMMWORD [XMMBLOCK(2,0,edx,SIZEOF_FAST_FLOAT)], xmm4
 	; -- Odd part
 	movaps	xmm6, XMMWORD [wk(0)]	; xmm6=tmp6
 	movaps	xmm0, XMMWORD [wk(1)]	; xmm0=tmp7
 	addps	xmm2,xmm3		; xmm2=tmp10
 	addps	xmm3,xmm6		; xmm3=tmp11
 	addps	xmm6,xmm0		; xmm6=tmp12, xmm0=tmp7
 	mulps	xmm3,[GOTOFF(ebx,PD_0_707)] ; xmm3=z3
 	movaps	xmm1,xmm2		; xmm1=tmp10
 	subps	xmm2,xmm6
 	mulps	xmm2,[GOTOFF(ebx,PD_0_382)] ; xmm2=z5
 	mulps	xmm1,[GOTOFF(ebx,PD_0_541)] ; xmm1=MULTIPLY(tmp10,FIX_0_541196)
 	mulps	xmm6,[GOTOFF(ebx,PD_1_306)] ; xmm6=MULTIPLY(tmp12,FIX_1_306562)
 	addps	xmm1,xmm2		; xmm1=z2
 	addps	xmm6,xmm2		; xmm6=z4
 	movaps	xmm5,xmm0
 	subps	xmm0,xmm3		; xmm0=z13
 	addps	xmm5,xmm3		; xmm5=z11
 	movaps	xmm7,xmm0
 	movaps	xmm4,xmm5
 	subps	xmm0,xmm1		; xmm0=data3
 	subps	xmm5,xmm6		; xmm5=data7
 	addps	xmm7,xmm1		; xmm7=data5
 	addps	xmm4,xmm6		; xmm4=data1
 	movaps	XMMWORD [XMMBLOCK(3,0,edx,SIZEOF_FAST_FLOAT)], xmm0
 	movaps	XMMWORD [XMMBLOCK(3,1,edx,SIZEOF_FAST_FLOAT)], xmm5
 	movaps	XMMWORD [XMMBLOCK(1,1,edx,SIZEOF_FAST_FLOAT)], xmm7
 	movaps	XMMWORD [XMMBLOCK(1,0,edx,SIZEOF_FAST_FLOAT)], xmm4
 	add	edx, 4*DCTSIZE*SIZEOF_FAST_FLOAT
 	dec	ecx
 	jnz	near .rowloop
 	; ---- Pass 2: process columns.
 	mov	edx, POINTER [data(eax)]	; (FAST_FLOAT *)
 	mov	ecx, DCTSIZE/4
 	alignx	16,7
 .columnloop:
 	movaps	xmm0, XMMWORD [XMMBLOCK(2,0,edx,SIZEOF_FAST_FLOAT)]
 	movaps	xmm1, XMMWORD [XMMBLOCK(3,0,edx,SIZEOF_FAST_FLOAT)]
 	movaps	xmm2, XMMWORD [XMMBLOCK(6,0,edx,SIZEOF_FAST_FLOAT)]
 	movaps	xmm3, XMMWORD [XMMBLOCK(7,0,edx,SIZEOF_FAST_FLOAT)]
 	; xmm0=(02 12 22 32), xmm2=(42 52 62 72)
 	; xmm1=(03 13 23 33), xmm3=(43 53 63 73)
 	movaps   xmm4,xmm0		; transpose coefficients(phase 1)
 	unpcklps xmm0,xmm1		; xmm0=(02 03 12 13)
 	unpckhps xmm4,xmm1		; xmm4=(22 23 32 33)
 	movaps   xmm5,xmm2		; transpose coefficients(phase 1)
 	unpcklps xmm2,xmm3		; xmm2=(42 43 52 53)
 	unpckhps xmm5,xmm3		; xmm5=(62 63 72 73)
 	movaps	xmm6, XMMWORD [XMMBLOCK(0,0,edx,SIZEOF_FAST_FLOAT)]
 	movaps	xmm7, XMMWORD [XMMBLOCK(1,0,edx,SIZEOF_FAST_FLOAT)]
 	movaps	xmm1, XMMWORD [XMMBLOCK(4,0,edx,SIZEOF_FAST_FLOAT)]
 	movaps	xmm3, XMMWORD [XMMBLOCK(5,0,edx,SIZEOF_FAST_FLOAT)]
 	; xmm6=(00 10 20 30), xmm1=(40 50 60 70)
 	; xmm7=(01 11 21 31), xmm3=(41 51 61 71)
 	movaps	XMMWORD [wk(0)], xmm4	; wk(0)=(22 23 32 33)
 	movaps	XMMWORD [wk(1)], xmm2	; wk(1)=(42 43 52 53)
 	movaps   xmm4,xmm6		; transpose coefficients(phase 1)
 	unpcklps xmm6,xmm7		; xmm6=(00 01 10 11)
 	unpckhps xmm4,xmm7		; xmm4=(20 21 30 31)
 	movaps   xmm2,xmm1		; transpose coefficients(phase 1)
 	unpcklps xmm1,xmm3		; xmm1=(40 41 50 51)
 	unpckhps xmm2,xmm3		; xmm2=(60 61 70 71)
 	movaps    xmm7,xmm6		; transpose coefficients(phase 2)
 	unpcklps2 xmm6,xmm0		; xmm6=(00 01 02 03)=data0
 	unpckhps2 xmm7,xmm0		; xmm7=(10 11 12 13)=data1
 	movaps    xmm3,xmm2		; transpose coefficients(phase 2)
 	unpcklps2 xmm2,xmm5		; xmm2=(60 61 62 63)=data6
 	unpckhps2 xmm3,xmm5		; xmm3=(70 71 72 73)=data7
 	movaps	xmm0,xmm7
 	movaps	xmm5,xmm6
 	subps	xmm7,xmm2		; xmm7=data1-data6=tmp6
 	subps	xmm6,xmm3		; xmm6=data0-data7=tmp7
 	addps	xmm0,xmm2		; xmm0=data1+data6=tmp1
 	addps	xmm5,xmm3		; xmm5=data0+data7=tmp0
 	movaps	xmm2, XMMWORD [wk(0)]	; xmm2=(22 23 32 33)
 	movaps	xmm3, XMMWORD [wk(1)]	; xmm3=(42 43 52 53)
 	movaps	XMMWORD [wk(0)], xmm7	; wk(0)=tmp6
 	movaps	XMMWORD [wk(1)], xmm6	; wk(1)=tmp7
 	movaps    xmm7,xmm4		; transpose coefficients(phase 2)
 	unpcklps2 xmm4,xmm2		; xmm4=(20 21 22 23)=data2
 	unpckhps2 xmm7,xmm2		; xmm7=(30 31 32 33)=data3
 	movaps    xmm6,xmm1		; transpose coefficients(phase 2)
 	unpcklps2 xmm1,xmm3		; xmm1=(40 41 42 43)=data4
 	unpckhps2 xmm6,xmm3		; xmm6=(50 51 52 53)=data5
 	movaps	xmm2,xmm7
 	movaps	xmm3,xmm4
 	addps	xmm7,xmm1		; xmm7=data3+data4=tmp3
 	addps	xmm4,xmm6		; xmm4=data2+data5=tmp2
 	subps	xmm2,xmm1		; xmm2=data3-data4=tmp4
 	subps	xmm3,xmm6		; xmm3=data2-data5=tmp5
 	; -- Even part
 	movaps	xmm1,xmm5
 	movaps	xmm6,xmm0
 	subps	xmm5,xmm7		; xmm5=tmp13
 	subps	xmm0,xmm4		; xmm0=tmp12
 	addps	xmm1,xmm7		; xmm1=tmp10
 	addps	xmm6,xmm4		; xmm6=tmp11
 	addps	xmm0,xmm5
 	mulps	xmm0,[GOTOFF(ebx,PD_0_707)] ; xmm0=z1
 	movaps	xmm7,xmm1
 	movaps	xmm4,xmm5
 	subps	xmm1,xmm6		; xmm1=data4
 	subps	xmm5,xmm0		; xmm5=data6
 	addps	xmm7,xmm6		; xmm7=data0
 	addps	xmm4,xmm0		; xmm4=data2
 	movaps	XMMWORD [XMMBLOCK(4,0,edx,SIZEOF_FAST_FLOAT)], xmm1
 	movaps	XMMWORD [XMMBLOCK(6,0,edx,SIZEOF_FAST_FLOAT)], xmm5
 	movaps	XMMWORD [XMMBLOCK(0,0,edx,SIZEOF_FAST_FLOAT)], xmm7
 	movaps	XMMWORD [XMMBLOCK(2,0,edx,SIZEOF_FAST_FLOAT)], xmm4
 	; -- Odd part
 	movaps	xmm6, XMMWORD [wk(0)]	; xmm6=tmp6
 	movaps	xmm0, XMMWORD [wk(1)]	; xmm0=tmp7
 	addps	xmm2,xmm3		; xmm2=tmp10
 	addps	xmm3,xmm6		; xmm3=tmp11
 	addps	xmm6,xmm0		; xmm6=tmp12, xmm0=tmp7
 	mulps	xmm3,[GOTOFF(ebx,PD_0_707)] ; xmm3=z3
 	movaps	xmm1,xmm2		; xmm1=tmp10
 	subps	xmm2,xmm6
 	mulps	xmm2,[GOTOFF(ebx,PD_0_382)] ; xmm2=z5
 	mulps	xmm1,[GOTOFF(ebx,PD_0_541)] ; xmm1=MULTIPLY(tmp10,FIX_0_541196)
 	mulps	xmm6,[GOTOFF(ebx,PD_1_306)] ; xmm6=MULTIPLY(tmp12,FIX_1_306562)
 	addps	xmm1,xmm2		; xmm1=z2
 	addps	xmm6,xmm2		; xmm6=z4
 	movaps	xmm5,xmm0
 	subps	xmm0,xmm3		; xmm0=z13
 	addps	xmm5,xmm3		; xmm5=z11
 	movaps	xmm7,xmm0
 	movaps	xmm4,xmm5
 	subps	xmm0,xmm1		; xmm0=data3
 	subps	xmm5,xmm6		; xmm5=data7
 	addps	xmm7,xmm1		; xmm7=data5
 	addps	xmm4,xmm6		; xmm4=data1
 	movaps	XMMWORD [XMMBLOCK(3,0,edx,SIZEOF_FAST_FLOAT)], xmm0
 	movaps	XMMWORD [XMMBLOCK(7,0,edx,SIZEOF_FAST_FLOAT)], xmm5
 	movaps	XMMWORD [XMMBLOCK(5,0,edx,SIZEOF_FAST_FLOAT)], xmm7
 	movaps	XMMWORD [XMMBLOCK(1,0,edx,SIZEOF_FAST_FLOAT)], xmm4
 	add	edx, byte 4*SIZEOF_FAST_FLOAT
 	dec	ecx
 	jnz	near .columnloop
 ;	pop	edi		; unused
 ;	pop	esi		; unused
 ;	pop	edx		; need not be preserved
 ;	pop	ecx		; need not be preserved
 	poppic	ebx
 	mov	esp,ebp		; esp <- aligned ebp
 	pop	esp		; esp <- original ebp
 	pop	ebp
 	ret
 %endif ; JFDCT_FLT_SSE_SUPPORTED
 %endif ; DCT_FLOAT_SUPPORTED
--- a/ji3dnflt.asm
+++ b/ji3dnflt.asm
@@ -0,0 +1,462 @@
 ;
 ; ji3dnflt.asm - floating-point IDCT (3DNow! & MMX)
 ;
 ; x86 SIMD extension for IJG JPEG library
 ; Copyright (C) 1999-2006, MIYASAKA Masaru.
 ; For conditions of distribution and use, see copyright notice in jsimdext.inc
 ;
 ; This file should be assembled with NASM (Netwide Assembler); it
 ; can *not* be assembled with Microsoft's MASM or any compatible
 ; assembler (including Borland's Turbo Assembler).
 ; NASM is available from http://nasm.sourceforge.net/ or
 ; http://sourceforge.net/project/showfiles.php?group_id=6208
 ;
 ; This file contains a floating-point implementation of the inverse DCT
 ; (Discrete Cosine Transform). The following code is based directly on
 ; the IJG's original jidctflt.c; see jidctflt.c for more details.
 ;
 ; Last Modified : February 4, 2006
 ;
 ; [TAB8]
 %include "jsimdext.inc"
 %include "jdct.inc"
 %ifdef DCT_FLOAT_SUPPORTED
 %ifdef JIDCT_FLT_3DNOW_MMX_SUPPORTED
 ; This module is specialized to the case DCTSIZE = 8.
 ;
 %if DCTSIZE != 8
 %error "Sorry, this code only copes with 8x8 DCTs."
 %endif
 ; --------------------------------------------------------------------------
 	SECTION	SEG_CONST
 	alignz	16
 	global	EXTN(jconst_idct_float_3dnow)
 EXTN(jconst_idct_float_3dnow):
 PD_1_414	times 2 dd  1.414213562373095048801689
 PD_1_847	times 2 dd  1.847759065022573512256366
 PD_1_082	times 2 dd  1.082392200292393968799446
 PD_2_613	times 2 dd  2.613125929752753055713286
 PD_RNDINT_MAGIC	times 2 dd  100663296.0	; (float)(0x00C00000 << 3)
 PB_CENTERJSAMP	times 8 db  CENTERJSAMPLE
 	alignz	16
 ; --------------------------------------------------------------------------
 	SECTION	SEG_TEXT
 	BITS	32
 ;
 ; Perform dequantization and inverse DCT on one block of coefficients.
 ;
 ; GLOBAL(void)
 ; jpeg_idct_float_3dnow (j_decompress_ptr cinfo, jpeg_component_info * compptr,
 ;                        JCOEFPTR coef_block,
 ;                        JSAMPARRAY output_buf, JDIMENSION output_col)
 ;
 %define cinfo(b)	(b)+8		; j_decompress_ptr cinfo
 %define compptr(b)	(b)+12		; jpeg_component_info * compptr
 %define coef_block(b)	(b)+16		; JCOEFPTR coef_block
 %define output_buf(b)	(b)+20		; JSAMPARRAY output_buf
 %define output_col(b)	(b)+24		; JDIMENSION output_col
 %define original_ebp	ebp+0
 %define wk(i)		ebp-(WK_NUM-(i))*SIZEOF_MMWORD	; mmword wk[WK_NUM]
 %define WK_NUM		2
 %define workspace	wk(0)-DCTSIZE2*SIZEOF_FAST_FLOAT
 					; FAST_FLOAT workspace[DCTSIZE2]
 	align	16
 	global	EXTN(jpeg_idct_float_3dnow)
 EXTN(jpeg_idct_float_3dnow):
 	push	ebp
 	mov	eax,esp				; eax = original ebp
 	sub	esp, byte 4
 	and	esp, byte (-SIZEOF_MMWORD)	; align to 64 bits
 	mov	[esp],eax
 	mov	ebp,esp				; ebp = aligned ebp
 	lea	esp, [workspace]
 	push	ebx
 ;	push	ecx		; need not be preserved
 ;	push	edx		; need not be preserved
 	push	esi
 	push	edi
 	get_GOT	ebx		; get GOT address
 	; ---- Pass 1: process columns from input, store into work array.
 ;	mov	eax, [original_ebp]
 	mov	edx, POINTER [compptr(eax)]
 	mov	edx, POINTER [jcompinfo_dct_table(edx)]	; quantptr
 	mov	esi, JCOEFPTR [coef_block(eax)]		; inptr
 	lea	edi, [workspace]			; FAST_FLOAT * wsptr
 	mov	ecx, DCTSIZE/2				; ctr
 	alignx	16,7
 .columnloop:
 %ifndef NO_ZERO_COLUMN_TEST_FLOAT_3DNOW
 	mov	eax, DWORD [DWBLOCK(1,0,esi,SIZEOF_JCOEF)]
 	or	eax, DWORD [DWBLOCK(2,0,esi,SIZEOF_JCOEF)]
 	jnz	short .columnDCT
 	pushpic	ebx		; save GOT address
 	mov	ebx, DWORD [DWBLOCK(3,0,esi,SIZEOF_JCOEF)]
 	mov	eax, DWORD [DWBLOCK(4,0,esi,SIZEOF_JCOEF)]
 	or	ebx, DWORD [DWBLOCK(5,0,esi,SIZEOF_JCOEF)]
 	or	eax, DWORD [DWBLOCK(6,0,esi,SIZEOF_JCOEF)]
 	or	ebx, DWORD [DWBLOCK(7,0,esi,SIZEOF_JCOEF)]
 	or	eax,ebx
 	poppic	ebx		; restore GOT address
 	jnz	short .columnDCT
 	; -- AC terms all zero
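 	; With every AC coefficient of both columns zero, the column IDCT
 	; reduces to the dequantized DC term, so it is computed once and
 	; copied to all eight outputs of each column.  movd loads two 16-bit
 	; coefficients; "punpcklwd mm0,mm0" followed by an arithmetic right
 	; shift of (DWORD_BIT-WORD_BIT) bits duplicates each word and shifts
 	; it back down, which sign-extends both values to 32 bits before
 	; pi2fd converts them to float.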
 	movd      mm0, DWORD [DWBLOCK(0,0,esi,SIZEOF_JCOEF)]
 	punpcklwd mm0,mm0
 	psrad     mm0,(DWORD_BIT-WORD_BIT)
 	pi2fd     mm0,mm0
 	pfmul     mm0, MMWORD [MMBLOCK(0,0,edx,SIZEOF_FLOAT_MULT_TYPE)]
 	movq      mm1,mm0
 	punpckldq mm0,mm0
 	punpckhdq mm1,mm1
 	movq	MMWORD [MMBLOCK(0,0,edi,SIZEOF_FAST_FLOAT)], mm0
 	movq	MMWORD [MMBLOCK(0,1,edi,SIZEOF_FAST_FLOAT)], mm0
 	movq	MMWORD [MMBLOCK(0,2,edi,SIZEOF_FAST_FLOAT)], mm0
 	movq	MMWORD [MMBLOCK(0,3,edi,SIZEOF_FAST_FLOAT)], mm0
 	movq	MMWORD [MMBLOCK(1,0,edi,SIZEOF_FAST_FLOAT)], mm1
 	movq	MMWORD [MMBLOCK(1,1,edi,SIZEOF_FAST_FLOAT)], mm1
 	movq	MMWORD [MMBLOCK(1,2,edi,SIZEOF_FAST_FLOAT)], mm1
 	movq	MMWORD [MMBLOCK(1,3,edi,SIZEOF_FAST_FLOAT)], mm1
 	jmp	near .nextcolumn
 	alignx	16,7
 %endif
 .columnDCT:
 	; -- Even part
 	movd      mm0, DWORD [DWBLOCK(0,0,esi,SIZEOF_JCOEF)]
 	movd      mm1, DWORD [DWBLOCK(2,0,esi,SIZEOF_JCOEF)]
 	movd      mm2, DWORD [DWBLOCK(4,0,esi,SIZEOF_JCOEF)]
 	movd      mm3, DWORD [DWBLOCK(6,0,esi,SIZEOF_JCOEF)]
 	punpcklwd mm0,mm0
 	punpcklwd mm1,mm1
 	psrad     mm0,(DWORD_BIT-WORD_BIT)
 	psrad     mm1,(DWORD_BIT-WORD_BIT)
 	pi2fd     mm0,mm0
 	pi2fd     mm1,mm1
 	pfmul     mm0, MMWORD [MMBLOCK(0,0,edx,SIZEOF_FLOAT_MULT_TYPE)]
 	pfmul     mm1, MMWORD [MMBLOCK(2,0,edx,SIZEOF_FLOAT_MULT_TYPE)]
 	punpcklwd mm2,mm2
 	punpcklwd mm3,mm3
 	psrad     mm2,(DWORD_BIT-WORD_BIT)
 	psrad     mm3,(DWORD_BIT-WORD_BIT)
 	pi2fd     mm2,mm2
 	pi2fd     mm3,mm3
 	pfmul     mm2, MMWORD [MMBLOCK(4,0,edx,SIZEOF_FLOAT_MULT_TYPE)]
 	pfmul     mm3, MMWORD [MMBLOCK(6,0,edx,SIZEOF_FLOAT_MULT_TYPE)]
 	movq	mm4,mm0
 	movq	mm5,mm1
 	pfsub	mm0,mm2			; mm0=tmp11
 	pfsub	mm1,mm3
 	pfadd	mm4,mm2			; mm4=tmp10
 	pfadd	mm5,mm3			; mm5=tmp13
 	pfmul	mm1,[GOTOFF(ebx,PD_1_414)]
 	pfsub	mm1,mm5			; mm1=tmp12
 	movq	mm6,mm4
 	movq	mm7,mm0
 	pfsub	mm4,mm5			; mm4=tmp3
 	pfsub	mm0,mm1			; mm0=tmp2
 	pfadd	mm6,mm5			; mm6=tmp0
 	pfadd	mm7,mm1			; mm7=tmp1
 	movq	MMWORD [wk(1)], mm4	; tmp3
 	movq	MMWORD [wk(0)], mm0	; tmp2
 	; -- Odd part
 	movd      mm2, DWORD [DWBLOCK(1,0,esi,SIZEOF_JCOEF)]
 	movd      mm3, DWORD [DWBLOCK(3,0,esi,SIZEOF_JCOEF)]
 	movd      mm5, DWORD [DWBLOCK(5,0,esi,SIZEOF_JCOEF)]
 	movd      mm1, DWORD [DWBLOCK(7,0,esi,SIZEOF_JCOEF)]
 	punpcklwd mm2,mm2
 	punpcklwd mm3,mm3
 	psrad     mm2,(DWORD_BIT-WORD_BIT)
 	psrad     mm3,(DWORD_BIT-WORD_BIT)
 	pi2fd     mm2,mm2
 	pi2fd     mm3,mm3
 	pfmul     mm2, MMWORD [MMBLOCK(1,0,edx,SIZEOF_FLOAT_MULT_TYPE)]
 	pfmul     mm3, MMWORD [MMBLOCK(3,0,edx,SIZEOF_FLOAT_MULT_TYPE)]
 	punpcklwd mm5,mm5
 	punpcklwd mm1,mm1
 	psrad     mm5,(DWORD_BIT-WORD_BIT)
 	psrad     mm1,(DWORD_BIT-WORD_BIT)
 	pi2fd     mm5,mm5
 	pi2fd     mm1,mm1
 	pfmul     mm5, MMWORD [MMBLOCK(5,0,edx,SIZEOF_FLOAT_MULT_TYPE)]
 	pfmul     mm1, MMWORD [MMBLOCK(7,0,edx,SIZEOF_FLOAT_MULT_TYPE)]
 	movq	mm4,mm2
 	movq	mm0,mm5
 	pfadd	mm2,mm1			; mm2=z11
 	pfadd	mm5,mm3			; mm5=z13
 	pfsub	mm4,mm1			; mm4=z12
 	pfsub	mm0,mm3			; mm0=z10
 	movq	mm1,mm2
 	pfsub	mm2,mm5
 	pfadd	mm1,mm5			; mm1=tmp7
 	pfmul	mm2,[GOTOFF(ebx,PD_1_414)]	; mm2=tmp11
 	movq	mm3,mm0
 	pfadd	mm0,mm4
 	pfmul	mm0,[GOTOFF(ebx,PD_1_847)]	; mm0=z5
 	pfmul	mm3,[GOTOFF(ebx,PD_2_613)]	; mm3=(z10 * 2.613125930)
 	pfmul	mm4,[GOTOFF(ebx,PD_1_082)]	; mm4=(z12 * 1.082392200)
 	pfsubr	mm3,mm0			; mm3=tmp12
 	pfsub	mm4,mm0			; mm4=tmp10
 	; -- Final output stage
 	pfsub	mm3,mm1			; mm3=tmp6
 	movq	mm5,mm6
 	movq	mm0,mm7
 	pfadd	mm6,mm1			; mm6=data0=(00 01)
 	pfadd	mm7,mm3			; mm7=data1=(10 11)
 	pfsub	mm5,mm1			; mm5=data7=(70 71)
 	pfsub	mm0,mm3			; mm0=data6=(60 61)
 	pfsub	mm2,mm3			; mm2=tmp5
 	movq      mm1,mm6		; transpose coefficients
 	punpckldq mm6,mm7		; mm6=(00 10)
 	punpckhdq mm1,mm7		; mm1=(01 11)
 	movq      mm3,mm0		; transpose coefficients
 	punpckldq mm0,mm5		; mm0=(60 70)
 	punpckhdq mm3,mm5		; mm3=(61 71)
 	movq	MMWORD [MMBLOCK(0,0,edi,SIZEOF_FAST_FLOAT)], mm6
 	movq	MMWORD [MMBLOCK(1,0,edi,SIZEOF_FAST_FLOAT)], mm1
 	movq	MMWORD [MMBLOCK(0,3,edi,SIZEOF_FAST_FLOAT)], mm0
 	movq	MMWORD [MMBLOCK(1,3,edi,SIZEOF_FAST_FLOAT)], mm3
 	movq	mm7, MMWORD [wk(0)]	; mm7=tmp2
 	movq	mm5, MMWORD [wk(1)]	; mm5=tmp3
 	pfadd	mm4,mm2			; mm4=tmp4
 	movq	mm6,mm7
 	movq	mm1,mm5
 	pfadd	mm7,mm2			; mm7=data2=(20 21)
 	pfadd	mm5,mm4			; mm5=data4=(40 41)
 	pfsub	mm6,mm2			; mm6=data5=(50 51)
 	pfsub	mm1,mm4			; mm1=data3=(30 31)
 	movq      mm0,mm7		; transpose coefficients
 	punpckldq mm7,mm1		; mm7=(20 30)
 	punpckhdq mm0,mm1		; mm0=(21 31)
 	movq      mm3,mm5		; transpose coefficients
 	punpckldq mm5,mm6		; mm5=(40 50)
 	punpckhdq mm3,mm6		; mm3=(41 51)
 	movq	MMWORD [MMBLOCK(0,1,edi,SIZEOF_FAST_FLOAT)], mm7
 	movq	MMWORD [MMBLOCK(1,1,edi,SIZEOF_FAST_FLOAT)], mm0
 	movq	MMWORD [MMBLOCK(0,2,edi,SIZEOF_FAST_FLOAT)], mm5
 	movq	MMWORD [MMBLOCK(1,2,edi,SIZEOF_FAST_FLOAT)], mm3
 .nextcolumn:
 	add	esi, byte 2*SIZEOF_JCOEF		; coef_block
 	add	edx, byte 2*SIZEOF_FLOAT_MULT_TYPE	; quantptr
 	add	edi, byte 2*DCTSIZE*SIZEOF_FAST_FLOAT	; wsptr
 	dec	ecx					; ctr
 	jnz	near .columnloop
 	; -- Prefetch the next coefficient block
 	prefetch [esi + (DCTSIZE2-8)*SIZEOF_JCOEF + 0*32]
 	prefetch [esi + (DCTSIZE2-8)*SIZEOF_JCOEF + 1*32]
 	prefetch [esi + (DCTSIZE2-8)*SIZEOF_JCOEF + 2*32]
 	prefetch [esi + (DCTSIZE2-8)*SIZEOF_JCOEF + 3*32]
 	; ---- Pass 2: process rows from work array, store into output array.
 	mov	eax, [original_ebp]
 	lea	esi, [workspace]			; FAST_FLOAT * wsptr
 	mov	edi, JSAMPARRAY [output_buf(eax)]	; (JSAMPROW *)
 	mov	eax, JDIMENSION [output_col(eax)]
 	mov	ecx, DCTSIZE/2				; ctr
 	alignx	16,7
 .rowloop:
 	; -- Even part
 	movq	mm0, MMWORD [MMBLOCK(0,0,esi,SIZEOF_FAST_FLOAT)]
 	movq	mm1, MMWORD [MMBLOCK(2,0,esi,SIZEOF_FAST_FLOAT)]
 	movq	mm2, MMWORD [MMBLOCK(4,0,esi,SIZEOF_FAST_FLOAT)]
 	movq	mm3, MMWORD [MMBLOCK(6,0,esi,SIZEOF_FAST_FLOAT)]
 	movq	mm4,mm0
 	movq	mm5,mm1
 	pfsub	mm0,mm2			; mm0=tmp11
 	pfsub	mm1,mm3
 	pfadd	mm4,mm2			; mm4=tmp10
 	pfadd	mm5,mm3			; mm5=tmp13
 	pfmul	mm1,[GOTOFF(ebx,PD_1_414)]
 	pfsub	mm1,mm5			; mm1=tmp12
 	movq	mm6,mm4
 	movq	mm7,mm0
 	pfsub	mm4,mm5			; mm4=tmp3
 	pfsub	mm0,mm1			; mm0=tmp2
 	pfadd	mm6,mm5			; mm6=tmp0
 	pfadd	mm7,mm1			; mm7=tmp1
 	movq	MMWORD [wk(1)], mm4	; tmp3
 	movq	MMWORD [wk(0)], mm0	; tmp2
 	; -- Odd part
 	movq	mm2, MMWORD [MMBLOCK(1,0,esi,SIZEOF_FAST_FLOAT)]
 	movq	mm3, MMWORD [MMBLOCK(3,0,esi,SIZEOF_FAST_FLOAT)]
 	movq	mm5, MMWORD [MMBLOCK(5,0,esi,SIZEOF_FAST_FLOAT)]
 	movq	mm1, MMWORD [MMBLOCK(7,0,esi,SIZEOF_FAST_FLOAT)]
 	movq	mm4,mm2
 	movq	mm0,mm5
 	pfadd	mm2,mm1			; mm2=z11
 	pfadd	mm5,mm3			; mm5=z13
 	pfsub	mm4,mm1			; mm4=z12
 	pfsub	mm0,mm3			; mm0=z10
 	movq	mm1,mm2
 	pfsub	mm2,mm5
 	pfadd	mm1,mm5			; mm1=tmp7
 	pfmul	mm2,[GOTOFF(ebx,PD_1_414)]	; mm2=tmp11
 	movq	mm3,mm0
 	pfadd	mm0,mm4
 	pfmul	mm0,[GOTOFF(ebx,PD_1_847)]	; mm0=z5
 	pfmul	mm3,[GOTOFF(ebx,PD_2_613)]	; mm3=(z10 * 2.613125930)
 	pfmul	mm4,[GOTOFF(ebx,PD_1_082)]	; mm4=(z12 * 1.082392200)
 	pfsubr	mm3,mm0			; mm3=tmp12
 	pfsub	mm4,mm0			; mm4=tmp10
 	; -- Final output stage
 	pfsub	mm3,mm1			; mm3=tmp6
 	movq	mm5,mm6
 	movq	mm0,mm7
 	pfadd	mm6,mm1			; mm6=data0=(00 10)
 	pfadd	mm7,mm3			; mm7=data1=(01 11)
 	pfsub	mm5,mm1			; mm5=data7=(07 17)
 	pfsub	mm0,mm3			; mm0=data6=(06 16)
 	pfsub	mm2,mm3			; mm2=tmp5
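 	; Rounding trick used below, sketched in C under the assumption that
 	; pfadd rounds to nearest (the variable names are illustrative only):
 	;   union { float f; int i; } u;
 	;   u.f = x + 100663296.0f;      /* 1.5 * 2^26 = PD_RNDINT_MAGIC */
 	;   n = (short) (u.i & 0xFFFF);  /* n == round(x / 8) */
 	; Floats in [2^26, 2^27) are spaced 8 apart, so the addition rounds x
 	; to a multiple of 8 and leaves round(x/8) in the low mantissa bits
 	; (for results that fit in 16 bits).  One pfadd thus performs both the
 	; final divide-by-8 descale and the float-to-integer conversion; the
 	; pand/pslld/por sequence then packs the low 16-bit halves together.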
 	movq	mm1,[GOTOFF(ebx,PD_RNDINT_MAGIC)]	; mm1=[PD_RNDINT_MAGIC]
 	pcmpeqd	mm3,mm3
 	psrld	mm3,WORD_BIT		; mm3={0xFFFF 0x0000 0xFFFF 0x0000}
 	pfadd	mm6,mm1			; mm6=roundint(data0/8)=(00 ** 10 **)
 	pfadd	mm7,mm1			; mm7=roundint(data1/8)=(01 ** 11 **)
 	pfadd	mm0,mm1			; mm0=roundint(data6/8)=(06 ** 16 **)
 	pfadd	mm5,mm1			; mm5=roundint(data7/8)=(07 ** 17 **)
 	pand	mm6,mm3			; mm6=(00 -- 10 --)
 	pslld	mm7,WORD_BIT		; mm7=(-- 01 -- 11)
 	pand	mm0,mm3			; mm0=(06 -- 16 --)
 	pslld	mm5,WORD_BIT		; mm5=(-- 07 -- 17)
 	por	mm6,mm7			; mm6=(00 01 10 11)
 	por	mm0,mm5			; mm0=(06 07 16 17)
 	movq	mm1, MMWORD [wk(0)]	; mm1=tmp2
 	movq	mm3, MMWORD [wk(1)]	; mm3=tmp3
 	pfadd	mm4,mm2			; mm4=tmp4
 	movq	mm7,mm1
 	movq	mm5,mm3
 	pfadd	mm1,mm2			; mm1=data2=(02 12)
 	pfadd	mm3,mm4			; mm3=data4=(04 14)
 	pfsub	mm7,mm2			; mm7=data5=(05 15)
 	pfsub	mm5,mm4			; mm5=data3=(03 13)
 	movq	mm2,[GOTOFF(ebx,PD_RNDINT_MAGIC)]	; mm2=[PD_RNDINT_MAGIC]
 	pcmpeqd	mm4,mm4
 	psrld	mm4,WORD_BIT		; mm4={0xFFFF 0x0000 0xFFFF 0x0000}
 	pfadd	mm3,mm2			; mm3=roundint(data4/8)=(04 ** 14 **)
 	pfadd	mm7,mm2			; mm7=roundint(data5/8)=(05 ** 15 **)
 	pfadd	mm1,mm2			; mm1=roundint(data2/8)=(02 ** 12 **)
 	pfadd	mm5,mm2			; mm5=roundint(data3/8)=(03 ** 13 **)
 	pand	mm3,mm4			; mm3=(04 -- 14 --)
 	pslld	mm7,WORD_BIT		; mm7=(-- 05 -- 15)
 	pand	mm1,mm4			; mm1=(02 -- 12 --)
 	pslld	mm5,WORD_BIT		; mm5=(-- 03 -- 13)
 	por	mm3,mm7			; mm3=(04 05 14 15)
 	por	mm1,mm5			; mm1=(02 03 12 13)
 	movq      mm2,[GOTOFF(ebx,PB_CENTERJSAMP)]	; mm2=[PB_CENTERJSAMP]
 	packsswb  mm6,mm3		; mm6=(00 01 10 11 04 05 14 15)
 	packsswb  mm1,mm0		; mm1=(02 03 12 13 06 07 16 17)
 	paddb     mm6,mm2
 	paddb     mm1,mm2
 	movq      mm4,mm6		; transpose coefficients(phase 2)
 	punpcklwd mm6,mm1		; mm6=(00 01 02 03 10 11 12 13)
 	punpckhwd mm4,mm1		; mm4=(04 05 06 07 14 15 16 17)
 	movq      mm7,mm6		; transpose coefficients(phase 3)
 	punpckldq mm6,mm4		; mm6=(00 01 02 03 04 05 06 07)
 	punpckhdq mm7,mm4		; mm7=(10 11 12 13 14 15 16 17)
 	pushpic	ebx			; save GOT address
 	mov	edx, JSAMPROW [edi+0*SIZEOF_JSAMPROW]
 	mov	ebx, JSAMPROW [edi+1*SIZEOF_JSAMPROW]
 	movq	MMWORD [edx+eax*SIZEOF_JSAMPLE], mm6
 	movq	MMWORD [ebx+eax*SIZEOF_JSAMPLE], mm7
 	poppic	ebx			; restore GOT address
 	add	esi, byte 2*SIZEOF_FAST_FLOAT	; wsptr
 	add	edi, byte 2*SIZEOF_JSAMPROW
 	dec	ecx				; ctr
 	jnz	near .rowloop
 	femms		; empty MMX/3DNow! state
 	pop	edi
 	pop	esi
 ;	pop	edx		; need not be preserved
 ;	pop	ecx		; need not be preserved
 	pop	ebx
 	mov	esp,ebp		; esp <- aligned ebp
 	pop	esp		; esp <- original ebp
 	pop	ebp
 	ret
 %endif ; JIDCT_FLT_3DNOW_MMX_SUPPORTED
 %endif ; DCT_FLOAT_SUPPORTED
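Reviewer note: the output stage above masks, shifts and ORs pairs of rounded results into word lanes, packs them with signed saturation (`packsswb`), and then adds `PB_CENTERJSAMP` byte-wise (`paddb`). The C sketch below mirrors that data flow for a single lane. It is only an illustration: the helper names are hypothetical, and it assumes `PB_CENTERJSAMP` holds 128 in every byte, which this hunk does not show.

```c
#include <stdint.h>

/* Saturate a signed 16-bit value to the signed 8-bit range, as PACKSSWB does. */
static int8_t sat_s8(int16_t v)
{
    if (v > 127)  return 127;
    if (v < -128) return -128;
    return (int8_t)v;
}

/* Merge the low 16 bits of two rounded dwords into one dword:
 * 'even' keeps its low word (pand with 0x0000FFFF),
 * 'odd' is moved into the high word (pslld by 16), then both are OR-ed (por). */
static uint32_t merge_words(uint32_t even, uint32_t odd)
{
    return (even & 0x0000FFFFu) | (odd << 16);
}

/* Turn one signed 16-bit IDCT result into an unsigned 8-bit sample:
 * saturate to [-128,127], then add the assumed centre value 128 with
 * byte wraparound, as PADDB would. */
static uint8_t to_sample(int16_t v)
{
    return (uint8_t)((uint8_t)sat_s8(v) + 128u);
}
```

Saturating first and re-centring afterwards is what lets the SIMD path clamp to 0..255 without the range-limit table lookup used by the non-SIMD routines.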
--- a/jidctflt.asm
+++ b/jidctflt.asm
@@ -0,0 +1,473 @@
 ;
 ; jidctflt.asm - floating-point IDCT (non-SIMD)
 ;
 ; x86 SIMD extension for IJG JPEG library
 ; Copyright (C) 1999-2006, MIYASAKA Masaru.
 ; For conditions of distribution and use, see copyright notice in jsimdext.inc
 ;
 ; This file should be assembled with NASM (Netwide Assembler); it
 ; can *not* be assembled with Microsoft's MASM or any compatible
 ; assembler (including Borland's Turbo Assembler).
 ; NASM is available from http://nasm.sourceforge.net/ or
 ; http://sourceforge.net/project/showfiles.php?group_id=6208
 ;
 ; This file contains a floating-point implementation of the inverse DCT
 ; (Discrete Cosine Transform). The following code is based directly on
 ; the IJG's original jidctflt.c; see jidctflt.c for more details.
 ;
 ; Last Modified : October 17, 2004
 ;
 ; [TAB8]
 %include "jsimdext.inc"
 %include "jdct.inc"
 %ifdef DCT_FLOAT_SUPPORTED
 ; This module is specialized to the case DCTSIZE = 8.
 ;
 %if DCTSIZE != 8
 %error "Sorry, this code only copes with 8x8 DCTs."
 %endif
 ; --------------------------------------------------------------------------
 	SECTION	SEG_CONST
 %define ROTATOR_TYPE	FP32	; float
 	alignz	16
 	global	EXTN(jconst_idct_float)
 EXTN(jconst_idct_float):
 F_1_414	dd	1.414213562373095048801689	; 2*cos(PI*1/4)
 F_1_847	dd	1.847759065022573512256366	; 2*cos(PI*1/8)
 F_1_082	dd	1.082392200292393968799446	; 2*(cos(PI*1/8)-cos(PI*3/8))
 F_2_613	dd	2.613125929752753055713286	; 2*(cos(PI*1/8)+cos(PI*3/8))
 	alignz	16
 ; --------------------------------------------------------------------------
 	SECTION	SEG_TEXT
 	BITS	32
 ;
 ; Perform dequantization and inverse DCT on one block of coefficients.
 ;
 ; GLOBAL(void)
 ; jpeg_idct_float (j_decompress_ptr cinfo, jpeg_component_info * compptr,
 ;                  JCOEFPTR coef_block,
 ;                  JSAMPARRAY output_buf, JDIMENSION output_col)
 ;
 %define cinfo(b)	(b)+8		; j_decompress_ptr cinfo
 %define compptr(b)	(b)+12		; jpeg_component_info * compptr
 %define coef_block(b)	(b)+16		; JCOEFPTR coef_block
 %define output_buf(b)	(b)+20		; JSAMPARRAY output_buf
 %define output_col(b)	(b)+24		; JDIMENSION output_col
 %define tmp		ebp-SIZEOF_FP64	; double tmp
 %define workspace	tmp-DCTSIZE2*SIZEOF_FAST_FLOAT
 					; FAST_FLOAT workspace[DCTSIZE2]
 %define rndint_magic	workspace-SIZEOF_FP32
 					; float rndint_magic = 100663296.0F
 %define gotptr		rndint_magic-SIZEOF_POINTER	; void * gotptr
 	align	16
 	global	EXTN(jpeg_idct_float)
 EXTN(jpeg_idct_float):
 	push	ebp
 	mov	ebp,esp
 	lea	esp, [workspace]
 	push	FP32 0x4CC00000		; (float)(0x00C00000 << 3)
 	pushpic	eax			; make room for the GOT address
 	push	ebx
 ;	push	ecx		; need not be preserved
 ;	push	edx		; need not be preserved
 	push	esi
 	push	edi
 	get_GOT	ebx			; get GOT address
 	movpic	POINTER [gotptr], ebx	; save GOT address
 	; ---- Pass 1: process columns from input, store into work array.
 	mov	edx, POINTER [compptr(ebp)]
 	mov	edx, POINTER [jcompinfo_dct_table(edx)]	; quantptr
 	mov	esi, JCOEFPTR [coef_block(ebp)]		; inptr
 	lea	edi, [workspace]			; FAST_FLOAT * wsptr
 	mov	ecx, DCTSIZE				; ctr
 	alignx	16,7
 .columnloop:
 	mov	ax, JCOEF [COL(1,esi,SIZEOF_JCOEF)]
 	or	ax, JCOEF [COL(2,esi,SIZEOF_JCOEF)]
 	jnz	short .columnDCT
 	mov	bx, JCOEF [COL(3,esi,SIZEOF_JCOEF)]
 	mov	ax, JCOEF [COL(4,esi,SIZEOF_JCOEF)]
 	or	bx, JCOEF [COL(5,esi,SIZEOF_JCOEF)]
 	or	ax, JCOEF [COL(6,esi,SIZEOF_JCOEF)]
 	or	bx, JCOEF [COL(7,esi,SIZEOF_JCOEF)]
 	or	ax,bx
 	jnz	short .columnDCT
 	; -- AC terms all zero
 	fild	JCOEF [COL(0,esi,SIZEOF_JCOEF)]
 	fmul	FLOAT_MULT_TYPE [COL(0,edx,SIZEOF_FLOAT_MULT_TYPE)]
 	fst	FAST_FLOAT [COL(0,edi,SIZEOF_FAST_FLOAT)]
 	fst	FAST_FLOAT [COL(1,edi,SIZEOF_FAST_FLOAT)]
 	fst	FAST_FLOAT [COL(2,edi,SIZEOF_FAST_FLOAT)]
 	fst	FAST_FLOAT [COL(3,edi,SIZEOF_FAST_FLOAT)]
 	fst	FAST_FLOAT [COL(4,edi,SIZEOF_FAST_FLOAT)]
 	fst	FAST_FLOAT [COL(5,edi,SIZEOF_FAST_FLOAT)]
 	fst	FAST_FLOAT [COL(6,edi,SIZEOF_FAST_FLOAT)]
 	fstp	FAST_FLOAT [COL(7,edi,SIZEOF_FAST_FLOAT)]
 	jmp	near .nextcolumn
 	alignx	16,7
 .columnDCT:
 	movpic	ebx, POINTER [gotptr]	; load GOT address
 	; -- Even part
 	fild	JCOEF [COL(2,esi,SIZEOF_JCOEF)]
 	fild	JCOEF [COL(6,esi,SIZEOF_JCOEF)]
 	fild	JCOEF [COL(4,esi,SIZEOF_JCOEF)]
 	fild	JCOEF [COL(0,esi,SIZEOF_JCOEF)]
 	fxch	st0,st3
 	fmul	FLOAT_MULT_TYPE [COL(2,edx,SIZEOF_FLOAT_MULT_TYPE)]
 	fxch	st0,st2
 	fmul	FLOAT_MULT_TYPE [COL(6,edx,SIZEOF_FLOAT_MULT_TYPE)]
 	fxch	st0,st1
 	fmul	FLOAT_MULT_TYPE [COL(4,edx,SIZEOF_FLOAT_MULT_TYPE)]
 	fxch	st0,st3
 	fmul	FLOAT_MULT_TYPE [COL(0,edx,SIZEOF_FLOAT_MULT_TYPE)]
 	fxch	st0,st1
 	fld	st2	; st2 = st2 + st0, st0 = st2 - st0
 	fsub	st0,st1
 	fxch	st0,st1
 	faddp	st3,st0
 	fmul	ROTATOR_TYPE [GOTOFF(ebx,F_1_414)]
 	fld	st3	; st1 = st1 + st3, st3 = st1 - st3
 	fsubr	st0,st2
 	fxch	st0,st4
 	faddp	st2,st0
 	fsub	st0,st2
 	fld	st1	; st2 = st1 + st2, st1 = st1 - st2
 	fsub	st0,st3
 	fxch	st0,st2
 	faddp	st3,st0
 	fld	st3	; st0 = st3 + st0, st3 = st3 - st0
 	fsub	st0,st1
 	fxch	st0,st4
 	faddp	st1,st0
 	; -- Odd part
 	fild	JCOEF [COL(1,esi,SIZEOF_JCOEF)]
 	fild	JCOEF [COL(7,esi,SIZEOF_JCOEF)]
 	fild	JCOEF [COL(3,esi,SIZEOF_JCOEF)]
 	fild	JCOEF [COL(5,esi,SIZEOF_JCOEF)]
 	fxch	st0,st3
 	fmul	FLOAT_MULT_TYPE [COL(1,edx,SIZEOF_FLOAT_MULT_TYPE)]
 	fxch	st0,st2
 	fmul	FLOAT_MULT_TYPE [COL(7,edx,SIZEOF_FLOAT_MULT_TYPE)]
 	fxch	st0,st1
 	fmul	FLOAT_MULT_TYPE [COL(3,edx,SIZEOF_FLOAT_MULT_TYPE)]
 	fxch	st0,st6
 	fxch	st3,st0
 	fmul	FLOAT_MULT_TYPE [COL(5,edx,SIZEOF_FLOAT_MULT_TYPE)]
 	fxch	st0,st5
 	fstp	FP64 [tmp]
 	fld	st1	; st1 = st1 + st0, st0 = st1 - st0
 	fsub	st0,st1
 	fxch	st0,st1
 	faddp	st2,st0
 	fld	st5	; st4 = st4 + st5, st5 = st4 - st5
 	fsubr	st0,st5
 	fxch	st0,st6
 	faddp	st5,st0
 	fld	st1	; st1 = st1 + st4, st4 = st1 - st4
 	fsub	st0,st5
 	fxch	st0,st5
 	faddp	st2,st0
 	fld	st5
 	fadd	st0,st1
 	fxch	st0,st5
 	fmul	ROTATOR_TYPE [GOTOFF(ebx,F_1_414)]
 	fxch	st0,st5
 	fmul	ROTATOR_TYPE [GOTOFF(ebx,F_1_847)]
 	fxch	st0,st6
 	fmul	ROTATOR_TYPE [GOTOFF(ebx,F_2_613)]
 	fxch	st0,st1
 	fmul	ROTATOR_TYPE [GOTOFF(ebx,F_1_082)]
 	fxch	st0,st6
 	fsubr	st1,st0
 	fsubp	st6,st0
 	; -- Final output stage
 	fsub	st0,st1
 	fld	st2	; st1 = st2 + st1, st2 = st2 - st1
 	fsub	st0,st2
 	fxch	st0,st3
 	faddp	st2,st0
 	fsub	st4,st0
 	fld	st3	; st0 = st3 + st0, st3 = st3 - st0
 	fsub	st0,st1
 	fxch	st0,st4
 	faddp	st1,st0
 	fxch	st0,st2
 	fstp	FAST_FLOAT [COL(7,edi,SIZEOF_FAST_FLOAT)]
 	fstp	FAST_FLOAT [COL(0,edi,SIZEOF_FAST_FLOAT)]
 	fstp	FAST_FLOAT [COL(1,edi,SIZEOF_FAST_FLOAT)]
 	fstp	FAST_FLOAT [COL(6,edi,SIZEOF_FAST_FLOAT)]
 	fadd	st1,st0
 	fld	FP64 [tmp]
 	fld	st1	; st3 = st3 + st1, st1 = st3 - st1
 	fsubr	st0,st4
 	fxch	st0,st2
 	faddp	st4,st0
 	fld	st0	; st0 = st0 + st2, st2 = st0 - st2
 	fsub	st0,st3
 	fxch	st0,st3
 	faddp	st1,st0
 	fxch	st0,st3
 	fstp	FAST_FLOAT [COL(2,edi,SIZEOF_FAST_FLOAT)]
 	fstp	FAST_FLOAT [COL(5,edi,SIZEOF_FAST_FLOAT)]
 	fstp	FAST_FLOAT [COL(3,edi,SIZEOF_FAST_FLOAT)]
 	fstp	FAST_FLOAT [COL(4,edi,SIZEOF_FAST_FLOAT)]
 .nextcolumn:
 	add	esi, byte SIZEOF_JCOEF	; advance pointers to next column
 	add	edx, byte SIZEOF_FLOAT_MULT_TYPE
 	add	edi, byte SIZEOF_FAST_FLOAT
 	dec	ecx
 	jnz	near .columnloop
 	; ---- Pass 2: process rows from work array, store into output array.
 	mov	edx, POINTER [cinfo(ebp)]
 	mov	edx, POINTER [jdstruct_sample_range_limit(edx)]
 	sub	edx, byte -CENTERJSAMPLE*SIZEOF_JSAMPLE	; JSAMPLE * range_limit
 	lea	esi, [workspace]			; FAST_FLOAT * wsptr
 	mov	edi, JSAMPARRAY [output_buf(ebp)]	; (JSAMPROW *)
 	mov	ecx, DCTSIZE				; ctr
 	alignx	16,7
 .rowloop:
 	push	edi
 	mov	edi, JSAMPROW [edi]			; (JSAMPLE *)
 	add	edi, JDIMENSION [output_col(ebp)]	; edi=outptr
 %ifndef NO_ZERO_ROW_TEST_FLOAT
 	mov	eax, FAST_FLOAT [ROW(1,esi,SIZEOF_FAST_FLOAT)]
 	add	eax,eax			; shl eax,1 (shift out the sign bit)
 	jnz	short .rowDCT
 	mov	eax, FAST_FLOAT [ROW(2,esi,SIZEOF_FAST_FLOAT)]
 	mov	ebx, FAST_FLOAT [ROW(3,esi,SIZEOF_FAST_FLOAT)]
 	or	eax, FAST_FLOAT [ROW(4,esi,SIZEOF_FAST_FLOAT)]
 	or	ebx, FAST_FLOAT [ROW(5,esi,SIZEOF_FAST_FLOAT)]
 	or	eax, FAST_FLOAT [ROW(6,esi,SIZEOF_FAST_FLOAT)]
 	or	ebx, FAST_FLOAT [ROW(7,esi,SIZEOF_FAST_FLOAT)]
 	or	eax,ebx
 	add	eax,eax			; shl eax,1 (shift out the sign bit)
 	jnz	short .rowDCT
 	; -- AC terms all zero
 	push	eax
 	fld	FAST_FLOAT [ROW(0,esi,SIZEOF_FAST_FLOAT)]
 	fadd	FP32 [rndint_magic]
 	fstp	FP32 [esp]
 	pop	eax
 	and	eax,RANGE_MASK
 	mov	al, JSAMPLE [edx+eax*SIZEOF_JSAMPLE]
 	mov	JSAMPLE [edi+0*SIZEOF_JSAMPLE], al
 	mov	JSAMPLE [edi+1*SIZEOF_JSAMPLE], al
 	mov	JSAMPLE [edi+2*SIZEOF_JSAMPLE], al
 	mov	JSAMPLE [edi+3*SIZEOF_JSAMPLE], al
 	mov	JSAMPLE [edi+4*SIZEOF_JSAMPLE], al
 	mov	JSAMPLE [edi+5*SIZEOF_JSAMPLE], al
 	mov	JSAMPLE [edi+6*SIZEOF_JSAMPLE], al
 	mov	JSAMPLE [edi+7*SIZEOF_JSAMPLE], al
 	jmp	near .nextrow
 	alignx	16,7
 %endif
 .rowDCT:
 	movpic	ebx, POINTER [gotptr]	; load GOT address
 	; -- Even part
 	fld	FAST_FLOAT [ROW(4,esi,SIZEOF_FAST_FLOAT)]
 	fld	FAST_FLOAT [ROW(2,esi,SIZEOF_FAST_FLOAT)]
 	fld	FAST_FLOAT [ROW(0,esi,SIZEOF_FAST_FLOAT)]
 	fld	FAST_FLOAT [ROW(6,esi,SIZEOF_FAST_FLOAT)]
 	fld	st2	; st2 = st2 + st0, st0 = st2 - st0
 	fsub	st0,st1
 	fxch	st0,st1
 	faddp	st3,st0
 	fmul	ROTATOR_TYPE [GOTOFF(ebx,F_1_414)]
 	fld	st3	; st1 = st1 + st3, st3 = st1 - st3
 	fsubr	st0,st2
 	fxch	st0,st4
 	faddp	st2,st0
 	fsub	st0,st2
 	fld	st1	; st2 = st1 + st2, st1 = st1 - st2
 	fsub	st0,st3
 	fxch	st0,st2
 	faddp	st3,st0
 	fld	st3	; st0 = st3 + st0, st3 = st3 - st0
 	fsub	st0,st1
 	fxch	st0,st4
 	faddp	st1,st0
 	; -- Odd part
 	fld	FAST_FLOAT [ROW(3,esi,SIZEOF_FAST_FLOAT)]
 	fxch	st0,st3
 	fld	FAST_FLOAT [ROW(1,esi,SIZEOF_FAST_FLOAT)]
 	fld	FAST_FLOAT [ROW(7,esi,SIZEOF_FAST_FLOAT)]
 	fld	FAST_FLOAT [ROW(5,esi,SIZEOF_FAST_FLOAT)]
 	fxch	st0,st5
 	fstp	FP64 [tmp]
 	fld	st1	; st1 = st1 + st0, st0 = st1 - st0
 	fsub	st0,st1
 	fxch	st0,st1
 	faddp	st2,st0
 	fld	st5	; st4 = st4 + st5, st5 = st4 - st5
 	fsubr	st0,st5
 	fxch	st0,st6
 	faddp	st5,st0
 	fld	st1	; st1 = st1 + st4, st4 = st1 - st4
 	fsub	st0,st5
 	fxch	st0,st5
 	faddp	st2,st0
 	fld	st5
 	fadd	st0,st1
 	fxch	st0,st5
 	fmul	ROTATOR_TYPE [GOTOFF(ebx,F_1_414)]
 	fxch	st0,st5
 	fmul	ROTATOR_TYPE [GOTOFF(ebx,F_1_847)]
 	fxch	st0,st6
 	fmul	ROTATOR_TYPE [GOTOFF(ebx,F_2_613)]
 	fxch	st0,st1
 	fmul	ROTATOR_TYPE [GOTOFF(ebx,F_1_082)]
 	fxch	st0,st6
 	fsubr	st1,st0
 	fsubp	st6,st0
 	; -- Final output stage
 	sub	esp, byte DCTSIZE*SIZEOF_FP32
 	fsub	st0,st1
 	fld	st2	; st1 = st2 + st1, st2 = st2 - st1
 	fsub	st0,st2
 	fxch	st0,st3
 	faddp	st2,st0
 	fsub	st4,st0
 	fld	st3	; st0 = st3 + st0, st3 = st3 - st0
 	fsub	st0,st1
 	fxch	st0,st4
 	faddp	st1,st0
 	fld	FP32 [rndint_magic]
 	fadd	st4,st0
 	fadd	st1,st0
 	fadd	st2,st0
 	fadd	st3,st0
 	fxch	st0,st4
 	fstp	FP32 [esp+6*SIZEOF_FP32]
 	fstp	FP32 [esp+1*SIZEOF_FP32]
 	fstp	FP32 [esp+0*SIZEOF_FP32]
 	fstp	FP32 [esp+7*SIZEOF_FP32]
 	fxch	st0,st1
 	fadd	st2,st0
 	fld	FP64 [tmp]
 	fld	st1	; st4 = st4 + st1, st1 = st4 - st1
 	fsubr	st0,st5
 	fxch	st0,st2
 	faddp	st5,st0
 	fld	st0	; st0 = st0 + st3, st3 = st0 - st3
 	fsub	st0,st4
 	fxch	st0,st4
 	faddp	st1,st0
 	fxch	st0,st2
 	fadd	st1,st0
 	fadd	st2,st0
 	fadd	st3,st0
 	faddp	st4,st0
 	fstp	FP32 [esp+5*SIZEOF_FP32]
 	fstp	FP32 [esp+4*SIZEOF_FP32]
 	fstp	FP32 [esp+3*SIZEOF_FP32]
 	fstp	FP32 [esp+2*SIZEOF_FP32]
 %assign i 0	; i=0;
 %rep 4	; -- repeat 4 times ---
 	pop	eax
 	pop	ebx
 	and	eax,RANGE_MASK
 	and	ebx,RANGE_MASK
 	mov	al, JSAMPLE [edx+eax*SIZEOF_JSAMPLE]
 	mov	bl, JSAMPLE [edx+ebx*SIZEOF_JSAMPLE]
 	mov	JSAMPLE [edi+(i+0)*SIZEOF_JSAMPLE], al
 	mov	JSAMPLE [edi+(i+1)*SIZEOF_JSAMPLE], bl
 %assign i i+2	; i+=2;
 %endrep	; -- repeat end ---
 .nextrow:
 	pop	edi
 	add	esi, byte DCTSIZE*SIZEOF_FAST_FLOAT
 	add	edi, byte SIZEOF_JSAMPROW	; advance pointer to next row
 	dec	ecx
 	jnz	near .rowloop
 	pop	edi
 	pop	esi
 ;	pop	edx		; need not be preserved
 ;	pop	ecx		; need not be preserved
 	pop	ebx
 	mov	esp,ebp
 	pop	ebp
 	ret
 %endif ; DCT_FLOAT_SUPPORTED
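Reviewer note: `rndint_magic` (100663296.0F, bit pattern 0x4CC00000) lets jidctflt.asm round and divide by 8 with a single `fadd`. The constant equals 1.5 * 2^26, so a single-precision sum has a unit in the last place of 8, and the low mantissa bits of the stored result hold round(x/8) in two's complement. A minimal C sketch of the idea follows. The function name is hypothetical, and `RANGE_MASK` is assumed to be 0x3FF (the IJG value for 8-bit samples), since its definition is not part of this diff.

```c
#include <stdint.h>
#include <string.h>

#define RNDINT_MAGIC 100663296.0f  /* 1.5 * 2^26, stored as 0x4CC00000 */
#define RANGE_MASK   0x3FF         /* assumption: MAXJSAMPLE*4+3 for 8-bit samples */

/* Return round(x / 8) masked to the range-limit index width.
 * Valid only while |x / 8| stays well below 2^22. */
static uint32_t round_div8_masked(float x)
{
    /* volatile forces the sum to be rounded to single precision,
     * matching the "fstp FP32" store in the assembly. */
    volatile float sum = x + RNDINT_MAGIC;
    float stored = sum;
    uint32_t bits;

    memcpy(&bits, &stored, sizeof bits);  /* reinterpret the float's bit pattern */
    return bits & RANGE_MASK;             /* same as "and eax,RANGE_MASK" */
}
```

For example, `round_div8_masked(1000.0f)` yields 125, and a negative input such as -8.0f yields 0x3FF, i.e. -1 in the 10-bit index space that the range-limit table expects.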
--- a/jidctfst.asm
+++ b/jidctfst.asm
@@ -0,0 +1,464 @@
 ;
 ; jidctfst.asm - fast integer IDCT (non-SIMD)
 ;
 ; x86 SIMD extension for IJG JPEG library
 ; Copyright (C) 1999-2006, MIYASAKA Masaru.
 ; For conditions of distribution and use, see copyright notice in jsimdext.inc
 ;
 ; This file should be assembled with NASM (Netwide Assembler); it
 ; can *not* be assembled with Microsoft's MASM or any compatible
 ; assembler (including Borland's Turbo Assembler).
 ; NASM is available from http://nasm.sourceforge.net/ or
 ; http://sourceforge.net/project/showfiles.php?group_id=6208
 ;
 ; This file contains a fast, not so accurate integer implementation of
 ; the inverse DCT (Discrete Cosine Transform). The following code is
 ; based directly on the IJG's original jidctfst.c; see jidctfst.c
 ; for more details.
 ;
 ; Last Modified : October 17, 2004
 ;
 ; [TAB8]
 %include "jsimdext.inc"
 %include "jdct.inc"
 %ifdef DCT_IFAST_SUPPORTED
 ; This module is specialized to the case DCTSIZE = 8.
 ;
 %if DCTSIZE != 8
 %error "Sorry, this code only copes with 8x8 DCTs."
 %endif
 ; --------------------------------------------------------------------------
 ; We can gain a little more speed, with a further compromise in accuracy,
 ; by omitting the addition in a descaling shift.  This yields an
 ; incorrectly rounded result half the time...
 ;
 %macro	descale 2
 %ifdef USE_ACCURATE_ROUNDING
 %if (%2)<=7
 	add	%1, byte (1<<((%2)-1))	; add reg32,imm8
 %else
 	add	%1, (1<<((%2)-1))	; add reg32,imm32
 %endif
 %endif
 	sar	%1,%2
 %endmacro
 ; --------------------------------------------------------------------------
 %define CONST_BITS	8
 %define PASS1_BITS	2
 %if IFAST_SCALE_BITS != PASS1_BITS
 %error "'IFAST_SCALE_BITS' must be equal to 'PASS1_BITS'."
 %endif
 %if CONST_BITS == 8
 F_1_082	equ	277		; FIX(1.082392200)
 F_1_414	equ	362		; FIX(1.414213562)
 F_1_847	equ	473		; FIX(1.847759065)
 F_2_613	equ	669		; FIX(2.613125930)
 %else
 ; NASM cannot do compile-time arithmetic on floating-point constants.
 %define DESCALE(x,n)  (((x)+(1<<((n)-1)))>>(n))
 F_1_082	equ	DESCALE(1162209775,30-CONST_BITS)	; FIX(1.082392200)
 F_1_414	equ	DESCALE(1518500249,30-CONST_BITS)	; FIX(1.414213562)
 F_1_847	equ	DESCALE(1984016188,30-CONST_BITS)	; FIX(1.847759065)
 F_2_613	equ	DESCALE(2805822602,30-CONST_BITS)	; FIX(2.613125930)
 %endif
 ; --------------------------------------------------------------------------
 	SECTION	SEG_TEXT
 	BITS	32
 ;
 ; Perform dequantization and inverse DCT on one block of coefficients.
 ;
 ; GLOBAL(void)
 ; jpeg_idct_ifast (j_decompress_ptr cinfo, jpeg_component_info * compptr,
 ;                  JCOEFPTR coef_block,
 ;                  JSAMPARRAY output_buf, JDIMENSION output_col)
 ;
 %define cinfo(b)	(b)+8		; j_decompress_ptr cinfo
 %define compptr(b)	(b)+12		; jpeg_component_info * compptr
 %define coef_block(b)	(b)+16		; JCOEFPTR coef_block
 %define output_buf(b)	(b)+20		; JSAMPARRAY output_buf
 %define output_col(b)	(b)+24		; JDIMENSION output_col
 %define range_limit	ebp-SIZEOF_POINTER		; JSAMPLE * range_limit
 %define ptr		range_limit-SIZEOF_POINTER	; void * ptr
 %define workspace	ptr-DCTSIZE2*SIZEOF_INT
 					; int workspace[DCTSIZE2]
 	align	16
 	global	EXTN(jpeg_idct_ifast)
 EXTN(jpeg_idct_ifast):
 	push	ebp
 	mov	ebp,esp
 	lea	esp, [workspace]
 	push	ebx
 ;	push	ecx		; need not be preserved
 ;	push	edx		; need not be preserved
 	push	esi
 	push	edi
 	; ---- Pass 1: process columns from input, store into work array.
 	mov	edx, POINTER [compptr(ebp)]
 	mov	edx, POINTER [jcompinfo_dct_table(edx)]	; quantptr
 	mov	esi, JCOEFPTR [coef_block(ebp)]		; inptr
 	lea	edi, [workspace]			; int * wsptr
 	mov	ecx, DCTSIZE				; ctr
 	alignx	16,7
 .columnloop:
 	mov	ax, JCOEF [COL(1,esi,SIZEOF_JCOEF)]
 	or	ax, JCOEF [COL(2,esi,SIZEOF_JCOEF)]
 	jnz	short .columnDCT
 	mov	bx, JCOEF [COL(3,esi,SIZEOF_JCOEF)]
 	mov	ax, JCOEF [COL(4,esi,SIZEOF_JCOEF)]
 	or	bx, JCOEF [COL(5,esi,SIZEOF_JCOEF)]
 	or	ax, JCOEF [COL(6,esi,SIZEOF_JCOEF)]
 	or	bx, JCOEF [COL(7,esi,SIZEOF_JCOEF)]
 	or	ax,bx
 	jnz	short .columnDCT
 	; -- AC terms all zero
 	mov	ax, JCOEF [COL(0,esi,SIZEOF_JCOEF)]
 	imul	ax, IFAST_MULT_TYPE [COL(0,edx,SIZEOF_IFAST_MULT_TYPE)]
 	cwde
 	mov	INT [COL(0,edi,SIZEOF_INT)], eax
 	mov	INT [COL(1,edi,SIZEOF_INT)], eax
 	mov	INT [COL(2,edi,SIZEOF_INT)], eax
 	mov	INT [COL(3,edi,SIZEOF_INT)], eax
 	mov	INT [COL(4,edi,SIZEOF_INT)], eax
 	mov	INT [COL(5,edi,SIZEOF_INT)], eax
 	mov	INT [COL(6,edi,SIZEOF_INT)], eax
 	mov	INT [COL(7,edi,SIZEOF_INT)], eax
 	jmp	near .nextcolumn
 	alignx	16,7
 .columnDCT:
 	push	ecx	; ctr
 	push	esi	; coef_block
 	push	edx	; quantptr
 	mov	POINTER [ptr], edi	; wsptr
 	; -- Even part
 	movsx	eax, JCOEF [COL(0,esi,SIZEOF_JCOEF)]
 	movsx	ecx, JCOEF [COL(4,esi,SIZEOF_JCOEF)]
 	imul	ax, IFAST_MULT_TYPE [COL(0,edx,SIZEOF_IFAST_MULT_TYPE)]
 	imul	cx, IFAST_MULT_TYPE [COL(4,edx,SIZEOF_IFAST_MULT_TYPE)]
 	movsx	ebx, JCOEF [COL(2,esi,SIZEOF_JCOEF)]
 	movsx	edi, JCOEF [COL(6,esi,SIZEOF_JCOEF)]
 	imul	bx, IFAST_MULT_TYPE [COL(2,edx,SIZEOF_IFAST_MULT_TYPE)]
 	imul	di, IFAST_MULT_TYPE [COL(6,edx,SIZEOF_IFAST_MULT_TYPE)]
 	lea	edx,[eax+ecx]		; edx=tmp10
 	sub	eax,ecx			; eax=tmp11
 	lea	ecx,[ebx+edi]		; ecx=tmp13
 	sub	ebx,edi
 	imul	ebx,(F_1_414)
 	descale	ebx,CONST_BITS
 	sub	ebx,ecx			; ebx=tmp12
 	lea	edi,[edx+ecx]		; edi=tmp0
 	sub	edx,ecx			; edx=tmp3
 	lea	ecx,[eax+ebx]		; ecx=tmp1
 	sub	eax,ebx			; eax=tmp2
 	push	edx		; tmp3
 	push	eax		; tmp2
 	push	ecx		; tmp1
 	push	edi		; tmp0
 	; -- Odd part
 	mov	edx, POINTER [esp+16]	; quantptr
 	movsx	eax, JCOEF [COL(1,esi,SIZEOF_JCOEF)]
 	movsx	ebx, JCOEF [COL(7,esi,SIZEOF_JCOEF)]
 	imul	ax, IFAST_MULT_TYPE [COL(1,edx,SIZEOF_IFAST_MULT_TYPE)]
 	imul	bx, IFAST_MULT_TYPE [COL(7,edx,SIZEOF_IFAST_MULT_TYPE)]
 	movsx	edi, JCOEF [COL(5,esi,SIZEOF_JCOEF)]
 	movsx	ecx, JCOEF [COL(3,esi,SIZEOF_JCOEF)]
 	imul	di, IFAST_MULT_TYPE [COL(5,edx,SIZEOF_IFAST_MULT_TYPE)]
 	imul	cx, IFAST_MULT_TYPE [COL(3,edx,SIZEOF_IFAST_MULT_TYPE)]
 	lea	esi,[eax+ebx]		; esi=z11
 	sub	eax,ebx			; eax=z12
 	lea	edx,[edi+ecx]		; edx=z13
 	sub	edi,ecx			; edi=z10
 	lea	ebx,[esi+edx]		; ebx=tmp7
 	sub	esi,edx
 	imul	esi,(F_1_414)		; esi=tmp11
 	descale	esi,CONST_BITS
 	lea	ecx,[edi+eax]
 	imul	ecx,(F_1_847)		; ecx=z5
 	imul	edi,(-F_2_613)		; edi=MULTIPLY(z10,-FIX_2_613125930)
 	imul	eax,(F_1_082)		; eax=MULTIPLY(z12,FIX_1_082392200)
 	descale	ecx,CONST_BITS
 	descale	edi,CONST_BITS
 	descale	eax,CONST_BITS
 	add	edi,ecx			; edi=tmp12
 	sub	eax,ecx			; eax=tmp10
 	; -- Final output stage
 	sub	edi,ebx		; edi=tmp6
 	pop	edx		; edx=tmp0
 	sub	esi,edi		; esi=tmp5
 	pop	ecx		; ecx=tmp1
 	add	eax,esi		; eax=tmp4
 	push	esi		; tmp5
 	push	eax		; tmp4
 	lea	eax,[edx+ebx]	; eax=data0(=tmp0+tmp7)
 	sub	edx,ebx		; edx=data7(=tmp0-tmp7)
 	lea	ebx,[ecx+edi]	; ebx=data1(=tmp1+tmp6)
 	sub	ecx,edi		; ecx=data6(=tmp1-tmp6)
 	mov	edi, POINTER [ptr]	; edi=wsptr
 	mov	INT [COL(0,edi,SIZEOF_INT)], eax
 	mov	INT [COL(7,edi,SIZEOF_INT)], edx
 	mov	INT [COL(1,edi,SIZEOF_INT)], ebx
 	mov	INT [COL(6,edi,SIZEOF_INT)], ecx
 	pop	esi		; esi=tmp4
 	pop	eax		; eax=tmp5
 	pop	edx		; edx=tmp2
 	pop	ecx		; ecx=tmp3
 	lea	ebx,[edx+eax]	; ebx=data2(=tmp2+tmp5)
 	sub	edx,eax		; edx=data5(=tmp2-tmp5)
 	lea	eax,[ecx+esi]	; eax=data4(=tmp3+tmp4)
 	sub	ecx,esi		; ecx=data3(=tmp3-tmp4)
 	mov	INT [COL(2,edi,SIZEOF_INT)], ebx
 	mov	INT [COL(5,edi,SIZEOF_INT)], edx
 	mov	INT [COL(4,edi,SIZEOF_INT)], eax
 	mov	INT [COL(3,edi,SIZEOF_INT)], ecx
 	pop	edx	; quantptr
 	pop	esi	; coef_block
 	pop	ecx	; ctr
 .nextcolumn:
 	add	esi, byte SIZEOF_JCOEF	; advance pointers to next column
 	add	edx, byte SIZEOF_IFAST_MULT_TYPE
 	add	edi, byte SIZEOF_INT
 	dec	ecx
 	jnz	near .columnloop
 	; ---- Pass 2: process rows from work array, store into output array.
 	mov	eax, POINTER [cinfo(ebp)]
 	mov	eax, POINTER [jdstruct_sample_range_limit(eax)]
 	sub	eax, byte -CENTERJSAMPLE*SIZEOF_JSAMPLE	; JSAMPLE * range_limit
 	mov	POINTER [range_limit], eax
 	lea	esi, [workspace]			; int * wsptr
 	mov	edi, JSAMPARRAY [output_buf(ebp)]	; (JSAMPROW *)
 	mov	ecx, DCTSIZE				; ctr
 	alignx	16,7
 .rowloop:
 	push	edi
 	mov	edi, JSAMPROW [edi]			; (JSAMPLE *)
 	add	edi, JDIMENSION [output_col(ebp)]	; edi=outptr
 %ifndef NO_ZERO_ROW_TEST
 	mov	eax, INT [ROW(1,esi,SIZEOF_INT)]
 	or	eax, INT [ROW(2,esi,SIZEOF_INT)]
 	jnz	short .rowDCT
 	mov	ebx, INT [ROW(3,esi,SIZEOF_INT)]
 	mov	eax, INT [ROW(4,esi,SIZEOF_INT)]
 	or	ebx, INT [ROW(5,esi,SIZEOF_INT)]
 	or	eax, INT [ROW(6,esi,SIZEOF_INT)]
 	or	ebx, INT [ROW(7,esi,SIZEOF_INT)]
 	or	eax,ebx
 	jnz	short .rowDCT
 	; -- AC terms all zero
 	mov	eax, INT [ROW(0,esi,SIZEOF_INT)]
 	mov	edx, POINTER [range_limit]	; (JSAMPLE *)
 	descale	eax,(PASS1_BITS+3)
 	and	eax,RANGE_MASK
 	mov	al, JSAMPLE [edx+eax*SIZEOF_JSAMPLE]
 	mov	JSAMPLE [edi+0*SIZEOF_JSAMPLE], al
 	mov	JSAMPLE [edi+1*SIZEOF_JSAMPLE], al
 	mov	JSAMPLE [edi+2*SIZEOF_JSAMPLE], al
 	mov	JSAMPLE [edi+3*SIZEOF_JSAMPLE], al
 	mov	JSAMPLE [edi+4*SIZEOF_JSAMPLE], al
 	mov	JSAMPLE [edi+5*SIZEOF_JSAMPLE], al
 	mov	JSAMPLE [edi+6*SIZEOF_JSAMPLE], al
 	mov	JSAMPLE [edi+7*SIZEOF_JSAMPLE], al
 	jmp	near .nextrow
 	alignx	16,7
 %endif
 .rowDCT:
 	push	esi	; wsptr
 	push	ecx	; ctr
 	mov	POINTER [ptr], edi	; outptr
 	; -- Even part
 	mov	eax, INT [ROW(0,esi,SIZEOF_INT)]
 	mov	ebx, INT [ROW(2,esi,SIZEOF_INT)]
 	mov	ecx, INT [ROW(4,esi,SIZEOF_INT)]
 	mov	edi, INT [ROW(6,esi,SIZEOF_INT)]
 	lea	edx,[eax+ecx]		; edx=tmp10
 	sub	eax,ecx			; eax=tmp11
 	lea	ecx,[ebx+edi]		; ecx=tmp13
 	sub	ebx,edi
 	imul	ebx,(F_1_414)
 	descale	ebx,CONST_BITS
 	sub	ebx,ecx			; ebx=tmp12
 	lea	edi,[edx+ecx]		; edi=tmp0
 	sub	edx,ecx			; edx=tmp3
 	lea	ecx,[eax+ebx]		; ecx=tmp1
 	sub	eax,ebx			; eax=tmp2
 	push	edx		; tmp3
 	push	eax		; tmp2
 	push	ecx		; tmp1
 	push	edi		; tmp0
 	; -- Odd part
 	mov	eax, INT [ROW(1,esi,SIZEOF_INT)]
 	mov	ecx, INT [ROW(3,esi,SIZEOF_INT)]
 	mov	edi, INT [ROW(5,esi,SIZEOF_INT)]
 	mov	ebx, INT [ROW(7,esi,SIZEOF_INT)]
 	lea	esi,[eax+ebx]		; esi=z11
 	sub	eax,ebx			; eax=z12
 	lea	edx,[edi+ecx]		; edx=z13
 	sub	edi,ecx			; edi=z10
 	lea	ebx,[esi+edx]		; ebx=tmp7
 	sub	esi,edx
 	imul	esi,(F_1_414)		; esi=tmp11
 	descale	esi,CONST_BITS
 	lea	ecx,[edi+eax]
 	imul	ecx,(F_1_847)		; ecx=z5
 	imul	edi,(-F_2_613)		; edi=MULTIPLY(z10,-FIX_2_613125930)
 	imul	eax,(F_1_082)		; eax=MULTIPLY(z12,FIX_1_082392200)
 	descale	ecx,CONST_BITS
 	descale	edi,CONST_BITS
 	descale	eax,CONST_BITS
 	add	edi,ecx			; edi=tmp12
 	sub	eax,ecx			; eax=tmp10
 	; -- Final output stage
 	sub	edi,ebx		; edi=tmp6
 	pop	edx		; edx=tmp0
 	sub	esi,edi		; esi=tmp5
 	pop	ecx		; ecx=tmp1
 	add	eax,esi		; eax=tmp4
 	push	esi		; tmp5
 	push	eax		; tmp4
 	lea	eax,[edx+ebx]	; eax=data0(=tmp0+tmp7)
 	sub	edx,ebx		; edx=data7(=tmp0-tmp7)
 	lea	ebx,[ecx+edi]	; ebx=data1(=tmp1+tmp6)
 	sub	ecx,edi		; ecx=data6(=tmp1-tmp6)
 	mov	esi, POINTER [range_limit]	; (JSAMPLE *)
 	descale	eax,(PASS1_BITS+3)
 	descale	edx,(PASS1_BITS+3)
 	descale	ebx,(PASS1_BITS+3)
 	descale	ecx,(PASS1_BITS+3)
 	mov	edi, POINTER [ptr]		; edi=outptr
 	and	eax,RANGE_MASK
 	and	edx,RANGE_MASK
 	and	ebx,RANGE_MASK
 	and	ecx,RANGE_MASK
 	mov	al, JSAMPLE [esi+eax*SIZEOF_JSAMPLE]
 	mov	dl, JSAMPLE [esi+edx*SIZEOF_JSAMPLE]
 	mov	bl, JSAMPLE [esi+ebx*SIZEOF_JSAMPLE]
 	mov	cl, JSAMPLE [esi+ecx*SIZEOF_JSAMPLE]
 	mov	JSAMPLE [edi+0*SIZEOF_JSAMPLE], al
 	mov	JSAMPLE [edi+7*SIZEOF_JSAMPLE], dl
 	mov	JSAMPLE [edi+1*SIZEOF_JSAMPLE], bl
 	mov	JSAMPLE [edi+6*SIZEOF_JSAMPLE], cl
 	pop	esi		; esi=tmp4
 	pop	eax		; eax=tmp5
 	pop	edx		; edx=tmp2
 	pop	ecx		; ecx=tmp3
 	lea	ebx,[edx+eax]	; ebx=data2(=tmp2+tmp5)
 	sub	edx,eax		; edx=data5(=tmp2-tmp5)
 	lea	eax,[ecx+esi]	; eax=data4(=tmp3+tmp4)
 	sub	ecx,esi		; ecx=data3(=tmp3-tmp4)
 	mov	esi, POINTER [range_limit]	; (JSAMPLE *)
 	descale	ebx,(PASS1_BITS+3)
 	descale	edx,(PASS1_BITS+3)
 	descale	eax,(PASS1_BITS+3)
 	descale	ecx,(PASS1_BITS+3)
 	and	ebx,RANGE_MASK
 	and	edx,RANGE_MASK
 	and	eax,RANGE_MASK
 	and	ecx,RANGE_MASK
 	mov	bl, JSAMPLE [esi+ebx*SIZEOF_JSAMPLE]
 	mov	dl, JSAMPLE [esi+edx*SIZEOF_JSAMPLE]
 	mov	al, JSAMPLE [esi+eax*SIZEOF_JSAMPLE]
 	mov	cl, JSAMPLE [esi+ecx*SIZEOF_JSAMPLE]
 	mov	JSAMPLE [edi+2*SIZEOF_JSAMPLE], bl
 	mov	JSAMPLE [edi+5*SIZEOF_JSAMPLE], dl
 	mov	JSAMPLE [edi+4*SIZEOF_JSAMPLE], al
 	mov	JSAMPLE [edi+3*SIZEOF_JSAMPLE], cl
 	pop	ecx	; ctr
 	pop	esi	; wsptr
 .nextrow:
 	pop	edi
 	add	esi, byte DCTSIZE*SIZEOF_INT	; advance pointer to next row
 	add	edi, byte SIZEOF_JSAMPROW
 	dec	ecx
 	jnz	near .rowloop
 	pop	edi
 	pop	esi
 ;	pop	edx		; need not be preserved
 ;	pop	ecx		; need not be preserved
 	pop	ebx
 	mov	esp,ebp
 	pop	ebp
 	ret
 %endif ; DCT_IFAST_SUPPORTED
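Reviewer note: the fixed-point constants in jidctfst.asm (277, 362, 473, 669 at `CONST_BITS == 8`) can be cross-checked against the 30-bit values in the fallback branch, and the two behaviours of the `descale` macro can be compared directly. The C sketch below does both. The function names are hypothetical, and the code assumes `>>` on a negative signed integer is an arithmetic shift (as `sar` is), which C leaves implementation-defined.

```c
#include <stdint.h>

/* DESCALE(x,n) from the fallback branch: right shift with round-to-nearest. */
static int64_t descale_round(int64_t x, int n)
{
    return (x + ((int64_t)1 << (n - 1))) >> n;
}

/* Derive a const_bits-bit constant from its 30-bit fixed-point form. */
static int32_t fix_from_q30(int64_t q30, int const_bits)
{
    return (int32_t)descale_round(q30, 30 - const_bits);
}

/* 'descale' without USE_ACCURATE_ROUNDING: a bare arithmetic shift. */
static int32_t descale_fast(int32_t x, int n)
{
    return x >> n;
}

/* 'descale' with USE_ACCURATE_ROUNDING: add half an LSB, then shift. */
static int32_t descale_accurate(int32_t x, int n)
{
    return (x + ((int32_t)1 << (n - 1))) >> n;
}
```

With these helpers, `fix_from_q30(1162209775, 8)` reproduces `F_1_082 = 277`, while `descale_fast(255, 8)` gives 0 where `descale_accurate(255, 8)` gives 1, which is the rounding error the macro's comment accepts in exchange for one fewer `add` per descale.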
--- a/jidctint.asm
+++ b/jidctint.asm
@@ -0,0 +1,524 @@
 ;
 ; jidctint.asm - accurate integer IDCT (non-SIMD)
 ;
 ; x86 SIMD extension for IJG JPEG library
 ; Copyright (C) 1999-2006, MIYASAKA Masaru.
 ; For conditions of distribution and use, see copyright notice in jsimdext.inc
 ;
 ; This file should be assembled with NASM (Netwide Assembler); it
 ; can *not* be assembled with Microsoft's MASM or any compatible
 ; assembler (including Borland's Turbo Assembler).
 ; NASM is available from http://nasm.sourceforge.net/ or
 ; http://sourceforge.net/project/showfiles.php?group_id=6208
 ;
 ; This file contains a slow-but-accurate integer implementation of the
 ; inverse DCT (Discrete Cosine Transform). The following code is based
 ; directly on the IJG's original jidctint.c; see jidctint.c for
 ; more details.
 ;
 ; Last Modified : October 17, 2004
 ;
 ; [TAB8]
 %include "jsimdext.inc"
 %include "jdct.inc"
 %ifdef DCT_ISLOW_SUPPORTED
 ; This module is specialized to the case DCTSIZE = 8.
 ;
 %if DCTSIZE != 8
 %error "Sorry, this code only copes with 8x8 DCTs."
 %endif
 ; --------------------------------------------------------------------------
 ; Descale and correctly round a DWORD value that's scaled by N bits.
 ;
 %macro	descale 2
 %if (%2)<=7
 	add	%1, byte (1<<((%2)-1))	; add reg32,imm8
 %else
 	add	%1, (1<<((%2)-1))	; add reg32,imm32
 %endif
 	sar	%1,%2
 %endmacro
 ; --------------------------------------------------------------------------
 %define CONST_BITS	13
 %define PASS1_BITS	2
 %if CONST_BITS == 13
 F_0_298	equ	 2446		; FIX(0.298631336)
 F_0_390	equ	 3196		; FIX(0.390180644)
 F_0_541	equ	 4433		; FIX(0.541196100)
 F_0_765	equ	 6270		; FIX(0.765366865)
 F_0_899	equ	 7373		; FIX(0.899976223)
 F_1_175	equ	 9633		; FIX(1.175875602)
 F_1_501	equ	12299		; FIX(1.501321110)
 F_1_847	equ	15137		; FIX(1.847759065)
 F_1_961	equ	16069		; FIX(1.961570560)
 F_2_053	equ	16819		; FIX(2.053119869)
 F_2_562	equ	20995		; FIX(2.562915447)
 F_3_072	equ	25172		; FIX(3.072711026)
 %else
 ; NASM cannot do compile-time arithmetic on floating-point constants.
 %define DESCALE(x,n)  (((x)+(1<<((n)-1)))>>(n))
 F_0_298	equ	DESCALE( 320652955,30-CONST_BITS)	; FIX(0.298631336)
 F_0_390	equ	DESCALE( 418953276,30-CONST_BITS)	; FIX(0.390180644)
 F_0_541	equ	DESCALE( 581104887,30-CONST_BITS)	; FIX(0.541196100)
 F_0_765	equ	DESCALE( 821806413,30-CONST_BITS)	; FIX(0.765366865)
 F_0_899	equ	DESCALE( 966342111,30-CONST_BITS)	; FIX(0.899976223)
 F_1_175	equ	DESCALE(1262586813,30-CONST_BITS)	; FIX(1.175875602)
 F_1_501	equ	DESCALE(1612031267,30-CONST_BITS)	; FIX(1.501321110)
 F_1_847	equ	DESCALE(1984016188,30-CONST_BITS)	; FIX(1.847759065)
 F_1_961	equ	DESCALE(2106220350,30-CONST_BITS)	; FIX(1.961570560)
 F_2_053	equ	DESCALE(2204520673,30-CONST_BITS)	; FIX(2.053119869)
 F_2_562	equ	DESCALE(2751909506,30-CONST_BITS)	; FIX(2.562915447)
 F_3_072	equ	DESCALE(3299298341,30-CONST_BITS)	; FIX(3.072711026)
 %endif
 ; --------------------------------------------------------------------------
 	SECTION	SEG_TEXT
 	BITS	32
 ;
 ; Perform dequantization and inverse DCT on one block of coefficients.
 ;
 ; GLOBAL(void)
 ; jpeg_idct_islow (j_decompress_ptr cinfo, jpeg_component_info * compptr,
 ;                  JCOEFPTR coef_block,
 ;                  JSAMPARRAY output_buf, JDIMENSION output_col)
 ;
 %define cinfo(b)	(b)+8		; j_decompress_ptr cinfo
 %define compptr(b)	(b)+12		; jpeg_component_info * compptr
 %define coef_block(b)	(b)+16		; JCOEFPTR coef_block
 %define output_buf(b)	(b)+20		; JSAMPARRAY output_buf
 %define output_col(b)	(b)+24		; JDIMENSION output_col
 %define range_limit	ebp-SIZEOF_POINTER		; JSAMPLE * range_limit
 %define ptr		range_limit-SIZEOF_POINTER	; void * ptr
 %define workspace	ptr-DCTSIZE2*SIZEOF_INT
 					; int workspace[DCTSIZE2]
 	align	16
 	global	EXTN(jpeg_idct_islow)
 EXTN(jpeg_idct_islow):
 	push	ebp
 	mov	ebp,esp
 	lea	esp, [workspace]
 	push	ebx
 ;	push	ecx		; need not be preserved
 ;	push	edx		; need not be preserved
 	push	esi
 	push	edi
 	; ---- Pass 1: process columns from input, store into work array.
 	mov	edx, POINTER [compptr(ebp)]
 	mov	edx, POINTER [jcompinfo_dct_table(edx)]	; quantptr
 	mov	esi, JCOEFPTR [coef_block(ebp)]		; inptr
 	lea	edi, [workspace]			; int * wsptr
 	mov	ecx, DCTSIZE				; ctr
 	alignx	16,7
 .columnloop:
 	mov	ax, JCOEF [COL(1,esi,SIZEOF_JCOEF)]
 	or	ax, JCOEF [COL(2,esi,SIZEOF_JCOEF)]
 	jnz	short .columnDCT
 	mov	bx, JCOEF [COL(3,esi,SIZEOF_JCOEF)]
 	mov	ax, JCOEF [COL(4,esi,SIZEOF_JCOEF)]
 	or	bx, JCOEF [COL(5,esi,SIZEOF_JCOEF)]
 	or	ax, JCOEF [COL(6,esi,SIZEOF_JCOEF)]
 	or	bx, JCOEF [COL(7,esi,SIZEOF_JCOEF)]
 	or	ax,bx
 	jnz	short .columnDCT
 	; -- AC terms all zero
 	mov	ax, JCOEF [COL(0,esi,SIZEOF_JCOEF)]
 	imul	ax, ISLOW_MULT_TYPE [COL(0,edx,SIZEOF_ISLOW_MULT_TYPE)]
 	cwde
 	sal	eax,PASS1_BITS
 	mov	INT [COL(0,edi,SIZEOF_INT)], eax
 	mov	INT [COL(1,edi,SIZEOF_INT)], eax
 	mov	INT [COL(2,edi,SIZEOF_INT)], eax
 	mov	INT [COL(3,edi,SIZEOF_INT)], eax
 	mov	INT [COL(4,edi,SIZEOF_INT)], eax
 	mov	INT [COL(5,edi,SIZEOF_INT)], eax
 	mov	INT [COL(6,edi,SIZEOF_INT)], eax
 	mov	INT [COL(7,edi,SIZEOF_INT)], eax
 	jmp	near .nextcolumn
 	alignx	16,7
 .columnDCT:
 	push	ecx	; ctr
 	push	esi	; coef_block
 	push	edx	; quantptr
 	mov	POINTER [ptr], edi	; wsptr
 	; -- Even part
 	movsx	eax, JCOEF [COL(0,esi,SIZEOF_JCOEF)]
 	movsx	ecx, JCOEF [COL(4,esi,SIZEOF_JCOEF)]
 	imul	ax, ISLOW_MULT_TYPE [COL(0,edx,SIZEOF_ISLOW_MULT_TYPE)]
 	imul	cx, ISLOW_MULT_TYPE [COL(4,edx,SIZEOF_ISLOW_MULT_TYPE)]
 	movsx	ebx, JCOEF [COL(2,esi,SIZEOF_JCOEF)]
 	movsx	edi, JCOEF [COL(6,esi,SIZEOF_JCOEF)]
 	imul	bx, ISLOW_MULT_TYPE [COL(2,edx,SIZEOF_ISLOW_MULT_TYPE)]
 	imul	di, ISLOW_MULT_TYPE [COL(6,edx,SIZEOF_ISLOW_MULT_TYPE)]
 	lea	edx,[eax+ecx]
 	sub	eax,ecx
 	sal	edx,CONST_BITS	; edx=tmp0
 	sal	eax,CONST_BITS	; eax=tmp1
 	lea	ecx,[ebx+edi]
 	imul	ecx,(F_0_541)	; ecx=z1
 	imul	ebx,(F_0_765)	; ebx=MULTIPLY(z2,FIX_0_765366865)
 	imul	edi,(-F_1_847)	; edi=MULTIPLY(z3,-FIX_1_847759065)
 	add	ebx,ecx		; ebx=tmp3
 	add	edi,ecx		; edi=tmp2
 	lea	ecx,[edx+ebx]	; ecx=tmp10
 	sub	edx,ebx		; edx=tmp13
 	lea	ebx,[eax+edi]	; ebx=tmp11
 	sub	eax,edi		; eax=tmp12
 	push	edx		; tmp13
 	push	eax		; tmp12
 	push	ebx		; tmp11
 	push	ecx		; tmp10
 	; -- Odd part
 	mov	edx, POINTER [esp+16]	; quantptr
 	movsx	eax, JCOEF [COL(1,esi,SIZEOF_JCOEF)]
 	movsx	edi, JCOEF [COL(3,esi,SIZEOF_JCOEF)]
 	imul	ax, ISLOW_MULT_TYPE [COL(1,edx,SIZEOF_ISLOW_MULT_TYPE)]
 	imul	di, ISLOW_MULT_TYPE [COL(3,edx,SIZEOF_ISLOW_MULT_TYPE)]
 	movsx	ecx, JCOEF [COL(5,esi,SIZEOF_JCOEF)]
 	movsx	ebx, JCOEF [COL(7,esi,SIZEOF_JCOEF)]
 	imul	cx, ISLOW_MULT_TYPE [COL(5,edx,SIZEOF_ISLOW_MULT_TYPE)]
 	imul	bx, ISLOW_MULT_TYPE [COL(7,edx,SIZEOF_ISLOW_MULT_TYPE)]
 	push	eax		; eax=tmp3
 	push	edi		; edi=tmp2
 	push	ecx		; ecx=tmp1
 	push	ebx		; ebx=tmp0
 	lea	esi,[ebx+edi]	; esi=z3
 	lea	edx,[ecx+eax]	; edx=z4
 	add	ebx,eax		; ebx=z1
 	add	ecx,edi		; ecx=z2
 	lea	eax,[esi+edx]
 	imul	eax,(F_1_175)	; eax=z5
 	imul	esi,(-F_1_961)	; esi=z3(=MULTIPLY(z3,-FIX_1_961570560))
 	imul	edx,(-F_0_390)	; edx=z4(=MULTIPLY(z4,-FIX_0_390180644))
 	imul	ebx,(-F_0_899)	; ebx=z1(=MULTIPLY(z1,-FIX_0_899976223))
 	imul	ecx,(-F_2_562)	; ecx=z2(=MULTIPLY(z2,-FIX_2_562915447))
 	add	esi,eax		; esi=z3(=z3+z5)
 	add	edx,eax		; edx=z4(=z4+z5)
 	lea	edi,[esi+ebx]	; edi=z1+z3
 	lea	eax,[edx+ecx]	; eax=z2+z4
 	add	esi,ecx		; esi=z2+z3
 	add	edx,ebx		; edx=z1+z4
 	pop	ecx		; ecx=tmp0
 	pop	ebx		; ebx=tmp1
 	imul	ecx,(F_0_298)	; ecx=tmp0(=MULTIPLY(tmp0,FIX_0_298631336))
 	imul	ebx,(F_2_053)	; ebx=tmp1(=MULTIPLY(tmp1,FIX_2_053119869))
 	add	edi,ecx		; edi=tmp0(=tmp0+z1+z3)
 	add	eax,ebx		; eax=tmp1(=tmp1+z2+z4)
 	pop	ecx		; ecx=tmp2
 	pop	ebx		; ebx=tmp3
 	imul	ecx,(F_3_072)	; ecx=tmp2(=MULTIPLY(tmp2,FIX_3_072711026))
 	imul	ebx,(F_1_501)	; ebx=tmp3(=MULTIPLY(tmp3,FIX_1_501321110))
 	add	esi,ecx		; esi=tmp2(=tmp2+z2+z3)
 	add	edx,ebx		; edx=tmp3(=tmp3+z1+z4)
 	; -- Final output stage
 	pop	ecx		; ecx=tmp10
 	pop	ebx		; ebx=tmp11
 	push	eax		; tmp1
 	push	edi		; tmp0
 	lea	eax,[ecx+edx]	; eax=data0(=tmp10+tmp3)
 	sub	ecx,edx		; ecx=data7(=tmp10-tmp3)
 	lea	edx,[ebx+esi]	; edx=data1(=tmp11+tmp2)
 	sub	ebx,esi		; ebx=data6(=tmp11-tmp2)
 	mov	edi, POINTER [ptr]	; edi=wsptr
 	descale	eax,(CONST_BITS-PASS1_BITS)
 	descale	ecx,(CONST_BITS-PASS1_BITS)
 	descale	edx,(CONST_BITS-PASS1_BITS)
 	descale	ebx,(CONST_BITS-PASS1_BITS)
 	mov	INT [COL(0,edi,SIZEOF_INT)], eax
 	mov	INT [COL(7,edi,SIZEOF_INT)], ecx
 	mov	INT [COL(1,edi,SIZEOF_INT)], edx
 	mov	INT [COL(6,edi,SIZEOF_INT)], ebx
 	pop	esi		; esi=tmp0
 	pop	eax		; eax=tmp1
 	pop	ecx		; ecx=tmp12
 	pop	edx		; edx=tmp13
 	lea	ebx,[ecx+eax]	; ebx=data2(=tmp12+tmp1)
 	sub	ecx,eax		; ecx=data5(=tmp12-tmp1)
 	lea	eax,[edx+esi]	; eax=data3(=tmp13+tmp0)
 	sub	edx,esi		; edx=data4(=tmp13-tmp0)
 	descale	ebx,(CONST_BITS-PASS1_BITS)
 	descale	ecx,(CONST_BITS-PASS1_BITS)
 	descale	eax,(CONST_BITS-PASS1_BITS)
 	descale	edx,(CONST_BITS-PASS1_BITS)
 	mov	INT [COL(2,edi,SIZEOF_INT)], ebx
 	mov	INT [COL(5,edi,SIZEOF_INT)], ecx
 	mov	INT [COL(3,edi,SIZEOF_INT)], eax
 	mov	INT [COL(4,edi,SIZEOF_INT)], edx
 	pop	edx	; quantptr
 	pop	esi	; coef_block
 	pop	ecx	; ctr
 .nextcolumn:
 	add	esi, byte SIZEOF_JCOEF	; advance pointers to next column
 	add	edx, byte SIZEOF_ISLOW_MULT_TYPE
 	add	edi, byte SIZEOF_INT
 	dec	ecx
 	jnz	near .columnloop
 	; ---- Pass 2: process rows from work array, store into output array.
 	mov	eax, POINTER [cinfo(ebp)]
 	mov	eax, POINTER [jdstruct_sample_range_limit(eax)]
 	sub	eax, byte -CENTERJSAMPLE*SIZEOF_JSAMPLE	; JSAMPLE * range_limit
 	mov	POINTER [range_limit], eax
 	lea	esi, [workspace]			; int * wsptr
 	mov	edi, JSAMPARRAY [output_buf(ebp)]	; (JSAMPROW *)
 	mov	ecx, DCTSIZE				; ctr
 	alignx	16,7
 .rowloop:
 	push	edi
 	mov	edi, JSAMPROW [edi]			; (JSAMPLE *)
 	add	edi, JDIMENSION [output_col(ebp)]	; edi=outptr
 %ifndef NO_ZERO_ROW_TEST
 	mov	eax, INT [ROW(1,esi,SIZEOF_INT)]
 	or	eax, INT [ROW(2,esi,SIZEOF_INT)]
 	jnz	short .rowDCT
 	mov	ebx, INT [ROW(3,esi,SIZEOF_INT)]
 	mov	eax, INT [ROW(4,esi,SIZEOF_INT)]
 	or	ebx, INT [ROW(5,esi,SIZEOF_INT)]
 	or	eax, INT [ROW(6,esi,SIZEOF_INT)]
 	or	ebx, INT [ROW(7,esi,SIZEOF_INT)]
 	or	eax,ebx
 	jnz	short .rowDCT
 	; -- AC terms all zero
 	mov	eax, INT [ROW(0,esi,SIZEOF_INT)]
 	mov	edx, POINTER [range_limit]	; (JSAMPLE *)
 	descale	eax,(PASS1_BITS+3)
 	and	eax,RANGE_MASK
 	mov	al, JSAMPLE [edx+eax*SIZEOF_JSAMPLE]
 	mov	JSAMPLE [edi+0*SIZEOF_JSAMPLE], al
 	mov	JSAMPLE [edi+1*SIZEOF_JSAMPLE], al
 	mov	JSAMPLE [edi+2*SIZEOF_JSAMPLE], al
 	mov	JSAMPLE [edi+3*SIZEOF_JSAMPLE], al
 	mov	JSAMPLE [edi+4*SIZEOF_JSAMPLE], al
 	mov	JSAMPLE [edi+5*SIZEOF_JSAMPLE], al
 	mov	JSAMPLE [edi+6*SIZEOF_JSAMPLE], al
 	mov	JSAMPLE [edi+7*SIZEOF_JSAMPLE], al
 	jmp	near .nextrow
 	alignx	16,7
 %endif
 .rowDCT:
 	push	esi	; wsptr
 	push	ecx	; ctr
 	mov	POINTER [ptr], edi	; outptr
 	; -- Even part
 	mov	eax, INT [ROW(0,esi,SIZEOF_INT)]
 	mov	ebx, INT [ROW(2,esi,SIZEOF_INT)]
 	mov	ecx, INT [ROW(4,esi,SIZEOF_INT)]
 	mov	edi, INT [ROW(6,esi,SIZEOF_INT)]
 	lea	edx,[eax+ecx]
 	sub	eax,ecx
 	sal	edx,CONST_BITS	; edx=tmp0
 	sal	eax,CONST_BITS	; eax=tmp1
 	lea	ecx,[ebx+edi]
 	imul	ecx,(F_0_541)	; ecx=z1
 	imul	ebx,(F_0_765)	; ebx=MULTIPLY(z2,FIX_0_765366865)
 	imul	edi,(-F_1_847)	; edi=MULTIPLY(z3,-FIX_1_847759065)
 	add	ebx,ecx		; ebx=tmp3
 	add	edi,ecx		; edi=tmp2
 	lea	ecx,[edx+ebx]	; ecx=tmp10
 	sub	edx,ebx		; edx=tmp13
 	lea	ebx,[eax+edi]	; ebx=tmp11
 	sub	eax,edi		; eax=tmp12
 	push	edx		; tmp13
 	push	eax		; tmp12
 	push	ebx		; tmp11
 	push	ecx		; tmp10
 	; -- Odd part
 	mov	eax, INT [ROW(1,esi,SIZEOF_INT)]
 	mov	edi, INT [ROW(3,esi,SIZEOF_INT)]
 	mov	ecx, INT [ROW(5,esi,SIZEOF_INT)]
 	mov	ebx, INT [ROW(7,esi,SIZEOF_INT)]
 	push	eax		; eax=tmp3
 	push	edi		; edi=tmp2
 	push	ecx		; ecx=tmp1
 	push	ebx		; ebx=tmp0
 	lea	esi,[ebx+edi]	; esi=z3
 	lea	edx,[ecx+eax]	; edx=z4
 	add	ebx,eax		; ebx=z1
 	add	ecx,edi		; ecx=z2
 	lea	eax,[esi+edx]
 	imul	eax,(F_1_175)	; eax=z5
 	imul	esi,(-F_1_961)	; esi=z3(=MULTIPLY(z3,-FIX_1_961570560))
 	imul	edx,(-F_0_390)	; edx=z4(=MULTIPLY(z4,-FIX_0_390180644))
 	imul	ebx,(-F_0_899)	; ebx=z1(=MULTIPLY(z1,-FIX_0_899976223))
 	imul	ecx,(-F_2_562)	; ecx=z2(=MULTIPLY(z2,-FIX_2_562915447))
 	add	esi,eax		; esi=z3(=z3+z5)
 	add	edx,eax		; edx=z4(=z4+z5)
 	lea	edi,[esi+ebx]	; edi=z1+z3
 	lea	eax,[edx+ecx]	; eax=z2+z4
 	add	esi,ecx		; esi=z2+z3
 	add	edx,ebx		; edx=z1+z4
 	pop	ecx		; ecx=tmp0
 	pop	ebx		; ebx=tmp1
 	imul	ecx,(F_0_298)	; ecx=tmp0(=MULTIPLY(tmp0,FIX_0_298631336))
 	imul	ebx,(F_2_053)	; ebx=tmp1(=MULTIPLY(tmp1,FIX_2_053119869))
 	add	edi,ecx		; edi=tmp0(=tmp0+z1+z3)
 	add	eax,ebx		; eax=tmp1(=tmp1+z2+z4)
 	pop	ecx		; ecx=tmp2
 	pop	ebx		; ebx=tmp3
 	imul	ecx,(F_3_072)	; ecx=tmp2(=MULTIPLY(tmp2,FIX_3_072711026))
 	imul	ebx,(F_1_501)	; ebx=tmp3(=MULTIPLY(tmp3,FIX_1_501321110))
 	add	esi,ecx		; esi=tmp2(=tmp2+z2+z3)
 	add	edx,ebx		; edx=tmp3(=tmp3+z1+z4)
 	; -- Final output stage
 	pop	ecx		; ecx=tmp10
 	pop	ebx		; ebx=tmp11
 	push	eax		; tmp1
 	push	edi		; tmp0
 	lea	eax,[ecx+edx]	; eax=data0(=tmp10+tmp3)
 	sub	ecx,edx		; ecx=data7(=tmp10-tmp3)
 	lea	edx,[ebx+esi]	; edx=data1(=tmp11+tmp2)
 	sub	ebx,esi		; ebx=data6(=tmp11-tmp2)
 	mov	esi, POINTER [range_limit]	; (JSAMPLE *)
 	descale	eax,(CONST_BITS+PASS1_BITS+3)
 	descale	ecx,(CONST_BITS+PASS1_BITS+3)
 	descale	edx,(CONST_BITS+PASS1_BITS+3)
 	descale	ebx,(CONST_BITS+PASS1_BITS+3)
 	mov	edi, POINTER [ptr]		; edi=outptr
 	and	eax,RANGE_MASK
 	and	ecx,RANGE_MASK
 	and	edx,RANGE_MASK
 	and	ebx,RANGE_MASK
 	mov	al, JSAMPLE [esi+eax*SIZEOF_JSAMPLE]
 	mov	cl, JSAMPLE [esi+ecx*SIZEOF_JSAMPLE]
 	mov	dl, JSAMPLE [esi+edx*SIZEOF_JSAMPLE]
 	mov	bl, JSAMPLE [esi+ebx*SIZEOF_JSAMPLE]
 	mov	JSAMPLE [edi+0*SIZEOF_JSAMPLE], al
 	mov	JSAMPLE [edi+7*SIZEOF_JSAMPLE], cl
 	mov	JSAMPLE [edi+1*SIZEOF_JSAMPLE], dl
 	mov	JSAMPLE [edi+6*SIZEOF_JSAMPLE], bl
 	pop	esi		; esi=tmp0
 	pop	eax		; eax=tmp1
 	pop	ecx		; ecx=tmp12
 	pop	edx		; edx=tmp13
 	lea	ebx,[ecx+eax]	; ebx=data2(=tmp12+tmp1)
 	sub	ecx,eax		; ecx=data5(=tmp12-tmp1)
 	lea	eax,[edx+esi]	; eax=data3(=tmp13+tmp0)
 	sub	edx,esi		; edx=data4(=tmp13-tmp0)
 	mov	esi, POINTER [range_limit]	; (JSAMPLE *)
 	descale	ebx,(CONST_BITS+PASS1_BITS+3)
 	descale	ecx,(CONST_BITS+PASS1_BITS+3)
 	descale	eax,(CONST_BITS+PASS1_BITS+3)
 	descale	edx,(CONST_BITS+PASS1_BITS+3)
 	and	ebx,RANGE_MASK
 	and	ecx,RANGE_MASK
 	and	eax,RANGE_MASK
 	and	edx,RANGE_MASK
 	mov	bl, JSAMPLE [esi+ebx*SIZEOF_JSAMPLE]
 	mov	cl, JSAMPLE [esi+ecx*SIZEOF_JSAMPLE]
 	mov	al, JSAMPLE [esi+eax*SIZEOF_JSAMPLE]
 	mov	dl, JSAMPLE [esi+edx*SIZEOF_JSAMPLE]
 	mov	JSAMPLE [edi+2*SIZEOF_JSAMPLE], bl
 	mov	JSAMPLE [edi+5*SIZEOF_JSAMPLE], cl
 	mov	JSAMPLE [edi+3*SIZEOF_JSAMPLE], al
 	mov	JSAMPLE [edi+4*SIZEOF_JSAMPLE], dl
 	pop	ecx	; ctr
 	pop	esi	; wsptr
 .nextrow:
 	pop	edi
 	add	esi, byte DCTSIZE*SIZEOF_INT	; advance pointer to next row
 	add	edi, byte SIZEOF_JSAMPROW
 	dec	ecx
 	jnz	near .rowloop
 	pop	edi
 	pop	esi
 ;	pop	edx		; need not be preserved
 ;	pop	ecx		; need not be preserved
 	pop	ebx
 	mov	esp,ebp
 	pop	ebp
 	ret
 %endif ; DCT_ISLOW_SUPPORTED
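The `imul reg,(F_x_xxx)` and `descale` pairs used throughout the routine above are ordinary fixed-point arithmetic. Below is a minimal C sketch of that pattern, not the library's code. The names `fix`, `descale` and `mul_const` are hypothetical helpers, `CONST_BITS` is taken as 13, and `>>` on a negative `int32_t` is assumed to be an arithmetic shift (true of common x86 compilers, but implementation-defined in C).

```c
#include <assert.h>
#include <stdint.h>

#define CONST_BITS 13   /* assumed scale of the F_x_xxx constants */

/* Nearest integer to x * 2^CONST_BITS, for positive x (hypothetical helper). */
static int32_t fix(double x)
{
    return (int32_t)(x * (1 << CONST_BITS) + 0.5);
}

/* Rounding right shift, like the asm `descale` macro:
 * add half of 2^n, then shift right by n bits.
 * Assumes >> on a negative value is an arithmetic shift. */
static int32_t descale(int32_t x, int n)
{
    assert(n >= 1 && n <= 30);
    return (x + ((int32_t)1 << (n - 1))) >> n;
}

/* Multiply v by the real constant c using only integer operations. */
static int32_t mul_const(int32_t v, double c)
{
    return descale(v * fix(c), CONST_BITS);
}
```

In the assembly the descale is deferred: products stay scaled by `CONST_BITS` until the final output stage, so each pass performs a single rounding step instead of one per multiply.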
--- a/jidctred.asm
+++ b/jidctred.asm
@@ -0,0 +1,688 @@
 ;
 ; jidctred.asm - reduced-size IDCT (non-SIMD)
 ;
 ; x86 SIMD extension for IJG JPEG library
 ; Copyright (C) 1999-2006, MIYASAKA Masaru.
 ; For conditions of distribution and use, see copyright notice in jsimdext.inc
 ;
 ; This file should be assembled with NASM (Netwide Assembler); it
 ; can *not* be assembled with Microsoft's MASM or any compatible
 ; assembler (including Borland's Turbo Assembler).
 ; NASM is available from http://nasm.sourceforge.net/ or
 ; http://sourceforge.net/project/showfiles.php?group_id=6208
 ;
 ; This file contains inverse-DCT routines that produce reduced-size output:
 ; either 4x4, 2x2, or 1x1 pixels from an 8x8 DCT block.
 ; The following code is based directly on the IJG's original jidctred.c;
 ; see jidctred.c for more details.
 ;
 ; Last Modified : October 17, 2004
 ;
 ; [TAB8]
 %include "jsimdext.inc"
 %include "jdct.inc"
 %ifdef IDCT_SCALING_SUPPORTED
 ; This module is specialized to the case DCTSIZE = 8.
 ;
 %if DCTSIZE != 8
 %error "Sorry, this code only copes with 8x8 DCTs."
 %endif
 ; --------------------------------------------------------------------------
 ; Descale and correctly round a DWORD value that's scaled by N bits.
 ;
 %macro	descale 2
 %if (%2)<=7
 	add	%1, byte (1<<((%2)-1))	; add reg32,imm8
 %else
 	add	%1, (1<<((%2)-1))	; add reg32,imm32
 %endif
 	sar	%1,%2
 %endmacro
 ; --------------------------------------------------------------------------
 %define CONST_BITS	13
 %define PASS1_BITS	2
 %if CONST_BITS == 13
 F_0_211	equ	 1730		; FIX(0.211164243)
 F_0_509	equ	 4176		; FIX(0.509795579)
 F_0_601	equ	 4926		; FIX(0.601344887)
 F_0_720	equ	 5906		; FIX(0.720959822)
 F_0_765	equ	 6270		; FIX(0.765366865)
 F_0_850	equ	 6967		; FIX(0.850430095)
 F_0_899	equ	 7373		; FIX(0.899976223)
 F_1_061	equ	 8697		; FIX(1.061594337)
 F_1_272	equ	10426		; FIX(1.272758580)
 F_1_451	equ	11893		; FIX(1.451774981)
 F_1_847	equ	15137		; FIX(1.847759065)
 F_2_172	equ	17799		; FIX(2.172734803)
 F_2_562	equ	20995		; FIX(2.562915447)
 F_3_624	equ	29692		; FIX(3.624509785)
 %else
 ; NASM cannot do compile-time arithmetic on floating-point constants.
 %define DESCALE(x,n)  (((x)+(1<<((n)-1)))>>(n))
 F_0_211	equ	DESCALE( 226735879,30-CONST_BITS)	; FIX(0.211164243)
 F_0_509	equ	DESCALE( 547388834,30-CONST_BITS)	; FIX(0.509795579)
 F_0_601	equ	DESCALE( 645689155,30-CONST_BITS)	; FIX(0.601344887)
 F_0_720	equ	DESCALE( 774124714,30-CONST_BITS)	; FIX(0.720959822)
 F_0_765	equ	DESCALE( 821806413,30-CONST_BITS)	; FIX(0.765366865)
 F_0_850	equ	DESCALE( 913142361,30-CONST_BITS)	; FIX(0.850430095)
 F_0_899	equ	DESCALE( 966342111,30-CONST_BITS)	; FIX(0.899976223)
 F_1_061	equ	DESCALE(1139878239,30-CONST_BITS)	; FIX(1.061594337)
 F_1_272	equ	DESCALE(1366614119,30-CONST_BITS)	; FIX(1.272758580)
 F_1_451	equ	DESCALE(1558831516,30-CONST_BITS)	; FIX(1.451774981)
 F_1_847	equ	DESCALE(1984016188,30-CONST_BITS)	; FIX(1.847759065)
 F_2_172	equ	DESCALE(2332956230,30-CONST_BITS)	; FIX(2.172734803)
 F_2_562	equ	DESCALE(2751909506,30-CONST_BITS)	; FIX(2.562915447)
 F_3_624	equ	DESCALE(3891787747,30-CONST_BITS)	; FIX(3.624509785)
 %endif
 ; --------------------------------------------------------------------------
 	SECTION	SEG_TEXT
 	BITS	32
 ;
 ; Perform dequantization and inverse DCT on one block of coefficients,
 ; producing a reduced-size 4x4 output block.
 ;
 ; GLOBAL(void)
 ; jpeg_idct_4x4 (j_decompress_ptr cinfo, jpeg_component_info * compptr,
 ;                JCOEFPTR coef_block,
 ;                JSAMPARRAY output_buf, JDIMENSION output_col)
 ;
 %define cinfo(b)	(b)+8		; j_decompress_ptr cinfo
 %define compptr(b)	(b)+12		; jpeg_component_info * compptr
 %define coef_block(b)	(b)+16		; JCOEFPTR coef_block
 %define output_buf(b)	(b)+20		; JSAMPARRAY output_buf
 %define output_col(b)	(b)+24		; JDIMENSION output_col
 %define range_limit	ebp-SIZEOF_POINTER	; JSAMPLE * range_limit
 %define workspace	range_limit-(DCTSIZE*4)*SIZEOF_INT
 					; int workspace[DCTSIZE*4]
 	align	16
 	global	EXTN(jpeg_idct_4x4)
 EXTN(jpeg_idct_4x4):
 	push	ebp
 	mov	ebp,esp
 	lea	esp, [workspace]
 	push	ebx
 ;	push	ecx		; need not be preserved
 ;	push	edx		; need not be preserved
 	push	esi
 	push	edi
 	; ---- Pass 1: process columns from input, store into work array.
 	mov	edx, POINTER [compptr(ebp)]
 	mov	edx, POINTER [jcompinfo_dct_table(edx)]	; quantptr
 	mov	esi, JCOEFPTR [coef_block(ebp)]		; inptr
 	lea	edi, [workspace]			; int * wsptr
 	mov	ecx, DCTSIZE				; ctr
 	alignx	16,7
 .columnloop:
 	; Don't bother to process column 4, because the second pass won't use it
 	cmp	ecx, byte DCTSIZE-4
 	je	near .nextcolumn
 	mov	ax, JCOEF [COL(1,esi,SIZEOF_JCOEF)]
 	or	ax, JCOEF [COL(2,esi,SIZEOF_JCOEF)]
 	jnz	short .columnDCT
 	mov	ax, JCOEF [COL(3,esi,SIZEOF_JCOEF)]
 	mov	bx, JCOEF [COL(5,esi,SIZEOF_JCOEF)]
 	or	ax, JCOEF [COL(6,esi,SIZEOF_JCOEF)]
 	or	bx, JCOEF [COL(7,esi,SIZEOF_JCOEF)]
 	or	ax,bx
 	jnz	short .columnDCT
 	; -- AC terms all zero; we need not examine term 4 for 4x4 output
 	mov	ax, JCOEF [COL(0,esi,SIZEOF_JCOEF)]
 	imul	ax, ISLOW_MULT_TYPE [COL(0,edx,SIZEOF_ISLOW_MULT_TYPE)]
 	cwde
 	sal	eax, PASS1_BITS
 	mov	INT [COL(0,edi,SIZEOF_INT)], eax
 	mov	INT [COL(1,edi,SIZEOF_INT)], eax
 	mov	INT [COL(2,edi,SIZEOF_INT)], eax
 	mov	INT [COL(3,edi,SIZEOF_INT)], eax
 	jmp	near .nextcolumn
 	alignx	16,7
 .columnDCT:
 	push	ecx	; ctr
 	push	esi	; coef_block
 	push	edx	; quantptr
 	push	edi	; wsptr
 	; -- Even part
 	movsx	ebx, JCOEF [COL(2,esi,SIZEOF_JCOEF)]
 	movsx	ecx, JCOEF [COL(6,esi,SIZEOF_JCOEF)]
 	movsx	eax, JCOEF [COL(0,esi,SIZEOF_JCOEF)]
 	imul	bx, ISLOW_MULT_TYPE [COL(2,edx,SIZEOF_ISLOW_MULT_TYPE)]
 	imul	cx, ISLOW_MULT_TYPE [COL(6,edx,SIZEOF_ISLOW_MULT_TYPE)]
 	imul	ax, ISLOW_MULT_TYPE [COL(0,edx,SIZEOF_ISLOW_MULT_TYPE)]
 	imul	ebx,(F_1_847)		; ebx=MULTIPLY(z2,FIX_1_847759065)
 	imul	ecx,(-F_0_765)		; ecx=MULTIPLY(z3,-FIX_0_765366865)
 	sal	eax,(CONST_BITS+1)	; eax=tmp0
 	add	ecx,ebx			; ecx=tmp2
 	lea	edi,[eax+ecx]		; edi=tmp10
 	sub	eax,ecx			; eax=tmp12
 	push	eax		; tmp12
 	push	edi		; tmp10
 	; -- Odd part
 	movsx	edi, JCOEF [COL(7,esi,SIZEOF_JCOEF)]
 	movsx	ecx, JCOEF [COL(5,esi,SIZEOF_JCOEF)]
 	imul	di, ISLOW_MULT_TYPE [COL(7,edx,SIZEOF_ISLOW_MULT_TYPE)]
 	imul	cx, ISLOW_MULT_TYPE [COL(5,edx,SIZEOF_ISLOW_MULT_TYPE)]
 	movsx	ebx, JCOEF [COL(3,esi,SIZEOF_JCOEF)]
 	movsx	eax, JCOEF [COL(1,esi,SIZEOF_JCOEF)]
 	imul	bx, ISLOW_MULT_TYPE [COL(3,edx,SIZEOF_ISLOW_MULT_TYPE)]
 	imul	ax, ISLOW_MULT_TYPE [COL(1,edx,SIZEOF_ISLOW_MULT_TYPE)]
 	mov	esi,edi		; esi=edi=z1
 	mov	edx,ecx		; edx=ecx=z2
 	imul	edi,(-F_0_211)	; edi=MULTIPLY(z1,-FIX_0_211164243)
 	imul	ecx,(F_1_451)	; ecx=MULTIPLY(z2,FIX_1_451774981)
 	imul	esi,(-F_0_509)	; esi=MULTIPLY(z1,-FIX_0_509795579)
 	imul	edx,(-F_0_601)	; edx=MULTIPLY(z2,-FIX_0_601344887)
 	add	edi,ecx		; edi=(tmp0)
 	add	esi,edx		; esi=(tmp2)
 	mov	ecx,ebx		; ecx=ebx=z3
 	mov	edx,eax		; edx=eax=z4
 	imul	ebx,(-F_2_172)	; ebx=MULTIPLY(z3,-FIX_2_172734803)
 	imul	eax,(F_1_061)	; eax=MULTIPLY(z4,FIX_1_061594337)
 	imul	ecx,(F_0_899)	; ecx=MULTIPLY(z3,FIX_0_899976223)
 	imul	edx,(F_2_562)	; edx=MULTIPLY(z4,FIX_2_562915447)
 	add	edi,ebx
 	add	esi,ecx
 	add	edi,eax		; edi=tmp0
 	add	esi,edx		; esi=tmp2
 	; -- Final output stage
 	pop	ebx		; ebx=tmp10
 	pop	ecx		; ecx=tmp12
 	lea	eax,[ebx+esi]	; eax=data0(=tmp10+tmp2)
 	sub	ebx,esi		; ebx=data3(=tmp10-tmp2)
 	lea	edx,[ecx+edi]	; edx=data1(=tmp12+tmp0)
 	sub	ecx,edi		; ecx=data2(=tmp12-tmp0)
 	pop	edi	; wsptr
 	descale	eax,(CONST_BITS-PASS1_BITS+1)
 	descale	ebx,(CONST_BITS-PASS1_BITS+1)
 	descale	edx,(CONST_BITS-PASS1_BITS+1)
 	descale	ecx,(CONST_BITS-PASS1_BITS+1)
 	mov	INT [COL(0,edi,SIZEOF_INT)], eax
 	mov	INT [COL(3,edi,SIZEOF_INT)], ebx
 	mov	INT [COL(1,edi,SIZEOF_INT)], edx
 	mov	INT [COL(2,edi,SIZEOF_INT)], ecx
 	pop	edx	; quantptr
 	pop	esi	; coef_block
 	pop	ecx	; ctr
 .nextcolumn:
 	add	esi, byte SIZEOF_JCOEF	; advance pointers to next column
 	add	edx, byte SIZEOF_ISLOW_MULT_TYPE
 	add	edi, byte SIZEOF_INT
 	dec	ecx
 	jnz	near .columnloop
 	; ---- Pass 2: process 4 rows from work array, store into output array.
 	mov	eax, POINTER [cinfo(ebp)]
 	mov	eax, POINTER [jdstruct_sample_range_limit(eax)]
 	sub	eax, byte -CENTERJSAMPLE*SIZEOF_JSAMPLE	; JSAMPLE * range_limit
 	mov	POINTER [range_limit], eax
 	lea	esi, [workspace]			; int * wsptr
 	mov	edi, JSAMPARRAY [output_buf(ebp)]	; (JSAMPROW *)
 	mov	ecx, DCTSIZE/2				; ctr
 	alignx	16,7
 .rowloop:
 	push	edi
 	mov	edi, JSAMPROW [edi]			; (JSAMPLE *)
 	add	edi, JDIMENSION [output_col(ebp)]	; edi=outptr
 %ifndef NO_ZERO_ROW_TEST
 	mov	eax, INT [ROW(1,esi,SIZEOF_INT)]
 	or	eax, INT [ROW(2,esi,SIZEOF_INT)]
 	jnz	short .rowDCT
 	mov	eax, INT [ROW(3,esi,SIZEOF_INT)]
 	mov	ebx, INT [ROW(5,esi,SIZEOF_INT)]
 	or	eax, INT [ROW(6,esi,SIZEOF_INT)]
 	or	ebx, INT [ROW(7,esi,SIZEOF_INT)]
 	or	eax,ebx
 	jnz	short .rowDCT
 	; -- AC terms all zero
 	mov	eax, INT [ROW(0,esi,SIZEOF_INT)]
 	mov	edx, POINTER [range_limit]	; (JSAMPLE *)
 	descale	eax,(PASS1_BITS+3)
 	and	eax,RANGE_MASK
 	mov	al, JSAMPLE [edx+eax*SIZEOF_JSAMPLE]
 	mov	JSAMPLE [edi+0*SIZEOF_JSAMPLE], al
 	mov	JSAMPLE [edi+1*SIZEOF_JSAMPLE], al
 	mov	JSAMPLE [edi+2*SIZEOF_JSAMPLE], al
 	mov	JSAMPLE [edi+3*SIZEOF_JSAMPLE], al
 	jmp	near .nextrow
 	alignx	16,7
 %endif
 .rowDCT:
 	push	esi	; wsptr
 	push	ecx	; ctr
 	push	edi	; outptr
 	; -- Even part
 	mov	eax, INT [ROW(0,esi,SIZEOF_INT)]
 	mov	ebx, INT [ROW(2,esi,SIZEOF_INT)]
 	mov	ecx, INT [ROW(6,esi,SIZEOF_INT)]
 	imul	ebx,(F_1_847)		; ebx=MULTIPLY(z2,FIX_1_847759065)
 	imul	ecx,(-F_0_765)		; ecx=MULTIPLY(z3,-FIX_0_765366865)
 	sal	eax,(CONST_BITS+1)	; eax=tmp0
 	add	ecx,ebx			; ecx=tmp2
 	lea	edi,[eax+ecx]		; edi=tmp10
 	sub	eax,ecx			; eax=tmp12
 	push	eax		; tmp12
 	push	edi		; tmp10
 	; -- Odd part
 	mov	eax, INT [ROW(1,esi,SIZEOF_INT)]
 	mov	ebx, INT [ROW(3,esi,SIZEOF_INT)]
 	mov	ecx, INT [ROW(5,esi,SIZEOF_INT)]
 	mov	edi, INT [ROW(7,esi,SIZEOF_INT)]
 	mov	esi,edi		; esi=edi=z1
 	mov	edx,ecx		; edx=ecx=z2
 	imul	edi,(-F_0_211)	; edi=MULTIPLY(z1,-FIX_0_211164243)
 	imul	ecx,(F_1_451)	; ecx=MULTIPLY(z2,FIX_1_451774981)
 	imul	esi,(-F_0_509)	; esi=MULTIPLY(z1,-FIX_0_509795579)
 	imul	edx,(-F_0_601)	; edx=MULTIPLY(z2,-FIX_0_601344887)
 	add	edi,ecx		; edi=(tmp0)
 	add	esi,edx		; esi=(tmp2)
 	mov	ecx,ebx		; ecx=ebx=z3
 	mov	edx,eax		; edx=eax=z4
 	imul	ebx,(-F_2_172)	; ebx=MULTIPLY(z3,-FIX_2_172734803)
 	imul	eax,(F_1_061)	; eax=MULTIPLY(z4,FIX_1_061594337)
 	imul	ecx,(F_0_899)	; ecx=MULTIPLY(z3,FIX_0_899976223)
 	imul	edx,(F_2_562)	; edx=MULTIPLY(z4,FIX_2_562915447)
 	add	edi,ebx
 	add	esi,ecx
 	add	edi,eax		; edi=tmp0
 	add	esi,edx		; esi=tmp2
 	; -- Final output stage
 	pop	ebx		; ebx=tmp10
 	pop	ecx		; ecx=tmp12
 	lea	eax,[ebx+esi]	; eax=data0(=tmp10+tmp2)
 	sub	ebx,esi		; ebx=data3(=tmp10-tmp2)
 	lea	edx,[ecx+edi]	; edx=data1(=tmp12+tmp0)
 	sub	ecx,edi		; ecx=data2(=tmp12-tmp0)
 	mov	esi, POINTER [range_limit]	; (JSAMPLE *)
 	descale	eax,(CONST_BITS+PASS1_BITS+3+1)
 	descale	ebx,(CONST_BITS+PASS1_BITS+3+1)
 	descale	edx,(CONST_BITS+PASS1_BITS+3+1)
 	descale	ecx,(CONST_BITS+PASS1_BITS+3+1)
 	pop	edi	; outptr
 	and	eax,RANGE_MASK
 	and	ebx,RANGE_MASK
 	and	edx,RANGE_MASK
 	and	ecx,RANGE_MASK
 	mov	al, JSAMPLE [esi+eax*SIZEOF_JSAMPLE]
 	mov	bl, JSAMPLE [esi+ebx*SIZEOF_JSAMPLE]
 	mov	dl, JSAMPLE [esi+edx*SIZEOF_JSAMPLE]
 	mov	cl, JSAMPLE [esi+ecx*SIZEOF_JSAMPLE]
 	mov	JSAMPLE [edi+0*SIZEOF_JSAMPLE], al
 	mov	JSAMPLE [edi+3*SIZEOF_JSAMPLE], bl
 	mov	JSAMPLE [edi+1*SIZEOF_JSAMPLE], dl
 	mov	JSAMPLE [edi+2*SIZEOF_JSAMPLE], cl
 	pop	ecx	; ctr
 	pop	esi	; wsptr
 .nextrow:
 	pop	edi
 	add	esi, byte DCTSIZE*SIZEOF_INT	; advance pointer to next row
 	add	edi, byte SIZEOF_JSAMPROW
 	dec	ecx
 	jnz	near .rowloop
 	pop	edi
 	pop	esi
 ;	pop	edx		; need not be preserved
 ;	pop	ecx		; need not be preserved
 	pop	ebx
 	mov	esp,ebp
 	pop	ebp
 	ret
 ; --------------------------------------------------------------------------
 ;
 ; Perform dequantization and inverse DCT on one block of coefficients,
 ; producing a reduced-size 2x2 output block.
 ;
 ; GLOBAL(void)
 ; jpeg_idct_2x2 (j_decompress_ptr cinfo, jpeg_component_info * compptr,
 ;                JCOEFPTR coef_block,
 ;                JSAMPARRAY output_buf, JDIMENSION output_col)
 ;
 %define cinfo(b)	(b)+8		; j_decompress_ptr cinfo
 %define compptr(b)	(b)+12		; jpeg_component_info * compptr
 %define coef_block(b)	(b)+16		; JCOEFPTR coef_block
 %define output_buf(b)	(b)+20		; JSAMPARRAY output_buf
 %define output_col(b)	(b)+24		; JDIMENSION output_col
 %define range_limit	ebp-SIZEOF_POINTER	; JSAMPLE * range_limit
 %define workspace	range_limit-(DCTSIZE*2)*SIZEOF_INT
 					; int workspace[DCTSIZE*2]
 	align	16
 	global	EXTN(jpeg_idct_2x2)
 EXTN(jpeg_idct_2x2):
 	push	ebp
 	mov	ebp,esp
 	lea	esp, [workspace]
 	push	ebx
 ;	push	ecx		; need not be preserved
 ;	push	edx		; need not be preserved
 	push	esi
 	push	edi
 	; ---- Pass 1: process columns from input, store into work array.
 	mov	edx, POINTER [compptr(ebp)]
 	mov	edx, POINTER [jcompinfo_dct_table(edx)]	; quantptr
 	mov	esi, JCOEFPTR [coef_block(ebp)]		; inptr
 	lea	edi, [workspace]			; int * wsptr
 	mov	ecx, DCTSIZE				; ctr
 	alignx	16,7
 .columnloop:
 	; Don't bother to process columns 2,4,6
 	; (ctr = 6,4,2: the only counter values with bits 0 and 3 both clear)
 	test	ecx, 0x09
 	jz	near .nextcolumn
 	mov	ax, JCOEF [COL(1,esi,SIZEOF_JCOEF)]
 	or	ax, JCOEF [COL(3,esi,SIZEOF_JCOEF)]
 	jnz	short .columnDCT
 	mov	ax, JCOEF [COL(5,esi,SIZEOF_JCOEF)]
 	or	ax, JCOEF [COL(7,esi,SIZEOF_JCOEF)]
 	jnz	short .columnDCT
 	; -- AC terms all zero; we need not examine terms 2,4,6 for 2x2 output
 	mov	ax, JCOEF [COL(0,esi,SIZEOF_JCOEF)]
 	imul	ax, ISLOW_MULT_TYPE [COL(0,edx,SIZEOF_ISLOW_MULT_TYPE)]
 	cwde
 	sal	eax, PASS1_BITS
 	mov	INT [COL(0,edi,SIZEOF_INT)], eax
 	mov	INT [COL(1,edi,SIZEOF_INT)], eax
 	jmp	short .nextcolumn
 	alignx	16,7
 .columnDCT:
 	push	ecx	; ctr
 	push	edi	; wsptr
 	; -- Odd part
 	movsx	eax, JCOEF [COL(1,esi,SIZEOF_JCOEF)]
 	movsx	ebx, JCOEF [COL(3,esi,SIZEOF_JCOEF)]
 	imul	ax, ISLOW_MULT_TYPE [COL(1,edx,SIZEOF_ISLOW_MULT_TYPE)]
 	imul	bx, ISLOW_MULT_TYPE [COL(3,edx,SIZEOF_ISLOW_MULT_TYPE)]
 	movsx	ecx, JCOEF [COL(5,esi,SIZEOF_JCOEF)]
 	movsx	edi, JCOEF [COL(7,esi,SIZEOF_JCOEF)]
 	imul	cx, ISLOW_MULT_TYPE [COL(5,edx,SIZEOF_ISLOW_MULT_TYPE)]
 	imul	di, ISLOW_MULT_TYPE [COL(7,edx,SIZEOF_ISLOW_MULT_TYPE)]
 	imul	eax,(F_3_624)	; eax=MULTIPLY(data1,FIX_3_624509785)
 	imul	ebx,(-F_1_272)	; ebx=MULTIPLY(data3,-FIX_1_272758580)
 	imul	ecx,(F_0_850)	; ecx=MULTIPLY(data5,FIX_0_850430095)
 	imul	edi,(-F_0_720)	; edi=MULTIPLY(data7,-FIX_0_720959822)
 	add	eax,ebx
 	add	ecx,edi
 	add	ecx,eax		; ecx=tmp0
 	; -- Even part
 	mov	ax, JCOEF [COL(0,esi,SIZEOF_JCOEF)]
 	imul	ax, ISLOW_MULT_TYPE [COL(0,edx,SIZEOF_ISLOW_MULT_TYPE)]
 	cwde
 	sal	eax,(CONST_BITS+2)	; eax=tmp10
 	; -- Final output stage
 	pop	edi	; wsptr
 	lea	ebx,[eax+ecx]	; ebx=data0(=tmp10+tmp0)
 	sub	eax,ecx		; eax=data1(=tmp10-tmp0)
 	pop	ecx	; ctr
 	descale	ebx,(CONST_BITS-PASS1_BITS+2)
 	descale	eax,(CONST_BITS-PASS1_BITS+2)
 	mov	INT [COL(0,edi,SIZEOF_INT)], ebx
 	mov	INT [COL(1,edi,SIZEOF_INT)], eax
 .nextcolumn:
 	add	esi, byte SIZEOF_JCOEF	; advance pointers to next column
 	add	edx, byte SIZEOF_ISLOW_MULT_TYPE
 	add	edi, byte SIZEOF_INT
 	dec	ecx
 	jnz	near .columnloop
 	; ---- Pass 2: process 2 rows from work array, store into output array.
 	mov	eax, POINTER [cinfo(ebp)]
 	mov	eax, POINTER [jdstruct_sample_range_limit(eax)]
 	sub	eax, byte -CENTERJSAMPLE*SIZEOF_JSAMPLE	; JSAMPLE * range_limit
 	mov	POINTER [range_limit], eax
 	lea	esi, [workspace]			; int * wsptr
 	mov	edi, JSAMPARRAY [output_buf(ebp)]	; (JSAMPROW *)
 	mov	ecx, DCTSIZE/4				; ctr
 	alignx	16,7
 .rowloop:
 	push	edi
 	mov	edi, JSAMPROW [edi]			; (JSAMPLE *)
 	add	edi, JDIMENSION [output_col(ebp)]	; edi=outptr
 %ifndef NO_ZERO_ROW_TEST
 	mov	eax, INT [ROW(1,esi,SIZEOF_INT)]
 	or	eax, INT [ROW(3,esi,SIZEOF_INT)]
 	jnz	short .rowDCT
 	mov	eax, INT [ROW(5,esi,SIZEOF_INT)]
 	or	eax, INT [ROW(7,esi,SIZEOF_INT)]
 	jnz	short .rowDCT
 	; -- AC terms all zero
 	mov	eax, INT [ROW(0,esi,SIZEOF_INT)]
 	mov	edx, POINTER [range_limit]	; (JSAMPLE *)
 	descale	eax,(PASS1_BITS+3)
 	and	eax,RANGE_MASK
 	mov	al, JSAMPLE [edx+eax*SIZEOF_JSAMPLE]
 	mov	JSAMPLE [edi+0*SIZEOF_JSAMPLE], al
 	mov	JSAMPLE [edi+1*SIZEOF_JSAMPLE], al
 	jmp	short .nextrow
 	alignx	16,7
 %endif
 .rowDCT:
 	push	ecx	; ctr
 	; -- Odd part
 	mov	eax, INT [ROW(1,esi,SIZEOF_INT)]
 	mov	ebx, INT [ROW(3,esi,SIZEOF_INT)]
 	mov	ecx, INT [ROW(5,esi,SIZEOF_INT)]
 	mov	edx, INT [ROW(7,esi,SIZEOF_INT)]
 	imul	eax,(F_3_624)	; eax=MULTIPLY(data1,FIX_3_624509785)
 	imul	ebx,(-F_1_272)	; ebx=MULTIPLY(data3,-FIX_1_272758580)
 	imul	ecx,(F_0_850)	; ecx=MULTIPLY(data5,FIX_0_850430095)
 	imul	edx,(-F_0_720)	; edx=MULTIPLY(data7,-FIX_0_720959822)
 	add	eax,ebx
 	add	ecx,edx
 	add	ecx,eax		; ecx=tmp0
 	; -- Even part
 	mov	eax, INT [ROW(0,esi,SIZEOF_INT)]
 	sal	eax,(CONST_BITS+2)	; eax=tmp10
 	; -- Final output stage
 	mov	edx, POINTER [range_limit]	; (JSAMPLE *)
 	lea	ebx,[eax+ecx]	; ebx=data0(=tmp10+tmp0)
 	sub	eax,ecx		; eax=data1(=tmp10-tmp0)
 	pop	ecx	; ctr
 	descale	ebx,(CONST_BITS+PASS1_BITS+3+2)
 	descale	eax,(CONST_BITS+PASS1_BITS+3+2)
 	and	ebx,RANGE_MASK
 	and	eax,RANGE_MASK
 	mov	bl, JSAMPLE [edx+ebx*SIZEOF_JSAMPLE]
 	mov	al, JSAMPLE [edx+eax*SIZEOF_JSAMPLE]
 	mov	JSAMPLE [edi+0*SIZEOF_JSAMPLE], bl
 	mov	JSAMPLE [edi+1*SIZEOF_JSAMPLE], al
 .nextrow:
 	pop	edi
 	add	esi, byte DCTSIZE*SIZEOF_INT	; advance pointer to next row
 	add	edi, byte SIZEOF_JSAMPROW
 	dec	ecx
 	jnz	near .rowloop
 	pop	edi
 	pop	esi
 ;	pop	edx		; need not be preserved
 ;	pop	ecx		; need not be preserved
 	pop	ebx
 	mov	esp,ebp
 	pop	ebp
 	ret
 ; --------------------------------------------------------------------------
 ;
 ; Perform dequantization and inverse DCT on one block of coefficients,
 ; producing a reduced-size 1x1 output block.
 ;
 ; GLOBAL(void)
 ; jpeg_idct_1x1 (j_decompress_ptr cinfo, jpeg_component_info * compptr,
 ;                JCOEFPTR coef_block,
 ;                JSAMPARRAY output_buf, JDIMENSION output_col)
 ;
 %define cinfo(b)	(b)+8		; j_decompress_ptr cinfo
 %define compptr(b)	(b)+12		; jpeg_component_info * compptr
 %define coef_block(b)	(b)+16		; JCOEFPTR coef_block
 %define output_buf(b)	(b)+20		; JSAMPARRAY output_buf
 %define output_col(b)	(b)+24		; JDIMENSION output_col
 %define ebp		esp-4		; use esp-4 in place of ebp (no saved ebp on the stack)
 	align	16
 	global	EXTN(jpeg_idct_1x1)
 EXTN(jpeg_idct_1x1):
 ;	push	ebp
 ;	mov	ebp,esp
 ;	push	ebx		; unused
 ;	push	ecx		; need not be preserved
 ;	push	edx		; need not be preserved
 ;	push	esi		; unused
 ;	push	edi		; unused
 	; We hardly need an inverse DCT routine for this: just take the
 	; average pixel value, which is one-eighth of the DC coefficient.
 	mov	edx, POINTER [compptr(ebp)]
 	mov	ecx, JCOEFPTR [coef_block(ebp)]		; inptr
 	mov	edx, POINTER [jcompinfo_dct_table(edx)]	; quantptr
 	mov	ax, JCOEF [COL(0,ecx,SIZEOF_JCOEF)]
 	imul	ax, ISLOW_MULT_TYPE [COL(0,edx,SIZEOF_ISLOW_MULT_TYPE)]
 	mov	ecx, JSAMPARRAY [output_buf(ebp)]	; (JSAMPROW *)
 	mov	edx, JDIMENSION [output_col(ebp)]
 	mov	ecx, JSAMPROW [ecx]			; (JSAMPLE *)
 	add	ax, (1 << (3-1)) + (CENTERJSAMPLE << 3)
 	sar	ax,3		; descale
 	test	ah,ah		; unsigned saturation
 	jz	short .output
 	not	ax
 	sar	ax,15
 	alignx	16,3
 .output:
 	mov	JSAMPLE [ecx+edx*SIZEOF_JSAMPLE], al
 ;	pop	edi		; unused
 ;	pop	esi		; unused
 ;	pop	edx		; need not be preserved
 ;	pop	ecx		; need not be preserved
 ;	pop	ebx		; unused
 ;	pop	ebp
 	ret
 %endif ; IDCT_SCALING_SUPPORTED
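The 1x1 routine above needs no transform: it dequantizes the DC term, descales it by 3 bits with rounding, re-centers it, and saturates to the sample range. The following C sketch mirrors those steps under stated assumptions: 8-bit samples (`CENTERJSAMPLE` = 128), a dequantized DC value that fits in 16 bits, and an arithmetic `>>` on negative values. The function name `idct_1x1_sample` is hypothetical. The original jidctred.c clamps through the `range_limit` table rather than by comparison.

```c
#include <assert.h>
#include <stdint.h>

#define CENTERJSAMPLE 128   /* assumes 8-bit samples */

/* Produce the single output sample of a 1x1 reduced-size IDCT. */
static uint8_t idct_1x1_sample(int16_t dc_coef, int16_t quant)
{
    /* Dequantize the DC coefficient. */
    int dc = (int)dc_coef * (int)quant;
    /* The asm works in 16 bits, so assume the product fits there. */
    assert(dc >= INT16_MIN && dc <= INT16_MAX);

    /* Add the rounding term and the level shift (both pre-scaled by 8),
     * then divide by 8 with a single shift. */
    int v = (dc + (1 << (3 - 1)) + (CENTERJSAMPLE << 3)) >> 3;

    /* Saturate to 0..255, as the test ah,ah / not / sar sequence does. */
    if (v < 0)
        return 0;
    if (v > 255)
        return 255;
    return (uint8_t)v;
}
```

Folding the level shift into the same addition as the rounding term is what lets the assembly get away with one `add` and one `sar` before the saturation test.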
--- a/jimmxfst.asm
+++ b/jimmxfst.asm
@@ -0,0 +1,510 @@
 ;
 ; jimmxfst.asm - fast integer IDCT (MMX)
 ;
 ; x86 SIMD extension for IJG JPEG library
 ; Copyright (C) 1999-2006, MIYASAKA Masaru.
 ; For conditions of distribution and use, see copyright notice in jsimdext.inc
 ;
 ; This file should be assembled with NASM (Netwide Assembler); it
 ; can *not* be assembled with Microsoft's MASM or any compatible
 ; assembler (including Borland's Turbo Assembler).
 ; NASM is available from http://nasm.sourceforge.net/ or
 ; http://sourceforge.net/project/showfiles.php?group_id=6208
 ;
 ; This file contains a fast, not so accurate integer implementation of
 ; the inverse DCT (Discrete Cosine Transform). The following code is
 ; based directly on the IJG's original jidctfst.c; see jidctfst.c
 ; for more details.
 ;
 ; Last Modified : February 4, 2006
 ;
 ; [TAB8]
 %include "jsimdext.inc"
 %include "jdct.inc"
 %ifdef DCT_IFAST_SUPPORTED
 %ifdef JIDCT_INT_MMX_SUPPORTED
 ; This module is specialized to the case DCTSIZE = 8.
 ;
 %if DCTSIZE != 8
 %error "Sorry, this code only copes with 8x8 DCTs."
 %endif
 ; --------------------------------------------------------------------------
 %define CONST_BITS	8	; 14 is also OK.
 %define PASS1_BITS	2
 %if IFAST_SCALE_BITS != PASS1_BITS
 %error "'IFAST_SCALE_BITS' must be equal to 'PASS1_BITS'."
 %endif
 %if CONST_BITS == 8
 F_1_082	equ	277		; FIX(1.082392200)
 F_1_414	equ	362		; FIX(1.414213562)
 F_1_847	equ	473		; FIX(1.847759065)
 F_2_613	equ	669		; FIX(2.613125930)
 F_1_613	equ	(F_2_613 - 256)	; FIX(2.613125930) - FIX(1)
 %else
 ; NASM cannot do compile-time arithmetic on floating-point constants.
 %define	DESCALE(x,n)  (((x)+(1<<((n)-1)))>>(n))
 F_1_082	equ	DESCALE(1162209775,30-CONST_BITS)	; FIX(1.082392200)
 F_1_414	equ	DESCALE(1518500249,30-CONST_BITS)	; FIX(1.414213562)
 F_1_847	equ	DESCALE(1984016188,30-CONST_BITS)	; FIX(1.847759065)
 F_2_613	equ	DESCALE(2805822602,30-CONST_BITS)	; FIX(2.613125930)
 F_1_613	equ	(F_2_613 - (1 << CONST_BITS))	; FIX(2.613125930) - FIX(1)
 %endif
 ; --------------------------------------------------------------------------
 	SECTION	SEG_CONST
 ; PRE_MULTIPLY_SCALE_BITS <= 2 (to avoid overflow)
 ; CONST_BITS + CONST_SHIFT + PRE_MULTIPLY_SCALE_BITS == 16 (for pmulhw)
 %define PRE_MULTIPLY_SCALE_BITS   2
 %define CONST_SHIFT     (16 - PRE_MULTIPLY_SCALE_BITS - CONST_BITS)
 	alignz	16
 	global	EXTN(jconst_idct_ifast_mmx)
 EXTN(jconst_idct_ifast_mmx):
 PW_F1414	times 4 dw  F_1_414 << CONST_SHIFT
 PW_F1847	times 4 dw  F_1_847 << CONST_SHIFT
 PW_MF1613	times 4 dw -F_1_613 << CONST_SHIFT
 PW_F1082	times 4 dw  F_1_082 << CONST_SHIFT
 PB_CENTERJSAMP	times 8 db  CENTERJSAMPLE
 	alignz	16
 ; --------------------------------------------------------------------------
 	SECTION	SEG_TEXT
 	BITS	32
 ;
 ; Perform dequantization and inverse DCT on one block of coefficients.
 ;
 ; GLOBAL(void)
 ; jpeg_idct_ifast_mmx (j_decompress_ptr cinfo, jpeg_component_info * compptr,
 ;                      JCOEFPTR coef_block,
 ;                      JSAMPARRAY output_buf, JDIMENSION output_col)
 ;
 %define cinfo(b)	(b)+8		; j_decompress_ptr cinfo
 %define compptr(b)	(b)+12		; jpeg_component_info * compptr
 %define coef_block(b)	(b)+16		; JCOEFPTR coef_block
 %define output_buf(b)	(b)+20		; JSAMPARRAY output_buf
 %define output_col(b)	(b)+24		; JDIMENSION output_col
 %define original_ebp	ebp+0
 %define wk(i)		ebp-(WK_NUM-(i))*SIZEOF_MMWORD	; mmword wk[WK_NUM]
 %define WK_NUM		2
 %define workspace	wk(0)-DCTSIZE2*SIZEOF_JCOEF
 					; JCOEF workspace[DCTSIZE2]
 	align	16
 	global	EXTN(jpeg_idct_ifast_mmx)
 EXTN(jpeg_idct_ifast_mmx):
 	push	ebp
 	mov	eax,esp				; eax = original ebp
 	sub	esp, byte 4
 	and	esp, byte (-SIZEOF_MMWORD)	; align to 64 bits
 	mov	[esp],eax
 	mov	ebp,esp				; ebp = aligned ebp
 	lea	esp, [workspace]
 	push	ebx
 ;	push	ecx		; need not be preserved
 ;	push	edx		; need not be preserved
 	push	esi
 	push	edi
 	get_GOT	ebx		; get GOT address
 	; ---- Pass 1: process columns from input, store into work array.
 ;	mov	eax, [original_ebp]
 	mov	edx, POINTER [compptr(eax)]
 	mov	edx, POINTER [jcompinfo_dct_table(edx)]	; quantptr
 	mov	esi, JCOEFPTR [coef_block(eax)]		; inptr
 	lea	edi, [workspace]			; JCOEF * wsptr
 	mov	ecx, DCTSIZE/4				; ctr
 	alignx	16,7
 .columnloop:
 %ifndef NO_ZERO_COLUMN_TEST_IFAST_MMX
 	mov	eax, DWORD [DWBLOCK(1,0,esi,SIZEOF_JCOEF)]
 	or	eax, DWORD [DWBLOCK(2,0,esi,SIZEOF_JCOEF)]
 	jnz	short .columnDCT
 	movq	mm0, MMWORD [MMBLOCK(1,0,esi,SIZEOF_JCOEF)]
 	movq	mm1, MMWORD [MMBLOCK(2,0,esi,SIZEOF_JCOEF)]
 	por	mm0, MMWORD [MMBLOCK(3,0,esi,SIZEOF_JCOEF)]
 	por	mm1, MMWORD [MMBLOCK(4,0,esi,SIZEOF_JCOEF)]
 	por	mm0, MMWORD [MMBLOCK(5,0,esi,SIZEOF_JCOEF)]
 	por	mm1, MMWORD [MMBLOCK(6,0,esi,SIZEOF_JCOEF)]
 	por	mm0, MMWORD [MMBLOCK(7,0,esi,SIZEOF_JCOEF)]
 	por	mm1,mm0
 	packsswb mm1,mm1
 	movd	eax,mm1
 	test	eax,eax
 	jnz	short .columnDCT
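 	; How the test above works (explanatory note): the DWORD check on
 	; rows 1 and 2 is only an early exit.  Rows 1..7 of the four columns
 	; are then OR-ed into mm1, so mm1 is zero only if every AC term is
 	; zero.  packsswb narrows the four words to bytes with signed
 	; saturation, which maps any non-zero word to a non-zero byte
 	; (e.g. 0x0100 -> 0x7F, 0xFF00 -> 0x80), so a single 32-bit
 	; "test eax,eax" covers all four columns.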
 	; -- AC terms all zero
 	movq	mm0, MMWORD [MMBLOCK(0,0,esi,SIZEOF_JCOEF)]
 	pmullw	mm0, MMWORD [MMBLOCK(0,0,edx,SIZEOF_IFAST_MULT_TYPE)]
 	movq      mm2,mm0		; mm0=in0=(00 01 02 03)
 	punpcklwd mm0,mm0		; mm0=(00 00 01 01)
 	punpckhwd mm2,mm2		; mm2=(02 02 03 03)
 	movq      mm1,mm0
 	punpckldq mm0,mm0		; mm0=(00 00 00 00)
 	punpckhdq mm1,mm1		; mm1=(01 01 01 01)
 	movq      mm3,mm2
 	punpckldq mm2,mm2		; mm2=(02 02 02 02)
 	punpckhdq mm3,mm3		; mm3=(03 03 03 03)
 	movq	MMWORD [MMBLOCK(0,0,edi,SIZEOF_JCOEF)], mm0
 	movq	MMWORD [MMBLOCK(0,1,edi,SIZEOF_JCOEF)], mm0
 	movq	MMWORD [MMBLOCK(1,0,edi,SIZEOF_JCOEF)], mm1
 	movq	MMWORD [MMBLOCK(1,1,edi,SIZEOF_JCOEF)], mm1
 	movq	MMWORD [MMBLOCK(2,0,edi,SIZEOF_JCOEF)], mm2
 	movq	MMWORD [MMBLOCK(2,1,edi,SIZEOF_JCOEF)], mm2
 	movq	MMWORD [MMBLOCK(3,0,edi,SIZEOF_JCOEF)], mm3
 	movq	MMWORD [MMBLOCK(3,1,edi,SIZEOF_JCOEF)], mm3
 	jmp	near .nextcolumn
 	alignx	16,7
 %endif
 .columnDCT:
 	; -- Even part
 	movq	mm0, MMWORD [MMBLOCK(0,0,esi,SIZEOF_JCOEF)]
 	movq	mm1, MMWORD [MMBLOCK(2,0,esi,SIZEOF_JCOEF)]
 	pmullw	mm0, MMWORD [MMBLOCK(0,0,edx,SIZEOF_IFAST_MULT_TYPE)]
 	pmullw	mm1, MMWORD [MMBLOCK(2,0,edx,SIZEOF_IFAST_MULT_TYPE)]
 	movq	mm2, MMWORD [MMBLOCK(4,0,esi,SIZEOF_JCOEF)]
 	movq	mm3, MMWORD [MMBLOCK(6,0,esi,SIZEOF_JCOEF)]
 	pmullw	mm2, MMWORD [MMBLOCK(4,0,edx,SIZEOF_IFAST_MULT_TYPE)]
 	pmullw	mm3, MMWORD [MMBLOCK(6,0,edx,SIZEOF_IFAST_MULT_TYPE)]
 	movq	mm4,mm0
 	movq	mm5,mm1
 	psubw	mm0,mm2			; mm0=tmp11
 	psubw	mm1,mm3
 	paddw	mm4,mm2			; mm4=tmp10
 	paddw	mm5,mm3			; mm5=tmp13
 	psllw	mm1,PRE_MULTIPLY_SCALE_BITS
 	pmulhw	mm1,[GOTOFF(ebx,PW_F1414)]
 	psubw	mm1,mm5			; mm1=tmp12
 	movq	mm6,mm4
 	movq	mm7,mm0
 	psubw	mm4,mm5			; mm4=tmp3
 	psubw	mm0,mm1			; mm0=tmp2
 	paddw	mm6,mm5			; mm6=tmp0
 	paddw	mm7,mm1			; mm7=tmp1
 	movq	MMWORD [wk(1)], mm4	; wk(1)=tmp3
 	movq	MMWORD [wk(0)], mm0	; wk(0)=tmp2
 	; -- Odd part
 	movq	mm2, MMWORD [MMBLOCK(1,0,esi,SIZEOF_JCOEF)]
 	movq	mm3, MMWORD [MMBLOCK(3,0,esi,SIZEOF_JCOEF)]
 	pmullw	mm2, MMWORD [MMBLOCK(1,0,edx,SIZEOF_IFAST_MULT_TYPE)]
 	pmullw	mm3, MMWORD [MMBLOCK(3,0,edx,SIZEOF_IFAST_MULT_TYPE)]
 	movq	mm5, MMWORD [MMBLOCK(5,0,esi,SIZEOF_JCOEF)]
 	movq	mm1, MMWORD [MMBLOCK(7,0,esi,SIZEOF_JCOEF)]
 	pmullw	mm5, MMWORD [MMBLOCK(5,0,edx,SIZEOF_IFAST_MULT_TYPE)]
 	pmullw	mm1, MMWORD [MMBLOCK(7,0,edx,SIZEOF_IFAST_MULT_TYPE)]
 	movq	mm4,mm2
 	movq	mm0,mm5
 	psubw	mm2,mm1			; mm2=z12
 	psubw	mm5,mm3			; mm5=z10
 	paddw	mm4,mm1			; mm4=z11
 	paddw	mm0,mm3			; mm0=z13
 	movq	mm1,mm5			; mm1=z10(unscaled)
 	psllw	mm2,PRE_MULTIPLY_SCALE_BITS
 	psllw	mm5,PRE_MULTIPLY_SCALE_BITS
 	movq	mm3,mm4
 	psubw	mm4,mm0
 	paddw	mm3,mm0			; mm3=tmp7
 	psllw	mm4,PRE_MULTIPLY_SCALE_BITS
 	pmulhw	mm4,[GOTOFF(ebx,PW_F1414)]	; mm4=tmp11
 	; To avoid overflow...
 	;
 	; (Original)
 	; tmp12 = -2.613125930 * z10 + z5;
 	;
 	; (This implementation)
 	; tmp12 = (-1.613125930 - 1) * z10 + z5;
 	;       = -1.613125930 * z10 - z10 + z5;
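 	;
 	; Why the split is needed (sketch, assuming CONST_BITS == 8 so that
 	; CONST_SHIFT == 6):  FIX(2.613125930) = 669 and 669 << 6 = 42816,
 	; which exceeds the signed 16-bit maximum (32767) that a pmulhw
 	; operand can hold, whereas F_1_613 = 669 - 256 = 413 gives
 	; 413 << 6 = 26432, which fits.  The remaining "- z10" is applied
 	; with the unscaled copy kept in mm1 (psubw mm0,mm1 below).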
 	movq	mm0,mm5
 	paddw	mm5,mm2
 	pmulhw	mm5,[GOTOFF(ebx,PW_F1847)]	; mm5=z5
 	pmulhw	mm0,[GOTOFF(ebx,PW_MF1613)]
 	pmulhw	mm2,[GOTOFF(ebx,PW_F1082)]
 	psubw	mm0,mm1
 	psubw	mm2,mm5			; mm2=tmp10
 	paddw	mm0,mm5			; mm0=tmp12
 	; -- Final output stage
 	psubw	mm0,mm3			; mm0=tmp6
 	movq	mm1,mm6
 	movq	mm5,mm7
 	paddw	mm6,mm3			; mm6=data0=(00 01 02 03)
 	paddw	mm7,mm0			; mm7=data1=(10 11 12 13)
 	psubw	mm1,mm3			; mm1=data7=(70 71 72 73)
 	psubw	mm5,mm0			; mm5=data6=(60 61 62 63)
 	psubw	mm4,mm0			; mm4=tmp5
 	movq      mm3,mm6		; transpose coefficients(phase 1)
 	punpcklwd mm6,mm7		; mm6=(00 10 01 11)
 	punpckhwd mm3,mm7		; mm3=(02 12 03 13)
 	movq      mm0,mm5		; transpose coefficients(phase 1)
 	punpcklwd mm5,mm1		; mm5=(60 70 61 71)
 	punpckhwd mm0,mm1		; mm0=(62 72 63 73)
 	movq	mm7, MMWORD [wk(0)]	; mm7=tmp2
 	movq	mm1, MMWORD [wk(1)]	; mm1=tmp3
 	movq	MMWORD [wk(0)], mm5	; wk(0)=(60 70 61 71)
 	movq	MMWORD [wk(1)], mm0	; wk(1)=(62 72 63 73)
 	paddw	mm2,mm4			; mm2=tmp4
 	movq	mm5,mm7
 	movq	mm0,mm1
 	paddw	mm7,mm4			; mm7=data2=(20 21 22 23)
 	paddw	mm1,mm2			; mm1=data4=(40 41 42 43)
 	psubw	mm5,mm4			; mm5=data5=(50 51 52 53)
 	psubw	mm0,mm2			; mm0=data3=(30 31 32 33)
 	movq      mm4,mm7		; transpose coefficients(phase 1)
 	punpcklwd mm7,mm0		; mm7=(20 30 21 31)
 	punpckhwd mm4,mm0		; mm4=(22 32 23 33)
 	movq      mm2,mm1		; transpose coefficients(phase 1)
 	punpcklwd mm1,mm5		; mm1=(40 50 41 51)
 	punpckhwd mm2,mm5		; mm2=(42 52 43 53)
 	movq      mm0,mm6		; transpose coefficients(phase 2)
 	punpckldq mm6,mm7		; mm6=(00 10 20 30)
 	punpckhdq mm0,mm7		; mm0=(01 11 21 31)
 	movq      mm5,mm3		; transpose coefficients(phase 2)
 	punpckldq mm3,mm4		; mm3=(02 12 22 32)
 	punpckhdq mm5,mm4		; mm5=(03 13 23 33)
 	movq	mm7, MMWORD [wk(0)]	; mm7=(60 70 61 71)
 	movq	mm4, MMWORD [wk(1)]	; mm4=(62 72 63 73)
 	movq	MMWORD [MMBLOCK(0,0,edi,SIZEOF_JCOEF)], mm6
 	movq	MMWORD [MMBLOCK(1,0,edi,SIZEOF_JCOEF)], mm0
 	movq	MMWORD [MMBLOCK(2,0,edi,SIZEOF_JCOEF)], mm3
 	movq	MMWORD [MMBLOCK(3,0,edi,SIZEOF_JCOEF)], mm5
 	movq      mm6,mm1		; transpose coefficients(phase 2)
 	punpckldq mm1,mm7		; mm1=(40 50 60 70)
 	punpckhdq mm6,mm7		; mm6=(41 51 61 71)
 	movq      mm0,mm2		; transpose coefficients(phase 2)
 	punpckldq mm2,mm4		; mm2=(42 52 62 72)
 	punpckhdq mm0,mm4		; mm0=(43 53 63 73)
 	movq	MMWORD [MMBLOCK(0,1,edi,SIZEOF_JCOEF)], mm1
 	movq	MMWORD [MMBLOCK(1,1,edi,SIZEOF_JCOEF)], mm6
 	movq	MMWORD [MMBLOCK(2,1,edi,SIZEOF_JCOEF)], mm2
 	movq	MMWORD [MMBLOCK(3,1,edi,SIZEOF_JCOEF)], mm0
 .nextcolumn:
 	add	esi, byte 4*SIZEOF_JCOEF		; coef_block
 	add	edx, byte 4*SIZEOF_IFAST_MULT_TYPE	; quantptr
 	add	edi, byte 4*DCTSIZE*SIZEOF_JCOEF	; wsptr
 	dec	ecx					; ctr
 	jnz	near .columnloop
 	; ---- Pass 2: process rows from work array, store into output array.
 	mov	eax, [original_ebp]
 	lea	esi, [workspace]			; JCOEF * wsptr
 	mov	edi, JSAMPARRAY [output_buf(eax)]	; (JSAMPROW *)
 	mov	eax, JDIMENSION [output_col(eax)]
 	mov	ecx, DCTSIZE/4				; ctr
 	alignx	16,7
 .rowloop:
 	; -- Even part
 	movq	mm0, MMWORD [MMBLOCK(0,0,esi,SIZEOF_JCOEF)]
 	movq	mm1, MMWORD [MMBLOCK(2,0,esi,SIZEOF_JCOEF)]
 	movq	mm2, MMWORD [MMBLOCK(4,0,esi,SIZEOF_JCOEF)]
 	movq	mm3, MMWORD [MMBLOCK(6,0,esi,SIZEOF_JCOEF)]
 	movq	mm4,mm0
 	movq	mm5,mm1
 	psubw	mm0,mm2			; mm0=tmp11
 	psubw	mm1,mm3
 	paddw	mm4,mm2			; mm4=tmp10
 	paddw	mm5,mm3			; mm5=tmp13
 	psllw	mm1,PRE_MULTIPLY_SCALE_BITS
 	pmulhw	mm1,[GOTOFF(ebx,PW_F1414)]
 	psubw	mm1,mm5			; mm1=tmp12
 	movq	mm6,mm4
 	movq	mm7,mm0
 	psubw	mm4,mm5			; mm4=tmp3
 	psubw	mm0,mm1			; mm0=tmp2
 	paddw	mm6,mm5			; mm6=tmp0
 	paddw	mm7,mm1			; mm7=tmp1
 	movq	MMWORD [wk(1)], mm4	; wk(1)=tmp3
 	movq	MMWORD [wk(0)], mm0	; wk(0)=tmp2
 	; -- Odd part
 	movq	mm2, MMWORD [MMBLOCK(1,0,esi,SIZEOF_JCOEF)]
 	movq	mm3, MMWORD [MMBLOCK(3,0,esi,SIZEOF_JCOEF)]
 	movq	mm5, MMWORD [MMBLOCK(5,0,esi,SIZEOF_JCOEF)]
 	movq	mm1, MMWORD [MMBLOCK(7,0,esi,SIZEOF_JCOEF)]
 	movq	mm4,mm2
 	movq	mm0,mm5
 	psubw	mm2,mm1			; mm2=z12
 	psubw	mm5,mm3			; mm5=z10
 	paddw	mm4,mm1			; mm4=z11
 	paddw	mm0,mm3			; mm0=z13
 	movq	mm1,mm5			; mm1=z10(unscaled)
 	psllw	mm2,PRE_MULTIPLY_SCALE_BITS
 	psllw	mm5,PRE_MULTIPLY_SCALE_BITS
 	movq	mm3,mm4
 	psubw	mm4,mm0
 	paddw	mm3,mm0			; mm3=tmp7
 	psllw	mm4,PRE_MULTIPLY_SCALE_BITS
 	pmulhw	mm4,[GOTOFF(ebx,PW_F1414)]	; mm4=tmp11
 	; To avoid overflow...
 	;
 	; (Original)
 	; tmp12 = -2.613125930 * z10 + z5;
 	;
 	; (This implementation)
 	; tmp12 = (-1.613125930 - 1) * z10 + z5;
 	;       = -1.613125930 * z10 - z10 + z5;
 	movq	mm0,mm5
 	paddw	mm5,mm2
 	pmulhw	mm5,[GOTOFF(ebx,PW_F1847)]	; mm5=z5
 	pmulhw	mm0,[GOTOFF(ebx,PW_MF1613)]
 	pmulhw	mm2,[GOTOFF(ebx,PW_F1082)]
 	psubw	mm0,mm1
 	psubw	mm2,mm5			; mm2=tmp10
 	paddw	mm0,mm5			; mm0=tmp12
 	; -- Final output stage
 	psubw	mm0,mm3			; mm0=tmp6
 	movq	mm1,mm6
 	movq	mm5,mm7
 	paddw	mm6,mm3			; mm6=data0=(00 10 20 30)
 	paddw	mm7,mm0			; mm7=data1=(01 11 21 31)
 	psraw	mm6,(PASS1_BITS+3)	; descale
 	psraw	mm7,(PASS1_BITS+3)	; descale
 	psubw	mm1,mm3			; mm1=data7=(07 17 27 37)
 	psubw	mm5,mm0			; mm5=data6=(06 16 26 36)
 	psraw	mm1,(PASS1_BITS+3)	; descale
 	psraw	mm5,(PASS1_BITS+3)	; descale
 	psubw	mm4,mm0			; mm4=tmp5
 	packsswb  mm6,mm5		; mm6=(00 10 20 30 06 16 26 36)
 	packsswb  mm7,mm1		; mm7=(01 11 21 31 07 17 27 37)
 	movq	mm3, MMWORD [wk(0)]	; mm3=tmp2
 	movq	mm0, MMWORD [wk(1)]	; mm0=tmp3
 	paddw	mm2,mm4			; mm2=tmp4
 	movq	mm5,mm3
 	movq	mm1,mm0
 	paddw	mm3,mm4			; mm3=data2=(02 12 22 32)
 	paddw	mm0,mm2			; mm0=data4=(04 14 24 34)
 	psraw	mm3,(PASS1_BITS+3)	; descale
 	psraw	mm0,(PASS1_BITS+3)	; descale
 	psubw	mm5,mm4			; mm5=data5=(05 15 25 35)
 	psubw	mm1,mm2			; mm1=data3=(03 13 23 33)
 	psraw	mm5,(PASS1_BITS+3)	; descale
 	psraw	mm1,(PASS1_BITS+3)	; descale
 	movq      mm4,[GOTOFF(ebx,PB_CENTERJSAMP)]	; mm4=[PB_CENTERJSAMP]
 	packsswb  mm3,mm0		; mm3=(02 12 22 32 04 14 24 34)
 	packsswb  mm1,mm5		; mm1=(03 13 23 33 05 15 25 35)
 	paddb     mm6,mm4
 	paddb     mm7,mm4
 	paddb     mm3,mm4
 	paddb     mm1,mm4
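 	; Note (assumes CENTERJSAMPLE == 128, i.e. 8-bit samples): packsswb
 	; has already clamped each result to -128..127, and paddb adds 0x80
 	; with wraparound, mapping that range onto 0..255
 	; (-128 -> 0, 0 -> 128, 127 -> 255).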
 	movq      mm2,mm6		; transpose coefficients(phase 1)
 	punpcklbw mm6,mm7		; mm6=(00 01 10 11 20 21 30 31)
 	punpckhbw mm2,mm7		; mm2=(06 07 16 17 26 27 36 37)
 	movq      mm0,mm3		; transpose coefficients(phase 1)
 	punpcklbw mm3,mm1		; mm3=(02 03 12 13 22 23 32 33)
 	punpckhbw mm0,mm1		; mm0=(04 05 14 15 24 25 34 35)
 	movq      mm5,mm6		; transpose coefficients(phase 2)
 	punpcklwd mm6,mm3		; mm6=(00 01 02 03 10 11 12 13)
 	punpckhwd mm5,mm3		; mm5=(20 21 22 23 30 31 32 33)
 	movq      mm4,mm0		; transpose coefficients(phase 2)
 	punpcklwd mm0,mm2		; mm0=(04 05 06 07 14 15 16 17)
 	punpckhwd mm4,mm2		; mm4=(24 25 26 27 34 35 36 37)
 	movq      mm7,mm6		; transpose coefficients(phase 3)
 	punpckldq mm6,mm0		; mm6=(00 01 02 03 04 05 06 07)
 	punpckhdq mm7,mm0		; mm7=(10 11 12 13 14 15 16 17)
 	movq      mm1,mm5		; transpose coefficients(phase 3)
 	punpckldq mm5,mm4		; mm5=(20 21 22 23 24 25 26 27)
 	punpckhdq mm1,mm4		; mm1=(30 31 32 33 34 35 36 37)
 	pushpic	ebx			; save GOT address
 	mov	edx, JSAMPROW [edi+0*SIZEOF_JSAMPROW]
 	mov	ebx, JSAMPROW [edi+1*SIZEOF_JSAMPROW]
 	movq	MMWORD [edx+eax*SIZEOF_JSAMPLE], mm6
 	movq	MMWORD [ebx+eax*SIZEOF_JSAMPLE], mm7
 	mov	edx, JSAMPROW [edi+2*SIZEOF_JSAMPROW]
 	mov	ebx, JSAMPROW [edi+3*SIZEOF_JSAMPROW]
 	movq	MMWORD [edx+eax*SIZEOF_JSAMPLE], mm5
 	movq	MMWORD [ebx+eax*SIZEOF_JSAMPLE], mm1
 	poppic	ebx			; restore GOT address
 	add	esi, byte 4*SIZEOF_JCOEF	; wsptr
 	add	edi, byte 4*SIZEOF_JSAMPROW
 	dec	ecx				; ctr
 	jnz	near .rowloop
 	emms		; empty MMX state
 	pop	edi
 	pop	esi
 ;	pop	edx		; need not be preserved
 ;	pop	ecx		; need not be preserved
 	pop	ebx
 	mov	esp,ebp		; esp <- aligned ebp
 	pop	esp		; esp <- original ebp
 	pop	ebp
 	ret
 %endif ; JIDCT_INT_MMX_SUPPORTED
 %endif ; DCT_IFAST_SUPPORTED
--- a/jimmxint.asm
+++ b/jimmxint.asm
@@ -0,0 +1,862 @@
 ;
 ; jimmxint.asm - accurate integer IDCT (MMX)
 ;
 ; x86 SIMD extension for IJG JPEG library
 ; Copyright (C) 1999-2006, MIYASAKA Masaru.
 ; For conditions of distribution and use, see copyright notice in jsimdext.inc
 ;
 ; This file should be assembled with NASM (Netwide Assembler). It
 ; can *not* be assembled with Microsoft's MASM or any compatible
 ; assembler (including Borland's Turbo Assembler).
 ; NASM is available from http://nasm.sourceforge.net/ or
 ; http://sourceforge.net/project/showfiles.php?group_id=6208
 ;
 ; This file contains a slow-but-accurate integer implementation of the
 ; inverse DCT (Discrete Cosine Transform). The following code is based
 ; directly on the IJG's original jidctint.c; see jidctint.c for
 ; more details.
 ;
 ; Last Modified : February 4, 2006
 ;
 ; [TAB8]
 %include "jsimdext.inc"
 %include "jdct.inc"
 %ifdef DCT_ISLOW_SUPPORTED
 %ifdef JIDCT_INT_MMX_SUPPORTED
 ; This module is specialized to the case DCTSIZE = 8.
 ;
 %if DCTSIZE != 8
 %error "Sorry, this code only copes with 8x8 DCTs."
 %endif
 ; --------------------------------------------------------------------------
 %define CONST_BITS	13
 %define PASS1_BITS	2
 %define DESCALE_P1	(CONST_BITS-PASS1_BITS)
 %define DESCALE_P2	(CONST_BITS+PASS1_BITS+3)
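 ; With the values above: DESCALE_P1 = 13-2 = 11 and DESCALE_P2 = 13+2+3 = 18,
 ; so the rounding constants defined in SEG_CONST below are
 ; PD_DESCALE_P1 = 1 << 10 = 1024 and PD_DESCALE_P2 = 1 << 17 = 131072.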
 %if CONST_BITS == 13
 F_0_298	equ	 2446		; FIX(0.298631336)
 F_0_390	equ	 3196		; FIX(0.390180644)
 F_0_541	equ	 4433		; FIX(0.541196100)
 F_0_765	equ	 6270		; FIX(0.765366865)
 F_0_899	equ	 7373		; FIX(0.899976223)
 F_1_175	equ	 9633		; FIX(1.175875602)
 F_1_501	equ	12299		; FIX(1.501321110)
 F_1_847	equ	15137		; FIX(1.847759065)
 F_1_961	equ	16069		; FIX(1.961570560)
 F_2_053	equ	16819		; FIX(2.053119869)
 F_2_562	equ	20995		; FIX(2.562915447)
 F_3_072	equ	25172		; FIX(3.072711026)
 %else
 ; NASM cannot do compile-time arithmetic on floating-point constants.
 %define DESCALE(x,n)  (((x)+(1<<((n)-1)))>>(n))
 F_0_298	equ	DESCALE( 320652955,30-CONST_BITS)	; FIX(0.298631336)
 F_0_390	equ	DESCALE( 418953276,30-CONST_BITS)	; FIX(0.390180644)
 F_0_541	equ	DESCALE( 581104887,30-CONST_BITS)	; FIX(0.541196100)
 F_0_765	equ	DESCALE( 821806413,30-CONST_BITS)	; FIX(0.765366865)
 F_0_899	equ	DESCALE( 966342111,30-CONST_BITS)	; FIX(0.899976223)
 F_1_175	equ	DESCALE(1262586813,30-CONST_BITS)	; FIX(1.175875602)
 F_1_501	equ	DESCALE(1612031267,30-CONST_BITS)	; FIX(1.501321110)
 F_1_847	equ	DESCALE(1984016188,30-CONST_BITS)	; FIX(1.847759065)
 F_1_961	equ	DESCALE(2106220350,30-CONST_BITS)	; FIX(1.961570560)
 F_2_053	equ	DESCALE(2204520673,30-CONST_BITS)	; FIX(2.053119869)
 F_2_562	equ	DESCALE(2751909506,30-CONST_BITS)	; FIX(2.562915447)
 F_3_072	equ	DESCALE(3299298341,30-CONST_BITS)	; FIX(3.072711026)
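 ; Sanity check of the formula (worked by hand for CONST_BITS = 13, n = 17):
 ;   DESCALE(1984016188,17) = (1984016188 + 65536) >> 17 = 15137
 ; which matches the literal F_1_847 in the CONST_BITS == 13 table above.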
 %endif
 ; --------------------------------------------------------------------------
 	SECTION	SEG_CONST
 	alignz	16
 	global	EXTN(jconst_idct_islow_mmx)
 EXTN(jconst_idct_islow_mmx):
 PW_F130_F054	times 2 dw  (F_0_541+F_0_765), F_0_541
 PW_F054_MF130	times 2 dw  F_0_541, (F_0_541-F_1_847)
 PW_MF078_F117	times 2 dw  (F_1_175-F_1_961), F_1_175
 PW_F117_F078	times 2 dw  F_1_175, (F_1_175-F_0_390)
 PW_MF060_MF089	times 2 dw  (F_0_298-F_0_899),-F_0_899
 PW_MF089_F060	times 2 dw -F_0_899, (F_1_501-F_0_899)
 PW_MF050_MF256	times 2 dw  (F_2_053-F_2_562),-F_2_562
 PW_MF256_F050	times 2 dw -F_2_562, (F_3_072-F_2_562)
 PD_DESCALE_P1	times 2 dd  1 << (DESCALE_P1-1)
 PD_DESCALE_P2	times 2 dd  1 << (DESCALE_P2-1)
 PB_CENTERJSAMP	times 8 db  CENTERJSAMPLE
 	alignz	16
 ; --------------------------------------------------------------------------
 	SECTION	SEG_TEXT
 	BITS	32
 ;
 ; Perform dequantization and inverse DCT on one block of coefficients.
 ;
 ; GLOBAL(void)
 ; jpeg_idct_islow_mmx (j_decompress_ptr cinfo, jpeg_component_info * compptr,
 ;                      JCOEFPTR coef_block,
 ;                      JSAMPARRAY output_buf, JDIMENSION output_col)
 ;
 %define cinfo(b)	(b)+8		; j_decompress_ptr cinfo
 %define compptr(b)	(b)+12		; jpeg_component_info * compptr
 %define coef_block(b)	(b)+16		; JCOEFPTR coef_block
 %define output_buf(b)	(b)+20		; JSAMPARRAY output_buf
 %define output_col(b)	(b)+24		; JDIMENSION output_col
 %define original_ebp	ebp+0
 %define wk(i)		ebp-(WK_NUM-(i))*SIZEOF_MMWORD	; mmword wk[WK_NUM]
 %define WK_NUM		12
 %define workspace	wk(0)-DCTSIZE2*SIZEOF_JCOEF
 					; JCOEF workspace[DCTSIZE2]
 	align	16
 	global	EXTN(jpeg_idct_islow_mmx)
 EXTN(jpeg_idct_islow_mmx):
 	push	ebp
 	mov	eax,esp				; eax = original ebp
 	sub	esp, byte 4
 	and	esp, byte (-SIZEOF_MMWORD)	; align to 64 bits
 	mov	[esp],eax
 	mov	ebp,esp				; ebp = aligned ebp
 	lea	esp, [workspace]
 	push	ebx
 ;	push	ecx		; need not be preserved
 ;	push	edx		; need not be preserved
 	push	esi
 	push	edi
 	get_GOT	ebx		; get GOT address
 	; ---- Pass 1: process columns from input, store into work array.
 ;	mov	eax, [original_ebp]
 	mov	edx, POINTER [compptr(eax)]
 	mov	edx, POINTER [jcompinfo_dct_table(edx)]	; quantptr
 	mov	esi, JCOEFPTR [coef_block(eax)]		; inptr
 	lea	edi, [workspace]			; JCOEF * wsptr
 	mov	ecx, DCTSIZE/4				; ctr
 	alignx	16,7
 .columnloop:
 %ifndef NO_ZERO_COLUMN_TEST_ISLOW_MMX
 	mov	eax, DWORD [DWBLOCK(1,0,esi,SIZEOF_JCOEF)]
 	or	eax, DWORD [DWBLOCK(2,0,esi,SIZEOF_JCOEF)]
 	jnz	short .columnDCT
 	movq	mm0, MMWORD [MMBLOCK(1,0,esi,SIZEOF_JCOEF)]
 	movq	mm1, MMWORD [MMBLOCK(2,0,esi,SIZEOF_JCOEF)]
 	por	mm0, MMWORD [MMBLOCK(3,0,esi,SIZEOF_JCOEF)]
 	por	mm1, MMWORD [MMBLOCK(4,0,esi,SIZEOF_JCOEF)]
 	por	mm0, MMWORD [MMBLOCK(5,0,esi,SIZEOF_JCOEF)]
 	por	mm1, MMWORD [MMBLOCK(6,0,esi,SIZEOF_JCOEF)]
 	por	mm0, MMWORD [MMBLOCK(7,0,esi,SIZEOF_JCOEF)]
 	por	mm1,mm0
 	packsswb mm1,mm1
 	movd	eax,mm1
 	test	eax,eax
 	jnz	short .columnDCT
 	; -- AC terms all zero
 	movq	mm0, MMWORD [MMBLOCK(0,0,esi,SIZEOF_JCOEF)]
 	pmullw	mm0, MMWORD [MMBLOCK(0,0,edx,SIZEOF_ISLOW_MULT_TYPE)]
 	psllw	mm0,PASS1_BITS
 	movq      mm2,mm0		; mm0=in0=(00 01 02 03)
 	punpcklwd mm0,mm0		; mm0=(00 00 01 01)
 	punpckhwd mm2,mm2		; mm2=(02 02 03 03)
 	movq      mm1,mm0
 	punpckldq mm0,mm0		; mm0=(00 00 00 00)
 	punpckhdq mm1,mm1		; mm1=(01 01 01 01)
 	movq      mm3,mm2
 	punpckldq mm2,mm2		; mm2=(02 02 02 02)
 	punpckhdq mm3,mm3		; mm3=(03 03 03 03)
 	movq	MMWORD [MMBLOCK(0,0,edi,SIZEOF_JCOEF)], mm0
 	movq	MMWORD [MMBLOCK(0,1,edi,SIZEOF_JCOEF)], mm0
 	movq	MMWORD [MMBLOCK(1,0,edi,SIZEOF_JCOEF)], mm1
 	movq	MMWORD [MMBLOCK(1,1,edi,SIZEOF_JCOEF)], mm1
 	movq	MMWORD [MMBLOCK(2,0,edi,SIZEOF_JCOEF)], mm2
 	movq	MMWORD [MMBLOCK(2,1,edi,SIZEOF_JCOEF)], mm2
 	movq	MMWORD [MMBLOCK(3,0,edi,SIZEOF_JCOEF)], mm3
 	movq	MMWORD [MMBLOCK(3,1,edi,SIZEOF_JCOEF)], mm3
 	jmp	near .nextcolumn
 	alignx	16,7
 %endif
 .columnDCT:
 	; -- Even part
 	movq	mm0, MMWORD [MMBLOCK(0,0,esi,SIZEOF_JCOEF)]
 	movq	mm1, MMWORD [MMBLOCK(2,0,esi,SIZEOF_JCOEF)]
 	pmullw	mm0, MMWORD [MMBLOCK(0,0,edx,SIZEOF_ISLOW_MULT_TYPE)]
 	pmullw	mm1, MMWORD [MMBLOCK(2,0,edx,SIZEOF_ISLOW_MULT_TYPE)]
 	movq	mm2, MMWORD [MMBLOCK(4,0,esi,SIZEOF_JCOEF)]
 	movq	mm3, MMWORD [MMBLOCK(6,0,esi,SIZEOF_JCOEF)]
 	pmullw	mm2, MMWORD [MMBLOCK(4,0,edx,SIZEOF_ISLOW_MULT_TYPE)]
 	pmullw	mm3, MMWORD [MMBLOCK(6,0,edx,SIZEOF_ISLOW_MULT_TYPE)]
 	; (Original)
 	; z1 = (z2 + z3) * 0.541196100;
 	; tmp2 = z1 + z3 * -1.847759065;
 	; tmp3 = z1 + z2 * 0.765366865;
 	;
 	; (This implementation)
 	; tmp2 = z2 * 0.541196100 + z3 * (0.541196100 - 1.847759065);
 	; tmp3 = z2 * (0.541196100 + 0.765366865) + z3 * 0.541196100;
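 	;
 	; How this maps onto pmaddwd (worked for CONST_BITS == 13):
 	; punpck{l,h}wd interleave the inputs into (z2,z3) word pairs, and
 	; pmaddwd multiplies each pair by a constant pair and sums the two
 	; products into one dword:
 	;   PW_F130_F054  = (4433+6270, 4433      ) = (10703,   4433) -> tmp3
 	;   PW_F054_MF130 = (4433,      4433-15137) = ( 4433, -10704) -> tmp2
 	; All four constants fit in a signed 16-bit word.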
 	movq      mm4,mm1		; mm1=in2=z2
 	movq      mm5,mm1
 	punpcklwd mm4,mm3		; mm3=in6=z3
 	punpckhwd mm5,mm3
 	movq      mm1,mm4
 	movq      mm3,mm5
 	pmaddwd   mm4,[GOTOFF(ebx,PW_F130_F054)]	; mm4=tmp3L
 	pmaddwd   mm5,[GOTOFF(ebx,PW_F130_F054)]	; mm5=tmp3H
 	pmaddwd   mm1,[GOTOFF(ebx,PW_F054_MF130)]	; mm1=tmp2L
 	pmaddwd   mm3,[GOTOFF(ebx,PW_F054_MF130)]	; mm3=tmp2H
 	movq      mm6,mm0
 	paddw     mm0,mm2		; mm0=in0+in4
 	psubw     mm6,mm2		; mm6=in0-in4
 	pxor      mm7,mm7
 	pxor      mm2,mm2
 	punpcklwd mm7,mm0		; mm7=tmp0L
 	punpckhwd mm2,mm0		; mm2=tmp0H
 	psrad     mm7,(16-CONST_BITS)	; psrad mm7,16 & pslld mm7,CONST_BITS
 	psrad     mm2,(16-CONST_BITS)	; psrad mm2,16 & pslld mm2,CONST_BITS
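 	; Example of the widening trick above: for the word -5, punpcklwd
 	; into a zeroed register yields the dword 0xFFFB0000 (= -5 << 16);
 	; the arithmetic shift right by 16-13 = 3 then gives
 	; -40960 = -5 << CONST_BITS, correctly sign-extended.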
 	movq	mm0,mm7
 	paddd	mm7,mm4			; mm7=tmp10L
 	psubd	mm0,mm4			; mm0=tmp13L
 	movq	mm4,mm2
 	paddd	mm2,mm5			; mm2=tmp10H
 	psubd	mm4,mm5			; mm4=tmp13H
 	movq	MMWORD [wk(0)], mm7	; wk(0)=tmp10L
 	movq	MMWORD [wk(1)], mm2	; wk(1)=tmp10H
 	movq	MMWORD [wk(2)], mm0	; wk(2)=tmp13L
 	movq	MMWORD [wk(3)], mm4	; wk(3)=tmp13H
 	pxor      mm5,mm5
 	pxor      mm7,mm7
 	punpcklwd mm5,mm6		; mm5=tmp1L
 	punpckhwd mm7,mm6		; mm7=tmp1H
 	psrad     mm5,(16-CONST_BITS)	; psrad mm5,16 & pslld mm5,CONST_BITS
 	psrad     mm7,(16-CONST_BITS)	; psrad mm7,16 & pslld mm7,CONST_BITS
 	movq	mm2,mm5
 	paddd	mm5,mm1			; mm5=tmp11L
 	psubd	mm2,mm1			; mm2=tmp12L
 	movq	mm0,mm7
 	paddd	mm7,mm3			; mm7=tmp11H
 	psubd	mm0,mm3			; mm0=tmp12H
 	movq	MMWORD [wk(4)], mm5	; wk(4)=tmp11L
 	movq	MMWORD [wk(5)], mm7	; wk(5)=tmp11H
 	movq	MMWORD [wk(6)], mm2	; wk(6)=tmp12L
 	movq	MMWORD [wk(7)], mm0	; wk(7)=tmp12H
 	; -- Odd part
 	movq	mm4, MMWORD [MMBLOCK(1,0,esi,SIZEOF_JCOEF)]
 	movq	mm6, MMWORD [MMBLOCK(3,0,esi,SIZEOF_JCOEF)]
 	pmullw	mm4, MMWORD [MMBLOCK(1,0,edx,SIZEOF_ISLOW_MULT_TYPE)]
 	pmullw	mm6, MMWORD [MMBLOCK(3,0,edx,SIZEOF_ISLOW_MULT_TYPE)]
 	movq	mm1, MMWORD [MMBLOCK(5,0,esi,SIZEOF_JCOEF)]
 	movq	mm3, MMWORD [MMBLOCK(7,0,esi,SIZEOF_JCOEF)]
 	pmullw	mm1, MMWORD [MMBLOCK(5,0,edx,SIZEOF_ISLOW_MULT_TYPE)]
 	pmullw	mm3, MMWORD [MMBLOCK(7,0,edx,SIZEOF_ISLOW_MULT_TYPE)]
 	movq	mm5,mm6
 	movq	mm7,mm4
 	paddw	mm5,mm3			; mm5=z3
 	paddw	mm7,mm1			; mm7=z4
 	; (Original)
 	; z5 = (z3 + z4) * 1.175875602;
 	; z3 = z3 * -1.961570560;  z4 = z4 * -0.390180644;
 	; z3 += z5;  z4 += z5;
 	;
 	; (This implementation)
 	; z3 = z3 * (1.175875602 - 1.961570560) + z4 * 1.175875602;
 	; z4 = z3 * 1.175875602 + z4 * (1.175875602 - 0.390180644);
 	movq      mm2,mm5
 	movq      mm0,mm5
 	punpcklwd mm2,mm7
 	punpckhwd mm0,mm7
 	movq      mm5,mm2
 	movq      mm7,mm0
 	pmaddwd   mm2,[GOTOFF(ebx,PW_MF078_F117)]	; mm2=z3L
 	pmaddwd   mm0,[GOTOFF(ebx,PW_MF078_F117)]	; mm0=z3H
 	pmaddwd   mm5,[GOTOFF(ebx,PW_F117_F078)]	; mm5=z4L
 	pmaddwd   mm7,[GOTOFF(ebx,PW_F117_F078)]	; mm7=z4H
 	movq	MMWORD [wk(10)], mm2	; wk(10)=z3L
 	movq	MMWORD [wk(11)], mm0	; wk(11)=z3H
 	; (Original)
 	; z1 = tmp0 + tmp3;  z2 = tmp1 + tmp2;
 	; tmp0 = tmp0 * 0.298631336;  tmp1 = tmp1 * 2.053119869;
 	; tmp2 = tmp2 * 3.072711026;  tmp3 = tmp3 * 1.501321110;
 	; z1 = z1 * -0.899976223;  z2 = z2 * -2.562915447;
 	; tmp0 += z1 + z3;  tmp1 += z2 + z4;
 	; tmp2 += z2 + z3;  tmp3 += z1 + z4;
 	;
 	; (This implementation)
 	; tmp0 = tmp0 * (0.298631336 - 0.899976223) + tmp3 * -0.899976223;
 	; tmp1 = tmp1 * (2.053119869 - 2.562915447) + tmp2 * -2.562915447;
 	; tmp2 = tmp1 * -2.562915447 + tmp2 * (3.072711026 - 2.562915447);
 	; tmp3 = tmp0 * -0.899976223 + tmp3 * (1.501321110 - 0.899976223);
 	; tmp0 += z3;  tmp1 += z4;
 	; tmp2 += z3;  tmp3 += z4;
 	movq      mm2,mm3
 	movq      mm0,mm3
 	punpcklwd mm2,mm4
 	punpckhwd mm0,mm4
 	movq      mm3,mm2
 	movq      mm4,mm0
 	pmaddwd   mm2,[GOTOFF(ebx,PW_MF060_MF089)]	; mm2=tmp0L
 	pmaddwd   mm0,[GOTOFF(ebx,PW_MF060_MF089)]	; mm0=tmp0H
 	pmaddwd   mm3,[GOTOFF(ebx,PW_MF089_F060)]	; mm3=tmp3L
 	pmaddwd   mm4,[GOTOFF(ebx,PW_MF089_F060)]	; mm4=tmp3H
 	paddd	mm2, MMWORD [wk(10)]	; mm2=tmp0L
 	paddd	mm0, MMWORD [wk(11)]	; mm0=tmp0H
 	paddd	mm3,mm5			; mm3=tmp3L
 	paddd	mm4,mm7			; mm4=tmp3H
 	movq	MMWORD [wk(8)], mm2	; wk(8)=tmp0L
 	movq	MMWORD [wk(9)], mm0	; wk(9)=tmp0H
 	movq      mm2,mm1
 	movq      mm0,mm1
 	punpcklwd mm2,mm6
 	punpckhwd mm0,mm6
 	movq      mm1,mm2
 	movq      mm6,mm0
 	pmaddwd   mm2,[GOTOFF(ebx,PW_MF050_MF256)]	; mm2=tmp1L
 	pmaddwd   mm0,[GOTOFF(ebx,PW_MF050_MF256)]	; mm0=tmp1H
 	pmaddwd   mm1,[GOTOFF(ebx,PW_MF256_F050)]	; mm1=tmp2L
 	pmaddwd   mm6,[GOTOFF(ebx,PW_MF256_F050)]	; mm6=tmp2H
 	paddd	mm2,mm5			; mm2=tmp1L
 	paddd	mm0,mm7			; mm0=tmp1H
 	paddd	mm1, MMWORD [wk(10)]	; mm1=tmp2L
 	paddd	mm6, MMWORD [wk(11)]	; mm6=tmp2H
 	movq	MMWORD [wk(10)], mm2	; wk(10)=tmp1L
 	movq	MMWORD [wk(11)], mm0	; wk(11)=tmp1H
 	; -- Final output stage
 	movq	mm5, MMWORD [wk(0)]	; mm5=tmp10L
 	movq	mm7, MMWORD [wk(1)]	; mm7=tmp10H
 	movq	mm2,mm5
 	movq	mm0,mm7
 	paddd	mm5,mm3			; mm5=data0L
 	paddd	mm7,mm4			; mm7=data0H
 	psubd	mm2,mm3			; mm2=data7L
 	psubd	mm0,mm4			; mm0=data7H
 	movq	mm3,[GOTOFF(ebx,PD_DESCALE_P1)]	; mm3=[PD_DESCALE_P1]
 	paddd	mm5,mm3
 	paddd	mm7,mm3
 	psrad	mm5,DESCALE_P1
 	psrad	mm7,DESCALE_P1
 	paddd	mm2,mm3
 	paddd	mm0,mm3
 	psrad	mm2,DESCALE_P1
 	psrad	mm0,DESCALE_P1
 	packssdw  mm5,mm7		; mm5=data0=(00 01 02 03)
 	packssdw  mm2,mm0		; mm2=data7=(70 71 72 73)
 	movq	mm4, MMWORD [wk(4)]	; mm4=tmp11L
 	movq	mm3, MMWORD [wk(5)]	; mm3=tmp11H
 	movq	mm7,mm4
 	movq	mm0,mm3
 	paddd	mm4,mm1			; mm4=data1L
 	paddd	mm3,mm6			; mm3=data1H
 	psubd	mm7,mm1			; mm7=data6L
 	psubd	mm0,mm6			; mm0=data6H
 	movq	mm1,[GOTOFF(ebx,PD_DESCALE_P1)]	; mm1=[PD_DESCALE_P1]
 	paddd	mm4,mm1
 	paddd	mm3,mm1
 	psrad	mm4,DESCALE_P1
 	psrad	mm3,DESCALE_P1
 	paddd	mm7,mm1
 	paddd	mm0,mm1
 	psrad	mm7,DESCALE_P1
 	psrad	mm0,DESCALE_P1
 	packssdw  mm4,mm3		; mm4=data1=(10 11 12 13)
 	packssdw  mm7,mm0		; mm7=data6=(60 61 62 63)
 	movq      mm6,mm5		; transpose coefficients(phase 1)
 	punpcklwd mm5,mm4		; mm5=(00 10 01 11)
 	punpckhwd mm6,mm4		; mm6=(02 12 03 13)
 	movq      mm1,mm7		; transpose coefficients(phase 1)
 	punpcklwd mm7,mm2		; mm7=(60 70 61 71)
 	punpckhwd mm1,mm2		; mm1=(62 72 63 73)
 	movq	mm3, MMWORD [wk(6)]	; mm3=tmp12L
 	movq	mm0, MMWORD [wk(7)]	; mm0=tmp12H
 	movq	mm4, MMWORD [wk(10)]	; mm4=tmp1L
 	movq	mm2, MMWORD [wk(11)]	; mm2=tmp1H
 	movq	MMWORD [wk(0)], mm5	; wk(0)=(00 10 01 11)
 	movq	MMWORD [wk(1)], mm6	; wk(1)=(02 12 03 13)
 	movq	MMWORD [wk(4)], mm7	; wk(4)=(60 70 61 71)
 	movq	MMWORD [wk(5)], mm1	; wk(5)=(62 72 63 73)
 	movq	mm5,mm3
 	movq	mm6,mm0
 	paddd	mm3,mm4			; mm3=data2L
 	paddd	mm0,mm2			; mm0=data2H
 	psubd	mm5,mm4			; mm5=data5L
 	psubd	mm6,mm2			; mm6=data5H
 	movq	mm7,[GOTOFF(ebx,PD_DESCALE_P1)]	; mm7=[PD_DESCALE_P1]
 	paddd	mm3,mm7
 	paddd	mm0,mm7
 	psrad	mm3,DESCALE_P1
 	psrad	mm0,DESCALE_P1
 	paddd	mm5,mm7
 	paddd	mm6,mm7
 	psrad	mm5,DESCALE_P1
 	psrad	mm6,DESCALE_P1
 	packssdw  mm3,mm0		; mm3=data2=(20 21 22 23)
 	packssdw  mm5,mm6		; mm5=data5=(50 51 52 53)
 	movq	mm1, MMWORD [wk(2)]	; mm1=tmp13L
 	movq	mm4, MMWORD [wk(3)]	; mm4=tmp13H
 	movq	mm2, MMWORD [wk(8)]	; mm2=tmp0L
 	movq	mm7, MMWORD [wk(9)]	; mm7=tmp0H
 	movq	mm0,mm1
 	movq	mm6,mm4
 	paddd	mm1,mm2			; mm1=data3L
 	paddd	mm4,mm7			; mm4=data3H
 	psubd	mm0,mm2			; mm0=data4L
 	psubd	mm6,mm7			; mm6=data4H
 	movq	mm2,[GOTOFF(ebx,PD_DESCALE_P1)]	; mm2=[PD_DESCALE_P1]
 	paddd	mm1,mm2
 	paddd	mm4,mm2
 	psrad	mm1,DESCALE_P1
 	psrad	mm4,DESCALE_P1
 	paddd	mm0,mm2
 	paddd	mm6,mm2
 	psrad	mm0,DESCALE_P1
 	psrad	mm6,DESCALE_P1
 	packssdw  mm1,mm4		; mm1=data3=(30 31 32 33)
 	packssdw  mm0,mm6		; mm0=data4=(40 41 42 43)
 	movq	mm7, MMWORD [wk(0)]	; mm7=(00 10 01 11)
 	movq	mm2, MMWORD [wk(1)]	; mm2=(02 12 03 13)
 	movq      mm4,mm3		; transpose coefficients(phase 1)
 	punpcklwd mm3,mm1		; mm3=(20 30 21 31)
 	punpckhwd mm4,mm1		; mm4=(22 32 23 33)
 	movq      mm6,mm0		; transpose coefficients(phase 1)
 	punpcklwd mm0,mm5		; mm0=(40 50 41 51)
 	punpckhwd mm6,mm5		; mm6=(42 52 43 53)
 	movq      mm1,mm7		; transpose coefficients(phase 2)
 	punpckldq mm7,mm3		; mm7=(00 10 20 30)
 	punpckhdq mm1,mm3		; mm1=(01 11 21 31)
 	movq      mm5,mm2		; transpose coefficients(phase 2)
 	punpckldq mm2,mm4		; mm2=(02 12 22 32)
 	punpckhdq mm5,mm4		; mm5=(03 13 23 33)
 	movq	mm3, MMWORD [wk(4)]	; mm3=(60 70 61 71)
 	movq	mm4, MMWORD [wk(5)]	; mm4=(62 72 63 73)
 	movq	MMWORD [MMBLOCK(0,0,edi,SIZEOF_JCOEF)], mm7
 	movq	MMWORD [MMBLOCK(1,0,edi,SIZEOF_JCOEF)], mm1
 	movq	MMWORD [MMBLOCK(2,0,edi,SIZEOF_JCOEF)], mm2
 	movq	MMWORD [MMBLOCK(3,0,edi,SIZEOF_JCOEF)], mm5
 	movq      mm7,mm0		; transpose coefficients(phase 2)
 	punpckldq mm0,mm3		; mm0=(40 50 60 70)
 	punpckhdq mm7,mm3		; mm7=(41 51 61 71)
 	movq      mm1,mm6		; transpose coefficients(phase 2)
 	punpckldq mm6,mm4		; mm6=(42 52 62 72)
 	punpckhdq mm1,mm4		; mm1=(43 53 63 73)
 	movq	MMWORD [MMBLOCK(0,1,edi,SIZEOF_JCOEF)], mm0
 	movq	MMWORD [MMBLOCK(1,1,edi,SIZEOF_JCOEF)], mm7
 	movq	MMWORD [MMBLOCK(2,1,edi,SIZEOF_JCOEF)], mm6
 	movq	MMWORD [MMBLOCK(3,1,edi,SIZEOF_JCOEF)], mm1
 .nextcolumn:
 	add	esi, byte 4*SIZEOF_JCOEF		; coef_block
 	add	edx, byte 4*SIZEOF_ISLOW_MULT_TYPE	; quantptr
 	add	edi, byte 4*DCTSIZE*SIZEOF_JCOEF	; wsptr
 	dec	ecx					; ctr
 	jnz	near .columnloop
 	; ---- Pass 2: process rows from work array, store into output array.
 	mov	eax, [original_ebp]
 	lea	esi, [workspace]			; JCOEF * wsptr
 	mov	edi, JSAMPARRAY [output_buf(eax)]	; (JSAMPROW *)
 	mov	eax, JDIMENSION [output_col(eax)]
 	mov	ecx, DCTSIZE/4				; ctr
 	alignx	16,7
 .rowloop:
 	; -- Even part
 	movq	mm0, MMWORD [MMBLOCK(0,0,esi,SIZEOF_JCOEF)]
 	movq	mm1, MMWORD [MMBLOCK(2,0,esi,SIZEOF_JCOEF)]
 	movq	mm2, MMWORD [MMBLOCK(4,0,esi,SIZEOF_JCOEF)]
 	movq	mm3, MMWORD [MMBLOCK(6,0,esi,SIZEOF_JCOEF)]
 	; (Original)
 	; z1 = (z2 + z3) * 0.541196100;
 	; tmp2 = z1 + z3 * -1.847759065;
 	; tmp3 = z1 + z2 * 0.765366865;
 	;
 	; (This implementation)
 	; tmp2 = z2 * 0.541196100 + z3 * (0.541196100 - 1.847759065);
 	; tmp3 = z2 * (0.541196100 + 0.765366865) + z3 * 0.541196100;
 	movq      mm4,mm1		; mm1=in2=z2
 	movq      mm5,mm1
 	punpcklwd mm4,mm3		; mm3=in6=z3
 	punpckhwd mm5,mm3
 	movq      mm1,mm4
 	movq      mm3,mm5
 	pmaddwd   mm4,[GOTOFF(ebx,PW_F130_F054)]	; mm4=tmp3L
 	pmaddwd   mm5,[GOTOFF(ebx,PW_F130_F054)]	; mm5=tmp3H
 	pmaddwd   mm1,[GOTOFF(ebx,PW_F054_MF130)]	; mm1=tmp2L
 	pmaddwd   mm3,[GOTOFF(ebx,PW_F054_MF130)]	; mm3=tmp2H
 	movq      mm6,mm0
 	paddw     mm0,mm2		; mm0=in0+in4
 	psubw     mm6,mm2		; mm6=in0-in4
 	pxor      mm7,mm7
 	pxor      mm2,mm2
 	punpcklwd mm7,mm0		; mm7=tmp0L
 	punpckhwd mm2,mm0		; mm2=tmp0H
 	psrad     mm7,(16-CONST_BITS)	; psrad mm7,16 & pslld mm7,CONST_BITS
 	psrad     mm2,(16-CONST_BITS)	; psrad mm2,16 & pslld mm2,CONST_BITS
 	movq	mm0,mm7
 	paddd	mm7,mm4			; mm7=tmp10L
 	psubd	mm0,mm4			; mm0=tmp13L
 	movq	mm4,mm2
 	paddd	mm2,mm5			; mm2=tmp10H
 	psubd	mm4,mm5			; mm4=tmp13H
 	movq	MMWORD [wk(0)], mm7	; wk(0)=tmp10L
 	movq	MMWORD [wk(1)], mm2	; wk(1)=tmp10H
 	movq	MMWORD [wk(2)], mm0	; wk(2)=tmp13L
 	movq	MMWORD [wk(3)], mm4	; wk(3)=tmp13H
 	pxor      mm5,mm5
 	pxor      mm7,mm7
 	punpcklwd mm5,mm6		; mm5=tmp1L
 	punpckhwd mm7,mm6		; mm7=tmp1H
 	psrad     mm5,(16-CONST_BITS)	; psrad mm5,16 & pslld mm5,CONST_BITS
 	psrad     mm7,(16-CONST_BITS)	; psrad mm7,16 & pslld mm7,CONST_BITS
 	movq	mm2,mm5
 	paddd	mm5,mm1			; mm5=tmp11L
 	psubd	mm2,mm1			; mm2=tmp12L
 	movq	mm0,mm7
 	paddd	mm7,mm3			; mm7=tmp11H
 	psubd	mm0,mm3			; mm0=tmp12H
 	movq	MMWORD [wk(4)], mm5	; wk(4)=tmp11L
 	movq	MMWORD [wk(5)], mm7	; wk(5)=tmp11H
 	movq	MMWORD [wk(6)], mm2	; wk(6)=tmp12L
 	movq	MMWORD [wk(7)], mm0	; wk(7)=tmp12H
 	; -- Odd part
 	movq	mm4, MMWORD [MMBLOCK(1,0,esi,SIZEOF_JCOEF)]
 	movq	mm6, MMWORD [MMBLOCK(3,0,esi,SIZEOF_JCOEF)]
 	movq	mm1, MMWORD [MMBLOCK(5,0,esi,SIZEOF_JCOEF)]
 	movq	mm3, MMWORD [MMBLOCK(7,0,esi,SIZEOF_JCOEF)]
 	movq	mm5,mm6
 	movq	mm7,mm4
 	paddw	mm5,mm3			; mm5=z3
 	paddw	mm7,mm1			; mm7=z4
 	; (Original)
 	; z5 = (z3 + z4) * 1.175875602;
 	; z3 = z3 * -1.961570560;  z4 = z4 * -0.390180644;
 	; z3 += z5;  z4 += z5;
 	;
 	; (This implementation)
 	; z3 = z3 * (1.175875602 - 1.961570560) + z4 * 1.175875602;
 	; z4 = z3 * 1.175875602 + z4 * (1.175875602 - 0.390180644);
 	movq      mm2,mm5
 	movq      mm0,mm5
 	punpcklwd mm2,mm7
 	punpckhwd mm0,mm7
 	movq      mm5,mm2
 	movq      mm7,mm0
 	pmaddwd   mm2,[GOTOFF(ebx,PW_MF078_F117)]	; mm2=z3L
 	pmaddwd   mm0,[GOTOFF(ebx,PW_MF078_F117)]	; mm0=z3H
 	pmaddwd   mm5,[GOTOFF(ebx,PW_F117_F078)]	; mm5=z4L
 	pmaddwd   mm7,[GOTOFF(ebx,PW_F117_F078)]	; mm7=z4H
 	movq	MMWORD [wk(10)], mm2	; wk(10)=z3L
 	movq	MMWORD [wk(11)], mm0	; wk(11)=z3H
 	; (Original)
 	; z1 = tmp0 + tmp3;  z2 = tmp1 + tmp2;
 	; tmp0 = tmp0 * 0.298631336;  tmp1 = tmp1 * 2.053119869;
 	; tmp2 = tmp2 * 3.072711026;  tmp3 = tmp3 * 1.501321110;
 	; z1 = z1 * -0.899976223;  z2 = z2 * -2.562915447;
 	; tmp0 += z1 + z3;  tmp1 += z2 + z4;
 	; tmp2 += z2 + z3;  tmp3 += z1 + z4;
 	;
 	; (This implementation)
 	; tmp0 = tmp0 * (0.298631336 - 0.899976223) + tmp3 * -0.899976223;
 	; tmp1 = tmp1 * (2.053119869 - 2.562915447) + tmp2 * -2.562915447;
 	; tmp2 = tmp1 * -2.562915447 + tmp2 * (3.072711026 - 2.562915447);
 	; tmp3 = tmp0 * -0.899976223 + tmp3 * (1.501321110 - 0.899976223);
 	; tmp0 += z3;  tmp1 += z4;
 	; tmp2 += z3;  tmp3 += z4;
 	movq      mm2,mm3
 	movq      mm0,mm3
 	punpcklwd mm2,mm4
 	punpckhwd mm0,mm4
 	movq      mm3,mm2
 	movq      mm4,mm0
 	pmaddwd   mm2,[GOTOFF(ebx,PW_MF060_MF089)]	; mm2=tmp0L
 	pmaddwd   mm0,[GOTOFF(ebx,PW_MF060_MF089)]	; mm0=tmp0H
 	pmaddwd   mm3,[GOTOFF(ebx,PW_MF089_F060)]	; mm3=tmp3L
 	pmaddwd   mm4,[GOTOFF(ebx,PW_MF089_F060)]	; mm4=tmp3H
 	paddd	mm2, MMWORD [wk(10)]	; mm2=tmp0L
 	paddd	mm0, MMWORD [wk(11)]	; mm0=tmp0H
 	paddd	mm3,mm5			; mm3=tmp3L
 	paddd	mm4,mm7			; mm4=tmp3H
 	movq	MMWORD [wk(8)], mm2	; wk(8)=tmp0L
 	movq	MMWORD [wk(9)], mm0	; wk(9)=tmp0H
 	movq      mm2,mm1
 	movq      mm0,mm1
 	punpcklwd mm2,mm6
 	punpckhwd mm0,mm6
 	movq      mm1,mm2
 	movq      mm6,mm0
 	pmaddwd   mm2,[GOTOFF(ebx,PW_MF050_MF256)]	; mm2=tmp1L
 	pmaddwd   mm0,[GOTOFF(ebx,PW_MF050_MF256)]	; mm0=tmp1H
 	pmaddwd   mm1,[GOTOFF(ebx,PW_MF256_F050)]	; mm1=tmp2L
 	pmaddwd   mm6,[GOTOFF(ebx,PW_MF256_F050)]	; mm6=tmp2H
 	paddd	mm2,mm5			; mm2=tmp1L
 	paddd	mm0,mm7			; mm0=tmp1H
 	paddd	mm1, MMWORD [wk(10)]	; mm1=tmp2L
 	paddd	mm6, MMWORD [wk(11)]	; mm6=tmp2H
 	movq	MMWORD [wk(10)], mm2	; wk(10)=tmp1L
 	movq	MMWORD [wk(11)], mm0	; wk(11)=tmp1H
 	; -- Final output stage
 	movq	mm5, MMWORD [wk(0)]	; mm5=tmp10L
 	movq	mm7, MMWORD [wk(1)]	; mm7=tmp10H
 	movq	mm2,mm5
 	movq	mm0,mm7
 	paddd	mm5,mm3			; mm5=data0L
 	paddd	mm7,mm4			; mm7=data0H
 	psubd	mm2,mm3			; mm2=data7L
 	psubd	mm0,mm4			; mm0=data7H
 	movq	mm3,[GOTOFF(ebx,PD_DESCALE_P2)]	; mm3=[PD_DESCALE_P2]
 	paddd	mm5,mm3
 	paddd	mm7,mm3
 	psrad	mm5,DESCALE_P2
 	psrad	mm7,DESCALE_P2
 	paddd	mm2,mm3
 	paddd	mm0,mm3
 	psrad	mm2,DESCALE_P2
 	psrad	mm0,DESCALE_P2
 	packssdw  mm5,mm7		; mm5=data0=(00 10 20 30)
 	packssdw  mm2,mm0		; mm2=data7=(07 17 27 37)
 	movq	mm4, MMWORD [wk(4)]	; mm4=tmp11L
 	movq	mm3, MMWORD [wk(5)]	; mm3=tmp11H
 	movq	mm7,mm4
 	movq	mm0,mm3
 	paddd	mm4,mm1			; mm4=data1L
 	paddd	mm3,mm6			; mm3=data1H
 	psubd	mm7,mm1			; mm7=data6L
 	psubd	mm0,mm6			; mm0=data6H
 	movq	mm1,[GOTOFF(ebx,PD_DESCALE_P2)]	; mm1=[PD_DESCALE_P2]
 	paddd	mm4,mm1
 	paddd	mm3,mm1
 	psrad	mm4,DESCALE_P2
 	psrad	mm3,DESCALE_P2
 	paddd	mm7,mm1
 	paddd	mm0,mm1
 	psrad	mm7,DESCALE_P2
 	psrad	mm0,DESCALE_P2
 	packssdw  mm4,mm3		; mm4=data1=(01 11 21 31)
 	packssdw  mm7,mm0		; mm7=data6=(06 16 26 36)
 	packsswb  mm5,mm7		; mm5=(00 10 20 30 06 16 26 36)
 	packsswb  mm4,mm2		; mm4=(01 11 21 31 07 17 27 37)
 	movq	mm6, MMWORD [wk(6)]	; mm6=tmp12L
 	movq	mm1, MMWORD [wk(7)]	; mm1=tmp12H
 	movq	mm3, MMWORD [wk(10)]	; mm3=tmp1L
 	movq	mm0, MMWORD [wk(11)]	; mm0=tmp1H
 	movq	MMWORD [wk(0)], mm5	; wk(0)=(00 10 20 30 06 16 26 36)
 	movq	MMWORD [wk(1)], mm4	; wk(1)=(01 11 21 31 07 17 27 37)
 	movq	mm7,mm6
 	movq	mm2,mm1
 	paddd	mm6,mm3			; mm6=data2L
 	paddd	mm1,mm0			; mm1=data2H
 	psubd	mm7,mm3			; mm7=data5L
 	psubd	mm2,mm0			; mm2=data5H
 	movq	mm5,[GOTOFF(ebx,PD_DESCALE_P2)]	; mm5=[PD_DESCALE_P2]
 	paddd	mm6,mm5
 	paddd	mm1,mm5
 	psrad	mm6,DESCALE_P2
 	psrad	mm1,DESCALE_P2
 	paddd	mm7,mm5
 	paddd	mm2,mm5
 	psrad	mm7,DESCALE_P2
 	psrad	mm2,DESCALE_P2
 	packssdw  mm6,mm1		; mm6=data2=(02 12 22 32)
 	packssdw  mm7,mm2		; mm7=data5=(05 15 25 35)
 	movq	mm4, MMWORD [wk(2)]	; mm4=tmp13L
 	movq	mm3, MMWORD [wk(3)]	; mm3=tmp13H
 	movq	mm0, MMWORD [wk(8)]	; mm0=tmp0L
 	movq	mm5, MMWORD [wk(9)]	; mm5=tmp0H
 	movq	mm1,mm4
 	movq	mm2,mm3
 	paddd	mm4,mm0			; mm4=data3L
 	paddd	mm3,mm5			; mm3=data3H
 	psubd	mm1,mm0			; mm1=data4L
 	psubd	mm2,mm5			; mm2=data4H
 	movq	mm0,[GOTOFF(ebx,PD_DESCALE_P2)]	; mm0=[PD_DESCALE_P2]
 	paddd	mm4,mm0
 	paddd	mm3,mm0
 	psrad	mm4,DESCALE_P2
 	psrad	mm3,DESCALE_P2
 	paddd	mm1,mm0
 	paddd	mm2,mm0
 	psrad	mm1,DESCALE_P2
 	psrad	mm2,DESCALE_P2
 	movq      mm5,[GOTOFF(ebx,PB_CENTERJSAMP)]	; mm5=[PB_CENTERJSAMP]
 	packssdw  mm4,mm3		; mm4=data3=(03 13 23 33)
 	packssdw  mm1,mm2		; mm1=data4=(04 14 24 34)
 	movq      mm0, MMWORD [wk(0)]	; mm0=(00 10 20 30 06 16 26 36)
 	movq      mm3, MMWORD [wk(1)]	; mm3=(01 11 21 31 07 17 27 37)
 	packsswb  mm6,mm1		; mm6=(02 12 22 32 04 14 24 34)
 	packsswb  mm4,mm7		; mm4=(03 13 23 33 05 15 25 35)
 	paddb     mm0,mm5
 	paddb     mm3,mm5
 	paddb     mm6,mm5
 	paddb     mm4,mm5
 	movq      mm2,mm0		; transpose coefficients(phase 1)
 	punpcklbw mm0,mm3		; mm0=(00 01 10 11 20 21 30 31)
 	punpckhbw mm2,mm3		; mm2=(06 07 16 17 26 27 36 37)
 	movq      mm1,mm6		; transpose coefficients(phase 1)
 	punpcklbw mm6,mm4		; mm6=(02 03 12 13 22 23 32 33)
 	punpckhbw mm1,mm4		; mm1=(04 05 14 15 24 25 34 35)
 	movq      mm7,mm0		; transpose coefficients(phase 2)
 	punpcklwd mm0,mm6		; mm0=(00 01 02 03 10 11 12 13)
 	punpckhwd mm7,mm6		; mm7=(20 21 22 23 30 31 32 33)
 	movq      mm5,mm1		; transpose coefficients(phase 2)
 	punpcklwd mm1,mm2		; mm1=(04 05 06 07 14 15 16 17)
 	punpckhwd mm5,mm2		; mm5=(24 25 26 27 34 35 36 37)
 	movq      mm3,mm0		; transpose coefficients(phase 3)
 	punpckldq mm0,mm1		; mm0=(00 01 02 03 04 05 06 07)
 	punpckhdq mm3,mm1		; mm3=(10 11 12 13 14 15 16 17)
 	movq      mm4,mm7		; transpose coefficients(phase 3)
 	punpckldq mm7,mm5		; mm7=(20 21 22 23 24 25 26 27)
 	punpckhdq mm4,mm5		; mm4=(30 31 32 33 34 35 36 37)
 	pushpic	ebx			; save GOT address
 	mov	edx, JSAMPROW [edi+0*SIZEOF_JSAMPROW]
 	mov	ebx, JSAMPROW [edi+1*SIZEOF_JSAMPROW]
 	movq	MMWORD [edx+eax*SIZEOF_JSAMPLE], mm0
 	movq	MMWORD [ebx+eax*SIZEOF_JSAMPLE], mm3
 	mov	edx, JSAMPROW [edi+2*SIZEOF_JSAMPROW]
 	mov	ebx, JSAMPROW [edi+3*SIZEOF_JSAMPROW]
 	movq	MMWORD [edx+eax*SIZEOF_JSAMPLE], mm7
 	movq	MMWORD [ebx+eax*SIZEOF_JSAMPLE], mm4
 	poppic	ebx			; restore GOT address
 	add	esi, byte 4*SIZEOF_JCOEF	; wsptr
 	add	edi, byte 4*SIZEOF_JSAMPROW
 	dec	ecx				; ctr
 	jnz	near .rowloop
 	emms		; empty MMX state
 	pop	edi
 	pop	esi
 ;	pop	edx		; need not be preserved
 ;	pop	ecx		; need not be preserved
 	pop	ebx
 	mov	esp,ebp		; esp <- aligned ebp
 	pop	esp		; esp <- original ebp
 	pop	ebp
 	ret
 %endif ; JIDCT_INT_MMX_SUPPORTED
 %endif ; DCT_ISLOW_SUPPORTED
--- a/jimmxred.asm
+++ b/jimmxred.asm
@@ -0,0 +1,719 @@
 ;
 ; jimmxred.asm - reduced-size IDCT (MMX)
 ;
 ; x86 SIMD extension for IJG JPEG library
 ; Copyright (C) 1999-2006, MIYASAKA Masaru.
 ; For conditions of distribution and use, see copyright notice in jsimdext.inc
 ;
 ; This file should be assembled with NASM (Netwide Assembler); it can
 ; *not* be assembled with Microsoft's MASM or any compatible assembler
 ; (including Borland's Turbo Assembler).
 ; NASM is available from http://nasm.sourceforge.net/ or
 ; http://sourceforge.net/project/showfiles.php?group_id=6208
 ;
 ; This file contains inverse-DCT routines that produce reduced-size
 ; output: either 4x4 or 2x2 pixels from an 8x8 DCT block.
 ; The following code is based directly on the IJG's original jidctred.c;
 ; see jidctred.c for more details.
 ;
 ; Last Modified : February 4, 2006
 ;
 ; [TAB8]
 %include "jsimdext.inc"
 %include "jdct.inc"
 %ifdef IDCT_SCALING_SUPPORTED
 %ifdef JIDCT_INT_MMX_SUPPORTED
 ; This module is specialized to the case DCTSIZE = 8.
 ;
 %if DCTSIZE != 8
 %error "Sorry, this code only copes with 8x8 DCTs."
 %endif
 ; --------------------------------------------------------------------------
 %define CONST_BITS	13
 %define PASS1_BITS	2
 %define DESCALE_P1_4	(CONST_BITS-PASS1_BITS+1)
 %define DESCALE_P2_4	(CONST_BITS+PASS1_BITS+3+1)
 %define DESCALE_P1_2	(CONST_BITS-PASS1_BITS+2)
 %define DESCALE_P2_2	(CONST_BITS+PASS1_BITS+3+2)
 %if CONST_BITS == 13
 F_0_211	equ	 1730		; FIX(0.211164243)
 F_0_509	equ	 4176		; FIX(0.509795579)
 F_0_601	equ	 4926		; FIX(0.601344887)
 F_0_720	equ	 5906		; FIX(0.720959822)
 F_0_765	equ	 6270		; FIX(0.765366865)
 F_0_850	equ	 6967		; FIX(0.850430095)
 F_0_899	equ	 7373		; FIX(0.899976223)
 F_1_061	equ	 8697		; FIX(1.061594337)
 F_1_272	equ	10426		; FIX(1.272758580)
 F_1_451	equ	11893		; FIX(1.451774981)
 F_1_847	equ	15137		; FIX(1.847759065)
 F_2_172	equ	17799		; FIX(2.172734803)
 F_2_562	equ	20995		; FIX(2.562915447)
 F_3_624	equ	29692		; FIX(3.624509785)
 %else
 ; NASM cannot do compile-time arithmetic on floating-point constants.
 %define DESCALE(x,n)  (((x)+(1<<((n)-1)))>>(n))
 F_0_211	equ	DESCALE( 226735879,30-CONST_BITS)	; FIX(0.211164243)
 F_0_509	equ	DESCALE( 547388834,30-CONST_BITS)	; FIX(0.509795579)
 F_0_601	equ	DESCALE( 645689155,30-CONST_BITS)	; FIX(0.601344887)
 F_0_720	equ	DESCALE( 774124714,30-CONST_BITS)	; FIX(0.720959822)
 F_0_765	equ	DESCALE( 821806413,30-CONST_BITS)	; FIX(0.765366865)
 F_0_850	equ	DESCALE( 913142361,30-CONST_BITS)	; FIX(0.850430095)
 F_0_899	equ	DESCALE( 966342111,30-CONST_BITS)	; FIX(0.899976223)
 F_1_061	equ	DESCALE(1139878239,30-CONST_BITS)	; FIX(1.061594337)
 F_1_272	equ	DESCALE(1366614119,30-CONST_BITS)	; FIX(1.272758580)
 F_1_451	equ	DESCALE(1558831516,30-CONST_BITS)	; FIX(1.451774981)
 F_1_847	equ	DESCALE(1984016188,30-CONST_BITS)	; FIX(1.847759065)
 F_2_172	equ	DESCALE(2332956230,30-CONST_BITS)	; FIX(2.172734803)
 F_2_562	equ	DESCALE(2751909506,30-CONST_BITS)	; FIX(2.562915447)
 F_3_624	equ	DESCALE(3891787747,30-CONST_BITS)	; FIX(3.624509785)
 %endif
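 ;
 ; Worked example (illustrative note, not part of the original source):
 ; FIX(x) is taken here to mean x scaled by 2^CONST_BITS and rounded to the
 ; nearest integer, as in jidctred.c.  With CONST_BITS = 13:
 ;   FIX(0.899976223) = round(0.899976223 * 8192) = round(7372.605...) = 7373
 ; The generic branch starts from the same constant scaled by 2^30:
 ;   DESCALE(966342111, 30-13) = (966342111 + 65536) >> 17 = 7373
 ; so both branches give the same F_0_899 when CONST_BITS == 13.
 ;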
 ; --------------------------------------------------------------------------
 	SECTION	SEG_CONST
 	alignz	16
 	global	EXTN(jconst_idct_red_mmx)
 EXTN(jconst_idct_red_mmx):
 PW_F184_MF076	times 2 dw  F_1_847,-F_0_765
 PW_F256_F089	times 2 dw  F_2_562, F_0_899
 PW_F106_MF217	times 2 dw  F_1_061,-F_2_172
 PW_MF060_MF050	times 2 dw -F_0_601,-F_0_509
 PW_F145_MF021	times 2 dw  F_1_451,-F_0_211
 PW_F362_MF127	times 2 dw  F_3_624,-F_1_272
 PW_F085_MF072	times 2 dw  F_0_850,-F_0_720
 PD_DESCALE_P1_4	times 2 dd  1 << (DESCALE_P1_4-1)
 PD_DESCALE_P2_4	times 2 dd  1 << (DESCALE_P2_4-1)
 PD_DESCALE_P1_2	times 2 dd  1 << (DESCALE_P1_2-1)
 PD_DESCALE_P2_2	times 2 dd  1 << (DESCALE_P2_2-1)
 PB_CENTERJSAMP	times 8 db  CENTERJSAMPLE
 	alignz	16
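 ;
 ; Illustrative note (not part of the original source): each PD_DESCALE_*
 ; constant is the rounding bias 1 << (n-1) for a later "psrad reg,n", so
 ; "paddd reg,bias" followed by "psrad reg,n" yields floor((x + 2^(n-1)) / 2^n).
 ; With CONST_BITS = 13 and PASS1_BITS = 2:
 ;   DESCALE_P1_4 = 13-2+1   = 12  ->  bias = 2048
 ;   DESCALE_P2_4 = 13+2+3+1 = 19  ->  bias = 262144
 ;   DESCALE_P1_2 = 13-2+2   = 13  ->  bias = 4096
 ;   DESCALE_P2_2 = 13+2+3+2 = 20  ->  bias = 524288
 ; Example for n = 12:
 ;   x = 10000  ->  (10000 + 2048) >> 12 =  2   (10000/4096 =  2.44...)
 ;   x = -6144  ->  (-6144 + 2048) >> 12 = -1   (-6144/4096 = -1.5)
 ;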
 ; --------------------------------------------------------------------------
 	SECTION	SEG_TEXT
 	BITS	32
 ;
 ; Perform dequantization and inverse DCT on one block of coefficients,
 ; producing a reduced-size 4x4 output block.
 ;
 ; GLOBAL(void)
 ; jpeg_idct_4x4_mmx (j_decompress_ptr cinfo, jpeg_component_info * compptr,
 ;                    JCOEFPTR coef_block,
 ;                    JSAMPARRAY output_buf, JDIMENSION output_col)
 ;
 %define cinfo(b)	(b)+8		; j_decompress_ptr cinfo
 %define compptr(b)	(b)+12		; jpeg_component_info * compptr
 %define coef_block(b)	(b)+16		; JCOEFPTR coef_block
 %define output_buf(b)	(b)+20		; JSAMPARRAY output_buf
 %define output_col(b)	(b)+24		; JDIMENSION output_col
 %define original_ebp	ebp+0
 %define wk(i)		ebp-(WK_NUM-(i))*SIZEOF_MMWORD	; mmword wk[WK_NUM]
 %define WK_NUM		2
 %define workspace	wk(0)-DCTSIZE2*SIZEOF_JCOEF
 					; JCOEF workspace[DCTSIZE2]
 	align	16
 	global	EXTN(jpeg_idct_4x4_mmx)
 EXTN(jpeg_idct_4x4_mmx):
 	push	ebp
 	mov	eax,esp				; eax = original ebp
 	sub	esp, byte 4
 	and	esp, byte (-SIZEOF_MMWORD)	; align to 64 bits
 	mov	[esp],eax
 	mov	ebp,esp				; ebp = aligned ebp
 	lea	esp, [workspace]
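 	; Illustrative sketch of the frame built above (an explanatory note, not
 	; part of the original source).  eax still points at the saved ebp, so
 	; the arguments are addressed relative to eax rather than ebp:
 	;   [eax+8] .. [eax+24]  cinfo, compptr, coef_block, output_buf, output_col
 	;   [eax+4]              return address
 	;   [eax+0]              caller's ebp
 	;   (0 or 4 bytes)       padding introduced by the 8-byte alignment
 	;   [ebp+0]              copy of eax (original_ebp)
 	;   wk(0) .. wk(1)       WK_NUM mmword temporaries just below ebp
 	;   workspace            DCTSIZE2*SIZEOF_JCOEF bytes below wk(0);
 	;                        esp now holds this address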
 	pushpic	ebx
 ;	push	ecx		; need not be preserved
 ;	push	edx		; need not be preserved
 	push	esi
 	push	edi
 	get_GOT	ebx		; get GOT address
 	; ---- Pass 1: process columns from input, store into work array.
 ;	mov	eax, [original_ebp]
 	mov	edx, POINTER [compptr(eax)]
 	mov	edx, POINTER [jcompinfo_dct_table(edx)]	; quantptr
 	mov	esi, JCOEFPTR [coef_block(eax)]		; inptr
 	lea	edi, [workspace]			; JCOEF * wsptr
 	mov	ecx, DCTSIZE/4				; ctr
 	alignx	16,7
 .columnloop:
 %ifndef NO_ZERO_COLUMN_TEST_4X4_MMX
 	mov	eax, DWORD [DWBLOCK(1,0,esi,SIZEOF_JCOEF)]
 	or	eax, DWORD [DWBLOCK(2,0,esi,SIZEOF_JCOEF)]
 	jnz	short .columnDCT
 	movq	mm0, MMWORD [MMBLOCK(1,0,esi,SIZEOF_JCOEF)]
 	movq	mm1, MMWORD [MMBLOCK(2,0,esi,SIZEOF_JCOEF)]
 	por	mm0, MMWORD [MMBLOCK(3,0,esi,SIZEOF_JCOEF)]
 	por	mm1, MMWORD [MMBLOCK(5,0,esi,SIZEOF_JCOEF)]
 	por	mm0, MMWORD [MMBLOCK(6,0,esi,SIZEOF_JCOEF)]
 	por	mm1, MMWORD [MMBLOCK(7,0,esi,SIZEOF_JCOEF)]
 	por	mm0,mm1
 	packsswb mm0,mm0
 	movd	eax,mm0
 	test	eax,eax
 	jnz	short .columnDCT
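 	; Illustrative note (not part of the original source): the first test
 	; looks only at one dword (two coefficients) of rows 1 and 2; the por
 	; chain then ORs together all four columns of rows 1,2,3,5,6,7.  Row 4
 	; is left out because the 4x4 IDCT below never reads it.  packsswb
 	; saturates each signed word to a byte, and a nonzero word always gives
 	; a nonzero byte (e.g. 0x0100 -> 0x7F, 0xFF00 -> 0x80), so the 64-bit
 	; result can be checked with a single 32-bit test.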
 	; -- AC terms all zero
 	movq	mm0, MMWORD [MMBLOCK(0,0,esi,SIZEOF_JCOEF)]
 	pmullw	mm0, MMWORD [MMBLOCK(0,0,edx,SIZEOF_ISLOW_MULT_TYPE)]
 	psllw	mm0,PASS1_BITS
 	movq      mm2,mm0		; mm0=in0=(00 01 02 03)
 	punpcklwd mm0,mm0		; mm0=(00 00 01 01)
 	punpckhwd mm2,mm2		; mm2=(02 02 03 03)
 	movq      mm1,mm0
 	punpckldq mm0,mm0		; mm0=(00 00 00 00)
 	punpckhdq mm1,mm1		; mm1=(01 01 01 01)
 	movq      mm3,mm2
 	punpckldq mm2,mm2		; mm2=(02 02 02 02)
 	punpckhdq mm3,mm3		; mm3=(03 03 03 03)
 	movq	MMWORD [MMBLOCK(0,0,edi,SIZEOF_JCOEF)], mm0
 	movq	MMWORD [MMBLOCK(1,0,edi,SIZEOF_JCOEF)], mm1
 	movq	MMWORD [MMBLOCK(2,0,edi,SIZEOF_JCOEF)], mm2
 	movq	MMWORD [MMBLOCK(3,0,edi,SIZEOF_JCOEF)], mm3
 	jmp	near .nextcolumn
 	alignx	16,7
 %endif
 .columnDCT:
 	; -- Odd part
 	movq	mm0, MMWORD [MMBLOCK(1,0,esi,SIZEOF_JCOEF)]
 	movq	mm1, MMWORD [MMBLOCK(3,0,esi,SIZEOF_JCOEF)]
 	pmullw	mm0, MMWORD [MMBLOCK(1,0,edx,SIZEOF_ISLOW_MULT_TYPE)]
 	pmullw	mm1, MMWORD [MMBLOCK(3,0,edx,SIZEOF_ISLOW_MULT_TYPE)]
 	movq	mm2, MMWORD [MMBLOCK(5,0,esi,SIZEOF_JCOEF)]
 	movq	mm3, MMWORD [MMBLOCK(7,0,esi,SIZEOF_JCOEF)]
 	pmullw	mm2, MMWORD [MMBLOCK(5,0,edx,SIZEOF_ISLOW_MULT_TYPE)]
 	pmullw	mm3, MMWORD [MMBLOCK(7,0,edx,SIZEOF_ISLOW_MULT_TYPE)]
 	movq      mm4,mm0
 	movq      mm5,mm0
 	punpcklwd mm4,mm1
 	punpckhwd mm5,mm1
 	movq      mm0,mm4
 	movq      mm1,mm5
 	pmaddwd   mm4,[GOTOFF(ebx,PW_F256_F089)]	; mm4=(tmp2L)
 	pmaddwd   mm5,[GOTOFF(ebx,PW_F256_F089)]	; mm5=(tmp2H)
 	pmaddwd   mm0,[GOTOFF(ebx,PW_F106_MF217)]	; mm0=(tmp0L)
 	pmaddwd   mm1,[GOTOFF(ebx,PW_F106_MF217)]	; mm1=(tmp0H)
 	movq      mm6,mm2
 	movq      mm7,mm2
 	punpcklwd mm6,mm3
 	punpckhwd mm7,mm3
 	movq      mm2,mm6
 	movq      mm3,mm7
 	pmaddwd   mm6,[GOTOFF(ebx,PW_MF060_MF050)]	; mm6=(tmp2L)
 	pmaddwd   mm7,[GOTOFF(ebx,PW_MF060_MF050)]	; mm7=(tmp2H)
 	pmaddwd   mm2,[GOTOFF(ebx,PW_F145_MF021)]	; mm2=(tmp0L)
 	pmaddwd   mm3,[GOTOFF(ebx,PW_F145_MF021)]	; mm3=(tmp0H)
 	paddd	mm6,mm4			; mm6=tmp2L
 	paddd	mm7,mm5			; mm7=tmp2H
 	paddd	mm2,mm0			; mm2=tmp0L
 	paddd	mm3,mm1			; mm3=tmp0H
 	movq	MMWORD [wk(0)], mm2	; wk(0)=tmp0L
 	movq	MMWORD [wk(1)], mm3	; wk(1)=tmp0H
 	; -- Even part
 	movq	mm4, MMWORD [MMBLOCK(0,0,esi,SIZEOF_JCOEF)]
 	movq	mm5, MMWORD [MMBLOCK(2,0,esi,SIZEOF_JCOEF)]
 	movq	mm0, MMWORD [MMBLOCK(6,0,esi,SIZEOF_JCOEF)]
 	pmullw	mm4, MMWORD [MMBLOCK(0,0,edx,SIZEOF_ISLOW_MULT_TYPE)]
 	pmullw	mm5, MMWORD [MMBLOCK(2,0,edx,SIZEOF_ISLOW_MULT_TYPE)]
 	pmullw	mm0, MMWORD [MMBLOCK(6,0,edx,SIZEOF_ISLOW_MULT_TYPE)]
 	pxor      mm1,mm1
 	pxor      mm2,mm2
 	punpcklwd mm1,mm4		; mm1=tmp0L
 	punpckhwd mm2,mm4		; mm2=tmp0H
 	psrad     mm1,(16-CONST_BITS-1)	; psrad mm1,16 & pslld mm1,CONST_BITS+1
 	psrad     mm2,(16-CONST_BITS-1)	; psrad mm2,16 & pslld mm2,CONST_BITS+1
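 	; Illustrative note (not part of the original source): punpck?wd into a
 	; zeroed register puts each 16-bit input in the upper half of a dword,
 	; i.e. x << 16; the arithmetic right shift by 16-(CONST_BITS+1) then
 	; leaves the sign-extended value x << (CONST_BITS+1).
 	; Example with CONST_BITS = 13 and x = -3 (0xFFFD):
 	;   after punpcklwd: 0xFFFD0000
 	;   after psrad 2:   0xFFFF4000 = -49152 = -3 * 16384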
 	movq      mm3,mm5		; mm5=in2=z2
 	punpcklwd mm5,mm0		; mm0=in6=z3
 	punpckhwd mm3,mm0
 	pmaddwd   mm5,[GOTOFF(ebx,PW_F184_MF076)]	; mm5=tmp2L
 	pmaddwd   mm3,[GOTOFF(ebx,PW_F184_MF076)]	; mm3=tmp2H
 	movq	mm4,mm1
 	movq	mm0,mm2
 	paddd	mm1,mm5			; mm1=tmp10L
 	paddd	mm2,mm3			; mm2=tmp10H
 	psubd	mm4,mm5			; mm4=tmp12L
 	psubd	mm0,mm3			; mm0=tmp12H
 	; -- Final output stage
 	movq	mm5,mm1
 	movq	mm3,mm2
 	paddd	mm1,mm6			; mm1=data0L
 	paddd	mm2,mm7			; mm2=data0H
 	psubd	mm5,mm6			; mm5=data3L
 	psubd	mm3,mm7			; mm3=data3H
 	movq	mm6,[GOTOFF(ebx,PD_DESCALE_P1_4)]	; mm6=[PD_DESCALE_P1_4]
 	paddd	mm1,mm6
 	paddd	mm2,mm6
 	psrad	mm1,DESCALE_P1_4
 	psrad	mm2,DESCALE_P1_4
 	paddd	mm5,mm6
 	paddd	mm3,mm6
 	psrad	mm5,DESCALE_P1_4
 	psrad	mm3,DESCALE_P1_4
 	packssdw  mm1,mm2		; mm1=data0=(00 01 02 03)
 	packssdw  mm5,mm3		; mm5=data3=(30 31 32 33)
 	movq	mm7, MMWORD [wk(0)]	; mm7=tmp0L
 	movq	mm6, MMWORD [wk(1)]	; mm6=tmp0H
 	movq	mm2,mm4
 	movq	mm3,mm0
 	paddd	mm4,mm7			; mm4=data1L
 	paddd	mm0,mm6			; mm0=data1H
 	psubd	mm2,mm7			; mm2=data2L
 	psubd	mm3,mm6			; mm3=data2H
 	movq	mm7,[GOTOFF(ebx,PD_DESCALE_P1_4)]	; mm7=[PD_DESCALE_P1_4]
 	paddd	mm4,mm7
 	paddd	mm0,mm7
 	psrad	mm4,DESCALE_P1_4
 	psrad	mm0,DESCALE_P1_4
 	paddd	mm2,mm7
 	paddd	mm3,mm7
 	psrad	mm2,DESCALE_P1_4
 	psrad	mm3,DESCALE_P1_4
 	packssdw  mm4,mm0		; mm4=data1=(10 11 12 13)
 	packssdw  mm2,mm3		; mm2=data2=(20 21 22 23)
 	movq      mm6,mm1		; transpose coefficients(phase 1)
 	punpcklwd mm1,mm4		; mm1=(00 10 01 11)
 	punpckhwd mm6,mm4		; mm6=(02 12 03 13)
 	movq      mm7,mm2		; transpose coefficients(phase 1)
 	punpcklwd mm2,mm5		; mm2=(20 30 21 31)
 	punpckhwd mm7,mm5		; mm7=(22 32 23 33)
 	movq      mm0,mm1		; transpose coefficients(phase 2)
 	punpckldq mm1,mm2		; mm1=(00 10 20 30)
 	punpckhdq mm0,mm2		; mm0=(01 11 21 31)
 	movq      mm3,mm6		; transpose coefficients(phase 2)
 	punpckldq mm6,mm7		; mm6=(02 12 22 32)
 	punpckhdq mm3,mm7		; mm3=(03 13 23 33)
 	movq	MMWORD [MMBLOCK(0,0,edi,SIZEOF_JCOEF)], mm1
 	movq	MMWORD [MMBLOCK(1,0,edi,SIZEOF_JCOEF)], mm0
 	movq	MMWORD [MMBLOCK(2,0,edi,SIZEOF_JCOEF)], mm6
 	movq	MMWORD [MMBLOCK(3,0,edi,SIZEOF_JCOEF)], mm3
 .nextcolumn:
 	add	esi, byte 4*SIZEOF_JCOEF		; coef_block
 	add	edx, byte 4*SIZEOF_ISLOW_MULT_TYPE	; quantptr
 	add	edi, byte 4*DCTSIZE*SIZEOF_JCOEF	; wsptr
 	dec	ecx					; ctr
 	jnz	near .columnloop
 	; ---- Pass 2: process rows from work array, store into output array.
 	mov	eax, [original_ebp]
 	lea	esi, [workspace]			; JCOEF * wsptr
 	mov	edi, JSAMPARRAY [output_buf(eax)]	; (JSAMPROW *)
 	mov	eax, JDIMENSION [output_col(eax)]
 	; -- Odd part
 	movq	mm0, MMWORD [MMBLOCK(1,0,esi,SIZEOF_JCOEF)]
 	movq	mm1, MMWORD [MMBLOCK(3,0,esi,SIZEOF_JCOEF)]
 	movq	mm2, MMWORD [MMBLOCK(5,0,esi,SIZEOF_JCOEF)]
 	movq	mm3, MMWORD [MMBLOCK(7,0,esi,SIZEOF_JCOEF)]
 	movq      mm4,mm0
 	movq      mm5,mm0
 	punpcklwd mm4,mm1
 	punpckhwd mm5,mm1
 	movq      mm0,mm4
 	movq      mm1,mm5
 	pmaddwd   mm4,[GOTOFF(ebx,PW_F256_F089)]	; mm4=(tmp2L)
 	pmaddwd   mm5,[GOTOFF(ebx,PW_F256_F089)]	; mm5=(tmp2H)
 	pmaddwd   mm0,[GOTOFF(ebx,PW_F106_MF217)]	; mm0=(tmp0L)
 	pmaddwd   mm1,[GOTOFF(ebx,PW_F106_MF217)]	; mm1=(tmp0H)
 	movq      mm6,mm2
 	movq      mm7,mm2
 	punpcklwd mm6,mm3
 	punpckhwd mm7,mm3
 	movq      mm2,mm6
 	movq      mm3,mm7
 	pmaddwd   mm6,[GOTOFF(ebx,PW_MF060_MF050)]	; mm6=(tmp2L)
 	pmaddwd   mm7,[GOTOFF(ebx,PW_MF060_MF050)]	; mm7=(tmp2H)
 	pmaddwd   mm2,[GOTOFF(ebx,PW_F145_MF021)]	; mm2=(tmp0L)
 	pmaddwd   mm3,[GOTOFF(ebx,PW_F145_MF021)]	; mm3=(tmp0H)
 	paddd	mm6,mm4			; mm6=tmp2L
 	paddd	mm7,mm5			; mm7=tmp2H
 	paddd	mm2,mm0			; mm2=tmp0L
 	paddd	mm3,mm1			; mm3=tmp0H
 	movq	MMWORD [wk(0)], mm2	; wk(0)=tmp0L
 	movq	MMWORD [wk(1)], mm3	; wk(1)=tmp0H
 	; -- Even part
 	movq	mm4, MMWORD [MMBLOCK(0,0,esi,SIZEOF_JCOEF)]
 	movq	mm5, MMWORD [MMBLOCK(2,0,esi,SIZEOF_JCOEF)]
 	movq	mm0, MMWORD [MMBLOCK(6,0,esi,SIZEOF_JCOEF)]
 	pxor      mm1,mm1
 	pxor      mm2,mm2
 	punpcklwd mm1,mm4		; mm1=tmp0L
 	punpckhwd mm2,mm4		; mm2=tmp0H
 	psrad     mm1,(16-CONST_BITS-1)	; psrad mm1,16 & pslld mm1,CONST_BITS+1
 	psrad     mm2,(16-CONST_BITS-1)	; psrad mm2,16 & pslld mm2,CONST_BITS+1
 	movq      mm3,mm5		; mm5=in2=z2
 	punpcklwd mm5,mm0		; mm0=in6=z3
 	punpckhwd mm3,mm0
 	pmaddwd   mm5,[GOTOFF(ebx,PW_F184_MF076)]	; mm5=tmp2L
 	pmaddwd   mm3,[GOTOFF(ebx,PW_F184_MF076)]	; mm3=tmp2H
 	movq	mm4,mm1
 	movq	mm0,mm2
 	paddd	mm1,mm5			; mm1=tmp10L
 	paddd	mm2,mm3			; mm2=tmp10H
 	psubd	mm4,mm5			; mm4=tmp12L
 	psubd	mm0,mm3			; mm0=tmp12H
 	; -- Final output stage
 	movq	mm5,mm1
 	movq	mm3,mm2
 	paddd	mm1,mm6			; mm1=data0L
 	paddd	mm2,mm7			; mm2=data0H
 	psubd	mm5,mm6			; mm5=data3L
 	psubd	mm3,mm7			; mm3=data3H
 	movq	mm6,[GOTOFF(ebx,PD_DESCALE_P2_4)]	; mm6=[PD_DESCALE_P2_4]
 	paddd	mm1,mm6
 	paddd	mm2,mm6
 	psrad	mm1,DESCALE_P2_4
 	psrad	mm2,DESCALE_P2_4
 	paddd	mm5,mm6
 	paddd	mm3,mm6
 	psrad	mm5,DESCALE_P2_4
 	psrad	mm3,DESCALE_P2_4
 	packssdw  mm1,mm2		; mm1=data0=(00 10 20 30)
 	packssdw  mm5,mm3		; mm5=data3=(03 13 23 33)
 	movq	mm7, MMWORD [wk(0)]	; mm7=tmp0L
 	movq	mm6, MMWORD [wk(1)]	; mm6=tmp0H
 	movq	mm2,mm4
 	movq	mm3,mm0
 	paddd	mm4,mm7			; mm4=data1L
 	paddd	mm0,mm6			; mm0=data1H
 	psubd	mm2,mm7			; mm2=data2L
 	psubd	mm3,mm6			; mm3=data2H
 	movq	mm7,[GOTOFF(ebx,PD_DESCALE_P2_4)]	; mm7=[PD_DESCALE_P2_4]
 	paddd	mm4,mm7
 	paddd	mm0,mm7
 	psrad	mm4,DESCALE_P2_4
 	psrad	mm0,DESCALE_P2_4
 	paddd	mm2,mm7
 	paddd	mm3,mm7
 	psrad	mm2,DESCALE_P2_4
 	psrad	mm3,DESCALE_P2_4
 	packssdw  mm4,mm0		; mm4=data1=(01 11 21 31)
 	packssdw  mm2,mm3		; mm2=data2=(02 12 22 32)
 	movq      mm6,[GOTOFF(ebx,PB_CENTERJSAMP)]	; mm6=[PB_CENTERJSAMP]
 	packsswb  mm1,mm2		; mm1=(00 10 20 30 02 12 22 32)
 	packsswb  mm4,mm5		; mm4=(01 11 21 31 03 13 23 33)
 	paddb     mm1,mm6
 	paddb     mm4,mm6
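 	; Illustrative note (not part of the original source), assuming 8-bit
 	; samples so that CENTERJSAMPLE = 128: packsswb has already clamped
 	; every value to [-128,127], so the wrapping byte add of 0x80 maps that
 	; range onto [0,255].
 	; Examples: 0x80 (-128) -> 0x00,  0x9C (-100) -> 0x1C (28),
 	;           0x7F (127)  -> 0xFF (255)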
 	movq      mm7,mm1		; transpose coefficients(phase 1)
 	punpcklbw mm1,mm4		; mm1=(00 01 10 11 20 21 30 31)
 	punpckhbw mm7,mm4		; mm7=(02 03 12 13 22 23 32 33)
 	movq      mm0,mm1		; transpose coefficients(phase 2)
 	punpcklwd mm1,mm7		; mm1=(00 01 02 03 10 11 12 13)
 	punpckhwd mm0,mm7		; mm0=(20 21 22 23 30 31 32 33)
 	mov	edx, JSAMPROW [edi+0*SIZEOF_JSAMPROW]
 	mov	esi, JSAMPROW [edi+2*SIZEOF_JSAMPROW]
 	movd	DWORD [edx+eax*SIZEOF_JSAMPLE], mm1
 	movd	DWORD [esi+eax*SIZEOF_JSAMPLE], mm0
 	psrlq	mm1,4*BYTE_BIT
 	psrlq	mm0,4*BYTE_BIT
 	mov	edx, JSAMPROW [edi+1*SIZEOF_JSAMPROW]
 	mov	esi, JSAMPROW [edi+3*SIZEOF_JSAMPROW]
 	movd	DWORD [edx+eax*SIZEOF_JSAMPLE], mm1
 	movd	DWORD [esi+eax*SIZEOF_JSAMPLE], mm0
 	emms		; empty MMX state
 	pop	edi
 	pop	esi
 ;	pop	edx		; need not be preserved
 ;	pop	ecx		; need not be preserved
 	poppic	ebx
 	mov	esp,ebp		; esp <- aligned ebp
 	pop	esp		; esp <- original ebp
 	pop	ebp
 	ret
 ; --------------------------------------------------------------------------
 ;
 ; Perform dequantization and inverse DCT on one block of coefficients,
 ; producing a reduced-size 2x2 output block.
 ;
 ; GLOBAL(void)
 ; jpeg_idct_2x2_mmx (j_decompress_ptr cinfo, jpeg_component_info * compptr,
 ;                    JCOEFPTR coef_block,
 ;                    JSAMPARRAY output_buf, JDIMENSION output_col)
 ;
 %define cinfo(b)	(b)+8		; j_decompress_ptr cinfo
 %define compptr(b)	(b)+12		; jpeg_component_info * compptr
 %define coef_block(b)	(b)+16		; JCOEFPTR coef_block
 %define output_buf(b)	(b)+20		; JSAMPARRAY output_buf
 %define output_col(b)	(b)+24		; JDIMENSION output_col
 	align	16
 	global	EXTN(jpeg_idct_2x2_mmx)
 EXTN(jpeg_idct_2x2_mmx):
 	push	ebp
 	mov	ebp,esp
 	push	ebx
 ;	push	ecx		; need not be preserved
 ;	push	edx		; need not be preserved
 	push	esi
 	push	edi
 	get_GOT	ebx		; get GOT address
 	; ---- Pass 1: process columns from input.
 	mov	edx, POINTER [compptr(ebp)]
 	mov	edx, POINTER [jcompinfo_dct_table(edx)]	; quantptr
 	mov	esi, JCOEFPTR [coef_block(ebp)]		; inptr
 	; | input:                  | result:        |
 	; | 00 01 ** 03 ** 05 ** 07 |                |
 	; | 10 11 ** 13 ** 15 ** 17 |                |
 	; | ** ** ** ** ** ** ** ** |                |
 	; | 30 31 ** 33 ** 35 ** 37 | A0 A1 A3 A5 A7 |
 	; | ** ** ** ** ** ** ** ** | B0 B1 B3 B5 B7 |
 	; | 50 51 ** 53 ** 55 ** 57 |                |
 	; | ** ** ** ** ** ** ** ** |                |
 	; | 70 71 ** 73 ** 75 ** 77 |                |
 	; -- Odd part
 	movq	mm0, MMWORD [MMBLOCK(1,0,esi,SIZEOF_JCOEF)]
 	movq	mm1, MMWORD [MMBLOCK(3,0,esi,SIZEOF_JCOEF)]
 	pmullw	mm0, MMWORD [MMBLOCK(1,0,edx,SIZEOF_ISLOW_MULT_TYPE)]
 	pmullw	mm1, MMWORD [MMBLOCK(3,0,edx,SIZEOF_ISLOW_MULT_TYPE)]
 	movq	mm2, MMWORD [MMBLOCK(5,0,esi,SIZEOF_JCOEF)]
 	movq	mm3, MMWORD [MMBLOCK(7,0,esi,SIZEOF_JCOEF)]
 	pmullw	mm2, MMWORD [MMBLOCK(5,0,edx,SIZEOF_ISLOW_MULT_TYPE)]
 	pmullw	mm3, MMWORD [MMBLOCK(7,0,edx,SIZEOF_ISLOW_MULT_TYPE)]
 	; mm0=(10 11 ** 13), mm1=(30 31 ** 33)
 	; mm2=(50 51 ** 53), mm3=(70 71 ** 73)
 	pcmpeqd   mm7,mm7
 	pslld     mm7,WORD_BIT		; mm7={0x0000 0xFFFF 0x0000 0xFFFF}
 	movq      mm4,mm0		; mm4=(10 11 ** 13)
 	movq      mm5,mm2		; mm5=(50 51 ** 53)
 	punpcklwd mm4,mm1		; mm4=(10 30 11 31)
 	punpcklwd mm5,mm3		; mm5=(50 70 51 71)
 	pmaddwd   mm4,[GOTOFF(ebx,PW_F362_MF127)]
 	pmaddwd   mm5,[GOTOFF(ebx,PW_F085_MF072)]
 	psrld	mm0,WORD_BIT		; mm0=(11 -- 13 --)
 	pand	mm1,mm7			; mm1=(-- 31 -- 33)
 	psrld	mm2,WORD_BIT		; mm2=(51 -- 53 --)
 	pand	mm3,mm7			; mm3=(-- 71 -- 73)
 	por	mm0,mm1			; mm0=(11 31 13 33)
 	por	mm2,mm3			; mm2=(51 71 53 73)
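 	; Illustrative note (not part of the original source): with words
 	; written low-to-high, a=(a0 a1 a2 a3) and b=(b0 b1 b2 b3) combine as
 	;   psrld a,WORD_BIT -> (a1  0 a3  0)
 	;   pand  b,mm7      -> ( 0 b1  0 b3)
 	;   por   a,b        -> (a1 b1 a3 b3)
 	; which pairs the odd columns of two rows for the pmaddwd that follows.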
 	pmaddwd	mm0,[GOTOFF(ebx,PW_F362_MF127)]
 	pmaddwd	mm2,[GOTOFF(ebx,PW_F085_MF072)]
 	paddd	mm4,mm5			; mm4=tmp0[col0 col1]
 	movq	mm6, MMWORD [MMBLOCK(1,1,esi,SIZEOF_JCOEF)]
 	movq	mm1, MMWORD [MMBLOCK(3,1,esi,SIZEOF_JCOEF)]
 	pmullw	mm6, MMWORD [MMBLOCK(1,1,edx,SIZEOF_ISLOW_MULT_TYPE)]
 	pmullw	mm1, MMWORD [MMBLOCK(3,1,edx,SIZEOF_ISLOW_MULT_TYPE)]
 	movq	mm3, MMWORD [MMBLOCK(5,1,esi,SIZEOF_JCOEF)]
 	movq	mm5, MMWORD [MMBLOCK(7,1,esi,SIZEOF_JCOEF)]
 	pmullw	mm3, MMWORD [MMBLOCK(5,1,edx,SIZEOF_ISLOW_MULT_TYPE)]
 	pmullw	mm5, MMWORD [MMBLOCK(7,1,edx,SIZEOF_ISLOW_MULT_TYPE)]
 	; mm6=(** 15 ** 17), mm1=(** 35 ** 37)
 	; mm3=(** 55 ** 57), mm5=(** 75 ** 77)
 	psrld	mm6,WORD_BIT		; mm6=(15 -- 17 --)
 	pand	mm1,mm7			; mm1=(-- 35 -- 37)
 	psrld	mm3,WORD_BIT		; mm3=(55 -- 57 --)
 	pand	mm5,mm7			; mm5=(-- 75 -- 77)
 	por	mm6,mm1			; mm6=(15 35 17 37)
 	por	mm3,mm5			; mm3=(55 75 57 77)
 	pmaddwd	mm6,[GOTOFF(ebx,PW_F362_MF127)]
 	pmaddwd	mm3,[GOTOFF(ebx,PW_F085_MF072)]
 	paddd	mm0,mm2			; mm0=tmp0[col1 col3]
 	paddd	mm6,mm3			; mm6=tmp0[col5 col7]
 	; -- Even part
 	movq	mm1, MMWORD [MMBLOCK(0,0,esi,SIZEOF_JCOEF)]
 	movq	mm5, MMWORD [MMBLOCK(0,1,esi,SIZEOF_JCOEF)]
 	pmullw	mm1, MMWORD [MMBLOCK(0,0,edx,SIZEOF_ISLOW_MULT_TYPE)]
 	pmullw	mm5, MMWORD [MMBLOCK(0,1,edx,SIZEOF_ISLOW_MULT_TYPE)]
 	; mm1=(00 01 ** 03), mm5=(** 05 ** 07)
 	movq	mm2,mm1				; mm2=(00 01 ** 03)
 	pslld	mm1,WORD_BIT			; mm1=(-- 00 -- **)
 	psrad	mm1,(WORD_BIT-CONST_BITS-2)	; mm1=tmp10[col0 ****]
 	pand	mm2,mm7				; mm2=(-- 01 -- 03)
 	pand	mm5,mm7				; mm5=(-- 05 -- 07)
 	psrad	mm2,(WORD_BIT-CONST_BITS-2)	; mm2=tmp10[col1 col3]
 	psrad	mm5,(WORD_BIT-CONST_BITS-2)	; mm5=tmp10[col5 col7]
 	; -- Final output stage
 	movq      mm3,mm1
 	paddd     mm1,mm4		; mm1=data0[col0 ****]=(A0 **)
 	psubd     mm3,mm4		; mm3=data1[col0 ****]=(B0 **)
 	punpckldq mm1,mm3		; mm1=(A0 B0)
 	movq	mm7,[GOTOFF(ebx,PD_DESCALE_P1_2)]	; mm7=[PD_DESCALE_P1_2]
 	movq	mm4,mm2
 	movq	mm3,mm5
 	paddd	mm2,mm0			; mm2=data0[col1 col3]=(A1 A3)
 	paddd	mm5,mm6			; mm5=data0[col5 col7]=(A5 A7)
 	psubd	mm4,mm0			; mm4=data1[col1 col3]=(B1 B3)
 	psubd	mm3,mm6			; mm3=data1[col5 col7]=(B5 B7)
 	paddd	mm1,mm7
 	psrad	mm1,DESCALE_P1_2
 	paddd	mm2,mm7
 	paddd	mm5,mm7
 	psrad	mm2,DESCALE_P1_2
 	psrad	mm5,DESCALE_P1_2
 	paddd	mm4,mm7
 	paddd	mm3,mm7
 	psrad	mm4,DESCALE_P1_2
 	psrad	mm3,DESCALE_P1_2
 	; ---- Pass 2: process rows, store into output array.
 	mov	edi, JSAMPARRAY [output_buf(ebp)]	; (JSAMPROW *)
 	mov	eax, JDIMENSION [output_col(ebp)]
 	; | input:| result:|
 	; | A0 B0 |        |
 	; | A1 B1 | C0 C1  |
 	; | A3 B3 | D0 D1  |
 	; | A5 B5 |        |
 	; | A7 B7 |        |
 	; -- Odd part
 	packssdw  mm2,mm4		; mm2=(A1 A3 B1 B3)
 	packssdw  mm5,mm3		; mm5=(A5 A7 B5 B7)
 	pmaddwd   mm2,[GOTOFF(ebx,PW_F362_MF127)]
 	pmaddwd   mm5,[GOTOFF(ebx,PW_F085_MF072)]
 	paddd     mm2,mm5		; mm2=tmp0[row0 row1]
 	; -- Even part
 	pslld     mm1,(CONST_BITS+2)	; mm1=tmp10[row0 row1]
 	; -- Final output stage
 	movq      mm0,[GOTOFF(ebx,PD_DESCALE_P2_2)]	; mm0=[PD_DESCALE_P2_2]
 	movq      mm6,mm1
 	paddd     mm1,mm2		; mm1=data0[row0 row1]=(C0 C1)
 	psubd     mm6,mm2		; mm6=data1[row0 row1]=(D0 D1)
 	paddd     mm1,mm0
 	paddd     mm6,mm0
 	psrad     mm1,DESCALE_P2_2
 	psrad     mm6,DESCALE_P2_2
 	movq      mm7,mm1		; transpose coefficients
 	punpckldq mm1,mm6		; mm1=(C0 D0)
 	punpckhdq mm7,mm6		; mm7=(C1 D1)
 	packssdw  mm1,mm7		; mm1=(C0 D0 C1 D1)
 	packsswb  mm1,mm1		; mm1=(C0 D0 C1 D1 C0 D0 C1 D1)
 	paddb     mm1,[GOTOFF(ebx,PB_CENTERJSAMP)]
 	movd	ecx,mm1
 	movd	ebx,mm1			; ebx=(C0 D0 C1 D1)
 	shr	ecx,2*BYTE_BIT		; ecx=(C1 D1 -- --)
 	mov	edx, JSAMPROW [edi+0*SIZEOF_JSAMPROW]
 	mov	esi, JSAMPROW [edi+1*SIZEOF_JSAMPROW]
 	mov	WORD [edx+eax*SIZEOF_JSAMPLE], bx
 	mov	WORD [esi+eax*SIZEOF_JSAMPLE], cx
 	emms		; empty MMX state
 	pop	edi
 	pop	esi
 ;	pop	edx		; need not be preserved
 ;	pop	ecx		; need not be preserved
 	pop	ebx
 	pop	ebp
 	ret
 %endif ; JIDCT_INT_MMX_SUPPORTED
 %endif ; IDCT_SCALING_SUPPORTED
--- a/jiss2flt.asm
+++ b/jiss2flt.asm
@@ -0,0 +1,508 @@
 ;
 ; jiss2flt.asm - floating-point IDCT (SSE & SSE2)
 ;
 ; x86 SIMD extension for IJG JPEG library
 ; Copyright (C) 1999-2006, MIYASAKA Masaru.
 ; For conditions of distribution and use, see copyright notice in jsimdext.inc
 ;
 ; This file should be assembled with NASM (Netwide Assembler); it can
 ; *not* be assembled with Microsoft's MASM or any compatible assembler
 ; (including Borland's Turbo Assembler).
 ; NASM is available from http://nasm.sourceforge.net/ or
 ; http://sourceforge.net/project/showfiles.php?group_id=6208
 ;
 ; This file contains a floating-point implementation of the inverse DCT
 ; (Discrete Cosine Transform). The following code is based directly on
 ; the IJG's original jidctflt.c; see jidctflt.c for more details.
 ;
 ; Last Modified : February 4, 2006
 ;
 ; [TAB8]
 %include "jsimdext.inc"
 %include "jdct.inc"
 %ifdef DCT_FLOAT_SUPPORTED
 %ifdef JIDCT_FLT_SSE_SSE2_SUPPORTED
 ; This module is specialized to the case DCTSIZE = 8.
 ;
 %if DCTSIZE != 8
 %error "Sorry, this code only copes with 8x8 DCTs."
 %endif
 ; --------------------------------------------------------------------------
 %macro	unpcklps2 2	; %1=(0 1 2 3) / %2=(4 5 6 7) => %1=(0 1 4 5)
 	shufps	%1,%2,0x44
 %endmacro
 %macro	unpckhps2 2	; %1=(0 1 2 3) / %2=(4 5 6 7) => %1=(2 3 6 7)
 	shufps	%1,%2,0xEE
 %endmacro
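 ; Illustrative note (not part of the original source): the shufps immediate
 ; holds four 2-bit selectors, lowest bits first; the two low result elements
 ; are picked from %1 and the two high ones from %2.
 ;   0x44 = 01 00 01 00b  ->  (%1[0] %1[1] %2[0] %2[1])
 ;   0xEE = 11 10 11 10b  ->  (%1[2] %1[3] %2[2] %2[3])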
 ; --------------------------------------------------------------------------
 	SECTION	SEG_CONST
 	alignz	16
 	global	EXTN(jconst_idct_float_sse2)
 EXTN(jconst_idct_float_sse2):
 PD_1_414	times 4 dd  1.414213562373095048801689
 PD_1_847	times 4 dd  1.847759065022573512256366
 PD_1_082	times 4 dd  1.082392200292393968799446
 PD_M2_613	times 4 dd -2.613125929752753055713286
 PD_RNDINT_MAGIC	times 4 dd  100663296.0	; (float)(0x00C00000 << 3)
 PB_CENTERJSAMP	times 16 db CENTERJSAMPLE
 	alignz	16
 ; --------------------------------------------------------------------------
 	SECTION	SEG_TEXT
 	BITS	32
 ;
 ; Perform dequantization and inverse DCT on one block of coefficients.
 ;
 ; GLOBAL(void)
 ; jpeg_idct_float_sse2 (j_decompress_ptr cinfo, jpeg_component_info * compptr,
 ;                       JCOEFPTR coef_block,
 ;                       JSAMPARRAY output_buf, JDIMENSION output_col)
 ;
 %define cinfo(b)	(b)+8		; j_decompress_ptr cinfo
 %define compptr(b)	(b)+12		; jpeg_component_info * compptr
 %define coef_block(b)	(b)+16		; JCOEFPTR coef_block
 %define output_buf(b)	(b)+20		; JSAMPARRAY output_buf
 %define output_col(b)	(b)+24		; JDIMENSION output_col
 %define original_ebp	ebp+0
 %define wk(i)		ebp-(WK_NUM-(i))*SIZEOF_XMMWORD	; xmmword wk[WK_NUM]
 %define WK_NUM		2
 %define workspace	wk(0)-DCTSIZE2*SIZEOF_FAST_FLOAT
 					; FAST_FLOAT workspace[DCTSIZE2]
 	align	16
 	global	EXTN(jpeg_idct_float_sse2)
 EXTN(jpeg_idct_float_sse2):
 	push	ebp
 	mov	eax,esp				; eax = original ebp
 	sub	esp, byte 4
 	and	esp, byte (-SIZEOF_XMMWORD)	; align to 128 bits
 	mov	[esp],eax
 	mov	ebp,esp				; ebp = aligned ebp
 	lea	esp, [workspace]
 	push	ebx
 ;	push	ecx		; need not be preserved
 ;	push	edx		; need not be preserved
 	push	esi
 	push	edi
 	get_GOT	ebx		; get GOT address
 	; ---- Pass 1: process columns from input, store into work array.
 ;	mov	eax, [original_ebp]
 	mov	edx, POINTER [compptr(eax)]
 	mov	edx, POINTER [jcompinfo_dct_table(edx)]	; quantptr
 	mov	esi, JCOEFPTR [coef_block(eax)]		; inptr
 	lea	edi, [workspace]			; FAST_FLOAT * wsptr
 	mov	ecx, DCTSIZE/4				; ctr
 	alignx	16,7
 .columnloop:
 %ifndef NO_ZERO_COLUMN_TEST_FLOAT_SSE
 	mov	eax, DWORD [DWBLOCK(1,0,esi,SIZEOF_JCOEF)]
 	or	eax, DWORD [DWBLOCK(2,0,esi,SIZEOF_JCOEF)]
 	jnz	near .columnDCT
 	movq	xmm1, _MMWORD [MMBLOCK(1,0,esi,SIZEOF_JCOEF)]
 	movq	xmm2, _MMWORD [MMBLOCK(2,0,esi,SIZEOF_JCOEF)]
 	movq	xmm3, _MMWORD [MMBLOCK(3,0,esi,SIZEOF_JCOEF)]
 	movq	xmm4, _MMWORD [MMBLOCK(4,0,esi,SIZEOF_JCOEF)]
 	movq	xmm5, _MMWORD [MMBLOCK(5,0,esi,SIZEOF_JCOEF)]
 	movq	xmm6, _MMWORD [MMBLOCK(6,0,esi,SIZEOF_JCOEF)]
 	movq	xmm7, _MMWORD [MMBLOCK(7,0,esi,SIZEOF_JCOEF)]
 	por	xmm1,xmm2
 	por	xmm3,xmm4
 	por	xmm5,xmm6
 	por	xmm1,xmm3
 	por	xmm5,xmm7
 	por	xmm1,xmm5
 	packsswb xmm1,xmm1
 	movd	eax,xmm1
 	test	eax,eax
 	jnz	short .columnDCT
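 	; (Sketch of the shortcut in scalar terms, following jidctflt.c: when
 	; rows 1..7 are zero for all four columns handled in this iteration,
 	; every output of each column's 1-D IDCT equals the dequantized DC term,
 	;   dcval = inptr[0] * quantptr[0]
 	; so the code below broadcasts dcval into the 8 workspace entries of
 	; each column instead of running the butterfly.  packsswb saturates
 	; rather than truncates, so a nonzero word cannot collapse to a zero
 	; byte in the test above.)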
 	; -- AC terms all zero
 	movq      xmm0, _MMWORD [MMBLOCK(0,0,esi,SIZEOF_JCOEF)]
 	punpcklwd xmm0,xmm0		; xmm0=(00 00 01 01 02 02 03 03)
 	psrad     xmm0,(DWORD_BIT-WORD_BIT)	; xmm0=in0=(00 01 02 03)
 	cvtdq2ps  xmm0,xmm0			; xmm0=in0=(00 01 02 03)
 	mulps	xmm0, XMMWORD [XMMBLOCK(0,0,edx,SIZEOF_FLOAT_MULT_TYPE)]
 	movaps	xmm1,xmm0
 	movaps	xmm2,xmm0
 	movaps	xmm3,xmm0
 	shufps	xmm0,xmm0,0x00			; xmm0=(00 00 00 00)
 	shufps	xmm1,xmm1,0x55			; xmm1=(01 01 01 01)
 	shufps	xmm2,xmm2,0xAA			; xmm2=(02 02 02 02)
 	shufps	xmm3,xmm3,0xFF			; xmm3=(03 03 03 03)
 	movaps	XMMWORD [XMMBLOCK(0,0,edi,SIZEOF_FAST_FLOAT)], xmm0
 	movaps	XMMWORD [XMMBLOCK(0,1,edi,SIZEOF_FAST_FLOAT)], xmm0
 	movaps	XMMWORD [XMMBLOCK(1,0,edi,SIZEOF_FAST_FLOAT)], xmm1
 	movaps	XMMWORD [XMMBLOCK(1,1,edi,SIZEOF_FAST_FLOAT)], xmm1
 	movaps	XMMWORD [XMMBLOCK(2,0,edi,SIZEOF_FAST_FLOAT)], xmm2
 	movaps	XMMWORD [XMMBLOCK(2,1,edi,SIZEOF_FAST_FLOAT)], xmm2
 	movaps	XMMWORD [XMMBLOCK(3,0,edi,SIZEOF_FAST_FLOAT)], xmm3
 	movaps	XMMWORD [XMMBLOCK(3,1,edi,SIZEOF_FAST_FLOAT)], xmm3
 	jmp	near .nextcolumn
 	alignx	16,7
 %endif
 .columnDCT:
 	; -- Even part
 	movq      xmm0, _MMWORD [MMBLOCK(0,0,esi,SIZEOF_JCOEF)]
 	movq      xmm1, _MMWORD [MMBLOCK(2,0,esi,SIZEOF_JCOEF)]
 	movq      xmm2, _MMWORD [MMBLOCK(4,0,esi,SIZEOF_JCOEF)]
 	movq      xmm3, _MMWORD [MMBLOCK(6,0,esi,SIZEOF_JCOEF)]
 	punpcklwd xmm0,xmm0		; xmm0=(00 00 01 01 02 02 03 03)
 	punpcklwd xmm1,xmm1		; xmm1=(20 20 21 21 22 22 23 23)
 	psrad     xmm0,(DWORD_BIT-WORD_BIT)	; xmm0=in0=(00 01 02 03)
 	psrad     xmm1,(DWORD_BIT-WORD_BIT)	; xmm1=in2=(20 21 22 23)
 	cvtdq2ps  xmm0,xmm0			; xmm0=in0=(00 01 02 03)
 	cvtdq2ps  xmm1,xmm1			; xmm1=in2=(20 21 22 23)
 	punpcklwd xmm2,xmm2		; xmm2=(40 40 41 41 42 42 43 43)
 	punpcklwd xmm3,xmm3		; xmm3=(60 60 61 61 62 62 63 63)
 	psrad     xmm2,(DWORD_BIT-WORD_BIT)	; xmm2=in4=(40 41 42 43)
 	psrad     xmm3,(DWORD_BIT-WORD_BIT)	; xmm3=in6=(60 61 62 63)
 	cvtdq2ps  xmm2,xmm2			; xmm2=in4=(40 41 42 43)
 	cvtdq2ps  xmm3,xmm3			; xmm3=in6=(60 61 62 63)
 	mulps     xmm0, XMMWORD [XMMBLOCK(0,0,edx,SIZEOF_FLOAT_MULT_TYPE)]
 	mulps     xmm1, XMMWORD [XMMBLOCK(2,0,edx,SIZEOF_FLOAT_MULT_TYPE)]
 	mulps     xmm2, XMMWORD [XMMBLOCK(4,0,edx,SIZEOF_FLOAT_MULT_TYPE)]
 	mulps     xmm3, XMMWORD [XMMBLOCK(6,0,edx,SIZEOF_FLOAT_MULT_TYPE)]
 	movaps	xmm4,xmm0
 	movaps	xmm5,xmm1
 	subps	xmm0,xmm2		; xmm0=tmp11
 	subps	xmm1,xmm3
 	addps	xmm4,xmm2		; xmm4=tmp10
 	addps	xmm5,xmm3		; xmm5=tmp13
 	mulps	xmm1,[GOTOFF(ebx,PD_1_414)]
 	subps	xmm1,xmm5		; xmm1=tmp12
 	movaps	xmm6,xmm4
 	movaps	xmm7,xmm0
 	subps	xmm4,xmm5		; xmm4=tmp3
 	subps	xmm0,xmm1		; xmm0=tmp2
 	addps	xmm6,xmm5		; xmm6=tmp0
 	addps	xmm7,xmm1		; xmm7=tmp1
 	movaps	XMMWORD [wk(1)], xmm4	; tmp3
 	movaps	XMMWORD [wk(0)], xmm0	; tmp2
 	; -- Odd part
 	movq      xmm2, _MMWORD [MMBLOCK(1,0,esi,SIZEOF_JCOEF)]
 	movq      xmm3, _MMWORD [MMBLOCK(3,0,esi,SIZEOF_JCOEF)]
 	movq      xmm5, _MMWORD [MMBLOCK(5,0,esi,SIZEOF_JCOEF)]
 	movq      xmm1, _MMWORD [MMBLOCK(7,0,esi,SIZEOF_JCOEF)]
 	punpcklwd xmm2,xmm2		; xmm2=(10 10 11 11 12 12 13 13)
 	punpcklwd xmm3,xmm3		; xmm3=(30 30 31 31 32 32 33 33)
 	psrad     xmm2,(DWORD_BIT-WORD_BIT)	; xmm2=in1=(10 11 12 13)
 	psrad     xmm3,(DWORD_BIT-WORD_BIT)	; xmm3=in3=(30 31 32 33)
 	cvtdq2ps  xmm2,xmm2			; xmm2=in1=(10 11 12 13)
 	cvtdq2ps  xmm3,xmm3			; xmm3=in3=(30 31 32 33)
 	punpcklwd xmm5,xmm5		; xmm5=(50 50 51 51 52 52 53 53)
 	punpcklwd xmm1,xmm1		; xmm1=(70 70 71 71 72 72 73 73)
 	psrad     xmm5,(DWORD_BIT-WORD_BIT)	; xmm5=in5=(50 51 52 53)
 	psrad     xmm1,(DWORD_BIT-WORD_BIT)	; xmm1=in7=(70 71 72 73)
 	cvtdq2ps  xmm5,xmm5			; xmm5=in5=(50 51 52 53)
 	cvtdq2ps  xmm1,xmm1			; xmm1=in7=(70 71 72 73)
 	mulps     xmm2, XMMWORD [XMMBLOCK(1,0,edx,SIZEOF_FLOAT_MULT_TYPE)]
 	mulps     xmm3, XMMWORD [XMMBLOCK(3,0,edx,SIZEOF_FLOAT_MULT_TYPE)]
 	mulps     xmm5, XMMWORD [XMMBLOCK(5,0,edx,SIZEOF_FLOAT_MULT_TYPE)]
 	mulps     xmm1, XMMWORD [XMMBLOCK(7,0,edx,SIZEOF_FLOAT_MULT_TYPE)]
 	movaps	xmm4,xmm2
 	movaps	xmm0,xmm5
 	addps	xmm2,xmm1		; xmm2=z11
 	addps	xmm5,xmm3		; xmm5=z13
 	subps	xmm4,xmm1		; xmm4=z12
 	subps	xmm0,xmm3		; xmm0=z10
 	movaps	xmm1,xmm2
 	subps	xmm2,xmm5
 	addps	xmm1,xmm5		; xmm1=tmp7
 	mulps	xmm2,[GOTOFF(ebx,PD_1_414)]	; xmm2=tmp11
 	movaps	xmm3,xmm0
 	addps	xmm0,xmm4
 	mulps	xmm0,[GOTOFF(ebx,PD_1_847)]	; xmm0=z5
 	mulps	xmm3,[GOTOFF(ebx,PD_M2_613)]	; xmm3=(z10 * -2.613125930)
 	mulps	xmm4,[GOTOFF(ebx,PD_1_082)]	; xmm4=(z12 * 1.082392200)
 	addps	xmm3,xmm0		; xmm3=tmp12
 	subps	xmm4,xmm0		; xmm4=tmp10
 	; -- Final output stage
 	subps	xmm3,xmm1		; xmm3=tmp6
 	movaps	xmm5,xmm6
 	movaps	xmm0,xmm7
 	addps	xmm6,xmm1		; xmm6=data0=(00 01 02 03)
 	addps	xmm7,xmm3		; xmm7=data1=(10 11 12 13)
 	subps	xmm5,xmm1		; xmm5=data7=(70 71 72 73)
 	subps	xmm0,xmm3		; xmm0=data6=(60 61 62 63)
 	subps	xmm2,xmm3		; xmm2=tmp5
 	movaps    xmm1,xmm6		; transpose coefficients(phase 1)
 	unpcklps  xmm6,xmm7		; xmm6=(00 10 01 11)
 	unpckhps  xmm1,xmm7		; xmm1=(02 12 03 13)
 	movaps    xmm3,xmm0		; transpose coefficients(phase 1)
 	unpcklps  xmm0,xmm5		; xmm0=(60 70 61 71)
 	unpckhps  xmm3,xmm5		; xmm3=(62 72 63 73)
 	movaps	xmm7, XMMWORD [wk(0)]	; xmm7=tmp2
 	movaps	xmm5, XMMWORD [wk(1)]	; xmm5=tmp3
 	movaps	XMMWORD [wk(0)], xmm0	; wk(0)=(60 70 61 71)
 	movaps	XMMWORD [wk(1)], xmm3	; wk(1)=(62 72 63 73)
 	addps	xmm4,xmm2		; xmm4=tmp4
 	movaps	xmm0,xmm7
 	movaps	xmm3,xmm5
 	addps	xmm7,xmm2		; xmm7=data2=(20 21 22 23)
 	addps	xmm5,xmm4		; xmm5=data4=(40 41 42 43)
 	subps	xmm0,xmm2		; xmm0=data5=(50 51 52 53)
 	subps	xmm3,xmm4		; xmm3=data3=(30 31 32 33)
 	movaps    xmm2,xmm7		; transpose coefficients(phase 1)
 	unpcklps  xmm7,xmm3		; xmm7=(20 30 21 31)
 	unpckhps  xmm2,xmm3		; xmm2=(22 32 23 33)
 	movaps    xmm4,xmm5		; transpose coefficients(phase 1)
 	unpcklps  xmm5,xmm0		; xmm5=(40 50 41 51)
 	unpckhps  xmm4,xmm0		; xmm4=(42 52 43 53)
 	movaps    xmm3,xmm6		; transpose coefficients(phase 2)
 	unpcklps2 xmm6,xmm7		; xmm6=(00 10 20 30)
 	unpckhps2 xmm3,xmm7		; xmm3=(01 11 21 31)
 	movaps    xmm0,xmm1		; transpose coefficients(phase 2)
 	unpcklps2 xmm1,xmm2		; xmm1=(02 12 22 32)
 	unpckhps2 xmm0,xmm2		; xmm0=(03 13 23 33)
 	movaps	xmm7, XMMWORD [wk(0)]	; xmm7=(60 70 61 71)
 	movaps	xmm2, XMMWORD [wk(1)]	; xmm2=(62 72 63 73)
 	movaps	XMMWORD [XMMBLOCK(0,0,edi,SIZEOF_FAST_FLOAT)], xmm6
 	movaps	XMMWORD [XMMBLOCK(1,0,edi,SIZEOF_FAST_FLOAT)], xmm3
 	movaps	XMMWORD [XMMBLOCK(2,0,edi,SIZEOF_FAST_FLOAT)], xmm1
 	movaps	XMMWORD [XMMBLOCK(3,0,edi,SIZEOF_FAST_FLOAT)], xmm0
 	movaps    xmm6,xmm5		; transpose coefficients(phase 2)
 	unpcklps2 xmm5,xmm7		; xmm5=(40 50 60 70)
 	unpckhps2 xmm6,xmm7		; xmm6=(41 51 61 71)
 	movaps    xmm3,xmm4		; transpose coefficients(phase 2)
 	unpcklps2 xmm4,xmm2		; xmm4=(42 52 62 72)
 	unpckhps2 xmm3,xmm2		; xmm3=(43 53 63 73)
 	movaps	XMMWORD [XMMBLOCK(0,1,edi,SIZEOF_FAST_FLOAT)], xmm5
 	movaps	XMMWORD [XMMBLOCK(1,1,edi,SIZEOF_FAST_FLOAT)], xmm6
 	movaps	XMMWORD [XMMBLOCK(2,1,edi,SIZEOF_FAST_FLOAT)], xmm4
 	movaps	XMMWORD [XMMBLOCK(3,1,edi,SIZEOF_FAST_FLOAT)], xmm3
 .nextcolumn:
 	add	esi, byte 4*SIZEOF_JCOEF		; coef_block
 	add	edx, byte 4*SIZEOF_FLOAT_MULT_TYPE	; quantptr
 	add	edi,      4*DCTSIZE*SIZEOF_FAST_FLOAT	; wsptr
 	dec	ecx					; ctr
 	jnz	near .columnloop
 	; -- Prefetch the next coefficient block
 	prefetchnta [esi + (DCTSIZE2-8)*SIZEOF_JCOEF + 0*32]
 	prefetchnta [esi + (DCTSIZE2-8)*SIZEOF_JCOEF + 1*32]
 	prefetchnta [esi + (DCTSIZE2-8)*SIZEOF_JCOEF + 2*32]
 	prefetchnta [esi + (DCTSIZE2-8)*SIZEOF_JCOEF + 3*32]
 	; ---- Pass 2: process rows from work array, store into output array.
 	mov	eax, [original_ebp]
 	lea	esi, [workspace]			; FAST_FLOAT * wsptr
 	mov	edi, JSAMPARRAY [output_buf(eax)]	; (JSAMPROW *)
 	mov	eax, JDIMENSION [output_col(eax)]
 	mov	ecx, DCTSIZE/4				; ctr
 	alignx	16,7
 .rowloop:
 	; -- Even part
 	movaps	xmm0, XMMWORD [XMMBLOCK(0,0,esi,SIZEOF_FAST_FLOAT)]
 	movaps	xmm1, XMMWORD [XMMBLOCK(2,0,esi,SIZEOF_FAST_FLOAT)]
 	movaps	xmm2, XMMWORD [XMMBLOCK(4,0,esi,SIZEOF_FAST_FLOAT)]
 	movaps	xmm3, XMMWORD [XMMBLOCK(6,0,esi,SIZEOF_FAST_FLOAT)]
 	movaps	xmm4,xmm0
 	movaps	xmm5,xmm1
 	subps	xmm0,xmm2		; xmm0=tmp11
 	subps	xmm1,xmm3
 	addps	xmm4,xmm2		; xmm4=tmp10
 	addps	xmm5,xmm3		; xmm5=tmp13
 	mulps	xmm1,[GOTOFF(ebx,PD_1_414)]
 	subps	xmm1,xmm5		; xmm1=tmp12
 	movaps	xmm6,xmm4
 	movaps	xmm7,xmm0
 	subps	xmm4,xmm5		; xmm4=tmp3
 	subps	xmm0,xmm1		; xmm0=tmp2
 	addps	xmm6,xmm5		; xmm6=tmp0
 	addps	xmm7,xmm1		; xmm7=tmp1
 	movaps	XMMWORD [wk(1)], xmm4	; tmp3
 	movaps	XMMWORD [wk(0)], xmm0	; tmp2
 	; -- Odd part
 	movaps	xmm2, XMMWORD [XMMBLOCK(1,0,esi,SIZEOF_FAST_FLOAT)]
 	movaps	xmm3, XMMWORD [XMMBLOCK(3,0,esi,SIZEOF_FAST_FLOAT)]
 	movaps	xmm5, XMMWORD [XMMBLOCK(5,0,esi,SIZEOF_FAST_FLOAT)]
 	movaps	xmm1, XMMWORD [XMMBLOCK(7,0,esi,SIZEOF_FAST_FLOAT)]
 	movaps	xmm4,xmm2
 	movaps	xmm0,xmm5
 	addps	xmm2,xmm1		; xmm2=z11
 	addps	xmm5,xmm3		; xmm5=z13
 	subps	xmm4,xmm1		; xmm4=z12
 	subps	xmm0,xmm3		; xmm0=z10
 	movaps	xmm1,xmm2
 	subps	xmm2,xmm5
 	addps	xmm1,xmm5		; xmm1=tmp7
 	mulps	xmm2,[GOTOFF(ebx,PD_1_414)]	; xmm2=tmp11
 	movaps	xmm3,xmm0
 	addps	xmm0,xmm4
 	mulps	xmm0,[GOTOFF(ebx,PD_1_847)]	; xmm0=z5
 	mulps	xmm3,[GOTOFF(ebx,PD_M2_613)]	; xmm3=(z10 * -2.613125930)
 	mulps	xmm4,[GOTOFF(ebx,PD_1_082)]	; xmm4=(z12 * 1.082392200)
 	addps	xmm3,xmm0		; xmm3=tmp12
 	subps	xmm4,xmm0		; xmm4=tmp10
 	; -- Final output stage
 	subps	xmm3,xmm1		; xmm3=tmp6
 	movaps	xmm5,xmm6
 	movaps	xmm0,xmm7
 	addps	xmm6,xmm1		; xmm6=data0=(00 10 20 30)
 	addps	xmm7,xmm3		; xmm7=data1=(01 11 21 31)
 	subps	xmm5,xmm1		; xmm5=data7=(07 17 27 37)
 	subps	xmm0,xmm3		; xmm0=data6=(06 16 26 36)
 	subps	xmm2,xmm3		; xmm2=tmp5
 	movaps	xmm1,[GOTOFF(ebx,PD_RNDINT_MAGIC)]	; xmm1=[PD_RNDINT_MAGIC]
 	pcmpeqd	xmm3,xmm3
 	psrld	xmm3,WORD_BIT		; xmm3={0xFFFF 0x0000 0xFFFF 0x0000 ..}
 	addps	xmm6,xmm1	; xmm6=roundint(data0/8)=(00 ** 10 ** 20 ** 30 **)
 	addps	xmm7,xmm1	; xmm7=roundint(data1/8)=(01 ** 11 ** 21 ** 31 **)
 	addps	xmm0,xmm1	; xmm0=roundint(data6/8)=(06 ** 16 ** 26 ** 36 **)
 	addps	xmm5,xmm1	; xmm5=roundint(data7/8)=(07 ** 17 ** 27 ** 37 **)
 	pand	xmm6,xmm3		; xmm6=(00 -- 10 -- 20 -- 30 --)
 	pslld	xmm7,WORD_BIT		; xmm7=(-- 01 -- 11 -- 21 -- 31)
 	pand	xmm0,xmm3		; xmm0=(06 -- 16 -- 26 -- 36 --)
 	pslld	xmm5,WORD_BIT		; xmm5=(-- 07 -- 17 -- 27 -- 37)
 	por	xmm6,xmm7		; xmm6=(00 01 10 11 20 21 30 31)
 	por	xmm0,xmm5		; xmm0=(06 07 16 17 26 27 36 37)
 	movaps	xmm1, XMMWORD [wk(0)]	; xmm1=tmp2
 	movaps	xmm3, XMMWORD [wk(1)]	; xmm3=tmp3
 	addps	xmm4,xmm2		; xmm4=tmp4
 	movaps	xmm7,xmm1
 	movaps	xmm5,xmm3
 	addps	xmm1,xmm2		; xmm1=data2=(02 12 22 32)
 	addps	xmm3,xmm4		; xmm3=data4=(04 14 24 34)
 	subps	xmm7,xmm2		; xmm7=data5=(05 15 25 35)
 	subps	xmm5,xmm4		; xmm5=data3=(03 13 23 33)
 	movaps	xmm2,[GOTOFF(ebx,PD_RNDINT_MAGIC)]	; xmm2=[PD_RNDINT_MAGIC]
 	pcmpeqd	xmm4,xmm4
 	psrld	xmm4,WORD_BIT		; xmm4={0xFFFF 0x0000 0xFFFF 0x0000 ..}
 	addps	xmm3,xmm2	; xmm3=roundint(data4/8)=(04 ** 14 ** 24 ** 34 **)
 	addps	xmm7,xmm2	; xmm7=roundint(data5/8)=(05 ** 15 ** 25 ** 35 **)
 	addps	xmm1,xmm2	; xmm1=roundint(data2/8)=(02 ** 12 ** 22 ** 32 **)
 	addps	xmm5,xmm2	; xmm5=roundint(data3/8)=(03 ** 13 ** 23 ** 33 **)
 	pand	xmm3,xmm4		; xmm3=(04 -- 14 -- 24 -- 34 --)
 	pslld	xmm7,WORD_BIT		; xmm7=(-- 05 -- 15 -- 25 -- 35)
 	pand	xmm1,xmm4		; xmm1=(02 -- 12 -- 22 -- 32 --)
 	pslld	xmm5,WORD_BIT		; xmm5=(-- 03 -- 13 -- 23 -- 33)
 	por	xmm3,xmm7		; xmm3=(04 05 14 15 24 25 34 35)
 	por	xmm1,xmm5		; xmm1=(02 03 12 13 22 23 32 33)
 	movdqa    xmm2,[GOTOFF(ebx,PB_CENTERJSAMP)]	; xmm2=[PB_CENTERJSAMP]
 	packsswb  xmm6,xmm3	; xmm6=(00 01 10 11 20 21 30 31 04 05 14 15 24 25 34 35)
 	packsswb  xmm1,xmm0	; xmm1=(02 03 12 13 22 23 32 33 06 07 16 17 26 27 36 37)
 	paddb     xmm6,xmm2
 	paddb     xmm1,xmm2
 	movdqa    xmm4,xmm6	; transpose coefficients(phase 2)
 	punpcklwd xmm6,xmm1	; xmm6=(00 01 02 03 10 11 12 13 20 21 22 23 30 31 32 33)
 	punpckhwd xmm4,xmm1	; xmm4=(04 05 06 07 14 15 16 17 24 25 26 27 34 35 36 37)
 	movdqa    xmm7,xmm6	; transpose coefficients(phase 3)
 	punpckldq xmm6,xmm4	; xmm6=(00 01 02 03 04 05 06 07 10 11 12 13 14 15 16 17)
 	punpckhdq xmm7,xmm4	; xmm7=(20 21 22 23 24 25 26 27 30 31 32 33 34 35 36 37)
 	pshufd	xmm5,xmm6,0x4E	; xmm5=(10 11 12 13 14 15 16 17 00 01 02 03 04 05 06 07)
 	pshufd	xmm3,xmm7,0x4E	; xmm3=(30 31 32 33 34 35 36 37 20 21 22 23 24 25 26 27)
 	pushpic	ebx			; save GOT address
 	mov	edx, JSAMPROW [edi+0*SIZEOF_JSAMPROW]
 	mov	ebx, JSAMPROW [edi+2*SIZEOF_JSAMPROW]
 	movq	_MMWORD [edx+eax*SIZEOF_JSAMPLE], xmm6
 	movq	_MMWORD [ebx+eax*SIZEOF_JSAMPLE], xmm7
 	mov	edx, JSAMPROW [edi+1*SIZEOF_JSAMPROW]
 	mov	ebx, JSAMPROW [edi+3*SIZEOF_JSAMPROW]
 	movq	_MMWORD [edx+eax*SIZEOF_JSAMPLE], xmm5
 	movq	_MMWORD [ebx+eax*SIZEOF_JSAMPLE], xmm3
 	poppic	ebx			; restore GOT address
 	add	esi, byte 4*SIZEOF_FAST_FLOAT	; wsptr
 	add	edi, byte 4*SIZEOF_JSAMPROW
 	dec	ecx				; ctr
 	jnz	near .rowloop
 	pop	edi
 	pop	esi
 ;	pop	edx		; need not be preserved
 ;	pop	ecx		; need not be preserved
 	pop	ebx
 	mov	esp,ebp		; esp <- aligned ebp
 	pop	esp		; esp <- original ebp
 	pop	ebp
 	ret
 %endif ; JIDCT_FLT_SSE_SSE2_SUPPORTED
 %endif ; DCT_FLOAT_SUPPORTED
--- a/jiss2fst.asm
+++ b/jiss2fst.asm
@@ -0,0 +1,512 @@
 ;
 ; jiss2fst.asm - fast integer IDCT (SSE2)
 ;
 ; x86 SIMD extension for IJG JPEG library
 ; Copyright (C) 1999-2006, MIYASAKA Masaru.
 ; For conditions of distribution and use, see copyright notice in jsimdext.inc
 ;
 ; This file should be assembled with NASM (Netwide Assembler); it
 ; can *not* be assembled with Microsoft's MASM or any compatible
 ; assembler (including Borland's Turbo Assembler).
 ; NASM is available from http://nasm.sourceforge.net/ or
 ; http://sourceforge.net/project/showfiles.php?group_id=6208
 ;
 ; This file contains a fast, not so accurate integer implementation of
 ; the inverse DCT (Discrete Cosine Transform). The following code is
 ; based directly on the IJG's original jidctfst.c; see jidctfst.c
 ; for more details.
 ;
 ; Last Modified : February 4, 2006
 ;
 ; [TAB8]
 %include "jsimdext.inc"
 %include "jdct.inc"
 %ifdef DCT_IFAST_SUPPORTED
 %ifdef JIDCT_INT_SSE2_SUPPORTED
 ; This module is specialized to the case DCTSIZE = 8.
 ;
 %if DCTSIZE != 8
 %error "Sorry, this code only copes with 8x8 DCTs."
 %endif
 ; --------------------------------------------------------------------------
 %define CONST_BITS	8	; 14 is also OK.
 %define PASS1_BITS	2
 %if IFAST_SCALE_BITS != PASS1_BITS
 %error "'IFAST_SCALE_BITS' must be equal to 'PASS1_BITS'."
 %endif
 %if CONST_BITS == 8
 F_1_082	equ	277		; FIX(1.082392200)
 F_1_414	equ	362		; FIX(1.414213562)
 F_1_847	equ	473		; FIX(1.847759065)
 F_2_613	equ	669		; FIX(2.613125930)
 F_1_613	equ	(F_2_613 - 256)	; FIX(2.613125930) - FIX(1)
 %else
 ; NASM cannot do compile-time arithmetic on floating-point constants.
 %define	DESCALE(x,n)  (((x)+(1<<((n)-1)))>>(n))
 F_1_082	equ	DESCALE(1162209775,30-CONST_BITS)	; FIX(1.082392200)
 F_1_414	equ	DESCALE(1518500249,30-CONST_BITS)	; FIX(1.414213562)
 F_1_847	equ	DESCALE(1984016188,30-CONST_BITS)	; FIX(1.847759065)
 F_2_613	equ	DESCALE(2805822602,30-CONST_BITS)	; FIX(2.613125930)
 F_1_613	equ	(F_2_613 - (1 << CONST_BITS))	; FIX(2.613125930) - FIX(1)
 %endif
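 ; Worked example for the generic branch (illustrative arithmetic only):
 ; the integer literals are the constants scaled by 2^30, and DESCALE
 ; rounds them to the nearest value with CONST_BITS fractional bits.
 ; With CONST_BITS = 8:
 ;   DESCALE(1162209775, 22) = (1162209775 + 2^21) >> 22 = 277
 ; which agrees with the hand-written F_1_082 above
 ; (1.082392200 * 256 = 277.09).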
 ; --------------------------------------------------------------------------
 	SECTION	SEG_CONST
 ; PRE_MULTIPLY_SCALE_BITS <= 2 (to avoid overflow)
 ; CONST_BITS + CONST_SHIFT + PRE_MULTIPLY_SCALE_BITS == 16 (for pmulhw)
 %define PRE_MULTIPLY_SCALE_BITS   2
 %define CONST_SHIFT     (16 - PRE_MULTIPLY_SCALE_BITS - CONST_BITS)
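 ; Why the three exponents must sum to 16 (a sketch of the arithmetic):
 ; pmulhw returns the high word of a signed 16x16-bit product, that is
 ; (a * b) >> 16.  With a = x << PRE_MULTIPLY_SCALE_BITS and
 ; b = FIX(c) << CONST_SHIFT, where FIX(c) ~= c * 2^CONST_BITS:
 ;   (a * b) >> 16 ~= x * c * 2^(PRE_MULTIPLY_SCALE_BITS + CONST_BITS
 ;                               + CONST_SHIFT - 16) = x * c
 ; Example with the values used here (2, 8, 6): x = 100, c = 1.414213562
 ;   a = 100 << 2 = 400,  b = 362 << 6 = 23168
 ;   (400 * 23168) >> 16 = 9267200 >> 16 = 141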
 	alignz	16
 	global	EXTN(jconst_idct_ifast_sse2)
 EXTN(jconst_idct_ifast_sse2):
 PW_F1414	times 8 dw  F_1_414 << CONST_SHIFT
 PW_F1847	times 8 dw  F_1_847 << CONST_SHIFT
 PW_MF1613	times 8 dw -F_1_613 << CONST_SHIFT
 PW_F1082	times 8 dw  F_1_082 << CONST_SHIFT
 PB_CENTERJSAMP	times 16 db CENTERJSAMPLE
 	alignz	16
 ; --------------------------------------------------------------------------
 	SECTION	SEG_TEXT
 	BITS	32
 ;
 ; Perform dequantization and inverse DCT on one block of coefficients.
 ;
 ; GLOBAL(void)
 ; jpeg_idct_ifast_sse2 (j_decompress_ptr cinfo, jpeg_component_info * compptr,
 ;                       JCOEFPTR coef_block,
 ;                       JSAMPARRAY output_buf, JDIMENSION output_col)
 ;
 %define cinfo(b)	(b)+8		; j_decompress_ptr cinfo
 %define compptr(b)	(b)+12		; jpeg_component_info * compptr
 %define coef_block(b)	(b)+16		; JCOEFPTR coef_block
 %define output_buf(b)	(b)+20		; JSAMPARRAY output_buf
 %define output_col(b)	(b)+24		; JDIMENSION output_col
 %define original_ebp	ebp+0
 %define wk(i)		ebp-(WK_NUM-(i))*SIZEOF_XMMWORD	; xmmword wk[WK_NUM]
 %define WK_NUM		2
 	align	16
 	global	EXTN(jpeg_idct_ifast_sse2)
 EXTN(jpeg_idct_ifast_sse2):
 	push	ebp
 	mov	eax,esp				; eax = original ebp
 	sub	esp, byte 4
 	and	esp, byte (-SIZEOF_XMMWORD)	; align to 128 bits
 	mov	[esp],eax
 	mov	ebp,esp				; ebp = aligned ebp
 	lea	esp, [wk(0)]
 	pushpic	ebx
 ;	push	ecx		; unused
 ;	push	edx		; need not be preserved
 	push	esi
 	push	edi
 	get_GOT	ebx		; get GOT address
 	; ---- Pass 1: process columns from input.
 ;	mov	eax, [original_ebp]
 	mov	edx, POINTER [compptr(eax)]
 	mov	edx, POINTER [jcompinfo_dct_table(edx)]	; quantptr
 	mov	esi, JCOEFPTR [coef_block(eax)]		; inptr
 %ifndef NO_ZERO_COLUMN_TEST_IFAST_SSE2
 	mov	eax, DWORD [DWBLOCK(1,0,esi,SIZEOF_JCOEF)]
 	or	eax, DWORD [DWBLOCK(2,0,esi,SIZEOF_JCOEF)]
 	jnz	near .columnDCT
 	movdqa	xmm0, XMMWORD [XMMBLOCK(1,0,esi,SIZEOF_JCOEF)]
 	movdqa	xmm1, XMMWORD [XMMBLOCK(2,0,esi,SIZEOF_JCOEF)]
 	por	xmm0, XMMWORD [XMMBLOCK(3,0,esi,SIZEOF_JCOEF)]
 	por	xmm1, XMMWORD [XMMBLOCK(4,0,esi,SIZEOF_JCOEF)]
 	por	xmm0, XMMWORD [XMMBLOCK(5,0,esi,SIZEOF_JCOEF)]
 	por	xmm1, XMMWORD [XMMBLOCK(6,0,esi,SIZEOF_JCOEF)]
 	por	xmm0, XMMWORD [XMMBLOCK(7,0,esi,SIZEOF_JCOEF)]
 	por	xmm1,xmm0
 	packsswb xmm1,xmm1
 	packsswb xmm1,xmm1
 	movd	eax,xmm1
 	test	eax,eax
 	jnz	short .columnDCT
 	; -- AC terms all zero
 	movdqa	xmm0, XMMWORD [XMMBLOCK(0,0,esi,SIZEOF_JCOEF)]
 	pmullw	xmm0, XMMWORD [XMMBLOCK(0,0,edx,SIZEOF_IFAST_MULT_TYPE)]
 	movdqa    xmm7,xmm0		; xmm0=in0=(00 01 02 03 04 05 06 07)
 	punpcklwd xmm0,xmm0		; xmm0=(00 00 01 01 02 02 03 03)
 	punpckhwd xmm7,xmm7		; xmm7=(04 04 05 05 06 06 07 07)
 	pshufd	xmm6,xmm0,0x00		; xmm6=col0=(00 00 00 00 00 00 00 00)
 	pshufd	xmm2,xmm0,0x55		; xmm2=col1=(01 01 01 01 01 01 01 01)
 	pshufd	xmm5,xmm0,0xAA		; xmm5=col2=(02 02 02 02 02 02 02 02)
 	pshufd	xmm0,xmm0,0xFF		; xmm0=col3=(03 03 03 03 03 03 03 03)
 	pshufd	xmm1,xmm7,0x00		; xmm1=col4=(04 04 04 04 04 04 04 04)
 	pshufd	xmm4,xmm7,0x55		; xmm4=col5=(05 05 05 05 05 05 05 05)
 	pshufd	xmm3,xmm7,0xAA		; xmm3=col6=(06 06 06 06 06 06 06 06)
 	pshufd	xmm7,xmm7,0xFF		; xmm7=col7=(07 07 07 07 07 07 07 07)
 	movdqa	XMMWORD [wk(0)], xmm2	; wk(0)=col1
 	movdqa	XMMWORD [wk(1)], xmm0	; wk(1)=col3
 	jmp	near .column_end
 	alignx	16,7
 %endif
 .columnDCT:
 	; -- Even part
 	movdqa	xmm0, XMMWORD [XMMBLOCK(0,0,esi,SIZEOF_JCOEF)]
 	movdqa	xmm1, XMMWORD [XMMBLOCK(2,0,esi,SIZEOF_JCOEF)]
 	pmullw	xmm0, XMMWORD [XMMBLOCK(0,0,edx,SIZEOF_IFAST_MULT_TYPE)]
 	pmullw	xmm1, XMMWORD [XMMBLOCK(2,0,edx,SIZEOF_IFAST_MULT_TYPE)]
 	movdqa	xmm2, XMMWORD [XMMBLOCK(4,0,esi,SIZEOF_JCOEF)]
 	movdqa	xmm3, XMMWORD [XMMBLOCK(6,0,esi,SIZEOF_JCOEF)]
 	pmullw	xmm2, XMMWORD [XMMBLOCK(4,0,edx,SIZEOF_IFAST_MULT_TYPE)]
 	pmullw	xmm3, XMMWORD [XMMBLOCK(6,0,edx,SIZEOF_IFAST_MULT_TYPE)]
 	movdqa	xmm4,xmm0
 	movdqa	xmm5,xmm1
 	psubw	xmm0,xmm2		; xmm0=tmp11
 	psubw	xmm1,xmm3
 	paddw	xmm4,xmm2		; xmm4=tmp10
 	paddw	xmm5,xmm3		; xmm5=tmp13
 	psllw	xmm1,PRE_MULTIPLY_SCALE_BITS
 	pmulhw	xmm1,[GOTOFF(ebx,PW_F1414)]
 	psubw	xmm1,xmm5		; xmm1=tmp12
 	movdqa	xmm6,xmm4
 	movdqa	xmm7,xmm0
 	psubw	xmm4,xmm5		; xmm4=tmp3
 	psubw	xmm0,xmm1		; xmm0=tmp2
 	paddw	xmm6,xmm5		; xmm6=tmp0
 	paddw	xmm7,xmm1		; xmm7=tmp1
 	movdqa	XMMWORD [wk(1)], xmm4	; wk(1)=tmp3
 	movdqa	XMMWORD [wk(0)], xmm0	; wk(0)=tmp2
 	; -- Odd part
 	movdqa	xmm2, XMMWORD [XMMBLOCK(1,0,esi,SIZEOF_JCOEF)]
 	movdqa	xmm3, XMMWORD [XMMBLOCK(3,0,esi,SIZEOF_JCOEF)]
 	pmullw	xmm2, XMMWORD [XMMBLOCK(1,0,edx,SIZEOF_IFAST_MULT_TYPE)]
 	pmullw	xmm3, XMMWORD [XMMBLOCK(3,0,edx,SIZEOF_IFAST_MULT_TYPE)]
 	movdqa	xmm5, XMMWORD [XMMBLOCK(5,0,esi,SIZEOF_JCOEF)]
 	movdqa	xmm1, XMMWORD [XMMBLOCK(7,0,esi,SIZEOF_JCOEF)]
 	pmullw	xmm5, XMMWORD [XMMBLOCK(5,0,edx,SIZEOF_IFAST_MULT_TYPE)]
 	pmullw	xmm1, XMMWORD [XMMBLOCK(7,0,edx,SIZEOF_IFAST_MULT_TYPE)]
 	movdqa	xmm4,xmm2
 	movdqa	xmm0,xmm5
 	psubw	xmm2,xmm1		; xmm2=z12
 	psubw	xmm5,xmm3		; xmm5=z10
 	paddw	xmm4,xmm1		; xmm4=z11
 	paddw	xmm0,xmm3		; xmm0=z13
 	movdqa	xmm1,xmm5		; xmm1=z10(unscaled)
 	psllw	xmm2,PRE_MULTIPLY_SCALE_BITS
 	psllw	xmm5,PRE_MULTIPLY_SCALE_BITS
 	movdqa	xmm3,xmm4
 	psubw	xmm4,xmm0
 	paddw	xmm3,xmm0		; xmm3=tmp7
 	psllw	xmm4,PRE_MULTIPLY_SCALE_BITS
 	pmulhw	xmm4,[GOTOFF(ebx,PW_F1414)]	; xmm4=tmp11
 	; To avoid overflow...
 	;
 	; (Original)
 	; tmp12 = -2.613125930 * z10 + z5;
 	;
 	; (This implementation)
 	; tmp12 = (-1.613125930 - 1) * z10 + z5;
 	;       = -1.613125930 * z10 - z10 + z5;
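 	;
 	; Range check behind this split (a sketch, assuming CONST_BITS = 8 and
 	; hence CONST_SHIFT = 6): pmulhw operands are signed words, so a packed
 	; constant must lie in [-32768, 32767].
 	;   F_2_613 << 6 = 669 << 6 = 42816	; does not fit
 	;   F_1_613 << 6 = 413 << 6 = 26432	; fits, stored negated as PW_MF1613
 	; The remaining "- z10" term is applied by the psubw that uses the
 	; unscaled copy of z10 saved before the psllw.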
 	movdqa	xmm0,xmm5
 	paddw	xmm5,xmm2
 	pmulhw	xmm5,[GOTOFF(ebx,PW_F1847)]	; xmm5=z5
 	pmulhw	xmm0,[GOTOFF(ebx,PW_MF1613)]
 	pmulhw	xmm2,[GOTOFF(ebx,PW_F1082)]
 	psubw	xmm0,xmm1
 	psubw	xmm2,xmm5		; xmm2=tmp10
 	paddw	xmm0,xmm5		; xmm0=tmp12
 	; -- Final output stage
 	psubw	xmm0,xmm3		; xmm0=tmp6
 	movdqa	xmm1,xmm6
 	movdqa	xmm5,xmm7
 	paddw	xmm6,xmm3		; xmm6=data0=(00 01 02 03 04 05 06 07)
 	paddw	xmm7,xmm0		; xmm7=data1=(10 11 12 13 14 15 16 17)
 	psubw	xmm1,xmm3		; xmm1=data7=(70 71 72 73 74 75 76 77)
 	psubw	xmm5,xmm0		; xmm5=data6=(60 61 62 63 64 65 66 67)
 	psubw	xmm4,xmm0		; xmm4=tmp5
 	movdqa    xmm3,xmm6		; transpose coefficients(phase 1)
 	punpcklwd xmm6,xmm7		; xmm6=(00 10 01 11 02 12 03 13)
 	punpckhwd xmm3,xmm7		; xmm3=(04 14 05 15 06 16 07 17)
 	movdqa    xmm0,xmm5		; transpose coefficients(phase 1)
 	punpcklwd xmm5,xmm1		; xmm5=(60 70 61 71 62 72 63 73)
 	punpckhwd xmm0,xmm1		; xmm0=(64 74 65 75 66 76 67 77)
 	movdqa	xmm7, XMMWORD [wk(0)]	; xmm7=tmp2
 	movdqa	xmm1, XMMWORD [wk(1)]	; xmm1=tmp3
 	movdqa	XMMWORD [wk(0)], xmm5	; wk(0)=(60 70 61 71 62 72 63 73)
 	movdqa	XMMWORD [wk(1)], xmm0	; wk(1)=(64 74 65 75 66 76 67 77)
 	paddw	xmm2,xmm4		; xmm2=tmp4
 	movdqa	xmm5,xmm7
 	movdqa	xmm0,xmm1
 	paddw	xmm7,xmm4		; xmm7=data2=(20 21 22 23 24 25 26 27)
 	paddw	xmm1,xmm2		; xmm1=data4=(40 41 42 43 44 45 46 47)
 	psubw	xmm5,xmm4		; xmm5=data5=(50 51 52 53 54 55 56 57)
 	psubw	xmm0,xmm2		; xmm0=data3=(30 31 32 33 34 35 36 37)
 	movdqa    xmm4,xmm7		; transpose coefficients(phase 1)
 	punpcklwd xmm7,xmm0		; xmm7=(20 30 21 31 22 32 23 33)
 	punpckhwd xmm4,xmm0		; xmm4=(24 34 25 35 26 36 27 37)
 	movdqa    xmm2,xmm1		; transpose coefficients(phase 1)
 	punpcklwd xmm1,xmm5		; xmm1=(40 50 41 51 42 52 43 53)
 	punpckhwd xmm2,xmm5		; xmm2=(44 54 45 55 46 56 47 57)
 	movdqa    xmm0,xmm3		; transpose coefficients(phase 2)
 	punpckldq xmm3,xmm4		; xmm3=(04 14 24 34 05 15 25 35)
 	punpckhdq xmm0,xmm4		; xmm0=(06 16 26 36 07 17 27 37)
 	movdqa    xmm5,xmm6		; transpose coefficients(phase 2)
 	punpckldq xmm6,xmm7		; xmm6=(00 10 20 30 01 11 21 31)
 	punpckhdq xmm5,xmm7		; xmm5=(02 12 22 32 03 13 23 33)
 	movdqa	xmm4, XMMWORD [wk(0)]	; xmm4=(60 70 61 71 62 72 63 73)
 	movdqa	xmm7, XMMWORD [wk(1)]	; xmm7=(64 74 65 75 66 76 67 77)
 	movdqa	XMMWORD [wk(0)], xmm3	; wk(0)=(04 14 24 34 05 15 25 35)
 	movdqa	XMMWORD [wk(1)], xmm0	; wk(1)=(06 16 26 36 07 17 27 37)
 	movdqa    xmm3,xmm1		; transpose coefficients(phase 2)
 	punpckldq xmm1,xmm4		; xmm1=(40 50 60 70 41 51 61 71)
 	punpckhdq xmm3,xmm4		; xmm3=(42 52 62 72 43 53 63 73)
 	movdqa    xmm0,xmm2		; transpose coefficients(phase 2)
 	punpckldq xmm2,xmm7		; xmm2=(44 54 64 74 45 55 65 75)
 	punpckhdq xmm0,xmm7		; xmm0=(46 56 66 76 47 57 67 77)
 	movdqa     xmm4,xmm6		; transpose coefficients(phase 3)
 	punpcklqdq xmm6,xmm1		; xmm6=col0=(00 10 20 30 40 50 60 70)
 	punpckhqdq xmm4,xmm1		; xmm4=col1=(01 11 21 31 41 51 61 71)
 	movdqa     xmm7,xmm5		; transpose coefficients(phase 3)
 	punpcklqdq xmm5,xmm3		; xmm5=col2=(02 12 22 32 42 52 62 72)
 	punpckhqdq xmm7,xmm3		; xmm7=col3=(03 13 23 33 43 53 63 73)
 	movdqa	xmm1, XMMWORD [wk(0)]	; xmm1=(04 14 24 34 05 15 25 35)
 	movdqa	xmm3, XMMWORD [wk(1)]	; xmm3=(06 16 26 36 07 17 27 37)
 	movdqa	XMMWORD [wk(0)], xmm4	; wk(0)=col1
 	movdqa	XMMWORD [wk(1)], xmm7	; wk(1)=col3
 	movdqa     xmm4,xmm1		; transpose coefficients(phase 3)
 	punpcklqdq xmm1,xmm2		; xmm1=col4=(04 14 24 34 44 54 64 74)
 	punpckhqdq xmm4,xmm2		; xmm4=col5=(05 15 25 35 45 55 65 75)
 	movdqa     xmm7,xmm3		; transpose coefficients(phase 3)
 	punpcklqdq xmm3,xmm0		; xmm3=col6=(06 16 26 36 46 56 66 76)
 	punpckhqdq xmm7,xmm0		; xmm7=col7=(07 17 27 37 47 57 67 77)
 .column_end:
 	; -- Prefetch the next coefficient block
 	prefetchnta [esi + DCTSIZE2*SIZEOF_JCOEF + 0*32]
 	prefetchnta [esi + DCTSIZE2*SIZEOF_JCOEF + 1*32]
 	prefetchnta [esi + DCTSIZE2*SIZEOF_JCOEF + 2*32]
 	prefetchnta [esi + DCTSIZE2*SIZEOF_JCOEF + 3*32]
 	; ---- Pass 2: process rows from work array, store into output array.
 	mov	eax, [original_ebp]
 	mov	edi, JSAMPARRAY [output_buf(eax)]	; (JSAMPROW *)
 	mov	eax, JDIMENSION [output_col(eax)]
 	; -- Even part
 	; xmm6=col0, xmm5=col2, xmm1=col4, xmm3=col6
 	movdqa	xmm2,xmm6
 	movdqa	xmm0,xmm5
 	psubw	xmm6,xmm1		; xmm6=tmp11
 	psubw	xmm5,xmm3
 	paddw	xmm2,xmm1		; xmm2=tmp10
 	paddw	xmm0,xmm3		; xmm0=tmp13
 	psllw	xmm5,PRE_MULTIPLY_SCALE_BITS
 	pmulhw	xmm5,[GOTOFF(ebx,PW_F1414)]
 	psubw	xmm5,xmm0		; xmm5=tmp12
 	movdqa	xmm1,xmm2
 	movdqa	xmm3,xmm6
 	psubw	xmm2,xmm0		; xmm2=tmp3
 	psubw	xmm6,xmm5		; xmm6=tmp2
 	paddw	xmm1,xmm0		; xmm1=tmp0
 	paddw	xmm3,xmm5		; xmm3=tmp1
 	movdqa	xmm0, XMMWORD [wk(0)]	; xmm0=col1
 	movdqa	xmm5, XMMWORD [wk(1)]	; xmm5=col3
 	movdqa	XMMWORD [wk(0)], xmm2	; wk(0)=tmp3
 	movdqa	XMMWORD [wk(1)], xmm6	; wk(1)=tmp2
 	; -- Odd part
 	; xmm0=col1, xmm5=col3, xmm4=col5, xmm7=col7
 	movdqa	xmm2,xmm0
 	movdqa	xmm6,xmm4
 	psubw	xmm0,xmm7		; xmm0=z12
 	psubw	xmm4,xmm5		; xmm4=z10
 	paddw	xmm2,xmm7		; xmm2=z11
 	paddw	xmm6,xmm5		; xmm6=z13
 	movdqa	xmm7,xmm4		; xmm7=z10(unscaled)
 	psllw	xmm0,PRE_MULTIPLY_SCALE_BITS
 	psllw	xmm4,PRE_MULTIPLY_SCALE_BITS
 	movdqa	xmm5,xmm2
 	psubw	xmm2,xmm6
 	paddw	xmm5,xmm6		; xmm5=tmp7
 	psllw	xmm2,PRE_MULTIPLY_SCALE_BITS
 	pmulhw	xmm2,[GOTOFF(ebx,PW_F1414)]	; xmm2=tmp11
 	; To avoid overflow...
 	;
 	; (Original)
 	; tmp12 = -2.613125930 * z10 + z5;
 	;
 	; (This implementation)
 	; tmp12 = (-1.613125930 - 1) * z10 + z5;
 	;       = -1.613125930 * z10 - z10 + z5;
 	movdqa	xmm6,xmm4
 	paddw	xmm4,xmm0
 	pmulhw	xmm4,[GOTOFF(ebx,PW_F1847)]	; xmm4=z5
 	pmulhw	xmm6,[GOTOFF(ebx,PW_MF1613)]
 	pmulhw	xmm0,[GOTOFF(ebx,PW_F1082)]
 	psubw	xmm6,xmm7
 	psubw	xmm0,xmm4		; xmm0=tmp10
 	paddw	xmm6,xmm4		; xmm6=tmp12
 	; -- Final output stage
 	psubw	xmm6,xmm5		; xmm6=tmp6
 	movdqa	xmm7,xmm1
 	movdqa	xmm4,xmm3
 	paddw	xmm1,xmm5		; xmm1=data0=(00 10 20 30 40 50 60 70)
 	paddw	xmm3,xmm6		; xmm3=data1=(01 11 21 31 41 51 61 71)
 	psraw	xmm1,(PASS1_BITS+3)	; descale
 	psraw	xmm3,(PASS1_BITS+3)	; descale
 	psubw	xmm7,xmm5		; xmm7=data7=(07 17 27 37 47 57 67 77)
 	psubw	xmm4,xmm6		; xmm4=data6=(06 16 26 36 46 56 66 76)
 	psraw	xmm7,(PASS1_BITS+3)	; descale
 	psraw	xmm4,(PASS1_BITS+3)	; descale
 	psubw	xmm2,xmm6		; xmm2=tmp5
 	packsswb  xmm1,xmm4	; xmm1=(00 10 20 30 40 50 60 70 06 16 26 36 46 56 66 76)
 	packsswb  xmm3,xmm7	; xmm3=(01 11 21 31 41 51 61 71 07 17 27 37 47 57 67 77)
 	movdqa	xmm5, XMMWORD [wk(1)]	; xmm5=tmp2
 	movdqa	xmm6, XMMWORD [wk(0)]	; xmm6=tmp3
 	paddw	xmm0,xmm2		; xmm0=tmp4
 	movdqa	xmm4,xmm5
 	movdqa	xmm7,xmm6
 	paddw	xmm5,xmm2		; xmm5=data2=(02 12 22 32 42 52 62 72)
 	paddw	xmm6,xmm0		; xmm6=data4=(04 14 24 34 44 54 64 74)
 	psraw	xmm5,(PASS1_BITS+3)	; descale
 	psraw	xmm6,(PASS1_BITS+3)	; descale
 	psubw	xmm4,xmm2		; xmm4=data5=(05 15 25 35 45 55 65 75)
 	psubw	xmm7,xmm0		; xmm7=data3=(03 13 23 33 43 53 63 73)
 	psraw	xmm4,(PASS1_BITS+3)	; descale
 	psraw	xmm7,(PASS1_BITS+3)	; descale
 	movdqa    xmm2,[GOTOFF(ebx,PB_CENTERJSAMP)]	; xmm2=[PB_CENTERJSAMP]
 	packsswb  xmm5,xmm6	; xmm5=(02 12 22 32 42 52 62 72 04 14 24 34 44 54 64 74)
 	packsswb  xmm7,xmm4	; xmm7=(03 13 23 33 43 53 63 73 05 15 25 35 45 55 65 75)
 	paddb     xmm1,xmm2
 	paddb     xmm3,xmm2
 	paddb     xmm5,xmm2
 	paddb     xmm7,xmm2
 	movdqa    xmm0,xmm1	; transpose coefficients(phase 1)
 	punpcklbw xmm1,xmm3	; xmm1=(00 01 10 11 20 21 30 31 40 41 50 51 60 61 70 71)
 	punpckhbw xmm0,xmm3	; xmm0=(06 07 16 17 26 27 36 37 46 47 56 57 66 67 76 77)
 	movdqa    xmm6,xmm5	; transpose coefficients(phase 1)
 	punpcklbw xmm5,xmm7	; xmm5=(02 03 12 13 22 23 32 33 42 43 52 53 62 63 72 73)
 	punpckhbw xmm6,xmm7	; xmm6=(04 05 14 15 24 25 34 35 44 45 54 55 64 65 74 75)
 	movdqa    xmm4,xmm1	; transpose coefficients(phase 2)
 	punpcklwd xmm1,xmm5	; xmm1=(00 01 02 03 10 11 12 13 20 21 22 23 30 31 32 33)
 	punpckhwd xmm4,xmm5	; xmm4=(40 41 42 43 50 51 52 53 60 61 62 63 70 71 72 73)
 	movdqa    xmm2,xmm6	; transpose coefficients(phase 2)
 	punpcklwd xmm6,xmm0	; xmm6=(04 05 06 07 14 15 16 17 24 25 26 27 34 35 36 37)
 	punpckhwd xmm2,xmm0	; xmm2=(44 45 46 47 54 55 56 57 64 65 66 67 74 75 76 77)
 	movdqa    xmm3,xmm1	; transpose coefficients(phase 3)
 	punpckldq xmm1,xmm6	; xmm1=(00 01 02 03 04 05 06 07 10 11 12 13 14 15 16 17)
 	punpckhdq xmm3,xmm6	; xmm3=(20 21 22 23 24 25 26 27 30 31 32 33 34 35 36 37)
 	movdqa    xmm7,xmm4	; transpose coefficients(phase 3)
 	punpckldq xmm4,xmm2	; xmm4=(40 41 42 43 44 45 46 47 50 51 52 53 54 55 56 57)
 	punpckhdq xmm7,xmm2	; xmm7=(60 61 62 63 64 65 66 67 70 71 72 73 74 75 76 77)
 	pshufd	xmm5,xmm1,0x4E	; xmm5=(10 11 12 13 14 15 16 17 00 01 02 03 04 05 06 07)
 	pshufd	xmm0,xmm3,0x4E	; xmm0=(30 31 32 33 34 35 36 37 20 21 22 23 24 25 26 27)
 	pshufd	xmm6,xmm4,0x4E	; xmm6=(50 51 52 53 54 55 56 57 40 41 42 43 44 45 46 47)
 	pshufd	xmm2,xmm7,0x4E	; xmm2=(70 71 72 73 74 75 76 77 60 61 62 63 64 65 66 67)
 	mov	edx, JSAMPROW [edi+0*SIZEOF_JSAMPROW]
 	mov	esi, JSAMPROW [edi+2*SIZEOF_JSAMPROW]
 	movq	_MMWORD [edx+eax*SIZEOF_JSAMPLE], xmm1
 	movq	_MMWORD [esi+eax*SIZEOF_JSAMPLE], xmm3
 	mov	edx, JSAMPROW [edi+4*SIZEOF_JSAMPROW]
 	mov	esi, JSAMPROW [edi+6*SIZEOF_JSAMPROW]
 	movq	_MMWORD [edx+eax*SIZEOF_JSAMPLE], xmm4
 	movq	_MMWORD [esi+eax*SIZEOF_JSAMPLE], xmm7
 	mov	edx, JSAMPROW [edi+1*SIZEOF_JSAMPROW]
 	mov	esi, JSAMPROW [edi+3*SIZEOF_JSAMPROW]
 	movq	_MMWORD [edx+eax*SIZEOF_JSAMPLE], xmm5
 	movq	_MMWORD [esi+eax*SIZEOF_JSAMPLE], xmm0
 	mov	edx, JSAMPROW [edi+5*SIZEOF_JSAMPROW]
 	mov	esi, JSAMPROW [edi+7*SIZEOF_JSAMPROW]
 	movq	_MMWORD [edx+eax*SIZEOF_JSAMPLE], xmm6
 	movq	_MMWORD [esi+eax*SIZEOF_JSAMPLE], xmm2
 	pop	edi
 	pop	esi
 ;	pop	edx		; need not be preserved
 ;	pop	ecx		; unused
 	poppic	ebx
 	mov	esp,ebp		; esp <- aligned ebp
 	pop	esp		; esp <- original ebp
 	pop	ebp
 	ret
 %endif ; JIDCT_INT_SSE2_SUPPORTED
 %endif ; DCT_IFAST_SUPPORTED
--- a/jiss2int.asm
+++ b/jiss2int.asm
@@ -0,0 +1,869 @@
 ;
 ; jiss2int.asm - accurate integer IDCT (SSE2)
 ;
 ; x86 SIMD extension for IJG JPEG library
 ; Copyright (C) 1999-2006, MIYASAKA Masaru.
 ; For conditions of distribution and use, see copyright notice in jsimdext.inc
 ;
 ; This file should be assembled with NASM (Netwide Assembler); it
 ; can *not* be assembled with Microsoft's MASM or any compatible
 ; assembler (including Borland's Turbo Assembler).
 ; NASM is available from http://nasm.sourceforge.net/ or
 ; http://sourceforge.net/project/showfiles.php?group_id=6208
 ;
 ; This file contains a slow-but-accurate integer implementation of the
 ; inverse DCT (Discrete Cosine Transform). The following code is based
 ; directly on the IJG's original jidctint.c; see jidctint.c for
 ; more details.
 ;
 ; Last Modified : February 4, 2006
 ;
 ; [TAB8]
 %include "jsimdext.inc"
 %include "jdct.inc"
 %ifdef DCT_ISLOW_SUPPORTED
 %ifdef JIDCT_INT_SSE2_SUPPORTED
 ; This module is specialized to the case DCTSIZE = 8.
 ;
 %if DCTSIZE != 8
 %error "Sorry, this code only copes with 8x8 DCTs."
 %endif
 ; --------------------------------------------------------------------------
 %define CONST_BITS	13
 %define PASS1_BITS	2
 %define DESCALE_P1	(CONST_BITS-PASS1_BITS)
 %define DESCALE_P2	(CONST_BITS+PASS1_BITS+3)
 %if CONST_BITS == 13
 F_0_298	equ	 2446		; FIX(0.298631336)
 F_0_390	equ	 3196		; FIX(0.390180644)
 F_0_541	equ	 4433		; FIX(0.541196100)
 F_0_765	equ	 6270		; FIX(0.765366865)
 F_0_899	equ	 7373		; FIX(0.899976223)
 F_1_175	equ	 9633		; FIX(1.175875602)
 F_1_501	equ	12299		; FIX(1.501321110)
 F_1_847	equ	15137		; FIX(1.847759065)
 F_1_961	equ	16069		; FIX(1.961570560)
 F_2_053	equ	16819		; FIX(2.053119869)
 F_2_562	equ	20995		; FIX(2.562915447)
 F_3_072	equ	25172		; FIX(3.072711026)
 %else
 ; NASM cannot do compile-time arithmetic on floating-point constants.
 %define DESCALE(x,n)  (((x)+(1<<((n)-1)))>>(n))
 F_0_298	equ	DESCALE( 320652955,30-CONST_BITS)	; FIX(0.298631336)
 F_0_390	equ	DESCALE( 418953276,30-CONST_BITS)	; FIX(0.390180644)
 F_0_541	equ	DESCALE( 581104887,30-CONST_BITS)	; FIX(0.541196100)
 F_0_765	equ	DESCALE( 821806413,30-CONST_BITS)	; FIX(0.765366865)
 F_0_899	equ	DESCALE( 966342111,30-CONST_BITS)	; FIX(0.899976223)
 F_1_175	equ	DESCALE(1262586813,30-CONST_BITS)	; FIX(1.175875602)
 F_1_501	equ	DESCALE(1612031267,30-CONST_BITS)	; FIX(1.501321110)
 F_1_847	equ	DESCALE(1984016188,30-CONST_BITS)	; FIX(1.847759065)
 F_1_961	equ	DESCALE(2106220350,30-CONST_BITS)	; FIX(1.961570560)
 F_2_053	equ	DESCALE(2204520673,30-CONST_BITS)	; FIX(2.053119869)
 F_2_562	equ	DESCALE(2751909506,30-CONST_BITS)	; FIX(2.562915447)
 F_3_072	equ	DESCALE(3299298341,30-CONST_BITS)	; FIX(3.072711026)
 %endif
 ; --------------------------------------------------------------------------
 	SECTION	SEG_CONST
 	alignz	16
 	global	EXTN(jconst_idct_islow_sse2)
 EXTN(jconst_idct_islow_sse2):
 PW_F130_F054	times 4 dw  (F_0_541+F_0_765), F_0_541
 PW_F054_MF130	times 4 dw  F_0_541, (F_0_541-F_1_847)
 PW_MF078_F117	times 4 dw  (F_1_175-F_1_961), F_1_175
 PW_F117_F078	times 4 dw  F_1_175, (F_1_175-F_0_390)
 PW_MF060_MF089	times 4 dw  (F_0_298-F_0_899),-F_0_899
 PW_MF089_F060	times 4 dw -F_0_899, (F_1_501-F_0_899)
 PW_MF050_MF256	times 4 dw  (F_2_053-F_2_562),-F_2_562
 PW_MF256_F050	times 4 dw -F_2_562, (F_3_072-F_2_562)
 PD_DESCALE_P1	times 4 dd  1 << (DESCALE_P1-1)
 PD_DESCALE_P2	times 4 dd  1 << (DESCALE_P2-1)
 PB_CENTERJSAMP	times 16 db CENTERJSAMPLE
 	alignz	16
 ; --------------------------------------------------------------------------
 	SECTION	SEG_TEXT
 	BITS	32
 ;
 ; Perform dequantization and inverse DCT on one block of coefficients.
 ;
 ; GLOBAL(void)
 ; jpeg_idct_islow_sse2 (j_decompress_ptr cinfo, jpeg_component_info * compptr,
 ;                       JCOEFPTR coef_block,
 ;                       JSAMPARRAY output_buf, JDIMENSION output_col)
 ;
 %define cinfo(b)	(b)+8		; j_decompress_ptr cinfo
 %define compptr(b)	(b)+12		; jpeg_component_info * compptr
 %define coef_block(b)	(b)+16		; JCOEFPTR coef_block
 %define output_buf(b)	(b)+20		; JSAMPARRAY output_buf
 %define output_col(b)	(b)+24		; JDIMENSION output_col
 %define original_ebp	ebp+0
 %define wk(i)		ebp-(WK_NUM-(i))*SIZEOF_XMMWORD	; xmmword wk[WK_NUM]
 %define WK_NUM		12
 	align	16
 	global	EXTN(jpeg_idct_islow_sse2)
 EXTN(jpeg_idct_islow_sse2):
 	push	ebp
 	mov	eax,esp				; eax = original ebp
 	sub	esp, byte 4
 	and	esp, byte (-SIZEOF_XMMWORD)	; align to 128 bits
 	mov	[esp],eax
 	mov	ebp,esp				; ebp = aligned ebp
 	lea	esp, [wk(0)]
 	pushpic	ebx
 ;	push	ecx		; unused
 ;	push	edx		; need not be preserved
 	push	esi
 	push	edi
 	get_GOT	ebx		; get GOT address
 	; ---- Pass 1: process columns from input.
 ;	mov	eax, [original_ebp]
 	mov	edx, POINTER [compptr(eax)]
 	mov	edx, POINTER [jcompinfo_dct_table(edx)]	; quantptr
 	mov	esi, JCOEFPTR [coef_block(eax)]		; inptr
 %ifndef NO_ZERO_COLUMN_TEST_ISLOW_SSE2
 	mov	eax, DWORD [DWBLOCK(1,0,esi,SIZEOF_JCOEF)]
 	or	eax, DWORD [DWBLOCK(2,0,esi,SIZEOF_JCOEF)]
 	jnz	near .columnDCT
 	movdqa	xmm0, XMMWORD [XMMBLOCK(1,0,esi,SIZEOF_JCOEF)]
 	movdqa	xmm1, XMMWORD [XMMBLOCK(2,0,esi,SIZEOF_JCOEF)]
 	por	xmm0, XMMWORD [XMMBLOCK(3,0,esi,SIZEOF_JCOEF)]
 	por	xmm1, XMMWORD [XMMBLOCK(4,0,esi,SIZEOF_JCOEF)]
 	por	xmm0, XMMWORD [XMMBLOCK(5,0,esi,SIZEOF_JCOEF)]
 	por	xmm1, XMMWORD [XMMBLOCK(6,0,esi,SIZEOF_JCOEF)]
 	por	xmm0, XMMWORD [XMMBLOCK(7,0,esi,SIZEOF_JCOEF)]
 	por	xmm1,xmm0
 	packsswb xmm1,xmm1
 	packsswb xmm1,xmm1
 	movd	eax,xmm1
 	test	eax,eax
 	jnz	short .columnDCT
 	; -- AC terms all zero
 	movdqa	xmm5, XMMWORD [XMMBLOCK(0,0,esi,SIZEOF_JCOEF)]
 	pmullw	xmm5, XMMWORD [XMMBLOCK(0,0,edx,SIZEOF_ISLOW_MULT_TYPE)]
 	psllw	xmm5,PASS1_BITS
 	movdqa    xmm4,xmm5		; xmm5=in0=(00 01 02 03 04 05 06 07)
 	punpcklwd xmm5,xmm5		; xmm5=(00 00 01 01 02 02 03 03)
 	punpckhwd xmm4,xmm4		; xmm4=(04 04 05 05 06 06 07 07)
 	pshufd	xmm7,xmm5,0x00		; xmm7=col0=(00 00 00 00 00 00 00 00)
 	pshufd	xmm6,xmm5,0x55		; xmm6=col1=(01 01 01 01 01 01 01 01)
 	pshufd	xmm1,xmm5,0xAA		; xmm1=col2=(02 02 02 02 02 02 02 02)
 	pshufd	xmm5,xmm5,0xFF		; xmm5=col3=(03 03 03 03 03 03 03 03)
 	pshufd	xmm0,xmm4,0x00		; xmm0=col4=(04 04 04 04 04 04 04 04)
 	pshufd	xmm3,xmm4,0x55		; xmm3=col5=(05 05 05 05 05 05 05 05)
 	pshufd	xmm2,xmm4,0xAA		; xmm2=col6=(06 06 06 06 06 06 06 06)
 	pshufd	xmm4,xmm4,0xFF		; xmm4=col7=(07 07 07 07 07 07 07 07)
 	movdqa	XMMWORD [wk(8)], xmm6	; wk(8)=col1
 	movdqa	XMMWORD [wk(9)], xmm5	; wk(9)=col3
 	movdqa	XMMWORD [wk(10)], xmm3	; wk(10)=col5
 	movdqa	XMMWORD [wk(11)], xmm4	; wk(11)=col7
 	jmp	near .column_end
 	alignx	16,7
 %endif
 .columnDCT:
 	; -- Even part
 	movdqa	xmm0, XMMWORD [XMMBLOCK(0,0,esi,SIZEOF_JCOEF)]
 	movdqa	xmm1, XMMWORD [XMMBLOCK(2,0,esi,SIZEOF_JCOEF)]
 	pmullw	xmm0, XMMWORD [XMMBLOCK(0,0,edx,SIZEOF_ISLOW_MULT_TYPE)]
 	pmullw	xmm1, XMMWORD [XMMBLOCK(2,0,edx,SIZEOF_ISLOW_MULT_TYPE)]
 	movdqa	xmm2, XMMWORD [XMMBLOCK(4,0,esi,SIZEOF_JCOEF)]
 	movdqa	xmm3, XMMWORD [XMMBLOCK(6,0,esi,SIZEOF_JCOEF)]
 	pmullw	xmm2, XMMWORD [XMMBLOCK(4,0,edx,SIZEOF_ISLOW_MULT_TYPE)]
 	pmullw	xmm3, XMMWORD [XMMBLOCK(6,0,edx,SIZEOF_ISLOW_MULT_TYPE)]
 	; (Original)
 	; z1 = (z2 + z3) * 0.541196100;
 	; tmp2 = z1 + z3 * -1.847759065;
 	; tmp3 = z1 + z2 * 0.765366865;
 	;
 	; (This implementation)
 	; tmp2 = z2 * 0.541196100 + z3 * (0.541196100 - 1.847759065);
 	; tmp3 = z2 * (0.541196100 + 0.765366865) + z3 * 0.541196100;
 	movdqa    xmm4,xmm1		; xmm1=in2=z2
 	movdqa    xmm5,xmm1
 	punpcklwd xmm4,xmm3		; xmm3=in6=z3
 	punpckhwd xmm5,xmm3
 	movdqa    xmm1,xmm4
 	movdqa    xmm3,xmm5
 	pmaddwd   xmm4,[GOTOFF(ebx,PW_F130_F054)]	; xmm4=tmp3L
 	pmaddwd   xmm5,[GOTOFF(ebx,PW_F130_F054)]	; xmm5=tmp3H
 	pmaddwd   xmm1,[GOTOFF(ebx,PW_F054_MF130)]	; xmm1=tmp2L
 	pmaddwd   xmm3,[GOTOFF(ebx,PW_F054_MF130)]	; xmm3=tmp2H
 	movdqa    xmm6,xmm0
 	paddw     xmm0,xmm2		; xmm0=in0+in4
 	psubw     xmm6,xmm2		; xmm6=in0-in4
 	pxor      xmm7,xmm7
 	pxor      xmm2,xmm2
 	punpcklwd xmm7,xmm0		; xmm7=tmp0L
 	punpckhwd xmm2,xmm0		; xmm2=tmp0H
 	psrad     xmm7,(16-CONST_BITS)	; psrad xmm7,16 & pslld xmm7,CONST_BITS
 	psrad     xmm2,(16-CONST_BITS)	; psrad xmm2,16 & pslld xmm2,CONST_BITS
 	movdqa	xmm0,xmm7
 	paddd	xmm7,xmm4		; xmm7=tmp10L
 	psubd	xmm0,xmm4		; xmm0=tmp13L
 	movdqa	xmm4,xmm2
 	paddd	xmm2,xmm5		; xmm2=tmp10H
 	psubd	xmm4,xmm5		; xmm4=tmp13H
 	movdqa	XMMWORD [wk(0)], xmm7	; wk(0)=tmp10L
 	movdqa	XMMWORD [wk(1)], xmm2	; wk(1)=tmp10H
 	movdqa	XMMWORD [wk(2)], xmm0	; wk(2)=tmp13L
 	movdqa	XMMWORD [wk(3)], xmm4	; wk(3)=tmp13H
 	pxor      xmm5,xmm5
 	pxor      xmm7,xmm7
 	punpcklwd xmm5,xmm6		; xmm5=tmp1L
 	punpckhwd xmm7,xmm6		; xmm7=tmp1H
 	psrad     xmm5,(16-CONST_BITS)	; psrad xmm5,16 & pslld xmm5,CONST_BITS
 	psrad     xmm7,(16-CONST_BITS)	; psrad xmm7,16 & pslld xmm7,CONST_BITS
 	movdqa	xmm2,xmm5
 	paddd	xmm5,xmm1		; xmm5=tmp11L
 	psubd	xmm2,xmm1		; xmm2=tmp12L
 	movdqa	xmm0,xmm7
 	paddd	xmm7,xmm3		; xmm7=tmp11H
 	psubd	xmm0,xmm3		; xmm0=tmp12H
 	movdqa	XMMWORD [wk(4)], xmm5	; wk(4)=tmp11L
 	movdqa	XMMWORD [wk(5)], xmm7	; wk(5)=tmp11H
 	movdqa	XMMWORD [wk(6)], xmm2	; wk(6)=tmp12L
 	movdqa	XMMWORD [wk(7)], xmm0	; wk(7)=tmp12H
 	; -- Odd part
 	movdqa	xmm4, XMMWORD [XMMBLOCK(1,0,esi,SIZEOF_JCOEF)]
 	movdqa	xmm6, XMMWORD [XMMBLOCK(3,0,esi,SIZEOF_JCOEF)]
 	pmullw	xmm4, XMMWORD [XMMBLOCK(1,0,edx,SIZEOF_ISLOW_MULT_TYPE)]
 	pmullw	xmm6, XMMWORD [XMMBLOCK(3,0,edx,SIZEOF_ISLOW_MULT_TYPE)]
 	movdqa	xmm1, XMMWORD [XMMBLOCK(5,0,esi,SIZEOF_JCOEF)]
 	movdqa	xmm3, XMMWORD [XMMBLOCK(7,0,esi,SIZEOF_JCOEF)]
 	pmullw	xmm1, XMMWORD [XMMBLOCK(5,0,edx,SIZEOF_ISLOW_MULT_TYPE)]
 	pmullw	xmm3, XMMWORD [XMMBLOCK(7,0,edx,SIZEOF_ISLOW_MULT_TYPE)]
 	movdqa	xmm5,xmm6
 	movdqa	xmm7,xmm4
 	paddw	xmm5,xmm3		; xmm5=z3
 	paddw	xmm7,xmm1		; xmm7=z4
 	; (Original)
 	; z5 = (z3 + z4) * 1.175875602;
 	; z3 = z3 * -1.961570560;  z4 = z4 * -0.390180644;
 	; z3 += z5;  z4 += z5;
 	;
 	; (This implementation)
 	; z3 = z3 * (1.175875602 - 1.961570560) + z4 * 1.175875602;
 	; z4 = z3 * 1.175875602 + z4 * (1.175875602 - 0.390180644);
 	movdqa    xmm2,xmm5
 	movdqa    xmm0,xmm5
 	punpcklwd xmm2,xmm7
 	punpckhwd xmm0,xmm7
 	movdqa    xmm5,xmm2
 	movdqa    xmm7,xmm0
 	pmaddwd   xmm2,[GOTOFF(ebx,PW_MF078_F117)]	; xmm2=z3L
 	pmaddwd   xmm0,[GOTOFF(ebx,PW_MF078_F117)]	; xmm0=z3H
 	pmaddwd   xmm5,[GOTOFF(ebx,PW_F117_F078)]	; xmm5=z4L
 	pmaddwd   xmm7,[GOTOFF(ebx,PW_F117_F078)]	; xmm7=z4H
 	movdqa	XMMWORD [wk(10)], xmm2	; wk(10)=z3L
 	movdqa	XMMWORD [wk(11)], xmm0	; wk(11)=z3H
 	; (Original)
 	; z1 = tmp0 + tmp3;  z2 = tmp1 + tmp2;
 	; tmp0 = tmp0 * 0.298631336;  tmp1 = tmp1 * 2.053119869;
 	; tmp2 = tmp2 * 3.072711026;  tmp3 = tmp3 * 1.501321110;
 	; z1 = z1 * -0.899976223;  z2 = z2 * -2.562915447;
 	; tmp0 += z1 + z3;  tmp1 += z2 + z4;
 	; tmp2 += z2 + z3;  tmp3 += z1 + z4;
 	;
 	; (This implementation)
 	; tmp0 = tmp0 * (0.298631336 - 0.899976223) + tmp3 * -0.899976223;
 	; tmp1 = tmp1 * (2.053119869 - 2.562915447) + tmp2 * -2.562915447;
 	; tmp2 = tmp1 * -2.562915447 + tmp2 * (3.072711026 - 2.562915447);
 	; tmp3 = tmp0 * -0.899976223 + tmp3 * (1.501321110 - 0.899976223);
 	; tmp0 += z3;  tmp1 += z4;
 	; tmp2 += z3;  tmp3 += z4;
 	movdqa    xmm2,xmm3
 	movdqa    xmm0,xmm3
 	punpcklwd xmm2,xmm4
 	punpckhwd xmm0,xmm4
 	movdqa    xmm3,xmm2
 	movdqa    xmm4,xmm0
 	pmaddwd   xmm2,[GOTOFF(ebx,PW_MF060_MF089)]	; xmm2=tmp0L
 	pmaddwd   xmm0,[GOTOFF(ebx,PW_MF060_MF089)]	; xmm0=tmp0H
 	pmaddwd   xmm3,[GOTOFF(ebx,PW_MF089_F060)]	; xmm3=tmp3L
 	pmaddwd   xmm4,[GOTOFF(ebx,PW_MF089_F060)]	; xmm4=tmp3H
 	paddd	xmm2, XMMWORD [wk(10)]	; xmm2=tmp0L
 	paddd	xmm0, XMMWORD [wk(11)]	; xmm0=tmp0H
 	paddd	xmm3,xmm5		; xmm3=tmp3L
 	paddd	xmm4,xmm7		; xmm4=tmp3H
 	movdqa	XMMWORD [wk(8)], xmm2	; wk(8)=tmp0L
 	movdqa	XMMWORD [wk(9)], xmm0	; wk(9)=tmp0H
 	movdqa    xmm2,xmm1
 	movdqa    xmm0,xmm1
 	punpcklwd xmm2,xmm6
 	punpckhwd xmm0,xmm6
 	movdqa    xmm1,xmm2
 	movdqa    xmm6,xmm0
 	pmaddwd   xmm2,[GOTOFF(ebx,PW_MF050_MF256)]	; xmm2=tmp1L
 	pmaddwd   xmm0,[GOTOFF(ebx,PW_MF050_MF256)]	; xmm0=tmp1H
 	pmaddwd   xmm1,[GOTOFF(ebx,PW_MF256_F050)]	; xmm1=tmp2L
 	pmaddwd   xmm6,[GOTOFF(ebx,PW_MF256_F050)]	; xmm6=tmp2H
 	paddd	xmm2,xmm5		; xmm2=tmp1L
 	paddd	xmm0,xmm7		; xmm0=tmp1H
 	paddd	xmm1, XMMWORD [wk(10)]	; xmm1=tmp2L
 	paddd	xmm6, XMMWORD [wk(11)]	; xmm6=tmp2H
 	movdqa	XMMWORD [wk(10)], xmm2	; wk(10)=tmp1L
 	movdqa	XMMWORD [wk(11)], xmm0	; wk(11)=tmp1H
 	; -- Final output stage
 	movdqa	xmm5, XMMWORD [wk(0)]	; xmm5=tmp10L
 	movdqa	xmm7, XMMWORD [wk(1)]	; xmm7=tmp10H
 	movdqa	xmm2,xmm5
 	movdqa	xmm0,xmm7
 	paddd	xmm5,xmm3		; xmm5=data0L
 	paddd	xmm7,xmm4		; xmm7=data0H
 	psubd	xmm2,xmm3		; xmm2=data7L
 	psubd	xmm0,xmm4		; xmm0=data7H
 	movdqa	xmm3,[GOTOFF(ebx,PD_DESCALE_P1)]	; xmm3=[PD_DESCALE_P1]
 	paddd	xmm5,xmm3
 	paddd	xmm7,xmm3
 	psrad	xmm5,DESCALE_P1
 	psrad	xmm7,DESCALE_P1
 	paddd	xmm2,xmm3
 	paddd	xmm0,xmm3
 	psrad	xmm2,DESCALE_P1
 	psrad	xmm0,DESCALE_P1
 	packssdw  xmm5,xmm7		; xmm5=data0=(00 01 02 03 04 05 06 07)
 	packssdw  xmm2,xmm0		; xmm2=data7=(70 71 72 73 74 75 76 77)
 	movdqa	xmm4, XMMWORD [wk(4)]	; xmm4=tmp11L
 	movdqa	xmm3, XMMWORD [wk(5)]	; xmm3=tmp11H
 	movdqa	xmm7,xmm4
 	movdqa	xmm0,xmm3
 	paddd	xmm4,xmm1		; xmm4=data1L
 	paddd	xmm3,xmm6		; xmm3=data1H
 	psubd	xmm7,xmm1		; xmm7=data6L
 	psubd	xmm0,xmm6		; xmm0=data6H
 	movdqa	xmm1,[GOTOFF(ebx,PD_DESCALE_P1)]	; xmm1=[PD_DESCALE_P1]
 	paddd	xmm4,xmm1
 	paddd	xmm3,xmm1
 	psrad	xmm4,DESCALE_P1
 	psrad	xmm3,DESCALE_P1
 	paddd	xmm7,xmm1
 	paddd	xmm0,xmm1
 	psrad	xmm7,DESCALE_P1
 	psrad	xmm0,DESCALE_P1
 	packssdw  xmm4,xmm3		; xmm4=data1=(10 11 12 13 14 15 16 17)
 	packssdw  xmm7,xmm0		; xmm7=data6=(60 61 62 63 64 65 66 67)
 	movdqa    xmm6,xmm5		; transpose coefficients(phase 1)
 	punpcklwd xmm5,xmm4		; xmm5=(00 10 01 11 02 12 03 13)
 	punpckhwd xmm6,xmm4		; xmm6=(04 14 05 15 06 16 07 17)
 	movdqa    xmm1,xmm7		; transpose coefficients(phase 1)
 	punpcklwd xmm7,xmm2		; xmm7=(60 70 61 71 62 72 63 73)
 	punpckhwd xmm1,xmm2		; xmm1=(64 74 65 75 66 76 67 77)
 	movdqa	xmm3, XMMWORD [wk(6)]	; xmm3=tmp12L
 	movdqa	xmm0, XMMWORD [wk(7)]	; xmm0=tmp12H
 	movdqa	xmm4, XMMWORD [wk(10)]	; xmm4=tmp1L
 	movdqa	xmm2, XMMWORD [wk(11)]	; xmm2=tmp1H
 	movdqa	XMMWORD [wk(0)], xmm5	; wk(0)=(00 10 01 11 02 12 03 13)
 	movdqa	XMMWORD [wk(1)], xmm6	; wk(1)=(04 14 05 15 06 16 07 17)
 	movdqa	XMMWORD [wk(4)], xmm7	; wk(4)=(60 70 61 71 62 72 63 73)
 	movdqa	XMMWORD [wk(5)], xmm1	; wk(5)=(64 74 65 75 66 76 67 77)
 	movdqa	xmm5,xmm3
 	movdqa	xmm6,xmm0
 	paddd	xmm3,xmm4		; xmm3=data2L
 	paddd	xmm0,xmm2		; xmm0=data2H
 	psubd	xmm5,xmm4		; xmm5=data5L
 	psubd	xmm6,xmm2		; xmm6=data5H
 	movdqa	xmm7,[GOTOFF(ebx,PD_DESCALE_P1)]	; xmm7=[PD_DESCALE_P1]
 	paddd	xmm3,xmm7
 	paddd	xmm0,xmm7
 	psrad	xmm3,DESCALE_P1
 	psrad	xmm0,DESCALE_P1
 	paddd	xmm5,xmm7
 	paddd	xmm6,xmm7
 	psrad	xmm5,DESCALE_P1
 	psrad	xmm6,DESCALE_P1
 	packssdw  xmm3,xmm0		; xmm3=data2=(20 21 22 23 24 25 26 27)
 	packssdw  xmm5,xmm6		; xmm5=data5=(50 51 52 53 54 55 56 57)
 	movdqa	xmm1, XMMWORD [wk(2)]	; xmm1=tmp13L
 	movdqa	xmm4, XMMWORD [wk(3)]	; xmm4=tmp13H
 	movdqa	xmm2, XMMWORD [wk(8)]	; xmm2=tmp0L
 	movdqa	xmm7, XMMWORD [wk(9)]	; xmm7=tmp0H
 	movdqa	xmm0,xmm1
 	movdqa	xmm6,xmm4
 	paddd	xmm1,xmm2		; xmm1=data3L
 	paddd	xmm4,xmm7		; xmm4=data3H
 	psubd	xmm0,xmm2		; xmm0=data4L
 	psubd	xmm6,xmm7		; xmm6=data4H
 	movdqa	xmm2,[GOTOFF(ebx,PD_DESCALE_P1)]	; xmm2=[PD_DESCALE_P1]
 	paddd	xmm1,xmm2
 	paddd	xmm4,xmm2
 	psrad	xmm1,DESCALE_P1
 	psrad	xmm4,DESCALE_P1
 	paddd	xmm0,xmm2
 	paddd	xmm6,xmm2
 	psrad	xmm0,DESCALE_P1
 	psrad	xmm6,DESCALE_P1
 	packssdw  xmm1,xmm4		; xmm1=data3=(30 31 32 33 34 35 36 37)
 	packssdw  xmm0,xmm6		; xmm0=data4=(40 41 42 43 44 45 46 47)
 	movdqa	xmm7, XMMWORD [wk(0)]	; xmm7=(00 10 01 11 02 12 03 13)
 	movdqa	xmm2, XMMWORD [wk(1)]	; xmm2=(04 14 05 15 06 16 07 17)
 	movdqa    xmm4,xmm3		; transpose coefficients(phase 1)
 	punpcklwd xmm3,xmm1		; xmm3=(20 30 21 31 22 32 23 33)
 	punpckhwd xmm4,xmm1		; xmm4=(24 34 25 35 26 36 27 37)
 	movdqa    xmm6,xmm0		; transpose coefficients(phase 1)
 	punpcklwd xmm0,xmm5		; xmm0=(40 50 41 51 42 52 43 53)
 	punpckhwd xmm6,xmm5		; xmm6=(44 54 45 55 46 56 47 57)
 	movdqa    xmm1,xmm7		; transpose coefficients(phase 2)
 	punpckldq xmm7,xmm3		; xmm7=(00 10 20 30 01 11 21 31)
 	punpckhdq xmm1,xmm3		; xmm1=(02 12 22 32 03 13 23 33)
 	movdqa    xmm5,xmm2		; transpose coefficients(phase 2)
 	punpckldq xmm2,xmm4		; xmm2=(04 14 24 34 05 15 25 35)
 	punpckhdq xmm5,xmm4		; xmm5=(06 16 26 36 07 17 27 37)
 	movdqa	xmm3, XMMWORD [wk(4)]	; xmm3=(60 70 61 71 62 72 63 73)
 	movdqa	xmm4, XMMWORD [wk(5)]	; xmm4=(64 74 65 75 66 76 67 77)
 	movdqa	XMMWORD [wk(6)], xmm2	; wk(6)=(04 14 24 34 05 15 25 35)
 	movdqa	XMMWORD [wk(7)], xmm5	; wk(7)=(06 16 26 36 07 17 27 37)
 	movdqa    xmm2,xmm0		; transpose coefficients(phase 2)
 	punpckldq xmm0,xmm3		; xmm0=(40 50 60 70 41 51 61 71)
 	punpckhdq xmm2,xmm3		; xmm2=(42 52 62 72 43 53 63 73)
 	movdqa    xmm5,xmm6		; transpose coefficients(phase 2)
 	punpckldq xmm6,xmm4		; xmm6=(44 54 64 74 45 55 65 75)
 	punpckhdq xmm5,xmm4		; xmm5=(46 56 66 76 47 57 67 77)
 	movdqa     xmm3,xmm7		; transpose coefficients(phase 3)
 	punpcklqdq xmm7,xmm0		; xmm7=col0=(00 10 20 30 40 50 60 70)
 	punpckhqdq xmm3,xmm0		; xmm3=col1=(01 11 21 31 41 51 61 71)
 	movdqa     xmm4,xmm1		; transpose coefficients(phase 3)
 	punpcklqdq xmm1,xmm2		; xmm1=col2=(02 12 22 32 42 52 62 72)
 	punpckhqdq xmm4,xmm2		; xmm4=col3=(03 13 23 33 43 53 63 73)
 	movdqa	xmm0, XMMWORD [wk(6)]	; xmm0=(04 14 24 34 05 15 25 35)
 	movdqa	xmm2, XMMWORD [wk(7)]	; xmm2=(06 16 26 36 07 17 27 37)
 	movdqa	XMMWORD [wk(8)], xmm3	; wk(8)=col1
 	movdqa	XMMWORD [wk(9)], xmm4	; wk(9)=col3
 	movdqa     xmm3,xmm0		; transpose coefficients(phase 3)
 	punpcklqdq xmm0,xmm6		; xmm0=col4=(04 14 24 34 44 54 64 74)
 	punpckhqdq xmm3,xmm6		; xmm3=col5=(05 15 25 35 45 55 65 75)
 	movdqa     xmm4,xmm2		; transpose coefficients(phase 3)
 	punpcklqdq xmm2,xmm5		; xmm2=col6=(06 16 26 36 46 56 66 76)
 	punpckhqdq xmm4,xmm5		; xmm4=col7=(07 17 27 37 47 57 67 77)
 	movdqa	XMMWORD [wk(10)], xmm3	; wk(10)=col5
 	movdqa	XMMWORD [wk(11)], xmm4	; wk(11)=col7
 .column_end:
 	; -- Prefetch the next coefficient block
 	prefetchnta [esi + DCTSIZE2*SIZEOF_JCOEF + 0*32]
 	prefetchnta [esi + DCTSIZE2*SIZEOF_JCOEF + 1*32]
 	prefetchnta [esi + DCTSIZE2*SIZEOF_JCOEF + 2*32]
 	prefetchnta [esi + DCTSIZE2*SIZEOF_JCOEF + 3*32]
 	; ---- Pass 2: process rows from work array, store into output array.
 	mov	eax, [original_ebp]
 	mov	edi, JSAMPARRAY [output_buf(eax)]	; (JSAMPROW *)
 	mov	eax, JDIMENSION [output_col(eax)]
 	; -- Even part
 	; xmm7=col0, xmm1=col2, xmm0=col4, xmm2=col6
 	; (Original)
 	; z1 = (z2 + z3) * 0.541196100;
 	; tmp2 = z1 + z3 * -1.847759065;
 	; tmp3 = z1 + z2 * 0.765366865;
 	;
 	; (This implementation)
 	; tmp2 = z2 * 0.541196100 + z3 * (0.541196100 - 1.847759065);
 	; tmp3 = z2 * (0.541196100 + 0.765366865) + z3 * 0.541196100;
 	movdqa    xmm6,xmm1		; xmm1=in2=z2
 	movdqa    xmm5,xmm1
 	punpcklwd xmm6,xmm2		; xmm2=in6=z3
 	punpckhwd xmm5,xmm2
 	movdqa    xmm1,xmm6
 	movdqa    xmm2,xmm5
 	pmaddwd   xmm6,[GOTOFF(ebx,PW_F130_F054)]	; xmm6=tmp3L
 	pmaddwd   xmm5,[GOTOFF(ebx,PW_F130_F054)]	; xmm5=tmp3H
 	pmaddwd   xmm1,[GOTOFF(ebx,PW_F054_MF130)]	; xmm1=tmp2L
 	pmaddwd   xmm2,[GOTOFF(ebx,PW_F054_MF130)]	; xmm2=tmp2H
 	movdqa    xmm3,xmm7
 	paddw     xmm7,xmm0		; xmm7=in0+in4
 	psubw     xmm3,xmm0		; xmm3=in0-in4
 	pxor      xmm4,xmm4
 	pxor      xmm0,xmm0
 	punpcklwd xmm4,xmm7		; xmm4=tmp0L
 	punpckhwd xmm0,xmm7		; xmm0=tmp0H
 	psrad     xmm4,(16-CONST_BITS)	; psrad xmm4,16 & pslld xmm4,CONST_BITS
 	psrad     xmm0,(16-CONST_BITS)	; psrad xmm0,16 & pslld xmm0,CONST_BITS
 	movdqa	xmm7,xmm4
 	paddd	xmm4,xmm6		; xmm4=tmp10L
 	psubd	xmm7,xmm6		; xmm7=tmp13L
 	movdqa	xmm6,xmm0
 	paddd	xmm0,xmm5		; xmm0=tmp10H
 	psubd	xmm6,xmm5		; xmm6=tmp13H
 	movdqa	XMMWORD [wk(0)], xmm4	; wk(0)=tmp10L
 	movdqa	XMMWORD [wk(1)], xmm0	; wk(1)=tmp10H
 	movdqa	XMMWORD [wk(2)], xmm7	; wk(2)=tmp13L
 	movdqa	XMMWORD [wk(3)], xmm6	; wk(3)=tmp13H
 	pxor      xmm5,xmm5
 	pxor      xmm4,xmm4
 	punpcklwd xmm5,xmm3		; xmm5=tmp1L
 	punpckhwd xmm4,xmm3		; xmm4=tmp1H
 	psrad     xmm5,(16-CONST_BITS)	; psrad xmm5,16 & pslld xmm5,CONST_BITS
 	psrad     xmm4,(16-CONST_BITS)	; psrad xmm4,16 & pslld xmm4,CONST_BITS
 	movdqa	xmm0,xmm5
 	paddd	xmm5,xmm1		; xmm5=tmp11L
 	psubd	xmm0,xmm1		; xmm0=tmp12L
 	movdqa	xmm7,xmm4
 	paddd	xmm4,xmm2		; xmm4=tmp11H
 	psubd	xmm7,xmm2		; xmm7=tmp12H
 	movdqa	XMMWORD [wk(4)], xmm5	; wk(4)=tmp11L
 	movdqa	XMMWORD [wk(5)], xmm4	; wk(5)=tmp11H
 	movdqa	XMMWORD [wk(6)], xmm0	; wk(6)=tmp12L
 	movdqa	XMMWORD [wk(7)], xmm7	; wk(7)=tmp12H
 	; -- Odd part
 	movdqa	xmm6, XMMWORD [wk(9)]	; xmm6=col3
 	movdqa	xmm3, XMMWORD [wk(8)]	; xmm3=col1
 	movdqa	xmm1, XMMWORD [wk(11)]	; xmm1=col7
 	movdqa	xmm2, XMMWORD [wk(10)]	; xmm2=col5
 	movdqa	xmm5,xmm6
 	movdqa	xmm4,xmm3
 	paddw	xmm5,xmm1		; xmm5=z3
 	paddw	xmm4,xmm2		; xmm4=z4
 	; (Original)
 	; z5 = (z3 + z4) * 1.175875602;
 	; z3 = z3 * -1.961570560;  z4 = z4 * -0.390180644;
 	; z3 += z5;  z4 += z5;
 	;
 	; (This implementation)
 	; z3 = z3 * (1.175875602 - 1.961570560) + z4 * 1.175875602;
 	; z4 = z3 * 1.175875602 + z4 * (1.175875602 - 0.390180644);
 	movdqa    xmm0,xmm5
 	movdqa    xmm7,xmm5
 	punpcklwd xmm0,xmm4
 	punpckhwd xmm7,xmm4
 	movdqa    xmm5,xmm0
 	movdqa    xmm4,xmm7
 	pmaddwd   xmm0,[GOTOFF(ebx,PW_MF078_F117)]	; xmm0=z3L
 	pmaddwd   xmm7,[GOTOFF(ebx,PW_MF078_F117)]	; xmm7=z3H
 	pmaddwd   xmm5,[GOTOFF(ebx,PW_F117_F078)]	; xmm5=z4L
 	pmaddwd   xmm4,[GOTOFF(ebx,PW_F117_F078)]	; xmm4=z4H
 	movdqa	XMMWORD [wk(10)], xmm0	; wk(10)=z3L
 	movdqa	XMMWORD [wk(11)], xmm7	; wk(11)=z3H
 	; (Original)
 	; z1 = tmp0 + tmp3;  z2 = tmp1 + tmp2;
 	; tmp0 = tmp0 * 0.298631336;  tmp1 = tmp1 * 2.053119869;
 	; tmp2 = tmp2 * 3.072711026;  tmp3 = tmp3 * 1.501321110;
 	; z1 = z1 * -0.899976223;  z2 = z2 * -2.562915447;
 	; tmp0 += z1 + z3;  tmp1 += z2 + z4;
 	; tmp2 += z2 + z3;  tmp3 += z1 + z4;
 	;
 	; (This implementation)
 	; tmp0 = tmp0 * (0.298631336 - 0.899976223) + tmp3 * -0.899976223;
 	; tmp1 = tmp1 * (2.053119869 - 2.562915447) + tmp2 * -2.562915447;
 	; tmp2 = tmp1 * -2.562915447 + tmp2 * (3.072711026 - 2.562915447);
 	; tmp3 = tmp0 * -0.899976223 + tmp3 * (1.501321110 - 0.899976223);
 	; tmp0 += z3;  tmp1 += z4;
 	; tmp2 += z3;  tmp3 += z4;
 	movdqa    xmm0,xmm1
 	movdqa    xmm7,xmm1
 	punpcklwd xmm0,xmm3
 	punpckhwd xmm7,xmm3
 	movdqa    xmm1,xmm0
 	movdqa    xmm3,xmm7
 	pmaddwd   xmm0,[GOTOFF(ebx,PW_MF060_MF089)]	; xmm0=tmp0L
 	pmaddwd   xmm7,[GOTOFF(ebx,PW_MF060_MF089)]	; xmm7=tmp0H
 	pmaddwd   xmm1,[GOTOFF(ebx,PW_MF089_F060)]	; xmm1=tmp3L
 	pmaddwd   xmm3,[GOTOFF(ebx,PW_MF089_F060)]	; xmm3=tmp3H
 	paddd	xmm0, XMMWORD [wk(10)]	; xmm0=tmp0L
 	paddd	xmm7, XMMWORD [wk(11)]	; xmm7=tmp0H
 	paddd	xmm1,xmm5		; xmm1=tmp3L
 	paddd	xmm3,xmm4		; xmm3=tmp3H
 	movdqa	XMMWORD [wk(8)], xmm0	; wk(8)=tmp0L
 	movdqa	XMMWORD [wk(9)], xmm7	; wk(9)=tmp0H
 	movdqa    xmm0,xmm2
 	movdqa    xmm7,xmm2
 	punpcklwd xmm0,xmm6
 	punpckhwd xmm7,xmm6
 	movdqa    xmm2,xmm0
 	movdqa    xmm6,xmm7
 	pmaddwd   xmm0,[GOTOFF(ebx,PW_MF050_MF256)]	; xmm0=tmp1L
 	pmaddwd   xmm7,[GOTOFF(ebx,PW_MF050_MF256)]	; xmm7=tmp1H
 	pmaddwd   xmm2,[GOTOFF(ebx,PW_MF256_F050)]	; xmm2=tmp2L
 	pmaddwd   xmm6,[GOTOFF(ebx,PW_MF256_F050)]	; xmm6=tmp2H
 	paddd	xmm0,xmm5		; xmm0=tmp1L
 	paddd	xmm7,xmm4		; xmm7=tmp1H
 	paddd	xmm2, XMMWORD [wk(10)]	; xmm2=tmp2L
 	paddd	xmm6, XMMWORD [wk(11)]	; xmm6=tmp2H
 	movdqa	XMMWORD [wk(10)], xmm0	; wk(10)=tmp1L
 	movdqa	XMMWORD [wk(11)], xmm7	; wk(11)=tmp1H
 	; -- Final output stage
 	movdqa	xmm5, XMMWORD [wk(0)]	; xmm5=tmp10L
 	movdqa	xmm4, XMMWORD [wk(1)]	; xmm4=tmp10H
 	movdqa	xmm0,xmm5
 	movdqa	xmm7,xmm4
 	paddd	xmm5,xmm1		; xmm5=data0L
 	paddd	xmm4,xmm3		; xmm4=data0H
 	psubd	xmm0,xmm1		; xmm0=data7L
 	psubd	xmm7,xmm3		; xmm7=data7H
 	movdqa	xmm1,[GOTOFF(ebx,PD_DESCALE_P2)]	; xmm1=[PD_DESCALE_P2]
 	paddd	xmm5,xmm1
 	paddd	xmm4,xmm1
 	psrad	xmm5,DESCALE_P2
 	psrad	xmm4,DESCALE_P2
 	paddd	xmm0,xmm1
 	paddd	xmm7,xmm1
 	psrad	xmm0,DESCALE_P2
 	psrad	xmm7,DESCALE_P2
 	packssdw  xmm5,xmm4		; xmm5=data0=(00 10 20 30 40 50 60 70)
 	packssdw  xmm0,xmm7		; xmm0=data7=(07 17 27 37 47 57 67 77)
 	movdqa	xmm3, XMMWORD [wk(4)]	; xmm3=tmp11L
 	movdqa	xmm1, XMMWORD [wk(5)]	; xmm1=tmp11H
 	movdqa	xmm4,xmm3
 	movdqa	xmm7,xmm1
 	paddd	xmm3,xmm2		; xmm3=data1L
 	paddd	xmm1,xmm6		; xmm1=data1H
 	psubd	xmm4,xmm2		; xmm4=data6L
 	psubd	xmm7,xmm6		; xmm7=data6H
 	movdqa	xmm2,[GOTOFF(ebx,PD_DESCALE_P2)]	; xmm2=[PD_DESCALE_P2]
 	paddd	xmm3,xmm2
 	paddd	xmm1,xmm2
 	psrad	xmm3,DESCALE_P2
 	psrad	xmm1,DESCALE_P2
 	paddd	xmm4,xmm2
 	paddd	xmm7,xmm2
 	psrad	xmm4,DESCALE_P2
 	psrad	xmm7,DESCALE_P2
 	packssdw  xmm3,xmm1		; xmm3=data1=(01 11 21 31 41 51 61 71)
 	packssdw  xmm4,xmm7		; xmm4=data6=(06 16 26 36 46 56 66 76)
 	packsswb  xmm5,xmm4		; xmm5=(00 10 20 30 40 50 60 70 06 16 26 36 46 56 66 76)
 	packsswb  xmm3,xmm0		; xmm3=(01 11 21 31 41 51 61 71 07 17 27 37 47 57 67 77)
 	movdqa	xmm6, XMMWORD [wk(6)]	; xmm6=tmp12L
 	movdqa	xmm2, XMMWORD [wk(7)]	; xmm2=tmp12H
 	movdqa	xmm1, XMMWORD [wk(10)]	; xmm1=tmp1L
 	movdqa	xmm7, XMMWORD [wk(11)]	; xmm7=tmp1H
 	movdqa	XMMWORD [wk(0)], xmm5	; wk(0)=(00 10 20 30 40 50 60 70 06 16 26 36 46 56 66 76)
 	movdqa	XMMWORD [wk(1)], xmm3	; wk(1)=(01 11 21 31 41 51 61 71 07 17 27 37 47 57 67 77)
 	movdqa	xmm4,xmm6
 	movdqa	xmm0,xmm2
 	paddd	xmm6,xmm1		; xmm6=data2L
 	paddd	xmm2,xmm7		; xmm2=data2H
 	psubd	xmm4,xmm1		; xmm4=data5L
 	psubd	xmm0,xmm7		; xmm0=data5H
 	movdqa	xmm5,[GOTOFF(ebx,PD_DESCALE_P2)]	; xmm5=[PD_DESCALE_P2]
 	paddd	xmm6,xmm5
 	paddd	xmm2,xmm5
 	psrad	xmm6,DESCALE_P2
 	psrad	xmm2,DESCALE_P2
 	paddd	xmm4,xmm5
 	paddd	xmm0,xmm5
 	psrad	xmm4,DESCALE_P2
 	psrad	xmm0,DESCALE_P2
 	packssdw  xmm6,xmm2		; xmm6=data2=(02 12 22 32 42 52 62 72)
 	packssdw  xmm4,xmm0		; xmm4=data5=(05 15 25 35 45 55 65 75)
 	movdqa	xmm3, XMMWORD [wk(2)]	; xmm3=tmp13L
 	movdqa	xmm1, XMMWORD [wk(3)]	; xmm1=tmp13H
 	movdqa	xmm7, XMMWORD [wk(8)]	; xmm7=tmp0L
 	movdqa	xmm5, XMMWORD [wk(9)]	; xmm5=tmp0H
 	movdqa	xmm2,xmm3
 	movdqa	xmm0,xmm1
 	paddd	xmm3,xmm7		; xmm3=data3L
 	paddd	xmm1,xmm5		; xmm1=data3H
 	psubd	xmm2,xmm7		; xmm2=data4L
 	psubd	xmm0,xmm5		; xmm0=data4H
 	movdqa	xmm7,[GOTOFF(ebx,PD_DESCALE_P2)]	; xmm7=[PD_DESCALE_P2]
 	paddd	xmm3,xmm7
 	paddd	xmm1,xmm7
 	psrad	xmm3,DESCALE_P2
 	psrad	xmm1,DESCALE_P2
 	paddd	xmm2,xmm7
 	paddd	xmm0,xmm7
 	psrad	xmm2,DESCALE_P2
 	psrad	xmm0,DESCALE_P2
 	movdqa    xmm5,[GOTOFF(ebx,PB_CENTERJSAMP)]	; xmm5=[PB_CENTERJSAMP]
 	packssdw  xmm3,xmm1		; xmm3=data3=(03 13 23 33 43 53 63 73)
 	packssdw  xmm2,xmm0		; xmm2=data4=(04 14 24 34 44 54 64 74)
 	movdqa    xmm7, XMMWORD [wk(0)]	; xmm7=(00 10 20 30 40 50 60 70 06 16 26 36 46 56 66 76)
 	movdqa    xmm1, XMMWORD [wk(1)]	; xmm1=(01 11 21 31 41 51 61 71 07 17 27 37 47 57 67 77)
 	packsswb  xmm6,xmm2		; xmm6=(02 12 22 32 42 52 62 72 04 14 24 34 44 54 64 74)
 	packsswb  xmm3,xmm4		; xmm3=(03 13 23 33 43 53 63 73 05 15 25 35 45 55 65 75)
 	paddb     xmm7,xmm5
 	paddb     xmm1,xmm5
 	paddb     xmm6,xmm5
 	paddb     xmm3,xmm5
 	movdqa    xmm0,xmm7	; transpose coefficients(phase 1)
 	punpcklbw xmm7,xmm1	; xmm7=(00 01 10 11 20 21 30 31 40 41 50 51 60 61 70 71)
 	punpckhbw xmm0,xmm1	; xmm0=(06 07 16 17 26 27 36 37 46 47 56 57 66 67 76 77)
 	movdqa    xmm2,xmm6	; transpose coefficients(phase 1)
 	punpcklbw xmm6,xmm3	; xmm6=(02 03 12 13 22 23 32 33 42 43 52 53 62 63 72 73)
 	punpckhbw xmm2,xmm3	; xmm2=(04 05 14 15 24 25 34 35 44 45 54 55 64 65 74 75)
 	movdqa    xmm4,xmm7	; transpose coefficients(phase 2)
 	punpcklwd xmm7,xmm6	; xmm7=(00 01 02 03 10 11 12 13 20 21 22 23 30 31 32 33)
 	punpckhwd xmm4,xmm6	; xmm4=(40 41 42 43 50 51 52 53 60 61 62 63 70 71 72 73)
 	movdqa    xmm5,xmm2	; transpose coefficients(phase 2)
 	punpcklwd xmm2,xmm0	; xmm2=(04 05 06 07 14 15 16 17 24 25 26 27 34 35 36 37)
 	punpckhwd xmm5,xmm0	; xmm5=(44 45 46 47 54 55 56 57 64 65 66 67 74 75 76 77)
 	movdqa    xmm1,xmm7	; transpose coefficients(phase 3)
 	punpckldq xmm7,xmm2	; xmm7=(00 01 02 03 04 05 06 07 10 11 12 13 14 15 16 17)
 	punpckhdq xmm1,xmm2	; xmm1=(20 21 22 23 24 25 26 27 30 31 32 33 34 35 36 37)
 	movdqa    xmm3,xmm4	; transpose coefficients(phase 3)
 	punpckldq xmm4,xmm5	; xmm4=(40 41 42 43 44 45 46 47 50 51 52 53 54 55 56 57)
 	punpckhdq xmm3,xmm5	; xmm3=(60 61 62 63 64 65 66 67 70 71 72 73 74 75 76 77)
 	pshufd	xmm6,xmm7,0x4E	; xmm6=(10 11 12 13 14 15 16 17 00 01 02 03 04 05 06 07)
 	pshufd	xmm0,xmm1,0x4E	; xmm0=(30 31 32 33 34 35 36 37 20 21 22 23 24 25 26 27)
 	pshufd	xmm2,xmm4,0x4E	; xmm2=(50 51 52 53 54 55 56 57 40 41 42 43 44 45 46 47)
 	pshufd	xmm5,xmm3,0x4E	; xmm5=(70 71 72 73 74 75 76 77 60 61 62 63 64 65 66 67)
 	mov	edx, JSAMPROW [edi+0*SIZEOF_JSAMPROW]
 	mov	esi, JSAMPROW [edi+2*SIZEOF_JSAMPROW]
 	movq	_MMWORD [edx+eax*SIZEOF_JSAMPLE], xmm7
 	movq	_MMWORD [esi+eax*SIZEOF_JSAMPLE], xmm1
 	mov	edx, JSAMPROW [edi+4*SIZEOF_JSAMPROW]
 	mov	esi, JSAMPROW [edi+6*SIZEOF_JSAMPROW]
 	movq	_MMWORD [edx+eax*SIZEOF_JSAMPLE], xmm4
 	movq	_MMWORD [esi+eax*SIZEOF_JSAMPLE], xmm3
 	mov	edx, JSAMPROW [edi+1*SIZEOF_JSAMPROW]
 	mov	esi, JSAMPROW [edi+3*SIZEOF_JSAMPROW]
 	movq	_MMWORD [edx+eax*SIZEOF_JSAMPLE], xmm6
 	movq	_MMWORD [esi+eax*SIZEOF_JSAMPLE], xmm0
 	mov	edx, JSAMPROW [edi+5*SIZEOF_JSAMPROW]
 	mov	esi, JSAMPROW [edi+7*SIZEOF_JSAMPROW]
 	movq	_MMWORD [edx+eax*SIZEOF_JSAMPLE], xmm2
 	movq	_MMWORD [esi+eax*SIZEOF_JSAMPLE], xmm5
 	pop	edi
 	pop	esi
 ;	pop	edx		; need not be preserved
 ;	pop	ecx		; unused
 	poppic	ebx
 	mov	esp,ebp		; esp <- aligned ebp
 	pop	esp		; esp <- original ebp
 	pop	ebp
 	ret
 %endif ; JIDCT_INT_SSE2_SUPPORTED
 %endif ; DCT_ISLOW_SUPPORTED
--- a/jiss2red.asm
+++ b/jiss2red.asm
@@ -0,0 +1,607 @@
 ;
 ; jiss2red.asm - reduced-size IDCT (SSE2)
 ;
 ; x86 SIMD extension for IJG JPEG library
 ; Copyright (C) 1999-2006, MIYASAKA Masaru.
 ; For conditions of distribution and use, see copyright notice in jsimdext.inc
 ;
 ; This file should be assembled with NASM (Netwide Assembler); it
 ; can *not* be assembled with Microsoft's MASM or any compatible
 ; assembler (including Borland's Turbo Assembler).
 ; NASM is available from http://nasm.sourceforge.net/ or
 ; http://sourceforge.net/project/showfiles.php?group_id=6208
 ;
 ; This file contains inverse-DCT routines that produce reduced-size
 ; output: either 4x4 or 2x2 pixels from an 8x8 DCT block.
 ; The following code is based directly on the IJG's original jidctred.c;
 ; see jidctred.c for more details.
 ;
 ; Last Modified : February 4, 2006
 ;
 ; [TAB8]
 %include "jsimdext.inc"
 %include "jdct.inc"
 %ifdef IDCT_SCALING_SUPPORTED
 %ifdef JIDCT_INT_SSE2_SUPPORTED
 ; This module is specialized to the case DCTSIZE = 8.
 ;
 %if DCTSIZE != 8
 %error "Sorry, this code only copes with 8x8 DCTs."
 %endif
 ; --------------------------------------------------------------------------
 %define CONST_BITS	13
 %define PASS1_BITS	2
 %define DESCALE_P1_4	(CONST_BITS-PASS1_BITS+1)
 %define DESCALE_P2_4	(CONST_BITS+PASS1_BITS+3+1)
 %define DESCALE_P1_2	(CONST_BITS-PASS1_BITS+2)
 %define DESCALE_P2_2	(CONST_BITS+PASS1_BITS+3+2)
 %if CONST_BITS == 13
 F_0_211	equ	 1730		; FIX(0.211164243)
 F_0_509	equ	 4176		; FIX(0.509795579)
 F_0_601	equ	 4926		; FIX(0.601344887)
 F_0_720	equ	 5906		; FIX(0.720959822)
 F_0_765	equ	 6270		; FIX(0.765366865)
 F_0_850	equ	 6967		; FIX(0.850430095)
 F_0_899	equ	 7373		; FIX(0.899976223)
 F_1_061	equ	 8697		; FIX(1.061594337)
 F_1_272	equ	10426		; FIX(1.272758580)
 F_1_451	equ	11893		; FIX(1.451774981)
 F_1_847	equ	15137		; FIX(1.847759065)
 F_2_172	equ	17799		; FIX(2.172734803)
 F_2_562	equ	20995		; FIX(2.562915447)
 F_3_624	equ	29692		; FIX(3.624509785)
 %else
 ; NASM cannot do compile-time arithmetic on floating-point constants.
 %define DESCALE(x,n)  (((x)+(1<<((n)-1)))>>(n))
 F_0_211	equ	DESCALE( 226735879,30-CONST_BITS)	; FIX(0.211164243)
 F_0_509	equ	DESCALE( 547388834,30-CONST_BITS)	; FIX(0.509795579)
 F_0_601	equ	DESCALE( 645689155,30-CONST_BITS)	; FIX(0.601344887)
 F_0_720	equ	DESCALE( 774124714,30-CONST_BITS)	; FIX(0.720959822)
 F_0_765	equ	DESCALE( 821806413,30-CONST_BITS)	; FIX(0.765366865)
 F_0_850	equ	DESCALE( 913142361,30-CONST_BITS)	; FIX(0.850430095)
 F_0_899	equ	DESCALE( 966342111,30-CONST_BITS)	; FIX(0.899976223)
 F_1_061	equ	DESCALE(1139878239,30-CONST_BITS)	; FIX(1.061594337)
 F_1_272	equ	DESCALE(1366614119,30-CONST_BITS)	; FIX(1.272758580)
 F_1_451	equ	DESCALE(1558831516,30-CONST_BITS)	; FIX(1.451774981)
 F_1_847	equ	DESCALE(1984016188,30-CONST_BITS)	; FIX(1.847759065)
 F_2_172	equ	DESCALE(2332956230,30-CONST_BITS)	; FIX(2.172734803)
 F_2_562	equ	DESCALE(2751909506,30-CONST_BITS)	; FIX(2.562915447)
 F_3_624	equ	DESCALE(3891787747,30-CONST_BITS)	; FIX(3.624509785)
 %endif
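 ; Worked check that the two branches agree (illustrative; FIX(x) is taken
 ; here to mean x * 2^CONST_BITS rounded to the nearest integer):
 ;   CONST_BITS = 13:  1.847759065 * 8192 = 15136.84  ->  15137
 ;   %else branch evaluated at the same CONST_BITS:
 ;     DESCALE(1984016188, 30-13) = (1984016188 + 65536) >> 17 = 15137
 ; The large literals are approximately the same constants scaled by 2^30.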
 ; --------------------------------------------------------------------------
 	SECTION	SEG_CONST
 	alignz	16
 	global	EXTN(jconst_idct_red_sse2)
 EXTN(jconst_idct_red_sse2):
 PW_F184_MF076	times 4 dw  F_1_847,-F_0_765
 PW_F256_F089	times 4 dw  F_2_562, F_0_899
 PW_F106_MF217	times 4 dw  F_1_061,-F_2_172
 PW_MF060_MF050	times 4 dw -F_0_601,-F_0_509
 PW_F145_MF021	times 4 dw  F_1_451,-F_0_211
 PW_F362_MF127	times 4 dw  F_3_624,-F_1_272
 PW_F085_MF072	times 4 dw  F_0_850,-F_0_720
 PD_DESCALE_P1_4	times 4 dd  1 << (DESCALE_P1_4-1)
 PD_DESCALE_P2_4	times 4 dd  1 << (DESCALE_P2_4-1)
 PD_DESCALE_P1_2	times 4 dd  1 << (DESCALE_P1_2-1)
 PD_DESCALE_P2_2	times 4 dd  1 << (DESCALE_P2_2-1)
 PB_CENTERJSAMP	times 16 db CENTERJSAMPLE
 	alignz	16
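 ; How the rows above are used (a sketch read off the code below, not an
 ; authoritative description).  pmaddwd multiplies eight signed words
 ; pairwise and adds each adjacent pair of products into one dword.  With
 ; inputs interleaved by punpck{l,h}wd as (x0 y0 x1 y1 ..) and a constant
 ; row "times 4 dw a,b", each dword result is
 ;     a*x_i + b*y_i
 ; e.g. (in2,in6) against PW_F184_MF076 gives in2*F_1_847 - in6*F_0_765.
 ; The PD_DESCALE_* rows hold the rounding term 1 << (n-1) that is added
 ; before the arithmetic right shift by n.  With CONST_BITS = 13 and
 ; PASS1_BITS = 2 these are 2048, 262144, 4096 and 524288 for
 ; n = 12, 19, 13 and 20 respectively.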
 ; --------------------------------------------------------------------------
 	SECTION	SEG_TEXT
 	BITS	32
 ;
 ; Perform dequantization and inverse DCT on one block of coefficients,
 ; producing a reduced-size 4x4 output block.
 ;
 ; GLOBAL(void)
 ; jpeg_idct_4x4_sse2 (j_decompress_ptr cinfo, jpeg_component_info * compptr,
 ;                     JCOEFPTR coef_block,
 ;                     JSAMPARRAY output_buf, JDIMENSION output_col)
 ;
 %define cinfo(b)	(b)+8		; j_decompress_ptr cinfo
 %define compptr(b)	(b)+12		; jpeg_component_info * compptr
 %define coef_block(b)	(b)+16		; JCOEFPTR coef_block
 %define output_buf(b)	(b)+20		; JSAMPARRAY output_buf
 %define output_col(b)	(b)+24		; JDIMENSION output_col
 %define original_ebp	ebp+0
 %define wk(i)		ebp-(WK_NUM-(i))*SIZEOF_XMMWORD	; xmmword wk[WK_NUM]
 %define WK_NUM		2
 	align	16
 	global	EXTN(jpeg_idct_4x4_sse2)
 EXTN(jpeg_idct_4x4_sse2):
 	push	ebp
 	mov	eax,esp				; eax = original ebp (esp right after push ebp)
 	sub	esp, byte 4
 	and	esp, byte (-SIZEOF_XMMWORD)	; align to 128 bits
 	mov	[esp],eax
 	mov	ebp,esp				; ebp = aligned ebp
 	lea	esp, [wk(0)]
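 	; Illustrative trace of the frame setup above (the addresses are made
 	; up; SIZEOF_XMMWORD is assumed to be 16):
 	;   after "push ebp"      esp = 0x0012FF78, eax = 0x0012FF78
 	;   sub esp,4 / and -16   esp = 0x0012FF70
 	;   mov [esp],eax         [0x0012FF70] = 0x0012FF78
 	;   mov ebp,esp           ebp = 0x0012FF70 (16-byte aligned)
 	;   lea esp,[wk(0)]       esp = ebp - 2*16 = 0x0012FF50
 	; Arguments are then addressed from eax: [eax] is the saved ebp,
 	; [eax+4] the return address, [eax+8] cinfo, and so on.  wk(0) and
 	; wk(1) sit at ebp-32 and ebp-16, so movdqa on them is aligned.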
 	pushpic	ebx
 ;	push	ecx		; unused
 ;	push	edx		; need not be preserved
 	push	esi
 	push	edi
 	get_GOT	ebx		; get GOT address
 	; ---- Pass 1: process columns from input.
 ;	mov	eax, [original_ebp]	; not needed: eax still holds it
 	mov	edx, POINTER [compptr(eax)]
 	mov	edx, POINTER [jcompinfo_dct_table(edx)]	; quantptr
 	mov	esi, JCOEFPTR [coef_block(eax)]		; inptr
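 %ifndef NO_ZERO_COLUMN_TEST_4X4_SSE2
 	; Shortcut: if rows 1-3 and 5-7 are all zero, every column is DC-only.
 	; Row 4 is never read by this routine, so it is not tested.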
 	mov	eax, DWORD [DWBLOCK(1,0,esi,SIZEOF_JCOEF)]
 	or	eax, DWORD [DWBLOCK(2,0,esi,SIZEOF_JCOEF)]
 	jnz	short .columnDCT
 	movdqa	xmm0, XMMWORD [XMMBLOCK(1,0,esi,SIZEOF_JCOEF)]
 	movdqa	xmm1, XMMWORD [XMMBLOCK(2,0,esi,SIZEOF_JCOEF)]
 	por	xmm0, XMMWORD [XMMBLOCK(3,0,esi,SIZEOF_JCOEF)]
 	por	xmm1, XMMWORD [XMMBLOCK(5,0,esi,SIZEOF_JCOEF)]
 	por	xmm0, XMMWORD [XMMBLOCK(6,0,esi,SIZEOF_JCOEF)]
 	por	xmm1, XMMWORD [XMMBLOCK(7,0,esi,SIZEOF_JCOEF)]
 	por	xmm0,xmm1
 	packsswb xmm0,xmm0
 	packsswb xmm0,xmm0
 	movd	eax,xmm0
 	test	eax,eax
 	jnz	short .columnDCT
 	; -- AC terms all zero
 	movdqa	xmm0, XMMWORD [XMMBLOCK(0,0,esi,SIZEOF_JCOEF)]
 	pmullw	xmm0, XMMWORD [XMMBLOCK(0,0,edx,SIZEOF_ISLOW_MULT_TYPE)]
 	psllw	xmm0,PASS1_BITS
 	movdqa    xmm3,xmm0	; xmm0=in0=(00 01 02 03 04 05 06 07)
 	punpcklwd xmm0,xmm0	; xmm0=(00 00 01 01 02 02 03 03)
 	punpckhwd xmm3,xmm3	; xmm3=(04 04 05 05 06 06 07 07)
 	pshufd	xmm1,xmm0,0x50	; xmm1=[col0 col1]=(00 00 00 00 01 01 01 01)
 	pshufd	xmm0,xmm0,0xFA	; xmm0=[col2 col3]=(02 02 02 02 03 03 03 03)
 	pshufd	xmm6,xmm3,0x50	; xmm6=[col4 col5]=(04 04 04 04 05 05 05 05)
 	pshufd	xmm3,xmm3,0xFA	; xmm3=[col6 col7]=(06 06 06 06 07 07 07 07)
 	jmp	near .column_end
 	alignx	16,7
 %endif
 .columnDCT:
 	; -- Odd part
 	movdqa	xmm0, XMMWORD [XMMBLOCK(1,0,esi,SIZEOF_JCOEF)]
 	movdqa	xmm1, XMMWORD [XMMBLOCK(3,0,esi,SIZEOF_JCOEF)]
 	pmullw	xmm0, XMMWORD [XMMBLOCK(1,0,edx,SIZEOF_ISLOW_MULT_TYPE)]
 	pmullw	xmm1, XMMWORD [XMMBLOCK(3,0,edx,SIZEOF_ISLOW_MULT_TYPE)]
 	movdqa	xmm2, XMMWORD [XMMBLOCK(5,0,esi,SIZEOF_JCOEF)]
 	movdqa	xmm3, XMMWORD [XMMBLOCK(7,0,esi,SIZEOF_JCOEF)]
 	pmullw	xmm2, XMMWORD [XMMBLOCK(5,0,edx,SIZEOF_ISLOW_MULT_TYPE)]
 	pmullw	xmm3, XMMWORD [XMMBLOCK(7,0,edx,SIZEOF_ISLOW_MULT_TYPE)]
 	movdqa    xmm4,xmm0
 	movdqa    xmm5,xmm0
 	punpcklwd xmm4,xmm1
 	punpckhwd xmm5,xmm1
 	movdqa    xmm0,xmm4
 	movdqa    xmm1,xmm5
 	pmaddwd   xmm4,[GOTOFF(ebx,PW_F256_F089)]	; xmm4=(tmp2L) partial: in1*F_2_562+in3*F_0_899
 	pmaddwd   xmm5,[GOTOFF(ebx,PW_F256_F089)]	; xmm5=(tmp2H) partial
 	pmaddwd   xmm0,[GOTOFF(ebx,PW_F106_MF217)]	; xmm0=(tmp0L) partial: in1*F_1_061-in3*F_2_172
 	pmaddwd   xmm1,[GOTOFF(ebx,PW_F106_MF217)]	; xmm1=(tmp0H) partial
 	movdqa    xmm6,xmm2
 	movdqa    xmm7,xmm2
 	punpcklwd xmm6,xmm3
 	punpckhwd xmm7,xmm3
 	movdqa    xmm2,xmm6
 	movdqa    xmm3,xmm7
 	pmaddwd   xmm6,[GOTOFF(ebx,PW_MF060_MF050)]	; xmm6=(tmp2L) partial: -in5*F_0_601-in7*F_0_509
 	pmaddwd   xmm7,[GOTOFF(ebx,PW_MF060_MF050)]	; xmm7=(tmp2H) partial
 	pmaddwd   xmm2,[GOTOFF(ebx,PW_F145_MF021)]	; xmm2=(tmp0L) partial: in5*F_1_451-in7*F_0_211
 	pmaddwd   xmm3,[GOTOFF(ebx,PW_F145_MF021)]	; xmm3=(tmp0H) partial
 	paddd	xmm6,xmm4		; xmm6=tmp2L
 	paddd	xmm7,xmm5		; xmm7=tmp2H
 	paddd	xmm2,xmm0		; xmm2=tmp0L
 	paddd	xmm3,xmm1		; xmm3=tmp0H
 	movdqa	XMMWORD [wk(0)], xmm2	; wk(0)=tmp0L
 	movdqa	XMMWORD [wk(1)], xmm3	; wk(1)=tmp0H
 	; -- Even part
 	movdqa	xmm4, XMMWORD [XMMBLOCK(0,0,esi,SIZEOF_JCOEF)]
 	movdqa	xmm5, XMMWORD [XMMBLOCK(2,0,esi,SIZEOF_JCOEF)]
 	movdqa	xmm0, XMMWORD [XMMBLOCK(6,0,esi,SIZEOF_JCOEF)]
 	pmullw	xmm4, XMMWORD [XMMBLOCK(0,0,edx,SIZEOF_ISLOW_MULT_TYPE)]
 	pmullw	xmm5, XMMWORD [XMMBLOCK(2,0,edx,SIZEOF_ISLOW_MULT_TYPE)]
 	pmullw	xmm0, XMMWORD [XMMBLOCK(6,0,edx,SIZEOF_ISLOW_MULT_TYPE)]
 	pxor      xmm1,xmm1
 	pxor      xmm2,xmm2
 	punpcklwd xmm1,xmm4		; xmm1=tmp0L
 	punpckhwd xmm2,xmm4		; xmm2=tmp0H
 	psrad     xmm1,(16-CONST_BITS-1) ; psrad xmm1,16 & pslld xmm1,CONST_BITS+1
 	psrad     xmm2,(16-CONST_BITS-1) ; psrad xmm2,16 & pslld xmm2,CONST_BITS+1
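 	; Why one psrad stands in for a sign-extend plus left shift (worked
 	; example with CONST_BITS = 13): punpck{l,h}wd into a zeroed register
 	; places each word x in the upper half of a dword, i.e. x << 16.  An
 	; arithmetic right shift by 16-CONST_BITS-1 = 2 then leaves x << 14,
 	; which is x << (CONST_BITS+1) with the sign preserved.
 	;   x = -3:  0xFFFD0000 = -196608;  -196608 >> 2 = -49152 = -3 * 16384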
 	movdqa    xmm3,xmm5		; xmm5=in2=z2
 	punpcklwd xmm5,xmm0		; xmm0=in6=z3
 	punpckhwd xmm3,xmm0
 	pmaddwd   xmm5,[GOTOFF(ebx,PW_F184_MF076)]	; xmm5=tmp2L
 	pmaddwd   xmm3,[GOTOFF(ebx,PW_F184_MF076)]	; xmm3=tmp2H
 	movdqa	xmm4,xmm1
 	movdqa	xmm0,xmm2
 	paddd	xmm1,xmm5		; xmm1=tmp10L
 	paddd	xmm2,xmm3		; xmm2=tmp10H
 	psubd	xmm4,xmm5		; xmm4=tmp12L
 	psubd	xmm0,xmm3		; xmm0=tmp12H
 	; -- Final output stage
 	movdqa	xmm5,xmm1
 	movdqa	xmm3,xmm2
 	paddd	xmm1,xmm6		; xmm1=data0L
 	paddd	xmm2,xmm7		; xmm2=data0H
 	psubd	xmm5,xmm6		; xmm5=data3L
 	psubd	xmm3,xmm7		; xmm3=data3H
 	movdqa	xmm6,[GOTOFF(ebx,PD_DESCALE_P1_4)]	; xmm6=[PD_DESCALE_P1_4]
 	paddd	xmm1,xmm6
 	paddd	xmm2,xmm6
 	psrad	xmm1,DESCALE_P1_4
 	psrad	xmm2,DESCALE_P1_4
 	paddd	xmm5,xmm6
 	paddd	xmm3,xmm6
 	psrad	xmm5,DESCALE_P1_4
 	psrad	xmm3,DESCALE_P1_4
 	packssdw  xmm1,xmm2		; xmm1=data0=(00 01 02 03 04 05 06 07)
 	packssdw  xmm5,xmm3		; xmm5=data3=(30 31 32 33 34 35 36 37)
 	movdqa	xmm7, XMMWORD [wk(0)]	; xmm7=tmp0L
 	movdqa	xmm6, XMMWORD [wk(1)]	; xmm6=tmp0H
 	movdqa	xmm2,xmm4
 	movdqa	xmm3,xmm0
 	paddd	xmm4,xmm7		; xmm4=data1L
 	paddd	xmm0,xmm6		; xmm0=data1H
 	psubd	xmm2,xmm7		; xmm2=data2L
 	psubd	xmm3,xmm6		; xmm3=data2H
 	movdqa	xmm7,[GOTOFF(ebx,PD_DESCALE_P1_4)]	; xmm7=[PD_DESCALE_P1_4]
 	paddd	xmm4,xmm7
 	paddd	xmm0,xmm7
 	psrad	xmm4,DESCALE_P1_4
 	psrad	xmm0,DESCALE_P1_4
 	paddd	xmm2,xmm7
 	paddd	xmm3,xmm7
 	psrad	xmm2,DESCALE_P1_4
 	psrad	xmm3,DESCALE_P1_4
 	packssdw  xmm4,xmm0		; xmm4=data1=(10 11 12 13 14 15 16 17)
 	packssdw  xmm2,xmm3		; xmm2=data2=(20 21 22 23 24 25 26 27)
 	movdqa    xmm6,xmm1	; transpose coefficients(phase 1)
 	punpcklwd xmm1,xmm4	; xmm1=(00 10 01 11 02 12 03 13)
 	punpckhwd xmm6,xmm4	; xmm6=(04 14 05 15 06 16 07 17)
 	movdqa    xmm7,xmm2	; transpose coefficients(phase 1)
 	punpcklwd xmm2,xmm5	; xmm2=(20 30 21 31 22 32 23 33)
 	punpckhwd xmm7,xmm5	; xmm7=(24 34 25 35 26 36 27 37)
 	movdqa    xmm0,xmm1	; transpose coefficients(phase 2)
 	punpckldq xmm1,xmm2	; xmm1=[col0 col1]=(00 10 20 30 01 11 21 31)
 	punpckhdq xmm0,xmm2	; xmm0=[col2 col3]=(02 12 22 32 03 13 23 33)
 	movdqa    xmm3,xmm6	; transpose coefficients(phase 2)
 	punpckldq xmm6,xmm7	; xmm6=[col4 col5]=(04 14 24 34 05 15 25 35)
 	punpckhdq xmm3,xmm7	; xmm3=[col6 col7]=(06 16 26 36 07 17 27 37)
 .column_end:
 	; -- Prefetch the next coefficient block
 	prefetchnta [esi + DCTSIZE2*SIZEOF_JCOEF + 0*32]
 	prefetchnta [esi + DCTSIZE2*SIZEOF_JCOEF + 1*32]
 	prefetchnta [esi + DCTSIZE2*SIZEOF_JCOEF + 2*32]
 	prefetchnta [esi + DCTSIZE2*SIZEOF_JCOEF + 3*32]
 	; ---- Pass 2: process rows, store into output array.
 	mov	eax, [original_ebp]
 	mov	edi, JSAMPARRAY [output_buf(eax)]	; (JSAMPROW *)
 	mov	eax, JDIMENSION [output_col(eax)]
 	; -- Even part
 	pxor      xmm4,xmm4
 	punpcklwd xmm4,xmm1		; xmm4=tmp0
 	psrad     xmm4,(16-CONST_BITS-1) ; psrad xmm4,16 & pslld xmm4,CONST_BITS+1
 	; -- Odd part
 	punpckhwd xmm1,xmm0
 	punpckhwd xmm6,xmm3
 	movdqa    xmm5,xmm1
 	movdqa    xmm2,xmm6
 	pmaddwd   xmm1,[GOTOFF(ebx,PW_F256_F089)]	; xmm1=(tmp2)
 	pmaddwd   xmm6,[GOTOFF(ebx,PW_MF060_MF050)]	; xmm6=(tmp2)
 	pmaddwd   xmm5,[GOTOFF(ebx,PW_F106_MF217)]	; xmm5=(tmp0)
 	pmaddwd   xmm2,[GOTOFF(ebx,PW_F145_MF021)]	; xmm2=(tmp0)
 	paddd     xmm6,xmm1		; xmm6=tmp2
 	paddd     xmm2,xmm5		; xmm2=tmp0
 	; -- Even part
 	punpcklwd xmm0,xmm3
 	pmaddwd   xmm0,[GOTOFF(ebx,PW_F184_MF076)]	; xmm0=tmp2
 	movdqa    xmm7,xmm4
 	paddd     xmm4,xmm0		; xmm4=tmp10
 	psubd     xmm7,xmm0		; xmm7=tmp12
 	; -- Final output stage
 	movdqa	xmm1,[GOTOFF(ebx,PD_DESCALE_P2_4)]	; xmm1=[PD_DESCALE_P2_4]
 	movdqa	xmm5,xmm4
 	movdqa	xmm3,xmm7
 	paddd	xmm4,xmm6		; xmm4=data0=(00 10 20 30)
 	paddd	xmm7,xmm2		; xmm7=data1=(01 11 21 31)
 	psubd	xmm5,xmm6		; xmm5=data3=(03 13 23 33)
 	psubd	xmm3,xmm2		; xmm3=data2=(02 12 22 32)
 	paddd	xmm4,xmm1
 	paddd	xmm7,xmm1
 	psrad	xmm4,DESCALE_P2_4
 	psrad	xmm7,DESCALE_P2_4
 	paddd	xmm5,xmm1
 	paddd	xmm3,xmm1
 	psrad	xmm5,DESCALE_P2_4
 	psrad	xmm3,DESCALE_P2_4
 	packssdw  xmm4,xmm3		; xmm4=(00 10 20 30 02 12 22 32)
 	packssdw  xmm7,xmm5		; xmm7=(01 11 21 31 03 13 23 33)
 	movdqa    xmm0,xmm4		; transpose coefficients(phase 1)
 	punpcklwd xmm4,xmm7		; xmm4=(00 01 10 11 20 21 30 31)
 	punpckhwd xmm0,xmm7		; xmm0=(02 03 12 13 22 23 32 33)
 	movdqa    xmm6,xmm4		; transpose coefficients(phase 2)
 	punpckldq xmm4,xmm0		; xmm4=(00 01 02 03 10 11 12 13)
 	punpckhdq xmm6,xmm0		; xmm6=(20 21 22 23 30 31 32 33)
 	packsswb  xmm4,xmm6		; xmm4=(00 01 02 03 10 11 12 13 20 ..)
 	paddb     xmm4,[GOTOFF(ebx,PB_CENTERJSAMP)]
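 	; Range handling, sketched with CENTERJSAMPLE assumed to be 128 (8-bit
 	; samples): packsswb saturates each word to [-128,127], and the
 	; wrapping byte add of 128 maps that range onto [0,255].
 	;    word  130 -> byte  127 (0x7F) -> 0x7F + 0x80 = 0xFF = 255
 	;    word    0 -> byte    0 (0x00) -> 0x00 + 0x80 = 0x80 = 128
 	;    word -200 -> byte -128 (0x80) -> 0x80 + 0x80 = 0x00 =   0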
 	pshufd    xmm2,xmm4,0x39	; xmm2=(10 11 12 13 20 21 22 23 30 ..)
 	pshufd    xmm1,xmm4,0x4E	; xmm1=(20 21 22 23 30 31 32 33 00 ..)
 	pshufd    xmm3,xmm4,0x93	; xmm3=(30 31 32 33 00 01 02 03 10 ..)
 	mov	edx, JSAMPROW [edi+0*SIZEOF_JSAMPROW]
 	mov	esi, JSAMPROW [edi+1*SIZEOF_JSAMPROW]
 	movd	_DWORD [edx+eax*SIZEOF_JSAMPLE], xmm4
 	movd	_DWORD [esi+eax*SIZEOF_JSAMPLE], xmm2
 	mov	edx, JSAMPROW [edi+2*SIZEOF_JSAMPROW]
 	mov	esi, JSAMPROW [edi+3*SIZEOF_JSAMPROW]
 	movd	_DWORD [edx+eax*SIZEOF_JSAMPLE], xmm1
 	movd	_DWORD [esi+eax*SIZEOF_JSAMPLE], xmm3
 	pop	edi
 	pop	esi
 ;	pop	edx		; need not be preserved
 ;	pop	ecx		; unused
 	poppic	ebx
 	mov	esp,ebp		; esp <- aligned ebp
 	pop	esp		; esp <- original ebp
 	pop	ebp
 	ret
 ; --------------------------------------------------------------------------
 ;
 ; Perform dequantization and inverse DCT on one block of coefficients,
 ; producing a reduced-size 2x2 output block.
 ;
 ; GLOBAL(void)
 ; jpeg_idct_2x2_sse2 (j_decompress_ptr cinfo, jpeg_component_info * compptr,
 ;                     JCOEFPTR coef_block,
 ;                     JSAMPARRAY output_buf, JDIMENSION output_col)
 ;
 %define cinfo(b)	(b)+8		; j_decompress_ptr cinfo
 %define compptr(b)	(b)+12		; jpeg_component_info * compptr
 %define coef_block(b)	(b)+16		; JCOEFPTR coef_block
 %define output_buf(b)	(b)+20		; JSAMPARRAY output_buf
 %define output_col(b)	(b)+24		; JDIMENSION output_col
 	align	16
 	global	EXTN(jpeg_idct_2x2_sse2)
 EXTN(jpeg_idct_2x2_sse2):
 	push	ebp
 	mov	ebp,esp
 	push	ebx
 ;	push	ecx		; need not be preserved
 ;	push	edx		; need not be preserved
 	push	esi
 	push	edi
 	get_GOT	ebx		; get GOT address
 	; ---- Pass 1: process columns from input.
 	mov	edx, POINTER [compptr(ebp)]
 	mov	edx, POINTER [jcompinfo_dct_table(edx)]	; quantptr
 	mov	esi, JCOEFPTR [coef_block(ebp)]		; inptr
 	; | input:                  | result:        |
 	; | 00 01 ** 03 ** 05 ** 07 |                |
 	; | 10 11 ** 13 ** 15 ** 17 |                |
 	; | ** ** ** ** ** ** ** ** |                |
 	; | 30 31 ** 33 ** 35 ** 37 | A0 A1 A3 A5 A7 |
 	; | ** ** ** ** ** ** ** ** | B0 B1 B3 B5 B7 |
 	; | 50 51 ** 53 ** 55 ** 57 |                |
 	; | ** ** ** ** ** ** ** ** |                |
 	; | 70 71 ** 73 ** 75 ** 77 |                |
 	; -- Odd part
 	movdqa	xmm0, XMMWORD [XMMBLOCK(1,0,esi,SIZEOF_JCOEF)]
 	movdqa	xmm1, XMMWORD [XMMBLOCK(3,0,esi,SIZEOF_JCOEF)]
 	pmullw	xmm0, XMMWORD [XMMBLOCK(1,0,edx,SIZEOF_ISLOW_MULT_TYPE)]
 	pmullw	xmm1, XMMWORD [XMMBLOCK(3,0,edx,SIZEOF_ISLOW_MULT_TYPE)]
 	movdqa	xmm2, XMMWORD [XMMBLOCK(5,0,esi,SIZEOF_JCOEF)]
 	movdqa	xmm3, XMMWORD [XMMBLOCK(7,0,esi,SIZEOF_JCOEF)]
 	pmullw	xmm2, XMMWORD [XMMBLOCK(5,0,edx,SIZEOF_ISLOW_MULT_TYPE)]
 	pmullw	xmm3, XMMWORD [XMMBLOCK(7,0,edx,SIZEOF_ISLOW_MULT_TYPE)]
 	; xmm0=(10 11 ** 13 ** 15 ** 17), xmm1=(30 31 ** 33 ** 35 ** 37)
 	; xmm2=(50 51 ** 53 ** 55 ** 57), xmm3=(70 71 ** 73 ** 75 ** 77)
 	pcmpeqd   xmm7,xmm7
 	pslld     xmm7,WORD_BIT		; xmm7={0x0000 0xFFFF 0x0000 0xFFFF ..}
 	movdqa    xmm4,xmm0		; xmm4=(10 11 ** 13 ** 15 ** 17)
 	movdqa    xmm5,xmm2		; xmm5=(50 51 ** 53 ** 55 ** 57)
 	punpcklwd xmm4,xmm1		; xmm4=(10 30 11 31 ** ** 13 33)
 	punpcklwd xmm5,xmm3		; xmm5=(50 70 51 71 ** ** 53 73)
 	pmaddwd   xmm4,[GOTOFF(ebx,PW_F362_MF127)]
 	pmaddwd   xmm5,[GOTOFF(ebx,PW_F085_MF072)]
 	psrld	xmm0,WORD_BIT		; xmm0=(11 -- 13 -- 15 -- 17 --)
 	pand	xmm1,xmm7		; xmm1=(-- 31 -- 33 -- 35 -- 37)
 	psrld	xmm2,WORD_BIT		; xmm2=(51 -- 53 -- 55 -- 57 --)
 	pand	xmm3,xmm7		; xmm3=(-- 71 -- 73 -- 75 -- 77)
 	por	xmm0,xmm1		; xmm0=(11 31 13 33 15 35 17 37)
 	por	xmm2,xmm3		; xmm2=(51 71 53 73 55 75 57 77)
 	pmaddwd	xmm0,[GOTOFF(ebx,PW_F362_MF127)]
 	pmaddwd	xmm2,[GOTOFF(ebx,PW_F085_MF072)]
 	paddd	xmm4,xmm5		; xmm4=tmp0[col0 col1 **** col3]
 	paddd	xmm0,xmm2		; xmm0=tmp0[col1 col3 col5 col7]
 	; -- Even part
 	movdqa	xmm6, XMMWORD [XMMBLOCK(0,0,esi,SIZEOF_JCOEF)]
 	pmullw	xmm6, XMMWORD [XMMBLOCK(0,0,edx,SIZEOF_ISLOW_MULT_TYPE)]
 	; xmm6=(00 01 ** 03 ** 05 ** 07)
 	movdqa	xmm1,xmm6		; xmm1=(00 01 ** 03 ** 05 ** 07)
 	pslld	xmm6,WORD_BIT		; xmm6=(-- 00 -- ** -- ** -- **)
 	pand	xmm1,xmm7		; xmm1=(-- 01 -- 03 -- 05 -- 07)
 	psrad	xmm6,(WORD_BIT-CONST_BITS-2) ; xmm6=tmp10[col0 **** **** ****]
 	psrad	xmm1,(WORD_BIT-CONST_BITS-2) ; xmm1=tmp10[col1 col3 col5 col7]
 	; -- Final output stage
 	movdqa	xmm3,xmm6
 	movdqa	xmm5,xmm1
 	paddd	xmm6,xmm4	; xmm6=data0[col0 **** **** ****]=(A0 ** ** **)
 	paddd	xmm1,xmm0	; xmm1=data0[col1 col3 col5 col7]=(A1 A3 A5 A7)
 	psubd	xmm3,xmm4	; xmm3=data1[col0 **** **** ****]=(B0 ** ** **)
 	psubd	xmm5,xmm0	; xmm5=data1[col1 col3 col5 col7]=(B1 B3 B5 B7)
 	movdqa	xmm2,[GOTOFF(ebx,PD_DESCALE_P1_2)]	; xmm2=[PD_DESCALE_P1_2]
 	punpckldq  xmm6,xmm3		; xmm6=(A0 B0 ** **)
 	movdqa     xmm7,xmm1
 	punpcklqdq xmm1,xmm5		; xmm1=(A1 A3 B1 B3)
 	punpckhqdq xmm7,xmm5		; xmm7=(A5 A7 B5 B7)
 	paddd	xmm6,xmm2
 	psrad	xmm6,DESCALE_P1_2
 	paddd	xmm1,xmm2
 	paddd	xmm7,xmm2
 	psrad	xmm1,DESCALE_P1_2
 	psrad	xmm7,DESCALE_P1_2
 	; -- Prefetch the next coefficient block
 	prefetchnta [esi + DCTSIZE2*SIZEOF_JCOEF + 0*32]
 	prefetchnta [esi + DCTSIZE2*SIZEOF_JCOEF + 1*32]
 	prefetchnta [esi + DCTSIZE2*SIZEOF_JCOEF + 2*32]
 	prefetchnta [esi + DCTSIZE2*SIZEOF_JCOEF + 3*32]
 	; ---- Pass 2: process rows, store into output array.
 	mov	edi, JSAMPARRAY [output_buf(ebp)]	; (JSAMPROW *)
 	mov	eax, JDIMENSION [output_col(ebp)]
 	; | input:| result:|
 	; | A0 B0 |        |
 	; | A1 B1 | C0 C1  |
 	; | A3 B3 | D0 D1  |
 	; | A5 B5 |        |
 	; | A7 B7 |        |
 	; -- Odd part
 	packssdw  xmm1,xmm1		; xmm1=(A1 A3 B1 B3 A1 A3 B1 B3)
 	packssdw  xmm7,xmm7		; xmm7=(A5 A7 B5 B7 A5 A7 B5 B7)
 	pmaddwd   xmm1,[GOTOFF(ebx,PW_F362_MF127)]
 	pmaddwd   xmm7,[GOTOFF(ebx,PW_F085_MF072)]
 	paddd     xmm1,xmm7		; xmm1=tmp0[row0 row1 row0 row1]
 	; -- Even part
 	pslld     xmm6,(CONST_BITS+2)	; xmm6=tmp10[row0 row1 **** ****]
 	; -- Final output stage
 	movdqa    xmm4,xmm6
 	paddd     xmm6,xmm1	; xmm6=data0[row0 row1 **** ****]=(C0 C1 ** **)
 	psubd     xmm4,xmm1	; xmm4=data1[row0 row1 **** ****]=(D0 D1 ** **)
 	punpckldq xmm6,xmm4	; xmm6=(C0 D0 C1 D1)
 	paddd     xmm6,[GOTOFF(ebx,PD_DESCALE_P2_2)]
 	psrad     xmm6,DESCALE_P2_2
 	packssdw  xmm6,xmm6		; xmm6=(C0 D0 C1 D1 C0 D0 C1 D1)
 	packsswb  xmm6,xmm6		; xmm6=(C0 D0 C1 D1 C0 D0 C1 D1 ..)
 	paddb     xmm6,[GOTOFF(ebx,PB_CENTERJSAMP)]
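 	; Scale check for a DC-only block (illustrative numbers; CONST_BITS = 13,
 	; PASS1_BITS = 2, CENTERJSAMPLE assumed to be 128).  Dequantized
 	; in0 = 80, every other coefficient zero, so both odd parts are 0:
 	;   pass 1: tmp10 = 80 << 15 = 2621440
 	;           (2621440 + 4096) >> 13 = 320        ; A0 = B0 = 320
 	;   pass 2: tmp10 = 320 << 15 = 10485760
 	;           (10485760 + 524288) >> 20 = 10      ; C0 = C1 = D0 = D1
 	;   output: 10 + 128 = 138 in all four samples, i.e. 80/8 + 128.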
 	pextrw	ebx,xmm6,0x00		; ebx=(C0 D0 -- --); the GOT address is no longer needed
 	pextrw	ecx,xmm6,0x01		; ecx=(C1 D1 -- --)
 	mov	edx, JSAMPROW [edi+0*SIZEOF_JSAMPROW]
 	mov	esi, JSAMPROW [edi+1*SIZEOF_JSAMPROW]
 	mov	WORD [edx+eax*SIZEOF_JSAMPLE], bx
 	mov	WORD [esi+eax*SIZEOF_JSAMPLE], cx
 	pop	edi
 	pop	esi
 ;	pop	edx		; need not be preserved
 ;	pop	ecx		; need not be preserved
 	pop	ebx
 	pop	ebp
 	ret
 %endif ; JIDCT_INT_SSE2_SUPPORTED
 %endif ; IDCT_SCALING_SUPPORTED
Author	SHA1	Message	Date
MIYASAKA Masaru	a2e6a9dd47	IJG R6b with x86SIMD V1.02 Independent JPEG Group's JPEG software release 6b with x86 SIMD extension for IJG JPEG library version 1.02	2015-07-29 16:36:25 -05:00
Thomas G. Lane	5ead57a34a	The Independent JPEG Group's JPEG software v6b	2015-07-27 13:43:00 -05:00