Use Ghostscript, but tell it to not reprocess images?

30

21

I have a PDF that has already compressed and somewhat artifact-y images, and I'm using Ghostscript to prepend a title page to that PDF.

However, I cannot find any way to tell GS to just use the existing images as-is without reprocessing them, and now I suspect it has something to do with how GS works, i.e. you can't recompile/link a PDF without reprocessing its images. Is that true?

I can raise the DPI setting in GS, but it'll go from 5MB to 60MB while still looking worse.

Is there any better alternative to GS that'll do what I need (preferably that will compile on OS X)?

Mahmoud Al-Qudsi

Posted 2011-11-22T10:34:58.463

Reputation: 3 274

Can you edit your question and quote the exact commandline you are using to prepend your title page to the original PDF? Then I could tell you what exactly to change or add to the commandline in order to get a better output for images... – Kurt Pfeifle – 2011-11-25T12:46:33.723

I don't want to just have it look better, I want to merge without reprocessing. This will a) result in better quality (lossless transforms), and b) not waste hours of CPU time processing my 1000+ page document. – Mahmoud Al-Qudsi – 2012-01-02T04:44:54.607

Hey, you didn't answer my question and you didn't quote the exact GS commandline you are using. Which means: you'll not be getting the help regarding GS you're looking for... – Kurt Pfeifle – 2012-01-02T09:07:55.060

Answers

44

If you just want to concatenate two PDF files without any reprocessing of their content, pdftk is for you. (On Mac OS X this should be available via MacPorts or Fink; for Linux, there are native packages for all major distributions; for Windows, look here.) Try this:

 pdftk title.pdf content.pdf cat output book.pdf

This will prepend the title.pdf to the content.pdf and write the result into book.pdf.

pdftk is a "dumb" but very fast way to concatenate two (or more) PDF files. "Dumb" insofar as pdftk does not in any way interpret the PDF data streams; it just makes sure that the internal object numbers are reshuffled as needed and appear in the PDF xref structure (which basically is a sort of PDF ToC for objects).
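If only the first page of the title file should be prepended, pdftk's file handles and page ranges can express that. The following is a minimal sketch, assuming pdftk is on the PATH; the function name `prepend_title` and the `DRY_RUN` variable are made up for illustration and are not part of pdftk:

```shell
#!/bin/sh
# prepend_title TITLE CONTENT OUTPUT
# Puts page 1 of TITLE in front of all pages of CONTENT.
# Hypothetical helper: set DRY_RUN=1 to print the command instead of running it.
prepend_title() {
    if [ "$#" -ne 3 ]; then
        echo "usage: prepend_title TITLE CONTENT OUTPUT" >&2
        return 2
    fi
    # A and B are pdftk file handles; "A1" is page 1 of A, "B" is every page of B.
    set -- pdftk "A=$1" "B=$2" cat A1 B output "$3"
    if [ "${DRY_RUN:-0}" = 1 ]; then
        echo "$*"
    else
        "$@"
    fi
}
```

Calling `prepend_title title.pdf content.pdf book.pdf` should then behave like the command above, except that only the first page of title.pdf is taken.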

Ghostscript:

If you want to use Ghostscript, the basic command to concatenate the same two files would be:

 gs \
  -o book.pdf \
  -sDEVICE=pdfwrite \
   title.pdf \
   content.pdf

However, as you experienced, this simple command line may mess up your image quality. The reason is that Ghostscript is not 'dumb' when it processes PDFs: it completely interprets them when reading in, and creates a completely new file when writing out the result. For creating the result, it will automatically use default settings for a lot of details in the overall processing. These defaults apply in all cases where the invocation does not instruct Ghostscript otherwise.

So Ghostscript's method to create the new book.pdf is much more "intelligent" (but also much slower) than pdftk's method. (This is also the reason why Ghostscript in many cases is able to --within limits-- "repair" b0rken PDF files, or to embed fonts into the output PDFs which are not embedded in input PDFs, or to remove duplicate images, replacing them by mere references, etc. -- and overall to create smaller, better optimized files from bloated input PDFs...)

The solution is to not let Ghostscript use its defaults, by adding more custom parameters to the command line.

What does it mean "Ghostscript 'interprets' its PDF input"?

All of the file and its contents (objects, streams, fonts, images,...) are read in, checked and held in Ghostscript's own internal representation, before it spits out the resulting PDF with its PDF objects again. However, when 'spitting out', Ghostscript will apply all of its internal default settings for the hundreds of parameters [*] that are available.

Unfortunately, this causes your "reprocessing" of images according to these default settings -- which can only be avoided or overridden by adding your own (desired) commandline parameters.

Your image problems could be caused by Ghostscript's need (due to licensing issues) to re-encode JPEG2000 images to JPEG encoding. If you want to avoid this, add the following to your commandline:

-dAutoFilterColorImages=false \
-dAutoFilterGrayImages=false \
-dColorImageFilter=/FlateEncode \
-dGrayImageFilter=/FlateEncode \

Other image-related commandline options to consider for including are:

-dColorConversionStrategy=/LeaveColorUnchanged \
-dDownsampleMonoImages=false \
-dDownsampleGrayImages=false \
-dDownsampleColorImages=false \

So the complete Ghostscript commandline that could make you happy should read:

 gs \
  -o book.pdf \
  -sDEVICE=pdfwrite \
  -dColorConversionStrategy=/LeaveColorUnchanged \
  -dDownsampleMonoImages=false \
  -dDownsampleGrayImages=false \
  -dDownsampleColorImages=false \
  -dAutoFilterColorImages=false \
  -dAutoFilterGrayImages=false \
  -dColorImageFilter=/FlateEncode \
  -dGrayImageFilter=/FlateEncode \
   title.pdf \
   content.pdf
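To check whether a run actually changed how the images are stored, the image listings of the input and the output can be compared. This is a rough sketch, assuming the `pdfimages` tool from poppler-utils is installed (it is not mentioned in the thread); the function names are hypothetical, and the column position of `enc` is taken from the `pdfimages -list` table layout:

```shell
#!/bin/sh
# summarize_encodings: reads `pdfimages -list` output on stdin and counts
# images per value of the "enc" column (column 9), skipping the two header lines.
summarize_encodings() {
    awk 'NR > 2 && NF >= 9 { count[$9]++ }
         END { for (e in count) print e, count[e] }' | sort
}

# image_encodings FILE: hypothetical convenience wrapper; needs pdfimages installed.
image_encodings() {
    pdfimages -list "$1" | summarize_encodings
}
```

Running `image_encodings content.pdf` and `image_encodings book.pdf` side by side shows, for example, whether images listed as `jpeg` in the input are still `jpeg` in the output or have become plain `image` data (which is what Flate-compressed images are listed as).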

You could also tell Ghostscript NOT to compress images at all in the output PDF, by using this commandline:

 gs \
  -o book.pdf \
  -sDEVICE=pdfwrite \
  -dColorConversionStrategy=/LeaveColorUnchanged \
  -dEncodeColorImages=false \
  -dEncodeGrayImages=false \
  -dEncodeMonoImages=false \
   title.pdf \
   content.pdf


[*]:
If you want to see the complete list of default settings that Ghostscript's pdfwrite device uses, run the following command. It returns the full list:

 gs \
   -sDEVICE=pdfwrite \
   -o /dev/null \
   -c "currentpagedevice { exch ==only ( ) print == } forall"

For explanations of what exactly all these parameters mean, you'll have to read up in the Adobe documentation about "Distiller Parameters". Ghostscript tries very hard to mimic all these...
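Since the full list is long, it can help to filter it down to the image-related entries. A small sketch, assuming a Ghostscript build that accepts `-c` and prints one `/Name value` pair per line as the command above does; the function names and the `Image` pattern are my own choice for illustration, not an official grouping of parameters:

```shell
#!/bin/sh
# only_image_params: keep lines whose parameter name contains "Image".
# Lines not starting with "/" (such as the Ghostscript banner) are dropped.
only_image_params() {
    grep -E '^/[A-Za-z]*Image' | sort
}

# list_image_defaults: hypothetical wrapper around the command above.
list_image_defaults() {
    gs -sDEVICE=pdfwrite -o /dev/null \
       -c "currentpagedevice { exch ==only ( ) print == } forall" |
        only_image_params
}
```

`list_image_defaults` should then print only entries such as the `Downsample...Images`, `AutoFilter...Images` and `...ImageFilter` settings discussed above, which makes it easier to see which defaults need overriding.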

Kurt Pfeifle

Posted 2011-11-22T10:34:58.463

Reputation: 10 024

(FYI) In my case, the flags dEncodeColorImages, dEncodeGrayImages, dEncodeMonoImages cause the output file to become a lot more massive. By removing them, the file size changed from 22MB to 3.1MB and the image quality seems exactly the same as with these flags. All the unique flags which I use are: dColorConversionStrategy=/LeaveColorUnchanged, dDownsampleMonoImages=false, dDownsampleGrayImages=false, dDownsampleColorImages=false, dAutoFilterColorImages=false, dAutoFilterGrayImages=false, dColorImageFilter=/FlateEncode, dGrayImageFilter=/FlateEncode – Dor – 2016-05-19T16:04:10.157

@Kurt Pfeifle What options are allowed for -dColorImageFilter? I can only find FlateEncode and DCTEncode. DCT seems to do JPEG (why did they encrypt that?). I think FLATE is an outdated option for images by now since Bell Labs patent on LZW is no longer an issue? However after spending quite some time searching I cannot find how to use PNG (or anything else)... My original images are PNG and I want them to stay unchanged. I tried the -c option, but it gives me -c can only be used in a built with POSTSCRIPT included.... – Louis Somers – 2019-09-17T22:05:30.323

-1

It turns out macOS can also do this natively, but not via any programmatic or scriptable interface.

By opening the two PDF files in two different Preview.app windows, you can drag and drop a page from the sidebar thumbnails into the other document, and OS X will recreate the PDF without reprocessing the actual documents/images. Works like a dream, and can possibly be AppleScripted, though I am not certain.

I was not able to use ghostscript and have it join PDFs without recreating/reprocessing the actual documents and images, but the pdftk suggestion in another answer also worked.

Mahmoud Al-Qudsi

Posted 2011-11-22T10:34:58.463

Reputation: 3 274

The statement in your last sentence is not correct. It IS possible for Ghostscript to process images and apply lossless compression or no compression at all to them. The lossless encoding scheme is called Flate. – Kurt Pfeifle – 2012-01-02T10:41:48.380