Reducing the size of a PDF file of scanned images

I downloaded this PDF file from a website; it is 350 KB in size and has 20 pages. All pages are scanned images. I extracted the images using Adobe Acrobat Pro (View > Tools > Document Processing > Export All Images); together they are 1.32 MB. I then converted them into a single PDF file (1.28 MB). How can I combine those images into a small PDF file?

Do I need to reduce the size of the scanned images with some software? If so, how can I do this and still match the quality of that 350 KB PDF file?

In fact, I regularly scan my documents and convert them to PDF, and I want to keep the files as small as possible.

What I tried:

  • in Adobe Acrobat Pro: File > Save As Other > Reduced Size PDF
  • in Adobe Acrobat Pro: File > Print > "print in grayscale" checkbox checked

update: Links removed due to copyright infringement!

living being

Posted 2014-12-28T08:29:27.187

Reputation: 812

Store them in a compressed directory? Assuming Windows OS. You can also compress each file with WinZip/PKZIP. – mdpc – 2014-12-28T09:11:18.843

They are stored in a regular directory, not a compressed one. Yes, I'm using Windows. Zipping and extracting each time? That's not practical. – living being – 2014-12-28T09:24:15.850

I mean set the directory to be compressed, so that whatever you put in it is compressed automagically. For longer-term storage and light use, I think compression on an individual-file basis is quite practical. – mdpc – 2014-12-28T09:25:46.623

The original pages look like a fax (not over 200 dpi black and white; could have been scanned that way), with a watermark on every page. That's why the PDF was so small, and that's how to re-create one of comparable size. – fixer1234 – 2014-12-28T09:37:48.250

@mdpc: I want to reduce the size of the file itself. – living being – 2014-12-28T09:41:34.257

@fixer1234: How can I reduce the size of my scanned images to this level? Also, does Adobe Reader convert the images to a higher quality? – living being – 2014-12-28T09:45:06.080

Use a scan setting equivalent to a fax (200 dpi B&W), and save in an image format that supports B&W (also called monochrome or bi-level), with compression, like TIFF, GIF, PNG, or PCX. If they are already scanned, use an image processing program like Irfanview to convert them. – fixer1234 – 2014-12-28T09:59:17.227
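To see why a fax-style bi-level scan ends up so much smaller, a rough back-of-the-envelope sketch in Python may help. It assumes an A4 page (8.27 x 11.69 in), which nobody in this thread stated, and the function names and the cutoff of 128 are made up for illustration. In practice a program such as Irfanview does the thresholding, and the compression of the chosen file format then shrinks the 1-bit data further.

```python
def raw_bytes(width_in, height_in, dpi, bits_per_pixel):
    """Uncompressed size in bytes of a scan at the given resolution and bit depth."""
    width_px = round(width_in * dpi)
    height_px = round(height_in * dpi)
    # Each row is padded up to a whole number of bytes.
    row_bytes = (width_px * bits_per_pixel + 7) // 8
    return row_bytes * height_px


def threshold_pack(gray_rows, cutoff=128):
    """Turn rows of 0-255 grey values into packed 1-bit rows (1 = white).

    cutoff is a hypothetical fixed threshold; real tools often pick it adaptively.
    """
    packed = bytearray()
    for row in gray_rows:
        byte, count = 0, 0
        for value in row:
            byte = (byte << 1) | (1 if value >= cutoff else 0)
            count += 1
            if count == 8:
                packed.append(byte)
                byte, count = 0, 0
        if count:
            # Pad the last byte of the row with zero bits on the right.
            packed.append(byte << (8 - count))
    return bytes(packed)


if __name__ == "__main__":
    a4 = (8.27, 11.69)  # assumed page size in inches
    for label, bpp in [("bi-level", 1), ("greyscale", 8), ("colour", 24)]:
        print(f"{label:10s} {raw_bytes(*a4, 200, bpp):>12,d} bytes before compression")
```

Under these assumptions a 200 dpi page is roughly 0.48 MB as 1-bit data, 3.9 MB as 8-bit greyscale, and 11.6 MB as 24-bit colour before any compression, so the bi-level scan starts out 8 to 24 times smaller.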

Converting the PDF to PostScript and back to PDF usually does the trick. If you have access to a Linux machine, you could do it like this: pdf2ps input.pdf output.ps; ps2pdf output.ps output.pdf – Reuben L. – 2014-12-28T10:07:43.907
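For anyone who wants to script that round trip, here is a minimal Python sketch. It assumes Ghostscript's pdf2ps and ps2pdf scripts are installed and on the PATH; the function names and the temporary file name are made up for illustration. Whether the round trip actually shrinks a PDF of scanned images is not guaranteed, so compare the sizes and the quality afterwards.

```python
import shutil
import subprocess


def roundtrip_commands(src_pdf, dst_pdf, tmp_ps="output.ps"):
    """Build the two commands from the comment above: PDF -> PostScript -> PDF."""
    return [
        ["pdf2ps", src_pdf, tmp_ps],
        ["ps2pdf", tmp_ps, dst_pdf],
    ]


def run_roundtrip(src_pdf, dst_pdf):
    """Run both steps, stopping with an error if a tool is missing or a step fails."""
    for cmd in roundtrip_commands(src_pdf, dst_pdf):
        if shutil.which(cmd[0]) is None:
            raise FileNotFoundError(f"{cmd[0]} not found on PATH (is Ghostscript installed?)")
        subprocess.run(cmd, check=True)


if __name__ == "__main__":
    # Only prints the commands; call run_roundtrip("input.pdf", "output.pdf") to execute them.
    for cmd in roundtrip_commands("input.pdf", "output.pdf"):
        print(" ".join(cmd))
```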

@fixer1234: I did some tweaking with Irfanview and did not get a good result. – living being – 2014-12-28T10:16:14.230

@Reuben L: I'm using Windows – living being – 2014-12-28T10:16:50.487

Take a close look at the text on the sample document. It isn't very high quality. That's what 18K per page looks like. Also, the result of reprocessing an old color or greyscale scan will probably be worse than a clean original scan made at the desired settings. – fixer1234 – 2014-12-28T10:22:17.063

The best result I got from Irfanview is 70 KB per page with awful quality. – living being – 2014-12-28T10:26:03.567

Can you post a sample original scan and a sample processed scan? Looking at the actual files is the only way to determine what the problem might be. – fixer1234 – 2014-12-28T20:15:47.553

As I mentioned, I downloaded that file from a website; I have no access to the original scanned images. Furthermore, my main purpose is to find a general solution that I can apply to my own scanned images so that they don't produce a large PDF file. – living being – 2014-12-28T20:21:26.697

"The best result I got from Irfanview is 70 KB per page with an awful quality." The only way for someone to understand the results you got is to see the before and after images involved. You haven't clearly stated your objective. What do you want to start with (existing images or hard copies)? What do you want to end up with (a small aggregate PDF)? How good/bad can the result look (similar to the sample you posted)? You can't get there if jpg is part of the process (doesn't do B&W, plus heavy compression creates artifacts). Starting from color/greyscale image will yield poor results. – fixer1234 – 2014-12-29T00:42:04.893

Answers

What you did is useful as an exercise. Otherwise exporting images from a PDF like this and creating a new PDF out of those makes no sense.

The original document space usage is:

Description        Bytes      Percentage
Images             351,829    97.60 %
Content Streams    2,742      0.76 %
Document Overhead  5,916      1.64 %
Total              360,478    100 %

Your document's space usage is:

Description        Bytes      Percentage
Images             1,329,944  98.87 %
Bookmarks          21         0.00 %
Content Streams    1,675      0.12 %
Structure info     60         0.00 %
Document Overhead  13,389     1.00 %
Total              1,345,089  100 % 
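The two breakdowns above can be compared with a little arithmetic. The sketch below uses the totals exactly as reported and the page count of 20 from the question; the function names are made up for illustration.

```python
PAGES = 20  # page count given in the question

# Byte counts as reported in the two space-usage tables.
original = {"images": 351_829, "total": 360_478}
rebuilt = {"images": 1_329_944, "total": 1_345_089}


def per_page(total_bytes, pages=PAGES):
    """Average number of bytes per page."""
    return total_bytes / pages


def growth_factor(new_bytes, old_bytes):
    """How many times larger the new figure is than the old one."""
    return new_bytes / old_bytes


if __name__ == "__main__":
    print(f"original: {per_page(original['total']):,.0f} bytes per page")
    print(f"rebuilt:  {per_page(rebuilt['total']):,.0f} bytes per page")
    print(f"file growth:  {growth_factor(rebuilt['total'], original['total']):.2f}x")
    print(f"image growth: {growth_factor(rebuilt['images'], original['images']):.2f}x")
```

On these figures the rebuilt file is about 3.7 times the size of the original, and almost all of that growth is in the image data.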

The original document wasn't created with Acrobat but with iText, which explains the missing structure info.

Under Document Processing there is a separate tool, "Optimize scanned PDF". I followed your workflow and ran the optimizer on my newly created PDF, and the resulting file size is 328 KB. However, the quality is clearly worse than that of the original document.

This is to be expected, as I did everything with default settings. This means the images were exported as JPEG, which in any case comes out larger than the same page saved as a PDF. I tested this by extracting each page to a single PDF: for example, the JPEG exported from page 1 is 22 KB, whereas the same page exported as a PDF is just 9 KB. Optimizing the images further in the new document worsens the image quality even more. This is unavoidable with lossy image formats such as JPEG.

The space usage above shows that Acrobat exported the images with the highest possible quality. This makes sense: when you export images, you want to get them out with minimal loss of image data.

One option could be OCRing the file, which converts the images to text; text is much lighter than image data. Acrobat Pro contains an OCR tool, but I can't test this as I don't have Arabic available.

EDIT: Extended language pack only applies to Adobe Reader. After some research it seems that Acrobat does not support Arabic OCR. See this Adobe forum discussion.

Scanning into PDF and then optimizing is always a tradeoff between size and quality. You just need to test with different settings (for both the original scan and the optimization) until you find a satisfactory compromise.

Instructions for PDF optimization are in Acrobat Help. Help is available online for both Acrobat X and Acrobat XI.

Peregrino69

Posted 2014-12-28T08:29:27.187

Reputation: 210

They converted the images into PDF without OCR. So it is "possible" to do it at such a small size. What kind of solution can we use to achieve this? That's my question. There is definitely a tradeoff between size and quality, but a document with this size and this quality exists. – living being – 2014-12-29T04:09:05.563

Most likely yes. I don't really get the question: if you have a PDF which contains pictures with text, you can OCR it with Acrobat, but you will have to test for yourself which settings give you an acceptable quality. There is no one-size-fits-all solution. The only things I can think of that would decrease the file size are OCRing and optimizing. I added a couple of things to the original answer. – Peregrino69 – 2014-12-29T10:42:56.023

Of course there's not one standard size, as you pointed out. I just meant that they achieved this size, but the size of mine is 300% more. That's a lot! Also, using OCR is a pain in the neck. It cannot recognize all of the text correctly, especially for non-English languages. – living being – 2014-12-29T10:52:43.220

The original scanned images are simply smaller than what Acrobat exports. If you want to have smaller size exported images, you can choose lower quality, but that's just what it says - lower quality images. I updated the answer again with info about using Acrobat with non-English languages and non-western characters. If you're using Acrobat professionally I'd recommend getting training on it. Lynda (http://www.lynda.com) is the current provider of official Adobe training materials. – Peregrino69 – 2014-12-29T11:01:40.550

I used Readiris Pro, a professional program designed specifically for OCR, and tried my best, but I never got a good result (for this language). – living being – 2014-12-29T11:05:38.350

"Readiris 14 Mac OCR doesn’t support the following languages: arabic, farsi, kazakh, mongolian (cyrillic)" (http://www.irislink.com/c2-2808-189/Readiris-14---OCR-Software---Scan--convert---manage-documents.aspx). So that only works if you use Windows. Readiris seems to create OCR:d PDFs directly, but that's not part of this question. This concerns Acrobat. If you can't find a satisfactory solution using Acrobat's tools, you can find a number of PDF compression tools with google, free and paid, so you can test with those.

– Peregrino69 – 2014-12-29T11:18:54.230

I used Readiris on Windows and it worked for Farsi. Also, I searched Google a lot for such tools, tried them, and never got a good result. – living being – 2014-12-29T11:21:25.160

Again, this question is about Acrobat, not Readiris. You didn't use it in this workflow, so that isn't a part of this. Acrobat has the tools it has. If those do not satisfy your needs, the only further advice I can give is contacting Adobe support. – Peregrino69 – 2014-12-29T11:26:39.313