Extracting background images from a PDF file?

8

3

I have a PDF file containing maps of the building I work in, here:

http://www.libsys.und.edu/dev/FloorPlans_All.pdf

The original source files have been lost, and I've been asked to extract the map images, preferably without the text and icons that have been overlaid on top of them. This has proven annoyingly difficult.

So far, I have tried the following GUI programs:

  • Adobe Reader: lets me select text, but not the background images
  • FoxIt PDF Viewer: lets me select text, but not the background images
  • XPDF on Ubuntu 10.10: lets me select text, but not the background images

And also the following command-line programs:

  • pdfimages: extracts the icons indicating bathrooms just fine, but not the background images
  • pdftohtml: same as pdfimages, plus it makes a poorly marked up HTML document
  • pdfextract: same as pdfimages
  • convert: successfully saved images, but with the text burned into them

I've even tried opening the PDF manually in a text editor and extracting the stream objects by pasting them into a new file and saving it with a .jpg, .png, or .bmp extension (each in turn). Considering how little I know about the internal structure of PDF files, it's no surprise that this didn't work.

So ... is there any way I can retrieve the map images from this thing without also getting the text and icons?

Will Martin

Posted 2011-05-27T16:24:24.190

Reputation: 821

The way I usually solve this kind of task: (1) Use qpdf to convert the binary parts to ASCII as far as possible. (2) Use a text editor to make all text invisible that I don't want to see on screen or in printouts (can be achieved easily and without damage to the XRef table by toggling the invisible flag). (3) Re-distill the result with Ghostscript to boil down its size as much as possible. -- Unfortunately, your file is no longer downloadable to demonstrate the procedure... – Kurt Pfeifle – 2011-05-28T06:06:02.413
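A rough Python sketch of step (2), for illustration only. It assumes step (1) has already left the content streams uncompressed (for example with qpdf's QDF mode, `qpdf --qdf --object-streams=disable in.pdf out.pdf`) and that the text is drawn with explicit `Tr` operators, neither of which is guaranteed for this file. In PDF, text rendering mode 3 means "invisible", and swapping one single-digit mode for another keeps the byte length unchanged, which is why the xref offsets survive. The name `hide_text` is made up for this example.

```python
import re

# A single-digit text rendering mode followed by the Tr operator, e.g. b"0 Tr".
# The lookbehind keeps the pattern from matching the tail of a longer number.
_TR = re.compile(rb"(?<![\w.])[0-7] Tr\b")


def hide_text(pdf_bytes: bytes) -> bytes:
    """Set every explicit text rendering mode to 3 (invisible).

    Crude by design: it works on raw bytes, so it is only meaningful on
    uncompressed content streams, and it would also rewrite a literal
    "0 Tr" that happens to sit inside a text string. The replacement has
    the same length as the match, so byte offsets do not move.
    """
    return _TR.sub(b"3 Tr", pdf_bytes)
```

Text that never sets a rendering mode (it defaults to 0) is left untouched by this sketch, so in that case the operator would have to be added by hand and the file repaired afterwards.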

Answers

7

You can download the XPDF library from http://www.foolabs.com/xpdf/download.html for Linux and Windows. Then run pdfimages -j input.pdf output and you should get output-000.jpg, output-001.jpg, etc. Also, check out http://linuxcommand.org/man_pages/pdfimages1.html for more usage options.

mybluevan

Posted 2011-05-27T16:24:24.190

Reputation: 116

1Correction, looks like the image is a vector graphic directly embedded in the PDF. Try opening it in something like Inkscape or Adobe Illustrator that handles vector graphics. – mybluevan – 2011-05-27T18:05:10.063

Ah HA! The maps are vector graphics -- no wonder I've been having such trouble! Inkscape seems to have opened it just fine, and I can edit it to my heart's content. Thanks! – Will Martin – 2011-05-27T18:27:23.377
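For anyone hitting the same wall: one quick way to check whether a PDF's pages are drawn with vector paths rather than embedded raster images is to inflate its streams and count path operators. Below is a minimal Python sketch, not a real PDF parser. The name `summarize` is made up, it only understands Flate-compressed or plain streams, it skips streams containing non-ASCII bytes, and it will miss anything stored inside object streams.

```python
import re
import zlib

# Raw stream bodies between the "stream" and "endstream" keywords.
_STREAM = re.compile(rb"stream\r?\n(.*?)\r?\nendstream", re.DOTALL)
# Path construction operators: moveto (m), lineto (l), curveto (c), rectangle (re).
_PATH_OP = re.compile(rb"(?<=\s)(?:m|l|c|re)(?=\s)")
# Image XObject dictionaries, which is what pdfimages-style tools look for.
_IMAGE = re.compile(rb"/Subtype\s*/Image\b")


def summarize(pdf_bytes: bytes) -> dict:
    """Count image XObjects and path operators found in the raw bytes."""
    path_ops = 0
    for match in _STREAM.finditer(pdf_bytes):
        raw = match.group(1)
        try:
            # decompressobj tolerates trailing bytes after the Flate data.
            data = zlib.decompressobj().decompress(raw)
        except zlib.error:
            data = raw  # not Flate-compressed, or uses another filter
        if not data.isascii():
            continue  # binary payload such as image samples
        path_ops += len(_PATH_OP.findall(b"\n" + data + b"\n"))
    return {
        "image_xobjects": len(_IMAGE.findall(pdf_bytes)),
        "path_operators": path_ops,
    }
```

A file that reports few or no image XObjects but a large number of path operators is probably vector artwork, which would explain why image extractors only return the small icons.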

2

Ok, after messing around with this for 5 minutes, my analysis is that PDF is even weirder than I originally thought, and that's saying something.

Not sure what your budget is, but with Acrobat Pro Extended 9, you can use:

A. Tools, Advanced Editing, Touchup Text Tool

-Select All
-Right click, Properties
-Text tab
-Select a standard font (e.g. Arial), close
-Hit Delete

B. Tools, Advanced editing, Touchup Object Tool

-Select the object, then delete (you can get most, but not all, of them; e.g. the student computer icons can't be selected)

Here's what Page 1 looked like after a quick cleanup: http://dl.dropbox.com/u/7434256/p1test.pdf

Craig H

Posted 2011-05-27T16:24:24.190

Reputation: 1 172

Weird is an understatement. I don't know the history of this file, but Acrobat Pro 8 gave us trouble. Inkscape did the trick, though, thank goodness. Now to convert it all to some proper SVGs that we can generate raster graphics from ... – Will Martin – 2011-05-27T18:29:16.043

1The job you've done on the original PDF (which, unfortunately, is no longer available to me) is not the best. Your file is still ~3 MByte. It contains lots and lots of unused objects. It even contains an instance of the /AA key (additional actions), making it a potentially dangerous PDF file. Ghostscript was able to boil it down to 60 kByte without losing any of its visible content. (The metadata contained in the file is spread over 17 different objects. The metadata also suggests there have been 17 different revisions/modifications of that file since its creation on 2011-01-18.) – Kurt Pfeifle – 2011-05-28T06:00:37.080
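Checks like the ones in this comment can be roughed out in a few lines of Python. This is only a sketch that scans the raw bytes: `inspect_pdf` is a made-up name, counting `%%EOF` markers is a heuristic for incremental saves rather than a true revision count, and keys hidden inside compressed object streams will not be seen.

```python
import re


def inspect_pdf(pdf_bytes: bytes) -> dict:
    """Return rough indicators read straight from the raw bytes of a PDF."""
    return {
        # Each incremental save normally appends its own %%EOF marker.
        "eof_markers": pdf_bytes.count(b"%%EOF"),
        # /AA (additional actions) can run actions on events such as page open.
        "has_aa": re.search(rb"/AA\b", pdf_bytes) is not None,
        # /OpenAction runs when the document itself is opened.
        "has_open_action": b"/OpenAction" in pdf_bytes,
        "size_kbyte": len(pdf_bytes) // 1024,
    }
```

Typical use would be `inspect_pdf(open("p1test.pdf", "rb").read())`, run once before and once after the cleanup to compare the numbers.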

@pipitas I'm glad I checked this again, it turns out Apache was down on that server. The original PDF is available again. It's still annoying though. I've since discovered that the maps were generated from AutoCAD DXF files, which make for seriously ugly vector graphics. There are hundreds of individual paths in each map, each one a single line with two end points. This probably made it easier for an architect to alter individual sections of wall or whatever, but it's a pain in the butt for anything else. – Will Martin – 2011-05-28T23:43:59.487
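If the two-point paths get in the way, one possible cleanup is to chain segments that share an endpoint into longer polylines before writing SVG. The Python sketch below assumes the segments have already been pulled out of the drawing as pairs of `(x, y)` tuples and that shared endpoints match exactly (real DXF-derived data would likely need rounding first). `merge_segments` and `to_svg_path` are illustrative names, not part of Inkscape or any library.

```python
from collections import defaultdict


def merge_segments(segments):
    """Chain segments that share an endpoint into polylines (lists of points)."""
    by_point = defaultdict(list)
    for index, (a, b) in enumerate(segments):
        by_point[a].append(index)
        by_point[b].append(index)

    used = set()

    def extend(line, at_end):
        # Keep attaching unused segments to one end of the polyline.
        while True:
            tip = line[-1] if at_end else line[0]
            nxt = next((i for i in by_point[tip] if i not in used), None)
            if nxt is None:
                return
            used.add(nxt)
            a, b = segments[nxt]
            other = b if a == tip else a
            if at_end:
                line.append(other)
            else:
                line.insert(0, other)

    polylines = []
    for index, (a, b) in enumerate(segments):
        if index in used:
            continue
        used.add(index)
        line = [a, b]
        extend(line, at_end=True)
        extend(line, at_end=False)
        polylines.append(line)
    return polylines


def to_svg_path(line):
    """Format one polyline as the 'd' attribute of an SVG <path>."""
    head, *rest = line
    return "M {} {} ".format(*head) + " ".join("L {} {}".format(*p) for p in rest)
```

At a junction where more than two segments meet, this greedy version simply follows whichever unused segment it finds first, so walls that branch end up split across several polylines.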

@Will Martin: Ouch! This is a rather big PDF file containing a lot of internal, hidden file updates (and therefore garbage from a user's point of view). -- 16 pages in 16 MBytes is rather "heavy" for such simple-looking vector graphics. At least 16 different layers ("Optional Content" in PDF parlance), one for each page. I'd rather not wade through this mess with a text editor only... – Kurt Pfeifle – 2011-05-29T12:07:06.267

2@pipitas: Thanks - fair points. Although I wouldn't describe what I did as a "job" - I was just demonstrating (after a couple minutes of playing around) that it was possible with Acrobat. Money back guarantee and all that. ;) – Craig H – 2011-05-30T20:46:00.903

1

Take the PDF made by Craig H and optimize it a bit by running it through Ghostscript. On Windows the command line is:

gswin32c.exe ^
   -o p1test-gs-optimized.pdf ^
   -sDEVICE=pdfwrite ^
   -dPDFSETTINGS=/prepress ^
    p1test.pdf

On Linux/Unix/Mac OS X do:

gs \
   -o p1test-gs-optimized.pdf \
   -sDEVICE=pdfwrite \
   -dPDFSETTINGS=/prepress \
    p1test.pdf

This will bring the size of the file down from about 3,000 kByte to about 60 kByte without losing content. Importing it into Inkscape (or InDesign, Illustrator, ...) should then be much faster.
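The same call can be scripted. Here is a small Python sketch, assuming Ghostscript is installed and reachable on PATH as `gs` (or `gswin32c` on Windows). The helper names `build_gs_command` and `optimize` are mine and not part of Ghostscript; only the command-line switches come from the commands above.

```python
import os
import shutil
import subprocess


def build_gs_command(src, dst, preset="/prepress", exe="gs"):
    """Assemble the Ghostscript argument list used to rewrite a PDF."""
    return [exe, "-o", dst, "-sDEVICE=pdfwrite", f"-dPDFSETTINGS={preset}", src]


def optimize(src, dst):
    """Run Ghostscript and return (size_before, size_after) in bytes."""
    exe = shutil.which("gs") or shutil.which("gswin32c")
    if exe is None:
        raise RuntimeError("Ghostscript not found on PATH")
    subprocess.run(build_gs_command(src, dst, exe=exe), check=True)
    return os.path.getsize(src), os.path.getsize(dst)
```

Calling `optimize("p1test.pdf", "p1test-gs-optimized.pdf")` would then report the file size before and after, which makes it easy to confirm the reduction on your own copy.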

Kurt Pfeifle

Posted 2011-05-27T16:24:24.190

Reputation: 10 024

1

...you could try Photoshop. It reads PDFs, and it's possible the file originated in PS and still has the layers... but it's a very long shot.

aart12

Posted 2011-05-27T16:24:24.190

Reputation: 11

0

In a Linux environment I have used pdfmod to extract all the images in one go. See https://wiki.gnome.org/Apps/PdfMod or, for Ubuntu users, https://apps.ubuntu.com/cat/applications/pdfmod/

To download and install it in Ubuntu, it is sufficient to type sudo apt-get install pdfmod.

  • Start the pdfmod GUI (type in pdfmod in the dashboard or command-line terminal)
  • Open the PDF document
  • Select all the pages (or any that you want to extract the images from)
  • The Edit menu will offer to export all the images that can be extracted from the selected range (export n images, where n is the number found). You can also reach this command by right-clicking the selection to open the context menu.
  • Once you go ahead with this, a new window will open up where you select the location to save the images into.

Hope this helps.

XavierStuvw

Posted 2011-05-27T16:24:24.190

Reputation: 309

Please read How do I recommend software for some tips as to how you should go about recommending software. Provide more than just a link, for example some additional information about the software itself and how it can be used to solve the problem in the question. You could even include some example command lines. – DavidPostill – 2016-04-10T18:23:21.003

@DavidPostill. Thanks for pointing this out. Done, I believe. – XavierStuvw – 2016-04-11T09:20:08.623

Much better ... ;) – DavidPostill – 2016-04-11T09:45:13.137

Now I know what I can demand from answers to my posts :-) – XavierStuvw – 2016-04-11T10:19:23.367

-1

Open the document on your screen and zoom in on the picture to make it as large as possible while all of it is still visible. Press Alt+PrtScn (or the equivalent on your operating system) to take a screenshot of the program window. Now open Paint or your favorite image editor (Photoshop, GIMP, etc.), paste in the picture, and crop out anything you don't want.

Will Gunn

Posted 2011-05-27T16:24:24.190

Reputation: 370

This also includes the icons that are over the background image in the screenshotted images, plus it uses the screen's resolution. There must be a better way. – Zachiel – 2017-01-03T11:00:53.357