I have a PDF file containing maps of the building I work in, here:
http://www.libsys.und.edu/dev/FloorPlans_All.pdf
The original source files have been lost, and I've been asked to extract the map images, preferably without the text and icons that have been overlaid on top of them. This has proven annoyingly difficult.
So far, I have tried the following GUI programs:
- Adobe Reader: lets me select text, but not the background images
- FoxIt PDF Viewer: lets me select text, but not the background images
- XPDF on Ubuntu 10.10: lets me select text, but not the background images
And also the following command-line programs:
- pdfimages: extracts the icons indicating bathrooms just fine, but not the background images
- pdftohtml: same as pdfimages, plus it makes a poorly marked up HTML document
- pdfextract: same as pdfimages
- convert: successfully saved images, but with the text burned into them
I've even tried opening the PDF manually in a text editor and extracting the stream objects by pasting them into a new file and saving it with a .jpg, .png, or .bmp extension (each in turn). Considering how little I know about the internal structure of PDF files, it's no surprise that this didn't work.
So ... is there any way I can retrieve the map images from this thing without also getting the text and icons?
The way I usually solve this kind of task: (1) Use qpdf to convert the binary parts to ASCII as far as possible. (2) Use a text editor to make invisible all the text that I don't want to see on screen or in printouts (this can be done easily, and without damaging the XRef table, by toggling the invisible flag). (3) Re-distill the result with Ghostscript to boil its size down as much as possible. Unfortunately, your file is no longer downloadable, so I can't demonstrate the procedure... – Kurt Pfeifle – 2011-05-28T06:06:02.413
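Step (2) of the comment above could be sketched in Python rather than done by hand in a text editor. This is only a sketch under two assumptions that the comment does not confirm: that the "invisible flag" means text rendering mode 3 (set with the `Tr` operator in a page's content stream), and that the uncompressed file produced in step (1) contains explicit `0 Tr` operators. The helper name `hide_text` is hypothetical.

```python
import re

# Matches a standalone "0 Tr" operator (text rendering mode 0 = normal fill).
# The lookbehind keeps operands such as "10 Tr" or "1.0 Tr" from matching.
_FILL_TR = re.compile(rb"(?<![\w.+-])0(\s+)Tr(?![A-Za-z])")


def hide_text(pdf_bytes: bytes) -> bytes:
    """Rewrite every `0 Tr` as `3 Tr` (assumed: mode 3 = invisible text).

    The replacement has the same length as the original, so byte offsets
    recorded in the cross-reference table stay valid.
    """
    return _FILL_TR.sub(lambda m: b"3" + m.group(1) + b"Tr", pdf_bytes)


# Hypothetical content-stream fragment, for illustration only.
sample = b"BT\n/F1 12 Tf\n0 Tr\n(Room 101) Tj\nET\n"
hidden = hide_text(sample)
```

Because the pattern is applied to the raw bytes, it would also rewrite a `0 Tr` that happens to appear inside a string literal, so it is safest to run it on a copy and check the result in a viewer. If the content streams never set a rendering mode explicitly, there is nothing to replace, and inserting a new operator would change the byte offsets.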