search PDFs with non-standard character encodings

19

6

Some PDF files produce garbage ("mojibake") when you copy text (even though they render OK). This makes it impossible to search them (whatever you search for will not match the garbage).

Does anyone have an easy workaround?

Examples:

  1. TEAC TV manual EU2816STF (yields above problems in Adobe Reader on both Windows and a Mac, but works fine in Preview on a Mac)
  2. Leadtek Winfast PVR2 manual (FTP link; also has problems in Preview on a Mac)
  3. Swann TV tuner card manual (FTP link; also has problems in Preview on a Mac)
  4. Phonedisc license agreement (from the now-defunct DTMS)
  5. Macquarie IFP quarterly fund review
  6. BAN-TACS Small Business Booklet (archived version)
  7. Easterfest 2004 flyer (also from the archive)

I am using Adobe Reader (latest version) for Windows - perhaps an alternative viewer might help? I'm looking for a free solution for Windows. Open-source would be even better.

Edit: The docs for the Multivalent Extract Text tool have a good summary of why things can go wrong, including: (quoted document last modified Jan 2006)

  • Text may not have a Unicode mapping. PDF Type 3 fonts often do not, and TeX DVI has characters that do not have Unicode equivalents.
  • The Unicode encoding may be buggy. Open Office maps some characters into the same Unicode, resulting in apparent letter dropping and doubling.

I guess the ultimate solution in these cases would be to OCR each glyph in a font to figure out what character it really is. Note that this would be easier than OCRing a noisy scanned document because the exact shape of the glyph is available (at infinite resolution since it's a "vector" image).
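As a rough illustration of that idea: once each glyph in a font has been identified (by OCR or by eye), the result is just a per-font lookup table that can be applied to anything copied from the document, or inverted to build a search term. A minimal Python sketch follows. The garbled codes and the table `GLYPH_TO_CHAR` are made up for the example and are not taken from any of the PDFs above.

```python
# Hypothetical per-font table: garbled code -> character the glyph really shows.
# In practice it would be filled in by recognising each glyph shape once.
GLYPH_TO_CHAR = {
    "\x01": "T",
    "\x02": "V",
    "\x03": " ",
    "\x04": "m",
    "\x05": "a",
    "\x06": "n",
    "\x07": "u",
    "\x08": "l",
}


def repair(garbled: str, table: dict[str, str], unknown: str = "\ufffd") -> str:
    """Replace each garbled code with its real character; mark unmapped codes."""
    return "".join(table.get(ch, unknown) for ch in garbled)


def to_garbled(text: str, table: dict[str, str]) -> str:
    """Inverse direction: turn a search term into the garbage a viewer would match.

    Raises KeyError if the term uses a character the table does not cover.
    """
    inverse = {real: code for code, real in table.items()}
    return "".join(inverse[ch] for ch in text)


if __name__ == "__main__":
    copied = "\x01\x02\x03\x04\x05\x06\x07\x05\x08"
    print(repair(copied, GLYPH_TO_CHAR))
```

The inverse function is the part that addresses searching: with the table in hand, a search term can be translated into the document's own garbage and searched for in the viewer as-is.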

Hugh Allen

Posted 2010-03-13T03:05:41.377

Reputation: 8 620

Using clipbrd.exe (see http://www.mydigitallife.info/2008/11/06/how-to-view-windows-clipboard-contents-easily-in-windows-xp-and-vista/) you can see what's on the clipboard. What does that give you?

– Arjan – 2010-03-16T09:16:35.403

@Arjan van Bentem: it gives me exactly the same garbage that I get when pasting into Notepad. – Hugh Allen – 2010-03-16T11:29:47.657

Any details on the format? I'm on a Mac, but I assume Windows would tell you if something is an image or text, and then for text maybe also reveals something about the encoding? – Arjan – 2010-03-16T23:28:08.243

For the TV Manual example: same issue in Adobe Reader 8.1.2 on a Mac, but no problems using the Mac's Preview to copy or search text. Its document properties show "Encoding: Custom" for the fonts (see http://img.skitch.com/20100318-827uckkb5i326eta291f3qig3u.png). Other PDF documents show things like "Encoding: Ansi" or "Roman" and have no issues in Adobe Reader on a Mac (like http://www.adobe.com/education/pdf/type_primer.pdf yields http://img.skitch.com/20100318-tbyjrny9bsg684eqhr7b3au7fb.png).

– Arjan – 2010-03-18T22:43:43.783

Do you have any other examples? I don't know if this implies anything, but comparing your example and mine: file type_primer.pdf yields type_primer.pdf: PDF document, version 1.5, whereas file product_manual_281.pdf gives me product_manual_281.pdf: PDF document, version 1.3. – Arjan – 2010-03-18T22:48:44.987

Hmm, both your new Leadtek and Swann examples give problems in Preview on a Mac as well. (And, in case it matters: both show "Encoding: Identity-H", and PDF document, version 1.3, and both have been created using CorelDRAW.) All your examples show "Clipboard contents: rich text (RTF)" on a Mac, so that's not a lot of info either. – Arjan – 2010-03-19T07:32:50.903

1

Also, http://pdftextonline.com/ cannot fetch the text from the TV Manual nor the Phonedisc document (did not try the others). But sending to Gmail and then viewing as HTML does work for the TV Manual (just like Preview has no issues with that document)...

– Arjan – 2010-03-23T11:41:32.243

Answers

3

Foxit Reader, perhaps?

For what it's worth, I just checked the PDF you linked to with Safari 4.0.4 on Mac OS X 10.6.2, and while there is some Engrish, the PDF renders flawlessly without any onscreen "garbage". Perhaps you're having Unicode issues (more common on Windows than Mac OS)?

Alex

Posted 2010-03-13T03:05:41.377

Reputation: 2 094

The garbage is not on the screen - it is in the clipboard when I copy some text. What happens for you when you try? – Hugh Allen – 2010-03-15T11:14:42.100

@Hugh: Features It is a remote controlled colour television. 100 programmes from VHF, UHF bands or cable channels can be preset. It can tune cable channels. Controlling the TV is very easy by its menu driven system. It has three Euroconnector socket for external de- vices (such as computer, video, video games, audio set, etc.) – Alex – 2010-03-16T00:02:56.270

@Hugh: The bullets aren't copying properly, but the rest is. What section/page/paragraph specifically are you having an issue with, and I'll give that a try? – Alex – 2010-03-16T00:03:28.500

All of it. I'm using Adobe Reader for Windows. I just updated to the latest version which didn't help. +1 thanks for the info. I guess Adobe Reader has a bug not shared by the OSX equivalent. – Hugh Allen – 2010-03-16T06:34:43.007

4

I tried Foxit Reader and it has the same issue. Its installer is also really intrusive, wanting to install a toolbar, modify your homepage, etc. :( – Hugh Allen – 2010-03-16T07:15:41.413

@Hugh: That's a shame -- it didn't use to be that way. My apologies; I'll stop recommending it, then. – Alex – 2010-03-16T19:48:40.087

@Hugh: I'm not using Adobe Reader for Mac OS; such a thing does exist, AFAIK, but the OS's built-in PDF handling is excellent, so most people don't use it. (I'm stunned that not even Windows 7 includes PDF support out of the box.) – Alex – 2010-03-16T19:49:33.190

3

The simplest way to get around this is to open the file in a recent version of Google Chrome, which has a built-in PDF plugin. Then you can use Chrome's search feature to find text, and copy-paste works correctly.

acatalept

Posted 2010-03-13T03:05:41.377

Reputation: 596

2

For the TV Manual example: same issue in Adobe Reader 8.1.2 on a Mac, but no problems using Mac's Preview to copy or search text. Also, sending it to a Gmail account and then choosing "View" and then "Plain HTML" reveals the text. But Adobe Reader doesn't like it.

Its document properties show "Encoding: Custom" for the fonts. Another document shows things like "Encoding: Ansi" or "Roman", and has no issues in either Preview or Adobe Reader on a Mac.

However, both the Leadtek and the Swann examples give problems in Preview on a Mac as well, and in Gmail, and both show "Encoding: Identity-H". The Phonedisc test fails too, with "Encoding: Custom".

Confusing, and not consistent, but on some Adobe forum I found the following explanation for yet another example that shows "Encoding: Custom" (emphasis mine):

After looking inside the PDF it turns out that no usable encoding information is present (neither in the PDF nor in the embedded font data) to derive the meaning of the characters/glyphs that are displayed on the pages in the document.

The fonts actually are all embedded, but in a way that all encoding information has been removed. This is a typical example of a PDF that is syntactically fully compliant with the PDF spec but where important information about the meaning of the text in it has been thrown away during the process of making the PDF. As far as I can tell it would be very difficult to recover the encoding info.
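A crude way to check a file for this situation is to look for the relevant keys in the raw bytes. The sketch below is only a heuristic, under the assumption that the font dictionaries are stored uncompressed (plausible for the PDF 1.3 files discussed here, but not for files that pack objects into compressed streams). It counts `/ToUnicode` entries, the `/Identity-H` encoding and Type 3 fonts across the whole file rather than per font; the function names are mine.

```python
import re
import sys

# Byte patterns for standard PDF names that matter for text extraction.
FONT_KEYS = {
    "ToUnicode": rb"/ToUnicode\b",        # explicit glyph-code -> Unicode map
    "Identity-H": rb"/Identity-H\b",      # codes are raw glyph/CID numbers
    "Type3": rb"/Subtype\s*/Type3\b",     # glyphs drawn by procedures
    "Font": rb"/Type\s*/Font\b",          # font dictionaries (not descriptors)
}


def font_hints(data: bytes) -> dict[str, int]:
    """Count occurrences of each key in the raw (uncompressed) PDF bytes."""
    return {name: len(re.findall(pattern, data)) for name, pattern in FONT_KEYS.items()}


def missing_unicode_map(hints: dict[str, int]) -> bool:
    """Rough guess: Identity-H or Type 3 fonts are present but no ToUnicode map is."""
    return (hints["Identity-H"] > 0 or hints["Type3"] > 0) and hints["ToUnicode"] == 0


if __name__ == "__main__" and len(sys.argv) > 1:
    with open(sys.argv[1], "rb") as handle:
        result = font_hints(handle.read())
    print(result, "-> likely unsearchable" if missing_unicode_map(result) else "")
```

A positive result only says that nothing in the file obviously tells a viewer which characters the glyph codes stand for; a negative one does not prove that copying will work.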

This does not explain why Mac's Preview (and apparently Infix as well) can handle some of the examples when Adobe Reader fails, even with "Encoding: Custom". Maybe Preview has no issues when the exact font happens to be present on the computer itself? Or maybe it's just guessing an encoding, which happens to work for some but not all of the documents?
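To make "guessing an encoding" concrete, here is one very simple kind of guess, sketched in Python: assume every code is off by the same constant and pick the constant that produces the most common English words. This is purely illustrative. Nothing in this thread confirms that Preview or any other viewer works this way, and the offset of 29 and the word list are assumptions chosen for the example; the sample sentence is borrowed from the text quoted in the comments.

```python
COMMON_WORDS = {"the", "and", "is", "to", "of", "it", "can", "be"}


def shift_text(text: str, offset: int) -> str:
    """Add a constant to every code point; out-of-range results become U+FFFD."""
    shifted = []
    for ch in text:
        code = ord(ch) + offset
        shifted.append(chr(code) if 0 <= code < 0x110000 else "\ufffd")
    return "".join(shifted)


def score(candidate: str) -> int:
    """Count whitespace-separated tokens that are common English words."""
    return sum(1 for word in candidate.lower().split() if word.strip(".,;:") in COMMON_WORDS)


def guess_offset(garbled: str, offsets=range(-64, 65)) -> int:
    """Return the offset whose shifted text contains the most common words."""
    return max(offsets, key=lambda off: score(shift_text(garbled, off)))


if __name__ == "__main__":
    plain = "It is a remote controlled colour television and it can tune cable channels"
    garbled = shift_text(plain, -29)  # simulate a font whose codes are 29 too low
    best = guess_offset(garbled)
    print(best, shift_text(garbled, best))
```

A guess like this works only when the damage really is a uniform shift, which would fit the observation that some documents come out fine and others do not.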

Whatever causes this: if passing through Google Docs or Gmail doesn't work, then maybe the easiest (but far from easy) workaround is indeed to save as TIFF and then do OCR. Services like Evernote might do it on the fly (it does OCR on images; I doubt it will do OCR on a PDF).

Arjan

Posted 2010-03-13T03:05:41.377

Reputation: 29 084

-1

The download of file 1 failed for me; file 2 I could open with xpdf, a fast, open-source PDF viewer. I guess it can't handle forms, but for pure text and graphics I prefer it for its fast startup time.

user unknown

Posted 2010-03-13T03:05:41.377

Reputation: 1 623

1

The question was not about "opening" the PDFs, or about "opening with fast startup time". Instead, it was about being unable to copy'n'paste text snippets from the rendered pages. So your answer is probably a good one, but it does not fit this question. – Kurt Pfeifle – 2011-07-28T23:19:42.443

-2

Unfortunately it cannot be helped. PDF documents do not actually contain any letters; they contain the shapes of letters. In other words, instead of reading a letter and drawing it on the screen, Adobe Reader, like any other PDF-reading application, simply draws the vector graphics encoded in the file.

However, some PDF readers come with software that can analyze the shapes and recover the text by using text recognition. It works the same as if you scanned a page of printed text and used software like ABBYY FineReader to convert it back to text, but because vector drawings have effectively unlimited resolution, results are typically much better than for scanned documents.

Some documents can be protected from being converted to text by fooling Adobe Reader. For example, letters can be drawn as several overlapping shapes in such a way that they still look the same visually, while text recognition software fails to recognize the text. Your document is an example of such protection.

One way would be to print the document into an image and let text recognition software recognize it. A higher resolution for the image will improve the quality. This method, however, is not very convenient.

Sergiy Belozorov

Posted 2010-03-13T03:05:41.377

Reputation: 1 704

2

"PDF documents do not actually contain any letters" -- that's not true for most non-scanned documents; see http://en.wikipedia.org/wiki/Portable_Document_Format#Text – Arjan – 2010-03-17T17:27:54.733

Thank you. Interesting information. I had always thought that there is no information about text in a PDF. Nevertheless, it seems like the document provided by Alexander doesn't have text embedded. Or it's also possible that the font used there has a weird character encoding, i.e. the codes do not correspond to the typical ASCII encoding. – Sergiy Belozorov – 2010-03-18T09:10:25.623

2

How could I have copied the text from the PDF if it were just shapes? You're partly right -- it's not rasterized in the PDF (unless it's from a scanned source), but text data IS included. However, the fonts are (usually) also embedded, permitting the included text to be vector-rendered. – Alex – 2010-03-24T05:00:20.407