A scanned document converted into a PDF initially does not contain any text. It is composed of pages, each covered by a full-page pixel image. This image may or may not contain areas that look like the shapes of characters, which human brains identify as letters and "text".
Programmatically, it is not text, only pixels.
To insert real text into a PDF derived from scanned images, you have to run an OCR process. This adds an extra layer of content to the PDF pages. That extra layer contains all identified (or misidentified) characters, placed behind the pixel shapes as real glyphs from a real font. However, these real-text characters carry special PDF markup that tags them as not to be rendered visually by a viewer (or when printing). Their existence shows up only when you search or highlight text (or when you copy and paste areas from the image while the Acrobat TouchUp Text Tool is active).
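To make the "not rendered" markup more concrete: the PDF specification defines a `Tr` operator that sets the text rendering mode, and mode 3 means "neither fill nor stroke", that is, invisible text. It is an assumption here that your OCR tool uses this mechanism for its hidden layer, although it is the common approach. The sketch below scans an already-decompressed page content stream for strings shown while mode 3 is active. The sample stream and the function name `invisible_strings` are made up for illustration; real PDFs usually compress their streams, so you would need a PDF library to extract them first.

```python
import re

# Hypothetical, already-decompressed page content stream:
# a full-page scan image (/Im0 Do) plus two text objects,
# one drawn invisibly (3 Tr) and one drawn normally (0 Tr).
SAMPLE_STREAM = b"""
q 612 0 0 792 0 0 cm /Im0 Do Q
BT /F1 12 Tf 3 Tr 72 700 Td (Hello) Tj ET
BT /F1 12 Tf 0 Tr 72 680 Td (Visible) Tj ET
"""

# Matches either a simple literal string "(...)" (no nested parentheses)
# or any other whitespace-delimited token.
TOKEN = re.compile(rb"\((?:\\.|[^\\()])*\)|[^\s()]+")


def invisible_strings(stream: bytes) -> list[str]:
    """Return literal strings shown with Tj while text rendering mode is 3.

    Simplified sketch: ignores q/Q graphics-state save/restore, the TJ
    operator, hex strings, and escape sequences inside strings.
    """
    mode = 0      # the default text rendering mode is 0 (fill)
    prev = b""    # previous token, i.e. the operand of a one-operand operator
    hidden = []
    for tok in TOKEN.findall(stream):
        if tok == b"Tr" and prev.isdigit():
            mode = int(prev)
        elif tok == b"Tj" and prev.startswith(b"(") and mode == 3:
            hidden.append(prev[1:-1].decode("latin-1"))
        prev = tok
    return hidden


if __name__ == "__main__":
    print(invisible_strings(SAMPLE_STREAM))
```

Only the string drawn under `3 Tr` is reported; the one drawn under `0 Tr` is ordinary visible text and is skipped.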
So, is your real question this: "The OCR results for my scanned PDF documents are sub-optimal. Not all characters are correctly identified. I want to edit the hidden text in order to make the OCR result better. How do I do that with a free tool?"
Edit:
I don't normally use Acrobat. But just now I had the opportunity to look at a 9.1.3 Professional version on a colleague's PC.
First thing I checked: Is it really true that Acrobat doesn't allow editing OCR'd text?
Answer: No, not true. I could use Acrobat's built-in OCR engine to capture the text of a random scanned document which I found via a Google search and downloaded from the web. After that, this text was perfectly editable with the TouchUp Text Tool available via the Advanced Editing menu entry.
Procedure:
- Start Acrobat Professional; load your scanned PDF document.
- In the Document menu, click OCR Text Recognition and select Recognize Text Using OCR.
- Decide which pages you want to OCR in the Recognize Text window.
- Start the process and wait till it's completed.
- Now use the Tools menu, *Advanced Editing*, and start the TouchUp Text Tool.
- From here you'll work it out yourself...
What do you mean exactly by the "text stream"? On a scanned document, the text is an image as well, you can't edit it easily. – Gnoupi – 2010-06-24T15:48:22.833
A PDF file has the potential to store two levels of representation, the actual image and a text part, which is what I (perhaps mistakenly) called the "text stream". When a word processing document is converted to a PDF, this part is created at the same time as the image, and is usually quite accurate. When a scanned document is turned into a PDF, the text part is created by OCR processing of the image. There are also PDF files that have no text part at all.
This part is what you are accessing when you copy and paste text from a PDF document. – Emil – 2010-06-24T20:27:24.120
You should add this info to the question ;-) – Ivo Flipse – 2010-06-25T10:56:25.233
I believe in keeping the question brief and to the point, and leaving any additional or clarifying information in the comments. I've edited the question to make it as clear as possible without getting too wordy. – Emil – 2010-06-26T17:32:38.137