Is there a freeware program for editing the text stream of PDFs?

5

3

PDFs are a great archive format for scanned images, but Acrobat does not allow you to edit the text layer of the document (the part that you can copy and paste from), leaving you with just the raw OCR. Are there any freeware alternatives that let you edit the text layer?

Emil

Posted 2010-06-24T15:42:37.040

Reputation: 495

1What do you mean exactly by the "text stream"? On a scanned document, the text is an image as well, you can't edit it easily. – Gnoupi – 2010-06-24T15:48:22.833

1A PDF file has the potential to store two levels of representation, the actual image and a text part, which is what I (perhaps mistakenly) called the "text stream". When a word processing document is converted to a PDF, this part is created at the same time as the image, and is usually quite accurate. When a scanned document is turned into a PDF, the text part is created by OCR processing of the image. There are also PDF files that have no text part at all.

This part is what you are accessing when you copy and paste text from a PDF document. – Emil – 2010-06-24T20:27:24.120

2You should add this info to the question ;-) – Ivo Flipse – 2010-06-25T10:56:25.233

1I believe in keeping the question brief and to the point, and leaving any additional or clarifying information in the comments. I've edited the question to make it as clear as possible without getting too wordy. – Emil – 2010-06-26T17:32:38.137

Answers

1

Free PDF editors are very scarce.

The only free one I know is OpenOffice with Sun PDF Import Extension.

From the techsupportalert article A PDF File Allows Editing in 100% Layout Accuracy:

OpenOffice with Sun PDF Import Extension produces a hybrid PDF / ODF file. The file created will have a normal .pdf file extension. By itself, it is a PDF file and can be viewed by any PDF viewer such as Adobe Reader, PDF-XChange Viewer or Foxit Reader.

On top of this, it contains a source ODF file, which can be opened with OpenOffice directly from the PDF file for editing without losing any layouts, bookmarks, hyperlinks or formats.

To create a hybrid PDF file, run OpenOffice with Sun PDF Import Extension installed, select "File", choose "Export as PDF", a PDF Options window like the screen shot will open, then tick "Create hybrid file" and click "Export".

This hybrid PDF file saves you from keeping two separate file formats, as it combines the two into one. It is ready for sharing and viewing with a PDF reader, yet it can be opened for perfect editing just the way a normal OpenOffice file can be. It is probably a good idea to give the hybrid file a name ending in "-odf.pdf" to distinguish it from a normal PDF file.

Sun PDF Import Extension is compatible with OpenOffice.org (3.0 or later) or StarOffice 9.

harrymc

Posted 2010-06-24T15:42:37.040

Reputation: 306 093

Great, thanks! This looks very promising, even if rather cumbersome. – Emil – 2010-06-24T19:04:33.607

1

A scanned document converted into a PDF initially does not contain any text. It's composed of pages each covered by a full-page pixel image. This image may or may not contain areas that look the same as shapes of characters, identified by human brains as letters and "text".

Programmatically, it is not text, only pixels.

In order to insert real text into a PDF derived from scanned images, one can only employ an OCR process. This will add an extra layer of content to the PDF pages. That extra layer would contain all identified (or mis-identified) characters behind the pixel shapes as real glyphs from a real font. However, these real-text characters carry special PDF markup that tells a viewer not to render them visually (on screen or when printing). Their existence would show up only when searching (or highlighting) text (or when trying to copy'n'paste areas from the image while the Acrobat Text Touchup Tool is active).
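To make the "not rendered" markup a bit more concrete, here is a minimal Python sketch. It assumes the OCR tool hides its text with text rendering mode 3 (the PDF operator `3 Tr`, meaning neither fill nor stroke), which is one common way to do it but not the only one. The sample content stream, the font name `/F1` and the helper `invisible_strings` are made up for illustration; real streams are usually compressed, may use hex strings or `TJ` arrays, and need font-specific decoding, none of which is handled here.

```python
import re

# Hypothetical, hand-written page content stream: one full-page image,
# followed by OCR text drawn in rendering mode 3 (invisible).
SAMPLE_STREAM = """
q 612 0 0 792 0 0 cm /Im0 Do Q
BT
3 Tr
/F1 12 Tf
72 700 Td
(Helo wrld) Tj
ET
"""

# A token is either a literal string "(...)" without nested parentheses,
# or a run of characters that are not whitespace or parentheses.
TOKEN = re.compile(r"\((?:\\.|[^\\()])*\)|[^\s()]+")


def invisible_strings(stream):
    """Return literal strings shown with Tj while rendering mode 3 is active."""
    mode = 0          # default text rendering mode: fill the glyphs
    previous = None   # the token just before the current one (the operand)
    hidden = []
    for token in TOKEN.findall(stream):
        if token == "Tr" and previous is not None and previous.isdigit():
            mode = int(previous)
        elif token == "Tj" and previous is not None and previous.startswith("("):
            if mode == 3:
                hidden.append(previous[1:-1])  # drop the surrounding parentheses
        previous = token
    return hidden


print(invisible_strings(SAMPLE_STREAM))
```

In this toy stream the misrecognized string "Helo wrld" is what a copy-and-paste would return, even though the viewer only shows the scanned image. The sketch ignores graphics-state save/restore (`q`/`Q`), so treat it as a way to look at the idea, not as a parser.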

So, is your real question this: "The OCR results for my scanned PDF documents are sub-optimal. Not all characters are correctly identified. I want to edit the hidden text in order to make OCR result better. How do I do that with a free tool?" ?


Edit: I don't normally use Acrobat. But just now I had the opportunity to look at a 9.1.3 Professional version on a colleague's PC.

First thing I checked: Is it really true that Acrobat doesn't allow editing OCR'd text?

Answer: No, not true. I could use Acrobat's builtin OCR engine to capture the text of a random scanned document which I google-searched and downloaded from the web. After that, this text was perfectly editable with the TouchUp Text Tool available via the Advanced Editing menu entry.

Procedure:

  1. Start Acrobat Professional; load your scanned PDF document.
  2. In the Document menu, click OCR Text Recognition and select Recognize Text Using OCR.
  3. Decide which pages you want to OCR in the Recognize Text window.
  4. Start the process and wait till it's completed.
  5. Now use the Tools menu, "Advanced Editing", and start the TouchUp Text Tool.
  6. From here you'll work it out yourself...

Kurt Pfeifle

Posted 2010-06-24T15:42:37.040

Reputation: 10 024

Yes, that is more or less what I want to do. The result of the OCR process that Acrobat carries out is saved as a separate layer (after which it is just text, albeit hidden), and I would like to edit that layer. At this point, it makes little sense to keep referring to it as an OCR result, especially when comparing it to, e.g., a PDF created from a Word document, where the text layer is not an OCR result at all. – Emil – 2010-06-24T22:06:06.250

This is an interesting question. I never needed to think about it, and I don't know enough. As soon as I have some time on hand I'll do some research (like study the relevant parts of the PDF spec) in order to find out more. Could well be that these hidden OCR'd text strings are made to be not editable at all. But maybe there is a workaround... – Kurt Pfeifle – 2010-06-25T12:21:32.487

(After all, there are tons of OCR'd PDF documents out there. And OCR working with 99% accuracy is already regarded as "good". But from the point of view of a high school teacher, any text that has 10 spelling errors for each 1000 characters would earn one of the worst school grades you can imagine...) – Kurt Pfeifle – 2010-06-25T12:22:41.530
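The arithmetic behind that remark can be checked with a tiny Python sketch. The function name `expected_errors` and the 3000-character page are assumptions for illustration only; the 99% figure is the one quoted above, and the calculation treats accuracy as a simple per-character rate.

```python
def expected_errors(characters, accuracy_percent):
    """Expected count of misrecognized characters (integer arithmetic)."""
    return characters * (100 - accuracy_percent) // 100


print(expected_errors(1000, 99))   # 10 errors per 1000 characters
print(expected_errors(3000, 99))   # 30 errors on a hypothetical 3000-character page
```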

There are programs that do this, but no freeware solutions.

Regarding the quality of the OCR, that's not really relevant. In some situations, a single error, let's say a very embarrassing one or one that leads to serious misunderstandings, would be enough to make a solution to this necessary. – Emil – 2010-06-26T17:30:44.787

This "answer" just confusingly restates the question. :-( You ask "Is your real question this" — yes, that's the real question, because that's what "text layer" means. And this is not an answer to the question. It's ok to post such answers before a clarification, but now please delete it, because it's a waste of time for those reading it. – ShreevatsaR – 2010-08-28T19:23:57.237

@ShreevatsaR: I had hoped it clarified the question. For the original question seems to have assumed that there is a text layer in any scanned document. Which there is not. That layer only gets added through an OCR process. A complaint about "Acrobat not allowing to edit..." may be rooted in not understanding this context. So my question "Is your real question..." was justified... There is a difference between text in a PDF derived from a Word document, and OCR'd text, which Emil seems to be in denial of.... --- And what do you find confusing with my explanations? – Kurt Pfeifle – 2010-08-28T21:28:07.527

@ShreevatsaR: Also, you should keep in mind that my answer was (1) to Emil's question as originally stated, and (2) a response to some comments that seemed to be unaware of the existence of text layers added by OCR. The original question was edited later and now reads a bit differently from what I responded to.... – Kurt Pfeifle – 2010-08-28T21:54:09.860

I don't see how the question assumes that. It's clearly starting with some PDF file that does have a text layer — maybe the questioner got it from somewhere else, and didn't scan it themselves, or whatever — somehow, the PDF file happens to have a text layer. Now how do you edit it? That's the question. Anyway, again, the only part of your edited answer that's relevant is Step 5 (use the TouchUp Text tool), because your steps 1 to 4 are about OCR itself (creating the text layer), which is irrelevant. I'm still looking for a freeware program that lets you edit the text layer of a PDF… – ShreevatsaR – 2010-08-29T01:32:30.033

BTW, Emil is not "in denial of" anything as you claim — see his comment on the question: "the text part is created by OCR processing of the image. There are also PDF files that have no text part at all. This part is what you are accessing when you copy and paste text from a PDF document". So really, your clarification is not necessary. :-) – ShreevatsaR – 2010-08-29T01:40:32.733

@ShreevatsaR: I'm not a native speaker of English. So the expression "in denial of" was possibly not exactly matching the thought I intended to express. What I had in mind when writing this was Emil's statement: "At this point, it makes little sense to keep referring to it as an OCR result". Since the OCR result behavior seems to be at the center of his described problem, it makes all sense to always keep this difference in mind when discussing PDF text editing (as compared to the situation with "normal" PDFs). – Kurt Pfeifle – 2010-08-29T09:49:41.370

@ShreevatsaR: Are you aware of all different variants for text representation (esp. regarding "layers" a.k.a. "optional content groups", "hidden" attributes and "transparency" attributes which all may come into play here) defined in the 1000+ pages of the official PDF file format spec? If not you're excused to deem my described steps 1-4 for creating a text layer (do we know it's a "layer"?!) as "irrelevant". But they are not. What I describe works for me (and you, if you follow all my steps). It may not work with PDFs Emil has. Because his may have been created by a different procedure... – Kurt Pfeifle – 2010-08-29T10:02:16.683

@ShreevatsaR: You are looking for a freeware program that lets you edit PDF text layer(s)? You may as well ask for a freeware program that lets you edit PDF texts at all. Not all text is in separate layers, you know? -- Anyway, you asked for it: you can try PDFEdit (http://pdfedit.petricek.net/en/index.html). But be warned: it's not as easy as using Word to edit a .doc. You need to know the internals of the PDF format and be familiar with the spec to be able to use it in ways that make sense. It's more like editing the DOM tree of a Web-2.0 application with a text editor.

– Kurt Pfeifle – 2010-08-29T10:13:06.077

@ShreevatsaR: you say the question is "clearly starting with some PDF file that does have a text layer" and "the PDF file happens to have a text layer". That's not a fact. It's just an assumption of Emil (and you). PDFs (not even scanned ones) don't necessarily use different "layers" for text storing. Text may share the same "layer" with other PDF objects, but it may still be "hidden" or "invisible"... -- – Kurt Pfeifle – 2010-08-29T10:43:55.923

@ShreevatsaR: ... And, BTW, the first version of Emil's question used the expression *"text stream"*. Which is technically more correct because there certainly exists at least one text stream, or (more frequently) several, in any PDF that contains text at all. "Stream" is a well-defined technical keyword from the PDF spec. So even for cases where text is in a separate "layer", that layer will store its text objects as "streams". – Kurt Pfeifle – 2010-08-29T10:51:21.677
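For readers who want to see what a "stream" looks like in practice, here is a rough Python sketch using only the standard library. It assumes the streams of interest are compressed with the FlateDecode filter (zlib), which is common but not guaranteed, and it ignores the `/Length` entry, encryption, object streams and other filters. The function name `decoded_streams` and the tiny hand-built object are illustrative assumptions, not part of any tool mentioned in this thread.

```python
import re
import zlib

# Matches the body between the "stream" and "endstream" keywords.
STREAM = re.compile(rb"stream\r?\n(.*?)endstream", re.DOTALL)


def decoded_streams(pdf_bytes):
    """Yield the inflated body of every stream that zlib is able to decode."""
    for match in STREAM.finditer(pdf_bytes):
        try:
            # decompressobj tolerates the end-of-line bytes before "endstream"
            body = zlib.decompressobj().decompress(match.group(1))
        except zlib.error:
            continue  # not Flate-compressed (for example a JPEG image)
        yield body


# Hypothetical single-object fragment, built by hand for the demonstration.
content = b"BT 3 Tr /F1 12 Tf 72 700 Td (Helo wrld) Tj ET"
packed = zlib.compress(content)
fragment = (
    b"4 0 obj\n<< /Length " + str(len(packed)).encode() +
    b" /Filter /FlateDecode >>\nstream\n" + packed + b"\nendstream\nendobj\n"
)

for body in decoded_streams(fragment):
    print(body.decode("latin-1"))
```

On a real file you would pass the bytes read from disk instead of `fragment`. Reading streams this way is safe; writing an edited stream back is a different matter, since lengths and the cross-reference table have to stay consistent, which is why a dedicated tool such as the PDFEdit mentioned above is the more realistic route.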

The point is—the situation Emil and I are in is—this: there is a PDF file, which happens to have page images, and also a text stream/layer. (I don't know why you say "not a fact": surely it is a fact about the particular file(s) Emil (and I) wants to edit, and no claim is being made about all PDF files.) So how do you edit this text layer/stream? Your solution doesn't work for me (and likely not for Emil either) because I don't want to do OCR again; I just want to edit the existing text stream that happens to be already present in my file. (It's not relevant how it got there!) – ShreevatsaR – 2010-08-30T20:41:00.610

Gheee... you're unteachable. I'll give up (after this last attempt). "Not fact, but assumption" referred to deeming the text to be in a separate layer (when it could be in the same one as the scanned image). You didn't even make an attempt to verify whether the PDF has different layers and whether one is set to not be displayed in default viewer settings (and tell us which it is). "Because I don't want to do OCR again"... well, you don't even want to come closer to a solution. If you tried, based on results you'd know more about where to tackle next (even if it's not to be your final solution). – Kurt Pfeifle – 2010-08-30T21:16:23.350

That is, Adobe Acrobat Pro isn't free or cheap… but I did find a copy on a friend's machine (on Mac OS X) and tried the TouchUp Text tool: it seems slightly unfriendly, as it doesn't even show the text being edited (it still shows the page image, rather than the text stream). Maybe I haven't figured out the interface. Anyway, I'll try PDFEdit, thanks. [BTW, I completely agree with you that it's good to keep in mind that we're talking about a separate, probably OCR-generated, text stream, and not the visible page image part — I just disagree that it's worth reiterating what's already known. :)] – ShreevatsaR – 2010-08-30T21:18:19.530

Ah, you said "extra layer" in your explanation, so I assumed "text layer" was the correct term rather than "text stream", sorry. :-) Ok, let's call it text stream: the "fact" is, the PDF consists of images, and also you can select from it and copy-paste text. This much is fact. From this I conclude that the text is stored separately from the image, and I want to edit it. You're right, it's probably not a "layer" but "text stream"… whatever it is, the question is only how to edit it. – ShreevatsaR – 2010-08-30T21:22:23.910

0

It appears what you mean by "text stream" is the text data from the PDF. Not sure. If that is the case, I use the standard clipboard and any text-only editor (I use KEDIT because of its column editing capabilities) to capture the data and edit it. The problem is that you lose any formatting with this, and sometimes with tables the order of the data gets messed up. But, for simple captures, it works.

Dave

Posted 2010-06-24T15:42:37.040

Reputation: 526

Yes, that is what I mean. I think I saw the term here somewhere and thought it apt.

I'm afraid I wasn't clear enough. What I want is not to edit the text from the PDF, but in the PDF, i.e., to produce a PDF with a good, edited text part, so that a person copying and pasting from the document would get good and accurate text, while still displaying the document in its original form. – Emil – 2010-06-24T19:02:16.243

PDF comes in more than one flavor. The Adobe version is "encrypted" and as such can't be edited without Adobe software. PDF is an open format that is defined in text which is editable by any text editor. Maybe someone can give you a reference to the PDF standard. – Dave – 2010-06-25T00:30:30.100

1

It seems rather far-reaching to say that Adobe PDFs can't be edited without Adobe software. Certainly most third-party PDF editing software manages to do that, or do you mean that they do it by reading the file and converting it into another format?

I wasn't looking to write the program myself, so the standard will be of little use. Wikipedia (http://en.wikipedia.org/wiki/Pdf) has plenty of information and links for understanding the standard.

– Emil – 2010-06-27T14:24:03.740