How to copy text out of a PDF without losing formatting?

40

16

When I copy text out of a PDF file and into a text editor, it ends up mangled in a variety of ways. Formatting like bold and italics is lost; soft line breaks within a paragraph are converted to hard line breaks; hyphens that break a word over two lines are preserved even when they shouldn't be; and single and double quotes are replaced with ? signs.

Ideally, I'd like to be able to copy text from a PDF and have formatting converted to HTML codes, "smart quotes" converted to " and ', and line breaks done properly. Is there any way to do this?

Colen

Posted 2010-10-11T21:13:58.040

Reputation: 872

May be related: https://superuser.com/a/455278/13787

– Steven R. Loomis – 2018-07-03T21:40:49.553

Word 2013 can open PDFs. Not perfect, but doable. – pratnala – 2012-12-01T14:56:48.990

Answers

54

Firstly, you have to understand what a PDF is. PDFs are designed to mimic a printed page, and they are designed only as an output format, not an input format. A PDF is basically a map containing the exact location of characters (individual letters or punctuation, etc.) or images. In most cases, a PDF does not even store information about where one word ends and another begins, much less things like soft breaks vs. hard breaks for paragraph endings.

(A few recent PDFs do store some information about this stuff, but that's a new technology, and you'd be lucky to find PDFs like that. Even if you did, your PDF viewer might not know about it.)

Anyway, it's up to your software to implement some kind of "artificial intelligence" to extract merely from the locations of individual characters what is a word, what is a paragraph, and so on. Different software is going to do this better than others, and it's also going to depend on how the PDF was made. In any case, you should never expect perfect results. Having the output PDF is not the same as having the source document. Far better to try to obtain that if you can.
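To make the guessing concrete, here is a minimal sketch of how a tool might rebuild lines and word breaks from nothing but character positions. The `Glyph` tuple, the function name and the two thresholds are hypothetical simplifications for illustration; real extractors use far more elaborate heuristics.

```python
from typing import List, Tuple

# Hypothetical simplification of what a PDF stores: (x, y, width, char).
Glyph = Tuple[float, float, float, str]

def glyphs_to_lines(glyphs: List[Glyph],
                    gap_ratio: float = 0.3,
                    line_tol: float = 2.0) -> List[str]:
    """Guess lines and word breaks purely from glyph positions."""
    # Top to bottom (PDF y grows upward), then left to right.
    ordered = sorted(glyphs, key=lambda g: (-g[1], g[0]))
    lines: List[List[Glyph]] = []
    for g in ordered:
        # Same line if the baseline is within line_tol of the line's first glyph.
        if lines and abs(lines[-1][0][1] - g[1]) <= line_tol:
            lines[-1].append(g)
        else:
            lines.append([g])

    result = []
    for line in lines:
        line.sort(key=lambda g: g[0])
        text = line[0][3]
        for prev, cur in zip(line, line[1:]):
            gap = cur[0] - (prev[0] + prev[2])
            # A gap wider than a fraction of the previous glyph is read as a space.
            if gap > gap_ratio * prev[2]:
                text += " "
            text += cur[3]
        result.append(text)
    return result
```

If the thresholds are wrong for a given font or layout, words get glued together or split apart, which is why different tools give different results on the same file.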

The standard solution to your kind of problem is to use Adobe Acrobat Professional (the expensive one, not the free reader) to convert the PDF to HTML. Even that is not going to get perfect results.

There is free software that can be used to extract text from PDFs with some of the formatting intact, but again, don't expect perfect results. See, e.g., calibre (which can convert to RTF format), pdftohtml/pdfreflow or the AbiWord word processor (with all import/export plugins enabled). There's also a PDF import plugin for OpenOffice.
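As a rough sketch of scripting one of these tools, the snippet below wraps the `pdftohtml` command from Poppler. It assumes `pdftohtml` is installed and on your PATH; the function names are made up for illustration, and the exact output file naming depends on the options and version, so check the tool's help before relying on it.

```python
import shutil
import subprocess
from typing import List

def pdftohtml_command(pdf_path: str, out_html: str) -> List[str]:
    """Build the command line (hypothetical helper, not part of Poppler)."""
    # -noframes: write one plain HTML document instead of a frameset
    # -i: ignore images, since only the text and basic formatting are wanted
    return ["pdftohtml", "-noframes", "-i", pdf_path, out_html]

def convert_pdf_to_html(pdf_path: str, out_html: str) -> None:
    """Run pdftohtml, failing loudly if it is missing or returns an error."""
    if shutil.which("pdftohtml") is None:
        raise RuntimeError("pdftohtml not found on PATH")
    subprocess.run(pdftohtml_command(pdf_path, out_html), check=True)
```

The resulting HTML usually keeps bold and italics as tags, but line breaks still tend to follow the printed layout.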

But please don't expect perfection with any of these results. You're going against the grain here. PDF just is not meant as an editable input format.

frabjous

Posted 2010-10-11T21:13:58.040

Reputation: 9 044

Feedback 5 years later: no big improvement. I had to convert it to HTML (using Acrobat X), then insert each row into an MS Word table. (Saving for Word, Excel or txt just messed everything up, and copy-paste from Chrome did not work at all either.) Still waiting for (very) smart software. – JinSnow – 2015-11-06T07:03:08.007

Right-clicking on the table and choosing "copy with formatting" works too, with the limits mentioned above – JinSnow – 2015-11-06T07:10:41.127

Because this is the accepted answer, I suggest that you also mention the (newer) option that pratnala wrote in his comment: open the PDF directly from Word 2013. On some PDFs I tried, it gave better results than all the above software. – BornToCode – 2017-05-17T00:51:54.640

8

Another option is to download and start using the free PDF viewer, Foxit (it's good). Then you can 'Save As' and choose .txt to convert it to a text file. That will preserve all the formatting. Dunno whether you can do the same in Adobe because I stopped using it a while ago when I converted to Foxit.

chris

Posted 2010-10-11T21:13:58.040

Reputation: 81

I use Foxit, and just tried it, I wouldn't say it preserved formatting. And all I wanted was decent line endings and each paragraph as a paragraph. – pgr – 2015-12-31T14:48:55.470

Using txt you will lose all formatting: fonts, bold, italics, colors, and of course more advanced options – skan – 2017-02-22T16:21:10.110

Foxit Reader worked great for me – Michael Tranchida – 2018-05-02T10:42:33.597

"Save as... Text" worked for me with several free pdf viewers. – Jeff – 2013-12-18T19:23:30.840

5

Open your PDF file with a browser (Google Chrome and Firefox are tested), then copy your text there.

harsini

Posted 2010-10-11T21:13:58.040

Reputation: 61

Sadly this didn't work for me in Firefox. – Reb – 2016-09-06T11:50:46.707

close. FF kept font sizes at least. Chrome failed miserably, not even line-feeds. – nd34567s32e – 2018-02-20T13:51:53.000

As of Oct 2019 opening a PDF in Chrome and copy/pasting to a text editor at least preserves end-of-line (but, sadly, not any leading white space on the lines). – DocOc – 2019-10-03T12:50:26.217

5

There is a very good online tool called Sejda. It deals with advanced PDF manipulation. There is no software to download. As it is a new online tool, it is currently still in beta. It allows you to extract text from a PDF, as well as providing a myriad of other PDF functions.

http://www.sejda.com/

A brief video review of Sejda's functions was done on 14th November 2012 by Revision3; it can be found here:

http://revision3.com/tzdaily/sejda-online-pdf

Simon

Posted 2010-10-11T21:13:58.040

Reputation: 4 193

1

One could still download the command line tool: http://www.sejda.org/download/ (I don't think it allows extracting text with formatting?)

– Arjan – 2012-12-01T14:41:44.117

I have already recommended Sejda above, Arjan – Simon – 2012-12-01T14:56:17.000

Huh? I just meant: you're saying it's an online tool, but one can also download the same thing. Also, looking into it further: I don't think it will preserve the formatting, like was asked for? – Arjan – 2012-12-01T15:16:50.807

I am well aware preserving of format was requested, but unless you try you will never know. – Simon – 2012-12-01T15:41:21.607

As it's a free tool with a wealth of features, and it's not even out of beta, there is nothing to lose by trying. With time its feature set will probably be extended, but for now I can't really complain. – Simon – 2012-12-01T15:47:24.323

4

You can use Adobe Acrobat Pro for this.

For tables: With Acrobat 9/10 there was a select tables feature. With Acrobat X you can just click Save As > Spreadsheet > Excel. It even concatenates pages into one long spreadsheet. Awesome feature.

For text: A similar feature exists for exporting to MS Word. Save As > Word > Word Doc.


user156787

Posted 2010-10-11T21:13:58.040

Reputation: 41

0

I found this very useful (Remove Line Breaks):

Here is a useful trick to quickly resolve this without having to remove all the line breaks manually. Basically, all it does is automatically replace all the unwanted line breaks with a single space, making all the text run together into a single paragraph:

1- copy the text you want from the PDF.

2- paste into a new Word document.

3- click “edit” then “replace”

4- make sure you’re in the “find what” field

5- click “more” then “special”

6- select “paragraph mark” (top of the list)

7- click into the “replace with” field

8- press the space bar once

9- click “replace all”

10- click “ok” then close the “find & replace” box.
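For text that is going into a plain editor rather than Word, roughly the same cleanup can be scripted. The sketch below is only one possible approach: it assumes paragraphs in the pasted text are separated by blank lines, and the function name is made up for illustration. Unlike the replace-all trick above, it keeps paragraph breaks, rejoins words split by a hyphen at a line end, and straightens "smart quotes".

```python
import re

# Map common "smart" quotes to their plain ASCII equivalents.
QUOTES = str.maketrans({
    "\u2018": "'", "\u2019": "'",
    "\u201c": '"', "\u201d": '"',
})

def clean_pasted_text(text: str) -> str:
    """Unwrap hard line breaks inside paragraphs of text pasted from a PDF."""
    text = text.replace("\r\n", "\n").translate(QUOTES)
    # Rejoin words hyphenated across a line break: "format-\nting" -> "formatting".
    # Caveat: a real hyphen at a line end ("well-\nknown") is also removed.
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", text)
    # Blank lines mark paragraph breaks; other newlines become single spaces.
    paragraphs = re.split(r"\n\s*\n", text)
    return "\n\n".join(" ".join(p.split()) for p in paragraphs if p.strip())
```

If the PDF viewer does not put blank lines between paragraphs when copying, everything collapses into one paragraph, just as with the Word steps.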

sky-light

Posted 2010-10-11T21:13:58.040

Reputation: 133

0

Pressing Ctrl + 6 in Foxit toggles between displaying the original file as a normal PDF and as text. (With a little fiddling with the zoom level of the text mode, there's not much jump in position back and forth between reading and copying.)

Stoatly

Posted 2010-10-11T21:13:58.040

Reputation: 1

-1

You could copy from Adobe Reader into MS Excel, format the table the way you want, and then copy and paste from Excel. This solution works great. You don't need to buy an expensive copy of Adobe Acrobat Professional.

Murali Sastry

Posted 2010-10-11T21:13:58.040

Reputation: 11

The question discusses text. Do you think this would be a good general solution for text, including converting formatting to HTML codes? – fixer1234 – 2015-12-11T05:24:57.993

-1

I was trying to save the text and format of a PDF that was organized in a table. In Acrobat Professional, I realized there is a 'Save As' option that allows saving as an Excel document. This worked well for my needs. I also noticed there is a Save As Word document option as well. I didn't try it though.

Douglas Thompson

Posted 2010-10-11T21:13:58.040

Reputation: 11

This duplicates user156787's answer. – fixer1234 – 2016-01-23T01:52:08.263