Convert pdf to text ignoring structure

4

1

I am looking for a tool that can batch-convert PDF files to text.

I don't want the tool to try to maintain any sort of structure; it should just print the text line by line, with spaces between words.

All the tools I have encountered so far (pdftotext, pdf2text, etc.) try to separate out structures and end up making a mess. The original documents were poorly structured, and after scanning a lot of the structures are mixed up. I want the most consistent output across all my PDFs, and the best way at present seems to be extracting each word line by line.

My purpose is to extract text that contains key-value pairs and compare it to data in a database.
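To make the goal concrete, here is a minimal sketch of the comparison step. The question does not say how the key-value pairs are written, so the `key: value` line shape is an assumption, and a plain dict stands in for a database row; `parse_pairs` and `diff_against_record` are hypothetical helper names.

```python
import re

# Assumption: each pair sits on its own line as "key: value".
# Adjust the separator to match the real documents.
PAIR = re.compile(r"^\s*([^:]+?)\s*:\s*(.+?)\s*$")


def parse_pairs(text: str) -> dict[str, str]:
    """Collect key/value pairs from plain text, skipping lines that do not match."""
    pairs = {}
    for line in text.splitlines():
        match = PAIR.match(line)
        if match:
            # Lower-case the key so "Name" and "name" compare equal.
            pairs[match.group(1).lower()] = match.group(2)
    return pairs


def diff_against_record(pairs: dict[str, str], record: dict[str, str]) -> dict:
    """Return {key: (extracted, expected)} for every key whose value differs."""
    return {
        key: (pairs.get(key), expected)
        for key, expected in record.items()
        if pairs.get(key) != expected
    }
```

Because only whole lines are matched, this kind of check does not depend on the column or table structure of the PDF, which is why flat line-by-line text is enough for it.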

rogermushroom

Posted 2011-05-06T16:07:53.990

Reputation: 143

Why not use pdftotext (etc.), then strip things like extra whitespace and various characters you don't care about? Perhaps it would help to state what purpose you need the text for, in case someone has a good alternative to what you seek. – Brian Vandenberg – 2011-05-06T16:29:38.800

@Brian Vandenberg Cheers for the comment. By default pdftotext attempts to format the archive I have and makes a mess. However, I have just tried it with the layout flag and then replaced each run of spaces with a single character, and I get something closer to what I am looking for. I will update the question with the purpose. – rogermushroom – 2011-05-06T16:33:27.517
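The approach from the comments (run pdftotext with the layout flag, then squeeze runs of spaces) can be sketched as a small batch script. This is only an illustration: it assumes the `pdftotext` command-line tool is installed and on the PATH, and the function names and directory arguments are hypothetical.

```python
import re
import subprocess
from pathlib import Path


def normalize(text: str) -> str:
    """Collapse each run of spaces/tabs to one space and drop empty lines."""
    lines = (re.sub(r"[ \t\f]+", " ", line).strip() for line in text.splitlines())
    return "\n".join(line for line in lines if line)


def convert_all(src_dir: str, dst_dir: str) -> None:
    """Run `pdftotext -layout` on every PDF in src_dir and save normalized text."""
    out = Path(dst_dir)
    out.mkdir(parents=True, exist_ok=True)
    for pdf in sorted(Path(src_dir).glob("*.pdf")):
        # Passing "-" as the output file makes pdftotext write to stdout.
        result = subprocess.run(
            ["pdftotext", "-layout", str(pdf), "-"],
            capture_output=True,
            check=True,
        )
        text = result.stdout.decode("utf-8", errors="replace")
        (out / (pdf.stem + ".txt")).write_text(normalize(text), encoding="utf-8")
```

Calling something like `convert_all("pdfs", "txt")` would then write one flattened `.txt` file per PDF, with words on each line separated by a single space.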

Answers

2

If you want to batch convert PDF files to text then take a look at my company's product, Debenu PDF Tools Pro.

It has three different options for converting PDF files to text which should give you the output you're looking for. The first option shown in the screenshot below will just extract the text line by line as it finds it in the PDF without formatting it. The second option tries to preserve the original layout.

It's a tool designed for batch processing. There's a fully functional 14-day trial, after which it reverts to Lite mode, which isn't feature-limited but does limit the number of files that can be processed per day.


Rowan

Posted 2011-05-06T16:07:53.990

Reputation: 942