I am looking for a tool that can batch-convert PDF files to text.
I don't want the tool to try to maintain any sort of structure. It should just print the text line by line, with single spaces between words.
All the tools I have encountered so far (pdftotext, pdf2text, etc.) try to separate out structures and end up making a mess. The original documents were poorly structured, and after scanning many of the structures are mixed up. I want the most consistent output across all my PDFs, and the best way at present seems to be extracting each word line by line.
My purpose is to extract the text, which contains key-value pairs, and compare it to data in a database.
Why not use pdftotext (etc.), then strip things like extra whitespace and various characters you don't care about? Perhaps it would help to state what purpose you need the text for, in case someone has a good alternative to what you seek. – Brian Vandenberg – 2011-05-06T16:29:38.800
@Brian Vandenberg Cheers for the comment. By default pdftotext attempts to format the archive I have and makes a mess. However, I have just tried it with the -layout flag and then replaced each run of spaces with a single character, and I get something closer to what I am looking for. I will update the question with the purpose. – rogermushroom – 2011-05-06T16:33:27.517
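For anyone wanting to try the approach from the comments on a whole folder, here is a minimal sketch of it: run pdftotext with -layout on each PDF, then collapse every run of spaces or tabs to a single space. It assumes pdftotext (from Poppler or Xpdf) is on the PATH. The function names are made up for illustration, and dropping blank lines is my own assumption about what "line by line" should mean, so adjust as needed.

```python
import re
import subprocess
from pathlib import Path


def normalize_whitespace(text: str) -> str:
    """Collapse runs of spaces/tabs to one space and drop empty lines.

    Dropping empty lines is an assumption, not something the question requires.
    """
    lines = []
    # splitlines() also splits on the form feed pdftotext emits between pages
    for line in text.splitlines():
        collapsed = re.sub(r"[ \t]+", " ", line).strip()
        if collapsed:
            lines.append(collapsed)
    return "\n".join(lines)


def pdf_to_plain_text(pdf_path: Path) -> str:
    """Run pdftotext in layout mode and return whitespace-normalized text."""
    # "-layout" keeps words on their original physical lines;
    # "-" as the output file makes pdftotext write to stdout.
    result = subprocess.run(
        ["pdftotext", "-layout", str(pdf_path), "-"],
        capture_output=True,
        encoding="utf-8",
        errors="replace",
        check=True,
    )
    return normalize_whitespace(result.stdout)


def convert_directory(pdf_dir: Path) -> None:
    """Write a .txt file next to every .pdf file found in pdf_dir."""
    for pdf_path in sorted(pdf_dir.glob("*.pdf")):
        text = pdf_to_plain_text(pdf_path)
        pdf_path.with_suffix(".txt").write_text(text, encoding="utf-8")
```

Calling `convert_directory(Path("scans"))`, where `scans` is a placeholder folder name, would leave one cleaned .txt file beside each PDF. Whether the result is consistent enough for key-value matching depends on the scans themselves.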