megalomania
April 25th, 2004, 04:52 PM
I want to create PDF files with text-under-image OCR’d results. The images to be OCR’d are digital photographs, and they have to be quite large to give good OCR accuracy; we are talking 300-400 K each. Once the images have been made into PDF pages, there is no way to reduce their size to the same extent I can with other image editing software. I can cut the images down to a tenth or less of their original size, but then I can’t OCR those.
As far as I know there is no way to remove the text layer from a PDF document and import it under a new PDF document (in this case made from low resolution images).
DJVU files on the other hand can export their text layer as an XML document. I could still get good OCR accuracy from the larger images, export the text, and then create a new DJVU file made of much smaller sized images, and import the XML text.
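For what it is worth, here is a rough Python sketch of the one step in that workflow I would expect to need handling by hand: if the smaller images have fewer pixels than the originals, the word positions in the exported text would have to be scaled down by the same factor before import. This assumes the exported XML stores positions as comma-separated pixel integers in a `coords` attribute, which I have not verified for every tool; the function name `scale_coords` and the sample XML are made up for illustration.

```python
import xml.etree.ElementTree as ET


def scale_coords(xml_text, factor, attr="coords"):
    """Return xml_text with every integer in each `attr` attribute scaled.

    Assumption: positions are comma-separated pixel integers in `attr`.
    The order of the values does not matter, since all are scaled alike.
    """
    root = ET.fromstring(xml_text)
    for elem in root.iter():
        value = elem.get(attr)
        if value is None:
            continue  # element carries no position information
        scaled = [str(round(int(n) * factor)) for n in value.split(",")]
        elem.set(attr, ",".join(scaled))
    return ET.tostring(root, encoding="unicode")


# Hypothetical exported fragment: one word box from a full-size image.
sample = '<LINE><WORD coords="100,200,300,150">hello</WORD></LINE>'
# Images shrunk to half the pixel dimensions, so halve the coordinates.
print(scale_coords(sample, 0.5))
```

If the smaller images keep the same pixel dimensions and only use heavier compression, this step would not be needed at all.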
The trouble is I can’t seem to find any software that enables you to create or edit DJVU files. There is something called Document Express (desktop, pro, and enterprise editions) that might do the trick, but I can’t find any, eh, let us say free copies. I think only the enterprise edition lets you export and import text as XML, although I could be wrong on this.
Are there any freeware appz out there that let you create and edit DJVU files in this way? A DJVU to PDF converter would also be handy.
Naturally if anyone knows how to extract OCR’d text layers from one PDF document and import it into another I would like to know.