Batch convert PDFs to searchable PDFs

3

2

I'm looking for a way to convert thousands of PDFs to searchable PDFs. I've used a program called "PDF Create Assistant" that came with Nuance's eCopy software. However, it can't process a whole folder tree: you have to go into each subfolder, select the files to convert, and then move on to the next folder.

What is another way to convert a large number of PDFs to searchable PDFs?

I haven't had any suggestions yet. Surely there must be a way to batch convert PDFs?

boilers222

Posted 2012-09-06T20:38:03.323

Reputation: 131

I couldn't find a user's guide on the site. Can you use eCopy from the command line, or do you have to use a GUI? If you can call it from the command line, then you could easily write a script that traverses all of your documents and calls the application to convert each one. – None – 2012-09-06T20:48:57.287

You have to use the GUI. At least, I haven't found a way to call it from the command line. If anyone knows how to use it from the command line, let me know. Thanks! – None – 2012-09-06T20:50:24.617

Is there an eCopy user's manual or other technical info online somewhere? Command line functionality, if it exists, would probably be described there. If you can give me a pointer, I'll take a look. – None – 2012-09-06T20:53:07.923

Not that I've found. – None – 2012-09-07T15:24:41.663

Answers

3

Use CPYCONVERTER.EXE in the BIN folder from the command line. (Wildcards are supported in eCopy Ver.9-Paperworks.) The help output below is from eCopy Desktop 8.5.

Command Line Cpy Converter Version 8.5 (Build 0.116)
 Copyright (c) 1992 - 2004. All rights reserved.

 Converts CPY to CPY, CPY to TIF or TIF to CPY

Usage:
 cpyconverter.exe [-?] -S=<source path> -D=<dest path> [-P] [-E] [-Q] [-B] [-O] [-T3/T4/TC/C/U]

Note:
 Wildcards are not supported.  Full paths must be used for source and destination

Switches:
--------------------
-?                      : This menu
-Q                      : Turn off logging.
-P                      : Converter pauses after conversion.
-E                      : Converter pauses if there is an error.
-B                      : Converter burns-in Blackout/Whiteout markups (if applicable).
-O                      : Converter OCRs document and creates searchable text (if applicable).
-S="<SOURCE PATH>"      : The path of the file to convert.
-D="<DESTINATION PATH>" : The path of the newly converted file.
-P=<PASSWORD>           : Password for encrypting and decrypting documents.
--------------------
 * If the source document is encrypted CPY converter will attempt to decrypt it to the destination document with the supplied password.
 * If the source document is not encrypted CPY converter will attempt to encrypt the destination document using the supplied password.
 * Please note you cannot encrypt/decrypt tif documents.

-<Conversion Type>      : The type of conversion to be done(T3, T4, TC, C, U)
--------------------
* T4 - Convert CPY to TIF Group4
* T3 - Convert CPY to TIF Group3
* C  - Convert TIF(Any group) to CPY
* U  - Convert CPY to CPY

Ex.1 cpyconverter.exe -S="C:\My Dir\test.tif" -D="C:\My Dir\test.cpy" -C
Convert Tiff to cpy

Ex.2 cpyconverter.exe -S="C:\My Dir\test.cpy" -D="C:\My Dir\test.tif" -T3
Convert Cpy to Tif Group 3

Ex.3 cpyconverter.exe -S="C:\My Dir\test.cpy" -D="C:\My Dir\test.tif" -T4
Convert Cpy to Tif Group 4
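
Because this version does not accept wildcards, each file needs its own command line. Below is a minimal sketch that generates those command lines from a list of full Windows paths read from standard input (for example a file produced by dir /s /b *.tif). It only prints the commands and does not run anything. The function name gen_cpy_commands is made up for this example, a POSIX shell on Windows (such as Git Bash or Cygwin) is assumed, and combining -C with -O follows the usage line above but has not been tested against eCopy.

```shell
#!/bin/sh
# Hypothetical helper: read full Windows paths (one per line) from stdin and
# print one cpyconverter.exe command per TIF file. Nothing is executed.
gen_cpy_commands() {
    cr=$(printf '\r')
    while IFS= read -r src || [ -n "$src" ]; do
        src=${src%"$cr"}              # drop a trailing CR from CRLF list files
        case $src in
            *.tif|*.TIF|*.tiff|*.TIFF) ;;
            *) continue ;;            # skip anything that is not a TIF
        esac
        dest="${src%.*}.cpy"          # same folder and name, .cpy extension
        # -C (TIF to CPY) and -O (OCR) are taken from the help text above;
        # using them together is an assumption.
        printf 'cpyconverter.exe -S="%s" -D="%s" -C -O\n' "$src" "$dest"
    done
}
```

The printed lines can be reviewed and saved to a .bat file before running them.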

EJW

Posted 2012-09-06T20:38:03.323

Reputation:

0

On Linux

First you need to OCR the PDFs that have not been OCRed yet. Here is a fairly simple way to find all PDFs that cannot be grepped and OCR them.

I noticed that a PDF file without any fonts is usually not searchable, so we can use pdffonts to detect such files.

The first two lines of the pdffonts output are the table header, so a searchable file produces more than two lines of output. Knowing this, we can create a script:

gedit check_pdf_searchable.sh

Then paste this:

#!/bin/bash
#set -vx
# pdffonts prints a two-line header, so fewer than 3 lines means no fonts were found
if (( $(pdffonts "$1" | wc -l) < 3 )); then
    echo "$1"
    pypdfocr "$1"
fi

then make it executable

chmod +x check_pdf_searchable.sh

Then print and OCR every non-searchable PDF in the current directory (the loop keeps file names with spaces intact):

for f in ./*.pdf; do ./check_pdf_searchable.sh "$f"; done

Or in the directory and its subdirectories:

find . -type f -iname "*.pdf" -exec ./check_pdf_searchable.sh {} \;
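
The pdffonts check described above can also be used to only list candidate files without converting them. This is a minimal sketch, assuming pdffonts from poppler-utils is installed. The function names font_rows, needs_ocr, and list_unsearchable are made up for this example.

```shell
#!/bin/sh
# Print how many font rows appear in pdffonts output read from stdin.
# The first two lines are the table header, so they are skipped.
font_rows() {
    tail -n +3 | grep -c .
}

# Succeed when pdffonts reports no fonts for the file (likely image-only).
# Note: a file that pdffonts cannot read at all is also reported here.
needs_ocr() {
    [ "$(pdffonts "$1" 2>/dev/null | font_rows)" -eq 0 ]
}

# List candidate files under a directory (default: current directory)
# without running any OCR. File names containing newlines are not handled.
list_unsearchable() {
    find "${1:-.}" -type f -iname '*.pdf' |
        while IFS= read -r f; do
            if needs_ocr "$f"; then
                printf '%s\n' "$f"
            fi
        done
}
```

Running list_unsearchable on the top-level folder first gives a list to review before committing to a long OCR run.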

Eduard Florinescu

Posted 2012-09-06T20:38:03.323

Reputation: 2 116

0

The easiest way is to use an online OCR API. The OCR.space API supports creating searchable PDFs. The service has a free tier of 25,000 conversions per month.

You can then automate this with PowerShell, batch, or any other scripting language. For example, trigger the conversion from a batch file with cURL:

curl -H "apikey:helloworld" --form "file=@yourpdf.PDF" --form "language=eng" --form "isOverlayRequired=true" https://api.ocr.space/Parse/Image
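
To apply the same request to every PDF in a folder, the call can be wrapped in a loop. This is a minimal sketch that reuses only the endpoint and form fields from the command above. The names API_KEY, DRY_RUN, ocr_one, and ocr_dir are made up for this example. The raw response is saved as-is next to each file; extracting a searchable PDF from it depends on API options not shown here.

```shell
#!/bin/sh
# Hypothetical settings for this sketch; replace the demo key with your own.
API_KEY="${API_KEY:-helloworld}"
ENDPOINT="https://api.ocr.space/Parse/Image"

# Upload one PDF and save the raw API response as <name>.json.
# With DRY_RUN set, the curl command is printed instead of executed.
ocr_one() {
    pdf=$1
    set -- curl -sS -H "apikey:${API_KEY}" \
        --form "file=@${pdf}" \
        --form "language=eng" \
        --form "isOverlayRequired=true" \
        -o "${pdf%.*}.json" "$ENDPOINT"
    if [ -n "${DRY_RUN:-}" ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

# Process every PDF directly inside a directory (default: current directory).
ocr_dir() {
    for f in "${1:-.}"/*.[pP][dD][fF]; do
        [ -e "$f" ] || continue       # the glob matched nothing
        ocr_one "$f"
    done
}
```

Setting DRY_RUN=1 before calling ocr_dir prints the commands so they can be checked before any upload happens.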

Nic Endo

Posted 2012-09-06T20:38:03.323

Reputation: 101