Command-line OCR in Windows 7

5

3

What are some command-line OCR utilities that will work in Windows 7 64-bit?

Phenom

Posted 2010-06-06T22:17:11.697

Reputation: 6 119

For Linux/OSX, check: OCR on Linux systems

– kenorb – 2014-09-19T12:01:59.727

Answers

4

I think Tesseract is the best (free) command-line based OCR software. Unfortunately there doesn't appear to be a Windows 7 64-bit binary available so you'd have to compile it yourself; here are the instructions for doing so (taken from a comment on the Tesseract FAQ page):

  1. Download tesseract 2.04. Unpack it. In this example I've unpacked to C:\projects\tesseract-2.04. Windows 7 still doesn't understand .tar.gz out of the box. My recommendation is to get a copy of 7-Zip.

  2. Download your required language files. I need German and English. I unpack these to the tessdata subdirectory, C:\projects\tesseract-2.04\tessdata.

  3. Install libtiff. On my (64-bit) system the suggested install directory is C:\Program Files (x86)\GnuWin32. Underneath this directory are a bunch of subdirectories containing files we'll need to compile tesseract with tiff support, namely include, bin and lib.

  4. Add C:\Program Files (x86)\GnuWin32\bin to your PATH environment variable so that the output tesseract.exe can find the libtiff dll. Restart.

  5. Open the VC solution (tesseract.sln).

  6. Change the solution configuration to "Release" mode. Note that if you later change back to Debug mode, you'll need to set up all the following again...

  7. In the solution explorer right click the solution node (Solution 'tesseract') and click "Properties". Change to "Configuration Properties" and select "Release" configuration from the dropdown at the top of the window.

     Navigate to: Tools -> Options -> Projects and Solutions -> VC++ Directories. Here we'll be adding the full paths for the subdirectories lib and include from the libtiff install so that VC can find the required header (.h) and static library (.lib) files. In this example they are:

       $(ProgramFiles)\GnuWin32\include
       $(ProgramFiles)\GnuWin32\lib

     as I'm using an environment variable. I could however just have written them as C:\Program Files (x86)\GnuWin32\include.

     Change the "Show Directories For" dropdown to "Include files". Add the following:

       $(ProgramFiles)\GnuWin32\include

     Now change the "Show Directories For" dropdown to "Library files". Add the following:

       $(ProgramFiles)\GnuWin32\lib

  8. Now open the project properties window for the tesseract project (not the solution). In the solution explorer right click the tesseract project and click properties. Navigate the horrendous list of options to Configuration Properties -> C/C++ -> Preprocessor and add HAVE_LIBTIFF to the list of Preprocessor Definitions. This causes a bunch of #includes to be enabled in the code.

  9. You also want to add an "Additional dependency". Go to the "Additional dependencies" section of the project properties and add libtiff.lib.

  10. Build the solution. Watch the error list. If you get a bunch of LNK2019 (unresolved external symbol) errors, that means the linker can't find something tesseract references: you're missing a reference to one of the paths from libtiff. If you get an error mentioning mt.exe, you've possibly encountered a bug in the SDK. Just try building again. See http://connect.microsoft.com/VisualStudio/feedback/ViewFeedback.aspx?FeedbackID=106634 for more info.
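Before building, the libtiff prerequisites from steps 3, 4, 7 and 9 can be checked with a short script. This is only a minimal sketch and not part of the original instructions: the function name missing_libtiff_prereqs is made up for illustration, and it assumes the header is called tiff.h and the import library libtiff.lib (the name used in step 9). Adjust both if your libtiff package differs.

```python
import os


def missing_libtiff_prereqs(root, path_value):
    """Return a list of human-readable problems; empty means all checks passed.

    root       -- libtiff install directory (step 3)
    path_value -- contents of the PATH environment variable (step 4)
    """
    problems = []

    # Steps 7 and 9: the compiler needs the header, the linker needs the .lib.
    # Both file names are assumptions, not taken from a libtiff manifest.
    for parts in (("include", "tiff.h"), ("lib", "libtiff.lib")):
        candidate = os.path.join(root, *parts)
        if not os.path.isfile(candidate):
            problems.append("missing file: " + candidate)

    # Step 4: <root>\bin must be on PATH so tesseract.exe can load the DLL.
    bin_dir = os.path.normcase(os.path.normpath(os.path.join(root, "bin")))
    on_path = [
        os.path.normcase(os.path.normpath(entry))
        for entry in path_value.split(os.pathsep)
        if entry
    ]
    if bin_dir not in on_path:
        problems.append("not on PATH: " + bin_dir)

    return problems
```

With the install directory from step 3, the call would be missing_libtiff_prereqs(r"C:\Program Files (x86)\GnuWin32", os.environ.get("PATH", "")). An empty list means the paths the build depends on are in place.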

If/when the solution builds successfully, you'll have a tesseract.exe file in the same directory as the tesseract solution file. Drag your multipage compressed tiff here and try running tesseract. For example, if your tiff is called in.tif and you want to output text to out.txt, and the document's language is German, then your command line would look like:

    tesseract.exe in.tif out -l deu

The output file will have .txt appended to it by tesseract. If you're just recognizing English text then you can leave off the -l option, as tesseract assumes "eng" if you don't specify anything. If your tif file has the file extension .tiff, then tesseract will crap itself thusly:

    C:\projects\tesseract-2.04>tesseract.exe in.tiff out -l deu
    Tesseract Open Source OCR Engine
    name_to_image_type:Error:Unrecognized image type:in.tiff
    IMAGE::read_header:Error:Can't read this image type:in.tiff
    tesseract.exe:Error:Read of file failed:in.tiff
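One way around this is to hand tesseract a copy of the file whose name ends in .tif. The Python sketch below does that. ensure_tif_extension is a hypothetical helper name, and the approach assumes, based only on the error output above, that this build picks the image type from the file extension.

```python
import os
import shutil


def ensure_tif_extension(image_path):
    """Return a path ending in .tif for the given TIFF file.

    A file already named *.tif is returned unchanged. A *.tiff file is
    copied next to the original under the shorter extension, so the
    original stays untouched.
    """
    base, ext = os.path.splitext(image_path)
    if ext.lower() == ".tif":
        return image_path
    if ext.lower() != ".tiff":
        raise ValueError("not a TIFF file name: " + image_path)

    tif_path = base + ".tif"
    # Do not overwrite an existing .tif that happens to share the base name.
    if not os.path.exists(tif_path):
        shutil.copyfile(image_path, tif_path)
    return tif_path
```

Copying rather than renaming is a deliberate choice here: it costs disk space but leaves the source file where other tools expect it.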

Hopefully (fingers crossed, heh) you've now got an OCR'd out.txt file sitting in C:\projects\tesseract-2.04.
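The command line above can also be driven from a script. This is a rough sketch, assuming tesseract.exe can be found from the working directory or via PATH. build_tesseract_command and ocr_to_text are illustrative names, only the arguments shown in this answer (image, output base name, optional -l) are used, and reading the result as UTF-8 is an assumption as well.

```python
import subprocess


def build_tesseract_command(image, output_base, lang=None, exe="tesseract.exe"):
    """Build the argument list: tesseract.exe <image> <output_base> [-l <lang>]."""
    command = [exe, image, output_base]
    if lang:  # with no -l option, tesseract assumes "eng"
        command += ["-l", lang]
    return command


def ocr_to_text(image, output_base, lang=None):
    """Run tesseract on one image and return the recognized text."""
    # check=True raises CalledProcessError if tesseract exits with an error.
    subprocess.run(build_tesseract_command(image, output_base, lang), check=True)
    # tesseract appends .txt to the output base name itself.
    with open(output_base + ".txt", encoding="utf-8") as handle:
        return handle.read()
```

Calling ocr_to_text("in.tif", "out", lang="deu") corresponds to the German example, and leaving out lang corresponds to the English default.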

jwaddell

Posted 2010-06-06T22:17:11.697

Reputation: 2 626

1

JOCR is the only one I know of that works on Windows and is command-line based. View their webpage here

Daisetsu

Posted 2010-06-06T22:17:11.697

Reputation: 5 195

0

There is an installer for Tesseract for Windows 7. I have just installed it and got it to do OCR on a small image. The result was terrible, but I hope that with some tuning I can improve the results.

andrew

Posted 2010-06-06T22:17:11.697

Reputation: 847