New OCR software, open source


jellywerker
September 5th, 2006, 10:38 AM
HP, Google, and UNLV have recently released the Tesseract OCR software. Originally developed by HP from 1985 to 1995, it is supposedly one of the most accurate OCR programs available. Google has updated some of the old code and put it on SourceForge. I hope this makes scanning easier for those who have yet to crack another OCR application.

Article: http://developers.slashdot.org/article.pl?sid=06/09/04/2215210&from=rss

Link: http://sourceforge.net/projects/tesseract-ocr

megalomania
September 8th, 2006, 11:07 PM
How well can we expect an 11-year-old OCR app that has been abandoned by its patron to perform compared to modern commercial OCR software? I read about it on bookpeople, and quite frankly it leaves a lot to be desired.

This was forwarded to me; I'm not on the list.
> Subject: [BP] HP's open-source Tesseract OCR, any experience?
> Tom Breuel pointed out to me a new project up at sourceforge, called
> "tesseract-ocr", with "lvincent" listed as admin -- presumably Luc
> Vincent (a document image processing expert now at Google). There are
> no files there, but they do seem to be at the University of Nevada -
> Las Vegas ISRI site, at
> http://www.isri.unlv.edu/downloads/ocr-prerelease-20051201.tar.bz2,
>
> I was wondering if any adventurous explorer had tried it out yet, and
> if so, what the results were like?

It is only configured to build under MSVC++6 for Windows.
It only accepts uncompressed bitonal TIFFs.
It's command-line only. No GUI.
It performed abysmally on the provided testimage.tif
But it did build. :)
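
For anyone unsure what "bitonal" input means in practice, here is a minimal Python sketch, not anything taken from the Tesseract code. It thresholds 8-bit grayscale pixels to black or white and packs them 8 pixels per byte, most significant bit first, with each row padded to a whole byte, which is the layout an uncompressed bilevel TIFF strip uses by default. The function names, the fixed cutoff of 128, and the choice of 1 = black are all assumptions made for illustration; writing the actual TIFF header and tags is left out.

```python
def to_bitonal(gray_rows, threshold=128):
    """Map 8-bit grayscale pixels to 1 (black) or 0 (white).

    The fixed cutoff is an assumption; real scans often need an
    adaptive threshold instead.
    """
    return [[1 if px < threshold else 0 for px in row] for row in gray_rows]


def pack_rows(bit_rows):
    """Pack each row of 0/1 pixels into bytes, 8 pixels per byte.

    Bits are stored most significant bit first, and every row is
    padded with zero bits up to a whole number of bytes.
    """
    packed = []
    for row in bit_rows:
        out = bytearray((len(row) + 7) // 8)
        for i, bit in enumerate(row):
            if bit:
                out[i // 8] |= 0x80 >> (i % 8)
        packed.append(bytes(out))
    return packed


if __name__ == "__main__":
    # Hypothetical 2x4 grayscale image: dark pixels become 1, light become 0.
    gray = [[0, 255, 100, 200], [250, 10, 240, 20]]
    bits = to_bitonal(gray)
    print(bits)
    print([row.hex() for row in pack_rows(bits)])
```

A grayscale or color scan would need a step like this (or the equivalent in an image editor) before the program would accept it.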

Also in that directory you mentioned, there is a utility called ocrspell,
which is crufty code that I can't get to configure properly on a modern Linux
system. To give you an idea, it is hardcoded for ispell 3.1.08 and its
dependent files, and most systems are using aspell 0.50.x or 0.60.x (ispell
3.1.20 or higher). The (other) problem here is that the dictionaries are very
different from what the program expects.

Granted, this was a fairly quick look, but I don't see this as being useful
very soon without a lot of gnashing of teeth.

jellywerker
September 9th, 2006, 12:19 AM
I did not know this. Thanks for sharing before someone wasted time trying to use it and then correcting its mistakes.

megalomania
September 10th, 2006, 06:37 PM
Apparently this software has been deemed unworthy compared to commercial OCR apps. The keyword here is commercial. As far as open source goes, OCR apps are in short supply. For those corporations, non-profits, and individuals that play by the rules, open source and freeware are the only affordable solutions. For the rest of us, thanks to the wonder that is warez, even the most expensive enterprise edition with unlimited site licenses of a commercial OCR app is still free.

If Google decides to spend some of their discretionary money on developing Tesseract, say a billion dollars (change under the CEO’s couch cushion for Google), then maybe this could go somewhere. I imagine Google has more financial capital than both Abbyy and Nuance (formerly ScanSoft) combined.

Developing software to clean up lower resolution scans and geometrically distorted images would be a better use of such money. OCR is essentially considered a solved problem except for handwriting.