Practical OCR solution for converting a large book to a digital format?

12

9

I was over at my grandparents' place this past weekend. My grandmother pulled out this giant (~1400 page) book of her family history going back to 1630 or so. Giant nerd that I am, I thought it would be slick to have all the information stored in a database and available from the web. I can handle all the web programming and regular expressions and whatnot, but what I don't know is the best way to get the text from book to computer.

I know some kind of OCR will be necessary. From the little research I've done, it seems like my options are:

  1. take a picture of every page with a camera then process the pictures with OCR software
  2. use a scanner to scan each page, then process with OCR software
  3. use some kind of hand held device, like this.

Does anyone have any ideas about the best way to tackle this problem? I don't want to destroy the book, because as far as I know, it can't be replaced. This is probably the only time I'm ever going to scan a large book, so I don't think I want to spend more than $250 on any kind of device. I don't mind some manual effort here (I realize this will most likely take months), but I'd like to find the most efficient method possible.

Note about the book: It's only about 20 years old, so it's in pretty good shape. It's monochrome and the pages haven't begun to yellow. Since it is so large though, I worry about possible shadows when the text gets down close to the binding.

user11219

Posted 2009-09-15T13:08:23.220

On a side note, if the book is only 20 years old and the information goes back to the 1600s, where is the original source material? That might be nice to capture as well! – Craig – 2009-09-15T17:34:28.740

Yeah, that would be cool too. I'm going to see if I can track down the original author. – None – 2009-09-15T22:32:19.153

Answers

8

I came across this on Lifehacker quite some time back, and it has been one of my top DIY projects ever since.

Replace the iPhone with any camera or imaging device, and you get a stack of nice high-res JPEGs ready for you to OCR with any software, even (urks!) MS Office... ;)

Cheap. Effective. DIY. You can't beat an idea like this.

EDIT: The comments raised some points about shadows, page curling, etc. These are easily resolved, as anyone who has photocopied library texts will know.

Add multiple light sources to illuminate the book and eliminate the shadows.

Slant the book at 90 degrees so the pages don't curl towards the binding in the middle. This also preserves the binding.

I'll see if I can give an example and set one up myself.

EDIT 2 : uploaded sample of how you should hold the book, and also notice the light source from the left.

caliban

Reputation: 18 979

I was unable to find the original files, and the author did not answer my emails. I used this scan stand from Thingiverse instead. I did some tests and had some issues using Tesseract, maybe because of irregular lighting without flash and bright reflections with flash. I got an error rate of around 5-10% with this scan stand and near-perfect results with a proper scanner. Because I have a lot of books to scan, I decided to buy a proper scanner.

– miguelmorin – 2018-07-27T21:47:23.797
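An error figure like the 5-10% quoted above can be measured rather than estimated by eye. The following is a minimal sketch, assuming you have hand-typed a correct transcription of a sample page to compare against; the function names are illustrative and the metric (character error rate via edit distance) is one common choice, not necessarily the one used for the figure in the comment.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions,
    or substitutions needed to turn string a into string b."""
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute, or keep a match
            ))
        previous = current
    return previous[-1]


def character_error_rate(reference: str, ocr_output: str) -> float:
    """Edit distance divided by the length of the hand-checked reference."""
    if not reference:
        raise ValueError("reference text must not be empty")
    return levenshtein(reference, ocr_output) / len(reference)
```

Running this on a handful of sample pages for each capture method (camera stand versus flatbed) gives a like-for-like comparison before committing to 1400 pages.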

That is so cool! Wish I could do that :) – alex – 2009-09-15T13:16:10.213

However, you need a real camera of good quality to do that, or you will end up with pictures you can't use, especially from a very old book. So it's far from cheap. – Gnoupi – 2009-09-15T13:18:55.310

Very interesting. I wonder how this would work with a book, considering the shadows there would probably be between pages. – None – 2009-09-15T13:19:20.553

If the pages are bent or have shadows you will have problems getting the OCR software to recognize the letters. – alex – 2009-09-15T13:20:53.453

Add multiple light sources to illuminate the book and eliminate the shadows. Slant the book at 90 degrees so the pages don't curl towards the binding in the middle. It's simple common sense; we did that all the time back in college when taking photos of library texts. – caliban – 2009-09-15T13:34:08.380

@Gnoupi - you don't necessarily need a 56 megapixel Phase One Leaf system to do OCR. In fact, a cheap 5 megapixel camera will do just fine. Set the ISO to 50 or 100 for low noise, put it in delayed capture mode, fire, and let it capture. 5 megapixels is beaucoup for OCR work. – caliban – 2009-09-15T16:11:13.183

I'm going to give this, or some slight variation of it, a try on 20 or so pages and see how practical it's going to be. Thanks for the tips! – None – 2009-09-19T13:18:10.000

3

From what I know, ABBYY makes the best OCR software, but it's not free. You could try the trial version of ABBYY FineReader and see whether it works for you.

alex

Reputation: 16 172

1

You will need to capture the images somehow; various services exist to do this for you. You will also need someone who is familiar with the content of the text to proofread, as OCR is not perfect yet, especially with anything handwritten.

Others are discussing your question here: http://ask.metafilter.com/92506/scan-my-books

Some companies will do this for you: http://www.scandexsystems.com/BookScanning2.html http://www.kirtas.com/index.php?option=com_content&view=article&id=13&Itemid=48 http://www.ristech.ca/product.html

Some Free Software: http://download.cnet.com/Image-To-PDF-OCR-Converter-PDF-E-Book-Maker/3000-6675_4-10392924.html

NickSentowski

Reputation: 189

1

For a project this large and this important to you and your family, a DIY book scanner may be the way to go. Some designs even sport page turners: http://www.diybookscanner.org/ This one doesn't natively support OCR, but it shoots 600 pages an hour, and you can run the images through OCR afterwards: http://hackaday.com/2011/07/18/diy-book-scanner-processes-600-pageshour/
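To put the 600 pages an hour figure in context for the roughly 1400-page book from the question, here is a small worked calculation. It covers capture time only; OCR and proofreading are not included, and the function name is just for illustration.

```python
def capture_hours(pages: int, pages_per_hour: float) -> float:
    """Hours needed to photograph `pages` pages at a steady capture rate."""
    if pages_per_hour <= 0:
        raise ValueError("pages_per_hour must be positive")
    return pages / pages_per_hour


# ~1400 pages (from the question) at 600 pages/hour (from the linked article)
print(f"{capture_hours(1400, 600):.1f} hours")
```

At that rate the photography itself is an afternoon's work, so most of the months the asker expects would go into proofreading rather than capture.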

Xaq Fixx

Reputation: 11

0

You may want to see if a university near you has a whole book scanner and then beg/bribe a student to put your book through it.

Chris Nava

Reputation: 7 009

0

I would recommend a flatbed scanner rigged for book scanning or a whole book scanner as mentioned by Chris.

If you can, get your images compiled into TIFF format, as that is the industry standard when it comes to document management systems.

For doing OCR, I would recommend Tesseract OCR, as it is the framework Google built upon for their books project.
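As a rough sketch of what batch OCR with Tesseract could look like, the snippet below drives the `tesseract` command-line tool from Python, one page image at a time. It assumes Tesseract is installed and on the PATH, that the scans are TIFF files in a single directory, and that the text is English; the directory layout and function names are hypothetical.

```python
import shutil
import subprocess
from pathlib import Path


def build_tesseract_command(image: Path, output_base: Path, lang: str = "eng") -> list[str]:
    """Command line for one page: tesseract <image> <outputbase> -l <lang>.
    Tesseract appends '.txt' to the output base itself."""
    return ["tesseract", str(image), str(output_base), "-l", lang]


def ocr_directory(image_dir: Path, text_dir: Path, pattern: str = "*.tif") -> list[Path]:
    """Run Tesseract over every matching image, in filename order,
    and return the paths of the text files it should have written."""
    if shutil.which("tesseract") is None:
        raise RuntimeError("tesseract executable not found on PATH")
    text_dir.mkdir(parents=True, exist_ok=True)
    outputs = []
    for image in sorted(image_dir.glob(pattern)):
        output_base = text_dir / image.stem
        subprocess.run(build_tesseract_command(image, output_base), check=True)
        outputs.append(Path(str(output_base) + ".txt"))
    return outputs
```

Naming the scans with zero-padded page numbers (for example `page_0001.tif`) keeps the sorted order the same as the book's page order, which makes later proofreading against the original easier.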

Greg Buehler

Reputation: 1 150

0

At work we use a Plustek Optibook 3600 book scanner, which is about $250.
It's basically a standard flatbed scanner, but the glass plate goes right to the edge of the scanner so that the book page can be placed flat on the plate. This eliminates the spine shadow and avoids damaging books.

pelms

Reputation: 8 283

Have you ever tried using that with a really thick book? This one is about 3 inches thick. – None – 2009-09-15T22:30:52.293

If you can open it 90° with the page reasonably flat it should be fine. Try on a table edge. – pelms – 2009-09-16T08:33:49.260

0

While it sounds tempting to automate the process, you may want to invest rather more time and work, since this particular book is a personal matter. OCR will do the bulk of the work, but you'll have to proofread page by page and compare with the original. Keep in mind that the author's mistakes are part of the deal; do not correct them (create footnotes if you feel so inclined). Take your time and don't put yourself under pressure. Book scanning is donkey work, but thoroughness pays, and you'll end up with a fine digital copy of your family's chronicle. Good luck with your endeavour :)

Molly7244

Actually, that's a really good point. I hadn't considered making the original content of the book available digitally, but as long as I have it, I may as well make a PDF version. – None – 2009-09-15T22:24:29.420

Why PDF? Think HTML. And you might as well keep the original scans, although you'll end up with a massive amount of data. – None – 2009-09-15T22:44:25.853

My idea was to have all the birth / lineage info in a database, so I could make a web frontend that would make navigating / searching / updating easier. I plan on working any typos out of that version. Also, I have some cousins that aren't in there and it would be nice to add them.

I was thinking pdf because it would be nice to have something that would look like the original book with the original page numbers and such intact. That version I'd leave alone and keep all the typos from the book. – None – 2009-09-15T23:43:40.070
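The birth and lineage database described above could start from something like the following. This is only a sketch under assumptions: the table layout, column names, and helper functions are hypothetical, SQLite is used because it ships with Python, and the `source_page` column is there so every corrected record can be traced back to the untouched scan of the original page.

```python
import sqlite3

# Hypothetical schema: one row per person, with optional links to parents.
SCHEMA = """
CREATE TABLE IF NOT EXISTS person (
    id          INTEGER PRIMARY KEY,
    full_name   TEXT NOT NULL,
    birth_year  INTEGER,
    father_id   INTEGER REFERENCES person(id),
    mother_id   INTEGER REFERENCES person(id),
    source_page INTEGER  -- page of the original book the record came from
);
"""


def open_database(path: str = ":memory:") -> sqlite3.Connection:
    """Open (or create) the database and make sure the table exists."""
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    return conn


def add_person(conn, full_name, birth_year=None, father_id=None,
               mother_id=None, source_page=None) -> int:
    """Insert one person and return the new row id."""
    cur = conn.execute(
        "INSERT INTO person (full_name, birth_year, father_id, mother_id, source_page) "
        "VALUES (?, ?, ?, ?, ?)",
        (full_name, birth_year, father_id, mother_id, source_page),
    )
    conn.commit()
    return cur.lastrowid


def children_of(conn, parent_id: int) -> list[str]:
    """Names of everyone who lists parent_id as father or mother."""
    rows = conn.execute(
        "SELECT full_name FROM person WHERE father_id = ? OR mother_id = ? "
        "ORDER BY birth_year, full_name",
        (parent_id, parent_id),
    )
    return [row[0] for row in rows]
```

Relatives missing from the book, such as the cousins mentioned above, can be added with `source_page` left empty, which also marks them as later additions rather than transcriptions.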