scan A4 doc > pdf > ocr > translate to english?

2

4

I've tried using a combination of

  • my home scanner to create a '300 dpi', 'document', 'pdf' (options on Canon all-in-one)
  • ZoHoViewer to create either an RTF or TXT file
  • Google Docs to translate

I'm not sure how good or bad a product ZoHoViewer is, but the following:

Als Arbeitsmarktbehörde haben wir den gesetzlichen Auftrag, die Vermittelbarkeit von

turns into:

AlsArbeitsmarktbeh6rde habenwirdengesetzlichenAuftrag,dieVermittelbarkeit vonSt...

Consequently, Google Docs makes a pig's breakfast of trying to translate it.

Does anyone have any better suggestions (preferably free online services)?

adolf garlic

Posted 2010-01-18T19:38:21.647

Reputation: 1 618

Since there isn't an "exact" duplicate, I'm leaving this one open. However you should go through the questions I linked, since they will probably offer possible solutions – Ivo Flipse – 2010-01-18T20:10:40.200

In case anyone's interested the translation should be "When labor market authority, we have a statutory mandate, the employability of" - or something along those lines – ChrisF – 2010-01-18T20:34:35.370

correction: "As the labour market authority" ... sounds better :) – None – 2010-01-18T21:32:08.297

@Molly - It was just a copy 'n' paste into Google Translate! – ChrisF – 2010-01-19T12:28:20.157

Answers

0

Not 100% perfect but the best out of all the things I have tried:

http://www.paperfile.net/ combined with a language pack (free to download; instructions are in the app). Copy and paste the whole text into a Google Doc, then use Tools > Translate in Google Docs.

adolf garlic

Posted 2010-01-18T19:38:21.647

Reputation: 1 618

5

There have been several other questions on SuperUser on OCR, which might be worth checking out for possible solutions.

Most notably this answer by Molly looks promising:

I really like TopOCR, certainly a great addition to your scan tools:

  • Incredible OCR accuracy, up to 99.8% with a 3 MP camera
  • No page limits, and no extra downloads or components needed
  • Handles images with mixed text and graphics (Manual or Auto Zoning)
  • Tolerates skew and uneven lighting
  • Multiple text output formats, including searchable PDF and HTML
  • Able to read 11 different languages
  • Powerful, easy to use Image Processing with Image Dewarping
  • Supports Smartphones: See some Smartphone samples
  • Includes built-in, full featured Text and Image WYSIWYG Editors
  • Post-processing spell checker for all 11 languages
  • Built-in Text-To-Speech software. How about OCR to MP3?
  • Includes a built-in multi-lingual text translator
  • Supports a Command Line Interface and a GUI
  • Make a high performance document Search and Indexing system
  • Browser Helper Mode supports creating free audio eBooks
  • With TopOCR's Web Engine it's easy to add new features


It's very accurate and works very well with low-quality images such as photographs of pages/documents.

TopOCR is freeware (can be made portable with Universal Extractor)

Further reading:

Which OCR software has the most options?

Practical OCR solution for converting a large book to a digital format?

How to extract text with OCR from a PDF on Linux?

Ivo Flipse

Posted 2010-01-18T19:38:21.647

Reputation: 24 054

Ok I will give it a try. – adolf garlic – 2010-01-22T09:25:27.027

I tried TopOCR with another doc and it is useless. Tildes and stuff all over the place [this was from a doc scanned at 600dpi]. Also I am undergoing the pain of switching from Windows to Mac and TopOCR is Windows only. – adolf garlic – 2010-01-29T16:32:08.127

This "downvoting business" here is getting ridiculous. In fact TopOCR is ideal for this job, as it 'understands' German (and a lot of other languages) and includes a translator. +1 and flagged for moderator attention. – None – 2010-01-18T21:20:43.737

2 Up, 2 down :/ – Sathyajith Bhat – 2010-01-18T23:44:14.700

Looks nice, but I don't really want to have to pull my camera out when I want to scan a doc. I can see how that might be nice if you do not have a scanner or are out and about with your phone though. – adolf garlic – 2010-01-20T13:20:45.467

It means you can also do it on scanned document ;-) – Ivo Flipse – 2010-01-20T13:24:54.663

4

Given that the OCR has converted:

Als Arbeitsmarktbehörde ...

to:

AlsArbeitsmarktbeh6rde ...

A couple of things spring to mind.

  1. Try scanning at a higher dpi. It looks like it can't recognise the space between the words; a higher dpi might improve that.

  2. Can you set the language of your OCR program? I see that it's converted the "ö" to a "6". While this might be a problem caused by the resolution it might also be that as "ö" isn't an everyday part of English, the program is choosing the "next best" fit - in this case "6".
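Both symptoms above (a digit landing inside a word, and words fused together) can be spotted automatically before the text goes anywhere near a translator. The following is only a rough sketch of such a check, not part of any OCR tool mentioned in this thread: the function name `ocr_warnings` and the length cutoff `MAX_WORD_LEN` are made up for illustration, and the heuristics will also flag legitimate tokens such as "A4".

```python
import re

# Hypothetical cutoff, not a tuned value: German compounds can be long,
# but a token beyond this length is more likely several fused words.
MAX_WORD_LEN = 25

# A letter directly next to a digit, e.g. the "h6" in "beh6rde".
DIGIT_IN_WORD = re.compile(r"[^\W\d_]\d|\d[^\W\d_]")
# Punctuation followed immediately by a letter, e.g. "Auftrag,die".
MISSING_SPACE = re.compile(r"[,;:](?=[^\W\d_])")


def ocr_warnings(text):
    """Return (token, reason) pairs for tokens that look like OCR damage."""
    warnings = []
    for token in text.split():
        if DIGIT_IN_WORD.search(token):
            warnings.append((token, "digit inside word"))
        if MISSING_SPACE.search(token):
            warnings.append((token, "no space after punctuation"))
        if len(token) > MAX_WORD_LEN:
            warnings.append((token, "unusually long token"))
    return warnings


# The OCR output quoted in the question.
sample = "AlsArbeitsmarktbeh6rde habenwirdengesetzlichenAuftrag,dieVermittelbarkeit"
for token, reason in ocr_warnings(sample):
    print(f"{reason}: {token}")
```

If a rescan at a higher dpi or a change of OCR language produces fewer warnings, that is a hint the change helped, although a clean result does not prove the text is correct.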

ChrisF

Posted 2010-01-18T19:38:21.647

Reputation: 39 650

Naah. zamzar is useless at recognising spaces even at 600 dpi, will try another tool for conversion. – adolf garlic – 2010-01-21T09:26:37.600

Good points Chris! – Ivo Flipse – 2010-01-18T21:06:20.867

Have rescanned at 600dpi, am just waiting (forever) for zamzar to send me the converted doc....I guess when it's free you cannot expect too much, but nearly a day!? Too long – adolf garlic – 2010-01-20T13:17:12.683

Now I just checked and the conversion expired. Starting again. Harumph. – adolf garlic – 2010-01-20T13:18:22.290