Optimal font for Tesseract? (specifically the .NET wrapper)

1

I am using Tesseract to convert printed text documents, captured with my cell phone camera, into text. The results are not great. The image quality is very good, far clearer than a fax, but Tesseract still has a very difficult time identifying characters.

I've also tried mimicking one of these documents in a text editor, taking a screenshot of the window, and running that through Tesseract, but the results are only marginally better.

This leads me to believe there's probably an optimal font for Tesseract. I Googled a bit and came across OCR-A, but it apparently requires a license. I then stumbled upon a free OCR-A alternative on SourceForge, but it doesn't appear to fare much better than Arial or Courier New.

Is there a font that works best with Tesseract or do I need to do something else to increase the accuracy of the character recognition?

user613051

Posted 2016-07-03T16:12:33.800

Reputation: 11

You do have the correct dictionary loaded, right? – Daniel B – 2016-07-03T16:20:46.577

@DanielB Good point. I am actually using this as a means to convert relatively small data files to base64 and then printing them on paper for backup. It's sort of the same idea behind Paperback. Any idea how to create my own custom dictionary? I could try creating a dictionary of every possible base64 string and see if that helps with the accuracy. – user613051 – 2016-07-03T17:58:07.640

Why not also print QR codes next to the text? – Máté Juhász – 2016-07-03T18:32:49.880

@MátéJuhász I've considered generating QR codes because of the amount of data they can hold, but haven't gotten around to looking for QR code reader apps that don't require every permission known to humankind – user613051 – 2016-07-03T18:57:11.713

Answers

0

Your best choice is to train it for whatever font you are using.

I don't want to pretend this is an easy process. It isn't, but it should work better. Also, most OCR programs favor 300 dpi or 600 dpi input, so upscaling may be necessary.
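To get a rough idea of whether upscaling is needed, you can estimate the resolution your capture actually achieves and compare it with the 300 dpi figure. This is only a sketch in Python rather than .NET: the function names are made up for illustration, and it assumes you know roughly how wide the photographed page (or screenshot area) is in inches.

```python
import math


def effective_dpi(pixel_width: int, physical_width_in: float) -> float:
    """Pixels per inch that the capture achieves across the page width."""
    if pixel_width <= 0 or physical_width_in <= 0:
        raise ValueError("dimensions must be positive")
    return pixel_width / physical_width_in


def upscale_factor(current_dpi: float, target_dpi: float = 300.0) -> float:
    """Resize factor needed to reach target_dpi; 1.0 means no upscaling."""
    if current_dpi <= 0 or target_dpi <= 0:
        raise ValueError("dpi values must be positive")
    return max(1.0, target_dpi / current_dpi)


def scaled_size(width: int, height: int, factor: float) -> tuple[int, int]:
    """Pixel dimensions after resizing by factor, rounded up."""
    return math.ceil(width * factor), math.ceil(height * factor)


if __name__ == "__main__":
    # Assumption: a screenshot rendered at a typical 96 dpi screen resolution.
    factor = upscale_factor(96.0)
    print(factor, scaled_size(800, 600, factor))

    # Assumption: a phone photo 3000 px wide covering an 8.5 in wide page.
    print(upscale_factor(effective_dpi(3000, 8.5)))
```

A screenshot taken at normal screen resolution comes out well below 300 dpi, which may be part of why the screenshot test was only marginally better than the photo. The resize itself would be done in whatever image library you already use before handing the image to Tesseract.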

The Tesseract GitHub wiki has some good resources on training Tesseract.

cybernard

Posted 2016-07-03T16:12:33.800

Reputation: 11 200