How do I train Tesseract for a new ttf font?

I am trying to train Tesseract for some funny looking fonts, like Palace for example. I tried a simple approach: I produced the traineddata with http://trainyourtesseract.com/ and then made a call like

api->Init(".\\tessdata", "eng+Palace", OEM_TESSERACT_ONLY);
api->SetPageSegMode(PSM_SINGLE_LINE);
api->SetImage(image);
// Get OCR result
outText = api->GetUTF8Text();

The result for a line like

M P S T a o e h i l n p r s t u w y

palacescript.tiff

is shown below; no glyph is recognized correctly:

.MDXXXo,XkX.n.mX.XnoX

Does trainyourtesseract produce bad traineddata, or am I making the wrong calls? And how does one handle such cases?

Actually, I have tried the same with less funny fonts, but the recognition hardly improves there either.

I am attaching the tiff file and my trained data for Palace.

Thank you everyone in advance for help, Yuliana

Yuliana Zigangirova

Posted 2019-11-01T13:42:47.733

Reputation: 11

did you solve it? – V.Wu – 2020-01-07T06:05:11.753

Answers

0

The trainyourtesseract site is only responsible for generating a .traineddata file; it is not responsible for accuracy. So you still need to do more training after you get the .traineddata file.

I could not find out what accuracy trainyourtesseract achieves, but it is certainly not 100 percent. Looking through the result, the accuracy still needs a lot of improvement.

The Tesseract Training Tutorial describes three options:
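To put a number on how far off a result is, one common approach is to compare the OCR output with the expected text using an edit distance. The following is a minimal sketch, not something trainyourtesseract or Tesseract provides: the names `edit_distance`, `strip_spaces` and `char_error_rate` are made up for illustration, whitespace is ignored on both sides, and the comparison is per byte, which is adequate for ASCII text such as the test line in the question.

```cpp
#include <algorithm>
#include <cctype>
#include <cstddef>
#include <string>
#include <vector>

// Levenshtein distance between two strings, counted per byte
// (insertions, deletions and substitutions each cost 1).
std::size_t edit_distance(const std::string& a, const std::string& b) {
    std::vector<std::size_t> prev(b.size() + 1), cur(b.size() + 1);
    for (std::size_t j = 0; j <= b.size(); ++j) prev[j] = j;
    for (std::size_t i = 1; i <= a.size(); ++i) {
        cur[0] = i;
        for (std::size_t j = 1; j <= b.size(); ++j) {
            std::size_t subst = prev[j - 1] + (a[i - 1] == b[j - 1] ? 0 : 1);
            cur[j] = std::min({subst, prev[j] + 1, cur[j - 1] + 1});
        }
        std::swap(prev, cur);
    }
    return prev[b.size()];
}

// Remove all whitespace so that spacing differences are not counted as errors.
// (Hypothetical helper; drop it if spacing matters for your use case.)
std::string strip_spaces(const std::string& s) {
    std::string out;
    for (char c : s) {
        if (!std::isspace(static_cast<unsigned char>(c))) out.push_back(c);
    }
    return out;
}

// Character error rate: edit distance divided by the length of the ground truth.
// Returns 0.0 for two empty strings and 1.0 if only the ground truth is empty.
double char_error_rate(const std::string& truth, const std::string& ocr) {
    const std::string t = strip_spaces(truth);
    const std::string o = strip_spaces(ocr);
    if (t.empty()) return o.empty() ? 0.0 : 1.0;
    return static_cast<double>(edit_distance(t, o)) / static_cast<double>(t.size());
}
```

Calling `char_error_rate` with the expected line and the string returned by `GetUTF8Text()` gives a single figure that can be tracked while the training data or the engine mode is changed. A value near 0 means few errors; the value can exceed 1 when the output contains many extra characters.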

Fine tune. Starting with an existing trained language, train on your specific additional data. This may work for problems that are close to the existing training data, but different in some subtle way, like a particularly unusual font. May work with even a small amount of training data.

Cut off the top layer (or some arbitrary number of layers) from the network and retrain a new top layer using the new data. If fine tuning doesn't work, this is most likely the next best option. Cutting off the top layer could still work for training a completely new language or script, if you start with the most similar looking script.

Retrain from scratch. This is a daunting task, unless you have a very representative and sufficiently large training set for your problem. If not, you are likely to end up with an over-fitted network that does really well on the training data, but not on the actual data.

You can retrain your Palace.traineddata from scratch; the disadvantage is that you need to supply a lot of training data. Or you can fine-tune eng.traineddata with images like your palacescript.tiff, but that still needs a lot of training data.

If you can't supply that much data, don't worry!

You can follow "How to prepare training files for Tesseract OCR and improve characters recognition?", which builds on the Legacy engine. Note that the .box file made by makebox cannot be used for the LSTM engine.
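When editing legacy box files by hand it is easy to break a line, so a small sanity check can help before training. The sketch below assumes the usual legacy layout of one entry per line, `<symbol> <left> <bottom> <right> <top> <page>`, with the origin at the bottom-left corner of the image. `BoxEntry` and `parse_box_line` are hypothetical names for this example, not part of the Tesseract API.

```cpp
#include <sstream>
#include <string>

// One line of a legacy-engine box file (illustrative struct).
struct BoxEntry {
    std::string symbol;  // the character(s), possibly multi-byte UTF-8
    int left = 0;
    int bottom = 0;
    int right = 0;
    int top = 0;
    int page = 0;
};

// Parse "<symbol> <left> <bottom> <right> <top> <page>".
// Returns false (leaving `out` untouched) if fields are missing, extra
// tokens follow, or the box has no positive width and height.
bool parse_box_line(const std::string& line, BoxEntry& out) {
    std::istringstream in(line);
    BoxEntry e;
    if (!(in >> e.symbol >> e.left >> e.bottom >> e.right >> e.top >> e.page)) {
        return false;
    }
    std::string extra;
    if (in >> extra) return false;  // unexpected trailing token
    // Origin is bottom-left, so top must be greater than bottom.
    if (e.left >= e.right || e.bottom >= e.top) return false;
    out = e;
    return true;
}
```

Running every line of a .box file through `parse_box_line` and reporting the line numbers that fail is usually enough to catch swapped coordinates or a missing page field.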

V.Wu

Posted 2019-11-01T13:42:47.733

Reputation: 101