How do I train Tesseract for a new ttf font?

The trainyourtesseract site is only responsible for generating a .traineddata file; it is not responsible for accuracy. So you still need to train further after you get the .traineddata file.

I couldn't find out what the accuracy of trainyourtesseract is, but it is definitely not 100 percent. Looking through the results, the accuracy still needs a lot of improvement.

According to the Tesseract Training Tutorial, you have several options:

Fine tune. Starting with an existing trained language, train on your specific additional data. This may work for problems that are close to the existing training data, but different in some subtle way, like a particularly unusual font. May work with even a small amount of training data.

Cut off the top layer (or some arbitrary number of layers) from the network and retrain a new top layer using the new data. If fine tuning doesn't work, this is most likely the next best option. Cutting off the top layer could still work for training a completely new language or script, if you start with the most similar looking script.

Retrain from scratch. This is a daunting task, unless you have a very representative and sufficiently large training set for your problem. If not, you are likely to end up with an over-fitted network that does really well on the training data, but not on the actual data.
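As a rough illustration of the fine-tune option, here is a minimal Python sketch that only builds the two command lines involved: extracting the LSTM model from an existing .traineddata with combine_tessdata, then continuing training from it with lstmtraining. The function name, the file paths, and the iteration count are my own assumptions for illustration; check the flags against your installed Tesseract version before running anything.

```python
import shlex


def fine_tune_commands(base_traineddata, train_listfile, output_prefix,
                       max_iterations=400):
    """Build (but do not run) the commands for a fine-tuning pass.

    All paths are hypothetical placeholders supplied by the caller.
    """
    # The extracted model is written next to the training output.
    lstm_path = output_prefix + ".lstm"

    # Step 1: extract the LSTM component from the base traineddata.
    extract = ["combine_tessdata", "-e", base_traineddata, lstm_path]

    # Step 2: continue training from that model on the new data.
    train = [
        "lstmtraining",
        "--model_output", output_prefix,
        "--continue_from", lstm_path,
        "--traineddata", base_traineddata,
        "--train_listfile", train_listfile,
        "--max_iterations", str(max_iterations),
    ]
    return [extract, train]


if __name__ == "__main__":
    # Example paths are assumptions, not taken from the question.
    for cmd in fine_tune_commands("tessdata_best/eng.traineddata",
                                  "train/eng.training_files.txt",
                                  "output/palace"):
        print(" ".join(shlex.quote(part) for part in cmd))
```

Printing the commands first lets you inspect them before passing each list to subprocess.run yourself.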

You can retrain your Palace.traineddata from scratch; the disadvantage is that you need to supply a lot of training data. Or you can fine-tune eng.traineddata with your palacescript.tiff, but that still needs a lot of training data.
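Since you have the ttf file, one way to get more training data is to render text in that font with text2image, which writes a .tif image and a matching .box file. The sketch below only builds the command line; the function name, the font name "Palace Script MT", and all paths are assumptions for illustration, so substitute whatever your font reports as its name.

```python
import shlex


def render_command(text_file, outputbase, font_name, fonts_dir):
    """Build (but do not run) a text2image command for one font.

    font_name must match the name the font itself declares;
    the value used in the example below is only a guess.
    """
    return [
        "text2image",
        "--text=" + text_file,
        "--outputbase=" + outputbase,
        "--font=" + font_name,
        "--fonts_dir=" + fonts_dir,
    ]


if __name__ == "__main__":
    cmd = render_command("training_text.txt",
                         "output/eng.PalaceScript.exp0",
                         "Palace Script MT",
                         "/usr/share/fonts")
    print(" ".join(shlex.quote(part) for part in cmd))
```

Running it over many different text files gives you more rendered samples than a single scanned page would.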

If you can't supply that much data, don't worry!

You can follow "How to prepare training files for Tesseract OCR and improve characters recognition?", which builds on the legacy engine. Note that a .box file made by makebox can't be used with the LSTM engine.

V.Wu

Posted 2019-11-01T13:42:47.733

Reputation: 101
