Tesseract hOCR and txt at the same time, or converting from Tesseract's hOCR to txt

3

1

I've been playing around with Linux OCR software, and I really like Tesseract, especially in conjunction with gscan2pdf. Tesseract v3 or later can output in the hOCR format, and gscan2pdf can use that to create searchable PDFs of scanned documents.

Sometimes, however, I would also like the plain-text version. Running pdftotext on the searchable PDF generated by gscan2pdf is not great for that: even with the -raw option, the output doesn't reproduce the original physical layout well. I can set up a user-defined command in gscan2pdf that calls tesseract on the original scanned image without the hocr option so that only plain text is generated, but OCR is time-consuming, and this means running it twice for each page. Is there a working way to convert hOCR to plain text (with the same layout tesseract produces when invoked without the hocr option), or a way to make tesseract output both plain text and hOCR at the same time?

https://github.com/jbrinley/HocrConverter looks promising, but it doesn't work for me.

PSkocik

Posted 2013-05-16T20:57:07.090

Reputation: 1 182

Answers

0

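<?php
/**
 * CLI script: takes the hOCR file produced by tesseract ... hocr as its 1st
 * argument and dumps the text of its <p> elements to the file given as the 2nd.
 * Usage: script.php in.tif.html out.txt
 */
if ($argc < 3) {
    fwrite(STDERR, "Usage: script.php in.tif.html out.txt\n");
    exit(1);
}
$inFile = $argv[1];
$outFile = $argv[2];
$stream = file_get_contents($inFile);

// loadHTML() is an instance method; calling it statically fails on PHP 8.
libxml_use_internal_errors(true); // keep parser warnings about the markup quiet
$dom = new DOMDocument();
$dom->loadHTML($stream);

// Collect the text of every paragraph element, one entry per <p>.
$out = array();
foreach ($dom->getElementsByTagName('p') as $tag) {
    $out[] = $tag->nodeValue;
}

file_put_contents($outFile, implode("\n", $out));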
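
If one output line per recognized text line is closer to what you want, a variant along these lines might work. It is only a sketch: it assumes the hOCR marks each text line with an element whose class contains ocr_line (as Tesseract's hOCR output normally does), and hocrToText is a hypothetical helper name, not part of Tesseract or gscan2pdf. It will not necessarily match the plain-text layout of tesseract byte for byte.

```php
<?php
/**
 * Sketch: convert hOCR markup to plain text, one output line per ocr_line
 * element. hocrToText() is a hypothetical helper name; the reliance on the
 * ocr_line class is an assumption about the hOCR input.
 */
function hocrToText(string $hocr): string
{
    if (trim($hocr) === '') {
        return '';
    }

    // Parse leniently; hOCR files are not always strictly valid HTML.
    $previous = libxml_use_internal_errors(true);
    $dom = new DOMDocument();
    $dom->loadHTML($hocr);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    // Select every element whose class attribute contains the token ocr_line.
    $xpath = new DOMXPath($dom);
    $query = "//*[contains(concat(' ', normalize-space(@class), ' '), ' ocr_line ')]";

    $lines = array();
    foreach ($xpath->query($query) as $node) {
        // Collapse whitespace between word elements into single spaces.
        $lines[] = trim(preg_replace('/\s+/u', ' ', $node->textContent));
    }

    return implode("\n", $lines);
}
```

It could replace the foreach loop in the script above, for example with file_put_contents($outFile, hocrToText(file_get_contents($inFile)));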

juanmf

Posted 2013-05-16T20:57:07.090

Reputation: 101