hOCR

hOCR is an open standard for representing formatted text obtained from optical character recognition (OCR). It encodes text, style, layout information, recognition confidence metrics and other information using Extensible Markup Language (XML) in the form of Hypertext Markup Language (HTML) or XHTML.[1]

Software

The following OCR software can output its recognition results as an hOCR file:

Example

The following example is an extract of an hOCR file:

...
<p class='ocr_par' lang='deu' title="bbox930">
  <span class='ocr_line' title="bbox 348 797 1482 838; baseline -0.009 -6">
    <span class='ocrx_word' title='bbox 348 805 402 832; x_wconf 93'>Die</span> 
    <span class='ocrx_word' title='bbox 421 804 697 832; x_wconf 90'>Darlehenssumme</span> 
    <span class='ocrx_word' title='bbox 717 803 755 831; x_wconf 96'>ist</span> 
    <span class='ocrx_word' title='bbox 773 803 802 831; x_wconf 96'>in</span> 
    <span class='ocrx_word' title='bbox 821 803 917 830; x_wconf 96'>ihrem</span> 
    <span class='ocrx_word' title='bbox 935 799 1180 838; x_wconf 95'>ursprünglichen</span> 
    <span class='ocrx_word' title='bbox 1199 797 1343 832; x_wconf 95'>Umfange</span> 
    <span class='ocrx_word' title='bbox 1362 805 1399 823; x_wconf 95'>zu</span> 
    <span class='ocrx_word' title='bbox 1417 x_wconf 96'>ver-</span> 
  </span>
  ...

The recognized text is stored in ordinary text nodes of the HTML file. The division into separate lines and words is expressed by the surrounding span elements. The usual HTML elements are also used, for example the p element for a paragraph. Additional information is carried in the attributes, such as:

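  • the type of layout element, given in the class attribute, such as "ocr_par", "ocr_line", "ocrx_word"
  • geometric information for each element, given as a bounding box "bbox" in the title attribute
  • language information, given in the "lang" attribute
  • recognition confidence values such as "x_wconf", also given in the title attribute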
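
As a rough illustration of how these attributes can be read back, the following Python sketch extracts the text, bounding box and confidence of each word from a fragment like the one above. It uses only the standard library's html.parser module. The names parse_title and WordCollector are hypothetical helpers introduced for this example and are not part of hOCR or of any OCR tool. The sketch assumes that the properties inside a title attribute are separated by semicolons, that a complete "bbox" holds four integers (x0 y0 x1 y1), and that word spans contain no nested spans.

```python
from html.parser import HTMLParser


def parse_title(title):
    """Split an hOCR title attribute into {property_name: [values]}.

    Hypothetical helper: assumes properties are separated by ';' and
    each property is a name followed by whitespace-separated values.
    """
    props = {}
    for part in title.split(";"):
        tokens = part.split()
        if tokens:
            props[tokens[0]] = tokens[1:]
    return props


class WordCollector(HTMLParser):
    """Collect text, bounding box and confidence of each 'ocrx_word' span."""

    def __init__(self):
        super().__init__()
        self.words = []       # finished word records
        self._current = None  # record being filled while inside a word span

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "span" and attrs.get("class") == "ocrx_word":
            props = parse_title(attrs.get("title") or "")
            bbox = props.get("bbox", [])
            conf = props.get("x_wconf", [])
            self._current = {
                "text": "",
                # keep the box only if all four coordinates are present
                "bbox": tuple(int(v) for v in bbox) if len(bbox) == 4 else None,
                "x_wconf": float(conf[0]) if conf else None,
            }

    def handle_data(self, data):
        # text nodes inside a word span hold the recognized text
        if self._current is not None:
            self._current["text"] += data

    def handle_endtag(self, tag):
        # simplification: the first closing span ends the current word
        if tag == "span" and self._current is not None:
            self.words.append(self._current)
            self._current = None


sample = """
<span class='ocr_line' title="bbox 348 797 1482 838; baseline -0.009 -6">
  <span class='ocrx_word' title='bbox 348 805 402 832; x_wconf 93'>Die</span>
  <span class='ocrx_word' title='bbox 421 804 697 832; x_wconf 90'>Darlehenssumme</span>
</span>
"""

collector = WordCollector()
collector.feed(sample)
for word in collector.words:
    print(word["text"], word["bbox"], word["x_wconf"])
```

A word whose title lacks a complete bounding box or a confidence value is still collected, with the missing field set to None, so incomplete input does not stop the parse.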

See also

  • ALTO (XML) -- another OCR data representation format

References

  1. Breuel, T. (2007-09-01). "The hOCR Microformat for OCR Workflow and Results". Ninth International Conference on Document Analysis and Recognition (ICDAR 2007). 2: 1063–1067. doi:10.1109/ICDAR.2007.4377078. ISBN 978-0-7695-2822-9.
This article is issued from Wikipedia. The text is licensed under Creative Commons - Attribution - Sharealike. Additional terms may apply for the media files.