In the 2014 challenge, Michael Stern suggests using OCR to parse an image of the number 2014 to 2014. I'd like to take this challenge in a different direction. Using built-in OCR from the language/standard library of your choice, design the smallest image (in bytes) which is parsed into the ASCII string "2014".

Stern's original image is 7357 bytes, but with a bit of effort it can be losslessly compressed to 980 bytes. No doubt the black-and-white version (181 bytes) works as well with the same code.

Rules: Each answer should give the image, its size in bytes, and the code needed to process it. No custom OCR allowed, for obvious reasons...! Any reasonable languages and image formats are allowed.

Edit: In response to comments, I'll broaden this to include any preexisting library, or even http://www.free-ocr.com/ for those languages where no OCR is available.

Charles

Posted 2014-01-05T06:16:31.547

Reputation: 2 435

How many languages or standard libraries have built-in OCR? Or do you intend "standard library" here to mean "any library which hasn't been created specifically for this challenge"? – Peter Taylor – 2014-01-05T08:25:30.420

Does any development platform other than Mathematica have OCR built in? – Michael Stern – 2014-01-05T15:44:55.747

You should standardize: say something like "use http://www.free-ocr.com/" or some other easily accessible OCR. – Justin – 2014-01-05T16:10:21.327

Answers

Shell (ImageMagick, Tesseract), 18 bytes

file=golf_2014
echo -n UDQKMTMgNQruqCqo6riKiO6I | base64 -d > $file.pbm
convert -border 2x2 -bordercolor white -resize 300% -sharpen 0 -monochrome $file.pbm $file.png
tesseract $file.png $file digits
cat $file.txt
rm $file.pbm $file.png $file.txt

The image is 18 bytes and can be reproduced like this:

echo -n UDQKMTMgNQruqCqo6riKiO6I | base64 -d > 2014.pbm

It looks like this (this is a PNG copy, not the original):

[image: 2014]

After processing with ImageMagick, it looks like this:

[image: 2014 big]

Using ImageMagick version 6.6.9-7, Tesseract version 3.02. The PBM image was created in Gimp and edited with a hex editor.
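To see what the 18 bytes actually contain, here is a minimal Python sketch that decodes the base64 payload from the commands above and prints the bitmap as text. The function name `decode_pbm` is made up for this illustration, and the parser assumes a raw (P4) PBM with no comment lines and single whitespace separators, which holds for this file but not for PBM files in general.

```python
import base64

# Base64 payload copied from the answer above.
PAYLOAD = "UDQKMTMgNQruqCqo6riKiO6I"


def decode_pbm(data: bytes):
    """Parse a raw (P4) PBM without comments into rows of '#'/'.' strings."""
    # Header is "P4\n<width> <height>\n"; everything after it is the bitmap.
    magic, dims, bitmap = data.split(b"\n", 2)
    if magic != b"P4":
        raise ValueError("not a raw PBM file")
    width, height = map(int, dims.split())
    stride = (width + 7) // 8  # each row is padded to a whole number of bytes
    rows = []
    for y in range(height):
        row_bytes = bitmap[y * stride:(y + 1) * stride]
        bits = "".join(f"{b:08b}" for b in row_bytes)[:width]
        rows.append(bits.replace("1", "#").replace("0", "."))  # 1 means black
    return width, height, rows


data = base64.b64decode(PAYLOAD)
width, height, rows = decode_pbm(data)
print(len(data), "bytes,", width, "x", height)
print("\n".join(rows))
```

The output is a 13 by 5 grid in which each digit is three pixels wide (one pixel for the "1"), with a single blank column between digits.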

This version also prints the recognized text as ASCII art built from its own digits. It requires jp2a.

file=golf_2014
echo -n UDQKMTMgNQruqCqo6riKiO6I | base64 -d > $file.pbm
convert -border 2x2 -bordercolor white -resize 300% -sharpen 0 -monochrome $file.pbm $file.png
tesseract $file.png $file digits
cat $file.txt
convert -background black -fill white -border 2x2 -bordercolor black -pointsize 100 label:$(cat $file.txt) $file.jpg
jp2a --chars=" $(cat $file.txt) " $file.jpg
rm $file.pbm $file.png $file.txt $file.jpg

It outputs something like this:

    2014444444102         01144444102              214441                 214441     
   1             1      24           1            04    4                0     4     
  1    410201     0    0    410004    1       2014      4              21      4     
 24   42     0    4    4    0     1    0    24          4             04       4     
  22222      1    1   0    42     0    4    2   4100    4            1   41    4     
            1    42   0    4      2     2   2412   0    4          24   420    4     
          04    42    0    1      2     2          0    4         0   40  0    4     
       204    42      0    1      2     2          0    4       24   42   0    4     
     21     12        0    4      0    42          0    4      2     411114     1112 
    04   412          24    0     1    0           0    4      0                   0 
  24     1111111110    1    42  21    4            0    4      200011111001    40002 
  4               4     04    44     42            0    4                 0    4     
 0                4      214       10              0    4                 0    4     
  22222222222222222         222222                  22222                  22222

user13957

Posted 2014-01-05T06:16:31.547

Reputation: 118

Very, very impressive. 3 bytes for the header, 5 bytes for the dimensions of the image, 10 bytes for the bitmap. The format is described here: http://netpbm.sourceforge.net/doc/pbm.html – Charles – 2014-01-08T02:59:17.883
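The byte count in the comment above can be checked with a little arithmetic. The following Python sketch computes the size of a minimal raw PBM from its dimensions; `raw_pbm_size` is a hypothetical helper written for this illustration, and it assumes the shortest possible header (no comments, one whitespace character between fields).

```python
def raw_pbm_size(width: int, height: int) -> int:
    """Size in bytes of a minimal raw (P4) PBM with no comment lines."""
    magic = len(b"P4\n")                                # 3 bytes
    dims = len(f"{width} {height}\n".encode("ascii"))   # e.g. b"13 5\n" is 5 bytes
    bitmap = ((width + 7) // 8) * height                # rows are padded to whole bytes
    return magic + dims + bitmap


print(raw_pbm_size(13, 5))
```

Because rows are padded to whole bytes, any width from 9 to 16 pixels costs two bytes per row, so the 13-pixel-wide image pays for three unused bits on every row.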

Java + Tesseract, 53 bytes

Since I don't have Mathematica, I decided to ~~bend the rules a bit and~~ use Tesseract to do the OCR. I wrote a program that renders "2014" into an image, using various fonts, sizes and styles, and finds the smallest image that gets recognized as "2014". Results depend on the available fonts.

Here is the winner on my computer - 53 bytes, using the "URW Gothic L" font: [image: 2014]

Code:

import java.awt.Color;
import java.awt.Font;
import java.awt.FontMetrics;
import java.awt.Graphics2D;
import java.awt.GraphicsEnvironment;
import java.awt.image.BufferedImage;
import java.io.BufferedReader;
import java.io.File;
import java.io.FileReader;
import java.io.IOException;

import javax.imageio.ImageIO;

public class Ocr {
    public static boolean blankLine(final BufferedImage img, final int x1, final int y1, final int x2, final int y2) {
        final int d = x2 - x1 + y2 - y1 + 1;
        final int dx = (x2 - x1 + 1) / d;
        final int dy = (y2 - y1 + 1) / d;
        for (int i = 0, x = x1, y = y1; i < d; ++i, x += dx, y += dy) {
            if (img.getRGB(x, y) != -1) {
                return false;
            }
        }
        return true;
    }

    public static BufferedImage trim(final BufferedImage img) {
        int x1 = 0;
        int y1 = 0;
        int x2 = img.getWidth() - 1;
        int y2 = img.getHeight() - 1;
        while (x1 < x2 && blankLine(img, x1, y1, x1, y2)) x1++;
        while (x1 < x2 && blankLine(img, x2, y1, x2, y2)) x2--;
        while (y1 < y2 && blankLine(img, x1, y1, x2, y1)) y1++;
        while (y1 < y2 && blankLine(img, x1, y2, x2, y2)) y2--;
        return img.getSubimage(x1, y1, x2 - x1 + 1, y2 - y1 + 1);
    }

    public static int render(final Font font, final int w, final String name) throws IOException {
        BufferedImage img = new BufferedImage(w, w, BufferedImage.TYPE_BYTE_BINARY);
        Graphics2D g = img.createGraphics();
        float size = font.getSize2D();
        Font f = font;
        while (true) {
            final FontMetrics fm = g.getFontMetrics(f);
            if (fm.stringWidth("2014") <= w) {
                break;
            }
            size -= 0.5f;
            f = f.deriveFont(size);
        }
        g = img.createGraphics();
        g.setFont(f);
        g.fillRect(0, 0, w, w);
        g.setColor(Color.BLACK);
        g.drawString("2014", 0, w - 1);
        g.dispose();
        img = trim(img);
        final File file = new File(name);
        ImageIO.write(img, "gif", file);
        return (int) file.length();
    }

    public static boolean ocr() throws Exception {
        Runtime.getRuntime().exec("/usr/bin/tesseract 2014.gif out -psm 8").waitFor();
        String t = "";
        final BufferedReader br = new BufferedReader(new FileReader("out.txt"));
        while (true) {
            final String s = br.readLine();
            if (s == null) break;
            t += s;
        }
        br.close();
        return t.trim().equals("2014");
    }

    public static void main(final String... args) throws Exception {
        int min = 10000;
        for (String s : GraphicsEnvironment.getLocalGraphicsEnvironment().getAvailableFontFamilyNames()) {
            for (int t = 0; t < 4; ++t) {
                final Font font = new Font(s, t, 50);
                for (int w = 10; w < 25; ++w) {
                    final int size = render(font, w, "2014.gif");
                    if (size < min && ocr()) {
                        render(font, w, "2014win.gif");
                        min = size;
                        System.out.println(s + ", " + size);
                    }
                }
            }
        }
    }
}
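The trim step in the code above removes blank border rows and columns so that the saved file is as small as possible. Here is a minimal Python sketch of the same idea on a text grid rather than a BufferedImage; the grid representation and the `blank` parameter are assumptions made for illustration and are not part of the Java program.

```python
def trim(rows, blank="."):
    """Crop fully blank border rows and columns from a grid of equal-length strings."""
    top, bottom = 0, len(rows) - 1
    left, right = 0, len(rows[0]) - 1
    # Like the Java loops, always keep at least one row and one column.
    while top < bottom and set(rows[top]) == {blank}:
        top += 1
    while top < bottom and set(rows[bottom]) == {blank}:
        bottom -= 1
    while left < right and all(r[left] == blank for r in rows[top:bottom + 1]):
        left += 1
    while left < right and all(r[right] == blank for r in rows[top:bottom + 1]):
        right -= 1
    return [r[left:right + 1] for r in rows[top:bottom + 1]]


# A "2" drawn with a blank margin on every side.
padded = [
    ".......",
    "..###..",
    "....#..",
    "..###..",
    "..#....",
    "..###..",
    ".......",
]
print("\n".join(trim(padded)))
```

The sketch trims rows before columns while the Java version does the reverse; the resulting crop is the same either way.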

aditsu quit because SE is EVIL

Posted 2014-01-05T06:16:31.547

Reputation: 22 326

I changed the rules to allow this and similar entries. Impressive file size. – Charles – 2014-01-05T18:09:05.697

Mathematica, ~~753~~ 100 bytes

f[n_,format_]:=
Module[{filename},
Print["raster: ",n," by ", n];
filename="2014At"<>ToString[n]<>"."<>format;
Print["filename:  ",filename];
Print["format: ",format];
Print["raster image: ",rasterImg=Rasterize[Style[2014,"OCR A Std"],
RasterSize->n,ImageSize->1n,ImageResolution->6n]];
Export[filename,rasterImg];
Print["Actual imported image: ",img=Switch[format,"PDF"|"HDF",Import[filename][[1]],
_,Import[filename]]];
Print["Identified text: ",TextRecognize[ImageResize[img,Scaled[3]]]];
Print["filesize (bytes): ",FileByteCount[filename]]]

My best case so far:

f[24, "PBM"]

[image: efficiency]

DavidC

Posted 2014-01-05T06:16:31.547

Reputation: 24 524

Mathematica, 78 bytes

The trick to winning this in Mathematica will probably be to use the ImageResize[] function, as below.

First, I created the text "2014" and saved it to a GIF file, for fair comparison with David Carraher's solution. The text looks like this: [image: 2014]. It is not optimized in any way; it's just Geneva in a small font size, and other fonts and smaller sizes may be possible. Straight TextRecognize[] would fail, but TextRecognize[ImageResize[...]] has no problem.

filename = "~/Desktop/2014.gif";
Print["Actual imported image: ", img = Import[filename]]
Print["Identified text: ", 
 TextRecognize[ImageResize[img, Scaled[2]]]]
Print["filesize (bytes): ", FileByteCount[filename]]

[image: results]

Fussing with the typeface, font size, degree of scaling, etc., will probably result in even smaller files that work.
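The key step in this answer is enlarging the image before recognition. As a rough illustration of what integer upscaling does to a tiny bitmap, here is a Python sketch that uses nearest-neighbour pixel replication on a text grid. Nearest-neighbour is an assumption chosen for simplicity; the resampling that ImageResize applies may differ, and `upscale` is a hypothetical helper, not a Mathematica function.

```python
def upscale(rows, factor=2):
    """Enlarge a grid of strings by repeating every pixel `factor` times in x and y."""
    out = []
    for r in rows:
        wide = "".join(ch * factor for ch in r)  # stretch horizontally
        out.extend([wide] * factor)              # then repeat the row vertically
    return out


small = [
    "#.#",
    "###",
    "..#",
]
big = upscale(small, factor=2)
print("\n".join(big))
```

The stored file keeps its original size; only the in-memory copy handed to the OCR step grows, which is why the enlargement costs nothing in the byte count.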

Michael Stern

Posted 2014-01-05T06:16:31.547

Reputation: 3 029

Very impressive file size. – DavidC – 2014-01-05T16:28:54.927

You can crop the white borders off the image to make it smaller, shorten the spaces between digits, and maybe redraw the digits to make them more compact. – swish – 2014-01-05T16:44:35.557

@swish indeed, trimming the white border takes it to 78 bytes. – Michael Stern – 2014-01-05T20:37:28.787

Produce the number 2014 from an image
