Seeing individual glyphs in a PDF /FontFile2 object

How can I extract the mapping from character IDs (CIDs) to glyph instructions in an embedded CID font of a PDF?

Some more details and motivation:

I have a large collection of PDFs, some of which have faulty CMaps that cause problems when extracting text from the files.

To correct this, I'd like to understand the /FontFile2 stream object (an embedded CID-type font) contained in the PDFs. It is probably enough to parse the stream into a mapping from CIDs to glyph instructions, without understanding how to interpret the instructions.
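A minimal sketch of such a parse, under several assumptions: the /FontFile2 stream has already been decoded (for example, inflated) into raw bytes, it holds a TrueType font program with `head`, `maxp`, `loca` and `glyf` tables, and the font's /CIDToGIDMap is /Identity so that CID and glyph ID coincide. The function names `read_tables`, `glyph_records` and `glyph_fingerprints` are made up for illustration. The glyph instructions are never interpreted, only sliced out and hashed.

```python
import hashlib
import struct


def read_tables(font: bytes) -> dict:
    """Return {tag: table bytes} from the sfnt table directory."""
    # Offset table: uint32 version, uint16 numTables, then three uint16 fields.
    num_tables = struct.unpack_from(">H", font, 4)[0]
    tables = {}
    for i in range(num_tables):
        # Each 16-byte record: tag, checksum, offset, length.
        tag, _checksum, offset, length = struct.unpack_from(
            ">4sIII", font, 12 + 16 * i
        )
        tables[tag.decode("latin-1")] = font[offset:offset + length]
    return tables


def glyph_records(font: bytes) -> dict:
    """Map glyph ID -> raw 'glyf' record (empty bytes if no outline)."""
    t = read_tables(font)
    num_glyphs = struct.unpack_from(">H", t["maxp"], 4)[0]
    # head.indexToLocFormat (int16 at byte 50): 0 = short, 1 = long offsets.
    long_loca = struct.unpack_from(">h", t["head"], 50)[0] == 1
    count = num_glyphs + 1  # 'loca' has one extra entry marking the end
    if long_loca:
        offsets = struct.unpack_from(">%dI" % count, t["loca"], 0)
    else:
        # Short offsets are stored divided by two.
        offsets = [2 * o for o in struct.unpack_from(">%dH" % count, t["loca"], 0)]
    glyf = t["glyf"]
    return {gid: glyf[offsets[gid]:offsets[gid + 1]] for gid in range(num_glyphs)}


def glyph_fingerprints(font: bytes) -> dict:
    """Map glyph ID -> SHA-1 of its raw record, or None for empty glyphs."""
    return {
        gid: hashlib.sha1(data).hexdigest() if data else None
        for gid, data in glyph_records(font).items()
    }
```

Two caveats on this sketch: if /CIDToGIDMap is a stream rather than /Identity, an extra CID-to-glyph-ID lookup is needed before indexing, and composite glyphs refer to their components by glyph ID, so their raw bytes may differ between subsets even when the shape is the same.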

The CIDs keep shifting around from one file to the next in the collection, even though there are only about half a dozen fonts. So I'm hoping that, even without understanding how to interpret the glyph instructions, I will be able to identify the glyphs uniquely and fix the CMaps by comparing faulty and correct ones, perhaps just by applying a simple majority rule to determine the mapping "glyph instructions" -> Unicode, and using that to recompute the CMaps of individual files.
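The majority rule could be sketched roughly as follows. The inputs are hypothetical: each file is assumed to have been reduced to a pair of plain dicts, one mapping CID to a glyph fingerprint (some hash of the glyph data) and one mapping CID to the Unicode string taken from the file's existing CMap. Parsing the CMap itself is not shown, and `majority_unicode` and `rebuild_cmap` are illustrative names only.

```python
from collections import Counter, defaultdict


def majority_unicode(files):
    """Vote, per glyph fingerprint, on the Unicode string it maps to.

    files: iterable of (fingerprints, cmap) pairs, where
      fingerprints maps CID -> fingerprint (or None for empty glyphs)
      cmap         maps CID -> Unicode string from that file's CMap
    """
    votes = defaultdict(Counter)
    for fingerprints, cmap in files:
        for cid, fp in fingerprints.items():
            if fp is not None and cid in cmap:
                votes[fp][cmap[cid]] += 1
    # Keep the most frequent Unicode string for each fingerprint.
    return {fp: counts.most_common(1)[0][0] for fp, counts in votes.items()}


def rebuild_cmap(fingerprints, consensus):
    """Recompute one file's CID -> Unicode mapping from the consensus."""
    return {
        cid: consensus[fp]
        for cid, fp in fingerprints.items()
        if fp in consensus
    }
```

This only works if correct CMaps outnumber faulty ones for each glyph; a tie is resolved arbitrarily by whichever candidate was seen first.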

Just Me

Posted 2019-03-28T22:51:57.307

Reputation: 1

No answers