How to convert PDF to ASCII Postscript so the contained text can be searched/replaced?

2

1

According to Chapter 3.2 of the PostScript Language Reference, "there are three encodings for the PostScript language: ASCII, binary token, and binary object sequence".

We've been generating PDF files from HTML/CSS with PrinceXML for quite some time. Recently, a new requirement arose in cooperation with another company that needs the contents of our PDF files as PostScript. When we convert the PDF to PS on the command line with pdf2ps, pdftops, a2ping or similar tools, the resulting PS files seem to use one of the binary encodings, because the text cannot be found by searching.

We deliver the PS file a few days before printing and don't know the printing date in advance, but the printing date must appear in the printed output. We therefore need to insert a date placeholder (##.##.####), which they will replace automatically at print time.

If we insert that placeholder into our HTML/CSS source, it cannot be found in the contents of the PostScript file and therefore cannot be replaced with the current date before printing.

Does anyone know a way to convert the PDF to ASCII PostScript so the contained text can be searched and replaced?

Codepunkt

Posted 2011-06-16T15:28:27.057

Reputation: 121

Why are you doing HTML -> PDF -> PS? Why not go straight from HTML to PS for this client? – Flimzy – 2011-06-19T06:48:06.777

because we didn't find any way to do so that produces ascii postscript so we can use placeholders the way we need to and that supports the same or almost the same html/css features as princeXML so both pdf and ps look the same. – Codepunkt – 2011-06-19T19:21:57.493

One way to make your PDF and PS look the same would be to do HTML -> PS -> PDF... although that doesn't address the text replacement requirement. IME, when text-replacement is required in PS, it's usually been done by writing raw PS. It's also possible that TeX could output into ASCII PS. But I'm sure you have absolutely no interest in rewriting your documents in a way that you could do this, though. :) I wish I could offer a better suggestion. – Flimzy – 2011-06-19T20:36:32.340

we considered writing raw postscript and then doing html > ps > pdf. could be kind of an emergency solution in case we find no other way to deal with it. any idea on books/tutorials/best practices for writing raw ps? :) – Codepunkt – 2011-06-19T22:43:43.750

Can you provide a link to a sample PDF that you need to convert to ASCII Postscript? Can you also provide a link to a non-ASCII PostScript that is supposed to contain your placeholder '##.##.####'? This way I might be able to work out a path for you to follow... – Kurt Pfeifle – 2011-06-20T16:09:04.620

Answers

0

I had no luck with pdf2ps.

With pdftops version 0.12.4 (bundled with poppler), I can find text in the PS code, but only one word at a time (each word is enclosed in parentheses).

For example, download a sample PDF, convert it, and edit the result with sed:

wget ctan.org/tex-archive/macros/latex/contrib/lipsum/lipsum.pdf
pdftops lipsum.pdf
sed 's;2011/;2012/;' lipsum.ps > lipsum2.ps

This changes the year (which appears near the beginning of the file) from 2011 to 2012. Be careful, though: you can't always simply change the text. Depending on the structure of the PS code, there may not be enough space for the replacement text. Try the previous example with 2013 instead of 2012 and you'll see.
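Given the length caveat above, the safest case is probably a replacement that is exactly as long as the text it replaces. The placeholder from the question, ##.##.####, and a date in DD.MM.YYYY form are both 10 characters. Here is a minimal sketch of that idea. It assumes the placeholder survives the conversion as one literal string, which depends on how the fonts are embedded and is not guaranteed. The file names sample.ps and dated.ps are made up, and the PS fragment is a hand-written stand-in, not real pdftops output.

```shell
#!/bin/sh
# Hypothetical stand-in for converter output: one PostScript string per word.
printf '%s\n' '(Printed) Tj' '(on) Tj' '(##.##.####) Tj' > sample.ps

# Replacement date in DD.MM.YYYY form: 10 characters, same as the placeholder.
print_date=$(date +%d.%m.%Y)

# Count the lines containing the placeholder; -F matches it as a fixed string.
hits=$(grep -cF '##.##.####' sample.ps)
echo "placeholder lines found: $hits"

# Swap the placeholder for the date; each '.' is escaped so it matches literally.
sed "s/##\.##\.####/$print_date/g" sample.ps > dated.ps
```

If the grep count is 0 on a real converted file, the placeholder was probably not written as plain text, and a same-length sed replacement will not help.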

I don't understand PostScript, but I suspect that some conversions may produce a file that is partly binary and partly text. If so, try sed, which should leave the non-textual bytes as they are.
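If the file really is part binary, some sed implementations may complain about, or mishandle, bytes that are invalid in the current locale. One common workaround is to run sed in the C locale so that it works byte by byte. This is a rough sketch under that assumption. The file mixed.ps is a made-up example, not real converter output.

```shell
#!/bin/sh
# Hypothetical mixed file: ASCII PostScript lines around some raw non-UTF-8 bytes.
printf '(2011/01/01) Tj\n\377\376\200binary\n(end) Tj\n' > mixed.ps

# LC_ALL=C makes sed work on single bytes rather than multibyte characters.
LC_ALL=C sed 's;2011/;2012/;' mixed.ps > mixed2.ps

# -a tells grep to treat the file as text even though it holds binary bytes;
# -c prints the number of matching lines.
LC_ALL=C grep -a -c '2012/' mixed2.ps
```

Comparing the two files with cmp -l afterwards is a quick way to check that only the intended bytes changed.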

David Costa

Posted 2011-06-16T15:28:27.057

Reputation: 701

0

Another solution consists of modifying the original PDF so that the date is in a form, and then using flpsed to fill it in. Check it out here: http://freshmeat.net/projects/flpsed

David Costa

Posted 2011-06-16T15:28:27.057

Reputation: 701