How can I determine the page count of a PostScript file (generated by Opera)?

2

I don't know the PostScript language.

I have a duplex printing emulation system written in bash. It prints the odd pages first and then the even pages. It needs to know if there's an odd page count so it can eject the last odd page that doesn't have a corresponding even side. It also uses page counts for reporting purposes.

I didn't know how to do this correctly, so I wrote code that searches the end and, if necessary, the beginning of the PostScript file for "%%Pages:", which is followed by a page count. This works on almost everything except files printed by the Opera browser.
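A minimal sketch of the header/trailer lookup described above, assuming the file follows the DSC convention in which the header may say "%%Pages: (atend)" and defer the real number to the trailer. The function name dsc_page_count is hypothetical; it is not the ps_page_ct function from the project.

```shell
#!/bin/sh
# Hypothetical helper: print the page count declared by DSC comments,
# or return non-zero if no numeric %%Pages: comment exists.
dsc_page_count() {
    # Keep only %%Pages: lines whose value starts with a number; this skips
    # the deferred form "%%Pages: (atend)". The last match wins, so a
    # trailer value overrides anything earlier in the file.
    count=$(sed -n 's/^%%Pages:[[:space:]]*\([0-9][0-9]*\).*/\1/p' "$1" | tail -n 1)
    [ -n "$count" ] || return 1
    printf '%s\n' "$count"
}
```

Called as dsc_page_count input.ps, it prints the number or fails, so the caller can fall back to another method. It scans the whole file instead of only the two ends, which is simpler but can pick up a %%Pages: comment from an embedded document if that happens to come last.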

Can anyone suggest another way to get this information?

PostScript files tend to be rather large, with a lot of non-human-readable content, so I haven't yet spent much time poring over the ones that come out of Opera.

TIA

The current code is at:

http://sourceforge.net/projects/duplexpr/

function ps_page_ct

Joe

Posted 2011-12-09T12:56:20.193

Reputation: 533

Answers

5

The following Ghostscript command will reliably count the pages in your PostScript file -- but it can be rather slow, because it requires the file to be completely interpreted (run), as @afrazier already stated in a comment:

gs \
  -o /dev/null \
  -sDEVICE=bbox \
  input.ps 2>&1 \
| grep HiResBoundingBox \
| wc -l
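For use in a script, the pipeline above could be wrapped in a function and the result tested for oddness. This is a sketch, not a drop-in: the names count_bbox_lines, gs_page_count and is_odd are made up here, and -q is added only to suppress Ghostscript's startup banner.

```shell
#!/bin/sh
# Count the %%HiResBoundingBox lines that the bbox device emits,
# one per rendered page. Reads Ghostscript's output on stdin.
count_bbox_lines() {
    grep -c '^%%HiResBoundingBox:'
}

# Hypothetical wrapper: interpret the whole file and print its page count.
# The bbox device writes to stderr, hence the 2>&1.
gs_page_count() {
    gs -q -o /dev/null -sDEVICE=bbox "$1" 2>&1 | count_bbox_lines
}

# True when the count is odd, i.e. the last sheet has no back side.
is_odd() {
    [ $(( $1 % 2 )) -eq 1 ]
}
```

A caller might run pages=$(gs_page_count input.ps) and then use is_odd "$pages" to decide whether to eject the final sheet. Splitting the counting from the Ghostscript call keeps the slow part in one place, so it can be made optional.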

Kurt Pfeifle

Posted 2011-12-09T12:56:20.193

Reputation: 10 024

Finally! I'll check it out and come back here. Thank you. – Joe – 2012-06-29T17:49:34.400

This appears to work, but "rather slow" is an understatement. On my i3 notebook, a three-page document runs in just over two minutes. At best, I can add a switch (option) to my system to use this method as a last resort. In the meantime, I altered my code to count "HiResBoundingBox" instead of "showpage". – Joe – 2012-06-29T18:49:30.973

@Joe: If your code relies on simply grepping the input file for HiResBoundingBox, it will not work. This word need not appear in input files at all -- it shows up in the output stream because Ghostscript interprets all of the input and distills this piece of information for you. – Kurt Pfeifle – 2012-06-29T22:23:56.523

@Joe: The reason I said the command is 'rather slow' is this: Ghostscript needs to completely interpret and render the PostScript file (without displaying it) in order to reliably extract the page count. That is just as much work as displaying the whole file on screen. The reason is that PostScript is a Turing-complete programming language, and to see what's happening at one spot, the interpreter has to run, in sequence, all the code located before that spot. – Kurt Pfeifle – 2012-06-29T22:28:50.623

@Joe: Two minutes for a 3 page document is extreme. It means that this document is a rather complex beast, and that it would take Ghostscript just as long to simply make that document display on screen... – Kurt Pfeifle – 2012-06-29T22:37:50.100

Thanks. I'll change the code back to using "showpage" (once it fails to find a %%Pages). I know it's not guaranteed to work, but it often does. Eventually, I'll add a switch to optionally do it the right way for files that don't work. – Joe – 2012-07-01T04:27:57.633

Well, now that I got that to work, they came out with PostScript 1.4, 1.5 and 1.6. I came up with a heuristic that works sometimes with 1.4 and 1.5. I didn't see anything in the 1.6 file I examined that looked like it would work. Will your brute force method above still work with these newer versions? – Joe – 2013-02-19T10:22:53.230

@Joe: there is no such thing as "PostScript 1.4, 1.5 and 1.6". PostScript specifies levels 1, 2 and 3. Are you sure you did indeed mean PostScript?? The versions you named are there for PDF, not for PostScript. – Kurt Pfeifle – 2013-02-24T17:13:01.203

Oops. I'm working on both at the same time and I get them confused. Those are PDF versions and should be in a separate question. Thanks for noticing. – Joe – 2013-02-25T22:51:25.700

5

Unfortunately, there is no simple way of finding pages in a raw PostScript file. That is why the %%Pages convention was created (as part of the Adobe Document Structuring Conventions).

The command for issuing a page is showpage. In simple cases, you just have to count them.

But this command can be embedded in the body of a function and then you need a Postscript parser.
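To illustrate why counting is unreliable, here is a small sketch. It writes a three-page PostScript file in which showpage appears only once in the text, inside a repeat loop, and then counts occurrences with grep. The function name naive_showpage_count and the sample file are invented for this example.

```shell
#!/bin/sh
# Naive heuristic: count textual occurrences of the word "showpage".
naive_showpage_count() {
    grep -ow 'showpage' "$1" | wc -l | tr -d ' '
}

# A tiny PostScript program that emits three (blank) pages via a loop.
sample=$(mktemp)
cat > "$sample" <<'EOF'
%!PS
3 { showpage } repeat
EOF

# Prints 1, although an interpreter would produce 3 pages.
naive_showpage_count "$sample"
rm -f "$sample"
```

The same mismatch arises when showpage is wrapped in a procedure that is defined once and called many times, which is why only interpreting the file gives a dependable count.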

mouviciel

Posted 2011-12-09T12:56:20.193

Reputation: 2 858

+1 for what is, ultimately, the correct answer. PostScript is a Turing-complete language. If you want an accurate answer, interpreting the file is the "One True Way". I'd start seriously looking at leveraging Ghostscript if you really need this. – afrazier – 2011-12-09T13:48:55.043

@afrazier: Did you ever compare this "one true answer" with mine? How does mine then rate on your trueness scale? – Kurt Pfeifle – 2013-02-26T01:54:11.513

@KurtPfeifle: I didn't see your answer, but after reviewing it, it's definitely worth a +1. After some more Googling, I see you've also posted this on SO. There's also this comment from the Ghostscript maintainer. I'm not sure they'll be any faster than what you used above, though. – afrazier – 2013-02-26T13:54:34.810

2

I found this little snippet somewhere. It processes the document very quickly and prints out the page count. This can help if exiftool does not print this metadata because the document was not generated correctly:

gs -dNODISPLAY -dBATCH -dNOPAUSE -o /dev/null source | grep -P '^Page' | wc -l

j.berrisch

Posted 2011-12-09T12:56:20.193

Reputation: 21

Tried it on two test files. One gave 0 for a 3 page document and the other generated a ghostscript error message. – Joe – 2013-04-25T00:00:19.333

Did you try without the |grep ....? What's the output? I use it as fallback when exif data does not contain 'Page Count'. Is there a link to the documents you are using? Which version of ghostscript are you using? On which OS? – j.berrisch – 2013-04-25T09:11:15.457

Here's one document: https://dl.dropboxusercontent.com/u/54584985/Opera01.ps . It does:

bigbird@ramdass:~/pgm/duplex_proj/devel/test_data$ gs -dNODISPLAY -dBATCH -dNOPAUSE -o /dev/null Opera01.ps|grep -P '^Page'|wc -l
0

Here's without the grep:

bigbird@ramdass:~/pgm/duplex_proj/devel/test_data$ gs -dNODISPLAY -dBATCH -dNOPAUSE -o /dev/null Opera01.ps
GPL Ghostscript 9.05 (2012-02-08)
Copyright (C) 2010 Artifex Software, Inc. All rights reserved.
This software comes with NO WARRANTY: see the file PUBLIC for details.
bigbird@ramdass:~/pgm/duplex_proj/devel/test_data$

Sorry about format. – Joe – 2013-04-28T03:50:55.810

Sorry, but you're right. I only tested it on my badly formatted PDFs, for which it works quite well. Perhaps Opera should fix something, because the exiftool result shows a 'Pages' property but with a weird value, 'atend'. – j.berrisch – 2013-05-03T13:11:39.213