How does less display PDFs?

51

9

I have tried several programs (pdftotext, pdf2txt.py, ...). All of them can extract text from PDFs, but the one doing the best job is good ol' less: the text it shows from the PDF keeps a proper layout. How is less doing this? Is it using a library, or is the PDF processing built in?

I am asking because I would like to use this functionality programmatically, without necessarily having to run less as an external program (I am using Python).

My system is:

» less --version
less 458 (GNU regular expressions)
Copyright (C) 1984-2012 Mark Nudelman

less comes with NO WARRANTY, to the extent permitted by law.
For information about the terms of redistribution,
see the file named README in the less distribution.
Homepage: http://www.greenwoodsoftware.com/less

» uname -a
Linux polyphemus 3.13.0-53-generic #89-Ubuntu SMP Wed May 20 10:34:39 UTC 2015 x86_64 x86_64 x86_64 GNU/Linux

dangonfast

Posted 2015-06-09T19:53:48.467

Reputation: 1 878

Answers

62

Your distribution is probably using the popular lesspipe.sh script. Check the LESSOPEN environment variable.
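Since you are working in Python, here is a minimal sketch of how you could inspect LESSOPEN from there. It assumes the common "input pipe" form of the variable (a leading | followed by a command containing %s); the function name lessopen_command is made up for illustration and is not part of less or any library.

```python
import os
import shlex


def lessopen_command(path, env=None):
    """Return the argv of the LESSOPEN input preprocessor for `path`.

    Returns None if LESSOPEN is unset or is not in the pipe form.
    """
    env = os.environ if env is None else env
    value = env.get("LESSOPEN", "")
    # Only the pipe form ("|cmd %s", "||cmd %s", "|-cmd %s") is handled here.
    if not value.startswith("|"):
        return None
    spec = value.lstrip("|").lstrip("-").strip()
    if not spec:
        return None
    # less substitutes the file name for %s in the command string.
    return [token.replace("%s", path) for token in shlex.split(spec)]


if __name__ == "__main__":
    print(lessopen_command("example.pdf"))
```

On a system where LESSOPEN is set, this prints the script that less runs before displaying the file, which is the script to read to find the actual PDF command.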

This script exists in many variations. I’m looking at the Gentoo version. In it, you’ll find the following line:

*.ps|*.pdf) ps2ascii "$1" || pstotext "$1" || pdftotext "$1" ;;

That means it tries those commands in the order shown, moving on to the next one only if the previous one fails (that is what || does). $1 is the file name.
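The same "try each tool until one succeeds" chain can be sketched in Python with subprocess. This is only an illustration of the fallback idea, not what lesspipe.sh itself does internally; run_first_successful is a hypothetical helper name, and unlike the shell version it discards the output of commands that fail.

```python
import shutil
import subprocess


def run_first_successful(commands):
    """Run each argv in order and return the stdout of the first that exits 0.

    Commands that are not installed are skipped, much like a failing
    command in a shell `a || b || c` chain. Returns None if all fail.
    """
    for argv in commands:
        if shutil.which(argv[0]) is None:
            continue  # tool not installed, try the next one
        proc = subprocess.run(
            argv,
            stdout=subprocess.PIPE,
            stderr=subprocess.DEVNULL,
            text=True,
        )
        if proc.returncode == 0:
            return proc.stdout
    return None
```

You would pass it a list of argument lists, one per converter you want to try, with the converters that give the best results first.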

Another version uses the following command:

pdftohtml -stdout "$t" | parsehtml -

Daniel B

Posted 2015-06-09T19:53:48.467

Reputation: 40 502

15 Thanks, it turns out it is using pdftotext -layout $1 - – dangonfast – 2015-06-09T20:19:59.103

@jeckyll2hide Did you find the explanation for the better results with less? – vvy – 2015-06-17T05:46:19.900

@vvy Probably the -layout switch. ;) – Daniel B – 2015-06-17T07:01:37.503
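For the Python use case from the question, the command reported in the comments can be called directly. This is a rough sketch that assumes pdftotext (from poppler-utils) is installed and on PATH, and that its output is UTF-8 (its default encoding); the function names are invented for this example.

```python
import subprocess


def pdftotext_layout_argv(pdf_path):
    """Build the argv for `pdftotext -layout FILE -` (a final "-" means stdout)."""
    return ["pdftotext", "-layout", pdf_path, "-"]


def pdf_to_text(pdf_path):
    """Return the layout-preserving text of a PDF, as less shows it here.

    Raises subprocess.CalledProcessError if pdftotext exits with an error,
    and FileNotFoundError if pdftotext is not installed.
    """
    proc = subprocess.run(
        pdftotext_layout_argv(pdf_path),
        check=True,
        capture_output=True,
    )
    return proc.stdout.decode("utf-8", errors="replace")
```

This still runs an external program, but it skips less entirely and gives you the text as a Python string.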