This is a hard task and not easy to solve. If you really want an exact result, copy paragraph by paragraph from your PDF viewer into a text file and check it with the wc -w tool. The reason not to use pdftotext in that case is that mathematical formulas may also end up in the output and be counted as "words". (Alternatively, you could edit the output you get from pdftotext.) Another reason this may fail is the headings: "4.3.2 Foo Bar" is counted as three words.
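To see the heading problem in isolation, here is a small sketch that feeds a made-up heading straight to wc -w, with no PDF involved. The sample text and the variable names are assumptions for illustration only:

```shell
# A heading as it might appear in pdftotext output (made-up sample text).
heading='4.3.2 Foo Bar'

# wc -w counts every whitespace-separated token, so the section number counts too.
# tr -d ' ' strips the padding that some wc implementations add to the number.
total=$(printf '%s\n' "$heading" | wc -w | tr -d ' ')

# Put each token on its own line and count only those starting with a letter.
letters=$(printf '%s\n' "$heading" | tr ' ' '\n' | grep -c '^[A-Za-z]')

echo "wc -w: $total, starting with a letter: $letters"
```

The first count includes the section number as a token, the second does not.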
A way around this is to count only words starting with a character from [A-Za-z]. So what I usually do is a two-step approach:
Get the list of unique words and check whether there are too many false positives in it:
pdftotext foo.pdf - | tr " " "\n" | sort | uniq | grep "^[A-Za-z]" > words
I don't use a dictionary here, because misspelled words would then not be counted as words.
Then take this word list and grep for it in the output of pdftotext:
pdftotext foo.pdf - | tr " " "\n" | grep -Ff words | wc -l
I know this could be done in a one-liner, but then I could not easily see the filter result from the first step. The -F option may help you, as stated in the comment by moi below (thanks).
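For reference, a one-liner variant might look like the sketch below. It is not the exact equivalent of the two steps above: count_letter_words is a hypothetical helper name, it reads the text on stdin, and it skips the intermediate word list, so you lose the chance to inspect the filter result.

```shell
# count_letter_words: hypothetical helper, reads text on stdin and prints
# the number of whitespace-separated tokens that start with a letter in [A-Za-z].
count_letter_words() {
    # Turn every run of spaces, tabs, and newlines into a single newline,
    # so that each token ends up on its own line, then count matching lines.
    tr -s '[:space:]' '\n' | grep -c '^[A-Za-z]'
}

# Usage with a PDF (foo.pdf is a placeholder name):
#   pdftotext foo.pdf - | count_letter_words
```

Note that grep -F without -x matches substrings, so the second step of the two-step version can also count tokens that merely contain a listed word (for example a word wrapped in parentheses). The sketch here only counts tokens that themselves begin with a letter, so the two numbers may differ slightly.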
It is worth noting that pdftotext is part of Xpdf, which is also available for the Windows platform. The Xpdf download page is located here: http://www.foolabs.com/xpdf/download.html . wc can also be found, but alternatively one can use pretty much any word processor like Word or LibreOffice Writer. They count words as well. (For LibreOffice Writer go to File -> Properties -> Statistics) – amenthes – 2016-08-01T23:41:31.960
It's pdftotext: don't forget the e. And you can use a single command: pdftotext myfile.pdf - | wc -w. – frabjous – 2010-12-13T04:15:43.020
@frabjous Thanks, updated the answer with the suggestions! – icyrock.com – 2010-12-14T01:48:24.247