Count the number of words in a PDF file

68

28

How can I get the word count of a PDF file? Most of the PDF files I want to count have an embedded text layer, so I don't need OCR.

The task arose from searching for scientific papers of a known size, e.g. 15000 words. Most modern papers are published in PDF format.

osgx

Posted 2010-12-13T02:07:11.660

Reputation: 5 419

Answers

94

Quick Answer:

pdftotext myfile.pdf - | wc -w

Long Answer:

If on Unix, you can use pdftotext to convert the PDF into a text file (here called converted-pdf.txt), and then run a word count on the generated file:

wc -w converted-pdf.txt

to get the word count.

Also, see the comment by frabjous: you can do it in one step by writing the text to stdout and piping it to wc, instead of going through a temporary file:

pdftotext myfile.pdf - | wc -w
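
For the use case in the question (finding a paper of a known size among several PDFs), the one-step command can be wrapped in a small function. This is only a sketch: the names pdf_word_count, pdf_word_counts and PDF_TO_TEXT are made up for illustration, and it assumes pdftotext is on the PATH.

```shell
# Hypothetical helpers, not part of pdftotext itself.
# PDF_TO_TEXT is an assumed override hook; it defaults to pdftotext.
PDF_TO_TEXT=${PDF_TO_TEXT:-pdftotext}

# Print the number of whitespace-separated words in one PDF.
pdf_word_count() {
    # "-" makes pdftotext write the extracted text to stdout.
    # tr strips the padding some wc implementations put around the number.
    "$PDF_TO_TEXT" "$1" - | wc -w | tr -d '[:space:]'
}

# Print "count<TAB>file" for every PDF given as an argument.
pdf_word_counts() {
    for f in "$@"; do
        printf '%s\t%s\n' "$(pdf_word_count "$f")" "$f"
    done
}
```

Something like pdf_word_counts *.pdf | sort -n then lists the files by word count, which makes a paper of roughly 15000 words easy to spot.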

icyrock.com

Posted 2010-12-13T02:07:11.660

Reputation: 4 623

It is worth noting that pdftotext is part of Xpdf, which is also available for Windows. The Xpdf download page is located here: http://www.foolabs.com/xpdf/download.html . wc can also be found for Windows, but alternatively one can use pretty much any word processor, such as Word or LibreOffice Writer; they count words as well. (For LibreOffice Writer, go to File -> Properties -> Statistics.)

– amenthes – 2016-08-01T23:41:31.960

10 It's pdftotext: don't forget the e. And you can use a single command: pdftotext myfile.pdf - | wc -w. – frabjous – 2010-12-13T04:15:43.020

1 @frabjous Thanks, updated the answer with the suggestions! – icyrock.com – 2010-12-14T01:48:24.247

13

This is a hard task and not easy to solve. If you really want an exact result, copy the text paragraph by paragraph from your PDF viewer into a text file and check it with the wc -w tool. The reason not to use pdftotext in that case is that mathematical formulas may also end up in the output and be counted as "words". (Alternatively, you could edit the output you get from pdftotext.) Another source of error is headings: "4.3.2 Foo Bar" is counted as three words.

A workaround is to count only words starting with a character from [A-Za-z]. So what I usually do is a two-step approach:

  1. Get the list of unique words and check whether it contains too many false positives:

    pdftotext foo.pdf - | tr " " "\n" | sort | uniq | grep "^[A-Za-z]" > words

    I don't use a dictionary here, because misspelled words would then not be counted as words.

  2. Take this word list and grep for it in the output of pdftotext:

    pdftotext foo.pdf - | tr " " "\n" | grep -Ff words | wc -l

I know this could be done in a one-liner, but then I could not easily inspect the filter result from the first step. The -F option may help, as pointed out in the comment by moi below (thanks).
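
If the intermediate word list is not needed, the same filter can be applied in a single pass. The sketch below is a variant under stated assumptions, not the exact two-step method above: the function name count_alpha_words is made up, it splits on any whitespace rather than only on spaces, and it counts every token that starts with a letter from [A-Za-z]. One caveat about step 2: grep -F matches substrings, so a short entry in the word list can also match longer tokens that merely contain it; adding -x (match whole lines only) avoids that.

```shell
# Hypothetical helper: read plain text on stdin and print how many
# whitespace-separated tokens start with an ASCII letter.
count_alpha_words() {
    # tr -s turns every run of whitespace into a single newline (one token per line).
    # grep -c counts the lines that begin with a letter.
    # LC_ALL=C keeps [A-Za-z] limited to ASCII letters.
    # "|| true" is there because grep exits non-zero when the count is 0.
    tr -s '[:space:]' '\n' | LC_ALL=C grep -c '^[A-Za-z]' || true
}
```

Used as pdftotext foo.pdf - | count_alpha_words, a heading such as "4.3.2 Foo Bar" then contributes two words instead of three.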

math

Posted 2010-12-13T02:07:11.660

Reputation: 2 376

1 I had to use grep -Ff words, because grep otherwise complains about "Unmatched [ or [^". From the man page:

-F, --fixed-strings
              Interpret PATTERN as a list of fixed strings, separated by newlines, any of which is to be matched. (-F is specified by POSIX.)
 – moi – 2016-08-23T09:09:24.963

10

I just tried out a free program, Translator's Abacus. You can drag and drop various file types (including PDF), and it pops up a browser with a printable report of the word count for each document. It worked fine for me. (It is specifically created for word counts and is only 435 KB... that is, not a "big application"). Translator's Abacus doesn't work on PDF 1.5 or later.

Alternatively: you can just Ctrl+A to select all text in Acrobat Reader and then copy-paste it into a program like Microsoft Word (which has a word count on the status bar at the bottom of the screen).

Adam

Posted 2010-12-13T02:07:11.660

Reputation: 251

In (many?) PDFs, Ctrl+A only selects the words on the current page, not the entire document. The Translator's Abacus works perfectly though, great! – Junuxx – 2012-10-08T12:44:45.383

3 Correction: Translator's Abacus doesn't work on PDF 1.5 or later. – Junuxx – 2012-10-08T12:55:57.100

+1 Ctrl+A in Adobe Reader together with WinMerge work great in Windows! – superjos – 2013-03-04T10:43:43.177

2

A straightforward way to do this, if you are using Acrobat Pro, is to export the PDF to a Microsoft Word document and then do the word count in Word. Alternatively, you can export it to a plain text file and use a word count utility in the text editor of your choice. I just did a word count on a PDF article using the Word method and it took all of 30 seconds to complete.

Hope this helps.

Bruce Crawford

Posted 2010-12-13T02:07:11.660

Reputation: 121

I converted to text and did wc -w filename.txt. It worked. Thanks. – vijayst – 2017-09-09T16:24:38.113

1

You can install OCRFeeder. In it, choose File -> Import PDF -> Automatically detect and recognize all pages -> Export to ODT, and a LibreOffice Writer document will be ready for a word count or any other RTF function you want to use.

user55926

Posted 2010-12-13T02:07:11.660

Reputation: 11

0

You can use Adobe Acrobat's console JavaScript with the following code, which I took from Dave Merchant's answer on forums.adobe.com:

var cnt=0;
for (var p = 0; p < this.numPages; p++) cnt += getPageNumWords(p);
console.println("There are " + cnt + " words in this file.");

Tested with Adobe Acrobat Pro DC 2018.011.20040 on Windows 7 SP1 x64 Ultimate.


To enable the JavaScript Console:

[screenshot omitted]

To launch the JavaScript Console Window:

CTRL + J


FYI, if you have the LaTeX source corresponding to the PDF, see: Correct word-count of a LaTeX document.

Franck Dernoncourt

Posted 2010-12-13T02:07:11.660

Reputation: 13 518

0

In Windows, starting from Microsoft Office 2013, you can open a PDF file in MS Word. Here is an example of a PDF file that I've opened in MS Word 2016:

[screenshot omitted]

Once it is open, you can see the number of words at the bottom left of the MS Word status bar.

Navaro

Posted 2010-12-13T02:07:11.660

Reputation: 111

0

I find the word counter included in abracadabra tools convenient. The installation is a bit quirky though.

Christoph

Posted 2010-12-13T02:07:11.660

Reputation: 796

-1

The de facto standard, which translators have used since around 2000, is AnyCount Word Count Tool. It does word counts in PDF and 37 other formats.

Vladimir

Posted 2010-12-13T02:07:11.660

Reputation: 1

Vladimir, are there any third-party references (mentions in books, papers, journals, market reviews) showing that AnyCount is widely used in the word counting and translation markets? Like https://books.google.com/books?id=llKVpiO2q0EC&pg=PA19#v=onepage&q=any+count&f=false

– osgx – 2017-11-06T16:52:54.657

-3

Press Ctrl+Shift+F to open the advanced search, type the word, and it will count how many times it appears in the document. It is not rocket science.
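
For text that has already been extracted (for example with pdftotext), counting the occurrences of one specific word can also be scripted. A rough sketch, assuming a grep that supports the -o option (GNU and BSD grep do); the function name count_word_occurrences is made up for illustration:

```shell
# Hypothetical helper: read plain text on stdin and print how often the
# given word occurs (case-insensitive, whole words only).
count_word_occurrences() {
    # -o prints each match on its own line, -i ignores case,
    # -w matches whole words only, -F treats the word as a literal string.
    # tr strips the padding some wc implementations put around the number.
    grep -o -i -w -F -e "$1" | wc -l | tr -d '[:space:]'
}
```

For example: pdftotext foo.pdf - | count_word_occurrences theorem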

Johnny Boy

Posted 2010-12-13T02:07:11.660

Reputation: 15

You may not have answered the OP's question but your post certainly helped me. Thanks. :D – mahela007 – 2015-07-01T17:08:04.540

9 I think you've misunderstood the question... 'word count' normally refers to the total number of words in a document, rather than the number of a specific word... and also, I think it would be better if you were to specify which program you are talking about - not all PDF readers have the same functions or use the same keyboard shortcuts. – evilsoup – 2013-03-28T19:07:11.840