I have many PDF files in one folder.
Is it possible to check whether one or more of them are corrupted (zero pages, or unfinished downloads) from the command line, without opening them one by one?
You can try doing it with pdfinfo (here on Fedora it is in the poppler-utils package). pdfinfo reads information about the PDF file from its document dictionary, so if it succeeds the file should be OK:
for f in *.pdf; do
    if ! pdfinfo "$f" &> /dev/null; then
        echo "$f is broken"
    fi
done
find . -iname '*.pdf' | while IFS= read -r f; do
    if pdftotext "$f" - &> /dev/null; then
        echo "$f was ok"
    else
        mv "$f" "$f.broken"
        echo "$f is broken"
    fi
done
To clarify: This script renames the pdf files that are diagnosed as 'broken' by appending .broken to the .pdf extension. – PatrickT – 2016-03-11T07:07:57.583
My tool of choice for checking PDFs is qpdf. qpdf has a --check argument that does a good job of finding problems in PDFs.
Checking a single file:
qpdf --check test_file.pdf
Checking all files in a directory:
find ./directory_to_scan/ -type f -iname '*.pdf' \( -exec sh -c 'qpdf --check "{}" > /dev/null && echo "{}": OK' \; -o -exec echo "{}": FAILED \; \)
Command explanation:

find ./directory_to_scan/ -type f -iname '*.pdf'
Find all files with a '.pdf' extension.

-exec sh -c 'qpdf --check "{}" > /dev/null && echo "{}": OK' \;
Execute qpdf for each file found and redirect all its output to /dev/null. Also print the filename followed by ': OK' if the exit status of qpdf is 0 (i.e. no errors).

-o -exec echo "{}": FAILED \; \)
This gets executed if errors are found: print the filename followed by ': FAILED'.
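The `\( ... -o ... \)` grouping works because `-exec` acts as a find test: when the first `-exec` command exits nonzero, find falls through to the alternative after `-o`. The same short-circuit can be sketched in plain shell, with `true`/`false` standing in for `qpdf --check`'s exit status (the `check` helper name is my own, for illustration):

```shell
# OK/FAILED branching via shell short-circuit operators; `true` and
# `false` are stand-ins for a real checker such as `qpdf --check`.
check() { "$@" > /dev/null 2>&1 && echo OK || echo FAILED; }

check true    # the check succeeds -> prints OK
check false   # the check fails    -> prints FAILED
```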
Getting qpdf:
qpdf has both Linux and Windows binaries available at https://github.com/qpdf/qpdf/releases. You could also use your package manager of choice to get it; for example, on Ubuntu you can install it with:
apt install qpdf
However, qpdf --check does not detect multiply defined metadata, which is incorrect since different tools handle it differently; I've reported a bug. Other tools such as pdfinfo and pdftk do not detect it either, but they do not claim to check the PDF structure.
I found an answer myself:
for x in *.pdf; do echo "$x"; pdfinfo "$x" | grep Pages; done
PDFs with errors will produce error output.
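For very large collections, a cheap first pass before running pdfinfo on everything is to look for the `%PDF-` header and `%%EOF` trailer that truncated downloads usually lack. This is only a heuristic sketch of my own, not real validation: a file can carry both markers and still be damaged, so treat it as a triage step before the heavier tools.

```shell
# Heuristic: a complete PDF starts with "%PDF-" and normally ends with
# an "%%EOF" marker; files missing either are very likely cut-off
# downloads. looks_complete is a hypothetical helper name.
looks_complete() {
    head -c 1024 "$1" | grep -qa '%PDF-' &&
    tail -c 1024 "$1" | grep -qa '%%EOF'
}

for f in *.pdf; do
    [ -e "$f" ] || continue   # directory may contain no PDFs
    if looks_complete "$f"; then
        echo "header/trailer OK: $f"
    else
        echo "likely truncated: $f"
    fi
done
```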
It is a bad idea (and never really needed) to iterate over the output of ls: http://mywiki.wooledge.org/ParsingLs
@slhck: This should be handled with find(1). :-) – Reinstate Monica - M. Schröder – 2013-04-11T13:15:47.137
None of the methods using pdfinfo or pdftotext worked for me. In fact, they kept giving me false positives and sometimes created files I didn't need.
What did work was JHOVE.
Installation:
Install the jar from the above link and update your PATH environment variable with this command:
echo "export PATH=\$PATH:/REPLACE_WITH/YOUR/PATH_TO/jhove/" >> ~/.bash_profile
Refresh each terminal with
source ~/.bash_profile
and you're good to start using it system wide.
Basic Usage:
jhove -m pdf-hul someFile.pdf
You'll get a lot of info about the pdf - more than most people probably need.
Bash One-Liner:
Simply returns valid
or invalid
:
if [[ $(jhove -m pdf-hul someFile.pdf | grep -a "Status:") == *"Well-Formed and valid"* ]]; then echo "valid"; else echo "invalid"; fi;
Note that this was run on Mac OS X, but I assume it works the same in any Unix-based Bash environment.
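To run the same check over a whole folder, the one-liner can be wrapped in a loop. This is a sketch assuming jhove is installed and on your PATH; the `check_pdfs` helper name (and the example directory path) are my own inventions:

```shell
# Classify every PDF in a directory as valid/invalid using JHOVE's
# pdf-hul module (assumes the jhove command is available on PATH).
check_pdfs() {
    dir=${1:-.}
    for f in "$dir"/*.pdf; do
        [ -e "$f" ] || continue   # no PDFs in this directory
        if jhove -m pdf-hul "$f" | grep -a "Status:" | grep -q "Well-Formed and valid"; then
            echo "valid: $f"
        else
            echo "invalid: $f"
        fi
    done
}
```

Usage: `check_pdfs /path/to/pdfs` prints one valid/invalid line per file.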
I would suggest replacing pdfinfo with pdftotext. This way all text on every page will be checked. And the > character should be &> so that all the error messages don't show up. – schoetbi – 2014-10-17T19:29:46.120
All my PDFs are flagged as broken. Hundreds of gigabytes of them, including ones I just created, whether using pdfinfo or pdftotext... – PatrickT – 2016-03-11T07:20:50.273