I have many PDF files in one folder.
Is it possible to check whether one or more of them are corrupted (zero pages, or unfinished downloads) from the command line, without opening them one by one?
You can try doing it with pdfinfo (here on Fedora it is in the poppler-utils package). pdfinfo reads information about the PDF file from its document dictionary, so if it succeeds the file should be OK:
for f in *.pdf; do
    if ! pdfinfo "$f" &> /dev/null; then
        echo "$f is broken"
    fi
done
find . -iname '*.pdf' | while IFS= read -r f; do
    if pdftotext "$f" - &> /dev/null; then
        echo "$f was ok"
    else
        mv "$f" "$f.broken"
        echo "$f is broken"
    fi
done
To clarify: This script renames the pdf files that are diagnosed as 'broken' by appending .broken to the .pdf extension. – PatrickT – 2016-03-11T07:07:57.583
My tool of choice for checking PDFs is qpdf. qpdf has a --check argument that does a good job of finding problems in PDFs.
Checking a single file:
qpdf --check test_file.pdf
Checking all files in a directory:
find ./directory_to_scan/ -type f -iname '*.pdf' \( -exec sh -c 'qpdf --check "{}" > /dev/null && echo "{}": OK' \; -o -exec echo "{}": FAILED \; \)
Command explanation:

find ./directory_to_scan/ -type f -iname '*.pdf'
Find all files with a '.pdf' extension.

-exec sh -c 'qpdf --check "{}" > /dev/null && echo "{}": OK' \;
Execute qpdf for each file found and redirect all its output to /dev/null. Also print the filename followed by ': OK' if the exit status of qpdf is 0 (i.e. no errors).

-o -exec echo "{}": FAILED \; \)
This gets executed if errors are found: print the filename followed by ': FAILED'.
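The `\( ... -o ... \)` grouping works because `-exec` acts as a find test: when the first `-exec` command exits nonzero, find falls through to the alternative after `-o`. The same short-circuit can be sketched in plain shell, with `true`/`false` standing in for `qpdf --check`'s exit status (the `check` helper name is my own, for illustration):

```shell
# OK/FAILED branching via shell short-circuit operators; `true` and
# `false` are stand-ins for a real checker such as `qpdf --check`.
check() { "$@" > /dev/null 2>&1 && echo OK || echo FAILED; }

check true    # the check succeeds -> prints OK
check false   # the check fails    -> prints FAILED
```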
Getting qpdf:
qpdf has both Linux and Windows binaries available at https://github.com/qpdf/qpdf/releases. You could also use your package manager of choice to get it; for example, on Ubuntu you can install it with:
apt install qpdf
However, qpdf --check does not detect multiply defined metadata, which is incorrect since different tools handle it differently; I've reported a bug. Other tools such as pdfinfo and pdftk do not detect it either, but they do not claim to check the PDF structure.
I found an answer myself:
for x in *.pdf; do echo "$x"; pdfinfo "$x" | grep Pages; done
PDFs with errors will produce error output.
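For very large collections, a cheap first pass before running pdfinfo on everything is to look for the `%PDF-` header and `%%EOF` trailer that truncated downloads usually lack. This is only a heuristic sketch of my own, not real validation: a file can carry both markers and still be damaged, so treat it as a triage step before the heavier tools.

```shell
# Heuristic: a complete PDF starts with "%PDF-" and normally ends with
# an "%%EOF" marker; files missing either are very likely cut-off
# downloads. looks_complete is a hypothetical helper name.
looks_complete() {
    head -c 1024 "$1" | grep -qa '%PDF-' &&
    tail -c 1024 "$1" | grep -qa '%%EOF'
}

for f in *.pdf; do
    [ -e "$f" ] || continue   # directory may contain no PDFs
    if looks_complete "$f"; then
        echo "header/trailer OK: $f"
    else
        echo "likely truncated: $f"
    fi
done
```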
It is a bad idea (and never really needed) to iterate over the output of ls: http://mywiki.wooledge.org/ParsingLs
@slhck: This should be handled with find(1). :-) – Reinstate Monica - M. Schröder – 2013-04-11T13:15:47.137
None of the methods using pdfinfo or pdftotext worked for me. In fact, they kept giving me false positives and sometimes created files I didn't need.
What did work was JHOVE.
Installation:
Install the jar from the above link and update your PATH environment variable with this command:
echo "export PATH=\$PATH:/REPLACE_WITH/YOUR/PATH_TO/jhove/" >> ~/.bash_profile
Refresh each terminal with
source ~/.bash_profile
and you're good to start using it system wide.
Basic Usage:
jhove -m pdf-hul someFile.pdf
You'll get a lot of info about the pdf - more than most people probably need.
Bash One-Liner:
Simply returns valid
or invalid
:
if [[ $(jhove -m pdf-hul someFile.pdf | grep -a "Status:") == *"Well-Formed and valid"* ]]; then echo "valid"; else echo "invalid"; fi;
Note that this was run on Mac OS X, but I assume it works the same in any Unix-based Bash environment.
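To run the same check over a whole folder, the one-liner can be wrapped in a loop. This is a sketch assuming jhove is installed and on your PATH; the `check_pdfs` helper name (and the example directory path) are my own inventions:

```shell
# Classify every PDF in a directory as valid/invalid using JHOVE's
# pdf-hul module (assumes the jhove command is available on PATH).
check_pdfs() {
    dir=${1:-.}
    for f in "$dir"/*.pdf; do
        [ -e "$f" ] || continue   # no PDFs in this directory
        if jhove -m pdf-hul "$f" | grep -a "Status:" | grep -q "Well-Formed and valid"; then
            echo "valid: $f"
        else
            echo "invalid: $f"
        fi
    done
}
```

Usage: `check_pdfs /path/to/pdfs` prints one valid/invalid line per file.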
I would suggest replacing pdfinfo with pdftotext. This way all text on every page will be checked. And the > character should be &> so that all the error messages don't show up. – schoetbi – 2014-10-17T19:29:46.120
All my PDFs are flagged as broken. Hundreds of gigabytes of them, including ones I just created, whether using pdfinfo or pdftotext... – PatrickT – 2016-03-11T07:20:50.273