I'm currently looking at ways to block malicious PDF files at the network boundary. This will include virus scanning, but that has known limitations. I see a common approach is to flatten the PDF file using something like:
gs -dBATCH -dNOPAUSE -sDEVICE=pdfwrite -sOutputFile=flattened.pdf raw.pdf
While this certainly seems to remove the usual suspects from the output of pdfid, that alone does not mean that the associated threats have been eliminated.
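For the before/after comparison, here is a minimal sketch of a raw keyword count, loosely in the spirit of what pdfid reports. It is an assumption-laden stand-in and not pdfid itself: the function name and the list of names are illustrative, it relies on the `-a` and `-o` options of GNU/BSD grep, and it misses hex-escaped names and anything inside compressed object streams.

```shell
# Crude approximation of a pdfid-style check: count raw occurrences of a
# few PDF name tokens that commonly indicate active content.
# Assumption: names are written literally (no #xx hex escapes) and are
# not hidden inside compressed object streams.
count_risky_names() {
    # $1 = path to a PDF; prints one "name count" pair per line
    for name in /JavaScript /JS /OpenAction /AA /Launch /EmbeddedFile /RichMedia; do
        # -a: treat binary data as text, -o: one match per output line,
        # -F: fixed-string match (no regex interpretation)
        n=$(grep -a -o -F -- "$name" "$1" | wc -l | tr -d ' ')
        printf '%s %s\n' "$name" "$n"
    done
}
```

Running `count_risky_names raw.pdf` and `count_risky_names flattened.pdf` side by side gives a quick diff, with the same caveat as above: a zero count is not proof that the threat is gone.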
Hence:
Will this approach eliminate most Flash and JavaScript exploits?
What threats are likely to persist?
Notes:
As this is intended for bulk scanning, suggestions such as this are not really practical at scale.
Links to authoritative sources would be much appreciated.
Update
The method above removes Flash and JavaScript from the PDF. Steffen (see below) highlighted that malware embedded in image files would likely survive. To mitigate this, I am downsampling the images. I've not been able to get a clear answer on whether gs preserves or removes EXIF data, but downsampling will likely shift the offset of any malware embedded there, nullifying its exploitability, and should also remove any malware embedded in the image data itself. Hence:
DPI=63
gs -dBATCH -dNOPAUSE -dQUIET -sDEVICE=pdfwrite \
-dDownsampleColorImages \
-dColorImageDownsampleType=/Bicubic -dColorImageResolution=${DPI} \
-dDownsampleGrayImages \
-dGrayImageDownsampleType=/Bicubic -dGrayImageResolution=${DPI} \
-dDownsampleMonoImages \
-dMonoImageDownsampleType=/Bicubic -dMonoImageResolution=${DPI} \
-sOutputFile="${TMPPDF}" "${SRCFILE}"
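For bulk use, the command above probably wants a wrapper that bounds run time and never leaves a half-written output behind. The following is a rough sketch under stated assumptions: `GS_BIN`, `GS_TIMEOUT`, `run_limited` and `flatten_pdf` are hypothetical names, `timeout` is the GNU coreutils utility (used only if present), and `-dSAFER` is added on top of the switches from the post.

```shell
# Hypothetical settings; override from the environment as needed.
GS_BIN=${GS_BIN:-gs}
GS_TIMEOUT=${GS_TIMEOUT:-60}   # seconds allowed per file
DPI=${DPI:-63}

# Run a command under a time limit when coreutils `timeout` is available.
run_limited() {
    if command -v timeout >/dev/null 2>&1; then
        timeout "$GS_TIMEOUT" "$@"
    else
        "$@"
    fi
}

# flatten_pdf SRC DST: rewrite an untrusted PDF through pdfwrite.
# DST is only created if gs exits cleanly and produced non-empty output.
flatten_pdf() {
    local src=$1 dst=$2 tmp
    tmp=$(mktemp "${dst}.XXXXXX") || return 1
    if run_limited "$GS_BIN" -dBATCH -dNOPAUSE -dQUIET -dSAFER \
            -sDEVICE=pdfwrite \
            -dDownsampleColorImages \
            -dColorImageDownsampleType=/Bicubic -dColorImageResolution="$DPI" \
            -dDownsampleGrayImages \
            -dGrayImageDownsampleType=/Bicubic -dGrayImageResolution="$DPI" \
            -dDownsampleMonoImages \
            -dMonoImageDownsampleType=/Bicubic -dMonoImageResolution="$DPI" \
            -sOutputFile="$tmp" "$src" </dev/null >/dev/null 2>&1 \
        && [ -s "$tmp" ]; then
        mv -- "$tmp" "$dst"
    else
        # Failed, timed out, or empty output: discard and report failure.
        rm -f -- "$tmp"
        return 1
    fi
}
```

A caller could then treat a non-zero return as "quarantine the original", for example `flatten_pdf "$SRCFILE" "$TMPPDF" || echo "rejected: $SRCFILE"`. One caveat worth checking against the Ghostscript documentation for your version: as far as I understand, pdfwrite only resamples images whose resolution exceeds the target by a threshold factor, so an image that is already low resolution may not be touched by the downsampling step.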