Is there a way to scan a pdf to ensure it doesn't contain anything that could be a virus?

Question

The answers to "Can a PDF file contain a virus?" show that it clearly can!

Sometimes we can be quite sure that a particular PDF, such as a book, has no need to do anything sophisticated. We wouldn't expect it to contain embedded executables or similarly complex items such as JavaScript; if it did, it could be avoided or treated with extra caution.

Question

Is there a simple way on macOS and Windows to ensure that any URL ending in .pdf is scanned for anything more complicated than text and images (the things we'd expect to find in, say, a book), and is only opened, downloaded, or viewed if it passes the check?

Note: I know many harmless PDFs contain some complex behaviours, but I'd prefer to turn the check off for those specific cases (i.e. when they come from a trusted source) rather than allow potentially malicious behaviour by default.

Does this answer your question? [What is the safest way to deal with loads of incoming PDF files, some of which could potentially be malicious?](https://security.stackexchange.com/questions/151300/what-is-the-safest-way-to-deal-with-loads-of-incoming-pdf-files-some-of-which-c) — ThoriumBR, Mar 14 '22 at 00:05
Is there a specific reason to believe that standard virus scanners would not suffice? Your question does not address this (IMHO obvious) solution. – Marcel, Mar 14 '22 at 07:02

score 2 · Answer 1 · answered Mar 14 '22 at 07:40

Ensure? No. A simple reason: Images, layout information, fonts, and all sorts of other "simple" data can nonetheless be malicious, and can lead to arbitrary code execution if the parser for them has an exploitable bug (a.k.a. a vulnerability). This is not academic; lots of exploits, including some quite famous ones, were carried out through image or font parsers.

Similarly, any scanner that you could use to theoretically validate the contents of a PDF could, itself, be vulnerable. After all, it too is parsing the file, and there's nothing that says security tools can't contain vulnerabilities themselves. In fact, adding a security tool always increases the attack surface - the amount of space where a vulnerability could exist - and there is no way to guarantee that the tool, even if not itself vulnerable, will reliably catch malicious data rather than pass it on to other code.

You could, in theory, have a PDF reader that doesn't handle any but the most common and trusted formats; it wouldn't be able to open everything (not even every book), but it could open most of them (probably all from most publishers, etc.). It wouldn't be totally safe - even common and trusted code can have vulnerabilities that lurk undetected for over a decade. I don't know of any PDF reader that has this feature (and specific product recommendations are out of scope for this site anyhow), but you might be able to find one if you look.

Another option would be a PDF validator. As mentioned above, this does add attack surface (the validator itself). In theory, though, a validator could apply strict validation without attempting to render the fonts, images, layout, and so on, which reduces the risk somewhat. It would probably (though not certainly) throw out anything that isn't safe without being at risk itself, unless the validator was software that somebody specifically targeted or was rather shoddily written.
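To make the "validate without rendering" idea concrete, here is a minimal sketch of a structural check that only looks at raw bytes and never decodes a font, image, or stream. The function name `validate_structure`, the size cap, and the specific rules are assumptions chosen for illustration, not a complete or authoritative PDF validator; passing these checks does not show that a file is safe.

```python
import re

MAX_SIZE = 50 * 1024 * 1024  # assumed policy limit for a book-sized PDF


def validate_structure(data: bytes) -> list[str]:
    """Return a list of problems; an empty list means these checks passed."""
    problems = []
    if len(data) > MAX_SIZE:
        problems.append("file larger than policy allows")
    # Strict rule: the header must be the very first bytes of the file.
    # Lenient readers tolerate leading junk, which this policy rejects.
    if not re.match(rb"%PDF-[12]\.\d[\r\n]", data):
        problems.append("missing or misplaced %PDF-x.y header")
    # The file must end with "startxref <offset> %%EOF" and nothing else
    # except optional whitespace.
    trailer = re.search(rb"startxref\s+(\d+)\s+%%EOF\s*$", data[-1024:])
    if trailer is None:
        problems.append("no startxref/%%EOF trailer at end of file")
    elif int(trailer.group(1)) >= len(data):
        problems.append("startxref offset points past end of file")
    return problems
```

A caller would reject any file for which the returned list is non-empty. Because the check is so much simpler than a renderer, it exposes far less parsing code to the untrusted input, which is the trade-off described above.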

One way to mitigate all these risks is to handle the PDFs in a sandbox: a low-privilege process with minimal and strictly controlled access to the rest of the system. Sandboxing is quite common, including for PDFs. Adobe Reader was one of the first really popular desktop programs that I know of to include a sandbox (other than browsers; Adobe adapted the one Chrome was already using), and sandboxing is used for approximately all apps on mobile devices and most apps from the desktop Windows Store and Mac App Store. Mind you, sandboxes aren't a perfect solution. They don't restrict everything, and even the things they do try to restrict might still be possible if the sandbox is itself buggy in the right way (as pretty much all complex software is). Still, a sandbox adds defense in depth.
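A real sandbox relies on operating-system mechanisms that are well beyond a short example. As a much weaker but related illustration, the sketch below runs a command in a separate child process with CPU limits, no inherited environment variables, and a throwaway working directory. It assumes a POSIX system (the `resource` module is not available on Windows), and the helper names `limit_resources` and `run_confined` are hypothetical. It does not restrict file or network access, so it should not be mistaken for a sandbox.

```python
import resource
import subprocess
import sys
import tempfile


def limit_resources():
    # Runs in the child process just before the command starts (POSIX only).
    resource.setrlimit(resource.RLIMIT_CPU, (5, 5))   # at most 5 s of CPU time
    resource.setrlimit(resource.RLIMIT_CORE, (0, 0))  # no core dumps


def run_confined(argv, timeout=10):
    """Run argv with resource limits, an empty environment and a scratch cwd."""
    with tempfile.TemporaryDirectory() as scratch:
        return subprocess.run(
            argv,
            cwd=scratch,               # temporary directory, deleted afterwards
            env={},                    # do not leak environment variables
            stdin=subprocess.DEVNULL,
            capture_output=True,
            timeout=timeout,           # wall-clock limit enforced by the parent
            preexec_fn=limit_resources,
        )
```

For example, `run_confined([sys.executable, "-c", "print('ok')"])` runs a trivial child under these limits. In practice the command would be whatever parser or validator handles the untrusted PDF, ideally inside a genuine sandbox as well.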

score 2 · Answer 2 · answered Mar 14 '22 at 08:33

There is a simple tool, PDFiD, from Didier Stevens:

https://blog.didierstevens.com/programs/pdf-tools/ (for PDFiD scroll down... and after that take a look at the other tools too btw)

I find it handy for a quick manual scan for the most common attack vectors in PDFs. Scanning is very quick, and it can warn you that the document contains elements that could be exploited.
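The general technique behind this kind of quick scan can be sketched in a few lines: search the raw bytes for PDF name tokens that are associated with active content and count them. This is an illustrative approximation, not PDFiD's actual implementation; the `WATCHLIST` contents and the function names are assumptions, and the scan cannot see names hidden inside compressed object streams.

```python
import re
from collections import Counter

# Assumed watchlist of names commonly associated with active content.
WATCHLIST = {
    b"/JS", b"/JavaScript", b"/OpenAction", b"/AA", b"/Launch",
    b"/EmbeddedFile", b"/RichMedia", b"/XFA", b"/AcroForm",
}

# A name token is "/" followed by characters that are neither whitespace
# nor PDF delimiters.
NAME_TOKEN = re.compile(rb"/[^\x00\t\n\x0c\r ()<>\[\]{}/%]+")
HEX_ESCAPE = re.compile(rb"#([0-9A-Fa-f]{2})")


def decode_name(token: bytes) -> bytes:
    # Names may hide characters as #xx hex escapes, e.g. /J#61vaScript.
    return HEX_ESCAPE.sub(lambda m: bytes([int(m.group(1), 16)]), token)


def count_risky_names(data: bytes) -> Counter:
    """Count watchlisted name tokens found in the raw bytes of a PDF."""
    counts = Counter()
    for match in NAME_TOKEN.finditer(data):
        name = decode_name(match.group())
        if name in WATCHLIST:
            counts[name.decode("ascii")] += 1
    return counts
```

A non-empty result is a reason to treat the file with more suspicion, not proof of malice; an empty result is likewise not proof of safety.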

Note: I am not any kind of a security expert, just a common user.

M_Ryan


We try not to answer with "use this tool". Instead, offer techniques, approaches, etc. – schroeder Mar 14 '22 at 08:57
My bad, I'm sorry for that. – M_Ryan Mar 14 '22 at 10:39

score 1 · Answer 3 · answered Mar 14 '22 at 08:59

There is no sure way, which is why the concept of Content Disarm and Reconstruction (CDR) is becoming popular. The process extracts the content from the document, then creates a new file containing just that content.

It's not a "scan" but a "carbon copy" of the content to bypass anything that might be lurking.
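The "carbon copy" idea can be shown with a toy model. The sketch below does not parse real PDFs; `Element`, `Document`, and `ALLOWED_KINDS` are hypothetical stand-ins used only to show the allowlist principle: nothing is inspected for malice, and only known-simple element types are copied into a brand-new document. A real CDR product would also re-encode the text and images themselves rather than copy their bytes.

```python
from dataclasses import dataclass, field


@dataclass
class Element:
    kind: str       # e.g. "text", "image", "javascript", "embedded_file"
    payload: bytes


@dataclass
class Document:
    elements: list = field(default_factory=list)


# Assumed policy: only the things we'd expect to find in a book.
ALLOWED_KINDS = {"text", "image"}


def reconstruct(untrusted: Document) -> Document:
    """Build a new document holding copies of allowlisted elements only."""
    clean = Document()
    for element in untrusted.elements:
        if element.kind in ALLOWED_KINDS:
            clean.elements.append(Element(element.kind, bytes(element.payload)))
    return clean
```

Anything not on the allowlist, including element types nobody thought to block, simply never reaches the output, which is the difference from a scanner that looks for known-bad content.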
