Removing malware from a malicious PDF file

Question

I have analyzed a large PDF using Didier Stevens's pdfid tool. I get the following result:

Now that I know there is malicious content in the PDF, what can I do to remove it? I would still like to view the contents of the PDF. Unfortunately, I am not confident in my ability to remove all the malicious content with technical tools like pdf parsers. I am looking for another method where my lack of experience will not be a big liability.

Here are some of my ideas:

Upload the document to Google Docs. I assume that Google has one of the most secure PDF viewers out there.
Open the PDF in a virtual environment and somehow save the PDF page-by-page. At the very worst, I could write a script to print screen each page and make a new PDF full of images.
Use Firefox's pdf.js or some other browser-based viewer

Another question I have: How reliable is the pdfid tool? For example, if I get a result with no Javascript, OpenAction, etc. can I be sure that it is safe?

Depending on how advanced the virus is, you can always open on a VM and print to file... — KnightOfNi, Sep 04 '15 at 19:44
`I know there is malicious content in the PDF` what do you know and how do you know it? — Neil Smithline, Sep 04 '15 at 20:17
You've asked 2 different questions, "how to remove it" and "how to view an infected PDF" and those questions are not related at all. Is there one question that you would prefer to have answered? — schroeder, Sep 05 '15 at 15:58

score 3 · Answer 1 · edited Jun 16 '20 at 09:49

Since there's no way to tell what's inside that file or what it's capable of, I'd recommend cutting your losses and nuking it from the get-go. With most PDFs (and, in fact, most files) you can recover relatively quickly, as a copy is bound to reside somewhere else (either an earlier draft or an unmodified version).

That being said, you seem very keen to open this PDF so I'm assuming it has some sort of uniqueness to it. You'll need to forgive me for putting my day-job hat on but usually this means a user hasn't saved or backed up their documents when they should have. If that's not the case, and this is truly the Dead Sea Scrolls of the PDF world, let's see how we can open/clean these files without damaging them or, indeed, yourself.

Using a browser to recover the PDF

You've brought up opening them in a browser several times:

"Upload the document to Google Docs. I assume that Google has one of the most secure PDF viewers out there."

and

"Use Firefox's pdf.js or some other browser-based viewer"

As I said at the beginning, we've no idea what we're dealing with at the moment. Opening it in a browser (at this stage) wouldn't be recommended. Also, uploading it to Google is a little bit unethical since you know full well that there's something suspicious going on with the PDF.

Even if you were somehow able to know exactly what that PDF file's malicious content is capable of, there's no way to be sure it hasn't been altered to masquerade as something less harmful.

Using a VM/isolated machine to recover the PDF

You touched on it in the middle of your question: using a VM or isolated machine is obviously a far safer way of opening that PDF.

Now you might be thinking that's a lot of time to spend setting one of those up, but if you keep a snapshot of a clean, small, portable system taken immediately after installation, you can almost safely open the PDF and print the contents to file (as @KnightOfNi has already alluded to) without the PDF editing capabilities. Why not go one step further and actually print them on a closed loop with a USB printer and rescan them? It might seem like a massive hassle, but we're dealing with a time-bomb of unknown magnitude. It could destroy your system, steal your data, crypto-locker your stuff, or simply be a lemon. The point is we have no idea.

"Can I be sure that it is safe?"

From your question:

Another question I have: How reliable is the pdfid tool? For example, if I get a result with no Javascript, OpenAction, etc. can I be sure that it is safe?

Again, not really - but there are ways you can test to see how accurate the tool is.

If it's Open Source, have a dig through the code and make sure it's not just spitting out any old garbage and barely touching the files.
In that VM/isolated machine lab, why not throw it some good/bad/ugly files that are known to be infected with a certain strain of malware? You can roll your own or you can reach out to the researcher community who usually have a few examples to download online.
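To see why a clean result is "not really" proof of safety, here is a minimal sketch of the kind of keyword counting such a scanner performs. This is not pdfid's actual code; the keyword list, the function names and the three verdict strings are my own assumptions for illustration. The point it makes is that the count only sees the raw bytes, so anything tucked inside a compressed object stream (/ObjStm) is invisible to it.

```python
import re

# Illustrative subset of names a keyword scanner might look for (assumption).
ACTIVE_KEYWORDS = [b"/JS", b"/JavaScript", b"/OpenAction", b"/AA",
                   b"/Launch", b"/EmbeddedFile"]
CONTAINER_KEYWORD = b"/ObjStm"


def decode_name_escapes(data: bytes) -> bytes:
    """Undo #xx hex escapes so that /J#61vaScript is seen as /JavaScript."""
    return re.sub(rb"#([0-9A-Fa-f]{2})",
                  lambda m: bytes([int(m.group(1), 16)]), data)


def count_keywords(data: bytes) -> dict:
    """Count each keyword in the raw bytes; nothing is decompressed."""
    normalized = decode_name_escapes(data)
    counts = {}
    for keyword in ACTIVE_KEYWORDS + [CONTAINER_KEYWORD]:
        # Require a non-alphanumeric character (or end of data) after the
        # keyword so that /AA does not match a longer name such as /AAPL.
        pattern = re.escape(keyword) + rb"(?![A-Za-z0-9])"
        counts[keyword.decode("ascii")] = len(re.findall(pattern, normalized))
    return counts


def verdict(counts: dict) -> str:
    """Hypothetical three-way summary of the counts."""
    if any(counts[k.decode("ascii")] for k in ACTIVE_KEYWORDS):
        return "suspicious"
    if counts[CONTAINER_KEYWORD.decode("ascii")]:
        # Object streams are compressed, so keywords inside them were not seen.
        return "inconclusive"
    return "nothing found"


sample = b"<< /OpenAction 2 0 R >> << /S /J#61vaScript /JS (app.alert(1)) >>"
hidden = b"<< /Type /ObjStm /N 3 /Filter /FlateDecode >> stream ... endstream"
print(verdict(count_keywords(sample)))  # suspicious
print(verdict(count_keywords(hidden)))  # inconclusive
```

Even "nothing found" only means none of the listed names appeared in the raw bytes. It says nothing about a bug in the reader's font, image or stream parsing, which needs no JavaScript at all.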

To be honest, it all depends on how much you want to recover that PDF. For most people, it's way more hassle than it's worth... but then again, that depends on the contents.

What I'd do:

Isolated virtual machine on an isolated physical machine (sounds like overkill but that's how I tested my first experience with CryptoLocker - glad I did it that way).
That means no file sharing (i.e. share desktop -> desktop on VM)
Go for an odd OS. The odder the better. Most of these PDF viruses target specific users of specific versions of an operating system. I doubt they've thought about Kali, TAILS, or similar.
Deactivate plugins/delete any known areas or apps of concern (Flash, etc.)
Open a test PDF and set all the security settings to max before opening the target document
Open the target document, physically print it, scan it back in.
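If physically printing isn't practical, a software stand-in for the print-and-rescan step is to render each page to a plain image inside the isolated VM and carry only the images out. Below is a minimal sketch, assuming poppler's `pdftoppm` is installed in the VM; the function names, output prefix and DPI are placeholders of mine. The renderer is itself a PDF parser and could have its own bugs, so this belongs inside the VM, never on the host.

```python
import subprocess
from pathlib import Path


def build_rasterize_command(pdf_path: str, out_prefix: str, dpi: int = 150) -> list:
    """Build a pdftoppm call that renders every page to <out_prefix>-<page>.png."""
    return ["pdftoppm", "-r", str(dpi), "-png", pdf_path, out_prefix]


def rasterize(pdf_path: str, out_dir: str, dpi: int = 150) -> list:
    """Render the PDF to PNG files in out_dir and return their paths in page order."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    command = build_rasterize_command(pdf_path, str(out / "page"), dpi)
    # check=True raises if pdftoppm fails; the timeout guards against a hang.
    subprocess.run(command, check=True, timeout=300)
    # pdftoppm zero-pads page numbers, so a plain sort keeps page order.
    return sorted(out.glob("page-*.png"))
```

Calling `rasterize("target.pdf", "pages")` inside the VM would leave one PNG per page in `pages/`. The images carry no scripts or PDF structure, at the cost of losing selectable text.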

You forgot to cut internet access to the VM or even better, cut internet for VM + Host — Freedo, Sep 05 '15 at 23:13
@Freedo "Isolated virtual machine on an isolated physical machine". Isolated as in no network connection ;). (Preferably in a padded cell) — ScottMcGready, Sep 05 '15 at 23:14
@Freedo actually wifi and Bluetooth radios off to be totally sure. — ScottMcGready, Sep 05 '15 at 23:18
physically printing is a big waste of time, energy and dead trees. a lot better is to use (for example) imagemagick to convert pages to plain pictures (TGA format—it's dead simple, therefore readers are unlikely to be exploitable), and get them off the VM, then convert to any other format (e.g. PNG) to save space. — Display Name, Nov 01 '16 at 10:13
If you are cutting network access, make sure you also have a Faraday cage :-) https://github.com/fulldecent/system-bus-radio — William Entriken, Apr 29 '22 at 13:34

score 2 · Answer 2 · answered Sep 04 '15 at 21:30

The pdfid output you've posted does not indicate malicious content, per se.

I'd be suspicious, too... and... look a little deeper before saying it's bad news.

Look Inside? Personally, I'd first open it with a text editor and take a quick look at that script... see what it would try to do... but then again, I'm a researcher so I'm curious at heart.

Scan it? You can simply upload the PDF to virustotal.com and it will get simultaneously checked by a whole bunch of malware scanners. If it comes up as malicious, then you'll know for sure that you should go the more cautious VM route. If it comes up as clean, then hey, maybe it really is. Or maybe the attacker is really good at hiding their intent amongst those ObjStms. The mere presence of JavaScript doesn't mean it's malicious.
https://www.virustotal.com/
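If the document might be private, one option is to search VirusTotal for the file's hash first, which only tells you whether somebody else has already submitted the same file and doesn't hand over the contents. A minimal sketch of computing that hash; the function names are mine, and hashing only reads the bytes without parsing the PDF:

```python
import hashlib


def sha256_of_stream(stream, chunk_size: int = 65536) -> str:
    """Hash a binary stream in chunks so large files need little memory."""
    digest = hashlib.sha256()
    for chunk in iter(lambda: stream.read(chunk_size), b""):
        digest.update(chunk)
    return digest.hexdigest()


def sha256_of_file(path: str) -> str:
    """Return the SHA-256 hex digest of the file at path."""
    with open(path, "rb") as handle:
        return sha256_of_stream(handle)
```

Paste the result of `sha256_of_file("target.pdf")` into the site's search box. No match just means nobody has submitted that exact file, not that it's clean.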

Disable JS? It's probably JavaScript you're worried about, right? Most PDF readers have a way to go into properties/settings/options/preferences and check a box for safe reading or uncheck a box for JavaScript execution.
https://www.techsupportalert.com/content/how-disable-javascript-popular-free-pdf-readers.htm

Or Roll Up Your Sleeves and Fire Up Your VM? Bottom line, let's say your paranoia is spot on, and the file really is bad news -- I would agree with you that opening it in a disposable VM session is going to be a safer way to access a potentially-malicious file. And if you do choose to go that route, remember to fire up a sniffer like Wireshark before you open the PDF; that way you can see whatever it tries to do, if anything.
