How should I serve untrusted / unsanitized documents (PDF, DOC, XLS) to end users over the web?

Question

My website will be hosting documents for end users and I would like to make this as secure as possible.

Assuming that I have a need to share documents on a website that may contain hostile code, what should I do to protect myself and my end users?

Are you referring to conversion on a server, conversion on your PC, or conversion through a service? (like Google Docs) — 700 Software, Feb 22 '12 at 00:30
See this question: [Which files are more insecure than others?](http://security.stackexchange.com/q/323/33) Though it's not quite what you're asking, I think the answer is (mostly) the same. — AviD, Feb 22 '12 at 00:57
@GeorgeBailey - I'm ultimately looking for a conversion exe that can run locally or on a server, and has some programmatic interface I can leverage. Lacking that, I'll suffice with command line switches. (If the TOS permits, I may wrap the Google Docs service into an EXE) — makerofthings7, Feb 22 '12 at 04:39
@AviD I was getting ahead of myself, and did this Jeopardy style. My underlying question is "how do I secure document transmission over the web"; my assumed answer was "convert it into a 'safe' format". I've edited the question; do you think I altered it too much and had best delete it and start over? — makerofthings7, Feb 22 '12 at 05:00
related with focus on HTML: http://stackoverflow.com/questions/3922872/how-can-xss-be-avoided-in-html-downloads — Ciro Santilli OurBigBook.com, Nov 07 '14 at 15:21

score 5 · Accepted Answer · answered Feb 22 '12 at 02:25

It sounds like you are concerned about viewing documents from an untrusted source (e.g., concerned about the risk that a malicious document might exploit a vulnerability in your PDF viewer, or in Word, or the like). That's a reasonable concern.

If that's the concern, one way to mitigate it is to use Google Docs to view the document. Google Docs provides the ability to view documents in your web browser. Google converts the document into HTML on Google's servers, and then sends the HTML to your browser. Google then takes responsibility for making sure their servers can handle malicious documents safely. This is a reasonable approach to protecting yourself against malicious documents.

The primary downside is that it may not be a good fit for viewing confidential documents, because you'd have to upload them to Google before you could view them in this way.

An alternative defense is to keep your document viewers fully patched; I recommend turning on automatic updates for Adobe Reader and Microsoft Office.

I've never explored Google Docs, but I like the idea of neutralizing content ... even if that final format is HTML. I would think HTML in a browser would be more risky than other formats. Is there a converter that converts all text into the markup StackExchange uses? — makerofthings7, Feb 22 '12 at 04:18
@DW I realized I didn't formulate my question correctly. I've changed it substantially. See my comment to Avid above. Originally, I was attempting to [re-purpose your answer for securing images](http://security.stackexchange.com/a/8625/396) in my first question to general documents since I thought they can be exploited with similar mechanisms (depending on the software and "trust zone" involved). — makerofthings7, Feb 22 '12 at 05:03
@makerofthings7: the first two points from that answer can certainly also be used for general documents; always, always do #1. The one about converting is just a bit harder for Word and PDF files. — bobince, Feb 22 '12 at 08:54

score 5 · Answer 2 · edited Mar 17 '17 at 13:14

As a site owner, there is only so much you can do about it. The primary responsibility for protecting users from malicious documents lies with the software vendors who make the document viewers (e.g., Adobe, Microsoft, and potentially browser vendors). They are the ones with the opportunity to best help their users; your leverage is more limited.

That said, here are some steps you could/should take, to protect your users as best as possible:

Only allow whitelisted types of documents. Create a whitelist of MIME types, representing document formats that you consider relatively safe and important to support. (This whitelist might include, for example, PDF, Word, PowerPoint, and Excel documents, but not native executables, Flash videos, zip or jar archives, HTML or JavaScript, etc.) When the user uploads a document, check that it is on the whitelist. When you serve this document to others, follow practices I've outlined elsewhere to defend against content-type sniffing attacks. The most important one is to set a correct Content-Type: header on the HTTP responses where you serve the document, using the previously validated MIME type. Also, include an X-Content-Type-Options: nosniff header, to prevent some versions of IE from trying to do content-type sniffing. In addition to defending against content-sniffing attacks, this will also prevent attackers from uploading other files that browsers treat specially (e.g., crossdomain.xml, an HTML5 manifest file, and more).
Validate filenames. Perform sanity checks on filenames when they are uploaded. For instance, you might check that they match a regexp like [a-zA-Z-_!(),: ]*(\.[a-zA-Z]+)?. In particular, you want to rule out slashes (to protect against path traversal). Also, check that the extension (if present) seems to match the validated MIME type and is on a whitelist of known-safe extensions. You might want to check that the filename does not contain two extensions (e.g., foo.txt.exe), to prevent social engineering attacks.

Alternatively, you could generate a random filename for each document. This would be the most secure, but it might diminish usability. A third option is to replace the original filename only if the original filename doesn't pass your validity checks.
Optional: Scan file uploads for viruses or malware. When a file is uploaded, you might want to use a virus scanner to check for known viruses or malware. You could use a local virus scanner, but it might be even easier to upload it to VirusTotal. I would do this at upload time, and possibly also subsequently at periodic intervals (e.g., once every 1000 subsequent downloads, or once every week), as sometimes more recent virus definition files will catch more viruses in older documents. It would be fine to do this asynchronously or in batch mode, if that provides better performance; if the virus-check comes back a failure, you can always remove the file at that time.
Optional: Use a separate domain to host file uploads. You could host the documents on a separate domain that is used only to host user-uploaded documents. This will limit the impact of some browser-level attacks, such as content-type sniffing. However, this does not defend against code-injection attacks that exploit a vulnerability (e.g., a buffer overrun, a double-free) in a document viewer and let the attacker execute native code.
Put barriers to denial of service and spam. Put a limit on the maximum file size, to avoid getting overwhelmed with submissions. If you allow unauthenticated users to upload files, require the user to solve a CAPTCHA as part of the upload.
Optional: Detect out-of-date browsers and plugins. Scan the user's browser. If they are running an old browser (like IE6) or old plugins, consider providing a recommendation that they update, to protect themselves as they browse on the net. Here are some resources to help with that:
- If you want to recommend IE6 users to update, IE6 Update makes it easy to add a little bit of code to your website that will detect users who are using IE6, and prompt them to update their browser.
- If you want more flexibility about which browser versions will trigger a little notification encouraging the user to upgrade, take a look at browser-update.com.
- If you must check for out-of-date or known-vulnerable plugins, the Mozilla plugin check may be useful. It supports Firefox, Safari, Chrome, and Opera, and partially supports IE. They also have a programmatic API to their plugin version database. Based on that, you could probably code something up to automatically check the user's plugins, and if they have an out-of-date plugin for a document type you support (e.g., an old version of an Adobe PDF viewer plugin), you could trigger a notification to encourage the user to upgrade their plugin.
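
As a rough illustration of the whitelist, filename, size-limit, and response-header steps above, here is a minimal Python sketch. Everything named in it is an assumption for illustration: the contents of `ALLOWED_TYPES`, the 10 MiB cap, the helper names, and the filename pattern (adapted from the regexp above, with digits added). It also assumes the MIME type has already been validated by some other means rather than taken on trust from the client.

```python
import re
import secrets

# Assumption: an illustrative whitelist mapping MIME types to permitted extensions.
ALLOWED_TYPES = {
    "application/pdf": {".pdf"},
    "application/msword": {".doc"},
    "application/vnd.ms-excel": {".xls"},
}

# Assumption: an arbitrary 10 MiB upload cap.
MAX_UPLOAD_BYTES = 10 * 1024 * 1024

# Adapted from the regexp in the answer (digits added): a base name with no
# slashes or dots, followed by at most one alphabetic extension.
SAFE_NAME = re.compile(r"[a-zA-Z0-9_!(),: -]*(\.[a-zA-Z]+)?")


def safe_filename(name: str, mime_type: str) -> str:
    """Keep the original name if it passes the checks, else generate a random one."""
    allowed_exts = ALLOWED_TYPES[mime_type]
    match = SAFE_NAME.fullmatch(name)
    ext = (match.group(1) or "").lower() if match else ""
    if ext in allowed_exts:
        return name
    # Fall back to a random name carrying an extension that matches the type.
    return secrets.token_hex(16) + sorted(allowed_exts)[0]


def check_upload(name: str, mime_type: str, size: int) -> str:
    """Reject disallowed types and oversized files; return the name to store."""
    if mime_type not in ALLOWED_TYPES:
        raise ValueError("document type is not on the whitelist")
    if size > MAX_UPLOAD_BYTES:
        raise ValueError("file exceeds the maximum upload size")
    return safe_filename(name, mime_type)


def download_headers(mime_type: str) -> dict:
    """Headers to send when serving a stored document."""
    return {
        "Content-Type": mime_type,
        "X-Content-Type-Options": "nosniff",
    }
```

Names such as `foo.txt.exe` or `../etc/passwd` fail the full-match check and are replaced with a random name, which corresponds to the "third option" described above.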

Related: Various links to [detect out-of-date browsers](http://security.stackexchange.com/a/2029) — makerofthings7, Feb 23 '12 at 03:54
Following on to your bullet point; [Perfidies is a Mozilla demo that compares a live browser session against a Plugin vulnerability database.](http://security.stackexchange.com/q/12057/396) — makerofthings7, Feb 23 '12 at 06:30

score 5 · Answer 3 · answered Feb 22 '12 at 20:54

When I've had to tackle this before, I've converted the file from its source format into another, and sometimes then into a third format.

If Open/LibreOffice supports your document format, you could just call it with --convert-to pdf. Then you can take that PDF and manipulate it further, e.g., using ImageMagick:

convert source.pdf source-page-%03d.jpg
convert -page A4 -compress jpeg source-page-*.jpg source.pdf

Chaining the three together means you end up with a PDF whose content is rendered page images, which should be pretty safe.
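
The chain described above could be scripted roughly as follows. This is only a sketch under assumptions: it presumes `soffice` and ImageMagick's `convert` are on the PATH, and the function names, the `-density 150` setting, the timeout, and the output file names are all made up for illustration. The converters themselves parse the untrusted file, so in practice you would probably want to run this inside a sandbox or throwaway container.

```python
import subprocess
from pathlib import Path
from typing import List, Sequence

TIMEOUT_SECONDS = 120  # Assumption: arbitrary cap so a hostile file cannot hang the job.


def office_to_pdf_cmd(source: Path, workdir: Path) -> List[str]:
    """LibreOffice command that writes <stem>.pdf into workdir."""
    return ["soffice", "--headless", "--convert-to", "pdf",
            "--outdir", str(workdir), str(source)]


def rasterize_cmd(pdf: Path, workdir: Path) -> List[str]:
    """ImageMagick command that renders each PDF page to a zero-padded JPEG."""
    return ["convert", "-density", "150", str(pdf), str(workdir / "page-%03d.jpg")]


def rebuild_cmd(pages: Sequence[Path], output: Path) -> List[str]:
    """ImageMagick command that packs the page images back into a PDF."""
    return ["convert", "-page", "A4", "-compress", "jpeg",
            *[str(p) for p in pages], str(output)]


def flatten(source: Path, workdir: Path) -> Path:
    """Run the three steps in order and return the path of the image-only PDF."""
    workdir.mkdir(parents=True, exist_ok=True)
    subprocess.run(office_to_pdf_cmd(source, workdir), check=True, timeout=TIMEOUT_SECONDS)
    intermediate = workdir / (source.stem + ".pdf")
    subprocess.run(rasterize_cmd(intermediate, workdir), check=True, timeout=TIMEOUT_SECONDS)
    # Zero-padded names keep the pages in order when sorted.
    pages = sorted(workdir.glob("page-*.jpg"))
    result = workdir / (source.stem + "-flattened.pdf")
    subprocess.run(rebuild_cmd(pages, result), check=True, timeout=TIMEOUT_SECONDS)
    return result
```

Calling `flatten(Path("upload.doc"), Path("work"))` would leave `work/upload-flattened.pdf` behind if all three tools succeed; any non-zero exit status raises `subprocess.CalledProcessError`.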

The only problem with this is that the open/libreoffice software could get infected from reading the files. — costa, Oct 31 '19 at 07:21

score 1 · Answer 4 · edited Mar 17 '17 at 13:14

Regarding DW's answer:

Optional: Use a separate domain to host file uploads.
You could host the documents on a separate domain that is used only to host user-uploaded documents. This will limit the impact of some browser-level attacks, such as content-type sniffing. However, this does not defend against code-injection attacks that exploit a vulnerability (e.g., a buffer overrun, a double-free) in a document viewer and let the attacker execute native code.

The most secure implementation of this, in my mind, is to create a unique DNS domain for every file, so that an issue with one download will be less likely to affect others in the repository.

This could be implemented in two parts. First, start with a wildcard entry in DNS that points to the web server that holds the files (e.g., *.userfiles.com). The idea is that when someone requests /Accounting/budget.xls, they will go to the following precomputed URL. In this case the domain prefix is an MD5 hash of the URL:

https://fd4929a6ea48f85130f5a43f89cb7dd6.userfiles.com/Accounting/budget.xls

Second, a server-side component can verify at runtime that "fd4929a6ea48f85130f5a43f89cb7dd6" is the MD5 of the requested URL path, and allow or deny access accordingly.

Examples:

https://96a077f305f5720221b338863ea8e3d7.userfiles.com/tech/SecurityPolicy.docx - OK
https://fd4929a6ea48f85130f5a43f89cb7dd6.userfiles.com/Accounting/budget.xls - OK
https://fd4929a6ea48f85130f5a43f89cb7dd6.userfiles.com/Accounting/Budget.xls - Fails checksum
https://fd4929a6ea48f85130f5a43f89cb7dd6.userfiles.com/Accounting/forecast.xls - Fails checksum
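
A minimal sketch of the per-file subdomain check might look like the following. The function names are hypothetical, and it assumes the hash is taken over the URL path exactly as requested (so a change in case changes the hash). It uses SHA-1 rather than MD5; the 40-character hex digest still fits within the 63-character limit for a DNS label.

```python
import hashlib
import hmac

BASE_DOMAIN = "userfiles.com"  # Wildcard domain from the example above.


def path_label(path: str) -> str:
    """Hex digest of the URL path, used as the subdomain prefix."""
    return hashlib.sha1(path.encode("utf-8")).hexdigest()


def download_url(path: str) -> str:
    """Precompute the per-file URL for a given path."""
    return f"https://{path_label(path)}.{BASE_DOMAIN}{path}"


def request_allowed(host: str, path: str) -> bool:
    """Allow the request only if the Host prefix matches the hash of the path."""
    label, _, rest = host.lower().partition(".")
    expected = path_label(path)
    return rest == BASE_DOMAIN and hmac.compare_digest(
        label.encode("utf-8"), expected.encode("utf-8")
    )
```

Because the hash is unkeyed, anyone can compute the prefix for a known path, so this arrangement separates files into distinct origins rather than acting as an access control on its own.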

Note to self: Use SHA1 instead of MD5, since MD5 is broken. — makerofthings7, Feb 23 '12 at 15:36

score 0 · Answer 5 · answered Feb 22 '12 at 16:22

Assuming that I have a need to share documents on a website that may contain hostile code, what should I do to protect myself and my end users?

I'll take a stab at answering this question. The honest truth is that most exploits delivered in a Word document, a PDF, or any other document format come in files that contain both valid and malicious content.

One could in theory have a perfectly valid Word document that contains malicious third-party content (some external content is allowed), which triggers an unknown series of events that allows malicious code to run or, in the case of a PDF file opened in Adobe Reader, causes malicious JavaScript to be run.

So there are several things you can do.

Convert all documents to a single format (my suggestion is PDF) and use the built-in security capabilities of Chrome, IE9, and Firefox to display the document within the browser itself. This means you don't actually transfer the file to the user unless they decide they want to save the file.

As I said, you will be unable to catch every single thing.
