For context: I have a web application that allows users to upload a PDF file, from which the web app extracts certain information by parsing it. The app then sends this information to another server for further processing.

The web app is based on Python (Django & FastAPI) and runs on a Linux-based operating system inside a Docker container (which has root privileges).

The PDF file is never stored on the server and is only handled in memory: it is received at an endpoint as a regular HTTP request with the file contained in the form data (multipart/form-data), then converted to HTML and parsed. The resulting data are sent to another server for storage in an SQL database.

My questions are as follows:

  1. Is parsing the file in an interpreted language such as Python considered to be 'executing' it?
  2. Does handling this file in this manner pose any risk to the server if the file contains malware?
ThoriumBR
  • Do you have a specific reason to use PDF for that instead of just letting the user enter the data they want to submit? –  Oct 15 '21 at 13:24
  • The extraction process is quite tedious and the automation is for user convenience. – portabletable Oct 15 '21 at 13:34
  • Is there a reason these files are generated as PDF? PDF is generally a format meant for humans to read, not for machine-to-machine communication. –  Oct 15 '21 at 14:37
  • Good question: the PDF files are automatically generated from another source, which is out of our (or the user's) control. – portabletable Oct 15 '21 at 14:40
  • That seems like a really shitty situation. Can you possibly share an example PDF (without any personal or confidential information, of course)? –  Oct 15 '21 at 14:47
  • Unfortunately that's not possible in my case. – portabletable Oct 18 '21 at 07:22
  • That's unfortunate, but understandable. As the answers state, it's risky and the workflow should be changed if possible. –  Oct 18 '21 at 12:33
  • Thank you for helping me clarify my situation. – portabletable Oct 19 '21 at 09:24
  • You didn't ask for any suggestions to remedy the situation, but I see suggestions being made nonetheless (e.g. avoiding PDF altogether). Another approach would be to process the PDF file client side in the browser, and only send your sanitized data (not the PDF file) to the server. – Ryan Oct 23 '21 at 19:27

2 Answers

You don't "execute" a PDF (although technically it can contain executable code such as JavaScript), but there are still risks from the approach you're taking. The main things that spring to mind are:

  • Command injection if you're passing the filename to some kind of parsing tool via a system() call or similar.
  • SQL injection if you're not handling the database queries safely.
  • A parsing vulnerability in whatever PDF library you're using, which could potentially lead to RCE.

The first two can be protected against through general secure development practices. For the third one, the specifics will depend on exactly how you're handling your PDFs, but the main things would be to make sure your PDF parsing library is fully patched (and kept up to date), and to disable any of the more dangerous functionality, such as JavaScript, that the library might support.
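
To illustrate the first two points, here is a minimal Python sketch, not a drop-in implementation. It assumes the untrusted filename is handed to an external tool through `subprocess` and that the extracted value ends up in an SQL table; the child process (a Python one-liner that echoes its argument), the `extracted` table, and the function names are all hypothetical stand-ins for whatever your app actually uses.

```python
import sqlite3
import subprocess
import sys


def run_tool_safely(filename: str) -> str:
    """Pass an untrusted filename to a child process without a shell."""
    # With an argument list and the default shell=False, the filename is
    # delivered as one literal argument, so characters such as ';' are
    # never interpreted by a shell. The child here is only a stand-in for
    # a real PDF tool: it echoes the argument it received.
    cmd = [sys.executable, "-c", "import sys; print(sys.argv[1])", filename]
    result = subprocess.run(
        cmd, capture_output=True, text=True, check=True, timeout=10
    )
    return result.stdout.rstrip("\n")


def store_value(conn: sqlite3.Connection, value: str) -> None:
    """Insert untrusted text using a bound parameter, not string formatting."""
    # The '?' placeholder keeps the value out of the SQL text entirely.
    conn.execute("INSERT INTO extracted (value) VALUES (?)", (value,))
    conn.commit()


# Hypothetical in-memory database standing in for the real SQL server.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE extracted (value TEXT)")

hostile_name = "report.pdf; rm -rf /"
hostile_value = "x'); DROP TABLE extracted; --"

echoed = run_tool_safely(hostile_name)
store_value(conn, hostile_value)
rows = conn.execute("SELECT value FROM extracted").fetchall()
```

In both cases the hostile string is treated purely as data: the child process receives the whole filename as a single argument, and the table still exists afterwards with the value stored verbatim. None of this helps with the third point, a bug inside the parser itself.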

If you can have the user just enter data directly into a form that would be much safer - but that may not be possible in your use case.

Gh0stFish

This workflow is risky. Even if a PDF file is not executable per se, it can contain embedded malicious code, and that code can end up being executed if the application or library parsing it has a vulnerability.

Treat all received PDF files as untrusted and hostile. Use a container with the bare minimum needed to read the PDF and convert it to another format (and please don't run it as root). Give it no networking, only a shared storage volume. Your application receives the PDF and writes it to the shared volume; the conversion container converts the PDF to another format and saves the result on the shared volume; and a third container (or even your application) reads the resulting file.

This way you reduce both the attack surface and the blast radius. If an attacker manages to execute code in the conversion container, they will end up in a container with no usable shell, no editor, and no networking. It's not impossible to misuse such an environment, but it is much more difficult, and an attempt is more likely to trip your security products.
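
As a rough sketch of how the conversion container might be launched, the Python snippet below only builds a `docker run` argument list; it does not start anything. The image name `pdf-converter:latest`, the host directory `/srv/pdf-exchange`, the file paths inside `/data`, and the UID/GID `65534` are assumptions made for illustration. The Docker flags themselves are standard ones, but which of them you need depends on your setup.

```python
import shlex
from typing import List


def build_conversion_command(
    shared_dir: str, image: str = "pdf-converter:latest"
) -> List[str]:
    """Build a `docker run` command for a locked-down conversion container."""
    return [
        "docker", "run", "--rm",
        "--network", "none",                    # no networking at all
        "--user", "65534:65534",                # assumed unprivileged UID:GID, not root
        "--read-only",                          # root filesystem is read-only
        "--cap-drop", "ALL",                    # drop all Linux capabilities
        "--security-opt", "no-new-privileges",  # block privilege escalation via setuid
        "--volume", f"{shared_dir}:/data",      # the only writable location
        image,                                  # hypothetical minimal converter image
        "/data/input.pdf", "/data/output.html", # hypothetical converter arguments
    ]


cmd = build_conversion_command("/srv/pdf-exchange")
printable = shlex.join(cmd)  # handy for logging the exact command
```

You would pass `cmd` to something like `subprocess.run(cmd, check=True, timeout=...)` from the service that orchestrates the conversion, and the shared directory has to be writable by the unprivileged user you pick.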

ThoriumBR