
Question: what methods, tools, techniques, and procedures are recommended for securely storing files obtained by mirroring a website? That is, I am looking for commonly known and accepted practices that fit the use case outlined below, even if there are varying points of view on which practices constitute a best practice.

Background: I have endeavored to provide as many specifics as I can about my use case, though I admit I struggle to go into greater detail because this is an area in which my knowledge is limited. I am asking this question to better direct any future efforts, research, etc. As outlined in this question, I sometimes mirror websites to gather information of value to due diligence investigations, and I learned from my previous question that things are often more complicated than expected. This question reflects my interest in the security implications of actually storing what I get when I use wget, curl, PhantomJS, etc. So, I welcome your thoughts on best practices, actions you suggest I take, or general precautions. For example, perhaps it is a best practice to store such files only on an air-gapped machine, to convert HTML to a different format, or to avoid JavaScript files entirely because they pose an inherent threat. For reference, I use Ubuntu Linux.

Goals:

  • Retain text-based files for keyword and other types of searching
  • Retain image files for visual and metadata evaluation
  • To the best of my ability given the other two goals, store and interact with files in a secure manner

File types encountered: these are the main file types I've come across to date...

  • Documents: PDF, RTF, DOC, DOCX, ODF, TXT
  • Presentations: PPT, PPTX
  • Spreadsheets: XLS, XLSX, CSV
  • Images: PNG, TIF, JPG, GIF
  • Web: HTML, CSS, JS, PHP
  • This depends on what you do with the files. If a file is just _sitting_ there, it really can't do anything bad. You need to either execute a malicious file, or parse a file with a vulnerable program, for it to do anything bad. – forest Dec 19 '17 at 03:03
  • @forest Hello again! I had heard about that distinction. I think I'm in the latter category (execution & parsing) because I want everything text-based to be searchable. In terms of non-text files, I'll want to examine certain files' metadata with exiftool and generally examine them if they are images. I hope that clarifies. – Tigelle Dec 19 '17 at 03:17
  • Are you worried about exploitation of your browser/client or of the server which is hosting the content? – forest Dec 19 '17 at 03:31
  • @forest The server where I'm storing the material. I've also considered secure cloud storage, about which I know little. Not sure how Wuala, SpiderOak, Tresorit, etc. might mitigate any concerns. – Tigelle Dec 19 '17 at 03:37
  • "Secure cloud storage" would do nothing, same with converting HTML to another format. If your server is what you want to secure, then the gist of the answer would be that you need to use a sandbox/MAC for the at-risk processes, or at the very least, process the data from an unprivileged user, ideally one with network access disabled. Unfortunately, this question is rather broad. The possibilities for exploitation are endless, and without more knowledge of the specific actions you plan to take, I can't give a very specific answer. – forest Dec 19 '17 at 03:47
  • @forest That makes sense. Thank you. I acknowledge the question is broad. I'm casting a fairly wide net because I'm a non-expert and I assume there are some good concepts/best practices to keep in mind. For example, my thought was that a user might offer, like you have, that converting HTML to PDF doesn't do much, that you could cut down risk quite a bit by not retaining X filetype, etc. – Tigelle Dec 19 '17 at 03:55
  • Perhaps you should say what file types you will be using, and what you will be doing with them. Checking if a file is an image doesn't give me much information (for example, you could be doing anything from using `file(1)` to opening it in an image viewer). If you're just checking file extensions and generating a checksum for each file to sort them and delete duplicates or something, then there is virtually no security risk at all. On the other hand, parsing untrusted PDFs is probably the least safe thing you can do. I don't know where between these two extremes your required actions lie. – forest Dec 19 '17 at 03:59
  • @forest Thank you. I listed some filetypes in the question, but I'll modify the question to list filetypes I've encountered thus far. – Tigelle Dec 19 '17 at 12:29
  • Are you going to access _all_ of those using a program to read them? Like are you going to be reading that powerpoint presentation in MS Office/LibreOffice? – forest Dec 19 '17 at 12:40
  • @forest I'll add a blurb on that in the original question. But, generally, I simply want to have text-based files available for keyword searching and image files for visual evaluation for interesting content and for metadata evaluation with exiftool. – Tigelle Dec 19 '17 at 14:36

2 Answers


The file types you listed and the goals you presented create a huge attack surface and, against a sophisticated adversary, a vastly increased possibility of exploitation. PDFs and PPT/PPTX are particularly problematic. If you cannot limit yourself to far fewer file types, you will need to isolate your activities, either through privilege separation or by working on a remote server.

Privilege separation

This is the solution if you need to perform this activity on your local computer. While it would be difficult to individually sandbox every single application you will be using, you can create a new user on your computer with few privileges (a shell sketch of some of these steps follows the list):

  • Disable access to su and sudo from the new user, and do not use them as that user.
  • Do not su to your lesser user from root to avoid TTY pushback attacks.
  • Use iptables to disable network access for that user.
  • Set resource limits to reduce the amount of damage an exploited application can do.
  • Use Wayland instead of Xorg if possible, or Xorg with systemd-logind to run it as non-root.
  • Enable the Secure Attention Key, and use it when you finish and switch to a new session.
  • Scan for and remove all unnecessary setuid or setgid files, as well as setcap files.
  • Use an auditing framework like auditd to monitor potentially malicious activities.
  • Apply general system hardening such as sysctl tweaks or hardening patches.
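
As a minimal sketch of a few of these steps on Ubuntu (the user name `mirror` and the exact rules are illustrative choices, not a complete hardening recipe):

```sh
# Create a low-privilege user for handling mirrored files
# ("mirror" is a placeholder name).
sudo adduser --disabled-password mirror

# Make sure the user is not in any sudo-capable group.
sudo deluser mirror sudo 2>/dev/null || true

# Drop all outbound packets from processes owned by that user.
sudo iptables -A OUTPUT -m owner --uid-owner mirror -j DROP

# List setuid/setgid binaries worth auditing or removing.
sudo find / -xdev -perm /6000 -type f 2>/dev/null

# List binaries with file capabilities set.
sudo getcap -r / 2>/dev/null
```

Resource limits for that account can then be set in /etc/security/limits.conf, and auditd rules added to watch what it does.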

Depending on the sophistication of the adversary, this may not be enough; even against an adversary of moderate capability it is rather incomplete, but it is a starting point. As you have to assume (most likely correctly) that the applications you are using to access these files are vulnerable to arbitrary code execution, the question turns into "How can I safely run untrusted code?", which is of course extremely broad.

Virtual Private Servers

An easier solution would be to use a VPS. You can run applications remotely on the VPS rather than on your own computer and interact with them that way. Even if the VPS is completely compromised, the attack surface is reduced to that of your SSH client and your terminal, which is fairly small. As you will not be able to directly view images over SSH (at least not safely), you may want to convert them to a very simple (and difficult to exploit) image format before transferring them to your local computer for viewing. An example of such a format is the PPM pixel map. This also works for viewing PDF files, as they can be readily converted into image files using various utilities.
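
For instance, on the VPS (assuming poppler-utils and ImageMagick are installed; the file names are illustrative):

```sh
# Render each page of a PDF as a PPM pixel map (poppler-utils).
pdftoppm suspicious.pdf page

# Convert a single image to PPM with ImageMagick.
convert photo.jpg photo.ppm
```

You can then `scp` the resulting .ppm files to your local machine; PPM is so structurally simple that a viewer parsing it has very little room for exploitable complexity.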

The host of a VPS is able to access and modify anything on it. If that is an issue for you (for example, if the files are extremely sensitive, or integrity is of utmost importance), you may not want to use a VPS. This is unlikely to matter here: as the website you downloaded from is already (presumably) public, there should be no confidentiality concern. You can increase the confidentiality and integrity of the data by using a dedicated server instead, although that would be more expensive.

You should keep a local backup of these files in case the VPS is shut down, so that you can restore them to another VPS later. The local copy should be stored in an "inert" form that is not susceptible to exploitation of any indexers or thumbnail generators you may have on your system, for example by packing the files into a tar archive.
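
A minimal sketch (the paths are illustrative):

```sh
# Pack the mirrored site into a single inert archive; local indexers
# and thumbnailers will not parse the individual files inside it.
tar -cf mirror-backup.tar mirrored-site/

# Record a checksum so integrity can be verified before restoring.
sha256sum mirror-backup.tar > mirror-backup.tar.sha256

# Later, verify and unpack on a fresh VPS.
sha256sum -c mirror-backup.tar.sha256
tar -xf mirror-backup.tar
```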

As you will need to use command-line utilities rather than graphical ones, you will need to find ways to access these files over SSH. Some examples of what you can do remotely on a VPS (a few concrete commands follow the list):

  • You can convert PNG, TIFF, JPEG, and (non-animated) GIF to PPM to safely view them locally.
  • XLS/XLSX can be converted to CSV, which is easy to work with on the command line. There is a good vim plugin for it.
  • RTF, DOC, DOCX, and ODF can be converted to images, which can be securely viewed locally.
  • I am not aware of any way to view PPT/PPTX on the command line, though you can still analyze them.
  • HTML, CSS, and JS can be viewed in a text editor or rendered in a text-based browser on the remote host.
  • PHP can be viewed in a text editor or executed using a command-line PHP interpreter.
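
As a rough sketch of a few of these conversions (assuming LibreOffice, poppler-utils, and lynx are installed on the VPS; the file names are placeholders):

```sh
# Spreadsheet to CSV via headless LibreOffice.
libreoffice --headless --convert-to csv accounts.xlsx

# Word-processor document to PDF, then each page to a PPM image
# that is safe to copy down and view locally.
libreoffice --headless --convert-to pdf report.docx
pdftoppm report.pdf report-page

# Browse saved HTML with a text-based browser on the remote host.
lynx ./index.html

# Run a PHP file with the command-line interpreter (on the
# sacrificial VPS only; this executes untrusted code).
php ./page.php
```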
– forest

Since you asked about storage, my answer focuses on storing potentially infected files, not on working with them (which is a very different issue).

All the file types you mentioned, including some image formats, have been or still are being used to distribute malware, so each of them presents a potential threat if unvetted.

However, you cannot get infected simply by downloading a document file, so storing even an infected file is usually safe until and unless you open it. Generally you only get infected when you open or view the file, and sometimes only if the software you open it with is vulnerable. But with some formats (like DOC) you might get exploited even with up-to-date software if you're not careful.

Thus your main concern in storing those documents securely boils down to preventing accidental clicks that would open them. The easiest way to do that is to scramble the files; even a basic XOR would do the job here, since the goal is only to stop accidental opening, and that is secure enough for this purpose.
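
The answer proposes a simple XOR; as one concrete stand-in for the same idea, `openssl enc` gives a reversible scramble with a real cipher (the file names are illustrative, and the command prompts for a passphrase):

```sh
# Scramble: the stored blob no longer parses as a document and has
# no extension that an accidental double-click would open.
openssl enc -aes-256-cbc -pbkdf2 -salt -in report.docx -out report.docx.blob
rm report.docx

# Unscramble only when you actually need to index or read the file.
openssl enc -d -aes-256-cbc -pbkdf2 -in report.docx.blob -out report.docx
```

Either way, the point is the same: the stored bytes are opaque, so a stray click or an over-eager indexer gets nothing to parse.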

– George Y.
  • OP doesn't want to just store the files and avoid accidental clicks. He says he needs to access the files, so "scrambling" them seems like it would not work. – forest Dec 19 '17 at 08:51
  • 1
    Of course you unscramble them back when you need to access or index them. – George Y. Dec 19 '17 at 21:08