What would it take to scan a PDF for questionable content in addition to malware?

Question

My company is trying to implement a new feature on our site to let customers upload documents accompanying an order. We know that we will need to perform some kind of malware scan by calling out to a service. However, my boss has also asked for the scan to cover questionable content which could wind up on our servers if not caught. The files in question would be in PDF format. My gut tells me this is a significant increase in complexity, as the system would need to decipher the content of each PDF file. Is there anything out there that does this today, and what would the system need to do in order to achieve this?

Does `questionable content` refer to code, inappropriate text and pictures, or other? — Neil Smithline, Jun 28 '16 at 15:06
Do you mean something like isitporn.com + human intervention if > a certain percentage? — noɥʇʎԀʎzɐɹƆ, Jun 28 '16 at 15:43
@NeilSmithline- the latter certainly. Although it may not be limited to pornographic material. The vague descriptor was "anything that exposes us to liability." — Alice, Jun 28 '16 at 16:25
@JamesLu- does the Clearsite API handle PDFs or only image files? — Alice, Jun 28 '16 at 16:33
It's taking the easy way out, but consider whether the risks are even worth the theoretical rewards in implementing a feature like this. You're exposing the company to liability by virtue of the fact that you're serving content you haven't vetted. Automated tools, OCR, etc are great but who's to say someone doesn't upload a document libeling (or promoting!) your competitors? — Ivan, Jun 28 '16 at 19:18

score 1 · Answer 1 · answered Jun 28 '16 at 15:00

A few frameworks come to mind for this: Mastiff, Viper, and IRMA. All are meant for reverse engineering and analyzing malware, so any of them will need some customization to run in this role. For example:

Out of the box install:

client  --> uploads file to system
system  --> system sends it to one of the above
program --> analyzes file, creates report

You would need to do something like the following:

client  --> uploads file to system
system  --> system sends it to one of the above
program --> analyzes file; if malicious, delete/quarantine/etc.
program --> analyzes file; if not malicious, send file to your org
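
The modified flow above could be sketched roughly as follows. This is a minimal illustration, not how any of the named frameworks work: `is_malicious` is a hypothetical callable standing in for whatever wrapper you write around Mastiff, Viper, IRMA, or a scanning service, and the `quarantine` and `accepted` directory names are assumptions. The sketch fails closed, so a scanner error sends the file to quarantine rather than to your org.

```python
import shutil
from pathlib import Path
from typing import Callable


def handle_upload(
    path: Path,
    is_malicious: Callable[[Path], bool],
    quarantine_dir: Path = Path("quarantine"),  # hypothetical location
    accepted_dir: Path = Path("accepted"),      # hypothetical location
) -> Path:
    """Move an uploaded file to quarantine or to the accepted area.

    `is_malicious` is a placeholder for the analysis step; its
    interface is an assumption made for this sketch.
    """
    try:
        malicious = is_malicious(path)
    except Exception:
        # Fail closed: if the scanner breaks, treat the file as suspect.
        malicious = True

    target_dir = quarantine_dir if malicious else accepted_dir
    target_dir.mkdir(parents=True, exist_ok=True)
    destination = target_dir / path.name
    shutil.move(str(path), str(destination))
    return destination
```

Only files that end up in the accepted directory would then be handed on to the rest of the order system.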

This can be done with some work. Otherwise, this becomes a vendor-specific question ("What product can...?"), which is off-topic here.

score 1 · Answer 2 · edited Mar 17 '17 at 10:46


The problem is defining what "questionable content" is. The three possibilities which spring to mind are:

1. malware,
2. copyrighted material being distributed without licence,
3. content deemed to be offensive, such as slander / blasphemy / thoughtcrime / pornography...

The first one can be massively mitigated by flattening the PDF files and virus scanning for good measure.
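
As a rough complement to flattening and virus scanning, a byte-level check for PDF name tokens associated with scripts, automatic actions, and embedded files can be used for logging or for confirming that a flattened file no longer carries them. The sketch below is a naive heuristic and not a replacement for either step: the token list is an assumption, PDF names can be written with `#xx` hex escapes, and objects can sit inside compressed streams, so an empty result proves nothing.

```python
import re

# Name tokens commonly associated with active or embedded content.
# This list is an illustrative assumption, not an exhaustive catalogue.
RISKY_NAMES = (
    b"/JavaScript",
    b"/JS",
    b"/OpenAction",
    b"/AA",
    b"/Launch",
    b"/EmbeddedFile",
)


def find_risky_names(pdf_bytes: bytes) -> list[str]:
    """Return the risky name tokens that appear literally in the raw bytes."""
    found = []
    for name in RISKY_NAMES:
        # Require a non-alphanumeric character (or end of data) after the
        # token, so that /JS does not match a longer name such as /JSFoo.
        pattern = re.escape(name) + rb"(?![A-Za-z0-9])"
        if re.search(pattern, pdf_bytes):
            found.append(name.decode("ascii"))
    return found
```

A file that still reports hits after flattening is a reasonable candidate for quarantine or manual review.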

Addressing the second is also relatively easy - just build a database of all the copyrighted content in the world and see if anything in the PDF matches it. You still need a manual process for "fair use".
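
The matching half of that idea can be sketched with hashed word shingles. Everything here is an assumption for illustration: the text is taken to be already extracted from the PDF, `known_hashes` stands in for the fingerprint database (which is the part nobody actually has), and the shingle size of 8 words is arbitrary.

```python
import hashlib
import re


def shingle_hashes(text: str, size: int = 8) -> set[str]:
    """Hash every run of `size` consecutive words in the text."""
    words = re.findall(r"\w+", text.lower())
    if len(words) < size:
        return set()
    return {
        hashlib.sha256(" ".join(words[i:i + size]).encode("utf-8")).hexdigest()
        for i in range(len(words) - size + 1)
    }


def overlap_ratio(candidate: str, known_hashes: set[str], size: int = 8) -> float:
    """Fraction of the candidate's shingles that appear in the known set."""
    hashes = shingle_hashes(candidate, size)
    if not hashes:
        return 0.0
    return len(hashes & known_hashes) / len(hashes)
```

A high ratio only says the text overlaps with something in the database; deciding whether that use is licensed or fair still needs a person.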

The third one is a bit tricky.

answered Jun 28 '16 at 15:40 by symcbean · edited Mar 17 '17 at 10:46 by Community

The third one is indeed tricky. For text-only PDF files, you could convert the PDF to text and then check for offensive words. But as soon as the files contain images, you need OCR, plus tools to detect pornographic images (tricky, but could be done), tools to detect slander (very tricky), and tools to detect blasphemous images (again, very tricky). – A. Darwin Jun 28 '16 at 18:05
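
The text-only case from this comment might look something like the sketch below. It assumes the PDF has already been converted to plain text by a separate step, and the blocklist contents are entirely up to you; both are assumptions, not part of the comment. Matching is on whole words, so a listed term does not trigger on a longer word that merely contains it.

```python
import re


def find_flagged_terms(text: str, blocklist: set[str]) -> set[str]:
    """Return the blocklisted single-word terms that occur in the text."""
    # Split into lowercase words; apostrophes are kept inside words.
    words = set(re.findall(r"[a-z']+", text.lower()))
    return words & {term.lower() for term in blocklist}
```

This only handles single-word terms and is easy to evade with misspellings or spacing, so a hit is better treated as a reason for human review than as a verdict.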
