ML approach to malicious attachments using email contents

Question

Based on my understanding of the field, a lot of attention is paid to spam filtering based on the contents and metadata of an email, and considerable attention is paid to detecting malicious attachments on their own. I was wondering whether any work has been done on an ML approach that analyzes attachments and emails together to determine malice. It seems like you could do a lot better at detecting emails with bad attachments by checking whether the contents of the attachment match up with the body of the email.

For instance, an email whose body claims to have an earnings report wouldn't be suspicious, and an executable file wouldn't be any more suspicious than any executable file, but an earnings report email with an executable attachment should set off red flags.

I was wondering whether products already incorporated this insight or if there were open source projects that integrated this capability.

For context, I work at a large company that already has a spam filter, and we perform automated static and dynamic analysis on attachments. My question is whether I could tie the output of the three together with ML to provide better coverage for threats that may slip through. We already do manual content to catch these kinds of things so I was wondering if applying ML could catch novel threats. I know that EXEs aren't the main threat.
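To make the idea concrete, here is a minimal sketch of how body text, attachment name, and a sandbox verdict might be joined into one feature record for a downstream classifier. The keyword table, the extension lists, the `sandbox_score` input, and the function name are all assumptions made for illustration; none of them come from an existing product.

```python
import os
import re

# Hypothetical mapping from body keywords to the attachment types a reader
# would expect (an assumption for illustration only).
EXPECTED_EXTENSIONS = {
    "report": {".pdf", ".docx", ".xlsx"},
    "invoice": {".pdf", ".xlsx"},
    "presentation": {".pptx", ".pdf"},
}
# Hypothetical, deliberately short list of directly executable types.
EXECUTABLE_EXTENSIONS = {".exe", ".scr", ".js", ".vbs", ".bat"}


def mismatch_features(body, filename, sandbox_score):
    """Join email-body and attachment signals into one feature dict."""
    words = set(re.findall(r"[a-z]+", body.lower()))
    root, ext = os.path.splitext(filename.lower())
    inner_ext = os.path.splitext(root)[1]  # ".pdf" in "x.pdf.exe"

    # Collect the extensions the body text leads the reader to expect.
    expected = set()
    for keyword, extensions in EXPECTED_EXTENSIONS.items():
        if keyword in words:
            expected |= extensions

    return {
        "body_implies_document": bool(expected),
        "type_mismatch": bool(expected) and ext not in expected,
        "double_extension": inner_ext != "" and ext in EXECUTABLE_EXTENSIONS,
        "is_executable": ext in EXECUTABLE_EXTENSIONS,
        "sandbox_score": sandbox_score,  # passed through from existing analysis
    }
```

A record like this could be fed to any classifier alongside the existing spam-filter score. The hand-written keyword table is only a stand-in for a learned model of the body text.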

Given the high variance in mail content I doubt that it is possible to get both a high true positive rate and a low false positive rate with this. Most mails would likely not have a clear enough classification at all since not enough similar mails were seen. And there are likely better and simpler approaches, like why should one accept an executable attachment from an unknown sender or from a sender who has never sent such an attachment in the first place? – Steffen Ullrich, Apr 29 '19 at 17:07
Also, no one sends executables to begin with. Most people send docx files or PowerPoints with VBA. – yeah_well, Apr 29 '19 at 19:12
I am sure companies are working on it, and it's still a huge problem to correctly determine whether an email is malicious, spam, or not. It's just very complex once you start looking at it. – yeah_well, Apr 29 '19 at 19:14
That's fair. I'm coming from a SOC that can't block content programmatically, and we don't have influence with the people who do. I'm not familiar with the options available to the team that manages filtering either. Part of what we're interested in is tracking campaigns, and I was wondering whether this approach might yield better coverage without costing too many resources. We have a large repository of mail internally if there aren't any large public data-sets. – solumnant, Apr 29 '19 at 19:14
@VipulNair That's mostly the case but there are a number of cases where people will send something like "Earnings_Report.pdf.exe". I just want to provide better coverage for the organization and I thought there should be some way to combine our existing email analysis with our automated static and dynamic analysis in a programmatic way. – solumnant, Apr 29 '19 at 19:29
There's a bit of an issue with your question. Questions of the type "what product/service does X?" are not a great fit on a Q&A site because the list of potential answers could go on forever (even if I think that there are none of this type). Did you have another question around this situation? – schroeder, Apr 29 '19 at 19:59
I guess the best way to put my question was whether there was any theoretical basis for what I was thinking of (whitepaper/research), and whether any open source projects or products even existed around the concept. It seemed as though the concept should already exist but I didn't know the name for it or how to look for it. I really just wanted a starting point for my own research, or perhaps other ways to use the existing (computationally expensive) information we already produce. – solumnant, Apr 29 '19 at 20:54

score 1 · Accepted Answer · answered Apr 29 '19 at 19:12


I think you are using the wrong tool for the wrong problem. Why not inspect the attachment for malicious code? What does pairing the content of both things gain you?

Also, you appear to have an underlying assumption about the disconnect between the two contents. Why do you think that the attachment content would not match the email content?

What if attachments are zipped, compressed, encrypted, or compiled and you cannot read the content? What if the attachment has no content?

So, if we do the things we currently can do correctly: validate senders, Bayesian analysis of email content, and inspection for malicious attachments, what gaps exist for your approach to provide fruit?

I do not believe projects exist to do this (and I have not heard of any) because it is simply not a fruitful area for work when there are other more fruitful avenues.

schroeder


Thanks. I'm new at my job and trying to help out, but I don't really understand the email stack too well. I'm bringing it up here because my manager mentioned it to me as a possible project when we were brainstorming, but neither of us is responsible for that stack. – solumnant Apr 29 '19 at 19:18
Projects can be useful as learning, even if they are not efficient. If you are hoping for an efficient, high-value solution though, I really don't think this is a great idea. – schroeder Apr 29 '19 at 19:20
Use ML to create and maintain a social network and determine anomalies in who is talking to whom, when, and patterns of behaviour, and you have a winner. – schroeder Apr 29 '19 at 19:21
We already have an automated sandbox for files that trigger enough rules, and the sandbox can handle most packed and compiled content. I was thinking that the presence of these features could be worthwhile signals to any ML. I gave an executable as an example because it was an easy illustration, but I know there are many more file types out there. I was just thinking that it should be possible to combine the artifacts from automated static and dynamic analysis with the analysis of the contents of the email to provide better coverage than ML on any one of them or a signature-based system. – solumnant Apr 29 '19 at 19:24
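The suggestion in the comments above, modelling who talks to whom and flagging departures from that pattern, can be sketched in a few lines. This is only a toy baseline under stated assumptions: the `SenderHistory` class, its method names, and the two flags are hypothetical, and a real system would also need time windows and some notion of volume.

```python
from collections import defaultdict


class SenderHistory:
    """Tracks who has mailed whom and which attachment types each sender used."""

    def __init__(self):
        self.pairs = defaultdict(int)         # (sender, recipient) -> message count
        self.sender_types = defaultdict(set)  # sender -> attachment extensions seen

    def observe(self, sender, recipient, extension=None):
        """Record one delivered message as part of the normal baseline."""
        self.pairs[(sender, recipient)] += 1
        if extension:
            self.sender_types[sender].add(extension)

    def anomalies(self, sender, recipient, extension=None):
        """Return flags for a new message that departs from the baseline."""
        flags = []
        if (sender, recipient) not in self.pairs:
            flags.append("new_pair")
        if extension and extension not in self.sender_types.get(sender, set()):
            flags.append("new_attachment_type")
        return flags
```

For example, after observing `alice@example.com` sending a `.pdf` to `bob@example.com`, a later `.exe` from the same sender would return `["new_attachment_type"]`, which also covers the earlier point about senders who have never sent such an attachment before.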

score 0 · Answer 2 · answered May 07 '19 at 13:21

E-mail spam filtering was one of the first uses of machine learning (ML), before it was called ML. Indeed, one of the first algorithms used to filter e-mails is Naive Bayes spam filtering, which uses supervised learning to produce a classification (spam or not spam).
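As an illustration of the Naive Bayes approach, here is a minimal sketch using word counts with add-one (Laplace) smoothing. The class name, the whitespace tokenizer, and the `"spam"`/`"ham"` labels are assumptions for the example, and it expects at least one training message per class before classifying.

```python
import math
from collections import Counter


class NaiveBayesSpamFilter:
    """Tiny multinomial Naive Bayes classifier over whitespace-split words."""

    def __init__(self):
        self.word_counts = {"spam": Counter(), "ham": Counter()}
        self.doc_counts = {"spam": 0, "ham": 0}

    def train(self, text, label):
        """Add one labelled message (label is "spam" or "ham")."""
        self.word_counts[label].update(text.lower().split())
        self.doc_counts[label] += 1

    def _log_score(self, words, label):
        counts = self.word_counts[label]
        total = sum(counts.values())
        vocab = len(set(self.word_counts["spam"]) | set(self.word_counts["ham"]))
        # Log prior: fraction of training messages carrying this label.
        score = math.log(self.doc_counts[label] / sum(self.doc_counts.values()))
        for word in words:
            # Add-one smoothing keeps unseen words from zeroing the product.
            score += math.log((counts[word] + 1) / (total + vocab))
        return score

    def classify(self, text):
        """Return the label with the higher log posterior score."""
        words = text.lower().split()
        return max(("spam", "ham"), key=lambda label: self._log_score(words, label))
```

Working in log space avoids floating-point underflow when a message contains many words.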

SpamAssassin, one of the best-known spam filters, used a genetic algorithm to tune its rule scores until version 3, then switched to a neural network (a perceptron) trained by stochastic gradient descent (source).
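To show what learning rule weights with a perceptron looks like in general, here is a small sketch that updates the weights one shuffled sample at a time, in the style of stochastic gradient descent. It is not SpamAssassin's code: the function names, the 0/1 rule-hit encoding, and the default parameters are assumptions chosen for illustration.

```python
import random


def train_perceptron(samples, n_rules, epochs=200, lr=1.0, seed=0):
    """Learn one weight per rule from (rule_hits, label) pairs.

    rule_hits is a list of 0/1 flags; label is 1 for spam and 0 for ham.
    """
    rng = random.Random(seed)
    weights = [0.0] * n_rules
    bias = 0.0
    data = list(samples)
    for _ in range(epochs):
        rng.shuffle(data)  # visit samples in random order each epoch
        for hits, label in data:
            activation = bias + sum(w * h for w, h in zip(weights, hits))
            predicted = 1 if activation > 0 else 0
            error = label - predicted  # -1, 0, or +1
            if error:
                # Perceptron update: nudge the weights of the rules that fired.
                for i, h in enumerate(hits):
                    weights[i] += lr * error * h
                bias += lr * error
    return weights, bias


def predict(weights, bias, hits):
    """Return 1 (spam) if the weighted rule score is positive, else 0 (ham)."""
    return 1 if bias + sum(w * h for w, h in zip(weights, hits)) > 0 else 0
```

Each learned weight plays the role of a per-rule score, so a message is flagged when the scores of the rules it triggers sum to more than the threshold.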
