Extracting features from PE files. Machine learning and malware

Question

Are the features that can be extracted from a PE file (header information, section names, strings, imports, exports, etc.) sufficient to train a machine learning algorithm to decide whether a suspicious file is malicious?

I have met with two different opinions:

The first opinion holds that these features are enough to build a basic detection system, and that behavioral attributes (e.g. API calls) can be added to improve efficiency and accuracy.

The second opinion holds that these attributes are useless in many cases, because many of them are redundant and redundant features can hurt the quality of a model.

I'm also wondering whether it is possible to detect that a malware sample is similar to another sample, and to infer that one is a variant of the other. Is this kind of information useful in malware detection?
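As an illustration of what "similar" could mean here, one crude option is to compare the sets of imported functions of two samples with a Jaccard index. This is only a sketch: the import names below are hypothetical, and real variant detection usually relies on more robust techniques than a plain set overlap.

```python
def jaccard(a: set, b: set) -> float:
    """Jaccard similarity |A & B| / |A | B|; 1.0 means identical sets."""
    if not a and not b:
        return 1.0  # two empty sets are treated as identical
    return len(a & b) / len(a | b)


# Hypothetical import lists for two samples (illustrative values only).
sample_a = {
    "kernel32.dll!CreateFileW",
    "kernel32.dll!WriteFile",
    "ws2_32.dll!connect",
}
sample_b = {
    "kernel32.dll!CreateFileW",
    "kernel32.dll!WriteFile",
    "advapi32.dll!RegSetValueExW",
}

# 2 shared imports out of 4 distinct imports in total.
print(jaccard(sample_a, sample_b))
```

A score close to 1.0 would suggest the samples share most of their imports; where to draw the line between "variant" and "unrelated" is something that would have to be tuned on real data.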

Could you elaborate on which features the second opinion would rely on to detect malware from the PE header? — Sebastian Walla, Aug 29 '18 at 19:19
Here is a brief presentation that discusses this second approach: https://www.youtube.com/watch?v=_msntOyAGvI — bielu000, Aug 30 '18 at 08:54

score 1 · Answer 1 · answered Aug 29 '18 at 19:17

If you encounter packed malware, the features extractable from the PE file and its imported library functions probably won't get you far. A packed sample exposes only a few imported functions, and the header information can be modified to mislead your trained model.

Note that being packed is not sufficient to classify an executable as malware, since legitimate programs can be packed too.

score 0 · Answer 2 · answered Sep 23 '18 at 14:28

Many of the header attributes can be changed without affecting the executable at all. Even those that do affect it can be changed by also changing the parts of the executable they describe (SizeOfImage, for example). So while that data can be useful for detecting some things, any decent coder can make their PE immune to this.
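For reference, the header fields in question are cheap to read, which is part of why they are tempting as features. The following is a minimal sketch using only Python's `struct` module; the function name `parse_pe_headers` and the returned dictionary layout are my own choices, and the code does no validation beyond the two magic values, so it should not be pointed at untrusted input as-is.

```python
import struct


def parse_pe_headers(data: bytes) -> dict:
    """Read a few static features from the PE headers of `data`."""
    if data[:2] != b"MZ":
        raise ValueError("missing MZ signature")
    # e_lfanew at offset 0x3C holds the file offset of the PE signature.
    (e_lfanew,) = struct.unpack_from("<I", data, 0x3C)
    if data[e_lfanew:e_lfanew + 4] != b"PE\0\0":
        raise ValueError("missing PE signature")

    # COFF file header (20 bytes) follows the 4-byte signature.
    coff = e_lfanew + 4
    machine, n_sections, timestamp, _, _, opt_size, _ = struct.unpack_from(
        "<HHIIIHH", data, coff
    )

    # SizeOfImage sits at offset 56 of the optional header (PE32 and PE32+).
    opt = coff + 20
    (size_of_image,) = struct.unpack_from("<I", data, opt + 56)

    # The section table starts right after the optional header, 40 bytes each.
    table = opt + opt_size
    sections = []
    for i in range(n_sections):
        name, vsize, vaddr, raw_size, raw_ptr, _, _, _, _, chars = (
            struct.unpack_from("<8sIIIIIIHHI", data, table + 40 * i)
        )
        sections.append({
            "name": name.rstrip(b"\0").decode("ascii", "replace"),
            "virtual_size": vsize,
            "raw_size": raw_size,
            "raw_offset": raw_ptr,
            "characteristics": chars,
        })

    return {
        "machine": machine,
        "timestamp": timestamp,
        "size_of_image": size_of_image,
        "sections": sections,
    }
```

Every value returned here is under the control of whoever built the file, which is exactly the weakness described above.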

Signatures are another matter: they can detect basic, unprotected malware, but anything encrypted will bypass them easily.

Your best bet is to look at the entropy of the sections and at their characteristics. High entropy means that the executable is likely to be packed or encrypted. The presence of a section with read, write and execute characteristics almost always means the executable is packed.
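Both checks are easy to sketch. The snippet below computes the Shannon entropy of a byte string (0 to 8 bits per byte) and tests a section's characteristics field for the read, write and execute flags. The flag values are the standard `IMAGE_SCN_MEM_*` constants from the PE format; the function names are mine, and any entropy cutoff for "probably packed" is something you would have to choose and tune yourself.

```python
import math
from collections import Counter

# Section characteristic flags from the PE/COFF format.
IMAGE_SCN_MEM_EXECUTE = 0x20000000
IMAGE_SCN_MEM_READ = 0x40000000
IMAGE_SCN_MEM_WRITE = 0x80000000
RWX = IMAGE_SCN_MEM_EXECUTE | IMAGE_SCN_MEM_READ | IMAGE_SCN_MEM_WRITE


def shannon_entropy(data: bytes) -> float:
    """Shannon entropy in bits per byte: 0.0 (constant) to 8.0 (uniform)."""
    if not data:
        return 0.0
    n = len(data)
    # Sum of p * log2(1/p) over the byte values that actually occur.
    return sum((c / n) * math.log2(n / c) for c in Counter(data).values())


def is_rwx(characteristics: int) -> bool:
    """True if the section is readable, writable and executable at once."""
    return (characteristics & RWX) == RWX
```

In practice you would feed `shannon_entropy` the raw bytes of each section and `is_rwx` the characteristics value from that section's header.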

But guess what: there are ways to decrease entropy as much as you want, and also to pack a PE without a read/write/execute section.

The more aggressive your heuristics are, the more likely you are to detect malware, but you will also run into a lot of false positives (not all packed executables are malware).

score -1 · Answer 3 · answered Aug 29 '18 at 19:17

In my experience, machine learning systems that detect "things" (malware, in your case) need a lot of samples to learn to tell what is malware from what is not. They also generate a lot of false positives, and in the end it is better to follow a hybrid approach: scan some parts of the executable, apply some rules, then apply another type of heuristic, and then machine learning. The details depend on each case.

For example, if somebody sends you an email with an executable, what is the probability that the executable is malware? Will you risk your system? Will you do a regular scan, or compute a sha256sum and verify it with a third party? Or will you apply a machine learning strategy?
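The hybrid approach could be sketched roughly as a staged check: a hash lookup first, then hand-written rules, then a model score. Everything here is hypothetical scaffolding: `triage`, the `known_bad` set, the rule list and the `model_score` callable stand in for whatever hash feed, rule engine and classifier you actually have, and the 0.5 threshold is an arbitrary placeholder.

```python
import hashlib
from typing import Callable, Iterable, Optional, Set, Tuple

Rule = Tuple[str, Callable[[bytes], bool]]


def sha256_hex(data: bytes) -> str:
    """Hex SHA-256 digest, the same value `sha256sum` prints for a file."""
    return hashlib.sha256(data).hexdigest()


def triage(
    data: bytes,
    known_bad: Set[str],
    rules: Iterable[Rule],
    model_score: Optional[Callable[[bytes], float]] = None,
    threshold: float = 0.5,  # placeholder cutoff, to be tuned
) -> str:
    """Run cheap, precise checks first and the model last."""
    # Stage 1: exact match against hashes reported by a third party.
    if sha256_hex(data) in known_bad:
        return "malicious (known hash)"
    # Stage 2: hand-written rules, in order; the first hit wins.
    for name, rule in rules:
        if rule(data):
            return f"suspicious (rule: {name})"
    # Stage 3: machine learning score, only if a model is supplied.
    if model_score is not None and model_score(data) >= threshold:
        return "suspicious (model)"
    return "no detection"
```

Ordering the stages this way means the model only sees files that the cheaper checks could not decide, which is one way to keep its false positives from dominating the results.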
