Are the features that can be extracted from a PE file (header fields, section names, strings, imports, exports, etc.) sufficient to train a machine learning model that decides whether a suspicious file is malicious?
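To make the question concrete, this is roughly the kind of feature I mean. It is only a minimal sketch using the Python standard library: the function names (`printable_strings`, `byte_entropy`, `static_features`) and the chosen features are my own illustration, and it reads raw bytes rather than properly parsing the PE headers, imports, or exports.

```python
import math
import re
from collections import Counter


def printable_strings(data, min_len=4):
    """Return runs of printable ASCII that are at least min_len bytes long."""
    pattern = rb"[\x20-\x7e]{%d,}" % min_len
    return [m.decode("ascii") for m in re.findall(pattern, data)]


def byte_entropy(data):
    """Shannon entropy in bits per byte (0.0 for empty input)."""
    if not data:
        return 0.0
    total = len(data)
    counts = Counter(data)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def static_features(data):
    """Build a small feature dictionary from the raw bytes of a file."""
    strings = printable_strings(data)
    return {
        "size": len(data),
        "entropy": byte_entropy(data),
        "num_strings": len(strings),
        # Every PE file starts with the DOS header magic "MZ".
        "has_mz_magic": data[:2] == b"MZ",
    }
```

A real extractor would parse the PE structures themselves to get section names and import/export tables; the dictionary above only stands in for that feature vector.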
I have come across two different opinions:
The first opinion states that these features are enough to create a basic detection system. Additionally, behavioral attributes (e.g. API calls) may be included to increase efficiency and accuracy.
The second opinion states that these attributes are useless in many cases, because many of them are redundant and redundant features can hurt the quality of a model.
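To illustrate what I understand by redundancy, here is a minimal sketch of one common way to prune it: drop any feature that is almost perfectly correlated with a feature already kept. The helper names (`pearson`, `drop_redundant`), the 0.95 threshold, and the example column names are assumptions of mine, not something either opinion prescribed.

```python
import math


def pearson(xs, ys):
    """Pearson correlation of two equal-length numeric lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0  # constant column: treated as uncorrelated here
    return cov / math.sqrt(vx * vy)


def drop_redundant(columns, threshold=0.95):
    """Keep a feature only if it is not highly correlated with a kept one.

    columns maps a feature name to its list of values, one per sample.
    Features are visited in insertion order, so earlier ones win.
    """
    kept = []
    for name, values in columns.items():
        if all(abs(pearson(values, columns[k])) < threshold for k in kept):
            kept.append(name)
    return kept


# Hypothetical feature columns for four samples.
columns = {
    "num_sections": [3, 5, 4, 8],
    "num_sections_x2": [6, 10, 8, 16],  # carries no new information
    "entropy": [7.1, 4.2, 6.6, 5.0],
}
kept = drop_redundant(columns)
```

If pruning like this is enough to deal with the redundancy, the second opinion would seem to be an argument for feature selection rather than against static features as such, but I am not sure that is the whole story.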
I'm also wondering whether it is possible to detect that one malware sample is similar to another and infer that one is a variant of the other. Is this kind of information useful in malware detection?
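As an example of what I mean by similarity, here is a minimal sketch that compares two samples with the Jaccard index, either over their sets of imported function names or over their byte n-grams. The function names, the choice of n = 4, and the import lists are hypothetical; I am not claiming this is how existing tools measure similarity.

```python
def byte_ngrams(data, n=4):
    """Return the set of all n-byte substrings of data."""
    return {data[i:i + n] for i in range(len(data) - n + 1)}


def jaccard(a, b):
    """Jaccard index |a & b| / |a | b| of two sets (1.0 if both are empty)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


# Hypothetical import sets of two samples.
imports_a = {"CreateFileA", "WriteFile", "VirtualAlloc", "CreateRemoteThread"}
imports_b = {"CreateFileA", "WriteFile", "VirtualAlloc", "WinExec"}

# 3 shared names out of 5 distinct names in total.
similarity = jaccard(imports_a, imports_b)
```

A score close to 1 would suggest that one sample is a variant of the other. Where to put the cutoff, and whether such a score is useful as an extra feature for detection, is exactly what I am unsure about.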