I intend to train an RNN on snapshots of VM metrics to classify malware. To do so, I will run hundreds of different pieces of malware inside that VM. It has been isolated from my host as well as I could manage.

What would be the best way (most secure for the host and most reliable for the data collected) to gather regular snapshots, every second or so, of system metrics such as packets sent and received and the process list, from an isolated Windows 10 VM in which malware is running?

The collection method should make it difficult for malware to interfere with it and should not increase the risk of malware escaping the VM. Until now, I have been relying on VBoxManage (which is great), but it does not list processes or packets sent/received.

I am open to virtualization engines other than VirtualBox, if that helps.

  • "_The way to collect data should make it difficult for malware to interfere with_" While that should be a goal, you don't want to do anything "special" on your test system that couldn't be done on a "normal" system... assuming the idea is that the trained neural network will form the basis of an anti-malware program intended to run on normal systems. – TripeHound Jul 31 '19 at 11:09
  • @TripeHound "_assuming the idea is that the trained neural network will form the basis of an anti-malware program intended to run on normal systems_" It's not the goal of my project, which is only to train an RNN to detect malware (with more focus on the neural network part than on the data collection) but indeed, it would be better if proposed methods could be applied on any system so they would be more useful to other people in general. – Cobalt Scales Jul 31 '19 at 11:41

1 Answer

I opted for Cuckoo to automate running the software on the guest machine and wrote a custom auxiliary analysis script that collects those metrics every second. The collection is thus only as secure as Cuckoo itself, so I expect better solutions exist. At least the metrics can be streamed as they are collected, so that if communication between the host and the guest is lost, the metrics gathered before the loss are not lost with it.
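
As a rough illustration of the idea (not the actual Cuckoo script), here is a minimal sketch of a once-per-second collector that streams each snapshot as soon as it is taken. It assumes the third-party `psutil` package is installed in the guest. The names `collect_snapshot` and `stream_snapshots` and the JSON-lines format are hypothetical choices for this example, not part of Cuckoo.

```python
import json
import time

try:
    import psutil  # third-party; assumed to be installed in the guest
except ImportError:
    psutil = None


def collect_snapshot():
    """Return one snapshot: packet counters plus the process list (needs psutil)."""
    if psutil is None:
        raise RuntimeError("psutil is required for the default collector")
    net = psutil.net_io_counters()
    return {
        "time": time.time(),
        "packets_sent": net.packets_sent,
        "packets_recv": net.packets_recv,
        "processes": [p.info for p in psutil.process_iter(["pid", "name"])],
    }


def stream_snapshots(out, collect=collect_snapshot, interval=1.0,
                     count=None, sleep=time.sleep):
    """Write one JSON line per snapshot to `out`, flushing after every line.

    Flushing each line means that everything collected before a loss of
    communication has already left the guest. `count=None` runs forever.
    """
    sent = 0
    while count is None or sent < count:
        out.write(json.dumps(collect()) + "\n")
        out.flush()
        sent += 1
        if count is None or sent < count:
            sleep(interval)
    return sent
```

To send the snapshots to the host, `out` could be a text-mode file object wrapped around a connected socket, for example `sock.makefile("w")`. The collector still runs inside the guest, so malware with sufficient privileges could tamper with it.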