Python Antivirus comparing hashes

Question

I'm writing an antivirus in Python, mostly to learn and for research purposes. I understand it would be more efficient to do this in something like C, and eventually I will port it over. So far I have coded the first part of the AV, which checks VirusShare and downloads the latest hashes to a file.

From here, I'm not sure how to compare the hashes against a database so I can see which malware family each one belongs to. Is there a resource online or some API I can use? I would try VirusTotal, but since I have a free account, it can only do 4 requests per minute.

Lastly, does the AV need to hash all legitimate files on the system when scanning and then compare them to my list of malicious signatures? I plan to build on this and eventually use ML, but for now I want to keep it as simple as possible and learn as I go.

It really sounds like you first need to understand how AV works. — schroeder, Nov 21 '19 at 20:49

score 1 · Answer 1 · answered Nov 20 '19 at 02:37

I would suggest rethinking your approach.

Hashes are not virus signatures; they only serve as a fingerprint to identify files. With most conventional hash algorithms (e.g. SHA256), modifying the input slightly results in an entirely different hash output. Thus there is no correlation between malicious features in your virus files and their hashes. Attempting to train a machine learning model on file hashes would be useless; it would simply overfit the input dataset.
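To illustrate that point, here is a minimal sketch using Python's standard `hashlib`. The two byte strings are made-up sample inputs (not real malware), and the helper names are only for this example.

```python
import hashlib


def sha256_hex(data: bytes) -> str:
    """Return the SHA-256 digest of data as a hex string."""
    return hashlib.sha256(data).hexdigest()


def differing_hex_digits(a: str, b: str) -> int:
    """Count the positions at which two equal-length hex digests differ."""
    return sum(1 for x, y in zip(a, b) if x != y)


# Two made-up inputs that differ only in their last byte.
original = b"example payload v1"
modified = b"example payload v2"

h1 = sha256_hex(original)
h2 = sha256_hex(modified)

print(h1)
print(h2)
print(differing_hex_digits(h1, h2), "of", len(h1), "hex digits differ")
```

The two digests should look unrelated even though the inputs are nearly identical, which is why a hash says nothing about what a file actually does.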

If you're building your AV purely on a hash-lookup approach, your current method works fine. It may be worth finding a virus source that provides both the file hash and the malware family, which would let you avoid VirusTotal's limit of 4 requests per minute. If you intend to analyse malicious features in files, however, you will need access to the virus files themselves, not just the hashes.
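A hash-lookup scanner of that kind could be sketched roughly as follows. This assumes a hypothetical local feed in `sha256,family` CSV form; real feeds differ in layout and may use MD5 or SHA-1 instead, in which case the hash function has to match. The function names are illustrative only.

```python
import csv
import hashlib
import io
from pathlib import Path
from typing import Dict, Optional


def load_signatures(csv_text: str) -> Dict[str, str]:
    """Parse 'sha256,family' rows (hypothetical format) into a dict."""
    signatures: Dict[str, str] = {}
    for row in csv.reader(io.StringIO(csv_text)):
        # Skip blank lines, malformed rows, and '#' comment lines.
        if len(row) >= 2 and not row[0].startswith("#"):
            signatures[row[0].strip().lower()] = row[1].strip()
    return signatures


def sha256_of_file(path: Path, chunk_size: int = 65536) -> str:
    """Hash a file in chunks so large files are not read into memory at once."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def lookup_family(path: Path, signatures: Dict[str, str]) -> Optional[str]:
    """Return the malware family for a file, or None if its hash is unknown."""
    return signatures.get(sha256_of_file(path))
```

Keeping the signatures in a dict makes each lookup a constant-time operation, so the cost of a scan is dominated by reading and hashing the files rather than by the comparison.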

score 0 · Answer 2 · edited Nov 21 '19 at 21:53

Is there a resource online or some API I can use?

If you want good results, I would suggest subscribing to VirusTotal. But there are also a lot of alternatives to VirusTotal.

Lastly does the AV need to hash all legitimate files on the system when scanning then compare that to my list of malicious signatures?

There is no need to hash all the files; you only need a list of malicious signatures. But this method is easily bypassed by attackers.

Additional Recommendation

Most antivirus and EDR products now use hooking, which gives them the ability to monitor and control activity across the entire system. For more information, see: https://en.wikipedia.org/wiki/Hooking
