Most anti-virus software (to my knowledge) uses a kind of brute-force method where it computes the hash of a file and compares it against thousands of known virus hashes. Is it just that they have servers with super-fast SSDs that they upload the hashes to and search really fast, or am I completely wrong in my understanding of how they work?
-
Why do you think looking up a hash in a list of thousands is slow? With the correct data structure, a modern computer can do such searches millions of times per second. A list of thousands of hashes easily fits in memory, no fast SSD needed. – marcelm Nov 12 '17 at 12:07
-
Hash lookup is on average *O(1)* (only on very rare occasions does it degrade to *O(n)*). Nevertheless the point of a hash is to perform *literal comparisons*. So a hacker could add two pointless bytes to the virus and easily make 65,536 versions of the file by assigning all possible values to those two bytes. – Willem Van Onsem Nov 12 '17 at 12:12
-
AV software is still one of the biggest performance drags in PCs :) So fast is relative here... – rackandboneman Nov 13 '17 at 00:20
-
You may also be interested in [this](https://en.wikipedia.org/wiki/Aho%E2%80%93Corasick_algorithm) as an example of how you can search for many substrings in a file all at once. – biziclop Nov 13 '17 at 11:41
-
Aho–Corasick is a multi-regex that can pattern-match on many thousands of patterns simultaneously; a state machine that emits multiple events. With a first pass of a regex, files get selected for deeper inspection. I know that's how we did firewalling and intrusion detection at wire speed at Check Point, which also had an AV division. – Rob Nov 14 '17 at 02:42
-
If you want to test hashes for set membership in really huge sets, use a Bloom filter. Though I would imagine that Aho–Corasick is more relevant to the main AV detection. A Bloom filter (real-world example) would be a set like: every hostname associated with divorce lawyers. It's an opaque bit structure that hides what is in it, and can give false positives, particularly as it gets really full. – Rob Nov 14 '17 at 02:45
-
The lookup of hashes is not the problem; calculating them is, much more so. But pure signature-based AV is not a good solution anymore anyway. (And the more advanced methods are even more demanding.) – eckes Nov 14 '17 at 09:38
-
They're not. They hide their performance penalties in other processes. – Joshua Nov 14 '17 at 18:56
-
You're assuming it's fast; in my experience that's not the case. It's just that most of the time they look at the file and quickly figure out that it's a file they have already checked. That's where the hashes come in. I also suspect they have a table of hashes for common software. It's nowhere near as fast when it hits something that hasn't already been cleared. – Loren Pechtel Nov 14 '17 at 21:30
3 Answers
Disclosure: I work for an anti-virus vendor.
Because most anti-virus engines were born to protect endpoints, and endpoint protection is still a major part of the business for many of them, modern anti-virus engines are optimized for scanning endpoints. This optimization includes many things, such as:
- Not scanning files which couldn't contain infections capable of infecting your computer;
- Remembering which files were scanned, and not scanning them again unless they were modified;
- Optimizing scans for the file type when possible - for example, when scanning executables, only certain parts of them need to be scanned. This minimizes disk reads and improves performance.
A common misconception is that AV engines use hashes. They generally do not, for three reasons:
- First, getting around hash detection is very easy and doesn't require modifying the actual malicious code at all (see the sketch after this list);
- Second, using hashes does not allow you to implement any kind of proactive protection - you will only be able to detect malware you have already seen;
- Third, calculating a hash requires the whole file to be read, while for some files (such as executables) this is not necessary. Reading a whole file from a non-SSD hard drive is an expensive operation - most AV engines can scan a large clean executable faster than they could hash it.
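To make the first point concrete, here is a minimal Python sketch (the byte string standing in for an executable, and the choice of SHA-256, are purely illustrative): appending a single inert byte, which most file formats tolerate, yields a completely different digest.

```python
import hashlib

# A made-up stand-in for a malicious executable's contents.
original = b"MZ\x90\x00 ...pretend this is a malicious executable..."
tampered = original + b"\x00"  # one inert padding byte appended

print(hashlib.sha256(original).hexdigest())
print(hashlib.sha256(tampered).hexdigest())
# The two digests are completely different, yet the program's behaviour
# is unchanged - a pure hash blacklist no longer recognizes the sample.
```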
-
A random number put in the middle of a message in an executable when that file is stored makes each file have a different hash. – Skaperen Nov 12 '17 at 08:41
-
I've heard many times about this "hash detection". Thanks for confirming it is nonsense. – Zanon Nov 12 '17 at 13:43
-
I don't think this answer fully addresses the main question of _how_ the performance is achieved. In particular, "most AV engines should scan a large clean executable file faster than calculating hash on it" seems to merely perpetuate the OP's question. How exactly is it possible to scan a large file faster than computing a hash/checksum from it? Both operations would appear to be `O(n)`. – aroth Nov 13 '17 at 13:48
-
@aroth Because "for example when scanning executables, only certain parts of it need to be scanned. This minimizes disk reads, and improves performance." – David Mulder Nov 13 '17 at 14:30
-
The question remains how you detect that a file on disk has changed since the last scan. It is trivial to reset the file modification date, so that is not an indicator; and a sophisticated infection could conceivably leave the file size unchanged. That leaves only a hash as a criterion. – Peter - Reinstate Monica Nov 14 '17 at 08:07
-
@PeterA.Schneider Aren't write operations monitorable, for lack of a better word? – timuzhti Nov 14 '17 at 08:24
-
Another reason hashing the file would not be suitable is that it would be trivially easy to subvert, as most file types, including executables, will tolerate adding random bytes at the end of the file with no change in how the file works. Or at the start of the file, for ZIP/JAR/etc. – thomasrutter Nov 14 '17 at 23:12
-
@PeterA.Schneider it is true that a file could be modified without changing its size and with its modification time set back. Virus scanners ought to re-scan files at regular intervals (even if not every time). Not the whole executable would need to be scanned. – thomasrutter Nov 14 '17 at 23:15
-
@PeterA.Schneider there are many possible ways. For example, since most anti-viruses already include the file system driver intercepting file operations, such driver may simply reset the flag in the internal database of already scanned files. – George Y. Nov 15 '17 at 02:50
-
You say that reading the file from an HDD would be an expensive operation. But how do you scan an executable without reading it? – Tomáš Zato - Reinstate Monica Nov 15 '17 at 11:55
-
@GeorgeY. So if a user deactivates their AV application for a short period, does the AV determine that all files need to be rescanned (or checked by hash???)... or does it actually leave some minimal form of monitoring active... or does it end up undermining the long-term security leaving most AV to believe files should still be the same when they no longer may be? – JeopardyTempest Nov 15 '17 at 14:19
-
@aroth Two operations might both be O(n), but one might still be 100 times faster than the other. – Michael Kay Nov 16 '17 at 21:12
-
why is it necessary to have a malware scanner like Malware Bytes? why don't AV vendors just include that in their virus scanning? or do they? And why is a separate anti-ransomware necessary? – ycomp Nov 17 '17 at 03:30
-
@ycomp it is not necessary. All major AV vendors detect adware, spyware and other malware - not just viruses. And separate anti-ransomware is not necessary either. – George Y. Nov 17 '17 at 04:01
-
@GeorgeY. but what happens when the AV vendor has a separate anti-ransomware tool (I use bitdefender av)? do they cripple the free av in regards to ransomware detection then? – ycomp Nov 17 '17 at 04:03
-
@TomášZato you can scan most clean executables without reading the **whole** file from disk. Obviously you need to read **some** data. – George Y. Nov 17 '17 at 04:04
-
@ycomp There's a difference between detecting ransomware binaries (this is what the AV engine does), and preventing ransomware from encrypting files (this may be an additional component). – George Y. Nov 17 '17 at 04:06
-
so does removing ransomware fall into the AV sphere? or the anti-ransomware software sphere generally? not preventing encrypting, just removing it once detected before it can even try to encrypt – ycomp Nov 17 '17 at 04:08
-
@aroth This might be one of those cases where big-O notation hides important constant factors... – trognanders Nov 17 '17 at 04:33
-
Since a couple of people have asked about scanning executables vs checking a hash on them: To compute the hash on a file, you have to read *the entire file*; however, proper executable scanning does not need to read the whole file, just key parts. If you have a 100+MB executable, that can mean the difference between reading 2MB of it or reading 100+MB -- multiply that by however many large executables happen to live on a system, and you should be able to see where the performance improvement comes from. – Doktor J Nov 22 '17 at 14:56
> Most anti-virus software (to my knowledge) uses a kind of brute-force method where it computes the hash of the file and compares it to thousands of known virus hashes. Is it just that they have servers with super-fast SSDs that they upload the hashes to and search really fast, or am I completely wrong in my understanding of how they work?
I think you're considering those online "antiviruses" which only do hashing, and which are quite poor compared to the (much more efficient and advanced) heuristic- and rule-based antiviruses SmokeDispenser mentioned.
Hashing is a good choice for the one use case where the same virus file is sent to a lot of people, and the AV engine you use is updated so often that the hash will be received and stored before you encounter the virus. In that one case you can, for example, immediately flag an email attachment as virus/phishing/ransomware/trojan etc., and you can do so with minimal computational expense. If you have to read the whole file anyway and you have a good connection (both true when, e.g., you're a mail server), then the additional I/O expense is negligible.
Since this is a minority of all use cases, (good) antiviruses don't (just) use hashes.
However, checking for a hash is really very, very fast. You don't need SSDs or any sophisticated hardware at all.
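As a rough illustration (a minimal Python sketch with made-up hashes; a real engine would use a compact native structure rather than Python strings), an in-memory set of a million digests answers membership queries in constant time on ordinary hardware:

```python
import hashlib
import os

# One million made-up "known bad" digests, held comfortably in RAM.
known_bad = {hashlib.sha256(os.urandom(16)).hexdigest()
             for _ in range(1_000_000)}

def is_known_bad(file_hash: str) -> bool:
    return file_hash in known_bad  # average O(1) membership test

sample = hashlib.sha256(b"contents of some file").hexdigest()
print(is_known_bad(sample))  # almost certainly False
```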
Imagine (even if it does not work exactly that way) that you have a hexadecimal hash such as 'baadc0debaadbaad' and want to verify whether it is present or not in an archive holding one hundred billion of them.
When you created the archive, you created two hundred fifty-six directories on a hard disk and named them from '00' to 'ff' (0 to 255 in decimal). In every one of them, you created 256 directories, again named from '00' to 'ff'. In each of those you put 256 directories again, and in each third-level directory of the form 'ab/cd/ef' you placed all the hashes beginning with 'abcdef', divided into 256 files from '00.txt' to 'ff.txt'.
So the hash 'baadc0debaadbaad' ends up in a file "C:/ba/ad/c0/de.txt" which reads:
```
...
baadc0de81f872ac Ebola virus
baadc0debaadbaad Dark Avenger virus
baadc0debf31fe11 known clean file
...
```
How many hashes are in that file? If hashes are evenly distributed, just divide one hundred billion by the number of files. There are 256 folders with 256 subfolders with 256 subfolders with 256 files, so the number of files is 256*256*256*256 = 4294967296.
One hundred billion divided by that number is around twenty-four. That's how many hashes (on average) are expected to be in that one file.
So if you were given a hash, and asked to find it between those hundred billion hashes by hand, you would just need to click on the appropriate first-level folder, click on the second, click on the third, open the appropriate file and read twenty-four hashes. You'd be done probably in less than thirty seconds, having "apparently" examined three billion hashes per second. By hand. By yourself. It seems impossible, it's just good data organisation.
A computer can do the same thing even more efficiently.
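Here is a minimal Python sketch of exactly that toy archive (the layout and file format are the invented ones from the example above, not anything a real product does):

```python
def hash_to_path(hex_hash):
    """Map a hash to its bucket file in the toy archive described above."""
    return f"{hex_hash[0:2]}/{hex_hash[2:4]}/{hex_hash[4:6]}/{hex_hash[6:8]}.txt"

def lookup(hex_hash):
    """Open one small bucket file and scan its ~24 lines."""
    try:
        with open(hash_to_path(hex_hash)) as bucket:
            for line in bucket:
                stored, _, label = line.strip().partition(" ")
                if stored == hex_hash:
                    return label
    except FileNotFoundError:
        pass
    return None

print(hash_to_path("baadc0debaadbaad"))  # ba/ad/c0/de.txt
```

No matter how many billions of hashes the archive holds, a lookup touches exactly one small file.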
Rule-based scan
A more complicated case is how to scan for a known and analysed virus inside a file of which you know (initially) nothing.
In this case what comes to help is the fact that the operating system must execute the virus, so the virus can't be just anywhere - it must be where the operating system will actively look for it. Typically, a virus will infect an executable file by hiding somewhere (e.g. in a data area) and replacing the beginning of the executable area with a call to itself. This means that in most cases you need only scan the executable area and ensure it's clean (of course, viruses pop up all the time with clever ways of hiding in unlikely places, so reality is a bit more complicated than what I just said).
The antivirus looks in the same places, and in this case performs a rule-based analysis on the file. This is similar to the hash search: you check the first byte you encounter in the sensitive area, and based on its value you look at another byte somewhere else. The algorithm doing this kind of operation is sometimes called an automaton. For example, to see whether a text string is either "Hello, World!" or "Attack at dawn", you would check the first letter; if it were an 'H', you would ignore all the checks for "Attack at dawn".
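A toy version of that branching idea, using the two strings from the text (a nested-dict trie; real engines compile far larger state machines over bytes, not characters):

```python
patterns = ["Hello, World!", "Attack at dawn"]

# Build a trie: each node maps the next character to a deeper node.
trie = {}
for p in patterns:
    node = trie
    for ch in p:
        node = node.setdefault(ch, {})
    node["#end"] = p  # a complete pattern terminates here

def match(text):
    """Walk the trie once; a mismatch prunes every remaining pattern."""
    node = trie
    for ch in text:
        if ch not in node:
            return None
        node = node[ch]
    return node.get("#end")

print(match("Hello, World!"))   # 'Hello, World!'
print(match("Attack at dawn"))  # 'Attack at dawn'
print(match("Hello, Worlds"))   # None
```

If the first character is 'H', the entire "Attack at dawn" branch is never even looked at.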
With programs, this is made way more complicated by polymorphism and self-encryption - the fact that two different pieces of code (say, 2+2 and 2*2) might yield the same result, so you need more checks ("If the first word is Hello, Howdy, or Good...") in order to establish that you've found a variation of the dreaded virus Salutations, Cosmos!. For self-encrypting code you would probably either need to find an algorithm to break the encryption (if that is feasible in very little time) or concentrate on recognizing the self-decryption routine, without which the virus would not be able to make itself executable by the system.
However, with a careful selection of the choice tree, you will be able to quickly converge either to a single possibility (the virus name) or to the confidence that no known virus is present.
Heuristics
This is the most complicated (and slowest) kind of analysis, but also the most powerful, because it allows you to strongly suspect the presence of a hitherto unknown virus. You can think of heuristics as lots of small rule-based recognizers; being small grants a speed advantage that offsets their being many. Every recognizer identifies a piece of code performing an action, such as decrypting code, attempting to gain privileges on the system, opening and writing to files, hooking system calls, rewriting code in memory, infiltrating the operating system and so on. Many legitimate programs do one or more of these things, or appear to (for example, decryption and decompression look very similar to one another; also, some programs are encrypted to protect intellectual property, not to hide malicious code). For each operation you establish a "score", and executable code above a certain threshold may be regarded as suspicious.
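A deliberately over-simplified sketch of the scoring idea (the recognizer names, weights, byte-string tests and threshold are all invented; real recognizers analyze code semantics, not literal substrings):

```python
# Each hypothetical recognizer inspects the sample and fires or not.
RULES = [
    ("decrypts own code",       40, lambda s: b"self_decrypt_stub" in s),
    ("hooks system calls",      30, lambda s: b"patch_syscall_table" in s),
    ("rewrites code in memory", 25, lambda s: b"VirtualProtect" in s),
]
THRESHOLD = 50

def heuristic_score(sample: bytes):
    score, hits = 0, []
    for name, weight, fires in RULES:
        if fires(sample):
            score += weight
            hits.append(name)
    return score, score >= THRESHOLD, hits

score, suspicious, hits = heuristic_score(
    b"...self_decrypt_stub...VirtualProtect...")
print(score, suspicious, hits)
# 65 True ['decrypts own code', 'rewrites code in memory']
```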
The exact details of the heuristic engines are of course much more complicated than this; they need to balance speed, reliability (when they say it's suspicious, it has to be suspicious) and sensitivity (when it is suspicious, they need to be able to say it is suspicious).
Actually, they're complicated enough that they're usually closely guarded secrets of antivirus companies, and often touted as 'Intelligent' all the way to 'Artificial Intelligences'.
The need to continuously update the rules (it makes little sense to keep looking for a virus that went extinct ten years ago) and, less urgently but no less importantly, to tune the heuristics, is the reason why the frequency of updates is an essential measure of an antivirus' quality (even if, up to a point, an outstanding heuristic engine might allow less frequent updates, and an impossibly perfect heuristic engine would need none at all).
Optimizations
If you are an antivirus, and are able to load into the operating system kernel to such an extent as to intercept any disk-write operation, and you know for sure that there is no way of booting the operating system without activating your code (i.e. you can't be blue-pilled), then it is true that no file on disk can be modified without your knowledge. At that point, you can keep track of which files have been scanned and found clean, and only scan files that were modified since the last scan, which is a really small fraction of the total (you might even ignore disk writes performed by non-infected processes, as long as you allow for the ways in which a virus written in, e.g., JavaScript could ask another process such as WSCRIPT to write to disk on its behalf).
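A minimal sketch of that bookkeeping, assuming a hypothetical intercept layer that reliably calls `on_file_written` for every write (which, as said above, is the hard part):

```python
clean_files = set()  # paths scanned and found clean since their last write

def on_file_written(path):
    """Called by the (hypothetical) filesystem-intercept driver on each write."""
    clean_files.discard(path)  # the cached verdict is no longer trustworthy

def needs_scan(path):
    return path not in clean_files

def mark_clean(path):
    clean_files.add(path)

mark_clean("C:/apps/big_program.exe")
print(needs_scan("C:/apps/big_program.exe"))  # False - skip, nothing changed
on_file_written("C:/apps/big_program.exe")
print(needs_scan("C:/apps/big_program.exe"))  # True - must rescan
```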
-
In your explanation about 256 directories and 256 subdirectories, etc., etc....what type of data structure is this? For instance, it's not a binary tree, but it is a tree, right? Is it just a directory tree? Excellent answer. I'm surprised it didn't spark more discussion. Thanks for taking the time to provide such quality content. – harperville Apr 12 '19 at 02:05
-
@harperville it is an N-tree, yes. Here, of course, N=256. A binary tree has N=2. You could do this using directories (if there weren't a directory length limit), using the hash in binary. *Then* the directory *also* is implemented as a tree - I think usually a B-tree, which is an N-tree with variable N. If you only populated the trees where there is some hash, you'd end up with a B-tree. And if the directory *was* indeed implemented as a B-tree itself and there weren't other inefficiencies involved (**there are!**), then you could keep all the files in a single directory for faster access. – LSerni Apr 12 '19 at 09:16
While anti-virus software might upload samples for analysis (see below), this is not how detection works.
Also, signature-based systems are becoming less common (as they can be easily evaded), and they do not work the way you might expect.
Signatures are built over a specific region or parts of a binary. If the location is not known, the signature window is slid across the binary (roughly as sketched below). So this is more complex, and generates far more intermediate comparisons than you might think.
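A naive Python sketch of that sliding window, one pattern at a time (as noted in the comments above, real engines use multi-pattern algorithms such as Aho–Corasick instead of one pass per signature):

```python
def slide_signature(binary: bytes, signature: bytes):
    """Yield every offset at which the signature occurs in the binary."""
    width = len(signature)
    for offset in range(len(binary) - width + 1):
        if binary[offset:offset + width] == signature:
            yield offset

data = b"....\xde\xad\xbe\xef....\xde\xad\xbe\xef.."
print(list(slide_signature(data, b"\xde\xad\xbe\xef")))  # [4, 12]
```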
Nonetheless, searching for those signatures is not slow, but also not exactly fast; there is often a noticeable delay when antivirus is in use.
Still, doing so does not require enormous amounts of disk space or RAM (though it needs some) - there is a lot more going on when a 4K video is decoded for display.
Besides signatures, there are also heuristics and behavior analysis: the binary in question gets sandboxed and its behavior is analyzed. If the actions taken are suspicious, such binaries may be blocked - those are the times the samples may get sent to the AV vendor for further analysis.
To conclude:
It’s not that fast. But the operations are not that complicated that they would need servers and couldn’t be done by a workstation on its own.
Such a server infrastructure would have to be huge, by the way. If the work were too hard for a workstation, a strong server could maybe serve a few customers, leaving a vendor with maybe 300 million copies having to pay for and operate several million servers.
-
The only point I'd quibble over is the endpoint:server resource ratio of 2 or 3:1... Whether you're talking mobiles, laptops or even business desktops, the ratio should be considerably higher. The stock compute server we use at work is 96 cores and 1TB of RAM with 12 SSDs and 24 spinning disks. That would be equivalent to dozens to hundreds of endpoints. – Basic Nov 12 '17 at 18:38
-
@Basic That totally depends on the type of server used, that is right. I will edit that paragraph to give it more slack. – Tobi Nary Nov 12 '17 at 18:43