Why is Google so much faster than a hard-drive search?

252

71

When I search for a file on my HD in Windows 7 or Windows XP, it takes some minutes to finish the process. If I fill in a search term in Google, the answer is on my screen in milliseconds.

How is it possible for Google to search the Internet, which is many times larger than my hard drive, faster than my OS can search my computer? Is it only a matter of computing power and the right algorithm?

Arne

Posted 2013-04-03T18:44:59.993

Reputation: 1 667

99

Have you tried indexing all the files on your drive and searching only the index? Try Everything and see.

– Karan – 2013-04-03T18:53:44.503

11

Google Desktop "used" to do that for Windows also... – rogerdpack – 2013-04-03T20:39:16.420

14

Google searches through indices stored in RAM, not through files on a hard drive. – Ari – 2013-04-04T01:12:07.987

13

The index is important, but Google also uses a map-reduce algorithm to conduct a massively parallel set of operations. No matter how many cores you have in your computer, I guarantee Google has more. – Adam Wuerl – 2013-04-04T03:05:20.333

41

There's nothing precluding a desktop search implementation from using indexing. However, remember that Google has enough cash for a) lots of very fast CPUs/servers to parallelise a query; b) lots of very fast RAM to avoid having to access a disk ever; c) lots of hard drives much faster than the one you use; d) lots of very smart engineers to optimize the algorithms involved (e.g. caching the results for (a lot of) frequently used queries, and much much more). It's not "only" a question of any one of these; it's all of these acting in concert. – millimoose – 2013-04-04T03:14:00.783

No need to turn on Windows Index search. Use the third-party tool Everything search (http://www.voidtools.com/); this will let you search faster.

– mr_eclair – 2013-04-04T03:41:06.490

@Karan: While we're on the topic I'll also mention my own program which is similar to Everything. :)

– user541686 – 2013-04-04T05:33:21.093

1

If you spent a billion dollars or so making your search as fast as possible ... – David Schwartz – 2013-04-04T06:42:56.280

Arne, @Karan has answered you correctly in his comment; I hope he can turn it into an answer with more details. That is how Google search works. For more, read this Wiki article.

– avirk – 2013-04-04T08:14:04.027

@DavidSchwartz Google has always been very fast. Indeed, their lack of money in early life is a large contributor, as it forced them to figure out a way of parallelising queries across many machines of varying quality. – Phoshi – 2013-04-04T11:08:12.987

@Antoine Simon's, the one that focuses on indexing above everything else, and leaves the rest to links. It's not wrong as much as painting a woefully incomplete picture of what makes a Google search responsive. – millimoose – 2013-04-04T12:45:46.503

Click the link below to understand how the Google search engine works so fast: How Search Works

– UdayKiran Pulipati – 2013-04-04T07:00:04.590

3

@JarrodRoberson: Since Vista, Windows has its own search indexer, which is turned on by default for the personal data folder and works exactly as Spotlight without eating up resources as you claim. By the way, the question is much broader and deeper than your OSX vs. Windows chitchat. – Pincopallino – 2013-04-04T14:08:40.433

3

The point is that with web content a lot of people are doing the same searches, making it worthwhile to do the work in advance. There is only one of you searching your local computer, so putting a lot of resources into indexes is not worth the work. – JamesRyan – 2013-04-04T14:36:49.223

On my Linux box, "locate filename" is insanely fast, because it is using an index too. Somehow it seems that neither Apple's nor Microsoft's GUI search tools on their desktop OSes can touch my Linux box's command-line locate speeds. – Warren P – 2013-04-05T02:46:18.170

The difference between an indexed and an unindexed search is the difference between an O(1) and an O(n) operation. – Lie Ryan – 2013-04-06T07:57:07.573

@WarrenP there is a small difference between locate and OSX and Windows search utilities. locate doesn't index the file content, but just the file name. – Pincopallino – 2013-04-07T16:59:25.253

1

Am I the only one who wishes that inspection of file content was a secondary step in search? I seldom want to search for file content, and would like to opt into it. That my Mac laptop searches first for file contents, and I have to tell it otherwise, seems a dumb default. But then, I'm a geek, not a typical Mac user. – Warren P – 2013-04-07T20:30:17.637

@LieRyan: Indexing isn't O(1)... – user541686 – 2013-04-08T06:58:52.103

@Mehrdad: creating an index isn't O(1), but searching an index is O(1), because it's essentially just a huge hash table. Search engine algorithms usually aren't O(1), though, because most search queries are composed of multiple words and merging/reducing the results is not O(1); sorting the results is not O(1) either. Fuzzy/phonetic/misspelt search can be made O(1), though, at the cost of a larger index size. – Lie Ryan – 2013-04-10T09:11:46.347

@LieRyan: What kind of a hashtable are you thinking of that lets you match arbitrary substrings? – user541686 – 2013-04-10T09:13:51.717

@Mehrdad: no internet search engines let you match arbitrary substrings. – Lie Ryan – 2013-04-10T09:14:23.627

@LieRyan: I thought we're talking about hard drive searches... your reply was right after the Linux comment. – user541686 – 2013-04-10T09:23:33.177

To achieve Google-like search speeds on your desktop, install "Search Everything". It indexes your hard drive and lets you find everything (hence the name) in an instant. – Matthias – 2013-04-12T10:24:03.703

@Karan - is that Everything tool safe? It looks like an obscure site and an obscure tool to me. – david blaine – 2013-05-07T18:11:34.940

1

@davidblaine: It's not open source (yet, although the author has hinted at it) so I can't guarantee anything. That said, it's probably one of the best known indexing utilities for Windows and I personally have been using it on my systems for years. If you could just see the number of deleted answers below where people did nothing except suggest using it like I did... If you want my personal recommendation, as long as you know its limitations (NTFS only, no content indexing etc., see FAQ for more), I think it is great at what it does and I simply love how fast it is! – Karan – 2013-05-07T21:05:00.843

Because a significant portion of the state of Oklahoma is covered with Google data servers. – Daniel R Hicks – 2013-05-18T02:40:45.890

Answers

212

Google is not searching the internet: it is searching an index. Google has huge server farms which are constantly scanning and indexing the internet. This process takes a lot of time, just like the search of your unindexed hard drive. In Windows 7, there is an option to index your hard drives. This process takes some time at first but once it is up and running the results of a search will be instantaneous.

If you want to know more about how the Google search works you can read Google's article "How Search Works" or read the article "How Stuff Works: How Google Works".
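
To make the difference concrete, here is a minimal Python sketch of searching with and without an index. It is only an illustration of the general idea, not how Google or Windows Search is actually implemented; the file names and contents in `DOCUMENTS` are made up, and the function names are hypothetical.

```python
from collections import defaultdict

# Hypothetical mini "drive": file name -> file contents (made up for illustration).
DOCUMENTS = {
    "notes.txt": "google searches an index",
    "todo.txt": "index the hard drive",
    "cars.txt": "a fast car",
}

def scan_search(documents, word):
    """Unindexed search: read every document again on every query."""
    return sorted(name for name, text in documents.items() if word in text.split())

def build_index(documents):
    """Slow, one-off step: map each word to the set of documents containing it."""
    index = defaultdict(set)
    for name, text in documents.items():
        for word in text.split():
            index[word].add(name)
    return index

def index_search(index, word):
    """Indexed search: a single dictionary lookup; no document is read."""
    return sorted(index.get(word, set()))

INDEX = build_index(DOCUMENTS)          # pay the cost once, up front
print(index_search(INDEX, "index"))     # same answer as scan_search(DOCUMENTS, "index")
```

Both functions return the same answer; the difference is that `scan_search` touches every document per query, while `index_search` only touches the structure that `build_index` prepared in advance.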

Simon

Posted 2013-04-03T18:44:59.993

Reputation: 3 831

46

Last paragraph: this link is much more authoritative and overall better.

– ulidtko – 2013-04-03T21:03:59.003

@ulidtko thanks, I added the link as well to the answer. – Simon – 2013-04-03T21:07:55.607

Also, Google uses stacked PCs: the indexed data is duplicated across these commodity PC server farms, and huge numbers of these PCs serve the search requests simultaneously. – vinodpthmn – 2013-04-04T05:03:27.287

"In Windows 7 there is as well an option to index your hard drives that process takes some time at first but once it is up and running the results of a search will as well be instantaneous." Where can I find this option?

– Piccolo – 2013-04-04T05:59:30.603

4

Pardon my curiosity, but don't file systems already index the files on the disk? Isn't what you see in your file explorer a mere index of links to the actual physical sectors on the disk? Why do we, then, need to do even more indexing?

– Adi – 2013-04-04T07:30:40.013

@HobbitHole It is a service called Windows Search, which should be running by default. When you search in an unindexed location, Windows will ask if you want to add it to the index. There are some advanced options that can be changed to tweak the search.

– Simon – 2013-04-04T08:36:06.880

9

@Adnan the file system's index is designed to find the position where a file is stored on the physical medium. It is like the index of a book that tells you on which page a chapter starts. A search index is designed to find content. A good search index indexes not only a file's name but also the content of known file types like pdf, doc, html, ... Advanced indexes also use synonyms, so if you search for "car" it might also find results with the word "automobile". – Simon – 2013-04-04T08:47:46.873

@Simon The situation in the OP's question (and mine as well) is searching for files on the disk, not searching in the files on disk. When I try to find a file named car.jpg it takes more than 15 seconds to find it, isn't the file explorer just searching the FS's index? If I have a list of the books (and their locations, but that's irrelevant) in the library, isn't that basically an index for the books' names? – Adi – 2013-04-04T09:19:09.507

3

@Adnan, a file system isn't really an "index", just a tree of file names. Searching such a tree isn't fast, because its structure isn't optimized for searching. OTOH Google (and databases) use specific sorted index structures which make looking up a particular entry lightning fast. Even then, not all searches can benefit from such an index and will be slow(er). – PiRX – 2013-04-04T10:36:26.333

@PiRX Thank you very much. I guess all I needed was the first 2 sentences. My assumption was that the FS tree is good for searching. Now I fully get it. – Adi – 2013-04-04T11:28:55.213

8

@Adnan In a sense, the FS Tree is optimised against searching. It's designed to allow addressing of known locations. From your root node, all you get is a list of directories and files under root. Every directory just knows about the files in it, and the directories below it. Accessing a known filepath is very fast under this, and it offers a lot of flexibility, but there does not exist a global listing of files to search through. You must always descend through the directory tree, and that makes for a lot of distinct lookups. – Phoshi – 2013-04-04T15:28:58.207

@Phoshi That's even chocolate sprinkles on top of the ice-creamy explanation PiRX provided. The picture is getting clearer and clearer for me. Thank you guys for being awesome. – Adi – 2013-04-04T15:35:06.267

71

Google is like searching the yellow pages for an address (indexed). Windows search is akin to driving around checking numbers on buildings (non-indexed).

Another analogy would be looking through a well organized library and card catalog, or just sorting through an unorganized pile of books every time.

Fundamentally it's all the organizational work done prior to the search that makes it fast.

FYI: When searching indexed locations, Windows Search can be just as responsive.
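
The "organizational work done prior to the search" can be sketched for file names too. This is a rough Python illustration, assuming you only care about file names (not contents); the function names are made up, and real tools such as Windows Search work differently under the hood.

```python
import os
from collections import defaultdict

def walk_search(root, filename):
    """Non-indexed search: descend through every directory on every query."""
    return sorted(
        os.path.join(dirpath, name)
        for dirpath, _dirnames, filenames in os.walk(root)
        for name in filenames
        if name == filename
    )

def build_name_index(root):
    """One-off crawl: map each file name to every path where it occurs."""
    index = defaultdict(list)
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            index[name].append(os.path.join(dirpath, name))
    return index

def indexed_search(index, filename):
    """Indexed search: one dictionary lookup, no directory is visited."""
    return sorted(index.get(filename, []))
```

You would call `build_name_index` once (the slow "driving around" part) and then answer each query with `indexed_search`. The catch is that the index goes stale when files change, so it has to be refreshed in the background.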

Ryan

Posted 2013-04-03T18:44:59.993

Reputation: 744

5

Or: Scanning a textbook vs looking into a (detailed) table of contents – bobobobo – 2013-04-04T23:12:03.757

36

Google's business is search (and serving up ads) and it's very focused on that. There are a number of things that Google does to ensure data is returned to you very fast:

  • First it uses MapReduce and PageRank to generate a comprehensive index of the World Wide Web. It updates this regularly so the results are fresh.
  • That index is distributed and replicated across Google's many servers.
  • Your query is split across multiple servers to build the returned results. This allows the process to be highly parallelized.
  • Common queries and results are cached, reducing the need to perform the search at all.
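
The "split the query across multiple servers" step can be sketched roughly as below. This is a toy Python illustration, not Google's actual architecture: each dictionary in `SHARDS` stands in for one server holding a slice of the index, the page names are invented, and threads stand in for network calls to separate machines.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical shards: each dict stands in for one server's part of the index.
SHARDS = [
    {"google": ["page1", "page4"], "index": ["page1"]},
    {"google": ["page7"], "drive": ["page8"]},
    {"index": ["page9"], "google": ["page12"]},
]

def search_shard(shard, term):
    """Work done by a single server: look the term up in its slice of the index."""
    return shard.get(term, [])

def search_all(shards, term):
    """Send the query to every shard at once, then merge the partial results."""
    with ThreadPoolExecutor(max_workers=len(shards)) as pool:
        partial_results = pool.map(lambda shard: search_shard(shard, term), shards)
        return sorted(page for pages in partial_results for page in pages)

print(search_all(SHARDS, "google"))
```

Each shard only has to search its own small slice, so the time per query depends on the slowest shard rather than on the total size of the index.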

See this link for more information about How Search Works

By comparison, a hard drive search without an index has to read through every file on the drive, and this can take a lot of time.

Additionally, you can think of both a filesystem and an index as a tree. In the filesystem the root of the tree is the top-level folder and it can have branches (folders) or leaves (files) in that one folder. Each branch can have sub-branches for more folders and leaves for more files. To search this structure you have to 'walk' all of the branches (and sub-branches) to find the leaf you are looking for. An index flips this hierarchy around. The base becomes the alphabet and all of the sub-branches are further refinements on this. The leaves are the locations of the items you are looking for. Searching this structure allows you to prune (exclude) large sections of the tree (e.g. the first letter of your search term allows you to trim 25 other branches right away).
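
The letter-by-letter tree described above is usually called a trie. Here is a minimal Python sketch of one, assuming plain lowercase words as keys; the class, the function names and the example paths are made up for illustration and are not taken from any real search engine.

```python
class TrieNode:
    def __init__(self):
        self.children = {}   # next letter -> TrieNode
        self.locations = []  # where the items whose name ends here are stored

def insert(root, word, location):
    """Add one word to the tree, creating a branch per letter as needed."""
    node = root
    for letter in word:
        node = node.children.setdefault(letter, TrieNode())
    node.locations.append(location)

def lookup(root, word):
    """Follow one branch per letter; every other branch is never visited."""
    node = root
    for letter in word:
        node = node.children.get(letter)
        if node is None:
            return []        # the branch does not exist, so the word is absent
    return node.locations

root = TrieNode()
insert(root, "car", "D:/photos/car.jpg")     # hypothetical locations
insert(root, "card", "D:/docs/card.pdf")
insert(root, "dog", "D:/photos/dog.jpg")
print(lookup(root, "car"))
```

A lookup takes one step per letter of the search term, no matter how many other words are stored, which is the pruning effect described above.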

Brad Patton

Posted 2013-04-03T18:44:59.993

Reputation: 9 939

30

About 4 years ago I asked myself the same question. As I googled around doing my research, I eventually read that there is more to it than the fact that they hire the best of the best to come up with some of the most sophisticated search algorithms.

One of the key designs they use is, I think, similar to the idea of MapReduce. You have a lot of cheap computers in farms. Give these computers only about 80 GB of hard disk space, and push hard to have about 16 GB or, even better, 32 GB of RAM in each of them (as much as possible). Remember that they are connected through some sophisticated system Google designed. But the key idea here is that when a query is submitted, it is passed into their system, which tries to search the fresh data in RAM. Keep in mind they have a lot of these cheap computers. And since the data is in RAM, it is found a lot faster than it would be on a hard disk. But don't forget that they also have a sophisticated system (indexing and all those algorithms) that helps greatly.

And this data doesn't have to be fresh, because we all know that Google stores everything. As for what should be in RAM, the same principle as with splay trees can be used: keep whatever people are searching for the most in RAM and flush the least-searched stuff to hard disk.
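
A rough way to picture the "hot data in RAM, cold data on disk" idea is the Python sketch below. Note the assumptions: it uses a simple least-recently-used policy instead of a splay tree or true search-frequency counts, a plain dictionary stands in for the hard disk, and the class name `TwoTierStore` is made up.

```python
from collections import OrderedDict

class TwoTierStore:
    """Keeps recently requested entries in a small 'RAM' tier, the rest on 'disk'."""

    def __init__(self, ram_capacity):
        self.ram_capacity = ram_capacity
        self.ram = OrderedDict()  # hot tier, ordered from least to most recently used
        self.disk = {}            # cold tier; a plain dict stands in for slow storage

    def put(self, key, value):
        self.disk.pop(key, None)       # an entry lives in exactly one tier
        self.ram[key] = value
        self.ram.move_to_end(key)      # mark as most recently used
        self._evict_if_needed()

    def get(self, key):
        if key in self.ram:
            self.ram.move_to_end(key)  # fast path: already in the hot tier
            return self.ram[key]
        value = self.disk.pop(key)     # slow path; raises KeyError if unknown
        self.put(key, value)           # promote to the hot tier
        return value

    def _evict_if_needed(self):
        while len(self.ram) > self.ram_capacity:
            cold_key, cold_value = self.ram.popitem(last=False)
            self.disk[cold_key] = cold_value  # flush the coldest entry to disk
```

Whatever gets requested often stays near the "most recently used" end and never leaves RAM, while rarely requested entries drift to the other end and get flushed.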

This little idea, coupled with their indexing and all the other things others have mentioned in their answers, might be one of the reasons why it is faster than a hard-drive search.

  • The power to predict based on other searches.
  • The data is most likely in RAM which we all know is faster.
  • Use multiple systems to divide and conquer.
  • Searching is their main priority.

Of course I could be wrong, but this made sense to me. And I was happy with what I learned.

Touch

Posted 2013-04-03T18:44:59.993

Reputation: 409

7

You nailed it on some of the things that the other, more popular posters missed. Google doesn't search everything as often. Definitely not on the whole internet, and not even everything in its own caches. Moreover, when you search on Google.com, the actual search is not happening in real-time, just a quick copying and displaying of search results that have already been produced and organized in the past months by Google. It's extremely complicated to describe the producing/organizing process, but it can vaguely be called "indexing" as someone said. – Joseph Myers – 2013-04-03T23:28:31.537

It's extremely complicated to describe the producing/organizing process.... Yep, that's what I refer to as the sophisticated part of it. Thumbs up, you summarized it well. – Touch – 2013-04-03T23:34:02.317

1

@JosephMyers Google indexes constantly. Do a search on a question asked on SuperUser earlier in the day (e.g. https://www.google.com/search?q=google+faster+than+a+hard+drive) and it shows up in the results.

– Brad Patton – 2013-04-04T00:24:16.507

@Touch I agree about searches in RAM. This was the fourth point in my post about caching – Brad Patton – 2013-04-04T00:25:07.777

@Brad Patton True. I had to mention it because it was the basis of what I learned. As for indexing constantly, the indexing part is kind of the organizing part. Therefore the statement holds that you search what has been organized and not what is being indexed at the moment. As for why the result is showing up: stackoverflow has more credibility than many websites, therefore it's a good idea to index it more frequently. That's why it shows up. If it wasn't for that, you would have to wait a day or two before what you search for shows up. I think that's what Mr JosephMyers is saying. – Touch – 2013-04-04T00:38:43.137

20

Google uses an extremely sophisticated indexing system, parallel operations, and a number of load-balancing techniques not available to a standard standalone computer. There is really very little similarity between a web search and a hard disk file search, and Google optimizes heavily for their specific use cases.

Frank Thomas

Posted 2013-04-03T18:44:59.993

Reputation: 29 039

4

In 2004, some Google employees published a paper on MapReduce, and since then they have improved it hundreds of times.

Also, they use the Google File System (GFS), which is a distributed file system like the Hadoop Distributed File System (HDFS) and is extremely optimized for their purposes. Also, as far as I know, GFS works maybe a thousand times faster than HDFS.

smttsp

Posted 2013-04-03T18:44:59.993

Reputation: 141

2

I thought I would add to this as I too had this question a while ago and found these great videos, which describe what Google does on the surface. Interesting to watch.

Google on Youtube 1
Google on Youtube 2

He goes a little bit deeper but not deep enough that you get lost in technicalities.

Cheers.

Mogget

Posted 2013-04-03T18:44:59.993

Reputation: 1 186

2

Just adding something to the wonderful answers here. Google caches popular search phrases. The results of these searches reside in memory. So if you search for something that is searched a lot, the results will show up almost immediately.
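
As a small illustration of result caching, here is a Python sketch using the standard library's `functools.lru_cache`. The `slow_search` function is a made-up stand-in for the real, expensive search, and the counter is only there to show how often it actually runs; this is not a claim about how Google's cache is built.

```python
from functools import lru_cache

CALLS = {"slow": 0}  # counts how often the expensive search really runs

def slow_search(query):
    """Hypothetical stand-in for the real (expensive) search."""
    CALLS["slow"] += 1
    return f"results for {query!r}"

@lru_cache(maxsize=1024)
def cached_search(query):
    """Popular queries are answered from memory after the first request."""
    return slow_search(query)

cached_search("cute cats")   # first request: does the real work
cached_search("cute cats")   # repeat request: served from the cache
print(CALLS["slow"])         # the expensive search ran only once
```

The first person to ask pays the full cost; everyone who asks the same thing afterwards gets the stored answer.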

Mellowcandle

Posted 2013-04-03T18:44:59.993

Reputation: 189

1

To answer the question on a simplistic level: imagine you have a textbook with a keyword index at the back.

Searching a hard disk (naively, at least) is like going through the book, page by page, scanning each line for an occurrence of your keyword.

Using an Internet search engine is like looking up the keyword in the index, and then turning directly to the page number it gives.

In reality of course, it's a lot more complex than this. For example, you would usually search your hard disk for different kinds of information than the Internet. But the basic thing to take away is that the search engine is using an index. It has already gone through the "book", word by word, and it has compiled a list of those words along with where to find them, and it has organised the list in such a way that it can look up things in it very quickly.

For example, think about the organisation of an index in a book. Firstly, it is usually sorted alphabetically, and secondly it may have letter headings. When you look up a word in the index you can see straight away the list of words beginning with the letter you want. And because the list is sorted, it is easy to find the word you want within the list, or to tell quickly if it is missing.
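
The benefit of a sorted index can be sketched in Python with the standard `bisect` module, which performs a binary search. The keywords and page numbers below are invented for illustration.

```python
from bisect import bisect_left

# Hypothetical back-of-book index: (keyword, page) pairs sorted by keyword.
BOOK_INDEX = [("algorithm", 12), ("cache", 40), ("index", 7), ("query", 55), ("ram", 31)]
KEYWORDS = [keyword for keyword, _page in BOOK_INDEX]

def find_page(keyword):
    """Binary search: halve the remaining list at each step instead of reading it all."""
    position = bisect_left(KEYWORDS, keyword)
    if position < len(KEYWORDS) and KEYWORDS[position] == keyword:
        return BOOK_INDEX[position][1]
    return None  # sorted order also tells us quickly that the word is missing

print(find_page("index"))
print(find_page("zebra"))
```

Because the list is sorted, each comparison rules out half of the remaining entries, so even an index with millions of words needs only a couple of dozen comparisons.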

So to summarize, it's like your hard disk just has a book, while the search engine has the index. Though as some others have pointed out, it's possible to use software to index your hard disk, and then you can use the index instead of the whole thing.

mwfearnley

Posted 2013-04-03T18:44:59.993

Reputation: 5 885

-1

I guess one of the reasons Google introduced autocomplete and used AJAX was the speed problem. Now, while you are typing, words are sent in the background so Google can do part of the job before you have finished. Also, indices are based on multiple-word combinations (which you can find as suggestions at the bottom of the page). Currently network speeds are higher than hard-drive speeds, and probably many of those indices reside in the RAM of the servers in their farm.

Xaqron

Posted 2013-04-03T18:44:59.993

Reputation: 148