
I want to experiment with creating a web crawler. I'll start by indexing a few medium-sized websites like Stack Overflow or Smashing Magazine. If it works, I'd like to start crawling the entire web. I'll respect robots.txt. I'll save all HTML, PDF, Word, Excel, PowerPoint, Keynote, etc. documents (not exes, dmgs etc., just documents) in a MySQL DB. Next to that, I'll have a second table containing all results and descriptions, and a table with words and on what page to find those words (aka an index).
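(For readers who want to picture the setup being asked about: a minimal sketch of the crawl-and-index flow, assuming Python 3 and using SQLite as a stand-in for the MySQL tables the question describes. All table, column, and function names here are illustrative, not taken from the question.)

```python
import re
import sqlite3
import urllib.request
import urllib.robotparser

def allowed_by_robots(url, user_agent="my-crawler"):
    """Check robots.txt before fetching -- the 'respect robots.txt' part."""
    robots_url = re.sub(r"(https?://[^/]+).*", r"\1/robots.txt", url)
    rp = urllib.robotparser.RobotFileParser(robots_url)
    rp.read()
    return rp.can_fetch(user_agent, url)

# Three tables, roughly matching the layout described above:
# raw documents, per-page descriptions, and a word -> page inverted index.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE documents  (id INTEGER PRIMARY KEY, url TEXT, body BLOB);
    CREATE TABLE pages      (doc_id INTEGER, title TEXT, description TEXT);
    CREATE TABLE word_index (word TEXT, doc_id INTEGER);
""")

url = "https://example.com/"
if allowed_by_robots(url):
    raw = urllib.request.urlopen(url).read()
    doc_id = conn.execute("INSERT INTO documents (url, body) VALUES (?, ?)",
                          (url, raw)).lastrowid
    # Crude tag stripping, just to get something tokenizable.
    text = re.sub(r"<[^>]+>", " ", raw.decode("utf-8", "replace"))
    for word in set(re.findall(r"[a-z0-9]+", text.lower())):
        conn.execute("INSERT INTO word_index (word, doc_id) VALUES (?, ?)",
                     (word, doc_id))
    conn.commit()
```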

How much HDD space do you think I need to save all the pages? Is it as low as 1 TB or is it about 10 TB, 20? Maybe 30? 1000?

Thanks

  • As an aside, if you're interested in crawlers and the innards of search engines, there is an excellent chapter or two in Programming Collective Intelligence from O'Reilly. It explains a crawler and indexer implementation in Python which will really help you understand how these things work. http://www.amazon.com/Programming-Collective-Intelligence-Building-Applications/dp/0596529325 – Justin Scott Jun 10 '10 at 18:02

3 Answers

3

The Internet Archive does index the web like you mentioned, but as far as I know it only preserves websites, not documents. They keep older versions of sites indexed, so their need for space might be a lot larger. In their FAQ they speak about 2 petabytes of required space for that task (http://www.archive.org/about/faqs.php#9) and about hundreds of Linux servers, each holding about 1 TB of data. Those figures should give you a first impression.

softcr
  • ZOMFG! I think I'll start out without the caching then hehe (-:-) –  Jun 05 '10 at 13:07
  • 2TB hard drives are [on sale for $115](http://www.goharddrive.com/ProductDetails.asp?ProductCode=G01-0156&Click=46406) - They might give you a discount if you buy in bulk ;-) – Matt Simmons Jun 05 '10 at 13:51
  • I helped put in some storage that will be used for the .ie portion of the Internet Archive a couple of months back; we had something like 90 TB for that, and .ie is a pretty small subset. – Helvick Jun 05 '10 at 13:57
0

In 2008 Google was indexing 1,000,000,000,000 pages. If a web page is, on average, 1 KB, that is 1,000 TB.
An average of 1 KB per page is a very low estimate; some PDFs run to huge sizes...
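(As a quick sanity check on those numbers, a back-of-envelope calculation in Python; the 1 KB figure is the same deliberately low average used above.)

```python
# ~10^12 pages (Google's 2008 figure) at ~1 KB each.
pages = 1_000_000_000_000          # pages
avg_page_size = 1_000              # bytes; a deliberately low 1 KB average
total_bytes = pages * avg_page_size
print(total_bytes / 10**12, "TB")  # -> 1000.0 TB, i.e. roughly 1 PB
```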

Good luck

radius
  • According to http://googleblog.blogspot.com/2010/06/our-new-search-index-caffeine.html, Google's DB is 100,000 TB – radius Jun 10 '10 at 12:23
-1

I suspect that an index alone is going to run you one kilobyte per page on average, by the time you include the description, etc. There are a lot of pages out there...

Rob Moir