
I want to experiment with creating a web crawler. I'll start by indexing a few medium-sized websites like Stack Overflow or Smashing Magazine. If it works, I'd like to start crawling the entire web. I'll respect robots.txt. I'll save all HTML, PDF, Word, Excel, PowerPoint, Keynote, etc. documents (not exes, dmgs etc., just documents) in a MySQL DB. Next to that, I'll have a second table containing all results and descriptions, and a table with words and on what page to find those words (aka an index).
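(For readers who want to picture the setup being asked about: a minimal sketch of the crawl-and-index flow, assuming Python 3 and using SQLite as a stand-in for the MySQL tables the question describes. All table, column, and function names here are illustrative, not taken from the question.)

```python
import re
import sqlite3
import urllib.request
import urllib.robotparser

def allowed_by_robots(url, user_agent="my-crawler"):
    """Check robots.txt before fetching -- the 'respect robots.txt' part."""
    robots_url = re.sub(r"(https?://[^/]+).*", r"\1/robots.txt", url)
    rp = urllib.robotparser.RobotFileParser(robots_url)
    rp.read()
    return rp.can_fetch(user_agent, url)

# Three tables, roughly matching the layout described above:
# raw documents, per-page descriptions, and a word -> page inverted index.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE documents  (id INTEGER PRIMARY KEY, url TEXT, body BLOB);
    CREATE TABLE pages      (doc_id INTEGER, title TEXT, description TEXT);
    CREATE TABLE word_index (word TEXT, doc_id INTEGER);
""")

url = "https://example.com/"
if allowed_by_robots(url):
    raw = urllib.request.urlopen(url).read()
    doc_id = conn.execute("INSERT INTO documents (url, body) VALUES (?, ?)",
                          (url, raw)).lastrowid
    # Crude tag stripping, just to get something tokenizable.
    text = re.sub(r"<[^>]+>", " ", raw.decode("utf-8", "replace"))
    for word in set(re.findall(r"[a-z0-9]+", text.lower())):
        conn.execute("INSERT INTO word_index (word, doc_id) VALUES (?, ?)",
                     (word, doc_id))
    conn.commit()
```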

How much HDD space do you think I need to save all the pages? Is it as low as 1 TB or is it about 10 TB, 20? Maybe 30? 1000?

Thanks

  • As an aside, if you're interested in crawlers and the innards of search engines, there is an excellent chapter or two in Programming Collective Intelligence from O'Reilly. It explains a crawler and indexer implementation in Python which will really help you understand how these things work. http://www.amazon.com/Programming-Collective-Intelligence-Building-Applications/dp/0596529325 – Justin Scott Jun 10 '10 at 18:02

3 Answers

3

The Internet Archive does index the web like you mentioned, but as far as I know it only preserves websites, not documents. They keep older versions of sites indexed, so their need for space might be a lot larger. In their FAQ they speak about 2 petabytes of required space for that task (http://www.archive.org/about/faqs.php#9) and about hundreds of Linux servers, each holding about 1 TB of data. Those figures should give you a first impression.

softcr
  • ZOMFG! I think I'll start out without the caching then hehe (-:-) –  Jun 05 '10 at 13:07
  • 2TB hard drives are [on sale for $115](http://www.goharddrive.com/ProductDetails.asp?ProductCode=G01-0156&Click=46406) - They might give you a discount if you buy in bulk ;-) – Matt Simmons Jun 05 '10 at 13:51
  • I helped put in some storage that will be used for the .ie portion of the Internet Archive a couple of months back; we had something like 90 TB for that, and .ie is a pretty small subset. – Helvick Jun 05 '10 at 13:57
0

In 2008 Google was indexing 1,000,000,000,000 pages. If a web page is, on average, 1 KB, that is 1,000 TB.
An average of 1 KB per page is a very low estimate; some PDFs run to huge sizes...
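(As a quick sanity check on those numbers, a back-of-envelope calculation in Python; the 1 KB figure is the same deliberately low average used above.)

```python
# ~10^12 pages (Google's 2008 figure) at ~1 KB each.
pages = 1_000_000_000_000          # pages
avg_page_size = 1_000              # bytes; a deliberately low 1 KB average
total_bytes = pages * avg_page_size
print(total_bytes / 10**12, "TB")  # -> 1000.0 TB, i.e. roughly 1 PB
```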

Good luck

radius
  • According to http://googleblog.blogspot.com/2010/06/our-new-search-index-caffeine.html, Google's DB is 100,000 TB – radius Jun 10 '10 at 12:23
-1

I suspect that an index alone is going to run you one kilobyte per page on average, by the time you include the description, etc. There are a lot of pages out there...

Rob Moir