8

I have a list of web pages that I need to scrape, parse and then store the resulting data in a database. The total is around 5,000,000 pages.

My current assumption is that the best approach is to deploy ~100 EC2 instances, give each instance 50,000 pages to scrape, leave them to run, and then merge the databases once the process is complete. My estimate is that this would take around one day to run (600ms to load, parse and save each page).
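As a sanity check, here is the arithmetic implied by those figures (nothing beyond the numbers already stated above):

    # Back-of-the-envelope check of the per-instance workload described above.
    pages_total = 5_000_000
    instances = 100
    seconds_per_page = 0.6                          # 600 ms to load, parse and save

    pages_per_instance = pages_total // instances   # 50,000
    hours_sequential = pages_per_instance * seconds_per_page / 3600
    print(hours_sequential)                         # ~8.3 hours of sequential work, inside the one-day budget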

Does anyone have experience with doing such a large volume of page scraping in a limited time? I've done large numbers before (1.5m), but that was from a single machine and took just over a week to complete.

The bottleneck in my situation is downloading the pages; parsing takes no more than 2ms per page. So what I'm looking for is anything that can streamline the downloading.

sam
  • When you say a list of web pages, is this just plain web pages or whole sites like a forum or something? Also, even if you go flat out, are there rules in place for the sites you want to scrape (or is this just getting the theory in place first)? – tombull89 Oct 31 '11 at 11:57
  • I have multiple instances where this question is relevant to me; for the sake of the question I gave an arbitrary figure that could easily be visualised. The type of web page varies, but for the question you can assume it is a forum being scraped if you like. Whether or not the site allows scraping is a non-issue (for the question anyway). – sam Oct 31 '11 at 12:02
  • To clarify the point about the type of web pages: each web page is independent of any other, they can be scraped in any order and have no reliance on another being scraped. It could be done forwards, backwards, randomised, that doesn't matter. – sam Oct 31 '11 at 12:06
  • I see. I don't know how EC2 would handle the downloads, but some more experienced SF users may have some ideas. Also, off-topic, but is this *the* citricsquid from the MinecraftForums? It's a fairly...unique...name. – tombull89 Oct 31 '11 at 12:11
  • mmhmm that is I. – sam Oct 31 '11 at 12:15
  • Welcome to ServerFault! Sorry I can't give you a definitive answer. – tombull89 Oct 31 '11 at 12:18

2 Answers

7

Working on the assumption that download time (and therefore bandwidth usage) is your limiting factor, I would make the following suggestions:

Firstly, choose m1.large instances. Of the three 'levels' of I/O performance (which includes bandwidth), the m1.large and m1.xlarge instances both offer 'high' I/O performance. Since your task is not CPU bound, the least expensive of these will be the preferable choice.

Secondly, your instance will be able to download far faster than any site can serve pages, so do not download a single page at a time on a given instance - run the downloads concurrently. You should be able to do at least 20 pages simultaneously (and I would guess you can probably do 50-100 without difficulty). (Take the example of downloading from a forum, from your comment: that is a dynamic page which takes the server time to generate, and there are other users consuming that site's bandwidth, etc.) Continue to increase the concurrency until you reach the limits of the instance's bandwidth. (Of course, don't make multiple simultaneous requests to the same site.)
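A minimal sketch of that per-instance concurrency, assuming a plain stdlib thread pool and grouping URLs by host so no single site ever sees more than one request at a time; the worker count and timeout are placeholders to tune against the instance's bandwidth:

    # Minimal sketch: a thread pool fetches many hosts at once, but each host's
    # URLs are downloaded sequentially, so no site gets simultaneous requests.
    from collections import defaultdict
    from concurrent.futures import ThreadPoolExecutor
    from urllib.parse import urlparse
    from urllib.request import urlopen

    def fetch(url, timeout=30):
        with urlopen(url, timeout=timeout) as resp:   # timeout is a placeholder
            return url, resp.read()

    def fetch_host_queue(urls):
        # All URLs here share one host, so they are fetched one after another.
        results = []
        for url in urls:
            try:
                results.append(fetch(url))
            except Exception as exc:                  # record failures for later retry
                results.append((url, exc))
        return results

    def scrape(urls, concurrency=50):                 # tune against instance bandwidth
        by_host = defaultdict(list)
        for url in urls:
            by_host[urlparse(url).netloc].append(url)
        with ThreadPoolExecutor(max_workers=concurrency) as pool:
            for batch in pool.map(fetch_host_queue, by_host.values()):
                yield from batch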

If you really are trying to maximize performance, you may consider launching instances in geographically appropriate regions to minimize latency (but that would require geolocating all your URLs, which may not be practical).

One thing to note is that instance bandwidth is variable, at times you will get higher performance, and at other times you will get lower performance. On the smaller instances, the variation in performance is more significant because the physical links are shared by more servers and any of those can decrease your available bandwidth. Between m1.large instances, within the EC2 network (same availability zone), you should get near theoretical gigabit throughput.

In general, with AWS, it is almost always more efficient to go with a larger instance as opposed to multiple smaller instances (unless you are specifically looking at something such as failover, etc. where you need multiple instances).

I don't know what your setup entails, but when I have previously attempted this (between 1 and 2 million links, updated periodically), my approach was to maintain a database of the links, adding new links as they were found, and to fork processes to scrape and parse the pages. A URL would be retrieved (at random) and marked as in progress in the database; the script would download the page and, if successful, mark the URL as downloaded and hand the content to another script that parsed it, adding any new links to the database as they were found. The advantage of the database here was centralization: multiple scripts could query it simultaneously and (as long as transactions were atomic) you could be assured that each page would only be downloaded once.
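A sketch of that claim-and-mark cycle, with SQLite standing in for whatever shared database was actually used (a networked database is what lets multiple instances share one queue); the table and column names are illustrative, not from the original setup:

    # Illustrative schema/queries only; assumes SQLite 3.35+ for UPDATE ... RETURNING.
    import sqlite3

    SCHEMA = """
    CREATE TABLE IF NOT EXISTS links (
        id     INTEGER PRIMARY KEY,
        url    TEXT UNIQUE,             -- UNIQUE ensures each page is queued only once
        status TEXT DEFAULT 'pending'   -- pending -> in_progress -> downloaded
    );
    """

    def claim_url(conn):
        """Atomically pick one pending URL at random and mark it in progress."""
        with conn:  # single transaction, so the claim is atomic
            return conn.execute(
                "UPDATE links SET status = 'in_progress' "
                "WHERE id = (SELECT id FROM links WHERE status = 'pending' "
                "            ORDER BY RANDOM() LIMIT 1) "
                "RETURNING id, url"
            ).fetchone()

    def finish_url(conn, link_id, new_urls):
        """Mark a page downloaded and enqueue any newly discovered links."""
        with conn:
            conn.execute("UPDATE links SET status = 'downloaded' WHERE id = ?",
                         (link_id,))
            conn.executemany("INSERT OR IGNORE INTO links (url) VALUES (?)",
                             [(u,) for u in new_urls])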

A couple of additional points worth mentioning: there is a limit (I believe 20) on the number of on-demand instances you can have running at one time - if you plan to exceed that, you will need to ask AWS to increase your account's limits. It would be much more economical to run spot instances and scale up your numbers when the spot price is low (perhaps one on-demand instance to keep everything organized, and the rest as spot instances).

If time is a higher priority than cost for you, the cluster compute instances offer 10Gbps bandwidth and should yield the greatest download throughput.

Recap: try a few large instances (instead of many small instances) and run multiple concurrent downloads on each instance - add more instances if you find yourself bandwidth limited, and move to larger instances if you find yourself CPU or memory bound.

cyberx86
4

We tried to do something similar; here are my five cents:

  1. Get 2-3 cheap unmetered servers, i.e. don't pay for the bandwidth.

  2. Use Python with asyncore. asyncore is the old way of doing things, but we found it works faster than any other method. The downside is that DNS lookups are blocking, i.e. not "parallel". Using asyncore we managed to scrape 1M URLs in about 40 minutes on a single 4-core Xeon with 8 GB RAM. Load average on the server was under 4 (which is excellent for 4 cores).

  3. If you do not like asyncore, try gevent. It even does non-blocking DNS. Using gevent, the same 1M URLs were downloaded in about 50 minutes on the same hardware, but load average on the server was huge. (A minimal gevent sketch follows after this list.)

Note that we tested lots of Python libraries, such as grequests, curl, and urllib/urllib2, but we did not test Twisted.

  4. We did test PHP + curl with several processes; it did the job in about an hour, but the load average on the server was huge.
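A minimal sketch of the gevent approach from point 3, assuming a simple urllib-based fetch; the pool size and timeout are my own placeholders, not from the original setup:

    # gevent makes socket operations (including DNS) cooperative after monkey-patching,
    # so a single process can keep many downloads in flight at once.
    import gevent.monkey
    gevent.monkey.patch_all()

    from gevent.pool import Pool
    from urllib.request import urlopen

    def fetch(url):
        try:
            with urlopen(url, timeout=30) as resp:   # timeout is a placeholder
                return url, resp.read()
        except Exception as exc:                     # keep failures for later retry
            return url, exc

    def scrape(urls, concurrency=200):               # pool size is a placeholder
        pool = Pool(concurrency)                     # caps the number of concurrent greenlets
        return list(pool.imap_unordered(fetch, urls))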
Nick