Questions tagged [web-crawler]

A web crawler (also known as a web spider) traverses the web pages of the internet by following the links (URLs) contained within each page. The crawler is usually given an initial seed of URLs from which to start its crawl.

This process is called Web crawling or spidering. Many sites, in particular search engines, use spidering as a means of providing up-to-date data. Web crawlers are mainly used to create a copy of all the visited pages for later processing by a search engine that will index the downloaded pages to provide fast searches. Crawlers can also be used to automate maintenance tasks on a Web site, such as checking links or validating HTML code, and to gather specific types of information from Web pages, such as harvesting e-mail addresses (usually for sending spam).

A Web crawler is one type of bot, or software agent. In general, it starts with a list of URLs to visit, called the seeds. As the crawler visits these URLs, it identifies all the hyperlinks in the page and adds them to the list of URLs to visit, called the crawl frontier. URLs from the frontier are recursively visited according to a set of policies.

The large volume of the Web implies that the crawler can only download a limited number of pages within a given time, so it needs to prioritize its downloads. The high rate of change implies that by the time the crawler gets to a page it might already have been updated or even deleted.

More on Wikipedia
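
To make the seed and crawl-frontier description above concrete, here is a minimal sketch of such a crawl loop in Python. It assumes the third-party requests and beautifulsoup4 packages; the breadth-first ordering, the page limit, and the fixed one-second delay stand in for real crawl policies, and a production crawler would additionally honor robots.txt, normalize URLs, and prioritize its frontier.

from collections import deque
from urllib.parse import urljoin, urlparse
import time

import requests
from bs4 import BeautifulSoup

def crawl(seeds, max_pages=100, delay=1.0):
    frontier = deque(seeds)   # the "crawl frontier": URLs still to visit
    visited = set()
    pages = {}                # url -> downloaded HTML, kept for later indexing

    while frontier and len(pages) < max_pages:
        url = frontier.popleft()
        if url in visited:
            continue
        visited.add(url)

        try:
            response = requests.get(url, timeout=10)
        except requests.RequestException:
            continue          # skip unreachable or failing pages
        pages[url] = response.text

        # Identify the hyperlinks on the page and add unseen ones to the frontier.
        soup = BeautifulSoup(response.text, "html.parser")
        for anchor in soup.find_all("a", href=True):
            link = urljoin(url, anchor["href"])
            if urlparse(link).scheme in ("http", "https") and link not in visited:
                frontier.append(link)

        time.sleep(delay)     # crude politeness policy between requests

    return pages

if __name__ == "__main__":
    # Illustrative seed list; replace with real starting URLs.
    results = crawl(["https://example.com/"], max_pages=10)
    print(f"Downloaded {len(results)} pages")

Visiting the frontier with a deque gives breadth-first order; swapping it for a priority queue keyed on page importance is where the prioritization mentioned above would come in.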

95 questions
3
votes
3 answers

Firewall - Preventing Content Theft & Rogue Crawlers

Our websites are being crawled by content thieves on a regular basis. We obviously want to let through the nice bots and legitimate user activity, but block questionable activity. We have tried IP blocking at our firewall, but this becomes to…
2
votes
0 answers

Barracuda.com and crawling / pinger services causing unusual load on web servers

I recently received a large number of hits on my home page from 64.235.153.8. It resolves to barracuda.com. I know Barracuda as an enterprise-class spam detection/prevention solution. Do they also offer some kind of bot/crawling or pinger services…
Luke G
  • 151
  • 6
2
votes
0 answers

Block Bad Bots in Nginx for Multiple Sites

I need to block a bunch of robots from crawling a few hundred sites hosted on an Nginx web server running on an Ubuntu 16.04 machine. I've found a fairly simple example here (the important part of the code is below) but it seems that this functionality…
2
votes
1 answer

Strange web traffic - Is this an attack?

I recently noticed some strange traffic in my nginx access logs. I'm not sure if these requests indicate an attack, a mistake, or something else. I've started responding to them with HTTP 444, so the logs will reflect that. 1) I noticed an increase in traffic,…
user153775
  • 23
  • 2
2
votes
1 answer

Are there any regularly updated Bot/Spider/Crawler Databases?

I am looking for a regularly updated database of different bots, spiders, and crawlers. I want to be able to identify them in the log files from IIS.
2
votes
2 answers

What's the purpose of spammy HTTP referers?

In the logs of my website, there are a lot of visits with an HTTP referer set to spam-like websites (usually Russian sites, I've noticed). I assume what they're doing is just using a web crawler to visit any site they find with the HTTP referer as…
user280917
2
votes
2 answers

Strange "GET /api/levels/ " and "GET /play/" requests in logs

I've set up a new Amazon EC2 instance. Within a day or two I started to get strange "GET" requests from "Googlebot-like" IPs (e.g. 66.249.76.84, 66.249.74.152), about one every 10 seconds (some examples): 66.249.74.152 - - [10/Apr/2013:06:05:02 +0000] "GET…
domage
  • 23
  • 2
2
votes
1 answer

Ethical/legal considerations when redirecting

A web crawler has brought our site down twice. It ignores our robots.txt, and we have had no reply from their customer services or support using both e-mail and Twitter. I have had to create a URL redirect based on their user agent string; I have…
NimChimpsky
  • 460
  • 2
  • 5
  • 17
2
votes
2 answers

Apache crashing with memory/cpu overload when google crawler visits site

I have a site with low traffic, less than 500 hits a day. It has 6G of memory and is way underutilized; on average 5% is in use. But as soon as Googlebot establishes a connection to my web server/Apache, the memory and CPU usage spikes in…
Daniel t.
  • 9,061
  • 1
  • 32
  • 36
2
votes
3 answers

How much HDD space would I need to cache the web while respecting robots.txt?

I want to experiment with creating a web crawler. I'll start by indexing a few medium-sized websites like Stack Overflow or Smashing Magazine. If it works, I'd like to start crawling the entire web. I'll respect robots.txt. I save all html, pdf,…
user42235
1
vote
0 answers

HTTrack stores extensionless pages with a .html appended

I'd like to mirror an old site of mine to local files. I've used httrack for this in the past, but I'm having a problem this time that I really thought I figured out before, but can't seem to now. My site has a lot of extensionless pages, which…
boomhauer
  • 151
  • 1
  • 1
  • 6
1
vote
1 answer

How many requests can a router handle?

I made a script to scan a file which contains a portion of the IPv4 address space (about 50 million addresses). It attempts to connect to each website using OpenSSL, extract a small piece of it, and write that to a file. To save some of the details, it uses…
user153882
  • 113
  • 5
1
vote
1 answer

Bash script - wait for all xargs processes to be finished

I have written a small bash script for crawling an XML sitemap of URLs. It retrieves 5 URLs in parallel using xargs. Now I want an e-mail to be sent when all URLs have been crawled, so it has to wait until all sub-processes of xargs have finished…
Alex
  • 302
  • 1
  • 3
  • 12
1
vote
1 answer

Suspected malicious activity by one of my site's users; any way to know for sure?

In the course of about 2 hours, a logged in user on my website accessed roughly 1,600 pages in a way that looks suspiciously similar to a bot. I am concerned because users must purchase access to the site in order to get full access to our protected…
Nick S.
  • 131
  • 1
1
vote
0 answers

What are the symptoms of an overloaded webserver?

I'm maintaining some web crawlers. I want to improve our load/throttling system to be more intelligent. Of course I look at response codes and throttle up or down based on that. I would, though, like the system to be better at dynamically adjusting…
Niels Kristian
  • 358
  • 2
  • 13