Questions tagged [web-crawler]

A web crawler (also known as a web spider) traverses the pages of the web by following the URLs contained within each page. The crawler is usually given an initial seed of URLs from which to start its crawl.

This process is called Web crawling or spidering. Many sites, in particular search engines, use spidering as a means of providing up-to-date data. Web crawlers are mainly used to create a copy of all the visited pages for later processing by a search engine that will index the downloaded pages to provide fast searches. Crawlers can also be used for automating maintenance tasks on a Web site, such as checking links or validating HTML code. Also, crawlers can be used to gather specific types of information from Web pages, such as harvesting e-mail addresses (usually for sending spam).

A Web crawler is one type of bot, or software agent. In general, it starts with a list of URLs to visit, called the seeds. As the crawler visits these URLs, it identifies all the hyperlinks in the page and adds them to the list of URLs to visit, called the crawl frontier. URLs from the frontier are recursively visited according to a set of policies.

The large volume of the Web implies that the crawler can only download a limited number of pages within a given time, so it needs to prioritize its downloads. The high rate of change implies that pages may already have been updated or even deleted by the time the crawler revisits them.
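The seed-and-frontier loop described above fits in a few lines of code. A minimal sketch in Python (standard library only; the page limit, politeness delay, and regex-based link extraction are simplifications, and a real crawler would also honor robots.txt, deduplicate canonical URLs, and prioritize the frontier):

```python
import re
import time
import urllib.parse
import urllib.request
from collections import deque

def crawl(seeds, max_pages=50, delay=1.0):
    frontier = deque(seeds)   # the crawl frontier, seeded with start URLs
    visited = set()
    while frontier and len(visited) < max_pages:
        url = frontier.popleft()
        if url in visited:
            continue
        visited.add(url)
        try:
            with urllib.request.urlopen(url, timeout=10) as resp:
                html = resp.read().decode("utf-8", errors="replace")
        except (OSError, ValueError):
            continue                    # skip unreachable or malformed URLs
        # Identify hyperlinks in the page and add them to the frontier.
        for href in re.findall(r'href="([^"#]+)"', html):
            frontier.append(urllib.parse.urljoin(url, href))
        time.sleep(delay)               # basic politeness between fetches
    return visited

# Example: crawl(["https://example.com/"])
```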

More on Wikipedia

95 questions
30
votes
4 answers

Does a company have an implied right to crawl my website?

I have found out that McAfee SiteAdvisor has reported my website as "may be having security issues". I care little about whatever McAfee thinks of my website (I can secure it myself and if not, McAfee definitely is not the company I'd be asking for…
kralyk
  • 487
  • 5
  • 11
12
votes
3 answers

How do sites detect bots behind proxies or company networks?

How do large sites (e.g. Wikipedia) deal with bots hidden behind a shared IP (a proxy or NAT)? For instance, at my university everybody searches Wikipedia, giving it a significant load. But, as far as I know, Wikipedia can only see the IP of the university…
user4052054
  • 222
  • 2
  • 6
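One plausible approach (a sketch, not a description of Wikipedia's actual mechanism) is to key rate limits on more than the source IP, e.g. on the (IP, User-Agent) pair, so one misbehaving bot behind a campus NAT does not get the whole institution throttled:

```python
import time
from collections import defaultdict, deque

WINDOW = 60    # seconds; arbitrary
LIMIT = 120    # requests per window per client key; arbitrary

_hits = defaultdict(deque)

def allow(ip, user_agent):
    """Sliding-window rate limit keyed on (IP, User-Agent), so one
    aggressive client behind a shared NAT address is throttled
    without blocking everyone else on that IP."""
    key = (ip, user_agent)
    now = time.monotonic()
    q = _hits[key]
    while q and now - q[0] > WINDOW:
        q.popleft()                 # drop hits outside the window
    if len(q) >= LIMIT:
        return False                # throttle this client only
    q.append(now)
    return True
```

Real sites layer further signals on top of this (cookies, TLS and header fingerprints, behavioral patterns), since the User-Agent string is trivially spoofable.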
10
votes
5 answers

Finding all IP ranges belonging to a specific ISP

I'm having an issue with a certain individual who keeps scraping my site in an aggressive manner, wasting bandwidth and CPU resources. I've already implemented a system that tails my web server access logs, adds each new IP to a database, keeps…
user45795
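The log-tailing system described in the excerpt can be sketched as follows (assuming an nginx-style access log whose lines begin with the client IP; the log path and threshold are placeholders). Mapping offending IPs to whole ISP ranges would additionally require a whois/ASN lookup per address:

```python
import collections
import re
import subprocess
import time

LOG = "/var/log/nginx/access.log"   # assumed path
LIMIT = 300                          # requests/min before an IP is flagged

counts = collections.Counter()
window = time.monotonic()

# Follow the access log and count requests per client IP.
with subprocess.Popen(["tail", "-F", LOG], stdout=subprocess.PIPE,
                      text=True) as tail:
    for line in tail.stdout:
        m = re.match(r"([0-9a-fA-F.:]+) ", line)   # leading client IP
        if m:
            counts[m.group(1)] += 1
        if time.monotonic() - window >= 60:
            for ip, n in counts.items():
                if n > LIMIT:
                    # a real system would insert into the database
                    # and invoke the firewall here
                    print(f"would ban {ip} ({n} req/min)")
            counts.clear()
            window = time.monotonic()
```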
9
votes
5 answers

How are these 'bad bots' finding my closed webserver?

I installed Apache a while ago, and a quick look at my access.log shows that all sorts of unknown IPs are connecting, mostly with status codes 403, 404, 400, and 408. I have no idea how they're finding my IP, because I only use the server personally,…
bryc
  • 193
  • 1
  • 5
8
votes
3 answers

How do I use robots.txt to disallow crawling for only my subdomains?

If I want my main website to be visible to search engines, but none of the subdomains to be, should I just put a "disallow all" robots.txt in the directories of the subdomains? If I do, will my main domain still be crawlable?
tkbx
  • 201
  • 1
  • 2
  • 6
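The usual answer is that robots.txt is per-host: each subdomain serves its own "disallow all" file, while the main domain's robots.txt stays permissive. A quick sanity check of such rules, sketched with Python's standard urllib.robotparser (the hostnames are placeholders):

```python
from urllib import robotparser

# The robots.txt you would serve at http://sub.example.com/robots.txt
# (hostname is a placeholder).
subdomain_rules = """\
User-agent: *
Disallow: /
"""

rp = robotparser.RobotFileParser()
rp.parse(subdomain_rules.splitlines())

# Compliant crawlers must now skip every URL on this host...
print(rp.can_fetch("*", "http://sub.example.com/page"))   # False

# ...while the main domain, serving its own empty or permissive
# robots.txt, remains fully crawlable.
```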
7
votes
4 answers

How do I rate limit Google's crawl of my IP block?

I have several sites on a /24 network that all get crawled by Google on a pretty regular basis. Normally this is fine. However, when Google starts crawling all the sites at the same time, the small set of servers that back this IP block can take a…
Zak
  • 1,032
  • 2
  • 15
  • 25
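Beyond the per-site crawl-rate setting in Google's Webmaster Tools (which doesn't coordinate across an IP block), one host-side option is to answer 503 whenever the backend is overloaded; well-behaved crawlers back off and retry later. A minimal WSGI sketch (the load threshold and Retry-After value are arbitrary):

```python
import os

def throttle(app, max_load=4.0):
    """Wrap a WSGI app: reply 503 + Retry-After when the 1-minute load
    average is too high, so crawlers back off instead of piling on."""
    def wrapper(environ, start_response):
        if os.getloadavg()[0] > max_load:
            start_response("503 Service Unavailable",
                           [("Retry-After", "120"),
                            ("Content-Type", "text/plain")])
            return [b"Busy; please retry later.\n"]
        return app(environ, start_response)
    return wrapper
```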
7
votes
8 answers

How can I tell how often Google crawls my site?

I've started a relatively new website, and I submitted it to Google and everything. I use Google's Webmaster Tools as well. I'm wondering how to figure out how frequently Google's spider accesses my website. I always hear people talking in forums…
RobHardgood
5
votes
2 answers

Strange request in access.log, how to block?

I am using nginx on my own server, and a few days ago I noticed some strange requests in my access.log: 77.50.217.37 - - [19/Aug/2011:17:50:50 +0200] "GET http://images.google.com/ HTTP/1.1" 200 151 "-" "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT…
jchampem
  • 53
  • 1
  • 3
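Requests for absolute URLs like this one are open-proxy probes: the client is checking whether the server will fetch a foreign site on its behalf. The common nginx fix is a catch-all default server that rejects unknown Host headers; the same idea expressed as a minimal WSGI check (the hostnames are placeholders):

```python
ALLOWED_HOSTS = {"example.com", "www.example.com"}   # placeholder names

def reject_foreign_hosts(app):
    """WSGI wrapper: refuse any request whose Host header is not one of
    our own names, so open-proxy probes get a 403 rather than the 200
    that would confirm a working proxy."""
    def wrapper(environ, start_response):
        host = environ.get("HTTP_HOST", "").split(":")[0]
        if host not in ALLOWED_HOSTS:
            start_response("403 Forbidden",
                           [("Content-Type", "text/plain")])
            return [b"Forbidden\n"]
        return app(environ, start_response)
    return wrapper
```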
4
votes
3 answers

How often do Google's web spiders crawl the web?

Just a few hours after making some changes to my site's HTML, I found that Google had updated its search results for my website. The Internet is so huge; how did the Google crawler do that? Doesn't it use too much bandwidth?
Xiè Jìléi
  • 782
  • 7
  • 13
  • 27
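Part of the answer to the bandwidth question is conditional revalidation: a crawler that remembered a page's ETag or Last-Modified can ask whether the page changed and receive a bodyless 304 instead of the full page. A standard-library sketch (the URL is a placeholder):

```python
import urllib.error
import urllib.request

url = "https://example.com/"              # placeholder

# First visit: remember the validators the server sent.
resp = urllib.request.urlopen(url)
etag = resp.headers.get("ETag")
modified = resp.headers.get("Last-Modified")

# Revisit: send them back. An unchanged page costs one 304, no body.
req = urllib.request.Request(url)
if etag:
    req.add_header("If-None-Match", etag)
if modified:
    req.add_header("If-Modified-Since", modified)
try:
    print(urllib.request.urlopen(req).getcode())   # 200: page changed
except urllib.error.HTTPError as e:
    if e.code == 304:
        print("not modified -- nothing to download")
    else:
        raise
```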
4
votes
3 answers

Does Google's web crawler download binary files?

My Google-fu is failing me right now. I'm trying to figure out whether Google's web crawler downloads non-image binary files when it spiders sites. I know it downloads (and indexes) images and PDFs, but what about .zip, .dmg, etc? My client offers…
jessica
  • 143
  • 4
4
votes
1 answer

Site crawler/spider that tosses results into MySQL

It's been suggested that we use MySQL for our site's search, since it already runs on the same server as our web server (nginx) and our DB (MySQL). Since not all of our pages are generated from the database, it's been suggested that we have a…
Ian
  • 251
  • 2
  • 10
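One common shape for this, sketched under stated assumptions (the mysql-connector-python package, MySQL 5.6+ so that InnoDB supports FULLTEXT indexes, and made-up table and credential names): the crawler upserts each page's extracted text, and search becomes a MATCH ... AGAINST query:

```python
import mysql.connector  # assumes the mysql-connector-python package

db = mysql.connector.connect(user="search", password="secret",
                             database="site")   # placeholder credentials
cur = db.cursor()

# One row per crawled page; the FULLTEXT index powers MATCH ... AGAINST.
cur.execute("""
    CREATE TABLE IF NOT EXISTS pages (
        url  VARCHAR(255) PRIMARY KEY,
        body MEDIUMTEXT,
        FULLTEXT (body)
    ) ENGINE=InnoDB
""")

def store(url, text):
    """Called by the crawler for every page, DB-backed or static."""
    cur.execute("REPLACE INTO pages (url, body) VALUES (%s, %s)",
                (url, text))
    db.commit()

def search(terms):
    cur.execute("SELECT url FROM pages "
                "WHERE MATCH(body) AGAINST (%s IN NATURAL LANGUAGE MODE)",
                (terms,))
    return [row[0] for row in cur.fetchall()]
```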
4
votes
4 answers

How can I run Nutch on Linux?

I want to run Nutch on Linux. I have logged in as the root user, set all the environment variables, and configured Nutch. I have created a url.txt file containing the URLs to crawl. When I try to run Nutch using the following…
3
votes
1 answer

Baidu spider causing 3 GB of traffic a day, but I do business in China

I'm in a difficult situation: the Baidu spider is hitting my site, consuming about 3 GB of bandwidth a day. At the same time, I do business in China, so I don't want to just block it. Has anyone else been in a similar situation (with any spider)? Did you…
d.lanza38
  • 327
  • 1
  • 5
  • 13
3
votes
1 answer

Why is Googlebot requesting robots.txt from my SSH server?

I run OSSEC on my server, and periodically I receive a warning like this: Received From: myserver->/var/log/auth.log Rule: 5701 fired (level 8) -> "Possible attack on the ssh server (or version gathering)." Portion of the log(s): Nov 19 14:26:33…
Brian
  • 766
  • 1
  • 6
  • 14
3
votes
2 answers

Is it a good idea to ban amazonaws.com?

My site is crawled by an anonymous bot hosted on Amazon EC2. This robot doesn't respect robots.txt and creates high load on the web server, so I added a check: if the reverse DNS of the requesting IP ends with "amazonaws.com", the server immediately returns a 403 page. This…
valodzka
  • 177
  • 3
  • 10
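The check described in the excerpt, with the usual forward-confirmation step added so a spoofed PTR record cannot trigger or dodge the ban (standard library only; results should be cached per IP, since two DNS lookups per request would add load of their own):

```python
import socket

def is_ec2(ip):
    """Return True if ip reverse-resolves into amazonaws.com AND the
    resulting name forward-resolves back to the same ip (guarding
    against PTR spoofing). Any DNS failure counts as 'not EC2'."""
    try:
        host = socket.gethostbyaddr(ip)[0]
    except OSError:
        return False
    if not host.endswith(".amazonaws.com"):
        return False
    try:
        return ip in socket.gethostbyname_ex(host)[2]
    except OSError:
        return False

# e.g. in the request handler:
# if is_ec2(client_ip): return a 403 immediately
```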