Questions tagged [scraping]

26 questions
8
votes
7 answers

How to avoid being scraped?

We have a searchable database (DB). We limit results to 15 per page and to 100 results in total, yet we still get people trying to scrape the site. We already ban clients that hit it too quickly. I was wondering if there is anything else that we can do.…
Randin
  • 183
  • 3
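One approach that complements banning fast clients is to throttle per client inside the application itself, so a scraper that stays just under the ban threshold still gets slowed down. Below is a minimal, hypothetical Python sketch of a sliding-window limiter; the window length, request budget, and the idea of answering over-limit clients with a 429 or a CAPTCHA are assumptions for illustration, not details from the question.

```python
# Hypothetical application-level rate limiter (sliding window per client IP).
# WINDOW_SECONDS and MAX_REQUESTS are placeholder values to tune.
import time
from collections import defaultdict, deque

WINDOW_SECONDS = 60   # length of the sliding window
MAX_REQUESTS = 30     # allowed requests per IP within the window

_history: dict[str, deque] = defaultdict(deque)

def allow_request(client_ip: str) -> bool:
    """Return False once an IP exceeds MAX_REQUESTS within WINDOW_SECONDS."""
    now = time.monotonic()
    q = _history[client_ip]
    while q and now - q[0] > WINDOW_SECONDS:   # discard timestamps outside the window
        q.popleft()
    if len(q) >= MAX_REQUESTS:
        return False                           # e.g. respond with HTTP 429 or a CAPTCHA
    q.append(now)
    return True
```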
8
votes
2 answers

Most efficient (time, cost) way to scrape 5 million web pages?

I have a list of web pages that I need to scrape and parse, then store the resulting data in a database. The total is around 5,000,000. My current assumption is that the best approach is to deploy ~100 EC2 instances, provide each instance…
sam
  • 211
  • 2
  • 6
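For scale, one machine running a modest thread pool can often cover a surprising share of 5,000,000 fetches, so the number of instances may matter less than bandwidth and per-site politeness limits. The sketch below is a hypothetical single-machine worker using requests and concurrent.futures; the URL list, the parse/store helpers, and the concurrency level are all assumptions.

```python
# Hypothetical single-machine scrape worker: fetch concurrently, parse, store.
from concurrent.futures import ThreadPoolExecutor, as_completed
import requests

def fetch(session: requests.Session, url: str) -> tuple[str, str]:
    resp = session.get(url, timeout=30)
    resp.raise_for_status()
    return url, resp.text

def parse(html: str) -> dict:
    return {"length": len(html)}          # placeholder parser

def store_result(url: str, data: dict) -> None:
    pass                                  # placeholder: insert into the database

def scrape_all(urls: list[str], workers: int = 50) -> None:
    session = requests.Session()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [pool.submit(fetch, session, u) for u in urls]
        for fut in as_completed(futures):
            try:
                url, html = fut.result()
            except requests.RequestException:
                continue                  # a real run would log and retry
            store_result(url, parse(html))
```

At around 50 in-flight requests averaging a second each, a single worker fetches on the order of a few million pages per day, so a small fleet sharing a work queue may be enough.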
4
votes
4 answers

How easy/expensive is it to adopt Google Mini/Google Appliance for intranet search?

Out of curiosity, is anyone here using Google Mini or Google Search Appliance to provide intranet search? Was it easy to set up? What kind of prices do they charge (ball park figure, I'm sure it depends on the customer)?
username
  • 4,725
  • 18
  • 54
  • 78
4
votes
0 answers

How to enable JavaScript for wget in Linux when grabbing a website?

I use wget like this to save a site: wget --page-requisites --no-parent --mirror http://example.com/index.html -P /home/. In some cases it does NOT work; the error is: This site requires Javascript to work, please enable Javascript in your browser or…
AriaData
  • 53
  • 1
  • 1
  • 4
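wget only saves what the server returns and never executes JavaScript, so pages whose content is built client-side need a headless browser instead of (or in addition to) wget. Here is a hypothetical sketch using Playwright for Python, which is an assumption not mentioned in the question (pip install playwright, then playwright install chromium).

```python
# Hypothetical: render a JavaScript-dependent page in headless Chromium
# and save the resulting HTML to disk.
from pathlib import Path
from playwright.sync_api import sync_playwright

def save_rendered_page(url: str, out_file: str) -> None:
    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page()
        page.goto(url, wait_until="networkidle")   # wait for client-side JS to settle
        Path(out_file).write_text(page.content(), encoding="utf-8")
        browser.close()

if __name__ == "__main__":
    save_rendered_page("http://example.com/index.html", "index.html")
```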
3
votes
3 answers

IP addresses of spiders and “official” web bots

Is there an official API to iplists.com from where I can get the list of spiders? My intention is to whitelist these IPs for site scraping.
Quintin Par
  • 4,293
  • 10
  • 46
  • 72
3
votes
1 answer

What to do about spoofed user agents? Scrapers pretending to be spiders

I've been following a few spiders in our logs, and a traceroute on their IPs shows they are in fact EC2 instances. The user agents are listed as Googlebot and msnbot, but they are not Google or MS IPs. Is there anything I can do, is…
Ryan Detzel
  • 687
  • 3
  • 7
  • 20
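The usual way to settle this is forward-confirmed reverse DNS: resolve the client IP to a hostname, check that it belongs to the crawler's documented domain, then resolve that hostname forward and confirm it maps back to the same IP. A minimal Python sketch; the domain suffixes and the sample address are illustrative and should be checked against each engine's documentation.

```python
# Forward-confirmed reverse DNS check for crawler IPs (sketch).
import socket

CRAWLER_SUFFIXES = (".googlebot.com", ".google.com", ".search.msn.com")

def is_genuine_crawler(ip: str) -> bool:
    try:
        host = socket.gethostbyaddr(ip)[0]              # reverse lookup: IP -> hostname
    except socket.herror:
        return False
    if not host.endswith(CRAWLER_SUFFIXES):
        return False
    try:
        forward_ips = socket.gethostbyname_ex(host)[2]  # forward lookup: hostname -> IPs
    except socket.gaierror:
        return False
    return ip in forward_ips                            # must round-trip to the same IP

# Example (illustrative address believed to lie in Googlebot's range):
# print(is_genuine_crawler("66.249.66.1"))
```

A spoofed EC2 client fails at the reverse-lookup step, since its PTR record points at an amazonaws.com hostname rather than the crawler's domain.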
2
votes
1 answer

Protect nginx from hammering

I would like to protect my nginx + Passenger + Rails 3 HTTP server from hammering/scraping. If you try to scrape Google, it shows you a CAPTCHA when you make too many requests from the same IP. What module should I use? Thanks.
xpepermint
  • 267
  • 3
  • 9
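nginx covers this out of the box with ngx_http_limit_req_module (and ngx_http_limit_conn_module for concurrent connections), so no third-party module is required. A minimal sketch; the zone size, rate, and burst values are placeholders to tune.

```nginx
# Sketch for nginx.conf; rate and burst are placeholder values.
http {
    # One state slot per client IP, 10 MB shared memory, 10 requests/second.
    limit_req_zone $binary_remote_addr zone=perip:10m rate=10r/s;

    server {
        location / {
            # Absorb short bursts of up to 20 requests, then reject the excess.
            limit_req zone=perip burst=20 nodelay;
        }
    }
}
```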
2
votes
2 answers

Amazon EC2 + S3 + Python + Scraping - The cheapest way of doing this?

I have tapped into Amazon's AWS offerings; please tell me, at a high level, if I am thinking about this right. I have a few Python scraping scripts on my local machine. I want to use AWS for super-fast internet connectivity at a cheaper price - win/win! I…
ThinkCode
  • 184
  • 1
  • 10
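At a high level that split is common: run the scraper on EC2 for the bandwidth, write results to S3 for cheap durable storage, and shut the instance down when the job finishes. Here is a hypothetical sketch of the S3 half using boto3; the bucket name and key scheme are assumptions.

```python
# Hypothetical: store each scraped page in S3 with boto3.
import hashlib
import boto3

s3 = boto3.client("s3")

def upload_page(bucket: str, url: str, html: str) -> str:
    """Write one page to S3 under a key derived from its URL; return the key."""
    key = "pages/" + hashlib.sha256(url.encode()).hexdigest() + ".html"
    s3.put_object(Bucket=bucket, Key=key,
                  Body=html.encode("utf-8"), ContentType="text/html")
    return key

# upload_page("my-scrape-bucket", "http://example.com/", "<html>...</html>")
```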
2
votes
1 answer

How can I use fail2ban to block scrapers?

I have a media site and a problem with users coming along and scraping all of the content. I placed an invisible URL on the page to catch spiders, which immediately blocks the IP, but some people have figured out the URL scheme and are creating their own…
coneybeare
  • 611
  • 1
  • 7
  • 14
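One way to wire the honeypot into fail2ban is a custom filter that matches the trap URL in the web server access log, with a jail that bans on the first hit. A sketch assuming an nginx/Apache combined log format; the trap path, log path, and ban time are placeholders.

```ini
# /etc/fail2ban/filter.d/honeypot.conf  (sketch; the trap path is a placeholder)
[Definition]
failregex = ^<HOST> .* "GET /trap-do-not-follow

# /etc/fail2ban/jail.local
[honeypot]
enabled  = true
port     = http,https
filter   = honeypot
logpath  = /var/log/nginx/access.log
maxretry = 1
bantime  = 86400
```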
1
vote
0 answers

HTTrack stores extensionless pages with a .html appended

I'd like to mirror an old site of mine to local files. I've used httrack for this in the past, but this time I'm having a problem that I thought I had figured out before and can't seem to solve now. My site has a lot of extensionless pages, which…
boomhauer
  • 151
  • 1
  • 1
  • 6
1
vote
1 answer

Suspected malicious activity by one of my site's users; any way to know for sure?

In the course of about 2 hours, a logged-in user on my website accessed roughly 1,600 pages in a way that looks suspiciously like a bot. I am concerned because users must purchase access to the site in order to get full access to our protected…
Nick S.
  • 131
  • 1
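Short of asking the user, the access log usually answers this: 1,600 pages in two hours is one request roughly every 4.5 seconds, and a human's gaps between requests vary far more than a script's. Below is a hypothetical sketch that summarizes request pacing for one account, assuming the timestamps have already been parsed out of the log.

```python
# Hypothetical: summarize inter-request gaps for one user's timestamps.
from datetime import datetime
from statistics import mean, pstdev

def pacing_summary(timestamps: list[datetime]) -> dict:
    ts = sorted(timestamps)
    gaps = [(b - a).total_seconds() for a, b in zip(ts, ts[1:])]
    return {
        "requests": len(ts),
        "mean_gap_s": round(mean(gaps), 2),
        "stdev_gap_s": round(pstdev(gaps), 2),  # near-zero spread suggests automation
        "min_gap_s": round(min(gaps), 2),       # sub-second minimums also suggest a script
    }
```

Other useful signals are the user-agent string, whether page assets (CSS, images) were ever requested, and whether pages were visited in strict ID order.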
1
vote
2 answers

Protection against scraping with nginx

This morning we had a crawler going nuts on our server, hitting our site almost 100 times per second. We'd like to add protection against this. I guess I'll have to use HttpLimitReqModule, but I don't want to block Google/Bing/... How should I do…
bl0b
  • 141
  • 1
  • 6
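limit_req can be combined with geo and map so that requests from known crawler networks get an empty key, and nginx simply does not rate-limit an empty key. A sketch that builds on the basic limit_req setup shown earlier; the CIDR is a placeholder to replace with the ranges each search engine publishes (or with addresses verified via reverse DNS).

```nginx
# Sketch: exempt listed crawler networks from the per-IP request limit.
geo $limited {
    default        1;
    66.249.64.0/19 0;   # placeholder: replace with verified crawler ranges
}

map $limited $limit_key {
    0 "";                      # empty key => limit_req is not applied
    1 $binary_remote_addr;
}

limit_req_zone $limit_key zone=perip:10m rate=5r/s;

server {
    location / {
        limit_req zone=perip burst=10 nodelay;
    }
}
```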
1
vote
1 answer

IIS 6 anti-data-harvesting/scraping

We have a page on our extranet website that exposes information we would like to prevent from being data harvested. We have done the due diligence of encrypting the URL parameters to make it hard for the end-user to generate links for data…
1
vote
1 answer

Can a scraping bot have JavaScript enabled?

I've got a few thousand requests that seem to be coming from a client with JavaScript enabled, and I'm wondering if that client could be a bot.
Emanuil Rusev
  • 801
  • 1
  • 9
  • 16
1
vote
0 answers

Using Tunnelbroker to load-balance a Node.JS web scraper

I was wondering if I could assign a block of IPv6 addresses from Tunnelbroker to the Linux VPS that hosts my Node.JS web-scraping app (Puppeteer), to achieve IP rotation, load-balance my scraping requests, and minimize the chances of the…
Hackel
  • 11
  • 2
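The underlying technique, whatever the language, is to add addresses from the routed /64 that Tunnelbroker assigns to the VPS's interface and then bind each outgoing connection to a different one. The question's app is Node.JS, but as a hypothetical illustration here is the same idea in Python with a requests transport adapter; the 2001:db8:: addresses are documentation placeholders.

```python
# Hypothetical illustration: rotate the source IPv6 address per request by
# binding sockets to addresses already configured on the host's interface.
import itertools
import requests
from requests.adapters import HTTPAdapter

class SourceAddressAdapter(HTTPAdapter):
    """Transport adapter that binds outgoing connections to a fixed local address."""
    def __init__(self, source_address: str, **kwargs):
        self.source_address = (source_address, 0)
        super().__init__(**kwargs)

    def init_poolmanager(self, *args, **kwargs):
        kwargs["source_address"] = self.source_address
        super().init_poolmanager(*args, **kwargs)

addresses = itertools.cycle(["2001:db8::10", "2001:db8::11", "2001:db8::12"])

def fetch(url: str) -> str:
    addr = next(addresses)                    # pick the next source address
    session = requests.Session()
    adapter = SourceAddressAdapter(addr)
    session.mount("http://", adapter)
    session.mount("https://", adapter)
    return session.get(url, timeout=30).text
```

Each address must already be assigned to the interface (for example with ip -6 addr add), or the bind will fail.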