Questions tagged [scraping]

26 questions
8
votes
7 answers

How to avoid being scraped?

We have a searchable database (DB). We limit results to 15 per page and to 100 results in total, yet we still get people trying to scrape the site. We already ban clients that hit it too quickly. I was wondering if there is anything else that we can do.…
Randin
  • 183
  • 3
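One approach that complements banning fast clients is to throttle per client inside the application itself, so a scraper that stays just under the ban threshold still gets slowed down. Below is a minimal, hypothetical Python sketch of a sliding-window limiter; the window length, request budget, and the idea of answering over-limit clients with a 429 or a CAPTCHA are assumptions for illustration, not details from the question.

```python
# Hypothetical application-level rate limiter (sliding window per client IP).
# WINDOW_SECONDS and MAX_REQUESTS are placeholder values to tune.
import time
from collections import defaultdict, deque

WINDOW_SECONDS = 60   # length of the sliding window
MAX_REQUESTS = 30     # allowed requests per IP within the window

_history: dict[str, deque] = defaultdict(deque)

def allow_request(client_ip: str) -> bool:
    """Return False once an IP exceeds MAX_REQUESTS within WINDOW_SECONDS."""
    now = time.monotonic()
    q = _history[client_ip]
    while q and now - q[0] > WINDOW_SECONDS:   # discard timestamps outside the window
        q.popleft()
    if len(q) >= MAX_REQUESTS:
        return False                           # e.g. respond with HTTP 429 or a CAPTCHA
    q.append(now)
    return True
```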
8
votes
2 answers

Most efficient (time, cost) way to scrape 5 million web pages?

I have a list of web pages that I need to scrape and parse, then store the resulting data in a database. The total is around 5,000,000. My current assumption is that the best approach is to deploy ~100 EC2 instances, provide each instance…
sam
  • 211
  • 2
  • 6
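For scale, one machine running a modest thread pool can often cover a surprising share of 5,000,000 fetches, so the number of instances may matter less than bandwidth and per-site politeness limits. The sketch below is a hypothetical single-machine worker using requests and concurrent.futures; the URL list, the parse/store helpers, and the concurrency level are all assumptions.

```python
# Hypothetical single-machine scrape worker: fetch concurrently, parse, store.
from concurrent.futures import ThreadPoolExecutor, as_completed
import requests

def fetch(session: requests.Session, url: str) -> tuple[str, str]:
    resp = session.get(url, timeout=30)
    resp.raise_for_status()
    return url, resp.text

def parse(html: str) -> dict:
    return {"length": len(html)}          # placeholder parser

def store_result(url: str, data: dict) -> None:
    pass                                  # placeholder: insert into the database

def scrape_all(urls: list[str], workers: int = 50) -> None:
    session = requests.Session()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [pool.submit(fetch, session, u) for u in urls]
        for fut in as_completed(futures):
            try:
                url, html = fut.result()
            except requests.RequestException:
                continue                  # a real run would log and retry
            store_result(url, parse(html))
```

At around 50 in-flight requests averaging a second each, a single worker fetches on the order of a few million pages per day, so a small fleet sharing a work queue may be enough.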
4
votes
4 answers

How easy/expensive is it to adopt Google Mini/Google Appliance for intranet search?

Out of curiosity, is anyone here using Google Mini or Google Search Appliance to provide intranet search? Was it easy to set up? What kind of prices do they charge (ball park figure, I'm sure it depends on the customer)?
username
  • 4,725
  • 18
  • 54
  • 78
4
votes
0 answers

How to enable JavaScript for wget in Linux when grabbing a website?

I use wget like this to save a site: wget --page-requisites --no-parent --mirror http://example.com/index.html -P /home/. In some cases it does NOT work; the error is: This site requires Javascript to work, please enable Javascript in your browser or…
AriaData
  • 53
  • 1
  • 1
  • 4
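wget only saves what the server returns and never executes JavaScript, so pages whose content is built client-side need a headless browser instead of (or in addition to) wget. Here is a hypothetical sketch using Playwright for Python, which is an assumption not mentioned in the question (pip install playwright, then playwright install chromium).

```python
# Hypothetical: render a JavaScript-dependent page in headless Chromium
# and save the resulting HTML to disk.
from pathlib import Path
from playwright.sync_api import sync_playwright

def save_rendered_page(url: str, out_file: str) -> None:
    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page()
        page.goto(url, wait_until="networkidle")   # wait for client-side JS to settle
        Path(out_file).write_text(page.content(), encoding="utf-8")
        browser.close()

if __name__ == "__main__":
    save_rendered_page("http://example.com/index.html", "index.html")
```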
3
votes
3 answers

IP addresses of spiders and “official” web bots

Is there an official API to iplists.com from where I can get the list of spiders? My intention is to whitelist these IPs for site scraping.
Quintin Par
  • 4,293
  • 10
  • 46
  • 72
3
votes
1 answer

What to do about spoofed user agents? Scrapers pretending to be spiders

I've been following a few spiders in our logs, and a traceroute on their IPs shows they are in fact EC2 instances. The user agents are listed as Googlebot and msnbot, but they are not Google or MS IPs. Is there anything I can do, is…
Ryan Detzel
  • 687
  • 3
  • 7
  • 20
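The usual way to settle this is forward-confirmed reverse DNS: resolve the client IP to a hostname, check that it belongs to the crawler's documented domain, then resolve that hostname forward and confirm it maps back to the same IP. A minimal Python sketch; the domain suffixes and the sample address are illustrative and should be checked against each engine's documentation.

```python
# Forward-confirmed reverse DNS check for crawler IPs (sketch).
import socket

CRAWLER_SUFFIXES = (".googlebot.com", ".google.com", ".search.msn.com")

def is_genuine_crawler(ip: str) -> bool:
    try:
        host = socket.gethostbyaddr(ip)[0]              # reverse lookup: IP -> hostname
    except socket.herror:
        return False
    if not host.endswith(CRAWLER_SUFFIXES):
        return False
    try:
        forward_ips = socket.gethostbyname_ex(host)[2]  # forward lookup: hostname -> IPs
    except socket.gaierror:
        return False
    return ip in forward_ips                            # must round-trip to the same IP

# Example (illustrative address believed to lie in Googlebot's range):
# print(is_genuine_crawler("66.249.66.1"))
```

A spoofed EC2 client fails at the reverse-lookup step, since its PTR record points at an amazonaws.com hostname rather than the crawler's domain.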
2
votes
1 answer

Protect nginx from hammering

I would like to protect my nginx + Passenger + Rails 3 HTTP server from hammering/scraping. If you try to scrape Google, it shows you a CAPTCHA when you make too many requests from the same IP. What module should I use? Thanks.
xpepermint
  • 267
  • 3
  • 9
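nginx covers this out of the box with ngx_http_limit_req_module (and ngx_http_limit_conn_module for concurrent connections), so no third-party module is required. A minimal sketch; the zone size, rate, and burst values are placeholders to tune.

```nginx
# Sketch for nginx.conf; rate and burst are placeholder values.
http {
    # One state slot per client IP, 10 MB shared memory, 10 requests/second.
    limit_req_zone $binary_remote_addr zone=perip:10m rate=10r/s;

    server {
        location / {
            # Absorb short bursts of up to 20 requests, then reject the excess.
            limit_req zone=perip burst=20 nodelay;
        }
    }
}
```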
2
votes
2 answers

Amazon EC2 + S3 + Python + Scraping - The cheapest way of doing this?

I have tapped into Amazon's AWS offerings; please tell me, at a high level, if I am thinking about this right. I have a few Python scraping scripts on my local machine. I want to use AWS for super-fast internet connectivity at a cheaper price - win/win! I…
ThinkCode
  • 184
  • 1
  • 10
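At a high level that split is common: run the scraper on EC2 for the bandwidth, write results to S3 for cheap durable storage, and shut the instance down when the job finishes. Here is a hypothetical sketch of the S3 half using boto3; the bucket name and key scheme are assumptions.

```python
# Hypothetical: store each scraped page in S3 with boto3.
import hashlib
import boto3

s3 = boto3.client("s3")

def upload_page(bucket: str, url: str, html: str) -> str:
    """Write one page to S3 under a key derived from its URL; return the key."""
    key = "pages/" + hashlib.sha256(url.encode()).hexdigest() + ".html"
    s3.put_object(Bucket=bucket, Key=key,
                  Body=html.encode("utf-8"), ContentType="text/html")
    return key

# upload_page("my-scrape-bucket", "http://example.com/", "<html>...</html>")
```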
2
votes
1 answer

How can I use fail2ban to block scrapers?

I have a media site and a problem with users coming along and scraping all of the content. I placed an invisible URL on the page to catch spiders, which immediately blocks the IP, but some people have figured out the URL scheme and are creating their own…
coneybeare
  • 611
  • 1
  • 7
  • 14
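One way to wire the honeypot into fail2ban is a custom filter that matches the trap URL in the web server access log, with a jail that bans on the first hit. A sketch assuming an nginx/Apache combined log format; the trap path, log path, and ban time are placeholders.

```ini
# /etc/fail2ban/filter.d/honeypot.conf  (sketch; the trap path is a placeholder)
[Definition]
failregex = ^<HOST> .* "GET /trap-do-not-follow

# /etc/fail2ban/jail.local
[honeypot]
enabled  = true
port     = http,https
filter   = honeypot
logpath  = /var/log/nginx/access.log
maxretry = 1
bantime  = 86400
```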
1
vote
0 answers

HTTrack stores extensionless pages with a .html appended

I'd like to mirror an old site of mine to local files. I've used httrack for this in the past, but this time I'm having a problem that I thought I had figured out before and can't seem to solve now. My site has a lot of extensionless pages, which…
boomhauer
  • 151
  • 1
  • 1
  • 6
1
vote
1 answer

Suspected malicious activity by one of my site's users; any way to know for sure?

In the course of about 2 hours, a logged-in user on my website accessed roughly 1,600 pages in a way that looks suspiciously like a bot. I am concerned because users must purchase access to the site in order to get full access to our protected…
Nick S.
  • 131
  • 1
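Short of asking the user, the access log usually answers this: 1,600 pages in two hours is one request roughly every 4.5 seconds, and a human's gaps between requests vary far more than a script's. Below is a hypothetical sketch that summarizes request pacing for one account, assuming the timestamps have already been parsed out of the log.

```python
# Hypothetical: summarize inter-request gaps for one user's timestamps.
from datetime import datetime
from statistics import mean, pstdev

def pacing_summary(timestamps: list[datetime]) -> dict:
    ts = sorted(timestamps)
    gaps = [(b - a).total_seconds() for a, b in zip(ts, ts[1:])]
    return {
        "requests": len(ts),
        "mean_gap_s": round(mean(gaps), 2),
        "stdev_gap_s": round(pstdev(gaps), 2),  # near-zero spread suggests automation
        "min_gap_s": round(min(gaps), 2),       # sub-second minimums also suggest a script
    }
```

Other useful signals are the user-agent string, whether page assets (CSS, images) were ever requested, and whether pages were visited in strict ID order.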
1
vote
2 answers

Protection against scraping with nginx

This morning we had a crawler going nuts on our server, hitting our site almost 100 times per second. We'd like to add protection against this. I guess I'll have to use HttpLimitReqModule, but I don't want to block Google/Bing/... How should I do…
bl0b
  • 141
  • 1
  • 6
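limit_req can be combined with geo and map so that requests from known crawler networks get an empty key, and nginx simply does not rate-limit an empty key. A sketch that builds on the basic limit_req setup shown earlier; the CIDR is a placeholder to replace with the ranges each search engine publishes (or with addresses verified via reverse DNS).

```nginx
# Sketch: exempt listed crawler networks from the per-IP request limit.
geo $limited {
    default        1;
    66.249.64.0/19 0;   # placeholder: replace with verified crawler ranges
}

map $limited $limit_key {
    0 "";                      # empty key => limit_req is not applied
    1 $binary_remote_addr;
}

limit_req_zone $limit_key zone=perip:10m rate=5r/s;

server {
    location / {
        limit_req zone=perip burst=10 nodelay;
    }
}
```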
1
vote
1 answer

IIS 6 anti-data-harvesting/scraping

We have a page on our extranet website that exposes information we would like to prevent from being data harvested. We have done the due diligence of encrypting the URL parameters to make it hard for the end-user to generate links for data…
1
vote
1 answer

Can a scraping bot have JavaScript enabled?

I've got a few thousand requests that seem to be coming from a client with JavaScript enabled, and I'm wondering if that client could be a bot.
Emanuil Rusev
  • 801
  • 1
  • 9
  • 16
1
vote
0 answers

Using Tunnelbroker to load-balance a Node.JS web scraper

I was wondering if I could assign a block of IPv6 addresses from Tunnelbroker to the Linux VPS that hosts my Node.JS web-scraping app (Puppeteer), to achieve IP rotation, load-balance my scraping requests, and minimize the chances of the…
Hackel
  • 11
  • 2
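The underlying technique, whatever the language, is to add addresses from the routed /64 that Tunnelbroker assigns to the VPS's interface and then bind each outgoing connection to a different one. The question's app is Node.JS, but as a hypothetical illustration here is the same idea in Python with a requests transport adapter; the 2001:db8:: addresses are documentation placeholders.

```python
# Hypothetical illustration: rotate the source IPv6 address per request by
# binding sockets to addresses already configured on the host's interface.
import itertools
import requests
from requests.adapters import HTTPAdapter

class SourceAddressAdapter(HTTPAdapter):
    """Transport adapter that binds outgoing connections to a fixed local address."""
    def __init__(self, source_address: str, **kwargs):
        self.source_address = (source_address, 0)
        super().__init__(**kwargs)

    def init_poolmanager(self, *args, **kwargs):
        kwargs["source_address"] = self.source_address
        super().init_poolmanager(*args, **kwargs)

addresses = itertools.cycle(["2001:db8::10", "2001:db8::11", "2001:db8::12"])

def fetch(url: str) -> str:
    addr = next(addresses)                    # pick the next source address
    session = requests.Session()
    adapter = SourceAddressAdapter(addr)
    session.mount("http://", adapter)
    session.mount("https://", adapter)
    return session.get(url, timeout=30).text
```

Each address must already be assigned to the interface (for example with ip -6 addr add), or the bind will fail.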