Questions tagged [web-crawler]

For questions about how to use or defend against web-crawlers or web-spiders.

17 questions
85
votes
10 answers

How and why is my site being abused?

I own a popular website that allows people to enter a phone number and get information back about that phone number, such as the name of the phone carrier. It's a free service, but it costs us money for each query so we show ads on the site to help…
Marc
  • 699
  • 1
  • 4
  • 4
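
A common first mitigation for this kind of per-query cost abuse is per-client rate limiting. Below is a minimal sketch of an in-memory fixed-window limiter in Python; the window size and threshold are illustrative assumptions, not values from the question.

```python
import time
from collections import defaultdict

# Minimal fixed-window rate limiter keyed by client IP.
# WINDOW and LIMIT are illustrative values, not taken from the question.
WINDOW = 60          # seconds per window
LIMIT = 10           # lookups allowed per window per IP

_hits = defaultdict(list)   # ip -> list of recent request timestamps

def allow_request(ip: str) -> bool:
    """Return True if this IP is still under its per-window quota."""
    now = time.time()
    recent = [t for t in _hits[ip] if now - t < WINDOW]
    if len(recent) >= LIMIT:
        _hits[ip] = recent
        return False
    recent.append(now)
    _hits[ip] = recent
    return True

if __name__ == "__main__":
    for i in range(12):
        print(i, allow_request("203.0.113.5"))
```

In practice this lives behind the query endpoint (or in the web server / CDN layer) and is combined with CAPTCHAs or API keys, since a single counter per IP is easy to spread across many addresses.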
3
votes
1 answer

How to crawl a web site if content is only visible to registered accounts?

I am reading about the attack and defense strategies of web spiders. Assume I have sensitive information on my website, which should be protected from 3rd-party web spiders. Use case #1: Me: I set the sensitive data only visible to registered user…
TJCLK
  • 818
  • 8
  • 23
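
For context on use case #1: a crawler typically just authenticates like a normal user and reuses the session cookie for every subsequent request. A minimal sketch with Python's requests library, assuming a hypothetical form-based login endpoint and field names:

```python
import requests

BASE = "https://example.com"          # hypothetical site

session = requests.Session()

# Log in once; the Session object keeps the returned cookies
# for all subsequent requests. Endpoint and field names are assumptions.
resp = session.post(f"{BASE}/login",
                    data={"username": "crawler-account",
                          "password": "s3cret"},
                    timeout=10)
resp.raise_for_status()

# Any page that is only visible to registered users can now be
# fetched with the authenticated session.
page = session.get(f"{BASE}/members/secret-report", timeout=10)
print(page.status_code, len(page.text))
```

This is why registration alone only raises the bar: the defense then shifts to per-account rate limits, monitoring, and terms of service rather than to the login wall itself.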
1
vote
1 answer

How do I prove that robots.txt was not provided?

I want to scrape our university's learning platform website, to let myself know via notifications when a new entry is added to any lesson. But I'm scared that they'll add a robots.txt afterwards and sue me or something, I don't know. I just don't have…
Kenan
  • 13
  • 2
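
One way to document the state of robots.txt (or its absence) at crawl time is simply to fetch it on every run and archive the raw response with a timestamp, while also honouring whatever it says. A minimal sketch, assuming a hypothetical platform URL; note that a self-archived copy is weak evidence on its own, it only records what the crawler saw:

```python
import datetime
import urllib.robotparser
import requests

SITE = "https://learn.example.edu"        # hypothetical platform URL

def snapshot_robots(site: str) -> None:
    """Fetch /robots.txt and archive the response as dated evidence."""
    resp = requests.get(f"{site}/robots.txt", timeout=10)
    stamp = datetime.datetime.utcnow().strftime("%Y%m%dT%H%M%SZ")
    with open(f"robots-{stamp}.txt", "w", encoding="utf-8") as fh:
        fh.write(f"# fetched {stamp}, HTTP {resp.status_code}\n")
        fh.write(resp.text if resp.ok else "(no robots.txt served)\n")

def may_fetch(site: str, path: str, agent: str = "my-notifier-bot") -> bool:
    """Honour robots.txt if one exists; an absent file allows everything."""
    rp = urllib.robotparser.RobotFileParser()
    rp.set_url(f"{site}/robots.txt")
    rp.read()
    return rp.can_fetch(agent, f"{site}{path}")

if __name__ == "__main__":
    snapshot_robots(SITE)
    print(may_fetch(SITE, "/courses/list"))
```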
1
vote
1 answer

Why is my web site being scanned for license.txt, and should I be worried?

Lately I am seeing multiple daily 404s for variations of "license.txt", e.g., "wordpress/license.txt", "blog/license.txt", "old/license.txt", "new/license.txt". Here's a little snippet of slightly redacted logfile to illustrate: 5.189.164.217 - -…
C8H10N4O2
  • 113
  • 4
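
Requests like these are typically automated scanners guessing common WordPress install paths, since license.txt reveals the installed version. A quick way to gauge the volume and the sources is to count the probing IPs in the access log; a minimal sketch, assuming a common combined-log layout where the first field is the client IP:

```python
import re
from collections import Counter

LOGFILE = "access.log"                       # path is an assumption
PROBE = re.compile(r'"GET [^"]*license\.txt[^"]*" 404')

ips = Counter()
with open(LOGFILE, encoding="utf-8", errors="replace") as fh:
    for line in fh:
        if PROBE.search(line):
            ips[line.split()[0]] += 1        # first field is the client IP

for ip, hits in ips.most_common(10):
    print(f"{ip:15} {hits} license.txt probes")
```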
1
vote
1 answer

Questions about SOCKS5 security

I'm planning to start a distributed crawler in order to avoid common limitations imposed by servers/CDN like rate limit, region filter, and others. My idea is to have a central server and multiple agents that will run on different networks. These…
fenugurod
  • 13
  • 2
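
For context on the setup being described: routing individual requests through an agent is usually done by exposing the agent as a SOCKS5 proxy and pointing the crawler at it. A minimal sketch with requests (needs the PySocks extra, `pip install requests[socks]`); host, port, and credentials are placeholders:

```python
import requests

# Hypothetical agent exposing a SOCKS5 proxy; credentials are placeholders.
AGENT = "socks5h://crawler:s3cret@10.0.0.12:1080"   # socks5h = DNS via proxy

proxies = {"http": AGENT, "https": AGENT}

resp = requests.get("https://httpbin.org/ip", proxies=proxies, timeout=15)
print(resp.json())          # shows the agent's egress IP, not the central server's
```

Using the `socks5h` scheme keeps DNS resolution on the agent side, which matters if the goal is to avoid region filters tied to the resolver as well as to the client IP.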
1
vote
1 answer

Why fingerprint a browser if a fingerprint can be replayed?

I'm facing an issue with rampant scraping and abuse on a website that costs me a good chunk of money to maintain. So, I have been looking to implement a few solutions, and apparently these solutions fingerprint the client in some form. However, the…
user22260
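
One reason fingerprints remain useful despite being replayable is that they are rarely treated as bearer tokens on their own; they are bound to other session attributes, so a replayed value arriving from a different context stands out. A minimal sketch of that idea in Python (the particular attribute mix is an illustrative assumption, not a description of any specific product):

```python
import hashlib
import ipaddress

def session_binding(fingerprint: str, client_ip: str, user_agent: str) -> str:
    """Bind the reported fingerprint to the client's /24 network and User-Agent.

    A fingerprint replayed from a different network or browser then
    produces a different binding value and can be flagged for review.
    """
    net = ipaddress.ip_network(f"{client_ip}/24", strict=False)
    material = f"{fingerprint}|{net}|{user_agent}"
    return hashlib.sha256(material.encode()).hexdigest()

original = session_binding("abc123", "198.51.100.7", "Mozilla/5.0 ...")
replayed = session_binding("abc123", "203.0.113.9", "python-requests/2.31")
print(original == replayed)   # False: same fingerprint, different context
```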
1
vote
1 answer

Does a searchable public database exist of (hostname, IP) mappings?

This question is not about the trivial usage of forward/reverse DNS. Getting the IP of a hostname is trivial (DNS), and with reverse DNS we can (typically) get a single hostname for an IP. However, particularly for massive http…
peterh
  • 2,938
  • 6
  • 25
  • 31
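
For reference, the "trivial" forward and reverse lookups the question sets aside look like this with the Python standard library; the harder problem being asked about is the many-hostnames-per-IP direction, which ordinary reverse DNS does not provide:

```python
import socket

host = "example.com"

# Forward lookup: hostname -> one or more IP addresses.
ips = sorted({info[4][0] for info in socket.getaddrinfo(host, None)})
print(host, "->", ips)

# Reverse lookup: IP -> (typically) a single PTR hostname.
for ip in ips:
    try:
        ptr, _, _ = socket.gethostbyaddr(ip)
        print(ip, "->", ptr)
    except socket.herror:
        print(ip, "-> no PTR record")
```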
1
vote
0 answers

I run a web crawler on my local computer, can my ISP detect that?

I'm using an Internet plan with 100 GB of monthly bandwidth from my ISP, and I made a simple web crawler for fun that runs on my personal computer 24/7. The crawler is consuming all of the bandwidth, and I configured it to skip downloading media files…
AccountantM
  • 296
  • 1
  • 6
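
As a side note on the "skip downloading media files" part: one common way to do that is to issue a HEAD request first and only GET responses whose Content-Type looks like text or HTML. A minimal sketch:

```python
import requests

def fetch_if_textual(url: str):
    """GET the URL only when a preliminary HEAD says it is text/HTML."""
    head = requests.head(url, allow_redirects=True, timeout=10)
    ctype = head.headers.get("Content-Type", "")
    if not ctype.startswith(("text/", "application/xhtml")):
        return None                      # skip images, video, PDFs, ...
    return requests.get(url, timeout=10).text

page = fetch_if_textual("https://example.com/")
print("skipped" if page is None else f"fetched {len(page)} bytes")
```

As for the visibility question itself: the ISP sees the volume and the destinations of the traffic regardless of what the crawler chooses to download.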
0
votes
0 answers

Threats that JavaScript poses to a web crawler

I'm writing a simple crawler with node.js, which searches for web pages and conditionally executes any JavaScript present. The problem is that in doing so, I execute code from untrusted sources in my node.js environment. Can running untrusted code…
0
votes
1 answer

How does msnbot keep finding my unpublished admin url?

I am a website developer (mainly using MVC.NET). Recently, we have been contacted by a hacker. He claimed that he knows our admin URL. The problem is we do not publish or put the admin URL anywhere on our webpage. The only place where the URL is…
Sam
  • 109
  • 1
0
votes
0 answers

Risks of web crawlers on public buckets

So I have some data that isn't overly sensitive, but I'm still on the fence about whether we should invest the additional time in managing it as a private resource, vs. just leaving it publicly available. The data (images & PDFs) are to be hosted on AWS'…
Francky_V
  • 103
  • 3
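
If the decision ends up being to keep the bucket private, enforcing that on S3 is a single API call; a minimal boto3 sketch, with the bucket name as a placeholder:

```python
import boto3

s3 = boto3.client("s3")
BUCKET = "my-image-and-pdf-bucket"      # placeholder name

# Enable all four "Block Public Access" settings for the bucket,
# so objects cannot be exposed via ACLs or bucket policies.
s3.put_public_access_block(
    Bucket=BUCKET,
    PublicAccessBlockConfiguration={
        "BlockPublicAcls": True,
        "IgnorePublicAcls": True,
        "BlockPublicPolicy": True,
        "RestrictPublicBuckets": True,
    },
)
```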
0
votes
1 answer

Why can't you give special security cookies to a specific crawler so that it could securely crawl the website?

In the current day and age we have the problem of malicious/spam crawlers and similar concerns. My suggestion would be to implement cookie support for crawling, by which I mean issuing specific cookies with a crawler ID (at best refreshed using…
Munchkin
  • 212
  • 2
  • 10
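
A related mechanism that is already deployed in practice is verifying a claimed crawler by reverse-resolving its IP and then forward-resolving the result, rather than handing out a shared secret that could leak or be replayed. A minimal sketch of that check, using Googlebot's documented hostnames as the example:

```python
import socket

def is_verified_googlebot(client_ip: str) -> bool:
    """Reverse-resolve the IP, check the domain, then confirm forward."""
    try:
        ptr, _, _ = socket.gethostbyaddr(client_ip)
    except socket.herror:
        return False
    if not ptr.endswith((".googlebot.com", ".google.com")):
        return False
    # Forward-confirm: the PTR name must resolve back to the same IP.
    try:
        return client_ip in {info[4][0] for info in socket.getaddrinfo(ptr, None)}
    except socket.gaierror:
        return False

print(is_verified_googlebot("66.249.66.1"))   # an address in Googlebot's range
```

A cookie-based scheme would have the same weakness as any shared secret: whoever obtains the cookie inherits the crawler's privileges, whereas the DNS check above ties trust to the network the request actually came from.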
0
votes
0 answers

How to Spoof JA3 Signature?

I am using the Python requests library to make HTTP calls. However, the website's bot detection is using JA3 fingerprint verification and blocking me. Is there any way I can spoof the JA3 signature?
Ditti
  • 1
  • 1
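
requests does not expose the full TLS ClientHello, but part of the JA3 string (the cipher list) can be changed by mounting an adapter with a custom SSLContext; matching the rest of a browser's fingerprint generally requires a client that imitates a browser TLS stack. A minimal sketch of the adapter approach, with an illustrative cipher string:

```python
import ssl
import requests
from requests.adapters import HTTPAdapter

class CipherAdapter(HTTPAdapter):
    """Mountable adapter that sends a non-default TLS cipher list,
    which alters the cipher portion of the resulting JA3 hash."""

    def __init__(self, ciphers: str, **kwargs):
        self._ciphers = ciphers
        super().__init__(**kwargs)

    def init_poolmanager(self, *args, **kwargs):
        ctx = ssl.create_default_context()
        ctx.set_ciphers(self._ciphers)          # illustrative cipher selection
        kwargs["ssl_context"] = ctx
        return super().init_poolmanager(*args, **kwargs)

session = requests.Session()
session.mount("https://", CipherAdapter("ECDHE+AESGCM:ECDHE+CHACHA20"))
print(session.get("https://httpbin.org/get", timeout=15).status_code)
```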
0
votes
1 answer

How do attackers hit a website with thousands of similar but distinct IP addresses?

I have a website that is being hit with invalid URL requests by thousands of distinct IP addresses, never the same one used twice. Most of them are in a few ranges of IP addresses and often just go up sequentially. Could this be a zombie botnet of…
Pat James
  • 141
  • 1
  • 6
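
A useful first step when triaging this kind of traffic is to aggregate the addresses by network prefix instead of looking at them individually; sequential addresses then collapse into a handful of ranges that can be looked up or blocked. A minimal sketch using the standard library (the log format is an assumption):

```python
import ipaddress
from collections import Counter

LOGFILE = "access.log"                  # assumed combined-format log

prefixes = Counter()
with open(LOGFILE, encoding="utf-8", errors="replace") as fh:
    for line in fh:
        ip = line.split()[0]            # first field is the client IP
        try:
            net = ipaddress.ip_network(f"{ip}/24", strict=False)
        except ValueError:
            continue                    # skip malformed lines
        prefixes[str(net)] += 1

for net, hits in prefixes.most_common(15):
    print(f"{net:20} {hits} requests")
```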
0
votes
1 answer

Are AWS signed URLs crawled by Google?

I have used an Amazon pre-signed URL to share content. https://boto3.amazonaws.com/v1/documentation/api/latest/guide/s3-presigned-urls.html Is Google able to crawl this URL? I'm sharing this URL with just one client. What about other services? There…
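
For reference, a pre-signed URL is just an ordinary object URL with a signature and expiry in the query string, so anything that sees it (the client, proxies, link scanners) can fetch it until it expires; crawlers can only reach it if the link is exposed somewhere they crawl, and a short expiry limits that window. A minimal boto3 sketch with placeholder bucket and key names:

```python
import boto3

s3 = boto3.client("s3")

# Bucket and key are placeholders; expiry is kept deliberately short.
url = s3.generate_presigned_url(
    "get_object",
    Params={"Bucket": "my-private-bucket", "Key": "report.pdf"},
    ExpiresIn=300,          # seconds the link stays valid
)
print(url)   # share only with the intended client; treat it like a secret
```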