
I'm planning to build a distributed crawler in order to avoid common limitations imposed by servers/CDNs, such as rate limiting, region filtering, and others.

My idea is to have a central server and multiple agents running on different networks. These agents will be SOCKS5 servers. The central server will round-robin requests across the pool of agents (SOCKS5 servers) to reach the origin (website).
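The round-robin part of this design can be sketched in a few lines of Python. This is only an illustration of the scheduling idea; the agent hostnames below are placeholders, not real servers:

```python
import itertools

# Hypothetical pool of agent endpoints (host, port); placeholder addresses.
AGENTS = [
    ("agent-a.example.net", 1080),
    ("agent-b.example.net", 1080),
    ("agent-c.example.net", 1080),
]

_pool = itertools.cycle(AGENTS)

def next_proxy() -> str:
    """Return the next SOCKS5 proxy URL in round-robin order."""
    host, port = next(_pool)
    return f"socks5://{host}:{port}"
```

Each call advances the cycle, so consecutive requests leave through different networks, which is what spreads the load across rate limits and regions.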

  • Is the origin able to detect the server IP?
  • I don't have control over the agent (SOCKS5 server), so how safe is this connection? Is the owner of the SOCKS5 server able to see what I'm doing, or even modify the request like a MitM attack?
  • Does something like this already exist?
schroeder
fenugurod
  • "My idea is to have a central server and multiple agents that will run on different networks. These agents will be SOCKS5 servers" - why? It would be just as much effort to build a distributed system where data is processed on the peripheral nodes, and that solution would scale much, MUCH more easily. – symcbean Jan 01 '20 at 21:24
  • Yes, it would be easier, but I can't trust the agents; that's the point of moving the data back to the server. The only thing I need is a different endpoint to the internet on every request. – fenugurod Jan 01 '20 at 23:07

1 Answer


Is the origin able to detect the server IP?

Unless the SOCKS server explicitly provides information about it (for example by adding an X-Forwarded-For header), the website cannot detect the originating IP address (what you call the "server").
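That logic can be sketched from the origin's point of view. The X-Forwarded-For header name is real, but the function and the addresses below are purely illustrative:

```python
def client_ip_seen_by_origin(peer_addr: str, headers: dict) -> str:
    """What the website can learn about the requester.

    The origin only ever sees the TCP peer address (here, the SOCKS5
    agent), unless the proxy voluntarily reveals the real client in a
    header such as X-Forwarded-For.
    """
    return headers.get("X-Forwarded-For", peer_addr)
```

With an agent that adds no such header, the origin is left with nothing but the agent's own address.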

I don't have control over the agent (SOCKS5 server), so how safe is this connection? Is the owner of the SOCKS5 server able to see what I'm doing, or even modify the request like a MitM attack?

SOCKS5 does not provide any security by itself. If this is a plain HTTP request, the operator of the SOCKS5 server can see everything and even manipulate the request (for example by adding an X-Forwarded-For header). If this is instead HTTPS, and you neither import a MITM CA as trusted nor ignore certificate errors, then the SOCKS operator cannot modify the traffic and can at most see the domain and IP you access and some traffic patterns.
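That metadata leak is visible directly in the protocol: the SOCKS5 CONNECT message (RFC 1928) carries the destination host in the clear, so the operator always learns which domain you are connecting to even when the payload itself is TLS. A minimal sketch of building that message for a domain-name target:

```python
import struct

def socks5_connect_request(host: str, port: int) -> bytes:
    """Build a SOCKS5 CONNECT message (RFC 1928) for a domain-name target.

    The destination hostname is embedded unencrypted, which is why the
    proxy operator can always see the domain you access, HTTPS or not.
    """
    addr = host.encode("idna")
    # VER=5, CMD=1 (CONNECT), RSV=0, ATYP=3 (domain name),
    # then length-prefixed hostname and the port in network byte order.
    return b"\x05\x01\x00\x03" + bytes([len(addr)]) + addr + struct.pack(">H", port)
```

Inspecting the bytes for a request to example.com:443 shows the literal string `example.com` sitting in the message.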

Something like this already exists?

Off-topic, but services like this do already exist. Note, though, that using such a service (or running your own) does not make it any more legal to violate a website's terms of service.

Steffen Ullrich
  • Thanks for your answer, it was clarifying. About the terms of service, this is kind of opaque in my opinion. We can think of this as something immoral rather than illegal. The terms of service will be respected by each of the agents; it's similar to how Google operates. I worked at a very big eCommerce website and we always had problems with Google's bots. They respect the traffic limits on each IP they have, but they have thousands of IPs, and we always had to deal with that aggregate traffic. – fenugurod Jan 01 '20 at 23:20