I'm planning to build a distributed crawler to work around common restrictions imposed by servers/CDNs, such as rate limits, region filters, and the like.
My idea is to have a central server and multiple agents running on different networks. Each agent is a SOCKS5 server. The central server will round-robin requests across the pool of agents (SOCKS5 servers) to reach the origin (the website).
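To make the round-robin part concrete, here is a minimal sketch of the dispatch logic on the central server. The agent addresses are hypothetical placeholders, and the proxy-dict shape assumes the `requests` library with SOCKS support (`pip install requests[socks]`):

```python
import itertools
from typing import Dict, List


class ProxyPool:
    """Rotates through a pool of SOCKS5 agents in round-robin order."""

    def __init__(self, agents: List[str]):
        # agents: hypothetical "host:port" addresses of the SOCKS5 servers
        self._cycle = itertools.cycle(agents)

    def next_proxies(self) -> Dict[str, str]:
        agent = next(self._cycle)
        # 'socks5h' makes DNS resolution happen on the agent's side,
        # so the origin sees neither the central server's IP nor its lookups
        return {
            "http": f"socks5h://{agent}",
            "https": f"socks5h://{agent}",
        }


pool = ProxyPool(["10.0.0.1:1080", "10.0.0.2:1080"])
```

With `requests`, each fetch would then be something like `requests.get(url, proxies=pool.next_proxies())`, so consecutive requests leave through different agents.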
- Can the origin detect the central server's IP?
- I don't control the agents (SOCKS5 servers), so how safe is this connection? Can the owner of a SOCKS5 server see what I'm doing, or even modify the request, like a MITM attack?
- Does something like this already exist?