Context
I'm working on an application where one of the features requires scraping metadata from resources available over HTTP/HTTPS. End users submit these resources as URLs; we then issue an HTTP request to each URL and parse the response body to pull relevant metadata (title, meta tags, etc.) from the HTML. We enforce HTTP/HTTPS as the protocol of all submitted URLs.
Question/Concern
Is there an attack vector where an HTTP request to a user-submitted resource could allow remote code execution, access to the file system, or access to network interfaces that are not exposed externally?
My concern is that there may be a network interface (loopback, localhost, etc) that when requested from the box could result in the application retrieving and displaying sensitive data back to the user.
What I've already tried/considered
- Server-side validation will only allow submission of HTTP/HTTPS resources
- Redirects are followed up to a limit, so it is not sufficient to only sanitize the initial user-submitted resource
- Any resource submitted without a protocol is forced to HTTPS
- Appropriate measures are taken to ensure that neither user-submitted URLs nor values scraped from HTTP responses can cause SQL injection
- Measures are taken to prevent/mitigate DoS attacks that rely on infinite redirect loops
- Requests to localhost, 127.0.0.1, or other hosts that are not at a fully qualified domain name may need to be blocked
- Is there a risk that the user could cause a request to be invoked against the Docker network or other similar network interfaces on the box?
- One risk I see here is that a malicious user sets up a host that redirects back to localhost or an on-box network. We wouldn't allow an initial connection to localhost or similar, but if we don't also block redirects to those interfaces, it could be a problem.
- Some kind of sandbox/container could help mitigate these attack vectors, but the setup would be prohibitive at this time
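
To make the blocking idea in the list above concrete, here is a minimal sketch of the check I have in mind, assuming Python and only the standard library. The names `is_public_ip` and `is_safe_url` are hypothetical helpers of my own, not part of any existing library. The assumption is that automatic redirects are disabled in the HTTP client and that this check is run on the submitted URL and again on every redirect target before it is followed.

```python
import ipaddress
import socket
from urllib.parse import urlsplit

ALLOWED_SCHEMES = {"http", "https"}


def is_public_ip(address: str) -> bool:
    """Return True only for addresses allocated for public networks."""
    try:
        ip = ipaddress.ip_address(address)
    except ValueError:
        return False  # not a parseable IP address: treat as unsafe
    # Unwrap IPv4-mapped IPv6 addresses such as ::ffff:127.0.0.1
    if ip.version == 6 and ip.ipv4_mapped is not None:
        ip = ip.ipv4_mapped
    # is_global is False for loopback, private (10/8, 172.16/12,
    # 192.168/16), link-local (169.254/16) and other reserved ranges
    return ip.is_global


def is_safe_url(url: str) -> bool:
    """Hypothetical pre-flight check for one hop (initial URL or redirect)."""
    parts = urlsplit(url)
    if parts.scheme not in ALLOWED_SCHEMES or not parts.hostname:
        return False
    try:
        port = parts.port or (443 if parts.scheme == "https" else 80)
        infos = socket.getaddrinfo(parts.hostname, port, type=socket.SOCK_STREAM)
    except (ValueError, socket.gaierror):
        return False  # malformed port or unresolvable host
    # Reject the URL if ANY resolved address is non-public
    return bool(infos) and all(is_public_ip(info[4][0]) for info in infos)
```

Resolving the hostname instead of string-matching it is deliberate: a name can resolve to 127.0.0.1 or to a Docker bridge address just as easily as a literal IP can. One gap I'm aware of in this sketch is that the HTTP client resolves the name again when it connects, so a host could return a different address the second time; I assume closing that gap would require connecting to the already-validated IP.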