What risks exist when executing HTTP requests to user-submitted resources?

Question

Context

I'm working on an application where one of the features requires scraping metadata from resources available over HTTP/HTTPS. These resources are submitted by end-users in the form of a URL, and we then invoke an HTTP request to the resource and parse the response body to pull relevant metadata (title, meta tags, etc) from the HTML response. We enforce HTTP/HTTPS as the protocol of all submitted URLs.

Question/Concern

Is there an attack vector where an HTTP request to a user-submitted resource could allow remote code execution, access to the file system, or access to network interfaces that are not exposed externally?

My concern is that there may be a network interface (loopback, localhost, etc) that when requested from the box could result in the application retrieving and displaying sensitive data back to the user.

What I've already tried/considered

- Server-side validation will only allow submission of HTTP/HTTPS resources
- Redirects are followed up to a limit, so it is not sufficient to only sanitize the initial user-submitted resource
- Any resource submitted without a protocol is forced to HTTPS
- Appropriate measures are taken to ensure that user-submitted resources cannot cause SQL injection and that values scraped from HTTP resources cannot cause SQL injection
- Measures are taken to prevent/mitigate DoS attacks facilitated through the following of infinite redirect loops
- We may need to block requests to localhost, 127.0.0.1, or other resources that are not at a fully qualified domain name
- Is there a risk that the user could cause a request to be invoked against the Docker network or other similar network interfaces on the box?
  - One risk I see here is that a malicious user sets up a redirect host that redirects back to localhost or an on-box network. We wouldn't allow an initial connection to localhost, etc., but if we don't prevent redirect requests to those interfaces as well, it could be a problem.
- Some kind of sandbox/container could help mitigate attack vectors; however, the setup would be a bit prohibitive at this time

@multithr3at3d thanks for the link! This at least gives me the name of the attack (SSRF). I'll review this more closely and see if it answers all my questions. — Vigs, May 02 '20 at 21:38
Cloud VMs generally have a metadata service that responds to internal HTTP calls and can hand back credentials; make sure you don't expose it. I like your idea of an external host redirecting internally -- that's pretty cool. Needless to say, it'll be tough to secure, and depending on the nature of the workload, you might end up with legal troubles, as you're effectively proxying traffic for someone. — keithRozario, May 03 '20 at 12:39

Pedro · Accepted Answer · 2020-05-04T13:06:06.993

Conceptually yes, this opens up a (quite obvious) way for external sources to manipulate what ends up in your application. Whether that translates into a vulnerability depends entirely on what you do with the information.

It is critical that you exhaustively filter anything that is brought in. Consider applying a whitelist of characters, or several whitelists depending on which field you are filtering. It also helps not to interpret anything that comes back: parse responses as plain text only.
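As a rough illustration of "not interpreting anything that comes back", here is a minimal sketch that pulls the title and meta tags out of a response body using only Python's standard-library `html.parser`, which tokenizes markup without executing scripts or fetching sub-resources. The names `MetadataParser` and `extract_metadata` are made up for this example, and the sketch assumes the body has already been decoded to text and size-limited elsewhere.

```python
from html.parser import HTMLParser


class MetadataParser(HTMLParser):
    """Collects <title> text and <meta name/property, content> pairs only."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.title_parts = []
        self.meta = {}
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag == "meta":
            attr = dict(attrs)
            name = attr.get("name") or attr.get("property")
            content = attr.get("content")
            if name and content is not None:
                self.meta[name.lower()] = content

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        # Text is only kept while inside <title>; everything else is ignored.
        if self._in_title:
            self.title_parts.append(data)


def extract_metadata(body_text):
    """Return the title and meta tags found in an HTML string (hypothetical helper)."""
    parser = MetadataParser()
    parser.feed(body_text)
    parser.close()
    return {"title": "".join(parser.title_parts).strip(), "meta": parser.meta}
```

The extracted values are still untrusted strings; they need the same filtering and escaping as any other user input before being stored or displayed.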

Definitely control remote URLs (including the protocol; make sure it's http:// or https://) and redirects, and reject anything like localhost or RFC 1918 addresses to protect your own network. Also limit the rate at which you access remote resources. Ideally plan to batch these requests and have them done by a separate process or on a different server.
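The address check described above could look roughly like the following sketch, which uses Python's standard `ipaddress`, `socket`, and `urllib.parse` modules. The function names (`is_public_ip`, `resolve_host`, `is_url_allowed`) and the injectable `resolver` parameter are assumptions made for illustration, not part of any library. It treats a URL as acceptable only if the scheme is http or https and every address the host resolves to is globally routable, which excludes loopback, RFC 1918 ranges, link-local addresses such as the cloud metadata endpoint 169.254.169.254, and typical Docker bridge networks.

```python
import ipaddress
import socket
from urllib.parse import urlsplit

ALLOWED_SCHEMES = {"http", "https"}


def is_public_ip(ip_text):
    """Return True only for globally routable, non-multicast addresses."""
    try:
        ip = ipaddress.ip_address(ip_text)
    except ValueError:
        return False
    # Unwrap IPv4-mapped IPv6 (e.g. ::ffff:127.0.0.1) before checking.
    mapped = getattr(ip, "ipv4_mapped", None)
    if mapped is not None:
        ip = mapped
    return ip.is_global and not ip.is_multicast


def resolve_host(host):
    """Resolve a host name to the set of IP address strings it maps to."""
    infos = socket.getaddrinfo(host, None, proto=socket.IPPROTO_TCP)
    return {info[4][0] for info in infos}


def is_url_allowed(url, resolver=resolve_host):
    """Check scheme and resolved addresses; `resolver` is injectable for tests."""
    try:
        parts = urlsplit(url)
    except ValueError:
        return False
    if parts.scheme not in ALLOWED_SCHEMES or not parts.hostname:
        return False
    try:
        addresses = resolver(parts.hostname)
    except OSError:
        return False
    # Reject if the host resolves to nothing or to any non-public address.
    return bool(addresses) and all(is_public_ip(a) for a in addresses)
```

The same check has to run on every redirect hop, not just the submitted URL, so automatic redirect following in the HTTP client should be turned off and each `Location` target validated before it is requested. A check like this can also be bypassed if the DNS answer changes between validation and connection, so it is safer to connect to the address that was actually vetted.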

Presuming that you'll only be showing some of the information pulled in, again make sure you filter and escape everything very carefully, even if it comes back from your own database (since it may not have been properly filtered on the way in). Assume the worst.
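A minimal sketch of that filter-then-escape step, assuming the scraped values end up in an HTML page: control characters are dropped, the value is truncated, and the result is escaped with the standard-library `html.escape` at output time. The helper names and the 200-character limit are arbitrary choices for the example.

```python
import html
import unicodedata

MAX_FIELD_LENGTH = 200  # arbitrary limit chosen for this example


def clean_field(value, max_length=MAX_FIELD_LENGTH):
    """Drop control/format characters, trim whitespace, and truncate."""
    # Unicode categories starting with "C" cover control, format,
    # surrogate, private-use, and unassigned code points.
    text = "".join(ch for ch in value if unicodedata.category(ch)[0] != "C")
    return text.strip()[:max_length]


def render_field(value):
    """Escape a cleaned value for inclusion in HTML text or attributes."""
    return html.escape(clean_field(value), quote=True)
```

Escaping at render time, rather than only on the way into the database, covers the case where an unfiltered value was stored earlier.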

Whatever functionality your application needs, this type of activity carries quite a bit of risk.
