14

From many posts, I know that almost everything in an HTTPS or SSL connection is encrypted. Still, I am wondering whether it is possible to get the URLs out of such a connection if the computer that opens the connection is on a home network and the attacker has access to the Wi-Fi router, including the router's Unix-based OS.

I am not talking about the content of any messages, only the domains that are visited in a browser and possibly the rest of the URLs, like domain.com/thiscategory/site123.

alecxe
jdoe
  • 2
    Fire up Wireshark, try it, and see! Alternatively, see the answer here: https://stackoverflow.com/questions/499591/are-https-urls-encrypted – Bob Brown Dec 27 '17 at 00:34
  • 1
    No, it's not possible. – Luke Park Dec 27 '17 at 00:51
  • 1
    Dupe https://security.stackexchange.com/questions/117536/is-https-url-in-plain-text-at-first-connection https://security.stackexchange.com/questions/7705/does-ssl-tls-https-hide-the-urls-being-accessed https://security.stackexchange.com/questions/34794/if-ssl-encrypts-urls-then-how-are-https-messages-routed and several more linked from those. – dave_thompson_085 Dec 27 '17 at 03:47
  • 1
    TLS encrypts everything above "it" which means that the application protocol on top of TLS is encrypted and thus you can not see the URL. You can however of course see the TCP headers and TLS headers. – mroman Dec 28 '17 at 10:47

4 Answers

20

TL;DR An attacker cannot see anything past the domain.

Structure of an HTTP request

An HTTP request consists of a request line (the method, the path, and the protocol version), a set of headers, and optionally a body. The most common methods are GET, POST, and HEAD, which retrieve a page, transfer data, or request only the response headers, respectively. TLS encrypts the entirety of HTTP traffic, including the method, the path, and the headers. The path from the URL is sent in the request line, directly above the headers. Take this example, with wget loading the page foo.example.com/some/page.html. This text, as ASCII, is sent to the server:

GET /some/page.html HTTP/1.1
User-Agent: Wget/1.19.1 (linux-gnu)
Accept: */*
Accept-Encoding: identity
Host: foo.example.com

The server will then respond with an HTTP status code, some headers of its own, and optionally some data (such as HTML). An example, giving a 301 redirect and some plain text as a response, may be:

HTTP/1.1 301 Moved Permanently
Date: Wed, 27 Dec 2017 04:42:54 GMT
Server: Apache
Location: https://bar.example.com/new/location.html
Content-Length: 56
Content-Type: text/plain

Thank you Mario, but our princess is in another castle!

Which would tell the client that the correct location is elsewhere.

Over plain HTTP, this is exactly what is sent to the site over TCP. TLS works on a layer beneath HTTP, so with HTTPS all of it is encrypted. This includes the page you are accessing with the GET method. Note that, although the Host header is part of the request and thus encrypted, the host can often still be obtained through an rDNS lookup on the IP address, or by checking SNI, which transmits the domain in plaintext.
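
As a rough illustration of where each part of a URL ends up, here is a minimal Python sketch that assembles the request text shown earlier. The function name build_get_request is made up for this example, and the header set simply mirrors the wget request above; real clients send more.

```python
from urllib.parse import urlsplit


def build_get_request(url: str, user_agent: str = "Wget/1.19.1 (linux-gnu)") -> bytes:
    """Assemble the ASCII text of a simple GET request (hypothetical helper)."""
    parts = urlsplit(url)
    # The path (plus any query string) goes into the request line.
    path = parts.path or "/"
    if parts.query:
        path += "?" + parts.query
    lines = [
        f"GET {path} HTTP/1.1",
        f"User-Agent: {user_agent}",
        "Accept: */*",
        "Accept-Encoding: identity",
        # The domain goes into the Host header; the fragment is never sent.
        f"Host: {parts.netloc}",
        "",
        "",
    ]
    return "\r\n".join(lines).encode("ascii")


request = build_get_request("https://foo.example.com/some/page.html#some-fragment")
print(request.decode("ascii"))
```

With HTTPS, every byte of this output travels inside the TLS stream, so neither the request line nor the Host header is readable on the wire.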

Structure of a URL

https://foo.example.com/some/page.html#some-fragment
| proto |    domain    |     path     |  fragment  |
  • proto - There are only two protocols in common use, HTTP and HTTPS.
  • domain - The domain, here foo.example.com (a subdomain of example.com), is detectable with rDNS or SNI.
  • path - The path is completely encrypted and can only be read by the target server.
  • fragment - The fragment is visible only to the web browser and is not transmitted.
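
The same breakdown can be reproduced with Python's standard urllib.parse module. This is only a sketch: the function name eavesdropper_view and the visibility labels are illustrative, and they restate the list above rather than measure anything.

```python
from urllib.parse import urlsplit


def eavesdropper_view(url: str) -> dict:
    """Map each URL component to (value, visibility on the network)."""
    parts = urlsplit(url)
    return {
        "proto": (parts.scheme, "guessable from the destination port"),
        "domain": (parts.hostname, "visible via rDNS or SNI"),
        "path": (parts.path, "encrypted inside TLS"),
        "fragment": (parts.fragment, "never transmitted"),
    }


view = eavesdropper_view("https://foo.example.com/some/page.html#some-fragment")
for name, (value, visibility) in view.items():
    print(f"{name:9} {value:20} {visibility}")
```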

What an attacker can see

So what can an attacker see if you make a request over HTTPS? Let's take the previous hypothetical request from the perspective of a passive eavesdropper on the network. If I wanted to know what you are accessing, I have only limited options:

  • I see you making a web request encrypted with TLS going to 203.0.113.98.
  • I see that the destination port is 443, which I know is used for HTTPS.
  • I do an rDNS lookup and see that IP is used for example.com and example.org.
  • I look at the SNI record and see you are connecting to foo.example.com.

This is all I could do. I would not be able to see the path you are requesting, or even what method you are using, short of heuristic analysis of the sizes and timing of the data being sent and received, known as traffic analysis.
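
To make the SNI step concrete, here is a minimal sketch of how an eavesdropper could pull the server name out of a captured ClientHello. It assumes the whole ClientHello arrives in a single TLS record and that the client sends the SNI extension in plaintext. Both function names are hypothetical. To stay self-contained, the example produces a ClientHello locally with Python's ssl.MemoryBIO instead of capturing one from the network.

```python
import ssl


def client_hello_bytes(hostname: str) -> bytes:
    """Produce the first flight a client would send, without any network I/O."""
    ctx = ssl.SSLContext(ssl.PROTOCOL_TLS_CLIENT)
    incoming, outgoing = ssl.MemoryBIO(), ssl.MemoryBIO()
    tls = ctx.wrap_bio(incoming, outgoing, server_hostname=hostname)
    try:
        tls.do_handshake()
    except ssl.SSLWantReadError:
        pass  # expected: the client is now waiting for the server's reply
    return outgoing.read()


def sni_from_client_hello(record: bytes):
    """Return the plaintext server name from a ClientHello record, or None."""
    if len(record) < 6 or record[0] != 0x16 or record[5] != 0x01:
        return None  # not a handshake record carrying a ClientHello
    pos = 5 + 4                    # record header + handshake header
    pos += 2 + 32                  # client version + random
    pos += 1 + record[pos]         # session id
    pos += 2 + int.from_bytes(record[pos:pos + 2], "big")  # cipher suites
    pos += 1 + record[pos]         # compression methods
    ext_end = pos + 2 + int.from_bytes(record[pos:pos + 2], "big")
    pos += 2
    while pos + 4 <= ext_end:
        ext_type = int.from_bytes(record[pos:pos + 2], "big")
        ext_len = int.from_bytes(record[pos + 2:pos + 4], "big")
        pos += 4
        if ext_type == 0:  # server_name extension
            # Skip list length (2 bytes) and name type (1 byte).
            name_len = int.from_bytes(record[pos + 3:pos + 5], "big")
            return record[pos + 5:pos + 5 + name_len].decode("ascii")
        pos += ext_len
    return None


print(sni_from_client_hello(client_hello_bytes("foo.example.com")))
```

Nothing in this parser needs a key: the name is read straight from bytes the client sends before any encryption has been negotiated.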

An important note about referers on older browsers

Even though HTTPS encrypts the path you are accessing, if you click a hyperlink within that site which goes to an unencrypted page, the full path may be leaked in the referer header. This is no longer the case for many newer browsers, but older or non-compliant browsers may still behave this way, as will websites which set the HTML5 referer meta tag to always send the information. An example of what a client would send unencrypted when going from https://example.com/private/details.html to http://example.org/public/page.html in such a case would be:

GET /public/page.html HTTP/1.1
Referer: https://example.com/private/details.html
User-Agent: Wget/1.19.1 (linux-gnu)
Accept: */*
Accept-Encoding: identity
Host: example.org

As such, navigating from an HTTPS page to an HTTP page may leak the full URL (excluding the fragment) of the previous page, so keep that in mind.
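
The behaviour of newer browsers mentioned above, which drop the header when going from HTTPS to HTTP, can be sketched roughly as follows. This is an illustration of the rule, not browser code; the function name referer_header is made up, and real browsers apply further policies that are ignored here.

```python
from urllib.parse import urldefrag, urlsplit


def referer_header(from_url: str, to_url: str):
    """Return the Referer value to send, or None if it should be omitted."""
    leaving_https = urlsplit(from_url).scheme == "https"
    going_to_https = urlsplit(to_url).scheme == "https"
    if leaving_https and not going_to_https:
        return None  # downgrade: do not leak the HTTPS URL in cleartext
    # The fragment is never part of the Referer value.
    return urldefrag(from_url).url


print(referer_header("https://example.com/private/details.html",
                     "http://example.org/public/page.html"))
print(referer_header("https://example.com/private/details.html#top",
                     "https://example.org/public/page.html"))
```

Note that in the second case the full previous URL still reaches the destination site; it is only hidden from the network because the new request is itself encrypted.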

forest
  • 1
    Thanks to all of you. I am really surprised, as I thought there must have been a way. Always believed that at least the first AP would need an address in order to establish the connection request by the computer. – jdoe Dec 28 '17 at 08:21
  • @jdoe The only address it needs is that of the domain itself. The rest of it can be hosted on the same system; knowing how to connect to `example.com/foo` and `example.com/bar` requires only knowing how to connect to `example.com` itself. – forest Dec 28 '17 at 09:11
  • Not 100% sure here, but aren't browsers supposed to not send a referer when you are transitioning from HTTPS to HTTP? I think this can be changed by explicitly setting a different referer policy (as I think Google does), but hopefully a site where the URL is sensitive wouldn't do that. – Anders Dec 28 '17 at 10:04
  • @Anders Some browsers may do that, but it is not a part of any specification and so should not be relied upon in important cases. Ideally, a site where the URLs are sensitive would set the proper referer policy to disable it on sensitive pages completely, but many (if not most) do not. – forest Dec 28 '17 at 10:06
  • According to [this spec](https://w3c.github.io/webappsec-referrer-policy/#referrer-policy-empty-string), having no referer policy should default to `no-referrer-when-downgrade`, which would not leak the referer unencrypted, and I think the major browsers implement that. (You are right that it would still leak to the site you are going to.) Maybe this is mostly nitpicking, and I might be wrong on this, though. – Anders Dec 28 '17 at 10:14
  • Oh it looks like you're right. My information was outdated. RFC 2616 made this requirement a SHOULD NOT, but RFC 7231 turned it into a MUST NOT. This used to be the case (https://serverfault.com/questions/520244/referer-is-passed-from-https-to-http-in-some-cases-how), and still is when `` is set. Unfortunately, it seems that is often set, e.g. on blogs and by ad/analytics companies. – forest Dec 28 '17 at 10:26
3

The naive answer is no: the URL is encrypted in the TLS stream. But that answer ignores a great deal of relevant information.

Suppose it's Wikipedia. How long is an HTTP GET request for https://en.wikipedia.org/wiki/Cryptography versus https://en.wikipedia.org/wiki/Information_security, assuming all the header fields are the same? If you can measure the length of a request, which will likely be submitted in a single TLS record, then you can probably tell these apart.

That doesn't help you to distinguish a request for the article on cryptography from the article on choreography, of course. It also doesn't help if the TLS client cleverly adds some padding, ignored by the server, to the TLS record to round it to a multiple of some block size. But English Wikipedia has a much longer article on cryptography than on choreography. So even if the endpoints pad their TLS records to the maximum 16384 bytes, you can probably distinguish the article on cryptography from the article on choreography.
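
A small sketch of the length argument, under simplifying assumptions: the request below carries only a Host header, whereas a real browser sends many more (identical across the requests being compared, so the difference in length is the same). The helper names and the block sizes are chosen purely for illustration.

```python
def get_request_length(path: str, host: str = "en.wikipedia.org") -> int:
    """Length in bytes of a bare-bones GET request for the given path."""
    request = f"GET {path} HTTP/1.1\r\nHost: {host}\r\n\r\n"
    return len(request.encode("ascii"))


def padded_length(length: int, block: int) -> int:
    """Round a length up to the next multiple of the block size."""
    return -(-length // block) * block


for title in ("Cryptography", "Information_security", "Choreography"):
    n = get_request_length("/wiki/" + title)
    print(title, n, padded_length(n, 64), padded_length(n, 128))
```

The two 12-letter titles give identical request lengths, while the longer title stands out unless the padding block is large enough to swallow the difference. The response sizes are a separate signal that request padding does nothing to hide.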

There's a complication from your perspective as the attacker: the client may use the same TLS stream for many requests and many responses. But they will likely all be timed in a burst as the victim loads a single page with embedded CSS, images, JavaScript, etc., and then go silent as the victim reads the page. The timing and number of these requests provide another variable on which you can discriminate what page they were looking for.

All these variables can be fed into a probabilistic model of pages—here's one example, lifted from the anonymity bibliography. Defeating that one example doesn't mean that the distribution of data an attacker on the network learns for one page is indistinguishable from another page, just that that particular distinguisher isn't as effective.

So, are you, as the eavesdropper, guaranteed to be able to read the URL off the wire? No: it is encrypted in the TLS stream (unless the NULL cipher is chosen!), so at best you can infer it from other observable variables with probabilistic dependencies on it.

On the other hand, is the victim guaranteed that their URL is concealed from an eavesdropper? No: there are many variables dependent on the URL that an attacker may be able to infer juicy information about, like which sexually transmitted disease you're reading about at the Mayo Clinic.

(Note that anything in the fragment of a URL—the part after the # mark in https://en.wikipedia.org/wiki/Cryptography#Terminology—is not transmitted in the HTTP GET request at all, unless there is some script on the page that triggers different network traffic dependent on the URL fragment.)

Squeamish Ossifrage
0

The URL, as you say, is inside the HTTP headers, which, like the HTTP body, are inside the TLS stream, which means they are encrypted. You can derive the server name by sniffing for DNS requests made before the HTTPS request, but you may not get results if, for example, the name is already in the local cache.
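
As a minimal sketch of the DNS approach, assuming the client uses classic unencrypted DNS over UDP: the queried name sits in plaintext right after the 12-byte header. Both function names are hypothetical, and the query is built locally here as a stand-in for a sniffed packet.

```python
import struct


def build_dns_query(name: str, query_id: int = 0x1234) -> bytes:
    """Build a plain DNS query for an A record (stand-in for a sniffed packet)."""
    # Header: id, flags (recursion desired), 1 question, no other records.
    header = struct.pack("!HHHHHH", query_id, 0x0100, 1, 0, 0, 0)
    qname = b"".join(
        bytes([len(label)]) + label.encode("ascii") for label in name.split(".")
    ) + b"\x00"
    return header + qname + struct.pack("!HH", 1, 1)  # type A, class IN


def queried_name(packet: bytes) -> str:
    """Read the name out of the question section of a DNS query."""
    pos = 12  # skip the fixed-size header
    labels = []
    while packet[pos] != 0:
        length = packet[pos]
        labels.append(packet[pos + 1:pos + 1 + length].decode("ascii"))
        pos += 1 + length
    return ".".join(labels)


print(queried_name(build_dns_query("foo.example.com")))
```

As with SNI, this reveals only the host name, never the path.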

Patrick Mevzek
-1

The URL is also encrypted when you use TLS. There is no way to find out the content or the resource URL by sniffing HTTPS traffic. Still, security best practices recommend not sending any sensitive information in HTTP query strings, because they can be cached in your browser or logged on your servers.

Camfy
  • This is incorrect. The domain (and subdomains) can be sniffed, and that is part of the URL. – forest Dec 27 '17 at 05:02