
In light of the recent Senate decision in the US to allow ISPs to sell users' browsing history, I've been reading recommendations on how users can retain their privacy. One of the common recommendations is to restrict your browsing to HTTPS sites, so that at least on-site content remains private even if domain-level activity is no longer private.

Thinking about this, though, ISPs could surely still make some basic deductions quite easily - e.g. a high download rate on youtube.com suggests watching video, a high upload rate on the same site suggests uploading video, etc.
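To make that concrete, here's a rough sketch (in Python) of the kind of rule-of-thumb classification I imagine an observer could apply; the thresholds and labels are purely my own guesses:

```python
# Rough sketch: guess a flow's likely activity from byte counts alone.
# Thresholds are illustrative guesses, not measured values.
def classify_flow(bytes_down: int, bytes_up: int) -> str:
    total = bytes_down + bytes_up
    if total < 1_000_000:           # small flow: ordinary page browsing
        return "browsing"
    if bytes_down > 10 * bytes_up:  # heavily download-skewed: likely streaming video
        return "watching video"
    if bytes_up > 2 * bytes_down:   # upload-skewed: likely uploading content
        return "uploading"
    return "interactive/mixed"

print(classify_flow(bytes_down=250_000_000, bytes_up=4_000_000))  # watching video
```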

Taking this approach further, I wonder if ISPs could guess specific page content from HTTPS response size? E.g. that "tech security / possible bias against Republicans" article from Ars Technica I linked comes in at 16.77 / 16.78 kB of transferred response size for just the base HTML page, whereas another article that fits the "weather nerd" category usually comes in at 13.34 kB.
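For instance, I could imagine someone pre-computing a table of transferred HTML sizes per article and then looking up observed sizes against it. A minimal sketch of that idea (the URLs are placeholders, and real pages will vary with cookies, ads and A/B tests):

```python
# Sketch: build a lookup table of base-HTML transfer size per article.
# URLs are placeholders, not the real Ars Technica links.
import requests

ARTICLES = {
    "tech-security-article": "https://example.com/articles/tech-security",
    "weather-nerd-article":  "https://example.com/articles/weather-nerd",
}

def html_transfer_size(url: str) -> int:
    # stream=True keeps the body compressed, so len() approximates the
    # number of bytes that actually cross the wire for the HTML document.
    resp = requests.get(url, stream=True)
    return len(resp.raw.read())

fingerprints = {name: html_transfer_size(url) for name, url in ARTICLES.items()}
print(fingerprints)  # e.g. {'tech-security-article': 16780, 'weather-nerd-article': 13340}
```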

Now of course this is going to depend on the dynamic vs static/caching nature of the page, and particularly if there is substantial user-specific tailoring to each HTML load.

Though does this concept stand? Am I correct in thinking that the encrypted size of an HTTPS response is going to be almost exactly proportional to the underlying plaintext size?
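From what I understand of TLS framing, the answer is roughly yes: each record adds only a small fixed overhead, so ciphertext length tracks plaintext length almost exactly unless the site deliberately pads. A back-of-the-envelope sketch, assuming TLS 1.2 style AES-GCM records (my assumption; other cipher suites and TLS 1.3 differ slightly):

```python
# Sketch: estimate on-the-wire TLS size for a given plaintext size, assuming
# TLS 1.2 AES-GCM framing: 5-byte record header, 8-byte explicit nonce and
# 16-byte auth tag per record, with up to 16 KiB of plaintext per record.
import math

RECORD_HEADER, EXPLICIT_NONCE, AUTH_TAG = 5, 8, 16
MAX_PLAINTEXT = 16384

def tls_wire_size(plaintext_len: int) -> int:
    records = max(1, math.ceil(plaintext_len / MAX_PLAINTEXT))
    return plaintext_len + records * (RECORD_HEADER + EXPLICIT_NONCE + AUTH_TAG)

for size in (13_340, 16_780):
    print(size, "->", tls_wire_size(size))
# Overhead is a few dozen bytes per 16 KiB record, so the encrypted size is
# effectively the plaintext size plus a small, predictable constant.
```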

Thinking about possible solutions, sites could obfuscate articles by padding them with junk HTML (e.g. a big comment section) to minimise uniqueness. Likewise, images could be compressed to identical sizes, making a simple size-mapping approach for any ISP / packet-sniffer ineffective and requiring more substantial pattern recognition. Though of course this is somewhat moot, as which domains have been visited and when is already more than enough to build an online profile.
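The padding idea could be as simple as rounding every HTML response up to a fixed bucket size with an HTML comment, something like this sketch (the 4 KiB bucket is an arbitrary choice of mine):

```python
# Sketch: pad every HTML response up to the next fixed bucket boundary so
# that many different pages share the same transfer size.
BUCKET = 4096  # arbitrary bucket size for illustration

def pad_html(html: bytes, bucket: int = BUCKET) -> bytes:
    pad_len = (-len(html)) % bucket
    if pad_len < 8:           # not enough room for the "<!--" ... "-->" wrapper
        pad_len += bucket     # pad into the next bucket instead
    filler = b"<!--" + b"x" * (pad_len - 7) + b"-->"
    return html + filler

page = b"<html><body>short article</body></html>"
print(len(page), "->", len(pad_html(page)))  # length is now a multiple of 4096
```

One complication I can see is that gzip would compress repetitive junk right back down, so the filler would either need to be incompressible or added after compression.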

andrewb
  • I would argue this is not a duplicate in that it's specific to the recent Senate decision about ISPs, which means it encompasses a larger scope than the other question. – Trey Blalock Mar 26 '17 at 04:21
  • Buy a cheap VPS and run a VPN to it; won't stop feds, but will stop ISPs; I doubt DO/Linode will sell you out. – dandavis Mar 27 '17 at 08:43
  • Didn't catch that question, seems like marking it as a duplicate is reasonably fair. Though the answer here is great and adds a lot more detail to the strategy that can be employed. – andrewb Mar 27 '17 at 22:01

1 Answer


Effectively you are asking about fingerprinting web browser behavior when visiting specific pages on a website. Yes, this is definitely something which can be done, but the accuracy will vary from site to site depending on how much the web pages themselves vary, and typically people doing this type of analysis gather a little more data than just HTTPS request size (although in some cases HTTPS request information is all you would need).

Keep in mind the ISPs are actually getting a LOT more data: DNS requests, timing information for page loads, additional browser assets that may be loaded (JavaScript, third-party CSS, web fonts, etc.), requests for files hosted on a Content Distribution Network (CDN) or a secondary image server, remote ad server requests, and things like Google Analytics calls, all at the same time.
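For example, plain DNS alone already hands an on-path observer a burst of hostnames for every page load. A quick sketch of what that looks like with scapy (needs root privileges; purely illustrative):

```python
# Sketch: log the DNS queries an on-path observer sees, regardless of whether
# the sites themselves use HTTPS. Requires scapy and root privileges.
from scapy.all import sniff, DNSQR

def log_query(pkt):
    if pkt.haslayer(DNSQR):
        # e.g. cdn.example.net., fonts.example.org., analytics.example.com.
        print(pkt[DNSQR].qname.decode())

# Each page load typically produces a burst of lookups for CDNs, ad servers,
# font providers and analytics hosts -- a fingerprint in its own right.
sniff(filter="udp port 53", prn=log_query, store=0)
```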

Add all of this additional data, and fingerprinting what the browser is doing at any given time becomes much easier, even with HTTPS enabled on the main site being visited.

Think of the following three requests to a website.

1.) The main page containing 15 images across 2 hosts and 2 CDNs as well as a banner ad. This page also loads a JavaScript library from a remote site as well as some web fonts from yet another website.

2.) A secondary page containing 3 images that are hosted on 1 host and one large image hosted on a CDN, some HTML content from 1 host, and 2 new JavaScript library connections.

3.) A third page containing very specific information of interest, consisting of a much longer HTML page and a few very large photos, all from only 1 host (not touching a CDN), and maybe this page took a few milliseconds longer to load since the web server no longer had it cached in RAM.

These are overly simplistic examples, but what you can start to see is that each web page in these examples has a unique fingerprint when loaded by a browser. Many websites have very homogeneous pages which won't lend themselves to such easy fingerprinting, but these are primarily for example purposes.

In any case, since these unique fingerprints can be created and even proactively analyzed by large search engines or bots (or simply collected en masse by the ISPs), it becomes very easy in some cases to predict what a user is doing, and in many cases exactly which page on a website is being viewed.
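As a rough sketch of how that matching could work, you can reduce each page load to a few coarse features an observer can measure (total bytes, distinct hosts contacted, number of requests) and pick the closest known fingerprint; the numbers here are invented for the three example pages above:

```python
# Sketch: nearest-match an observed page load against pre-computed fingerprints
# built from coarse, observer-visible features. All numbers are invented.
FINGERPRINTS = {
    "main page":      {"bytes": 1_400_000, "hosts": 6, "requests": 22},
    "secondary page": {"bytes":   600_000, "hosts": 3, "requests": 8},
    "article page":   {"bytes": 2_900_000, "hosts": 1, "requests": 5},
}

def best_match(observed: dict) -> str:
    def distance(fp: dict) -> float:
        return (abs(fp["bytes"] - observed["bytes"]) / max(fp["bytes"], 1)
                + abs(fp["hosts"] - observed["hosts"])
                + abs(fp["requests"] - observed["requests"]))
    return min(FINGERPRINTS, key=lambda name: distance(FINGERPRINTS[name]))

print(best_match({"bytes": 2_850_000, "hosts": 1, "requests": 5}))  # article page
```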

As far as request sizes being the same goes, there are three additional variables to keep in mind:

1.) Compression algorithms negotiated between browser and server for data being sent.

2.) Packet size variation due to additional factors like type of network.

3.) Dynamic pages with things like news feeds will change in size.

Even with a couple of variables like these, it would still be possible to reasonably approximate which page a person is loading. It won't be an exact one-to-one size-to-page comparison, but for certain browsers under certain conditions the data size will frequently be the same, and the range of behaviors will likely be very closely grouped together.
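To illustrate the compression point: the negotiated compression level shifts the absolute size a bit, but the spread for a single page is usually much smaller than the gap between different pages, so the observed size still narrows things down. A synthetic sketch (the "pages" here are made-up stand-ins, so the exact numbers are not meaningful):

```python
# Sketch: how much does the compression level alone move the observed size?
import random
import zlib

random.seed(0)

def fake_page(word_count: int) -> bytes:
    words = ["secure", "data", "network", "cloud", "weather", "storm", "policy"]
    body = " ".join(random.choice(words) for _ in range(word_count))
    return ("<html><body>" + body + "</body></html>").encode()

page_a = fake_page(2500)  # the longer "security" article
page_b = fake_page(1900)  # the shorter "weather" article

for name, page in (("page_a", page_a), ("page_b", page_b)):
    sizes = [len(zlib.compress(page, level)) for level in range(1, 10)]
    print(name, min(sizes), "-", max(sizes))
# Within one page, the size varies only modestly across compression levels;
# across pages, the difference in content tends to dominate.
```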

Finally, a bigger concern is the aggregation of many different types of data and the compiling of that information about customers, so that customer profiling can happen with only a few clicks and comparisons against other data sets. ISPs will get an amazing amount of data about end users from every device in their houses that communicates out.

Trey Blalock