In light of the recent US Senate decision to allow ISPs to sell users' browsing history, I've been reading recommendations on how users can retain their privacy. One of the common recommendations is to restrict your browsing to HTTPS sites, so that at least page content remains private even if which domains you visit is no longer private.
Thinking about this, though, ISPs could surely still make some basic deductions quite easily: a high download rate on youtube.com suggests watching video, a high upload rate to the same site suggests uploading video, and so on.
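To make that concrete, here is a minimal sketch (Python; the thresholds, field names, and the idea of having per-flow byte counters on the ISP side are all my own illustrative assumptions, not anything a real ISP is known to use):

```python
from dataclasses import dataclass

@dataclass
class FlowStats:
    server_name: str   # e.g. from the cleartext TLS SNI field
    bytes_down: int    # bytes sent server -> client
    bytes_up: int      # bytes sent client -> server
    duration_s: float

def guess_activity(flow: FlowStats) -> str:
    # Made-up thresholds: sustained, heavily asymmetric traffic suggests
    # streaming (down) or uploading (up); anything else stays "unknown".
    down_rate = flow.bytes_down / flow.duration_s
    up_rate = flow.bytes_up / flow.duration_s
    if down_rate > 500_000 and down_rate > 10 * up_rate:
        return f"likely streaming video from {flow.server_name}"
    if up_rate > 500_000 and up_rate > 10 * down_rate:
        return f"likely uploading to {flow.server_name}"
    return "browsing / unknown"

# Ten minutes of heavy, one-way youtube.com traffic:
print(guess_activity(FlowStats("youtube.com", 2_000_000_000, 5_000_000, 600.0)))
```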
Taking this approach further, I wonder whether ISPs could guess specific page content from the HTTPS response size. For example, that "tech security / possible bias against Republicans" article from Ars Technica I linked comes in at 16.77–16.78 kB of transferred response data for just the base HTML page, whereas another article, this one fitting the "weather nerd" category, usually comes in at 13.34 kB.
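As a toy illustration of how simple that mapping could be (the URLs and byte counts below are hypothetical stand-ins, not real measurements):

```python
from typing import Optional

# Precomputed table of transfer size -> page, built by crawling the site
# once. All entries here are invented for illustration.
fingerprints = {
    16_780: "https://example.com/tech-security-article",
    13_340: "https://example.com/weather-nerd-article",
}

def match_page(observed_bytes: int, tolerance: int = 50) -> Optional[str]:
    # Return the known page whose recorded size is closest, if close enough.
    best = min(fingerprints, key=lambda size: abs(size - observed_bytes))
    return fingerprints[best] if abs(best - observed_bytes) <= tolerance else None

print(match_page(16_790))  # -> the tech-security article
```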
Now of course this is going to depend on how dynamic versus static/cacheable the page is, and in particular on whether each HTML load is substantially tailored to the individual user.
But does this concept hold? Am I correct in thinking that the encrypted size of an HTTPS response is going to be almost exactly proportional to the size of the underlying plaintext?
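As far as I understand, it should be: TLS 1.3 with AES-GCM splits the plaintext into records of at most 16 KiB and adds roughly 22 bytes per record (a 5-byte record header, a 1-byte inner content type, and a 16-byte authentication tag), so unless the server deliberately pads its records, the ciphertext tracks the plaintext to within a fraction of a percent. A back-of-envelope sketch in Python (the page sizes are just my two examples from above):

```python
import math

# Rough ciphertext size under TLS 1.3 + AES-GCM, assuming no record
# padding: 5-byte header + 1-byte inner content type + 16-byte auth tag
# per record, with at most 16 KiB of plaintext per record.
RECORD_MAX = 16 * 1024
PER_RECORD_OVERHEAD = 5 + 1 + 16

def estimate_ciphertext(plaintext_len: int) -> int:
    records = max(1, math.ceil(plaintext_len / RECORD_MAX))
    return plaintext_len + records * PER_RECORD_OVERHEAD

for size in (13_340, 16_780):
    print(f"{size} plaintext bytes -> ~{estimate_ciphertext(size)} on the wire")
```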
Thinking about possible mitigations, sites could obfuscate articles by padding them with junk HTML (e.g. a big comment block) to reduce uniqueness. Likewise, images could be compressed to identical sizes, making a simple size-to-page mapping ineffective for any ISP or packet sniffer and forcing more substantial pattern recognition. Though of course this is somewhat moot, as which domains have been visited, and when, is already more than enough to build an online profile.
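As a sketch of the junk-HTML idea (the bucket size and the comment-based filler are arbitrary choices of mine):

```python
# Pad every HTML response up to the next multiple of a fixed bucket size
# using an HTML comment of filler, so many different articles end up with
# identical transfer sizes. The 16 KiB bucket is an arbitrary assumption.
BUCKET = 16 * 1024

def pad_html(body: bytes) -> bytes:
    overhead = len(b"<!--") + len(b"-->")
    target = -(-(len(body) + overhead) // BUCKET) * BUCKET  # ceil to bucket
    filler = b"X" * (target - len(body) - overhead)
    return body + b"<!--" + filler + b"-->"

page = b"<html><body>article text...</body></html>"
print(len(page), "->", len(pad_html(page)))  # e.g. 41 -> 16384
```

One caveat I can see: the padding would have to be applied after compression, since gzip would collapse a long run of identical filler bytes back down to almost nothing.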