17

I'm working on a small side project which involves a browser add-on and a backend service. The service is pretty simple: give it a URL and it will check if the URL exists in a database. If the URL is found, some additional information is also returned.

The browser add-on forwards any URLs that the user opens to the service and checks the response. Now, sharing every URL you're browsing is of course a big no-no. So instead, I was thinking about using SHA-1 (or a similar hash function) to create a hash of the URL, and sending only that to the backend service to check for membership in the DB.

My question is whether this scheme is better for the users' privacy. My thinking is that now I'm not sharing any URLs, and the only way I know the user has opened a given URL is if it's already present in the database.
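
To make this concrete, here's roughly what I have in mind on the client side (a sketch only, written in Python rather than actual WebExtension code; the `/check` endpoint name is a placeholder, not a real service):

```python
import hashlib

import requests  # assumption: the backend speaks plain HTTPS + JSON

API = "https://backend.example.invalid/check"  # placeholder endpoint

def check_url(url: str):
    # Hash the full URL locally; only the digest ever leaves the client.
    digest = hashlib.sha1(url.encode("utf-8")).hexdigest()
    resp = requests.get(API, params={"h": digest}, timeout=5)
    resp.raise_for_status()
    data = resp.json()
    # The server returns extra info only if the hash is in its database.
    return data if data.get("found") else None
```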

Arminius
  • 43,922
  • 13
  • 140
  • 136
Jibran
  • 273
  • 2
  • 5
  • 1
    I might not understand properly, but is "look up information about a host in a database" not exactly what e.g. DNS queries do? People do plaintext DNS queries all day long. So unless your users are identifiable in your service, how is this an issue? Someone from some IP address looked up "findcheapsex.com", so what. Note that e.g. Google also tracks clicks _and_ tries very hard to indeed identify users (which seems perfectly acceptable for hundreds of millions of people). – Damon Nov 08 '16 at 12:27
  • 4
    you could look at [homomorphic encryption](https://en.wikipedia.org/wiki/Homomorphic_encryption) for making computations on encrypted data. [One of the top researchers](https://www.google.ch/search?q=raluca+ada+popa+app) in the field proposed a few apps for this – Ciprian Tomoiagă Nov 08 '16 at 12:56
  • 1
    Look at Private Information Retrieval: https://en.wikipedia.org/wiki/Private_information_retrieval . Homomorphic encryption might be overkill. – Reinstate Monica Nov 08 '16 at 17:08
  • 2
    @Damon - there's a big difference between logging a DNS query for "webmd.com" and logging the full URL for "webmd.com/syphilis/treatment" -- DNS queries do leak information, but they leak less than full URL's. – Johnny Nov 09 '16 at 06:23
  • 1
    Just gonna assume that you already know that you should salt these. It won't help a ton with the problems other people have mentioned but it makes some of the exploits a little more difficult. – ford prefect Nov 09 '16 at 14:17

6 Answers

27

It's better but not perfect.

While it is (currently) infeasible to recover the URL from a given hash, a given URL of course always produces the same hash.

So it is not possible to see all the URLs a user browses, but it is quite likely that most of them can be recovered.

While it isn’t possible to see that user A visits HASH1 and conclude that HASH1 means fancyDomainBelongingToUserA-NoOneElseVisits.com, it is, for example, possible to simply calculate the hash of CheatOnMyWife.fancytld and then see which users visit that site.

I wouldn’t consider that to be protecting the user's privacy.

Also, just matching users who visit a lot of similar domains can be pretty revealing.
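
To illustrate the attack described above (a minimal sketch with invented data; anyone who can see the submitted hashes, including the service operator, can do this):

```python
import hashlib

def h(url: str) -> str:
    return hashlib.sha1(url.encode("utf-8")).hexdigest()

# What the backend sees: per-user sets of submitted hashes (invented example data).
observed = {
    "user_a": {h("https://cheatonmywife.fancytld/"), h("https://news.example/")},
    "user_b": {h("https://news.example/")},
}

# An attacker (or the operator) simply hashes the sites they care about...
targets = {h(u): u for u in ["https://cheatonmywife.fancytld/"]}

# ...and checks which users submitted those hashes.
for user, hashes in observed.items():
    for match in hashes & set(targets):
        print(user, "visited", targets[match])  # -> user_a visited https://cheatonmywife.fancytld/
```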

Giacomo1968
  • 1,185
  • 5
  • 16
Josef
  • 5,903
  • 25
  • 33
  • 1
    Are services like Web of Trust, or the Google Toolbar (when it was still around), or any similar services that rate/rank URLs facing the same issues? How do they work around these privacy concerns? – Jibran Nov 08 '16 at 08:16
  • Also, given the nature of the project, since I'm essentially providing some information to the user based on the URL they are visiting, maybe this step is inevitable and something that users just accept for the value they get out of it? Or is there another way? – Jibran Nov 08 '16 at 08:20
  • 14
    @Jibran they don't: [Web of Trust Sells Your Browsing History, Uninstall It Now](https://lifehacker.com/web-of-trust-sells-your-browsing-history-uninstall-it-1788667989), [Web of Trust (WOT) Add-on taken down by Chrome & Firefox](http://news.thewindowsclub.com/web-of-trust-wot-add-on-taken-down-86981/), ... I don't know what you are doing, but if you need to collect all users URLs probably whatever it is is wrong. You could look how Google safe browsing worked and works. They used a bloom filter and now store the **prefix** of hashes of bad domains. – Josef Nov 08 '16 at 08:23
  • So by hashing my URL I'm already a step ahead? :) Given that there is no way to avoid sending the URL or a hash AFAIK, the users have to have trust in the service to use this? Or is there a way around this? – Jibran Nov 08 '16 at 08:27
  • @Jibran I don't know what exactly you want to achieve. Look at the methods google safe browsing uses for example and see if that fits you. You are ahead, but not far I would say. Think how you would feel about the headlines **Jibran's Addon destroys your privacy, Uninstall It Now** or **Jibran's Addon taken down by Chrome & Firefox**... – Josef Nov 08 '16 at 08:31
  • Thanks. The Google Safe Browsing using the Update API https://developers.google.com/safe-browsing/v4/update-api seems like an interesting approach. I probably can't use that since my DB is being updated every few minutes, but it's still interesting to see Googles approach. But what this tells me is that you always need to do a comparison, whether you do it on the local machine or the remote server. Thanks a lot for all the info and help. – Jibran Nov 08 '16 at 08:38
  • @Jibran but maybe you can compare just a hash prefix and not the full hash? https://developers.google.com/safe-browsing/v4/urls-hashing#hash-prefix-computations Depends on your exact application of course. – Josef Nov 08 '16 at 08:46
  • That sounds better. But I assume that only improves the privacy if the hash prefix has a greater chance of collision with other URL prefixes. Because otherwise it's just as identifiable as the full hash, if I'm not wrong. – Jibran Nov 08 '16 at 09:28
  • 2
    Also I think Google uses prefixes to save space, and not improve privacy, because once you have the complete bad URL list, you're no longer concerned with URLs leaving your local system. – Jibran Nov 08 '16 at 09:30
  • As [Arminius](https://security.stackexchange.com/a/142122/37864) points out, they also use partial hashes for the webservice, where this clearly is a privacy feature. (They also allow you to send the full url to a webservice and I doubt they care about the few bytes difference in traffic) – Josef Nov 08 '16 at 11:11
  • @Josef How would transmitting only the hash prefix improve privacy? Haven't you just effectively made it so your hash function is now `truncate_to_x_length(hash())` instead of `hash()`? That still suffers from all the same problems detailed in this answer, does it not? – Ajedi32 Nov 08 '16 at 14:39
  • @Ajedi32: if you chop off enough of the hash then it ceases to be a good hash, since it stops being collision-resistant. False matches are bad for accuracy, but good for "plausible deniability" what the input was. Combine those with the smaller size, and in fact that's the same trade-off provided by a Bloom filter, so combining the Bloom filter with truncated hashes might be quite a natural choice. – Steve Jessop Nov 08 '16 at 16:35
  • 2
    Alternatively (and I don't think, from the descriptions here, that it's what Google does), you could use a partial hash to shard your badURL database into manageable chunks. So if the client computes a partial hash and gets back a handful of matching bad URLs, it can compare those against the full URL to determine badness in reasonable time. The server doesn't know for sure which (if any) of all the URLs matching that partial hash, was the one I actually visited. – Steve Jessop Nov 08 '16 at 16:40
  • 2
    I think the only privacy-safe option is to send the entire database of URLs to the user, and have the check done client-side – BlueRaja - Danny Pflughoeft Nov 08 '16 at 20:12
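
To make the prefix/sharding idea from the comments above concrete, here is a rough sketch (hypothetical function names, not Google's actual API): the client sends only a short hash prefix, the server returns every full hash it stores under that prefix, and the decisive comparison happens locally.

```python
import hashlib

PREFIX_BYTES = 4  # a 32-bit prefix, as in the Safe Browsing discussion

def full_hash(url: str) -> bytes:
    return hashlib.sha256(url.encode("utf-8")).digest()

# --- server side (sketch): return all stored full hashes under a prefix ---
def lookup_by_prefix(db_hashes: set, prefix: bytes) -> list:
    return [stored for stored in db_hashes if stored.startswith(prefix)]

# --- client side (sketch): the server only ever sees 4 bytes ---
def check(url: str, query_server) -> bool:
    h = full_hash(url)
    candidates = query_server(h[:PREFIX_BYTES])
    return h in candidates  # the final comparison never leaves the client

# Wiring both halves together in-process for demonstration:
db = {full_hash("https://bad.example/phish")}
print(check("https://bad.example/phish", lambda p: lookup_by_prefix(db, p)))  # True
print(check("https://good.example/", lambda p: lookup_by_prefix(db, p)))      # False
```
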
9

I think it's good that you want to protect a user's privacy, but what you're building seems inherently opposed to protecting privacy, so I don't think it's possible with a simple setup (e.g. the client sending the URL, in whatever form, directly to your backend service).

As others have noted, hashing with SHA-1 is a good first step, but it only buys you privacy against a human taking a quick glance at the database. It doesn't give you much privacy against algorithms designed to analyze the database contents.

You're leaking more than the visited URL, too: if you're doing real-time checking, the user also tells you at what time they were online and looking at the given URL.

A few others have suggested ways to mitigate the privacy issues. While they're all better than doing nothing, they don't solve the problem. For example, Google's solution of only sending 32 bits of the hash looks nice, but it still maps all existing URLs into a table with about 4 billion slots. Some of those slots may contain a large number of entries, but not all URLs are equally likely to be visited (Facebook URLs, for example, are much more likely than some primary school's homepage), and the URLs of a single domain will most likely be spread fairly evenly over the 4 billion slots. So, given the set of full URLs that hash to the same 32-bit prefix, it will still be quite easy to guess which one was actually visited, especially for Google, which has PageRank data on a huge number of the URLs out there...

Such an attack involves someone building a rainbow table of the URLs he's interested in. You could make it more difficult by:

  1. Using a password hashing function instead of SHA-1, which takes much longer to calculate - but this will make your browser plugin seem unresponsive.
  2. Salting your hashes. Obviously you can't give every user his own salt, or the hashes of the same URL submitted by different users will never match, most likely making your application pointless. But the larger your userbase grows, the fewer users need to share the same salt value. You still don't protect user privacy, but you make it harder to compute rainbow tables to find out exactly which URLs were visited, and if someone does that for the salt of a specific user, only the privacy of the other users sharing his salt is compromised. (A sketch combining both ideas follows after this list.)
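
A rough sketch of both ideas together, assuming PBKDF2 as the slow hash and a small fixed pool of shared salts (the bucketing scheme and iteration count are arbitrary illustrations, not recommendations):

```python
import hashlib

# Illustration only: a small fixed pool of salts shared by groups of users.
# In practice these would be generated once and stored; users in the same
# bucket still produce matching hashes for the same URL.
SALT_POOL = [bytes([i]) * 16 for i in range(8)]

def salt_for(user_id: str) -> bytes:
    # Deterministically assign each user to one salt bucket (hypothetical scheme).
    bucket = int.from_bytes(hashlib.sha256(user_id.encode()).digest()[:4], "big") % len(SALT_POOL)
    return SALT_POOL[bucket]

def slow_url_hash(url: str, user_id: str) -> str:
    # PBKDF2 instead of bare SHA-1: deliberately slow, so bulk precomputation
    # of URL tables becomes more expensive (but, as noted above, not impossible).
    return hashlib.pbkdf2_hmac(
        "sha256", url.encode("utf-8"), salt_for(user_id), 100_000
    ).hex()

# Users assigned the same salt bucket still submit identical values for the
# same URL, so the backend can still match them against its database.
print(slow_url_hash("https://example.org/", "some-user-id"))
```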

However, none of this helps at all in cases where an attacker isn't interested in the whole set of hashed URLs, but only wants to answer very specific questions (e.g. which users visited URLs belonging to the domains on a given "blacklist"?). Since such a query only involves a short list (maybe a few dozen up to a few hundred thousand URLs, depending on the size of the blacklist), it's trivial to hash each of them in a short amount of time, no matter what countermeasures you use to slow it down.

It's worse than that, because many websites only have a few common entry points, the most likely one being just the domain followed by an empty path. Other commonly visited paths are login pages, profile pages etc., so the number of URLs you need to hash in order to determine whether someone has visited a specific domain is most likely very small. An attacker who does this will miss users who used a deep link into a website, but he'll catch most of them.

And it gets even worse: if an attacker manages to recover one full URL from a hash that a user provided, he can very easily get the URLs for a large part of that user's browsing session. How? Since he has a URL, he can fetch it with his own custom spider, extract all the links in the document, hash them and look for them in your database. Then he does the same with those links, and so on.
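
A sketch of that crawl-and-hash attack using only the standard library (the seed URL is a placeholder for whichever URL was recovered first):

```python
import hashlib
import urllib.request
from html.parser import HTMLParser
from urllib.parse import urljoin

class LinkCollector(HTMLParser):
    """Collect href values from anchor tags."""
    def __init__(self):
        super().__init__()
        self.links = []
    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def hashes_reachable_from(seed_url: str, depth: int = 2) -> dict:
    """Build a sha1(url) -> url table for everything linked from one known URL."""
    seen, frontier, table = set(), [seed_url], {}
    for _ in range(depth):
        next_frontier = []
        for url in frontier:
            if url in seen:
                continue
            seen.add(url)
            table[hashlib.sha1(url.encode("utf-8")).hexdigest()] = url
            try:
                html = urllib.request.urlopen(url, timeout=5).read().decode("utf-8", "replace")
            except Exception:
                continue
            parser = LinkCollector()
            parser.feed(html)
            next_frontier += [urljoin(url, link) for link in parser.links]
        frontier = next_frontier
    return table

# Every submitted hash that also appears in this table is de-anonymized.
# table = hashes_reachable_from("https://site-recovered-from-one-hash.example/")
```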

So you can do a few things to make it harder, but I don't think there's a way around the user having to basically trust you with his browsing history. The only ways around that which I can see would involve building a distributed system not completely under your control and using that to collect URLs, for example a kind of mixer network. Another avenue might be to have the clients download large parts of your database contents, thus hiding which URLs they were actually interested in, and to provide new content for your database only in large batches, which would at least hide the time component of the user's browsing.

Out of Band
  • 9,150
  • 1
  • 21
  • 30
8

Short answer.

While you state you are concerned about your end-user’s privacy, it’s not clear whom you intend to be “protecting” them from, and for what reason.

  • If the core functionality of your application is to—essentially—farm user data from a client, send it to a server and deliver a result, then you as the recipient of that data will always know what that data is.
  • If your goal is to protect data in transmission from the client to the server from prying third parties, then an encryption scheme can be devised to protect transmission. But that is the absolute best you can do to protect user data.

Long answer.

First you say this:

I’m working on a small side project which involves a browser add-on and a backend service. The service is pretty simple: Give it a URL and it will check if the URL exists in a database. If the URL is found some additional information is also returned.

Then you say this:

The browser add-on forwards any URLs that the user opens to the service and checks the response. Now, sharing every URL you’re browsing is of course a big no-no.

The problem with the scheme you describe and your concerns for privacy is that your application’s core, inherent behavior is to share information that is traditionally considered private. So at the end of the day, what level of “privacy” do you intend to protect, for whom, from what, and for what reason?

If someone agrees to use your application—having some basic, rudimentary knowledge of what the application does and what information it shares—chances are good they know that your backend server will know exactly what they browse. Oh sure, you can set up any elaborate, contrived hashing scheme you can come up with to “mask” the URL, but at the end of the day your backend server will know the end user’s data. And even if you are convinced this data is somehow unknown to you, it still does not stop the perception that you would know what the data is; and honestly I cannot conceive of a scheme where you can provide this service and not know what URLs are being browsed.

If you are concerned about user data leaking out in transmission to potential 3rd parties of some kind then perhaps you can come up with some encryption scheme that can protect the data being transmitted. To me, that is doable.

But if your overall desire is to collect private data of some kind, analyze it and then deliver an end result, the overall concept of you—and your system—somehow not knowing specifics about that data is flawed. You control the backend of a process like this and have complete access to the data whether you like it or not.

Giacomo1968
  • 1,185
  • 5
  • 16
  • 1
    Agreed. One minor nitpick: Just because someone agrees to use the browser extension, they don't necessarily understand the consequences for their privacy. Many internet users *don't* know "fully well" what sending a partial hash of an URL (or even the full URL) means for their privacy, because, sad as it is, many internet users don't understand anything about how the internet, or HTTP in particular, works. To them, computers are magic, the internet is magic and the browser has these magical browser extensions that make more magic things happen. But... they might not care even if they knew. – Out of Band Nov 08 '16 at 13:40
  • 1
    Thanks. What you've said makes complete sense and is pretty much the conclusion I've come to as well. As I see it, hashing does protect the transmitted data from 3rd parties, but then using HTTPS achieves the same goal. From my perspective hashing provides one benefit over sending raw URLs. With hashed URLs the only way for my backend service to track *complete* user history is to have the hash of EVERY URL on the internet. This way, I am at worst able to track only those URLs which I have in my DB, which is better than having a clear browsing history with every URL visited. – Jibran Nov 08 '16 at 16:07
  • 2
    @Jibran: Consider that it doesn't just matter what you can do, it also matters what someone can do who steals your database. You happen to know that you don't have a big rainbow table for your chosen hash, containing all the URLs on the internet. But likely it's trivial for someone else (including you if you turn evil) to compute that table for all the URLs they know about, which is certainly enough URLs to threaten user privacy to some degree. But as Jake says, this app is inherently *somewhat* privacy-busting, so find a level your users can live with. – Steve Jessop Nov 08 '16 at 16:47
  • @Pascal Fair enough. Adjusted the wording to acknowledge the level of knowledge some folks out there have. – Giacomo1968 Nov 08 '16 at 16:55
4

Your proposal to store (partial) hashes of the URLs is an established way to mitigate the impact on privacy. While that makes it harder to answer "On which pages have you been?", it's obviously still trivial if you know exactly which pages you're looking for, since the hashes are practically unique for every URL.

What you describe is exactly the problem that the Google Safe Browsing service had to solve. This service is used by Chrome and other applications to check suspicious URLs against Google's list of dangerous websites while browsing - with the requirement of still ensuring some degree of privacy.

Google outlines their method in the Google Chrome Privacy Whitepaper:

When Safe Browsing is enabled in Chrome, Chrome contacts Google's servers periodically to download the most recent Safe Browsing list of unsafe sites, including phishing, social engineering, and malware sites, as well as sites that lead to unwanted software. The most recent copy of this list is stored locally on your system. Chrome checks the URL of each site you visit or file you download against this local list. If you navigate to a URL that appears on the list, Chrome sends a partial URL fingerprint (the first 32 bits of a SHA-256 hash of the URL) to Google for verification that the URL is indeed dangerous. Chrome also sends a partial URL fingerprint when a site requests a potentially dangerous permission, so that Google can protect you if the site is malicious. Google cannot determine the actual URL from this information.

(Emphasis my own)

Note that if a few false positives are acceptable for your service, you could store only a small part of the hash with the benefit of a faster lookup and plausible deniability.
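
As a rough illustration of that trade-off (the 4-byte prefix length and the URLs are arbitrary assumptions, not what Google actually stores):

```python
import hashlib

PREFIX_BYTES = 4  # 32 bits, as in the whitepaper quote; shorter means more collisions

def fingerprint(url: str) -> bytes:
    # Partial URL fingerprint: the first PREFIX_BYTES of a SHA-256 hash of the URL.
    return hashlib.sha256(url.encode("utf-8")).digest()[:PREFIX_BYTES]

# The service stores only truncated fingerprints instead of full URLs or full hashes.
stored = {fingerprint(u) for u in ["https://evil.example/", "https://scam.example/x"]}

def probably_listed(url: str) -> bool:
    # A match only means "possibly listed": distinct URLs can share a prefix,
    # which is exactly what gives users some plausible deniability.
    return fingerprint(url) in stored

print(probably_listed("https://evil.example/"))    # True
print(probably_listed("https://benign.example/"))  # almost certainly False (could collide)
```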

Arminius
  • 43,922
  • 13
  • 140
  • 136
  • 1
    I'm thinking of going with the hash approach as well. But my understanding is that using a partial hash doesn't offer any privacy advantage over using the full hash, since mostly the hash prefixes will be unique as well. – Jibran Nov 08 '16 at 10:54
  • 1
    @Jibran Depends on how short they are. At some point you will get collisions. – Arminius Nov 08 '16 at 11:04
  • Note that an "interesting" byproduct of this protocol is the following: if Google wants to get notified whenever a user visits `www.somesite.com`, it is sufficient that they add it to the unsafe list. Since this local list is also composed of hash prefixes, this is difficult to detect from outside. – Federico Poloni Nov 08 '16 at 13:29
4

While all the other answers are focusing on how to transfer the URL to your backend service "properly", the general conclusion seems to be: that is not possible.

I'd like to suggest a different approach, which may very well not be possible in your use case, but which I think is at least worth discussing.

Instead of sending the url to the backend, why not send the database over to the addon and do the lookup there?

Of course this introduces all kinds of new problems. The database is probably very big, may contain information you do not want on your users' machines, etc. But for simple/small applications, this might be a valid solution.
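
A minimal sketch of that approach, assuming the database is published as a downloadable set of URL hashes (the download URL below is invented for illustration):

```python
import hashlib
import json
import urllib.request

DB_URL = "https://backend.example.invalid/url-hash-db.json"  # hypothetical snapshot

def load_local_db() -> set:
    # Periodically download the full set of hashed entries and keep it locally.
    with urllib.request.urlopen(DB_URL, timeout=10) as resp:
        return set(json.load(resp))

def check_locally(url: str, local_db: set) -> bool:
    # The lookup never leaves the machine, so nothing about browsing is revealed.
    return hashlib.sha1(url.encode("utf-8")).hexdigest() in local_db
```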

  • This only solves the problem if the database won't be populated by the browser addons themselves, but will be populated from another source. But if that was the case, you wouldn't need to download the whole database; you could also just query for all the urls whose hash contains one of your hashed url's bytes. If that still yields too many results, query for urls which contained two of the bytes of the url hash you were interested in. This way clients can hide what they are looking for, and even trade privacy for download speed. You could provide a "privacy" slider for that. – Out of Band Nov 08 '16 at 17:15
1

It is not much better for user privacy. For example, https://www.google.com/ would always have the same hash, so it would be known who browsed it.

Depending on your project's needs, you might consider other options that suit you better; one of them would be not transmitting every URL every time, for example. You could also check only the FQDN and not the whole URL, which would be a lot better for privacy (a quick sketch of that follows below).
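
For instance, a sketch of hashing only the hostname, assuming the same hash-and-send scheme from the question:

```python
import hashlib
from urllib.parse import urlsplit

def fqdn_hash(url: str) -> str:
    # Keep only the host, so paths and query strings never leave the client.
    host = urlsplit(url).hostname or ""
    return hashlib.sha1(host.encode("utf-8")).hexdigest()

# Both calls submit the same value, revealing only that some page on the
# domain was visited, not which one:
print(fqdn_hash("https://webmd.com/syphilis/treatment"))
print(fqdn_hash("https://webmd.com/"))
```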

Giacomo1968
  • 1,185
  • 5
  • 16
Aria
  • 2,706
  • 11
  • 19