Private contact discovery - calculating probability of false positive

Question

I'm working on contact discovery for a service and want to do it in a way that maintains privacy as much as possible, without using Signal's Private Contact Discovery Service (https://signal.org/blog/private-contact-discovery/), due to hardware/infrastructure limitations.

A user will grant access to their contacts and the mobile apps will generate a SHA-256 hash for each phone number and email address found. The first N characters of each hash will be sent to the server, where they will be compared with the hashes of the contact details of users who use the service.

For each partial match, the server will reply with the next M characters, returning N+M characters, where N+M is less than the length of a SHA-256 hash.

The mobile will then compare these more complete hashes with the ones it generated and mark any users with a match as being a user of the service.

How can I calculate the required values for N and M, so that the chance of a false positive (marking a user as a user of the service when they are not) is 1/1,000,000,000?

How many unique people do you anticipate will use the service? — Cowthulhu, Jun 27 '19 at 14:08

Nic · Answer 1 · 2019-07-01T19:58:47.570

The mathy bit

Let's play with the numbers, but correctly. SHA-256 outputs a 256-bit hash. An n-bit string can represent 2ⁿ possible values, so the odds of randomly guessing a given n-bit string are one in 2**n. Therefore, to get under any probability P of colliding with any given hash, we need to get ceil(log2(P)) and send at least that many bits. How many characters that is depends on how you represent the bits.

But notice my wording there: "any given other hash". That means you need, at minimum, 30 bits to ensure less than one-in-a-billion odds of collision with a single other hash. To ensure no collisions within your database at all, we need to account for the fact that we're not just comparing one hash against one other hash, but one hash against every other hash. So... do the same math I did, but instead of ceil(log2(P)), it's ceil(log2(P * C)), where C is the count of other hashes. It doesn't affect the value all that much, frankly; another few zeroes will only add a few more bits on to the end. Exponents scale really fast.

As a fun side note, even if you had 1 000 000 000 000 (10¹², "one trillion" to Americans) contacts in your system, using the full 256-bit hash, the attacker would have to guess on average 100 000 000 000 000 000 000 000 000 000 000 000 000 000 000 000 000 000 000 000 000 000 (10⁶⁵) hashes before getting a match.

The frame-challenge bit

But I think you're going about this wrong. You say you want to maintain privacy by avoiding sending data. But... there's no way¹ to extract data from a hash. Why not just have the client send all of their hashes to the server and the server sends back the ones that match? It'll be simpler to implement and leak insignificantly more information. Plus, that way you don't trust the client to be honest about which contacts they have; the server, which you control, is doing all the validation. In total, I think that'd actually leak less data to a malicious user than your current scheme.

Otherwise, consider this scheme:

Attacker produces a random prefix and sends it to the server, claiming it's a contact of his.
Server sends back all of the matching hashes.
Attacker sends back a confirmation for all of them.
Server now allows Attacker to claim everyone with any partial match as a contact.
Attacker can message, get their actual contact info, etc., depending on what your service provides.

In contrast, with my idea, the attacker would have to brute-force guess every hash. That's certainly possible, but even if they managed a trillion guesses a second -- an unlikely situation, given that they'd have to hit your server for every single one -- it'd take on average 2.3 times the life of the universe to guess a single collision. ...Er, wait, sorry, I misread my notes. It'd take about 2.3 hundred million billion billion billion universe lifetimes. Turns out 2²⁵⁶ is very big.

^{1: Okay, there sort of is a way to get data out of a hash, depending on the hash you used. Something considered "strong" like SHA-256, though, can't be reversed -- just make sure you build in a way to easily upgrade the hash algorithm if/when SHA-256 needs to be replaced. Prefixing the hash with the algorithm is a useful, secure way of indicating hashes used without much fuss.}

Vilius Povilaika · Answer 2 · 2019-01-28T13:09:22.370

0

Alright. Let’s play with the numbers a little bit. SHA-256 consists of 256 bytes. As you said, we want it split into N and M. A byte is alphanumerical, case insensitive in this case. That makes an each byte one of the 26 letters in the English alphabet, or one of the 10 numbers (0-9). So there’s a 1:36 chance of guessing a byte. SHA-256, as I mentioned, has 256 bytes. After a bit of calculations we can see that 36^6 gives you a 1/2,176,782,336 chance of guessing. So in summary: 6 is the minimal number of digits needed to meet your requirements, in other words - the length of N or M should be 6. The other variable should fill the remaining 250 bytes, or, if N+M < 256b, then both should be equal to 6 to be as short as possible but to meet your requirements.

edited Jan 28 '19 at 13:09

answered Jan 28 '19 at 13:00

Vilius Povilaika

972
8
20

2

256 bits, not bytes – domen Feb 27 '19 at 14:56
1

A byte has 256 possible values (0-255), not 36. And typically hash function outputs are measured in bits, not bytes. – Cowthulhu Jun 27 '19 at 14:10
I feel like the birthday paradox applies here somehow, but I'm not quite sure how... – Nic Jun 27 '19 at 15:38

Private contact discovery - calculating probability of false positive

2 Answers2

The mathy bit

The frame-challenge bit