The mathy bit
Let's play with the numbers, but correctly. SHA-256 outputs a 256-bit hash. An n-bit string can represent 2n possible values, so the odds of randomly guessing a given n-bit string are one in 2**n
. Therefore, to get under any probability P of colliding with any given hash, we need to get ceil(log2(P))
and send at least that many bits. How many characters that is depends on how you represent the bits.
But notice my wording there: "any given other hash". That means you need, at minimum, 30 bits to ensure less than one-in-a-billion odds of collision with a single other hash. To ensure no collisions within your database at all, we need to account for the fact that we're not just comparing one hash against one other hash, but one hash against every other hash. So... do the same math I did, but instead of ceil(log2(P))
, it's ceil(log2(P * C))
, where C is the count of other hashes. It doesn't affect the value all that much, frankly; another few zeroes will only add a few more bits on to the end. Exponents scale really fast.
As a fun side note, even if you had 1 000 000 000 000 (1012, "one trillion" to Americans) contacts in your system, using the full 256-bit hash, the attacker would have to guess on average 100 000 000 000 000 000 000 000 000 000 000 000 000 000 000 000 000 000 000 000 000 000 (1065) hashes before getting a match.
The frame-challenge bit
But I think you're going about this wrong. You say you want to maintain privacy by avoiding sending data. But... there's no way1 to extract data from a hash. Why not just have the client send all of their hashes to the server and the server sends back the ones that match? It'll be simpler to implement and leak insignificantly more information. Plus, that way you don't trust the client to be honest about which contacts they have; the server, which you control, is doing all the validation. In total, I think that'd actually leak less data to a malicious user than your current scheme.
Otherwise, consider this scheme:
- Attacker produces a random prefix and sends it to the server, claiming it's a contact of his.
- Server sends back all of the matching hashes.
- Attacker sends back a confirmation for all of them.
- Server now allows Attacker to claim everyone with any partial match as a contact.
- Attacker can message, get their actual contact info, etc., depending on what your service provides.
In contrast, with my idea, the attacker would have to brute-force guess every hash. That's certainly possible, but even if they managed a trillion guesses a second -- an unlikely situation, given that they'd have to hit your server for every single one -- it'd take on average 2.3 times the life of the universe to guess a single collision. ...Er, wait, sorry, I misread my notes. It'd take about 2.3 hundred million billion billion billion universe lifetimes. Turns out 2256 is very big.
1: Okay, there sort of is a way to get data out of a hash, depending on the hash you used. Something considered "strong" like SHA-256, though, can't be reversed -- just make sure you build in a way to easily upgrade the hash algorithm if/when SHA-256 needs to be replaced. Prefixing the hash with the algorithm is a useful, secure way of indicating hashes used without much fuss.