What's a "safe" URL shortening algorithm?

Question

The situation: Currently, we are sending out emails and SMS to our users, which include a link to a form that each user has to fill out on a regular basis (an "eDiary"). As we need to be able to authenticate and authorize the user accordingly, we currently attach a jwt token as a query parameter to the link as we want to make it as easy as possible for the users, as not many of them are very tech-savvy.

The problem with this is, that the links tend to be very long because of this, which makes it very expensive, as we're sending out about five SMSes, out of which four are just for the link.

Therefore we are currently looking for a way to shorten the links in a safe way. As the data that is being entered is very sensitive, the link should still be secure enough, so it can not be guessed in a "short" amount of time. The links are usually valid for one or two days, as we can not "force" the user to enter the data in a shorter amount of time.

Therefore I am very likely looking for a hashing algorithm of sorts, that could be used in this scenario, where the structure of the links is well known. We'll of course also limit the amount of retries a user could do, before being temporarily blocked by the system, but in case we'll have to serve millions of links, the chances will grow of guessing one at random.

The question: What would be a good compromise between shortening the URL as much as possible, but still keeping the link safe enough so "guessing" a specific link is unlikely for specific users as well as for all users (birthday attack)?

Just to comment. Couldn't you assume that users save their login credentials (cookie remember-me) and that they are already logged in when they land into the linked page? — usr-local-ΕΨΗΕΛΩΝ, Oct 01 '20 at 21:38
Other comment: are you running your own URL shortener or are you using a public shortener? The second would make my body shake... — usr-local-ΕΨΗΕΛΩΝ, Oct 01 '20 at 21:38
How secure it would be if urls were relatively short, but phone number worked as a password? If somebody obtained URL from your phone / SMS, then he has access to it, so it's already dangerous. If he does not have your phone number, then even if somebody guesses the link, then he still cannot enter. Additionally after e.g 3 failed attempts the page should be blocked and as you said - 1 day of expiration or something — Joelty, Oct 02 '20 at 11:04
Don't provide access to "very sensitive" information through SMS. SIM-swapping attacks have already been [successfully used in the wild](https://en.wikipedia.org/wiki/SIM_swap_scam#Incidents). — Fax, Oct 02 '20 at 13:08
What if: those shortended URLs worked **only for the forms** and to view their data the user must log in the regular way? — jaskij, Oct 02 '20 at 13:18
The problem is not the shortener, it is that you are sending authentication tokens through an insecure channel. Email and SMS are not secure. — OrangeDog, Oct 03 '20 at 10:18
Your users' privacy can be further enhanced with a notice upon login via your SMS link mentioning the last time they logged in. Also, it's smart to invalidate the link as soon as it's used. These things are sometimes called *nonces* meaning number used once. — O. Jones, Oct 03 '20 at 11:10
1. @usr-local-ΕΨΗΕΛΩΝ: We do not use cookies, but jwts to also handle the authentication as well as the automatic expiry from the token side. Also, we're of course planning to build our own URL shortener, hence the question. — BenSower, Oct 03 '20 at 12:58
2. @Joelty I really like the idea of using the phone number as a password, but we're working with people that are VERY digitally challenged and need to keep this service as simple to use as humanly possible. Still, this could in fact be a valid solution that would avoid the problem itself! — BenSower, Oct 03 '20 at 13:02
@Fax : Yes, we are aware of SIM-swapping attacks, but as mentioned, we had to make a compromise between usability (we are dealing with a very low-tech crowd) and security. In an ideal world, we'd not have to send out these "magic" links in the first place. — BenSower, Oct 03 '20 at 13:05
@JanDorniak: This logic is already implemented to some degree, as a user can only access the data of the currently active form, which is only temporarily accessible through this channel. — BenSower, Oct 03 '20 at 13:06
@O.Jones: I really like the idea of mentioning the last time they logged in! Concerning invalidating the link, we can only do that after the designated "active" time period has passed, as users might have to edit the data they already entered in that time frame. — BenSower, Oct 03 '20 at 13:08

Conor Mancone · Accepted Answer · 2020-10-02T12:53:11.480

30

Entropy is your friend. Using only alphanumeric characters (special characters are best avoided in this case because they often need URL encoding, which complicates things) you have a "language" of 62 possible characters to choose from. For a string of length X made from this "language", the total number of possible strings is simply:

62**X

If you start blocking an IP address after Y failed attempts then the odds that an attacker with a single IP address will guess a code are:

Y/(62**X)

But imagine an attacker can easily switch IP addresses, so let's imagine they have a million IP addresses at their disposal (note: the number will be much larger if you support IPV6). Therefore their odds of success are simply:

(1e6*Y)/(62**X)

Finally note (h/t @Falco) that the above assumes the attacker is looking for a particular code. If you are worried about someone finding any code then you need to further multiply by the number of active codes you have at a given time, which depends on how often they are created and how quickly they expire.

Given all of this though, you just have to decide how low you want the probability to be, plug in your Y, and solve for X. As a simple starting point I usually suggest a 32 character alphanumeric string (make sure and use a proper CSPRNG). If you block an IP after 1000 failed attempts then an attackers odds of finding a specific code are:

(1e6*1000)/(62**32)

Which is 4.400134339715791e-49. Given those odds, it's more likely that the attacker will win the lottery 4 or 5 times in a row before they guess a code. You could have billions of active codes at a time and the odds of guessing any one would still effectively be zero.

edited Oct 02 '20 at 12:53

answered Oct 01 '20 at 12:22

Conor Mancone

29,899
13
91
96

8

This, but make sure the source of entropy is a secure random number generator! – Saustin Oct 01 '20 at 12:50
Nice, that makes a lot of sense! Thank you for your comment! :-) – BenSower Oct 01 '20 at 17:19
9

With IPv6 you wouldn't limit per IP, you'd limit per /64 or similar. Everyone effectively has unlimited amounts of IPv6 available. Personally, I have /46 available as of right now. In ten minutes I could probably assemble something like /40 if I wanted to... Blocking anything less than a /64 doesn't make sene with IPv6. – vidarlo Oct 01 '20 at 19:43
8

@vidarlo indeed but it is tricky, because it can vary so much from ISP to ISP. Somewhere out there is an ISP that hands out IPV6 a dozen at a time, because... Well people are always doing non-sensical things – Conor Mancone Oct 01 '20 at 20:34
4

Indeed, but less than /64 seems rare. And if you *don't* block a /64, you make it ***trivial*** for an attacker... – vidarlo Oct 01 '20 at 21:02
2

A better idea than blocking ips, is to delay responses when the number of failed attempts reaches your threshold. If the attacker has to wait 10 to 30 seconds before getting a response they can't effectively brute force the solution space. But an end user with the correct details is only marginally impacted. – Bae Oct 02 '20 at 00:07
2

@Bae if the attacker can parallelize the attacks, then by delaying the response you might just be DOS-ing yourself, depending on how that's done (at least you need some memory to hold state between request and response). – Paŭlo Ebermann Oct 02 '20 at 00:29
4

"it's more likely that every person in the world will win the lottery at the same time than that an attacker will guess a code." – Um, if people's chances of winning the lottery are independent and less than 99% for each person, then the chances of everyone in the world winning the lottery at the same time are ***WAY*** less than 10^-49. – Tanner Swett Oct 02 '20 at 04:19
1

@TannerSwett good call - In my head I was combining the exponents in the wrong way. – Conor Mancone Oct 02 '20 at 09:12
With a 32-character string (chosen randomly from a 62-character alphabet as you suggest), you'll have about 190.5 bits of entropy. With that much entropy there's absolutely no need for any kind of rate limiting. See my answer below for more details. – Ilmari Karonen Oct 02 '20 at 11:35
1

The calculation completely misses the fact that the service has more than a single aktive Link at one time. The service should be considered broken if an attacker guesses **any** valid link. If there are 50.000 Users on the Service and each link is valid for 2 days, and there is a new link sent out every day, you have 100.000 active links at any given time. (In this case the result is still e-44) – Falco Oct 02 '20 at 12:18
1

@Falco Good call, thanks! I was approaching from the perspective of an attacker trying to find a particular code, but of course that may not be the threat model here. – Conor Mancone Oct 02 '20 at 12:53
"special characters are best avoided in this case because they often need URL encoding, which complicates things" usually you can find a couple of non-problematic special characters and get the number up to 64, which simplifies the generation process.For example "-" and "_" can be used in urls without encoding. – Peter Green Oct 02 '20 at 19:28

score 12 · Answer 2 · edited Oct 07 '21 at 07:34

TL;DR: Don't bother with rate limiting. Just generate a secure random 128-bit (or 192-bit) token for each URL using your preferred crypto API / library and base64url encode it. Include the encoded token in the URL and also store it in a secure database with the associated user, form and expiration data.

Like Conor Mancone, I would also suggest just including a single random token with sufficient entropy in the URL. You should obviously use a cryptographically secure random number source to generate these tokens.

When generating the URL, you should store each token in a database along with any associated information needed to authenticate the user and display the correct form. You may also want to store a creation and/or expiration timestamp, both to limit the validity period of the URLs (and thus reduce the risk of old e-mails being compromised) and also simply to allow you to purge old records from the database.

As for what counts as "sufficient entropy", the precise lower limit obviously varies depending on your use case and threat model. In particular, assuming that you expect to have at most 2^p valid URLs in your database at any given time, that your adversary can make at most 2^q queries to your service and that they should have at most a one-in-2^r chance of successfully guessing a valid URL, your tokens should be at least p + q + r bits long.

In practice, a pretty safe "industry standard" token length would be 128 bits. Assuming that you'll have at most 2³² valid URLs at a time, a 128-bit token would require an attacker to make at least 2⁶⁴ queries to your service to have a 1/2³² chance of guessing even a single valid URL. For most purposes, this should be more than enough even without any kind of rate limiting.

(Tangentially, a 128-bit token length also allows you to generate up to about 2⁶⁴ random tokens before you'll suffer your first token collision on average. But that's kind of irrelevant, since a database anyway allows you to detect collisions and handle them just by generating a new token.)

If your really wanted to be sure, you could go up to 192 or even 256 bits. A 192-bit token, for example, would allow you to have up to 2⁶⁴ URLs while requiring at least 2⁶⁴ queries for an attack success probability of 1/2⁶⁴. And a 256-bit token would increase the difficulty of the attack by an extra factor of 2⁶⁴ on top of that — not that I see how that could possibly be necessary for any realistic threat.

As for generating and encoding the tokens, I would suggest simply generating a random 128-bit (or 192-bit or 256-bit) bitstring using any cryptographic RNG of your choice and encoding it using URL-safe Base64. (Most programming language runtimes should have a suitable RNG built in, or at least easy to install as a library. And if not, your OS most likely provides one, e.g. as /dev/urandom on Unixish systems.) This will produce a 22-character string for a 128-bit token, a 32-character string for a 192-bit token or a 43-character string for a 256-bit token. And it's quite a bit simpler than generating the token one character at a time as Conor Mancone's answer suggests.

BTW, if you don't happen to have access to a convenient database and/or a secure RNG, another option would be to include all the necessary information (at least user ID, form ID and timestamp) in the URL itself together with a 128-bit cryptographic message authentication code of those values (computed and verified using a secret key stored on the server). Indeed, that's basically what JWT does to authenticate the tokens, just with a bit more overhead.

Note that, in this particular case, each token is only valid for a single user/form/timestamp combination, which the attacker must choose before attempting to guess the token, so effectively p = 0 (since 2⁰ = 1). Thus, a somewhat shorter token can provide the same effective level of security than if using the random token method described earlier. Of course, this length saving is usually more than balanced by the extra parameters that need to be included in the URL.

Very thorough. As an aside, it can be easy to lose track of just how big these numbers are. If you _did_ have 2^32 tokens that were 128 bits each, the tokens alone would take up nearly 70GB of data. Small for a computer, but large for a database. If you managed to generate 2^64 tokens with 192bits of data each, you would need half a zetta byte of data storage - I don't think there are any data centers that large yet! — Conor Mancone, Oct 02 '20 at 13:01
Thank you guys for the detailed insights! This already helped me a lot! — BenSower, Oct 03 '20 at 13:21

score 3 · Answer 3 · answered Oct 02 '20 at 23:25

If you want safety I will recommend UUIDv4 encoded in base58. In essence you get 22 alphanumeric characters which are URL-safe and they store full UUIDv4 which is (reasonably) guaranteed to be random and unguessable.

A nice write up on the subject: https://www.skitoy.com/p/base58-unique-ids/638/

What's a "safe" URL shortening algorithm?

3 Answers3