52

Let's say I operate a website where you can create cat pictures. I give every cat picture a unique identifier so that it can be shared on social media with http://catpictures.com/base62Identifier.

I could give the cat pictures sequential identifiers such as 1,2,3, etc, but then it would be possible to easily discover how many new cat pictures the users create per day (by the largest identifier that returns HTTP 200 each day). This exposes me to the common strategy of ordering a product from your competitors once a month and noting the invoice number. Website traffic figures are well-correlated to business revenue so I obviously want to keep this information secret.

What I'm considering trying:

This sounds like a job for a hashing algorithm, right? The trouble is by observing a hash it's pretty easy to tell which algorithm created it (md5, crc32, etc). Someone with a rainbow table would make short work of that idea. I could salt the identifier [hash("salt"+1), hash("salt"+2), ...], but would then have to worry about the security associated with the salt. And collision checking.

Another idea I had was to generate a random string of characters and use that as the cat picture's primary key in the database (or just I could hash the first n bits of the cat picture data). This way I would only have to check for collisions.

Is there a standard, best-practice way avoiding exposing your traffic volumes through your unique identifier URLs?

Edit: I'm specifically looking for a solution that is a good combination of security and suitability as a database primary key or indexable column.

Escher
  • 603
  • 5
  • 8
  • 41
    Is there any reason you can't use a random number for each resource? There's no need for any hash. – Navin Dec 13 '15 at 20:11
  • 2
    Would using [multiplicative inverses](http://ericlippert.com/2013/11/14/a-practical-use-of-multiplicative-inverses/) be "safe" enough? – Corak Dec 14 '15 at 07:59

13 Answers13

78

The standard approach to this kind of issue is to create a UUID (Universally Unique Identifier) for each picture. This is generally a random 128-bit identifier which you can assign to each picture without any particular concern that it would be possible to enumerate the pictures via a brute-force attack on the namespace.

For example in .NET you can use the GUID structure for this kind of purpose. Since Windows 2000 (source), Guid.NewGuid generates random (version 4) UUID. (Ancient versions generated a version 1 UUID which reveals the date when it was generated, doing nothing to protect you from the "invoice number" problem.)

Rory McCune
  • 60,923
  • 14
  • 136
  • 217
  • 1
    Aren't v4 UUIDs generated from a cryptographically strong PRNG through, making them appropriately random? – Escher Dec 14 '15 at 13:59
  • @AndréBorie: The main performance drawback I have observed from using UUIDs as primary keys is that random UUIDs are ill-suited for use as a clustered index, which is the default SQL Server setting for a PK. So one has to make sure to configure the table appropriately, and set the primary key as NON CLUSTERED. – Jean Hominal Dec 14 '15 at 14:20
  • 5
    @Escher [The specification](https://www.itu.int/rec/T-REC-X.667-200409-S/en) "strongly recommends" a cryptographically secure RNG, but does not forbid to use a non-secure one. – Philipp Dec 14 '15 at 19:40
  • 1
    Note on UUID vs. CSPRNG: Best of both words would be generating the number with a CSPRNG and then making it an UUID v4 via bit mask. – Medinoc Dec 15 '15 at 10:29
  • 3
    An extended discussion on these points should go to DBA.SE but from a DBA's PoV UUIDs can be problematical. In their natural form they are bad candidates for clustered keys, leading to much excess fragmentation. The "Sequental UUIDs" compromise in SQL Server fixes this but recreates the key guessing problem. Their size can also be more of an issue then some assume: as well as 12 extra bytes per row in data (when compared to a 32-bit integer key) there is 12 extra bytes per row on relevant indexes (and 12 per row on *all* indexes if your clustering key in a UUID). – David Spillett Dec 15 '15 at 11:53
  • 1
    @David Spillett, the answer to that is "don't cluster on a UUID." Key indexes do not have to be clustered indexes. – Greenstone Walker Dec 16 '15 at 01:29
  • 1
    @GreenstoneWalker: Yes, but sometimes there isn't another really useful clustering key either, and in any case a lot of people default to clustering on their PK without even thinking about it. What I'm trying to say is "think about it up front and you and/or your DBA will thank you later" (and I think you are agreeing). – David Spillett Dec 16 '15 at 09:45
30

I would simply use the picture hash. What's the problem with someone figuring out the hash you used? If I think "this part of the url looks like a sha1", download the file and it has that sha1, I was right. But that doesn't make me able to break your «cat security». Even if it was treatable to attempt breaking the hash to figure out the picture, there's no point in attempting that instead of simply downloading it.

Ángel
  • 17,578
  • 3
  • 25
  • 60
  • 13
    Except that if two users upload the same cat picture (both from elsewhere), which user then 'owns' it (e.g. can delete it)? This is assuming prohibiting duplicates is not in itself a benefit. – abligh Dec 14 '15 at 10:11
  • 5
    @abligh, well, in this case I'd just keep a list of users who uploaded it and didn't delete it. But of course it depends on the use-case. There are cases where you need to hide the fact the two documents from different users are in fact the same, in which case you can just throw the user id into the hashed content. – Jan Hudec Dec 14 '15 at 13:04
  • 1
    If it wasn't cat pictures but large ZIP files (for instance), this method would also prevent duplicate files being uploaded. – Prinsig Dec 14 '15 at 14:00
  • @abligh You don't have to use the primary key as the method of finding the picture in a typical URL route. For example, you could have `mysite.com/users/user_id/pictures/image_hash` or `mysite.com/pictures/image_has/users/user_id`. I guess what I'm trying to say is that as long as you enforce the `image <--> user` association, the image hash could be used to look up the image in the database (like any other column) instead of the integer primary key, and then only the user associated with the image's id could delete it. If you don't have forced logins to upload though, that's another problem. – Chris Cirefice Dec 14 '15 at 16:13
  • @Prinsig And different files with the same hash – rpax Dec 15 '15 at 22:32
14

Just generate a cryptographically secure hash of the image data and use it as an identifier.

This has two side-effects:

  • People can tell if an image already exists on your service by asking for an image with that hash.
  • People can not upload duplicate images.

Both of these effects are not inherently bad. They might even come in handy. But if you would like to avoid them, you could salt each image hash with a pseudorandom number from a secure random number generator.

Collisions are nothing to worry about, by the way. With a hash function like SHA256, the chances for a random collision are so astronomically small, it would be a sensation when you would find one.

Philipp
  • 48,867
  • 8
  • 127
  • 157
  • 1
    I think another side effect would be the requirement of long cat picture URLs :P – Jonathan Gray Dec 13 '15 at 17:14
  • *People can not upload **exact** duplicate images*. Open a .jpg, "save as", and hash both files. It's *possible* that your software spots there's no change and writes the original data, but tweak one pixel in any image format and the hash will be different. So you don't even reliably prevent people uploading the *same* image accidentally given that the "same" image may not be the same file (e.g. automagic resizing on mobile). [Image hashing](http://stackoverflow.com/q/998662/2583476) would fix that, but isn't a secure hash. It may not matter of course. – Chris H Dec 14 '15 at 09:15
  • 4
    @Chris with real-world image hosting services (like Imgur, for instance), a common use case is taking an image file obtained elsewhere and uploading it as-is, with no editing at all, not changing a single pixel, not even opening it in an image editor and hitting "Save As". It's probably quite common that people upload bit-for-bit identical images. – David Conrad Dec 14 '15 at 19:33
  • @DavidConrad, I'm sure it is common. Re-uploading a non-bit-identical image probably also is: on http my mobile provider *sometimes* recompresses images (making e.g. maps useless in the process). Re-upload that and it won't hash the same, with no user action to alter it. – Chris H Dec 14 '15 at 21:58
  • The point is that the statement `People can not upload exact image duplicates` is wrong. – Stijn de Witt Dec 17 '15 at 15:54
9

The standard way is simply to randomly generate your URLs, using a cryptographically secure pseudo-random number generator (CSPRNG).

No need for any hashing or the like - just use plain old random numbers. They don't need to be GUIDs either (unless your database handles GUIDs better than simple numbers for some reason). Presumably your site already remembers which image is accessible at each URL, so just modify that to deal with random URLs instead of sequential ones.

A 128-bit random number should be long enough.

Remember to check for duplicate URLs when processing new images.

D.W.
  • 98,420
  • 30
  • 267
  • 572
user253751
  • 3,885
  • 3
  • 19
  • 15
  • Using a UUID (or GUID) means you don't have to check for duplicates - the 'U' stands for Unique, and is part of the guarantee of a UUID or GUID. But a UUID doesn't necessarily offer a guarantee of unpredictability. A cryptographically secure pseudo-random number generator (CSPRNG) used to generate a sufficient number of bits provides both. – John Deters Dec 14 '15 at 19:31
  • @JohnDeters Most UUIDs are type 5 UUIDs, which are just random numbers with a "this is a random number" indicator tacked on. – user253751 Dec 14 '15 at 20:09
  • I agree. RFC section 6. Security Considerations, begins with: "Do not assume that UUIDs are hard to guess; they should not be used as security capabilities (identifiers whose mere possession grants access), for example." But if you use a CSPRNG to generate a large enough number, it will be hard to guess. A hash of a photo is not a suitable seed for a CSPRNG. – John Deters Dec 14 '15 at 21:20
  • 1
    @immibis, you are free to refrain from editing other answers if you prefer -- but it's not exactly Stack Exchange policy. See http://security.stackexchange.com/help/editing and http://security.stackexchange.com/help/privileges/edit. If you feel that my edits did not improve your answer or made it worse, you have the power to roll the edits back. (Personally, I felt that my edit is an improvement, and after adding the additional caveats I added in my edit, this becomes the best answer to the question... but feel free to form your own opinion.) – D.W. Dec 14 '15 at 23:42
8

From what I read in the question, comments, and other answers, everything is turning around finding a unique identifier for each picture, which is not guessable, nor would provide information about the number of pictures, and easy to handle in a database.

Then, why don't you just use the timestamp of insertion (number of millisecond since 1970)? If there is a probability for two people inserting a cat picture in the very same millisecond, you can concatenate it with a sequential number corresponding to the number of insertion in that millisecond.

That way the only thing somebody aggressively searching your last photo would find out about is the last time someone added a photo provided you let such a jerk make what would look like a daily dos attack.

Meanwhile you would have no concerns with collisions or database support.

Aldian
  • 181
  • 3
6

Alternative solution:

Add metadata to your image identifiers. Example:

philipp_20151213_00002.jpg - Second image posted by user Philipp on December 13th 2015.

I leaks that metadata, but it's only data a user can see anyway when clicking on that link (I assume).

This doesn't tell an observer how many images are posted in total on your service, just about the activity of that particular user on that particular day. If you would like to hide this too, you could use pseudorandom numbers instead of sequential numbers. Collisions might still be possible when a single user uploads a very large amount of images in one day, but they will be rare enough that you can handle them by simply generating new random numbers until you have one which isn't taken.

Philipp
  • 48,867
  • 8
  • 127
  • 157
1

Here is one method. Keep an 8 byte server-wide CSPRNG. Then for each new image, generate another 8 byte CSPRNG. Hash this CSPRNG with your server-wide CSPRNG (md5 is fine). Then XOR the last 8 bytes of the hash with the image ID (which will auto-increment from 0 in a database). The client will receive a Base64 encoding of the image's unique 8 byte CSPRNG along with the 8 byte XOR result. This will be the public image ID.

When the server receives the public image ID, it will hash the first 8 bytes of the public ID along with the 8 byte server-wide CSPRNG. Then it will take the last 8 bytes of the hash and XOR it with the last 8 bytes of the public ID. The result would be the private internal ID which can be indexed from the database.

Update (explanation):

First, pre-define a random global CSPRNG that will be used for all ID calculations (8 bytes or 64-bits with 18,446,744,073,709,551,616 possible combinations).

serverCSPRNG = CSPRNG(8)

For creating a new public ID (16 bytes) from a privateID (8 bytes), do the following:

newCSPRNG = CSPRNG(8)
hashEnding = last8Bytes(md5(newCSPRNG + serverCSPRNG))
publicID = newCSPRNG + XOR(hashEnding, privateID)

For deriving the privateID from the publicID:

hashEnding = last8Bytes(md5(first8Bytes(publicID) + serverCSPRNG))
privateID = XOR(hashEnding, last8Bytes(publicID))

For additional security, a secondary global (static server-only) CSPRNG may be XOR'd on the last 8 bytes of the publicID in order to protect it completely from brute-force attacks (as it implements the security model inherent of a one-time-pad).

Jonathan Gray
  • 1,036
  • 7
  • 11
  • This is more complex than just using a random number. Are there some specific advantages it provides? Also, if you really want to propose this, you might want to use mathematics to describe your scheme; that will probably be easier to follow. Finally, is 8 bytes enough? If there are $2^{32}$ images, then by the birthday paradox, an attacker can estimate roughly how many images you have by doing something like $2^{32}$ probes -- so I suspect you'd be better off using 128 bits or so. – D.W. Dec 14 '15 at 22:58
  • @D.W. I will try to clear it up using a more mathematical-like approach, as you suggest. There are actually going to be 2^64 (18,446,744,073,709,551,616) possible combinations with 8 bytes (this is 64-bit protection). However this approach is not simply breakable using regular brute force methods. It is more of a one-time-pad approach. This is because the brute force is done on the server CSPRNG, which would require prior knowledge of a public ID along with the associated internal ID. This could even be overcome with a secondary server-side CSPRNG (for the true 128-bit protection). – Jonathan Gray Dec 15 '15 at 12:03
  • @D.W. As for specific advantages, other methods require the use of values that are likely to use implementations that require an additional indexed column in a database. This method derives the public ID *after row insertion* which could be beneficial as-well. Additionally, this method will scale better in a clustered setup where one master database is used. There is also no need to compensate for errors due to the possibility of inserting a non-unique value into a column requiring unique values (if properly implemented). – Jonathan Gray Dec 15 '15 at 12:39
  • Thanks for the update: the new description makes things a lot clearer. However, I don't think 8 bytes is enough. A brute-force attacker only needs to guess the `serverCSPRNG` value, which is only 64 bits, so a brute-force attack requires only 2^64 md5 computations. That is below what is recommended today and might be feasible at a not-unreasonable expense. I do see the advantages; nice. However, I would suggest a slightly different scheme: e.g., I'd suggest `publicID = E(K, privateID)` where `K` is a 128-bit server key and `E` is a strong encryption method (AES?). – D.W. Dec 15 '15 at 17:02
  • @D.W. I chose this method over strong encryption as it is more efficient. But also like I was trying to explain (maybe I wasn't clear) is that when brute forcing there is no way of determining whether the value you have it correct or not. There are deterministic ways of guessing at it but even that possibility is removed using a second 8-byte server-wide random value. The addition of the second random value would complete the requirements for a completely non-deterministic, "perfect" encryption. It's a little messier than a one-time-pad but the concept is virtually the same. – Jonathan Gray Dec 15 '15 at 17:55
  • *"when brute forcing there is no way of determining whether the value you have it correct or not"* - I don't think that's correct. I suspect that the private IDs will often have some structure that allows verifying a correct guess at `serverCSPRNG` (e.g., if the private IDs are sequential): guess `serverCSPRNG`, do a trial decryption, and see if the resulting privateID has the right structure/format. – D.W. Dec 16 '15 at 00:18
  • @D.W. The private IDs are seqential, yes, but that value is ultimately XOR'd with a hash that's derived with the help of a random value. Because of the way it's mathematically derived, any possible public ID will result in a valid private ID. There really is no way to verify any part of the crypto to make use of a brute force attack against this, especially if using the secondary CSPRNG (this would prevent an attack against a known private ID). In fact knowing one server CSPRNG would only be beneficial in brute forcing the other (assuming the private ID is also known). – Jonathan Gray Dec 16 '15 at 07:26
  • I would like to point out that I know that it is possible if knowing multiple sequential public IDs to gain enough information to brute force against the original server CSPRNG. However in this case, hashing collisions actually work in our favour making false positives possible. It's not pretty. The additional use of the secondary server-wide CSPRNG is added protection that would still prevent internal ID leakage stemming from the compromise (brute force) of the first. Without that, all they can really track is sequence. – Jonathan Gray Dec 16 '15 at 08:54
1

As noted here: Hashes, UUID's etc. have the 'disadvantage' that insertions of records in the DB where these hashes/uuid's are the PK and the PK is clustered are possibly very expensive (define expensive...) since they're usually not sequential (unless a specific function like NEWSEQUENTIALID is used, however: note the "important" block on that page: "If privacy is a concern, do not use this function. It is possible to guess the value of the next generated GUID...").

Apart from the suggestions here I would consider something like Twitter's (discontinued) snowflake. I wrote a similar .Net library (IdGen); it's readme has some information on how it works exactly. The advantage is that generated ID's are still (mostly) sequential, not too space-intensive (64bit v.s. 128bit UUID's / hashes) and can be used in a (uncoordinated) distributed environment where you have several hosts/processes generating ID's without causing collisions. And altough they're sequential, they don't give away much information on the number of cat pictures (or, more generally, number of "used ID's") over some period of time.

RobIII
  • 442
  • 2
  • 9
1

This sounds like a job for a hashing algorithm, right?

No, because as you observe you have to worry about collisions. To me it sounds like a job for a permutation, i.e. a block cipher. This does require management of a key, which is the downside, but it allows you to use your database's auto-increment function and not to worry about collisions.

The tricky part is deciding what to do about the IV, and here you have options. You could generate a new one each time you create a URL, so there will potentially be e.g. 2^128 different URLs per cat picture. You could make the IV be per-user or per-session and stored server-side as part of the user profile / session state. You could even make it be per-user but included in the URL, so you can track who successfully makes the pictures go viral.

Peter Taylor
  • 161
  • 5
0

One approach is to use hashids.

Hashids is a small open-source library that generates short, unique, non-sequential ids from numbers.

It converts numbers like 347 into strings like “yr8”, or array of numbers like [27, 986] into “3kTMd”.

You can also decode those ids back. This is useful in bundling several parameters into one or simply using them as short UIDs.

Your DB performance is unimpaired as you can continue to use numeric sequential IDs internally. Meanwhile the external keys are opaque.

  • 2
    Please be aware, that hashid's isn't crypto secure: http://carnage.github.io/2015/08/cryptanalysis-of-hashids/ ids can be returned to numbers by a sufficiently motivated adversary – user999305 Dec 14 '15 at 21:01
  • I think hashids are reasonable for this use case. While hashids are reversible, they start short, they are guaranteed to be unique, and it prevents offensive English words (like fuck) from appearing as hashes. The reversibility can be mitigated by using a large secret. It won't be *crypto secure* like a UUID, but it does have many properties that are desirable for client facing Invoice Numbers. – whitehat101 Dec 15 '15 at 02:24
  • Not all cases need crypto security. When it is needed, I'd go for arbitrary GUIDs instead. – Alfred Armstrong Dec 15 '15 at 09:43
0

I have a low-tech solution to this problem. Simply use a URL shortening service, or write your own.

It provides the following:

  1. Your public URL is not exposed on social media sites.
  2. Your URLs are guaranteed to be random and arbitrary.
  3. You are free to change your underlying implementation of resource naming, and the external links will continue to work.
  4. Easier sharing http://catpic.to/i34dhY vs. http://catpictures.com/some-guid-string.
  5. The unique ID is easily indexed/searched.

If you don't want to rely on a third party service, you can easily roll your own by implementing a bijective function in the language of your choice.

Burhan Khalid
  • 926
  • 1
  • 6
  • 4
0

Problem:

  • We wish to have a number that is sequential; otherwise it gets expensive adding records to the database as the middle of indexes have to be updated in a mostly random order.
  • We don’t want the number to relate to how many cats have been uploaded.
  • We need the number to be unique but only within your website.

Therefore:

  • nextCat is set to 0 when the website first starts up, it will likely need to be 64 bit.
  • nextCat is incremented each time a cat is added, and newCat is set to true.
  • nextCat is incremented by a random timer that fires at a rate that is faster than you expect cats to be added. However if newCat is true, then the incremented is not done for this timer fire, and newCat is set to false.
  • each cat is ALSO given a GUID, but never needs to be found based on its GUID
  • the web address for a cat is something.com/cats/catNumber-catGuid
  • if when a cat is requested the catGuid is wrong, the same response is given then for a catNumber that does not relate to a cat.

(The timer is done for a random time, so that it is hard to tell if two cats are added between a fireing of the timer.)

Ian Ringrose
  • 641
  • 1
  • 4
  • 9
  • So the indices of cats are for example 0, 1, 4, 7, 8, 10, 12, 15, etc? If you knew the distribution used to generate the random interval you could take the expected value and divide the difference between cat indices 24hrs apart to get a pretty good estimation of the cats generated during that time. – Escher Dec 16 '15 at 16:17
  • Thanks @Escher, I have updated my answer so it is very hard to find the indices of the cats. – Ian Ringrose Dec 16 '15 at 16:29
-2

General Best Practice: Never expose the PKEY in any web links.

In your database - your PKEY needs to be a BIGINT for speed. Also in your database, consider adding this field... (public_filename ..if not exists) to your table. public_filename field needs to be a guid string. Use a guid function to rename the file with a unique file name upon upload to your server, and populate public_filename with that.

The public_filename should be used for weblinks, not the PKEY.

Also I recommend keeping a user_filename field to support any forensic searches from the uploader- if you allow that. user_filename would be the original filename uploaded by the user.

Never expose the PKEY in any web links - always use some form of public_filename. Use your database queries to de-reference the public_filename to a PKEY, and from there you can figure out which file to serve from the server.

Another Best Practice - never overwrite a users file upload automatically. Rename the user_filename field with a serialization (-001, -002).

Odds are you will get many files named "mycat" from the same user.

fredogone
  • 1
  • 1
  • There's nothing wrong with exposing the primary key. It's common to use hash(data) as the key. – Navin Dec 13 '15 at 20:13
  • Navin - hash strings as primary keys are really slow. Why would you want that? Plus. forensically - you have no speedy sequence. – fredogone Dec 13 '15 at 20:23
  • fredogone - Store the hash in a 64bit int, not a string. You're only going to hash the image when it is added to your system. When the user wants to retrieve it, they need to remember the hash. – Navin Dec 13 '15 at 20:28
  • Navin - you've just eliminated forensic sequence, unless you keep another field to capture timestamp. – fredogone Dec 13 '15 at 20:30
  • Well, if you ever update/edit rows, an auto-incrementing ID wouldn't give you sequence either. – Navin Dec 13 '15 at 20:33
  • No what forensic evidence is looking for - is it? But if you were looking for rows created between a range of rows - it would..an auto-incrementing ID would give you that. – fredogone Dec 13 '15 at 20:56