I want to make it impractical to link users to their sensitive data without their passwords, even with full access to the database.

Furthermore, if a user has multiple pieces of sensitive data, I also want to avoid linking the different pieces together.

Based on the comments and some searching, I have updated the question.

I found a few similar questions, but none seem to give the details I'm looking for, or they have slightly different prerequisites (e.g. data sharing).

So, in addition to registering website users and managing them with standard CRUD operations, I'm storing some possibly sensitive pieces of data from the users into the database. In that respect, this question is similar, and the answers offer guidelines for me as well. Specifically, I should not "store anything in [my] database that could be used to obtain the encryption key without knowing the password."

The client side is composed of simple HTML/CSS/JS pages and, at the moment, a desktop app is not an option. I analyze the data only in groups (based on variables in the data), so individual data are of no interest. However, I want users to be able to see their own data and, for example, to delete it if they wish.

What I'm thinking is generating a key for each piece of data, encrypting the key–data_id pairs into the database and decrypting them each time an unencrypted key is needed, either for storing data or when a user wants to see their data:

import json
from cryptography.fernet import Fernet


def get_key(self, password, data_id):

    # Check that the given password is valid
    if not self.check_password(password):
        raise KeyError('The given password is not valid')
    
    # Always use string representation of the data_id since json allows only string keys
    data_id_str = str(data_id)

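    # Create a Fernet object with the password
    # (Fernet requires a 32-byte URL-safe base64-encoded key, so the password
    # must first be run through a key derivation function)
    f = Fernet(password)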
    
    # Set the encoding for the bytes <-> string conversion    
    encoding = 'utf-8'
    
    # Decrypt and load into a dict the existing keys
    if self.encrypted_keys:
    
        # Ensure that the encrypted keys are in bytes for the Fernet
        bytes_encrypted_keys = bytes(self.encrypted_keys)

        # Decrypt the encrypted keys and transform the bytes object into a string
        keys_string = f.decrypt(bytes_encrypted_keys).decode(encoding)
        
        # Load the string into a dict
        keys_dict = json.loads(keys_string)
    
    # Create an empty dict if no keys defined
    else:
        keys_dict = {}
    
    # Try to get a key for the data_id
    try:
        key = keys_dict[data_id_str]
    
    # The key not found
    except KeyError:
        
        # Generate a new URL-safe 32-byte key and store it as a string in keys_dict
        key = keys_dict.setdefault(
            data_id_str,
            Fernet.generate_key().decode(encoding),
        )
        
        # Turn the updated keys_dict into a string
        updated_keys_string = json.dumps(keys_dict)

        # Encode the string into bytes for the Fernet
        bytes_keys = updated_keys_string.encode(encoding)

        # Encrypt the updated keys
        self.encrypted_keys = f.encrypt(bytes_keys)

        # Save the model instance so the encrypted keys are written to the database
        # (the bytes object itself has no save() method)
        self.save()
    
    # Return the decrypted key for the data_id
    return key
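
Note that `Fernet` expects a 32-byte URL-safe base64-encoded key, so a raw password cannot be passed to it directly. Below is a minimal sketch of deriving such a key from the password with PBKDF2, using only the standard library. The function name `derive_fernet_key`, the per-user salt and the iteration count are assumptions for illustration and are not part of the code above:

```python
import base64
import hashlib
import os

# Assumed work factor; tune it to what the server can afford per request.
PBKDF2_ITERATIONS = 200_000


def derive_fernet_key(password: str, salt: bytes) -> bytes:
    """Derive a 32-byte key from the password, base64-encoded as Fernet expects."""
    raw_key = hashlib.pbkdf2_hmac(
        'sha256',
        password.encode('utf-8'),
        salt,
        PBKDF2_ITERATIONS,
        dklen=32,
    )
    return base64.urlsafe_b64encode(raw_key)


# The salt is not secret; it would be stored next to the user's encrypted keys.
salt = os.urandom(16)
key = derive_fernet_key('correct horse battery staple', salt)
```

The resulting `key` could then be passed to `Fernet(key)` in place of the raw password.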

Does this seem like a reasonable process? Are there some obvious flaws I'm missing? Is this overkill? Are there some other things I should consider?

I'm aware that a weak spot in this is the strength of the password. I'll try to manage that with the standard strength checks.

I also understand that access to the server gives the ability to intercept the process and compromise the keys. Of course, if there were a way to prevent this without a desktop app, I'd be interested. Currently, I'm hoping at least to secure the database.

Thank you for the advice!

teppo
  • "I want to make it difficult" -- I'd start by rephrasing that to actually define a spec for what you actually want to happen. You either want to prevent it or you want to allow it under certain conditions. Once you define that, your design will start to emerge. – schroeder Nov 22 '20 at 13:21
  • The concept you are going for is "Pseudonymisation". You want the data to be anonymous, but attributable at a later date. There are well-defined methods and design patterns for this. Look at "tokens". The bonus is that if you do it correctly, you can keep the data (if permitted in the regulatory environment you are in) when the user wants to "delete" it, by deleting the token instead (provided you have stripped PII from the tokenised data, of course). – schroeder Nov 22 '20 at 13:23
  • Thanks for the tip @schroeder. I'll do some searching on "tokens". I would like to **prevent** unauthorized linking of the users to their sensitive data but I'm not sure if that is ever possible. If an adversary has enough resources and time, I believe there's no way you can prevent them from finding out the information they are after. Hopefully, however, I could make it difficult enough to be impractical given the possible rewards. I'll rephrase. – teppo Nov 22 '20 at 18:55
  • I really think you will find what you are looking for in Pseudonymisation models and tokenised data. I haven't architected a tokenised system in forever, else I'd add more in an answer – schroeder Nov 22 '20 at 20:08

1 Answer

If I'm understanding correctly:

Forget encryption & decryption and key protection. None of that is necessary.

Use a Hash the same way password identities are kept.

The hash becomes a unique identifier of the data without revealing the user.

You derive a suitably complex hash from the user-supplied password. The user can later provide that password for you to re-hash to derive the matching identifier for their data.

No fuss, no muss, no stored passwords.
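
A minimal sketch of this idea, assuming PBKDF2 from Python's standard library as the "suitably complex" hash. The function name `data_identifier`, the fixed context value and the iteration count are illustrative assumptions:

```python
import hashlib
import hmac

# Assumed work factor; a slow hash makes guessing passwords offline more costly.
ITERATIONS = 200_000

# Assumed fixed, non-secret context value used as the salt for data identifiers.
CONTEXT = b'data-identifier-v1'


def data_identifier(password: str) -> str:
    """Hash the password into an identifier that does not name the user."""
    digest = hashlib.pbkdf2_hmac(
        'sha256', password.encode('utf-8'), CONTEXT, ITERATIONS
    )
    return digest.hex()


# When data is stored, it is labelled with the identifier only.
stored_id = data_identifier('correct horse battery staple')

# Later, the user supplies the password again and the identifier is recomputed.
is_match = hmac.compare_digest(
    data_identifier('correct horse battery staple'), stored_id
)
```

In this sketch, two users who pick the same password would end up with the same identifier, so a real design would need something more to tell them apart.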

--EDIT for your newly added constraint--

Furthermore, if a user has multiple pieces of sensitive data, I also want to avoid linking the different pieces together

Use a random salt to produce a different hash ID for every data blob. With no other constraints, you'd have to compute the hash for every salt in the system to find a match. That may be trivial for hundreds of salts, but for much larger numbers you may want an additional constraint.
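
A rough sketch of the salted variant, again assuming PBKDF2 from the standard library. The helper names (`blob_identifier`, `new_blob_record`, `find_blobs`) and the in-memory `records` list standing in for a database table are hypothetical:

```python
import hashlib
import hmac
import os

# Assumed work factor; every lookup pays this cost once per stored salt.
ITERATIONS = 100_000


def blob_identifier(password: str, salt: bytes) -> bytes:
    """Hash the password with a per-blob salt to get that blob's identifier."""
    return hashlib.pbkdf2_hmac(
        'sha256', password.encode('utf-8'), salt, ITERATIONS
    )


def new_blob_record(password: str) -> tuple:
    """Create a (salt, identifier) pair for a new data blob."""
    salt = os.urandom(16)
    return salt, blob_identifier(password, salt)


def find_blobs(password: str, records: list) -> list:
    """Return the positions of all records whose identifier matches the password."""
    return [
        index
        for index, (salt, identifier) in enumerate(records)
        if hmac.compare_digest(blob_identifier(password, salt), identifier)
    ]


# Hypothetical table: two blobs for one user and one blob for another.
records = [
    new_blob_record('alice password'),
    new_blob_record('bob password'),
    new_blob_record('alice password'),
]
alice_blobs = find_blobs('alice password', records)
```

Because each blob has its own salt, the two identifiers belonging to the same password look unrelated in the table, and the lookup cost grows linearly with the number of stored salts.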

user10216038
  • I see that the specs I gave are incomplete. I also need a normal registration for the users, as there are views allowed only for registered users. In this respect, [this is a similar question](https://security.stackexchange.com/questions/23409/how-to-login-and-encrypt-data-with-the-same-password-key?rq=1). So, if I understand correctly, if I use only hashes, I would need to use two passwords: one for login and one for the link between the user and the sensitive data. I'll update the question accordingly. – teppo Nov 22 '20 at 12:52
  • @teppo - you could use two *different* passwords if you like. You could also use the same password with two different hash algorithms. One would not relate to or correlate to the other. – user10216038 Nov 23 '20 at 05:08
  • Yes, two different hash algorithms might be an option. However, I noticed one weakness in my original idea, which would apply to two hash algorithms as well: if a user has multiple data blobs, using the same key for all of them might make it easier to identify the user. I'll do some thinking and update my question again. – teppo Nov 23 '20 at 19:14
  • @teppo - Yes, data anonymization is a whole other problem. Organizations have been failing at that for years and continue to do so. – user10216038 Nov 23 '20 at 20:42
  • About the random salt, I would then need to store the salts secretly somewhere, wouldn't I? Or how would the process of getting to see your own data work? – teppo Nov 24 '20 at 18:44
  • @teppo - No, salts are not secret. You can simply append them to the hash. For convenience, you may want to keep a separate list of all salts used so that when a request for *My Content with Password* comes in you can produce all possible hashes from the salt list in order to look for a match. Nothing secret or protected here. It relies on the fact that you need the password to find the hash, with or without salt, and the process is not reversible. – user10216038 Nov 24 '20 at 18:52
  • Thanks, now I understand. So, what do you think: what are the benefits of using just hashes instead of encrypting the keys as in my suggestion? – teppo Nov 26 '20 at 19:00