6

I'm looking to encrypt files using secure hashing and encryption algorithms in Python. Having used bcrypt in the past, I decided to use it for my passphrase calculator, then pass the output through SHA256 in order to get my 32 bytes of data, then use that with AES to encrypt/decrypt a file:

#!/usr/bin/env python

from argparse import ArgumentParser
from bcrypt import gensalt, hashpw
from Crypto.Cipher import AES
from hashlib import sha256

import os, struct, sys

def main():
    parser = ArgumentParser(description = "Encrypts or decrypts a file using " +
            "bcrypt for the password and triple AES for file encryption.")
    parser.add_argument('-p', '--passphrase', required = True, 
            help = "The passphrase to use for encryption.")
    parser.add_argument('-i', '--input', required = True,
            help = "The input file for encryption / decryption.")
    parser.add_argument('-o', '--output', required = True,
            help = "The output file for encryption / decryption.")
    parser.add_argument('-r', '--rounds', default = 10, 
            help = "The number of bcrypt rounds to use.")
    parser.add_argument('-s', '--salt', default = None,
            help = "The salt to use with bcrypt in decryption.")
    parser.add_argument('operation', choices = ('encrypt', 'decrypt'),
            help = "The operation to apply, whether to encrypt or decrypt data.")

    parameters = parser.parse_args()

    if parameters.operation == 'encrypt':
        encrypt(parameters.input, parameters.output, parameters.passphrase,
                parameters.rounds)
    elif parameters.operation == 'decrypt':
        decrypt(parameters.input, parameters.output, parameters.passphrase,
                parameters.salt)

def encrypt(input_file, output_file, passphrase, rounds):
    bcrypt_salt = gensalt(rounds)
    bcrypt_passphrase = hashpw(passphrase, bcrypt_salt)
    passphrase_hash = sha256(bcrypt_passphrase).digest()

    print "Salt: %s" % (bcrypt_salt, )

    iv = os.urandom(16)

    cipher = AES.new(passphrase_hash, AES.MODE_CBC, iv)

    with open(input_file, 'rb') as infile:
        infile.seek(0, 2)
        input_size = infile.tell()

        infile.seek(0)  

        with open(output_file, 'wb') as outfile:
            outfile.write(struct.pack('<Q', input_size))
            outfile.write(iv)

            while True:
                chunk = infile.read(64 * 1024)
                if len(chunk) == 0:
                    break
                elif len(chunk) % 16 != 0:
                    chunk += ' ' * (16 - len(chunk) % 16)

                outfile.write(cipher.encrypt(chunk))

    return bcrypt_salt

def decrypt(input_file, output_file, passphrase, salt):
    print "Salt: %s" % (salt,)

    bcrypt_passphrase = hashpw(passphrase, salt)
    passphrase_hash = sha256(bcrypt_passphrase).digest()

    with open(input_file, 'rb') as infile:
        input_size = struct.unpack('<Q', infile.read(struct.calcsize('Q')))[0]
        iv = infile.read(16)

        cipher = AES.new(passphrase_hash, AES.MODE_CBC, iv)

        with open(output_file, 'wb') as outfile:
            while True:
                chunk = infile.read(64 * 1024)
                if len(chunk) == 0:
                    break

                outfile.write(cipher.decrypt(chunk))

            outfile.truncate(input_size)

if __name__ == "__main__":
    main()

What are the possible weak points of an implementation like this?

What I have determined is that it would be easy for an attacker to determine the original file size, however that doesn't reveal much about the file. SHA-256 isn't the best hashing algorithm in the world, but wrapping a bcrypt password would lead me to believe that all threats would be mitigated there. Due to bcrypt's ability to increase in security over time by adding more rounds to the algorithm, it seems like a pretty safe bet to use bcrypt for now.

Are there any gaping holes in this implementation? I'm not a cryptographer, but I do know the basics and purposes of each of the three algorithms being used here.

Thomas Pornin
  • 320,799
  • 57
  • 780
  • 949
Naftuli Kay
  • 6,715
  • 9
  • 47
  • 75

1 Answers1

9

From looking at it for two minutes:

  • Using bcrypt is sound. SHA-256 is one of the best known hash functions (because it has been around for more than 10 years, widely deployed, and yet unscathed).

  • Since you generate a new salt for each encrypted file (and that's good !), it would make sense to store it in a file header. That way, you would not have to provide it as parameter to the decryption routine. A salt needs not be secret (that's a salt, not a key) but a new salt must be generated for each encryption, hence external handling could be cumbersome. You already have a custom header for the IV, you could do the same for the salt.

    Alternative: store the salt as header, and generate both the encryption key and the IV from bcrypt output (split the SHA-256 output into two 128-bit chunks; first one is the key, second one is the IV). This would keep header size to 16 bytes. But a random IV (from os.urandom(), like you do) is not bad either.

  • You use encryption but there is no MAC. There are not many attack models where the attacker can observe data (i.e. you need confidentiality) but not alter data (i.e. you do not need integrity). In practice, you would be better off adding a MAC. Ideally, replace CBC with a mode which handles both encryption and integrity (GCM, EAX).

    If your support library does not provide such a mode, you will have to do it "old style" by combining encryption with HMAC. This is not totally trivial to do. Best practice would be to generate an extra MAC key from bcrypt output (if you produce the encryption key, the IV and the MAC key, you will need a larger hash function, e.g. SHA-512) and compute the MAC over the concatenation of the IV and the encrypted data (you could skip the IV only if you generate it from the passphrase too, but it is safer to include it in the MAC input). Upon decryption, verify the MAC first, then proceed to decryption only if the MAC matches.

  • You might want to include some provision for algorithm agility, i.e. adding in the header a byte value which identifies the combination of algorithms you use (bcrypt, SHA-256, AES+CBC, HMAC/SHA-256). This would allow you to switch later on to another set of algorithms without breaking code compatibility with existing files. In that case, don't forget to add this "algorithm identifier byte" to the input for the MAC.

  • Your padding is ambiguous: upon decryption, you will find some space characters at the end, but you will not know whether the extra spaces were part of the input file, or added by the encrypt function. And, indeed, upon decryption you leave the spaces. This might not be an issue for your specific usage scenario, but, as a general encryption/decryption tool, this is usually considered an issue if files do not survive an encryption/decryption trip without modification. PKCS#7 padding (also known as PKCS#5 padding) is commonly used: you add k bytes (at least 1, at most 16, so that the total length is a multiple of 16) and all the bytes have value k. Upon decryption, look at the last byte, which will tell you how many padding bytes were added.

Thomas Pornin
  • 320,799
  • 57
  • 780
  • 949
  • Thanks so much for the reply! This answers a lot of the questions I had. Point 2: I may not actually end up storing these things in the file header, as it can be stored in a database, this script was only to get quickly up and running. I'll have to look into generating the encryption key and IV from the SHA-256 output, but wouldn't that change my encryption level in AES? Point 3: I'll look into that, I'm not so familiar with the idea of a MAC. Point 5: these values may be stored in a database. – Naftuli Kay Jan 22 '13 at 17:28
  • Point 6: I believe I'm already doing what you recommended. When I encrypted and decrypted a file using the script above, I was able to keep SHA-256 sum integrity of the file. (see the `outfile.truncate(input_size)` call towards the end of the decryption method) – Naftuli Kay Jan 22 '13 at 17:31
  • @TKKocheran: ah yes, hadn't seen that. You encode the total length in the file header. Note that an attacker who can modify the file may alter that value, inducing you into doing an uncontrolled `truncate()` upon decryption; depending on the OS and filesystem, this might be a way to make you fill your disk real fast. – Thomas Pornin Jan 22 '13 at 17:43
  • @TKKocheran: if by "encryption level" you mean "AES key size", then be assured that 128 bits are enough for security (even if your enemy is the US government or Google). The larger key lengths are there in order to comply with old military regulations, but cracking a 128-bit AES key is already quite beyond what can be done with existing technology; see [this answer](http://security.stackexchange.com/questions/6141/amount-of-simple-operations-that-is-safely-out-of-reach-for-all-humanity/6149#6149) for a discussion on that subject. – Thomas Pornin Jan 22 '13 at 17:45
  • Ah, so should I store a hash of the original data? How can I mitigate an uncontrolled truncate attack? – Naftuli Kay Jan 22 '13 at 18:05
  • @TKKocheran: first you could verify that the length in the header is coherent with the length of the encrypted file (it must be lower, but no more than 15 bytes lower in your case). And, also, you need a MAC; a stored hash "elsewhere" is only a crude attempt at integrity. Ideally, you want the MAC to be computed and verified over the _encrypted_ data, because it will mean that when you actually decrypt (and unpad and truncate) you work only over genuine data, not something crafted by the attacker. – Thomas Pornin Jan 22 '13 at 18:12
  • Interesting, I didn't know that AES could encrypt without any filesize overhead (apart from the `%16` bit) – Naftuli Kay Jan 22 '13 at 20:08