2

I have implemented encryption for certain columns in a database. In some cases it may be necessary to search by these columns, either for exact matches or for substrings. At the moment, there is less than 10000 rows to search through (at most). However this will probably change in the future so I am anticipating efficency problems.

The data will remain encrypted in the database, but when retrieved by the application, it can then be decrypted. This means that the retrieval does not have to be 100% accurate, it can retrieve more records than actually match the query, and the application itself can discard the non-matching records.

With this in mind I came up with the following solutions:

  • For exact text match, hash the encrypted data and modulo this with a relatively low number (2^16 perhaps), and store that value in an additional column. When querying the database, it is only necessary to perform the same operation on the input string, and retrieve all records with matching hash value.

  • For finding substrings, hash each letter and mod 32 the result. Then set the corresponding bit in an integer. Perform the same operation on the query string. Thus it is only necessary to retrieve records where the result of a binary AND on the query with the stored value is greater than 0. This could also be done using each pair of letters, perhaps with a 64-bit or longer integer to allow for more records to be discarded.

My question is, would either of these techniques have a real impact on the security of the data in the encrypted columns? The encryption scheme is AES-256.

Slicedpan
  • 313
  • 2
  • 12
  • Hmmmm, just discovered that what I am asking about here appears to be similar (or the same as) encrypted bloom filters. Would anyone believe me if I told them I came up with this independently? :) – Slicedpan Nov 06 '14 at 15:17
  • This doesn't make a lot of sense to me. For 1. You would have to have the encrypted data already to search. So why are you searching for it? If you're using a database why not create a unique id for each row that you can query for. When you insert the encrypted data you would get the unique id in return, and use that to later query the database. – RoraΖ Nov 06 '14 at 15:17
  • For example, a User table with real names encrypted, and you want to search for a person with a specific name – Slicedpan Nov 06 '14 at 15:24

1 Answers1

3

Ability to narrow searches tends to be in direct opposition with the confidentiality that you seek through encryption. For instance, if you store your "16-bit hash" in an extra column then that hash reveals 16 bits of the data -- 16 indirect bits, but 16 bits nonetheless. An attacker who sees the database may try to guess (brute force) the record contents, and the 16 bits will allow him to detect 65535/65536th of bad guesses: this is a substantial advantage.

Ability to do substring searches is even worse, since it necessarily reveals information that allows the brute force attack to proceed in gradual steps (this is in fact the same problem as partial password authentication).

At best, what you could do is to implement deterministic encryption, such that encryption of a given record value always yields the same encrypted result. This leaks a modicus of information (if two records have the same contents then this will show, despite the encryption layer); on the other hand, it allows for exact searches: you encrypt the value to search, and use the index on the encrypted values. Substring searches, however, should be avoided at all costs.


I think a better method would be to revisit your assumptions:

However this will probably change in the future so I am anticipating efficency problems.

Usually, performance issues don't exist until having been actually encountered (at least in a test platform, if not in production) and duly measured. As Donald Knuth once wrote: premature optimization is the root of all evil.

Even if the envisioned performance issue is real and you know how much it will cost, some alternate methods might be applicable. For instance, you could read all the records in the RAM of the application, decrypt them all, and keep them in RAM. This would allow very fast searches without even going to the SQL level. Modern servers have a lot of RAM. As an example, the servers that maintain the StackExchange sites (all of them) are reputed to have been sufficiently boosted in RAM (a few hundred gigabytes) so that all the data can be cached, and the servers can perform all read accesses at RAM speed.

If your records are, say, no longer than 100 bytes (e.g. they are the names of some people), then you can store 10 millions of value in a mere gigabyte of RAM. What is a gigabyte ? Even your phone has more RAM than that.

Tom Leek
  • 168,808
  • 28
  • 337
  • 475
  • Thanks for the reply, much food for thought here. Do you think that padding the plaintext with a secret value before hashing would do anything to prevent the possible use of bruteforce attacks? (Assuming of course that an attacker would not have access to the secret) – Slicedpan Nov 06 '14 at 16:05
  • "Padding with a secret value before hashing" really means computing a [MAC](http://en.wikipedia.org/wiki/Message_authentication_code), so you may as well do it correctly. Use HMAC, which is a deterministic MAC. Using HMAC to generate a search key is equivalent to the "deterministic encryption" solution, except that it uses up an extra column (in particular, if you can search the MAC values then you still leak whether two source values are identical or not). – Tom Leek Nov 06 '14 at 16:17
  • Cool. Appreciate the help, cheers. – Slicedpan Nov 06 '14 at 16:20