Searching Encrypted Data from Database

Question

Suppose I have a table with email is stored in one table. I have encrypted it using AES-256 and stored the result in DB Column.

For let's say I have one input email and I want to search whether this email exists in DB or not: Now I see two approaches to it

Encrypt the input and search for the encrypted value.
Use the concept of Blind Index(using HMAC SHA-256) and store it and then for input also create this HMAC and search it.

Which approach is better and why? For Encrytion we are using Inhouse library which does AES-256 using secret Key/ Salt. Secret key is fetched from AWS KMS(Basically decrypt from AWS KMS)

I guess your only input is not-email. – kelalaka Oct 14 '20 at 19:44 — kelalaka, Oct 14 '20 at 19:44
Lets say it is. Then how do we solve this. – Ankit Bansal Oct 14 '20 at 19:46 — Ankit Bansal, Oct 14 '20 at 19:46

ManRow · Answer 1 · 2021-09-27T07:26:06.177

5

Perhaps the common two-column approach used for login scenarios may be applicable here.

This is often where users need log in with their email (either because they forgot their username or the site doesn't use them), but their email address is considered confidential data that should not be stored in plaintext.

In this scenario, your "user table" would include two columns:

One column containing a peppered hash of the email addresses (such as SHA-256(pepper || email), where "pepper" is just some secret bytes stored on the server or HSM. The term is taken from like how a "pepper" is used to protect passwords)
One column containing the actual email addresses, encrypted with a random-IV per-entry (e.g like IV || E(email))

The first column is used to quickly look up user info for login purposes while the second column can be used to retrieve the actual email address whenever needed (e.g., for correspondence purposes).

edited Sep 27 '21 at 07:26

answered Oct 15 '20 at 05:51

ManRow

401
1
4
10

I'm pretty sure that AES is faster than HMAC. While you have the key, why two columns? Just use AES for login purposes, too. Is it really a standard written in some rfc, etc.., or a common approach? – kelalaka Oct 15 '20 at 07:13
Oh just simply so you don't have to use the less secure ECB or CBC modes with a *fixed-IV-for-all-entries* to encrypt the email addresses. The two column approach gives you that fast log-in lookup while at the same time also letting you encrypt email addresses in (any of) the more secure *random-IV-per-entry* modes. – ManRow Oct 15 '20 at 08:33
No, random IV will prevent you from the first access to the row. It should be deterministic. For the other columns, at least, authenticated encryption is recommended like AES-GCM or better chacha20-poly1305. Depending on the column structure, the authentication can be performed on the row level, too. This is really application specific subject, that is why in the first hand I have asked the OP... – kelalaka Oct 15 '20 at 08:35
No -- the random IV is just prepended to the encrypted email address in the second column. The first column just stores `hash(email + pepper)` for *quick lookups* to identify *which user* is trying to log in. You use *only* the first column for login "identification" -- and that is also *its only* sole purpose! (Basically, so you can use some more secure random-IV-per-entry mode to actually encrypt the individual email addresses) – ManRow Oct 15 '20 at 08:55

score 1 · Answer 2 · answered Oct 14 '20 at 20:03

1

E-mail case:

For Encryption, you need to use AES in ECB mode of operation in order to correctly execute the SELECT query. The IV has no use in ECB mode of operation and ECB mode can leak information about the data even the e-mails are unique. If we consider that the data is more than one block, then some part of the data can be matched on the ciphertext - consider that the e-mail that separated exactly at the @ character. The other block-cipher mode of operations uses IV for probabilistic encryption that randomizes the same value into different ciphertext; like CBC and CTR don't allow the SELECT queries.

Using HMAC can be a better solution if the data more than 128-bit size. There is a negligible chance of collision but it will be not important if you use HMAC-SHA256. If you find a collision then you can easily resolve it with other fields, but more importantly, you will be in the Cryptographic news!

The general case:

For both cases, there is a common frequency attack ^* that can be devastating. With the frequency knowledge of the data, an observer can reveal the ciphertext/HMACed data without needing any key!. Therefore; both solutions can only be secure if the data are unique on the column that they are stored.

For each column is advised to use different keys and those can be constructed from the master key by using HKDF(k,"table_name"+"column_name")

Conclusion: If the data is unique then HMAC-SHA256 is better, if not none.

_{^* The linked answer in the Cryptography.SE gives details about the attack and post some realted articles}

answered Oct 14 '20 at 20:03

kelalaka

5,409
4
24
47

You do not need to use ECB in order to deterministically encrypt. This can be done with other modes as well. – Xander Oct 23 '20 at 02:55
@Xander could you properly name a block-cipher mode of operation that can be deterministic that will enable the SELECT query without crippling it like fixed IV/nonce? – kelalaka Oct 23 '20 at 07:23
CBC or GCM-SIV with a fixed IV. – Xander Oct 23 '20 at 13:14
1

@Xander well those are crippled! CBC will still leak information if the first blocks are equal and can cause the frequency attack. GCM-SIV fixed IV? SIV mode uses all of the messages to remove the nonce-misuse. So, you actually rely only on that. One needs to see the real use case to talk about the lost security case. At first, It might seem good, however, what is the slowdown? – kelalaka Oct 23 '20 at 15:57
@Xander I did not say that my solution is the solution, I rather compared them and show the problems, indeed, I've commented on another answer to become better and upvoted. Could you write about your experiences so that we can see them? – kelalaka Oct 23 '20 at 15:59
Well yes, of course they leak information, that's why we use random IVs. The only point I'm making is that the part of your answer suggesting that ECB is the only mode that can be used is incorrect, and in fact for multiblock messages will leak significantly more information than other mode options, so it's a particularly dangerous piece of misinformation. – Xander Oct 23 '20 at 16:24
@Xander the author of the SIV mode is against the fixed IV. If you get your idea from here [https://cyber.biu.ac.il/aes-gcm-siv/] they are wrong about that. The most correct article is here https://tosc.iacr.org/index.php/ToSC/article/view/810/763 – kelalaka Nov 01 '20 at 16:03
Of course they're against it. It's a bad idea. I never suggested it was a good idea. I was just pointing out that your answer is wrong, and wrong in a dangerous way, because what you claim is the only option for the encryption path is even worse. – Xander Nov 01 '20 at 20:01
@Xander I've already said the ECB is dangerous in many times, that was the whole point of the answer... – kelalaka Nov 01 '20 at 20:05
"For Encryption, you need to use AES in ECB mode of operation in order to correctly execute the SELECT query."..."The other block-cipher mode of operations uses IV for probabilistic encryption that randomizes the same value into different ciphertext; like CBC and CTR don't allow the SELECT queries." This is not correct. My suggestion is that you fix it. – Xander Nov 01 '20 at 20:22
@Xander Nope! I'm not going to say the crippled versions other ways to do. Those are the crippled version of other encryption modes. `could you properly name a block-cipher mode of operation that can be deterministic that will enable the SELECT query without crippling it like fixed IV/nonce?` and you turned in the crippled version :) For the login the a PRF like HMAC is fine. One doesn't need anything else. – kelalaka Nov 01 '20 at 20:29
Ok, it's your answer. You can do what you want with it. – Xander Nov 01 '20 at 20:39
@Xander yes, thanks for the discussion. – kelalaka Nov 01 '20 at 20:39

score 1 · Answer 3 · answered Sep 24 '21 at 11:12

Databases currently have native encryption. MySQL calls it Transparent Data Encryption and Oracle calls it the same, Postgres have various options for encryption, IBM DB2 have Native Data Encryption, Redis have data encryption, MongoDB have data at rest and client side field encryption, CouchDB have both encrypted data at rest and field level encryption.

This all means you don't have to create your own way to encrypt data if your database engine already supports that. You will not be more secure writing your own encryption routines.

Reversible encryption is difficult to properly secure because you need to decrypt it, and key management is the most difficult one. If an attacker gets access to it, it's over, no matter how many PBKDF2 rounds you used to generate another key to encrypt every char from the field with a different key, and encrypted the resulting cyphertext.

The attacker can do anything your application can do. If somewhere the application gets a key from Amazon KMS, the attacker can do the same. If the application uses the key to decrypt an index table, so does the attacker. Nothing your application does cannot be done by an attacker.

So either use the database native encryption, or better define your threat model.

Searching Encrypted Data from Database

3 Answers3

Linked