How secure is the Ciphersweet library for searchable encryption, and why is a duplicate entry leak not a problem?

Question

I'm currently managing a code base in which we've got a mysql database with all records encrypted using the php-encryption library. This works well for our current setup. We now got a new business requirement that should make it possible to do a SELECT based on one of the encrypted fields.

Since it is impossible to select based on the encrypted values, I searched around and found ciphersweet. It's a new (6 months old) repo with currently only 136 github stars. I've read through a blogpost about it written by the company behind the lib.

The idea is based on blind indexing in which the general idea is to store a keyed hash (e.g. HMAC) of the plaintext in a separate column. The blind index key should be distinct from the encryption key and unknown to the database server.

As far as I understand (but I could be wrong here) the value for which I need to search is hashed using the index key which is static per column. The resulting value is searched for in the HMAC column. When a record is found, the encrypted value is then decrypted.

They describe that it does have a duplicate entry leak, meaning that if all records are obtained it can be known which records have the same value, but not what that value is.

I understand the concepts and it sounds ok, but since I'm not a cryptographic expert I can't really judge it's secureness. Isn't it somehow possible to use this duplicate entry leak to do some other attack? I've always learned (=read on the internet) that encryption should always include a IV/Nonce/Salt to make rainbow tables impossible. I guess the usage of the static index key per column prevents these rainbow tables though.

Basically I've got the feeling I'm missing something here. Why is the duplicate entry leak not a problem all of a sudden? Is there anybody else out there who can comment on this library/technology?

Are you subject to any regulatory, or contractual security controls? What's the reason for the encryption? Is it just 'good practice'? — Daisetsu, Nov 01 '18 at 15:32
@Daisetsu - We're storing medical data within the EU and we're subject to the GDPR. We are required to encrypt the records, but no encryption standard is specified. — kramer65, Nov 01 '18 at 15:51
Side note: a number of databases provide database/column(/row) encryption, including transparent encryption (mostly this is an at-rest). Would one of those be a better fit than encrypting fields in the application layer? — Clockwork-Muse, Nov 09 '18 at 19:55
@Clockwork-Muse At-rest encryption of the database doesn't protect against someone who has read access to the database (either by acquiring a backup dump, through SQL injection, using a time-based side-channel, ...) — nicoo, Nov 16 '18 at 01:19
CipherSweet now has a more detailed description of its security properties, etc. https://github.com/paragonie/ciphersweet/blob/master/docs/SECURITY.md — Scott Arciszewski, Nov 18 '18 at 01:58

Scott Arciszewski · Answer 1 · 2019-05-28T19:47:39.663

Important Disclaimer: I wrote CipherSweet for my employer. Everything that follows should be taken with a grain of salt unless otherwise verified by third party security experts. Even if this answer gets a lot of votes, it must never be accepted. I'm merely attempting to answer some of the basic questions about CipherSweet's design, not answer whether or not it's secure. Trust others instead.

The security model of CipherSweet is essentially a time-memory trade off, which introduces the risk of attacks that are more similar to a crossword puzzle than traditional cryptanalysis techniques (hence, we simply refer to them as "crossword puzzle attacks", although the more formal term "partially-known plaintext attacks" is more likely to find relevant hits in the academic literature).

In this respect, CipherSweet's overall design can be secure, but only if care is taken in designing the blind indexes in your application. There are a few considerations:

How many bits of each blind index will you store?
- Fewer bits increases the chances of collisions on a given SELECT query, which means more false positives to filter out after decryption.
- However, fewer bits also decreases the usefulness of a blind index.
How many blind indexes will you create for a given encrypted column?
- The more indexes you create, the more metadata you risk leaking to attackers.

Of course, the security of this design is only worth talking about if a few other assumptions hold true:

Each blind index has a distinct key.
The hash function (or KDF) used to turn the plaintext into a blind index is adequately secure (e.g. in the random oracle model).
The encryption itself is secure (e.g. AEAD with a security level in excess of 127 bits against all known practical attacks).

To understand how these assumptions are answered, take a read through the CipherSweet internals documentation. But in short, the answers are:

CipherSweet uses distinct per-field and per-index keys, derived from a master key.
As of November 2018, HMAC-SHA384 and BLAKE2b remain unbroken.
As of November 2018, AES-CTR + HMAC-SHA2 and XChaCha20-Poly1305 remain unbroken.

So with that said, the unanswered question is: How do we determine an unsafe level of information leakage?

Obviously, creating a distinct blind index $plaintext[0], $plaintext[1], ... would leak entire plaintexts from the indices.
- You wouldn't need to break the encryption in this case. Just study patterns and commit dummy entries and treat it as a very inefficient substitution cipher.
Conversely, a single literal blind index of the entire plaintext, truncated to 16 bits, would almost certainly not leak anything useful for attackers.
- Given a sufficiently large dataset, collisions would be very common.
- Of course, this almost defeats the purpose of even creating a blind index.

Furthermore, CipherSweet allows you to create compound blind indexes. For example, you could hash together:

The first initial of the person's first name (if applicable)
The first initial of the person's last name (if applicable)
The last 4 digits of the person's social security number
A single letter identifying their sex/gender (if applicable)

This greatly increases the keyspace of possible plaintexts for a given blind index, even if the keyspace of each individual input is limited. This is especially useful for encrypting/searching with boolean fields.

Given all of the above, I don't know if you'll be able to obtain a sufficient answer about whether or not to trust CipherSweet on this website. The unanswered questions aren't exactly straightforward to answer, and will require an in-depth analysis.

That being said: It is certainly possible to misuse CipherSweet in a way that undermines its security goals. It's a tool.

I believe that, unless some deep flaw is discovered in the protocol design, it SHOULD be possible to use CipherSweet securely. And even if it is secure, you will almost certainly want a security expert to double-check how you use it.

score 6 · Answer 2 · answered Nov 09 '18 at 14:53

Disclaimer: while I looked into the design and implementation of ciphersweet back when it was published (or at least when I became aware of it), I didn't perform a full-blown audit & proof of security of the design, and I didn't look into the PHP implementation in depth (in large parts because I do not do much PHP work). Do not mistake this for an audit report (I only make those when I'm paid for it :P)

Ciphersweet uses a pretty sound, boring design; Scott explains it, along with Cyphersweet's security claims, in his answer so I won't go over it again. The main point of Ciphersweet, IMO, is that it is safer and harder to mess-up than the alternatives.

The “duplicate entry leak” you mention does not apply to full records (those are encrypted with standard, non-deterministic encryption) but to the blind indexes: if you index on, say, HIV status, then someone with read access to the database can figure out which records have the same HIV status, and from there likely recover HIV status for all records.

This is an information leak that is fundamental to blind indexes: if you have enough information to SELECT all rows with a given (function of the) HIV status, you have enough information to check whether any 2 rows have the same HIV status, so fancier cryptography will not help there (incl. using deterministic encryption, order-preserving/revealing encryption, ...).

The good news is that, unlike other designs (like order-revealing encryption), keyed hashes (under an unknown key) do not reveal any more information than whether the values are equal.

Obviously, it is not sufficient (as shown with the HIV status example), so there are 3 main mitigations you can use (and Ciphersweet support them all):

the most obvious is not to add blind indexes on very-sensitive data: if you do not want to expose HIV status data, why are you creating an index for efficiently querying on them?
Use compound indexes: if the data you need to index on is too low-entropy to be safely put in a blind index (say, HIV status), you can hash it together with some other data (the documentation provides uses the SSN in that example), and support SELECTing records with a given HIV status and a given SSN.

This is, IMO, the less-useful option, since you can directly SELECT by SSN (assuming you have a blind index over SSNs) and then check the HIV status in the decrypted record. Reserve it for cases where either you cannot have an index over one of the fields (because they are all too low-entropy and/or high-sensitivity).
truncate the HMAC value, so as to reduce the information leak: let's say you have patient records, all with a unique name, and support selecting by it [0]. I could check whether a certain patient exists by adding a record (through the application) with that name, then checking the database for a second record with the same name hash, even if the application itself wouldn't grant me the permission to search for patients by name.

With a truncated hash, you can make it so that each lookup in the blind index returns (on average) a small number of records; say, if you want avg. 3 records per query, out of 1 000 000 records, you would want a hash with size log2(10⁶/3) ~ 18 bits. This makes the scenario I described impossible.

I do not think Ciphersweet provides particular support for evolving the size of the blind indexes as the size of your database increases, though it should be doable. Thankfully, the only issue with not resizing a blind index as the database grows, is a slight performance overhead: if your DB gets 10× bigger and now contains 10 000 000 records, keeping the same 18 bit blind index results in avg. 30 records getting selected, which the application then decrypts and filters; decrypting 30 records to find the one you are interested in should still be quite fast.

[0] In a real-world use-case, you would probably support selecting by a normalized (lowercased, stripped of punctuation) version of the name; Ciphersweet supports functional indexes.

TL;DR: Ciphersweet is safe, probably much more so than most alternatives; there are some caveats you need to be aware of, which exists in all encrypted databases, and some operational concerns, but they are all very manageable.

How secure is the Ciphersweet library for searchable encryption, and why is a duplicate entry leak not a problem?

2 Answers2