I'm answering my own question because I think I now understand BREACH and how to prevent it. I'd love feedback.
How BREACH works (as I understand it)
(Expanding on an explanation here that helped me.)
Suppose you're an attacker. You are signed into a service as yourself. You notice that there's a search endpoint, and if you send the search term rabbits, you get back a response like this:
<SearchResponse>
<AuthToken>d2a372efa35aab29028c49d71f56789</AuthToken>
<SearchTerm>rabbits</SearchTerm>
<Results>
<Result>rabbits rock</Result>
<Result>yay rabbits</Result>
</Results>
</SearchResponse>
You also notice that the response is gzipped and encrypted (HTTPS).
You try searching for a string that's formatted like the <AuthToken> value, like aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa. The response is:
<SearchResponse>
<AuthToken>d2a372efa35aab29028c49d71f56789</AuthToken>
<SearchTerm>aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa</SearchTerm>
<Results>
</Results>
</SearchResponse>
There are no results for this. You then modify your search term slightly, so that it starts with the first few characters of your own auth token:
<SearchResponse>
<AuthToken>d2a372efa35aab29028c49d71f56789</AuthToken>
<SearchTerm>d2a37aaaaaaaaaaaaaaaaaaaaaaaaaa</SearchTerm>
<Results>
</Results>
</SearchResponse>
As you hoped, something interesting is happening. Because the search term is nonsense, the <Results> are always the same: empty. The only thing changing is the <SearchTerm>. And because of compression, the more the <SearchTerm> value resembles the <AuthToken> value, the smaller the response is.
This is because of how gzip compression works: when compressing, it replaces a repeated string with a short back-reference to its earlier occurrence, and when decompressing, it expands those references again. The more repetitive the input, the smaller the compressed output.
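To make the size effect concrete, here is a minimal Python sketch that rebuilds the example response and measures its compressed size for guesses that share a longer and longer prefix with the token. It uses Python's zlib module (the same DEFLATE algorithm gzip uses) as a stand-in for the server's compression. The names build_response, compressed_size and guess_with_prefix are made up for this illustration, and the filler string is my own assumption: it is deliberately non-repeating so that the guess does not compress well on its own.

```python
import zlib

# Token and response layout copied from the example above.
TOKEN = "d2a372efa35aab29028c49d71f56789"

# Assumption for this sketch: a non-repeating filler of the same length as the
# token, sharing no characters with it, so only the guessed prefix can match.
FILLER = "ghijklmnopqrstuvwxyzGHIJKLMNOPQ"


def build_response(search_term: str) -> bytes:
    """Rebuild the empty-result response body for a given search term."""
    return (
        "<SearchResponse>\n"
        f"<AuthToken>{TOKEN}</AuthToken>\n"
        f"<SearchTerm>{search_term}</SearchTerm>\n"
        "<Results>\n"
        "</Results>\n"
        "</SearchResponse>\n"
    ).encode("utf-8")


def compressed_size(search_term: str) -> int:
    """Length in bytes of the DEFLATE-compressed response."""
    return len(zlib.compress(build_response(search_term), 9))


def guess_with_prefix(n: int) -> str:
    """A guess whose first n characters match the token; the rest is filler."""
    return TOKEN[:n] + FILLER[n:]


if __name__ == "__main__":
    for n in (0, 8, 16, 24, len(TOKEN)):
        guess = guess_with_prefix(n)
        print(f"{n:2d} matching chars -> {compressed_size(guess)} bytes")
```

Running it prints one size per guess. The uncompressed response is the same length every time, so any difference in the printed sizes comes from compression alone. Differences between guesses that are only one character apart can be too small to show up in the byte count, which is one reason a real attack needs many requests.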
You search again, using the exact value of the <AuthToken>:
<SearchResponse>
<AuthToken>d2a372efa35aab29028c49d71f56789</AuthToken>
<SearchTerm>d2a372efa35aab29028c49d71f56789</SearchTerm>
<Results>
</Results>
</SearchResponse>
This time you make a note of how small the response is. Now you know that any time the response is this size, it means the search term matched the auth token exactly.
Now, because these are your requests, you've been able to read them directly. If you could do a MITM attack on another user of the site (eg, by running a rogue router), you'd be able to see the size of the encrypted response, but not the actual contents.
You think to yourself: if I can trick someone else into sending the search terms I want them to, and if I can see how big the encrypted response is, I can tweak the search term over and over. The closer I get to guessing the auth token, the smaller the response will be, and when it's the size of the response I just saw, I've guessed correctly. Once I know their auth token, I can sign in as them.
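If you can somehow get your victim's browser to make the necessary requests (for example, by luring them to a page you control that fires off cross-site requests, or through an XSS attack), you have everything you need.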
Mitigation
This attack would not work if:
- The server did not use HTTP compression (like gzip, in our example)
- The request could not be made successfully without a CSRF token, which the attacker could not know
- The server never put both sensitive data (like an API token) and user-supplied data (like the search term) in the same response
- The server never returned the same API token twice (eg, if raw token values were timestamped and signed before sending, the timestamp would ensure the token in the response changed constantly)
- The response always contained random-length padding, as @AndrolGenhald pointed out in a comment (although with enough requests, an attacker might separate the signal from this noise)
- The request could not be made successfully without a session cookie, and the site's session cookie had a SameSite attribute, and the would-be victim was using a browser that understands this attribute and therefore does not include the cookie with requests originating from another site.
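As an illustration of the "never return the same token twice" item in the list above, here is a minimal Python sketch of one way to do it: XOR the token with fresh random bytes on every response and send the random bytes alongside the result. Note that this is a different technique from the timestamp-and-sign idea mentioned in that item; I'm using it because it is short and makes the token's bytes on the wire change completely each time. The function names mask_token and unmask_token are hypothetical, not from any particular framework.

```python
import secrets


def mask_token(token: bytes) -> str:
    """Return a fresh, random-looking encoding of the token for one response."""
    pad = secrets.token_bytes(len(token))            # new random pad every call
    masked = bytes(t ^ p for t, p in zip(token, pad))
    return (pad + masked).hex()                      # send pad and masked token together


def unmask_token(wire_value: str) -> bytes:
    """Recover the original token from a value produced by mask_token."""
    raw = bytes.fromhex(wire_value)
    half = len(raw) // 2
    pad, masked = raw[:half], raw[half:]
    return bytes(m ^ p for m, p in zip(masked, pad))


if __name__ == "__main__":
    token = b"d2a372efa35aab29028c49d71f56789"
    first = mask_token(token)
    second = mask_token(token)
    print(first)
    print(second)
    print(unmask_token(first) == token and unmask_token(second) == token)
```

Whoever receives the value can still recover the real token, so this does not hide it from the legitimate client. The point is only that the bytes in the response differ on every request, so a guess placed in the search term has nothing stable to match against.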