This is a follow-up to my previous question: Prevent DOS against RSA authentication. This question also discusses a similar problem: Prevent denial of service attacks against slow hashing functions?.
My setup is client-server with plain sockets and a custom protocol. The client programs are shipped a public RSA key, and the servers hold the private key counterpart.
In general, connections are persistent, so logins are actually rare, except during server restarts, network glitches and other issues that might cause a large number of clients to disconnect and reconnect. In other words, a "DDoS" attack might simply be clients trying to reconnect...
The current (DoS-friendly) handshake works as follows:
1. Client sends
[protocol-version] [public-key encrypted nonce]
2. Server unpacks nonce.
3. Server generates its own nonce and uses PBKDF2 to derive a key using the client nonce and its own nonce.
4. The server responds with
[reply-code] [server-nonce] [AES encrypted packet + HMAC]
5. Client checks the reply-code. If ok, takes the server-nonce and derives the key and checks that everything is ok by decrypting the encrypted payload using the new key.
6. Encrypted communication commences using this newly created key.
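To make the key-derivation part of the handshake concrete, here is a minimal Python sketch using only the standard library. It is an illustration under assumptions, not my actual implementation: the iteration count, the key length, the choice of SHA-256, and the use of the client nonce as the PBKDF2 password with the server nonce as the salt are all guesses made for the example, and the function names are hypothetical. The RSA and AES steps are left out because they need a third-party library.

```python
import hashlib
import hmac
import os

PBKDF2_ITERATIONS = 1000  # assumed value, not taken from the real protocol
KEY_LEN = 32              # assumed: 256-bit session key


def derive_session_key(client_nonce: bytes, server_nonce: bytes) -> bytes:
    """Derive the shared key from both nonces.

    Assumption: client nonce acts as the PBKDF2 password, server nonce as salt.
    """
    return hashlib.pbkdf2_hmac(
        "sha256", client_nonce, server_nonce, PBKDF2_ITERATIONS, dklen=KEY_LEN
    )


def make_tag(key: bytes, payload: bytes) -> bytes:
    """HMAC over the (already encrypted) payload sent in the server reply."""
    return hmac.new(key, payload, hashlib.sha256).digest()


def check_tag(key: bytes, payload: bytes, tag: bytes) -> bool:
    """Client-side check that the reply was produced with the same key."""
    return hmac.compare_digest(make_tag(key, payload), tag)


if __name__ == "__main__":
    client_nonce = os.urandom(32)  # would arrive RSA-encrypted from the client
    server_nonce = os.urandom(16)  # sent back in the clear with the reply
    key = derive_session_key(client_nonce, server_nonce)
    tag = make_tag(key, b"encrypted-packet-placeholder")
    print(check_tag(key, b"encrypted-packet-placeholder", tag))
```

Both sides run the same derivation, so the only expensive server-side operation that an unauthenticated client can trigger is the RSA decryption plus one PBKDF2 call.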
The problem is that a large number of connects will consume a lot of CPU resources. This will not only affect users who are already logged in, but also slow down logins, causing them to time out, which in turn causes even more disconnects and logins.
It's easy to see that even without malicious intent, the servers are not stable under reasonably high login load if the decryption is costly.
In order to mitigate this, I posed the question mentioned above (Prevent DOS against RSA authentication), where ECDH was suggested to lower the cost of logins.
That is a good start, but it might not properly address the problem, where a DoS will not only prevent logins, but also degrade the experience of users already logged in.
I've tried to come up with a few strategies that could help regardless of the handshake algorithm used, and I'd like to hear which ones would be recommended, and if I'm overlooking some useful strategy.
1. Restrict login CPU usage by queuing decrypts on a single (or few) low priority thread. If the queue is large, clients can be rejected or kept on hold with periodic updates.
2. Add a login server, which serves clients with keys, similar to the handshake above, but the client is also given an identifier with this key. The client can then log into the normal servers by presenting the identifier, as the server will be able to retrieve it. (Presenting the identifier and retrieving the key from the login server would replace 1-3 in the handshake protocol). Any DoS would only affect the login server.
3. Again a login server, but using HTTPS instead of the custom RSA scheme for distributing the key + identifier to the client.
4. Login server as (2) but require the client to present a Hashcash with the request (and the login server does not process the encrypted data unless it is valid).
5. Login server as (2) but use a server-issued client puzzle instead.
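To illustrate what I mean by strategies (4) and (5), here is a rough Python sketch of a server-issued puzzle with a Hashcash-style solution. Everything in it is an assumption for the sake of the example: the difficulty, the time-to-live, the per-server secret, and all function names are hypothetical. The challenge is an HMAC over the client address and a timestamp, so the server can verify it without storing per-client state.

```python
import hashlib
import hmac
import os
import time
from typing import Optional, Tuple

SERVER_SECRET = os.urandom(32)  # hypothetical per-server secret
DIFFICULTY_BITS = 12            # assumed; higher means more client work
PUZZLE_TTL = 30                 # assumed lifetime of a puzzle, in seconds


def _challenge(client_addr: str, ts: int) -> bytes:
    msg = f"{client_addr}|{ts}".encode()
    return hmac.new(SERVER_SECRET, msg, hashlib.sha256).digest()


def issue_puzzle(client_addr: str, now: Optional[float] = None) -> Tuple[int, bytes]:
    """Return (timestamp, challenge); nothing is stored on the server."""
    ts = int(time.time() if now is None else now)
    return ts, _challenge(client_addr, ts)


def leading_zero_bits(digest: bytes) -> int:
    return len(digest) * 8 - int.from_bytes(digest, "big").bit_length()


def solve_puzzle(challenge: bytes) -> int:
    """Client side: brute-force a counter until the hash has enough zero bits."""
    counter = 0
    while True:
        digest = hashlib.sha256(challenge + counter.to_bytes(8, "big")).digest()
        if leading_zero_bits(digest) >= DIFFICULTY_BITS:
            return counter
        counter += 1


def verify_solution(client_addr: str, ts: int, challenge: bytes, counter: int,
                    now: Optional[float] = None) -> bool:
    """Server side: one HMAC and one hash, done before any RSA decryption."""
    current = time.time() if now is None else now
    if not 0 <= current - ts <= PUZZLE_TTL:
        return False  # expired or from the future
    if not hmac.compare_digest(_challenge(client_addr, ts), challenge):
        return False  # challenge was not issued by this server for this client
    digest = hashlib.sha256(challenge + counter.to_bytes(8, "big")).digest()
    return leading_zero_bits(digest) >= DIFFICULTY_BITS
```

The point is the asymmetry: the client performs on the order of 2^DIFFICULTY_BITS hashes, while the server performs two cheap operations before it decides whether to spend CPU on the decryption. This sketch does not stop a valid solution from being replayed within its lifetime; that would need some extra bookkeeping.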
Merits and disadvantages (as I understand them):
(1) Can be used both with the normal servers and for a login server.
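As a rough sketch of what (1) could look like, here is a bounded queue drained by a single worker thread, written in Python with the standard library only. The class name, the queue limit and the callback style are hypothetical choices for the example. Note that Python's `threading` module has no portable way to lower a thread's priority, so that part would have to come from the OS or from the real server's runtime.

```python
import queue
import threading

MAX_PENDING_LOGINS = 100  # assumed limit; tune to what a login timeout tolerates


class LoginQueue:
    """Runs expensive decrypts on one worker thread with a bounded backlog."""

    def __init__(self, decrypt, max_pending=MAX_PENDING_LOGINS):
        self._decrypt = decrypt  # the costly operation, e.g. the RSA decryption
        self._pending = queue.Queue(maxsize=max_pending)
        threading.Thread(target=self._run, daemon=True).start()

    def submit(self, ciphertext, on_done) -> bool:
        """Queue a login. Returns False if the backlog is full, in which case
        the caller can reject the client or tell it to retry later."""
        try:
            self._pending.put_nowait((ciphertext, on_done))
            return True
        except queue.Full:
            return False

    def join(self):
        """Block until every queued login has been processed."""
        self._pending.join()

    def _run(self):
        while True:
            ciphertext, on_done = self._pending.get()
            try:
                try:
                    result = self._decrypt(ciphertext)
                except Exception:
                    result = None  # treat a failed decrypt as a rejected login
                on_done(result)
            finally:
                self._pending.task_done()
```

With this shape, the CPU spent on logins is capped at what one thread can consume, and the backlog limit bounds how long an accepted client waits, so a flood turns into fast rejections instead of timeouts.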
Using (2) will complicate login somewhat, but makes it trivial to ensure that players aren't affected by a DoS attack on the authentication algorithm.
I suspect that (3) would make it easier to use a DDoS protection service like Cloudflare; however, it is my understanding that (4) and (5) are impossible to use with HTTPS, which is a downside.
Regardless, any scheme needs to be coupled with the standard mechanisms preventing single machine DoS attacks, such as banning quickly reconnecting IPs. Selecting a cheaper authentication algorithm will also help a lot.
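For completeness, this is the kind of per-IP mechanism I have in mind for quickly reconnecting clients, as a small Python sketch. The thresholds and the class name are made up for the example.

```python
import time
from collections import defaultdict, deque


class ReconnectLimiter:
    """Temporarily bans an IP that connects too often within a time window."""

    def __init__(self, max_attempts=5, window=10.0, ban_time=60.0):
        self.max_attempts = max_attempts  # assumed: attempts allowed per window
        self.window = window              # assumed: window length in seconds
        self.ban_time = ban_time          # assumed: ban duration in seconds
        self._attempts = defaultdict(deque)
        self._banned_until = {}

    def allow(self, ip: str, now=None) -> bool:
        """Record a connection attempt and say whether to accept it."""
        now = time.monotonic() if now is None else now
        if self._banned_until.get(ip, 0.0) > now:
            return False
        recent = self._attempts[ip]
        # Forget attempts that have fallen out of the window.
        while recent and now - recent[0] > self.window:
            recent.popleft()
        recent.append(now)
        if len(recent) > self.max_attempts:
            self._banned_until[ip] = now + self.ban_time
            recent.clear()
            return False
        return True


if __name__ == "__main__":
    limiter = ReconnectLimiter(max_attempts=3)
    print([limiter.allow("203.0.113.5", now=t) for t in (0.0, 1.0, 2.0, 3.0)])
```

This only helps against a single machine hammering the server. It does nothing when thousands of legitimate clients reconnect once each from different addresses, and clients behind a shared NAT could trip it by accident.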
EDIT
To Summarize
- My current handshake cannot handle a sufficiently large number of simultaneous connects because the RSA decryption will consume excessive amounts of CPU.
- I would like to know the usual methods to reduce this vulnerability, both at handshake level (cheaper algorithms, client puzzles, limited CPU for decryption, etc.) and on server level (separate out login services, etc.). Links to papers / books would be great.
- Also, I would be grateful if I could get a good/bad assessment of the strategies (1-5) mentioned above.
EDIT 2
Note that these are persistent connections running a custom protocol. This means thousands of legitimate clients may be connected at the same time.
If an attacker succeeds in temporarily choking the bandwidth and causing connection timeouts, this can be used to leverage legitimate client reconnects to bring down the server, regardless of client reconnect delays.