I want to reduce, or mitigate the effects of, malicious layer 7 traffic (targeted attacks, generic evil automated crawling) that reaches my backend and makes it very slow or even unavailable. This concerns load-based attacks, as described in https://serverfault.com/a/531942/1816
Assume that:
- I use a backend/CMS that is not very fast (e.g. ~1500 ms TTFB for every dynamically generated page). Optimizing this is not possible, or simply too expensive in terms of effort.
- I've fully scaled up, i.e. I'm already on the fastest hardware possible.
- I cannot scale out, i.e. the CMS does not support master-to-master replication, so it is served by a single node only.
- I use a CDN in front of the backend, powerful enough to handle any traffic, which caches responses for a long time (e.g. 10 days). Cached responses (hits) are fast and do not touch my backend again; misses obviously reach my backend (see the cache sketch after this list).
- The IP of my backend is unknown to attackers/bots.
- Some use cases, e.g. POST requests or logged-in users (a small fraction of total site usage), are set to bypass the CDN's cache, so they always end up hitting the backend.
- Changing anything in the URL in a way that makes it new/unique to the CDN (e.g. adding `&_foo=1247895239`) will always end up hitting the backend.
- An attacker who has studied the system first will very easily find very slow use cases (e.g. paginated pages down to the 10,000th result), which they can abuse together with the random parameters of the previous point to bring the backend to its knees.
- I cannot predict all known and valid URLs and legitimate parameters of my backend at a given time in order to somehow whitelist requests or sanitize the URL on the CDN and so reduce unnecessary requests reaching the backend. E.g. `/search?q=whatever` and `/search?foo=bar&q=whatever` will produce exactly the same result, because `foo=bar` is not something my backend uses, but I cannot sanitize that at the CDN level (the cache-key sketch after this list shows what such sanitization would look like).
- Some attacks come from a single IP, others from many IPs (e.g. 2,000 or more) which cannot be guessed or easily filtered out via IP ranges.
- The CDN provider and the backend hosting provider both offer some sort of DDoS protection feature, but the attacks that can bring my backend down are very small (e.g. only 10 requests per second) and are never classified as DDoS attacks by these providers.
- I do have monitoring in place and instantly get notified when the backend is stressed, but I don't want to manually ban IPs because this is not viable (I may be sleeping, working on something else, or on vacation, or the attack may come from many different IPs).
- I am hesitant to introduce a per-IP limit of connections per second on the backend, since I will, at some point, end up denying access to legitimate users (see the rate-limit sketch after this list). E.g. imagine a presentation/workshop about my service taking place at a university or large company, where tens or hundreds of browsers will use the service almost simultaneously from a single IP address. If these users are logged in, they will always reach my backend and not be served by the CDN. Another case is public-sector users, who all access the service from a very limited number of IP addresses (provided by the government). So such a limit would deny access to legitimate users, while not helping at all against attacks from many IPs, each of which only makes a couple of requests.
- I do not want to permanently blacklist the large IP ranges of entire countries which are sometimes the origins of attacks (e.g. China, Eastern Europe), because this is unfair and wrong, it would deny access to legitimate users from those areas, and attacks from other places would not be affected.
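To make the caching assumptions concrete, here is a minimal Python sketch of how I assume the CDN behaves from the backend's point of view (purely illustrative; the 10-day TTL, the ~1500 ms backend cost and the POST/logged-in bypass come from the list above, everything else, including the names, is made up). Hits never touch the backend; misses and cache-bypassing requests pay the full rendering cost, and any query string the CDN has never seen before is a guaranteed miss.

```python
import time

CACHE_TTL = 10 * 24 * 3600   # ~10 days, as in the assumptions above
BACKEND_TTFB = 1.5           # ~1500 ms per dynamically generated page

cache = {}                   # cache key (full URL) -> (expires_at, body)

def backend_render(url):
    """The slow CMS: every dynamic page costs ~1.5 s of backend time."""
    time.sleep(BACKEND_TTFB)
    return f"rendered {url}"

def cdn_fetch(method, url, logged_in=False):
    """Hypothetical CDN edge: serve from cache unless the request bypasses it."""
    bypass = method == "POST" or logged_in      # use cases that skip the cache
    now = time.time()
    if not bypass and url in cache and cache[url][0] > now:
        return cache[url][1]                    # hit: the backend is not touched
    body = backend_render(url)                  # miss or bypass: slow path
    if not bypass:
        cache[url] = (now + CACHE_TTL, body)
    return body

# A repeated GET is fast, but a never-seen query string is always a miss:
cdn_fetch("GET", "/search?q=whatever")                   # miss, ~1.5 s
cdn_fetch("GET", "/search?q=whatever")                   # hit, instant
cdn_fetch("GET", "/search?q=whatever&_foo=1247895239")   # miss again, ~1.5 s
```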
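The cache-key sketch below shows what "sanitizing the URL on the CDN" would mean, assuming a hypothetical per-path whitelist of legitimate query parameters (`KNOWN_PARAMS` is invented for illustration). The point of the corresponding assumption is exactly that I cannot build or maintain such a whitelist, so this shows what I am missing, not something I can deploy.

```python
from urllib.parse import urlsplit, parse_qsl, urlencode

# Hypothetical whitelist of legitimate parameters per path -- exactly the
# thing I cannot realistically enumerate or keep up to date.
KNOWN_PARAMS = {"/search": {"q", "page"}}

def canonical_cache_key(url):
    """Drop unknown query parameters and sort the rest, so junk can't bust the cache."""
    parts = urlsplit(url)
    allowed = KNOWN_PARAMS.get(parts.path, set())
    kept = sorted((k, v) for k, v in parse_qsl(parts.query) if k in allowed)
    return parts.path + ("?" + urlencode(kept) if kept else "")

# /search?q=whatever, /search?foo=bar&q=whatever and a random &_foo=... would
# all collapse onto one cached entry instead of each hitting the slow backend:
assert canonical_cache_key("/search?q=whatever") == \
       canonical_cache_key("/search?foo=bar&q=whatever&_foo=1247895239")
```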
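The rate-limit sketch below is a plain per-IP token bucket (the rate and burst values are arbitrary assumptions, not anything I run). It illustrates why I'm hesitant: a workshop's worth of legitimate users behind one NAT IP exhausts the bucket immediately, while a botnet of 2,000 IPs making a couple of requests each never comes close to the limit.

```python
import time
from collections import defaultdict

RATE = 2.0    # allowed requests per second per IP (arbitrary assumption)
BURST = 10.0  # bucket size (arbitrary assumption)

buckets = defaultdict(lambda: {"tokens": BURST, "last": time.monotonic()})

def allow(ip):
    """Classic token bucket keyed by client IP: refill over time, spend one token per request."""
    b = buckets[ip]
    now = time.monotonic()
    b["tokens"] = min(BURST, b["tokens"] + (now - b["last"]) * RATE)
    b["last"] = now
    if b["tokens"] >= 1.0:
        b["tokens"] -= 1.0
        return True
    return False

# 100 logged-in browsers behind one university NAT, loading the page at once:
denied = sum(not allow("university-nat-ip") for _ in range(100))
print(f"legitimate requests denied: {denied} / 100")    # roughly 90 of 100

# Meanwhile 2,000 attacking IPs doing 2 requests each stay comfortably under the limit:
allowed = sum(allow(f"bot-{i}") for i in range(2000) for _ in range(2))
print(f"attack requests allowed: {allowed} / 4000")     # all of them
```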
So, what can I do to handle this situation? Is there a solution that I've not taken into consideration in my assumptions that could help?