I'm wondering what backpressure strategies people commonly use for their web services?
Imagine your service operates under heavy load, and at some point the load reaches 120% of your capacity. How do you deal with this?
The soundest strategy I can think of is to start rejecting connections. If one host reaches its peak capacity (e.g. all Apache workers are busy), I would start rejecting TCP connections until one of the workers frees up. This way every connection that is accepted is handled immediately without queuing (so latency stays minimal), and the excess 20% are rejected, allowing the load balancer to redispatch them to another host or to apply some other load-shedding strategy (e.g. redirecting to static/cached content).
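To make the idea concrete, here is a rough sketch in C of what the fail-fast behavior looks like at the application level (`MAX_WORKERS`, `busy_workers`, and `hand_off_to_worker` are invented for illustration):

```c
#include <stdatomic.h>
#include <sys/socket.h>
#include <unistd.h>

#define MAX_WORKERS 8            /* illustrative capacity limit */

static atomic_int busy_workers;  /* workers decrement this when they finish */

extern void hand_off_to_worker(int fd);  /* assumed helper, defined elsewhere */

/* Fail-fast accept loop: when every worker is busy, close the new
 * connection immediately instead of queueing it behind the others. */
void accept_loop(int listen_fd)
{
    for (;;) {
        int conn = accept(listen_fd, NULL, NULL);
        if (conn < 0)
            continue;
        if (atomic_load(&busy_workers) >= MAX_WORKERS) {
            close(conn);         /* shed load: reject rather than queue */
            continue;
        }
        atomic_fetch_add(&busy_workers, 1);
        hand_off_to_worker(conn);
    }
}
```

The catch, as I explain below, is that by the time the application can close() the socket, the TCP handshake has already completed, so the rejection happens too late for the load balancer to redispatch cleanly.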
I think this fail-fast approach is far superior to any kind of queueing. Small queues are good for absorbing short bursts of traffic, but with excessive queueing the system can fail spectacularly under heavy load. For example, with FIFO processing and no AQM (active queue management), the system can reach a state where every request it finishes has already timed out on the client side, so it makes no forward progress at all.
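To illustrate the failure mode, here is a minimal deadline-aware dequeue sketch in C (`struct request`, `queue_pop`, and `drop_request` are invented helpers): requests whose client-side deadline has already passed are dropped instead of processed, which is exactly the check a plain FIFO lacks.

```c
#include <stdint.h>
#include <time.h>

/* Hypothetical request record: deadline_ms is the point in time after
 * which the client will have given up waiting. */
struct request {
    uint64_t deadline_ms;
    /* ... payload ... */
};

extern struct request *queue_pop(void);       /* assumed queue helpers */
extern void drop_request(struct request *r);

static uint64_t now_ms(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (uint64_t)ts.tv_sec * 1000 + (uint64_t)ts.tv_nsec / 1000000;
}

/* Pop requests until we find one the client is still waiting for.
 * Without a check like this, a deep FIFO can spend all of its time
 * serving requests that have already timed out on the client side. */
struct request *next_live_request(void)
{
    struct request *req;
    while ((req = queue_pop()) != NULL) {
        if (req->deadline_ms > now_ms())
            return req;          /* still worth processing */
        drop_request(req);       /* expired: shed it */
    }
    return NULL;
}
```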
I was surprised that this strategy is not as easy to implement as it sounds. My approach was to set a small listen backlog on the web server, expecting connections that don't fit to be rejected. But due to changes in Linux kernel 2.2 this strategy falls apart (see http://veithen.github.io/2014/01/01/how-tcp-backlog-works-in-linux.html).
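For reference, my naive attempt looked roughly like this sketch (minimal error handling): a listening socket with a deliberately tiny backlog, hoping the kernel would refuse anything beyond it.

```c
#include <netinet/in.h>
#include <stdint.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* Create a listener with a deliberately tiny backlog. The hope was
 * that once one connection is pending, further connection attempts
 * get refused; on Linux >= 2.2 the backlog argument only caps the
 * accept queue, so this does not behave as I expected. */
int make_listener(uint16_t port)
{
    int fd = socket(AF_INET, SOCK_STREAM, 0);
    if (fd < 0)
        return -1;

    struct sockaddr_in addr;
    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_ANY);
    addr.sin_port = htons(port);

    if (bind(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0 ||
        listen(fd, 1) < 0) {     /* backlog of 1: the failed rejection hope */
        close(fd);
        return -1;
    }
    return fd;
}
```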
Newer Linux kernels complete the handshake for you unconditionally: the SYN-ACK is sent to the client without considering the listen backlog size at all. Enabling the tcp_abort_on_overflow option does not help much either. It makes the kernel send an RST when a connection does not fit into the accept queue, but by that point the client already considers the connection ESTABLISHED and may have started sending data.
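From the client's perspective the sequence looks roughly like this sketch (192.0.2.10:8080 is a placeholder address, error handling compressed):

```c
#include <arpa/inet.h>
#include <errno.h>
#include <netinet/in.h>
#include <stdio.h>
#include <sys/socket.h>
#include <unistd.h>

/* Client-side view of the problem. Against a server whose accept queue
 * is full and which has tcp_abort_on_overflow enabled, connect() can
 * still succeed, and the first write() may land in the local socket
 * buffer, so the RST only surfaces later as ECONNRESET. */
int main(void)
{
    int fd = socket(AF_INET, SOCK_STREAM, 0);
    struct sockaddr_in addr = { .sin_family = AF_INET,
                                .sin_port = htons(8080) };
    inet_pton(AF_INET, "192.0.2.10", &addr.sin_addr);

    if (connect(fd, (struct sockaddr *)&addr, sizeof(addr)) == 0)
        puts("connected: handshake completed, client sees ESTABLISHED");

    /* Typically succeeds even if the server will reset us: the kernel
     * just buffers the bytes locally. */
    if (write(fd, "GET / HTTP/1.0\r\n\r\n", 18) > 0)
        puts("request 'sent' as far as the application can tell");

    char buf[512];
    if (read(fd, buf, sizeof(buf)) < 0 && errno == ECONNRESET)
        puts("RST arrived only now, after the request was already sent");

    close(fd);
    return 0;
}
```

By the time the RST is visible, the client has no safe way to know whether the request reached the application.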
This is especially problematic with HAProxy: if a connection was successfully established, it will not redispatch the request to another server, since the request may already have had side effects on the first one.
So I guess my questions are:
- am I the weird one for trying to implement something like this?
- are there any other strategies for dealing with sustained high load you can recommend?
- is the Linux kernel's tcp_abort_on_overflow behavior broken, and should it have applied to the half-open (SYN) queue instead?
Thanks in advance!