OK, this was asked a while back, and I'm late to the party. Still, there is something to add here.
Jackie, you've pretty much nailed it. Your illustration shows how load balancing is handled on most smaller and midsized installations.
You should read the load balancing introduction by Willy Tarreau that Nakedible linked to. It is still valid, and it's a good introduction.
You need to consider how these fit your needs:
- TCP/IP-level load balancers (Linux Virtual Server et al). Lowest per-connection overhead, highest speed, cannot "see" HTTP.
- HTTP-level load balancers (HAProxy, nginx, Apache 2.2, Pound, Microsoft ARR, and more). Higher overhead; can see HTTP, can gzip HTTP, can do SSL, can do sticky-session load balancing.
- HTTP reverse proxies (Apache Traffic Server, Varnish, Squid). Can hold cacheable objects (some web pages, CSS, JS, images) in RAM and serve them to subsequent clients without involving the backend web server. Can often do some of the same things that L7 HTTP load balancers do.
> there is a second balancer as I am sure at some point the balancer would need help too.
Well, sure. But load balancing is simple, and often a single load balancer can go fast. I link to this article, which struck a nerve on the web, just as an example of the performance ballpark a single modern server can provide. Don't use multiple LBs before you need to. When you do, a common approach is IP-level load balancers at the very front (or DNS round robin), feeding HTTP-level load balancers, feeding the proxies and webapp servers.
> help on what the "balancer/s" should be and best practices on how to set them up.
The trouble spot is session state handling and, to some extent, failure-state behavior. Setting up the load balancers themselves is comparatively straightforward.
If you're just using 2-4 backend webapp servers, static hashing based on the origin IP address can be viable. This avoids the need for shared session state among the webapp servers. Each webapp node sees 1/N of the overall traffic, and the customer-to-server mapping is static in normal operation. It's not a good fit for larger installations, though: changing the pool size remaps most clients to a different server, and large proxies or NATs in front of many clients skew the distribution.
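To make that concrete, here is a minimal sketch of origin-IP hashing (the backend addresses are made up). The one subtlety worth showing is that the hash has to be stable across processes and restarts:

```python
import hashlib

# Hypothetical backend pool; the addresses are placeholders.
BACKENDS = ["10.0.0.11", "10.0.0.12", "10.0.0.13", "10.0.0.14"]

def pick_backend(client_ip: str) -> str:
    """Map a client IP to the same backend on every request.

    Uses a stable hash (MD5 here) rather than Python's built-in
    hash(), which is randomized per process and would break the
    static customer-to-server mapping."""
    digest = hashlib.md5(client_ip.encode("ascii")).digest()
    return BACKENDS[int.from_bytes(digest[:4], "big") % len(BACKENDS)]

print(pick_backend("203.0.113.7"))  # same backend every time for this IP
```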
The two best load balancing algorithms, in the sense that they have benign behavior under high load and give an even load distribution, are round robin and true random load balancing. Both of these require that your web application has global session state, i.e. state available to all webapp nodes. How this is done depends on the webapp tech stack, but there are generally standard solutions available for this.
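For illustration, both algorithms fit in a few lines (the backend names are placeholders). Note that neither one looks at the client at all, which is exactly why they need that shared session state:

```python
import itertools
import random

BACKENDS = ["app1", "app2", "app3"]  # hypothetical pool

# Round robin: hand out backends in a fixed rotation.
_rotation = itertools.cycle(BACKENDS)

def round_robin() -> str:
    return next(_rotation)

# True random: uniform pick; spreads clients evenly in expectation.
def pick_random() -> str:
    return random.choice(BACKENDS)

# Weighted random, for pools with unequal hardware
# (the 3:1:1 weights here are made up for the example).
def pick_weighted() -> str:
    return random.choices(BACKENDS, weights=(3, 1, 1), k=1)[0]
```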
If neither static hashing nor shared session state is a good fit for you, then the choice is generally 'sticky session' load balancing with per-server session state. In most cases this works fine, and it is a fully viable choice.
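How stickiness is implemented varies by product; a common approach is a cookie pin, roughly like this sketch (the cookie name `SERVERID` and the exact flow are my assumption, not any particular balancer's behavior):

```python
import random

BACKENDS = ["app1", "app2", "app3"]   # hypothetical pool
COOKIE = "SERVERID"                   # hypothetical cookie name

def route(request_cookies: dict) -> tuple[str, dict]:
    """Return (backend, cookies_to_set) for one request."""
    pinned = request_cookies.get(COOKIE)
    if pinned in BACKENDS:                # returning client: honor the pin
        return pinned, {}
    backend = random.choice(BACKENDS)     # new client: pick one, then pin it
    return backend, {COOKIE: backend}

# First request arrives without the cookie; the balancer picks and pins.
backend, to_set = route({})
# Later requests carry the cookie and land on the same backend,
# where that client's session state lives.
assert route(to_set)[0] == backend
```

This is also where the failure-state behavior mentioned above bites: if a pinned backend goes down, its clients lose their sessions.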
> the balancer/s would see how many connections are on each apache instance (via some config list of internal IP's or eternal IPs) and distributes the connections equally
Yeah, some sites use this; what you're describing is usually called 'least connections' balancing. There are many names for the many different load balancing algorithms that exist. If you can pick round robin or random (or weighted round robin, weighted random), then I would recommend you do so, for the reasons given above.
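For completeness, the algorithm the question describes is simple in principle; this sketch keeps made-up in-process counters, whereas a real balancer maintains them from its own connection table:

```python
# Hypothetical live connection counts, as a balancer might track them.
active = {"app1": 12, "app2": 7, "app3": 9}

def least_connections() -> str:
    """Pick the backend currently serving the fewest connections."""
    return min(active, key=active.get)

backend = least_connections()   # "app2" with the counts above
active[backend] += 1            # increment on connect...
# active[backend] -= 1          # ...decrement when the connection closes
```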
Last thing: Don't forget that many vendors (F5, Cisco and others at the high end; e.g. Coyote Point and Kemp Technologies at more reasonable prices) offer mature load balancing appliances.