Foreword
Update in 2016: things are evolving, all servers are getting better, they all support SSL, and the web is more amazing than ever.
Unless stated otherwise, the following is targeted at professionals in business and start-ups, supporting thousands to millions of users.
These tools and architectures require a lot of users/hardware/money. You can try them in a home lab or to run a blog, but that doesn't make much sense.
As a general rule, remember that you want to keep it simple. Every piece of middleware appended is another critical piece of middleware to maintain. Perfection is achieved not when there is nothing left to add, but when there is nothing left to remove.
Some Common and Interesting Deployments
HAProxy (balancing) + nginx (PHP application + caching)
The webserver is nginx running PHP. Since nginx is already there, it might as well handle the caching and redirections.
HAProxy ---> nginx-php
A       ---> nginx-php
P       ---> nginx-php
r       ---> nginx-php
o       ---> nginx-php
x       ---> nginx-php
y       ---> nginx-php
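A minimal haproxy.cfg sketch of that layout (the backend name, health-check path and server addresses are made up for illustration):

    frontend www
        bind :80
        mode http
        default_backend nginx_php

    backend nginx_php
        mode http
        balance roundrobin
        option httpchk GET /ping          # assumes the PHP app exposes some health endpoint
        server web1 10.0.0.11:80 check
        server web2 10.0.0.12:80 check
        server web3 10.0.0.13:80 check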
HAProxy (balancing) + Varnish (caching) + Tomcat (Java application)
HAProxy can redirect to Varnish based on the request URI (*.jpg *.css *.js).
HAProxy ---> tomcat
A       ---> tomcat
        ---> tomcat
P       ---> tomcat <----+
r       ---> tomcat <---+|
o                       ||
x       ---> varnish <--+|
y       ---> varnish <---+
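A hedged sketch of the URI-based routing in haproxy.cfg (backend names, ports and addresses are invented; the Varnish instances are assumed to listen on their default 6081):

    frontend www
        bind :80
        mode http
        acl static path_end .jpg .css .js
        use_backend varnish_static if static
        default_backend tomcat_app

    backend tomcat_app
        mode http
        server tomcat1 10.0.0.21:8080 check
        server tomcat2 10.0.0.22:8080 check

    backend varnish_static
        mode http
        server varnish1 10.0.0.31:6081 check
        server varnish2 10.0.0.32:6081 check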
HAProxy (balancing) + nginx (SSL to the host and caching) + Webserver (application)
The webservers don't speak SSL even though EVERYONE MUST SPEAK SSL (especially on this HAProxy-webserver link, with private user information going through EC2). Adding a local nginx brings SSL all the way to the host. Once nginx is there, it might as well do some caching and URL rewriting.
Note: there is a 443:8080 port redirection happening, but it is incidental rather than a feature. There is no point in the port redirection itself; the load balancer could just as well speak to webserver:8080 directly.
(nginx + webserver on same host)
HAProxy ---> nginx:443 -> webserver:8080
A       ---> nginx:443 -> webserver:8080
P       ---> nginx:443 -> webserver:8080
r       ---> nginx:443 -> webserver:8080
o       ---> nginx:443 -> webserver:8080
x       ---> nginx:443 -> webserver:8080
y       ---> nginx:443 -> webserver:8080
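A sketch of the per-host nginx doing the SSL termination and local proxying (certificate paths and the server name are placeholders):

    server {
        listen 443 ssl;
        server_name app.example.com;

        ssl_certificate     /etc/nginx/ssl/app.crt;
        ssl_certificate_key /etc/nginx/ssl/app.key;

        location / {
            proxy_pass http://127.0.0.1:8080;              # the local webserver
            proxy_set_header Host $host;
            proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
            proxy_set_header X-Forwarded-Proto https;
        }
    }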
Middleware
HAProxy: THE load balancer
Main Features:
- Load balancing (TCP, HTTP, HTTPS)
- Multiple algorithms (round robin, source IP, headers)
- Session persistence
- SSL termination
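A hedged haproxy.cfg fragment showing a couple of these features together, SSL termination plus cookie-based session persistence (the certificate path and server addresses are invented):

    frontend www_ssl
        bind :443 ssl crt /etc/haproxy/certs/site.pem      # SSL termination
        mode http
        default_backend app

    backend app
        mode http
        balance roundrobin                                 # or: balance source
        cookie SRV insert indirect nocache                 # session persistence
        server app1 10.0.0.41:80 check cookie app1
        server app2 10.0.0.42:80 check cookie app2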
Similar Alternatives: nginx (multi-purpose web-server configurable as a load balancer)
Different Alternatives: Cloud (Amazon ELB, Google load balancer), Hardware (F5, Fortinet, Citrix NetScaler), Other & worldwide (DNS, anycast, CloudFlare)
What does HAProxy do and when do you HAVE TO use it?
Whenever you need load balancing, HAProxy is the go-to solution.
Except when you want something very cheap OR quick & dirty OR you don't have the skills available, then you may use an ELB :D
Except when you're in banking/government/similar, required to use your own datacenter with hard requirements (dedicated infrastructure, dependable failover, two layers of firewall, auditing, an SLA paying x% per minute of downtime, all in one), then you may put 2 F5s on top of the rack containing your 30 application servers.
Except when you want to go past 100k HTTP(S) connections [and multi-site], then you MUST have multiple HAProxies with a layer of [global] load balancing in front of them (CloudFlare, DNS, anycast). Theoretically, the global balancer could talk straight to the webservers, which would allow ditching HAProxy. Usually, however, you SHOULD keep HAProxy(s) as the public entry point(s) to your datacenter and tune advanced options to balance fairly across hosts and minimize variance.
Personal Opinion: A small, contained, open source project, entirely dedicated to ONE TRUE PURPOSE. Among the easiest to configure (ONE file), most useful and most reliable pieces of open source software I have come across in my life.
Nginx: Apache that doesn't suck
Main Features:
- WebServer HTTP or HTTPS
- Run applications (CGI, PHP, others)
- URL redirection/rewriting
- Access control
- HTTP Headers manipulation
- Caching
- Reverse Proxy
Similar Alternatives: Apache, Lighttpd, Tomcat, Gunicorn...
Apache was the de facto web server, also known as a giant clusterfuck of dozens of modules and thousands of lines of httpd.conf on top of a broken request-processing architecture. nginx redoes all of that with fewer modules, (slightly) simpler configuration and a better core architecture.
What does nginx do and when do you HAVE TO use it?
A webserver is intended to run applications. When your application is developed to run on nginx, you already have nginx and you may as well use all its features.
Except when your application is not intended to run on nginx and nginx is nowhere to be found in your stack (Java shop anyone?), then there is little point in adding nginx. The webserver features are likely to exist in your current webserver, and the other tasks are better handled by the appropriate dedicated tool (HAProxy/Varnish/CDN).
Except when your webserver/application is lacking features, is hard to configure and/or you'd rather die than look at it (Gunicorn anyone?), then you may put an nginx in front (i.e. locally on each node) to perform URL rewriting, send 301 redirections, enforce access control, provide SSL encryption, and edit HTTP headers on-the-fly. [These are the features expected from a webserver.]
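For instance, a sketch of that per-node nginx sitting in front of a local Gunicorn on port 8000 (names, paths and the internal network range are made up):

    server {
        listen 443 ssl;                                    # SSL up to the host
        server_name api.example.com;
        ssl_certificate     /etc/nginx/ssl/api.crt;
        ssl_certificate_key /etc/nginx/ssl/api.key;

        location = /old-endpoint {
            return 301 /new-endpoint;                      # URL redirection
        }

        location /admin/ {
            allow 10.0.0.0/8;                              # access control (assumed internal range)
            deny  all;
            proxy_pass http://127.0.0.1:8000;
        }

        location / {
            proxy_pass http://127.0.0.1:8000;              # Gunicorn
            proxy_set_header X-Forwarded-Proto https;      # header editing on the fly
            add_header X-Frame-Options DENY;
        }
    }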
Varnish: THE caching server
Main Features:
- Caching
- Advanced Caching
- Fine Grained Caching
- Caching
Similar Alternatives: nginx (multi-purpose web-server configurable as a caching server)
Different Alternatives: CDN (Akamai, Amazon CloudFront, CloudFlare), Hardware (F5, Fortinet, Citrix Netscaler)
What does Varnish do and when do you HAVE TO use it?
It does caching, only caching. It's usually not worth the effort and it's a waste of time. Try CDN instead. Be aware that caching is the last thing you should care about when running a website.
Except when you're running a website exclusively about pictures or videos then you should look into CDN thoroughly and think about caching seriously.
Except when you're forced to use your own hardware in your own datacenter (CDN ain't an option) and your webservers are terrible at delivering static files (adding more webservers ain't helping) then Varnish is the last resort.
Except when you have a site with mostly-static-yet-complex-dynamically-generated-content (see the following paragraphs) then Varnish can save a lot of processing power on your webservers.
Static caching is overrated in 2016
Caching is almost configuration free, money free, and time free. Just subscribe to CloudFlare, or CloudFront, or Akamai, or MaxCDN. The time it takes me to write this line is longer than the time it takes to set up caching, AND the beer I am holding in my hand is more expensive than the median CloudFlare subscription.
All these services work out of the box for static *.css *.js *.png and more. In fact, they mostly honour the Cache-Control directive in the HTTP header. The first step of caching is to configure your webservers to send proper cache directives. It doesn't matter what CDN, what Varnish, what browser is in the middle.
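For example, a small nginx fragment sending those cache directives for static assets (the 7-day lifetime is an arbitrary choice):

    # inside a server { } block
    location ~* \.(css|js|png|jpg|gif|svg|woff2?)$ {
        expires 7d;        # emits "Cache-Control: max-age=604800" plus an Expires header
    }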
Performance Considerations
Varnish was created at a time when the average web server was choking trying to serve a cat picture on a blog. Nowadays a single instance of the average modern multi-threaded asynchronous buzzword-driven webserver can reliably deliver kittens to an entire country, courtesy of sendfile().
I did some quick performance testing for the last project I worked on. A single Tomcat instance could serve 21,000 to 33,000 static files per second over HTTP (testing files from 20 B to 12 kB with varying HTTP/client connection counts). The sustained outbound traffic was beyond 2.4 Gb/s; production will only have 1 Gb/s interfaces. You can't do better than the hardware allows, so there is no point in even trying Varnish.
Caching Complex Changing Dynamic Content
CDNs and caching servers usually ignore URLs with parameters like ?article=1843, they ignore any request with session cookies or authenticated users, and they ignore most MIME types, including the application/json from /api/article/1843/info. There are configuration options available, but they are usually not fine grained, rather "all or nothing".
Varnish can have custom complex rules (see VCL) to define what is cacheable and what is not. These rules can cache specific content by URI, headers, current user session cookie, MIME type and content ALL TOGETHER. That can save a lot of processing power on the webservers for some very specific load patterns. That's when Varnish is handy and AWESOME.
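Purely as an illustration, a small VCL 4.0 sketch in that spirit, caching the hypothetical /api/article/<id>/info JSON responses even when a session cookie is present and leaving everything else to the default behaviour:

    vcl 4.0;

    backend default {
        .host = "127.0.0.1";     # assumed local webserver
        .port = "8080";
    }

    sub vcl_recv {
        # Cache this specific read-only API even for logged-in users,
        # by ignoring the session cookie for these URLs only.
        if (req.url ~ "^/api/article/[0-9]+/info$") {
            unset req.http.Cookie;
            return (hash);
        }
    }

    sub vcl_backend_response {
        if (bereq.url ~ "^/api/article/[0-9]+/info$" &&
            beresp.http.Content-Type ~ "application/json") {
            set beresp.ttl = 60s;                  # arbitrary TTL for the example
            unset beresp.http.Set-Cookie;
        }
    }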
Conclusion
It took me a while to understand all these pieces, when to use them and how they fit together. Hope this can help you.
That turned out to be quite long (6 hours to write. OMG! :O). Maybe I should start a blog or a book about this. Fun fact: there doesn't seem to be a limit on answer length.