The Background
We're running a Varnish cache server in front of a ColdFusion (CF)/Apache2 backend server. The Varnish box runs a health-check probe every two seconds, configured as follows:
    probe healthcheck {
        .url = "/probe.cfm";
        .timeout = 5s;
        .interval = 2s;
        .window = 10;
        .threshold = 5;
        .initial = 5;
        .expected_response = 200;
    }
    backend web1 {
        .host = "<backend ip>";
        .port = "80";
        .probe = healthcheck;
    }
The probe.cfm does this:
    <cfoutput>
    <!doctype html>
    <!--[if lt IE 7]> <html class="no-js ie6 oldie" lang="en"> <![endif]-->
    <!--[if IE 7]> <html class="no-js ie7 oldie" lang="en"> <![endif]-->
    <!--[if IE 8]> <html class="no-js ie8 oldie" lang="en"> <![endif]-->
    <!--[if gt IE 8]><!--> <html class="no-js" lang="en"> <!--<![endif]-->
    <head>
        <meta charset="utf-8">
        <title>CF Probe</title>
    </head>
    <body>
    </cfoutput>
    <cfquery name="qryProbe" datasource="#Request.DSN#">
        SELECT TOP 1 [PageID] FROM [Page] WHERE [PageID] > 6
    </cfquery>
    <cfoutput>
    #Variables.qryProbe.RecordCount#
    </body>
    </html>
    </cfoutput>
This page selects a single record from the underlying database (through the configured data source) and returns a 200 if the query succeeds.
Later in the Varnish config there's a section that tests whether the backend is up. If it isn't, the grace period on the cache is set to 24 hours, and any page that isn't in the cache should get a synthetic maintenance page instead.
    sub vcl_recv {
        if (req.backend.healthy) {
            set req.grace = 30s;
        } else {
            set req.grace = 24h;
        }
    }
    sub vcl_error {
        if (!req.backend.healthy && obj.status != 200 && obj.status != 403 && obj.status != 404 && obj.status != 301 && obj.status != 302) {
            synthetic {"<some HTML here>"};
        }
    }
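Note that req.grace on its own only controls how stale an object a request is willing to accept. For the 24-hour grace to have anything to serve, objects also have to be kept past their TTL, which in this VCL dialect is done with beresp.grace in vcl_fetch. A minimal sketch, assuming Varnish 3.x syntax; the 24h value simply mirrors the req.grace above and is an assumption rather than a line from the actual config:

```vcl
sub vcl_fetch {
    # Assumption: keep every object for up to 24h past its TTL so that
    # the 24h req.grace set in vcl_recv can still find a stale copy
    # while the backend is marked sick.
    set beresp.grace = 24h;
}
```

Without something like this, objects are only kept for the default grace period after they expire, however large req.grace is.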
The Problem
We recently had a situation where the CF instance wasn't, strictly speaking, unresponsive, but it wasn't serving pages either. According to the Varnish logs, however, the backend was still healthy, so Varnish quickly stopped serving content, too.
Additionally, I saw at least one case where the backend was returning 500 errors while we restarted the CF instance, and Varnish was still reporting it as healthy, even though the health-check line in the log showed that it had received a 500 from the backend.
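The arithmetic behind that second observation can be sketched. As I understand the probe settings, a backend counts as healthy as long as at least .threshold of the last .window probes succeeded. With .window = 10 and .threshold = 5, a previously healthy backend therefore needs six consecutive failed probes before it is marked sick: roughly 12 seconds at a 2s interval, and possibly longer if each probe has to wait out its .timeout. A stricter variant, purely as a sketch (the name healthcheck_strict and all of the values are illustrative assumptions, not tested settings):

```vcl
# Hypothetical stricter probe: same URL, smaller window, higher threshold.
probe healthcheck_strict {
    .url = "/probe.cfm";
    .timeout = 2s;             # assumption: fail faster than the original 5s
    .interval = 2s;
    .window = 5;               # only look at the last 5 probes
    .threshold = 4;            # at least 4 of those 5 must have succeeded
    .initial = 4;              # start out healthy, as the original probe does
    .expected_response = 200;
}
```

With these values, two consecutive failures (about 4 seconds) would be enough to mark the backend sick, at the cost of being more sensitive to brief hiccups.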
The Question
How can I test the health of the CF backend more accurately, so that Varnish responds correctly to outages, reboots, and so on?
Additionally, can anyone see any glaring flaws in the backend health checks I've set up, or in the test that determines whether the synthetic HTML gets rendered?