Varnish failing intermittently for no obvious reason

Question

For the last few years we have been running Varnish as a cache and load balancer in front of several apache servers serving several thousand websites.

We also use monit to ensure that if varnish ever dies it gets restarted. The varnish section in monitrc looks like this:

  # Check varnish on port 80
  check process varnish with pidfile /var/run/varnishd.pid
  start program = "/etc/init.d/varnish start"
  stop program = "/etc/init.d/varnish stop"
  if failed host 127.0.0.1 port 80 protocol http
    and request "/monit-check-url"
    then restart

This has worked fine for at least 3 years. We get occasional failures of the port 80 check, but monit restarts varnish accordingly and it's generally unnoticeable to users.

However, over the last few weeks we are seeing flurries of these failures, usually over a period of a couple of hours, and users are noticing connection failures. Today has been particularly bad.

There are no clues in syslog (it's a Debian box, btw) of the kind suggested by the "Varnish crashing" section at: https://www.varnish-cache.org/docs/3.0/tutorial/troubleshooting.html. All we see in there is monit failing its check on port 80, then stopping and starting varnish.

Additionally we are not seeing any spike in bandwidth or number of hits to the backend webservers that would suggest it's failing under higher than normal load.

We were running Varnish 3.0.3 which I upgraded to 3.0.7 but the problem has continued. No other changes have been made to this box that coincide with the problems starting, and the varnish configuration hasn't been changed in quite a long time.

Has anyone had any similar experiences with varnish or have any suggestions on troubleshooting this further? Could it be some sort of attack?

Any help or advice greatly appreciated!

Do you mean the varnish that is bound to port 6082 (aka varnishadm) is dying as well? Or just the varnish _child_? — karatedog, Apr 16 '16 at 23:46

score 0 · Answer 1 · answered Mar 16 '17 at 00:22

Your approach here seems a little heavy-handed, as there are many reasons why a request could fail, not all of which are varnish problems (e.g. connectivity issues or failures on the backends). Restarting varnish causes an outage while it starts up again, so it should only be used as a last resort.

Before restarting anything, I'd recommend running varnishadm debug.health on the varnish box to see what state varnish considers your backend to be in. Depending on the result, you can decide where to look further:

If the backend is considered unhealthy, then the problem lies between varnish and the backend (or in the backend itself). Check the networking to the backend, plus any monitoring on the backend.
If the backend is considered healthy, then the problem lies between monit and varnish. Check the networking to the varnish server, plus debug the monitoring itself.
If the varnishadm process can't establish a connection, then the problem is with varnish itself. Check which varnish processes are running and look for any error messages from varnish in your logs.
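
As a rough sketch, the three cases above could be scripted along these lines. The triage function and its messages are made up for illustration, and the check for the word "Sick" is my assumption about how debug.health words an unhealthy backend on your version, so confirm it against your own output first. Note that debug.health only reports backends that have a probe defined in the VCL.

```shell
#!/bin/sh
# Hypothetical triage helper (not part of Varnish or monit).
# $1: exit status of `varnishadm debug.health`
# $2: the text that command printed
# Assumption: an unhealthy backend is reported with the word "Sick".
triage() {
    status=$1
    output=$2
    if [ "$status" -ne 0 ]; then
        # varnishadm could not talk to the management interface
        echo "varnish: check varnishd processes and logs"
    elif printf '%s\n' "$output" | grep -q 'Sick'; then
        # at least one backend is failing its probe
        echo "backend: check the backend and the network path to it"
    else
        # varnish and its backends look fine from here
        echo "monitoring: check monit and the network path to varnish"
    fi
}

# Only query varnish when varnishadm is actually installed.
if command -v varnishadm >/dev/null 2>&1; then
    out=$(varnishadm debug.health 2>&1)
    triage "$?" "$out"
fi
```

Running this by hand (or logging its output from monit before any restart) while the port 80 check is failing should at least tell you which of the three areas to dig into.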
