
We are using Ganglia to monitor our cloud infrastructure on Amazon AWS. Everything is working correctly (metrics are flowing, etc.), except that occasionally the gmetad process segfaults out of the blue. The gmetad process runs on an m3.medium EC2 instance and monitors about 50 servers. The servers are arranged into groups, each with a bastion EC2 instance where the group's metrics are gathered. gmetad is configured to grab the metrics from those bastions - about 10 of them.
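
For reference, the relevant part of our gmetad.conf looks roughly like this sketch; the cluster names and bastion hostnames here are placeholders, not our real ones:

    # Excerpt from /etc/ganglia/gmetad.conf (illustrative names)
    # Syntax: data_source "cluster name" [polling interval] host:port ...
    data_source "web-tier" 60 bastion-web.internal:8649
    data_source "db-tier"  60 bastion-db.internal:8649
    # ...one data_source line per bastion, about 10 in total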

Some useful facts:

  • We are running Debian Wheezy on all the EC2s
  • The crash produces no logs in normal operation other than a kernel segfault message like "gmetad[11291]: segfault at 71 ip 000000000040547c sp 00007ff2d6572260 error 4 in gmetad[400000+e000]". When we run gmetad manually with debug logging (see the first sketch after this list), the crash appears to happen while gmetad is doing a cleanup.
  • Once we realised that the cleanup process might be to blame, we did more research around it. We found that our disk IO was far too high, so we added rrdcached to reduce it (see the second sketch after this list). Disk IO is now much lower and the crash happens less often, but it still occurs about once a day on average.
  • We have two systems (dev and production). Both exhibit the crash, but the dev system, which monitors a much smaller group of servers, crashes significantly less often.
  • The production system is running ganglia 3.3.8-1+nmu1 and rrdtool 1.4.7-2. We've upgraded the dev system to ganglia 3.6.0-2~bpo70+1 (rrdtool still 1.4.7-2), but that doesn't seem to have helped with the crash.
  • We have monit running on both systems, configured to restart gmetad if it dies (see the third sketch after this list). It restarts immediately with no issues.
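
For anyone wanting to reproduce our debugging: this is roughly how we run gmetad interactively. The debug level is just what we happen to use, the daemon user and binary path may differ on your install, and the gdb step is only the standard way to capture a backtrace from a segfault:

    # Keep monit from restarting gmetad behind our back, then stop the daemon
    sudo monit unmonitor gmetad
    sudo service gmetad stop
    # A debug level above zero keeps gmetad in the foreground with verbose output
    sudo -u ganglia gmetad -d 2
    # Or run it under gdb to get a backtrace when it segfaults:
    sudo -u ganglia gdb --args /usr/sbin/gmetad -d 2
    (gdb) run
    # ...wait for the SIGSEGV, then:
    (gdb) backtrace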
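
For completeness, this is the shape of our rrdcached wiring. The socket path and flag values are our choices, and the exact variable names in the defaults files depend on your init scripts; the key mechanism is that librrd honours the RRDCACHED_ADDRESS environment variable, which routes gmetad's RRD updates through the cache daemon:

    # /etc/default/rrdcached (illustrative values)
    OPTS="-l unix:/var/run/rrdcached.sock -j /var/lib/rrdcached/journal -w 1800 -z 900 -F"

    # /etc/default/gmetad - point librrd at the cache daemon
    export RRDCACHED_ADDRESS=unix:/var/run/rrdcached.sock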
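
And the monit stanza is nothing exotic - something along these lines (the pidfile path is whatever the init script writes; ours is equivalent to this):

    # Hypothetical /etc/monit/conf.d/gmetad
    check process gmetad with pidfile /var/run/gmetad.pid
        start program = "/etc/init.d/gmetad start"
        stop program  = "/etc/init.d/gmetad stop"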

Has anyone experienced this kind of crash, especially on Amazon hardware? We're at our wits' end trying to find a solution!

SamBarham
  • OK, it's an old thread, but we have the same issue with our cluster (no cloud) running rrdcached 1.7.0 and Ganglia 3.7.2. Did you find a solution? – Yann Sagon Aug 21 '17 at 15:28

0 Answers