Graphite stops collecting data randomly

Question

We have a Graphite server that collects data through collectd, statsd, JMXTrans ... For the past few days, we have frequently had holes in our data. Digging through the data we still have, we can see an increase in the carbon cache size (from 50K to 4M). We don't see an increase in the number of metrics collected (metricsReceived is stable at around 300K). The number of queries has increased from 1000 to 1500 on average.

Strangely, cpuUsage decreases slightly, from 100% (we have 4 CPUs) to 50%, when the cache size increases.

Strangely again, we see an increase in the number of octets read from disk, and a decrease in the number of octets written.

We have carbon configured mostly with default values:

MAX_CACHE_SIZE = inf
MAX_UPDATES_PER_SECOND = 5000
MAX_CREATES_PER_MINUTE = 2000

Obviously, something has changed in our system, but we don't understand what, nor how we can find the cause ...

Any help?

I usually take a ground-up approach to Graphite issues: is there space on the disk to write to? Have the data directory permissions changed at all? Has there been a change in the daemon user collecting stats? If there is no clear cause, it's entirely possible you have RRD corruption, and you may need to find a way to export what you have and start metric collection from scratch. — Stephan, Aug 28 '13 at 22:21
We checked disk space and permission, nothing strange there. No change in the daemon collecting data, maybe an increase in the number of metrics, but not that big. We're looking into WSP corruption. — Guillaume, Sep 01 '13 at 19:09

score 2 · Answer 1 · answered Oct 28 '13 at 05:45

This is not a bug in the Graphite stack, but rather an IO bottleneck, most probably because your storage does not provide enough IOPS. Because of this, the queue keeps building up and overflows at 4M. At that point, you lose that much queued data, which shows up later as random 'gaps' in your graph. Your system cannot keep up with the rate at which it is receiving metrics. The queue keeps filling up and overflowing.
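To illustrate the arithmetic behind a growing queue, here is a minimal sketch. It assumes a deliberately simplified model in which a fixed number of datapoints arrive and are written to disk each minute; the function name and the rates are hypothetical and are not taken from carbon itself.

```python
def simulate_cache(received_per_min, written_per_min, minutes, start=0):
    """Return the queue size after each minute under a fixed-rate model.

    Simplifying assumption: arrivals and writes are constant, and the
    queue can never drop below zero.
    """
    sizes = []
    size = start
    for _ in range(minutes):
        size = max(0, size + received_per_min - written_per_min)
        sizes.append(size)
    return sizes


# Hypothetical rates: the disk keeps up, so the queue stays where it started.
print(simulate_cache(300_000, 300_000, 3, start=50_000))
# Hypothetical rates: the disk falls 10% behind, so the queue grows each minute.
print(simulate_cache(300_000, 270_000, 3, start=50_000))
```

Under this model, a 10% shortfall adds 30K datapoints per minute, so a queue would go from 50K to 4M in a little over two hours. Real write throughput is not constant, so treat this only as a rough way to reason about the trend.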

Strangely, cpuUsage decreases slightly, from 100% (we have 4 CPUs) to 50%, when the cache size increases.

This is because your system begins swapping and the CPUs get a lot of 'idle time', because of the IO wait.
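One way to check whether the CPUs are really waiting on IO is to compare two samples of the aggregate `cpu` line in `/proc/stat` on Linux. The sketch below is only an illustration: the function name is made up, and it assumes the standard field order (user, nice, system, idle, iowait, irq, softirq, steal).

```python
def iowait_fraction(before, after):
    """Fraction of CPU time spent in iowait between two 'cpu' lines.

    Each argument is the aggregate 'cpu ...' line from /proc/stat.
    Only the first eight counters are summed, because guest time is
    already included in the user counter.
    """
    a = [int(x) for x in before.split()[1:9]]
    b = [int(x) for x in after.split()[1:9]]
    deltas = [y - x for x, y in zip(a, b)]
    total = sum(deltas)
    # Index 4 is the iowait counter in the assumed field order.
    return deltas[4] / total if total else 0.0


# Usage on a Linux host (not executed here):
#   import time
#   with open("/proc/stat") as f:
#       before = f.readline()
#   time.sleep(5)
#   with open("/proc/stat") as f:
#       after = f.readline()
#   print(iowait_fraction(before, after))
```

A value that stays high while the carbon cache grows would be consistent with the IO wait explanation; a value near zero would suggest looking elsewhere.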

To add context, I have 500 provisioned IOPS on AWS on a system that receives some 40K metrics. The queue is stable at 50K.

I'm seeing the exact same issue described in the question. However, disk usage is minimal (reported as 0%-3% by atop) and I'm only pushing ~80 metrics/s through StatsD. Therefore it seems unlikely that I have an IO bottleneck. Any idea of what might be causing the issue? — heyman, May 04 '15 at 08:40

Michael Martinez · Answer 2 · 2017-06-12T20:14:29.403

The other answer mentioned a disk I/O bottleneck. I'll talk about a network bottleneck as another cause of this.

In my environment, we run a cluster of front end UI servers (httpd, memcached); another cluster of middle layer relays (carbon-c-relay performing forwarding and aggregation); and a backend layer (httpd, memcached, carbon-c-relay, and carbon-cache.) Each of these clusters consists of multiple instances in EC2 and in total process 15 million metrics per minute.

We had a problem where we were seeing gaps in the metrics generated by the aggregate "sum" function, and the aggregated values were also incorrect (too low). The problem would ease after restarting carbon-c-relay in the middle layer, but gaps would start appearing again after several hours.
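To quantify gaps like these, one option is to scan a series as returned by Graphite's render API with `format=json`, where each datapoint is a `[value, timestamp]` pair and a missing value is `null`. This is a minimal sketch; the function name and the sample data are invented for illustration.

```python
def find_gaps(datapoints):
    """Return (start_ts, end_ts, count) for each run of missing values.

    `datapoints` is a list of [value, timestamp] pairs; a value of None
    (JSON null) marks an interval with no data.
    """
    gaps = []
    run_start = None
    last_ts = None
    count = 0
    for value, ts in datapoints:
        if value is None:
            if run_start is None:
                # A new run of missing values begins here.
                run_start = ts
                count = 0
            count += 1
            last_ts = ts
        elif run_start is not None:
            # A real value closes the current run.
            gaps.append((run_start, last_ts, count))
            run_start = None
    if run_start is not None:
        # The series ended while still inside a gap.
        gaps.append((run_start, last_ts, count))
    return gaps


# Made-up sample with one-minute resolution.
sample = [[1.0, 60], [None, 120], [None, 180], [2.0, 240], [None, 300]]
print(find_gaps(sample))
```

Tracking how the number and length of gaps change after a relay restart can help confirm whether the problem returns on a regular schedule.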

We had aggregation taking place in both the middle layer and the backend layer (the backend layer aggregated the aggregated metrics passed to it from the middle layer).

The middle layer hosts were not CPU bound, not disk bound, and had no memory constraints. This, combined with the fact that the problem would only appear a few hours after restarting the relay processes, pointed to a network bottleneck. Our solution was simply to add more hosts to the middle layer. Doing this instantly resulted in the aggregated metrics being computed correctly, without gaps.

Where exactly in the network stack was the bottleneck? I couldn't tell you. It could have been on the Linux hosts; it could have been on the Amazon side.
