We have a Graphite server that collects data through collectd, statsd, JMXTrans ... For the past few days, we have frequently had holes in our data. Digging through the data we still have, we can see that the carbon cache size has grown (from 50K to 4M). We don't see an increase in the number of metrics collected (metricsReceived is stable at around 300K). The number of queries has increased from 1000 to 1500 on average.
Strangely, cpuUsage decreases slightly, from 100% (we have 4 CPUs) to 50%, when the cache size increases.
Strangely again, we see an increase in the number of bytes read from disk, and a decrease in the number of bytes written.
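For anyone who wants to look at the same numbers: they come from carbon's self-reported metrics, and can be pulled side by side through the render API. Below is a minimal sketch of that, not our exact tooling. The base URL `graphite.example.com` is a placeholder, the metric names assume the default `carbon.agents.*` prefix, and `build_render_url`, `summarize` and `fetch_summary` are just illustrative helper names.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Placeholder base URL of the Graphite web app (assumption, not our real host).
GRAPHITE_URL = "http://graphite.example.com"

# Carbon self-metrics mentioned above; names assume the default
# carbon.agents.<instance> prefix.
TARGETS = [
    "carbon.agents.*.cache.size",
    "carbon.agents.*.metricsReceived",
    "carbon.agents.*.cache.queries",
    "carbon.agents.*.cpuUsage",
]


def build_render_url(base_url, targets, from_="-7d"):
    """Build a render API URL that returns the given targets as JSON."""
    params = [("target", t) for t in targets]
    params += [("from", from_), ("format", "json")]
    return base_url.rstrip("/") + "/render?" + urlencode(params)


def summarize(series_list):
    """Return {target: (min, mean, max)}, ignoring null datapoints."""
    summary = {}
    for series in series_list:
        # Each datapoint is a [value, timestamp] pair; value may be null.
        values = [v for v, _ts in series["datapoints"] if v is not None]
        if values:
            summary[series["target"]] = (
                min(values),
                sum(values) / len(values),
                max(values),
            )
    return summary


def fetch_summary(base_url=GRAPHITE_URL, targets=TARGETS, from_="-7d"):
    """Fetch the series over HTTP and summarize them (needs network access)."""
    with urlopen(build_render_url(base_url, targets, from_), timeout=30) as resp:
        return summarize(json.load(resp))
```

Calling `fetch_summary(from_="-14d")` and comparing it with a shorter window such as `-2d` is one rough way to see which of these series moved when the holes started.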
We have carbon configured mostly with default values:
- MAX_CACHE_SIZE = inf
- MAX_UPDATES_PER_SECOND = 5000
- MAX_CREATES_PER_MINUTE = 2000
Obviously, something has changed in our system, but we don't understand what, nor how to track down the cause ...
Any help?