
I'm a bit new to graphite, so bear with me on this. I'm looking into alternatives for a large and fairly unwieldy cacti installation, so I've been playing with graphite. We pull a lot of data via SNMP, so I've also downloaded, compiled and installed collectd to pipe SNMP data into graphite.

I've set up a simple query within collectd to just grab the current eth0 in/out counters. I want to capture at one-minute resolution for a week, and at five-minute resolution thereafter, so my storage-schemas.conf looks like this:

[carbon]
 pattern = ^carbon\.
 retentions = 60:90d

[default]
 pattern = .*
 retentions = 60s:1w, 5m:1y
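
For what it's worth, my mental model of how whisper rolls the 60-second archive up into the 5-minute one (the whisper-dump header further down confirms the average aggregation and an xFilesFactor of 0.5) is roughly the sketch below; the function is mine, not whisper's:

# Sketch of whisper downsampling as I understand it: each 5-minute point is
# the average of the five 1-minute points covering it, and is only written
# if at least xFilesFactor (0.5) of those points are non-null.
def downsample(points, factor=5, xff=0.5):
    out = []
    for i in range(0, len(points), factor):
        chunk = points[i:i + factor]
        known = [p for p in chunk if p is not None]
        if len(known) >= xff * len(chunk):
            out.append(sum(known) / float(len(known)))
        else:
            out.append(None)
    return out

# e.g. five 1-minute values -> one 5-minute average
print(downsample([100, 200, None, 400, 300]))     # [250.0]
print(downsample([100, None, None, None, None]))  # [None]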

Similarly, in collectd.conf I have set the following:

<Plugin snmp>
   <Data "std_traffic">
       Type "if_octets"
       Table true
       Instance "IF-MIB::ifDescr"
       Values "IF-MIB::ifInOctets" "IF-MIB::ifOutOctets"
   </Data>

   <Host "lonsbrndlb01">
       Address "lonsbrndlb01"
       Version 2
       Community "public"
       Collect "std_traffic"
       Interval 60
   </Host>
</Plugin>

This almost works perfectly. The keys appear in graphite, and data comes in.

The only problem is that the data is a counter, not a per-minute rate. I can get around this in graphite by using the derivative function, which turns the counter into per-point differences (effectively per-minute values here, given 60-second points). However, doing this, I see this graph:

From this it's fairly evident that the data's only arriving every 5 minutes, not every 60 seconds as I specified. Why is this? I thought I'd set the right values in both collectd and graphite, so I must be missing something somewhere.
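
One thing I can do to rule collectd out is poll the same OIDs directly with net-snmp and difference them by hand; if the raw counters show the same shape, collectd isn't mangling anything. A rough sketch, assuming net-snmp's snmpget is on the path and that eth0 is ifIndex 2 (an assumption that needs checking against IF-MIB::ifDescr first):

#!/usr/bin/env python
# Rough sanity check: poll ifInOctets/ifOutOctets for one interface every
# 60 seconds via net-snmp's snmpget and print the per-minute byte deltas.
# Assumes eth0 is ifIndex 2 -- confirm with a walk of IF-MIB::ifDescr first.
import subprocess
import time

HOST, COMMUNITY, IFINDEX = "lonsbrndlb01", "public", 2

def get_octets(direction):
    oid = "IF-MIB::if%sOctets.%d" % (direction, IFINDEX)
    out = subprocess.check_output(
        ["snmpget", "-v2c", "-c", COMMUNITY, "-Oqv", HOST, oid])
    return int(out.strip())

prev = None
while True:
    now = (get_octets("In"), get_octets("Out"))
    if prev is not None:
        # Note: no rollover handling here; this is just a quick shape check.
        print("rx bytes/min: %d  tx bytes/min: %d" %
              (now[0] - prev[0], now[1] - prev[1]))
    prev = now
    time.sleep(60)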

Edit

Some more data on this, as it might be useful.

The formulas I'm using are just derivative(lonsbrndlb01.snmp.if_octets-eth0.tx) and derivative(lonsbrndlb01.snmp.if_octets-eth0.rx), although I've now switched to using nonNegativeDerivative because of counter rollovers. I've also updated the image below to give a sense of scale.
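
In case it's useful to anyone else, here's a toy illustration of why the switch matters: ifInOctets is a 32-bit counter, so a wrap makes a plain difference go hugely negative, and derivative() plots exactly that. These are my own simplified versions, not the real implementations in graphite-web's functions.py:

# Toy illustration: derivative() just differences consecutive points, so a
# 32-bit counter wrap shows up as a huge negative spike; the nonNegative
# version drops that point instead.
def derivative(points):
    return [None if prev is None or cur is None else cur - prev
            for prev, cur in zip([None] + points[:-1], points)]

def non_negative_derivative(points):
    return [None if d is None or d < 0 else d for d in derivative(points)]

# counter ticks up, wraps past 2**32, keeps going
samples = [4294967000, 4294967200, 100, 300]
print(derivative(samples))               # [None, 200, -4294967100, 200]
print(non_negative_derivative(samples))  # [None, 200, None, 200]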

Running whisper-dump.py on the rx.wsp file gives a header of:

Meta data:
  aggregation method: average
  max retention: 31536000
  xFilesFactor: 0.5

Archive 0 info:
  offset: 40
  seconds per point: 60
  points: 10080
  retention: 604800
  size: 120960

Archive 1 info:
  offset: 121000
  seconds per point: 300
  points: 105120
  retention: 31536000
  size: 1261440

followed by about 2.4M of data.
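
Those header numbers check out against my understanding of the whisper on-disk layout (16 bytes of file metadata plus 12 bytes of archive info per archive, then 12 bytes per point: a 4-byte timestamp and an 8-byte double), for what that's worth:

# Cross-checking the whisper-dump header against the on-disk layout.
POINT_SIZE, META_SIZE, ARCHIVE_INFO_SIZE = 12, 16, 12

archives = [(60, 10080), (300, 105120)]  # (seconds per point, points)

offset = META_SIZE + ARCHIVE_INFO_SIZE * len(archives)  # 40, as dumped
for secs, points in archives:
    size = points * POINT_SIZE
    print("offset=%d  seconds_per_point=%d  points=%d  retention=%d  size=%d"
          % (offset, secs, points, secs * points, size))
    offset += size
# prints offsets 40 and 121000 and sizes 120960 and 1261440,
# matching the archive info above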

Data from the graph by appending &format=json is:

[{"target": "nonNegativeDerivative(lonsbrndlb01.snmp.if_octets-eth0.rx)", "datapoints": [[null, 1342597800], [26346975.0, 1342597860], [35197821.0, 1342597920], [138121.0, 1342597980], [108605.0, 1342598040], [690712.0, 1342598100], [27213713.0, 1342598160], [876898.0, 1342598220], [463897.0, 1342598280], [137499.0, 1342598340], [96980.0, 1342598400], [26237641.0, 1342598460], [35094898.0, 1342598520], [112569.0, 1342598580], [274897.0, 1342598640], [139174.0, 1342598700], [806881.0, 1342598760], [26206311.0, 1342598820], [112298.0, 1342598880], [781205.0, 1342598940], [606872.0, 1342599000], [5184462.0, 1342599060], [61946135.0, 1342599120], [4126005.0, 1342599180], [115908.0, 1342599240], [714159.0, 1342599300], [195738.0, 1342599360], [26261781.0, 1342599420], [100503.0, 1342599480], [751322.0, 1342599540], [930865.0, 1342599600], [230666.0, 1342599660], [59636.0, 1342599720], [62575579.0, 1342599780], [104950.0, 1342599840], [1208886.0, 1342599900], [379369.0, 1342599960], [785827.0, 1342600020], [26215475.0, 1342600080], [221604.0, 1342600140], [351866.0, 1342600200], [231163.0, 1342600260], [211398.0, 1342600320], [70770807.0, 1342600380], [429324.0, 1342600440], [1937893.0, 1342600500], [1476961.0, 1342600560], [72383.0, 1342600620], [371513.0, 1342600680], [29186024.0, 1342600740], [1924055.0, 1342600800], [280068.0, 1342600860], [341216.0, 1342600920], [36643885.0, 1342600980], [26708952.0, 1342601040], [259828.0, 1342601100], [488406.0, 1342601160], [230698.0, 1342601220], [766407.0, 1342601280], [28252848.0, 1342601340]]}, {"target": "nonNegativeDerivative(lonsbrndlb01.snmp.if_octets-eth0.tx)", "datapoints": [[null, 1342597800], [26007032.0, 1342597860], [34808859.0, 1342597920], [100498.0, 1342597980], [91818.0, 1342598040], [649666.0, 1342598100], [26566941.0, 1342598160], [895897.0, 1342598220], [478867.0, 1342598280], [100242.0, 1342598340], [81130.0, 1342598400], [25908859.0, 1342598460], [34659481.0, 1342598520], [75295.0, 1342598580], [285061.0, 1342598640], [103644.0, 1342598700], [824177.0, 1342598760], [25884962.0, 1342598820], [93420.0, 1342598880], [799160.0, 1342598940], [582373.0, 1342599000], [5024696.0, 1342599060], [61269813.0, 1342599120], [3336907.0, 1342599180], [436657.0, 1342599240], [696692.0, 1342599300], [182144.0, 1342599360], [25947578.0, 1342599420], [79011.0, 1342599480], [733857.0, 1342599540], [1015395.0, 1342599600], [184960.0, 1342599660], [48026.0, 1342599720], [61462810.0, 1342599780], [89187.0, 1342599840], [1195360.0, 1342599900], [386772.0, 1342599960], [744445.0, 1342600020], [25913548.0, 1342600080], [201978.0, 1342600140], [344650.0, 1342600200], [199421.0, 1342600260], [208959.0, 1342600320], [69924581.0, 1342600380], [381593.0, 1342600440], [1610764.0, 1342600500], [1484192.0, 1342600560], [41585.0, 1342600620], [373375.0, 1342600680], [28478208.0, 1342600740], [1893711.0, 1342600800], [253921.0, 1342600860], [354558.0, 1342600920], [36199040.0, 1342600980], [26395675.0, 1342601040], [239238.0, 1342601100], [477775.0, 1342601160], [212554.0, 1342601220], [752374.0, 1342601280], [27890202.0, 1342601340]]}]

It may be peaky data, but there's no way this box is peaking at 60MBit traffic every few minutes.
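
The JSON at least lets me check whether points really are only arriving every 5 minutes: the timestamp spacing and the null count tell the story. A quick sketch, assuming graphite-web is reachable on localhost (adjust host/port to taste); the render URL is much like the one that produced the data above:

# Quick check on the render JSON: how far apart are the timestamps,
# and how many points are null? Expecting gaps=[60] if points really
# land every minute.
import json
try:
    from urllib.request import urlopen  # Python 3
except ImportError:
    from urllib2 import urlopen         # Python 2

RENDER_URL = ("http://localhost/render?format=json&from=-1h"
              "&target=nonNegativeDerivative(lonsbrndlb01.snmp.if_octets-eth0.rx)")

for s in json.loads(urlopen(RENDER_URL).read().decode("utf-8")):
    ts = [t for _, t in s["datapoints"]]
    gaps = sorted(set(b - a for a, b in zip(ts, ts[1:])))
    nulls = sum(1 for v, _ in s["datapoints"] if v is None)
    print("%s: gaps=%s, %d/%d nulls" % (s["target"], gaps, nulls, len(ts)))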

growse
  • I'm not sure about collectd; it's possible it's doing some summing or averaging before sending. One thing to remember, though, is that it's not peaking at 60 MB/sec: those counters are the number of bytes per minute. So it's 60 MB/minute, or 1 MB/sec, which is still a lot. Have you tried pulling the data via SNMP with something other than collectd? – GardenMWM Jul 18 '12 at 21:33

1 Answer


If you use the whisper-dump.py command on the appropriate whisper file, what does it show? From the graph it looks like it's not exactly every 5 minutes; is it at all possible that you're just getting spiky network traffic? Also, for counters it's always a good idea to use nonNegativeDerivative instead of derivative, since the nonNegative version accounts for rollover.

GardenMWM
  • I just switched from derivative to nonNegativeDerivative, as I saw a counter roll over this morning. I'll add details about the data and the formula to the original post. I'll also add dump information both from the web interface and from `whisper-dump.py`. – growse Jul 18 '12 at 08:42
  • The penny dropped for me at 3am this morning. I was being thrown by a combination of things: 1) it *is* peaky traffic, that's just what's going on, and 2) nonNegativeDerivative is *per minute* here, as you say. Applying a scaling factor has given me a much more sensible graph (see the sketch below). – growse Jul 19 '12 at 08:43
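
For completeness, the scaling that makes the graph sensible is just bytes-per-minute to bits-per-second, i.e. multiply by 8/60. A sketch of the conversion, assuming graphite's scale() function (which multiplies a series by a constant factor); the peak value is taken from the JSON above:

# Converting the per-minute byte deltas that nonNegativeDerivative()
# produces (with 60-second points) into bits per second: multiply by 8/60.
BYTES_PER_MIN_TO_BITS_PER_SEC = 8.0 / 60.0

# e.g. the biggest rx delta in the JSON above:
peak = 70770807  # bytes in one 60-second interval
print("%.1f Mbit/s" % (peak * BYTES_PER_MIN_TO_BITS_PER_SEC / 1e6))  # ~9.4 Mbit/s

# The equivalent render target, assuming scale() is available:
target = ("scale(nonNegativeDerivative("
          "lonsbrndlb01.snmp.if_octets-eth0.rx), %f)"
          % BYTES_PER_MIN_TO_BITS_PER_SEC)
print(target)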