
I'm a bit new to graphite, so bear with me on this. I'm looking into alternatives for a large and fairly unwieldy cacti installation, so I've been playing with graphite. We pull a lot of data via SNMP, so I've also downloaded, compiled and installed collectd to pipe SNMP data into graphite.

I've set up a simple query within collectd to just grab the current eth0 in/out counters. I want to capture at one-minute resolution for a week, and at five-minute resolution thereafter, so my storage-schemas.conf looks like this:

[carbon]
 pattern = ^carbon\.
 retentions = 60:90d

[default]
 pattern = .*
 retentions = 60s:1w, 5m:1y
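
For what it's worth, my mental model of how whisper rolls the 60-second archive up into the 5-minute one (the whisper-dump header further down confirms the average aggregation and an xFilesFactor of 0.5) is roughly the sketch below; the function is mine, not whisper's:

# Sketch of whisper downsampling as I understand it: each 5-minute point is
# the average of the five 1-minute points covering it, and is only written
# if at least xFilesFactor (0.5) of those points are non-null.
def downsample(points, factor=5, xff=0.5):
    out = []
    for i in range(0, len(points), factor):
        chunk = points[i:i + factor]
        known = [p for p in chunk if p is not None]
        if len(known) >= xff * len(chunk):
            out.append(sum(known) / float(len(known)))
        else:
            out.append(None)
    return out

# e.g. five 1-minute values -> one 5-minute average
print(downsample([100, 200, None, 400, 300]))     # [250.0]
print(downsample([100, None, None, None, None]))  # [None]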

Similarly, in collectd.conf I have set the following:

<Plugin snmp>
   <Data "std_traffic">
       Type "if_octets"
       Table true
       Instance "IF-MIB::ifDescr"
       Values "IF-MIB::ifInOctets" "IF-MIB::ifOutOctets"
   </Data>

   <Host "lonsbrndlb01">
       Address "lonsbrndlb01"
       Version 2
       Community "public"
       Collect "std_traffic"
       Interval 60
   </Host>
</Plugin>

This almost works perfectly. The keys appear in graphite, and data comes in.

The only problem is that the data is a counter, not a per-minute rate. I can get around this in graphite by using the derivative function, which turns the counter into per-point differences (effectively per-minute values here, given 60-second points). However, doing this, I see this graph:

From this it's fairly evident that the data's only arriving every 5 minutes, not every 60 seconds as I specified. Why is this? I thought I'd set the right values in both collectd and graphite, so I must be missing something somewhere.
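
One thing I can do to rule collectd out is poll the same OIDs directly with net-snmp and difference them by hand; if the raw counters show the same shape, collectd isn't mangling anything. A rough sketch, assuming net-snmp's snmpget is on the path and that eth0 is ifIndex 2 (an assumption that needs checking against IF-MIB::ifDescr first):

#!/usr/bin/env python
# Rough sanity check: poll ifInOctets/ifOutOctets for one interface every
# 60 seconds via net-snmp's snmpget and print the per-minute byte deltas.
# Assumes eth0 is ifIndex 2 -- confirm with a walk of IF-MIB::ifDescr first.
import subprocess
import time

HOST, COMMUNITY, IFINDEX = "lonsbrndlb01", "public", 2

def get_octets(direction):
    oid = "IF-MIB::if%sOctets.%d" % (direction, IFINDEX)
    out = subprocess.check_output(
        ["snmpget", "-v2c", "-c", COMMUNITY, "-Oqv", HOST, oid])
    return int(out.strip())

prev = None
while True:
    now = (get_octets("In"), get_octets("Out"))
    if prev is not None:
        # Note: no rollover handling here; this is just a quick shape check.
        print("rx bytes/min: %d  tx bytes/min: %d" %
              (now[0] - prev[0], now[1] - prev[1]))
    prev = now
    time.sleep(60)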

Edit

Some more data on this, as it might be useful.

The formulas I'm using are just derivative(lonsbrndlb01.snmp.if_octets-eth0.tx) and derivative(lonsbrndlb01.snmp.if_octets-eth0.rx), although I've now switched to using nonNegativeDerivative because of counter rollovers. I've also updated the image below to give a sense of scale.
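
In case it's useful to anyone else, here's a toy illustration of why the switch matters: ifInOctets is a 32-bit counter, so a wrap makes a plain difference go hugely negative, and derivative() plots exactly that. These are my own simplified versions, not the real implementations in graphite-web's functions.py:

# Toy illustration: derivative() just differences consecutive points, so a
# 32-bit counter wrap shows up as a huge negative spike; the nonNegative
# version drops that point instead.
def derivative(points):
    return [None if prev is None or cur is None else cur - prev
            for prev, cur in zip([None] + points[:-1], points)]

def non_negative_derivative(points):
    return [None if d is None or d < 0 else d for d in derivative(points)]

# counter ticks up, wraps past 2**32, keeps going
samples = [4294967000, 4294967200, 100, 300]
print(derivative(samples))               # [None, 200, -4294967100, 200]
print(non_negative_derivative(samples))  # [None, 200, None, 200]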

Running whisper-dump.py on the rx.wsp file gives a header of:

Meta data:
  aggregation method: average
  max retention: 31536000
  xFilesFactor: 0.5

Archive 0 info:
  offset: 40
  seconds per point: 60
  points: 10080
  retention: 604800
  size: 120960

Archive 1 info:
  offset: 121000
  seconds per point: 300
  points: 105120
  retention: 31536000
  size: 1261440

followed by about 2.4M of data.
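
Those header numbers check out against my understanding of the whisper on-disk layout (16 bytes of file metadata plus 12 bytes of archive info per archive, then 12 bytes per point: a 4-byte timestamp and an 8-byte double), for what that's worth:

# Cross-checking the whisper-dump header against the on-disk layout.
POINT_SIZE, META_SIZE, ARCHIVE_INFO_SIZE = 12, 16, 12

archives = [(60, 10080), (300, 105120)]  # (seconds per point, points)

offset = META_SIZE + ARCHIVE_INFO_SIZE * len(archives)  # 40, as dumped
for secs, points in archives:
    size = points * POINT_SIZE
    print("offset=%d  seconds_per_point=%d  points=%d  retention=%d  size=%d"
          % (offset, secs, points, secs * points, size))
    offset += size
# prints offsets 40 and 121000 and sizes 120960 and 1261440,
# matching the archive info above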

Data from the graph by appending &format=json is:

[{"target": "nonNegativeDerivative(lonsbrndlb01.snmp.if_octets-eth0.rx)", "datapoints": [[null, 1342597800], [26346975.0, 1342597860], [35197821.0, 1342597920], [138121.0, 1342597980], [108605.0, 1342598040], [690712.0, 1342598100], [27213713.0, 1342598160], [876898.0, 1342598220], [463897.0, 1342598280], [137499.0, 1342598340], [96980.0, 1342598400], [26237641.0, 1342598460], [35094898.0, 1342598520], [112569.0, 1342598580], [274897.0, 1342598640], [139174.0, 1342598700], [806881.0, 1342598760], [26206311.0, 1342598820], [112298.0, 1342598880], [781205.0, 1342598940], [606872.0, 1342599000], [5184462.0, 1342599060], [61946135.0, 1342599120], [4126005.0, 1342599180], [115908.0, 1342599240], [714159.0, 1342599300], [195738.0, 1342599360], [26261781.0, 1342599420], [100503.0, 1342599480], [751322.0, 1342599540], [930865.0, 1342599600], [230666.0, 1342599660], [59636.0, 1342599720], [62575579.0, 1342599780], [104950.0, 1342599840], [1208886.0, 1342599900], [379369.0, 1342599960], [785827.0, 1342600020], [26215475.0, 1342600080], [221604.0, 1342600140], [351866.0, 1342600200], [231163.0, 1342600260], [211398.0, 1342600320], [70770807.0, 1342600380], [429324.0, 1342600440], [1937893.0, 1342600500], [1476961.0, 1342600560], [72383.0, 1342600620], [371513.0, 1342600680], [29186024.0, 1342600740], [1924055.0, 1342600800], [280068.0, 1342600860], [341216.0, 1342600920], [36643885.0, 1342600980], [26708952.0, 1342601040], [259828.0, 1342601100], [488406.0, 1342601160], [230698.0, 1342601220], [766407.0, 1342601280], [28252848.0, 1342601340]]}, {"target": "nonNegativeDerivative(lonsbrndlb01.snmp.if_octets-eth0.tx)", "datapoints": [[null, 1342597800], [26007032.0, 1342597860], [34808859.0, 1342597920], [100498.0, 1342597980], [91818.0, 1342598040], [649666.0, 1342598100], [26566941.0, 1342598160], [895897.0, 1342598220], [478867.0, 1342598280], [100242.0, 1342598340], [81130.0, 1342598400], [25908859.0, 1342598460], [34659481.0, 1342598520], [75295.0, 1342598580], [285061.0, 1342598640], [103644.0, 1342598700], [824177.0, 1342598760], [25884962.0, 1342598820], [93420.0, 1342598880], [799160.0, 1342598940], [582373.0, 1342599000], [5024696.0, 1342599060], [61269813.0, 1342599120], [3336907.0, 1342599180], [436657.0, 1342599240], [696692.0, 1342599300], [182144.0, 1342599360], [25947578.0, 1342599420], [79011.0, 1342599480], [733857.0, 1342599540], [1015395.0, 1342599600], [184960.0, 1342599660], [48026.0, 1342599720], [61462810.0, 1342599780], [89187.0, 1342599840], [1195360.0, 1342599900], [386772.0, 1342599960], [744445.0, 1342600020], [25913548.0, 1342600080], [201978.0, 1342600140], [344650.0, 1342600200], [199421.0, 1342600260], [208959.0, 1342600320], [69924581.0, 1342600380], [381593.0, 1342600440], [1610764.0, 1342600500], [1484192.0, 1342600560], [41585.0, 1342600620], [373375.0, 1342600680], [28478208.0, 1342600740], [1893711.0, 1342600800], [253921.0, 1342600860], [354558.0, 1342600920], [36199040.0, 1342600980], [26395675.0, 1342601040], [239238.0, 1342601100], [477775.0, 1342601160], [212554.0, 1342601220], [752374.0, 1342601280], [27890202.0, 1342601340]]}]

It may be peaky data, but there's no way this box is peaking at 60MBit traffic every few minutes.
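
The JSON at least lets me check whether points really are only arriving every 5 minutes: the timestamp spacing and the null count tell the story. A quick sketch, assuming graphite-web is reachable on localhost (adjust host/port to taste); the render URL is much like the one that produced the data above:

# Quick check on the render JSON: how far apart are the timestamps,
# and how many points are null? Expecting gaps=[60] if points really
# land every minute.
import json
try:
    from urllib.request import urlopen  # Python 3
except ImportError:
    from urllib2 import urlopen         # Python 2

RENDER_URL = ("http://localhost/render?format=json&from=-1h"
              "&target=nonNegativeDerivative(lonsbrndlb01.snmp.if_octets-eth0.rx)")

for s in json.loads(urlopen(RENDER_URL).read().decode("utf-8")):
    ts = [t for _, t in s["datapoints"]]
    gaps = sorted(set(b - a for a, b in zip(ts, ts[1:])))
    nulls = sum(1 for v, _ in s["datapoints"] if v is None)
    print("%s: gaps=%s, %d/%d nulls" % (s["target"], gaps, nulls, len(ts)))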

growse
  • I'm not sure about collectd; it's possible it's doing some summing or averaging before sending. One thing to remember, though, is that it's not peaking at 60 MB/sec: those counters are the number of bytes per minute. So it's 60 MB/minute, or 1 MB/sec, which is still a lot. Have you tried pulling the data via SNMP with something other than collectd? – GardenMWM Jul 18 '12 at 21:33

1 Answer


If you use the whisper-dump.py command on the appropriate whisper file, what does it show? From the graph it looks like it's not exactly every 5 minutes; is it at all possible that you're just getting spiky network traffic? Also, for counters it's always a good idea to use nonNegativeDerivative instead of derivative, since the nonNegative version accounts for rollover.

GardenMWM
  • I just switched from derivative to nonNegativeDerivative, as I saw a counter roll over this morning. I'll add details about the data and the formula to the original post. I'll also add dump information both from the web interface and from `whisper-dump.py`. – growse Jul 18 '12 at 08:42
  • The penny dropped for me at 3am this morning. I was being thrown by a combination of things: 1) it *is* peaky traffic, that's just what's going on, and 2) nonNegativeDerivative is *per minute* here, as you say. Applying a scaling factor has given me a much more sensible graph (see the sketch below). – growse Jul 19 '12 at 08:43
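
For completeness, the scaling that makes the graph sensible is just bytes-per-minute to bits-per-second, i.e. multiply by 8/60. A sketch of the conversion, assuming graphite's scale() function (which multiplies a series by a constant factor); the peak value is taken from the JSON above:

# Converting the per-minute byte deltas that nonNegativeDerivative()
# produces (with 60-second points) into bits per second: multiply by 8/60.
BYTES_PER_MIN_TO_BITS_PER_SEC = 8.0 / 60.0

# e.g. the biggest rx delta in the JSON above:
peak = 70770807  # bytes in one 60-second interval
print("%.1f Mbit/s" % (peak * BYTES_PER_MIN_TO_BITS_PER_SEC / 1e6))  # ~9.4 Mbit/s

# The equivalent render target, assuming scale() is available:
target = ("scale(nonNegativeDerivative("
          "lonsbrndlb01.snmp.if_octets-eth0.rx), %f)"
          % BYTES_PER_MIN_TO_BITS_PER_SEC)
print(target)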