
We have a 4-core CPU production system which runs a lot of cron jobs, has a constantly filled process queue and a usual load of ~1.5.

During night time we do some IO-intensive stuff with Postgres. We generate a graph showing the load/memory usage (rrd-updates.sh). This "fails" sometimes in high-IO-load situations. It happens nearly every night, but not in every high-IO situation.

My "normal" solution would be to nice and ionice the postgres stuff and increase the prio of the graph generation. However this still fails. The graph generation is semi-thread-proof with flock. I do log the execution times and for the graph generation it is up to 5 min during high IO load, seemingly resulting in a missing graph for up to 4 min.
The timeframe is exactly matching with the postgres activity (this sometimes happens during the day aswell, though not that often) Ionicing up to realtime prio (C1 N6 graph_cron vs C2 N3 postgres), nicing way above the postgres (-5 graph_cron vs 10 postgres) did not solve the issue.

Assuming the data is not collected, the additional issue is that ionice/nice somehow still do not work.
Even with 90% IO wait and a load of 100, I was still able to run the data collection command `free` with no more than maybe a 5 second delay (in testing, at least).

Sadly, I have not been able to reproduce this exactly in testing (having only a virtualized dev system).

Versions:

Kernel 2.6.32-5-686-bigmem
Debian Squeeze
rrdtool 1.4.3
Hardware: SAS 15K RPM HDD with LVM in hardware RAID1
Mount options: ext3 with rw,errors=remount-ro
Scheduler: CFQ
crontab:

* * * * *               root    flock -n /var/lock/rrd-updates.sh nice -n-1 ionice -c1 -n7 /opt/bin/rrd-updates.sh

There seems to be a possibly related bug report from Mr Oetiker on GitHub for rrdcached:
https://github.com/oetiker/rrdtool-1.x/issues/326

This could actually be my issue (concurrent writes), but it does not explain why the cron job does not fail. Assuming I actually had two concurrent writes, flock -n would return exit code 1 (per the man page, confirmed in testing). As I do not get an email with the output either, and the cron job does actually run fine all the other times, I am somewhat lost.
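One way to narrow that down could be a small wrapper that records flock's exit status, so a skipped run (lock already held) can be told apart from a run that started but produced nothing. A minimal sketch; the wrapper name and log path are placeholders, not part of the existing setup:

#!/bin/bash
# Hypothetical wrapper (e.g. /opt/bin/rrd-updates-wrapper.sh) called from cron
# instead of the flock one-liner; the log path is only an example.
LOG=/var/log/rrd-updates-cron.log

flock -n /var/lock/rrd-updates.sh \
    nice -n -1 ionice -c1 -n7 /opt/bin/rrd-updates.sh >>"$LOG" 2>&1
rc=$?

# flock -n exits with 1 if the lock is already held; otherwise it passes through
# the exit status of rrd-updates.sh, so the log separates "skipped because the
# previous run still holds the lock" from "the script itself failed".
if [ "$rc" -ne 0 ]; then
    echo "$(date '+%F %T') rrd-updates run exited with status $rc" >>"$LOG"
fi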

Example output: cpu load graph with missing lines

Based on the comments, I added the relevant part of the update script:

rrdtool update /var/rrd/cpu.rrd $(vmstat 5 2 | tail -n 1 | awk '{print "N:"$14":"$13}')
rrdtool update /var/rrd/mem.rrd $(free | grep Mem: | awk '{print "N:"$2":"$3":"$4}')
rrdtool update /var/rrd/mem_bfcach.rrd $(free | grep buffers/cache: | awk '{print "N:"$3+$4":"$3":"$4}')
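For reference, a rough sketch of how each collection step could be timed and checked before the update, to see which command stalls or returns nothing under high IO load (the log path is only an example, not part of the current script):

#!/bin/bash
# Sketch of instrumenting the existing collectors; /var/log/rrd-collect.log
# is just an example path.
LOG=/var/log/rrd-collect.log

update_logged() {
    # $1 = rrd file, $2 = collected value string (may be empty if the pipe failed)
    local rrd="$1" value="$2"
    echo "$(date '+%F %T') $rrd value='$value'" >>"$LOG"
    [ -n "$value" ] && rrdtool update "$rrd" "$value"   # skip empty results
}

t0=$(date +%s)
cpu=$(vmstat 5 2 | tail -n 1 | awk '{print "N:"$14":"$13}')
echo "$(date '+%F %T') vmstat took $(( $(date +%s) - t0 ))s" >>"$LOG"
update_logged /var/rrd/cpu.rrd "$cpu"

t0=$(date +%s)
mem=$(free | grep Mem: | awk '{print "N:"$2":"$3":"$4}')
echo "$(date '+%F %T') free took $(( $(date +%s) - t0 ))s" >>"$LOG"
update_logged /var/rrd/mem.rrd "$mem"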

What am I missing, or where can I check further?

Remember: this is a production system, so no dev tools, stack traces or similar are available or installable.

Dennis Nolte
  • Way back when, MRTG was replaced by RRDgraph. One of the marvelous changes from the old to the new was that RRDgraph actually generates the images only when there is a view request. The old MRTG generated whole new graphs for every data point every five minutes. Your problem is with data collection, not the graph render. – ericx Aug 12 '14 at 13:36
  • @ericx thank you for your comment. I added the source for the data generation. Do you still think the issue is the vmstat command instead of IOnice/nice somehow not working correctly? If so, why do you think that way? – Dennis Nolte Aug 12 '14 at 14:10
  • Does your `cron` capture STDERR anywhere? On FreeBSD I usually run these under `periodic every5` and I have a `/var/log/periodic.every5` that generally captures any errors. I would also stagger the three scripts and possibly rotate the order to see whether one in particular hangs. Most of my RRDTool experience was with `cricket` which had its own logging. The `cricket` logs were excellent for finding trouble. Are you really collecting every minute? (* * * * * instead of */5 * * * *) What's the granularity of the graph? RRD defaults to 5 minute intervals. – ericx Aug 12 '14 at 14:31
  • This is the command which was used to initially create them: create cpu.rrd --step 300 DS:sys:GAUGE:70:U:U DS:user:GAUGE:70:U:U RRA:AVERAGE:0.01:1:6351 So this means you just found another bug, thank you. I did redirect STDOUT and STDERR for that script for testing; nothing that helped me got logged back when I tried the first time. I will add the output tomorrow. – Dennis Nolte Aug 12 '14 at 15:05
  • Yah, ` --step 300 ` Shoving data into this thing more frequently than every 300 seconds is pointless. – ericx Aug 12 '14 at 15:34
  • The logfile is still empty; the dump, however, showed that some info is missing, so as you said it is actually the data generation which fails. – Dennis Nolte Aug 13 '14 at 08:16
  • why don't you just simply use sysstat? any reason? – sio4 Sep 12 '14 at 15:26
  • Basically for two reasons: as it is a prod system, I need a lengthy release cycle to install stuff which is not "critical" for operation. 2nd: previously we had it installed and running, and at the times we had the high load it produced OOMs, so we stopped using it for now. – Dennis Nolte Sep 15 '14 at 07:13
  • In terms of understanding the "failure", rrdtool's display is based on a 5 minute polling cycle. If you don't finish processing one cycle before the next is due to start, and if your data collection and graph production are part of the same processing operation, then you will get a missing data point. – mc0e Oct 18 '14 at 07:38

1 Answer


I guess it's not rrdtool that cannot update the graph, but rather that the data cannot be measured at that point. By the way, your method of measuring CPU and memory stats is just wrong, because it gives you an instantaneous result. CPU and memory load can change drastically over the 60-second interval, but you take only one value. You should really consider taking SNMP data, which gives an average over an interval. Plus, the whole pipe seems to be more expensive and slower than an snmpget call. That could be the main reason for the gaps.
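A minimal sketch of what that could look like for the CPU graph, assuming net-snmp is installed and snmpd answers locally; the OIDs are the UCD-SNMP-MIB ssCpuSystem/ssCpuUser values and the community string is just a placeholder:

#!/bin/bash
# Rough sketch of SNMP-based collection; assumes a local snmpd exposing the
# UCD-SNMP-MIB. Community string and OIDs may need adjusting.
HOST=localhost
COMMUNITY=public

# ssCpuSystem / ssCpuUser: CPU percentages averaged by snmpd, rather than a
# one-off vmstat sample taken at whatever moment cron happens to fire.
sys=$(snmpget -v2c -c "$COMMUNITY" -Oqv "$HOST" .1.3.6.1.4.1.2021.11.10.0)
usr=$(snmpget -v2c -c "$COMMUNITY" -Oqv "$HOST" .1.3.6.1.4.1.2021.11.9.0)

rrdtool update /var/rrd/cpu.rrd "N:${sys}:${usr}"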

drookie
  • Just as a follow-up: it was this. Once we were able to move some resource-hungry processes to another server, the graphs got generated fine. – Dennis Nolte Sep 21 '15 at 10:50