43

I am looking for a way to diagnose issues such as swap death, where a process with ballooning memory usage (such as apache) fills up swap and kills the whole machine.

I'm already using Cacti, and I can set up Nagios (though I would rather not) or Munin, but as far as I can tell they can't record individual program usage - just overall status.

I know I can roll a script that appends (>>) to some file every 30s, but I'd like to see if a mature solution already exists.

Again, ideally it would:

  • record processes' memory usage every N seconds
  • record processes' CPU usage every N seconds
  • support charts and history
  • support averages - like mysqld has used 43% CPU in the last day and averaged 400MB memory
  • be free and open source

Process names are not and should not be known in advance - the idea is to just let it monitor and then have a look at the top offenders.
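To make the "averages" and "top offenders" requirements concrete, here is a minimal sketch of the aggregation step only. It assumes some collector has already recorded one `(process_name, cpu_percent, rss_kb)` tuple per process per interval; the function names `summarize` and `top_offenders` are hypothetical and not part of any tool mentioned here. Note that the averages are taken over the samples in which a process appeared, not over wall-clock time.

```python
from collections import defaultdict


def summarize(samples):
    """Average CPU and memory per process name.

    samples: iterable of (process_name, cpu_percent, rss_kb) tuples,
    one per process per sampling interval (hypothetical record format).
    """
    totals = defaultdict(lambda: [0.0, 0, 0])  # cpu sum, rss sum, sample count
    for name, cpu, rss_kb in samples:
        entry = totals[name]
        entry[0] += cpu
        entry[1] += rss_kb
        entry[2] += 1
    return {
        name: {"avg_cpu": cpu / n, "avg_rss_kb": rss / n, "samples": n}
        for name, (cpu, rss, n) in totals.items()
    }


def top_offenders(summary, key="avg_rss_kb", limit=5):
    """Return the `limit` process names with the highest value for `key`."""
    ranked = sorted(summary.items(), key=lambda kv: kv[1][key], reverse=True)
    return ranked[:limit]
```

No process names are configured anywhere: whatever shows up in the samples gets ranked.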

My system is Linux (OpenSUSE).

peterh
Artem Russakovskii
  • Do you want to monitor *any* process which may have a memory leak (The top N memory hogs) or are you looking to monitor a defined set of processes (e.g. Apache webserver and a Tomcat process)? The latter is doable with some simple Nagios or Cacti plugins. The former is more difficult. You should clarify this. – Stefan Lasiewski Jul 29 '10 at 04:01
  • I already clarified it in the post but to clarify again: I want to know the state of the system when it goes down due to swap death. I want to know who the worst offenders are. And btw, it doesn't have to be a memory leak - just an influx of traffic, or whatever causes high memory usage. So, again, no advance knowledge of binary names should be configured. – Artem Russakovskii Jul 30 '10 at 03:25
  • possible duplicate - http://serverfault.com/questions/67234/storing-calculating-historical-load-averages – warren Aug 03 '10 at 17:20
  • Warren, that's an entirely different question. – Artem Russakovskii Aug 03 '10 at 23:28
  • 14
    Closing such a good-quality post was a bad thing, especially retroactively, 4 years later. – peterh Apr 26 '15 at 23:02

11 Answers

21

If you want just the top offenders, consider running top in batch mode with a relatively long interval (60 seconds or more). You may need more than one top running to capture the top offenders on multiple resources. I have configured systems to run top for a few cycles whenever a resource was being overused.
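A minimal sketch of the batch-mode idea, assuming the procps version of top, where `-b` selects batch mode, `-d` sets the delay in seconds and `-n` limits the number of cycles. The function names and the log path argument are placeholders of mine, not anything top provides.

```python
import subprocess
from datetime import datetime


def top_batch_command(interval_s=60, cycles=5):
    """Build the argv for a batch-mode top run (procps flags assumed)."""
    return ["top", "-b", "-d", str(interval_s), "-n", str(cycles)]


def log_top(path, interval_s=60, cycles=5):
    """Append a few cycles of top output to a log file (hypothetical path)."""
    with open(path, "a") as log:
        log.write(f"=== top started {datetime.now().isoformat()} ===\n")
        log.flush()  # make sure the header lands before top's own output
        subprocess.run(top_batch_command(interval_s, cycles), stdout=log, check=True)
```

Calling `log_top("/var/log/top-offenders.log")` from cron or from an alert handler would leave a record to read after the machine comes back.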

Consider running sar in batch mode to capture resource utilization. I realize this is server-based (not per-process), but it is useful for determining the times when problems are occurring.

Run munin and enable notifications. This may give you a chance to get in and watch the server going down. You may be able to correct the problem before it goes down.

For memory leaks, a steady increase in swap usage indicates a problem. I once watched a server slowly die over a period of days. The problem service was a program monitoring other processes for memory leaks. The system admin kept insisting the increasing swap usage was not a problem, right up until the server stopped responding.
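The "steady increase in swap usage" check can be sketched roughly as follows. It assumes the usual Linux `/proc/meminfo` layout with `SwapTotal` and `SwapFree` lines in kB; the function names and the `min_growth_kb` threshold are hypothetical choices of mine.

```python
def swap_used_kb(meminfo_text):
    """Compute used swap in kB from the text of /proc/meminfo."""
    fields = {}
    for line in meminfo_text.splitlines():
        key, _, rest = line.partition(":")
        parts = rest.split()
        if parts:
            fields[key.strip()] = int(parts[0])
    return fields["SwapTotal"] - fields["SwapFree"]


def steadily_increasing(samples, min_growth_kb=0):
    """True if swap usage never drops and grows by more than min_growth_kb overall."""
    if len(samples) < 2:
        return False
    never_drops = all(b >= a for a, b in zip(samples, samples[1:]))
    return never_drops and samples[-1] - samples[0] > min_growth_kb


def read_swap_used_kb(path="/proc/meminfo"):
    """Take one sample from the running system."""
    with open(path) as f:
        return swap_used_kb(f.read())
```

Sampling `read_swap_used_kb()` every few minutes and feeding the last day of values to `steadily_increasing` would have flagged the slow death described above long before the server stopped responding.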

You may find that cfengine's anomaly detection can be used to trigger a script to capture the system state when things go wrong. You may want a lot of information besides just the processes using the most resources. For a sudden influx of usage you may want a list of network connections (by address not name). Memory usage is also useful.
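Whatever ends up triggering it, the capture script itself can be small. This is a rough sketch under my own assumptions: the command list (`ps` sorted by RSS, `free -m`, `netstat -tn` for numeric addresses) is just one plausible selection, and `capture_state` and `run_command` are hypothetical names, not part of cfengine or any other tool.

```python
import subprocess
from datetime import datetime

# Assumed command selection; adjust to what is installed on the box.
COMMANDS = {
    "processes": ["ps", "-eo", "pid,comm,rss,pcpu", "--sort=-rss"],
    "memory": ["free", "-m"],
    "connections": ["netstat", "-tn"],  # -n: addresses, not names
}


def run_command(argv):
    """Run one command and return its stdout, or a note if it is missing."""
    try:
        result = subprocess.run(argv, capture_output=True, text=True)
    except FileNotFoundError:
        return f"(command not found: {argv[0]})"
    return result.stdout


def capture_state(commands=COMMANDS, runner=run_command, now=None):
    """Return one text report with a section per command."""
    stamp = (now or datetime.now()).isoformat(timespec="seconds")
    sections = [f"=== snapshot {stamp} ==="]
    for label, argv in commands.items():
        sections.append(f"--- {label}: {' '.join(argv)} ---")
        sections.append(runner(argv))
    return "\n".join(sections)
```

Appending the result of `capture_state()` to a file on each trigger gives you the process list, memory and connections from the moment things went wrong.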

BillThor
15

sysstat is made pretty much exactly for your kind of purpose.

Peter Eisentraut
  • This is where you should start. You can't know where to start an examination until you know where you might have the best chances. Sysstat is what you are looking for (also has pretty graphs). Once you know more use systemtap. – Allen Aug 03 '10 at 16:20
9

I've used atop before:

http://freshmeat.net/projects/atop/

"Atop is an ASCII full-screen performance monitor that is capable of reporting the activity of all processes (even if processes have finished during the interval), daily logging of system and process activity for long-term analysis, highlighting overloaded system resources by using colors, etc. At regular intervals, it shows system-level activity related to the CPU, memory, swap, disks, and network layers, and for every active process it shows the CPU utilization, the memory growth, priority, username, state, and exit code."

NinjaCat
  • atop doesn't seem to have a report that would provide me with what I wanted. Please correct me if I'm wrong. – Artem Russakovskii Jul 27 '10 at 09:45
  • It takes care of your first two bullet points (memory/cpu by process). You can use the library to gather these stats and then do your history / graphing based on the data. – NinjaCat Jul 28 '10 at 14:25
  • 4
    @artem-russakovskii - By default atop logs data to a file every ten minutes. If your server crashed at 3:45 you could start atop with `atop -r log_filename`, press `m` to switch to the per-process memory usage view, and then press `t` to move forward in 10 minute increments until 3:40. You can read more about the basics of using atop at https://lwn.net/Articles/387202/ and see an example of identifying a memory leak at http://www.atoptool.nl/download/case_leakage.pdf – sciurus Mar 01 '11 at 19:45
7

Have you tried collectd?
It's very powerful and customizable.
Has a lot of plugins and could be integrated with nagios.

http://collectd.org/features.shtml

PiL
  • Collectd is very lightweight, not too difficult to set up, and will let you see memory/swap growth over time. It will not pinpoint the offending processes, though -- but maybe you'll be able to notice and catch the memory growth in time and inspect the situation manually with `top`. – Marius Gedminas Jul 30 '10 at 11:34
  • 1
    I have to say that I haven't tried that plugin, but reading from the manual of the process plugin of collectd: "If processes are selected the following information is gathered. All this information is aggregated by the process name. Its Resident Segment Size, Used user- and system-time, The number of processes by that name, The number of threads (summed up over all the processes), The number of major and minor page faults. Rough I/O-numbers (bytes written and read due to syscalls by the process)." – PiL Jul 30 '10 at 12:02
  • You can select the processes either by name or by regex. – PiL Jul 30 '10 at 12:03
3

Server Density does exactly what you describe.

I use it on one of our production servers and am very happy with it. Its top feature is the ability to view charts, click on a peak and see the server's CPU/memory consumption at that point in time, including all running processes. They call it snapshots.

It's constantly improving. One of the latest features is anomaly detection. You can also set up various thresholds.

Aron Rotteveel
  • 4
    Ah, I forgot to mention the little part where I'd prefer it to be free, and open source, if possible. Over $100 per server is not really what I'm looking to spend (and I only have 1 server, not 5). http://www.serverdensity.com/pricing/ – Artem Russakovskii Jul 30 '10 at 03:28
3

nmon is a great tool that does what you're looking for. It was developed for AIX and Linux, and it produces a ton of detailed output that is easy to put into reports. If you google it, there is an IBM wiki that has a bunch of documentation and additional utilities for parsing the data.

mattcaffeine
2

Centreon on top of Nagios, with Nagios coupled with NRPE. You can write custom scripts that report data to NRPE in ANY format you wish. Nagios then polls the data from remote servers via NRPE, and Centreon makes a pretty graph and adds a ton of user flexibility. We use it over at http://beyondhosting.net. I have a VZ container template with Centreon + Nagios already set up if you want it.

Graphs Centreon builds: hostthenpost.org/tyler/2010-07-23_1719.png

VisBits
  • I'd like a ready solution for reporting the things I mentioned, most importantly processes consuming the most memory. I'm also not sure what VZ is. – Artem Russakovskii Jul 27 '10 at 09:47
2

Maybe the good old OProfile does what you need? It's a kernel-based system-level profiler with only a small overhead (a couple of percent).

Then there's an excellent Perl script, PSMon, which allows you to set up all kinds of CPU/memory limits. If those are exceeded, psmon will log an error and/or kill the offending process.

The latter would not produce any profiling reports for you, but if it decides to kill the same process over and over again, you have probably found the nasty bastard you were looking for. :-)

Janne Pikkarainen
2

http://studyhat.blogspot.com/2010/08/user-activity-view-processes-display.html

Have a look at the link above: it has a small piece of code that gives you output for memory, CPU, etc.

Rajat
2

The answers suggested when I asked a similar question:

Icapan said:

Munin is the easiest way to get uptime graphs with minimum effort in installing and configuring. I also use atop for aggregate CPU usage by some process, but that is not what you asked for.

David Spillet said:

I use collectd to record system load amongst a number of other parameters. It stores the data in RRD stores that can be graphed and otherwise analysed using the many available tools and scripts. I use a modified version of this script for my graphing (sample output).

Collectd has plugins for monitoring lots of stuff (everything commonly asked for and a few things on top), and creating your own shouldn't be difficult if you need something specialised, so it makes for a very flexible tool. Configuring the graphs in rrd.cgi is a very manual process, though not a difficult one, and you might well find a more convenient tool for working with the RRD files maintained by collectd.

You might also check Nagios or OpenNMS, too.

warren
1

Munin will do all of what you need out of the box without requiring Nagios or any other tool. There are RPMs available for OpenSUSE.

gareth_bowles
  • Does it do it with a plugin? If so, which one? I haven't been able to find one that doesn't require a pre-configured list of processes to monitor. – Artem Russakovskii Jul 26 '10 at 16:31
  • It wasn't clear from your original question that you don't want to monitor a preconfigured list of processes - could you provide more detail on your requirements ? – gareth_bowles Jul 27 '10 at 02:08
  • Clarification: Process names are not and should not be known in advance - the idea is to just let it monitor and then have a look at the top offenders. – Artem Russakovskii Jul 27 '10 at 09:49