
I'm getting seemingly random server hangs (specifically OOM-ing) and running out of skill trying to track this down.

I'm using a Debian 5 VPS with Apache/MySQL/PHP. I've got Postfix running, using MySQL as well.

I was SSH'd in when it happened last and used top to see:

  1. load average shot up to 25 and higher
  2. CPU at 49.8% wait, 48.6% idle, so some kind of IO blocking?
  3. 13 apache2 processes, totalling 41.4% of memory
  4. MySQL showing only 2.6% of memory

Memory showed:

    Mem:  524512k total, 518144k used,  6368k free,   800k buffers
    Swap: 262136k total, 261024k used,  1112k free, 22824k cached

I've got Munin installed and it doesn't show (to my inexperienced eyes) anything really pathological happening at the time this happened - even Postfix isn't doing much in terms of queue size.

df tells me I'm only using 58% of my disk so I'm not close to topping out there.

php.ini is set to a 128M memory limit and a 30-second max execution time.
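For reference, these are the relevant php.ini directives as I believe they're set (worth double-checking the live values with phpinfo()):

    ; maximum memory a single PHP script may allocate
    memory_limit = 128M
    ; maximum script execution time, in seconds
    max_execution_time = 30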

I've been dredging through the Apache and MySQL logs, but can't see anything.

Can anyone suggest a next step in terms of what extra monitoring I could put on the server, or further logging?

Best wishes

Peter


2 Answers


It's not strange to see the OOM-killer triggering, since you have run out of memory! You have only 512 MB of memory, which is not that much nowadays, and ALL of your 256 MB of swap space is in use.

My suggestion is that you buy more memory modules and add those to your server.

Which processes are being killed by the OOM-killer - apache2? You should check this in /var/log/messages.
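For example, something along these lines should show recent OOM activity (assuming your syslog setup writes kernel messages to /var/log/messages, as Debian does by default):

    # look for OOM-killer invocations and the processes it killed
    grep -iE 'oom-killer|out of memory|killed process' /var/log/messages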

rems
  • Hi, thanks for the reply. Chucking more memory at it might solve it, but it could be masking something else. It's happening rarely, and at times when all the visible processes 'appear' to be using few resources.... I think the OOM-killer picks processes to kill according to criteria other than whether they are the resource hogs. Thanks again for the reply.. P – Peter Feb 10 '11 at 13:31
  • Do you have graphs or some statistic about memory usage? – coredump Feb 10 '11 at 18:24
  • @coredump. I've put a couple of the munin graphs up. The last incident was 10am Thursday. mem graph is here http://bit.ly/ic0UWC and the load average is here http://bit.ly/hmQ0Wn Thanks v much for any suggestions. Peter – Peter Feb 11 '11 at 10:39
  • Strange, the OOM killer should trigger only if you run out of memory and swap, but that doesn't seem to happen. When the load graph starts working again, though, there's a descending load scale, so maybe something is hitting hard and increasing the load and memory too fast for Munin to catch it. You should check what can be causing those spikes. – coredump Feb 11 '11 at 11:36
  • @coredump - thanks for the reply. It's a classic 'drive you nuts' error because the server is fine almost all the time and rarely runs anywhere near full capacity on any measure... I'm wondering whether there is more php logging I could do to rule out the 'rogue script' option. I _thought_ I had set hard limits on memory usage for PHP but possibly not. Wonder if there's any simple code around to stress-test a server and make sure these mem/cpu limits have been properly set? – Peter Feb 14 '11 at 11:07
  • You can try ab (apache benchmark), but if the 'rogue code' is somewhere deep in your code you will need better debugging. Maybe log the slow queries on your DB. You can also use sar to log your system stats. – coredump Feb 14 '11 at 12:10
  • I ran into similar problems to find out that the culprit where cron jobs running ... Did you try disabling all cron jobs? Or checking if cron jobs are running at the moment the system starts getting filled? Just another idea ... – rems Feb 14 '11 at 12:35
  • @rems - thanks for the comment - I only just saw it. No cron jobs running around that time (only daily ones for backups which don't correlate with the crash times). I've put a cronjob in to monitor server load each minute, which may either kill or cure! I have a feeling that mySQL may be to blame, but it's only a hunch. – Peter Apr 19 '11 at 11:18

You can try ab (ApacheBench) to stress the server, but if the 'rogue code' is somewhere deep in your code you will need better debugging. You should also log the slow queries on your DB.
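For example, with the MySQL 5.0 that ships with Debian 5 (Lenny), something like this under [mysqld] in my.cnf enables the slow query log (the directive names changed slightly in 5.1+, where it's slow_query_log / slow_query_log_file; paths here are just an example):

    [mysqld]
    # log queries that take longer than long_query_time seconds
    log_slow_queries = /var/log/mysql/mysql-slow.log
    long_query_time  = 2
    # optionally also log queries that don't use indexes
    log-queries-not-using-indexes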

You can also use sar (from the sysstat package) to log your system stats. Unfortunately, the best way to see what's happening is to be logged in on the machine when it happens and watch what each process is doing. You can write a script to capture that too, but it may end up not running because of the load.
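As a rough sketch of such a script - cron-driven, with the log path and process count just assumptions to adapt - snapshot load, memory, and the heaviest processes every minute so you have something to look back at after the next hang:

    #!/bin/sh
    # snapshot.sh - record load, memory, and the top memory consumers.
    # Run from cron, e.g.:  * * * * * root /usr/local/bin/snapshot.sh
    LOGFILE=/var/log/snapshot.log       # assumed location, change as needed
    {
        date
        uptime                          # load averages
        free -m                         # memory and swap usage in MB
        ps auxww --sort=-rss | head -15 # processes sorted by resident memory
        echo "----"
    } >> "$LOGFILE"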

coredump
  • thanks for the reply (for some reason I didn't see it until now). I'll check this out. Cheers, Peter – Peter Apr 19 '11 at 11:13