We use munin and monit to keep track of general stats on our VPS. Over the last couple of weeks we have been running into random I/O wait spikes that kill our server's performance.

Since then we have been checking our cron jobs for possible suspects, but haven't found one that matches the spike pattern. We can't always get to the server in time to check ps aux for stalled processes, and the results can vary even during a spike.

So I am wondering if there is a better way to set up passive monitoring, preferably via munin/monit, that keeps track of the processes experiencing or causing the most I/O wait?

(PS: I have used some of the suggestions in this Q&A, but haven't been able to pinpoint the cause yet.)

bitinn

3 Answers

You might want to use Linux process accounting. If it's included in your kernel (I'm not sure about the current status), you can enable it with accton. The command sa will report the following data per process:

cpu - sum of system and user time in cpu seconds
re - "real time" in cpu seconds
k - cpu-time averaged core usage, in 1k units
avio - average number of I/O operations per execution
tio - total number of I/O operations
k*sec - cpu storage integral (kilo-core seconds)
u - user cpu time in cpu seconds
s - system time in cpu seconds

You might be interested in the avio and tio values of the processes.

The GNU accounting utilities manual and the Enabling Process Accounting on Linux HOWTO give more details.
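If process accounting turns out not to be available, a rough alternative is to sample the per-process counters in /proc/<pid>/io and rank processes by how many bytes they moved between two samples. This is a different technique from sa, not a reimplementation of it, and it is only a sketch: it assumes the kernel exposes /proc/<pid>/io (task I/O accounting) and that the script runs as root so it can read other users' processes. The function names below are made up for illustration.

```python
import os


def parse_io(io_text):
    """Parse the contents of /proc/<pid>/io into a dict of integer counters."""
    counters = {}
    for line in io_text.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            counters[key.strip()] = int(value)
    return counters


def snapshot(proc_root="/proc"):
    """Map pid -> read_bytes + write_bytes for every process we can read."""
    totals = {}
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue
        try:
            with open(os.path.join(proc_root, entry, "io")) as handle:
                io = parse_io(handle.read())
        except (OSError, ValueError):
            continue  # process exited, or we lack permission to read it
        totals[int(entry)] = io.get("read_bytes", 0) + io.get("write_bytes", 0)
    return totals


def top_io(before, after, limit=5):
    """Return (pid, byte_delta) pairs with the largest increase between snapshots.

    A pid that only appears in `after` is charged with its whole lifetime total.
    """
    deltas = {pid: after[pid] - before.get(pid, 0) for pid in after}
    ranked = sorted(deltas.items(), key=lambda item: item[1], reverse=True)
    return [(pid, delta) for pid, delta in ranked[:limit] if delta > 0]
```

To use it, call snapshot(), sleep for a few seconds, call snapshot() again, and pass both results to top_io(). Appending the output to a log file leaves a record to compare against the munin graphs after a spike.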

Alexander Janssen

There may not be a process that's causing this. There's more likely an I/O subsystem or resource contention issue. I wouldn't be looking for a single process or group of processes that cause this. I'd be noting the conditions under which it happens.

Do you have the full specifications of your server, including the details of the disk setup? Please post them along with OS version/kernel - uname -a

edit -

I guess this is a VPS. Call or email your provider's support. Show them a graph and explain your situation.

ewwhite

You may be exceeding vps limits:

Even if you catch the processes with high io values, they will probably be generic apache-php or other website-related processes. That is not enough; you would need the actual url behind them. And when you find the url, it will NOT be the cause of the io problem, it will be a victim of it. Most CMSs need ~1000 files to run, and any problem with the hdd or cache will cause them to show huge io spikes. The actual cause of the io could be some other vps account, or you hitting your vps limits. Exceeding vps limits does cause io spikes, given the limitations providers have in place.

Other accounts or machine processes hogging the vps:

You could consider changing your provider!

I had a similar issue. I spent weeks trying to diagnose it, then went back and forth with a deaf support bureaucracy, and finally got fed up and moved to another host. Now everything is suddenly fine and has been fine for a year! What happened? Same vps size, same code, even a slower machine and a busier site, and everything is smooth.

My guess at the time was that some other account was hogging the hdd with a huge daily forum backup (I found this through Google). My other guess was some inefficient server backup software. I don't know. The similarity to your situation is that it started on a certain date, it was quite periodic, and there was very little change in my vps code or load. You can neither catch nor prove this with monitoring software confined within the vps walls.

That was a year ago. Nowadays there are some new headaches around.

Core dumps: This is the new fashion now, core dumping instead of error logging. Server hysteria. When something goes wrong, the server starts dumping a 200M core file, which keeps io high, so other things go wrong in a chain, and one ends up with 3-5 200M files saved at the busiest time of the day, stalling the hdd.

imagick image processor: For some reason, imagick started working with 1G-50G temporary files. When that happens, it stalls the hdd. I suspect imagick is used by munin, which would be ironic.

Another practical suggestion: Keep munin open in a window while you work on something else. It auto-refreshes every 5 minutes, so you will see the spike when it starts. I used to catch io spikes like that. Your spike lasts for 30 minutes or so, so you can easily catch it. Then you can run ps aux etc.
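If sitting in front of the graph is not practical, the "ps aux" step can be roughly automated. The sketch below records which processes are in uninterruptible sleep (state D in ps, which usually means waiting on disk) by reading /proc/<pid>/stat. It assumes a Linux /proc filesystem; the function names and the default log file name are made up for illustration.

```python
import os
import time


def parse_stat(stat_text):
    """Return (pid, command, state) from the contents of /proc/<pid>/stat."""
    # The command is wrapped in parentheses and may itself contain spaces
    # or parentheses, so split on the last ')' rather than on whitespace.
    head, _, tail = stat_text.rpartition(")")
    pid_text, _, comm = head.partition("(")
    state = tail.split()[0]
    return int(pid_text), comm, state


def blocked_processes(proc_root="/proc"):
    """List (pid, command) for every process currently in state D."""
    blocked = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue
        try:
            with open(os.path.join(proc_root, entry, "stat")) as handle:
                pid, comm, state = parse_stat(handle.read())
        except (OSError, ValueError, IndexError):
            continue  # the process exited while we were reading it
        if state == "D":
            blocked.append((pid, comm))
    return blocked


def log_blocked(log_path="iowait-blocked.log"):
    """Append one timestamped line per blocked process to log_path."""
    stamp = time.strftime("%Y-%m-%d %H:%M:%S")
    with open(log_path, "a") as log:
        for pid, comm in blocked_processes():
            log.write(f"{stamp} {pid} {comm}\n")
```

Calling log_blocked() every few seconds (from a small loop with time.sleep, or less precisely from cron) builds a history that can be lined up with the munin graph afterwards. As noted above, the processes it catches may be victims of the spike and not its cause.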

Johan