The "load average" on a *nix machine is the "average length of the run queue", or in other words, the average number of processes that are running or waiting to run (on Linux, processes in uninterruptible sleep, usually blocked on I/O, are counted as well). While the concept is simple enough to understand, troubleshooting a high load average can be less straightforward.
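To make the numbers concrete: on Linux the three averages, plus a snapshot of runnable versus total scheduling entities, are exposed in `/proc/loadavg`. Below is a minimal sketch for reading it, assuming a Linux host. The `LoadAvg` type and `parse_loadavg` function are names made up for illustration.

```python
from typing import NamedTuple


class LoadAvg(NamedTuple):
    one: float       # 1-minute load average
    five: float      # 5-minute load average
    fifteen: float   # 15-minute load average
    runnable: int    # currently runnable scheduling entities
    total: int       # scheduling entities that currently exist


def parse_loadavg(text: str) -> LoadAvg:
    """Parse the contents of /proc/loadavg (Linux-specific format)."""
    fields = text.split()
    # The fourth field looks like "3/412": runnable / total entities.
    runnable, total = fields[3].split("/")
    return LoadAvg(
        float(fields[0]),
        float(fields[1]),
        float(fields[2]),
        int(runnable),
        int(total),
    )


if __name__ == "__main__":
    # Assumes a Linux host; /proc/loadavg does not exist elsewhere.
    with open("/proc/loadavg") as f:
        print(parse_loadavg(f.read()))
```

If only the three averages are needed, `os.getloadavg()` from the standard library returns them on most Unix systems without reading `/proc` directly.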
Here are the statistics from a server I worked on today, which made me wonder about the best way to diagnose this sort of problem:
- 1GB RAM free, 0 swap space usage
- CPU times around 20% user, 30% wait, 50% idle (according to top)
- About 2 to 3 processes in either "R" or "D" state at a time (tested using ps | grep)
- Server logs free of any error messages indicating hardware problems
- Load average around 25.0 (for all 3 averages)
- Server visibly unresponsive for users
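For reference, the "R" or "D" count above can be reproduced with something like the sketch below. It assumes a Linux host with procps `ps`, and it lists threads (`-L`) rather than only processes, since Linux counts each runnable or uninterruptible thread toward the load average and a plain `ps | grep` may undercount them. The `busy_tasks` function name is hypothetical.

```python
import subprocess
from collections import Counter


def busy_tasks(ps_output: str) -> list:
    """Return (state, pid, lwp, command) tuples for tasks in R or D state.

    Expects lines in the format produced by:
        ps -eLo state=,pid=,lwp=,comm=
    """
    tasks = []
    for line in ps_output.splitlines():
        parts = line.split(None, 3)
        # Skip blank or malformed lines (and any stray header line).
        if len(parts) < 4 or not parts[1].isdigit() or not parts[2].isdigit():
            continue
        state, pid, lwp, comm = parts
        if state in ("R", "D"):
            tasks.append((state, int(pid), int(lwp), comm.strip()))
    return tasks


if __name__ == "__main__":
    # Assumption: procps ps on Linux; the trailing "=" suppresses column headers.
    out = subprocess.run(
        ["ps", "-eLo", "state=,pid=,lwp=,comm="],
        capture_output=True, text=True, check=True,
    ).stdout
    busy = busy_tasks(out)
    print(f"{len(busy)} tasks in R or D state")
    # Group by command name to see which program owns the busy threads.
    for comm, count in Counter(t[3] for t in busy).most_common():
        print(f"{count:5d}  {comm}")
```

This is a single snapshot, so running it several times in a row gives a better picture than one sample.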
I eventually "fixed" the problem by restarting mysqld, which doesn't make a lot of sense, because according to MySQL's "show processlist" command, the server was theoretically idle.
What other tools/metrics should I have used to help diagnose this issue and possibly determine what was causing the server load to run so high?