To solve this problem you (or someone on your behalf) will need to gather some data about your system and analyse it using the scientific method (or a process you prefer).
You can gather the data using system tools such as sar, free, iostat, vmstat, etc.
Install monitoring to gather and track data [1, 2].
Reading your logs is also frequently helpful.
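As a rough illustration of gathering this kind of data yourself, here is a minimal sketch in Python that samples the load average and normalises it by CPU count. The function names `load_per_cpu` and `sample_load` are made up for this example, and it assumes a Unix-like system where `os.getloadavg()` is available. It is not a substitute for sar or proper monitoring, just a way to see what the numbers mean.

```python
import os
import time


def load_per_cpu(load_averages, cpu_count):
    """Normalise the 1, 5 and 15 minute load averages by the number of CPUs."""
    if cpu_count < 1:
        raise ValueError("cpu_count must be at least 1")
    return tuple(round(value / cpu_count, 2) for value in load_averages)


def sample_load(samples=3, interval_seconds=1.0):
    """Collect a few timestamped, per-CPU load average readings."""
    cpus = os.cpu_count() or 1  # os.cpu_count() can return None
    readings = []
    for i in range(samples):
        stamp = time.strftime("%Y-%m-%dT%H:%M:%S")
        readings.append((stamp, load_per_cpu(os.getloadavg(), cpus)))
        if i < samples - 1:
            time.sleep(interval_seconds)
    return readings
```

Calling `sample_load()` returns a list of `(timestamp, (one_min, five_min, fifteen_min))` tuples. As a rule of thumb, a per-CPU value that stays well above 1 suggests work is queuing. Bear in mind that on Linux the load average also counts tasks in uninterruptible sleep (usually waiting on I/O), so a high value does not by itself mean the CPUs are busy.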
Now that you have a view of how your system is performing you can start to ask questions, perform trials and analyse the results.
1. What is the actual problem you're trying to solve?
   My load average is unusually high. [1]
2. Now that we know what problem we're actually solving, we have some direction. Let's gather some information to help us figure out a solution.
   - Is the problem time related? Does it happen regularly or randomly?
   - Check your logs, all of them, not just the particular service's logs, as something else may be causing the problem. Log entries generally have timestamps to help you correlate events across multiple applications and services, so use them. If necessary, increase the log verbosity too.
   - Watch what your system is doing. Use tools like top, vmstat, iostat, sar, ps, tcpdump or even full-blown monitoring.
3. Analyse the information you have gathered. What is actually happening on the system when the service stops responding? What is the state of the system's resources?
4. Take appropriate action to remediate. Hopefully it's pretty obvious what's going on: you're running out of memory and the OOM killer comes out to play, your swap activity is too high, your run queue is too long, you're I/O bound, etc. If it's not obvious then you're probably not gathering the correct data. You know what to do: go back to 2.
5. Monitor what the changes introduced at 4. do.
6. Did the changes fix the problem? Is it better? Is it worse? Is there no difference? Where you go from here depends on what you find. You may need to go back to 2. and gather more pertinent data, or 3. to reanalyse what data you have, or 4. because you identified a number of potential solutions.
7. Document your findings and the changes you made.
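The advice above about using timestamps to correlate events across logs can be sketched roughly as follows. This is only an illustration: the helper names `parse_line` and `events_near`, the sample log lines, and the assumption that every line starts with an ISO 8601 timestamp followed by a space are all hypothetical. Real logs come in many formats and would need their own parsing.

```python
from datetime import datetime, timedelta


def parse_line(source, line):
    """Split a line of the assumed form '<ISO timestamp> <message>'."""
    stamp, _, message = line.partition(" ")
    return datetime.fromisoformat(stamp), source, message


def events_near(logs, incident, window_seconds=60):
    """Return entries from every log within the window around the incident, oldest first."""
    window = timedelta(seconds=window_seconds)
    entries = [
        parse_line(source, line)
        for source, lines in logs.items()
        for line in lines
        if line.strip()
    ]
    nearby = [entry for entry in entries if abs(entry[0] - incident) <= window]
    return sorted(nearby, key=lambda entry: entry[0])


# Hypothetical sample data: two services, one incident time.
sample_logs = {
    "app": [
        "2024-01-01T03:00:05 request timeout",
        "2024-01-01T09:00:00 worker restarted",
    ],
    "cron": ["2024-01-01T03:00:00 backup job started"],
}
incident_time = datetime(2024, 1, 1, 3, 0, 30)

for when, source, message in events_near(sample_logs, incident_time):
    print(when.isoformat(), source, message)
```

Merging the logs into one time-ordered view makes it easier to spot that, in this made-up data, the backup job started a few seconds before the application began timing out, which is exactly the kind of cross-service hint a single service's log would not show.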