
I recently inherited a web server setup from another developer. It's basically the following:

  • 2 web servers running Apache 2
  • 2 load balancers running nginx
  • 2 database servers running MySQL

Every week or so, the Apache web servers become unresponsive to requests and the load balancer ends up returning 504 Gateway Timeout. I logged in to the web server and checked uptime, which returned: 18:40:49 up 5 days, 20:15, 1 user, load average: 122.37, 119.80, 107.57. That is extremely high compared to the number of processors available to the instance, which is 8.

To get things back online as fast as possible, I restarted the web servers and everything went back to normal: 18:54:19 up 5 min, 1 user, load average: 0.11, 0.22, 0.10

I am not asking for definitive answers, as I should be looking further into the source of the problem myself, but I would like some hints and suggestions regarding this issue:

  • Why do you think this might be happening?
  • How can I look further into this issue to identify the source of the problem? I need some pointers on where and what to look for.

Thanks for the help.

Ayman Farhat

1 Answer


A high load that gets fixed by a restart could be a symptom of some sort of leak. If memory usage increases over time, either due to a memory leak or simply because the application platform maintains data structures that keep growing, the server could end up swapping a lot.

This obviously depends on a lot of factors, but I have seen web servers that were too tight on memory exhibit symptoms just like what you described.
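One way to check the swapping theory is to log memory figures periodically (for example from cron) and see whether swap usage climbs in the days before the load spikes. The following is only a minimal sketch, assuming a Linux host where /proc/meminfo is readable; the function names `parse_meminfo` and `memory_pressure` are made up for this illustration.

```python
from pathlib import Path


def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a dict of field name -> integer.

    Most fields are reported in kB; the unit suffix is ignored here.
    """
    info = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        fields = rest.split()
        if sep and fields and fields[0].isdigit():
            info[name.strip()] = int(fields[0])
    return info


def memory_pressure(info):
    """Return (swap_used_kb, available_fraction) from parsed meminfo values."""
    swap_used = info.get("SwapTotal", 0) - info.get("SwapFree", 0)
    total = info.get("MemTotal", 0)
    # MemAvailable is missing on older kernels; fall back to MemFree.
    available = info.get("MemAvailable", info.get("MemFree", 0))
    fraction = available / total if total else 0.0
    return swap_used, fraction


if __name__ == "__main__":
    try:
        text = Path("/proc/meminfo").read_text()
    except OSError:
        print("/proc/meminfo not readable; this sketch assumes Linux")
    else:
        swap_used, fraction = memory_pressure(parse_meminfo(text))
        print(f"swap used: {swap_used} kB, memory available: {fraction:.1%}")
```

If swap usage keeps growing and available memory keeps shrinking between restarts, that would point toward the leak explanation rather than a sudden traffic spike.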

Another possibility is that the application spawns background threads that, for some reason, keep running and consuming CPU time or other resources.
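To look for runaway threads, it can help to list which processes use the most CPU and how many threads each one has while the load is climbing. Here is a rough sketch, assuming the procps version of `ps` found on most Linux distributions; the helper names `parse_ps` and `busiest` are hypothetical. Note that the %CPU value reported by `ps` is averaged over the lifetime of the process, so it is not an instantaneous reading.

```python
import subprocess


def parse_ps(output):
    """Parse `ps -eo pid,nlwp,pcpu,comm` output into a list of dicts."""
    rows = []
    for line in output.splitlines():
        parts = line.split(None, 3)
        # Skip the header line and anything that does not start with a PID.
        if len(parts) < 4 or not parts[0].isdigit():
            continue
        try:
            rows.append({
                "pid": int(parts[0]),
                "threads": int(parts[1]),  # nlwp: number of threads
                "cpu": float(parts[2]),    # pcpu: lifetime-average %CPU
                "command": parts[3].strip(),
            })
        except ValueError:
            continue
    return rows


def busiest(rows, limit=5):
    """Return the `limit` rows with the highest CPU usage."""
    return sorted(rows, key=lambda r: r["cpu"], reverse=True)[:limit]


if __name__ == "__main__":
    try:
        result = subprocess.run(
            ["ps", "-eo", "pid,nlwp,pcpu,comm"],
            capture_output=True, text=True, check=True,
        )
    except (OSError, subprocess.CalledProcessError):
        print("could not run ps; this sketch assumes procps on Linux")
    else:
        for row in busiest(parse_ps(result.stdout)):
            print(f"{row['pid']:>7} {row['threads']:>5} "
                  f"{row['cpu']:>6.1f} {row['command']}")
```

A process whose thread count or CPU share keeps growing between restarts would be a good candidate for closer inspection.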

I strongly recommend you ask the previous owner for clues as to what may be consuming resources (both memory and CPU). The symptoms you describe could also occur if the server has been compromised, but without knowing how the server is supposed to behave, it can be very hard to tell the difference. And even if a compromise isn't part of the explanation, you still need to understand the application in order to debug the problem.

kasperd