I've been hosting a website on an Amazon EC2 instance for years. Recently, users have complained of slowness and connection failures. I've checked memory and CPU usage on both the EC2 LAMP server and the RDS database server, and both seem well within the nominal range.
Web Server
- CPU usage averages about 15%, with rare spikes to around 50-60% about twice a day
- Memory usage 3.5G total, 3.2G used, 2.7G cached, swap usage zero
DB Server
- CPU usage typically 2-5%, with daily spikes. These spikes have been gradually getting higher for about a week, but never exceed 10%
- DB connections under 1 except for infrequent spikes to 2
- 5GB free RAM
Using netstat, I see that at any given time there are around 1000 connections to the web server:
$ netstat -ant | wc -l
1089
I've seen this number as high as 1480 earlier in the day when the problems occur.
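For what it's worth, that raw count includes the netstat header lines, listening sockets, and connections in TIME_WAIT, not just established ones. A breakdown by state can be pulled with something like this (a sketch, assuming the usual two header lines in the netstat output):
$ netstat -ant | awk 'NR>2 {print $6}' | sort | uniq -c | sort -rn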
All of this makes me think that the machine is network-bound, i.e., there isn't enough available network bandwidth to serve all the requested data. I suspect this lack of bandwidth may be the machine's bottleneck.
Can anyone suggest how to determine if this machine is, in fact, limited by network bandwidth? It would be extremely helpful if I could construct a network usage graph that indicates the problem. I'm not sure what this might look like, but am imagining a graph showing a hard plateau during the times of poor performance.
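I assume the raw data for such a graph could come either from CloudWatch's NetworkIn/NetworkOut metrics or from sar on the instance itself; a rough sketch of both (the instance ID and the time range are placeholders):
$ aws cloudwatch get-metric-statistics --namespace AWS/EC2 --metric-name NetworkOut \
      --dimensions Name=InstanceId,Value=i-0123456789abcdef0 \
      --start-time 2023-01-01T00:00:00Z --end-time 2023-01-02T00:00:00Z \
      --period 300 --statistics Maximum
$ sar -n DEV 60    # per-interface throughput, sampled every 60 seconds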
I've attached a screenshot of the AWS monitoring graphs.
EDIT: I was watching the server this morning when the slowness started, and I was unable to locate any resource bottleneck. The web server's memory and CPU usage look fine, and so do the DB server's. I don't see any outrageous amount of network bandwidth being used, and yet the server responds very slowly to page requests. Then the problem suddenly evaporates.
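For reference, the kind of live check I mean is roughly the following (load average, run queue, memory, and swap activity):
$ uptime
$ vmstat 5    # run queue (r), swap in/out (si/so), and CPU breakdown every 5 seconds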
While the problem persists, from a user's perspective (in Firefox) it looks as though something about the TLS handshake is slow, very much like this problem, but my Apache server has HostnameLookups set to Off.
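One way to pin that down would be to time the phases of a request from the client side, since curl reports the TLS handshake time separately from the TCP connect time; something like this (the URL is a placeholder):
$ curl -o /dev/null -s -w 'dns:%{time_namelookup} tcp:%{time_connect} tls:%{time_appconnect} ttfb:%{time_starttransfer} total:%{time_total}\n' https://www.example.com/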
The bottleneck, whatever it is, appears to prevent network connections from being made. During the slowness, total network connections held steady at around 800:
$ netstat -n | wc -l
While the connections to the database from the web server were very steady around 200:
$ netstat -an | grep <db-server-ip-here> | wc -l
As soon as the problem passes (which happens quite erratically), these numbers jump to roughly double those values and the server runs lightning fast.
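To catch that transition as it happens, a simple loop logging the established-connection count every few seconds should produce the plateau-then-jump picture described above; a rough sketch (the log file name is just an example):
$ while true; do echo "$(date +%T) $(netstat -nt | grep -c ESTABLISHED)"; sleep 10; done >> conn-count.log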