I'm running a t2.micro instance on Amazon Linux AMI 2018.03 (4.14.59-64.43.amzn1.x86_64). It hosts a PHP website using Apache/2.4.33 and connects to an RDS MySQL database.

From time to time, the server completely "disappears". Trying to display the website, connect via FTP, or even connect over SSH with PuTTY all result in a timeout. And it doesn't come back on its own; I have to manually stop the instance via the AWS console and start it up again, and then everything is back to normal. (Interestingly, the "reboot" command does nothing and seems to be ignored by the server. Only stopping it and starting it again works.)

The problem is, I've checked every log file I could find and there doesn't seem to be anything at all around the time the server stops responding, so I have no idea how to troubleshoot. Checking CloudWatch metrics, CPU and network usage also seem to be normal while the server is unresponsive.

This seems to happen when I run a particular memory-heavy PHP script a bunch of times (but only randomly; I can also run the script without issue), so I suspect it might be related to the RAM filling up. But if the system were killing something to free up memory, wouldn't that show up in the logs?
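(Editor's note: one quick thing worth checking after the next forced restart is whether the kernel's OOM killer left a trace. A hedged sketch; log locations are as on Amazon Linux 1, adjust for other distros, and the `|| true` guards just keep the commands from failing when nothing matches:)

```shell
# Look for out-of-memory kills in the kernel ring buffer and syslog.
dmesg | grep -iE 'out of memory|oom-killer' || true
grep -i 'oom' /var/log/messages 2>/dev/null || true
```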

How would one go about debugging in a situation like this?

Thanks

Here is the only thing in the messages log around the last occurrence:

Sep  6 15:11:34 compta dhclient[2266]: PRC: Renewing lease on eth0.
Sep  6 15:11:34 compta dhclient[2266]: XMT: Renew on eth0, interval 10970ms.
Sep  6 15:11:34 compta dhclient[2266]: RCV: Reply message on eth0 from ****::***:****:****:****.
Sep  6 15:11:34 compta ec2net: [get_meta] Trying to get http://***.***.***.***/latest/meta-data/network/interfaces/macs/**:**:**:**:**:**/local-ipv4s
Sep  6 15:11:34 compta ec2net: [rewrite_aliases] Rewriting aliases of eth0
Sep  6 15:11:34 compta ec2net: [get_meta] Trying to get http://***.***.***.***/latest/meta-data/network/interfaces/macs/**:**:**:**:**:**/subnet-ipv4-cidr-block
Sep  6 15:22:13 compta kernel: imklog 5.8.10, log source = /proc/kmsg started.
Sep  6 15:22:13 compta rsyslogd: [origin software="rsyslogd" swVersion="5.8.10" x-pid="2356" x-info="http://www.rsyslog.com"] start
Sep  6 15:22:13 compta kernel: [    0.000000] Linux version 4.14.59-64.43.amzn1.x86_64 (mockbuild@gobi-build-64010) (gcc version 7.2.1 20170915 (Red Hat 7.2.1-2) (GCC)) #1 SMP Thu Aug 2 21:29:33 UTC 2018
Sep  6 15:22:13 compta kernel: [    0.000000] Command line: root=LABEL=/ console=tty1 console=ttyS0 selinux=0 LANG=en_US.UTF-8 KEYTABLE=us
Sep  6 15:22:13 compta kernel: [    0.000000] x86/fpu: Supporting XSAVE feature 0x001: 'x87 floating point registers'

15:22 is when I restarted the server.

Just realised something: the eth0 lease usually renews about every minute, but the renewals stop once the server stops responding.
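(Editor's note: since nothing survives in the logs, one way to narrow this down might be to log memory usage continuously, so there is a trail to inspect after the next hang. A minimal sketch; the path and cron schedule are illustrative:)

```shell
# Append a memory snapshot to a log file on each run. Schedule via cron,
# e.g. "* * * * * /home/ec2-user/memlog.sh", so the last entries show
# the state just before a hang.
LOG=/tmp/memlog.txt   # better: somewhere persistent like /var/log/memlog.txt
{
  date
  grep -E 'MemTotal|MemFree|MemAvailable|SwapFree' /proc/meminfo
  echo '---'
} >> "$LOG"
```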

Dino
  • CPU heavy on a bursting (t2) instance probably means you should check your CPU credits balance in CloudWatch. Hitting the cap will mean everything gets EXTREMELY slow. – ceejayoz Sep 06 '18 at 15:13
  • @ceejayoz Checked it but it seems this wasn't the issue. See my comment on gator2003 answer for more details. – Dino Sep 07 '18 at 01:53
  • Your theory that it could run out of RAM is worth considering. Two options here, run it on a t2.medium / t2.large for a few days and see if it fails. A cheaper option is to [set up some swap space](https://www.photographerstechsupport.com/tutorials/adding-swap-space-ec2-amazon-linux-instance/), either in a file on your existing EBS volume or on a new dedicated volume. I have a t2.nano running Nginx / PHP / MySQL and a few other things, 512MB RAM 512MB swap (uses 100MB), works great. I have tuned MySQL / PHP quite carefully. – Tim Sep 07 '18 at 08:28
  • @Tim Thanks, I set up some swap space and it hasn't crashed since then! The script is very memory heavy though so I'll have to add some queue system I think once I have more concurrent users or even the swap will fill up, but now I know where it's coming from :) – Dino Sep 11 '18 at 17:04

2 Answers


Agreed on checking CPU credits on a t2 instance. The throttling can produce exactly that behavior.

Check out this link: https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/t2-credits-baseline-concepts.html

gator2003
  • I don't think this is the issue here, if I understand the metrics correctly. I checked CloudWatch for CPUCreditBalance and it seems to always be capped at 144, except for a few occurrences where it instantly drops to 30. I think those drops are when I restart the server, since the doc on your link says "For T2, the CPU credit balance does not persist between instance stops and starts. If you stop a T2 instance, the instance loses all its accrued credits." If I'm correct, this means that the last few times I restarted the server because it wasn't responding anymore, the balance was at 144. – Dino Sep 07 '18 at 01:48
  • Stop / start resets the credit balance - you have a new instance on a new host. An OS level restart doesn't change instances, you're still on the same hypervisor with the same instance, but may use up some credits. – Tim Sep 07 '18 at 08:25
  • I don't think it's this btw – Tim Sep 07 '18 at 08:27

As per my previous comment, I'll turn this into an answer so you can mark it correct. That way, people won't keep coming in to try to help.

I suggest you set up some swap space to test whether it's a RAM problem. I have a tutorial on how to do that here, but it's a very common thing to do, so there are hundreds of resources explaining it.
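(Editor's note: for reference, the usual swap-file recipe on Amazon Linux looks roughly like this. The size and path are examples, and the commands need root:)

```shell
# Create and enable a 1 GiB swap file (run as root; size is an example).
dd if=/dev/zero of=/var/swapfile bs=1M count=1024
chmod 600 /var/swapfile
mkswap /var/swapfile
swapon /var/swapfile
# Persist across reboots
echo '/var/swapfile none swap sw 0 0' >> /etc/fstab
# Verify
swapon -s
free -m
```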

Tim