
I have a variety of Ubuntu machines on EC2 running in production, about 30 of which were upgraded from 15.04 to 15.10. On most of them, the upgrade went flawlessly and I've had no issues at all.

However, 10 of my webservers started crashing immediately after the 15.10 upgrade. By "crash" I mean that Instance Status Checks fail and I can no longer SSH to the machine. Background daemons running on the system stop responding, and nothing is written to the logs. The most recent log entries I see on one machine show:

/var/log/syslog:Dec 18 00:28:58 xxx-web-4a dhclient: DHCPREQUEST of 10.xxx.xxx.104 on eth0 to 10.xxx.xxx.1 port 67 (xid=0x616a091d)
/var/log/syslog:Dec 18 00:28:58 xxx-web-4a dhclient: DHCPACK of 10.xxx.xxx.104 from 10.xxx.xxx.1
/var/log/syslog:Dec 18 00:28:58 xxx-web-4a dhclient: bound to 10.xxx.xxx.104 -- renewal in 1640 seconds.

But my Instance Status Checks didn't begin failing until 00:32:00 (when the first of several checks failed to respond). There is absolutely nothing in the logs during the period following the entries above.

Now, like I said, my ~20 other 15.10 instances have never crashed in the 6+ weeks since their upgrade. Only this set of webservers is affected, and all of them are crashing. So, what's different about these machines? Only two things, really.

  1. They're my highest-traffic 15.10 instances, sending and receiving about 5-10Mb/sec on average, with occasional peaks of 30-40Mb/sec or a bit more.
  2. They're my only instances of type c4.xlarge or m4.xlarge. Originally, they were all c4.xlarge, but I replaced them with m4.xlarge to try to isolate the problem. Crashes seem less frequent on the m4.xlarge, but I've still seen 3 or 4 a day across the 10 webservers. Generally, I'm seeing each instance crash at least once a day, at seemingly random times.

These instances are running Apache 2.4.x, mod_php 5.6.11, and memcached 1.4.24, but I have other machines receiving less traffic on a smaller instance type that are perfectly stable.

Not sure if related, but all of these machines are seeing periodic ifquery segfaults, for example:

/var/log/syslog:Dec 17 14:02:05 xxx-web-3a kernel: [   22.592488] ifquery[476]: segfault at 1 ip 0000000000403187 sp 00007ffde8596050 error 4 in ifup[400000+d000]
/var/log/syslog:Dec 17 14:02:05 xxx-web-3a kernel: [   23.593774] ifquery[510]: segfault at 1 ip 0000000000403187 sp 00007ffde6087b90 error 4 in ifup[400000+d000]
/var/log/syslog:Dec 17 14:02:05 xxx-web-3a kernel: [   24.594994] ifquery[531]: segfault at 1 ip 0000000000403187 sp 00007ffe70747a50 error 4 in ifup[400000+d000]
/var/log/syslog:Dec 17 14:04:12 xxx-web-3a kernel: [    2.623024] ifquery[367]: segfault at 1 ip 0000000000403187 sp 00007ffefc980f60 error 4 in ifup[400000+d000]
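For anyone reading these lines, here is a minimal Python sketch that decodes one of them. It assumes the standard x86 page-fault error-code bits (bit 0 = page present, bit 1 = write, bit 2 = user mode) and that the bracketed `400000+d000` is the mapping's base address and size. The function name `decode_segfault` and the returned field names are made up for illustration.

```python
import re

# Pattern for the kernel's user-space segfault report, as seen in the syslog lines above.
SEGFAULT_RE = re.compile(
    r"(?P<comm>\S+)\[(?P<pid>\d+)\]: segfault at (?P<addr>[0-9a-f]+) "
    r"ip (?P<ip>[0-9a-f]+) sp (?P<sp>[0-9a-f]+) error (?P<err>[0-9a-f]+) "
    r"in (?P<obj>[^\[]+)\[(?P<base>[0-9a-f]+)\+(?P<size>[0-9a-f]+)\]"
)

# One of the lines from the question, copied verbatim.
SAMPLE = (
    "/var/log/syslog:Dec 17 14:02:05 xxx-web-3a kernel: [   22.592488] "
    "ifquery[476]: segfault at 1 ip 0000000000403187 sp 00007ffde8596050 "
    "error 4 in ifup[400000+d000]"
)


def decode_segfault(line):
    """Return a dict describing the fault, or None if the line does not match."""
    m = SEGFAULT_RE.search(line)
    if m is None:
        return None
    err = int(m.group("err"), 16)   # the kernel prints the error code in hex
    ip = int(m.group("ip"), 16)
    base = int(m.group("base"), 16)
    return {
        "comm": m.group("comm"),
        "pid": int(m.group("pid")),
        "object": m.group("obj"),
        "fault_addr": int(m.group("addr"), 16),
        "offset": ip - base,                # instruction offset inside the mapping
        "page_present": bool(err & 0x1),    # 0 means the page was not mapped
        "write": bool(err & 0x2),           # 0 means the access was a read
        "user_mode": bool(err & 0x4),       # 1 means the fault came from user space
    }


if __name__ == "__main__":
    info = decode_segfault(SAMPLE)
    print(info)
```

Read this way, `error 4` at address `1` is a user-mode read of an unmapped page very close to NULL, at the same offset inside `ifup` every time. That looks more like a bug in the `ifquery` binary itself than a kernel memory error, though that is only an inference from these four lines.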

One system, prior to the c4.xlarge --> m4.xlarge migration, saw a General Protection Fault logged a single time in the system console log, but I have not seen this again.

I'm not seeing these segfaults on my other 15.10 machines which are not crashing.

These are all Enhanced Networking instances with the Intel 82599 10G Ethernet adapter, which I slightly suspect may contribute to the issue. However, I have other (much-lower-traffic) machines with the same adapter running 15.10 that have never crashed.

Is anyone seeing similar problems, or have any ideas for debugging or fixing this? Thanks!

Edit

Looking at the Console Log, one of my frequently-crashing systems reported a General Protection Fault right before rebooting:

[171009.844097] general protection fault: 0000 [#1]
[ 0.000000] Initializing cgroup subsys cpuset
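In captured console output like the excerpt above, the last message before a crash can run straight into the first message of the next boot. A rough Python sketch for pulling these apart follows. It assumes every kernel message starts with a bracketed `seconds.microseconds` timestamp and treats any drop in that timestamp as a reboot. The helper names `parse_entries` and `last_before_reboot` are hypothetical.

```python
import re

# Kernel timestamps look like "[171009.844097]" or "[    0.000000]".
# Markers such as "[#1]" do not match because they lack the digits.digits form.
TS_RE = re.compile(r"\[\s*(\d+\.\d+)\]")

# The console excerpt from the question, with both messages on one line.
SAMPLE = (
    "[171009.844097] general protection fault: 0000 [#1] "
    "[ 0.000000] Initializing cgroup subsys cpuset"
)


def parse_entries(text):
    """Split console text into (timestamp_seconds, message) tuples."""
    matches = list(TS_RE.finditer(text))
    entries = []
    for i, m in enumerate(matches):
        # A message runs until the next timestamp, or to the end of the text.
        end = matches[i + 1].start() if i + 1 < len(matches) else len(text)
        entries.append((float(m.group(1)), text[m.end():end].strip()))
    return entries


def last_before_reboot(entries):
    """Return entries that are directly followed by a lower timestamp."""
    return [prev for prev, cur in zip(entries, entries[1:]) if cur[0] < prev[0]]


if __name__ == "__main__":
    for uptime, message in last_before_reboot(parse_entries(SAMPLE)):
        print(f"uptime {uptime / 3600:.1f} h before reboot: {message}")
```

Run over the full text returned by `ec2-get-console-output`, this lists the final message and the uptime before each reboot, which makes it easier to compare crashes across instances.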
Will
  • Which Ubuntu base AMI version are you using? Have a look at https://forums.aws.amazon.com/thread.jspa?messageID=682508 – Jukka Jan 05 '16 at 08:15
  • Originally, I experienced this after upgrading from 15.04 to 15.10 via `do-release-upgrade`; I'm unsure which 15.04 AMI it was. Then I launched a fresh instance from Ubuntu's official Cloud Images AMI for December 9th, `ami-47723b2d`, and it failed in a manner similar to the thread you linked, which is different from the kernel panic errors (General Protection Fault) that I'm experiencing. These are literally crashing due to some kernel or driver memory error. But yes, there seem to be numerous problems with 15.10 on ec2 c4.xlarge and m4.xlarge instances, at least. – Will Jan 05 '16 at 11:55
  • I'm trying to turn on kernel crash reporting so I can get a backtrace for a bug report, but I'll have to see in the morning which instances crashed and if the crash reports saved. I'll update when that happens. Thanks! – Will Jan 05 '16 at 11:56
  • You can use either `netconsole` or `crashdump` (RHEL: https://access.redhat.com/documentation/en-US/Red_Hat_Enterprise_Linux/7/pdf/Kernel_Crash_Dump_Guide/Red_Hat_Enterprise_Linux-7-Kernel_Crash_Dump_Guide-en-US.pdf, Ubuntu: https://help.ubuntu.com/lts/serverguide/kernel-crash-dump.html). On physical machines you could also use the `pstore` driver for that. Anyway, even after you capture `dmesg` you'll end up upgrading/downgrading the kernel as the simplest solution, so be prepared for that, unless you have a person on your team who can fix or work around the GPE in your current kernel. – SaveTheRbtz Jan 06 '16 at 04:26
  • Thanks, yeah, I'm fine with having to update the kernel or downgrade it or downgrade the whole machine. But I also want to help the Ubuntu/kernel/driver team fix this issue if I can. I've been trying to get kdump working for hours! See my question [here](https://askubuntu.com/questions/717458/cannot-get-kdump-to-dump-a-vmcore-using-crashkernel) :( I also got netconsole working, but halfway through the boot it stops sending output, even if netconsole is the only console. That's the most frustrating part of this: none of the debugging tools I'd know to use are working. – Will Jan 06 '16 at 05:33
  • Also, I can get read-only console output via AWS's `ec2-get-console-output -r`, and when I crash the kernel manually via `echo c >/proc/sysrq-trigger` I get a nice traceback on the console, but this GPF/GPE never outputs anything more than the one line in my post. That's why I'm hoping the vmcore will help, but I can't get it to dump for the life of me, even over SSH. – Will Jan 06 '16 at 05:34
