EC2 instance Amzn2 was down for a month. Kernel panic when starting now

I had a fully functional Amazon Linux 2 instance that I had configured as a mail server. Over the summer I shut it down, and did nothing else to it. When I try to start it now, it passes only 1 of 2 status checks: it fails the Instance Status Check, and the instance is not reachable. I tried re-attaching the IP address in case it was a network problem, but that changed nothing. Looking through the System log for anything unusual, I found a kernel panic. Specifically:

[    2.330642] Kernel panic - not syncing: No working init found.  Try passing init= option to kernel. See Linux Documentation/admin-guide/init.rst for guidance.
[    2.341751] CPU: 1 PID: 1 Comm: swapper/0 Not tainted 4.14.128-112.105.amzn2.x86_64 #1
[    2.348702] Hardware name: Amazon EC2 t3.micro/, BIOS 1.0 10/16/2017
[    2.353256] Call Trace:
[    2.356063]  dump_stack+0x5c/0x82
[    2.359252]  ? rest_init+0x10/0xb0
[    2.362449]  panic+0xe4/0x252
[    2.365444]  ? do_execveat_common.isra.31+0x87/0x820
[    2.369391]  ? rest_init+0xb0/0xb0
[    2.372587]  kernel_init+0xeb/0xfc
[    2.375852]  ret_from_fork+0x35/0x40
[    2.379959] Kernel Offset: disabled
[    2.383237] ---[ end Kernel panic..]

I'm at a loss. No change has been made to the volume or the kernel since the system was last functional. What could have caused this problem?

I also get two other types of warnings before the end of the log (after the kernel panic):

[    2.398371] WARNING: CPU: 1 PID: 1 at kernel/sched/core.c:1198 set_task_cpu+0x177/0x180
[    2.405144] Modules linked in:
[    2.408252] CPU: 1 PID: 1 Comm: swapper/0 Not tainted 4.14.128-112.105.amzn2.x86_64 #1
[    2.414987] Hardware name: Amazon EC2 t3.micro/, BIOS 1.0 10/16/2017
[    2.419494] task: ffff88803deb4000 task.stack: ffffc90000194000
[    2.423834] RIP: 0010:set_task_cpu+0x177/0x180
[    2.427527] RSP: 0000:ffff88803e103e30 EFLAGS: 00010006
[    2.431555] RAX: 0000000000000200 RBX: ffff88803df18000 RCX: 0000000000000090
[    2.436436] RDX: ffffffffffffffd8 RSI: 0000000000000000 RDI: ffff88803df18000
[    2.441357] RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000003
[    2.446228] R10: 0000000000000010 R11: 0000000000000000 R12: ffff88803df18b34
[    2.451314] R13: 0000000000000000 R14: 0000000000000246 R15: 0000000000021380
[    2.456536] FS:  0000000000000000(0000) GS:ffff88803e100000(0000) knlGS:0000000000000000
[    2.463460] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[    2.467777] CR2: 0000000000000000 CR3: 000000000200a001 CR4: 00000000007606e0
[    2.472646] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[    2.477707] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[    2.482526] PKRU: 00000000
[    2.485418] Call Trace:
[    2.488215]  <IRQ>
[    2.490820]  try_to_wake_up+0x154/0x490
[    2.494284]  ? __queue_work+0x11c/0x410
[    2.497707]  ? call_timer_fn+0x130/0x130
[    2.501135]  call_timer_fn+0x30/0x130
[    2.504433]  run_timer_softirq+0x3f0/0x450
[    2.507943]  ? timerqueue_add+0x52/0x80
[    2.622453]  ? enqueue_hrtimer+0x37/0x80
[    2.625931]  __do_softirq+0xe3/0x2c7
[    2.629236]  irq_exit+0xbd/0xd0
[    2.632357]  smp_apic_timer_interrupt+0x78/0x130
[    2.636131]  apic_timer_interrupt+0x7d/0x90
[    2.639758]  </IRQ>
[    2.642441] RIP: 0010:panic+0x201/0x252
[    2.645950] RSP: 0000:ffffc90000197ec8 EFLAGS: 00000282 ORIG_RAX: ffffffffffffff10
[    2.652663] RAX: 000000000000009b RBX: ffffffff81607900 RCX: ffffffff82065548
[    2.657640] RDX: 0000000000000000 RSI: 0000000000000092 RDI: 0000000000000046
[    2.662559] RBP: ffffc90000197f40 R08: 0000000000000168 R09: 000000000000000f
[    2.667455] R10: ffffffff821ddbe0 R11: ffffffff826e872d R12: 0000000000000000
[    2.672391] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
[    2.677708]  ? rest_init+0x10/0xb0
[    2.681017]  ? panic+0x1fa/0x252
[    2.684341]  ? do_execveat_common.isra.31+0x87/0x820
[    2.688306]  ? rest_init+0xb0/0xb0
[    2.691632]  kernel_init+0xeb/0xfc
[    2.694988]  ret_from_fork+0x35/0x40
[    2.698394] Code: 0e ff ff ff 80 8b 6c 08 00 00 04 e9 2a ff ff ff 0f 0b e9 ce fe ff ff f7 43 5c fd ff ff ff 0f 84 d8 fe ff ff 0f 0b e9 d1 fe ff ff <0f> 0b e9 db fe ff ff 66 90 0f 1f 44 00 00 41 55 41 54 49 89 f5 
[    2.711861] ---[ end trace 7b2141584b80ec24 ]---
[    2.715622] sched: Unexpected reschedule of offline CPU#0!


[    2.723573] WARNING: CPU: 1 PID: 1 at arch/x86/kernel/smp.c:128 native_smp_send_reschedule+0x37/0x40
[    2.730938] Modules linked in:
[    2.734042] CPU: 1 PID: 1 Comm: swapper/0 Tainted: G        W       4.14.128-112.105.amzn2.x86_64 #1
[    2.741424] Hardware name: Amazon EC2 t3.micro/, BIOS 1.0 10/16/2017
[    2.745941] task: ffff88803deb4000 task.stack: ffffc90000194000
[    2.750354] RIP: 0010:native_smp_send_reschedule+0x37/0x40
[    2.754563] RSP: 0000:ffff88803e103e18 EFLAGS: 00010086
[    2.758654] RAX: 000000000000002e RBX: ffff88803e021380 RCX: ffffffff82065548
[    2.763553] RDX: 0000000000000000 RSI: 0000000000000096 RDI: 0000000000000046
[    2.768488] RBP: ffff88803e021380 R08: 0000000000000199 R09: 000000000000000f
[    2.773431] R10: ffff88803e103d40 R11: ffffffff826e872d R12: ffff88803df18000
[    2.778717] R13: ffff88803e103e68 R14: 0000000000000246 R15: 0000000000021380
[    2.783660] FS:  0000000000000000(0000) GS:ffff88803e100000(0000) knlGS:0000000000000000
[    2.790587] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[    2.794889] CR2: 0000000000000000 CR3: 000000000200a001 CR4: 00000000007606e0
[    2.799878] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[    2.804779] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[    2.809674] PKRU: 00000000
[    2.812641] Call Trace:
[    2.815456]  <IRQ>
[    2.818130]  check_preempt_curr+0x75/0x80
[    2.821640]  ttwu_do_wakeup+0x19/0x140
[    2.825031]  try_to_wake_up+0x1d1/0x490
[    2.828570]  ? __queue_work+0x11c/0x410
[    2.832008]  ? call_timer_fn+0x130/0x130
[    2.835543]  call_timer_fn+0x30/0x130
[    2.838958]  run_timer_softirq+0x3f0/0x450
[    2.842498]  ? timerqueue_add+0x52/0x80
[    2.845953]  ? enqueue_hrtimer+0x37/0x80
[    2.849532]  __do_softirq+0xe3/0x2c7
[    2.852846]  irq_exit+0xbd/0xd0
[    2.856006]  smp_apic_timer_interrupt+0x78/0x130
[    2.859839]  apic_timer_interrupt+0x7d/0x90
[    2.863436]  </IRQ>
[    2.866089] RIP: 0010:panic+0x201/0x252
[    2.869540] RSP: 0000:ffffc90000197ec8 EFLAGS: 00000282 ORIG_RAX: ffffffffffffff10
[    2.876169] RAX: 000000000000009b RBX: ffffffff81607900 RCX: ffffffff82065548
[    2.881295] RDX: 0000000000000000 RSI: 0000000000000092 RDI: 0000000000000046
[    2.886293] RBP: ffffc90000197f40 R08: 0000000000000168 R09: 000000000000000f
[    2.891396] R10: ffffffff821ddbe0 R11: ffffffff826e872d R12: 0000000000000000
[    2.896413] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
[    2.901459]  ? rest_init+0x10/0xb0
[    2.904800]  ? panic+0x1fa/0x252
[    2.908095]  ? do_execveat_common.isra.31+0x87/0x820
[    2.912120]  ? rest_init+0xb0/0xb0
[    2.915520]  kernel_init+0xeb/0xfc
[    2.918961]  ret_from_fork+0x35/0x40
[    2.922434] Code: f5 18 01 73 18 48 8b 05 f8 14 e1 00 be fd 00 00 00 48 8b 80 a0 00 00 00 e9 e7 49 9b 00 89 fe 48 c7 c7 c8 ad dc 81 e8 74 73 09 00 <0f> 0b c3 66 0f 1f 44 00 00 0f 1f 44 00 00 48 83 ec 20 65 48 8b 
[    2.936170] ---[ end trace 7b2141584b80ec25 ]---
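
For anyone triaging a similar log, here is a minimal Python sketch that pulls the panic message and the kernel release out of saved console output. The function name `summarize_console_log` and the `SAMPLE` constant are hypothetical, and the regular expressions are assumptions based only on the log format shown above; they are not a general dmesg parser.

```python
import re
from typing import Dict, Optional

# Assumed format: "[    2.330642] Kernel panic - not syncing: <message>"
PANIC_RE = re.compile(r"^\[\s*(\d+\.\d+)\]\s+Kernel panic - not syncing: (.*)$")
# Assumed format: "<release> #1", e.g. "4.14.128-112.105.amzn2.x86_64 #1"
RELEASE_RE = re.compile(r"\b(\d+\.\d+\.\d+-\S+)\s+#\d+")

# Two lines copied from the log in the question, used as sample input.
SAMPLE = (
    "[    2.330642] Kernel panic - not syncing: No working init found.  "
    "Try passing init= option to kernel. "
    "See Linux Documentation/admin-guide/init.rst for guidance.\n"
    "[    2.341751] CPU: 1 PID: 1 Comm: swapper/0 Not tainted "
    "4.14.128-112.105.amzn2.x86_64 #1\n"
)


def summarize_console_log(text: str) -> Dict[str, Optional[object]]:
    """Return the first panic timestamp/message and kernel release found."""
    panic_time: Optional[float] = None
    panic_message: Optional[str] = None
    kernel_release: Optional[str] = None
    for raw in text.splitlines():
        line = raw.strip()
        if panic_message is None:
            match = PANIC_RE.match(line)
            if match:
                panic_time = float(match.group(1))
                panic_message = match.group(2).strip()
        if kernel_release is None:
            match = RELEASE_RE.search(line)
            if match:
                kernel_release = match.group(1)
    return {
        "panic_time": panic_time,
        "panic_message": panic_message,
        "kernel_release": kernel_release,
    }


if __name__ == "__main__":
    for key, value in summarize_console_log(SAMPLE).items():
        print(f"{key}: {value}")
```

The kernel release it reports is the one the instance actually tried to boot, which is useful to compare against what is installed on the volume.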

psisis

Posted 2019-09-19T10:45:57.023

Reputation: 9

Are you sure you didn't run an apt upgrade (or similar) command before shutting down? The message says "missing init", so the initrd could be corrupt or missing. If you updated the kernel, perhaps try to boot into an earlier one from GRUB (if it's accessible), or perhaps try to chroot from another instance? I haven't used cloud computing before, but the "missing init" message makes me think it has something to do with an upgraded kernel and initrd – QuickishFM – 2019-09-19T11:28:20.043

No, nothing at all, which is why I'm worried it might happen again. Sure, I didn't shut it down under the best of conditions. I should have manually stopped all services first, but I don't know if it would have made a difference.

I was so frustrated I didn't want to chroot. But doing it now, I ran yum update on the kernel package, which was a pending update. Another thing that is so annoying about Amazon Linux 2 (and Red Hat in general), which I didn't know before, is the difficulty of finding packages. Glibc is one big clusterfudge – psisis – 2019-09-19T11:37:09.230

It probably didn't make a difference: every time I have force-closed a Linux machine, it booted back up, ran fsck to check for data corruption, and that was it. Perhaps the initrd and kernel are stored on a shared Amazon drive (so they serve other VMs at the same time) and, during the month, they removed an old initrd? I don't think that's the case, but it's all I can think of that relates to the one-month gap. I think you should contact AWS customer service with the error and stack trace; they'll be able to manually run fsck on your volume and restore the initrd if need be – QuickishFM – 2019-09-19T11:44:14.520

Alright. Thanks for all your help. I did chroot, and updated the kernel and every other package. But sadly it made no difference. I even created a new instance and a new volume from a snapshot, but got exactly the same problem: no initrd. If it's something to do with booting, GRUB and whatnot, I'm afraid it will be hard to diagnose and fix because of the virtualization environment and how unfamiliar I am with it. Might have to pay 30 bucks to get technical support... well well – psisis – 2019-09-19T12:52:19.447

If you were able to chroot into the volume, maybe you can fix the initrd from there. https://linoxide.com/linux-how-to/fixing-broken-initrd-image-linux/ is what I found online; I've done it before but I can't remember the exact commands. Essentially, by chrooting you act as if the system itself were booted, and you run update-initramfs or mkinitrd. Perhaps try that before shelling out the cash :) It has worked for me on a similar problem

– QuickishFM – 2019-09-19T13:11:07.417
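
The chroot approach described above can be sketched roughly as follows. This is only an illustration in Python that builds and prints the command sequence without running anything. The device name `/dev/xvdf1`, the mount point `/mnt/rescue`, and the function name `rescue_commands` are assumptions, not values from this thread; on Amazon Linux 2 the initramfs tool is dracut rather than update-initramfs, so the sketch uses that. The `nouuid` option is included because it may be needed when the broken volume is an XFS filesystem sharing a UUID with the rescue instance's own root volume.

```python
import shlex
from typing import List


def rescue_commands(
    device: str,
    kernel_release: str,
    mountpoint: str = "/mnt/rescue",
    nouuid: bool = False,
) -> List[List[str]]:
    """Build (but do not run) the commands to rebuild an initramfs via chroot."""
    mount_cmd = ["mount"]
    if nouuid:
        # Assumption: only needed for XFS volumes with a duplicate UUID.
        mount_cmd += ["-o", "nouuid"]
    mount_cmd += [device, mountpoint]

    commands = [["mkdir", "-p", mountpoint], mount_cmd]
    # Make the pseudo-filesystems visible inside the chroot.
    for pseudo in ("dev", "proc", "sys"):
        commands.append(["mount", "--bind", f"/{pseudo}", f"{mountpoint}/{pseudo}"])
    # dracut -f <image> <kernel version> overwrites the existing image.
    image = f"/boot/initramfs-{kernel_release}.img"
    commands.append(["chroot", mountpoint, "dracut", "-f", image, kernel_release])
    return commands


if __name__ == "__main__":
    # Hypothetical device name; check the real one with lsblk on the rescue host.
    for cmd in rescue_commands("/dev/xvdf1", "4.14.128-112.105.amzn2.x86_64"):
        print(shlex.join(cmd))
```

The kernel release passed in should match a directory under /lib/modules on the broken volume, otherwise dracut has no modules to pack into the image.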

Hm. I tried following your link, and every step went through fine. I also used dracut to make a new initramfs (https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/6/html/deployment_guide/sec-verifying_the_initial_ram_disk_image). I have both an initrd and an initramfs, which is confusing. More confusing is that my initramfs is named after the kernel, while the initrd is named initrd-plymouth.img, which is weird. Plymouth seems to be some RHEL kernel/wm program related to theme management. Anyway, nothing got fixed. Seems like a complicated problem. I'll have to ask support and post an answer

– psisis – 2019-09-19T13:40:35.113
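
On the naming confusion: a small Python sketch, assuming the RHEL-style convention where each `/boot/vmlinuz-<release>` is paired with `/boot/initramfs-<release>.img`, can list which installed kernels have a non-empty matching image. The function name `initramfs_status` is hypothetical, and a file such as initrd-plymouth.img is simply ignored because it does not follow that per-kernel pattern.

```python
from pathlib import Path
from typing import Dict


def initramfs_status(boot_dir: str = "/boot") -> Dict[str, bool]:
    """Map each kernel release in boot_dir to whether its initramfs exists."""
    boot = Path(boot_dir)
    status: Dict[str, bool] = {}
    if not boot.is_dir():
        return status
    for kernel in sorted(boot.glob("vmlinuz-*")):
        release = kernel.name[len("vmlinuz-"):]
        image = boot / f"initramfs-{release}.img"
        # An empty file is treated as missing, since it cannot be a valid image.
        status[release] = image.is_file() and image.stat().st_size > 0
    return status


if __name__ == "__main__":
    # Point this at <mountpoint>/boot when inspecting a volume from a rescue host.
    for release, ok in initramfs_status("/boot").items():
        print(f"{release}: {'ok' if ok else 'MISSING initramfs'}")
```

This only checks that the files exist and pair up by name. It says nothing about whether the image contents are valid or whether the GRUB configuration points at them.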

No answers