
Unable to SSH into a machine, I connected it to a monitor and found the following:

[image: screenshot of the console output]

The machine is running Ubuntu Server 18.04 LTS on a first-generation 8-core Ryzen 1700. I've since restarted the machine and it works fine, but I'm not sure what caused this in the first place and want to avoid it happening again.


Greg
  • A good place to start is typing the error messages into your favourite search engine. – user9517 Feb 10 '19 at 22:27
  • You should also examine your logs to see if there is any more relevant information. – user9517 Feb 10 '19 at 22:28
  • @lain I've googled the problem and didn't find anything useful. One solution online said it was related to the Nvidia GPU, which I've since removed, and the issue still happens. It might be something obvious that I am missing. Which logs should I check, and what should I be looking for? – Greg Feb 10 '19 at 23:22
  • Do read this: https://www.kernel.org/doc/Documentation/RCU/stallwarn.txt Judging by the other errors, the issue was somewhere in snapd, so one of the apps installed via snapd might be the cause of the lockup, which then cascaded to the rest of the system. – Gothrek Feb 13 '19 at 17:32

1 Answer


This is a random issue with 1st and 2nd gen Ryzens (at least). You'll find several reports and no real solution. I have a Ryzen 2700U, and in forums people always suggest trying a newer kernel. I've tried kernels 5.0, 5.4, 5.6 and 5.8, and had the issue with all of them.

I've recently increased kernel.watchdog_thresh from 10s (default) to 60s (max):

sudo sysctl -w kernel.watchdog_thresh=60

To make it permanent, add the following to /etc/sysctl.conf:

kernel.watchdog_thresh=60
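
If you would rather not edit the file by hand, the same change can be scripted. This is only a minimal sketch: `set_watchdog_thresh` is a hypothetical helper name, not part of any package, and it assumes the setting lives in a plain `key=value` sysctl file such as /etc/sysctl.conf. It drops any existing `kernel.watchdog_thresh` line and appends the new value, so running it twice does not create duplicates.

```shell
# Hypothetical helper (not a standard tool): set kernel.watchdog_thresh
# in a sysctl config file without leaving duplicate entries behind.
set_watchdog_thresh() {
    conf_file="$1"   # e.g. /etc/sysctl.conf
    value="$2"       # threshold in seconds, e.g. 60
    tmp_file="$(mktemp)" || return 1

    # Copy every line except existing kernel.watchdog_thresh assignments.
    if [ -f "$conf_file" ]; then
        grep -v '^[[:space:]]*kernel\.watchdog_thresh[[:space:]]*=' \
            "$conf_file" > "$tmp_file" || true
    fi

    # Append the new assignment, then write the result back in place.
    printf 'kernel.watchdog_thresh=%s\n' "$value" >> "$tmp_file"
    cat "$tmp_file" > "$conf_file" && rm -f "$tmp_file"
}

# Example, run as root:
#   set_watchdog_thresh /etc/sysctl.conf 60
#   sysctl -p                       # reload /etc/sysctl.conf
#   sysctl kernel.watchdog_thresh   # show the value currently in effect
```

As noted in the comments below, this only changes how long the watchdog waits before reporting a lockup.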

It's still too early to say whether it worked, but I have a good feeling.

user618360
  • Those are the steps to increase the timeout after which the problem is reported. Are you saying that changed anything about the underlying problem? – anx Feb 21 '21 at 10:41
  • It took a bit longer to happen, but that was probably just a coincidence. – user618360 Feb 23 '21 at 21:47