5

I rent a dedicated server (Intel Haswell CPU, custom hardware) from a low-cost hosting service and run 64-bit CentOS 6.4 on it with the stock kernel (2.6.32-358.14.1.el6.x86_64).

Every few weeks it hangs and the other customers seem to have similar problems.

In the dmesg output I see (here is the full dmesg output):

CPU0: Intel(R) Core(TM) i7-4770 CPU @ 3.40GHz stepping 03
....
NMI watchdog enabled, takes one hw-pmu counter.
....
iTCO_wdt: Intel TCO WatchDog Timer Driver v1.07rh
iTCO_wdt: Found a Lynx Point TCO device (Version=2, TCOBASE=0x1860)
iTCO_wdt: initialized. heartbeat=30 sec (nowayout=0)

and in the process list I see:

#  ps uawwwx|grep [w]atchdog
root         6  0.0  0.0      0     0 ?        S    Aug22   0:00 [watchdog/0]
root        10  0.0  0.0      0     0 ?        S    Aug22   0:00 [watchdog/1]
root        14  0.0  0.0      0     0 ?        S    Aug22   0:00 [watchdog/2]
root        18  0.0  0.0      0     0 ?        S    Aug22   0:00 [watchdog/3]
root        22  0.0  0.0      0     0 ?        S    Aug22   0:00 [watchdog/4]
root        26  0.0  0.0      0     0 ?        S    Aug22   0:00 [watchdog/5]
root        30  0.0  0.0      0     0 ?        S    Aug22   0:00 [watchdog/6]
root        34  0.0  0.0      0     0 ?        S    Aug22   0:00 [watchdog/7]

Does this mean that a hardware watchdog is already active on my server and will reboot the machine within 30 seconds of it freezing?

(I have put kernel.panic=10 in /etc/sysctl.conf so that it no longer gets stuck in the kdb console.)

Or do I have to install and start the CentOS package watchdog?

ewwhite
Alexander Farber
  • 4
    Why are you OK with this server hanging so frequently? Is it non-critical? – sofly Sep 07 '13 at 13:13
  • Because the price is good (50 EUR for Haswell + 32 GB RAM) and I also have all my domains there... – Alexander Farber Sep 07 '13 at 16:16
  • 1
    Not the answer you want to hear, but the correct answer for your situation is to ditch the crappy hosting company. Of course, if this is a Development environment, it's off-topic for Serverfault. Which I'm certainly willing to ignore as this looks generally interesting and useful anyway. – Magellan Sep 07 '13 at 17:23
  • Instead of trying to mitigate this error by making it reboot on hang, you should probably just leave the host and find another one that doesn't have hanging machines, since it sounds like this could be an issue with the host. Or contact them? – sofly Sep 08 '13 at 00:51
  • 1
    @SoFLy The OP says this has been [discussed with the host](http://serverfault.com/questions/533793/is-a-hardware-watchdog-already-active-at-my-centos-server/537127#comment616829_537127) via public forum. It still doesn't mean that the host is doing a good job. This is likely a bad hardware/driver/OS interaction. – ewwhite Sep 08 '13 at 03:07
  • So, similar conclusion... switch to a more reliable host without botched hardware. – sofly Sep 09 '13 at 18:33
  • 2
    My server is hosting a little Facebook game. The total amount of income from players is EUR 150. The server costs EUR 50 + I have some more expenses. Could you guys please stop chanting "switch the hoster", because I'm actually happy with it and am willing to take a reboot every few weeks? I just need to configure the watchdog properly, so that the server restarts by itself. – Alexander Farber Sep 10 '13 at 09:00
  • OK, it's cheap, and you only have to reboot every few weeks. When you start making 1500€ a month on it, you are going to be losing a lot more money every time you have to reboot, and then you will have to think about moving somewhere more reliable. – Michael Hampton Sep 14 '13 at 01:02
  • 2
    Ok, thanks for this deep insight, even though I'm sure I'll never make EUR 1500 with it – Alexander Farber Sep 14 '13 at 08:22

3 Answers

9

Well, there are a few issues to tackle here...

  • What happens when the server hangs? What's on the screen? What's in the logs? Do you have to engage with the hosting provider to reboot? Can you perform the reset on your own?

  • Your server should not be hanging, stalling or crashing!! Having worked in environments where low-end, DIY or custom hardware is used, I understand that the service provider's aim is to cut costs. However, if there's a stability issue, the onus is on the provider to remediate those issues. It's not difficult to build a stable Linux server platform. Yet, it happens more often than it should. If the combination of hardware/software/OS/firmware is toxic, that's a bad sign. The provider should be operating at a scale where they should be able to understand problems before they impact multiple clients.

  • Does your hardware have an IPMI device? Do YOU have IPMI access? Typically, watchdogs are part of your out-of-band management device. For instance, HP ProLiant servers have their Automatic Server Recovery (ASR) feature set to handle this.

  • The device your system detects is part of the Intel chipset in use. So there is technically a watchdog device and there is generic kernel support for it (it looks like it's in the CentOSPlus kernel, not the one you have). However, the watchdog package can help as a software-level watchdog, outside of the hardware hooks you may have.
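Whether the running kernel really drives the chipset watchdog can be checked directly. Here is a minimal sketch, assuming the driver is called iTCO_wdt (as in the dmesg output in the question) and that it registers /dev/watchdog. The function name wd_status is made up for illustration:

```shell
# Sketch only: wd_status is a hypothetical helper, not part of any package.
# It reports whether a watchdog device node exists and whether a given driver
# shows up in a module listing (lsmod output read from stdin).
wd_status() {
    dev="${1:-/dev/watchdog}"   # device node to look for
    drv="${2:-iTCO_wdt}"        # driver name to look for (assumed from dmesg)
    if [ -e "$dev" ]; then
        echo "device: $dev present"
    else
        echo "device: $dev missing"
    fi
    if grep -q "^$drv "; then
        echo "driver: $drv loaded"
    else
        echo "driver: $drv not loaded"
    fi
}

# On the real machine (as root):
#   lsmod | wd_status
#   dmesg | grep -i -e iTCO -e watchdog
```

If both lines come back positive, the hardware side is in place and only something to feed the device is missing.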

But again, you're treating the symptom here. It's important to get to the root cause. If other customers are encountering these issues, you all need to resolve this with the service provider.

ewwhite
  • On the screen is a useless kernel stack trace. This has all been discussed en masse at the hoster's forum... I don't have IPMI access, because it costs extra money... I am sorry, but your answer is useless for me. – Alexander Farber Sep 07 '13 at 16:23
  • Well, think of this as a hosting problem. Likely driver or hardware related. You have a watchdog on your server hardware, but the installed Linux kernel can't take advantage of it. The support was built into the CentOS Plus kernel, though. This was backported from newer kernel revisions. You can install the watchdog app as well. But the point of this answer was to see what options you had available to you. E.g. IPMI is a very good start. – ewwhite Sep 07 '13 at 16:28
7

Linux has a generic watchdog interface. You can use it by either enabling the NMI watchdog your iTCO_wdt hardware supports or by installing and configuring a software watchdog which does not depend on the hardware.
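To see which of the two mechanisms is in play, a rough sketch follows. It assumes the running kernel exposes the lockup detector as kernel.nmi_watchdog under /proc/sys, and nmi_watchdog_state is a name invented here. The software fallback is the kernel's softdog module, which provides /dev/watchdog without any hardware timer:

```shell
# Sketch only: nmi_watchdog_state is a hypothetical helper.
# It reads the lockup detector switch (assumed to be kernel.nmi_watchdog).
nmi_watchdog_state() {
    f="${1:-/proc/sys/kernel/nmi_watchdog}"
    [ -r "$f" ] || { echo "unknown"; return 0; }
    case "$(cat "$f")" in
        1) echo "enabled" ;;
        0) echo "disabled" ;;
        *) echo "unknown" ;;
    esac
}

# On the real machine (as root):
#   nmi_watchdog_state    # state of the kernel's lockup detector
#   modprobe softdog      # software timer behind /dev/watchdog
```

Note that the lockup detector and the /dev/watchdog device are separate things: the first lives inside the kernel, the second has to be fed from user space.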

sciurus
  • Thanks. Will the "watchdog" CentOS package use the "iTCO_wdt" thingie seen in my dmesg or are they unrelated? – Alexander Farber Sep 08 '13 at 07:03
  • I edited my answer to make it more clear that they're unrelated. – sciurus Sep 08 '13 at 21:29
  • 1
    But the hardware support for this particular watchdog is not available in the kernel the OP is using. If it's a hardware/driver interaction, the software watchdog may not be the answer. – ewwhite Sep 09 '13 at 14:41
  • @ewwhite how do you know their kernel doesn't support the hardware watchdog? I think their dmesg output shows it's supported. – sciurus Sep 09 '13 at 20:48
  • 1
    The code for that Intel chipset was not backported into their kernel. It is in the CentOSPlus variant of the kernel. [See the kernel changelog](http://rpmfind.net/linux/RPM/centos/centosplus/6.4/x86_64/Packages/kernel-firmware-2.6.32-358.6.2.el6.centos.plus.noarch.html) or: `[watchdog] iTCO_wdt: add Intel Lynx Point DeviceIDs (John Villalovos) [738470]` – ewwhite Sep 09 '13 at 21:17
  • 1
    @ewwhite you still didn't say *why* you think the code wasn't backported into their kernel. I think you are mistaken for two reasons. One is that you see the device being recognized in their dmesg output. The other is that on a stock RHEL 6 system running 2.6.32-358, I grepped the kernel-firmware 2.6.32 changelog and it shows the same message that you pasted. Since CentOS rebuilds the RHEL kernel, the support should be there too. – sciurus Sep 09 '13 at 22:54
1

On CentOS

yum install watchdog

On Ubuntu

apt-get install watchdog
#optional
#apt-get install das-watchdog

Then...

sudo vi /etc/watchdog.conf

Of course you should know that in Vim the colon (:) opens the command line, where w writes your changes, w! forces the write and q quits. (You can also move around with hjkl, delete with d, insert with i and leave insert mode with Escape.)

Uncomment:

 watchdog-device = /dev/watchdog
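If you would rather not open an editor, the same change can be scripted. This is only a sketch: enable_watchdog_device is a made-up name, and it assumes GNU sed and that the shipped file contains the line commented out with a leading #:

```shell
# Sketch only: enable_watchdog_device is a hypothetical helper.
# It uncomments the watchdog-device line in a watchdog.conf-style file
# and prints the resulting line so you can check it.
enable_watchdog_device() {
    conf="${1:-/etc/watchdog.conf}"
    # Strip a leading '#' (and any whitespace after it) from that one setting only
    sed -i 's/^#[[:space:]]*\(watchdog-device[[:space:]]*=\)/\1/' "$conf"
    grep '^watchdog-device' "$conf"
}

# On the real machine (as root):
#   cp /etc/watchdog.conf /etc/watchdog.conf.bak
#   enable_watchdog_device
```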

See

 man watchdog.conf

For more... when you're done...

service watchdog restart

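As far as I can tell, those [watchdog/N] kernel threads belong to the kernel's own lockup detector, not to the watchdog package. They don't feed /dev/watchdog, so the hardware timer does nothing until a daemon opens that device and keeps writing to it.

This should help you cope with flaky hardware (unreliable power supplies, for example) by turning random lock-ups into random reboots.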

You can test it with

echo *todo* placeholder while I test how to test it, in case I reboot...
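One way to test it that I know of, sketched under assumptions: most watchdog drivers start the timer when the device is opened and keep it running if the device is closed without the magic character V (as long as nowayout=0, which the dmesg above shows). I assume iTCO_wdt behaves that way too. wd_arm and wd_disarm are names invented for this example, and you should only try this on a machine you can afford to have reset:

```shell
# Sketch only: wd_arm and wd_disarm are hypothetical helpers.
# Opening the device starts the timer; closing it without having written
# 'V' leaves the timer running, so nobody feeds it and it should expire.
wd_arm() {
    printf '1' > "${1:-/dev/watchdog}"
}

# Writing 'V' right before the close asks the driver to stop the timer.
wd_disarm() {
    printf 'V' > "${1:-/dev/watchdog}"
}

# On the real machine (as root, with the watchdog service stopped):
#   sync; wd_arm    # expect a reset after roughly the heartbeat interval
#   wd_disarm       # run this before the timeout to call the test off
```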

If it still doesn't work, you might have to sweat a little more and find out what driver your platform supports.

Personally, I would try loading and testing each watchdog timer module individually, with something like this, run as root in the shell:

# Each candidate driver is armed and then left unfed; if it works, the machine reboots.
LOG=/var/log/watchdog-test.log
log() { echo "$*" | tee -a "$LOG"; sync; }

log "Testing default... "
service watchdog stop
log "Didn't work, we're still here..."
# If the default watchdog does work, I bet stopping the service disabled the default watchdog then... *todo* test and update this
echo "Modules still loaded..."
DOGS=$(lsmod | grep -e wdt -e dog | cut -d' ' -f1)
echo $DOGS
for dog in $DOGS; do
  echo "Unloading $dog"
  rmmod "$dog" || { echo "Oops.. didn't work, $dog won't unload"; sleep 70; }
done
echo "Did they all unload...? If not, I think the rest of this is a waste of time... reboot and skip that one next time"
sleep 63
# Candidates: every module in the running kernel's watchdog directory
DOGS=$(find "/lib/modules/$(uname -r)" -path '*/watchdog/*' -name '*.ko*' | sed -e 's@.*/@@' -e 's@\.ko.*$@@' | sort -u)
for dog in $DOGS; do
  log "Testing $dog... "
  if modprobe -v "$dog" && [ -e /dev/watchdog ]; then
    dmesg | tail -5
    log "$dog Loaded. Ready for a reboot?"
    # *todo* force a quicker timeout? *todo* read kernel source
    # Opening the device arms the timer; closing it without writing 'V' should leave it running
    echo 1 > /dev/watchdog
    sleep 63
  fi
  rmmod "$dog"
  log "$dog Didn't work, we're still here..."
done

If it runs all the way through without a reboot, then none of the modules seemed to work. If your PC reboots, when it boots up:

tail -1 /var/log/watchdog-test.log

Will show a likely candidate... Now make sure your server loads it...
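On CentOS 6, one way to do that is a small script under /etc/sysconfig/modules, which (as far as I know) rc.sysinit runs at boot if it is executable and named *.modules. A sketch, with persist_module being a name I made up:

```shell
# Sketch only: persist_module is a hypothetical helper.
# Assumption: CentOS 6 runs every executable /etc/sysconfig/modules/*.modules at boot.
persist_module() {
    mod="$1"                               # module name, e.g. the one from the log
    dir="${2:-/etc/sysconfig/modules}"     # target directory
    printf '#!/bin/sh\n/sbin/modprobe %s >/dev/null 2>&1\n' "$mod" > "$dir/$mod.modules"
    chmod +x "$dir/$mod.modules"
}

# On the real machine (as root), with the module name taken from the log:
#   persist_module iTCO_wdt
```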

Ubuntu seems to load whichever module you name in this file:

sudo vi /etc/default/watchdog

I haven't tested this. If you do, come and update this answer. todo Here are some hints:

  • SUSE: https://www.suse.com/support/kb/doc?id=7016880
  • Ubuntu: https://github.com/miniwark/miniwark-howtos/wiki/Hardware-Watchdog-Timer-setup-on-Ubuntu-12.04
  • ODROID: http://odroid.com/dokuwiki/doku.php?id=en:odroid_linux_watchdog

Dagelf