Here's my situation:
I'm having a very occasional problem where a (very) remote embedded PC/104 system running Debian seems to lose the ability to use any communications interface. I can't reach it over Ethernet or the serial ports (the console). After I cycle the power, the system logs show nothing amiss: they just end abruptly and resume minutes or hours later, once the power has been cycled.
I suspect the system isn't locked up, because I have a Python script that tries to ping google.com and, if the ping fails, uses an IO pin to toggle the wireless modem's power supply via a relay.
So, I have a completely unresponsive system, and a modem which is being power-cycled every ten minutes by that same system. Fortunately, between reboots, I can use the modem to power-cycle the processor and get back to collecting data.
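For context, the logic of that script is roughly as follows. This is a minimal sketch rather than the actual code: the function names, the ping flags, and the placeholder relay function are assumptions for illustration only.

```python
import subprocess
import time


def host_reachable(host="google.com", timeout_s=5):
    """Return True if a single ping to host succeeds."""
    try:
        result = subprocess.run(
            # -c 1: send one packet; -W: reply timeout in seconds (Linux ping)
            ["ping", "-c", "1", "-W", str(timeout_s), host],
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout_s + 5,
        )
    except (subprocess.TimeoutExpired, OSError):
        return False
    return result.returncode == 0


def toggle_modem_power(off_seconds=5):
    # Hypothetical placeholder: the real script drives an IO pin
    # wired to the relay on the modem's power supply.
    print("modem power off for", off_seconds, "s, then back on")


def check_once(is_reachable=host_reachable, reset_modem=toggle_modem_power):
    """Run one check and power-cycle the modem if the ping fails."""
    if is_reachable():
        return True
    reset_modem()
    return False


def main(interval_s=600):
    # Ten-minute loop, matching the power-cycle interval I observe.
    while True:
        check_once()
        time.sleep(interval_s)
```

Since the modem keeps getting power-cycled while the system is unreachable, a loop like this must still be getting scheduled, which is why I don't think the kernel has locked up.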
The system has a hardware watchdog, and I've had watchdogd set up and running for a while. Last time this happened, I tried adding the line:
file=/var/log/messages
to watchdog.conf, but it didn't help. I then read that:
"When using file mode watchdog will try to stat(2) the given files. Errors returned by stat will not cause a reboot. For a reboot the stat call has to last at least one minute."
I don't know enough about stat to know how it might respond to losing the ability to write to disk, but I suspect it doesn't just hang.
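One way I could check this is with a small probe like the one below. It is only a sketch under my own assumptions: the function names are hypothetical, and it is not something watchdogd does itself. The idea is to time a stat call (which only reads file metadata and never writes) alongside a small write followed by fsync, which actually exercises the write path.

```python
import os
import tempfile
import time


def timed_stat(path):
    """Return (ok, seconds) for a single stat call on path."""
    start = time.monotonic()
    try:
        os.stat(path)
        ok = True
    except OSError:
        ok = False
    return ok, time.monotonic() - start


def write_probe(directory):
    """Return (ok, seconds) for writing and fsyncing a small temp file."""
    start = time.monotonic()
    try:
        fd, path = tempfile.mkstemp(dir=directory)
        try:
            os.write(fd, b"probe\n")
            os.fsync(fd)  # force the data out to the device
        finally:
            os.close(fd)
            os.unlink(path)
        ok = True
    except OSError:
        ok = False
    return ok, time.monotonic() - start
```

If the disk is in a state where writes fail with an error, write_probe would report a failure while timed_stat might still succeed. If a call blocks indefinitely, the function simply never returns, so this only helps when run with some external timeout.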
I also just noticed that watchdogd has a --sync option, but the man page doesn't say much about what happens if the sync fails. My interval is 2 seconds. Are there reasons not to sync an SSD every two seconds?
-Thanks