
I'm trying to set up a cron job to reboot devices daily, with a safe fallback to a SysRq reset in case the reboot hangs. The issue is that SSH gets killed but the device never reboots, so it is lost and requires costly human intervention to restart.

The script that used to work for a while:

5 5 * * * root /sbin/reboot -f; sleep 30; /bin/echo `date -u +'\%Y-\%m-\%dT\%H:\%M:\%SZ'` >> /var/log/player-reboot.error.log; echo 1 > /proc/sys/kernel/sysrq; sync; echo b > /proc/sysrq-trigger

However, it's pretty brutal (a hard reboot -f), and some of our devices recently failed to recover (a couple out of thousands each day).

I'm not sure what hangs. It looks like the log file is never written, so I'd guess that either the reboot itself or the echo hangs.

I was looking to use ampersands (&) so that nothing blocks and a proper reset is sure to happen eventually. However, it does not seem to work at all (no more reboots):

5 5 * * * root /sbin/shutdown -r +2 &; sleep 240; /bin/echo `date -u +'\%Y-\%m-\%dT\%H:\%M:\%SZ'` >> /var/log/player-reboot.error.log &; echo 1 > /proc/sys/kernel/sysrq; sleep 1; echo b > /proc/sysrq-trigger

Can I use the ampersand in a cron entry? Do you know a smarter way to achieve the desired result? Thanks!

Olivier
  • Do you see any errors in the cron log file (/var/log/cron)? I expect that you'll see shell syntax errors there because of the ; after the & - you want just &. I find it easier to put long command lines like this into a file and call the file from cron rather than having everything in the crontab file - generally means that things are more readable. – Paul Haldane Nov 27 '17 at 14:22
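The point about `&;` can be checked without waiting for cron to fire. The following is a minimal sketch, assuming a POSIX `sh` is available; `check_syntax` is a hypothetical helper name, and `sh -n` only parses the command line without running it.

```shell
#!/bin/sh
# check_syntax (hypothetical helper): parse a command line with "sh -n",
# which reads the commands but does not execute them.
check_syntax() {
    if sh -n -c "$1" 2>/dev/null; then
        echo "ok"
    else
        echo "syntax error"
    fi
}

# "&" already terminates a command, so a ";" right after it is rejected.
check_syntax 'sleep 1 &; echo done'
# With "&" alone, the line parses and the next command follows directly.
check_syntax 'sleep 1 & echo done'
```

Since cron hands the whole line to the shell, a parse error like this would likely mean that none of the commands on the line run, which would match the "no more reboots" symptom.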

1 Answer


A simpler approach is to run a separate process that checks whether the uptime is greater than 24 hours (e.g. 25h). If the check returns true, something clearly went wrong with the reboot, so the machine must be restarted via SysRq.

For maximum reliability, your periodic check should not depend on crond (which can be killed by the hanging shutdown process). Rather, use a polling scheme, something like this:

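#!/bin/bash
# Watchdog: force a SysRq reboot once uptime exceeds the limit.
max_uptime=$((25*3600))  # 25 hours, in seconds
sleep_time=3600          # poll once per hour
while true; do
    # /proc/uptime starts with the uptime in seconds; keep the integer part
    current_uptime=$(grep -o '^[[:digit:]]\+' /proc/uptime)
    echo "current uptime: $current_uptime seconds"
    if [ "$current_uptime" -gt "$max_uptime" ]; then
        echo "reboot!"
        # Enable SysRq, flush buffers, then reboot immediately
        echo 1 > /proc/sys/kernel/sysrq; sync; echo b > /proc/sysrq-trigger
    else
        echo "not now!"
    fi
    echo "sleeping..."
    sleep "$sleep_time"
done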

You can start the above script at boot with an @reboot crond entry, or from rc.local and friends.
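Before rolling a watchdog like this out to a fleet, it may help to exercise the threshold comparison without triggering a real reset. The sketch below is a variant written for that purpose, not the script above: `uptime_exceeded` is a hypothetical helper that takes the limit in seconds and, optionally, a file in /proc/uptime format, so that a fake file can stand in for the live system.

```shell
#!/bin/sh
# uptime_exceeded LIMIT [FILE] (hypothetical helper): succeed if the uptime
# recorded in FILE (default: /proc/uptime) is greater than LIMIT seconds.
uptime_exceeded() {
    limit=$1
    file=${2:-/proc/uptime}
    # The first field is the uptime in seconds with a fractional part.
    read -r up rest < "$file" || return 2
    # Drop the fraction so the value can be compared as an integer.
    [ "${up%%.*}" -gt "$limit" ]
}

# Dry run against a fake uptime file (26 hours) instead of the live system.
fake=$(mktemp)
echo "93600.00 1000.00" > "$fake"
if uptime_exceeded $((25*3600)) "$fake"; then
    echo "would reset via SysRq"
else
    echo "would keep waiting"
fi
rm -f "$fake"
```

Keeping the comparison in a function that accepts a file argument means the SysRq write never has to happen during testing; only the decision is checked.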

shodanshok
  • Interesting idea, but how can you be sure that shutdown/reboot won't exit this program before hanging? For instance sshd is down when such issues do occur. – Olivier Nov 28 '17 at 20:41
  • `sshd` is stopped (sent SIGTERM) by the ssh service file/unit, which is triggered by runlevel 6 (or equivalent). A script like the one above will be killed by `init` just before the machine reboots. So yes, it *is* possible for the machine to hang after the script is killed but before the machine reboots; however, it is very unlikely. – shodanshok Nov 28 '17 at 20:52