
We had a system go offline this morning. The only thing in syslog is:

Mar 20 15:27:15 fooserver systemd[1]: Received SIGINT.
Mar 20 15:27:15 fooserver systemd[1]: Starting Synchronise Hardware Clock to System Clock...
Mar 20 15:27:15 fooserver systemd[1]: Stopping system-ifup.slice.
Mar 20 15:27:15 fooserver systemd[1]: Removed slice system-ifup.slice.
Mar 20 15:27:15 fooserver rsyslogd: [origin software="rsyslogd" swVersion="8.4.2" x-pid="579" x-info="http://www.rsyslog.com"] exiting on signal 15.

Then there is a five-hour gap until it was manually restarted.

When it came back up, everything operated as it should.

No other log files (I grepped for this time period in everything that was in /var/log) show anything unusual.

The best I've got so far is someone was in the equipment room and pressed the button (accidentally). But that's thin. Only a few people have access, and I don't think any were on site at that time.

Is there anywhere else to look for this? Or, perhaps, anything else I could set to monitor for this for next time?

I currently have this command running in screen trying to catch it for next time: sysdig -p '%proc.pname[%proc.ppid]: %proc.name -> %evt.type(%evt.args)' evt.type=kill
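A complementary check to the sysdig watcher, offered only as a rough sketch: scan the current and rotated syslog files for any earlier "Received SIG..." lines from systemd, to see whether this has happened before. The helper name `received_signals` is hypothetical, and the pattern assumes the traditional syslog line format shown in the excerpt above.

```shell
#!/bin/sh
# Sketch: print every "systemd[1]: Received SIG..." line found in the given
# log files. received_signals is a hypothetical helper name, not an existing
# tool, and the pattern assumes the syslog format shown in the question.
received_signals() {
    # -h suppresses file names so only the matching log lines are printed.
    grep -h 'systemd\[1\]: Received SIG' "$@"
}
```

It could be called as `received_signals /var/log/syslog /var/log/syslog.1`; those paths are an assumption about where rsyslog writes on this host, and compressed rotations would need to be decompressed first.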

DrDamnit
  • Have you checked `fooserver` logs? That could indicate a problem with the software that occurs with some specific input. – Tero Kilkanen Mar 20 '17 at 19:13
  • More to the point, you need to check the systemd journal. There should certainly be more than that logged. – Michael Hampton Mar 20 '17 at 20:28
  • I'd also be looking at the /crash subdirectory, along with any out-of-band management logs. SIGINT is basically a hard interrupt, and could be anything from someone tagging the power button, to a power supply fault or system component glitch that caused the systems management processor to initiate a shutdown, rather than let the "magical computron smoke" escape. – George Erhard Mar 20 '17 at 20:42
  • @Tero - yes. I checked the logs. See "grep" comment in OP. – DrDamnit Mar 21 '17 at 20:41
  • @MichaelHampton Good idea. I will peruse the systemd journal. – DrDamnit Mar 21 '17 at 20:42
  • @GeorgeErhard I did not notice a /crash directory. Should it be off the root? – DrDamnit Mar 21 '17 at 20:42
  • @MichaelHampton systemd journal (journalctl) only goes back to the reboot _after_ the crash. :-( – DrDamnit Mar 21 '17 at 20:49
  • /var/crash should hold a crash dump, which... won't be a readable log. You'll likely need to parse it with a debugger to see what exactly triggered the panic. – George Erhard Mar 22 '17 at 15:59
  • ```
    root@fooserver:/# find . -type d -name "*crash*"
    ./usr/src/linux-headers-3.16.0-4-amd64/include/config/crash
    root@fooserver:/#
    ```
    Seems there is no crash directory anywhere? This is starting to look like someone pressed the button. – DrDamnit Mar 22 '17 at 18:08
  • No ILOM, RSA, or other out-of-band system manager you can access? If not, this is a pretty big oversight - not only should you be able to remotely "press the button" if needed, but also for hardware-level logging and error trapping. – George Erhard Mar 24 '17 at 17:28
  • From `man systemd`: `SIGINT - Upon receiving this signal the systemd system manager will start the ctrl-alt-del.target unit.` Maybe someone pressed CTRL+ALT+DEL? – Martin von Wittich Oct 11 '17 at 09:20
  • Which exact distribution and version you are using can be relevant in figuring out which log files to look in for more clues. – kasperd Jul 18 '18 at 10:04
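
Two points from the comments above can be sketched in shell: the journal only reaching back to the latest boot, and the `ctrl-alt-del.target` quote from `man systemd`. This is a minimal sketch, assuming journald is running with its default `Storage=auto` setting, in which case logs are kept across reboots only if `/var/log/journal` exists. The helper names `journal_is_persistent` and `enable_persistent_journal` are hypothetical; the commands they call are standard systemd tools, but whether they suit this particular host is an assumption.

```shell
#!/bin/sh
# Sketch, assuming journald uses its default Storage=auto setting: the journal
# survives a reboot only if the /var/log/journal directory exists.

# Hypothetical helper: succeeds if the persistent journal directory exists.
# An alternative directory can be passed as the first argument.
journal_is_persistent() {
    [ -d "${1:-/var/log/journal}" ]
}

# Hypothetical helper: create the directory, fix its ownership and ACLs via
# systemd-tmpfiles, and restart journald so it starts writing there.
# Must be run as root.
enable_persistent_journal() {
    mkdir -p /var/log/journal
    systemd-tmpfiles --create --prefix /var/log/journal
    systemctl restart systemd-journald
}

# After the next unexpected shutdown (left as comments because they only make
# sense on the affected host):
#   journalctl --list-boots                # boots the journal knows about
#   journalctl -b -1 -e                    # jump to the end of the previous boot
#   systemctl status ctrl-alt-del.target   # unit systemd starts on SIGINT
```

Running `journal_is_persistent || enable_persistent_journal` as root would turn persistence on only where it is missing, so the messages leading up to a future shutdown should still be readable after the reboot.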

0 Answers