3

First instance: a CentOS 5.4 (64-bit) server with plenty of resources. We installed Hudson (http://wiki.hudson-ci.org/display/HUDSON/Meet+Hudson) and everything was hunky-dory. Several days or weeks later (can't remember which), the entire server would randomly freeze, requiring a hard reboot. There was nothing running on it other than what Hudson needed.

New gig: a freshly installed CentOS 5.5 (64-bit) server. Within a month or so, the freezing started again, with no apparent reason.

We have identical servers running all over the place, serving everything from Tomcat to JBoss to basic Apache stuff, all without ever freezing or crashing.

It seems Hudson is the problem - we just can't figure out what it does differently from typical configs.

So 2 questions:

  1. Any Hudson experts out there want to chime in?
  2. Troubleshooting: What are the right logs to be looking at? Where might we find an entry that says "X caused the system to crash" etc.?
Joshua
  • 593
  • 2
  • 19
  • When you freshly installed CentOS 5.5 64bit, was it still on the same physical hardware? Have you done any testing on the hardware before reinstalling the OS? Been able to check the physical console on a crash? It would also help if you listed the system specs, physical virtual, etc. – Andy Shinn Apr 01 '11 at 04:13
  • Two separate hardware platforms, both known good. – Joshua Apr 01 '11 at 05:09
  • 1
    I'm on CentOS 5.6 64bit as well and I am seeing the same issue. The server was rock solid until we put Hudson on it, and now it seems to go down every couple of days. – rik.the.vik Feb 02 '12 at 21:50

1 Answer

2

The best way I've found is to keep some kind of live log over a network or serial connection. Sometimes the kernel can print a critical message to a logged-in shell even though it can't save it to a file, so just having a remote shell open can help. You can also tail -f the relevant log files, or better yet, cat /proc/kmsg to watch kernel messages live over ssh.

Another, more reliable option is to set up a physical serial port as the console. I have all my servers set up with a serial console and can log the whole boot with a serial terminal emulator like HyperTerminal or, better, PuTTY on a serial port. Adding the boot option console=ttyS0 sends all kernel messages to COM1, which needs far less to keep working than a network connection does. Most motherboards still have a header on the board for COM1 even if they don't expose the connector.
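For what it's worth, here is a minimal sketch of both approaches (the buildhost name and the 115200 baud rate are just placeholders; on CentOS 5 the GRUB config lives at /boot/grub/grub.conf):

    # Live remote logging over ssh (run from another machine and leave it open):
    ssh root@buildhost 'tail -f /var/log/messages'   # follow syslogged kernel messages
    ssh root@buildhost 'cat /proc/kmsg'              # raw kernel messages as they arrive
                                                     # (klogd also reads /proc/kmsg, so
                                                     # messages may be split between readers)

    # Serial console: append console= options to the kernel line in /boot/grub/grub.conf, e.g.
    #   kernel /vmlinuz-2.6.18-... ro root=... console=tty0 console=ttyS0,115200
    # then attach a null-modem cable to COM1 and log the session with PuTTY or minicom.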

penguin359
  • 452
  • 3
  • 8
  • Interesting and worth looking into. However, I'm curious if there is anything I can look at right now that might show me what is causing the crashing? (i.e. logging that's already taking place). Even if it is vague, it will give me a good place to start. – Joshua Apr 01 '11 at 07:46
  • 1
    Analyzing a previous crash? Mostly, logs like /var/log/messages, /var/log/syslog, and /var/log/debug. 99% of logging goes there. If your kernel is panicking, it can be configured to dump to your primary swap partition, which is then saved to /var/log/dump on reboot. You can also enable automatic reboot on panic rather than halting with the sysctl kernel.panic: add kernel.panic = 30 to /etc/sysctl.conf and run sysctl -p. – penguin359 Apr 01 '11 at 07:59
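A quick sketch of that panic-reboot setting, exactly as described in the comment above (the verification step is just sysctl reading back the value):

    # /etc/sysctl.conf -- reboot 30 seconds after a kernel panic instead of hanging
    kernel.panic = 30

    # apply without rebooting
    sysctl -p

    # confirm the running value
    sysctl kernel.panic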