
Every few weeks one of our Solaris 10 servers becomes unresponsive. I can telnet to port 22 and get the SSH banner, but I cannot establish an SSH session. It's a Dell R610, so I log in via the DRAC console. There I can press Enter and get a new line, but as soon as I run a command such as 'prstat' the console hangs and neither Control-C nor anything else gets through. I also cannot send CTRL-ALT-DEL to reboot gracefully, so I end up doing a remote hard power cycle.

Nothing strange appears in the logs. We have set up cron jobs that append the output of prstat, iostat, vmstat, sar, etc. to a file every minute to try to see what is causing this, but all we see is that the machine is fine and then everything stops.
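For anyone wanting to reproduce this kind of per-minute capture, here is a minimal sketch of a collector that cron could run. It assumes Python 3.5 or later is installed on the host (it is not part of a stock Solaris 10 install), and the command list, the log path and the function name `snapshot` are all hypothetical. The one deliberate choice is the `os.fsync` call after every command, so that the last samples have a better chance of surviving a hard power cycle.

```python
import os
import subprocess
import time

# Hypothetical command list and log path; adjust for your own host.
COMMANDS = [
    ["prstat", "-c", "1", "1"],
    ["vmstat", "1", "2"],
    ["iostat", "-xn", "1", "2"],
]
LOG_PATH = "/var/tmp/hang-watch.log"


def snapshot(commands, log_path, timeout=30):
    """Append one timestamped block per command and force it to disk."""
    with open(log_path, "a") as log:
        for cmd in commands:
            stamp = time.strftime("%Y-%m-%d %H:%M:%S")
            log.write("=== %s %s ===\n" % (stamp, " ".join(cmd)))
            try:
                result = subprocess.run(
                    cmd,
                    stdout=subprocess.PIPE,
                    stderr=subprocess.STDOUT,
                    timeout=timeout,
                )
                log.write(result.stdout.decode("utf-8", "replace"))
            except (OSError, subprocess.TimeoutExpired) as exc:
                # A missing or slow command is itself worth recording.
                log.write("FAILED: %r\n" % (exc,))
            # Push this block out of the page cache before the next command.
            log.flush()
            os.fsync(log.fileno())


# Example call from a cron-driven script:
# snapshot(COMMANDS, LOG_PATH)
```

A "FAILED" entry caused by a timeout shortly before the hang would at least show which command stalled first. If a process is stuck in a way that cannot be killed, the collector itself may block, so treat this as a best-effort aid only.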

We are also graphing metrics in Cacti and don't see anything there either. As I said, everything is normal and then the data just stops.

The problem happened again last night, and we discovered in the 'last' output that the machine seems to start shutting down a couple of hours before it becomes unresponsive, even though no one is shutting it down. Here is the output:

reboot system boot Tue Nov 23 17:24  <-- here is where I rebooted it
reboot system down Tue Nov 23 15:01
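To spot these records without reading the whole 'last' listing by eye, a small filter can help. This is only a sketch: the function name `shutdown_records` is hypothetical, and the field layout is assumed from the two lines quoted above, so it may need adjusting for other 'last' formats.

```python
def shutdown_records(last_output):
    """Return the timestamp text of every 'system down' record, in order."""
    hits = []
    for line in last_output.splitlines():
        fields = line.split()
        # Records of interest start with: reboot system down <date...>
        if fields[:3] == ["reboot", "system", "down"]:
            hits.append(" ".join(fields[3:]))
    return hits


sample = """\
reboot    system boot                   Tue Nov 23 17:24
reboot    system down                   Tue Nov 23 15:01
"""
print(shutdown_records(sample))  # ['Tue Nov 23 15:01']
```

Any timestamp returned that does not match a known maintenance window is a candidate for the unexplained shutdown described here.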

There are no environmental or chassis alarms in the DRAC.

I've checked for any cron jobs, etc. that could be shutting down the server and don't see anything. I want to enable auditd, but that requires a reboot and this is a major production system.

Can anyone offer any advice?

Dell R610 Solaris 10 5/09 s10x_u7wos_08 X86

Thanks,

Shane

rxvt

3 Answers


I discovered that if I go into BIOS -> CPU Settings and disable C-States, the servers no longer crash. They have now been up for over a month, while the other servers that didn't have the setting changed still crashed.

rxvt

I have that exact behavior on the Dell R410 running Solaris 10 9/10 s10x_u9wos_14a.

I found these threads, which led me to think I should use the Broadcom driver instead of the one bundled with Solaris for my install: http://opensolaris.org/jive/thread.jspa?messageID=491917 http://forums.oracle.com/forums/thread.jspa?threadID=1924459&tstart=15

I'm going to try installing it this weekend, but as you know, only time will tell, because there is absolutely no trace of the problem until it occurs.

Output from fmdump -e:

fmdump: /var/fm/fmd/errlog is empty

  • After doing a few Google searches based on the information you provided, I discovered that going into BIOS -> CPU Settings and disabling C-States stops the servers from crashing. We've had our servers online for over a month now. Thanks! – rxvt Feb 07 '11 at 18:13

First things to check - are you running the latest patch levels and updated firmware for your hardware? What software are you running on the host and has this had the latest patches applied? Does the host have adequate clean power and cooling?

Checking the HCL, it looks like the Dell R610 is certified on OpenSolaris and Solaris 11 Express but no mention of Solaris 10.

hth.

cachonfinga