3

I moved my ESXi 5.0 server into a colo about 3 weeks back and ever since then, I've had issues with the server spontaneously powering down and coming back up. I previously had the server sitting at my house for close to a month while I was on vacation. During that time, the server did not go down once.The only differences (that I am aware of) are:

  1. New physical location
  2. Installed a Dell PERC5i RAID card
  3. There are actually some websites on there now, but nothing really taxing from a traffic or processor standpoint

What makes the situation a little urgent is that in one instance, when ESXi and the guests came back up, one of the VMs experienced a filesystem panic and went into RO mode. I rebooted that guest, ran fsck and everything was back to normal. I'm trying to figure out what caused that particular restart and would really appreciate the eye of an experienced ESXi user to spot anything peculiar in my logs. I don't see anything that looks like a kernel crash or memory dump. Below are excerpts from what seem to me to be relevant logs just before and after the restart event ... please let me know if I should include any others.

vmksummary.log

2012-08-07T17:00:01Z heartbeat: up 2d18h42m11s, 3 VMs; [[3406 vmx 2092436kB] [3453 vmx 2095768kB] [3373 vmx 2300420kB]] [[3531 sfcb-hhrc 2%max] [3432 sfcb-vmware_bas 5%max] [3420 sfcb-pycim 16%max]]
2012-08-07T18:00:01Z heartbeat: up 2d19h42m11s, 3 VMs; [[3406 vmx 2092488kB] [3453 vmx 2095640kB] [3373 vmx 2301544kB]] [[3531 sfcb-hhrc 2%max] [3432 sfcb-vmware_bas 5%max] [3420 sfcb-pycim 16%max]]
2012-08-07T18:58:42Z bootstop: Host has booted
2012-08-07T19:00:01Z heartbeat: up 0d0h2m10s, 3 VMs; [[3405 vmx 464780kB] [3451 vmx 815008kB] [3373 vmx 1086716kB]] [[3501 sfcb-CIMXML-Pro 1%max] [3432 sfcb-vmware_bas 2%max] [3420 sfcb-pycim 5%max]]

syslog.log

2012-08-04T20:00:01Z crond[2702]: USER root pid 97212 cmd /usr/lib/vmware/vmksummary/log-heartbeat.py
2012-08-04T20:01:01Z crond[2702]: USER root pid 97329 cmd /sbin/auto-backup.sh
2012-08-04T21:00:01Z crond[2702]: USER root pid 99638 cmd /usr/lib/vmware/vmksummary/log-heartbeat.py
2012-08-04T21:01:01Z crond[2702]: USER root pid 99745 cmd /sbin/auto-backup.sh
2012-08-04T22:00:01Z crond[2702]: USER root pid 102014 cmd /usr/lib/vmware/vmksummary/log-heartbeat.py
2012-08-04T22:01:01Z crond[2702]: USER root pid 102081 cmd /sbin/auto-backup.sh
2012-08-04T22:17:54Z jumpstart: dependencies for plugin 'restore-host-cache' not met (missing: vcfs)
2012-08-04T22:17:54Z vmkmicrocode: Warning: Line size is greater than expected size 242
2012-08-04T22:17:54Z vmkmicrocode: File microcode_amd_0x100fa0.bin does not contain a valid microcode update for any of the processors
2012-08-04T22:17:54Z vmkmicrocode: File m4010676860C0001.dat does not contain a valid microcode update for any of the processors
2012-08-04T22:17:54Z vmkmicrocode: File m03106a5.dat does not contain a valid microcode update for any of the processors
2012-08-04T22:17:54Z vmkmicrocode: cpu0 with revision (a07) can use the update in file microcode-1027.dat
2012-08-04T22:17:54Z vmkmicrocode: update number 25 version(1), revision(2571), date(0x9282010), size(2048)
2012-08-04T22:17:54Z vmkmicrocode: cpu1 with revision (a07) can use the update in file microcode-1027.dat
2012-08-04T22:17:54Z vmkmicrocode: update number 25 version(1), revision(2571), date(0x9282010), size(2048)
2012-08-04T22:17:54Z vmkmicrocode: cpu2 with revision (a07) can use the update in file microcode-1027.dat
2012-08-04T22:17:54Z vmkmicrocode: update number 25 version(1), revision(2571), date(0x9282010), size(2048)

vmkernel.log

2012-08-04T02:59:59.509Z cpu4:2655)<6>megasas_hotplug_work[6]: aen event code 0x0027
2012-08-04T15:57:19.630Z cpu5:2655)<6>megasas_hotplug_work[6]: aen event code 0x005e
2012-08-04T16:03:35.776Z cpu4:2649)<6>megasas_hotplug_work[6]: aen event code 0x005e
TSC: 0 cpu0:0)Boot: 167: Parsing boot option module /useropts.gz
TSC: 14715 cpu0:0)Boot: 173: Parsing command line boot options
TSC: 86415 cpu0:0)BootConfig: 38: coresPerPkg = 0
TSC: 90368 cpu0:0)BootConfig: 41: useNUMAInfo = TRUE
TSC: 93878 cpu0:0)BootConfig: 44: numaLatencyLoops = 20
...
PRESUMABLY MORE BOOT STUFF
...
0:00:00:03.667 cpu0:2048)IDT: 991: 0x30 <keyboard> exclusive, flags 0x3
0:00:00:03.667 cpu0:2048)IDT: 991: 0x58 <mouse> exclusive, flags 0x3
0:00:00:03.667 cpu0:2048)IOAPIC: 1335: 0x58 retriggerred
0:00:00:03.667 cpu0:2048)IOAPIC: 1335: 0x30 retriggerred
0:00:00:03.667 cpu0:2048)GlobalTimer: 78: GlobalTimer service not available
0:00:00:03.667 cpu0:2048)Initializing Power Management ...
0:00:00:03.670 cpu0:2048)Power: 2568: No supported CPU power management technology detected
0:00:00:03.671 cpu0:2048)MCE: 616: Fixed 10 MCE bank/CPU-package ownership settings
0:00:00:03.672 cpu0:2048)CpuSched: 11824: Reset scheduler statistics
0:00:00:03.672 cpu0:2048)Init: 892: Vmkernel initialization done. Returning to console.
0:00:00:03.672 cpu0:2048)VMKernel loaded successfully.
2012-08-04T22:17:52.152Z cpu6:2059)ScsiCore: 129: Starting taskMgmt watchdog world 2059
2012-08-04T22:17:52.152Z cpu4:2060)ScsiCore: 129: Starting taskMgmt watchdog world 2060
2012-08-04T22:17:52.152Z cpu5:2141)VSCSI: 2520: Starting reset handler world 2141/1
2012-08-04T22:17:52.152Z cpu3:2177)ScsiCore: 63: Starting taskmgmt handler world 2177/1
2012-08-04T22:17:52.152Z cpu2:2178)ScsiCore: 63: Starting taskmgmt handler world 2178/1
2012-08-04T22:17:52.152Z cpu5:2142)VSCSI: 2709: Starting reset watchdog world 2142

hostd.log

2012-08-04T22:13:54.996Z [FFEA7AC0 info 'Vmomi'] Activation [N5Vmomi10ActivationE:0x33f7abc0] : Invoke done [waitForUpdates] on [vmodl.query.PropertyCollector:ha-property-collector]
2012-08-04T22:13:54.996Z [FFEA7AC0 verbose 'Vmomi'] Arg version:
--> "46"
2012-08-04T22:13:54.996Z [FFEA7AC0 info 'Vmomi'] Throw vmodl.fault.RequestCanceled
2012-08-04T22:13:54.996Z [FFEA7AC0 info 'Vmomi'] Result:
--> (vmodl.fault.RequestCanceled) {
--> dynamicType = <unset>,
--> faultCause = (vmodl.MethodFault) null,
--> msg = "",
--> }
2012-08-04T22:13:54.997Z [34759B90 error 'SoapAdapter.HTTPService'] HTTP Transaction failed on stream TCP(local=127.0.0.1:0, peer=127.0.0.1:58492) with error N7Vmacore15SystemExceptionE(Connection reset by p
2012-08-04T22:14:13.998Z [340C2B90 verbose 'Proxysvc Req01482'] New proxy client TCP(local=66.196.32.10:80, peer=223.4.119.245:43890)
2012-08-04T22:14:44.561Z [348FBB90 verbose 'vm:/vmfs/volumes/4ffd026d-a15e589f-c6e3-003048d37c09/REDACTED/REDACTED.vmx'] Actual VM overhead: 119980032 bytes
2012-08-04T22:14:44.562Z [348FBB90 verbose 'Vmsvc'] RefreshVms updated overhead for 1 VM
2012-08-04T22:15:07.104Z [34718B90 verbose 'Cimsvc'] Ticket issued for CIMOM version 1.0, user root
Section for VMware ESX, pid=2790, version=5.0.0, build=build-623860, option=Release
------ In-memory logs start --------
2012-08-04T22:18:21.746Z [FFC7CAC0 info 'Default'] Supported VMs 87
2012-08-04T22:18:21.746Z [FFC7CAC0 info 'Handle checker'] Setting system limit of 2222
2012-08-04T22:18:21.746Z [FFC7CAC0 info 'Handle checker'] Set system limit to 2222
2012-08-04T22:18:21.746Z [FFC7CAC0 info 'Default'] Setting malloc mmap threshold to 32 k
2012-08-04T22:18:21.746Z [FFC7CAC0 info 'Default'] getrlimit(RLIMIT_NPROC): curr=64 max=128, return code = Success
2012-08-04T22:18:21.746Z [FFC7CAC0 info 'Default'] setrlimit(RLIMIT_NPROC): curr=128 max=128, return code = Success
------ In-memory logs end --------
2012-08-04T22:18:21.747Z [FFC7CAC0 info 'Default'] Initialized channel manager

I've ruled out:

  • problem with the VM whose filesystem went R/O - my understanding is that a single VM crash can't bring down ESXi
  • problem with a web traffic spike - the only site on that VM isn't very heavily trafficked at close to 10:30pm and a look through the guest's Apache logs and whatnot supports this

I suspect:

  • problem with the Dell RAID card I installed - it was fine for 3+ weeks before I installed this, I will be installing diagnostics in the next couple of days so that I can monitor
  • possibly an issue with throughput on the RAID card causing slow response to requests made by the VM and leading it to think there's something wrong with the filesystem, although this should not explain a restart, Linux should be fine with it, just marks the FS R/O and continues along until you can fix the problem, and as explained above, the system should not have been under load
  • does VMWare perform automatic updates requiring reboots? I don't have VMWare Tools installed on any guests so that could lead to a dirty reboot of guest VMs.
  • bad power at the colo - the morning after moving the server there, I had to get them to restart my machine ... I suspect somebody switched off a powerbar or something since I got a very generic "we were having power issues" response from them. Also, we had a big electrical storm a few hours ago and the server rebooted at least 3 times during a 20 minute span, no filesystems corrupted but this should not be the case for a supposedly UPS + generator backed data center
  • anything else you can think of?
Nick
  • 31
  • 1
  • 1
  • 3
  • 6
    You should be worried if your colo is giving you generic messages about "power issues". That's your prime suspect. – Michael Hampton Aug 08 '12 at 04:22
  • Yes this was my exact reaction as well, but I wanted to rule out any possibility of configuration or hardware problems with my rig before I go to the trouble of relocating. The only doubt in my mind really is the fact that on my development server which sits right next to me, I've kicked the cord out or had the power go out so many times over the past few years but not once did the filesystem get corrupted on me. I suppose that this particular time, I could have been unlucky enough that the server was right in the middle of doing something. – Nick Aug 08 '12 at 06:36

1 Answers1

3

The electrical storm could have resulted in any number of issues. Depending on the class/quality of your data center's facility, there could have been an impact.

  • Your most useful logs would be visible in the vSphere Client "Events" tab.
  • Do you have any out-of-band management available? DRAC, perhaps? That would give you information about the physical hardware status.
  • Is this actually a Dell server? Which model/generation? If so, you should install Dell's CIM agents for ESXi 5.
  • Does your PERC/5i controller have cache memory and a battery-backed cache unit (BBWC)? Running without those can impact write performance.
  • An individual VMWare ESXi system does not have any automatic update capabilities.
  • You should install VMWare tools on your guest systems.
  • Do you have dual power supplies in the server and access to an A/B power feed? If this is a single-PSU system, that's a possible culprit.
ewwhite
  • 194,921
  • 91
  • 434
  • 799
  • The events tab only seems to show events such as individual VM powerups, SSH logins, etc. It does not even indicate that the machine was booted up. This article http://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=1019238 suggests going into the physical logs themselves, which I have done. It isn't a Dell server, it's a SuperMicro Server running a dual Quad-core Xeon processors. All components are on VMWare's HCL and the PERC 5i is just a rebranded LSI RAID card. I don't believe the RAID card I have has BBWC. – Nick Aug 08 '12 at 06:11
  • It is a single PSU unit but as I said, it was running fine close to a month prior to this with no problems. It would be quite a coincidence that the PSU should become wonky right when I moved it to the DC. VMWare tools is definitely a good recommendation. Am I correct in understanding that it will gracefully power-down guests with the package installed? Even if I initiate a reboot at the host level? – Nick Aug 08 '12 at 06:17
  • 1
    The BBU and cache RAM are separate units on the PERC 5/i. It's possible to have either without the other. I would not have sent the box to the data center without having both, though I'm not going to blame the filesystem corruption on this without more info. – Michael Hampton Aug 08 '12 at 06:52
  • @nick Yes, the VMWare tools can be used to shutdown your guest systems gracefully, if configured to do so. However, if you are simply losing power on your single-PSU system, nothing can help that. – ewwhite Aug 08 '12 at 15:42