Every month or so, one of my servers running VMware 4.1 becomes unresponsive. The only way to get it back up is a hard reboot. When this happens I can still connect to VMware, but I cannot do anything except navigate and view information.
The server is a Dell PowerEdge R210 with two 1TB SATA disks and a Dell SAS 6/iR RAID controller (mirroring the disks, no battery). I have another identical server that runs without problems.
I have now replaced the server in production so that I can run some tests on it to figure this out. So far I've updated the BIOS and the RAID controller firmware, reinstalled VMware, and replaced all RAM modules, but none of that fixed the issue.
I tried installing Ubuntu on the server and the issue does not occur there; it only happens when running VMware.
This has now happened about 10 times, and it seems more likely to happen under heavy disk load.
The error messages look like this:
Lost connectivity to storage device naa.600508e000000000a528c060b1275b09. Path vmhba1:C1:T0:L0 is down. Affected datastores: "", "datastore1", "Hypervisor1", "Hypervisor2", "Hypervisor3".
Lost access to volume 50520233-c467e816-a5a1-0026b97a4010 (datastore1) due to connectivity issues. Recovery attempt is in progress and outcome will be reported shortly.
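To check whether it is always the same device and path that drops, the event text can be parsed with a small script. This is only a rough sketch: the regular expression is my own guess based on the two messages quoted above, and the names `LOST_CONN` and `parse_lost_connectivity` are made up for illustration, not part of any VMware tool.

```python
import re
from collections import Counter

# Assumed pattern, derived only from the "Lost connectivity" message quoted above.
LOST_CONN = re.compile(
    r"Lost connectivity to storage device (?P<device>naa\.[0-9a-f]+)\. "
    r"Path (?P<path>vmhba\d+:C\d+:T\d+:L\d+) is down\."
)


def parse_lost_connectivity(lines):
    """Return (device, path) tuples for every 'Lost connectivity' event found."""
    events = []
    for line in lines:
        match = LOST_CONN.search(line)
        if match:
            events.append((match.group("device"), match.group("path")))
    return events


# The two messages from this post, used as sample input.
sample = [
    'Lost connectivity to storage device naa.600508e000000000a528c060b1275b09. '
    'Path vmhba1:C1:T0:L0 is down. Affected datastores: "", "datastore1", '
    '"Hypervisor1", "Hypervisor2", "Hypervisor3".',
    "Lost access to volume 50520233-c467e816-a5a1-0026b97a4010 (datastore1) "
    "due to connectivity issues. Recovery attempt is in progress and outcome "
    "will be reported shortly.",
]

events = parse_lost_connectivity(sample)
# Count how often each path went down; one path dominating points at that path.
counts = Counter(path for _device, path in events)
print(events)
print(counts)
```

Feeding it the exported event list from each of the roughly 10 incidents would show whether vmhba1:C1:T0:L0 is the only path affected.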
Here are the log entries: