I'm seeing random database corruption on our Exchange Server 2013, logged as ESE event 476. This is the fifth time it has happened and the situation is already unacceptable. Here's a screenshot of Event Viewer showing the incident.
Recovery has to be done either from backups or with eseutil /p, which is a lossy procedure since the logs got corrupted too.
At this point I really want to isolate the problem and find out which device is to blame. This Exchange Server runs inside a VM on vSphere 6.0. The VMDK lives on storage exported through iSCSI from a Dell PowerVault MD3820i.
Given the nature of the error, it appears to be a problem with the storage subsystem, but how can we investigate this? On the previous occasions the folks at DELL said that everything was fine on the storage side, but I don't know whether the diagnostics they ran are trustworthy enough.
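One idea I have for probing the storage path independently of Exchange is a write and read-back checksum test on a volume backed by the same iSCSI LUN. Below is a minimal sketch in Python of what I mean. The function names, the block size and the test file name are my own hypothetical choices, not anything from DELL or VMware tooling, and it only detects blocks that change after being written; it does not say which layer changed them.

```python
import hashlib
import os
import tempfile

BLOCK_SIZE = 1024 * 1024  # 1 MiB per block (arbitrary choice for this sketch)


def write_blocks(path, block_count, block_size=BLOCK_SIZE):
    """Write random blocks to path and return the SHA-256 digest of each block."""
    digests = []
    with open(path, "wb") as f:
        for _ in range(block_count):
            block = os.urandom(block_size)
            digests.append(hashlib.sha256(block).hexdigest())
            f.write(block)
        f.flush()
        os.fsync(f.fileno())  # ask the OS to push the data out of its write cache
    return digests


def verify_blocks(path, digests, block_size=BLOCK_SIZE):
    """Re-read path and return the indexes of blocks whose digest no longer matches."""
    bad = []
    with open(path, "rb") as f:
        for index, expected in enumerate(digests):
            block = f.read(block_size)
            if hashlib.sha256(block).hexdigest() != expected:
                bad.append(index)
    return bad


if __name__ == "__main__":
    # Hypothetical test file; in practice this should point at a volume
    # that sits on the iSCSI-backed datastore.
    target = os.path.join(tempfile.gettempdir(), "storage_check.bin")
    digests = write_blocks(target, block_count=16)
    mismatches = verify_blocks(target, digests)
    print("corrupted blocks:", mismatches)
    os.remove(target)
```

A read straight after the write is likely to be served from the OS cache, so for this to say anything about the array the file would have to be much larger than RAM, or the verification would have to run after a reboot, with the digests saved somewhere else in the meantime.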
Thanks in advance,
EDIT: There is no antivirus software installed on the server. The host hardware running VMware vSphere 6.0 is a DELL PowerEdge R730, certified by DELL to run vSphere. There are no errors in the VMware logs, or at least I wasn't able to find any.
Storage communication is done over iSCSI using two Cat6 cables in multipath mode with dual controllers on the PowerVault MD3820i, so it's a pretty standard configuration that is known to work, and again, it was certified by DELL.
I know that being certified by DELL doesn't mean a setup is good. But they sold us the hardware and recommended their best practices, and we followed all of them.
EDIT II: The PowerVault storage appliance is running the latest firmware from DELL. Version 08_20_09_60, which is one release older than the latest, addressed one particular issue that can lead to data corruption: "Addressed a rare condition which has the potential of causing a processor fault that could result in a data integrity issue."
As for the network cards, we're using a dual-port Broadcom NetXtreme II BCM57810 10GbE. The card does not support TCP offload engine (TOE) or iSCSI offloading, so this should not be the issue.
VMware is running with the recommended drivers for the local SAS controllers too: the megaraid_sas driver instead of the default tg3 bundled with VMware. I don't think this could be the issue, since the VMs are on iSCSI storage and not on local storage.