Exchange 2013 database corruption with event 476 in ESE

Question

I'm seeing random database corruption on our Exchange Server 2013, logged as event 476 from ESE. This is the fifth time it has happened, and the situation is already unacceptable. Here's a screenshot of Event Viewer with the incident.

Because the logs got corrupted too, recovery has to be done either from backups or with eseutil /p, which is a lossy procedure.

At this point I really want to isolate the problem and find out which device is to blame. This Exchange Server is running inside a VM on vSphere 6.0. The VMDK is exported through iSCSI from a Dell PowerVault MD3820i.

Given the nature of the error, it appears to be a problem with the storage subsystem, but how can we investigate this? On the previous occasions the folks at Dell said that everything was fine with the storage, but I don't know whether the diagnostics they ran are trustworthy enough.

Thanks in advance,

EDIT: There is no antivirus software installed on the server. The host hardware running VMware vSphere 6.0 is a Dell PowerEdge R730, certified by Dell to run vSphere. There are no errors in the VMware logs, or at least I wasn't able to find any.

Storage communication is done over iSCSI using two Cat6 cables in multipath mode with dual controllers on the PowerVault MD3820i, so it's a pretty standard configuration that is known to work, and again, it was certified by Dell.

I know that being certified by Dell doesn't mean something is good. But they sold the hardware and recommended their best practices, and we followed all of them.

EDIT II: The PowerVault storage appliance is running the latest firmware from Dell, version 08_20_09_60. The release one version older than the latest addressed one particular issue that leads to data corruption: "Addressed a rare condition which has the potential of causing a processor fault that could result in a data integrity issue."

As for the network cards, we're using a dual Broadcom NetXtreme II BCM57810 10GbE. The card does not support TCP offload engine (TOE) and/or iSCSI offloading, so this should not be the issue.

VMware is running with the recommended drivers for the local SAS controllers too: the megaraid_sas driver instead of the default tg3 bundled with VMware. I don't think this could be the issue, since the VMs are on iSCSI storage and not on local storage.

My environment is completely different but we are having the same issues with Exchange 2013 (not the same errors though). For the SAN I'm running a NetApp FAS2040, servers are 3 HP DL 360p's Gen8 with a Cisco Nexus 5000 connecting our 10GB iSCSI links (Cat6 cabling as well). Strangely enough, when we enabled jumbo frames over a month ago we started to see Exchange DB errors (mailboxes going into quarantine). We have an open ticket with Microsoft but "Neel" keeps pointing to disk errors on C:\ as a result of corruption issues on E: and G: (DB drives). The plan is to disable jumbo frames. — , Aug 14 '15 at 17:50
This is really strange. I do have some other Exchange environments with other storage backends (mainly ZFS with FreeNAS), and even with jumbo frames everything runs smoothly. The other thing to consider is that the jumbo frames setup was, once again, deployed and recommended by Dell, so we're following their own recommendation... — Vinícius Ferrão, Aug 14 '15 at 17:58

Rob Moir · Answer 1 · 2015-08-13T10:17:32.347

As it says in the event log error description, this will almost certainly be a fault with the system hardware, which can be a rather nebulous concept when talking about virtual guests.

I would be looking very hard at the storage subsystem. Given my recent experiences with virtual clusters built on Dell servers, I would suspect an issue with either network card firmware or storage system firmware, in that order.

Having had a cup of tea and a think, I've looked again at your error: you're getting a 1019 error. This specifically says that the Exchange server went to read some data in the database that it 'knew' had been written but was unable to find it. Have you read https://support.microsoft.com/en-gb/kb/314917? The errors are discussed there in some detail.

This can only be disk corruption of some kind and the root cause for that is very likely to be an issue with the storage system, especially considering that you mention this has happened before.

My other worry at this point is that 1019 errors can be rather insidious; it could be the end result of a write going wrong some time ago not being detected because the data wasn't needed for some time. Restoring yesterday's backup won't help if the corruption occurred last week, for example.
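One way to look for this kind of silent write loss independently of Exchange is to write self-describing test blocks to a scratch file on the suspect volume and check them again later. The following is only a minimal sketch of that idea: the block size, header layout, and function names are assumptions made for illustration, not part of any Dell, VMware, or Microsoft tool. A check run immediately after the write may be served from cache, so it is more meaningful after a reboot or some hours of normal load.

```python
import hashlib
import os
import struct

BLOCK_SIZE = 4096                 # assumption: arbitrary test block size
HEADER = struct.Struct(">Q32s")   # block index + SHA-256 digest of the payload


def write_test_file(path, blocks):
    """Write `blocks` checksummed blocks and ask the OS to flush them to disk."""
    payload_size = BLOCK_SIZE - HEADER.size
    with open(path, "wb") as f:
        for index in range(blocks):
            payload = os.urandom(payload_size)
            digest = hashlib.sha256(payload).digest()
            f.write(HEADER.pack(index, digest) + payload)
        f.flush()
        os.fsync(f.fileno())


def verify_test_file(path):
    """Return the indexes of blocks that are truncated, misplaced, or damaged."""
    bad = []
    with open(path, "rb") as f:
        index = 0
        while True:
            block = f.read(BLOCK_SIZE)
            if not block:
                break
            if len(block) < BLOCK_SIZE:
                bad.append(index)  # short read: the tail of the file is missing
                break
            stored_index, digest = HEADER.unpack(block[:HEADER.size])
            payload = block[HEADER.size:]
            if stored_index != index or hashlib.sha256(payload).digest() != digest:
                bad.append(index)
            index += 1
    return bad
```

An empty list from `verify_test_file` means every block read back exactly as written. A non-empty list on a volume that nothing else touched would point at the storage path rather than at Exchange, although a clean result does not prove the storage is healthy.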

At this point, I'd certainly be contacting Dell and also, maybe, Microsoft.

I'm really dissatisfied with those storage appliances from Dell. I've already deployed the latest firmware, which fixes "random data corruption when a CPU fault happens on the controller", but this hasn't solved the issue. — Vinícius Ferrão, Aug 13 '15 at 06:51

score 0 · Answer 2 · edited Aug 13 '15 at 06:20

With the limited information about the environment it is running on, I would start by checking the following.

Make sure AV has the appropriate exclusions set for Exchange.

Make sure the drivers for storage and network are the correct stable versions for the devices at the other end.

Look for other events that precede the failure.

Try to include more info about the hardware: server type, memory, CPU, network card types and configuration (port-channel, etc.).

Look closely at your vSphere logs for any storage-related errors.
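As a rough illustration of looking for events that precede the failure, the sketch below groups whatever was logged in a time window before each occurrence of a given event. It assumes the events have already been exported and parsed into records. The `Event` fields, the function name, and the 30-minute default window are hypothetical choices, not an Event Viewer or vSphere API.

```python
from datetime import datetime, timedelta
from typing import NamedTuple


class Event(NamedTuple):
    # Hypothetical record layout for an already-parsed log entry.
    time: datetime
    source: str
    event_id: int
    message: str


def events_before(events, source, event_id, window=timedelta(minutes=30)):
    """Map each matching failure event to the events logged shortly before it."""
    ordered = sorted(events, key=lambda e: e.time)
    result = {}
    for failure in ordered:
        if failure.source != source or failure.event_id != event_id:
            continue
        # Keep only events inside the window that strictly precede the failure.
        result[failure] = [
            e for e in ordered
            if failure.time - window <= e.time < failure.time
        ]
    return result
```

Calling `events_before(events, "ESE", 476)` would return, for each event 476 from ESE, the list of entries from the preceding half hour. Recurring iSCSI, multipath, or disk warnings in those lists across several incidents would be worth raising with the storage vendor.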

score 0 · Answer 3 · answered Nov 20 '15 at 20:10

There are problems in VMware 6 that can corrupt Exchange stores (or anything active, like a database). There are (related?) issues with the Changed Block Tracking (CBT) feature used by virtual backup software like Veeam. Search on those topics and you will find others with corrupt Exchange stores. It's a particularly nasty problem, since after your store is corrupted the CBT errors may have made ALL of your backup restore points (including off-site) unusable. From what I can understand, VMware has a patch to prevent the corruption of the running server, but at the time of this posting there is no fix for the CBT issues, and CBT-based backups of ESXi 6.0 are not reliable.

FWIW, I've had good experience with Dell's MD SANs. They're not fancy, but I've got several clients running them and have never had a problem. Likewise, I've got quite a few shelves of EqualLogic that have been reliable. Of course, I use only basic LUN features, nothing fancy like snapshots or replication; I rely on Veeam for that.

We don't use Veeam... Is this problem exclusive to Veeam, or is it caused by the backup methods Veeam uses? — Vinícius Ferrão, Nov 20 '15 at 20:11
