Prevent corruption on ext3 Linux VMs running on XenServer after EqualLogic hung for 45 minutes

Question

I had a problem today with my EqualLogic PS4000E storage array: it got stuck for 45 minutes, then came back up and ran normally, no logs, no nothing to help us discover what happens.

I run XenServer 5.6 SP2 with a 2-server pool. After this storage problem, the most recent Linux VMs (Ubuntu 12) and the Windows VMs went back to work normally, but most of the older Debian VMs ended up with read-only filesystems and we needed to fsck them all. Some VMs were permanently corrupted; others worked normally after a reboot and fsck.

I would like to know whether there is any way to prevent VM filesystem corruption when the iSCSI connection is lost or times out, maybe by increasing the iSCSI timeout on Xen or something similar in each guest VM.

Anybody?

Contact Dell support to diagnose the problem. This should not happen. You may have a network problem, a controller problem, or a disk failure. Do you have backups in case the corruption is worse than you think? — Dom, Aug 20 '16 at 19:14
I tried, but the logs do not show anything, and my box is 6 years old. Dell declined my warranty renewal last April and does not provide any support. This is impressive: the system hung twice in the last 30 days and it never happened before. I take backups, but it's a stressful situation since I don't know when this will happen again; I hope never. There is no money available to buy another box at this time. — Luiz Gustavo, Aug 21 '16 at 20:51
Do you have logging enabled on the switches? Do you see flapping ports? Do you see logs on your Xen servers? — Dom, Aug 22 '16 at 08:47

score 0 · Accepted Answer · answered Aug 22 '16 at 15:36

Corruption isn't going to be entirely preventable when you're dealing with ~1-hour loss of storage connectivity - certainly not by tuning some SCSI timeout variable in the hypervisor or OS.
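To put numbers on that gap, here is a minimal sketch. It assumes a Linux host where the iSCSI LUNs show up as SCSI block devices exposing `/sys/block/<dev>/device/timeout` (in seconds); Xen paravirtual guest disks may not expose this file at all. The function names are made up for illustration. It reads the current per-device command timeout and reports how far short of a 45-minute outage each one falls.

```python
from pathlib import Path

# Outage length reported in the question: 45 minutes.
OUTAGE_SECONDS = 45 * 60


def read_scsi_timeouts(sys_block="/sys/block"):
    """Return {device: timeout_seconds} for block devices exposing a SCSI timeout file."""
    timeouts = {}
    for timeout_file in sorted(Path(sys_block).glob("*/device/timeout")):
        try:
            value = int(timeout_file.read_text().strip())
        except (OSError, ValueError):
            continue  # unreadable or non-numeric entry: skip it
        # Path looks like <sys_block>/<dev>/device/timeout, so <dev> is two levels up.
        timeouts[timeout_file.parent.parent.name] = value
    return timeouts


def shortfall(timeouts, outage=OUTAGE_SECONDS):
    """Return {device: seconds_missing} for devices whose timeout is shorter than the outage."""
    return {dev: outage - t for dev, t in timeouts.items() if t < outage}


if __name__ == "__main__":
    for dev, missing in shortfall(read_scsi_timeouts()).items():
        print(f"{dev}: timeout is {missing} s shorter than a {OUTAGE_SECONDS} s outage")
```

A timeout measured in tens of seconds would have to be raised by roughly two orders of magnitude to ride out an outage like this one, and every in-flight write would simply block for that long.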

Your inability to renew warranty is unfortunate, but normal for 7.2k disk Equallogic systems which are limited to 5 years max warranty (10K/15K/SSD units can go out to 7 years). I would link to the EQL "Release and Support Guidelines" PDF, but access to the support page where it's hosted requires an active warranty.

You stated that only your "old" Debian VMs experience serious problems afterward - perhaps this is related to which file system they're using, and/or how your mounts are configured? (e.g. data=journal/ordered/writeback)
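One way to compare the old and new guests is to look at what each one reports in `/proc/mounts`. The sketch below is an illustration only (the function name is invented): it parses `/proc/mounts`-style text and lists the `data=` and `errors=` options for ext3/ext4 mounts. A missing `data=` entry is reported as `None`, which usually means the filesystem default is in effect rather than that journaling is off.

```python
def journal_settings(mounts_text):
    """Return data=/errors= options for each ext3/ext4 line in /proc/mounts-style text."""
    results = []
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 4 or fields[2] not in ("ext3", "ext4"):
            continue
        device, mountpoint, fstype, options = fields[:4]
        # "rw" -> ("rw", ""), "data=ordered" -> ("data", "ordered")
        opts = dict(opt.partition("=")[::2] for opt in options.split(","))
        results.append({
            "device": device,
            "mountpoint": mountpoint,
            "fstype": fstype,
            "data": opts.get("data"),      # None if not listed in the mount options
            "errors": opts.get("errors"),  # e.g. "remount-ro"
        })
    return results


if __name__ == "__main__":
    with open("/proc/mounts") as fh:
        for entry in journal_settings(fh.read()):
            print(entry)
```

Run inside each guest, this gives a quick side-by-side of the Debian and Ubuntu VMs. A filesystem that went read-only after the outage is consistent with an `errors=remount-ro` option, though that is a guess about this particular setup.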

"no logs, no nothing to help us discover what happens"

This is highly unlikely, though many sets of log data can be difficult to obtain without previous experience/familiarity in gathering and analyzing them.

How do you know that this was a storage problem? What events/errors or behavior did you observe that led to this conclusion?

@Dom asked a great question in a comment regarding switch logs. Equallogic diagnostics are not built around end-user readability, but switch logs should be fully accessible and readable if logging is actually in place.

If you don't have the budget to replace a SAN after its end of service life / supportability, you can't afford to have one in the first place. I know that's completely hindsight and doesn't help you, but you should seriously consider moving off the EQL storage and onto something less expensive (like multiple servers, local storage only, and replicate VMs with something like DRBD). A SAN can be great, but it's a serious financial commitment too.

I viewed the switch logs but did not find anything wrong. The storage IPs were pingable, and I could connect by telnet and SSH and log in as grpadmin, but no prompt was shown. The web admin was also accessible, but it did not show the array, only the message "connecting". I already bought other servers for VMs with local storage, which was the solution, but I can't migrate and deactivate the Xen + EQL setup yet. I will take a look at DRBD, thanks for the recommendation. — Luiz Gustavo, Aug 22 '16 at 22:06
Are there any errors in the EQL event viewer? (In the "monitor" section of group manager) - if there's no evidence there, you're essentially stuck needing help from support. — JimNim, Aug 23 '16 at 01:53
No error is logged in the EQL, this is the problem. I'll try restarting the primary controller so the secondary takes over, and will monitor :( After the warranty expired, I can't access the EQL portal or download firmware. This is bizarre; Dell wants us to put the EQL in the trash bin after the warranty expires. — Luiz Gustavo, Aug 23 '16 at 02:56
Version 5.0.2, and I never updated because it worked perfectly during the last 5 years — Luiz Gustavo, Aug 23 '16 at 22:27
