
A 42 TB LUN, formatted with XFS and shared via NFS, was reported 'unavailable' by customers. In the end I was forced to restart the file server. The XFS LUN won't mount until it is repaired, but to repair it I need to mount it so the log can replay and commit the uncommitted changes. In the past, I've learned that dumping the log and running the repair results in loss of filenames for a portion of the files and folders in the LUN. That is 42 TB and potentially hundreds of thousands of files. Loss of filenames equates to data loss.

I have a backup. Restoring will require gathering resources. I think there's roughly 30TB of data in that LUN that I need to restore and copy back into place. So I need 30 TB of free space, which is not readily available.

Is there another way of forcing XFS to mount in order to replay those logs and commit the changes?

This is the third time a LUN has 'frozen' on me, been reported as XFS-corrupted in the logs, and forced me to reboot the server to bring it back online. XFS seems to have a solid reputation, it has been around for a long time, and it is the default for the file server's OS (RHEL7). Do I have some terrible error in my configuration that is killing these LUNs?

The SAN presents the LUN, which is mounted nodev,nosuid,nofail on the file server. The file server shares it to workstations, which mount the share as synchronous. Is there something in this combination that would hang the file server?

Jeter-work
  • What happened right before the corruption? More details. – ewwhite Mar 31 '17 at 03:41
  • Customers were adding to the LUN. They started a 10TB transfer in. – Jeter-work Mar 31 '17 at 03:56
  • What *exact* error message(s) are you getting? And what SAN? XFS is reliable - if the underlying hardware behaves. In particular, if the hardware correctly honors write barriers. Are write barriers disabled - is the filesystem mounted with `-o nobarrier`? If you do that and don't have reliable battery-backed non-volatile cache on your hardware, you're likely to get corruption. – Andrew Henle Mar 31 '17 at 15:14
  • I spent about 2 hours running commands and transcribing messages and the page ate my update. Too frustrated now to recreate the work. I will update on Monday. – Jeter-work Mar 31 '17 at 23:50
  • The SAN is an EMC VNX5400. The root of the problem is that I thought the file server was frozen when I guess it was committing writes. Or it was frozen in the process of committing writes. Resolved the issue by disconnecting all of the file server's clients via shutdown on client end, and by stopping the NFS export service. This has allowed (relatively) quick restart of the file server. I have also addressed the lack of restore space. – Jeter-work Mar 13 '18 at 16:16
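The write-barrier question raised in the comments can be checked directly on the file server. The following is only a minimal sketch: it assumes the kernel lists `nobarrier` among the mount options in `/proc/mounts` when barriers are disabled, and the function name `xfs_mounts_without_barriers` is hypothetical, chosen for illustration.

```python
from pathlib import Path


def xfs_mounts_without_barriers(mounts_text):
    """Return (device, mountpoint) pairs for XFS mounts that list 'nobarrier'.

    mounts_text is expected to be in /proc/mounts format:
    device mountpoint fstype options dump pass
    """
    flagged = []
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 4:
            continue  # skip blank or malformed lines
        device, mountpoint, fstype, options = fields[:4]
        # Options are comma-separated; match the exact option name only.
        if fstype == "xfs" and "nobarrier" in options.split(","):
            flagged.append((device, mountpoint))
    return flagged


if __name__ == "__main__":
    mounts = Path("/proc/mounts")
    if mounts.exists():
        for device, mountpoint in xfs_mounts_without_barriers(mounts.read_text()):
            print(f"{device} on {mountpoint} is mounted with nobarrier")
```

An empty result only means the option is not listed for any XFS mount; it says nothing about whether the SAN itself honors cache flushes.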

1 Answer


I came across this question while checking for updates to bugs #1681410 and #1686687 on Launchpad, which have also affected me with symptoms similar to the ones you describe (also with XFS, but with a larger LUN and running Ubuntu 16.04 server).

We checked our storage system (which provides extensive logs) in considerable depth, including requesting support from the manufacturer, but ended up not finding any errors or misconfigurations there.

Having run into this several times, we managed to narrow the occurrence of this behaviour down to a time when nobody could have been actively working on the system, which let us look at other factors as well. We finally found evidence that the cron-scheduled weekly runs of fstrim (a default on Ubuntu 16.04 server!) seem to trigger the corruption on our filesystem, especially as it takes some time to fstrim a LUN of over 100 TB.

I believe the bugs posted on Launchpad quite likely describe this issue but, as it appears to me, the issue has been reported upstream and never really fixed so far. So for now we simply make sure that fstrim is never run by removing the respective entry from cron.weekly. We also check whether the cron job has been re-added after running updates, which is something I'd like to see solved differently.
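The post-update check for a re-added cron job could be automated along these lines. This is a rough sketch rather than what we actually run: the directory `/etc/cron.weekly` is assumed from the description above, and the helper name `fstrim_cron_entries` is made up for illustration.

```python
from pathlib import Path


def fstrim_cron_entries(cron_dir):
    """Return files under cron_dir whose name or content mentions fstrim."""
    hits = []
    directory = Path(cron_dir)
    if not directory.is_dir():
        return hits  # nothing to check if the directory is absent
    for entry in sorted(directory.iterdir()):
        if not entry.is_file():
            continue
        try:
            text = entry.read_text(errors="replace")
        except OSError:
            text = ""  # unreadable file: fall back to checking the name only
        if "fstrim" in entry.name or "fstrim" in text:
            hits.append(entry)
    return hits


if __name__ == "__main__":
    # Assumed location of the weekly cron scripts.
    for path in fstrim_cron_entries("/etc/cron.weekly"):
        print(f"fstrim is scheduled again via {path}")
```

Run after each package update, this would flag a reinstalled entry. It does not cover other schedulers (for example a systemd timer), which would need a separate check.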

antiplex
  • For me, it occurs if I switch the server off before it has finished committing changes. So I've implemented a forced shutdown for all the clients. In the past I just warned them that the file shares would be unavailable. So the shares are not in use, and thus the LUNs are not in use, and there are no changes to commit. Haven't had any trouble since then. – Jeter-work Mar 13 '18 at 15:56