A 42 TB LUN, formatted with XFS and shared via NFS, was reported 'unavailable' by customers. In the end I was forced to restart the file server. Now I'm stuck: the XFS LUN won't mount until it is repaired, but the repair won't run until the filesystem has been mounted so that the log is replayed and the uncommitted changes are committed. In the past, I've learned that discarding the log and then running the repair results in loss of filenames for a portion of the files and folders in the LUN. That is 42 TB and potentially hundreds of thousands of files. Loss of filenames equates to data loss.
I have a backup, but restoring will require gathering resources. I think there's roughly 30 TB of data in that LUN that I would need to restore and copy back into place. So I need 30 TB of free space, which is not readily available.
Is there another way to force XFS to mount so that the log is replayed and the changes are committed?
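To make the order of operations concrete, here is a minimal sketch of the sequence described above, from least to most destructive. The device path and mount point are hypothetical placeholders, and the script only prints the commands rather than running them. The flags used are `xfs_repair -n` (check only, no modification) and `xfs_repair -L` (zero the log, then repair), the latter being the step that has cost me filenames before.

```python
def recovery_plan(device, mountpoint):
    """Return (label, command) steps in order, least destructive first."""
    return [
        # A normal mount makes the kernel replay the XFS log.
        ("replay log via mount", ["mount", "-t", "xfs", device, mountpoint]),
        ("unmount again", ["umount", mountpoint]),
        # -n: inspect only, never modifies the filesystem.
        ("dry-run check", ["xfs_repair", "-n", device]),
        ("repair", ["xfs_repair", device]),
        # -L zeroes the log; uncommitted metadata changes are lost.
        ("LAST RESORT: zero log and repair", ["xfs_repair", "-L", device]),
    ]


def render(plan):
    """Format the plan as numbered, human-readable lines."""
    return [
        f"{i}. {label}: {' '.join(cmd)}"
        for i, (label, cmd) in enumerate(plan, 1)
    ]


if __name__ == "__main__":
    # Hypothetical device and mount point, for illustration only.
    for line in render(recovery_plan("/dev/mapper/lun42", "/mnt/lun42")):
        print(line)
```

In my case the first step is the one that fails, which is what pushes me toward the last one.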
This is the third time a LUN has 'frozen' on me, been reported as XFS-corrupted in the logs, and forced me to reboot the server to bring it back online. XFS seems to have a solid reputation, it has been around for a long time, and it is the default for the file server's OS (RHEL7). Is there some terrible error in my configuration that is killing these LUNs?
The setup: the SAN presents the LUN, which is mounted with nodev,nosuid,nofail on the file server. The file server shares it over NFS to workstations, which mount the share as synchronous. Is there something in this combination that would hang the file server?
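In case the exact mount options matter, here is a small sketch for pulling the XFS entries and their options out of fstab-formatted text, so they can be posted verbatim. The function name `xfs_entries` is my own, and reading `/etc/fstab` assumes that is where the LUN mounts are defined.

```python
def xfs_entries(fstab_text):
    """Map mount point -> list of options for every XFS entry."""
    entries = {}
    for line in fstab_text.splitlines():
        line = line.strip()
        # Skip blank lines and comments.
        if not line or line.startswith("#"):
            continue
        # fstab fields: device, mount point, fs type, options, dump, pass
        fields = line.split()
        if len(fields) >= 4 and fields[2] == "xfs":
            entries[fields[1]] = fields[3].split(",")
    return entries


if __name__ == "__main__":
    # Assumption: the LUN mounts are defined in /etc/fstab.
    with open("/etc/fstab") as fh:
        for mountpoint, options in xfs_entries(fh.read()).items():
            print(mountpoint, ",".join(options))
```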