So we've got this 4-node Storage Spaces Direct (S2D) cluster, which has been working for more than 1.5 years without any major issue. The OS is Windows Server 2016.
- Firewall down for all profiles
- No antivirus installed, Windows Defender OFF
- Active Directory delegations untouched
- No change in the network infrastructure has been reported
- RDMA was disabled 1 year ago, as we found out the NIC didn't fully support it
Two days ago, we noticed a lot of error messages in the cluster event log, and the backup jobs of all Hyper-V VMs hosted on the cluster failed (made via Veeam).
Investigation quickly showed there are many issues with the SMB connections.
Each of the 4 hosts:
- can ping other resources in the network
- can't connect to any shared folder
- NTP sync fails (net time \\server fails, and so does w32tm /monitor)
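To separate "ping works" from "SMB fails" at the transport level, a plain TCP connect to port 445 can be attempted from each node. This is only a minimal sketch, assuming Python is available on a host; the function name tcp_port_open and its defaults are made up for illustration, and server.domain.com stands for whichever file server is being tested.

```python
import socket


def tcp_port_open(host: str, port: int = 445, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port completes within timeout.

    Hypothetical helper: it only checks that the TCP handshake succeeds,
    it does not speak SMB or authenticate.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name resolution failures.
        return False
```

Calling tcp_port_open("server.domain.com") on a node while ping still succeeds shows whether the failure is already visible at the TCP layer (as event 30803 suggests) or only at the SMB/authentication layer.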
Obviously, the File Share Witness fails as well, and some issues with domain services are reported too.
We tried rebooting the nodes one at a time, and after a reboot the SMB connections are just fine... for a few minutes/hours, and then the issue arises again.
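Since the problem comes back some minutes or hours after a reboot, it may help to timestamp the exact moment port 445 stops answering, so it can be matched against the event logs. The following is a rough sketch under the same assumptions as before (Python on a node, placeholder host name); probe, poll and transitions are hypothetical helper names, and the 30-second interval is arbitrary.

```python
import socket
import time
from datetime import datetime
from typing import Iterable, Iterator, Optional, Tuple


def probe(host: str, port: int = 445, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def poll(host: str, interval: float = 30.0,
         count: Optional[int] = None) -> Iterator[Tuple[str, bool]]:
    """Yield (ISO timestamp, reachable) every `interval` seconds.

    Runs forever when count is None, otherwise stops after `count` samples.
    """
    taken = 0
    while count is None or taken < count:
        yield datetime.now().isoformat(timespec="seconds"), probe(host)
        taken += 1
        time.sleep(interval)


def transitions(samples: Iterable[Tuple[str, bool]]) -> Iterator[Tuple[str, bool]]:
    """Yield only the samples whose state differs from the previous sample."""
    previous = None
    for stamp, ok in samples:
        if ok != previous:
            yield stamp, ok
            previous = ok


# Example use (placeholder host name), left commented out because it loops forever:
# for stamp, ok in transitions(poll("server.domain.com")):
#     print(stamp, "reachable" if ok else "UNREACHABLE")
```

Only state changes are printed, so a node left running after a reboot produces one line when SMB comes up and one line when it breaks again.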
The impact on the cluster, along with the File Share Witness being offline, is that we can't reliably perform a Live Migration of the VMs between the nodes (it succeeds randomly). A Quick Migration works like a charm, though. As SMB connections are not possible, we can't move the VMs to another cluster or standalone host.
We fear the cluster will go haywire if a node fails uncontrollably. Even though the VMs are stable, we still can't perform a backup (we could perform an export).
Have any of you heard about this issue with S2D or the Microsoft Failover Cluster role? It might also be unrelated to the cluster itself...
What can be done to find the root cause of this issue?
Here are samples of the logs found in the cluster role, and in the event logs for SMBClient:
From the Cluster console:
Cluster network name resource 'Cluster Name' encountered an error enabling the network name on this node. The reason for the failure was: 'Unable to obtain a logon token'.
The error code was '1311'.
You may take the network name resource offline and online again to retry.
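As far as I can tell, 1311 is the Win32 error ERROR_NO_LOGON_SERVERS, which would be consistent with the nodes failing to reach a domain controller. A small sketch for decoding such codes, assuming Python on the node; the KNOWN_CODES table and the describe function are made up for illustration, and the ctypes.FormatError fallback only exists on Windows.

```python
import sys

# Hand-maintained table (illustrative); only the code seen in this post is listed.
KNOWN_CODES = {
    1311: (
        "ERROR_NO_LOGON_SERVERS",
        "There are currently no logon servers available to service the logon request.",
    ),
}


def describe(code: int) -> str:
    """Return a readable description of a Win32 error code."""
    if code in KNOWN_CODES:
        name, text = KNOWN_CODES[code]
        return f"{code} {name}: {text}"
    if sys.platform == "win32":
        import ctypes  # FormatError is only available on Windows
        return f"{code}: {ctypes.FormatError(code)}"
    return f"{code}: unknown"
```

On a Windows host, describe can be called with any other code that shows up in the cluster log to get the system's own message text.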
Event ID 30803:
Failed to establish a network connection.
Error: {Device Timeout} The specified I/O operation on %hs was not completed before the time-out period expired.
Server name: server.domain.com
Server address: x.x.x.x:445 Connection type: Wsk
Guidance: This indicates a problem with the underlying network or transport, such as with TCP/IP, and not with SMB. A firewall that blocks TCP port 445, or TCP port 5445 when using an iWARP RDMA adapter, can also cause this issue.
Another one, ID 30804:
A network connection was disconnected.
Server name: \server.domain.com Server address: x.x.x.x:445 Connection type: Wsk
Guidance: This indicates that the client's connection to the server was disconnected.
Frequent, unexpected disconnects when using an RDMA over Converged Ethernet (RoCE) adapter may indicate a network misconfiguration. RoCE requires Priority Flow Control (PFC) to be configured for every host, switch and router on the RoCE network. Failure to properly configure PFC will cause packet loss, frequent disconnects and poor performance.