
The problem is the following: I have two servers, each with the same filesystem mounted as OCFS2:

root [server01]% mount | grep ocfs2
ocfs2_dlmfs on /dlm type ocfs2_dlmfs (rw)
/dev/sddlmam1 on /share type ocfs2 (rw,_netdev,datavolume,heartbeat=local)

and

root [server02]% mount | grep ocfs2
ocfs2_dlmfs on /dlm type ocfs2_dlmfs (rw)
/dev/sddlmam1 on /share type ocfs2 (rw,_netdev,datavolume,heartbeat=local)

Both servers export this filesystem via NFS:

root [server01]% cat /etc/exports 
/share *(rw,sync,no_root_squash)

and

root [server02]% cat /etc/exports 
/share *(rw,sync,no_root_squash)

The OCFS2 configuration looks OK:

root [server01]% cat  /etc/ocfs2/cluster.conf
cluster:
    node_count = 2
    name = ocfs2

node:
    number = 0
    cluster = ocfs2
    ip_port = 7778
    ip_address = 10.1.1.2
    name = server01

node:
    number = 1
    cluster = ocfs2
    ip_port = 7778
    ip_address = 10.1.1.3
    name = server02


root [server02]% cat  /etc/ocfs2/cluster.conf 

cluster:
    node_count = 2
    name = ocfs2

node:
    number = 0
    cluster = ocfs2
    ip_port = 7778
    ip_address = 10.1.1.2
    name = server01

node:
    number = 1
    cluster = ocfs2
    ip_port = 7778
    ip_address = 10.1.1.3
    name = server02

These servers are clustered and load-balanced via round-robin DNS:

root [client01]% nslookup servercluster
Server:     10.108.1.1
Address:    10.108.1.1#53

servercluster.domain.com    canonical name = servercluster.domain.com.
Name:   servercluster.domain.com
Address: 10.1.1.2
Name:   servercluster.domain.com
Address: 10.1.1.3
Name:   servercluster.domain.com
Address: 10.1.1.4

root [client01]% mount | grep servercluster
servercluster:/share on /share type nfs4 (rw,relatime,vers=4.0,rsize=1048576,wsize=1048576,namlen=255,hard,proto=tcp,port=0,timeo=600,retrans=2,sec=sys,clientaddr=10.1.1.5,local_lock=none,addr=10.1.1.2)

And

root [client02]% nslookup servercluster
Server:     10.108.1.1
Address:    10.108.1.1#53

servercluster.domain.com    canonical name = servercluster.domain.com.
Name:   servercluster.domain.com
Address: 10.1.1.2
Name:   servercluster.domain.com
Address: 10.1.1.3
Name:   servercluster.domain.com
Address: 10.1.1.4

root [client02]% mount | grep servercluster
servercluster:/share on /share type nfs4 (rw,relatime,vers=4.0,rsize=1048576,wsize=1048576,namlen=255,hard,proto=tcp,port=0,timeo=600,retrans=2,sec=sys,clientaddr=10.1.1.6,local_lock=none,addr=10.1.1.3)

server01 is suffering outages and has logged NFS-related messages:

dracut.log:

Info: Not stripping '/tmp/initramfs.6SdU0W/lib/modules/3.8.13-118.16.4.el6uek.x86_64/kernel/fs/nfsd/nfsd.ko', because module signature appended.

messages:

Dec  8 02:18:19 server01 kernel: NFSD: Using /var/lib/nfs/v4recovery as the NFSv4 state recovery directory
Dec  8 02:18:19 server01 kernel: NFSD: starting 90-second grace period (net ffffffff81983940)

My question is: could these errors be caused by having the same filesystem exported by two different servers via NFS?

Humberto Castellon
X3MBoy

1 Answer


There shouldn't be any corruption of the underlying OCFS2 filesystem, since OCFS2 handles locking itself.

Obviously there's no coordination between the two NFS servers, so you could have a situation where one client sends an update to an existing file while another client does the same. Typically the second client's data will simply overwrite the first client's update, possibly after a short delay while the file is temporarily open for the first client's write. But you shouldn't see any actual corruption, at least.
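The last-writer-wins behaviour described above can be sketched with a quick local demonstration (hypothetical temporary file; this only illustrates the race outcome, not NFS itself):

```shell
# Two uncoordinated writers to the same file: without shared locking
# across the NFS servers, the later write simply replaces the earlier one.
f=$(mktemp)
echo "update from client01" > "$f"   # first client's write
echo "update from client02" > "$f"   # second client's write lands later
cat "$f"                             # only client02's data remains
rm -f "$f"
```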

Chopper3
  • Actually the server got several errors, but after an update of the ocfs2 package it started to behave correctly. Maybe it was a bug, but due to a lot of pressure I had to update without time to collect more logs and output to file a bug report. – X3MBoy Jun 21 '19 at 18:30