
Something has broken and I lost the connection to storage on the first server. The second server still had access to that FS. I tried to restart GFS by stopping and starting the lock_gulmd, gfs, pool and ccsd services (in various orders), but no luck. On the master server (the third one), "gulm_tool nodelist localhost" says:

Name: srv1
  state = Expired
  mode = Slave
  missed beats = 0
  last beat = 0
  delay avg = 0
  max delay = 0

I found that it needs to be fenced, but should that happen automatically or manually? Can anyone help? At the moment none of the hosts is writing anything to the FS, so no harm should be done, I presume. The second host is also expired at the moment and can't start lock_gulmd.

Icapan

2 Answers


If it hasn't already been fenced automatically, I would assume your fencing mechanism isn't working properly.

I suppose what one could do is reboot the expired hosts (either one by one or both at the same time) and inform the cluster that fencing has been successful with the fence_ack_manual tool. Doesn't this show up in your logs?

Running this tool (on the node that requested it to be run, which is not the node that needed to be rebooted) will allow the GFS filesystem and the faulty node to be recovered. The recovery mainly consists of the node becoming a proper cluster member again and the GFS filesystem journal being replayed if necessary, IIRC.
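
As a rough sketch, assuming srv1 is the expired node from your output and that you acknowledge the fence from the node that requested it (the exact fence_ack_manual flags can differ between cluster suite versions, so check the man page on your release):

    # on the expired node (srv1): reboot or power-cycle it
    reboot

    # on the node that requested the fence, acknowledge that it was done manually
    fence_ack_manual -n srv1

    # then verify that srv1 is no longer Expired
    gulm_tool nodelist localhost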

wzzrd

Honestly, the best way to clear GFS problems like this, especially when you're locked out of the filesystem anyway, is just to shut all the machines down and then start the cluster back up again. It was the most reliable, and usually the quickest, way of fixing these problems when I was wrangling lots of GFS filesystems.
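
As a minimal sketch, assuming the service names from your question and the usual pool, ccsd, lock_gulmd, gfs startup order (double-check the order your own init scripts expect):

    # on every node: stop the stack and power off
    service gfs stop
    service lock_gulmd stop
    service ccsd stop
    service pool stop
    shutdown -h now

    # once all nodes are down, boot them and bring the stack back up
    service pool start
    service ccsd start
    service lock_gulmd start
    service gfs start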

womble