2

We have some daemons executing on a number of hosts.

The daemon executables are very large binaries hosted on NFS.

When the binaries are updated on the NFS server, the previously running daemons sometimes die with a bus error (SIGBUS). I assume the NFS server is replacing the binaries in a way that's invisible to the VFS layer on the NFS clients, so the clients end up loading pages from the updated binary, which of course leads to madness.

We tried moving the new binaries into place with mv instead of copying over the old ones with cp, but that doesn't seem to fix it.

I'm considering simply mlock()'ing the binary in the daemon startup script, but surely there are magic NFS options or semantics that we should be abusing. Is there a better way to fix this?

mbac32768

2 Answers

2

The best solution we've found is to always install the binary with a version string at the end of its name, and maintain a symbolic link always pointing to the latest version.

/mnt/foo/bar -> bar-20111201000000
/mnt/foo/bar-20111201000000
/mnt/foo/bar-20111115000000

When you install the new version you atomically move a new symbolic link over the old one.
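The link swap described above can be sketched in Python as follows. This is only an illustration: the function name `switch_symlink` and the temporary link name are hypothetical, and it assumes the temporary link and the final link live in the same directory so the rename stays on one filesystem.

```python
import os


def switch_symlink(link_path: str, target_name: str) -> None:
    """Repoint link_path at target_name by renaming a fresh symlink over it."""
    link_dir = os.path.dirname(os.path.abspath(link_path))
    # Hypothetical temporary name, placed next to the final link.
    tmp_path = os.path.join(
        link_dir, ".%s.tmp.%d" % (os.path.basename(link_path), os.getpid())
    )
    # Create the new link under the temporary name first.
    os.symlink(target_name, tmp_path)
    try:
        # rename(2) swaps the directory entry in one step, so a lookup of
        # link_path finds either the old link or the new one.
        os.replace(tmp_path, link_path)
    except OSError:
        os.unlink(tmp_path)
        raise
```

For the layout shown earlier, the call would be `switch_symlink("/mnt/foo/bar", "bar-20111201000000")`, run after the new versioned file has been copied into place.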

When you run the binary off of NFS, your process maps the versioned binary name, which new installations won't disturb. It also has this neat bonus feature where you can run ps and immediately see which version of the binary is running.
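One caveat on the ps bonus: ps prints the command line the process was started with, so if the daemon is launched through the symlink path, that is the name that may show up. A small launcher that resolves the link before exec'ing is one way to make sure the versioned name appears. The sketch below is an assumption about how such a launcher could look; `resolve_exec_args` and `launch` are hypothetical names.

```python
import os


def resolve_exec_args(link_path, extra_args=()):
    """Return the versioned path behind link_path and an argv that uses it."""
    # realpath follows the symlink, e.g. /mnt/foo/bar -> /mnt/foo/bar-20111201000000
    real_path = os.path.realpath(link_path)
    return real_path, [real_path, *extra_args]


def launch(link_path, extra_args=()):
    """Replace the current process with the resolved, versioned binary."""
    real_path, argv = resolve_exec_args(link_path, extra_args)
    # execv does not return on success; argv[0] carries the versioned name,
    # which is what ps will display.
    os.execv(real_path, argv)
```

A startup script would call something like `launch("/mnt/foo/bar", ["--some-flag"])`, where the flag is only a placeholder.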

mbac32768
0

This is a common issue with NFS. When you remove the file, the NFS client still believes the file information it has cached is correct, goes to reload pages from the file, and gets a bus error.

What you want to do is move the existing binary aside, put the new binary in place, and then, once each of the machines has started using the new binary, remove the old one. Apache runs into the same problem when it tries to mmap served files from NFS that change.
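A rough Python sketch of that sequence follows. It swaps in one variation that is not part of the answer as written: instead of moving the old binary aside, it gives it a second name with a hard link and then renames the new build over the live path, so the live path is never missing. The function names and the idea of a separate staging path are assumptions, and all three paths are assumed to be on the same filesystem.

```python
import os


def install_keeping_old(new_build: str, live_path: str, backup_path: str) -> None:
    """Install new_build at live_path while keeping the old file reachable."""
    if os.path.exists(live_path):
        # Clear any stale backup left over from a previous install.
        if os.path.lexists(backup_path):
            os.unlink(backup_path)
        # A second hard link keeps the old file's inode alive after the rename.
        os.link(live_path, backup_path)
    # Rename the staged build over the live name in a single step.
    os.replace(new_build, live_path)


def remove_old(backup_path: str) -> None:
    """Drop the old binary once no host is still running it."""
    if os.path.lexists(backup_path):
        os.unlink(backup_path)
```

Deciding when every host has restarted onto the new binary is left to the operator; `remove_old` should only run after that point.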

karmawhore
  • While that solves my immediate problem, it opens up a timing problem: there's a span of time after the existing binary has been moved out of the way, but before the new binary has been moved into place, during which it won't exist at all. – mbac32768 Jul 20 '10 at 13:17