1

I have some work that needs to be done on 50+ servers. The first step is to check out an updated version of some source code onto a shared directory (assume all of the servers have the shared drive mounted). The second is to perform some work on each of the servers.

I'd prefer to have these two scripts run on each of the servers. All 50+ servers are cloned from a single disk image and it's not practical for me to customize any of them.

When the 50 servers run the first script, I want only the first one that tries to run it to actually run it; I want the others to simply exit. The server that actually runs the script should update the shared directory, then exit. Later, the second script will run on all servers and perform the work based on the updated code that the first server fetched.

What's the best way to do this? Can I reliably have the first script run on one server and create a file or something that acts as a 'semaphore' or 'lock' of some sort that keeps the other servers away?

Making this more complicated is that I'm thinking of having the scripts run from identical cron files on each of the servers -- meaning they could all try to run at the same time, assuming their clocks are set identically.

I'm hoping these will be run from bash scripts. Does this make sense as an approach?

EDIT: Updated based on questions:

We don't want every server to try to check out its own copy of these files -- they are in a multi-GB source code repository, and having 50+ simultaneous checkouts of that code would be difficult for our source control server (and not scalable to 100+ servers).

Adding a cron job to the 50+ servers is not that big of an issue, but adding another customized server with its own configuration is harder. We're already cloning the 50 servers -- maintaining a separate server just to check out the latest source code for the 50+ servers to access seems wasteful and would add more overhead than just adding a script to our current servers.

Kevin Bedell
  • 113
  • 5
  • I could be wrong, but it seems to me you should just have two separate scripts, 1 for updating the code on the shared directory, and another for running the "task" on all 50+ servers, based on the updated code. Why can't script 1 run at a different time (even just a couple minutes off) from script 2? If you do go this route, you should look into Puppet or Chef for maintaining exact/replica copies of files, cron jobs, and tasks on multiple servers. – David W May 06 '13 at 18:13
  • @DavidW I agree and that was my intention, I'm sorry I was not clear. The script that does the checkout would run before the other script. – Kevin Bedell May 06 '13 at 18:20
  • Does this help at all http://serverfault.com/questions/376717/how-to-schedule-server-jobs-more-intelligently-than-with-cron/376724#376724 in particular flock – user9517 May 06 '13 at 19:52

4 Answers

2

Three solutions.

  1. Run the "checkout" step manually, or in a separate script on just one of the servers. This seems like the best approach--otherwise you may run into a race condition.
  2. If you are willing to accept a chance of running into a race condition, you could certainly try creating a specific date-stamped file when the first script runs. Or, if the dates would be reliable enough, you could try checking the last-modified date of the checked-out files (a rough sketch of this follows below).
  3. If customizing the servers is really verboten, then have each VM make its own copy of the files to work on instead of trying to use a shared volume.

Each of these has tradeoffs, but you haven't really made it clear why you want to design the solution this way.
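
A rough sketch of option 2, assuming the checked-out code lives at a hypothetical /mnt/shared/code and treating anything older than six hours as stale (the path and the window are placeholder choices, not part of the question):

#!/bin/bash
# Hypothetical location of the timestamp file -- adjust for your environment.
STAMP=/mnt/shared/code/.last-checkout

# Script 1: after a successful checkout, record when it happened.
date +%s > "${STAMP}"

# Script 2: refuse to run if the checkout is missing or older than 6 hours.
NOW=$(date +%s)
if [ ! -f "${STAMP}" ] || [ $(( NOW - $(cat "${STAMP}") )) -gt 21600 ]; then
    echo "Checked-out code looks stale; aborting." >&2
    exit 1
fi
# ... per-server work goes here ...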

Quinten
  • 1,076
  • 1
  • 11
  • 25
  • Re: 1) We have 50+ servers that need this update and it needs to run automatically during off-hours -- specifically over a weekend. Putting into place a weekend process that requires someone to log in and run one step would be error prone. Re 3) Having 50+ (or 100 eventually) servers all simultaneously doing full checkouts of the source code is a big load for our source control server, and 2) This is the essence of my question -- what's the best way to do that reliably? – Kevin Bedell May 06 '13 at 18:24
  • Truly reliably -- you can't, at least not without some serious re-engineering of your infrastructure. Providing true atomicity over the network is pretty hard. – Matthew Ife May 06 '13 at 18:28
  • If the checkout needs to happen once over the weekend, close in time to the second script running and can't be done in advance, then 1) seems to be the best solution. Just set up cron on one of the servers to run the checkout an hour or so before the second script runs. I know you want each to be identical, but making one different will be the most reliable solution. – Quinten May 06 '13 at 18:41
  • @Quinten Thanks. I understand what you're saying. There are some circumstances where writing a lockfile over NFS can work. I've read that 'mkdir' is atomic and will return a different status based on whether or not the directory already exists. I may have to figure out how to force just one server to check the files out -- like maybe embedding a particular hostname in the script, so the server won't do the checkout unless its hostname matches. – Kevin Bedell May 06 '13 at 18:49
  • @KevinBedell - there you go...make the first script do something like "IF I am then else exit." – TheCleaner May 06 '13 at 18:54
  • 1
    @TheCleaner's idea is a good one. Assuming that you know that the server in question will be available to do the checkout at the time in question. – Quinten May 06 '13 at 19:58
1

There is no true atomicity over the network without a lot of engineering to provide it, and the more engineering required, the more complicated the solution becomes.

There are serious tradeoffs to consider. This answer offers you no insight on what to do when the work is half done.

NFSv3 supports an atomic locking mechanism in newer kernels (well, pretty old ones, to be frank): http://nfs.sourceforge.net/#faq_d10 . So, in theory, a semaphore can be achieved in the following way.

  1. A 'done' file exists on the host already. (this is a signal for script 2 only)
  2. Open an 'acquire' file on the host using O_EXCL.
  3. Rename 'done' to 'done.old'.
  4. Do your special work here.
  5. Open a 'done' file on the host using O_EXCL.
  6. Unlink 'done.old'.
  7. Unlink 'acquire'.

Here's some template shell scripting that attempts this.

#!/bin/bash
# WARNING: This is a critical line! NEVER EDIT THIS
set -e -o noclobber

BASEPATH=/tmp
cd "${BASEPATH}"

# 1. A done file exists on the host already (this is a signal for script 2 only)
# 2. Open an 'acquire' file on the host using `O_EXCL`.
echo > 'acquire'

# 3. Rename 'done' to 'done.old'.
mv 'done' 'done.old' 2>/dev/null || :

# 4. Do your special work here.
echo "How much wood could a woodchuck chuck if a woodchuck could chuck wood?"

# 5. Open a 'done' file using O_EXCL
echo > 'done'

# 6. Unlink 'done.old'.
unlink 'done.old' || :

# 7. Unlink 'acquire'.
unlink 'acquire'

The most important line is set -e -o noclobber, which serves two purposes:

  • It ensures that the script exits if any command fails.
  • The script will not overwrite existing files (redirections that create files behave like an open with O_EXCL).

Given those settings, the most important functional part is echo > 'acquire', which atomically creates the acquire file. If this fails (because another host already created it; even if two opens occur at once, only one will win), the -e option of set ensures we quit the script.

There should never be two of these scripts running in parallel. However, this script does not prevent two runs happening one after another (which would be permitted in its current form). I guess the best means to handle that would be to alter the 'done' file to be a timestamped file whose existence you check for before the process begins. This assumes it is 'safe' to rely on the time to decide whether the critical work needs to run again.

Note that this isn't a complete solution. At the moment it guarantees only that two processes cannot claim the file at the same time. As mentioned, a further modification is needed so that the script does not start at all when a recent 'done' file is already present; a rough sketch of that check follows.
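
For example (the done.<date> naming and the daily granularity are just assumptions to illustrate the idea; adjust them to however often the job actually runs):

# Before touching 'acquire', skip the run entirely if today's work is already done.
TODAY=$(date +%F)            # e.g. 2013-05-06
if [ -e "done.${TODAY}" ]; then
    exit 0
fi

# ... acquire and work as in the script above, but at step 5 create the dated
# file instead of the plain 'done' file:
echo > "done.${TODAY}"

# The 'acquire' file still protects against two hosts running in parallel;
# the dated 'done' file only prevents the same day's work being repeated.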

Other things not covered are:

  • What if the process starts but does not finish?
  • What if the shared directory becomes unavailable before the run, or halfway through it? How should that be handled?
  • If a host takes too long doing the 'safe' work at step 4, how does that affect the next time the job wants to run? Should we use the result of the old instance once it finishes, or start a new instance?

To cover these problems you need a 'fencing' mechanism (which means a lot of changes to the infrastructure) to truly guarantee that re-acquiring the lock on another host is a safe operation.

Matthew Ife
  • 22,927
  • 2
  • 54
  • 71
1

Might I suggest the following:

Nominate one server as a replica code repository. You can then cron the updates to that repository at any interval. The rest of the servers can test whether the replica is available and then rsync the files from the nominated server. This information can be stored in the shared file server space. This will be pretty easy to automate and should be fairly robust.
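
For example, a rough sketch of the per-server sync step (repo-host and /srv/code are placeholder names; substitute your own):

#!/bin/bash
# Hypothetical nominated server and code location.
REPO_HOST=repo-host
CODE_DIR=/srv/code/

# Only sync if the nominated server has already populated its copy.
if ssh "${REPO_HOST}" test -d "${CODE_DIR}"; then
    # -a preserves permissions and timestamps; --delete keeps the local copy identical.
    rsync -a --delete "${REPO_HOST}:${CODE_DIR}" "${CODE_DIR}"
fi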

Another, more radical, solution would be to use BitTorrent Sync. The repository server would be read/write while the others would have a read-only share. It might be quicker, as the network load will be shared amongst the servers. btsync can be set up via a configuration file and the Linux client works pretty well.

EDIT: you can skip the repository server for the radical solution and stick with btsync.

Cheers! :)

Danie
  • 1,350
  • 10
  • 12
0

You will have to use some sort of lock file, created before doing anything else, that records the owner of the first script and the time it ran. When another server tries to execute the script, it should see the lock file and exit. At the end of the script (if it ran), delete said lock file.
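
A minimal sketch of that idea on the shared filesystem, using mkdir as the atomic test-and-set (as the comments below note, mkdir is atomic; the lock path is a placeholder):

#!/bin/bash
# Hypothetical lock location on the shared mount.
LOCKDIR=/mnt/shared/checkout.lock

if mkdir "${LOCKDIR}" 2>/dev/null; then
    # We won the race: record who holds the lock and when.
    echo "$(hostname) $(date)" > "${LOCKDIR}/owner"

    # ... do the checkout into the shared directory here ...

    # Release the lock when finished.
    rm -rf "${LOCKDIR}"
else
    # Another server got there first; just exit.
    exit 0
fi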

Jim B
  • 23,938
  • 4
  • 35
  • 58
  • 'some sort of lock' -- this is the question. How can I do that when 50+ servers are trying to be the first one and have it be reliable? – Kevin Bedell May 06 '13 at 18:31
  • create a file called norun.lock (on the network filesystem) ... I'm fairly certain that file creation is atomic – Jim B May 06 '13 at 18:41
  • 1
    mkdir is atomic. – user9517 May 06 '13 at 19:51
  • Yeah I couldn't remember if it was mkdir or creating a lock file that was guaranteed atomic by the system calls. So instead of a file make a dir to look for then RM it when done. – Jim B May 07 '13 at 04:19