
We have a cluster being used to run MPI jobs for a customer. Previously this cluster used Torque as the scheduler, but we are transitioning to Grid Engine 6.2u5 (for some other features). Unfortunately, we are having trouble duplicating some of our maintenance scripts in the Grid Engine environment.

In Torque, we have a prologue.parallel script which is used to carry out an automated health-check on the node. If this script returns a fail condition, Torque will helpfully offline the node and re-queue the job to use a different group of nodes.

In Grid Engine, however, the queue "prolog" only runs on the head node of the job. We can manually run our prologue script from the startmpi.sh initialization script, for the mpi parallel environment; but I can't figure out how to detect a fail condition and carry out the same "mark offline and requeue" procedure.

Any suggestions?

ajdecon

3 Answers


I can't say I've tried it, but having the prolog script return a value other than 0, 99, or 100 should place the queue in an error state. You may be able to use a similar tactic in the start_proc_args script.
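As a sketch (untested on a live SGE install), the prolog could wrap your existing health check and map its result onto those exit codes. The health-check path here is a placeholder for your own script:

```shell
# Untested sketch: map a health-check result onto the exit codes the
# SGE prolog understands. Per sge_conf(5): 0 lets the job start, 99
# requeues the job, 100 puts the job in an error state, and any other
# value should put the queue instance into an error state.
prolog_status() {
    if "$@" >/dev/null 2>&1; then
        echo 0   # healthy: let the job start
    else
        echo 1   # unhealthy: queue instance goes into an error state
    fi
}

# In the actual prolog script (path is hypothetical):
#   exit "$(prolog_status /opt/admin/node_healthcheck.sh)"
```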

If that doesn't work, I'm not sure if what you are asking is possible to achieve via prolog scripts. Perhaps you could use a health-check cron job (or use your monitoring system of choice) to perform the checks and disable the host's queues if they fail?
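The cron variant could look roughly like this. It assumes `qmod -d` accepts a `*@host` wildcard to disable every queue instance on a host (re-enable later with `qmod -e`); `QMOD` is overridable only so the logic can be exercised without a live SGE installation:

```shell
# Sketch of a cron-driven health check that disables this host's
# queue instances on failure. All paths are hypothetical.
QMOD=${QMOD:-qmod}

disable_if_unhealthy() {
    host=$1; shift
    if "$@" >/dev/null 2>&1; then
        return 0                   # healthy: leave the queues alone
    fi
    "$QMOD" -d "*@${host}"         # unhealthy: disable all queues on this host
    return 1
}

# Example crontab entry (every 15 minutes):
# */15 * * * * /opt/admin/cron_healthcheck.sh
# where that script calls:
#   disable_if_unhealthy "$(hostname)" /opt/admin/node_healthcheck.sh
```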

Kamil Kisiel
  • I'm slowly coming to the conclusion that the cron job will be necessary, but it doesn't quite do the job. Some of our health checks would interfere with a job if it ran at the same time, and our customer's schedule isn't such that there will be regular downtimes. Some of our checks also involve memory or network conditions which should really be checked right before the job runs. :-( – ajdecon Mar 03 '11 at 01:43
  • I feel your pain. It does seem like a deficiency in the GridEngine system if this isn't possible. You could try contacting Univa or some of the open source GridEngine authors and requesting the feature. They seem to be quite receptive to such requests at this point. – Kamil Kisiel Mar 03 '11 at 02:13

In case it's helpful to others, here's what we ended up doing:

  • Health checks on a long time-scale which wouldn't interfere with potentially overlapping jobs (e.g. checking for hardware problems in the storage system) were offloaded to periodic cron jobs, with frequencies depending on the check.

  • Health checks on a long time-scale which might interfere with jobs (memory performance checks) were offloaded to an SGE job submitted to each node in "exclusive" mode, submitted nightly by cron. If the check fails, the node is taken offline before any other jobs can arrive.

  • Checks on environment conditions right before a job runs (stray processes, full memory, etc.) were put in a script run from the PE startup script, startmpi.sh. Commands are sent to the nodes using pdsh, and result codes are returned via stdout. (Not ideal, but it works.) If one or more nodes fail, the script takes them offline and runs qmod -r $JOB_ID to re-queue the job. (Note that the job has to be marked re-runnable, either in its script or by default.) This forces the list of nodes to be rebuilt before the job script actually runs.
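To sketch the pdsh-parsing part of that last step (the "OK"/"FAIL" convention and all paths are our own conventions, not anything SGE mandates):

```shell
# Each node's health check prints "OK" or "FAIL" on stdout, and pdsh
# prefixes every output line with "hostname: "; parse_failures
# extracts the hostnames that did not report OK.
parse_failures() {
    awk -F': *' '$2 != "OK" { print $1 }'
}

# In the real startmpi.sh hook (hypothetical paths; $NODELIST is
# built from the PE hostfile):
#   bad=$(pdsh -w "$NODELIST" /opt/admin/node_healthcheck.sh | parse_failures)
#   if [ -n "$bad" ]; then
#       for h in $bad; do qmod -d "*@$h"; done   # take bad nodes offline
#       qmod -r "$JOB_ID"                        # re-queue; job must be re-runnable
#       exit 1
#   fi
```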

We're currently working on building fault-tolerance into this, but the basics have been confirmed to work. Thanks to @kamil-kisiel and the #gridengine channel on synirc.net for suggestions.

ajdecon

Why not create a load sensor that runs on every node and sets a complex depending on what you test for?

With this approach you can still run jobs that don't depend on, for example, the interconnect when the interconnect network is down.
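A load sensor is just a long-running script: sge_execd sends it a line on stdin at each load-report interval ("quit" tells it to stop), and the sensor answers with load values between begin/end markers. A sketch, assuming a custom boolean complex named node_healthy has been added with qconf -mc, and using a hypothetical health-check path:

```shell
# One report cycle for a hypothetical "node_healthy" complex: run the
# given check command and emit one begin/end block in the load-sensor
# output format, "hostname:complex:value".
sensor_report() {
    host=$1; shift
    if "$@" >/dev/null 2>&1; then ok=1; else ok=0; fi
    echo "begin"
    echo "${host}:node_healthy:${ok}"
    echo "end"
}

# The main loop, as SGE drives it:
# while read -r line; do
#     [ "$line" = "quit" ] && exit 0
#     sensor_report "$(hostname)" /opt/admin/node_healthcheck.sh
# done
```

Jobs that need a healthy node can then request the complex, e.g. with `qsub -l node_healthy=1`.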

Jimmy Hedman