We've been using cron for about as long as I can remember to handle all of our job scheduling needs. Everything from storage clones/snapshots to reports against databases to daily system reports to monitoring checks is scheduled across a few hundred servers via cron.

The drawbacks are pretty obvious: jobs are difficult to manage, there is no easy way to create dependencies (especially across different servers), and, of course, it is inevitable that someone "temporarily" comments out a job but later forgets to remove the comment.

We tried a commercial offering, but in the end it was deemed too expensive as a step up from cron.

I see other options out there, such as SLURM, Oracle Grid Engine, Torque/Maui, Quartz, DIET, and Condor, which appear to be geared toward larger, more homogeneous cluster environments with jobs that could run on any number of similar nodes: grid computing and the like. Our environment is fairly mixed (various Linuxes, AIX, and FreeBSD), and we need to create dependencies across different types of systems (e.g. a job on a Linux box may need to determine whether a job on an AIX box should run).

Does anyone have any experience moving from cron to a more centrally-managed offering? Any tips for choosing the software or whether it is better to go open source or commercial?


6 Answers


Condor, OGE, and Torque can all get you there, but only Condor has built-in dependency management with its DAGMan tool. DAGMan lets you set up a directed acyclic graph that describes your workflow, and the manager takes care of moving through the jobs in your workflow and evaluating pass/fail results at each step in the flow. Condor is relatively platform agnostic, which means DAGMan is too, and you can certainly have one child step run on AIX when the parent ran on Linux or Windows. DAGMan isn't concerned with where jobs run, just that exit codes are pass or fail.
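
To make the pass/fail idea concrete, here is a minimal Python sketch of a dependency-ordered runner. It is not DAGMan and uses no Condor API; the function name `run_workflow`, the job names, and the exit codes are all made up for illustration, and it assumes Python 3.9+ for `graphlib`. It only shows the general pattern: walk the graph in dependency order and run a child only if every parent exited 0.

```python
from graphlib import TopologicalSorter


def run_workflow(parents, run_job):
    """Run jobs in dependency order; skip a job if any parent did not pass.

    parents: dict mapping job name -> set of parent job names.
    run_job: callable taking a job name and returning an exit code (0 = pass).
    Returns a dict mapping job name -> "passed", "failed", or "skipped".
    """
    status = {}
    # static_order() yields every job after all of its parents.
    for job in TopologicalSorter(parents).static_order():
        if any(status[p] != "passed" for p in parents.get(job, ())):
            status[job] = "skipped"
            continue
        status[job] = "passed" if run_job(job) == 0 else "failed"
    return status


if __name__ == "__main__":
    # Hypothetical three-step chain spanning different hosts.
    parents = {
        "linux_extract": set(),
        "aix_load": {"linux_extract"},
        "report": {"aix_load"},
    }
    # Stand-in for real remote execution: canned exit codes per job.
    exit_codes = {"linux_extract": 0, "aix_load": 1, "report": 0}
    print(run_workflow(parents, exit_codes.__getitem__))
```

In a real setup `run_job` would submit the job to whichever host it belongs on and wait for its exit code; the ordering logic does not care where the job ran.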

Any tips for choosing the software or whether it is better to go open source or commercial?

With some caveats I think the free communities in this space are well worth looking at.

OGE is in a weird space now. It's no longer free to run the Oracle-produced GE variant and Oracle is no longer contributing code it writes back to the GE SCC, but there are several forks of the code that are trying to soldier on as free, open source projects. Univa in particular has led the charge, hiring ex-Sun GE devs to continue to work on an open source, freely available GE variant. Grid Engine has two things going for it: it's easy to set up, and it can handle short-running (<2 minute) jobs without imposing a lot of scheduling overhead that slows down throughput. Its big downside is that Windows support is not very good. Some of us put some effort into porting it to run on Cygwin many years ago, but it's not as good as native, that's for sure.

Now Condor is my favourite of the three technologies you mentioned. There's a strong community around Condor and the software is very mature (>20 years old now). Native Windows and POSIX OS support means it runs all over the place very well. The aforementioned DAGMan is just one of the many great pieces that come with Condor. It can be a touch complicated to set up, but once it's up and running it's rock solid. It has an incredibly flexible language for doing job <-> machine matching and building the usage rules for your resources. It also supports dynamic provisioning on machines, letting jobs select how much of a machine's resources they need and then re-advertising the difference as still available. It supports global resource counters so you can constrain against things like software licenses. And of course, it has DAGMan, which is an incredibly powerful tool for workflow management. The downside to Condor is that the scheduling overhead for short-running jobs can be burdensome. Ideally you want jobs that run longer than 2 minutes; otherwise scheduling starts to become a big part of the job's time in the system.

Torque is a little more niche. I know less about it I'm afraid. It compares more to Grid Engine than Condor. There are paid add-ons that @warren mentioned that can expand what the basic, free Torque can do.

If you want to try out the three technologies and see how they work with your specific workloads, CycleCloud can spin up secure, virtualized pools that are pre-configured with Condor, GridEngine or Torque -- so no time spent figuring that stuff out on your part. It'd be a few dollars to spin up small pools of each technology and try them with representative workloads. (Disclaimer: I work for Cycle Computing; we make CycleCloud.)

Ian C.
  • Thanks for the information. Condor seems really geared toward larger collections of machines all capable of running a particular job. The problem I have is more one of having a bunch of jobs which run in very specific locations, but I need to chain jobs together to run in a specific order. Is this something Condor can do as well, or is it going to be painful to make it work this way? – Cakemox Jul 21 '11 at 11:53
  • Condor can handle your situation. You can constrain jobs from DAGs in all kinds of ways so they target very specific machines or hardware in your pools. – Ian C. Jul 21 '11 at 14:45

Chronos looks very promising.

Chronos is Airbnb's replacement for cron. It is a distributed and fault-tolerant scheduler that runs on top of Apache Mesos. You can use it to orchestrate jobs. It supports custom Mesos executors as well as the default command executor. Thus by default, Chronos executes sh (on most systems bash) scripts. Chronos can be used to interact with systems such as Hadoop (incl. EMR), even if the Mesos slaves on which execution happens do not have Hadoop installed. Included wrapper scripts allow transferring files and executing them on a remote machine in the background and using asynchronous callbacks to notify Chronos of job completion or failures.

I've also had great personal success using Jenkins as a cron replacement. It handles executing jobs on remote servers quite nicely. Here's a writeup on it: http://www.22ideastreet.com/blog/2014/05/02/replace-local-cron-with-jenkins/

Greg Sheremeta

For the past 4.5 years, I have worked with HP's (née Opsware) Server Automation platform, and the rest of the Business Technology Optimization suite (Network Automation, Operations Orchestration, etc).

For a large enough environment, job management via SA is a highly viable (and desirable) option. In conjunction with OO, jobs can be controlled via change control management, ticketing, etc.

Here's the not-so-fun part: it's pricey (very pricey). You might check some of the suggestions in a similar question I asked a while back: FLOSS Server management and audit tools.

I'd also say that Torque/Maui/Moab (from Adaptive Computing) are very cool: not sure on pricing, but they are highly flexible tools as well.

Disclaimer - I work for a partner of HP BTO and Adaptive

NOTE: A completely different take on the problem!

cron is old and clunky in certain respects.

If you are indeed looking for new ways to do scheduling I'd try something event based with a messaging middleware. Think RabbitMQ with clients on each server.

Inter-host dependencies can be solved by "notification queues".

"Real" time-based events are a little trickier; that's actually what cron is for (and is quite good at, at least in small environments). Where it gets tricky is preventing hiccups. For example: every night at 0100h, do a snapshot. You might see load spikes or a lot of failing logins at that very moment throughout your whole infrastructure. If you have a queue-based approach you'll get at least some deviation for free (although it's not guaranteed -- unless some logic implements that).
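
As a rough illustration of "some logic implements that", here is a small Python sketch that gives each host a stable offset inside a time window, so a nightly job does not fire everywhere at the same second. The function name `splay_seconds`, the host name, and the 10-minute window are assumptions for the example, not part of any particular scheduler.

```python
import hashlib


def splay_seconds(hostname, job_name, window_seconds):
    """Return a stable offset in [0, window_seconds) for this host and job.

    The offset is derived from a hash, so the same host always gets the
    same delay for the same job, while different hosts are spread out.
    """
    if window_seconds <= 0:
        raise ValueError("window_seconds must be positive")
    digest = hashlib.sha256(f"{hostname}:{job_name}".encode()).digest()
    return int.from_bytes(digest[:8], "big") % window_seconds


if __name__ == "__main__":
    # Hypothetical host and job; spread the 0100h snapshot over 10 minutes.
    delay = splay_seconds("db01.example.com", "nightly-snapshot", 600)
    print(f"start snapshot {delay} seconds after 01:00")
```

A hash-based offset is used instead of a random one so that each host's start time stays predictable from night to night.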

The thing to get around is that without real time-based jobs you can't rely on things like: yeah, my backups will start at 0200h, and if they are still running at 0400h something's wrong. What's easier is making sure that no two jobs that interfere with each other run at the same time. Just make a blocking agent that will only consume one job at a time.
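
A minimal in-process sketch of such a blocking agent, in Python, might look like the following. It uses the standard library `queue` module as a stand-in for a real message broker; the name `start_blocking_agent` and the example jobs are invented for illustration. Because there is exactly one worker thread, a job is only taken off the queue once the previous one has finished.

```python
import queue
import threading


def start_blocking_agent(jobs, results):
    """Start one worker thread that runs queued jobs strictly one at a time.

    jobs: queue.Queue of (name, callable) tuples; None tells the agent to stop.
    results: list that receives (name, return value or exception) per job.
    """
    def worker():
        while True:
            job = jobs.get()  # blocks until the next job arrives
            if job is None:   # sentinel: shut the agent down
                jobs.task_done()
                break
            name, func = job
            try:
                results.append((name, func()))
            except Exception as exc:  # record the failure, keep consuming
                results.append((name, exc))
            finally:
                jobs.task_done()

    thread = threading.Thread(target=worker, daemon=True)
    thread.start()
    return thread


if __name__ == "__main__":
    jobs = queue.Queue()
    results = []
    agent = start_blocking_agent(jobs, results)
    # Hypothetical jobs that must never overlap.
    jobs.put(("snapshot", lambda: "snapshot done"))
    jobs.put(("backup", lambda: "backup done"))
    jobs.put(None)
    agent.join()
    print(results)
```

With a real broker the same idea applies: run a single consumer per group of conflicting jobs and acknowledge a message only after the job completes.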

The management part would be a nice web interface where jobs can be submitted either on demand or on a schedule. For the time-based part we are back to "cron" or your favorite implementation of it (the Java Quartz scheduler has a granularity of seconds, AFAIK), so just use good old cron :)

Please don't downvote me for being OT -- it's a rather rough concept, but since the question doesn't rule out spending money, one might as well spend it on building a solution for the exact in-house requirements rather than on buying something that a vendor thinks fulfills some requirements :)

Martin M.
  • This is interesting for distributing large jobs, but my jobs are much more temporal. I do have some jobs which could be queued like this, though, so I'll keep this in mind for those. – Cakemox Jun 08 '11 at 15:36

I've used Espresso (Cybermation) from CA. Not sure what they're calling it now. I've also used UC4. They both work, cost a lot of money (to my understanding), and can be a bear to maintain, but they do what it says on the tin. /Edit - missed that you say that commercial apps are too expensive. I can definitely agree, but for some companies, it's worth it, especially when it's for business applications that make money.


I've worked with the Open Source Job Scheduler as an option to replace a 2000+ line central crontab in a production environment. Things got so complicated with cron that we could not determine what the downtime windows were or how to deal with inter-server dependencies. This product helped, but was a bit complex to set up.
