I'm looking for an open source solution for the following:
I have jobs that need to run continuously. The jobs are applications or scripts. If a job fails, it needs to be restarted. If it fails too often, say 10 times, either consecutively or within a certain time period, say 1 hour, it needs to be cancelled and a notification sent to a central repository. If a job starts heating up (using too much CPU, memory, etc.), a warning should be issued, and the job should be killed if it gets too hot. Jobs should also be optionally schedulable to run only during certain hours.
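To make the restart rule concrete, here is a rough Python sketch of the behaviour I have in mind. The `RestartPolicy` class, its method names, and the default thresholds are all made up for illustration and are not taken from any existing tool. It only decides "restart or cancel"; launching processes, sending notifications, and monitoring CPU or memory are left out.

```python
from collections import deque


class RestartPolicy:
    """Hypothetical helper: decide whether a failed job is restarted or cancelled."""

    def __init__(self, max_failures=10, window_seconds=3600.0):
        self.max_failures = max_failures
        self.window_seconds = window_seconds
        self._consecutive = 0    # failures since the last clean run
        self._recent = deque()   # timestamps of failures inside the window

    def record_success(self):
        """A clean run resets the consecutive-failure counter."""
        self._consecutive = 0

    def record_failure(self, now):
        """Record a failure at time `now` (in seconds).

        Returns True if the job should be restarted, False if it should
        be cancelled and a notification sent.
        """
        self._consecutive += 1
        self._recent.append(now)
        # Forget failures that have fallen out of the sliding time window.
        while self._recent and now - self._recent[0] > self.window_seconds:
            self._recent.popleft()
        too_many = (
            self._consecutive >= self.max_failures
            or len(self._recent) >= self.max_failures
        )
        return not too_many
```

A supervisor loop would call `record_failure(time.monotonic())` each time the job exits abnormally and stop restarting it as soon as the method returns `False`.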
I know there must be open source, platform-independent, full-featured solutions for this, implemented in a high-level language (e.g. Python), but I'm not even sure what to search for or what such a system is called. I've done a lot of googling but have yet to find something that does all of this.