
We have a bunch of cron jobs that occasionally fail, for example due to a network hiccup. Only rarely do they fail consistently (for example due to a bug or misconfiguration).

I'd like to receive error mails only in the latter case, and suppress the cron mails when a job fails only occasionally. The goal is to fight "pager fatigue", that is, no longer caring about the mails because most of them don't require action anyway.

Are there any tools (for example wrappers around the cron job) that do this? How do other organizations handle a large number of Linux servers with cron jobs?

moritz
  • You monitor the status of the jobs. You have the monitoring system page people, not cron's output. That's how I'd do it anyway. Cron's built for running periodic tasks and emailing the output. Email is not a suitable monitoring tool in my opinion. –  Sep 01 '16 at 07:49
  • The problem with monitoring is that it needs to be kept in sync with the actual cron jobs, which adds extra burden on the maintainer. The cron mails go into a ticket system, so they aren't lost. – moritz Sep 01 '16 at 09:04
  • Keeping monitoring 'kept in sync' is a trivial issue to solve. Touch a file if the job runs successfully. File hasn't been touched in X minutes/hours? Monitoring system sends out an alert. –  Sep 01 '16 at 17:35
  • Then I still have to keep the information which jobs to monitor in the monitoring system, so I have to store the information in two places. Not very good for information hygiene. – moritz Sep 07 '16 at 08:12

1 Answer

The jobs you are running under cron should handle expected errors. It is unusual to have cron jobs that periodically fail. Fix the programs so that they don't fail. That may mean you need to wrap them in retry logic that waits a short period of time, then retries once or twice. However, I don't really like the retry solution.
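If you do go the retry route, a minimal sketch might look like the following. The function name `run_with_retries`, the default of three attempts, and the 30 second delay are illustrative assumptions, not values taken from any particular setup.

```python
import subprocess
import time


def run_with_retries(cmd, attempts=3, delay=30.0, sleep=time.sleep):
    """Run cmd until it exits with status 0 or the attempts are used up.

    cmd is an argument list such as ["/usr/local/bin/sync-job"] (hypothetical path).
    Returns the CompletedProcess of the last attempt.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    result = None
    for attempt in range(1, attempts + 1):
        result = subprocess.run(cmd, capture_output=True, text=True)
        if result.returncode == 0:
            break
        if attempt < attempts:
            # Wait a short period before retrying; no wait after the final attempt.
            sleep(delay)
    return result
```

A wrapper script would call this and print `result.stdout` and `result.stderr` only when the final `returncode` is non-zero, so cron has nothing to mail when a retry succeeds.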

If you have jobs failing routinely because of a "network hiccup", address the network issues. If it is for other reasons, address that issue.

If you want to alert only when the cron job is no longer working (you will need to define what that means), don't alert on individual cron job failures. Build a monitoring process that can detect the problem. This can be difficult. For example, if you are monitoring an update process, a period with no updates can trigger a false positive in the monitor that checks that updates are being done.
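One simple form of such a monitor is the "touch a file" idea from the comments: the job touches a heartbeat file after each successful run, and a separate check alerts when the file is too old. The sketch below assumes that approach; `mark_success`, `is_stale`, and the heartbeat path are hypothetical names, and the maximum age has to be chosen per job.

```python
import os
import time
from pathlib import Path


def mark_success(heartbeat):
    """Called by the job after a successful run: create or touch the file."""
    Path(heartbeat).touch()


def is_stale(heartbeat, max_age_seconds, now=None):
    """Return True if the heartbeat file is missing or older than max_age_seconds."""
    now = time.time() if now is None else now
    try:
        mtime = os.stat(heartbeat).st_mtime
    except FileNotFoundError:
        # The job has never reported success.
        return True
    return now - mtime > max_age_seconds
```

Setting `max_age_seconds` to a few multiples of the job's interval means a single failed run does not make the heartbeat stale, while a job that stays broken does.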

Make sure you have scheduled your cron jobs so that you don't have conflicting jobs running at the same time. A timeline chart may help.

You may be able to cobble together a monitor for your critical jobs that counts the failures and successes and alerts if there have been too many successive failures. This will require an extra step in the job to report its status.
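As a rough sketch of that idea, the wrapper below keeps the count of successive failures in a small state file and returns output only once a threshold is reached. Because cron mails only when a job produces output, staying silent suppresses the mail. The names `record_result` and `run_quietly`, the state file location, and the threshold of three are all assumptions for illustration.

```python
import subprocess
from pathlib import Path


def record_result(state_file, succeeded):
    """Update the consecutive-failure count stored in state_file and return it."""
    path = Path(state_file)
    try:
        failures = int(path.read_text().strip())
    except (FileNotFoundError, ValueError):
        failures = 0  # no usable state yet: start counting from zero
    failures = 0 if succeeded else failures + 1
    path.write_text(f"{failures}\n")
    return failures


def run_quietly(cmd, state_file, threshold=3):
    """Run cmd; return a report only after `threshold` failures in a row.

    Returns an empty string otherwise, so printing it gives cron nothing to mail.
    """
    result = subprocess.run(
        cmd, stdout=subprocess.PIPE, stderr=subprocess.STDOUT, text=True
    )
    failures = record_result(state_file, result.returncode == 0)
    if failures >= threshold:
        return f"{failures} consecutive failures of {cmd!r}\n{result.stdout}"
    return ""
```

The crontab entry would then invoke a small script that prints the return value of `run_quietly` with `end=""`. A single success resets the counter, so only a job that keeps failing produces mail. Each job needs its own state file.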

BillThor