2

A problem I keep running into with Ansible: one deployment step should run when any of a number of preparation steps has changed, but that changed status is lost due to fatal errors.

When Ansible cannot continue after one successful preparation step, I still want the machine to eventually reach the state the playbook was meant to achieve. But Ansible forgets, e.g.:

- hosts: all
  tasks:
    - name: "(a) some task is changed"
      git:
        update: yes
        ...
      notify:
        # (b) ansible knows about having to call the handler later!
        - apply

    - name: "(c) connection lost here"
      command: ...
      notify:
        - apply

  handlers:
    - name: apply
      # (d) handler never runs: on the next invocation git-fetch is a no-op
      command: /bin/never

Since the preparation step (a) is now a no-op, running again does not recover this information. For some tasks, just running ALL handlers is good enough. For others, one can rewrite the handlers into tasks that know when: to run. But some tasks and checks are expensive and/or unreliable, so this is not always good enough.
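To illustrate the "rewrite handlers into tasks" workaround: a handler can become a pair of tasks, where a (possibly expensive) check decides whether the action still needs to run. The check script and its exit-code convention here are hypothetical stand-ins:

```yaml
- name: check whether apply is still needed
  # hypothetical check script: rc 0 = apply needed, rc 1 = nothing to do
  command: /usr/local/bin/needs-apply
  register: needs_apply
  changed_when: false
  failed_when: needs_apply.rc not in [0, 1]

- name: apply
  # stands in for the real handler action
  command: /bin/apply
  when: needs_apply.rc == 0
```

Unlike a notified handler, this check runs on every invocation, which is exactly why it is a poor fit when the check itself is expensive or unreliable.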

Partial solutions:

  1. Write out a file and check for its existence later, instead of relying on the Ansible handler. This feels like an antipattern: after all, Ansible knows what's left to do - I just do not know how to get it to remember that across multiple attempts.
  2. Stay in a loop until it works or a manual fix is applied, however long that may take. This seems like a bad trade, because now I might not be able to use Ansible against the same group of targets, or I have to safeguard against undesirable side effects of multiple concurrent runs.
  3. Require higher reliability of the targets, so failures are rare enough to justify always resolving these situations manually, using --start-at-task= and checking which handlers are still needed. Experience says things do occasionally break, and right now I am adding more things that can.
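For reference, the flag-file approach from option 1 might be sketched like this (the flag path and commands are illustrative, not from the original question):

```yaml
- name: "(a) preparation step"
  git:
    repo: https://example.com/repo.git   # illustrative
    dest: /srv/checkout
  register: prep

- name: remember that apply is still pending
  # the flag file survives an aborted run, unlike a handler notification
  file:
    path: /var/lib/myapp/apply-pending   # hypothetical path
    state: touch
  when: prep is changed

- name: check for a pending apply
  stat:
    path: /var/lib/myapp/apply-pending
  register: pending

- name: apply
  command: /bin/apply                    # illustrative
  when: pending.stat.exists

- name: clear the flag only after a successful apply
  file:
    path: /var/lib/myapp/apply-pending
    state: absent
  when: pending.stat.exists
```

Because the flag is only removed after the apply task succeeds, a crash anywhere in between leaves the flag in place for the next run.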

Is there a pattern, feature or trick to properly handle such errors?

anx
  • 6,875
  • 4
  • 22
  • 45

2 Answers

1

The Ansible docs you linked to suggest a way to deal with this:

Ansible runs handlers at the end of each play. If a task notifies a handler but another task fails later in the play, by default the handler does not run on that host, which may leave the host in an unexpected state. For example, a task could update a configuration file and notify a handler to restart some service. If a task later in the same play fails, the configuration file might be changed but the service will not be restarted.

You can change this behavior with the --force-handlers command-line option, by including force_handlers: True in a play, or by adding force_handlers = True to ansible.cfg. When handlers are forced, Ansible will run all notified handlers on all hosts, even hosts with failed tasks. (Note that certain errors could still prevent the handler from running, such as a host becoming unreachable.)

Placing it in ansible.cfg will ensure that it is the default behavior for every playbook and role you run.
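Concretely, the three ways to enable this look as follows:

```yaml
# 1. On the command line:
#    ansible-playbook site.yml --force-handlers

# 2. Per play, in the playbook itself:
- hosts: all
  force_handlers: true
  tasks:
    # ...

# 3. Globally, in ansible.cfg:
#    [defaults]
#    force_handlers = True
```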

Very little can save you if the host dies during a playbook run.

Michael Hampton
  • 237,123
  • 42
  • 477
  • 940
  • Well, *fatal errors* such as connection issues stopping ansible from continuing was sort of the premise.. (This is still helpful, because I did *not* use this option in a place I should have!) – anx Oct 15 '20 at 14:22
0

It seems that currently the only way to tackle this problem is the one Michael Hampton pointed out.

IMHO this is not a fully satisfying solution, since the handlers themselves can fail for the same reason that made the playbook run crash in the first place. A better solution would persist handler notification state between playbook executions, ideally on the remote hosts. There is already the concept of facts and custom facts, which hold some state on the remote host's disk.

Currently I have no working concept of how to implement that.
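For illustration only, such persistence might build on custom local facts (files under /etc/ansible/facts.d are read back as ansible_local during fact gathering); the fact name and apply command below are hypothetical, and this sketch does not solve the open problem of when to clear the fact safely:

```yaml
- name: "(a) preparation step"
  git:
    repo: https://example.com/repo.git               # illustrative
    dest: /srv/checkout
  register: prep

- name: persist the pending notification as a local fact
  copy:
    dest: /etc/ansible/facts.d/pending_apply.fact    # read back as ansible_local.pending_apply
    content: '{ "pending": true }'
  when: prep is changed

# On a later run, after fact gathering, the crashed run's state is visible:
- name: apply if a notification was persisted earlier
  command: /bin/apply                                # illustrative
  when: ansible_local.pending_apply.pending | default(false)

- name: clear the persisted notification
  file:
    path: /etc/ansible/facts.d/pending_apply.fact
    state: absent
  when: ansible_local.pending_apply.pending | default(false)
```

Note that local facts are read at fact-gathering time, so a fact written during a play is only seen on the next run (or after an explicit setup/gather step).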