I am trying to add an EC2 instance to an Elastic Load Balancer using an Ansible playbook with the ec2_elb module. This is the task that should do this:

- name: "Add host to load balancer {{ load_balancer_name }}"
  sudo: false
  local_action:
    module: ec2_elb
    state: present
    wait: true
    region: "{{ region }}"
    ec2_elbs: ['{{ load_balancer_name }}']
    instance_id: "{{ ec2_id }}"

However, it routinely fails, with this output (verbosity turned up):

TASK: [Add host to load balancer ApiELB-staging] ****************************** 
<127.0.0.1> REMOTE_MODULE ec2_elb region=us-east-1 state=present instance_id=i-eb7e0cc7
<127.0.0.1> EXEC ['/bin/sh', '-c', 'mkdir -p $HOME/.ansible/tmp/ansible-tmp-1409156786.81-113716163813868 && chmod a+rx $HOME/.ansible/tmp/ansible-tmp-1409156786.81-113716163813868 && echo $HOME/.ansible/tmp/ansible-tmp-1409156786.81-113716163813868']
<127.0.0.1> PUT /var/folders/d4/17fw96k107d5kbck6fb2__vc0000gn/T/tmpki4HPF TO /Users/pkaeding/.ansible/tmp/ansible-tmp-1409156786.81-113716163813868/ec2_elb
<127.0.0.1> EXEC ['/bin/sh', '-c', u'LANG=en_US.UTF-8 LC_CTYPE=en_US.UTF-8 /usr/bin/python /Users/pkaeding/.ansible/tmp/ansible-tmp-1409156786.81-113716163813868/ec2_elb; rm -rf /Users/pkaeding/.ansible/tmp/ansible-tmp-1409156786.81-113716163813868/ >/dev/null 2>&1']
failed: [10.0.115.149 -> 127.0.0.1] => {"failed": true}
msg: The instance i-eb7e0cc7 could not be put in service on LoadBalancer:ApiELB-staging. Reason: Instance has not passed the configured HealthyThreshold number of health checks consecutively.

FATAL: all hosts have already failed -- aborting

I have my ELB configuration defined like this (also via Ansible):

- name: "Ensure load balancer exists: {{ load_balancer_name }}"
  sudo: false
  local_action:
    module: ec2_elb_lb
    name: "{{ load_balancer_name }}"
    state: present
    region: "{{ region }}"
    subnets: "{{ vpc_public_subnet_ids }}"
    listeners:
      - protocol: https
        load_balancer_port: 443
        instance_protocol: http
        instance_port: 8888
        ssl_certificate_id: "{{ ssl_cert }}"
    health_check:
        ping_protocol: http # options are http, https, ssl, tcp
        ping_port: 8888
        ping_path: "/internal/v1/status"
        response_timeout: 5 # seconds
        interval: 30 # seconds
        unhealthy_threshold: 10
        healthy_threshold: 10
  register: apilb

When I access the status resource from either my laptop or from the server itself (as localhost) I get a 200 response as expected. I also added a command task to the Ansible playbook, right before adding the instance to the ELB, to confirm that the application is booted up and serving requests properly (and it is):

- command: /usr/bin/curl -v --fail http://localhost:8888/internal/v1/status

I don't see any non-200 responses for the status check resource in the logs for my application (but of course, if the requests never made it as far as my application, they would not be logged).

The other weird thing is that the instance does get added to the ELB, and it seems to work properly. So I know that at some point, at least, the load balancer can access the application properly (for both the status check resource, and other resources). The AWS console shows the instance is healthy, and the Cloudwatch charts don't show any failed health checks.

Any ideas?

pkaeding
  • possible duplicate of [ELB Instance Out of service](http://serverfault.com/questions/478330/elb-instance-out-of-service) – Ladadadada Aug 27 '14 at 16:57
  • @Ladadadada if my issue is the same, then wouldn't there need to be failed (ie non-200) status check requests? I don't see any in my logs, and I did verify that the application is up and returning 200 for the status check before trying to add it to the ELB. – pkaeding Aug 27 '14 at 17:08
  • No. All status check requests succeed, but the default state of a new instance when added to an ELB is **unhealthy**. In your case it must send back 10 responses where the status is `200`, each separated by 30 seconds before the instance will be considered **healthy**. – Ladadadada Aug 27 '14 at 17:14
  • Hmm, interesting. How can the Ansible `ec2_elb` play ever *not* fail, then, if it doesn't wait at least the minimum time for the instance to be considered healthy? – pkaeding Aug 27 '14 at 17:16
  • Maybe I am running into https://github.com/ansible/ansible/issues/5305 – pkaeding Aug 27 '14 at 17:21
  • Judging from the docs, there's a [`wait_timeout`](http://docs.ansible.com/ec2_elb_module.html) parameter which you will have to set to something higher than 300 for this to work. (330 would be safe). Or lower your `interval` or `healthy_threshold` so that you have to wait less than 300 seconds. Your `unhealthy_threshold` is the same, so once a web server starts throwing 500 responses, it will stay in the pool for 5 minutes before the ELB drops it. – Ladadadada Aug 27 '14 at 17:23
  • Ahh, you are right! If you would like to put the suggestion of fixing the `wait_timeout` into an answer, I'd be happy to accept it. It turns out my problem was on the Ansible side of things, so this isn't really a duplicate of the other question. – pkaeding Aug 27 '14 at 17:37
  • "Hmm, interesting. How can the Ansible ec2_elb play ever not fail, then, if it doesn't wait at least the minimum time for the instance to be considered healthy?" You give it a non-existent instance ID, one that's not in the right VPC/AZ, etc. – ceejayoz Aug 27 '14 at 18:07

2 Answers

Adapted from my earlier comment:

Judging from the Ansible docs, there's a `wait_timeout` parameter which you will have to set to something higher than 300 seconds for this to work (330 would be safe).
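For example, a minimal sketch based on the task in the question (330 just gives the module a little more than the 10 × 30 s = 300 s the ELB needs to count the instance as healthy):

- name: "Add host to load balancer {{ load_balancer_name }}"
  sudo: false
  local_action:
    module: ec2_elb
    state: present
    wait: true
    wait_timeout: 330 # longer than healthy_threshold (10) x interval (30 s) = 300 s
    region: "{{ region }}"
    ec2_elbs: ['{{ load_balancer_name }}']
    instance_id: "{{ ec2_id }}"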

Or you could lower your `interval` or `healthy_threshold` (or both) so that the instance is considered healthy in less than 300 seconds.

Your `unhealthy_threshold` is the same as your `healthy_threshold`, so once a web server starts throwing 500 responses, it will stay in the pool for 5 minutes before the ELB drops it.
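If you go the threshold route instead, here is a rough sketch of a tightened health_check based on the block in the question (the specific numbers below are only illustrative):

    health_check:
        ping_protocol: http
        ping_port: 8888
        ping_path: "/internal/v1/status"
        response_timeout: 5 # seconds
        interval: 10 # seconds
        unhealthy_threshold: 2 # a failing instance is dropped after ~20 seconds
        healthy_threshold: 3 # a new instance counts as healthy after ~30 seconds

With values like these the instance is marked healthy in well under 300 seconds, so the ec2_elb task completes without needing a longer wait_timeout.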

Ladadadada

You can use the `ec2_elb` option `wait: no`.
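For example, a sketch of the task from the question with the wait disabled (the module then registers the instance and returns immediately instead of polling until the ELB reports it healthy):

- name: "Add host to load balancer {{ load_balancer_name }}"
  sudo: false
  local_action:
    module: ec2_elb
    state: present
    wait: no # don't block until the instance passes the ELB health check
    region: "{{ region }}"
    ec2_elbs: ['{{ load_balancer_name }}']
    instance_id: "{{ ec2_id }}"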

Alexey Vazhnov
  • This is the correct answer for me. `wait: no` doesn't wait for the instance to pass the ELB health check before continuing, whereas in my case it only becomes healthy later on. – Morgan Christiansson May 19 '16 at 13:45