19

We're trying to run a fairly straightforward setup on Amazon EC2 - several HTTP servers sitting behind an Amazon Elastic Load Balancer (ELB).

Our domain is managed in Route53, and we have a CNAME record set up to point to the ELB.

We've experienced some issues where some - but not all - locations are intermittently unable to connect to the load balancer; it seems that this may be related to the resolution of the ELB's domain name.

Amazon support advised us that the underlying Elastic IP of the load balancer has been changing, and that the problem is that some ISPs' DNS servers do not honour the TTL. We're not satisfied with this explanation, because we replicated the problem using Amazon's own DNS servers from an EC2 instance, as well as on local ISPs in Australia and via Google's DNS server (8.8.8.8).

Amazon also confirmed that during the period where we noticed downtime from some locations, traffic passing through the ELB was down significantly - so the problem is not with our endpoints.

Interestingly, the domain seems to resolve to the correct IP on the servers that cannot connect - but the attempt to establish a TCP connection fails.

All the instances attached to the ELB have been healthy at all times. They're all

Does anyone know how we might go about diagnosing this problem more deeply? Has anyone else experienced this problem with the Elastic Load Balancer?

Thanks,

Cera
  • I should add as another note - despite this seemingly being potentially related to DNS or routing, as far as we can tell our domain *always* resolves to the correct EIP - running the `host` utility resolves to the same address on systems where we can connect and systems where we can't. – Cera Jan 15 '13 at 22:17

3 Answers

22

I found this question while Googling for how to diagnose Amazon Elastic Load Balancers (ELBs) and I want to answer it for anyone else like me who has had this trouble without much guidance.

ELB Properties

ELBs have some interesting properties. For instance:

  • ELBs are made up of 1 or more nodes
  • These nodes are published as A records for the ELB name
  • These nodes can fail, or be shut down, and connections will not be closed gracefully
  • It often requires a good relationship with Amazon support ($$$) to get someone to dig into ELB problems

NOTE: Another interesting, but slightly less pertinent, property is that ELBs were not designed to handle sudden spikes of traffic. They typically require 15 minutes of heavy traffic before they will scale up, or they can be pre-warmed on request via a support ticket.

Troubleshooting ELBs (manually)

Update: AWS has since migrated all ELBs to use Route 53 for DNS. In addition, all ELBs now have an all.$elb_name record that will return the full list of nodes for the ELB. For example, if your ELB name is elb-123456789.us-east-1.elb.amazonaws.com, then you would get the full list of nodes by doing something like dig all.elb-123456789.us-east-1.elb.amazonaws.com. For IPv6 nodes, all.ipv6.$elb_name also works. In addition, Route 53 is able to return up to 4KB of data still using UDP, so using the +tcp flag may not be necessary.

Knowing this, you can do a little bit of troubleshooting on your own. First, resolve the ELB name to a list of nodes (as A records):

$ dig @ns-942.amazon.com +tcp elb-123456789.us-east-1.elb.amazonaws.com ANY

The +tcp flag is suggested as your ELB could have too many records to fit inside of a single UDP packet. I'm also told, but haven't personally confirmed, that Amazon will only display up to 6 nodes unless you perform an ANY query. Running this command will give you output that looks something like this (trimmed for brevity):

;; ANSWER SECTION:
elb-123456789.us-east-1.elb.amazonaws.com. 60 IN SOA ns-942.amazon.com. root.amazon.com. 1376719867 3600 900 7776000 60
elb-123456789.us-east-1.elb.amazonaws.com. 600 IN NS ns-942.amazon.com.
elb-123456789.us-east-1.elb.amazonaws.com. 60 IN A 54.243.63.96
elb-123456789.us-east-1.elb.amazonaws.com. 60 IN A 23.21.73.53
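
If you'd rather script the lookup, here is a minimal Python sketch of the same step using only the standard library. It goes through the system resolver via socket.getaddrinfo instead of querying Amazon's nameserver directly like the dig command does, so answers may be cached and you only get A records back. The function name elb_node_ips and its resolver parameter are my own inventions for illustration, not part of any AWS tooling.

```python
import socket


def elb_node_ips(elb_name, use_all_record=False, resolver=socket.getaddrinfo):
    """Return the sorted, de-duplicated IPv4 addresses published for an ELB name.

    use_all_record prepends "all." to the name, which the ELB DNS is described
    as answering with the full node list. resolver is injectable (hypothetical
    parameter) so the function can be exercised without network access.
    """
    name = "all." + elb_name if use_all_record else elb_name
    # getaddrinfo yields (family, type, proto, canonname, sockaddr) tuples;
    # for AF_INET the sockaddr is an (ip, port) pair.
    infos = resolver(name, 80, socket.AF_INET, socket.SOCK_STREAM)
    return sorted({info[4][0] for info in infos})
```

Calling elb_node_ips("elb-123456789.us-east-1.elb.amazonaws.com") should then give you the list of node addresses to test one by one.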

Now, for each of the A records, use e.g. curl to test a connection to the ELB. Of course, you also want to isolate your test to just the ELB without connecting to your backends. One final property and little-known fact about ELBs:

  • The maximum size of the request method (verb) that can be sent through an ELB is 127 characters. Any larger and the ELB will reply with an HTTP 405 - Method not allowed.

This means that we can take advantage of this behavior to test only that the ELB is responding:

$ curl -X $(python -c 'print("A" * 128)') -i http://ip.of.individual.node
HTTP/1.1 405 METHOD_NOT_ALLOWED
Content-Length: 0
Connection: Close

If you see HTTP/1.1 405 METHOD_NOT_ALLOWED then the ELB is responding successfully. You might also want to adjust curl's timeouts to values that are acceptable to you.
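
The same check can be sketched in Python with http.client, which is handy if you want to loop over every node. This is only a rough equivalent of the curl command, not a replacement for it: the function name probe_elb_node, the 5 second default timeout, and the connection_factory parameter (there so the function can be tested without a real ELB) are all assumptions of mine.

```python
import http.client


def probe_elb_node(ip, port=80, timeout=5.0, method_length=128,
                   connection_factory=http.client.HTTPConnection):
    """Send a request with an over-long method to one node.

    Returns the HTTP status code (405 is the reply described above for a
    method longer than 127 characters), or None if the node could not be
    reached or did not answer with valid HTTP.
    """
    conn = connection_factory(ip, port, timeout=timeout)
    try:
        # A method of 128 characters exceeds the 127-character limit.
        conn.request("A" * method_length, "/")
        return conn.getresponse().status
    except (OSError, http.client.HTTPException):
        # Covers connection refused, timeouts, and malformed responses.
        return None
    finally:
        conn.close()
```

A return value of 405 means the node answered; None means the TCP connection or the HTTP exchange failed, which is the case worth reporting.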

Troubleshooting ELBs using elbping

Of course, doing this can get pretty tedious, so I've built a tool to automate this called elbping. It's available as a Ruby gem, so if you have RubyGems then you can install it by simply doing:

$ gem install elbping

Now you can run:

$ elbping -c 4 http://elb-123456789.us-east-1.elb.amazonaws.com
Response from 54.243.63.96: code=405 time=210 ms
Response from 23.21.73.53: code=405 time=189 ms
Response from 54.243.63.96: code=405 time=191 ms
Response from 23.21.73.53: code=405 time=188 ms
Response from 54.243.63.96: code=405 time=190 ms
Response from 23.21.73.53: code=405 time=192 ms
Response from 54.243.63.96: code=405 time=187 ms
Response from 23.21.73.53: code=405 time=189 ms
--- 54.243.63.96 statistics ---
4 requests, 4 responses, 0% loss
min/avg/max = 187/194/210 ms
--- 23.21.73.53 statistics ---
4 requests, 4 responses, 0% loss
min/avg/max = 188/189/192 ms
--- total statistics ---
8 requests, 8 responses, 0% loss
min/avg/max = 187/192/210 ms

Remember, if you see code=405 then that means that the ELB is responding.
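
If you collect the timings yourself rather than with elbping, the per-node summary is easy to reproduce. The following is a small Python sketch of that bookkeeping; it is not elbping's actual implementation, and the function name summarize and the dictionary keys are made up for illustration.

```python
def summarize(latencies_ms, requests_sent):
    """Summarize one node's results: loss percentage plus min/avg/max latency.

    latencies_ms holds one entry per response received, so any request that
    got no response counts as lost.
    """
    responses = len(latencies_ms)
    summary = {
        "requests": requests_sent,
        "responses": responses,
        "loss_pct": 100.0 * (requests_sent - responses) / requests_sent,
        "min": None,
        "avg": None,
        "max": None,
    }
    if latencies_ms:
        summary["min"] = min(latencies_ms)
        summary["avg"] = sum(latencies_ms) / responses
        summary["max"] = max(latencies_ms)
    return summary
```

A node that shows non-zero loss while the others stay at 0% is the kind of evidence that is worth attaching to a support ticket.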

Next Steps

Whichever method you choose, you will at least know if your ELB's nodes are responding or not. Armed with this knowledge, you can either turn your focus to troubleshooting other parts of your stack or be able to make a pretty reasonable case to AWS that something is wrong.

Hope this helps!

Charles Hooper
  • 1
    Thanks for the great answer. We originally figured out most of this through trial and error, but this will be a handy reference. – Cera Aug 18 '13 at 23:15
7

The fix is actually simple: Use an A record rather than a CNAME in Route53.

In the AWS Management Console, choose "A record" and then move the radio button labeled "Alias" to "Yes." Then select your ELB from the dropdown menu.
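
If you manage DNS from code instead of the console, the same alias record can be expressed as a Route53 change batch. Below is a hedged Python sketch that only builds the request body; the function name alias_change_batch is my own, and the commented-out call assumes you have boto3 installed and credentials configured. Note that the hosted zone ID inside AliasTarget is the ELB's own zone ID (reported in the load balancer's description), not the ID of your domain's zone.

```python
def alias_change_batch(record_name, elb_dns_name, elb_hosted_zone_id):
    """Build a Route53 ChangeBatch that points an A record at an ELB via alias.

    Alias records carry no TTL of their own, so none is set here.
    """
    return {
        "Comment": "Alias A record pointing at an ELB",
        "Changes": [
            {
                "Action": "UPSERT",
                "ResourceRecordSet": {
                    "Name": record_name,
                    "Type": "A",
                    "AliasTarget": {
                        "HostedZoneId": elb_hosted_zone_id,
                        "DNSName": elb_dns_name,
                        "EvaluateTargetHealth": False,
                    },
                },
            }
        ],
    }


# Hypothetical usage (all identifiers below are placeholders):
#   import boto3
#   boto3.client("route53").change_resource_record_sets(
#       HostedZoneId="<ID of your domain's hosted zone>",
#       ChangeBatch=alias_change_batch(
#           "www.example.com.",
#           "elb-123456789.us-east-1.elb.amazonaws.com.",
#           "<ELB hosted zone ID>",
#       ),
#   )
```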

jamieb
  • 1
    I don't understand the rationale behind this fix. Amazon's documentation for the ELB specifically says that a `CNAME` record should be used. What would be the benefit of an `A` record / what is changing here? – Cera Jan 15 '13 at 22:15
  • 3
    You'd have to use a CNAME if your DNS was hosted someplace other than Route53. But A record aliasing is a feature that is specific to Route53 and is intended to solve the exact problem you're encountering. The [Route53 docs](http://docs.aws.amazon.com/Route53/latest/DeveloperGuide/CreatingAliasRRSets.html) explain it in greater depth. – jamieb Jan 15 '13 at 22:39
  • @jamieb Can you provide a link to that piece of documentation? – Till Sep 12 '13 at 14:19
  • 1
    It's called "Alias Target" as opposed to an A record. http://docs.aws.amazon.com/Route53/latest/DeveloperGuide/HowToAliasRRS.html – Jonny07 Jul 29 '14 at 22:03
0

There are some potential solutions you could try in this AWS developer forum thread: https://forums.aws.amazon.com/message.jspa?messageID=387552

For example:

potential fix #1

We had a similar problem when we moved to ELB. We resolved this by reducing the name of our ELB to a single character. Even a 2 char name for ELB caused random problems with Network Solutions DNS resolutions.

Your ELB's DNS name should be something like -> X.<9chars>.us-east-1.elb.amazonaws.com

potential fix #2

I'm the original poster. Thanks for all the responses. We were able to reduce the frequency with which we experienced DNS issues by setting the TTL very high (so they would be cached by non-Network Solutions servers). However, we were still getting enough problems that we just couldn't stay with Network Solutions any longer. We thought of moving to UltraDNS based on good reports on the service, but it looked like Route 53 (which uses UltraDNS under the covers, it would appear) would be cheaper for us. Since switching to Route 53, we have no more DNS issues, and our ELB names can be nice and long too.

There were other things to try in that post but those seem to be the best leads.

slm
  • Thanks for the suggestions. Unfortunately it seems that the problem lies purely in the DNS resolution of the hostname for the ELB, not for our record that aliases to it. Our record always resolves to the ELB's hostname properly. – Cera Jan 15 '13 at 22:18
  • Did @jamieb's fix solve the problem? – slm Jan 15 '13 at 22:38
  • If I understand you correctly then the problem is that you have CNAME/ANAME records that resolve to a CNAME/ANAME record ELB, and your part is resolving just fine, no performance issues, but once you get to the ELB's DNS records the performance problems show up? – slm Jan 15 '13 at 22:40
  • @slm - potential fix #1 does not help. I would recommend removing it from the post. – Ursus Mar 30 '17 at 20:34