
As far as I can tell, this problem is isolated to PowerDNS. The servers run two packages, pdns-static-3.0.1-1.i386.rpm and pdns-recursor-3.3-1.i386.rpm, on the most recent version of Amazon Linux.

The Amazon EC2 load balancers are assigned a CNAME that resolves to multiple addresses. Below is an example of the actual behavior. Notice how the addresses are always in the same order.

[root@localhost ~]# host cache.domain.com
cache.domain.com is an alias for xxxxx.us-east-1.elb.amazonaws.com.
xxxxx.us-east-1.elb.amazonaws.com has address aaa.aaa.aaa.aaa
xxxxx.us-east-1.elb.amazonaws.com has address bbb.bbb.bbb.bbb
[root@localhost ~]# host cache.domain.com
cache.domain.com is an alias for xxxxx.us-east-1.elb.amazonaws.com.
xxxxx.us-east-1.elb.amazonaws.com has address aaa.aaa.aaa.aaa
xxxxx.us-east-1.elb.amazonaws.com has address bbb.bbb.bbb.bbb
[root@localhost ~]# host cache.domain.com
cache.domain.com is an alias for xxxxx.us-east-1.elb.amazonaws.com.
xxxxx.us-east-1.elb.amazonaws.com has address aaa.aaa.aaa.aaa
xxxxx.us-east-1.elb.amazonaws.com has address bbb.bbb.bbb.bbb

The expected behavior is round robin across the hosts:

[root@localhost ~]# host cache.domain.com
cache.domain.com is an alias for xxxxx.us-east-1.elb.amazonaws.com.
xxxxx.us-east-1.elb.amazonaws.com has address aaa.aaa.aaa.aaa
xxxxx.us-east-1.elb.amazonaws.com has address bbb.bbb.bbb.bbb
[root@localhost ~]# host cache.domain.com
cache.domain.com is an alias for xxxxx.us-east-1.elb.amazonaws.com.
xxxxx.us-east-1.elb.amazonaws.com has address bbb.bbb.bbb.bbb
xxxxx.us-east-1.elb.amazonaws.com has address aaa.aaa.aaa.aaa
[root@localhost ~]# host cache.domain.com
cache.domain.com is an alias for xxxxx.us-east-1.elb.amazonaws.com.
xxxxx.us-east-1.elb.amazonaws.com has address aaa.aaa.aaa.aaa
xxxxx.us-east-1.elb.amazonaws.com has address bbb.bbb.bbb.bbb

The addresses do eventually swap, but it seems to happen on a 30-minute cache timer; changing the TTL of the record doesn't appear to affect anything. It appears as though the resolver caches the response. This adversely affects my application because all of the load is sent to only one of the load balancers (Availability Zones), so if I have servers in two zones, only one zone is under load at a time.

Do you know how I can fix this so that the order of the addresses alternates each time the host is resolved?


DIG OUTPUT

; <<>> DiG 9.7.6-P1-RedHat-9.7.6-1.P1.18.amzn1 <<>> cache.domain.com
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 54610
;; flags: qr rd ra; QUERY: 1, ANSWER: 3, AUTHORITY: 0, ADDITIONAL: 0

;; QUESTION SECTION:
cache.domain.com.           IN      A

;; ANSWER SECTION:
cache.domain.com.    100     IN      CNAME   xxxxx.us-east-1.elb.amazonaws.com.
xxxxx.us-east-1.elb.amazonaws.com. 3 IN A aaa.aaa.aaa.aaa
xxxxx.us-east-1.elb.amazonaws.com. 3 IN A bbb.bbb.bbb.bbb

;; Query time: 0 msec
;; SERVER: ccc.ccc.ccc.ccc#53(ccc.ccc.ccc.ccc)
;; WHEN: Mon Jul  2 15:09:27 2012
;; MSG SIZE  rcvd: 130

Recursor config

allow-from=0.0.0.0/0
dont-query=
local-address=127.0.0.1
local-port=530                                                                  # The port should be changed to 530 because it's not good to run on the same port as the DNS server
quiet=yes
setgid=pdns
setuid=pdns
disable-packetcache=
packetcache-ttl=0
forward-zones=domain.local=LOCALIP,domain.cloud=LOCALIP                         # Forward the two zones we care about back to the local DNS server
forward-zones-recurse=amazonaws.com=172.16.0.23,compute-1.internal=172.16.0.23  # Forward queries for Amazon's domains to Amazon's resolver

SOLUTION

Add the following lines to recursor.conf:

disable-packetcache=
packetcache-ttl=0

Add the following line to pdns.conf:

recursive-cache-ttl=0
bwight
  • regarding the bounty: you can't have both a detailed canonical answer, and a specific problem solved. Furthermore, configuring the PowerDNS Recursor *is* the solution for this specific problem. – Habbie Jul 03 '12 at 07:27

2 Answers


The PowerDNS Recursor caches at two levels.

It caches responses from authoritative servers for up to the TTL specified in the response it got (limited by max-cache-ttl but never exceeding the TTL it got from an auth).

Additionally, when a response packet from the recursor to a client (your clients that are generating load) is generated and sent, this packet is cached as a whole, so that the same question can be answered extremely quickly (without any parsing) if it comes in again. This is called the packetcache.

Shuffling happens in between these two levels. This means that your results are in fact shuffled, but their shuffle order is kept stable by the packetcache (for up to an hour, by default). If you want per-response shuffle, set 'disable-packetcache' or 'packetcache-ttl=0'.
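
To make the interaction between the two levels concrete, here is a minimal Python sketch of the behavior described above. It is a toy model, not PowerDNS code: the class `RecursorSketch`, the helper `distinct_orders`, and the placeholder addresses are all hypothetical, and TTL expiry is left out for brevity.

```python
import random


class RecursorSketch:
    """Toy model of the two cache levels (hypothetical, not PowerDNS code)."""

    def __init__(self, records, packetcache_enabled=True, seed=None):
        # Level 1: records learned from the authoritative server.
        self.record_cache = list(records)
        # Level 2: whole responses, keyed by the question name.
        self.packet_cache = {}
        self.packetcache_enabled = packetcache_enabled
        self.rng = random.Random(seed)

    def resolve(self, name):
        # A packet cache hit replays the stored response, order included.
        if self.packetcache_enabled and name in self.packet_cache:
            return self.packet_cache[name]
        # Otherwise build a fresh response; shuffling happens at this step,
        # between the record cache and the packet cache.
        answer = list(self.record_cache)
        self.rng.shuffle(answer)
        answer = tuple(answer)
        if self.packetcache_enabled:
            self.packet_cache[name] = answer
        return answer


def distinct_orders(resolver, name, queries=200):
    """Count how many different address orderings the resolver returns."""
    return len({resolver.resolve(name) for _ in range(queries)})


# Placeholder addresses from the documentation range, not real ELB addresses.
ADDRESSES = ["192.0.2.1", "192.0.2.2"]

cached = RecursorSketch(ADDRESSES, packetcache_enabled=True, seed=1)
uncached = RecursorSketch(ADDRESSES, packetcache_enabled=False, seed=1)

print(distinct_orders(cached, "cache.domain.com"))    # 1: the first shuffle is replayed
print(distinct_orders(uncached, "cache.domain.com"))  # more than 1 once the packet cache is off
```

With the packet cache enabled, every query after the first returns the stored order, which matches the stable ordering seen in the question. With it disabled, each response is shuffled again.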

Habbie
  • disable-packetcache actually disables the caching code; packetcache-ttl=0 pushes the same check slightly deeper down. Thus, disable-packetcache should be slightly faster. Specifying both does not hurt :) – Habbie Jul 01 '12 at 20:11
  • I added both of those to the recursor configuration and still had no luck. I updated the original post with dig output and the configuration I have. – bwight Jul 02 '12 at 15:20
  • I tested with recursor 3.3 and your exact config. The A records for an 'A www.google.com' query shuffle for me. Can you share a name that fails for you? Obviously cache.domain.com is fake :) – Habbie Jul 02 '12 at 17:40
  • I tried www.google.com and there is no shuffling going on. Do you think it has anything to do with the OS? I tried `host www.google.com` and `dig www.google.com`; neither works for me. Could it be something in the pdns config? – bwight Jul 02 '12 at 18:05
  • Can you try this and post the output? `for d in {1..100} ; do echo $(dig +short a www.google.com @127.0.0.1 -p 530 | grep '^[0-9]') ; done | sort -u | wc -l` – Habbie Jul 02 '12 at 19:21
  • When I run that the result is 94 – bwight Jul 02 '12 at 19:31
  • let us [continue this discussion in chat](http://chat.stackexchange.com/rooms/3967/discussion-between-bwight-and-habbie) – bwight Jul 02 '12 at 19:37
  • if the result is anything other than 1, shuffling is working! – Habbie Jul 02 '12 at 20:19
  • In chat it turned out recursor was in fact working correctly, but the front end proxy (also pdns) was misconfigured. Looks solved to me now. – Habbie Jul 03 '12 at 05:22
  • It's solved; I'll award you the bounty points as soon as I can. Thanks for your help. – bwight Jul 03 '12 at 14:11
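
The `dig` loop from the comments above counts how many distinct answer orderings come back; any result other than 1 means shuffling is working. The following Python sketch shows the same counting logic. The function names are hypothetical, the sample data uses placeholder addresses, and `query_addresses` assumes `dig` is installed and that the recursor listens on 127.0.0.1 port 530 as in the posted config.

```python
import subprocess


def count_distinct_orderings(responses):
    """Count unique address orderings; each response is a sequence of address strings."""
    return len({tuple(response) for response in responses})


def query_addresses(name, server="127.0.0.1", port=530):
    """Run dig once and return the address lines in answer order.

    Mirrors `dig +short a NAME @SERVER -p PORT | grep '^[0-9]'` from the
    comment above; requires dig on the PATH and a reachable server.
    """
    result = subprocess.run(
        ["dig", "+short", "a", name, f"@{server}", "-p", str(port)],
        capture_output=True,
        text=True,
        check=True,
    )
    # Keep only lines that start with a digit, dropping the CNAME target.
    return [line for line in result.stdout.splitlines() if line[:1].isdigit()]


# Offline illustration with placeholder addresses: two different orderings.
sample = [
    ["192.0.2.1", "192.0.2.2"],
    ["192.0.2.2", "192.0.2.1"],
    ["192.0.2.1", "192.0.2.2"],
]
print(count_distinct_orderings(sample))  # 2

# Live check (not run here):
# responses = [query_addresses("www.google.com") for _ in range(100)]
# print(count_distinct_orderings(responses))
```

Running the live check against each resolver in the chain separately (the recursor and whatever sits in front of it) helps locate which layer is holding the order fixed.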

Not necessarily a "fix" - but do you need to use the CNAME from your application rather than directly querying the underlying A record? Presumably the CNAME => A record mapping doesn't change that often.

Sometimes the simplest dumb fixes are actually enough, and avoid having to solve all the world's problems just to get the results you need!

Bron Gondwana
  • I do actually need to use the CNAME instead of the A record. The CNAME points to a list of load balancers that are used for redundancy within the Amazon cloud. The A records change as more load is put on the system, because Amazon will provision more powerful load balancers as my needs increase. – bwight Jul 02 '12 at 17:24
  • Oh well... so much for that plan then. – Bron Gondwana Jul 02 '12 at 18:42